Monday 16 March 2015

The Making of P0 Snake - Part 2: IT TALKS!


One of the features that stood out about P0 Snake, according to the people who played it, is its digitized speech.

Now, mind you, talking games have been around for quite some time, and they were not exactly uncommon in the 80s either. The problem with speech, though, is that it takes “a lot” of memory. Therefore, while modern productions can get very chatty (even too much), old games could usually only afford a few seconds of speech before filling up the host machine’s memory. But it's probably this scarcity that contributed to making the few words that came out of the talking video games so memorable. Elvin Atombender, the mad scientist in Impossible Mission, would welcome you to his lab with his signature “Another visitor… stay a while… stay forever!” that would send shivers down your spine. Also, with speech being such an "expensive" commodity, programmers used it only where it really added something to the gameplay.
Mega Apocalypse, another all-time favourite of mine, was so hectic that sometimes players didn't know what they were doing, such was the focus on trying to stay alive. So the game just told them what was going on. “Extra life!”, and the player knew he had hit something good, not a cursed asteroid.
With P0 Snake things were a bit tricky though. The RGCD 16Kb development competition, as the name suggests, only allows 16Kb games. So, how much speech can we fit into 16Kb? Or, even better, how much speech can we squeeze into 16Kb and still have enough space left to code a game? Let’s find out!

Enter “The Dictionary”

“When words are scarce they are seldom spent in vain.”
[William Shakespeare]

P0 Snake's dictionary is made up of the following sentences:

“Welcome to P0 Snake”
“Get Ready”
“One Up!”
“Oh no!”
“Well Done!”
“Game Over”

In total, they account for roughly 9 seconds of speech.
It doesn't sound like a big deal, but, trust me, it is. Let’s see why.

A sound is basically a waveform, or, for our purposes, the digital representation of its analog form.  This process of analog-to-digital conversion has two steps: sampling and quantization. A signal is sampled by measuring its amplitude at a particular time; the sampling rate is the number of samples taken per second. In order to accurately measure a wave, it is necessary to have at least two samples in each cycle: one measuring the positive part of the wave and one measuring the negative part.
Of course having more than 2 samples per cycle will increase the accuracy and the quality of the waveform representation, but what’s really important is that you don’t have LESS than 2 samples per cycle, as this would cause the frequency of the wave to be completely missed. If you want to know more about this stuff, just look up the Nyquist-Shannon Sampling Theorem and start from there; I’m trying to keep this simple. What all of this is telling us, anyway, is that in order to digitize sound properly, we must keep a sample rate that is at least twice the frequency of the signal we are sampling. The 44.1kHz frequency at which your CD tracks or your mp3s are sampled can render a signal of up to 20kHz, which is enough to capture all the frequencies that the human ear can hear. If mp3 had been made for dogs, it would need to use a sample rate higher than 120kHz, as dogs can hear frequencies of up to 60kHz. If it had been made for cats, it would need to be more than 160kHz! It’s good that we evolved from apes and not from cats, otherwise our iPod could only contain a fourth of the music it contains now!
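The Nyquist limit is easy to see in action. Here is a quick sketch (the tone and the zero-crossing estimator are my own illustration, not from the original article): a 3000Hz tone sampled at 8000Hz keeps its frequency, but sampled at 4000Hz — below twice the tone's frequency — it aliases down to a spurious 1000Hz.

```python
import math

# Illustrative sketch: estimate a sampled tone's frequency by counting
# zero crossings, and watch a 3000Hz tone alias when sampled below the
# Nyquist rate of 2 x 3000 = 6000Hz.
def estimated_frequency(tone_hz, fs):
    n = fs  # one second of samples
    x = [math.sin(2 * math.pi * tone_hz * i / fs + 0.1) for i in range(n)]
    crossings = sum(1 for a, b in zip(x, x[1:]) if a * b < 0)
    return crossings / 2  # a sine crosses zero twice per cycle

print(estimated_frequency(3000, 8000))  # ~3000: faithfully captured
print(estimated_frequency(3000, 4000))  # ~1000: aliased, the 3000Hz tone is lost
```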
And there is more good news: the human voice only spans a narrower set of frequencies, going up to approximately 3500Hz. It means that if you want to sample speech, a sample rate of at least 7000Hz is enough to keep it intelligible. Indeed, most speech encoding systems, including GSM, whose acronym probably rings a bell (ahem…) for most of you, only use 8000Hz. That’s the kind of frequency we are really dealing with, and that settles the math for the sampling, the first part of the analog-to-digital conversion. What about quantization? In general, the more the better, and your typical CD track, to go back to the original example, uses 16 bits, or 2 bytes, which means that each sample is a number in the range (-32768, 32767). Let’s assume we are using the same resolution. We have all the numbers now, so let’s see how much space those 9 seconds of speech will eat up.

9 seconds X 8000 samples per second X 2 bytes per sample = 144,000 bytes ≈ 140 Kb

Given that we have to fit both speech AND a game into 16 Kb, that’s a bit too much. We should target 4 or 5 Kb at most. It’s a long way uphill from here...

The first little help comes, again, from a limitation of the machine. 16-bit samples simply can’t be played on the Commodore 64. In fact, the SID, the Commodore 64's sound chip, was not designed to play sampled sound at all. The way programmers did it was by exploiting a bug in the sound chip. I won’t go into details as to how it works (try this if you dare), but, simply put, although a better quantization can be achieved with weird tricks, the most common way of playing sampled sound on a Commodore 64 only allows for 4-bit resolution. This means that the range we can address is not (-32768, 32767) but (-8, 7). Since 4 bits are one fourth of 16, our original space consumption goes down from 140Kb to 35Kb. That’s a lot better, but still more than twice the space we have for the entire game.
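The reduction itself is straightforward. Here is a minimal sketch (the function name is mine, not from the actual P0 Snake tools): keeping only the top 4 bits of each 16-bit sample maps the full range onto (-8, 7).

```python
# Illustrative sketch: reduce 16-bit signed PCM samples to the 4-bit
# range (-8..7) that the common C64 playback trick can handle, by
# keeping only the top 4 bits of each sample.
def to_4bit(samples_16bit):
    # Arithmetic right shift by 12 maps -32768..32767 onto -8..7
    return [s >> 12 for s in samples_16bit]

print(to_4bit([-32768, -1, 0, 32767]))  # [-8, -1, 0, 7]
```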

Before we move on, let’s take a look at what these waveforms we have been talking about look like. For example, the piece of speech “Oh no!”.

From top to bottom, the original recording, the same one sampled at 8000Hz with 16bit quantization, and finally the same one at 8000Hz with 4 bit quantization.

As you can see, the fact that only 16 different amplitude values are possible with 4-bit quantization shows in the waveform looking a bit "blocky" at a glance. This really is the BEST waveform we can hope to play on the Commodore 64, and, despite its appearance, it doesn’t sound that bad. This is what we really have to start from anyway.

Before we even start to think about compression, let’s see if we can bring memory occupation down from the current 35Kb.

Let’s zoom in a bit and explore the “oh” part of “oh no!”, going back to the original 16-bit quantization and 44.1kHz sampling, that is, CD quality. This is what 20 milliseconds of “oh” look like:

We immediately notice something: the waveform looks very “regular”, as if it were made of the same piece repeated many times, with just minimal differences. We'll take advantage of this aspect when we deal with compression, but for now another interesting piece of evidence is that there aren't really many “parts” in this waveform; that is, it doesn't change very rapidly. We mentioned that the frequency of the human voice tops out at 3500Hz, and that we need double that frequency, that is 7000Hz, which we had rounded to 8000Hz, to sample it. But I bet we need much less than that for this specific segment.
Let’s see what it looks like at 8000Hz:

It looks very similar, but we knew this already, as we know that 8kHz will suffice for human voice in general.

Let’s see what happens at 5000Hz:

It’s starting to deteriorate a bit; still, all the transitions are there, and this means that the “oh”, although “noisy”, will still sound like an “oh”. Let’s try 1000Hz:

Bummer! Most of the transitions have disappeared. This waveform doesn't sound like an “oh” anymore. But what we take from this exercise is that we don't always need to use 8000Hz. Some pieces of our speech are happy with a lower sample rate.
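Dropping a segment to a lower sample rate can be sketched as naive decimation, i.e. keeping one sample out of every N (illustrative only; the names are mine, and a real resampler low-pass filters first to avoid aliasing):

```python
# Illustrative sketch: downsample by keeping every `factor`-th sample.
# A proper resampler would filter out high frequencies first; this just
# shows the space saving, which is what we are chasing here.
def decimate(samples, factor):
    return samples[::factor]

samples = list(range(16))    # stand-in for a segment of 8000Hz audio
print(decimate(samples, 4))  # "2000Hz" version: a quarter of the bytes
```

Every segment that tolerates a factor of 2 cuts its storage in half, which is why hunting for the lowest acceptable rate per segment pays off.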

Speech fragments can be divided into two categories: Voiced and Unvoiced. Voiced sounds are generated by the vocal cords’ vibration. Among their characteristics is the fact that they are periodic, and their period is called pitch. It’s the case for vowels and specifically the fragment that we have analyzed so far. Unvoiced sounds, on the other hand, are not generated by the vibration of the vocal cords and they don’t have a specific pitch. Furthermore, and most importantly, they come with a higher frequency. It’s the case for fricative sounds like “F” or “S”.
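A classic cheap way to tell the two categories apart (a heuristic I am adding for illustration; the article does not say P0 Snake used it) is the zero-crossing rate: noisy, high-frequency unvoiced sounds cross zero far more often than periodic voiced ones.

```python
import math

# Illustrative sketch: unvoiced sounds like "S" or "F" are noisy and
# high-frequency, so they cross zero far more often than voiced,
# periodic sounds like vowels.
def zero_crossing_rate(samples):
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return crossings / len(samples)

fs = 8000
# Vowel-like 200Hz tone vs hiss-like 3500Hz tone, one second each
voiced   = [math.sin(2 * math.pi * 200 * n / fs + 0.1) for n in range(fs)]
unvoiced = [math.sin(2 * math.pi * 3500 * n / fs + 0.1) for n in range(fs)]

print(zero_crossing_rate(voiced) < zero_crossing_rate(unvoiced))  # True
```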

Now this is interesting, and you can see already where this is going: let’s use a higher sample rate for unvoiced segments and a lower sample rate for voiced segments.
We can validate this theory by looking at another segment of P0 Snake’s speech: the “ZE” part in “welcome to P ZEro Snake”.

You can clearly see where the “Z” ends and where the “E” starts. Now let’s sample this segment at 5000Hz, which worked quite well in the previous example:

Look what just happened: the “Z” is almost completely gone, but the “E” is still there.

In conclusion, we really need the full 8000Hz to sample fricative sounds and preserve their intelligibility.

Putting everything together

So far we have learned the following facts:

  1. A 4-bit amplitude resolution is all we can afford.
  2. An 8000Hz sampling rate allows any type of speech to be encoded.
  3. Fricative sounds require this entire bandwidth.
  4. Voiced sounds will be happy with much less. How much less? It depends on the sound, the speaker’s voice and other factors.

So, in order to minimize the memory footprint of the speech in P0 Snake, we just need to use, for each speech segment, the minimum sampling frequency that keeps the segment intelligible. This frequency will be higher for unvoiced segments and lower for voiced segments.
Although we could choose ANY frequency for each of the segments, we really want to limit this search to a small set of possible frequencies, for various reasons; the most important of them will become clear in the next part, but we can already say that we want to be able to encode this frequency somehow in the bitstream. P0 Snake uses only 4 frequencies, so it requires a 2-bit overhead to indicate the frequency of each speech segment:
S = {3500Hz, 3942Hz, 5256Hz, 7884Hz}

And the algorithm to preprocess the data is now quite easy:

S = [3500Hz, 3942Hz, 5256Hz, 7884Hz]
segments = split(source) // splits the source into segments of equal duration

foreach (segment in segments)
    f = Frequency_Analysis(segment)
    i = 0
    while (i < 3 && S[i] < f)
        i = i + 1
    sampled_segment = Sample(segment, S[i])
    yield return( (i, sampled_segment) )

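The same algorithm can be sketched in runnable form. The real tool used a proper frequency-analysis library; here a crude zero-crossing estimate stands in for it, and resampling is shown as plain decimation, so treat this as a sketch rather than the actual preprocessor:

```python
import math

# Runnable sketch of the preprocessing pseudocode above.
S = [3500, 3942, 5256, 7884]  # candidate sampling frequencies, in Hz

def required_rate(segment, fs):
    # Roughly twice the dominant frequency, estimated via zero crossings
    # (a stand-in for the real frequency-analysis library).
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if a * b < 0)
    dominant = crossings * fs / (2 * len(segment))
    return 2 * dominant

def preprocess(segments, fs):
    for segment in segments:
        f = required_rate(segment, fs)
        i = 0
        while i < 3 and S[i] < f:
            i += 1
        # Stand-in for Sample(segment, S[i]): naive decimation
        step = max(1, round(fs / S[i]))
        yield i, segment[::step]

fs = 7884
vowel = [math.sin(2 * math.pi * 300 * n / fs + 0.1) for n in range(fs)]
hiss  = [math.sin(2 * math.pi * 3500 * n / fs + 0.1) for n in range(fs)]
for i, _ in preprocess([vowel, hiss], fs):
    print(i)  # low index (cheap rate) for the vowel, high index for the hiss
```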
The result of this preprocessor was only good as a starting point for a (painful) manual refinement of each segment: although the frequency analysis lib I used worked very well for me, I found that for some segments I could still move to a lower frequency while keeping decent sound quality. This pickiness might sound like overkill, but don't forget that we are fighting over bytes here, so every little helps!

In the end, all the speech in P0 Snake averages roughly 4900Hz. Let’s see where this takes us:

35Kb * 4900/8000 = ~21Kb

21 Kb. Now... we are talking! It’s still 4 times larger than what we are targeting, and (as we'll see in the next post) we can't use any known audio codec to bring this number down, because the Commodore 64 simply doesn't have enough horsepower to play a sampled sound AS it decompresses it AND also run a game. Therefore, we must play uncompressed data. But this number already looks very interesting for one simple reason: the rules of the competition say that the size of the game must not exceed 16 Kb, but of course we can use the entire memory space of the Commodore 64 at runtime, that is, 64Kb. If we could come up with a way to compress our speech into 4Kb in a way that can be decompressed (in a reasonable time) to 21Kb BEFORE the game runs, then that would be it.
We'll see how in the next post, if you wish.
