Monday, 16 March 2015

The Making of P0 Snake - Part 2: IT TALKS!

Introduction



One of the features that stood out about P0 Snake, according to the people who played it, is its digitized speech.






Now, mind you, talking games have been around for quite some time, and they were not exactly uncommon in the 80s either. The problem with speech, though, is that it takes “a lot” of memory. Therefore, while modern productions can get very chatty (even too much), old games could usually afford only a few seconds of speech before they filled up the host machine’s memory. But it's probably this scarcity that made the few words that came out of talking video games so memorable. Elvin Atombender, the mad scientist in Impossible Mission, would welcome you to his lab with his signature “Another visitor… stay a while… stay forever!” that would send shivers down your spine. Also, with speech being such an "expensive" commodity, programmers had to resort to it only where it really added something to the gameplay.
Mega Apocalypse, another all-time favourite of mine, was so hectic that sometimes players didn't know what they were doing, such was the focus on trying to stay alive. So the game just told them what was going on. “Extra life!”, and the player knew he had hit something good, not a cursed asteroid.
With P0 Snake things were a bit tricky though. The RGCD 16Kb development competition, as the name suggests, only allows 16Kb games. So, how much speech can we fit into 16Kb? Or, even better, how much speech can we squeeze into 16Kb and still have enough space left to code a game? Let’s find out!


Enter “The Dictionary”



“When words are scarce they are seldom spent in vain.”
[William Shakespeare]


P0 Snake's dictionary is made up of 7 sentences:


“Welcome to P0 Snake”
“Get Ready”
“One Up!”
“Teleport!”
“Oh no!”
“Well Done!”
“Game Over”


In total, they account for roughly 9 seconds of speech.
It doesn't sound like a big deal, but, trust me, it is. Let’s see why.


A sound is basically a waveform, or, for our purposes, the digital representation of its analog form. This process of analog-to-digital conversion has two steps: sampling and quantization. A signal is sampled by measuring its amplitude at a particular time; the sampling rate is the number of samples taken per second. In order to accurately measure a wave, it is necessary to have at least two samples in each cycle: one measuring the positive part of the wave and one measuring the negative part.
Of course having more than 2 samples per cycle will increase the accuracy and the quality of the waveform representation, but what’s really important is that you don’t have LESS than 2 samples per cycle, as this would cause the frequency of the wave to be completely missed. If you want to know more about this stuff, just look up the Nyquist-Shannon sampling theorem and start from there; I’m trying to keep this simple. What all of this is telling us, anyway, is that in order to digitize sound properly we must keep a sample rate that is at least twice the highest frequency of the signal we are sampling. The 44.1kHz rate at which your CD tracks or your MP3s are sampled can render a signal of up to about 22kHz, which is enough to capture all the frequencies that the human ear can hear (roughly up to 20kHz). If MP3 were made for dogs, it would need to use a sample rate higher than 120kHz, as dogs can hear frequencies of up to 60kHz. If it were made for cats, it would need to be more than 160kHz! It’s good that we evolved from apes and not from cats, otherwise our iPods could only contain about a fourth of the music they contain now!
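To make the "at least twice the frequency" rule concrete, here is a minimal Python sketch. The function names are made up for illustration, and the folding formula only describes what happens to a single pure tone, which is a simplification of real speech.

```python
def min_sample_rate(max_frequency_hz):
    """Lowest sample rate (Hz) that still gives two samples per cycle."""
    return 2 * max_frequency_hz


def alias_frequency(signal_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling at sample_rate_hz.

    If the tone is above half the sample rate it "folds" back to a lower,
    wrong frequency instead of being captured correctly.
    """
    folded = signal_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


for listener, top_hz in [("human", 20_000), ("dog", 60_000), ("speech", 3_500)]:
    print(listener, "needs at least", min_sample_rate(top_hz), "Hz")

print(alias_frequency(3_000, 8_000))  # 3000: captured correctly
print(alias_frequency(3_000, 5_000))  # 2000: too few samples, wrong frequency
```

The last line is the "completely missed" case from the text: a 3000Hz tone sampled at only 5000Hz comes out as if it were a 2000Hz tone.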
And there is more good news: the human voice only spans a narrower set of frequencies, which goes up to approximately 3500Hz. It means that if you want to sample speech you just need a sample rate of at least 7000Hz to keep it intelligible. Indeed, most speech encoding systems, including GSM, whose acronym probably rings a bell (ahem…) for most of you, only use 8000Hz. That’s the kind of frequency we are really dealing with, and this closes the math for the sampling, the first part of the analog-to-digital conversion. What about quantization? In general, the more the better, and your typical CD track, to go back to the original example, uses 16 bits (2 bytes) per sample, which means that each sample is a number in the range (-32768, 32767). Let’s assume we are using the same resolution. We have all the numbers now, so let’s see how much space those 9 seconds of speech will eat up.


9 seconds X 8000 samples per second X 2 bytes per sample = 144000 bytes = 140 Kb


Given that we have to fit both speech AND a game into 16 Kb, that’s a bit too much. We should target 4 or 5 Kb at most. It’s a long way uphill from here...
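The arithmetic above is easy to replay for other settings. This is just a small helper written for this post (the name pcm_size_bytes is mine), assuming plain uncompressed samples with no header of any kind.

```python
def pcm_size_bytes(seconds, sample_rate_hz, bits_per_sample):
    """Size in bytes of uncompressed sampled audio."""
    total_bits = seconds * sample_rate_hz * bits_per_sample
    return total_bits // 8


size = pcm_size_bytes(9, 8000, 16)
print(size, "bytes =", round(size / 1024, 1), "KB")
```

Changing the last argument to 4 gives 36000 bytes, which is the figure we will run into as soon as we look at what the Commodore 64 can actually play.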


The first little help comes, again, from a limitation of the machine. 16 bit samples simply can’t be played on the Commodore 64. In fact, the SID, the Commodore 64 sound chip, was not designed to play sampled sound at all. The way programmers did it was by exploiting a bug in the sound chip. I won’t go into details as to how it works (try this if you dare), but, simply said, although a better quantization can be achieved with weird tricks, the most common way of playing sampled sound on a Commodore 64 only allows for 4 bit resolution. This means that the range we can address is not (-32768,32767) but (-8,7). Since 4 bits are one fourth of 16 bits, our original space consumption goes down from 140Kb to 35Kb. That’s a lot better, but still more than twice the space we have for the entire game.
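Going from 16 bit to 4 bit samples can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool used for P0 Snake: the function names are invented, and the choice of storing the first sample of each pair in the high nibble is an assumption.

```python
def to_4bit(sample16):
    """Map a signed 16 bit sample (-32768..32767) to a signed 4 bit one (-8..7)."""
    return sample16 >> 12  # arithmetic shift drops 12 bits and keeps the sign


def pack_nibbles(samples4):
    """Pack 4 bit samples two per byte (assumed layout: first sample in the high nibble)."""
    if len(samples4) % 2:
        samples4 = samples4 + [0]  # pad with silence to get an even count
    packed = bytearray()
    for high, low in zip(samples4[0::2], samples4[1::2]):
        packed.append(((high & 0x0F) << 4) | (low & 0x0F))
    return bytes(packed)


samples16 = [-32768, -1, 0, 4096, 32767]
samples4 = [to_4bit(s) for s in samples16]
print(samples4)                      # [-8, -1, 0, 1, 7]
print(pack_nibbles(samples4).hex())  # 8f0170
```

Five 16 bit samples (10 bytes) end up in 3 bytes, padding included, which is where the "one fourth" saving comes from.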


Before we move on, let’s take a look at what these waveforms we have been talking about look like, using the piece of speech “Oh No!” as an example.





From top to bottom, the original recording, the same one sampled at 8000Hz with 16bit quantization, and finally the same one at 8000Hz with 4 bit quantization.


As you can see, the fact that only 16 different amplitude values are possible with 4 bit quantization shows in the form of the waveform being a bit "blocky" at a glance. This really represents the BEST waveform we can think of playing on the Commodore 64, and, despite the appearance, it doesn’t sound that bad. This is what we really have to start from anyway.


Before we even start to think about compression, let’s see if we can bring memory occupation down from the current 35Kb.


Let’s zoom in a bit and explore the “oh” part of “oh no!”, going back to the original 16 bit quantization and 44.1kHz sampling, that is, CD quality. This is what 20 milliseconds of “oh” look like:











We immediately notice something: the waveform looks very “regular”, as if it were made of the same piece repeated many times, with just minimal differences. We'll take advantage of this aspect when we deal with the compression, but for now another interesting observation is that there aren't really many “parts” in this waveform, that is, it doesn't change very rapidly. We mentioned that the frequency of the human voice tops out at 3500Hz, and that we need double that frequency, that is 7000Hz, which we rounded up to 8000Hz, to sample it. But I bet you we need much less than that for this specific segment.
Let’s see what it looks like at 8000Hz:











It looks very similar, but we knew this already, as we know that 8kHz will suffice for human voice in general.


Let’s see what happens at 5000Hz:











It’s starting to deteriorate a bit. Still, all the transitions are there, and this means that the “oh”, although “noisy”, will still sound like an “oh”. Let’s try 1000Hz:





Bummer! Most of the transitions have disappeared. This waveform doesn't sound like an “oh” anymore. But what we take from this exercise is that we don't always need to use 8000Hz. Some pieces of our speech are happy with a lower sample rate.
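If you want to replay this experiment without a sound editor, here is a rough Python sketch. Everything in it is a stand-in: the "vowel" is just a synthetic 700Hz tone, the resampler simply picks the nearest source sample (no filtering at all), and counting sign changes is a crude proxy for the "transitions" visible in the pictures.

```python
import math


def resample_nearest(samples, src_rate, dst_rate):
    """Re-sample by picking the nearest source sample (no anti-alias filter)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    return [samples[min(last, round(i * src_rate / dst_rate))] for i in range(n_out)]


def sign_changes(samples):
    """Count how many times the waveform crosses zero."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))


SRC_RATE = 44100
# Hypothetical stand-in for 20 ms of a vowel: a plain 700 Hz tone.
vowel = [math.sin(2 * math.pi * 700 * n / SRC_RATE) for n in range(882)]

for rate in (8000, 5000, 1000):
    reduced = resample_nearest(vowel, SRC_RATE, rate)
    print(rate, "Hz:", len(reduced), "samples,", sign_changes(reduced), "sign changes")
```

At 8000Hz and 5000Hz the number of sign changes stays close to that of the source, while at 1000Hz a good part of them is gone, which matches what the waveforms above show.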


Speech fragments can be divided into two categories: voiced and unvoiced. Voiced sounds are generated by the vibration of the vocal cords. Among their characteristics is the fact that they are periodic, and the rate at which that period repeats is what we perceive as pitch. It’s the case for vowels, and specifically for the fragment that we have analyzed so far. Unvoiced sounds, on the other hand, are not generated by the vibration of the vocal cords and they don’t have a specific pitch. Furthermore, and most importantly, most of their energy sits at higher frequencies. It’s the case for fricative sounds like “F” or “S”.
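A common textbook heuristic for telling the two categories apart is the zero-crossing rate: noisy, high-frequency fragments change sign far more often than periodic ones. The sketch below is only an illustration of that heuristic, not what P0 Snake's tooling did. The 0.25 threshold is an arbitrary value picked for this example, and the two test signals are synthetic stand-ins for a vowel and a hiss.

```python
import math
import random


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings / (len(samples) - 1)


def looks_unvoiced(samples, threshold=0.25):
    """Guess "unvoiced" when the waveform changes sign very often (assumed threshold)."""
    return zero_crossing_rate(samples) > threshold


RATE = 8000
# 20 ms of a 200 Hz tone: periodic, like a vowel.
vowel_like = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(160)]
# 20 ms of random noise: no pitch, like an "S" or an "F".
rng = random.Random(0)
hiss_like = [rng.uniform(-1.0, 1.0) for _ in range(160)]

print(looks_unvoiced(vowel_like))  # False
print(looks_unvoiced(hiss_like))   # True
```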




Now this is interesting, and you can see already where this is going: let’s use a higher sample rate for unvoiced segments and a lower sample rate for voiced segments.
We can validate this theory by looking at another segment of P0 Snake’s speech: the “ZE” part in “welcome to P ZEro Snake”.

















You can clearly see where the “Z” ends and where the “E” starts already. Now let’s sample this segment at 5000Hz, which worked quite well in the previous example:



















Look what just happened: the “Z” is almost completely gone but the “E” is still there.


In conclusion, we really need the full 8000Hz to sample fricative sounds and keep them intelligible.

Putting everything together


So far we have learned the following facts:


  1. 4 bit amplitude is all we can afford
  2. 8000Hz is a sampling rate that allows any type of speech to be coded.
  3. Fricative sounds require this entire bandwidth.
  4. Voiced sounds will be happy with much less. How much less? It depends on the sound, the voice of the speaker and other factors.


So, in order to minimize the memory occupation of the speech in P0 Snake, we just need to use, for each speech segment, the minimum sampling frequency that keeps the segment intelligible. This frequency will be higher for unvoiced segments and lower for voiced segments.
Although we could choose ANY frequency for each of the segments, we really want to limit this search to a small set of possible frequencies. There are various reasons for this, the most important of which will become clear in the next part, but we can say already that we also want to be able to encode this frequency somehow in the bitstream. P0 Snake uses only 4 frequencies, so it requires a 2 bit overhead to indicate the frequency for each speech segment:
S = {3500Hz, 3942Hz, 5256Hz, 7884Hz}


And the algorithm to preprocess the data is now quite easy:


S = [3500Hz, 3942Hz, 5256Hz, 7884Hz]
segments = split(source) //splits the source in segments
                         //of equal duration


foreach (segment in segments)
{
    f = Frequency_Analysis(segment)
    i = 0
    while (i < 3 && S[i] < f)
    {
        i = i + 1
    }
    sampled_segment = Sample(segment, S[i])
    yield return( (i, sampled_segment) )
}
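The selection loop above can be sketched in runnable Python. This is a sketch under a few assumptions, not the actual preprocessor: the frequency analysis is left out and replaced by a hand-written list of "required" sample rates, the 50 ms segment length is made up, and the size estimate simply adds the 2 bit frequency index to the 4 bit samples of each segment.

```python
RATES_HZ = [3500, 3942, 5256, 7884]  # the four rates used by P0 Snake


def pick_rate_index(required_hz):
    """Index of the lowest rate that is >= required_hz (the highest one if none is)."""
    i = 0
    while i < len(RATES_HZ) - 1 and RATES_HZ[i] < required_hz:
        i += 1
    return i


def encoded_size_bytes(required_rates_hz, segment_seconds):
    """Total size with 4 bit samples plus a 2 bit rate index per segment."""
    total_bits = 0
    for required_hz in required_rates_hz:
        rate = RATES_HZ[pick_rate_index(required_hz)]
        total_bits += 2 + 4 * round(rate * segment_seconds)
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical analysis results for five 50 ms segments.
required = [3000, 3600, 5000, 8000, 2000]
print([pick_rate_index(r) for r in required])  # [0, 1, 2, 3, 0]
print(encoded_size_bytes(required, 0.05), "bytes")
```

Note that a segment asking for more than 7884Hz still gets the highest available rate, exactly as in the pseudocode, where the loop stops at index 3.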


The result of this preprocessor was only good as a starting point for a (painful) manual refinement of each segment: although the frequency analysis lib I used worked very well for me, I found that for some segments I could still move to a lower frequency while keeping a decent sound quality. This pickiness might sound like overkill, but don't forget that we are fighting with bytes here, so every little helps!


In the end, all the speech in P0 Snake averages out at roughly 4900Hz. Let’s see where this takes us:


35Kb * 4900/8000 = ~21Kb


21 Kb. Now... we are talking! It’s still 4 times larger than what we are targeting, and (as we'll see in the next post) we can't use any known audio codec to bring this number down, because the Commodore 64 simply doesn't have enough horsepower to play a sampled sound AS it decompresses it AND also run a game. Therefore, we must play uncompressed data. But this number already looks very interesting for one simple reason: the rules of the competition say that the size of the game must not exceed 16 Kb, but of course we can use the entire memory space of the Commodore 64 at runtime, that is 64Kb. If we could come up with a way to compress our speech into 4Kb in a way that can be decompressed (in a reasonable time) to 21 Kb BEFORE the game runs, then that would be it.
We'll see how in the next post, if you wish.

Wednesday, 11 March 2015

The Making of P0 Snake - Part 1

Preface


More than 20 years ago my teenage self had a dream. To be more exact, my teenage self shared the same boring dream with most of my friends. That dream was to create a video game for the Commodore 64. It sounds silly today, but that’s what every teenager wanted to do back in the 80s. You wanted to be an astronaut, a football player or a game developer. Not surprisingly, I did not make a career in space travel or any kind of professional sports and, although I went on to publish several games for modern platforms many years later, I didn’t make it my main occupation. Which is not something I regret now. What I really regretted was failing to write my own Commodore 64 game. I’ll spare you the details of what made me try to rectify that situation, although I guess it could simply be summarised by a friend of mine telling me that it is never too late (yes, it’s that simple), but anyway, fast forward a couple of decades and P0 Snake, my entry in the RGCD 16Kb game development competition, came first and will be published later this year. Better late than never, you silly daydreaming teenager!




In this and the next series of posts I’ll try to cover a few aspects of the development process that I think are worth sharing, especially those that may be of general programming interest. In fact, I’ll try not to make these notes inaccessible to someone who has never seen a Commodore 64 in his life, although I might be writing a couple of lines of assembler here and there (which I’ll try to explain in detail). Hopefully I’ll give some answers to those of you who are wondering how someone can even think of implementing something for a 33-year-old machine. And who knows? Maybe I’ll drag some of you into this madness. That would be great!

An appreciation of the limitations of the target platform

It’s no mystery that the Commodore 64 comes with 64KB of RAM. This might sound like a silly amount of silicon to achieve anything sensible nowadays, but it was not back in the 80s. In 1982, when the machine came out, 64KB of RAM was “A LOT” of memory. In fact, it was the best selling point of the machine. This is a very important fact to consider, because we are targeting this specific machine, not a modern PC. An assembler instruction on the Commodore 64 takes 1 to 3 bytes, so 64KB could host a good 30,000 “lines” of code if we were to use all of that space for this purpose. Furthermore, old machines in general, and the Commodore 64 in particular, were designed to make video game development (well... certain aspects of it) “easy”. Programming a ball that bounces around the screen, for example, is something that can be achieved very easily, and it takes very few bytes to implement. If you want to play a sound effect that resembles an explosion, that’s easy too. This is because the Commodore 64’s strongest asset was its custom chips, which allowed programmers to achieve certain effects without much effort. The most famous, and one of the most useful, of the aces up the C64’s sleeve is its mighty hardware sprites. Let’s go back to the bouncing ball example. This is what we want to achieve:


Now let’s write pseudocode to achieve this effect on a machine of that era with no hardware sprites:


x = 0
y = 0
xdir = 1 //going right
ydir = 1 //going down


ball = {(x0,y0), (x1,y1), .. (xn,yn)} //a sequence of pixels making up the shape of the ball.


while(true)
{
    //erase the ball by restoring the background
    foreach (pixel in ball)
    {
        screen[x + pixel.x, y + pixel.y] =
            backbuffer[x + pixel.x, y + pixel.y]
    }
    x = x + xdir
    y = y + ydir
    if (x == 0 or x == rightborder)
    {
        xdir = -xdir //bounce horizontally
    }
    if (y == 0 or y == bottomborder)
    {
        ydir = -ydir //bounce vertically
    }
    //draw the ball at the new position
    foreach (pixel in ball)
    {
        screen[x + pixel.x, y + pixel.y] = ballcolour
    }
    somedelay()
}



The first few lines set the ball position and direction. It will be initially positioned in the top-left corner of the screen. Each iteration of the loop will update the x and y position according to the direction. The problem here is what “update” really means.
There are three main steps:

  1. erase the pixels at position (x,y) + offset for each pixel offset in ball
  2. update the ball position (x,y)
  3. set all the pixels at position (x,y) + offset for each pixel offset in ball


Surprisingly, this is not very different from what a modern PC does. The fact that machines and graphics cards are so fast now makes this process totally invisible to the user, but those of you who witnessed the birth of the first GUIs will definitely remember that moving windows around the screen triggered that weird erase/redraw effect. That’s exactly what went on in PCs in the early nineties: the user moving a window would result in all the pixels making up the window at the old position being erased (the background is drawn) and then all the pixels being redrawn in the new position. It’s blazing fast nowadays, not so much 20 years ago. And indeed, things would get much more complicated when you had multiple objects moving on the screen, which is always the case in games, unless you are dealing with something as simple as a bouncing ball.


How would you achieve the same effect on the Commodore 64?
The Commodore 64 comes with 8 hardware sprites. A sprite is a shape that can be defined by the user and positioned anywhere on the screen. What’s great about sprites is that they exist on a different layer than the background. This means that upon moving them, you don’t have to worry about the background. Let’s see what the bouncing ball pseudocode looks like on the C64.


xdir = 1 //going right
ydir = 1 //going down


ball = {(x0,y0), (x1,y1), .. (xn,yn)} //a sequence of pixels making up the shape of the ball.


sprite0.pointer = ball
sprite0.x = 0
sprite0.y = 0
sprite0.visible = true


while(true)
{
    sprite0.x = sprite0.x + xdir
    sprite0.y = sprite0.y + ydir
    if (sprite0.x == 0 or sprite0.x == rightborder)
    {
        xdir = -xdir //bounce horizontally
    }
    if (sprite0.y == 0 or sprite0.y == bottomborder)
    {
        ydir = -ydir //bounce vertically
    }
    somedelay()
}


And that’s it. We basically set up sprite number 0 to represent the ball, then in the main loop we just update the position of sprite0 and take care of the bouncing, and the good old Commodore 64 takes care of everything else.
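To get a feel for what "a different layer" means, here is a toy Python model. It is purely an illustration: on the real machine the video chip does the composition in hardware, whereas here a made-up compose_frame function plays that role, and the Sprite class, the 8x5 grid and the two-pixel ball are all assumptions of the example.

```python
from dataclasses import dataclass


@dataclass
class Sprite:
    x: int = 0
    y: int = 0
    visible: bool = False


WIDTH, HEIGHT = 8, 5
background = [["." for _ in range(WIDTH)] for _ in range(HEIGHT)]
ball_shape = [(0, 0), (1, 0)]  # pixel offsets making up the ball


def compose_frame(sprite):
    """Stand-in for the video chip: overlay the sprite on a copy of the background."""
    frame = [row[:] for row in background]
    if sprite.visible:
        for dx, dy in ball_shape:
            frame[sprite.y + dy][sprite.x + dx] = "O"
    return frame


sprite0 = Sprite(visible=True)
xdir, ydir = 1, 1
for _ in range(6):
    # The "game" only touches the sprite position, never the background.
    sprite0.x += xdir
    sprite0.y += ydir
    if sprite0.x == 0 or sprite0.x == WIDTH - 2:
        xdir = -xdir
    if sprite0.y == 0 or sprite0.y == HEIGHT - 1:
        ydir = -ydir

print("\n".join("".join(row) for row in compose_frame(sprite0)))
```

The main loop contains no erase and no redraw, and the background list is never written to, which is the whole point of the sprite approach.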
We spoke about efficiency, which is clearly the most obvious advantage of this approach, but there’s so much more going on under the hood. The erase-redraw approach that other computers are forced to resort to implies that you have to store the background in memory twice: once for the clean, unaltered background that you need to copy the pixels FROM when you are erasing, and once for the actual displayed version of it, with the ball and all the other objects overlaid. The concept of double buffering is an integral part of game development today (and even the Commodore 64 provides an elegant way to implement it), but you don’t want to resort to it unless you really need it, to limit the footprint of your graphics in memory.
In our specific example, besides the memory that goes to waste for something that simple, a lot of cpu power is wasted in erasing and drawing back the same thing over and over. There are a lot of tricks that programmers of other platforms came up with in the 80s to make this process as quick as possible, but the basic approach stays the same: you have to erase and redraw your objects.
This video shows the bouncing ball implemented on a C64. As you can see the background is completely unaffected by the ball movement.




This long preamble was needed to put things into perspective: the 16Kb limitation of the competition is indeed a tight constraint, but things are not so bad after all. There is good hardware support for some common tasks that allows you to achieve interesting results with limited memory. The pure horsepower of the machine is ridiculous compared to your mobile (let alone your computer), but the sheer amount of “things” that happen on the display is very limited, and even when things get very busy, the number of pixels that the machine needs to shift at a certain moment in time is one order of magnitude smaller than those that your mobile has to move around, say, when you unlock the screen. Yes, you can’t play the latest TV show on your stock Commodore 64 (well, some people will disagree with this, but let’s not make these posts too hardcore, shall we?), but you can move objects on the display and play some sound without having to implement the basics of that.
That’s why I’m not going to cover how to squeeze a game into 16 Kb on the Commodore 64, because in general that’s not really something to write home about: there are literally thousands of video games developed by pioneer coders in the 80s that are smaller than 16Kb. Those unsung heroes deserve your respect because THEIR 16Kb creatures laid the foundation of the industry that brought you Halo, Fifa 2015 or whatever you guys are playing today. Nor am I giving you an introduction to C64 game coding, because there are dozens of tutorials on game programming for this machine (you’d be surprised!). Finally, I won’t introduce you to Commodore 64 programming at all, because this video does it better than I could dream of and is well worth your “I’m not slacking, my code is compiling” time today. Check it out!
What the next posts will cover are those aspects of the development of P0 Snake that challenged and intrigued me the most: the hurdles I had to come to terms with and the makeshift solutions that allowed me to overcome them. It’ll be about how we can combine modern tools and knowledge with vintage technology to craft something beautifully old. It’s not going to be a tutorial (I would never be that arrogant), nor a lesson in data compression (again). It’s just the way I did it, which is not necessarily the best way of doing it. But, heck, it worked!
Stay tuned.