Introduction
Compression of digital sound has been a subject of study for some 50 years now, during which researchers have produced a large number of audio codecs, that is, systems to compress (or encode) sound and to decompress (or decode) it into something that can be played. We have already mentioned MP3, which is based on psychoacoustic modelling and was designed for music, and GSM, which is a very simple algorithm to encode speech, but there are many more. What pretty much all of these systems have in common is that the decoder does not return the original waveform, but an approximation of it. They are therefore referred to as lossy audio codecs. We have already seen that even heavy modifications to the original digitized waveform, like dropping the sampling frequency to 8000 Hz or moving from 16 to 4 bit quantization, do affect the quality of the sound, but, especially for speech, it retains its "understandability". There are algorithms like CELP that can encode speech at rates as low as 800 bytes per second. Unfortunately, we can only dream of using this kind of technology in this context, for two simple reasons. First, any audio decoder would need space just for its own code: what good would it be to compress our 9 seconds of audio to 4KB if we needed, say, a 20KB decoder to play it? Second, and most importantly, audio decoders are too heavy on the CPU. In fact, for a computer like the Commodore 64, whose CPU runs at only 1 MHz and is only capable of 8 bit integer math, even just playing the uncompressed samples is a demanding task!
How demanding? Let’s find out. (Warning: what follows might be a bit boring and very C64-centric, so you might want to take my word for it that playing digitized audio is very demanding and just skip to the next section.)
The problem is that whatever the CPU was doing when the interrupt fired, it was doing it using its registers, and our sample playing routine will use those same registers to do its thing. This means that, before returning control, our routine must make sure that the registers contain exactly the same values they contained when the CPU was interrupted. This is achieved by saving all the registers somewhere as the first thing in the interrupt routine, and then restoring all of them as the last thing before returning control. That somewhere is usually the system stack. Unfortunately, only one of the registers, the Accumulator (A), can write to the stack. The other two registers, X and Y, can't. But they can write to the Accumulator!
We now have the default skeleton code for a timer-driven interrupt service routine, which will look like this:
pha //push the Accumulator onto system stack. [3]
txa //transfer X to Accumulator [2]
pha //push accumulator again, thus saving X value. [3]
tya //transfer Y to Accumulator [2]
pha //push the Accumulator once again, thus saving Y value [3]
//our sample playing routine here,
//which can now use the 3 registers
[...]
pla //pops Accumulator from the stack [4]
tay //transfer Accumulator to Y [2]
pla //pops Accumulator [4]
tax //transfer Accumulator to X [2]
pla //pops the Accumulator [4]
rti // return to whatever CPU was doing,
//All the registers correctly restored [6]
The numbers in square brackets are the number of cycles each instruction takes. Remember that the Commodore 64 is clocked at 1 MHz, that is 1,000,000 cycles per second, so let's see what the overhead of all this saving and restoring of registers adds up to:
3+2+3+2+3+4+2+4+2+4+6 = 35
So for each call to the interrupt routine, 35 cycles go to waste without doing anything. Zero, zip, nada! We said that the highest rate at which we end up calling this routine is 7884 times per second, which means that every second 7884 * 35 cycles are wasted. That's 275,940 cycles per second, which is roughly 28% of the CPU horsepower.
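The arithmetic above can be double-checked with a few lines of Python (the cycle counts are the ones annotated in the skeleton):

```python
# Cycle counts from the interrupt skeleton above.
save = 3 + 2 + 3 + 2 + 3           # pha, txa, pha, tya, pha
restore = 4 + 2 + 4 + 2 + 4 + 6    # pla, tay, pla, tax, pla, rti
overhead = save + restore          # cycles wasted per interrupt

rate = 7884          # interrupts per second (the sampling rate)
cpu_hz = 1_000_000   # the C64 CPU runs at roughly 1 MHz

wasted = overhead * rate
print(overhead, wasted, f"{wasted / cpu_hz:.1%}")  # 35 275940 27.6%
```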
To cut a long story short, 28% of the CPU goes to waste just on the overhead of saving and restoring the registers, and we haven't really played anything yet! Therefore, we can't realistically add anything very complex on top of that if we want to be able to ALSO run the game at the same time.
Although there are tricks involving self modifying code (yuk!) that can take that number down, and P0 Snake uses all of them, it still gives us a good idea of how demanding this task is for the CPU. On-the-fly audio decompression is therefore out of the question here.
Compression
Let’s summarize our findings so far:
- We squeezed our original source audio into 21KB.
- We want to land somewhere in the neighborhood of 4KB.
- We accept a short wait before the game runs, during which the 4KB of compressed audio is decompressed to the 21KB we are going to use from that moment on.
- The decoder should be a fairly short program, whose memory footprint should be negligible compared to the 4KB of the compressed audio.
- It should be really fast! No fancy floating point, no weird math, nothing like that. Possibly LUTs or dictionaries, but little more than that.
Having ruled out lossy audio compression schemes, classic lossless compression approaches spring to mind. LZ77 is at the core of most of them: RAR, ZIP, 7Z, you name it. It's based on the idea that multiple occurrences of the same subsequence in a sequence can be encoded by storing a single copy of that subsequence and replacing the other occurrences with references to it.
A string like
ABCDABCDABCDE
is encoded to something like
ABCD(-4,4)(-8,4)E
The pairs (i,j) between parentheses mean “go back i symbols and copy j symbols from there”.
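A minimal decoder for this kind of token stream can be sketched in a few lines of Python. The token format here, plain literals plus (offset, length) tuples, is only an illustration of the idea, not a real LZ77 bitstream:

```python
def lz77_decode(tokens):
    # Each token is either a literal symbol or an (offset, length) pair
    # meaning "go back `offset` symbols and copy `length` symbols from there".
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                # One symbol at a time: back-references may overlap
                # the data being produced.
                out.append(out[-offset])
        else:
            out.append(token)
    return "".join(out)

# The "ABCD(-4,4)(-8,4)E" example from above:
print(lz77_decode(["A", "B", "C", "D", (4, 4), (8, 4), "E"]))
# → ABCDABCDABCDE
```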
While finding such repeated copies is a computationally intensive task, that burden falls entirely on the encoder. The decoder, which is what we are most interested in, has a fairly straightforward job to do. Straightforward enough to be implemented on the C64 without too much pain. We are almost there!
The only problem left is that LZ77, being a lossless compression approach, is not expected to compress much. We are lucky if we see a 30-40% reduction (you can do this experiment yourself: record a wav file and zip it). That would take our memory footprint down to 12-15 KB. Much better, but not quite there yet.
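You can simulate that experiment without recording anything. In this sketch a noisy sine wave stands in for speech and zlib stands in for the zip tool; the exact ratio of course depends on the material, the point is only that a lossless compressor gains little on raw waveform data:

```python
import math
import random
import zlib

random.seed(0)
# A stand-in for a recorded voice: a sine wave plus noise, 8-bit samples.
samples = bytes(
    min(255, max(0, int(128 + 100 * math.sin(i / 7.0) + random.gauss(0, 20))))
    for i in range(8000)   # one "second" at 8000 Hz
)
ratio = len(zlib.compress(samples, 9)) / len(samples)
print(f"compressed to {ratio:.0%} of the original size")
```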
But what if we managed to make the LZ77 encoder's life easier? What if we could somehow maximize the number and the size of identical subsequences of samples?
Let’s bring up once again that piece of the “Oh No” waveform we have mentioned in part 2.
We said that it almost looks like it was made up of the same segment repeated over and over. Let’s bring up these segments.
That's very close to what LZ77 needs to do its dirty job at its best, but not quite there yet. There are, in fact, minimal differences among the blocks. If this waveform were really made of exactly the same segment repeated over and over, LZ77 would only need one of those segments plus a few indices to encode the entire waveform in the picture.
Which brings us to the following intuition: how bad would it sound to just take one of these blocks and repeat it over and over to form the actual waveform? In this case, the differences are so small that we probably wouldn't be able to notice them, and LZ77 would encode this piece to roughly one fifth of its original size! So, all we need to do is find all the similar blocks, make them identical, and then run a normal LZ77 compression pass to get something that will unpack easily and sound reasonably close to the unaltered waveform. We don't really need fancy approximate pattern matching algorithms here, given how small the amount of data is, so we can just go the brute force route. The algorithm would be something like this:
encode(waveform, threshold)
{
    i=0;
    while (i < length(waveform))
    {
        for (blocksize = 64; blocksize > 4; blocksize--)
        {
            j = find_most_similar_block in range (0,i)
                of size blocksize;
            if (similarity((j,j+blocksize),
                           (i,i+blocksize)) < threshold(blocksize))
            {
                // replace the current block with a copy
                // of a similar one
                copy ((j,j+blocksize), (i,i+blocksize))
                i+=blocksize;
                break;
            }
        }
        else //no similar block of any size was found
        {
            i++; // keep the current sample value and go ahead.
        }
    }
    return LZ77encode(waveform);
}
Dealing with blocks smaller than 4 samples doesn't really make much sense, because the overhead of the indices that LZ77 needs to replicate a block would exceed the length of the block itself.
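The brute-force pass can be sketched in Python like this. It's an illustration of the idea, not the actual encoder used in the game: plain squared error stands in for the block similarity, and the threshold is a caller-supplied function of the block size, as in the pseudocode:

```python
def encode(waveform, threshold, max_block=64, min_block=5):
    # For each position, look for the most similar earlier block of the
    # largest possible size; if it is close enough, overwrite the current
    # block with a copy of it, so that a later LZ77 pass can turn it into
    # a cheap back-reference.
    wf = list(waveform)
    i = 0
    while i < len(wf):
        replaced = False
        for size in range(max_block, min_block - 1, -1):
            if i + size > len(wf) or i < size:
                continue   # no room for a block of this size
            # Most similar block lying entirely before position i
            # (squared error as a stand-in for the similarity function).
            best_j = min(range(0, i - size + 1),
                         key=lambda j: sum((wf[j + k] - wf[i + k]) ** 2
                                           for k in range(size)))
            best_d = sum((wf[best_j + k] - wf[i + k]) ** 2
                         for k in range(size))
            if best_d < threshold(size):
                wf[i:i + size] = wf[best_j:best_j + size]
                i += size
                replaced = True
                break
        if not replaced:
            i += 1   # keep the current sample value and go ahead
    return wf   # a real encoder would now run LZ77 on this
```

Running it on ten samples whose second half nearly repeats the first half turns the waveform into two exactly identical blocks, which is precisely what LZ77 thrives on.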
The most important part of this algorithm is the similarity function, which needs to compare two blocks of a given size and return a distance between them. The temptation to go straight for the Euclidean distance is strong. And in fact we'll give in to it, but there's one last thing about speech encoding that must be considered: the dynamic range of speech is not linear (a fact that some basic audio compression schemes, like A-law, leverage). For our purposes this means that an alteration to samples that are "close" to zero will introduce a heavier distortion than the same absolute alteration to samples that are near the boundary of the sample range. To account for this, assuming such a range to be [-1,1], rather than using the absolute sample value, we'll go with the logarithm of the amplitude.
Ultimately, given two vectors of samples P and Q, whose values range in [-1,1], the distance between the two vectors is the Euclidean distance computed on their log-scaled amplitudes rather than on the raw sample values.
It’s important to note that this logarithmic decay is taken into account only when computing the distance between two blocks, but the waveforms we end up storing are the original, linear sampled versions.
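A possible implementation of such a distance looks like this. The epsilon floor (which keeps the logarithm finite at zero) and its value are illustrative choices of mine, not taken from the original encoder:

```python
import math

def log_amplitude(x, eps=1e-3):
    # Map a sample in [-1, 1] to a signed, log-scaled amplitude.
    # eps keeps log() finite at zero; its exact value is a tuning choice.
    return math.copysign(math.log(abs(x) + eps) - math.log(eps), x)

def distance(p, q):
    # Euclidean distance between two equal-length blocks, computed on
    # log-scaled amplitudes instead of the raw sample values.
    return math.sqrt(sum((log_amplitude(a) - log_amplitude(b)) ** 2
                         for a, b in zip(p, q)))

# The same absolute error hurts more near zero than near full scale:
print(distance([0.0], [0.1]) > distance([0.9], [1.0]))   # True
```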
Final tricks and conclusion
Although there are many more tricks and bits of hand tuning that I had to resort to in order to improve compression and/or reduce the compression artifacts, the encoding algorithm we have discussed so far does the largest part of the job, and puts us in a position in which the Commodore 64 only has to LZ77-decode 4KB of data to 21KB before the game starts. The only thing we still need to feed the encoder is the right threshold, which is dynamically adjusted by the encoder according to the size of the vectors it compares (this part would probably deserve a post of its own).
We can basically try different thresholds until we are satisfied with the total size of the encoded waveform, 4KB in our case, and go for it. Of course, a higher threshold will cause more blocks to be marked as similar and lead to better compression, at the cost of a decrease in sound quality.
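That search can be as simple as the following sketch, where the hypothetical encoded_size callback stands in for "run the block-replacement pass, then LZ77, and measure the result":

```python
def pick_threshold(candidates, encoded_size, budget):
    # Try thresholds from gentle to aggressive and keep the first one
    # whose encoded output fits in the budget. More aggressive thresholds
    # mean better compression but worse sound.
    for t in sorted(candidates):
        if encoded_size(t) <= budget:
            return t
    return None   # even the most aggressive threshold did not fit

# With a dummy size model standing in for the real encoder:
sizes = {0.1: 7000, 0.2: 6000, 0.4: 4000}
print(pick_threshold([0.4, 0.1, 0.2], sizes.get, 4096))   # 0.4
```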
In fact, I used different thresholds for different sentences in the game, which leads me to the very last trick that I want to tell you about. Listen again to the speech:
You might have noticed that the worst sounding sentence of the bunch is the first one, "Welcome to P0 Snake". It may surprise you to know that this part is actually the one that has been compressed the least, and on its own it takes up half of the total 4KB. It should therefore sound better than the rest. Believe it or not, it actually does. Let me explain with an example: in the silence of the night you can probably hear both your watch ticking and your neighbour talking in another flat. Now, suppose you take your neighbour out for dinner in a crowded restaurant. You can still hear what she is saying, but even if you focus on your watch, you won't be able to hear it ticking.
This phenomenon is called Auditory Masking, and it’s just one of the many concepts you'll get familiar with, should you decide to dive into the fascinating world of Psychoacoustics. Glitches and artifacts introduced by compression are like the ticking of your watch as your neighbour talks during the night. You can hear them in a quiet environment, but if you bring in more sound to “confuse” the ear, they'll become unnoticeable. That’s what I did with the remaining set of sentences in the game: whenever the game is talking, there’s always music playing along with the speech. It’s a bit like hiding the dust under the carpet, but, well… it works!
This concludes the third and last part about speech encoding. Next time we'll probably be looking at something a bit more geeky, like self modifying code or graphics compression.
Stay tuned!