We are all surrounded by codes. Computers process huge amounts of information in a multitude of formats – written text, numbers, images and audio/video – and a variety of coding methods are used to make storing and transmitting information more efficient and more reliable.d
To be clear, this doesn’t mean coding as in writing software – I’m talking about encoding: converting data into a different format, using a set of predetermined mappings. A simple example would be converting the alphabet to numbers, using A=1, B=2 etc. Codes in general are simply a way to convert information into a different form, that helps us store and transmit it, and can range from very simple to quite complex.
For example, a code your computer is literally using right now is ASCII – the American Standard Code for Information Interchange. This was originally developed in America in the 1960s, as a way to standardise the process of converting letters (used to display text on the then-relatively-new computer screens) into numbers the computer could process and store.
You’d imagine there’d be endless possible ways to decide on the ordering – do we start with A=1, or A=0, and where do we assign punctuation marks like space, full stop and comma – and what about capital and lower-case letters? Since so many computer systems were being developed, it made sense to have a standard everyone could use.
The original standard involved a simple character set with a single upper-case alphabet, a small selection of punctuation marks and a set of control characters (things like ‘new line’, ‘tab’ and indicators for the start and end of headings and blocks of text, used for different types of printing machines and displays). Since then, it’s been gradually expanded to incorporate more symbols and character sets for different languages and has now become part of the larger Unicode system (which also includes pictorial symbols like emoji).
This larger system is still built on the same basic principle – each character you might type has a number allocated to it. There are also clever features built into the choice of numbering – for example, take a look at this section of a table of ASCII codes for just the visible, printable characters.
The ASCII for capital A corresponds to the number 65, and the ASCII for lower-case a is 97. This might seem like a fairly arbitrary choice until you think about the fact that these numbers will be stored and processed using binary. If you write the numbers in binary, you might see an interesting pattern:
Since binary is base 2, each column corresponds to a particular power of 2 – 1, 2, 4, 8, 16, 32, 64 and so on. The 64 column is the left-most column in these 7-digit binary strings, and since 65 is one more than 64, its binary representation is made up of just a 64 and a 1 (1000001).
As you look down the column, you’ll see the numbers all start the same way – the capital letters with 100, and the lower-case with 110 – and the remaining five binary digits just tell you how far through the alphabet you are, using the classic A=1, B=2 system. This means to write a letter in binary, I can simply convert it to a number, e.g. (M=12), write that in binary (12 -> 01100), then write 100 at the front for capital M, and 110 for lower-case.
When designing a code like this, choosing to include patterns like this don’t just make for satisfying realisations 50 years later – they also mean that the code is easier to use. There are plenty of other ways codes can be made more efficient and useful through clever design.
One simple example is how Morse code uses particular strings of dots and dashes to represent letters. Unlike the ASCII codes, Morse Code doesn’t use the same number of symbols in every codeword – some are shorter, like E and T, and others use longer strings.
The letters with one-symbol codewords, E and T, both share the property that they occur relatively frequently in written language – as do A, I, N and (to some extent) M, which use the two-symbol strings. Other commonly used letters, like S and O, use the more memorable three-symbol codewords, dot-dot-dot and dash-dash-dash.
When developing this code, the engineer Alfred Vail (who expanded it from Morse’s original numbers-only version to also transmit letters) could have just written the possible combinations of dots and dashes in any order and assigned them to the letters. Instead, choosing to assign shorter strings to common characters means that communicating in Morse code is more efficient – the average number of dots and dashes needed to transmit a message is lower. It’s also easier for a human operator to type, since the more commonly used letters are easier to remember.
Can we (pre)fix it?
Another clever design trick that can be used when your code has different length codewords is to make it a prefix code. A prefix code has the property that none of the code strings occur at the start of any of the others. For example, using only the codewords 11, 12, 211 and 212 would work as a prefix code, but using 12, 21, 221 and 122 wouldn’t – if you see a ‘12’ at the start of a codeword you don’t know which of ‘12’ or ‘122’ is being typed.
One main benefit of this is that you don’t need to specify where the breaks between codewords are – we say the code is uniquely decodable, since there’s only one way to interpret a given string of digits as a set of separate codewords. Try this example:
E = 111
T = 010
A = 000
R = 1101
H = 1010
|Decode the message:
This means we can get the benefit of variable-length strings, and use shorter ones for more commonly communicated symbols, without needing to use any extra information to indicate how to parse the message. These kinds of tricks mean that we can both communicate messages more efficiently, and store data in an encoded form using less space.
Data compression – a hugely important feature of computer system design – uses tricks like this all the time to store data using fewer bytes of space on your hard drive or memory. The ideal is lossless data compression – storing something more compactly, but without losing any of the information, so that after decompressing it again you get back all the information that was there before. Techniques for lossy data compression are easier to design – storing photos with lower resolution, or audio files that lose sound quality – but there are ways to compress data that don’t lose a single bit.
Huffman coding uses a clever system – it analyses the data being stored prior to encoding, and determines the frequency of each symbol in the data. Since the information stored isn’t necessarily going to be written language, and might be computer code, image or audio information, this might not match the frequencies of letters in language we saw earlier – but for a given file, the frequencies can be tallied.
This is then used to design a prefix code, using shorter strings for more common symbols, and inventor David Huffman came up with an efficient algorithm to sort and assign the codes to pack data into a smaller space. While there is then also a requirement to communicate the particular coding used, this will only take up some of the space saved by reducing the length of the message, and it’s still more efficient than a fixed-length code like ASCII.
How low can you go?
How much more efficient such codes can get depends entirely on the contents of the data. For example, if every symbol occurs equally frequently in the message, it’s not possible to save as much space as if you have a more varied distribution. If you have one symbol that dominates, and occurs more than half the time, Huffman coding is inefficient. Prefix codes generally are also inefficient on small alphabets.
These issues are related to ideas that mathematician Claude Shannon studied as part of his groundbreaking Mathematical Theory of Communication. He posited that information, and the way it’s transmitted, doesn’t have to bear any relation to the actual information you’re sending.
For example, in a system where the only two things you can transmit are the word ‘yes’ and the entire text of the bible, you still only send one bit of information – even though the bible is a long chunk of text – since it’s only one of two options. A clever realisation of Shannon’s was how patterns in data – like letters of high frequency in written language – can be exploited to reduce how much actual information is needed to transmit something.
For example, Shannon conducted experiments on how easily people are able to predict what the next letter or word will be in a sentence. If someone is presented with the initial words ‘Once upon a’, there aren’t many fairy-tale fans who wouldn’t be able to guess that the next word is likely to be ‘time’. Similarly, faced with a word starting with a Q, there’s a good chance the next letter is U.
This means that ‘time’ and ‘U’ somehow convey less information to the person receiving the message – if they were missing, the meaning of the message would be unaffected. However, if I received the message ‘Meet me at the’ with the next word missing, there’d be a lot more information carried by that missing word, since I have less chance of predicting what it’s going to be. Shannon called this concept information density – not all letters and words are created equal, and some are more important than others.
While these ideas may seem simple, lossless data compression and coding underlies huge amounts of modern technology, and sometimes it’s nice to take a moment to appreciate just how clever these structures and ideas are, and how they make things work so effortlessly.