Hacking UTF-8

Character encodings

There is an old saying that computing is all about binary data. In our day-to-day work this feels like a mere abstraction, because layers upon layers of encoding and decoding processes hide this pervasive truth. Still, the old saying holds: everything is binary. One of the first things people needed to encode in computer systems was the information humans traditionally created and stored as text. If everything is binary, how is it possible to store and load information in text format? The answer is pretty straightforward: we devise methods to map numbers to characters. Not rocket science, right?

One of the first mapping schemes was called ASCII. ASCII stands for American Standard Code for Information Interchange, and it basically maps bytes to characters. The good thing about ASCII is its simplicity: a simple table lookup is enough to translate between a character and its binary representation. But this representation suffers from a major drawback (otherwise no other schemes would have been invented). ASCII was designed to represent characters in computer systems in a quick and straightforward way; the standard was developed in the early 1960s, right at the start of the computer revolution, when the only encoding needed was for the most common characters used in America. For that purpose alone, ASCII sufficed. The problem arises when we want to encode characters that are not Latin characters. Characters used in Hebrew, Cyrillic, Greek and many other scripts cannot be mapped with the ASCII strategy. The reason is simple: one byte, or 8 bits, gives 2⁸ = 256 different symbols (and standard ASCII only uses 7 of those bits, for 128 symbols), which is nowhere near enough to cover all the writing systems history has produced.
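
As a quick illustration, Python exposes this number-to-character mapping directly through the built-ins ord and chr:

    # ASCII maps small integers to characters; ord/chr expose this mapping.
    print(ord('A'))   # 65, the ASCII code of 'A'
    print(chr(97))    # 'a', the character stored at code 97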

UTF-8

To enable the representation of text in computer systems for other sets of characters, Ken Thompson and Rob Pike devised a new scheme called UTF-8. UTF-8 stands for Unicode Transformation Format, 8-bit. The "8-bit" part is a little misleading, because it may lead people to think that the scheme uses one byte per character. While everything is, of course, ultimately represented in bytes, there is a crucial difference: in UTF-8 some characters are represented with just one byte, others with two or three, up to a maximum of four bytes.
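
You can see this variable width in practice by encoding a few characters in Python (the particular characters here are just illustrative picks from different Unicode ranges):

    # Each of these characters needs a different number of UTF-8 bytes.
    for ch in 'aα€🎉':
        encoded = ch.encode('utf-8')
        print(ch, encoded, len(encoded))
    # a b'a' 1
    # α b'\xce\xb1' 2
    # € b'\xe2\x82\xac' 3
    # 🎉 b'\xf0\x9f\x8e\x89' 4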

| Bytes | Bits used | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|-------|-----------|----------|----------|----------|----------|
| 1     | 7         | 0xxxxxxx | -------- | -------- | -------- |
| 2     | 11        | 110xxxxx | 10xxxxxx | -------- | -------- |
| 3     | 16        | 1110xxxx | 10xxxxxx | 10xxxxxx | -------- |
| 4     | 21        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

The table above describes the rules used to encode characters in binary form, and several things were taken into account when the format was devised. One of the nice properties of UTF-8 is backward compatibility with ASCII: the encodings of the printable ASCII characters are exactly the same in UTF-8. This is a major feature, since conversions between character encodings would be a mess for all the information already encoded in ASCII if we wanted to integrate it into a UTF-8 based system. An interesting observation about UTF-8 is that in every 2-, 3- and 4-byte sequence the continuation bytes all start from 0x80, which is 1000 0000 in binary. This observation is handy to keep in mind if we want to map character representations between character sets. It is true that people use many different symbols with different semantics to encode information, but many characters have an equivalent representation in other alphabets; for instance, we can map the letter a to the Greek letter alpha, α. UTF-8 uses the leading bytes as a prefix that selects a block of characters and uses the last byte for the position within that block, and it preserves alphabetical order whenever that semantics holds. So it is expected that, if the code for a is X and Z is the prefix for a given alphabet, the code for the corresponding letter is something like the byte sequence ZX.
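
Both properties are easy to check from Python, in a small sketch like this:

    # ASCII characters keep their exact byte values in UTF-8...
    print('a'.encode('utf-8'))   # b'a', the single byte 0x61, same as ASCII
    # ...while the trailing bytes of multi-byte characters all start at 0x80.
    print('α'.encode('utf-8'))   # b'\xce\xb1', and 0xb1 >= 0x80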

UTF-8 Tables

UTF-8 is well documented, and its mappings form an exhaustive list of correspondences between binary sequences and characters for many different character sets. There are several online sites where you can look up the UTF-8 representation of a specific character and learn more about the encoding.
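
If you prefer to stay inside Python, the standard library's unicodedata module is a handy local complement to the online tables:

    import unicodedata

    # Print the official Unicode name and the UTF-8 bytes of a character.
    print(unicodedata.name('α'))   # GREEK SMALL LETTER ALPHA
    print('α'.encode('utf-8'))     # b'\xce\xb1'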

Hacking UTF-8

This is a practical blog, so you would expect to be able to do something with this knowledge. Indeed there is: one of the hacks we can do is to map characters between character sets. But how?! First of all we must find a pattern; there is always a hidden pattern. The first pattern we notice is that for all of the 2-, 3- and 4-byte characters, the final byte carries an offset of 0x80 in hexadecimal, which is 1000 0000 in binary; to see this you just need to check the table with the encoding rules. Another pattern is that all of these characters have the form

    A --- 0x80 --- P

where A is a sequence of bytes that functions as a prefix identifying the character set, 0x80 is the offset of the final byte, and P is the position of the character in the encoding table.

Why is this useful? Because with this pattern we can map the familiar ASCII characters to other representations. For instance, if we want to map ASCII letters to Armenian characters, we just prefix each letter's position byte with the byte 0xd5.
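
As a small sanity check of the pattern, we can assemble the raw bytes by hand and decode them as UTF-8 (0xd5 being the prefix from the example above):

    # Assemble prefix + (0x80 + position) by hand and decode it as UTF-8.
    position = ord('h') - ord('a')                            # 'h' sits 7 places after 'a'
    print(bytes([0xd5, 0x80 + position]).decode('utf-8'))     # Շ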

With these concepts in mind we arrive at the following Python function

    def utfy(prefix, string):
        # Shift 'a' so that it lands on 0x80, the first slot after the prefix.
        offset = 0x80 - ord('a')
        letters = [bytes(prefix + [ord(l) + offset]).decode('utf-8') if l != ' ' else ' '
                   for l in string.lower()]
        return "".join(letters)

that does just that.

To print Hello World in Armenian characters you would run

    print(utfy([0xd5], "Hello World"))

and get ՇՄՋՋՎ ՖՎՑՋՃ as a result.