Unicode


Some years have passed since ASCII was invented, so let's move about 20 years forward, straight to the 1980s. By then, ASCII had become a widely used standard for electronic communication, and most devices were familiar with the mapping. However, as technology spread internationally, new challenges emerged. There were far more languages and alphabets to cover, and the 128 positions in the ASCII table were simply not enough. Moreover, the list was now closed, which made things worse. To complicate matters further, many old devices and protocols handled only seven-bit data cleanly, so values beyond ASCII's last code point, 0x7F (delete), could not be transmitted reliably.

The Problem: Internationalization

As technology expanded globally, the need to represent more languages and symbols became crucial. The solution to extend the character set had to be both efficient and compatible with older systems. Devices couldn’t easily be updated, so any change had to be seamless and largely invisible to users. The last thing anyone wanted was a massive update that would require replacing hardware or forcing businesses to change everything at once.

Big Changes, Invisible to Users

To solve this problem, a smart solution was needed that would extend the character mapping while maintaining backward compatibility with existing devices. The challenge was to ensure that new machines could handle a broader range of symbols, while older machines could still function properly.

The goal was simple: make sure everything “just works.” This required solving the issue without interrupting normal operations. No one wanted to think about how or why this change took place; it simply had to be seamless for users.

Neat Solution and Perfect Hack

The answer came in the form of UTF-8, a character encoding system that solved the problem elegantly. The solution needed to be space-efficient, so that as many characters as possible could be represented without wasting memory.

UTF-8 uses one to four bytes to represent a character, depending on its code point. It remains backward-compatible with ASCII, so the first 128 characters are encoded exactly as in the ASCII set. UTF-8 can address the full Unicode range of 1,114,112 code points, of which well over 100,000 have already been assigned to characters, encompassing almost all of the world’s writing systems.
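These varying byte lengths are easy to observe in practice. A quick Python sketch (standard library only; the sample characters are just illustrative picks from each range):

```python
# Byte lengths of the UTF-8 encoding for characters from different ranges.
samples = ["A", "é", "€", "😀"]  # ASCII, Latin-1 Supplement, BMP symbol, emoji
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Running this shows 1, 2, 3, and 4 bytes respectively: the higher the code point, the longer the sequence.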

If you’re curious about specific Unicode values, you can always check unicode.org for the full character set. But for now, let’s explore the different byte lengths used in UTF-8.

1 Byte

The 1-byte encoding is fully backward-compatible with ASCII. It uses 7 bits for the character value, with the leading bit fixed at 0 to mark the byte as a single-byte character. This allows for a maximum of 128 characters, exactly as defined in ASCII.

| Bytes | Binary (Prefix) | Bits | Maximum Value |
|-------|-----------------|------|---------------|
| 1     | 0_______        | 7    | 127           |
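This compatibility can be verified directly: every one of the 128 ASCII characters encodes to the same single byte, with the most significant bit set to 0. A small Python check:

```python
# Every ASCII character encodes to the identical single byte in UTF-8,
# and that byte's most significant bit is always 0.
for code in range(128):
    b = chr(code).encode("utf-8")
    assert len(b) == 1 and b[0] == code  # identical to the ASCII value
    assert b[0] & 0b1000_0000 == 0       # leading bit is 0
print("all 128 ASCII characters round-trip as single UTF-8 bytes")
```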

2 Bytes

The 2-byte encoding offers a significant expansion: with 11 payload bits, it can represent up to 2,048 unique characters. The first byte starts with the prefix 110 and the continuation byte with 10, signaling that these bytes are not part of the original ASCII set. This ensures that old devices won’t misinterpret the new characters.

| Bytes | Binary (Prefix)   | Bits | Maximum Value |
|-------|-------------------|------|---------------|
| 2     | 110_____ 10______ | 11   | 2,047         |
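To see the prefixes at work, here is a hand-packed 2-byte sequence in Python for 'é' (U+00E9), an arbitrary example from this range; the result matches what the built-in encoder produces:

```python
# Hand-packing the 2-byte form for 'é' (U+00E9): 110xxxxx 10xxxxxx.
cp = ord("é")                          # 0x00E9, fits in 11 bits
byte1 = 0b110_00000 | (cp >> 6)        # prefix 110 + top 5 payload bits
byte2 = 0b10_000000 | (cp & 0b111111)  # prefix 10 + low 6 payload bits
assert bytes([byte1, byte2]) == "é".encode("utf-8")  # matches the library
print(hex(byte1), hex(byte2))          # 0xc3 0xa9
```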

3 Bytes

With 3 bytes, the encoding can represent over 65,000 characters, which was especially important as more languages and symbols became standard. The longer byte sequence allows for more unique values to be assigned.

| Bytes | Binary (Prefix)            | Bits | Maximum Value |
|-------|----------------------------|------|---------------|
| 3     | 1110____ 10______ 10______ | 16   | 65,535        |
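The same packing scheme extends naturally: a leading byte with prefix 1110 followed by two continuation bytes. A Python sketch using '€' (U+20AC) as an example character from this range:

```python
# Hand-packing the 3-byte form for '€' (U+20AC): 1110xxxx 10xxxxxx 10xxxxxx.
cp = ord("€")                          # 0x20AC, fits in 16 bits
encoded = bytes([
    0b1110_0000 | (cp >> 12),          # prefix 1110 + top 4 payload bits
    0b10_000000 | ((cp >> 6) & 0x3F),  # prefix 10 + middle 6 bits
    0b10_000000 | (cp & 0x3F),         # prefix 10 + low 6 bits
])
assert encoded == "€".encode("utf-8")  # matches the library
print(encoded.hex(" "))                # e2 82 ac
```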

4 Bytes

The 4-byte encoding provides 21 payload bits, enough to cover the entire Unicode range up to code point 1,114,111 (U+10FFFF). This was crucial for supporting emoji, historic scripts, and many other specialized characters without breaking backward compatibility.

| Bytes | Binary (Prefix)                     | Bits | Maximum Value |
|-------|-------------------------------------|------|---------------|
| 4     | 11110___ 10______ 10______ 10______ | 21   | 1,114,111     |
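The longest form works the same way: a leading byte with prefix 11110 and three continuation bytes. A Python sketch using '😀' (U+1F600), a typical character beyond the 16-bit range:

```python
# Hand-packing the 4-byte form for '😀' (U+1F600):
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
cp = ord("😀")                          # 0x1F600, within the 21-bit range
encoded = bytes([
    0b11110_000 | (cp >> 18),           # prefix 11110 + top 3 payload bits
    0b10_000000 | ((cp >> 12) & 0x3F),  # prefix 10 + next 6 bits
    0b10_000000 | ((cp >> 6) & 0x3F),   # prefix 10 + next 6 bits
    0b10_000000 | (cp & 0x3F),          # prefix 10 + low 6 bits
])
assert encoded == "😀".encode("utf-8")  # matches the library
print(encoded.hex(" "))                 # f0 9f 98 80
```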

Summary

In conclusion, UTF-8 is one of the most elegant solutions in the history of computer science. It provides seamless backward compatibility with ASCII while enabling support for a massive number of characters across the world’s languages. Today, virtually every piece of technology understands UTF-8. This clever “hack,” designed in the early 1990s, is still the standard and will likely remain in use for years to come.

Although we live in the 21st century, the spirit of the 1960s and 1990s continues to influence the way we interact with technology. UTF-8 has stood the test of time, and it’s a testament to the power of thoughtful, forward-looking solutions that don’t disrupt the status quo.