2 Spaces: Unicode Cube

Media technology facilitates communication between people. At the same time, the medium shapes the communication itself: the restrictions of the technology put limits on what we are able to express in it.

Take, for example, the printing press. While it enabled a message to reach a large number of people, it was limited to the printer's collection of movable type. This caused some glyphs to disappear from the written language, and led to the invention of new ones, often through the reinterpretation of older handwritten texts.

Soon after the invention of the digital computer in the mid-20th century came the need to store natural-language text in a form machines could process and people could read back. It took quite a while before this became standardised. Because of the limited hardware of the time and the lack of any need to communicate with other systems, computers all had their own standards for encoding text. Even the length of a word, the smallest group of bits a machine handles as a unit, wasn't agreed upon. Some computer makers built systems where a word consisted of 6 bits (64 possibilities): enough for all 26 letters of the Latin alphabet in a single case, with room left over for numbers, basic punctuation and control characters. But there were also odd-numbered word systems, like the 15-bit Apollo Guidance Computer, with one extra bit reserved for parity checking. By the early seventies, most had arrived at an 8-bit unit, better known as the byte, as the common standard for computers.

This, however, didn't mean that every computer could read and write text the same way, because there was no consensus on which byte represents which character. For example: while an Apple IIe might encode the letter A with byte 0x61, an Atari ST encodes the same character as 0x41. This got resolved over the years with the growing adoption of ASCII. The American Standard Code for Information Interchange character set consists of 128 characters, of which the first 32 are control characters used in telegraph communication; only a handful of them are still in common use today (\n, \r, \t). The rest of this 7-bit encoding was populated with basic punctuation, Arabic numerals, and upper and lower case Latin letters. This was enough to be compatible with American-English communication systems, but insufficient for other writing systems.

The standardisation of the 8-bit byte made it possible for local computer vendors to add 128 extra characters as an extension to ASCII. While this development solved the problem of exchanging digital text written in American English, the variants of Extended ASCII were all incompatible with one another. Meanwhile, the 16-bit encoding systems for Chinese, Japanese, Korean, Thai, etc. caused even more problems, since computers built around ASCII assume a character is only 1 byte long instead of 2. The incorrectly decoded text that results from these incompatibilities is called mojibake. For example, storing the Arabic الإعلان العالمى لحقوق الإنسان in the Arabic code page Windows-1256 and reading it back as Windows-1252 (Microsoft Windows' Extended ASCII) turns it into ÇáÅÚáÇä ÇáÚÇáãì áÍÞæÞ ÇáÅäÓÇä.
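This mix-up is easy to reproduce in Java (the platform the project ends up on anyway). The short program below is my own illustration rather than part of the project; it assumes the Java runtime ships both the windows-1256 and windows-1252 charsets, as standard desktop JDKs do.

    import java.nio.charset.Charset;

    public class Mojibake {
        public static void main(String[] args) {
            String arabic = "الإعلان العالمى لحقوق الإنسان";
            // Store the text as bytes in the Arabic code page Windows-1256 ...
            byte[] stored = arabic.getBytes(Charset.forName("windows-1256"));
            // ... then read those same bytes back as if they were Windows-1252.
            String garbled = new String(stored, Charset.forName("windows-1252"));
            System.out.println(garbled);  // ÇáÅÚáÇä ÇáÚÇáãì áÍÞæÞ ÇáÅäÓÇä
        }
    }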

In the late eighties, people from Xerox, Apple, Sun Microsystems, Microsoft and NeXT set out to create Unicode: a universal character encoding system designed to support most, if not all, written languages. Due to its ubiquity, ASCII was used as a starting point. This, combined with encoding forms of different widths (e.g. the character A can be encoded as 0x41 in UTF-8, 0x0041 in UTF-16BE, or 0x00000041 in UTF-32BE), ensured compatibility with legacy ASCII systems, while also allowing for multi-byte 16 or even 32-bit characters like 0x0f42 (ག) or 0x0001f004 (🀄) respectively. Without Unicode, international communication as we know it today, including the originally Japan-exclusive emoji set, wouldn't be possible. Instead we would be staring at a row of ����������.
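The difference between these encoding forms is easy to inspect in Java. The snippet below is again my own illustration; it assumes the runtime provides the UTF-32BE charset, which standard JDK builds do.

    import java.nio.charset.Charset;

    public class EncodingWidths {
        // Render the bytes of s in the given encoding as hexadecimal digits.
        static String hex(String s, String charsetName) {
            StringBuilder sb = new StringBuilder();
            for (byte b : s.getBytes(Charset.forName(charsetName))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            String[] samples = { "A", "\u0f42", new String(Character.toChars(0x1f004)) };
            for (String s : samples) {
                System.out.printf("%-4s UTF-8: %-10s UTF-16BE: %-10s UTF-32BE: %s%n",
                        s, hex(s, "UTF-8"), hex(s, "UTF-16BE"), hex(s, "UTF-32BE"));
            }
        }
    }

For A this prints one, two and four bytes respectively, while the mahjong tile needs four bytes in all three forms.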

Not unlike decodeunicode, I want to visualise the Unicode system and so, in a small way, increase awareness and appreciation of this fundamental yet overlooked technology. My choice of tools for constructing this 3D visualisation largely depended on how well they support Unicode. It turns out this isn't as trivial as one might expect. In the name of performance, the native character type of languages like C and C++ is only a single byte. This could be circumvented by creating your own data type for 2 or 4-byte characters, but that assumes I also write my own text-rendering implementation for something like openFrameworks, which I'm not equipped to do. I also explored implementing my idea with the Kotlin-based rendering engine openRNDR. This looked very promising, since the Java platform supports all kinds of Unicode variants out of the box. Sadly, openRNDR wasn't capable of rendering any characters without extra work. The work-around I eventually arrived at, with help from the very helpful openRNDR community, only partly worked: it managed to render characters up to halfway through the Arabic character block, but then halted, often without a stack trace. Eventually I gave up and returned to my trusty workhorse Processing. Since Processing is Java-based, I could transfer what I had learned during my openRNDR adventures directly into the Processing sketch. From here it was relatively smooth sailing, except for one more hurdle: I couldn't get UTF-32 encoding to work, which is required to reach all the way to the end of the current Unicode system at 0x0002FA0 (駾). Instead I had to settle for the Basic Multilingual Plane (BMP), the range a single Java char can represent, which encompasses all code points between 0x0 and 0xffff: 65,536 different points in space, enough for a neat 3D visualisation.
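To give an impression of what the Processing side looks like, here is a minimal sketch along the lines of the final visualisation. It is my own simplified reconstruction, not the project's actual source (see the link below for that), and it assumes a font with broad BMP coverage, such as GNU Unifont, is installed on the system.

    // Lay out every code point of the Basic Multilingual Plane on a cube-shaped lattice.
    int SIDE = 41;        // 41 * 41 * 41 = 68921, enough room for 65536 points
    float SPACING = 14;
    int STEP = 16;        // draw every 16th code point; set to 1 for the full set (slow)
    PFont font;

    void setup() {
      size(900, 900, P3D);
      font = createFont("Unifont", 12);
      textFont(font);
      textAlign(CENTER, CENTER);
    }

    void draw() {
      background(0);
      translate(width / 2.0, height / 2.0, -600);
      rotateY(frameCount * 0.003);
      float offset = (SIDE - 1) * SPACING / 2.0;
      for (int codePoint = 0; codePoint <= 0xFFFF; codePoint += STEP) {
        // Skip unassigned code points, control characters and lone surrogates.
        if (!Character.isDefined(codePoint) || Character.isISOControl(codePoint)) continue;
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue;
        pushMatrix();
        // Map the code point to x, y, z coordinates on the lattice.
        translate((codePoint % SIDE) * SPACING - offset,
                  ((codePoint / SIDE) % SIDE) * SPACING - offset,
                  (codePoint / (SIDE * SIDE)) * SPACING - offset);
        text((char) codePoint, 0, 0);
        popMatrix();
      }
    }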

Source code