Il n'y a pas de hors-texte
— Derrida, French philosopher
Let me tell you a story about the letter ñ...
In its journy from your will...
to your computer screen.
The ñ is the 15th letter of the Spanish alphabet. It is a n with a diacritical tilde. It's history is tracked to monasteries where the glyph ñ was a shortcut for nn.
Everybody has a clear vision of what a ñ. Don't you?
The keyboard is agnostic about what letters it
has printed on its own surface. It simply sends the
scancode 0x27
.
You can use # showkeys --scancodes
to inspect the
raw scan codes.
Scancodes are too close to hardware events, so they are translated into more abstract keycodes.
You can use # getkeycodes
display the map between
scancodes and keycodes.
What is printed on your keyboard must to be known by the system to correctly translate abstract keycodes into keysym symbols.
Here is the mapping table for a QWERTY-Spanish layout.
Once we know the correct symbol, it is mapped against a set of characters such as the Universal Character Set or Unicode sets.
Use dumpkeys | loadkeys -u -m > unicode.h
to
build the table translating keysyms to code points.
Here is the resulting table for the previous mapping.
Finally, we need an encoding:
a scheme to translate numbers to bits.
Thus, the intrepid ñ reaches its destination,
turned into simple bytes 0xc3 0xb1
. Now it is time
to present itself before the world.
Terminals are configured to work with encodings, this case is UTF-8. The terminal knows how to turn bytes into a Unicode code point.
The Unicode code point is then looked for inside the font. It maps to a vector graphic depicting the symbol.
Use $ gucharmap
to display the utility to explore
installed fonts.
And finally, the graphic information is put into the video memory to be displayed through the screen.
It is an effort to create a set of characters for most of the writing systems on Earth (and beyond) and a set of rules for normalization, decomposition, collation, rendering and bidirectional display order.
But the core component is no more than an humongous
table: 1.114.112 entries in total from
U+0000
to U+10FFFF
. Here is how
it is organized...
Each entry in the table is said to be a grapheme.
A grapheme is the smallest unit in a written language. Each
grapheme has an assigned code point in the
Unicode's code space of 1.114.112 values. They are designated by
the prefix U+
followed by its hexadecimal
representation.
Unicode is divided into 17 planes, each of 65.536 code points.
Graphemes inside each plane are grouped by similarity.
Unicode overlaps ASCII, Latin-1 & Universal Character Set.
The plane 0 is called Basic Multilingual Plane and comprises most of the modern languages in the world.
The plane 1 is called Supplementary Multilingual Plane and comprises historycal scripts, mathematical alphanumerics, Emoji and game symbols.
Plane 2 is called the Supplementary Ideographic Plane and holds CJK Unified Ideographs.
Planes 3 to 13 are unused.
Plane 14 is the Supplementary Special-purpose Plane intended to hold special language tags.
Planes 15-16 are reserved for third parties to add their own glyphs. Commonly used by fonts to describe building blocks for other characters.
You need some way to transmit the codepoints. For instance, you could use 4 bytes to encode any of the Unicode code points now and future.
Suppose you want to transmit:
España
You need to transmit the code points:
U+0045 U+0073 U+0070 U+0061 U+00F1 U+0061
And you could use a 32 bit encoding for that:
00000045 00000073 00000070 00000061 000000F1 00000061
This is indeed the UTF-32 encoding format which is a direct representation of the Unicode table.
Each chunk of bits used to encode a code point is named a code unit.
The problem is you're using 24 bytes to encode a 6-letter word.
And you are using an encoding scheme which takes into account room for 10 unused planes!
So there are other more economic ways to encode the same text.
For the España
example, the encoded message reads as
follows:
0045 0073 0070 0061 00F1 0061
You should considerate the endianness as well!
The UTF-8 encoding is designed to expose a bunch of desired properties for an encoding scheme...
For multibyte representations, the first leading byte include as
many 1
bits as the number of bytes needed to encode
the code point plus a 0
. The rest of n
bits contains the first n
bits of the code point.
Following bytes start by 10
and complete the byte
with the following 6 bits of the code point. Code points are
padded with 0
bits when needed.
For the España
example, this time the encoded
message reads as follows:
45 73 70 61 C3B1 61
The ñ is encoded this way:
Code point: U+00F1
Binary: 00000000 11110001
UTF-8 : 11000011 10110001
Hex : C 3 B 1
Given a message in UTF-8 with emoji, translate it into HTML mixing text and images.
JavaScript Harmony (ES 6) includes a lot of text related functions and upgrades its encoding scheme to UTF-16!
But in current JavaScript, the encoding is UCS-2 and that means Emoji characters (plane 3) are out of the code space...
So we will use images from an Emoji database instead!
The idea is quite simple: first decode UTF-8 to Unicode code points and pass through them creating text nodes for text and styled elements for emoji.
Emoji are pictographs (from "picture" e and "character" moji) first provided by the NTT DoCoMo mobile operator in Japan to ease by-message communications.
Not all the original Emoji are in Unicode but some of them
reside with other symbols in the 3rd plane of Unicode, between
U+1F300
and U+27B0
.
There are several initiatives about compiling and documenting Emoji. One that provides tools to generate CSS files is js-emoji on GitHub (it's actually a library to display Emoji and takes the data from emoji-data) but there are other.
The original UTF-8 message is:
42 75 65 6e 6f 20 f0 9f 98 89 2c 20 68 65 20 61
71 75 c3 ad 20 75 6e 20 f0 9f 93 9d 20 65 6e 20
f0 9f 87 aa f0 9f 87 b8 20 63 6f 6e 20 61 6c 67
75 6e 6f 73 20 f0 9f 98 80 2c 20 f0 9f 86 97 3f
3f
When encoding is not provided, most of browsers attempt to decode the data. And they do really well! But this time we wont leave the browser to work for us.
We will use a special XMLHttpRequest
:
function get(url) {
return new Promise(function (resolve) {
var xhr = new XMLHttpRequest();
xhr.open('GET', url);
xhr.responseType = 'arraybuffer';
xhr.onload = function () {
resolve(new Uint8Array(xhr.response));
};
xhr.send();
});
}
Once we have a buffer with the raw bytes, lets translate it into a Unicode object such as an array of code points.
function decodeToUnicode(utf8data) {
var codepoints = [];
var i = 0, codeunit;
var b0, b1, b2, b3;
while (i < utf8data.length) {
codeunit = utf8data[i];
/* code unit handling */
}
return codepoints;
}
The only thing we need to know is if the code unit is lower than 128, it is an ASCII byte.
if (codeunit < 0x80) {
codepoints.push(codeunit);
i += 1;
}
If not, the code point will be represented with 2 code
units at least. The first one will include as many leading
1
as code units used to encode the code point, a
0
and the first bits of the code point in the
remaining free bits.
Remaining code units start by 10
and include
chunks of the next 6 bits of the code point.
So for a code point requiring 4 code units:
// 4 code units for a glyph
if (codeunit >= 0xf0) {
b3 = (codeunit & 0x07) << 18;
b2 = (utf8data[i+1] & 0x3f) << 12;
b1 = (utf8data[i+2] & 0x3f) << 6;
b0 = (utf8data[i+3] & 0x3f);
codepoints.push(b3 | b2 | b1 | b0);
i += 4;
}
Now we have an array of code points, the message could be divised:
[66, 117, 101, 110, 111, 32,
128521, 44, 32,
104, 101, 32,
97, 113, 117, 237, 32,
117, 110, 32,
128221, 32,
101, 110, 32,
127466, 127480, 32,
99, 111, 110, 32,
97, 108, 103, 117, 110, 111, 115, 32,
128512, 44, 32,
127383, 63, 63]
Emoji are code points above the BMP (greater than 65535).
But have you notice those two consecutive codes
127466
and 127480
?
They are called Regional Indicator Symbols and provide a way to represent countries.
They are actually two symbols and is the system who is in charge of replacing them with a flag or whatever.
You can download the complete source for the example.
The demo folder can be found at GitHub as well.