Excerpts from http://ijstokes.paunix.org/unicode/unicode.html
Some acronyms:
UCS (ISO 10646-1): Universal Character Set
BMP: Basic Multilingual Plane
UTF: Unicode Transformation Format
UTF-8: 8-bit Unicode Transformation Format
UCS is from ISO; Unicode is from the Unicode Consortium.
In 1991 they agreed to collaborate, and since then
the two standards have been essentially the same. Both organisations still
publish their standards individually, but they coordinate any changes to them.
UCS/Unicode was originally defined as a 31-bit character set (this does not imply any encoding);
the code space has since been limited to U+0000 through U+10FFFF.
Basically, that means each character is assigned an integer value.
More precisely, each character is assigned an official name and an integer value (also known as its code point).
The code point of a character is usually written in
the form U+xxxx (where xxxx is in hex).
For example:
Character   Hex value   Official name
A           U+0041      "Latin capital letter A"
The UCS characters U+0000 to U+007F are identical to those in US-ASCII.
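As a small illustration of code points and official names, here is a sketch using Python's standard unicodedata module. The helper name describe is made up for these notes; it is not part of any standard.

```python
import unicodedata

def describe(ch):
    """Return (U+xxxx notation, official name) for a single character.

    'describe' is a hypothetical helper, named only for this example.
    """
    code_point = ord(ch)  # the integer value assigned to the character
    return "U+%04X" % code_point, unicodedata.name(ch)

print(describe("A"))

# The first 128 code points coincide with US-ASCII:
# encoding each one as ASCII yields the byte with the same value.
print(all(chr(i).encode("ascii") == bytes([i]) for i in range(0x80)))
```

Note that unicodedata reports the official name in upper case.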
Now comes the concept of encoding, that is, how we represent
the code point of each character as a sequence of bytes for computers to use.
- UCS-2 represents each character as a 2-byte sequence.
  for example A = U+0041 is represented as 0x00 0x41
- UCS-4 represents each character as a 4-byte sequence.
  for example A = U+0041 is represented as 0x00 0x00 0x00 0x41
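The two fixed-width encodings can be sketched in a few lines of Python. The function name fixed_width is hypothetical, and big-endian byte order is an assumption chosen to match the examples above (the notes do not say anything about byte order).

```python
def fixed_width(ch, width):
    """Encode one character as a fixed-width, big-endian byte sequence.

    width=2 mimics UCS-2, width=4 mimics UCS-4. Byte order is assumed
    big-endian to match the examples in these notes.
    """
    # int.to_bytes raises OverflowError if the code point does not fit,
    # e.g. a code point above U+FFFF with width=2.
    return ord(ch).to_bytes(width, "big")

print(fixed_width("A", 2).hex())
print(fixed_width("A", 4).hex())
```

For a character such as A, the results agree with Python's built-in "utf-16-be" and "utf-32-be" codecs.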
UTF-8
- UCS characters U+0000 to U+007F (US-ASCII) are encoded as is.
  for example A = U+0041 is represented as 0x41
- UCS characters > U+007F are represented by multiple bytes.
  These multibyte representations never contain a byte value
  between 0x00 and 0x7F.
- The first byte is always between 0xC0 and 0xFD, and
  indicates how many bytes make up the character's sequence.
  All following bytes are between 0x80 and 0xBF.
- 0xFE and 0xFF are not used in UTF-8 encoding.
So, in summary:
0x00 - 0x7F  US-ASCII
0x80 - 0xBF  continuation bytes of multibyte sequences
0xC0 - 0xFD  first byte of a multibyte sequence; indicates the number of bytes
0xFE - 0xFF  not used
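The summary ranges translate directly into a byte classifier. This is only a sketch following the ranges as stated in these notes; the function name classify and its labels are invented for the example.

```python
def classify(byte):
    """Classify one byte value (0-255) using the ranges from the summary.

    The labels are hypothetical names chosen for this example.
    """
    if byte <= 0x7F:
        return "ascii"         # 0x00 - 0x7F
    if byte <= 0xBF:
        return "continuation"  # 0x80 - 0xBF
    if byte <= 0xFD:
        return "lead"          # 0xC0 - 0xFD
    return "unused"            # 0xFE - 0xFF

# 'A' followed by U+00E9 (e with acute), which needs two bytes in UTF-8
encoded = "A\u00e9".encode("utf-8")
print([classify(b) for b in encoded])
```

Because the ranges do not overlap, a decoder can tell from any single byte whether it stands alone, starts a sequence, or continues one.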
This pretty smart encoding system was invented by Ken Thompson.
Rob Pike’s UTF-8 history http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
The following table, from Markus Kuhn’s article, explains why this encoding is damn smart. As we can see, the number of most significant set bits in the first byte indicates the number of bytes in the whole sequence.
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
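To see the table in action, here is a minimal hand-written encoder in Python. It is a sketch of the 1 to 6 byte scheme exactly as tabulated above, not a replacement for a real codec; the name utf8_encode is made up for this example. Python's built-in str.encode("utf-8") only goes up to U+10FFFF and rejects surrogates, so the two agree only within that range.

```python
def utf8_encode(code_point):
    """Encode one code point using the 1-6 byte scheme from the table.

    'utf8_encode' is a hypothetical name used only in these notes.
    """
    if code_point < 0x80:
        return bytes([code_point])  # 0xxxxxxx: plain US-ASCII

    # Pick the shortest sequence length whose range contains the code point.
    for length, limit in ((2, 0x800), (3, 0x10000), (4, 0x200000),
                          (5, 0x4000000), (6, 0x80000000)):
        if code_point < limit:
            break
    else:
        raise ValueError("code point out of range")

    out = []
    # Fill the continuation bytes (10xxxxxx) from the least significant end.
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6

    # First byte: 'length' leading 1 bits, then a 0, then the remaining bits.
    lead_marker = (0xFF << (8 - length)) & 0xFF
    out.append(lead_marker | code_point)
    return bytes(reversed(out))

print(utf8_encode(0x41).hex())
print(utf8_encode(0x20AC).hex())
```

Always choosing the first range that fits is what enforces the "shortest possible sequence" rule from the paragraph above.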


