Character Encoding Schemes

©2013 PerfectLogic Corporation, All Rights Reserved

The characters and glyphs of natural languages, like all information, are stored in computer memory and on storage media as numbers. Many schemes for encoding characters have been devised. They all share one objective: mapping characters to unique numerical values. Early attempts to devise encoding schemes focused on the character sets used in those parts of the world (most notably, Europe and the Americas) where computers had first been introduced. As use of the computer spread to new regions of the world and new problem domains, the demand for representing larger and more diverse character sets grew. The response to this need was the invention of new and increasingly complex encoding schemes.

A variety of encoding schemes are used to map the symbols of our spoken, mathematical, and graphical languages into numerical codes. Each of these schemes defines a one-to-one mapping between a symbol and a number or sequence of numbers. The encoding schemes (e.g., ASCII, ANSI) for the relatively small character sets used in Latin-like written languages are simple and compact, with integer values chosen so that they fit neatly into a single byte. The integer values for these simple encoding schemes may be viewed as indices into character set tables. The desire to encode the symbols of special-purpose languages such as those of mathematics, music, and finance, as well as the symbols of ancient languages, has spurred development of encoding methods capable of representing all the written languages of the world.

1. The American Standard Code for Information Interchange (ASCII)

The original ASCII character set included 128 characters. The set included punctuation marks, upper and lowercase Latin alphabetic characters, the digits 0 through 9, and a set of control characters. The ASCII mapping of this character set to 7-bit numeric values was devised by the American Standards Association (ASA), the predecessor of the American National Standards Institute (ANSI), to simplify the task of connecting character-oriented peripherals (e.g., printers, teletype machines, monitors, etc.) to computers built by different manufacturers.

Table 1 - ASCII Character Encoding
Dec Hex Char Name Dec Hex Char Name Dec Hex Char Name
0 00 NUL Null character 43 2B + plus 86 56 V Upper V
1 01 SOH Start of Heading 44 2C , comma 87 57 W Upper W
2 02 STX Start of Text 45 2D - hyphen 88 58 X Upper X
3 03 ETX End of Text 46 2E . period 89 59 Y Upper Y
4 04 EOT End of Transmission 47 2F / forward slash 90 5A Z Upper Z
5 05 ENQ Enquire 48 30 0 zero 91 5B [ left bracket
6 06 ACK Acknowledge 49 31 1 one 92 5C \ backslash
7 07 BEL Bell 50 32 2 two 93 5D ] right bracket
8 08 BS Backspace 51 33 3 three 94 5E ^ caret
9 09 HT Horizontal Tab 52 34 4 four 95 5F _ underscore
10 0A LF Line Feed 53 35 5 five 96 60 ` left single quote
11 0B VT Vertical Tab 54 36 6 six 97 61 a Lower a
12 0C FF Form Feed 55 37 7 seven 98 62 b Lower b
13 0D CR Carriage Return 56 38 8 eight 99 63 c Lower c
14 0E SO Shift Out 57 39 9 nine 100 64 d Lower d
15 0F SI Shift In 58 3A : colon 101 65 e Lower e
16 10 DLE Data Link Escape 59 3B ; semicolon 102 66 f Lower f
17 11 DC1 Device Control 1 60 3C < less than 103 67 g Lower g
18 12 DC2 Device Control 2 61 3D = equal 104 68 h Lower h
19 13 DC3 Device Control 3 62 3E > greater than 105 69 i Lower i
20 14 DC4 Device Control 4 63 3F ? question mark 106 6A j Lower j
21 15 NAK Neg. Acknowledgement 64 40 @ at symbol 107 6B k Lower k
22 16 SYN Synchronous Idle 65 41 A Upper A 108 6C l Lower l
23 17 ETB End Transmission Blk. 66 42 B Upper B 109 6D m Lower m
24 18 CAN Cancel 67 43 C Upper C 110 6E n Lower n
25 19 EM End of Medium 68 44 D Upper D 111 6F o Lower o
26 1A SUB Substitute 69 45 E Upper E 112 70 p Lower p
27 1B ESC Escape 70 46 F Upper F 113 71 q Lower q
28 1C FS File Separator 71 47 G Upper G 114 72 r Lower r
29 1D GS Group Separator 72 48 H Upper H 115 73 s Lower s
30 1E RS Record Separator 73 49 I Upper I 116 74 t Lower t
31 1F US Unit Separator 74 4A J Upper J 117 75 u Lower u
32 20 SP Space 75 4B K Upper K 118 76 v Lower v
33 21 ! Exclamation mark 76 4C L Upper L 119 77 w Lower w
34 22 " Double quote 77 4D M Upper M 120 78 x Lower x
35 23 # Pound sign 78 4E N Upper N 121 79 y Lower y
36 24 $ Dollar sign 79 4F O Upper O 122 7A z Lower z
37 25 % percent sign 80 50 P Upper P 123 7B { left brace
38 26 & ampersand 81 51 Q Upper Q 124 7C | vertical bar
39 27 ' single quote 82 52 R Upper R 125 7D } right brace
40 28 ( left paren. 83 53 S Upper S 126 7E ~ tilde
41 29 ) right paren 84 54 T Upper T 127 7F DEL Delete
42 2A * asterisk 85 55 U Upper U        

The ASCII numerical codes representing characters use the seven low-order bits of a byte. The high-order bit (i.e., the most significant bit) of each character byte was reserved to record byte parity (a primitive error detection code whose value is set to one if the number of set bits in the memory unit is odd).
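
The parity scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name add_parity_bit is hypothetical, and it assumes the parity bit is computed over the seven data bits and set to one when their count of set bits is odd.

```python
def add_parity_bit(code: int) -> int:
    """Return an 8-bit value: a 7-bit ASCII code with a parity bit in bit 7.

    Assumption: the parity bit is 1 when the seven data bits contain an
    odd number of set bits, as in the definition given in the text.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("ASCII codes occupy only the seven low-order bits")
    ones = bin(code).count("1")   # number of set bits among the data bits
    parity = ones % 2             # 1 when that count is odd
    return (parity << 7) | code


# 'A' is 1000001 (two set bits); 'C' is 1000011 (three set bits).
print(f"{add_parity_bit(ord('A')):08b}")
print(f"{add_parity_bit(ord('C')):08b}")
```

With this convention every stored byte ends up with an even number of set bits, so a single flipped bit can be detected by recounting.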

2. ANSI and Extended ASCII Character Sets

One response to the growing need for representing new characters was the extension of the ASCII character set from 128 to 256 characters. By using the high-order (parity) bit of the memory bytes used to contain ASCII character codes, the size of the character set could be doubled. Putting this (mostly unused) parity bit into service reduced the impact the expansion of the character set would have on existing hardware and software.

There has been, however, no universal agreement on which characters to include in the character set extension. Many different sets have been defined and are in use today. Consequently, there is no standard ASCII extension. One instance of an ASCII extension, known as the ANSI character set, is shown in Table 2. ANSI characters with code values 32 to 127 correspond to those in the 7-bit ASCII character set.

Another popular ASCII extension is the one defined by ISO 8859-1, a standard developed by the International Organization for Standardization (ISO). While no ASCII extension is regarded as "the standard", ISO 8859-1 is one of the extensions governed by a formal standards document. ISO 8859-1 is also referred to as the ISO Latin-1 set, and is widely used throughout North and South America, Western Europe, Africa, and those countries in Asia which use Latin-like alphabets.
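
The practical consequence of having several ASCII extensions can be seen by decoding the same bytes two ways. The sketch below uses Python's built-in "cp1252" and "latin-1" codecs; treating "cp1252" (Windows code page 1252) as the ANSI set of Table 2 is an assumption made for illustration.

```python
# Bytes below 0x80 are plain ASCII; 0xE9 lies in the range where the two
# extensions agree; 0x80 and 0x99 lie in the range where they differ.
samples = bytes([0x41, 0xE9, 0x80, 0x99])

as_cp1252 = samples.decode("cp1252")    # assumed to match the ANSI set
as_latin1 = samples.decode("latin-1")   # ISO 8859-1

for byte, a, b in zip(samples, as_cp1252, as_latin1):
    print(f"0x{byte:02X}: cp1252={a!r}  latin-1={b!r}")
```

The first two bytes decode to the same characters under both codecs, while 0x80 and 0x99 yield printable symbols under one and control characters under the other.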

Table 2 - ANSI Character Encoding (an Extended ASCII Character Set)
Dec Hex Char Name Dec Hex Char Name Dec Hex Char Name
128 80 € Euro symbol 171 AB « Left, double angle quote 214 D6 Ö Upper O with diaeresis
129 81   Unassigned 172 AC ¬ Not symbol 215 D7 × Multiplication symbol
130 82 ‚ Single low-9 quote 173 AD ­ Soft hyphen 216 D8 Ø Upper O with stroke
131 83 ƒ Lower f with hook 174 AE ® Registered symbol 217 D9 Ù Upper U with grave
132 84 „ Double low-9 quote 175 AF ¯ Macron 218 DA Ú Upper U with acute
133 85 … Horizontal ellipsis 176 B0 ° Degree symbol 219 DB Û Upper U with circumflex
134 86 † Dagger 177 B1 ± Plus-minus symbol 220 DC Ü Upper U with diaeresis
135 87 ‡ Double dagger 178 B2 ² Superscript two 221 DD Ý Upper Y with acute accent
136 88 ˆ Circumflex accent modifier 179 B3 ³ Superscript three 222 DE Þ Upper Thorn
137 89 ‰ Per mille symbol 180 B4 ´ Acute accent 223 DF ß Lower sharp s
138 8A Š Upper S with caron 181 B5 µ Micro symbol 224 E0 à Lower a with grave accent
139 8B ‹ Left angle quote 182 B6 ¶ Pilcrow symbol 225 E1 á Lower a with acute accent
140 8C Œ Latin capital ligature OE 183 B7 · Middle dot 226 E2 â Lower a with circumflex
141 8D   Unassigned 184 B8 ¸ Cedilla 227 E3 ã Lower a with tilde
142 8E Ž Upper Z with caron 185 B9 ¹ Superscript one 228 E4 ä Lower a with diaeresis
143 8F   Unassigned 186 BA º Masculine ordinal indicator 229 E5 å Lower a with ring
144 90   Unassigned 187 BB » Right double angle quote 230 E6 æ Lower æ
145 91 ‘ Left single quote 188 BC ¼ Fraction=one quarter 231 E7 ç Lower c with cedilla
146 92 ’ Right single quote 189 BD ½ Fraction=one half 232 E8 è Lower e with grave accent
147 93 “ Left double quote 190 BE ¾ Fraction=three quarters 233 E9 é Lower e with acute accent
148 94 ” Right double quote 191 BF ¿ Inverted question mark 234 EA ê Lower e with circumflex
149 95 • Bullet 192 C0 À Upper A with grave 235 EB ë Lower e with diaeresis
150 96 – En dash 193 C1 Á Upper A with acute 236 EC ì Lower i with grave accent
151 97 — Em dash 194 C2 Â Upper A with circumflex 237 ED í Lower i with acute accent
152 98 ˜ Small tilde 195 C3 Ã Upper A with tilde 238 EE î Lower i with circumflex
153 99 ™ Trademark symbol 196 C4 Ä Upper A with diaeresis 239 EF ï Lower i with diaeresis
154 9A š Lower s with caron 197 C5 Å Upper A with ring 240 F0 ð Lower eth
155 9B › Right angle quote 198 C6 Æ Upper AE 241 F1 ñ Lower n with tilde
156 9C œ Latin small ligature oe 199 C7 Ç Upper C with cedilla 242 F2 ò Lower o with grave accent
157 9D   Unassigned 200 C8 È Upper E with grave 243 F3 ó Lower o with acute accent
158 9E ž Lower z with caron 201 C9 É Upper E with acute 244 F4 ô Lower o with circumflex
159 9F Ÿ Upper Y with diaeresis 202 CA Ê Upper E with circumflex 245 F5 õ Lower o with tilde
160 A0   Non-breaking space 203 CB Ë Upper E with diaeresis 246 F6 ö Lower o with diaeresis
161 A1 ¡ Inverted exclamation mark 204 CC Ì Upper I with grave 247 F7 ÷ Division symbol
162 A2 ¢ Cent symbol 205 CD Í Upper I with acute 248 F8 ø Lower o with stroke
163 A3 £ Pound symbol 206 CE Î Upper I with circumflex 249 F9 ù Lower u with grave accent
164 A4 ¤ Currency symbol 207 CF Ï Upper I with diaeresis 250 FA ú Lower u with acute accent
165 A5 ¥ Yen symbol 208 D0 Ð Upper Eth 251 FB û Lower u with circumflex
166 A6 ¦ Broken bar 209 D1 Ñ Upper N with tilde 252 FC ü Lower u with diaeresis
167 A7 § Section symbol 210 D2 Ò Upper O with grave 253 FD ý Lower y with acute accent
168 A8 ¨ Diaeresis 211 D3 Ó Upper O with acute 254 FE þ Lower thorn
169 A9 © Copyright symbol 212 D4 Ô Upper O with circumflex 255 FF ÿ Lower y with diaeresis
170 AA ª Feminine ordinal indicator 213 D5 Õ Upper O with tilde  

3. Unicode

Unicode is a still-evolving computing industry standard whose designers aim to have it eventually replace older character encoding schemes that are incapable of representing many of the complex writing systems (e.g., Chinese) of the world. The Unicode Consortium manages the development of this standard. Copies of the most recent version of the Unicode Standard are available at the Consortium's website.

The Unicode standardization project reserves a range of integer values for identifying characters and glyphs. These reserved values lie in the closed interval [0, 10FFFF] (hexadecimal). This range, or codespace, includes 1,114,112 values. Each value, referred to as a code point, may be associated with a distinct character or glyph. The codespace is divided into seventeen planes, numbered 0 to 16, each containing 65,536 code points. These planes are subdivided into blocks of varying sizes, each used to encode symbols for a particular language or group of languages. The zeroth plane is referred to as the Basic Multilingual Plane (BMP), with code points in the closed interval [0, FFFF]. Most of the 65,536 code points in the BMP have already been assigned to characters.
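
The arithmetic behind the codespace and its planes can be checked with a short Python sketch. The names MAX_CODE_POINT, PLANE_SIZE, and plane_of are introduced here only for illustration.

```python
MAX_CODE_POINT = 0x10FFFF   # upper bound of the Unicode codespace
PLANE_SIZE = 0x10000        # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("value lies outside the Unicode codespace")
    return code_point >> 16  # equivalent to integer division by 65,536


codespace_size = MAX_CODE_POINT + 1
plane_count = codespace_size // PLANE_SIZE

print(codespace_size)    # 1114112
print(plane_count)       # 17
print(plane_of(0x2122))  # 0: the trademark symbol lies in the BMP
print(plane_of(0x1D160)) # 1: the musical eighth note lies in plane 1
```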

The assignment of the first 256 code points in the Unicode codespace is identical to the assignments made in the ISO 8859-1 standard (see Section 2, ANSI and Extended ASCII Character Sets). This choice simplifies the conversion of ASCII-encoded text to the Unicode standard, and reduces the impact of the Unicode standard on legacy systems.
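
This correspondence is easy to confirm with Python's built-in "latin-1" codec (its name for ISO 8859-1). The snippet is a small check, not part of any standard API beyond the codec itself.

```python
# Decode every possible byte value as ISO 8859-1 and compare the resulting
# code points with the byte values themselves.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
matches = [ord(ch) for ch in decoded] == list(range(256))
print(matches)  # True

# Pure ASCII text therefore keeps the same numeric values as code points.
ascii_text = "Plain ASCII text"
same_values = [ord(ch) for ch in ascii_text] == list(ascii_text.encode("ascii"))
print(same_values)  # True
```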

3.1 Unicode Encoding Schemes

The Unicode standard defines two general methods for mapping code points to sequences of fixed-size memory units (8-bit, 16-bit, or 32-bit). These memory units are referred to as code units. The sequences produced by these encoding methods may be from one to four code units in length. The first of these general methods is referred to as the Unicode Transformation Format (UTF) encoding method. Several variants of this method are defined. They include UTF-8, UTF-16, and UTF-32. The value appearing after the hyphen indicates the size, in bits, of the code unit in the encoded sequences.

The second basic method is referred to as the Universal Character Set (UCS) encoding method. The two variants of this general encoding scheme are the UCS-2 and the UCS-4 mapping methods. Here the value following the hyphen in the method name indicates the number of bytes produced by the method during the mapping of a code point to a multi-byte sequence. The UCS-2 method is now obsolete, and the UCS-4 and UTF-32 methods are essentially equivalent.

UTF-8 and UTF-16 are the most widely employed methods for mapping Unicode code points to their memory-resident representations.

3.1.1 UTF-8 Encoding Method

The UTF-8 method maps code points to a sequence of bytes ranging in length from 1 to 4 bytes. Each byte within the sequence contains both control bits and non-control bits. The control bits indicate how many bytes there are in a given sequence, and whether a given byte is the first in the sequence or one of the "trailing" bytes. The figure below illustrates how these control bits are interpreted.

The non-control bits of each byte in a sequence are used to record the character code value (i.e., code point) assigned by the Unicode Standard. The way this is accomplished is most easily explained by giving an example. The Unicode integer value assigned to the trademark symbol, ™, is 8,482 base 10. Expressed as a hexadecimal number, the value is 2122. The Unicode convention for expressing this code point is U+2122. The questions needing answers are these: "How is this value encoded using the UTF-8 method?"; and "How many bytes will be required?" The answers to these questions are found by first considering the binary representation of the hexadecimal number 2122, keeping in mind the encoding details depicted in Figure 1.

At the top of Figure 2, the binary representation of the code value for the trademark symbol is given. Its representation requires 14 bits (leading zeros may be ignored). Each byte of a UTF-8 code sequence, except for the first, has six bit positions available for containing the Unicode character value (i.e., code point). The least significant six bits of the binary representation of the trademark code are inserted in the final byte of the UTF-8 sequence. The next six bits of the code are moved to the next-to-last byte of the UTF-8 sequence. This leaves only the two most significant bits of the trademark code to insert. These two bits are inserted in the low-order bit positions of a third byte, which becomes the first byte of the sequence. The control bits, 5 through 7, of this byte are set to indicate that the resulting UTF-8 code sequence is of length 3. Control bit 4 is reset to zero to mark the end of the initial chain of 1 bits.

Thus, the Unicode code point, U+2122, requires three bytes for its UTF-8 representation, and this three byte sequence expressed using hexadecimal digits is,

E2 84 A2.

An examination of Figure 1 reveals that a 4-byte UTF-8 sequence provides a total of 21 non-control bits (three in the first byte and six in each of the three trailing bytes). These 21 bits can represent values in the closed interval [0, 1FFFFF], equivalent to the decimal range 0..2,097,151. However, the Unicode Standard restricts code points to the interval [0, 10FFFF], so not all of these possible values are associated with characters.
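
The bit-packing steps of this section can be sketched as a small Python function and checked against Python's built-in UTF-8 codec. The function name utf8_encode is hypothetical, and the sketch is meant as an illustration of the method rather than a replacement for a library encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as a UTF-8 byte sequence."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("value lies outside the Unicode codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not encoded in UTF-8")
    if code_point < 0x80:            # up to 7 bits: a single byte, 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # up to 11 bits: 110xxxxx 10xxxxxx
        length, lead = 2, 0xC0
    elif code_point < 0x10000:       # up to 16 bits: 1110xxxx + 2 trailing
        length, lead = 3, 0xE0
    else:                            # up to 21 bits: 11110xxx + 3 trailing
        length, lead = 4, 0xF0
    trailing = []
    for _ in range(length - 1):
        # Take the low six bits and mark the byte as trailing (10xxxxxx).
        trailing.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # The bits that remain go into the low-order positions of the first byte.
    return bytes([lead | code_point] + trailing[::-1])


encoded = utf8_encode(0x2122)        # the trademark symbol
print(encoded.hex(" ").upper())      # E2 84 A2
```

Comparing the result with "\u2122".encode("utf-8") is a convenient way to confirm that the hand-built sequence agrees with a standard encoder.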

3.1.2 UTF-16 Encoding Method

The UTF-16 encoding method maps code points into either one or two 16-bit code units. Characters in the Basic Multilingual Plane (BMP) (i.e., code points in the range 0 to FFFF) are mapped directly to a single 16-bit word. For all other characters the UTF-16 transformation of code points yields a pair of 16-bit words referred to as a surrogate pair. The 16-bit word containing the most significant bits of a code point is referred to as the leading or high surrogate, and the word containing the least significant bits of the code point is called the trailing or low surrogate.

The method for mapping code points to surrogate pairs is depicted in Figure 3 using, as an example, the Unicode code point U+1D160, which represents the musical eighth note symbol. The high and low surrogates are first initialized to the hexadecimal values D800 and DC00, respectively. The value of the most significant five bits of the code point (bits 16 through 20), decremented by one, is then moved to bit positions 6 through 9 of the high surrogate; after the decrement the value never exceeds fifteen, so four bits suffice. The sixteen least significant bits of the code point are distributed between bit positions 0 through 5 of the high surrogate and positions 0 through 9 of the low surrogate, as indicated in Figure 3.
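
The surrogate-pair construction described above can be sketched in Python as follows. The function name utf16_encode is hypothetical; the sketch follows the bit positions given in the text and is not the downloadable C or Ada program logic mentioned in this section.

```python
def utf16_encode(code_point: int) -> list:
    """Return the UTF-16 code units (one or two integers) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("value lies outside the Unicode codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are reserved and cannot be encoded")
    if code_point <= 0xFFFF:
        return [code_point]                   # BMP: a single 16-bit word
    top_five = (code_point >> 16) - 1         # bits 16..20, decremented by one
    middle_six = (code_point >> 10) & 0x3F    # bits 10..15 of the code point
    low_ten = code_point & 0x3FF              # bits 0..9 of the code point
    high = 0xD800 | (top_five << 6) | middle_six
    low = 0xDC00 | low_ten
    return [high, low]


units = utf16_encode(0x1D160)                 # musical eighth note
print(" ".join(f"{unit:04X}" for unit in units))  # D834 DD60
```

Decrementing the top five bits by one is equivalent to subtracting 10000 (hexadecimal) from the code point before splitting the remaining 20 bits into two 10-bit halves, which is how the computation is often written.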

Program logic, expressed in both the C and Ada programming languages, that illustrates the UTF-16 encoding scheme may be downloaded (C_version, Ada_version).

3.1.3 UTF-32 and UCS-4 Encoding Methods

The equivalent UTF-32 and UCS-4 encoding methods employ a simple and very direct method to represent code points. All code points, regardless of value, are mapped directly into 32-bit code units.
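
Because the mapping is direct, a UTF-32 encoder amounts to writing the code point as a 32-bit integer. The sketch below uses Python's struct module; the function name utf32_encode_be is hypothetical, and big-endian byte order is an assumption chosen only for display.

```python
import struct


def utf32_encode_be(code_point: int) -> bytes:
    """Store a code point in one 32-bit code unit, high-order byte first."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("value lies outside the Unicode codespace")
    return struct.pack(">I", code_point)   # ">I": big-endian unsigned 32-bit


unit = utf32_encode_be(0x1D160)
print(unit.hex(" ").upper())               # 00 01 D1 60
```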

3.2 Endianness Attribute and the Unicode Byte Order Mark (BOM)

Endianness refers to the order in which the bytes within a code unit are arranged in memory. UTF-16 encodings in which the high-order byte of each code unit precedes the low-order byte are said to be in Big Endian (abbreviated BE) order. The Big Endian ordering of the UTF-16 encoding of the musical eighth note symbol shown in Figure 3 would be,

D834 DD60.

UTF-16 encodings in which the low-order byte of each code unit precedes the high-order byte are said to be in Little Endian (abbreviated LE) order. The equivalent Little Endian ordering of the UTF-16 encoding of the musical eighth note symbol would be,

34D8 60DD.

The endianness of UTF-16 encodings is indicated by appending the suffix "BE" or "LE" to the method name. The Unicode mapping method yielding UTF-16 encodings that have the low-order byte of each code unit appearing before the high-order byte is designated UTF-16LE. The mapping method in which the "natural" (high-order byte first) order is preserved is designated UTF-16BE; the unqualified name UTF-16 is interpreted the same way when no byte order mark is present.
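
The two byte orders can be observed directly with Python's built-in "utf-16-be" and "utf-16-le" codecs, using the musical eighth note symbol as the example character.

```python
note = "\U0001D160"                      # musical eighth note, U+1D160

big_endian = note.encode("utf-16-be")    # high-order byte of each unit first
little_endian = note.encode("utf-16-le") # low-order byte of each unit first

print(big_endian.hex(" ").upper())       # D8 34 DD 60
print(little_endian.hex(" ").upper())    # 34 D8 60 DD
```

Note that only the bytes within each 16-bit code unit are swapped; the high surrogate still precedes the low surrogate in both orderings.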

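A single code point, U+FEFF, serves as the Byte Order Mark (BOM) that indicates the byte order of UTF-16 encodings within text streams. When the first two bytes of a stream are FE FF, the character encodings adhere to the UTF-16BE encoding scheme; when they are FF FE, the byte ordering follows the UTF-16LE scheme. The byte-swapped value, U+FFFE, is deliberately left unassigned to any character, so reading FFFE as the first code unit signals that the bytes are in the opposite order from the one the reader assumed.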
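
A reader of a UTF-16 stream can inspect its first two bytes to decide the byte order. The following Python sketch uses the BOM constants from the standard codecs module; the function name utf16_byte_order is hypothetical, and the sketch assumes the data is already known to be UTF-16 rather than some other encoding.

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Report the byte order signalled by a leading BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"                           # no BOM at the start


be_stream = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
le_stream = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")

print(utf16_byte_order(be_stream))   # big
print(utf16_byte_order(le_stream))   # little
print(utf16_byte_order(b"\x00A"))    # unknown
```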