10.2005.5 Unicode
UNICODE
by Sinclair Tweedie

1. ASCII
The most widely used system for the representation of data was ASCII, the American Standard Code for Information Interchange. In its extended form this is an eight-bit code which allows the alphabet (upper and lower case), numerals, punctuation characters (commas, question marks, etc.), certain foreign language characters (European languages!) and graphics characters to be represented. Each character is given a numerical code in the range 0 to 255 (256 in all, that is 2^8). Originally it was a 6-bit code and then a 7-bit code.

If you think back to some basic logic theory you will remember that:

1 bit allows 2 combinations, i.e. 0 and 1
2 bits allow 4 combinations, i.e. 00, 01, 10 and 11
3 bits allow 8 combinations, i.e. 000, 001, 010, 011, 100, 101, 110 and 111

and so on: n bits allow 2^n combinations.

How many bits would you need to successfully work with English sentences?

'a' to 'z'        26
plus 'A' to 'Z'   26
plus '0' to '9'   10 (and these are just the number symbols)

That makes 62 characters. 6 bits would allow 64 combinations, so let's throw in the space and a full stop. That gives us upper and lower case letters, number symbols, a space and one punctuation symbol (the full stop). Immediately one thinks, "Wouldn't it be nice to have the rest of the punctuation symbols?" The obvious solution was to add another bit, but 7 bits is not a nice computer number, whereas 8 bits was then becoming the accepted size of 1 byte.

At one time two systems existed: ASCII (which was the 6-bit code and then the 7-bit code) and ASCII-8 (or Extended ASCII), which used 8 bits and allowed 256 variations. The old ASCII was soon dropped and ASCII-8 became plain ASCII. The rough breakdown of codes is shown below:

0 to 31      These are reserved codes generally used for control purposes, such as signalling the printer to do a carriage return or signalling the keyboard to make a beep. Note that even in ASCII there is recognition of the need for more than just displayable characters. There is a need to control the look of the text and to communicate this look meaningfully and unambiguously.
32 to 127    The usual alphabetic and punctuation characters
128 to 255   Some common foreign language characters and various graphic characters
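The bit-counting argument above can be checked with a few lines of Java. This is only a minimal sketch: the class name BitCombinations and the method combinations are made up for illustration and do not come from these notes.

```java
// Sketch: how many distinct codes can a given number of bits represent?
// The class and method names are illustrative assumptions.
class BitCombinations {

    // n bits allow 2 to the power n combinations (valid for 0 <= bits <= 30).
    static int combinations(int bits) {
        return 1 << bits;
    }

    public static void main(String[] args) {
        // 26 lower case + 26 upper case + 10 digits + space + full stop
        int needed = 26 + 26 + 10 + 2;
        System.out.println("Characters needed: " + needed);

        for (int bits = 1; bits <= 8; bits++) {
            System.out.println(bits + " bits allow "
                    + combinations(bits) + " combinations");
        }
    }
}
```

Running it shows that 6 bits are exactly enough for the 64 characters counted above, while 8 bits give the 256 codes of extended ASCII.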
2. EBCDIC

EBCDIC (Extended Binary Coded Decimal Interchange Code) has the same purpose as ASCII and is often used in IBM mainframes. The main difference between the two codes is the grouping of characters. For example:

Character   ASCII       EBCDIC
0           0011 0000   1111 0000
1           0011 0001   1111 0001
2           0011 0010   1111 0010
3           0011 0011   1111 0011
4           0011 0100   1111 0100
5           0011 0101   1111 0101
6           0011 0110   1111 0110
7           0011 0111   1111 0111
8           0011 1000   1111 1000
9           0011 1001   1111 1001
The EBCDIC format is referred to as zoned decimal. Each character is stored in 8 bits: 4 zone bits followed by 4 numeric bits.
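The split into zone bits and numeric bits can be illustrated in Java. This is a rough sketch that uses only the digit codes from the table above (zone 0011 for ASCII, zone 1111 for EBCDIC); the class ZonedDecimal and its methods are hypothetical names, and no real EBCDIC conversion library is involved.

```java
// Sketch: splitting an 8-bit digit code into its zone and numeric halves.
// Names are illustrative; digit codes follow the table in these notes.
class ZonedDecimal {

    // The upper 4 bits of the code (the zone bits).
    static int zone(int code) {
        return (code >> 4) & 0x0F;
    }

    // The lower 4 bits of the code (the numeric bits).
    static int numeric(int code) {
        return code & 0x0F;
    }

    // Format a 4-bit value as exactly four binary digits.
    static String nibble(int value) {
        String s = Integer.toBinaryString(value & 0x0F);
        while (s.length() < 4) {
            s = "0" + s;
        }
        return s;
    }

    // Digit codes as listed in the table: zone 0011 (ASCII), zone 1111 (EBCDIC).
    static int asciiDigit(int digit) {
        return 0x30 | digit;
    }

    static int ebcdicDigit(int digit) {
        return 0xF0 | digit;
    }

    public static void main(String[] args) {
        for (int d = 0; d <= 9; d++) {
            int a = asciiDigit(d);
            int e = ebcdicDigit(d);
            System.out.println(d + "  ASCII " + nibble(zone(a)) + " " + nibble(numeric(a))
                    + "  EBCDIC " + nibble(zone(e)) + " " + nibble(numeric(e)));
        }
    }
}
```

In both codes the numeric bits are simply the digit's value; only the zone bits differ.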
3.
And on to Unicode
Its primary goal is to provide an unambiguous encoding of the content of plain text, ultimately covering all languages in the world. Currently in its third major version, Unicode contains a large number of characters covering most of the currently used scripts in the world. It also contains additional characters for interoperability with older character encodings, and characters with control-like functions included primarily for reasons of providing unambiguous interpretation of plain text. Unicode provides specifications for use of all of these characters. Unicode attempts to incorporate as many known language systems as possible (including Braille) and make provision for the foreseeable future, taking into account that computer systems are notorious for underestimating growth (remember 1 Megabyte?).
3.1
Aims
There are inherent problems associated with the number of available characters and the ease of access you have to them. One of the aims has been to develop a system which is as compact and adaptable as possible.

One possible approach was to convert all written text into a graphic format. After all, many Japanese businessmen prefer to fax hand-written documents rather than spend time with slow-to-produce E-mails. Many Arabic texts are sent as graphics because not many platforms support the codes for their text characters. So why not graphical representation?

3.1.1 Graphics take up space (and therefore bandwidth), even in compressed form.

3.1.2 Graphics cannot be re-formatted: what you see is the absolute restriction on what you get. By comparison, word processing software depends on text in a particular format that is adaptable and easy to deal with. This is the huge issue of standardisation. It has far-reaching effects and implications. Unicode strives to pull out the essential forms of text throughout a variety of systems.

3.1.3 Legibility: the graphic format is by definition non-standard. You need human intelligence to read some people's handwriting, and this applies to every written language, not just English.

3.1.4 And here's one you won't have thought of: how do you get the computer to read the text aloud? (Wake up, it's going to happen.) Well, of course, you'll have to apply OCR to it and convert it to ... and guess what, you've just come full circle to the conclusion that you need an international standardised code.
3.2
Unicode format
Unicode gives each character a unique number, just as ASCII did, but it also defines a name for each character and some additional properties. The objective is also quite modern in comparison to ASCII (the year is now 2003 for anyone laughing at this in years to come while I sit in retirement on some beach.... ahh....). All identified characters from the world's major languages should be capable of being communicated via E-mail. The variety and number of international characters is huge and tends to suffer from repetition: characters, or parts of characters, reappear in different languages but may have vastly different significance and meaning.

Unicode draws a distinction between characters and glyphs. Pure text is separated from how it looks. A glyph is the shape of a character after it has been displayed. Take the example of fonts. A particular font may be made up of a number of glyphs which adopt a particular style of display and which may cover all or part of the possible character set. The variety and shape of the glyphs which make up a font are not part of Unicode. It deals exclusively with characters, taken to be the smallest component of written language that has a semantic value.

The unique name of each character can be something such as "LATIN CAPITAL LETTER A" or "CJK UNIFIED IDEOGRAPH-9AA8". In the 16-bit form described in these notes, a single 16-bit number is assigned to each character (later versions of the standard also define characters beyond this range). Code values are referred to in hexadecimal form. The code value U+0041 (decimal number 65) represents the character A in Unicode.
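The link between a character, its number and its name can be seen from Java. This is a small sketch that assumes a modern Java (version 7 or later, for Character.getName); the class CharacterInfo and the helper codePointLabel are illustrative names only.

```java
// Sketch: looking up the number and the Unicode name of a character.
// Assumes Java 7 or later; class and helper names are illustrative.
class CharacterInfo {

    // Format a code value in the usual U+XXXX hexadecimal notation.
    static String codePointLabel(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        char c = 'A';
        System.out.println("Character:  " + c);
        System.out.println("Decimal:    " + (int) c);
        System.out.println("Code value: " + codePointLabel(c));
        System.out.println("Name:       " + Character.getName(c));
    }
}
```

For the character A this reports decimal 65, code value U+0041 and the name LATIN CAPITAL LETTER A, matching the values quoted above.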
Numbers:

Character        ASCII       Unicode
0                0011 0000   0000 0000 0011 0000
1                0011 0001   0000 0000 0011 0001

Capitals (upper case characters):

Character        ASCII       Unicode
A (65d or 41h)   0100 0001   0000 0000 0100 0001

Lower case characters:

Character        ASCII       Unicode
a (97d or 61h)   0110 0001   0000 0000 0110 0001
If you condense each group of four binary digits of the Unicode value into one hexadecimal digit, you get the Unicode value in hexadecimal:

A = 0000 0000 0100 0001b = 0 0 4 1 h = U+0041

The Unicode standard defines other normative properties such as:

character type: the character set to which the character belongs,
combining class: a numeric value attached to a character which indicates the set of other characters it can combine with to produce language variations (e.g. accents),
bi-directional behaviour: some languages read from right to left (Arabic, Hebrew).

Unicode also provides additional information such as:

alias names: the names by which the character may appear in different languages,
compatibility mapping: the degree of congruence with pre-existing standards,
casing partner: some languages have variants which differ significantly in shape and size,
sample glyph: the written visual look of the character,

and finally usage notes for the characters.

Unicode distinguishes characters by script, but not by language. U+0041 LATIN CAPITAL LETTER A ('A') is used for an English as well as a French capital A. This makes it harder to tell which language the text is in, but most other problems are simplified.
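Some of these ideas can be explored with the standard java.lang.Character class. The sketch below is illustrative only: the class CharacterProperties and its helper methods are made-up names, and the Java methods shown expose just a few of the Unicode properties listed above (type, casing partner and bi-directional behaviour).

```java
// Sketch: the 16-bit pattern of a character and a few of its properties.
// Class and helper names are illustrative assumptions.
class CharacterProperties {

    // Write the 16 bits of a char in groups of four, e.g. 0000 0000 0100 0001.
    static String toBinary16(char c) {
        StringBuilder sb = new StringBuilder();
        for (int bit = 15; bit >= 0; bit--) {
            sb.append((c >> bit) & 1);
            if (bit % 4 == 0 && bit != 0) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    // True for characters whose script runs from right to left.
    static boolean isRightToLeft(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    public static void main(String[] args) {
        char a = 'A';
        System.out.println(toBinary16(a) + "b = " + String.format("%04X", (int) a) + "h");
        System.out.println("Upper case letter: "
                + (Character.getType(a) == Character.UPPERCASE_LETTER));
        System.out.println("Casing partner:    " + Character.toLowerCase(a));
        System.out.println("Right to left:     " + isRightToLeft(a));

        char alef = '\u05D0'; // Hebrew letter alef
        System.out.println("Hebrew alef right to left: " + isRightToLeft(alef));
    }
}
```

Each group of four bits printed by toBinary16 corresponds to one hexadecimal digit of the code value.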
3.3
What is a Mark-up language?

Originally, computer generated texts contained control codes or macros that enabled formatting in a specific way. A generic type of coding appeared in the late 1960s, using descriptive tags ("heading" rather than a specific numerical code like "format-17"). Charles Goldfarb, Edward Mosher and Raymond Lorie of IBM invented the Generalised Mark-up Language (GML) in 1969. The system was intended for IBM mainframes managing major publishing projects.

It is interesting to note Goldfarb's comment on the production of documents before he ever got involved with mark-up language. It does much to point to the aims of the original Mark-up Language:

"In 1966 I was an attorney practising in Boston, MA, two years out of Harvard Law School. I knew nothing about computers, but I knew there had to be a better way to produce documents than dictating them, reviewing a draft, marking up the draft with corrections, reviewing the retyped draft, and then, in frustration, seeing that the typist had introduced more errors while making the corrections."

Just as there was an increasing need then to provide for accurate electronic storage of texts, there is now an increasing international need for the transmission of texts. Have you noticed how most word processors now provide HTML views as standard and let you publish a document as HTML? HTML embeds tags in the document which are interpreted by an Internet browser and used to format and position the text on the screen. Mark-up language provides similar features to the format characters in Unicode, and it is inevitable that the development of Unicode takes account of the existence of HTML and XML and their objectives in the communication of text.
4.
In Java, a Unicode character (type char) occupies 16 bits, so there is room for up to 65536 different characters. The numbering system runs from \u0000 to \uFFFF, i.e. it is numbered in hexadecimal.

Example:

    char c = 'E';

or

    char c = '\u0045';
Some sample character sets:

0000 to 001F   Control characters (same as ASCII)
0020 to 007F   Basic Latin characters (for all European languages, same as ASCII)
0080 to 00FF   Latin supplement characters (same as ISO 8859-1, Latin-1)
0370 to 03FF   Greek
0600 to 06FF   Arabic
0F00 to 0FBF   Tibetan

Unfortunately Java 1.1, as supplied with the Ready-to-program IDE, does not recognise characters beyond \u007F (equivalent to 127d).

Sample program:

    // The "Unicode" class. By M Brock, 23 February 2002
    public class Unicode
    {
        public static void main (String [] args)
        {
            char c = 69;                     // decimal 69 is the code for 'E'
            System.out.println (c);
            System.out.println ('\u0045');   // the same character, written as a Unicode escape
        } // main method
    } // Unicode class

This prints:

    E
    E
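The ranges in the table above correspond to what Java calls Unicode blocks, and Character.UnicodeBlock.of reports the block a character falls in. The following is a minimal sketch assuming a modern Java; the class name BlockLookup and the helper blockOf are illustrative. It prints only code values and block names, not the characters themselves, since a console may not be able to display them.

```java
// Sketch: finding which range (block) of Unicode a character belongs to.
// Class and helper names are illustrative assumptions.
class BlockLookup {

    // Return the Unicode block containing the given character.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        // One sample character from each range in the table above.
        char[] samples = { 'E', '\u00E9', '\u03A9', '\u0627', '\u0F40' };
        for (char c : samples) {
            System.out.println(String.format("U+%04X", (int) c) + "  " + blockOf(c));
        }
    }
}
```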
Note the following special characters in Java:

\t    tab
\n    new line
\'    single quote
\"    double quote
\\    backslash
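Each of these escape sequences stands for a single character with its own code. The short sketch below prints the code of each one; the class EscapeDemo and the helper codeOf are made-up names for illustration.

```java
// Sketch: each Java escape sequence is one character with its own code.
// Class and helper names are illustrative assumptions.
class EscapeDemo {

    // The numeric code of a character.
    static int codeOf(char c) {
        return (int) c;
    }

    public static void main(String[] args) {
        System.out.println("tab          " + codeOf('\t'));
        System.out.println("new line     " + codeOf('\n'));
        System.out.println("single quote " + codeOf('\''));
        System.out.println("double quote " + codeOf('"'));
        System.out.println("backslash    " + codeOf('\\'));

        // Escapes used inside a string: a tab, quotes and a backslash.
        System.out.println("Name:\t\"Unicode\" \\ \'ASCII\'");
    }
}
```

The tab and new line codes (9 and 10) fall in the control range 0 to 31 described in section 1.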
Exercise:
Write a looped program to return the ASCII value of a character, the Unicode value, and the actual character, using the following tabular layout:

ASCII   Unicode   Character
65      \u0041    A
66      \u0042    B
67      \u0043    C
etc.
________________________________________________
mb