Text in HTML...

ISO 8859 character sets
ISO 8859 is a set of 10 different 256-character sets used to represent Western alphabetic languages. It does not address Far East languages at all. These sets were designed by the standards group ECMA (European Computer Manufacturers Association) and are included in the Internet charset registry for use with MIME identification.

ISO 8859 matters because ISO 8859-1 (also called ISO Latin-1) is the character set used for HTTP (the transport protocol for web documents) and for the creation of HTML documents. It contains the characters needed to write all major West European languages and is the preferred encoding on the Internet. The following languages are supported under the ISO 8859-1 character set:

Afrikaans
Basque
Catalan
Danish
Dutch
English
Faeroese
Finnish
French
Galician
German
Icelandic
Irish
Italian
Norwegian
Portuguese
Spanish
Swedish

ISO 8859-1 Characters - To Encode or Not To Encode?
It is acceptable to leave all ISO 8859-1 characters as unencoded character octets, but there is no guarantee that every destination system will understand all of the characters. To make the entire character set more portable and viewable, HTML also offers alternative representations of all ISO 8859-1 characters as coded entities. A special syntax is used to represent these Character Entities using either a numeric reference or a shorthand mnemonic word.
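
As a rough illustration of the two entity forms, the sketch below builds both for a given character. It relies on Python's standard `html.entities.codepoint2name` table, which reflects the HTML 4 entity list and is therefore somewhat larger than the HTML 2.0 and 3.2 tables discussed on this page. The function name `entity_forms` is made up for this example.

```python
from html.entities import codepoint2name


def entity_forms(ch):
    """Return (numeric, named) entity references for a single character.

    The named form is None when the entity table has no mnemonic for it.
    """
    code = ord(ch)
    numeric = f"&#{code};"
    name = codepoint2name.get(code)
    named = f"&{name};" if name else None
    return numeric, named


# Show both forms for a few ISO 8859-1 characters.
for ch in "é©ß":
    print(repr(ch), entity_forms(ch))
```

Plain ASCII letters have no mnemonic in this table, so only the numeric form is available for them.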

These 'safe' entities are created using characters from the US ASCII (ISO 646) character set. Interestingly enough, the first half of the ISO 8859-1 set (character positions 000-127) is identical to the US ASCII set. In fact, ASCII is always the 0-127 character position subset in ALL ISO 8859 character sets. If safe character entity references are created using a safe portion of the ISO 8859-1 set, which characters in the ISO 8859-1 set need to be encoded, and which format should be used?
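
The shared ASCII half can be checked with a short sketch, assuming Python's bundled codecs (`iso8859_1` through `iso8859_10`) are faithful to the ten ISO 8859 parts described here. The variable names are illustrative only.

```python
# Bytes 0-127, decoded as plain ASCII for reference.
ascii_bytes = bytes(range(128))
expected = ascii_bytes.decode("ascii")

# Python codec names for parts 1 through 10 of ISO 8859.
parts = [f"iso8859_{n}" for n in range(1, 11)]

# True for a part when its lower half decodes exactly like ASCII.
lower_half_matches = {
    name: ascii_bytes.decode(name) == expected for name in parts
}
print(all(lower_half_matches.values()))

# The upper half is where the parts differ: one byte, two meanings.
byte = bytes([0xE9])
latin1_char = byte.decode("iso8859_1")
cyrillic_char = byte.decode("iso8859_5")
print(latin1_char, cyrillic_char)
```

This is why entity references written in ASCII survive a wrong guess about which ISO 8859 part a document uses, while raw octets above 127 do not.
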
The ISO-8859-1 Character Set Positions

Positions 000-031 and 127-159:
The characters in the first range are non-printing control characters in the HTML context and are not of any real interest to the discussion of HTML. The second range is earmarked for extended control characters and is not used for encoding characters in HTML. The reason for this is to maintain interoperability with 7-bit devices and with faulty software that strips the 8th bit. Some operating systems may use this range for text characters, but this cannot be relied upon.
Positions 032-064:
Includes common English punctuation and the digits 0-9. This range does not need to be encoded, except for the four reserved HTML characters (quotation mark, ampersand, less than and greater than).
Positions 065-126:
Includes uppercase and lowercase letters (A-Z and a-z) as well as common English punctuation. These characters do not need to be encoded.
Positions 160-191:
These represent special symbols. It is always safest to encode this range as character entities (numbered or named) to ensure better portability. This range has only recently gained Named Entity support for most of the characters so using Numbered Entities is recommended.
Positions 192-255:
These represent special upper and lower case accented national characters. It is always safest to encode this range as character entities (numbered or named) to ensure better portability. The HTML specifications suggest encoding this range as named entities.
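
The range-by-range advice above can be sketched as a small encoder. This is only one possible reading of the guidance: the function name `encode_latin1_for_html` is hypothetical, rejecting positions 127-159 outright is a choice made for this sketch, and the mnemonic names come from Python's `html.entities.codepoint2name` table (HTML 4), not from the HTML 2.0 table.

```python
from html.entities import codepoint2name

# The four reserved HTML characters from positions 032-064.
RESERVED = {'"': "&quot;", "&": "&amp;", "<": "&lt;", ">": "&gt;"}


def encode_latin1_for_html(text):
    """Encode ISO 8859-1 text following the position ranges described above."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in RESERVED:
            out.append(RESERVED[ch])
        elif code <= 126:
            # 000-031 controls and 032-126 printable ASCII pass through as-is.
            out.append(ch)
        elif code <= 159:
            # 127-159: not used for characters in HTML (rejected in this sketch).
            raise ValueError(f"position {code} is not usable in HTML")
        elif code <= 191:
            # 160-191: special symbols, numbered entities recommended.
            out.append(f"&#{code};")
        elif code <= 255:
            # 192-255: accented national characters, named entities suggested.
            out.append(f"&{codepoint2name[code]};")
        else:
            raise ValueError(f"position {code} is outside ISO 8859-1")
    return "".join(out)


print(encode_latin1_for_html('Café © 1996 <b>"Señor"</b>'))
```

The output of such an encoder consists only of characters in the 000-126 range, so it stays intact on 7-bit channels.
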
Character Entity Formats - Pros and Cons
Using &entityname;
Pros:
Cons:
Using &#number;
Pros:
Cons:
Special Character Cases:
HTML Reserved characters: (Less than, Greater than, Ampersand and Quotation mark)
Newer commonly used entities:

The Unicode Solution
There is a shift occurring in computer text representation. Traditionally, each character of text is represented at its lowest level by a single byte (8 bits) of data. This allows for 256 possible distinct characters. In languages where the entire character set exceeds this range (such as Far East languages), two bytes are used to represent a single character. Many Far East languages use their own standard double-byte encodings, which compounds the problem and makes transporting characters and documents more difficult. This diversity of character sets can also cause significant problems in the programmatic handling of character data.
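
To make the conflict concrete, the sketch below encodes one Japanese character with two double-byte encodings and then misreads the result with a third. The codecs chosen (`shift_jis`, `euc_jp`, `big5`) are simply examples that ship with Python; the page itself does not name specific encodings.

```python
ch = "日"

# Two common Japanese encodings: same character, different byte pairs.
sjis = ch.encode("shift_jis")
eucjp = ch.encode("euc_jp")
print(sjis.hex(), eucjp.hex())

# Reading Shift_JIS bytes as Big5 (a Chinese encoding) gives something else;
# errors="replace" avoids an exception if the byte pair is unassigned.
misread = sjis.decode("big5", errors="replace")
print(repr(misread))
```

Without out-of-band knowledge of which encoding produced the bytes, a receiving program has no reliable way to recover the original character.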

The Unicode standard was developed to greatly reduce this fracturing of languages into conflicting character sets. Like the Far East double-byte encodings, Unicode uses 16 bits of data to represent each character. With 16 bits (twice the size of a single byte), 65,536 (256*256) distinct encodings are possible. All major character sets of the world (including Far East languages, symbols and dingbats) can be represented using only about 35,000 of these code points. Even though not all possible code points are currently used in Unicode, many obscure characters and dead-language writing systems are not included in the set. Including ALL known languages, variations and symbols would be a never-ending task. A mechanism does exist, however, to expand the number of possible code points in Unicode into the millions should the need arise.
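
The arithmetic can be sketched as follows. The page does not name the expansion mechanism; this example assumes it refers to surrogate pairs, where two reserved 16-bit values together address one additional code point. The helper `to_surrogates` is hypothetical.

```python
# Code points addressable with a single 16-bit value.
SINGLE_UNIT = 2 ** 16  # 256 * 256

# Assumed mechanism: surrogate pairs, 1024 high values times 1024 low values.
HIGH = range(0xD800, 0xDC00)
LOW = range(0xDC00, 0xE000)
EXTRA = len(HIGH) * len(LOW)


def to_surrogates(code_point):
    """Split a code point at or above 0x10000 into a (high, low) pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(SINGLE_UNIT, EXTRA)
print([hex(unit) for unit in to_surrogates(0x1F600)])
```

Under this assumption the pairs add a little over a million code points on top of the original 65,536.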

Current software uses 7-bit or 8-bit encoding of characters, while Unicode uses 16 bits. A current system that reads raw Unicode could misinterpret it badly, so there is a workaround: Unicode can be translated into sequences of 7-bit or 8-bit encodings that allow many current and old systems to interchange or transparently pass these documents without loss of content. The most popular version of this translation mechanism is UTF-8 (Universal Character Set Transformation Format, 8-bit form). This format uses a variable number of single bytes to represent each Unicode code point.
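
A short sketch of the variable-length behaviour, using Python's built-in `utf-8` codec. The helper name `utf8_bytes` and the sample characters are chosen only for illustration.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of one character as a list of byte values."""
    return list(ch.encode("utf-8"))


for ch in ["A", "é", "日"]:
    encoded = utf8_bytes(ch)
    hex_form = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {hex_form}")
```

ASCII characters keep their single-byte values, while every byte of a multi-byte sequence has its 8th bit set, so non-ASCII characters cannot be mistaken for ASCII ones.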

The number of operating systems and applications that understand Unicode character encoding is growing, and it is a proposed future solution to text encoding in HTML and on the Internet.


Related Sites
Official References
ftp://ds.internic.net/rfc/rfc1866.txt
RFC 1866: The HTML 2.0 specification (plain text). The appendix contains the Character Entity table.
http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_13.html
The web version of the HTML 2.0 (RFC 1866) Character Entity table
http://www.w3.org/pub/WWW/MarkUp/Wilbur/
The HTML 3.2 (Wilbur) draft
[This includes all character entities listed in HTML 2.0 plus new named entities covering the ISO 8859-1 160-191 range.]
http://www.w3.org/pub/WWW/MarkUp/Cougar/HTML.dtd
The experimental HTML (Cougar) draft
[Demonstrates some of the directions HTML is taking - including new character entities.]
http://www.w3.org/pub/WWW/International/O-HTML.html
The W3 HTML Internationalization area
http://unicode.org
The Unicode Consortium site

Other Related Links
(These sites provided many of the topics and ideas for this page)
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/FAQ-ISO-8859-1
Michael K. Gschwind's ISO 8859-1 National Character Set FAQ
http://www.uni-passau.de/~ramsch/iso8859-1.html
Excellent resource with good pointers on ISO-8859 issues
http://www.cs.tu-berlin.de/~czyborra/charsets
Roman Czyborra's excellent page with many ISO 8859 links and issues. Includes charts of all ISO 8859 character sets.
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html
Alan Flavell's excellent document of pointers to information about ISO-8859
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/character-faq.txt
Alan Flavell's brief FAQ document regarding ISO-8859 issues in HTML
http://www.bbsinc.com/iso8859.html
Kevin J Brewer's page with MANY links regarding character set issues.

