Text in HTML...

ISO 8859 character sets

ISO 8859 is a set of 10 different 256-character sets used to represent Western alphabetic languages. It does not address Far East languages at all. These sets were designed by the standards group ECMA (European Computer Manufacturers Association) and are included in the Internet charset registry for use with MIME identification.

Why is ISO 8859 important? The ISO 8859-1 character set (also called ISO Latin-1) is the one used for HTTP (the transport protocol for web documents) and is also used in the creation of HTML documents. It contains the characters needed to type all major West European languages and is the preferred encoding on the Internet. The following languages are supported under the ISO 8859-1 character set:


Afrikaans	Faeroese	Irish
Basque	Finnish	Italian
Catalan	French	Norwegian
Danish	Galician	Portuguese
Dutch	German	Spanish
English	Icelandic	Swedish

ISO 8859-1 Characters - To Encode or Not To Encode?

It is acceptable to leave all ISO 8859-1 characters as unencoded character octets, but there is no guarantee that every destination system will understand all of the characters. To increase the portability of the entire character set, HTML additionally offers alternative representations of all ISO 8859-1 characters as character entities. A special syntax represents these character entities using either a numeric reference (&#number;) or a shorthand mnemonic name (&entityname;). These 'safe' entities are built from characters in the US ASCII (ISO 646) character set. Interestingly enough, the first half of the ISO 8859-1 set (character positions 000-127) is identical to the US ASCII set. In fact, ASCII is always the 0-127 subset of ALL ISO 8859 character sets. If safe character entity references are created from a safe portion of the ISO 8859-1 set, which characters in the set need to be encoded, and which format should be used?
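Before turning to that question, the two claims above can be checked with a small sketch. It uses Python's standard codecs; the helper name `to_ascii_safe` is made up for illustration, and the `xmlcharrefreplace` error handler is simply one convenient way to produce numeric references, not something the HTML specifications prescribe.

```python
# Positions 0-127 of ISO 8859-1 are the US ASCII set, so both codecs
# must decode the same low-half bytes to the same text.
low_half = bytes(range(128))
assert low_half.decode("ascii") == low_half.decode("latin-1")


def to_ascii_safe(text: str) -> bytes:
    """Replace every non-ASCII character with a numeric reference (&#number;).

    Hypothetical helper: the result contains only 7-bit ASCII octets.
    """
    return text.encode("ascii", errors="xmlcharrefreplace")


print(to_ascii_safe("Señor Müller"))  # b'Se&#241;or M&#252;ller'
```

The output contains only ASCII octets, which is why such references survive systems that mishandle the upper half of the set.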

The ISO-8859-1 Character Set Positions
	000-031 | 032-064 | 065-096 | 097-126
	127-159 | 160-191 | 192-223 | 224-255

Positions 000-031 and 127-159:

The characters in the first range are non-printing characters in the HTML context and are not of any real interest to the discussion of HTML. The second range is earmarked for extended control characters and is not used for encoding characters in HTML. The reason for this is to maintain interoperability with 7-bit devices and with faulty software that strips the 8th bit. Some operating systems may use this special range for text characters, but this cannot be relied upon.

Positions 032-064:

Includes common English punctuation and the digits 0-9. This range does not need to be encoded, except for the four reserved HTML characters (quotation mark, ampersand, less-than and greater-than).

Positions 065-126:

Includes uppercase and lowercase letters (A-Z and a-z) as well as common English punctuation. These characters do not need to be encoded.

Positions 160-191:

These represent special symbols. It is always safest to encode this range as character entities (numbered or named) to ensure better portability. This range has only recently gained Named Entity support for most of its characters, so using Numbered Entities is recommended.

Positions 192-255:

These represent special upper and lower case accented national characters. It is always safest to encode this range as character entities (numbered or named) to ensure better portability. The HTML specifications suggest encoding this range as named entities.
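The range-by-range advice above can be summarized as a minimal sketch. The function name `encode_latin1_for_html` and the choice to reject control positions with an exception are assumptions made for illustration. The entity names come from Python's `html.entities.codepoint2name` table, which reflects the HTML 4 entity set rather than the HTML 2.0 table discussed here.

```python
from html.entities import codepoint2name

# The four reserved HTML characters, written as named entities.
RESERVED = {'"': "&quot;", "&": "&amp;", "<": "&lt;", ">": "&gt;"}


def encode_latin1_for_html(text: str) -> str:
    """Apply the per-range recommendations to ISO 8859-1 text (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in RESERVED:
            out.append(RESERVED[ch])              # reserved characters: named entity
        elif 32 <= cp <= 126 or ch in "\t\n\r":
            out.append(ch)                        # printable ASCII and whitespace: leave as-is
        elif 160 <= cp <= 191:
            out.append(f"&#{cp};")                # special symbols: numbered entity
        elif 192 <= cp <= 255:
            out.append(f"&{codepoint2name[cp]};") # accented letters: named entity
        else:
            # Other control positions (0-31, 127-159) and anything above 255.
            raise ValueError(f"position {cp} is not usable in ISO 8859-1 HTML text")
    return "".join(out)


print(encode_latin1_for_html("© 2 < 3 café"))  # &#169; 2 &lt; 3 caf&eacute;
```

Tab, line feed and carriage return are passed through because they are ordinary whitespace in HTML source, even though they sit in the 000-031 range.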

Character Entity Formats - Pros and Cons

Using &entityname;

Pros:

The mnemonic words are much easier to remember than the numbers.
Official support has existed for accented Latin characters (192-255 character range) since at least HTML 2.0. The standards actually recommend using the named entities for this range over the numbered entities.
Browser support for accented Latin characters is also very widespread.

Cons:

Not all browsers support the newer entity names (160-191 range).

Using &#number;

Pros:

Support is excellent in almost all browsers. It is hard to go wrong using them.
The entire ISO 8859-1 character set range has almost always been addressable using this method.

Cons:

The numbers are harder to remember.
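To make the two formats concrete, the following sketch prints both spellings of a character and confirms that they decode to the same thing. The helper `both_forms` is hypothetical, and the name lookup again relies on Python's `html.entities` table (HTML 4 names), used here only as a convenient stand-in for the entity tables in the HTML specifications.

```python
import html
from html.entities import codepoint2name


def both_forms(ch: str):
    """Return (named entity or None, numbered entity) for one character."""
    cp = ord(ch)
    name = codepoint2name.get(cp)          # None when no mnemonic name exists
    named = f"&{name};" if name else None
    return named, f"&#{cp};"


for ch in "é©":
    named, numbered = both_forms(ch)
    # Both spellings decode back to the same character.
    assert html.unescape(named) == html.unescape(numbered) == ch
    print(ch, named, numbered)
```

The numbered form is simply the character position in decimal, which is why it works for the whole set, while the named form depends on the reader knowing the mnemonic.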

Special Character Cases:

HTML Reserved characters: (Less than, Greater than, Ampersand and Quotation mark)

Use character entity names

Newer commonly used entities:

Copyright and Registered Trademark (© and ®):
Most browsers now support the named entity versions, but it is probably a bit safer to use the numbers instead.
Non-breaking space (&nbsp;):
The named version is covered by most browsers now, but to be absolutely sure use the number instead.
Trademark:
Support is very limited, and this character is not actually in the ISO 8859-1 set. We recommend using regular text or the following markup to approximate the character: <sup><tt>TM</tt></sup>
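Two of the special cases above can be checked with a short sketch. It assumes Python's standard `html.escape` as the escaping routine; the helper names `escape_reserved` and `in_latin1` are made up for illustration.

```python
import html


def escape_reserved(text: str) -> str:
    """Escape the reserved HTML characters using entity names.

    Note: with quote=True, html.escape also rewrites the apostrophe
    as &#x27;, which goes beyond the four characters listed above.
    """
    return html.escape(text, quote=True)


def in_latin1(ch: str) -> bool:
    """Report whether a character has a position in ISO 8859-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(escape_reserved('AT&T "x < y"'))  # AT&amp;T &quot;x &lt; y&quot;
print(in_latin1("\u00a9"))              # True  (copyright sign, position 169)
print(in_latin1("\u2122"))              # False (trademark sign is outside the set)
```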

The Unicode Solution

There is a shift occurring in computer text representation. Traditionally, text is represented at its lowest level by a single byte (8 bits) per character. This allows for 256 possible distinct characters. In languages where the character set exceeds this range (such as Far East languages), two bytes are used to represent a single character. Many Far East languages use their own standard double-byte encodings - this compounds the problem and makes transporting characters and documents more difficult. This diversity of character sets can also lead to significant problems in the programmatic handling of character data.

The Unicode standard was developed to greatly reduce this fracturing of languages into conflicting character sets. Like the Far East double-byte encodings, Unicode uses 16 bits of data to represent its characters. With 16 bits (twice the size of a single byte), 65536 (256*256) distinct encodings are possible. All major character sets of the world (including Far East languages, symbols and dingbats) can be represented using a total of only about 35,000 of these code points. Even though not all possible code points are currently used in Unicode, there are many obscure characters and dead-language writing systems that are not included in the set. Including ALL known languages, variations and symbols would be a never-ending task. A mechanism does, however, exist to expand the number of possible code points in Unicode into the millions should the need arise.
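The arithmetic above, and one reading of the expansion mechanism, can be sketched as follows. The text does not name the mechanism; this example assumes it refers to the surrogate pairs of UTF-16, in which two reserved 16-bit values together address one code point beyond the first 65536. The function name `to_surrogate_pair` is made up for illustration.

```python
SIXTEEN_BIT_CODES = 256 * 256          # 65536 values fit in 16 bits
assert SIXTEEN_BIT_CODES == 2 ** 16

# With surrogate pairs the code space runs from 0 to 0x10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1       # 1,114,112 = 17 * 65536


def to_surrogate_pair(cp: int):
    """Split a code point above 0xFFFF into two 16-bit surrogate values."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = cp - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))             # 0xd83d 0xde00
```

Python's own UTF-16 codec produces the same two values for that code point, which is a useful cross-check on the arithmetic.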

Current software uses 7-bit or 8-bit encoding of characters, while Unicode uses 16 bits. What would happen if a current system read raw Unicode? The results could be quite nasty, so there is a workaround. Unicode can be translated into sequences of 7-bit or 8-bit units that allow many current and older systems to interchange or transparently pass these documents without loss of content. The most popular of these translation mechanisms is UTF-8 (UCS Transformation Format, 8-bit form). This format uses a variable number of single bytes to represent each Unicode code point.
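The variable-length behaviour of UTF-8 is easy to observe. The sketch below uses Python's built-in UTF-8 codec; the helper name `utf8_hex` is made up for illustration.

```python
def utf8_hex(ch: str) -> str:
    """Show the UTF-8 bytes of one character as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


print(utf8_hex("A"))       # 41        (ASCII: one byte, unchanged)
print(utf8_hex("\u00e9"))  # c3 a9     (e-acute, position 233 in ISO 8859-1: two bytes)
print(utf8_hex("\u20ac"))  # e2 82 ac  (euro sign, outside ISO 8859-1: three bytes)

# Pure ASCII text is byte-for-byte identical in UTF-8, which is what lets
# older 7-bit and 8-bit systems pass such documents through untouched.
assert "HTML".encode("utf-8") == "HTML".encode("ascii")
```

Note that characters from the upper half of ISO 8859-1 take two bytes in UTF-8, so an ISO 8859-1 file containing them is not already valid UTF-8.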

The number of operating systems and applications that understand Unicode character encoding is growing, and it is a proposed future solution to text encoding in HTML and on the Internet.

Related Sites

Official References
ftp://ds.internic.net/rfc/rfc1866.txt: RFC 1866: The HTML 2.0 specification (plain text.) Appendix contains Character Entity table.
http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_13.html: The web version of the HTML 2.0 (RFC 1866) Character Entity table
http://www.w3.org/pub/WWW/MarkUp/Wilbur/: The HTML 3.2 (Wilbur) draft
[This includes all character entities listed in HTML 2.0 plus new named entities covering the ISO 8859-1 160-191 range.]
http://www.w3.org/pub/WWW/MarkUp/Cougar/HTML.dtd: The experimental HTML (Cougar) draft
[Demonstrates some of the directions HTML is taking - including new character entities.]
http://www.w3.org/pub/WWW/International/O-HTML.html: The W3 HTML Internationalization area
http://unicode.org: The Unicode Consortium site
Other Related Links (These sites provided many of the topics and ideas for this page)
ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/FAQ-ISO-8859-1: Michael K. Gschwind's ISO 8859-1 National Character Set FAQ
http://www.uni-passau.de/~ramsch/iso8859-1.html: Excellent resource with good pointers on ISO-8859 issues
http://www.cs.tu-berlin.de/~czyborra/charsets: Roman Czyborra's excellent page with many ISO 8859 links and issues. Includes charts of all ISO 8859 character sets.
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html: Alan Flavell's excellent document of pointers to information about ISO-8859
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/character-faq.txt: Alan Flavell's brief FAQ document regarding ISO-8859 issues in HTML
http://www.bbsinc.com/iso8859.html: Kevin J Brewer's page with MANY links regarding character set issues.