Text in HTML...
- ISO 8859 character sets
- ISO 8859 is a set of 10 different 256-character sets used to
represent Western alphabetic languages. It does not address Far
East languages at all. These sets were designed by the standards
group ECMA (European Computer Manufacturers Association)
and are included in the Internet charset register for use with
MIME identification.
Why is ISO 8859 important? Because ISO 8859-1 (also called
ISO Latin-1) is the character set used for HTTP (the transport
protocol for web documents) and is also used in the creation of HTML
documents. This character set contains the characters needed to write
all major West European languages and is also the preferred encoding
on the Internet. The following languages are supported under the
ISO 8859-1 character set:
Afrikaans    Faeroese     Irish
Basque       Finnish      Italian
Catalan      French       Norwegian
Danish       Galician     Portuguese
Dutch        German       Spanish
English      Icelandic    Swedish
- ISO 8859-1 Characters - To Encode or Not To Encode?
- It is acceptable to leave all ISO 8859-1 characters as unencoded
character octets, but there can be no guarantee that all destination
systems will understand all of the characters. In order to increase
portability/viewability of the entire character set, the HTML language
additionally offers alternative representations of all ISO 8859-1
characters using coded entity representations. A special syntax
is used to represent these Character
Entities using either a number reference or a shorthand mnemonic
word.
These 'safe' entities are created using characters from the US ASCII (ISO 646)
character set. Interestingly enough, the first half of the ISO 8859-1
set (character positions 000-127) is identical to the US ASCII set. In
fact, ASCII is always the 0-127 character position subset of ALL
ISO 8859 character sets. If safe character entity references are
created using this safe portion of the ISO 8859-1 set, which
characters in the ISO 8859-1 set need to be encoded, and which format
should be used?
- Positions 000-031 and 127-159:
- The characters in the first range are non-printing characters in
the HTML context and are not of any real interest to the discussion
of HTML. The second range is earmarked for extended control
characters, and is not used for encoding characters in HTML. The
reason for this is to maintain interoperability with 7-bit devices
and with faulty software that strips the 8th bit. Some operating
systems may use this special range for text characters, but this
cannot be relied upon.
- Positions 032-064:
- Includes common English punctuation and the digits 0-9.
This range does not need to be encoded except for the four
reserved HTML characters (quotation mark, ampersand, less than
and greater than).
- Positions 065-126:
- Includes uppercase and lowercase letters (A-Z and a-z) as well as
common English punctuation. These characters do not need to
be encoded.
- Positions 160-191:
- These represent special symbols. It is always safest to encode
this range as character entities (numbered or named) to ensure
better portability. This range has only recently gained Named
Entity support for most of the characters so using Numbered
Entities is recommended.
- Positions 192-255:
- These represent special upper and lower case accented national
characters. It is always safest to encode this range as character
entities (numbered or named) to ensure better portability. The HTML
specifications suggest encoding this range as named entities.
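The position ranges above can be sketched as a small lookup. This is only
an illustration, not part of any HTML specification: the names `classify`
and `encode_numbered` are made up for this example, and for simplicity it
writes every encoded character in the numbered form, including the four
reserved characters (for which the named form is usually recommended).

```python
def classify(position: int) -> str:
    """Return how the ranges above treat one ISO 8859-1 character position."""
    if not 0 <= position <= 255:
        raise ValueError("not an ISO 8859-1 position")
    if position <= 31 or 127 <= position <= 159:
        return "control"    # control ranges, not used for characters in HTML
    if chr(position) in '"&<>':
        return "reserved"   # the four reserved HTML characters
    if position <= 126:
        return "plain"      # 032-126: safe to leave unencoded
    return "encode"         # 160-255: safest written as an entity


def encode_numbered(text: str) -> str:
    """Rewrite reserved and 160-255 characters as &#number; references."""
    out = []
    for ch in text:
        if classify(ord(ch)) in ("reserved", "encode"):
            out.append(f"&#{ord(ch)};")
        else:
            out.append(ch)  # plain and control characters pass through
    return "".join(out)


print(encode_numbered("Café & crème © 1997"))
# prints: Caf&#233; &#38; cr&#232;me &#169; 1997
```

The result contains only US ASCII characters, so it should survive
systems that do not understand the upper half of ISO 8859-1. Characters
outside the 0-255 range raise an error in this sketch.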
- Character Entity Formats - Pros and Cons
- Using &entityname;
- Pros:
- The mnemonic words are much easier to remember than the numbers.
- Official support has existed for accented Latin characters
(192-255 character range) since at least HTML 2.0. The
standards actually recommend using the named entities for this
range over the numbered entities.
- Browser support for accented Latin characters is also
very widespread.
- Cons:
- Not all browsers may support the newer entity names (160-191
range.)
- Using &#number;
- Pros:
- Support is excellent in almost all browsers. It is hard to
go wrong using them.
- The entire ISO 8859-1 character set range has almost always
been addressable using this method.
- Cons:
- The numbers are harder to remember.
- Special Character Cases:
- HTML Reserved characters:
(Less than, Greater than, Ampersand and Quotation mark)
- Use character entity names
- Newer commonly used entities:
- Copyright and Registered Trademark
(© and ®):
Most browsers now support the named entity versions, but it is
probably a bit safer to use the numbers instead.
- Non-breaking space (&nbsp;):
The named version is covered by most browsers now, but to be
absolutely sure use the number (&#160;) instead.
- Trademark:
Support is very limited,
and this character is not actually IN the ISO 8859-1 set.
Use regular text or the following markup to approximate the
character: <sup><tt>TM</tt></sup>
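The recommendations in this section can be combined into one rough
encoder: named entities for the reserved characters and the accented
192-255 range, numbered entities for the 160-191 symbols, and the
superscript fallback for the trademark sign. The function name
`encode_mixed` and the `RESERVED` table are invented for this sketch;
`codepoint2name` is the entity table shipped in Python's standard
`html.entities` module.

```python
from html.entities import codepoint2name

# The four reserved HTML characters and their entity names.
RESERVED = {'"': "&quot;", "&": "&amp;", "<": "&lt;", ">": "&gt;"}


def encode_mixed(text: str) -> str:
    """Encode text following the per-range advice given above."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in RESERVED:
            out.append(RESERVED[ch])                 # reserved: named
        elif 192 <= code <= 255 and code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # accented: named
        elif 160 <= code <= 191:
            out.append(f"&#{code};")                 # symbols: numbered
        elif ch == "\u2122":
            # The trademark sign is not in ISO 8859-1; use the fallback.
            out.append("<sup><tt>TM</tt></sup>")
        else:
            out.append(ch)
    return "".join(out)


print(encode_mixed('<b>Señor</b> © 1997 Acme™'))
# prints: &lt;b&gt;Se&ntilde;or&lt;/b&gt; &#169; 1997 Acme<sup><tt>TM</tt></sup>
```

Choosing numbers for 160-191 and names for 192-255 simply mirrors the
advice above; a page aimed only at newer browsers could use names
throughout.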
- The Unicode Solution
- There is a shift occurring in computer text representation.
Traditionally, each character of text is represented by a single byte
(8 bits) of data at its lowest level. This allows for 256
possible distinct characters. In languages where the entire
character set exceeds this range (such as Far East languages),
two bytes are used to represent a single character. Many Far
East languages use their own standard sets of double-byte encodings,
which compounds the problem and makes transporting characters and
documents more difficult still. This diversity of character sets can
also lead to significant problems in the programmatic handling of
character data.
The Unicode standard was developed to greatly reduce this fracturing
of languages into conflicting character sets. Like the Far East
encodings, Unicode uses 16 bits of data to represent its characters.
With 16 bits (twice the size of a single 8-bit byte), 65536 (256*256)
distinct encodings are possible. All major character sets of
the world (including Far East languages, symbols and dingbats) can be
represented using a total of only about 35,000 of these character code
points in the Unicode set. Even though not all of the possible code
points are currently used in Unicode, many obscure characters and
dead-language writing systems are not included in the set. Including
ALL known languages, variations and symbols would be a never-ending task.
A mechanism does, however, exist to expand the number of possible code
points in Unicode into the millions in case of such a need.
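The arithmetic above is easy to check, and the expansion mechanism can be
illustrated too. The sketch below assumes the mechanism meant is the
surrogate pair used by today's UTF-16 encoding, where a character beyond
the first 65536 code points is written as two 16-bit units; the helper
name `utf16_units` is invented for this example.

```python
# 16 bits give 256 * 256 distinct values.
assert 2 ** 16 == 256 * 256 == 65536


def utf16_units(ch: str) -> list:
    """Return the 16-bit units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


print([hex(u) for u in utf16_units("A")])           # ['0x41']
print([hex(u) for u in utf16_units("\U0001F600")])  # ['0xd83d', '0xde00']

# Each half of a surrogate pair carries 10 bits of the code point.
print(1024 * 1024)  # 1048576 code points reachable through pairs
```

A character inside the original 16-bit range needs one unit, while one
outside it needs two, so existing 16-bit text stays valid.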
Current software uses 7 or 8 bit encoding of characters. Unicode uses 16
bits. What would happen if a current system read raw Unicode? The result
could be quite nasty, so there is a workaround. Unicode can be transformed
into sequences of 7-bit or 8-bit values that allow many current and old
systems to interchange or transparently pass these documents without loss
of content. The most popular of these transformation formats is UTF-8
(UCS Transformation Format, 8-bit form). This format uses a
variable-length sequence of one or more bytes to represent each Unicode
character code point.
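The variable-length behavior of UTF-8 can be seen by encoding a few
characters and counting the bytes. This is a minimal sketch using
Python's built-in `str.encode`; the helper name `utf8_hex` and the sample
characters are chosen only for illustration.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


for ch in ["A", "\u00e9", "\u20ac", "\u4e2d"]:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
# prints:
# U+0041 -> 41
# U+00E9 -> c3 a9
# U+20AC -> e2 82 ac
# U+4E2D -> e4 b8 ad

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert "plain text".encode("utf-8") == "plain text".encode("ascii")

# An accented letter is one byte in ISO 8859-1 but two bytes in UTF-8.
assert "\u00e9".encode("latin-1") == b"\xe9"
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"
```

Because ASCII characters keep their single-byte form, systems that only
understand ASCII can still pass UTF-8 documents through unchanged, while
characters from the upper half of ISO 8859-1 take two bytes each.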
The number of operating systems and applications that understand Unicode
character encoding is growing, and it is a proposed future solution to
text encoding in HTML and on the Internet.
Related Sites
- Official References
- ftp://ds.internic.net/rfc/rfc1866.txt
- RFC 1866: The HTML 2.0 specification (plain text.) Appendix contains Character Entity table.
- http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_13.html
- The web version of the HTML 2.0 (RFC 1866) Character Entity table
- http://www.w3.org/pub/WWW/MarkUp/Wilbur/
- The HTML 3.2 (Wilbur) draft
[This includes all character entities listed in HTML 2.0 plus new named
entities covering the ISO 8859-1 160-191 range.]
- http://www.w3.org/pub/WWW/MarkUp/Cougar/HTML.dtd
- The experimental HTML (Cougar) draft
[Demonstrates some of the directions HTML is taking - including new
character entities.]
- http://www.w3.org/pub/WWW/International/O-HTML.html
- The W3 HTML Internationalization area
- http://unicode.org
- The Unicode Consortium site
- Other Related Links
(These sites provided many of the topics and ideas for this page)
- ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/FAQ-ISO-8859-1
- Michael K. Gschwind's ISO 8859-1 National Character Set FAQ
- http://www.uni-passau.de/~ramsch/iso8859-1.html
- Excellent resource with good pointers on ISO-8859 issues
- http://www.cs.tu-berlin.de/~czyborra/charsets
- Roman Czyborra's excellent page with many ISO 8859 links and issues. Includes
charts of all ISO 8859 character sets.
- http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html
- Alan Flavell's excellent document of pointers to information about ISO-8859
- http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/character-faq.txt
- Alan Flavell's brief FAQ document regarding ISO-8859 issues in HTML
- http://www.bbsinc.com/iso8859.html
- Kevin J Brewer's page with MANY links regarding character set issues.