XHTML 1.0
= Index DOT Html/Css by Brian Wilson [bloo@blooberry.com] =

Statistics
Authors: Steven Pemberton (HTML WG chair), et al. (too many to mention)
Specifications
HTML 4.0 Recommendation - http://www.w3.org/TR/REC-html40/
XML 1.0 Recommendation - http://www.w3.org/TR/REC-xml
XHTML 1.0 Documentation - http://www.w3.org/TR/xhtml1/
Timeline:
- Industry organizations and companies gather to decide the future of HTML - May, 1998
- Draft document "Reformulating HTML in XML" (Voyager) released - December, 1998
- First working draft of "XHTML 1.0" released by the W3C - February, 1999
- Second working draft of "XHTML 1.0" released by the W3C - March, 1999
- Third working draft of "XHTML 1.0" released by the W3C - May, 1999
Where did HTML come from?
HTML 4.0 and its predecessors are defined using SGML, a stable and well-defined meta-language that allows other markup languages to be created. SGML is very powerful and flexible, but these very features have prevented its widespread adoption: the resulting complexity prohibits a compliant parser from being lightweight.

Enter XML
XML is a new meta-language that aims to solve many of SGML's problems while retaining the power and flexibility that make SGML such a compelling solution. Like SGML, XML can define markup languages. HTML, by contrast, consists of a static, restrictive set of elements and attributes. It is inflexible and cannot adapt to the many needs foreseen in just the next few years. HTML's tag set is not even very good at marking content with semantic meaning. Yet HTML is currently the ubiquitous document format on the World Wide Web, used in millions, if not billions, of documents. Clearly, HTML needs to grow up: to gain something like the expressive power of SGML without taking on a lot of baggage. XML is the chosen successor in this equation.

XML has many things going for it, and when they are added up, the move to XML makes perfect sense. Part of the reason current browsers are so big is that their parsers must accommodate bad syntax; many pages on the web today are coded using bad HTML syntax and authoring practices. The market for lightweight browsers is expected to grow considerably, and lean and mean browsers will let new devices tackle the contents of the web with fewer problems. XML is also extensible, which will allow even more powerful capabilities in the full-featured browsers of the future.

XHTML
A gathering in May, 1998 of industry organizations and companies decided that HTML needed to be re-created as an XML application to meet the current and future needs of an ever-diversifying application and presentation market. To that end, the W3C has published a draft, "XHTML 1.0", which re-casts HTML 4.0 in XML syntax and componentizes its capabilities.

The transformation of HTML to XHTML will not be without a few growing pains, as some fundamental simplifications in the XML language are just different enough from current popular HTML authoring practice to create some incompatibilities. The XHTML draft defines distinct namespaces for the three separate HTML 4.0 DTDs - strict, transitional and frameset. The extensibility and flexibility of XML will allow HTML to be broken down even further if need be, or easily extended - possibly for uses and applications that cannot even be foreseen at this point. The "X" in XML stands for "eXtensible", after all.

The HTML to XHTML headache:
What needs to change

Converting a document from HTML 4.0 to XHTML 1.0 will not be a totally painless affair - some changes WILL need to be made.
  • An XHTML document MUST be well-formed XML
    It must conform to basic XML syntax. If it does not, the XML parser does not have an obligation to continue processing the document. Unlike today's HTML parsers, an XML parser will not try to recover and "guess" what you meant if the syntax is incorrect.
  • <html> MUST be the top-level element.
    Not a change from HTML, but there are quite a few documents out there that neglect this important point.
  • Element and attribute names MUST be in lower case
    HTML is not case-sensitive; XML is.
  • Attribute values MUST be quoted
  • End tags are required for non-empty elements
    They are no longer optional.
    Affected Elements: body, colgroup, dd, dt, head, html, li, option, p, rt, tbody/thead/tfoot, th/td, tr
  • All empty elements must use the XML "empty tag" syntax
    XML empty elements are explicitly closed with a trailing forward slash ("/") before the end bracket (e.g. <br> becomes <br />)
    Affected Elements: area, base, basefont, bgsound, br, col, frame, hr, img, input, isindex, keygen, link, meta, param, spacer, wbr
  • XML does not allow attribute minimization.
    Stand-alone attributes must be expanded (e.g. <td nowrap>cell</td> becomes <td nowrap="nowrap">cell</td>)
  • Whitespace handling in attribute values is different in XML.
    Leading and trailing spaces are stripped, and runs of whitespace characters within the attribute value are collapsed to single spaces.
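  • Script sections should be wrapped in XML CDATA sections
    Otherwise an XML parser treats any "<" or "&" in the script code as the start of markup or of an entity reference.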
  • SGML DTD exclusions are not possible in XML, but they should still be observed as "good practice".
    Not allowed to nest within themselves: a, button, form, label
    Pre exclusions: big, img, object, small, sub, sup
    Button exclusions: fieldset, form, iframe, input, label, select, textarea
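
Because an XML DTD cannot express these exclusions, an XML parser will not catch them, so an author may want a separate check. What follows is a minimal Python sketch of such a check, not something defined by the XHTML draft. The EXCLUSIONS table simply transcribes the lists above, the function name find_exclusion_violations is hypothetical, and the sketch assumes a well-formed document whose element names carry no namespace.

```python
import xml.etree.ElementTree as ET

# Prohibited descendants for each element, transcribed from the lists above.
EXCLUSIONS = {
    "a": {"a"},
    "button": {"button", "fieldset", "form", "iframe", "input",
               "label", "select", "textarea"},
    "form": {"form"},
    "label": {"label"},
    "pre": {"big", "img", "object", "small", "sub", "sup"},
}


def find_exclusion_violations(markup):
    """Return (ancestor, descendant) tag pairs that break the exclusions."""
    root = ET.fromstring(markup)  # raises ParseError if not well-formed
    violations = []

    def walk(element, banned):
        # 'banned' maps a prohibited tag name to the ancestor prohibiting it.
        if element.tag in banned:
            violations.append((banned[element.tag], element.tag))
        inherited = dict(banned)
        for tag in EXCLUSIONS.get(element.tag, ()):
            inherited.setdefault(tag, element.tag)
        for child in element:
            walk(child, inherited)

    walk(root, {})
    return violations


if __name__ == "__main__":
    sample = '<body><a href="x">outer <em><a href="y">inner</a></em></a></body>'
    print(find_exclusion_violations(sample))
```

The check follows the whole chain of ancestors rather than only the direct parent, because an SGML exclusion applies at any depth: an <a> wrapped in an <em> inside another <a> is still prohibited.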
Several of the above changes make mandatory certain features that were optional in the SGML world, or that are optional in current usage because of the historical leniency of HTML parsers. When something is optional, people tend to abuse it. XML parsers will be very strict about these rules. In theory, none of these changes should make documents unreadable by current browsers.
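
The strictness of an XML parser is easy to see in practice. The following is a small sketch using the XML parser from Python's standard library; the sample markup and the helper name is_well_formed are illustrative assumptions, not taken from the XHTML draft. It checks well-formedness only, not validity against an XHTML DTD.

```python
import xml.etree.ElementTree as ET

# HTML 4.0 habits: upper-case tag, unquoted and minimized attributes,
# a <p> that is never closed, and <br> without the trailing slash.
HTML_STYLE = """<html><head><title>Demo</title></head>
<body><P align=center>First paragraph<br>
<table><tr><td nowrap>cell</td></tr></table>
</body></html>"""

# The same content rewritten to follow the rules listed above.
XHTML_STYLE = """<html><head><title>Demo</title></head>
<body><p align="center">First paragraph<br /></p>
<table><tr><td nowrap="nowrap">cell</td></tr></table>
</body></html>"""

# Script code containing "<" and "&", without and with a CDATA section.
SCRIPT_BARE = "<script>if (a < b && c) { go(); }</script>"
SCRIPT_CDATA = "<script><![CDATA[ if (a < b && c) { go(); } ]]></script>"


def is_well_formed(markup):
    """Return (True, None) if markup parses as XML, else (False, reason)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


if __name__ == "__main__":
    samples = [
        ("HTML style", HTML_STYLE),
        ("XHTML style", XHTML_STYLE),
        ("mixed-case tags", "<p>text</P>"),
        ("bare script", SCRIPT_BARE),
        ("CDATA script", SCRIPT_CDATA),
    ]
    for label, markup in samples:
        ok, reason = is_well_formed(markup)
        print(label, "->", "well-formed" if ok else "rejected: " + reason)
```

Only the XHTML-style document and the CDATA-wrapped script parse. The parser stops at the first error in each of the others instead of guessing at the author's intent, which is exactly the behavior that lets an XML parser stay small.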

HTML Tidy
Dave Raggett (the co-author or primary author of the HTML 3.0, 3.2 and 4.0 specs) has created a free little program that converts an HTML page to XHTML for you, along with correcting many common authoring mistakes. See http://www.w3.org/People/Raggett/tidy/ for more details.
[This is not intended to be a product plug, merely a pointer toward a helpful tool.]

Why XHTML is important
The world of the web is changing, as are the browsers that access it. HTML has needed to change for quite some time in order to keep up, but it didn't have the power to do so. Changing HTML 4.0 into XHTML 1.0 will give it the power it needs to adapt today and to flourish in the future.

