The basic syntax is familiar from HTML and XHTML, with some clarifications:
- Character encoding
- Element types
- Opening and closing elements
- Self-closing tags
- Character references, and escaping &
- Attribute syntax
All HTML5 documents start the same way:
<!DOCTYPE html>
…where html is case-insensitive.
And that’s it. No DTD. No namespace. From there straight into the <html> tag.
Character encoding
Declaring the character encoding is mandatory, and HTML5 gives you three ways to do so:
- At the transport level, eg by using the HTTP Content-Type header
- By using a Unicode Byte Order Mark (BOM) at the start of the file
- Using a meta element with a charset attribute that specifies the encoding within the first 1024 bytes of the document, eg <meta charset="UTF-8">
The older version of this syntax is still allowed:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
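As a rough sketch of where the declaration goes in practice: putting the meta element first inside head keeps it comfortably within the first 1024 bytes. The title and body text here are placeholders, not from the spec.

```html
<!DOCTYPE html>
<html>
<head>
  <!-- First child of head, so it is seen before any non-ASCII text -->
  <meta charset="UTF-8">
  <title>Encoding example</title>
</head>
<body>
  <p>Café, naïve, 日本語</p>
</body>
</html>
```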
The Byte Order Mark is a fiendishly elegant approach (if rather opaque). Interestingly the W3C’s own validator raises a warning if you use it:
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
Element types
HTML5 defines five types of element – understanding them is useful background for some of the features below.
- Void elements: area, base, br, col, command, embed, hr, img, input, keygen, link, meta, param, source, track, wbr – Void elements have no end tag, so have no content (in the sense that a paragraph has character content).
- Raw text elements: script, style
- RCDATA elements: textarea, title – RCDATA is a concept from SGML, standing for “Entity references and character data” (or “Replaceable Character Data”), which is what these elements can contain.
- Foreign elements: Elements from the MathML namespace and the SVG namespace
- Normal elements: All other allowed HTML elements
Only normal and RCDATA elements are parsed as HTML. Void elements have no content. Raw text and foreign elements are parsed as other syntaxes.
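To make the five types concrete, here is an illustrative page that uses one of each. The content is invented for the example; the comments describe how each element's content is treated under the rules above.

```html
<!DOCTYPE html>
<html>
<head>
  <!-- RCDATA: the character reference is decoded, but <b> stays as literal text -->
  <title>Fish &amp; chips <b>not bold</b></title>
  <!-- Raw text: parsed as CSS, so &amp; and <b> are not interpreted as HTML -->
  <style>p::before { content: "&amp; <b> "; }</style>
</head>
<body>
  <!-- Normal element: parsed as HTML and may contain other elements -->
  <!-- br is a void element: no content and no end tag -->
  <p>Some <em>emphasised</em> text<br>and a second line</p>
  <!-- Foreign elements: svg and circle come from the SVG namespace -->
  <svg width="20" height="20"><circle cx="10" cy="10" r="8"/></svg>
</body>
</html>
```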
Opening and closing elements
The start and end tags of certain normal elements can be omitted.
Generally this means that you can omit the end tag of an element in a semantically natural way, eg in a list you can omit the closing </li> tags. Slightly more formally, end tags can be omitted where there is no other possible (or sensible) interpretation.
For example, this can only be understood as a closed list with two items:
<ul> <li>Item 1 <li>Item 2 </ul>
This was the case in older versions of HTML, but we considered it poor practice. HTML5 has a well-defined list of optional start and end tags for normal elements.
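A few other entries on that list, sketched as fragments with invented content. In each case the next start tag, or the parent's end tag, leaves only one sensible place for the omitted end tag.

```html
<!-- dt and dd end tags omitted -->
<dl>
  <dt>Term
  <dd>Definition
</dl>

<!-- tr, th and td end tags omitted; the tbody start and end tags are omitted too -->
<table>
  <tr><th>Name<th>Value
  <tr><td>alpha<td>1
</table>

<!-- p end tag omitted before another p, and before the end of the parent -->
<section>
  <p>First paragraph
  <p>Second paragraph
</section>
```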
<div> is not on that list. Nesting div elements is commonplace, so the following would be ambiguous:
<div id="first">some content <div id="second">some more content </div>
Does the second div come after the first, or sit inside it?
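The two readings, written out with every end tag in place. These are alternatives, not one document, which is why the ids repeat.

```html
<!-- Reading 1: second sits inside first -->
<div id="first">some content
  <div id="second">some more content</div>
</div>

<!-- Reading 2: second comes after first -->
<div id="first">some content</div>
<div id="second">some more content</div>
```

A browser's parser will in practice recover by nesting the second div inside the first, but the markup with the missing end tag is still invalid.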
The list defines optional start tags as well as optional end tags. Elements with optional start tags (provided they have no attributes, and some other conditions) are:
- html
- head
- body
- colgroup
- tbody
So this is a valid HTML5 file:
<!DOCTYPE html> <title>valid</title> <p>This is a complete HTML5 document.</p>
The html, head and body tags are inferred.
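As a sketch of what the parser infers, the three-line file above is treated as if it had been written out in full:

```html
<!DOCTYPE html>
<html>
<head>
<title>valid</title>
</head>
<body>
<p>This is a complete HTML5 document.</p>
</body>
</html>
```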
It will become clearer whether we consider this good or poor practice in HTML5.
Self-closing tags
In XHTML, we’re used to closing void tags with a trailing slash, eg <br />
In HTML5, since void tags have no contents they do not have a closing tag. You can make them notionally self-closing with a trailing slash. There is no difference between the three breaks in the following paragraph:
<p> line 1<br> line 2<br/> line 3<br /> line four </p>
Foreign elements may also be self-closing.
Other elements may not be self-closing, eg <p /> is valid XHTML but invalid HTML5.
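The three cases side by side, as an illustrative fragment. The file name logo.png is a placeholder.

```html
<!-- Void element: the trailing slash is optional and changes nothing -->
<img src="logo.png" alt="Logo">
<img src="logo.png" alt="Logo" />

<!-- Foreign element: may be self-closing -->
<svg width="20" height="20"><circle cx="10" cy="10" r="8" /></svg>

<!-- Normal element: needs a real end tag, even when empty -->
<p></p>
```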
Character references, and escaping &
The ampersand (&) may be left unescaped in more cases than in HTML4.
To see which ampersands we can leave unescaped, we have to take a quick look at a couple of related areas.
HTML5 defines three types of character reference:
- Named: &, followed by a character reference name (case sensitive – there are 2231 of them in HTML5), terminated by a ;
- Decimal: &#, followed by one or more digits corresponding in base 10 to a Unicode code point, terminated by a ;
- Hexadecimal: &#x, followed by a hexadecimal number corresponding in base 16 to a Unicode code point, terminated by a ; (case-insensitive: x or X, a-f or A-F)
Almost all Unicode characters are permitted:
The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), and control characters other than space characters.
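For example, the character é (U+00E9, decimal 233) can be written in any of the three forms. The word is arbitrary; each line should render the same text after its label.

```html
<p>Named:       caf&eacute;</p>
<p>Decimal:     caf&#233;</p>
<p>Hexadecimal: caf&#xE9; or caf&#Xe9;</p>
```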
As in XHTML, character references must be terminated with a semicolon.
This allows HTML5 to define an ambiguous ampersand: an & followed by letters or digits and then a ;, which looks like a named character reference but does not match any defined name.
Normal elements, RCDATA elements and attributes (ie all data parsed as HTML) cannot contain ambiguous ampersands. Bare ampersands are fine.
If you think about it, this corresponds very closely to how you would debug bare ampersands in XHTML. You can easily spot an & that needs to be escaped, or an &-escaped string that is missing its ; terminator. The rules above simply formalise this, in order to let you use a bare & where it’s simpler and reads better.
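A small illustration of where the line falls. The name nosuch is made up for the example, chosen because it is not a defined character reference name.

```html
<!-- Fine: a bare ampersand that cannot be read as a reference -->
<p>Fish & chips</p>

<!-- Fine: real character references, terminated with a semicolon -->
<p>Fish &amp; chips, &copy; Example Ltd</p>

<!-- Not allowed as written: &nosuch; is an ambiguous ampersand -->

<!-- The same text with the ampersand escaped -->
<p>The literal text &amp;nosuch; is now safe</p>
```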
Attribute syntax
Multiple attributes in a tag must be separated by a space.
You can write attributes:
- With a value in ‘single quotes’
- With a value in “double quotes”
- With an un-quoted value, if it is not empty and contains no spaces or any of the characters " ' = < > `
- As an empty attribute (the name alone), in which case the empty string is assumed
This empty attribute syntax can be quite elegant. The spec gives the example:
<input disabled>
XHTML would require disabled="disabled" – more formal but much less elegant.
This syntax is a change from HTML4, which allowed the value without the name for enumerated attributes. Apparently this was not supported in browsers.
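The four attribute forms listed above, side by side. The names and values are invented for the example.

```html
<!-- Single-quoted values -->
<input type='text' name='a' value='single quoted'>

<!-- Double-quoted values -->
<input type="text" name="b" value="double quoted">

<!-- Unquoted values: no spaces or quote characters inside them -->
<input type=text name=c value=unquoted>

<!-- Empty attribute syntax: disabled has the empty string as its value -->
<input type="text" name="d" disabled>
```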