
XML's Formal structure: EBNF explained

Some formal languages, such as XML, have their syntax defined using a specific and unusual set of punctuation. This notation is called Extended Backus-Naur Form, and what follows is an attempt to explain it and why it exists.

Extended Backus-Naur Form (EBNF) is a notation designed to allow a language designer to formally define the rules of a language. It is a variant of Backus-Naur Form (BNF), which serves the same purpose but has not been "extended" with additional operators such as those for optional and repeated items. EBNF is of particular interest to metadata folks because it is used to define XML.

The form consists of formal rules, called "productions." These are generally in the following format:

definedEntity2 ::= definedEntity1

The operator "::=" means "is defined as": wherever definedEntity2 is called for, anything that matches definedEntity1 is acceptable. definedEntity1 is itself also defined with a production (or by reference to an external standard).

Several entities can be listed after the ::= operator. The following operators are used in XML specs to refine productions:

  • c ::= a | b
    c consists of either a or b. [OR operator]

  • b ::= a?
    b consists of either a or nothing. ["optional" operator]

  • c ::= a - b
    c consists of anything that matches a but does not match b. [NOT, or "exception," operator]

  • b ::= a+
    b consists of one or more instances of a.

  • b ::= a*
    b consists of zero, one, or many instances of a.
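
Each of the operators above has a close analogue in regular-expression syntax, which can make them easier to experiment with. The following is a minimal sketch in Python, not anything taken from the XML spec: the symbols letter, vowel, and digit are hypothetical stand-ins for a and b, and the lookahead used for "a - b" is only a faithful translation when the whole string is tested at once, as it is here.

```python
import re

# Hypothetical toy symbols (not from the XML spec).
LETTER = r"[a-z]"
VOWEL = r"[aeiou]"
DIGIT = r"[0-9]"

# Each EBNF operator from the list, rewritten as a regular expression.
RULES = {
    "letter | digit": rf"(?:{LETTER}|{DIGIT})",         # either one or the other
    "letter?":        rf"(?:{LETTER})?",                # one letter or nothing
    "letter - vowel": rf"(?!(?:{VOWEL})\Z)(?:{LETTER})",  # a letter that is not a vowel
    "letter+":        rf"(?:{LETTER})+",                # one or more letters
    "letter*":        rf"(?:{LETTER})*",                # zero or more letters
}


def matches(rule: str, text: str) -> bool:
    """Return True if the whole of text satisfies the named toy rule."""
    return re.fullmatch(RULES[rule], text) is not None
```

For instance, matches("letter - vowel", "q") is true while matches("letter - vowel", "e") is false, and matches("letter+", "") is false while matches("letter*", "") is true.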

These are used to build the definition of XML from the top down. To illustrate: the production for "document" is:

document ::= prolog element Misc*

This means that a valid XML document consists of a prolog, then an element, followed by zero, one, or many Miscs. Similar productions exist for each of those entities. For example, if we went to the "element" production (the spec is internally hyperlinked, making this very easy), we would see:

element ::= EmptyElemTag | STag content ETag

This indicates that an element consists of either a single empty-element tag, or a sequence of one start tag, then content, and then one end tag.
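
One way to get a feel for a production like this is to turn it into a small recognizer, with one function per production. The sketch below does that in Python under heavily simplified assumptions that are mine rather than the spec's: names are ASCII letters only, tags carry no attributes, and content is just plain text and nested elements. The function names are hypothetical.

```python
import re
from typing import Optional

# Simplified stand-ins for the real productions (assumptions, not the spec):
# ASCII-letter names, no attributes, no comments or entity references.
NAME = r"[A-Za-z]+"
EMPTY_ELEM_TAG = re.compile(rf"<({NAME})/>")
STAG = re.compile(rf"<({NAME})>")
ETAG = re.compile(rf"</({NAME})>")
CHAR_DATA = re.compile(r"[^<&]+")


def parse_element(text: str, pos: int = 0) -> Optional[int]:
    """Return the position just past one element starting at pos, or None."""
    # First alternative: EmptyElemTag
    empty = EMPTY_ELEM_TAG.match(text, pos)
    if empty:
        return empty.end()
    # Second alternative: STag content ETag, in that order
    start = STAG.match(text, pos)
    if not start:
        return None
    pos = parse_content(text, start.end())
    end = ETAG.match(text, pos)
    # Requiring the two names to agree is a well-formedness constraint
    # stated alongside the grammar, not something "::=" itself expresses.
    if end and end.group(1) == start.group(1):
        return end.end()
    return None


def parse_content(text: str, pos: int) -> int:
    """Consume zero or more runs of character data or nested elements."""
    while True:
        data = CHAR_DATA.match(text, pos)
        if data:
            pos = data.end()
            continue
        nested = parse_element(text, pos)
        if nested is None:
            return pos
        pos = nested


def is_element(text: str) -> bool:
    """True if the whole string is exactly one (simplified) element."""
    return parse_element(text) == len(text)
```

With these definitions, is_element("<a/>") and is_element("<a><b/>text</a>") both hold, while is_element("<a>text</b>") and is_element("<a>text") do not.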

Each of those entities also has a production, and the spec goes all the way down to the level of allowable Unicode characters for each.
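
As an illustration of that lowest level, the XML 1.0 production for a legal character is Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. A short Python sketch of the same rule follows; the function name is my own, and it assumes it is handed a single character.

```python
def is_xml_char(ch: str) -> bool:
    """Check one character against the XML 1.0 Char production."""
    cp = ord(ch)  # the character's Unicode code point
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # [#x20-#xD7FF]
        or 0xE000 <= cp <= 0xFFFD      # [#xE000-#xFFFD]
        or 0x10000 <= cp <= 0x10FFFF   # [#x10000-#x10FFFF]
    )
```

Each "|" in the production becomes an "or" in the code, and each bracketed range becomes a pair of comparisons.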

Reading EBNF can be daunting initially, but it is an excellent way of peeking into the syntax of a language. I find myself appreciating the juxtaposition of underlying rigidity and surface flexibility in XML.


Neat. Two questions:

Am I right that two or more terms after ::= without a mediating symbol are ANDed together, and that all must be present for a valid expression?

Does capitalization (e.g., "Misc") have a function beyond easier legibility?


Indeed, the terms separated by spaces indicate not only that all must be present, but that they must appear in that sequence.
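
A toy model makes the point about sequence easy to check. In this Python sketch each symbol of the "document" production is replaced by a single placeholder letter (P for prolog, E for element, M for Misc); that substitution is purely an assumption for illustration, since the real productions are far richer and the real prolog is allowed to be empty.

```python
import re

# Placeholder letters standing in for whole productions (hypothetical).
PROLOG = "P"
ELEMENT = "E"
MISC = "M"

# document ::= prolog element Misc*
# Writing the terms side by side means "all of these, in this order".
DOCUMENT = re.compile(rf"{PROLOG}{ELEMENT}(?:{MISC})*")


def is_document(symbols: str) -> bool:
    """True if the placeholder string follows the toy document rule."""
    return DOCUMENT.fullmatch(symbols) is not None
```

Here is_document("PE") and is_document("PEMM") succeed, but is_document("EP") and is_document("PME") fail, because the same terms appear in the wrong order.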