
March 31, 2005

XML's Formal Structure: EBNF Explained

Some formal languages, such as XML, define their internal relationships using a specific and unusual set of punctuation. This notation is called Extended Backus-Naur Form, and what follows is an attempt to explain what it is and why it exists.

Extended Backus-Naur Form (EBNF) is a notation that allows a language designer to formally define the rules of a language. It is an extension of BNF (the same basic idea, just "extended" with additional logical operators). EBNF is of particular interest to metadata folks because it is used to define XML.

The form consists of formal rules, called "productions." These are generally in the following format:

definedEntity2 ::= definedEntity1

The operator "::=" indicates that definedEntity2 can be considered identical to definedEntity1, which is itself defined either by another production or by reference to an external standard.

Several entities can be listed after the ::= operator. The following operators are used in XML specs to refine productions:

  • c ::= a | b
    c consists of either a or b. [OR operator]

  • b ::= a?
    b consists of either a or nothing. ["optional" operator]

  • c ::= a - b
    c consists of any a that does not also match b. [exception, or NOT, operator]

  • b ::= a+
    b consists of one or more instances of a. ["one or more" operator]

  • b ::= a*
    b consists of zero, one, or many instances of a. ["zero or more" operator]

These are used to build the definition of XML from the top down. To illustrate: the production for "document" is:

document ::= prolog element Misc*

This means that a valid XML document consists of a prolog, then an element, followed by zero, one, or many Miscs. Similar productions exist for each of those entities. For example, if we went to the "element" production (the spec is internally hyperlinked, making this very easy), we would see:

element ::= EmptyElemTag | STag content ETag

This indicates that an element consists either of an empty-element tag, or of a sequence of one start tag, content, and then one end tag.
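Both branches of that alternation produce well-formed markup, which is easy to confirm with Python's standard-library XML parser (the "note" element is my own illustrative example, not one from the spec):

```python
import xml.etree.ElementTree as ET

# EmptyElemTag branch: a single self-closing tag.
empty = ET.fromstring("<note/>")

# "STag content ETag" branch: explicit start and end tags around content.
full = ET.fromstring("<note>reminder</note>")

print(empty.tag)  # → note
print(full.text)  # → reminder
```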

Each of those entities also has a production, and the spec goes all the way down to the level of allowable Unicode characters for each.
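To make the top-level "document" production concrete, here is a minimal document in which the XML declaration serves as the prolog, a single root element follows, and a trailing comment stands in for the Misc* (comments qualify as Misc). The document text is my own toy example; Python's standard-library parser accepts it:

```python
import xml.etree.ElementTree as ET

# document ::= prolog element Misc*
#   prolog  -> the XML declaration
#   element -> the single root element <greeting>
#   Misc*   -> the trailing comment
doc = '<?xml version="1.0"?><greeting>hello</greeting><!-- a Misc -->'

root = ET.fromstring(doc)
print(root.tag, root.text)  # → greeting hello
```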

Reading EBNF can be daunting initially, but it is an excellent way of peeking into the syntax of a language. I find myself appreciating the juxtaposition of underlying rigidity and surface flexibility in XML.

March 28, 2005

Archival Description and Library Work

Recently, I attended an event held by the Andersen Library: Archival Processing in an Age of Abundance: New Approaches to Old Backlogs. It gave me some food for thought as to the nature of library work and metadata creation.

The highlight of the talk was Dennis Meissner's description of the work he did with Mark Greene examining workflow. This research, flowing from a feeling that backlogs of collections were a growing problem at repositories, discovered a number of institutions running at a net deficit with regard to processing: new collections were being brought in faster than they could be processed under the de facto standards in place. Even institutions with relatively fast processors commonly float a large backlog from year to year.

More to the point, Meissner and Greene found a wide range of standards and expectations regarding the speed with which archival processing should proceed. Finding that an average of 20 to 25 hours per cubic foot is a fair estimate of real-world speed, they recommend a target rate of 4 hours per cubic foot. They believe this will require discarding most, if not all, work at the item level, including sorting and conservation tasks that many processors have considered necessary.

A point that was brought up really hit home: the reason why many people get into library work in general, and archival work in particular, is that they "like books." In the archival world, processors often get to deal first hand with the unique, rare and personal. What Meissner & Greene suggest is that without special directive and/or funding, the processor should do no "research" at anything below the series level - no description.

Eliminating this close contact with the item level will also greatly reduce the possibility that the archivist will find something of value, of sensitivity, or of close relationship to another collection. But the possibilities that we fantasize about in this regard, the researchers claim, are not worth the real-life slowdown in processing that every collection suffers under such a labor-intensive approach.

Another question is that of acquisitions. If Meissner & Greene's approach takes hold (possible, maybe not probable) and collections are processed much faster and at a shallower level, would this affect the mindset with which new collections are acquired? Surely the equations of time, research access and human resources would shift so dramatically that accessioning would be changed in some fashion.

Finally, I wanted to throw my two cents in about metadata creation. I believe that the study has implications for the world of metadata creation in general, not just in the world of archives.

The paper resulting from the study, which has not yet been published, has the working title of "More Product, Less Process." This is in line with how I have been thinking overall about metadata for digital collections. It is important to streamline the process in a way that balances efficiency with sufficient access. I further believe our goal should be to enhance our skills, models and approaches using grant funding, allowing us to fold metadata creation into projects as part of our regular duties.