April 20, 2005

Why am I seeing double form subdivisions?

Now that authority linking has been turned back on, please be aware of a temporary issue regarding the updating of headings with form subdivisions. Specifically, when you update a record with a form subdivision in $$v and the authority still uses $$x for the same form subdivision, Aleph will double the subdivision, e.g.,

Authority: Children $$x Prayer-books and devotions.
Bibl hdg.: Children $$v Prayer-books and devotions.
Upd bibl.: Children $$x Prayer-books and devotions $$v Prayer-books and devotions.

The simplest solution is to follow the authorized form and use $$x. Eventually, these will be corrected to $$v; and in the meantime, $$v and $$x will interfile happily, so the coding difference will be invisible to users. For a more detailed explanation of the issue, read on.

When Aleph checks and finds an authority which matches a bib heading, it replaces the bib heading with the authority heading. This serves to correct minor differences in capitalization, diacritics, punctuation, etc., and is generally a good thing. The problem arises because Aleph is not equally sensitive to coding variations when it checks headings and when it updates them. When matching a bib heading to an authority, Aleph normalizes out coding differences (along with minor differences in capitalization, diacritics, punctuation, etc.). When Aleph updates a bib heading, it attends to coding differences; in this case, it sees the authority $$x and the bib $$v as different, and preserves both. This preservation of different subfields is good when the difference is that the bib heading includes subfields that aren't in the authority, such as $$e or $$4; but in cases like the one above, it can cause problems.
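The mismatch between matching and updating can be illustrated with a small Python sketch. This is a toy model only, not Aleph's actual code: the function names, the representation of a heading as a list of (subfield code, text) pairs, and the rule "keep any bib subfield whose code is absent from the authority" are all assumptions made for illustration.

```python
# Toy model of the behavior described above; NOT Aleph's real algorithm.
# A heading is a list of (subfield code, text) pairs (an assumed representation).

def normalize(heading):
    """Build a matching key that ignores subfield codes, case, and punctuation."""
    parts = []
    for _code, text in heading:
        kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
        parts.append(" ".join(kept.split()))
    return tuple(parts)

def matches(bib, authority):
    """Matching step: coding differences are normalized away."""
    return normalize(bib) == normalize(authority)

def update(bib, authority):
    """Updating step (assumed rule): take the authority heading, then keep
    any bib subfield whose code does not occur in the authority."""
    authority_codes = {code for code, _ in authority}
    extras = [(code, text) for code, text in bib if code not in authority_codes]
    return list(authority) + extras

def display(heading):
    """Render a heading in the $$-style notation used in this post."""
    return " ".join(text if code == "a" else f"$${code} {text}"
                    for code, text in heading)

authority = [("a", "Children"), ("x", "Prayer-books and devotions")]
bib = [("a", "Children"), ("v", "Prayer-books and devotions")]

if matches(bib, authority):
    print(display(update(bib, authority)))
```

In this model the two headings match because the $$x/$$v difference is ignored, but the update step treats $$v as an extra subfield worth keeping, so the form subdivision appears twice.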

When Database Management and LEO staff were working through these issues in order to get flipped headings from the V14 to V16 conversion fixed and get automatic updating from authorities turned back on, we determined that in most cases, the doubling of form subdivisions was due to the presence of an updated authority containing $$v and older bib headings containing $$x. Most of our bib form subdivisions are still in $$x. To solve the doubling in these cases, LEO ran a job which identified all the authorities containing $$v and set the UPD=N flag in them. This means that Aleph will not update a bib record which appears to match one of these authorities, and will not mistakenly create doubled form subdivisions.
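The logic of that job can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the job LEO actually ran: the record layout (a dict with "id", "heading", and "UPD" keys) and the function name are hypothetical.

```python
# Hypothetical sketch of the flagging job; the record structure is assumed.

def flag_form_subdivision_authorities(authorities):
    """Set UPD to "N" on every authority whose heading contains a $$v
    subfield, and return the ids of the records that were flagged."""
    flagged = []
    for record in authorities:
        if any(code == "v" for code, _text in record["heading"]):
            record["UPD"] = "N"
            flagged.append(record["id"])
    return flagged

authorities = [
    {"id": "auth-1", "UPD": "Y",
     "heading": [("a", "Children"), ("v", "Prayer-books and devotions")]},
    {"id": "auth-2", "UPD": "Y",
     "heading": [("a", "Children"), ("x", "Prayer-books and devotions")]},
]

print(flag_form_subdivision_authorities(authorities))
```

Only the first record is flagged here; the second, which still uses $$x, is left eligible for automatic updating.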

The other side of this coin is the authorities which still contain form subdivisions coded $$x. LC is gradually working through its authority files creating form subdivision authority records (using tag 185) and revising heading authority records which contain a form subdivision. Our authority file contains a mix of these updated and non-updated authorities. As LC completes its work and as we resume our batch updating of authorities with newer versions from LC, these instances of authorized headings using $$x for form subdivisions will go away. For now, though, we're in an interim phase. (NB: These authorities are exceptions. LC does not routinely create authorities for headings ending in form subdivisions, but it does do so when such a heading has been used as an example or a reference on another authority.)

One good thing about the indexing in Aleph version 16 is that it is NOT sensitive to nearly as many minor differences in headings as version 14 was. In version 14, minor differences in capitalization, diacritics, punctuation, etc., and in coding, would cause separate lines (i.e., split files) to appear in the index. Version 16 is much better at recognizing that such differences still belong to the "same" heading, and is able to represent them with a single merged index entry. Therefore, the urgency of standardizing the coding of form subfields has considerably lessened with version 16.

So, back to solutions: the simplest correction is to follow the coding in the authority, even when it is obsolete. Doing so will avoid the subdivision doubling, and will not cause problems in the index. The more involved solution is to revise the coding in the authority to $$v, and add the UPD=N field to avoid causing doubling of older bib headings containing $$x form subdivisions. Database Management will revise any authorities like this that we find or have reported to us during this interim phase. This will allow use of the current $$v coding in the bib record. The long term solution will be to replace these authorities with updates from LC. Our batch updating process will include a step to set such authorities to UPD=N as we load them. Eventually we may update all the older $$x bib form subdivisions to $$v, but that is not a high priority given that they are no longer causing split files.

In any case, it is always a good idea to glance over a bib record after sending it to the server. In most cases, any automatic heading updates will be good changes. In some cases, the updated record may reveal that an incorrectly coded subdivision in the bib record (e.g., a place name in $$x rather than $$z) caused the kind of doubling described above and should be corrected. In other cases, something that was correctly coded in the bib record may have doubled because it is incorrectly coded in the authority.

If you have any questions about this, please let me know.

Posted by s-hear at 11:32 AM | Comments (0)

February 14, 2005

Should I use LCSH terms in my metadata records?

Librarians usually think of authority control in the context of the library catalog, where an elaborate set of rules and a long history of consensus building and professional discipline have resulted in a relatively high uniformity of practice and understanding. Other communities, perceiving the value ascribed to authorized vocabularies, have shown an interest in making use of these vocabularies and presumably of adding the value inherent in them to other resource discovery tools and databases. But where exactly does the “added value” of the terms in a controlled vocabulary reside? (What follows focuses on subject vocabularies, but could also be applied to other kinds of controlled vocabularies.)

One simple notion is that the added value resides in the terms themselves. There is some logic to this. The terms in an authorized list like Library of Congress Subject Headings (LCSH) have been selected from a number of synonyms as being the “best” in some sense (e.g., most used by practitioners of a field). Automatic validation of a database’s terms against the source list can ensure formal consistency with the approved terminology. Declaration of a database’s terms’ adherence to a known authorized list can in principle enable searchers looking across databases to use a single controlled vocabulary. It would appear that all of these good things can be achieved simply by picking terms from an authorized list. This could be called formal adherence to a controlled vocabulary.

The problem with formal adherence is that it sidesteps the question of what terms mean. Only the simplest controlled term lists use terms that are fully exclusive of one another. In most cases, the relationships between terms are more complex. In LCSH, two main devices are used to provide logical hierarchies of terms. Broad terms are assigned one or more subdivisions to narrow the meaning of a heading. Related terms are linked by cross references in broader/narrower hierarchies, again to specify a range of broad to narrow meanings. The specific meaning and correct application of a term in this kind of system depends on an understanding of the system’s principle of using the most specific terminology appropriate for describing a particular resource, and of how each term relates to the larger list. This could be called semantic adherence. Semantic adherence to a controlled vocabulary goes beyond formal adherence in that it seeks consistency with the meaning of terms as defined in the authorizing source, as well as with their forms.

Semantic adherence is more complicated than formal adherence. The specific meanings of terms are not routinely expressed in the authority records which make up a controlled vocabulary, and are rarely conveyed in a simple list of the terms. They are not typically available to automatic term matching algorithms, and cannot typically be validated by machine tests. They may not be understood by users of the list who are not informed about the rules and discipline of the community that prepared the list. For all these reasons, it is doubtful that simple formal adherence to a controlled vocabulary will equate with semantic adherence to the same vocabulary.

The question then becomes, how much does the added value of a controlled term depend simply on its form, and how much does it depend on both the term’s form and its semantics? Consider two examples:

The diary of an American Civil War Confederate soldier might be assigned any of the following controlled LCSH headings, all of which are formally valid:
 United States
 United States—History
 United States—History—Civil War, 1861-1865
 United States—History—Civil War, 1861-1865—Personal narratives
 United States—History—Civil War, 1861-1865—Personal narratives, Confederate

A collection of photos of the B-1 bomber might be assigned any of the following controlled LCSH headings, all formally valid:
 Airplanes
 Government aircraft
 Airplanes, Military
 Bombers
 Strategic bombers
 Jet bombers
 Supersonic bombers
 B-1 bomber

In the context of a particular database, any of these levels of specificity might seem more or less appropriate for describing the specified content. However, to be consistent with LCSH semantics and rules of application, only the last and most specific term in each list would be appropriate. If semantic adherence is not undertaken by the creators of a set of databases, how much of the value of the controlled terms will be preserved for users attempting to search across those databases with the controlled vocabulary?
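The gap between the two kinds of adherence can be made concrete with a short Python sketch based on the B-1 bomber example. It is only an illustration: the term set is copied from the list above, but the broader-term links are a simplified, hypothetical chain invented for this example and are not the actual LCSH cross references. The function names are likewise assumptions.

```python
# Terms taken from the B-1 bomber example above.
AUTHORIZED = {
    "Airplanes", "Government aircraft", "Airplanes, Military", "Bombers",
    "Strategic bombers", "Jet bombers", "Supersonic bombers", "B-1 bomber",
}

# HYPOTHETICAL, simplified broader-term links (term -> broader term).
# These are illustrative only and do not reproduce real LCSH references.
BROADER = {
    "B-1 bomber": "Bombers",
    "Bombers": "Airplanes, Military",
    "Airplanes, Military": "Airplanes",
}

def is_formally_valid(term):
    """Formal adherence: the term simply appears in the authorized list."""
    return term in AUTHORIZED

def broader_terms(term):
    """Collect every term above this one in the assumed hierarchy."""
    found = set()
    while term in BROADER:
        term = BROADER[term]
        found.add(term)
    return found

def most_specific(candidates):
    """Semantic check: drop any candidate that is merely a broader term
    of another candidate."""
    too_broad = set()
    for term in candidates:
        too_broad |= broader_terms(term)
    return [term for term in candidates if term not in too_broad]

candidates = ["Airplanes", "Airplanes, Military", "Bombers", "B-1 bomber"]
print(all(is_formally_valid(term) for term in candidates))
print(most_specific(candidates))
```

Every candidate passes the formal test, so that test alone cannot tell a database creator which heading to assign. Choosing the most specific one requires knowledge of how the terms relate, which a bare list of valid terms does not supply.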

User satisfaction with searching across databases which share only formal adherence to a particular controlled vocabulary is a question that can only be properly settled by research. However, in principle the value of controlled terms does not reside simply in their “valid” form. It depends also and crucially on their semantics. The added value of authority control thus depends on:
 acknowledgment of an authoritative source for the form and meaning of terms
 discipline in the application of terms
As agencies and database creators seek to expand the utility of controlled vocabularies for users outside their source communities, there is an increased need to assign explicit definitions to authority records and to inform outside users of their rules of application. This need for more explicitly specific semantics in authority records was recognized by Francoise Bourdon in her book International cooperation in the field of authority data (Saur, 1993) in relation to name authority files; it is just as true for subjects, and for other kinds of controlled vocabularies.

Posted by s-hear at 12:10 PM | Comments (1)
The views and opinions expressed in this page are strictly those of the page author. The contents of this page have not been reviewed or approved by the University of Minnesota.