Remember the Spike Lee commercial for some company’s athletic shoes? Spike appears on camera trying to compete against a professional basketball player who flies around the court outmaneuvering Spike and sinking baskets with ease. At the end Spike is seen studying a sneaker and muttering to himself, “The shoes. It’s gotta be the shoes.” Remember this; we’ll come back to it.
How do libraries and other information managing entities achieve precision in their search systems? We have had two basic models. One is the alphabetical list. By following strict rules on formulating name, title, and subject access points, our card catalogs and online catalogs have made it possible to focus in on a particular search target by searching an entry term, and then browsing the ordered entries filed under that term. This makes finding the title “War and Peace” much simpler than if one were to do a keyword search on the words “war” and “peace” appearing in titles.
Computers have enabled the use of Boolean searching, which achieves precision by searching for less frequent terms or combinations of terms found in our metadata records. An unspecified keyword search on “war” and “Tolstoy” is fairly precise; a search on “war” in a title field and “Tolstoy” in an author field is even more precise. Both browsing and Boolean searching are valuable tools for narrowing down the number of records needing to be reviewed to a reasonably relevant few based on data in the records.
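The difference between an unqualified keyword search and a fielded Boolean search can be sketched in a few lines. This is only an illustration: the four records are invented sample data, and the function names keyword_search and fielded_search are hypothetical, not any catalog system's API.

```python
import re

# Invented sample records for illustration only.
RECORDS = [
    {"title": "War and Peace", "author": "Tolstoy, Leo"},
    {"title": "Tolstoy and the Literature of War", "author": "Doe, Jane"},
    {"title": "The Art of War", "author": "Sunzi"},
    {"title": "Anna Karenina", "author": "Tolstoy, Leo"},
]


def words(text):
    """Split a string into a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(records, *terms):
    """Boolean AND of the terms, matched against any field of the record."""
    hits = []
    for rec in records:
        all_words = set()
        for value in rec.values():
            all_words |= words(value)
        if all(term.lower() in all_words for term in terms):
            hits.append(rec)
    return hits


def fielded_search(records, **field_terms):
    """Boolean AND of the terms, each matched only against its named field."""
    hits = []
    for rec in records:
        if all(term.lower() in words(rec.get(field, ""))
               for field, term in field_terms.items()):
            hits.append(rec)
    return hits


anywhere = keyword_search(RECORDS, "war", "Tolstoy")
by_field = fielded_search(RECORDS, title="war", author="Tolstoy")
print(len(anywhere), len(by_field))
```

With this sample data the unqualified search also retrieves the book about Tolstoy, while the fielded search keeps only the book by him.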
Then there’s Google. Google’s single search box is widely admired as a model of simple, efficient design; and the results Google produces are remarkably successful in responding to users’ needs. However, this is not because of Google’s search model. It’s because of Google’s brilliant ranking algorithm. By ranking its thousands of results exceptionally well, Google is able to bring the likeliest candidates for meeting its searchers’ needs to the top of the result list. This ranking is based primarily on Google’s access to a vast amount of online behavior related precisely to the objects it indexes, web pages and their URLs. By gathering data about the behavior of searchers in choosing web pages and the frequency of links to particular pages from other pages, Google is able to assign rank values to its results that often bring the most sought after sources to the top.
Most data sets do not and will never have the kind of behavioral data that drives the Google ranking algorithm. Without it, they cannot begin to achieve the success Google finds in bringing certain result set members to the top of result lists. Yet search interface designers continue to adopt the Google model of a single, unqualified search box, which has such widespread public support. We all want to be like Google, so we want to look like Google.
“The shoes. It’s gotta be the shoes.”
Faceted browsing is best understood as an alternative to Google’s linear ranking. The faceted terms offered to the searcher highlight commonalities in the result set data (subjects, authors, dates, etc.), which the searcher can then use to reduce the results to a set which more precisely meets the searcher’s needs. This is a definite improvement over a single, poorly ranked list; but ultimately it also raises new questions. Is a Google-style popularity ranking best in all cases? Is it satisfactory to see only the top ranked facet terms, or do searchers want to see them all? If hit counts are used for ranking, is there an implicit expectation that the facet terms are consistently formed, so that the number of hits for an author or subject is not split among two or more terms for the same entity, thereby lowering its rank? Are there other more intuitive orders that would be expected for sorting facet terms, e.g., dates in chronological order, names in alphabetical order, classifications in compressed hierarchies? Why do we assume that popularity is the best predictor of what a searcher browsing facet terms is looking for? Just because it works for Google?
“The shoes. It’s gotta be the shoes.”
The Spike Lee who directed this commercial is a lot smarter than the character he plays. Next time you hear someone calling for a search interface to be like Google, be like Spike. Mutter to yourself, ‘The shoes, it’s gotta be the shoes,’ and smile.
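The facet-ranking question raised above, whether hit counts get split among variant forms of the same name, can be illustrated with a small sketch. The result set and the PREFERRED_FORM mapping are invented for illustration; a real system would draw preferred forms from authority records rather than a hand-built dictionary.

```python
from collections import Counter

# Invented result set: the same author appears under two forms of name.
RESULTS = [
    {"author": "Doe, Jane"},
    {"author": "Doe, Jane"},
    {"author": "Doe, Jane"},
    {"author": "Tolstoy, Leo"},
    {"author": "Tolstoy, Leo"},
    {"author": "Tolstoi, Lev"},
    {"author": "Tolstoi, Lev"},
]

# Hypothetical mapping from variant forms to one preferred form.
PREFERRED_FORM = {"Tolstoi, Lev": "Tolstoy, Leo"}


def facet_counts(records, field, preferred=None):
    """Count hits per facet term, optionally merging variant forms first."""
    preferred = preferred or {}
    return Counter(preferred.get(rec[field], rec[field]) for rec in records)


raw = facet_counts(RESULTS, "author")
merged = facet_counts(RESULTS, "author", PREFERRED_FORM)

print(raw.most_common())     # hit-count order, variants counted separately
print(merged.most_common())  # hit-count order, variants merged
print(sorted(merged))        # alphabetical order, an alternative to ranking
```

In this sample the author with four records ranks below the author with three until the two forms of the name are merged.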
NACO normalization is a set of rules for eliminating small differences from character strings. Normalization is good, but NACO normalization isn't good in all cases.
Normalization is the practice of simplifying character strings for comparison and sorting. Typically normalization removes case differences from letters by making them all capitals or all small letters; removes punctuation that might vary accidentally between two headings for the same thing, or might complicate sorting; changes combined characters and characters marked by diacritics into simpler, unmarked forms; etc.
NACO normalization is primarily a set of rules for normalizing MARC encoded name and title headings. This means that NACO normalization goes beyond the rules listed above to include some more specialized rules needed for its particular task. For example, the first comma in a personal name heading usually marks a distinction between surname and forename, and NACO normalization retains it. A name heading in inverted form with a comma and one in direct order form without a comma are not considered to be the same, e.g., "Chandra, Prakash" and "Chandra Prakāsh" are both valid NACO headings, and the first comma rule allows them to be distinguished after normalization.
MARC headings are divided into subfields. In name headings, there are cases where the same data element can be found coded with different subfield values, one current and the other obsolete. NACO normalization deals with this by removing the subfield value, but leaving the subfield position marked; i.e., $a, $b, and $c get normalized to $-, $-, and $-. This means that insignificant or erroneous differences in subfield code values do not affect comparison and sorting, so "$a Microbiology Conference $b (1st ...)" and "$a Microbiology Conference $n (1st ...)" can index the same; likewise "$a Smith, John, $d 1953- " and "$a Smith, John, $c 1953- ".
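The behaviors described so far can be sketched as follows. This is a simplified illustration built only from the rules mentioned in this entry (case folding, diacritic and punctuation removal, first-comma retention, subfield code masking); it is not the full NACO rule set, and the function name normalize_heading and its keep_first_comma parameter are hypothetical.

```python
import re
import unicodedata

SUBFIELD_MARK = "\x1f"  # internal placeholder for a masked subfield code
COMMA_MARK = "\x1e"     # internal placeholder for the retained first comma


def normalize_heading(heading, keep_first_comma=True):
    """Simplified normalization sketch; not the complete NACO rules."""
    # Mask subfield codes such as "$a" or "$d", keeping their position.
    text = re.sub(r"\$[a-z0-9]", SUBFIELD_MARK, heading)
    # Decompose characters and drop combining marks (diacritics).
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove case differences.
    text = text.upper()
    # Protect the first comma so the punctuation step does not remove it.
    if keep_first_comma:
        text = text.replace(",", COMMA_MARK, 1)
    # Replace all remaining punctuation with spaces.
    text = "".join(
        ch if ch.isalnum() or ch.isspace() or ch in (SUBFIELD_MARK, COMMA_MARK)
        else " "
        for ch in text
    )
    # Restore the markers and collapse runs of whitespace.
    text = text.replace(SUBFIELD_MARK, "$-").replace(COMMA_MARK, ",")
    return " ".join(text.split())


print(normalize_heading("$a Smith, John, $d 1953- "))
print(normalize_heading("$a Smith, John, $c 1953- "))
print(normalize_heading("Chandra, Prakash"))
print(normalize_heading("Chandra Prakāsh"))
```

The two Smith headings normalize to the same string, while the two Chandra headings stay distinct because the first comma survives. Passing keep_first_comma=False makes the Chandra headings collapse into one.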
However, outside of MARC name and title headings, the NACO normalization rules do not necessarily lead to good results. Subject headings sometimes use the same words to mean different things, with the difference signified by the subfield code used. For example, under current coding practice, "... $x Portraits" and "... $v Portraits" mean different things when they appear in a subject heading string (the former is a work about portraits of a person, while the latter is a collection of those portraits). Treating them as identical is not necessarily the best choice--though neither is indexing them separately with no indication of what the difference between them is, but that's a different discussion. Similarly, the first comma in a topical subject string does not mark a valid distinction between two headings; e.g., "Short stories, African" and "Short stories African" would not both be valid, separate headings, and allowing normalization to treat them as identical prevents an inadvertently omitted comma from disrupting filing order.
There are also kinds of normalization which NACO normalization does not include, but which have proven very useful. NACO normalization makes no changes to numbers, but there are standardized changes that can be made to numbers in specified subfields that will enable them to file in a numeric rather than a decimal order (i.e., 3, 21, 111 instead of 111, 21, 3). This type of normalization may be judged appropriate, despite not being part of NACO normalization.
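One common way to get numeric filing order is to left-pad each run of digits with zeros before comparing. The sketch below assumes that approach; the function name pad_numbers and the width of 10 are illustrative choices, not part of any published rule.

```python
import re


def pad_numbers(text, width=10):
    """Left-pad every run of digits with zeros so strings sort numerically."""
    return re.sub(r"\d+", lambda match: match.group().zfill(width), text)


values = ["no. 111", "no. 21", "no. 3"]

character_order = sorted(values)
numeric_order = sorted(values, key=pad_numbers)

print(character_order)
print(numeric_order)
```

Sorting on the padded key files "no. 3" before "no. 21" before "no. 111", while the displayed values themselves are left unchanged.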
More details about NACO normalization can be found at http://www.loc.gov/catdir/pcc/naco/normrule.html .
Clay Shirky has posted a piece titled "Ontology Is Overrated: Ontologies, Links, and Tags" at http://shirky.com/writings/ontology_overrated.html The piece has attracted some interest among library staff. The rest of this entry is a set of notes in response to points raised by Shirky. Though I agree with his basic point about the viability of uncontrolled user tagging of web content as a means of extending access, I also contend much of his case that ontology is "overrated" is overstated. The comments in the extended entry here will be easier to follow if the Shirky piece is read first.
It is not universally true that library categorization/classification systems try to anticipate all new categories. Dewey’s system was originally designed to contain all possible topics, but LCC and LCSH both were designed to be responsive to literary warrant, i.e., to expand and incorporate new categories over time as they emerge, and NOT to try to map out all knowledge in advance. Even Dewey is responsive to the emergence of new categories of information. The difficulty Dewey editors have squeezing large new areas of research into narrow predetermined class ranges is evidence of the hazards of attempting too fine an analysis in one’s original classification. But the mistake was Dewey’s, not LC’s. The LC system is more flexible, though it too has its overcrowded areas.
“Noble gases” is just an arguable label. The category is consistent and coherent. The fact that it’s been arguably labeled doesn’t call the success of the categorization/classification into question.
Classification places single items for retrieval. It also places related items together. Shirky seems to acknowledge only the former function, and to ignore the latter.
Before making assertions about how Yahoo views its users and how Yahoo perceives reality, it might be good to know more about what constraints their data storage and retrieval software impose on their definitions of categories and assignment of items to them. The decision to assign categories to a single rather than multiple hierarchies may be a software constraint, not a philosophical position.
The assignment of an object to multiple categories is really managed by subject headings, not the addition of secondary classifications. Are we going to get to LCSH? Apparently not.
Google search does not categorize at all. It just retrieves on user-specified terms. It sorts the results, but the sorting is not categorical. On the other hand, Google’s Directory is classificatory, and works a lot like Yahoo. (Oh, but fewer people use it, so it must not have any value.)
Decisions about how to build brief representations of objects (i.e., catalog records) because the full content is inaccessible to machine manipulation are different in kind from the decisions made possible when using computers to analyze fully machine readable documents for retrieval. It’s like comparing the physical skills needed for professional football to those needed for PlayStation football.
Search vs. categorization is a false dichotomy. Presuming that a user preference for search over the categories in a majority of cases proves categories have no use and are not desirable is a false leap of judgment. If categorization is a bad idea, why does anyone use Yahoo?
The characteristics suggested for domains amenable to classification lean too heavily on one example, and misjudge the value of bringing like but not identical things together within a browsable frame. It also misplaces the unit with which ontology is concerned—i.e., not the elemental components, but the categories within which they fall. It’s the logic of ideational collectives into which the elements fall that determines whether classification is useful, not the distinctions between the base level units being organized.
The point made about the loss of nuanced semantics when a preferred term is selected from several candidates is valid, to a point. But it overreaches in implying that the differences between terms mean that there is no relationship of interest between them. Someone looking for a conversation with like-minded others might appreciate the distinctions in LiveJournal, but someone researching the range of opinions on a topic would find the data gathering task under this system significantly more laborious.
The notion that classification/categorization is a one-time-for-all activity is false. LCC adds new class numbers, LCSH adds new terms and revises old ones, DDC has its phoenix schedules. The fact that these systems are dynamic rather than static does mean that they require more work than the naïve might assume, but it does not disprove their utility. (And “Former Soviet Union” is primarily a label for a part of the world. If a better label were found, no massive reshelving at LC would be required to implement it. Alternatively, if Shirky is suggesting that some other organization and sequencing of materials about the places in the former Soviet Union would be superior, is he not engaging in the dreaded “fortune telling” himself?)
URLs are also volatile, and can point to things that require payment or authentication to access. For both of these reasons, user categorizing of links can be a wasted effort of no use to other users.
Many words, even words in natural language, contain an element of classification. The fact that one can derive sets of similar objects from users’ natural language terminology does not mean that no classification has been employed. It just means that the underlying classificatory functions of words have been invoked to derive a set of transpersonal categorizations.
The selectivity of tagged links in del.icio.us gets scant attention. Given that it only includes URLs that someone has chosen to preserve, is it really a strategy for tagging all URLs? If it were to be extended that way, would it still be as useful?
It’s also interesting that no comparison is made between del.icio.us’s tagging of URLs by users of web pages vs the metadata tagging of URL-addressed pages by their creators. Part of the utility of del.icio.us may derive from its aggregation of reviewers’ rather than creators’ tags.
Shirky also slights the role of language as a structured system that stands between individuals and the objects they describe. The rules and conventions of language both enable and constrain both freedom and consensus in the use of terms. Formal classificatory systems draw on the ability of language to represent logical relationships for shared understanding and use. The informal ones that Shirky is advocating do the same, but with a lot less discipline and a higher risk of tagged items being “unfindable” because of idiosyncratic use of terms, or being findable only through a laborious series of searches for multiple terms.
For all the disagreement expressed above, I fully agree with Shirky’s point that informal user-driven building of access paths and terminology is the only viable approach to making the bulk of web objects accessible. I also agree that the consistencies and relationships that emerge when vast amounts of informally generated tagging data are aggregated have great utility. I just don’t think that the alternative methodologies he disparages or their appropriate applications have been treated fairly, or even well understood. I also find the notion that classification is a kind of arrogant assault on freedom to be tiresome. Formal classification is a game with rules. Playing the game means accepting the rules, and thereby getting the benefits of a more disciplined approach to organizing information. Shirky’s boast that end users can apply terminology “without having to agree with anyone else about how something ‘should’ be tagged” is analogous to arguing that the umpire has no right to insist on the number of bases that “should” be tagged before heading for home plate. It’s the attitude behind that sneering “should” that grates.
MNCAT is our library of record, not WorldCat. For various reasons, WorldCat is an unreliable guide to what the University Libraries holds. For more detail, read on.
The libraries do all our original and copy cataloging and catalog maintenance in MNCAT, our local catalog. We regularly export batches of records to OCLC and RLG, the two bibliographic utilities to which we belong. However, this does not mean that OCLC's WorldCat and RLG's RLIN database accurately mirror what is in MNCAT.
OCLC does not load all records immediately. OCLC's matching algorithm is fairly complex, looking for multiple points of consistency. When such things as title, year of publication, ISBN/ISSN, publisher, and system number all match up, OCLC promptly and reliably adds our holdings symbol to their WorldCat record. However, if there are discrepancies, or if the record is new to OCLC, it is shunted off for manual review, which can be very slow. Consequently, original University of Minnesota records (e.g., for theses) are often not found in WorldCat, or may be long delayed before finally appearing there. Also, OCLC does not routinely remove our holdings symbol when we include "deleted" records in the batches we send, so many of the titles we appear in WorldCat to hold have actually been withdrawn or lost, as indicated in MNCAT.
Prior to our move to Aleph, RLIN was a more accurate mirror of MNCAT, because RLIN maintains a copy of each RLG member institution's record. While in NOTIS, we would routinely batch update our bibliographic records in RLIN whenever changes were made to the bib record in MNCAT. Since moving to Aleph, we have limited RLIN updates to cases where the MNCAT holdings record has been added or modified; so heading corrections no longer prompt an update of the RLIN copy of our record. Still, RLIN and Eureka are currently a closer reflection of MNCAT than WorldCat.
MNCAT is our library of record. MNCAT should be consulted to find out what we hold, and where it is shelved. When WorldCat indicates that we have a record but MNCAT does not, WorldCat is wrong in the vast majority of cases; and many of the titles we do hold do not and may never appear in WorldCat.
Librarians usually think of authority control in the context of the library catalog, where an elaborate set of rules and a long history of consensus building and professional discipline have resulted in a relatively high uniformity of practice and understanding. Other communities, perceiving the value ascribed to authorized vocabularies, have shown an interest in making use of these vocabularies and presumably of adding the value inherent in them to other resource discovery tools and databases. But where exactly does the “added value” of the terms in a controlled vocabulary reside? (What follows focuses on subject vocabularies, but could also be applied to other kinds of controlled vocabularies.)
One simple notion is that the added value resides in the terms themselves. There is some logic to this. The terms in an authorized list like Library of Congress Subject Headings (LCSH) have been selected from a number of synonyms as being the “best” in some sense (e.g., most used by practitioners of a field). Automatic validation of a database’s terms against the source list can ensure formal consistency with the approved terminology. Declaration of a database’s terms’ adherence to a known authorized list can in principle enable searchers looking across databases to use a single controlled vocabulary. It would appear that all of these good things can be achieved simply by picking terms from an authorized list. This could be called formal adherence to a controlled vocabulary.
The problem with formal adherence is that it sidesteps the question of what terms mean. Only the simplest controlled terms lists use terms that are fully and mutually exclusive of one another. In most cases, the relationships between terms are more complex. In LCSH, two main devices are used to provide logical hierarchies of terms. Broad terms are assigned one or more subdivisions to narrow the meaning of a heading. Related terms are linked by cross references in broader/narrower hierarchies, again to specify a range of broad to narrow meanings. The specific meaning and correct application of a term in this kind of system depends on an understanding of the system’s principle of using the most specific terminology appropriate for describing a particular resource, and of how each term relates to the larger list. This could be called semantic adherence. Semantic adherence to a controlled vocabulary goes beyond formal adherence in that it seeks consistency with the meaning of terms as defined in the authorizing source, as well as with their forms.
Semantic adherence is more complicated than formal adherence. The specific meanings of terms are not routinely expressed in the authority records which make up a controlled vocabulary, and are rarely conveyed in a simple list of the terms. They are not typically available to automatic term matching algorithms, and cannot typically be validated by machine tests. They may not be understood by users of the list who are not informed about the rules and discipline of the community that prepared the list. For all these reasons, it is doubtful that simple formal adherence to a controlled vocabulary will equate with semantic adherence to the same vocabulary.
The question then becomes, how much does the added value of a controlled term depend simply on its form, and how much does it depend on both the term’s form and its semantics? Consider two examples:
The diary of an American Civil War Confederate soldier might be assigned any of the following controlled LCSH headings, all of which are formally valid:
United States
United States—History
United States—History—Civil War, 1861-1865
United States—History—Civil War, 1861-1865—Personal narratives
United States—History—Civil War, 1861-1865—Personal narratives, Confederate
A collection of photos of the B-1 bomber might be assigned any of the following controlled LCSH headings, all formally valid:
Airplanes
Government aircraft
Airplanes, Military
Bombers
Strategic bombers
Jet bombers
Supersonic bombers
B-1 bomber
In the context of a particular database, any of these levels of specificity might seem more or less appropriate for describing the specified content. However, to be consistent with LCSH semantics and rules of application, only the last and most specific terms would be appropriate. If semantic adherence is not undertaken by the creators of a set of databases, how much of the value of the controlled terms will be preserved for users attempting to search across those databases with the controlled vocabulary?
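The gap between formal and semantic adherence can be sketched with the Civil War example. The BROADER table below is an assumption made for illustration: it simply chains the five headings in the order listed above, with "--" standing in for the subdivision separator. It is not how LCSH authority data is actually structured, and the function names are hypothetical.

```python
# Hypothetical broader-heading links, chained in the order the headings
# are listed in the text; "--" stands in for the subdivision separator.
BROADER = {
    "United States--History": "United States",
    "United States--History--Civil War, 1861-1865": "United States--History",
    "United States--History--Civil War, 1861-1865--Personal narratives":
        "United States--History--Civil War, 1861-1865",
    "United States--History--Civil War, 1861-1865--Personal narratives, Confederate":
        "United States--History--Civil War, 1861-1865--Personal narratives",
}

AUTHORIZED = set(BROADER) | set(BROADER.values())


def is_formally_valid(heading):
    """Formal adherence: the heading appears in the authorized list."""
    return heading in AUTHORIZED


def depth(heading):
    """Number of broader headings above this one in the chain."""
    steps = 0
    while heading in BROADER:
        heading = BROADER[heading]
        steps += 1
    return steps


def most_specific(candidates):
    """The candidate that sits deepest in the broader/narrower chain."""
    return max(candidates, key=depth)


print(all(is_formally_valid(h) for h in AUTHORIZED))
print(most_specific(AUTHORIZED))
```

Every heading passes the formal check, so validation against the list alone cannot tell a cataloger which one to assign. Choosing the most specific heading requires the relationships between terms, which a flat list of valid strings does not carry.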
User satisfaction with searching across databases which share only formal adherence to a particular controlled vocabulary is a question that can only be properly settled by research. However, in principle the value of controlled terms does not reside simply in their “valid” form. It depends also and crucially on their semantics. The added value of authority control thus depends on:
acknowledgment of an authoritative source for the form and meaning of terms
discipline in the application of terms
As agencies and database creators seek to expand the utility of controlled vocabularies for users outside their source communities, there is an increased need to assign explicit definitions to authority records and to inform outside users of their rules of application. This need for more explicitly specific semantics in authority records was recognized by Francoise Bourdon in her book International cooperation in the field of authority data (Saur, 1993) in relation to name authority files; it is just as true for subjects, and for other kinds of controlled vocabularies.
Authority control is a system librarians have devised to manage access to the records for their collections. When a number of items are held in a collection that have a common characteristic--same author, same topic, same series, versions of the same work--users of the collection may want to see the records for those items brought together. In online catalogs, authority control uses a separate file of records called authority records (distinct from bibliographic records, which represent items in the collection). The authority file defines uniform headings for each of these common characteristics--names, subjects, series, and entries for works. Also needed are rules for how to create and use these headings, and system functions to link the authority records to the bibliographic records and provide an integrated display of information from both. Authority records also enable the library to provide entries for variants of the uniform heading, and to indicate relationships between headings.
For example, if catalog users are searching for works by F. Scott Fitzgerald, they can enter a search for either "fitzgerald f scott" or "fitzgerald francis scott". Both searches will find the heading in the catalog where the author's works are listed: Fitzgerald, F. Scott (Francis Scott), 1896-1940. Authority control ensures that only one form of the heading is used, and that it can still be searched in a number of ways. Similarly, a search on "fbi" will bring up a link to the heading for the body: United States. Federal Bureau of Investigation.
Authority records can also express relationships between headings. The authority for the FBI also points to a heading for the body's earlier name: United States. Dept. of Justice. Division of Investigation. The authority record for the topic Ostriches also points to broader and related topics: Birds, Ratites, and Cookery (Ostrich).
Often, authorized headings include more information than anyone would search for. This enables the system to display a clear and distinct heading in response to a partial or truncated search (e.g., "fitzgerald f s" leads to the full name heading above), and to distinguish between two similar entries (e.g., Wagner, Richard, 1813-1883, for the composer, and Wagner, Richard, 1939-1972, a dancer and choreographer).
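The searches described above can be sketched as a left-anchored match against both the authorized heading and its variants. This is a minimal illustration, not a description of any catalog system: the record layout, the exact wording of the variant forms, and the function names are assumptions.

```python
import re

# Simplified authority records; the variant forms are illustrative guesses.
AUTHORITY_RECORDS = [
    {
        "heading": "Fitzgerald, F. Scott (Francis Scott), 1896-1940",
        "variants": ["Fitzgerald, Francis Scott"],
    },
    {
        "heading": "United States. Federal Bureau of Investigation",
        "variants": ["FBI"],
    },
]


def search_key(text):
    """Lowercase the text and drop punctuation, leaving space-separated words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def find_headings(query):
    """Return authorized headings whose heading or variant starts with the query."""
    key = search_key(query)
    found = []
    for record in AUTHORITY_RECORDS:
        forms = [record["heading"]] + record["variants"]
        if any(search_key(form).startswith(key) for form in forms):
            found.append(record["heading"])
    return found


for query in ("fitzgerald f scott", "fitzgerald francis scott",
              "fitzgerald f s", "fbi"):
    print(query, "->", find_headings(query))
```

Each of the three Fitzgerald searches, including the truncated one, leads to the single authorized heading, and "fbi" leads to the heading for the bureau.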
Control freak is a blog devoted to library catalog authority control. It features answers to questions that I get about specific authority control issues, rules, and functions, and occasional comments on the state of the art. Additional questions and comments are always welcome.
Stephen