May 23, 2005

Is ontology overrated?

Clay Shirky has posted a piece titled "Ontology Is Overrated: Categories, Links, and Tags." The piece has attracted some interest among library staff. The rest of this entry is a set of notes responding to points Shirky raises. Though I agree with his basic point that uncontrolled user tagging of web content is a viable means of extending access, I also contend that much of his case that ontology is "overrated" is itself overstated. The comments in the extended entry here will be easier to follow if the Shirky piece is read first.

It is not universally true that library categorization/classification systems try to anticipate all new categories. Dewey’s system was originally designed to contain all possible topics, but LCC and LCSH both were designed to be responsive to literary warrant, i.e., to expand and incorporate new categories over time as they emerge, and NOT to try to map out all knowledge in advance. Even Dewey is responsive to the emergence of new categories of information. The difficulty Dewey editors have squeezing large new areas of research into narrow predetermined class ranges is evidence of the hazards of attempting too fine an analysis in one’s original classification. But the mistake was Dewey’s, not LC’s. The LC system is more flexible, though it too has its overcrowded areas.

“Noble gases” is just an arguable label. The category is consistent and coherent. The fact that it’s been arguably labeled doesn’t call the success of the categorization/classification into question.

Classification places single items for retrieval. It also places related items together. Shirky seems to acknowledge only the former function, and to ignore the latter.

Before making assertions about how Yahoo views its users and how Yahoo perceives reality, it might be good to know more about what constraints their data storage and retrieval software impose on their definitions of categories and assignment of items to them. The decision to assign categories to a single rather than multiple hierarchies may be a software constraint, not a philosophical position.

The assignment of an object to multiple categories is really managed by subject headings, not the addition of secondary classifications. Are we going to get to LCSH? Apparently not.

Google search does not categorize at all. It just retrieves on user-specified terms. It sorts the results, but the sorting is not categorical. On the other hand, Google’s Directory is classificatory, and works a lot like Yahoo. (Oh, but fewer people use it, so it must not have any value.)

Decisions about how to build brief representations of objects (i.e., catalog records) because the full content is inaccessible to machine manipulation are different in kind from the decisions made possible when using computers to analyze fully machine readable documents for retrieval. It’s like comparing the physical skills needed for professional football to those needed for PlayStation football.

Search vs. categorization is a false dichotomy. Presuming that users’ preference for search over categories in a majority of cases proves that categories are useless and undesirable is an unwarranted leap. If categorization is a bad idea, why does anyone use Yahoo?

The characteristics suggested for domains amenable to classification lean too heavily on one example, and misjudge the value of bringing like but not identical things together within a browsable frame. They also misplace the unit with which ontology is concerned: not the elemental components, but the categories within which they fall. It’s the logic of the ideational collectives into which the elements fall that determines whether classification is useful, not the distinctions between the base-level units being organized.

The point made about the loss of nuanced semantics when a preferred term is selected from several candidates is valid, to a point. But it overreaches in implying that the differences between terms mean that there is no relationship of interest between them. Someone looking for a conversation with like-minded others might appreciate the distinctions in LiveJournal, but someone researching the range of opinions on a topic would find the data gathering task under this system significantly more laborious.
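To make the trade-off concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tagged items, the tag variants, and the mapping of variants to a single preferred heading are hypothetical, and the function names are mine. It shows only the retrieval consequence: a search on one free tag finds one community's items, while a search on the preferred term gathers them all in one pass.

```python
# Hypothetical (item, free-text tag) pairs, as users might have entered them.
TAGGED_ITEMS = [
    ("post-1", "movies"),
    ("post-2", "film"),
    ("post-3", "cinema"),
    ("post-4", "movies"),
]

# Hypothetical controlled vocabulary: each variant maps to one preferred term.
PREFERRED_TERM = {
    "movies": "Motion pictures",
    "film": "Motion pictures",
    "cinema": "Motion pictures",
}


def search_by_tag(items, tag):
    """Return items carrying exactly this free-text tag."""
    return sorted(item for item, item_tag in items if item_tag == tag)


def search_by_heading(items, heading, preferred):
    """Return items whose tag maps to the given preferred term."""
    return sorted(
        item for item, item_tag in items if preferred.get(item_tag) == heading
    )


if __name__ == "__main__":
    # One variant finds only part of the material.
    print(search_by_tag(TAGGED_ITEMS, "film"))
    # The preferred term gathers every variant at once.
    print(search_by_heading(TAGGED_ITEMS, "Motion pictures", PREFERRED_TERM))
```

The free tags keep the distinctions a like-minded reader may care about; the mapping is what spares the researcher from running one search per variant. Nothing prevents keeping both.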

The notion that classification/categorization is a one-time-for-all activity is false. LCC adds new class numbers, LCSH adds new terms and revises old ones, DDC has its phoenix schedules. The fact that these systems are dynamic rather than static does mean that they require more work than the naïve might assume, but it does not disprove their utility. (And “Former Soviet Union” is primarily a label for a part of the world. If a better label were found, no massive reshelving at LC would be required to implement it. Alternatively, if Shirky is suggesting that some other organization and sequencing of materials about the places in the former Soviet Union would be superior, is he not engaging in the dreaded “fortune telling” himself?)

URLs are also volatile, and can point to things that require payment or authentication to access. For both of these reasons, user categorizing of links can be a wasted effort of no use to other users.

Many words, even words in natural language, contain an element of classification. The fact that one can derive sets of similar objects from users’ natural language terminology does not mean that no classification has been employed. It just means that the underlying classificatory functions of words have been invoked to derive a set of transpersonal categorizations.

The selectivity of the tagged links in the link-tagging service Shirky describes gets scant attention. Given that it only includes URLs that someone has chosen to preserve, is it really a strategy for tagging all URLs? If it were to be extended that way, would it still be as useful?

It’s also interesting that no comparison is made between the service’s tagging of URLs by users of web pages and the metadata tagging of URL-addressed pages by their creators. Part of the service’s utility may derive from its aggregation of reviewers’ rather than creators’ tags.

Shirky also slights the role of language as a structured system that stands between individuals and the objects they describe. The rules and conventions of language both enable and constrain both freedom and consensus in the use of terms. Formal classificatory systems draw on the ability of language to represent logical relationships for shared understanding and use. The informal ones that Shirky is advocating do the same, but with a lot less discipline and a higher risk of tagged items being “unfindable” because of idiosyncratic use of terms, or being findable only through a laborious series of searches for multiple terms.

For all the disagreement expressed above, I fully agree with Shirky’s point that informal user-driven building of access paths and terminology is the only viable approach to making the bulk of web objects accessible. I also agree that the consistencies and relationships that emerge when vast amounts of informally generated tagging data are aggregated have great utility. I just don’t think that the alternative methodologies he disparages or their appropriate applications have been treated fairly, or even well understood. I also find the notion that classification is a kind of arrogant assault on freedom to be tiresome. Formal classification is a game with rules. Playing the game means accepting the rules, and thereby getting the benefits of a more disciplined approach to organizing information. Shirky’s boast that end users can apply terminology “without having to agree with anyone else about how something ‘should’ be tagged” is analogous to arguing that the umpire has no right to insist on the number of bases that “should” be tagged before heading for home plate. It’s the attitude behind that sneering “should” that grates.

Posted by s-hear at 9:50 AM

May 11, 2005

WorldCat says we have this. How come I can't find it in MNCAT?

MNCAT is our catalog of record, not WorldCat. For various reasons, WorldCat is an unreliable guide to what the University Libraries hold. For more detail, read on.

The libraries do all our original and copy cataloging and catalog maintenance in MNCAT, our local catalog. We regularly export batches of records to OCLC and RLG, the two bibliographic utilities to which we belong. However, this does not mean that OCLC's WorldCat and RLG's RLIN database accurately mirror what is in MNCAT.

OCLC does not load all records immediately. OCLC's matching algorithm is fairly complex, looking for multiple points of consistency. When such things as title, year of publication, ISBN/ISSN, publisher, and system number all match up, OCLC promptly and reliably adds our holdings symbol to its WorldCat record. However, if there are discrepancies, or if the record is new to OCLC, it is shunted off for manual review, which can be very slow. Consequently, original University of Minnesota records (e.g., for theses) are often not found in WorldCat, or may be long delayed before finally appearing there. Also, OCLC does not routinely remove our holdings symbol when we include "deleted" records in the batches we send, so many of the titles that WorldCat shows us as holding have actually been withdrawn or lost, as indicated in MNCAT.
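For readers who think in code, the routing just described can be sketched roughly as follows. This is not OCLC's actual algorithm, which I have not seen and which is surely more forgiving than exact equality; the record type, field names, and return values are all hypothetical. It only illustrates why a brand-new record, or one with any discrepancy, ends up in the slow queue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BibRecord:
    """Hypothetical, heavily simplified bibliographic record."""
    title: str
    year: int
    standard_number: Optional[str]  # ISBN or ISSN, if any
    publisher: str
    system_number: Optional[str]


# The points of consistency named above (assumed to be compared exactly).
MATCH_FIELDS = ("title", "year", "standard_number", "publisher", "system_number")


def route_record(incoming: BibRecord, existing: Optional[BibRecord]) -> str:
    """Decide what happens to one record from an exported batch.

    Returns "add_holdings" when every match point agrees with the
    existing record, and "manual_review" otherwise.
    """
    if existing is None:
        # The record is new to the utility: no counterpart to match against.
        return "manual_review"
    for field_name in MATCH_FIELDS:
        if getattr(incoming, field_name) != getattr(existing, field_name):
            # Any single discrepancy is enough to divert the record.
            return "manual_review"
    return "add_holdings"
```

An original record for a local thesis has no counterpart, so under this sketch it always takes the `"manual_review"` path, which matches our experience of theses being slow to appear.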

Prior to our move to Aleph, RLIN was a more accurate mirror of MNCAT, because RLIN maintains a copy of each RLG member institution's record. While in NOTIS, we would routinely batch update our bibliographic records in RLIN whenever changes were made to the bib record in MNCAT. Since moving to Aleph, we have limited RLIN updates to cases where the MNCAT holdings record has been added or modified; so heading corrections no longer prompt an update of the RLIN copy of our record. Still, RLIN and Eureka are currently a closer reflection of MNCAT than WorldCat.

MNCAT is our catalog of record. MNCAT should be consulted to find out what we hold, and where it is shelved. When WorldCat indicates that we hold a title but MNCAT does not, WorldCat is wrong in the vast majority of cases; and many of the titles we do hold do not and may never appear in WorldCat.

Posted by s-hear at 9:06 AM