May 23, 2005

Is ontology overrated?

Clay Shirky has posted a piece titled "Ontology Is Overrated: Ontologies, Links, and Tags" (http://shirky.com/writings/ontology_overrated.html). The piece has attracted some interest among library staff. The rest of this entry is a set of notes in response to points raised by Shirky. Though I agree with his basic point about the viability of uncontrolled user tagging of web content as a means of extending access, I also contend that much of his case that ontology is "overrated" is itself overstated. The comments in the extended entry here will be easier to follow if the Shirky piece is read first.

It is not universally true that library categorization/classification systems try to anticipate all new categories. Dewey’s system was originally designed to contain all possible topics, but both LCC and LCSH were designed to be responsive to literary warrant, i.e., to expand and incorporate new categories over time as they emerge, and NOT to try to map out all knowledge in advance. Even Dewey is responsive to the emergence of new categories of information. The difficulty Dewey editors have squeezing large new areas of research into narrow predetermined class ranges is evidence of the hazards of attempting too fine an analysis in one’s original classification. But the mistake was Dewey’s, not LC’s. The LC system is more flexible, though it too has its overcrowded areas.

“Noble gases” is just an arguable label. The category is consistent and coherent. The fact that it’s been arguably labeled doesn’t call the success of the categorization/classification into question.

Classification places single items for retrieval. It also places related items together. Shirky seems to acknowledge only the former function, and to ignore the latter.

Before making assertions about how Yahoo views its users and how Yahoo perceives reality, it might be good to know more about what constraints their data storage and retrieval software impose on their definitions of categories and assignment of items to them. The decision to assign categories to a single rather than multiple hierarchies may be a software constraint, not a philosophical position.

The assignment of an object to multiple categories is really managed by subject headings, not the addition of secondary classifications. Are we going to get to LCSH? Apparently not.
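The distinction can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the records, class numbers, and headings below are invented placeholders, not real LCC numbers or LCSH terms. Each item gets exactly one class number (its place on the shelf) but any number of subject headings, and it is the heading index that brings one item into several categories at once.

```python
from collections import defaultdict

# Hypothetical catalog records: the class numbers and headings are
# made-up placeholders, not real LCC or LCSH values.
records = [
    {"title": "Book A", "class_number": "X100", "subjects": ["Tagging", "Classification"]},
    {"title": "Book B", "class_number": "X200", "subjects": ["Classification"]},
    {"title": "Book C", "class_number": "Y300", "subjects": ["Tagging", "Web search"]},
]


def build_subject_index(records):
    """Map each subject heading to the titles that carry it."""
    index = defaultdict(list)
    for record in records:
        for heading in record["subjects"]:
            index[heading].append(record["title"])
    return dict(index)


def shelf_order(records):
    """Each title appears once, in the single place its class number puts it."""
    return [r["title"] for r in sorted(records, key=lambda r: r["class_number"])]


subject_index = build_subject_index(records)
```

Here `shelf_order(records)` lists every title exactly once, while `subject_index["Tagging"]` and `subject_index["Classification"]` both include "Book A".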

Google search does not categorize at all. It just retrieves on user-specified terms. It sorts the results, but the sorting is not categorical. On the other hand, Google’s Directory is classificatory, and works a lot like Yahoo. (Oh, but fewer people use it, so it must not have any value.)

Deciding how to build brief representations of objects (i.e., catalog records) when the full content is inaccessible to machine manipulation is different in kind from the decisions that become possible when computers can analyze fully machine-readable documents for retrieval. It’s like comparing the physical skills needed for professional football to those needed for PlayStation football.

Search vs. categorization is a false dichotomy. To presume that users’ preference for search over categories in a majority of cases proves that categories are useless and undesirable is an unwarranted leap. If categorization is a bad idea, why does anyone use Yahoo?

The characteristics suggested for domains amenable to classification lean too heavily on one example, and misjudge the value of bringing like but not identical things together within a browsable frame. They also misplace the unit with which ontology is concerned: not the elemental components, but the categories within which they fall. It’s the logic of the ideational collectives into which the elements fall that determines whether classification is useful, not the distinctions between the base-level units being organized.

The point made about the loss of nuanced semantics when a preferred term is selected from several candidates is valid, to a point. But it overreaches in implying that the differences between terms mean that there is no relationship of interest between them. Someone looking for a conversation with like-minded others might appreciate the distinctions in LiveJournal, but someone researching the range of opinions on a topic would find the data gathering task under this system significantly more laborious.
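A rough sketch of the trade-off, in Python. Everything here is an assumption made for illustration: the tags, the items, and the variant-to-preferred mapping are invented, and no real tagging service is being modeled. A raw tag search preserves each tagger's distinction but returns only exact matches, while a search routed through a preferred term gathers the variants in one pass.

```python
# Hypothetical mapping from variant tags to a single preferred term.
PREFERRED = {"movies": "film", "cinema": "film", "films": "film"}

# Hypothetical (item, tag) pairs as users might have entered them.
tagged = [
    ("url1", "movies"),
    ("url2", "cinema"),
    ("url3", "film"),
    ("url4", "jazz"),
]


def search_raw(tagged, term):
    """Return only the items tagged with exactly this term."""
    return sorted(item for item, tag in tagged if tag == term)


def search_controlled(tagged, term, preferred=PREFERRED):
    """Return items whose tag maps to the same preferred term as the query."""
    target = preferred.get(term, term)
    return sorted(item for item, tag in tagged if preferred.get(tag, tag) == target)
```

With this data, `search_raw(tagged, "film")` finds only "url3", so the researcher has to repeat the search for every variant they can think of; `search_controlled(tagged, "cinema")` finds "url1", "url2", and "url3" at once, at the cost of flattening whatever the taggers meant by choosing different words.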

The notion that classification/categorization is a one-time-for-all activity is false. LCC adds new class numbers, LCSH adds new terms and revises old ones, DDC has its phoenix schedules. The fact that these systems are dynamic rather than static does mean that they require more work than the naïve might assume, but it does not disprove their utility. (And “Former Soviet Union” is primarily a label for a part of the world. If a better label were found, no massive reshelving at LC would be required to implement it. Alternatively, if Shirky is suggesting that some other organization and sequencing of materials about the places in the former Soviet Union would be superior, is he not engaging in the dreaded “fortune telling” himself?)

URLs are also volatile, and can point to things that require payment or authentication to access. For both of these reasons, user categorizing of links can be a wasted effort of no use to other users.

Many words, even words in natural language, contain an element of classification. The fact that one can derive sets of similar objects from users’ natural language terminology does not mean that no classification has been employed. It just means that the underlying classificatory functions of words have been invoked to derive a set of transpersonal categorizations.

The selectivity of tagged links in del.icio.us gets scant attention. Given that it only includes URLs that someone has chosen to preserve, is it really a strategy for tagging all URLs? If it were to be extended that way, would it still be as useful?

It’s also interesting that no comparison is made between del.icio.us’s tagging of URLs by users of web pages vs the metadata tagging of URL-addressed pages by their creators. Part of the utility of del.icio.us may derive from its aggregation of reviewers’ rather than creators’ tags.

Shirky also slights the role of language as a structured system that stands between individuals and the objects they describe. The rules and conventions of language enable and constrain both freedom and consensus in the use of terms. Formal classificatory systems draw on the ability of language to represent logical relationships for shared understanding and use. The informal ones that Shirky is advocating do the same, but with a lot less discipline and a higher risk of tagged items being “unfindable” because of idiosyncratic use of terms, or being findable only through a laborious series of searches for multiple terms.

For all the disagreement expressed above, I fully agree with Shirky’s point that informal user-driven building of access paths and terminology is the only viable approach to making the bulk of web objects accessible. I also agree that the consistencies and relationships that emerge when vast amounts of informally generated tagging data are aggregated have great utility. I just don’t think that the alternative methodologies he disparages or their appropriate applications have been treated fairly, or even well understood. I also find the notion that classification is a kind of arrogant assault on freedom to be tiresome. Formal classification is a game with rules. Playing the game means accepting the rules, and thereby getting the benefits of a more disciplined approach to organizing information. Shirky’s boast that end users can apply terminology “without having to agree with anyone else about how something ‘should’ be tagged” is analogous to arguing that the umpire has no right to insist on the number of bases that “should” be tagged before heading for home plate. It’s the attitude behind that sneering “should” that grates.

Posted by s-hear at May 23, 2005 9:50 AM