Brewster Kahle gave the talk at the closing plenary of the CNI Spring Task Force meeting. Brewster just keeps on doing; he never seems daunted by the scope of large tasks. The amazing thing is that it works! He set out to capture the web, and the Internet Archive (IA) does that better than any other entity. He called on us to "put the best we have to offer within the reach of our children." Within reach, to Brewster (and to our children), means "on the web." He then walked us through a back-of-the-napkin calculation of what it would take, concluding that the goal is within our reach today and within our budgets to boot. Are we ready to answer the call?
Books. The Library of Congress = 20M volumes = 26TB = $60,000 of disk space. At 2 hours/book (without destroying the books) this is doable. Output back to book form costs $1/book. This print-on-demand solution is being demonstrated today by the BookMobile the Internet Archive has put on the streets not just of the USA but also of India, Egypt, and, most recently, rural Uganda.
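The napkin math is easy to check. A quick sketch using the figures as Brewster quoted them (the derived per-book and per-gigabyte numbers are mine, not his):

```python
# Back-of-the-napkin check of the figures from the talk.
volumes = 20_000_000          # Library of Congress holdings, as quoted
total_tb = 26                 # total size of the digitized corpus, as quoted
disk_cost = 60_000            # cost of that much disk, as quoted

mb_per_book = total_tb * 1_000_000 / volumes   # ~1.3 MB per volume (text, not images)
cost_per_gb = disk_cost / (total_tb * 1_000)   # ~$2.31/GB, plausible 2004-era pricing

print(f"~{mb_per_book:.1f} MB per book, ~${cost_per_gb:.2f} per GB")
```

At roughly a megabyte per book, the numbers only work for plain text; page images would be orders of magnitude larger, which is presumably why the disk figure is so startlingly small.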
Audio. About 2M "saleable objects" of audio exist, but much of it sits behind IP restrictions that make it hard to deal with. The IA approached the "taper" community: people who have taken advantage of performance-oriented rock bands that, following the Grateful Dead's lead, allow fans to tape their music and exchange it for non-commercial use. "How would you like infinite bandwidth and infinite storage for free?" the IA asked the tapers. Guess what? They love the idea. 500 rock bands have given the IA permission to archive this material and share it for free. The tapers have already produced 10-20TB of concerts, available on the IA.
Moving Images. Don't just consider the 100,000-200,000 mainstream films (half of them from India). Consider the 2M films created in the 20th century that document daily life. Some of these may be in your very own basement. One hour of film costs about $100 to convert. One hour of video costs only $15. The IA is also now capturing 20 channels of video from around the world 24/7 for about $500,000. It is estimated there may be about 400 channels around the world.
Software. The IA has received a DMCA exception to circumvent copy protection for the purpose of ripping some of the 50,000 software packages that exist to date. They are only allowed to rip titles from no-longer-supported operating systems.
Web. The IA now captures 20TB/month of web content. The Wayback Machine holds over 30B (yes, billion) pages from 50M sites on 15M hosts. Anna Patterson's search engine based on this corpus searches 4 times the number of sites covered by Google.
The Internet Archive does all this on a budget of about $4M or $5M each year. I don't know about you, but this leaves me breathless.
In order to preserve this growing corpus (libraries, Brewster notes, traditionally burn eventually), the IA seeks out partners around the world who can host copies of the data. The more different they are from the US, the better. Right now a copy is held at the new library in Alexandria, and negotiations are under way with a northern European country. Brewster estimates that the resources needed to maintain a mirror of the IA are a PB of disk (that's petabyte), a GB of bandwidth, and $100M to set up an appropriate endowment for continued operation.
But if the "Universal Access to All Human Knowledge" goal articulated by Raj Reddy of the Million Book Project is too vast, and even "All Published Knowledge Available to the Kid in Uganda" is a bit far out, how about something easy, asks Brewster. What if we just tried to attack what we already have every right to collect? Let's go for "Public Access to the Public Domain."
In the USA the public domain is pre-1923 publications. In fact, Brewster points out, with the aid of Mike Klezman's (?) recently completed electronic version of the copyright registry, it is now easy to find out which materials from 1923-1964 did not have their copyrights renewed and are now also in the public domain. Let's go get this material! His proposal: give the IA a book and $10 and the IA will return to you the book unharmed plus a digital copy. Will we accept the offer? Oh, and by the way, the IA is also happy to accept video and $15/hour for the conversion of that to digital format. Oh, and did I mention that the IA will also host the digital documents on their servers "forever"?
I think we should take Brewster up on this offer. How much material do we have in the University of Minnesota collections which we could part with for a bit to let the IA digitize and store it? We should seriously consider a project to pump this material and the limited dollars required to the IA as fast as we can. This is a crazy idea at a crazy price point; let's try to sink Brewster under our enthusiastic response! The great thing is, we probably won't: he has not sunk yet.
P.S. Brewster also tossed off an idea about how to archive blogs in response to a question. His thought was that we should be able to subscribe to blog RSS feeds and simply archive everything we see announced via that mechanism. I wonder if we could auto-harvest RSS from UThink.
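As a thought experiment, the RSS-harvest idea needs very little machinery. A minimal sketch using only Python's standard library (the function names are mine, and a real harvester against UThink would need scheduling, de-duplication, and storage on top of this):

```python
# Sketch of Brewster's blog-archiving idea: a feed announces new posts,
# so archiving becomes "parse the feed, fetch every announced link."
import urllib.request
import xml.etree.ElementTree as ET

def parse_items(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            items.append((item.findtext("title", default="(untitled)"), link))
    return items

def harvest(feed_url):
    """Fetch the feed, then fetch and yield each announced page for archiving."""
    with urllib.request.urlopen(feed_url) as resp:
        entries = parse_items(resp.read())
    for title, link in entries:
        with urllib.request.urlopen(link) as page:
            yield title, page.read()
```

The appeal of the approach is that the blog platform does the hard part (announcing what is new); the archive just has to listen.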
I am concerned that the work of the Joint Committee of the Higher Education and Entertainment Communities may do more harm than good by legitimizing some role for higher ed in killing off P2P file sharing. I don't think we have a role; this is a fight between the RIAA and MPAA on one side and American society on the other, and we will just get trampled in the middle. Still, a session updating us on the P2P issue at CNI was interesting. It is clear that EDUCAUSE is finding little workable technology to help satisfy industry demands (tools like Audible Magic and ICARUS are throwing out the legitimate baby with the illegal bathwater). Brewster Kahle was in the audience and asked us to please remember that the Internet Archive depends on P2P for distribution of its legitimate content. If we need an example of real-life content dependent on P2P distribution, he welcomes us to point his way.
I am a pretty visual person and appreciate a well-laid-out graphical representation of an issue. I find one of the masters of our field to be Herbert Van de Sompel. I didn't attend his session today on Federations of Institutional Repositories, but I see the handouts in the CNI packet and am struck again by what lean, direct, and illuminating illustrations he comes up with. I don't know whether he makes this stuff up himself or employs some graphic talent on the back end, but his touch has been so consistent over the years in many contexts that I suspect the former. I hear many people laud the interface of SFX, few of whom realize just how much it is the vision of Herbert, who showed "rough" versions of SFX many years before it became a commercial product with virtually the same interface it still enjoys. If you want to see what I consider PowerPoint well-used, take a look at a presentation by Herbert some day.
By the way, his work on new roles for MPEG-21 & OAI & OpenURL in federating repositories is quite interesting, thinking way outside the box. Take a look at the D-Lib article he and a few colleagues wrote for a taste.
After lunch a few of us retired to a quieter corner of the hotel to discuss whether it would be worth our time and effort to try to make LOCKSS more of a preservation tool. There was a clear consensus among this group that LOCKSS is not preservation today, and that the project (though it claims a preservation role) is really not doing much (beyond its NSF grant attempt, anyway) to make accommodations in the software for preservation issues. These would include things like issue-level manifests with metadata, file format recognition and metadata (perhaps via JHOVE, which I saw was announced today), or picking up formats other than HTML (maybe an OAI harvest of metadata followed by a harvest of the related deeper-web items). Right now LOCKSS is, in essence, a "bit store"; it is a backup mechanism. In some ways, building these capabilities into LOCKSS installations might also erode some of the wins the system brings in terms of ease of setup and maintenance.
A group funded by the Mellon Foundation is trying to define the bounds of interaction between course management systems (CMS) and repositories. Their report should be available on the DLF web site by the end of May. In today's presentation to CNI they made three fundamental points: (1) users will be getting to repository content through a broad set of "course management" tools that extend well beyond CMS into PowerPoint, Weblogs, Citation Managers and the like; (2) repositories need to attend to a Checklist of requirements and desirables in order to interoperate with this layer of tools; and (3) the process used to build course content can be expressed as "Gather-Create-Share".
This "Gather-Create-Share" seems like a weak echo of Apple's "Rip, Mix, and Burn" campaign a few years ago. It is also the process that Lessig warns us is under threat given the intellectual property regime our country is putting into force. The session really didn't touch on the impediments that copyright puts in the way of the "Gather" step, but I was told that IP issues will be part of the Checklist when the group reports out to the DLF.
Random thought... Could we cut off much of the unwanted workstation traffic by limiting the public browser in a new way? What would happen if the public browser refused to allow more than 100 characters in any text field of any form? Would that be enough to kill its use for email but still allow research use?
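For what it's worth, the 100-character rule would be a small check wherever the browser (or a filtering proxy in front of it) sees the form submission. A hypothetical sketch (the names and the cap are mine, not any real product's):

```python
# Toy gatekeeper for the 100-character idea: inspect a form POST body and
# reject it if any field exceeds the cap. Short search-box queries pass;
# long email bodies do not.
from urllib.parse import parse_qsl

MAX_FIELD_CHARS = 100

def allow_submission(form_body):
    """True if every field in a URL-encoded form body is within the cap."""
    return all(len(value) <= MAX_FIELD_CHARS for _, value in parse_qsl(form_body))
```

Of course, webmail clients that submit via other channels would slip past a filter this naive; it only illustrates the idea.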
Ralph Quarles (IU) and I found each other at the reception. Ralph has offered to help us evaluate our computer support and seek an appropriate model for future support. He noted that he is ready for some ongoing contact with his colleagues at other CIC institutions. The Library IT Directors have that kind of forum in the CIC, but staff at his level, those actually running technology support operations, really don't have many opportunities to reach out to each other. I wonder if we should plan a day or two of professional "shoot the breeze" time at Minnesota for all the folks in these positions? We could do it as part of our investigative effort. This could both help this cohort build connections to one another and serve as a font of wisdom and warning for our own planning effort.
We ask our technology staff to do the seemingly impossible. Our staff is not nearly large enough to manage the kind of deployment we've got around the Libraries. How can only six staff manage 600 machines? On the other hand, could it be a failure of imagination? When I arrived at the U, the first significant decision I made was to kill our attempt to use Sun Ray "appliance" computers to replace public workstations in the Libraries. We had good reasons for that decision, but the fundamental problem was staring us in the face then and remains at the core of our troubles in ITS: we cannot support a deployment of 600 Windows workstations with so few staff. Why can't we change the rules? At MIT I watched an organization deploy and maintain thousands of workstations with fewer staff than we have available to us.
I believe we need to think outside the box. It may not be Sun Ray, but we must recognize our situation (a budget even more limited than it was in 2001) and devise creative solutions to meet our needs within those bounds. I am certain this means compromises, but not necessarily the ones that run our staff ragged without the reward of a computing infrastructure they can take pride in, tell the world about, and share with our community.
My frustration with our current situation expresses itself as a frustration with ugly machinery, and I do believe that computers should be in the process of fading out of sight, but that's a red herring. My real frustration is that I've allowed our expectations to be diminished by accepting the limits we've imposed on ourselves. I wonder if we shouldn't get the CIC equivalents of Directors of ITS together to share their frustrations and triumphs. We could certainly use some inspiration and, who knows, we might even be able to do something inspiring ourselves!
Reagan Moore from the San Diego Supercomputer Center discussed their SRB development. A very dense presentation left me with the basic impression that I need to understand this approach to data storage being developed as part of the NSF grid infrastructure. SRB "provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets." An alternative to LOCKSS? I did hear from one colleague who has already been reviewing SRB that it is a very complex bit of software to install and maintain.
Jerome McDonough, the Digital Library Development Team Leader at NYU, gave a very informative presentation on video standards and preservation. The bottom line was that though 4:4:4 (truly uncompressed; an Apple codec and MJPEG2000 can provide this) video would be the right thing to capture and archive, it is unfortunately much too expensive to store. At NYU they've stepped back to the compromise of capturing 4:2:2 video instead. He noted that anything other than 4:4:4 capture and storage was, in fact, allowing for lossy compression. This is fine until migration has to happen, at which point artifacts will creep in due to the recompression of images. Note that NYU uses Tripwire to create and check up on MD5 checksums (can I use that on Thomas?), and UC Berkeley's OceanStore project is taking a stab at very large, high-performing distributed storage solutions.
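The fixity checking that Tripwire does for NYU is easy to mock up in miniature. A sketch of the checksum-manifest idea (real tools add signed databases, scheduling, and alerting on top of this):

```python
# Toy fixity check: record an MD5 digest for every file under a directory,
# then re-walk later and flag anything whose digest has drifted.
import hashlib
from pathlib import Path

def md5_of(path, chunk=1 << 20):
    """MD5 a file in 1 MB chunks so large video masters needn't fit in RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(root):
    """Map relative path -> MD5 digest for every file under root."""
    root = Path(root)
    return {str(p.relative_to(root)): md5_of(p)
            for p in root.rglob("*") if p.is_file()}

def verify(root, manifest):
    """Return the paths whose current digest no longer matches the manifest."""
    current = build_manifest(root)
    return sorted(k for k in manifest if current.get(k) != manifest[k])
```

The same pattern works whether the bits are 4:2:2 video masters or web pages; the hard part is deciding what to do when `verify` comes back non-empty.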
Ed Ayers gave a wonderfully funny and somewhat touching plenary address about the tensions between "Academic Culture and Computer Culture". His own work includes the well regarded Valley of the Shadow. He described the "communal autonomy" of the university as a defining characteristic. The heart of our institutions is the "mysterious exchange between student and teacher," that "intimate bubble" in which learning happens. He described this activity as a flame, both intense and vulnerable, and our universities as "massive structures to protect those flames." His suggestion was that we build lighter, smaller things that "simplify the vastness," things like instant local class nets that don't rely on the broader campus network. His mantra was "scale down." As he spoke of the intimate bubble of interaction between student and teacher, I began to wonder whether technology is not beginning to pierce that bubble with tools that allow for action in the world from the classroom and feedback from the world into the classroom.
I was really struck by Ed's flickering flame. I get awfully frustrated by the scale and scope of the University of Minnesota. I lament that its mission seems to be: "Everything to everyone!" The bureaucracy can seem endless, the commitment to excellence often lacking, the message muddled, and on and on. But I stay here, and this flickering flame reminded me why. This academic enterprise runs rather counter to most of American culture; it is rather precious. In a culture where profit and individual heroism are prized, we operate an enterprise which spends every penny we are given in the name of creating moments of intimate, hidden victory. People discover who they are on our campus, they encounter mentors, they open their minds to one another. Sure, there is a lot of bureaucracy, not to mention a whole lot of drinking, backstabbing, and just getting by... but what if those things are the cover needed to protect the flickering flame? If our culture actually realized how radical an enterprise this is, would we be allowed to get away with it? We all block the wind so that the flame of learning has a chance to move from one candle to another; maybe not every time we fire up the PowerPoint slides for another class, but maybe enough. Maybe this behemoth of an institution is what it takes to make this opportunity available to more of our neighbors.
Then again, that's pretty dreamy stuff. Quite the rationalization. I still want to stir up our University enough that we don't settle for less than excellence. The Libraries is where this starts for me. And more specifically our IT division and the work we do to put appropriate technology into the hands of our staff, students, and faculty.
Yikes! It's a good thing I'm not doing this blogging stuff more often!
Steve Cawley and I attended a valuable "executive roundtable" bringing CIOs and University Librarians together to discuss "identity." As we discussed the challenges and successes of authentication and authorization in today's academic environment, I began to wonder if we are not focusing a bit too close to home. I note that while libraries felt very secure in their database and searching expertise, tools like Google snuck up outside our borders and transformed user expectations of searching and research so that now we are strangers in our own territory. What will the identity landscape look like five and ten years from now? Will users have an expectation that they can carry an identity into our organization that was credentialed beyond our borders and control?
Some hint of that future may have appeared in the form of a discussion about e-portfolios and their impact on our data models. A move toward portfolios is a move toward users asserting control of their own data (a user as holding the copy of record of their transcript, for example). This turns on its head our current data model where institutions bear the responsibility for holding and managing the continuity of that sort of data. I wonder whether the solution to the buy-in problem for institutional repositories, for example, might be an individual repository model sewn together by metadata harvesting like OAI? Who will be the "portfolio banks" of the future who (for a small fee) manage the physical systems on which your e-portfolio resides and ensure that the policies and permissions you specify for your information are actually carried out when sharing your portfolio?
Some additional notes: Who assigns identities? Who decides their scope? Some discussion of OKI's concept of "authN" (authentication) and "authZ" (authorization). The role of UIN (numbers) vs. NetID (typically names). Distinguishing between the deed of authentication and the trails and logs kept about that deed and subsequent actions (librarians are loath to keep any trail, but doing the deed might be fine). See SPEC Kits 277 and 278 from the ARL and "Mirage of Continuity" by Brian Hawkins. Credit Brad with the notion "sustainable economies in tension with the frontiers of innovation" (all you really have to do to make technology sustainable is stop changing) and Beth with "the economics of compromise" (the notion that organizations are much more willing to work with you after they have experienced a compromise and its costs than before). If setting up a portfolio banking business, what would be your "free as in beer" service lure and what would you charge for? Would password management be part of the package?
Today and tomorrow I'm at the CNI Spring Task Force meeting in Alexandria, Virginia. The Coalition for Networked Information holds these meetings twice a year and I've been lucky enough to work at two institutions that value the CNI's work and think this is a place worth being. I always find CNI meetings very meaty. Most of the sessions are in small breakout rooms, the best of these involving a brief presentation followed by vigorous discussion by a knowledgeable group. As an experiment, I've decided to try to capture my thoughts on the meeting as it proceeds via this blog, so the next few entries will be about my experience of CNI. I hope this helps me actually follow up on some of the ideas sparked here.