
May 10, 2008

"Live from Minnesota", Drupal Search Sprint 2008

Just winding down now from a long day of thinking about and working on Drupal search, I thought I'd spend five minutes jotting down a few impressions.

The People

Robert Douglass, Blake Lucchesi, Djun Kim, David Lesieur, Earnest Berry, Doug Green, and Yours Truly

I'm almost starting to take for granted the fact that Drupal will be sending the University of Minnesota Libraries a handful of authentically brilliant and unusually good-natured developers to visit, this being our second high-profile Drupal project. But yeah, we're "Live from Minnesota" as Doug Green points out (I like the meme, no doubt).

The Work

Like a lot of efforts underway in Drupal, search is moving towards a more extensible model by way of a core API. This shift has me excited for a number of reasons, but mostly due to my interest in the Drupal/Solr combo within the context of academic research environments. By decoupling the search and indexing processes from a specific database implementation, we open the door to a whole new ecosystem of search utilities.

If all goes well with the new search API, I believe it won't be long before we start seeing c|net-style faceted drill-down e-commerce sites in Drupal. We already have a well laid-out Apache Solr Module, but a search API will provide more granular control and better integration with core search, and it will ease the long-term burden of maintenance on Robert by moving portions of his code into core.

Personally, this is a chance for me to really think deeply about how to best harness the wonderful work of Doug Green and others within the context of core search features, especially as regards Solr integration. On an even more practical level, this is my own little unit testing bootcamp as I help to write tests for the work being done here this weekend.

Off to bed and on to day two.

April 19, 2007

Speed Reading XML with PHP XMLReader

First, allow me to bow down and grovel before all those who had a hand in creating the PHP XMLReader Library. Your work is indeed grovel-worthy.

What is the XMLReader Library, you ask? It’s a “thin PHP layer” on top of a ridiculously fast C/C++ XML pull parsing utility, and I thank my lucky stars for bumping into it (Pull parsing XML in PHP - IBM).

I’ve been working on various projects that require lots of citations from PubMed. And, I wrote some scripts to download them en masse via PubMed’s Entrez Programming Utilities, which offer a simple XML API to PubMed data. So, I downloaded a batch of 50k for my latest project, and casually noticed that the resulting file was 250+ megs. I needed to take that data and transform it into BibTeX. “Oh Dear,” I thought, “Will SimpleXML handle all that data?” Nope, it laughed in my face, a deep, long, belly laugh. Then it walked away.

Rats. I could have broken the file down into baby files, but it sort of annoyed me to have to do so, especially as I was inevitably going to have to rinse and repeat this process every time I needed different data. So, I cast about to find a more acceptable solution. And behold, right under my nose, PHP 5 had gone and bundled the XMLReader Library into its core offerings. Yay.

Instead of laughing at me and walking off into the distance, XMLReader chewed right through that massive document like a Minnesotan with a turkey leg at the state fair.
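
For the curious, here is roughly what the pull-parsing pattern looks like. This is a minimal sketch and not the full script from this post: the function names eachArticle and toBibtex are made up for illustration, the element names (PubmedArticle, MedlineCitation, PMID, ArticleTitle, and so on) are assumptions about the layout of PubMed's XML, and the type hints assume PHP 7 or later. The idea is to let XMLReader walk the big file one node at a time and hand only a single citation to SimpleXML at any moment.

```php
<?php
// Sketch only: element names are assumptions about PubMed's XML layout.

/** Yield each <PubmedArticle> as a small SimpleXMLElement, one at a time. */
function eachArticle(XMLReader $reader): Generator
{
    // Walk forward until the cursor sits on the first <PubmedArticle>.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->name === 'PubmedArticle') {
            break;
        }
    }

    while ($reader->nodeType === XMLReader::ELEMENT
        && $reader->name === 'PubmedArticle') {
        // Only this one record is materialized, never the whole file.
        yield new SimpleXMLElement($reader->readOuterXml());

        // Jump to the next sibling record, skipping the subtree just read.
        if (!$reader->next('PubmedArticle')) {
            break;
        }
    }
}

/** Format one citation as a bare-bones BibTeX @article entry (hypothetical helper). */
function toBibtex(SimpleXMLElement $article): string
{
    $cite    = $article->MedlineCitation;
    $journal = $cite->Article->Journal;

    return sprintf(
        "@article{pmid%s,\n  title = {%s},\n  journal = {%s},\n  year = {%s}\n}\n",
        (string) $cite->PMID,
        (string) $cite->Article->ArticleTitle,
        (string) $journal->Title,
        (string) $journal->JournalIssue->PubDate->Year
    );
}

// Usage (the file name is hypothetical):
//   $reader = new XMLReader();
//   $reader->open('pubmed_batch.xml');
//   foreach (eachArticle($reader) as $article) {
//       echo toBibtex($article);
//   }
//   $reader->close();
```

Because the reader only ever holds one record, memory use stays roughly flat however large the download gets. Titles containing inline markup would need more care than the plain string cast used here.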

Enough background. For those of you who do PHP and XML, I’m including my script below. Read on if you’re interested. If you gag at the sight of code, move along, nothing to see here.

Continue reading "Speed Reading XML with PHP XMLReader" »