April 19, 2007

Speed Reading XML with PHP XMLReader

First, allow me to bow down and grovel before all those who had a hand in creating the PHP XMLReader Library. Your work is indeed grovel-worthy.

What is the XMLReader Library, you ask? It’s a “thin PHP layer” on top of a ridiculously fast C/C++ XML pull parsing utility, and I thank my lucky stars for bumping into it (Pull parsing XML in PHP - IBM).
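
For the curious, here is a minimal sketch of what "pull parsing" looks like in practice. It assumes a tiny, made-up XML string (the element names are borrowed from PubMed, but the sample itself is invented for illustration); a real file would be opened with open() rather than XML().

```php
<?php
// Hypothetical sample data, not real PubMed output.
$xml = '<Articles><Article><PMID>1</PMID></Article><Article><PMID>2</PMID></Article></Articles>';

$reader = new XMLReader();
$reader->XML($xml);

$pmids = array();
while ($reader->read()) {
    // The parser hands over one node at a time; we pull out only what we need
    // and never hold the whole document in memory.
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'PMID') {
        $pmids[] = $reader->readString();
    }
}
$reader->close();

print implode(', ', $pmids) . "\n";
```

The point of the loop is that memory use depends on the current node, not on the size of the document.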

I’ve been working on various projects that require lots of citations from PubMed. And, I wrote some scripts to download them en masse via PubMed’s Entrez Programming Utilities, which offer a simple XML API to PubMed data. So, I downloaded a batch of 50k for my latest project, and casually noticed that the resulting file was 250+ megs. I needed to take that data and transform it into BibTeX. “Oh Dear,” I thought, “Will SimpleXML handle all that data?” Nope, it laughed in my face, a deep, long, belly laugh. Then it walked away.

Rats. I could have broken the file down into baby files, but it sort of annoyed me to have to do so, especially as I was inevitably going to have to rinse and repeat this process every time I needed different data. So, I cast about to find a more acceptable solution. And behold, right under my nose, PHP 5 had gone and bundled the XMLReader Library into its core offerings. Yay.

Instead of laughing at me and walking off into the distance, XMLReader chewed right through that massive document like a Minnesotan with a turkey leg at the state fair.

Enough background. For those of you who do PHP and XML, I’m including my script below. Read on if you’re interested. If you gag at the sight of code, move along, nothing to see here.

<?php
$reader = new XMLReader();

$filename = '/path/to/xml/fil/pubmed.xml'; // Set the file name

if (!$reader->open($filename)) {
    print "can't open file";
    exit;
}

// Initialize everything up front so PHP does not raise undefined-variable notices
$name = $value = '';
$key = $title = $journal = $year = $volume = $number = $pages = '';
$note = $language = $isbn = $Abstract = '';
$mesh = $keywordlist = $author_list = $lastname = $forename = '';
$pubdate = 0;
$article = '';
$x = 0; // count of records converted

while ($reader->read()) {

    // remember which element we are currently inside
    if ($reader->nodeType == XMLReader::ELEMENT) {
        $name = $reader->name;
    }

    if (in_array($reader->nodeType, array(XMLReader::TEXT, XMLReader::CDATA, XMLReader::WHITESPACE, XMLReader::SIGNIFICANT_WHITESPACE)) && $name != '') {
        $value = $reader->value;
    }

    if ($reader->value != '') {
        if ($name == 'PMID') { $key = $value; }
        if ($name == 'ArticleTitle') { $title = $value; }
        if ($name == 'Title') { $journal = $value; }
        if ($name == 'PubDate') { $pubdate = 1; }
        if ($name == 'Year' && $pubdate == 1) { $year = $value; $pubdate = 0; }
        if ($name == 'Volume') { $volume = $value; }
        if ($name == 'Issue') { $number = $value; }
        if ($name == 'MedlinePgn') { $pages = $value; }
        if ($name == 'Affiliation') { $note = 'Affiliation: ' . $value; }
        if ($name == 'Language') { $language = $value; }

        // Yes, I see that I'm mapping isbn to issn
        if ($name == 'ISSN') { $isbn = $value; }
        if ($name == 'AbstractText') { $Abstract = $value; }

        // remember, we have multiple mesh terms and authors
        if ($name == 'DescriptorName') { $mesh .= $value . '; '; }
        if ($name == 'Keyword') { $keywordlist .= $value . '; '; }
        if ($name == 'LastName') { $lastname = $value; }
        if ($name == 'ForeName') { $forename = $value; }

        if ($lastname != '' && $forename != '') {
            $author_list .= $lastname . ', ' . $forename . '; ';
            $lastname = '';
            $forename = '';
        }
    }

    if ($reader->nodeType == XMLReader::END_ELEMENT) {
        $name = '';
        $value = '';
    }

    // when we reach the end of a PubmedArticle node, we grab all the values and make a new bibtex entry
    if ($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'PubmedArticle') {

        // strip the trailing '; ' separator
        $authors = substr($author_list, 0, -2);
        $keywords = substr($mesh, 0, -2);

        /*
         * we create a long string of BibTex records in memory.
         * A more memory-conscious approach would be to
         * progressively write these records to a text file
         */
        $article .= '@article{' . $key . ',' . "\n";
        $article .= 'author = {' . $authors . '},' . "\n";
        $article .= 'title = {' . $title . '},' . "\n";
        $article .= 'year = {' . $year . '},' . "\n";
        $article .= 'journal = {' . $journal . '},' . "\n";
        $article .= 'volume = {' . $volume . '},' . "\n";
        if ($number != '') { $article .= 'number = {' . $number . '},' . "\n"; }
        $article .= 'pages={' . $pages . '},' . "\n";
        if ($note != '') { $article .= 'note = {' . $note . ' Additional Keywords: ' . $keywordlist . '},' . "\n"; }
        $article .= 'keywords = {' . $keywords . '},' . "\n";
        $article .= 'isbn={' . $isbn . '},' . "\n";
        $article .= 'language = {' . $language . '},' . "\n";
        $article .= 'abstract = {' . $Abstract . '}' . "\n";
        $article .= '}' . "\n";

        // reset per-record fields so values do not leak into the next record
        $key = $title = $journal = $year = $volume = $number = $pages = '';
        $note = $language = $isbn = $Abstract = '';
        $author_list = $mesh = $keywordlist = '';
        $pubdate = 0;
        $x++;
    }
}

$bibtex_filename = '/path/to/bibtex/file/pubmed.bib';
if (!$handle = fopen($bibtex_filename, 'a')) {
    printf("Cannot open file (%s)", $bibtex_filename);
    exit;
}

if (fwrite($handle, $article) === FALSE) {
    printf("Cannot write to (%s)", $bibtex_filename);
    exit;
}

fclose($handle);
?>

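The comment in the script above mentions that a more memory-conscious approach would write records out progressively. Here is a rough sketch of that idea, not a drop-in replacement. The function names (first_text, bibtex_entry, pubmed_to_bibtex_stream) are my own inventions for illustration, only a handful of fields are emitted, and it assumes each PubmedArticle record is small enough to hand to SimpleXML on its own.

```php
<?php
// Hypothetical helper: text of the first descendant element matching $path, or ''.
function first_text(SimpleXMLElement $record, $path)
{
    $matches = $record->xpath('.//' . $path);
    return $matches ? trim((string) $matches[0]) : '';
}

// Hypothetical helper: build one BibTeX entry from a single <PubmedArticle>.
// Only a few of the script's fields are shown to keep the sketch short.
function bibtex_entry(SimpleXMLElement $record)
{
    $entry  = '@article{' . first_text($record, 'PMID') . ',' . "\n";
    $entry .= 'title = {' . first_text($record, 'ArticleTitle') . '},' . "\n";
    $entry .= 'journal = {' . first_text($record, 'Title') . '},' . "\n";
    $entry .= 'year = {' . first_text($record, 'PubDate/Year') . '}' . "\n";
    $entry .= '}' . "\n";
    return $entry;
}

// Convert record by record; returns the number of entries written, or false on failure.
function pubmed_to_bibtex_stream($xml_path, $bib_path)
{
    $reader = new XMLReader();
    if (!$reader->open($xml_path)) {
        return false;
    }
    $handle = fopen($bib_path, 'w');
    if (!$handle) {
        $reader->close();
        return false;
    }
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'PubmedArticle') {
            // Only this one record is expanded into a SimpleXML tree.
            $record = simplexml_load_string($reader->readOuterXml());
            if ($record !== false) {
                // Write immediately instead of growing one big string in memory.
                fwrite($handle, bibtex_entry($record));
                $count++;
            }
        }
    }
    fclose($handle);
    $reader->close();
    return $count;
}
```

Calling pubmed_to_bibtex_stream('/path/to/xml/fil/pubmed.xml', '/path/to/bibtex/file/pubmed.bib') would then keep just one record in memory at a time. The trade-off is that each record gets parsed twice (once by XMLReader, once by SimpleXML), which seemed a fair price for the friendlier per-record syntax.
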
Lightbox 2 = JS Goodness + Your Image Gallery

Wow, this is a neat tool:

http://www.huddletogether.com/projects/lightbox2/

I slapped something together on my staging server:

http://webstaging.dewey.lib.umn.edu/image/tid/224

It could hardly have been easier. I also love that this gracefully degrades. With JavaScript turned off, users are simply taken directly to the image itself. I can imagine a number of uses for this utility, but I'm most interested in looking at integrating it into online exhibits, where we might have an image that triggers a pop-up that allows users to page through various images. This is a common feature on newspaper sites and the like.

LibWidget Branding

Just playing around with branding for the Code Snippets/LibWidgets/whatever we're gonna call these things. The widget generator is written in PHP and JavaScript. Thanks to I Love Jack Daniels for the idea behind the simple caching mechanism.

Can't Access This Page? Go to the Original

April 13, 2007

Drupal 6 Language Support

Wow, this will be super useful. We haven't even begun to tackle multilingual support for our web presence in the U Libraries, but it has often been discussed. This feature could give us a real leg up:

"Following on a huge amount of discussion and planning at the Internationalization Group, Jose A. Reyero, Károly Négyesi (chx) and Gábor Hojtsy (myself) prepared a big patch to introduce better language support into Drupal 6.x-dev. We got significant code reviews and correction tips from Dries Buytaert, Steven Wittens and Doug Green. As a result, the development version of Drupal 6 now includes a generic language setup screen with support for right-to-left written scripts, native language names, path and domain based language selection and browser language detection. We also added support for language dependent path aliases, so you can have addresses like "example.com/espanol/contacto" and "example.com/english/contact", or alternatively "english.example.com/contact" and "espanol.example.com/contacto"."

See the announcement: http://drupal.org/node/131516