
Speed Reading XML with PHP XMLReader

First, allow me to bow down and grovel before all those who had a hand in creating the PHP XMLReader Library. Your work is indeed grovel-worthy.

What is the XMLReader Library, you ask? It’s a “thin PHP layer” on top of a ridiculously fast C/C++ XML pull parsing utility, and I thank my lucky stars for bumping into it (Pull parsing XML in PHP - IBM).

I’ve been working on various projects that require lots of citations from PubMed. And, I wrote some scripts to download them en masse via PubMed’s Entrez Programming Utilities, which offer a simple XML API to PubMed data. So, I downloaded a batch of 50k for my latest project, and casually noticed that the resulting file was 250+ megs. I needed to take that data and transform it into BibTeX. “Oh Dear,” I thought, “Will SimpleXML handle all that data?” Nope, it laughed in my face, a deep, long, belly laugh. Then it walked away.

Rats. I could have broken the file down into baby files, but it sort of annoyed me to have to do so, especially as I was inevitably going to have to rinse and repeat this process every time I needed different data. So, I cast about to find a more acceptable solution. And behold, right under my nose, PHP 5 had gone and bundled the XMLReader Library into its core offerings. Yay.

Instead of laughing at me and walking off into the distance, XMLReader chewed right through that massive document like a Minnesotan with a turkey leg at the state fair.

Enough background. For those of you who do PHP and XML, I’m including my script below. Read on if you’re interested. If you gag at the sight of code, move along, nothing to see here.


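<?php
$reader = new XMLReader();

$filename = '/path/to/xml/fil/pubmed.xml'; // Set the file name

if (!$reader->open($filename)) {
    print "can't open file";
    exit;
}

// Initialise everything the loop tests or appends to, so PHP raises no undefined-variable notices
$name = $value = $key = $title = $journal = $year = $volume = $number = '';
$pages = $note = $language = $isbn = $Abstract = '';
$mesh = $keywordlist = $lastname = $forename = $author_list = $article = '';
$pubdate = 0;
$x = 0;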

while ($reader->read()) {

    // Remember the name of the element we are currently inside
    if ($reader->nodeType == XMLReader::ELEMENT) {
        $name = $reader->name;
    }

    // Capture the text content that belongs to that element
    if (in_array($reader->nodeType, array(XMLReader::TEXT, XMLReader::CDATA, XMLReader::WHITESPACE, XMLReader::SIGNIFICANT_WHITESPACE)) && $name != '') {
        $value = $reader->value;
    }

    if ($reader->value != '') {
        if ($name == 'PMID') { $key = $value; }
        if ($name == 'ArticleTitle') { $title = $value; }
        if ($name == 'Title') { $journal = $value; }
        // Flag PubDate so that only the Year that follows it is used
        if ($name == 'PubDate') { $pubdate = 1; }
        if ($name == 'Year' && $pubdate == 1) { $year = $value; $pubdate = 0; }
        if ($name == 'Volume') { $volume = $value; }
        if ($name == 'Issue') { $number = $value; }
        if ($name == 'MedlinePgn') { $pages = $value; }
        if ($name == 'Affiliation') { $note = 'Affiliation: ' . $value; }
        if ($name == 'Language') { $language = $value; }

        // Yes, I see that I'm mapping isbn to issn
        if ($name == 'ISSN') { $isbn = $value; }
        if ($name == 'AbstractText') { $Abstract = $value; }

        // Remember, we have multiple mesh terms and authors
        if ($name == 'DescriptorName') { $mesh .= $value . '; '; }
        if ($name == 'Keyword') { $keywordlist .= $value . '; '; }
        if ($name == 'LastName') { $lastname = $value; }
        if ($name == 'ForeName') { $forename = $value; }

        // Once both halves of a name are in hand, add the author and start over
        if ($lastname != '' && $forename != '') {
            $author_list .= $lastname . ', ' . $forename . '; ';
            $lastname = '';
            $forename = '';
            $initials = '';
        }
    }

    // Leaving an element: forget its name and value
    if ($reader->nodeType == XMLReader::END_ELEMENT) {
        $name = '';
        $value = '';
    }
   
    // When we reach the end of a record, we grab all the values and make a new BibTeX entry
    if ($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'PubmedArticle') {

        // Trim the trailing '; ' separator
        $authors  = substr($author_list, 0, -2);
        $keywords = substr($mesh, 0, -2);

        /*
         * We create a long string of BibTeX records in memory.
         * A more memory-conscious approach would be to
         * progressively write these records to a text file.
         */
        $article .= '@article{' . $key . ',' . "\n";
        $article .= 'author = {' . $authors . '},' . "\n";
        $article .= 'title = {' . $title . '},' . "\n";
        $article .= 'year = {' . $year . '},' . "\n";
        $article .= 'journal = {' . $journal . '},' . "\n";
        $article .= 'volume = {' . $volume . '},' . "\n";
        if ($number != '') { $article .= 'number = {' . $number . '},' . "\n"; }
        $article .= 'pages = {' . $pages . '},' . "\n";
        if ($note != '') { $article .= 'note = {' . $note . ' Additional Keywords: ' . $keywordlist . '},' . "\n"; }
        $article .= 'keywords = {' . $keywords . '},' . "\n";
        $article .= 'isbn = {' . $isbn . '},' . "\n";
        $article .= 'language = {' . $language . '},' . "\n";
        $article .= 'abstract = {' . $Abstract . '}' . "\n";
        // End the entry with a blank line so records don't run together
        $article .= '}' . "\n\n";

        // Reset the per-record values so nothing leaks into the next entry
        $author_list = '';
        $mesh = '';
        $keywordlist = '';
        $key = $title = $journal = $year = $volume = $number = '';
        $pages = $note = $language = $isbn = $Abstract = '';
        $pubdate = 0;
        $x++; // running count of records converted
    }
}

   
$bibtex_filename = '/path/to/bibtex/file/pubmed.bib';
if (!$handle = fopen($bibtex_filename, 'a')) {
    printf("Cannot open file (%s)", $bibtex_filename);
    exit;
}

if (fwrite($handle, $article) === FALSE) {
    printf("Cannot write to (%s)", $bibtex_filename);
    exit;
}

fclose($handle);
?>
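
The comment inside the script notes that writing records out progressively would be easier on memory than building one long string. Here is a minimal sketch of that idea, not the script I actually ran. The helper `format_bibtex_entry()`, the `$map` array, the inline sample XML and the `php://memory` output stream are all assumptions made up for illustration; a real run would open the PubMed file with `open()` and `fopen()` the .bib path instead.

```php
<?php
// Hypothetical helper, not part of the original script: formats one record.
function format_bibtex_entry(array $fields): string
{
    return '@article{' . ($fields['key'] ?? '') . ",\n"
         . 'title = {' . ($fields['title'] ?? '') . "},\n"
         . 'year = {' . ($fields['year'] ?? '') . "}\n"
         . "}\n\n";
}

// Made-up inline sample standing in for the real PubMed file.
$xml = '<PubmedArticleSet>'
     . '<PubmedArticle><PMID>1</PMID><ArticleTitle>First</ArticleTitle><Year>2006</Year></PubmedArticle>'
     . '<PubmedArticle><PMID>2</PMID><ArticleTitle>Second</ArticleTitle><Year>2007</Year></PubmedArticle>'
     . '</PubmedArticleSet>';

$reader = new XMLReader();
$reader->XML($xml); // with a real file this would be $reader->open($filename)

// php://memory keeps the sketch self-contained; a real run would fopen() the .bib path.
$out = fopen('php://memory', 'w+');

// Element name => BibTeX field (only three fields, to keep the sketch short)
$map = array('PMID' => 'key', 'ArticleTitle' => 'title', 'Year' => 'year');
$fields = array();
$name = '';

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT) {
        $name = $reader->name;
    } elseif ($reader->nodeType == XMLReader::TEXT && isset($map[$name])) {
        $fields[$map[$name]] = $reader->value;
    } elseif ($reader->nodeType == XMLReader::END_ELEMENT) {
        $name = '';
        if ($reader->name == 'PubmedArticle') {
            // Write this record now instead of appending to one giant string
            fwrite($out, format_bibtex_entry($fields));
            $fields = array();
        }
    }
}
$reader->close();

// Read the result back only to show what was written
rewind($out);
$bib = stream_get_contents($out);
fclose($out);
echo $bib;
```

Because each entry is written as soon as its closing PubmedArticle tag goes by, only one record's worth of fields is held in memory at a time.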


Comments

Whoops, I just saved over a user's comments. Basically, "nk" raised a couple of issues with the above script. S/he indicated that I'd run into problems with the ArticleId and Year fields, as there are multiple potential values for both in your average PubMed record. And you know what? NK is right. S/he also asked if any follow-up resources might be available - more on that at the end of my comments.

Here are my follow-up comments:

First, thanks for the feedback. I ran into both the ID and Year problems myself and never got back to updating this code example...so thanks.

I don't, however, actually need to resort to "traditional" means to parse through child elements; XMLReader handles that too. It just spits everything (child elements included) out in order as you call elements...in one long stream.
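
To make the "one long stream" point concrete, here is a small sketch that just records each node as `read()` hands it over. The sample XML is made up for illustration, and the indentation via the `depth` property is only there to make the nesting visible.

```php
<?php
// Made-up sample with a nested child element (Year inside PubDate).
$xml = '<PubmedArticle><PMID>123</PMID><PubDate><Year>2006</Year></PubDate></PubmedArticle>';

$reader = new XMLReader();
$reader->XML($xml);

$stream = array();
while ($reader->read()) {
    $indent = str_repeat('  ', $reader->depth);
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            $stream[] = $indent . '<' . $reader->name . '>';
            break;
        case XMLReader::TEXT:
            $stream[] = $indent . $reader->value;
            break;
        case XMLReader::END_ELEMENT:
            $stream[] = $indent . '</' . $reader->name . '>';
            break;
    }
}
$reader->close();

// Parents, children and text all arrive in document order
echo implode("\n", $stream), "\n";
```

The child element Year shows up in the same flat sequence as its parent PubDate, which is why keeping track of the current element name (plus the occasional flag) is enough to pick out nested values.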

So for the PMID, I only needed to change

if($name == 'ArticleId'){ $key = $value;}

to
if($name == 'PMID'){ $key = $value;}

Getting the proper year is only slightly trickier. What I essentially need is the year that appears right after the PubDate element. I do this like so:

if($name == 'PubDate'){ $pubdate = 1; }
if($name == 'Year' && $pubdate == 1){ $year = $value; $pubdate = 0; }

There may well be other ways of handling this sort of thing, but this way works for me.
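
Here is a self-contained sketch of the flag idea, run against a made-up record that carries several Year elements. One deliberate difference from the two lines above, and it is my assumption rather than what the script does: the flag is set when the PubDate start tag is seen instead of on a text node, so it should also work when there is no whitespace between tags.

```php
<?php
// Made-up record: only the Year inside PubDate should be picked up.
$xml = '<PubmedArticle>'
     . '<DateCreated><Year>2005</Year></DateCreated>'
     . '<PubDate><Year>2006</Year><Month>Jan</Month></PubDate>'
     . '<DateRevised><Year>2008</Year></DateRevised>'
     . '</PubmedArticle>';

$reader = new XMLReader();
$reader->XML($xml);

$name = '';
$pubdate = 0;
$year = '';

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT) {
        $name = $reader->name;
        // Raise the flag as soon as we enter PubDate
        if ($name == 'PubDate') { $pubdate = 1; }
    } elseif ($reader->nodeType == XMLReader::TEXT) {
        // Take the first Year after PubDate, then lower the flag
        if ($name == 'Year' && $pubdate == 1) {
            $year = $reader->value;
            $pubdate = 0;
        }
    } elseif ($reader->nodeType == XMLReader::END_ELEMENT) {
        $name = '';
    }
}
$reader->close();

echo $year, "\n"; // prints 2006
```

The Year values under DateCreated and DateRevised are ignored because the flag is down when they go by.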

In terms of XMLReader resources, yeah, very sparse indeed. The folks at IBM developerWorks did put together a nice article on the subject, however:

http://www.ibm.com/developerworks/xml/library/x-pullparsingphp.html

They didn't feature the term "XMLReader" in the title anywhere, even though this is the focus of the article, which results in very poor SEO for said topic.

Thanks again!
