Speed Reading XML with PHP XMLReader
First, allow me to bow down and grovel before all those who had a hand in creating the PHP XMLReader Library. Your work is indeed grovel-worthy.
What is the XMLReader Library, you ask? It’s a “thin PHP layer” on top of a ridiculously fast C/C++ XML pull parsing utility, and I thank my lucky stars for bumping into it (Pull parsing XML in PHP - IBM).
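To make “pull parsing” a bit more concrete, here is a minimal sketch. It is not part of my script, and the inline XML and its PMIDs are made up for illustration. Each call to read() hands you the next node, and you decide what to keep, so the whole document never has to sit in memory at once.

```php
<?php
// Hypothetical sample document, made up for this illustration.
$xml = '<articles>'
     . '<article><PMID>101</PMID></article>'
     . '<article><PMID>102</PMID></article>'
     . '</articles>';

$reader = new XMLReader();
$reader->XML($xml); // parse from a string; open() does the same for a file

$pmids = array();
while ($reader->read()) { // advance one node at a time
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'PMID') {
        // readString() returns the text content of the current element
        $pmids[] = $reader->readString();
    }
}
$reader->close();

print implode(', ', $pmids) . "\n";
```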
I’ve been working on various projects that require lots of citations from PubMed. And, I wrote some scripts to download them en masse via PubMed’s Entrez Programming Utilities, which offer a simple XML API to PubMed data. So, I downloaded a batch of 50k for my latest project, and casually noticed that the resulting file was 250+ megs. I needed to take that data and transform it into BibTeX. “Oh Dear,” I thought, “Will SimpleXML handle all that data?” Nope, it laughed in my face, a deep, long, belly laugh. Then it walked away.
Rats. I could have broken the file down into baby files, but it sort of annoyed me to have to do so, especially as I was inevitably going to have to rinse and repeat this process every time I needed different data. So, I cast about to find a more acceptable solution. And behold, right under my nose, PHP 5 had gone and bundled the XMLReader Library into its core offerings. Yay.
Instead of laughing at me and walking off into the distance, XMLReader chewed right through that massive document like a Minnesotan with a turkey leg at the state fair.
Enough background. For those of you who do PHP and XML, I’m including my script below. Read on if you’re interested. If you gag at the sight of code, move along, nothing to see here.
<?php
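$reader = new XMLReader();
$filename = '/path/to/xml/fil/pubmed.xml' ; //Set the File Name
//Stop here if the file can't be opened; otherwise every read() below would fail
if(!$reader->open($filename)){ exit("can't open file"); }
//Initialize everything the loop reads or appends to, so PHP doesn't complain about undefined variables
$name = $value = '';
$key = $title = $journal = $year = $volume = $number = $pages = '';
$note = $language = $isbn = $Abstract = '';
$mesh = $keywordlist = $author_list = $lastname = $forename = $article = '';
$pubdate = 0;
$x = 0; //number of records converted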
while ($reader->read()) {
if($reader->nodeType == XMLReader::ELEMENT ){
$name = $reader->name;
}
if (in_array($reader->nodeType, array(XMLReader::TEXT, XMLReader::CDATA, XMLReader::WHITESPACE, XMLReader::SIGNIFICANT_WHITESPACE)) && $name!=''){
$value= $reader->value;
}
if($reader->value != ''){
if($name == 'PMID'){ $key = $value;}
if($name == 'ArticleTitle'){ $title = $value;}
if($name == 'Title'){ $journal = $value;}
if($name == 'PubDate'){ $pubdate = 1; }
if($name == 'Year' && $pubdate == 1){ $year = $value; $pubdate = 0; }
if($name == 'Volume'){ $volume = $value;}
if($name == 'Issue'){ $number = $value;}
if($name == 'MedlinePgn'){ $pages = $value;}
if($name == 'Affiliation'){ $note = 'Affiliation: ' . $value;}
if($name == 'Language'){ $language = $value;}
//Yes, I see that I'm storing the ISSN in the isbn field
if($name == 'ISSN'){ $isbn = $value;}
if($name == 'AbstractText'){ $Abstract = $value;}
//remember, we have multiple mesh terms and authors
if($name == 'DescriptorName'){ $mesh .= $value . '; ';}
if($name == 'Keyword'){ $keywordlist .= $value . '; ';}
if($name == 'LastName'){ $lastname = $value;}
if($name == 'ForeName'){ $forename = $value;}
if($lastname != '' && $forename != ''){
$author_list .= $lastname . ', ' . $forename . '; ';
$lastname = '';
$forename = '';
$initials = '';
}
}
if ($reader->nodeType == XMLReader::END_ELEMENT){
$name = '';
$value = '';
}
//when we reach the end of a PubmedArticle element, we grab all the values and make a new BibTeX entry
if ($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'PubmedArticle'){
$authors = substr($author_list, 0, -2);
$keywords = substr($mesh , 0, -2);
/*
* we create a long string of BibTeX records in memory.
* A more memory-conscious approach would be to
* progressively write these records to a text file
*/
$article .= '@article{' . $key . ',' . "\n";
$article .= 'author = {' . $authors . '},'. "\n";
$article .= 'title = {' . $title . '},'. "\n";
$article .= 'year = {' . $year . '},'. "\n";
$article .= 'journal = {' . $journal . '},'. "\n";
$article .= 'volume = {' . $volume . '},'. "\n";
if($number !=''){$article .= 'number = {' . $number . '},'. "\n";}
$article .= 'pages={' . $pages . '},'. "\n";
if($note !=''){$article .= 'note = {' . $note . ' Additional Keywords: ' . $keywordlist . '},'. "\n";}
$article .= 'keywords = {' . $keywords . '},'. "\n";
$article .= 'isbn={' . $isbn . '},'. "\n";
$article .= 'language = {' . $language . '},'. "\n";
$article .= 'abstract = {' . $Abstract . '}'. "\n";
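$article .= '}' . "\n\n";
//reset the per-record fields so a missing element doesn't inherit the previous record's value
$key = $title = $journal = $year = $volume = $number = $pages = '';
$note = $language = $isbn = $Abstract = '';
$author_list = '';
$mesh = '';
$keywordlist = '';
$x++;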
}
}
$bibtex_filename = '/path/to/bibtex/file/pubmed.bib';
if (!$handle = fopen($bibtex_filename, 'a')) {
printf("Cannot open file (%s)", $bibtex_filename);
exit;
}
if (fwrite($handle, $article) === FALSE) {
printf("Cannot write to (%s)", $bibtex_filename);
exit;
}
fclose($handle);
?>
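The comment in the script admits that building one giant string isn't the most memory-conscious choice. Here is a rough sketch of the alternative, under a few assumptions: the function names bibtex_entry() and convert_pubmed() are made up, only the key, authors, and title are mapped, the sample records are invented, and the element paths assume the PubmedArticle > MedlineCitation > Article layout. XMLReader finds each PubmedArticle, expand() turns just that one record into something SimpleXML can handle, and the entry goes to disk right away.

```php
<?php
// Hypothetical helper: build one BibTeX entry from a single <PubmedArticle>.
// Assumes MedlineCitation and Article are present in every record.
function bibtex_entry(SimpleXMLElement $record)
{
    $citation = $record->MedlineCitation;
    $article  = $citation->Article;
    $authors  = array();
    if (isset($article->AuthorList)) {
        foreach ($article->AuthorList->Author as $author) {
            $authors[] = $author->LastName . ', ' . $author->ForeName;
        }
    }
    // BibTeX conventionally separates authors with " and "
    return '@article{' . $citation->PMID . ",\n"
         . 'author = {' . implode(' and ', $authors) . "},\n"
         . 'title = {' . $article->ArticleTitle . "}\n"
         . "}\n\n";
}

// Hypothetical driver: stream records from $xml_path into $bib_path.
// Returns the number of records written, or false if a file can't be opened.
function convert_pubmed($xml_path, $bib_path)
{
    $reader = new XMLReader();
    if (!$reader->open($xml_path)) {
        return false;
    }
    $out = fopen($bib_path, 'w');
    if ($out === false) {
        $reader->close();
        return false;
    }
    $doc = new DOMDocument();
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'PubmedArticle') {
            $node = $reader->expand(); // DOM copy of this one record only
            if ($node === false) {
                continue;
            }
            $record = simplexml_import_dom($doc->importNode($node, true));
            fwrite($out, bibtex_entry($record)); // write now, keep nothing around
            $count++;
        }
    }
    fclose($out);
    $reader->close();
    return $count;
}

// Try it on a tiny made-up file.
$xml_path = tempnam(sys_get_temp_dir(), 'pm');
$bib_path = tempnam(sys_get_temp_dir(), 'bib');
file_put_contents($xml_path,
    '<PubmedArticleSet>'
    . '<PubmedArticle><MedlineCitation><PMID>111</PMID><Article>'
    . '<ArticleTitle>First title</ArticleTitle><AuthorList><Author>'
    . '<LastName>Doe</LastName><ForeName>Jane</ForeName></Author></AuthorList>'
    . '</Article></MedlineCitation></PubmedArticle>'
    . '<PubmedArticle><MedlineCitation><PMID>222</PMID><Article>'
    . '<ArticleTitle>Second title</ArticleTitle>'
    . '</Article></MedlineCitation></PubmedArticle>'
    . '</PubmedArticleSet>');

$count  = convert_pubmed($xml_path, $bib_path);
$bibtex = file_get_contents($bib_path);
print $count . " records written\n" . $bibtex;
unlink($xml_path);
unlink($bib_path);
```

Reading fields by their position in the record, rather than by bare element name, should also sidestep the PubDate flag trick in my script, since a Year or PMID that appears elsewhere in the record can't be mistaken for the one you want.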