
Workaround for a problem with PDFBox text extraction

The problem

When PDFBox was used to extract text from a PDF file of about 20 MB, it consumed more and more memory and eventually dragged the system to a standstill.

The workaround

I used the methods setStartPage and setEndPage in the PDFTextStripper class to limit the number of pages converted to text at a time.
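
The range arithmetic behind this can be sketched on its own, without PDFBox. This is only an illustration: it assumes that PDFTextStripper counts pages from 1 and treats the end page as inclusive, and the class name PageRanges and the method chunk are made up for this example, not part of PDFBox or DSpace.

```java
import java.util.ArrayList;
import java.util.List;

public class PageRanges {
    /**
     * Splits pages 1..totalPages into consecutive blocks.
     * Each entry is {start, end}, 1-based and inclusive, with no overlap.
     */
    public static List<int[]> chunk(int totalPages, int pagesPerChunk) {
        if (pagesPerChunk < 1) {
            throw new IllegalArgumentException("pagesPerChunk must be at least 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += pagesPerChunk) {
            // Clamp the final block so it ends on the last page.
            int end = Math.min(start + pagesPerChunk - 1, totalPages);
            ranges.add(new int[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Page count taken from the run reported in this post.
        for (int[] range : chunk(1959, 50)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

Each pair would then be passed to setStartPage and setEndPage before a call to getText. Because the last block is clamped to the final page, no separate pass over the tail of the document is needed.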

Results of a run

Extracting text
NumberPagesInPDF 1959
Extraction time seconds 230
Pages per second (1959 / 230) about 8.5
Number of characters extracted 672324

For the number of pages per text extraction I tried 20, 50 and 100 pages, and the results were similar. This rate seems very slow.
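
The throughput of the run follows directly from the page count, character count and elapsed time reported above. A small sketch of that arithmetic; the class name Throughput and the method perSecond are hypothetical helpers for this example only.

```java
import java.util.Locale;

public class Throughput {
    /** Items processed per second, given an item count and elapsed whole seconds. */
    public static double perSecond(long items, long seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return (double) items / seconds;
    }

    public static void main(String[] args) {
        // Figures taken from the run reported above.
        System.out.println(String.format(Locale.ROOT, "%.1f pages/s", perSecond(1959, 230)));
        System.out.println(String.format(Locale.ROOT, "%.0f chars/s", perSecond(672324, 230)));
    }
}
```

For this run that works out to about 8.5 pages and about 2,900 characters per second.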

Specs on computer used

The run above was done on odin.lib.umn.edu. Here is some information on it:

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 3
cpu MHz : 3200.166
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6404.36
clflush size : 64

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 3
cpu MHz : 3200.166
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6400.17
clflush size : 64

free -m (without pdfbox running)
             total       used       free     shared    buffers     cached
Mem:          2018       1428        590          0        229       1045
-/+ buffers/cache:        153       1865
Swap:         3812         31       3781

Code sample

I rewrote the getDestinationStream method in the DSpace class PDFFilter to extract text in blocks of 50 pages. The new version of the method is given below:

    public InputStream getDestinationStream(InputStream source) throws Exception {
    
        System.out.println("Extracting text");

        long startTime = System.currentTimeMillis();
        // get input stream from bitstream
        // pass to filter, get string back
        PDFTextStripper TextExtractor = new PDFTextStripper();
        PDFParser parser = null;
        String extractedText = null;
        int NumberPagesInPDF =0;
        try
        {
            parser = new PDFParser(source);
            parser.parse();
            
            // Page numbers in PDFTextStripper are 1-based and the end page is
            // inclusive, so each block covers StartPage..EndPage with no overlap.
            StringBuilder textBuffer = new StringBuilder(" ");
            int SizeOfPDFSection = 50;
            PDDocument PDFtoExtract = new PDDocument(parser.getDocument());
            NumberPagesInPDF = PDFtoExtract.getNumberOfPages();
            int StartPage = 1;
            while (StartPage <= NumberPagesInPDF){
              // clamp the last block so it stops at the final page
              int EndPage = Math.min(StartPage + SizeOfPDFSection - 1, NumberPagesInPDF);
              TextExtractor.setStartPage( StartPage );
              TextExtractor.setEndPage( EndPage );
              textBuffer.append(TextExtractor.getText(PDFtoExtract));
              StartPage = EndPage + 1;
            }

            // the clamped loop already reaches the last page, so no separate
            // pass over the tail of the PDF is needed
            extractedText = textBuffer.toString();
        }
        finally
        {
            try
            {
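                // parser is still null if the PDFParser constructor threw
                if (parser != null && parser.getDocument() != null)
                {
                    parser.getDocument().close();
                }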
            }
            catch(Exception e)
            {
               log.error("Error closing temporary PDF file: " + e.getMessage(), e);
            }
        }

        // if verbose flag is set, print out extracted text
        // to STDOUT
        if (MediaFilterManager.isVerbose)
        {
            System.out.println(extractedText);
        }

        // generate an input stream with the extracted text
        long stopTime = System.currentTimeMillis();
        long deltaSeconds = (stopTime-startTime)/1000;
        System.out.println(" NumberPagesInPDF " + NumberPagesInPDF);
        System.out.println("Extraction time seconds " + deltaSeconds );
        System.out.println("Pages per second "  + (double)NumberPagesInPDF/(double)deltaSeconds  );
        System.out.println(" Number of characters extracted " + extractedText.length());
        byte[] textBytes = extractedText.getBytes();
        ByteArrayInputStream bais = new ByteArrayInputStream(textBytes);
        return bais; // safe: the stream holds its own reference to textBytes, so the array stays reachable
    }
}
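
On the question of whether the byte array survives the return: a ByteArrayInputStream keeps a reference to the array it wraps, so the data stays reachable for as long as the stream does. The sketch below shows this round trip in isolation. It also passes an explicit charset to getBytes, since the no-argument form uses the platform default; UTF-8 is only an assumption here, not something DSpace is known to require, and the class name TextStreams and its methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class TextStreams {
    /** Wraps the text in an in-memory stream; the local array outlives this method. */
    public static InputStream toStream(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // assumed encoding
        return new ByteArrayInputStream(bytes);
    }

    /** Reads a stream to the end and decodes it with the same assumed encoding. */
    public static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream stream = toStream("text extracted from a PDF");
        System.out.println(readAll(stream));
    }
}
```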

                           
