
January 27, 2011

Indexer problem and old libraries

When the full text indexer tries to process a single handle, it fails with a class-not-found error (java.lang.NoClassDefFoundError).

The plan

1) Locate a handle less than a megabyte in size that has not been indexed. Confirm this by directly checking the Postgres DB.

2) Find the files in the library that have changed between now and Wed, 18 Feb 2009 (files pulled from the last commit to the odin system).

3) Run the indexer with current library files ... expect failure.

4) Replace the library files with the old values and run the indexer ... anticipate success.


Results:

1) Handle that has not been indexed

Handle 608 has not been indexed.

Handle | Name | Bytes
608 | 200615.pdf | 842375


If this handle had been indexed, the query below would also return a *.pdf.txt row.


select handle.handle, bitstream.name, bitstream.size_bytes from
handle,item2bundle,bitstream,bundle2bitstream where handle.resource_id = item2bundle.item_id and
handle.resource_type_id=2 and item2bundle.bundle_id=bundle2bitstream.bundle_id and
bundle2bitstream.bitstream_id = bitstream.bitstream_id and handle = 608;

However, we get only the PDF itself:


handle | name | size_bytes
--------+------------+------------
608 | 200615.pdf | 842375
(1 row)


2) Comparison of old and new libraries.



ulwal-320-bu-m:~ silvi003$ dirdiff lib_old lib_new
No matching file: lib_old/PDFBox.jar
No matching file: lib_new/commons-collections-3.2.1.jar
No matching file: lib_new/jcaptcha-1.0-all.jar
No matching file: lib_new/jxl.jar
No matching file: lib_new/pdfbox-1.1.0.jar
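
The same kind of comparison can be sketched in Python. This is only an illustration, not the dirdiff tool used above: the function name compare_lib_dirs is made up here, and it assumes both directories are flat (no subdirectories of jars). Besides names that exist on only one side, it also reports same-named files whose bytes differ, which a name-only listing would miss.

```python
import filecmp
import os


def compare_lib_dirs(old_dir, new_dir):
    """Compare two flat library directories by file name and content.

    Returns (only_in_old, only_in_new, changed) as sorted lists of names.
    Hypothetical helper for illustration; not part of DSpace or dirdiff.
    """
    old_names = {n for n in os.listdir(old_dir)
                 if os.path.isfile(os.path.join(old_dir, n))}
    new_names = {n for n in os.listdir(new_dir)
                 if os.path.isfile(os.path.join(new_dir, n))}
    common = sorted(old_names & new_names)
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting file size and modification time.
    _, mismatch, errors = filecmp.cmpfiles(old_dir, new_dir, common,
                                           shallow=False)
    return (sorted(old_names - new_names),
            sorted(new_names - old_names),
            sorted(mismatch + errors))
```

Calling compare_lib_dirs("lib_old", "lib_new") would return three lists: files only in the old directory, files only in the new one, and files present in both but with different content.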

3) Run the indexer with current library files ... expect failure.

First make a dspace-ir test space.
[silvi003@strip1 dspace]$ tu cp -R dspace-ir dspace-ir_test

It did fail:
[silvi003@strip1 bin]$ tu ./filter-media -i  608
Applying Media Filters
:/dspace/dspace-ir_test/lib/activation.jar:/dspace/dspace-ir_test/lib/commons-cli.jar:/dspace/dspace-ir_test/lib/commons-codec-1.3.jar:/dspace/dspace-ir_test/lib/commons-collections-3.2.1.jar:/dspace/dspace-ir_test/lib/commons-collections.jar:/dspace/dspace-ir_test/lib/commons-dbcp.jar:/dspace/dspace-ir_test/lib/commons-fileupload.jar:/dspace/dspace-ir_test/lib/commons-io.jar:/dspace/dspace-ir_test/lib/commons-lang-2.1.jar:/dspace/dspace-ir_test/lib/commons-pool.jar:/dspace/dspace-ir_test/lib/fontbox.jar:/dspace/dspace-ir_test/lib/handle.jar:/dspace/dspace-ir_test/lib/jakarta-poi.jar:/dspace/dspace-ir_test/lib/jargon.jar:/dspace/dspace-ir_test/lib/jaxen.jar:/dspace/dspace-ir_test/lib/jcaptcha-1.0-all.jar:/dspace/dspace-ir_test/lib/jdom.jar:/dspace/dspace-ir_test/lib/jena.jar:/dspace/dspace-ir_test/lib/jstl.jar:/dspace/dspace-ir_test/lib/jxl.jar:/dspace/dspace-ir_test/lib/log4j.jar:/dspace/dspace-ir_test/lib/lucene.jar:/dspace/dspace-ir_test/lib/lucene-sandbox.jar:/dspace/dspace-ir_test/lib/mail.jar:/dspace/dspace-ir_test/lib/mets.jar:/dspace/dspace-ir_test/lib/oaicat.jar:/dspace/dspace-ir_test/lib/oro.jar:/dspace/dspace-ir_test/lib/pdfbox-1.1.0.jar:/dspace/dspace-ir_test/lib/pg74.216.jdbc3.jar:/dspace/dspace-ir_test/lib/rome.jar:/dspace/dspace-ir_test/lib/serializer.jar:/dspace/dspace-ir_test/lib/servlet.jar:/dspace/dspace-ir_test/lib/standard.jar:/dspace/dspace-ir_test/lib/tm-extractors.jar:/dspace/dspace-ir_test/lib/xalan.jar:/dspace/dspace-ir_test/lib/xercesImpl.jar:/dspace/dspace-ir_test/lib/xml-apis.jar:/dspace/dspace-ir_test/config:/dspace/src/dspace-ir/build/classes/
########################################### 
Collection handle: 608
*********************************************
Handle 608
Bitstream ID 2287
Filter org.dspace.app.mediafilter.PDFFilter
Bitstream supports filter true
Bitstream name 200615.pdf.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
	at org.apache.pdfbox.util.PDFStreamEngine.(PDFStreamEngine.java:63)
	at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:102)
	at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:159)
	at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:357)
	at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:311)
	at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:272)
	at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:211)
[silvi003@strip1 bin]$ 
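
The classpath printed above contains no jar with commons-logging in its name, which fits the missing org/apache/commons/logging/LogFactory, although the listing alone cannot show whether some other jar bundles that class. A minimal Python sketch for checking a classpath string like the one printed is below. The function name find_class_on_classpath is an assumption for illustration, and the code assumes a Unix-style classpath separated by colons.

```python
import os
import zipfile


def find_class_on_classpath(classpath, class_name):
    """Return the classpath entries that contain the given class.

    class_name may use dots or slashes, e.g.
    "org/apache/commons/logging/LogFactory".
    Hypothetical helper; assumes ':' separates classpath entries.
    """
    resource = class_name.replace(".", "/") + ".class"
    hits = []
    for entry in classpath.split(":"):
        if not entry:
            continue  # the printed classpath starts with an empty entry
        if os.path.isdir(entry):
            # Directory entry: look for the class file on disk.
            if os.path.isfile(os.path.join(entry, *resource.split("/"))):
                hits.append(entry)
        elif zipfile.is_zipfile(entry):
            # Jar entry: jars are zip archives, so list their members.
            with zipfile.ZipFile(entry) as jar:
                if resource in jar.namelist():
                    hits.append(entry)
    return hits
```

An empty result for a class named in a NoClassDefFoundError means no entry on that classpath provides it.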

4) Replace the library files with the old values and run the indexer ... anticipate success.

This also fails, though on a different class:
[silvi003@strip1 bin]$ tu ./filter-media -i  608
Applying Media Filters
:/dspace/dspace-ir_test/lib/activation.jar:/dspace/dspace-ir_test/lib/commons-cli.jar:/dspace/dspace-ir_test/lib/commons-codec-1.3.jar:/dspace/dspace-ir_test/lib/commons-collections.jar:/dspace/dspace-ir_test/lib/commons-dbcp.jar:/dspace/dspace-ir_test/lib/commons-fileupload.jar:/dspace/dspace-ir_test/lib/commons-io.jar:/dspace/dspace-ir_test/lib/commons-lang-2.1.jar:/dspace/dspace-ir_test/lib/commons-pool.jar:/dspace/dspace-ir_test/lib/fontbox.jar:/dspace/dspace-ir_test/lib/handle.jar:/dspace/dspace-ir_test/lib/jakarta-poi.jar:/dspace/dspace-ir_test/lib/jargon.jar:/dspace/dspace-ir_test/lib/jaxen.jar:/dspace/dspace-ir_test/lib/jdom.jar:/dspace/dspace-ir_test/lib/jena.jar:/dspace/dspace-ir_test/lib/jstl.jar:/dspace/dspace-ir_test/lib/log4j.jar:/dspace/dspace-ir_test/lib/lucene.jar:/dspace/dspace-ir_test/lib/lucene-sandbox.jar:/dspace/dspace-ir_test/lib/mail.jar:/dspace/dspace-ir_test/lib/mets.jar:/dspace/dspace-ir_test/lib/oaicat.jar:/dspace/dspace-ir_test/lib/oro.jar:/dspace/dspace-ir_test/lib/PDFBox.jar:/dspace/dspace-ir_test/lib/pg74.216.jdbc3.jar:/dspace/dspace-ir_test/lib/rome.jar:/dspace/dspace-ir_test/lib/serializer.jar:/dspace/dspace-ir_test/lib/servlet.jar:/dspace/dspace-ir_test/lib/standard.jar:/dspace/dspace-ir_test/lib/tm-extractors.jar:/dspace/dspace-ir_test/lib/xalan.jar:/dspace/dspace-ir_test/lib/xercesImpl.jar:/dspace/dspace-ir_test/lib/xml-apis.jar:/dspace/dspace-ir_test/config:/dspace/src/dspace-ir/build/classes/
########################################### 
Collection handle: 608
*********************************************
Handle 608
Bitstream ID 2287
Filter org.dspace.app.mediafilter.PDFFilter
Bitstream supports filter true
Bitstream name 200615.pdf.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/pdfbox/util/PDFTextStripper
	at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:102)
	at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:159)
	at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:357)
	at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:311)
	at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:272)
	at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:211)


However, a PDFTextStripper class is there, though under org/pdfbox/util rather than the org/apache/pdfbox/util path named in the exception:


[silvi003@strip1 lib]$ jar -tvf PDFBox.jar | grep PDFTextStripper
13194 Thu Oct 12 12:19:36 CDT 2006 org/pdfbox/util/PDFTextStripper.class
3358 Thu Oct 12 12:19:38 CDT 2006 org/pdfbox/util/PDFTextStripperByArea.class
4382 Mon Jul 31 20:27:52 CDT 2006 Resources/PDFTextStripper.properties
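
A programmatic version of that jar listing is sketched below in Python. It reports the fully qualified name of each match so it can be compared directly with the name in the stack trace. The function name locate_class is made up for this sketch.

```python
import zipfile


def locate_class(jar_path, simple_name):
    """Return the fully qualified names of classes in jar_path whose
    class file is exactly <simple_name>.class.

    Hypothetical helper; a jar is read as an ordinary zip archive.
    """
    file_name = simple_name + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name[:-len(".class")].replace("/", ".")
            for name in jar.namelist()
            # Match the class at the archive root or inside any package.
            if name == file_name or name.endswith("/" + file_name)
        )
```

For a jar with the contents listed above, locate_class("PDFBox.jar", "PDFTextStripper") would return ["org.pdfbox.util.PDFTextStripper"], which is not the org.apache.pdfbox.util.PDFTextStripper that the exception asks for.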

January 14, 2011

Big SQL query to find handles where the pdf has not been indexed in DSPACE

The UDC DSPACE instance has a full text indexer. However, there are many PDFs that have not been indexed. I wrote a SQL query to find these. Here it is:
SELECT handle.handle, bitstream.name, bitstream.size_bytes
FROM handle, item2bundle, bitstream, bundle2bitstream
WHERE handle.resource_id = item2bundle.item_id
  AND handle.resource_type_id = 2
  AND item2bundle.bundle_id = bundle2bitstream.bundle_id
  AND bundle2bitstream.bitstream_id = bitstream.bitstream_id
  AND handle IN (
    (SELECT handle.handle
     FROM handle, item2bundle, bitstream, bundle2bitstream
     WHERE handle.resource_id = item2bundle.item_id
       AND handle.resource_type_id = 2
       AND item2bundle.bundle_id = bundle2bitstream.bundle_id
       AND bundle2bitstream.bitstream_id = bitstream.bitstream_id
       AND bitstream.name ~ '^.*pdf$')
    EXCEPT
    (SELECT handle.handle
     FROM handle, item2bundle, bitstream, bundle2bitstream
     WHERE handle.resource_id = item2bundle.item_id
       AND handle.resource_type_id = 2
       AND item2bundle.bundle_id = bundle2bitstream.bundle_id
       AND bundle2bitstream.bitstream_id = bitstream.bitstream_id
       AND bitstream.name ~ '^.*pdf.txt$')
  )
ORDER BY handle::text::integer;

This produces output like:

 handle |                                                               name                                                                | size_bytes 
--------+-----------------------------------------------------------------------------------------------------------------------------------+------------
 394    | Connect2006Fall.pdf                                                                                                               |    2146701
 394    | license.txt                                                                                                                       |       1400
 406    | m139.pdf                                                                                                                          |    1349983
 406    | license.txt                                                                                                                       |       1371
 406    | m139_Extras.zip                                                                                                                   |    2240787
 406    | m139meta.doc.txt                                                                                                                  |       2342
 406    | index.txt.txt                                                                                                                     |       1850
 422    | license.txt                                                                                                                       |       1371


Breaking down the above expression:
SELECT handle.handle, bitstream.name, bitstream.size_bytes FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle in {THE HANDLES OF NON INDEXED PDF}

Next step:
THE HANDLES OF NON INDEXED PDF = {HANDLES THAT HAVE A PDF} EXCEPT {HANDLES THAT ARE INDEXED}

HANDLES THAT HAVE A PDF = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf$';

HANDLES THAT ARE INDEXED = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf.txt$';
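
The set-difference idea can be tried out on toy data. The Python sketch below rests on several assumptions: it uses an in-memory SQLite database instead of Postgres, collapses the four-table join into one made-up table files(handle, name), and replaces the `~` regex match with LIKE, so it only illustrates the EXCEPT pattern and is not a drop-in for the DSPACE schema. The function name handles_missing_text is also made up.

```python
import sqlite3


def handles_missing_text(rows):
    """rows: list of (handle, bitstream_name) tuples.

    Returns every row belonging to a handle that has a name ending in
    'pdf' but no name ending in 'pdf.txt'. Toy stand-in for the real
    four-table join; the table and column names are assumptions.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE files (handle INTEGER, name TEXT)")
        conn.executemany("INSERT INTO files VALUES (?, ?)", rows)
        query = """
            SELECT handle, name FROM files
            WHERE handle IN (
                SELECT handle FROM files WHERE name LIKE '%pdf'
                EXCEPT
                SELECT handle FROM files WHERE name LIKE '%pdf.txt'
            )
            ORDER BY handle, name
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

Because the outer SELECT returns every bitstream of a matching handle, non-PDF files such as license.txt show up in the result too, just as in the output above.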

January 13, 2011

ABBYY OCR System does not work well in a VM

I tried to run ABBYY 3.0 in a VM, using a 34-page PDF as input:
MRC-72-3 A consumer test of canned, seasoned salad tomatoes.pdf
I ran the Microsoft Performance Monitor while ABBYY was processing it to produce the plot below.
FinalperformanceTest.gif
red line is \\DLS-OCR\Processor(_Total)\% Processor Time
green line is \\DLS-OCR\Memory\Page Faults/sec ( this maxed out at above 60K)


We also have a CSV version of the data. This run was completed after several modifications were made to the VM to improve performance. However, even at this point it still takes almost 2 seconds per page.
Because of this, we will not be using a VM with ABBYY.