
Indexer problem and old libraries

When the full-text indexer (filter-media) tries to process a single handle, it fails with a class-not-found error (java.lang.NoClassDefFoundError).

The plan

1) Locate a handle whose file is under a megabyte in size and has not been indexed. Confirm this by checking the Postgres DB directly.

2) Find the files in the library that have changed between now and Wed, 18 Feb 2009 (files pulled from the last commit to the odin system).

3) Run the indexer with current library files ... expect failure.

4) Replace the library files with the old values and run the indexer ... anticipate success.


Results:

1) Handle that has not been indexed

Handle 608 has not been indexed.

Handle | Name | Bytes
608 | 200615.pdf | 842375


If this had been indexed, the query below would also return a *.pdf.txt row (the extracted text) alongside the PDF.


select handle.handle, bitstream.name, bitstream.size_bytes
from handle, item2bundle, bitstream, bundle2bitstream
where handle.resource_id = item2bundle.item_id
and handle.resource_type_id = 2
and item2bundle.bundle_id = bundle2bitstream.bundle_id
and bundle2bitstream.bitstream_id = bitstream.bitstream_id
and handle.handle = '608';

However, we get only the PDF itself:


 handle | name       | size_bytes
--------+------------+------------
 608    | 200615.pdf |     842375
(1 row)


2) Comparison of old and new libraries.



ulwal-320-bu-m:~ silvi003$ dirdiff lib_old lib_new
No matching file: lib_old/PDFBox.jar
No matching file: lib_new/commons-collections-3.2.1.jar
No matching file: lib_new/jcaptcha-1.0-all.jar
No matching file: lib_new/jxl.jar
No matching file: lib_new/pdfbox-1.1.0.jar
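
For anyone without dirdiff to hand, the same name-level comparison can be sketched in a few lines of Java. This is only an illustration: the class name LibDiff and its methods are made up for this post, and it compares file names only, not file contents, so it may not behave exactly like dirdiff.

```java
import java.io.File;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of DSpace): name-only diff of two lib directories. */
final class LibDiff {

    /** Names present in 'left' but absent from 'right', in sorted order. */
    static Set<String> onlyIn(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    /** File names directly inside 'dir'; empty if 'dir' is not a readable directory. */
    static Set<String> fileNames(String dir) {
        Set<String> names = new TreeSet<>();
        String[] listed = new File(dir).list(); // null when dir is not a directory
        if (listed != null) {
            for (String name : listed) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // args[0] = old lib directory, args[1] = new lib directory
        Set<String> oldLib = fileNames(args[0]);
        Set<String> newLib = fileNames(args[1]);
        for (String name : onlyIn(oldLib, newLib)) {
            System.out.println("only in " + args[0] + ": " + name);
        }
        for (String name : onlyIn(newLib, oldLib)) {
            System.out.println("only in " + args[1] + ": " + name);
        }
    }
}
```

It would be run as: java LibDiff lib_old lib_new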

3) Run the indexer with current library files ... expect failure.

First make a dspace-ir test space.
[silvi003@strip1 dspace]$ tu cp -R dspace-ir dspace-ir_test

It did fail:
[silvi003@strip1 bin]$ tu ./filter-media -i  608
Applying Media Filters
:/dspace/dspace-ir_test/lib/activation.jar:/dspace/dspace-ir_test/lib/commons-cli.jar:/dspace/dspace-ir_test/lib/commons-codec-1.3.jar:/dspace/dspace-ir_test/lib/commons-collections-3.2.1.jar:/dspace/dspace-ir_test/lib/commons-collections.jar:/dspace/dspace-ir_test/lib/commons-dbcp.jar:/dspace/dspace-ir_test/lib/commons-fileupload.jar:/dspace/dspace-ir_test/lib/commons-io.jar:/dspace/dspace-ir_test/lib/commons-lang-2.1.jar:/dspace/dspace-ir_test/lib/commons-pool.jar:/dspace/dspace-ir_test/lib/fontbox.jar:/dspace/dspace-ir_test/lib/handle.jar:/dspace/dspace-ir_test/lib/jakarta-poi.jar:/dspace/dspace-ir_test/lib/jargon.jar:/dspace/dspace-ir_test/lib/jaxen.jar:/dspace/dspace-ir_test/lib/jcaptcha-1.0-all.jar:/dspace/dspace-ir_test/lib/jdom.jar:/dspace/dspace-ir_test/lib/jena.jar:/dspace/dspace-ir_test/lib/jstl.jar:/dspace/dspace-ir_test/lib/jxl.jar:/dspace/dspace-ir_test/lib/log4j.jar:/dspace/dspace-ir_test/lib/lucene.jar:/dspace/dspace-ir_test/lib/lucene-sandbox.jar:/dspace/dspace-ir_test/lib/mail.jar:/dspace/dspace-ir_test/lib/mets.jar:/dspace/dspace-ir_test/lib/oaicat.jar:/dspace/dspace-ir_test/lib/oro.jar:/dspace/dspace-ir_test/lib/pdfbox-1.1.0.jar:/dspace/dspace-ir_test/lib/pg74.216.jdbc3.jar:/dspace/dspace-ir_test/lib/rome.jar:/dspace/dspace-ir_test/lib/serializer.jar:/dspace/dspace-ir_test/lib/servlet.jar:/dspace/dspace-ir_test/lib/standard.jar:/dspace/dspace-ir_test/lib/tm-extractors.jar:/dspace/dspace-ir_test/lib/xalan.jar:/dspace/dspace-ir_test/lib/xercesImpl.jar:/dspace/dspace-ir_test/lib/xml-apis.jar:/dspace/dspace-ir_test/config:/dspace/src/dspace-ir/build/classes/
########################################### 
Collection handle: 608
*********************************************
Handle 608
Bitstream ID 2287
Filter org.dspace.app.mediafilter.PDFFilter
Bitstream supports filter true
Bitstream name 200615.pdf.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
	at org.apache.pdfbox.util.PDFStreamEngine.<clinit>(PDFStreamEngine.java:63)
	at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:102)
	at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:159)
	at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:357)
	at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:311)
	at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:272)
	at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:211)
[silvi003@strip1 bin]$ 
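
The error names org/apache/commons/logging/LogFactory, and none of the jar names on the classpath printed above looks like a commons-logging jar. One way to check which classes a given classpath can actually resolve, without running the whole filter, is a small probe like the sketch below. The class name ClassProbe is made up for this post; it only relies on the standard Class.forName call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of DSpace): reports which class names the classpath cannot resolve. */
final class ClassProbe {

    /** Returns true if the named class can be located and loaded, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: load the class but do not run its static initializers
            Class.forName(className, false, ClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, so a broken dependency lands here too
            return false;
        }
    }

    /** Returns the subset of the given class names that could not be loaded. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isLoadable(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Each argument is a fully qualified class name to test
        for (String name : args) {
            System.out.println((isLoadable(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it with the same classpath that filter-media prints (plus the directory holding ClassProbe.class), passing org.apache.commons.logging.LogFactory, org.apache.pdfbox.util.PDFTextStripper and org.pdfbox.util.PDFTextStripper as arguments.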

4) Replace the library files with the old values and run the indexer ... anticipate success.

Contrary to expectation, this also fails, this time on a different class:
[silvi003@strip1 bin]$ tu ./filter-media -i  608
Applying Media Filters
:/dspace/dspace-ir_test/lib/activation.jar:/dspace/dspace-ir_test/lib/commons-cli.jar:/dspace/dspace-ir_test/lib/commons-codec-1.3.jar:/dspace/dspace-ir_test/lib/commons-collections.jar:/dspace/dspace-ir_test/lib/commons-dbcp.jar:/dspace/dspace-ir_test/lib/commons-fileupload.jar:/dspace/dspace-ir_test/lib/commons-io.jar:/dspace/dspace-ir_test/lib/commons-lang-2.1.jar:/dspace/dspace-ir_test/lib/commons-pool.jar:/dspace/dspace-ir_test/lib/fontbox.jar:/dspace/dspace-ir_test/lib/handle.jar:/dspace/dspace-ir_test/lib/jakarta-poi.jar:/dspace/dspace-ir_test/lib/jargon.jar:/dspace/dspace-ir_test/lib/jaxen.jar:/dspace/dspace-ir_test/lib/jdom.jar:/dspace/dspace-ir_test/lib/jena.jar:/dspace/dspace-ir_test/lib/jstl.jar:/dspace/dspace-ir_test/lib/log4j.jar:/dspace/dspace-ir_test/lib/lucene.jar:/dspace/dspace-ir_test/lib/lucene-sandbox.jar:/dspace/dspace-ir_test/lib/mail.jar:/dspace/dspace-ir_test/lib/mets.jar:/dspace/dspace-ir_test/lib/oaicat.jar:/dspace/dspace-ir_test/lib/oro.jar:/dspace/dspace-ir_test/lib/PDFBox.jar:/dspace/dspace-ir_test/lib/pg74.216.jdbc3.jar:/dspace/dspace-ir_test/lib/rome.jar:/dspace/dspace-ir_test/lib/serializer.jar:/dspace/dspace-ir_test/lib/servlet.jar:/dspace/dspace-ir_test/lib/standard.jar:/dspace/dspace-ir_test/lib/tm-extractors.jar:/dspace/dspace-ir_test/lib/xalan.jar:/dspace/dspace-ir_test/lib/xercesImpl.jar:/dspace/dspace-ir_test/lib/xml-apis.jar:/dspace/dspace-ir_test/config:/dspace/src/dspace-ir/build/classes/
########################################### 
Collection handle: 608
*********************************************
Handle 608
Bitstream ID 2287
Filter org.dspace.app.mediafilter.PDFFilter
Bitstream supports filter true
Bitstream name 200615.pdf.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/pdfbox/util/PDFTextStripper
	at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:102)
	at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:159)
	at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:357)
	at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:311)
	at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:272)
	at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:211)


However, a PDFTextStripper class is there, but in the package org.pdfbox.util rather than the org.apache.pdfbox.util named in the error.


[silvi003@strip1 lib]$ jar -tvf PDFBox.jar | grep PDFTextStripper
13194 Thu Oct 12 12:19:36 CDT 2006 org/pdfbox/util/PDFTextStripper.class
3358 Thu Oct 12 12:19:38 CDT 2006 org/pdfbox/util/PDFTextStripperByArea.class
4382 Mon Jul 31 20:27:52 CDT 2006 Resources/PDFTextStripper.properties
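
The jar -tvf | grep check works for one jar at a time and also matches near misses such as PDFTextStripperByArea. A rough Java sketch of the same search across several jars, printing the full package of each exact match, is given below. The class name JarClassFinder and its methods are invented for this post; only the standard java.util.jar classes are used.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical helper (not part of DSpace): finds which package a class lives under in a jar. */
final class JarClassFinder {

    /** Keeps only entries that are exactly simpleName.class, in any package. */
    static List<String> matching(Iterable<String> entryNames, String simpleName) {
        String fileName = simpleName + ".class";
        List<String> hits = new ArrayList<>();
        for (String entry : entryNames) {
            if (entry.equals(fileName) || entry.endsWith("/" + fileName)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    /** Converts a jar entry such as a/b/C.class to the class name a.b.C. */
    static String toClassName(String entry) {
        return entry.substring(0, entry.length() - ".class".length()).replace('/', '.');
    }

    /** Lists every entry name in the jar at the given path. */
    static List<String> entryNames(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0] = simple class name, remaining args = jar files to search
        for (int i = 1; i < args.length; i++) {
            for (String entry : matching(entryNames(args[i]), args[0])) {
                System.out.println(args[i] + ": " + toClassName(entry));
            }
        }
    }
}
```

It would be run from the lib directory as: java JarClassFinder PDFTextStripper *.jar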
