Indexer problem and old libraries
When the full-text indexer tries to process a single handle, it fails with a class-not-found error.
The plan
1) Locate a handle, less than a megabyte in size, that has not been indexed. Confirm this by checking the Postgres DB directly.
2) Find the files in the library that have changed between now and Wed, 18 Feb 2009 (files pulled from the last commit to the odin system).
3) Run the indexer with current library files ... expect failure.
4) Replace the library files with the old values and run the indexer ... anticipate success.
Results:
1) Handle that has not been indexed
Handle 608 has not been indexed.
Handle | Name | Bytes
608 | 200615.pdf | 842375
If this had been indexed, the query below would also return a *.pdf.txt file.
select handle.handle, bitstream.name, bitstream.size_bytes
from handle, item2bundle, bitstream, bundle2bitstream
where handle.resource_id = item2bundle.item_id
  and handle.resource_type_id = 2
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and handle = 608;
However we get
handle | name | size_bytes
--------+------------+------------
608 | 200615.pdf | 842375
(1 row)
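The check above can be sketched in Python, assuming the query rows have already been fetched as (handle, name, size_bytes) tuples. The function name has_extracted_text is hypothetical, and the ".txt" suffix rule is an assumption based on the bitstream name 200615.pdf.txt that the media filter prints in this log.

```python
def has_extracted_text(rows, source_name):
    """Return True if rows include the text bitstream derived from source_name.

    Assumption: the media filter names its output by appending ".txt" to the
    source bitstream name (200615.pdf -> 200615.pdf.txt).
    """
    expected = source_name + ".txt"
    return any(name == expected for _handle, name, _size in rows)


# The single row returned by the query for handle 608.
rows = [(608, "200615.pdf", 842375)]
print(has_extracted_text(rows, "200615.pdf"))  # False: no extracted text yet
```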
2) Comparison of old and new libraries.
ulwal-320-bu-m:~ silvi003$ dirdiff lib_old lib_new
No matching file: lib_old/PDFBox.jar
No matching file: lib_new/commons-collections-3.2.1.jar
No matching file: lib_new/jcaptcha-1.0-all.jar
No matching file: lib_new/jxl.jar
No matching file: lib_new/pdfbox-1.1.0.jar
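Where dirdiff is not available, a similar comparison can be sketched in Python. This is a minimal version that compares file names only, not file contents; the function name compare_libs is hypothetical, and lib_old and lib_new are the directory names used above.

```python
import os


def compare_libs(old_dir, new_dir):
    """Return (only_in_old, only_in_new) as sorted lists of file names."""
    old_files = set(os.listdir(old_dir))
    new_files = set(os.listdir(new_dir))
    return sorted(old_files - new_files), sorted(new_files - old_files)


# Run the comparison only if both directories exist in the working directory.
if os.path.isdir("lib_old") and os.path.isdir("lib_new"):
    only_old, only_new = compare_libs("lib_old", "lib_new")
    for name in only_old:
        print("only in lib_old:", name)
    for name in only_new:
        print("only in lib_new:", name)
```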
3) Run the indexer with current library files ... expect failure.
First make a dspace-ir test space.
[silvi003@strip1 dspace]$ tu cp -R dspace-ir dspace-ir_test
It did fail:
[silvi003@strip1 bin]$ tu ./filter-media -i 608
Applying Media Filters
:/dspace/dspace-ir_test/lib/activation.jar:/dspace/dspace-ir_test/lib/commons-cli.jar:/dspace/dspace-ir_test/lib/commons-codec-1.3.jar:/dspace/dspace-ir_test/lib/commons-collections-3.2.1.jar:/dspace/dspace-ir_test/lib/commons-collections.jar:/dspace/dspace-ir_test/lib/commons-dbcp.jar:/dspace/dspace-ir_test/lib/commons-fileupload.jar:/dspace/dspace-ir_test/lib/commons-io.jar:/dspace/dspace-ir_test/lib/commons-lang-2.1.jar:/dspace/dspace-ir_test/lib/commons-pool.jar:/dspace/dspace-ir_test/lib/fontbox.jar:/dspace/dspace-ir_test/lib/handle.jar:/dspace/dspace-ir_test/lib/jakarta-poi.jar:/dspace/dspace-ir_test/lib/jargon.jar:/dspace/dspace-ir_test/lib/jaxen.jar:/dspace/dspace-ir_test/lib/jcaptcha-1.0-all.jar:/dspace/dspace-ir_test/lib/jdom.jar:/dspace/dspace-ir_test/lib/jena.jar:/dspace/dspace-ir_test/lib/jstl.jar:/dspace/dspace-ir_test/lib/jxl.jar:/dspace/dspace-ir_test/lib/log4j.jar:/dspace/dspace-ir_test/lib/lucene.jar:/dspace/dspace-ir_test/lib/lucene-sandbox.jar:/dspace/dspace-ir_test/lib/mail.jar:/dspace/dspace-ir_test/lib/mets.jar:/dspace/dspace-ir_test/lib/oaicat.jar:/dspace/dspace-ir_test/lib/oro.jar:/dspace/dspace-ir_test/lib/pdfbox-1.1.0.jar:/dspace/dspace-ir_test/lib/pg74.216.jdbc3.jar:/dspace/dspace-ir_test/lib/rome.jar:/dspace/dspace-ir_test/lib/serializer.jar:/dspace/dspace-ir_test/lib/servlet.jar:/dspace/dspace-ir_test/lib/standard.jar:/dspace/dspace-ir_test/lib/tm-extractors.jar:/dspace/dspace-ir_test/lib/xalan.jar:/dspace/dspace-ir_test/lib/xercesImpl.jar:/dspace/dspace-ir_test/lib/xml-apis.jar:/dspace/dspace-ir_test/config:/dspace/src/dspace-ir/build/classes/
###########################################
Collection handle: 608
*********************************************
Handle 608
Bitstream ID 2287
Filter org.dspace.app.mediafilter.PDFFilter
Bitstream supports filter true
Bitstream name 200615.pdf.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.pdfbox.util.PDFStreamEngine.(PDFStreamEngine.java:63)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:102)
    at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:159)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:357)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:311)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:272)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:211)
[silvi003@strip1 bin]$
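The missing class here is org/apache/commons/logging/LogFactory, and no jar with commons-logging in its name appears on the classpath printed by filter-media. One way to confirm which jar, if any, supplies a given class is to scan the lib directory. The following is a minimal Python sketch using only the standard zipfile module; the function name jars_providing is hypothetical.

```python
import glob
import os
import zipfile


def jars_providing(lib_dir, class_name):
    """List the jars in lib_dir that contain the given class.

    class_name uses the slash-separated form from the stack trace,
    e.g. "org/apache/commons/logging/LogFactory".
    """
    entry = class_name + ".class"
    hits = []
    for jar_path in sorted(glob.glob(os.path.join(lib_dir, "*.jar"))):
        try:
            with zipfile.ZipFile(jar_path) as jar:
                if entry in jar.namelist():
                    hits.append(os.path.basename(jar_path))
        except zipfile.BadZipFile:
            continue  # skip files that are not valid archives
    return hits


# Scan the test lib directory from this log, if it exists on this machine.
lib = "/dspace/dspace-ir_test/lib"
if os.path.isdir(lib):
    print(jars_providing(lib, "org/apache/commons/logging/LogFactory"))
```

An empty list would be consistent with the NoClassDefFoundError above.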
4) Replace the library files with the old values and run the indexer ... anticipate success.
This fails:
[silvi003@strip1 bin]$ tu ./filter-media -i 608
Applying Media Filters
:/dspace/dspace-ir_test/lib/activation.jar:/dspace/dspace-ir_test/lib/commons-cli.jar:/dspace/dspace-ir_test/lib/commons-codec-1.3.jar:/dspace/dspace-ir_test/lib/commons-collections.jar:/dspace/dspace-ir_test/lib/commons-dbcp.jar:/dspace/dspace-ir_test/lib/commons-fileupload.jar:/dspace/dspace-ir_test/lib/commons-io.jar:/dspace/dspace-ir_test/lib/commons-lang-2.1.jar:/dspace/dspace-ir_test/lib/commons-pool.jar:/dspace/dspace-ir_test/lib/fontbox.jar:/dspace/dspace-ir_test/lib/handle.jar:/dspace/dspace-ir_test/lib/jakarta-poi.jar:/dspace/dspace-ir_test/lib/jargon.jar:/dspace/dspace-ir_test/lib/jaxen.jar:/dspace/dspace-ir_test/lib/jdom.jar:/dspace/dspace-ir_test/lib/jena.jar:/dspace/dspace-ir_test/lib/jstl.jar:/dspace/dspace-ir_test/lib/log4j.jar:/dspace/dspace-ir_test/lib/lucene.jar:/dspace/dspace-ir_test/lib/lucene-sandbox.jar:/dspace/dspace-ir_test/lib/mail.jar:/dspace/dspace-ir_test/lib/mets.jar:/dspace/dspace-ir_test/lib/oaicat.jar:/dspace/dspace-ir_test/lib/oro.jar:/dspace/dspace-ir_test/lib/PDFBox.jar:/dspace/dspace-ir_test/lib/pg74.216.jdbc3.jar:/dspace/dspace-ir_test/lib/rome.jar:/dspace/dspace-ir_test/lib/serializer.jar:/dspace/dspace-ir_test/lib/servlet.jar:/dspace/dspace-ir_test/lib/standard.jar:/dspace/dspace-ir_test/lib/tm-extractors.jar:/dspace/dspace-ir_test/lib/xalan.jar:/dspace/dspace-ir_test/lib/xercesImpl.jar:/dspace/dspace-ir_test/lib/xml-apis.jar:/dspace/dspace-ir_test/config:/dspace/src/dspace-ir/build/classes/
###########################################
Collection handle: 608
*********************************************
Handle 608
Bitstream ID 2287
Filter org.dspace.app.mediafilter.PDFFilter
Bitstream supports filter true
Bitstream name 200615.pdf.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/pdfbox/util/PDFTextStripper
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:102)
    at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:159)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:357)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:311)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:272)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:211)
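However, a PDFTextStripper class is there, although the listing below shows it under org/pdfbox/util rather than the org/apache/pdfbox/util named in the error.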
[silvi003@strip1 lib]$ jar -tvf PDFBox.jar | grep PDFTextStripper
13194 Thu Oct 12 12:19:36 CDT 2006 org/pdfbox/util/PDFTextStripper.class
3358 Thu Oct 12 12:19:38 CDT 2006 org/pdfbox/util/PDFTextStripperByArea.class
4382 Mon Jul 31 20:27:52 CDT 2006 Resources/PDFTextStripper.properties
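The listing and the error disagree on the package: the JVM asked for org/apache/pdfbox/util/PDFTextStripper, while PDFBox.jar holds org/pdfbox/util/PDFTextStripper. This suggests, though the log alone does not prove it, that the compiled DSpace classes now reference the newer package name. A minimal Python sketch of that comparison follows; the function name same_name_other_package is hypothetical, and the entries are copied from the jar listing above.

```python
def same_name_other_package(entries, wanted):
    """Return .class entries that share wanted's simple class name
    but live at a different path (i.e. in a different package)."""
    wanted_entry = wanted + ".class"
    simple = wanted_entry.rsplit("/", 1)[-1]
    return [
        e for e in entries
        if e.rsplit("/", 1)[-1] == simple and e != wanted_entry
    ]


# Entries reported by "jar -tvf PDFBox.jar | grep PDFTextStripper".
entries = [
    "org/pdfbox/util/PDFTextStripper.class",
    "org/pdfbox/util/PDFTextStripperByArea.class",
    "Resources/PDFTextStripper.properties",
]
# Class named in the NoClassDefFoundError.
wanted = "org/apache/pdfbox/util/PDFTextStripper"

print(same_name_other_package(entries, wanted))
# ['org/pdfbox/util/PDFTextStripper.class']
```

A non-empty result means a class with the right name exists but cannot satisfy the lookup, because the JVM matches on the fully qualified name.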