media filter UDC and cron job
Cron job
My predecessor wrote a cron job to index the contents of the PDFs in UDC. It is:
#
# Filter media
#
1 0 * * * /dspace/dspace-ir/bin/filter-media.sh > /dspace/dspace-ir/log/filter-media.log 2>&1
It was noted that this process took up to eight hours to run and slowed the system down for users.
It will need to be edited and replaced.
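One possible direction for the replacement, as a rough sketch only: wrap the existing script so it runs at the lowest CPU priority and never overlaps with a run that is still going. The FILTER_CMD path comes from the cron entry above; the LOCK_DIR location and the run_filter name are made up for illustration, and whether a lower priority is enough to stop users noticing an eight-hour run is untested.

```shell
#!/bin/sh
# Hypothetical wrapper around the existing filter-media.sh.
# FILTER_CMD is the script from the cron entry; LOCK_DIR is an assumed location.
FILTER_CMD="${FILTER_CMD:-/dspace/dspace-ir/bin/filter-media.sh}"
LOCK_DIR="${LOCK_DIR:-/tmp/filter-media.lock}"

run_filter() {
    # mkdir is atomic: it fails if the directory already exists,
    # which means an earlier run has not finished yet.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "filter-media: previous run still active, skipping" >&2
        return 1
    fi
    # nice -n 19 asks for the lowest CPU priority so web requests win.
    nice -n 19 "$FILTER_CMD"
    status=$?
    # Release the lock whether or not the filter succeeded.
    rmdir "$LOCK_DIR"
    return $status
}
```

The cron entry would then call a script that ends with run_filter instead of calling filter-media.sh directly. If the job is killed mid-run the lock directory stays behind and has to be removed by hand.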
Error record associated with the cron job
Creating search index:
Applying Media Filters
2008-02-11 08:07:19,271 INFO org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
Exception in thread "main" java.lang.IllegalArgumentException: Cannot resolve 4938 to a DSpace object
at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:192)
Applying Media Filters
2008-02-11 08:07:19,839 INFO org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
2008-02-11 08:07:20,160 INFO org.dspace.content.MetadataField @ Loading MetadataField elements into cache.
2008-02-11 08:07:20,199 INFO org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
SKIPPED: bitstream 16263 because 'LIFE_SCIENCEs_PREDESIGN_REPORT041504_.pdf.txt' already exists
SKIPPED: bitstream 16261 because 'equine_predesign_may04.pdf.txt' already exists
SKIPPED: bitstream 16259 because 'EducationalFacilitiesPredesignStudyFinal.pdf.txt' already exists
ERROR filtering, skipping bitstream #16251 java.lang.ArrayIndexOutOfBoundsException: 4
java.lang.ArrayIndexOutOfBoundsException: 4
at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:294)
at org.fontbox.cmap.CMapParser.parse(CMapParser.java:103)
at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:110)
at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:155)
at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:327)
at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:296)
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:266)
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:260)
at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:202)
ERROR filtering, skipping bitstream #16250 java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
SKIPPED: bitstream 16249 because 'Volume_II-Appendix2.pdf.txt' already exists
ERROR filtering, skipping bitstream #16248 java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
SKIPPED: bitstream 15486 because 'AHC_FacilitiesMasterPlan.pdf.txt' already exists
SKIPPED: bitstream 13492 because 'Vet_med_facilities_development_plan_FINAL.pdf.txt' already exists
SKIPPED: bitstream 13490 because 'SPH_CONSOLIDATION.pdf.txt' already exists
SKIPPED: bitstream 13488 because 'AHC_strategic_facility_plan_1998.pdf.txt' already exists
SKIPPED: bitstream 13486 because 'AHC_Precinct_Plan_Report_Final_May_2006.pdf.txt' already exists
SKIPPED: bitstream 13484 because 'AHC_Mpls_District_Plan_2000.pdf.txt' already exists
Creating search index:
Creating browse index
Indexing all Items in DSpace....2008-02-11 08:17:24,358 INFO org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
2008-02-11 08:17:25,315 INFO org.dspace.content.MetadataField @ Loading MetadataField elements into cache.
2008-02-11 08:17:25,357 INFO org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
... Done
Creating search index
2008-02-11 08:19:57,683 INFO org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
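A log like this is easier to act on once it is boiled down to counts and the IDs of the bitstreams that failed. The following is a small sketch of that, not part of DSpace: summarize_filter_log is a made-up helper name, and it assumes the SKIPPED and ERROR entries each start at the beginning of a line, as they do in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: summarize a filter-media log.
# Assumes the "SKIPPED: bitstream" and "ERROR filtering" entries
# begin at the start of a line, as in the excerpt above.
summarize_filter_log() {
    log_file="$1"
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    skipped=$(grep -c '^SKIPPED: bitstream' "$log_file" || true)
    failed=$(grep -c '^ERROR filtering, skipping bitstream' "$log_file" || true)
    echo "skipped=$skipped failed=$failed"
    # Print the ID of every bitstream that failed, one per line.
    sed -n 's/^ERROR filtering, skipping bitstream #\([0-9][0-9]*\).*/\1/p' "$log_file"
}
```

Called as summarize_filter_log /dspace/dspace-ir/log/filter-media.log, it prints one summary line followed by the failed bitstream IDs. On the excerpt above that would be 10 skipped and 3 failed (16251, 16250, 16248), which gives a short list of PDFs to inspect by hand.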
Comments
Hi Jeff, if you know a solution for this problem, please send some tips!
My program also generates the exception:
java.lang.ArrayIndexOutOfBoundsException: 4
at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:294)
...
Posted by: Ilija Antovic | December 3, 2008 4:20 AM