
media filter UDC and cron job

Cron job

My predecessor wrote a cron job to index the contents of the PDFs in UDC. It is:
#
# Filter media
#
1 0 * * * /dspace/dspace-ir/bin/filter-media.sh > /dspace/dspace-ir/log/filter-media.log 2>&1
This process was found to take up to eight hours to run, which affected users.
The cron job will need to be edited or replaced.
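
One possible direction for the replacement, sketched here as an assumption rather than a tested fix: have cron call a small wrapper that runs the filter at low CPU priority and refuses to start while a previous run is still going. The lock path and the function name run_filter_media are hypothetical; only the filter-media.sh and log paths come from the cron entry above. This does not make the filter any faster, it only tries to reduce the impact on users.

```shell
#!/bin/sh
# Hypothetical wrapper around filter-media.sh (a sketch, not the original job).
# Paths default to the ones in the existing cron entry; LOCK is an assumed location.
FILTER=${FILTER:-/dspace/dspace-ir/bin/filter-media.sh}
LOG=${LOG:-/dspace/dspace-ir/log/filter-media.log}
LOCK=${LOCK:-/tmp/filter-media.lock}

run_filter_media() {
    # mkdir is atomic, so the directory doubles as a simple lock:
    # if it already exists, an earlier run has not finished yet.
    if ! mkdir "$LOCK" 2>/dev/null; then
        echo "filter-media is already running, skipping this run" >&2
        return 1
    fi
    # nice -n 19 asks the scheduler for the lowest CPU priority.
    nice -n 19 "$FILTER" > "$LOG" 2>&1
    status=$?
    # Release the lock so the next scheduled run can start.
    rmdir "$LOCK"
    return $status
}
```

The script would end with a call to run_filter_media, and the cron line would point at the wrapper instead of filter-media.sh. If the wrapper is killed mid-run, the lock directory has to be removed by hand.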

Error record associated with the cron job

Creating search index:
Applying Media Filters
2008-02-11 08:07:19,271 INFO  org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
Exception in thread "main" java.lang.IllegalArgumentException: Cannot resolve 4938 to a DSpace object
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:192)
Applying Media Filters
2008-02-11 08:07:19,839 INFO  org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
2008-02-11 08:07:20,160 INFO  org.dspace.content.MetadataField @ Loading MetadataField elements into cache.
2008-02-11 08:07:20,199 INFO  org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
SKIPPED: bitstream 16263 because 'LIFE_SCIENCEs_PREDESIGN_REPORT041504_.pdf.txt' already exists
SKIPPED: bitstream 16261 because 'equine_predesign_may04.pdf.txt' already exists
SKIPPED: bitstream 16259 because 'EducationalFacilitiesPredesignStudyFinal.pdf.txt' already exists
ERROR filtering, skipping bitstream #16251 java.lang.ArrayIndexOutOfBoundsException: 4
java.lang.ArrayIndexOutOfBoundsException: 4
        at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:294)
        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:103)
        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
        at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:110)
        at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:155)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:327)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:296)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:266)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:260)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:202)
ERROR filtering, skipping bitstream #16250 java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
SKIPPED: bitstream 16249 because 'Volume_II-Appendix2.pdf.txt' already exists
ERROR filtering, skipping bitstream #16248 java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
SKIPPED: bitstream 15486 because 'AHC_FacilitiesMasterPlan.pdf.txt' already exists
SKIPPED: bitstream 13492 because 'Vet_med_facilities_development_plan_FINAL.pdf.txt' already exists
SKIPPED: bitstream 13490 because 'SPH_CONSOLIDATION.pdf.txt' already exists
SKIPPED: bitstream 13488 because 'AHC_strategic_facility_plan_1998.pdf.txt' already exists
SKIPPED: bitstream 13486 because 'AHC_Precinct_Plan_Report_Final_May_2006.pdf.txt' already exists
SKIPPED: bitstream 13484 because 'AHC_Mpls_District_Plan_2000.pdf.txt' already exists
Creating search index:
Creating browse index
Indexing all Items in DSpace....2008-02-11 08:17:24,358 INFO  org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
2008-02-11 08:17:25,315 INFO  org.dspace.content.MetadataField @ Loading MetadataField elements into cache.
2008-02-11 08:17:25,357 INFO  org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
 ... Done
Creating search index
2008-02-11 08:19:57,683 INFO  org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
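
To get a quick picture of a run without reading the whole log, the SKIPPED and ERROR lines can be counted and the failing bitstream IDs pulled out. This is a small sketch that assumes the log lines look exactly like the ones above; the function name summarise_filter_log is made up for illustration.

```shell
#!/bin/sh
# Summarise a filter-media log (sketch; assumes the line formats shown above).
# Usage: summarise_filter_log /path/to/filter-media.log
summarise_filter_log() {
    log="$1"
    # grep -c prints the count but exits non-zero on zero matches, hence "|| true".
    skipped=$(grep -c '^SKIPPED: bitstream ' "$log" || true)
    failed=$(grep -c '^ERROR filtering, skipping bitstream #' "$log" || true)
    echo "skipped: $skipped"
    echo "failed: $failed"
    # Print the numeric ID of every bitstream that raised an exception.
    sed -n 's/^ERROR filtering, skipping bitstream #\([0-9][0-9]*\) .*/\1/p' "$log"
}
```

It can be pointed at the log the cron job writes, for example summarise_filter_log /dspace/dspace-ir/log/filter-media.log. On the excerpt above it would report 10 skipped and 3 failed, with IDs 16251, 16250 and 16248.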

Comments

Hi Jeff, if you know a solution for this problem, please send some tips!

My program also generates the exception:

java.lang.ArrayIndexOutOfBoundsException: 4
at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:294)


...
