PDF files that are not being indexed: SQL to look at the issue
The problem
Beth found that the item with handle 56385 (http://conservancy.umn.edu/handle/56385) had not gone through the full-text indexer. The command
/dspace/dspace-ir/bin/filter-media -i 56385
will index just that one item. Here is the error message that you get:
ErrorMediaFilter
This error message indicates that the error is coming from the third-party jar PDFBox.jar.
Some SQL
Below is an SQL query that pulls out all of the files in the repository that are PDFs:
select handle.handle,bitstream.name, bitstream.description from handle,item2bundle,bitstream,bundle2bitstream where handle.resource_id = item2bundle.item_id and handle.resource_type_id=2 and item2bundle.bundle_id=bundle2bitstream.bundle_id and bundle2bitstream.bitstream_id = bitstream.bitstream_id and bitstream.name ~ '^.*pdf$' order by handle::text::integer;
Next is an SQL query that finds all the PDFs that have been indexed:
select handle.handle,bitstream.name, bitstream.description from handle,item2bundle,bitstream,bundle2bitstream where handle.resource_id = item2bundle.item_id and handle.resource_type_id=2 and item2bundle.bundle_id=bundle2bitstream.bundle_id and bundle2bitstream.bitstream_id = bitstream.bitstream_id and bitstream.description = 'Extracted text' order by handle::text::integer;
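The two queries above can be combined to list the affected items directly. The following is only a sketch, assuming the same tables and columns used in those queries and the same PostgreSQL dialect. It works at the item level: it returns handles of items that have at least one PDF bitstream but no bitstream described as 'Extracted text'. Because it counts items rather than files, its row count need not match a file-level difference, and it will miss an item whose PDF failed but which has extracted text from some other file.

```sql
-- Sketch: items with a PDF bitstream but no 'Extracted text' bitstream.
-- Assumes the same schema as the queries above; the aliases h, i2b,
-- b2b and b are introduced here for readability only.
select h.handle
from handle h
where h.resource_type_id = 2
  and exists (
    -- the item has at least one bitstream whose name ends in "pdf"
    select 1
    from item2bundle i2b
    join bundle2bitstream b2b on b2b.bundle_id = i2b.bundle_id
    join bitstream b on b.bitstream_id = b2b.bitstream_id
    where i2b.item_id = h.resource_id
      and b.name ~ '^.*pdf$'
  )
  and not exists (
    -- the item has no bitstream produced by the text extractor
    select 1
    from item2bundle i2b
    join bundle2bitstream b2b on b2b.bundle_id = i2b.bundle_id
    join bitstream b on b.bitstream_id = b2b.bitstream_id
    where i2b.item_id = h.resource_id
      and b.description = 'Extracted text'
  )
order by h.handle::text::integer;
```

Each handle returned could then be passed to filter-media -i, as in the command shown earlier, to see whether it reproduces the same error.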
Current number of unindexed files
# PDFs - 14477
# Indexed - 13252
# Unindexed - 1225