
PDF files that are not being indexed: SQL to look at the issue

The problem

Beth found that the item with handle 56385 (http://conservancy.umn.edu/handle/56385) had not gone through the full-text indexer.
The command
/dspace/dspace-ir/bin/filter-media -i 56385
will index just that one item. Here is the error message that you get:
[Image: ErrorMediaFilter]
This error message indicates that the error is coming from the third-party jar PDFBox.jar.
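To retry more than one item, the same command can be generated once per handle. The following is a minimal sketch in Python, not part of the original workflow: the path and the -i flag are copied from the command above, the helper name filter_media_commands is hypothetical, and the script only prints the command lines rather than running them.

```python
import shlex

# Path and -i flag are taken from the command shown in this post.
FILTER_MEDIA = "/dspace/dspace-ir/bin/filter-media"


def filter_media_commands(handles):
    """Build one filter-media command line per handle, without running anything."""
    return [shlex.join([FILTER_MEDIA, "-i", str(handle)]) for handle in handles]


# 56385 is the handle from this post; add further handles to the list as needed.
commands = filter_media_commands(["56385"])
for command in commands:
    print(command)
```

Printing first makes it easy to check the list before pasting the lines into a shell on the DSpace server.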

Some SQL

Below is an SQL query that pulls out all of the files that are PDFs in the repository:
select handle.handle, bitstream.name, bitstream.description
from handle, item2bundle, bitstream, bundle2bitstream
where handle.resource_id = item2bundle.item_id
  and handle.resource_type_id = 2
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and bitstream.name ~ '^.*pdf$'
order by handle::text::integer;

Next is an SQL query that finds all the PDFs that have been indexed, by looking for bitstreams whose description is 'Extracted text':
select handle.handle, bitstream.name, bitstream.description
from handle, item2bundle, bitstream, bundle2bitstream
where handle.resource_id = item2bundle.item_id
  and handle.resource_type_id = 2
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and bitstream.description = 'Extracted text'
order by handle::text::integer;
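The two result sets can be compared to list which PDFs are missing extracted text, instead of only counting them. The sketch below is one possible way to do that in Python and rests on an assumption that should be checked against the real data: that the extracted-text bitstream is named after the PDF with ".txt" appended (for example report.pdf.txt). The function name unindexed_pdfs and the sample rows are hypothetical.

```python
def unindexed_pdfs(pdf_rows, text_rows):
    """Return (handle, name) pairs for PDFs with no matching extracted-text row.

    pdf_rows:  (handle, name, description) rows from the first query.
    text_rows: (handle, name, description) rows from the second query.
    Assumption: the extracted-text bitstream is named "<pdf name>.txt".
    """
    indexed = {(handle, name) for handle, name, _ in text_rows}
    return [
        (handle, name)
        for handle, name, _ in pdf_rows
        if (handle, name + ".txt") not in indexed
    ]


# Hypothetical sample rows, for illustration only.
pdf_rows = [
    ("100", "report.pdf", None),
    ("101", "thesis.pdf", None),
    ("101", "appendix.pdf", None),
]
text_rows = [
    ("100", "report.pdf.txt", "Extracted text"),
    ("101", "thesis.pdf.txt", "Extracted text"),
]

missing = unindexed_pdfs(pdf_rows, text_rows)
print(missing)
```

Matching on the handle and the file name together, rather than on the handle alone, catches items where only some of the attached PDFs were indexed.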

Current number of unindexed files

# PDFs - 14477
# Indexed - 13252

# Unindexed - 1225
