
Possible solution to large sets of PDFs not being indexed.

1) Currently, DSpace reads the bitstreams for all of the PDFs into memory before indexing starts.

2) General idea: build an array of bitstream.internal_id values and derive the file names from them. Open the files one at a time and process each in turn. An if statement will be needed to catch the PDF filters, and a new version of the processBitstream method will have to be written in the MediaFilter.java class.
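
A minimal sketch of the one-file-at-a-time idea, using only the Java standard library. It assumes the asset store keeps each file under three directory levels taken from the first six digits of internal_id (for example 21/23/58/2123...); check that against your own asset store before relying on it. The names OneAtATime, StreamHandler, pathFor and processAll are hypothetical and are not part of DSpace.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class OneAtATime {

    /** Hypothetical callback standing in for the per-file filter work. */
    interface StreamHandler {
        void handle(String internalId, InputStream in) throws IOException;
    }

    /**
     * Builds a file path from an internal_id.
     * ASSUMPTION: the first three pairs of digits are directory levels.
     */
    static Path pathFor(Path assetStoreRoot, String internalId) {
        if (internalId == null || internalId.length() < 6) {
            throw new IllegalArgumentException("internal_id too short: " + internalId);
        }
        return assetStoreRoot
                .resolve(internalId.substring(0, 2))
                .resolve(internalId.substring(2, 4))
                .resolve(internalId.substring(4, 6))
                .resolve(internalId);
    }

    /**
     * Opens each file in turn, hands it to the handler, and closes it
     * before moving on, so only one stream is open at any time.
     * Returns the number of files that were actually processed.
     */
    static int processAll(Path assetStoreRoot, List<String> internalIds,
                          StreamHandler handler) throws IOException {
        int processed = 0;
        for (String id : internalIds) {
            Path file = pathFor(assetStoreRoot, id);
            if (!Files.isRegularFile(file)) {
                continue; // skip ids whose file is missing
            }
            try (InputStream in = Files.newInputStream(file)) {
                handler.handle(id, in);
            }
            processed++;
        }
        return processed;
    }
}
```

The handler is where the PDF text extraction would go. Because each stream is closed before the next one is opened, memory use should depend on the largest single file rather than on the whole collection.
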
3) Determine whether a filter is the PDF filter. In MediaFilterManager, the expression:

filterClasses[i].getClass().getName()
produces a String of the form:
org.dspace.app.mediafilter.PDFFilter
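
A small sketch of the "if" that catches the PDF filter, based on the class name string shown above. The helper class PdfFilterCheck and its methods are hypothetical names made up for this example; only the class name string comes from the notes.

```java
final class PdfFilterCheck {

    /** Class name reported by getClass().getName() for the PDF filter. */
    static final String PDF_FILTER_CLASS = "org.dspace.app.mediafilter.PDFFilter";

    /** True when the given class name is exactly the PDF filter's name. */
    static boolean isPdfFilterName(String className) {
        return PDF_FILTER_CLASS.equals(className);
    }

    /** True when the given filter object is an instance of that exact class. */
    static boolean isPdfFilter(Object filter) {
        return filter != null && isPdfFilterName(filter.getClass().getName());
    }
}
```

In MediaFilterManager this would be called as PdfFilterCheck.isPdfFilter(filterClasses[i]). Comparing the exact name will not match a subclass of the PDF filter, which may or may not be what is wanted.
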

4) Bundle.java contains an example of a database query in DSpace:

TableRowIterator tri = DatabaseManager.queryTable(
        ourContext, "bitstream",
        "SELECT bitstream.* FROM bitstream, bundle2bitstream WHERE "
        + "bundle2bitstream.bitstream_id=bitstream.bitstream_id AND "
        + "bundle2bitstream.bundle_id= ? ",
        bundleRow.getIntColumn("bundle_id"));

5) A query that will get the bitstream.internal_id from the handle:

select handle.handle, bitstream.internal_id
from handle, item2bundle, bitstream, bundle2bitstream
where handle.resource_id = item2bundle.item_id
  and handle.resource_type_id = 2
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and bitstream.name ~ '^.*pdf$'
order by handle::text::integer;

6) The SQL query produces output like this:

 handle |               internal_id
--------+-----------------------------------------
 4      | 21235899611297801355164539089367487918
 6      | 107336362058605133394783483256748994425
 7      | 11090457400765004032181753363050906238
 8      | 15085152857357519261000291360441738392
 9      | 107871098203799408458667072002741710893
 11     | 50276060641731470155592232786626559701
 12     | 65101419339422847404612669026873496384
 13     | 43890973756880840755046786472676911568
 15     | 160770701110516817178377325959276903855
 16     | 65016360752601910182437228650469824124
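7) Finding the handle. This assumes the handle is a purely numeric string, as in the output above; Integer.parseInt throws a NumberFormatException otherwise.

int handle = Integer.parseInt(myItem.getHandle());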
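
A hedged sketch of a more forgiving version of the parse, in case some handles come back in prefix/suffix form (such as 123456789/16) rather than as a bare number. Whether the numeric suffix is the right value to use in that case is an assumption. The class HandleNumbers and the method numericPart are hypothetical names, not DSpace API.

```java
import java.util.OptionalInt;

final class HandleNumbers {

    /**
     * Returns the integer after the last '/' in the handle, or the whole
     * handle as an integer when there is no '/'. Returns empty when the
     * handle is null or that part is not a valid int.
     */
    static OptionalInt numericPart(String handle) {
        if (handle == null) {
            return OptionalInt.empty();
        }
        // lastIndexOf returns -1 when there is no '/', so this keeps the whole string.
        String tail = handle.substring(handle.lastIndexOf('/') + 1).trim();
        try {
            return OptionalInt.of(Integer.parseInt(tail));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

With this, HandleNumbers.numericPart(myItem.getHandle()) gives an OptionalInt that can be checked with isPresent() before use, instead of letting one odd handle stop the whole run.
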
