« ABBY OCR System does not work well in a VM | Main | Indexer problem and old libraries »

Big SQL query to find handles where the pdf has not been indexed in DSPACE

The UDC DSPACE instance has a full text indexer. However there are many pdfs that have not been indexed. I wrote a SQL query to find these. Here it is
SELECT handle.handle, bitstream.name, bitstream.size_bytes  FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle in 
((SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.name ~ '^.*pdf$')  EXCEPT (SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.name ~ '^.*pdf.txt$')) order by handle::text::integer;

This produces output like:

 handle |                                                               name                                                                | size_bytes 
--------+-----------------------------------------------------------------------------------------------------------------------------------+------------
 394    | Connect2006Fall.pdf                                                                                                               |    2146701
 394    | license.txt                                                                                                                       |       1400
 406    | m139.pdf                                                                                                                          |    1349983
 406    | license.txt                                                                                                                       |       1371
 406    | m139_Extras.zip                                                                                                                   |    2240787
 406    | m139meta.doc.txt                                                                                                                  |       2342
 406    | index.txt.txt                                                                                                                     |       1850
 422    | license.txt                                                                                                                       |       1371


Breaking it down the above expression:
SELECT handle.handle, bitstream.name, bitstream.size_bytes FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle in {THE HANDLES OF NON INDEXED PDF}
Next Step:
THE HANDLES OF NON INDEXED PDF = SELECT {HANDLES THAT HAVE A PDF} EXCEPT {HANDLES THAT HAVE A ARE INDEXED}

HANDLES THAT HAVE A PDF = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf$';

HANDLES THAT HAVE A PDF = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf.txt$';

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)