Big SQL query to find handles where the pdf has not been indexed in DSPACE
The UDC DSPACE instance has a full text indexer. However there are many pdfs that have not been indexed. I wrote a SQL query to find these. Here it is
SELECT handle.handle, bitstream.name, bitstream.size_bytes FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle in {THE HANDLES OF NON INDEXED PDF}
Next Step:
THE HANDLES OF NON INDEXED PDF = SELECT {HANDLES THAT HAVE A PDF} EXCEPT {HANDLES THAT HAVE A ARE INDEXED}
HANDLES THAT HAVE A PDF = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf$';
HANDLES THAT HAVE A PDF = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf.txt$';
SELECT handle.handle, bitstream.name, bitstream.size_bytes FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle in ((SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf$') EXCEPT (SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf.txt$')) order by handle::text::integer;This produces output like:
handle | name | size_bytes --------+-----------------------------------------------------------------------------------------------------------------------------------+------------ 394 | Connect2006Fall.pdf | 2146701 394 | license.txt | 1400 406 | m139.pdf | 1349983 406 | license.txt | 1371 406 | m139_Extras.zip | 2240787 406 | m139meta.doc.txt | 2342 406 | index.txt.txt | 1850 422 | license.txt | 1371Breaking it down the above expression:
SELECT handle.handle, bitstream.name, bitstream.size_bytes FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle in {THE HANDLES OF NON INDEXED PDF}
Next Step:
THE HANDLES OF NON INDEXED PDF = SELECT {HANDLES THAT HAVE A PDF} EXCEPT {HANDLES THAT HAVE A ARE INDEXED}
HANDLES THAT HAVE A PDF = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf$';
HANDLES THAT HAVE A PDF = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf.txt$';