
June 29, 2010

Candidate datastream for METS struct map

In our Fedora archive, there will be complex objects that have child objects. The order of these children will be specified by a METS struct map in a Fedora datastream. The METS XML will look something like:

<METS:mets OBJID="demo:UMNcard.Will001" xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap>
    <METS:div ID="UMNcard.Will001.STRUCT">
      <METS:div ORDER="1" CONTENTIDS="demo:UMNcard.Will001.01" LABEL="FRONT"/>
      <METS:div ORDER="2" CONTENTIDS="demo:UMNcard.Will001.02" LABEL="BACK"/>
    </METS:div>
  </METS:structMap>
</METS:mets>
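
To show how the child order could be read back out of such a struct map, here is a minimal sketch using only the JDK's DOM parser. The class and method names (StructMapReader, orderedChildren) are hypothetical, and it assumes every child div carries both an ORDER and a CONTENTIDS attribute, as in the sample above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Fedora or any METS library.
class StructMapReader {
    static final String METS_NS = "http://www.loc.gov/METS/";

    /** Returns the CONTENTIDS of every div that has an ORDER attribute, sorted by ORDER. */
    static List<String> orderedChildren(String metsXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on the METS namespace
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(metsXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse METS XML", e);
        }
        NodeList divs = doc.getElementsByTagNameNS(METS_NS, "div");
        List<Element> children = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (div.hasAttribute("ORDER")) { // skips the wrapping div, which has no ORDER
                children.add(div);
            }
        }
        children.sort(Comparator.comparingInt(e -> Integer.parseInt(e.getAttribute("ORDER"))));
        List<String> ids = new ArrayList<>();
        for (Element e : children) {
            ids.add(e.getAttribute("CONTENTIDS"));
        }
        return ids;
    }
}
```

For the sample document above this returns demo:UMNcard.Will001.01 followed by demo:UMNcard.Will001.02. Sorting on ORDER rather than trusting document order keeps the result stable if the divs are ever written out of sequence.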

June 15, 2010

Possible solution to large sets of PDFs not being indexed.

1) Currently the bitstreams for all the PDFs are read into memory by DSpace before indexing starts.

2) General idea: make an array of bitstream.internal_id values and produce file names from them. Open the files one at a time and process each. There will have to be an if to catch PDF filters. A new version of the processBitstream method will have to be written in the MediaFilter.java class.

3) Determine whether a filter is the PDF filter. In MediaFilterManager the expression:

filterClasses[i].getClass().getName()

produces a String of the form:

org.dspace.app.mediafilter.PDFFilter

4) Bundle.java has an example of a database query in DSpace:

TableRowIterator tri = DatabaseManager.queryTable(ourContext, "bitstream",
        "SELECT bitstream.* FROM bitstream, bundle2bitstream WHERE "
        + "bundle2bitstream.bitstream_id=bitstream.bitstream_id AND "
        + "bundle2bitstream.bundle_id= ? ",
        bundleRow.getIntColumn("bundle_id"));

5) A query that gets the bitstream.internal_id from the handle:

select handle.handle, bitstream.internal_id
from handle, item2bundle, bitstream, bundle2bitstream
where handle.resource_id = item2bundle.item_id
  and handle.resource_type_id = 2
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and bitstream.name ~ '^.*pdf$'
order by handle::text::integer;

6) The SQL query produces output like:

 handle |               internal_id
--------+-----------------------------------------
 4      | 21235899611297801355164539089367487918
 6      | 107336362058605133394783483256748994425
 7      | 11090457400765004032181753363050906238
 8      | 15085152857357519261000291360441738392
 9      | 107871098203799408458667072002741710893
 11     | 50276060641731470155592232786626559701
 12     | 65101419339422847404612669026873496384
 13     | 43890973756880840755046786472676911568
 15     | 160770701110516817178377325959276903855
 16     | 65016360752601910182437228650469824124

7) Getting the handle as an integer:

int handle = Integer.parseInt(myItem.getHandle());
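
Steps 2, 3 and 7 above can be sketched together in plain Java, without any DSpace classes. Everything here is an assumption made for illustration: the class name PdfOneAtATime, the StreamHandler callback that stands in for the real text extraction, and in particular the assetstore layout used by fileFor (three directory levels of two digits each, taken from the start of the internal_id), which should be checked against the actual assetstore before relying on it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch; none of these names exist in DSpace.
class PdfOneAtATime {
    // Class name reported in step 3 for the PDF filter.
    static final String PDF_FILTER = "org.dspace.app.mediafilter.PDFFilter";

    /** Hypothetical callback standing in for the real text extraction. */
    interface StreamHandler {
        void handle(String internalId, InputStream in) throws IOException;
    }

    /** Step 3: true when the given class name is the PDF filter's. */
    static boolean isPdfFilter(String filterClassName) {
        return PDF_FILTER.equals(filterClassName);
    }

    /** Step 7: the handle as an int; also tolerates a "prefix/suffix" form. */
    static int handleAsInt(String handle) {
        if (handle == null) {
            throw new IllegalArgumentException("item has no handle");
        }
        return Integer.parseInt(handle.substring(handle.lastIndexOf('/') + 1).trim());
    }

    /** Step 2: ASSUMED layout <assetstore>/12/34/56/123456..., verify locally. */
    static Path fileFor(String assetstoreDir, String internalId) {
        return Paths.get(assetstoreDir, internalId.substring(0, 2),
                internalId.substring(2, 4), internalId.substring(4, 6), internalId);
    }

    /** Opens the files one at a time; returns how many could not be processed. */
    static int processAll(String assetstoreDir, List<String> internalIds, StreamHandler handler) {
        int failures = 0;
        for (String id : internalIds) {
            // try-with-resources closes each stream before the next file is opened
            try (InputStream in = Files.newInputStream(fileFor(assetstoreDir, id))) {
                handler.handle(id, in);
            } catch (IOException e) {
                failures++;
                System.err.println("Skipping " + id + ": " + e.getMessage());
            }
        }
        return failures;
    }
}
```

Inside MediaFilterManager the check would be called as isPdfFilter(filterClasses[i].getClass().getName()). Catching the IOException per file means one bad PDF is counted and skipped rather than stopping the whole run.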

June 11, 2010

PDF files that are not being indexed: SQL to look at the issue

The problem

Beth found that the item with handle 56385 (http://conservancy.umn.edu/handle/56385) had not gone through the full-text indexer.
The command
/dspace/dspace-ir/bin/filter-media -i 56385
will index just that one item. Here is the error message that you get:
ErrorMediaFilter
This error message indicates that the error is coming from the third-party jar PDFBox.jar.

Some SQL

Below is an SQL query that pulls out all of the PDF files in the repository:

select handle.handle, bitstream.name, bitstream.description
from handle, item2bundle, bitstream, bundle2bitstream
where handle.resource_id = item2bundle.item_id
  and handle.resource_type_id = 2
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and bitstream.name ~ '^.*pdf$'
order by handle::text::integer;

Next is an SQL query that finds all the PDFs that have been indexed:

select handle.handle, bitstream.name, bitstream.description
from handle, item2bundle, bitstream, bundle2bitstream
where handle.resource_id = item2bundle.item_id
  and handle.resource_type_id = 2
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and bitstream.description = 'Extracted text'
order by handle::text::integer;

Current number of unindexed files

# PDFs - 14477
# Indexed - 13252

# Unindexed - 1225
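
The subtraction above gives a count but not which items are affected. A minimal sketch for listing them, assuming the handle column of each query result has already been read into a collection of integers (the class and method names are hypothetical):

```java
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper for comparing the results of the two queries.
class UnindexedHandles {
    /** Handles that appear in the PDF query but not in the 'Extracted text' query. */
    static SortedSet<Integer> unindexed(Collection<Integer> pdfHandles,
                                        Collection<Integer> indexedHandles) {
        SortedSet<Integer> result = new TreeSet<>(pdfHandles); // sorted, duplicates dropped
        result.removeAll(indexedHandles);
        return result;
    }
}
```

Because this works on distinct handles, an item with several PDFs is counted once, so the size of the result need not equal 1225. An item with several PDFs of which only some were indexed also drops out of the result.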

June 10, 2010

Script to take an SQL query and turn it into a tab-delimited file

This will convert an sql query to a tab delimited file. I am sure there is a more elegant way to do this, but this works.
# Name sql2tab.sh
#
# dbFlavor - dspace_sr or dspace_ir
# sql_query - a query for the database
#
# Example:
#  ./sql2tab.sh dspace_ir 'select handle from handle;'
#
# Note: quote the query on the command line. The script also quotes
# $sql_query below so that bash does not expand a '*' in the query
# into an ls of the directory.
#
# J. Silvis
# June 2010
#*******************************************************
flavor=$1
sql_query=$2

echo "$sql_query" > sql_temp_junk
# run the sql query and put it in tempData
psql -U "$flavor" "$flavor" < sql_temp_junk > tempData

# Replace the pipe (|) delimiter with tab
perl -p -i -e 's/\s+\|\s+/\t/g' tempData
# clear out the leading whitespace
perl -p -i -e 's/^\s+//' tempData
# dump the results out to be either directly viewed or sent to a pipe.
cat tempData
# get rid of junk files
rm tempData
rm sql_temp_junk
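
If the same conversion is ever wanted inside Java code rather than from the shell, the script's two substitutions can be applied line by line. This is only a sketch of those two regular expressions; the class and method names are hypothetical.

```java
// Hypothetical helper mirroring the two perl substitutions in sql2tab.sh.
class PsqlToTab {
    /** Converts one line of aligned psql output to tab-delimited form. */
    static String toTabDelimited(String line) {
        return line.replaceAll("\\s+\\|\\s+", "\t") // " | " column separator becomes a tab
                   .replaceAll("^\\s+", "");         // drop leading whitespace
    }
}
```

As with the shell script, the dashed separator row and the row-count footer contain no padded pipe and pass through unchanged.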