September 29, 2011

Bill's Solution to the Solr problem

We had problems connecting SOLR to Drupal. Here are some answers that Bill Tantzen found:

Some issues with bad characters

Special characters in the Solr REST call must be replaced with their hex (percent-encoded) values. The original call:
http://128.101.146.59:8080/solr/select/?version=1.2&start=0&rows=10&indent=on&wt=standard&q=fullText:food&sort=modstitle desc&qt=mods

Should be:

http://128.101.146.59:8080/solr/select/?version=1.2&start=0&rows=10&indent=on&wt=standard&q=fullText:food++%26sort%3Dmodstitle%2Bdesc&qt=mods
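As a rough illustration of where those hex values come from, here is a minimal Python sketch that percent-encodes a piece of the query string. The helper name hex_escape is made up for this note; urllib.parse.quote is the standard library call doing the work. It reproduces the escaped portion of the working URL above and makes no claim about how Drupal or Islandora builds the request.

```python
from urllib.parse import quote


def hex_escape(text):
    """Percent-encode every reserved character in a piece of a URL."""
    # safe="" exempts nothing, so '&', '=' and '+' are all replaced
    # by their hex values (%26, %3D and %2B).
    return quote(text, safe="")


# The part of the query string that was escaped in the working URL.
escaped = hex_escape("&sort=modstitle+desc")
print(escaped)
```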

Fields that Solr sorts on should be single valued

Bill found out that fields used for sorting must be single valued. Our title field is not. Earlier versions of SOLR would handle multivalued fields for sort.

The MODS schema

<titleInfo>
    <title>[dc.title]</title>
</titleInfo>
<titleInfo type="translated">
    <title>[dc.title.alternative]</title>
</titleInfo>

The XSLT to make SOLR indices

Note that the XSLT does not distinguish between different values of the type attribute, so if there is more than one version of the title, the field will no longer be single valued. The XSLT file is:
/Users/birage/fedora/tomcat/webapps/fedoragsearch/WEB-INF/classes/config/index/GSearch_solr/demoFoxmlToSolr.xslt  

    <xsl:for-each select="foxml:datastream[@ID='MODS']/foxml:datastreamVersion[last()]/foxml:xmlContent//mods:titleInfo/mods:title">
        <xsl:if test="text() [normalize-space(.) ]"><!--don't bother with empty space-->
            <field>
                <xsl:attribute name="name">
                    <xsl:value-of select="concat('mods.', 'title')"/>
                </xsl:attribute>
                <xsl:value-of select="text()"/>
            </field>
        </xsl:if>
    </xsl:for-each>

Possible values for the type attribute of title element in MODS

From: http://www.loc.gov/standards/mods/v3/mods-userguide-elements.html
type - This attribute is applied when it is necessary to identify what type of title is recorded.
For the main title (MARC 21 field 245), no type is indicated. The following values may be used with the type attribute:
abbreviated (equivalent to MARC 21 field 210)
translated (equivalent to MARC 21 field 242, 246)
alternative (equivalent to MARC 21 fields 246, 740)
uniform (equivalent to MARC 21 fields 130, 240, 730)
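
Since the main title carries no type attribute, one possible way to keep a sort field single valued is to index only the titleInfo elements that have no type. The Python sketch below shows that selection on a small made-up MODS record. It is not the GSearch XSLT, the sample record and the function name main_titles are assumptions for illustration, and it assumes the standard MODS v3 namespace.

```python
import xml.etree.ElementTree as ET

MODS_URI = "http://www.loc.gov/mods/v3"
MODS_NS = {"mods": MODS_URI}

# Hypothetical record: one main title and one translated title.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Main title</title></titleInfo>
  <titleInfo type="translated"><title>Translated title</title></titleInfo>
</mods>"""


def main_titles(mods_xml):
    """Return the titles whose titleInfo element has no type attribute."""
    root = ET.fromstring(mods_xml)
    titles = []
    for info in root.iter("{%s}titleInfo" % MODS_URI):
        if info.get("type") is not None:
            continue  # abbreviated, translated, alternative or uniform
        title = info.find("mods:title", MODS_NS)
        # Skip empty titles, as the XSLT does.
        if title is not None and title.text and title.text.strip():
            titles.append(title.text.strip())
    return titles


print(main_titles(SAMPLE))
```

A record can still contain more than one untyped titleInfo, so this narrows the field but does not by itself guarantee a single value.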

Reason why Islandora could not connect to SOLR on stage

The bug

When we tried to connect to solr from islandora we got the error:
Unable to connect to Solr server 

Islandora code connected to the problem

This error is generated in the Islandora file:
./sites/all/modules/Islandora-islandora_solr_search-9e474f7/solr.admin.inc
The following function is where the check fails:

/**
 *
 * @param String $solr_url
 * @return boolean
 *
 * Checks availability of Solr installation
 *
 */
function solr_available($solr_url) {
  // path from url is parsed to allow graceful inclusion or exclusion of 'http://'
  $pathParts = parse_url($solr_url); 
  $path = 'http://' . $pathParts['host'] . ':' . $pathParts['port'] . $pathParts['path'] . '/admin/file';
  $test = @fopen($path, "r");
  if ($test) {
    return true;
  }
  return false;
}
    

The fix (upgrade SOLR)

It turns out that Solr 3.1 does not recognize "/admin/file" at the end of a URL. We upgraded to SOLR 3.4 and it worked.
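
To test a Solr install outside of Drupal, the same check can be approximated in Python. This is only a sketch of what solr_available() does, not Islandora code: the function names are my own, it assumes the Solr URL includes a port, and the five-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def probe_url(solr_url):
    """Build the /admin/file URL that the availability check requests."""
    # Add a scheme when it is missing so urlparse can find host and port.
    if "://" not in solr_url:
        solr_url = "http://" + solr_url
    parts = urlparse(solr_url)
    # Assumes a port is present, as in the Solr URLs used in these notes.
    path = parts.path.rstrip("/")
    return "http://%s:%s%s/admin/file" % (parts.hostname, parts.port, path)


def solr_available(solr_url, timeout=5):
    """Return True if the probe URL can be opened, False otherwise."""
    try:
        with urllib.request.urlopen(probe_url(solr_url), timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 404 as well as connection failures.
        return False


print(probe_url("128.101.146.59:8080/solr"))
```

Printing the probe URL and opening it in a browser is a quick way to see whether a given Solr version answers at that path.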

September 23, 2011

Reason why we could not upload foxml file (agecon_top.xml)

We had assumed that the XACML permissions were at fault, but this was not the case. The problem was actually caused by a TN element in the foxml.

Here are the contents of
/swadm/local/fedora3/server/fedora-internal-use/fedora-internal-use-repository-policies-approximating-2.0
deny-apim-if-not-localhost.xml
deny-inactive-or-deleted-objects-or-datastreams-if-not-administrator.xml
deny-policy-management-if-not-administrator.xml
deny-purge-datastream-if-active-or-inactive.xml
deny-purge-object-if-active-or-inactive.xml
deny-reloadPolicies-if-not-localhost.xml
deny-unallowed-file-resolution.xml
LOGPATH
permit-anything-to-administrator.xml
permit-apia-unrestricted.xml
permit-dsstate-check-unrestricted.xml
permit-oai-unrestricted.xml
permit-serverStatus-unrestricted.xml
readme.txt

Switch in UDC Media filter.

I changed the Media Filter so that it would not use the unix nice command when it launches. This should speed up the process.

Crontab

@reboot /sbin/service httpd start
@reboot sudo -u tomcat /dspace/bin/start_tomcat.sh
# day of week (0 - 6) (Sunday=0)
10 1 * * 6 /dspace/dspace-ir/bin/media_launch.sh
30 22 * * 1 /dspace/dspace-sr/bin/index-all-cron
30 22 * * 2 /dspace/dspace-ir/bin/index-all-cron
30 22 * * 3 /dspace/dspace-sr/bin/index-all-cron
30 22 * * 4 /dspace/dspace-ir/bin/index-all-cron
30 22 * * 5 /dspace/dspace-sr/bin/index-all-cron

media_launch.sh

tstamp=`date "+%Y%m%d_%H:%M"`
echo $tstamp
nice /dspace/dspace-ir/bin/filter-media.sh > /dspace/dspace-ir/log/filter-media.sh_$tstamp.log 2>&1
cd /dspace/dspace-ir/bin/
/dspace/dspace-ir/bin/index_check_and_email.sh

filter-media.sh

Note that the "-n" passed to filter-media means that the index will not be made after each collection is OCRed. The runs using "nice" also used "-n".

#!/bin/sh
# This script grabs the handles of each collection
# in a DSpace DB instance. Then loops through the
# handles and run the full-text indexer against each
# collection.
# This is done to fix out of memory errors,
# PDFs that are too large for full-text indexing,
# and when filter-media (java app) fails now full
# text indexing continues on other collections.

# Setup the environment
JAVA_HOME=/opt/jdk1.5.0_10
PATH=$JAVA_HOME/bin:/opt/ant/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin
export PATH JAVA_HOME

dbname="dspace_ir"
username="read_only"
hostname="strip3.oit.umn.edu"

# Determine if we have Postgres client installed
which psql > /dev/null
if [ $? -ne 0 ]
then
  echo
  echo "psql not found in your PATH, please add to your PATH and re-run script"
  echo
  exit 1
fi

print_usage() {
  echo 1>&2 "Usage: $0 [-d dbname] [-u username]"
  exit 1;
}

while getopts d:hu: o
do
  case "$o" in
    d) dbname="$OPTARG";;
    h) print_usage;;
    n) hostname="$OPTARG";;
    u) username="$OPTARG";;
    [?]) print_usage;;
  esac
done

echo_cmd="echo SELECT handle FROM handle WHERE resource_type_id=3;"
psql_cmd="psql -t -U $username -h $hostname $dbname"

BINDIR=`dirname $0`

for handle in `$echo_cmd | $psql_cmd`
do
  $BINDIR/filter-media -n -i $handle
done

$BINDIR/index-all

September 15, 2011

Items in DSpace with in_archive = f and withdrawn = f

Problem

Louise told me that the following two purls produced an error when accessed:
http://purl.umn.edu/114817
http://purl.umn.edu/113790

The cause

It turns out that both in_archive and withdrawn are set to false.
dspace_sr=> select * from handle where handle = 113790;
 handle_id | handle | resource_type_id | resource_id 
-----------+--------+------------------+-------------
     51427 | 113790 |                2 |       53352


dspace_sr=> select * from item where item_id = 53352;

 item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection 
---------+--------------+------------+-----------+----------------------------+-------------------
   53352 |         2680 | f          | f         | 2011-08-25 12:58:34.912-05 |                  
(1 row)
I think these two columns should not both be set to f.

The solution

Set withdrawn to t and then Louise can resubmit the metadata:
UPDATE item SET withdrawn = 'T' WHERE item_id = 53352;
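
To look for other items in the same state, the same condition can be turned into a query. The sketch below runs it against an in-memory SQLite table that stands in for the DSpace item table, so the table layout, the two extra rows and the function name are assumptions; only item 53352 comes from the output above. On the real PostgreSQL database the columns are booleans and the query would be run through psql instead.

```python
import sqlite3


def find_unarchived_unwithdrawn(conn):
    """Return item_ids where in_archive and withdrawn are both 'f'."""
    rows = conn.execute(
        "SELECT item_id FROM item "
        "WHERE in_archive = 'f' AND withdrawn = 'f' "
        "ORDER BY item_id"
    )
    return [row[0] for row in rows]


# Stand-in table; rows 1 and 2 are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (item_id INTEGER, in_archive TEXT, withdrawn TEXT)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [(53352, "f", "f"), (1, "t", "f"), (2, "f", "t")],
)

print(find_unarchived_unwithdrawn(conn))
```

The result is only a list of candidates to inspect; each one still needs to be checked before it is changed.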

September 6, 2011

Various curl commands for Fedora RDF (REST interface)

List of datastreams in a Fedora Object

curl example

curl --user user:password http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams

browser example

http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams

MODS Datastream

curl example

curl --user user:password http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams/MODS/content

browser example

http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams/MODS/content

RDF Datastream

curl example

curl --user user:password http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams/RELS-EXT/content

browser example

http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams/RELS-EXT/content

Get all triples

curl example

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=N-Triples\&stream=on\&query=*+*+*

browser example

http://128.101.146.59:8080/fedora/risearch?type=triples&lang=spo&format=N-Triples&stream=on&query=*+*+*

All triples with 36284 handle as an object

curl example

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=N-Triples\&stream=on\&query=*+*+\<info:fedora/urepository:36284\>

browser example

http://128.101.146.59:8080/fedora/risearch?type=triples&lang=spo&format=N-Triples&stream=on&query=*+*+<info:fedora/urepository:36284>

All triples with 36284 handle as a subject

curl example

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=N-Triples\&stream=on\&query=\<info:fedora/urepository:36284\>+*+*

browser example

http://128.101.146.59:8080/fedora/risearch?type=triples&lang=spo&format=N-Triples&stream=on&query=<info:fedora/urepository:36284>+*+*

Determine if 36284 is a collection

curl example

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=N-Triples\&stream=on\&query=*+\<info:fedora/fedora-system:def/relations-external%23isMemberOfCollection\>+\<info:fedora/urepository:36284\>

Find the title of 36284

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=n-triples\&stream=on\&query=\<info:fedora/urepository:36284\>+\<http://purl.org/dc/elements/1.1/title\>+*
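
The risearch calls above all follow one pattern: a subject, a predicate and an object, with * as the wildcard. As a convenience, here is a small Python sketch that builds such URLs and percent-encodes the query. The helper names risearch_url and uri are made up for this note, and the host is the one used in the examples above.

```python
from urllib.parse import urlencode

RISEARCH = "http://128.101.146.59:8080/fedora/risearch"


def uri(value):
    """Wrap a URI in the angle brackets that the spo language expects."""
    return "<%s>" % value


def risearch_url(subject="*", predicate="*", obj="*", fmt="N-Triples"):
    """Build a triples query URL; '*' matches anything in that position."""
    params = {
        "type": "triples",
        "lang": "spo",
        "format": fmt,
        "stream": "on",
        "query": " ".join([subject, predicate, obj]),
    }
    # urlencode escapes '<', '>', '#', ':' and spaces in the query value.
    return RISEARCH + "?" + urlencode(params)


# Everything that has 36284 as its subject.
print(risearch_url(subject=uri("info:fedora/urepository:36284")))

# The title of 36284.
print(risearch_url(
    subject=uri("info:fedora/urepository:36284"),
    predicate=uri("http://purl.org/dc/elements/1.1/title"),
))
```

Because the result is fully encoded, it can be pasted into a browser or passed to curl inside quotes without escaping each & and < by hand.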