February 7, 2012

Select to find ascii control characters in DSPACE (i.e. in the postgres DB)

SQL select for CNT Chars

The command below will find DSPACE titles that contain ascii control characters (less than 31).
 select  handle, title, item.last_modified   from item, handle, itemsbytitle  where item.item_id = handle.resource_id  AND itemsbytitle.item_id = item.item_id AND last_modified > '2012-01-28'  AND title ~ ('[^'||chr(31)||'-'||chr(255)||']')  ;

Reason for the Range

The site:
http://www.asciitable.com/
Shows that the control characters in ASCII are all less than 32.

Result of the search

handle |                                                           title                                                            |       last_modified        
--------+----------------------------------------------------------------------------------------------------------------------------+----------------------------
 47226  | Pouvoir, violence et resistance en postcolonie : une lecture de en attendant le vote des betes Sauvages E’Ahmadou Kourouma | 2012-02-04 01:11:16.764-06
(1 row)
s

Converting the title to hex and sorting


[silvi003:~]$ echo 'Pouvoir, violence et resistance en postcolonie : une lecture de en attendant le vote des betes Sauvages E’Ahmadou Kourouma'
  | hexdump -v -e '"\\\x" 1/1 "%02x" " "' | perl -p -i -e 's/ /\n/g' | sort | uniq -c  
   1 \x0a
  18 \x20
   1 \x2c
   1 \x3a
   1 \x41
   1 \x45
   1 \x4b
   1 \x50
   1 \x53
   7 \x61
   1 \x62
   4 \x63
   4 \x64
  19 \x65
   1 \x67
   1 \x68
   4 \x69
   4 \x6c
   2 \x6d
   8 \x6e
  10 \x6f
   1 \x70
   4 \x72
   6 \x73
   9 \x74
   7 \x75
   4 \x76
   1 \x80
   1 \x99
   1 \xe2
There is a line feed in the title (\x0a), and this happens to be the last character.

ISSUE

The regex above will select strings with a CR.
000 1101 	015 	13 	0D 	CR 	␍ 	^M 	\r 	Carriage return[g]
This is bad. I need to put in some sort of "OR" for this.

December 16, 2011

Handles of items that have been withdrawn from DSPACE Instance of UDC

dspace_ir=> select handle  from item,handle   where resource_id=item_id AND resource_type_id =2 AND withdrawn='t';
 handle 
--------
 59457
 59456
 91897
 59624
 52302
 52304
 1417
 59224
 1564
 1556
 1597
 59696
 52294
 52295
 54922
 1635
 1652
 1641
 1642
 1634
 1654
 1643
 1640
 1638
 1650
 1648
 1639
 1637
 1636
 1631
 1632
 47185
 1693
 1657
 1700
 1656
 1667
 1699
 1666
 57139
 56283
 1714
 1713
 53967
 1962
 1959
 1924
 1941
 1917
 1921
 1956
 1945
 235
 52291
 52301
 52293
 52303
 52305
 52306
 2140
 4308
 2255
 4071
 52308
 2342
 4252
 2421
 2411
 2424
 2420
 2423
 4396
 47036
 3992
 4122
 2610
 2599
 2614
 46794
 2779
 2870
 2851
 2880
 5763
 56296
 3226
 3393
 3391
 3394
 3422
 48945
 3468
 4948
 5022
 5027
 1056
 1149
 1072
 56320
 54792
 59699
 4183
 47738
 4459
 4468
 57299
 4489
 4576
 57302
 92517
 45722
 4779
 4842
 4781
 53859
 4862
 4886
 4848
 4429
 50922
 56358
 103371
 5866
 4997
 57063
 53867
 5141
 57052
 104500
 57199
 57141
 107249
 109941
 100004
 97262
 98554
 50549
 48018
 104356
(139 rows)

December 15, 2011

Fields in DSPACE eperson table that are empty, fixed or small

Fields in DSPACE eperson table

 dspace_sr=> \d eperson   
                    Table "public.eperson"
       Column        |            Type             | Modifiers 
---------------------+-----------------------------+-----------
 eperson_id          | integer                     | not null
 email               | character varying(64)       | 
 password            | character varying(64)       | 
 firstname           | character varying(64)       | 
 lastname            | character varying(64)       | 
 can_log_in          | boolean                     | 
 require_certificate | boolean                     | 
 self_registered     | boolean                     | 
 last_active         | timestamp without time zone | 
 sub_frequency       | integer                     | 
 phone               | character varying(32)       | 
 netid               | character varying(64)       | 
Indexes:
    "eperson_pkey" primary key, btree (eperson_id)
    "eperson_email_key" unique, btree (email)
    "eperson_email_idx" btree (email)
    "eperson_netid_idx" btree (netid)

SQL to check for empty fields

field: last_active

dspace_sr=> select last_active   from eperson where last_active ~ '.*';
 last_active 
-------------
(0 rows)
last_active is empty

field: netid

dspace_sr=> select netid  from eperson where netid~'.*';
 netid 
-------------
(0 rows)
last_active is netid

field: require_certificate are all false

dspace_sr=> select require_certificate   from eperson where require_certificate = 't';
 require_certificate 
---------------------
(0 rows)

field: can_log_in are all true

dspace_sr=> select can_log_in  from eperson where can_log_in = 'f' ;
 can_log_in 
------------
(0 rows)

field: can_log_in are all true

dspace_sr=> select can_log_in  from eperson where can_log_in = 'f' ;
 can_log_in 
------------
(0 rows)

field: self_registered is a small group

dspace_sr=> select *    from eperson where self_registered = 'f' ;
 eperson_id |       email        |             password             | firstname | lastname | can_log_in | require_certificate | self_registered | last_active | sub_frequency | phone | netid 
------------+--------------------+----------------------------------+-----------+----------+------------+---------------------+-----------------+-------------+---------------+-------+-------
          1 | teale003@umn.edu   | 76419c58730d9f35de7ac538c2fd6737 | Brad      | Teale    | t          | f                   | f               |             |               |       | 
          3 | jkelly@umn.edu     | 1a9bb0d26640ffb18ee06ae398d201ad | Julia     | Kelly    | t          | f                   | f               |             |               |       | 
          5 | skill@umn.edu      | fa483ad4f1d094e0d5aaa31856d618ec | Sharon    | Kill     | t          | f                   | f               |             |               |       | 
         21 | ruddy005@gmail.com | f850f3fb454e369b8b3444ad66fcd469 | Martha    | Ruddy    | t          | f                   | f               |             |               |       | 
          2 | lletnes@umn.edu    | a8113a9b4f61b178cd1fea4efa5bf4c8 | Louise    | Letnes   | t          | f                   | f               |             |               |       | 
         15 | jsilvis@umn.edu    | 7f8b6712b097089167df2fd678ae8133 | Jeff      | Silvis   | t          | f                   | f               |             |               |       | 
(6 rows)

These people were put in by a bash script.

November 18, 2011

Number of assets per item in AgEcon

File that contains a list of all bitstreams in Agecon

less transfer.sh

./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_61536/urepository_2.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_61536/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_99776/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_114550/urepository_2.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_114550/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_93451/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_93451/urepository_2.pdf
                                                        ^               ^
                                                        |               |
                                                    Handle            Bitstream number
handle 
[swadm:/swadm/assetstore_stage]$ cat transfer.sh  | perl -p -i -e 's/(.\/AssetsUDC\/Assets_Found201111.*asset_)(\d+)(\/urepository_)(\d+)(\.pdf)/\4/g' | sort | uniq -c | sort -nk 2 


Number  Assets
of       per 
Items    item 
  48971 1
  21191 2
    116 3
     13 4
      4 5
      3 6
      2 7
      1 8
      1 9
      1 10

November 16, 2011

Cron to do mysqldump

Create a mysql user that will do the backups

GRANT LOCK TABLES, SELECT ON *.* TO 'backup'@'%.oit.umn.edu' IDENTIFIED BY PASSWORD;

The Script to run under cron

#!/usr/bin/perl

printf("Running...");

####################
# Backup settings #
###################

my $mysql_user = "backup";
my $mysql_pwd = "password";
my $mysql_server = "mysql_host";
my $dir_base = "path_to_backup_dir";

my ($y, $m, $d) = (localtime)[5,4,3];

$y += 1900;
$m += 1;

if ($m < 10) { $m = "0" . $m; }
if ($d < 10) { $d = "0" . $d; }

$stamp = $y . "_" . $m . $d;
$file_dumpname = $stamp . "_mysqldump.sql";

# Format:
#/swadm/assetstore_stage/Backups/mysql/2009_0921_mysqldump.sql
#/swadm/assetstore_stage/Backups/mysql/2009_0922_mysqldump.sql

##################
# Handling MySQL #
##################

# Making mySQL dumps
$starttime = `date`;
printf("mysqldump -h$mysql_server -u$mysql_user -p$mysql_pwd drupalstage > $dir_base/$file_dumpname\n");

`mysqldump -h$mysql_server -u$mysql_user -p$mysql_pwd drupalstage > $dir_base/$file_dumpname`;

#############################
# Chmod -- owner/group only #
#############################
`chmod 660 $dir_base/$file_dumpname`;
`gzip -fq $dir_base/$file_dumpname`;


##############
# Concluding #
##############


# The "done" message
printf("...complete.\n");
$finishtime = `date`;

####################
# Mail the results #
####################

$title='MySQL backup';
$to='your email';
$from= 'crontab@yourhost';
$subject='[cron] vm02 mySQLdump';

open(MAIL, "|/usr/sbin/sendmail -t");

## Mail Header
print MAIL "To: $to\n";
print MAIL "From: $from\n";
print MAIL "Subject: $subject\n\n";
## Mail Body
# $dumpsize = `du  -h $dir_base/${file_dumpname}.gz`;
#print MAIL "Start:\n$starttime\nFinish:\n$finishtime\nDumpsize:\n  $dumpsize";
print MAIL "Start:\n$starttime\nFinish:\n$finishtime\nFile $dir_base/${file_dumpname}.gz";
close(MAIL);

November 14, 2011

Metadata Field dc.type

Sql Query to get Types of AgEcon Papers

[silvi003:~]$ cat cmdType.sql
\f ','
\a
\t
\o outputfile.csv
select text_value from handle, metadatavalue, item  where metadatavalue.item_id=handle.resource_id AND handle.resource_type_id=2 AND 
handle.resource_id=item.item_id AND item.withdrawn='f' AND metadata_field_id=66;
\o
\q


[silvi003:~]$ psql -U dspace_sr  dspace_sr  < cmdType.sql 

The distribution

[silvi003:~]$ cat outputfile.csv  | sort | uniq -c | sort -n 
      1 journal article
      2 Dataset
      4 Preprint
     24 Book Item
    205 Thesis
    225 Book
    328 Thesis or Dissertation
   1096 Report
   2673 Technical Report
   3592 Working Paper
   6313 Article
   7492 Presentation
   8202 Working or Discussion Paper
   8694 Journal Article
   9294 Conference Paper

DSPACE mime types for AgEcon ... Very few excel

Below is a list of all the MIME types supported by DSPACE

 bitstream_format_id |           mimetype            |  short_description   |                             description                              | support_level | internal 
---------------------+-------------------------------+----------------------+----------------------------------------------------------------------+---------------+----------
                   3 | application/pdf               | PDF                  | Adobe Portable Document Format                                       |             1 | f
                   1 | application/octet-stream      | Unknown              | Unknown data format                                                  |             0 | f
                   2 | text/plain                    | License              | Item-specific license agreed upon to submission                      |             1 | t
                   4 | text/xml                      | XML                  | Extensible Markup Language                                           |             1 | f
                   5 | text/plain                    | Text                 | Plain Text                                                           |             1 | f
                   6 | text/html                     | HTML                 | Hypertext Markup Language                                            |             1 | f
                   7 | text/css                      | CSS                  | Cascading Style Sheets                                               |             1 | f
                   8 | application/msword            | Microsoft Word       | Microsoft Word                                                       |             1 | f
                   9 | application/vnd.ms-powerpoint | Microsoft Powerpoint | Microsoft Powerpoint                                                 |             1 | f
                  10 | application/vnd.ms-excel      | Microsoft Excel      | Microsoft Excel                                                      |             1 | f
                  11 | application/marc              | MARC                 | Machine-Readable Cataloging records                                  |             1 | f
                  12 | image/jpeg                    | JPEG                 | Joint Photographic Experts Group/JPEG File Interchange Format (JFIF) |             1 | f
                  13 | image/gif                     | GIF                  | Graphics Interchange Format                                          |             1 | f
                  14 | image/png                     | image/png            | Portable Network Graphics                                            |             1 | f
                  15 | image/tiff                    | TIFF                 | Tag Image File Format                                                |             1 | f
                  16 | audio/x-aiff                  | AIFF                 | Audio Interchange File Format                                        |             1 | f
                  17 | audio/basic                   | audio/basic          | Basic Audio                                                          |             1 | f
                  18 | audio/x-wav                   | WAV                  | Broadcase Wave Format                                                |             1 | f
                  19 | video/mpeg                    | MPEG                 | Moving Picture Experts Group                                         |             1 | f
                  20 | text/richtext                 | RTF                  | Rich Text Format                                                     |             1 | f
                  21 | application/vnd.visio         | Microsoft Visio      | Microsoft Visio                                                      |             1 | f
                  22 | application/x-filemaker       | FMP3                 | Filemaker Pro                                                        |             1 | f
                  23 | image/x-ms-bmp                | BMP                  | Microsoft Windows bitmap                                             |             1 | f
                  24 | application/x-photoshop       | Photoshop            | Photoshop                                                            |             1 | f
                  25 | application/postscript        | Postscript           | Postscript Files                                                     |             1 | f
                  26 | video/quicktime               | Video Quicktime      | Video Quicktime                                                      |             1 | f
                  27 | audio/x-mpeg                  | MPEG Audio           | MPEG Audio                                                           |             1 | f
                  28 | application/vnd.ms-project    | Microsoft Project    | Microsoft Project                                                    |             1 | f
                  29 | application/mathematica       | Mathematica          | Mathematica Notebook                                                 |             1 | f
                  30 | application/x-latex           | LateX                | LaTeX document                                                       |             1 | f
                  31 | application/x-tex             | TeX                  | Tex/LateX document                                                   |             1 | f
                  32 | application/x-dvi             | TeX dvi              | TeX dvi format                                                       |             1 | f
                  33 | application/sgml              | SGML                 | SGML application (RFC 1874)                                          |             1 | f
                  34 | application/wordperfect5.1    | WordPerfect          | WordPerfect 5.1 document                                             |             1 | f
                  35 | audio/x-pn-realaudio          | RealAudio            | RealAudio file                                                       |             1 | f
                  36 | image/x-photo-cd              | Photo CD             | Kodak Photo CD image                                                 |             1 | f

A file with the wrong bitstream_format_id

 
handle | bitstream_id | bitstream_format_id |                    name                     | size_bytes |             checksum             | checksum_algorithm | description | user_format_description |                                     source                                      |               internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+---------------------------------------------+------------+----------------------------------+--------------------+-------------+-------------------------+---------------------------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
 95522  |        74367 |                   1 | Staff Paper P10-8--InSTePP10-04.revised pdf |     313884 | 35f4304e6a0c68e935c09c0469a9e291 | MD5                |             |                         | /dspace/assetstore/dspace-sr/upload/Staff Paper P10-8--InSTePP10-04.revised pdf | 102028865626877833459313413758816463357 | f       |            0 |           2
(1 row)

This is labeled as Unknown, but should be PDF. The line below changed it:
 
dspace_sr=> 
dspace_sr=> UPDATE bitstream SET bitstream_format_id = '3' WHERE bitstream_id = '74367';
UPDATE 1

Distribution of bitstream_format_id

The sql query that pulls only live bitstreams:
[silvi003:~]$ cat cmdMime.sql 
\f ','
\a
\t
\o outputfile.csv
SELECT bitstream_format_id  FROM handle,item, item2bundle,bitstream,bundle2bitstream WHERE  handle.resource_type_id=2 AND handle.resource_id = item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND handle.resource_id=item.item_id AND item.withdrawn='f' AND   bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.deleted = 'f'  ;
\o
\q
[silvi003:~]$ psql -U dspace_sr  dspace_sr  < cmdMime.sql
Number count of bitstream_format_id

[silvi003:~]$ cat outputfile.csv | sort | uniq -c | sort -n 
      # bitstream_format_id
      1 1
      2 10
  20602 2
  48265 3

the odd bitstream_format_id

Most of the the bitstreams are PDFs ( bitstream_format_id 3) or liscense (bitstream_format_id 2). There is one Unknown (bitstream_format_id 1) and two excel (bitstream_format_id 10). They are shown below:
bitstream_format_id =1
dspace_sr=> SELECT handle, bitstream.*  FROM handle,item2bundle,bitstream,bundle2bitstream WHERE  handle.resource_type_id=2 AND handle.resource_id = item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND handle.resource_id=item.item_id AND item.withdrawn='f' AND   bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.deleted = 'f'  AND bitstream_format_id=1;
NOTICE:  adding missing FROM-clause entry for table "item"
 handle | bitstream_id | bitstream_format_id |                          name                           | size_bytes |             checksum             | checksum_algorithm | description | user_format_description |                                           source                                            |              internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+---------------------------------------------------------+------------+----------------------------------+--------------------+-------------+-------------------------+---------------------------------------------------------------------------------------------+----------------------------------------+---------+--------------+-------------
 62242  |        59248 |                   1 | data appendix jayasinghe Beghin Moschini ajae 9007.xlsx |     141434 | 6f40baf7dd97f784091e69ed8714b837 | MD5                |             |                         | /dspace/assetstore/dspace-sr/upload/data appendix jayasinghe Beghin Moschini ajae 9007.xlsx | 13124464764865665476393448862247227640 | f       |            0 |           3
(1 row)


bitstream_format_id =10
dspace_sr=> SELECT handle, bitstream.*  FROM handle,item2bundle,bitstream,bundle2bitstream WHERE  handle.resource_type_id=2 AND handle.resource_id = item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND handle.resource_id=item.item_id AND item.withdrawn='f' AND   bundle2bitstream.bitstream_id = bitstream.bitstream_id AND  bitstream.deleted = 'f'  AND bitstream_format_id=10;
NOTICE:  adding missing FROM-clause entry for table "item"
 handle | bitstream_id | bitstream_format_id |                  name                   | size_bytes |             checksum             | checksum_algorithm |        description         | user_format_description |                                                         source                                                          |              internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+-----------------------------------------+------------+----------------------------------+--------------------+----------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------------+---------+--------------+-------------
 42187  |        32967 |                  10 | MissouriUseValueCalculationsOct2007.xls |     361472 | cbefd6d4008ba5d97db49d2b9178f89f | MD5                | Excel Spreadsheet          |                         | /dspace/assetstore/dspace-sr/upload/C:\Documents and Settings\Lori\My Documents\MissouriUseValueCalculationsOct2007.xls | 87756269209817911914027269532862968326 | f       |            0 |           3
 92231  |        61062 |                  10 | stpap536.data.zip                       |     741798 | b48a9d5aa21f3d8411230bde4651e4fe | MD5                | Data in zipped Excel files |                         | /dspace/assetstore/dspace-sr/upload/stpap536.data.zip                                                                   | 28095595994115196972466977473167819715 | f       |            0 |           3
(2 rows)


October 28, 2011

Islandora was not properly pointing at https fedora sites

The issue

When we changed to https for stage we got the white screen of death with drupal.

Location of code

The file where the troubled code is:
sites/all/modules/Islandora-islandora-6589c7c/ConnectionHelper.inc
function _fixURL

The bug

This is the basic problem. The code below will yield:
$new_url = 'http'
No matter what the input is.
<?php                   
   $url = 'https://ddd';
    if (strpos($url, 'http://') == 0) {
        $new_url = 'http';
    }
    elseif (strpos($url, 'https://') == 0) {
      $new_url = 'https';
    }
    else {
        drupal_set_message(t('Invalid URL: !url', array('!url' => $url)));
        return NULL;
    }
   echo $new_url . "\n";
?>
Rewrite of the code that properly selects https:
<?php                   
   $url = 'https://ddd';

    $url_start = substr ($url , 0, 5 );

    if ($url_start == 'http:') {
        $new_url = 'http';
    }
    elseif ($url_start == 'https') {
      $new_url = 'https';
    }
    else {
        drupal_set_message(t('Invalid URL: !url', array('!url' => $url)));
        return NULL;
    }
   echo $new_url . "\n";
?>
I inserted this code into the function _fixURL code of the file:
sites/all/modules/Islandora-islandora-6589c7c/ConnectionHelper.inc
and all is well.

October 21, 2011

Finding all the collections a DSPACE item belongs to

Multi-level tree in AgEcon

The list below gives an item that is under several layers in the DSPACE tree.
AgEcon Search
  Universitaet Hohenheim>
   Institute of Agricultural Policy and Agricultural Markets >
      Working Papers >
        A 2004 Social Accounting Matrix for Israel>

SQL to determine names of ancestors of an item

Item to collection

Starting handle = 110156

SQL for item handle to collection handle

SELECT handle FROM handle WHERE resource_type_id=3 AND resource_id=(SELECT collection.collection_id FROM collection, collection2item, item, handle WHERE collection2item.item_id=item.item_id AND collection2item.collection_id=collection.collection_id AND handle.resource_id=item.item_id AND resource_type_id=2 AND handle = 110156) ;
collection handle: 93692

Collection Name SQL

select name from handle, collection where handle = 93692 AND resource_type_id= 3 AND collection_id = handle.resource_id;

Collection to Community

SQL for Collection handle to Community handle

SELECT handle FROM community2collection, handle WHERE handle.resource_type_id =4 AND community2collection.community_id = handle.resource_id AND community2collection.collection_id = (SELECT resource_id FROM handle WHERE resource_type_id= 3 AND handle= 93692);
community level 1: 93691

community name SQL

select name from handle, community where handle = 93691 AND resource_type_id= 4 AND community_id = handle.resource_id;

Next Community level

SELECT handle FROM community2collection, handle WHERE handle.resource_type_id =4 AND community2collection.community_id = handle.resource_id AND community2collection.collection_id = (SELECT resource_id FROM handle WHERE resource_type_id= 3 AND handle= 93691);
top 93690

Proof that 93690 is top

SELECT handle FROM community2community, handle WHERE handle.resource_type_id =4 AND community2community.parent_comm_id = handle.resource_id AND community2community.child_comm_id = (SELECT resource_id FROM handle WHERE resource_type_id= 4sss AND handle= 93690);
returns nothing

PHP code

The code below will pull the names of all the ancestors of an item.
#!/usr/bin/php 
 $value)
 echo $key.'=>'.$value."\n";

//
// Test should produce: (1 indicates a primary community, 0 is a lower level)
//
// Content-type: text/html
// X-Powered-By: PHP/4.3.9
// 
// Working Papers=>0
// Universitaet Hohenheim=>1
// Institute of Agricultural Policy and Agricultural Markets=>0


// if you are interesed in the parents of a collection call:
// process_collection 

// if you are interesed in the parents of a community call:
// process_community 


function ancestor_names($item_handle, &$ancestors){
  include("port_DSPACE_sql.inc.php");
  $dbh = open_DSPACE_DB();  // opens a connection to the DSPACE psql database

// This query will find a collection handle given an item handle
  $item_to_collection_query="SELECT handle FROM handle WHERE resource_type_id=3 AND 
          resource_id=(SELECT collection.collection_id 
          FROM collection, collection2item, item, handle 
          WHERE collection2item.item_id=item.item_id 
          AND collection2item.collection_id=collection.collection_id 
          AND handle.resource_id=item.item_id AND resource_type_id=2 
          AND handle = $item_handle) ; ";

  $result = pg_query($item_to_collection_query);
  if (!$result) {
    echo "Problem with query " . $item_to_collection_query . "\n";
    echo pg_last_error();
    exit();
  }


// Could have multiple collection parents
while ($collection_handle = pg_fetch_row($result)) {
     process_collection($collection_handle[0], $ancestors);
}
}
 
 function process_collection($collection_handle,  &$ancestors){
 $collection_title_query = "select name from handle, collection where handle = 93692 AND resource_type_id= 3 AND collection_id = handle.resource_id;";
  $result = pg_query($collection_title_query);
  if (!$result) {
    echo "Problem with query " . $collection_title_query . "\n";
    echo pg_last_error();
    exit();
  }

while ($collection_title = pg_fetch_row($result)) {
  $ancestors[$collection_title[0]] = 0;
}
   collection_to_community ($collection_handle, $ancestors);
 }

// climbs the tree from the collection to the comunity level
function collection_to_community ($collection_handle,  &$ancestors) {
  $collection_to_commuity_query="SELECT handle FROM community2collection, handle 
                                WHERE handle.resource_type_id =4 AND 
                                community2collection.community_id = handle.resource_id AND
                                community2collection.collection_id = 
                                (SELECT resource_id FROM handle WHERE 
                                resource_type_id= 3 AND handle= $collection_handle);";
  $result = pg_query($collection_to_commuity_query);
  if (!$result) {
    echo "Problem with query " . $collection_to_commuity_query . "\n";
    echo pg_last_error();
    exit();
  }

while ($community_handle = pg_fetch_row($result)) {
     process_community($community_handle[0],$ancestors);
}

}


// finds the community title and recursively finds all higher level collections
function process_community ($community_handle, &$ancestors) {
 $community_title_query = "select name from handle, community where handle = $community_handle AND resource_type_id= 4 AND community_id = handle.resource_id; ";
 
  $result = pg_query($community_title_query);
  if (!$result) {
    echo "Problem with query " . $community_title_query . "\n";
    echo pg_last_error();
    exit();
  }
$community_title="";
while ($row = pg_fetch_row($result)) {
  $community_title = $row[0] ;
}

  $community_to_community_query="SELECT handle FROM community2community, handle WHERE 
                                handle.resource_type_id =4 AND 
                                community2community.parent_comm_id = handle.resource_id AND 
                                community2community.child_comm_id = (SELECT resource_id 
                                FROM handle WHERE resource_type_id= 4 AND handle= $community_handle); ";
  $result = pg_query($community_to_community_query);
  if (!$result) {
    echo "Problem with query " . $collection_to_commuity_query . "\n";
    echo pg_last_error();
    exit();
  }
$this_community_is_top = 1;
while ($row = pg_fetch_row($result)) {
  // if we get into this loop there is a higher community
  $this_community_is_top = 0;
  process_community($row[0],$ancestors );
}
  $ancestors[$community_title] = $this_community_is_top;
}
 ?>

Every item maps to a collection

Number of item handles:
dspace_sr=> select count (handle) from handle where resource_type_id=2;
 count 
-------
 48653
Number of item links to collections:
dspace_sr=> select count (item_id) from collection2item;
 count 
-------
 48653

Communities do not have multiple parents

I also checked the community2community table and there were no duplicate child_comm_id values. This implies that the children have one and only one parent.

September 29, 2011

Bill's Solution to the Solr problem

We had problems connecting SOLR to Drupal. Here are some answers that Bill Tantzen found:

Some issues with bad characters

Hex values should be inserted in the solr REST call:
http://128.101.146.59:8080/solr/select/?version=1.2&start=0&rows=10&indent=on&wt=standard&q=fullText:food&sort=modstitle desc&qt=mods

Should be:

http://128.101.146.59:8080/solr/select/?version=1.2&start=0&rows=10&indent=on&wt=standard&q=fullText:food++%26sort%3Dmodstitle%2Bdesc&qt=mods

Fields that Solr searches on should be single valued

Bill found out that fields that are being used for sort must be single values. Our title field is not. In the past SOLR would handle multivalued fields for sort.

The MODS schema

	<titleInfo>
 	 	<title>[dc.title]</title>
 	</titleInfo>
<titleInfo type="translated">
 	 	<title>[dc.title.alternative]</title>
 	</titleInfo>

The XSLT to make SOLR indices

Note the XSLT will not distinguish between different values of the type attribute, so there is more than version of title, it will no longer be single valued.
/Users/birage/fedora/tomcat/webapps/fedoragsearch/WEB-INF/classes/config/index/GSearch_solr/demoFoxmlToSolr.xslt  

    <xsl:for-each select="foxml:datastream[@ID='MODS']/foxml:datastreamVersion[last()]/foxml:xmlContent//mods:titleInfo/mods:title">
         <xsl:if test="text() [normalize-space(.) ]"><!--don't bother with empty space-->
            <field>./WEB-INF/classes/config/index/GSearch_solr/demoFoxmlToSolr.xslt
               <xsl:attribute name="name">
                  <xsl:value-of select="concat('mods.', 'title')"/>
               </xsl:attribute>
               <xsl:value-of select="text()"/>
            </field>
         </xsl:if>
      </xsl:for-each>

Possible values for the type attribute of title element in MODS

From: http://www.loc.gov/standards/mods/v3/mods-userguide-elements.html
type - This attribute is applied when it is necessary to identify what type of title is recorded.
For the main title (MARC 21 field 245), no type is indicated. The following values may be used with the type attribute:
abbreviated (equivalent to MARC 21 field 210)
translated (equivalent to MARC 21 field 242, 246)
alternative (equivalent to MARC 21 fields 246, 740)
uniform (equivalent to MARC 21 fields 130, 240, 730)

Reason why Islandora could not connect to SOLR on stage

The bug

When we tried to connect to solr from islandora we got the error:
Unable to connect to Solr server 

Islandora code connected to the problem

This error is generated in the Islandora file:
./sites/all/modules/Islandora-islandora_solr_search-9e474f7/solr.admin.inc: The following function is the root cause of the problem:

/**
 *
 * @param String $solr_url
 * @return boolean
 *
 * Checks availability of Solr installation
 *
 */
function solr_available($solr_url) {
  // path from url is parsed to allow graceful inclusion or exclusion of 'http://'
  $pathParts = parse_url($solr_url); 
  $path = 'http://' . $pathParts['host'] . ':' . $pathParts['port'] . $pathParts['path'] . '/admin/file';
  $test = @fopen($path, "r");
  if ($test) {
    return true;
  }
  return false;
}
    

The fix (upgrade SOLR)

It turns out that solr 3.1 cannot recognize the
"/admin/file"
at the end of a URL. We upgraded to SOLR 3.4 and it worked.

September 23, 2011

Reason why we could not upload foxml file (agecon_top.xml)

We had assumed that it was the XACML permissions this was not the case. The problem was actually due to a TN element in the foxml that was the problem.

Here is the contents of the
/swadm/local/fedora3/server/fedora-internal-use/fedora-internal-use-repository-policies-approximating-2.0
deny-apim-if-not-localhost.xml deny-inactive-or-deleted-objects-or-datastreams-if-not-administrator.xml deny-policy-management-if-not-administrator.xml deny-purge-datastream-if-active-or-inactive.xml deny-purge-object-if-active-or-inactive.xml deny-reloadPolicies-if-not-localhost.xml deny-unallowed-file-resolution.xml LOGPATH permit-anything-to-administrator.xml permit-apia-unrestricted.xml permit-dsstate-check-unrestricted.xml permit-oai-unrestricted.xml permit-serverStatus-unrestricted.xml readme.txt

Switch in UDC Media filter.

I changed the Media Filter so that it would not use the unix nice command when it launches. This should speed up the process.

Crontab

@reboot /sbin/service httpd start @reboot sudo -u tomcat /dspace/bin/start_tomcat.sh # day of week (0 - 6) (Sunday=0) 10 1 * * 6 /dspace/dspace-ir/bin/media_launch.sh 30 22 * * 1 /dspace/dspace-sr/bin/index-all-cron 30 22 * * 2 /dspace/dspace-ir/bin/index-all-cron 30 22 * * 3 /dspace/dspace-sr/bin/index-all-cron 30 22 * * 4 /dspace/dspace-ir/bin/index-all-cron 30 22 * * 5 /dspace/dspace-sr/bin/index-all-cron

media_launch.sh

tstamp=`date "+%Y%m%d_%H:%M"` echo $tstamp nice /dspace/dspace-ir/bin/filter-media.sh > /dspace/dspace-ir/log/filter-media.sh_$tstamp.log 2>&1 cd /dspace/dspace-ir/bin/ /dspace/dspace-ir/bin/index_check_and_email.sh

filter-media.sh

Note the "-n" in filter-media means that the index will not be made after each collection is OCRed. Also in the runs using "nice" the "-n" was also used.
#!/bin/sh # This script grabs the handles of each collection # in a DSpace DB instance. Then loops through the # handles and run the full-text indexer against each # collection. # This is done to fix out of memory errors, # PDFs that are too large for full-text indexing, # and when filter-media (java app) fails now full # text indexing continues on other collections. # Setup the environment JAVA_HOME=/opt/jdk1.5.0_10 PATH=$JAVA_HOME/bin:/opt/ant/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin export PATH JAVA_HOME dbname="dspace_ir" username="read_only" hostname="strip3.oit.umn.edu" # Determine if we have Postgres client installed which psql > /dev/null if [ $? -ne 0 ] then echo echo "psql not found in your PATH, please add to your PATH and re-run script" echo exit 1 fi print_usage() { echo 1>&2 "Usage: $0 [-d dbname] [-u username]" exit 1; } while getopts d:hu: o do case "$o" in d) dbname="$OPTARG";; h) print_usage;; n) hostname="$OPTARG";; u) username="$OPTARG";; [?]) print_usage;; esac done echo_cmd="echo SELECT handle FROM handle WHERE resource_type_id=3;" psql_cmd="psql -t -U $username -h $hostname $dbname" BINDIR=`dirname $0` for handle in `$echo_cmd | $psql_cmd` do $BINDIR/filter-media -n -i $handle done $BINDIR/index-all

September 15, 2011

items in DSAPCE with in_archive =f and withdrawn = f

Problem

Louise told me that the following two purls produced an error when accessed:
http://purl.umn.edu/114817
http://purl.umn.edu/113790

The cause

It turns our that both in_archive and withdrawn are set to false.
dspace_sr=> select * from handle where handle = 113790;
 handle_id | handle | resource_type_id | resource_id 
-----------+--------+------------------+-------------
     51427 | 113790 |                2 |       53352


dspace_sr=> select * from item where item_id = 53352;

 item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection 
---------+--------------+------------+-----------+----------------------------+-------------------
   53352 |         2680 | f          | f         | 2011-08-25 12:58:34.912-05 |                  
(1 row)
I think that both variables should not be set to f.

The solution

Set withdrawn to t and the Louise can resubmit the metadata:
UPDATE item SET withdrawn = 'T' WHERE item_id = 53352;

September 6, 2011

Varios curls for fedora RDF (REST interface)

List of datastreams in a Fedora Object

curl example

curl --user user:password http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams

browser example

http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams

MODs Datastream

curl example

curl --user user:password http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams/MODS/content

browser example

http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams/MODS/contentss

RDF Datastream

curl example

curl --user user:password http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams/RELS-EXT/content

browser example

http://128.101.146.59:8080/fedora/objects/urepository:97023/datastreams/RELS-EXT/content

Get all triples

curl example

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=N-Triples\&stream=on\&query=*+*+*

browser example

http://128.101.146.59:8080/fedora/risearch?type=triples&lang=spo&format=N-Triples&stream=on&query=*+*+*

All triples with 36284 handle as a predicate

curl example

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=N-Triples\&stream=on\&query=*+*+\<info:fedora/urepository:36284\>

browser example

http://128.101.146.59:8080/fedora/risearch?type=triples&lang=spo&format=N-Triples&stream=on&query=*+*+<info:fedora/urepository:36284>

All triples with 36284 handle as a subject

curl example

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=N-Triples\&stream=on\&query=\<info:fedora/urepository:36284\>+*+*

browser example

http://128.101.146.59:8080/fedora/risearch?type=triples&lang=spo&format=N-Triples&stream=on&query=<info:fedora/urepository:36284>+*+*

Determine if 36284 is a collection

curl example

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=N-Triples\&stream=on\&query=*+\<info:fedora/fedora-system:def/relations-external#isMemberOfCollection\>+\<info:fedora/urepository:36284\>

Find the title of 36284

curl --user user:password http://128.101.146.59:8080/fedora/risearch?type=triples\&lang=spo\&format=n-triples\&stream=on\&query=\<info:fedora/urepository:36284\>+\<http://purl.org/dc/elements/1.1/title\>+*