« October 2010 | Main | December 2010 »

November 24, 2010

Documentation of php code to query DSPACE DB

Code Purpose Documentation complete
port_meta_field_uniq_count.php Get a count of the unique
values for a field (metadata_id)
yes
port_file_handles_to_meta.php Take a file of DSPACE handles and dump
the associated metadata to an excel file
yes
port_DSPACE_sql.inc.php Set of utility functions no
port_metadata_id_to_full_table.php Finds all the metadata associated with non null
values of a metadata id e.g dc.subject.mesh has a metadat_id of 62
./port_metadata_id_to_full_table.php 62
will make an excel spreadsheet of all the
metadata where dc.subject.mesh
is not null
yes
port_metadata_count.php Gives a count of occurrences of a metadata text field
yes
port_sql_column_diff.php Does a diff between two SQL columns no

November 18, 2010

DSPACE sql for items that have not been withdrawn

The line below will do a search for dc.type (metadata_field_id=66) for items that have not been withdrawn.
select handle from handle, metadatavalue, item where metadata_field_id=66 AND metadatavalue.item_id=handle.resource_id AND text_value='Other' AND resource_id=item.item_id AND withdrawn='f';

mime types in AgEcon

Overview

I have found that the metadata type:
dc.format.mimetype (metadata_field_id = 36)
THe field dc.format.mimetype only contains entries for items from before the migration to DSPACE. In general, we will need to use the Format field from the bitstream table. This is valid for both before and after the migration.

dc.format.mimetype

The little shell script below pulls the handles of items that have non-null values for dc.format.mimetype.
query="select handle from metadatavalue,handle where metadata_field_id=36 AND item_id=handle.resource_id AND handle.resource_type_id=2;"
echo $query > temp_sql_file
psql -U dspace_sr  dspace_sr  < temp_sql_file
rm temp_sql_file
This shell script pulls all item handles.
query="select  handle  from handle where handle.resource_type_id=2;"
echo $query > temp_sql_file
psql -U dspace_sr  dspace_sr  < temp_sql_file
rm temp_sql_file
With these two scripts, I was able to find the handles that did and did not have the dc.format.mimetype field populated.

Comparison of metadata pre and post migration to DSPACE

Here is a comparison of pre vs post migration to DSPACE and the metadata fields related to mimetype.
DSPACE Table Element Present on Pre-Migration
(example handle 36676)
Present on Post- Migration
(example handle 96677)
dc.format.mimetype yes no
Bitstream.name yes yes
Bitstream.source no yes
Bitstream.description no yes
Bitstream.format yes yes
Bitstream.user format description no no
Bitstream.license no yes
The information for the table above came largely from the bitstream table.

Bitstream table for handle 96677:
Bitstream table for handle 96677 Bitstream table for handle 36676:
Old_bitstream_ 36676.tiff

Summary of Bitstream.format

I have found the unique mime types in AgEcon using the bitstream.format field:
application/pdf 45264
application/octet-stream 1
application/vnd.ms-excel 2
The handles for the non-pdf files are:
application/octet-stream
62242
application/vnd.ms-excel
42187
92231
This field will be used to determine mime type.

November 8, 2010

mismatch between handle table and collection/community tables

Problem: some collection and community handles are in the handle table but are not found in the collection/community tables

1) If you put this handle into the agecon system you get:
Invalid Identifier

The identifier /59677 does not correspond to a valid Object in AgEcon Search. This may be because of one of the following reasons:

    * The URL of the current page is incorrect - if you followed a link from outside of AgEcon Search it may be mistyped or corrupt.
    * You entered an invalid ID into a form - please try again.

If you're having problems, or you expected the ID to work, feel free to contact the site administrators.
2) The handles do exist in the purl system.
I call collections of this type "zombie" collections/communities. 3) Before we migrate to Islandora, we need to find all the zombie collections/communities and make sure that they are not transferred into the new system and that they are tombstoned in the purl system.

Locating all the troublesome handles

I wrote some small php code that found a vector of handles for items, collections or communities from the handle table and then diffed this vector with one from the item, collection or community table. Below are the results.

Items

query1 SELECT handle FROM handle, item  WHERE handle.resource_type_id=2 AND handle.resource_id=item.item_id
query2 SELECT handle FROM handle WHERE resource_type_id=2
Length query 1 item_id 42838
Length query 2 42838
Length diff array 0

Collections

Array
(
    [39] => 33702
    [47] => 33718
    [48] => 33720
    [50] => 33724
    [51] => 33726
    [63] => 33750
    [70] => 33764
    [71] => 33766
    [72] => 33768
    [81] => 33786
    [82] => 33788
    [101] => 33826
    [111] => 33846
    [126] => 33876
    [127] => 33878
    [128] => 33880
    [149] => 33922
    [158] => 33940
    [159] => 33942
    [171] => 33966
    [183] => 33990
    [187] => 33998
    [192] => 34008
    [201] => 34026
    [202] => 34028
    [234] => 34092
    [237] => 34098
    [302] => 34228
    [312] => 34248
    [319] => 34262
    [322] => 34268
    [325] => 34274
    [328] => 34280
    [329] => 34282
    [330] => 34284
    [331] => 34286
    [332] => 34288
    [333] => 34289
    [334] => 34290
    [335] => 34291
    [336] => 34293
    [337] => 34295
    [338] => 34297
    [339] => 34299
    [340] => 34301
    [341] => 34303
    [342] => 34305
    [343] => 34307
    [344] => 34309
    [345] => 34311
    [346] => 34313
    [347] => 34315
    [348] => 34317
    [349] => 34319
    [350] => 34321
    [351] => 34323
    [359] => 34340
    [360] => 34342
    [362] => 34344
    [363] => 34346
    [364] => 34348
    [365] => 34350
    [368] => 34356
    [699] => 35008
    [718] => 35046
    [851] => 35312
    [868] => 35346
    [1123] => 35857
    [1322] => 36453
    [1323] => 36456
    [1324] => 36459
    [1325] => 36462
    [1333] => 36486
    [1361] => 36570
    [1365] => 36582
    [1371] => 36600
    [1373] => 36606
    [1383] => 36636
    [1405] => 36998
    [1435] => 37168
    [1436] => 37170
    [1451] => 37295
    [1459] => 37584
    [1460] => 37586
    [1483] => 37939
    [1484] => 37940
    [1485] => 37942
    [1486] => 37947
    [1490] => 37968
    [1502] => 42438
    [1519] => 42893
    [1560] => 43992
    [1562] => 44080
    [1584] => 44940
    [1602] => 45639
    [1621] => 46044
    [1623] => 46046
    [1638] => 46277
    [1643] => 46321
    [1645] => 46501
    [1646] => 46502
    [1685] => 47047
    [1753] => 47495
    [1766] => 47829
    [1771] => 48118
    [1784] => 48250
    [1812] => 49828
    [1814] => 49831
    [1822] => 50343
    [1836] => 51324
    [1845] => 52117
    [1849] => 52237
    [1860] => 53120
    [1865] => 53221
    [1867] => 53245
    [1868] => 53287
    [1877] => 53371
    [1887] => 53406
    [1921] => 53652
    [1953] => 54541
    [1973] => 55664
    [2070] => 56401
    [2139] => 58795
    [2140] => 58798
    [2216] => 59649
    [2239] => 59672
    [2241] => 59676
    [2242] => 59677
    [2305] => 91208
    [2411] => 94371
)
query1 SELECT handle FROM handle, collection  WHERE handle.resource_type_id=3 AND handle.resource_id=collection.collection_id
query2 SELECT handle FROM handle WHERE resource_type_id=3
Length query 1 item_id 2295
Length query 2 2425
Length diff array 130
[silvi003@strip1 bin]$ 

Communities


Array
(
    [5] => 35871
    [11] => 35889
    [12] => 35892
    [14] => 35900
    [15] => 35903
    [16] => 35906
    [17] => 35909
    [258] => 36632
    [259] => 36635
    [285] => 37192
    [301] => 37938
    [302] => 37941
    [315] => 42934
    [327] => 43991
    [349] => 47494
    [369] => 53119
    [375] => 53468
    [382] => 56201
    [394] => 60659
    [407] => 93689
)
query1 SELECT handle FROM handle, community  WHERE handle.resource_type_id=4 AND handle.resource_id=community.community_id
query2 SELECT handle FROM handle WHERE resource_type_id=4
Length query 1 item_id 397
Length query 2 417
Length diff array 20

mismatch between handle table and collection/community tables

query1 SELECT handle FROM handle, item  WHERE handle.resource_type_id=2 AND handle.resource_id=item.item_id
query2 SELECT handle FROM handle WHERE resource_type_id=2
Length query 1 item_id 42838
Length query 2 42838
Length diff array 0

Going from handles to metadata text value in Dspace

Handles that are items

DSPACE tables that map item handles to meta data text

The pdf below shows how the handle table connects to the metadata table for items. Item Handle to MetaData.pdf

sql to pull metadata for items

The query will pull all the metadata for handle value 30308 that is an item (resource_type_id=2).
select text_value from handle, metadatavalue where metadatavalue.item_id=handle.resource_id AND handle.resource_type_id=2 AND handle=30308;

Handles that are collections

DSPACE tables that map collection handles to meta data text

The pdf below shows how the handle table connects to the metadata table for items. Note for this to work we need to go through the collection table.
Collection Handle to MetaData.pdf

sql to pull metadata for collections

THe query below willpull all of the metadata associated with the collection that has a handle of 37244.
select text_value from handle, collection, metadatavalue where metadatavalue.item_id=collection.template_item_id AND collection.collection_id=handle.resource_id AND handle.resource_type_id=3 AND handle=37244;

Handles that are communities

It looks like the communities do not map to meta data fields. The diagram below shows the connection between the handle and the community table, but I do not see how to link to the metadatavalue table.
Community Handle to MetaData.pdf