
November 18, 2011

Number of assets per item in AgEcon

The file transfer.sh contains a list of all bitstreams in AgEcon:

less transfer.sh

./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_61536/urepository_2.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_61536/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_99776/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_114550/urepository_2.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_114550/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_93451/urepository_1.pdf
./AssetsUDC/Assets_Found20111114_1524_60427_max/asset_93451/urepository_2.pdf
                                                        ^               ^
                                                        |               |
                                                    Handle            Bitstream number
Reduce each path to its bitstream number and tally the numbers:

[swadm:/swadm/assetstore_stage]$ perl -pe 's/(\.\/AssetsUDC\/Assets_Found201111.*asset_)(\d+)(\/urepository_)(\d+)(\.pdf)/$4/' transfer.sh | sort | uniq -c | sort -nk 2


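Each count is the number of files with that bitstream number, so, assuming numbering starts at 1 with no gaps, row N counts the items that have at least N assets:

  Files Bitstream number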
  48971 1
  21191 2
    116 3
     13 4
      4 5
      3 6
      2 7
      1 8
      1 9
      1 10
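
The tally above is keyed on the bitstream number, not on the handle. To count how many items have exactly N files, group by handle first. A minimal sketch, assuming every line of transfer.sh is a path of the form .../asset_<handle>/urepository_<n>.pdf; the function name assets_per_item is hypothetical:

```shell
# Print "<number of items> <assets per item>" for paths read on stdin.
assets_per_item() {
    # 1. keep only the handle from each path
    # 2. count files per handle
    # 3. count handles per file count, ordered by file count
    sed -n 's|.*/asset_\([0-9][0-9]*\)/urepository_[0-9][0-9]*\.pdf.*|\1|p' |
        sort | uniq -c |
        awk '{ items[$1]++ } END { for (n in items) print items[n], n }' |
        sort -k2,2n
}

# Usage: assets_per_item < transfer.sh
```

Lines that do not match the expected path pattern are skipped silently, so compare the total with `wc -l transfer.sh` if that matters.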

November 16, 2011

Cron to do mysqldump

Create a MySQL user that will do the backups:

GRANT LOCK TABLES, SELECT ON *.* TO 'backup'@'%.oit.umn.edu' IDENTIFIED BY 'password';

The script to run under cron:

#!/usr/bin/perl

print "Running...";

###################
# Backup settings #
###################

my $mysql_user = "backup";
my $mysql_pwd = "password";
my $mysql_server = "mysql_host";
my $dir_base = "path_to_backup_dir";

# Build a zero-padded date stamp such as 2009_0921.
# localtime returns years since 1900 and months numbered from 0.
my ($y, $m, $d) = (localtime)[5,4,3];
my $stamp = sprintf("%04d_%02d%02d", $y + 1900, $m + 1, $d);
$file_dumpname = $stamp . "_mysqldump.sql";

# Format:
#/swadm/assetstore_stage/Backups/mysql/2009_0921_mysqldump.sql
#/swadm/assetstore_stage/Backups/mysql/2009_0922_mysqldump.sql

##################
# Handling MySQL #
##################

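# Making MySQL dumps
$starttime = `date`;
# Echo the command; note that this puts the password into cron's output.
print "mysqldump -h$mysql_server -u$mysql_user -p$mysql_pwd drupalstage > $dir_base/$file_dumpname\n";

# Run the dump through the shell and report a non-zero exit status.
system("mysqldump -h$mysql_server -u$mysql_user -p$mysql_pwd drupalstage > $dir_base/$file_dumpname") == 0
    or warn "mysqldump exited with status $?\n";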

#############################
# Chmod -- owner/group only #
#############################
`chmod 660 $dir_base/$file_dumpname`;
`gzip -fq $dir_base/$file_dumpname`;


##############
# Concluding #
##############


# The "done" message
printf("...complete.\n");
$finishtime = `date`;

####################
# Mail the results #
####################

$title='MySQL backup';
$to='your email';
$from= 'crontab@yourhost';
$subject='[cron] vm02 mySQLdump';

open(MAIL, "|/usr/sbin/sendmail -t") or die "Cannot start sendmail: $!\n";

## Mail Header
print MAIL "To: $to\n";
print MAIL "From: $from\n";
print MAIL "Subject: $subject\n\n";
## Mail Body
# $dumpsize = `du  -h $dir_base/${file_dumpname}.gz`;
#print MAIL "Start:\n$starttime\nFinish:\n$finishtime\nDumpsize:\n  $dumpsize";
print MAIL "Start:\n$starttime\nFinish:\n$finishtime\nFile $dir_base/${file_dumpname}.gz";
close(MAIL);
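
The crontab line itself is not shown in these notes. A hypothetical entry, assuming the script is saved as /path/to/mysql_backup.pl (a placeholder, like path_to_backup_dir in the script) and should run nightly at 02:30; the helper name cron_entry is also made up for illustration:

```shell
# Print a crontab line that runs a script every day at a given time.
# Crontab field order: minute hour day-of-month month day-of-week command.
# Arguments: $1 = minute, $2 = hour, $3 = command to run.
cron_entry() {
    printf '%s %s * * * %s\n' "$1" "$2" "$3"
}

# Hypothetical path and time; review the line, then add it with `crontab -e`.
cron_entry 30 2 /path/to/mysql_backup.pl
```

The script must be executable by the cron user, since the entry calls it directly and relies on its #!/usr/bin/perl line.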

November 14, 2011

Metadata Field dc.type

SQL query to get the types of AgEcon papers

[silvi003:~]$ cat cmdType.sql
\f ','
\a
\t
\o outputfile.csv
select text_value from handle, metadatavalue, item  where metadatavalue.item_id=handle.resource_id AND handle.resource_type_id=2 AND 
handle.resource_id=item.item_id AND item.withdrawn='f' AND metadata_field_id=66;
\o
\q


[silvi003:~]$ psql -U dspace_sr  dspace_sr  < cmdType.sql 

The distribution

[silvi003:~]$ cat outputfile.csv  | sort | uniq -c | sort -n 
      1 journal article
      2 Dataset
      4 Preprint
     24 Book Item
    205 Thesis
    225 Book
    328 Thesis or Dissertation
   1096 Report
   2673 Technical Report
   3592 Working Paper
   6313 Article
   7492 Presentation
   8202 Working or Discussion Paper
   8694 Journal Article
   9294 Conference Paper
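
The tally is case-sensitive, which is why "journal article" (1) and "Journal Article" (8694) show up as separate rows. A minimal sketch of a case-insensitive variant; the function name count_types is hypothetical, and folding everything to lower case is only one assumption about how such variants should be merged:

```shell
# Tally the values read on stdin, ignoring differences in letter case.
count_types() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n
}

# Usage: count_types < outputfile.csv
```

This merges case variants only. Near-duplicates such as "Working Paper" and "Working or Discussion Paper" still need a manual mapping.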

DSpace MIME types for AgEcon: very few Excel files

Below is a list of all the MIME types supported by DSpace:

 bitstream_format_id |           mimetype            |  short_description   |                             description                              | support_level | internal 
---------------------+-------------------------------+----------------------+----------------------------------------------------------------------+---------------+----------
                   3 | application/pdf               | PDF                  | Adobe Portable Document Format                                       |             1 | f
                   1 | application/octet-stream      | Unknown              | Unknown data format                                                  |             0 | f
                   2 | text/plain                    | License              | Item-specific license agreed upon to submission                      |             1 | t
                   4 | text/xml                      | XML                  | Extensible Markup Language                                           |             1 | f
                   5 | text/plain                    | Text                 | Plain Text                                                           |             1 | f
                   6 | text/html                     | HTML                 | Hypertext Markup Language                                            |             1 | f
                   7 | text/css                      | CSS                  | Cascading Style Sheets                                               |             1 | f
                   8 | application/msword            | Microsoft Word       | Microsoft Word                                                       |             1 | f
                   9 | application/vnd.ms-powerpoint | Microsoft Powerpoint | Microsoft Powerpoint                                                 |             1 | f
                  10 | application/vnd.ms-excel      | Microsoft Excel      | Microsoft Excel                                                      |             1 | f
                  11 | application/marc              | MARC                 | Machine-Readable Cataloging records                                  |             1 | f
                  12 | image/jpeg                    | JPEG                 | Joint Photographic Experts Group/JPEG File Interchange Format (JFIF) |             1 | f
                  13 | image/gif                     | GIF                  | Graphics Interchange Format                                          |             1 | f
                  14 | image/png                     | image/png            | Portable Network Graphics                                            |             1 | f
                  15 | image/tiff                    | TIFF                 | Tag Image File Format                                                |             1 | f
                  16 | audio/x-aiff                  | AIFF                 | Audio Interchange File Format                                        |             1 | f
                  17 | audio/basic                   | audio/basic          | Basic Audio                                                          |             1 | f
                  18 | audio/x-wav                   | WAV                  | Broadcase Wave Format                                                |             1 | f
                  19 | video/mpeg                    | MPEG                 | Moving Picture Experts Group                                         |             1 | f
                  20 | text/richtext                 | RTF                  | Rich Text Format                                                     |             1 | f
                  21 | application/vnd.visio         | Microsoft Visio      | Microsoft Visio                                                      |             1 | f
                  22 | application/x-filemaker       | FMP3                 | Filemaker Pro                                                        |             1 | f
                  23 | image/x-ms-bmp                | BMP                  | Microsoft Windows bitmap                                             |             1 | f
                  24 | application/x-photoshop       | Photoshop            | Photoshop                                                            |             1 | f
                  25 | application/postscript        | Postscript           | Postscript Files                                                     |             1 | f
                  26 | video/quicktime               | Video Quicktime      | Video Quicktime                                                      |             1 | f
                  27 | audio/x-mpeg                  | MPEG Audio           | MPEG Audio                                                           |             1 | f
                  28 | application/vnd.ms-project    | Microsoft Project    | Microsoft Project                                                    |             1 | f
                  29 | application/mathematica       | Mathematica          | Mathematica Notebook                                                 |             1 | f
                  30 | application/x-latex           | LateX                | LaTeX document                                                       |             1 | f
                  31 | application/x-tex             | TeX                  | Tex/LateX document                                                   |             1 | f
                  32 | application/x-dvi             | TeX dvi              | TeX dvi format                                                       |             1 | f
                  33 | application/sgml              | SGML                 | SGML application (RFC 1874)                                          |             1 | f
                  34 | application/wordperfect5.1    | WordPerfect          | WordPerfect 5.1 document                                             |             1 | f
                  35 | audio/x-pn-realaudio          | RealAudio            | RealAudio file                                                       |             1 | f
                  36 | image/x-photo-cd              | Photo CD             | Kodak Photo CD image                                                 |             1 | f

A file with the wrong bitstream_format_id

 
 handle | bitstream_id | bitstream_format_id |                    name                     | size_bytes |             checksum             | checksum_algorithm | description | user_format_description |                                     source                                      |               internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+---------------------------------------------+------------+----------------------------------+--------------------+-------------+-------------------------+---------------------------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
 95522  |        74367 |                   1 | Staff Paper P10-8--InSTePP10-04.revised pdf |     313884 | 35f4304e6a0c68e935c09c0469a9e291 | MD5                |             |                         | /dspace/assetstore/dspace-sr/upload/Staff Paper P10-8--InSTePP10-04.revised pdf | 102028865626877833459313413758816463357 | f       |            0 |           2
(1 row)

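This bitstream is labeled Unknown (bitstream_format_id 1) but should be PDF (bitstream_format_id 3). Note that the file name ends in "revised pdf", with no dot before the extension. The statement below corrects the label:

dspace_sr=> UPDATE bitstream SET bitstream_format_id = 3 WHERE bitstream_id = 74367;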
UPDATE 1
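
Before relabeling a bitstream it can help to confirm that the file really is a PDF. A minimal sketch, assuming shell access to a copy of the file; is_pdf is a hypothetical helper, and it checks only the "%PDF-" signature at the start of the file, not whether the rest of the file is valid:

```shell
# Succeed if the file named by $1 starts with the PDF signature "%PDF-".
is_pdf() {
    # tr drops NUL bytes so binary files do not trigger shell warnings
    [ "$(head -c 5 "$1" | tr -d '\000')" = "%PDF-" ]
}

# Usage: is_pdf "Staff Paper P10-8--InSTePP10-04.revised pdf" && echo "looks like a PDF"
```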

Distribution of bitstream_format_id

The SQL query that pulls only live bitstreams (items not withdrawn, bitstreams not deleted):
[silvi003:~]$ cat cmdMime.sql 
\f ','
\a
\t
\o outputfile.csv
SELECT bitstream_format_id
FROM handle, item, item2bundle, bitstream, bundle2bitstream
WHERE handle.resource_type_id=2 AND handle.resource_id=item.item_id AND item.withdrawn='f'
  AND handle.resource_id=item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id
  AND bundle2bitstream.bitstream_id=bitstream.bitstream_id AND bitstream.deleted='f';
\o
\q
[silvi003:~]$ psql -U dspace_sr  dspace_sr  < cmdMime.sql

Count of each bitstream_format_id

[silvi003:~]$ cat outputfile.csv | sort | uniq -c | sort -n 
  count bitstream_format_id
      1 1
      2 10
  20602 2
  48265 3
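
The IDs are easier to read with their short descriptions next to them. A minimal sketch, with the four mappings copied from the format table above; the function name label_formats is hypothetical, and any ID outside those four is printed as "?":

```shell
# Tally bitstream_format_id values from stdin and append a description.
label_formats() {
    sort | uniq -c | sort -n |
        awk 'BEGIN { d[1] = "Unknown"; d[2] = "License"; d[3] = "PDF"; d[10] = "Microsoft Excel" }
             { print $1, $2, (($2 in d) ? d[$2] : "?") }'
}

# Usage: label_formats < outputfile.csv
```

Each output line has the count, the ID, then the description, for example "2 10 Microsoft Excel" for the two Excel bitstreams.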

The odd bitstream_format_id values

Most of the bitstreams are PDFs (bitstream_format_id 3) or licenses (bitstream_format_id 2). There is one Unknown (bitstream_format_id 1) and two Excel (bitstream_format_id 10). They are shown below:

bitstream_format_id = 1
dspace_sr=> SELECT handle, bitstream.* FROM handle, item, item2bundle, bitstream, bundle2bitstream WHERE handle.resource_type_id=2 AND handle.resource_id = item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND handle.resource_id=item.item_id AND item.withdrawn='f' AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.deleted = 'f' AND bitstream_format_id=1;
 handle | bitstream_id | bitstream_format_id |                          name                           | size_bytes |             checksum             | checksum_algorithm | description | user_format_description |                                           source                                            |              internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+---------------------------------------------------------+------------+----------------------------------+--------------------+-------------+-------------------------+---------------------------------------------------------------------------------------------+----------------------------------------+---------+--------------+-------------
 62242  |        59248 |                   1 | data appendix jayasinghe Beghin Moschini ajae 9007.xlsx |     141434 | 6f40baf7dd97f784091e69ed8714b837 | MD5                |             |                         | /dspace/assetstore/dspace-sr/upload/data appendix jayasinghe Beghin Moschini ajae 9007.xlsx | 13124464764865665476393448862247227640 | f       |            0 |           3
(1 row)


bitstream_format_id = 10
dspace_sr=> SELECT handle, bitstream.* FROM handle, item, item2bundle, bitstream, bundle2bitstream WHERE handle.resource_type_id=2 AND handle.resource_id = item2bundle.item_id AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND handle.resource_id=item.item_id AND item.withdrawn='f' AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND bitstream.deleted = 'f' AND bitstream_format_id=10;
 handle | bitstream_id | bitstream_format_id |                  name                   | size_bytes |             checksum             | checksum_algorithm |        description         | user_format_description |                                                         source                                                          |              internal_id               | deleted | store_number | sequence_id 
--------+--------------+---------------------+-----------------------------------------+------------+----------------------------------+--------------------+----------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------------+---------+--------------+-------------
 42187  |        32967 |                  10 | MissouriUseValueCalculationsOct2007.xls |     361472 | cbefd6d4008ba5d97db49d2b9178f89f | MD5                | Excel Spreadsheet          |                         | /dspace/assetstore/dspace-sr/upload/C:\Documents and Settings\Lori\My Documents\MissouriUseValueCalculationsOct2007.xls | 87756269209817911914027269532862968326 | f       |            0 |           3
 92231  |        61062 |                  10 | stpap536.data.zip                       |     741798 | b48a9d5aa21f3d8411230bde4651e4fe | MD5                | Data in zipped Excel files |                         | /dspace/assetstore/dspace-sr/upload/stpap536.data.zip                                                                   | 28095595994115196972466977473167819715 | f       |            0 |           3
(2 rows)