July 15, 2010

Excel files to be derived from DSpace to go into Drupal

We are going to extract the guts from DSpace and put them into Drupal. Here are the various Excel spreadsheets that must be created.

parent_type
0 - no parent
1 - collection
2 - community
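
To keep the three parent codes (0, 1, 2) in one place while generating the spreadsheets, something like the enum below could be used. This is only a sketch: the name `ParentType` and its methods are my own invention and are not part of DSpace or Drupal.

```java
/** Hypothetical helper: the parent codes written to the parent_type column. */
enum ParentType {
    NONE(0),        // 0 - no parent
    COLLECTION(1),  // 1 - collection
    COMMUNITY(2);   // 2 - community

    private final int code;

    ParentType(int code) {
        this.code = code;
    }

    /** Numeric value to write into the spreadsheet. */
    int code() {
        return code;
    }

    /** Maps a code read back from a spreadsheet to its constant. */
    static ParentType fromCode(int code) {
        for (ParentType t : values()) {
            if (t.code == code) {
                return t;
            }
        }
        throw new IllegalArgumentException("Unknown parent code: " + code);
    }
}
```

Rejecting unknown codes in `fromCode` means a bad spreadsheet cell fails loudly instead of importing an item with the wrong parent.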


items
metadata 1  ... metadata n parent_type parent_id

communities

     Column       |          Type          | Modifiers 
-------------------+------------------------+-----------
 community_id      | integer                | not null
 name              | character varying(128) | 
 short_description | character varying(512) | 
 introductory_text | text                   | 
 logo_bitstream_id | integer                | 
 copyright_text    | text                   | 
 side_bar_text     | text                   |    <---- Blank

Excel community file
community_id name short_description introductory_text parent_type parent_id has_child
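
As a rough idea of how one line of the community file could be produced, here is a sketch that writes a row as CSV, which Excel can open. The class `CommunityRow`, the comma-separated layout, and writing has_child as 1 or 0 are all assumptions of mine; only the column names come from the list above.

```java
/** Hypothetical row of the community spreadsheet, written as CSV. */
final class CommunityRow {
    static final String HEADER = "community_id,name,short_description,"
            + "introductory_text,parent_type,parent_id,has_child";

    final int communityId;
    final String name;
    final String shortDescription;
    final String introductoryText;
    final int parentType;   // 0 - no parent, 1 - collection, 2 - community
    final int parentId;
    final boolean hasChild;

    CommunityRow(int communityId, String name, String shortDescription,
                 String introductoryText, int parentType, int parentId,
                 boolean hasChild) {
        this.communityId = communityId;
        this.name = name;
        this.shortDescription = shortDescription;
        this.introductoryText = introductoryText;
        this.parentType = parentType;
        this.parentId = parentId;
        this.hasChild = hasChild;
    }

    /** Quotes a text field; embedded quotes are doubled, null becomes an empty field. */
    static String quote(String value) {
        if (value == null) {
            return "";
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** One CSV line; has_child is written as 1 or 0 (an assumption). */
    String toCsv() {
        return communityId + "," + quote(name) + "," + quote(shortDescription) + ","
                + quote(introductoryText) + "," + parentType + "," + parentId + ","
                + (hasChild ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(HEADER);
        System.out.println(
                new CommunityRow(7, "Maps", "Old \"city\" maps", null, 2, 3, true).toCsv());
    }
}
```

Quoting every text field matters here because introductory_text is free text and may contain commas, quotes, or line breaks.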


collections:


                  Table "public.collection"
         Column         |          Type          | Modifiers 
------------------------+------------------------+-----------
 collection_id          | integer                | not null
 name                   | character varying(128) | 
 short_description      | character varying(512) | 
 introductory_text      | text                   | 
 logo_bitstream_id      | integer                | 
 template_item_id       | integer                |   <--- not all filled
 provenance_description | text                   |   <--- blank
 license                | text                   | 
 copyright_text         | text                   | 
 side_bar_text          | text                   | 
 workflow_step_1        | integer                | <--- not all filled
 workflow_step_2        | integer                | <--- not all filled
 workflow_step_3        | integer                | 
 submitter              | integer                | <--- eperson_group_id
 admin                  | integer                | <--- eperson_group_id

Excel collection file
collection_id name short_description introductory_text parent_type submitter_group_id admin_group_id parent_id has_child




dspace_ir=> \d eperson;
                    Table "public.eperson"
       Column        |            Type             | Modifiers 
---------------------+-----------------------------+-----------
 eperson_id          | integer                     | not null
 email               | character varying(64)       | 
 password            | character varying(64)       | 
 firstname           | character varying(64)       | 
 lastname            | character varying(64)       | 
 can_log_in          | boolean                     |  <-- all of these are true
 require_certificate | boolean                     |  <-- all of these are false
 self_registered     | boolean                     |  <-- blanks and false
 last_active         | timestamp without time zone | 
 sub_frequency       | integer                     |  <-- blank
 phone               | character varying(32)       | 
 netid               | character varying(64)       | 



dspace_ir=> \d epersongroup;
      Column      |          Type          | Modifiers 
------------------+------------------------+-----------
 eperson_group_id | integer                | not null    <-- admin or submitter
 name             | character varying(256) | 

dspace_ir=> \d epersongroup2eperson;
  Table "public.epersongroup2eperson"
      Column      |  Type   | Modifiers 
------------------+---------+-----------
 id               | integer | not null
 eperson_group_id | integer | 
 eperson_id       | integer | 


dspace_ir=> select * from  epersongroup2workspaceitem ;
 id | eperson_group_id | workspace_item_id 
----+------------------+-------------------
(0 rows)



Three Excel tables
eperson
eperson_id firstname lastname phone netid

group 
eperson_group_id name

eperson2group
eperson_group_id eperson_id

July 14, 2010

List of bitstream_format_id values and MIME types for DSpace


 bitstream_format_id |           mimetype            |  short_description   |                             description                              | support_level | internal 
---------------------+-------------------------------+----------------------+----------------------------------------------------------------------+---------------+----------
                   1 | application/octet-stream      | Unknown              | Unknown data format                                                  |             0 | f
                   2 | text/plain                    | License              | Item-specific license agreed upon to submission                      |             1 | t
                   3 | application/pdf               | PDF                  | Adobe Portable Document Format                                       |             1 | f
                   4 | text/xml                      | XML                  | Extensible Markup Language                                           |             1 | f
                   5 | text/plain                    | Text                 | Plain Text                                                           |             1 | f
                   6 | text/html                     | HTML                 | Hypertext Markup Language                                            |             1 | f
                   7 | text/css                      | CSS                  | Cascading Style Sheets                                               |             1 | f
                   8 | application/msword            | Microsoft Word       | Microsoft Word                                                       |             1 | f
                   9 | application/vnd.ms-powerpoint | Microsoft Powerpoint | Microsoft Powerpoint                                                 |             1 | f
                  10 | application/vnd.ms-excel      | Microsoft Excel      | Microsoft Excel                                                      |             1 | f
                  11 | application/marc              | MARC                 | Machine-Readable Cataloging records                                  |             1 | f
                  12 | image/jpeg                    | JPEG                 | Joint Photographic Experts Group/JPEG File Interchange Format (JFIF) |             1 | f
                  13 | image/gif                     | GIF                  | Graphics Interchange Format                                          |             1 | f
                  14 | image/png                     | image/png            | Portable Network Graphics                                            |             1 | f
                  15 | image/tiff                    | TIFF                 | Tag Image File Format                                                |             1 | f
                  16 | audio/x-aiff                  | AIFF                 | Audio Interchange File Format                                        |             1 | f
                  17 | audio/basic                   | audio/basic          | Basic Audio                                                          |             1 | f
                  18 | audio/x-wav                   | WAV                  | Broadcase Wave Format                                                |             1 | f
                  19 | video/mpeg                    | MPEG                 | Moving Picture Experts Group                                         |             1 | f
                  20 | text/richtext                 | RTF                  | Rich Text Format                                                     |             1 | f
                  21 | application/vnd.visio         | Microsoft Visio      | Microsoft Visio                                                      |             1 | f
                  22 | application/x-filemaker       | FMP3                 | Filemaker Pro                                                        |             1 | f
                  23 | image/x-ms-bmp                | BMP                  | Microsoft Windows bitmap                                             |             1 | f
                  24 | application/x-photoshop       | Photoshop            | Photoshop                                                            |             1 | f
                  25 | application/postscript        | Postscript           | Postscript Files                                                     |             1 | f
                  26 | video/quicktime               | Video Quicktime      | Video Quicktime                                                      |             1 | f
                  27 | audio/x-mpeg                  | MPEG Audio           | MPEG Audio                                                           |             1 | f
                  28 | application/vnd.ms-project    | Microsoft Project    | Microsoft Project                                                    |             1 | f
                  29 | application/mathematica       | Mathematica          | Mathematica Notebook                                                 |             1 | f
                  30 | application/x-latex           | LateX                | LaTeX document                                                       |             1 | f
                  31 | application/x-tex             | TeX                  | Tex/LateX document                                                   |             1 | f
                  32 | application/x-dvi             | TeX dvi              | TeX dvi format                                                       |             1 | f
                  33 | application/sgml              | SGML                 | SGML application (RFC 1874)                                          |             1 | f
                  34 | application/wordperfect5.1    | WordPerfect          | WordPerfect 5.1 document                                             |             1 | f
                  35 | audio/x-pn-realaudio          | RealAudio            | RealAudio file                                                       |             1 | f
                  36 | image/x-photo-cd              | Photo CD             | Kodak Photo CD image                                                 |             1 | f
                  37 | text/plain                    | tfw                  | ArcView World File For TIF Image                                     |             0 | f
                  38 | text/plain                    | e00                  | ArcInfo Coverage Export                                              |             0 | f
(38 rows)
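
One thing to watch for in the listing above: the mimetype column is not unique. text/plain appears for ids 2, 5, 37 and 38, so a lookup from MIME type to format has to return a list. A small sketch of such an index, using only a handful of rows copied from the table; the class and method names are illustrative and not part of DSpace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index from mimetype to every bitstream_format_id that uses it. */
final class BitstreamFormats {
    /** A few (bitstream_format_id, mimetype) pairs copied from the listing. */
    static final String[][] SAMPLE = {
        {"2", "text/plain"},
        {"3", "application/pdf"},
        {"5", "text/plain"},
        {"37", "text/plain"},
        {"38", "text/plain"},
    };

    /** Groups format ids by mimetype, keeping the order of the input rows. */
    static Map<String, List<Integer>> idsByMimetype(String[][] rows) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        for (String[] row : rows) {
            index.computeIfAbsent(row[1], k -> new ArrayList<>())
                 .add(Integer.parseInt(row[0]));
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(idsByMimetype(SAMPLE));
    }
}
```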

Workaround for a problem with PDFBox text extraction

The problem

When PDFBox was used to extract text from a file of about 20 MB, it would use more and more memory and eventually drag the system to a standstill.

The workaround

I used the methods setStartPage and setEndPage in the PDFTextStripper class to limit the number of pages converted to text at a time.
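
One detail to get right when looping over page blocks: as far as I can tell from the PDFBox documentation, PDFTextStripper page numbers start at 1 and the end page is inclusive, so consecutive blocks must not share a boundary page. Below is a sketch of just the range arithmetic, with no PDFBox dependency. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a document into page blocks. */
final class PageRanges {
    /** Returns {startPage, endPage} pairs, 1-based and inclusive, covering 1..pageCount. */
    static List<int[]> split(int pageCount, int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be >= 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= pageCount; start += blockSize) {
            // the last block may be shorter than blockSize
            int end = Math.min(start + blockSize - 1, pageCount);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : split(1959, 50)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

Each pair would be passed to setStartPage and setEndPage in turn. Because the last block is clipped to the page count, no separate pass for the tail of the document is needed.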

Results of a run

Extracting text
NumberPagesInPDF 1959
Extraction time seconds 230
Pages per second 8.5 (computed: 1959 pages / 230 seconds)
Number of characters extracted 672324

For the number of pages per text extraction I tried 20, 50 and 100 pages, and the results were similar. This rate seems very slow.

Specs on computer used

The run above was done on odin.lib.umn.edu. Here is some information on it:

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 3
cpu MHz : 3200.166
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6404.36
clflush size : 64
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 3
cpu MHz : 3200.166
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6400.17
clflush size : 64

free -m (without pdfbox running)
             total       used       free     shared    buffers     cached
Mem:          2018       1428        590          0        229       1045
-/+ buffers/cache:        153       1865
Swap:         3812         31       3781

Code sample

I rewrote the getDestinationStream method in the DSpace class PDFFilter to extract text in blocks of 50 pages. The new version of the method is given below:

    public InputStream getDestinationStream(InputStream source) throws Exception {
    
        System.out.println("Extracting text");

        long startTime = System.currentTimeMillis();
        // get input stream from bitstream
        // pass to filter, get string back
        PDFTextStripper TextExtractor = new PDFTextStripper();
        PDFParser parser = null;
        String extractedText = null;
        int NumberPagesInPDF =0;
        try
        {
            parser = new PDFParser(source);
            parser.parse();
            
            extractedText = " " ;
            int SizeOfPDFSection = 50;
            PDDocument PDFtoExtract = new PDDocument(parser.getDocument());
            NumberPagesInPDF = PDFtoExtract.getNumberOfPages();
            // PDFTextStripper page numbers are 1-based and the end page is
            // inclusive, so consecutive blocks must not share a boundary page
            int StartPage = 1;
            int EndPage = 0;
            while (StartPage <= NumberPagesInPDF){
              // the last block is clipped to the end of the document,
              // so no separate pass for the final pages is needed
              EndPage = Math.min(StartPage + SizeOfPDFSection - 1, NumberPagesInPDF);
              TextExtractor.setStartPage( StartPage );
              TextExtractor.setEndPage( EndPage );
              extractedText += TextExtractor.getText(PDFtoExtract);
              StartPage = EndPage + 1;
            }
        }
        finally
        {
            try
            {
                // parser is still null if the PDFParser constructor threw
                if (parser != null)
                {
                    parser.getDocument().close();
                }
            }
            catch(Exception e)
            {
               log.error("Error closing temporary PDF file: " + e.getMessage(), e);
            }
        }

        // if verbose flag is set, print out extracted text
        // to STDOUT
        if (MediaFilterManager.isVerbose)
        {
            System.out.println(extractedText);
        }

        // generate an input stream with the extracted text
        long stopTime = System.currentTimeMillis();
        long deltaSeconds = (stopTime-startTime)/1000;
        System.out.println(" NumberPagesInPDF " + NumberPagesInPDF);
        System.out.println("Extraction time seconds " + deltaSeconds );
        System.out.println("Pages per second "  + (double)NumberPagesInPDF/(double)deltaSeconds  );
        System.out.println(" Number of characters extracted " + extractedText.length());
        byte[] textBytes = extractedText.getBytes();
        ByteArrayInputStream bais = new ByteArrayInputStream(textBytes);
        return bais; // safe: the stream holds a reference to textBytes, so the array stays alive after this method returns
    }
}
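
Returning a ByteArrayInputStream built from a local array is fine in Java: the stream keeps its own reference to the array, so the bytes stay readable after the method that created them has returned. Here is a small standalone check of that. The class name is made up, and the explicit UTF-8 charset is my assumption; the method above uses the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: a stream over a local byte array outlives the method. */
final class StreamScopeDemo {
    /** Builds a stream from a local byte array, much as getDestinationStream does. */
    static ByteArrayInputStream toStream(String text) {
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(textBytes); // the stream keeps textBytes reachable
    }

    /** Reads the whole stream back into a string. */
    static String readAll(ByteArrayInputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(readAll(toStream("extracted text")));
    }
}
```

Naming the charset on both sides also avoids surprises when the extracted text contains non-ASCII characters and the server's default encoding differs from what the consumer expects.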