
Modify DSPACE to accept external text files instead of using DSPACE's internal OCR system.

Introduction

To achieve this I had to modify the processBitstream method in MediaFilter.java. The modified version, processBitstream_ABBYY, accepts an external text file as input instead of running the native DSPACE OCR system. The method name carries the "ABBYY" suffix because we plan to use the ABBYY OCR system to produce the text.

Classes modified

./src/org/dspace/app/mediafilter/MediaFilter.java
./src/org/dspace/app/mediafilter/MediaFilterManager.java

Method written

processBitstream_ABBYY (in MediaFilter.java)
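
To give an idea of the lookup step, here is a minimal sketch in plain Java. It is not the actual code from MediaFilter.java and does not use the DSPACE API. The class ExternalTextSource, its methods, and the directory of OCR output files are all hypothetical names made up for illustration. The only thing taken from this entry is the naming pattern visible in the bitstream table, where the text for Brief_2003-98677.pdf is stored as Brief_2003-98677.pdf.txt.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of DSPACE: finds externally produced OCR text
// for a source bitstream by file name.
class ExternalTextSource {
    // Assumed directory where the external OCR tool writes its .txt output.
    private final Path ocrDir;

    ExternalTextSource(Path ocrDir) {
        this.ocrDir = ocrDir;
    }

    // Mirrors the naming seen in the bitstream table: "x.pdf" -> "x.pdf.txt".
    static String derivedName(String sourceName) {
        return sourceName + ".txt";
    }

    // Returns a stream over the external text file, or empty if there is none,
    // in which case a caller could fall back to the built-in extraction.
    Optional<InputStream> open(String sourceName) throws IOException {
        Path candidate = ocrDir.resolve(derivedName(sourceName));
        if (Files.isRegularFile(candidate)) {
            return Optional.of(Files.newInputStream(candidate));
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Set up a throwaway directory holding one external text file.
        Path dir = Files.createTempDirectory("ocr");
        Files.write(dir.resolve("Brief_2003-98677.pdf.txt"),
                "text from external OCR".getBytes(StandardCharsets.UTF_8));

        ExternalTextSource source = new ExternalTextSource(dir);
        try (InputStream in = source.open("Brief_2003-98677.pdf")
                .orElseThrow(IllegalStateException::new)) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
        // No external file for this name, so the lookup comes back empty.
        System.out.println(source.open("missing.pdf").isPresent());
    }
}
```

In a real filter the returned stream would presumably take the place of the stream that the built-in extraction produces, so the rest of the bitstream creation could stay unchanged. That part depends on the DSPACE version and is left out here.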

Bitstreams before executing code

dspace_ir_silvi003_2011_02=# SELECT bitstream.* FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle=56424;

 bitstream_id | bitstream_format_id |         name         | size_bytes |             checksum             | checksum_algorithm | description | user_format_description |                           source                            |               internal_id               | deleted | store_number | sequence_id 
--------------+---------------------+----------------------+------------+----------------------------------+--------------------+-------------+-------------------------+-------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
            3 |                   3 | Brief_2003-98677.pdf |     626917 | 9d159b478e66b012bada96b9c3df77f2 | MD5                |             |                         | /home/silvi003/dspace/dspace-ir/upload/Brief_2003-98677.pdf | 84690010223348412237368928451773250266  | f       |            0 |           1
            4 |                   2 | license.txt          |       1355 | 1d8ffcef0e2c0982456aa8c8a736e26d | MD5                |             |                         | Written by org.dspace.content.Item                          | 121016793780646677673477451716426805228 | f       |            0 |           2
(2 rows)

Bitstreams after executing code

dspace_ir_silvi003_2011_02=# SELECT bitstream.* FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle=56424;
 bitstream_id | bitstream_format_id |           name           | size_bytes |             checksum             | checksum_algorithm |  description   | user_format_description |                           source                            |               internal_id               | deleted | store_number | sequence_id 
--------------+---------------------+--------------------------+------------+----------------------------------+--------------------+----------------+-------------------------+-------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
            3 |                   3 | Brief_2003-98677.pdf     |     626917 | 9d159b478e66b012bada96b9c3df77f2 | MD5                |                |                         | /home/silvi003/dspace/dspace-ir/upload/Brief_2003-98677.pdf | 84690010223348412237368928451773250266  | f       |            0 |           1
            4 |                   2 | license.txt              |       1355 | 1d8ffcef0e2c0982456aa8c8a736e26d | MD5                |                |                         | Written by org.dspace.content.Item                          | 121016793780646677673477451716426805228 | f       |            0 |           2
            5 |                   5 | Brief_2003-98677.pdf.txt |     441997 | 2a395efbb6798effbd31b78cc410c554 | MD5                | Extracted text |                         | Written by MediaFilter org.dspace.app.mediafilter.PDFFilter | 117268227247096420229650348692064973457 | f       |            0 |           3
(3 rows)
