Modify DSPACE to accept external text files instead of using DSPACE's internal OCR system.
Introduction
To achieve this, I modified the processBitstream method in MediaFilter.java. The modified version accepts an external text file as input instead of relying on the native DSPACE OCR system. The new method name carries the "ABBYY" suffix because we plan to use the ABBYY OCR system.
Classes modified
./src/org/dspace/app/mediafilter/MediaFilter.java
./src/org/dspace/app/mediafilter/MediaFilterManager.java
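The change described in the introduction amounts to replacing the stream produced by the built-in text extraction with a stream read from an externally generated text file. The sketch below illustrates only that lookup step and is independent of the DSPACE classes. The class name ExternalTextSource, its methods, and the convention that the external file sits in one directory and is named after the source bitstream with ".txt" appended are all assumptions made for illustration. They are not taken from the actual processBitstream_ABBYY code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper: finds externally produced OCR text for a source bitstream. */
final class ExternalTextSource {

    /** Directory where the external OCR tool is assumed to write its .txt output. */
    private final Path textDir;

    ExternalTextSource(Path textDir) {
        this.textDir = textDir;
    }

    /** Derived name, e.g. "Brief_2003-98677.pdf" becomes "Brief_2003-98677.pdf.txt". */
    static String filteredName(String sourceName) {
        return sourceName + ".txt";
    }

    /** Returns the external text file for the given source name, if one exists. */
    Optional<Path> locate(String sourceName) {
        Path candidate = textDir.resolve(filteredName(sourceName));
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    /**
     * Opens the external text as a stream. In a modified filter, this stream
     * would stand in for the output of the built-in extraction step.
     */
    Optional<InputStream> open(String sourceName) throws IOException {
        Optional<Path> path = locate(sourceName);
        if (path.isPresent()) {
            return Optional.of(Files.newInputStream(path.get()));
        }
        return Optional.empty();
    }
}
```

An empty result lets the caller decide what to do when no external text exists, for example skipping the item or falling back to the native extraction.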
Method written
processBitstream_ABBYY (in MediaFilter.java)
Bitstreams before executing code
dspace_ir_silvi003_2011_02=# SELECT bitstream.* FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle=56424;
bitstream_id | bitstream_format_id | name | size_bytes | checksum | checksum_algorithm | description | user_format_description | source | internal_id | deleted | store_number | sequence_id
--------------+---------------------+----------------------+------------+----------------------------------+--------------------+-------------+-------------------------+-------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
3 | 3 | Brief_2003-98677.pdf | 626917 | 9d159b478e66b012bada96b9c3df77f2 | MD5 | | | /home/silvi003/dspace/dspace-ir/upload/Brief_2003-98677.pdf | 84690010223348412237368928451773250266 | f | 0 | 1
4 | 2 | license.txt | 1355 | 1d8ffcef0e2c0982456aa8c8a736e26d | MD5 | | | Written by org.dspace.content.Item | 121016793780646677673477451716426805228 | f | 0 | 2
(2 rows)
Bitstreams after executing code
dspace_ir_silvi003_2011_02=# SELECT bitstream.* FROM handle,item2bundle,bitstream,bundle2bitstream WHERE handle.resource_id = item2bundle.item_id AND handle.resource_type_id=2 AND item2bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id = bitstream.bitstream_id AND handle=56424;
bitstream_id | bitstream_format_id | name | size_bytes | checksum | checksum_algorithm | description | user_format_description | source | internal_id | deleted | store_number | sequence_id
--------------+---------------------+--------------------------+------------+----------------------------------+--------------------+----------------+-------------------------+-------------------------------------------------------------+-----------------------------------------+---------+--------------+-------------
3 | 3 | Brief_2003-98677.pdf | 626917 | 9d159b478e66b012bada96b9c3df77f2 | MD5 | | | /home/silvi003/dspace/dspace-ir/upload/Brief_2003-98677.pdf | 84690010223348412237368928451773250266 | f | 0 | 1
4 | 2 | license.txt | 1355 | 1d8ffcef0e2c0982456aa8c8a736e26d | MD5 | | | Written by org.dspace.content.Item | 121016793780646677673477451716426805228 | f | 0 | 2
5 | 5 | Brief_2003-98677.pdf.txt | 441997 | 2a395efbb6798effbd31b78cc410c554 | MD5 | Extracted text | | Written by MediaFilter org.dspace.app.mediafilter.PDFFilter | 117268227247096420229650348692064973457 | f | 0 | 3
(3 rows)
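The size_bytes and checksum columns above offer a simple way to confirm that the stored text bitstream is byte-for-byte the external file that was supplied. The following is a minimal sketch of such a check. The class BitstreamCheck and its methods are hypothetical names introduced here, and it assumes the external text file is still available on disk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: compares a local file with the size and MD5 recorded for a bitstream. */
final class BitstreamCheck {

    /** Computes the MD5 digest of a file as a lowercase hex string. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True if the file has the expected size in bytes and the expected MD5 checksum. */
    static boolean matches(Path file, long expectedSize, String expectedMd5)
            throws IOException, NoSuchAlgorithmException {
        return Files.size(file) == expectedSize
                && md5Hex(file).equalsIgnoreCase(expectedMd5);
    }
}
```

For the row with bitstream_id 5, the call would pass the path of the external text file together with 441997 and 2a395efbb6798effbd31b78cc410c554. A false result would suggest that the stored text came from a different source or was altered on the way in.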