« August 2008 | Main | October 2008 »

September 30, 2008

Translating apache log format to dspace

Introduction

There are basically two types of log files that must be handled: views and downloads.

Downloads

Comparison of apache and dspace logs

From apache logs:
69.109.228.170 - - [21/Sep/2008:23:59:27 -0500] "GET /bitstream/31045/1/26020387.pdf HTTP/1.1" 200 1225483 "http://scholar.google.com/scholar?hl=en&lr=&q=accept+Genetically+modified+organism&btnG=Search" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1"

When the guts of this request is sent to agecon by entering the line below into a browser:

http://ageconsearch.umn.edu/bitstream/31045/1/26020387.pdf

Catalina records the following log entry:

2008-09-30 16:12:08,153 INFO org.dspace.app.webui.servlet.BitstreamServlet @ anonymous:session_id=F82A25EDFCF0C73AE8C19291D3C3985A:ip_addr=128.101.29.84:view_bitstream:bitstream_id=3976


Required conversions

Notice that in the two log entries above, apache records a handle of 31045, while the dspace log gives a bitstream_id of 3976. To convert from apache to dspace, we must map the handle to the bitstream_id. The sql command below will do this:

select bundle2bitstream.bitstream_id from item2bundle, handle, bundle2bitstream where (handle.handle=31045 and handle.resource_id = item2bundle.item_id and bundle2bitstream.bundle_id=item2bundle.bundle_id); bitstream_id
bundle_id
-----------
3976
Note the bundle_id and the bitstream_id are the same.

Views

Comparison of apache and dspace logs

For views we have an apache log of the form:

203.20.101.203 - - [21/Sep/2008:23:32:17 -0500] "GET /handle/22682 HTTP/1.1" 503 410 "http://scholar.google.com.au/scholar?q=%22some+implications+of+the+growth+of+the+mineral+sector%22&hl=en&um=1&ie=UTF-8&oi=scholart" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; InfoPath.1)"

When the request below is put into a browser:
http://ageconsearch.umn.edu/handle/22682

We get the catalina log:

2008-09-30 16:26:30,456 INFO org.dspace.app.webui.servlet.DSpaceServlet @ anonymous:session_id=F82A25EDFCF0C73AE8C19291D3C3985A:ip_addr=128.101.29.84:view_item:handle=22682

Required conversions

To generate the dspace log format, we need to determine what was being viewed. In the case above it was an view_item.
From the handle we can find the resource_type_id, which allows us to generate terms like, view_item.
select * from handle where handle=22682; handle_id | handle | resource_type_id | resource_id -----------+--------+------------------+------------- 13089 | 22682 | 2 | 12346
The table below provides a conversion between the resource_type_id and they actual type.
resource_type resource_type_id
BITSTREAM 0
BUNDLE 1
ITEM 2
COLLECTION 3
COMMUNITY 4
SITE 5
GROUP 6
EPERSON 7

September 29, 2008

Regex to pull GET POST commands from apache logs

The regex: perl -p -i -e 's/^.*\s\"(GET|POST)(.*)\"\s.*\".*\".*\"$/\1\2/' Will extract:
GET /bitstream/34171/1/sp06pa01.pdf HTTP/1.1

from
99.239.52.187 - - [21/Sep/2008:09:38:24 -0500] "GET /bitstream/34171/1/sp06pa01.pdf HTTP/1.1" 200 16368 "http://www.google.ca/search?hl=en&q=positive+effects+on+canada+from+immigration&start=10&sa=N" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1"

September 25, 2008

perl path on os x

Where to find perl modules installed by cpan on mac OS X.
export PERL5LIB=/opt/local/lib/perl5/site_perl/5.8.8:/opt/local/lib/perl5/site_perl/5.8.8/darwin-2level

September 24, 2008

Analysis of Apache/Catalina AgEcon Records for

Breakdown of log entries by type

I looked at the catalina.out log records for 2008-09-21. These logs are in the file: catalina.out_2008-09-21 (because of the way that the files are backed up the log entries only extend to 11:30 PM). From these entries, I made a list of log entries by type: BreakDownOfAgEconLog_2008_09_21.html It is worth noting that of the 62K hits only 20 came from "SimpleSearch". That is only 20 users went to our search engine and the rest searched through google or are robots.

Log types that are required for stats

Jason Roy and I agree that the following log types are need for stats.
Log Name Number in Log Found Apache Match Apache needs SQL
view_bitstream 10772 Y Y
view_item 4462 Y N
view_collection 2084 Y Y
view_community 582 Y Y
The "Apache needs SQL" column indicates whether it is required to use information from the dspace SQL database to map the apache logs to dspace catalina logs. Also the term "view_bitstream" corresponds to download.

How apache logs map to catalina logs

To take care of some issues in the catalina logs, I am going to use apache logs. Here are examples of the log entries for both apache and catalina for all of the critical log types given in the table above. There are also catalina examples for almost all the types.

September 17, 2008

Why UDC crashed ... "pool exhausted" error Jason, Basically DSPACE does not properly close connections to the SQL server. When the pool is exhausted it generates error messages. This may be more of a problem now because OAI is available (climbing down t

Jason,

Basically DSPACE does not properly close connections to the SQL server. When the pool is exhausted it generates error messages. This may be more of a problem now because OAI is available (climbing down the tree will hit the DB a lot) or there is another SQL-Injection attack, or UDC just may be more popular. I did not explore the probable increased load.

A more detailed explanation of the error is given below, with a possible fix. To step up the fix given on the web I need some privileges on strip3. I have asked CCO for them.

Jeff

1) Problem Indicated in the Logs

Starting at
2008-09-17 08:23:08,769
and ending at
2008-09-17 08:53:29,729 (When Bill restarted DSPACE).
There were 330 error messages of the type:
2008-09-17 08:53:29,729 WARN org.dspace.app.webui.servlet.DSpaceServlet @ anonymous:no_context:database_error:org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool exhausted


An error messages of this sort would generate an error screen.

2) Fix from University of Michigan

In dspace-tech the University of Michigan team addresses this problem by closing prossess that have the phrase 'idle in transaction' when displayed by ps.

see
http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg01057.html

3) Confirmation that our problem matches University of Michigan's
I have checked strip3 (where the postgres database lives) and between 08:54 (When Bill restarted DSPACE) and 11:22, there have been 45 processes created that have the form:

postgres 15047 0.0 0.0 86304 4964 ? S 10:58 0:00 postgres: dspace_ir dspace_ir 134.84.135.19 idle

So we are building up these "idle" processes on the DB side and it is likely that the system will crash again, unless we put in the Michigan fix

September 16, 2008

Properties of tomcat and apache on strip1

Here are the properties:

Using CATALINA_BASE: /opt/tomcat
Using CATALINA_HOME: /opt/tomcat
Using CATALINA_TMPDIR: /opt/tomcat/temp
Using JRE_HOME: /opt/jdk1.5.0_10
Server version: Apache Tomcat/5.5.20
Server built: Sep 12 2006 10:09:20
Server number: 5.5.20.0
OS Name: Linux
OS Version: 2.6.9-78.0.1.ELsmp
Architecture: i386
JVM Version: 1.5.0_10-b03
JVM Vendor: Sun Microsystems Inc.



Server version: Apache/2.0.52
Server built: May 9 2008 05:54:40

mod_jk/1.2.25

September 11, 2008

Finding the x.500 fields in an ssl cert

The command below will reveal all the x.500 fields in tools.lib.umn.edu.crt

openssl x509 -noout -text -in tools.lib.umn.edu.crt
openssl rsa -noout -text -in server.key
openssl req -noout -text -in server.csr
openssl rsa -noout -text -in ca.key
openssl x509 -noout -text -in ca.crt

Bad handle in AgEcon indexer

An AgEcon patron tried to submit a file to the archive. When they did a search based on author an error resulted. The AgEcon staff resubmitted and then deleted the old record. It looks like a bad handle got into the lucene index with the submit and then was never removed. When dspace queries lucene and finds a handle that is not in the DB it throws an error and halts the search. Perhaps it should continue on. Here is some error logs.

September 8, 2008

OAI on odin

Activating the OAI harvester

I have enabled the OAI harvester for the dspace instance on the odin box. I needed to make the oai_dc metadata available. To do this I copied the file:

~/dspace_home/config/templates/oaicat.properties
to
~/dspace_home/config/templates/oaicat.properties

and returned the contents of this file to its original form (i.e. revision 1). Note: in this default state the file oaicat.properties has the line:
Crosswalks.oai_dc=org.dspace.app.oai.OAIDCCrosswalk uncommented.
I also removed the replace task from the build.xml file in the dspace-sr instance. Although for several months the OAI harvestor has been operational, the changes mean that the dspace-ir (UDC) and the dspace-sr (AgEcon) invoke the OAI system in an identical fashion.

Hyperlinks to call OAI verbs

Hyperlinks to development server on odin

The URL below returned all the metadata:
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc

This URL will return all the metadata since 2008-04-15:
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15

The next URL returns the data between 2008-04-15 and 2008-04-20
https://odin.lib.umn.edu:9031/dspace-oai-ir/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-20

Hyperlinks to live box

AgEcon:
http://strip1.oit.umn.edu:8080/dspace_sr-oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30
UDC:
http://strip1.oit.umn.edu:8080/dspace_ir-oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30

Nice hyperlinks to live box

http://strip1.oit.umn.edu:8080/dspace_ir-oai/
has been aliased to
http://conservancy.umn.edu/oai

and
http://strip1.oit.umn.edu:8080/dspace_ir-oai/
has been aliased to
http://ageconsearch.umn.edu/oai/

These are much nicer to look at.

So the urls below will work.
http://conservancy.umn.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30
http://ageconsearch.umn.edu/oai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-15&until=2008-04-30

Issues with OAI harvester on DSPACE

Elements in the AgEcon schema are not displayed

Only the AgEcon metadata that are Dublin Core are displayed by the OAI harvester
Note in the table from the link above:  
metadata_schema_id = 1 is dc schema
metadata_schema_id = 2 is agecon schema

The elements of the AgEcon schema are not displayed by the current crosswalk and a new crosswalk would have to be written.

Crosswalk is not qualified Dublin Core

The crosswalk from the dspace archive to the OAI output is Dublin Core not qualified Dublin Core (note the metadataPrefix=oai_dc in the OAI requests above).
So the values:
 date        | accessioned | Date DSpace takes possession of item.
 date        | available        | Date or date range item became available to the public.
 date        | issued    
  
are all mapped to the tag <dc:date> In this OAI output sample , I have pointed out which OAI <dc:date> tags correspond to the various qualified Dublin Core dates. I did an sql query on item 2 to find the actual values for the different date fields.
I could try to find a caned crosswalk for DSPACE to qualified DC. I have briefly looked for this several people say it would be a good thing but I have not found anyone who has done this. Hard to say how long this would take, but it is possible that no one has done it.

Update oaicat.jar

I did an update on oaicat.jar. It turns out that this both versions of the jar work, so I stayed with the more recent version. The manifests form both jars are given here.
oaiCatMainfest.html