June 23, 2011

Changes required to ingest islandora_ContentModelCollection.xml into the Fedora repository

Summary

Biraj pulled some XML directly from his repository, but it could not be ingested into the new Fedora repository at https://umetadata-stage.lib.umn.edu:8443/fedora/admin/. I could, however, ingest into umetadata-stage the examples that the Fedora repository people gave us.

Original file that would not load and modified file that does load

Version of the XML taken directly from Biraj's Fedora repository, which would not ingest into the umetadata-stage Fedora repository:
islandora_ContentModelCollection.xml
Modified XML that can be ingested:
islandora_ContentModelCollection_noaudit.xml

Changes

Modification                                        | Comment
Eliminate audit elements                            | not certain that this is required
Eliminate CREATED timestamp elements                | not certain that this is required
Eliminate element with ID="TN.0" LABEL="Thumbnail"  | required
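
The three modifications above could be scripted roughly as follows. This is only a sketch under assumptions, not the procedure actually used: it assumes the file is FOXML using the namespace info:fedora/fedora-system:def/foxml#, that the audit trail is the element with ID="AUDIT", and that the CREATED timestamps are attributes rather than child elements. The function name strip_for_ingest is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the exported object is FOXML and uses this namespace.
FOXML_NS = "info:fedora/fedora-system:def/foxml#"

# "TN.0" comes from the table above; "AUDIT" is an assumed ID for the
# audit trail and may differ in the real file.
IDS_TO_REMOVE = {"AUDIT", "TN.0"}


def strip_for_ingest(xml_text: str) -> str:
    """Return the object XML with audit, CREATED, and TN.0 content removed."""
    # Keep the foxml: prefix in the serialized output.
    ET.register_namespace("foxml", FOXML_NS)
    root = ET.fromstring(xml_text)

    # ElementTree elements do not know their parent, so build a lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Drop every CREATED attribute (assumed to be an attribute, not an element).
    for elem in root.iter():
        elem.attrib.pop("CREATED", None)

    # Collect first, then remove, so the tree is not changed while iterating.
    doomed = [e for e in root.iter() if e.get("ID") in IDS_TO_REMOVE]
    for elem in doomed:
        parent = parent_of.get(elem)
        if parent is not None:
            parent.remove(elem)

    return ET.tostring(root, encoding="unicode")
```

To use it, read islandora_ContentModelCollection.xml, pass its text to strip_for_ingest, and write the result to islandora_ContentModelCollection_noaudit.xml. ElementTree may rename namespace prefixes other than foxml: in the output; the result is still namespace-equivalent, but it is worth diffing against the hand-edited file before ingesting.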

June 10, 2011

crons on strip3 (DB side of DSPACE)

# Clean up the databases nightly
20 0 * * * vacuumdb -U dspace_ir --analyze dspace_ir > /dev/null 2>&1
40 0 * * * vacuumdb -U dspace_sr --analyze dspace_sr > /dev/null 2>&1

# Backup the databases nightly
2 1 * * * /var/lib/pgsql/backup.sh

# MySQL backup
5 0 * * * /opt/mysql/bin/backup.sh

crawler not working for UDC

Problem

For conservancy.umn.edu we have not been getting good crawls from the University crawl app.

New robots.txt file

I changed the robots.txt file to allow crawling down the subject browse tree:
# removed Disallow: /browse-subject line
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search

Results with new robots.txt file

I then asked Curt Squires to crawl UDC again. He reported:
A test search on inurl:conservancy.umn.edu shows about 8100 hits, so I think we're still missing some stuff. http://google.umn.edu/search?q=inurl%3Aconservancy.umn.edu&btnG=Google+Search&access=p&client=default_frontend&output=xml_no_dtd&proxystylesheet=default_frontend&ie=UTF-8&entqr=0&oe=UTF-8&ud=1&site=entire_index

Future plans

Curt is out of town until next Wednesday. When he gets back, I will allow the robots to go down the browse-title path. Since every asset must have a title, that path should reach everything. The new robots.txt file will be:
# Remove the lines:
# Disallow: /browse-title
# Disallow: /*/browse-title

User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
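
Before asking for another crawl, the two rule sets can be sanity-checked locally. The sketch below uses Python's standard urllib.robotparser to compare the current and the planned robots.txt. One caveat: that parser treats each Disallow path as a literal prefix and does not expand the * wildcard, so only the top-level rules are exercised here; how the University crawler handles the /*/ rules is not something this check can show. The helper name allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Current file: /browse-subject opened up, /browse-title still blocked.
CURRENT_RULES = """\
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""

# Planned file: /browse-title opened up, /browse-subject blocked again.
PLANNED_RULES = """\
User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""


def allowed(rules: str, path: str) -> bool:
    """Report whether a generic crawler may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", path)


if __name__ == "__main__":
    for path in ("/browse-subject", "/browse-title", "/browse-author"):
        print(path, allowed(CURRENT_RULES, path), allowed(PLANNED_RULES, path))
```

The two columns printed for each path are the verdicts under the current and the planned file, which makes it easy to confirm that exactly one browse tree is open in each version.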