
crawler not working for UDC

Problem

For conservancy.umn.edu we have not been getting good crawls from the University crawl app.

New robots.txt file

I changed the robots.txt file to allow crawling down the subject browse tree:
# removed Disallow: /browse-subject line
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
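
A quick way to sanity-check a rule set like this before asking for another crawl is to test a few paths against it. The sketch below is only an approximation: it assumes the crawler treats each Disallow value as a prefix match and honours `*` as a wildcard (Google-style matching), which I have not confirmed for the University crawl app. The `/handle/1234/...` paths are made-up examples, not real UDC URLs.

```python
import re

# The Disallow rules from the robots.txt above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""


def disallow_patterns(robots_txt):
    """Collect Disallow values, skipping comments and blank lines."""
    patterns = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            value = line.split(":", 1)[1].strip()
            if value:
                patterns.append(value)
    return patterns


def pattern_to_regex(pattern):
    """Assumed semantics: '*' matches any run of characters, rest is literal."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts))


def is_blocked(path, patterns):
    """True if any pattern matches the path from its start (prefix match)."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


PATTERNS = disallow_patterns(ROBOTS_TXT)

# Hypothetical paths, for illustration only.
for path in ["/browse-subject", "/handle/1234/browse-subject",
             "/browse-title", "/handle/1234/5678"]:
    print(path, "blocked" if is_blocked(path, PATTERNS) else "allowed")
```

Under those assumptions the top-level `/browse-subject` path is open, but any subject browse nested under another path segment is still caught by the `/*/browse-subject` rule.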

Results with new robots.txt file

I then asked Curt Squires to crawl UDC again. He found:
A test search on inurl:conservancy.umn.edu shows about 8100 hits, so I think we're still missing some stuff. http://google.umn.edu/search?q=inurl%3Aconservancy.umn.edu&btnG=Google+Search&access=p&client=default_frontend&output=xml_no_dtd&proxystylesheet=default_frontend&ie=UTF-8&entqr=0&oe=UTF-8&ud=1&site=entire_index

Future plans

Curt is out of town until next Wed. When he gets back, I will allow the robots to go down the browse-title path. Since every asset has to have a title, that path should reach everything. The new robots.txt file will be:
# Remove the lines:
# Disallow: /browse-title
# Disallow: /*/browse-title

User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
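
To double-check what actually changes between the two rule sets in this post, the Disallow lines can be compared as sets. This is a minimal sketch: the names `CURRENT`, `PLANNED` and `disallow_rules` are mine, and it only compares rule text, it does not model how any particular crawler interprets the rules.

```python
def disallow_rules(robots_txt):
    """Return the set of Disallow values, ignoring comments and blanks."""
    rules = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            value = line.split(":", 1)[1].strip()
            if value:
                rules.add(value)
    return rules


# The robots.txt currently in place (subject browse opened up).
CURRENT = """\
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""

# The planned robots.txt (title browse opened up).
PLANNED = """\
# Remove the lines:
# Disallow: /browse-title
# Disallow: /*/browse-title

User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""

current_rules = disallow_rules(CURRENT)
planned_rules = disallow_rules(PLANNED)

# Rules dropped by the planned file (paths that become crawlable).
newly_allowed = sorted(current_rules - planned_rules)
# Rules the planned file adds back (paths that become blocked again).
newly_blocked = sorted(planned_rules - current_rules)

print("no longer disallowed:", newly_allowed)
print("newly disallowed:", newly_blocked)
```

The comparison shows both browse-title rules being dropped, and also that the planned file puts `Disallow: /browse-subject` back, so the subject path opened earlier would be closed again once the title path is open.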
