Crawler not working for UDC
Problem
For conservancy.umn.edu, we have not been getting good crawls from the University crawl app.
New robots.txt file
I changed the robots.txt file to allow crawling down the subject browse tree:
# removed Disallow: /browse-subject line
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
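To sanity-check which paths a rule set like this leaves open, a small matcher can help. The following is a minimal sketch, not the crawler's actual logic: it assumes the `*` and trailing `$` wildcard convention documented by the major search engines, treats the file as a single `User-agent: *` group, and uses hypothetical helper names (`disallow_patterns`, `is_blocked`) and a made-up nested path for illustration. Whether the University crawl app interprets `/*/` the same way is an assumption worth confirming.

```python
import re

ROBOTS_TXT = """\
# removed Disallow: /browse-subject line
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""


def disallow_patterns(robots_txt):
    """Collect Disallow paths, skipping comments and empty rules."""
    patterns = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                patterns.append(path)
    return patterns


def to_regex(pattern):
    """Translate a robots.txt path pattern into a prefix-matching regex."""
    anchored = pattern.endswith("$")  # a trailing '$' pins the end of the path
    if anchored:
        pattern = pattern[:-1]
    # '*' stands for any run of characters; everything else is literal
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, patterns):
    """True if any Disallow pattern matches the start of the path."""
    return any(to_regex(p).match(path) for p in patterns)


patterns = disallow_patterns(ROBOTS_TXT)
# "/some-collection/browse-subject" is a hypothetical nested path
for path in ("/browse-subject", "/browse-title", "/some-collection/browse-subject"):
    print(path, "blocked" if is_blocked(path, patterns) else "allowed")
```

Under these assumptions the top-level subject browse is open, while subject browse pages nested under another path segment remain blocked by the `/*/browse-subject` rule.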
Results with new robots.txt file
I then asked Curt Squires to crawl UDC again. He found:
A test search on inurl:conservancy.umn.edu shows about 8100 hits, so I think we're still missing some stuff.
http://google.umn.edu/search?q=inurl%3Aconservancy.umn.edu&btnG=Google+Search&access=p&client=default_frontend&output=xml_no_dtd&proxystylesheet=default_frontend&ie=UTF-8&entqr=0&oe=UTF-8&ud=1&site=entire_index
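To repeat the same test search after each crawl, the query URL can be rebuilt from its parts. This is a small sketch using only the parameters visible in the URL above; the function name `build_inurl_search` is hypothetical, and the meaning of each parameter is not documented here.

```python
from urllib.parse import urlencode

SEARCH_BASE = "http://google.umn.edu/search"


def build_inurl_search(host):
    """Rebuild the inurl: test query for a host, using the parameters from the URL above."""
    params = [
        ("q", f"inurl:{host}"),
        ("btnG", "Google Search"),
        ("access", "p"),
        ("client", "default_frontend"),
        ("output", "xml_no_dtd"),
        ("proxystylesheet", "default_frontend"),
        ("ie", "UTF-8"),
        ("entqr", "0"),
        ("oe", "UTF-8"),
        ("ud", "1"),
        ("site", "entire_index"),
    ]
    # urlencode percent-encodes ':' and turns spaces into '+'
    return SEARCH_BASE + "?" + urlencode(params)


print(build_inurl_search("conservancy.umn.edu"))
```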
Future plans
Curt is out of town until next Wednesday. When he gets back, I will allow the robots to go down the browse-title path. Since every asset must have a title, that path should reach everything. The new robots.txt file will be:
# Remove the lines:
# Disallow: /browse-title
# Disallow: /*/browse-title
User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
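Before deploying the planned file, it may help to diff its Disallow rules against the current ones. The sketch below compares the two rule sets quoted on this page; the helper name `disallowed` is hypothetical, and the comparison looks only at rule text, not at how a crawler interprets it.

```python
def disallowed(robots_txt):
    """Return the set of Disallow paths, ignoring comments."""
    rules = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # commented-out rules are skipped
        if line.lower().startswith("disallow:"):
            rules.add(line.split(":", 1)[1].strip())
    return rules


CURRENT_RULES = """\
# removed Disallow: /browse-subject line
User-agent: *
Disallow: /browse-author
Disallow: /browse-title
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-title
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""

PLANNED_RULES = """\
# Remove the lines:
# Disallow: /browse-title
# Disallow: /*/browse-title
User-agent: *
Disallow: /browse-subject
Disallow: /browse-author
Disallow: /browse-date
Disallow: /suggest
Disallow: /*/browse-subject
Disallow: /*/browse-author
Disallow: /*/browse-date
Disallow: /image
Disallow: /feed
Disallow: /password-login
Disallow: /advanced-search
"""

newly_open = sorted(disallowed(CURRENT_RULES) - disallowed(PLANNED_RULES))
newly_blocked = sorted(disallowed(PLANNED_RULES) - disallowed(CURRENT_RULES))

print("Rules dropped (paths opened):", newly_open)
print("Rules added (paths blocked):", newly_blocked)
```

The diff shows both browse-title rules dropped as intended, and also shows `Disallow: /browse-subject` returning in the planned file. If the subject browse tree should stay crawlable alongside the title tree, that line would need to be left out.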