Select to find ascii control characters in DSPACE (i.e. in the postgres DB)
SQL select for CNT Chars
The command below will find DSPACE titles that contain ascii control characters (less than 31).
select handle, title, item.last_modified from item, handle, itemsbytitle where item.item_id = handle.resource_id AND itemsbytitle.item_id = item.item_id AND last_modified > '2012-01-28' AND title ~ ('[^'||chr(31)||'-'||chr(255)||']') ;
Reason for the Range
The site:http://www.asciitable.com/
Shows that the control characters in ASCII are all less than 32.
Result of the search
handle | title | last_modified --------+----------------------------------------------------------------------------------------------------------------------------+---------------------------- 47226 | Pouvoir, violence et resistance en postcolonie : une lecture de en attendant le vote des betes Sauvages E’Ahmadou Kourouma | 2012-02-04 01:11:16.764-06 (1 row) s
Converting the title to hex and sorting
[silvi003:~]$ echo 'Pouvoir, violence et resistance en postcolonie : une lecture de en attendant le vote des betes Sauvages E’Ahmadou Kourouma' | hexdump -v -e '"\\\x" 1/1 "%02x" " "' | perl -p -i -e 's/ /\n/g' | sort | uniq -c 1 \x0a 18 \x20 1 \x2c 1 \x3a 1 \x41 1 \x45 1 \x4b 1 \x50 1 \x53 7 \x61 1 \x62 4 \x63 4 \x64 19 \x65 1 \x67 1 \x68 4 \x69 4 \x6c 2 \x6d 8 \x6e 10 \x6f 1 \x70 4 \x72 6 \x73 9 \x74 7 \x75 4 \x76 1 \x80 1 \x99 1 \xe2There is a line feed in the title (\x0a), and this happens to be the last character.
ISSUE
The regex above will select strings with a CR.000 1101 015 13 0D CR ␍ ^M \r Carriage return[g]This is bad. I need to put in some sort of "OR" for this.