[Koha-devel] Searching for garbage characters

Tue Nov 6 15:32:02 CET 2012

We've managed to import a number of MARC records with corrupted
diacritics, and my attempts to retrieve these with a report haven't met
with success. (Sample records are in this list:
http://catalog.splnh.com/cgi-bin/koha/opac-shelves.pl?viewshelf=8.)

My thought is to search the 100, 700, etc. tags for any characters
outside the printable ASCII range (32 through 126), but my regex
skills aren't up to the task. To wit:

SELECT CONCAT('<a href=\"/cgi-bin/koha/catalogue/detail.pl?biblionumber=',
              biblionumber, '\">', biblionumber, '</a>') AS bibnumber,
       pname
FROM (SELECT biblionumber,
             ExtractValue(marcxml,
               '//datafield[@tag="100"]/subfield[@code>="a"]') AS pname
      FROM biblioitems) AS authors
WHERE pname REGEXP '[\W]'

These alternative WHERE clauses didn't get me any closer either:

WHERE pname REGEXP '[^a-z]'

WHERE pname LIKE '%[^a-zA-Z0-9]%'

WHERE PATINDEX('%[^a-zA-Z0-9]%',pname) > 1
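
To be concrete about the test I'm after, here is a minimal sketch in
Python, run outside Koha entirely. The function name and the sample
names are made up for illustration; what I can't work out is how to
express the same check in the report's WHERE clause.

```python
# Hypothetical helper, not part of Koha: returns True if the string
# contains any character outside the ASCII 32 through 126 range.
def has_garbage(name: str) -> bool:
    return any(not 32 <= ord(ch) <= 126 for ch in name)


# Made-up sample values, for illustration only. The second one holds
# the kind of corrupted diacritic we are seeing in the catalog.
samples = ["Twain, Mark", "Bront\u00c3\u00ab, Charlotte", "Verne, Jules"]

flagged = [s for s in samples if has_garbage(s)]
print(flagged)
```

Note that this would also flag correctly encoded diacritics, which is
fine for a first pass since I can weed those out by eye.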

Any thoughts on how to write this report? Have tried the folks over on
the MarcEdit list, but no solution as yet.

Many thanks,

Cab Vinton, Director
Sanbornton Public Library
Sanbornton, NH

Life is short. Read fast!