[Koha-devel] Finding invalid XML characters in Koha data via SQL

Fri Apr 12 08:06:05 CEST 2024

Hi!

On 12.04.2024 03:36, David Cook via Koha-devel wrote:
> Hi all,
> 
> I just wanted to share a (MariaDB) SQL report that I wrote for finding 
> bib records with invalid XML characters:
> 
> select biblionumber from biblio_metadata where metadata REGEXP 
> '[^\\x{0009}\\x{000A}\\x{000D}\\x{0020}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]+';
> 
> Newer versions of Koha strip invalid characters from the XML so that you 
> can fix your records. I figure this report is very valuable when coupled 
> with that functionality. In fact, I just advised a library today to use 
> them together to fix up some bad data in their catalogue.
> 
> --
> 
> On a related note, I’ve noticed that you can have a record with good bib 
> XML but invalid item XML, and you won’t notice until your record fails 
> to be indexed. So I’m planning on writing a report for that too.
> 
> I’m thinking it might be good to add these reports to core Koha, so that 
> people can find and fix their own metadata problems. What do people think?

Sounds like an excellent idea! Sounds kind of similar to "MARC 
bibliographic framework test" at /cgi-bin/koha/admin/checkmarc.pl

The report could also be added to 
https://wiki.koha-community.org/wiki/SQL_Reports_Library so that it is 
immediately useful, including for older versions of Koha.
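
For anyone who wants to sanity-check the character class outside the database, here is a minimal sketch in Python. It uses the same ranges as the SQL report above (everything outside the XML 1.0 "Char" production counts as invalid). The names INVALID_XML_CHARS and find_invalid_xml_chars are made up for this example and are not part of Koha; the sample string is also just an illustration.

```python
import re

# Complement of the XML 1.0 "Char" production, i.e. the same ranges
# as in the SQL report: tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD,
# U+10000-U+10FFFF are allowed, everything else is flagged.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters not allowed in XML 1.0.

    Hypothetical helper for illustration only; not a Koha function.
    """
    return [
        (match.start(), f"U+{ord(match.group()):04X}")
        for match in INVALID_XML_CHARS.finditer(text)
    ]


if __name__ == "__main__":
    # Sample data: a vertical tab and an escape character hidden in a title.
    sample = "Good title\x0b with an escape\x1b character"
    for offset, codepoint in find_invalid_xml_chars(sample):
        print(offset, codepoint)
```

Something like this could be run over the metadata of a record the SQL report has flagged, to see exactly which characters are the problem and where they sit.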

Best regards,
Magnus