encoding debugging
Yesterday Sam Ruby pointed out a small glitch in my RSS feeds. So late last night I started looking at my publishing scripts to work out what the problem was. I guessed I'd done something stupid like double-encoding a string somewhere. Or maybe it'd be an obvious fix to a library.
If only it were that simple.
In Perl 5.8.1, as in most modern languages, Unicode should just work in almost all cases. Unfortunately, if you've hit one of the cases where it doesn't, it's almost impossible to work out why.
After several hours of reading up on Perl's Unicode implementation details, tracing output, and working through the source code of several CPAN modules I'm using, I finally worked out what was going wrong. A module was overriding Perl's default Unicode behaviour, which meant a file was being read with the wrong encoding. This was then compounded by me using an HTML escaping function instead of an XML one.
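Those two failure modes can be sketched in a few lines. This is Python rather than Perl, and it is not the code from my publishing scripts: the sample string, the latin-1 misread and the `html_style_escape` helper are all assumptions made up for illustration. It only shows the general shape of the problem, which is bytes decoded with the wrong encoding, and an HTML-style escaper producing named entities that an XML parser doesn't know about.

```python
import xml.etree.ElementTree as ET
from html.entities import codepoint2name
from xml.sax.saxutils import escape as xml_escape

# Hypothetical entry text containing a real ellipsis character (U+2026).
text = "wait\u2026"

# Failure 1: UTF-8 bytes read back with the wrong encoding.
raw = text.encode("utf-8")         # the ellipsis becomes three bytes
misread = raw.decode("latin-1")    # each byte turns into its own junk character


def html_style_escape(s):
    """Hypothetical HTML-style escaper: swap characters for named entities."""
    out = []
    for ch in s:
        name = codepoint2name.get(ord(ch))
        out.append(f"&{name};" if name else ch)
    return "".join(out)


# Failure 2: HTML named entities are not defined in plain XML.
html_escaped = html_style_escape(text)   # "wait&hellip;"
try:
    ET.fromstring(f"<title>{html_escaped}</title>")
    html_version_parses = True
except ET.ParseError:
    html_version_parses = False

# An XML escaper only touches &, < and >, so the character survives as-is.
xml_escaped = xml_escape(text)
round_tripped = ET.fromstring(f"<title>{xml_escaped}</title>").text
```

Here `html_version_parses` ends up false because `&hellip;` is undefined as far as the XML parser is concerned, while the XML-escaped version round-trips cleanly. Combine the two failures and the feed ends up with entities for the junk characters instead of for the ellipsis.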
Problem is, fixing this bug then triggered other bugs elsewhere in the code. If I read the file in with the correct encoding, the templating module started to complain about multibyte characters. Fixing that caused other problems. And so on.
In the end I gave up - I'd rather spend my time on a ground-up rewrite of my publishing system, and there was a much simpler solution. I just edited the offending entry to use "…" instead of "…".
The real underlying problem here is that my current publishing system is a bit of a hack. But still, it really shouldn't take this much effort to do the right thing with non-ASCII text.