[wpkg-users] [Bug 269] config.xml file has wrong encoding
Malte Starostik
lists at malte.homeip.net
Mon Apr 9 15:37:17 CEST 2012
Hi Stefan,
On Monday, 9 April 2012, 11:14:43, bugzilla-daemon at bugzilla.wpkg.org wrote:
> http://bugzilla.wpkg.org/show_bug.cgi?id=269
> In general Notepad++ is happy with "ANSI as UTF-8" encoded files, but this
> seems not to be the case for this file.
>
> The BOM doesn't hurt and makes sure that any editor, not only
> specialized ones, displays and edits the text correctly.
Please don't get me wrong, all of what you say in this bug is correct :)
However, there is IMHO a subtle problem with BOMs in any file that is supposed
to be machine-readable. Without a BOM, any UTF-8 encoded file will be
correctly parsed by anything that digests ASCII-based text in an 8-bit-clean
way. A BOM can break this in ways that are sometimes hard to debug, as it's
usually not visible. Imagine a
simple key=value based config file:
message="Içh ßiñ €in Täst ☺"
Without a BOM, this will be correctly handled by even the most stupid parser.
As long as the code consuming the data from that parser is aware of the UTF-8
encoding, all is fine. But when you add a BOM, the parser will see the key as
"<BOM>message" instead of "message" and fail miserably.
Any XML tool must definitely cope with the presence of a BOM. But then an XML
file without explicitly specified encoding and without BOM must be UTF-8 encoded
anyway. So as you already said, the BOM helps non-specialized editors.
Right, but personally I've had those invisible buggers bite me several times
while they never served me any good ;)
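
As a rough illustration of the XML side, the following sketch uses Python's standard xml.etree.ElementTree on a made-up document (the element names are hypothetical, not taken from the real config.xml). It assumes an expat-based parser, which is what Python ships:

```python
import codecs
import xml.etree.ElementTree as ET

# UTF-8 bytes, no XML declaration and therefore no explicit encoding.
doc = "<config><message>Täst ☺</message></config>".encode("utf-8")

without_bom = ET.fromstring(doc)
with_bom = ET.fromstring(codecs.BOM_UTF8 + doc)

# Both variants should yield the same decoded text.
print(without_bom.findtext("message"))
print(with_bom.findtext("message"))
```

So for XML the BOM is harmless to the parser and redundant for the encoding; it only changes what a plain text editor guesses.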
(Gosh, I do feel like ranting. It's not directed at you, but should emphasize
why I consider adding a BOM a valid but unfortunate "fix":)
<rant>
I'd assert that BOMs are a kludge that should be used very sparingly. In
fact, as the byte order is clear in UTF-8, the BOM that is customary and necessary
with UTF-16/UCS-2 degenerates into a mere flag marking the text as Unicode. What an
epic fail ;) Anything non-UTF-8 should be flagged as "Danger: obsolete encoding
inside" instead and those ISO-8859-*, WIN125* and whatnot should go die in a
fire. AFAIK among all current OSes, only Windows still doesn't default to UTF-8
in text files. Plus MS has this evil habit of assuming Unicode = UCS-2, which
totally breaks ASCII-compatibility, breaks all protocols that can't deal with
<NUL>-bytes in text streams etc. (I've seen Outlook Express send E-Mails with
UCS-2 content as text/plain, no charset given, no transfer encoding and all
those lovely <NUL>s inside...) Yeah, UCS-2 is a Unicode encoding and fine for
internal processing (if all you need is the BMP). But when serializing to a
file, UTF-8 is the way to go, no BOMs needed.
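
A small Python sketch of the byte-level difference, using the "utf-16-le" and "utf-16-be" codecs as stand-ins for UCS-2 (for BMP-only text the bytes are the same); the sample string is my own:

```python
import codecs

text = "Hi"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

print(utf8)      # b'Hi'        -> identical to plain ASCII
print(utf16_le)  # b'H\x00i\x00' -> a NUL byte after every ASCII character
print(utf16_be)  # b'\x00H\x00i' -> same text, different byte order

# With two possible byte orders, the BOM actually carries information ...
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)  # b'\xff\xfe' b'\xfe\xff'
# ... whereas UTF-8 has only one byte sequence per character.
print(codecs.BOM_UTF8)                           # b'\xef\xbb\xbf'
```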
</rant>
Kind regards,
Malte