[wpkg-users] [Bug 269] config.xml file has wrong encoding

Malte Starostik lists at malte.homeip.net
Mon Apr 9 15:37:17 CEST 2012


Hi Stefan,

On Monday, 9 April 2012, 11:14:43, bugzilla-daemon at bugzilla.wpkg.org wrote:
> http://bugzilla.wpkg.org/show_bug.cgi?id=269
> In general Notepad++ is happy with "ANSI as UTF-8" encoded files, but this
> seems not to be the case for this file.
> 
> The BOM doesn't hurt and makes sure that any editor, not just only
> specialized ones, display and edit the text correctly.

Please don't get me wrong, everything you say in this bug is correct :)
However, there is IMHO a subtle problem with BOMs in any file that is supposed
to be machine-readable.  Without a BOM, any UTF-8 encoded file will be
correctly parsed by anything that digests 8-bit, ASCII-compatible text.  A BOM
can break this in ways that are sometimes hard to debug, as it's usually not
visible.  Imagine a simple key=value based config file:

message="Içh ßiñ €in Täst ☺"

Without a BOM, this will be correctly handled by even the most stupid parser.
As long as the code consuming the data from that parser is aware of the UTF-8
encoding, all is fine.  But when you add a BOM, the parser sees the key as
"<BOM>message" instead of "message", so the lookup fails miserably.

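To make that concrete, here is a minimal sketch in Python.  The parser
(parse_config) is hypothetical and deliberately naive; it is not taken from
WPKG or any real tool, it only stands in for "the most stupid parser":

```python
import codecs


def parse_config(raw: bytes) -> dict:
    """Hypothetical, deliberately naive key=value parser working on raw bytes."""
    result = {}
    for line in raw.splitlines():
        if b"=" not in line:
            continue
        key, _, value = line.partition(b"=")
        # Strip ASCII whitespace and the surrounding double quotes.
        result[key.strip()] = value.strip().strip(b'"')
    return result


text = 'message="Içh ßiñ €in Täst ☺"\n'
plain = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain  # same content, prefixed with EF BB BF

cfg_plain = parse_config(plain)
cfg_bom = parse_config(with_bom)

print(b"message" in cfg_plain)  # True
print(b"message" in cfg_bom)    # False
print(list(cfg_bom))            # the key now starts with the three BOM bytes
```

A consumer that knows about the problem can strip the BOM first, for example
by decoding with Python's "utf-8-sig" codec, but the naive parser has no idea
it should.
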
Any XML tool must definitely cope with the presence of a BOM.  But then an XML
file without an explicitly specified encoding and without a BOM must be UTF-8
encoded anyway.  So as you already said, the BOM helps non-specialized editors.
Right, but personally I've had those invisible buggers bite me several times 
while they never served me any good ;)
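
For the XML side, a small sketch (again Python, using the standard library's
ElementTree; the <config>/<message> document is made up for illustration and
is not the real config.xml) suggests that a conforming parser yields the same
result with or without the BOM:

```python
import codecs
import xml.etree.ElementTree as ET

# Hypothetical document: no XML declaration, so UTF-8 is the default encoding.
doc = "<config><message>Içh ßiñ €in Täst ☺</message></config>".encode("utf-8")

root_plain = ET.fromstring(doc)
root_bom = ET.fromstring(codecs.BOM_UTF8 + doc)

# Both variants should parse to the same element tree content.
print(root_plain.findtext("message") == root_bom.findtext("message"))
```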

{Gosh, I do feel like ranting. It's not directed at you, but it should
emphasize why I consider adding a BOM a valid but unfortunate "fix":}
<rant>
I'd assert that BOMs are a kludge that should be used very sparingly.  In
fact, as the byte order is unambiguous in UTF-8, the BOM that is customary and
necessary with UTF-16/UCS-2 degenerates into a mere flag marking the text as
Unicode.  What an epic fail ;) Anything non-UTF-8 should be flagged as "Danger:
obsolete encoding inside" instead, and those ISO-8859-*, WIN125* and whatnot
should go die in a fire.  AFAIK among all current OSs, only Windows still
doesn't default to UTF-8 in text files.  Plus MS has this evil habit of
assuming Unicode = UCS-2, which totally breaks ASCII compatibility, breaks all
protocols that can't deal with <NUL> bytes in text streams etc. (I've seen
Outlook Express send e-mails with UCS-2 content as text/plain, no charset
given, no transfer encoding and all those lovely <NUL>s inside...) Yeah, UCS-2
is a Unicode encoding and fine for internal processing (if all you need is the
BMP).  But when serializing to a file, UTF-8 is the way to go, no BOMs needed.
</rant>
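
To back up the <NUL> complaint, a tiny Python sketch.  The sample string is
arbitrary, and Python's UTF-16 codecs stand in for UCS-2 here, which is
equivalent as long as the text stays within the BMP:

```python
import codecs

text = "key=value"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # no BOM, little-endian
utf16_bom = text.encode("utf-16")    # this codec prepends a BOM

# Pure ASCII text is byte-identical in UTF-8.
print(utf8 == text.encode("ascii"))

# In UTF-16 every ASCII character drags a NUL byte along.
print(utf16_le.count(b"\x00"))

# With UTF-16 the BOM actually carries information: the byte order.
print(utf16_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```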

Kind regards,
Malte


