XML Manifest contains invalid characters

Don Rylander-3

XML Manifest contains invalid characters

When the directory path to a deployed application contains certain unusual
characters (e.g., $« or $» in the Windows Western character set), the XML
manifest generated during deployment is not readable by IXMLDOMDocument.
Opening the file in Scintilla shows some special characters in place of the
ones in the path, although PFE and Notepad have no problem with it.

Manually editing the file in Scintilla by removing the special characters,
and replacing them with the path as it exists (including the "unusual"
characters) allows the Executable Browser (and Scintilla) to read the
manifest file properly.

A text comparison with BeyondCompare shows the edited and unedited versions
of the file are the same. A binary comparison, however, shows that
re-editing them in Scintilla apparently causes the "unusual" characters to
be preceded (escaped?) by a character with a code point of 194. This
character does show up when viewing the file with PFE, but Unicode-aware
editors don't display it.

I don't know whether there's anything that should be done (except for
redoing my directory structure =8^0), but maybe something isn't quite to
spec.

Thanks,

Don

Chris Uppal-3

Re: XML Manifest contains invalid characters

Don,

> When the directory paths to a deployed application contains certain
> unusual characters (e.g., $« or $» in the Windows Western character set),
> the XML manifest generated during deployment is not readable by
> IXMLDOMDocument. Opening the file in Scintilla shows some special
> characters in place of the ones in the path, although PFE and Notepad
> have no problem with it.

I believe that this is a Unicode issue, and that it is a bug which will affect
anyone who uses characters in filenames which are not in the 7-bit ASCII range
(whatever code page they use).

The ImageStripper is writing the XML log file as if it were an ordinary file
containing 8-bit characters in the user's current character set. Actually XML
is usually written in one or another form of Unicode, defaulting to UTF-8. If
it is not in UTF-8 or UTF-16 then it is /required/ to start with an Encoding
Declaration. (Note that XML parsers are not required to be able to read
formats other than UTF-8 and -16).

Consider the String:
'«»'
That consists of Characters with "code points":
171 and 187
so that's what would be written to the file. However, in Unicode the
corresponding characters have code points:
U+00AB and U+00BB
(which are actually 171 and 187 as it happens, but that's just a coincidence).
When written as UTF-8 they take 2 bytes each, and become the byte sequence:
0xC2 0xAB 0xC2 0xBB
or, in decimal:
194 171 194 187
It looks as if the "special" characters are being escaped by prefixing them
with 194, but actually that's misleading, it's just a coincidence (the same one
as before) that the UTF-8 encoding takes that form for these characters.
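
The byte values above can be checked with a short sketch. It is written in Python rather than Dolphin Smalltalk, purely for illustration, and it assumes that Python's "cp1252" codec is a fair stand-in for the Windows Western code page; the helper name as_decimal is made up for this example.

```python
# Illustration only: "cp1252" stands in for the Windows Western code page.
text = "\u00ab\u00bb"  # the two guillemet characters from the path

ansi_bytes = text.encode("cp1252")  # one byte per character (8-bit write)
utf8_bytes = text.encode("utf-8")   # two bytes per character in UTF-8


def as_decimal(data):
    """Return the byte values as a list of decimal integers."""
    return list(data)


print(as_decimal(ansi_bytes))  # [171, 187]
print(as_decimal(utf8_bytes))  # [194, 171, 194, 187]
```

The leading 194 (0xC2) is simply the first byte of the two-byte UTF-8 form for code points U+0080 through U+00BF, which is why it looks like an escape prefix.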

So the ImageStripper is creating an invalid XML file, and the XML parser is
(correctly) refusing to have anything to do with it. (It is required to treat
it as a "fatal error")

Now if you open the file in Scintilla, I suspect that it will discover that the
content isn't valid UTF-8 and revert to assuming it's just 8-bit data in the
current code page, and will display it as such. But then, when you write it
out again, because it knows that XML is UTF-8, it'll write it out as such, thus
converting 171 to 194 171.
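
The suspected round trip can be mimicked in a few lines of Python. This is only a guess at the editor's behaviour, as the paragraph above says; the fallback to cp1252 and the function name load_text are assumptions made for the sketch, not anything taken from Scintilla.

```python
def load_text(raw):
    """Decode as UTF-8 when the bytes are valid UTF-8, else assume cp1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: treat the data as 8-bit text in the code page.
        return raw.decode("cp1252")


original = bytes([171, 187])     # bytes as written by the 8-bit code path
shown = load_text(original)      # what the editor would display
resaved = shown.encode("utf-8")  # what a save-as-UTF-8 would write

print(list(resaved))  # [194, 171, 194, 187]
```

A lone 171 cannot start a UTF-8 sequence, so the first decode fails and the fallback runs; saving then produces the 194-prefixed bytes seen in the binary comparison.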

As to cures:

One would be for the ImageStripper to provide a correct Encoding Declaration
for the user's code page. In your case I think that should be
<?xml version="1.0" encoding="ISO-8859-1"?>
at the start of the XML file (see ImageStripper>>openLogFile). I'm not sure
where to get the correct code page name from (it's available somewhere!).
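
The effect of the declaration can be sketched with Python's standard XML parser instead of IXMLDOMDocument. The element name file, the attribute path, and the helper parse_with are hypothetical; only the two encoding declarations come from this discussion.

```python
import xml.etree.ElementTree as ET

# 8-bit bytes as a code-page write would produce them (0xAB and 0xBB).
body = b'<file path="\xabdemo\xbb"/>'


def parse_with(encoding):
    """Prefix the body with an XML declaration naming the given encoding."""
    decl = b'<?xml version="1.0" encoding="' + encoding.encode("ascii") + b'"?>'
    return ET.fromstring(decl + body)


try:
    parse_with("UTF-8")
    utf8_ok = True
except ET.ParseError:
    utf8_ok = False  # the bytes are not valid UTF-8, so parsing fails

latin1_path = parse_with("ISO-8859-1").get("path")

print(utf8_ok)      # False
print(latin1_path)  # «demo»
```

The same bytes are rejected under a UTF-8 declaration and accepted under ISO-8859-1, which matches the behaviour described for the manifest.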

Another would be to change the ImageStripper so that it wrote UTF-8 (as it
claims to do).

A third, which /might/ work, would be to tell the XML parser to ignore the
embedded declaration and treat the text as if it were in the current code page.
That's not a particularly general fix, and anyway I don't know whether it is possible
to do it with an IXMLDOMDocument.

-- chris

Don Rylander-3

Re: XML Manifest contains invalid characters

Chris,
"Chris Uppal" <[hidden email]> wrote in message
news:4402e698$0$1176$[hidden email]...
> Don,
[...]
> I believe that this is a Unicode issue, and that it is a bug which will
> affect anyone who uses characters in filename which are not in the 7-bit
> ASCII range (whatever code page they use).
I think you've hit on it.

> The ImageStripper is writing the XML log file as if it were an ordinary
> file containing 8-bit characters in the user's current character set.
> Actually XML is usually written in one or another form of Unicode,
> defaulting to UTF-8. If it is not in UTF-8 or UTF-16 then it is /required/
> to start with an Encoding Declaration. (Note that XML parsers are not
> required to be able to read formats other than UTF-8 and -16).

What makes it all the more interesting is opening it in SciTE (which I
erroneously called Scintilla in my original post). SciTE apparently
chooses its display character set based on the encoding specified in the
XML header. Setting the encoding to UTF-16 or removing the explicit
encoding (which I believe defaults to the better of UTF-8 or -16) both allow
SciTE to interpret the characters as entered. IXMLDOMDocument, on the
other hand, was not deceived by my feeble efforts. No encoding in the XML
header produced the same "invalid character" error as UTF-8 encoding, while
changing encoding to UTF-16 gave "Switch from current encoding to specified
encoding not supported." (Viewing the files in Internet Explorer gives the
same errors.)

[...]
> As to cures:
>
> One would be for the ImageStripper to provide a correct Encoding
> Declaration for the user's code page. In your case I think that should be
> <?xml version="1.0" encoding="ISO-8859-1"?>
> at the start of the XML file (see ImageStripper>>openLogFile). I'm not
> sure where to get the correct code page name from (it's available
> somewhere!).
This works fine just by changing the encoding statement in the header,
unlike changing it to UTF-16 or no explicit encoding. It's a simple hack
for me, but there's got to be a way to get ISO encoding based on a user's
Locale. How hard can it be? ;^)

>
> Another would be to change the ImageStripper so that it wrote UTF-8 (as it
> claims to do).
It looks like this might entail digging into the guts of FileStream. That
would scare me off.

>
> A third, which /might/ work, would be to tell the XML parser to ignore the
> embedded declaration and treat the text as if it were in the current code
> page. That's not a particularly general fix, and anyway I don't know it is
> possible to do it with an IXMLDOMDocument.
A cursory check of MSDN suggests this approach isn't easy enough for my
purposes.

Thanks for taking the time to reply, Chris. I think I'll check into the
Locale business to see how easy a more generalized solution might be.

Don

Chris Uppal-3

Re: XML Manifest contains invalid characters

Don Rylander wrote:

> > <?xml version="1.0" encoding="ISO-8859-1"?>
[...]
> This works fine just by changing the encoding statement in the header,
> unlike changing it to UTF-16 or no explicit encoding. It's a simple hack
> for me, but there's got to be a way to get ISO encoding based on a user's
> Locale. How hard can it be? ;^)

I spent some time looking for a way. If it's there I can't find it. All you
can do is find the current code-page ID (GetACP() in kernel32.dll). On my
system that answers 1252 (which I believe to be either identical to, or very
similar to, ISO-8859-1). The problems are that there's no way to get a sensible
string name for it (GetCPInfoEx() will get you a string, but it's not useful),
and that converting a given ID into a /standard/ encoding ID (or name) is
essentially impossible. It would probably be possible to create a mapping
table, but...
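
A mapping table of that kind might look like the following Python sketch. The table is hand-written and deliberately partial, the names CODE_PAGE_TO_IANA and xml_declaration are invented for the example, and whether a particular XML parser accepts each charset name is not guaranteed.

```python
# Partial, hand-written table: Windows code page ID -> IANA charset name.
CODE_PAGE_TO_IANA = {
    1250: "windows-1250",
    1251: "windows-1251",
    1252: "windows-1252",
    28591: "ISO-8859-1",
    65001: "UTF-8",
}


def xml_declaration(code_page):
    """Build an XML declaration for a code page, if its name is known."""
    name = CODE_PAGE_TO_IANA.get(code_page)
    if name is None:
        raise LookupError("no charset name known for code page %d" % code_page)
    return '<?xml version="1.0" encoding="%s"?>' % name


print(xml_declaration(1252))
```

On Windows the code page ID itself could be read with ctypes.windll.kernel32.GetACP(); that call is left out here so the sketch runs on any platform.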

Of course, these difficulties with code pages are why Unicode was invented in
the first place.

> > Another would be to change the ImageStripper so that it wrote UTF-8 (as
> > it claims to do).
> It looks like this might entail digging into the guts of FileStream. That
> would scare me off.

Steve Waring (a long time ago) posted a little utility method for converting
UnicodeStrings into UTF-8. With that (appended) and:
=============
String>>asUTF8String
^ self asUnicodeString asUTF8String.
=============
it shouldn't be too hard, or interfere with stripping too much, to convert
everything to UTF-8 before writing it to the file. Of course the best thing
would be if Dolphin had proper Unicode handling, but I don't expect that
anytime soon ;-) (I do have a working, but as yet incomplete, set of Unicode
Strings and Streams, but it's a pretty large package, and would add rather a
lot of bloat to the target application if that app didn't happen to need
Unicode support anyway.)

-- chris

======= code by Steve Waring =========
!UnicodeString methodsFor!
asUTF8String
    "Answer a byte string representation of the receiver.
    -Not supported in Win95, but should work in Win98"
    | buf size bytes |
    size := self size.
    buf := String new: size+size+size.
    size == 0 ifTrue: [^buf]. "Avoid 'The Parameter is Incorrect' error"
    bytes := KernelLibrary default
        wideCharToMultiByte: 65001 "CP_UTF8"
        dwFlags: 0
        lpWideCharStr: self
        cchWideChar: size
        lpMultiByteStr: buf
        cchMultiByte: buf size
        lpDefaultChar: nil
        lpUsedDefaultChar: nil.
    bytes == 0 ifTrue: [^KernelLibrary default systemError].
    buf resize: bytes.
    ^buf! !
!UnicodeString categoriesFor: #asUTF8String!converting!public! !
====================================
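
For comparison, the convert-before-writing idea can be sketched in Python, where the conversion is a single encode call. The element name file, the attribute path, and the function manifest_bytes are hypothetical and do not reflect the real manifest layout.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def manifest_bytes(path):
    """Build a tiny manifest-like document and encode it as UTF-8."""
    # quoteattr escapes &, < and quotes, and adds the surrounding quotes.
    text = '<?xml version="1.0" encoding="UTF-8"?><file path=%s/>' % quoteattr(path)
    return text.encode("utf-8")


data = manifest_bytes("\u00abdemo\u00bb")
parsed = ET.fromstring(data)  # parses, because the bytes really are UTF-8

print(parsed.get("path"))  # «demo»
```

Because the bytes now match the declared encoding, the parser reads the path back unchanged.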

Don Rylander-3

Re: XML Manifest contains invalid characters

"Chris Uppal" <[hidden email]> wrote in message
news:44041e35$2$1175$[hidden email]...

> Don Rylander wrote:
>
>> > <?xml version="1.0" encoding="ISO-8859-1"?>
> [...]
>> This works fine just by changing the encoding statement in the header,
>> unlike changing it to UTF-16 or no explicit encoding. It's a simple hack
>> for me, but there's got to be a way to get ISO encoding based on a user's
>> Locale. How hard can it be? ;^)
>
> I spent some time looking for a way. If it's there I can't find it. All
> you can do is find the current code-page ID (GetACP() in kernel32.dll). On
> my system that answers 1252 (which I believe to be either identical to, or
> very similar to, ISO-8859-1). The problems are that there's no way to get a
> sensible string name for it (GetCPInfoEx() will get you a string, but it's
> not useful), and that converting a given ID into a /standard/ encoding ID
> (or name) is essentially impossible. It would probably be possible to
> create a mapping table, but...

I came to the same conclusion about GetCPInfo and GetCPInfoEx; they get you
really close, but you still need some reference table to look up the string
name of the numeric code page. By the way, MSDN indicates that Windows code
page 1252 *is* the same as Latin 1 and ISO-8859-1
(http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp).
I'd seen other references that implied there might be differences, but if
Windows views them identically, who am I to differ. ;^)

IXMLDocument2 knows its character set (on my system, IXMLDocument2 new
charset = 'UTF-8'), but I haven't found a way to do anything useful with
that.

[...]

>> > Another would be to change the ImageStripper so that it wrote UTF-8 (as
>> > it claims to do).
>> It looks like this might entail digging into the guts of FileStream.
>> That would scare me off.
>
> Steve Waring (a long time ago) posted a little utility method for
> converting UnicodeStrings into UTF-8. With that (appended) and:
> =============
> String>>asUTF8String
> ^ self asUnicodeString asUTF8String.
> =============
> it shouldn't be too hard, or interfere with stripping too much, to convert
> everything to UTF-8 before writing it to the file. Of course the best
> thing would be if Dolphin had proper Unicode handling, but I don't expect
> that anytime soon ;-) (I do have a working, but as yet incomplete, set of
> Unicode Strings and Streams, but it's a pretty large package, and would add
> rather a lot of bloat to the target application if that app didn't happen
> to need Unicode support anyway.)
>
> -- chris

This could be useful (I'm still amazed at what Steve was able to get done
when he was first learning Dolphin!), but given that (a) my problem (which
nobody else seems to have!) is solved by changing the encoding to ISO-8859-1
in ImageStripper>>openLogFile, and (b) Unicode support is evolving in both
Dolphin and Windows (the .NET and Vista stuff on MSDN seems much more
comprehensive), I'm starting to think my time would be better spent
elsewhere.

I suppose this could affect anyone who uses characters beyond 7-bit ASCII,
but maybe others have been more sensible in naming things. Thanks again for
spending time on this, Chris.

Don

Chris Uppal-3

Re: XML Manifest contains invalid characters

Don,

> By the way, MSDN indicates that Windows code
> page 1252 *is* the same as Latin 1 and ISO-8859-1
> (http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp).
> I'd seen other references that implied there might be differences, but if
> Windows views them identically, who am I to differ. ;^)

They aren't /quite/ the same. Or rather, if you tell Windows to convert a
String consisting of Characters 0 through 255 into Unicode, then it gives you
slightly different results depending on whether you tell it that the source
data is 1252 or 8859-1. The only significant difference is that the euro
symbol ( = U+20AC) is at position 128 in 1252, while that position is
apparently not mapped in their version of 8859-1. I don't know what the
official spec calls for.

If you use that character in a file declared to be "ISO-8859-1", the XML parser
appears to replace it with '?'.

A little more investigation, and it seems that MS's parser recognises
"windows-1252" as a document encoding, which allows it to parse the euro. I
don't suppose all that many other XML processors would accept that name, even
though it does turn out to be registered with IANA ;-) (The other 125x code
page numbers are registered in that form too, perhaps they would work as well.)
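
The difference at position 128 can be seen with Python's codecs. This is a sketch only: Python's "cp1252" and "latin-1" tables are assumed to be close to, but are not necessarily identical with, the conversion tables Windows or the MS parser use.

```python
euro_byte = bytes([128])

as_1252 = euro_byte.decode("cp1252")     # the euro sign, U+20AC
as_latin1 = euro_byte.decode("latin-1")  # U+0080, a control character

try:
    "\u20ac".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False  # ISO-8859-1 has no euro sign at all

print(hex(ord(as_1252)), hex(ord(as_latin1)), euro_in_latin1)
```

In Python's tables the two encodings agree from 160 to 255, so the guillemets at 171 and 187 are unaffected; the disagreement is confined to the 128 to 159 range.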

> (I'm still amazed at what Steve was able to get done
> when he was first learning Dolphin!),

Talented bloke.

> but given that (a) my problem (which nobody else seems to have!) is solved
> by changing the encoding to ISO-8859-1 in ImageStripper>>openLogFile, and
> (b) Unicode support is evolving in both Dolphin and Windows (the .NET and
> Vista stuff on MSDN seems much more comprehensive), I'm starting to think
> my time would be better spent elsewhere.

I'm not so sure that it actually /is/ evolving all that much in Dolphin. At
least we have a text widget that'll display UTF-8 correctly now (according to
the Scintilla documentation).

> Thanks again for spending time on this, Chris.

You're welcome. All this codepage muck was on my list of things to put off
anyway, so this little investigation just bumped it up the queue a bit. Got me
over a hump that was holding up my Unicode stuff too, so I should thank you for
raising the subject ;-)

-- chris