When the directory paths to a deployed application contain certain unusual
characters (e.g., $« or $» in the Windows Western character set), the XML manifest generated during deployment is not readable by IXMLDOMDocument. Opening the file in Scintilla shows some special characters in place of the ones in the path, although PFE and Notepad have no problem with it.

Manually editing the file in Scintilla by removing the special characters, and replacing them with the path as it exists (including the "unusual" characters), allows the Executable Browser (and Scintilla) to read the manifest file properly.

A text comparison with BeyondCompare shows the edited and unedited versions of the file are the same. A binary comparison, however, shows that re-editing them in Scintilla apparently causes the "unusual" characters to be preceded (escaped?) by a character with a code point of 194. This character does show up when viewing the file with PFE, but Unicode-aware editors don't display it.

I don't know whether there's anything that should be done (except for redoing my directory structure =8^0), but maybe something isn't quite to spec.

Thanks,

Don
Don,
> When the directory paths to a deployed application contain certain
> unusual characters (e.g., $« or $» in the Windows Western character set),
> the XML manifest generated during deployment is not readable by
> IXMLDOMDocument. Opening the file in Scintilla shows some special
> characters in place of the ones in the path, although PFE and Notepad
> have no problem with it.

I believe that this is a Unicode issue, and that it is a bug which will affect anyone who uses characters in filenames which are not in the 7-bit ASCII range (whatever code page they use).

The ImageStripper is writing the XML log file as if it were an ordinary file containing 8-bit characters in the user's current character set. Actually, XML is usually written in one or another form of Unicode, defaulting to UTF-8. If it is not in UTF-8 or UTF-16, then it is /required/ to start with an Encoding Declaration. (Note that XML parsers are not required to be able to read formats other than UTF-8 and -16.)

Consider the String:

    '«»'

That consists of Characters with "code points" 171 and 187, so that's what would be written to the file. However, in UTF-8 the corresponding Unicode characters have code points U+AB and U+BB (which are actually 171 and 187 as it happens, but that's just a coincidence). When written as UTF-8 they take 2 bytes each, and become the byte sequence:

    0xC2 0xAB 0xC2 0xBB

or, in decimal:

    194 171 194 187

It looks as if the "special" characters are being escaped by prefixing them with 194, but actually that's misleading; it's just a coincidence (the same one as before) that the UTF-8 encoding takes that form for these characters.

So the ImageStripper is creating an invalid XML file, and the XML parser is (correctly) refusing to have anything to do with it.
(It is required to treat it as a "fatal error".)

Now if you open the file in Scintilla, I suspect that it will discover that the content isn't valid UTF-8 and revert to assuming it's just 8-bit data in the current code page, and will display it as such. But then, when you write it out again, because it knows that XML is UTF, it'll write it out as such, thus converting 171 to 194 171.

As to cures:

One would be for the ImageStripper to provide a correct Encoding Declaration for the user's code page. In your case I think that should be

    <?xml version="1.0" encoding="ISO-8859-1"?>

at the start of the XML file (see ImageStripper>>openLogFile). I'm not sure where to get the correct code page name from (it's available somewhere!).

Another would be to change the ImageStripper so that it wrote UTF-8 (as it claims to do).

A third, which /might/ work, would be to tell the XML parser to ignore the embedded declaration and treat the text as if it were in the current code page. That's not a particularly general fix, and anyway I don't know whether it is possible to do it with an IXMLDOMDocument.

-- chris
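[Editorial aside: the byte arithmetic described above is easy to verify. A small Python sketch, used here purely for illustration since the thread's own code is Smalltalk:]

```python
# The two characters from the problem path, as discussed above.
s = "\u00ab\u00bb"  # '«»'

# Stored as raw 8-bit Latin-1/Windows-1252 data they are single bytes
# 171 and 187, which is what the ImageStripper writes...
latin1_bytes = s.encode("latin-1")
print(list(latin1_bytes))   # [171, 187]

# ...but encoded as UTF-8 each character takes two bytes, with a 0xC2
# (decimal 194) lead byte.
utf8_bytes = s.encode("utf-8")
print(list(utf8_bytes))     # [194, 171, 194, 187]

# Hence the apparent "escaping" with code point 194: it is simply the
# UTF-8 lead byte for characters in the U+0080..U+00BF range.
```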
Chris,
"Chris Uppal" <[hidden email]> wrote in message news:4402e698$0$1176$[hidden email]... > Don, [...] > I believe that this is a Unicode issue, and that it is a bug which will > affect > anyone who uses characters in filename which are not in the 7-bit ASCII > range > (whatever code page they use). I think you've hit on it. > The ImageStripper is writing the XML log file as if it were a an ordinary > file > containing 8-bit characters in the user's current character set. Actually > XML > is usually written in one or another form of Unicode, defaulting to UTF-8. > If > it is not in UTF-8 or UTF-16 then it is /required/ to start with an > Encoding > Declaration. (Note that XML parsers are not required to be able to read > formats other than UTF-8 and -16). erroneously called Scintilla in my original post). Scite apparently interprets its display character set based on the encoding specified in the XML header. Setting the encoding to UTF-16 or removing the explicit encoding (which I believe defaults to the better of UTF-8 or -16) both allow Scite to interpret the characters as entered. IXMLDOMDocument, on the other hand, was not deceived by my feeble efforts. No encoding in the XML header produced the same "invalid character" error as UTF-8 encoding, while changing encoding to UTF-16 gave "Switch from current encoding to specified encoding not supported." (Viewing the files in InternetExplorer gives the same errors.) [...] > As to cures: > > One would be for the ImageStripper to provide a correct Encoding > Declaration > for the user's code page. In your case I think that should be > <?xml version="1.0" encoding="ISO-8859-1"?> > at the start of the XML file (see ImageStripper>>openLogFile). I'm not > sure > where to get the correct code page name from (it's available somewhere!). This works fine just by changing the encoding statement in the header, unlike changing it to UTF-16 or no explicit encoding. 
It's a simple hack for me, but there's got to be a way to get the ISO encoding based on a user's Locale. How hard can it be? ;^)

> Another would be to change the ImageStripper so that it wrote UTF-8 (as
> it claims to do).

It looks like this might entail digging into the guts of FileStream. That would scare me off.

> A third, which /might/ work, would be to tell the XML parser to ignore
> the embedded declaration and treat the text as if it were in the current
> code page. That's not a particularly general fix, and anyway I don't
> know whether it is possible to do it with an IXMLDOMDocument.

A cursory check of MSDN suggests this approach isn't easy enough for my purposes.

Thanks for taking the time to reply, Chris. I think I'll check into the Locale business to see how easy a more generalized solution might be.

Don
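[Editorial aside: the behavior Don reports—parser rejection without a declaration, success with an explicit ISO-8859-1 declaration—can be reproduced with any conforming XML parser. A Python sketch; the manifest element names are invented for illustration, not the actual ImageStripper output:]

```python
import xml.etree.ElementTree as ET

# A log file written as raw 8-bit Latin-1 bytes, as the ImageStripper does.
path_text = "C:\\Apps\\«Deployed»\\app.exe"
body = "<manifest><path>%s</path></manifest>" % path_text

# Without a declaration the parser must assume UTF-8, and the lone bytes
# 0xAB/0xBB are not valid UTF-8 sequences, so parsing is a fatal error.
bad = body.encode("latin-1")
try:
    ET.fromstring(bad)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False
print(parsed_without_declaration)  # False

# With the suggested declaration, the very same bytes parse fine.
good = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("latin-1")
tree = ET.fromstring(good)
print(tree.find("path").text == path_text)  # True
```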
Don Rylander wrote:
> > <?xml version="1.0" encoding="ISO-8859-1"?>
[...]
> This works fine just by changing the encoding statement in the header,
> unlike changing it to UTF-16 or no explicit encoding. It's a simple hack
> for me, but there's got to be a way to get ISO encoding based on a
> user's Locale. How hard can it be? ;^)

I spent some time looking for a way. If it's there I can't find it. All you can do is find the current code-page ID (GetACP() in kernel32.dll). On my system that answers 1252 (which I believe to be either identical to, or very similar to, ISO-8859-1). The problems are that there's no way to get a sensible string name for it (GetCPInfoEx() will get you a string, but it's not useful), and that converting a given ID into a /standard/ encoding ID (or name) is essentially impossible. It would probably be possible to create a mapping table, but...

Of course, these difficulties with code pages are why Unicode was invented in the first place.

> > Another would be to change the ImageStripper so that it wrote UTF-8
> > (as it claims to do).
> It looks like this might entail digging into the guts of FileStream.
> That would scare me off.

Steve Waring (a long time ago) posted a little utility method for converting UnicodeStrings into UTF-8. With that (appended) and:

=============
String>>asUTF8String
    ^self asUnicodeString asUTF8String.
=============

it shouldn't be too hard, or interfere with stripping too much, to convert everything to UTF-8 before writing it to the file.

Of course the best thing would be if Dolphin had proper Unicode handling, but I don't expect that anytime soon ;-) (I do have a working, but as yet incomplete, set of Unicode Strings and Streams, but it's a pretty large package, and would add rather a lot of bloat to the target application if that app didn't happen to need Unicode support anyway.)

-- chris

======= code by Steve Waring =========
!UnicodeString methodsFor!

asUTF8String
    "Answer a byte string representation of the receiver.
    -Not supported in Win95, but should work in Win98"

    | buf size bytes |
    size := self size.
    buf := String new: size + size + size.
    size == 0 ifTrue: [^buf]. "Avoid 'The Parameter is Incorrect' error"
    bytes := KernelLibrary default
        wideCharToMultiByte: 65001 "CP_UTF8"
        dwFlags: 0
        lpWideCharStr: self
        cchWideChar: size
        lpMultiByteStr: buf
        cchMultiByte: buf size
        lpDefaultChar: nil
        lpUsedDefaultChar: nil.
    bytes == 0 ifTrue: [^KernelLibrary default systemError].
    buf resize: bytes.
    ^buf! !

!UnicodeString categoriesFor: #asUTF8String!converting!public! !
====================================
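[Editorial aside: the second cure—converting the log text to UTF-8 bytes before writing, which is what Steve's method does via WideCharToMultiByte(CP_UTF8, ...)—can be sketched outside Smalltalk. A Python illustration; the helper name and manifest shape are invented for the example:]

```python
import xml.etree.ElementTree as ET

def manifest_as_utf8(path_text):
    """Hypothetical helper: build the log document and return it as the
    UTF-8 bytes that should actually be written to the file."""
    body = ('<?xml version="1.0" encoding="UTF-8"?>'
            "<manifest><path>%s</path></manifest>" % path_text)
    return body.encode("utf-8")

# With the content genuinely encoded as UTF-8, the declaration (or the
# UTF-8 default) is now truthful, so a conforming parser accepts it.
data = manifest_as_utf8("C:\\Apps\\«Deployed»\\app.exe")
tree = ET.fromstring(data)
print(tree.find("path").text)  # C:\Apps\«Deployed»\app.exe
```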
"Chris Uppal" <[hidden email]> wrote in message
news:44041e35$2$1175$[hidden email]...
> Don Rylander wrote:
>
>> > <?xml version="1.0" encoding="ISO-8859-1"?>
> [...]
>> This works fine just by changing the encoding statement in the header,
>> unlike changing it to UTF-16 or no explicit encoding. It's a simple
>> hack for me, but there's got to be a way to get ISO encoding based on a
>> user's Locale. How hard can it be? ;^)
>
> I spent some time looking for a way. If it's there I can't find it. All
> you can do is find the current code-page ID (GetACP() in kernel32.dll).
> On my system that answers 1252 (which I believe to be either identical
> to, or very similar to, ISO-8859-1). The problems are that there's no
> way to get a sensible string name for it (GetCPInfoEx() will get you a
> string, but it's not useful), and that converting a given ID into a
> /standard/ encoding ID (or name) is essentially impossible. It would
> probably be possible to create a mapping table, but...

That gets you really close, but you still need some reference table to look up the string name of the numeric code page.

By the way, MSDN indicates that Windows code page 1252 *is* the same as Latin 1 and ISO-8859-1 (http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp). I'd seen other references that implied there might be differences, but if Windows views them identically, who am I to differ. ;^)

IXMLDocument2 knows its character set (on my system, IXMLDocument2 new charset = 'UTF-8'), but I haven't found a way to do anything useful with that.

[...]

>> > Another would be to change the ImageStripper so that it wrote UTF-8
>> > (as it claims to do).
>> It looks like this might entail digging into the guts of FileStream.
>> That would scare me off.
>
> Steve Waring (a long time ago) posted a little utility method for
> converting UnicodeStrings into UTF-8. With that (appended) and:
> =============
> String>>asUTF8String
>     ^self asUnicodeString asUTF8String.
> =============
> it shouldn't be too hard, or interfere with stripping too much, to
> convert everything to UTF-8 before writing it to the file. Of course the
> best thing would be if Dolphin had proper Unicode handling, but I don't
> expect that anytime soon ;-) (I do have a working, but as yet
> incomplete, set of Unicode Strings and Streams, but it's a pretty large
> package, and would add rather a lot of bloat to the target application
> if that app didn't happen to need Unicode support anyway.)
>
> -- chris

That looks doable (I'm still amazed at what Steve was able to get done when he was first learning Dolphin!), but given that (a) my problem (which nobody else seems to have!) is solved by changing the encoding to ISO-8859-1 in ImageStripper>>openLogFile, and (b) Unicode support is evolving in both Dolphin and Windows (the .NET and Vista stuff on MSDN seems much more comprehensive), I'm starting to think my time would be better spent elsewhere. I suppose this could affect anyone who uses characters beyond 7-bit ASCII, but maybe others have been more sensible in naming things.

Thanks again for spending time on this, Chris.

Don

> ======= code by Steve Waring =========
> !UnicodeString methodsFor!
>
> asUTF8String
>     "Answer a byte string representation of the receiver.
>     -Not supported in Win95, but should work in Win98"
>
>     | buf size bytes |
>     size := self size.
>     buf := String new: size + size + size.
>     size == 0 ifTrue: [^buf]. "Avoid 'The Parameter is Incorrect' error"
>     bytes := KernelLibrary default
>         wideCharToMultiByte: 65001 "CP_UTF8"
>         dwFlags: 0
>         lpWideCharStr: self
>         cchWideChar: size
>         lpMultiByteStr: buf
>         cchMultiByte: buf size
>         lpDefaultChar: nil
>         lpUsedDefaultChar: nil.
>     bytes == 0 ifTrue: [^KernelLibrary default systemError].
>     buf resize: bytes.
>     ^buf! !
> !UnicodeString categoriesFor: #asUTF8String!converting!public! !
> ====================================
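[Editorial aside: the "reference table" Don mentions—mapping the numeric code-page ID from GetACP() to a charset name usable in an XML encoding declaration—only needs a handful of entries in practice. A hypothetical, deliberately incomplete sketch in Python:]

```python
# Illustrative mapping from Windows ANSI code-page IDs (as returned by
# GetACP()) to charset names accepted in XML encoding declarations.
# The table is hypothetical and incomplete; a real one would cover all
# the ANSI code pages the application might run under.
CODEPAGE_TO_CHARSET = {
    1250: "windows-1250",
    1251: "windows-1251",
    1252: "ISO-8859-1",   # near-identical to 1252, per the discussion above
    1253: "windows-1253",
    65001: "UTF-8",
}

def xml_declaration(codepage):
    # Fall back to UTF-8 for unknown IDs rather than emit a bogus name.
    charset = CODEPAGE_TO_CHARSET.get(codepage, "UTF-8")
    return '<?xml version="1.0" encoding="%s"?>' % charset

print(xml_declaration(1252))  # <?xml version="1.0" encoding="ISO-8859-1"?>
```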
Don,
> By the way, MSDN indicates that Windows code page 1252 *is* the same as
> Latin 1 and ISO-8859-1
> (http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp).
> I'd seen other references that implied there might be differences, but
> if Windows views them identically, who am I to differ. ;^)

They aren't /quite/ the same. Or rather, if you tell Windows to convert a String consisting of Characters 0 through 255 into Unicode, then it gives you slightly different results depending on whether you tell it that the source data is 1252 or 8859-1. The only significant difference is that the euro symbol (€ = U+20AC) is at position 128 in 1252, while that position is apparently not mapped in their version of 8859-1. I don't know what the official spec calls for.

If you use that character in a file declared to be "ISO-8859-1", the XML parser appears to replace it with '?'. A little more investigation, and it seems that MS's parser recognises "windows-1252" as a document encoding, which allows it to parse the euro. I don't suppose all that many other XML processors would accept that name, even though it does turn out to be registered with IANA ;-) (The other 125x code page numbers are registered in that form too; perhaps they would work as well.)

> (I'm still amazed at what Steve was able to get done
> when he was first learning Dolphin!),

Talented bloke.

> but given that (a) my problem (which nobody else seems to have!) is
> solved by changing the encoding to ISO-8859-1 in
> ImageStripper>>openLogFile, and (b) Unicode support is evolving in both
> Dolphin and Windows (the .NET and Vista stuff on MSDN seems much more
> comprehensive), I'm starting to think my time would be better spent
> elsewhere.

I'm not so sure that it actually /is/ evolving all that much in Dolphin. At least we have a text widget that'll display UTF-8 correctly now (according to the Scintilla documentation).

> Thanks again for spending time on this, Chris.
You're welcome. All this codepage muck was on my list of things to put off anyway, so this little investigation just bumped it up the queue a bit. Got me over a hump that was holding up my Unicode stuff too, so I should thank you for raising the subject ;-)

-- chris
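[Editorial aside: the euro-sign discrepancy between Windows-1252 and ISO-8859-1 described above is easy to demonstrate. A Python sketch, for illustration only:]

```python
# The euro sign (U+20AC) sits at byte 0x80 (decimal 128) in Windows-1252,
# but has no mapping at all in ISO-8859-1 -- the one practically
# significant difference between the two encodings noted in the thread.
euro = "\u20ac"

print(list(euro.encode("cp1252")))   # [128]

try:
    euro.encode("latin-1")           # latin-1 == ISO-8859-1
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)                 # False
```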