I read the thread a while ago about extended character sets. I have a problem that is related but different. Can anyone help?

I'm reading information from a (computed) web URL, which may supply non-UK/US English characters (such as o-umlaut, which is the one I've hit first). I'm parsing the string by the HTML tags it contains. The read is successful, and the stream contents can be manipulated up to the point where the foreign character occurs. After that, any attempt to access the string results in a subscript-out-of-bounds error in the depths of the system. This includes any attempt to copy, print, Transcript show or even inspect, so I can't even see what's in there.

I'm therefore strongly inferring that the o-umlaut is the problem, but I can't prove it. My guess is that the system can't find the character in its character set, hence the out-of-bounds error. Other files from the same source can be read and parsed with no problem.

However, if I create the same character as part of a string constant in code, I can inspect it quite happily and it comes up OK. The same is true if I read it in from a file rather than a URL, although the character then displays as a weird jumble. So the problem may be in the way the character is read in via XML.InputSource.

I've tried to find a way to create the input stream encoded with #utf16, but it always ends up circling back to UTF-8 (I get something like EncodedStream with UTF-16 on something else on ReadStream with UTF8). I could perhaps read the URL as a byte stream or binary, but I can't figure out how to create this kind of stream on a URL.

Does anyone have code that will circumvent this problem, and where can I find it? (As a newbie, though a regular Smalltalk writer from some years ago, I don't know where the public forums are.)

Thanks
Tony Law
www.informationspan.com
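One possible explanation for the oddly rendered character seen when reading from a file is an encoding mismatch: bytes written as UTF-8 but decoded one byte per character. The sketch below is in Python rather than Smalltalk and does not use XML.InputSource or any VisualWorks class; the sample word is a made-up example. It only shows the byte-level effect and does not establish the cause of the subscript-out-of-bounds error.

```python
# Illustration only: Python stands in for the Smalltalk stream layer.
text = "K\u00f6ln"  # hypothetical sample word containing o-umlaut (U+00F6)

# In UTF-8, o-umlaut occupies two bytes: 0xC3 0xB6.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'K\xc3\xb6ln'

# Decoding with a one-byte-per-character encoding (Latin-1 here) turns
# those two bytes into two separate characters, U+00C3 and U+00B6.
misread = utf8_bytes.decode("latin-1")
print(len(text), len(misread))  # 4 5

# Decoding with the encoding the bytes were written in restores the text.
restored = utf8_bytes.decode("utf-8")
print(restored == text)  # True
```

If the jumbled output from the file consists of two characters where one o-umlaut was expected, that would be consistent with this kind of mismatch.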
That seems very odd. Could you be more specific about what you're doing to get the string? If I just get a string containing non-ANSI characters, for example

    (HttpClient get: 'http://smalltalk.ru') contents

then I get back a string with lots of Russian characters in it, and depending on the OS and installed fonts I can even see them (works nicely on a Mac, less nicely on Windows).

Do you have some reason to believe that your input string is UTF-16? UTF-8 should be able to represent any character just as well, and unless the input really is UTF-16, decoding it as UTF-16 is likely to give you some very garbled text.

At 04:29 AM 10/7/2008, Tony Law wrote:
I've been reading the thread a while ago about extended character sets. I
--
Alan Knight [|], Engineering Manager, Cincom Smalltalk
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
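The remark in the reply above about UTF-16 producing garbled text can be illustrated at the byte level. The sketch below is in Python, not Smalltalk, and the HTML fragment is a hypothetical stand-in for the page being parsed. It assumes the server sent UTF-8, which the thread does not confirm.

```python
# Illustration only: what happens when UTF-8 bytes are decoded as UTF-16.
html = "<p>K\u00f6ln</p>"    # hypothetical fragment containing o-umlaut
data = html.encode("utf-8")  # assumed wire format: UTF-8
print(len(html), len(data))  # 11 12

# UTF-16 consumes two bytes per code unit, so pairs of unrelated bytes
# are fused into single characters and the markup disappears.
garbled = data.decode("utf-16-le")
print(len(garbled))          # 6
print("<p>" in garbled)      # False

# Decoding as UTF-8 recovers the original text, tags included.
print(data.decode("utf-8") == html)  # True
```

Since the tags themselves are destroyed by the wrong decoding, a parser keyed on HTML tags would have nothing to match, which is one reason to confirm the actual encoding before forcing #utf16.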