Web input with extended character set ...


Web input with extended character set ...

Tony Law
A while ago I read the thread about extended character sets. I have a problem that is related but different. Can anyone help?

I'm reading information from a (computed) web URL, which may supply characters outside UK/US English (such as o-umlaut, the first one I've hit). I'm parsing the string by the HTML tags it contains. The read succeeds, and the stream contents can be manipulated up to the point where the foreign character occurs. After that, any attempt to access the string results in a subscript-out-of-bounds error in the depths of the system. This includes any attempt to copy, print, Transcript show or even inspect, so I can't even see what's in there. (I'm therefore strongly inferring that the o-umlaut is the problem, but I can't prove it. My guess is that the system can't find the character in its character set, hence the out-of-bounds error. Other files from the same source can be read and parsed with no problem, though.)

However ... if I create the same character as part of a string constant in code, I can inspect it quite happily and it comes up OK. The same is true if I read it in from a file rather than a URL, although the character then comes out weirdly scrambled. So the problem may be in the way the character is read in via XML.InputSource.

I've tried to find a way to create the input stream encoded as #utf16, but it always ends up circling back to UTF-8 (I get something like an EncodedStream with UTF-16 on something else on a ReadStream with UTF-8). I could perhaps read the URL as a byte or binary stream, but I can't figure out how to create that kind of stream on a URL.

Does anyone have code that will circumvent this problem, and where can I find it? (As a newbie, though a regular Smalltalk writer from some years ago, I don't know where the public forums are).

Thanks

Tony Law
www.informationspan.com

Re: [vwnc] Web input with extended character set ...

Alan Knight-2
That seems very odd. Could you be more specific about what you're doing to get the string? If I just get a string containing non-ANSI characters, for example
   (HttpClient get: 'http://smalltalk.ru') contents
then I get back a string with lots of Russian characters in it, and depending on the OS and installed fonts I can even see them (works nicely on a Mac, less nicely on Windows).

Do you have some reason to believe that your input string is UTF-16? UTF-8 can represent any character just as well, and unless the input really is UTF-16, decoding it as UTF-16 is likely to give you some very garbled text.
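
To make the garbling concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded with the wrong encoding. It is plain Python rather than VisualWorks code, and the sample string "Köln" and the helper name decode_as are assumptions chosen for illustration; nothing here is taken from the actual page being read.

```python
# Illustrative only: plain Python, not VisualWorks code.
# "Köln" is a made-up sample string containing an o-umlaut.
text = "Köln"
raw = text.encode("utf-8")  # the bytes a UTF-8 web server would send


def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes with the given encoding, replacing undecodable sequences."""
    return data.decode(encoding, errors="replace")


as_utf8 = decode_as(raw, "utf-8")       # round-trips to the original text
as_latin1 = decode_as(raw, "latin-1")   # each UTF-8 byte becomes its own character
as_utf16 = decode_as(raw, "utf-16-le")  # byte pairs are read as single code units

print(repr(as_utf8))
print(repr(as_latin1))
print(repr(as_utf16))
```

Decoded as UTF-8 the text survives intact. Decoded as Latin-1 the two bytes of the o-umlaut turn into two separate characters ("KÃ¶ln"), which may be the kind of scrambling seen when reading from a file. Decoded as UTF-16 the bytes are paired up into unrelated characters, so none of the original text is recognisable.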

At 04:29 AM 10/7/2008, Tony Law wrote:

--
View this message in context: http://www.nabble.com/Web-input-with-extended-character-set-...-tp19853590p19853590.html
Sent from the VisualWorks mailing list archive at Nabble.com.

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

--
Alan Knight [|], Engineering Manager, Cincom Smalltalk
