Working with urls that contain non latin characters


Offray Vladimir Luna Cárdenas-2
Hi,

I was ready to show a friend the Pharo web capabilities with the
classical "myString asUrl retrieveContents", but the friend gave me a
URL that contains non-Latin characters[1], and then I got a
ZnInvalidUTF8 error.

[1]
http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=

How can I process web addresses in Pharo that contain non latin
characters like the one in [1]?

Thanks,

Offray





Re: Working with urls that contain non latin characters

Herby Vojčík


Offray Vladimir Luna Cárdenas wrote on 27. 7. 2018 12:39:

> Hi,
>
> I was ready to show a friend the Pharo web capabilities with the
> classical "myString asUrl retrieveContents", but the friend gave me a
> URL that contains non-Latin characters[1], and then I got a
> ZnInvalidUTF8 error.
>
> [1]
> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>
> How can I process web addresses in Pharo that contain non latin
> characters like the one in [1]?

Maybe you can turn it into one or more additional test cases for Zinc and
submit them, so the authors (or anyone else willing) can take it from
there and fix the problem.

I also faintly remember Zinc having some problems when I worked with it,
and needing to devise workarounds for asUrl.

> Thanks,
>
> Offray


Re: Working with urls that contain non latin characters

Sven Van Caekenberghe-2
In reply to this post by Offray Vladimir Luna Cárdenas-2
Hi Offray,

> On 27 Jul 2018, at 12:39, Offray Vladimir Luna Cárdenas <[hidden email]> wrote:
>
> Hi,
>
> I was ready to show a friend the Pharo web capabilities with the
> classical "myString asUrl retrieveContents", but the friend gave me a
> URL that contains non-Latin characters[1], and then I got a
> ZnInvalidUTF8 error.
>
> [1]
> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>
> How can I process web addresses in Pharo that contain non latin
> characters like the one in [1]?

I am on holiday, so I cannot go too deep into this, but AFAIU the URL is wrong (or it assumes a specific context with a non-standard encoding).

In a URL's query part, non-ASCII data is first UTF-8 encoded, then percent encoded (this is the modern way).

I don't read Chinese, so it is hard to infer much from the original site, but I am assuming the search is for '喀什', a city called Kashgar, https://en.wikipedia.org/wiki/Kashgar_(disambiguation).

The string in question can be written as (to avoid copy/paste problems):

  String with: 21888 asCharacter with: 20160 asCharacter.

The encoding in a URL has to be:

  ZnPercentEncoder new encode: (String with: 21888 asCharacter with: 20160 asCharacter).

This gives us for example the following URL:

  'https://www.google.com/search?q=%E5%96%80%E4%BB%80' asUrl.

This parses OK, and the resulting URL object holds the correct string (already decoded):

  'https://www.google.com/search?q=%E5%96%80%E4%BB%80' asUrl queryAt: #q.

If you copy/paste that URL in your browser it should resolve to stuff about Kashgar.
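
For readers who want to cross-check the two-step encoding (UTF-8 first, then percent-encoding) outside Pharo, here is a minimal Python sketch. It uses only the standard library's urllib.parse and is an illustration, not part of Zinc; the variable names are my own.

```python
from urllib.parse import quote, unquote

# The two characters from the example above: code points 21888 and 20160.
text = chr(21888) + chr(20160)

# Step 1: UTF-8 encode the text into bytes.
utf8_bytes = text.encode("utf-8")

# Step 2: percent-encode. For str input, quote() UTF-8 encodes by default
# and then escapes every non-ASCII byte as %XX.
encoded = quote(text)

# Decoding reverses both steps: percent-decode, then UTF-8 decode.
decoded = unquote(encoded)

print(utf8_bytes.hex().upper())
print(encoded)
print(decoded == text)
```

The escaped form printed by the second line should match the query value used in the Google URL above.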

Obviously the website www.bidchance.com does something else (possibly non-standard).

HTH,

Sven

> Thanks,
>
> Offray



Re: Working with urls that contain non latin characters

Ben Coman
In reply to this post by Offray Vladimir Luna Cárdenas-2
On 27 July 2018 at 18:39, Offray Vladimir Luna Cárdenas
<[hidden email]> wrote:

> Hi,
>
> I was ready to show a friend the Pharo web capabilities with the
> classical "myString asUrl retrieveContents", but the friend gave me a
> URL that contains non-Latin characters[1], and then I got a
> ZnInvalidUTF8 error.
>
> [1]
> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>
> How can I process web addresses in Pharo that contain non latin
> characters like the one in [1]?

Just some blind digging...

A few levels down the stack is a call equivalent to...
    x := '%BF%A6%CA%B2'.
    ZnPercentEncoder new decode: x.
which fails with the same error.

In #decode: we have...
    bytes := #[191 166 202 178].
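
The same behaviour can be reproduced outside Zinc. As a rough cross-check (a Python sketch using only the standard library, not the Pharo code path), percent-decoding the value succeeds and yields those four bytes, while interpreting them as strict UTF-8 is rejected:

```python
from urllib.parse import unquote_to_bytes

x = "%BF%A6%CA%B2"

# Percent-decoding alone always succeeds: it just yields raw bytes.
raw = unquote_to_bytes(x)
print(list(raw))

# Interpreting those bytes as UTF-8 is the step that fails: 0xBF is a
# continuation byte and cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as error:
    utf8_ok = False
    print("not valid UTF-8:", error.reason)
```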

and browsing around I discovered a useful method...
    encoder := ZnCharacterEncoder detectEncoding: bytes
"==> a ZnSimplifiedByteEncoder('iso88591' strict)"

now the following works...
    (ZnPercentEncoder new characterEncoder: encoder) decode: x.


So maybe that helps explain it,
but I don't know how to join the dots to make it work out of the box
with "asUrl retrieveContents"

cheers -ben


Re: Working with urls that contain non latin characters

Sven Van Caekenberghe-2


> On 27 Jul 2018, at 13:58, Ben Coman <[hidden email]> wrote:
>
> On 27 July 2018 at 18:39, Offray Vladimir Luna Cárdenas
> <[hidden email]> wrote:
>> Hi,
>>
>> I was ready to show a friend the Pharo web capabilities with the
>> classical "myString asUrl retrieveContents", but the friend gave me a
>> URL that contains non-Latin characters[1], and then I got a
>> ZnInvalidUTF8 error.
>>
>> [1]
>> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>>
>> How can I process web addresses in Pharo that contain non latin
>> characters like the one in [1]?
>
> Just some blind digging...
>
> A few levels down the stack is a call equivalent to...
>    x := '%BF%A6%CA%B2'.
>    ZnPercentEncoder new decode: x.
> which fails with the same error.
>
> In #decode we have...
>    bytes := #[191 166 202 178].

Correct (as it is not legal UTF-8)

> and browsing around I discovered a useful method...
>    encoder := ZnCharacterEncoder detectEncoding: bytes
> "==> a ZnSimplifiedByteEncoder('iso88591' strict)"
>
> now the following works...
>    (ZnPercentEncoder new characterEncoder: encoder ) decode: x.

Right, but that guess is wrong (check the resulting string).

That is to be expected: we are talking about Chinese characters, which fall outside the range that #iso88591 (#latin1) can represent.
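
To see why such a guess can "work" and still be wrong, the same bytes can be decoded as ISO-8859-1 in any environment. The Python sketch below (standard library only, offered as an illustration rather than as what Zinc does internally) decodes without error, but only because every byte value maps to some Latin-1 character:

```python
from urllib.parse import unquote

x = "%BF%A6%CA%B2"

# ISO-8859-1 maps each byte to the code point with the same value, so
# decoding never fails, even when the bytes came from another encoding.
as_latin1 = unquote(x, encoding="iso-8859-1")

print(len(as_latin1))
print([hex(ord(ch)) for ch in as_latin1])
```

The result is four characters in the range 0xA0 to 0xFF, none of which is Chinese.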

Clearly, the original website http://www.bidchance.com knows the encoding ...

Again, to my understanding, without further context, when %BF%A6%CA%B2 is encountered in the query part of a URL, it is first percent decoded, then UTF-8 decoded. That is what #asUrl assumes, and it leads to the error because that particular byte sequence, interpreted that way, is not a legal UTF-8 encoding.

> So maybe that helps explain it,
> but I don't know how to join the dots to make it work out of the box
> with "asUrl retrieveContents"
>
> cheers -ben



Re: Working with urls that contain non latin characters

Sven Van Caekenberghe-2


> On 27 Jul 2018, at 15:25, Sven Van Caekenberghe <[hidden email]> wrote:
>
> Right, but that guess is wrong (check the resulting string).

Actually, the encoding used is GBK, https://en.wikipedia.org/wiki/GBK_(character_encoding)

This is a variable-length encoding used in China. It is not currently implemented in Zinc (but could be added).

But even if we implemented it, that would not solve the current issue (we would not know that we had to use it).
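
For illustration, here is a small Python sketch (Python's standard library ships a GBK codec; this is not Zinc code) showing that the bytes decode to the expected search term once the encoding is supplied explicitly. Nothing in the URL itself announces GBK, so the caller has to bring that knowledge:

```python
from urllib.parse import quote, unquote

x = "%BF%A6%CA%B2"

# The encoding is passed in by hand; it cannot be derived from the URL.
as_gbk = unquote(x, encoding="gbk")
print([ord(ch) for ch in as_gbk])

# Re-encoding the decoded text the standard way (UTF-8, then percent-
# encoding) gives the form that a UTF-8-assuming URL parser accepts.
# Note: the original site presumably still expects the GBK form.
print(quote(as_gbk))
```

The code points printed should be the same two used earlier in the thread (21888 and 20160).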