ZnURL and parsing URL with diacritics

ZnURL and parsing URL with diacritics

Pharo Smalltalk Users mailing list
Hello,

when I try to parse this URL with asUrl, the error "ZnCharacterEncodingError: ASCII character expected" occurs:

'http://domain.com/ěščýž.html' asUrl.

this also does not work:

ZnEasy get: 'http://domain.com/ěščýž.html'

How can I solve this? In a web browser, the URL with diacritics works fine.

I tried also this:

ZnEasy get: 'http://domain.com/ěščýž.html' urlEncoded.

but this cripples the whole URL.

Thanks! Petr Fischer



Re: ZnURL and parsing URL with diacritics

Peter Kenny

Hi Petr

I have used #urlEncoded in the past, with success, to deal with German umlauts. The secret is to urlEncode just the part containing the diacritics. If you encode the whole URL, the slashes are encoded too, and this confuses Zinc, which segments the URL before decoding.
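
You can see why by evaluating the whole-URL encoding directly (output elided here; the point is the escaped separators, %3A for the colon and %2F for the slashes):

'http://domain.com/ěščýž.html' urlEncoded.
"=> 'http%3A%2F%2Fdomain.com%2F...' - the scheme and path separators are
percent-encoded too, so ZnUrl can no longer segment the URL before decoding"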

 

So I would expect you to be able to read your file with:

ZnEasy get: 'http://domain.com/', 'ěščýž.html' urlEncoded.

However, this also fails with 'ASCII character expected', and I can't understand why. The debug trace has too many levels for me to understand. Zinc is evidently getting in a mess trying to decode the urlEncoded string, but if we try:

'ěščýž.html' urlEncoded urlDecoded

as a separate operation, it works OK.

I think only Sven can explain this for you.

HTH

Peter Kenny


Re: ZnURL and parsing URL with diacritics

Sven Van Caekenberghe-2
Hi,


The external representation of a URL with special characters is not the same as what an address bar or browser search field accepts. The latter is quite intelligent and accepts much broader input.

ZnUrl parses the official external representation according to the spec.

Internally, ZnUrl represents all components as resolved strings. The solution is to construct difficult/special URLs by hand.

Here is an example: let's say we want to access the English Wikipedia page of the Czech Republic (the country) using its native name 'Česká republika' (which is not only non-ASCII, but non-Latin1 as well, so it needs a WideString and UTF-8 encoding).

Here is one way to construct such a URL.

ZnUrl new
  scheme: #http;
  host: 'en.wikipedia.org';
  addPathSegment: 'wiki';
  addPathSegment: 'Česká republika';
  yourself.

Which gives a URL with the following external representation:

  http://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika

This can be parsed without problems.

  'http://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika' asUrl.

You can send #retrieveContents to a URL to actually fetch it.

ZnUrl new
  scheme: #http;
  host: 'en.wikipedia.org';
  addPathSegment: 'wiki';
  addPathSegment: 'Česká republika';
  retrieveContents.

Or you could use the url in a ZnClient object.

BTW, there are many ways to construct URLs; I would maybe do the following.

  'https://en.wikipedia.org/wiki' asUrl addPathSegment: 'Česká republika'; yourself.

Or something like

ZnClient new
  url: 'https://en.wikipedia.org/wiki';
  addPathSegment: 'Česká republika';
  get.

HTH,

Sven


Re: ZnURL and parsing URL with diacritics

Pharo Smalltalk Users mailing list
OK. Thanks for the examples. But in my case, the bad URL (with diacritics) comes directly from the Zomato.com REST API (they probably do not read specs), so I'll end up with a few string "hacks".

pf



Re: ZnURL and parsing URL with diacritics

Sven Van Caekenberghe-2
It would probably help if you gave a real example: a REST call that returns something (presumably JSON or XML) containing a URL that is problematic.

FWIW, the following also work:

('https://en.wikipedia.org/wiki/' , 'Česká republika' urlEncoded) asUrl.
('https://en.wikipedia.org/wiki/' , 'Česká republika' urlEncoded) asUrl retrieveContents.


Re: ZnURL and parsing URL with diacritics

Erik Stel
Check out the following info:

     https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier

The REST API might be answering with an IRI instead of a URI, in which case it might not be faulty after all (I did not check the API myself).
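
The relationship is easy to see in Pharo with the operations already used in this thread (the encoded string below is the one from Sven's Wikipedia example):

'Česká republika' urlEncoded.
"=> '%C4%8Cesk%C3%A1%20republika' - the URI form of the IRI segment"
'%C4%8Cesk%C3%A1%20republika' urlDecoded.
"=> 'Česká republika' - back to the IRI form"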





Re: ZnURL and parsing URL with diacritics

Sean P. DeNigris
Pharo Smalltalk Users mailing list wrote
> OK. Thanks for the examples. But in my case, the bad URL (with diacritics)
> comes directly from the Zomato.com REST API (they probably do not read
> specs), so I'll end up with a few string "hacks".

Sven actually found a trick to handle this case and then forgot, he he [1]. Just in case you still have the issue:

'http://myhost/path/with/umlaut/äöü.txt' asFileReference asUrl.

1. http://forum.world.st/Umlauts-in-ZnUrl-tp4793736.html



Cheers,
Sean

Re: ZnURL and parsing URL with diacritics

Peter Kenny
I recall seeing this trick when Sven published it, but I have never tried it. Trying it now, I get strange results. I entered:

'https://fr.wiktionary.org/wiki/péripétie' asFileReference asUrl.

and when I inspect the result, it is the address of a non-existent file in my image directory. I am using Moose 6.1 (Pharo 6.0, latest update #60541) on Windows 10. This is not a current image, I know, but it is more recent than the date of Sven's post (Dec 2014). Has something changed since 2014?

Incidentally, I tried the other trick Sven cites in the same thread. The same URL as above can be written:

'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.

This works fine, and is neater than the alternative with explicit percent encoding which I currently use. So thanks, Sean, for pointing me to the thread.
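
Since a ZnUrl responds to #retrieveContents (as Sven showed earlier in the thread), the slash-composed URL can also be fetched directly:

('https://fr.wiktionary.org/wiki' asUrl / 'péripétie') retrieveContents.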

Peter Kenny


Re: ZnURL and parsing URL with diacritics

Sean P. DeNigris
Peter Kenny wrote
> And when I inspect the result, it is the address of a non-existent file in
> my image directory.

Ah, no. I see the same result. By "worked" I meant that it created a URL that Safari accepted, but I see now it's not the same as correctly parsing it.


Peter Kenny wrote
> Incidentally, I tried the other trick Sven cites in the same thread. The
> same url as above can be written:
> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.

Yes, this works if you are assembling the URL, but several people presented
the use case of processing URLs from elsewhere, leaving one in a
chicken-and-egg situation where one can't parse due to the diacritics and
can't escape the diacritics (i.e. without incorrectly escaping other things)
without parsing :/



Cheers,
Sean

Re: ZnURL and parsing URL with diacritics

Sven Van Caekenberghe-2



Yes, that is pretty close to a catch 22. Strictly speaking, such URLs are incorrect and can't be parsed.

I do understand that sometimes these URLs occur in the wild, but again, strictly speaking they are in error.

The fact that browser search boxes accept them is a service on top of the strict URL syntax. I am not 100% sure how they do it, but it probably involves a lot of heuristics and trial and error.

The parser of ZnUrl is just 3 to 4 methods. There is nothing preventing somebody from making a new ZnLoseUrlParser, but it won't be easy.


Re: ZnURL and parsing URL with diacritics

Peter Kenny
Sean, Sven

Thinking about this, I have found a simple (maybe too simple) way round it. The obvious first approach is to apply 'urlEncoded' to the received url string, but this fails because it also encodes the slashes and other segment dividers. A simple-minded approach is to scan the received string, copy the slashes and other segment dividers unchanged and percent encode everything else. I cobbled together the following in a playground, but it could easily be turned into a method in String class.

urlEncodedSegments := [ :url | | outStream |
        outStream := String new writeStream.
        url asString do: [ :ch |
                "copy segment dividers unchanged, percent encode everything else"
                (':/?' includes: ch)
                        ifTrue: [ outStream nextPut: ch ]
                        ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
        outStream contents ].

urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
=> https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie

This may fail if a slash can occur in a url other than as a segment divider. I am not sure if this is possible - could there be some sort of escaped slash within a segment? Anyway, if the received url strings are well-behaved, apart from the diacritics, this approach could be used as a hack for Sean's problem.

HTH

Peter Kenny

Note to Sven: The comment to String>>urlEncoded says: ' This is an encoding where characters that are illegal in a URL are escaped.' Slashes are escaped but are quite legal. Should the comment be changed, or the method?




Re: ZnURL and parsing URL with diacritics

Peter Kenny
Well it didn't take long to find a potential problem in what I wrote, at least as a general solution. If the input string contains something which has already been percent encoded, it will re-encode the percent signs. In this case, decoding will recover the once-encoded version, but we need to decode twice to recover the original text. Any web site receiving this version will almost certainly decode once only, and so will not see the right details.

The solution is simple: just include the percent sign in the list of excluded characters, so the test becomes:

        url asString do: [ :ch | (':/?%' includes: ch)
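
For completeness, the corrected block then reads (the same code as before, with only the safe-character list changed):

urlEncodedSegments := [ :url | | outStream |
        outStream := String new writeStream.
        url asString do: [ :ch |
                "also pass '%' through, so already-encoded input is not double-encoded"
                (':/?%' includes: ch)
                        ifTrue: [ outStream nextPut: ch ]
                        ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
        outStream contents ].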


Re: ZnURL and parsing URL with diacritics

Peter Kenny
Sean

I have realized that the method I proposed can be expressed entirely within the Zinc system, which may make it a bit neater and easier to follow. There probably is no completely general solution, but there is a completely general way of finding a solution for your problem domain.

It is important to realize that String>>urlEncoded is defined as:

        ZnPercentEncoder new encode: self.

ZnPercentEncoder does not attempt to parse the input string as a url. It scans the entire string, and percent encodes any character that is not in its safe set (see the comment to ZnPercentEncoder>>encode:). Sven has given as default a minimum safe set, which does not include slash, but there is a setter method to redefine the safe set.
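
A quick check of the effect of the safe set (a sketch; the first expression uses the default safe set, the second extends it with the slash via that setter):

ZnPercentEncoder new encode: 'a/b'.
"=> 'a%2Fb' - slash is not in the default safe set"
ZnPercentEncoder new
        safeSet: '/' , ZnPercentEncoder rfc3986UnreservedCharacters;
        encode: 'a/b'.
"=> 'a/b' - slash now passes through unchanged"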

So the general way to find a solution for your domain is to collect a representative set of the url strings, apply String>>urlEncoded to each, and work out which characters have been percent encoded wrongly for your domain. For most urls these are likely to include ':/?#', as well as '%' if the input includes things already percent encoded, but there may be others specific to your domain. Now construct an instance of ZnPercentEncoder with the safe set extended to include these characters; note that the default safe set is given by the class method ZnPercentEncoder class>>rfc3986UnreservedCharacters. Apply this instance to encode all your test incoming url strings and verify that they work. Iterate, extending the safe set, until everything passes.

If you want to keep the neatness of being able to write something like 'incomingString urlEncoded asZnUrl', you can add a method to String; for the case of the common url characters mentioned above:

String>>urlEncodedMyWay
        "As urlEncoded, but with the safe set extended to include characters commonly found in a url"

        ^ ZnPercentEncoder new
                safeSet: ':/?#%' , (ZnPercentEncoder rfc3986UnreservedCharacters);
                encode: self

This works in much the same way as the snippet I posted originally, because my code simply reproduces the essentials of ZnPercentEncoder>>encode:.

I seem to be trying to monopolize this thread, so I shall shut up now.

HTH

Peter Kenny


Re: ZnURL and parsing URL with diacritics

Sven Van Caekenberghe-2
I would use a variant of your original transformation.

The issue (the error in the URL) is that all kinds of non-ASCII characters occur unencoded. We should/could assume that other special/reserved ASCII characters _are_ properly encoded (so we do not need to handle them).

So I would literally patch/fix the problem, like this:

| bogusUrl fixedUrl |
bogusUrl := 'https://en.wikipedia.org/wiki/Česká republika'.
fixedUrl := String streamContents: [ :out |
        bogusUrl do: [ :each |
                "pass plain ASCII through; percent-encode spaces and all non-ASCII"
                (each codePoint < 127 and: [ each ~= $ ])
                        ifTrue: [ out nextPut: each ]
                        ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ].
fixedUrl asUrl retrieveContents.

I made an extra case for the space character; it works either way in the example given, but a space cannot occur freely.


Re: ZnURL and parsing URL with diacritics

Peter Kenny
Sven

That would certainly work, and represents the most liberal possible approach. An equivalent, keeping entirely within Zinc, would be to use a special-purpose instance of ZnPercentEncoder, in which the safe set is defined as all characters between code points 33 and 126 inclusive. (Starting at 33 fixes your space point.)
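
A minimal sketch of that variant (assuming, as in the snippets above, that ZnPercentEncoder>>safeSet: accepts an arbitrary String of safe characters):

| encoder |
encoder := ZnPercentEncoder new
        safeSet: (String withAll: ((33 to: 126) collect: [ :i | Character value: i ]));
        yourself.
(encoder encode: 'https://fr.wiktionary.org/wiki/péripétie') asUrl.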

Using 'bogusUrl' as a variable name seems a bit pejorative. I am looking up French and German words in Wiktionary all the time, and I am building a Pharo app to do it for me. The version of the url with the accented characters will not work in Zinc until I have urlEncoded it, but it works perfectly well in a browser and is much easier to read.

Peter Kenny



Re: ZnURL and parsing URL with diacritics

Sven Van Caekenberghe-2
Peter,

It *is* a bogus URL; please go and read some RFCs.

A browser's address/search box is an entirely different thing that adds convenience features, such as the issue we are discussing here.

Sven

> On 26 Mar 2019, at 16:02, PBKResearch <[hidden email]> wrote:
>
> Sven
>
> That would certainly work, and represents the most liberal possible approach. An equivalent, keeping entirely within Zinc, would be to use a special-purpose instance of ZnPercentEncoder, in which the safe set is defined as all characters between code points 33 and 126 inclusive. (Starting at 33 fixes your space point.)
>
> Using 'bogusUrl' as a variable name seems a bit pejorative. I am looking up French and German words in Wiktionary all the time, and I am building a Pharo app to do it for me. The version of the url with the accented characters will not work in Zinc until I have urlEncoded it, but it works perfectly well in a browser and is much easier to read.
>
> Peter Kenny
>
>
> -----Original Message-----
> From: Pharo-users <[hidden email]> On Behalf Of Sven Van Caekenberghe
> Sent: 26 March 2019 12:26
> To: Any question about pharo is welcome <[hidden email]>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>
> I would use a variant of your original transformation.
>
> The issue (the error in the URL) is that all kinds of non-ASCII characters occur unencoded. We should/could assume that other special/reserved ASCII characters _are_ properly encoded (so we do not need to handle them).
>
> So I would literally patch/fix the problem, like this:
>
> | bogusUrl fixedUrl url |
> bogusUrl := 'https://en.wikipedia.org/wiki/Česká republika'.
> fixedUrl := String streamContents: [ :out |
> bogusUrl do: [ :each |
> (each codePoint < 127 and: [ each ~= $ ])
> ifTrue: [ out nextPut: each ]
> ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ].
> fixedUrl asUrl retrieveContents.
>
> I made and extra case for the space character, it works either way in the example given, but a space cannot occur freely.
>
>> On 26 Mar 2019, at 12:53, PBKResearch <[hidden email]> wrote:
>>
>> Sean
>>
>> I have realized that the method I proposed can be expressed entirely within the Zinc system, which may make it a bit neater and easier to follow. There probably is no completely general solution, but there is a completely general way of finding a solution for your problem domain.
>>
>> It is important to realize that String>>urlEncoded is defined as:
>> ZnPercentEncoder new encode: self.
>> ZnPercentEncoder does not attempt to parse the input string as a url. It scans the entire string, and percent encodes any character that is not in its safe set (see the comment to ZnPercentEncoder>>encode:). Sven has given as default a minimum safe set, which does not include slash, but there is a setter method to redefine the safe set.
>>



Re: ZnURL and parsing URL with diacritics

Peter Kenny
Sven

Well, RFCs are unreadable - I know, because I looked at RFC 3986 while investigating this question - but OK, I get your point. I suppose I should be looking for something that makes it easier to provide similar convenience features in Pharo. As you say, if this issue is cracked, that is a step on the way.

Peter

-----Original Message-----
From: Pharo-users <[hidden email]> On Behalf Of Sven Van Caekenberghe
Sent: 26 March 2019 15:08
To: Any question about pharo is welcome <[hidden email]>
Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics

Peter,

It *is* a bogus URL; please go and read some RFCs.

A browser's address/search box is an entirely different thing that adds convenience features, such as the issue we are discussing here.

Sven

> On 26 Mar 2019, at 16:02, PBKResearch <[hidden email]> wrote:
>
> Sven
>
> That would certainly work, and represents the most liberal possible
> approach. An equivalent, keeping entirely within Zinc, would be to use
> a special-purpose instance of ZnPercentEncoder, in which the safe set
> is defined as all characters between code points 33 and 126 inclusive.
> (Starting at 33 fixes your space point.)
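>
> In code, that liberal encoder might look like this - a playground sketch, assuming only the #safeSet: setter and #encode: already mentioned in this thread; the rest is ordinary Pharo:
>
> | safeSet encoder |
> "All printable ASCII, i.e. code points 33 to 126; space (32) stays outside the safe set."
> safeSet := String withAll: ((33 to: 126) collect: [ :cp | Character value: cp ]).
> encoder := ZnPercentEncoder new safeSet: safeSet; yourself.
> encoder encode: 'https://en.wikipedia.org/wiki/Česká republika'.
> => https://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika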
>
> Using 'bogusUrl' as a variable name seems a bit pejorative. I am looking up French and German words in Wiktionary all the time, and I am building a Pharo app to do it for me. The version of the url with the accented characters will not work in Zinc until I have urlEncoded it, but it works perfectly well in a browser and is much easier to read.
>
> Peter Kenny
>
>
> -----Original Message-----
> From: Pharo-users <[hidden email]> On Behalf Of
> Sven Van Caekenberghe
> Sent: 26 March 2019 12:26
> To: Any question about pharo is welcome <[hidden email]>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>
> I would use a variant of your original transformation.
>
> The issue (the error in the URL) is that all kinds of non-ASCII characters occur unencoded. We should/could assume that other special/reserved ASCII characters _are_ properly encoded (so we do not need to handle them).
>
> So I would literally patch/fix the problem, like this:
>
> | bogusUrl fixedUrl |
> bogusUrl := 'https://en.wikipedia.org/wiki/Česká republika'.
> fixedUrl := String streamContents: [ :out |
>     bogusUrl do: [ :each |
>         (each codePoint < 127 and: [ each ~= $ ])
>             ifTrue: [ out nextPut: each ]
>             ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ].
> fixedUrl asUrl retrieveContents.
>
> I made an extra case for the space character; it works either way in the example given, but a space cannot occur freely in a URL.
>
>> On 26 Mar 2019, at 12:53, PBKResearch <[hidden email]> wrote:
>>
>> Sean
>>
>> I have realized that the method I proposed can be expressed entirely within the Zinc system, which may make it a bit neater and easier to follow. There probably is no completely general solution, but there is a completely general way of finding a solution for your problem domain.
>>
>> It is important to realize that String>>urlEncoded is defined as:
>> ZnPercentEncoder new encode: self.
>> ZnPercentEncoder does not attempt to parse the input string as a url. It scans the entire string and percent encodes any character that is not in its safe set (see the comment to ZnPercentEncoder>>encode:). Sven has provided a minimal safe set as the default, which does not include slash, but there is a setter method to redefine the safe set.
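>>
>> To see the consequence in a playground (the percent escapes shown are simply the UTF-8 encodings one would expect, given the default safe set described above):
>>
>> 'wiki/péripétie' urlEncoded.
>> => wiki%2Fp%C3%A9rip%C3%A9tie - the slash is escaped along with the accented characters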
>>
>> So the general way to find a solution for your domain is to collect a representative set of the url strings, apply String>>urlEncoded to each, and work out which characters have been percent encoded wrongly for your domain. For typical urls this is likely to include ':/?#', as well as '%' if the input includes things that are already percent encoded, but there may be others specific to your domain. Now construct an instance of ZnPercentEncoder with the safe set extended to include these characters - note that the default safe set is given by the class-side method ZnPercentEncoder class>>rfc3986UnreservedCharacters. Apply this instance to encode all your test url strings and verify that they work. Iterate, extending the safe set, until everything passes.
>>
>> If you want to keep the neatness of being able to write something like 'incomingString urlEncoded asZnUrl', you can add a method to String; for the case of the common url characters mentioned above:
>>
>> String >> urlEncodedMyWay
>>     "As urlEncoded, but with the safe set extended to include characters commonly found in a url"
>>
>>     ^ ZnPercentEncoder new
>>         safeSet: ':/?#%' , (ZnPercentEncoder rfc3986UnreservedCharacters);
>>         encode: self
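>>
>> With that in place one can write, for example (usage sketch, using the Wiktionary url from earlier in the thread):
>>
>> 'https://fr.wiktionary.org/wiki/péripétie' urlEncodedMyWay asZnUrl.
>> => https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie - parsed as a ZnUrl, with the segment dividers intact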
>>
>> This works in much the same way as the snippet I posted originally, because my code simply reproduces the essentials of ZnPercentEncoder>>encode:.
>>
>> I seem to be trying to monopolize this thread, so I shall shut up now.
>>
>> HTH
>>
>> Peter Kenny
>>
>> -----Original Message-----
>> From: Pharo-users <[hidden email]> On Behalf Of
>> PBKResearch
>> Sent: 24 March 2019 15:36
>> To: 'Any question about pharo is welcome'
>> <[hidden email]>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>>
>> Well, it didn't take long to find a potential problem in what I wrote, at least as a general solution. If the input string contains something which has already been percent encoded, it will re-encode the percent signs. In that case, decoding will recover the once-encoded version, but we would need to decode twice to recover the original text. Any web site receiving this version will almost certainly decode once only, and so will not see the right details.
>>
>> The solution is simple - just include the percent sign among the characters that are copied through unencoded, so the third line becomes:
>> url asString do: [ :ch | (':/?%' includes: ch )
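>>
>> Spelled out, the corrected block is the same playground snippet as quoted below, with '%' added to the pass-through set:
>>
>> urlEncodedSegments := [ :url |
>>     | outStream |
>>     outStream := String new writeStream.
>>     url asString do: [ :ch |
>>         (':/?%' includes: ch)
>>             ifTrue: [ outStream nextPut: ch ]
>>             ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
>>     outStream contents ].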
>>
>> -----Original Message-----
>> From: Pharo-users <[hidden email]> On Behalf Of
>> PBKResearch
>> Sent: 24 March 2019 12:11
>> To: 'Any question about pharo is welcome'
>> <[hidden email]>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>>
>> Sean, Sven
>>
>> Thinking about this, I have found a simple (maybe too simple) way round it. The obvious first approach is to apply 'urlEncoded' to the received url string, but this fails because it also encodes the slashes and other segment dividers. A simple-minded alternative is to scan the received string, copy the slashes and other segment dividers unchanged, and percent encode everything else. I cobbled together the following in a playground, but it could easily be turned into a method on String.
>>
>> urlEncodedSegments := [ :url |
>>     | outStream |
>>     outStream := String new writeStream.
>>     url asString do: [ :ch |
>>         (':/?' includes: ch)
>>             ifTrue: [ outStream nextPut: ch ]
>>             ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
>>     outStream contents ].
>>
>> urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
>> => https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie
>>
>> This may fail if a slash can occur in a url other than as a segment divider. I am not sure if this is possible - could there be some sort of escaped slash within a segment? Anyway, if the received url strings are well-behaved, apart from the diacritics, this approach could be used as a hack for Sean's problem.
>>
>> HTH
>>
>> Peter Kenny
>>
>> Note to Sven: The comment to String>>urlEncoded says: 'This is an encoding where characters that are illegal in a URL are escaped.' Slashes are escaped but are quite legal. Should the comment be changed, or the method?
>>
>>
>>
>> -----Original Message-----
>> From: Pharo-users <[hidden email]> On Behalf Of
>> Sven Van Caekenberghe
>> Sent: 23 March 2019 20:03
>> To: Any question about pharo is welcome <[hidden email]>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>>
>>
>>
>>> On 23 Mar 2019, at 20:53, Sean P. DeNigris <[hidden email]> wrote:
>>>
>>> Peter Kenny wrote
>>>> And when I inspect the result, it is the address of a non-existent
>>>> file in my image directory.
>>>
>>> Ah, no. I see the same result. By "worked" I meant that it created a
>>> URL that safari accepted, but I see now it's not the same as
>>> correctly parsing it.
>>>
>>>
>>> Peter Kenny wrote
>>>> Incidentally, I tried the other trick Sven cites in the same thread.
>>>> The same url as above can be written:
>>>> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.
>>>
>>> Yes, this works if you are assembling the URL, but several people
>>> presented the use case of processing URLs from elsewhere, leaving
>>> one in a chicken-and-egg situation where one can't parse due to the
>>> diacritics and can't escape the diacritics (i.e. without incorrectly
>>> escaping other things) without parsing :/
>>
>> Yes, that is pretty close to a catch 22. Strictly speaking, such URLs are incorrect and can't be parsed.
>>
>> I do understand that sometimes these URLs occur in the wild, but again, strictly speaking they are in error.
>>
>> The fact that browser search boxes accept them is a service on top of the strict URL syntax; I am not 100% sure how they do it, but it probably involves a lot of heuristics and trial and error.
>>
>> The parser of ZnUrl is just 3 or 4 methods. There is nothing preventing somebody from making a new ZnLooseUrlParser, but it won't be easy.
>>
>>> -----
>>> Cheers,
>>> Sean
>>> --
>>> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
>>>
>>
>>
>>
>>
>>
>
>
>