utf-8 encoded URL path info?

NorbertHartl
I recently discovered that current browsers display UTF-8 percent-encoded URL paths as the decoded characters. That is,

http://en.wikipedia.org/wiki/Gew%C3%BCrztraminer

is actually displayed as

http://en.wikipedia.org/wiki/Gewürztraminer
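The relationship between the two forms can be sketched in a few lines. This is an illustration in Python using only the standard library, not Pier or Seaside code; the variable names are chosen for this example. It shows that the "ü" in the path is first encoded to its two UTF-8 bytes (C3 BC) and each byte is then percent-encoded.

```python
from urllib.parse import quote, unquote

path = "/wiki/Gewürztraminer"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character
# by default and leaves "/" untouched.
encoded = quote(path)
print(encoded)  # /wiki/Gew%C3%BCrztraminer

# unquote() reverses the step, decoding the bytes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)  # /wiki/Gewürztraminer
```

The browser does the equivalent of the second step for display only; the request that goes over the wire still carries the percent-encoded form.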

As far as I can see, in Pier the URL path is tightly coupled to the name of a structure, and structure names are restricted to only a few characters. The comments in the code justify this as keeping names safe for use in some object protocols. What would be the way to go if I wanted to enable these kinds of URLs? What are the problematic cases if a structure name could contain non-7-bit characters?

thanks,

Norbert
 
_______________________________________________
Magritte, Pier and Related Tools ...
https://www.iam.unibe.ch/mailman/listinfo/smallwiki

Re: utf-8 encoded URL path info?

Philippe Marschall
2010/4/19 Norbert Hartl <[hidden email]>:
> I recently discovered that current browsers display UTF-8 percent-encoded URL paths as the decoded characters. That is,
>
> http://en.wikipedia.org/wiki/Gew%C3%BCrztraminer
>
> is actually displayed as
>
> http://en.wikipedia.org/wiki/Gewürztraminer

Nice, isn't it?

> As far as I can see, in Pier the URL path is tightly coupled to the name of a structure, and structure names are restricted to only a few characters. The comments in the code justify this as keeping names safe for use in some object protocols. What would be the way to go if I wanted to enable these kinds of URLs? What are the problematic cases if a structure name could contain non-7-bit characters?

The following comes to my mind:
- If you post the second link with umlauts, it might cause problems if
the server does not use UTF-8, or if the user agent is IE5 or some other
non-modern browser.
- If you use WAKomEncoded(39) (which you should) and go beyond Latin-1
(e.g. €), you enter the wonderful world of WideStrings.
- There might be round-trip problems with external systems (files, databases, ...).
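The last two points can be illustrated with a small sketch. It is a Python analogue, not Squeak or Seaside code, and the sample strings are made up for the example. It shows that "ü" still fits in a single Latin-1 byte while "€" does not (which is roughly where a byte-per-character string stops being sufficient), and what a round trip looks like when an external system reads UTF-8 bytes back as Latin-1.

```python
name = "Gewürztraminer"
price = "5 €"

# "ü" (U+00FC) exists in Latin-1, so every character fits in one byte.
latin1_bytes = name.encode("latin-1")
print(len(latin1_bytes) == len(name))  # True

# "€" (U+20AC) has no Latin-1 code point, so encoding fails.
try:
    price.encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
print(euro_fits_latin1)  # False

# Round-trip hazard: UTF-8 bytes written out, then read back as Latin-1.
garbled = name.encode("utf-8").decode("latin-1")
print(garbled)  # GewÃ¼rztraminer
```

The garbled result is not an error anywhere in the pipeline, which is why this kind of mismatch with files or databases tends to go unnoticed until the text is displayed.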

In summary, you might run into bugs, but then again someone has to in
order to get them fixed.

Cheers
Philippe

Re: utf-8 encoded URL path info?

johnmci
In Sophie the URI subsystem was *fixed* to ensure the path data was UTF-8 and HTTP correct.
However, what we found (a couple of years back) was that different browsers had different opinions about
what to do for every UTF-8 character. At the time we decided the differences were bugs; hopefully they have
been fixed by now, since as far as I recall you can now have domains with UTF-8 characters.

On 2010-04-19, at 10:49 AM, Philippe Marschall wrote:

> 2010/4/19 Norbert Hartl <[hidden email]>:
>> I recently discovered that current browsers display UTF-8 percent-encoded URL paths as the decoded characters. That is,
>>
>> http://en.wikipedia.org/wiki/Gew%C3%BCrztraminer
>>
>> is actually displayed as
>>
>> http://en.wikipedia.org/wiki/Gewürztraminer
>
> Nice, isn't it?
--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter:  squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================





Re: utf-8 encoded URL path info?

Philippe Marschall
2010/4/19 John M McIntosh <[hidden email]>:
> In Sophie the URI subsystem was *fixed* to ensure the path data was UTF-8 and HTTP correct.
> However, what we found (a couple of years back) was that different browsers had different opinions about
> what to do for every UTF-8 character. At the time we decided the differences were bugs; hopefully they have
> been fixed by now, since as far as I recall you can now have domains with UTF-8 characters.

For domains that's easy because there's only one standard. It has
nothing to do with UTF-8 but uses Punycode. For the path and the query
there are three standards: one says ASCII, one says Latin-1, and one says
UTF-8. There is no way of knowing before making a request whether the
server accepts Latin-1 or UTF-8.
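A short sketch of the difference, in Python with the standard library only. The host name "bücher.example" is a made-up example, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so treat this as an illustration rather than a reference implementation. The domain has one ASCII form via Punycode, while the same path character yields different bytes depending on which charset the client assumes.

```python
from urllib.parse import quote, unquote

# Domains: non-ASCII labels are converted to an ASCII form via Punycode.
host = "bücher.example".encode("idna")
print(host)  # b'xn--bcher-kva.example'

# Paths: the same character gives different bytes per assumed charset.
as_utf8 = quote("ü", encoding="utf-8")
as_latin1 = quote("ü", encoding="latin-1")
print(as_utf8, as_latin1)  # %C3%BC %FC

# A server assuming UTF-8 cannot decode the Latin-1 form ...
misread_as_utf8 = unquote(as_latin1, encoding="utf-8", errors="replace")
# ... and a server assuming Latin-1 silently garbles the UTF-8 form.
misread_as_latin1 = unquote(as_utf8, encoding="latin-1")
print(repr(misread_as_utf8), misread_as_latin1)
```

Nothing in the request tells the server which of the two encodings the client picked, which is the ambiguity described above.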

Cheers
Philippe