Enabling non 7bit characters in URL path info

NorbertHartl
Hi,

I just discovered that browsers can display Unicode characters in the URL bar, or, put better, that you can use URLs containing umlauts and the like. My current project would benefit greatly from this, so I want it :)

The mechanics are easy. Everything in a URL path is first encoded into UTF-8 bytes. Then the individual bytes are percent-encoded into a URL-safe format.

So a German ü is encoded as %C3%BC. The raw URL looks really ugly, but current browsers display the corresponding character, which is very nice. Just as a trial I hacked GemStone to deal with it, but I need some assistance figuring out the proper way of doing it.
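
To make the two encoding steps concrete, here is a minimal workspace sketch in plain Smalltalk. It deliberately avoids the Seaside and GemStone encoder classes, all variable names are made up for illustration, and it only covers code points up to U+FFFF (no surrogate handling, no four-byte form). It also escapes every byte, whereas a real URL encoder would leave safe ASCII characters alone.

```smalltalk
"Step 1: turn one code point into UTF-8 bytes.
 Step 2: write each byte as %XX."
| codePoint bytes hex out |
codePoint := 16rFC.                       "ü, U+00FC"
bytes := OrderedCollection new.
codePoint < 16r80
    ifTrue: [ bytes add: codePoint ]      "one byte, plain ASCII"
    ifFalse: [
        codePoint < 16r800
            ifTrue: [
                "two-byte form: 110xxxxx 10xxxxxx"
                bytes add: (16rC0 bitOr: (codePoint bitShift: -6)).
                bytes add: (16r80 bitOr: (codePoint bitAnd: 16r3F)) ]
            ifFalse: [
                "three-byte form: 1110xxxx 10xxxxxx 10xxxxxx"
                bytes add: (16rE0 bitOr: (codePoint bitShift: -12)).
                bytes add: (16r80 bitOr: ((codePoint bitShift: -6) bitAnd: 16r3F)).
                bytes add: (16r80 bitOr: (codePoint bitAnd: 16r3F)) ] ].
hex := '0123456789ABCDEF'.
out := WriteStream on: String new.
bytes do: [ :b |
    out nextPut: $%.
    out nextPut: (hex at: b // 16 + 1).   "high nibble"
    out nextPut: (hex at: b \\ 16 + 1) ]. "low nibble"
out contents                              "'%C3%BC'"
```

For ü the two bytes come out as 16rC3 and 16rBC, which is where the %C3%BC above comes from.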

1. I disabled the check for valid characters in PRPath class>>isValidName:. With Unicode support the rule should be reversed, I guess: instead of listing the allowed characters, there are only a couple of characters you want to exclude.

Now you can add pages with names that contain non-7bit characters.

2. I changed WAUrlEncoder>>encode:on: to
        aStream nextPutAll: aCharacter asString encodeAsUTF8 encodeForHTTP
3. The above change has no effect on its own because the characters in the BMP are cached. Therefore I ran
        WAUrlEncoder initializeBMP

Now when you open a page whose name contains non-7bit characters, the URL is in the new format and the browser displays the right character. But Pier complains that it can't find the page. The URL decoder is not fixed yet.

4. in FSSeasideHandler>>decodeString: I changed
        ^string
    to
        ^string decodeFromUTF8
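
For step 1, the "reversed" rule could look roughly like the sketch below. This is not the Pier implementation. The set of excluded characters is purely an assumption chosen for illustration (characters that have a structural meaning in URLs), and `isValidName` here is a workspace block, not the real PRPath class>>isValidName: method.

```smalltalk
"Deny-list check: a name is valid unless it is empty or contains
 one of a few excluded characters. The excluded set is hypothetical."
| excluded isValidName |
excluded := '/?#%'.
isValidName := [ :name |
    name isEmpty not and: [
        (name detect: [ :ch | excluded includes: ch ] ifNone: [ nil ]) isNil ] ].
isValidName value: 'Grüße'.   "true: umlauts are no longer rejected"
isValidName value: 'a/b'      "false: contains an excluded character"
```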

With these changes you can use a lot more characters in the URL path. The change in WAUrlEncoder is ugly and heavy, but performance should not suffer for characters within the 16-bit range. As for the decodeString: change, I don't know whether it works if the string contains a single encoded non-7bit byte. With some intelligent change in PRPath this could be it. But I guess there are much better ways to achieve it. So how? :)
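
To illustrate the open question about a single encoded non-7bit byte, here is a sketch of a UTF-8 decoder over already percent-decoded byte values that never reads past the end of its input. It is plain workspace Smalltalk and not what decodeFromUTF8 actually does. The fallback policy (keep a malformed byte as its own code point, i.e. a Latin-1 reading) is an assumption, and the sketch does not reject overlong forms or surrogates.

```smalltalk
"Decode UTF-8 byte values into code points. Malformed or truncated
 sequences fall back to the raw byte value (assumed policy)."
| decode |
decode := [ :bytes | | out i |
    out := OrderedCollection new.
    i := 1.
    [ i <= bytes size ] whileTrue: [ | b need cp ok |
        b := bytes at: i.
        "number of continuation bytes announced by the lead byte, -1 if invalid"
        need := b < 16r80
            ifTrue: [ 0 ]
            ifFalse: [ (b between: 16rC2 and: 16rDF)
                ifTrue: [ 1 ]
                ifFalse: [ (b between: 16rE0 and: 16rEF)
                    ifTrue: [ 2 ]
                    ifFalse: [ (b between: 16rF0 and: 16rF4)
                        ifTrue: [ 3 ]
                        ifFalse: [ -1 ] ] ] ].
        "the sequence must fit into the input and use 10xxxxxx continuations"
        ok := need >= 0 and: [ i + need <= bytes size ].
        ok ifTrue: [
            i + 1 to: i + need do: [ :k |
                ((bytes at: k) bitAnd: 16rC0) = 16r80 ifFalse: [ ok := false ] ] ].
        ok
            ifTrue: [
                cp := need = 0
                    ifTrue: [ b ]
                    ifFalse: [ b bitAnd: (16r3F bitShift: need negated) ].
                i + 1 to: i + need do: [ :k |
                    cp := (cp bitShift: 6) bitOr: ((bytes at: k) bitAnd: 16r3F) ].
                out add: cp.
                i := i + need + 1 ]
            ifFalse: [
                out add: b.          "fallback: keep the raw byte value"
                i := i + 1 ] ].
    out asArray ].
decode value: #(16rC3 16rBC).   "one element: code point 252 (ü)"
decode value: #(16rFC)          "one element: 252 again, via the fallback"
```

With this policy both %C3%BC and a lone %FC end up as ü, and the loop always advances by at least one byte, so it terminates on any input.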

thanks,

Norbert

Re: Enabling non 7bit characters in URL path info

Dale
Norbert,

I'm not sure I understand the problem. Could you give me a specific example of the problem with FSSeasideHandler>>decodeString:?

Dale

Re: Enabling non 7bit characters in URL path info

NorbertHartl

On 21.04.2010, at 18:55, Dale Henrichs wrote:

> Norbert,
>
> I'm not sure I understand the problem, could you give me a specific example of the problem with FSSeasideHandler>>decodeString:?
>
Try:

((Character codePoint: ('16r' , ('%FC' copyFrom: 2 to: 3)) asNumber) asString) decodeFromUTF8

If I do that in a workspace the image hangs.
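
The expression builds a one-character string holding code point 16rFC and hands it to decodeFromUTF8. Read as a byte, 16rFC is neither a continuation byte nor a lead byte that current UTF-8 (RFC 3629) allows, so a lone %FC is not a well-formed sequence. The sketch below only checks that property in plain Smalltalk; it does not explain why the image hangs instead of signalling an error.

```smalltalk
"Classify the byte 16rFC according to the UTF-8 byte ranges."
| byte |
byte := 16rFC.
(byte bitAnd: 16rC0) = 16r80.      "false: not a continuation byte (10xxxxxx)"
byte between: 16rC2 and: 16rF4     "false: not a valid lead byte either"
```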

Norbert


Re: Enabling non 7bit characters in URL path info

Dale
Very nice:)

I'll check into this...

Dale