ZnUnicodeComposingReadStream?


ZnUnicodeComposingReadStream?

alistairgrant
Hi Sven & Everyone,

I need to convert a UTF-8 encoded decomposed stream (Mac OS file
names) into a composed string, e.g.:

string: 'test-äöü-äöü'
code points: #(116 101 115 116 45 228 246 252 45 97 776 111 776 117 776)
utf8 encoding: #[116 101 115 116 45 195 164 195 182 195 188 45 97 204
136 111 204 136 117 204 136]

In the above string, the first group of three accented characters is the
same as the second group, but encoded differently: code points
(228 246 252) vs (97 776 111 776 117 776).

Reading the utf8 encoded stream should result in:

string: 'test-äöü-äöü'
code points: #(116 101 115 116 45 228 246 252 45 228 246 252)
utf8 encoding: #[116 101 115 116 45 195 164 195 182 195 188 45 195 164
195 182 195 188]
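(For reference, the example can be checked with any Unicode library. Python's unicodedata is used here purely as an illustration, not as Pharo code; it reproduces exactly the code points and bytes above.)

```python
import unicodedata

# Mixed-form input: first "äöü" precomposed, second decomposed (base + U+0308).
s = "test-\u00e4\u00f6\u00fc-a\u0308o\u0308u\u0308"
print([ord(c) for c in s])

# NFC replaces each base + combining-mark pair with its precomposed character.
nfc = unicodedata.normalize("NFC", s)
print([ord(c) for c in nfc])
print(list(nfc.encode("utf-8")))
```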

My current thought is to write a ZnUnicodeComposingReadStream which
would wrap a ZnCharacterReadStream and return the composed characters.
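(The shape of such a wrapper, sketched in Python rather than Pharo; composing_chars is a hypothetical stand-in for the proposed stream class. The core idea is just to buffer characters until the next starter, i.e. a character with combining class 0, then emit the buffered segment composed.)

```python
import unicodedata

def composing_chars(chars):
    # Hypothetical sketch of what a ZnUnicodeComposingReadStream could do:
    # buffer a starter plus its trailing combining marks, emit them composed.
    buffer = ""
    for ch in chars:
        if buffer and unicodedata.combining(ch) == 0:
            # A new starter begins: the buffered segment is complete.
            yield from unicodedata.normalize("NFC", buffer)
            buffer = ""
        buffer += ch
    if buffer:
        yield from unicodedata.normalize("NFC", buffer)

print("".join(composing_chars(iter("a\u0308o\u0308"))))  # prints "äö"
```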

What do you think?

Thanks!
Alistair


Re: ZnUnicodeComposingReadStream?

Sven Van Caekenberghe-2
Alistair, are you aware of the following (article/codebase) ?

  https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43

Due to the size of the full DB it is doubtful it would become a standard part of Pharo though.

Sven




Re: ZnUnicodeComposingReadStream?

alistairgrant
Hi Sven,

Thanks very much for your quick reply...

On Fri, 13 Jul 2018 at 19:59, Sven Van Caekenberghe <[hidden email]> wrote:
>
> Alistair, are you aware of the following (article/codebase) ?
>
>   https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43
>
> Due to the size of the full DB it is doubtful it would become a standard part of Pharo though.
>
> Sven

I hadn't seen this.  I'll read it next (although I think it will take
me longer than 17 minutes :-)).

But a quick, partial answer is that I was planning to support only the
composition and decomposition tables that are already included in the
main image as part of CombinedChar (see the Composition and
Decomposition class variables).
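(For illustration, the kind of table-driven pairwise composition those class variables enable could look like the Python sketch below. The three-entry table is hand-written for this example; the real CombinedChar data covers many more pairs.)

```python
# Illustration only: a tiny composition table standing in for the real
# Composition data in CombinedChar.
COMPOSITION = {
    (0x61, 0x308): 0xE4,  # a + combining diaeresis -> ä
    (0x6F, 0x308): 0xF6,  # o + combining diaeresis -> ö
    (0x75, 0x308): 0xFC,  # u + combining diaeresis -> ü
}

def compose_pairwise(code_points):
    # Fold each (previous, current) pair found in the table into one code point.
    out = []
    for cp in code_points:
        if out and (out[-1], cp) in COMPOSITION:
            out[-1] = COMPOSITION[(out[-1], cp)]
        else:
            out.append(cp)
    return out

print(compose_pairwise([97, 776, 111, 776, 117, 776]))  # prints [228, 246, 252]
```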

Thanks again,
Alistair




Re: ZnUnicodeComposingReadStream?

Max Leske

Hi Alistair,

*nix systems usually come with the iconv[1] command line program that implements the normalization and denormalization algorithms, or uconv[2], a library that does the same thing. These algorithms include a lot of black magic and I recommend not getting your hands dirty with them. With the FFI interface Pharo has today it shouldn't be too hard to call out to uconv (although I'm not saying it's trivial; I've written a VM plugin that we use at work to interface with uconv, and you do have to know how encodings and iconv work) or to execute iconv directly.

I can send you a copy of the plugin code if you want; actually, I may put it on GitHub if there's interest.

Cheers,
Max

[1] https://linux.die.net/man/1/iconv
[2] https://en.wikipedia.org/wiki/Uconv
[3] http://site.icu-project.org/



Re: ZnUnicodeComposingReadStream?

Max Leske

I realize I got things mixed up a bit: uconv is a program akin to iconv. What we interface with is libicu.

Max



Re: ZnUnicodeComposingReadStream?

alistairgrant
Hi Sven & Max,


On Fri, Jul 13, 2018 at 07:59:32PM +0200, Sven Van Caekenberghe wrote:
> Alistair, are you aware of the following (article/codebase) ?
>
>   https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43
>
> Due to the size of the full DB it is doubtful it would become a standard part of Pharo though.
>
> Sven


Thanks again for the link, it has helped my (still limited)
understanding of Unicode.

The reason I started looking into this was that my file attribute
modification tests failed on Mac OS.  The problem is that Mac OS
requires file names to be decomposed UTF-8, and my plugin wasn't doing
the conversion.

Following the general principle of trying to keep the VM minimal and do
as much as possible in the image, I had hoped I could do the UTF-8
(de)composition in the image.

But it turns out that Mac OS doesn't follow the standard rules, so
programs really need to use the native file name encoding routines on
Mac OS.
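(To make the conversion concrete: the common case in both directions is a plain NFD/NFC round trip, sketched below in Python purely for illustration. The part this misses is exactly the problem above: HFS+ stores a variant of NFD, not standard NFD, so the OS routines stay authoritative.)

```python
import unicodedata

# Common case: macOS stores names decomposed; composing them back is NFC.
# Caveat: HFS+ uses a *variant* of NFD, so plain NFD/NFC only approximates
# the native conversion.
name = "\u00e4\u00f6\u00fc.txt"               # composed, as a user typed it
on_disk = unicodedata.normalize("NFD", name)  # roughly what the file system stores
assert unicodedata.normalize("NFC", on_disk) == name
print([ord(c) for c in on_disk])  # prints [97, 776, 111, 776, 117, 776, 46, 116, 120, 116]
```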

So that's the path I'll be following in this instance.  I still really
appreciate the link, and will be exploring the Unicode package more.



On Fri, Jul 13, 2018 at 10:50:36PM +0200, Max Leske wrote:

> Hi Alistair,
>
> *nix systems usually come with the iconv[1] command line program that
> implements the normalization and denormalization algorithms, or Uconv 2, a
> library that does the same thing. These algorithms include a lot of black magic
> and I recommend to not make your hands dirty with them. With the FFI interface
> Pharo has today it shouldn't be too hard to call out to Uconv (although I'm not
> saying it's trivial; I've written a VM plugin that we use a work to interface
> with Uconv and you do have to know how encodings and iconv work) or execute
> iconv directly.
>
> I can send you a copy of the plugin code if you want, actually, I may put it on
> github if there's interest.
>
> Cheers,
> Max
>
> [1] https://linux.die.net/man/1/iconv
> [2] https://en.wikipedia.org/wiki/Uconv
> [3] http://site.icu-project.org/
>
>

On Sat, Jul 14, 2018 at 08:20:23AM +0200, Max Leske wrote:
> I realize I got things mixed up a bit: Uconv is a program akin to Iconv. What
> we interface with is libicu.
>
> Max

The VM already uses libiconv for the encoding translation on Linux.  As
far as I know, the routines haven't been exposed directly to the image
(although I haven't looked carefully).

I'd be interested in looking at your plugin - I'm still working through
the current FilePlugin behaviour, but I think it would be useful to have
these routines available directly from the image for debugging, etc.

Thanks again,
Alistair



Re: ZnUnicodeComposingReadStream?

Max Leske
On 16 Jul 2018, at 19:46, Alistair Grant wrote:

> I'd be interested in looking at your plugin - I'm still working through
> the current FilePlugin behaviour, but I think it would be useful to have
> these routines available directly from the image for debugging, etc.
>
> Thanks again,
> Alistair

I've put the plugin source on GitHub:
https://github.com/Netstyle/Squeak-VM-Unicode-operations-plugin

I hope you'll find it useful. Note that the code was written for version
4.0.3-2202 of the Squeak VM, so you'd most likely have to make a couple
of modifications to get it running on the OpenSmalltalk VMs.


Cheers,
Max