Unicode patch

Janko Mivšek

Unicode patch

Dear Squeakers,

Please find attached a Unicode patch that improves the internal
representation of Unicode characters. It:

1. introduces a new class, TwoByteString
2. changes at:put: on ByteString and other such methods to "scale" the
string to TwoByteString or FourByteString, depending on the width of
the character
3. renames WideString to FourByteString for consistency, and renames
all related methods
4. adds the category CollectionTests-Unicode with tests
5. adds the class UnicodeBenchmarking for measuring the speed of
Unicode handling, such as at:put: speed and UTF8 conversions, on the
included English, French, Slovenian, Russian and Chinese texts.
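
The "scaling" idea in the at:put: item can be modelled roughly as follows. This is a Python sketch, not Squeak code and not taken from the changeset; the class ScalingString and its methods are hypothetical names, and only the three class names it reports come from the message above.

```python
class ScalingString:
    """Toy model of a string whose storage widens on demand (hypothetical)."""

    NAMES = {1: "ByteString", 2: "TwoByteString", 4: "FourByteString"}

    def __init__(self, size):
        self.width = 1            # bytes per character: 1, 2 or 4
        self.codes = [0] * size   # one code point per character

    @staticmethod
    def width_for(code_point):
        """Smallest of 1, 2 or 4 bytes that can hold the code point."""
        if code_point < 0x100:
            return 1
        if code_point < 0x10000:
            return 2
        return 4

    def at_put(self, index, char):
        """Store char at a 1-based index, widening the storage if needed."""
        code_point = ord(char)
        # The storage never narrows again in this model.
        self.width = max(self.width, self.width_for(code_point))
        self.codes[index - 1] = code_point
        return char

    def class_name(self):
        """Name of the class that would hold the string at this width."""
        return self.NAMES[self.width]


s = ScalingString(3)
s.at_put(1, "a")            # fits in one byte
s.at_put(2, "\u010d")       # U+010D needs two bytes
s.at_put(3, "\U0001d11e")   # U+1D11E needs four bytes
```

In this model the string only ever grows wider; whether the real patch also narrows a string again is not stated in the thread.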

ByteString and TwoByteString also include UTF8 conversion methods, which
will probably be moved to UTF8TextConverter later.
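
For readers unfamiliar with what such a conversion involves, here is a minimal Python sketch of UTF8 encoding from a sequence of code points. It is an illustration only and does not reproduce the methods in the changeset; the function name utf8_encode is an assumption, and surrogate code points are not rejected.

```python
def utf8_encode(code_points):
    """Encode an iterable of Unicode code points as UTF-8 bytes."""
    out = bytearray()
    for cp in code_points:
        if cp < 0x80:                        # 1 byte: 0xxxxxxx
            out.append(cp)
        elif cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:                                # 4 bytes: 11110xxx and three 10xxxxxx
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)


encoded = utf8_encode(ord(c) for c in "Miv\u0161ek")
```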

I hope this patch will help improve Squeak's Unicode support a bit.

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Attachment: unicode.1.cs.gz (33K)

Damien Cassou-3

Re: Unicode patch

Hi Janko,

did you try to load your changeset in a Squeak 3.10 image? What is the
status of the tests?

If your changeset is good enough and if you write unit tests, it may
be interesting to put your changeset into 3.10.

Bye

2007/6/12, Janko Mivšek <[hidden email]>:

> Dear Squeakers,
>
> Please find attached a Unicode patch that improves the internal
> representation of Unicode characters. It:
>
> 1. introduces a new class, TwoByteString
> 2. changes at:put: on ByteString and other such methods to "scale" the
> string to TwoByteString or FourByteString, depending on the width of
> the character
> 3. renames WideString to FourByteString for consistency, and renames
> all related methods
> 4. adds the category CollectionTests-Unicode with tests
> 5. adds the class UnicodeBenchmarking for measuring the speed of
> Unicode handling, such as at:put: speed and UTF8 conversions, on the
> included English, French, Slovenian, Russian and Chinese texts.
>
> ByteString and TwoByteString also include UTF8 conversion methods, which
> will probably be moved to UTF8TextConverter later.
>
> I hope this patch will help improve Squeak's Unicode support a bit.
>
> Best regards
> Janko

--
Damien Cassou

Edgar J. De Cleene

Re: Unicode patch

On 6/13/07 10:18 PM, "Damien Cassou" <[hidden email]> wrote:

> Hi Janko,
>
> did you try to load your changeset in a Squeak 3.10 image? What is the
> status of the tests?
>
> If your changeset is good enough and if you write unit tests, it may
> be interesting to put your changeset into 3.10.

Damien, Janko's Aida/Web works. I tested it yesterday and am trying to
collaborate with him.

Edgar

Bert Freudenberg

Re: Unicode patch

In reply to this post by Damien Cassou-3

Just from glancing at the code this cannot possibly be right.

Like, in many places the isWideString test is simply replaced with
isFourByteString. But the distinction we need to make is whether we
have character values below 256 or above (for example to choose
between the old and the MultiByteScanner). So #isWideString needs to
be preserved and answer true for all Strings that have character
values >= 256.
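
The difference between the two tests can be sketched as below. This is a Python model with assumed function names, not the Squeak methods themselves; only MultiByteScanner and the two selector names come from the discussion.

```python
def is_wide_string(code_points):
    """True if any character value is >= 256, whatever the storage width."""
    return any(cp >= 256 for cp in code_points)


def is_four_byte_string(code_points):
    """True only if some character needs more than 16 bits."""
    return any(cp >= 0x10000 for cp in code_points)


def choose_scanner(code_points, wide_test):
    """Pick a scanner name using the given predicate (names are labels only)."""
    return "MultiByteScanner" if wide_test(code_points) else "old scanner"


# A Slovenian name fits in a two-byte string but still has values >= 256.
name = [ord(c) for c in "Miv\u0161ek"]
correct = choose_scanner(name, is_wide_string)         # "MultiByteScanner"
mistaken = choose_scanner(name, is_four_byte_string)   # "old scanner"
```

Under these assumptions, swapping one test for the other silently routes two-byte text to the scanner meant for character values below 256.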

As for the internal representation of TwoByteStrings, I'm not sure
using big endian on all platforms is a good idea. Should certainly be
discussed - like, it might be valuable to hand that string to a
primitive and then platform order would be better.
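
To make the byte-order concern concrete, here is a small Python sketch that packs the same 16-bit character values in big-endian and in platform order. The helper name pack_two_byte is an assumption, and nothing here reflects how the patch or the VM actually stores strings.

```python
import struct
import sys


def pack_two_byte(code_points, byte_order):
    """Pack code points below 0x10000 as 16-bit units in the given order."""
    prefix = {"big": ">", "little": "<", "native": "="}[byte_order]
    return struct.pack(f"{prefix}{len(code_points)}H", *code_points)


text = [ord(c) for c in "Miv\u0161ek"]
big = pack_two_byte(text, "big")
native = pack_two_byte(text, "native")

# On a little-endian host the two layouts differ, so code that hands the
# raw bytes to native routines would have to swap every character first.
needs_swap = big != native
```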

Also, the renaming of WideString without providing proper conversion
methods will most certainly break existing projects.

Then there are a lot of nits to pick - like the class comments are
wrong, ByteString>>replaceFrom:... only creates 32 bit strings,
bitShift is used all over the place when Smalltalk code traditionally
uses * and //, what is TwoByteString>>printString good for, why does
TwoByteString>>asByteString do an unnecessary copy etc.
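
On the bitShift point, the two styles compute the same thing for splitting a 16-bit character value into bytes. The Python sketch below only illustrates that equivalence; the function names are made up for the example.

```python
def split_with_shifts(value):
    """High and low byte of a 16-bit value using shifts and masks."""
    return (value >> 8) & 0xFF, value & 0xFF


def split_with_arithmetic(value):
    """The same split written with multiplication-style arithmetic."""
    return (value // 256) % 256, value % 256


def join_with_arithmetic(high, low):
    """Rebuild the 16-bit value from its two bytes."""
    return high * 256 + low


high, low = split_with_arithmetic(0x0161)   # (0x01, 0x61)
```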

Before inclusion this still needs a lot of work and testing.

- Bert -

On Jun 14, 2007, at 3:18 , Damien Cassou wrote:

> Hi Janko,
>
> did you try to load your changeset in a Squeak 3.10 image? What is the
> status of the tests?
>
> If your changeset is good enough and if you write unit tests, it may
> be interesting to put your changeset into 3.10.
>
> Bye

Janko Mivšek

Re: Unicode patch

In reply to this post by Damien Cassou-3

Hi Damien,

Damien Cassou wrote:
> did you try to load your changeset in a Squeak 3.10 image? What is the
> status of the tests?
>
> If your changeset is good enough and if you write unit tests, it may
> be interesting to put your changeset into 3.10.

That would be nice. I just don't yet know the procedure by which patches
from the community go through all the tests and careful eyes to be included
in the main image. Is this written down somewhere? And for a start, where
can I find 3.10?

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

stephane ducasse

Re: Unicode patch

In reply to this post by Bert Freudenberg

> Just from glancing at the code this cannot possibly be right.
>
> Like, in many places the isWideString test is simply replaced with
> isFourByteString. But the distinction we need to make is whether we
> have character values below 256 or above (for example to choose
> between the old and the MultiByteScanner). So #isWideString needs
> to be preserved and answer true for all Strings that have character
> values >= 256.
>
> As for the internal representation of TwoByteStrings; I'm not sure
> using big endian on all platforms is a good idea. Should certainly
> be discussed - like, it might be valuable to hand that string to a
> primitive and then platform order would be better.
>
> Also, the renaming of WideString without providing proper
> conversion methods will most certainly break existing projects.
>
> Then there are a lot of nits to pick - like the class comments are
> wrong, ByteString>>replaceFrom:... only creates 32 bit strings,
> bitShift is used all over the place when Smalltalk code
> traditionally uses * and //, what is TwoByteString>>printString
> good for, why does TwoByteString>>asByteString do an unnecessary
> copy etc.
>
> Before inclusion this still needs a lot of work and testing.

Sounds like it. Thanks for the feedback, Bert.

Stef

Edgar J. De Cleene

Re: Unicode patch

In reply to this post by Janko Mivšek

On 6/14/07 3:23 PM, "Janko Mivšek" <[hidden email]> wrote:

> I just don't yet know the procedure by which patches from the
> community go through all the tests and careful eyes to be included in
> the main image. Is this written down somewhere? And for a start, where
> can I find 3.10?

Janko:

You can read about 3.10 at http://wiki.squeak.org/squeak/5919 and follow
the links.
At http://wiki.squeak.org/squeak/5990 you can complain about how 3.10 is
going,
and http://ftp.squeak.org/3.10alpha/Squeak3.10alpha.7105.zip is the last
published image.

I hope to fix some mistakes soon and update to 7113 and beyond.

As for packages, they must go into Package Universes now.
The image is moving toward a smaller size, to converge with Pavel's work.

Ralph is extending the quality control of the image to packages; this work
has just begun.

Edgar

stephane ducasse

Re: Unicode patch

Edgar, sorry to repeat myself, but could you send the list the changes
that have been harvested?
How can you expect people to trust this image if we do not know
what is harvested and busy people are not given a chance to comment?
The feedback from Bert really illustrates that problem.

Stef

On 15 June 07, at 00:10, Edgar J. De Cleene wrote:

> On 6/14/07 3:23 PM, "Janko Mivšek" <[hidden email]> wrote:
>
>> I just don't yet know the procedure by which patches from the
>> community go through all the tests and careful eyes to be included in
>> the main image. Is this written down somewhere? And for a start, where
>> can I find 3.10?
>
> Janko:
>
> You can read about 3.10 at http://wiki.squeak.org/squeak/5919 and follow
> the links.
> At http://wiki.squeak.org/squeak/5990 you can complain about how 3.10 is
> going,
> and http://ftp.squeak.org/3.10alpha/Squeak3.10alpha.7105.zip is the last
> published image.
>
> I hope to fix some mistakes soon and update to 7113 and beyond.
>
> As for packages, they must go into Package Universes now.
> The image is moving toward a smaller size, to converge with Pavel's work.
>
> Ralph is extending the quality control of the image to packages; this work
> has just begun.
>
> Edgar

stephane ducasse

Re: Unicode patch

In reply to this post by Edgar J. De Cleene

> As for packages, they must go into Package Universes now.

Why? What does it mean?

> The image is moving toward a smaller size, to converge with Pavel's work.
>
> Ralph is extending the quality control of the image to packages; this work
> has just begun.

How?

Edgar J. De Cleene

Re: Unicode patch

In reply to this post by stephane ducasse

On 6/15/07 5:28 AM, "stephane ducasse" <[hidden email]> wrote:

> Edgar, sorry to repeat myself, but could you send the list the changes
> that have been harvested?
> How can you expect people to trust this image if we do not know
> what is harvested and busy people are not given a chance to comment?
> The feedback from Bert really illustrates that problem.
>
> Stef

If you read the swiki ....
I know you want me off the team, so write to Ralph and give me a break.

Edgar