Unicode patch

Unicode patch

Janko Mivšek
Dear Squeakers,

Please find attached a Unicode patch that improves the internal
representation of Unicode characters. It:

1. introduces a new class, TwoByteString
2. changes at:put: on ByteString and other such methods to "scale" the
    string to TwoByteString or FourByteString, depending on the width of
    the character
3. renames WideString to FourByteString for consistency, and also
    renames all related methods
4. adds the category CollectionTests-Unicode with tests
5. adds the class UnicodeBenchmarking for measuring the speed of
    Unicode handling, such as at:put: speed and UTF8 conversions, on the
    included English, French, Slovenian, Russian and Chinese texts.
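
The "scaling" behaviour of at:put: described in the list above can be sketched roughly as follows. This is a Python illustration, not Squeak code and not taken from the changeset: the class ScalingString, the method at_put and the helper width_for are hypothetical names, and the 1/2/4-byte thresholds are an assumption based on the class names ByteString, TwoByteString and FourByteString.

```python
from array import array

# Hypothetical model: storage width in bytes -> array typecode.
# 'L' is guaranteed to hold at least 32 bits; its real item size may be larger.
TYPECODES = {1: "B", 2: "H", 4: "L"}


def width_for(code_point):
    """Smallest width (1, 2 or 4 bytes) that can hold code_point."""
    if code_point < 0x100:
        return 1
    if code_point < 0x10000:
        return 2
    return 4


class ScalingString:
    """Toy string that widens its storage when a wider character is stored."""

    def __init__(self, text=""):
        code_points = [ord(ch) for ch in text]
        self.width = max((width_for(cp) for cp in code_points), default=1)
        self.data = array(TYPECODES[self.width], code_points)

    def at_put(self, index, char):
        """Store char at a 1-based index, widening the storage first if needed."""
        code_point = ord(char)
        needed = width_for(code_point)
        if needed > self.width:
            # "Scale" the string: copy every element into wider storage.
            self.data = array(TYPECODES[needed], list(self.data))
            self.width = needed
        self.data[index - 1] = code_point

    def __str__(self):
        return "".join(chr(cp) for cp in self.data)


s = ScalingString("abc")      # width 1, comparable to a ByteString
s.at_put(2, "\u0161")         # 's with caron' does not fit in one byte
two_byte_width = s.width
s.at_put(3, "\U0001F600")     # outside the 16-bit range
four_byte_width = s.width
```

The sketch only ever widens the storage. Whether the patch ever narrows a string again is not stated in this thread.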

ByteString and TwoByteString also include UTF8 conversion methods, which
will probably be moved to UTF8TextConverter later.
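
For readers unfamiliar with what such a UTF8 conversion has to do, here is a minimal sketch in Python. It is not the code from the changeset; utf8_encode and utf8_from_code_points are hypothetical helper names, and the sketch assumes valid code points (it does not reject surrogates or values above 0x10FFFF).

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (no validation)."""
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf8_from_code_points(code_points):
    """Concatenate the UTF-8 encoding of each code point."""
    return b"".join(utf8_encode(cp) for cp in code_points)
```

A string whose characters all fit in one or two bytes never reaches the four-byte branch, which may be why the conversion can live on ByteString and TwoByteString separately.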

I hope this patch will help improve Squeak's Unicode support a bit.

Best regards
Janko


--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si



Attachment: unicode.1.cs.gz (33K)

Re: Unicode patch

Damien Cassou-3
Hi Janko,

did you try to load your changeset in a Squeak 3.10 image? What is the
status of the tests?

If your changeset is good enough and if you write unit tests, it may
be interesting to put your changeset into 3.10.

Bye

2007/6/12, Janko Mivšek <[hidden email]>:

> Dear Squeakers,
>
> Please find attached a Unicode patch that improves the internal
> representation of Unicode characters. It:
>
> 1. introduces a new class, TwoByteString
> 2. changes at:put: on ByteString and other such methods to "scale" the
>     string to TwoByteString or FourByteString, depending on the width of
>     the character
> 3. renames WideString to FourByteString for consistency, and also
>     renames all related methods
> 4. adds the category CollectionTests-Unicode with tests
> 5. adds the class UnicodeBenchmarking for measuring the speed of
>     Unicode handling, such as at:put: speed and UTF8 conversions, on the
>     included English, French, Slovenian, Russian and Chinese texts.
>
> ByteString and TwoByteString also include UTF8 conversion methods, which
> will probably be moved to UTF8TextConverter later.
>
> I hope this patch will help improve Squeak's Unicode support a bit.
>
> Best regards
> Janko
>
> --
> Janko Mivšek
> AIDA/Web
> Smalltalk Web Application Server
> http://www.aidaweb.si

--
Damien Cassou



Re: Unicode patch

Edgar J. De Cleene



On 6/13/07 10:18 PM, "Damien Cassou" <[hidden email]> wrote:

> Hi Janko,
>
> did you try to load your changeset in a Squeak 3.10 image? What is the
> status of the tests?
>
> If your changeset is good enough and if you write unit tests, it may
> be interesting to put your changeset into 3.10.

Damien, Janko's AIDA/Web works. I tested it yesterday and am trying to
collaborate with him.

Edgar




Re: Unicode patch

Bert Freudenberg
In reply to this post by Damien Cassou-3
Just from glancing at the code, this cannot possibly be right.

In many places the isWideString test is simply replaced with
isFourByteString. But the distinction we need to make is whether we
have character values below 256 or not (for example, to choose
between the old scanner and the MultiByteScanner). So #isWideString
needs to be preserved and answer true for all Strings that have
character values >= 256.
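
The difference between the two tests can be shown with a small Python sketch. It is only an illustration of the argument, not Squeak code: ToyString and its methods are hypothetical stand-ins, and the return value "old scanner" is a placeholder label rather than a real class name.

```python
class ToyString:
    """Stand-in for a string class with a fixed number of bytes per character."""

    def __init__(self, text, bytes_per_char):
        self.text = text
        self.bytes_per_char = bytes_per_char

    def is_four_byte_string(self):
        # Asks about the storage class only.
        return self.bytes_per_char == 4

    def is_wide_string(self):
        # Asks about the content: any character value >= 256?
        return any(ord(ch) >= 256 for ch in self.text)


def choose_scanner(is_wide):
    """Pick a scanner based on the answer to the width test."""
    return "MultiByteScanner" if is_wide else "old scanner"


# A two-byte string holding a character above 255 (U+0161).
name = ToyString("Miv\u0161ek", bytes_per_char=2)

by_storage = choose_scanner(name.is_four_byte_string())
by_content = choose_scanner(name.is_wide_string())
```

With the storage-based test the two-byte string is routed to the old scanner even though it contains a character value above 255. The content-based test routes it to the MultiByteScanner.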

As for the internal representation of TwoByteStrings, I'm not sure
using big endian on all platforms is a good idea. This should certainly
be discussed: it might be valuable to hand that string to a
primitive, and then platform order would be better.
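
To make the byte-order question concrete, here is a hedged Python sketch of the same two-byte character laid out in big-endian, little-endian and platform order. The function encode_two_byte is a hypothetical helper, not part of the patch.

```python
import sys


def encode_two_byte(text, byteorder):
    """Lay out each character as a 16-bit unit in the given byte order.

    byteorder is "big" or "little"; characters above 0xFFFF are rejected
    because they do not fit in two bytes.
    """
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError("character does not fit in two bytes")
        out += code_point.to_bytes(2, byteorder)
    return bytes(out)


big = encode_two_byte("\u0161", "big")              # high byte first
little = encode_two_byte("\u0161", "little")        # low byte first
native = encode_two_byte("\u0161", sys.byteorder)   # whatever this machine uses
```

On a little-endian machine, a fixed big-endian layout means native code reading the data as 16-bit units has to swap bytes first. With platform order it can read the units directly.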

Also, renaming WideString without providing proper conversion
methods will most certainly break existing projects.

Then there are a lot of nits to pick: the class comments are
wrong; ByteString>>replaceFrom:... only creates 32-bit strings;
bitShift is used all over the place where Smalltalk code traditionally
uses * and //; what is TwoByteString>>printString good for; why does
TwoByteString>>asByteString do an unnecessary copy; etc.
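
On the bitShift nit: for the byte arithmetic involved here, shifting and multiplying or dividing by 256 give the same results, so the choice is one of style. A small Python sketch with hypothetical helper names (Python's // floors, as Smalltalk's // does):

```python
def combine_shift(high, low):
    """Build a 16-bit value from two bytes using shifts."""
    return (high << 8) | low


def combine_arith(high, low):
    """Build the same value using multiplication and addition."""
    return high * 256 + low


def split_shift(value):
    """Split a 16-bit value into (high, low) bytes using shift and mask."""
    return value >> 8, value & 0xFF


def split_arith(value):
    """Split the same value using floor division and remainder."""
    return value // 256, value % 256
```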

Before inclusion this still needs a lot of work and testing.

- Bert -

On Jun 14, 2007, at 3:18 , Damien Cassou wrote:

> Hi Janko,
>
> did you try to load your changeset in a Squeak 3.10 image? What is the
> status of the tests?
>
> If your changeset is good enough and if you write unit tests, it may
> be interesting to put your changeset into 3.10.
>
> Bye


Re: Unicode patch

Janko Mivšek
In reply to this post by Damien Cassou-3
Hi Damien,

Damien Cassou wrote:
> did you try to load your changeset in a Squeak 3.10 image? What is the
> status of the tests?
>
> If your changeset is good enough and if you write unit tests, it may
> be interesting to put your changeset into 3.10.

That would be nice. I just don't know yet the procedure by which patches
from the community go through all the tests and careful eyes before being
included in the main image. Is this written down somewhere? And for a
start, where can I find 3.10?

Best regards
Janko


--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: Unicode patch

stephane ducasse
In reply to this post by Bert Freudenberg

> Just from glancing at the code, this cannot possibly be right.
>
> In many places the isWideString test is simply replaced with
> isFourByteString. But the distinction we need to make is whether we
> have character values below 256 or not (for example, to choose
> between the old scanner and the MultiByteScanner). So #isWideString
> needs to be preserved and answer true for all Strings that have
> character values >= 256.
>
> As for the internal representation of TwoByteStrings, I'm not sure
> using big endian on all platforms is a good idea. This should certainly
> be discussed: it might be valuable to hand that string to a
> primitive, and then platform order would be better.
>
> Also, renaming WideString without providing proper conversion
> methods will most certainly break existing projects.
>
> Then there are a lot of nits to pick: the class comments are
> wrong; ByteString>>replaceFrom:... only creates 32-bit strings;
> bitShift is used all over the place where Smalltalk code traditionally
> uses * and //; what is TwoByteString>>printString good for; why does
> TwoByteString>>asByteString do an unnecessary copy; etc.
>
> Before inclusion this still needs a lot of work and testing.

Sounds like it. Thanks for the feedback, Bert.

Stef



Re: Unicode patch

Edgar J. De Cleene
In reply to this post by Janko Mivšek



On 6/14/07 3:23 PM, "Janko Mivšek" <[hidden email]> wrote:

> I just don't know yet the procedure by which patches from the
> community go through all the tests and careful eyes before being
> included in the main image. Is this written down somewhere? And for a
> start, where can I find 3.10?

Janko:

You can read about 3.10 at http://wiki.squeak.org/squeak/5919 and follow
the links from there. At http://wiki.squeak.org/squeak/5990 you can
complain about how 3.10 is going, and
http://ftp.squeak.org/3.10alpha/Squeak3.10alpha.7105.zip is the last
published image.

I hope to fix some mistakes soon and update to 7113 and beyond.

About packages, they must go into Package Universes now.
The image is getting smaller, to converge with Pavel's work.

Ralph is extending the image's quality control to packages; this work has
just begun.

Edgar




Re: Unicode patch

stephane ducasse
Edgar, sorry to repeat it, but could you send the list the changes
that have been harvested?
How can you expect people to trust this image if we do not know
what is harvested and busy people are not given a chance to comment?
Bert's feedback really illustrates that problem.

Stef

On 15 June 07, at 00:10, Edgar J. De Cleene wrote:

> On 6/14/07 3:23 PM, "Janko Mivšek" <[hidden email]> wrote:
>
>> I just don't know yet the procedure by which patches from the
>> community go through all the tests and careful eyes before being
>> included in the main image. Is this written down somewhere? And for a
>> start, where can I find 3.10?
>
> Janko:
>
> You can read about 3.10 at http://wiki.squeak.org/squeak/5919 and follow
> the links from there. At http://wiki.squeak.org/squeak/5990 you can
> complain about how 3.10 is going, and
> http://ftp.squeak.org/3.10alpha/Squeak3.10alpha.7105.zip is the last
> published image.
>
> I hope to fix some mistakes soon and update to 7113 and beyond.
>
> About packages, they must go into Package Universes now.
> The image is getting smaller, to converge with Pavel's work.
>
> Ralph is extending the image's quality control to packages; this work has
> just begun.
>
> Edgar



Re: Unicode patch

stephane ducasse
In reply to this post by Edgar J. De Cleene
> About packages, they must go into Package Universes now.

Why? What does it mean?

> The image is getting smaller, to converge with Pavel's work.
>
> Ralph is extending the image's quality control to packages; this work has
> just begun.

How?




Re: Unicode patch

Edgar J. De Cleene
In reply to this post by stephane ducasse



On 6/15/07 5:28 AM, "stephane ducasse" <[hidden email]> wrote:

> Edgar, sorry to repeat it, but could you send the list the changes
> that have been harvested?
> How can you expect people to trust this image if we do not know
> what is harvested and busy people are not given a chance to comment?
> Bert's feedback really illustrates that problem.
>
> Stef

If you read the swiki ....
I know you want me off the team, so write to Ralph and give me a break.


Edgar