Re: [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Eliot Miranda-2
 
Hi Dale,

On Thu, Dec 10, 2015 at 12:27 PM, Dale Henrichs <[hidden email]> wrote:


On 12/09/2015 04:31 PM, Levente Uzonyi wrote:
On Wed, 9 Dec 2015, Dale Henrichs wrote:



On 12/09/2015 12:44 AM, Stephan Eggermont wrote:
On 08-12-15 22:35, Dale Henrichs wrote:
What I meant is that you can't _always_ use the code point for
collation, i.e., sorting based on the value of code points is not always
correct[1].

I have given up on universal sorting when I learned that dutch libraries sorting of author names depends on the country of origin of the author. So if Jan van Beek is dutch he will be sorted under B, while if he's belgian under V. I haven't checked what happens if the author emigrates, or changes nationality...

Stephan

Well, with ICU (and GemStone's implementation) you can choose which collator to use (Country specific) at the image level or on a comparison by comparison bases ... for example for an indexed collection (Unicode) Strings, you can choose the collator to use for that particular index ... so while it's true that universal sorter is not possible, it is possible to choose a collator that will satisfy a particlar customer ....

I expect my image to compare strings using the codepoint-based (+language tags) lexicographical method, because it's simple, deterministic and fast.
Imagine having failing tests just because your image uses different default comparison methods based on some (external) parameter...
It's also a nightmare to find out why your program is slow on some machine, while it's fast on another.

When we implemented the Unicode support in GemStone we preserved the legacy string classes and  their legacy behavior ... We added new Unicode* classes with the new collator-based behavior for sorting and comparison ... That way legacy applications (and legacy) tests were not impacted by the choice of  collator ... And folks could choose whether or not their application would benefit by the use of the new Unicode* classes....

The ICU library performance is actually comparable to our original implementations, so there isn't a noticeable performance difference - we built the support into our vm and if folks are interested in some of the gory technical details, we'd be willing to share our experience, as there are several things that we did to minimize potential performance impacts ---

Just so you know, I will dig my heels in as deeply as I am able to prevent the use of C++ libraries in the VM.  It destroys the simulator, which is the most important thing we have for VM development productivity.  As far as I'm concerned any use of external libraries to implement core functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo) are built upon.  So for me it's a non-starter.  I hope others agree.

_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Eliot Miranda-2
 
Hi Todd,

On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <[hidden email]> wrote:
Hi Eliot,

On Dec 15, 2015, at 13:46, Eliot Miranda <[hidden email]> wrote:

Just so you know, I will dig my heels in as deeply as I am able to prevent the use of C++ libraries in the VM.  It destroys the simulator, which is the most important thing we have for VM development productivity.  As far as I'm concerned any use of external libraries to implement core functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo) are built upon.

OK, I defer to you because you certainly know more about the VM internals and what does and doesn't work well than anyone else.  

So I guess I would like to know your recommendation for 1) how best to store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we get to worry about endian).  

Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it should start as an 8-bit-per-Unicode-character string.  Attempts to store Character values that won't fit cause the String to become a String whose element size is large enough to accommodate the character.  In Spur, become: is cheap so this growth pays only for the reallocation and copying of the at a, not for an expensive heap scan necessary to do the become:.


Bearing in mind that both representations are variable length and so while accessing the n'th byte/word is O(1), accessing the n'th character is necessarily O(n) unless you know you have no surrogates in your string.

Right, so UTF-8 and UTF-16 are not convenient representations and to be provided only for interchange.
 

Also...since NSString has been mentioned...it is worth noting that NSString is built atop CFString (source code here: https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which does a fair job of optimizing memory by using bytes where it can and shorts where it cannot.  It is also worth noting that characterAt: actually does the wrong thing, since it assumes characters are no bigger than FFFF rather than 10FFFF.  

Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one has become: it is very easy to manage.  Now with Spur not only do we have become:, we have a fairly fast become:.

Does this make sense?
 
Also...I'll just toss in this very nice article on unicode and how NSString deals with it.

-Todd Blanchard
 
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Hannes Hirzel
 
On 12/16/15, Eliot Miranda <[hidden email]> wrote:

> Hi Todd,
>
> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <[hidden email]> wrote:
>
>> Hi Eliot,
>>
>> On Dec 15, 2015, at 13:46, Eliot Miranda <[hidden email]> wrote:
>>
>> Just so you know, I will dig my heels in as deeply as I am able to
>> prevent
>> the use of C++ libraries in the VM.  It destroys the simulator, which is
>> the most important thing we have for VM development productivity.  As far
>> as I'm concerned any use of external libraries to implement core
>> functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo)
>> are
>> built upon.
>>
>>
>> OK, I defer to you because you certainly know more about the VM internals
>> and what does and doesn't work well than anyone else.
>>
>> So I guess I would like to know your recommendation for 1) how best to
>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>> get to worry about endian).
>>
>
> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
> should start as an 8-bit-per-Unicode-character string.  Attempts to store
> Character values that won't fit cause the String to become a String whose
> element size is large enough to accommodate the character.

This is the case: see tests here http://wiki.squeak.org/squeak/6316

In Spur,
> become: is cheap so this growth pays only for the reallocation and copying
> of the at a, not for an expensive heap scan necessary to do the become:.

Could you elaborate on this please?

>
>
>> Bearing in mind that both representations are variable length and so
>> while
>> accessing the n'th byte/word is O(1), accessing the n'th character is
>> necessarily O(n) unless you know you have no surrogates in your string.
>>
>
> Right, so UTF-8 and UTF-16 are not convenient representations and to be
> provided only for interchange.

+1

Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
There are converters for UTF-8 and UTF-16.



>
>>
>> Also...since NSString has been mentioned...it is worth noting that
>> NSString is built atop CFString (source code here:
>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>> does a fair job of optimizing memory by using bytes where it can and
>> shorts
>> where it cannot.  It is also worth noting that characterAt: actually does
>> the wrong thing, since it assumes characters are no bigger than FFFF
>> rather
>> than 10FFFF.
>>
>
> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
> has become: it is very easy to manage.  Now with Spur not only do we have
> become:, we have a fairly fast become:.
>
> Does this make sense?
>
>
>> Also...I'll just toss in this very nice article on unicode and how
>> NSString deals with it.
>> https://www.objc.io/issues/9-strings/unicode/
>>
>> -Todd Blanchard
>>
>
> _,,,^..^,,,_
> best, Eliot
>
Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Ben Coman
 
On Wed, Dec 16, 2015 at 6:22 PM, H. Hirzel <[hidden email]> wrote:

>
> On 12/16/15, Eliot Miranda <[hidden email]> wrote:
>> Hi Todd,
>>
>> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <[hidden email]> wrote:
>>
>>> Hi Eliot,
>>>
>>> On Dec 15, 2015, at 13:46, Eliot Miranda <[hidden email]> wrote:
>>>
>>> Just so you know, I will dig my heels in as deeply as I am able to
>>> prevent
>>> the use of C++ libraries in the VM.  It destroys the simulator, which is
>>> the most important thing we have for VM development productivity.  As far
>>> as I'm concerned any use of external libraries to implement core
>>> functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo)
>>> are
>>> built upon.
>>>
>>>
>>> OK, I defer to you because you certainly know more about the VM internals
>>> and what does and doesn't work well than anyone else.
>>>
>>> So I guess I would like to know your recommendation for 1) how best to
>>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>>> get to worry about endian).
>>>
>>
>> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
>> should start as an 8-bit-per-Unicode-character string.  Attempts to store
>> Character values that won't fit cause the String to become a String whose
>> element size is large enough to accommodate the character.
>
> This is the case: see tests here http://wiki.squeak.org/squeak/6316
>
> In Spur,
>> become: is cheap so this growth pays only for the reallocation and copying
>> of the at a, not for an expensive heap scan necessary to do the become:.
>
> Could you elaborate on this please?


https://hal.inria.fr/hal-01152610/file/partialReadBarrier.pdf

cheers -ben


>>
>>
>>> Bearing in mind that both representations are variable length and so
>>> while
>>> accessing the n'th byte/word is O(1), accessing the n'th character is
>>> necessarily O(n) unless you know you have no surrogates in your string.
>>>
>>
>> Right, so UTF-8 and UTF-16 are not convenient representations and to be
>> provided only for interchange.
>
> +1
>
> Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
> There are converters for UTF-8 and UTF-16.
>
>
>
>>
>>>
>>> Also...since NSString has been mentioned...it is worth noting that
>>> NSString is built atop CFString (source code here:
>>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>>> does a fair job of optimizing memory by using bytes where it can and
>>> shorts
>>> where it cannot.  It is also worth noting that characterAt: actually does
>>> the wrong thing, since it assumes characters are no bigger than FFFF
>>> rather
>>> than 10FFFF.
>>>
>>
>> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
>> has become: it is very easy to manage.  Now with Spur not only do we have
>> become:, we have a fairly fast become:.
>>
>> Does this make sense?
>>
>>
>>> Also...I'll just toss in this very nice article on unicode and how
>>> NSString deals with it.
>>> https://www.objc.io/issues/9-strings/unicode/
>>>
>>> -Todd Blanchard
>>>
>>
>> _,,,^..^,,,_
>> best, Eliot
>>
Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Eliot Miranda-2
In reply to this post by Hannes Hirzel

Hi Hannes,

> On Dec 16, 2015, at 2:22 AM, H. Hirzel <[hidden email]> wrote:
>
>
>> On 12/16/15, Eliot Miranda <[hidden email]> wrote:
>> Hi Todd,
>>
>>> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <[hidden email]> wrote:
>>>
>>> Hi Eliot,
>>>
>>> On Dec 15, 2015, at 13:46, Eliot Miranda <[hidden email]> wrote:
>>>
>>> Just so you know, I will dig my heels in as deeply as I am able to
>>> prevent
>>> the use of C++ libraries in the VM.  It destroys the simulator, which is
>>> the most important thing we have for VM development productivity.  As far
>>> as I'm concerned any use of external libraries to implement core
>>> functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo)
>>> are
>>> built upon.
>>>
>>>
>>> OK, I defer to you because you certainly know more about the VM internals
>>> and what does and doesn't work well than anyone else.
>>>
>>> So I guess I would like to know your recommendation for 1) how best to
>>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>>> get to worry about endian).
>>
>> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
>> should start as an 8-bit-per-Unicode-character string.  Attempts to store
>> Character values that won't fit cause the String to become a String whose
>> element size is large enough to accommodate the character.
>
> This is the case: see tests here http://wiki.squeak.org/squeak/6316
>
> In Spur,
>> become: is cheap so this growth pays only for the reallocation and copying
>> of the at a, not for an expensive heap scan necessary to do the become:.
>
> Could you elaborate on this please?

Well, there are two presentations online, one at ESUG 2014, and one at Splash 2015, a paper at ISMM 2015, and a blog post if you want a full account, but...

Spur supports lazy become via forwarders.  When an object is becommed, it is turned into a forwarder to the object it becomes, and when a pair of objects are becommed each is copied and the two originals turned into forwarders to the opposite copy.  Immediately after the forwarding the receiver in each stack frame in the stack zone is scanned to follow forwarders so that no read barrier is needed when accessing inst vars.

Forwarders have a unique class index so any message send to a forwarder will fail. The check for sends to forwarders needs to be done only just before a full lookup.  A forwarder looks different to other objects (they have a unique format field in their object header) so any primitive that encounters a forwarder in its operands will fail (since primitives validate their arguments).  So in primitive failure the VM checks for forwarders amongst the operands and if any are found, they are followed and the primitive is retried.

There are read barriers in some operations, hence I call the scheme a partial read barrier.  The main cost is the stack zone scan which must be performed for pointer objects (since only these can have inst vars).    This takes very little time on current machinery.  Becoming bit objects like strings is faster because no scan is needed.


>>> Bearing in mind that both representations are variable length and so
>>> while
>>> accessing the n'th byte/word is O(1), accessing the n'th character is
>>> necessarily O(n) unless you know you have no surrogates in your string.
>>
>> Right, so UTF-8 and UTF-16 are not convenient representations and to be
>> provided only for interchange.
>
> +1
>
> Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
> There are converters for UTF-8 and UTF-16.
>
>>> Also...since NSString has been mentioned...it is worth noting that
>>> NSString is built atop CFString (source code here:
>>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>>> does a fair job of optimizing memory by using bytes where it can and
>>> shorts
>>> where it cannot.  It is also worth noting that characterAt: actually does
>>> the wrong thing, since it assumes characters are no bigger than FFFF
>>> rather
>>> than 10FFFF.
>>
>> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
>> has become: it is very easy to manage.  Now with Spur not only do we have
>> become:, we have a fairly fast become:.
>>
>> Does this make sense?
>>
>>
>>> Also...I'll just toss in this very nice article on unicode and how
>>> NSString deals with it.
>>> https://www.objc.io/issues/9-strings/unicode/
>>>
>>> -Todd Blanchard
>>
>> _,,,^..^,,,_
>> best, Eliot