Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
30 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Dale Henrichs-3


On 12/09/2015 12:44 AM, Stephan Eggermont wrote:

> On 08-12-15 22:35, Dale Henrichs wrote:
>> What I meant is that you can't _always_ use the code point for
>> collation, i.e., sorting based on the value of code points is not always
>> correct[1].
>
> I have given up on universal sorting when I learned that dutch
> libraries sorting of author names depends on the country of origin of
> the author. So if Jan van Beek is dutch he will be sorted under B,
> while if he's belgian under V. I haven't checked what happens if the
> author emigrates, or changes nationality...
>
> Stephan
>
Well, with ICU (and GemStone's implementation) you can choose which
collator to use (Country specific) at the image level or on a comparison
by comparison bases ... for example for an indexed collection (Unicode)
Strings, you can choose the collator to use for that particular index
... so while it's true that universal sorter is not possible, it is
possible to choose a collator that will satisfy a particlar customer ....

Dale


Reply | Threaded
Open this post in threaded view
|

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Levente Uzonyi
On Wed, 9 Dec 2015, Dale Henrichs wrote:

>
>
> On 12/09/2015 12:44 AM, Stephan Eggermont wrote:
>> On 08-12-15 22:35, Dale Henrichs wrote:
>>> What I meant is that you can't _always_ use the code point for
>>> collation, i.e., sorting based on the value of code points is not always
>>> correct[1].
>>
>> I have given up on universal sorting when I learned that dutch libraries
>> sorting of author names depends on the country of origin of the author. So
>> if Jan van Beek is dutch he will be sorted under B, while if he's belgian
>> under V. I haven't checked what happens if the author emigrates, or changes
>> nationality...
>>
>> Stephan
>>
> Well, with ICU (and GemStone's implementation) you can choose which collator
> to use (Country specific) at the image level or on a comparison by comparison
> bases ... for example for an indexed collection (Unicode) Strings, you can
> choose the collator to use for that particular index ... so while it's true
> that universal sorter is not possible, it is possible to choose a collator
> that will satisfy a particlar customer ....

I expect my image to compare strings using the codepoint-based (+language
tags) lexicographical method, because it's simple, deterministic and fast.
Imagine having failing tests just because your image uses different
default comparison methods based on some (external) parameter...
It's also a nightmare to find out why your program is slow on some
machine, while it's fast on another.

Levente

>
> Dale
>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Dale Henrichs-3


On 12/09/2015 04:31 PM, Levente Uzonyi wrote:

> On Wed, 9 Dec 2015, Dale Henrichs wrote:
>
>>
>>
>> On 12/09/2015 12:44 AM, Stephan Eggermont wrote:
>>> On 08-12-15 22:35, Dale Henrichs wrote:
>>>> What I meant is that you can't _always_ use the code point for
>>>> collation, i.e., sorting based on the value of code points is not
>>>> always
>>>> correct[1].
>>>
>>> I have given up on universal sorting when I learned that dutch
>>> libraries sorting of author names depends on the country of origin
>>> of the author. So if Jan van Beek is dutch he will be sorted under
>>> B, while if he's belgian under V. I haven't checked what happens if
>>> the author emigrates, or changes nationality...
>>>
>>> Stephan
>>>
>> Well, with ICU (and GemStone's implementation) you can choose which
>> collator to use (Country specific) at the image level or on a
>> comparison by comparison bases ... for example for an indexed
>> collection (Unicode) Strings, you can choose the collator to use for
>> that particular index ... so while it's true that universal sorter is
>> not possible, it is possible to choose a collator that will satisfy a
>> particlar customer ....
>
> I expect my image to compare strings using the codepoint-based
> (+language tags) lexicographical method, because it's simple,
> deterministic and fast.
> Imagine having failing tests just because your image uses different
> default comparison methods based on some (external) parameter...
> It's also a nightmare to find out why your program is slow on some
> machine, while it's fast on another.

When we implemented the Unicode support in GemStone we preserved the
legacy string classes and  their legacy behavior ... We added new
Unicode* classes with the new collator-based behavior for sorting and
comparison ... That way legacy applications (and legacy) tests were not
impacted by the choice of  collator ... And folks could choose whether
or not their application would benefit by the use of the new Unicode*
classes....

The ICU library performance is actually comparable to our original
implementations, so there isn't a noticeable performance difference - we
built the support into our vm and if folks are interested in some of the
gory technical details, we'd be willing to share our experience, as
there are several things that we did to minimize potential performance
impacts ---

Dale



Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Hannes Hirzel
In reply to this post by EuanM
On 12/15/15, Ben Coman <[hidden email]> wrote:

> On Thu, Dec 10, 2015 at 12:37 AM, Todd Blanchard <[hidden email]>
> wrote:
>> They are practically the same thing.
>>
>> ICU was developed by Taligent which was a joint venture between Apple and
>> IBM.  Makes sense that NSString and ICU's UnicodeString are pretty close
>> in implementation.  ICU was also ported to Java for Sun by IBM.  The point
>> is - this is a very elaborate chunk of code with far reach. If ICU is
>> wrong on some point - it is universally wrong and thus likely to be taken
>> as "right" as it is at least consistent.  I think re-implementing it is
>> folly TBH.  Just use it.
>
> Apple seem to have moved on from NSString to support Unicode in a
> different way in Switft...

Could you please give some more details?

I have read
https://www.objc.io/issues/9-strings/unicode/#nsstring-and-unicode
so far.

It says that an NSString object actually represents an array of
UTF-16-encoded code units.

This in contrast to Squeak / Pharo where a String is an
ArrayedCollection of 21 bit Unicode code points (transparently
optimizing to a ByteArray if the string only contains values of the
first code page).


>>
>>> On Dec 8, 2015, at 15:52, EuanM <[hidden email]> wrote:
>>>
>>> Equally old are the NextStep Object C functions which are now embodied
>>> within MacOS X.
>>>
>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Ben Coman
On Wed, Dec 16, 2015 at 12:05 AM, H. Hirzel <[hidden email]> wrote:

> On 12/15/15, Ben Coman <[hidden email]> wrote:
>> On Thu, Dec 10, 2015 at 12:37 AM, Todd Blanchard <[hidden email]>
>> wrote:
>>> They are practically the same thing.
>>>
>>> ICU was developed by Taligent which was a joint venture between Apple and
>>> IBM.  Makes sense that NSString and ICU's UnicodeString are pretty close
>>> in implementation.  ICU was also ported to Java for Sun by IBM.  The point
>>> is - this is a very elaborate chunk of code with far reach. If ICU is
>>> wrong on some point - it is universally wrong and thus likely to be taken
>>> as "right" as it is at least consistent.  I think re-implementing it is
>>> folly TBH.  Just use it.
>>
>> Apple seem to have moved on from NSString to support Unicode in a
>> different way in Switft...
>
> Could you please give some more details?

http://oleb.net/blog/2014/07/swift-strings/
"Swift's change has the potential to prevent many common errors when
dealing with string lengths or substrings. It is a huge difference to
most other Unicode-aware string libraries (including NSString) where
the building blocks of a string are usually UTF-16 code units or
single Unicode scalars"

cheers -ben

>
> I have read
> https://www.objc.io/issues/9-strings/unicode/#nsstring-and-unicode
> so far.
>
> It says that an NSString object actually represents an array of
> UTF-16-encoded code units.
>
> This in contrast to Squeak / Pharo where a String is an
> ArrayedCollection of 21 bit Unicode code points (transparently
> optimizing to a ByteArray if the string only contains values of the
> first code page).
>
>
>>>
>>>> On Dec 8, 2015, at 15:52, EuanM <[hidden email]> wrote:
>>>>
>>>> Equally old are the NextStep Object C functions which are now embodied
>>>> within MacOS X.
>>>>
>>>
>>>
>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Eliot Miranda-2
In reply to this post by Dale Henrichs-3
Hi Dale,

On Thu, Dec 10, 2015 at 12:27 PM, Dale Henrichs <[hidden email]> wrote:


On 12/09/2015 04:31 PM, Levente Uzonyi wrote:
On Wed, 9 Dec 2015, Dale Henrichs wrote:



On 12/09/2015 12:44 AM, Stephan Eggermont wrote:
On 08-12-15 22:35, Dale Henrichs wrote:
What I meant is that you can't _always_ use the code point for
collation, i.e., sorting based on the value of code points is not always
correct[1].

I have given up on universal sorting when I learned that dutch libraries sorting of author names depends on the country of origin of the author. So if Jan van Beek is dutch he will be sorted under B, while if he's belgian under V. I haven't checked what happens if the author emigrates, or changes nationality...

Stephan

Well, with ICU (and GemStone's implementation) you can choose which collator to use (Country specific) at the image level or on a comparison by comparison bases ... for example for an indexed collection (Unicode) Strings, you can choose the collator to use for that particular index ... so while it's true that universal sorter is not possible, it is possible to choose a collator that will satisfy a particlar customer ....

I expect my image to compare strings using the codepoint-based (+language tags) lexicographical method, because it's simple, deterministic and fast.
Imagine having failing tests just because your image uses different default comparison methods based on some (external) parameter...
It's also a nightmare to find out why your program is slow on some machine, while it's fast on another.

When we implemented the Unicode support in GemStone we preserved the legacy string classes and  their legacy behavior ... We added new Unicode* classes with the new collator-based behavior for sorting and comparison ... That way legacy applications (and legacy) tests were not impacted by the choice of  collator ... And folks could choose whether or not their application would benefit by the use of the new Unicode* classes....

The ICU library performance is actually comparable to our original implementations, so there isn't a noticeable performance difference - we built the support into our vm and if folks are interested in some of the gory technical details, we'd be willing to share our experience, as there are several things that we did to minimize potential performance impacts ---

Just so you know, I will dig my heels in as deeply as I am able to prevent the use of C++ libraries in the VM.  It destroys the simulator, which is the most important thing we have for VM development productivity.  As far as I'm concerned any use of external libraries to implement core functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo) are built upon.  So for me it's a non-starter.  I hope others agree.

_,,,^..^,,,_
best, Eliot


Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Eliot Miranda-2
Hi Todd,

On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <[hidden email]> wrote:
Hi Eliot,

On Dec 15, 2015, at 13:46, Eliot Miranda <[hidden email]> wrote:

Just so you know, I will dig my heels in as deeply as I am able to prevent the use of C++ libraries in the VM.  It destroys the simulator, which is the most important thing we have for VM development productivity.  As far as I'm concerned any use of external libraries to implement core functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo) are built upon.

OK, I defer to you because you certainly know more about the VM internals and what does and doesn't work well than anyone else.  

So I guess I would like to know your recommendation for 1) how best to store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we get to worry about endian).  

Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it should start as an 8-bit-per-Unicode-character string.  Attempts to store Character values that won't fit cause the String to become a String whose element size is large enough to accommodate the character.  In Spur, become: is cheap so this growth pays only for the reallocation and copying of the at a, not for an expensive heap scan necessary to do the become:.


Bearing in mind that both representations are variable length and so while accessing the n'th byte/word is O(1), accessing the n'th character is necessarily O(n) unless you know you have no surrogates in your string.

Right, so UTF-8 and UTF-16 are not convenient representations and to be provided only for interchange.
 

Also...since NSString has been mentioned...it is worth noting that NSString is built atop CFString (source code here: https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which does a fair job of optimizing memory by using bytes where it can and shorts where it cannot.  It is also worth noting that characterAt: actually does the wrong thing, since it assumes characters are no bigger than FFFF rather than 10FFFF.  

Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one has become: it is very easy to manage.  Now with Spur not only do we have become:, we have a fairly fast become:.

Does this make sense?
 
Also...I'll just toss in this very nice article on unicode and how NSString deals with it.

-Todd Blanchard
 
_,,,^..^,,,_
best, Eliot


Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Hannes Hirzel
On 12/16/15, Eliot Miranda <[hidden email]> wrote:

> Hi Todd,
>
> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <[hidden email]> wrote:
>
>> Hi Eliot,
>>
>> On Dec 15, 2015, at 13:46, Eliot Miranda <[hidden email]> wrote:
>>
>> Just so you know, I will dig my heels in as deeply as I am able to
>> prevent
>> the use of C++ libraries in the VM.  It destroys the simulator, which is
>> the most important thing we have for VM development productivity.  As far
>> as I'm concerned any use of external libraries to implement core
>> functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo)
>> are
>> built upon.
>>
>>
>> OK, I defer to you because you certainly know more about the VM internals
>> and what does and doesn't work well than anyone else.
>>
>> So I guess I would like to know your recommendation for 1) how best to
>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>> get to worry about endian).
>>
>
> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
> should start as an 8-bit-per-Unicode-character string.  Attempts to store
> Character values that won't fit cause the String to become a String whose
> element size is large enough to accommodate the character.

This is the case: see tests here http://wiki.squeak.org/squeak/6316

In Spur,
> become: is cheap so this growth pays only for the reallocation and copying
> of the at a, not for an expensive heap scan necessary to do the become:.

Could you elaborate on this please?

>
>
>> Bearing in mind that both representations are variable length and so
>> while
>> accessing the n'th byte/word is O(1), accessing the n'th character is
>> necessarily O(n) unless you know you have no surrogates in your string.
>>
>
> Right, so UTF-8 and UTF-16 are not convenient representations and to be
> provided only for interchange.

+1

Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
There are converters for UTF-8 and UTF-16.



>
>>
>> Also...since NSString has been mentioned...it is worth noting that
>> NSString is built atop CFString (source code here:
>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>> does a fair job of optimizing memory by using bytes where it can and
>> shorts
>> where it cannot.  It is also worth noting that characterAt: actually does
>> the wrong thing, since it assumes characters are no bigger than FFFF
>> rather
>> than 10FFFF.
>>
>
> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
> has become: it is very easy to manage.  Now with Spur not only do we have
> become:, we have a fairly fast become:.
>
> Does this make sense?
>
>
>> Also...I'll just toss in this very nice article on unicode and how
>> NSString deals with it.
>> https://www.objc.io/issues/9-strings/unicode/
>>
>> -Todd Blanchard
>>
>
> _,,,^..^,,,_
> best, Eliot
>

Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Ben Coman
On Wed, Dec 16, 2015 at 6:22 PM, H. Hirzel <[hidden email]> wrote:

>
> On 12/16/15, Eliot Miranda <[hidden email]> wrote:
>> Hi Todd,
>>
>> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <[hidden email]> wrote:
>>
>>> Hi Eliot,
>>>
>>> On Dec 15, 2015, at 13:46, Eliot Miranda <[hidden email]> wrote:
>>>
>>> Just so you know, I will dig my heels in as deeply as I am able to
>>> prevent
>>> the use of C++ libraries in the VM.  It destroys the simulator, which is
>>> the most important thing we have for VM development productivity.  As far
>>> as I'm concerned any use of external libraries to implement core
>>> functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo)
>>> are
>>> built upon.
>>>
>>>
>>> OK, I defer to you because you certainly know more about the VM internals
>>> and what does and doesn't work well than anyone else.
>>>
>>> So I guess I would like to know your recommendation for 1) how best to
>>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>>> get to worry about endian).
>>>
>>
>> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
>> should start as an 8-bit-per-Unicode-character string.  Attempts to store
>> Character values that won't fit cause the String to become a String whose
>> element size is large enough to accommodate the character.
>
> This is the case: see tests here http://wiki.squeak.org/squeak/6316
>
> In Spur,
>> become: is cheap so this growth pays only for the reallocation and copying
>> of the at a, not for an expensive heap scan necessary to do the become:.
>
> Could you elaborate on this please?


https://hal.inria.fr/hal-01152610/file/partialReadBarrier.pdf

cheers -ben


>>
>>
>>> Bearing in mind that both representations are variable length and so
>>> while
>>> accessing the n'th byte/word is O(1), accessing the n'th character is
>>> necessarily O(n) unless you know you have no surrogates in your string.
>>>
>>
>> Right, so UTF-8 and UTF-16 are not convenient representations and to be
>> provided only for interchange.
>
> +1
>
> Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
> There are converters for UTF-8 and UTF-16.
>
>
>
>>
>>>
>>> Also...since NSString has been mentioned...it is worth noting that
>>> NSString is built atop CFString (source code here:
>>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>>> does a fair job of optimizing memory by using bytes where it can and
>>> shorts
>>> where it cannot.  It is also worth noting that characterAt: actually does
>>> the wrong thing, since it assumes characters are no bigger than FFFF
>>> rather
>>> than 10FFFF.
>>>
>>
>> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
>> has become: it is very easy to manage.  Now with Spur not only do we have
>> become:, we have a fairly fast become:.
>>
>> Does this make sense?
>>
>>
>>> Also...I'll just toss in this very nice article on unicode and how
>>> NSString deals with it.
>>> https://www.objc.io/issues/9-strings/unicode/
>>>
>>> -Todd Blanchard
>>>
>>
>> _,,,^..^,,,_
>> best, Eliot
>>

Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Eliot Miranda-2
In reply to this post by Hannes Hirzel
Hi Hannes,

> On Dec 16, 2015, at 2:22 AM, H. Hirzel <[hidden email]> wrote:
>
>
>> On 12/16/15, Eliot Miranda <[hidden email]> wrote:
>> Hi Todd,
>>
>>> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <[hidden email]> wrote:
>>>
>>> Hi Eliot,
>>>
>>> On Dec 15, 2015, at 13:46, Eliot Miranda <[hidden email]> wrote:
>>>
>>> Just so you know, I will dig my heels in as deeply as I am able to
>>> prevent
>>> the use of C++ libraries in the VM.  It destroys the simulator, which is
>>> the most important thing we have for VM development productivity.  As far
>>> as I'm concerned any use of external libraries to implement core
>>> functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo)
>>> are
>>> built upon.
>>>
>>>
>>> OK, I defer to you because you certainly know more about the VM internals
>>> and what does and doesn't work well than anyone else.
>>>
>>> So I guess I would like to know your recommendation for 1) how best to
>>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>>> get to worry about endian).
>>
>> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
>> should start as an 8-bit-per-Unicode-character string.  Attempts to store
>> Character values that won't fit cause the String to become a String whose
>> element size is large enough to accommodate the character.
>
> This is the case: see tests here http://wiki.squeak.org/squeak/6316
>
> In Spur,
>> become: is cheap so this growth pays only for the reallocation and copying
>> of the at a, not for an expensive heap scan necessary to do the become:.
>
> Could you elaborate on this please?

Well, there are two presentations online, one at ESUG 2014, and one at Splash 2015, a paper at ISMM 2015, and a blog post if you want a full account, but...

Spur supports lazy become via forwarders.  When an object is becommed, it is turned into a forwarder to the object it becomes, and when a pair of objects are becommed each is copied and the two originals turned into forwarders to the opposite copy.  Immediately after the forwarding the receiver in each stack frame in the stack zone is scanned to follow forwarders so that no read barrier is needed when accessing inst vars.

Forwarders have a unique class index so any message send to a forwarder will fail. The check for sends to forwarders needs to be done only just before a full lookup.  A forwarder looks different to other objects (they have a unique format field in their object header) so any primitive that encounters a forwarder in its operands will fail (since primitives validate their arguments).  So in primitive failure the VM checks for forwarders amongst the operands and if any are found, they are followed and the primitive is retried.

There are read barriers in some operations, hence I call the scheme a partial read barrier.  The main cost is the stack zone scan which must be performed for pointer objects (since only these can have inst vars).    This takes very little time on current machinery.  Becoming bit objects like strings is faster because no scan is needed.


>>> Bearing in mind that both representations are variable length and so
>>> while
>>> accessing the n'th byte/word is O(1), accessing the n'th character is
>>> necessarily O(n) unless you know you have no surrogates in your string.
>>
>> Right, so UTF-8 and UTF-16 are not convenient representations and to be
>> provided only for interchange.
>
> +1
>
> Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
> There are converters for UTF-8 and UTF-16.
>
>>> Also...since NSString has been mentioned...it is worth noting that
>>> NSString is built atop CFString (source code here:
>>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>>> does a fair job of optimizing memory by using bytes where it can and
>>> shorts
>>> where it cannot.  It is also worth noting that characterAt: actually does
>>> the wrong thing, since it assumes characters are no bigger than FFFF
>>> rather
>>> than 10FFFF.
>>
>> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
>> has become: it is very easy to manage.  Now with Spur not only do we have
>> become:, we have a fairly fast become:.
>>
>> Does this make sense?
>>
>>
>>> Also...I'll just toss in this very nice article on unicode and how
>>> NSString deals with it.
>>> https://www.objc.io/issues/9-strings/unicode/
>>>
>>> -Todd Blanchard
>>
>> _,,,^..^,,,_
>> best, Eliot

12