Smalltalk › Gemtalk › GLASS

Re: [Glass] case insensitive search broken for Unicode7

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

6 messages Options

Pieter Nagel-3

Re: [Glass] case insensitive search broken for Unicode7

Please note that Unicode16 and Unicode32 have the same problem as well.

I can confirm that the patch that Mariano Martinez Peck posted [*] works
for all three classes, at least in the sense that none of our unit tests
catch any error, and it fixes the insensitve sorting issue (for those who
don't care about using an ICU collator).

US-centric system *might* get away with patching Unicode7 alone, but then
you'd have to be pretty sure that you never deal characters beyond strict
ASCII, i.e. surnames with accented characters.

So I suggest people who need to patch Unicode7 do it to Unicode16 and
Unicode32 too.

[*] Unicode7 >> _findString: subString startingAt: startIndex ignoreCase:
aBoolean

^ super _findString: subString startingAt: startIndex ignoreCase: aBoolean

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Pieter Nagel-3

Re: [Glass] case insensitive search broken for Unicode7

The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase: delegates
to the ICU collator only in the case where ignoreCase is true. Is this
correct?

The reasoning seems to be that a case-sensitive match in Unicode can be done
by just comparing the byte values of the two strings for identity, as the
super
implementation presumably does. But since some letters can be decomposed into
multiple codepoints in canonical and non-canonical ways[1], that's not
true. And
I suppose surrogate escapes factor in here as true.

To be honest, I'm not familiar enough with ICU to know whether it
(optionally?)
takes character decomposition into account on comparison, but I would
guess that
issues like these are precisely why an industrial-strength Unicode handling
library was needed in the first place.

What does this look like in 3.2? Does your testing include comparisons
where the
two strings have characters decomposed in different ways?

[1] I.e. U+00E9 LATIN SMALL LETTER E WITH ACUTE can also be decomposed as
U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Dale Henrichs-3

Re: [Glass] case insensitive search broken for Unicode7

Pieter,

The engineer responsible for the ICU implementation has been on vacation all week, so I haven't had a chance to discuss the #_findString:startingAt:ignoreCase: issue with him ...

With that said we have been "discovering" things about mixed Unicode[7|16|32] (where a collator is always used) and [DoubleByte|QuadByte]String classes and the basic conclusion that we've come to is that:

"it does not make sense to attempt to perfomed mixed comparisons

between Unicode* and *String instances"

For 3.2 the default mechanism ("Legacy comparison mode" ) will be that it is an error to attempt to compared a Unicode* instance with a *String instance, unless explicitly specifies a collator to be used for the comparison. We will also support "Unicode comparison mode" in which all comparisons for both Unicode* instances and *String instances will use the default collator.

I am planning to make "Unicode comparison mode" the default for GLASS.

There is an implication for upgrading a repository from 3.1 to 3.2, from the perspective that if you have "sorted" collections with mixed instances, it is possible that the sort order will change in "Unicode comparison mode" and you will start getting errors in "Legacy comparison mode" ...

I think that if you do have mixed "sorted" collections today that the sort order is probably not deterministic. In 3.1 a comparison with a *String receiver and a Unicode* argument may give a different answer if you reverse the comparison and have a Unicode* receiver and a *String argument ... common occurrences in sorted collections:)

If it isn't already obvious, 3.1 was our first attempt at introducing Unicode into the system and being a largely US-ASCII shop, we did not recognize the perils of mixing encoding schemes until we started trying to support Unicode indexing ...

I think the "comparison modes" are the right answer for allowing our existing customer's applications to continue to function correctly within the String* universe while providing a pathway towards a purely Unicode-based system.

Even with the 3.2 release, we will leave a couple of loose ends to be cleaned up in 3.2.1, mainly in the area of fileout ... we'd like to standardize on UTF8 filein/fileout (while still supporting "gemstone-encoding" filein/fileout for existing customer), but we're defering that work for 3.2 ...

So for 3.2 I would expect taht in "Legacy comparison mode" it will be an error to do a mixed #_findString:startingAt:ignoreCase:. For a Unicode* #_find... (or in "Unicode comparison mode") the default collator will be used...

Dale

On Thu, Mar 27, 2014 at 5:19 AM, Pieter Nagel <[hidden email]> wrote:

The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase: delegates
to the ICU collator only in the case where ignoreCase is true. Is this
correct?

The reasoning seems to be that a case-sensitive match in Unicode can be done
by just comparing the byte values of the two strings for identity, as the
super
implementation presumably does. But since some letters can be decomposed into
multiple codepoints in canonical and non-canonical ways[1], that's not
true. And
I suppose surrogate escapes factor in here as true.

To be honest, I'm not familiar enough with ICU to know whether it
(optionally?)
takes character decomposition into account on comparison, but I would
guess that
issues like these are precisely why an industrial-strength Unicode handling
library was needed in the first place.

What does this look like in 3.2? Does your testing include comparisons
where the
two strings have characters decomposed in different ways?

[1] I.e. U+00E9 LATIN SMALL LETTER E WITH ACUTE can also be decomposed as
U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Pieter Nagel-3

Re: [Glass] case insensitive search broken for Unicode7

In reply to this post by Pieter Nagel-3

Hi Dale,

> "it does not make sense to attempt to perfomed mixed comparisons
> between Unicode* and *String instances"

I've been trying to wrap my head around that statement, and they way that
"Legacy comparison mode" and "Unicode comparison mode" seem to hinge
around whether a collator needs to be specified or defaulted.

To my mind the choice of collator is a totally orthogonal concern to
whether mixed comparisons "make sense". Well defined comparisons are
purely dependent on whether the _encoding_ of *String instances is known.

Unicode is a strict superset of all string encodings, and therefore mixed
comparisons between the two should be a well-defined operation, *assuming*
the encoding used by *String instances is known. No matter what character
appears in a *String instance, it will correspond to a Unicode character
and can thus be used as part of a comparison to a Unicode string.

The choice of collator has no bearing on whether comparison is a
well-defined operation or not. It only affects issues like
language-dependent conventions where, for example, Germans want "ß" to
sort equal to "ss".

How passing in a collator will transform a mixed String/Unicode comparison
from an error to not does not make sense to me. If the String instance's
encoding was known, the comparison was already well-defined and not really
an error in the first place. If the String's encoding is not known (or not
specified by the programmer), then the collator will be fed garbage input
- how should it know whether a specific byte 0xBC in the input was
supposed to mean "Vulgar Fraction One Quarter" (ISO-8859-1) or "Latin
Capital Ligature OE" (ISO-8859-15)?

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Mariano Martinez Peck

Re: [Glass] case insensitive search broken for Unicode7

In reply to this post by Dale Henrichs-3

On Fri, Mar 28, 2014 at 2:55 PM, Dale Henrichs <[hidden email]> wrote:

Pieter,

The engineer responsible for the ICU implementation has been on vacation all week, so I haven't had a chance to discuss the #_findString:startingAt:ignoreCase: issue with him ...

With that said we have been "discovering" things about mixed Unicode[7|16|32] (where a collator is always used) and [DoubleByte|QuadByte]String classes and the basic conclusion that we've come to is that:

"it does not make sense to attempt to perfomed mixed comparisons
between Unicode* and *String instances"

Dale,

While I understand such conclusion, I think this is a different discussion than the one I originally pasted, isn't it?

('Newmont' asUnicodeString _findString: 'newm' asUnicodeString startingAt: 1 ignoreCase: true) > 0

answers false and in this case I am not mixing anything...both are Unicode7. So what I mean is that this is broken even without mixing.

Also...note that users may be comparing/mixing Unicode* and String* WITHOUT knowing. For example...in my case, I don't know how but I get Unicode7 from a combo list from a magritte form.... (of course, in Pharo I get a String) and then I search over that result.... So if you were to choose "Legacy comparison mode" for GLASS...then we should at least avoid using Unicode classes in Seaside/magritte. Otherwise it would be a pain to maintain a working system for Pharo and GemStone. So..this is just to agree that "Unicode comparison mode" would be better for GLASS ?

Thanks,

Dale

On Thu, Mar 27, 2014 at 5:19 AM, Pieter Nagel <[hidden email]> wrote:

The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase: delegates
to the ICU collator only in the case where ignoreCase is true. Is this
correct?

The reasoning seems to be that a case-sensitive match in Unicode can be done
by just comparing the byte values of the two strings for identity, as the
super
implementation presumably does. But since some letters can be decomposed into
multiple codepoints in canonical and non-canonical ways[1], that's not
true. And
I suppose surrogate escapes factor in here as true.

To be honest, I'm not familiar enough with ICU to know whether it
(optionally?)
takes character decomposition into account on comparison, but I would
guess that
issues like these are precisely why an industrial-strength Unicode handling
library was needed in the first place.

What does this look like in 3.2? Does your testing include comparisons
where the
two strings have characters decomposed in different ways?

[1] I.e. U+00E9 LATIN SMALL LETTER E WITH ACUTE can also be decomposed as
U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

--
Mariano
http://marianopeck.wordpress.com

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Dale Henrichs-3

Re: [Glass] case insensitive search broken for Unicode7

Mariano,

You are correct that Unicode to Unicode should work and the error that you are seeing is definitely a bug ...

You are also correct that for glass one should use 'Unicode comparison mode' ...

The gotcha is that in 3.1 and earlier we allowed Unicode* and *String instances to be intermixed. Basically we ignored the unicode-ness of instances and used simple code point comparisons in certain circumstances ... the main implication of this is that sorted collections may not be sorted correctly so one will have to be aware of this when upgrading to the 3.2 ...

I will be doing a number of experiments with GLASS upgrades to 3.2 in the next week or so to see if I can identify any issues and come up with ways to compensate ...

Dale

On Wed, Apr 16, 2014 at 1:24 PM, Mariano Martinez Peck <[hidden email]> wrote:

On Fri, Mar 28, 2014 at 2:55 PM, Dale Henrichs <[hidden email]> wrote:

Pieter,

The engineer responsible for the ICU implementation has been on vacation all week, so I haven't had a chance to discuss the #_findString:startingAt:ignoreCase: issue with him ...

With that said we have been "discovering" things about mixed Unicode[7|16|32] (where a collator is always used) and [DoubleByte|QuadByte]String classes and the basic conclusion that we've come to is that:

"it does not make sense to attempt to perfomed mixed comparisons
between Unicode* and *String instances"

Dale,

While I understand such conclusion, I think this is a different discussion than the one I originally pasted, isn't it?

('Newmont' asUnicodeString _findString: 'newm' asUnicodeString startingAt: 1 ignoreCase: true) > 0

answers false and in this case I am not mixing anything...both are Unicode7. So what I mean is that this is broken even without mixing.

Also...note that users may be comparing/mixing Unicode* and String* WITHOUT knowing. For example...in my case, I don't know how but I get Unicode7 from a combo list from a magritte form.... (of course, in Pharo I get a String) and then I search over that result.... So if you were to choose "Legacy comparison mode" for GLASS...then we should at least avoid using Unicode classes in Seaside/magritte. Otherwise it would be a pain to maintain a working system for Pharo and GemStone. So..this is just to agree that "Unicode comparison mode" would be better for GLASS ?

Thanks,

Dale

On Thu, Mar 27, 2014 at 5:19 AM, Pieter Nagel <[hidden email]> wrote:

The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase: delegates
to the ICU collator only in the case where ignoreCase is true. Is this
correct?

The reasoning seems to be that a case-sensitive match in Unicode can be done
by just comparing the byte values of the two strings for identity, as the
super
implementation presumably does. But since some letters can be decomposed into
multiple codepoints in canonical and non-canonical ways[1], that's not
true. And
I suppose surrogate escapes factor in here as true.

To be honest, I'm not familiar enough with ICU to know whether it
(optionally?)
takes character decomposition into account on comparison, but I would
guess that
issues like these are precisely why an industrial-strength Unicode handling
library was needed in the first place.

What does this look like in 3.2? Does your testing include comparisons
where the
two strings have characters decomposed in different ways?

[1] I.e. U+00E9 LATIN SMALL LETTER E WITH ACUTE can also be decomposed as
U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

--
Mariano
http://marianopeck.wordpress.com

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass