Please note that Unicode16 and Unicode32 have the same problem as well.
I can confirm that the patch that Mariano Martinez Peck posted [*] works for all three classes, at least in the sense that none of our unit tests catch any error, and it fixes the insensitve sorting issue (for those who don't care about using an ICU collator). US-centric system *might* get away with patching Unicode7 alone, but then you'd have to be pretty sure that you never deal characters beyond strict ASCII, i.e. surnames with accented characters. So I suggest people who need to patch Unicode7 do it to Unicode16 and Unicode32 too. [*] Unicode7 >> _findString: subString startingAt: startIndex ignoreCase: aBoolean ^ super _findString: subString startingAt: startIndex ignoreCase: aBoolean _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase: delegates
to the ICU collator only in the case where ignoreCase is true. Is this correct? The reasoning seems to be that a case-sensitive match in Unicode can be done by just comparing the byte values of the two strings for identity, as the super implementation presumably does. But since some letters can be decomposed into multiple codepoints in canonical and non-canonical ways[1], that's not true. And I suppose surrogate escapes factor in here as true. To be honest, I'm not familiar enough with ICU to know whether it (optionally?) takes character decomposition into account on comparison, but I would guess that issues like these are precisely why an industrial-strength Unicode handling library was needed in the first place. What does this look like in 3.2? Does your testing include comparisons where the two strings have characters decomposed in different ways? [1] I.e. U+00E9 LATIN SMALL LETTER E WITH ACUTE can also be decomposed as U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Pieter, The engineer responsible for the ICU implementation has been on vacation all week, so I haven't had a chance to discuss the #_findString:startingAt:ignoreCase: issue with him ...
With that said we have been "discovering" things about mixed Unicode[7|16|32] (where a collator is always used) and [DoubleByte|QuadByte]String classes and the basic conclusion that we've come to is that:
"it does not make sense to attempt to perfomed mixed comparisons between Unicode* and *String instances" For 3.2 the default mechanism ("Legacy comparison mode" ) will be that it is an error to attempt to compared a Unicode* instance with a *String instance, unless explicitly specifies a collator to be used for the comparison. We will also support "Unicode comparison mode" in which all comparisons for both Unicode* instances and *String instances will use the default collator.
I am planning to make "Unicode comparison mode" the default for GLASS. There is an implication for upgrading a repository from 3.1 to 3.2, from the perspective that if you have "sorted" collections with mixed instances, it is possible that the sort order will change in "Unicode comparison mode" and you will start getting errors in "Legacy comparison mode" ...
I think that if you do have mixed "sorted" collections today that the sort order is probably not deterministic. In 3.1 a comparison with a *String receiver and a Unicode* argument may give a different answer if you reverse the comparison and have a Unicode* receiver and a *String argument ... common occurrences in sorted collections:)
If it isn't already obvious, 3.1 was our first attempt at introducing Unicode into the system and being a largely US-ASCII shop, we did not recognize the perils of mixing encoding schemes until we started trying to support Unicode indexing ...
I think the "comparison modes" are the right answer for allowing our existing customer's applications to continue to function correctly within the String* universe while providing a pathway towards a purely Unicode-based system.
Even with the 3.2 release, we will leave a couple of loose ends to be cleaned up in 3.2.1, mainly in the area of fileout ... we'd like to standardize on UTF8 filein/fileout (while still supporting "gemstone-encoding" filein/fileout for existing customer), but we're defering that work for 3.2 ...
So for 3.2 I would expect taht in "Legacy comparison mode" it will be an error to do a mixed #_findString:startingAt:ignoreCase:. For a Unicode* #_find... (or in "Unicode comparison mode") the default collator will be used...
Dale On Thu, Mar 27, 2014 at 5:19 AM, Pieter Nagel <[hidden email]> wrote: The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase: delegates
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by Pieter Nagel-3
Hi Dale,
> "it does not make sense to attempt to perfomed mixed comparisons > between Unicode* and *String instances" I've been trying to wrap my head around that statement, and they way that "Legacy comparison mode" and "Unicode comparison mode" seem to hinge around whether a collator needs to be specified or defaulted. To my mind the choice of collator is a totally orthogonal concern to whether mixed comparisons "make sense". Well defined comparisons are purely dependent on whether the _encoding_ of *String instances is known. Unicode is a strict superset of all string encodings, and therefore mixed comparisons between the two should be a well-defined operation, *assuming* the encoding used by *String instances is known. No matter what character appears in a *String instance, it will correspond to a Unicode character and can thus be used as part of a comparison to a Unicode string. The choice of collator has no bearing on whether comparison is a well-defined operation or not. It only affects issues like language-dependent conventions where, for example, Germans want "ß" to sort equal to "ss". How passing in a collator will transform a mixed String/Unicode comparison from an error to not does not make sense to me. If the String instance's encoding was known, the comparison was already well-defined and not really an error in the first place. If the String's encoding is not known (or not specified by the programmer), then the collator will be fed garbage input - how should it know whether a specific byte 0xBC in the input was supposed to mean "Vulgar Fraction One Quarter" (ISO-8859-1) or "Latin Capital Ligature OE" (ISO-8859-15)? _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by Dale Henrichs-3
On Fri, Mar 28, 2014 at 2:55 PM, Dale Henrichs <[hidden email]> wrote:
Dale, While I understand such conclusion, I think this is a different discussion than the one I originally pasted, isn't it? ('Newmont' asUnicodeString _findString: 'newm' asUnicodeString startingAt: 1 ignoreCase: true) > 0
answers false and in this case I am not mixing anything...both are Unicode7. So what I mean is that this is broken even without mixing. Also...note that users may be comparing/mixing Unicode* and String* WITHOUT knowing. For example...in my case, I don't know how but I get Unicode7 from a combo list from a magritte form.... (of course, in Pharo I get a String) and then I search over that result.... So if you were to choose "Legacy comparison mode" for GLASS...then we should at least avoid using Unicode classes in Seaside/magritte. Otherwise it would be a pain to maintain a working system for Pharo and GemStone. So..this is just to agree that "Unicode comparison mode" would be better for GLASS ?
Thanks,
Mariano http://marianopeck.wordpress.com _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Mariano, You are correct that Unicode to Unicode should work and the error that you are seeing is definitely a bug ... You are also correct that for glass one should use 'Unicode comparison mode' ...
The gotcha is that in 3.1 and earlier we allowed Unicode* and *String instances to be intermixed. Basically we ignored the unicode-ness of instances and used simple code point comparisons in certain circumstances ... the main implication of this is that sorted collections may not be sorted correctly so one will have to be aware of this when upgrading to the 3.2 ...
I will be doing a number of experiments with GLASS upgrades to 3.2 in the next week or so to see if I can identify any issues and come up with ways to compensate ... Dale On Wed, Apr 16, 2014 at 1:24 PM, Mariano Martinez Peck <[hidden email]> wrote:
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Free forum by Nabble | Edit this page |