CaseInsensitiveOrder map with upper/lower case characters for all latin1 entries

Nicolai Hess-3-2

See issues 17302/17242/17227:
String>>findString:startingAt:caseSensitive: appears to be failing for extended charsets
String>>compare:caseSensitive: seems to be failing for extended charset comparisons
String>>beginsWithEmpty:caseSensitive: has test failure for some cases

The problem is that the CaseInsensitiveOrder map is built only from the ASCII characters, but the findString/compare/beginsWith methods use it for all byte characters.

Any objections if we fill this map as suggested in case 17242?

CaseInsensitiveOrder := AsciiOrder copy.
(0 to: 255) do: [ :v |
    | char upper |
    char := v asCharacter.
    upper := char asUppercase.
    "Keep the character itself when its uppercase form falls outside the byte range."
    upper isOctetCharacter
        ifFalse: [ upper := char ].
    "Give char the same sort value as its uppercase form."
    CaseInsensitiveOrder
        at: char asciiValue + 1
        put: (CaseInsensitiveOrder at: upper asciiValue + 1) ].

(The check for #isOctetCharacter is needed because for some entries the corresponding uppercase character is not within this character set, for example ÿ, whose uppercase Ÿ lies outside Latin-1.)

This would solve all three issues.
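
A rough way to exercise the three methods in a playground is sketched below. The sample strings are made up for illustration and are not taken from the issue reports, and the selectors are assumed to be the ones named in the issue titles above. No expected results are stated, since they depend on whether the image already contains the new map.

```smalltalk
"Playground sketch; the sample strings are illustrative assumptions."
| upper lower |
upper := 'CAFÉ ÀÉÎ'.
lower := 'café àéî'.

"Case-insensitive comparison of two strings that differ only in case."
Transcript show: (upper compare: lower caseSensitive: false) printString; cr.

"Case-insensitive search for a lowercase needle in an uppercase string."
Transcript show: (upper findString: 'àéî' startingAt: 1 caseSensitive: false) printString; cr.

"Case-insensitive prefix test."
Transcript show: (upper beginsWithEmpty: 'café' caseSensitive: false) printString; cr.
```

Running the same three lines before and after rebuilding the map should show whether the non-ASCII letters are now folded together.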

nicolai

Sven Van Caekenberghe-2

Hallo Nicolai,

That looks like a beautiful fix that makes perfect sense.
If all tests are green, I see no reason not to do it.

Thanks and well done (again),

Sven

Henrik Sperre Johansen

If you use asLowercase as the "canonical" ordering index instead, can you drop the isOctetCharacter test, or are there uppercase characters in latin1 with no corresponding lowercases?

I was about to suggest copying the CaseSensitiveOrder mapping instead of the AsciiOrder, since it has an ordering more refined than just A-Z, but that would quickly lead to wanting to extend it to a generic Latin1 sort order rather than just ASCII (é between e and f, for example), which is a can of worms that is hard to solve without making the ordering locale specific...
I mean, one could use the default Unicode ordering, but would inevitably receive complaints from, say, Norwegians, that å sorts between a and b instead of after z.

After all, it only affects the case where compare is used for ordering anyways.

Cheers,
Henry

Nicolai Hess-3-2

In reply to this post by Sven Van Caekenberghe-2

2016-01-06 10:09 GMT+01:00 Sven Van Caekenberghe <[hidden email]>:

> That looks like a beautiful fix that makes perfect sense.
> If all tests are green, I see no reason not to do it.

Thanks for your feedback.

> Thanks and well done (again),
>
> Sven

Nicolai Hess-3-2

In reply to this post by Henrik Sperre Johansen

2016-01-06 11:09 GMT+01:00 Henrik Johansen <[hidden email]>:

> If you use asLowercase as the "canonical" ordering index instead, can you drop the isOctetCharacter test, or are there uppercase characters in latin1 with no corresponding lowercases?

Interesting, good idea, there are no uppercase characters without lowercases.
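
The lowercase-based variant suggested above could look roughly like this. It is only a playground sketch, not the committed change: `order` is a local stand-in for the CaseInsensitiveOrder class variable, it assumes AsciiOrder is the identity table over 0..255, and it relies on asLowercase never leaving the Latin-1 range.

```smalltalk
"Sketch under the assumptions stated above; 'order' is a hypothetical local table."
| order |
order := (0 to: 255) asByteArray.
0 to: 255 do: [ :v |
    | lower |
    "Fold every byte character onto its lowercase form; no isOctetCharacter guard."
    lower := (Character value: v) asLowercase.
    order at: v + 1 put: (order at: lower asciiValue + 1) ].

"Both cases of a Latin-1 letter should now share one sort value."
Transcript show: ((order at: $É asciiValue + 1) = (order at: $é asciiValue + 1)) printString; cr.
```

Compared with the uppercase version, this drops one branch, at the price of making the lowercase letters the canonical sort values instead of the uppercase ones.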

> I was about to suggest copying the CaseSensitiveOrder mapping instead of the AsciiOrder, since it has an ordering more refined than just A-Z, but that would quickly lead to wanting to extend it to a generic Latin1 sort order rather than just ASCII (é between e and f, for example), which is a can of worms that is hard to solve without making the ordering locale specific...
> I mean, one could use the default Unicode ordering, but would inevitably receive complaints from, say, Norwegians, that å sorts between a and b instead of after z.

I hadn't thought about ordering... and I don't think I want to :)

> After all, it only affects the case where compare is used for ordering anyways.
>
> Cheers,
> Henry

stepharo

In reply to this post by Nicolai Hess-3-2

It is cool to have you three on board. I love your discussions because
I'm learning by immersion.
