CaseInsensitiveOrder map with upper/lower case characters for all Latin-1 entries

Nicolai Hess-3-2

See issues 17302, 17242 and 17227:
String>>findString:startingAt:caseSensitive: appears to be failing for extended charsets
String>>compare:caseSensitive: seems to be failing for extended charset comparisons
String>>beginsWithEmpty:caseSensitive: has test failure for some cases

The problem is that the CaseInsensitiveOrder map is built with case mappings for the ASCII characters only, yet the findString/compare/beginsWith methods use it for all byte characters.

Any objections if we fill this map as suggested in case 17242?

CaseInsensitiveOrder := AsciiOrder copy.
(0 to: 255) do: [ :v |
    | char upper |
    char := v asCharacter.
    upper := char asUppercase.
    "Fall back to the character itself when its uppercase form is not a byte character."
    upper isOctetCharacter
        ifFalse: [ upper := char ].
    CaseInsensitiveOrder
        at: char asciiValue + 1
        put: (CaseInsensitiveOrder at: upper asciiValue + 1) ].

(The check for #isOctetCharacter is needed because for some entries the corresponding
uppercase character is not within this character set.)

This would solve all three issues.
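
To see which entries actually need the #isOctetCharacter guard, something like the following could be evaluated in a Playground. This is only a sketch: it assumes Character>>asUppercase follows the Unicode case mappings in the image at hand, and the variable name outsideLatin1 is made up for illustration. In Unicode, ÿ (U+00FF) uppercases to Ÿ (U+0178), so it is one entry that should show up.

```smalltalk
"Collect the byte characters whose uppercase form does not fit in one byte.
 outsideLatin1 is an illustrative name, not something from the image."
| outsideLatin1 |
outsideLatin1 := ((0 to: 255) collect: [ :v | Character value: v ])
    select: [ :c | c asUppercase isOctetCharacter not ].
outsideLatin1
```

Inspecting the result shows the characters that keep their own table entry instead of sharing one with an uppercase form.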


nicolai

Re: CaseInsensitiveOrder map with upper/lower case characters for all Latin-1 entries

Sven Van Caekenberghe-2
Hallo Nicolai,

That looks like a beautiful fix that makes perfect sense.
If all tests are green, I see no reason not to do it.

Thanks and well done (again),

Sven



Re: CaseInsensitiveOrder map with upper/lower case characters for all Latin-1 entries

Henrik Sperre Johansen

If you use asLowercase as the "canonical" ordering index instead, can you drop the isOctetCharacter test, or are there uppercase characters in Latin-1 with no corresponding lowercase?

I was about to suggest copying the CaseSensitiveOrder mapping instead of AsciiOrder, since its ordering is more refined than just A-Z. But that would quickly lead to wanting a generic Latin-1 sort order rather than just ASCII (é between e and f, for example), which is a can of worms that is hard to solve without making the ordering locale-specific.
One could use the default Unicode ordering, but would inevitably receive complaints from, say, Norwegians, that å sorts between a and b instead of after z.

After all, it only affects the case where compare is used for ordering anyway.

Cheers,
Henry


Re: CaseInsensitiveOrder map with upper/lower case characters for all Latin-1 entries

Nicolai Hess-3-2
In reply to this post by Sven Van Caekenberghe-2


2016-01-06 10:09 GMT+01:00 Sven Van Caekenberghe <[hidden email]>:
> That looks like a beautiful fix that makes perfect sense.
> If all tests are green, I see no reason not to do it.

Thanks for your feedback.

> Thanks and well done (again),
>
> Sven

:)


Re: CaseInsensitiveOrder map with upper/lower case characters for all Latin-1 entries

Nicolai Hess-3-2
In reply to this post by Henrik Sperre Johansen


2016-01-06 11:09 GMT+01:00 Henrik Johansen <[hidden email]>:

> If you use asLowercase as the "canonical" ordering index instead, can you drop the isOctetCharacter test, or are there uppercase characters in Latin-1 with no corresponding lowercase?

Interesting, good idea, there are no uppercase characters without lowercases.
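
A sketch of that variant, written so it can be tried in a Playground and not tested against the image: asciiOrder and caseInsensitiveOrder are local stand-ins for the String class variables AsciiOrder and CaseInsensitiveOrder, and it assumes Character>>asLowercase returns a byte character for every value from 0 to 255.

```smalltalk
"Build the case-insensitive table with lowercase as the canonical form.
 The two temporaries are stand-ins for the String class variables."
| asciiOrder caseInsensitiveOrder |
asciiOrder := (0 to: 255) asByteArray.
caseInsensitiveOrder := asciiOrder copy.
0 to: 255 do: [ :v |
    | lower |
    lower := (Character value: v) asLowercase.
    "No isOctetCharacter guard: the lowercase form is assumed to stay within one byte."
    caseInsensitiveOrder
        at: v + 1
        put: (asciiOrder at: lower asciiValue + 1) ].
caseInsensitiveOrder
```

One side effect to keep in mind: folding to lowercase places the letters after the characters with codes 91 to 96 ([ \ ] ^ _ `), whereas folding to uppercase keeps them before. That only matters where the table is used for ordering.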
 

> I was about to suggest copying the CaseSensitiveOrder mapping instead of AsciiOrder, since its ordering is more refined than just A-Z. But that would quickly lead to wanting a generic Latin-1 sort order rather than just ASCII (é between e and f, for example), which is a can of worms that is hard to solve without making the ordering locale-specific.
> One could use the default Unicode ordering, but would inevitably receive complaints from, say, Norwegians, that å sorts between a and b instead of after z.

I hadn't thought about ordering, and I don't think I want to :)

> After all, it only affects the case where compare is used for ordering anyway.
>
> Cheers,
> Henry


Re: CaseInsensitiveOrder map with upper/lower case characters for all Latin-1 entries

stepharo
In reply to this post by Nicolai Hess-3-2
It is cool to have you three on board. I love your discussions because
I'm learning by immersion.

