UTF8 sorting for Spanish words [was: Some beginner questions ....]

Carla F. Griggio
Hi everyone,
As Marten brought up the topic, I'll add a more specific question about UTF-8 sorting.

I've got this problem. I'm using UTF-8 for Spanish texts, and in some cases I have to sort strings.

The thing is that characters with a "tilde" (á, é, í, etc.) have a higher numeric code than the same letter without it (a, e, i, etc.), so the sort order turns out wrong for a Spanish speaker.

For example:
'Córdoba' > 'Corrientes' evaluates to true, and it shouldn't.
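
For anyone who wants to reproduce this: the comparison goes by numeric character code, not by position in the alphabet. Here is a minimal sketch in Python (not GemStone Smalltalk, chosen only because it is easy to run; the variable names are made up for illustration) that shows the same behaviour:

```python
# Illustration only: Python compares strings by code point,
# which mirrors the behaviour described above.
accented = "C\u00f3rdoba"   # "Córdoba", with ó as the single code point U+00F3
unaccented = "Corrientes"

# The first difference is at index 1: ó (243) versus o (111).
print(ord(accented[1]), ord(unaccented[1]))

# Because 243 > 111, the accented word compares as greater ...
print(accented > unaccented)

# ... and therefore sorts after "Corrientes".
print(sorted([accented, unaccented]))
```

The rest of the two words never gets looked at, because the comparison stops at the first differing character.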

Can I specify an alternative sorting rule for strings in UTF-8?

Thanks!
Carla

On Thu, Nov 24, 2011 at 2:32 PM, Norbert Hartl <[hidden email]> wrote:

On 24.11.2011 at 17:03, Marten Feldtmann wrote:

> I mean: sorting, comparing, searching in UTF8, UTF16 and/or UTF32 oriented strings, translating between code pages
>
I don't know. I know there is support for reading UTF-8, but I have never encountered the other two. There are multibyte character classes in GemStone, but I don't know if there is any collation support. Actually, I have never worked with conversion or transliteration in GemStone. Now I want to know myself what GemStone provides.

Norbert

> Marten
>
> On 24.11.2011 at 16:58, Norbert Hartl wrote:
>>
>> On 24.11.2011 at 15:52, Marten Feldtmann wrote:
>>
>>> Hello,
>>>
>>> and again some simple questions:
>>>
>>> ->  What is the state of Unicode support within GemStone? I thought about porting ICU to GemStone to get access to Unicode ...
>>>
>> What do you mean by "get access to Unicode"?
>>



Re: UTF8 sorting for Spanish words [was: Some beginner questions ....]

NorbertHartl
Carla,
On 25.11.2011 at 19:54, Carla F. Griggio wrote:

Can I specify an alternative sorting rule for strings in UTF-8?

If you are talking about characters and text, there is no UTF-8. There is only Unicode, which assigns a distinct number to most of the symbols we use in IT. UTF-8 is just one format in which you can write those numbers to disk, the network, etc.
In Unicode, things are not as easy as they are in ASCII. Sorting becomes a really big problem because comparing characters isn't easy. If it is done right, then you need a collation algorithm [1] for sorting strings.
So your question is close to, if not the same as, Marten's. If GemStone had support comparable to ICU [2], the odds are high that the UCA (Unicode Collation Algorithm) [3] would be supported as well.
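
To make the difference between a Unicode code point and its UTF-8 encoding concrete, here is a small sketch in Python (picked only because it is easy to run; nothing in it is GemStone API, and the variable names are just for illustration):

```python
# Illustration only: one piece of text, several byte representations.
text = "C\u00f3rdoba"  # "Córdoba", ó stored as the single code point U+00F3

# The text itself is a sequence of code points (numbers).
code_points = [ord(ch) for ch in text]

# UTF-8 and UTF-16 are just two ways to serialize those numbers.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# 7 characters, but 8 bytes in UTF-8 and 14 bytes in UTF-16.
print(len(text), len(utf8_bytes), len(utf16_bytes))

# The code point of ó, next to the two bytes that represent it in UTF-8.
print(hex(code_points[1]), utf8_bytes[1:3])

# Decoding either byte form gives back the same text.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text
```

So the sorting question is about the characters, whatever encoding they happen to be stored in.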

I'm sorry not to bring good news. Hopefully something in this mail is of value to you.

Norbert







Re: UTF8 sorting for Spanish words [was: Some beginner questions ....]

Dale Henrichs
In reply to this post by Carla F. Griggio
Carla,

The default collating sequence that GemStone uses is intended to be compatible with the legacy collation sequence and is known to be wrong for multibyte characters.

Sometime in the 2.x timeframe (pre-2.3) we introduced Extended Character Set Support [1]. It was intended to support collating the characters in the 32-bit Unicode Standard as well as allowing the creation of custom collating sequences. Unfortunately, our collation algorithm didn't take diacritical marks into account. Issue 321 [3] was created to document and track progress on this problem.

At this point in time it doesn't look like there is a viable workaround.

Dale

[1] http://community.gemstone.com/download/attachments/6816350/GS64-SysAdminGuide-3.0.pdf?version=1
      Appendix F.2, page 445.
[2] http://www.unicode.org/reports/tr10/
[3] http://code.google.com/p/glassdb/issues/detail?id=321


Re: UTF8 sorting for Spanish words [was: Some beginner questions ....]

Carla F. Griggio
Hi Norbert, Dale,

Thanks for the replies.
The workaround I found is pretty much ad-hoc and nasty :P But at least it works...

In the sort block where I order the strings, I don't compare the strings themselves but a copy of each in which the accented vowels are replaced with plain vowels :P
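
Roughly, the idea looks like the sketch below. It is written in Python rather than GemStone Smalltalk so that it is easy to try out, and the names `ACCENT_MAP` and `without_accents` are hypothetical, not taken from my actual code:

```python
import unicodedata

# Hypothetical mapping: accented Spanish vowels to their plain counterparts.
ACCENT_MAP = str.maketrans("áéíóúüÁÉÍÓÚÜ", "aeiouuAEIOUU")

def without_accents(text):
    """Return a copy of text with accented vowels replaced by plain ones."""
    # Normalize to NFC first so each accented vowel is a single character.
    return unicodedata.normalize("NFC", text).translate(ACCENT_MAP)

cities = ["Corrientes", "Córdoba", "Catamarca", "Chaco"]

# Sort by the accent-free copy; the original strings are left untouched.
print(sorted(cities, key=without_accents))
```

Note that this leaves ñ alone, and two words that differ only by an accent compare as equal, so their relative order is whatever the sort happens to keep.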

I'll pay attention to all the links you've sent, maaaaybe I can help to fix this issue?




Re: UTF8 sorting for Spanish words [was: Some beginner questions ....]

NorbertHartl

On 01.12.2011 at 17:22, Carla F. Griggio wrote:

I don't compare the strings themselves but a copy of each with the vowels with accents replaced with normal vowels :P

That's exactly what transliteration [1] is all about. It is a common technique for producing "something comparable", and it is not a hack (unless you choose to see it as one).


Norbert

Re: UTF8 sorting for Spanish words [was: Some beginner questions ....]

Dale Henrichs
That's basically what is missing from our collation algorithm ... caps and diacritical marks are basically secondary sort keys with the base character being the primary key ...
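
As a rough sketch of that idea (in Python, only because it is easy to run; this is not how GemStone implements collation and it is not the full UCA), one can build a multi-level sort key. The function name `collation_key` is hypothetical, and putting accents ahead of case in the key is an assumption made for this example:

```python
import unicodedata

def collation_key(text):
    """Build a (base letters, accents, case) key for sorting."""
    primary, accents, case = [], [], []
    # NFD splits "ó" into "o" followed by a combining acute accent.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the mark to the base character that precedes it.
            if accents:
                accents[-1] += ch
        else:
            primary.append(ch.casefold())  # base letter, case ignored
            accents.append("")             # no marks seen yet for this letter
            case.append(ch.isupper())      # False (lowercase) sorts first
    # Tuples compare element by element, so the later levels only matter
    # when the base letters are identical.
    return (primary, accents, case)

words = ["Corrientes", "Córdoba", "cordoba", "Cordoba"]
print(sorted(words, key=collation_key))
```

One known limitation of this sketch: it treats ñ as n plus a tilde, while Spanish treats ñ as a separate letter that comes after n, so a real solution needs language-specific tailoring on top.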

Dale

[1] http://en.wikipedia.org/wiki/Transliteration