[squeak-dev] Re: a diacritics free version of a string


[squeak-dev] Re: a diacritics free version of a string

Stephan Eggermont-3
Philippe wrote:
> The Unicode solution would be to do normalization with full
> decomposition and then a regex on \p{InCombiningDiacriticalMarks} and
> replace it with an empty string or something similar.

I don't think that is enough. I think the normalization is language dependent.
o-umlaut is replaced by oe in German, but the equivalent in Dutch is o.

Stephan


Re: [squeak-dev] Re: a diacritics free version of a string

Philippe Marschall
2009/6/3  <[hidden email]>:
> Philippe wrote:
>>
>> The Unicode solution would be to do normalization with full
>> decomposition and then a regex on \p{InCombiningDiacriticalMarks} and
>> replace it with an empty string or something similar.
>
> I don't think that is enough. I think the normalization is language
> dependent.
> o-umlaut is replaced by oe in German, but the equivalent in Dutch is o.

I think we are talking about different issues. Unicode normalization [1]
solves the problem that some characters (like o-umlaut) can be
represented in more than one way in Unicode: it picks a single
representation and uses it consistently. What I proposed would then
simply remove the diacritical marks (remove the umlaut, keep the o).
What you propose is more sophisticated.
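
A minimal sketch of the decompose-then-strip idea, written in Python rather than Squeak because Python's standard library exposes Unicode normalization directly. The function name `strip_diacritics` is made up for illustration, and the code point range U+0300 to U+036F is assumed to stand in for the `\p{InCombiningDiacriticalMarks}` block mentioned above.

```python
import unicodedata

# Code point range of the Combining Diacritical Marks block (U+0300..U+036F).
COMBINING_START = 0x0300
COMBINING_END = 0x036F


def strip_diacritics(text: str) -> str:
    """Return text with combining diacritical marks removed (hypothetical helper)."""
    # NFD is the full canonical decomposition: a precomposed o-umlaut
    # (U+00F6) becomes 'o' followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character that falls inside the combining marks block.
    return "".join(
        ch for ch in decomposed
        if not COMBINING_START <= ord(ch) <= COMBINING_END
    )


precomposed = "K\u00f6ln"    # o-umlaut as a single code point
decomposed = "Ko\u0308ln"    # 'o' followed by a combining diaeresis
print(strip_diacritics(precomposed))
print(strip_diacritics(decomposed))
```

Both spellings come out the same because normalization runs first. Letters that have no canonical decomposition (for example o with a stroke) pass through unchanged, so this is not a general transliteration.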

But you're right, most text operations are language dependent,
including upper- and lower-case conversion.
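
To illustrate the language-dependent variant, here is a rough Python sketch that applies per-language replacements before falling back to plain mark removal. The table `LANGUAGE_RULES`, the function `fold`, and the language codes are all hypothetical, and the German entries are only the commonly cited substitutions, not a complete or authoritative rule set.

```python
import unicodedata

# Hypothetical per-language replacement tables. Languages without an
# entry (such as "nl" here) fall through to plain mark removal.
LANGUAGE_RULES = {
    "de": {
        "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
        "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
        "\u00df": "ss",
    },
}


def fold(text: str, language: str) -> str:
    """Remove diacritics, applying language-specific replacements first."""
    # Compose first so each accented letter is a single code point
    # that can be looked up in the replacement table.
    composed = unicodedata.normalize("NFC", text)
    rules = LANGUAGE_RULES.get(language, {})
    replaced = "".join(rules.get(ch, ch) for ch in composed)
    # Whatever is left gets the generic treatment: decompose, then drop
    # characters in the Combining Diacritical Marks block.
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)


print(fold("K\u00f6ln", "de"))
print(fold("co\u00f6rdinatie", "nl"))
```

The caller has to supply the language explicitly; nothing in the string itself says whether an o-umlaut should become oe or o.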

 [1] http://www.unicode.org/unicode/reports/tr15/tr15-23.html

Cheers
Philippe