
[squeak-dev] Re: a diacritics free version of a string


Stephan Eggermont-3

[squeak-dev] Re: a diacritics free version of a string

Philippe wrote:
> The Unicode solution would be to do normalization with full
> decomposition and then a regex on \p{InCombiningDiacriticalMarks} and
> replace it with an empty string or something similar.

I don't think that is enough. I think the normalization is language dependent.
o-umlaut is replaced by oe in German, but the equivalent in Dutch is o.

Stephan

Philippe Marschall

Re: [squeak-dev] Re: a diacritics free version of a string

2009/6/3 <[hidden email]>:
> Philippe wrote:
>>
>> The Unicode solution would be to do normalization with full
>> decomposition and then a regex on \p{InCombiningDiacriticalMarks} and
>> replace it with an empty string or something similar.
>
> I don't think that is enough. I think the normalization is language
> dependent.
> o-umlaut is replaced by oe in German, but the equivalent in Dutch is o.

I think we are talking about different issues. Unicode normalization [1]
addresses the fact that some characters (like o-umlaut) can be
represented in more than one way in Unicode, by settling on a single
representation. What I proposed would then simply remove the diacritical
marks (remove the umlaut, keep the o). What you propose is more sophisticated.
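
The decompose-then-strip idea can be sketched roughly as follows. This is only an illustration in Python, not Squeak code. It assumes that "full decomposition" means canonical decomposition (NFD), and it filters on the code point range U+0300..U+036F, which is the Combining Diacritical Marks block that \p{InCombiningDiacriticalMarks} refers to. The function name strip_diacritics is made up for this example.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose text (NFD), then drop combining diacritical marks.

    Assumption: NFD is the intended normalization form. Only marks in the
    Combining Diacritical Marks block (U+0300..U+036F) are removed.
    """
    # NFD splits a precomposed character such as o-umlaut (U+00F6)
    # into the base letter "o" followed by COMBINING DIAERESIS (U+0308).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every code point that lies outside the combining-marks block.
    return "".join(ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f"))
```

For example, strip_diacritics("K\u00f6ln") gives "Koln" whether the input was precomposed or already decomposed. Letters that have no canonical decomposition, such as o-slash (U+00F8), pass through unchanged.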

But you're right, most text operations are language dependent,
including conversion between upper and lower case.
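
A language-aware variant might look something like the sketch below, again in Python and purely illustrative. The table LANGUAGE_RULES and the function fold are hypothetical names. The only rules taken from this thread are that German replaces o-umlaut with oe while Dutch simply drops the mark. Anything not covered by a language's table falls back to plain mark removal.

```python
import unicodedata

# Hypothetical per-language replacement tables. Only the German and Dutch
# handling of o-umlaut comes from the discussion; a real table would need
# many more entries per language.
LANGUAGE_RULES = {
    "de": {"\u00f6": "oe", "\u00d6": "Oe"},
    "nl": {},  # Dutch: no special cases here, the mark is just removed
}


def fold(text, language):
    """Apply language-specific replacements, then strip remaining marks."""
    rules = LANGUAGE_RULES.get(language, {})
    # Compose first so the table only needs the precomposed characters.
    composed = unicodedata.normalize("NFC", text)
    replaced = "".join(rules.get(ch, ch) for ch in composed)
    # Fallback: decompose and drop any combining marks that are left.
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With these tables, fold("sch\u00f6n", "de") gives "schoen", while fold("co\u00f6rdinatie", "nl") gives "coordinatie".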

[1] http://www.unicode.org/unicode/reports/tr15/tr15-23.html

Cheers
Philippe