Squeak to/from UTF-8 conversions

Andreas.Raab

Squeak to/from UTF-8 conversions

Hi -

I was working on a little improvement in UTF-8 conversion speed (so far
it's about 150x faster for Latin-1 text ;-) and, to measure the
improvement, was running a test that said:

    strings := String allSubInstances.
    1 to: strings size do: [:i |
        original := strings at: i.
        utf8 := original squeakToUtf8.
        copy := utf8 utf8ToSqueak.
        original = copy ifFalse: [self error: 'Encoding problem']].

When I ran this test it failed on each and every WideString instance.
Digging into it, it seems that all of the WideStrings in Squeak have a
language tag that is being supplied implicitly by the current
LanguageEnvironment.
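
A minimal sketch of why the tag cannot survive the trip, assuming (this is not stated anywhere in the thread) that a wide character's integer value packs the language tag into the bits above a 22-bit code point. The tag value 5 is arbitrary and purely illustrative:

```smalltalk
"Assumed layout, for illustration only:
 low 22 bits = Unicode code point, bits above = language tag."
| tag codePoint packed payload recoveredTag |
tag := 5.                                       "hypothetical tag value"
codePoint := 16r3042.                           "HIRAGANA LETTER A"
packed := (tag bitShift: 22) bitOr: codePoint.  "value the WideString would hold"
payload := packed bitAnd: 16r3FFFFF.            "the only part UTF-8 can carry"
recoveredTag := packed bitShift: -22.           "the part UTF-8 has no room for"
Transcript
	show: 'code point: ', payload printString; cr;
	show: 'tag: ', recoveredTag printString; cr;
	show: 'same value without the tag? ', (payload = packed) printString; cr.
```

Under that assumption the last line prints false for any non-zero tag, so a decoder can only reproduce the original string if it is handed the same tag again from somewhere else.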

Questions:
1) From what it looks like right now there is no way to preserve that
language tag through a UTF-8 conversion. Is this indeed the case or am I
missing something?
2) Given that my language environment is being set to Latin-1, how
should clients treat UTF-8 to provide the "proper" language tag? For
example, I expected that a client would be able to read and write UTF-8
text without implicitly providing that language tag. If that's the case,
then how does one store these in common text files? (I could see how to
do this for formatted text but not for "plain text files" without
further attributes)
3) More generally asking, isn't the language tag here more of a
"decorator" along the lines of text attributes? This would certainly
model more closely the effect that I'm seeing here (some attributes are
dropped by the squeak -> utf8 -> squeak conversion) *except* that I
didn't expect any lossy conversion for strings (contrary to Text where
dropping text attributes is obviously lossy).

Thanks for any help,
- Andreas

Bert Freudenberg

Re: Squeak to/from UTF-8 conversions

On Jun 26, 2007, at 9:19 , Andreas Raab wrote:

> Hi -
>
> I was working on a little improvement in UTF-8 conversion speed (so
> far it's about 150x faster for latin-1 text ;-) and for measuring
> the improvements was running a test that said:
>
>     strings := String allSubInstances.
>     1 to: strings size do: [:i |
>         original := strings at: i.
>         utf8 := original squeakToUtf8.
>         copy := utf8 utf8ToSqueak.
>         original = copy ifFalse: [self error: 'Encoding problem']].
>
> When I ran this test it failed on each and every WideString
> instance. Digging into it, it seems that all of the WideStrings in
> Squeak have a language tag that is being supplied implicitly by the
> current LanguageEnvironment.
>
> Questions:
> 1) From what it looks like right now there is no way to preserve
> that language tag through a UTF-8 conversion. Is this indeed the
> case or am I missing something?
> 2) Given that my language environment is being set to Latin-1, how
> should clients treat UTF-8 to provide the "proper" language tag?
> For example, I expected that a client would be able to read and write
> UTF-8 text without implicitly providing that language tag. If
> that's the case, then how does one store these in common text
> files? (I could see how to do this for formatted text but not for
> "plain text files" without further attributes)
> 3) More generally asking, isn't the language tag here more of a
> "decorator" along the lines of text attributes? This would
> certainly model more closely the effect that I'm seeing here (some
> attributes are dropped by the squeak -> utf8 -> squeak conversion)
> *except* that I didn't expect any lossy conversion for strings
> (contrary to Text where dropping text attributes is obviously lossy).

Nice catch. We had the discussion before, and this to me is another
hint that we really should strip the language tag from Strings and
move it to Text attributes. For rendering bare strings the default
language could be taken from the current environment. The problem is,
IIUC, that currently a lot of bare strings are passed around so it
was simpler to just tag the language onto the string itself.

- Bert -

Nicolas Cellier-3

Re: Squeak to/from UTF-8 conversions

However, if you strip the language tag, you will run into very minor
bugs with the A macron and a macron, because their encodings have been
hijacked as CrossedX and EndOfRun in the CharacterScanner family (a clever
trick back when there were only 256 Characters). I looked into how these
damned characters could ever work in Squeak and Sophie, and found that the
black magic was this language tag.
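
A minimal sketch of the collision described above, assuming the scanner looks up its stop conditions at index code + 1 and reserves the two slots just past the 256 byte characters for EndOfRun and CrossedX. The slot numbers 257 and 258, the 22-bit shift and the tag value 1 are assumptions made for illustration, not taken from this thread:

```smalltalk
| reservedSlots slotFor aMacronCapital aMacronSmall tagged |
reservedSlots := #(257 258).        "assumed to hold EndOfRun and CrossedX"
slotFor := [:value | value + 1].    "assumed table lookup: code + 1"
aMacronCapital := 16r100.           "LATIN CAPITAL LETTER A WITH MACRON"
aMacronSmall := 16r101.             "LATIN SMALL LETTER A WITH MACRON"
tagged := (1 bitShift: 22) bitOr: aMacronCapital.   "hypothetical tag 1"
Transcript
	show: (reservedSlots includes: (slotFor value: aMacronCapital)) printString; cr;
	show: (reservedSlots includes: (slotFor value: aMacronSmall)) printString; cr;
	show: (reservedSlots includes: (slotFor value: tagged)) printString; cr.
```

With these assumptions the two untagged characters land on the reserved slots, while the tagged value is far outside the table and cannot be mistaken for a stop condition.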

Andreas, maybe you could have a look at how RTF text is converted in
Sophie; it seems to deal with the language tag correctly, at least with
extended Latin characters.

Nicolas

Andreas.Raab

Re: Squeak to/from UTF-8 conversions

nicolas cellier wrote:
> However, if you strip the language tag, you will run into very minor
> bugs with the A macron and a macron, because their encodings have been
> hijacked as CrossedX and EndOfRun in the CharacterScanner family (a clever
> trick back when there were only 256 Characters). I looked into how these
> damned characters could ever work in Squeak and Sophie, and found that the
> black magic was this language tag.

Ah, how interesting. I wasn't even aware of that but it makes good
sense. Which, in a sense, only emphasizes my question 2) given that
the default Latin-1 environment doesn't seem to set a language code
whatsoever.

> Andreas, maybe you could have a look at how RTF text is converted in
> Sophie; it seems to deal with the language tag correctly, at least with
> extended Latin characters.

Good point. Unfortunately, I don't have the time to get into Sophie in
detail (I was just trying to understand why UTF-8 conversion is lossy
and what to do about it) but if someone would give me a primer on how
Sophie deals with these issues I'd appreciate it.

Cheers,
- Andreas

Michael Rueger-4

Re: Squeak to/from UTF-8 conversions

Andreas Raab wrote:
> Good point. Unfortunately, I don't have the time to get into Sophie in
> detail (I was just trying to understand why UTF-8 conversion is lossy
> and what to do about it) but if someone would give me a primer on how
> Sophie deals with these issues I'd appreciate it.

The main thing we did in Sophie was to build "true" Unicode converters.
The existing ones preserved some of the black magic, so for some
characters converting into Unicode didn't actually give you Unicode.
So just looking at the new converters would be a good start. They can be
found in the SOphie-RTF package.

Michael

MappingUnicodeTextConverter #()
CP1250UnicodeTextConverter #()
CP1251UnicodeTextConverter #()
CP1252UnicodeTextConverter #()
MacRomanUnicodeTextConverter #()

Yoshiki Ohshima

Re: Squeak to/from UTF-8 conversions

In reply to this post by Andreas.Raab

As Bert suggested, the Right Thing is to build the system on the
assumption that bare Strings and Characters cannot really be displayed.
For method source, the tag is encoded as a text property, so it is
retained. The consequence is to store Squeak Text in an XML-like (or
whatever) format in UTF-8 and to use that almost always.

-- Yoshiki

Andreas.Raab

Re: Squeak to/from UTF-8 conversions

Yoshiki Ohshima wrote:
> As Bert suggested, the Right Thing is to build the system on the
> assumption that bare Strings and Characters cannot really be displayed.

I'm sure it is, but unfortunately we don't have the time to do the Right
Thing since we need to get a product out the door ;-)

> For method source, the tag is encoded as a text property, so it is
> retained. The consequence is to store Squeak Text in an XML-like (or
> whatever) format in UTF-8 and to use that almost always.

Thanks. So if I hear you correctly you are recommending to preserve the
language tag via additional attributes. Is that correct?

Cheers,
- Andreas

Yoshiki Ohshima

Re: Squeak to/from UTF-8 conversions

Andreas,

> > For method source, the tag is encoded as a text property, so it is
> > retained. The consequence is to store Squeak Text in an XML-like (or
> > whatever) format in UTF-8 and to use that almost always.
>
> Thanks. So if I hear you correctly you are recommending to preserve the
> language tag via additional attributes. Is that correct?

The short answer is, yes.
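
One way to read "preserve the language tag via additional attributes" is to split each character value into a code point, which goes into the UTF-8 payload, and a tag, which is stored on the side. A minimal sketch, assuming a hypothetical layout with a 22-bit code point and the tag in the bits above it (the tag value 5 and the variable names are illustrative only):

```smalltalk
| packed codePoints tags rebuilt |
"Two tagged character values, built under the assumed layout."
packed := Array
	with: ((5 bitShift: 22) bitOr: 16r3042)
	with: ((5 bitShift: 22) bitOr: 16r3044).
"Code points are what a UTF-8 encoder would write."
codePoints := packed collect: [:v | v bitAnd: 16r3FFFFF].
"Tags are kept separately, e.g. as attribute runs next to the text."
tags := packed collect: [:v | v bitShift: -22].
"Reading back: combine each decoded code point with its stored tag."
rebuilt := codePoints with: tags collect: [:c :t | (t bitShift: 22) bitOr: c].
Transcript show: (rebuilt = packed) printString; cr.
```

Because the tags travel next to the UTF-8 data instead of inside it, the round trip in this sketch is lossless, while the UTF-8 payload itself stays plain Unicode.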

-- Yoshiki

Philippe Marschall

Re: Squeak to/from UTF-8 conversions

In reply to this post by Andreas.Raab

What's the status of these patches? Seaside shows a measurable speed
drop when doing UTF-8 encoding/decoding, so we'd be more than willing
to test them. We don't care about the stripping of language tags; we
are fine with the unification aspect of Unicode.

Cheers
Philippe
