I18n and Cairo/pango rendering

timrowledge

I18n and Cairo/pango rendering

Has anyone done any work on neatly integrating UTF8 strings, editing, our scanners, displaying with Cairo/pango etc?

I have the Cairo/pango based Unicode plugin working on the Pi, but I'd like to use it neatly rather than in an ad hoc, hacky way. Nic Cellier & I did a lot of cleanup last year on the scanners' implementations of scanning, but not so much on the displaying. It seems like something that ought to be done at some point. I'm actually a bit miffed to discover that Cairo/pango requires UTF8 strings, which have to be about the most complicated possible strings for editing. We have our MacRoman encoded byte strings and full Unicode 32-bit wide strings and can indeed convert to/from UTF8, but I'm not too keen on constantly going back and forth.

Is there anything out there I can use?

/tim
{insert witticism here}

Nicolas Cellier

Re: I18n and Cairo/pango rendering

2014-07-22 23:02 GMT+02:00 tim Rowledge <[hidden email]>:

> Has anyone done any work on neatly integrating UTF8 strings, editing, our scanners, displaying with Cairo/pango etc?
>
> I have the Cairo/pango based Unicode plugin working on the Pi, but I'd like to use it neatly rather than in an ad hoc, hacky way. Nic Cellier & I did a lot of cleanup last year on the scanners' implementations of scanning, but not so much on the displaying. It seems like something that ought to be done at some point. I'm actually a bit miffed to discover that Cairo/pango requires UTF8 strings, which have to be about the most complicated possible strings for editing. We have our MacRoman encoded byte strings and full Unicode 32-bit wide strings and can indeed convert to/from UTF8, but I'm not too keen on constantly going back and forth.

MacRoman? Beware Tim, you have slept too long, but I must tell you the awful truth now.

ByteStrings are no longer MacRoman encoded.

They are ISO 8859-1 (Latin-1), which matches Unicode on the first 256 code points...

Converting to UTF8 seems hackish, but should just work.

Why would you have to go back from UTF8? Optimization? (storing the UTF8 result)

Or could we create a UTF8String class?

timrowledge

Re: I18n and Cairo/pango rendering

On Jul 22, 2014, at 14:14, Nicolas Cellier <[hidden email]> wrote:

> 2014-07-22 23:02 GMT+02:00 tim Rowledge <[hidden email]>:
>
> (Snip).
>
> MacRoman? Beware Tim, you have slept too long, but I must tell you the awful truth now.
> ByteStrings are no longer MacRoman encoded.
> They are ISO 8859-1 (Latin-1), which matches Unicode on the first 256 code points...

Oh. Well, I was basing my comment on comments in code that I came across. Guess they need fixing. This isn't something I've ever felt a need to think about before, so it's all new and clunky to me...

One thing springs to mind though: if the basic ByteString is Latin-1/UTF, why do we have any code to convert? Right now (in my 4.5) it looks like there is a relatively slow check for any non-compliant chars in the #squeakToUtf8 method. Can we drop that now? It would likely be nice if any old ByteString were acceptable to the Cairo/pango plugin.

> Converting to UTF8 seems hackish, but should just work.
> Why would you have to go back from UTF8? Optimization? (storing the UTF8 result)

Hopefully we don't really need to go back in my use case: Scratch i18n, short strings with very little editing. I can probably keep the 'real' string and convert as and when needed for the displaying methods, maybe even caching the converted form. For the longer term we should at least consider doing a better, cleaner job so as to live in a world where it at least appears that UTF8 is becoming a new standard. I have no idea how everyone is handling editing variable-length encoded texts.

> Or could we create a UTF8String class?

Certainly a possibility. A simple version might just do a convert/edit/reconvert for every operation, but there has to be a better way.
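
For illustration only, here is a minimal Python sketch of the "convert/edit/reconvert for every operation" approach. It is not Squeak code, and the function name `insert_text` is hypothetical. It decodes the whole UTF-8 buffer, edits by character index, and encodes again, which is simple but costs time proportional to the string length on every edit.

```python
def insert_text(utf8_bytes: bytes, char_index: int, new_text: str) -> bytes:
    """Insert new_text before the character at char_index in UTF-8 data."""
    # Convert: decode the variable-length bytes into a sequence of code points.
    text = utf8_bytes.decode("utf-8")
    # Edit: character indexing is trivial once the text is decoded.
    edited = text[:char_index] + new_text + text[char_index:]
    # Reconvert: encode back to UTF-8 for the consumer that needs bytes.
    return edited.encode("utf-8")


original = "naïve".encode("utf-8")  # 5 characters stored in 6 bytes
result = insert_text(original, 3, "!")
print(result.decode("utf-8"))
```

For short strings with little editing, as in the Scratch case described above, this cost may well be acceptable.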

/tim

{insert witticism here}

Tobias Pape

Re: I18n and Cairo/pango rendering

On 23.07.2014, at 02:01, tim Rowledge <[hidden email]> wrote:

>> Or could we create a UTF8String class?
> Certainly a possibility. A simple version might just do a convert/edit/reconvert for every operation, but there has to be a better way.
>

What about making WideString the only one?

*duckandcover*

Bert Freudenberg

Re: I18n and Cairo/pango rendering

In reply to this post by timrowledge

On 23.07.2014, at 02:01, tim Rowledge <[hidden email]> wrote:

> On Jul 22, 2014, at 14:14, Nicolas Cellier <[hidden email]> wrote:
>
>> 2014-07-22 23:02 GMT+02:00 tim Rowledge <[hidden email]>:
>> (Snip).
>>
>> MacRoman? Beware Tim, you have slept too long, but I must tell you the awful truth now.
>> ByteStrings are no longer MacRoman encoded.

I guess Tim spent too much time in the Scratch image, which still was MacRoman.

>> They are ISO 8859-1 (Latin-1), which matches Unicode on the first 256 code points...
>
> Oh. Well, I was basing my comment on comments in code that I came across. Guess they need fixing. This isn't something I've ever felt a need to think about before, so it's all new and clunky to me...
> One thing springs to mind though: if the basic ByteString is Latin-1/UTF, why do we have any code to convert? Right now (in my 4.5) it looks like there is a relatively slow check for any non-compliant chars in the #squeakToUtf8 method. Can we drop that now? It would likely be nice if any old ByteString were acceptable to the Cairo/pango plugin.

Well, Latin1 matches the first 256 codepoints in Unicode, but only codepoints < 128 (a.k.a. ASCII) have a one-byte encoding in UTF-8. That's why we need to check. If all chars are < 128 then the ByteString is returned unmodified.
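
The check described here can be sketched in Python as follows. This illustrates the encoding rule only, not the actual #squeakToUtf8 implementation, and the function name `latin1_to_utf8` is hypothetical. Bytes below 128 are copied unchanged. Each byte from 128 to 255 becomes two UTF-8 bytes.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert Latin-1 encoded bytes to UTF-8 encoded bytes."""
    # Fast path: pure ASCII is byte-for-byte identical in Latin-1 and UTF-8.
    if all(b < 128 for b in data):
        return data
    out = bytearray()
    for b in data:
        if b < 128:
            out.append(b)
        else:
            # Code points 128..255 need two bytes: 110xxxxx 10xxxxxx.
            out.append(0xC0 | (b >> 6))
            out.append(0x80 | (b & 0x3F))
    return bytes(out)


print(latin1_to_utf8(b"plain ascii"))
print(latin1_to_utf8("café".encode("latin-1")))
```

The scan over the input is the "relatively slow check" mentioned above. It cannot simply be dropped, because any byte at or above 128 changes the output.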

>> Converting to UTF8 seems hackish, but should just work.
>> Why would you have to go back from UTF8? Optimization? (storing the UTF8 result)
>
> Hopefully we don't really need to go back in my use case: Scratch i18n, short strings with very little editing. I can probably keep the 'real' string and convert as and when needed for the displaying methods, maybe even caching the converted form. For the longer term we should at least consider doing a better, cleaner job so as to live in a world where it at least appears that UTF8 is becoming a new standard. I have no idea how everyone is handling editing variable-length encoded texts.

UTF8 is only a standard for externalizing strings. Internally it's too cumbersome to work with.

>> Or could we create a UTF8String class?
>
> Certainly a possibility. A simple version might just do a convert/edit/reconvert for every operation, but there has to be a better way.

A string-like class storing its chars in ByteArrays plus an encoding would be nice indeed. Not sure it should be a String subclass (like Scratch's "UTF8" class), because operations would be weird at least for UTF8 with its varying bytes-per-char. Rather have conversion methods from/to actual Strings.

That way we would have objects that know their encoding, rather than the current squeakToUtf8 which results in an invalid String and hence must be used only temporarily for passing to a primitive (file save, socket send etc).

But I'd say you should put squeakToUtf8 sends in the primitive call code and if the repeated conversion is actually slowing things down then replace the strings by some encoded thing which would return self in response to that message.
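
A rough Python sketch of that idea, assuming a hypothetical `EncodedString` class (no such class is claimed to exist in Squeak): it holds raw bytes plus the name of their encoding, converts from and to real strings explicitly, and answers itself when asked for UTF-8 if it already is UTF-8.

```python
class EncodedString:
    """Hypothetical holder of raw bytes together with their encoding name."""

    def __init__(self, data: bytes, encoding: str):
        self.data = bytes(data)
        self.encoding = encoding

    @classmethod
    def from_string(cls, text: str, encoding: str = "utf-8") -> "EncodedString":
        # Conversion from an actual string into encoded bytes.
        return cls(text.encode(encoding), encoding)

    def to_string(self) -> str:
        # Conversion back to an actual string for editing or comparison.
        return self.data.decode(self.encoding)

    def as_utf8(self) -> "EncodedString":
        # Already UTF-8: return self, so repeated requests cost nothing.
        if self.encoding == "utf-8":
            return self
        return EncodedString.from_string(self.to_string(), "utf-8")


latin = EncodedString.from_string("é", "latin-1")
utf8 = latin.as_utf8()
print(utf8.data, utf8.as_utf8() is utf8)
```

Because the object is deliberately not a string subclass, per-character operations cannot be applied to the encoded bytes by accident.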

- Bert -

timrowledge

Re: I18n and Cairo/pango rendering

In reply to this post by Tobias Pape

Oh, surely we can come up with something even crazier? How about, for the 64-bit layout, having the bottom 24 bits be the character and the top 39 be a pseudo-pointer to the next char in the string?

/tim
{insert witticism here}

> What about making WideString the only one?
>
> *duckandcover*

Tobias Pape

Le String… Re: [squeak-dev] I18n and Cairo/pango rendering

On 23.07.2014, at 19:03, tim Rowledge <[hidden email]> wrote:

> Oh surely we can come up with something even crazier? How about for the 64bit layout having the bottom 24bits be the character and the top 39 be a pseudo pointer to the next char in the string?
>

Well, there is a reason we don't think of a byte as 6 or 7 bits anymore…
I'd rather see the String implementation we use be the most versatile one.
For interacting with C, plain old C strings seem the way to go. But what
is that, actually?
For covering Unicode, a plain collection of 8-bit numbers is certainly not it.
I really do not want to operate on UTF-8 inside the VM or image, but
for interchange, UTF-8 is (IMHO) clearly the way to go.

Best
-Tobias

Nicolas Cellier

Re: I18n and Cairo/pango rendering

In reply to this post by timrowledge

2014-07-23 19:03 GMT+02:00 tim Rowledge <[hidden email]>:

> Oh, surely we can come up with something even crazier? How about, for the 64-bit layout, having the bottom 24 bits be the character and the top 39 be a pseudo-pointer to the next char in the string?

Note that you somehow have that with UTF8: the number of leading bits set in a character's first byte (if > 127) already indicates the position of the next char ;)

It's thus a pseudo pointer (relative).
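
To make the "relative pseudo pointer" concrete, here is a small Python sketch. The function names are hypothetical. It reads the count of leading 1 bits in each lead byte to find where the next character starts, without decoding anything.

```python
from typing import List


def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the character starting with lead_byte occupies."""
    if lead_byte < 0x80:            # 0xxxxxxx: single-byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def char_offsets(data: bytes) -> List[int]:
    """Return the byte offset at which each character starts."""
    offsets = []
    i = 0
    while i < len(data):
        offsets.append(i)
        i += utf8_sequence_length(data[i])  # hop to the next character
    return offsets


print(char_offsets("aé€😀".encode("utf-8")))
```

Reaching the n-th character this way still means walking from the start, which is why indexing into UTF-8 is linear rather than constant time.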

timrowledge

Re: I18n and Cairo/pango rendering

In reply to this post by Bert Freudenberg

On 23-07-2014, at 12:59 AM, Bert Freudenberg <[hidden email]> wrote:
> But I'd say you should put squeakToUtf8 sends in the primitive call code and if the repeated conversion is actually slowing things down then replace the strings by some encoded thing which would return self in response to that message.

After a lot of wondering and wandering around code I ended up just doing the conversion each time the prim(s) is called. Interestingly, on a Pi and with the fairly high degree of caching of resultant bitmaps used in Scratch there isn’t any noticeable impact on typing performance.

As a way to decide when Pango rendering is used, I settled upon a new font class and made sure it all went via the font, so I can have Scratch using Pango fonts and the dev tools using ‘normal’ fonts without the confusion in old Scratch images. There are good arguments for also bringing wide/byte stringiness into it so that any Latin-1 compatible string uses our normal code, but right now it all looks OK.
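
The shape of that design, sketched in Python with entirely hypothetical class names and placeholder return values (this is not the Scratch or Squeak code): the caller always goes through the font, and only the Pango-backed font converts to UTF-8, doing so on every call.

```python
class Font:
    """Stand-in for a 'normal' font that draws strings directly."""

    def render(self, text: str) -> str:
        return "bitmap:" + text


class PangoFont(Font):
    """Stand-in for a font whose rendering needs UTF-8 bytes."""

    def render(self, text: str) -> str:
        utf8 = text.encode("utf-8")  # converted each time, no caching
        return "pango:%d bytes" % len(utf8)


def draw(text: str, font: Font) -> str:
    # The caller never checks encodings; the font decides how to render.
    return font.render(text)


print(draw("é", Font()))
print(draw("é", PangoFont()))
```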

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: VMB: Verify, then Make Bad