
MultiCharacterScanner>addCharToPresentation: and conversion to pre-composed unicode code points


Nicolas Cellier

MultiCharacterScanner>addCharToPresentation: and conversion to pre-composed unicode code points

As I understand it, MultiCharacterScanner transforms a String of decomposed Unicode into a String of pre-composed Unicode code points, with the help of UnicodeCompositionStream.

It stores the result in presentation.

As I understand it, this was necessary because some keyboards/VMs produce such decomposed sequences.
I presume this once helped with measuring and displaying those codes with fonts that have only pre-composed codes.

First remark: it is a pity that the base character comes first, before the diacritical.

This forces the composition algorithm to look ahead.

We can't change it, it's a standard, but I wonder what motivated such an ordering...
Ref: http://www.unicode.org/standard/principles.html
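To make the look-ahead concrete, here is a minimal workspace sketch. It is not how UnicodeCompositionStream is written; `table` and `compose` are hypothetical names, the pair table has only two entries, and it works on plain code points and handles a single combining mark per base.

```smalltalk
"Hypothetical pair table: base code point -> (combining mark -> pre-composed code point).
 Two entries only, for illustration; a real table would come from the Unicode data files."
| table compose |
table := Dictionary new.
table at: 16r41 put: (Dictionary new at: 16r0308 put: 16rC4; yourself).   "A + diaeresis"
table at: 16r65 put: (Dictionary new at: 16r0301 put: 16rE9; yourself).   "e + acute"

compose := [:codePoints | | in out |
	in := ReadStream on: codePoints.
	out := WriteStream on: (Array new: codePoints size).
	[in atEnd] whileFalse: [ | base marks composed |
		base := in next.
		"Look ahead: the base cannot be emitted until we have seen what follows it."
		marks := table at: base ifAbsent: [nil].
		composed := (marks notNil and: [in atEnd not])
			ifTrue: [marks at: in peek ifAbsent: [nil]]
			ifFalse: [nil].
		composed isNil
			ifTrue: [out nextPut: base]
			ifFalse: [in next. out nextPut: composed]].
	out contents].

compose value: #(16r41 16r0308 16r62 16r65 16r0301).   "#(196 98 233), i.e. Ä b é"
```

If the mark came first, the scanner could decide on each base as soon as it arrives; with the base first, it has to hold the base back until the next code point is known.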

Second remark: transforming Unicode sequences to a canonical form is not only useful for measuring/displaying text.

It's also useful for comparing strings (for equality, for collation, ...).

So the transformation could happen somewhere other than at display time.

Unicode defines standard ways to do it and, bad news, UnicodeCompositionStream does not conform to them.

Ref: https://en.wikipedia.org/wiki/Unicode_equivalence
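A small workspace example of why this matters for comparison, assuming plain String equality with no normalization step in between (the variable names are just for illustration):

```smalltalk
| decomposed precomposed |
decomposed := {$A. Character value: 16r0308} as: String.   "A followed by combining diaeresis"
precomposed := String with: (Character value: 16rC4).      "Ä as a single code point"

decomposed size.            "2"
precomposed size.           "1"
decomposed = precomposed.   "false: = compares code points, not canonical equivalence"
```

Both strings denote the same text as far as Unicode is concerned, so they would only compare equal after both have been brought to the same normal form.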

Third remark: I wonder whether this composition is really necessary at all for measuring/displaying.

Don't Unicode fonts provide special kerning pairs for those diacriticals?

I couldn't find good references on this one...

Bert Freudenberg

Re: MultiCharacterScanner>addCharToPresentation: and conversion to pre-composed unicode code points

On 2013-09-22, at 19:50, Nicolas Cellier <[hidden email]> wrote:

> As I understand it, MultiCharacterScanner transforms a String of decomposed Unicode into a String of pre-composed Unicode code points, with the help of UnicodeCompositionStream.
>
> It stores the result in presentation.
>
> As I understand it, this was necessary because some keyboards/VMs produce such decomposed sequences.
> I presume this once helped with measuring and displaying those codes with fonts that have only pre-composed codes.
>
> First remark: it is a pity that the base character comes first, before the diacritical.
> This forces the composition algorithm to look ahead.
> We can't change it, it's a standard, but I wonder what motivated such an ordering...
> Ref: http://www.unicode.org/standard/principles.html
>
> Second remark: transforming Unicode sequences to a canonical form is not only useful for measuring/displaying text.
>
> It's also useful for comparing strings (for equality, for collation, ...).
> So the transformation could happen somewhere other than at display time.
> Unicode defines standard ways to do it and, bad news, UnicodeCompositionStream does not conform to them.
>
> Ref: https://en.wikipedia.org/wiki/Unicode_equivalence

Yep.

> Third remark: I wonder whether this composition is really necessary at all for measuring/displaying.
>
> Don't Unicode fonts provide special kerning pairs for those diacriticals?
> I couldn't find good references on this one...

This would work if we had the diacriticals in our fonts and if glyph rendering took kerning info into account. Neither is the case currently, so the next-best thing was composition, which allows us to use the pre-composed Latin-1 characters.

Just paste this into Squeak:

A + combining diaeresis: Ä

Precomposed: Ä

Both look the same in my email client but in Squeak I get:

which indicates the presentation thing is not working currently. In case this doesn't make it through via email, the combining diaeresis is Character value: 16r0308.

- Bert -

Bert Freudenberg

Re: MultiCharacterScanner>addCharToPresentation: and conversion to pre-composed unicode code points

... and it appears my email client normalizes before sending. Anyway, try this then:

{$A. Character value: 16r0308} as: String

If you then copy the result into a word processor, it looks okay again.

- Bert -

Nicolas Cellier

Re: MultiCharacterScanner>addCharToPresentation: and conversion to pre-composed unicode code points

Anyway, after digging a bit, composing is far more complex than just kerning.
For example, the vertical spacing might have to be adjusted to get a reasonable stacking of diacritics in languages that require multiple diacritics...

I like the Yukon site, which shows what is expected: http://www.ynlc.ca/languages/font/stacking/stacking.html

These are not isolated cases: ancient Himalayan languages might require such stacking too...

So the rules could be:

- combinedChar shall be able to hold more than two chars, and shall be produced even if there is no pre-composed form for the given base

- at rendering time, if a pre-composed form exists in the font, then use it; else render the base and, if the diacritics exist in the font, perform the composition (this has to be written)

For languages like those of the Yukon, it seems better not to use the pre-composed glyphs at all so as to have homogeneous rendering, but this requires a good stacking algorithm, and we'd better leave that for later.
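The rendering rule above could be sketched roughly like this in a workspace. Everything here is hypothetical: the "font" is reduced to the set of code points it has glyphs for, `precomposedFor` is a two-entry stand-in for a real composition table, and `plan` only decides which glyphs to draw. Positioning and stacking of the diacritics, which is the hard part, is not addressed.

```smalltalk
"All names are hypothetical; a cluster is an Array of code points, base first."
| fontGlyphs precomposedFor plan |
fontGlyphs := Set withAll: #(16r41 16rC4 16r65 16r0301).   "A, Ä, e, combining acute"
precomposedFor := Dictionary new.
precomposedFor at: #(16r41 16r0308) put: 16rC4.   "A + diaeresis"
precomposedFor at: #(16r65 16r0301) put: 16rE9.   "e + acute"

plan := [:cluster | | whole |
	whole := precomposedFor at: cluster ifAbsent: [nil].
	(whole notNil and: [fontGlyphs includes: whole])
		ifTrue: [Array with: whole]   "pre-composed glyph exists in the font: use it"
		ifFalse: [
			"Otherwise draw the base, plus whichever diacritics the font actually has."
			(Array with: cluster first),
				(cluster allButFirst select: [:cp | fontGlyphs includes: cp])]].

plan value: #(16r41 16r0308).   "#(196): Ä is in the font"
plan value: #(16r65 16r0301).   "#(101 769): no é glyph, so base plus acute"
plan value: #(16r65 16r0308).   "#(101): no pre-composed form and no diaeresis glyph"
```

Because the cluster is an Array of any length, the same decision works for bases carrying several diacritics, which is what the first rule asks for.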
