EncodedCharSets

demarey

EncodedCharSets

Hi again,

Does anyone know the rationale behind this?

declareEncodedCharSet: anEncodedCharSetOrLanguageEnvironmentClass atIndex: aNumber

    EncodedCharSets at: aNumber put: anEncodedCharSetOrLanguageEnvironmentClass

    "this method is used to modularize the old initialize method:
    EncodedCharSets at: 0+1 put: Unicode.
    EncodedCharSets at: 1+1 put: JISX0208.
    EncodedCharSets at: 2+1 put: GB2312.
    EncodedCharSets at: 3+1 put: KSX1001.
    EncodedCharSets at: 4+1 put: JISX0208.
    EncodedCharSets at: 5+1 put: JapaneseEnvironment.
    EncodedCharSets at: 6+1 put: SimplifiedChineseEnvironment.
    EncodedCharSets at: 7+1 put: KoreanEnvironment.
    EncodedCharSets at: 8+1 put: GB2312.
    EncodedCharSets at: 12+1 put: KSX1001.
    EncodedCharSets at: 13+1 put: GreekEnvironment.
    EncodedCharSets at: 14+1 put: Latin2Environment.
    EncodedCharSets at: 15+1 put: RussianEnvironment.
    EncodedCharSets at: 17+1 put: Latin9Environment.
    EncodedCharSets at: 256 put: Unicode.
    "

Henrik Sperre Johansen

Re: EncodedCharSets

If you mean the name EncodedCharSets, and the fact there are Encodings in that list, it is a leftover from when WideString code points could be in character sets other than Unicode.
That made little sense, since each wide character takes 32 bits anyway (the only benefit being a possibly simpler codePoint -> glyph index translation, described below), so most or all of the places where this happened have been removed, and we now assume WideString code points are Unicode code points.

The goal of the old code was to give parameters to the old StrikeFont string display primitive, which is limited to using a table of 256 glyphs for any string it wants to render.
The Scanner's job was to introduce a stop whenever it encountered a change that made the string it was scanning impossible to display with a single call to the primitive.
The presence of a leadingChar induced (or still induces) a stop in the Scanner, which explains the mix of encodings and languages (whose code points are in Unicode) in the EncodedCharSets table.
This stop let a properly constructed StrikeFontSet swap in a glyph table suitable for displaying the leading char, using a custom codePoint -> glyph index conversion.
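
To make the stop mechanism concrete, here is a rough Python model (not Pharo code) of splitting a string into runs that share a leading char, so that each run could go to the primitive in one call. It assumes the leading char occupies the bits above the low 22, matching the `bitShift: -22` in Character>>leadingChar quoted later in this thread. The helper names `split_into_runs` and `tagged` are hypothetical, made up for this illustration.

```python
from itertools import groupby

# Assumption: the leading char lives above the low 22 bits of the character value.
LEADING_CHAR_SHIFT = 22


def leading_char(value: int) -> int:
    """Return the tag stored above the low 22 bits."""
    return value >> LEADING_CHAR_SHIFT


def tagged(tag: int, code_point: int) -> int:
    """Hypothetical helper: build a character value carrying a leading char."""
    return (tag << LEADING_CHAR_SHIFT) | code_point


def split_into_runs(values):
    """Group consecutive character values that share a leading char.

    Each boundary between groups corresponds to a scanner stop.
    """
    return [(tag, list(run)) for tag, run in groupby(values, key=leading_char)]


text = [ord("a"), ord("b"), tagged(5, 0x3042), tagged(5, 0x3044), ord("c")]
runs = split_into_runs(text)
for tag, run in runs:
    print(tag, [hex(v) for v in run])
```

In this model every change of tag ends the current run, which is the point where a StrikeFontSet could swap in a different 256-glyph table.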

TL;DR: It's a relic of a more complex past, part of a mechanism that let StrikeFonts display characters outside MacRoman.

Cheers,
Henry

P.S. Funnily enough, the mechanism is somewhat similar to what would be needed to swap in fallback fonts for code points not covered by the default font efficiently (although the stop would be on missing glyph, rather than leading char), instead of the current, doomed-to-fail approach of using a FallbackFont in each font.
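
As a loose illustration of that idea, the following Python sketch (again a model, not existing Pharo code) splits a sequence of code points into runs by the first font that covers each one, so the stop happens on a missing glyph rather than on a leading char. The font names, coverage sets, and the functions `pick_font` and `split_by_font` are all invented for the example.

```python
from itertools import groupby


def pick_font(code_point, fonts):
    """Return the name of the first font whose coverage includes code_point."""
    for name, coverage in fonts:
        if code_point in coverage:
            return name
    return None  # no font in the list has a glyph for this code point


def split_by_font(code_points, fonts):
    """Group consecutive code points that resolve to the same font."""
    return [
        (name, list(run))
        for name, run in groupby(code_points, key=lambda cp: pick_font(cp, fonts))
    ]


# Hypothetical fonts: a default covering printable ASCII and one fallback.
fonts = [
    ("default", set(range(0x20, 0x7F))),
    ("cjk-fallback", {0x3042, 0x3044}),
]
runs = split_by_font([0x48, 0x69, 0x3042, 0x3044, 0x21], fonts)
print(runs)
```

The fallback order is decided once per run here, instead of each font carrying its own FallbackFont.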

Eliot Miranda-2

Re: EncodedCharSets

In reply to this post by demarey

Hi Christophe,

What Henrik says is correct. Here's the relevant definition in Character:

Character>>leadingChar
    "Answer the value of the 8 highest bits which is used to identify the language.
    This is mostly used for east asian languages CJKV as a workaround against unicode han-unification."

    ^ self asInteger bitShift: -22

i.e. the top 8 bits of the leading character in a string are (were?) used to index EncodedCharSets to determine what language the string is in.
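
The lookup can be modeled in a few lines of Python (a sketch of the arithmetic, not Pharo code). The table entries are copied from the comment in the original post, keeping its 1-based `n+1` indices. The 22-bit mask for the remaining code point and the `None` result for unlisted slots are assumptions of this sketch, since the thread does not say what those slots held.

```python
LEADING_CHAR_SHIFT = 22  # from `bitShift: -22`
CODE_POINT_MASK = (1 << LEADING_CHAR_SHIFT) - 1  # assumption: low 22 bits hold the code point

# 1-based slots, as in the old initialize method (class names kept as strings).
ENCODED_CHAR_SETS = {
    0 + 1: "Unicode",
    1 + 1: "JISX0208",
    2 + 1: "GB2312",
    3 + 1: "KSX1001",
    4 + 1: "JISX0208",
    5 + 1: "JapaneseEnvironment",
    6 + 1: "SimplifiedChineseEnvironment",
    7 + 1: "KoreanEnvironment",
    8 + 1: "GB2312",
    12 + 1: "KSX1001",
    13 + 1: "GreekEnvironment",
    14 + 1: "Latin2Environment",
    15 + 1: "RussianEnvironment",
    17 + 1: "Latin9Environment",
    256: "Unicode",
}


def leading_char(value: int) -> int:
    """Python equivalent of `self asInteger bitShift: -22`."""
    return value >> LEADING_CHAR_SHIFT


def code_point(value: int) -> int:
    """The part of the value below the leading char."""
    return value & CODE_POINT_MASK


def char_set_for(value: int):
    """Look up the table slot for a value; None stands in for an unlisted slot."""
    return ENCODED_CHAR_SETS.get(leading_char(value) + 1)


tagged_value = (5 << LEADING_CHAR_SHIFT) | 0x3042
print(leading_char(tagged_value), hex(code_point(tagged_value)), char_set_for(tagged_value))
```

A plain character such as `ord("a")` has leading char 0 and therefore lands in slot 1, Unicode.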

_,,,^..^,,,_

best, Eliot

Sven Van Caekenberghe-2

Re: EncodedCharSets

> On 17 Sep 2015, at 19:26, Eliot Miranda <[hidden email]> wrote:
>
> i.e. the top 8 bits of the leading character in a string are (were?) used to index EncodedCharSets to determine what language the string is in.

Past tense indeed, until someone can explain why we would need this while it cannot be found anywhere else.

demarey

Re: EncodedCharSets

In reply to this post by Eliot Miranda-2

Thanks all for your explanations.

It looks like there are still a lot of things to clean up.

Christophe

Nicolas Cellier

Re: EncodedCharSets

Yes, but the first step is to read http://www.ipa.go.jp/files/000005751.pdf to understand exactly which feature you're going to lose, or better, how you're going to support it differently.

Nicolas
