EncodedCharSets

EncodedCharSets

demarey
Hi again,

Does anyone know the rationale behind this?

declareEncodedCharSet: anEncodedCharSetOrLanguageEnvironmentClass atIndex: aNumber
       
        EncodedCharSets at: aNumber put: anEncodedCharSetOrLanguageEnvironmentClass
       
        "this method is used to modularize the old initialize method:
        EncodedCharSets at: 0+1 put: Unicode.
        EncodedCharSets at: 1+1 put: JISX0208.
        EncodedCharSets at: 2+1 put: GB2312.
        EncodedCharSets at: 3+1 put: KSX1001.
        EncodedCharSets at: 4+1 put: JISX0208.
        EncodedCharSets at: 5+1 put: JapaneseEnvironment.
        EncodedCharSets at: 6+1 put: SimplifiedChineseEnvironment.
        EncodedCharSets at: 7+1 put: KoreanEnvironment.
        EncodedCharSets at: 8+1 put: GB2312.
        EncodedCharSets at: 12+1 put: KSX1001.
        EncodedCharSets at: 13+1 put: GreekEnvironment.
        EncodedCharSets at: 14+1 put: Latin2Environment.
        EncodedCharSets at: 15+1 put: RussianEnvironment.
        EncodedCharSets at: 17+1 put: Latin9Environment.
        EncodedCharSets at: 256 put: Unicode.

        "




Re: EncodedCharSets

Henrik Sperre Johansen

> On 17 Sep 2015, at 4:09 , Christophe Demarey <[hidden email]> wrote:
>
> Hi again,
>
> Does anyone know the rationale behind this?
>
If you mean the name EncodedCharSets, and the fact that there are encodings in that list, it is a leftover from when WideString code points could be in character sets other than Unicode.
That made little sense, since each wide character takes 32 bits anyway (apart from a possibly simpler codePoint -> glyph index translation, described below), so most or all of the places where this happened have been removed, and we now assume WideString code points to be equal to Unicode code points.

The goal of the old code was to supply parameters to the old StrikeFont string display primitive, which is limited to a table of 256 glyphs for any string it renders.
The Scanner's job was to introduce a stop whenever it encountered a change that made the string it was scanning impossible to display with a single call to the primitive.
The presence of a leadingChar induced (or induces) a stop in the Scanner, which explains the mix of encodings and languages (whose code points are in Unicode) in the EncodedCharSets table.
This stop let a properly constructed StrikeFontSet swap in a glyph table suitable for displaying characters with that leading char, using a custom codePoint -> glyph index conversion.
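
The stop mechanism described above can be illustrated with a small workspace sketch. This is not the actual Scanner code: it works on plain integers, assumes that a stop happens whenever the leading char (value shifted right by 22 bits) differs from that of the previous character, and uses made-up sample values.

```smalltalk
"Hypothetical sketch: split character values into runs that share a leading char.
 Each run could then be drawn with one glyph table. Sample values are illustrative."
| values runs current lastLeading |
values := {
	16r41.
	16r42.
	(5 bitShift: 22) + 16r4E00.
	(5 bitShift: 22) + 16r4E01.
	16r43 }.
runs := OrderedCollection new.
current := OrderedCollection new.
lastLeading := nil.
values do: [:each | | leading |
	"The leading char is assumed to live in the bits above the low 22."
	leading := each bitShift: -22.
	"A change of leading char ends the current run (the 'stop')."
	(lastLeading notNil and: [leading ~= lastLeading]) ifTrue: [
		runs add: current.
		current := OrderedCollection new].
	current add: each.
	lastLeading := leading].
current isEmpty ifFalse: [runs add: current].
runs size. "print it: 3"
```

The three runs are the two leading-char-0 values, the two leading-char-5 values, and the final leading-char-0 value.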

TL;DR: It's a relic of a more complex past, used in a mechanism that let StrikeFonts display characters other than MacRoman ones.

Cheers,
Henry

P.S. Funnily enough, the mechanism is somewhat similar to what would be needed to swap in fallback fonts for code points not covered by the default font efficiently (although the stop would be on missing glyph, rather than leading char), instead of the current, doomed-to-fail approach of using a FallbackFont in each font.

Re: EncodedCharSets

Eliot Miranda-2
In reply to this post by demarey
Hi Christophe,

On Thu, Sep 17, 2015 at 7:09 AM, Christophe Demarey <[hidden email]> wrote:
Hi again,

Does anyone know the rationale behind this?


What Henrik says is correct.  Here's the relevant definition in Character:

Character>>leadingChar
        "Answer the value of the 8 highest bits which is used to identify the language.
        This is mostly used for east asian languages CJKV as a workaround against unicode han-unification."
        ^ self asInteger bitShift: -22


i.e. the top 8 bits of the leading character in a string are (were?) used to index EncodedCharSets to determine what language the string is in.
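
To make the bit arithmetic concrete, here is a minimal workspace sketch. It uses plain integers rather than Character instances, and it assumes a 30-bit value with the leading char in the top 8 bits and the code point in the low 22 bits, which is what the bitShift: -22 above implies. The values 5 and 16r4E00 are illustrative only.

```smalltalk
"Sketch under the assumptions stated above; not code from the image."
| codePoint leadingChar encoded |
codePoint := 16r4E00.	"a CJK unified ideograph"
leadingChar := 5.	"illustrative; the quoted comment writes this slot as 5+1"
"Pack the leading char above the low 22 bits."
encoded := (leadingChar bitShift: 22) bitOr: codePoint.
encoded bitShift: -22.	"print it: 5, the leading char again"
encoded bitAnd: 16r3FFFFF.	"print it: 19968, i.e. 16r4E00, the code point"
```

The n+1 form used in the quoted comment suggests that leading char n is looked up at slot n+1 of the 1-based EncodedCharSets table.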

_,,,^..^,,,_
best, Eliot
Re: EncodedCharSets

Sven Van Caekenberghe-2

> On 17 Sep 2015, at 19:26, Eliot Miranda <[hidden email]> wrote:
>
> i.e. the top 8 bits of the leading character in a string are (were?) used to index EncodedCharSets to determine what language the string is in.

Past tense indeed, until someone can explain why we would need this while it cannot be found anywhere else.

Re: EncodedCharSets

demarey
In reply to this post by Eliot Miranda-2
Thanks, all, for your explanations.
It looks like there are still a lot of things to clean up.

Christophe

Re: EncodedCharSets

Nicolas Cellier
Yes, but the first step is to read http://www.ipa.go.jp/files/000005751.pdf to understand exactly which feature you're going to lose, or better, how you're going to support it differently.

Nicolas
