#isBreakableAt:in:


Nicolas Cellier

2013/9/21 tim Rowledge <[hidden email]>

> a) There are {language}environment classes and encoding classes. There is #isBreakableAt:in: implemented in both but seemingly unused in the encoding classes because it is just plain broken there. Should it be removed from the encoders? In the language environment classes it is implemented to return true for space and cr by default, but space, cr & lf in Latin1 and Latin2. Is that as expected?

From what I understand:
- there is no need to answer true for space, cr, lf, since these are already handled in the CharacterScanner stopConditions, so the default answer should be ^false (unless one of these is removed from stopConditions; I thought I saw that, but cannot remember...)
- whether it should be in EncodedCharSet or LanguageEnvironment, I don't know...

I don't completely like the Multi* version...
For example, when the last breakable char is not a space, there is no adjustment of space width.
Maybe Justified makes no sense in Japanese?
I'd very much like to have tests describing the expectations...



Re: #isBreakableAt:in:

Yoshiki Ohshima-3
At Tue, 24 Sep 2013 23:21:00 +0200,
Nicolas Cellier wrote:

>
> 2013/9/21 tim Rowledge <[hidden email]>
>
> >
> > a) There are {language}environment classes and encoding classes. There is
> > #isBreakableAt:in: implemented in both but seemingly unused in the encoding
> > classes because it is just plain broken there. Should it be removed from
> > the encoders? In the language environment classes it is implemented to
> > return true for space and cr by default, but space, cr & lf in Latin1 and
> > Latin2. Is that as expected?
> >
> >
> From what I understand:
> - no need to answer true for space, cr, lf since these are already handled
> in the CharacterScanner stopConditions, so default answer should be ^false
> (unless one of these is removed from stopConditions, I thought I saw that,
> but cannot remember...)
> - whether it should be in EncodedCharSet or LanguageEnvironment, I don't
> know...
>
> I don't completely like the Multi* version...
> For example, when the last breakable char is not a space, there is no
> adjustment of space width.
> Maybe Justified makes no sense in Japanese?
> I'd very much like to have tests describing the exepectations...
>

Having tests would have been good, yes.  For reference, this might
help a bit.  The page rightly mentions contradictory "House Rules", so
it is not clear-cut.

http://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages

I'd support a rewrite of the whole thing, and perhaps would do more
"total rewrite" approach...

-- Yoshiki


Re: #isBreakableAt:in:

timrowledge

On 24-09-2013, at 3:14 PM, Yoshiki Ohshima <[hidden email]> wrote:
>
>
>
> I'd support a rewrite of the whole thing, and perhaps would do more
> "total rewrite" approach…


That's easy for you to say; you actually know about i18n stuff. I can't even reliably spell it…

What do you remember about the various scanMultiCharactersCombiningFrom:to:in:rightX:stopConditions:kern: methods? The only 'live' reference to them is commented out in Unicode class>scanSelector, so we could argue that they all ought to be deleted. But I doubt all that work was done just for the hell of it and if it isn't in use now there was presumably a reason for the change.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Has a pulse, but that's about all.




Re: #isBreakableAt:in:

Nicolas Cellier
1) because there is a special scanning method for Japanese
2) because unicode diacritics and other combining chars should be rendered specially



Re: #isBreakableAt:in:

Nicolas Cellier
In reply to this post by Yoshiki Ohshima-3
Thanks Yoshiki, that's helpful.
So, EncodedCharSet is currently used, but it would be better to use the LanguageEnvironment (or a more specialized house rule).
And isBreakable should at least look at a pair of chars (whether a line cannot end with prev, or a line cannot start with next, or the pair is unbreakable, etc...).
Since you have very accurate names for it, it would be nice to see methods spelled with the Latin transliteration: oikomi, oidashi, etc... :)
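
A minimal workspace sketch of such a pair-based check, under stated assumptions: cannotStartLine, cannotEndLine and canBreak are hypothetical names (nothing here exists in the image), and the two sets hold only a few sample characters rather than a complete house rule.

```smalltalk
"Hypothetical sketch: cannotStartLine, cannotEndLine and canBreak are illustrative names."
| cannotStartLine cannotEndLine canBreak |
"Must not begin a line: ideographic comma, ideographic full stop, right corner bracket."
cannotStartLine := Set withAll: (#(16r3001 16r3002 16r300D) collect: [:code | Character value: code]).
"Must not end a line: left corner bracket."
cannotEndLine := Set withAll: (#(16r300C) collect: [:code | Character value: code]).
"A break between prev and next is allowed only if neither rule forbids it."
canBreak := [:prev :next |
	((cannotEndLine includes: prev) or: [cannotStartLine includes: next]) not].
canBreak value: (Character value: 16r3042) value: (Character value: 16r3044).	"true"
canBreak value: (Character value: 16r3042) value: (Character value: 16r3001).	"false"
canBreak value: (Character value: 16r300C) value: (Character value: 16r3042).	"false"
```

The point of the two-argument block is that a single index is not enough: the same position can be breakable or not depending on which character follows.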



Re: #isBreakableAt:in:

timrowledge
In reply to this post by Nicolas Cellier

On 24-09-2013, at 3:37 PM, Nicolas Cellier <[hidden email]> wrote:

> 1) because there is a special scanning method for Japanese

So that removes the main(?) need for scanMultiCharactersCombining… ?

> 2) because unicode diacritics and other combining chars should be rendered specially

And we certainly don't seem to do that right at the moment. Does FreeType deal with that? Would we be better off removing the scanMultiCharactersCombining… methods for now and replacing them later with a new approach (to be written by Yoshiki, of course ;-) ) ? We might be able to remove the CombinedChar related code too in that case.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
I haven't lost my mind; it's backed up on tape somewhere.




Re: #isBreakableAt:in:

Nicolas Cellier
No, we can't throw Combining away like that.
Combining is pan-language, it's a unicode feature.

As for whether we should rewrite it and/or provide font support:
- a simple rule that combines a base character with a single diacritic into a precomposed Unicode character, like Yoshiki implemented, is already better than nothing. So we shall support it for WideString at least (totally useless for ByteString).
- we should recognize more complex cases and pass the baby to the font IMHO
The font shall decide how to render (put a ? mark, or display just the base, or do the whole combination with multiple accent stacking, etc...)
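
A minimal sketch of the "simple rule" idea, assuming a lookup table keyed by (base, combining mark) code points. The table and the compose block are hypothetical and are not the CombinedChar code in the image; only the four Unicode code points are real.

```smalltalk
"Hypothetical sketch: table and compose are illustrative names."
| table compose |
table := Dictionary new.
table at: (Array with: 16r61 with: 16r300) put: 16rE0.	"a + combining grave -> a with grave"
table at: (Array with: 16r65 with: 16r301) put: 16rE9.	"e + combining acute -> e with acute"
"Answer the precomposed code point, or nil when the pair is not in the table."
compose := [:base :mark |
	table at: (Array with: base with: mark) ifAbsent: [nil]].
compose value: 16r61 value: 16r300.	"224, i.e. 16rE0"
compose value: 16r61 value: 16r302.	"nil: a case that would be left to the font"
```

A nil answer marks exactly the "more complex case" where the decision would be passed on to the font.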

Did you try to reconnect Combining in Unicode scanSelector?
That would be interesting.
Currently my image locks up if I try to render (String with: $a with: (16r300 to: 16r36F) atRandom asCharacter)...
MessageNotUnderstood: CompositionScanner>>scanMultiCharactersCombiningFrom:to:in:rightX:stopConditions:kern:
but what if we add it in both CharacterScanner branches?



Re: #isBreakableAt:in:

timrowledge
In reply to this post by Nicolas Cellier

On 24-09-2013, at 2:21 PM, Nicolas Cellier <[hidden email]> wrote:

>
> 2013/9/21 tim Rowledge <[hidden email]>
>
> a) There are {language}environment classes and encoding classes. There is #isBreakableAt:in: implemented in both but seemingly unused in the encoding classes because it is just plain broken there. Should it be removed from the encoders? In the language environment classes it is implemented to return true for space and cr by default, but space, cr & lf in Latin1 and Latin2. Is that as expected?
>
>  
> From what I understand:
> - no need to answer true for space, cr, lf since these are already handled in the CharacterScanner stopConditions, so default answer should be ^false (unless one of these is removed from stopConditions, I thought I saw that, but cannot remember...)
> - whether it should be in EncodedCharSet or LanguageEnvironment, I don't know…

I'm a bit puzzled by what's going on here, now that I'm looking into it. The Multi* classes actually *do* have stopConditions *without* space, and don't have methods to handle #space. And yet I can't find anything going wrong… this is odd.

With Nicolas' changes to the implementations of #isBreakableAt:in: it seems like we shouldn't be getting any use of registerBreakableIndex (and in fact I'm running code where it isn't even in the scan methods to be called!) and so MultiCompositionScanner>crossedX ought to be getting a bit annoyed by now. The only bit of its code getting run is the last 4 lines and 'breakableIndex'/'lineHeightAtBreak'/'baselineAtBreak' are all nil. This surely ought to be breaking the TextLine 'line' but so far… nothing. Even changing the width of windows to force recomposing doesn't seem to upset things. Why isn't it breaking!

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Compatible: Gracefully accepts erroneous data from any source.




Re: #isBreakableAt:in:

Hannes Hirzel
In reply to this post by Nicolas Cellier
On 9/24/13, Nicolas Cellier <[hidden email]> wrote:
> No, we can't throw Combining away like that.
> Combining is pan-language, it's a unicode feature.
>
> Whether we should rewrite it, and/or provide font support,
> - a simple rule that combines a base character with a single diacritics
> into a precomposed unicode like Yoshiki implemented is already better than
> nothing.

+1

and very common in hundreds of languages

> So we shall support it for WideString at least

+1

> (totally useless for ByteString).

Not totally, but not important these days.



Re: #isBreakableAt:in:

Bob Arning-2
In reply to this post by timrowledge
I suppose an example would have helped here. So I tried rolling my own:

StringHolder new contents: 'jhdgkjh dfkhg kfg kdjh gkjdhfg kjdfh gkjdfhg jkdhf gkjdh fgkjhd fgkjh dfkgjdkjfgh kdjhf gkjdhf gkjdh fgkjdhf gkjdh fgkjdh fgkjdh fgkjdhf gkjdh fgkjdhg kjd fhg' asWideString; openLabel: 'foo'

Then I fiddled with the window and tried to see where the code did or did not go. It basically didn't go anywhere that seemed related to your question because:

    scanner := (theText string isOctetString
                        ifTrue:[CompositionScanner new]
                        ifFalse:[MultiCompositionScanner new]).

detects that my WideString isn't really all that wide and simply uses CompositionScanner. (This is from TextComposer composeBlahBlah). Even though it uses a plain scanner here, it does use a Multi* scanner when trying to figure out where to highlight.
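
A small workspace sketch of why the test above picks the plain scanner. It uses only selectors that already appear in this thread; the variable names are illustrative.

```smalltalk
| narrow wide |
"A WideString whose characters all fit in a byte."
narrow := 'plain text' asWideString.
"The same text with one character above 255 appended."
wide := narrow , (WideString with: 401 asCharacter).
narrow isOctetString.	"true, so CompositionScanner would be chosen"
wide isOctetString.	"false, so MultiCompositionScanner would be chosen"
```

So the choice depends on the characters actually present, not on the class of the string.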

Maybe that's why you weren't seeing what you expected. I'll try again with some really wide chars.

Cheers,
Bob


Re: #isBreakableAt:in:

Bob Arning-2
In reply to this post by timrowledge
Well, something is a little wrong

StringHolder new contents:  'word1 word2 word3 word4 word5 word6 word7 word8 word9 word10 word11 word12 word13 word14 word15 word16 word17 word18 word19 word20 word21 word22 word23 word24 word25 word26 word27 word28 word29 word30 ' asWideString,(WideString with: 401 asCharacter with: $a with: 402 asCharacter with: $b); openLabel: 'foo'

If you resize this window, you see lines breaking in the middle of words. Because breakAtSpace is always false, you simply

    (breakableIndex isNil or: [breakableIndex < line first]) ifTrue: [
        "Any breakable point in this line.  Just wrap last character."
        breakableIndex := lastIndex - 1.
        lineHeightAtBreak := lineHeight.
        baselineAtBreak := baseline.
    ].

    "It wasn't a space, but anyway this is where we break the line."
    line stop: breakableIndex.
    lineHeight := lineHeightAtBreak.
    baseline := baselineAtBreak.
    ^ true.

and break wherever the line crossed the right margin.
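
The fallback can be seen in isolation with a toy model. This is a sketch under loud assumptions: every character is one unit wide, spaces are the only breakable points, and all names are illustrative rather than taken from the scanner classes.

```smalltalk
"Hypothetical model: one unit per character, names are illustrative."
| text width lines start breakable i |
text := 'word1 word2 abcdefghijklmnop word3'.
width := 8.
lines := OrderedCollection new.
start := 1.
breakable := nil.
i := 1.
[i <= text size] whileTrue: [
	"Remember the most recent space as a possible break."
	(text at: i) = Character space ifTrue: [breakable := i].
	i - start + 1 > width
		ifTrue: [
			"Crossed the margin: use the recorded break, else wrap at the last character that fit."
			(breakable isNil or: [breakable < start]) ifTrue: [breakable := i - 1].
			lines add: (text copyFrom: start to: breakable).
			start := breakable + 1.
			breakable := nil.
			i := start]
		ifFalse: [i := i + 1]].
start <= text size ifTrue: [lines add: (text copyFrom: start to: text size)].
lines	"'word1 ' 'word2 ' 'abcdefgh' 'ijklmnop ' 'word3'"
```

If the space test is removed, breakable stays nil and every line is cut at the margin, which matches the mid-word breaks seen in the window.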

Cheers,
Bob


Re: #isBreakableAt:in:

timrowledge
In reply to this post by Bob Arning-2

On 26-09-2013, at 6:28 AM, Bob Arning <[hidden email]> wrote:
> Then I fiddled with the window and tried to see where the code did or did not go. It basically didn't go anywhere that seemed related to your question because:
>
>     scanner := (theText string isOctetString
>                         ifTrue:[CompositionScanner new]
>                         ifFalse:[MultiCompositionScanner new]).

You hear that loud, dull, thudding noise from the northwest? That's me head-desking. I just wrote that code yesterday and I completely overlooked it. Sigh. This is why pair programming or at least code-review is so useful, children. Don't forget to ask your parents for one for the {winter solstice holiday of your religious/cultural tradition}.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
For every action, there is an equal and opposite criticism.




Re: #isBreakableAt:in:

timrowledge
In reply to this post by Bob Arning-2

On 26-09-2013, at 7:14 AM, Bob Arning <[hidden email]> wrote:

> Well, something is a little wrong

I rather thought so. I'll use your StringHolder to work out something. Actually I reckon a quick hack to add #space and simply use #registerBreakableIndex should be a good start.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
BASIC is to computer programming as QWERTY is to typing.




Re: #isBreakableAt:in:

timrowledge

On 26-09-2013, at 10:19 AM, tim Rowledge <[hidden email]> wrote:

>
> On 26-09-2013, at 7:14 AM, Bob Arning <[hidden email]> wrote:
>
>> Well, something is a little wrong
>
> I rather thought so. I'll use your StringHolder to work out something. Actually I reckon a quick hack to add #space and simply use #registerBreakableIndex should be good start.


Well, that wasn't much fun.

The current implementations of registerBreakableIndex and crossedX are nastily intertwined with assumptions about how they are used in such a way that I suspect laws of nature are being broken. Certainly I'm not going to spend any more time today trying to work out WTF is going on.

So I've returned the use of isBreakableAt:in:in: & registerBreakableIndex to their previous status and it no longer makes nasty with widestrings and wrapping.

It raises more questions (still lots from the previous message unanswered, folks!):
- EncodedCharSets: there are several commented out in EncodedCharSet class>initialize. Why?
- Why is Unicode also commented as 'Latin1Environment'?
- What is Latin2Environment?
- Why is there a separate Latin1 class?
- Why are there mixed-up EncodedCharSet classes and language environment classes?


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Oxymorons: Clearly misunderstood




Re: #isBreakableAt:in:

Nicolas Cellier
In reply to this post by timrowledge
There are more cleanups required in this area.

1) Senders of setConditionArray: always pass #paddedSpace or nil.
  #paddedSpace installs PaddedSpaceCondition, which stops on the Space character and sends #paddedSpace to the scanner.
  nil installs NilCondition, which does not stop on Space.

We thus have a SpaceCondition which is never used, and right now that is fortunate, because it would send the #space stopCondition when encountering a Space character, but only CompositionScanner would understand that!

2) setFont overrides this with stopConditions := DefaultStopConditions, which does not stop on Space.
The (single) CompositionScanner performs an additional action: it sets the Space stopCondition to #space.
Curiously, SegmentScanner forces the Space stopCondition to nil (that's now unnecessary, but I'm quite sure that it once was needed).

Since setFont is always sent by setStopConditions (* see below), #paddedSpace is restored if the alignment is Justified, or else the NilCondition overrides the DefaultCondition...
* computeDefaultLineHeight also calls setFont, but only before composition takes place...

This is really a convoluted mess!

My opinion:
1) we should remove SpaceCondition and NilCondition, and only have a DefaultCondition that stops on Space by sending #space
2) we should restore the #space message to record the Space position and count.
3) Only in JapaneseLanguageEnvironment should there be special JapaneseStopConditions
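
A toy model of a single default table that stops on Space, as a sketch only: stops, hits and the 256-entry size are assumptions for illustration, and the selectors are merely collected here, not performed on a real scanner.

```smalltalk
"Hypothetical model of a stop-condition table; names and table size are illustrative."
| stops text hits |
stops := Array new: 256.	"all nil: no character stops the scan"
stops at: Character space asInteger + 1 put: #space.
stops at: Character tab asInteger + 1 put: #tab.
stops at: Character cr asInteger + 1 put: #cr.
text := 'ab cd'.
hits := OrderedCollection new.
1 to: text size do: [:index | | code action |
	code := (text at: index) asInteger.
	"Characters outside the table never stop the scan in this model."
	action := code < 256 ifTrue: [stops at: code + 1] ifFalse: [nil].
	action isNil ifFalse: [hits add: index -> action]].
hits	"one entry: index 3 with #space"
```

With one shared table like this, every scanner would have to understand #space, which is the reason for restoring that message in point 2.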




Re: #isBreakableAt:in:

timrowledge

On 26-09-2013, at 1:23 PM, Nicolas Cellier <[hidden email]> wrote:

> There are more cleanups required in this area

A lot is wrong and it's a nasty mess. It needs a large rewrite sometime - but not today.
I need to get some other tidyups done and checked before that. And some answers to all those questions!

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Fractured Idiom:- COGITO EGGO SUM - I think; therefore, I am a waffle




Re: #isBreakableAt:in:

Nicolas Cellier
In reply to this post by timrowledge
A Character codePoint contains both
- a charCode
- a language tag (so called #leadingChar)

The leadingChar can encode either a CharacterSet, or a LanguageEnvironment (see EncodedCharSet initialize).
The CharacterSet tells how to interpret the charCode (whether 16r41 encodes a capital A or something else).
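
The packing can be sketched with plain integer arithmetic. This is a minimal sketch, assuming the tag sits above the low 22 bits of the value; the exact bit widths, the tag value 5 and the block names are assumptions for illustration.

```smalltalk
"Hypothetical sketch: pack, leadingCharOf and charCodeOf are illustrative blocks."
| pack leadingCharOf charCodeOf value |
pack := [:leadingChar :charCode | (leadingChar bitShift: 22) bitOr: charCode].
leadingCharOf := [:v | v bitShift: -22].
charCodeOf := [:v | v bitAnd: 16r3FFFFF].
value := pack value: 5 value: 16r41.
leadingCharOf value: value.	"5"
charCodeOf value: value.	"65, i.e. 16r41"
(pack value: 0 value: 16r41) = 16r41.	"true: tag 0 leaves the plain code unchanged"
value = 16r41.	"false: same charCode, different tag"
```

The last line shows the side effect described next: two characters with the same charCode compare as different once their tags differ.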

All this is complex, and has strange side effects, because a letter A in a given char set could be different from a letter A in another char set (they don't have the same leadingChar, and possibly not the same charCode, though that is probably not true for A since most encodings are supersets of ASCII)...
With Unicode (ISO 10646) we can have a canonical (hem, almost) encoding for all languages, so all this is getting a bit obsolete, except for East Asian languages for historical reasons.

I've tried to generalize the use of Unicode in the image, except for East Asian environments.

The latin1 character set is a subset of Unicode (it matches the first 256 codes), so with the promotion of Unicode, it is effectively obsolescent.



Re: #isBreakableAt:in:

Yoshiki Ohshima-3
At Thu, 26 Sep 2013 22:37:04 +0200,
Nicolas Cellier wrote:

>
> A Character codePoint contains both
> - a charCode
> - a language tag (so called #leadingChar)
>
> The leadingChar can encode either a CharacterSet, or a LanguageEnvironment
> (see EncodedCharSet initialize).
> The CharacterSet tells how to interpret the charCode (whether 16r41 encodes
> a capital A or something else).
>
> All this is complex, and has strange side effects, because a letter A in a
> given char set could be different from a character A in another char set
> (they don't have same leadingChar, and eventually not same charCode, though
> maybe not true for A since most encodings are superset of ASCII)...
> With Unicode (iso 10646) we can have a canonical (hem, almost) encoding for
> all languages, so all this is getting a bit obsolete, except for eastern
> asian languages for historical reasons.

It is not quite for "historical reasons"; the leadingChar concept (again
borrowed from Emacs) meets a practical need.  The idea of encoding the tag
in the character objects themselves should be made obsolete, but unified
chars should still be distinguishable for some applications (such
as a Chinese-Japanese dictionary application).

-- Yoshiki


Re: #isBreakableAt:in:

timrowledge
In reply to this post by Nicolas Cellier

On 26-09-2013, at 1:37 PM, Nicolas Cellier <[hidden email]> wrote:

> A Character codePoint contains both
> - a charCode
> - a language tag (so called #leadingChar)

I see the use of bits 22-30 (I think) as leadingChar in Character. Is it correct to say that any character extracted from a ByteString will actually be just the 8-bit value, or am I missing some devious encryption somewhere? If all ByteStrings include only 8-bit valued characters then one would be safe in expecting only the 0 encoded characterset (aka Latin1Environment). Since that is simple and comprehensible, I have to anticipate that it isn't like that. Life would be too simple.

So far as I can work out from the code we are very much assuming that a ByteString is simple ascii encoded (see basicScanCharactersFrom… etc). I'd love some assurance that life is simple.

>
> The leadingChar can encode either a CharacterSet, or a LanguageEnvironment (see EncodedCharSet initialize).

Why? Why on earth would we make it that way? That seems crazy.

> The CharacterSet tells how to interpret the charCode (whether 16r41 encodes a capital A or something else).

Yeah, got that part. The interesting followup question is why, in #scanMultiCharactersFrom:to:in:rightX:stopConditions:kern:, did we
a) find the encoding
b) check that it is the same as we started with (I see the point to the endOfRun if it changes)
c) insist that encoding ==0 before testing the stops (and I see that your latest suggested changes drop that)
d) ignore the encoding in favour of Latin1Environment when sending isBreakableAt… ?

I can't see any good reason for it but obviously someone did at some point. That clearly means I may be missing something important.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
C++ is history repeated as tragedy. Java is history repeated as farce.




Re: #isBreakableAt:in:

Nicolas Cellier
In reply to this post by timrowledge
I have processed #space as it should always have been processed: as a stopCondition.
This is because, like cr, tab or lf, we do not handle space by displaying a glyph; we just add horizontal spacing (with elastic padding if alignment == Justified). It is also because a space gives a chance to wrap the line if we crossedX the rightMargin...
I have cleaned up the setting of the stopConditions inst. var. so that it happens in a single place, setStopConditionsOrNil:, called from a single site, setStopConditions.
Since setStopConditions sends setFont, setFont should not setStopConditions.
Also note that with the addition of ColumnBreakStopConditions, there is no longer any need to copy stopConditions.

Fingers crossed, but I could load these changes in a trunk image, so no update is required for now.
And WideStrings are now composed more like they should be.
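
For the elastic padding mentioned above, here is a minimal arithmetic sketch of spreading leftover width over the recorded spaces. The numbers and variable names are made up for illustration, and this is not the padding code in the image.

```smalltalk
"Hypothetical sketch: distribute the extra width of a justified line over its spaces."
| extra spaceCount pads |
extra := 40 - 33.	"line width minus the width of the composed text"
spaceCount := 3.	"spaces recorded by #space on this line"
"Each space gets the integer share; the first (extra \\ spaceCount) spaces get one more."
pads := (1 to: spaceCount) collect: [:i |
	(extra // spaceCount) + (i <= (extra \\ spaceCount) ifTrue: [1] ifFalse: [0])].
pads	"#(3 2 2)"
```

This only works if the scanner counts the spaces on each line, which is why recording the Space position and count matters for Justified text.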

