Issue 4142 in pharo: Never use a leadingChar for byte char

pharo
Status: FixedWaitingToBePharoed
Owner: [hidden email]

New issue 4142 by [hidden email]: Never use a leadingChar for byte  
char
http://code.google.com/p/pharo/issues/detail?id=4142

 From Squeak:

Levente Uzonyi uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-ul.440.mcz

==================== Summary ====================

Name: Collections-ul.440
Author: ul
Time: 26 April 2011, 2:37:08.897 am
UUID: 4c084629-af8b-3740-b919-ec87f228c915
Ancestors: Collections-kb.439

- ignore the leadingChar for unique characters in Character class >>  
#leadingChar:code:
- fixed the copying of Characters

=============== Diff against Collections-kb.439 ===============

Item was changed:
  ----- Method: Character class>>leadingChar:code: (in category 'instance  
creation') -----
  leadingChar: leadChar code: code

        code >= 16r400000 ifTrue: [
                self error: 'code is out of range'.
        ].
        leadChar >= 256 ifTrue: [
                self error: 'lead is out of range'.
        ].
+       code < 256 ifTrue: [ ^self value: code ].
-
        ^self value: (leadChar bitShift: 22) + code.!

Item was changed:
  ----- Method: Character>>clone (in category 'copying') -----
  clone
+       "Characters from 0 to 255 are unique, copy only the rest."
+
+       value < 256 ifTrue: [ ^self ].
+       ^super clone!
-       "Answer with the receiver, because Characters are unique."!

Item was changed:
  ----- Method: Character>>copy (in category 'copying') -----
  copy
+       "Characters from 0 to 255 are unique, copy only the rest."
+
+       value < 256 ifTrue: [ ^self ].
+       ^super copy!
-       "Answer with the receiver because Characters are unique."!

Item was changed:
  ----- Method: Character>>deepCopy (in category 'copying') -----
  deepCopy
+       "Characters from 0 to 255 are unique, copy only the rest."
+
+       value < 256 ifTrue: [ ^self ].
+       ^super deepCopy!
-       "Answer with the receiver because Characters are unique."!

Item was added:
+ ----- Method: Character>>shallowCopy (in category 'copying') -----
+ shallowCopy
+       "Characters from 0 to 255 are unique, copy only the rest."
+
+       value < 256 ifTrue: [ ^self ].
+       ^super shallowCopy!

Item was changed:
  ----- Method: Character>>veryDeepCopyWith: (in category 'copying') -----
  veryDeepCopyWith: deepCopier
+       "Characters from 0 to 255 are unique, copy only the rest."
+
+       value < 256 ifTrue: [ ^self ].
+       ^super veryDeepCopyWith: deepCopier!
-       "Return self.  I can't be copied."!
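
For readers unfamiliar with the encoding, the arithmetic in the patched Character class>>leadingChar:code: above can be tried in a workspace. This is only a sketch: it works on plain integers rather than Characters, omits the two range checks, and the sample values 5 and 16r3042 are arbitrary choices for illustration, not values taken from the commit.

```smalltalk
"Workspace sketch of the packing done by leadingChar:code: after the patch.
 Plain integers only; the block and variable names are made up for this example."
| encode packed |
encode := [:leadChar :code |
	code < 256
		ifTrue: [code]	"byte range: the leadingChar is ignored"
		ifFalse: [(leadChar bitShift: 22) + code]].	"otherwise it sits above the 22-bit code"

encode value: 5 value: 65.	"answers 65, the leadingChar is dropped"

packed := encode value: 5 value: 16r3042.
packed bitShift: -22.	"answers 5, the leadingChar"
packed bitAnd: 16r3FFFFF.	"answers 16r3042, the 22-bit code"
```

The mask 16r3FFFFF is 16r400000 - 1, which matches the 'code is out of range' check in the method: a valid code always fits in the low 22 bits.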



Levente Uzonyi uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-ul.141.mcz

==================== Summary ====================

Name: Multilingual-ul.141
Author: ul
Time: 26 April 2011, 2:26:32.742 am
UUID: ecda4489-6940-b043-8aba-881d913f4985
Ancestors: Multilingual-nice.140

Removed #leadingChar and its usage from ByteTextConverter and its
subclasses. Only CJKV characters should have leadingChar.

=============== Diff against Multilingual-nice.140 ===============

Item was changed:
  ----- Method: ByteTextConverter>>decode: (in category 'private') -----
  decode: aByte
        "Answer a decoded squeak character corresponding to aByte code.
        Note that aByte does necessary span in the range 0...255, since this  
receiver is a ByteTextEncoder."
        | code |
        ((code := self class decodeTable at: 1 + aByte) = -1 or: [code =  
16rFFFD]) ifTrue: [^nil].
+       ^Character value: code!
-       ^ Character leadingChar: self leadingChar code: code!

Item was removed:
- ----- Method: ByteTextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-       self subclassResponsibility!

Item was removed:
- ----- Method: CP1250TextConverter>>leadingChar (in category 'friend')  
-----
- leadingChar
-       ^0!

Item was removed:
- ----- Method: CP1253TextConverter>>leadingChar (in category 'friend')  
-----
- leadingChar
-       ^ GreekEnvironment leadingChar!

Item was removed:
- ----- Method: ISO88592TextConverter>>leadingChar (in category 'friend')  
-----
- leadingChar
-       ^Latin2Environment leadingChar!

Item was removed:
- ----- Method: ISO88597TextConverter>>leadingChar (in category 'friend')  
-----
- leadingChar
-       ^GreekEnvironment leadingChar!

Item was removed:
- ----- Method: Latin1TextConverter>>leadingChar (in category 'friend')  
-----
- leadingChar
-       ^0!

Item was removed:
- ----- Method: MacRomanTextConverter>>leadingChar (in category 'friend')  
-----
- leadingChar
-
-       ^ 0.
- !


_______________________________________________
Pharo-bugtracker mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-bugtracker
Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo
Updates:
        Status: FixProposed
        Labels: Type-Squeak

Comment #1 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Name: SLICE-Issue-4142-ByteCharacterNeverUseALeadingChar-nice.1
Dependencies: Multilingual-Encodings-nice.13,  
Multilingual-Languages-nice.12, Collections-Strings-nice.178,  
Multilingual-TextConversion-nice.20

Only East Asian languages should use a leadingChar.
Fix: leadingChar -> 0 for byte characters and byte encoders.

Note that this change forces leadingChar to 0 for some environments (Greek,
Russian, Nepalese).
AFAIK, it's better to have these leadingChars at 0 and use a Unicode font.
However, it would be nice to ask a user of one of these languages... Igor?


Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo
Updates:
        Cc: [hidden email]

Comment #2 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Igor, can you check that?
Thanks


Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #3 on issue 4142 by [hidden email]: Never use a leadingChar for  
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Hmm, I wonder why you need to copy characters at all.
I think that every #copy message sent to a character can just answer
self.

The only method that affects a character's state is
Character>>setValue: newValue
and it is private, which means you are not allowed to change the value of
an existing character.
And obviously you don't need to copy chars, because characters with the
same value represent the same character.

What I think is wrong is that Character>>#= uses #asciiValue instead of
#codePoint.
(Okay, it answers value, but since value is actually a Unicode value, the
implementation of #asciiValue is incorrect: it should fail for any
character code > 127, because only codes 0..127 are defined in the ASCII
standard.)




Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #4 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

You are right, asciiValue is a misnomer, but I^H^H Levente didn't change
it.
We should change this later; this issue is more about leadingChar.

The change to #copy was motivated by the fact that the system expects byte
characters to be unique. There is no such expectation for the MultiByte
chars, and they aren't unique.
Of course, there are currently no mutators, so we could avoid a copy in
both cases.
Anyway, we had better document this, in a TestCase at least, because who
knows what 3rd party libraries will do with setValue: (we have no support
for immutability...).

The question you did not answer: is a specific leadingChar required for
Russian?
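
A TestCase documenting the uniqueness of byte characters could look roughly like the sketch below. The class name, category and test selector are hypothetical and not part of the proposed slice. It only asserts the byte-range behaviour shown in the Squeak diff (value < 256 answers the receiver) and deliberately says nothing about wide characters, whose copying is still under discussion here.

```smalltalk
"Hypothetical test class; name and category are assumptions for illustration."
TestCase subclass: #CharacterCopyingTest
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'CollectionsTests-Strings'.

"Method to add to CharacterCopyingTest:"
testByteCharacterCopiesAreIdentical
	"Characters from 0 to 255 are unique, so every copy answers the receiver."
	| char |
	char := Character value: 65.
	self assert: char copy == char.
	self assert: char shallowCopy == char.
	self assert: char deepCopy == char
```

The identity comparison (==) is the point of the test: equality (=) would also hold for a fresh copy and so would not document uniqueness.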


Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #5 on issue 4142 by [hidden email]: Never use a leadingChar for  
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

No. I don't know what the leading char does, or why I would need it for
Russian letters when there are much simpler and commonly used Unicode
values for them.
This means that even if one may use a leading char, I would simply
discourage that and encourage using plain Unicode, period.
I vote for getting rid of the leading char. IMO it is better to make
another class for chars with leading chars and put the complexity there.
Correct me if I'm wrong, but in 99.9999% of cases Unicode is enough. So why
should we waste our time and keep things complex for something that is used
in only 0.00001% of cases today?


As for #setValue: and third-party libs: not our problem, this method is
private.
Those who abuse the API are on their own.
Instead of taking care of them, we should punish those who use private
interfaces outside of the implementing class or its subclasses. There is no
excuse for that. Period.



Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #6 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Thanks Igor, you confirmed what I thought of leadingChar. Unfortunately I'm
Latin, so I needed confirmation :)

Concerning setValue:, I would much prefer having immediate characters like
in VW; we would get immutability and probably some optimization.

As to whether we should completely eliminate leadingChar, I can't tell for
sure. As I understand it, it is meant for East Asian languages, to work
around Han unification. I don't know whether this could be handled
differently by using specific fonts or text attributes, and I can't bridge
the cultural gap that easily, it's too much to learn. So I have to defer to
the users of these languages; taking a wrong decision out of ignorance is
the last thing I'd like to do.


Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #7 on issue 4142 by [hidden email]: Never use a leadingChar for  
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

@ setValue:
we could rename it to
privateSetValue:
and next month we could rename it to
privateSetValueDontUseThisMethod:
to make sure that nobody dares to use it :)

About the leading char:
Do you think it is possible to make a separate class, like
CharacterWithLeadingChar
and keep that stuff there, while leaving Character as clean & lean
Unicode?

Also, I would really like people who know this better than me, and who
actually need this feature, to argue why we should use this scheme while
the rest of the world just uses Unicode.



Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #8 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

It might be useful to recall the leadingChar reference:
  http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/main.html

However, this link does not examine alternatives...


Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #9 on issue 4142 by [hidden email]: Never use a leadingChar for  
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Yes. It is more or less an implementation description.
Here you can see the table of language tags (AKA leading chars) assigned:

http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/node10.html

Now, my question is simple:
  - who actually uses anything other than Unicode today?

I have never seen GB2312 and would prefer never to hear about it in future.
So, what do we lose by simply removing this logic and leaving only Unicode?

I could tell you how many different Russian char encodings have existed:
KOI8-R, KOI8-U, windows-1251, cp866 (and you can find plenty of others at
the end of this page: http://en.wikipedia.org/wiki/Cyrillic_alphabet)
But really, who cares about them today? I definitely would not like to deal
with anything other than Unicode today.




Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #10 on issue 4142 by [hidden email]: Never use a leadingChar for  
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

My 2c from memory; I make no claims as to its actual accuracy:

IIRC, using the leadingChar is part of what StrikeFontSet does to support
different (Unicode) character ranges.
It's a flawed approach at best if you want to display more than one
language's characters (the flaw is not the basic approach of storing
different ranges in different Fonts and selecting based on character, but
doing the selection based on leadingChar).

The only time it would ever matter is when one Unicode character has
different glyphs depending on which language is displayed. AFAIK, this is
only true for Japanese/Korean or some such combination.

In the other places where leadingChar is currently used, it just feels like
a remnant of the times when OSes were strictly bound to one single-byte
code page, and the code in these places should be modernized.


Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #11 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

I like Igor's idea of introducing CharacterWithLeadingChar and keeping
Character for Unicode.



Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #12 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

From reading the wiki, it looks like it was originally done like that, but
then the two character classes were folded into a single one.

To me, a character code should represent a glyph. It should not carry
anything like "this is letter X from language Y", because it is too low
level. The way glyphs are interpreted depends heavily on context.
Consider ordinary Greek script and math/physics formulas, where you see the
same glyphs but with completely different meanings.
In Unicode there are also a lot of code points for various punctuation and
scientific glyphs which do not belong to any language. So what
tag(s)/encodings could you assign to them? It is pointless.


I don't understand why Japanese/Korean glyphs, if they coincide, would
cause problems. Depending on context, you clearly know that a given text is
either Japanese or Korean. But I'm not enough of an expert in this area to
tell for sure. The only thing I know is: keep it simple, stupid. This
practice usually wins in the long run.



Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #13 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

OK, I will integrate the proposed fixes, and after that it would be good to
have the solution proposed by Igor :)



Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #14 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Good, remember, little steps ;)


Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo

Comment #15 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

The leading char fixes are OK. But I think that for copying we should just
use ^ self everywhere.


Re: Issue 4142 in pharo: Never use a leadingChar for byte char

pharo
Updates:
        Status: Closed

Comment #16 on issue 4142 by [hidden email]: Never use a leadingChar  
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

in 13185

