Issue 4142 in pharo: Never use a leadingChar for byte char

Status: FixedWaitingToBePharoed
Owner: [hidden email]

New issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

From Squeak:

Levente Uzonyi uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-ul.440.mcz

==================== Summary ====================

Name: Collections-ul.440
Author: ul
Time: 26 April 2011, 2:37:08.897 am
UUID: 4c084629-af8b-3740-b919-ec87f228c915
Ancestors: Collections-kb.439

- ignore the leadingChar for unique characters in Character class >>
#leadingChar:code:
- fixed the copying of Characters

=============== Diff against Collections-kb.439 ===============

Item was changed:
----- Method: Character class>>leadingChar:code: (in category 'instance creation') -----
leadingChar: leadChar code: code

code >= 16r400000 ifTrue: [
self error: 'code is out of range'.
].
leadChar >= 256 ifTrue: [
self error: 'lead is out of range'.
].
+ code < 256 ifTrue: [ ^self value: code ].
-
^self value: (leadChar bitShift: 22) + code.!

Item was changed:
----- Method: Character>>clone (in category 'copying') -----
clone
+ "Characters from 0 to 255 are unique, copy only the rest."
+
+ value < 256 ifTrue: [ ^self ].
+ ^super clone!
- "Answer with the receiver, because Characters are unique."!

Item was changed:
----- Method: Character>>copy (in category 'copying') -----
copy
+ "Characters from 0 to 255 are unique, copy only the rest."
+
+ value < 256 ifTrue: [ ^self ].
+ ^super copy!
- "Answer with the receiver because Characters are unique."!

Item was changed:
----- Method: Character>>deepCopy (in category 'copying') -----
deepCopy
+ "Characters from 0 to 255 are unique, copy only the rest."
+
+ value < 256 ifTrue: [ ^self ].
+ ^super deepCopy!
- "Answer with the receiver because Characters are unique."!

Item was added:
+ ----- Method: Character>>shallowCopy (in category 'copying') -----
+ shallowCopy
+ "Characters from 0 to 255 are unique, copy only the rest."
+
+ value < 256 ifTrue: [ ^self ].
+ ^super shallowCopy!

Item was changed:
----- Method: Character>>veryDeepCopyWith: (in category 'copying') -----
veryDeepCopyWith: deepCopier
+ "Characters from 0 to 255 are unique, copy only the rest."
+
+ value < 256 ifTrue: [ ^self ].
+ ^super veryDeepCopyWith: deepCopier!
- "Return self. I can't be copied."!
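
The packing done by #leadingChar:code: in the diff above can be tried in a
workspace. This is only a minimal sketch: the sample values (lead 5, code
16r3042) are arbitrary illustrations and not taken from the diff, and the
last expression assumes an image in which the patched method is loaded.

```smalltalk
"Workspace sketch: how a leadingChar and a code are packed into one value.
 The numbers are arbitrary sample values chosen for illustration."
| lead code packed |
lead := 5.
code := 16r3042.
packed := (lead bitShift: 22) + code.   "same arithmetic as leadingChar:code:"
packed bitShift: -22.                   "answers the lead again (5)"
packed bitAnd: 16r3FFFFF.               "answers the code again (16r3042)"

"With the patched method, codes below 256 ignore the lead entirely, so this
 is expected to answer true in a patched image:"
(Character leadingChar: lead code: 65) == (Character value: 65).
```

The low 22 bits hold the code and the bits above them hold the lead, which
is why the method rejects codes >= 16r400000.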

Levente Uzonyi uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-ul.141.mcz

==================== Summary ====================

Name: Multilingual-ul.141
Author: ul
Time: 26 April 2011, 2:26:32.742 am
UUID: ecda4489-6940-b043-8aba-881d913f4985
Ancestors: Multilingual-nice.140

Removed #leadingChar and its usage from ByteTextConverter and its
subclasses. Only CJKV characters should have leadingChar.

=============== Diff against Multilingual-nice.140 ===============

Item was changed:
----- Method: ByteTextConverter>>decode: (in category 'private') -----
decode: aByte
"Answer a decoded squeak character corresponding to aByte code.
Note that aByte does necessary span in the range 0...255, since this
receiver is a ByteTextEncoder."
| code |
((code := self class decodeTable at: 1 + aByte) = -1 or: [code = 16rFFFD]) ifTrue: [^nil].
+ ^Character value: code!
- ^ Character leadingChar: self leadingChar code: code!

Item was removed:
- ----- Method: ByteTextConverter>>leadingChar (in category 'friend') -----
- leadingChar
- self subclassResponsibility!

Item was removed:
- ----- Method: CP1250TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
- ^0!

Item was removed:
- ----- Method: CP1253TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
- ^ GreekEnvironment leadingChar!

Item was removed:
- ----- Method: ISO88592TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
- ^Latin2Environment leadingChar!

Item was removed:
- ----- Method: ISO88597TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
- ^GreekEnvironment leadingChar!

Item was removed:
- ----- Method: Latin1TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
- ^0!

Item was removed:
- ----- Method: MacRomanTextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-
- ^ 0.
- !
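
The effect of the new ByteTextConverter>>decode: can be sketched without
touching the converter classes. The table below is a hypothetical stand-in
for a converter's decodeTable, with a single illustrative entry (byte 16rE1
mapped to code point 16r03B1, Greek small alpha); the block repeats the
logic of the patched method, so the decoded character carries a plain code
point and no leadingChar.

```smalltalk
"Workspace sketch with a hypothetical decode table; -1 marks an unmapped byte."
| table decode |
table := Array new: 256 withAll: -1.
table at: 1 + 16rE1 put: 16r03B1.       "illustrative entry only"

decode := [:aByte | | code |
	((code := table at: 1 + aByte) = -1 or: [code = 16rFFFD])
		ifTrue: [nil]
		ifFalse: [Character value: code]].

(decode value: 16rE1) asInteger.        "the plain code point, 16r03B1"
decode value: 0.                        "nil, the byte is unmapped in this table"
```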

_______________________________________________
Pharo-bugtracker mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-bugtracker

Updates:
Status: FixProposed
Labels: Type-Squeak

Comment #1 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Name: SLICE-Issue-4142-ByteCharacterNeverUseALeadingChar-nice.1
Dependencies: Multilingual-Encodings-nice.13,
Multilingual-Languages-nice.12, Collections-Strings-nice.178,
Multilingual-TextConversion-nice.20

Only East Asian languages should use a leadingChar.
Fix: leadingChar -> 0 for byte characters and byte encoders.

Note that this change forces leadingChar to 0 for some environments (Greek,
Russian, Nepalese).
AFAIK, it's better to have these leadingChars at 0 and use a Unicode font.
However, it would be nice to ask a user of one of these languages... Igor?

Updates:
Cc: [hidden email]

Comment #2 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Igor, can you check that?
Thanks

Comment #3 on issue 4142 by [hidden email]: Never use a leadingChar for
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Hmm... I wonder why you need to copy characters at all.
I think that for every #copy message sent to a character, it can just
answer self.

The only method that affects a character's state is
Character>>setValue: newValue
and it's a private one, which means that you are not allowed to change the
value of an existing character.
And obviously, you don't need to copy chars, because characters with the
same value represent the same character.

What is wrong, I think, is that Character>>#= uses
#asciiValue instead of #codePoint
(okay, it answers value, but since value is actually a Unicode value, the
implementation of #asciiValue is incorrect and should fail for any
character code > 127, because only character codes 0..127 are defined in
the ASCII standard.)

Comment #4 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

You are right, asciiValue is a misnomer, but I^H^H Levente didn't change
it.
We should change this later; this issue was more about leadingChar.

The change of #copy was motivated by the fact that the system expects byte
characters to be unique. There is no such expectation for the MultiByte
chars, and they aren't unique.
Of course, currently there are no mutators, so we could avoid a copy in
both cases.
Anyway, we had better document this, at least in a TestCase, because who
knows what 3rd party libraries will do with setValue: (we have no support
for immutability...).

The question you did not answer: is a specific leadingChar required for
Russian ?
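
The uniqueness point above can be checked in a workspace. This is a sketch
that assumes an image of this era, where Character value: answers a shared
instance for values below 256 and may answer a fresh object otherwise; in
an image with immediate characters the last identity test would answer
true as well. The sample values are arbitrary.

```smalltalk
"Workspace sketch: identity versus equality of characters."
| a b |
a := Character value: 65.
b := Character value: 65.
a == b.    "true: byte characters are shared, unique instances"

a := Character value: 16r3042.
b := Character value: 16r3042.
a = b.     "true: equal values mean equal characters"
a == b.    "identity is not guaranteed here; it depends on the image"
```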

Comment #5 on issue 4142 by [hidden email]: Never use a leadingChar for
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

No. I don't know what the leading char does, or why I would need it for
Russian letters when there are much simpler and commonly used Unicode
values for them.
It means that even if one may use a leading char, I would simply discourage
that and encourage using plain Unicode, period.
I vote for getting rid of the leading char. IMO it is better to make another
class for chars with leading chars and put the complexity there. Because,
correct me if I'm wrong, in 99.9999% of cases Unicode is enough. So why
should we waste our time and keep things complex that are used in only
0.00001% of cases today?

As for #setValue: and third-party libs: not our problem, this method is
private.
And those who abuse the API are on their own.
Instead of taking care, we should punish those who use private interfaces
outside of the implementing class or its subclasses. There is no excuse for
that. Period.

Comment #6 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Thanks Igor, you confirmed what I thought of leadingChar. Unfortunately I'm
Latin, so I needed confirmation :)

Concerning setValue:, I would much prefer having immediate characters like
in VW; we would get immutability and probably some optimization.

As to whether we should completely eliminate leadingChar, I can't tell for
sure. As I understand it, it is meant for East Asian languages, to work
around Han unification. I don't know whether this could be handled
differently by using specific fonts or text attributes, and I can't fill
the cultural gap that easily; that's too much to learn. So I have to defer
to the users of these languages: taking a wrong decision out of ignorance
is the last thing I'd like to do.

Comment #7 on issue 4142 by [hidden email]: Never use a leadingChar for
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

@ setValue:
we could change its name to
privateSetValue:
and next month we could change it to
privateSetValueDontUseThisMethod:
and so make sure that nobody will dare to use it :)

About the leading char:
Do you think it is possible to make a separate class, like
CharacterWithLeadingChar,
and keep that stuff there, while leaving Character as clean & lean
Unicode?

Also, I really would like people who know this better than me, and who
actually need to use this feature, to argue why we should use this scheme
while the rest of the world just uses Unicode.

Comment #8 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

It might be useful to recall the leadingChar reference:
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/main.html

However, this link does not examine alternatives...

Comment #9 on issue 4142 by [hidden email]: Never use a leadingChar for
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Yes. More or less, it's an implementation description.
Here you can see the table of language tags, AKA leading chars, assigned:

http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/node10.html

Now, my question is simple:
- who is using anything other than Unicode today in reality?

I have never seen GB2312 and would prefer never to hear about it in future.
So, what do we lose by simply removing this logic and leaving only Unicode?

I could tell you how many Russian character encodings have existed:
KOI8-R, KOI8-U, windows-1251, cp866 (and you can find plenty of others at
the end of this page: http://en.wikipedia.org/wiki/Cyrillic_alphabet).
But really, who cares about them today? I definitely would not like to deal
with anything other than Unicode today.

Comment #10 on issue 4142 by [hidden email]: Never use a leadingChar for
byte char
http://code.google.com/p/pharo/issues/detail?id=4142

My 2c from memory, I make no claims to its actual accuracy:

IIRC, using the leadingChar is part of what StrikeFontSet does to support
different (Unicode) character ranges.
It's a flawed approach at best when you want to display more than one
language's characters (not the basic approach of storing different
ranges in different Fonts and selecting based on character, but doing the
selection based on leadingChar).

The only time it would ever matter is when one Unicode character has
different glyphs depending on which language is displayed. AFAIK, this is
only true for Japanese/Korean or some such combination.

For other places where leadingChar is currently used, it just feels like a
remnant of the times when OSes were strictly bound to one single-byte code
page, and the code in these places should be modernized.

Comment #11 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

I like Igor's idea of introducing CharacterWithLeadingChar and keeping
Character for Unicode.

Comment #12 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

From reading the wiki, it looks like it was originally done like that, but
then the two character classes were folded into a single one.

To me, a character code should represent a glyph. It should not carry
anything like "this is letter X from language Y", because it's too low
level. The way glyphs are interpreted depends heavily on context.
Consider ordinary Greek script and math/physics formulas, where you see the
same glyphs but with completely different meanings.
In Unicode there are also a lot of code points for various punctuation and
scientific glyphs which do not belong to any language. So what
tag(s)/encodings could you assign to them? It is pointless.

I don't understand why Japanese/Korean glyphs, if they coincide, would
cause problems. Depending on context you clearly know that a given text is
either Japanese or Korean. But I'm not enough of an expert in this area to
tell for sure. The only thing I know is: keep it simple, stupid. This
practice usually wins in the long run.

Comment #13 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

OK, I will integrate the proposed fixes, and after that it would be good to
have the solution proposed by Igor :)

Comment #14 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Good, remember, little steps ;)

Comment #15 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

The leading char fixes are OK. But I think for copying, just use ^ self
everywhere.

Updates:
Status: Closed

Comment #16 on issue 4142 by [hidden email]: Never use a leadingChar
for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

in 13185
