Status: FixedWaitingToBePharoed
Owner: [hidden email]

New issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

From Squeak:

Levente Uzonyi uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-ul.440.mcz

==================== Summary ====================

Name: Collections-ul.440
Author: ul
Time: 26 April 2011, 2:37:08.897 am
UUID: 4c084629-af8b-3740-b919-ec87f228c915
Ancestors: Collections-kb.439

- ignore the leadingChar for unique characters in Character class >> #leadingChar:code:
- fixed the copying of Characters

=============== Diff against Collections-kb.439 ===============

Item was changed:
  ----- Method: Character class>>leadingChar:code: (in category 'instance creation') -----
  leadingChar: leadChar code: code

      code >= 16r400000 ifTrue: [
          self error: 'code is out of range'.
      ].
      leadChar >= 256 ifTrue: [
          self error: 'lead is out of range'.
      ].
+     code < 256 ifTrue: [ ^self value: code ].
-
      ^self value: (leadChar bitShift: 22) + code.!

Item was changed:
  ----- Method: Character>>clone (in category 'copying') -----
  clone
+     "Characters from 0 to 255 are unique, copy only the rest."
+
+     value < 256 ifTrue: [ ^self ].
+     ^super clone!
-     "Answer with the receiver, because Characters are unique."!

Item was changed:
  ----- Method: Character>>copy (in category 'copying') -----
  copy
+     "Characters from 0 to 255 are unique, copy only the rest."
+
+     value < 256 ifTrue: [ ^self ].
+     ^super copy!
-     "Answer with the receiver because Characters are unique."!

Item was changed:
  ----- Method: Character>>deepCopy (in category 'copying') -----
  deepCopy
+     "Characters from 0 to 255 are unique, copy only the rest."
+
+     value < 256 ifTrue: [ ^self ].
+     ^super deepCopy!
-     "Answer with the receiver because Characters are unique."!

Item was added:
+ ----- Method: Character>>shallowCopy (in category 'copying') -----
+ shallowCopy
+     "Characters from 0 to 255 are unique, copy only the rest."
+
+     value < 256 ifTrue: [ ^self ].
+     ^super shallowCopy!

Item was changed:
  ----- Method: Character>>veryDeepCopyWith: (in category 'copying') -----
  veryDeepCopyWith: deepCopier
+     "Characters from 0 to 255 are unique, copy only the rest."
+
+     value < 256 ifTrue: [ ^self ].
+     ^super veryDeepCopyWith: deepCopier!
-     "Return self.  I can't be copied."!


Levente Uzonyi uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-ul.141.mcz

==================== Summary ====================

Name: Multilingual-ul.141
Author: ul
Time: 26 April 2011, 2:26:32.742 am
UUID: ecda4489-6940-b043-8aba-881d913f4985
Ancestors: Multilingual-nice.140

Removed #leadingChar and it's usage from ByteTextConverter and it's subclasses. Only CJKV characters should have leadingChar.

=============== Diff against Multilingual-nice.140 ===============

Item was changed:
  ----- Method: ByteTextConverter>>decode: (in category 'private') -----
  decode: aByte
      "Answer a decoded squeak character corresponding to aByte code.
      Note that aByte does necessary span in the range 0...255, since this receiver is a ByteTextEncoder."

      | code |
      ((code := self class decodeTable at: 1 + aByte) = -1 or: [code = 16rFFFD]) ifTrue: [^nil].
+     ^Character value: code!
-     ^ Character leadingChar: self leadingChar code: code!

Item was removed:
- ----- Method: ByteTextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-     self subclassResponsibility!

Item was removed:
- ----- Method: CP1250TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-     ^0!

Item was removed:
- ----- Method: CP1253TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-     ^ GreekEnvironment leadingChar!

Item was removed:
- ----- Method: ISO88592TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-     ^Latin2Environment leadingChar!

Item was removed:
- ----- Method: ISO88597TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-     ^GreekEnvironment leadingChar!

Item was removed:
- ----- Method: Latin1TextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-     ^0!

Item was removed:
- ----- Method: MacRomanTextConverter>>leadingChar (in category 'friend') -----
- leadingChar
-
-     ^ 0.
- !

_______________________________________________
Pharo-bugtracker mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-bugtracker
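For readers who do not have a Squeak image at hand, the value layout used by Character class>>leadingChar:code: in the diff above can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not the Squeak implementation: the function names, the helper split_value and the lead value 5 used in the usage note are made up for illustration.

```python
# Toy model of the patched Character class>>leadingChar:code: arithmetic.
# All names here are hypothetical; only the constants come from the diff.

LEAD_SHIFT = 22       # the leadingChar is stored above the low 22 bits
MAX_CODE = 0x400000   # 16r400000 in the Smalltalk source, i.e. 1 << 22
MAX_LEAD = 256


def leading_char_code(lead_char: int, code: int) -> int:
    """Return the integer value the patched method would pass to #value:."""
    if code >= MAX_CODE:
        raise ValueError("code is out of range")
    if lead_char >= MAX_LEAD:
        raise ValueError("lead is out of range")
    if code < 256:
        # New in Collections-ul.440: byte characters ignore the leadingChar.
        return code
    return (lead_char << LEAD_SHIFT) + code


def split_value(value: int) -> tuple[int, int]:
    """Hypothetical inverse: recover (leadingChar, code) from a packed value."""
    return value >> LEAD_SHIFT, value & (MAX_CODE - 1)
```

With an arbitrary lead value of 5, leading_char_code(5, 65) answers plain 65, while leading_char_code(5, 0x4E2D) packs the lead into the high bits and split_value gives back the pair (5, 0x4E2D).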
Updates:
Status: FixProposed
Labels: Type-Squeak

Comment #1 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Name: SLICE-Issue-4142-ByteCharacterNeverUseALeadingChar-nice.1
Dependencies: Multilingual-Encodings-nice.13, Multilingual-Languages-nice.12, Collections-Strings-nice.178, Multilingual-TextConversion-nice.20

Only East Asian languages should use a leadingChar.
Fix the leadingChar -> 0 for byte characters and byte encoders.

Note that this change forces leadingChar to 0 for some environments (Greek, Russian, Nepalese).
AFAIK, it's better to have these leadingChars at 0 and use a Unicode font.
However, it would be nice to ask a user of one of these languages... Igor?
Updates:
Cc: [hidden email]

Comment #2 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Igor, can you check that?
Tx
Comment #3 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Hmm.. I wonder why you need to copy characters at all. I think that every #copy message sent to a character can just answer self.
The only method that affects a character's state is Character>>setValue: newValue, and it is a private one, which means that you are not allowed to change the value of an existing character.
And obviously, you don't need to copy chars, because characters with the same value represent the same character.

What is wrong, I think, is that Character>>#= uses #asciiValue instead of #codePoint. (Okay, it answers value, but since value is actually a Unicode value, the implementation of #asciiValue is incorrect: it should fail for any character code > 127, because only character codes 0..127 are defined in the ASCII standard.)
Comment #4 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

You are right, asciiValue is a misnomer, but I^H^H Levente didn't change it.
We shall change this later; this issue is more about leadingChar.

The change to #copy was motivated by the fact that the system expects byte characters to be unique.
There is no such expectation for the MultiByte chars, and they aren't unique.
Of course, currently there are no mutators, so we could avoid a copy in both cases.
Anyway, we had better document this, in a TestCase at least, because who knows what 3rd-party libraries will do with setValue: (we have no support for immutability...).

The question you did not answer: is a specific leadingChar required for Russian?
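The uniqueness argument in comment #4 can be pictured with a toy Python model. It is a sketch under assumptions, not Squeak or Pharo code: the class Char, its constructor of() and the cache are hypothetical stand-ins, and only the "value < 256 answers self" rule is taken from the Collections-ul.440 diff.

```python
# Toy stand-in for Character: values 0..255 are unique (cached) instances,
# larger values get a fresh object every time. All names are hypothetical.

class Char:
    _byte_chars = {}  # cache of the unique instances for values 0..255

    def __init__(self, value):
        self.value = value

    @classmethod
    def of(cls, value):
        """Answer the unique instance for byte values, a new object otherwise."""
        if value < 256:
            if value not in cls._byte_chars:
                cls._byte_chars[value] = cls(value)
            return cls._byte_chars[value]
        return cls(value)

    def copy(self):
        """Mirror of the patched #copy: byte chars answer self, the rest are copied."""
        if self.value < 256:
            return self
        return Char(self.value)

    def __eq__(self, other):
        # Equality is by value, so a copy still compares equal to its original.
        return isinstance(other, Char) and self.value == other.value

    def __hash__(self):
        return hash(self.value)
```

In this model Char.of(65).copy() is the very same object as Char.of(65), whereas a copy of Char.of(0x4E2D) is equal to the original but not identical to it. Since nothing mutates value, answering self in both cases (as suggested in comment #3) would be observably the same except for identity checks.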
Comment #5 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

No. I don't know what the leading char does, or why I would need it for Russian letters when there are much simpler and commonly used Unicode values for them.
It means that even if one may use a leading char, I would simply discourage that and encourage using plain Unicode, period.

I vote for getting rid of the leading char. IMO it is better to make another class for chars with leading chars and put the complexity there.
Because, correct me if I'm wrong, in 99.9999% of cases Unicode is enough. So why should we waste our time and keep things complex that are used in only 0.00001% of cases today?

As for #setValue: and third-party libs: not our problem. This method is private, and those who abuse the API are on their own.
Instead of taking care, we should punish those who use private interfaces outside of the implementing class or its subclasses. There is no excuse for that. Period.
Comment #6 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Thanks Igor, you confirmed what I thought of leadingChar.
Unfortunately I'm Latin, so I need confirmation :)

Concerning setValue:, I would much prefer having immediate characters like in VW; we could get immutability and probably some optimization.

As to whether we should completely eliminate leadingChar, I can't tell for sure.
As I understand it, it is meant for East Asian languages, in order to work around Han unification.
I don't know if this could be handled differently by using specific fonts or text attributes, and I can't fill the cultural gap that easily; that's too much to learn. So I have to defer to the users of these languages: taking a wrong decision out of ignorance is the last thing I'd like to do.
Comment #7 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

@ setValue:
we could change its name to privateSetValue:, and next month we could change it to privateSetValueDontUseThisMethod:, and so make sure that nobody will dare to use it :)

About the leading char:
Do you think it is possible to make a separate class, like CharacterWithLeadingChar, and keep that stuff there, while leaving Character as clean & lean Unicode?

Also, I really would like people who know this better than me, and who actually need to use this feature, to argue why we should use this scheme while the rest of the world just uses Unicode.
Comment #8 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

It might be useful to recall the leadingChar reference:
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/main.html
However, this link does not examine alternatives...
Comment #9 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Yes. More or less it's an implementation description.
Here you can see the table of assigned language tags, AKA leading chars:
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/node10.html

Now, my question is simple:
- who is using anything other than Unicode today in reality?

I have never seen GB2312 and would prefer to never hear about it in the future.
So, what do we lose by simply removing this logic and leaving only Unicode?

I could tell you how many various Russian char encodings existed: KOI8-R, KOI8-U, windows-1251, cp866 (and you can find plenty of others at the end of this page: http://en.wikipedia.org/wiki/Cyrillic_alphabet).
But really, who cares about them today?
I definitely would not like to deal with anything other than Unicode today.
Comment #10 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

My 2c from memory; I make no claims to its actual accuracy:
IIRC, using the leadingChar is part of what StrikeFontSet does to support different (Unicode) character ranges.

It's a flawed approach at best when you want to display more than one language's characters (not the basic approach of storing different ranges in different Fonts and selecting based on the character, but doing the selection based on leadingChar).

The only time it would ever matter is when one Unicode character has different glyphs depending on which language is displayed. AFAIK, this is only true for Japanese/Korean or some such combination.

In the other places where leadingChar is currently used, it just feels like a remnant of the times when OSes were strictly bound to one single-byte code page, and the code in these places should be modernized.
Comment #11 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

I like Igor's idea of introducing CharacterWithLeadingChar and keeping Character for Unicode.
Comment #12 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

From reading the wiki, it looks like it was originally done like that, but then the two character classes were folded into a single one.

To me, a character code should represent a glyph. It should not carry anything like "this is letter X from language Y", because it is too low-level.
The way glyphs are interpreted depends heavily on context.
Consider ordinary Greek script and math/physics formulas, where you see the same glyphs but they have completely different meanings.
In Unicode there are also a lot of code points for various punctuation and scientific glyphs which do not belong to any language. So, what tag(s)/encodings could you assign to them? It is pointless.

I don't understand why Japanese/Korean glyphs, if they coincide, could cause problems.
Depending on context, you clearly know that a given text is either Japanese or Korean.

But I'm not enough of an expert in this area to tell for sure.
The only thing I know is: keep it simple, stupid. This practice usually wins in the long run.
Comment #13 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

OK, I will integrate the proposed fixes, and afterwards it would be good to have the solution proposed by Igor :)
Comment #14 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

Good, remember, little steps ;)
Comment #15 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

The leading char fixes are OK.
But I think for copying, just use ^ self everywhere.
Updates:
Status: Closed

Comment #16 on issue 4142 by [hidden email]: Never use a leadingChar for byte char
http://code.google.com/p/pharo/issues/detail?id=4142

in 13185