Hi all,
I need to type Czech characters in Squeak. I have a Latin2 font, so I tried to set up a Latin2 environment in Squeak as follows:

StrikeFontSet installExternalFontFileName6: 'latin2.out' encoding: 14 encodingName: #Latin2 textStyleName: #DefaultMultiStyle.
Locale switchToID: (LocaleID isoLanguage: 'cs').

It seems OK: I can see Czech characters with diacritical marks, for example by evaluating this in a Workspace:

(Character value: 236) asString convertFromEncoding: 'iso-8859-2'

Now when I run Squeak on Ubuntu like this:

LANG="cs_CZ.ISO-8859-2" LC_ALL="cs_CZ.ISO-8859-2" squeak

I can type lower case Czech letters ě š č ř ž ý á í é ů ú - the keyboard keys with these letters work. But when I press a key with a diacritical mark and then a character key, I get only the character followed by a question mark, e? s? c? for example. So I am not able to type Czech upper case characters (like Ě Š Č etc.).

Where is the problem? In the Squeak VM (I use the latest version, 3.10-6) or in Squeak itself? Please help.

Thanks
Michal
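For readers without a Squeak image at hand, the Workspace check above can be mirrored outside Squeak. This is a minimal Python sketch (Python is used only for illustration and is not part of the original setup): it decodes byte 236 as ISO-8859-2, which should give the same letter the Workspace expression displays.

```python
# Byte 236 (0xEC) interpreted as ISO-8859-2 (Latin2) is the Czech letter "ě".
latin2_byte = bytes([236])
decoded = latin2_byte.decode("iso-8859-2")

# The decoded character is a Unicode code point above 255 (U+011B),
# which is why a Latin2-aware font or conversion is needed to show it.
code_point = ord(decoded)

# The reverse direction: "š" (U+0161, decimal 353) is a single Latin2 byte.
encoded = "š".encode("iso-8859-2")

print(decoded, hex(code_point), encoded)
```

The point of the sketch is only that the Latin2 byte value and the Unicode code point of the same letter differ, which matters later in the thread when codes above 255 come up.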
Michal Perutka wrote:
> I can type lower case Czech letters ě š č ř ž ý á í é ů ú - the keyboard
> keys with these letters work. But when I press a key with a diacritical
> mark + some character key, I get only the character followed by a
> question mark, e? s? c? for example. So I am not able to type Czech
> upper case characters (like Ě Š Č etc.).
>
> Where is the problem? In Squeak VM (I use last 3.10-6 version) or in
> Squeak itself? Please help.

I don't know too much about Linux input handling, but it looks like a mismatch between VM and image (i.e., the VM is reporting two codes that the image needs to merge, and the image doesn't really know what to do with them).

To track this down, you might start by looking at the incoming events in EventSensor (but VERY carefully; screwing up there is a great recipe for a force-quit-restart cycle ;) and see if the event codes look reasonable to you. Also check out the other input converters - some of them might already be doing what you need.

Cheers,
  - Andreas
2009/8/14 Andreas Raab <[hidden email]>

Thanks.

So, in EventSensor>>processKeyboardEvent: I inserted a line

Transcript show: evt asString; show: String cr.

(or I can insert that line in ISO88592InputInterpreter>>nextCharFrom:firstEvt:, the result is the same)

Then, when I type á (=225), I get

#(2 2841355 225 1 0 225 0 0)
#(2 2841355 225 0 0 225 0 0)
#(2 2841506 225 2 0 225 0 0)

When I type the acute accent key and then a (=97), first I get

#(2 2862057 180 2 0 0 0 0)

then

#(2 2872015 97 1 0 97 0 0)
#(2 2872015 97 0 0 97 0 0)
#(2 2872015 769 1 0 769 0 0)
#(2 2872015 769 0 0 769 0 0)
#(2 2872191 97 2 0 97 0 0)

and as a result I get a?, not á.

But what next?

Cheers,
Michal
At Fri, 14 Aug 2009 23:18:11 +0200,
Michal Perutka wrote:
>
> So, in EventSensor>>processKeyboardEvent: I inserted a line
> Transcript show: evt asString; show: String cr.
> (or I can insert that line in ISO88592InputInterpreter>>nextCharFrom:firstEvt:, the result is the same)
>
> Then, when I type á (=225), I get
> #(2 2841355 225 1 0 225 0 0)
> #(2 2841355 225 0 0 225 0 0)
> #(2 2841506 225 2 0 225 0 0)
>
> When I type acute accent key and then a (=97), first I get
> #(2 2862057 180 2 0 0 0 0)
>
> then
> #(2 2872015 97 1 0 97 0 0)
> #(2 2872015 97 0 0 97 0 0)
> #(2 2872015 769 1 0 769 0 0)
> #(2 2872015 769 0 0 769 0 0)
> #(2 2872191 97 2 0 97 0 0)
>
> and as result I get a?, not á

The VM appears to be sending the base character followed by the composition accent character. That in itself is correct, but the image side has to do something with it.

In the Etoys image, there is a class called UnicodeCompositionStream. If you put 97 (= 16r61) and 769 (= 16r301) into that stream, you get the accented a out of it. In the Etoys image, the ParagraphEditor uses it to make the composed character. It should work OK.

Alternatively (or along with it), you could turn on the Pango renderer, which takes a non-composed sequence and renders it properly.

-- Yoshiki
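To make the composition step concrete outside Squeak, here is a small Python sketch. It does not use UnicodeCompositionStream; it relies on Python's standard `unicodedata` module, which applies the same Unicode canonical composition, so it only illustrates the expected result for the two codes in the event dump.

```python
import unicodedata

# The two codes the VM reported in the keystroke events.
base = chr(97)     # "a"
accent = chr(769)  # U+0301 COMBINING ACUTE ACCENT (16r301)

# Canonical composition (NFC) merges the pair into one precomposed character.
composed = unicodedata.normalize("NFC", base + accent)

# 769 is a combining mark (non-zero combining class); 180, the code seen
# when the accent key itself was reported, is the spacing ACUTE ACCENT.
accent_is_combining = unicodedata.combining(accent) != 0
spacing_accent_is_combining = unicodedata.combining(chr(180)) != 0

print(composed, ord(composed))
```

The composed code is 225, the same value Michal sees when typing á directly.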
Yoshiki Ohshima wrote:
> The VM appears to be sending the base character and the composition
> accent character. Which itself is correct but the image side has to
> do something.
>
> In the Etoys image, there is a class called
> UnicodeCompositionStream. If you stick 97 (= 16r61) and 769 (=
> 16r301) to that stream, you get out of the accented a.

Sweet! I was just looking at it; it looks as if the code that generated the mapping was stripped out. Do you still have it somewhere? Also, is the set of combination rules complete, or does it only cover the common combinations?

In any case, this is hugely valuable - I'll check to see how we get this into Squeak.

Cheers,
  - Andreas
On Tue, Aug 18, 2009 at 9:12 AM, Yoshiki Ohshima <[hidden email]> wrote:

Assuming the Unicode characters 97 ("a") followed by 16r301 (the combining acute accent) in a String, should the correct behaviour be to consider this one character or two?

Given the String 'xxa'xx' (where "a" is Unicode 97 and the middle ' is Unicode 16r301), would "String at: 3" return a single composed character or an uncomposed character?

Or should Unicode-able Strings not be indexable at all, to completely circumvent issues like this?

Gulik

--
http://gulik.pbwiki.com/
At Mon, 17 Aug 2009 14:38:00 -0700,
Andreas Raab wrote:
>
> Yoshiki Ohshima wrote:
> > The VM appears to be sending the base character and the composition
> > accent character. Which itself is correct but the image side has to
> > do something.
> >
> > In the Etoys image, there is a class called
> > UnicodeCompositionStream. If you stick 97 (= 16r61) and 769 (=
> > 16r301) to that stream, you get out of the accented a.
>
> Sweet! I was just looking at it, it looks as if the code that generated
> the mapping was stripped out. Do you still have it somewhere? Also, is
> the rule of combinations complete or does it only cover the common
> combination rules?

Hehe, proper comments in methods and classes would probably be a good idea, as the method comment is wrong and there is nothing that tells you (err, us, really) what to do. But here it is. Download the following:

http://unicode.org/Public/UNIDATA/UnicodeData.txt

and put it in the directory with your image. Then evaluating:

CombinedChar parseCompositionMappingFrom: ((FileStream readOnlyFileNamed: 'UnicodeData.txt') wantsLineEndConversion: true)

would do it. (Actually, the resulting dictionaries are bigger than the ones in the Etoys image. The image hasn't been updated in some time...)

-- Yoshiki
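As a rough illustration of what building a composition mapping from UnicodeData.txt involves, here is a Python sketch. It is not the CombinedChar code; the function name `parse_composition_mapping` is hypothetical, and the three sample lines are written in the semicolon-separated UnicodeData.txt layout for illustration rather than read from the downloaded file. The sketch assumes the sixth field holds the decomposition, that a leading `<tag>` marks a compatibility decomposition to skip, and it ignores the separate composition exclusion list that a complete implementation would also consult.

```python
# Sample lines in the UnicodeData.txt layout (fields separated by ";").
SAMPLE = """\
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
00E1;LATIN SMALL LETTER A WITH ACUTE;Ll;0;L;0061 0301;;;;N;LATIN SMALL LETTER A ACUTE;;00C1;;00C1
0161;LATIN SMALL LETTER S WITH CARON;Ll;0;L;0073 030C;;;;N;LATIN SMALL LETTER S HACEK;;0160;;0160
"""


def parse_composition_mapping(lines):
    """Map (base, combining mark) code point pairs to the composed code point."""
    mapping = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 6 or not fields[5]:
            continue  # no decomposition recorded for this character
        parts = fields[5].split()
        if parts[0].startswith("<") or len(parts) != 2:
            continue  # skip compatibility and non-pair decompositions
        base, mark = (int(part, 16) for part in parts)
        mapping[(base, mark)] = int(fields[0], 16)
    return mapping


compositions = parse_composition_mapping(SAMPLE.splitlines())
print(compositions)
```

With the real file, the same loop could be fed from `open("UnicodeData.txt")`, which yields a much larger dictionary.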
At Tue, 18 Aug 2009 10:12:21 +1200,
Michael van der Gulik wrote:
>
> Assuming the Unicode characters 97 ("a") followed by 16r301 (the combining acute accent) in a String, should the correct behaviour be to
> consider this one character or two?
>
> Given the String 'xxa'xx' (where "a" is Unicode 97 and the middle ' is Unicode 16r301), would "String at: 3" return a
> single composed character or an uncomposed character?
>
> Or should Unicode-able Strings not be indexable at all, to completely circumvent issues like this?

A Unicode string can be indexable, but basically don't expect to always get a useful "character" (displayable, comparable, etc.). What you get back is a code point, not a character. For comparison and other purposes, you need to "normalize" the string first, and the result can be a single composed character or an uncomposed sequence.

However, when do you need "aString at: 3"? From the Squeak point of view, as long as some relationships are satisfied (such as #at: agreeing with #size), random access indexing is rarely needed, and where it is, it needs some closer attention.

-- Yoshiki
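The distinction between code points and characters can be seen in a few lines of Python, used here purely as an illustration with the standard `unicodedata` module (Squeak's String behaves differently in detail). Note that Python indexes from 0, so Smalltalk's "at: 3" corresponds to index 2.

```python
import unicodedata

# "xx", then "a" followed by U+0301 COMBINING ACUTE ACCENT, then "xx".
s = "xxa\u0301xx"

# Indexing returns code points: the base letter and the accent are separate.
third = s[2]   # corresponds to "aString at: 3"
fourth = s[3]

# Normalizing to NFC composes the pair, so the size and the third element change.
nfc = unicodedata.normalize("NFC", s)

# Comparison is only meaningful after normalizing both sides the same way.
equal_raw = s == "xx\u00e1xx"
equal_normalized = nfc == unicodedata.normalize("NFC", "xx\u00e1xx")

print(len(s), len(nfc), equal_raw, equal_normalized)
```

In both forms indexing agrees with size; what changes is whether the third element is a displayable letter on its own.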
Folks -
Armed with the input from Yoshiki, here is an attempt at addressing the problem of decomposed Unicode input. I decided to do the handling a little differently from Etoys by providing a UnicodeInputInterpreter that does the composition in HandMorph, since all the hooks were already available.

[Yoshiki - out of curiosity, what is the reason why in the Etoys image this level of input conversion is not managed via the input interpreter but rather separately in ParagraphEditor?]

I need some people who can test this stuff though. So Michal or anyone else who does m17n input on Linux, please try the following:

1) Verify that your platform generates non-composed input (see the event dump in Michal's earlier message).

2) Load the attached code. It will go and fetch the Unicode data and install the composition mappings in Unicode.

3) Install the new input converter using:

World primaryHand keyboardInterpreter: UnicodeInputInterpreter new.

4) Type the same sequence(s) as in 1).

If everything goes as it should, you should see the composed character. If it doesn't, it would be interesting to see what the state of the input event queue is at that point in time (print out the contents of "sensor eventQueue" inside of UnicodeInputInterpreter>>nextCharFrom:firstEvt:).

Change Set: UnicodeInput-ar
Date: 17 August 2009
Author: Andreas Raab

Simplified handling for decomposed Unicode input. UnicodeInputInterpreter deals with the composition based on the composition operations provided by Unicode:

- Unicode>>isComposed: aCharacter
- Unicode>>isComposable: aCharacter
- Unicode>>compose: baseChar with: compositionChar
- Unicode>>decompose: composedChar

See the method comments for more information.

If this works okay for people, I'll push it into the trunk.

Cheers,
  - Andreas

Attachment: UnicodeInput-ar.2.cs (7K)
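The idea of composing at the input-interpreter level can be sketched in Python. This is not the code from the change set; the function names are hypothetical, and the field meanings are assumptions read off the event dump earlier in the thread: the first field is taken to be 2 for keyboard events, the fourth field 0 for a keystroke (as opposed to 1 for key down and 2 for key up), and the sixth field the Unicode value.

```python
import unicodedata

EVENT_TYPE_KEYBOARD = 2  # assumed meaning of the first field
EVENT_KEY_CHAR = 0       # assumed meaning of the fourth field for keystrokes

# The events logged for "acute accent key, then a".
LOGGED_EVENTS = [
    (2, 2862057, 180, 2, 0, 0, 0, 0),
    (2, 2872015, 97, 1, 0, 97, 0, 0),
    (2, 2872015, 97, 0, 0, 97, 0, 0),
    (2, 2872015, 769, 1, 0, 769, 0, 0),
    (2, 2872015, 769, 0, 0, 769, 0, 0),
    (2, 2872191, 97, 2, 0, 97, 0, 0),
]


def keystroke_codes(events):
    """Keep only keystroke events and return their Unicode values."""
    return [evt[5] for evt in events
            if evt[0] == EVENT_TYPE_KEYBOARD and evt[3] == EVENT_KEY_CHAR]


def compose_keystrokes(codes):
    """Merge a combining mark into the preceding character when possible."""
    out = []
    for code in codes:
        ch = chr(code)
        if out and unicodedata.combining(ch):
            merged = unicodedata.normalize("NFC", out[-1] + ch)
            if len(merged) == 1:  # a precomposed form exists
                out[-1] = merged
                continue
        out.append(ch)
    return "".join(out)


typed = compose_keystrokes(keystroke_codes(LOGGED_EVENTS))
print(typed)
```

A sequence with no precomposed form, such as b followed by a combining grave accent, passes through this sketch unchanged as two code points.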
At Mon, 17 Aug 2009 22:22:52 -0700,
Andreas Raab wrote:
>
> Folks -
>
> Armed with the input from Yoshiki, here is an attempt at addressing the
> problem of decomposed unicode input.

Cool!

> I decided to do the handling a
> little differently from Etoys by providing a UnicodeInputInterpreter
> that does the composition in HandMorph since all the hooks were already
> available.
>
> [Yoshiki - out of curiosity, what is the reason why in the Etoys image
> this level of input conversion is not managed via the input interpreter
> but rather separately in ParagraphEditor?]

One primary case is that when the "multi key" style input is used (the user holds the "multi" key and hits a key to enter the accent), the composition char may not come right after the base char; the user could even move the cursor to a non-accented base char and hit the key sequence to enter just the composition char.

Another case is where the user pastes a string that begins with a composition character into text, and I thought it should combine that with the character before the paste point when possible. (This second case is not that important, I think.)

-- Yoshiki
Yoshiki Ohshima wrote:
> One primary case is that when the "multi key" style input is used
> (the user holds the "multi" key and hit a key to enter the accent),
> the composition char may not come right after the base char; the user
> could even move the cursor to a non-accented base char and hit the key
> sequence to just enter the composition char.

So you are basically saying that there are input modes where I can position the cursor at an arbitrary position in the text and then type a composition character to modify the character at the input position? Wow. I had no idea ;-) Where is this used?

I guess that means back to the drawing board, but it'll be interesting to see if the approach works at all for the case in question.

Cheers,
  - Andreas
At Mon, 17 Aug 2009 23:03:37 -0700,
Andreas Raab wrote:
>
> Yoshiki Ohshima wrote:
> > One primary case is that when the "multi key" style input is used
> > (the user holds the "multi" key and hit a key to enter the accent),
> > the composition char may not come right after the base char; the user
> > could even move the cursor to a non-accented base char and hit the key
> > sequence to just enter the composition char.
>
> So you are basically saying that there are input modes where I can
> position the cursor at an arbitrary position in the text and then type a
> composition character to modify the character at the input position?
> Wow. I had no idea ;-) Where is this used?

I'm not entirely sure how widely it is used in the world, but the XO keyboard setting for one country used that style. I vaguely remember that the layout was later moved back to dead-key style input, but the other style surely exists.

Yes, with this you can also stack many different accent marks on an arbitrary base character (e.g., you can type "accent-grave-circumflex b") quite easily in the XO's Chat program, Write activity, and so on. With Pango enabled, Etoys can do that.

> I guess that means back to the drawing board, but it'll be interesting
> to see if the approach works at all for the case in question.

Right. It was too easy to type a code point sequence with a composition character for which no pre-composed form exists. A more flexible renderer is ideal, but for now it is practical for Squeak to support only pre-composed forms and just display ? for unhandled cases...

-- Yoshiki
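The "support only pre-composed forms and show ? otherwise" policy can be illustrated with a short Python sketch. The function name `render_precomposed_only` is hypothetical and the sketch says nothing about how Squeak or Pango actually render text; it only shows which sequences have a single pre-composed code point according to Python's `unicodedata` module.

```python
import unicodedata


def render_precomposed_only(cluster):
    """Return the pre-composed character for a base + marks cluster, or "?"."""
    composed = unicodedata.normalize("NFC", cluster)
    return composed if len(composed) == 1 else "?"


# "a" + combining acute accent has a pre-composed form.
simple = render_precomposed_only("a\u0301")

# "b" + combining grave + combining circumflex has none, so a renderer
# limited to pre-composed forms has nothing better to show than "?".
stacked = "b\u0300\u0302"
stacked_nfc_length = len(unicodedata.normalize("NFC", stacked))
fallback = render_precomposed_only(stacked)

print(simple, stacked_nfc_length, fallback)
```

A renderer that positions combining marks itself, which is the role Pango plays in the discussion above, would not need this fallback.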
Hi,

2009/8/27 Andreas Raab <[hidden email]>
> What happens when you use Character value: instead of Unicode
> value:? Does it blow up? Does it display incorrectly? Anything else?

$? for every code > 255. For example, Character value: 353 shows $?, while Character leadingChar: 14 code: 353 shows $š.

Today I tested UnicodeInputInterpreter on Win XP and it works as well as on Linux.

Cheers
Michal
Michal Perutka wrote:
> 2009/8/27 Andreas Raab <[hidden email]>
>
>> Actually, that was no mistake. I meant to use Character value: xxx
>> since I want us to get away from the leading char stuff in Unicode.
>> What happens when you use Character value: instead of Unicode
>> value:? Does it blow up? Does it display incorrectly? Anything else?
>
> $? for every code > 255. For example Character value: 353 shows $?,
> Character leadingChar: 14 code: 353 shows $š

That's *extremely* strange. I just tried it and it shows $š in both cases in a current updated trunk image after selecting a suitable font (I used Arial). What image are you using? Which font(s) did you use to try this?

Can someone else try to verify this with a current trunk image? (I'm wondering if I screwed something up with my own experiments here.) You should see no difference between printing (Character leadingChar: 14 code: 353) and (Character value: 353) (i.e., they either both display as $? or they both display as $š).

Cheers,
  - Andreas
2009/8/28 Andreas Raab <[hidden email]>
> Can someone else try to verify this with a current trunk image? (I'm
> wondering if I screwed something up with my own experiments here)
> You should see no difference between printing (Character
> leadingChar: 14 code: 353) and (Character value: 353) (i.e., they
> either both display as $? or they both display as $š).

Sorry, the problem is with my font - I use a Latin2 bitmap font. When I use some TTF font (installed from Windows), it is OK.

Cheers
Michal
2009/8/27 Andreas Raab <[hidden email]>

Using evtBuf sixth throughout the method caused problems with ctrl-c, ctrl-s, etc., and even scrolling with the mouse wheel stopped working. This modification seems to fix them:

UnicodeInputInterpreter>>nextCharFrom: sensor firstEvt: evtBuf
    "Compose Unicode character sequences"
    "Only try this if the first event is composable and is a character event"
    | peekEvent keyValue composed |
    keyValue := evtBuf sixth > 127 ifTrue: [evtBuf sixth] ifFalse: [evtBuf third].
    ((Unicode isComposable: keyValue) and: [evtBuf fourth = EventKeyChar]) ifTrue: [
    ...

Cheers
Michal
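The key-value selection in the modified method can be restated as a tiny Python sketch, checked against the event buffers logged earlier in the thread. The function name `key_value` is hypothetical; the only logic taken from the post is "use the sixth field when it is above 127, otherwise the third". Python tuples are 0-based, so Smalltalk's "sixth" and "third" become indexes 5 and 2.

```python
def key_value(evt_buf):
    """Pick the sixth field when it is above 127, else fall back to the third."""
    sixth, third = evt_buf[5], evt_buf[2]
    return sixth if sixth > 127 else third


# Event buffers copied from the Transcript output earlier in the thread.
combining_accent = (2, 2872015, 769, 0, 0, 769, 0, 0)
plain_a = (2, 2872015, 97, 0, 0, 97, 0, 0)
accent_key_up = (2, 2862057, 180, 2, 0, 0, 0, 0)

values = [key_value(evt) for evt in (combining_accent, plain_a, accent_key_up)]
print(values)
```

For the third buffer the sixth field is 0, so the selection falls back to the third field; this is the kind of event where reading only the sixth field would lose the key code.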
Michal Perutka wrote:
> 2009/8/28 Andreas Raab <[hidden email]>
>
>> Can someone else try to verify this with a current trunk image? (I'm
>> wondering if I screwed something up with my own experiments here)
>> You should see no difference between printing (Character
>> leadingChar: 14 code: 353) and (Character value: 353) (i.e., they
>> either both display as $? or they both display as $š).
>
> Sorry, the problem is with my font - I use a Latin2 bitmap font. When I
> use some TTF font (installed from Windows), it is OK.

Phew! Glad we sorted that out ;-)

Cheers,
  - Andreas