Hello again.

Does anyone know how we can reflect Unicode characters in Smalltalk?

I want to read characters and compare them with a Unicode representation, like U+0026 for & or U+003C for <, and so on.

Thanks
Mohammad
|
On 14 March 2013 20:58, Mohammad Al Houssami (Alumni) <[hidden email]> wrote:
> Hello again.
>
> Does anyone know how can we reflect Unicode characters in Smalltalk?

What do you mean by "reflect"?

> I want to read characters and compare them with a Unicode representation.
> Like U+0026 for & or U+003C for < and so on..

Piece of cake:

    $& asUnicode hex    '16r26'
    $< asUnicode hex    '16r3C'

In Pharo, Characters are inherently stored as Unicode code points.

So your main question would be where you want to read from (file/network), and what encoding the source uses (UTF-8, UTF-16, UTF-32, etc.).

> Thanks
>
> Mohammad

--
Best regards,
Igor Stasenko.
|
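A minimal sketch of the comparison Igor describes, assuming a Pharo workspace. It only uses Character>>asInteger, Character class>>value: and radix literals; the variable name c is just for illustration.

```smalltalk
"Compare a character with a Unicode code point written in hexadecimal."
| c |
c := $&.
c asInteger = 16r26.              "true: & is U+0026"
c = (Character value: 16r26).     "true: the same test, character against character"
$< asInteger = 16r3C.             "true: < is U+003C"
```

Either direction works: turn the character into an integer and compare numbers, or build a Character from the code point and compare characters.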
I am trying to build a tokenizer according to the spec below:
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenization

Let me take the Data State as an example. It says:

Consume the next input character:

U+0026 AMPERSAND (&)
    Switch to the character reference in data state.

So I want to consume the next character of the HTML file and compare it to U+0026. The encoding is probably UTF-8 since it's the most common.

The problem is that some characters cannot be typed as literals in order to use the asUnicode function on them. For example, you have these cases, which are not trivial:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE

In Java you simply do if (c == '\u0009'). How does this go in Smalltalk?

-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2013.0.2904 / Virus Database: 2641/6173 - Release Date: 03/14/13
|
On 14 March 2013 21:14, Mohammad Al Houssami (Alumni)
<[hidden email]> wrote:
> For example you have these cases which are not trivial:
> U+0009 CHARACTER TABULATION (tab)
> U+000A LINE FEED (LF)
> U+000C FORM FEED (FF)
> U+0020 SPACE
>
> In java you simply do if (c == '\u0009').
> How does this go in Smalltalk?

    c = Character tab
        ifTrue: [ ... ]
        ifFalse: [ ... ]

For many other 'funky' characters you can use a converter method, which converts an integer to a character:

    1234 asCharacter asUnicode    => 1234

--
Best regards,
Igor Stasenko.
|
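As a rough sketch of how the four characters from the question could be tested in one place, assuming Pharo: the block name classify and the symbols it returns are hypothetical and are not taken from the WHATWG spec or from any Pharo package.

```smalltalk
"Sketch: classify one input character. classify and the returned symbols are hypothetical."
| whitespace classify |
whitespace := { Character tab.                "U+0009"
                Character lf.                 "U+000A"
                Character value: 16r000C.     "U+000C FORM FEED"
                Character space }.            "U+0020"
classify := [ :c |
    (whitespace includes: c)
        ifTrue: [ #whitespace ]
        ifFalse: [
            c = $&
                ifTrue: [ #characterReference ]
                ifFalse: [ #other ] ] ].
classify value: (Character value: 16r000A).   "#whitespace"
classify value: $&.                           "#characterReference"
classify value: $a.                           "#other"
```

Characters that have a named constructor (tab, lf, space) can use it; anything else can be built from its code point with Character value:.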
In reply to this post by Mohammad Al Houssami (Alumni)
Btw, I detected a virus in your message :)
What else can you call software that appends extra advertising content to your mail messages? :)

> No virus found in this message.
> Checked by AVG - www.avg.com
> Version: 2013.0.2904 / Virus Database: 2641/6173 - Release Date: 03/14/13

--
Best regards,
Igor Stasenko.
|
Haha, I usually delete them... I forgot this time. I'll try what you said. Thanks so much for the help :)
|
In reply to this post by Mohammad Al Houssami (Alumni)
I'm not sure if I understand your question. But did you try this?

    yourCharacter = 16r0026 asCharacter

Regards.
|
I'm not sure where you got the 16r0026 from. What I have is something like this:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE

What's on the left is the Unicode representation. In some cases there is no printable character to compare to (so I can't write a literal as I can for < > & ...). What I will have to do is create a character with the Unicode value U+000A. I have tried

    myCharacter := 000A asCharacter.

but it doesn't work. When I try to print myCharacter it gives an error because it doesn't understand the A.

Thanks
Mohammad
|
On 15 March 2013 13:44, Mohammad Al Houssami (Alumni) <[hidden email]> wrote:
> What I will have to do is to create a character with unicode value U+000A.
> I have tried myCharacter:= 000A asCharacter. But it doesn't work.
>
> When I try to print myCharacter it gives an error because it doesn't understand the A.

Because 000A is not a valid literal in Smalltalk. It is actually understood by the compiler as a unary message send:

    0 A

If you want to use hexadecimal integer literals, you should use radix notation:

    <base>r<number>

Try:

    2r10101101
    16rA

--
Best regards,
Igor Stasenko.
|
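To make the radix notation concrete for the four code points from the question, here is a small sketch assuming Pharo. The variable name lineFeed is illustrative only.

```smalltalk
"000A is not a literal, but 16r000A is the integer 10 written in base 16."
| lineFeed |
lineFeed := 16r000A asCharacter.
lineFeed = Character lf.                   "true"
lineFeed = (Character value: 10).          "true: 16r000A and 10 are the same integer"
16r0009 asCharacter = Character tab.       "true"
16r0020 asCharacter = Character space.     "true"
```

Leading zeros after the r are harmless, so the digits of a U+XXXX label can be copied as they stand.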
I'm new to Smalltalk, so I'm not sure how you derived the numbers. So if I have 000A as a Unicode value, how do I do the transformation? I don't know how you got

    2r10101101
    16rA

And what are base, r, and number in <base>r<number>?

Thanks again
Mohammad
|
It's not base 'r', it's base 16. In Smalltalk you use 'r' to denote the fact that you're specifying the base (radix) (https://en.wikipedia.org/wiki/Radix) of a number.

So Igor was pointing out that for the hexadecimal representation of the decimal number 10 you use

    16rA

For the binary representation of the decimal number 10 you use

    2r00001010

and for octal it's

    8r12

So when he says <base>r<number>, he's just pointing out the pattern shown above for writing the decimal number 10 in base 16, 2, and 8.

Also, to see what a number is in a different base (radix), use this:

    10 radix: 4

or

    16rA radix: 10
    2r00001010 radix: 10
    8r12 radix: 10

Hope this helps.

Paul
|
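If the code points arrive as text rather than as literals in your source, a sketch along these lines may help. It assumes Pharo's Integer class>>readFrom:base: and Integer>>printStringBase: are available in your image; the variable names are hypothetical.

```smalltalk
"Sketch: turn a spec-style label such as 'U+000A' into a Character."
| label hexDigits codePoint char |
label := 'U+000A'.
hexDigits := label copyFrom: 3 to: label size.      "'000A'"
codePoint := Integer readFrom: hexDigits base: 16.  "10"
char := Character value: codePoint.
char = Character lf.                                 "true"
codePoint printStringBase: 16.                       "'A'"
```

This is the runtime counterpart of writing 16r000A directly in the source: the literal is parsed by the compiler, the string is parsed by readFrom:base:.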
OK, it's just that I didn't understand the notation he used. It's clear now. I know about radix and bases and all that; it was the notation that confused me. The whole thing was just a matter of changing notation from base 16 to base 10. I tried it and it works.

Thank you Paul and Igor.
|
In reply to this post by Esteban A. Maringolo
[ I am still not accepted to the mailing list to POST ... ]

Just looking at Unicode handling in the Pharo 2.0 release today ...

The class comment for UTF8DecomposedTextConverter says

"An UTF8DecomposedTextConverter converts from decomposed UTF8 using the UnicodeCompositionStream."

I don't see UnicodeCompositionStream ... at least not in this One-Click image today ...

I also wonder why we do not have more of a parallel in the BOM methods for the UTF-8 and UTF-16 converter classes, to ensure converter instances have polymorphic behavior ... even such simple accessors as BOM value methods are not implemented ... but one class has a >>nextPut... method that does a BOM check each time, so it is not just avoiding a method call here, if that was the issue, I suppose ...

I do use Unicode every day in my Japanese apps, and this mismatch across converters looks as if it would bite someone eventually ... Or no ? Maybe I am missing something obvious ...

Btw, are we safe to assume that no one in India (or elsewhere with IT shops) is still using UTF-32 ?
|
On Mar 18, 2013, at 4:52 PM, Robert Shiplett <[hidden email]> wrote:
Post arrived… and no manual accept needed.
|
so-desu ...
Clicked in don't show message again box ...

But by then I was looking at

    (encoding = 'utf-8')
        ifTrue: [^UTF8TextConverter].
    (encoding = 'shiftjis' or: [ encoding = 'sjis' ])

as seen in JapaneseEnvironment>>systemConverterClass for X11 ... it also looks a little fragile for lack of accessors, yet X11Encoding is just such a class-side location for "utf8EncodingCode" or suchlike ...

My other programming environment uses "utf8" as the string. Oy vey. Then there are the 'UTF-8' folks out there ...

Is there a reason to avoid symbols here ? Or at least to refactor that

    () or: []

out and into X11Encoding ?

That call at 2AM on a Sunday morning is Sooo often for similar issues that crept past code reviews ... or is this a lighter style than I am used to ?
|
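One way the spelling variants mentioned above ('utf-8', 'utf8', 'UTF-8') could be folded together before comparing is sketched below. The normalize block is a hypothetical helper and is not part of Pharo, X11Encoding or JapaneseEnvironment; it only illustrates the idea of comparing symbols instead of raw strings.

```smalltalk
"Sketch: fold encoding-name variants into one symbol before comparing.
 normalize is a hypothetical helper, not an existing Pharo method."
| normalize |
normalize := [ :name |
    (name asLowercase reject: [ :ch | ch = $- or: [ ch = $_ ] ]) asSymbol ].
normalize value: 'UTF-8'.        "#utf8"
normalize value: 'utf8'.         "#utf8"
normalize value: 'Shift_JIS'.    "#shiftjis"
(normalize value: 'utf-8') = #utf8.   "true"
```

With something like this in a single class-side method, the string literals would live in one place rather than being repeated in each environment class.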
On Mar 18, 2013, at 5:26 PM, LogiqueWerks <[hidden email]> wrote:
> Is there a reason to avoid symbols here ? Or at least to refactor that
>
> () or: []
>
> out and into X11Encoding ?
>
> That call at 2AM on a Sunday morning is Sooo often for similar issues that
> crept past code reviews ... or is this a lighter style that I am used to ?

Code quality is very low for most of the old code base… Improvements are always welcome.

Marcus
|
On 18 Mar 2013, at 17:30, Marcus Denker <[hidden email]> wrote:
> code quality is very low for most of the old code base…
> Improvements are always welcome.

Check out the ZnCharacterEncoder hierarchy for a much simpler, more modern implementation of UTF8 and some common byte encoders.

It requires people fluent in all kinds of special spoken languages to help in moving things forward.

What do we really need above UTF8 and some simple byte encoders ?

Sven

--
Sven Van Caekenberghe
http://stfx.eu
Smalltalk is the Red Pill
|
Sven Van Caekenberghe wrote:
> Check out the ZnCharacterEncoder hierarchy for a much simpler, more modern implementation of UTF8 and some common byte encoders.

Is that a candidate to move into core?
|
Ben,
On 19 Mar 2013, at 00:41, Ben Coman <[hidden email]> wrote:
> Is that a candidate to move into core?

It is permanently in the image. Pending community testing, feedback, improvement, and extension, yes, it could replace the others.

It is currently used for the HTTP layer, but is fully functional for any stream:

    'my-utf8-file.txt' asFileReference readStreamDo: [ :stream |
        stream binary. "remove the old stuff"
        (ZnCharacterReadStream on: stream) upToEnd ].

The key difference is that it works strictly from bytes to/from characters, not from character to/from character like the old ones.

Sven
|
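To illustrate the bytes-to/from-characters point without touching a file, here is a small round-trip sketch. It assumes the Zinc version in your image provides ZnUTF8Encoder with encodeString: and decodeBytes: (worth checking in the class browser); the variable names are hypothetical.

```smalltalk
"Sketch: one character above ASCII becomes two bytes in UTF-8 and decodes back.
 Assumes ZnUTF8Encoder>>encodeString: and >>decodeBytes: exist in this image."
| encoder eAcute bytes decoded |
encoder := ZnUTF8Encoder new.
eAcute := String with: (Character value: 16r00E9).   "U+00E9, a single character"
bytes := encoder encodeString: eAcute.               "a ByteArray: 16rC3 16rA9"
decoded := encoder decodeBytes: bytes.
bytes size = 2.                                       "true"
decoded first asInteger = 16r00E9.                    "true"
```

The tokenizer can then work purely on Characters and their code points, with the encoder being the only place that knows the input was UTF-8.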