Unicode in Smalltalk


Unicode in Smalltalk

Mohammad Al Houssami (Alumni)

Hello again.

 

Does anyone know how we can reflect Unicode characters in Smalltalk?

I want to read characters and compare them with their Unicode representation, such as U+0026 for & or U+003C for <, and so on.

Thanks

Mohammad


Re: Unicode in Smalltalk

Igor Stasenko
On 14 March 2013 20:58, Mohammad Al Houssami (Alumni)
<[hidden email]> wrote:
> Hello again.
>
>
>
> Does anyone know how can we reflect Unicode characters in Smalltalk?
>
What do you mean by "reflect"?


> I want to read characters and compare them with a Unicode representation.
> Like U+0026 for & or U+003C for < and so on..
>
Piece of cake.
$& asUnicode hex  "=> '16r26'"
$< asUnicode hex  "=> '16r3C'"

In Pharo, Characters are inherently stored as Unicode.

So your main question is where you want to read from (file/network),
and what encoding the source uses (UTF-8, UTF-16, UTF-32, etc.)...




--
Best regards,
Igor Stasenko.


Re: Unicode in Smalltalk

Mohammad Al Houssami (Alumni)
I am trying to build a tokenizer according to the spec below
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenization

Take the data state as an example. It says:
Consume the next input character:

U+0026 AMPERSAND (&)
    Switch to the character reference in data state.

So I want to consume the next character of the HTML file and compare it to U+0026.
The encoding is probably utf-8 since it’s the most common.
The problem is that some characters cannot be typed as a literal, so there is nothing to send asUnicode to.

For example you have these cases which are not trivial:
U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE

In Java you simply write if (c == '\u0009').
How is this done in Smalltalk?





Re: Unicode in Smalltalk

Igor Stasenko
On 14 March 2013 21:14, Mohammad Al Houssami (Alumni)
<[hidden email]> wrote:

> In java you simply do if (c == '\u0009').
> How does this go in Smalltalk?
>
In Smalltalk you write:

c = Character tab ifTrue: [... ] ifFalse: [... ]

For many other 'funky' characters you can use a conversion method, which
converts an integer to a character:

1234 asCharacter asUnicode  "=> 1234"
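
As a minimal sketch of that comparison, the whitespace code points from the spec excerpt can be matched either through the named constructors on Character or through their numeric values. The variable c is only a stand-in for the next input character, and Character value: is assumed here as the integer-to-character constructor, equivalent to sending asCharacter to the integer.

```smalltalk
| c |
"Hypothetical input character; a tokenizer would take it from its input stream."
c := Character value: 16r000A.

c = Character tab.                 "U+0009 CHARACTER TABULATION"
c = Character lf.                  "U+000A LINE FEED"
c = (Character value: 16r000C).    "U+000C FORM FEED, built from its number"
c = Character space.               "U+0020 SPACE"
c = $&.                            "U+0026, printable, so a literal works"
c asInteger = 16r000A.             "or compare the numeric value directly"
```

Each line is an independent Boolean expression, so any of them can be used as the receiver of ifTrue:ifFalse: in the same way as the Character tab test.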





--
Best regards,
Igor Stasenko.


Re: Unicode in Smalltalk

Igor Stasenko
In reply to this post by Mohammad Al Houssami (Alumni)
Btw, I detected a virus in your message :)
What else can you call software that appends extra advertising content
to your mail messages? :)

> No virus found in this message.
> Checked by AVG - www.avg.com
> Version: 2013.0.2904 / Virus Database: 2641/6173 - Release Date: 03/14/13



--
Best regards,
Igor Stasenko.


Re: Unicode in Smalltalk

Mohammad Al Houssami (Alumni)

HAHA
I usually delete them... I forgot this time.

I'll try what you said. Thanks so much for the help :)


Re: Unicode in Smalltalk

Esteban A. Maringolo
In reply to this post by Mohammad Al Houssami (Alumni)
I'm not sure if I understand your question.

But did you try this? :

yourCharacter = 16r0026 asCharacter

Regards.


Re: Unicode in Smalltalk

Mohammad Al Houssami (Alumni)
I'm not sure where you got the 16r0026 from. What I have is something like this:
U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE

What's on the left is the Unicode code point. In some cases there is no printable character to compare against (so I can't write a literal as I can for <, > or &).
What I have to do is create a character with the Unicode value U+000A.
I have tried myCharacter := 000A asCharacter, but it doesn't work.

When I try to print myCharacter it gives an error because it doesn't understand the A.

Thanks
Mohammad

Re: Unicode in Smalltalk

Igor Stasenko
On 15 March 2013 13:44, Mohammad Al Houssami (Alumni)
<[hidden email]> wrote:

> I have tried myCharacter:= 000A asCharacter. But it doesn't work.
>
> When I try to print myCharacter it gives an error because it doesn't understand the A.
>
Because 000A is not a valid literal in Smalltalk.
The compiler actually reads it as the unary message A sent to 0:
 0 A

If you want to use hexadecimal integer literals, you should use radix notation:

<base>r<number>

Try:

2r10101101
16rA




--
Best regards,
Igor Stasenko.


Re: Unicode in Smalltalk

Mohammad Al Houssami (Alumni)
I'm new to Smalltalk, so I'm not sure how you derived the numbers.
So if I have 000A as a Unicode value, how do I do the transformation? I don't know how you got
2r10101101
16rA

And what are base, r and number in <base>r<number>?

Thanks again
Mohammad


Re: Unicode in Smalltalk

Paul DeBruicker
It's not base 'r'; it's base 16. In Smalltalk you use 'r' to denote that
you're specifying the base (radix)
(https://en.wikipedia.org/wiki/Radix) of a number. So Igor was
pointing out that for the hexadecimal representation of the decimal
number 10 you use


16rA


For the binary representation of the decimal number 10 you use

2r00001010


and for octal it's

8r12


so when he says

<base>r<number>

he's just pointing out the pattern I used above when writing the
decimal number 10 in base 16, 2, and 8.

Also, to see what a number is in a different base (radix), use this:


10 radix: 4

or

16rA radix: 10
2r00001010 radix: 10
8r12 radix: 10
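
Applied to the original question, the U+XXXX values from the spec can be written directly as base-16 literals and turned into characters. This is only a small sketch: the temporary names are illustrative, and printStringBase: and printStringRadix: are the Integer printing messages assumed for going back from a character to hexadecimal digits.

```smalltalk
| lineFeed ampersand |
"U+000A and U+0026 written as radix literals; leading zeros are allowed."
lineFeed := 16r000A asCharacter.
ampersand := 16r0026 asCharacter.

lineFeed = Character lf.        "same character as the named constructor"
ampersand = $&.                 "same character as the literal"

"Going the other way: from a character back to its hexadecimal digits."
$& asInteger printStringBase: 16.     "digits only"
$& asInteger printStringRadix: 16.    "digits with the radix prefix"
```

So U+000A becomes 16r000A: the U+ prefix is replaced by 16r and the digits stay the same.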




hope this helps

Paul



Re: Unicode in Smalltalk

Mohammad Al Houssami (Alumni)
Ok, it's just that I didn't understand the notation he used. It's clear now. I know about radix and bases and all this stuff, but it's the notation that got me confused. The whole thing was just changing the notation from base 16 to base 10. I tried it and it works.

Thank you Paul and Igor.



Re: Unicode in Smalltalk

LogiqueWerks
In reply to this post by Esteban A. Maringolo
Just looking in Pharo 2.0 release on Unicode handling today ...

The class comment for UTF8DecomposedTextConverter says

"An UTF8DecomposedTextConverter converts from decomposed UTF8 using the UnicodeCompositionStream."

I don't see UnicodeCompositionStream ... at least not in this One-Click image today ...

I also wonder why the BOM methods of the utf8 and utf16 converter classes are not more parallel, so that converter instances behave polymorphically ... even such simple accessors as BOM value methods are not implemented ... yet one class has a >>nextPut... method that does a BOM check each time, so the point cannot be merely to avoid a method call, I suppose ...

I do use Unicode every day in my Japanese apps and this mismatch across converters looks as if it would bite someone eventually ...

Or no ?  Maybe I am missing something obvious ...

Btw, are we safe to assume that no one in India (or elsewhere with IT shops) is still using utf32 ?

Re: Unicode in Smalltalk

Marcus Denker-4

On Mar 18, 2013, at 4:52 PM, Robert Shiplett <[hidden email]> wrote:

> [ I still am not accepted to the mailing list to POST ... ]


Post arrived… and no manual accept needed.


Re: Unicode in Smalltalk

LogiqueWerks
so-desu ...

Clicked in don't show message again box ...

But by then I was looking at

                        (encoding = 'utf-8')
                                ifTrue: [^UTF8TextConverter].
                        (encoding = 'shiftjis' or: [ encoding = 'sjis' ])

as seen in JapaneseEnvironment>>systemConverterClass for X11 ... it also looks a little fragile for lack of accessors, yet X11Encoding is just the class-side location for a "utf8EncodingCode" or suchlike ...

My other programming environment uses "utf8" as the string. Oy vey. Then there are the 'UTF-8' folks out there ...

Is there a reason to avoid symbols here ?  Or at least to refactor that

  () or: []

out and into X11Encoding ?

That call at 2AM on a Sunday morning is sooo often for similar issues that crept past code reviews ... or is this a lighter style than I am used to ?
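
One way to make such comparisons less fragile is to normalize the encoding name before testing it. This is only a sketch of the idea, not code from the image: the block normalize and the sample names are hypothetical.

```smalltalk
| normalize |
"Hypothetical helper: lowercase the name and drop hyphens and underscores,
 so that 'UTF-8', 'utf-8' and 'utf8' all compare equal."
normalize := [ :name |
    name asLowercase reject: [ :each | each = $- or: [ each = $_ ] ] ].

(normalize value: 'UTF-8') = 'utf8'.
(normalize value: 'utf-8') = (normalize value: 'utf8').
(normalize value: 'Shift_JIS') = 'shiftjis'.
```

With the name normalized in one place, a test such as the () or: [] above would need only a single comparison per encoding.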


Re: Unicode in Smalltalk

Marcus Denker-4

On Mar 18, 2013, at 5:26 PM, LogiqueWerks <[hidden email]> wrote:

> That call at 2AM on a Sunday morning is Sooo often for similar issues that
> crept past code reviews ... or is this a lighter style that I am used to ?

Code quality is very low for most of the old code base…
Improvements are always welcome.


        Marcus



Re: Unicode in Smalltalk

Sven Van Caekenberghe-2

On 18 Mar 2013, at 17:30, Marcus Denker <[hidden email]> wrote:

> code quality is very low for most of the old code base…  
> Improvements are always welcome.

Check out the ZnCharacterEncoder hierarchy for a much simpler, more modern implementation of UTF8 and some common byte encoders.

It requires people fluent in all kinds of special spoken languages to help in moving things forward.

What do we really need beyond UTF-8 and some simple byte encoders?

Sven

--
Sven Van Caekenberghe
http://stfx.eu
Smalltalk is the Red Pill



Re: Unicode in Smalltalk

Ben Coman
Sven Van Caekenberghe wrote:

> Check out the ZnCharacterEncoder hierarchy for a much simpler, more modern implementation of UTF8 and some common byte encoders.
>  

Is that a candidate to move into core?


Re: Unicode in Smalltalk

Sven Van Caekenberghe-2
Ben,

On 19 Mar 2013, at 00:41, Ben Coman <[hidden email]> wrote:

> Is that a candidate to move into core?

It is permanently in the image.

Pending community testing, feedback, improvement, extension, yes it could replace the others.

It is currently used for the HTTP layer, but is fully functional for any stream:

'my-utf8-file.txt' asFileReference readStreamDo: [ :stream |
        stream binary. "switch to binary mode to bypass the old converters"
        (ZnCharacterReadStream on: stream) upToEnd ].

The key difference is that it works strictly from bytes to/from characters, not from character to/from character like the old ones.
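
Tying this back to the tokenizer question from the start of the thread, here is a hedged sketch that decodes UTF-8 bytes with ZnCharacterReadStream and then compares each decoded character to a code point. The byte array stands in for a file, and the state names are just strings echoing the WHATWG data state; none of this is taken from an actual tokenizer.

```smalltalk
| bytes chars states |
"UTF-8 bytes for: & < TAB é  (é is encoded as the two bytes 195 169)"
bytes := #[38 60 9 195 169].
chars := ZnCharacterReadStream on: bytes readStream.
states := OrderedCollection new.

[ chars atEnd ] whileFalse: [
    | c |
    c := chars next.
    c = $&
        ifTrue: [ states add: 'character reference in data state' ]
        ifFalse: [
            c = $<
                ifTrue: [ states add: 'tag open state' ]
                ifFalse: [
                    "Anything else is recorded by its code point in hexadecimal."
                    states add: 'emit U+' , (c asInteger printStringBase: 16) ] ] ].
states
```

Because the stream hands back Characters rather than bytes, the comparisons are made on code points and do not depend on how many bytes each character took in the file.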

Sven
