Unicode strings, benchmarks

Janko Mivšek
Hi Squeakers,

I have already extended String with TwoByteString and added "scaling":
a string is automatically converted to a wider string when a wider
character is put into it. So far so good, and this already works in
Aida/Web.
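
The "scaling" rule can be sketched in a workspace as picking the narrowest storage that still holds the widest character of a text. This is only an illustration under assumptions, not the actual TwoByteString code: the block name bytesPerCharFor and the 1/2/4-byte thresholds are made up for the example.

```smalltalk
"Sketch: choose a storage width from the widest code point in a text.
 bytesPerCharFor is a hypothetical helper, not part of Squeak or Aida/Web."
| bytesPerCharFor |
bytesPerCharFor := [:codePoints |
    | widest |
    widest := codePoints inject: 0 into: [:max :each | max max: each].
    widest < 16r100
        ifTrue: [1]                      "fits a byte string (ASCII, Latin 1)"
        ifFalse: [widest < 16r10000
            ifTrue: [2]                  "needs two bytes per character"
            ifFalse: [4]]].              "beyond the Basic Multilingual Plane"

bytesPerCharFor value: #(16r41 16rE9).   "A, e-acute (U+00E9): answers 1"
bytesPerCharFor value: #(16r41 16r10D).  "A, c-caron (U+010D): answers 2"
```

Under this rule French text stays in a byte string, while a single Slovenian letter such as U+010D is enough to force the wider representation.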

I also wrote a somewhat better UTF-8 conversion, but it is only 25-80%
faster than the existing one in UTF8TextConverter. To prepare for even
better results, I made a benchmark that measures conversion time for
2500-character texts in English, French, Slovenian, Russian and Chinese.
It measures 100 conversions per text, which adds up to 250K characters.
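
A measurement loop of this shape can be tried in a workspace. It is a minimal sketch, not the benchmark that produced the numbers in this thread: sampleText is a placeholder for the real texts, and the sketch assumes that String>>convertToEncoding: and Time class>>millisecondsToRun: are present in the image, as they appear to be in 3.8/3.9 images.

```smalltalk
"Sketch of a timing loop: 100 conversions of a 2500-character text.
 sampleText is a stand-in; substitute real English, Russian, Chinese text."
| sampleText elapsed |
sampleText := String new: 2500 withAll: $a.       "pure ASCII placeholder"
elapsed := Time millisecondsToRun: [
    100 timesRepeat: [sampleText convertToEncoding: 'utf-8']].
Transcript
    show: 'UTF-8 encode, 100 x 2500 chars: ', elapsed printString, ' ms';
    cr.
```

Running the loop several times and taking the smallest value reduces noise from garbage collection.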

Here are the results for VW, and for Squeak with the old UTF8 converter
and the new one:

           VW     old     new
english    30     313     248  ByteString,   pure ASCII
french     32     323     251  ByteString,   ISO8859-1 (Latin 1)
slovenian  48     578     480  TwoByteString Latin 2
russian   112    1306     720  TwoByteString Cyrillic
chinese   107    1544    3825  TwoByteString

Notice the exceptional VW performance, about 10x that of Squeak, and
they do all encodings in plain Smalltalk! No primitives! So how come
Squeak is so slow here?
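
The ratios behind the "10x" remark can be recomputed from the table. The snippet below is a small sketch that only uses the numbers already posted; the variable names are illustrative.

```smalltalk
"Squeak time divided by VW time, rounded, for each language in the table."
| results |
results := {
    #english -> #(30 313 248).
    #french -> #(32 323 251).
    #slovenian -> #(48 578 480).
    #russian -> #(112 1306 720).
    #chinese -> #(107 1544 3825)}.
results do: [:row |
    | vw oldTime newTime |
    vw := row value at: 1.
    oldTime := row value at: 2.
    newTime := row value at: 3.
    Transcript
        show: row key; tab;
        show: 'old/VW: ', (oldTime / vw) rounded printString; tab;
        show: 'new/VW: ', (newTime / vw) rounded printString; cr].
```

The old converter comes out at roughly 10 to 14 times the VW time. The new one is at roughly 6 to 10 times, except for Chinese, where it is about 36 times.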

The above benchmark was run on Squeak 3.9 on SUSE Linux 10.1, P3.2GHz.

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Re: Unicode strings, benchmarks

Janko Mivšek
With corrected table of results:

>            VW     old     new
> english    30     313     248  ByteString,   pure ASCII
> french     32     323     251  ByteString,   ISO8859-1 (Latin 1)
> slovenian  48     578     480  TwoByteString Latin 2
> russian   112    1306     720  TwoByteString Cyrillic
> chinese   107    1544    3825  TwoByteString
>


Re: Unicode strings, benchmarks

Yoshiki Ohshima
In reply to this post by Janko Mivšek
  Hi, Janko,

> I also wrote a somewhat better UTF-8 conversion, but it is only 25-80%
> faster than the existing one in UTF8TextConverter.

  Good!

> Here are the results for VW, and for Squeak with the old UTF8 converter
> and the new one:
>
>            VW     old     new
> english    30     313     248  ByteString,   pure ASCII
> french     32     323     251  ByteString,   ISO8859-1 (Latin 1)
> slovenian  48     578     480  TwoByteString Latin 2
> russian   112    1306     720  TwoByteString Cyrillic
> chinese   107    1544    3825  TwoByteString
>
> Notice the exceptional VW performance, about 10x that of Squeak, and
> they do all encodings in plain Smalltalk! No primitives! So how come
> Squeak is so slow here?

  Is it true that you traded away performance for Chinese to gain it
for the other languages?

  BTW, I can't see the difference between this and your "With
corrected table of results:".

  - UTF8TextConverter wasn't written with performance in mind (as you
    can tell^^;)
  - This kind of tight loop gives a factor of 3-5 performance
    difference between VW and Squeak, plus,
  - the immediate representation for characters in VW must be helping
    a lot.
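
For readers who have not seen one, the kind of tight loop in question looks roughly like the following. This is a generic sketch of UTF-8 encoding written in plain Smalltalk, not the code of UTF8TextConverter or of the new converter; the block name utf8Encode is made up, and surrogates and values above U+10FFFF are not checked.

```smalltalk
"Sketch: encode a collection of Unicode code points as UTF-8 bytes."
| utf8Encode |
utf8Encode := [:codePoints |
    | out |
    out := WriteStream on: (ByteArray new: codePoints size).
    codePoints do: [:cp |
        cp < 16r80
            ifTrue: [out nextPut: cp]                          "1 byte: ASCII"
            ifFalse: [cp < 16r800
                ifTrue: [                                      "2 bytes"
                    out nextPut: 16rC0 + (cp bitShift: -6).
                    out nextPut: 16r80 + (cp bitAnd: 16r3F)]
                ifFalse: [cp < 16r10000
                    ifTrue: [                                  "3 bytes"
                        out nextPut: 16rE0 + (cp bitShift: -12).
                        out nextPut: 16r80 + ((cp bitShift: -6) bitAnd: 16r3F).
                        out nextPut: 16r80 + (cp bitAnd: 16r3F)]
                    ifFalse: [                                 "4 bytes"
                        out nextPut: 16rF0 + (cp bitShift: -18).
                        out nextPut: 16r80 + ((cp bitShift: -12) bitAnd: 16r3F).
                        out nextPut: 16r80 + ((cp bitShift: -6) bitAnd: 16r3F).
                        out nextPut: 16r80 + (cp bitAnd: 16r3F)]]]].
    out contents].

"U+0041, U+010D, U+0416, U+4E2D"
utf8Encode value: #(16r41 16r10D 16r416 16r4E2D).
"answers the bytes 65 196 141 208 150 228 184 173"
```

Every character costs a few comparisons, shifts and stream writes, so per-send overhead and the cost of creating or looking up character objects dominate the running time.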

  For the OLPC, I think I will end up writing primitives for
Squeak.  One could say that I should link the iconv library, but I am
not sure whether that is a good idea...

-- Yoshiki

Re: Unicode strings, benchmarks

Janko Mivšek
Hi Yoshiki,

Yoshiki Ohshima wrote:

>
>> I also wrote a somewhat better UTF-8 conversion, but it is only 25-80%
>> faster than the existing one in UTF8TextConverter.
>
>   Good!
>
>> Here are the results for VW, and for Squeak with the old UTF8 converter
>> and the new one:
>>
>>            VW     old     new
>> english    30     313     248  ByteString,   pure ASCII
>> french     32     323     251  ByteString,   ISO8859-1 (Latin 1)
>> slovenian  48     578     480  TwoByteString Latin 2
>> russian   112    1306     720  TwoByteString Cyrillic
>> chinese   107    1544    3825  TwoByteString
>>
>> Notice the exceptional VW performance, about 10x that of Squeak, and
>> they do all encodings in plain Smalltalk! No primitives! So how come
>> Squeak is so slow here?
>
>   Is it true that you traded away performance for Chinese to gain it
> for the other languages?

Definitely not, and I just don't understand why Chinese is so slow. I
hope you'll be able to look at that code to see what's wrong. And
Chinese is close to Japanese, right? I learned a bit of Chinese 20 years
ago, but that was not much help - I forgot too much :)

I'll prepare and publish code and benchmark tomorrow.


>   BTW, I can't see the difference between this and your "With
> corrected table of results:".

The "corrected" should be "with corrected layout", just that. Sorry for
that ambiguity.

Re: Unicode strings, benchmarks

tgkuo
Hi,
On Tue, 12 Jun 2007 00:28:24 +0200, you wrote:

>Definitely not, and I just don't understand why Chinese is so slow. I
>hope you'll be able to look at that code to see what's wrong. And
>Chinese is close to Japanese, right? I learned a bit of Chinese 20 years
>ago, but that was not much help - I forgot too much :)
>
>I'll prepare and publish code and benchmark tomorrow.

Interesting to hear that Unicode encoding is possible in Squeak.
But the new VM/image that I installed and ran on a Windows XP box
showed strange characters for the title and for any Chinese characters
entered. There is still much conversion work to be done to overcome the
anomalies, I think. Maybe someone has a more fully converted and stable
image for testing ... :-) .

Best regards.
tgkuo


 


Re: Unicode strings, benchmarks

Yoshiki Ohshima
  Hello,

> Interesting to hear that Unicode encoding is possible in Squeak.

  Well, it has been possible for quite a long time.

> But the new VM/image that I installed and ran on a Windows XP box
> showed strange characters for the title and for any Chinese characters
> entered. There is still much conversion work to be done to overcome the
> anomalies, I think. Maybe someone has a more fully converted and stable
> image for testing ... :-) .

  Can you be a bit more specific?  Which "new VM/Image" did you try?

  BTW, have you looked at:

http://www.smalltalk.org.cn/squeakr/squeakdownload0.html

?

-- Yoshiki

Re: Unicode strings, benchmarks

tgkuo
Hi,
On Mon, 11 Jun 2007 20:36:27 -0700, you wrote:
>  BTW, have you looked at:
>
>http://www.smalltalk.org.cn/squeakr/squeakdownload0.html
>
It's a Simplified Chinese edition, not usable in our country,
which is a Traditional Chinese, Big5-encoding world.

Anyway, thanks.

tgkuo


Re: Unicode strings, benchmarks

Janko Mivšek
In reply to this post by tgkuo
Hi tgkuo,

Here is proof that Unicode is definitely possible in Squeak. I put some
English, French, Slovenian, Russian and Chinese on the same web page:

        http://mivsek.eranova.si:8888/

This is done with the help of Aida/Web on Squeak 3.9. This image has the
Unicode patch installed, but it would probably work the same on plain
3.9 too.

Best regards
Janko

Re: Unicode strings, benchmarks

Lukas Renggli
> This is done with the help of Aida/Web on Squeak 3.9. This image has the
> Unicode patch installed, but it would probably work the same on plain
> 3.9 too.

It already did in 3.8, as tested and demonstrated in the comments of
the following blog post. The comments were written in a 3.8 image, but
nowadays this application runs on 3.9:

    http://www.lukas-renggli.ch/blog/studenckifestwal

Lukas
--
Lukas Renggli
http://www.lukas-renggli.ch

Re: Unicode strings, benchmarks

Yoshiki Ohshima
In reply to this post by Janko Mivšek
  Janko,

> Here is proof that Unicode is definitely possible in Squeak. I put some
> English, French, Slovenian, Russian and Chinese on the same web page:
>
> http://mivsek.eranova.si:8888/
>
> This is done with the help of Aida/Web on Squeak 3.9. This image has the
> Unicode patch installed, but it would probably work the same on plain
> 3.9 too.

  This is only a hunch and could be completely wrong, but have you
tried to *use* Slovenian in Squeak, in the sense of typing it in from
the keyboard, displaying it in a workspace, and using it in class
names, variable names, etc.?

  To try it, take the OLPC image for example, gunzip the attached
latin2.out, and evaluate the following in a workspace:

StrikeFontSet
    installExternalFontFileName6: 'latin2.out'
    encoding: Latin2Environment leadingChar
    encodingName: #Latin2
    textStyleName: #DefaultMultiStyle.

  And then, evaluate:

    Locale currentPlatform: (Locale localeID: (LocaleID isoString: 'sl')).

and save the image.  If you are on Windows, this would let you type
Latin 2 characters, display them, and use them everywhere in the image...




Attachment: latin2.out.gz (27K)

Re: Unicode strings, benchmarks

Yoshiki Ohshima
In reply to this post by tgkuo
  TG Kuo,

> It's a Simplified Chinese edition, not usable in our country,
> which is a Traditional Chinese, Big5-encoding world.
>
> Anyway, thanks.

  BTW, it is almost just a matter of selecting good fonts and making
an "environment" for Traditional Chinese.  As we do have reserved
IDs for both Traditional Chinese and Simplified Chinese, it should be
really easy once somebody who really wants to have it does it.  And
I'm more than happy to help.

-- Yoshiki


Re: Unicode strings, benchmarks

Eno
In reply to this post by Yoshiki Ohshima
Hi Yoshiki

I'm a Chinese user, actually Traditional Chinese (using Big5 encoding).

I'm bothered by this too: Squeak cannot show our fonts in the UI, though there is no problem in web pages rendered by Seaside 3.0. That means the encoding (UTF-8) is correct but the font is missing.

Why can't Squeak display the text correctly as other programs do? Is it related to its own way of interpreting the encoding, rather than going through the OS API?

Some questions I want to ask:

How is the file latin2.out made?

Can we just use a Windows system font instead?

If not, how could we make chinese.out?

Re: Unicode strings, benchmarks

K K Subbu
On Thursday 11 Nov 2010 1:46:36 pm Eno wrote:
> Some questions I want to ask:
>
> How is the file latin2.out made?
This is a file-out of a set (Array) of StrikeFont objects.

> If not, how could we make chinese.out?
You have to create a bunch of StrikeFonts and create a function to file them
out and load them back in. See class side methods in StrikeFontSet
   createExternalFontFile....   for creating such files
and
   installExternalFontFile.... for loading them back.

Subbu

Re: Unicode strings, benchmarks

Hannes Hirzel
If I remember correctly, there was Chinese support in the past. So
searching the mail archive might be helpful.

--Hannes

Re: Unicode strings, benchmarks

Yoshiki Ohshima-2
In reply to this post by Eno
At Thu, 11 Nov 2010 00:16:36 -0800 (PST),
Eno wrote:

> Hi Yoshiki
>
> I'm a Chinese user, actually Traditional Chinese (using Big5 encoding).
>
> I'm bothered by this too: Squeak cannot show our fonts in the UI, though
> there is no problem in web pages rendered by Seaside 3.0. That means the
> encoding (UTF-8) is correct but the font is missing.
>
> Why can't Squeak display the text correctly as other programs do? Is it
> related to its own way of interpreting the encoding, rather than going
> through the OS API?
>
> Some questions I want to ask:
>
> How is the file latin2.out made?
>
> Can we just use a Windows system font instead?
>
> If not, how could we make chinese.out?

  Sorry for taking forever to answer this.  I dug up the font files I
created a while ago and uploaded them to:

http://tinlizzie.org/~ohshima/uSimplifiedChineseFont.out
http://tinlizzie.org/~ohshima/uTraditionalChineseFont.out

  In the Etoys development image, you can load it by evaluating:

StrikeFontSet
    installExternalFontFileName: 'uSimplifiedChineseFont.out'
    encoding: SimplifiedChineseEnvironment leadingChar
    encodingName: #SimplifiedChinese
    textStyleName: #DefaultMultiStyle.

The trunk image seems to be missing some methods to make it run,
however.

-- Yoshiki

Re: Unicode strings, benchmarks

Eno
Hi,

I downloaded the development image from
http://etoys.squeak.org/download/ , ran the default image and evaluated
the code, but it still cannot show the font as expected.

I suppose it can't work because what I need on my WinXP PC is
TraditionalChineseEnvironment.

I searched the system for the class "TraditionalChineseEnvironment", but
it is missing.
There are only a few environments available in the Multilingual-languages
package. How can I make a "TraditionalChineseEnvironment" one? I
currently have no idea how to write the class methods that need to be
changed for TraditionalChineseEnvironment. Can you help me build one?

Thanks.

Best regards,

Eno


Re: Unicode strings, benchmarks

Yoshiki Ohshima-2
In reply to this post by Eno
At Wed, 15 Dec 2010 18:54:26 +0800,
tgkuo wrote:

> Hi,
>
> I downloaded the development image from
> http://etoys.squeak.org/download/ , ran the default image and evaluated
> the code, but it still cannot show the font as expected.
>
> I suppose it can't work because what I need on my WinXP PC is
> TraditionalChineseEnvironment.
>
> I searched the system for the class "TraditionalChineseEnvironment", but
> it is missing.
> There are only a few environments available in the Multilingual-languages
> package. How can I make a "TraditionalChineseEnvironment" one? I
> currently have no idea how to write the class methods that need to be
> changed for TraditionalChineseEnvironment. Can you help me build one?

  Again, sorry for the slow response.

  If you don't need support for CNS 11643 (or Big5),
SimplifiedChineseEnvironment can mimic, say, the Russian Environment.
You would need to get a new "leading char" assigned (just take the next
available one), and define a font you would like to use.
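
As background on what a "leading char" is: in the multilingual Squeak images of this period, a character's value appears to pack the leading char into the bits above a 22-bit code point. The sketch below shows only that arithmetic. It is an illustration under that assumption, and the leading char number 20 is made up, not a real assignment.

```smalltalk
"Sketch: pack and unpack a leading char and a code point (22-bit layout assumed)."
| leadingChar codePoint packed |
leadingChar := 20.                          "hypothetical, not a real assignment"
codePoint := 16r4E2D.                       "U+4E2D"
packed := (leadingChar bitShift: 22) + codePoint.

(packed bitShift: -22) = leadingChar.       "true: the leading char comes back"
(packed bitAnd: 16r3FFFFF) = codePoint.     "true: the code point comes back"
```

Two environments can therefore hold the same code point and still be told apart, which is how a font for the right language gets selected.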

Nowadays, you would want to consider using TrueType fonts.  The OLPC
image includes a method
#makeSmartRefFilesFrom:encodingTag:ranges:outputFileName: in
TTCFontSet.  You define the range you're interested in, choose a TT
font that covers it, and create a ".out" file.
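
To make "define the range you're interested in" concrete, here is a sketch of code point ranges that a Traditional Chinese font would plausibly need. The three Unicode blocks are standard, but the representation as start -> stop associations and the helper covers are assumptions for illustration. The argument shape that the TTCFontSet method actually expects is not shown in this thread and should be checked in the image.

```smalltalk
"Sketch: candidate code point ranges and a coverage check.
 The association format and the block name covers are hypothetical."
| ranges covers |
ranges := {
    16r3000 -> 16r303F.     "CJK Symbols and Punctuation"
    16r3100 -> 16r312F.     "Bopomofo"
    16r4E00 -> 16r9FFF}.    "CJK Unified Ideographs"
covers := [:codePoint |
    (ranges
        detect: [:r | codePoint between: r key and: r value]
        ifNone: [nil]) notNil].

covers value: 16r4E2D.      "true: inside CJK Unified Ideographs"
covers value: 16r0416.      "false: Cyrillic is outside these ranges"
```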

-- Yoshiki