Latin-1 to UTF-8 speedups

Andreas.Raab
Hi -

John asked me for the UTF-8 changes that I had done for our own use, and
since other people may be interested as well, here they are. Keep in
mind that the speedup is aimed at situations where your input is
basically Latin-1; it won't have any effect if you are actually using
anything beyond Latin-1.

The main goal of these changes is to make the overhead of adding UTF-8
conversions "just in case" vanishingly small. For example, converting
ASCII text with no extended characters at all is effectively free:

"Convert 1 million ASCII characters (100 runs over a 10,000-character string); timeToRun answers milliseconds"
string := (String new: 10000 withAll: $a).

"The current converter"
Transcript cr; show: [1 to: 100 do:[:i|
   string convertToWithConverter: UTF8TextConverter new
]] timeToRun.

=> 1809

"The fast path"
Transcript cr; show: [1 to: 100 do:[:i|
   string squeakToUtf8
]] timeToRun.

=> 4
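
One way a fast path can get this cheap for pure ASCII is to scan once for any character above 127 and, if there is none, hand back the original string, since ASCII is already valid UTF-8. The workspace snippet below is only a sketch of that idea under my own assumptions; it is not taken from the attached change set, and the variable names are made up for illustration.

```smalltalk
"Hypothetical sketch, not the code from the attached change set"
| source firstExtended result |
source := String new: 10000 withAll: $a.
"Look for the first character outside the 7-bit ASCII range"
firstExtended := source detect: [:ch | ch asInteger > 127] ifNone: [nil].
"Pure ASCII is already valid UTF-8, so the original can be answered as is"
result := firstExtended isNil
   ifTrue: [source]
   ifFalse: [source convertToWithConverter: UTF8TextConverter new].
result == source   "true here: nothing was copied or converted"
```

With this shape the ASCII case costs a single pass over the string and no allocation; only strings that actually contain extended characters pay for a conversion.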

Even when using the full Latin-1 range, there is still a goodly bit of
speedup (a bit under 3x in this run):

"Convert 1 million extended Latin-1 characters (100 runs over a 10,000-character string)"
string := (String new: 10000 withAll: $ß).

"The current converter"
Transcript cr; show: [1 to: 100 do:[:i|
   string convertToWithConverter: UTF8TextConverter new
]] timeToRun.

=> 5193

"The fast path"
Transcript cr; show: [1 to: 100 do:[:i|
   string squeakToUtf8
]] timeToRun.

=> 1816
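
The extended case cannot take the "nothing to do" shortcut, because every Latin-1 character above 127 turns into two bytes in UTF-8. As a worked example (a sketch with variable names of my own choosing, not code from the change set), here is the arithmetic for ß, code point 16rDF:

```smalltalk
"Worked example: the two UTF-8 bytes for a Latin-1 character above 127"
| code lead trail |
code := 16rDF.                                "code point of ß in Latin-1 and Unicode"
lead := 16rC0 bitOr: (code bitShift: -6).     "110xxxxx carries the top two bits"
trail := 16r80 bitOr: (code bitAnd: 16r3F).   "10xxxxxx carries the low six bits"
{lead. trail}                                 "16rC3 and 16r9F"
```

So a string of 10,000 ß characters comes out as 20,000 bytes, which is presumably part of why this case stays well above the ASCII timing.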

Depending on your concrete usage, the result will be somewhere in
between these two extremes. For our use we found it to be close to the
optimal (pure ASCII) case, but if you use a lot of extended Latin-1 your
results will be closer to the second one. In any case, it should be a
nice little speedup, so enjoy the ride.

Cheers,
   - Andreas




Attachment: SqueakToUtf8.cs (4K)