New Win32 VM [m17n testers needed]


Re: UTF8 Squeak

Yoshiki Ohshima
  Janko,

> >   So, the question to you is that if you have a system with 8-bit
> > ByteString and 32-bit WideString in year 2007, would you add a class
> > to represent 16-bit string to that system?
>
> I would say yes, because for most countries 16-bit is enough and 32-bit
> is then just a waste of memory. And I just noticed that WideString is
> actually fixed to 4 bytes. I would therefore think about renaming it to
> ForByteString and add TwoByteString (or similar names). For user these
> are always Strings anyway, as SmallIntegers and LargeIntegers are always
> Integers.

  Similar deal in Squeak, too.  The system does the auto coercion
between WideString and ByteString, and the user doesn't have to deal
with them most of the time.

  Adding a 16-bit variant is surely an option.  At the same time, there is
a similar but different POV: "because for most users 8-bit is enough and
the 32-bit version is not used so frequently anyway".  There is no "right"
answer, only different trade-offs.  (That is why this problem is
interesting^^;)

  Actually, adding a more general character object that doesn't rely
on a particular bit representation (and therefore can go beyond
32-bit), and making strings arrays of such characters, would be
better eventually.
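Squeak's auto-coercion between the narrow and wide representations can be sketched as follows. This is an illustrative model in Python, not Squeak's actual implementation; the class name `FlexString` and its protocol are made up for the example.

```python
class FlexString:
    """Sketch of Squeak-style auto-coercion: store code points and
    report the narrowest storage width (1 or 4 bytes per character),
    widening transparently when a wide character is stored."""

    def __init__(self, text=""):
        self._points = [ord(c) for c in text]

    @property
    def width(self):
        # 1-byte storage while every code point fits in 8 bits, else 4.
        return 1 if all(p < 256 for p in self._points) else 4

    def at_put(self, index, char):
        # Storing a wide character silently widens the representation.
        self._points[index] = ord(char)

    def __str__(self):
        return "".join(chr(p) for p in self._points)

s = FlexString("abc")
assert s.width == 1
s.at_put(1, "あ")          # widens automatically
assert s.width == 4
```

The user-visible protocol never changes; only the storage width does, which is the trade-off Yoshiki describes.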

-- Yoshiki


RE: UTF8 Squeak

Alan L. Lovejoy
In reply to this post by Yoshiki Ohshima


<Alan L>Each String object should specify its encoding scheme.  UTF-8 should

> be the default, but all commonly-encountered encodings should be
> supported, and should all be usable at once (in different String
> instances.) When a Character is reified from a String, it should use
> the Unicode code point values (full 32-bit value.)  Ideally, the
> encoding of a String should be a function of an associated Strategy
> object, and not be based on having different subclasses of String
</Alan L>

<Yoshiki>Is this better than using UTF-32 throughout the image for all Strings?
One reason would be that for some chars in domestic encodings, the
round-trip conversion is not exactly guaranteed; so you can avoid that
problem in this way.  But other than that, encodings only matter when the
system is interfacing with the outside world.  So, the internal
representation can be uniform, I think.

  Would you write all comparison methods for all combinations of
different encodings?
</Yoshiki>

Well, perhaps UTF-32 would be a better default, now that I think about
it--due to performance issues for accessing characters at an index. But
using 32-bit-wide or 16-bit-wide strings internally as the only option would
be a waste of memory in many situations, especially for the "Latin-1"
languages.

Having String instances that use specified encodings enables one to avoid
doing conversions unless and until it's needed. It also makes it easy to
deal with the data as it will actually exist when persisted, or when
transported over the network. And it makes it easier to handle the host
platform's native character encodings (there may be more than one,) or the
character encodings used by external libraries or applications that either
offer callpoints to, or consume callpoints from, a Squeak process. It also
documents the encoding used by each String.

If all Strings use UTF-32, and are only converted to other encodings by the
VM, how does one write Smalltalk code to convert text from one character
encoding to another?  I'd rather not make character encodings yet another
bit of magic that only the VM can do.

It is already the case that accessing individual characters from a String
results in the reification of a Character object.  So, leveraging what is
already the case, conversion to/from the internal encoding to the canonical
(Unicode) encoding should occur when a Character object is reified from an
encoded character in a String (or in a Stream.)  Character objects that are
"put:" into a String would be converted from the Unicode code point to the
encoding native to that String.  Using Character reification to/from Unicode
as the unification mechanism provides the illusion that all Strings use the
same code points for their characters, even though they in fact do not.
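Alan's proposal can be sketched in a few lines. This is a hypothetical model, not an existing Squeak class: bytes stay in a per-instance encoding (the Strategy), and conversion to and from Unicode happens only at the Character reification boundary. Python's codecs stand in for the Strategy objects.

```python
class EncodedString:
    """Sketch of a String whose bytes stay in a per-instance encoding,
    while at()/at_put() reify Unicode characters at the boundary.
    All names here are hypothetical, for illustration only."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data            # bytes in the native encoding
        self.encoding = encoding    # the Strategy, per instance

    def at(self, index):
        # Decode to Unicode only when a character is reified.
        return self.data.decode(self.encoding)[index]

    def at_put(self, index, unicode_char):
        chars = list(self.data.decode(self.encoding))
        chars[index] = unicode_char
        # Re-encode into this instance's native encoding.
        self.data = "".join(chars).encode(self.encoding)

latin = EncodedString("café".encode("latin-1"), "latin-1")
utf8 = EncodedString("café".encode("utf-8"), "utf-8")
# The illusion: both answer the same Unicode character...
assert latin.at(3) == utf8.at(3) == "é"
# ...even though the stored bytes differ.
assert latin.data != utf8.data
```

This is exactly the "illusion that all Strings use the same code points" even though the underlying byte representations differ.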

Of course, for some encodings (such as UTF-8) there would probably be a
performance penalty for accessing characters at an arbitrary index ("aString
at: n.") But there may be good ways to mitigate that, using clever
implementation tricks (caveat: I haven't actually tried it.)  However, with
my proposal, one is free to use UTF-16 for all Strings, or UTF-32 for all
Strings, or ASCII for all Strings--based on one's space and performance
constraints, and based on the character repertoire one needs for one's user
base.  And the conversion to UTF-16 or UTF-32 (or whatever) can be done when
the String is read from an external Stream (using the VW stream decorator
approach, for example.)

The ASCII encoding would be good for the multitude of legacy applications
that are English-only. ISO 8859-1 would be best for post-1980s/pre-UTFx
legacy applications that have to deal with non-English languages, or have to
deal with either HTML or pre-Vista Windows. UTF-x would be best for most
other situations.


--Alan








RE: UTF8 Squeak

Alan L. Lovejoy
In reply to this post by J J-6
<Alan L>UTF-8 should be the default</Alan L>

<J J (Jason)>Wouldn't that be a pretty big speed impact given how much
strings are used?</J J (Jason)>

Now that I think about it, that could very well be the case.  There might be
clever ways to make the impact much less than one might otherwise expect
(for example, RunArrays were a clever way to make Text objects reasonably
efficient)--but I haven't actually implemented it, so there's no guarantee.

So, perhaps the default internal String encoding should be UTF-32, instead
of UTF-8 or UTF-16, in order to avoid the performance issue.  But that
raises a memory usage issue--which is the primary reason I don't think a
"one size fits all" approach is sufficient.

--Alan





Re: UTF8 Squeak

Tapple Gao
In reply to this post by Yoshiki Ohshima
On Thu, Jun 07, 2007 at 08:16:21PM -0700, Alan Lovejoy wrote:
> It is already the case that accessing individual characters from a String
> results in the reification of a Character object.  So, leveraging what is
> already the case, conversion to/from the internal encoding to the canonical
> (Unicode) encoding should occur when a Character object is reified from an
> encoded character in a String (or in a Stream.)  Character objects that are
> "put:" into a String would be converted from the Unicode code point to the
> encoding native to that String.  Using Character reification to/from Unicode
> as the unification mechanism provides the illusion that all Strings use the
> same code points for their characters, even though they in fact do not.

Someone already mentioned the way Plan-9 did this, and provided
a link, which I read, and it sounded pretty logical. What
follows is my assessment of what I read.

The key realization that Plan-9 made is that random-access
string access is the exception, rather than the rule. Stream
access is much more common, and much more in need of
optimization. This seems logical to me. UTF-8 is a
stream-oriented encoding of Unicode that Plan-9 invented to
solve this optimization issue. UTF-8 is self-synchronizing and
byte-oriented, which allows a reader to be nearly stateless, and
still consume much less memory than UTF-32. Plan 9 also
described that, contrary to what some expect, very few programs
do better with UTF-32, because very few programs really need to
process the string in a non-linear way. Regular expressions and
sorting are the two main exceptions.

UTF-8 also allows the transition to be made slightly more
smoothly, since many ASCII programs will already work with
UTF-8.
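The self-synchronization property mentioned above is simple to demonstrate. Continuation bytes always have the bit pattern 10xxxxxx, so a reader dropped into a byte stream at an arbitrary offset can resynchronize by skipping forward to the next lead byte. The helper below is illustrative, not any real library's API.

```python
def resync(data: bytes, pos: int) -> int:
    """Skip forward past UTF-8 continuation bytes (10xxxxxx) so a
    reader starting at an arbitrary byte offset lands on a lead byte.
    This is why UTF-8 readers can be nearly stateless."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

encoded = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
# Byte 2 is the continuation byte of 'é'; resync skips past it.
assert resync(encoded, 2) == 3
assert encoded[resync(encoded, 2):].decode("utf-8") == "llo"
```

A UTF-16 or UTF-32 reader dropped at a random byte offset has no such in-band marker and needs external alignment state.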

This is a synopsis of what I read. I am not familiar with this
issue as much as you are.

--
Matthew Fulmer -- http://mtfulmer.wordpress.com/
Help improve Squeak Documentation: http://wiki.squeak.org/squeak/808


Re: UTF8 Squeak

Yoshiki Ohshima
In reply to this post by Yoshiki Ohshima
  Alan,

> Well, perhaps UTF-32 would be a better default, now that I think about
> it--due to performance issues for accessing characters at an index. But
> using 32-bit-wide or 16-bit-wide strings internally as the only option would
> be a waste of memory in many situations, especially for the "Latin-1"
> languages.

  We do switch in Squeak between different bit-width representations (8
and 32) whenever necessary or favorable.

> Having String instances that use specified encodings enables one to avoid
> doing conversions unless and until it's needed. It also makes it easy to
> deal with the data as it will actually exist when persisted, or when
> transported over the network. And it makes it easier to handle the host
> plaform's native character encodings (there may be more than one,) or the
> character encodings used by external libraries or applications that either
> offer callpoints to, or consume callpoints from, a Squeak process. It also
> documents the encoding used by each String.

  Nothing prevents you from using a String as if it is, say, a
ByteArray.  For example, you can pass a String or ByteArray to a
socket primitive to fill it and you can keep the bits in it as you
like.

  However, Smalltalk is not just about holding data; once it comes to
displaying a String, concatenating them, comparing them, etc., etc.,
you do have to have a canonical form.

> If all Strings use UTF-32, and are only converted to other encodings by the
> VM, how does one write Smalltalk code to convert text from one character
> encoding to another?  I'd rather not make character encodings yet another
> bit of magic that only the VM can do.

  Hmm.  Of course you can convert encodings in memory.  In Squeak,
there are a bunch of subclasses of TextConverter.  Did anybody
mention/suggest that the conversion has to be VM magic?

> It is already the case that accessing individual characters from a String
> results in the reification of a Character object.  So, leveraging what is
> already the case, conversion to/from the internal encoding to the canonical
> (Unicode) encoding should occur when a Character object is reified from an
> encoded character in a String (or in a Stream.)  Character objects that are
> "put:" into a String would be converted from the Unicode code point to the
> encoding native to that String.  Using Character reification to/from Unicode
> as the unification mechanism provides the illusion that all Strings use the
> same code points for their characters, even though they in fact do
> not.

  You criticized an approach nobody advocated as "magic" above, but
what you wrote here really is magic.  I've got a feeling that this
system would be very hard to debug.

  BTW, what would you do with Symbols?

> Of course, for some encodings (such as UTF-8) there would probably be a
> performance penalty for accessing characters at an arbitrary index ("aString
> at: n.") But there may be good ways to mitigate that, using clever
> implementation tricks (caveat: I haven't actually tried it.)  However, with
> my proposal, one is free to use UTF-16 for all Strings, or UTF-32 for all
> Strings, or ASCII for all Strings--based on one's space and performance
> constraints, and based on the character repertoire one needs for one's user
> base.  And the conversion to UTF-16 or UTF-32 (or whatever) can be done when
> the String is read from an external Stream (using the VW stream decorator
> approach, for example.)

  I *do* see some upsides of this approach, actually, but the
downsides are overwhelmingly bigger, if you think of Smalltalk as a
self-contained system.  Handling keyboard input alone would make the
system really complex.

  IIUC, Matsumoto-san's (Matz) m17n idea for Ruby is sort of along
this line.  I don't think that is a good approach, but it is slightly
more acceptable in Ruby, because Ruby is not a whole system.

  BTW, current Squeak allows you to do this.  Within the 32-bit
quantity, the first several bits denote the "language"; you can make
up a special language and store the code point in different encodings.
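The tag-bits scheme Yoshiki describes can be sketched as bit packing. The 8-bit tag / 22-bit code point split below is an assumption for illustration; Squeak's actual field widths may differ.

```python
# Sketch of packing a language tag into the top bits of a 32-bit
# character value, as Squeak's m17n characters do.  The 8-bit-tag /
# 22-bit-code-point split here is an assumed layout, not Squeak's.
TAG_SHIFT = 22

def make_char(tag: int, code_point: int) -> int:
    return (tag << TAG_SHIFT) | code_point

def tag_of(value: int) -> int:
    return value >> TAG_SHIFT

def code_point_of(value: int) -> int:
    return value & ((1 << TAG_SHIFT) - 1)

v = make_char(5, 0x3042)          # hiragana 'あ' tagged as language 5
assert tag_of(v) == 5
assert code_point_of(v) == 0x3042
```

A "special language" tag, as suggested, would simply be one tag value whose low bits are interpreted under a private encoding.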

> The ASCII encoding would be good for the mutlitude of legacy applications
> that are English-only. ISO 8859-1 would be best for post-1980s/pre-UTFx
> legacy applications that have to deal with non-English languages, or have to
> deal with either HTML or pre-Vista Windows. UTF-x would be best for most
> other situations.

  Is this your observation?  Where does legacy application in Japanese
fit?  Why HTML is associated with latin-1?  What is special about
Vista Windows?  This doesn't make any good sense.

  One approach I might try in a "new system" would be:

  - the bits of raw string representation is in UTF-8 but it is not
    really displayable.
  - you always do stuff though an equivalent of Text, that carry
    enough attributes for the bits.
  - maybe remove character object.  A "character" is just a short Text.
    For the ASCII part, it could be a special case; i.e., a naked
    byte can have implicit text attributes by default.

-- Yoshiki


Re: UTF8 Squeak

Lukas Renggli
Just as a side-note: In Seaside the encoding and decoding turns out to
be very  complicated and expensive. In fact so expensive, that almost
nobody is willing to pay for it. What most people do is to work with
(Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data
is received, stored, and sent exactly the way it comes from the
socket. Byte identical strings are sent back as they were received.

There are many caveats:

1. Most string operations don't work (except concatenation), e.g.
asking a string for its #size might return a wrong number.

2. All literal strings have to be encoded manually to the right
format. This clutters the code and is ugly.

3. Data in inspectors is sometimes not readable without a manual conversion.

I am no expert with encodings, so I have no idea how this could be
cleanly solved. There is definitely the need for improvement.
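Lukas's first caveat comes down to byte count versus character count. A quick illustration:

```python
# Caveat 1 in one example: a UTF-8 byte sequence treated as a plain
# byte string reports its byte count, not its character count.
text = "Mivšek"                    # 6 characters
raw = text.encode("utf-8")         # 'š' takes two bytes in UTF-8
assert len(text) == 6              # character count
assert len(raw) == 7               # byte count: the "wrong" #size
```

Every operation that indexes or measures the string (#size, #at:, #copyFrom:to:) inherits the same off-by-bytes behavior.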

Another issue I observed is that Characters in Squeak have an
inconsistent behavior for #==. For characters with codePoint > 256 the
identity is not preserved. This gives problems with code that uses #==
to compare characters, legacy code and code ported from VisualWorks
(SmaCC for example). In VisualWorks Characters are unique, just like
Symbols are.
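The identity problem Lukas describes follows directly from interning only the narrow characters. A sketch of the mechanism (illustrative class, mirroring Squeak's behavior rather than its code):

```python
class Character:
    """Sketch mirroring Squeak: Characters with code point < 256 come
    from a shared table (identity preserved); wider ones are created
    fresh each time, so identity comparison (#==) silently breaks."""
    _table = {}

    def __init__(self, code_point: int):
        self.code_point = code_point

    @classmethod
    def value(cls, code_point: int):
        if code_point < 256:
            return cls._table.setdefault(code_point, cls(code_point))
        return cls(code_point)      # a fresh object every time

a1, a2 = Character.value(65), Character.value(65)
w1, w2 = Character.value(0x3042), Character.value(0x3042)
assert a1 is a2                         # identity holds below 256
assert w1 is not w2                     # identity broken above 256
assert w1.code_point == w2.code_point   # equality still works
```

Code ported from a system with unique (or immediate) characters, like VisualWorks, trips over exactly the `w1 is not w2` case.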

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch


Re: UTF8 Squeak

Andreas.Raab
Lukas Renggli wrote:
> Just as a side-note: In Seaside the encoding and decoding turns out to
> be very  complicated and expensive. In fact so expensive, that almost
> nobody is willing to pay for it.

But is that a property of 1) Seaside or 2) Squeak or 3) UTF-8? If the
first, just fix it ;-) If the second, what conversions are slow? If the
third, why not speed it up by a primitive? (UTF-8 translation isn't that
hard)

> What most people do is to work with
> (Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data
> is received, stored, and sent exactly the way it comes from the
> socket. Byte identical strings are sent back as they were received.

I assume you mean Seaside 2.7 above not Squeak 2.7.

> I am no expert with encodings, so I have no idea how this could be
> cleanly solved. There is definitely the need for improvement.

How about trying to improve the speed of conversions? You seem to imply
that this is the major issue here, so if the conversions were
blindingly fast (which I think they easily could by writing one or two
primitives) this should improve matters.

> Another issue I observed is that Characters in Squeak have an
> inconsistent behavior for #==. For characters with codePoint > 256 the
> identity is not preserved. This gives problems with code that uses #==
> to compare characters, legacy code and code ported from VisualWorks
> (SmaCC for example). In VisualWorks Characters are unique, just like
> Symbols are.

Yeah, but there isn't really an easy workaround unless you have
immediate characters, which Squeak doesn't--so fixing those comparisons
to use equality is really your only option (FWIW, given that VW has a
good JIT, I would expect it can inline this trivially, so there
shouldn't be a speed difference in VW).

Cheers,
   - Andreas



Re: UTF8 Squeak

Lukas Renggli
> > Just as a side-note: In Seaside the encoding and decoding turns out to
> > be very  complicated and expensive. In fact so expensive, that almost
> > nobody is willing to pay for it.
>
> But is that a property of 1) Seaside or 2) Squeak or 3) UTF-8? If the
> first, just fix it ;-) If the second, what conversions are slow? If the
> third, why not speed it up by a primitive? (UTF-8 translation isn't that
> hard)

I would if I knew how to do it.

> > What most people do is to work with
> > (Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data
> > is received, stored, and sent exactly the way it comes from the
> > socket. Byte identical strings are sent back as they were received.
>
> I assume you mean Seaside 2.7 above not Squeak 2.7.

I am talking about Squeak 3.7. There are many Seaside users that will
stick with Squeak 3.7 forever.

> How about trying to improve the speed of conversions? You seem to imply
> that this is the major issue here, so if the conversions were
> blindingly fast (which I think they easily could by writing one or two
> primitives) this should improve matters.

Are you talking about escaping? In Seaside 2.8 the escaping is already
2 times faster than in Seaside 2.7. Character encoding is another
story.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch


Re: UTF8 Squeak

Andreas.Raab
Lukas Renggli wrote:
>> But is that a property of 1) Seaside or 2) Squeak or 3) UTF-8? If the
>> first, just fix it ;-) If the second, what conversions are slow? If the
>> third, why not speed it up by a primitive? (UTF-8 translation isn't that
>> hard)
>
> I would if I knew how to do it.

I'll see if I can find some time on the weekend to look at this.

>> > What most people do is to work with
>> > (Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data
>> > is received, stored, and sent exactly the way it comes from the
>> > socket. Byte identical strings are sent back as they were received.
>>
>> I assume you mean Seaside 2.7 above not Squeak 2.7.
>
> I am talking about Squeak 3.7. There are many Seaside users that will
> stick with Squeak 3.7 forever.

Yes, using Squeak ->3<-.7 can make good sense for people who don't care
about using m17n internally (definitely more than using Squeak ->2<-.7
as you wrote initially).

>> How about trying to improve the speed of conversions? You seem to imply
>> that this is the major issue here, so if the conversions were
>> blindingly fast (which I think they easily could by writing one or two
>> primitives) this should improve matters.
>
> Are you talking about escaping? In Seaside 2.8 the escaping is already
> 2 times faster than in Seaside 2.7. Character encoding is another
> story.

I'm talking about UTF-8 conversions. A simple thing to do would be (for
example) to have a lookup table for everything covered by 2-byte
encodings (which is practically everything in the western hemisphere).
Something like this:

nextFromStream: stream
        "Read a UTF-8 encoded character from the stream"
        | value1 value2 |
        value1 := utf8Table at: stream nextByte.
        value1 isCharacter ifTrue: [^value1].
        value1 isArray ifTrue: [
                value2 := value1 at: stream nextByte.
                value2 isCharacter ifTrue: [^value2]].
        "... put the slow code here ..."

(Note that the lookup table can include the required language tags etc.
to make any further conversion unnecessary.)  Beyond that, a primitive
would go a very long way here.
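Andreas's two-level table can also be sketched in Python. Where the Smalltalk above dispatches on isCharacter/isArray, the result type (str versus dict) plays that role here; the table layout is illustrative and omits the language tags a real Squeak table would carry.

```python
def build_tables():
    """Build Andreas-style lookup tables: one entry per lead byte.
    ASCII lead bytes map straight to a character; two-byte lead bytes
    map to a second-level table indexed by the continuation byte."""
    table = {}
    for b in range(0x80):
        table[b] = chr(b)                      # 1-byte: direct hit
    for lead in range(0xC2, 0xE0):             # 2-byte lead bytes
        table[lead] = {
            cont: chr(((lead & 0x1F) << 6) | (cont & 0x3F))
            for cont in range(0x80, 0xC0)
        }
    return table

TABLE = build_tables()

def next_char(stream):
    "Decode one character via the tables; 3+-byte sequences fall back."
    entry = TABLE.get(next(stream))
    if isinstance(entry, str):
        return entry                           # fast path, 1 byte
    if isinstance(entry, dict):
        return entry[next(stream)]             # fast path, 2 bytes
    raise NotImplementedError("slow path for 3+ byte sequences")

stream = iter("aé".encode("utf-8"))
assert next_char(stream) == "a"
assert next_char(stream) == "é"
```

Two table hits cover everything through U+07FF, which is Andreas's "practically everything in the western hemisphere" fast path.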

Cheers,
   - Andreas


Re: UTF8 Squeak

K. K. Subramaniam
In reply to this post by Yoshiki Ohshima
On Friday 08 June 2007 1:25 am, Yoshiki Ohshima wrote:
..
> > Well, UTF8 is just an encoding of Unicode code points, So, Squeak will
> > have to support Unicode. Its language and tools will need to handle
> > Unicode code points and UTF8 streams. Internally, whether code points or
> > UTF8 encoding is used would depend on the context.
>
>   Why do you get the impression that Squeak doesn't support it?
Squeak's Unicode/UTF8 support seemed incomplete. I couldn't get Squeak on
Linux to take in ½ or π. How about:
a) Use Unicode chars in literals and text fields. I should be able to write
math equations in PluggableText.
b) Use Unicode chars in names (object, method, variable, symbols). Children
should be able to name their scripts and variables in their language in
Etoys.
c) See fallback glyphs for Unicode, like four hex digits laid out 2x2 in a
small box the same height as the current font. That works much better than a
[] box.
d) Have Buttons that generate Unicode. This could be used to build soft
keyboards. (cf. PopUpMenu>>readKeyboard uses asciiValue :-().
e) Use modal input - codes coming in from Sensors could be button presses
(e.g. ESC, hotkeys to switch keyboard layouts) or multilingual text
sequences.
f) See 'current language' indicator in input fields. Handling backspace will
be language dependent.
> Using UTF-8 internally throughout the system would be a challenge,
> especially thinking about that the overloaded methods like at:,
> at:put: and all of these have to be disambiguated as to what it means.
at:put: is a random access operation and UTF-8 is not meant for such purposes.
UTF-8 works well for streams of characters, and Unicode code points for random
access and lookup. This is what I meant when I said it would depend on
context. Then there are mixed streams like keyboard input: I could be reading
button presses (like Enter for OK) or a stream of characters in a text
field. We may need in-stream character codes to switch modes and languages.

I am still coming up to speed on Squeak multilingual support, and these
observations are based on my explorations so far. It is quite possible that I
have missed something.

Regards .. Subbu


Re: UTF8 Squeak

K. K. Subramaniam
In reply to this post by Yoshiki Ohshima
On Friday 08 June 2007 2:41 am, Yoshiki Ohshima wrote:
>   However, there is a reason to call our stuff m17n, instead of i18n.
> It might be still an aspiration to it, but supporting one language at
> a time "sort of localed based idea" is not enough for "real"
> multilingualization, where you would like to mix strings from
> different languages freely.
Very true. India has over 28 official languages and multilingual streams are
the norm rather than the exception. Children learn three languages in primary
school. Math texts make heavy use of 'math symbols'.

Regards .. Subbu


Re: UTF8 Squeak

K. K. Subramaniam
In reply to this post by Yoshiki Ohshima
On Friday 08 June 2007 11:28 am, Yoshiki Ohshima wrote:

> > Of course, for some encodings (such as UTF-8) there would probably be a
> > performance penalty for accessing characters at an arbitrary index
> > ("aString at: n.") But there may be good ways to mitigate that, using
> > clever implementation tricks (caveat: I haven't actually tried it.)
> > However, with my proposal, one is free to use UTF-16 for all Strings, or
> > UTF-32 for all Strings, or ASCII for all Strings--based on one's space
> > and performance constraints, and based on the character repertoire one
> > needs for one's user base.  And the conversion to UTF-16 or UTF-32 (or
> > whatever) can be done when the String is read from an external Stream
> > (using the VW stream decorator approach, for example.)
>
>   I *do* see some upsides of this approach, actually, but the
> downsides is overwhelming bigger, if you think that Smalltalk is a
> self-contained system.  Handling keyboard input alone would make the
> system really complex.
I am not sure if Squeak needs multiple transformation formats for Unicode code
points. A Unicode code point needs up to 21 bits, and UTF-8 varies from 8 to
32 bits per character. Is there any sound case for other UTFs now (outside of
VMs)? The Wikipedia entry below has a good summary of pros and cons:
    http://en.wikipedia.org/wiki/UTF-8

Rob Pike's note:
  http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
is a very good reality check on the situation.

For children who will be working in a multilingual environment, Squeak will be
spending most of its time waiting for a button/key push anyway :-).

Regards .. Subbu


Re: UTF8 Squeak

Nicolas Cellier-3
In reply to this post by Lukas Renggli
Lukas Renggli <renggli <at> gmail.com> writes:

>
> Another issue I observed is that Characters in Squeak have an
> inconsistent behavior for #==. For characters with codePoint > 256 the
> identity is not preserved. This gives problems with code that uses #==
> to compare characters, legacy code and code ported from VisualWorks
> (SmaCC for example). In VisualWorks Characters are unique, just like
> Symbols are.
>
> Lukas
>


Just for the detail: Characters are unique like SmallIntegers are.
VW uses the last-two-bits-of-object-pointer trick so that Characters are
immediate values.

Nicolas



Re: UTF8 Squeak

Colin Putney
In reply to this post by Andreas.Raab

On Jun 7, 2007, at 11:55 PM, Andreas Raab wrote:

> How about trying to improve the speed of conversions? You seem to  
> imply that this is the major issue here, so if the conversions  
> were blindingly fast (which I think they easily could by writing
> one or two primitives) this should improve matters.

The conversions could be made faster, yes. But consider this: the  
life-cycle of a string in a web app is very often something like this:

- comes in over HTTP
- lives in the image for a while, maybe persisted in some way
- gets sent back out over HTTP many times

Even if the conversion *is* blindingly fast, it's still better to  
leave it as UTF-8 the whole time, not only to remove the overhead of  
decoding and reencoding, but also to avoid storing WideStrings in the  
image for long periods of time. Also, consider that building html  
pages mainly involves writing lots of short strings to streams, which  
sometimes include non-ASCII characters. If they can be pre-encoded  
it's another space and time win. On the other hand, the traditional  
drawback to UTF-8, random access to characters, doesn't come up much  
with generating web pages, though of course a web app may do this  
kind of thing as part of its domain functionality.
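Colin's pre-encoding point is easy to sketch: encode the constant fragments once, then build each response by byte concatenation only, so no decode/re-encode ever happens on the hot path. The names below are illustrative, not any real framework's API.

```python
# Sketch of pre-encoded HTML fragments: constants are encoded once,
# responses are assembled purely at the byte level.
OPEN = "<p>".encode("utf-8")
CLOSE = "</p>".encode("utf-8")

def render(paragraphs):
    # Dynamic data is encoded once on entry; constants are already bytes.
    out = bytearray()
    for p in paragraphs:
        out += OPEN + p.encode("utf-8") + CLOSE
    return bytes(out)

body = render(["naïve", "text"])
assert body == "<p>naïve</p><p>text</p>".encode("utf-8")
```

The bytes can be handed to the socket as-is, matching the "byte identical in, byte identical out" life-cycle described above.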

I don't claim that all strings should always be UTF-8, but having a  
UTF8String class would be an excellent thing.

Colin


Re: UTF8 Squeak

stephane ducasse
Colin

Could you say the difference between WideString and UTF-8 (would UTF-8
be a specialized WideString?).
I got bitten by these encodings problems and having a nice solution  
would be good.

Stef



Re: UTF8 Squeak

Philippe Marschall
2007/6/9, stephane ducasse <[hidden email]>:
> Colin
>
> Could you say the difference between WidString and UTF-8 (UTF-8 would
> a specialized WideString?).
The way I understand it, UTF8String would be a subclass of ByteString
and would probably have methods like #size, #first:, #last: and #at:
overridden.

> I got bitten by these encodings problems and having a nice solution
> would be good.
Well, there is what the evil language with J does: UCS-2 everywhere, no
excuses. This is a bit awkward for characters outside the BMP (which
are rarer than unicorns), but IIRC the astral planes didn't exist
when it was created. So you could argue for UCS-4. Yes, it's twice the
size, but who really cares? If you could get rid of all the size hacks
in Squeak that were cool in the '70s, would you?
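Why "UCS-2 everywhere" is awkward outside the BMP, in two lines: a non-BMP code point does not fit in one 16-bit unit, so UTF-16 spends a surrogate pair on it, and plain UCS-2 cannot represent it at all.

```python
# A non-BMP ("astral plane") character needs a surrogate pair in
# UTF-16, i.e. two 16-bit units; UCS-2 has no way to represent it.
gclef = "\U0001D11E"                       # MUSICAL SYMBOL G CLEF
assert ord(gclef) > 0xFFFF                 # outside the BMP
assert len(gclef.encode("utf-16-be")) == 4 # two 16-bit units

bmp = "あ"                                 # a BMP character
assert len(bmp.encode("utf-16-be")) == 2   # exactly one unit
```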

Cheers
Philippe




Re: UTF8 Squeak

Colin Putney
In reply to this post by stephane ducasse

On Jun 9, 2007, at 12:24 AM, stephane ducasse wrote:

> Could you say the difference between WideString and UTF-8 (would UTF-8
> be a specialized WideString?).

WideString is a fixed-length encoding - each character is 4 bytes
long. UTF-8 is a variable-length encoding - each character can be
1 to 4 bytes.

The problem with WideString is that it wastes memory. Most characters  
can fit into 2 bytes, and all of them can fit into 3 bytes.
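A quick way to see the space trade-off (sketched in Python as a stand-in, since it makes byte counts easy to inspect; the sample string is just an arbitrary mostly-ASCII example):

```python
# Compare storage for a mostly-ASCII string with a few accented characters.
text = "Pošlji sporočilo"        # 16 characters, 2 of them non-ASCII

utf8_size = len(text.encode("utf-8"))   # variable width: 1 byte each, +1 per accent
fixed2_size = len(text) * 2             # TwoByteString-style fixed storage
fixed4_size = len(text) * 4             # WideString-style fixed storage

print(utf8_size, fixed2_size, fixed4_size)   # 18 32 64
```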

The problem with UTF-8 is that it makes random access expensive.  
UTF8String>>at: would have to do a linear search through the string  
to find the character at a given offset.
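The linear search that such an `at:` would need can be sketched like this (Python stand-in; `utf8_char_at` is a hypothetical name, not an existing API):

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return character `index` by walking UTF-8 lead bytes -- O(index),
    versus O(1) subscripting into a fixed-width WideString."""
    def width(b: int) -> int:
        # The lead byte encodes the sequence length:
        # 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
        return 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

    pos = 0
    for _ in range(index):              # skip the first `index` characters
        pos += width(data[pos])
    return data[pos:pos + width(data[pos])].decode("utf-8")

print(utf8_char_at("aé€".encode("utf-8"), 2))   # € (scans past 'a' and 'é')
```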

> I got bitten by these encodings problems and having a nice solution  
> would be good.

I don't think there's a single solution that's good for all problems.  
For the kind of web applications that I work on, UTF-8 is a clear  
win. For other kinds of applications, WideString and maybe  
TwoByteString are probably better.

Colin


Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Philippe Marschall
Philippe Marschall wrote:
>> I got bitten by these encodings problems and having a nice solution
>> would be good.
> Well, there is what the evil language with J does: UCS2 everywhere, no
> excuses. This is a bit awkward for characters outside the BMP (which
> are rarer than unicorns) but IIRC the astral planes didn't exist
> when it was created. So you could argue for UCS4. Yes, it's twice the
> size, but who really cares? If you could get rid of all the size hacks
> in Squeak that were cool in the '70s, would you?

All of us who use the image as a database care about space efficiency, but
on the other side we want all normal string operations to run on Unicode
strings too. That's why a UTF-8 encoded string is not appropriate even
though it is the most space-efficient: string operations on it are not fast enough.

I would propose a hybrid solution: three subclasses of String:

1. ByteString for ASCII (native English speakers)
2. TwoByteString for most other languages
3. FourByteString (WideString) for Japanese/Chinese/and others

And even for the 2nd group, for short strings plain ASCII suffices in
many cases. For Slovenian I would say for 80% of short strings (we have
only čšžČŠŽ as non-ASCII chars). I think most of Latin Europe has a
similar situation.

Conversion between strings should be automatic, as it is with numbers. You
start with an ASCII-only ByteString; when you first encounter a
character > 255 you convert to TwoByteString, and past 65535 to FourByteString.
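The promotion rule described here could be sketched as follows (Python used as a stand-in; `narrowest_string_class` is a hypothetical helper, and the class names follow the proposal above):

```python
def narrowest_string_class(s: str) -> str:
    """Pick the narrowest of the three proposed representations,
    the way SmallInteger silently promotes to LargeInteger."""
    top = max(map(ord, s), default=0)   # largest code point in the string
    if top <= 255:
        return "ByteString"        # 1 byte per character
    if top <= 65535:
        return "TwoByteString"     # 2 bytes: covers the BMP, incl. čšž
    return "FourByteString"        # 4 bytes: everything beyond the BMP

print(narrowest_string_class("hello"))   # ByteString
print(narrowest_string_class("čšž"))     # TwoByteString
print(narrowest_string_class("𝄞"))      # FourByteString
```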

Best regards
Janko




--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

Philippe Marschall
2007/6/9, Janko Mivšek <[hidden email]>:

> Philippe Marschall wrote:
> >> I got bitten by these encodings problems and having a nice solution
> >> would be good.
> > Well, there is what the evil language with J does: UCS2 everywhere, no
> > excuses. This is a bit awkward for characters outside the BMP (which
> > are rarer than unicorns) but IIRC the astral planes didn't exist
> > when it was created. So you could argue for UCS4. Yes, it's twice the
> > size, but who really cares? If you could get rid of all the size hacks
> > in Squeak that were cool in the '70s, would you?
>
> All of us who use image as a database care about space efficiency but on
> the other side we want all normal string operations to run on unicode
> strings too.
The image is not an efficient database. It stores all kinds of "crap"
like Morphs. And it sucks as a database (ACID transactions, anyone?).
Don't even get me started on migration (like the Squeak Chronology
classes).

Philippe




Image as a database (was Re: UTF8 Squeak)

Janko Mivšek
Hi Philippe,

Philippe Marschall wrote:

> 2007/6/9, Janko Mivšek <[hidden email]>:
>> Philippe Marschall wrote:
>> All of us who use image as a database care about space efficiency but on
>> the other side we want all normal string operations to run on unicode
>> strings too.
>
> The image is not an efficient database. It stores all kinds of "crap"
> like Morphs. And it sucks as a database (ACID transactions, anyone?).
> Don't even get me started on migration (like the Squeak Chronology
> classes).

I admit that I come from VW, where I'm running quite a number of web apps
on images which also serve as the sole database, and that just works,
reliably and fast.

Now I'm thinking of doing the same in Squeak. That is, using the Squeak
image as a database, fast and reliable. Am I too naive?

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si
