New Win32 VM [m17n testers needed]

Re: UTF8 Squeak

Andreas.Raab
Colin Putney wrote:

> If a String were a flat array of Unicode code points, it would be
> implemented in Smalltalk as an array of Characters wouldn't it? The fact
> that you've chosen to hide the internal representation of the string and
> use a "variable byte" or "variable word" subclass to store bytes, rather
> than objects, is an indication that the strings *are* encoded. In fact,
> the encodings have names: ISO 8859-1 and UCS-4. Janko is proposing to
> add a string class that internally stores strings encoded in UCS-2 to
> the mix.
>
> So what's so holy about these particular encodings, besides the fact
> that they're especially efficient on the VisualWorks VM?

Indeed. That is effectively the point I was trying to make in taking a
more "encoding-centered" perspective on the problem. In which case there
is nothing holy about particular encodings (and nothing confusing about
the choice of names); some people use one encoding, some people use
another and at the end of the day there is no need to be religious about
what exactly a string must contain (EBCDIC anyone? :-)

Cheers,
   - Andreas


Re: UTF8 Squeak

Andreas.Raab
In reply to this post by Janko Mivšek
Hi Janko,

Thanks for the numbers, that's quite interesting to see. It seems that
the total increase in image size would be roughly 10% all things
considered (from approx. 113 MB to approx. 123 MB). That's actually less
than I would have intuitively expected (my guess was in the range of 20%
total image size).

Of course a single data point is no proof of anything but thanks again
for taking the time and getting us a few numbers.

Cheers,
   - Andreas

Janko Mivšek wrote:

> Hi Andreas,
>
> Here is an analysis of one VW "image as a database" which runs a
> document portal for the quality management system of one of our
> pharmaceutical distributors.
>
> image size: 113MB
>
>                 instances  size total  avg size   byte size
> ByteString      355.068     8.562.097     24      9.982.369
> TwoByteString    19.848     5.372.602    541     10.824.596
>
> If I remember correctly byte indexed objects have 4 byte header in VW,
> therefore:
>
> byte size = 4 bytes per header + size (2*size for TwoByteString)
>
> This should also be rounded up to 4 bytes, which I ignored for now.
>
> Strings therefore make up approx. 20% of the whole image. If 4B strings
> were used instead of 2B ones, the string space increase would be:
>
>     byte size with 2BString: 20.806.965
>     byte size with 4BString: 31.552.169
>                  increase %:        52%
>
> So, a 2x bigger image was really an exaggerated statement, but you can see
> from those results that the image would grow quite extensively if 4B strings
> were used instead of 2B ones.
>
> Best regards
> Janko
>
>
> Andreas Raab wrote:
>> Janko Mivšek wrote:
>>>> so, all of this talk is for about 4 MB extra (in that image squeak
>>>> takes 26.8 MB at startup)?
>>>
>>> Consider image as a database where you store strings from your
>>> application. In that case space efficient but still manipulable
>>> strings really matter. For instance, I run one 380MB VW image full of
>>> TwoByteStrings and this image would probably have 760M with only
>>> FourByteStrings ...
>>
>> Actually, I would be very interested in a more accurate answer than
>> "probably" since the 2x answer assumes that the whole image consists
>> of 2-byte strings and that there is zero overhead for headers etc.
>> both of which are obviously not the case. If you wouldn't mind, could
>> you run a little script that computes the number of characters that
>> are actually stored as 2 bytes? Something like:
>>
>>   TwoByteString allInstances inject: 0 into:[:sum :str| sum + str size].
>>
>> This strictly counts the number of characters that "matter", i.e.,
>> that are affected by an encoding change and I'd be interested in
>> getting some data point about how that looks in a real application
>> (e.g., whether that is in the 10%, 25%, or 50% range). In particular
>> considering that VW probably uses the most compact form by default and
>> that there is probably quite a bit of application code running and
>> that there is probably more than just strings to keep in the data, I'm
>> really curious how much of that ends up being relevant for the 2-byte
>> encoding.
>>
>> Thanks,
>>   - Andreas
>>
>>
>
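
The size arithmetic in Janko's analysis above can be re-checked with a short script. This is a minimal sketch in Python rather than Smalltalk. The instance and character counts are copied from his table. The 4-byte header is his recollection of the VW object format, alignment padding is ignored as in his post, and reading "113MB" as 113 * 10^6 bytes is an assumption. The names `byte_size`, `with_2b` and `with_4b` are made up for the illustration.

```python
HEADER_BYTES = 4  # per-object header, as recalled in the post (assumption)

# Figures from the table: number of instances and total characters stored.
byte_string = {"instances": 355_068, "chars": 8_562_097}
two_byte_string = {"instances": 19_848, "chars": 5_372_602}


def byte_size(stats, bytes_per_char):
    """One header per instance plus the character payload (padding ignored)."""
    return HEADER_BYTES * stats["instances"] + bytes_per_char * stats["chars"]


# String space today: 1-byte and 2-byte strings.
with_2b = byte_size(byte_string, 1) + byte_size(two_byte_string, 2)
# String space if every TwoByteString were stored with 4 bytes per character.
with_4b = byte_size(byte_string, 1) + byte_size(two_byte_string, 4)

growth = with_4b - with_2b
string_growth_pct = 100 * growth / with_2b
image_growth_pct = 100 * growth / (113 * 10**6)  # 113MB read as decimal MB

print(with_2b)                   # 20806965
print(with_4b)                   # 31552169
print(round(string_growth_pct))  # 52
print(round(image_growth_pct, 1))
```

The first three printed values match the figures in the post. The last one comes out a little under 10%, which is where the "roughly 10%" estimate for the whole image comes from.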



Re: UTF8 Squeak

Nicolas Cellier-3
In reply to this post by Andreas.Raab
Andreas Raab <andreas.raab <at> gmx.de> writes:

>
> Colin Putney wrote:
> > If a String were a flat array of Unicode code points, it would be
> > implemented in Smalltalk as an array of Characters wouldn't it? The fact
> > that you've chosen to hide the internal representation of the string and
> > use a "variable byte" or "variable word" subclass to store bytes, rather
> > than objects, is an indication that the strings *are* encoded. In fact,
> > the encodings have names: ISO 8859-1 and UCS-4. Janko is proposing to
> > add a string class that internally stores strings encoded in UCS-2 to
> > the mix.
> >
> > So what's so holy about these particular encodings, besides the fact
> > that they're especially efficient on the VisualWorks VM?
>
> Indeed. That is effectively the point I was trying to make in taking a
> more "encoding-centered" perspective on the problem. In which case there
> is nothing holy about particular encodings (and nothing confusing about
> the choice of names); some people use one encoding, some people use
> another and at the end of the day there is no need to be religious about
> what exactly a string must contain (EBCDIC anyone? :-)
>
> Cheers,
>    - Andreas
>
>

As long as there is a Character class, there must be:
- either a canonical encoding in the system,
- or each Character should also carry encoding information.

A canonical encoding must be able to encode all characters, not a subset, so
using a Universal Character Set (UCS) is required, and the standard ISO-10646
seems the best candidate, unless you are ready to invent your own.

Having strings encoded using this canonical encoding seems an efficient strategy
regarding String-Character conversions. Currently, our strings are encoded with
an encoder that is neutral wrt the canonical encoding...

Anyway, even if Characters carry encoding information, in order to compare
them we would need a universal canonical encoding too...

So, the only religion here is:
- to have the simplest implementation that enables multilingual support,
- to conform to the widely used UCS standard.

Of course, you need to deal with other encodings to link to the external world.
There, several strategies have been presented in this thread:
- doing a canonical conversion at each input/output (by way of stream)
- not converting at all, but storing a collection of bytes uninterpreted by the
image (a ByteArray solution)
- having an internal representation of external objects (strings encoded in
another code page), able to be manipulated inside the image as any other String.

The latter is the VW solution. And this is what you seem to propose. It is also
the richest solution. The current implementation is the simplest. There lies a
religious choice, maybe...

Nicolas
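
The first strategy above (a canonical conversion at each input/output, by way of a stream) can be sketched as follows. This is an illustration in Python, not Squeak or VW code: a Python `str` stands in for the canonical in-image string of code points, and the helper names `read_text` and `write_text` are hypothetical.

```python
import io


def read_text(raw: bytes, external_encoding: str) -> str:
    """Decode external bytes into the canonical in-image form on input."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding=external_encoding,
                          newline="") as stream:
        return stream.read()


def write_text(text: str, external_encoding: str) -> bytes:
    """Encode the canonical in-image form into external bytes on output."""
    buffer = io.BytesIO()
    with io.TextIOWrapper(buffer, encoding=external_encoding,
                          newline="") as stream:
        stream.write(text)
        stream.flush()
        return buffer.getvalue()


# The same name arriving in two different external encodings.
from_latin2 = read_text(b"Miv\xb9ek", "iso-8859-2")
from_utf8 = read_text(b"Miv\xc5\xa1ek", "utf-8")

# Inside the "image" both are plain code points, so they compare equal.
print(from_latin2 == from_utf8)                 # True
print(write_text(from_latin2, "utf-8"))         # b'Miv\xc5\xa1ek'
```

Because both inputs are converted to the same canonical form at the boundary, comparison inside the image needs no knowledge of where a string came from, which is the point made above about needing a universal canonical encoding for comparisons.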



Re: UTF8 Squeak

Bert Freudenberg
In reply to this post by Colin Putney
On Jun 13, 2007, at 1:31 , Colin Putney wrote:

> On Jun 12, 2007, at 4:28 AM, Bert Freudenberg wrote:
>
>
>> On Jun 12, 2007, at 8:29 , Colin Putney wrote:
>>
>>
>>> Your proposal is actually to have strings encoded as ISO 8859-1,  
>>> UCS-2 or UCS-4.
>>>
>>
>> Actually, the idea is that a String has Unicode throughout, with  
>> no encoding. A string is simply a flat array of Unicode code points.
>>
>> To optimize space usage we choose the lowest number of bytes per  
>> character that can encompass all code points in a String. This is  
>> implemented as specialized subclasses of String. So for code  
>> points below 256 we use ByteString (8 bit per char), for all  
>> others WideString (32 bits per char). This is purely space  
>> optimization, not a change in encoding.
>>
>
> Yes, I understand how m17n was implemented in Squeak. I'm trying to  
> challenge one of the ideas that underlies Janko's proposal, which  
> you lay out beautifully above: "String has Unicode throughout, with  
> no encoding." And again at the end: "This is purely space  
> optimization, not a change in encoding."
>
> If a String were a flat array of Unicode code points, it would be  
> implemented in Smalltalk as an array of Characters wouldn't it?

If that was as efficient as the current implementation it certainly  
would. From the outside it certainly appears as an array of Characters.

> The fact that you've chosen to hide the internal representation of  
> the string and use a "variable byte" or "variable word" subclass to  
> store bytes, rather than objects, is an indication that the strings  
> *are* encoded.

I'd say the main rationale for this was for optimization.

The implementation of Strings in Squeak has always been as a  
variableByteSubclass; the numerical value of each byte in the String  
is the Character's value. This means you could only have Characters  
with value 0 to 255 in a String. Now, to extend that range we have  
WideStrings, which are an extension as natural as extending the  
SmallInteger range by LargeIntegers. It still holds that the  
numerical value of each word in a WideString is identical to the  
Character's value at that position. There is no interpretation in the  
mapping between the internal representation and the external appearance.

> In fact, the encodings have names: ISO 8859-1 and UCS-4.

This is an unavoidable coincidence. We just do Unicode. The 8-bit  
subset of Unicode happens to coincide with ISO 8859-1. And UCS-4  
happens to be 32 bits.

> Janko is proposing to add a string class that internally stores  
> strings encoded in UCS-2 to the mix.

That's one way to say it. The other way to say it is to not store  
unnecessary 0-bytes for Unicode characters that are less than 65536.  
Same as we do for characters below 256.

> So what's so holy about these particular encodings, besides the  
> fact that they're especially efficient on the VisualWorks VM?

I have no idea how VW came into the discussion. For a discussion why  
these "encodings" appear natural, see above.

So what are you proposing instead?

- Bert -
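
The scheme Bert describes (pick the narrowest fixed width that holds every code point in the string, and store each code point's numerical value unchanged) can be sketched like this. It is an illustration in Python, not the Squeak implementation. The function names and the `widths` parameter are hypothetical: `(1, 4)` stands for ByteString plus WideString, and `(1, 2, 4)` adds the 2-byte class Janko proposes. Big-endian byte order is an arbitrary choice for the sketch.

```python
def storage_width(text, widths=(1, 4)):
    """Smallest allowed bytes-per-character that fits every code point."""
    widest = max(map(ord, text), default=0)
    for width in widths:
        if widest < 256 ** width:
            return width
    raise ValueError("no allowed width can hold this string")


def pack(text, widths=(1, 4)):
    """Store each code point's value as-is in a fixed-width slot."""
    width = storage_width(text, widths)
    payload = b"".join(ord(ch).to_bytes(width, "big") for ch in text)
    return width, payload


def unpack(width, payload):
    """Read the slots back; the slot value is the code point itself."""
    return "".join(
        chr(int.from_bytes(payload[i:i + width], "big"))
        for i in range(0, len(payload), width)
    )


print(storage_width("Squeak"))                    # 1
print(storage_width("Mivšek"))                    # 4
print(storage_width("Mivšek", widths=(1, 2, 4)))  # 2
print(len(pack("Mivšek", widths=(1, 2, 4))[1]))   # 12
```

In every width the slot holds the same number, so `unpack` needs no table or decoding step beyond knowing the slot size. Whether one calls that "an encoding" or "a space optimization" is exactly the disagreement in this thread.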


