How to zip a WideString

Peter Kenny

Hello

I have a problem with text storage, to which I seem to have found a solution, but it’s a bit clumsy-looking. I would be grateful for confirmation that (a) there is no neater solution, (b) I can rely on this to work – I only know that it works in a few test cases.

I need to store a large number of text strings in a database. To avoid the database files becoming too large, I am thinking of zipping the strings, or at least the less frequently accessed ones. Depending on the source, some of the strings will be instances of ByteString, some of WideString (because they contain characters not representable in one byte). Storing a WideString uncompressed seems to occupy 4 bytes per character, so I decided, before thinking of compression, to store the strings utf8Encoded, which yields a ByteArray. But zipped can only be applied to a String, not a ByteArray.

So my proposed solution is:

For compression:    myZipString := myWideString utf8Encoded asString zipped.
For decompression:  myOutputString := myZipString unzipped asByteArray utf8Decoded.

As I said, it works in all the cases I tried, whether WideString or not, but the chains of transformations look clunky somehow. Can anyone see a neater way of doing it? And can I rely on it working, especially when I am handling foreign texts with many multi-byte characters?
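
One way to gain more confidence than a handful of manual tries is to run the two chains over a set of samples in a Playground. The sketch below is only that: the sample strings are made up for illustration, and passing proves nothing beyond those samples. The likely reason the chain holds up is that utf8Encoded yields only byte values, so the intermediate asString should always be a ByteString and zipped never sees a wide character.

```smalltalk
"Playground sketch: round-trip a few sample strings through the two chains above.
 The sample strings are invented for illustration only."
| samples |
samples := {
	'plain ASCII text'.
	'café au lait'.
	'foo 10 €'.
	'日本語のテキスト' }.
samples do: [ :each |
	| packed restored |
	"Compression chain from the post: String -> UTF-8 bytes -> ByteString -> zipped String"
	packed := each utf8Encoded asString zipped.
	"Decompression chain from the post: zipped String -> ByteString -> bytes -> String"
	restored := packed unzipped asByteArray utf8Decoded.
	self assert: restored = each ]
```

If an assertion fails for some input, that input is a counterexample worth posting to the list.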

 

Thanks in advance for any help.

Peter Kenny

Re: How to zip a WideString

Sven Van Caekenberghe-2
Hi Peter,

About #zipped / #unzipped and the inflate / deflate classes: your observation is correct. These work from string to string, while the compressed representation should clearly be binary.

The contents (the input, what is inside the compressed data) can be anything; it is not necessarily a string (it could be an image, so also something binary). Only the creator of the compressed data knows; you cannot assume to know in general.

It would be possible (and it would be very nice) to change this; however, that would have a serious impact on users (as the contract changes).

About your use case: why would your DB not be capable of storing large strings? A good DB should be capable of storing any kind of string (full Unicode) efficiently.

What DB and what sizes are we talking about?

Sven

> On 3 Oct 2019, at 11:06, PBKResearch <[hidden email]> wrote:
>
> [...]


Re: How to zip a WideString

tomo
Peter and Sven,

The zip API from string to string works fine, except that aWideString zipped generates a malformed zip string.
I think it might be good guidance to define String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:, such as

String>>zippedWithEncoding: encoder
zippedWithEncoding: encoder
    "Encode the receiver with encoder and gzip the resulting bytes into a ByteArray"
    ^ ByteArray
        streamContents: [ :stream |
            | gzstream |
            gzstream := GZipWriteStream on: stream.
            encoder
                next: self size
                putAll: self
                startingAt: 1
                toStream: gzstream.
            gzstream close ]

and ByteArray>>unzippedWithEncoding: encoder
unzippedWithEncoding: encoder
    "Gunzip the receiver and decode the resulting bytes with encoder into a String"
    | byteStream |
    byteStream := GZipReadStream on: self.
    ^ String
        streamContents: [ :stream |
            [ byteStream atEnd ]
                whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]

Then, you can write something like
zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.

This will not affect the existing zipped/unzipped API and you can
handle other encodings.
This zippedWithEncoding: generates a ByteArray, which is kind of
conformant to the encoding API.
And you don't have to create many intermediate byte arrays and byte strings.

I hope this helps.
---
tomo

2019/10/3(Thu) 18:56 Sven Van Caekenberghe <[hidden email]>:

> [...]

Re: How to zip a WideString

Peter Kenny
In reply to this post by Sven Van Caekenberghe-2
Hi Sven

The DB system I am using at the moment is OmniBase - despite Todd Blanchard's warning, I have decided to experiment with it. It has the advantage of being fully object based, though I am not yet using anything more elaborate than strings, dictionaries and arrays as the data types. One secondary advantage is that I still use Dolphin occasionally, and my version of Dolphin 6.1 comes with OmniBase built in. I have checked that an OmniBase DB built in Pharo can be read in Dolphin.

As to size, there is no problem with storing large strings in OmniBase, except for the amount of disk space occupied in total. I am looking far ahead - my toy development DB is only about 15MB, but if I get to where I want to be, it could be tens of GB. With modern machines this may not be a problem, but I thought there might come a time when I want to think about trade-offs between storage space and unzipping time. I had a few qualms when I looked inside my development DB; it seems that an OmniBase DB consists of a few smallish index files and one ginormous file called 'objects'. I am not sure how the OS will get on with a huge single file.
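
To put rough numbers on that trade-off, something like the following could be run in a Playground. It is only a sketch: the sample text is artificial (one short phrase doubled repeatedly, so it compresses far better than real prose would), the repetition count is arbitrary, and the timing depends on the machine and image. It uses the gzip streams directly rather than any extension method.

```smalltalk
"Sketch: compare storage sizes and time the unzip step.
 The sample text is artificial and highly repetitive."
| text utf8 zipped millis |
text := 'foo 10 € '.
8 timesRepeat: [ text := text , text ].	"9 * 256 = 2304 characters"
utf8 := text utf8Encoded.
zipped := ByteArray streamContents: [ :out |
	(GZipWriteStream on: out) nextPutAll: utf8; close ].
"Time 1000 decompress-and-decode passes"
millis := Time millisecondsToRun: [
	1000 timesRepeat: [ (GZipReadStream on: zipped) upToEnd utf8Decoded ] ].
"Inspect or print this array to see the figures"
{ 'WideString bytes (4 per character)' -> (text size * 4).
  'UTF-8 bytes' -> utf8 size.
  'gzip bytes' -> zipped size.
  'ms per 1000 unzips' -> millis }
```

Running the same measurement on a sample of real entries from the DB would give a more honest picture than this synthetic text.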

But all this is speculative at the moment. For now I shall continue with storing the strings unzipped (but utf8Encoded - thanks for such a neat facility), bearing in mind that if I need to save space later, my method as described will work.
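
If only some entries are compressed later, the reading side has to tell the two kinds apart. One possible approach, sketched here on the assumption that compressed entries are written with GZipWriteStream: gzip data starts with the two bytes 16r1F 16r8B, and valid UTF-8 cannot start with that pair because 16r8B is a continuation byte. The block names isGzip, pack and unpack are hypothetical helpers for this sketch, not part of Pharo or OmniBase.

```smalltalk
"Sketch: store text either as plain UTF-8 or as gzipped UTF-8, and
 decide on read by looking at the two-byte gzip magic number.
 The block names are hypothetical, not part of Pharo."
| isGzip pack unpack plain packed |
isGzip := [ :bytes |
	bytes size >= 2
		and: [ (bytes at: 1) = 16r1F and: [ (bytes at: 2) = 16r8B ] ] ].
pack := [ :string |
	ByteArray streamContents: [ :out |
		(GZipWriteStream on: out) nextPutAll: string utf8Encoded; close ] ].
unpack := [ :bytes |
	(isGzip value: bytes)
		ifTrue: [ (GZipReadStream on: bytes) upToEnd utf8Decoded ]
		ifFalse: [ bytes utf8Decoded ] ].
plain := 'foo 10 €' utf8Encoded.
packed := pack value: 'foo 10 €'.
self assert: (unpack value: plain) = 'foo 10 €'.
self assert: (unpack value: packed) = 'foo 10 €'
```

With this in place, existing uncompressed entries would not need to be rewritten when compression is switched on for new or rarely used ones.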

Peter Kenny

-----Original Message-----
From: Pharo-users <[hidden email]> On Behalf Of Sven Van Caekenberghe
Sent: 03 October 2019 10:56
To: Any question about pharo is welcome <[hidden email]>
Subject: Re: [Pharo-users] How to zip a WideString

[...]



Re: How to zip a WideString

Sven Van Caekenberghe-2
In reply to this post by tomo
Hi Tomo,

Indeed, I stand corrected: it does seem possible to use the existing gzip classes to work from bytes to bytes. This works fine:

data := ByteArray streamContents: [ :out | (GZipWriteStream on: out) nextPutAll: 'foo 10 €' utf8Encoded; close ].

(GZipReadStream on: data) upToEnd utf8Decoded.

Now regarding the encoding option, I am not so sure that is really necessary (though nice to have). Why would anyone use anything except UTF8 (today)?

Thanks again for the correction!

Sven

> On 3 Oct 2019, at 12:41, Tomohiro Oda <[hidden email]> wrote:
>
> [...]


Re: How to zip a WideString

Sven Van Caekenberghe-2
Actually, thinking about this a bit more, why not add #zipped and #unzipped to ByteArray?


ByteArray>>#zipped
  "Return a GZIP compressed version of the receiver as a ByteArray"

  ^ ByteArray streamContents: [ :out |
      (GZipWriteStream on: out) nextPutAll: self; close ]

ByteArray>>#unzipped
  "Assuming the receiver contains GZIP encoded data,
   return the decompressed data as a ByteArray"

  ^ (GZipReadStream on: self) upToEnd


The original one-liner then becomes

  'string' utf8Encoded zipped.

and

  data unzipped utf8Decoded

which is pretty clear, simple and intention-revealing, IMHO.

> On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <[hidden email]> wrote:
>
> [...]


Re: How to zip a WideString

tomo
Sven,

Yes, ByteArray>>zipped/unzipped are a simple, neat and intuitive way of zipping/unzipping binary data.
I also love the new idioms. They look clean and concise.

Best Regards,
---
tomo

3 Oct 2019 (Thu) 20:14 Sven Van Caekenberghe <[hidden email]>:

> [...]

Re: How to zip a WideString

Peter Kenny
Sven, Tomo

Thanks for this discussion. I shall bear in mind Sven's proposed extension to ByteArray - this is exactly the sort of neater solution I was hoping for. Any chance this might make it into standard Pharo (perhaps in P8)?

Peter Kenny

-----Original Message-----
From: Pharo-users <[hidden email]> On Behalf Of Tomohiro Oda
Sent: 03 October 2019 12:22
To: Any question about pharo is welcome <[hidden email]>
Subject: Re: [Pharo-users] How to zip a WideString

[...]
> >>
> >
>
>



Re: How to zip a WideString

Sven Van Caekenberghe-2
https://github.com/pharo-project/pharo/issues/4806

PR will follow




Re: How to zip a WideString

Sven Van Caekenberghe-2
https://github.com/pharo-project/pharo/pull/4812




Re: How to zip a WideString

Peter Kenny
Thanks Sven. Just 5 hours from when I raised the question, there is a solution in place for everyone. This group is amazing!





Re: How to zip a WideString

Sean P. DeNigris
Administrator
Peter Kenny wrote
> Just 5 hours from when I raised the question, there is a solution in place
> for everyone. This group is amazing!

Indeed. Bravo, Sven and all of our other contributors.

I can't resist mentioning that the fix would almost certainly have taken
significantly longer a few years ago due to the awkward issue/contribution
infrastructure. Esteban and others' tireless work to get git/GH support up
and running are at the heart of all these quick turnaround stories. Keep
going!!

p.s. it takes serious courage to port the issue tracker of a project this
size and reach twice in a few years



-----
Cheers,
Sean

Re: How to zip a WideString

Sven Van Caekenberghe-2


> On 3 Oct 2019, at 18:25, Sean P. DeNigris <[hidden email]> wrote:
>
> Peter Kenny wrote
>> Just 5 hours from when I raised the question, there is a solution in place
>> for everyone. This group is amazing!
>
> Indeed. Bravo, Sven and all of our other contributors.

Thanks, but I must say that it was Tomo's snippet that showed me how to use the gzip streams binary to binary as I thought that was not possible.

And we need users' questions to start acting, so thank you Peter as well.

> I can't resist mentioning that the fix would almost certainly have taken
> significantly longer a few years ago due to the awkward issue/contribution
> infrastructure. Esteban and others' tireless work to get git/GH support up
> and running are at the heart of all these quick turnaround stories. Keep
> going!!
>
> p.s. it takes serious courage to port the issue tracker of a project this
> size and reach twice in a few years

Indeed, it had been a while since I did a PR and I must say it went super smooth.

Also, Pharo 8 feels quite snappy.




Re: How to zip a WideString

Richard O'Keefe
In reply to this post by tomo
The interface should surely be
   SomeClass
     methods for: 'compression'
       zipped "return a byte array"

     class methods for: 'decompression'
       unzip: aByteArray "return an instance of SomeClass"


Re: How to zip a WideString

Peter Kenny
Richard

I don't think so. The case being considered for my problem is the compression of a ByteArray produced by applying #utf8Encoded to a WideString, but it extends to any other form of ByteArray. If you substitute ByteArray for SomeClass in your examples, I think you will see why the chosen interface was used.

Peter Kenny




Re: How to zip a WideString

Richard O'Keefe
Here's how it would look in my library:

   compressed := original zipped.
      "There is currently one definition, in AbstractStringOrByteArray,
       covering [ReadOnly]ByteArray, [ReadOnly]String and its many subclasses,
       ByteBuffer, StringBuffer, Substring, [ReadOnly]ShortArray,
       [ReadOnly]MappedByteArray, and some others.  This relies on
       _ asByteArraySize and _ asByteArrayDo: _.  There is no need for
       a separate #utf8Encoded, that's what asByteArrayDo: *does*."

   copy := original class unzip: compressed.
      "This is a little trickier, but not hugely so.  There is no need
       for special case code.  [ReadOnly][Mapped]ByteArray and ByteBuffer
       are sequences of bytes, Stringy things are Unicode, and
       [ReadOnly]ShortArrays are treated as UTF16."

As far as I can tell, this just works for the original use case.



Re: How to zip a WideString

Sven Van Caekenberghe-2
Richard,

Your implementation mixes zipping/unzipping and encoding/decoding, dictating a single way to do so, if I understand it correctly.

The composition with several messages allows for end users to choose their own encoding format, depending on their own needs, which I think is more flexible.
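
As a rough illustration of that composition (a sketch only, assuming an image that already contains ByteArray>>zipped and ByteArray>>unzipped from pull request 4812; the choice of UTF-16 is arbitrary and just stands in for "some encoding other than UTF-8"):

```smalltalk
"Sketch: choose the encoder separately from the compression step.
 Assumes ByteArray>>zipped / ByteArray>>unzipped are loaded (PR 4812)."
| encoder original compressed restored |
encoder := ZnCharacterEncoder newForEncoding: 'utf-16'.
original := 'foo 10 €'.

"String -> ByteArray (encoding), then ByteArray -> ByteArray (gzip)"
compressed := (encoder encodeString: original) zipped.

"ByteArray -> ByteArray (gunzip), then ByteArray -> String (decoding)"
restored := encoder decodeBytes: compressed unzipped.

"restored should be equal to original"
restored = original
```

The same encoder instance has to be used on both sides; nothing in the compressed bytes records which encoding was chosen.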

Sven




Re: How to zip a WideString

Sven Van Caekenberghe-2
In reply to this post by tomo
Actually, thinking about the original use case, I now feel that it would be best to remove #zipped/unzipped from String.

The original problem was that

  'Les élèves Françaises ont 100 €' zipped unzipped.

does not work (it fails on WideStrings), while we now have

  'Les élèves Françaises ont 100 €' utf8Encoded zipped unzipped utf8Decoded.

which I would consider better form/style.

The original is also very confusing, since the result of zipping is not a string but binary.
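
For the database use case that started the thread, a small playground sketch along these lines might help decide which strings are worth compressing. It assumes ByteArray>>zipped and ByteArray>>unzipped are present in the image, and the variable names are only illustrative:

```smalltalk
"Sketch: compare encoded size with compressed size before storing.
 Assumes ByteArray>>zipped / ByteArray>>unzipped are available."
| text bytes compressed |
text := 'Les élèves Françaises ont 100 €'.
bytes := text utf8Encoded.        "ByteArray, at most 4 bytes per character"
compressed := bytes zipped.       "ByteArray in GZIP format"

Transcript
	show: 'encoded: ', bytes size printString;
	show: ' compressed: ', compressed size printString;
	cr.

"Round trip back to a String"
compressed unzipped utf8Decoded = text
```

GZIP adds a fixed header and trailer, so for very short strings the compressed form can come out larger than the plain UTF-8 bytes; compression probably only pays off for the longer texts.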

> On 3 Oct 2019, at 13:21, Tomohiro Oda <[hidden email]> wrote:
>
> Sven,
>
> Yes, ByteArray>>zipped/unzipped are simple, neat and intuitive way of
> zipping/unzipping binary data.
> I also love the new idioms. They look clean and concise.
>
> Best Regards,
> ---
> tomo
>
> 2019年10月3日(木) 20:14 Sven Van Caekenberghe <[hidden email]>:
>>
>> Actually, thinking about this a bit more, why not add #zipped #unzipped to ByteArray ?
>>
>>
>> ByteArray>>#zipped
>>  "Return a GZIP compressed version of the receiver as a ByteArray"
>>
>>  ^ ByteArray streamContents: [ :out |
>>      (GZipWriteStream on: out) nextPutAll: self; close ]
>>
>> ByteArray>>#unzipped
>>  "Assuming the receiver contains GZIP encoded data,
>>   return the decompressed data as a ByteArray"
>>
>>  ^ (GZipReadStream on: self) upToEnd
>>
>>
>> The original oneliner then becomes
>>
>>  'string' utf8Encoded zipped.
>>
>> and
>>
>>  data unzipped utf8Decoded
>>
>> which is pretty clear, simple and intention-revealing, IMHO.
>>
>>> On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <[hidden email]> wrote:
>>>
>>> Hi Tomo,
>>>
>>> Indeed, I stand corrected, it does indeed seem possible to use the existing gzip classes to work from bytes to bytes, this works fine:
>>>
>>> data := ByteArray streamContents: [ :out | (GZipWriteStream on: out) nextPutAll: 'foo 10 €' utf8Encoded; close ].
>>>
>>> (GZipReadStream on: data) upToEnd utf8Decoded.
>>>
>>> Now regarding the encoding option, I am not so sure that is really necessary (though nice to have). Why would anyone use anything except UTF8 (today).
>>>
>>> Thanks again for the correction !
>>>
>>> Sven
>>>
>>>> On 3 Oct 2019, at 12:41, Tomohiro Oda <[hidden email]> wrote:
>>>>
>>>> Peter and Sven,
>>>>
>>>> The zip API from string to string works fine, except that aWideString
>>>> zipped generates a malformed zip string.
>>>> I think it might be good guidance to define
>>>> String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:,
>>>> such as
>>>> String>>zippedWithEncoding: encoder
>>>> zippedWithEncoding: encoder
>>>>  ^ ByteArray
>>>>      streamContents: [ :stream |
>>>>          | gzstream |
>>>>          gzstream := GZipWriteStream on: stream.
>>>>          encoder
>>>>              next: self size
>>>>              putAll: self
>>>>              startingAt: 1
>>>>              toStream: gzstream.
>>>>          gzstream close ]
>>>>
>>>> and ByteArray>>unzippedWithEncoding: encoder
>>>> unzippedWithEncoding: encoder
>>>>  | byteStream |
>>>>  byteStream := GZipReadStream on: self.
>>>>  ^ String
>>>>      streamContents: [ :stream |
>>>>          [ byteStream atEnd ]
>>>>              whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
>>>>
>>>> Then, you can write something like
>>>> zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
>>>> unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>>>>
>>>> This will not affect the existing zipped/unzipped API and you can
>>>> handle other encodings.
>>>> This zippedWithEncoding: generates a ByteArray, which is kind of
>>>> conformant to the encoding API.
>>>> And you don't have to create many intermediate byte arrays and byte strings.
>>>>
>>>> I hope this helps.
>>>> ---
>>>> tomo
>>>>
>>>> On Thu, 3 Oct 2019 at 18:56, Sven Van Caekenberghe <[hidden email]> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> About #zipped / #unzipped and the inflate / deflate classes: your observation is correct, these work from string to string, while clearly the compressed representation should be binary.
>>>>>
>>>>> The contents (input, what is inside the compressed data) can be anything, it is not necessarily a string (it could be an image, so also something binary). Only the creator of the compressed data knows, you cannot assume to know in general.
>>>>>
>>>>> It would be possible (and it would be very nice) to change this, however that will have serious impact on users (as the contract changes).
>>>>>
>>>>> About your use case: why would your DB not be capable of storing large strings ? A good DB should be capable of storing any kind of string (full unicode) efficiently.
>>>>>
>>>>> What DB and what sizes are we talking about ?
>>>>>
>>>>> Sven
>>>>>
>>>>>> On 3 Oct 2019, at 11:06, PBKResearch <[hidden email]> wrote:
>>>>>>
>>>>>> Hello
>>>>>>
>>>>>> I have a problem with text storage, to which I seem to have found a solution, but it’s a bit clumsy-looking. I would be grateful for confirmation that (a) there is no neater solution, (b) I can rely on this to work – I only know that it works in a few test cases.
>>>>>>
>>>>>> I need to store a large number of text strings in a database. To avoid the database files becoming too large, I am thinking of zipping the strings, or at least the less frequently accessed ones. Depending on the source, some of the strings will be instances of ByteString, some of WideString (because they contain characters not representable in one byte). Storing a WideString uncompressed seems to occupy 4 bytes per character, so I decided, before thinking of compression, to store the strings utf8Encoded, which yields a ByteArray. But zipped can only be applied to a String, not a ByteArray.
>>>>>>
>>>>>> So my proposed solution is:
>>>>>>
>>>>>> For compression:             myZipString := myWideString utf8Encoded asString zipped.
>>>>>> For decompression:         myOutputString := myZipString unzipped asByteArray utf8Decoded.
>>>>>>
>>>>>> As I said, it works in all the cases I tried, whether WideString or not, but the chains of transformations look clunky somehow. Can anyone see a neater way of doing it? And can I rely on it working, especially when I am handling foreign texts with many multi-byte characters?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>> Peter Kenny
>>>>>
>>>>>
>>>>
>>>
>>
>>
>



Re: How to zip a WideString

Richard O'Keefe
In reply to this post by Sven Van Caekenberghe-2
One of the key concepts in astc's text handling is that all strings
use the same encoding.
So no, my system doesn't mix encoding/decoding with compression/decompression;
encoding and decoding are completely out of scope.
There is a *transformation format* issue, but not an encoding issue.
For example, having got aByteArray from somewhere,
   aString := SCSUDecoder decode: (ByteArray unzip: aByteArray)
uncompresses in one step (which knows nothing about encodings)
and decodes in another (which is admittedly the Simple Compression
Scheme for Unicode, but knows nothing of FLATE compression).
Similarly,
 aByteArray := (EightBitEncoder type: 'ISO-8859-1' encode: aString) zipped
separates encoding from compression.
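
For comparison, a rough Pharo-side sketch of the same separation (this is not astc code): the character encoder is chosen in one step and gzip is applied in another. It assumes Zinc's ZnCharacterEncoder (newForEncoding:, encodeString:, decodeBytes:) is available in the image and reuses the gzip stream pattern quoted earlier in this thread; the variable names are illustrative.

```smalltalk
| encoder text bytes compressed restored |
"Step 1: pick an encoding. Latin-1 is enough here because the text has no euro sign."
encoder := ZnCharacterEncoder newForEncoding: 'iso-8859-1'.
text := 'Les élèves Françaises'.
bytes := encoder encodeString: text.
"Step 2: compress the bytes; this step knows nothing about the encoding."
compressed := ByteArray streamContents: [ :out |
	(GZipWriteStream on: out) nextPutAll: bytes; close ].
"Reverse both steps in the opposite order."
restored := encoder decodeBytes: (GZipReadStream on: compressed) upToEnd.
```

Swapping the encoder for ZnCharacterEncoder utf8 would leave the compression step untouched.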

On Thu, 10 Oct 2019 at 02:17, Sven Van Caekenberghe <[hidden email]> wrote:

>
> Richard,
>
> Your implementation mixes zipping/unzipping and encoding/decoding, dictating a single way to do so, if I understand it correctly.
>
> The composition of several messages allows end users to choose their own encoding format, depending on their own needs, which I think is more flexible.
>
> Sven
>
> > On 4 Oct 2019, at 06:36, Richard O'Keefe <[hidden email]> wrote:
> >
> > Here's how it would look in my library:
> >   compressed := original zipped.
> >      "There is currently one definition, in AbstractStringOrByteArray,
> >       covering [ReadOnly]ByteArray, [ReadOnly]String and its many subclasses,
> >       ByteBuffer, StringBuffer, Substring, [ReadOnly]ShortArray,
> >       [ReadOnly]MappedByteArray, and some others.  This relies on
> >       _ asByteArraySize and _ asByteArrayDo: _.  There is no need for
> >       a separate #utf8Encoded, that's what asByteArrayDo: *does*."
> >
> >   copy := original class unzip: compressed.
> >      "This is a little trickier, but not hugely so.  There is no need
> >       for special case code.  [ReadOnly][Mapped]ByteArray and ByteBuffer are
> >       sequences of bytes, Stringy things are Unicode, and
> >       [ReadOnly]ShortArrays are treated as UTF16."
> >
> > As far as I can tell, this just works for the original use case.
> >
> > On Fri, 4 Oct 2019 at 11:42, PBKResearch <[hidden email]> wrote:
> >>
> >> Richard
> >>
> >> I don't think so. The case being considered for my problem is the compression of a ByteArray produced by applying #utf8Encoded to a WideString, but it extends to any other form of ByteArray. If you substitute ByteArray for SomeClass in your examples, I think you will see why the chosen interface was used.
> >>
> >> Peter Kenny
> >>
> >>
> >> -----Original Message-----
> >> From: Pharo-users <[hidden email]> On Behalf Of Richard O'Keefe
> >> Sent: 03 October 2019 23:08
> >> To: Any question about pharo is welcome <[hidden email]>
> >> Subject: Re: [Pharo-users] How to zip a WideString
> >>
> >> The interface should surely be
> >>   SomeClass
> >>     methods for: 'compression'
> >>       zipped "return a byte array"
> >>
> >>    class methods for: 'decompression'
> >>      unzip: aByteArray "return an instance of SomeClass"
> >>
> >>
> >
>
>


Re: How to zip a WideString

Sven Van Caekenberghe-2
In reply to this post by Richard O'Keefe


> On 4 Oct 2019, at 06:36, Richard O'Keefe <[hidden email]> wrote:
>
> There is no need for
> a separate #utf8Encoded, that's what asByteArrayDo: *does*."

So #asByteArrayDo: produces UTF-8 bytes, and UTF-8 is an encoding, so the encoding seems to be fixed, IIUC.

But this is the Pharo mailing list and you are referring to a system nobody can see (I asked you about this before), so it is very hard to understand (the context of) what you write.