Hello

I have a problem with text storage, to which I seem to have found a solution, but it’s a bit clumsy-looking. I would be grateful for confirmation that (a) there is no neater solution, (b) I can rely on this to work – I only know that it works in a few test cases.

I need to store a large number of text strings in a database. To avoid the database files becoming too large, I am thinking of zipping the strings, or at least the less frequently accessed ones. Depending on the source, some of the strings will be instances of ByteString, some of WideString (because they contain characters not representable in one byte). Storing a WideString uncompressed seems to occupy 4 bytes per character, so I decided, before thinking of compression, to store the strings utf8Encoded, which yields a ByteArray. But #zipped can only be applied to a String, not a ByteArray.

So my proposed solution is:

For compression:
    myZipString := myWideString utf8Encoded asString zipped.

For decompression:
    myOutputString := myZipString unzipped asByteArray utf8Decoded.

As I said, it works in all the cases I tried, whether WideString or not, but the chains of transformations look clunky somehow. Can anyone see a neater way of doing it? And can I rely on it working, especially when I am handling foreign texts with many multi-byte characters?

Thanks in advance for any help.

Peter Kenny
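For reference, a minimal workspace sketch of the round trip described above, assuming a Pharo image where Zinc's #utf8Encoded / #utf8Decoded and String>>#zipped are available; the sample string is only an illustration. The final comparison should answer true if the round trip is lossless:

```smalltalk
| original zippedString roundTripped |
"A WideString containing multi-byte characters"
original := 'Grüße – 你好 – 10 €'.
"utf8Encoded gives a ByteArray; asString maps it to a ByteString, which #zipped accepts"
zippedString := original utf8Encoded asString zipped.
"Reverse each step to recover the original WideString"
roundTripped := zippedString unzipped asByteArray utf8Decoded.
roundTripped = original
```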
Hi Peter,
About #zipped / #unzipped and the inflate / deflate classes: your observation is correct, these work from string to string, while clearly the compressed representation should be binary.

The contents (the input, what is inside the compressed data) can be anything; it is not necessarily a string (it could be an image, so also something binary). Only the creator of the compressed data knows, so you cannot assume to know in general.

It would be possible (and it would be very nice) to change this, however that would have serious impact on users (as the contract changes).

About your use case: why would your DB not be capable of storing large strings? A good DB should be capable of storing any kind of string (full Unicode) efficiently. What DB and what sizes are we talking about?

Sven
Peter and Sven,
The zip API from string to string works fine, except that aWideString zipped generates a malformed zip string. I think it might be a good idea to define String>>#zippedWithEncoding: and ByteArray>>#unzippedWithEncoding:, such as:

String>>zippedWithEncoding: encoder
    ^ ByteArray streamContents: [ :stream |
        | gzstream |
        gzstream := GZipWriteStream on: stream.
        encoder
            next: self size
            putAll: self
            startingAt: 1
            toStream: gzstream.
        gzstream close ]

ByteArray>>unzippedWithEncoding: encoder
    | byteStream |
    byteStream := GZipReadStream on: self.
    ^ String streamContents: [ :stream |
        [ byteStream atEnd ] whileFalse: [
            stream nextPut: (encoder nextFromStream: byteStream) ] ]

Then you can write something like:

    zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
    unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.

This will not affect the existing zipped/unzipped API, and you can handle other encodings. This zippedWithEncoding: generates a ByteArray, which is kind of conformant to the encoding API. And you don't have to create many intermediate byte arrays and byte strings.

I hope this helps.
---
tomo
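Because the encoder is a parameter, the same pair of methods could also use a single-byte encoding for purely Western text. A sketch on top of Tomo's proposed (not yet existing) methods, assuming ZnCharacterEncoder latin1 from Zinc; the comparison should answer true:

```smalltalk
| zipped unzipped |
"Compress using Latin-1 instead of UTF-8 (one byte per character here)"
zipped := 'déjà vu' zippedWithEncoding: ZnCharacterEncoder latin1.
"Decompression must use the same encoder the data was written with"
unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder latin1.
unzipped = 'déjà vu'
```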
In reply to this post by Sven Van Caekenberghe-2
Hi Sven
The DB system I am using at the moment is OmniBase - despite Todd Blanchard's warning, I have decided to experiment with it. It has the advantage of being fully object based, though I am not yet using anything more elaborate than strings, dictionaries and arrays as the data types. One secondary advantage is that I still use Dolphin occasionally, and my version of Dolphin 6.1 comes with OmniBase built in. I have checked that an OmniBase DB built in Pharo can be read in Dolphin.

As to size, there is no problem with storing large strings in OmniBase, except for the amount of disk space occupied in total. I am looking far ahead - my toy development DB is only about 15MB, but if I get to where I want to be, it could be tens of GB. With modern machines this may not be a problem, but I thought there might come a time when I want to think about trade-offs between storage space and unzipping time.

I had a few qualms when I looked inside my development DB; it seems that an OmniBase DB consists of a few smallish index files and one ginormous file called 'objects'. I am not sure how the OS will get on with a huge single file. But all this is speculative at the moment. For now I shall continue with storing the strings unzipped (but utf8Encoded - thanks for such a neat facility), bearing in mind that if I need to save space later, my method as described will work.

Peter Kenny
In reply to this post by tomo
Hi Tomo,
Indeed, I stand corrected: it does seem possible to use the existing gzip classes to work from bytes to bytes. This works fine:

    data := ByteArray streamContents: [ :out |
        (GZipWriteStream on: out) nextPutAll: 'foo 10 €' utf8Encoded; close ].

    (GZipReadStream on: data) upToEnd utf8Decoded.

Now regarding the encoding option, I am not so sure that is really necessary (though nice to have). Why would anyone use anything except UTF-8 (today)?

Thanks again for the correction!

Sven
Actually, thinking about this a bit more, why not add #zipped #unzipped to ByteArray ?
ByteArray>>#zipped
    "Return a GZIP compressed version of the receiver as a ByteArray"

    ^ ByteArray streamContents: [ :out |
        (GZipWriteStream on: out) nextPutAll: self; close ]

ByteArray>>#unzipped
    "Assuming the receiver contains GZIP encoded data,
    return the decompressed data as a ByteArray"

    ^ (GZipReadStream on: self) upToEnd

The original one-liner then becomes

    'string' utf8Encoded zipped.

and

    data unzipped utf8Decoded

which is pretty clear, simple and intention-revealing, IMHO.
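Assuming the two ByteArray extension methods proposed above are installed, a quick workspace check of the storage trade-off Peter asked about might look like this (the sample is deliberately repetitive, so actual ratios for real text will differ):

```smalltalk
| text encoded compressed |
"A highly compressible sample: 10000 identical characters"
text := String new: 10000 withAll: $a.
encoded := text utf8Encoded.
compressed := encoded zipped.
"Compare raw vs compressed sizes and verify the round trip"
{ encoded size. compressed size. compressed unzipped utf8Decoded = text }
```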
Sven,
Yes, ByteArray>>#zipped/#unzipped are a simple, neat and intuitive way of zipping/unzipping binary data. I also love the new idioms. They look clean and concise.

Best Regards,
---
tomo
Sven, Tomo
Thanks for this discussion. I shall bear in mind Sven's proposed extension to ByteArray - this is exactly the sort of neater solution I was hoping for. Any chance this might make it into standard Pharo (perhaps in P8)?

Peter Kenny
https://github.com/pharo-project/pharo/issues/4806
PR will follow

> On 3 Oct 2019, at 13:49, PBKResearch <[hidden email]> wrote:
>
> Sven, Tomo
>
> Thanks for this discussion. I shall bear in mind Sven's proposed extension to ByteArray - this is exactly the sort of neater solution I was hoping for. Any chance this might make it into standard Pharo (perhaps in P8)?
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users <[hidden email]> On Behalf Of Tomohiro Oda
> Sent: 03 October 2019 12:22
> To: Any question about pharo is welcome <[hidden email]>
> Subject: Re: [Pharo-users] How to zip a WideString
>
> Sven,
>
> Yes, ByteArray>>zipped/unzipped are a simple, neat and intuitive way of zipping/unzipping binary data.
> I also love the new idioms. They look clean and concise.
>
> Best Regards,
> ---
> tomo
>
> On 3 Oct 2019 at 20:14, Sven Van Caekenberghe <[hidden email]> wrote:
>>
>> Actually, thinking about this a bit more, why not add #zipped #unzipped to ByteArray ?
>>
>> ByteArray>>#zipped
>>     "Return a GZIP compressed version of the receiver as a ByteArray"
>>
>>     ^ ByteArray streamContents: [ :out |
>>         (GZipWriteStream on: out) nextPutAll: self; close ]
>>
>> ByteArray>>#unzipped
>>     "Assuming the receiver contains GZIP encoded data,
>>     return the decompressed data as a ByteArray"
>>
>>     ^ (GZipReadStream on: self) upToEnd
>>
>> The original one-liner then becomes
>>
>>     'string' utf8Encoded zipped.
>>
>> and
>>
>>     data unzipped utf8Decoded
>>
>> which is pretty clear, simple and intention-revealing, IMHO.
>>
>>> On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <[hidden email]> wrote:
>>>
>>> Hi Tomo,
>>>
>>> Indeed, I stand corrected, it does indeed seem possible to use the existing gzip classes to work from bytes to bytes; this works fine:
>>>
>>>     data := ByteArray streamContents: [ :out |
>>>         (GZipWriteStream on: out) nextPutAll: 'foo 10 €' utf8Encoded; close ].
>>>
>>>     (GZipReadStream on: data) upToEnd utf8Decoded.
>>> Now regarding the encoding option, I am not so sure that is really necessary (though nice to have). Why would anyone use anything except UTF8 (today)?
>>>
>>> Thanks again for the correction !
>>>
>>> Sven
>>>
>>>> On 3 Oct 2019, at 12:41, Tomohiro Oda <[hidden email]> wrote:
>>>>
>>>> Peter and Sven,
>>>>
>>>> The zip API from string to string works fine, except that aWideString zipped generates a malformed zip string.
>>>> I think it might be a good guidance to define String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding: . Such as
>>>>
>>>> String>>zippedWithEncoding: encoder
>>>>     ^ ByteArray
>>>>         streamContents: [ :stream |
>>>>             | gzstream |
>>>>             gzstream := GZipWriteStream on: stream.
>>>>             encoder
>>>>                 next: self size
>>>>                 putAll: self
>>>>                 startingAt: 1
>>>>                 toStream: gzstream.
>>>>             gzstream close ]
>>>>
>>>> and
>>>>
>>>> ByteArray>>unzippedWithEncoding: encoder
>>>>     | byteStream |
>>>>     byteStream := GZipReadStream on: self.
>>>>     ^ String
>>>>         streamContents: [ :stream |
>>>>             [ byteStream atEnd ]
>>>>                 whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
>>>>
>>>> Then, you can write something like
>>>>
>>>>     zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
>>>>     unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>>>>
>>>> This will not affect the existing zipped/unzipped API and you can handle other encodings.
>>>> This zippedWithEncoding: generates a ByteArray, which is kind of conformant to the encoding API.
>>>> And you don't have to create many intermediate byte arrays and byte strings.
>>>>
>>>> I hope this helps.
>>>> ---
>>>> tomo
|
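The round trip Sven proposes (encode to UTF-8 bytes, compress the bytes, then reverse both steps) is not Pharo-specific. As a cross-language illustration only, the same contract can be sketched in Python, with the stdlib gzip module standing in for GZipWriteStream/GZipReadStream:

```python
import gzip

text = "Les élèves Françaises ont 100 €"   # contains multi-byte characters

# Encode to UTF-8 bytes first; compression operates on bytes, not characters.
data = gzip.compress(text.encode("utf-8"))

# The compressed form is binary; it need not be valid text in any encoding.
assert isinstance(data, bytes)

# Decompress back to bytes, then decode UTF-8 to recover the string.
roundtrip = gzip.decompress(data).decode("utf-8")
assert roundtrip == text
```

The point is the same as in the ByteArray proposal: the compressed value is binary, and only the encode/decode steps know anything about characters.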
https://github.com/pharo-project/pharo/pull/4812
|
Thanks Sven. Just 5 hours from when I raised the question, there is a solution in place for everyone. This group is amazing!
|
Peter Kenny wrote
> Just 5 hours from when I raised the question, there is a solution in place
> for everyone. This group is amazing!

Indeed. Bravo, Sven and all of our other contributors.

I can't resist mentioning that the fix would almost certainly have taken significantly longer a few years ago due to the awkward issue/contribution infrastructure. Esteban and others' tireless work to get git/GH support up and running is at the heart of all these quick turnaround stories. Keep going!!

p.s. it takes serious courage to port the issue tracker of a project this size and reach, twice in a few years

-----
Cheers,
Sean
--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
> On 3 Oct 2019, at 18:25, Sean P. DeNigris <[hidden email]> wrote:
>
> Peter Kenny wrote
>> Just 5 hours from when I raised the question, there is a solution in place
>> for everyone. This group is amazing!
>
> Indeed. Bravo, Sven and all of our other contributors.

Thanks, but I must say that it was Tomo's snippet that showed me how to use the gzip streams binary to binary, as I thought that was not possible. And we need users' questions to start acting, so thank you Peter as well.

> I can't resist mentioning that the fix would almost certainly have taken
> significantly longer a few years ago due to the awkward issue/contribution
> infrastructure. Esteban and others' tireless work to get git/GH support up
> and running are at the heart of all these quick turnaround stories. Keep going!!
>
> p.s. it takes serious courage to port the issue tracker of a project this
> size and reach twice in a few years

Indeed, it had been a while since I did a PR and I must say it went super smooth. Also, Pharo 8 feels quite snappy.
|
In reply to this post by tomo
The interface should surely be
SomeClass
    methods for: 'compression'
        zipped
            "return a byte array"

    class methods for: 'decompression'
        unzip: aByteArray
            "return an instance of SomeClass"
|
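Richard's proposed interface puts decompression on the class side, so the class that created the data decides how to interpret the decompressed bytes. A hypothetical sketch of that shape in Python (the Text class and its methods are invented here purely for illustration, not part of any library):

```python
import gzip

class Text:
    """Stands in for one of Richard's 'SomeClass' examples."""

    def __init__(self, value: str):
        self.value = value

    def zipped(self) -> bytes:
        # Instance side: a Text knows its own byte representation (UTF-8 here).
        return gzip.compress(self.value.encode("utf-8"))

    @classmethod
    def unzip(cls, data: bytes) -> "Text":
        # Class side: the class, as creator of the data, knows how to
        # decode the decompressed bytes back into an instance.
        return cls(gzip.decompress(data).decode("utf-8"))

original = Text("été - 100 €")
copy = Text.unzip(original.zipped())
assert copy.value == original.value
```

A ByteArray-like class would implement unzip as the identity on the decompressed bytes, which is the case Peter raises next.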
Richard
I don't think so. The case being considered for my problem is the compression of a ByteArray produced by applying #utf8Encoded to a WideString, but it extends to any other form of ByteArray. If you substitute ByteArray for SomeClass in your examples, I think you will see why the chosen interface was used.

Peter Kenny
|
Here's how it would look in my library:
compressed := original zipped.
    "There is currently one definition, in AbstractStringOrByteArray,
     covering [ReadOnly]ByteArray, [ReadOnly]String and its many subclasses,
     ByteBuffer, StringBuffer, Substring, [ReadOnly]ShortArray,
     [ReadOnly]MappedByteArray, and some others. This relies on
     _ asByteArraySize and _ asByteArrayDo: _. There is no need for
     a separate #utf8Encoded, that's what asByteArrayDo: *does*."

copy := original class unzip: compressed.
    "This is a little trickier, but not hugely so. There is no need
     for special case code. [ReadOnly][Mapped]ByteArray and ByteBuffer
     are sequences of bytes, Stringy things are Unicode, and
     [ReadOnly]ShortArrays are treated as UTF16."

As far as I can tell, this just works for the original use case.
|
Richard,
Your implementation mixes zipping/unzipping and encoding/decoding, dictating a single way to do so, if I understand it correctly.

The composition with several messages allows end users to choose their own encoding format, depending on their own needs, which I think is more flexible.

Sven
|
In reply to this post by tomo
Actually, thinking about the original use case, I now feel that it would be best to remove #zipped/unzipped from String.
The original problem was that 'Les élèves Françaises ont 100 €' zipped unzipped. does not work (it fails on WideStrings), while we now have 'Les élèves Françaises ont 100 €' utf8Encoded zipped unzipped utf8Decoded. which I would consider better form/style. The original is also very confusing, since the result of zipping is not a string but binary. > On 3 Oct 2019, at 13:21, Tomohiro Oda <[hidden email]> wrote: > > Sven, > > Yes, ByteArray>>zipped/unzipped are simple, neat and intuitive way of > zipping/unzipping binary data. > I also love the new idioms. They look clean and concise. > > Best Regards, > --- > tomo > > 2019年10月3日(木) 20:14 Sven Van Caekenberghe <[hidden email]>: >> >> Actually, thinking about this a bit more, why not add #zipped #unzipped to ByteArray ? >> >> >> ByteArray>>#zipped >> "Return a GZIP compressed version of the receiver as a ByteArray" >> >> ^ ByteArray streamContents: [ :out | >> (GZipWriteStream on: out) nextPutAll: self; close ] >> >> ByteArray>>#unzipped >> "Assuming the receiver contains GZIP encoded data, >> return the decompressed data as a ByteArray" >> >> ^ (GZipReadStream on: self) upToEnd >> >> >> The original oneliner then becomes >> >> 'string' utf8Encoded zipped. >> >> and >> >> data unzipped utf8Decoded >> >> which is pretty clear, simple and intention-revealing, IMHO. >> >>> On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <[hidden email]> wrote: >>> >>> Hi Tomo, >>> >>> Indeed, I stand corrected, it does indeed seem possible to use the existing gzip classes to work from bytes to bytes, this works fine: >>> >>> data := ByteArray streamContents: [ :out | (GZipWriteStream on: out) nextPutAll: 'foo 10 €' utf8Encoded; close ]. >>> >>> (GZipReadStream on: data) upToEnd utf8Decoded. >>> >>> Now regarding the encoding option, I am not so sure that is really necessary (though nice to have). Why would anyone use anything except UTF8 (today). >>> >>> Thanks again for the correction ! 
>>> Sven
>>>
>>>> On 3 Oct 2019, at 12:41, Tomohiro Oda <[hidden email]> wrote:
>>>>
>>>> Peter and Sven,
>>>>
>>>> The zip API from string to string works fine, except that aWideString zipped
>>>> generates a malformed zip string.
>>>> I think it might be good guidance to define
>>>> String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:, such as:
>>>>
>>>> String>>zippedWithEncoding: encoder
>>>>     ^ ByteArray
>>>>         streamContents: [ :stream |
>>>>             | gzstream |
>>>>             gzstream := GZipWriteStream on: stream.
>>>>             encoder
>>>>                 next: self size
>>>>                 putAll: self
>>>>                 startingAt: 1
>>>>                 toStream: gzstream.
>>>>             gzstream close ]
>>>>
>>>> and ByteArray>>unzippedWithEncoding: encoder
>>>>     | byteStream |
>>>>     byteStream := GZipReadStream on: self.
>>>>     ^ String
>>>>         streamContents: [ :stream |
>>>>             [ byteStream atEnd ]
>>>>                 whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
>>>>
>>>> Then, you can write something like:
>>>>
>>>> zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
>>>> unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>>>>
>>>> This will not affect the existing zipped/unzipped API, and you can handle other encodings.
>>>> This zippedWithEncoding: generates a ByteArray, which conforms to the encoding API.
>>>> And you don't have to create many intermediate byte arrays and byte strings.
>>>>
>>>> I hope this helps.
>>>> ---
>>>> tomo
>>>>
>>>> 2019/10/3(Thu) 18:56 Sven Van Caekenberghe <[hidden email]>:
>>>>>
>>>>> [Sven's reply of 3 Oct and Peter's original message, both quoted in full earlier in the thread, elided] |
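For readers outside Pharo, Tomo's encode-then-compress pipeline can be sketched in Python as a rough analogue (the function names are illustrative, not part of the thread's Smalltalk code): characters become bytes via the chosen encoding, the bytes are gzipped, and decompression reverses both steps.

```python
import gzip

def zipped_with_encoding(text: str, encoding: str = "utf-8") -> bytes:
    # Encode first (characters -> bytes), then compress (bytes -> bytes).
    return gzip.compress(text.encode(encoding))

def unzipped_with_encoding(data: bytes, encoding: str = "utf-8") -> str:
    # Decompress first (bytes -> bytes), then decode (bytes -> characters).
    return gzip.decompress(data).decode(encoding)

# Round-trip a string with multi-byte characters (the "WideString" case).
original = "Grüße, 世界"
packed = zipped_with_encoding(original)
assert isinstance(packed, bytes)  # the compressed form is binary, not a string
assert unzipped_with_encoding(packed) == original
```

As in Tomo's version, the compressed value is a byte array rather than a string, and the encoder is a parameter, so other encodings can be plugged in without touching the compression step.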
One of the key concepts in astc's text handling is that all strings use the same encoding. So no, my system doesn't mix encoding/decoding with compression/decompression; encoding and decoding are completely out of scope. There is a *transformation format* issue, but not an encoding issue. For example, having got aByteArray from somewhere,

    aString := SCSUDecoder decode: (ByteArray unzip: aByteArray)

uncompresses in one step (which knows nothing about encodings) and decodes in another (which is admittedly the Simple Compression Scheme for Unicode, but knows nothing of FLATE compression). Similarly,

    aByteArray := (EightBitEncoder type: 'ISO-8859-1' encode: aString) zipped

separates encoding from compression.

On Thu, 10 Oct 2019 at 02:17, Sven Van Caekenberghe <[hidden email]> wrote:
>
> Richard,
>
> Your implementation mixes zipping/unzipping and encoding/decoding, dictating a single way to do so, if I understand it correctly.
>
> The composition with several messages allows end users to choose their own encoding format, depending on their own needs, which I think is more flexible.
>
> Sven
>
> > On 4 Oct 2019, at 06:36, Richard O'Keefe <[hidden email]> wrote:
> >
> > Here's how it would look in my library:
> >
> > compressed := original zipped.
> > "There is currently one definition, in AbstractStringOrByteArray, covering
> > [ReadOnly]ByteArray, [ReadOnly]String and its many subclasses, ByteBuffer,
> > StringBuffer, Substring, [ReadOnly]ShortArray, [ReadOnly]MappedByteArray,
> > and some others. This relies on _ asByteArraySize and _ asByteArrayDo: _.
> > There is no need for a separate #utf8Encoded, that's what asByteArrayDo: *does*."
> >
> > copy := original class unzip: compressed.
> > "This is a little trickier, but not hugely so. There is no need for
> > special-case code. [ReadOnly][Mapped]ByteArray and ByteBuffer are
> > sequences of bytes, stringy things are Unicode, and [ReadOnly]ShortArrays
> > are treated as UTF16."
> >
> > As far as I can tell, this just works for the original use case.
> >
> > On Fri, 4 Oct 2019 at 11:42, PBKResearch <[hidden email]> wrote:
> >>
> >> Richard
> >>
> >> I don't think so. The case being considered for my problem is the compression of a ByteArray produced by applying #utf8Encoded to a WideString, but it extends to any other form of ByteArray. If you substitute ByteArray for SomeClass in your examples, I think you will see why the chosen interface was used.
> >>
> >> Peter Kenny
> >>
> >> -----Original Message-----
> >> From: Pharo-users <[hidden email]> On Behalf Of Richard O'Keefe
> >> Sent: 03 October 2019 23:08
> >> To: Any question about pharo is welcome <[hidden email]>
> >> Subject: Re: [Pharo-users] How to zip a WideString
> >>
> >> The interface should surely be
> >>
> >> SomeClass
> >>     methods for: 'compression'
> >>         zipped "return a byte array"
> >>
> >>     class methods for: 'decompression'
> >>         unzip: aByteArray "return an instance of SomeClass" |
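Richard's proposed interface keeps compression strictly byte-oriented (instance-side `zipped` returns bytes) and dispatches decompression on the target class (class-side `unzip:` decides how to interpret the bytes). A hedged Python sketch of that shape, with all class and method names invented for illustration:

```python
import zlib

class Zippable:
    """Sketch of the proposed interface: instance-side zipped, class-side unzip."""

    def to_bytes(self) -> bytes:          # analogue of asByteArrayDo:
        raise NotImplementedError

    @classmethod
    def from_bytes(cls, data: bytes):
        raise NotImplementedError

    def zipped(self) -> bytes:
        # Compression only ever sees bytes; it knows nothing about encodings.
        return zlib.compress(self.to_bytes())

    @classmethod
    def unzip(cls, data: bytes):
        # The receiving class decides how to interpret the decompressed bytes.
        return cls.from_bytes(zlib.decompress(data))

class UnicodeText(Zippable):
    def __init__(self, s: str):
        self.s = s

    def to_bytes(self) -> bytes:
        return self.s.encode("utf-8")

    @classmethod
    def from_bytes(cls, data: bytes):
        return cls(data.decode("utf-8"))

# "copy := original class unzip: compressed" becomes:
original = UnicodeText("naïve café, 日本語")
copy = UnicodeText.unzip(original.zipped())
assert copy.s == original.s
```

This illustrates the design trade-off Sven raises: the conversion to bytes (here hard-coded as UTF-8 in `to_bytes`) is fixed per class rather than chosen per call.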
> On 4 Oct 2019, at 06:36, Richard O'Keefe <[hidden email]> wrote:
>
> There is no need for a separate #utf8Encoded, that's what asByteArrayDo: *does*."

So #asByteArrayDo: produces UTF-8 bytes, which is an encoding, so it seems fixed, IIUC.

But this is the Pharo mailing list and you are referring to a system nobody can see (I asked you this before), so it is very hard to understand (the context of) what you write. |
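Sven's point is that baking one encoding into the byte-conversion step fixes it for every caller, whereas composing separate messages lets each caller choose. A small Python illustration (hypothetical, not Pharo code) of why the choice matters: the same characters yield different byte sequences, and hence different compressed payloads, under different encodings.

```python
import zlib

text = "Grüße"

# Same characters, two different byte representations.
latin1_payload = zlib.compress(text.encode("latin-1"))
utf8_payload = zlib.compress(text.encode("utf-8"))

# A fixed encoding in the API would silently pick one of these for every caller.
assert latin1_payload != utf8_payload

# Each payload round-trips only with the encoding it was produced with.
assert zlib.decompress(latin1_payload).decode("latin-1") == text
assert zlib.decompress(utf8_payload).decode("utf-8") == text
```

Nothing in the compressed bytes records which encoding was used; only the producer knows, which is the same contract problem Sven describes for #zipped earlier in the thread.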