About #zipped / #unzipped


About #zipped / #unzipped

Sven Van Caekenberghe-2
Hi,

While working on https://pharo.fogbugz.com/f/cases/21858/Cleanup-remaining-DeprecatedFileSystem-users (where we need more help!), I found String>>#zipped to be one of the users of the deprecated RWBinaryOrTextStream. Although this usage is easy enough to fix, I think the current #zipped / #unzipped on String is broken.

(Note also that there are no real users of these methods.)

Right now it seems cool that the following is an identity.

  'foobar' zipped unzipped.

However, the result of zipping something is actually something binary (a collection of opaque bytes). Come to think of it, the input is really bytes as well, not unencoded characters.

Of course, the current methods are broken, as can be seen from a more complex (wide) string.

  'élèves Françaises à 10 €' zipped unzipped. >>> <something very weird>

The error results from some implicit/wrong character encoding being used.

The right way to do this is to explicitly encode/decode the string.

  (GZipReadStream on: (ByteArray streamContents: [ :out |
     (GZipWriteStream on: out)
        nextPutAll: 'élèves Françaises à 10 €' utf8Encoded;
        close ])) upToEnd utf8Decoded.

It follows that #zipped / #unzipped would make more sense on ByteArray, so that the above identity would become:

  'élèves Françaises à 10 €' utf8Encoded zipped unzipped utf8Decoded.

This change of signature would be comparable to what we recently did with #base64Encoded / #base64Decoded.

What do you think?
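
To make the proposal concrete, here is a minimal sketch of what byte-level versions could look like. The two method bodies are an assumption for illustration, not existing code: they reuse only the stream calls from the explicit example above, and they assume that #upToEnd on a GZipReadStream over a ByteArray answers a ByteArray.

```smalltalk
"Hypothetical sketch: #zipped / #unzipped defined on ByteArray instead of String."

ByteArray >> zipped
    "Answer a new ByteArray holding the gzip-compressed bytes of the receiver."
    ^ ByteArray streamContents: [ :out |
        (GZipWriteStream on: out)
            nextPutAll: self;
            close ]

ByteArray >> unzipped
    "Answer a new ByteArray holding the decompressed bytes of the receiver."
    ^ (GZipReadStream on: self) upToEnd
```

With methods like these, the choice of character encoding stays with the caller, who encodes before zipping and decodes after unzipping, so no implicit encoding can sneak in.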

Sven





Re: About #zipped / #unzipped

Marcus Denker-4


> On 18 Oct 2018, at 20:59, Sven Van Caekenberghe <[hidden email]> wrote:
>
> What do you think ?
>

Yes, to me this sounds interesting… I think compression is indeed better done at the level of bytes than on Strings.


        Marcus



Re: About #zipped / #unzipped

Stephane Ducasse-3
+1

On Tue, Oct 23, 2018 at 10:50 AM Marcus Denker <[hidden email]> wrote:

> Yes, to me this sounds interesting… I think compression is indeed better done at the level of bytes than on Strings.
>
>
>         Marcus
>
>


Re: About #zipped / #unzipped

Sven Van Caekenberghe-2
https://github.com/pharo-project/pharo/pull/1947

> On 26 Oct 2018, at 22:28, Stephane Ducasse <[hidden email]> wrote:
>
> +1
>
> On Tue, Oct 23, 2018 at 10:50 AM Marcus Denker <[hidden email]> wrote:
>> Yes, to me this sounds interesting… I think compression is indeed better done at the level of bytes than on Strings.
>>
>>
>>        Marcus
>>
>>
>