Default String encoding when storing to file?

Henrik Sperre Johansen
Hi, I just did a small test of a faster storeOn: for ByteStrings (using
next:putAll:startingAt: for sequences not containing quotes, instead of
nextPut: for each character), and ran into a weird quirk.
When writing to disk (on Windows), the old storeOn: stores the ø's in
the string below as F8, while storeOn2: stores them as C3 B8.
(I haven't looked at exactly why they're saved differently.)
As far as I can tell from a Google search, F8 is ø in a single-byte
encoding such as Latin-1 (ISO 8859-1), not ASCII, which only covers
00-7F, and C3 B8 is ø in UTF-8.
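
For reference, the two byte sequences can be reproduced by hand from the code point of ø. This is a minimal workspace sketch, assuming ø is code point 248 (16rF8); it uses only integer arithmetic, so it does not depend on any particular converter class, and the variable names are purely illustrative:

```smalltalk
| codePoint leadByte continuationByte |
codePoint := 248.
"ø is 16rF8; a single-byte encoding such as Latin-1 stores this value as is."

"UTF-8 spreads code points 128..2047 over two bytes: 110xxxxx 10xxxxxx."
leadByte := 2r11000000 + (codePoint bitShift: -6).
continuationByte := 2r10000000 + (codePoint bitAnd: 2r00111111).

"Print it: the two bytes in hex."
(leadByte printStringBase: 16), ' ', (continuationByte printStringBase: 16)
```

Printing the last expression should give 'C3 B8', which matches what storeOn2: leaves in the file.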

Could this be a cause for the "invalid UTF8 character"-errors we've been
seeing sporadically?

Attached a changeset with storeOn2: , below is the workspace I used.

Cheers,
Henry

str :=
'asdfasdfadfadfrgjn''fgoibocbxlgjsrgoihrgohgfn''cx,bmxnbøzghøfhzødfhxcvnzdfljfoaurgaorhr8htg0ae8gofhef08hasovhdfhøxo''vh89ah4f9aw8hf'.

str storeOn: (FileStream newFileNamed: 'test22.txt').
Time millisecondsToRun: [|fs|
fs := FileStream oldFileNamed: 'test22.txt'.
50000 timesRepeat: [str storeOn: fs].
fs close]. 23494 22869

str storeOn2: (FileStream newFileNamed: 'test2.txt').
Time millisecondsToRun: [|fs|
fs := FileStream oldFileNamed: 'test2.txt'.
50000 timesRepeat: [str storeOn2: fs].
fs close]. 1303  1383

'From Pharo1.0beta of 16 May 2008 [Latest update: #10470] on 12 October 2009 at 2:31:08 pm'!

!ByteString methodsFor: 'printing' stamp: 'HenrikSperreJohansen 10/12/2009 14:30'!
storeOn2: aStream
        "Print inside string quotes, doubling embedded quotes."
        | ix startIx|
        aStream nextPut: $'.
        startIx := 1.
        [(ix := self indexOf: $' startingAt: startIx) > 0 ] whileTrue: [
                aStream next: ix +1 - startIx putAll: self startingAt: startIx.
                aStream nextPut: $'.
                startIx := ix +1].
       
        aStream next: self size +1 - startIx putAll: self startingAt: startIx.
        aStream nextPut: $'.
! !

!ByteString methodsFor: 'printing' stamp: 'HenrikSperreJohansen 10/12/2009 14:30'!
storeString2
        "Answer a String representation of the receiver from which the receiver
        can be reconstructed."
        |result ws |
        result := String new: self size +2.
        ws := result writeStream.
        self storeOn2: ws.
        ^ws position = result size ifTrue: [result] ifFalse: [ws contents]
! !
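
A quick workspace check of the changeset above, assuming it has been filed in: storeString2 should answer the same string as the stock storeString, with each embedded quote doubled. The sample string is made up for illustration.

```smalltalk
| sample |
sample := 'it''s'.
"The receiver has 4 characters: i t ' s"

sample storeString size.
"7: the two enclosing quotes plus the doubled embedded quote"

sample storeString2 = sample storeString
"Expected to be true if storeOn2: preserves the output of storeOn:"
```

This only compares the in-image result; it says nothing about which bytes end up on disk when the same method writes to a FileStream.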


_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Re: Default String encoding when storing to file?

Mariano Martinez Peck


2009/10/12 Henrik Johansen <[hidden email]>

WOW!!! I ran it here on a Windows XP machine with 1 GB RAM, and these are the results:

str storeOn: (FileStream newFileNamed: 'test22.txt').
Time millisecondsToRun: [|fs|
fs := FileStream oldFileNamed: 'test22.txt'.
50000 timesRepeat: [str storeOn: fs].
fs close]. 84022

str storeOn2: (FileStream newFileNamed: 'test2.txt').
Time millisecondsToRun: [|fs|
fs := FileStream oldFileNamed: 'test2.txt'.
50000 timesRepeat: [str storeOn2: fs].
fs close]. 4990

BIG DIFFERENCE

Re: Default String encoding when storing to file?

Henrik Sperre Johansen
Mariano Martinez Peck wrote:

>
> WOW!!! I run it here in a Windows XP 1GB RAM and these are the results:
>
> str storeOn: (FileStream newFileNamed: 'test22.txt').
> Time millisecondsToRun: [|fs|
> fs := FileStream oldFileNamed: 'test22.txt'.
> 50000 timesRepeat: [str storeOn: fs].
> fs close]. 84022
>
> str storeOn2: (FileStream newFileNamed: 'test2.txt').
> Time millisecondsToRun: [|fs|
> fs := FileStream oldFileNamed: 'test2.txt'.
> 50000 timesRepeat: [str storeOn2: fs].
> fs close]. 4990
>
> BIG DIFFERENCE
>
Yeah, writing one char at a time (plus, when doing that, the broken
nextPut: primitive) isn't exactly performant :P
Btw, it works fine for WideStrings too when using storeOn: internally
(in printString, for example), but not when writing them to file. :/
Plus, as I noted in the original post, the behaviour of storeOn: is not
100% preserved (a different encoding is used for ø).

Cheers,
Henry

Re: Default String encoding when storing to file?

Stéphane Ducasse
So, what would be the next step? :)

Stef

On Oct 13, 2009, at 3:17 PM, Henrik Johansen wrote:

> Yeah, writing one char at a time (plus, when doing that, the broken
> nextPut: primitive) isn't exactly performant :P
> Btw, it works fine for Widestrings too when using storeOn: internally
> (in printString f.ex.), but not when writing them to file. :/
> Plus, as I indicated, behaviour from storeOn: is not 100% preserved
> (different encoding used for ø), as I noted in the original post.
>
> Cheers,
> Henry

Re: Default String encoding when storing to file?

Henrik Sperre Johansen
Someone with a clue answering my original question of whether the
encoding when saving to file is wrong or not :)

Alternatively, someone with VM knowledge could look into why the nextPut:
primitive is broken...
(And no, it's not just that the stream hasn't been placed in the cache yet,
or that the collection is full. Placing a counter in nextPut: and printing it:
|string ws|
Smalltalk at: #NextPutPrimitiveFails put: 0.
string := String new: 500000.
ws := string writeStream.
500000 timesRepeat: [ws nextPut: $a].
NextPutPrimitiveFails. 500000
gives one failure per call, for all 500000 calls.)

Cheers,
Henry
