[bug] UnicodeString conversion truncation


[bug] UnicodeString conversion truncation

Robin Redeker-2
Issue status update for
http://smalltalk.gnu.org/node/108
Post a follow up:
http://smalltalk.gnu.org/project/comments/add/108

 Project:      GNU Smalltalk
 Version:      <none>
 Component:    Base classes
 Category:     bug reports
 Priority:     normal
 Assigned to:  Unassigned
 Reported by:  elmex
 Updated by:   elmex
 Status:       active

There seems to be another bug with UnicodeStrings.
Assume the following program:

Eval [
 PackageLoader fileInPackage: #I18N.
 a := WriteStream on: (Array new).
 1 to: 100 do: [:nr| a nextPut: nr].
 str := (a contents collect: [:nr| nr printString]) join: ','.
 str displayNl.
 str2 := str asUnicodeString.
 "str2 asString displayNl."   "<- gives a backtrace"
 str3 := '', str2.
 str3 displayNl.
]

It prints two lines of numbers; the second line
is truncated right after 88, like this: "...,88,8"

When I uncomment the commented line in the code snippet above, I get a
backtrace:

Object: Iconv new "<0x2ac0d49dba00>" error: incomplete input sequence
I18N.IncompleteSequenceError(Exception)>>signal
I18N.IncompleteSequenceError class(Exception class)>>signal
I18N.Encoders.Iconv>>convertMore
I18N.Encoders.Iconv>>atEnd
[] in I18N.Encoders.Iconv(Stream)>>do:
BlockClosure>>on:do:
I18N.Encoders.Iconv(Stream)>>do:
I18N.Encoders.Iconv(Stream)>>contents
UnicodeString>>asString
UndefinedObject>>executeStatements





Re: [bug] UnicodeString conversion truncation

Paolo Bonzini
Issue status update for
http://smalltalk.gnu.org/project/issue/108
Post a follow up:
http://smalltalk.gnu.org/project/comments/add/108

 Project:      GNU Smalltalk
 Version:      <none>
 Component:    Base classes
 Category:     bug reports
 Priority:     normal
 Assigned to:  Unassigned
 Reported by:  elmex
 Updated by:   bonzinip
-Status:       active
+Status:       fixed
 Attachment:   http://smalltalk.gnu.org/files/issues/gst-iconv-more.patch (3.71 KB)

You opened a can of half a dozen different off-by-one and similar bugs.
:-)  All fixed in the attached patch.





Re: Re: [bug] UnicodeString conversion truncation

Robin Redeker-2
On Sun, Oct 21, 2007 at 08:51:24AM -0700, Paolo Bonzini wrote:

> Issue status update for
> http://smalltalk.gnu.org/project/issue/108
> Post a follow up:
> http://smalltalk.gnu.org/project/comments/add/108
>
> Project:      GNU Smalltalk
> Version:      <none>
> Component:    Base classes
> Category:     bug reports
> Priority:     normal
> Assigned to:  Unassigned
> Reported by:  elmex
> Updated by:   bonzinip
> -Status:       active
> +Status:       fixed
> Attachment:   http://smalltalk.gnu.org/files/issues/gst-iconv-more.patch 
> (3.71 KB)
>
> You opened a can of half a dozen different off-by-one and similar bugs.
> :-)  All fixed in the attached patch

I tested the patch and everything seems to work now.
But I've found this code in json.st, which puzzled me a bit:

   String>>#jsonPrintOn:
      (self anySatisfy: [ :ch | ch value between: 128 and: 255 ])
             ifTrue: [ self asUnicodeString jsonPrintOn: aStream ]
             ifFalse: [ super jsonPrintOn: aStream ]

Why print strings that have non-ASCII chars differently?
And this in the string parsing code:

            c = $u
               ifTrue: [
                  c := (Integer readFrom: (stream next: 4) readStream radix: 16) asCharacter.
                  (c class == UnicodeCharacter and: [ str species == String ])
                     ifTrue: [ str := (UnicodeString new writeStream
                        nextPutAll: str contents; yourself) ] ].
         ].
      str nextPut: c.

Maybe I don't understand the Unicode implementation of GNU Smalltalk well enough.

Would you object if I change the json code to operate on UnicodeStrings only?

Strictly and semantically, the JSON implementation should only operate on
UnicodeStrings, as JSON is only parseable as Unicode. (I wonder what happens
when the current JSON reader encounters a UTF-16 encoded String; as far as my
test went, it just didn't work, because it doesn't expect multibyte encodings
in a String.)

What puzzles me is the question of what JSONReader>>#nextJSONString should
return. Should it be a String or a UnicodeString?

If it returns a UnicodeString, no literal string access on a Dictionary
returned by the JSON parser will work, because the lookup would use a String
object, which has a different hash function than UnicodeString.


Robin



Re: [bug] UnicodeString conversion truncation

Paolo Bonzini-2

>    String>>#jsonPrintOn:
>       (self anySatisfy: [ :ch | ch value between: 128 and: 255 ])
>              ifTrue: [ self asUnicodeString jsonPrintOn: aStream ]
>              ifFalse: [ super jsonPrintOn: aStream ]
>
> Why print strings that have non-ascii chars differently?

Because, say, a UTF-8-encoded string containing the characters 195 and
160 should print as "\u00E0", not as "à" (that's a lowercase accented
'a').  The easiest way to convert the two bytes to a single character is
with #asUnicodeString; in GNU Smalltalk, Strings are bytes and
UnicodeStrings are characters.
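
Untested, but something along these lines should make the byte/character
distinction concrete (it assumes the I18N package and a UTF-8 locale):

     Eval [ | s |
        PackageLoader fileInPackage: #I18N.
        s := String with: 195 asCharacter with: 160 asCharacter.   "the UTF-8 bytes of à"
        s size printNl.                   "prints 2: a String counts bytes"
        s asUnicodeString size printNl    "prints 1: a UnicodeString counts characters"
     ]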

Actually, to support ISO-2022-JP and similar encodings (which use a
sequence introduced by ESC to switch between Latin and double-byte
characters), one of us should probably change jsonPrintOn: to use

     (self allSatisfy: [ :ch | ch value between: 32 and: 126 ])
        ifFalse: [ self asUnicodeString jsonPrintOn: aStream ]
        ifTrue: [ super jsonPrintOn: aStream ]

(Note that you can safely skip this aside: even the above, unfortunately,
would not cater for UTF-7.  You can skip it because UTF-7 is terminally
broken, and all you should do with UTF-7 is convert it to a saner encoding
as soon as you read something in it.)

> And this in the string parsing code:
>
>             c = $u
>                ifTrue: [
>                   c := (Integer readFrom: (stream next: 4) readStream radix: 16) asCharacter.
>                   (c class == UnicodeCharacter and: [ str species == String ])
>                      ifTrue: [ str := (UnicodeString new writeStream
>                         nextPutAll: str contents; yourself) ] ].
>          ].
>       str nextPut: c.

What it does now is to operate on UnicodeStrings if it considers it
necessary; if there are no \uXXXX escapes, it uses String because valid
JSON only has 7-bit characters in strings.

> Would you object if I change the json code to operate on UnicodeStrings only?

I would like to understand why you need this, but no, I would not object,
especially because I consider JSON your code, not mine.  I just helped a
bit.  :-)

I think you wouldn't be able to operate on UnicodeStrings only, unless I
fix the bug with String/UnicodeString hashes (see below).

I don't know if after the explanation above you still want JSON to
operate on UnicodeStrings only.

> Strictly and semantically the JSON implementation should only operate on UnicodeStrings
> as JSON is only parseable in Unicode. (I wonder what happens with the current JSON reader
> when it encounters a utf-16 encoded String, as far as my test went, it just didn't
> work because it doesn't expect multibyte encodings in String).

JSON is not supposed to include non-Latin-1 characters.  Everything
that's not 7-bit encodable should be escaped using \uXXXX.

> What puzzles me is the question what JSONReader>>#nextJSONString should
> return. Should it be a String or a UnicodeString?

Strictly speaking it should return a UnicodeString, but it's easier to
use, and faster, if (when it's possible) we let it return a String.
Switching to UnicodeStrings as soon as we find a \uXXXX is a
conservative approximation of "when it's possible".

Probably, what is missing from GNU Smalltalk's Iconv package is an
"Encoding" object that can answer queries like "is this string pure
ASCII?", the default very slow implementation being something like this:

     str := self asString.
     uniStr := self asUnicodeString.
     str size = uniStr size ifFalse: [ ^false ].
     str with: uniStr do: [ :ch :uni |
         ch value = uni codePoint ifFalse: [ ^false ] ].
     ^true

This snippet would provide a more rigorous definition of "when it's
possible".

> If it returns UnicodeString no literal string access on a Dictionary returned by
> the JSON parser will work as it would get only a String object which has a different
> hash function than UnicodeString.

Hmmm, this has to be fixed.

Paolo




Re: Re: [bug] UnicodeString conversion truncation

Robin Redeker-2
On Mon, Oct 22, 2007 at 10:57:10AM +0200, Paolo Bonzini wrote:
>
[.snip.]
> >Would you object if I change the json code to operate on UnicodeStrings
> >only?
>
> I would like to understand why you need this, but no, I would not object
> especially because I consider JSON your code, not mine.  I just helped a
> bit.  :-)

Heh, ok. I just wanted to hear other people's thoughts about this :)

> I think you wouldn't be able to operate on UnicodeStrings only, unless I
> fix the bug with String/UnicodeString hashes (see below).
>
> I don't know if after the explanation above you still want JSON to
> operate on UnicodeStrings only.

Because a JSON parser can only process characters, not bytes of some
multibyte encoding. As far as I understood, '(ReadStream on: aString)
next' will return a Character in the range 0 to: 255, which represents
a byte of the multibyte encoding of the string.

Or am I wrong, and will String>>#next sometimes return a UnicodeCharacter?

> >Stricly and semantically the JSON implementation should only operate on
> >UnicodeStrings
> >as JSON is only parseable in Unicode. (I wonder what happens with the
> >current JSON reader
> >when it encounters a utf-16 encoded String, as far as my test went, it
> >just didn't
> >work because it doesn't expect multibyte encodings in String).
>
> JSON is not supposed to include non-Latin-1 characters.  Everything
> that's not 7-bit encodable should be escaped using \uXXXX.

I must object; the JSON RFC ( http://www.ietf.org/rfc/rfc4627.txt ) says:

   "JavaScript Object Notation (JSON) is a text format for the serialization
   of structured data."
And:
   "A string is a sequence of zero or more Unicode characters [UNICODE]."
And:
   "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

Also the whole grammar/BNF is defined in terms of Unicode characters.
The \uXXXX is allowed in strings for convenience.

   "Any character may be escaped."

The JSON parser has no choice but to operate on Unicode characters.
Parsing a UTF-16 encoded JSON text byte-wise will just not work :)
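
Untested, but a small experiment shows what a byte-wise reader sees when
given UTF-16 (here the UTF-16BE bytes of '{}', without a BOM):

     Eval [ | utf16 |
        utf16 := String with: 0 asCharacter with: ${
                        with: 0 asCharacter with: $}.
        (utf16 readStream next = ${) printNl   "prints false: a NUL byte comes first"
     ]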

> >What puzzles me is the question what JSONReader>>#nextJSONString should
> >return. Should it be a String or a UnicodeString?
>
> Strictly speaking it should return a UnicodeString, but it's easier to
> use it, and faster, if (when it's possible) we let it return a String.
> Switching to UnicodeStrings as soon as we find a \uXXXX is a
> conservative approximation of "when it's possible".

I guess it depends on how compatible String and UnicodeString are. I can't
assume anything about the strings returned by the implementation, and usually
I would have to convert the string to a UnicodeString anyway, I guess.

If you are concerned about the memory footprint of UnicodeStrings, I would
suggest making it possible to have the JSON implementation always return
encoded strings when told to do so.

w.r.t. the internal encoding of Unicode strings: Perl has an interesting
concept: it stores Unicode strings internally either as ISO-8859 or UTF-X (some
extended form of UTF-8 encoding which can encode arbitrary integer values), and
this isn't visible at the language level. At the language level a String is a
sequence of integers interpreted as Unicode characters.

> Probably, what is missing from GNU Smalltalk's Iconv package is an
> "Encoding" object that can answer queries like "is this string pure
> ASCII?", the default very slow implementation being something like this:
>
>     str := self asString.
>     uniStr := self asUnicodeString.
>     str size = uniStr size ifFalse: [ ^false ].
>     str with: uniStr do: [ :ch :uni |
>         ch value = uni codePoint ifFalse: [ ^false ] ].
>     ^true
>
> This snippet would provide a more rigorous definition of "when it's
> possible".
>
> >If it returns UnicodeString no literal string access on a Dictionary
> >returned by
> >the JSON parser will work as it would get only a String object which has a
> >different
> >hash function than UnicodeString.
>
> Hmmm, this has to be fixed.

Is it fixable? If I have a UnicodeString, the encoding is lost and the hash
has to operate on the characters. If I have e.g. the UTF-16 encoded form in a
String, then the hash method has to operate on the bytes, which will lead to
a different hash.

Of course it would already be helpful if it worked for ASCII characters
and Latin-1, because I often access those Dictionaries with literal strings,
and usually those literals are ASCII strings in my case.

But it would also be nice if there were a way to have UnicodeString literals :)
Of course the Smalltalk source would then need a defined encoding, and the
Smalltalk parser would have to understand Unicode.
(I don't need this, it's just a random thought :)
(btw. is Smalltalk case-insensitive, or do class names have to start with an
upper-case character? Or is that just a convention?)



Robin



Re: Re: [bug] UnicodeString conversion truncation

Paolo Bonzini

>> I don't know if after the explanation above you still want JSON to
>> operate on UnicodeStrings only.
>
> Because a JSON Parser can only process characters and not bytes of some
> multibyte encoding. As far as I understood a '(ReadStream on: String)
> next' will return me a Character in the range 0 to: 255 which represents
> a byte of the multibyte encoding of the string.
>
> Or am I wrong and String>>#next will return me an UnicodeCharacter sometimes?

No, you're right.  However, note that there are no UnicodeCharacters
below 128; there, the two spaces overlap.  So, if the characters are
all 7-bit, a ReadStream on a String and one on a UnicodeString will be
indistinguishable.
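
Untested, but the overlap is easy to check (assuming the I18N package):

     Eval [
        PackageLoader fileInPackage: #I18N.
        ('abc' readStream next = 'abc' asUnicodeString readStream next)
           printNl    "prints true: below 128 both yield the same Characters"
     ]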

>    "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
>
> The JSON parser has no choice but to operate on Unicode characters.
> Parsing a UTF-16 encoded JSON text byte-wise will just not work :)

Oops, my fault.  And can you specify different encodings?

>    "Any character may be escaped."

Sure, you can write "\u0041".  In this case the JSON reader will return
a UnicodeString.  That's why I wrote "switching to UnicodeStrings as
soon as we find a \uXXXX is a conservative approximation".

> w.r.t. internal encoding of Unicode strings: Perl has an interesting concept:
> It stores Unicode strings internally either as iso-8859 or UTF-X (some extended
> form of UTF-8 encoding which can encode arbitrary integer values), which isn't
> visible on the language level. On the language level a String is a sequence of
> integers interpreted as Unicode characters.

Unfortunately, compatibility with pre-Unicode Smalltalk was a mess to
achieve, and actually it still has some problems (mostly the hashing
problem I refer to below).  So, I really have to thank you for working
out the bugs before 3.0.

>> Probably, what is missing from GNU Smalltalk's Iconv package is an
>> "Encoding" object that can answer queries like "is this string pure
>> ASCII?", the default very slow implementation being something like this:
>>
>>     str := self asString.
>>     uniStr := self asUnicodeString.
>>     str size = uniStr size ifFalse: [ ^false ].
>>     str with: uniStr do: [ :ch :uni |
>>         ch value = uni codePoint ifFalse: [ ^false ] ].
>>     ^true
>>
>> This snippet would provide a more rigorous definition of "when it's
>> possible".
>>
>>> If it returns UnicodeString no literal string access on a Dictionary
>>> returned by
>>> the JSON parser will work as it would get only a String object which has a
>>> different
>>> hash function than UnicodeString.
>> Hmmm, this has to be fixed.
>
> Is it fixable?

To some extent, it should be.  For example, in the case of
UnicodeStrings I can define the hash to be computed by translating
(internally) to UTF-8 and hashing the result.  Then, we can cross our
fingers and hope that Strings are also UTF-8 (good bet nowadays), and
just implement EncodedString>>hash as "^self asUnicodeString hash".

As a note to myself, this would mean also skipping the first 3 bytes of
a String to be hashed, if they are the BOM.
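
In code, the idea would be roughly the following (an untested sketch, not
the actual implementation):

     UnicodeString extend [
         hash [
             "Hash the UTF-8 translation of the receiver, so that equal
              UTF-8 Strings and UnicodeStrings hash alike."
             ^(self asString: 'UTF-8') hash
         ]
     ]

     I18N.EncodedString extend [
         hash [ ^self asUnicodeString hash ]
     ]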

> Of course it would already be helpful if it would work for ASCII characters
> and Latin-1

ASCII characters and UTF-8 please. :-)  I'm also from a Latin-1 country,
but I try to think as internationally as possible. :-)

> (btw. does Smalltalk operate case-insensitive or have classnames to be
> upper case for the first character? Or is that just a convention?)

Just a convention.

Paolo



Re: Re: [bug] UnicodeString conversion truncation

Robin Redeker-2
On Mon, Oct 22, 2007 at 12:03:10PM +0200, Paolo Bonzini wrote:
>
[.snip.]
> >   "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
> >
> >The JSON parser has no choice but to operate on Unicode characters.
> >Parsing a UTF-16 encoded JSON text byte-wise will just not work :)
>
> Oops, my fault.  And can you specify different encodings?

That can't really be specified in the JSON text itself; it's usually an
out-of-band thing, e.g. both ends agree on sending UTF-8 encoded JSON.
Of course there is a heuristic, which is even defined in the RFC:

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

But that's rather ugly IMO.
The cleanest interface for the JSON parser/serializer would be to
receive and produce UnicodeStrings and let the programmer worry about
encoding.
(An octet-parsing wrapper can then always be defined later on top of that.)
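
For what it's worth, the heuristic itself is simple to sketch (untested;
the block and the test value are just for illustration):

     Eval [ | detect |
        detect := [ :b |
           ((b at: 1) = 0 and: [ (b at: 2) = 0 ]) ifTrue: [ 'UTF-32BE' ] ifFalse: [
           (b at: 1) = 0 ifTrue: [ 'UTF-16BE' ] ifFalse: [
           ((b at: 2) = 0 and: [ (b at: 3) = 0 ]) ifTrue: [ 'UTF-32LE' ] ifFalse: [
           (b at: 2) = 0 ifTrue: [ 'UTF-16LE' ] ifFalse: [ 'UTF-8' ] ] ] ] ] ].
        (detect value: '{"a":1}' asByteArray) displayNl   "prints UTF-8"
     ]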

>
> >   "Any character may be escaped."
>
> Sure, you can write "\u0041".  In this case the JSON reader will return
> a UnicodeString.  That's why I wrote 'switching to UnicodeStrings as
> soon as we find a \uXXXX is a conservative approximation".

Ah, yes, now I understand the string-parsing code you wrote.

> >>>If it returns UnicodeString no literal string access on a Dictionary
> >>>returned by
> >>>the JSON parser will work as it would get only a String object which has
> >>>a different
> >>>hash function than UnicodeString.
> >>Hmmm, this has to be fixed.
> >
> >Is it fixable?
>
> To some extent, it should be.  For example, in the case of
> UnicodeStrings I can define the hash to be computed by translating
> (internally) to UTF-8 and hashing the result.  Then, we can cross our
> fingers and hope that Strings are also UTF-8 (good bet nowadays), and
> just implement EncodedString>>hash as "^self asUnicodeString hash".
>
> As a note to myself, this would mean also skipping the first 3 bytes of
> a String to be hashed, if they are the BOM.

Hm, I agree that hashing Strings in their UTF-8 encoded form is a good
approximation.  It will of course break horribly if someone chooses to use
e.g. German umlauts in the source code in Latin-1 encoding, or maybe not.
How is the encoding of a literal string determined?

> >Of course it would already be helpful if it would work for ASCII characters
> >and Latin-1
>
> ASCII characters and UTF-8 please. :-)  I'm also from a Latin-1 country,
> but I try to think as international as possible. :-)

That Smalltalk source code literals come in UTF-8 encoded form is a bold
assumption (which is increasingly right these days on Linux and other OSs :-)



Re: Re: [bug] UnicodeString conversion truncation

Paolo Bonzini
> The cleanest interface for the JSON parser/serializer would be to
> receive and produce UnicodeStrings and let the programmer worry about
> encoding.

I see.  An alternative is, when you read "\uXXXX", to just return
Strings.  To add a UnicodeCharacter to a String stream, you just use

    aStream display: aCharacter

A full implementation would probably require adding a method like this:

     PositionableStream >> encoding
         ^collection encoding

and I can take care of a more complete implementation of Stream encoding.

There are many ways to specify encoding, for example the following:

1) add a #on:encoding: constructor where the encoding defaults to
'UTF-8'.  When creating a String to be returned, use the same encoding
as the input.

2) use the aforementioned PositionableStream >> encoding method; when
creating a String to be returned, use the same encoding as the input.

3) use the aforementioned PositionableStream >> encoding method and add
a #on:outputEncoding: constructor, where the encoding defaults to the
same encoding as the input.

4) use the aforementioned PositionableStream >> encoding method and
always return UnicodeStrings.  In this case, you will never find
Characters whose value is >= 128 in the input (you'll find
UnicodeCharacters instead!).
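
Option 1, for instance, would boil down to something like this (an
untested sketch, not committed API):

     JSONReader class extend [
         on: aStream encoding: encString [
             "Answer a reader that assumes encString, defaulting to
              'UTF-8' in plain #on:, both for the input and for the
              Strings it returns."
             ^self new stream: aStream; encoding: encString; yourself
         ]
     ]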

> Hm, I agree that hashing Strings in their UTF-8 encoded form is a good
> approximation.  It will of course break horribly if someone chooses to use
> e.g. German umlauts in the source code in Latin-1 encoding, or maybe not.
> How is the encoding of a literal string determined?

It is not, so far; and unless one is interested in using Strings and
UnicodeStrings interchangeably for hashing, you should not care.  Do you
have examples of prior art from other languages?

>> ASCII characters and UTF-8 please. :-)  I'm also from a Latin-1 country,
>> but I try to think as international as possible. :-)
>
> That Smalltalk source code literals come in UTF-8 encoded form is a bold
> assumption (which is increasingly right these days on Linux and other OSs :-)

Yes.

Paolo



Re: Re: [bug] UnicodeString conversion truncation

Robin Redeker-2
On Mon, Oct 22, 2007 at 01:37:19PM +0200, Paolo Bonzini wrote:
> >The cleanest interface for the JSON parser/serializer would be to
> >receive and produce UnicodeStrings and let the programmer worry about
> >encoding.
>
> I see.  An alternative is, in the case when you read "\uXXXX", to just
> return Strings.  To add a UnicodeCharacter to a String stream, you just use
>
>    aStream display: aCharacter

Hm, but then I would have to do that for any character in a String, not
only for \uXXXX, if I understand you right, as this is valid JSON
(encoded in UTF-8):

   {"test":"にほんじん\u306b"}

> A full implementation would probably require adding a method like this:
>
>     PositionableStream >> encoding
>         ^collection encoding
>
> and I can take care of a more complete implementation of Stream encoding.
>
> There are many ways to specify encoding, for example the following:
>
[.snip.]

Interesting, I'll keep those in mind for the next json.st iteration :)

> >Hm, I agree that hasing Strings in their UTF-8 encoded form is a good
> >approximation.
> >Which will of course horribly break if someone chooses to use eg. german
> >"umlaute"
> >in the source code in latin-1 encoding, or maybe not. How is the encoding
> >of a
> >literal string determined?
>
> It is not so far, and unless one is interested in using Strings and
> UnicodeStrings interchangeably for hashing, you should not care.  Do you
> have example of prior art for other languages?

Nope. Perl has strings of "integers" which can represent either octets
or Unicode characters; the interpretation is up to the programmer. So
the internal hash only operates on the integer values.
Of course you can only use strings as keys for Perl hashes, as keys
are automatically stringified (afaik, but I may be wrong here).

About Unicode in Perl in general:

See this Perl script (encoded in UTF-8):

http://www.ta-sa.org/files/txt/3f0babeefe692cbf6bdd62def1dd68a2.txt

Output:

306B 307B 3093 3058 3093
E3   81   AB   E3   81   BB   E3   82   93   E3   81   98   E3   82   93
FE   FF   30   6B   30   7B   30   93   30   58   30   93

The 'use utf8' at the beginning tells the parser to interpret the
source code as UTF-8 encoded Unicode, which makes $string contain
Unicode characters.

After encode() is used, $utf8_encoded and $utf16_encoded contain
strings of characters, each in the range 0 to 255, which represent the
octets of the encoded strings.

It gets interesting if you remove the 'use utf8' statement at the top of
the script, which results in this output:

E3   81   AB   E3   81   BB   E3   82   93   E3   81   98   E3   82   93
C3   A3   C2   81   C2   AB   C3   A3   C2   81   C2   ...
FE   FF   0    E3   0    81   0    AB   0    E3   0    ...

Without the 'use utf8', $string already contains octets which represent
the UTF-8 encoded Unicode string.

Ah, so much about Perl strings. In general Perl doesn't really care much
about Unicode and lets the programmer care about encoding and keeping
track of how and whether strings are encoded.


Robin



Re: Re: [bug] UnicodeString conversion truncation

Paolo Bonzini

> Hm, but then I would have to do that for any character in a String and not
> only for \uXXXX, if I understand you right, as this is valid JSON
> (encoding UTF-8):
>
>    {"test":"にほんじん\u306b"}

I couldn't refute or confirm the above, so I started implementing it,
and this led to another approach: do everything in Unicode inside the
JSON reader, as you suggested, but accept *and return* encoded strings.
The attached patch and the attached JSON.st will show more or less
what I have in mind.  There is still something to be gained from a
Stream>>#encoding method, but it already clarifies things a lot, at
least for me.

With this patch (which is straightforward apart from a stupid
String/Symbol mismatch in i18n/Locale.st), I can do things like these:

PackageLoader fileInPackage: #I18N.
FileStream fileIn: '../examples/JSON.st'.
JSONReader fromJSON: '{"test":"にほんじん\u306b"}'
=> 'test'->'にほんじんに'
JSONReader fromJSON: '{"test":"にほんじん\u306b"}' encoding: 'UTF-16'
=> UTF-16BE['test']->UTF-16BE['にほんじんに']

(there are problems with hashing, which make the second result unusable
in practice, but the JSON reader does its part of the job right, at least).

> Nope, Perl has strings of "integers" which can either represent octets
> or Unicode characters. The interpretation is up to the programmer.

Hm, so that's as bad as what we do.  Ours has more potential for
confusion, but it's a little more typesafe.  This fits the general Perl
philosophy, I would say.

Paolo

* looking for [hidden email]--2004b/smalltalk--devo--2.2--patch-612 to compare with
* comparing to [hidden email]--2004b/smalltalk--devo--2.2--patch-612
M  kernel/UniString.st
M  examples/JSON.st
M  packages/i18n/Locale.st
M  packages/iconv/Sets.st
M  kernel/CharArray.st
M  kernel/Stream.st
M  kernel/String.st

* modified files

--- orig/examples/JSON.st
+++ mod/examples/JSON.st
@@ -32,7 +32,7 @@
 
 
 Stream subclass: #JSONReader
-    instanceVariableNames: 'stream'
+    instanceVariableNames: 'stream encoding'
     classVariableNames: ''
     poolDictionaries: ''
     category: nil !
@@ -51,15 +51,39 @@ toJSON: anObject
 
 fromJSON: string
    "I'm responsible for decoding the JSON string to objects."
-   ^(self on: string readStream) nextJSONObject
+   ^self fromJSON: string encoding: string encoding
+!
+
+fromJSON: string encoding: encString
+   "I'm responsible for decoding the JSON string to objects."
+   | stream |
+   stream := string readStream.
+   string isUnicode ifFalse: [
+       stream := I18N.EncodedStream unicodeOn: stream encoding: string encoding ].
+   ^(self on: stream encoding: encString) nextJSONObject
 !
 
 on: aStream
-    ^self new stream: aStream
+    ^self on: aStream encoding: 'UTF-8'
+!
+
+on: aStream encoding: encString
+    "TODO: if we had an #encoding method on Streams, we could use it and do
+     the re-encoding here.  Now instead we assume encString is also the input
+     encoding."
+    ^self new stream: aStream; encoding: encString; yourself
 ! !
 
 !JSONReader methodsFor: 'json'!
 
+encoding
+    ^encoding
+!
+
+encoding: aString
+    encoding := aString
+!
+
 stream: aStream
     stream := aStream
 !
@@ -165,15 +189,11 @@ nextJSONString
             c = $t ifTrue: [ c := Character tab ].
             c = $u
                ifTrue: [
-  c := (Integer readFrom: (stream next: 4) readStream radix: 16) asCharacter.
-  (c class == UnicodeCharacter and: [ str species == String ])
-    ifTrue: [ str := (UnicodeString new writeStream
- nextPutAll: str contents; yourself) ] ].
+  c := (Integer readFrom: (stream next: 4) readStream radix: 16) asCharacter ].
          ].
-      str nextPut: c.
+ str nextPut: c.
    ].
-   "Undo the conversion to UnicodeString done above."
-   ^str contents asString.
+   ^str contents asString: self encoding
 !
 
 nextJSONNumber


--- orig/kernel/CharArray.st
+++ mod/kernel/CharArray.st
@@ -60,6 +60,14 @@ accessing and manipulation methods for s
  ^self with: Character nl
     ]
 
+    CharacterArray class >> isUnicode [
+ "Answer whether the receiver stores bytes (i.e. an encoded
+ form) or characters (if true is returned)."
+
+ <category: 'multibyte encodings'>
+ self subclassResponsibility
+    ]
+
     = aString [
  "Answer whether the receiver's items match those in aCollection"
 
@@ -188,6 +196,14 @@ accessing and manipulation methods for s
  ^nil
     ]
 
+    isUnicode [
+ "Answer whether the receiver stores bytes (i.e. an encoded
+ form) or characters (if true is returned)."
+
+ <category: 'multibyte encodings'>
+ ^self class isUnicode
+    ]
+
     encoding [
  "Answer the encoding used by the receiver."
 


--- orig/kernel/Stream.st
+++ mod/kernel/Stream.st
@@ -312,7 +312,7 @@ provide for writing collections sequenti
  whose value is above 127."
 
  <category: 'character writing'>
- ^self species shape ~~ #character
+ ^self species isUnicode
     ]
 
     cr [


--- orig/kernel/String.st
+++ mod/kernel/String.st
@@ -64,6 +64,14 @@ or assumed to be the system default.'>
  ^SystemExceptions.WrongClass signalOn: anInteger mustBe: SmallInteger
     ]
 
+    String class >> isUnicode [
+ "Answer false; the receiver stores bytes (i.e. an encoded
+ form), not characters."
+
+ <category: 'multibyte encodings'>
+ ^false
+    ]
+
     = aString [
  "Answer whether the receiver's items match those in aCollection"
 


--- orig/kernel/UniString.st
+++ mod/kernel/UniString.st
@@ -55,6 +55,13 @@ as 4-byte UTF-32 characters'>
  ^'Unicode'
     ]
 
+    UnicodeString class >> isUnicode [
+ "Answer true; the receiver stores characters."
+
+ <category: 'multibyte encodings'>
+ ^true
+    ]
+
     asString [
  "Returns the string corresponding to the receiver.  Without the
  Iconv package, unrecognized Unicode characters become $?


--- orig/packages/i18n/Locale.st
+++ mod/packages/i18n/Locale.st
@@ -69,7 +69,7 @@ information.'>
  "Set the default charset used when nothing is specified."
 
  <category: 'database'>
- DefaultCharsets at: 'POSIX' put: aString asSymbol
+ DefaultCharsets at: 'POSIX' put: aString asString
     ]
 
     LocaleData class >> defaults [
@@ -77,10 +77,136 @@ information.'>
  associations."
 
  <category: 'database'>
- ^#(#('POSIX' '' #'UTF-8') #('af' 'ZA' #'ISO-8859-1') #('am' 'ET' #'UTF-8') #('ar' 'SA' #'ISO-8859-6') #('as' 'IN' #'UTF-8') #('az' 'AZ' #'UTF-8') #('be' 'BY' #CP1251) #('ber' 'MA' #'UTF-8') #('bg' 'BG' #CP1251) #('bin' 'NG' #'ISO-8859-1') #('bn' 'IN' #'UTF-8') #('bnt' 'TZ' #'ISO-8859-1') #('bo' 'CN' #'UTF-8') #('br' 'FR' #'ISO-8859-1') #('bs' 'BA' #'ISO-8859-2') #('ca' 'ES' #'ISO-8859-1') #('chr' 'US' #'ISO-8859-1') #('cpe' 'US' #'ISO-8859-1') #('cs' 'CZ' #'ISO-8859-2') #('cy' 'GB' #'ISO-8859-14') #('da' 'DK' #'ISO-8859-1') #('de' 'DE' #'ISO-8859-1') #('div' 'MV' #'ISO-8859-1') #('el' 'GR' #'ISO-8859-7') #('en' 'US' #'ISO-8859-1') #('eo' 'XX' #'ISO-8859-3') #('es' 'ES' #'ISO-8859-1') #('et' 'EE' #'ISO-8859-4') #('eu' 'ES' #'ISO-8859-1') #('fa' 'IR' #'UTF-8') #('fi' 'FI' #'ISO-8859-1') #('fo' 'FO' #'ISO-8859-1') #('fr' 'FR' #'ISO-8859-1') #('ful' 'NG' #'ISO-8859-1') #('fy' 'NL' #'ISO-8859-1') #('ga' 'IE' #'ISO-8859-1') #('gd' 'GB' #'ISO-8859-1') #('gl' 'ES' #'ISO-8859-1') #('gn' 'PY' #'ISO-8859-1') #('gu' 'IN' #'UTF-8') #('gv' 'GB' #'ISO-8859-1') #('ha' 'NG' #'ISO-8859-1') #('he' 'IL' #'ISO-8859-8') #('hi' 'IN' #'UTF-8') #('hr' 'HR' #'ISO-8859-2') #('hu' 'HU' #'ISO-8859-2') #('ibo' 'NG' #'ISO-8859-1') #('id' 'ID' #'ISO-8859-1') #('is' 'IS' #'ISO-8859-1') #('it' 'IT' #'ISO-8859-1') #('iu' 'CA' #'UTF-8') #('ja' 'JP' #'EUC-JP') #('ka' 'GE' #'GEORGIAN-PS') #('kau' 'NG' #'ISO-8859-1') #('kk' 'KZ' #'UTF-8') #('kl' 'GL' #'ISO-8859-1') #('km' 'KH' #'UTF-8') #('kn' 'IN' #'UTF-8') #('ko' 'KR' #'EUC-KR') #('kok' 'IN' #'UTF-8') #('ks' 'PK' #'UTF-8') #('kw' 'GB' #'ISO-8859-1') #('ky' 'KG' #'UTF-8') #('la' 'VA' #ASCII) #('lt' 'LT' #'ISO-8859-13') #('lv' 'LV' #'ISO-8859-13') #('mi' 'NZ' #'ISO-8859-13') #('mk' 'MK' #'ISO-8859-5') #('ml' 'IN' #'UTF-8') #('mn' 'MN' #'KOI8-R') #('mni' 'IN' #'UTF-8') #('mr' 'IN' #'UTF-8') #('ms' 'MY' #'ISO-8859-1') #('mt' 'MT' #'ISO-8859-3') #('my' 'MM' #'UTF-8') #('ne' 'NP' #'UTF-8') #('nic' 'NG' #'ISO-8859-1') #('nl' 'NL' #'ISO-8859-1') #('nn' 'NO' #'ISO-8859-1') #('no' 'NO' #'ISO-8859-1') #('oc' 'FR' #'ISO-8859-1') #('om' 'ET' #'UTF-8') #('or' 'IN' #'UTF-8') #('pa' 'IN' #'UTF-8') #('pap' 'AN' #'UTF-8') #('pl' 'PL' #'ISO-8859-2') #('ps' 'PK' #'UTF-8') #('pt' 'PT' #'ISO-8859-1') #('rm' 'CH' #'ISO-8859-1') #('ro' 'RO' #'ISO-8859-2') #('ru' 'RU' #'KOI8-R') #('sa' 'IN' #'UTF-8') #('se' 'NO' #'UTF-8') #('sh' 'YU' #'ISO-8859-2') #('si' 'LK' #'UTF-8') #('sit' 'CN' #'UTF-8') #('sk' 'SK' #'ISO-8859-2') #('sl' 'SI' #'ISO-8859-2') #('so' 'SO' #'UTF-8') #('sp' 'YU' #'ISO-8859-5') #('sq' 'AL' #'ISO-8859-1') #('sr' 'YU' #'ISO-8859-2') #('sv' 'SE' #'ISO-8859-1') #('sw' 'KE' #'ISO-8859-1') #('syr' 'TR' #'UTF-8') #('ta' 'IN' #'UTF-8') #('te' 'IN' #'UTF-8') #('tg' 'TJ' #'UTF-8') #('th' 'TH' #'TIS-620') #('ti' 'ET' #'UTF-8') #('tk' 'TM' #'UTF-8') #('tl' 'PH' #'ISO-8859-1') #('tr' 'TR' #'ISO-8859-9') #('ts' 'ZA' #'ISO-8859-1') #('tt' 'RU' #'UTF-8') #('uk' 'UA' #'KOI8-U') #('ur' 'PK' #'UTF-8') #('uz' 'UZ' #'ISO-8859-1') #('ven' 'ZA' #'ISO-8859-1') #('vi' 'VN' #'UTF-8') #('wa' 'BE' #'ISO-8859-1') #('wen' 'DE' #'ISO-8859-1') #('xh' 'ZA' #'ISO-8859-1') #('yi' 'US' #CP1255) #('yo' 'NG' #'ISO-8859-1') #('zh' 'CN' #GB2312) #('zu' 'ZA' #'ISO-8859-1')) "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet 
seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix"
- "('hy' 'AM' #'ARMSCII-8')" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix"
- "('lo' 'LA' #'MULELAO-1')" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix"
- "('sd' ? ?)" "not yet seen on Unix" "obsolete" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "obsolete" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix" "not yet seen on Unix"
+ ^#(#('POSIX' '' 'UTF-8')
+ #('af' 'ZA' 'ISO-8859-1')
+ #('am' 'ET' 'UTF-8')
+ #('ar' 'SA' 'ISO-8859-6')
+ #('as' 'IN' 'UTF-8')
+ #('az' 'AZ' 'UTF-8')
+ #('be' 'BY' 'CP1251')
+ #('ber' 'MA' 'UTF-8')
+ #('bg' 'BG' 'CP1251')
+ #('bin' 'NG' 'ISO-8859-1')
+ #('bn' 'IN' 'UTF-8')
+ #('bnt' 'TZ' 'ISO-8859-1')
+ #('bo' 'CN' 'UTF-8')
+ #('br' 'FR' 'ISO-8859-1')
+ #('bs' 'BA' 'ISO-8859-2')
+ #('ca' 'ES' 'ISO-8859-1')
+ #('chr' 'US' 'ISO-8859-1')
+ #('cpe' 'US' 'ISO-8859-1')
+ #('cs' 'CZ' 'ISO-8859-2')
+ #('cy' 'GB' 'ISO-8859-14')
+ #('da' 'DK' 'ISO-8859-1')
+ #('de' 'DE' 'ISO-8859-1')
+ #('div' 'MV' 'ISO-8859-1')
+ #('el' 'GR' 'ISO-8859-7')
+ #('en' 'US' 'ISO-8859-1')
+ #('eo' 'XX' 'ISO-8859-3')
+ #('es' 'ES' 'ISO-8859-1')
+ #('et' 'EE' 'ISO-8859-4')
+ #('eu' 'ES' 'ISO-8859-1')
+ #('fa' 'IR' 'UTF-8')
+ #('fi' 'FI' 'ISO-8859-1')
+ #('fo' 'FO' 'ISO-8859-1')
+ #('fr' 'FR' 'ISO-8859-1')
+ #('ful' 'NG' 'ISO-8859-1')
+ #('fy' 'NL' 'ISO-8859-1')
+ #('ga' 'IE' 'ISO-8859-1')
+ #('gd' 'GB' 'ISO-8859-1')
+ #('gl' 'ES' 'ISO-8859-1')
+ #('gn' 'PY' 'ISO-8859-1')
+ #('gu' 'IN' 'UTF-8')
+ #('gv' 'GB' 'ISO-8859-1')
+ #('ha' 'NG' 'ISO-8859-1')
+ #('he' 'IL' 'ISO-8859-8')
+ #('hi' 'IN' 'UTF-8')
+ #('hr' 'HR' 'ISO-8859-2')
+ #('hu' 'HU' 'ISO-8859-2')
+ #('ibo' 'NG' 'ISO-8859-1')
+ #('id' 'ID' 'ISO-8859-1')
+ #('is' 'IS' 'ISO-8859-1')
+ #('it' 'IT' 'ISO-8859-1')
+ #('iu' 'CA' 'UTF-8')
+ #('ja' 'JP' 'EUC-JP')
+ #('ka' 'GE' 'GEORGIAN-PS')
+ #('kau' 'NG' 'ISO-8859-1')
+ #('kk' 'KZ' 'UTF-8')
+ #('kl' 'GL' 'ISO-8859-1')
+ #('km' 'KH' 'UTF-8')
+ #('kn' 'IN' 'UTF-8')
+ #('ko' 'KR' 'EUC-KR')
+ #('kok' 'IN' 'UTF-8')
+ #('ks' 'PK' 'UTF-8')
+ #('kw' 'GB' 'ISO-8859-1')
+ #('ky' 'KG' 'UTF-8')
+ #('la' 'VA' 'ASCII')
+ #('lt' 'LT' 'ISO-8859-13')
+ #('lv' 'LV' 'ISO-8859-13')
+ #('mi' 'NZ' 'ISO-8859-13')
+ #('mk' 'MK' 'ISO-8859-5')
+ #('ml' 'IN' 'UTF-8')
+ #('mn' 'MN' 'KOI8-R')
+ #('mni' 'IN' 'UTF-8')
+ #('mr' 'IN' 'UTF-8')
+ #('ms' 'MY' 'ISO-8859-1')
+ #('mt' 'MT' 'ISO-8859-3')
+ #('my' 'MM' 'UTF-8')
+ #('ne' 'NP' 'UTF-8')
+ #('nic' 'NG' 'ISO-8859-1')
+ #('nl' 'NL' 'ISO-8859-1')
+ #('nn' 'NO' 'ISO-8859-1')
+ #('no' 'NO' 'ISO-8859-1')
+ #('oc' 'FR' 'ISO-8859-1')
+ #('om' 'ET' 'UTF-8')
+ #('or' 'IN' 'UTF-8')
+ #('pa' 'IN' 'UTF-8')
+ #('pap' 'AN' 'UTF-8')
+ #('pl' 'PL' 'ISO-8859-2')
+ #('ps' 'PK' 'UTF-8')
+ #('pt' 'PT' 'ISO-8859-1')
+ #('rm' 'CH' 'ISO-8859-1')
+ #('ro' 'RO' 'ISO-8859-2')
+ #('ru' 'RU' 'KOI8-R')
+ #('sa' 'IN' 'UTF-8')
+ #('se' 'NO' 'UTF-8')
+ #('sh' 'YU' 'ISO-8859-2')
+ #('si' 'LK' 'UTF-8')
+ #('sit' 'CN' 'UTF-8')
+ #('sk' 'SK' 'ISO-8859-2')
+ #('sl' 'SI' 'ISO-8859-2')
+ #('so' 'SO' 'UTF-8')
+ #('sp' 'YU' 'ISO-8859-5')
+ #('sq' 'AL' 'ISO-8859-1')
+ #('sr' 'YU' 'ISO-8859-2')
+ #('sv' 'SE' 'ISO-8859-1')
+ #('sw' 'KE' 'ISO-8859-1')
+ #('syr' 'TR' 'UTF-8')
+ #('ta' 'IN' 'UTF-8')
+ #('te' 'IN' 'UTF-8')
+ #('tg' 'TJ' 'UTF-8')
+ #('th' 'TH' 'TIS-620')
+ #('ti' 'ET' 'UTF-8')
+ #('tk' 'TM' 'UTF-8')
+ #('tl' 'PH' 'ISO-8859-1')
+ #('tr' 'TR' 'ISO-8859-9')
+ #('ts' 'ZA' 'ISO-8859-1')
+ #('tt' 'RU' 'UTF-8')
+ #('uk' 'UA' 'KOI8-U')
+ #('ur' 'PK' 'UTF-8')
+ #('uz' 'UZ' 'ISO-8859-1')
+ #('ven' 'ZA' 'ISO-8859-1')
+ #('vi' 'VN' 'UTF-8')
+ #('wa' 'BE' 'ISO-8859-1')
+ #('wen' 'DE' 'ISO-8859-1')
+ #('xh' 'ZA' 'ISO-8859-1')
+ #('yi' 'US' 'CP1255')
+ #('yo' 'NG' 'ISO-8859-1')
+ #('zh' 'CN' 'GB2312')
+ #('zu' 'ZA' 'ISO-8859-1'))
+ "('hy' 'AM' #'ARMSCII-8')"
+ "('lo' 'LA' #'MULELAO-1')"
+ "('sd' ? ?)"
     ]
 
     LocaleData class >> initialize [


--- orig/packages/iconv/Sets.st
+++ mod/packages/iconv/Sets.st
@@ -100,6 +100,7 @@ assumed to be the system default.'>
  <category: 'instance creation'>
  | str |
  str := aString asString.
+ str encoding = str class defaultEncoding ifTrue: [ ^str ].
  ^self fromString: str encoding: str encoding
     ]
 
@@ -109,6 +110,7 @@ assumed to be the system default.'>
  str := aString isString
     ifTrue: [aString]
     ifFalse: [aString asString: encoding].
+ encoding = str class defaultEncoding ifTrue: [ ^str ].
  ^(self basicNew)
     setString: aString;
     encoding: encoding
@@ -124,6 +126,14 @@ assumed to be the system default.'>
  self shouldNotImplement
     ]
 
+    EncodedString class >> isUnicode [
+ "Answer false; the receiver stores bytes (i.e. an encoded
+ form), not characters."
+
+ <category: 'accessing'>
+ ^false
+    ]
+
     asString [
  <category: 'accessing'>
  ^string
@@ -279,6 +289,14 @@ Encoders can return EncodedString object
  ^EncodedString fromString: (String new: size) encoding: self encoding
     ]
 
+    isUnicode [
+ "Answer false; the receiver stores bytes (i.e. an encoded
+ form), not characters."
+
+ <category: 'accessing'>
+ ^false
+    ]
+
     encoding [
  "Answer the encoding used for the created Strings."
 




"======================================================================
|
|   JSON reader/writer example
|
|
 ======================================================================"


"======================================================================
|
| Copyright 2007 Free Software Foundation, Inc.
| Written by Robin Redeker.
|
| This file is part of the GNU Smalltalk class library.
|
| The GNU Smalltalk class library is free software; you can redistribute it
| and/or modify it under the terms of the GNU Lesser General Public License
| as published by the Free Software Foundation; either version 2.1, or (at
| your option) any later version.
|
| The GNU Smalltalk class library is distributed in the hope that it will be
| useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser
| General Public License for more details.
|
| You should have received a copy of the GNU Lesser General Public License
| along with the GNU Smalltalk class library; see the file COPYING.LIB.
| If not, write to the Free Software Foundation, 59 Temple Place - Suite
| 330, Boston, MA 02110-1301, USA.  
|
 ======================================================================"


Stream subclass: #JSONReader
    instanceVariableNames: 'stream encoding'
    classVariableNames: ''
    poolDictionaries: ''
    category: nil !

JSONReader comment:
'I read data structures (currently built of OrderedCollection and Dictionary)
from and to JSON (JavaScript Object Notation). Writing is done with the
#toJSON method (note: it will behave badly with circular data structures).' !

!JSONReader class methodsFor: 'json'!

toJSON: anObject
   "I'm returning a JSON string which represents the object."
   ^anObject toJSON
!

fromJSON: string
   "I'm responsible for decoding the JSON string to objects."
   ^self fromJSON: string encoding: string encoding
!

fromJSON: string encoding: encString
   "I'm responsible for decoding the JSON string to objects."
   | stream |
   stream := string readStream.
   string isUnicode ifFalse: [
       stream := I18N.EncodedStream unicodeOn: stream encoding: string encoding ].
   ^(self on: stream encoding: encString) nextJSONObject
!

on: aStream
    ^self on: aStream encoding: 'UTF-8'
!

on: aStream encoding: encString
    "TODO: if we had an #encoding method on Streams, we could use it and do
     the re-encoding here.  Now instead we assume encString is also the input
     encoding."
    ^self new stream: aStream; encoding: encString; yourself
! !

!JSONReader methodsFor: 'json'!

encoding
    ^encoding
!

encoding: aString
    encoding := aString
!

stream: aStream
    stream := aStream
!

peek
   "I'm peeking for the next non-whitespace character and will drop all whitespace in front of it"
   | c |
   [
     c := stream peek.
     c = (Character space)
         or: [ c = (Character tab)
         or: [ c = (Character lf)
         or: [ c = (Character cr)]]]
   ] whileTrue: [
     stream next
   ].
   ^c
!

next
   "I'm returning the next non-whitespace character"
   | c |
   c := self peek.
   c isNil ifTrue: [ ^self error: 'expected character but found end of stream' ].
   stream next.
   ^c
! !

!JSONReader methodsFor: 'private'!

nextJSONObject
   "I decode a json self to a value, which will be one of: nil,
true, false, OrderedCollection, Dictionary, String or Number
(i will return Integer or Float depending on the input)."
   | c |
   c := self peek.
   (c = $n) ifTrue: [ self next: 4. ^nil   ].
   (c = $t) ifTrue: [ self next: 4. ^true  ].
   (c = $f) ifTrue: [ self next: 5. ^false ].
   (c = ${) ifTrue: [ ^self nextJSONDict ].
   (c = $[) ifTrue: [ ^self nextJSONArray  ].
   (c = $") ifTrue: [ ^self nextJSONString ].
   ^self nextJSONNumber
!

nextJSONArray
   "I decode JSON arrays from self and will return a OrderedCollection for them."
   | c obj value |
   obj := OrderedCollection new.
   self next.
   [ c := self peek.
     (c = $]) ] whileFalse: [
      (c = $,) ifTrue: [ self next. ].
      value := self nextJSONObject.
      obj add: value.
   ].
   self next.
   ^obj
!

nextJSONDict
   "I decode JSON objects from self and will return a Dictionary containing all the key/value pairs."
   | c obj key value |
   obj := Dictionary new.
   self next.
   [ c := self peek.
     c = $} ] whileFalse: [
      (c = $,) ifTrue: [ self next ].

      key := self nextJSONString.

      c := self next.
      c = $: ifFalse: [
         self error: ('unexpected character found where name-separator '':'' expected, found: %1' bindWith: c)
      ].

      value := self nextJSONObject.

      obj at: key put: value.
   ].
   self next.
   ^obj
!

nextJSONString
   "I'm extracting a JSON string from self and return them as String."
   | c obj str |
   str := ReadWriteStream on: UnicodeString new.
   self next.
   [
        c := stream next.
        c = $"
   ] whileFalse: [
      c = $\
         ifTrue: [
            c := stream next.
            c isNil ifTrue:
               [ ^self error: 'expected character, found end of stream' ].
            c = $b ifTrue: [ c := 8 asCharacter ].
            c = $f ifTrue: [ c := 12 asCharacter ].
            c = $n ifTrue: [ c := Character nl ].
            c = $r ifTrue: [ c := Character cr ].
            c = $t ifTrue: [ c := Character tab ].
            c = $u
               ifTrue: [
                  c := (Integer readFrom: (stream next: 4) readStream radix: 16) asCharacter ].
         ].
         str nextPut: c.
   ].

   "Same as 'str contents asString: self encoding', a little more efficient."
   "str reset. ^(I18N.EncodedStream encoding: str as: self encoding) contents"
   ^str contents asString: self encoding
!

nextJSONNumber
   "I'm extracting a number in JSON format from self and return Integer or Float depending on the input."
   | c num sgn int intexp frac exp isfloat |
   num := WriteStream on: (String new).

   isfloat := false.
   sgn     := 1.
   int     := 0.
   intexp  := 1.

   c := stream peek.
   (c isNil) ifTrue: [ ^self error: 'expected number or -sign, but found end of stream' ].
   c = $- ifTrue: [ sgn := -1. stream next. ].

   c := stream peek.
   (c isNil) ifTrue: [ ^self error: 'expected number, but found end of stream' ].
   (c isDigit or: [ c = $. ]) ifFalse: [ ^self error: 'invalid JSON input' ].

   [ c notNil and: [ c isDigit ] ] whileTrue: [
      stream next.
      int := sgn * (c digitValue) + (int * 10).
      c := stream peek
   ].
   (c isNil) ifTrue: [ ^int ].

   c = $. ifTrue: [
      stream next.
      isfloat := true.
      [ c := stream peek. c notNil and: [ c isDigit ] ] whileTrue: [
         sgn := sgn / 10.
         int := sgn * (c digitValue) + int.
         stream next
      ]
   ].

   exp := 0.
   ((c = $e) or: [ c = $E ]) ifFalse: [
        ^isfloat ifTrue: [ int asFloat ] ifFalse: [ int ] ].

   stream next.
   c := stream peek.
   (c isNil) ifTrue: [ ^int ].
   sgn := 1.
   c = $+ ifTrue: [ sgn :=  1. self next ].
   c = $- ifTrue: [ sgn := -1. self next ].

   [ c := stream peek. c notNil and: [ c isDigit ] ] whileTrue: [
      exp := (c digitValue) + (exp * 10).
      stream next
   ].

   int := int * (10 raisedToInteger: exp * sgn).
   ^int asFloat
! !

!Number methodsFor: 'json'!

jsonPrintOn: aStream
   "I return the Number in a JSON compatible format as String."
   self asFloat printOn: aStream
! !

!Float methodsFor: 'json'!

jsonPrintOn: aStream
   "I return the Number in a JSON compatible format as String."
   aStream nextPutAll:
        (self printString copyReplacing: self exponentLetter withObject: $e)
! !

!Integer methodsFor: 'json'!

jsonPrintOn: aStream
   "I return the Integer in a JSON compatible format as String."
   self printOn: aStream
! !

!Dictionary methodsFor: 'json'!

jsonPrintOn: ws
   "I encode my contents (key/value pairs) to a JSON object and return it as String."
   | f |
   ws nextPut: ${.
   f := true.
   self keysAndValuesDo: [ :key :val |
      f ifFalse: [ ws nextPut: $, ].
      key jsonPrintOn: ws.
      ws nextPut: $:.
      val jsonPrintOn: ws.
      f := false
   ].
   ws nextPut: $}.
! !

!CharacterArray methodsFor: 'json'!

jsonPrintOn: ws
   "I will encode me as JSON String and return a String containing my encoded version."
   ws nextPut: $".
   self do: [ :c || i |
      i := c asInteger.
      (((i = 16r20
         or: [ i = 16r21 ])
         or: [ i >= 16r23 and: [ i <= 16r5B ] ])
         or: [ i >= 16r5D ])
            ifTrue: [ ws nextPut: c ];
            ifFalse: [ | f |
               f := false.
               ws nextPut: $\.
               i = 16r22 ifTrue: [ f := true. ws nextPut: c ].
               i = 16r5C ifTrue: [ f := true. ws nextPut: c ].
               i = 16r2F ifTrue: [ f := true. ws nextPut: c ].
               i = 16r08 ifTrue: [ f := true. ws nextPut: $b ].
               i = 16r0C ifTrue: [ f := true. ws nextPut: $f ].
               i = 16r0A ifTrue: [ f := true. ws nextPut: $n ].
               i = 16r0D ifTrue: [ f := true. ws nextPut: $r ].
               i = 16r09 ifTrue: [ f := true. ws nextPut: $t ].
               f ifFalse: [
                  ws nextPut: $u.
                  ws nextPutAll: ('0000', i printString: 16) last: 4 ].
            ]
   ].
   ws nextPut: $".
! !

!String methodsFor: 'json'!

jsonPrintOn: aStream
   "I will encode me as JSON String and return a String containing my encoded version."
   (self anySatisfy: [ :ch | ch value between: 128 and: 255 ])
        ifTrue: [ self asUnicodeString jsonPrintOn: aStream ]
        ifFalse: [ super jsonPrintOn: aStream ]! !

!SequenceableCollection methodsFor: 'json'!

jsonPrintOn: ws
   "I'm returning a JSON encoding of my contents as String."
   | f |
   ws nextPut: $[.
   f := true.
   self do: [ :val |
      f ifFalse: [ ws nextPut: $, ].
      val jsonPrintOn: ws.
      f := false
   ].
   ws nextPut: $].
! !

!UndefinedObject methodsFor: 'json'!

jsonPrintOn: aStream
   "I'm returning my corresponding value as JSON String."
   aStream nextPutAll: 'null'
! !

!Boolean methodsFor: 'json'!

jsonPrintOn: aStream
   "I'm returning the JSON String for truth or lie."
   self printOn: aStream
! !

!Object methodsFor: 'json'!

jsonPrintOn: aStream
    self subclassResponsibility
!

toJSON
    ^String streamContents: [ :aStream | self jsonPrintOn: aStream ]
! !

