Re: Starting with smalltalk


Paolo Bonzini
(First of all, you may want to subscribe to [hidden email] -- it
is moderated for non-subscribers, so no spam).
> I've been reading the GNU Smalltalk manual, but the following I haven't
> been able to find on the web yet:
> - A GNU smalltalk compatible, functional program
You mean a program written with gst?  Unfortunately I don't know of any
:-(  Mike Anderson has some on his blog, but they're small.
> - A way of separating smalltalk source over multiple files
You write the source code in multiple files, and then provide a loading
script that loads them all (optionally saving everything to an image
file, see later).
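
A minimal sketch of such a loading script, assuming FileStream class>>#fileIn: is available in your GNU Smalltalk version. The file names (model.st, main.st) and the image name are made up for illustration:

```smalltalk
"load.st: hypothetical loading script; run it with: gst load.st"
FileStream fileIn: 'model.st'.
FileStream fileIn: 'main.st'.

"Optional: save everything that was loaded into an image file."
ObjectMemory snapshot: 'myprogram.im'!
```

Listing the files in dependency order (superclasses before subclasses) keeps each fileIn: from referring to classes that are not defined yet.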
> - A way of editing smalltalk files without the use of a commercial IDE
GNU Smalltalk has an Emacs mode.
> - A way of running smalltalk programs like other programs (from the
> command line) without the need for a wrapper script (the normal
> '#!/usr/bin/env' doesn't work, nor could I find ways of creating
> bytecode/packages/binaries)
You can use (with GNU Smalltalk 2.2)

#! /usr/bin/env gst -f

or

#! /bin/sh
"exec" "gst" "-f" "$0" "$@"

GNU Smalltalk special-cases the #! at the beginning of a file as a
one-line comment.  Comments are quote-delimited in Smalltalk, so the
second line is eaten by GNU Smalltalk's parser in the second example.
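
Put together, a complete script using the second form might look like the sketch below. The greeting is only a placeholder, and the file must be made executable with chmod +x:

```smalltalk
#! /bin/sh
"exec" "gst" "-f" "$0" "$@"

"The shell runs the line above as: exec gst -f script args.
 GNU Smalltalk skips the first line and reads the quoted words as comments."
'Hello, world' displayNl!
```

The shell never reads past the exec line, so everything after it only has to be valid Smalltalk.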

In addition, GNU Smalltalk can save a snapshot of its state in an image
(.im) file that can be made executable with chmod.  Making something run
automatically when the image file is reloaded is feasible.  Just create
a class-side method named #update: including some code like

update: aspect
    "Restart the program when the image is reloaded."
    aspect == #returnFromSnapshot ifTrue: [ self restart ]!

and then evaluate code like

ObjectMemory
    addDependent: NameOfTheClassWithTheUpdateMethod;
    snapshot: 'myprogram.im'

Then, running gst with "gst -I myprogram.im" (or just making
myprogram.im executable) will invoke the #restart method on the class
NameOfTheClassWithTheUpdateMethod.

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Re: Re: Starting with smalltalk

Mike Anderson-3
Sorry to be weighing in so late on this one. My brain works slowly in
the summer heat...

Paolo Bonzini wrote:
>> I've been reading the GNU Smalltalk manual, but the following I haven't
>> been able to find on the web yet:
>> - A GNU smalltalk compatible, functional program
>
> You mean a program written with gst?  Unfortunately I don't know of any
> :-(  Mike Anderson has some on his blog, but they're small.

What you will find is that one of the major problems Smalltalk has as a
language is that the dialects are sufficiently dissimilar that programs
are not very portable, so the only programs you will find for GSt are
those that were written for GSt. There are projects that aim to remedy
this, e.g. Sport. Porting Sport to gst would be a very useful project.

The other main problem, related to the above, is that the Smalltalk Way
is image-based development, which unfortunately means that the easiest
way to distribute programs is as images, not code.

At a personal level, the main problem I have is that the packaging
system is a bit inflexible, so splitting out a package is hard work.

>> - A way of editing smalltalk files without the use of a commercial IDE

This sounds as if you're thinking about commercial Smalltalks, like
VisualWorks. Actually, most other Smalltalks don't use files: you
develop within the IDE, and code at the method level. Where the source
code is outside of the image, it is found in a repository like Envy or
Store, i.e. a database.

> GNU Smalltalk has an Emacs mode.

SciTE also has syntax-highlighting, if, like me, you never really got to
grips with Emacs (if you're using Emacs, surely you must prefer Lisp
over Smalltalk?).

Mike



Re: Re: Starting with smalltalk

Bram Neijt
On 7/5/06, Mike Anderson <[hidden email]> wrote:
> What you will find is that one of the major problems Smalltalk has as a
> language is that the dialects are sufficiently dissimilar that programs
> are not very portable, so the only programs you will find for GSt are
> those that were written for GSt. There are projects that aim to remedy
> this, eg. Sport. Porting Sport to gst would be a very useful project.
This is a problem, but with the growing number of architectures and
operating systems, it is just as hard for any other language
(probably).


> The other main problem, related to the above, is that the Smalltalk Way
> is image-based development, which unfortunately means that the easiest
> way to distribute programs is as images, not code.
>
> At a personal level, the main problem I have is that the packaging
> system is a bit inflexible, so splitting out a package is hard work.
I have not found anything about packaging yet, however this is the
kind of thing that will keep a language from ever getting out (even
out of a computer ;-) ).

>
> >> - A way of editing smalltalk files without the use of a commercial IDE
>
> This sounds as if you're thinking about commercial Smalltalks, like
> Visual Works. Actually, most other Smalltalks don't use files - you
> develop within the IDE, and code at the method level. Where the source
> code is outside of the image, it is found in a repository like Envy or
> Store, ie. a database.
I'm sorry, but if Smalltalk can't even get out of my computer, I might
just not bother to learn it at all. This does explain why I can't find
any real-life implementations on the internet (like a simple hello,
ls, find, sort or anything like that with install scripts,
documentation and comments).

Then I guess there aren't any standard command-line argument parsing
libraries in the stdlib either, right?

Greets,
  Bram

PS If all this is really like I now think it is, I can imagine why
this language never took off!



Re: Re: Starting with smalltalk

Paolo Bonzini

> This is a problem, but with the growing number of architectures and
> operating systems, it is just as hard for any other language
> (probably).
It's a bit different, and actually worse.  It's as if you had ten
different forks of Python, with some people writing for one and others
for the rest.
> I have not found anything about packaging yet, however this is the
> kind of thing that will keep a language from ever getting out (even
> out of a computer ;-) ).
I don't think the packaging system is *too* inflexible.  It's
underdeveloped, true, and feature requests will only help.

>>> - A way of editing smalltalk files without the use of a commercial IDE
>>
>> This sounds as if you're thinking about commercial Smalltalks, like
>> Visual Works. Actually, most other Smalltalks don't use files - you
>> develop within the IDE, and code at the method level. Where the source
>> code is outside of the image, it is found in a repository like Envy or
>> Store, ie. a database.
> I'm sorry, but if Smalltalk can't even get out of my computer, I might
> just not bother to learn it at all. This does explain why I can't find
> any real-life implementations on the internet (like a simple hello,
> ls, find, sort or anything like that with install scripts,
> documentation and comments).
Mike is speaking about commercial Smalltalks.  GNU Smalltalk is by
design different.  You can write your code in files, with SciTE or
Emacs.  The next version, when it comes out, will almost surely have a
more compact and less arcane syntax for defining classes, and so on.
> Then I guess there aren't any standard command-line argument parsing
> libraries in the stdlib either, right?
If you want, I can write one in half an hour. :-P  Would this syntax
satisfy you (I'm getting the command line options from autoconf)?

Smalltalk
    arguments: '-B|--prepend-include: -I|--include: -t|--trace:
-p|--preselect= -F|--freeze --help --version -v'
    do: [ :arg :option | (arg->option) printNl ].

The output could be something like

    'trace'->'AC_DEFUN'
    $v->nil
    'prepend-include'->'/usr/local/share'

if you invoked your script like

    gst -f script.st --trace=AC_DEFUN -v -B/usr/local/share
> PS If all this is really like I now think it is, I can imagine why
> this language never took off!
Maybe that's because the language was born 20 years before Python.  The
problem is not the inflexibility of the language; it is that nobody
implemented the features that people love in other languages (due to
lack of time, lack of funding, or sometimes even human stupidity).

Paolo



Re: Re: Starting with smalltalk

Paolo Bonzini

> I'll get a book on Smalltalk, take some time to read-up on the syntax
> and try Squeak to see the difference between GNU and Squeak.
You can try the tutorial that comes with GNU Smalltalk.

The differences are mostly conceptual.  Plus Squeak has a huge (and
sometimes very poorly designed) class library for graphics and much more.
> Then, I'll get back to you all.
No need to wait.  We're here to help and to understand where you have
problems.
> Nice looking commandline parser by the way. I don't understand it all
> yet, but I'll get there. In the end I'll try to make a commandline
> arguments parser and post it somewhere.
Heh... I wanted to see how far I was from my (purposely exaggerated)
30-minute estimate of the time to make one.  So I did it.

Here it is.  220 lines in ~2 hours, slightly less actually, including 30
minutes for testing (didn't have time to do SUnit tests, so they're just
commands at the end of the file).  No comments for now, I will add them
when I commit.  :-P

Paolo

"======================================================================
|
|   Smalltalk command-line parser
|
|
 ======================================================================"


"======================================================================
|
| Copyright 2006 Free Software Foundation, Inc.
| Written by Paolo Bonzini.
|
| This file is part of the GNU Smalltalk class library.
|
| The GNU Smalltalk class library is free software; you can redistribute it
| and/or modify it under the terms of the GNU Lesser General Public License
| as published by the Free Software Foundation; either version 2.1, or (at
| your option) any later version.
|
| The GNU Smalltalk class library is distributed in the hope that it will be
| useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser
| General Public License for more details.
|
| You should have received a copy of the GNU Lesser General Public License
| along with the GNU Smalltalk class library; see the file COPYING.LIB.
| If not, write to the Free Software Foundation, 59 Temple Place - Suite
| 330, Boston, MA 02110-1301, USA.  
|
 ======================================================================"


Object subclass: #Getopt
                  instanceVariableNames: 'options longOptions prefixes args currentArg actionBlock errorBlock'
                  classVariableNames: ''
                  poolDictionaries: ''
                  category: 'Language-Data types'
!

Getopt comment:
'My instances parse command-line arguments according to a pattern
string that describes the accepted short and long options.' !


!Getopt class methodsFor: 'instance creation'!

test: args with: pattern
    args do: [ :each |
        self
            parse: each subStrings
            with: pattern
            do: [ :x :y | (x->y) printNl ]
            ifError: [ (each->'error') displayNl ].
        Transcript nl ]!
   
parse: args with: pattern do: actionBlock
    ^self new
        parsePattern: pattern;
        actionBlock: actionBlock;
        errorBlock: [ ^nil ];
        parse: args!

parse: args with: pattern do: actionBlock ifError: errorBlock
    ^self new
        parsePattern: pattern;
        actionBlock: actionBlock;
        errorBlock: [ ^errorBlock value ];
        parse: args!

!Getopt methodsFor: 'parsing'!

fullOptionName: aString
    (prefixes includes: aString) ifFalse: [ errorBlock value ].
    longOptions do: [ :k |
        (k startsWith: aString) ifTrue: [ ^k ] ].
    self halt!

optionKind: aString
    | kindOrString |
    kindOrString := options at: aString ifAbsent: [ errorBlock value ].
    ^kindOrString isSymbol
        ifTrue: [ kindOrString ]
        ifFalse: [ options at: kindOrString ]!

optionName: aString
    | kindOrString |
    kindOrString := options at: aString ifAbsent: [ errorBlock value ].
    ^kindOrString isSymbol
        ifTrue: [ aString ]
        ifFalse: [ kindOrString ]!

parseRemainingArguments
    [ args atEnd ] whileFalse: [
        actionBlock value: nil value: args next ]!

parseOption: name kind: kind with: arg
    | theArg |
    theArg := arg.
    (kind = #mandatoryArg and: [ arg isNil ])
        ifTrue: [
            args atEnd ifTrue: [ errorBlock value ].
            theArg := args next ].
    (kind = #noArg and: [ theArg notNil ])
        ifTrue: [ errorBlock value ].

    actionBlock value: name value: theArg!
   
parseLongOption: argStream
    | name kind haveArg arg |
    name := argStream upTo: $=.
    argStream skip: -1.

    name := self fullOptionName: name.
    name := self optionName: name.
    kind := self optionKind: name.
    haveArg := argStream nextMatchFor: $=.
    arg := haveArg ifTrue: [ argStream upToEnd ] ifFalse: [ nil ].
    self parseOption: name kind: kind with: arg!

parseShortOptions: argStream
    | name kind ch haveArg arg |
    [ argStream atEnd ] whileFalse: [
        ch := argStream next.
        name := self optionName: ch.
        kind := self optionKind: ch.
        haveArg := kind ~~ #noArg and: [ argStream atEnd not ].
        arg := haveArg ifTrue: [ argStream upToEnd ] ifFalse: [ nil ].
        self parseOption: name kind: kind with: arg ]!

parseOneArgument
    | arg argStream |
    arg := args next.
    arg = '--' ifTrue: [ ^self parseRemainingArguments ].

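    "An empty word, a lone dash, or a word without a leading dash is a
     plain argument.  Checking the size also protects the at: 2 below."
    (arg size < 2 or: [ arg first ~= $- ])
        ifTrue: [ ^actionBlock value: nil value: arg ].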

    argStream := arg readStream.
    (arg at: 2) = $-
        ifTrue: [ argStream next: 2. self parseLongOption: argStream ]
        ifFalse: [ argStream next. self parseShortOptions: argStream ]!

parse
    [ args atEnd ] whileFalse: [ self parseOneArgument ]!
 
!Getopt methodsFor: 'initializing'!

addPrefixes: option
    longOptions add: option.
    1 to: option size do: [ :length |
        prefixes add: (option copyFrom: 1 to: length) ]!

rejectBadPrefixes
    longOptions := longOptions asSortedCollection: [ :a :b | a size <= b size ].

    prefixes := prefixes select: [ :each | (prefixes occurrencesOf: each) == 1 ].
    prefixes := prefixes asSet.
    prefixes addAll: longOptions!

initialize
    options := Dictionary new.
    longOptions := Set new.
    prefixes := Bag new!

checkSynonyms: synonyms
    (synonyms allSatisfy: [ :each | each startsWith: '-' ])
        ifFalse: [ ^self error: 'expected -' ].

    (synonyms anySatisfy: [ :each | each size < 2 or: [ each = '--' ] ])
        ifTrue: [ ^self error: 'expected option name' ].

    synonyms do: [ :each |
        ((each startsWith: '--') and: [ each includes: $= ])
            ifTrue: [ ^self error: 'unexpected = inside long option' ] ]!

colonsToKind: colons
    colons = 0 ifTrue: [ ^#noArg ].
    colons = 1 ifTrue: [ ^#mandatoryArg ].
    colons = 2 ifTrue: [ ^#optionalArg ].
    ^self error: 'too many colons, don''t know what to do with them...'!

atSynonym: synonym put: kindOrName
    | key |
    synonym size = 2
        ifTrue: [ key := synonym at: 2 ]
        ifFalse: [ key := synonym copyFrom: 3. self addPrefixes: key ].

    "Duplicates are detected on the keys; #includes: would search the values."
    (options includesKey: key) ifTrue: [ self error: 'duplicate option' ].
    options at: key put: kindOrName.
    ^key!

parseSynonyms: synonyms kind: kind
    | last |
    last := self atSynonym: synonyms last put: kind.
    synonyms from: 1 to: synonyms size - 1 do: [ :each |
        self atSynonym: each put: last ]!

parseOption: opt
    | colons optNames synonyms kind |
    optNames := opt copyWithout: $:.
    colons := opt size - optNames size.
    opt from: optNames size + 1 to: opt size do: [ :ch |
        ch = $: ifFalse: [ ^self error: 'invalid pattern, colons are hosed' ] ].

    kind := self colonsToKind: colons.
    synonyms := optNames subStrings: $|.
    self checkSynonyms: synonyms.
    self parseSynonyms: synonyms kind: kind!

parsePattern: pattern
    self initialize.
    pattern subStrings do: [ :opt | self parseOption: opt ].
    self rejectBadPrefixes!

actionBlock: aBlock
    actionBlock := aBlock!
           
errorBlock: aBlock
    errorBlock := aBlock!
           
parse: argsArray
    args := argsArray readStream.
    self parse.
    ^args contents!

"Smalltalk is an instance of SystemDictionary, so these go on the instance side."
!SystemDictionary methodsFor: 'command-line'!

arguments: pattern do: actionBlock
    ^Getopt
        parse: self arguments
        with: pattern
        do: actionBlock!

arguments: pattern do: actionBlock ifError: errorBlock
    ^Getopt
        parse: self arguments
        with: pattern
        do: actionBlock
        ifError: errorBlock! !

"Getopt new parsePattern: '-B'"
"Getopt new parsePattern: '--long'"
"Getopt new parsePattern: '--longish --longer'"
"Getopt new parsePattern: '--long --longer'"
"Getopt new parsePattern: '-B:'"
"Getopt new parsePattern: '-B::'"
"Getopt new parsePattern: '-a|-b'"
"Getopt new parsePattern: '-a|--long'"
"Getopt new parsePattern: '-a|--very-long|--long'"
"Getopt test: #('-a' '-b' '-ab' '-a -b') with: '-a -b'"
"Getopt test: #('-a' '-b' '-ab' '-a -b') with: '-a: -b'"
"Getopt test: #('-a' '-b' '-ab' '-a -b') with: '-a:: -b'"
"Getopt test: #('--longish' '--longer' '--longi' '--longe' '--lo' '-longer') with: '--longish --longer'"
"Getopt test: #('--lo' '--long' '--longe' '--longer') with: '--long --longer'"
"Getopt test: #('--noarg' '--mandatory' '--mandatory foo' '--mandatory=' '--mandatory=foo' '--optional' '--optional foo') with: '--noarg --mandatory: --optional::'"
"Getopt test: #('-a' '-b') with: '-a|-b'"
"Getopt test: #('--long' '-b') with: '-b|--long'"
"Getopt test: #('--long=x' '-bx') with: '-b|--long:'"
"Getopt test: #('-b' '--long' '--very-long') with: '-b|--very-long|--long'"
"Getopt test: #('--long=x' '--very-long x' '-bx') with: '-b|--very-long|--long:'"
"Getopt test: #('-b -- -b' '-- -b' '-- -b -b') with: '-b'"


Re: Re: Starting with smalltalk

Bram Neijt
On 7/6/06, Mike Anderson <[hidden email]> wrote:
> Well, that was a bit inflammatory, but if it was just code snippets you
> were after, try this:
True, it was. Mainly because I heard 'the language is great' from
people who use it, and I've seen a few videos of Alan Kay about how
great it is and that he can't understand why it isn't used more often.

So I felt like people were saying "this is art!" and I just couldn't see it.

Thanks a lot for the sources. I'll try them out and probably build some
pages with info as I come across more code and learn more.

Greetings,
  Bram

PS One of the videos I'm referring to can be found here:
http://video.google.com/videoplay?docid=-2950949730059754521



[Q] Unicode String?

Chun, Sungjin
Hi,

I've tried GNU Smalltalk and it seems good to me. But I have a
problem: the current implementation does not support Unicode. It seems
that it supports single-byte characters only. I've also tried
Squeak, which seems slower than GNU Smalltalk (I'm not sure about
this; it might not be correct) but has a Unicode-compatible string
implementation, and I think this kind of approach is good. Is there
any chance of having a Unicode-compatible string implementation in the
next version of GNU Smalltalk?

Thanks in advance.



Re: [Q] Unicode String?

Paolo Bonzini
Chun Sungjin wrote:

> Hi,
>
> I've tried GNU Smalltalk and it seems good to me. But I have a
> problem: the current implementation does not support Unicode. It seems
> that it supports single-byte characters only. I've also tried
> Squeak, which seems slower than GNU Smalltalk (I'm not sure about
> this; it might not be correct) but has a Unicode-compatible string
> implementation, and I think this kind of approach is good. Is there any
> chance of having a Unicode-compatible string implementation in the next
> version of GNU Smalltalk?
What do you need exactly?  The main missing thing is support for
Character objects with values above 255.  However if you are content
with multibyte character sets like UTF-8, or with Unicode character
codes, that's fine.

For character set translation, if you load the I18N package, GNU
Smalltalk gets an iconv wrapper.  The main method you need is
EncodedStream>>#on:from:to: (e.g. on: 'abc' from: 'UTF-8' to: 'UCS-4').

To extract Unicode character codes from a UCS-4LE encoded string, you
can use (ByteStream on: x asByteArray) and send nextLong.  For
big-endian, there is no class but I was thinking of adding a #bigEndian
method to ByteStream for the next version.
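
As a sketch of what that extraction amounts to, the loop below assembles the character codes by hand from a little-endian byte sequence, without relying on ByteStream. The sample bytes are made up for illustration: they encode 'a' (97) and the euro sign (U+20AC, 8364) in UCS-4LE. For a real string, the bytes would come from x asByteArray as above.

```smalltalk
| bytes codes |
"Two UCS-4LE characters, 16r61 and 16r20AC, least significant byte first."
bytes := #[97 0 0 0 172 32 0 0].
codes := OrderedCollection new.
1 to: bytes size by: 4 do: [ :i |
    "Combine four bytes into one 32-bit character code."
    codes add: (bytes at: i)
        + ((bytes at: i + 1) bitShift: 8)
        + ((bytes at: i + 2) bitShift: 16)
        + ((bytes at: i + 3) bitShift: 24) ].
codes printNl!
```

The collection that gets printed holds the two codes 97 and 8364. For big-endian data the same loop works with the shift amounts reversed.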

Things that could be useful are
    Integer>>#asUTF8String
    String class>>#utf8FromCodepoint: (same as above)
    String>>#utf8Stream
    UTF8Stream (returns Unicode character codes)
    ... (tell me what you need) ...

Paolo



Re: [Q] Unicode String?

Chun, Sungjin
Hi,

The main problem is that if, for example, I create a string
like this:

a := 'Some MultiByte Encoded String'.

then

a size

does not answer the correct length of the string.

However, I will try what you said. Thank you.

On Jul 7, 2006, at 4:03 PM, Paolo Bonzini wrote:

> Chun Sungjin wrote:
>> Hi,
>>
>> I've tried GNU Smalltalk and it seems good to me. But I have a
>> problem: the current implementation does not support Unicode. It seems
>> that it supports single-byte characters only. I've also tried
>> Squeak, which seems slower than GNU Smalltalk (I'm not sure about
>> this; it might not be correct) but has a Unicode-compatible string
>> implementation, and I think this kind of approach is good. Is there
>> any chance of having a Unicode-compatible string implementation in
>> the next version of GNU Smalltalk?
> What do you need exactly?  The main missing thing is support for
> Character objects with values above 255.  However if you are
> content with multibyte character sets like UTF-8, or with Unicode
> character codes, that's fine.
>
> For character set translation, if you load the I18N package, GNU
> Smalltalk gets an iconv wrapper.  The main method you need is
> EncodedStream>>#on:from:to: (e.g. on: 'abc' from: 'UTF-8' to:
> 'UCS-4').
>
> To extract Unicode character codes from a UCS-4LE encoded string,
> you can use (ByteStream on: x asByteArray) and send nextLong.  For
> big-endian, there is no class but I was thinking of adding a
> #bigEndian method to ByteStream for the next version.
>
> Things that could be useful are
>    Integer>>#asUTF8String
>    String class>>#utf8FromCodepoint: (same as above)
>    String>>#utf8Stream
>    UTF8Stream (returns Unicode character codes)
>    ... (tell me what you need) ...
>
> Paolo




Re: {Spam?} Re: [Q] Unicode String?

Paolo Bonzini
Chun Sungjin wrote:

> Hi,
>
> The main problem is that if, for example, I create a string
> like this:
>
> a := 'Some MultiByte Encoded String'.
>
> then
>
> a size
>
> does not answer the correct length of the string.
Well, strlen does not in C either.  You need mbrlen, and #size is more
like strlen than mbrlen.

Also, the result heavily depends on the chosen character set.  If we
want to have #utf8Size, that's fine.  But #size should be the number of
*bytes*, not of characters.
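
To make the distinction concrete, here is a rough sketch of what a #utf8Size-style count could do, assuming the data really is UTF-8: count every byte except the continuation bytes (those of the form 10xxxxxx). The sample bytes are made up for illustration; for a String they would come from asByteArray. This is only an illustration, not an existing method.

```smalltalk
| bytes count |
"Assumption: the bytes are UTF-8.  These are 'caf' followed by the
 two-byte sequence for e-acute (16rC3 16rA9)."
bytes := #[99 97 102 195 169].
count := 0.
bytes do: [ :b |
    "Continuation bytes have their top two bits set to 10; skip them."
    (b bitAnd: 16rC0) = 16r80 ifFalse: [ count := count + 1 ] ].

"Number of bytes, which is what #size answers."
bytes size printNl.
"Number of characters."
count printNl!
```

Here the first line printed is 5 and the second is 4. The count is only meaningful for UTF-8; another multibyte character set needs a different rule, which is why the result depends on the chosen encoding.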

I'm seeing now if I can add an EncodedStream method that extracts
Unicode characters.  Then what you wanted would be something like

    (EncodedStream wordsOn: 'some string') contents size

for which, of course, we can add a utility method.

Paolo



Why string should be collection of single byte characters? (WAS: Re: [Q] Unicode String?)

Chun, Sungjin
Hi,

For me, a string should not be limited to a collection of single-byte
characters. A string is a string, not a simple collection of bytes, isn't
it? I think Squeak's approach (or OpenStep's approach, with an abstract
public string class and concrete private subclasses that implement the
several kinds of string) is a good one. But since I'm not currently working
hard on GNU Smalltalk, this may not be the best idea for GNU Smalltalk's
case :-)

PS)
I DO think that strlen is not meant for Unicode (actually multi-byte
encoded) strings and is a bad design: it is limited to single-byte
encodings. I DO think that a modern language should consider Unicode-like
strings. I DO think Smalltalk is MODERN :-)

----- Original Message -----
From: "Paolo Bonzini" <[hidden email]>
To: "Chun Sungjin" <[hidden email]>
Cc: "GNU Smalltalk" <[hidden email]>
Sent: Friday, July 07, 2006 6:17 PM
Subject: Re: {Spam?} Re: [Help-smalltalk] [Q] Unicode String?


> Chun Sungjin wrote:
> > Hi,
> >
> > The main problem is that if, for example, I create a string
> > like this:
> >
> > a := 'Some MultiByte Encoded String'.
> >
> > then
> >
> > a size
> >
> > does not answer the correct length of the string.
> Well, strlen does not in C either.  You need mbrlen, and #size is more
> like strlen than mbrlen.
>
> Also, the result heavily depends on the chosen character set.  If we
> want to have #utf8Size, that's fine.  But #size should be the number of
> *bytes*, not of characters.
>
> I'm seeing now if I can add an EncodedStream method that extracts
> Unicode characters.  Then what you wanted would be something like
>
>     (EncodedStream wordsOn: 'some string') contents size
>
> for which, of course, we can add a utility method.
>
> Paolo
>




Re: {Spam?} Why string should be collection of single byte characters? (WAS: Re: [Q] Unicode String?)

Paolo Bonzini
Sungjin Chun wrote:
> Hi,
>
> For me, a string should not be limited to a collection of single-byte
> characters. A string is a string, not a simple collection of bytes, isn't
> it? I think Squeak's approach (or OpenStep's approach, with an abstract
> public string class and concrete private subclasses that implement the
> several kinds of string) is a good one. But since I'm not currently working
> hard on GNU Smalltalk, this may not be the best idea for GNU Smalltalk's
> case :-)
>
There's already CharacterArray as a superclass of String.  It probably
would not be hard to have a UnicodeString subclass of CharacterArray,
and use that instead of WordArray inside the I18N package.  I'd also
need UnicodeCharacter, probably.

I'm working on it in my spare time; I attach my current prototype patch.

Paolo

--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -89,70 +89,70 @@
 
 Namespace current: Smalltalk.I18N.Encoders!
 
-Encoder subclass: #FromUCS4
+Encoder subclass: #FromUTF32
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'i18n-Character sets'!
 
-FromUCS4 comment:
-'This class is a superclass for classes that convert from UCS4
+FromUTF32 comment:
+'This class is a superclass for classes that convert from UTF-32
 characters (encoded as 32-bit Integers) to bytes in another
 encoding (encoded as Characters).'!
 
-Encoder subclass: #ToUCS4
+Encoder subclass: #ToUTF32
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'i18n-Character sets'!
 
-ToUCS4 comment:
+ToUTF32 comment:
 'This class is a superclass for classes that convert from bytes
-(encoded as Characters) to UCS4 characters (encoded as 32-bit
+(encoded as Characters) to UTF-32 characters (encoded as 32-bit
 Integers to simplify the code and to avoid endianness conversions).'!
 
-ToUCS4 subclass: #ComposeUCS4LE
+ToUTF32 subclass: #ComposeUTF32LE
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'i18n-Character sets'!
 
-ComposeUCS4LE comment:
-'This class is used internally to provide UCS4 characters encoded as
-32-bit integers for a descendent of FromUCS4, when the starting
+ComposeUTF32LE comment:
+'This class is used internally to provide UTF-32 characters encoded as
+32-bit integers for a descendent of FromUTF32, when the starting
 encoding is little-endian.'!
 
-ToUCS4 subclass: #ComposeUCS4BE
+ToUTF32 subclass: #ComposeUTF32BE
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'i18n-Character sets'!
 
-ComposeUCS4BE comment:
-'This class is used internally to provide UCS4 characters encoded as
-32-bit integers for a descendent of FromUCS4, when the starting
+ComposeUTF32BE comment:
+'This class is used internally to provide UTF-32 characters encoded as
+32-bit integers for a descendent of FromUTF32, when the starting
 encoding is big-endian.'!
 
-FromUCS4 subclass: #SplitUCS4LE
+FromUTF32 subclass: #SplitUTF32LE
        instanceVariableNames: 'wch'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'i18n-Character sets'!
 
-SplitUCS4LE comment:
+SplitUTF32LE comment:
 'This class is used internally to split into four 8-bit characters
-the 32-bit UCS4 integers coming from a descendent of ToUCS4, when
+the 32-bit UTF-32 integers coming from a descendent of ToUTF32, when
 the destination encoding is little-endian.'!
 
-FromUCS4 subclass: #SplitUCS4BE
+FromUTF32 subclass: #SplitUTF32BE
        instanceVariableNames: 'count wch'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'i18n-Character sets'!
 
-SplitUCS4BE comment:
+SplitUTF32BE comment:
 'This class is used internally to split into four 8-bit characters
-the 32-bit UCS4 integers coming from a descendent of ToUCS4, when
+the 32-bit UTF-32 integers coming from a descendent of ToUTF32, when
 the destination encoding is big-endian.'!
 
 Encoder subclass: #Iconv
@@ -166,21 +166,21 @@
 'This class is used to delegate the actual character set conversion
 to the C library''s iconv function.  Most conversions use iconv as
 the only step in the conversions, sometimes the structure is
-ToUCS4+SplitUCS4xx+Iconv or Iconv+ComposeUCS4xx+FromUCS4, rarely
+ToUTF32+SplitUTF32xx+Iconv or Iconv+ComposeUTF32xx+FromUTF32, rarely
 Iconv is skipped altogether and only Smalltalk converters are used.'!
 
-FromUCS4 subclass: #ToUTF7
+FromUTF32 subclass: #ToUTF7
  instanceVariableNames: 'left value lookahead'
  classVariableNames: 'Base64Characters DirectCharacters ToBase64'
  poolDictionaries: ''
  category: 'i18n-Encodings'!
 
 ToUTF7 comment:
-'This class implements a converter that transliterates UCS4
+'This class implements a converter that transliterates UTF-32
 characters (encoded as 32-bit Integers) to UTF-7 encoded
 characters.'!
 
-ToUCS4 subclass: #FromUTF7
+ToUTF32 subclass: #FromUTF7
  instanceVariableNames: 'shift wch lookahead'
  classVariableNames: 'DirectCharacters FromBase64'
  poolDictionaries: ''
@@ -188,7 +188,7 @@
 
 ToUTF7 comment:
 'This class implements a converter that transliterates UTF-7
-encoded characters to UCS4 values (encoded as 32-bit Integers).'!
+encoded characters to UTF-32 values (encoded as 32-bit Integers).'!
 
 Namespace current: Smalltalk.I18N!
 
@@ -241,9 +241,9 @@
 !Encoder methodsFor: 'private - initialization'!
 
 initializeFrom: fromEncoding to: toEncoding origin: aStringOrStream
-    origin := aStringOrStream isString
- ifTrue: [ aStringOrStream readStream ]
- ifFalse: [ aStringOrStream ].
+    origin := (aStringOrStream isKindOf: Stream)
+ ifFalse: [ aStringOrStream readStream ]
+ ifTrue: [ aStringOrStream ].
 
     self flush
 ! !
@@ -258,27 +258,27 @@
     }
 !
 
-registerEncoderFor: arrayOfAliases toUCS4: toUCS4Class fromUCS4: fromUCS4Class
+registerEncoderFor: arrayOfAliases toUTF32: toUTF32Class fromUTF32: fromUTF32Class
     "Register the two classes that will respectively convert from the
-     charsets in arrayOfAliases to UCS4 and vice versa.
+     charsets in arrayOfAliases to UTF-32 and vice versa.
 
      The former class is a stream that accepts characters and returns
-     (via #next) integers representing UCS-4 character codes, while
-     the latter accepts UCS-4 character codes and converts them to
+     (via #next) integers representing UTF-32 character codes, while
+     the latter accepts UTF-32 character codes and converts them to
      characters.  For an example see respectively FromUTF7 and ToUTF7
      (I admit it is not a trivial example)."
 
     EncodersRegistry := EncodersRegistry copyWith:
- { arrayOfAliases. toUCS4Class. fromUCS4Class }
+ { arrayOfAliases. toUTF32Class. fromUTF32Class }
 ! !
 
 !EncodedStream class methodsFor: 'private - triangulating'!
 
 bigEndianPivot
     "When only one of the sides is implemented in Smalltalk
-     and the other is obtained via iconv, we use UCS-4 to
+     and the other is obtained via iconv, we use UTF-32 to
      marshal data from Smalltalk to iconv; answer whether we
-     should encode UCS-4 characters as big-endian."
+     should encode UTF-32 characters as big-endian."
     ^Memory bigEndian
 !
 
@@ -287,29 +287,119 @@
      and the other is obtained via iconv, we need a common
      pivot encoding to marshal data from Smalltalk to iconv.
      Answer the iconv name of this encoding."
-    ^self bigEndianPivot ifTrue: [ 'UCS-4BE' ] ifFalse: [ 'UCS-4LE' ]
+    ^self bigEndianPivot ifTrue: [ 'UTF-32BE' ] ifFalse: [ 'UTF-32LE' ]
 !
 
-split: input
+split: input to: encoding
     "Answer a pipe with the given input stream (which produces
-     UCS-4 character codes as integers) and whose output is
+     UTF-32 character codes as integers) and whose output is
      a series of Characters in the required pivot encoding"
-    ^self bigEndianPivot
- ifTrue: [ SplitUCS4BE on: input from: 'words' to: 'UCS4-BE' ]
- ifFalse: [ SplitUCS4LE on: input from: 'words' to: 'UCS4-LE' ].
+    ^(encoding = 'UCS-4BE' or: [ encoding = 'UTF-32BE' ])
+ ifTrue: [ SplitUTF32BE on: input from: 'UTF-32' to: encoding ]
+ ifFalse: [ SplitUTF32LE on: input from: 'UTF-32' to: encoding ].
 !
 
-compose: input
+compose: input from: encoding
     "Answer a pipe with the given input stream (which produces
      Characters in the required pivot encoding) and whose output
-     is a series of integer UCS-4 character codes."
-    ^self bigEndianPivot
- ifTrue: [ ComposeUCS4BE on: input from: 'UCS4-BE' to: 'words' ]
- ifFalse: [ ComposeUCS4LE on: input from: 'UCS4-LE' to: 'words' ].
+     is a series of integer UTF-32 character codes."
+    ^(encoding = 'UCS-4BE' or: [ encoding = 'UTF-32BE' ])
+ ifTrue: [ ComposeUTF32BE on: input from: encoding to: 'UTF-32' ]
+ ifFalse: [ ComposeUTF32LE on: input from: encoding to: 'UTF-32' ].
 ! !
 
 !EncodedStream class methodsFor: 'instance creation'!
 
+encoding: aWordArray
+    "Answer a pipe of encoders that converts aWordArray (which contains
+     Integers for the Unicode values) to the current locale's default
+     charset."
+    ^self
+ encoding: aWordArray
+ as: Locale default charset
+!
+
+encoding: aStringOrStream as: toEncoding
+    "Answer a pipe of encoders that converts aStringOrStream (which contains
+     Integers for the Unicode values) to the supplied encoding (which
+     can be an ASCII String or Symbol)."
+    | pivot to encoderTo pipe |
+
+    "Adopt an uniform naming"
+    to := toEncoding asString.
+    (to = 'UTF-32' or: [ to = 'UCS-4' ])
+ ifTrue: [ to := self pivotEncoding ].
+    (to = 'UTF-16' or: [ to = 'UCS-2' ])
+ ifTrue: [ to := self pivotEncoding copyReplaceAll: '32' with: '16' ].
+
+    "If converting to the pivot encoding, we're done."
+    pivot := 'UTF-32'.
+    ((to startsWith: 'UCS-4') or: [ to startsWith: 'UTF-32' ])
+ ifTrue: [ pivot := to ].
+    pivot = 'UTF-32' ifTrue: [ pivot := self pivotEncoding ].
+
+    encoderTo := Iconv.
+    EncodersRegistry do: [ :each |
+ ((each at: 1) includes: to)
+    ifTrue: [ encoderTo := each at: 2 ]
+    ].
+
+    pipe := aStringOrStream.
+
+    "Split UTF-32 character codes into bytes if needed by iconv."
+    encoderTo == Iconv ifTrue: [ pipe := self split: pipe to: pivot ].
+
+    "If not converting to the pivot encoding, we need one more step."
+    to = pivot ifFalse: [
+        pipe := encoderTo on: pipe from: pivot to: toEncoding ].
+    ^pipe
+!
+
+unicodeOn: aStringOrStream
+    "Answer a pipe of encoders that converts aStringOrStream (which can
+     be a string or another stream) from the current locale's default
+     charset to integers representing Unicode character codes."
+    ^self
+ unicodeOn: aStringOrStream
+ encoding: Locale default charset
+!
+
+unicodeOn: aStringOrStream encoding: fromEncoding
+    "Answer a pipe of encoders that converts aStringOrStream
+     (which can be a string or another stream) from the supplied
+     encoding (which can be an ASCII String or Symbol) to
+     integers representing Unicode character codes."
+    | from pivot encoderFrom pipe |
+
+    "Adopt an uniform naming"
+    from := fromEncoding asString.
+    (from = 'UTF-32' or: [ from = 'UCS-4' ])
+ ifTrue: [ from := aStringOrStream utf32Encoding ].
+    (from = 'UTF-16' or: [ from = 'UCS-2' ])
+ ifTrue: [ from := aStringOrStream utf16Encoding ].
+
+    pivot := 'UTF-32'.
+    ((from startsWith: 'UCS-4') or: [ from startsWith: 'UTF-32' ])
+ ifTrue: [ pivot := from ].
+    pivot = 'UTF-32' ifTrue: [ pivot := self pivotEncoding ].
+
+    encoderFrom := Iconv.
+    EncodersRegistry do: [ :each |
+ ((each at: 1) includes: from)
+    ifTrue: [ encoderFrom := each at: 2 ]
+    ].
+
+    pipe := aStringOrStream.
+
+    "If not converting from the pivot encoding, we need one more step."
+    from = pivot ifFalse: [
+        pipe := encoderFrom on: aStringOrStream from: fromEncoding to: pivot ].
+
+    "Compose iconv-produced bytes into UTF-32 character codes if needed."
+    encoderFrom == Iconv ifTrue: [ pipe := self compose: pipe from: pivot ].
+    ^pipe
+!
+
 on: aStringOrStream from: fromEncoding
     "Answer a pipe of encoders that converts aStringOrStream
      (which can be a string or another stream) from the given
@@ -340,8 +430,20 @@
     "Adopt an uniform naming"
     from := fromEncoding asString.
     to := toEncoding asString.
-    from = 'UCS-4' ifTrue: [ from := 'UCS-4BE' ].
-    to = 'UCS-4' ifTrue: [ to := 'UCS-4BE' ].
+    (from = 'UTF-32' or: [ from = 'UCS-4' ])
+ ifTrue: [ from := aStringOrStream utf32Encoding ].
+    (from = 'UTF-16' or: [ from = 'UCS-2' ])
+ ifTrue: [ from := aStringOrStream utf16Encoding ].
+    (to = 'UTF-32' or: [ to = 'UCS-4' ])
+ ifTrue: [ to := self pivotEncoding ].
+    (to = 'UTF-16' or: [ to = 'UCS-2' ])
+ ifTrue: [ to := self pivotEncoding copyReplaceAll: '32' with: '16' ].
+
+    ((from startsWith: 'UCS-4') or: [ from startsWith: 'UTF-32' ])
+ ifTrue: [ pivot := from ].
+    ((to startsWith: 'UCS-4') or: [ to startsWith: 'UTF-32' ])
+ ifTrue: [ pivot := to ].
+    pivot = 'UTF-32' ifTrue: [ pivot := self pivotEncoding ].
 
     encoderFrom := encoderTo := Iconv.
     EncodersRegistry do: [ :each |
@@ -358,12 +460,12 @@
     "Else answer a `pipe' that takes care of triangulating.
      There is an additional complication: Smalltalk encoders
      read or provide a stream of character codes (respectively
-     if the source is UCS-4, or the target is UCS-4), while iconv
+     if the source is UTF-32, or the target is UTF-32), while iconv
      expects raw bytes.  So we add an intermediate layer if
      a mixed Smalltalk+iconv conversion is done: it converts
-     character codes --> bytes (SplitUCS4xx, used if iconv will
-     convert from UCS-4) or bytes --> character code (ComposeUCS4xx,
-     used if iconv will convert to UCS-4).
+     character codes --> bytes (SplitUTF32xx, used if iconv will
+     convert from UTF-32) or bytes --> character code (ComposeUTF32xx,
+     used if iconv will convert to UTF-32).
 
      There are five different cases (remember that at least one converter
      is not iconv, so `both use iconv' and `from = pivot = to' are banned):
@@ -373,7 +475,6 @@
  from uses iconv --> iconv + Compose + non-iconv (implies to ~= pivot)
  none uses iconv --> non-iconv + non-iconv (implies neither = pivot)"
 
-    pivot := self pivotEncoding.
     pipe := aStringOrStream.
     from = pivot
  ifFalse: [
@@ -382,16 +483,16 @@
 
     pipe := encoderFrom on: pipe from: fromEncoding to: pivot.
     encoderTo == Iconv ifTrue: [
- pipe := self split: pipe.
+ pipe := self split: pipe to: pivot.
 
  "Check if we already reached the destination format."
  to = pivot ifTrue: [ ^pipe ].
     ].
  ].
 
-    "Compose iconv-produced bytes into UCS-4 character codes if needed."
+    "Compose iconv-produced bytes into UTF-32 character codes if needed."
     encoderFrom == Iconv ifTrue: [
- pipe := self compose: pipe
+ pipe := self compose: pipe from: pivot
     ].
 
     ^encoderTo on: pipe from: pivot to: toEncoding.
@@ -399,7 +500,7 @@
 
 Namespace current: Smalltalk.I18N.Encoders!
 
-!FromUCS4 methodsFor: 'stream operation'!
+!FromUTF32 methodsFor: 'stream operation'!
 
 species
     "We answer a string of Characters encoded in our destination
@@ -407,15 +508,15 @@
     ^String
 ! !
 
-!ToUCS4 methodsFor: 'stream operation'!
+!ToUTF32 methodsFor: 'stream operation'!
 
 species
-    "We answer a WordArray of UCS4 characters encoded as a series of
+    "We answer a WordArray of UTF-32 characters encoded as a series of
      32-bit Integers."
     ^WordArray
 ! !
 
-!ComposeUCS4LE methodsFor: 'stream operation'!
+!ComposeUTF32LE methodsFor: 'stream operation'!
 
 next
     "Answer a 32-bit integer obtained by reading four 8-bit character
@@ -426,7 +527,7 @@
      (self nextInput asInteger bitShift: 24)
 ! !
 
-!ComposeUCS4BE methodsFor: 'stream operation'!
+!ComposeUTF32BE methodsFor: 'stream operation'!
 
 next
     "Answer a 32-bit integer obtained by reading four 8-bit character
@@ -439,7 +540,7 @@
           self nextInput asInteger    
 ! !
 
-!SplitUCS4LE methodsFor: 'stream operation'!
+!SplitUTF32LE methodsFor: 'stream operation'!
 
 atEnd
     "Answer whether the receiver can produce more characters"
@@ -474,7 +575,7 @@
     wch := 1
 ! !
 
-!SplitUCS4BE methodsFor: 'stream operation'!
+!SplitUTF32BE methodsFor: 'stream operation'!
 
 atEnd
     "Answer whether the receiver can produce more characters"
@@ -670,7 +771,7 @@
 !ToUTF7 class methodsFor: 'initialization'!
 
 initialize
-    "Initialize the tables used by the UCS4-to-UTF7 converter"
+    "Initialize the tables used by the UTF-32-to-UTF-7 converter"
 
     Base64Characters := #[
         16r00 16r00 16r00 16r00 16r00 16rA8 16rFF 16r03
@@ -806,7 +907,7 @@
 !FromUTF7 class methodsFor: 'initialization'!
 
 initialize
-    "Initialize the tables used by the UTF7-to-UCS4 converter"
+    "Initialize the tables used by the UTF-7-to-UTF-32 converter"
 
     FromBase64 := #[
  62 99 99 99 63
@@ -842,7 +943,7 @@
 !FromUTF7 methodsFor: 'converting'!
 
 atEnd
-    "Answer whether the receiver can produce another UCS4 32-bit
+    "Answer whether the receiver can produce another UTF-32 32-bit
      encoded integer"
     ^lookahead isNil
 !
@@ -878,7 +979,7 @@
     "Flush any remaining state left in the encoder by the last character
      (this is because UTF-7 encodes 6 bits at a time, so it takes three
      characters before it can provide a single 16-bit character and
-     up to six characters before it can provide a full UCS-4 character)."
+     up to six characters before it can provide a full UTF-32 character)."
     shift := 0.
     lookahead := self getNext.
 ! !
@@ -977,16 +1078,42 @@
 description
     "Answer a textual description of the exception."
     ^'unknown encoding specified'! !
-
+
 
 "Now add some extensions to the system classes"
 
-(CharacterArray classPool includesKey: #DefaultEncoding)
-    ifFalse: [ CharacterArray addClassVarName: #DefaultEncoding ]!
-
 !CharacterArray class methodsFor: 'converting'!
 
 defaultEncoding
+    self subclassResponsibility!
+
+!CharacterArray methodsFor: 'converting'!
+
+encoding
+    "Answer the encoding of the receiver, assuming it is in the
+     default locale's default charset"
+
+    self class defaultEncoding asString = 'UTF-16'
+ ifTrue: [ ^self utf16Encoding ].
+    self class defaultEncoding asString = 'UTF-32'
+ ifTrue: [ ^self utf32Encoding ].
+    ^self class defaultEncoding!
+
+utf16Encoding
+    "Answer the encoding of the receiver, assuming it's UTF-16"
+    ^Memory bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [ 'UTF-16LE' ]!
+
+utf32Encoding
+    "Answer the encoding of the receiver, assuming it's UTF-32"
+    ^Memory bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [ 'UTF-32LE' ]! !
+
+
+(String classPool includesKey: #DefaultEncoding)
+    ifFalse: [ String addClassVarName: #DefaultEncoding ]!
+
+!String class methodsFor: 'converting'!
+
+defaultEncoding
     "Answer the default locale's default charset"
     DefaultEncoding isNil
  ifTrue: [ DefaultEncoding := Locale default charset ].
@@ -999,15 +1126,28 @@
     DefaultEncoding := aString
 ! !
 
-!CharacterArray methodsFor: 'converting'!
+!String methodsFor: 'converting'!
 
-encoding
-    "Answer the encoding of the receiver, assuming it is in the
-     default locale's default charset"
+asUnicode
+    "Return a WordArray with the contents of the receiver, interpreted
+     as the default locale character set."
+    ^(EncodedStream unicodeOn: self) contents!
+
+asUnicode: aString
+    "Return a WordArray with the contents of the receiver, interpreted
+     as being in the character set named by aString."
+    ^(EncodedStream unicodeOn: self encoding: aString) contents! !
+
+!String methodsFor: 'converting'!
+
+utf32Encoding
+    "Assuming the receiver is encoded as UTF-32 with a proper
+     endianness marker, answer the correct encoding of the receiver."
 
-    ^self class defaultEncoding asString = 'UTF-16'
- ifTrue: [ self utf16Encoding ]
- ifFalse: [ self class defaultEncoding ]
+    | b1 b2 bigEndian |
+    b1 := self at: 1. "Low byte"
+    bigEndian := b1 = 0.
+    ^bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [ 'UTF-32LE' ]
 !
 
 utf16Encoding
@@ -1026,12 +1166,53 @@
     ^bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [ 'UTF-16LE' ]
 ! !
 
+
+!WordArray class methodsFor: 'converting'!
+
+defaultEncoding
+    ^'UTF-32'! !
+
+!WordArray methodsFor: 'converting'!
+
+utf16Encoding
+    self shouldNotImplement! !
+
+!WordArray methodsFor: 'converting'!
+
+encoded
+    "Return a String with the contents of the receiver, converted
+     into the default locale character set."
+    ^(EncodedStream encoding: self) contents!
+
+encodedAs: aString
+    "Return a String with the contents of the receiver, converted
+     into the character set named by aString."
+    ^(EncodedStream encoding: self as: aString) contents!
+
+
 !PositionableStream methodsFor: 'converting'!
 
 encoding
     "Answer the encoding of the underlying collection"
-    ^collection encoding
-! !
+    ^collection encoding!
+
+utf16Encoding
+    "Answer the encoding of the underlying collection, assuming it's UTF-16"
+    ^collection utf16Encoding!
+
+utf32Encoding
+    "Answer the encoding of the underlying collection, assuming it's UTF-32"
+    ^collection utf32Encoding! !
+
+!Stream methodsFor: 'converting'!
+
+utf16Encoding
+    "Answer the encoding of the underlying collection, assuming it's UTF-16"
+    ^Memory bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [ 'UTF-16LE' ]!
+
+utf32Encoding
+    "Answer the encoding of the underlying collection, assuming it's UTF-32"
+    ^Memory bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [ 'UTF-32LE' ]! !
 
 Encoders.ToUTF7 initialize!
 Encoders.FromUTF7 initialize!
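
For readers following the patch: the ComposeUTF32LE/BE classes glue four
bytes back into one 32-bit code point, and the only difference between
the two is the order in which the bytes are weighted. A minimal sketch
of that idea, independent of the classes in the patch (the block names
composeLE and composeBE and the sample bytes are made up for
illustration, not taken from Sets.st):

```smalltalk
"Compose four bytes into one 32-bit code point, in both byte orders.
 The sample bytes spell code point 279 (16r0117)."
| le be composeLE composeBE |
composeLE := [ :bytes |
    (bytes at: 1) +
    ((bytes at: 2) bitShift: 8) +
    ((bytes at: 3) bitShift: 16) +
    ((bytes at: 4) bitShift: 24) ].
composeBE := [ :bytes |
    ((bytes at: 1) bitShift: 24) +
    ((bytes at: 2) bitShift: 16) +
    ((bytes at: 3) bitShift: 8) +
    (bytes at: 4) ].
le := composeLE value: #[16r17 16r01 0 0].   "least significant byte first"
be := composeBE value: #[0 0 16r01 16r17].   "most significant byte first"
le printNl.
be printNl!
```

Both lines should print 279; the Split classes do the reverse, emitting
the four bytes of each integer in the chosen order.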

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Re: {Spam?} Why string should be collection of single byte characters? (WAS: Re: [Q] Unicode String?)

Paolo Bonzini
In reply to this post by Chun, Sungjin

> I DO think that strlen is not for unicode(actually multi-byte encoded case)
> string and is bad design: limited to single byte encoding.
>  
I think the issue is a different one.  strlen counts bytes; to count
characters you have to walk the string with something like mbrlen.  In
Smalltalk #size returns allocation units: only if we stored everything
in UTF-32 (no, UTF-16 would not suffice) would this mean characters.
>  I DO think that
> modern language should consider unicode like string. I DO think Smalltalk is
> MODERN :-)
>  
I do think that modern languages should support Unicode, and you're right
that GNU Smalltalk (mostly) does not.  I don't think they should dismiss
byte-based character encodings like UTF-8, though.  In my opinion these
should remain the primary representation, especially when, as in UTF-8,
it is easy to find the first byte of a character (unlike JIS-0212 or
GB-2312) and there is no need for escape sequences (unlike ISO-2022).
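
To make the bytes-versus-characters point concrete, here is a minimal
sketch that uses only a ByteArray and integer arithmetic, so it does not
depend on any Unicode support in the image. It relies on the UTF-8
property mentioned above: continuation bytes always look like
2r10xxxxxx, so every other byte starts a character. The sample bytes
are made up for illustration.

```smalltalk
"Count characters in a UTF-8 byte sequence by counting the bytes
 that are not continuation bytes (those of the form 2r10xxxxxx)."
| bytes count |
bytes := #[97 196 151 98].    "'a', one two-byte character, 'b'"
count := 0.
bytes do: [ :each |
    (each bitAnd: 16rC0) = 16r80 ifFalse: [ count := count + 1 ] ].
bytes size printNl.    "allocation units, i.e. bytes"
count printNl!
```

This should print 4 and then 3: #size sees four bytes, while only three
of them start a character.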

Paolo



Re: {Spam?} Why string should be collection of single byte characters? (WAS: Re: [Q] Unicode String?)

Paolo Bonzini
In reply to this post by Paolo Bonzini

> I'm working on it in my spare time, I attach my current prototype patch.
I have almost completed this; it's only about 400 lines of new code,
mostly in i18n/Sets.st. I have defined a new UnicodeString class, and
modified Character to have support for characters whose Unicode code
point is > 255. Also, for ease of testing and usage, I've defined a
syntax $<279> that allows you to refer to a Character by its numeric
code point. It's equivalent to "279 asCharacter" -- I could have
inlined that expression at compile time instead, but I prefer to also
have a more compact syntax.

The changes are mostly backwards compatible, but characters should *not*
be compared with ==, but with = unless you're sure the code point is <=
255. Similarly, they should *not* be printed with nextPut:, but with
display:, unless you're sure the code point is <= 127.


What follows are some use cases. This is in a UTF-8 locale but (subject
to the capabilities of your system's iconv function) it works as well
for every other locale.

I am no expert on the *needs* of people using Unicode, so can you
please confirm that it is (close to) what you need? In particular, I'd
like feedback on what to do when transcoding is not enabled, because
right now the behavior is inconsistent: see the notes preceded by ***.

Without the I18N package, the behavior is incomplete: you can store
Unicode characters, but not print them correctly.

Printing a Unicode character:
st> $<279> printNl!
$<16r0117>

Converting a Unicode character to String:
*** maybe should consider returning '?'
st> $<279> asString printNl!
error: Invalid argument <16r0117>: argument must be between $<0> and
$<16r00FF>

Converting a Unicode character to a UTF-32 String:
st> ($<279> asUnicodeString) printNl!
'<16r0117>'

Converting a UTF-32 String with a Unicode character to a byte-encoded
String:
*** maybe should give an error instead
st> $<279> asUnicodeString asString printNl!
'?'

Asking the resulting Strings for their number of characters:
st> $<279> asUnicodeString numberOfCharacters printNl!
1
st> $<279> asUnicodeString asString numberOfCharacters printNl!
error: should not be implemented in this class

Converting ByteArrays or Strings to UnicodeStrings:
st> #[196 151] asUnicodeString first printNl!
error: should not be implemented in this class

-----


After loading the I18N package, everything is much better:

Printing a Unicode character:
st> $<279> printNl!


Converting a Unicode character to String:
st> $<279> asString printNl!
'ė'

Converting a Unicode character to a UTF-32 String, and then back just by
printing it:
st> ($<279> asUnicodeString) printNl!
'ė'

Converting a UTF-32 String with a Unicode character to a byte-encoded
String:
st> $<279> asUnicodeString asString printNl!
'ė'

Asking the resulting Strings for their number of characters:
st> $<279> asUnicodeString numberOfCharacters printNl!
1

st> $<279> asUnicodeString asString numberOfCharacters printNl!
1

Converting ByteArrays or Strings to UnicodeStrings:
st> #[196 151] asUnicodeString first printNl!


st> #[196 151] asUnicodeString size printNl!
1

st> #[196 151] asUnicodeString numberOfCharacters printNl!
1
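
In case it is not obvious why #[196 151] is a single character: those
two bytes are the UTF-8 encoding of code point 279, the same character
written $<279> above. A small sketch of the decoding by hand, using
only integer arithmetic (the variable names are just for illustration):

```smalltalk
"Decode a two-byte UTF-8 sequence: 5 payload bits from the lead byte,
 6 payload bits from the continuation byte."
| lead trail codePoint |
lead := 196.     "2r11000100: lead byte of a two-byte sequence"
trail := 151.    "2r10010111: continuation byte"
codePoint := ((lead bitAnd: 16r1F) bitShift: 6)
                 bitOr: (trail bitAnd: 16r3F).
codePoint printNl!
```

This prints 279, i.e. 16r0117.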

Paolo


