Re: Starting with smalltalk

Paolo Bonzini

Re: Starting with smalltalk

(First of all, you may want to subscribe to [hidden email] -- it
is moderated for non-subscribers, so no spam).
> I've been reading the GNU smalltalk manual, but the following I haven't
> been able to find on the web yet:
> - A GNU smalltalk compatible, functional program
You mean a program written with gst? Unfortunately I don't know of any
:-( Mike Anderson has some on his blog, but they're small.
> - A way of separating smalltalk source over multiple files
You write the source code in multiple files, and then provide a loading
script that loads them all (optionally saving everything to an image
file, see later).
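
A minimal sketch of such a loading script follows. The file names are hypothetical, and it assumes FileStream class>>#fileIn: as used in the GNU Smalltalk tutorial; the snapshot call is the one described later in this message.

```smalltalk
"load.st: hypothetical loading script. The file names are made up;
 list your own files in dependency order."
FileStream fileIn: 'Account.st'.
FileStream fileIn: 'Bank.st'.
FileStream fileIn: 'Main.st'.

"Optional: save everything that was loaded into an image file."
ObjectMemory snapshot: 'myprogram.im'!
```

Running "gst load.st" should then file in the three sources in order and, if the last statement is kept, write the image.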
> - A way of editing smalltalk files without the use of a commercial IDE
GNU Smalltalk has an Emacs mode.
> - A way of running smalltalk programs like other programs (from the
> commandline) without the need of a wrapper script (The normal
> '#!/usr/bin/env' doesn't work, nor could I find ways of creating
> bytecode/packages/binaries)
You can use (with GNU Smalltalk 2.2)

#! /usr/bin/env gst -f

or

#! /bin/sh
"exec" "gst" "-f" "$0" "$@"

GNU Smalltalk special cases the #! at the beginning of a file as a
one-line command. Comments are quote-delimited in Smalltalk, so the
second line is eaten by GNU Smalltalk's parser in the second example.
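
Put together, a complete script using the second form might look like the sketch below. The file name hello.st is hypothetical, and the script assumes gst is on the PATH.

```smalltalk
#! /bin/sh
"exec" "gst" "-f" "$0" "$@"
"For /bin/sh the line above is an exec command, so the shell never reads
 past it. For the Smalltalk parser it is just a run of comments."
'Hello, world' displayNl!
```

After "chmod +x hello.st", the file can be started as ./hello.st like any other program.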

In addition, GNU Smalltalk can save a snapshot of its status in an image
(.im) file that can be made executable with chmod. Making something run
automatically when the image file is reloaded is feasible. Just create
a class-side method named #update: including some code like

update: aspect
"Flush instances of the receiver when an image is loaded."
aspect == #returnFromSnapshot ifTrue: [ self restart ]!

and then evaluate code like

ObjectMemory
addDependent: NameOfTheClassWithTheUpdateMethod;
snapshot: 'myprogram.im'

Then, running gst with "gst -I myprogram.im" (or just making
myprogram.im executable) will invoke the #restart method on the class
NameOfTheClassWithTheUpdateMethod.
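
For illustration, the two fragments above can be combined into one file. This is an untested sketch: the class name MyProgram, its category and the body of #restart are made up, and only #update:, #addDependent: and #snapshot: come from the description above.

```smalltalk
"Hypothetical class whose class side restarts the program on image load."
Object subclass: #MyProgram
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Examples'!

!MyProgram class methodsFor: 'startup'!

update: aspect
    "Called by ObjectMemory; react only when an image has been reloaded."
    aspect == #returnFromSnapshot ifTrue: [ self restart ]!

restart
    "Placeholder for the real entry point of the program."
    'myprogram.im was reloaded' displayNl! !

"Register the class as a dependent, then save the image."
ObjectMemory
    addDependent: MyProgram;
    snapshot: 'myprogram.im'!
```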

Paolo

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Mike Anderson-3

Re: Re: Starting with smalltalk

Sorry to be weighing in so late on this one. My brain works slowly in
the summer heat...

Paolo Bonzini wrote:
>> I've been reading the GNU smalltalk manual, but the following I haven't
>> been able to find on the web yet:
>> - A GNU smalltalk compatible, functional program
>
> You mean a program written with gst? Unfortunately I don't know of any
> :-( Mike Anderson has some on his blog, but they're small.

What you will find is that one of the major problems Smalltalk has as a
language is that the dialects are sufficiently dissimilar that programs
are not very portable, so the only programs you will find for GSt are
those that were written for GSt. There are projects that aim to remedy
this, e.g. Sport. Porting Sport to gst would be a very useful project.

The other main problem, related to the above, is that the Smalltalk Way
is image-based development, which unfortunately means that the easiest
way to distribute programs is as images, not code.

At a personal level, the main problem I have is that the packaging
system is a bit inflexible, so splitting out a package is hard work.

>> - A way of editing smalltalk files without the use of a commercial IDE

This sounds as if you're thinking about commercial Smalltalks, like
VisualWorks. Actually, most other Smalltalks don't use files: you
develop within the IDE, and code at the method level. Where the source
code is outside of the image, it is found in a repository like Envy or
Store, i.e. a database.

> GNU Smalltalk has an Emacs mode.

SciTE also has syntax-highlighting, if, like me, you never really got to
grips with Emacs (if you're using Emacs, surely you must prefer Lisp
over Smalltalk?).

Mike

Bram Neijt

Re: Re: Starting with smalltalk

On 7/5/06, Mike Anderson <[hidden email]> wrote:
> What you will find is that one of the major problems Smalltalk has as a
> language is that the dialects are sufficiently dissimilar that programs
> are not very portable, so the only programs you will find for GSt are
> those that were written for GSt. There are projects that aim to remedy
> this, eg. Sport. Porting Sport to gst would be a very useful project.
This is a problem, but with the growing number of architectures and
operating systems, it is just as hard for any other language
(probably).

> The other main problem, related to the above, is that the Smalltalk Way
> is image-based development, which unfortunately means that the easiest
> way to distribute programs is as images, not code.
>
> At a personal level, the main problem I have is that the packaging
> system is a bit inflexible, so splitting out a package is hard work.
I have not found anything about packaging yet, however this is the
kind of thing that will keep a language from ever getting out (even
out of a computer ;-) ).

>
> >> - A way of editing smalltalk files without the use of a commercial IDE
>
> This sounds as if you're thinking about commercial Smalltalks, like
> Visual Works. Actually, most other Smalltalks don't use files - you
> develop within the IDE, and code at the method level. Where the source
> code is outside of the image, it is found in a repository like Envy or
> Store, ie. a database.
I'm sorry, but if Smalltalk can't even get out of my computer, I might
just not bother to learn it at all. This does explain why I can't find
any real-life implementations on the internet (like a simple hello,
ls, find, sort or anything like that with install scripts,
documentation and comments).

Then I guess there aren't any standard commandline argument parsing
libraries in the stdlib either, right?

Greets,
Bram

PS If all this is really like I now think it is, I can imagine why
this language never took off!

Paolo Bonzini

Re: Re: Starting with smalltalk

> This is a problem, but with the growing number of architectures and
> operating systems, it is just as hard for any other language
> (probably).
It's a bit different, and actually worse. It's as if you had ten
different forks of Python, with some people writing for one and some
for the others.
> I have not found anything about packaging yet, however this is the
> kind of thing that will keep a language from ever getting out (even
> out of a computer ;-) ).
I don't think the packaging system is *too* inflexible. It's
underdeveloped, true, and feature requests will only help.

>> >> - A way of editing smalltalk files without the use of a commercial
>> IDE
>>
>> This sounds as if you're thinking about commercial Smalltalks, like
>> Visual Works. Actually, most other Smalltalks don't use files - you
>> develop within the IDE, and code at the method level. Where the source
>> code is outside of the image, it is found in a repository like Envy or
>> Store, ie. a database.
> I'm sorry, but if Smalltalk can't even get out of my computer, I might
> just not bother to learn it at all. This does explain why I can't find
> any real-life implementations on the internet (like a simple hello,
> ls, find, sort or anything like that with install scripts,
> documentation and comments).

Mike is speaking about commercial Smalltalks. GNU Smalltalk is by
design different. You can write your code in files, with SciTE or
Emacs. The next version, when it comes out, will almost surely have a
more compact and less arcane syntax for defining classes, and so on.
> Then I guess there aren't any standard commandline argument parsing
> libraries in the stdlib either, right?
If you want, I can write one in half an hour. :-P Would this syntax
satisfy you (I'm getting the command line options from autoconf)?

Smalltalk
arguments: '-B|--prepend-include: -I|--include: -t|--trace:
-p|--preselect= -F|--freeze --help --version -v'
do: [ :arg :option | (arg->option) printNl ].

The output could be something like

'trace'->'AC_DEFUN'
$v->nil
'prepend-include'->'/usr/local/share'

if you invoked your script like

gst -f script.st --trace=AC_DEFUN -v -B/usr/local/share
> PS If all this is really like I now think it is, I can imagine why
> this language never took off!
Maybe that's because the language was born 20 years before Python. The
problem is not the inflexibility of the language; it is that nobody
implemented the features that people love in other languages (due to
lack of time, lack of funding, or sometimes even human stupidity).

Paolo

Paolo Bonzini

Re: Re: Starting with smalltalk

> I'll get a book on Smalltalk, take some time to read-up on the syntax
> and try Squeak to see the difference between GNU and Squeak.
You can try the tutorial that comes with GNU Smalltalk.

The differences are mostly conceptual. Plus Squeak has a huge (and
sometimes very poorly designed) class library for graphics and much more.
> Then, I'll get back to you all.
No need to wait. We're here to help and to understand where you have
problems.
> Nice looking commandline parser by the way. I don't understand it all
> yet, but I'll get there. In the end I'll try to make a commandline
> arguments parser and post it somewhere.
Heh... I wanted to see how far I was from my (purposely exaggerated)
30-minute estimate of the time to make one. So I did it.

Here it is. 220 lines in ~2 hours, slightly less actually, including 30
minutes for testing (didn't have time to do SUnit tests, so they're just
commands at the end of the file). No comments for now, I will add them
when I commit. :-P

Paolo

"======================================================================
|
| Smalltalk command-line parser
|
|
======================================================================"

"======================================================================
|
| Copyright 2006 Free Software Foundation, Inc.
| Written by Paolo Bonzini.
|
| This file is part of the GNU Smalltalk class library.
|
| The GNU Smalltalk class library is free software; you can redistribute it
| and/or modify it under the terms of the GNU Lesser General Public License
| as published by the Free Software Foundation; either version 2.1, or (at
| your option) any later version.
|
| The GNU Smalltalk class library is distributed in the hope that it will be
| useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
| General Public License for more details.
|
| You should have received a copy of the GNU Lesser General Public License
| along with the GNU Smalltalk class library; see the file COPYING.LIB.
| If not, write to the Free Software Foundation, 59 Temple Place - Suite
| 330, Boston, MA 02110-1301, USA.
|
======================================================================"

Object subclass: #Getopt
instanceVariableNames: 'options longOptions prefixes args currentArg actionBlock errorBlock'
classVariableNames: ''
poolDictionaries: ''
category: 'Language-Data types'
!

Getopt comment:
'My instances parse an array of command-line arguments against a pattern
that describes the accepted short and long options.' !

!Getopt class methodsFor: 'instance creation'!

test: args with: pattern
args do: [ :each |
self
parse: each subStrings
with: pattern
do: [ :x :y | (x->y) printNl ]
ifError: [ (each->'error') displayNl ].
Transcript nl ]!

parse: args with: pattern do: actionBlock
^self new
parsePattern: pattern;
actionBlock: actionBlock;
errorBlock: [ ^nil ];
parse: args!

parse: args with: pattern do: actionBlock ifError: errorBlock
^self new
parsePattern: pattern;
actionBlock: actionBlock;
errorBlock: [ ^errorBlock value ];
parse: args!

!Getopt methodsFor: 'parsing'!

fullOptionName: aString
(prefixes includes: aString) ifFalse: [ errorBlock value ].
longOptions do: [ :k |
(k startsWith: aString) ifTrue: [ ^k ] ].
self halt!

optionKind: aString
| kindOrString |
kindOrString := options at: aString ifAbsent: [ errorBlock value ].
^kindOrString isSymbol
ifTrue: [ kindOrString ]
ifFalse: [ options at: kindOrString ]!

optionName: aString
| kindOrString |
kindOrString := options at: aString ifAbsent: [ errorBlock value ].
^kindOrString isSymbol
ifTrue: [ aString ]
ifFalse: [ kindOrString ]!

parseRemainingArguments
[ args atEnd ] whileFalse: [
actionBlock value: nil value: args next ]!

parseOption: name kind: kind with: arg
| theArg |
theArg := arg.
(kind = #mandatoryArg and: [ arg isNil ])
ifTrue: [
args atEnd ifTrue: [ errorBlock value ].
theArg := args next ].
(kind = #noArg and: [ theArg notNil ])
ifTrue: [ errorBlock value ].

actionBlock value: name value: theArg!

parseLongOption: argStream
| name kind haveArg arg |
name := argStream upTo: $=.
argStream skip: -1.

name := self fullOptionName: name.
name := self optionName: name.
kind := self optionKind: name.
haveArg := argStream nextMatchFor: $=.
arg := haveArg ifTrue: [ argStream upToEnd ] ifFalse: [ nil ].
self parseOption: name kind: kind with: arg!

parseShortOptions: argStream
| name kind ch haveArg arg |
[ argStream atEnd ] whileFalse: [
ch := argStream next.
name := self optionName: ch.
kind := self optionKind: ch.
haveArg := kind ~~ #noArg and: [ argStream atEnd not ].
arg := haveArg ifTrue: [ argStream upToEnd ] ifFalse: [ nil ].
self parseOption: name kind: kind with: arg ]!

parseOneArgument
| arg argStream |
arg := args next.
arg = '--' ifTrue: [ ^self parseRemainingArguments ].

(arg isEmpty or: [ arg first ~= $- ])
ifTrue: [ ^actionBlock value: nil value: arg ].

argStream := arg readStream.
(arg at: 2) = $-
ifTrue: [ argStream next: 2. self parseLongOption: argStream ]
ifFalse: [ argStream next. self parseShortOptions: argStream ]!

parse
[ args atEnd ] whileFalse: [ self parseOneArgument ]!

!Getopt methodsFor: 'initializing'!

addPrefixes: option
longOptions add: option.
1 to: option size do: [ :length |
prefixes add: (option copyFrom: 1 to: length) ]!

rejectBadPrefixes
longOptions := longOptions asSortedCollection: [ :a :b | a size <= b size ].

prefixes := prefixes select: [ :each | (prefixes occurrencesOf: each) == 1 ].
prefixes := prefixes asSet.
prefixes addAll: longOptions!

initialize
options := Dictionary new.
longOptions := Set new.
prefixes := Bag new!

checkSynonyms: synonyms
(synonyms allSatisfy: [ :each | each startsWith: '-' ])
ifFalse: [ ^self error: 'expected -' ].

(synonyms anySatisfy: [ :each | each size < 2 or: [ each = '--' ] ])
ifTrue: [ ^self error: 'expected option name' ].

synonyms do: [ :each |
((each startsWith: '--') and: [ each includes: $= ])
ifTrue: [ ^self error: 'unexpected = inside long option' ] ]!

colonsToKind: colons
colons = 0 ifTrue: [ ^#noArg ].
colons = 1 ifTrue: [ ^#mandatoryArg ].
colons = 2 ifTrue: [ ^#optionalArg ].
^self error: 'too many colons, don''t know what to do with them...'!

atSynonym: synonym put: kindOrName
| key |
synonym size = 2
ifTrue: [ key := synonym at: 2 ]
ifFalse: [ key := synonym copyFrom: 3. self addPrefixes: key ].

(options includesKey: key) ifTrue: [ self error: 'duplicate option' ].
options at: key put: kindOrName.
^key!

parseSynonyms: synonyms kind: kind
| last |
last := self atSynonym: synonyms last put: kind.
synonyms from: 1 to: synonyms size - 1 do: [ :each |
self atSynonym: each put: last ]!

parseOption: opt
| colons optNames synonyms kind |
optNames := opt copyWithout: $:.
colons := opt size - optNames size.
opt from: optNames size + 1 to: opt size do: [ :ch |
ch = $: ifFalse: [ ^self error: 'invalid pattern, colons are hosed' ] ].

kind := self colonsToKind: colons.
synonyms := optNames subStrings: $|.
self checkSynonyms: synonyms.
self parseSynonyms: synonyms kind: kind!

parsePattern: pattern
self initialize.
pattern subStrings do: [ :opt | self parseOption: opt ].
self rejectBadPrefixes!

actionBlock: aBlock
actionBlock := aBlock!

errorBlock: aBlock
errorBlock := aBlock!

parse: argsArray
args := argsArray readStream.
self parse.
^args contents!

!SystemDictionary methodsFor: 'command-line'!

arguments: pattern do: actionBlock
^Getopt
parse: self arguments
with: pattern
do: actionBlock!

arguments: pattern do: actionBlock ifError: errorBlock
^Getopt
parse: self arguments
with: pattern
do: actionBlock
ifError: errorBlock! !

"Getopt new parsePattern: '-B'"
"Getopt new parsePattern: '--long'"
"Getopt new parsePattern: '--longish --longer'"
"Getopt new parsePattern: '--long --longer'"
"Getopt new parsePattern: '-B:'"
"Getopt new parsePattern: '-B::'"
"Getopt new parsePattern: '-a|-b'"
"Getopt new parsePattern: '-a|--long'"
"Getopt new parsePattern: '-a|--very-long|--long'"
"Getopt test: #('-a' '-b' '-ab' '-a -b') with: '-a -b'"
"Getopt test: #('-a' '-b' '-ab' '-a -b') with: '-a: -b'"
"Getopt test: #('-a' '-b' '-ab' '-a -b') with: '-a:: -b'"
"Getopt test: #('--longish' '--longer' '--longi' '--longe' '--lo' '-longer') with: '--longish --longer'"
"Getopt test: #('--lo' '--long' '--longe' '--longer') with: '--long --longer'"
"Getopt test: #('--noarg' '--mandatory' '--mandatory foo' '--mandatory=' '--mandatory=foo' '--optional' '--optional foo') with: '--noarg --mandatory: --optional::'"
"Getopt test: #('-a' '-b') with: '-a|-b'"
"Getopt test: #('--long' '-b') with: '-b|--long'"
"Getopt test: #('--long=x' '-bx') with: '-b|--long:'"
"Getopt test: #('-b' '--long' '--very-long') with: '-b|--very-long|--long'"
"Getopt test: #('--long=x' '--very-long x' '-bx') with: '-b|--very-long|--long:'"
"Getopt test: #('-b -- -b' '-- -b' '-- -b -b') with: '-b'"

Bram Neijt

Re: Re: Starting with smalltalk

On 7/6/06, Mike Anderson <[hidden email]> wrote:
> Well, that was a bit inflammatory, but if it was just code snippets you
> were after, try this:
True, it was. Mainly because I heard 'the language is great' from
people who do it, and I've seen a few videos of Alan Kay about how
great it is and that he can't understand why it isn't used more often.

So I felt like people were saying "this is art!" and I just couldn't see it.

Thanks a lot for the sources. I'll try them out and probably build some
pages with info as I come across more code and learn more.

Greetings,
Bram

PS One of the videos I'm referring to can be found here:
http://video.google.com/videoplay?docid=-2950949730059754521

Chun, Sungjin

[Q] Unicode String?

Hi,

I've tried GNU Smalltalk and it seems good to me. But I have a
problem: the current implementation does not support Unicode. It seems
to support single-byte characters only. I've also tried
Squeak, which seems slower than GNU Smalltalk (I'm not sure about
this, it might not be correct) but has a Unicode-compatible string
implementation, and I think this kind of approach is good. Is there
any chance of having a Unicode-compatible string implementation in the
next version of GNU Smalltalk?

Thanks in advance.

Paolo Bonzini

Re: [Q] Unicode String?

Chun Sungjin wrote:

> Hi,
>
> I've tried GNU Smalltalk and it seems good to me. But I have a
> problem: the current implementation does not support Unicode. It seems
> to support single-byte characters only. I've also tried
> Squeak, which seems slower than GNU Smalltalk (I'm not sure about
> this, it might not be correct) but has a Unicode-compatible string
> implementation, and I think this kind of approach is good. Is there any
> chance of having a Unicode-compatible string implementation in the next
> version of GNU Smalltalk?

What do you need exactly? The main missing thing is support for
Character objects with values above 256. However if you are content
with multibyte character sets like UTF-8, or with Unicode character
codes, that's fine.

For character set translation, if you load the I18N package, GNU
Smalltalk gets an iconv wrapper. The main method you need is
EncodedStream>>#on:from:to: (e.g. on: 'abc' from: 'UTF-8' to: 'UCS-4').

To extract Unicode character codes from a UCS-4LE encoded string, you
can use (ByteStream on: x asByteArray) and send nextLong. For
big-endian, there is no class but I was thinking of adding a #bigEndian
method to ByteStream for the next version.
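
As an illustration of what reading one little-endian 32-bit code amounts to, here is a sketch that does the byte arithmetic by hand on a ByteArray instead of using ByteStream. The sample bytes are made up for the example (the letter a followed by the euro sign, U+20AC), and only core collection and integer messages are used.

```smalltalk
| bytes codes |
"Two UCS-4LE code units: 16r61 and 16r20AC, least significant byte first."
bytes := #[97 0 0 0 172 32 0 0].
codes := OrderedCollection new.
1 to: bytes size by: 4 do: [ :i |
    "Combine four bytes, the first one being the least significant."
    codes add: (bytes at: i)
        + ((bytes at: i + 1) bitShift: 8)
        + ((bytes at: i + 2) bitShift: 16)
        + ((bytes at: i + 3) bitShift: 24) ].
"codes now holds 97 and 8364."
codes printNl!
```

For big-endian data the shift amounts would simply be applied in the opposite order.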

Things that could be useful are
Integer>>#asUTF8String
String class>>#utf8FromCodepoint: (same as above)
String>>#utf8Stream
UTF8Stream (returns Unicode character codes)
... (tell me what you need) ...

Paolo

Chun, Sungjin

Re: [Q] Unicode String?

Hi,

The main problem is that, for example, if I create an instance of
String like this:

a := 'Some MultiByte Encoded String'.

then

a size

does not answer the correct length of the string.

However, I will try what you said, thank you.

On Jul 7, 2006, at 4:03 PM, Paolo Bonzini wrote:

> Chun Sungjin wrote:
>> Hi,
>>
>> I've tried GNU Smalltalk and it seems good to me. But I have a
>> problem: the current implementation does not support Unicode. It seems
>> to support single-byte characters only. I've also tried
>> Squeak, which seems slower than GNU Smalltalk (I'm not sure about
>> this, it might not be correct) but has a Unicode-compatible string
>> implementation, and I think this kind of approach is good. Is there
>> any chance of having a Unicode-compatible string implementation in
>> the next version of GNU Smalltalk?
> What do you need exactly? The main missing thing is support for
> Character objects with values above 256. However if you are
> content with multibyte character sets like UTF-8, or with Unicode
> character codes, that's fine.
>
> For character set translation, if you load the I18N package, GNU
> Smalltalk gets an iconv wrapper. The main method you need is
> EncodedStream>>#on:from:to: (e.g. on: 'abc' from: 'UTF-8' to:
> 'UCS-4').
>
> To extract Unicode character codes from an UCS-4LE encoded string,
> you can use (ByteStream on: x asByteArray) and send nextLong. For
> big-endian, there is no class but I was thinking of adding a
> #bigEndian method to ByteStream for the next version.
>
> Things that could be useful are
> Integer>>#asUTF8String
> String class>>#utf8FromCodepoint: (same as above)
> String>>#utf8Stream
> UTF8Stream (returns Unicode character codes)
> ... (tell me what you need) ...
>
> Paolo

Paolo Bonzini

Re: {Spam?} Re: [Q] Unicode String?

Chun Sungjin wrote:

> Hi,
>
> main problem is that for example, if I did create an instance of
> string like this;
>
> a := 'Some MultiByte Encoded String'.
>
> then
>
> a size
>
> does not answer correct length of string.

Well, strlen does not in C, either. You need mbrlen, and #size is more
like strlen than mbrlen.

Also, the result heavily depends on the chosen character set. If we
want to have #utf8Size, that's fine. But #size should be the number of
*bytes*, not of characters.
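
To make the difference concrete, here is a small sketch that counts both for a UTF-8 byte sequence. It is not the #utf8Size method mentioned above, just one possible way to compute such a number: every byte that is not a UTF-8 continuation byte (bit pattern 10xxxxxx) starts a new character. The sample bytes are made up for the example.

```smalltalk
| utf8 chars |
"UTF-8 for: h (1 byte), e with acute accent (2 bytes), euro sign (3 bytes)."
utf8 := #[104 195 169 226 130 172].
chars := utf8 inject: 0 into: [ :count :byte |
    "Skip continuation bytes, count every other byte as one character."
    (byte bitAnd: 16rC0) = 16r80
        ifTrue: [ count ]
        ifFalse: [ count + 1 ] ].
"6 bytes"
utf8 size printNl.
"3 characters"
chars printNl!
```

The count assumes well-formed UTF-8; with another multibyte encoding the rule for spotting character boundaries would be different, which is why the result depends on the chosen character set.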

I'm seeing now if I can add an EncodedStream method that extracts
Unicode characters. Then what you wanted would be something like

(EncodedStream wordsOn: 'some string') contents size

for which, of course, we can add a utility method.

Paolo

Chun, Sungjin

Why string should be collection of single byte characters? (WAS: Re: [Q] Unicode String?)

Hi,

For me, a string should not be limited to a collection of single-byte
characters. A string is a string, not a simple collection of bytes, isn't
it? I think Squeak's approach is good (or OpenStep's approach, with an
abstract public string class and concrete private subclasses that
implement several kinds of string). But since I'm not currently working
hard on GNU Smalltalk, this may not be the best idea for GNU Smalltalk's
case :-)

PS)
I DO think that strlen is not suited to Unicode (actually multi-byte
encoded) strings and is a bad design: it is limited to single-byte
encodings. I DO think that a modern language should take Unicode into
account for strings. I DO think Smalltalk is MODERN :-)

----- Original Message -----
From: "Paolo Bonzini" <[hidden email]>
To: "Chun Sungjin" <[hidden email]>
Cc: "GNU Smalltalk" <[hidden email]>
Sent: Friday, July 07, 2006 6:17 PM
Subject: Re: {Spam?} Re: [Help-smalltalk] [Q] Unicode String?

> Chun Sungjin wrote:
> > Hi,
> >
> > main problem is that for example, if I did create an instance of
> > string like this;
> >
> > a := 'Some MultiByte Encoded String'.
> >
> > then
> >
> > a size
> >
> > does not answer correct length of string.
> Well, strlen does not in C, too. You need mbrlen, and #size is more
> like strlen than mbrlen.
>
> Also, the result heavily depends on the chosen character set. If we
> want to have #utf8Size, that's fine. But #size should be the number of
> *bytes*, not of characters.
>
> I'm seeing now if I can add an EncodedStream method that extracts
> Unicode characters. Then what you wanted would be something like
>
> (EncodedStream wordsOn: 'some string') contents size
>
> for which, of course, we can add a utility method.
>
> Paolo
>

Paolo Bonzini

Re: {Spam?} Why string should be collection of single byte characters? (WAS: Re: [Q] Unicode String?)

Sungjin Chun wrote:
> Hi,
>
> For me, string should not be limited to collection of single byte
> characters. String is string not a simple collection of byte, isn't it? I
> think squeak's approach (or OpenStep's approach, where abstract public
> string class and concrete private subclasses of string that implements
> several cases of string). But I'm not currently working hard on GNU
> Smalltalk, this may not be the best idea for GNU Smalltalk's case :-)
>
There's already CharacterArray as a superclass of String. It probably
would not be hard to have a UnicodeString subclass of CharacterArray,
and use that instead of WordArray inside the I18N package. I'd also
need UnicodeCharacter, probably.

I'm working on it in my spare time, I attach my current prototype patch.

Paolo

--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -89,70 +89,70 @@

Namespace current: Smalltalk.I18N.Encoders!

-Encoder subclass: #FromUCS4
+Encoder subclass: #FromUTF32
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!

-FromUCS4 comment:
-'This class is a superclass for classes that convert from UCS4
+FromUTF32 comment:
+'This class is a superclass for classes that convert from UTF-32
characters (encoded as 32-bit Integers) to bytes in another
encoding (encoded as Characters).'!

-Encoder subclass: #ToUCS4
+Encoder subclass: #ToUTF32
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!

-ToUCS4 comment:
+ToUTF32 comment:
'This class is a superclass for classes that convert from bytes
-(encoded as Characters) to UCS4 characters (encoded as 32-bit
+(encoded as Characters) to UTF-32 characters (encoded as 32-bit
Integers to simplify the code and to avoid endianness conversions).'!

-ToUCS4 subclass: #ComposeUCS4LE
+ToUTF32 subclass: #ComposeUTF32LE
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!

-ComposeUCS4LE comment:
-'This class is used internally to provide UCS4 characters encoded as
-32-bit integers for a descendent of FromUCS4, when the starting
+ComposeUTF32LE comment:
+'This class is used internally to provide UTF-32 characters encoded as
+32-bit integers for a descendent of FromUTF32, when the starting
encoding is little-endian.'!

-ToUCS4 subclass: #ComposeUCS4BE
+ToUTF32 subclass: #ComposeUTF32BE
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!

-ComposeUCS4BE comment:
-'This class is used internally to provide UCS4 characters encoded as
-32-bit integers for a descendent of FromUCS4, when the starting
+ComposeUTF32BE comment:
+'This class is used internally to provide UTF-32 characters encoded as
+32-bit integers for a descendent of FromUTF32, when the starting
encoding is big-endian.'!

-FromUCS4 subclass: #SplitUCS4LE
+FromUTF32 subclass: #SplitUTF32LE
instanceVariableNames: 'wch'
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!

-SplitUCS4LE comment:
+SplitUTF32LE comment:
'This class is used internally to split into four 8-bit characters
-the 32-bit UCS4 integers coming from a descendent of ToUCS4, when
+the 32-bit UTF-32 integers coming from a descendent of ToUTF32, when
the destination encoding is little-endian.'!

-FromUCS4 subclass: #SplitUCS4BE
+FromUTF32 subclass: #SplitUTF32BE
instanceVariableNames: 'count wch'
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!

-SplitUCS4BE comment:
+SplitUTF32BE comment:
'This class is used internally to split into four 8-bit characters
-the 32-bit UCS4 integers coming from a descendent of ToUCS4, when
+the 32-bit UTF-32 integers coming from a descendent of ToUTF32, when
the destination encoding is big-endian.'!

Encoder subclass: #Iconv
@@ -166,21 +166,21 @@
'This class is used to delegate the actual character set conversion
to the C library''s iconv function. Most conversions use iconv as
the only step in the conversions, sometimes the structure is
-ToUCS4+SplitUCS4xx+Iconv or Iconv+ComposeUCS4xx+FromUCS4, rarely
+ToUTF32+SplitUTF32xx+Iconv or Iconv+ComposeUTF32xx+FromUTF32, rarely
Iconv is skipped altogether and only Smalltalk converters are used.'!

-FromUCS4 subclass: #ToUTF7
+FromUTF32 subclass: #ToUTF7
instanceVariableNames: 'left value lookahead'
classVariableNames: 'Base64Characters DirectCharacters ToBase64'
poolDictionaries: ''
category: 'i18n-Encodings'!

ToUTF7 comment:
-'This class implements a converter that transliterates UCS4
+'This class implements a converter that transliterates UTF-32
characters (encoded as 32-bit Integers) to UTF-7 encoded
characters.'!

-ToUCS4 subclass: #FromUTF7
+ToUTF32 subclass: #FromUTF7
instanceVariableNames: 'shift wch lookahead'
classVariableNames: 'DirectCharacters FromBase64'
poolDictionaries: ''
@@ -188,7 +188,7 @@

ToUTF7 comment:
'This class implements a converter that transliterates UTF-7
-encoded characters to UCS4 values (encoded as 32-bit Integers).'!
+encoded characters to UTF-32 values (encoded as 32-bit Integers).'!

Namespace current: Smalltalk.I18N!

@@ -241,9 +241,9 @@
!Encoder methodsFor: 'private - initialization'!

initializeFrom: fromEncoding to: toEncoding origin: aStringOrStream
- origin := aStringOrStream isString
- ifTrue: [ aStringOrStream readStream ]
- ifFalse: [ aStringOrStream ].
+ origin := (aStringOrStream isKindOf: Stream)
+ ifFalse: [ aStringOrStream readStream ]
+ ifTrue: [ aStringOrStream ].

self flush
! !
@@ -258,27 +258,27 @@
}
!

-registerEncoderFor: arrayOfAliases toUCS4: toUCS4Class fromUCS4: fromUCS4Class
+registerEncoderFor: arrayOfAliases toUTF32: toUTF32Class fromUTF32: fromUTF32Class
"Register the two classes that will respectively convert from the
- charsets in arrayOfAliases to UCS4 and vice versa.
+ charsets in arrayOfAliases to UTF-32 and vice versa.

The former class is a stream that accepts characters and returns
- (via #next) integers representing UCS-4 character codes, while
- the latter accepts UCS-4 character codes and converts them to
+ (via #next) integers representing UTF-32 character codes, while
+ the latter accepts UTF-32 character codes and converts them to
characters. For an example see respectively FromUTF7 and ToUTF7
(I admit it is not a trivial example)."

EncodersRegistry := EncodersRegistry copyWith:
- { arrayOfAliases. toUCS4Class. fromUCS4Class }
+ { arrayOfAliases. toUTF32Class. fromUTF32Class }
! !

!EncodedStream class methodsFor: 'private - triangulating'!

bigEndianPivot
"When only one of the sides is implemented in Smalltalk
- and the other is obtained via iconv, we use UCS-4 to
+ and the other is obtained via iconv, we use UTF-32 to
marshal data from Smalltalk to iconv; answer whether we
- should encode UCS-4 characters as big-endian."
+ should encode UTF-32 characters as big-endian."
^Memory bigEndian
!

@@ -287,29 +287,119 @@
and the other is obtained via iconv, we need a common
pivot encoding to marshal data from Smalltalk to iconv.
Answer the iconv name of this encoding."
- ^self bigEndianPivot ifTrue: [ 'UCS-4BE' ] ifFalse: [ 'UCS-4LE' ]
+ ^self bigEndianPivot ifTrue: [ 'UTF-32BE' ] ifFalse: [ 'UTF-32LE' ]
!

-split: input
+split: input to: encoding
"Answer a pipe with the given input stream (which produces
- UCS-4 character codes as integers) and whose output is
+ UTF-32 character codes as integers) and whose output is
a series of Characters in the required pivot encoding"
- ^self bigEndianPivot
- ifTrue: [ SplitUCS4BE on: input from: 'words' to: 'UCS4-BE' ]
- ifFalse: [ SplitUCS4LE on: input from: 'words' to: 'UCS4-LE' ].
+ ^(encoding = 'UCS-4BE' or: [ encoding = 'UTF-32BE' ])
+ ifTrue: [ SplitUTF32BE on: input from: 'UTF-32' to: encoding ]
+ ifFalse: [ SplitUTF32LE on: input from: 'UTF-32' to: encoding ].
!

-compose: input
+compose: input from: encoding
"Answer a pipe with the given input stream (which produces
Characters in the required pivot encoding) and whose output
- is a series of integer UCS-4 character codes."
- ^self bigEndianPivot
- ifTrue: [ ComposeUCS4BE on: input from: 'UCS4-BE' to: 'words' ]
- ifFalse: [ ComposeUCS4LE on: input from: 'UCS4-LE' to: 'words' ].
+ is a series of integer UTF-32 character codes."
+ ^(encoding = 'UCS-4BE' or: [ encoding = 'UTF-32BE' ])
+ ifTrue: [ ComposeUTF32BE on: input from: encoding to: 'UTF-32' ]
+ ifFalse: [ ComposeUTF32LE on: input from: encoding to: 'UTF-32' ].
! !

!EncodedStream class methodsFor: 'instance creation'!

+encoding: aWordArray
+ "Answer a pipe of encoders that converts aWordArray (which contains
+ Integers for the Unicode values) to the current locale's default
+ charset."
+ ^self
+ encoding: aWordArray
+ as: Locale default charset
+!
+
+encoding: aStringOrStream as: toEncoding
+ "Answer a pipe of encoders that converts aStringOrStream (which
+ produces Integers for the Unicode values) to the supplied encoding
+ (which can be an ASCII String or Symbol)."
+ | pivot to encoderTo pipe |
+
+ "Adopt a uniform naming"
+ to := toEncoding asString.
+ (to = 'UTF-32' or: [ to = 'UCS-4' ])
+ ifTrue: [ to := self pivotEncoding ].
+ (to = 'UTF-16' or: [ to = 'UCS-2' ])
+ ifTrue: [ to := self pivotEncoding copyReplaceAll: '32' with: '16' ].
+
+ "If converting to the pivot encoding, we're done."
+ pivot := 'UTF-32'.
+ ((to startsWith: 'UCS-4') or: [ to startsWith: 'UTF-32' ])
+ ifTrue: [ pivot := to ].
+ pivot = 'UTF-32' ifTrue: [ pivot := self pivotEncoding ].
+
+ encoderTo := Iconv.
+ EncodersRegistry do: [ :each |
+ ((each at: 1) includes: to)
+ ifTrue: [ encoderTo := each at: 2 ]
+ ].
+
+ pipe := aStringOrStream.
+
+ "Split UTF-32 character codes into bytes if needed by iconv."
+ encoderTo == Iconv ifTrue: [ pipe := self split: pipe to: pivot ].
+
+ "If not converting to the pivot encoding, we need one more step."
+ to = pivot ifFalse: [
+ pipe := encoderTo on: pipe from: pivot to: toEncoding ].
+ ^pipe
+!
+
+unicodeOn: aStringOrStream
+ "Answer a pipe of encoders that converts aStringOrStream (which can
+ be a string or another stream) from the current locale's default
+ charset to integers representing Unicode character codes."
+ ^self
+ unicodeOn: aStringOrStream
+ encoding: Locale default charset
+!
+
+unicodeOn: aStringOrStream encoding: fromEncoding
+ "Answer a pipe of encoders that converts aStringOrStream
+ (which can be a string or another stream) from the supplied
+ encoding (which can be an ASCII String or Symbol) to
+ integers representing Unicode character codes."
+ | from pivot encoderFrom pipe |
+
+ "Adopt a uniform naming"
+ from := fromEncoding asString.
+ (from = 'UTF-32' or: [ from = 'UCS-4' ])
+ ifTrue: [ from := aStringOrStream utf32Encoding ].
+ (from = 'UTF-16' or: [ from = 'UCS-2' ])
+ ifTrue: [ from := aStringOrStream utf16Encoding ].
+
+ pivot := 'UTF-32'.
+ ((from startsWith: 'UCS-4') or: [ from startsWith: 'UTF-32' ])
+ ifTrue: [ pivot := from ].
+ pivot = 'UTF-32' ifTrue: [ pivot := self pivotEncoding ].
+
+ encoderFrom := Iconv.
+ EncodersRegistry do: [ :each |
+ ((each at: 1) includes: from)
+ ifTrue: [ encoderFrom := each at: 2 ]
+ ].
+
+ pipe := aStringOrStream.
+
+ "If not converting from the pivot encoding, we need one more step."
+ from = pivot ifFalse: [
+ pipe := encoderFrom on: aStringOrStream from: fromEncoding to: pivot ].
+
+ "Compose iconv-produced bytes into UTF-32 character codes if needed."
+ encoderFrom == Iconv ifTrue: [ pipe := self compose: pipe from: pivot ].
+ ^pipe
+!
+
on: aStringOrStream from: fromEncoding
"Answer a pipe of encoders that converts aStringOrStream
(which can be a string or another stream) from the given
@@ -340,8 +430,20 @@
"Adopt an uniform naming"
from := fromEncoding asString.
to := toEncoding asString.
- from = 'UCS-4' ifTrue: [ from := 'UCS-4BE' ].
- to = 'UCS-4' ifTrue: [ to := 'UCS-4BE' ].
+ (from = 'UTF-32' or: [ from = 'UCS-4' ])
+ ifTrue: [ from := aStringOrStream utf32Encoding ].
+ (from = 'UTF-16' or: [ from = 'UCS-2' ])
+ ifTrue: [ from := aStringOrStream utf16Encoding ].
+ (to = 'UTF-32' or: [ to = 'UCS-4' ])
+ ifTrue: [ to := self pivotEncoding ].
+ (to = 'UTF-16' or: [ to = 'UCS-2' ])
+ ifTrue: [ to := self pivotEncoding copyReplaceAll: '32' with: '16' ].
+
+ ((from startsWith: 'UCS-4') or: [ from startsWith: 'UTF-32' ])
+ ifTrue: [ pivot := from ].
+ ((to startsWith: 'UCS-4') or: [ to startsWith: 'UTF-32' ])
+ ifTrue: [ pivot := to ].
+ pivot = 'UTF-32' ifTrue: [ pivot := self pivotEncoding ].

encoderFrom := encoderTo := Iconv.
EncodersRegistry do: [ :each |
@@ -358,12 +460,12 @@
"Else answer a `pipe' that takes care of triangulating.
There is an additional complication: Smalltalk encoders
read or provide a stream of character codes (respectively
- if the source is UCS-4, or the target is UCS-4), while iconv
+ if the source is UTF-32, or the target is UTF-32), while iconv
expects raw bytes. So we add an intermediate layer if
a mixed Smalltalk+iconv conversion is done: it converts
- character codes --> bytes (SplitUCS4xx, used if iconv will
- convert from UCS-4) or bytes --> character code (ComposeUCS4xx,
- used if iconv will convert to UCS-4).
+ character codes --> bytes (SplitUTF32xx, used if iconv will
+ convert from UTF-32) or bytes --> character code (ComposeUTF32xx,
+ used if iconv will convert to UTF-32).

There are five different cases (remember that at least one converter
is not iconv, so `both use iconv' and `from = pivot = to' are banned):
@@ -373,7 +475,6 @@
from uses iconv --> iconv + Compose + non-iconv (implies to ~= pivot)
none uses iconv --> non-iconv + non-iconv (implies neither = pivot)"

- pivot := self pivotEncoding.
pipe := aStringOrStream.
from = pivot
ifFalse: [
@@ -382,16 +483,16 @@

pipe := encoderFrom on: pipe from: fromEncoding to: pivot.
encoderTo == Iconv ifTrue: [
- pipe := self split: pipe.
+ pipe := self split: pipe to: pivot.

"Check if we already reached the destination format."
to = pivot ifTrue: [ ^pipe ].
].
].

- "Compose iconv-produced bytes into UCS-4 character codes if needed."
+ "Compose iconv-produced bytes into UTF-32 character codes if needed."
encoderFrom == Iconv ifTrue: [
- pipe := self compose: pipe
+ pipe := self compose: pipe from: pivot
].

^encoderTo on: pipe from: pivot to: toEncoding.
@@ -399,7 +500,7 @@

Namespace current: Smalltalk.I18N.Encoders!

-!FromUCS4 methodsFor: 'stream operation'!
+!FromUTF32 methodsFor: 'stream operation'!

species
"We answer a string of Characters encoded in our destination
@@ -407,15 +508,15 @@
^String
! !

-!ToUCS4 methodsFor: 'stream operation'!
+!ToUTF32 methodsFor: 'stream operation'!

species
- "We answer a WordArray of UCS4 characters encoded as a series of
+ "We answer a WordArray of UTF-32 characters encoded as a series of
32-bit Integers."
^WordArray
! !

-!ComposeUCS4LE methodsFor: 'stream operation'!
+!ComposeUTF32LE methodsFor: 'stream operation'!

next
"Answer a 32-bit integer obtained by reading four 8-bit character
@@ -426,7 +527,7 @@
(self nextInput asInteger bitShift: 24)
! !

-!ComposeUCS4BE methodsFor: 'stream operation'!
+!ComposeUTF32BE methodsFor: 'stream operation'!

next
"Answer a 32-bit integer obtained by reading four 8-bit character
@@ -439,7 +540,7 @@
self nextInput asInteger
! !

-!SplitUCS4LE methodsFor: 'stream operation'!
+!SplitUTF32LE methodsFor: 'stream operation'!

atEnd
"Answer whether the receiver can produce more characters"
@@ -474,7 +575,7 @@
wch := 1
! !

-!SplitUCS4BE methodsFor: 'stream operation'!
+!SplitUTF32BE methodsFor: 'stream operation'!

atEnd
"Answer whether the receiver can produce more characters"
@@ -670,7 +771,7 @@
!ToUTF7 class methodsFor: 'initialization'!

initialize
- "Initialize the tables used by the UCS4-to-UTF7 converter"
+ "Initialize the tables used by the UTF-32-to-UTF-7 converter"

Base64Characters := #[
16r00 16r00 16r00 16r00 16r00 16rA8 16rFF 16r03
@@ -806,7 +907,7 @@
!FromUTF7 class methodsFor: 'initialization'!

initialize
- "Initialize the tables used by the UTF7-to-UCS4 converter"
+ "Initialize the tables used by the UTF-7-to-UTF-32 converter"

FromBase64 := #[
62 99 99 99 63
@@ -842,7 +943,7 @@
!FromUTF7 methodsFor: 'converting'!

atEnd
- "Answer whether the receiver can produce another UCS4 32-bit
+ "Answer whether the receiver can produce another UTF-32 32-bit
encoded integer"
^lookahead isNil
!
@@ -878,7 +979,7 @@
"Flush any remaining state left in the encoder by the last character
(this is because UTF-7 encodes 6 bits at a time, so it takes three
characters before it can provide a single 16-bit character and
- up to six characters before it can provide a full UCS-4 character)."
+ up to six characters before it can provide a full UTF-32 character)."
shift := 0.
lookahead := self getNext.
! !
@@ -977,16 +1078,42 @@
description
"Answer a textual description of the exception."
^'unknown encoding specified'! !
-
+

"Now add some extensions to the system classes"

-(CharacterArray classPool includesKey: #DefaultEncoding)
- ifFalse: [ CharacterArray addClassVarName: #DefaultEncoding ]!
-
!CharacterArray class methodsFor: 'converting'!

defaultEncoding
+ self subclassResponsibility!
+
+!CharacterArray methodsFor: 'converting'!
+
+encoding
+ "Answer the encoding of the receiver, assuming it is in the
+ default locale's default charset"
+
+ self class defaultEncoding asString = 'UTF-16'
+ ifTrue: [ ^self utf16Encoding ].
+ self class defaultEncoding asString = 'UTF-32'
+ ifTrue: [ ^self utf32Encoding ].
+ ^self class defaultEncoding!
+
+utf16Encoding
+ "Answer the encoding of the receiver, assuming it's UTF-16"
+ ^Memory bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [ 'UTF-16LE' ]!
+
+utf32Encoding
+ "Answer the encoding of the receiver, assuming it's UTF-32"
+ ^Memory bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [ 'UTF-32LE' ]! !
+
+
+(String classPool includesKey: #DefaultEncoding)
+ ifFalse: [ String addClassVarName: #DefaultEncoding ]!
+
+!String class methodsFor: 'converting'!
+
+defaultEncoding
"Answer the default locale's default charset"
DefaultEncoding isNil
ifTrue: [ DefaultEncoding := Locale default charset ].
@@ -999,15 +1126,28 @@
DefaultEncoding := aString
! !

-!CharacterArray methodsFor: 'converting'!
+!String methodsFor: 'converting'!

-encoding
- "Answer the encoding of the receiver, assuming it is in the
- default locale's default charset"
+asUnicode
+ "Return a WordArray with the contents of the receiver, interpreted
+ as the default locale character set."
+ ^(EncodedStream unicodeOn: self) contents!
+
+asUnicode: aString
+ "Return a WordArray with the contents of the receiver, interpreted
+ as being encoded in the character set named by aString."
+ ^(EncodedStream unicodeOn: self encoding: aString) contents! !
+
+!String methodsFor: 'converting'!
+
+utf32Encoding
+ "Assuming the receiver is encoded as UTF-32 with a proper
+ endianness marker, answer the correct encoding of the receiver."

- ^self class defaultEncoding asString = 'UTF-16'
- ifTrue: [ self utf16Encoding ]
- ifFalse: [ self class defaultEncoding ]
+ | b1 bigEndian |
+ b1 := (self at: 1) asInteger. "First byte; 0 means big-endian"
+ bigEndian := b1 = 0.
+ ^bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [ 'UTF-32LE' ]
!

utf16Encoding
@@ -1026,12 +1166,53 @@
^bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [ 'UTF-16LE' ]
! !

+
+!WordArray class methodsFor: 'converting'!
+
+defaultEncoding
+ ^'UTF-32'! !
+
+!WordArray methodsFor: 'converting'!
+
+utf16Encoding
+ self shouldNotImplement! !
+
+!WordArray methodsFor: 'converting'!
+
+encoded
+ "Return a String with the contents of the receiver, converted
+ into the default locale character set."
+ ^(EncodedStream encoding: self) contents!
+
+encodedAs: aString
+ "Return a String with the contents of the receiver, converted
+ into the character set named by aString."
+ ^(EncodedStream encoding: self as: aString) contents!
+
+
!PositionableStream methodsFor: 'converting'!

encoding
"Answer the encoding of the underlying collection"
- ^collection encoding
-! !
+ ^collection encoding!
+
+utf16Encoding
+ "Answer the encoding of the underlying collection, assuming it's UTF-16"
+ ^collection utf16Encoding!
+
+utf32Encoding
+ "Answer the encoding of the underlying collection, assuming it's UTF-32"
+ ^collection utf32Encoding! !
+
+!Stream methodsFor: 'converting'!
+
+utf16Encoding
+ "Answer the encoding of the underlying collection, assuming it's UTF-16"
+ ^Memory bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [ 'UTF-16LE' ]!
+
+utf32Encoding
+ "Answer the encoding of the underlying collection, assuming it's UTF-32"
+ ^Memory bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [ 'UTF-32LE' ]! !

Encoders.ToUTF7 initialize!
Encoders.FromUTF7 initialize!
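
For readers who want to picture the Split/Compose steps outside
Smalltalk, here is a minimal sketch in Python (not part of the patch).
The function names split_utf32 and compose_utf32 are made up for this
illustration; they only mirror the idea of turning one 32-bit character
code into four bytes and back, in either byte order.

```python
def split_utf32(code, big_endian):
    """Turn one 32-bit character code into its four bytes."""
    return code.to_bytes(4, "big" if big_endian else "little")


def compose_utf32(four_bytes, big_endian):
    """Rebuild the 32-bit character code from four bytes."""
    return int.from_bytes(four_bytes, "big" if big_endian else "little")


code = 279  # U+0117, the character used in the examples in this thread
le = split_utf32(code, big_endian=False)
be = split_utf32(code, big_endian=True)
print(list(le))  # [23, 1, 0, 0]
print(list(be))  # [0, 0, 1, 23]
print(compose_utf32(le, big_endian=False))  # 279
```

The little-endian result matches what Python's own "utf-32-le" codec
produces for the same character, which is a handy cross-check.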

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Paolo Bonzini

Re: {Spam?} Why string should be collection of single byte characters? (WAS: Re: [Q] Unicode String?)

In reply to this post by Chun, Sungjin

> I DO think that strlen is not for unicode(actually multi-byte encoded case)
> string and is bad design: limited to single byte encoding.
>
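I see it differently. strlen counts bytes, while mbrlen reports how many
bytes the next multibyte character occupies, so you can count characters
with it. In Smalltalk, #size returns allocation units: only if we stored
everything in UTF-32 would this equal the number of characters (no,
UTF-16 would not suffice, because of surrogate pairs).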
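
To make the difference between allocation units and characters
concrete, here is a small sketch in Python (chosen only because it is
easy to run; unit_counts is a hypothetical helper, not anything in GNU
Smalltalk). It counts the units needed for the same text in UTF-8,
UTF-16 and UTF-32.

```python
def unit_counts(text):
    """Return (characters, UTF-8 bytes, UTF-16 units, UTF-32 units)."""
    return (
        len(text),                           # Unicode code points
        len(text.encode("utf-8")),           # 1-byte units
        len(text.encode("utf-16-le")) // 2,  # 2-byte units, no BOM
        len(text.encode("utf-32-le")) // 4,  # 4-byte units, no BOM
    )


print(unit_counts("\u0117"))      # (1, 2, 1, 1)
print(unit_counts("\U0001D11E"))  # (1, 4, 2, 1)
```

The second character lies outside the 16-bit range, so UTF-16 needs two
units for it; only the UTF-32 count always equals the character count.
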
> I DO think that
> modern language should consider unicode like string. I DO think Smalltalk is
> MODERN :-)
>
I do think that modern languages should support Unicode, and you're right
that GNU Smalltalk (mostly) does not. But I don't think they should
dismiss byte-based character encodings like UTF-8. In my opinion these
should remain the primary representation, especially when, as in UTF-8,
it is easy to find the first byte of a character (unlike JIS-0212 or
GB-2312) and there is no need for escape sequences (unlike ISO-2022).
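
As an illustration of why the first byte is easy to find in UTF-8, here
is a minimal Python sketch. It assumes well-formed UTF-8 input, and the
helper names are invented for the example. Continuation bytes always
have the bit pattern 10xxxxxx, so any other byte starts a character.

```python
def is_lead_byte(b):
    """True unless b is a UTF-8 continuation byte (10xxxxxx)."""
    return (b & 0xC0) != 0x80


def char_start(data, i):
    """Step back from offset i to the first byte of its character."""
    while i > 0 and not is_lead_byte(data[i]):
        i -= 1
    return i


def count_chars(data):
    """Count characters by counting the bytes that start one."""
    return sum(1 for b in data if is_lead_byte(b))


data = "a\u0117\u20ac".encode("utf-8")  # 1 + 2 + 3 bytes
print(len(data))            # 6
print(count_chars(data))    # 3
print(char_start(data, 5))  # 3
```

No decoding state has to be carried along: looking at a single byte is
enough to tell whether a character starts there.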

Paolo


Paolo Bonzini

Re: {Spam?} Why string should be collection of single byte characters? (WAS: Re: [Q] Unicode String?)

In reply to this post by Paolo Bonzini

> I'm working on it in my spare time, I attach my current prototype patch.
I have almost completed this. It's only about 400 lines of new code,
mostly in i18n/Sets.st. I have defined a new UnicodeString class, and
modified Character to support characters whose Unicode code point
is > 255. For ease of testing and usage, I've also defined a
syntax $<279> that allows you to refer to a Character by its numeric
code point. It's equivalent to "279 asCharacter" -- I could instead have
inlined that expression at compile time, but I prefer to also have a
more compact syntax.

The changes are mostly backwards compatible, but characters should *not*
be compared with ==, but with = unless you're sure the code point is <=
255. Similarly, they should *not* be printed with nextPut:, but with
display:, unless you're sure the code point is <= 127.

What follows are some use cases. This is in a UTF-8 locale but (subject
to the capabilities of your system's iconv function) it works equally
well in every other locale.

I am no expert on the *needs* of people using Unicode, so can you
please confirm that it is (close to) what you need? In particular, I'd
like feedback on what to do when transcoding is not enabled, because
right now the behavior is inconsistent: see the notes preceded by ***.

Without the I18N package, the behavior is incomplete: you can store
Unicode characters, but not print them correctly:

Printing a Unicode character:
st> $<279> printNl!
$<16r0117>

Converting a Unicode character to String:
*** maybe should consider returning '?'
st> $<279> asString printNl!
error: Invalid argument <16r0117>: argument must be between $<0> and
$<16r00FF>

Converting a Unicode character to a UTF-32 String:
st> ($<279> asUnicodeString) printNl!
'<16r0117>'

Converting a UTF-32 String with a Unicode character to a byte-encoded
String:
*** maybe should give an error instead
st> $<279> asUnicodeString asString printNl!
'?'

Asking the resulting Strings for their number of characters:
st> $<279> asUnicodeString numberOfCharacters printNl!
1
st> $<279> asUnicodeString asString numberOfCharacters printNl!
error: should not be implemented in this class

Converting ByteArrays or Strings to UnicodeStrings:
st> #[196 151] asUnicodeString first printNl!
error: should not be implemented in this class

-----

After loading the I18N package, everything is much better:

Printing a Unicode character:
st> $<279> printNl!
$ė

Converting a Unicode character to String:
st> $<279> asString printNl!
'ė'

Converting a Unicode character to a UTF-32 String, and then back just by
printing it:
st> ($<279> asUnicodeString) printNl!
'ė'

Converting a UTF-32 String with a Unicode character to a byte-encoded
String:
st> $<279> asUnicodeString asString printNl!
'ė'

Asking the resulting Strings for their number of characters:
st> $<279> asUnicodeString numberOfCharacters printNl!
1

st> $<279> asUnicodeString asString numberOfCharacters printNl!
1

Converting ByteArrays or Strings to UnicodeStrings:
st> #[196 151] asUnicodeString first printNl!
$ė

st> #[196 151] asUnicodeString size printNl!
1

st> #[196 151] asUnicodeString numberOfCharacters printNl!
1
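
If you want to cross-check the byte values used above outside of
Smalltalk, the following Python sketch may help (the helper names are
made up for this example). It decodes the bytes 196 151 as UTF-8 and
shows a '?' substitution similar in spirit to the fallback marked with
*** earlier; Python's "replace" error handler is only an analogy, not
what gst does internally.

```python
def decode_utf8(byte_values):
    """Decode a list of byte values as UTF-8; return the code points."""
    return [ord(ch) for ch in bytes(byte_values).decode("utf-8")]


def to_ascii_lossy(code_point):
    """Encode one code point as ASCII, using '?' when it does not fit."""
    return chr(code_point).encode("ascii", errors="replace").decode("ascii")


print(decode_utf8([196, 151]))  # [279]
print(to_ascii_lossy(279))      # ?
print(to_ascii_lossy(65))       # A
```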

Paolo
