[vwnc] Questions about handling ISO8859L1 String objects...


Rick Flower
Hi all..

I'm trying my hand at parsing some CSV files that I receive (and have no
control over the format) and they
appear to be encoded as ISO8859 strings after the contents are read in
using :

    coll := '/tmp/foo.csv' asFilename contentsOfEntireFile tokensBasedOn: Character cr.

The first item in the collection looks something like the following :

should be read as "StoreName, textbox4" but comes in as :

$y "16r00FF"  (the 'y' actually has an umlaut over it -- I'm not really
sure what this first 32-bit word is for)
$p "16r00FE"
$S "16r0053"    --> S
$ "16r0000"
$t "16r0074"     --> t
$ "16r0000"
$o "16r006F"   --> o
$ "16r0000"
$r "16r0072"    --> r
$ "16r0000"
$e "16r0065"   --> e
$ "16r0000"

Is there some good way to convert this into a regular string?  Also --
if it helps, this will eventually be done by
passing the file in via Seaside using the WAUpload handling.. Not sure
if that matters..

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Re: [vwnc] Questions about handling ISO8859L1 String objects...

Rick Flower
Sorry.. One more thing.. After playing around with Neo Office (loading
the CSV), I
believe the file is actually encoded as a Unicode file and not ISO8859-1
hence the odd
characters at the beginning of the string.. Is there a way to override
the string encoding
when the file is read-in?  If so it may solve my problem directly..

Thx!

Re: [vwnc] Questions about handling ISO8859L1 String objects...

Rick Flower
Rick Flower wrote:
> Sorry.. One more thing.. After playing around with Neo Office (loading
> the CSV), I
> believe the file is actually encoded as a Unicode file and not ISO8859-1
> hence the odd
> characters at the beginning of the string.. Is there a way to override
> the string encoding
> when the file is read-in?  If so it may solve my problem directly..
>  
One more thing.. In doing more searching on VW & Unicode (less any ODBC
references), I
found a discussion from last January and ran the following on my OSX
version of VW :

(StreamEncoder new: #default) encoding

and see ISO8859-1 returned.. Should this be something else or is that
OK?  This image ultimately
is shared between OSX and Linux  -- not sure if that matters.

Re: [vwnc] Questions about handling ISO8859L1 String objects...

Steven Kelly
In reply to this post by Rick Flower

That's a Unicode file, UTF-16, so two bytes for each character. The FF and FE are the Byte Order Mark (BOM); they tell you which byte of each two-byte pair is the high-order byte and which is the low-order byte (FF FE means low byte first, i.e. little-endian), but you can see that already from the data.
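
To make the byte layout concrete, here is a minimal workspace sketch that decodes the bytes from the dump in the question by hand. It is only an illustration, not how VisualWorks does it internally: it assumes the data starts with a BOM and contains only two-byte characters, and the variable names are made up.

```smalltalk
| bytes littleEndian codes |
"Bytes from the dump in the question: FF FE (the BOM), then 'Store'."
bytes := #[255 254 83 0 116 0 111 0 114 0 101 0].
"FF FE means little-endian (low byte first); FE FF would mean big-endian.
 This sketch assumes one of the two is present and skips it."
littleEndian := (bytes at: 1) = 16rFF and: [(bytes at: 2) = 16rFE].
codes := OrderedCollection new.
3 to: bytes size - 1 by: 2 do: [:i |
    | first second |
    first := bytes at: i.
    second := bytes at: i + 1.
    codes add: (littleEndian
        ifTrue: [first + (second bitShift: 8)]
        ifFalse: [(first bitShift: 8) + second])].
"Answers the characters S, t, o, r, e."
(codes collect: [:each | Character value: each]) asArray
```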

If you just ask for contentsOfEntireFile, VisualWorks has to assume some encoding, and uses the platform's default. That is presumably ISO-8859-1 on Linux; on Windows it would be the similar Microsoft codepage (identical apart from Microsoft smart quotes, IIRC). You want to explicitly make the stream use UTF-16 encoding. The easiest way is just (aFilename withEncoding: #'utf-16') readStream (or some such message, sorry, I don't have an image with me on this machine). VW can probably figure out the Byte Order Mark itself these days (or was that just for XML files?), so just asking the stream for #contents should be enough.

It might be better style to use the stream as a stream:
[aStream atEnd] whileFalse: [lines add: (aStream upTo: Character cr)]
(or with more detailed stream processing to get the individual fields from each line).
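
Put together, reading the file line by line might look roughly like the sketch below. The withEncoding: message and the #'utf-16' symbol are taken as written from the suggestion above and are unverified; the exact encoding name may differ in your image. The path is the one from the original question.

```smalltalk
| stream lines |
"Encoding symbol copied from the suggestion above; treat its spelling as an assumption."
stream := ('/tmp/foo.csv' asFilename withEncoding: #'utf-16') readStream.
lines := OrderedCollection new.
"Collect one entry per CR-terminated line, and close the stream even if reading fails."
[[stream atEnd] whileFalse: [lines add: (stream upTo: Character cr)]]
    ensure: [stream close].
lines
```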

Hope this helps, and sorry I don't have the details to hand,
Steve

Re: [vwnc] Questions about handling ISO8859L1 String objects...

Rick Flower
Steven Kelly wrote:

>
> That's a Unicode file, UTF-16 so two bytes for each character. The FF
> and FE are the Byte Order Mark, and tell you of each two byte pair
> which is the high order and which the low order byte - but you can see
> that already from the data.
>
> If you just ask for contentsOfEntireFile, VisualWorks has to assume
> some encoding, and uses the platform's default. ISO-8859-1 is
> presumably on Linux, Windows would be the similar Microsoft codepage
> (identical apart from Microsoft smart quotes, IIRC). You want to
> explicitly make the stream use UTF-16 encoding. The easiest way is
> just (aFilename withEncoding: #'utf-16') readStream (or somesuch
> message, sorry, don't have an image with me on this machine). VW can
> probably figure out the Byte Order Mark itself these days (or was that
> just for XML files?), so just asking the stream for #contents should
> be enough.
>
> It might be better style to use the stream as a stream:
> [aStream atEnd] whileFalse: [lines add: (aStream upTo: Character cr)]
> (or with more detailed stream processing to get the individual fields
> from each line).
>
Steve,

Thanks for the help.. I tried using the 'withEncoding:' and it worked
like a charm.. All problems are gone.
Thanks for getting me unstuck..

-- Rick

Re: [vwnc] BOM

Steffen Märcker
In reply to this post by Steven Kelly
> VW can probably figure out the Byte Order Mark itself these days (or was  
> that just for XML files?), so just asking the stream for #contents  
> should be enough.

... at least for UTF-8 it doesn't respect the BOM. Each time an XML document
starts with one, the framework raises an exception. In my opinion, at least the
XML framework should handle that issue by default on its own.

See also: http://unicode.org/faq/utf_bom.html#bom5

Happy Easter!
Steffen


Re: [vwnc] BOM

Joachim Geidel
On 12.04.09 at 20:58, Steffen Märcker wrote:
>> VW can probably figure out the Byte Order Mark itself these days (or was
>> that just for XML files?), so just asking the stream for #contents
>> should be enough.
>
> ... at least for UTF8 it doesn't respect the BOM. Each time an XML starts
> with it, the framework runs into an Exception. In my opinion, at least the
> XML framework should handle that issue by default on it's own.
>
> See also: http://unicode.org/faq/utf_bom.html#bom5

...and the Unicode support is outdated and incomplete. E.g., the encoders
will write Characters with values beyond 16r10FFFF and silently truncate
them. Character values above 16r10FFFF are not legal in Unicode. In UTF-16,
there are "supplementary characters" represented by "surrogate pairs" which
are not supported in VisualWorks, see
http://www.parcplace.net/list/vwnc-archive/0608/msg00033.html. VW only
supports the Basic Multilingual Plane (BMP, plane 0). UTF-32 is not
supported at all. In addition, there are some problems when copying Strings
between external libraries and VW, see below.

For JNIPort, I had to implement my own encoder to correctly handle the
UTF-16 encoding used in java.lang.String. It's JavaLangStringStreamEncoder
in the package "JNIPort String Encoding", which is actually a UTF-16 encoder
with support for supplementary characters.

The UnicodeStreamEncoder in the Internationalization package implements
UCS-2 encoding, which is an obsolete precursor of UTF-16. UCS-2 supports
only characters up to 16rFFFF, but the UnicodeStreamEncoder does not check
this when writing a character. The correct behavior would be to write the
"Character illegalCode" 16rFFFF.

The UTF16StreamEncoder does not pay attention to surrogate pairs and will
produce illegal characters when reading supplementary characters from a
UTF-16 encoded text, instead of decoding the surrogate pairs into a
character in the range 16r10000-16r10FFFF. It also assumes that the size of
an encoded Character is always 2, which is wrong - for supplementary
characters, it is 4. So it's really another UCS-2 encoder, not a UTF-16
encoder.
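
For reference, decoding a surrogate pair is plain arithmetic. The workspace sketch below is only an illustration of the calculation, not code from JNIPort or VisualWorks, and the variable names are made up. It uses the pair D834 DD1E, which encodes U+1D11E (MUSICAL SYMBOL G CLEF).

```smalltalk
| high low codePoint |
"High surrogates lie in 16rD800-16rDBFF, low surrogates in 16rDC00-16rDFFF."
high := 16rD834.
low := 16rDD1E.
"Each surrogate carries 10 bits; the result is offset by 16r10000."
codePoint := ((high - 16rD800) bitShift: 10) + (low - 16rDC00) + 16r10000.
"codePoint is now 16r1D11E; its UTF-16 encoding takes 4 bytes, not 2."
codePoint
```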

Exchanging Strings between VW and external libraries has some problems, too.
For example:

copyUnicodeStringFromHeap
    "Answer an instance of a String by copying the null terminated Unicode string pointed to by the receiver from the external heap.  ..."

    ^self copyDoubleByteStringFromHeap: #UCS_2

copyDoubleByteStringFromHeap: encoding
    "..."

    | bytes |
    bytes := self primCopyDoubleByteStringFromHeap: theDatum pointerKind: type kind.
    ^encoding == #UCS_2
        ifTrue: [bytes changeClassTo: TwoByteString]
        ifFalse: [bytes asStringEncoding: encoding]

While it is correct that a UCS-2 encoded String can be copied without
modification, the assumption that a "Unicode String" has UCS-2 encoding is
wrong. This should be UTF-16. The same for String>>copyToHeapUnicode. The
problematic methods can be found by looking for senders of #UCS_2.

There is also a bug in String>>copyToHeap:encoding: which creates the
terminator for the external string like this:

    null := (ByteString new: 1) asByteArrayEncoding: encoding.

This expression will produce anything but nulls for many encodings, e.g.
Base-64 or UTF-7. However, the terminator of the external string should not
be an *encoding* of a null character, but simply one or two null characters.
See http://www.parcplace.net/list/vwnc-archive/0607/msg00339.html.

Cheers!
Joachim Geidel



Re: [vwnc] BOM

Jan Weerts
In reply to this post by Steffen Märcker
Joachim Geidel wrote:
> On 12.04.09 at 20:58, Steffen Märcker wrote:
> ...and the Unicode support is outdated and incomplete.

thanks Joachim for the collection.

I would like to add "Combining Diacritical Marks".
That's U+0300 and above, which are read from a UTF-8
stream as gibberish or cause errors. See
http://www.unicode.org/charts/PDF/U0300.pdf for some
really strange diacriticals like "combining seagull
below". Murphy dictates that we had to handle some data including
more than the representable set of characters. This
brought our attention to Character>>initCompositeLetters
and the shared variables referenced therein. This and
Character>>diacriticalNamed: seem to be based on a rather
old Unicode version.

Regards
  Jan

Re: [vwnc] Questions about handling ISO8859L1 String objects...

Holger Guhl
In reply to this post by Steven Kelly
If you want to have a look:
In section "Heeg" of VisualWorks Contributions you will find parcel
"GHCsvImportExport [1.10]". We had similar issues and solved them. If
you want to reuse the entire parcel or just the code dealing with BOM
(byte order mark), go ahead ...
Cheers

Holger Guhl
--
Senior Consultant * Certified Scrum Master * [hidden email]
Tel: +49 231 9 75 99 21 * Fax: +49 231 9 75 99 20
Georg Heeg eK Dortmund
Handelsregister: Amtsgericht Dortmund  A 12812



Re: [vwnc] Questions about handling ISO8859L1 String objects...

Steven Kelly
In reply to this post by Rick Flower
Thanks Holger! I see that fix was written nearly 3 years ago. Strikes me
that the process of getting fixes into the base isn't working :-(.

Steve

Re: [vwnc] BOM

Alan Knight-2
In reply to this post by Jan Weerts
Thanks indeed. We're aware of a number of these, and fixes are already in the works for 7.7, but it's nice to have a concise list like that of things people have run into.

--
Alan Knight [|], Engineering Manager, Cincom Smalltalk

Re: [vwnc] XML name

Steffen Märcker
This is somewhat related. As I pointed out earlier (see old mail below),
it seems that the XML framework does not fully respect the specification
of IDs: it rejects the colon as a name character.
If my observation is correct, will this be fixed as well?

Greetings,
Steffen


Old mail to vwnc:


Today I played with the XML ID specification. It states that an ID  
attribute must match the _Name_ production.  
(http://www.w3.org/TR/2006/REC-xml-20060816/#id)

I created Character classes from the given definition and tested them
against the existing implementation. The result is that the XML framework
rejects the colon. But the XML 1.0 specification explicitly states: [...] XML
processors must accept the colon as a name character. Two questions:
1. Is my test suitable?
2. If so, why does the implementation differ from the specification?

Workspace code and the XMLCharacterClass class (requires Regex11) are
attached.

Ciao, Steffen

Attachments: IDs.ws (1K), XMLCharacterClass.st (22K)