Hi all..
I'm trying my hand at parsing some CSV files that I receive (I have no control over the format), and they appear to be encoded as ISO8859 strings after the contents are read in using:

    coll := '/tmp/foo.csv' asFilename contentsOfEntireFile tokensBasedOn: Character cr.

The first item in the collection should be read as "StoreName, textbox4" but comes in as:

    $y "16r00FF"   (the 'y' actually has an umlaut over it -- I'm not really sure what this first 32-bit word is for)
    $p "16r00FE"
    $S "16r0053" --> S
    $  "16r0000"
    $t "16r0074" --> t
    $  "16r0000"
    $o "16r006F" --> o
    $  "16r0000"
    $r "16r0072" --> r
    $  "16r0000"
    $e "16r0065" --> e
    $  "16r0000"

Is there some good way to convert this into a regular string? Also, if it helps, this will eventually be done by passing the file in via Seaside using the WAUpload handling. Not sure if that matters.

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Sorry.. One more thing.. After playing around with NeoOffice (loading the CSV), I believe the file is actually encoded as a Unicode file and not ISO8859-1, hence the odd characters at the beginning of the string. Is there a way to override the string encoding when the file is read in? If so, it may solve my problem directly.

Thx!
Rick Flower wrote:
> Sorry.. One more thing.. After playing around with Neo Office (loading
> the CSV), I believe the file is actually encoded as a Unicode file and
> not ISO8859-1 hence the odd characters at the beginning of the string..
> Is there a way to override the string encoding when the file is
> read-in? If so it may solve my problem directly..

One more thing.. In doing more searching on VW & Unicode (less any ODBC references), I found a discussion from last January and ran the following on my OSX version of VW:

    (StreamEncoder new: #default) encoding

and see ISO8859-1 returned. Should this be something else, or is that OK? This image is ultimately shared between OSX and Linux -- not sure if that matters.
In reply to this post by Rick Flower
That's a Unicode file, UTF-16, so two bytes for each character. The FF and FE are the Byte Order Mark, which tells you which byte of each two-byte pair is the high-order byte and which is the low-order byte -- but you can see that already from the data.
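To make the byte-order point concrete, here is a minimal workspace sketch in plain Smalltalk. The sample bytes are the first six bytes shown in Rick's dump; the variable names are made up for illustration, and treating a missing BOM as big-endian is just an assumption of this sketch.

```smalltalk
| bytes byteOrder firstCode |
"First six bytes from the dump: the BOM, then 'S' and 't'."
bytes := ByteArray withAll: #(16rFF 16rFE 16r53 16r00 16r74 16r00).

"FF FE means the low-order byte comes first (little-endian);
 FE FF means the high-order byte comes first (big-endian)."
byteOrder := ((bytes at: 1) = 16rFF and: [(bytes at: 2) = 16rFE])
    ifTrue: [#littleEndian]
    ifFalse: [((bytes at: 1) = 16rFE and: [(bytes at: 2) = 16rFF])
        ifTrue: [#bigEndian]
        ifFalse: [#noByteOrderMark]].

"Combine the first byte pair after the BOM into one 16-bit code unit.
 Anything other than #littleEndian is read as big-endian here (assumption)."
firstCode := byteOrder = #littleEndian
    ifTrue: [(bytes at: 3) + ((bytes at: 4) bitShift: 8)]
    ifFalse: [((bytes at: 3) bitShift: 8) + (bytes at: 4)].

Character value: firstCode
```

For these bytes, byteOrder is #littleEndian and the last expression answers $S, which matches the "StoreName" header Rick expected.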
Steven Kelly wrote:
>
> That's a Unicode file, UTF-16 so two bytes for each character. The FF
> and FE are the Byte Order Mark, and tell you of each two byte pair
> which is the high order and which the low order byte - but you can see
> that already from the data.
>
> If you just ask for contentsOfEntireFile, VisualWorks has to assume
> some encoding, and uses the platform's default. ISO-8859-1 is
> presumably on Linux, Windows would be the similar Microsoft codepage
> (identical apart from Microsoft smart quotes, IIRC). You want to
> explicitly make the stream use UTF-16 encoding. The easiest way is
> just
>     (aFilename withEncoding: #'utf-16') readStream
> (or somesuch message, sorry, don't have an image with me on this
> machine). VW can probably figure out the Byte Order Mark itself these
> days (or was that just for XML files?), so just asking the stream for
> #contents should be enough.
>
> It might be better style to use the stream as a stream:
>     [aStream atEnd] whileFalse: [lines add: (aStream upTo: Character cr)]
> (or with more detailed stream processing to get the individual fields
> from each line).
>

Thanks for the help.. I tried using 'withEncoding:' and it worked like a charm. All problems are gone. Thanks for getting me unstuck.

-- Rick
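Putting Steven's two suggestions together, a workspace sketch might look like the following. The path is the one from Rick's first post, and withEncoding: with #'utf-16' is taken from the thread as Rick confirmed it; splitting each line on commas is only an assumption for illustration and does not handle quoted fields that contain commas.

```smalltalk
| stream lines rows |
lines := OrderedCollection new.

"Open the file with an explicit encoding instead of the platform default."
stream := ('/tmp/foo.csv' asFilename withEncoding: #'utf-16') readStream.

"Read line by line, and close the stream even if reading fails."
[[stream atEnd] whileFalse: [lines add: (stream upTo: Character cr)]]
    ensure: [stream close].

"Naive field split; real CSV needs proper handling of quoted fields."
rows := lines collect: [:line | line tokensBasedOn: $,].
```

Reading through the stream rather than contentsOfEntireFile keeps the encoding decision in one visible place, which also matters if the same image runs on platforms with different defaults.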
In reply to this post by Steven Kelly
> VW can probably figure out the Byte Order Mark itself these days (or was
> that just for XML files?), so just asking the stream for #contents
> should be enough.

... at least for UTF-8 it doesn't respect the BOM. Each time an XML document starts with one, the framework runs into an exception. In my opinion, at least the XML framework should handle that issue by default on its own.

See also: http://unicode.org/faq/utf_bom.html#bom5

Happy Easter!
Steffen

> It might be better style to use the stream as a stream:
>     [aStream atEnd] whileFalse: [lines add: (aStream upTo: Character cr)]
> (or with more detailed stream processing to get the individual fields
> from each line).
>
> Hope this helps, and sorry I don't have the details to hand,
> Steve
On 12.04.09 20:58, Steffen Märcker wrote:
>> VW can probably figure out the Byte Order Mark itself these days (or was
>> that just for XML files?), so just asking the stream for #contents
>> should be enough.
>
> ... at least for UTF8 it doesn't respect the BOM. Each time an XML starts
> with it, the framework runs into an Exception. In my opinion, at least the
> XML framework should handle that issue by default on it's own.
>
> See also: http://unicode.org/faq/utf_bom.html#bom5

...and the Unicode support is outdated and incomplete. E.g., the encoders will write Characters with values beyond 16r10FFFF and silently truncate them. Character values above 16r10FFFF are not legal in Unicode. In UTF-16, there are "supplementary characters" represented by "surrogate pairs", which are not supported in VisualWorks; see http://www.parcplace.net/list/vwnc-archive/0608/msg00033.html. VW only supports the Basic Multilingual Plane (BMP, plane 0). UTF-32 is not supported at all. In addition, there are some problems when copying Strings between external libraries and VW, see below.

For JNIPort, I had to implement my own encoder to correctly handle the UTF-16 encoding used in java.lang.String. It's JavaLangStringStreamEncoder in the package "JNIPort String Encoding", which is actually a UTF-16 encoder with support for supplementary characters.

The UnicodeStreamEncoder in the Internationalization package implements UCS-2 encoding, which is an obsolete precursor of UTF-16. UCS-2 supports only characters up to 16rFFFF, but the UnicodeStreamEncoder does not check this when writing a character. The correct behavior would be to write the "Character illegalCode" 16rFFFF.

The UTF16StreamEncoder does not pay attention to surrogate pairs and will produce illegal characters when reading supplementary characters from a UTF-16 encoded text, instead of decoding the surrogate pairs into a character in the range 16r10000-16r10FFFF. It also assumes that the size of an encoded Character is always 2, which is wrong: for supplementary characters, it is 4. So it's really another UCS-2 encoder, not a UTF-16 encoder.

Exchanging Strings between VW and external libraries has some problems, too. For example:

    copyUnicodeStringFromHeap
        "Answer an instance of a String by copying the null terminated
        Unicode string pointed to by the receiver from the external heap. ..."

        ^self copyDoubleByteStringFromHeap: #UCS_2

    copyDoubleByteStringFromHeap: encoding
        "..."

        | bytes |
        bytes := self primCopyDoubleByteStringFromHeap: theDatum pointerKind: type kind.
        ^encoding == #UCS_2
            ifTrue: [bytes changeClassTo: TwoByteString]
            ifFalse: [bytes asStringEncoding: encoding]

While it is correct that a UCS-2 encoded String can be copied without modification, the assumption that a "Unicode String" has UCS-2 encoding is wrong. This should be UTF-16. The same goes for String>>copyToHeapUnicode. The problematic methods can be found by looking for senders of #UCS_2.

There is also a bug in String>>copyToHeap:encoding:, which creates the terminator for the external string like this:

    null := (ByteString new: 1) asByteArrayEncoding: encoding.

This expression will produce anything but nulls for many encodings, e.g. Base-64 or UTF-7. However, the terminator of the external string should not be an *encoding* of a null character, but simply one or two null bytes. See http://www.parcplace.net/list/vwnc-archive/0607/msg00339.html.

Cheers!
Joachim Geidel
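For readers unfamiliar with surrogate pairs, the decoding step that Joachim says a real UTF-16 decoder has to perform can be sketched with plain integer arithmetic. This is only an illustration of the standard UTF-16 formula, not code from VisualWorks or JNIPort; the sample pair D834 DD1E (U+1D11E, MUSICAL SYMBOL G CLEF) and the variable names are chosen here for the example.

```smalltalk
| high low codePoint encodedSize |
"Two 16-bit code units as they would appear in a UTF-16 stream."
high := 16rD834.
low := 16rDD1E.

"A valid pair has the high unit in D800-DBFF and the low unit in DC00-DFFF."
((high between: 16rD800 and: 16rDBFF) and: [low between: 16rDC00 and: 16rDFFF])
    ifFalse: [self error: 'not a surrogate pair'].

"Each unit contributes 10 bits; the result is offset by 16r10000."
codePoint := 16r10000 + ((high - 16rD800) bitShift: 10) + (low - 16rDC00).

"Supplementary characters take 4 bytes in UTF-16, BMP characters take 2."
encodedSize := codePoint > 16rFFFF ifTrue: [4] ifFalse: [2].

codePoint = 16r1D11E
```

The last expression answers true, and encodedSize is 4, which is the case an encoder that assumes two bytes per character gets wrong.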
In reply to this post by Steffen Märcker
Joachim Geidel wrote:
> On 12.04.09 20:58, Steffen Märcker wrote:
> ...and the Unicode support is outdated and incomplete.

Thanks, Joachim, for the collection. I would like to add "Combining Diacritical Marks". That's U+0300 and above, which will be read from a UTF-8 stream into gibberish or errors. See http://www.unicode.org/charts/PDF/U0300.pdf for some really strange diacriticals like "combining seagull below".

Murphy dictates that we had to process some data containing more than the representable set of characters. This brought our attention to Character>>initCompositeLetters and the shared variables referenced therein. This and Character>>diacriticalNamed: seem to be based on a rather old Unicode version.

Regards
Jan
In reply to this post by Steven Kelly
If you want to have a look:
In section "Heeg" of VisualWorks Contributions you will find parcel "GHCsvImportExport [1.10]". We had similar issues and solved them. If you want to reuse the entire parcel or just the code dealing with BOM (byte order mark), go ahead ...

Cheers

Holger Guhl
--
Senior Consultant * Certified Scrum Master * [hidden email]
Tel: +49 231 9 75 99 21 * Fax: +49 231 9 75 99 20
Georg Heeg eK Dortmund
Handelsregister: Amtsgericht Dortmund A 12812
In reply to this post by Rick Flower
Thanks Holger! I see that fix was written nearly 3 years ago. It strikes me that the process of getting fixes into the base isn't working :-(.

Steve
In reply to this post by Jan Weerts
Thanks indeed. We're aware of a number of these, and fixes
are already in the works for 7.7, but it's nice to have a concise list
like that of things people have run into.
--
Alan Knight, Engineering Manager, Cincom Smalltalk
This is somewhat related. As I pointed out earlier (see old mail below), it seems that the XML framework does not fully respect the specification of IDs: it rejects the colon as a name character. If my observation is correct, will this be fixed as well?

Greetings,
Steffen

Old mail to vwnc:

Today I played with the XML ID specification. It states that an ID attribute must match the _Name_ production (http://www.w3.org/TR/2006/REC-xml-20060816/#id). I created Character classes from the given definition and tested them against the existing implementation. The result is that the XML framework rejects the colon. But the XML 1.0 specification explicitly states:

    [...] XML processors must accept the colon as a name character.

Two questions:
1. Is my test suitable?
2. If not, why does the implementation differ from the specification?

Workspace code and the XMLCharacterClass class (requires Regex11) are attached.

Ciao,
Steffen