Can someone point me in the direction of a straightforward guide to handling unicodes in ST?

I need to be able to read, process and possibly write out again data which includes extended character sets; not just the usual French and German accents, but eastern European such as the S-caron and so on. How do I

- set up a ReadStream on the file which will capture the incoming data as 16-bit characters
- set up 16-bit character and string variables which I can use to process these data
- write out 16-bit characters (appropriate form of WriteStream)

The input and output are XML files. At present the input file turns up in VW as a garbled string, which I detect and manually convert using data from another source to provide the first best guess. At present I am outputting extended characters as their unicode mnemonics, e.g. &eacute; &scaron;, but the file is being transferred to an application which could handle native utf-16 if I can output it.

I've played around a bit with TwoByteString and so on but can't figure out how to do the whole task end to end.

Tony Law
--------------
ITasITis: technology postings from InformationSpan
http://itasitis.wordpress.com/
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
(filename withEncoding: #utf8) readStream.
(filename withEncoding: #utf8) writeStream.
(filename withEncoding: #utf16) readStream.
(filename withEncoding: #utf16) writeStream.

-Boris

--
DeepCove Labs Ltd.
In reply to this post by Tony Law
Just to add to Boris’s answer:

Characters and strings inside VW work fine with Unicode, with no special effort (a String is a ByteString or a TwoByteString pretty automatically, a bit like a number is a SmallInteger or LargePositiveInteger automatically).

When reading or writing external files or streams, you need to specify what encoding to use. By default you’ll get the platform’s encoding for your current locale, but you can specify an encoding with #withEncoding: as Boris showed.

When reading a file you need to know what encoding it was written with: a capital S-caron (Š) is represented as a single byte with value A9 if the encoding is ISO 8859-2, but as 2 bytes in UTF-16 (or 4 if represented as a normal S plus a combining caron). Just specify the right encoding and let VW do the work.

Steve
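To see the automatic choice of string class for yourself, here is a small workspace sketch. It assumes a stock VisualWorks image; the variable names and the sample word are made up for illustration, and the concrete classes you get back may vary by version, so evaluate the last three lines with "print it" rather than taking them on trust.

```smalltalk
"Workspace sketch, not a definitive recipe. Variable names are illustrative."
| plain wide |

"An ordinary literal containing only ASCII characters."
plain := 'Skoda'.

"Character value: builds a character from its Unicode code point;
 16r0160 is LATIN CAPITAL LETTER S WITH CARON."
wide := (String with: (Character value: 16r0160)) , 'koda'.

"Print each of these in turn."
plain class.   "the concrete String subclass VW picked for the ASCII text"
wide class.    "the concrete String subclass VW picked once a wide character is present"
wide size.     "size counts characters, not bytes"
```

Nothing here mentions an encoding, which is the point: encodings only come into play at the boundary with files and other external streams.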
In reply to this post by Tony Law
You need to know the encoding of the input XML file.

Then you do (as Boris typed):

    (filename withEncoding: #theInputFileEncoding) readStream.

Reading from this stream should give you an internal string of an appropriate class (the internal string representation is in general not linked to what the external file had).

Then, to output a 16-bit character file, use

    (filename withEncoding: #utf16) writeStream

and stream out the modified string.

In short:
- Read the external string, specifying the encoding it's in.
- Do manipulation internally without worrying about encoding.
- Specify your wanted format (in your case probably either #utf16 or #ucs_2) when writing back to disk.

Cheers,
Henry

Addendum:
If you want to be evil, and/or really need the internal representation to be 16 bit, you could do:

    rsBin := ('testFile.txt' asFilename withEncoding: #binary) readStream.
    tmp := rsBin contents.
    rsBin close.
    TwoByteString adoptInstance: tmp.

Just make sure the byte ordering of the file and what VW expects are the same :)
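Put together, the three steps (read with the input encoding, process internally, write with the output encoding) might look like the sketch below. The file names are hypothetical, and #utf8 as the input encoding is only an assumption borrowed from Boris's example; substitute whatever the input file was really written with.

```smalltalk
"End-to-end sketch under stated assumptions:
 - 'input.xml' and 'output.xml' are hypothetical file names
 - the input is assumed to be UTF-8; use the file's real encoding instead"
| in out text |

"1. Read, telling VW which encoding the file uses."
in := ('input.xml' asFilename withEncoding: #utf8) readStream.
text := [in upToEnd] ensure: [in close].

"2. Process text here as an ordinary String; no encoding concerns at this stage."

"3. Write, naming the encoding the receiving application expects."
out := ('output.xml' asFilename withEncoding: #utf16) writeStream.
[out nextPutAll: text] ensure: [out close].
```

The ensure: blocks close each stream even if reading or writing fails part-way. Since the files are XML, the encoding named in the XML declaration of the output should agree with the encoding actually used to write it.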
In reply to this post by Tony Law
When importing text from external sources, also keep in mind line-end conversions, which are normally performed along with character encoding/decoding. Here's a shameless plug of a post on this topic that I wrote a while ago:

http://www.cincomsmalltalk.com/userblogs/cst/blogView?showComments=true&printTitle=No_End_of_Line_End_Confusion.&entry=3349018393

HTH,
Martin

"Henrik Johansen" <[hidden email]> wrote:
> Date: January 20, 2010 9:17:15 AM
> From: "Henrik Johansen" <[hidden email]>
> To: "Tony Law" <[hidden email]>
> Cc: VWNC <[hidden email]>
> Subject: Re: [vwnc] Handling unicodes in VW