[vwnc] Handling unicodes in VW

Tony Law
Can someone point me in the direction of a straightforward guide to handling Unicode in ST?

I need to be able to read, process and possibly write out again data which includes extended character sets; not just the usual French and German accents, but eastern European such as the S-caron and so on. How do I
  • set up a ReadStream on the file which will capture the incoming data as 16-bit characters
  • set up 16-bit character and string variables which I can use to process these data
  • write out 16-bit characters (appropriate form of WriteStream)

The input and output are XML files. At present the input file turns up in VW as a garbled string, which I detect and manually convert using data from another source to provide the first best guess. At present I am outputting extended characters as their Unicode mnemonics, e.g. &eacute; &scaron;, but the file is being transferred to an application which could handle native UTF-16 if I can output it.

I've played around a bit with TwoByteString and so on but can't figure out how to do the whole task end to end.

Tony Law
--------------
ITasITis: technology postings from InformationSpan
http://itasitis.wordpress.com/


_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Re: [vwnc] Handling unicodes in VW

Boris Popov, DeepCove Labs (SNN)
(filename withEncoding: #utf8) readStream.
(filename withEncoding: #utf8) writeStream.

(filename withEncoding: #utf16) readStream.
(filename withEncoding: #utf16) writeStream.


-Boris

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Tony Law
Sent: 20 January 2010 12:14
To: VWNC
Subject: [vwnc] Handling unicodes in VW

 


Re: [vwnc] Handling unicodes in VW

Steven Kelly
In reply to this post by Tony Law
Just to add to Boris’s answer:

 

Characters and strings inside VW work fine with Unicode, with no special effort (a String is a ByteString or a TwoByteString pretty automatically, a bit like a number is a SmallInteger or LargePositiveInteger automatically).

 

When reading or writing external files or streams, you need to specify what encoding to use. By default you’ll get the platform’s encoding for your current locale, but you can specify an encoding with #withEncoding: as Boris showed. When reading a file you need to know what encoding it was written with: a capital S-caron (Š, U+0160) is represented as a single byte with value A9 if the encoding is ISO 8859-2 (the lowercase š is B9), but as 2 bytes in UTF-16 (or 4 if represented as a normal S plus a combining caron). Just specify the right encoding and let VW do the work.
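
As a rough illustration of the point about string classes, the workspace snippet below builds one plain string and one containing a capital S-caron, then asks each for its class. This is a sketch from memory rather than tested code: it assumes String with: and Character value: behave in your VisualWorks version as they do in standard Smalltalk, and the variable names are made up for the example.

```smalltalk
"Workspace sketch; plain and wide are hypothetical names used only here."
| plain wide |
plain := 'resume'.
"U+0160 is the capital S with caron mentioned above."
wide := String with: (Character value: 16r0160).

"Evaluate each line below with Print it to see which class VW picked."
plain class.
wide class.
```

If the automatic switching described above applies, the second expression should report a wider string class than the first, without the code ever naming TwoByteString.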

 

Steve

 


Re: [vwnc] Handling unicodes in VW

Henrik Sperre Johansen
In reply to this post by Tony Law
You need to know the encoding of the input xml file.


Then you do (as Boris typed):
(filename withEncoding: #theInputFileEncoding) readStream.

Reading from this stream should give you an internal string of an appropriate class (internal string representation is in general not linked to what the external file had).

Then, to output a 16-bit character file, use
(filename withEncoding: #utf16) writeStream, and stream out the modified string.

In short:
- read external string specifying the encoding it's in.
- Do manipulation internally without worrying about encoding.
- Specify your wanted format (in your case probably either #utf16 or #ucs_2) when writing back to disk.
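
The three steps above can be sketched end to end in a workspace roughly as follows. Treat this as an untested outline under stated assumptions: the file names 'input.xml' and 'output.xml' are hypothetical, #utf8 stands in for whatever encoding the input really uses, and the variable names are invented for the example. Only withEncoding:, readStream and writeStream come from the messages in this thread.

```smalltalk
"Workspace sketch. File names are hypothetical, and #utf8 is only a
 placeholder for the real encoding of the input file."
| source target text |

"1. Read, stating the encoding the file was written in."
source := ('input.xml' asFilename withEncoding: #utf8) readStream.
text := [source upToEnd] ensure: [source close].

"2. Process text here. It is an ordinary String, whatever the file used."

"3. Write, stating the encoding the receiving application expects."
target := ('output.xml' asFilename withEncoding: #utf16) writeStream.
[target nextPutAll: text] ensure: [target close].
```

The ensure: blocks close each file even if reading or writing fails. Since these are XML files, any encoding named in the XML declaration of the output should also be changed to match the encoding actually written.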

Cheers,
Henry

Addendum:
If you want to be evil, and/or really need the internal representation to be 16 bit, you could do:

rsBin := ('testFile.txt' asFilename withEncoding: #binary) readStream.
tmp := rsBin contents.
rsBin close.
TwoByteString adoptInstance: tmp.

Just make sure the byte ordering of the file and what VW expects are the same :)


Re: [vwnc] Handling unicodes in VW

kobetic
In reply to this post by Tony Law
When importing text from external sources, also keep in mind line-end conversions, which are normally performed along with character encoding/decoding. Here's a shameless plug of a post on this topic that I wrote a while ago:

http://www.cincomsmalltalk.com/userblogs/cst/blogView?showComments=true&printTitle=No_End_of_Line_End_Confusion.&entry=3349018393
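
For what it is worth, here is a small sketch of switching that conversion off when the line ends must survive untouched. It assumes the encoded stream answers lineEndTransparent, which is how I remember the VisualWorks stream protocol; please check that selector in your image before relying on it. The file name and #utf8 are again placeholders.

```smalltalk
"Workspace sketch. lineEndTransparent is quoted from memory and should be
 verified in your image; the file name and encoding are hypothetical."
| source raw |
source := ('input.xml' asFilename withEncoding: #utf8) readStream.

"Ask the stream to pass CR, LF and CRLF through exactly as stored."
source lineEndTransparent.

raw := [source upToEnd] ensure: [source close].
```

Without that line the stream is free to translate the external line ends into the internal convention while it decodes the characters.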

HTH,

Martin

"Henrik Johansen"<[hidden email]> wrote:

> Date: January 20, 2010 9:17:15 AM
> From: "Henrik Johansen"<[hidden email]>
> To: "Tony Law"<[hidden email]>
> Cc: VWNC<[hidden email]>
> Subject: Re: [vwnc] Handling unicodes in VW