[vwnc] XML.StreamWrapper error on UTF-8 "byte-order mark"

Steven Kelly
In 7.6, parsing a UTF-8 XML file that starts with a BOM fails with the
error "< expected, but not found". XML.StreamWrapper>>checkEncoding only
takes account of a UTF-16 BOM (somewhat odd, given that it first checks
that the encoding is UTF-8). Maybe I'm missing something here. The
version below got my file to read. I couldn't resist making the check
for the UTF-16 BOM (FE FF / FF FE) rather less cryptic than "c1 * c2 =
16rFD02": to understand that, you need to know that multiplication is
commutative, that FE * FF = FD02, and that no other pair of bytes
multiplies to the same value.
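
The uniqueness claim can be checked by brute force. The workspace
snippet below is only an illustrative sketch (the variable name "pairs"
is made up here, not taken from the base code): it collects every
ordered pair of byte values whose product is 16rFD02.

```smalltalk
"Workspace sketch: find every ordered pair of byte values (0..255)
whose product equals 16rFD02. 'pairs' is an illustrative name."
| pairs |
pairs := OrderedCollection new.
0 to: 255 do: [:c1 |
	0 to: 255 do: [:c2 |
		c1 * c2 = 16rFD02
			ifTrue: [pairs add: (Array with: c1 with: c2)]]].
pairs
```

Only 16rFE and 16rFF, in either order, should turn up: 16rFD02 is 64770,
and if one factor were 253 or less the other would have to exceed 255.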

The last ifTrue: block could just be "stream position: pos + 3" if we
can be certain that will put us in the right place and state, even for
streams whose encoding uses several bytes per character. That sounds
reasonable, given that we've just decided that this really is a UTF-8
stream.

Steve

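checkEncoding
        "Look for a byte-order mark at the start of the stream. A UTF-16
        BOM switches the stream to a UTF-16 encoder, a UTF-8 BOM is
        skipped, and anything else rewinds the stream to where it was."

        | encoding |
        encoding := [stream encoding] on: Error do: [:ex | ex returnWith: #null].
        encoding = #'UTF-8'
                ifTrue:
                        [| firstPair third pos |
                        pos := stream position.
                        "Read the first two bytes raw and peek at the third."
                        stream setBinary: true.
                        firstPair := stream nextAvailable: 2.
                        third := stream peek.
                        stream setBinary: false.
                        (#([16rFE 16rFF] [16rFF 16rFE]) includes: firstPair)
                                ifTrue:
                                        ["UTF-16 BOM, in either byte order"
                                        stream encoder: (UTF16StreamEncoder new
                                                forByte1: firstPair first
                                                byte2: firstPair last)]
                                ifFalse:
                                        [(firstPair = #[16rEF 16rBB] and: [third = 16rBF])
                                                ifTrue:
                                                        ["UTF-8 BOM (EF BB BF): consume the third byte as well"
                                                        stream setBinary: true; next; setBinary: false]
                                                ifFalse: [stream position: pos]]]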

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Re: [vwnc] XML.StreamWrapper error on UTF-8 "byte-order mark"

Steffen Märcker
Some time ago I had a similar problem with BOM-marked UTF-8 files. It
looked like the API assumes that UTF-8 input does not begin with a BOM,
even though the specification allows it. The situation in 7.6 seems to be
the same. Wouldn't it be reasonable to treat a stream starting with "<" as
UTF-8 encoded in general?
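
One caveat with keying on "<" alone: a UTF-16 little-endian stream
without a BOM also begins with the byte 16r3C (followed by 00). The
non-normative appendix on autodetection in the XML 1.0 specification
tells the cases apart by looking at more than one byte. A rough
workspace sketch of that idea follows; the block name "guessEncoding"
and the returned symbols are illustrative assumptions, not VisualWorks
API, and only the first two bytes are examined.

```smalltalk
"Sketch: guess an encoding family from the first two bytes of an XML
document that has no BOM. 'guessEncoding' and the symbols it answers
are illustrative names, not part of any library."
| guessEncoding |
guessEncoding := [:bytes |
	(bytes size >= 2 and: [(bytes at: 1) = 16r3C])
		ifTrue:
			["'<' first: UTF-16LE if a zero byte follows; otherwise
			ASCII-compatible, taken as UTF-8 unless an encoding
			declaration says otherwise"
			(bytes at: 2) = 0
				ifTrue: [#'UTF-16LE']
				ifFalse: [#'UTF-8']]
		ifFalse:
			[(bytes size >= 2
				and: [(bytes at: 1) = 0 and: [(bytes at: 2) = 16r3C]])
					ifTrue: [#'UTF-16BE']
					ifFalse: [#unknown]]].
guessEncoding value: #[16r3C 16r3F 16r78 16r6D]
```

The last line feeds in the bytes of "<?xm", the usual start of an XML
declaration in an ASCII-compatible encoding.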

Steffen


On 27.11.2008 at 17:52, Steven Kelly <[hidden email]> wrote:

> [...]

Re: [vwnc] XML.StreamWrapper error on UTF-8 "byte-order mark"

Holger Guhl
Please have a look at our recent version of GHCsvImportExport (1.11) in
the Cincom Public Repository. We made some extensions to PeekableStream
to carefully peek for a BOM (byte order mark). Method #nextBOM peeks for
a byte order mark and leaves the stream positioned just after it (if
there is one). Method #getEncodingFromBOM translates the result into an
encoding symbol.
Method Heeg.CsvReader>>onFileNamed: shows a possible application scenario.
We have not yet extracted the reusable PeekableStream code into a
separate package. I'd like to encourage you to go with this approach or
adapt it to your needs. Having reusable code for Stream is better than
inlining it wherever you need it. BTW: the PeekableStream extension
methods have some nice comments that explain some of the bits-and-bytes
"magic".

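To give a rough idea of what translating a BOM into an encoding symbol
can look like, here is a small workspace sketch. It is not the code from
GHCsvImportExport: the block names "startsWith" and "bomInfo", the
symbols and the answered Array shape are all assumptions made for
illustration, it works on a ByteArray rather than a stream, and it
ignores UTF-32.

```smalltalk
"Sketch: map the leading bytes of a ByteArray to an encoding symbol
and the length of the BOM. All names here are illustrative; UTF-32
BOMs are deliberately not handled."
| startsWith bomInfo |
startsWith := [:bytes :prefix |
	bytes size >= prefix size
		and: [(bytes copyFrom: 1 to: prefix size) = prefix]].
bomInfo := [:bytes |
	(startsWith value: bytes value: #[16rEF 16rBB 16rBF])
		ifTrue: [Array with: #'UTF-8' with: 3]
		ifFalse:
			[(startsWith value: bytes value: #[16rFE 16rFF])
				ifTrue: [Array with: #'UTF-16BE' with: 2]
				ifFalse:
					[(startsWith value: bytes value: #[16rFF 16rFE])
						ifTrue: [Array with: #'UTF-16LE' with: 2]
						ifFalse: [Array with: nil with: 0]]]].
bomInfo value: #[16rEF 16rBB 16rBF 16r3C]
```

Answering the BOM length alongside the symbol lets the caller skip
exactly that many bytes before handing the rest to a decoder.
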
Regards

Holger Guhl
--
Senior Consultant * Certified Scrum Master * [hidden email]
Tel: +49 231 9 75 99 21 * Fax: +49 231 9 75 99 20
Georg Heeg eK Dortmund
Handelsregister: Amtsgericht Dortmund  A 12812


Re: [vwnc] XML.StreamWrapper error on UTF-8 "byte-order mark"

Steven Kelly
Thanks Holger, I'll take a look. While of course I agree with the idea
of putting the code in the right place, I find it's much easier to get a
change into the base VW if it's a bug fix to a single base method, than
if it's an extension in the public repository. Once it's in the base VW,
it's maintained and supported, and of course it's there immediately for
both new and experienced users: no need to hit the bug, wonder about it,
ask on the mailing list, wait for the creator of the fix in the public
repository to reply, load the latest version, and update all build
scripts to include it (and add a note to self to check that newer
versions of it are still good, and that it is updated when there is a
new VW version) :-).

I added a smiley, but I have to admit that it's a slightly pained smile.
Eclipse users have to spend an average of 30 minutes a day just keeping
their IDE up to date, and I worry that VW is heading in the same
direction with all the contributed stuff. I really welcome the moves to
integrate the most used parts in the standard development image. For bug
fixes like this, I think it's even clearer that they should be in the
base.

Steve

> -----Original Message-----
> From: Holger Guhl [mailto:[hidden email]]
> Sent: 01 December 2008 12:30
> To: Steven Kelly
> Cc: VW NC
> Subject: Re: [vwnc] XML.StreamWrapper error on UTF-8 "byte-order mark"
> [...]