In 7.6, parsing a UTF-8 XML file that starts with a BOM causes the error "< expected, but not found". The code for XML.StreamWrapper>>checkEncoding only takes account of a UTF-16 BOM (somewhat odd, given that it first checks that the encoding is UTF-8). Maybe I'm missing something here. To get my file to read, the following worked.

I couldn't resist changing the check for the UTF-16 BOM (FE FF / FF FE) to be rather less cryptic than "c1 * c2 = 16rFD02". To understand that, you need to know that multiplication is commutative, that 16rFE * 16rFF = 16rFD02, and that no other pair of bytes multiplies to the same value.

The last ifTrue: block could just be "stream position: pos + 3" if we can be certain that will put us in the right place and state, even for funkily encoded multi-byte-per-character streams. That sounds reasonable, given that we've just decided that this really is a UTF-8 stream.

Steve

checkEncoding
    "If the stream claims to be UTF-8, look for a BOM at its start:
     switch encoder on a UTF-16 BOM, skip a UTF-8 BOM, otherwise rewind."

    | encoding |
    encoding := [stream encoding] on: Error do: [:ex | ex returnWith: #null].
    encoding = #'UTF-8'
        ifTrue:
            [| firstPair third pos |
            pos := stream position.
            "Read the first two bytes raw and peek at the third."
            stream setBinary: true.
            firstPair := stream nextAvailable: 2.
            third := stream peek.
            stream setBinary: false.
            (#([16rFE 16rFF] [16rFF 16rFE]) includes: firstPair)
                ifTrue:
                    ["UTF-16 BOM, in either byte order"
                    stream encoder: (UTF16StreamEncoder new
                        forByte1: firstPair first
                        byte2: firstPair last)]
                ifFalse:
                    [(firstPair = #[16rEF 16rBB] and: [third = 16rBF])
                        ifTrue:
                            ["UTF-8 BOM: consume its third byte as well"
                            stream setBinary: true; next; setBinary: false]
                        ifFalse:
                            ["No BOM: go back to where we started"
                            stream position: pos]]]

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
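For anyone who wants to check the "no other pair of bytes" claim about 16rFD02, a brute-force loop in a workspace is enough. This is only an illustrative sketch: the variable name matches is made up here, and nothing in it comes from the XML code itself.

```smalltalk
"Workspace sketch: collect every ordered pair of byte values whose product is 16rFD02."
| matches |
matches := OrderedCollection new.
0 to: 255 do: [:c1 |
	0 to: 255 do: [:c2 |
		c1 * c2 = 16rFD02
			ifTrue: [matches add: (Array with: c1 with: c2)]]].
matches
```

Inspecting the result should show exactly two entries, the values 16rFE and 16rFF in both orders. That is why the product test recognises a UTF-16 BOM regardless of byte order, even if it is harder to read than an explicit comparison.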
Some time ago I had a similar problem with BOM-marked UTF-8 files. It looked as if the API assumes that UTF-8 input does not begin with a BOM, even though the specification allows it. The situation in 7.6 seems to be the same. Wouldn't it be reasonable to treat a stream starting with "< as UTF-8 encoded in general?

Steffen

On 27.11.2008 at 17:52, Steven Kelly <[hidden email]> wrote:
> In 7.6, parsing a UTF-8 XML file starting with a BOM causes an error "<
> expected, but not found". [...]
Please have a look at our recent version of GHCsvImportExport(1.11) in the Cincom Public Repository. We made some extensions to PeekableStream to carefully peek for a BOM (byte order mark). Method #nextBOM peeks for a byte order mark and, if one is present, leaves the stream positioned just past it. Method #getEncodingFromBOM translates the result into an encoding symbol. Method Heeg.CsvReader>>onFileNamed: shows a possible application scenario.

We have not yet extracted the reusable PeekableStream code into a separate package. I'd like to encourage you to go with this approach or adapt it to your needs. Having reusable code for Stream is better than inlining it wherever you need it. BTW: the PeekableStream extension methods have some nice comments that explain some of the bits and bytes "magic".

Regards

Holger Guhl
--
Senior Consultant * Certified Scrum Master * [hidden email]
Tel: +49 231 9 75 99 21 * Fax: +49 231 9 75 99 20
Georg Heeg eK Dortmund
Handelsregister: Amtsgericht Dortmund A 12812

Steven Kelly wrote:
> In 7.6, parsing a UTF-8 XML file starting with a BOM causes an error "<
> expected, but not found". [...]
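To make the idea of translating a BOM into an encoding symbol concrete, here is a minimal workspace sketch. It does not reproduce the GHCsvImportExport code (neither #nextBOM nor #getEncodingFromBOM appears here). The block names and the UTF-16 symbols it answers are assumptions chosen for illustration, and it works on a ByteArray of leading bytes rather than on a stream.

```smalltalk
"Hypothetical sketch: map the leading bytes of a file to an encoding symbol.
 The symbols #'UTF-16BE' and #'UTF-16LE' are illustrative labels only."
| startsWith encodingFor |
startsWith := [:bytes :bom |
	bytes size >= bom size
		and: [(bytes copyFrom: 1 to: bom size) = bom]].
encodingFor := [:bytes |
	(startsWith value: bytes value: #[16rEF 16rBB 16rBF])
		ifTrue: [#'UTF-8']
		ifFalse: [
			(startsWith value: bytes value: #[16rFE 16rFF])
				ifTrue: [#'UTF-16BE']
				ifFalse: [
					(startsWith value: bytes value: #[16rFF 16rFE])
						ifTrue: [#'UTF-16LE']
						ifFalse: [nil]]]].
"The bytes of a UTF-8 BOM followed by '<?xml'"
encodingFor value: #[16rEF 16rBB 16rBF 16r3C 16r3F 16r78 16r6D 16r6C]
```

The last expression answers the UTF-8 symbol. A ByteArray without a BOM answers nil, which leaves the caller free to fall back on the declared or default encoding. UTF-32 BOMs are not handled: FF FE 00 00 would be reported as UTF-16LE by this sketch.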
Thanks Holger, I'll take a look. While of course I agree with the idea of putting the code in the right place, I find it's much easier to get a change into the base VW if it's a bug fix to a single base method than if it's an extension in the public repository. Once it's in the base VW, it's maintained and supported, and of course it's there immediately for both new and experienced users: no need to hit the bug, wonder about it, ask on the mailing list, wait for the creator of the fix in the public repository to reply, load the latest version, and update all build scripts to include it (and add a note to self to check that newer versions of it are still good, and that it is updated when there is a new VW version) :-).

I added a smiley, but I have to admit that it's a slightly pained smile. Eclipse users have to spend an average of 30 minutes a day just keeping their IDE up to date, and I worry that VW is heading in the same direction with all the contributed stuff. I really welcome the moves to integrate the most used parts into the standard development image. For bug fixes like this, I think it's even clearer that they should be in the base.

Steve

> -----Original Message-----
> From: Holger Guhl [mailto:[hidden email]]
> Sent: 01 December 2008 12:30
> To: Steven Kelly
> Cc: VW NC
> Subject: Re: [vwnc] XML.StreamWrapper error on UTF-8 "byte-order mark"
>
> Please, have a look at our recent version of GHCsvImportExport(1.11) in
> Cincom Public Repository. [...]