I had a failing test that was reading an 8-bit file-out. The test reads the file as UTF-8 encoded, and the 8-bit characters cause the UTF-8 decoding to produce an incorrect result.
The file contains the character 16rF3, which as a lead byte starts a 4-byte (21-bit) UTF-8 sequence. The decoder therefore expects 4 bytes of the form

    11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx

but the bytes following the 16rF3 are plain ASCII (high bit clear), so they are not valid continuation bytes and the sequence is ill-formed.
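
To illustrate the failure outside Smalltalk, here is a small Python sketch (the byte values are invented, not the actual file contents): a 16rF3 lead byte followed by plain ASCII is rejected by a strict UTF-8 decoder.

    # 0xF3 opens a 4-byte sequence, but the following ASCII bytes
    # (high bit clear) are not the required 10xxxxxx continuation bytes.
    data = bytes([0xF3]) + b'abcdef'
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)  # ... position 0: invalid continuation byte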
RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence." (from Wikipedia's UTF-8 article)
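
As a sketch of the behaviors those requirements permit (again in Python, with invented data): a conforming decoder either signals the error or substitutes U+FFFD for the ill-formed bytes, but never interprets them as if they were valid.

    data = bytes([0xF3]) + b'abc'
    print(data.decode('utf-8', errors='replace'))  # '\ufffdabc': the 0xF3 becomes U+FFFD
    # data.decode('utf-8', errors='strict')        # would raise UnicodeDecodeError instead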