I had a failing test that was reading an 8-bit
file-out. The test reads the file as
UTF-8 encoded, and the 8-bit
characters are then decoded incorrectly.
The file contains the
character 16rF3, which triggers the 21-bit (4-byte) UTF-8 decoding. This expects 4
bytes looking like 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, but the
characters following the F3 are normal ASCII, which is an error.
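
To make the failure concrete, here is a minimal Python sketch of the structural check a decoder has to perform. It is not the decoder used by the failing test; the function names and the sample bytes are hypothetical, and the check is deliberately simplified: it only verifies the lead byte and the 10xxxxxx shape of the trailing bytes, not the tighter second-byte ranges that RFC 3629 also requires (overlong forms, surrogates, values above U+10FFFF).

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1              # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2              # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3              # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4              # 11110xxx: the 21-bit form
    return 0                  # stray continuation byte or disallowed lead


def is_continuation(byte: int) -> bool:
    """True if the byte has the 10xxxxxx continuation shape."""
    return (byte & 0xC0) == 0x80


def first_ill_formed(data: bytes) -> int:
    """Return the index of the first ill-formed sequence, or -1 if none is found."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0:
            return i
        trail = data[i + 1 : i + n]
        # A truncated sequence or a non-continuation trailing byte is an error.
        if len(trail) != n - 1 or not all(is_continuation(b) for b in trail):
            return i
        i += n
    return -1


# Hypothetical 8-bit data: 0xF3 followed by ordinary ASCII.
sample = b"caf\xf3 ok"
print(sequence_length(0xF3), first_ill_formed(sample))
```

With this sample, the lead byte F3 announces a 4-byte sequence, but the next byte is an ASCII space, so the check reports the position of the F3 instead of consuming the following characters.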
RFC 3629 states "Implementations of the decoding
algorithm MUST protect against decoding invalid sequences." The Unicode
Standard requires decoders to "...treat any ill-formed code unit sequence as
an error condition. This guarantees that it will neither interpret nor emit an
ill-formed code unit sequence."
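
As an illustration of the behaviour these passages describe, Python's built-in UTF-8 codec in strict mode raises an error on an ill-formed sequence instead of emitting characters. This is only a sketch for comparison, not the code under test; the helper name and the sample bytes are hypothetical.

```python
from typing import Optional, Tuple


def try_decode_utf8(data: bytes) -> Tuple[Optional[str], Optional[int]]:
    """Decode strictly; return (text, None) on success or (None, error_offset)."""
    try:
        return data.decode("utf-8", errors="strict"), None
    except UnicodeDecodeError as err:
        # err.start is the byte offset where the ill-formed sequence begins.
        return None, err.start


# Hypothetical 8-bit data: 0xF3 followed by ordinary ASCII.
text, bad_offset = try_decode_utf8(b"caf\xf3 ok")
print(text, bad_offset)
```

Here the decoder refuses the input and reports the offset of the F3 byte, so the caller can decide how to handle an 8-bit file rather than silently receiving wrong text.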