[Bug?] String>>copyToHeap:encoding: does not insert terminating null for some encodings

Joachim Geidel
I'm currently implementing a StreamEncoder for the "Java modified UTF-8"
encoding used in the Java Native Interface. When copying the encoded
strings to the heap and back with

        | pointer |
        pointer := 'test' gcCopyToHeapEncoding: #JavaModifiedUTF8.
        pointer copyCStringFromHeap: #JavaModifiedUTF8

the result was not the same as the input, but had arbitrary trailing
characters. The source of this was in String>>copyToHeap:encoding:,
which creates the null byte(s) for terminating the C string like this:

        null := (ByteString new: 1) asByteArrayEncoding: encoding.

That does not work for Java modified UTF-8, as this encoding uses the
byte sequence #[192 128] for a Character with code point 0.

The problem also exists for the encodings #BASE64 and #'UTF-7', which
also encode null characters as non-null byte sequences. The encoding
#CompoundText produces an empty ByteArray (zero length), so this might
either lead to a primitive failure or simply not add anything to the end
of the string (I didn't try).
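
The behaviour described above can be inspected in a workspace. This is only a sketch: it reuses the terminator expression quoted from String>>copyToHeap:encoding: and the encoding symbols named in this message, and it assumes each of those encoders is loaded in the image. The resulting bytes are left for inspection rather than asserted here.

```smalltalk
"Sketch: collect the bytes that the current terminator expression
 produces for each encoding mentioned above. Inspect the result."
#(#UTF16 #'UTF-7' #BASE64 #CompoundText) collect: [:encoding |
	encoding -> ((ByteString new: 1) asByteArrayEncoding: encoding)]
```

Any entry whose value is empty or contains a non-zero byte would not act as a C string terminator.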

The result is that the strings copied to the heap don't have a
terminating null. This can lead to memory corruption or wrong results
when the string is manipulated by an external library.

I assume using the encoding to produce the trailing null from a
ByteString with a null character is done to ensure that the number of
bytes in the ByteArray is correct. E.g., the encoding #UTF16 produces
two null bytes instead of one.

A possible solution would be to extract the null byte creation to a new
method of StreamEncoder, e.g. like this:

        null := StreamEncoder externalStringTerminatorFor: encoding.

ByteArray class>>externalStringTerminatorFor: encoding
        ^StreamEncoder externalStringTerminatorFor: encoding

StreamEncoder class>>externalStringTerminatorFor: encoding
        ^(self lookupEncoderDirectory: encoding)
                externalStringTerminator

StreamEncoder class>>externalStringTerminator
        ^#[0]

UTF16StreamEncoder class>>externalStringTerminator
        ^#[0 0]

UnicodeStreamEncoder class>>externalStringTerminator
        ^#[0 0]

and maybe something else for CompoundTextStreamEncoder.
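
A possible sanity check for this proposal, as a sketch only: it assumes the externalStringTerminatorFor: methods suggested above have actually been added to the image (they are not part of the shipped system), and it reuses the encoding symbols mentioned earlier in this message.

```smalltalk
"Sketch: for each encoding, answer whether the proposed terminator is
 non-empty and consists of zero bytes only.
 Assumes the proposed StreamEncoder class>>externalStringTerminatorFor:
 has been installed."
#(#UTF16 #'UTF-7' #BASE64 #CompoundText) collect: [:encoding |
	| null |
	null := StreamEncoder externalStringTerminatorFor: encoding.
	encoding -> (null notEmpty
		and: [(null detect: [:byte | byte ~= 0] ifNone: [nil]) isNil])]
```

Every association should answer true if the terminators are usable for C strings.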

Best regards,
Joachim Geidel

RE: [Bug?] String>>copyToHeap:encoding: does not insert terminating null for some encodings

Steven Kelly
This sounds rather like the following bug I reported in 7.2 in 2003:

Case 358800: AR 47137: ExternalMethod call with MSCP1252 string of size
multiples of 4 is changed

When the string size was a multiple of 4, there was a problem with
addition of extra characters to add a null character to terminate the
string. You can see it is (sadly!) still a problem in 7.4.1 on Windows
with the following code:

        | p aCheckStr |
        'temp.txt' asFilename writeStream nextPutAll: 'notepad temp.txt'; close.
        aCheckStr := 'temp.txt' asFilename contentsOfEntireFile.
        p := WinNTSystemSupport CreateProcess: nil arguments: aCheckStr.
        (Delay forSeconds: 3) wait.
        ^WinNTSystemSupport TerminateProcess: p

This will give an error, as the argument to the first DLLCC call is a
16-char MSCP1252String (which is what VW creates when reading from files
on Windows). But if aCheckStr is just set to a 16-char ByteString:
  aCheckStr := 'notepad temp.txt'.
it will work fine.
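
To see the difference between the two arguments, something like the following can be evaluated in a workspace. It is a diagnostic sketch only and assumes temp.txt has already been written by the code above; the temporary names fromFile and literal are illustrative.

```smalltalk
"Sketch: compare the string read from the file with the literal.
 Answers both classes, the size of the file contents, and whether
 the two strings compare equal."
| fromFile literal |
fromFile := 'temp.txt' asFilename contentsOfEntireFile.
literal := 'notepad temp.txt'.
Array
	with: fromFile class
	with: literal class
	with: fromFile size
	with: fromFile = literal
```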

Steve


Re: [Bug?] String>>copyToHeap:encoding: does not insert terminating null for some encodings

Joachim Geidel
Steven Kelly wrote:
> This sounds rather like the following bug I reported in 7.2 in 2003:
>
> Case 358800: AR 47137: ExternalMethod call with MSCP1252 string of size
> multiples of 4 is changed
>
> When the string size was a multiple of 4, there was a problem with
> addition of extra characters to add a null character to terminate the
> string. You can see it is (sadly!) still a problem in 7.4.1 on Windows
> with the following code:

Yes, this looks similar, but it might be a different case. The external
method call directly executes a primitive (395), handing over the method
arguments to the VM unmodified, if I understand external methods correctly.
This would point to a problem at the VM level. In the StreamEncoder
case, the unwanted bytes are created in String>>copyToHeap:encoding:,
and this does not depend on the length of the String. Of course, if the
VM calls back into Smalltalk to convert the String to bytes, it may be
the same problem after all. I don't have the VM sources to check this.
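
One way to check the claim that the StreamEncoder case does not depend on the string length is sketched below. It uses only the two heap-copy messages from my first message and one of the affected encodings named there. Since the trailing bytes come from whatever happens to be on the heap, the answers may vary from run to run.

```smalltalk
"Sketch: round-trip strings of sizes 1 to 8 through the heap and
 record, per size, whether the string comes back unchanged."
(1 to: 8) collect: [:size |
	| original pointer |
	original := String new: size withAll: $a.
	pointer := original gcCopyToHeapEncoding: #'UTF-7'.
	size -> ((pointer copyCStringFromHeap: #'UTF-7') = original)]
```

If failures show up at all sizes rather than only at multiples of 4, that would support treating the two reports as separate problems.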

Joachim