I'm currently implementing a StreamEncoder for the "Java modified UTF-8"
encoding used in the Java Native Interface. When copying the encoded
strings to the heap and back with

    | pointer |
    pointer := 'test' gcCopyToHeapEncoding: #JavaModifiedUTF8.
    pointer copyCStringFromHeap: #JavaModifiedUTF8

the result was not the same as the input, but had arbitrary trailing
characters. The source of this was in String>>copyToHeap:encoding:,
which creates the null byte(s) for terminating the C string like this:

    null := (ByteString new: 1) asByteArrayEncoding: encoding.

That does not work for Java modified UTF-8, as this encoding uses the
byte sequence #[192 128] for a Character with code point 0.

The problem also exists for the encodings #BASE64 and #'UTF-7', which
also encode null characters as non-null byte sequences. The encoding
#CompoundText produces an empty ByteArray (zero length), so this might
either lead to a primitive failure or simply not add anything to the
end of the string (I didn't try).

The result is that the strings copied to the heap don't have a
terminating null. This can lead to memory corruption or wrong results
when the string is manipulated by an external library.

I assume using the encoding to produce the trailing null from a
ByteString with a null character is done to ensure that the number of
bytes in the ByteArray is correct. E.g., the encoding #UTF16 produces
two null bytes instead of one.

A possible solution would be to extract the null byte creation to a new
method of StreamEncoder, e.g. like this:

    null := StreamEncoder externalStringTerminatorFor: encoding.

    ByteArray class>>externalStringTerminatorFor: encoding
        ^StreamEncoder externalStringTerminatorFor: encoding

    StreamEncoder class>>externalStringTerminatorFor: encoding
        ^(self lookupEncoderDirectory: encoding)
            externalStringTerminator

    StreamEncoder class>>externalStringTerminator
        ^#[0]

    UTF16StreamEncoder class>>externalStringTerminator
        ^#[0 0]

    UnicodeStreamEncoder class>>externalStringTerminator
        ^#[0 0]

and maybe something else for CompoundTextStreamEncoder.
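To see which encodings are affected, the terminator bytes can be inspected directly in a workspace. This is only a sketch: it reuses the expression from String>>copyToHeap:encoding: quoted above and assumes that the listed encoding names are all registered in the image (#JavaModifiedUTF8 is left out because it is not part of a standard image).

```smalltalk
"Sketch: show the bytes each encoding produces for a single NUL character.
 Anything other than one or more zero bytes is not a usable C string terminator."
#(#UTF16 #BASE64 #'UTF-7' #CompoundText) collect: [:encoding |
    encoding -> ((ByteString new: 1) asByteArrayEncoding: encoding)]
```

Going by the observations above, #UTF16 should answer two zero bytes, #BASE64 and #'UTF-7' should answer non-zero byte sequences, and #CompoundText an empty ByteArray.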

Best regards,
Joachim Geidel

This sounds rather like the following bug I reported in 7.2 in 2003:

Case 358800: AR 47137: ExternalMethod call with MSCP1252 string of size
multiples of 4 is changed

When the string size was a multiple of 4, there was a problem with
addition of extra characters to add a null character to terminate the
string. You can see it is (sadly!) still a problem in 7.4.1 on Windows
with the following code:

    | p aCheckStr |
    'temp.txt' asFilename writeStream
        nextPutAll: 'notepad temp.txt';
        close.
    aCheckStr := 'temp.txt' asFilename contentsOfEntireFile.
    p := WinNTSystemSupport CreateProcess: nil arguments: aCheckStr.
    (Delay forSeconds: 3) wait.
    ^WinNTSystemSupport TerminateProcess: p

This will give an error, as the argument to the first DLLCC call is a
16-char MSCP1252String (which is what VW creates when reading from files
on Windows). But if aCheckStr is just set to a 16-char ByteString:

    aCheckStr := 'notepad temp.txt'.

it will work fine.

Steve

> -----Original Message-----
> From: Joachim Geidel [mailto:[hidden email]]
> Sent: 31 July 2006 21:42
> To: vwnc-list
> Subject: [Bug?] String>>copyToHeap:encoding: does not insert terminating
> null for some encodings

Steven Kelly wrote:
> This sounds rather like the following bug I reported in 7.2 in 2003:
>
> Case 358800: AR 47137: ExternalMethod call with MSCP1252 string of size
> multiples of 4 is changed
>
> When the string size was a multiple of 4, there was a problem with
> addition of extra characters to add a null character to terminate the
> string. You can see it is (sadly!) still a problem in 7.4.1 on Windows
> with the following code:

Yes, this looks similar, but it might be a different case. The external
method call directly executes a primitive (395), handing over the method
arguments to the VM unmodified, if I understand external methods right.
This would point to a problem at the VM level. In the StreamEncoder
case, the unwanted bytes are created in String>>copyToHeap:encoding:,
and this does not depend on the length of the String.

Of course, if the VM calls back into Smalltalk to convert the String to
bytes, it may be the same problem after all. I don't have the VM sources
to check this.

Joachim
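
Whether the StreamEncoder problem really is independent of the string length could be checked with a round trip over several lengths. The following is a minimal sketch, assuming that gcCopyToHeapEncoding: and copyCStringFromHeap: behave as in the first message and that String class>>new:withAll: is available; the encoding is one of those reported as affected. Since a missing terminator means reading past the end of the heap copy, this is best tried in a scratch image.

```smalltalk
"Sketch: round-trip strings of length 1 to 8 through the heap and
 record for each length whether the copy equals the original."
| encoding |
encoding := #'UTF-7'.
(1 to: 8) collect: [:n |
    | original pointer copy |
    original := String new: n withAll: $a.
    pointer := original gcCopyToHeapEncoding: encoding.
    copy := pointer copyCStringFromHeap: encoding.
    n -> (copy = original)]
```

If the failures show up at every length rather than only at multiples of 4, that would support treating the two reports as separate problems.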