Hi Levente,
On Tue, Mar 29, 2016 at 7:10 PM, Levente Uzonyi <[hidden email]> wrote: Hi Eliot, Not sure what's going on here but I can reproduce the bug.
So this is interesting. It may boil down to leading character issues. First, the bug is that the JIT and the Interpreter don't agree on the semantics of WideString>>#at: when the value at the index is "out of range". Notionally the Spur VM restricts characters to the range 0 through 16r3FFFFFFF, to keep them in the +ve SmallInteger range. The JIT primitive fails if asked to answer a Character outside that range, and 16rD80000FF is outside, hence the JIT primitive fails. The Interpreter primitive does not (yet) fail. There's an optimization in the JIT that if a primitive deals with all possible inputs and the method is written not to use an error code then the primitive will fail without invoking the interpreter primitive. This a) means the JIT doesn't have to set error codes correctly and b) unless the programmer wants the failure code failure is a bit quicker. But in this case I got caught. Because the Interpreter primitive does not fail, changing the method to include the error code the JIT now calls the Interpreter primitive when the JIT primitive fails, the interpreter primitive does not fail, and the character is answered anyway. So there are a few things going on here... a) in loading the Monticello snapshot for Multilingual-ul.210 a WideString gets created for JapaneseEnvironment class>>#flapTabTextFor:in: that has 16rD80000FF at 213 and 275 in the source, whereas (JapaneseEnvironment class>>#flapTabTextFor:in:) getSource string contains Character value: 27 (Character escape) at 213 and 275, and Character value: 16r14030FC at 214 and 276, but weirdly the strings have the same size, and are equal everywhere else. So somewhere in the Monticello package read (or write?) the escape is being rewritten to Character value: 16rD80000FF. b) I'm confused^H^H^H^H^H^H^H^Hnot sure what to do about the range of acceptable characters. It seems right to me to only use 30 bits for characters; that's more than enough for Unicode and it seems arbitrary to allow e.g. 61 bit Characters in 64-bits and 30 bit Characters in 32-bit (makes sense with integers because of LargeInteger). So I should either make WideString>>at:[put:] fail for out of range characters in all versions of the primitives, and modify the methods so they yield a sensible error message, or should I make the JIT primitive not fail? I want to fail for out-of-range characters, but not until we understand the Monticello issue that maps Character escape to Character value: 16rD80000FF. If you want to experiment with the system then here are some variants on WideString>>at:[put:] that will be useful (also attached). To see the JIT primitives fail, but get a sensible error message, use WideString>>at: index "Answer the Character stored in the field of the receiver indexed by the argument. Primitive. Fail if the index argument is not an Integer or is out of bounds. Essential. See Object documentation whatIsAPrimitive." <primitive: 63> ^index isInteger ifTrue: [(index >= 1 and: [index <= self size]) ifTrue: [self error: 'illegal character value ', ((self basicAt: index) storeStringBase: 16)] ifFalse: [self errorSubscriptBounds: index]] ifFalse: [index isNumber ifTrue: [self at: index asInteger] ifFalse: [self errorNonIntegerIndex]] WideString>>at: index put: aCharacter "Store the Character into the field of the receiver indicated by the index. Primitive. Fail if the index is not an Integer or is out of bounds, or if the argument is not a Character. Essential. See Object documentation whatIsAPrimitive." <primitive: 64> ^aCharacter isCharacter ifTrue: [index isInteger ifTrue: [(index >= 1 and: [index <= self size]) ifTrue: [self error: 'illegal character value ', (aCharacter asInteger storeStringBase: 16)] ifFalse: [self errorSubscriptBounds: index]] ifFalse: [self errorNonIntegerIndex]] ifFalse: [self errorImproperStore] To suppress the errors and have the interpreter primitives back up (and answer out-of-range Characters) simply include an error code: at: index "Answer the Character stored in the field of the receiver indexed by the argument. Primitive. Fail if the index argument is not an Integer or is out of bounds. Essential. See Object documentation whatIsAPrimitive." <primitive: 63 error: ec> ^index isInteger ifTrue: [self errorSubscriptBounds: index] ifFalse: [index isNumber ifTrue: [self at: index asInteger] ifFalse: [self errorNonIntegerIndex]] Again, the decision for the VM is whether to drop the pretence of Characters being 30 bits (not something I'm happy with) or whether to make sure the interpreter WideString primitives also fail for out-of-range characters, and for us to fix Monticello snapshot read (write?) for the Character escape, Character value: 16r14030FC sequence. The problem is that there are ways of creating strings that bypass both Character class>>value: and WideString>>at:put:. Compare these: This produces a WideString full of illegal characters via adoptInstance: | s | s := Bitmap new: 16 withAll: 16rD80000FF. WideString adoptInstance: s. s. This does the same via basicAt:[put:] which treat WideString as arrays of 32-bit unsigned integers (which is what they are in memory). | s | s := WideString new: 16. 1 to: s size do: [:i| s basicAt: i put: 16rD80000FF] This fails to do so because Character class>>#value: has a range check WideString new: 16 withAll: (Character value: 16rD80000FF) This fails if the JIT primitive is in use: (tryPrimitive:withArgs: side-steps the JIT primitive, calls the interpreter primitive which does answer the out-of-range character) | illegalCharacter s | illegalCharacter := ((WideString new: 1) basicAt: 1 put: 16rD80000FF; tryPrimitive: 63 withArgs: #(1)). s := WideString new: 16. 1 to: s size do: [:i| s at: i put: illegalCharacter] I'm open to being called idiotic here (please weigh in) but I think the "right thing" is a) make the Interpreter wide string primitives also fail for character access outside of 30 bits. b) change the error handling on WideString>>at:[put:] to give sensible error messages when attempting character access outside of 30 bits. c) fix the character encoding issue Here's the source of the method: !JapaneseEnvironment class methodsFor: 'language methods' stamp: 'yo 8/3/2004 21 :25'! flapTabTextFor: aString in: aFlapTab | string | string := super flapTabTextFor: aString. string isEmptyOrNil ifTrue: [^ self]. string := aFlapTab orientation == #vertical ifTrue: [string copyReplaceAll: '<E3><83><BC>' with: '<EF><BD><9C>'] ifFalse: [string copyReplaceAll: '<EF><BD><9C>' with: '<E3><83><BC>']. ^ string. !]lang[(213 1 9 1 41 1 9 1 16)0,5,0,5,0,5,0,5,0! ! This is read differently in the CompiledMethod>>getSource case to the Monticello snapshot load case. Why? Here's a model of that method source: | w | w := WideString with: 102 asCharacter with: 108 asCharacter with: 27 asCharacter with: 16r14030FC asCharacter. ByteString streamContents: [:s| s nextChunkPut: w] 'flー!]lang[(3 1)0,5!' This doit implies that it's not the converters that are at fault: | w | w := WideString with: 102 asCharacter with: 108 asCharacter with: 27 asCharacter with: 16r14030FC asCharacter. (TextConverter allSubclasses reject: [:tcc| tcc includesBehavior: ByteTextConverter]) reject: [:tcc| | r | r := (ByteString streamContents: [:s| s nextChunkPut: w]) readStream. (tcc new parseLangTagFor: (r next: 4) fromStream: r) = w] an OrderedCollection() i.e. all the multibyte converters answer the string unaltered. So where is this 16rD80000FF bit pattern comming from? P.S. Digging further I see it /shouldn't/ be a leading character issue. There's just enough room for a 22 bit Unicode character code and an 8-bit leading character language code: Character>>encodedCharSet self asInteger < 16r400000 ifTrue: [ ^Unicode ]. "Shortcut" ^EncodedCharSet charsetAt: self leadingChar Character>>leadingChar "Answer the value of the 8 highest bits which is used to identify the language. This is mostly used for east asian languages CJKV as a workaround against unicode han-unification." ^ self asInteger bitShift: -22 16rFF << 22 = 16r3F,C0,00,00 Levente _,,,^..^,,,_ best, Eliot WideString-at.st (1018 bytes) Download Attachment WideString-atput.st (1K) Download Attachment WideString-at.st (820 bytes) Download Attachment WideString-atput.st (882 bytes) Download Attachment |
Hi Eliot,
I made a snippet to read the source of the method as an array of integers from the .mcz in the package cache: (MCMczReader versionFromFile: 'package-cache/Multilingual-ul.210.mcz') snapshot definitions detect: [ :each | each isMethodDefinition and: [ each className = #JapaneseEnvironment and: [ each selector = #flapTabTextFor:in: ] ] ] ifFound: [ :definition | Array streamContents: [ :stream | | source | source := definition source. 1 to: source size do: [ :index | stream nextPut: (source basicAt: index) ] ] ] ifNone: [ self error ] In MCMczReader >> #loadDefinitions, if you change this line [:m | [^definitions := (DataStream on: m contentStream) next definitions] to this [:m | [ self error. ^definitions := (DataStream on: m contentStream) next definitions] then the definition will be read from the sources instead of the binary snapshot, and you'll get the correct source code. First I disabled all the ZipPlugin primitives to see if those are responsible for this issue, but they turned out to be okay. Then I dag into DataStream, and I came to the conclusion that the issue is in BitBlt. The mangled characters appear when PositionableStream >> #nextWordsInto: applies some BitBlt magic to convert the read bytes into a WideString. Here's a snippet triggering the error: | wideString source pos blt expectedWideString | source := #[1 64 255 14 1 64 48 251]. expectedWideString := WideString fromByteArray: source. wideString := WideString new: source size // 4. pos := 0. blt := (BitBlt toForm: (Form new hackBits: wideString)) sourceForm: (Form new hackBits: source). blt combinationRule: Form over; sourceX: 0; sourceY: pos // 4; height: wideString byteSize // 4; width: 4; destX: 0; destY: 0; copyBits. wideString restoreEndianness. self assert: wideString = expectedWideString Levente |
I have a change where I massively decalred the bitblt operands as unsigned int. I did not push it so far, but I'll try to see if it solves the symptoms.2016-03-31 3:16 GMT+02:00 Levente Uzonyi <[hidden email]>: Hi Eliot, |
OK, declaring a few 'unsigned int' in BitBltSimulation suffices to make the problem disappear on SqueakCogSpur64 MacOSX, I'll publish this evening. 2016-03-31 7:42 GMT+02:00 Nicolas Cellier <[hidden email]>:
|
Free forum by Nabble | Edit this page |