64-bit WideString access (was [squeak-dev] New Cog VMs available)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

64-bit WideString access (was [squeak-dev] New Cog VMs available)

Eliot Miranda-2
Hi Levente,

On Tue, Mar 29, 2016 at 7:10 PM, Levente Uzonyi <[hidden email]> wrote:
Hi Eliot,

I think something went wrong with those semaphores. Connecting to a Socket seems to wait for the full 45 seconds timeout with this VM. To reproduce, just try opening any remote MC repository. This happens with cogspur64linuxht. I haven't tried any other versions yet.

Not sure what's going on here but I can reproduce the bug.

There's another issue with WideStrings where, #at: (primitive 63) fails for a few instances. This issue is present in 3643 as well. To reproduce it, open the Monticello Browser, select the Multilingual package, select the Trunk repository and press Changes. You'll see that two methods are reported to have changed (even though they haven't). Selecting any of them will pop up a debugger with the primitive failure. The common symptom is that #basicAt: will work for the index, for which #at: fails, and it will return the number 16rD80000FF.

So this is interesting.  It may boil down to leading character issues.

First, the bug is that the JIT and the Interpreter don't agree on the semantics of WideString>>#at: when the value at the index is "out of range".  Notionally the Spur VM restricts characters to the range 0 through 16r3FFFFFFF, to keep them in the +ve SmallInteger range.  The JIT primitive fails if asked to answer a Character outside that range, and 16rD80000FF is outside, hence the JIT primitive fails.  The Interpreter primitive does not (yet) fail.

There's an optimization in the JIT that if a primitive deals with all possible inputs and the method is written not to use an error code then the primitive will fail without invoking the interpreter primitive.  This a) means the JIT doesn't have to set error codes correctly and b) unless the programmer wants the failure code failure is a bit quicker.  But in this case I got caught.  Because the Interpreter primitive does not fail, changing the method to include the error code the JIT now calls the Interpreter primitive when the JIT primitive fails, the interpreter primitive does not fail, and the character is answered anyway.

So there are a few things going on here...

a) in loading the Monticello snapshot for Multilingual-ul.210 a WideString gets created for JapaneseEnvironment class>>#flapTabTextFor:in: that has 16rD80000FF at 213 and 275 in the source, whereas (JapaneseEnvironment class>>#flapTabTextFor:in:) getSource string contains Character value: 27 (Character escape) at 213 and 275, and Character value: 16r14030FC at 214 and 276, but weirdly the strings have the same size, and are equal everywhere else.  So somewhere in the Monticello package read (or write?) the escape is being rewritten to Character value: 16rD80000FF.

b) I'm confused^H^H^H^H^H^H^H^Hnot sure what to do about the range of acceptable characters.  It seems right to me to only use 30 bits for characters; that's more than enough for Unicode and it seems arbitrary to allow e.g. 61 bit Characters in 64-bits and 30 bit Characters in 32-bit (makes sense with integers because of LargeInteger).  So I should either make WideString>>at:[put:] fail for out of range characters in all versions of the primitives, and modify the methods so they yield a sensible error message, or should I make the JIT primitive not fail?  I want to fail for out-of-range characters, but not until we understand the Monticello issue that maps Character escape to Character value: 16rD80000FF.

If you want to experiment with the system then here are some variants on WideString>>at:[put:] that will be useful (also attached). 


To see the JIT primitives fail, but get a sensible error message, use

WideString>>at: index
"Answer the Character stored in the field of the receiver indexed by the
argument.  Primitive.  Fail if the index argument is not an Integer or is out
of bounds.  Essential.  See Object documentation whatIsAPrimitive."

<primitive: 63>
^index isInteger
ifTrue:
[(index >= 1 and: [index <= self size])
ifTrue: [self error: 'illegal character value ', ((self basicAt: index) storeStringBase: 16)]
ifFalse: [self errorSubscriptBounds: index]]
ifFalse:
[index isNumber
ifTrue: [self at: index asInteger]
ifFalse: [self errorNonIntegerIndex]]


WideString>>at: index put: aCharacter
"Store the Character into the field of the receiver indicated by the index.
Primitive.  Fail if the index is not an Integer or is out of bounds, or if the
argument is not a Character.  Essential.  See Object documentation whatIsAPrimitive."

<primitive: 64>
^aCharacter isCharacter
ifTrue:
[index isInteger
ifTrue:
[(index >= 1 and: [index <= self size])
ifTrue: [self error: 'illegal character value ', (aCharacter asInteger storeStringBase: 16)]
ifFalse: [self errorSubscriptBounds: index]]
ifFalse: [self errorNonIntegerIndex]]
ifFalse:
[self errorImproperStore]

To suppress the errors and have the interpreter primitives back up (and answer out-of-range Characters) simply include an error code:
at: index
"Answer the Character stored in the field of the receiver indexed by the
argument.  Primitive.  Fail if the index argument is not an Integer or is out
of bounds.  Essential.  See Object documentation whatIsAPrimitive."

<primitive: 63 error: ec>
^index isInteger
ifTrue:
[self errorSubscriptBounds: index]
ifFalse:
[index isNumber
ifTrue: [self at: index asInteger]
ifFalse: [self errorNonIntegerIndex]]

Again, the decision for the VM is whether to drop the pretence of Characters being 30 bits (not something I'm happy with) or whether to make sure the interpreter WideString primitives also fail for out-of-range characters, and for us to fix Monticello snapshot read (write?) for the Character escape, Character value: 16r14030FC sequence.  The problem is that there are ways of creating strings that bypass both Character class>>value: and WideString>>at:put:.  Compare these:

This produces a WideString full of illegal characters via adoptInstance:
| s |
s := Bitmap new: 16 withAll: 16rD80000FF.
WideString adoptInstance: s.
s.

This does the same via basicAt:[put:] which treat WideString as arrays of 32-bit unsigned integers (which is what they are in memory).
| s |
s := WideString new: 16.
1 to: s size do: [:i| s basicAt: i put: 16rD80000FF]

This fails to do so because Character class>>#value: has a range check

WideString new: 16 withAll: (Character value: 16rD80000FF)

This fails if the JIT primitive is in use: (tryPrimitive:withArgs: side-steps the JIT primitive, calls the interpreter primitive which does answer the out-of-range character)
| illegalCharacter s |
illegalCharacter := ((WideString new: 1) basicAt: 1 put: 16rD80000FF; tryPrimitive: 63 withArgs: #(1)).
s := WideString new: 16.
1 to: s size do: [:i| s at: i put: illegalCharacter]


I'm open to being called idiotic here (please weigh in) but I think the "right thing" is
a) make the Interpreter wide string primitives also fail for character access outside of 30 bits.
b) change the error handling on WideString>>at:[put:] to give sensible error messages when attempting character access outside of 30 bits.
c) fix the character encoding issue

Here's the source of the method:

!JapaneseEnvironment class methodsFor: 'language methods' stamp: 'yo 8/3/2004 21
:25'!
flapTabTextFor: aString in: aFlapTab

        | string |
        string := super flapTabTextFor: aString.
        string isEmptyOrNil ifTrue: [^ self].
        string := aFlapTab orientation == #vertical
                                ifTrue: [string copyReplaceAll: '<E3><83><BC>' with: '<EF><BD><9C>']
                                ifFalse: [string copyReplaceAll: '<EF><BD><9C>' with: '<E3><83><BC>'].

        ^ string.
!]lang[(213 1 9 1 41 1 9 1 16)0,5,0,5,0,5,0,5,0! !

This is read differently in the CompiledMethod>>getSource case to the Monticello snapshot load case.  Why?  

Here's a model of that method source:
| w |
w := WideString with: 102 asCharacter with: 108 asCharacter with: 27 asCharacter with: 16r14030FC asCharacter.
ByteString streamContents: [:s| s nextChunkPut: w] 'flー!]lang[(3 1)0,5!'

This doit implies that it's not the converters that are at fault:
| w |
w := WideString with: 102 asCharacter with: 108 asCharacter with: 27 asCharacter with: 16r14030FC asCharacter.
(TextConverter allSubclasses reject: [:tcc| tcc includesBehavior: ByteTextConverter]) reject: [:tcc| | r |
r := (ByteString streamContents: [:s| s nextChunkPut: w]) readStream.
(tcc new parseLangTagFor: (r next: 4) fromStream: r) = w] an OrderedCollection()

i.e. all the multibyte converters answer the string unaltered.  So where is this 16rD80000FF bit pattern comming from?





P.S.  Digging further I see it /shouldn't/ be a leading character issue.  There's just enough room for a 22 bit Unicode character code and an 8-bit leading character language code:

Character>>encodedCharSet
self asInteger < 16r400000 ifTrue: [ ^Unicode ]. "Shortcut"
^EncodedCharSet charsetAt: self leadingChar

Character>>leadingChar
"Answer the value of the 8 highest bits which is used to identify the language.
This is mostly used for east asian languages CJKV as a workaround against unicode han-unification."
^ self asInteger bitShift: -22

16rFF << 22 = 16r3F,C0,00,00




Levente


On Tue, 29 Mar 2016, Eliot Miranda wrote:

... at http://www.mirandabanda.org/files/Cog/VM/VM.r3663


CogVM binaries as per VMMaker.oscog-eem.1746/r3663

General:

Fix loss of signals to e.g. a socket's readSemaphore when data available:
Fix a bug in sqAtomicOps.h where the assumption that the intrinsic atomic add
operation will apply to 16-bit variables is false on clang.  So on x86 and x64
use inline assembly with gcc and clang since this is known to work.
Improve the doSignalSemaphores code in sqExternalSemaphores.c so that the tide
variables are initialized at the right point and so there's only one copy of
the signalling code.

Allow primitiveUtcWithOffset to accept an optional parameter with an
array or object with two or more slots to store UTC posix microseconds
and time zone offset in seconds. Compatibility with VMM trunk.

Add a Smalltalk epoch version of it, primitiveUtcAndTimezoneOffset,
and give it primitive #244.

Fix signed/unsigned arithmetic issues in 64-bit microsecond clock code.
This fixes the bug whereby
Time localMicrosecondClock - Time utcMicrosecondClock // 1000000
would answer something with an odd second, a multiple of 3600 plus 1.

General Integer conversion routines:
Several clean ups to integer conversion routines.

Simplify bit operations using positiveMachineIntegerValueOf:/
positiveMachineIntegerFor: rather than doing 32/64 bits dissertation.


Spur Cogit:
Rewrote identity primitive to check for forwarders only when the identity of
objects is different.  The previous version would crash if the argument was a
forwarder to an immediate.

Spur VMs:
Now that UUIDs are created with purely image-based code in Squeak trunk, make
the UUIDPlugin external in all Squeak Spur VMs.


Plugins:
Make FFI load symbol fail after (when find function fails). 
In Pharo, we allow to load global functions (so we can control world windows
through FFI). 

LargeIntegersPlugin:

Finish 1st round of LargeIntegers refactoring

- simplify the left and right shift
- use as much unsigned arithmetic as possible
- homogenize type declaration within the plugin
- remove unused digitOf:at:


Slang:
Beware: sending abs to an unsigned int will re-interpret the int as signed and
will generate C compiler warnings.  The new version uses SQABS and SQLABS macros
on sqInt and sqLong vars, and fabs and fabsf on double and float args,
respectively, and llabs on long long, __int64 vars.


Mac OS X:
Unswap the Squeak and Newspeak localized strings

_,,,^..^,,,_
best, Eliot

_,,,^..^,,,_
best, Eliot



WideString-at.st (1018 bytes) Download Attachment
WideString-atput.st (1K) Download Attachment
WideString-at.st (820 bytes) Download Attachment
WideString-atput.st (882 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] 64-bit WideString access (was [squeak-dev] New Cog VMs available)

Levente Uzonyi
Hi Eliot,

I made a snippet to read the source of the method as an array of integers
from the .mcz in the package cache:

(MCMczReader versionFromFile: 'package-cache/Multilingual-ul.210.mcz')
snapshot definitions
  detect: [ :each |
  each isMethodDefinition
  and: [ each className = #JapaneseEnvironment
  and: [ each selector = #flapTabTextFor:in: ] ] ]
  ifFound: [ :definition |
  Array streamContents: [ :stream |
  | source |
  source := definition source.
  1 to: source size do: [ :index |
  stream nextPut: (source basicAt: index) ] ] ]
  ifNone: [ self error ]

In MCMczReader >> #loadDefinitions, if you change this line

  [:m | [^definitions := (DataStream on: m contentStream) next definitions]

to this

  [:m | [ self error. ^definitions := (DataStream on: m contentStream) next definitions]

then the definition will be read from the sources instead of the binary
snapshot, and you'll get the correct source code.

First I disabled all the ZipPlugin primitives to see if those are
responsible for this issue, but they turned out to be okay.
Then I dag into DataStream, and I came to the conclusion that the issue is
in BitBlt. The mangled characters appear when PositionableStream >>
#nextWordsInto: applies some BitBlt magic to convert the read bytes into a
WideString.

Here's a snippet triggering the error:

| wideString source pos blt expectedWideString |
source := #[1 64 255 14 1 64 48 251].
expectedWideString := WideString fromByteArray: source.
wideString := WideString new: source size // 4.
pos := 0.
blt := (BitBlt
  toForm: (Form new hackBits: wideString))
  sourceForm: (Form new hackBits: source).
blt
  combinationRule: Form over;
  sourceX: 0;
  sourceY: pos // 4;
  height: wideString byteSize // 4;
  width: 4;
  destX: 0;
  destY: 0;
  copyBits.
wideString restoreEndianness.
self assert: wideString = expectedWideString

Levente

Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] 64-bit WideString access (was [squeak-dev] New Cog VMs available)

Nicolas Cellier
I have a change where I massively decalred the bitblt operands as unsigned int.
I did not push it so far, but I'll try to see if it solves the symptoms.

2016-03-31 3:16 GMT+02:00 Levente Uzonyi <[hidden email]>:
Hi Eliot,

I made a snippet to read the source of the method as an array of integers from the .mcz in the package cache:

(MCMczReader versionFromFile: 'package-cache/Multilingual-ul.210.mcz') snapshot definitions
        detect: [ :each |
                each isMethodDefinition
                        and: [ each className = #JapaneseEnvironment
                                and: [ each selector = #flapTabTextFor:in: ] ] ]
        ifFound: [ :definition |
                Array streamContents: [ :stream |
                        | source |
                        source := definition source.
                        1 to: source size do: [ :index |
                                stream nextPut: (source basicAt: index) ] ] ]
        ifNone: [ self error ]

In MCMczReader >> #loadDefinitions, if you change this line

                [:m | [^definitions := (DataStream on: m contentStream) next definitions]

to this

                [:m | [ self error. ^definitions := (DataStream on: m contentStream) next definitions]

then the definition will be read from the sources instead of the binary snapshot, and you'll get the correct source code.

First I disabled all the ZipPlugin primitives to see if those are responsible for this issue, but they turned out to be okay.
Then I dag into DataStream, and I came to the conclusion that the issue is in BitBlt. The mangled characters appear when PositionableStream >> #nextWordsInto: applies some BitBlt magic to convert the read bytes into a WideString.

Here's a snippet triggering the error:

| wideString source pos blt expectedWideString |
source := #[1 64 255 14 1 64 48 251].
expectedWideString := WideString fromByteArray: source.
wideString := WideString new: source size // 4.
pos := 0.
blt := (BitBlt
        toForm: (Form new hackBits: wideString))
        sourceForm: (Form new hackBits: source).
blt
        combinationRule: Form over;
        sourceX: 0;
        sourceY: pos // 4;
        height: wideString byteSize // 4;
        width: 4;
        destX: 0;
        destY: 0;
        copyBits.
wideString restoreEndianness.
self assert: wideString = expectedWideString

Levente




Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] 64-bit WideString access (was [squeak-dev] New Cog VMs available)

Nicolas Cellier
OK, declaring a few 'unsigned int' in BitBltSimulation suffices to make the problem disappear on SqueakCogSpur64 MacOSX, I'll publish this evening.

2016-03-31 7:42 GMT+02:00 Nicolas Cellier <[hidden email]>:
I have a change where I massively decalred the bitblt operands as unsigned int.
I did not push it so far, but I'll try to see if it solves the symptoms.

2016-03-31 3:16 GMT+02:00 Levente Uzonyi <[hidden email]>:
Hi Eliot,

I made a snippet to read the source of the method as an array of integers from the .mcz in the package cache:

(MCMczReader versionFromFile: 'package-cache/Multilingual-ul.210.mcz') snapshot definitions
        detect: [ :each |
                each isMethodDefinition
                        and: [ each className = #JapaneseEnvironment
                                and: [ each selector = #flapTabTextFor:in: ] ] ]
        ifFound: [ :definition |
                Array streamContents: [ :stream |
                        | source |
                        source := definition source.
                        1 to: source size do: [ :index |
                                stream nextPut: (source basicAt: index) ] ] ]
        ifNone: [ self error ]

In MCMczReader >> #loadDefinitions, if you change this line

                [:m | [^definitions := (DataStream on: m contentStream) next definitions]

to this

                [:m | [ self error. ^definitions := (DataStream on: m contentStream) next definitions]

then the definition will be read from the sources instead of the binary snapshot, and you'll get the correct source code.

First I disabled all the ZipPlugin primitives to see if those are responsible for this issue, but they turned out to be okay.
Then I dag into DataStream, and I came to the conclusion that the issue is in BitBlt. The mangled characters appear when PositionableStream >> #nextWordsInto: applies some BitBlt magic to convert the read bytes into a WideString.

Here's a snippet triggering the error:

| wideString source pos blt expectedWideString |
source := #[1 64 255 14 1 64 48 251].
expectedWideString := WideString fromByteArray: source.
wideString := WideString new: source size // 4.
pos := 0.
blt := (BitBlt
        toForm: (Form new hackBits: wideString))
        sourceForm: (Form new hackBits: source).
blt
        combinationRule: Form over;
        sourceX: 0;
        sourceY: pos // 4;
        height: wideString byteSize // 4;
        width: 4;
        destX: 0;
        destY: 0;
        copyBits.
wideString restoreEndianness.
self assert: wideString = expectedWideString

Levente