Cog Primitive Performance


Eliot Miranda-2
 
Hi All,

    Clément and I have been looking at primitive performance in the context of string hash.  We'll say more about string hash in a later message (and there is nothing to be alarmed about here).  But it is worth describing the performance measurements of the various kinds of primitive dispatch in Cog.  The following measurements are collected from a 64-bit Squeak trunk image with a 64-bit Cog Spur VM with four different implementations of hashMultiply, three of them primitive.  The times are collected on a 2.3 GHz Intel Core i7 MacMini.

The doit below compares the time taken to go round a loop containing 10 unrolled copies of an operation 10 million times, for a total of 100 million invocations of the operation.  The operations are
- doing nothing
- sending yourself
- invoking the SmallInteger>>hashMultiply method
- invoking a version of hashMultiply written as a normal interpreter primitive (more on this below)
- invoking a version of hashMultiply written as a C function called from machine code
- invoking a version of hashMultiply written as a machine code primitive

FYI, hash multiply is
hashMultiply
| low |
low := self bitAnd: 16383.
^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
bitAnd: 16r0FFFFFFF

and when written as a primitive is
hashMultiply
| low |
low := self bitAnd: 16383.
^(16r260D * low + (16r260D * (self bitShift: -14) + (16r0065 * low) * 16384))
bitAnd: 16r0FFFFFFF
because we don't care about the multiply by 16384 overflowing; when written as a primitive it won't overflow into a LargePositiveInteger.

Here is the doit:

| n them |
n := 10000000.
them := {
[1 to: n do: [:v| v. v. v. v. v. v. v. v. v. v]]. [1 to: n do: [:v| v yourself. v yourself. v yourself. v yourself. v yourself. v yourself. v yourself. v yourself. v yourself. v yourself]].
[1 to: n do: [:v| v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply]].
[1 to: n do: [:v| v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive]].
[1 to: n do: [:v| v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive]].
[1 to: n do: [:v| v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive]] }.
them := them, them.
them collect: [:b| b timeToRun]

and the results:

#(19 235 2321 2453 615 314
   19 233 2317 2473 614 313)

Writing this as a table we have
19/19 null
235/233 v yourself (remember divide by 10 to compare against the loop overhead)
2321/2317 hashMultiply
2453/2473 hashMultiplyInterpreterPrimitive
615/614 hashMultiplyPrimitive
314/313 hashMultiplyNativePrimitive

The time for null is the cost of performing the loop arithmetic and the back jump.  The JIT eliminates the push and pop of v, so the block is effectively empty.

v yourself is the overhead of sending a message to a SmallInteger, and returning self, plus the loop overhead.  So the loop overhead is comparable to the send overhead.  Unsurprising; the loop contains an inlined add and an inlined compare.

hashMultiply is the cost of evaluating hashMultiply in Cog jitted code, i.e. as it exists today in standard Squeak and Pharo images.

hashMultiplyInterpreterPrimitive is the cost of evaluating a primitive version of hashMultiply written as a "normal interpreter primitive".  Such primitives take their receiver and arguments from the Smalltalk stack, accessing the stack through the stackPointer global variable, and answer the result by adjusting the stackPointer and replacing the top of stack with the result.  When called from a jitted method, the invocation of the interpreter primitive involves
- writing the native stack and frame pointers to the stackPointer and framePointer variables
- setting the native stack and frame pointers to the C stack (the C stack at the point the interpreter was entered either at start-up or after a callback)
- calling the C function
- undoing the stack switch
- testing primFailCode, the variable that holds the primitive error code and hence, when 0, indicates that the primitive succeeded
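
For reference, here is a rough sketch of what hashMultiply looks like when written in this style as plugin code (in Slang, the Smalltalk subset from which the VM's C is generated).  The selector and the exact InterpreterProxy calls below are illustrative assumptions, not the code that was measured:

primitiveHashMultiply
	"Sketch of a normal interpreter primitive: fetch the receiver via the
	 stack pointer, compute, and answer by replacing the top of stack."
	<export: true>
	| receiver low |
	receiver := interpreterProxy stackIntegerValue: 0.
	interpreterProxy failed ifTrue: [^nil].
	low := receiver bitAnd: 16383.
	interpreterProxy pop: 1 thenPush:
		(interpreterProxy integerObjectOf:
			((16r260D * low + (16r260D * (receiver bitShift: -14) + (16r0065 * low) * 16384))
				bitAnd: 16r0FFFFFFF))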

hashMultiplyPrimitive is an experimental implementation of the primitive in which it is written as a C function that takes the receiver as an argument and answers its result by returning it, setting primFailCode and returning 0 on failure (0 is never a valid object, so it can be used to flag failure).  The primitive runs on the Smalltalk stack, avoiding the stack switch of a normal interpreter primitive invocation.  Because the Smalltalk stack is divided into small pages, this form can only be used for C functions that will not consume much stack (they must not recurse, call a deep call chain, do any stack allocation, raise an exception, etc.).
When called from a jitted method, the invocation of the C primitive involves
- saving any live caller-saved registers (i.e. the register containing the receiver)
- marshaling the live registers (in this case only the register containing the receiver) as argument(s) according to the C ABI
- calling the C function
- restoring any caller-saved registers
- testing the C result register and either copying it to the receiver/result register and returning, or falling through to primitive failure code
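
As a sketch of roughly what such a function could look like (written here in Slang rather than hand-written C; the selector, the failure protocol and the use of primitiveFailFor:/PrimErrBadReceiver are assumptions about the shape of the experiment, not its actual code):

ceHashMultiply: receiverOop
	"Sketch: take the receiver oop as a C argument and answer the result oop
	 directly; answer 0 (never a valid oop) after setting primFailCode on failure."
	| value low |
	(objectMemory isIntegerObject: receiverOop) ifFalse:
		[self primitiveFailFor: PrimErrBadReceiver.
		 ^0].
	value := objectMemory integerValueOf: receiverOop.
	low := value bitAnd: 16383.
	^objectMemory integerObjectOf:
		((16r260D * low + (16r260D * (value bitShift: -14) + (16r0065 * low) * 16384))
			bitAnd: 16r0FFFFFFF)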

hashMultiplyNativePrimitive is the standard fast implementation of a primitive, using the JIT's inbuilt assembler; i.e. the primitive is written as the abstract assembly code from which the JIT generates machine code.  Because this is effectively assembly programming we restrict its use to relatively simple performance-critical primitives (SmallInteger, BoxedFloat64 and SmallFloat64 arithmetic and comparison, basicNew, basicNew:, shallowCopy, closure value, perform:[with:with:], #==, #~~, #class, basicAt:, basicAt:put:, basicSize).

As you can see, normal interpreter primitives are woefully slow.  The cost is so high that jitted Smalltalk code just beats the primitive, even though the normal Smalltalk code does one more operation and involves five sends (the JIT is clever enough to inline the bitShift: and bitAnd: operations).  Alas, most primitives in the VM are written in this style.  It has two advantages: first, there exists a large number of existing implementations using this style, and second, the style makes it easy to write primitives that take a variable number of arguments (so for example getVMParameters, vmParameterAt: and vmParameterAt:put: are written as one large primitive in the VM code).

Writing the primitive as a C function that takes its parameters as arguments and answers its result as the function's return value is much faster.  We're therefore interested in using this scheme to reimplement primitives such as replaceFrom:to:with:startingAt: that are performance-critical but too complex to write as native machine code primitives without developing ulcers, losing whatever hair color one has left, getting psoriasis, etc.

What's perhaps surprising, but should actually be unsurprising, is that there is still noticeable overhead in calling even a very simple C function, compared to embedding the code for the primitive directly in the jitted machine code.  It is unsurprising because the numbers show that the overhead of calling a one-argument C function is very similar to the overhead of a send.

What's inconvenient about these call-a-C-function-on-the-Smalltalk-stack primitives is that our C call marshalling machinery currently only handles 4 arguments, a limit that means we never have to pass arguments on the stack on ARM and WIN64, both of which have a four-integer-register-argument calling convention.  But writing replaceFrom:to:with:startingAt: as a C function requires 5 arguments, because we have to supply the receiver as well.  So the C marshalling machinery will have to be extended to pass arguments both in registers and on the stack.  Volunteers welcome ;-)


You may guess where this conversation is going in terms of string hash.  But for the moment we can conclude that
- normal interpreter primitives should not be relied upon for performance-critical primitives
- the experimental call-a-C-function-on-the-Smalltalk-stack primitives look like a good way to write more complex primitives (provided we can marshal their arguments and hence invoke them).  While they are a little slower to invoke than machine code primitives, that cost is quickly amortized over complex operations.

_,,,^..^,,,_
best, Eliot

Re: Cog Primitive Performance

Levente Uzonyi
 
Hi Eliot,

This all sounds pretty cool.
I wonder why you kept the multiplication by 16384, when you could have
just used #bitShift: 14 instead. If I read it correctly, this is like that
in the jitted code too.
I suppose the multiplication is a leftover from the time it was faster
than bit shift.

Levente

Re: Cog Primitive Performance

Levente Uzonyi
 
To be more precise, in SimpleStackBasedCogit >> #genPrimitiveHashMultiply

                MoveCq: 16384 R: TempReg;
                MulR: TempReg R: highReg;                                               "highReg := (16r260D * (receiver bitShift: -14)) + (16r0065 * low)"

could read:

                LogicalShiftLeftCq: 14 R: highReg;                                               "highReg := (16r260D * (receiver bitShift: -14)) + (16r0065 * low) bitShift: 14"

(Note that I changed the comment too; the shift was missing.)

Levente


Re: Cog Primitive Performance

Andres Valloud-4
In reply to this post by Eliot Miranda-2
 
On 4/17/17 16:45 , Eliot Miranda wrote:

> FYI, hash multiply is
> hashMultiply
> | low |
> low := self bitAnd: 16383.
> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
> bitAnd: 16383) * 16384))
> bitAnd: 16r0FFFFFFF
>
> and when written as a primitive is
> hashMultiply
> | low |
> low := self bitAnd: 16383.
> ^(16r260D * low + (16r260D * (self bitShift: -14) + (16r0065 * low) *
> 16384))
> bitAnd: 16r0FFFFFFF
> because we don't care about the multiply by 16384 overflowing; when
> written as a primitive it won't overflow into a LargePositiveInteger.

Hopefully this doesn't mean the primitive got implemented by actually
doing these operations verbatim.  As you have probably seen, that
convoluted arithmetic is done this way in Smalltalk only to simulate a
32x32 multiplication bit-anded down to 28 bits without overflowing into
large integers (the original code from August 2000 had my initials).

At a sufficiently low level such as C, all that complexity is just an
unsigned multiplication by 1664525.  The image code should still have a
comment to that effect, did it get lost?

Andres.
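
Indeed, since 16r0065 * 16384 + 16r260D = 1664525, the split form and the plain multiply agree modulo 2^28.  A quick workspace check of that claim (a sketch relying only on SmallInteger>>hashMultiply as quoted above):

| mismatches |
mismatches := (1 to: 100000) reject:
	[:v | v hashMultiply = ((v * 1664525) bitAnd: 16r0FFFFFFF)].
mismatches isEmpty  "answers true"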

Re: Cog Primitive Performance

Ben Coman
In reply to this post by Eliot Miranda-2
 


On Tue, Apr 18, 2017 at 7:45 AM, Eliot Miranda <[hidden email]> wrote:
 
Hi All,

    Clément and I have been looking at primitive performance in the context of string hash.  We'll say more about string hash in a later message (and there is nothing to be alarmed about here).  But it is worth describing the performance measurements of the various kinds of primitive dispatch in Cog.  The following measurements are collected from a 64-bit Squeak trunk image with a 64-bit Cog Spur VM with four different implementations of hashMultiply, three of them primitive.  The times are collected on a 2.3 GHz Intel Core i7 MacMini.

The doit below compares the time taken to go round a loop containing 10 unrolled copies of an operation 10 million times, for a total of 100 million invocations of the operation.  The operations are
- doing nothing
- sending yourself
- invoking the SmallInteger>>hashMultiply method
- invoking a version of hashMultiply written as a normal interpreter primitive (more on this below)
- invoking a version of hashMultiply written as a C function called from machine code
- invoking a version of hashMultiply written as a machine code primitive

Where does lowcode fit into this picture?

cheers -ben

Re: Cog Primitive Performance

Nicolas Cellier
In reply to this post by Andres Valloud-4
 


2017-04-18 4:02 GMT+02:00 Andres Valloud <[hidden email]>:

On 4/17/17 16:45 , Eliot Miranda wrote:
FYI, hash multiply is
hashMultiply
| low |
low := self bitAnd: 16383.
^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
bitAnd: 16383) * 16384))
bitAnd: 16r0FFFFFFF

and when written as a primitive is
hashMultiply
| low |
low := self bitAnd: 16383.
^(16r260D * low + (16r260D * (self bitShift: -14) + (16r0065 * low) *
16384))
bitAnd: 16r0FFFFFFF
because we don't care about the multiply by 16384 overflowing; when
written as a primitive it won't overflow into a LargePositiveInteger.

Hopefully this doesn't mean the primitive got implemented by actually doing these operations verbatim.  As you have probably seen, that convoluted arithmetic is done this way in Smalltalk only to simulate a 32x32 multiplication bit-anded down to 28 bits without overflowing into large integers (the original code from August 2000 had my initials).

At a sufficiently low level such as C, all that complexity is just an unsigned multiplication by 1664525.  The image code should still have a comment to that effect, did it get lost?

Andres.

More: if it's a 64-bit image, then we have 60-bit unsigned SmallIntegers;
since 1664525 highBit = 21, and self is a hash result not exceeding 30 bits, we can implement hashMultiply as
    ^self * 1664525 bitAnd: 16r0FFFFFFF

Maybe the JIT can be given a second chance too.

Re: Cog Primitive Performance

Levente Uzonyi
In reply to this post by Andres Valloud-4
 
The methods in SmallInteger and LargePositiveInteger have the initials SqR
(that's yours IIRC), but they have no comment.
The variant in String is originally from 2001 and has never had that
comment.

Levente


Re: Cog Primitive Performance

Andres Valloud-4
 
Yes, the SqR initials are mine.  That's too bad about the comments, at
the very least there should be something like this:

"This code performs a multiplication by 1664525 mod 2^28 without
overflowing into large integers."


Re: Cog Primitive Performance

David T. Lewis
In reply to this post by Nicolas Cellier
 
On Tue, Apr 18, 2017 at 10:27:41AM +0200, Nicolas Cellier wrote:

>  
> 2017-04-18 4:02 GMT+02:00 Andres Valloud <[hidden email]>
> :
>
> >
> > On 4/17/17 16:45 , Eliot Miranda wrote:
> >
> >> FYI, hash multiply is
> >> hashMultiply
> >> | low |
> >> low := self bitAnd: 16383.
> >> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
> >> bitAnd: 16383) * 16384))
> >> bitAnd: 16r0FFFFFFF
> >>
> >> and when written as a primitive is
> >> hashMultiply
> >> | low |
> >> low := self bitAnd: 16383.
> >> ^(16r260D * low + (16r260D * (self bitShift: -14) + (16r0065 * low) *
> >> 16384))
> >> bitAnd: 16r0FFFFFFF
> >> because we don't care about the multiply by 16384 overflowing; when
> >> written as a primitive it won't overflow into a LargePositiveInteger.
> >>
> >
> > Hopefully this doesn't mean the primitive got implemented by actually
> > doing these operations verbatim.  As you have probably seen, that
> > convoluted arithmetic is done this way in Smalltalk only to simulate a
> > 32x32 multiplication bit-anded down to 28 bits without overflowing into
> > large integers (the original code from August 2000 had my initials).
> >
> > At a sufficiently low level such as C, all that complexity is just an
> > unsigned multiplication by 1664525.  The image code should still have a
> > comment to that effect, did it get lost?
> >
> > Andres.
> >
>
> More: if it's a 64 bit image, then we have 60 bit long unsigned small int
> since 1664525 highBit = 21, and self is a hash result not exceeding 30
> bits, we can implement hashMultiply as
>     ^self * 1664525 bitAnd: 16r0FFFFFFF
>
> Maybe the JIT can be given a second chance too.

AFAIK, primitiveStringHash has always been translated directly from
ByteArray class>>hashBytes:startingWith: and the hashMultiply logic that we are
discussing here currently is repeated in four different methods in Squeak trunk
(String class>>stringHash:initialHash:, TypeString class>>stringHash:initialHash:,
SmallInteger hashMultiply, and ByteArray class hashBytes:startingWith:).

At this point, and particularly given the differences in various VMs, it would
make sense to reimplement primitiveStringHash directly in MiscPrimitivePlugin
with whatever optimizations are needed (maybe differently based on the size of a
small integer), rather than translating it from a default implementation in
the image.

But I think we may be missing Eliot's main point. There is overhead associated
with accessing the stack (at least on a stack interpreter) and in returning
values to the image, and this will be the case regardless of what calculations are
being done in the primitive itself. So at least for some VMs we may be reaching
the point where that overhead becomes a significant part of the performance profile.

Whether that overhead would justify creating a new primitive callout mechanism
may be another question, but it is certainly worth understanding where the
overhead occurs, and how that performance profile changes as the Cog jit and
Sista progress.

Dave


Re: Cog Primitive Performance

Nicolas Cellier
 


2017-04-18 16:03 GMT+02:00 David T. Lewis <[hidden email]>:

On Tue, Apr 18, 2017 at 10:27:41AM +0200, Nicolas Cellier wrote:
>
> 2017-04-18 4:02 GMT+02:00 Andres Valloud <[hidden email]>
> :
>
> >
> > On 4/17/17 16:45 , Eliot Miranda wrote:
> >
> >> FYI, hash multiply is
> >> hashMultiply
> >> | low |
> >> low := self bitAnd: 16383.
> >> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
> >> bitAnd: 16383) * 16384))
> >> bitAnd: 16r0FFFFFFF
> >>
> >> and when written as a primitive is
> >> hashMultiply
> >> | low |
> >> low := self bitAnd: 16383.
> >> ^(16r260D * low + (16r260D * (self bitShift: -14) + (16r0065 * low) *
> >> 16384))
> >> bitAnd: 16r0FFFFFFF
> >> because we don't care about the multiply by 16384 overflowing; when
> >> written as a primitive it won't overflow into a LargePositiveInteger.
> >>
> >
> > Hopefully this doesn't mean the primitive got implemented by actually
> > doing these operations verbatim.  As you have probably seen, that
> > convoluted arithmetic is done this way in Smalltalk only to simulate a
> > 32x32 multiplication bit-anded down to 28 bits without overflowing into
> > large integers (the original code from August 2000 had my initials).
> >
> > At a sufficiently low level such as C, all that complexity is just an
> > unsigned multiplication by 1664525.  The image code should still have a
> > comment to that effect, did it get lost?
> >
> > Andres.
> >
>
> More: if it's a 64 bit image, then we have 60 bit long unsigned small int
> since 1664525 highBit = 21, and self is a hash result not exceeding 30
> bits, we can implement hashMultiply as
>     ^self * 1664525 bitAnd: 16r0FFFFFFF
>
> Maybe the JIT can be given a second chance too.

AFAIK, primitiveStringHash has always been translated directly from
ByteArray class>>hashBytes:startingWith: and the hashMultiply logic that we are
discussing here currently is repeated in four different methods in Squeak trunk
(String class>>stringHash:initialHash:, TypeString class>>stringHash:initialHash:,
SmallInteger hashMultiply, and ByteArray class hashBytes:startingWith:).

At this point, and particularly given the differences in various VMs, it would
make sense to reimplement primitiveStringHash directly in MiscPrimitivesPlugin
with whatever optimizations are needed (maybe differently based on size of a
small integers), rather than translating it from a default implementation in
the image.

Hi David,
My opinion is rather that we should get rid of MiscPrimitivesPlugin.
1) If String hash is performance critical (and it is), it's not really Misc.
2) Eliot's post shows that regular primitive invocation is slow vs modern spur/jit, so it might not pay
3) other than that, we don't properly version control the MiscPrimitivePlugin source code:
   it's not handled in the VMMaker repository nor in one of the well-identified external plugin repositories,
   but it's spread over Squeak and/or Pharo packages, and thus subject to uncontrolled changes and divergences
 
Maybe it's time to do it.

But I think we may be missing Eliot's main point. There is overhead associated
with accessing the stack (at least on a stack interpreter) and in returning
values to image, and this will be the case regardless of what calculations are
being done in the primitive itself. So at least for some VMs we may be reaching
the point where that overhead becomes a significant part of the performance profile.

Whether that overhead would justify creating a new primitive callout mechanism
may be another question, but it is certainly worth understanding where the
overhead occurs, and how that performance profile changes as the Cog jit and
Sista progress.

Dave


Yes, exactly, but the new callout mechanism has severe stack limitations preventing general purpose usage.
Concerning the specific case of hashMultiply, since it's just an imul and a bitAnd, the primitive should be inlined by the VM just like other basic arithmetic operations IMO.
So the remarks of Andres make sense, even if we can't generalize to other Misc primitives.


Re: Cog Primitive Performance

Clément Béra
 


On Tue, Apr 18, 2017 at 9:09 AM, Nicolas Cellier <[hidden email]> wrote:
 


2017-04-18 16:03 GMT+02:00 David T. Lewis <[hidden email]>:

On Tue, Apr 18, 2017 at 10:27:41AM +0200, Nicolas Cellier wrote:
>
> 2017-04-18 4:02 GMT+02:00 Andres Valloud <[hidden email]>
> :
>
> >
> > On 4/17/17 16:45 , Eliot Miranda wrote:
> >
> >> FYI, hash multiply is
> >> hashMultiply
> >> | low |
> >> low := self bitAnd: 16383.
> >> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
> >> bitAnd: 16383) * 16384))
> >> bitAnd: 16r0FFFFFFF
> >>
> >> and when written as a primitive is
> >> hashMultiply
> >> | low |
> >> low := self bitAnd: 16383.
> >> ^(16r260D * low + (16r260D * (self bitShift: -14) + (16r0065 * low) *
> >> 16384))
> >> bitAnd: 16r0FFFFFFF
> >> because we don't care about the multiply by 16384 overflowing; when
> >> written as a primitive it won't overflow into a LargePositiveInteger.
> >>
> >
> > Hopefully this doesn't mean the primitive got implemented by actually
> > doing these operations verbatim.  As you have probably seen, that
> > convoluted arithmetic is done this way in Smalltalk only to simulate a
> > 32x32 multiplication bit-anded down to 28 bits without overflowing into
> > large integers (the original code from August 2000 had my initials).
> >
> > At a sufficiently low level such as C, all that complexity is just an
> > unsigned multiplication by 1664525.  The image code should still have a
> > comment to that effect, did it get lost?
> >
> > Andres.
> >
>
> More: if it's a 64 bit image, then we have 60 bit long unsigned small int
> since 1664525 highBit = 21, and self is a hash result not exceeding 30
> bits, we can implement hashMultiply as
>     ^self * 1664525 bitAnd: 16r0FFFFFFF
>
> Maybe the JIT can be given a second chance too.

AFAIK, primitiveStringHash has always been translated directly from
ByteArray class>>hashBytes:startingWith: and the hashMultiply logic that we are
discussing here currently is repeated in four different methods in Squeak trunk
(String class>>stringHash:initialHash:, TypeString class>>stringHash:initialHash:,
SmallInteger hashMultiply, and ByteArray class hashBytes:startingWith:).

At this point, and particularly given the differences in various VMs, it would
make sense to reimplement primitiveStringHash directly in MiscPrimitivesPlugin
with whatever optimizations are needed (maybe differently based on size of a
small integers), rather than translating it from a default implementation in
the image.

Hi David,
My opinion is rather that we should get rid of MiscPrimitivesPlugin.
1) If String hash is performance critical (and it is), it's not really Misc.
2) Eliot's post shows that regular primitive invocation is slow vs modern spur/jit, so it might not pay
3) other than that, we don't properly version control MiscPrimitivePlugin source code
   it's not handled in VMMaker repository nor in one of the well identified external plugin repositories.
   but it's spreaded over Squeak and/or Pharo packages, and thus subject to uncontrolled changes and divergences
 
Maybe it's time to do it.

I agree. I am ready to collaborate on this and update the Pharo packages. 

For each primitive, we need to evaluate the performance with and without the primitive. In each case, we either move the primitive to a numbered primitive or we just use normal Smalltalk code. We looked into string hashing with Eliot and it seems that having only hashMultiply as a primitive is enough.
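
For illustration, with hashMultiply as the only primitive, stringHash:initialHash: can be plain Smalltalk along these lines (just a sketch of the idea, not a final implementation):

stringHash: aString initialHash: speciesHash
	"Sketch: plain Smalltalk string hashing on top of the hashMultiply primitive."
	| hash |
	hash := speciesHash bitAnd: 16r0FFFFFFF.
	1 to: aString size do:
		[:pos | hash := (hash + (aString basicAt: pos)) hashMultiply].
	^hash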

But I think we may be missing Eliot's main point. There is overhead associated
with accessing the stack (at least on a stack interpreter) and in returning
values to image, and this will be the case regardless of what calculations are
being done in the primitive itself. So at least for some VMs we may be reaching
the point where that overhead becomes a significant part of the performance profile.

Whether that overhead would justify creating a new primitive callout mechanism
may be another question, but it is certainly worth understanding where the
overhead occurs, and how that performance profile changes as the Cog jit and
Sista progress.
Dave


Yes, exactly, but the new callout mechanism has severe stack limitations preventing general purpose usage.

The idea there is that each primitive can be analysed to determine whether it can be called that way or not. I believe large integer primitives, for example, may be able to be called that way.
 
Concerning the specific case of hashMultiply, since it's just an imul and a bitAnd, the primitive should be inlined by the VM just like other basic arithmetic operations IMO.
So the remarks of Andres makes sense, even if we can't generalize to other Misc primitives.




Re: Cog Primitive Performance

Eliot Miranda-2
In reply to this post by David T. Lewis
 
Hi All, Hi David,

On Tue, Apr 18, 2017 at 7:03 AM, David T. Lewis <[hidden email]> wrote:

On Tue, Apr 18, 2017 at 10:27:41AM +0200, Nicolas Cellier wrote:
>
> 2017-04-18 4:02 GMT+02:00 Andres Valloud <[hidden email]>
> :
>
> >
> > On 4/17/17 16:45 , Eliot Miranda wrote:
> >
> >> FYI, hash multiply is
> >> hashMultiply
> >> | low |
> >> low := self bitAnd: 16383.
> >> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
> >> bitAnd: 16383) * 16384))
> >> bitAnd: 16r0FFFFFFF
> >>
> >> and when written as a primitive is
> >> hashMultiply
> >> | low |
> >> low := self bitAnd: 16383.
> >> ^(16r260D * low + (16r260D * (self bitShift: -14) + (16r0065 * low) *
> >> 16384))
> >> bitAnd: 16r0FFFFFFF
> >> because we don't care about the multiply by 16384 overflowing; when
> >> written as a primitive it won't overflow into a LargePositiveInteger.
> >>
> >
> > Hopefully this doesn't mean the primitive got implemented by actually
> > doing these operations verbatim.  As you have probably seen, that
> > convoluted arithmetic is done this way in Smalltalk only to simulate a
> > 32x32 multiplication bit-anded down to 28 bits without overflowing into
> > large integers (the original code from August 2000 had my initials).
> >
> > At a sufficiently low level such as C, all that complexity is just an
> > unsigned multiplication by 1664525.  The image code should still have a
> > comment to that effect, did it get lost?
> >
> > Andres.
> >
>
> More: if it's a 64 bit image, then we have 60 bit long unsigned small int
> since 1664525 highBit = 21, and self is a hash result not exceeding 30
> bits, we can implement hashMultiply as
>     ^self * 1664525 bitAnd: 16r0FFFFFFF
>
> Maybe the JIT can be given a second chance too.

AFAIK, primitiveStringHash has always been translated directly from
ByteArray class>>hashBytes:startingWith: and the hashMultiply logic that we are
discussing here currently is repeated in four different methods in Squeak trunk
(String class>>stringHash:initialHash:, TypeString class>>stringHash:initialHash:,
SmallInteger hashMultiply, and ByteArray class hashBytes:startingWith:).

At this point, and particularly given the differences in various VMs, it would
make sense to reimplement primitiveStringHash directly in MiscPrimitivesPlugin
with whatever optimizations are needed (maybe differently based on size of a
small integers), rather than translating it from a default implementation in
the image.

But I think we may be missing Eliot's main point. There is overhead associated
with accessing the stack (at least on a stack interpreter) and in returning
values to image, and this will be the case regardless of what calculations are
being done in the primitive itself. So at least for some VMs we may be reaching
the point where that overhead becomes a significant part of the performance profile.

Exactly.  This thread is about the overhead associated with invoking the various forms of primitive.  Interpreter primitives, which access their receiver and arguments via the global stackPointer variable, are very slow, and the measurements show that it is accessing arguments and returning results through the stackPointer global that is the main issue.
 
Whether that overhead would justify creating a new primitive callout mechanism
may be another question, but it is certainly worth understanding where the
overhead occurs, and how that performance profile changes as the Cog jit and
Sista progress.

My conclusion is that we should stop using MiscPrimitivePlugin and reimplement its performance-critical primitives not to use interpreter primitives.  More on that in a follow-up on what to do about string hashing.
 
Dave

--
_,,,^..^,,,_
best, Eliot

Re: Cog Primitive Performance

Eliot Miranda-2
In reply to this post by Andres Valloud-4
 
Hi Andres,

On Tue, Apr 18, 2017 at 1:52 AM, Andres Valloud <[hidden email]> wrote:

Yes, the SqR initials are mine.  That's too bad about the comments, at the very least there should be something like this:

"This code performs a multiplication by 1664525 mod 2^28 without overflowing into large integers."

Thanks.  I'll update the primitives appropriately.  Could you supply a reference both for the technique and the choice of  1664525 in particular?  I'd like to include that info in the comments too.


--
_,,,^..^,,,_
best, Eliot

Re: Cog Primitive Performance

timrowledge
In reply to this post by Andres Valloud-4
 

> On 18-04-2017, at 1:52 AM, Andres Valloud <[hidden email]> wrote:
>
> Yes, the SqR initials are mine.  That's too bad about the comments, at the very least there should be something like this:
>
> "This code performs a multiplication by 1664525 mod 2^28 without overflowing into large integers.”

But... but... I thought comments weren’t useful?

;-)


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Fractured Idiom:- HARLEZ-VOUS FRANCAIS? - Can you drive a French motorcycle?



Re: Cog Primitive Performance

timrowledge
In reply to this post by Nicolas Cellier
 

> On 18-04-2017, at 9:09 AM, Nicolas Cellier <[hidden email]> wrote:
>
> My opinion is rather that we should get rid of MiscPrimitivesPlugin.

It’s been on my list of “things we ought to do” for, oh, 15 years?

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: PWB: Put to Waste Basket



Re: Cog Primitive Performance

Bert Freudenberg
In reply to this post by Nicolas Cellier
 
On Tue, Apr 18, 2017 at 6:09 PM, Nicolas Cellier <[hidden email]> wrote:

Concerning the specific case of hashMultiply, since it's just an imul and a bitAnd, the primitive should be inlined by the VM just like other basic arithmetic operations IMO.
So the remarks of Andres makes sense, even if we can't generalize to other Misc primitives.

Yes, code optimized by Sista should be pretty much as fast as a primitive, so we could hopefully get rid of many primitives that are just there to improve performance.

- Bert -

Re: Cog Primitive Performance

Andres Valloud-4
In reply to this post by Eliot Miranda-2
 
The hashing technique is just a multiplicative hash function.  The
multiplication technique is equivalent to splitting AX into AH and AL,
and using the distributive rule to compute (AH * 2^8 + AL) * (BH * 2^8 + BL).
The choice of 1664525 is due to Mark van der Gulik; he took the constant
out of a Knuth table of good linear congruential pseudo-RNG factors.  It's
been 17 years since then, and I do not currently think the pseudo-RNG
relationship is that relevant.  In the mean time, as hashing goes, it's
good enough.


Re: Cog Primitive Performance

Eliot Miranda-2
In reply to this post by Bert Freudenberg
 
Hi Bert, Hi All,

On Tue, Apr 18, 2017 at 10:55 AM, Bert Freudenberg <[hidden email]> wrote:
 
On Tue, Apr 18, 2017 at 6:09 PM, Nicolas Cellier <[hidden email]> wrote:

Concerning the specific case of hashMultiply, since it's just an imul and a bitAnd, the primitive should be inlined by the VM just like other basic arithmetic operations IMO.
So the remarks of Andres makes sense, even if we can't generalize to other Misc primitives.

Yes, code optimized by Sista should be pretty much as fast as a primitive, so we could hopefully get rid of many primitives that are just there to improve performance.

The problem to think through carefully is the impact on the ContextInterpreter, StackInterpreter and SqueakJS VMs.  I have a gut feeling that the right thing to do is to keep the primitives for the benefit of these VMs and to have Cog ignore them.  That implies that the primitive failure code can be what one would like to execute and have Sista optimize.  That means that translated primitives are an obstacle.

Rewriting the translated primitives as conventional primitives in MiscPrimitivePlugin and not including the MiscPrimitivePlugin in Cog (JIT) VMs (or including one which provides no primitives) would allow the primitive failure code in the various Collection methods to be written optimally, allow the interpreter VMs to keep using the primitives, allow Cog VMs to ignore the primitives easily, and would allow putting the primitives closer to the receivers.  Redirecting through to the class side wastes cycles.
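
For example, hashBytes:startingWith: would then take roughly this shape (a sketch only; the primitive and module names are the ones already mentioned in this thread, and the failure code is the plain Smalltalk fallback the JIT would compile):

hashBytes: aByteArray startingWith: speciesHash
	"Use the plugin primitive when it is present; on VMs without
	 MiscPrimitivePlugin the Smalltalk failure code below runs instead."
	<primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
	| hash |
	hash := speciesHash bitAnd: 16r0FFFFFFF.
	1 to: aByteArray size do:
		[:i | hash := (hash + (aByteArray basicAt: i)) hashMultiply].
	^hash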

_,,,^..^,,,_
best, Eliot

Re: Cog Primitive Performance

Eliot Miranda-2
In reply to this post by Andres Valloud-4
 
Hi Andres,

On Tue, Apr 18, 2017 at 11:06 AM, Andres Valloud <[hidden email]> wrote:

The hashing technique is just a multiplicative hash function.  The multiplication technique is equivalent to splitting AX into AH and AL, and using the distributive rule to (AH * 2^8 + AL) * (BH * 2^8 + BL). The choice of 1664525 is due to Mark van der Gulik, he took the constant out of a Knuth table of good linear congruency pseudo-RNG factors.  It's been 17 years since that, I do not currently think the pseudo-RNG relationship is that relevant.  In the mean time, as hashing goes, it's good enough.

I'm looking for more of a reference, say to a paper which describes the technique, or a reference implementation on the web.  Remember that one important property of 1664525 is that it belongs to a set of integers for which n bitCount * 2 between: n highBit and: n highBit + 2, i.e. for which about half the bits are set.  Further, the bits are reasonably well distributed throughout, unlike for example 16rF0F0F0F0F.

I'll look up the table in my Knuth vol 3.



--
_,,,^..^,,,_
best, Eliot

Re: Cog Primitive Performance

Andres Valloud-4
 
The "technique" of multiplicative hash functions is well publicized, e.g.

https://en.wikipedia.org/wiki/Universal_hashing (under hashing strings)

https://en.wikipedia.org/wiki/Hash_function#Multiplicative_hashing

and IIRC Knuth under hashing.  Part of the problem with multiplicative
hash functions is that every other person chooses a different number and
attaches their name to it.  So if you choose 13, 131, 1313 etc you have
the K&R "digital method" hash, or if you choose an FNV factor then you
have the FNV hashes, and so on.  But really they are all the same
algorithm: namely, the evaluation of a polynomial with coefficients
implied by the data at the factor chosen using Horner's rule.

Note I did measure, and even Knuth's suggestions for factors did not
substantially improve the outcome.  The entire class of multiplicative
hash functions, as long as the factor is not really bad like 31 or 255,
are "good enough".  In particular, the size of the factor doesn't matter
as much provided it's sufficiently large compared to the quanta being
hashed (e.g. bytes => factor [significantly] greater than 256).

Overall, I checked 300+ hash functions against 100+ datasets and
documented the results in the hashing book.  Many of the hash functions
are multiplicative hash functions, either explicitly or in disguise.

At one point I did find the RNG factor 1664525 on line 26 from Knuth's
TAOCP vol 2 table of good linear congruence RNG multipliers.  However,
that the factor is good for an RNG does not necessarily show it has to
be a good factor for a multiplicative hash function.  In retrospect,
that implication seems like an argument by osmosis.

Andres.
