Hi.
I've committed a change to the inbox with changes to allow getting/putting 64-bit values to ByteArrays (similar to the 32- and 16-bit accessors). Could this be added to trunk?

Also, first time I used the selective commit function - very nice! The changes I didn't want committed didn't, in fact, get committed. Just the desirable bits!

-cbc
Hi Chris, I think these methods belong in the image with the fastest implementation we can do.

I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach, which tries to avoid LargeInteger arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements:

| smallN largeN maBa cbBa |
smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
maBa := ByteArray new: 8.
cbBa := ByteArray new: 8.
maBa maUint: 64 at: 0 put: largeN.
cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
{ 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false ] bench.
'ma smallN write' -> [ cbBa maUint: 64 at: 0 put: smallN ] bench.
'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
'ma smallN access' -> [ cbBa maUnsigned64At: 1 ] bench.
'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false ] bench.
'ma largeN write' -> [ cbBa maUint: 64 at: 0 put: largeN ] bench.
'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
'ma largeN access' -> [ cbBa maUnsigned64At: 1 ] bench. }

Here are the results:

'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' .
'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' .
'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' .
'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' .
'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' .
'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' .
'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' .
'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'
It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range, but slower in the other 3 metrics. Noticeably, it was only 14% as fast for writing the high end of the 64-bit range, and similarly slower for small-number access.

On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham <[hidden email]> wrote:
> [...]
Hi Chris,

I'm all for having the fastest version in the image that works. If you could make your version handle endianness, then I'm all for including it (at least in the 3 variants that are faster). My first use for this (an interface for KAFKA) apparently requires big-endianness, so I really want that supported.

It might be best to keep my naming, though - it follows the name pattern that is already in the class. Or will yours also support 128?

-cbc

On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller <[hidden email]> wrote:
> [...]
Hi Chrises,

my vote would be to write these as 12 numbered primitives, (2, 4 & 8 bytes) * (at: & at:put:) * (big & little endian), because they can be performance critical, and implementing them like this means maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.

On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham <[hidden email]> wrote:
> [...]

_,,,^..^,,,_
best, Eliot
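For readers unfamiliar with what such an endian-explicit accessor primitive boils down to at the VM level, here is a minimal C sketch. This is purely illustrative - the function names are hypothetical and this is not actual Spur VM code; a real primitive would add bounds and tag checks, and the JIT could reduce it to a single (possibly byte-swapped) load or store.

```c
#include <stdint.h>

/* Hypothetical sketch of the core of an endian-explicit 64-bit
   accessor primitive: byte-wise load/store that gives the requested
   byte order regardless of host endianness. */

uint64_t load_u64(const uint8_t *bytes, int bigEndian)
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++) {
        int shift = bigEndian ? (56 - 8 * i) : (8 * i);
        value |= (uint64_t)bytes[i] << shift;
    }
    return value;
}

void store_u64(uint8_t *bytes, uint64_t value, int bigEndian)
{
    for (int i = 0; i < 8; i++) {
        int shift = bigEndian ? (56 - 8 * i) : (8 * i);
        bytes[i] = (uint8_t)(value >> shift);
    }
}
```

With 12 numbered primitives, each (size, direction, endianness) combination would hard-code its own variant of this loop, letting the compiler unroll it or collapse it to one instruction.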
Sometimes the number of bytes is only known in a variable, so would it be possible to do 4 primitives which accept the number of bits (or bytes) as an argument? (uint:at: uint:at:put:) * (big endian, little endian)

On Mon, Aug 31, 2015 at 12:25 PM, Eliot Miranda <[hidden email]> wrote:
> [...]
On Mon, Aug 31, 2015 at 11:35 AM, Chris Muller <[hidden email]> wrote:
> [...]

Of course it's possible, but such an architecture can hardly be quick. If one needs the flexible primitives then use them, but don't hobble the system by only providing them. Having a real 64-bit VM means that the use of two 32-bit accesses is unnecessarily slow. Which would you rather, and which do you think would be faster (I don't know, but I have my suspicions): expand the existing flexible integerAt: prims to integerAt:put:bytes:signed:bigEndian: (yuck), or implement this in terms of a wrapper something like

ByteArray>>integerAt: index bytes: numBytes signed: signed bigEndian: bigEndian
	| value sign |
	^numBytes >= 4
		ifTrue:
			[numBytes = 8 ifTrue:
				[value := self unsignedLong64At: index.
				 bigEndian ifTrue: [value := self byteReverseEightBytes: value].
				 (sign := value bitShift: -63) ~= 0 ifTrue: "if the VM is intelligent about left shift of zero then this test is unnecessary..."
					[value := value - ((sign bitAnd: 1) bitShift: 64)].
				 ^value].
			 numBytes = 4 ifTrue:
				[value := self unsignedLong32At: index.
				 bigEndian ifTrue: [value := self byteReverseFourBytes: value].
				 (sign := value bitShift: -31) ~= 0 ifTrue: "if the VM is intelligent about left shift of zero then this test is unnecessary..."
					[value := value - ((sign bitAnd: 1) bitShift: 32)].
				 ^value].
			 ^self error: 'size must be a power of two from 1 to 8']
		ifFalse: ...

_,,,^..^,,,_
best, Eliot
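The sign-extension step in Eliot's wrapper - read the value unsigned, test the top bit, and subtract 2^(8*N) if it is set - is a standard two's-complement trick. A hedged C sketch of that step (the function name is illustrative; in Smalltalk the subtraction needs LargeInteger arithmetic, while in C a wider intermediate type suffices):

```c
#include <stdint.h>

/* Turn an N-byte unsigned load into a signed value by checking the
   top bit and subtracting 2^(8*N). numBytes is assumed to be 1..7
   here; for 8 bytes, C code would just cast to int64_t (or need a
   128-bit intermediate), since 1 << 64 is undefined behavior. */

int64_t sign_extend(uint64_t value, int numBytes)
{
    int bits = 8 * numBytes;
    uint64_t sign = value >> (bits - 1);   /* 0 or 1: the top bit */
    if (sign != 0)
        return (int64_t)(value - ((uint64_t)1 << bits));  /* subtract 2^bits */
    return (int64_t)value;
}
```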
I would ask that someone please measure the real-world performance benefit of adding these (or any other) numbered primitives. Maybe it's a lot, maybe it's not, but when in doubt, leave it out.

Dave

On Mon, Aug 31, 2015 at 10:25:59AM -0700, Eliot Miranda wrote:
> [...]
FWIW... IMO it's better to enable access to the relevant compiler intrinsic with platform-specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating endianness concerns from the large integer arithmetic primitives with such macros enabled 25-40% faster performance on big-endian platforms. Just as importantly, the intrinsic approach takes significantly less code to implement.

On 8/31/15 10:25, Eliot Miranda wrote:
> [...]
Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:

Before:
{'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.

After:
{'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.

As you can see, everything became faster except for smallN access. This is the side effect of optimizing for the average case instead of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance.

I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values having an odd index belong to big-endian access, and even ones to little-endian.

| byteArray inputs random storageBits unsigned |
Smalltalk garbageCollect.
random := Random seed: 36rSqueak.
storageBits := 64.
unsigned := true.
byteArray := ByteArray new: storageBits // 8 * 2.
inputs := Array new: 100000.
(2 to: storageBits * 2 + 1) collect: [ :descriptor |
	"lowest bit describes endianness, the rest the number of bits."
	| limit bigEndian offset |
	bigEndian := descriptor odd.
	limit := 1 << (descriptor >> 1) - 1.
	unsigned
		ifTrue: [ offset := -1 ]
		ifFalse: [ offset := -1 - (limit >> 1) ].
	inputs replace: [ :each | (random nextInt: limit) + offset ].
	[ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
		1 to: inputs size do: [ :inputIndex |
			byteArray
				unsignedLong64At: startIndex
				put: (inputs at: inputIndex)
				bigEndian: bigEndian;
				unsignedLong64At: startIndex
				bigEndian: bigEndian ] ] ] timeToRun ].

I ran it with various accessors and got the following results:

"short"
#(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
#(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".

"long"
#(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125".
#(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".

"unsigned short"
#(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
#(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".

"unsigned long"
#(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625".
#(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".

"unsigned long 64"
#(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375".
#(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat 259.359375"

Levente

On Sun, 30 Aug 2015, Chris Muller wrote:
> [...]
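To make the benchmark's descriptor encoding concrete (lowest bit selects endianness, the remaining bits give the value width), here is a small C sketch of the decoding. The struct and function names are illustrative, not part of the benchmark:

```c
#include <stdint.h>

/* Decoding the benchmark descriptor: odd descriptors mean big-endian
   access, and descriptor >> 1 is the width in bits; limit is the
   largest unsigned value representable in that many bits. */
typedef struct { int bigEndian; int bits; uint64_t limit; } Descriptor;

Descriptor decode(int descriptor)
{
    Descriptor d;
    d.bigEndian = descriptor & 1;   /* "descriptor odd" in the Smalltalk code */
    d.bits = descriptor >> 1;
    d.limit = (d.bits >= 64) ? UINT64_MAX : (((uint64_t)1 << d.bits) - 1);
    return d;
}
```

So the loop over 2 to: 129 covers widths 1..64 bits, each in both byte orders, which matches the 128-element result arrays.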
Levente,

Interesting. I have a question and a concern about your implementation, though:

Question: why, in the micro checks, is the Write faster than the Access:

{'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.

yet in your more thorough benchmark, the Write is twice as slow as the Access?

"unsigned long 64"
(put, or Write) "average asFloat 536.109375".
(Access) "average asFloat 259.359375"

Any ideas?

The concern is that your code is nicely optimized for our current 32-bit VM - but once we go to 64-bit, I think it will fail. Should we be concerned?

-cbc

On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <[hidden email]> wrote:
> [...]
Hi Chris,
I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit. Levente On Tue, 8 Sep 2015, Chris Cunningham wrote: > Levente, > Interesting. I have a question and a concern about your implementation, though: > > Question: why in the micro checks, is the Write faster than the Access: > {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. > 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. > 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. > 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. > yet in your more thorough benchmark, the Write twice as slow as the Access? > "unsigned long 64" > (put, or Write) "average asFloat 536.109375". > (Access) "average asFloat 259.359375" > any ideas? > > the concern is that your code is nicely optimized for our current 32bit vm - but once we go to 64 bit, I think it will fail. Should we be > concerned? > > -cbc > > On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <[hidden email]> wrote: > Hi All, > > A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the > "fastest implementation". > I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following: > > Before: > {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. > 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. > 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. > 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}. > > After: > {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 
> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. > 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. > 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. > > As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead > of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance. > > I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given > number of bits. The result is an array of run times where values having and odd index belong to big-endian access, and even ones to > little-endian. > > | byteArray inputs random storageBits unsigned | > Smalltalk garbageCollect. > random := Random seed: 36rSqueak. > storageBits := 64. > unsigned := true. > byteArray := ByteArray new: storageBits // 8 * 2. > inputs := Array new: 100000. > (2 to: storageBits * 2 + 1) collect: [ :descriptor | > "lowest bit describes endianness, the rest the number of bits." > | limit bigEndian offset | > bigEndian := descriptor odd. > limit := 1 << (descriptor >> 1) - 1. > unsigned > ifTrue: [ offset := -1 ] > ifFalse: [ offset := -1- (limit >> 1) ]. > inputs replace: [ :each | (random nextInt: limit) + offset ]. > [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex | > 1 to: inputs size do: [ :inputIndex | > byteArray > unsignedLong64At: startIndex > put: (inputs at: inputIndex) > bigEndian: bigEndian; > unsignedLong64At: startIndex > bigEndian: bigEndian ] ] ] timeToRun ]. > > I ran it with various accessors and got the following results: > > "short" > #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". > #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5". 
> > "long" > #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 > 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125". > #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 > 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625". > > "unsigned short" > #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5". > #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125". > > "unsigned long" > #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 > 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625". > #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 > 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375". > > "unsigned long 64" > #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 > 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 > 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 > 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat > 536.109375". 
> #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 > 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 > 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 > 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat > 259.359375" > > > Levente > > On Sun, 30 Aug 2015, Chris Muller wrote: > > Hi Chris, I think these methods belong in the image with the fastest > implementation we can do. > > I implemented 64-bit unsigned access for Ma Serializer back in 2005. > I modeled my implementation after Andreas' original approach which > tries to avoid LI arithmetic. I was curious whether your > implementations would be faster, because if they are then it could > benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a > trunk image, I used the following script to take comparison > measurements: > > | smallN largeN maBa cbBa | smallN := ((2 raisedTo: 13) to: (2 > raisedTo: 14)) atRandom. > largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom. > maBa := ByteArray new: 8. > cbBa := ByteArray new: 8. > maBa maUint: 64 at: 0 put: largeN. > cbBa unsignedLong64At: 1 put: largeN bigEndian: false. > self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 > bigEndian: false). > { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN > bigEndian: false] bench. > 'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench. > 'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench. > 'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench. > 'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN > bigEndian: false] bench. > 'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench. 
> 'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench. > 'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench. > } > > Here are the results: > > 'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' . > 'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' . > 'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' . > 'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' . > 'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' . > 'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' . > 'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' . > 'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.' > > It looks like your 64-bit access is 86% faster for accessing the > high-end of the 64-bit range, but slower in the other 3 metrics. > Noticeably, it was only 14% as fast for writing the high-end of the > 64-bit range, and similarly as much slower for small-number access. > > On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham > <[hidden email]> wrote: > [...] |
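For readers outside Squeak, the semantics of the accessors being benchmarked can be sketched in Python. This is a hedged stand-in, not the trunk code; the function names mirror the Smalltalk selectors, and the index is 1-based as in the `unsignedLong64At:put:bigEndian:` and `unsignedLong64At:bigEndian:` methods:

```python
def unsigned_long64_at_put(ba, index, value, big_endian):
    # Analogue of ByteArray>>unsignedLong64At:put:bigEndian: (1-based index):
    # store `value` as 8 bytes in the requested byte order.
    ba[index - 1:index + 7] = value.to_bytes(8, 'big' if big_endian else 'little')
    return value

def unsigned_long64_at(ba, index, big_endian):
    # Analogue of ByteArray>>unsignedLong64At:bigEndian: (1-based index):
    # read 8 bytes back as an unsigned integer.
    return int.from_bytes(ba[index - 1:index + 7], 'big' if big_endian else 'little')

ba = bytearray(8)
large_n = 2 ** 63 + 4242                # high end of the 64-bit range
unsigned_long64_at_put(ba, 1, large_n, False)
assert unsigned_long64_at(ba, 1, False) == large_n
```

The "largeN" cases in the benchmarks are exactly values like `large_n` above, which no longer fit in a SmallInteger on a 32-bit VM, so reading them back forces a LargeInteger allocation.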
In reply to this post by cbc
I forgot to answer your other questions.
On Tue, 8 Sep 2015, Chris Cunningham wrote: > Levente, > Interesting. I have a question and a concern about your implementation, though: > > Question: why in the micro checks, is the Write faster than the Access: Because Access means allocation of new objects (LargeIntegers), while Write simply means copying bytes. > {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. > 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. > 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. > 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. > yet in your more thorough benchmark, is the Write twice as slow as the Access? For each pair of lines the first one was measured _before_ my changes, the second one was measured _after_ them. The benchmark measures read and write together, so 536.109375 stands for the average number of milliseconds to read and write nine million numbers using your variant of the 64-bit methods on my machine. > "unsigned long 64" > (put, or Write) "average asFloat 536.109375". > (Access) "average asFloat 259.359375" > any ideas? So this is (read and write Before) "average asFloat 536.109375" (read and write After) "average asFloat 259.359375" Levente > > the concern is that your code is nicely optimized for our current 32bit vm - but once we go to 64 bit, I think it will fail. Should we be > concerned? > > -cbc > > On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <[hidden email]> wrote: > Hi All, > > A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". > [...] |
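The thread repeatedly refers to Andreas' approach of avoiding LargeInteger ("LI") arithmetic, which Ma Serializer also follows. A sketch of the general technique — an assumption about the idea, not Ma Serializer's or the trunk's actual code — is to split the 64-bit value into two 32-bit halves, so that on a 32-bit VM most intermediate results stay in immediate-integer range:

```python
def put_uint64_in_word_halves(ba, index, value):
    # Split into two 32-bit halves; only the initial mask/shift touches the
    # full 64-bit value, every later intermediate fits in 32 bits.
    low = value & 0xFFFFFFFF
    high = value >> 32
    for i in range(4):
        ba[index - 1 + i] = (low >> (8 * i)) & 0xFF   # bytes 0..3, little-endian
        ba[index + 3 + i] = (high >> (8 * i)) & 0xFF  # bytes 4..7

ba = bytearray(8)
n = (2 ** 64) - 1 - 987654321
put_uint64_in_word_halves(ba, 1, n)
assert int.from_bytes(ba, 'little') == n
```

In Python the halving makes no performance difference, but it illustrates why the "ma" variant can write largeN values quickly: the expensive big-integer operations are confined to the initial split.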
In reply to this post by Levente Uzonyi-2
(Ha! What an amazing thread!) Levente, why am I not surprised that
you could manage to squeeze yet even more efficiency out of it. If there were an eXtreme sport for programming, you'd win the gold medal... :) I've got to try this on the MagmaBenchmarker right now, I'll let you know in a bit. On Tue, Sep 8, 2015 at 4:42 AM, Levente Uzonyi <[hidden email]> wrote: > Hi All, > > A bit later than I wanted to, but I've finally uploaded my versions to the > Trunk. I guess I went as far as possible with getting the "fastest > implementation". > I modified your benchmark to use the same numbers, so that the measurements > could be repeated. I got the following: > [...] |
In reply to this post by Levente Uzonyi-2
Ok, it makes sense now. Well, it made sense before, I was just slow to see it. -cbc On Tue, Sep 8, 2015 at 9:25 AM, Levente Uzonyi <[hidden email]> wrote: I forgot to answer your other questions. |
In reply to this post by Levente Uzonyi-2
After thinking a bit more about this, I came to the conclusion that while
I paid attention to handle reader methods in 64-bit images correctly, some of the writer methods are incorrect for "large" SmallInteger inputs. First I implemented the missing parts for 61-bit SmallIntegers using a general purpose loop, but it turned out that the loop is significantly faster for smaller numbers, and only slightly slower for larger ones, so I decided to nuke the special case for 30-bit SmallIntegers. I also found further possibilities to optimize writes, so the average is down from 259 to 236 milliseconds. Chris Muller's benchmark gives { 'cbc smallN write'->'12,200,000 per second. 81.7 nanoseconds per run.' . 'cbc smallN access'->'10,300,000 per second. 96.8 nanoseconds per run.' . 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.' . 'cbc largeN access'->'5,270,000 per second. 190 nanoseconds per run.'} Which means +20% speed for smallN writes, and +18% for largeN access. Levente P.S.: I still couldn't test the code in my 64-bit Spur image, because it won't respond to any input after startup. On Tue, 8 Sep 2015, Levente Uzonyi wrote: > Hi Chris, > > I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit. > > Levente > > On Tue, 8 Sep 2015, Chris Cunningham wrote: > >> [...] |
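The "general purpose loop" Levente mentions for 61-bit SmallIntegers can be pictured as a plain byte-at-a-time store. This is a sketch of the shape of such a loop, not the trunk method; `store_unsigned` and its parameters are hypothetical names:

```python
def store_unsigned(ba, index, value, n_bytes, big_endian):
    # One byte per iteration; each shift/mask only narrows the value, which
    # is why such a loop stays cheap for small inputs and only modestly
    # slower for large ones.
    for i in range(n_bytes):
        shift = 8 * (n_bytes - 1 - i if big_endian else i)
        ba[index - 1 + i] = (value >> shift) & 0xFF

ba = bytearray(8)
store_unsigned(ba, 1, 0x0102030405060708, 8, True)
assert bytes(ba) == bytes([1, 2, 3, 4, 5, 6, 7, 8])
```

A loop like this handles any width uniformly, which is what makes the separate fast path for 30-bit SmallIntegers unnecessary once the loop itself is fast enough.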
Hi Levente,
Sent from my iPhone

> On Sep 9, 2015, at 2:11 PM, Levente Uzonyi <[hidden email]> wrote:
>
> After thinking a bit more about this, I came to the conclusion that
> while I paid attention to handle reader methods in 64-bit images
> correctly, some of the writer methods are incorrect for "large"
> SmallInteger inputs.
> First I implemented the missing parts for 61-bit SmallIntegers using
> a general-purpose loop, but it turned out that the loop is
> significantly faster for smaller numbers, and only slightly slower
> for larger ones, so I decided to nuke the special case for 30-bit
> SmallIntegers.
> I also found further possibilities to optimize writes, so the average
> is down from 259 to 236 milliseconds. Chris Muller's benchmark gives
>
> { 'cbc smallN write'->'12,200,000 per second. 81.7 nanoseconds per run.' .
> 'cbc smallN access'->'10,300,000 per second. 96.8 nanoseconds per run.' .
> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.' .
> 'cbc largeN access'->'5,270,000 per second. 190 nanoseconds per run.' }
>
> which means +20% speed for smallN writes, and +18% for largeN access.
>
> Levente
>
> P.S.: I still couldn't test the code in my 64-bit Spur image, because
> it won't respond to any input after startup.

IIRC that's due to a bug in earlier versions of the bootstrap in
converting negative integers, or perhaps larger positive SmallIntegers.
Anyway, I *think* that the latest Spur 64-bit VM and image combination
works properly now. Apologies.

>> On Tue, 8 Sep 2015, Levente Uzonyi wrote:
>>
>> Hi Chris,
>>
>> I added the #normalize sends to avoid the creation of spurious
>> LargeIntegers in 64-bit images (there are two places where I relied
>> on #-'s ability to work on unnormalized input). I didn't have a
>> chance to test it, but I expect it to work correctly. Even if the
>> code is sub-optimal in 64-bit, it shouldn't be any slower than in
>> 32-bit.
>>
>> Levente
>>
>>> On Tue, 8 Sep 2015, Chris Cunningham wrote:
>>>
>>> Levente,
>>> Interesting.
>>> I have a question and a concern about your implementation, though.
>>> Question: why, in the micro checks, is the Write faster than the Access:
>>> {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>>> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>>> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>>> 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>>> yet in your more thorough benchmark, the Write is twice as slow as the Access?
>>> "unsigned long 64"
>>> (put, or Write) "average asFloat 536.109375".
>>> (Access) "average asFloat 259.359375"
>>> Any ideas?
>>> The concern is that your code is nicely optimized for our current
>>> 32-bit VM - but once we go to 64-bit, I think it will fail. Should
>>> we be concerned?
>>> -cbc
>>>
>>> On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <[hidden email]> wrote:
>>>
>>> Hi All,
>>>
>>> A bit later than I wanted to, but I've finally uploaded my versions
>>> to the Trunk. I guess I went as far as possible with getting the
>>> "fastest implementation".
>>> I modified your benchmark to use the same numbers, so that the
>>> measurements could be repeated. I got the following:
>>>
>>> Before:
>>> {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
>>> 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
>>> 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
>>> 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
>>>
>>> After:
>>> {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>>> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>>> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>>> 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>>>
>>> As you can see, everything became faster except for smallN access.
>>> This is the side effect of optimizing for the average case instead
>>> of specific cases - like zero bytes. I decided not to use that
>>> trick, because it decreased the overall performance.
>>>
>>> I also wrote a benchmark which measures reads and writes together.
>>> It generates random numbers which can be represented using a given
>>> number of bits. The result is an array of run times where values
>>> having an odd index belong to big-endian access, and even ones to
>>> little-endian.
>>>
>>> | byteArray inputs random storageBits unsigned |
>>> Smalltalk garbageCollect.
>>> random := Random seed: 36rSqueak.
>>> storageBits := 64.
>>> unsigned := true.
>>> byteArray := ByteArray new: storageBits // 8 * 2.
>>> inputs := Array new: 100000.
>>> (2 to: storageBits * 2 + 1) collect: [ :descriptor |
>>>     "lowest bit describes endianness, the rest the number of bits."
>>>     | limit bigEndian offset |
>>>     bigEndian := descriptor odd.
>>>     limit := 1 << (descriptor >> 1) - 1.
>>>     unsigned
>>>         ifTrue: [ offset := -1 ]
>>>         ifFalse: [ offset := -1 - (limit >> 1) ].
>>>     inputs replace: [ :each | (random nextInt: limit) + offset ].
>>>     [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
>>>         1 to: inputs size do: [ :inputIndex |
>>>             byteArray
>>>                 unsignedLong64At: startIndex
>>>                 put: (inputs at: inputIndex)
>>>                 bigEndian: bigEndian;
>>>                 unsignedLong64At: startIndex
>>>                 bigEndian: bigEndian ] ] ] timeToRun ].
>>>
>>> I ran it with various accessors and got the following results:
>>>
>>> "short"
>>> #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
>>> #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
>>>
>>> "long"
>>> #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125".
>>> #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".
>>>
>>> "unsigned short"
>>> #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
>>> #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
>>>
>>> "unsigned long"
>>> #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625".
>>> #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
>>>
>>> "unsigned long 64"
>>> #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375".
>>> #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396) "average asFloat 259.359375".
>>>
>>> Levente
|
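A note on the recurring "avoid LargeInteger arithmetic" theme in this thread: on a 32-bit Squeak image, only integers of about 30 bits are immediate SmallIntegers, so shifting and masking a full 64-bit value forces LargePositiveInteger allocation at every step. The general idea behind the optimized accessors can be sketched as splitting the value into halves so each intermediate stays within SmallInteger range. This Python sketch (hypothetical helper name, not the actual Trunk code) only illustrates the split; Python ints are arbitrary-precision, so the benefit is specific to Smalltalk's tagged-integer representation.

```python
# Sketch of writing a 64-bit value as two 32-bit halves (little-endian),
# so that on a 32-bit Smalltalk image the shifts and masks operate on
# values small enough to stay cheap. Hypothetical helper, for illustration.

def put_uint64_le_via_halves(ba, index, value):
    low = value & 0xFFFFFFFF            # low 32 bits
    high = (value >> 32) & 0xFFFFFFFF   # high 32 bits
    for i in range(4):
        ba[index + i] = (low >> (8 * i)) & 0xFF       # bytes 0..3
        ba[index + 4 + i] = (high >> (8 * i)) & 0xFF  # bytes 4..7

ba = bytearray(8)
n = 0x1122334455667788
put_uint64_le_via_halves(ba, 0, n)
assert ba == bytearray(n.to_bytes(8, 'little'))
```

Levente's #normalize remark fits the same picture: on a 64-bit image the halves themselves can be unnormalized LargeInteger-shaped objects, which is why the reader/writer paths need normalization to stay correct there.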