Hi.
I've committed a change to the inbox with changes to allow getting/putting 64-bit values to ByteArrays (similar to the 32- and 16-bit accessors). Could this be added to trunk?

Also, first time I used the selective commit function - very nice! The changes I didn't want committed didn't, in fact, get committed. Just the desirable bits!

-cbc
Hi Chris, I think these methods belong in the image with the fastest implementation we can do.

I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach, which tries to avoid LargeInteger arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements:

| smallN largeN maBa cbBa |
smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
maBa := ByteArray new: 8.
cbBa := ByteArray new: 8.
maBa maUint: 64 at: 0 put: largeN.
cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
{ 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false ] bench.
'ma smallN write' -> [ cbBa maUint: 64 at: 0 put: smallN ] bench.
'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
'ma smallN access' -> [ cbBa maUnsigned64At: 1 ] bench.
'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false ] bench.
'ma largeN write' -> [ cbBa maUint: 64 at: 0 put: largeN ] bench.
'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
'ma largeN access' -> [ cbBa maUnsigned64At: 1 ] bench. }

Here are the results:

'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' .
'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' .
'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' .
'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' .
'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' .
'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' .
'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' .
'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'
It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range, but slower in the other 3 metrics. Noticeably, it was only 14% as fast for writing the high end of the 64-bit range, and similarly slower for small-number access.

On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham <[hidden email]> wrote:
> [...]
Hi Chris,

I'm all for having the fastest version in the image that works. If you could make your version handle endianness, then I'm all for including it (at least in the 3 variants that are faster). My first use for this (an interface for KAFKA) apparently requires big-endianness, so I really want that supported.

It might be best to keep my naming, though - it follows the name pattern that is already in the class. Or will yours also support 128?

-cbc

On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller <[hidden email]> wrote:
> [...]
Hi Chrises,

my vote would be to write these as 12 numbered primitives, (2, 4 & 8 bytes) * (at: & at:put:) * (big & little endian), because they can be performance critical, and implementing them like this means maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.

On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham <[hidden email]> wrote:
> [...]

_,,,^..^,,,_
best, Eliot
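For readers unfamiliar with what such an endian-explicit accessor primitive boils down to at the VM level, here is a minimal C sketch. This is purely illustrative - the function names are hypothetical and this is not actual Spur VM code; a real primitive would add bounds and tag checks, and the JIT could reduce it to a single (possibly byte-swapped) load or store.

```c
#include <stdint.h>

/* Hypothetical sketch of the core of an endian-explicit 64-bit
   accessor primitive: byte-wise load/store that gives the requested
   byte order regardless of host endianness. */

uint64_t load_u64(const uint8_t *bytes, int bigEndian)
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++) {
        int shift = bigEndian ? (56 - 8 * i) : (8 * i);
        value |= (uint64_t)bytes[i] << shift;
    }
    return value;
}

void store_u64(uint8_t *bytes, uint64_t value, int bigEndian)
{
    for (int i = 0; i < 8; i++) {
        int shift = bigEndian ? (56 - 8 * i) : (8 * i);
        bytes[i] = (uint8_t)(value >> shift);
    }
}
```

With 12 numbered primitives, each (size, direction, endianness) combination would hard-code its own variant of this loop, letting the compiler unroll it or collapse it to one instruction.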
Sometimes the number of bytes is only known in a variable, so would it be possible to do 4 primitives which accept the number of bits (or bytes) as an argument? (uint:at: uint:at:put:) * (big endian, little endian)

On Mon, Aug 31, 2015 at 12:25 PM, Eliot Miranda <[hidden email]> wrote:
> [...]
On Mon, Aug 31, 2015 at 11:35 AM, Chris Muller <[hidden email]> wrote:
> [...]

Of course it's possible, but such an architecture can hardly be quick. If one needs the flexible primitives then use them, but don't hobble the system by only providing them. Having a real 64-bit VM means that the use of two 32-bit accesses is unnecessarily slow. Which would you rather, and which do you think would be faster (I don't know, but I have my suspicions): expand the existing flexible integerAt: prims to integerAt:put:bytes:signed:bigEndian: (yuck), or implement this in terms of a wrapper something like

ByteArray>>integerAt: index bytes: numBytes signed: signed bigEndian: bigEndian
	| value sign |
	^numBytes >= 4
		ifTrue:
			[numBytes = 8 ifTrue:
				[value := self unsignedLong64At: index.
				 bigEndian ifTrue: [value := self byteReverseEightBytes: value].
				 (sign := value bitShift: -63) ~= 0 ifTrue: "if the VM is intelligent about left shift of zero then this test is unnecessary..."
					[value := value - ((sign bitAnd: 1) bitShift: 64)].
				 ^value].
			 numBytes = 4 ifTrue:
				[value := self unsignedLong32At: index.
				 bigEndian ifTrue: [value := self byteReverseFourBytes: value].
				 (sign := value bitShift: -31) ~= 0 ifTrue: "if the VM is intelligent about left shift of zero then this test is unnecessary..."
					[value := value - ((sign bitAnd: 1) bitShift: 32)].
				 ^value].
			 ^self error: 'size must be a power of two from 1 to 8']
		ifFalse: ...

_,,,^..^,,,_
best, Eliot
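The sign-extension step in Eliot's wrapper - read the value unsigned, test the top bit, and subtract 2^(8*N) if it is set - is a standard two's-complement trick. A hedged C sketch of that step (the function name is illustrative; in Smalltalk the subtraction needs LargeInteger arithmetic, while in C a wider intermediate type suffices):

```c
#include <stdint.h>

/* Turn an N-byte unsigned load into a signed value by checking the
   top bit and subtracting 2^(8*N). numBytes is assumed to be 1..7
   here; for 8 bytes, C code would just cast to int64_t (or need a
   128-bit intermediate), since 1 << 64 is undefined behavior. */

int64_t sign_extend(uint64_t value, int numBytes)
{
    int bits = 8 * numBytes;
    uint64_t sign = value >> (bits - 1);   /* 0 or 1: the top bit */
    if (sign != 0)
        return (int64_t)(value - ((uint64_t)1 << bits));  /* subtract 2^bits */
    return (int64_t)value;
}
```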
I would ask that someone please measure the real-world performance benefit of adding these (or any other) numbered primitives. Maybe it's a lot, maybe it's not, but when in doubt, leave it out.

Dave

On Mon, Aug 31, 2015 at 10:25:59AM -0700, Eliot Miranda wrote:
> [...]
FWIW... IMO it's better to enable access to the relevant compiler intrinsic with platform-specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating endianness concerns from the large integer arithmetic primitives with such macros enabled 25-40% faster performance on big-endian platforms. Just as importantly, the intrinsic approach takes significantly less code to implement.

On 8/31/15 10:25, Eliot Miranda wrote:
> [...]
Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:

Before:
{'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.

After:
{'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.

As you can see, everything became faster except for smallN access. This is the side effect of optimizing for the average case instead of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance.

I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values having an odd index belong to big-endian access, and even ones to little-endian.

| byteArray inputs random storageBits unsigned |
Smalltalk garbageCollect.
random := Random seed: 36rSqueak.
storageBits := 64.
unsigned := true.
byteArray := ByteArray new: storageBits // 8 * 2.
inputs := Array new: 100000.
(2 to: storageBits * 2 + 1) collect: [ :descriptor |
	"lowest bit describes endianness, the rest the number of bits."
	| limit bigEndian offset |
	bigEndian := descriptor odd.
	limit := 1 << (descriptor >> 1) - 1.
	unsigned
		ifTrue: [ offset := -1 ]
		ifFalse: [ offset := -1 - (limit >> 1) ].
	inputs replace: [ :each | (random nextInt: limit) + offset ].
	[ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
		1 to: inputs size do: [ :inputIndex |
			byteArray
				unsignedLong64At: startIndex
				put: (inputs at: inputIndex)
				bigEndian: bigEndian;
				unsignedLong64At: startIndex
				bigEndian: bigEndian ] ] ] timeToRun ].

I ran it with various accessors and got the following results:

"short"
#(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
#(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".

"long"
#(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125".
#(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".

"unsigned short"
#(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
#(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".

"unsigned long"
#(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625".
#(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".

"unsigned long 64"
#(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375".
#(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat 259.359375"

Levente

On Sun, 30 Aug 2015, Chris Muller wrote:
> [...]
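To make the benchmark's descriptor encoding concrete (lowest bit selects endianness, the remaining bits give the value width), here is a small C sketch of the decoding. The struct and function names are illustrative, not part of the benchmark:

```c
#include <stdint.h>

/* Decoding the benchmark descriptor: odd descriptors mean big-endian
   access, and descriptor >> 1 is the width in bits; limit is the
   largest unsigned value representable in that many bits. */
typedef struct { int bigEndian; int bits; uint64_t limit; } Descriptor;

Descriptor decode(int descriptor)
{
    Descriptor d;
    d.bigEndian = descriptor & 1;   /* "descriptor odd" in the Smalltalk code */
    d.bits = descriptor >> 1;
    d.limit = (d.bits >= 64) ? UINT64_MAX : (((uint64_t)1 << d.bits) - 1);
    return d;
}
```

So the loop over 2 to: 129 covers widths 1..64 bits, each in both byte orders, which matches the 128-element result arrays.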
Levente,

Interesting. I have a question and a concern about your implementation, though:

Question: why, in the micro checks, is the Write faster than the Access:

{'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.

yet in your more thorough benchmark, the Write is twice as slow as the Access?

"unsigned long 64"
(put, or Write) "average asFloat 536.109375".
(Access) "average asFloat 259.359375"

Any ideas?

The concern is that your code is nicely optimized for our current 32-bit VM - but once we go to 64-bit, I think it will fail. Should we be concerned?

-cbc

On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <[hidden email]> wrote:
> [...]
Hi Chris,
I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit. Levente On Tue, 8 Sep 2015, Chris Cunningham wrote: > Levente, > Interesting. I have a question and a concern about your implementation, though: > > Question: why in the micro checks, is the Write faster than the Access: > {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. > 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. > 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. > 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. > yet in your more thorough benchmark, the Write twice as slow as the Access? > "unsigned long 64" > (put, or Write) "average asFloat 536.109375". > (Access) "average asFloat 259.359375" > any ideas? > > the concern is that your code is nicely optimized for our current 32bit vm - but once we go to 64 bit, I think it will fail. Should we be > concerned? > > -cbc > > On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <[hidden email]> wrote: > Hi All, > > A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the > "fastest implementation". > I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following: > > Before: > {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. > 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. > 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. > 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}. > > After: > {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 
> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. > 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. > 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. > > As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead > of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance. > > I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given > number of bits. The result is an array of run times where values having and odd index belong to big-endian access, and even ones to > little-endian. > > | byteArray inputs random storageBits unsigned | > Smalltalk garbageCollect. > random := Random seed: 36rSqueak. > storageBits := 64. > unsigned := true. > byteArray := ByteArray new: storageBits // 8 * 2. > inputs := Array new: 100000. > (2 to: storageBits * 2 + 1) collect: [ :descriptor | > "lowest bit describes endianness, the rest the number of bits." > | limit bigEndian offset | > bigEndian := descriptor odd. > limit := 1 << (descriptor >> 1) - 1. > unsigned > ifTrue: [ offset := -1 ] > ifFalse: [ offset := -1- (limit >> 1) ]. > inputs replace: [ :each | (random nextInt: limit) + offset ]. > [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex | > 1 to: inputs size do: [ :inputIndex | > byteArray > unsignedLong64At: startIndex > put: (inputs at: inputIndex) > bigEndian: bigEndian; > unsignedLong64At: startIndex > bigEndian: bigEndian ] ] ] timeToRun ]. > > I ran it with various accessors and got the following results: > > "short" > #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". > #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5". 
> > "long" > #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 > 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125". > #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 > 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625". > > "unsigned short" > #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5". > #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125". > > "unsigned long" > #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 > 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625". > #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 > 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375". > > "unsigned long 64" > #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 > 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 > 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 > 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat > 536.109375". 
> #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 > 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 > 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 > 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat > 259.359375" > > > Levente > > On Sun, 30 Aug 2015, Chris Muller wrote: > > Hi Chris, I think these methods belong in the image with the fastest > implementation we can do. > > I implemented 64-bit unsigned access for Ma Serializer back in 2005. > I modeled my implementation after Andreas' original approach which > tries to avoid LI arithmetic. I was curious whether your > implementations would be faster, because if they are then it could > benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a > trunk image, I used the following script to take comparison > measurements: > > | smallN largeN maBa cbBa | smallN := ((2 raisedTo: 13) to: (2 > raisedTo: 14)) atRandom. > largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom. > maBa := ByteArray new: 8. > cbBa := ByteArray new: 8. > maBa maUint: 64 at: 0 put: largeN. > cbBa unsignedLong64At: 1 put: largeN bigEndian: false. > self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 > bigEndian: false). > { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN > bigEndian: false] bench. > 'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench. > 'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench. > 'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench. > 'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN > bigEndian: false] bench. > 'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench. 
> 'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench. > 'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench. > } > > Here are the results: > > 'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' . > 'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' . > 'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' . > 'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' . > 'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' . > 'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' . > 'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' . > 'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.' > > It looks like your 64-bit access is 86% faster for accessing the > high-end of the 64-bit range, but slower in the other 3 metrics. > Noticeably, it was only 14% as fast for writing the high-end of the > 64-bit range, and similarly as much slower for small-number access. > > On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham > <[hidden email]> wrote: > [...] |
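For readers outside Squeak, the semantics of the accessors being benchmarked can be sketched in Python. This is a hedged stand-in, not the trunk code; the function names mirror the Smalltalk selectors, and the index is 1-based as in the `unsignedLong64At:put:bigEndian:` and `unsignedLong64At:bigEndian:` methods:

```python
def unsigned_long64_at_put(ba, index, value, big_endian):
    # Analogue of ByteArray>>unsignedLong64At:put:bigEndian: (1-based index):
    # store `value` as 8 bytes in the requested byte order.
    ba[index - 1:index + 7] = value.to_bytes(8, 'big' if big_endian else 'little')
    return value

def unsigned_long64_at(ba, index, big_endian):
    # Analogue of ByteArray>>unsignedLong64At:bigEndian: (1-based index):
    # read 8 bytes back as an unsigned integer.
    return int.from_bytes(ba[index - 1:index + 7], 'big' if big_endian else 'little')

ba = bytearray(8)
large_n = 2 ** 63 + 4242                # high end of the 64-bit range
unsigned_long64_at_put(ba, 1, large_n, False)
assert unsigned_long64_at(ba, 1, False) == large_n
```

The "largeN" cases in the benchmarks are exactly values like `large_n` above, which no longer fit in a SmallInteger on a 32-bit VM, so reading them back forces a LargeInteger allocation.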
In reply to this post by cbc
I forgot to answer your other questions.
On Tue, 8 Sep 2015, Chris Cunningham wrote: > Levente, > Interesting. I have a question and a concern about your implementation, though: > > Question: why in the micro checks, is the Write faster than the Access: Because Access means allocation of new objects (LargeIntegers), while Write simply means copying bytes. > {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. > 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. > 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. > 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. > yet in your more thorough benchmark, is the Write twice as slow as the Access? For each pair of lines the first one was measured _before_ my changes, the second one was measured _after_ them. The benchmark measures read and write together, so 536.109375 stands for the average number of milliseconds to read and write nine million numbers using your variant of the 64-bit methods on my machine. > "unsigned long 64" > (put, or Write) "average asFloat 536.109375". > (Access) "average asFloat 259.359375" > any ideas? So this is (read and write Before) "average asFloat 536.109375" (read and write After) "average asFloat 259.359375" Levente > > the concern is that your code is nicely optimized for our current 32bit vm - but once we go to 64 bit, I think it will fail. Should we be > concerned? > > -cbc > > On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <[hidden email]> wrote: > Hi All, > > A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". > [...] |
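The thread repeatedly refers to Andreas' approach of avoiding LargeInteger ("LI") arithmetic, which Ma Serializer also follows. A sketch of the general technique — an assumption about the idea, not Ma Serializer's or the trunk's actual code — is to split the 64-bit value into two 32-bit halves, so that on a 32-bit VM most intermediate results stay in immediate-integer range:

```python
def put_uint64_in_word_halves(ba, index, value):
    # Split into two 32-bit halves; only the initial mask/shift touches the
    # full 64-bit value, every later intermediate fits in 32 bits.
    low = value & 0xFFFFFFFF
    high = value >> 32
    for i in range(4):
        ba[index - 1 + i] = (low >> (8 * i)) & 0xFF   # bytes 0..3, little-endian
        ba[index + 3 + i] = (high >> (8 * i)) & 0xFF  # bytes 4..7

ba = bytearray(8)
n = (2 ** 64) - 1 - 987654321
put_uint64_in_word_halves(ba, 1, n)
assert int.from_bytes(ba, 'little') == n
```

In Python the halving makes no performance difference, but it illustrates why the "ma" variant can write largeN values quickly: the expensive big-integer operations are confined to the initial split.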
In reply to this post by Levente Uzonyi-2
(Ha! What an amazing thread!) Levente, why am I not surprised that
you could manage to squeeze yet even more efficiency out of it. If there were an eXtreme sport for programming, you'd win the gold medal... :) I've got to try this on the MagmaBenchmarker right now, I'll let you know in a bit. On Tue, Sep 8, 2015 at 4:42 AM, Levente Uzonyi <[hidden email]> wrote: > Hi All, > > A bit later than I wanted to, but I've finally uploaded my versions to the > Trunk. I guess I went as far as possible with getting the "fastest > implementation". > I modified your benchmark to use the same numbers, so that the measurements > could be repeated. I got the following: > [...] |
In reply to this post by Levente Uzonyi-2
Ok, it makes sense now. Well, it made sense before, I was just slow to see it. -cbc On Tue, Sep 8, 2015 at 9:25 AM, Levente Uzonyi <[hidden email]> wrote: I forgot to answer your other questions. |
In reply to this post by Levente Uzonyi-2
After thinking a bit more about this, I came to the conclusion that while
I paid attention to handle reader methods in 64-bit images correctly, some of the writer methods are incorrect for "large" SmallInteger inputs. First I implemented the missing parts for 61-bit SmallIntegers using a general purpose loop, but it turned out that the loop is significantly faster for smaller numbers, and only slightly slower for larger ones, so I decided to nuke the special case for 30-bit SmallIntegers. I also found further possibilities to optimize writes, so the average is down from 259 to 236 milliseconds. Chris Muller's benchmark gives { 'cbc smallN write'->'12,200,000 per second. 81.7 nanoseconds per run.' . 'cbc smallN access'->'10,300,000 per second. 96.8 nanoseconds per run.' . 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.' . 'cbc largeN access'->'5,270,000 per second. 190 nanoseconds per run.'} Which means +20% speed for smallN writes, and +18% for largeN access. Levente P.S.: I still couldn't test the code in my 64-bit Spur image, because it won't respond to any input after startup. On Tue, 8 Sep 2015, Levente Uzonyi wrote: > Hi Chris, > > I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit. > > Levente > > On Tue, 8 Sep 2015, Chris Cunningham wrote: > >> [...] |
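The "general purpose loop" Levente mentions for 61-bit SmallIntegers can be pictured as a plain byte-at-a-time store. This is a sketch of the shape of such a loop, not the trunk method; `store_unsigned` and its parameters are hypothetical names:

```python
def store_unsigned(ba, index, value, n_bytes, big_endian):
    # One byte per iteration; each shift/mask only narrows the value, which
    # is why such a loop stays cheap for small inputs and only modestly
    # slower for large ones.
    for i in range(n_bytes):
        shift = 8 * (n_bytes - 1 - i if big_endian else i)
        ba[index - 1 + i] = (value >> shift) & 0xFF

ba = bytearray(8)
store_unsigned(ba, 1, 0x0102030405060708, 8, True)
assert bytes(ba) == bytes([1, 2, 3, 4, 5, 6, 7, 8])
```

A loop like this handles any width uniformly, which is what makes the separate fast path for 30-bit SmallIntegers unnecessary once the loop itself is fast enough.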
Hi Levente,
Sent from my iPhone

> On Sep 9, 2015, at 2:11 PM, Levente Uzonyi <[hidden email]> wrote:
>
> After thinking a bit more about this, I came to the conclusion that
> while I paid attention to handle reader methods in 64-bit images
> correctly, some of the writer methods are incorrect for "large"
> SmallInteger inputs.
> First I implemented the missing parts for 61-bit SmallIntegers using
> a general-purpose loop, but it turned out that the loop is
> significantly faster for smaller numbers, and only slightly slower
> for larger ones, so I decided to nuke the special case for 30-bit
> SmallIntegers.
> I also found further possibilities to optimize writes, so the average
> is down from 259 to 236 milliseconds. Chris Muller's benchmark gives
>
> { 'cbc smallN write'->'12,200,000 per second. 81.7 nanoseconds per run.' .
> 'cbc smallN access'->'10,300,000 per second. 96.8 nanoseconds per run.' .
> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.' .
> 'cbc largeN access'->'5,270,000 per second. 190 nanoseconds per run.' }
>
> which means +20% speed for smallN writes, and +18% for largeN access.
>
> Levente
>
> P.S.: I still couldn't test the code in my 64-bit Spur image, because
> it won't respond to any input after startup.

IIRC that's due to a bug in earlier versions of the bootstrap in
converting negative integers, or perhaps larger positive SmallIntegers.
Anyway, I *think* that the latest Spur 64-bit VM and image combination
works properly now. Apologies.

>> On Tue, 8 Sep 2015, Levente Uzonyi wrote:
>>
>> Hi Chris,
>>
>> I added the #normalize sends to avoid the creation of spurious
>> LargeIntegers in 64-bit images (there are two places where I relied
>> on #-'s ability to work on unnormalized input). I didn't have a
>> chance to test it, but I expect it to work correctly. Even if the
>> code is sub-optimal in 64-bit, it shouldn't be any slower than in
>> 32-bit.
>>
>> Levente
>>
>>> On Tue, 8 Sep 2015, Chris Cunningham wrote:
>>>
>>> Levente,
>>> Interesting.
>>> I have a question and a concern about your implementation, though.
>>> Question: why, in the micro checks, is the Write faster than the Access:
>>> {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>>> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>>> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>>> 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>>> yet in your more thorough benchmark, the Write is twice as slow as the Access?
>>> "unsigned long 64"
>>> (put, or Write) "average asFloat 536.109375".
>>> (Access) "average asFloat 259.359375"
>>> Any ideas?
>>> The concern is that your code is nicely optimized for our current
>>> 32-bit VM - but once we go to 64-bit, I think it will fail. Should
>>> we be concerned?
>>> -cbc
>>>
>>> On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <[hidden email]> wrote:
>>>
>>> Hi All,
>>>
>>> A bit later than I wanted to, but I've finally uploaded my versions
>>> to the Trunk. I guess I went as far as possible with getting the
>>> "fastest implementation".
>>> I modified your benchmark to use the same numbers, so that the
>>> measurements could be repeated. I got the following:
>>>
>>> Before:
>>> {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
>>> 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
>>> 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
>>> 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
>>>
>>> After:
>>> {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>>> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>>> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>>> 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>>>
>>> As you can see, everything became faster except for smallN access.
>>> This is the side effect of optimizing for the average case instead
>>> of specific cases - like zero bytes. I decided not to use that
>>> trick, because it decreased the overall performance.
>>>
>>> I also wrote a benchmark which measures reads and writes together.
>>> It generates random numbers which can be represented using a given
>>> number of bits. The result is an array of run times where values
>>> having an odd index belong to big-endian access, and even ones to
>>> little-endian.
>>>
>>> | byteArray inputs random storageBits unsigned |
>>> Smalltalk garbageCollect.
>>> random := Random seed: 36rSqueak.
>>> storageBits := 64.
>>> unsigned := true.
>>> byteArray := ByteArray new: storageBits // 8 * 2.
>>> inputs := Array new: 100000.
>>> (2 to: storageBits * 2 + 1) collect: [ :descriptor |
>>>     "lowest bit describes endianness, the rest the number of bits."
>>>     | limit bigEndian offset |
>>>     bigEndian := descriptor odd.
>>>     limit := 1 << (descriptor >> 1) - 1.
>>>     unsigned
>>>         ifTrue: [ offset := -1 ]
>>>         ifFalse: [ offset := -1 - (limit >> 1) ].
>>>     inputs replace: [ :each | (random nextInt: limit) + offset ].
>>>     [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
>>>         1 to: inputs size do: [ :inputIndex |
>>>             byteArray
>>>                 unsignedLong64At: startIndex
>>>                 put: (inputs at: inputIndex)
>>>                 bigEndian: bigEndian;
>>>                 unsignedLong64At: startIndex
>>>                 bigEndian: bigEndian ] ] ] timeToRun ].
>>>
>>> I ran it with various accessors and got the following results:
>>>
>>> "short"
>>> #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
>>> #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
>>>
>>> "long"
>>> #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125".
>>> #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".
>>>
>>> "unsigned short"
>>> #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
>>> #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
>>>
>>> "unsigned long"
>>> #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625".
>>> #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
>>>
>>> "unsigned long 64"
>>> #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375".
>>> #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396) "average asFloat 259.359375".
>>>
>>> Levente
|
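A note on the recurring "avoid LargeInteger arithmetic" theme in this thread: on a 32-bit Squeak image, only integers of about 30 bits are immediate SmallIntegers, so shifting and masking a full 64-bit value forces LargePositiveInteger allocation at every step. The general idea behind the optimized accessors can be sketched as splitting the value into halves so each intermediate stays within SmallInteger range. This Python sketch (hypothetical helper name, not the actual Trunk code) only illustrates the split; Python ints are arbitrary-precision, so the benefit is specific to Smalltalk's tagged-integer representation.

```python
# Sketch of writing a 64-bit value as two 32-bit halves (little-endian),
# so that on a 32-bit Smalltalk image the shifts and masks operate on
# values small enough to stay cheap. Hypothetical helper, for illustration.

def put_uint64_le_via_halves(ba, index, value):
    low = value & 0xFFFFFFFF            # low 32 bits
    high = (value >> 32) & 0xFFFFFFFF   # high 32 bits
    for i in range(4):
        ba[index + i] = (low >> (8 * i)) & 0xFF       # bytes 0..3
        ba[index + 4 + i] = (high >> (8 * i)) & 0xFF  # bytes 4..7

ba = bytearray(8)
n = 0x1122334455667788
put_uint64_le_via_halves(ba, 0, n)
assert ba == bytearray(n.to_bytes(8, 'little'))
```

Levente's #normalize remark fits the same picture: on a 64-bit image the halves themselves can be unnormalized LargeInteger-shaped objects, which is why the reader/writer paths need normalization to stay correct there.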