Re: Primitive replaceFrom:to:with:startingAt: in the JIT

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

Levente Uzonyi
 
Hi Clémont,

I finally found the time to write some benchmarks.
I compared the output of the script below on sqcogspur64linuxht vm
201710061559 and 201712221331 avaiable on bintray.

result := { ByteArray. DoubleByteArray. WordArray. DoubleWordArray. ByteString. WideString. FloatArray. Array } collect: [ :class |
  | collection |
  Smalltalk garbageCollect.
  collection := class basicNew: 10000.
  class -> (#(0 1 2 5 10 20 50 100 200 500 1000 2000 5000 10000) collect: [ :size |
  | iterations time overhead |
  iterations := (40000000 // (size max: 1) sqrt) floor.
  overhead := [ 1 to: iterations do: [ :i | ] ] timeToRun.
  time := [ 1 to: iterations do: [ :i |
  collection replaceFrom: 1 to: size with: collection startingAt: 1 ] ] timeToRun.
  { size. iterations. time - overhead } ]) ].

I found that the quick paths are probably only implented for bytes and
pointers collections, because there was no significant difference for
DoubleByteArray, WordArray, DoubleWordArray, WideString and FloatArray.

For pointers and bytes collections, there's significant speedup when the
copied portion is small. However, somewhere between 50 and 100 copied
elements, the copying of bytes collections becomes slower (up to 1.5x @
100k elements) with the newer VM.
It's interesting that this doesn't happen to pointers classes. Instead of
slowdown there's still 1.5x speedup even at 100k elements.

Levente

On Mon, 23 Oct 2017, Clément Bera wrote:

> Hi all,
> For a long time I was willing to add primitive #replaceFrom:to:with:startingAt: in the JIT but did not take time to do it. These days I am showing the JIT to one of my students and as an example of how one would write code in the JIT we implemented this primitive
> together, Spur-only. This is part of commit 2273.
>
> I implemented quick paths for byte objects and array-like objects only. The rationale behind this is that the most common cases I see in Pharo user benchmarks in the profiler is copy of arrays and byteStrings. Typically some application benchmarks would show 3-5% of
> time spent in copying small things, and switching from the JIT runtime to C runtime is an important part of the cost.
>
> First evaluation shows the following speed-ups, but I've just done that quickly in my machine:
>
> Copy of size 0
>     Array 2.85x
>     ByteString 2.7x
> Copy of size 1
>     Array 2.1x
>     ByteString 2x
> Copy of size 3
>     Array 2x
>     ByteString 1.9x
> Copy of size 8
>     Array 1.8x
>     ByteString 1.8x
> Copy of size 64
>    Array 1.1x
>    ByteString 1.1x
> Copy of size 1000
>    Array 1x
>    ByteString 1x
>
> So I would expect some macro benchmarks to get 1 to 3% percent speed-up. Not as much as I expected but it's there.
>
> Can someone who is good at benchmarks such as Levente have a look and provide us with a better evaluation of the performance difference ?
>
> Thanks.
>
> --
> Clément BéraPharo consortium engineer
> https://clementbera.wordpress.com/
> Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

Clément Béra
 

Hi,

Indeed, I implemented the quick paths only for byte objects and pointer objects. The reason for this is that I based the implementation on customer benchmarks, and there was a lot of replacements for pointer objects (typically, growing the inner array of OrderedCollection/Dictionaries) and byte objects (typically bytestrings concatenation, such as 'a' , 'b' which calls twice this primitive for only a couple characters and is now much faster).

I wrote the quick path with short copies mind, in which case switching to the C runtime is too expensive. Based on your analysis, for pointer objects it looks good, but for byte objects it gets slower.

So I see two solutions for the slow down on byte copies:
1) over a threshold of 1000 elements, fall back to the slang primitive for byte objects
2) Generate smarter copying machine code (for example with some kind of duff device / unrolled loop / overlapping copies with 1 copy non aligned). If possible nothing requiring to extend the RTL. Note that memcopy in C is written as a per back-end files which is thousands of lines on x86 and we don't want to go anywhere near this complexity, in our case the size of generate machine code also matters and we want to keep it small. We could use pointer size read / writes on byte objects too if we deal with aliasing (if array == replArray just fall back to a slower path).

Let's look at the naive machine code in Cog's RTL (I could show the actual assembly version if you like it better), in each case the loop does:
- compare and jumpBelow to see if the copy is finished
- read one field from repl
- write the read field in the array
- increment the 2 registers holding indexes for read/write
- jump back. 

The copying machine code for byte object is as follow:

                       instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXbr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xbr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.

The copying code for pointer object is as follow:
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.

Any idea what we could do smarter, but not too complex, without extending the RTL and without generating let's say more than 20 extra instructions ?

I think checking for aliasing and using pointer size read/write may be the way to go. Then if performance is not better we just fall back to slang code.

On Mon, Dec 25, 2017 at 10:57 PM, Levente Uzonyi <[hidden email]> wrote:
Hi Clémont,

I finally found the time to write some benchmarks.
I compared the output of the script below on sqcogspur64linuxht vm 201710061559 and 201712221331 avaiable on bintray.

result := { ByteArray. DoubleByteArray. WordArray. DoubleWordArray. ByteString. WideString. FloatArray. Array } collect: [ :class |
        | collection |
        Smalltalk garbageCollect.
        collection := class basicNew: 10000.
        class -> (#(0 1 2 5 10 20 50 100 200 500 1000 2000 5000 10000) collect: [ :size |
                | iterations time overhead |
                iterations := (40000000 // (size max: 1) sqrt) floor.
                overhead := [ 1 to: iterations do: [ :i | ] ] timeToRun.
                time := [ 1 to: iterations do: [ :i |
                        collection replaceFrom: 1 to: size with: collection startingAt: 1 ] ] timeToRun.
                { size. iterations. time - overhead } ]) ].

I found that the quick paths are probably only implented for bytes and pointers collections, because there was no significant difference for DoubleByteArray, WordArray, DoubleWordArray, WideString and FloatArray. 

For pointers and bytes collections, there's significant speedup when the copied portion is small. However, somewhere between 50 and 100 copied elements, the copying of bytes collections becomes slower (up to 1.5x @ 100k elements) with the newer VM.
It's interesting that this doesn't happen to pointers classes. Instead of slowdown there's still 1.5x speedup even at 100k elements.

Levente


On Mon, 23 Oct 2017, Clément Bera wrote:

Hi all,
For a long time I was willing to add primitive #replaceFrom:to:with:startingAt: in the JIT but did not take time to do it. These days I am showing the JIT to one of my students and as an example of how one would write code in the JIT we implemented this primitive
together, Spur-only. This is part of commit 2273.

I implemented quick paths for byte objects and array-like objects only. The rationale behind this is that the most common cases I see in Pharo user benchmarks in the profiler is copy of arrays and byteStrings. Typically some application benchmarks would show 3-5% of
time spent in copying small things, and switching from the JIT runtime to C runtime is an important part of the cost.

First evaluation shows the following speed-ups, but I've just done that quickly in my machine:

Copy of size 0
    Array 2.85x
    ByteString 2.7x
Copy of size 1
    Array 2.1x
    ByteString 2x
Copy of size 3
    Array 2x
    ByteString 1.9x
Copy of size 8
    Array 1.8x
    ByteString 1.8x
Copy of size 64
   Array 1.1x
   ByteString 1.1x
Copy of size 1000
   Array 1x
   ByteString 1x

So I would expect some macro benchmarks to get 1 to 3% percent speed-up. Not as much as I expected but it's there.

Can someone who is good at benchmarks such as Levente have a look and provide us with a better evaluation of the performance difference ?

Thanks.

--
Clément Béra
https://clementbera.wordpress.com/
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq




--
Clément Béra
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq



Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

Eliot Miranda-2
 
Hi Clément,


On Dec 27, 2017, at 12:14 PM, Clément Bera <[hidden email]> wrote:


Hi,

Indeed, I implemented the quick paths only for byte objects and pointer objects. The reason for this is that I based the implementation on customer benchmarks, and there was a lot of replacements for pointer objects (typically, growing the inner array of OrderedCollection/Dictionaries) and byte objects (typically bytestrings concatenation, such as 'a' , 'b' which calls twice this primitive for only a couple characters and is now much faster).

I wrote the quick path with short copies mind, in which case switching to the C runtime is too expensive. Based on your analysis, for pointer objects it looks good, but for byte objects it gets slower.

So I see two solutions for the slow down on byte copies:
1) over a threshold of 1000 elements, fall back to the slang primitive for byte objects
2) Generate smarter copying machine code (for example with some kind of duff device / unrolled loop / overlapping copies with 1 copy non aligned). If possible nothing requiring to extend the RTL. Note that memcopy in C is written as a per back-end files which is thousands of lines on x86 and we don't want to go anywhere near this complexity, in our case the size of generate machine code also matters and we want to keep it small. We could use pointer size read / writes on byte objects too if we deal with aliasing (if array == replArray just fall back to a slower path).

Let's look at the naive machine code in Cog's RTL (I could show the actual assembly version if you like it better), in each case the loop does:
- compare and jumpBelow to see if the copy is finished
- read one field from repl
- write the read field in the array
- increment the 2 registers holding indexes for read/write
- jump back. 

The copying machine code for byte object is as follow:

                       instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXbr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xbr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.

I think the thing to do is to implement a word copier for byte objects.  It could use up to two registers for the source to assemble a word when the source is not aligned mod a word boundary.  The copying engine would copy bytes up to the start of a word, then copy words, then copy trailing bytes.

Moving 4 or 8 bytes in the inner loop is probably more than good enough to exceed the interpreter primitive's speed, without the complexity of a duff's device dispatch.  Note that fetching the first word of the destination, when there are odd initial bytes is probably a good way of making the initial odd bytes write cheap.  Or fetch the destination, mask out the bytes that are written to, add/or in the source bytes.

Another thing is to use derived pointers that are incremented instead of the Xbr form, which wastes a register better used for buffering bytes into whole word writes.

Isn't it great to finally be at a stage where we can meaningfully have these JIT vs C compiler performance discussions?


The copying code for pointer object is as follow:
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.

Any idea what we could do smarter, but not too complex, without extending the RTL and without generating let's say more than 20 extra instructions ?

I think checking for aliasing and using pointer size read/write may be the way to go. Then if performance is not better we just fall back to slang code.

On Mon, Dec 25, 2017 at 10:57 PM, Levente Uzonyi <[hidden email]> wrote:
Hi Clémont,

I finally found the time to write some benchmarks.
I compared the output of the script below on sqcogspur64linuxht vm 201710061559 and 201712221331 avaiable on bintray.

result := { ByteArray. DoubleByteArray. WordArray. DoubleWordArray. ByteString. WideString. FloatArray. Array } collect: [ :class |
        | collection |
        Smalltalk garbageCollect.
        collection := class basicNew: 10000.
        class -> (#(0 1 2 5 10 20 50 100 200 500 1000 2000 5000 10000) collect: [ :size |
                | iterations time overhead |
                iterations := (40000000 // (size max: 1) sqrt) floor.
                overhead := [ 1 to: iterations do: [ :i | ] ] timeToRun.
                time := [ 1 to: iterations do: [ :i |
                        collection replaceFrom: 1 to: size with: collection startingAt: 1 ] ] timeToRun.
                { size. iterations. time - overhead } ]) ].

I found that the quick paths are probably only implented for bytes and pointers collections, because there was no significant difference for DoubleByteArray, WordArray, DoubleWordArray, WideString and FloatArray. 

For pointers and bytes collections, there's significant speedup when the copied portion is small. However, somewhere between 50 and 100 copied elements, the copying of bytes collections becomes slower (up to 1.5x @ 100k elements) with the newer VM.
It's interesting that this doesn't happen to pointers classes. Instead of slowdown there's still 1.5x speedup even at 100k elements.

Levente


On Mon, 23 Oct 2017, Clément Bera wrote:

Hi all,
For a long time I was willing to add primitive #replaceFrom:to:with:startingAt: in the JIT but did not take time to do it. These days I am showing the JIT to one of my students and as an example of how one would write code in the JIT we implemented this primitive
together, Spur-only. This is part of commit 2273.

I implemented quick paths for byte objects and array-like objects only. The rationale behind this is that the most common cases I see in Pharo user benchmarks in the profiler is copy of arrays and byteStrings. Typically some application benchmarks would show 3-5% of
time spent in copying small things, and switching from the JIT runtime to C runtime is an important part of the cost.

First evaluation shows the following speed-ups, but I've just done that quickly in my machine:

Copy of size 0
    Array 2.85x
    ByteString 2.7x
Copy of size 1
    Array 2.1x
    ByteString 2x
Copy of size 3
    Array 2x
    ByteString 1.9x
Copy of size 8
    Array 1.8x
    ByteString 1.8x
Copy of size 64
   Array 1.1x
   ByteString 1.1x
Copy of size 1000
   Array 1x
   ByteString 1x

So I would expect some macro benchmarks to get 1 to 3% percent speed-up. Not as much as I expected but it's there.

Can someone who is good at benchmarks such as Levente have a look and provide us with a better evaluation of the performance difference ?

Thanks.

--
Clément Béra
https://clementbera.wordpress.com/
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq




--
Clément Béra
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq



Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

Clément Béra
 


On Wed, Dec 27, 2017 at 10:37 PM, Eliot Miranda <[hidden email]> wrote:
 
Hi Clément,


On Dec 27, 2017, at 12:14 PM, Clément Bera <[hidden email]> wrote:


Hi,

Indeed, I implemented the quick paths only for byte objects and pointer objects. The reason for this is that I based the implementation on customer benchmarks, and there was a lot of replacements for pointer objects (typically, growing the inner array of OrderedCollection/Dictionaries) and byte objects (typically bytestrings concatenation, such as 'a' , 'b' which calls twice this primitive for only a couple characters and is now much faster).

I wrote the quick path with short copies mind, in which case switching to the C runtime is too expensive. Based on your analysis, for pointer objects it looks good, but for byte objects it gets slower.

So I see two solutions for the slow down on byte copies:
1) over a threshold of 1000 elements, fall back to the slang primitive for byte objects
2) Generate smarter copying machine code (for example with some kind of duff device / unrolled loop / overlapping copies with 1 copy non aligned). If possible nothing requiring to extend the RTL. Note that memcopy in C is written as a per back-end files which is thousands of lines on x86 and we don't want to go anywhere near this complexity, in our case the size of generate machine code also matters and we want to keep it small. We could use pointer size read / writes on byte objects too if we deal with aliasing (if array == replArray just fall back to a slower path).

Let's look at the naive machine code in Cog's RTL (I could show the actual assembly version if you like it better), in each case the loop does:
- compare and jumpBelow to see if the copy is finished
- read one field from repl
- write the read field in the array
- increment the 2 registers holding indexes for read/write
- jump back. 

The copying machine code for byte object is as follow:

                       instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXbr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xbr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.

I think the thing to do is to implement a word copier for byte objects.  It could use up to two registers for the source to assemble a word when the source is not aligned mod a word boundary.  The copying engine would copy bytes up to the start of a word, then copy words, then copy trailing bytes.

Moving 4 or 8 bytes in the inner loop is probably more than good enough to exceed the interpreter primitive's speed, without the complexity of a duff's device dispatch.  Note that fetching the first word of the destination, when there are odd initial bytes is probably a good way of making the initial odd bytes write cheap.  Or fetch the destination, mask out the bytes that are written to, add/or in the source bytes.

Yes that's the kind of thing what I was thinking of. In sista image side I do it for byte equality in some cases and it speeds things up significantly (though I do it only when everything is aligned like I want, which is the common case). 

Assembling a word when reads are not aligned with the writes seemed tedious - not sure if that's the best time invested/speed-up ratio. One could experiment.


Another thing is to use derived pointers that are incremented instead of the Xbr form, which wastes a register better used for buffering bytes into whole word writes.

Yeah well I thought about it, just make sure to restore the register to return correctly with the array. I've felt it was not worth it - we don't need more registers at this point and the processor already internally optimise those tights loops. I made the simplest thing that could possibly work. The primitive code is quite complex already, mostly due to all the range/type checks, but still...

We could experiment a bit and evaluate performance but I don't want to waste too much time. I watched at google a couple weeks ago an hour talk on building fast memcopy, the result of 4 people working for weeks if not months, and I don't want to go there - their implementation has like 20 different ways of copying things depending on the number of bytes to copy. I am not going to implement 20 versions and check they all work and that they are all fast.

Maybe just falling back to C code if there are more than 1000 elements to copy is good enough... At least on the short term...

Of course one alternative is extend the RTL... to use the REP prefix on intel and bulk memory moves on ARM... I don't want to go there either but someone could.


Isn't it great to finally be at a stage where we can meaningfully have these JIT vs C compiler performance discussions?


Yeah.

So many things I would like to do. 
 

The copying code for pointer object is as follow:
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.

Any idea what we could do smarter, but not too complex, without extending the RTL and without generating let's say more than 20 extra instructions ?

I think checking for aliasing and using pointer size read/write may be the way to go. Then if performance is not better we just fall back to slang code.

On Mon, Dec 25, 2017 at 10:57 PM, Levente Uzonyi <[hidden email]> wrote:
Hi Clémont,

I finally found the time to write some benchmarks.
I compared the output of the script below on sqcogspur64linuxht vm 201710061559 and 201712221331 avaiable on bintray.

result := { ByteArray. DoubleByteArray. WordArray. DoubleWordArray. ByteString. WideString. FloatArray. Array } collect: [ :class |
        | collection |
        Smalltalk garbageCollect.
        collection := class basicNew: 10000.
        class -> (#(0 1 2 5 10 20 50 100 200 500 1000 2000 5000 10000) collect: [ :size |
                | iterations time overhead |
                iterations := (40000000 // (size max: 1) sqrt) floor.
                overhead := [ 1 to: iterations do: [ :i | ] ] timeToRun.
                time := [ 1 to: iterations do: [ :i |
                        collection replaceFrom: 1 to: size with: collection startingAt: 1 ] ] timeToRun.
                { size. iterations. time - overhead } ]) ].

I found that the quick paths are probably only implented for bytes and pointers collections, because there was no significant difference for DoubleByteArray, WordArray, DoubleWordArray, WideString and FloatArray. 

For pointers and bytes collections, there's significant speedup when the copied portion is small. However, somewhere between 50 and 100 copied elements, the copying of bytes collections becomes slower (up to 1.5x @ 100k elements) with the newer VM.
It's interesting that this doesn't happen to pointers classes. Instead of slowdown there's still 1.5x speedup even at 100k elements.

Levente


On Mon, 23 Oct 2017, Clément Bera wrote:

Hi all,
For a long time I was willing to add primitive #replaceFrom:to:with:startingAt: in the JIT but did not take time to do it. These days I am showing the JIT to one of my students and as an example of how one would write code in the JIT we implemented this primitive
together, Spur-only. This is part of commit 2273.

I implemented quick paths for byte objects and array-like objects only. The rationale behind this is that the most common cases I see in Pharo user benchmarks in the profiler is copy of arrays and byteStrings. Typically some application benchmarks would show 3-5% of
time spent in copying small things, and switching from the JIT runtime to C runtime is an important part of the cost.

First evaluation shows the following speed-ups, but I've just done that quickly in my machine:

Copy of size 0
    Array 2.85x
    ByteString 2.7x
Copy of size 1
    Array 2.1x
    ByteString 2x
Copy of size 3
    Array 2x
    ByteString 1.9x
Copy of size 8
    Array 1.8x
    ByteString 1.8x
Copy of size 64
   Array 1.1x
   ByteString 1.1x
Copy of size 1000
   Array 1x
   ByteString 1x

So I would expect some macro benchmarks to get 1 to 3% percent speed-up. Not as much as I expected but it's there.

Can someone who is good at benchmarks such as Levente have a look and provide us with a better evaluation of the performance difference ?

Thanks.

--
Clément Béra
https://clementbera.wordpress.com/
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq




--
Clément Béra
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq







--
Clément Béra
Pharo consortium engineer
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

timrowledge
 

> On 27-12-2017, at 2:56 PM, Clément Bera <[hidden email]> wrote:
>
>
> Assembling a word when reads are not aligned with the writes seemed tedious - not sure if that's the best time invested/speed-up ratio. One could experiment.
>
It’s just an  8bpp bitblt on a single line. We kinda-sorta have code for that lying around.

How about handling any simple cases you can come up with and failing to the (usually) intrinsic memcpy() ? Take advantage of those 4 people’s months of work!

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
All programmers are playwrights and all computers are lousy actors.


Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

Clément Béra
 


On Thu, Dec 28, 2017 at 12:03 AM, tim Rowledge <[hidden email]> wrote:


> On 27-12-2017, at 2:56 PM, Clément Bera <[hidden email]> wrote:
>
>
> Assembling a word when reads are not aligned with the writes seemed tedious - not sure if that's the best time invested/speed-up ratio. One could experiment.
>
It’s just an  8bpp bitblt on a single line. We kinda-sorta have code for that lying around.

How about handling any simple cases you can come up with and failing to the (usually) intrinsic memcpy() ? Take advantage of those 4 people’s months of work!

We can't do that. Unfortunately our primitives allow things that memcopy does not such as:

array := ByteArray new: 1000.
array at: 1 put: 255.
array replaceFrom: 2 to: 500 with: array startingAt: 1.
array at: 501 put: 42.
array replaceFrom: 502 to: 1000 with: array startingAt: 501.

Or we detect aliasing and fall back to C code if there is aliasing ...

...
 

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
All programmers are playwrights and all computers are lousy actors.





--
Clément Béra
Pharo consortium engineer
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

Nicolas Cellier
 


2017-12-28 9:01 GMT+01:00 Clément Bera <[hidden email]>:
 


On Thu, Dec 28, 2017 at 12:03 AM, tim Rowledge <[hidden email]> wrote:


> On 27-12-2017, at 2:56 PM, Clément Bera <[hidden email]> wrote:
>
>
> Assembling a word when reads are not aligned with the writes seemed tedious - not sure if that's the best time invested/speed-up ratio. One could experiment.
>
It’s just an  8bpp bitblt on a single line. We kinda-sorta have code for that lying around.

How about handling any simple cases you can come up with and failing to the (usually) intrinsic memcpy() ? Take advantage of those 4 people’s months of work!

We can't do that. Unfortunately our primitives allow things that memcopy does not such as:

array := ByteArray new: 1000.
array at: 1 put: 255.
array replaceFrom: 2 to: 500 with: array startingAt: 1.
array at: 501 put: 42.
array replaceFrom: 502 to: 1000 with: array startingAt: 501.

Or we detect aliasing and fall back to C code if there is aliasing ...

...
 

I don't much like the current implementation in case of aliasing...

But note that this mechanism is effectively used by (g)zip implementation in Squeak/Pharo.
Using memcpy would be relying on undefined behavior.
Using memmove would lead to a different result.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
All programmers are playwrights and all computers are lousy actors.





--
Clément Béra
Pharo consortium engineer
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq


Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

timrowledge
In reply to this post by Clément Béra
 

> On 28-12-2017, at 12:01 AM, Clément Bera <[hidden email]> wrote:
>
> We can't do that. Unfortunately our primitives allow things that memcopy does not such as:

Fair point. It’s even more like bitblt than I initially thought.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Performance is easier to add than clarity.


Reply | Threaded
Open this post in threaded view
|

Re: Primitive replaceFrom:to:with:startingAt: in the JIT

Clément Béra
In reply to this post by Levente Uzonyi
 
Hi Levente, 

VMMaker.oscog-cb.2323 tries to solve the slow-down problem you mentioned here by decreasing the number of instructions in the copying loops from 7 to 5. 

On my machine, it seems at 10k elements, copying time improved by a factor of 1.25.

Copying loop becomes (replReg is adjusted by a diff ahead of the loop)

(pointer):
instr := cogit MoveXwr: startReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit CmpR: startReg R: stopReg.
cogit JumpAboveOrEqual: instr.


(bytes):
instr := cogit MoveXbr: startReg R: replReg R: TempReg.
cogit MoveR: TempReg Xbr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit CmpR: startReg R: stopReg.
cogit JumpAboveOrEqual: instr.

Since replReg unlike arrayReg is unused afterwards, we could cheat more to free one register using replReg as the counter use the fixed index read instr with index 0, but I am too lazy to do it. I wanted to do this since incrementing 2 counters felt a bit cumbersome and I wanted to try out the tight loop scheme they use in V8. That's what mostly pays off (removing an unconditional jump).

I guess there is still some slow down when copying large byte object - if some-one could compute the threshold at which point the C code is faster and what is the performance difference factor, tell me, I may change the jitted code so it falls back to the C primitive over this threshold.

Hopefully next month we'll discuss the #compareTo:[collated:] primitive for data objects (in [ ] optional parameters, default being ASCII order)... Much to discuss since we can use pointer comparison for byte objects, even for the last word since unused bytes are zeroed, of course if we deal with some narrow cases...

Best,

On Mon, Dec 25, 2017 at 10:57 PM, Levente Uzonyi <[hidden email]> wrote:
Hi Clémont,

I finally found the time to write some benchmarks.
I compared the output of the script below on sqcogspur64linuxht vm 201710061559 and 201712221331 avaiable on bintray.

result := { ByteArray. DoubleByteArray. WordArray. DoubleWordArray. ByteString. WideString. FloatArray. Array } collect: [ :class |
        | collection |
        Smalltalk garbageCollect.
        collection := class basicNew: 10000.
        class -> (#(0 1 2 5 10 20 50 100 200 500 1000 2000 5000 10000) collect: [ :size |
                | iterations time overhead |
                iterations := (40000000 // (size max: 1) sqrt) floor.
                overhead := [ 1 to: iterations do: [ :i | ] ] timeToRun.
                time := [ 1 to: iterations do: [ :i |
                        collection replaceFrom: 1 to: size with: collection startingAt: 1 ] ] timeToRun.
                { size. iterations. time - overhead } ]) ].

I found that the quick paths are probably only implented for bytes and pointers collections, because there was no significant difference for DoubleByteArray, WordArray, DoubleWordArray, WideString and FloatArray.

For pointers and bytes collections, there's significant speedup when the copied portion is small. However, somewhere between 50 and 100 copied elements, the copying of bytes collections becomes slower (up to 1.5x @ 100k elements) with the newer VM.
It's interesting that this doesn't happen to pointers classes. Instead of slowdown there's still 1.5x speedup even at 100k elements.

Levente


On Mon, 23 Oct 2017, Clément Bera wrote:

Hi all,
For a long time I was willing to add primitive #replaceFrom:to:with:startingAt: in the JIT but did not take time to do it. These days I am showing the JIT to one of my students and as an example of how one would write code in the JIT we implemented this primitive
together, Spur-only. This is part of commit 2273.

I implemented quick paths for byte objects and array-like objects only. The rationale behind this is that the most common cases I see in Pharo user benchmarks in the profiler is copy of arrays and byteStrings. Typically some application benchmarks would show 3-5% of
time spent in copying small things, and switching from the JIT runtime to C runtime is an important part of the cost.

First evaluation shows the following speed-ups, but I've just done that quickly in my machine:

Copy of size 0
    Array 2.85x
    ByteString 2.7x
Copy of size 1
    Array 2.1x
    ByteString 2x
Copy of size 3
    Array 2x
    ByteString 1.9x
Copy of size 8
    Array 1.8x
    ByteString 1.8x
Copy of size 64
   Array 1.1x
   ByteString 1.1x
Copy of size 1000
   Array 1x
   ByteString 1x

So I would expect some macro benchmarks to get 1 to 3% percent speed-up. Not as much as I expected but it's there.

Can someone who is good at benchmarks such as Levente have a look and provide us with a better evaluation of the performance difference ?

Thanks.

--
Clément BéraPharo consortium engineer
https://clementbera.wordpress.com/
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq




--
Clément Béra
Pharo consortium engineer
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq