BitBLt performance work

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

BitBLt performance work

timrowledge

I'm working on making bitBlt faster, specifically for the Raspberry Pi but with likely benefits for all machines. It only takes a quick look at the generated code to see there is a lot of room for improvement. Some comparison testing for broadly similar cases shows that our bitBlt is only about 10% of the performance of the X related pixman code. We really ought to be able to do better than that.

Part of the problem is simply the complexity buried within inner loops and that can often be dealt with easily enough. Part of it is Pi specific - the ARM11 buried in the SoC has a worst-case word load time of 150 cycles; so cache preloading and hinting is pretty important for any streaming type work. It also has very small caches by comparison to the Watt-sucking intel cpus in our desktops.

Where things get a bit trickier is working out what the blazes is going on with some of the weirder code; there's a *lot* of possible combinations of all the input variables and it looks like only a few are 'realistic'. For example the cmMaskTable/ShiftTable stuff has potential for a rather large number of combinations but only a few appear to be used. If anyone knows what the specs really are I'd be happy to hear about it. We have 40 combination rules, halftone forms, clipping rectangles, swapped-endian bitmaps, external bitmaps, and colormaps and probably pink unicorns.

As mentioned in my mail about the pixelpeeker plugin, some cases are probably best pulled out of bitblt entirely. I suspect that getting the load-up related code cleaner will help quite a bit since there is a lot of work to do before even a tiny blt. Rapid detection of special cases can obviously help - but having some clear evidence of what *Actual* cases are common and worth attacking would be nice.

Info about any facets of the weirder blt code, data about actual common cases, experiences in trying to improve performance, anything you think might help, all welcomed.



tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Machine-independent:  Does not run on any existing machine.


Reply | Threaded
Open this post in threaded view
|

Re: BitBLt performance work

David Ungar

Cool! Can't wait to read about the results.

- David
Sent from my iPhone, tap tap

On Mar 15, 2013, at 12:01 PM, tim Rowledge <[hidden email]> wrote:

>
> I'm working on making bitBlt faster, specifically for the Raspberry Pi but with likely benefits for all machines. It only takes a quick look at the generated code to see there is a lot of room for improvement. Some comparison testing for broadly similar cases shows that our bitBlt is only about 10% of the performance of the X related pixman code. We really ought to be able to do better than that.
>
> Part of the problem is simply the complexity buried within inner loops and that can often be dealt with easily enough. Part of it is Pi specific - the ARM11 buried in the SoC has a worst-case word load time of 150 cycles; so cache preloading and hinting is pretty important for any streaming type work. It also has very small caches by comparison to the Watt-sucking intel cpus in our desktops.
>
> Where things get a bit trickier is working out what the blazes is going on with some of the weirder code; there's a *lot* of possible combinations of all the input variables and it looks like only a few are 'realistic'. For example the cmMaskTable/ShiftTable stuff has potential for a rather large number of combinations but only a few appear to be used. If anyone knows what the specs really are I'd be happy to hear about it. We have 40 combination rules, halftone forms, clipping rectangles, swapped-endian bitmaps, external bitmaps, and colormaps and probably pink unicorns.
>
> As mentioned in my mail about the pixelpeeker plugin, some cases are probably best pulled out of bitblt entirely. I suspect that getting the load-up related code cleaner will help quite a bit since there is a lot of work to do before even a tiny blt. Rapid detection of special cases can obviously help - but having some clear evidence of what *Actual* cases are common and worth attacking would be nice.
>
> Info about any facets of the weirder blt code, data about actual common cases, experiences in trying to improve performance, anything you think might help, all welcomed.
>
>
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Machine-independent:  Does not run on any existing machine.
>
>
Reply | Threaded
Open this post in threaded view
|

Re: BitBLt performance work

Hans-Martin Mosner
In reply to this post by timrowledge
 
Am 03/15/2013 08:01 PM, schrieb tim Rowledge:
>  
> ... and probably pink unicorns.
>
>
Whatever you do, please leave the pink unicorns alone. They're there for a reason.

Cheers,
Hans-Martin
Reply | Threaded
Open this post in threaded view
|

RE: BitBLt performance work

Ron Teitelbaum
 
>
>
> Am 03/15/2013 08:01 PM, schrieb tim Rowledge:
> >
> > ... and probably pink unicorns.
> >
> >
> Whatever you do, please leave the pink unicorns alone. They're there for a
> reason.

I'm sure a lot of code depends on them the way they are but they were
originally designed to be purple spotted unicorns with glitter hair.

>
> Cheers,
> Hans-Martin


Reply | Threaded
Open this post in threaded view
|

Re: BitBLt performance work

Bert Freudenberg
In reply to this post by timrowledge

On 2013-03-15, at 20:01, tim Rowledge <[hidden email]> wrote:

> I'm working on making bitBlt faster

Awesome! Maybe getting some usage statistics would be helpful. Very likely a handful of combinations dominates the workload. That's actually easy to verify.

I renamed #copyBits to primCopyBits and then did this:

===================================
copyBits
        BitBltStats ifNotNil: [ BitBltStats add: {
                destForm isForm ifTrue: [destForm depth].
                sourceForm isForm ifTrue: [sourceForm depth].
                halftoneForm isForm ifTrue: [halftoneForm depth].
                combinationRule.
                colorMap ifNotNil: [colorMap class]}].
        ^ self primCopyBits
===================================

Then in a workspace you can start/stop the data collection:

===================================

"start"
BitBltStats := Bag new

"stop"
BitBltStats sortedCounts inspect. BitBltStats := nil

====================================

Here's what I go an Etoys image doing normal Etoys stuff:

23709->#(16 nil nil 3 nil)
20471->#(16 nil nil 3 Bitmap)
7852->#(16 32 nil 34 nil)
2961->#(16 16 nil 3 nil)
1181->#(16 1 nil 25 Bitmap)
958->#(16 16 nil 25 nil)
702->#(16 1 nil 3 Bitmap)
702->#(16 16 nil 1 nil)
702->#(16 1 nil 1 Bitmap)
339->#(16 4 nil 25 Bitmap)
190->#(32 nil nil 3 nil)
190->#(16 2 nil 25 Bitmap)
42->#(2 2 nil 3 nil)
40->#(32 32 nil 3 nil)
18->#(8 1 nil 25 Bitmap)
9->#(32 8 nil 3 Bitmap)
5->#(16 16 nil 33 Bitmap)

 26624->#(16 nil nil 3 Bitmap)
 10860->#(16 nil nil 3 nil)
 2557->#(16 16 nil 3 nil)
 2068->#(16 32 nil 34 nil)
 1612->#(16 16 nil 25 nil)
 932->#(16 1 nil 25 Bitmap)
 822->#(16 1 nil 3 Bitmap)
 822->#(16 16 nil 1 nil)
 822->#(16 1 nil 1 Bitmap)
 343->#(1 16 nil 1 Bitmap)
 343->#(1 1 nil 33 Bitmap)
 343->#(1 16 nil 3 Bitmap)
 70->#(16 4 nil 25 Bitmap)
 45->#(32 nil nil 3 nil)
 26->#(16 2 nil 25 Bitmap)
 14->#(8 1 nil 25 Bitmap)
 7->#(32 8 nil 3 Bitmap)
 1->#(16 16 nil 33 Bitmap)

If using Smalltalk tools (System Browsers etc), it looks a bit different:

 39350->#(16 32 nil 34 nil)
 23318->#(16 nil nil 3 nil)
 1872->#(16 16 nil 3 nil)
 1512->#(16 nil nil 3 Bitmap)
 794->#(16 16 nil 25 nil)
 634->#(16 1 nil 25 Bitmap)
 624->#(16 1 nil 3 Bitmap)
 624->#(16 16 nil 1 nil)
 624->#(16 1 nil 1 Bitmap)
 73->#(32 32 nil 3 nil)

Might be interesting to see how other images use it. And maybe add in some instrumentation to measure actual time spent.

For Etoys and Scratch in particular, WarpBlt is also rather important.

- Bert -


Reply | Threaded
Open this post in threaded view
|

Re: BitBLt performance work

timrowledge


On 15-03-2013, at 2:13 PM, Bert Freudenberg <[hidden email]> wrote:

>
> On 2013-03-15, at 20:01, tim Rowledge <[hidden email]> wrote:
>
>> I'm working on making bitBlt faster
>
> Awesome! Maybe getting some usage statistics would be helpful. Very likely a handful of combinations dominates the workload. That's actually easy to verify.
>
> I renamed #copyBits to primCopyBits and then did this:

Well that's an interesting way of doing it. I'm so used to the lower performance machines that I immediately think of adding logging to the plugin C codeā€¦ Actually, that does have its value as well; I build a plugin without inlining so it would be easier to instrument and it is hardly any different in performance. Which probably says something interesting about the code and C compilers and machine in use. Those logs were what prompted the pixelpeekerplugin.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Paralyzed from the neck up.


Reply | Threaded
Open this post in threaded view
|

Re: BitBLt performance work

J. Vuletich (mail lists)
In reply to this post by Bert Freudenberg
 
Great trick, Bert!

Tim, also include other primitives in BitBlt. In Cuis, they are:
#primDisplayString:from:to:map:xTable:kern:   Used for ALL text!
#copyBitsAgain
#copyBitsTranslucent:
#drawLoopX:y:

Cheers,
Juan Vuletich

Quoting Bert Freudenberg <[hidden email]>:

>
> On 2013-03-15, at 20:01, tim Rowledge <[hidden email]> wrote:
>
>> I'm working on making bitBlt faster
>
> Awesome! Maybe getting some usage statistics would be helpful. Very  
> likely a handful of combinations dominates the workload. That's  
> actually easy to verify.
>
> I renamed #copyBits to primCopyBits and then did this:
>
> ===================================
> copyBits
> BitBltStats ifNotNil: [ BitBltStats add: {
> destForm isForm ifTrue: [destForm depth].
> sourceForm isForm ifTrue: [sourceForm depth].
> halftoneForm isForm ifTrue: [halftoneForm depth].
> combinationRule.
> colorMap ifNotNil: [colorMap class]}].
> ^ self primCopyBits
> ===================================
>
> Then in a workspace you can start/stop the data collection:
>
> ===================================
>
> "start"
> BitBltStats := Bag new
>
> "stop"
> BitBltStats sortedCounts inspect. BitBltStats := nil
>
> ====================================
>
> Here's what I go an Etoys image doing normal Etoys stuff:
>
> 23709->#(16 nil nil 3 nil)
> 20471->#(16 nil nil 3 Bitmap)
> 7852->#(16 32 nil 34 nil)
> 2961->#(16 16 nil 3 nil)
> 1181->#(16 1 nil 25 Bitmap)
> 958->#(16 16 nil 25 nil)
> 702->#(16 1 nil 3 Bitmap)
> 702->#(16 16 nil 1 nil)
> 702->#(16 1 nil 1 Bitmap)
> 339->#(16 4 nil 25 Bitmap)
> 190->#(32 nil nil 3 nil)
> 190->#(16 2 nil 25 Bitmap)
> 42->#(2 2 nil 3 nil)
> 40->#(32 32 nil 3 nil)
> 18->#(8 1 nil 25 Bitmap)
> 9->#(32 8 nil 3 Bitmap)
> 5->#(16 16 nil 33 Bitmap)
>
>  26624->#(16 nil nil 3 Bitmap)
>  10860->#(16 nil nil 3 nil)
>  2557->#(16 16 nil 3 nil)
>  2068->#(16 32 nil 34 nil)
>  1612->#(16 16 nil 25 nil)
>  932->#(16 1 nil 25 Bitmap)
>  822->#(16 1 nil 3 Bitmap)
>  822->#(16 16 nil 1 nil)
>  822->#(16 1 nil 1 Bitmap)
>  343->#(1 16 nil 1 Bitmap)
>  343->#(1 1 nil 33 Bitmap)
>  343->#(1 16 nil 3 Bitmap)
>  70->#(16 4 nil 25 Bitmap)
>  45->#(32 nil nil 3 nil)
>  26->#(16 2 nil 25 Bitmap)
>  14->#(8 1 nil 25 Bitmap)
>  7->#(32 8 nil 3 Bitmap)
>  1->#(16 16 nil 33 Bitmap)
>
> If using Smalltalk tools (System Browsers etc), it looks a bit different:
>
>  39350->#(16 32 nil 34 nil)
>  23318->#(16 nil nil 3 nil)
>  1872->#(16 16 nil 3 nil)
>  1512->#(16 nil nil 3 Bitmap)
>  794->#(16 16 nil 25 nil)
>  634->#(16 1 nil 25 Bitmap)
>  624->#(16 1 nil 3 Bitmap)
>  624->#(16 16 nil 1 nil)
>  624->#(16 1 nil 1 Bitmap)
>  73->#(32 32 nil 3 nil)
>
> Might be interesting to see how other images use it. And maybe add  
> in some instrumentation to measure actual time spent.
>
> For Etoys and Scratch in particular, WarpBlt is also rather important.
>
> - Bert -
>
>
>



Cheers,
Juan Vuletich

Reply | Threaded
Open this post in threaded view
|

Re: BitBLt performance work

timrowledge


On 16-03-2013, at 12:10 PM, "Juan Vuletich (mail lists)" <[hidden email]> wrote:

> Great trick, Bert!
>
> Tim, also include other primitives in BitBlt. In Cuis, they are:
> #primDisplayString:from:to:map:xTable:kern:   Used for ALL text!
> #copyBitsAgain
> #copyBitsTranslucent:
> #drawLoopX:y:

An excellent point. I already added a trap for WarpBlt but the displaystring is especially important.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: SSAN: Stop, and See if Anyone Notices


Reply | Threaded
Open this post in threaded view
|

Re: BitBLt performance work

timrowledge
In reply to this post by David Ungar


On 15-03-2013, at 12:09 PM, David Ungar <[hidden email]> wrote:

>
> Cool! Can't wait to read about the results.


Well preliminary results are encouraging; looks like we might see somewhere between 3 and 10 times faster on a Pi. Could be useful...

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
All wiyht.  Rho sritched mg kegtops awound?


Reply | Threaded
Open this post in threaded view
|

Re: BitBLt performance work

ungar

Great! Any thought about posting a video?


On Mar 21, 2013, at 2:51 PM, tim Rowledge <[hidden email]> wrote:

>
>
> On 15-03-2013, at 12:09 PM, David Ungar <[hidden email]> wrote:
>
>>
>> Cool! Can't wait to read about the results.
>
>
> Well preliminary results are encouraging; looks like we might see somewhere between 3 and 10 times faster on a Pi. Could be useful...
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> All wiyht.  Rho sritched mg kegtops awound?
>
>