I'm working on making bitBlt faster, specifically for the Raspberry Pi but with likely benefits for all machines. It only takes a quick look at the generated code to see there is a lot of room for improvement. Some comparison testing for broadly similar cases shows that our bitBlt is only about 10% of the performance of the X related pixman code. We really ought to be able to do better than that. Part of the problem is simply the complexity buried within inner loops and that can often be dealt with easily enough. Part of it is Pi specific - the ARM11 buried in the SoC has a worst-case word load time of 150 cycles; so cache preloading and hinting is pretty important for any streaming type work. It also has very small caches by comparison to the Watt-sucking intel cpus in our desktops. Where things get a bit trickier is working out what the blazes is going on with some of the weirder code; there's a *lot* of possible combinations of all the input variables and it looks like only a few are 'realistic'. For example the cmMaskTable/ShiftTable stuff has potential for a rather large number of combinations but only a few appear to be used. If anyone knows what the specs really are I'd be happy to hear about it. We have 40 combination rules, halftone forms, clipping rectangles, swapped-endian bitmaps, external bitmaps, and colormaps and probably pink unicorns. As mentioned in my mail about the pixelpeeker plugin, some cases are probably best pulled out of bitblt entirely. I suspect that getting the load-up related code cleaner will help quite a bit since there is a lot of work to do before even a tiny blt. Rapid detection of special cases can obviously help - but having some clear evidence of what *Actual* cases are common and worth attacking would be nice. Info about any facets of the weirder blt code, data about actual common cases, experiences in trying to improve performance, anything you think might help, all welcomed. tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Machine-independent: Does not run on any existing machine. |
Cool! Can't wait to read about the results. - David Sent from my iPhone, tap tap On Mar 15, 2013, at 12:01 PM, tim Rowledge <[hidden email]> wrote: > > I'm working on making bitBlt faster, specifically for the Raspberry Pi but with likely benefits for all machines. It only takes a quick look at the generated code to see there is a lot of room for improvement. Some comparison testing for broadly similar cases shows that our bitBlt is only about 10% of the performance of the X related pixman code. We really ought to be able to do better than that. > > Part of the problem is simply the complexity buried within inner loops and that can often be dealt with easily enough. Part of it is Pi specific - the ARM11 buried in the SoC has a worst-case word load time of 150 cycles; so cache preloading and hinting is pretty important for any streaming type work. It also has very small caches by comparison to the Watt-sucking intel cpus in our desktops. > > Where things get a bit trickier is working out what the blazes is going on with some of the weirder code; there's a *lot* of possible combinations of all the input variables and it looks like only a few are 'realistic'. For example the cmMaskTable/ShiftTable stuff has potential for a rather large number of combinations but only a few appear to be used. If anyone knows what the specs really are I'd be happy to hear about it. We have 40 combination rules, halftone forms, clipping rectangles, swapped-endian bitmaps, external bitmaps, and colormaps and probably pink unicorns. > > As mentioned in my mail about the pixelpeeker plugin, some cases are probably best pulled out of bitblt entirely. I suspect that getting the load-up related code cleaner will help quite a bit since there is a lot of work to do before even a tiny blt. Rapid detection of special cases can obviously help - but having some clear evidence of what *Actual* cases are common and worth attacking would be nice. > > Info about any facets of the weirder blt code, data about actual common cases, experiences in trying to improve performance, anything you think might help, all welcomed. > > > > tim > -- > tim Rowledge; [hidden email]; http://www.rowledge.org/tim > Machine-independent: Does not run on any existing machine. > > |
In reply to this post by timrowledge
Am 03/15/2013 08:01 PM, schrieb tim Rowledge: > > ... and probably pink unicorns. > > Whatever you do, please leave the pink unicorns alone. They're there for a reason. Cheers, Hans-Martin |
> > > Am 03/15/2013 08:01 PM, schrieb tim Rowledge: > > > > ... and probably pink unicorns. > > > > > Whatever you do, please leave the pink unicorns alone. They're there for a > reason. I'm sure a lot of code depends on them the way they are but they were originally designed to be purple spotted unicorns with glitter hair. > > Cheers, > Hans-Martin |
In reply to this post by timrowledge
On 2013-03-15, at 20:01, tim Rowledge <[hidden email]> wrote: > I'm working on making bitBlt faster Awesome! Maybe getting some usage statistics would be helpful. Very likely a handful of combinations dominates the workload. That's actually easy to verify. I renamed #copyBits to primCopyBits and then did this: =================================== copyBits BitBltStats ifNotNil: [ BitBltStats add: { destForm isForm ifTrue: [destForm depth]. sourceForm isForm ifTrue: [sourceForm depth]. halftoneForm isForm ifTrue: [halftoneForm depth]. combinationRule. colorMap ifNotNil: [colorMap class]}]. ^ self primCopyBits =================================== Then in a workspace you can start/stop the data collection: =================================== "start" BitBltStats := Bag new "stop" BitBltStats sortedCounts inspect. BitBltStats := nil ==================================== Here's what I go an Etoys image doing normal Etoys stuff: 23709->#(16 nil nil 3 nil) 20471->#(16 nil nil 3 Bitmap) 7852->#(16 32 nil 34 nil) 2961->#(16 16 nil 3 nil) 1181->#(16 1 nil 25 Bitmap) 958->#(16 16 nil 25 nil) 702->#(16 1 nil 3 Bitmap) 702->#(16 16 nil 1 nil) 702->#(16 1 nil 1 Bitmap) 339->#(16 4 nil 25 Bitmap) 190->#(32 nil nil 3 nil) 190->#(16 2 nil 25 Bitmap) 42->#(2 2 nil 3 nil) 40->#(32 32 nil 3 nil) 18->#(8 1 nil 25 Bitmap) 9->#(32 8 nil 3 Bitmap) 5->#(16 16 nil 33 Bitmap) 26624->#(16 nil nil 3 Bitmap) 10860->#(16 nil nil 3 nil) 2557->#(16 16 nil 3 nil) 2068->#(16 32 nil 34 nil) 1612->#(16 16 nil 25 nil) 932->#(16 1 nil 25 Bitmap) 822->#(16 1 nil 3 Bitmap) 822->#(16 16 nil 1 nil) 822->#(16 1 nil 1 Bitmap) 343->#(1 16 nil 1 Bitmap) 343->#(1 1 nil 33 Bitmap) 343->#(1 16 nil 3 Bitmap) 70->#(16 4 nil 25 Bitmap) 45->#(32 nil nil 3 nil) 26->#(16 2 nil 25 Bitmap) 14->#(8 1 nil 25 Bitmap) 7->#(32 8 nil 3 Bitmap) 1->#(16 16 nil 33 Bitmap) If using Smalltalk tools (System Browsers etc), it looks a bit different: 39350->#(16 32 nil 34 nil) 23318->#(16 nil nil 3 nil) 1872->#(16 16 nil 3 nil) 1512->#(16 nil nil 3 Bitmap) 794->#(16 16 nil 25 nil) 634->#(16 1 nil 25 Bitmap) 624->#(16 1 nil 3 Bitmap) 624->#(16 16 nil 1 nil) 624->#(16 1 nil 1 Bitmap) 73->#(32 32 nil 3 nil) Might be interesting to see how other images use it. And maybe add in some instrumentation to measure actual time spent. For Etoys and Scratch in particular, WarpBlt is also rather important. - Bert - |
On 15-03-2013, at 2:13 PM, Bert Freudenberg <[hidden email]> wrote: > > On 2013-03-15, at 20:01, tim Rowledge <[hidden email]> wrote: > >> I'm working on making bitBlt faster > > Awesome! Maybe getting some usage statistics would be helpful. Very likely a handful of combinations dominates the workload. That's actually easy to verify. > > I renamed #copyBits to primCopyBits and then did this: Well that's an interesting way of doing it. I'm so used to the lower performance machines that I immediately think of adding logging to the plugin C codeā¦ Actually, that does have its value as well; I build a plugin without inlining so it would be easier to instrument and it is hardly any different in performance. Which probably says something interesting about the code and C compilers and machine in use. Those logs were what prompted the pixelpeekerplugin. tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Useful random insult:- Paralyzed from the neck up. |
In reply to this post by Bert Freudenberg
Great trick, Bert! Tim, also include other primitives in BitBlt. In Cuis, they are: #primDisplayString:from:to:map:xTable:kern: Used for ALL text! #copyBitsAgain #copyBitsTranslucent: #drawLoopX:y: Cheers, Juan Vuletich Quoting Bert Freudenberg <[hidden email]>: > > On 2013-03-15, at 20:01, tim Rowledge <[hidden email]> wrote: > >> I'm working on making bitBlt faster > > Awesome! Maybe getting some usage statistics would be helpful. Very > likely a handful of combinations dominates the workload. That's > actually easy to verify. > > I renamed #copyBits to primCopyBits and then did this: > > =================================== > copyBits > BitBltStats ifNotNil: [ BitBltStats add: { > destForm isForm ifTrue: [destForm depth]. > sourceForm isForm ifTrue: [sourceForm depth]. > halftoneForm isForm ifTrue: [halftoneForm depth]. > combinationRule. > colorMap ifNotNil: [colorMap class]}]. > ^ self primCopyBits > =================================== > > Then in a workspace you can start/stop the data collection: > > =================================== > > "start" > BitBltStats := Bag new > > "stop" > BitBltStats sortedCounts inspect. BitBltStats := nil > > ==================================== > > Here's what I go an Etoys image doing normal Etoys stuff: > > 23709->#(16 nil nil 3 nil) > 20471->#(16 nil nil 3 Bitmap) > 7852->#(16 32 nil 34 nil) > 2961->#(16 16 nil 3 nil) > 1181->#(16 1 nil 25 Bitmap) > 958->#(16 16 nil 25 nil) > 702->#(16 1 nil 3 Bitmap) > 702->#(16 16 nil 1 nil) > 702->#(16 1 nil 1 Bitmap) > 339->#(16 4 nil 25 Bitmap) > 190->#(32 nil nil 3 nil) > 190->#(16 2 nil 25 Bitmap) > 42->#(2 2 nil 3 nil) > 40->#(32 32 nil 3 nil) > 18->#(8 1 nil 25 Bitmap) > 9->#(32 8 nil 3 Bitmap) > 5->#(16 16 nil 33 Bitmap) > > 26624->#(16 nil nil 3 Bitmap) > 10860->#(16 nil nil 3 nil) > 2557->#(16 16 nil 3 nil) > 2068->#(16 32 nil 34 nil) > 1612->#(16 16 nil 25 nil) > 932->#(16 1 nil 25 Bitmap) > 822->#(16 1 nil 3 Bitmap) > 822->#(16 16 nil 1 nil) > 822->#(16 1 nil 1 Bitmap) > 343->#(1 16 nil 1 Bitmap) > 343->#(1 1 nil 33 Bitmap) > 343->#(1 16 nil 3 Bitmap) > 70->#(16 4 nil 25 Bitmap) > 45->#(32 nil nil 3 nil) > 26->#(16 2 nil 25 Bitmap) > 14->#(8 1 nil 25 Bitmap) > 7->#(32 8 nil 3 Bitmap) > 1->#(16 16 nil 33 Bitmap) > > If using Smalltalk tools (System Browsers etc), it looks a bit different: > > 39350->#(16 32 nil 34 nil) > 23318->#(16 nil nil 3 nil) > 1872->#(16 16 nil 3 nil) > 1512->#(16 nil nil 3 Bitmap) > 794->#(16 16 nil 25 nil) > 634->#(16 1 nil 25 Bitmap) > 624->#(16 1 nil 3 Bitmap) > 624->#(16 16 nil 1 nil) > 624->#(16 1 nil 1 Bitmap) > 73->#(32 32 nil 3 nil) > > Might be interesting to see how other images use it. And maybe add > in some instrumentation to measure actual time spent. > > For Etoys and Scratch in particular, WarpBlt is also rather important. > > - Bert - > > > Cheers, Juan Vuletich |
On 16-03-2013, at 12:10 PM, "Juan Vuletich (mail lists)" <[hidden email]> wrote: > Great trick, Bert! > > Tim, also include other primitives in BitBlt. In Cuis, they are: > #primDisplayString:from:to:map:xTable:kern: Used for ALL text! > #copyBitsAgain > #copyBitsTranslucent: > #drawLoopX:y: An excellent point. I already added a trap for WarpBlt but the displaystring is especially important. tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Strange OpCodes: SSAN: Stop, and See if Anyone Notices |
In reply to this post by David Ungar
On 15-03-2013, at 12:09 PM, David Ungar <[hidden email]> wrote: > > Cool! Can't wait to read about the results. Well preliminary results are encouraging; looks like we might see somewhere between 3 and 10 times faster on a Pi. Could be useful... tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim All wiyht. Rho sritched mg kegtops awound? |
Great! Any thought about posting a video? On Mar 21, 2013, at 2:51 PM, tim Rowledge <[hidden email]> wrote: > > > On 15-03-2013, at 12:09 PM, David Ungar <[hidden email]> wrote: > >> >> Cool! Can't wait to read about the results. > > > Well preliminary results are encouraging; looks like we might see somewhere between 3 and 10 times faster on a Pi. Could be useful... > > tim > -- > tim Rowledge; [hidden email]; http://www.rowledge.org/tim > All wiyht. Rho sritched mg kegtops awound? > > |
Free forum by Nabble | Edit this page |