BitBlt performance work for ARM32 & 64

timrowledge
We are *extremely* lucky that RPF has gifted us some of Ben Avison's time to revisit the BitBlt work he did in 2014 for the 32bit ARM vm in order to improve Scratch performance. Now he is going to work on extending that to support the 64bit ARM vm.

He's asking for a bit of information though:

- changes made to BitBlt since then that might need attention. Didn't we add a new rule or two some time back?
- the choices about int/ptr for 64 bit. I could swear there was a doc explaining it somewhere on the github site but I haven't spotted it yet.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: BIK: Buggered if I Know



Re: BitBlt performance work for ARM32 & 64

David T. Lewis
On Wed, Mar 24, 2021 at 05:10:30PM -0700, tim Rowledge wrote:
> We are *extremely* lucky that RPF has gifted us some of Ben Avison's time to revisit the BitBlt work he did in 2014 for the 32bit ARM vm in order to improve Scratch performance. Now he is going to work on extending that to support the 64bit ARM vm.
>

Excellent!


> He's asking for a bit of information though;
>
> - changes made to bitblt since then that might need attention. Didn't we add a new rule or two some time back?

I am not certain of the status in the VM, but I think these links provide the background:

  https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/505

  http://source.squeak.org/VMMaker/VMMaker.oscog-nice.2909.diff

Dave




> - the choices about int/ptr for 64 bit. I could swear there was a doc explaining it somewhere on the github site but haven't spotted it yet
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Strange OpCodes: BIK: Buggered if I Know
>
>
>

Re: BitBlt performance work for ARM32 & 64

timrowledge


> On 2021-03-24, at 6:15 PM, David T. Lewis <[hidden email]> wrote:
>
> I am not certain of the status in the VM, but I think these links provide the background:
>
>  https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/505
>
>  http://source.squeak.org/VMMaker/VMMaker.oscog-nice.2909.diff

OK, passed them on.

Ben is clearly having a fun time with this. His test suite has 524288 tests! Right now he's down to just 6 failing, all relating to addWord/subWord stuff, specifically when the line length takes up less than a word.

Hell, rather than trying to edit I'll just include his comments -

> The next problem I hit appears to be a genuine bug in the existing code - specifically the function rgbComponentAlphawith() - one which has simply become more visible due to a buffer overrun fix in copyLoop().
>
> What's happening is that source and destination images are 8 bpp, but the source and destination positions differ in 32-bit word alignment. Yet 32-bit blocks of pixels are passed in and out of rgbComponentAlphawith(), aligned to words at the destination. Therefore in a case like the following:
>
> Word boundaries at source   |       |       |
> Pixels                         X X X X X X X
> Word boundaries at dest   |       |       |       |
>
> previously, copyLoop() would read from the word following the last source pixel when constructing the final source word to pass to rgbComponentAlphawith() - but this is unsafe if the last source pixel is at the end of the image and the image ends at memory page alignment, because the following page might not be mapped in. The new version of copyLoop() correctly skips this load; however, it arbitrarily chooses to zero the relevant bits instead, and this triggers a shortcut in rgbComponentAlphawith() to be taken when it wasn't before.
>
> This shortcut leaves the destination word unchanged if the source word was all 0. At first glance this seems OK, since even at < 32bpp, the source pixels are made up of bitfields, and for a component alpha operation, each field is used as an alpha weight to blend a planar colour into the destination. The problem is in the way the blend calculation is done: it's (dest * (0xFF - alpha) + planar_source * alpha) >> 8. Thus when alpha == 0, it will subtract 1 from each colour component value. It was particularly noticeable in my test because it also utilised random gamma tables, so the off-by-one value was very stark.
>
> Ideally, the blending algorithm would be fixed so that an alpha of 0 leaves the destination unchanged. However, in all but a straight-through 1:1 gamma LUT, there will still be at least some incidences of a many-to-one mapping, so a round trip through gamma and ungamma could cause colour component changes even with an alpha of 0 and a better blending algorithm. So the simple solution is to remove the invalid shortcut in rgbComponentAlphawith(), and once you do so, the old and new builds produce the same result for this test. The change does impact the results of some other tests too, so I needed to recalculate the correct CRCs with this change in place.
>
> By now I was down to 190 failing tests out of 524288. Surveying the first 10, their combinationRules are all either addWord or subWord. Bearing the above in mind, I immediately suspected what the problem might be. Since pixels are stored left-to-right in most-significant-to-least-significant order, and carry/borrow propagate from less-significant to more-significant bits, that means that memory locations after the nominal end of the source row can impact the result. That means that the change that zeroes the remaining bits rather than loading them has the potential to change the results whenever word-relative alignment differs between source and destination.
>
> This time I tried backporting the buffer overflow fix to the old build (it's a good thing not to read off the end of the input buffer in any case) and recalculating the CRCs again.
>
> Now we were down to 6 failures out of 524288 tests. They were all addWord or subWord again, but the extra distinguishing feature this time was that the line lengths were very short - no more than 1 pixel for 16bpp, 3 pixels for 8bpp and (though there weren't enough examples to prove it) presumably no more than 7 pixels for 4bpp and so on.
>
> After a bit more digging, I think this change is responsible:
> https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/426
>
> in the sense that it also causes bits that are outside the positions that are masked into the destination to be zeroed when they weren't before, but in such a way that they affect whether or not a carry or borrow into bit positions that *do* matter happens.
>
> Unfortunately the patch from the resolution of that issue doesn't cleanly apply back onto the old source tree - it seems there were other intervening changes in how the "preload" flag was calculated. However, I'm reasonably confident that since there are now only 6 failures, they're probably all caused by this one remaining issue. It's very reassuring that I now get the same results from the new source tree irrespective of whether the 32-bit ARM assembly fast paths are enabled or not. So the easiest thing is to simply recalculate the CRCs again from the new source tree and use them as a baseline for the AArch64 conversion.
>
> One thing that occurs to me is that, given that there are some combinationRules that have this "leaking across pixel boundaries" behaviour (and maybe it might be deliberately added one day - say for a "blur" operation) it arguably should be the case that the source and destination words are masked to only include the bits relating to pixels included in the bounding box *before* calling the combinationRule as well as afterwards. For example, assuming an operation on a single 8bpp pixel:
>
> Words at source        |AA AA AA AA|BB BB BB BB|
> Pixel                         **
> Words at destination      |FF FF FF FF|
>
> In the old Squeak VM, addWordwith() would have been passed 0xAAAAAABB and 0xFFFFFFFF. New VMs skip the load of the second source word and pass 0xAAAAAA00 and 0xFFFFFFFF. What I'm saying is, perhaps it should be 0x00AA0000 and 0x00FF0000 instead. (Had this rule been in place already, none of the issues I hit today would have shown up.)
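
To make the carry point concrete, here is a minimal sketch in Python. It assumes addWord is a plain 32-bit wrapping add of the source and destination words; the helper names (add_word, merge), the premask flag and the sample word values are all made up for illustration and are not taken from the VM source. It compares the result for one 8bpp pixel when the bits past the end of the row are stale, when they are zeroed, and when both words are masked before the rule is applied.

```python
MASK32 = 0xFFFFFFFF


def add_word(source_word, dest_word):
    """Assumed model of the addWord rule: a plain 32-bit wrapping add."""
    return (source_word + dest_word) & MASK32


def merge(source_word, dest_word, pixel_mask, premask=False):
    """Apply add_word, then write only the bits under pixel_mask into dest_word.

    With premask=True both inputs are masked *before* the rule runs, which is
    the behaviour suggested at the end of the quoted message.
    """
    if premask:
        combined = add_word(source_word & pixel_mask, dest_word & pixel_mask)
    else:
        combined = add_word(source_word, dest_word)
    return (dest_word & ~pixel_mask & MASK32) | (combined & pixel_mask)


# Hypothetical values: one 8bpp pixel occupying the second byte from the top.
PIXEL_MASK = 0x00FF0000
DEST = 0x00200080
SRC_STALE = 0x0010FF80   # low byte holds whatever followed the source row
SRC_ZEROED = 0x0010FF00  # same word with the unloaded byte zeroed instead

for label, src in (("stale", SRC_STALE), ("zeroed", SRC_ZEROED)):
    plain = merge(src, DEST, PIXEL_MASK)
    masked = merge(src, DEST, PIXEL_MASK, premask=True)
    print(f"{label:6} plain={plain:#010x} premasked={masked:#010x}")
# stale  plain=0x00310080 premasked=0x00300080
# zeroed plain=0x00300080 premasked=0x00300080
```

With these values the stale low byte produces a carry that reaches the pixel (0x31 instead of 0x30), so the unmasked result depends on memory outside the bounding box, while the pre-masked result is the same for both inputs. Pre-masking also changes what the rule computes compared with either unmasked variant, so any stored CRCs would need regenerating if it were adopted.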

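The off-by-one in the blend calculation quoted earlier can be checked the same way. The sketch below applies the quoted expression to a single 8-bit colour component; blend_shift and blend_rounded are hypothetical names, and the rounded divide-by-255 variant is only one possible way to make an alpha of 0 an identity, not something proposed in the quoted message or taken from the VM.

```python
def blend_shift(dest, planar_source, alpha):
    """Blend one 8-bit component as quoted: divide by 256 using a shift."""
    return (dest * (0xFF - alpha) + planar_source * alpha) >> 8


def blend_rounded(dest, planar_source, alpha):
    """Hypothetical alternative: divide by 255, rounding to nearest."""
    return (dest * (0xFF - alpha) + planar_source * alpha + 127) // 255


# Which destination values are altered by a blend with alpha == 0?
changed = [d for d in range(256) if blend_shift(d, 0x80, 0) != d]
print(len(changed), blend_shift(0, 0x80, 0), blend_shift(255, 0x80, 0))
# 255 0 254

# The rounded variant leaves every destination value alone when alpha == 0.
print(all(blend_rounded(d, 0x80, 0) == d for d in range(256)))
# True
```

With the shift version every non-zero component drops by one when alpha is 0 (zero stays zero), which is why skipping the blend for an all-zero source word gives a different answer from performing it. As the quoted message notes, a round trip through non-trivial gamma tables could still alter values even with an exact blend, so this does not by itself make the shortcut valid.
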
Anything anyone can think of that might help would be appreciated. Although the intent here is to make the ARM64 system as fast as possible for the Pi's benefit, it will probably have benefits for other machines too, because some of his improvements are just C code. Bug fixes should be passed up to all systems easily enough.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
At the end of the day, a cliché walks into a bar -- fresh as a daisy, cute as a button, and sharp as a tack.