The Trunk: Kernel-nice.1235.mcz

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

The Trunk: Kernel-nice.1235.mcz

commits-2
Nicolas Cellier uploaded a new version of Kernel to project The Trunk:
http://source.squeak.org/trunk/Kernel-nice.1235.mcz

==================== Summary ====================

Name: Kernel-nice.1235
Author: nice
Time: 16 May 2019, 12:08:27.320094 am
UUID: dd74b668-0767-441e-a952-5c73af6327be
Ancestors: Kernel-nice.1234

Round accelerated arithmetic chunks to upper multiple of 4 bytes rather than to lower.

I believe that this marginally improves the performance because it's a tiny bit better to recompose a longer least significant chunk with a shorter most significant chunk.

If someone wants to confirm...
It's better to tune the threshold before benchmarking.
See LargeArithmeticBench from http://www.squeaksource.com/STEM.html and http://smallissimo.blogspot.com/2019/05 blog for details.

=============== Diff against Kernel-nice.1234 ===============

Item was changed:
  ----- Method: LargePositiveInteger>>digitDiv21: (in category 'private') -----
  digitDiv21: anInteger
  "This is part of the recursive division algorithm from Burnikel - Ziegler
  Divide a two limbs receiver by 1 limb dividend
  Each limb is decomposed in two halves of p bytes (8*p bits)
  so as to continue the recursion"
 
  | p qr1 qr2 |
+ "split in two parts, rounded to upper multiple of 4"
+ p := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- p := (anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
  p < self class thresholdForDiv21 ifTrue: [^(self digitDiv: anInteger neg: false) collect: #normalize].
  qr1 := (self butLowestNDigits: p) digitDiv32: anInteger.
  qr2 := (self lowestNDigits: p) + (qr1 last bitShift: 8*p) digitDiv32: anInteger.
  qr2 at: 1 put: (qr2 at: 1) + ((qr1 at: 1) bitShift: 8*p).
  ^qr2!

Item was changed:
  ----- Method: LargePositiveInteger>>digitDiv32: (in category 'private') -----
  digitDiv32: anInteger
  "This is part of the recursive division algorithm from Burnikel - Ziegler
  Divide 3 limb (a2,a1,a0) by 2 limb (b1,b0).
  Each limb is made of p bytes (8*p bits).
  This step transforms the division problem into multiplication
  It must use a fast multiplyByInteger: to be worth the overhead costs."
 
  | a2 b1 d p q qr r |
+ "split in two parts, rounded to upper multiple of 4"
+ p :=(anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- p :=(anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
  (a2 := self butLowestNDigits: 2*p)
  < (b1 := anInteger butLowestNDigits: p)
  ifTrue:
  [qr := (self butLowestNDigits: p) digitDiv21: b1.
  q := qr first.
  r := qr last]
  ifFalse:
  [q := (1 bitShift: 8*p) - 1.
  r := (self butLowestNDigits: p) - (b1 bitShift: 8*p) + b1].
  d := q * (anInteger lowestNDigits: p).
  r := (self lowestNDigits: p) + (r bitShift: 8*p) - d.
  [r < 0]
  whileTrue:
  [q := q - 1.
  r := r + anInteger].
  ^Array with: q with: r
  !

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul22: (in category 'private') -----
  digitMul22: anInteger
  "Multiply after decomposing each operand in two parts, using Karatsuba algorithm.
  Karatsuba perform only 3 multiplications, leading to a cost O(n^3 log2)
  asymptotically better than super O(n^2) for large number of digits n.
  See https://en.wikipedia.org/wiki/Karatsuba_algorithm"
 
  | half xLow xHigh yLow yHigh low mid high |
+ "split each in two parts, rounded to upper multiple of 4"
+ half := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- "Divide each integer in two halves"
- half := (anInteger digitLength + 1 bitShift: -1)  bitClear: 2r11.
  xLow := self lowestNDigits: half.
  xHigh := self butLowestNDigits: half.
  yLow := anInteger lowestNDigits: half.
  yHigh := anInteger butLowestNDigits: half.
 
  "Karatsuba trick: perform with 3 multiplications instead of 4"
  low := xLow multiplyByInteger: yLow.
  high := xHigh multiplyByInteger: yHigh.
  mid := high + low + (xHigh - xLow multiplyByInteger: yLow - yHigh).
 
  "Sum the parts of decomposition"
  ^(high isZero
  ifTrue: [low]
  ifFalse: [(high bitShift: 16*half)
  inplaceAddNonOverlapping: low digitShiftBy: 0])
  + (mid bitShift: 8*half)!

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul23: (in category 'private') -----
  digitMul23: anInteger
  "Multiply after decomposing the receiver in 2 parts, and multiplicand in 3 parts.
  Only invoke when anInteger digitLength between: 3/2 and 5/2 self digitLength.
  This is a variant of Toom-Cook algorithm (see digitMul33:)"
     
  | half x1 x0 y2 y1 y0 y20 z3 z2 z1 z0 |
+ "divide self in 2 and operand in 3 parts, rounded to upper multiple of 4"
+ half := ( self digitLength + 7 bitShift: -3) bitShift: 2.
- "divide self in 2 and operand in 3 parts"
- half := ( self digitLength + 1 bitShift: -1) bitClear: 2r11.
  x1 := self butLowestNDigits: half.
  x0 := self lowestNDigits: half.
  y2 := anInteger butLowestNDigits: half * 2.
  y1 := anInteger copyDigitsFrom: half + 1 to: half * 2.
  y0 := anInteger lowestNDigits: half.
     
  "Toom trick: 4 multiplications instead of 6"
  y20 := y2 + y0.
  z3 := x1 multiplyByInteger: y2.
  z2 := x0 - x1 multiplyByInteger: y20 - y1.
  z1 := x0 + x1 multiplyByInteger: y20 + y1.
  z0 := x0 multiplyByInteger: y0.
     
  "Sum the parts of decomposition"
  ^z0 + ((z1 - z2 bitShift: -1) - z3 bitShift: 8*half)
  + (((z1 + z2 bitShift: -1) - z0) + (z3 bitShift: 8*half) bitShift: 16 * half)!

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul33: (in category 'private') -----
  digitMul33: anInteger
  "Multiply after decomposing each operand in 3 parts, using a Toom-Cooke algorithm.
  Toom-Cooke is a generalization of Karatsuba divide and conquer algorithm.
  See https://en.wikipedia.org/wiki/Toom%E2%80%93Cook_multiplication
  Use a Bodrato-Zanoni variant for the choice of interpolation points and matrix inversion
  See What about Toom-Cook matrices optimality? - Marco Bodrato, Alberto Zanoni - Oct. 2006
  http://www.bodrato.it/papers/WhatAboutToomCookMatricesOptimality.pdf"
 
  | third x2 x1 x0 y2 y1 y0 y20 z4 z3 z2 z1 z0 x20 |
+ "divide both operands in 3 parts, rounded to upper multiple of 4"
+ third := anInteger digitLength + 11 // 12 bitShift: 2.
- "divide both operands in 3 parts"
- third := anInteger digitLength + 2 // 3 bitClear: 2r11.
  x2 := self butLowestNDigits: third * 2.
  x1 := self copyDigitsFrom: third + 1 to: third * 2.
  x0 := self lowestNDigits: third.
  y2 := anInteger butLowestNDigits: third * 2.
  y1 := anInteger copyDigitsFrom: third + 1 to: third * 2.
  y0 := anInteger lowestNDigits: third.
 
  "Toom-3 trick: 5 multiplications instead of 9"
  z0 := x0 multiplyByInteger: y0.
  z4 := x2 multiplyByInteger: y2.
  x20 := x2 + x0.
  y20 := y2 + y0.
  z1 := x20 + x1 multiplyByInteger: y20 + y1.
  x20 := x20 - x1.
  y20 := y20 - y1.
  z2 := x20 multiplyByInteger: y20.
  z3 := (x20 + x2 bitShift: 1) - x0 multiplyByInteger: (y20 + y2 bitShift: 1) - y0.
 
  "Sum the parts of decomposition"
  z3 := z3 - z1 quo: 3.
  z1 := z1 - z2 bitShift: -1.
  z2 := z2 - z0.
 
  z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
  z2 := z2 + z1 - z4.
  z1 := z1 - z3.
  ^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)!

Item was changed:
  ----- Method: LargePositiveInteger>>digitMulSplit: (in category 'private') -----
  digitMulSplit: anInteger
  "multiply digits when self and anInteger have not well balanced digitlength.
  in this case, it is better to split the largest (anInteger) in several parts and recompose"
 
  | xLen yLen split q r high mid low sizes |
  yLen := anInteger digitLength.
  xLen := self digitLength.
+ "divide in about 1.5 xLen, rounded to upper multiple of 4"
+ split := (xLen * 3 + 7 bitShift: -3) bitShift: 2.
- split := (xLen * 3 + 2 bitShift: -1) bitClear: 2r11.
 
  "Arrange to sum non overlapping parts"
  q := yLen // split.
  q < 3 ifTrue: [^(0 to: yLen - 1 by: split) detectSum: [:yShift | (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split)) bitShift: 8 * yShift]].
  r := yLen \\ split.
  "allocate enough bytes, but not too much, in order to minimise normalize cost;
  we could allocate xLen + yLen for each one as well"
  sizes := {q-1*split. q*split. q*split+r}.
  low  := Integer new: (sizes atWrap: 0 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
  mid := Integer new:  (sizes atWrap: 1 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
  high := Integer new: (sizes atWrap: 2 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
  0 to: yLen - 1 by: 3 * split do: [:yShift |
  low
  inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
  digitShiftBy: yShift].
  split to: yLen - 1 by: 3 * split do: [:yShift |
  mid
  inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
  digitShiftBy: yShift].
  split * 2 to: yLen - 1 by: 3 * split do: [:yShift |
  high
  inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
  digitShiftBy: yShift].
  ^high normalize + mid normalize + low normalize!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByFourth (in category 'private') -----
  squaredByFourth
  "Use a 4-way Toom-Cook divide and conquer algorithm to perform the multiplication.
  See Asymmetric Squaring Formulae Jaewook Chung and M. Anwar Hasan
  https://www.lirmm.fr/arith18/papers/Chung-Squaring.pdf"
 
  | p a0 a1 a2 a3 a02 a13 s0 s1 s2 s3 s4 s5 s6 t2 t3 |
+ "divide in 4 parts, rounded to upper multiple of 4"
+ p := (self digitLength + 15 bitShift: -4) bitShift: 2.
- "divide in 4 parts"
- p := (self digitLength + 3 bitShift: -2) bitClear: 2r11.
  a3 := self butLowestNDigits: p * 3.
  a2 := self copyDigitsFrom: p * 2 + 1 to: p * 3.
  a1 := self copyDigitsFrom: p + 1 to: p * 2.
  a0 := self lowestNDigits: p.
 
  "Toom-4 trick: 7 multiplications instead of 16"
  a02 := a0 - a2.
  a13 := a1 - a3.
  s0 := a0 squared.
  s1 := (a0 * a1) bitShift: 1.
  s2 := (a02 + a13) * (a02 - a13).
  s3 := ((a0 + a1) + (a2 + a3)) squared.
  s4 := (a02 * a13) bitShift: 1.
  s5 := (a3 * a2) bitShift: 1.
  s6 := a3 squared.
 
  "Interpolation"
  t2 := s1 + s5.
  t3 := (s2 + s3 + s4 bitShift: -1) - t2.
  s3 := t2 - s4.
  s4 := t3 - s0.
  s2 := t3 - s2 - s6.
 
  "Sum the parts of decomposition"
  ^s0 + (s1 bitShift: 8*p) + (s2 + (s3 bitShift: 8*p) bitShift: 16*p)
  +(s4 + (s5 bitShift: 8*p) + (s6 bitShift: 16*p) bitShift: 32*p)
 
  "
  | a |
  a := 770 factorial-1.
  a digitLength.
  [a * a - a squaredToom4 = 0] assert.
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredToom4]] timeToRun] value /
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
  "!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByHalf (in category 'private') -----
  squaredByHalf
  "Use a divide and conquer algorithm to perform the multiplication.
  Split in two parts like Karatsuba, but economize 2 additions by using asymetrical product."
 
  | half xHigh xLow low high mid |
 
+ "Divide digits in two halves rounded tp upper multiple of 4"
+ half := (self digitLength + 1 bitShift: -3) bitShift: 2.
- "Divide digits in two halves"
- half := self digitLength + 1 // 2 bitClear: 2r11.
  xLow := self lowestNDigits: half.
  xHigh := self butLowestNDigits: half.
 
  "eventually use karatsuba"
  low := xLow squared.
  high := xHigh squared.
  mid := xLow multiplyByInteger: xHigh.
 
  "Sum the parts of decomposition"
  ^(high bitShift: 16*half)
  inplaceAddNonOverlapping: low digitShiftBy: 0;
  + (mid bitShift: 8*half+1)
 
  "
  | a |
  a := 440 factorial-1.
  a digitLength.
  self assert: a * a - a squaredKaratsuba = 0.
  [Smalltalk garbageCollect.
  [2000 timesRepeat: [a squaredKaratsuba]] timeToRun] value /
  [Smalltalk garbageCollect.
  [2000 timesRepeat: [a * a]] timeToRun] value asFloat
  "!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByThird (in category 'private') -----
  squaredByThird
  "Use a 3-way Toom-Cook divide and conquer algorithm to perform the multiplication"
 
  | third x0 x1 x2 x20 z0 z1 z2 z3 z4 |
+ "divide in 3 parts, rounded to upper multiple of 4"
+ third := self digitLength + 11 // 3 bitShift: 2.
- "divide in 3 parts"
- third := self digitLength + 2 // 3 bitClear: 2r11.
  x2 := self butLowestNDigits: third * 2.
  x1 := self copyDigitsFrom: third + 1 to: third * 2.
  x0 := self lowestNDigits: third.
 
  "Toom-3 trick: 5 multiplications instead of 9"
  z0 := x0 squared.
  z4 := x2 squared.
  x20 := x2 + x0.
  z1 := (x20 + x1) squared.
  x20 := x20 - x1.
  z2 := x20 squared.
  z3 := ((x20 + x2 bitShift: 1) - x0) squared.
 
  "Sum the parts of decomposition"
  z3 := z3 - z1 quo: 3.
  z1 := z1 - z2 bitShift: -1.
  z2 := z2 - z0.
 
  z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
  z2 := z2 + z1 - z4.
  z1 := z1 - z3.
  ^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)
 
  "
  | a |
  a := 1400 factorial-1.
  a digitLength.
  self assert: a * a - a squaredToom3 = 0.
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredToom3]] timeToRun] value /
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
  "!