# The Inbox: Kernel-nice.1235.mcz

## The Inbox: Kernel-nice.1235.mcz

Nicolas Cellier uploaded a new version of Kernel to project The Inbox:
http://source.squeak.org/inbox/Kernel-nice.1235.mcz

==================== Summary ====================

Name: Kernel-nice.1235
Author: nice
Time: 16 May 2019, 12:08:27.320094 am
UUID: dd74b668-0767-441e-a952-5c73af6327be
Ancestors: Kernel-nice.1234

Round accelerated arithmetic chunks to upper multiple of 4 bytes rather than to lower.

I believe that this marginally improves the performance because it's a tiny bit better to recompose a longer least significant chunk with a shorter most significant chunk.

If someone wants to confirm...
It's better to tune the threshold before benchmarking.
See LargeArithmeticBench from http://www.squeaksource.com/STEM.html and http://smallissimo.blogspot.com/2019/05 blog for details.

=============== Diff against Kernel-nice.1234 ===============

Item was changed:

      ----- Method: LargePositiveInteger>>digitDiv21: (in category 'private') -----
      digitDiv21: anInteger
          "This is part of the recursive division algorithm from Burnikel - Ziegler
          Divide a two limbs receiver by 1 limb dividend
          Each limb is decomposed in two halves of p bytes (8*p bits)
          so as to continue the recursion"

          | p qr1 qr2 |
    +     "split in two parts, rounded to upper multiple of 4"
    +     p := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
    -     p := (anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
          p < self class thresholdForDiv21 ifTrue: [^(self digitDiv: anInteger neg: false) collect: #normalize].
          qr1 := (self butLowestNDigits: p) digitDiv32: anInteger.
          qr2 := (self lowestNDigits: p) + (qr1 last bitShift: 8*p) digitDiv32: anInteger.
          qr2 at: 1 put: (qr2 at: 1) + ((qr1 at: 1) bitShift: 8*p).
          ^qr2!

Item was changed:

      ----- Method: LargePositiveInteger>>digitDiv32: (in category 'private') -----
      digitDiv32: anInteger
          "This is part of the recursive division algorithm from Burnikel - Ziegler
          Divide 3 limb (a2,a1,a0) by 2 limb (b1,b0).
          Each limb is made of p bytes (8*p bits).
          This step transforms the division problem into multiplication
          It must use a fast multiplyByInteger: to be worth the overhead costs."

          | a2 b1 d p q qr r |
    +     "split in two parts, rounded to upper multiple of 4"
    +     p :=(anInteger digitLength + 7 bitShift: -3) bitShift: 2.
    -     p :=(anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
          (a2 := self butLowestNDigits: 2*p)
              < (b1 := anInteger butLowestNDigits: p)
                  ifTrue:
                      [qr := (self butLowestNDigits: p) digitDiv21: b1.
                      q := qr first.
                      r := qr last]
                  ifFalse:
                      [q := (1 bitShift: 8*p) - 1.
                      r := (self butLowestNDigits: p) - (b1 bitShift: 8*p) + b1].
          d := q * (anInteger lowestNDigits: p).
          r := (self lowestNDigits: p) + (r bitShift: 8*p) - d.
          [r < 0]
              whileTrue:
                  [q := q - 1.
                  r := r + anInteger].
          ^Array with: q with: r
          !

Item was changed:

      ----- Method: LargePositiveInteger>>digitMul22: (in category 'private') -----
      digitMul22: anInteger
          "Multiply after decomposing each operand in two parts, using Karatsuba algorithm.
          Karatsuba perform only 3 multiplications, leading to a cost O(n^3 log2)
          asymptotically better than super O(n^2) for large number of digits n.
          See https://en.wikipedia.org/wiki/Karatsuba_algorithm"

          | half xLow xHigh yLow yHigh low mid high |
    +     "split each in two parts, rounded to upper multiple of 4"
    +     half := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
    -     "Divide each integer in two halves"
    -     half := (anInteger digitLength + 1 bitShift: -1)  bitClear: 2r11.
          xLow := self lowestNDigits: half.
          xHigh := self butLowestNDigits: half.
          yLow := anInteger lowestNDigits: half.
          yHigh := anInteger butLowestNDigits: half.

          "Karatsuba trick: perform with 3 multiplications instead of 4"
          low := xLow multiplyByInteger: yLow.
          high := xHigh multiplyByInteger: yHigh.
          mid := high + low + (xHigh - xLow multiplyByInteger: yLow - yHigh).

          "Sum the parts of decomposition"
          ^(high isZero
              ifTrue: [low]
              ifFalse: [(high bitShift: 16*half)
                  inplaceAddNonOverlapping: low digitShiftBy: 0])
              + (mid bitShift: 8*half)!

Item was changed:

      ----- Method: LargePositiveInteger>>digitMul23: (in category 'private') -----
      digitMul23: anInteger
          "Multiply after decomposing the receiver in 2 parts, and multiplicand in 3 parts.
          Only invoke when anInteger digitLength between: 3/2 and 5/2 self digitLength.
          This is a variant of Toom-Cook algorithm (see digitMul33:)"

          | half x1 x0 y2 y1 y0 y20 z3 z2 z1 z0 |
    +     "divide self in 2 and operand in 3 parts, rounded to upper multiple of 4"
    +     half := ( self digitLength + 7 bitShift: -3) bitShift: 2.
    -     "divide self in 2 and operand in 3 parts"
    -     half := ( self digitLength + 1 bitShift: -1) bitClear: 2r11.
          x1 := self butLowestNDigits: half.
          x0 := self lowestNDigits: half.
          y2 := anInteger butLowestNDigits: half * 2.
          y1 := anInteger copyDigitsFrom: half + 1 to: half * 2.
          y0 := anInteger lowestNDigits: half.

          "Toom trick: 4 multiplications instead of 6"
          y20 := y2 + y0.
          z3 := x1 multiplyByInteger: y2.
          z2 := x0 - x1 multiplyByInteger: y20 - y1.
          z1 := x0 + x1 multiplyByInteger: y20 + y1.
          z0 := x0 multiplyByInteger: y0.

          "Sum the parts of decomposition"
          ^z0 + ((z1 - z2 bitShift: -1) - z3 bitShift: 8*half)
              + (((z1 + z2 bitShift: -1) - z0) + (z3 bitShift: 8*half) bitShift: 16 * half)!

Item was changed:

      ----- Method: LargePositiveInteger>>digitMul33: (in category 'private') -----
      digitMul33: anInteger
          "Multiply after decomposing each operand in 3 parts, using a Toom-Cooke algorithm.
          Toom-Cooke is a generalization of Karatsuba divide and conquer algorithm.
          See https://en.wikipedia.org/wiki/Toom%E2%80%93Cook_multiplication
          Use a Bodrato-Zanoni variant for the choice of interpolation points and matrix inversion
          See What about Toom-Cook matrices optimality? - Marco Bodrato, Alberto Zanoni - Oct. 2006
          http://www.bodrato.it/papers/WhatAboutToomCookMatricesOptimality.pdf"

          | third x2 x1 x0 y2 y1 y0 y20 z4 z3 z2 z1 z0 x20 |
    +     "divide both operands in 3 parts, rounded to upper multiple of 4"
    +     third := anInteger digitLength + 11 // 12 bitShift: 2.
    -     "divide both operands in 3 parts"
    -     third := anInteger digitLength + 2 // 3 bitClear: 2r11.
          x2 := self butLowestNDigits: third * 2.
          x1 := self copyDigitsFrom: third + 1 to: third * 2.
          x0 := self lowestNDigits: third.
          y2 := anInteger butLowestNDigits: third * 2.
          y1 := anInteger copyDigitsFrom: third + 1 to: third * 2.
          y0 := anInteger lowestNDigits: third.

          "Toom-3 trick: 5 multiplications instead of 9"
          z0 := x0 multiplyByInteger: y0.
          z4 := x2 multiplyByInteger: y2.
          x20 := x2 + x0.
          y20 := y2 + y0.
          z1 := x20 + x1 multiplyByInteger: y20 + y1.
          x20 := x20 - x1.
          y20 := y20 - y1.
          z2 := x20 multiplyByInteger: y20.
          z3 := (x20 + x2 bitShift: 1) - x0 multiplyByInteger: (y20 + y2 bitShift: 1) - y0.

          "Sum the parts of decomposition"
          z3 := z3 - z1 quo: 3.
          z1 := z1 - z2 bitShift: -1.
          z2 := z2 - z0.

          z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
          z2 := z2 + z1 - z4.
          z1 := z1 - z3.
          ^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)!

Item was changed:

      ----- Method: LargePositiveInteger>>digitMulSplit: (in category 'private') -----
      digitMulSplit: anInteger
          "multiply digits when self and anInteger have not well balanced digitlength.
          in this case, it is better to split the largest (anInteger) in several parts and recompose"

          | xLen yLen split q r high mid low sizes |
          yLen := anInteger digitLength.
          xLen := self digitLength.
    +     "divide in about 1.5 xLen, rounded to upper multiple of 4"
    +     split := (xLen * 3 + 7 bitShift: -3) bitShift: 2.
    -     split := (xLen * 3 + 2 bitShift: -1) bitClear: 2r11.

          "Arrange to sum non overlapping parts"
          q := yLen // split.
          q < 3 ifTrue: [^(0 to: yLen - 1 by: split) detectSum: [:yShift | (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split)) bitShift: 8 * yShift]].
          r := yLen \\ split.
          "allocate enough bytes, but not too much, in order to minimise normalize cost;
          we could allocate xLen + yLen for each one as well"
          sizes := {q-1*split. q*split. q*split+r}.
          low  := Integer new: (sizes atWrap: 0 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
          mid := Integer new:  (sizes atWrap: 1 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
          high := Integer new: (sizes atWrap: 2 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
          0 to: yLen - 1 by: 3 * split do: [:yShift |
              low
                  inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
                  digitShiftBy: yShift].
          split to: yLen - 1 by: 3 * split do: [:yShift |
              mid
                  inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
                  digitShiftBy: yShift].
          split * 2 to: yLen - 1 by: 3 * split do: [:yShift |
              high
                  inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
                  digitShiftBy: yShift].
          ^high normalize + mid normalize + low normalize!

Item was changed:

      ----- Method: LargePositiveInteger>>squaredByFourth (in category 'private') -----
      squaredByFourth
          "Use a 4-way Toom-Cook divide and conquer algorithm to perform the multiplication.
          See Asymmetric Squaring Formulae Jaewook Chung and M. Anwar Hasan
          https://www.lirmm.fr/arith18/papers/Chung-Squaring.pdf"

          | p a0 a1 a2 a3 a02 a13 s0 s1 s2 s3 s4 s5 s6 t2 t3 |
    +     "divide in 4 parts, rounded to upper multiple of 4"
    +     p := (self digitLength + 15 bitShift: -4) bitShift: 2.
    -     "divide in 4 parts"
    -     p := (self digitLength + 3 bitShift: -2) bitClear: 2r11.
          a3 := self butLowestNDigits: p * 3.
          a2 := self copyDigitsFrom: p * 2 + 1 to: p * 3.
          a1 := self copyDigitsFrom: p + 1 to: p * 2.
          a0 := self lowestNDigits: p.

          "Toom-4 trick: 7 multiplications instead of 16"
          a02 := a0 - a2.
          a13 := a1 - a3.
          s0 := a0 squared.
          s1 := (a0 * a1) bitShift: 1.
          s2 := (a02 + a13) * (a02 - a13).
          s3 := ((a0 + a1) + (a2 + a3)) squared.
          s4 := (a02 * a13) bitShift: 1.
          s5 := (a3 * a2) bitShift: 1.
          s6 := a3 squared.

          "Interpolation"
          t2 := s1 + s5.
          t3 := (s2 + s3 + s4 bitShift: -1) - t2.
          s3 := t2 - s4.
          s4 := t3 - s0.
          s2 := t3 - s2 - s6.

          "Sum the parts of decomposition"
          ^s0 + (s1 bitShift: 8*p) + (s2 + (s3 bitShift: 8*p) bitShift: 16*p)
          +(s4 + (s5 bitShift: 8*p) + (s6 bitShift: 16*p) bitShift: 32*p)

          "
          | a |
          a := 770 factorial-1.
          a digitLength.
          [a * a - a squaredToom4 = 0] assert.
          [Smalltalk garbageCollect.
          [1000 timesRepeat: [a squaredToom4]] timeToRun] value /
          [Smalltalk garbageCollect.
          [1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
          "!

Item was changed:

      ----- Method: LargePositiveInteger>>squaredByHalf (in category 'private') -----
      squaredByHalf
          "Use a divide and conquer algorithm to perform the multiplication.
          Split in two parts like Karatsuba, but economize 2 additions by using asymetrical product."

          | half xHigh xLow low high mid |

    +     "Divide digits in two halves rounded tp upper multiple of 4"
    +     half := (self digitLength + 1 bitShift: -3) bitShift: 2.
    -     "Divide digits in two halves"
    -     half := self digitLength + 1 // 2 bitClear: 2r11.
          xLow := self lowestNDigits: half.
          xHigh := self butLowestNDigits: half.

          "eventually use karatsuba"
          low := xLow squared.
          high := xHigh squared.
          mid := xLow multiplyByInteger: xHigh.

          "Sum the parts of decomposition"
          ^(high bitShift: 16*half)
              inplaceAddNonOverlapping: low digitShiftBy: 0;
              + (mid bitShift: 8*half+1)

          "
          | a |
          a := 440 factorial-1.
          a digitLength.
          self assert: a * a - a squaredKaratsuba = 0.
          [Smalltalk garbageCollect.
          [2000 timesRepeat: [a squaredKaratsuba]] timeToRun] value /
          [Smalltalk garbageCollect.
          [2000 timesRepeat: [a * a]] timeToRun] value asFloat
          "!

Item was changed:

      ----- Method: LargePositiveInteger>>squaredByThird (in category 'private') -----
      squaredByThird
          "Use a 3-way Toom-Cook divide and conquer algorithm to perform the multiplication"

          | third x0 x1 x2 x20 z0 z1 z2 z3 z4 |
    +     "divide in 3 parts, rounded to upper multiple of 4"
    +     third := self digitLength + 11 // 12 bitShift: 2.
    -     "divide in 3 parts"
    -     third := self digitLength + 2 // 3 bitClear: 2r11.
          x2 := self butLowestNDigits: third * 2.
          x1 := self copyDigitsFrom: third + 1 to: third * 2.
          x0 := self lowestNDigits: third.

          "Toom-3 trick: 5 multiplications instead of 9"
          z0 := x0 squared.
          z4 := x2 squared.
          x20 := x2 + x0.
          z1 := (x20 + x1) squared.
          x20 := x20 - x1.
          z2 := x20 squared.
          z3 := ((x20 + x2 bitShift: 1) - x0) squared.

          "Sum the parts of decomposition"
          z3 := z3 - z1 quo: 3.
          z1 := z1 - z2 bitShift: -1.
          z2 := z2 - z0.

          z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
          z2 := z2 + z1 - z4.
          z1 := z1 - z3.
          ^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)

          "
          | a |
          a := 1400 factorial-1.
          a digitLength.
          self assert: a * a - a squaredToom3 = 0.
          [Smalltalk garbageCollect.
          [1000 timesRepeat: [a squaredToom3]] timeToRun] value /
          [Smalltalk garbageCollect.
          [1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
          "!
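
To make the rounding change concrete, here is a minimal workspace sketch that compares the previous two-way split size with the new one for a few digit lengths. It is not part of the commit: the block names `oldHalf` and `newHalf` and the sample lengths are chosen purely for illustration, and the expressions are copied from the removed and added lines of `digitMul22:`.

```smalltalk
"Illustrative only: oldHalf and newHalf are hypothetical names, not part of Kernel-nice.1235."
| oldHalf newHalf |
"Previous formula: half the digit length, rounded DOWN to a multiple of 4 bytes"
oldHalf := [:n | (n + 1 bitShift: -1) bitClear: 2r11].
"New formula: half the digit length, rounded UP to a multiple of 4 bytes"
newHalf := [:n | (n + 7 bitShift: -3) bitShift: 2].
"For each sample digit length, answer: length, old low chunk size, new low chunk size,
 and the size left for the most significant chunk under the new formula"
#(9 13 16 20 100 101) collect: [:n |
	{n. oldHalf value: n. newHalf value: n. n - (newHalf value: n)}]
```

Evaluated with "print it", a 20-byte operand gives a low chunk of 8 bytes under the old formula and 12 bytes under the new one, leaving 8 bytes for the high chunk. A 16-byte operand gives 8 in both cases. This matches the commit message: when the length is not a multiple of 8, the least significant chunk becomes the longer one.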