Nicolas Cellier uploaded a new version of Kernel to project The Trunk:
http://source.squeak.org/trunk/Kernel-nice.1235.mcz

==================== Summary ====================

Name: Kernel-nice.1235
Author: nice
Time: 16 May 2019, 12:08:27.320094 am
UUID: dd74b668-0767-441e-a952-5c73af6327be
Ancestors: Kernel-nice.1234

Round accelerated arithmetic chunks to upper multiple of 4 bytes rather than to lower.

I believe that this marginally improves the performance because it's a tiny bit better to recompose a longer least significant chunk with a shorter most significant chunk.

If someone wants to confirm...
It's better to tune the threshold before benchmarking.
See LargeArithmeticBench from http://www.squeaksource.com/STEM.html and http://smallissimo.blogspot.com/2019/05 blog for details.

=============== Diff against Kernel-nice.1234 ===============

Item was changed:
  ----- Method: LargePositiveInteger>>digitDiv21: (in category 'private') -----
  digitDiv21: anInteger
      "This is part of the recursive division algorithm from Burnikel - Ziegler
      Divide a two limbs receiver by 1 limb dividend
      Each limb is decomposed in two halves of p bytes (8*p bits)
      so as to continue the recursion"

      | p qr1 qr2 |
+     "split in two parts, rounded to upper multiple of 4"
+     p := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
-     p := (anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
      p < self class thresholdForDiv21 ifTrue: [^(self digitDiv: anInteger neg: false) collect: #normalize].
      qr1 := (self butLowestNDigits: p) digitDiv32: anInteger.
      qr2 := (self lowestNDigits: p) + (qr1 last bitShift: 8*p) digitDiv32: anInteger.
      qr2 at: 1 put: (qr2 at: 1) + ((qr1 at: 1) bitShift: 8*p).
      ^qr2!

Item was changed:
  ----- Method: LargePositiveInteger>>digitDiv32: (in category 'private') -----
  digitDiv32: anInteger
      "This is part of the recursive division algorithm from Burnikel - Ziegler
      Divide 3 limb (a2,a1,a0) by 2 limb (b1,b0).
      Each limb is made of p bytes (8*p bits).
      This step transforms the division problem into multiplication
      It must use a fast multiplyByInteger: to be worth the overhead costs."

      | a2 b1 d p q qr r |
+     "split in two parts, rounded to upper multiple of 4"
+     p :=(anInteger digitLength + 7 bitShift: -3) bitShift: 2.
-     p :=(anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
      (a2 := self butLowestNDigits: 2*p) < (b1 := anInteger butLowestNDigits: p)
          ifTrue:
              [qr := (self butLowestNDigits: p) digitDiv21: b1.
              q := qr first.
              r := qr last]
          ifFalse:
              [q := (1 bitShift: 8*p) - 1.
              r := (self butLowestNDigits: p) - (b1 bitShift: 8*p) + b1].
      d := q * (anInteger lowestNDigits: p).
      r := (self lowestNDigits: p) + (r bitShift: 8*p) - d.
      [r < 0]
          whileTrue:
              [q := q - 1.
              r := r + anInteger].
      ^Array with: q with: r
      !

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul22: (in category 'private') -----
  digitMul22: anInteger
      "Multiply after decomposing each operand in two parts, using Karatsuba algorithm.
      Karatsuba perform only 3 multiplications, leading to a cost O(n^3 log2)
      asymptotically better than super O(n^2) for large number of digits n.
      See https://en.wikipedia.org/wiki/Karatsuba_algorithm"

      | half xLow xHigh yLow yHigh low mid high |
+     "split each in two parts, rounded to upper multiple of 4"
+     half := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
-     "Divide each integer in two halves"
-     half := (anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
      xLow := self lowestNDigits: half.
      xHigh := self butLowestNDigits: half.
      yLow := anInteger lowestNDigits: half.
      yHigh := anInteger butLowestNDigits: half.

      "Karatsuba trick: perform with 3 multiplications instead of 4"
      low := xLow multiplyByInteger: yLow.
      high := xHigh multiplyByInteger: yHigh.
      mid := high + low + (xHigh - xLow multiplyByInteger: yLow - yHigh).

      "Sum the parts of decomposition"
      ^(high isZero
          ifTrue: [low]
          ifFalse: [(high bitShift: 16*half)
              inplaceAddNonOverlapping: low digitShiftBy: 0])
          + (mid bitShift: 8*half)!

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul23: (in category 'private') -----
  digitMul23: anInteger
      "Multiply after decomposing the receiver in 2 parts, and multiplicand in 3 parts.
      Only invoke when anInteger digitLength between: 3/2 and 5/2 self digitLength.
      This is a variant of Toom-Cook algorithm (see digitMul33:)"

      | half x1 x0 y2 y1 y0 y20 z3 z2 z1 z0 |
+     "divide self in 2 and operand in 3 parts, rounded to upper multiple of 4"
+     half := ( self digitLength + 7 bitShift: -3) bitShift: 2.
-     "divide self in 2 and operand in 3 parts"
-     half := ( self digitLength + 1 bitShift: -1) bitClear: 2r11.
      x1 := self butLowestNDigits: half.
      x0 := self lowestNDigits: half.
      y2 := anInteger butLowestNDigits: half * 2.
      y1 := anInteger copyDigitsFrom: half + 1 to: half * 2.
      y0 := anInteger lowestNDigits: half.

      "Toom trick: 4 multiplications instead of 6"
      y20 := y2 + y0.
      z3 := x1 multiplyByInteger: y2.
      z2 := x0 - x1 multiplyByInteger: y20 - y1.
      z1 := x0 + x1 multiplyByInteger: y20 + y1.
      z0 := x0 multiplyByInteger: y0.

      "Sum the parts of decomposition"
      ^z0 + ((z1 - z2 bitShift: -1) - z3 bitShift: 8*half)
          + (((z1 + z2 bitShift: -1) - z0) + (z3 bitShift: 8*half) bitShift: 16 * half)!

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul33: (in category 'private') -----
  digitMul33: anInteger
      "Multiply after decomposing each operand in 3 parts, using a Toom-Cooke algorithm.
      Toom-Cooke is a generalization of Karatsuba divide and conquer algorithm.
      See https://en.wikipedia.org/wiki/Toom%E2%80%93Cook_multiplication
      Use a Bodrato-Zanoni variant for the choice of interpolation points and matrix inversion
      See What about Toom-Cook matrices optimality? - Marco Bodrato, Alberto Zanoni - Oct. 2006
      http://www.bodrato.it/papers/WhatAboutToomCookMatricesOptimality.pdf"

      | third x2 x1 x0 y2 y1 y0 y20 z4 z3 z2 z1 z0 x20 |
+     "divide both operands in 3 parts, rounded to upper multiple of 4"
+     third := anInteger digitLength + 11 // 12 bitShift: 2.
-     "divide both operands in 3 parts"
-     third := anInteger digitLength + 2 // 3 bitClear: 2r11.
      x2 := self butLowestNDigits: third * 2.
      x1 := self copyDigitsFrom: third + 1 to: third * 2.
      x0 := self lowestNDigits: third.
      y2 := anInteger butLowestNDigits: third * 2.
      y1 := anInteger copyDigitsFrom: third + 1 to: third * 2.
      y0 := anInteger lowestNDigits: third.

      "Toom-3 trick: 5 multiplications instead of 9"
      z0 := x0 multiplyByInteger: y0.
      z4 := x2 multiplyByInteger: y2.
      x20 := x2 + x0.
      y20 := y2 + y0.
      z1 := x20 + x1 multiplyByInteger: y20 + y1.
      x20 := x20 - x1.
      y20 := y20 - y1.
      z2 := x20 multiplyByInteger: y20.
      z3 := (x20 + x2 bitShift: 1) - x0 multiplyByInteger: (y20 + y2 bitShift: 1) - y0.

      "Sum the parts of decomposition"
      z3 := z3 - z1 quo: 3.
      z1 := z1 - z2 bitShift: -1.
      z2 := z2 - z0.

      z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
      z2 := z2 + z1 - z4.
      z1 := z1 - z3.
      ^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)!

Item was changed:
  ----- Method: LargePositiveInteger>>digitMulSplit: (in category 'private') -----
  digitMulSplit: anInteger
      "multiply digits when self and anInteger have not well balanced digitlength.
      in this case, it is better to split the largest (anInteger) in several parts and recompose"

      | xLen yLen split q r high mid low sizes |
      yLen := anInteger digitLength.
      xLen := self digitLength.
+     "divide in about 1.5 xLen, rounded to upper multiple of 4"
+     split := (xLen * 3 + 7 bitShift: -3) bitShift: 2.
-     split := (xLen * 3 + 2 bitShift: -1) bitClear: 2r11.

      "Arrange to sum non overlapping parts"
      q := yLen // split.
      q < 3 ifTrue: [^(0 to: yLen - 1 by: split) detectSum: [:yShift | (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split)) bitShift: 8 * yShift]].
      r := yLen \\ split.
      "allocate enough bytes, but not too much, in order to minimise normalize cost;
      we could allocate xLen + yLen for each one as well"
      sizes := {q-1*split. q*split. q*split+r}.
      low  := Integer new: (sizes atWrap: 0 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
      mid := Integer new: (sizes atWrap: 1 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
      high := Integer new: (sizes atWrap: 2 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
      0 to: yLen - 1 by: 3 * split do: [:yShift |
          low
              inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
              digitShiftBy: yShift].
      split to: yLen - 1 by: 3 * split do: [:yShift |
          mid
              inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
              digitShiftBy: yShift].
      split * 2 to: yLen - 1 by: 3 * split do: [:yShift |
          high
              inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
              digitShiftBy: yShift].
      ^high normalize + mid normalize + low normalize!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByFourth (in category 'private') -----
  squaredByFourth
      "Use a 4-way Toom-Cook divide and conquer algorithm to perform the multiplication.
      See Asymmetric Squaring Formulae Jaewook Chung and M. Anwar Hasan
      https://www.lirmm.fr/arith18/papers/Chung-Squaring.pdf"

      | p a0 a1 a2 a3 a02 a13 s0 s1 s2 s3 s4 s5 s6 t2 t3 |
+     "divide in 4 parts, rounded to upper multiple of 4"
+     p := (self digitLength + 15 bitShift: -4) bitShift: 2.
-     "divide in 4 parts"
-     p := (self digitLength + 3 bitShift: -2) bitClear: 2r11.
      a3 := self butLowestNDigits: p * 3.
      a2 := self copyDigitsFrom: p * 2 + 1 to: p * 3.
      a1 := self copyDigitsFrom: p + 1 to: p * 2.
      a0 := self lowestNDigits: p.

      "Toom-4 trick: 7 multiplications instead of 16"
      a02 := a0 - a2.
      a13 := a1 - a3.
      s0 := a0 squared.
      s1 := (a0 * a1) bitShift: 1.
      s2 := (a02 + a13) * (a02 - a13).
      s3 := ((a0 + a1) + (a2 + a3)) squared.
      s4 := (a02 * a13) bitShift: 1.
      s5 := (a3 * a2) bitShift: 1.
      s6 := a3 squared.

      "Interpolation"
      t2 := s1 + s5.
      t3 := (s2 + s3 + s4 bitShift: -1) - t2.
      s3 := t2 - s4.
      s4 := t3 - s0.
      s2 := t3 - s2 - s6.

      "Sum the parts of decomposition"
      ^s0 + (s1 bitShift: 8*p) + (s2 + (s3 bitShift: 8*p) bitShift: 16*p)
          +(s4 + (s5 bitShift: 8*p) + (s6 bitShift: 16*p) bitShift: 32*p)

  "
  | a |
  a := 770 factorial-1.
  a digitLength.
  [a * a - a squaredToom4 = 0] assert.
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredToom4]] timeToRun] value /
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
  "!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByHalf (in category 'private') -----
  squaredByHalf
      "Use a divide and conquer algorithm to perform the multiplication.
      Split in two parts like Karatsuba, but economize 2 additions by using asymetrical product."

      | half xHigh xLow low high mid |

+     "Divide digits in two halves rounded tp upper multiple of 4"
+     half := (self digitLength + 1 bitShift: -3) bitShift: 2.
-     "Divide digits in two halves"
-     half := self digitLength + 1 // 2 bitClear: 2r11.
      xLow := self lowestNDigits: half.
      xHigh := self butLowestNDigits: half.

      "eventually use karatsuba"
      low := xLow squared.
      high := xHigh squared.
      mid := xLow multiplyByInteger: xHigh.

      "Sum the parts of decomposition"
      ^(high bitShift: 16*half)
          inplaceAddNonOverlapping: low digitShiftBy: 0;
          + (mid bitShift: 8*half+1)

  "
  | a |
  a := 440 factorial-1.
  a digitLength.
  self assert: a * a - a squaredKaratsuba = 0.
  [Smalltalk garbageCollect.
  [2000 timesRepeat: [a squaredKaratsuba]] timeToRun] value /
  [Smalltalk garbageCollect.
  [2000 timesRepeat: [a * a]] timeToRun] value asFloat
  "!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByThird (in category 'private') -----
  squaredByThird
      "Use a 3-way Toom-Cook divide and conquer algorithm to perform the multiplication"

      | third x0 x1 x2 x20 z0 z1 z2 z3 z4 |
+     "divide in 3 parts, rounded to upper multiple of 4"
+     third := self digitLength + 11 // 3 bitShift: 2.
-     "divide in 3 parts"
-     third := self digitLength + 2 // 3 bitClear: 2r11.
      x2 := self butLowestNDigits: third * 2.
      x1 := self copyDigitsFrom: third + 1 to: third * 2.
      x0 := self lowestNDigits: third.

      "Toom-3 trick: 5 multiplications instead of 9"
      z0 := x0 squared.
      z4 := x2 squared.
      x20 := x2 + x0.
      z1 := (x20 + x1) squared.
      x20 := x20 - x1.
      z2 := x20 squared.
      z3 := ((x20 + x2 bitShift: 1) - x0) squared.

      "Sum the parts of decomposition"
      z3 := z3 - z1 quo: 3.
      z1 := z1 - z2 bitShift: -1.
      z2 := z2 - z0.

      z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
      z2 := z2 + z1 - z4.
      z1 := z1 - z3.
      ^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)

  "
  | a |
  a := 1400 factorial-1.
  a digitLength.
  self assert: a * a - a squaredToom3 = 0.
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredToom3]] timeToRun] value /
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
  "!
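
For readers who want to see what "upper multiple of 4 rather than lower" means for the two-way split, here is a small workspace sketch. It is an illustration only and not part of the commit: the byte lengths are arbitrary sample values, and the two expressions are copied from the removed and added lines of digitMul22: with n standing in for anInteger digitLength.

```smalltalk
"Illustration only, not part of Kernel-nice.1235.
 n is a hypothetical operand length in bytes.
 old: half of n, rounded down to a multiple of 4 (removed line).
 new: half of n, rounded up to a multiple of 4 (added line)."
#(10 16 21 100) collect: [:n |
    | old new |
    old := (n + 1 bitShift: -1) bitClear: 2r11.
    new := (n + 7 bitShift: -3) bitShift: 2.
    {n. old. new}]
```

Evaluating this with "print it" should give the triples (10 4 8), (16 8 8), (21 8 12) and (100 48 52). With the new formula the least significant chunk is never shorter than the most significant one, which matches the reasoning in the summary; for n = 10 the split becomes 8 low bytes and 2 high bytes instead of 4 and 6.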