On Tue, 21 May 2019 at 18:55, Nicolas Cellier <[hidden email]> wrote:

I have updated Smallapack to version 1.6.1 so as to accelerate sum.
| a b c |
a := LapackSGEMatrix randNormal: #(10000 1).
b := a as: FloatArray.
c := a asAbstractMatrix.
{a sum. b sum. c sum.}.
{[a sum] bench. [b sum] bench. [c sum] bench.}.

'27,500 per second. 36.3 microseconds per run.' "LapackMatrix"
'117,000 per second. 8.54 microseconds per run.' "FloatArray"
'1,180 per second. 845 microseconds per run.' "Un-accelerated AbstractMatrix"
I measured with single precision for everything so as to have fair comparisons.
As you can see, LapackMatrix sum is still slower than FloatArray sum. This is because we have to create a Matrix of ones first (xLASET) before calling the BLAS dot-product primitive (xDOTU).
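The idea, spelled out with selectors already used in this thread (a sketch of the idea only, not the actual Smallapack code; it reuses a from the snippet above):

ones := LapackSGEMatrix nrow: 1 ncol: a nrow withAll: 1.0. "the row of ones that xLASET fills"
sumViaOnes := (ones * a) at: 1. "the product then goes through BLAS (xDOTU or xGEMM)"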
In Squeak, a profile can be obtained through (I guess it is not much different in Pharo):

AndreasSystemProfiler spyOn: [[a sum] bench].
which gives:
**Leaves**
37.19 (1,865) LapackSLibrary xlasetWithuplo:m:n:alpha:beta:a:lda:length:
27.12 (1,360) BlasSLibrary dotF2CWithn:x:incx:y:incy:
9.31 (467) Behavior basicNew:
2.42 (121) ByteString class compare:with:collated:
1.43 (72) ByteString hashWithInitialHash:
Note that this is un-accelerated BLAS (no Intel Math Kernel Library or other accelerated version), but that does not make much difference for these BLAS level-1 functions (those with cost O(N)).
But we can avoid that cost and compute all the cumulative sums, or some of them at once with a single Matrix operation (xTRMM / xGEMM):
| a b c |
a := LapackSGEMatrix randUniform: #(10000 1).
b := LapackSGEMatrix nrow: 10000 ncol: 10000 withAll: 1. "huge matrix, that's not cheap, don't even bench it!"
"cumsum"
c := b lowerTriangle * a.
| a b c |
a := LapackSGEMatrix randUniform: #(10000 1).
b := LapackSGEMatrix nrow: 10000 ncol: 10000 withAll: 1. "huge matrix, that's not cheap, don't even bench it!"
"cumsum 1 out of 100"
c := (b lowerTriangle atRows: (100 to: 10000 by: 100)) * a.
| a b c |
a := LapackSGEMatrix randUniform: #(10000 1).
"cheaper construction of partial cum sum"
[b := LapackSGEMatrix rows: ((100 to: a nrow by: 100) collect: [:n |
    (LapackSGEMatrix nrow: 1 ncol: n withAll: 1.0) , (LapackSGEMatrix nrow: 1 ncol: a nrow - n)]).
c := b * a.] bench.
'3.5 per second. 286 milliseconds per run.'
So as you see, with some moderate effort, we have computed 100 partial cumulative sums in about 286 ms, that is 2.86 ms per cumsum on average, not too bad. Maybe not as straightforward as numpy, and maybe still not as fast, but not completely off the mark.
Oops, I posted too fast! ms are not µs!

[c := b * a] bench gives '1,690 per second. 593 microseconds per run.', that is about 6 µs per sum; but including the construction of the multiplier it is way too much (it's not accelerated!).
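To see how much of the 286 ms goes into the construction alone, one could bench just that part (a sketch reusing the same expression as above; no measurement reported here):

[b := LapackSGEMatrix rows: ((100 to: a nrow by: 100) collect: [:n |
    (LapackSGEMatrix nrow: 1 ncol: n withAll: 1.0) , (LapackSGEMatrix nrow: 1 ncol: a nrow - n)])] bench.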
On Tue, 21 May 2019 at 12:51, Jimmie Houchin <[hidden email]> wrote:
Not a problem. I greatly respect other people's time and priorities and their personal lives.

Just for the record, I am using 64-bit Pharo on a fast i7 laptop with 16 GB RAM, running Xubuntu 18.04 64-bit.
I do not remember any problems loading. And within the small amount of experimenting that I did, it seemed to operate fine.
Again, thanks for your contribution. I know it is a lot of work and a pretty large area to cover. Python/Numpy has armies of people working on this.
Jimmie
On 5/21/19 2:54 AM, Nicolas Cellier wrote:
Hi Jimmie,
I didn't take time yesterday to analyze your specific example because it was quite late, but here are some remarks:
1) First, I recommend using 64-bit Pharo, because number crunching and Float operations will be faster (not FloatArray though).
2) It would be nice to use a profiler to analyze where the time is spent; I would not be surprised if (Float readFrom: ...) took a non-negligible percentage of it (see the sketch after remark 4).
3) ExternalDoubleArray only adds overhead if no bulk operation is performed (like reading raw binary data, or serving as a storage area passed to Lapack/BLAS primitives); it does not provide accelerated features by itself. I think it is too low-level to serve as a primary interface (also sketched below).
4) LapackXXXMatrix sum has effectively not been optimized to use BLAS, and this can easily be corrected; thanks for giving this example.
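To make remarks 2) and 3) concrete, here is a small sketch (MessageTally is the stock profiler in both Squeak and Pharo; lines, la and a are the names from your snippet):

"remark 2: profile just the parsing to see what Float readFrom: costs"
MessageTally spyOn: [
    lines do: [:line | Float readFrom: (',' split: line) last]].

"remark 3: the pattern in your commented-out lines is the intended one:
allocate the Lapack matrix once, fill its column while parsing,
then let bulk operations (sum, products) run on the Lapack side"
la := LapackDGEMatrix allocateNrow: lines size ncol: 1.
a := la columnAt: 1.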
With some cooperation, we could easily make some progress; there is low-hanging fruit. But I understand if you prefer to stick with the more mature numpy solution.
Thanks for trying. At least, you were able to load and use Smallapack in Pharo, and this is already good feedback. If you have time, I'll publish a small enhancement for accelerating sum and ask you to retry.
Thanks again
On Tue, 21 May 2019 at 05:13, Serge Stinckwich <[hidden email]> wrote:
There is another solution with my TensorFlow Pharo binding: you can do a matrix multiplication like this:
| graph t1 t2 c1 c2 mult session result |
graph := TF_Graph create.
t1 := TF_Tensor fromFloats: (1 to: 1000000) asArray shape: #(1000 1000).
t2 := TF_Tensor fromFloats: (1 to: 1000000) asArray shape: #(1000 1000).
c1 := graph const: 'c1' value: t1.
c2 := graph const: 'c2' value: t2.
mult := c1 * c2.
session := TF_Session on: graph.
result := session runOutput: (mult output: 0).
result asNumbers
Here I'm doing a multiplication between two matrices of 1000x1000 size in 537 ms on my computer.
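To reproduce such a measurement, one could time just the run with the objects from the snippet above (timeToRun is standard Pharo; this is a sketch, not part of the original measurement):

[result := session runOutput: (mult output: 0)] timeToRun.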
All operations can be done in a graph of operations that is run outside Pharo, so it can be very fast.
Operations can be done on CPU or GPU. 32-bit or 64-bit float operations are possible.
This is a work in progress but can already be used.
Regards,
On Tue, May 21, 2019 at 6:54 AM Jimmie Houchin <[hidden email]> wrote:
I wasn't worried about how to do sliding windows. My problem is that using LapackDGEMatrix in my example was 18x slower than FloatArray, which is slower than Numpy. It isn't what I was expecting.
What I didn't know is whether I was doing something wrong to cause such a tremendous slowdown.
Python and Numpy are not my favorite, but they aren't uncomfortable.
So I gave up and went back to Numpy.
Thanks.
On 5/20/19 5:17 PM, Nicolas Cellier wrote:
Hi Jimmie,
indeed, I did not subscribe...
Having efficient methods for sliding window average is possible; here is how I would do it:
"Create a vector with 100,000 rows filles with
random values (uniform distrubution in [0,1]"
v := LapackDGEMatrix randUniform: #(100000 1).
"extract values from rank 10001 to 20000"
w1 := v atIntervalFrom: 10001 to: 20000 by: 1.
"create a left multiplier matrix for performing
average of w1"
a := LapackDGEMatrix nrow: 1 ncol: w1 nrow
withAll: 1.0 / w1 size.
"get the average (this is a 1x1 matrix from which
we take first element)"
avg1 := (a * w1) at: 1.
[ "select another slice of same size"
w2 := v atIntervalFrom: 15001 to: 25000 by: 1.
"get the average (we can recycle a)"
avg2 := (a * w2) at: 1 ] bench.
This gives:
'16,500 per second. 60.7 microseconds per run.'
versus:
[w2 sum / w2 size] bench.
'1,100 per second. 908 microseconds per run.'
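The same left-multiplier trick extends to weighted averages: instead of a constant row of 1/N, put the weights in the row. A sketch reusing w1 from above; the two-block weighting is only an illustrative choice:

"sketch: weight the more recent half of the window twice as much as the older half (weights sum to 1)"
n := w1 nrow.
aw := (LapackDGEMatrix nrow: 1 ncol: n // 2 withAll: 1.0 / (n * 1.5)) ,
    (LapackDGEMatrix nrow: 1 ncol: n - (n // 2) withAll: 2.0 / (n * 1.5)).
wavg := (aw * w1) at: 1.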
For max and min, it's harder. Lapack/BLAS only provides the max of absolute values as a primitive:

[w2 absMax] bench.
'19,400 per second. 51.5 microseconds per run.'
Everything else will be slower, unless we write new primitives in C and connect them...

[w2 maxOf: [:each | each]] bench.
'984 per second. 1.02 milliseconds per run.'
On 5/16/19 1:26 PM, Nicolas Cellier wrote:
> Any feedback on this?
> Did someone try to use Smallapack in Pharo?
> Jimmie?
I am going to guess that you are not on pharo-users. My bad. I posted this in pharo-users as it wasn't a Pharo development question. I probably should have posted here or emailed you directly.
All I really need is good performance with a simple array of floats. No matrix math. Nothing complicated. Moving averages over a slice of the array. A variety of different averages, weighted, etc. Max/min of the array. But just a single simple array.
Any help greatly appreciated.
Thanks.
On 4/28/19 8:32 PM, Jimmie Houchin wrote:
Hello,
I have installed Smallapack into Pharo 7.0.3. Thanks, Nicolas.
I am very unsure about my use of Smallapack. I am not a mathematician or scientist. However, the only part of Smallapack I am trying to use at the moment is something that would be 64-bit and comparable to FloatArray, so that I can do some simple accessing, slicing, sum, and average on the array.
Here is some sample code I wrote just to play in a playground.
I have ExternalDoubleArray, LapackDGEMatrix, and FloatArray samples. The ones not in use are commented out for any given run.
fp is a download from http://ratedata.gaincapital.com/2018/12%20December/EUR_USD_Week1.zip and unzipped to a directory.
fp := '/home/jimmie/data/EUR_USD_Week1.csv'.
index := 0.
pricesSum := 0.
asum := 0.
ttr := [
    lines := fp asFileReference contents lines allButFirst.
    a := ExternalDoubleArray new: lines size.
    "la := LapackDGEMatrix allocateNrow: lines size ncol: 1.
    a := la columnAt: 1."
    "a := FloatArray new: lines size."
    lines do: [ :line | | parts price |
        parts := ',' split: line.
        index := index + 1.
        price := Float readFrom: (parts last).
        a at: index put: price.
        pricesSum := pricesSum + price.
        (index rem: 100) = 0 ifTrue: [
            asum := a sum ]]] timeToRun.
{ index. pricesSum. asum. ttr }.
"ExternalDoubleArray an Array(337588
383662.5627699992
383562.2956199993 0:00:01:59.885)"
"FloatArray an Array(337588 383662.5627699992
383562.2954441309
0:00:00:06.555)"
FloatArray is not the precision I need. But it is over 18x faster.
I am afraid I must be doing something badly wrong. Python/Numpy is over 4x faster than FloatArray for the above.
If I am using Smallapack incorrectly please help.
Any help greatly appreciated.
Thanks.