2010/6/24 Jimmie Houchin <[hidden email]>:
> On 6/21/2010 5:34 PM, Levente Uzonyi wrote:
>> On Mon, 21 Jun 2010, Jimmie Houchin wrote:
>>> On 6/21/2010 9:27 AM, Levente Uzonyi wrote:
>>>> On Sun, 20 Jun 2010, Jimmie Houchin wrote:
>>>>> [snip]
>>>>> Here are some samples from my experimentation.
>>>>>
>>>>> Some of what I am doing is rolling calculations over my dataset.
>>>>>
>>>>> The dataset is one week's worth of OHLC data of a currency pair.
>>>>>
>>>>> In Squeak I have:
>>>>>
>>>>> ttr := [
>>>>>     1 to: m rowCount - 500 do: [ :i |
>>>>>         | row rowSum rowMax rowMin rowMedian rowAverage |
>>>>>         row := m atRows: i to: 499 + i columns: 5 to: 5.
>>>>>         rowSum := row sum.
>>>>>         rowMax := row max.
>>>>>         rowMin := row min.
>>>>>         rowMedian := row median.
>>>>>         rowAverage := row average.
>>>>>         omd add: {rowSum . rowMax . rowMin . rowMedian . rowAverage}]] timeToRun.
>>>>>
>>>>> Squeak: 17 seconds, with Cog 4.2 seconds (nice work guys (Eliot/Teleplace)).
>>>>
>>>> This code can be implemented a lot more efficiently. #atRows:to:columns:to: creates a new matrix, but that can be avoided. #sum, #max, #min, #median and #average iterate over the row. What's worse, #median sorts the row. These can be eliminated too.
>>>> The total runtime cost is O((r - w) * (w + w * log(w))), which is O(r * w * log(w)) if r >> w, which is true in your case. (r is the number of rows, w is the window size (500 in your example).)
>>>> This can be reduced to O(r * log(w)), which is a 500x speedup (in theory, ignoring constants) in your case.
>>>>
>>>> The idea is to store the intermediate results. The sum (and average) only require a single variable which stores the sum of the window. Then subtract the element leaving the window and add the new element entering the window, and you have the new sum and average. Min, max and median are a bit more tricky, but a balanced binary tree can handle them. Calculating min and max in a balanced tree takes O(log(n)) time (which is O(log(w)) in your case). Adding and removing one element each also takes O(log(w)) time. For the median, you have to find the node of the median of the first 500 elements at the beginning. When an element is removed from or added to the tree, the median will be the same node, the next or the previous element in the tree, depending on the median, the added/removed element and the size of the tree. This can be handled in O(log(w)) time.
>>>>
>>>> For a matrix with 10000 rows and 5 columns of random floats your code takes 5.5 seconds with Cog. Using a temp for sum and average and a tree for min and max (without the median, because it requires a special tree implementation) it takes ~35ms. That's a 157x speedup.
>>>>
>>>> Levente
>>>
>>> Hello Levente,
>>>
>>> I am not surprised that I may not have the most efficient implementation. I understand what you are saying in principle, but I don't understand how to implement what you are saying. Can you provide an example doing what I did in the manner you describe? I would greatly appreciate it. I would then run it against my data for a test.
>>
>> Hi Jimmie,
>>
>> if you load http://leves.web.elte.hu/squeak/LLRBT-ul.7.mcz [1] then you can evaluate the following example in a workspace. Just print it:
>>
>> | rng matrix windowSize columnIndex old new oldTime newTime |
>> "Create a matrix filled with random float values between 0 and 1 with 10000 rows."
>> rng := Random new.
>> matrix := Matrix rows: 10000 columns: 5 tabulate: [ :i :j | rng next ].
>> windowSize := 500.
>> columnIndex := 5.
>> "The 'old' slow computation."
>> oldTime := [ old := Array new: matrix rowCount - windowSize streamContents: [ :stream |
>>     1 to: matrix rowCount - windowSize + 1 do: [ :i |
>>         | row |
>>         row := matrix atRows: i to: i + windowSize - 1 columns: 5 to: 5.
>>         stream nextPut: {
>>             row sum.
>>             row max.
>>             row min.
>>             "row median."
>>             row average } ] ] ] timeToRun.
>> "The 'new' fast computation."
>> newTime := [ new := Array new: matrix rowCount - windowSize streamContents: [ :stream |
>>     | sum tree |
>>     "Our variables holding the intermediate results."
>>     sum := 0.
>>     tree := LLRBTree new.
>>     "Calculate the first values."
>>     1 to: windowSize do: [ :rowIndex |
>>         | element |
>>         element := matrix at: rowIndex at: columnIndex.
>>         sum := sum + element.
>>         tree at: element put: element ].
>>     stream nextPut: { sum. tree max. tree min. sum / windowSize }.
>>     "And the rest."
>>     windowSize + 1 to: matrix rowCount do: [ :rowIndex |
>>         | oldElement newElement |
>>         oldElement := matrix at: rowIndex - windowSize at: columnIndex.
>>         newElement := matrix at: rowIndex at: columnIndex.
>>         sum := sum - oldElement + newElement.
>>         "Housekeeping for median would happen here."
>>         tree
>>             deleteKey: oldElement;
>>             at: newElement put: newElement.
>>         stream nextPut: { sum. tree max. tree min. sum / windowSize } ] ] ] timeToRun.
>> "Make sure the calculations give the same results."
>> old with: new do: [ :oldRow :newRow |
>>     oldRow with: newRow do: [ :oldValue :newValue |
>>         self assert: (oldValue closeTo: newValue) ] ].
>> "Print it to see the runtime."
>> { oldTime. newTime }
>>
>> The "old" version is your original code with minor changes; the "new" one stores the intermediate results. If you change the value of the matrix variable, you can test it on your own data.
>>
>>> The above was simply an example. I have many more methods which I've implemented which are doing a variety of moving averages and such. To my understanding, Squeak doesn't have such a library of statistical methods at this time. That would be one nice thing that could be done when Numpy becomes a C lib and can be interfaced to from Squeak.
>>>
>>> I appreciate your comment above. I would really like to see Squeak outperform some of the alternatives. :)
>>
>> The same "trick" can be done in Python, and I'm sure it has a sufficient balanced binary tree implementation for the median too.
>>
>> Levente
>
> I just recently had an opportunity to study your code to learn from it. I haven't tried it yet, but while exploring and studying I realized, if I understand correctly, a problem. In many (most?) uses of Matrix, Numpy, etc., the order of the data is part of the data. It isn't simply a set of numbers, but an ordered collection of numbers or data. Whether it is weather statistics, etc., or in my case financials, each value is associated with a precise point in time in the market.
>
> To my understanding, a tree-based implementation would lose the order of the data. I would imagine that if you used keys which could be ordered and then applied that order, you would lose the performance gains.
>
> In my simple example order had little meaning. But a common calculation is a weighted average, which does associate a weight with each value according to its position in the array.
>
> While playing, this is the implementation that performed best for me, going from 6200ms to 3600ms.
>
> m := Matrix rows: 6333 columns: 162.
> omd := OrderedCollection new.
> ttr := [ | column start windowSize |
>     column := m atColumn: 5.
>     start := 1.
>     windowSize := 500.
>     1 to: column size - windowSize do: [ :i |
>         | row rowSum rowMax rowMin rowAverage rowMedian |
>         row := column copyFrom: i to: i + windowSize.
>         rowSum := row sum.
>         rowMax := row max.
>         rowMin := row min.
>         rowMedian := row median.
>         rowAverage := row average.
>         omd add: {rowSum . rowMax . rowMin . rowAverage}]] timeToRun.
> ttr  "3600"
>
> If I remove the median calculation I get 594ms. You are quite correct, the median calculation is quite expensive. However, it is quite light compared to the things I actually do. :)
>
> 594ms is quite an improvement over the original 6200ms. The original without the median is now 2704ms.
> Numpy is still faster over the same data set doing the same thing; it is at 302ms. But Numpy is optimized C/Fortran with a Python face. It is a nice API, but still dependent on someone doing the hard work on the back end. The nice thing with the Squeak code is that it performs reasonably and is all Smalltalk. :)
>

For toying, educating, or implementing a few utilities, maybe. But for the main algorithms (eigen decomposition, singular values, etc.), I much prefer a known-to-work solution, one that takes care of both numerical stability and space/time efficiency, and scales to high dimensions whenever needed. Redoing LAPACK is a HUGE amount of work. The probability of failure is high. Do you think Matlab, Scilab, Octave, R-lab, etc. waste their time writing a LAPACK clone? I wouldn't.

Plus you must consider that using an optimized BLAS is the state of the art in number crunching, maybe 3 orders of magnitude more efficient than a Smalltalk-written loop. Even with an optimized future Cog it will still be 2... The idea of BLAS is to get high efficiency in low-level computations if you can vectorize them. It's a perfect solution for implementing higher-level structures in Smalltalk while preserving decent efficiency. Also, ScaLAPACK gives the opportunity to exploit parallel architectures and distribute the load when there is a large amount of computation... and I trust BLAS will exploit GPU features when enough standardization has been reached.

Sometimes we should just open our eyes and take a look at the external world. We don't have to throw away any code not written in Smalltalk. If someone did some really hard work for free, then let's just grab it! Let's be as clever as Python. Let's open our closed world. Let's promote FFI. That would promote Smalltalk solutions more than a 100% Smalltalk application would.

Well, just my POV :)

Nicolas

> Thanks for the education.
>
> Jimmie
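
To make the FFI idea above a bit more concrete, here is a minimal, untested sketch of what a single BLAS binding could look like with the Squeak FFI plugin. cblas_sdot is the single-precision dot product from the standard CBLAS interface; the class name BlasInterface, the module name 'cblas', and the assumption that a FloatArray can be passed directly where a float pointer is expected are all guesses that depend on platform and FFI version, so treat this as an illustration of the approach rather than a working binding.

BlasInterface class >> sdot: n x: xFloatArray incx: incx y: yFloatArray incy: incy
    "Hypothetical binding for: float cblas_sdot(const int n, const float *x, const int incx, const float *y, const int incy).
     The module name 'cblas' is an assumption; on Linux it might be 'libcblas.so' or similar."
    <cdecl: float 'cblas_sdot' (long void* long void* long) module: 'cblas'>
    ^ self externalCallFailed

With such a binding, the dot product of two FloatArrays a and b would be something like BlasInterface sdot: a size x: a incx: 1 y: b incy: 1, and the rolling-window calculations discussed in this thread could be expressed in the same vectorized style while the number crunching stays in C.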
On Thu, 24 Jun 2010, Jimmie Houchin wrote:
<snip>
>
> I just recently had an opportunity to study your code to learn from it. I haven't tried it yet, but while exploring and studying I realized, if I understand correctly, a problem. In many (most?) uses of Matrix, Numpy, etc., the order of the data is part of the data. It isn't simply a set of numbers, but an ordered collection of numbers or data. Whether it is weather statistics, etc., or in my case financials, each value is associated with a precise point in time in the market.
>
> To my understanding, a tree-based implementation would lose the order of the data. I would imagine that if you used keys which could be ordered and then applied that order, you would lose the performance gains.

The binary tree is only useful for the fast calculation of min, max and median. If the ordering of the data matters for a calculation, then you need some other trick.

>
> In my simple example order had little meaning. But a common calculation is a weighted average, which does associate a weight with each value according to its position in the array.

If the weights are exponential, you can use the same trick as with the sum. If the weights are not exponential, you can use convolution with FFT for O((n+w)*log(n+w)) performance, which is O(n*log(n)) if w << n. Here is an example:

| data windowWithWeights fftSize circularData circularWeights fft cdtr cdti cwtr cwti |
data := #(1 2 3 4 5) asFloatArray. "Our time series."
windowWithWeights := #(4 2 1) asFloatArray. "Most recent data with weight 4, least recent data with weight 1."
"Prepare for the FFT."
fftSize := (data size + windowWithWeights size) asLargerPowerOfTwo.
circularData := data copyGrownBy: fftSize - data size.
circularWeights := windowWithWeights copyGrownBy: fftSize - windowWithWeights size.
fft := FFT new: fftSize.
"Transform the data and store the results."
fft realData: circularData.
fft transformForward: true.
cdtr := fft realData.
cdti := fft imagData.
"Transform the weights and store the results."
fft realData: circularWeights.
fft transformForward: true.
cwtr := fft realData.
cwti := fft imagData.
"Do the convolution, which is simple complex multiplication, then the inverse FFT."
fft realData: cdtr * cwtr - (cdti * cwti) imagData: cdtr * cwti + (cdti * cwtr).
fft transformForward: false.
"Extract the data."
fft realData copyFrom: windowWithWeights size to: data size.
"a FloatArray(17.0 24.0 31.0)
 17 = 4 * 3 + 2 * 2 + 1 * 1,
 24 = 4 * 4 + 2 * 3 + 1 * 2,
 31 = 4 * 5 + 2 * 4 + 1 * 3"

>
> While playing, this is the implementation that performed best for me, going from 6200ms to 3600ms.
>
> m := Matrix rows: 6333 columns: 162.
> omd := OrderedCollection new.
> ttr := [ | column start windowSize |
>     column := m atColumn: 5.
>     start := 1.
>     windowSize := 500.
>     1 to: column size - windowSize do: [ :i |
>         | row rowSum rowMax rowMin rowAverage rowMedian |
>         row := column copyFrom: i to: i + windowSize.
>         rowSum := row sum.
>         rowMax := row max.
>         rowMin := row min.
>         rowMedian := row median.
>         rowAverage := row average.
>         omd add: {rowSum . rowMax . rowMin . rowAverage}]] timeToRun.
> ttr  "3600"
>
> If I remove the median calculation I get 594ms. You are quite correct, the median calculation is quite expensive. However, it is quite light compared to the things I actually do. :)

I found a simple way to improve the performance without using a tree or loading new packages: SortedCollection. :) The least useful data structure in Squeak can shine here (though the tree-based implementation is >10 times faster in this case, it's still pretty good):

rng := Random new.
m := Matrix rows: 6333 columns: 162 tabulate: [ :r :c | rng next ].
omd := OrderedCollection new.
ttr := [
    | column windowSize window sum |
    column := m atColumn: 5.
    windowSize := 500.
    window := (column first: windowSize) asSortedCollection.
    sum := window sum.
    omd add: { sum. window first. window last. window median. sum / windowSize }.
    1 to: column size - windowSize do: [ :i |
        | outgoing incoming |
        outgoing := column at: i.
        incoming := column at: i + windowSize.
        sum := sum - outgoing + incoming.
        window remove: outgoing; add: incoming.
        omd add: { sum. window first. window last. window median. sum / windowSize } ] ] timeToRun.
ttr

My result for this code is 335ms (and it calculates the median too). It could be improved with a special method, but that's outside the scope of this mail. :)

Calculating the weighted averages with the same parameters (6333 data points, window size 500) with the FFT convolution described above took 350ms.

I hope you'll find these methods useful.

Levente

>
> 594ms is quite an improvement over the original 6200ms. The original without the median is now 2704ms.
> Numpy is still faster over the same data set doing the same thing; it is at 302ms. But Numpy is optimized C/Fortran with a Python face. It is a nice API, but still dependent on someone doing the hard work on the back end. The nice thing with the Squeak code is that it performs reasonably and is all Smalltalk. :)
>
> Thanks for the education.
>
> Jimmie
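
A minimal sketch of the exponential-weighting idea Levente mentions above ("If the weights are exponential, you can use the same trick as with the sum"): an exponentially weighted moving average keeps a single running value, so each new element costs O(1). The random column and the smoothing factor below are stand-ins chosen for the example, not values from the thread.

| rng column alpha ewma results |
rng := Random new.
column := (1 to: 6333) collect: [ :i | rng next ]. "Stand-in for m atColumn: 5."
alpha := 2.0 / 501.0. "Smoothing factor, roughly comparable to a 500-wide window."
ewma := column first.
results := OrderedCollection new.
column do: [ :x |
    "Same trick as the running sum: fold each new element into one accumulator."
    ewma := (alpha * x) + ((1.0 - alpha) * ewma).
    results add: ewma ].
results last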
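
The FFT-convolution example earlier in this post can also be cross-checked with a direct weighted sum over each window position. This is the naive O(n*w) version, shown only to confirm the 17, 24, 31 values quoted in the comment of that example.

| data weights direct |
data := #(1 2 3 4 5) asFloatArray.
weights := #(4 2 1) asFloatArray. "Most recent element gets weight 4, as in the FFT example."
direct := (1 to: data size - weights size + 1) collect: [ :i |
    | sum |
    sum := 0.
    weights withIndexDo: [ :w :j |
        "Weight 4 multiplies the newest element of the window, weight 1 the oldest."
        sum := sum + (w * (data at: i + weights size - j)) ].
    sum ].
direct "17.0 24.0 31.0, matching the FFT result."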