Dear all,

In order to clean data before using machine learning algorithms (like PCA), I implemented a class to do data standardization: PMStandardizationScaler
https://github.com/PolyMathOrg/PolyMath/blob/development/src/Math-PrincipalComponentAnalysis/PMStandardizationScaler.class.st

Data can be centered and scaled. The class is similar to the one defined in scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

More complex data transformations can be implemented as subclasses of the abstract class PMDataTransformer. You have to implement the fit: and transform: methods.

A+
-- Serge Stinckwich UMI UMMISCO 209 (SU/IRD/UY1)
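[Editor's note: a rough sketch of the fit/transform idea behind such a scaler, in Python rather than Smalltalk for illustration only. The function names and plain-list representation are not PolyMath's actual API.]

    # Illustrative sketch: "fit" learns per-column mean and standard
    # deviation, "transform" centers and scales each column with them.
    import math

    def fit(rows):
        """Return (means, stdevs) per column of a list of equal-length rows."""
        n = len(rows)
        dim = len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(dim)]
        stdevs = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
                  for j in range(dim)]
        return means, stdevs

    def transform(rows, means, stdevs):
        """Center and scale every row with the fitted statistics."""
        return [[(r[j] - means[j]) / stdevs[j] for j in range(len(r))]
                for r in rows]

    data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    means, stdevs = fit(data)
    scaled = transform(data, means, stdevs)

After the transform, each column of `scaled` has mean 0 and unit spread, which is the precondition PCA usually wants.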
You received this message because you are subscribed to the Google Groups "PolyMath" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/d/optout. |
Hi Serge,
I find PMStandardizationScaler an extremely useful thing, so please excuse me making a few comments:

- #transform: uses self scale and self mean repeatedly. self scale computes the covarianceMatrix each time, which can be a lot of work; perhaps a local variable would speed that up.
- Calculating the whole covariance matrix just for some stdevs: how much programming does that save <g>, more than half a line of code?
- The covariance thing computes, I think, variance by dividing by size, not (size - 1). Is that your intention?

I like the architecture; it is simple and obvious to use. I can use it, thanks!
werner

On Tuesday, July 31, 2018 at 10:20:40 AM UTC+2, Serge Stinckwich wrote:
On Tue, Jul 31, 2018 at 2:43 PM werner kassens <[hidden email]> wrote:
OK, done in the last version.
I don't understand what you want to say here.
This is the same pattern used in scikit-learn's StandardScaler. It would be nice to add more transformers, like those in http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
-- Serge Stinckwich UMI UMMISCO 209 (SU/IRD/UY1)
On Tue, Jul 31, 2018 at 4:03 PM, Serge Stinckwich <[hidden email]> wrote:
collection>>stdev
	| avg sample sum |
	avg := self average. "see comment in self sum"
	sample := self anyOne.
	sum := self inject: sample into: [:accum :each | accum + (each - avg) squared].
	sum := sum - sample.
	^ (sum / (self size - 1)) sqrt

The method you use probably does:

	^ (sum / self size) sqrt

werner
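[Editor's note: Werner's point, dividing by size versus (size - 1), can be seen numerically. A small check in Python (used here only for illustration; PolyMath itself is Smalltalk):]

    # Dividing the sum of squared deviations by n gives the biased
    # (population) standard deviation; dividing by n - 1 gives the
    # unbiased (sample) one, as Pharo's Collection>>stdev does.
    import math

    xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)      # sum of squared deviations

    biased_stdev = math.sqrt(ss / n)           # divide by n
    unbiased_stdev = math.sqrt(ss / (n - 1))   # divide by n - 1

For this data set the biased value is exactly 2.0 while the unbiased one is about 2.14; the gap shrinks as n grows, but for small samples it is noticeable.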
On Tue, Jul 31, 2018 at 4:56 PM werner kassens <[hidden email]> wrote:
Sorry, I still don't understand. Where do you see a problem? PMStandardizationScaler uses an iterative way to compute the mean and variance, with PMCovarianceAccumulator. So we need to compute the covariance matrix in order to compute the variance.
A+
Serge Stinckwich UMI UMMISCO 209 (SU/IRD/UY1)
I guess it's a misunderstanding; I don't see a problem. Sorry for the irritation, Serge.
werner

On Tue, Jul 31, 2018 at 6:23 PM, Serge Stinckwich <[hidden email]> wrote:
Hi Serge,

sum( (x_i - average)^2 for i=1:n ) / n is a biased estimator of the variance. One must divide by n-1 to obtain an unbiased estimator. That's probably what Werner means.

Apart from the bias, I have verified the formula, and it seems correct. But it would deserve an explanation (or a reference to an explanation, in Didier Besset's book?), because it is non-trivial.

delta_{n+1} is the difference (average_estimate_n - x_{n+1}) divided by (n+1), where x_{n+1} is the newly accumulated value (a vector in the code). We want to use (average_estimate_{n+1} - x_{n+1}) in the covariance estimator. But then we should also compensate for the evolution of the average estimate in the previous accumulation... Let's do it in scalar first:

    sum( (x_i - average_estimate_{n+1})^2 )/n
      = sum( x_i^2 )/n - 2*(sum(x_i)/n)*average_estimate_{n+1} + average_estimate_{n+1}^2
      = sum( x_i^2 )/n + average_estimate_{n+1}^2 - 2*average_estimate_{n+1}*average_estimate_n

Since we have computed:

    variance_estimate_n = sum( (x_i - average_estimate_n)^2 )/n
                        = sum( x_i^2 )/n - average_estimate_n^2

then compensating the error requires taking:

    variance_estimate_n_corrected = variance_estimate_n + (average_estimate_{n+1} - average_estimate_n)^2
                                  = variance_estimate_n + delta_{n+1}^2

Then, updating the variance with the newly accumulated value, with the biased estimator:

    variance_estimate_{n+1} = (variance_estimate_n_corrected * n + (average_estimate_{n+1} - x_{n+1})^2) / (n+1)

    average_estimate_{n+1} = average_estimate_n - delta_{n+1}

    average_estimate_{n+1} - x_{n+1} = average_estimate_n - x_{n+1} - delta_{n+1}
                                     = (n+1)*delta_{n+1} - delta_{n+1}
                                     = n * delta_{n+1}

So, if I did not mess up so far:

    variance_estimate_{n+1} = (n * variance_estimate_n + n * delta_{n+1}^2 + n^2 * delta_{n+1}^2) / (n+1)

IOW:

    variance_estimate_{n+1} = variance_estimate_n * n/(n+1) + n * delta_{n+1}^2

This can be extended to covariance, and we indeed find the iterative formula which is programmed.

2018-07-31 18:33 GMT+02:00 werner kassens <[hidden email]>:
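[Editor's note: the scalar recurrence above is easy to check numerically. A small Python verification (not PolyMath code) comparing the incremental update against the batch biased variance:]

    # Check the incremental update:
    #   delta_{n+1} = (mean_n - x_{n+1}) / (n + 1)
    #   mean_{n+1}  = mean_n - delta_{n+1}
    #   var_{n+1}   = var_n * n/(n+1) + n * delta_{n+1}^2
    # against the batch biased estimator sum((x - mean)^2) / n.
    xs = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

    mean, var, n = 0.0, 0.0, 0
    for x in xs:
        delta = (mean - x) / (n + 1)
        var = var * n / (n + 1) + n * delta ** 2
        mean = mean - delta
        n += 1

    batch_mean = sum(xs) / len(xs)
    batch_var = sum((x - batch_mean) ** 2 for x in xs) / len(xs)

After the loop, mean and var agree with batch_mean and batch_var up to floating-point rounding, which is what the derivation predicts.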
On Tue, Jul 31, 2018 at 10:28 PM Nicolas Cellier <[hidden email]> wrote:
Thank you for the math, Nicolas!

Actually, Collection>>stdev is not part of PolyMath but of Pharo:

Collection>>stdev
	| avg sample sum |
	"In statistics, the standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values. For details about implementation see comment in self sum."
	avg := self average.
	sample := self anyOne.
	sum := self inject: sample into: [ :accum :each | accum + (each - avg) squared ].
	sum := sum - sample.
	^ (sum / (self size - 1)) sqrt

Serge Stinckwich UMI UMMISCO 209 (SU/IRD/UY1)