GSoC 2017 Introduction

Oleksandr Zaitsev
Hello,

I am a student from Ukraine, completing my Bachelor's in Applied Mathematics and Informatics. Here is my LinkedIn profile: https://www.linkedin.com/in/unolk/. I have been working with Pharo for about 1.5 years now. My latest project is a neural network package in Pharo: http://smalltalkhub.com/#!/~Oleks/NeuralNetwork. Here is my article about the implementation of a single-layer perceptron in Pharo, where I describe the first part of this project: https://medium.com/towards-data-science/single-layer-perceptron-in-pharo-5b13246a041d. The second article is almost done - it will be about a multi-layer neural network and MNIST digit recognition (it is already working and giving 92% accuracy).

Several days ago I was accepted to Google Summer of Code to work on a project related to PolyMath. I will build a tabular dataset structure for data analysis, similar to pandas in Python or data frames in R. You can see my proposal here: https://docs.google.com/document/d/1z6zi4s3Ur4YcOOgHe1iDwpnNwwVTCrbS7ZdBnzHaCpA/edit?usp=sharing. Please let me know what you think about this project. Is it important? What would you like to see implemented? Should I change something in my plan or project description? Do you know of similar projects that I should consider? Your feedback is very important to me.

I will be writing a weekly blog about the TabularDatasets project (do you have ideas for a better name?) on my Medium: https://medium.com/@i.oleks. You can subscribe to receive instant updates. I will also be writing about my progress on Twitter: https://twitter.com/oleks_lviv, in the #polymath channel on Discord, and on the Pharo mailing list, so you can easily stay informed about the state of the project.

I'm looking forward to working with you!

Sincerely,
Oleks

Re: GSoC 2017 Introduction

werner kassens-2
Hi Oleks,
Your tabular dataset project makes a lot of sense to me. A few questions:
Let's say I have one dataset (a) with a column of Dates and a column of Numbers, say temperature. There is not a row for every date; some data are missing. If I ask (a) for the temperature on a date that does not exist, can I ask for the preceding temperature (date-wise), can I ask for the succeeding one, and can I ask for the temperature of the nearest day? (In other words, I can't simply fill (a) with defaults for the missing dates.)
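
For reference, this kind of lookup maps quite directly onto pandas, the library Oleks names as a model; the frame, column names and dates below are invented purely for illustration:

    import pandas as pd

    # Dataset (a): dates with gaps, one temperature per existing date.
    a = pd.DataFrame({'temperature': [3.1, 4.5, 2.8]},
                     index=pd.to_datetime(['2017-01-01', '2017-01-03', '2017-01-07']))

    query = pd.to_datetime(['2017-01-05'])    # a date that is missing in (a)
    a.reindex(query, method='ffill')          # preceding (date-wise) temperature
    a.reindex(query, method='bfill')          # succeeding temperature
    a.reindex(query, method='nearest')        # temperature of the nearest day

The missing dates never have to be filled in with defaults for this to work.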

Let's say I have a second dataset (b) consisting of Dates and, say, air pressure, but its dates do not always correspond to the dates in (a). Can I merge those datasets to get Date, temperature and air pressure, sorted by date, with empty cells where the dates do not correspond?
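
Again purely for illustration (invented names and values), an outer merge in pandas keeps every date from either side and leaves the non-matching cells empty (NaN):

    import pandas as pd

    a = pd.DataFrame({'date': pd.to_datetime(['2017-01-01', '2017-01-03']),
                      'temperature': [3.1, 4.5]})
    b = pd.DataFrame({'date': pd.to_datetime(['2017-01-02', '2017-01-03']),
                      'pressure': [1013, 1020]})

    merged = a.merge(b, on='date', how='outer').sort_values('date')
    # 2017-01-01 has a temperature but NaN pressure;
    # 2017-01-02 has a pressure but NaN temperature.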

Let's further say I have yet another dataset (c) based on weekly data. Now I want to merge this with (a) into one weekly dataset (d), where I have a column with the minimum temperature of the corresponding week. Possible?
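
A rough pandas sketch of that combination, again with made-up data: resample (a) into weekly bins, take the minimum, and join it with the weekly dataset (c):

    import pandas as pd

    # (a) daily temperatures, (c) one weekly observation per week (invented data).
    a = pd.Series([3.1, 4.5, 2.8],
                  index=pd.to_datetime(['2017-01-02', '2017-01-04', '2017-01-09']),
                  name='temperature')
    c = pd.DataFrame({'rainfall': [12.0, 7.5]},
                     index=pd.to_datetime(['2017-01-02', '2017-01-09']))

    weekly_min = a.resample('W').min().rename('min_temperature')  # weekly minimum of (a)
    d = c.resample('W').first().join(weekly_min, how='outer')     # weekly dataset (d)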

What if (c) has no column with the date of the beginning of the week, and the corresponding column consists of strings that are not directly translatable to dates, e.g. "week1" to "week52" (within a year)?
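
One hedged way to handle such labels (the column name and the week-start convention here are assumptions) is to extract the week number and compute the date arithmetically, without relying on a date parser:

    import pandas as pd

    c = pd.DataFrame({'week': ['week1', 'week2', 'week52'],
                      'rainfall': [12.0, 7.5, 3.2]})

    week_no = c['week'].str.extract(r'week(\d+)', expand=False).astype(int)
    # Monday of week n, counting from the first Monday of 2017 (one possible convention).
    c['week_start'] = pd.Timestamp('2017-01-02') + pd.to_timedelta(7 * (week_no - 1), unit='D')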

Let's say I want to add a column to the weekly (d) with my own BlockClosure whose arguments are Monday's (of that week) temperature, Friday's temperature and the weekly datum from (c). If Monday's temperature does not exist, the block should take the succeeding temperature; if Friday's does not exist, it should take the preceding one. Is that possible without too much hassle?
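
A sketch of how that could be expressed in pandas, reusing the invented toy data from the sketches above (the combining function is an arbitrary stand-in for the user's BlockClosure):

    import pandas as pd

    a = pd.Series([3.1, 4.5, 2.8],
                  index=pd.to_datetime(['2017-01-02', '2017-01-04', '2017-01-09']))
    d = pd.DataFrame({'rainfall': [12.0, 7.5]},
                     index=pd.to_datetime(['2017-01-08', '2017-01-15']))  # week-end labels

    mondays = d.index - pd.Timedelta(days=6)        # Monday of each week
    fridays = d.index - pd.Timedelta(days=2)        # Friday of each week
    mon_temp = a.reindex(mondays, method='bfill')   # missing Monday -> succeeding temperature
    fri_temp = a.reindex(fridays, method='ffill')   # missing Friday -> preceding temperature

    block = lambda mon, fri, weekly: (mon + fri) / 2 - weekly   # stand-in for the user's block
    d['derived'] = block(mon_temp.values, fri_temp.values, d['rainfall'].values)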

What if I want to condense (a) to monthly data, i.e. the step size varies between 28 and 31 days?
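
In pandas the calendar handles the uneven step; a minimal sketch with a toy series as above:

    import pandas as pd

    a = pd.Series([3.1, 4.5, 2.8],
                  index=pd.to_datetime(['2017-01-02', '2017-02-04', '2017-02-09']))
    monthly = a.resample('MS').mean()   # one row per calendar month, 28- to 31-day bins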

werner

Re: GSoC 2017 Introduction

werner kassens-2
>(it's already working and giving 92% accuracy)

Congratulations on this result, Oleks!
Is it correct to assume that you use the MNIST version with 60,000 training examples and 10,000 test examples? How long did your algorithm need to train, and on which computer?
werner

Re: GSoC 2017 Introduction

Oleksandr Zaitsev
Hello, Werner

These are very interesting questions. Thanks a lot for asking them! I don't know for sure, but I think all of these problems could be solved with some fairly complex pandas queries (for example, some of them sound like they could be done with merges/joins). I will try to solve all of these problems in pandas (and maybe R) and write a blog post about it within the next week or two. I will let you know when it's ready.

I also think that Pharo needs a more elegant tool for querying. Something that would allow me to treat the Date column as an object and to "communicate" with it accordingly. But this is just an idea (a feeling) - at this point I can't support it with any good arguments or examples. I need to explore different options and start building something to see how it works.

As for MNIST: yes, it's the 60,000 and 10,000 version taken from here: http://yann.lecun.com/exdb/mnist/. The network that gives 92% accuracy has exactly the same architecture as the simplest network from the TensorFlow tutorials: https://www.tensorflow.org/get_started/mnist/beginners (this way I can easily compare the performance). It has 784 inputs, one layer of 10 softmax neurons and a cross-entropy cost function. It is trained on mini-batches of 100 training examples each. The training runs for 1000 epochs (1 mini-batch per epoch), though my experiments show that 300-400 epochs are enough. The learning rate is a constant 0.5 for all epochs. All these parameters are the same as the ones in the TensorFlow example. The accuracy of this network trained in Pharo is exactly the same as in TensorFlow (92%). But the training time is much longer: about 6-7 minutes in my Pharo image versus some 2-3 seconds in TensorFlow.
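
For readers who want to see the setup spelled out, here is a NumPy restatement of the architecture described above (this is not the Pharo code, just an illustrative sketch; the random placeholder arrays stand in for the real MNIST data):

    import numpy as np

    # 784-10 softmax regression, cross-entropy loss, mini-batches of 100, learning rate 0.5.
    rng = np.random.default_rng(0)
    X_train = rng.random((10000, 784))                # placeholder for the real MNIST images
    Y_train = np.eye(10)[rng.integers(0, 10, 10000)]  # placeholder one-hot labels

    W, b = np.zeros((784, 10)), np.zeros(10)

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)          # subtract the row max for stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    for step in range(1000):                          # 1000 mini-batch steps
        idx = rng.choice(len(X_train), size=100, replace=False)
        X, Y = X_train[idx], Y_train[idx]
        grad_logits = (softmax(X @ W + b) - Y) / len(X)   # gradient of the mean cross-entropy
        W -= 0.5 * X.T @ grad_logits
        b -= 0.5 * grad_logits.sum(axis=0)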

I also want to measure the time with TensorFlow restricted to a single CPU core to make the comparison closer to Pharo. But I don't think my implementation could come anywhere near that performance unless it were implemented with native code. And it would probably be wiser to just create a binding for TensorFlow - Francesco Agati (sa_su_ke) is already working on it.
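
For what it's worth, one way to do that single-core restriction with the TensorFlow 1.x API of the time looks roughly like this (a sketch, not code from the thread):

    import tensorflow as tf

    # Limit TensorFlow to one CPU device and one thread per op for a fairer timing comparison.
    config = tf.ConfigProto(intra_op_parallelism_threads=1,
                            inter_op_parallelism_threads=1,
                            device_count={'CPU': 1})
    sess = tf.Session(config=config)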

I also tried some multi-layer architectures (784-500-10 and 784-800-10). With different learning rates (each constant for its network) they took 6-8 hours to train in my Pharo image, but they all resulted in lower accuracy than the simple 784-10 one (around 88%). Two days ago I realized that in Pharo log is the common (base-10) logarithm, not the natural one as in many other languages (so in my cross-entropy I was using log, but differentiating it as ln). This funny mistake could be one source of problems.

But I think the actual reason for such low performance is a poor choice of learning rate - with more layers, neural nets become very sensitive to changes of the learning rate (there are many local minima in which we may get stuck, and the network takes a long time to train) - so it would be nice to implement some algorithm for adjusting the learning rate (for example, AdaGrad).
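
For reference, the AdaGrad rule mentioned here is small enough to sketch in a few lines of NumPy (this is the textbook update, not anything from the Pharo package):

    import numpy as np

    def adagrad_update(param, grad, cache, eta=0.5, eps=1e-8):
        """One AdaGrad step: each parameter gets its own shrinking learning rate."""
        cache += grad ** 2                            # accumulated squared gradients
        param -= eta * grad / (np.sqrt(cache) + eps)  # scale the step per parameter
        return param, cache

    # Usage sketch: cache = np.zeros_like(W); W, cache = adagrad_update(W, dW, cache)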

I won't have much time to work on this project in the summer because I will have to concentrate on the Tabular Datasets project (GSoC). So for now I'm not implementing new algorithms, but improving the readability of the code and writing tests and documentation. This way others can continue working on this project, or use it in their own work - at least as an inspiration :)

Thanks again for your feedback!

Oleks

Re: GSoC 2017 Introduction

werner kassens-2
Hi Oleks,
Thanks for your interesting, detailed answer! Yes indeed, it would be interesting to see what an adaptive learning rate could do. I will certainly read your neural net documentation with interest when it is finished. One question, in order to understand the training times better (just in case one would like to compare approaches quite different from neural nets): what speed does your computer have?

Regarding my first example of merging (a) and (b), keep in mind that when a date has a datum in only one of the two datasets, the other dataset does _not_ have a row with that date and a NaN for the datum; the date simply does not exist in that other dataset.
werner
