Re: [Pharo-dev] GSOC 2015 Call for Ideas

Posted by stepharo on Feb 21, 2015; 12:38pm
URL: https://forum.world.st/GSOC-2015-Call-for-Ideas-tp4805773p4806792.html

Indeed, these are nice to have, but they will not magically happen :)
There is a 400-page book on SciTalk.

Stef

On 19/2/15 09:36, Andrea Ferretti wrote:

> Hi Serge,
>
> as I said I do not really have the time now to get involved in a gsoc
> proposal, but I can give you my perspective. There are two sides to
> the story.
>
> The first one is complementary to SciSmalltalk: in order to analyze
> data, you need to get data in first. So, one may want to read - say -
> a CSV, and have a number of heuristics, such as:
>
> - autodetection of encoding
> - autodetection of quotes and delimiter
> - autodetection of columns containing numbers or dates
> - the possibility to indicate that some markers, such as "N/A",
> represent missing values
> - the possibility to indicate a replacement for missing values, such
> as 0, or "", or the average or the minimum of the other values in the
> column
>
> See http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#csv-text-files
> for some examples.
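
The heuristics listed above can be sketched in a few lines. This is a minimal illustration in Python (the language of the pandas reference), using only the standard library, not a proposal for how a Pharo implementation should look. The function name `load_csv`, its parameters, and the default marker set are assumptions made for the example; encoding and date detection are left out.

```python
import csv
import io
from statistics import mean


def _is_number(cell):
    """Return True if the string parses as a float."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def load_csv(text, missing_markers=("N/A", ""), candidates=",;\t|"):
    """Parse CSV text with a guessed dialect (hypothetical helper).

    - the delimiter and quote character are guessed by csv.Sniffer
    - cells equal to one of `missing_markers` become None
    - columns whose remaining cells are all numeric are converted to
      float, and their missing cells are replaced by the column mean
    Rows are assumed to have the same number of cells as the header.
    """
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    header, body = rows[0], rows[1:]

    # Step 1: replace missing-value markers with None.
    body = [
        [None if cell.strip() in missing_markers else cell.strip() for cell in row]
        for row in body
    ]

    # Step 2: work column by column to detect numeric columns.
    columns = [list(col) for col in zip(*body)]
    for col in columns:
        present = [cell for cell in col if cell is not None]
        if present and all(_is_number(cell) for cell in present):
            average = mean(float(cell) for cell in present)
            col[:] = [average if cell is None else float(cell) for cell in col]

    # Step 3: turn the columns back into rows.
    return header, [list(row) for row in zip(*columns)]
```

Calling `load_csv` on a semicolon-separated text where one cell of a numeric column is `N/A` returns that column as floats, with the gap filled by the mean of the other values.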
>
> It may be worth considering making this into a sequence that is read
> and processed lazily, to deal with CSV files bigger than memory.
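
As a rough sketch of the lazy-sequence idea, again in Python with the standard library only: rows are produced one at a time by a generator, so an aggregate can be computed without holding the whole file in memory. The names `iter_rows` and `column_mean` are hypothetical and exist only for this example.

```python
import csv
import io


def iter_rows(stream):
    """Yield one dict per CSV row, reading the stream lazily.

    `stream` is any iterable of text lines, such as an open file.
    Only the current row is held in memory.
    """
    for row in csv.DictReader(stream):
        yield row


def column_mean(rows, column):
    """Compute the mean of a numeric column in a single pass over the rows."""
    total, count = 0.0, 0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else None


# Usage with an in-memory stream; an open file object works the same way.
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n")
result = column_mean(iter_rows(sample), "value")
```

With a real file, passing the handle from `open(path, newline="")` to `iter_rows` keeps memory use independent of the file size.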
>
> When data is finally in, usually the first task is doing some
> processing, inspection or visualization. The Smalltalk collections are
> good for processing (although some lazy variants might help), and
> Roassal and the inspectors are perfect for visualization and browsing.
>
> The second part comes when one wants to run some algorithm.
> While there is no need to have the fanciest ones, there should be some
> of the basics, such as:
>
> - some form of regression (linear, logistic...)
> - some form of clustering (kmeans, dbscan, canopy...)
> - SVM
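
To give an idea of how small the most basic of these can be, here is a sketch of simple (one-variable) linear regression by ordinary least squares, in plain Python. The function name `fit_line` is an assumption for the example; a real library would also cover the multivariate case, diagnostics and numerical stability.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points lying on the line y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```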
>
> Another thing which would be useful is support for linear algebra,
> leveraging native libraries such as BLAS or LAPACK.
>
> In short: just copying R, or numpy + pandas + scikit-learn would
> already be a giant leap forward.
>
> Actually, some of the things I have mentioned above are already (I
> think) in SciSmalltalk, which brings me to the next point:
> documentation. There is really no point in having all these tools if
> people do not know they are there.
>
> For this to become useful, there should be a dedicated site,
> highlighting what is already available, in what state (experimental,
> partial, stable...) and how to use it.
>
> Ideally, I would also include some tutorials, for instance on dealing
> with standard problems such as Kaggle competitions. Here I think
> Smalltalk would have an edge, since these tutorials could be in the
> form of Prof Stef. Still, it would be nice if some form of the
> tutorials were also on the web, which would make them discoverable.
>
> Best,
> Andrea
>
> 2015-02-18 11:14 GMT+01:00 Serge Stinckwich <[hidden email]>:
>> On Wed, Feb 18, 2015 at 11:01 AM, Sven Van Caekenberghe <[hidden email]> wrote:
>>> OK, try making a proposal then, http://gsoc.pharo.org has the instructions and the current list, you probably know more about data science than I do.
>>>
>>>> On 18 Feb 2015, at 10:53, Andrea Ferretti <[hidden email]> wrote:
>>>>
>>>> I am sorry if the previous messages came off as too harsh. The Neo
>>>> tools are perfectly fine for their intended use.
>>>>
>>>> What I was trying to say is that a good idea for a SoC project would
>>>> be to develop a framework for data analysis that would be useful for
>>>> data scientists, and in particular this would include something to
>>>> import unstructured data more freely.
>> Sorry Andrea. I didn't see your message because I'm not on the
>> pharo-users mailing list, only on pharo-dev.
>> I'm also really interested in having a GSoC project to develop a data
>> analysis framework.
>> Let's talk together to discuss a proposal.
>>
>> Regards,
>> --
>> Serge Stinckwich
>> UCBN & UMI UMMISCO 209 (IRD/UPMC)
>> Every DSL ends up being Smalltalk
>> http://www.doesnotunderstand.org/
>>
>