I am sorry, I must have misunderstood the purpose of this thread. I
just gave a couple of vague ideas.
I did not really mean that I would be able to, or would have time to, mentor such a project.
> OK, try making a proposal then,
http://gsoc.pharo.org has the instructions and the current list, you probably know more about data science than I do.
>
>> On 18 Feb 2015, at 10:53, Andrea Ferretti <[hidden email]> wrote:
>>
>> I am sorry if the previous messages came off as too harsh. The Neo
>> tools are perfectly fine for their intended use.
>>
>> What I was trying to say is that a good idea for a SoC project would
>> be to develop a framework for data analysis that would be useful for
>> data scientists, and in particular this would include something to
>> import unstructured data more freely.
>>
>> 2015-02-18 10:39 GMT+01:00 Sven Van Caekenberghe <[hidden email]>:
>>> Well, you are certainly free to contribute.
>>>
>>> Heuristic interpretation of data could be useful, but it looks like an addition on top; the core library should stay fast and efficient.
>>>
>>>> On 18 Feb 2015, at 10:35, Andrea Ferretti <[hidden email]> wrote:
>>>>
>>>> For an example of what I am talking about, see
>>>>
>>>>
>>>> http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#csv-text-files
>>>>
>>>> I agree that these are definitely too many options, but they get the job
>>>> done for quick and dirty exploration.
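As a quick illustration of the kind of options that page documents, here is a minimal sketch (the file contents, separator, and column names are made up for the example):

```python
import io
import pandas as pd

# A messy semicolon-separated file with a custom missing-value marker.
csv = io.StringIO("date;city;visitors\n2015-01-01;Milan;120\n2015-01-02;Milan;n/a\n")

# sep, na_values and parse_dates tame the format without a full schema.
df = pd.read_csv(csv, sep=";", na_values=["n/a"], parse_dates=["date"])
print(df.dtypes)  # date is datetime64[ns], visitors is float64
```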
>>>>
>>>> The fact is that working with a dump of a table from your database,
>>>> whose content you know, requires different tools than exploring the
>>>> latest open data that your local municipality has put online, using
>>>> yet another messy format.
>>>>
>>>> Enterprise programmers deal more often with the former, data
>>>> scientists with the latter, and I think there is room for both kinds
>>>> of tools.
>>>>
>>>> 2015-02-18 10:26 GMT+01:00 Andrea Ferretti <[hidden email]>:
>>>>> Thank you Sven. I think this should be emphasized and prominent on the
>>>>> home page*. Still, libraries such as pandas are even more lenient,
>>>>> doing things such as:
>>>>>
>>>>> - autodetecting which fields are numeric in CSV files
>>>>> - allowing you to fill in missing data based on statistics (for instance,
>>>>> you can say: where the field `age` is missing, use the average age)
>>>>>
>>>>> Probably there is room for something built on top of Neo
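The two pandas behaviours described above look roughly like this (a minimal sketch; the column names and data are invented for illustration):

```python
import io
import pandas as pd

# CSV with a missing value in the "age" column.
csv = io.StringIO("name,age\nalice,30\nbob,\ncarol,40\n")

df = pd.read_csv(csv)  # "age" is auto-detected as numeric (float64)

# Fill the missing age with the average of the known ages.
df["age"] = df["age"].fillna(df["age"].mean())

print(df["age"].tolist())  # [30.0, 35.0, 40.0]
```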
>>>>>
>>>>>
>>>>> * By the way, I suggest that the documentation on Neo could benefit
>>>>> from a reorganization. Right now, the first topic in the NeoJSON
>>>>> paper introduces JSON itself. I would argue that everyone who tries
>>>>> to use the library already knows what JSON is. Yet there is no
>>>>> example of how to read JSON from a file in the whole document.
>>>>>
>>>>> 2015-02-18 10:12 GMT+01:00 Sven Van Caekenberghe <[hidden email]>:
>>>>>>
>>>>>>> On 18 Feb 2015, at 09:52, Andrea Ferretti <[hidden email]> wrote:
>>>>>>>
>>>>>>> Also, these tasks
>>>>>>> often involve consuming data from various sources, such as CSV and
>>>>>>> JSON files. NeoCSV and NeoJSON are still a little too rigid for the
>>>>>>> task - libraries like pandas let you just feed in a CSV file and try
>>>>>>> to make heads or tails of the content without having to define much
>>>>>>> of a schema beforehand.
>>>>>>
>>>>>> Both NeoCSV and NeoJSON can operate in two ways: (1) without the definition of any schemas, or (2) with the definition of schemas and mappings. The quick and dirty explore style is most certainly possible.
>>>>>>
>>>>>> 'my-data.csv' asFileReference readStreamDo: [ :in | (NeoCSVReader on: in) upToEnd ].
>>>>>>
>>>>>> => an array of arrays
>>>>>>
>>>>>> 'my-data.json' asFileReference readStreamDo: [ :in | (NeoJSONReader on: in) next ].
>>>>>>
>>>>>> => objects structured using dictionaries and arrays
>>>>>>
>>>>>> Sven
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>
>