
Issue with NeoCSVReader


abergel

Issue with NeoCSVReader

Hi!

I have a simple use of NeoCSVReader, but I get an error (a walkback).

-=-=-=-=-=-=-=-=-=
content := (ZnEasy get: 'https://github.com/sudar/pig-samples/raw/master/data/tweets.csv') contents readStream.
lines := (NeoCSVReader on: content)
    skipHeader;
    upToEnd.
-=-=-=-=-=-=-=-=-=

The URL points to a .csv file with a header. No idea why the code fails.

I loaded the CSV reader using:
Gofer new
    smalltalkhubUser: 'SvenVanCaekenberghe' project: 'Neo';
    package: 'ConfigurationOfNeoCSV';
    load.
(Smalltalk at: #ConfigurationOfNeoCSV) loadBleedingEdge.

Alexandre
--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.

Pharo Smalltalk Users mailing list

Re: Issue with NeoCSVReader

Hi Alexandre,

I do not have access to a Pharo image right now, but I had a look at
your CSV file.
There are rows that contain no data or too little data. I think
NeoCSVReader cannot handle that and expects a "proper" file.

I will check tonight, in case nobody else comes up with additional
information by then.

To be continued
Sebastian


jtuchel

Re: Issue with NeoCSVReader

I guess Seb is right.

The file cannot be read by Excel either. So, without having analysed this any further, it seems to me that the number of fields differs from row to row. NeoCSV does not handle this kind of "edge case". By definition, each row in CSV has the same number of fields, even if you could possibly argue that exceptions to this rule could be handled.

OTOH, what does a missing field in a row usually mean? Is it the last one? One from the middle? If so, which one?

So I think you should first check whether the file is valid CSV and, if it is not, how you can get a valid one. If you cannot, I guess you have to roll your own parser that follows the "omission rules" of that file format.
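
One rough way to run such a check before parsing, as a sketch only: count the comma-separated fields on each line and see whether the counts agree. The sample input is made up for illustration, and the count is naive, since it ignores quoted fields that contain commas or line breaks.

```smalltalk
| input fieldCounts |
"Hypothetical sample: three records, the second one is short."
input := '1,2,3\5\3,2,1' withCRs.
"Naive field count per line: number of commas plus one (quoting is ignored)."
fieldCounts := input lines collect: [ :each | (each occurrencesOf: $,) + 1 ].
"More than one distinct count means the rows differ in length."
fieldCounts asSet size > 1
```

For this sample the last expression answers true, because the second line has one field and the others have three.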

Joachim

Ben Coman

Re: Issue with NeoCSVReader

On Wed, Jun 29, 2016 at 3:13 PM, Joachim Tuchel <[hidden email]> wrote:
> I guess Seb is right.
> The file cannot be read by Excel either. So, without having analysed this
> any further, it seems to me that the number of fields differs from row to
> row. NeoCSV does not handle this kind of "edge case". By definition, each row
> in CSV has the same number of fields, even if you could possibly argue
> that exceptions to this rule could be handled.
> OTOH, what does a missing field in a row usually mean? Is it the last one?
> One from the middle? If so, which one?

Your questions are significant. The standard library should not be
complicated by trying to handle undefined missing fields from a row.
Such a file is simply invalid.
cheers -ben


jtuchel

Re: Issue with NeoCSVReader

Ben,

> Ben Coman <[hidden email]> hat am 29. Juni 2016 um 09:40 geschrieben:
> Your questions are significant. The standard library should not be
> complicated by trying to handle undefined missing fields from a row.
> Such a file is simply invalid.
> cheers -ben

I fully agree. I just wanted to give an example of questions that are much harder to answer than they might seem at first. I wanted to argue against blaming NeoCSV for being unable to parse a file like that. I am certainly not asking NeoCSV to do magic on invalid CSV files. It would be wrong in at least 50% of the cases anyway, no matter what assumptions it made.

Joachim

abergel

Re: Issue with NeoCSVReader

Thanks for all your comments

Alexandre


abergel

Re: Issue with NeoCSVReader

In reply to this post by Pharo Smalltalk Users mailing list

Thanks Sebastian!

Apparently, it is related to UTF-8 encoding.

The following code works well:

"Download the file contents."
c := (ZnEasy get: 'https://github.com/sudar/pig-samples/raw/master/data/tweets.csv') contents.
"Decode the downloaded bytes as UTF-8 before parsing."
c := ZnCharacterEncoder utf8 decodeBytes: c.
content := c readStream.
lines := (NeoCSVReader on: content)
    skipHeader;
    upToEnd.

Strange
Alexandre


Sven Van Caekenberghe-2

Re: Issue with NeoCSVReader

> On 29 Jun 2016, at 16:00, Alexandre Bergel <[hidden email]> wrote:
>
> Thanks Sebastian!
>
> Apparently, it is related to UTF encoding.

To be more precise: the resource (file) is served by GitHub with a content-type of 'application/octet-stream', which means it should be interpreted as pure bytes. Maybe that is on purpose (the 'raw' aspect). Still, it means a client cannot know what is inside; it can only guess. It should be served as 'text/csv' so that clients at least know what it is and can act accordingly (for example, by applying the proper text converter).
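
As a rough sketch of what "acting accordingly" can look like on the client side: look at the declared content type, and decode the body explicitly when it arrives as bytes. The variable names are only illustrative, and treating the bytes as UTF-8 is an assumption about this particular file, not something the server declares.

```smalltalk
| response body text lines |
response := ZnEasy get: 'https://github.com/sudar/pig-samples/raw/master/data/tweets.csv'.
"Show what the server declared, for example application/octet-stream."
Transcript show: response contentType printString; cr.
body := response contents.
"A binary content type yields a ByteArray; decode it as UTF-8 (an assumption)."
text := body isString
    ifTrue: [ body ]
    ifFalse: [ ZnCharacterEncoder utf8 decodeBytes: body ].
lines := (NeoCSVReader on: text readStream)
    skipHeader;
    upToEnd.
```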


abergel

Re: Issue with NeoCSVReader

Ah okay! I understand better now.

Thanks Sven!
Alexandre


Sven Van Caekenberghe-2

Re: Issue with NeoCSVReader

In reply to this post by jtuchel

> On 29 Jun 2016, at 09:45, Joachim Tuchel <[hidden email]> wrote:
>
> Ben,
>
> > Ben Coman <[hidden email]> hat am 29. Juni 2016 um 09:40 geschrieben:
> > Your questions are significant. The standard library should not be
> > complicated by trying to handle undefined missing fields from a row.
> > Such a file is simply invalid.
> > cheers -ben
>
> I fully agree. I just wanted to give an example of questions that are much harder to answer than they might seem at first. I wanted to argue against blaming NeoCSV for being unable to parse a file like that. I am certainly not asking NeoCSV to do magic on invalid CSV files. It would be wrong in at least 50% of the cases anyway, no matter what assumptions it made.
>
> Joachim

So yes, CSV is only defined for records that are all the same size.

NeoCSVReader can, however, deal with shorter records. The first record (or the header, or the field converters) defines the number of fields for the whole file. Obviously, only empty fields at the end are supported.

Here is an example:

(NeoCSVReader on: '1,2,3\5\3,2,1' withCRs readStream) emptyFieldValue: #empty; upToEnd.

=> #(#('1' '2' '3') #('5' #empty #empty) #('3' '2' '1'))

The second line/record contains only one value; the empty ones are added automatically. Setting #emptyFieldValue: is optional; the empty field value defaults to nil.
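
For comparison, a minimal sketch of the same input read without setting an empty field value. The variable name is only illustrative; going by the description above, the short second record should still be padded to three fields, with nil in the missing positions.

```smalltalk
| rows |
"No emptyFieldValue: is set here, so the default is used."
rows := (NeoCSVReader on: '1,2,3\5\3,2,1' withCRs readStream) upToEnd.
"Answer the second record, which has only one value in the input."
rows second
```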

Sven