NeoCSV change proposal

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

NeoCSV change proposal

Peter Uhnak
Hi,

this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )

I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).

The change doesn't affect the default behavior, only when explicitly requested.

Peter

Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Peter Uhnak
attached

On Sat, Jul 22, 2017 at 02:12:10PM +0200, Peter Uhnak wrote:
> Hi,
>
> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>
> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
>
> The change doesn't affect the default behavior, only when explicitly requested.
>
> Peter

Neo-CSV-Core.st (38K) Download Attachment
Neo-CSV-Tests.st (30K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Sven Van Caekenberghe-2
In reply to this post by Peter Uhnak
Peter,

> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
>
> Hi,
>
> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>
> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
>
> The change doesn't affect the default behavior, only when explicitly requested.
>
> Peter

Could you email me your .mcz packages (saved in the cache) ? That is much easier to work with.

Thx, Sven


Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Sven Van Caekenberghe-2
In reply to this post by Peter Uhnak
Hi Peter,

> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
>
> Hi,
>
> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>
> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
>
> The change doesn't affect the default behavior, only when explicitly requested.
>
> Peter

I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.

This option is always going to be slower, but the current implementation might be improved, I think.

The key test in #writeOptionalQuotedField:

{
  lineEnd asString.
  separator asString.
  '"' } anySatisfy: [ :each | string includesSubstring: each ]

will be quite slow.

If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.

Regards,

Sven


Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Stephane Ducasse-3
Hi peter
Do you have an example that I understand?

@Svn Once the change is integrated could you add a little paragraph to
the chapter?

Stef

On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe <[hidden email]> wrote:

> Hi Peter,
>
>> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
>>
>> Hi,
>>
>> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>>
>> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
>>
>> The change doesn't affect the default behavior, only when explicitly requested.
>>
>> Peter
>
> I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.
>
> This option is always going to be slower, but the current implementation might be improved, I think.
>
> The key test in #writeOptionalQuotedField:
>
> {
>   lineEnd asString.
>   separator asString.
>   '"' } anySatisfy: [ :each | string includesSubstring: each ]
>
> will be quite slow.
>
> If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.
>
> Regards,
>
> Sven
>
>

Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Sven Van Caekenberghe-2

> On 22 Jul 2017, at 21:51, Stephane Ducasse <[hidden email]> wrote:
>
> Hi peter
> Do you have an example that I understand?

In its most simple form, CSV looks like this:

a,b,c,d

However, if a field is text, it might contain a comma itself, or other problematic characters like line endings, hence there is a provision for quoting fields, as follows:

"a","b","c",d

Now, "a" could be "a1,a2" and still be one field. Embedded quotes are doubled.

Not all fields need quoting, normally only text or string fields. If you know something is an integer, you don't need quotes, like d above.

Since CSV is normally a collection of identical records, you set these options for all of them.

What Peter wanted, and now implemented himself as an option, is to make the quoting optional per field, per record, by first checking if a field contains dangerous characters and only then adding the quotes. I guess this makes most sense in the context of generating a CSV file for humans to edit.

> @Svn Once the change is integrated could you add a little paragraph to
> the chapter?
>
> Stef
>
> On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe <[hidden email]> wrote:
>> Hi Peter,
>>
>>> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
>>>
>>> Hi,
>>>
>>> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>>>
>>> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
>>>
>>> The change doesn't affect the default behavior, only when explicitly requested.
>>>
>>> Peter
>>
>> I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.
>>
>> This option is always going to be slower, but the current implementation might be improved, I think.
>>
>> The key test in #writeOptionalQuotedField:
>>
>> {
>>  lineEnd asString.
>>  separator asString.
>>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
>>
>> will be quite slow.
>>
>> If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.
>>
>> Regards,
>>
>> Sven
>>
>>
>


Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Peter Uhnak
In reply to this post by Sven Van Caekenberghe-2
Hi Sven,

my use case was hand-edited CSVs (therefore the quotes are extra clutter), which imples that I would be hand-viewing/editing only small CSVs (no performance issues).

I agree that we should err on the safe side with CR & LF (e.g. tools may sometimes autoconvert line endings).

Regarding performance:

#findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or I don't know how to use), so I've trried with inCharacterSet:

Tested on worst-case scenario - strings don't contain tested symbols.

s10 := 'a' repeat: 10.
s100 := 'a' repeat: 100.

[ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 includesSubstring: each ] ] bench. "'1,200,046 per second'"
[ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 includesSubstring: each ] ] bench. "'495,482 per second'"

[ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,819,416 per second'"
[ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,200,668 per second'"

[ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'1,187,324 per second'"
[ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'165,526 per second'"


#includesAny: seems to be the best by far.

Storing the tested characters didn't improve it by much.

Peter

On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:

> Hi Peter,
>
> > On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
> >
> > Hi,
> >
> > this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> >
> > I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
> >
> > The change doesn't affect the default behavior, only when explicitly requested.
> >
> > Peter
>
> I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.
>
> This option is always going to be slower, but the current implementation might be improved, I think.
>
> The key test in #writeOptionalQuotedField:
>
> {
>   lineEnd asString.
>   separator asString.
>   '"' } anySatisfy: [ :each | string includesSubstring: each ]
>
> will be quite slow.
>
> If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.
>
> Regards,
>
> Sven
>
>

Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Sven Van Caekenberghe-2
Peter,

> On 22 Jul 2017, at 22:27, Peter Uhnak <[hidden email]> wrote:
>
> Hi Sven,
>
> my use case was hand-edited CSVs (therefore the quotes are extra clutter), which imples that I would be hand-viewing/editing only small CSVs (no performance issues).
>
> I agree that we should err on the safe side with CR & LF (e.g. tools may sometimes autoconvert line endings).
>
> Regarding performance:
>
> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or I don't know how to use), so I've trried with inCharacterSet:

Yes, it is a bitch to use, the the version you used does not use the primitive.

Try playing with this:

s10 := 'a' repeat: 10.
s100 := 'a' repeat: 100.

searchSet := ByteArray new: 256 withAll: 0.
',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].

searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.

[ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
 "15,160,161 per second"
[ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench.
 "8,219,081 per second"

ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: 1.

Sven

> Tested on worst-case scenario - strings don't contain tested symbols.
>
> s10 := 'a' repeat: 10.
> s100 := 'a' repeat: 100.
>
> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 includesSubstring: each ] ] bench. "'1,200,046 per second'"
> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 includesSubstring: each ] ] bench. "'495,482 per second'"
>
> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,819,416 per second'"
> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,200,668 per second'"
>
> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'1,187,324 per second'"
> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'165,526 per second'"
>
>
> #includesAny: seems to be the best by far.
>
> Storing the tested characters didn't improve it by much.
>
> Peter
>
> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
>> Hi Peter,
>>
>>> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
>>>
>>> Hi,
>>>
>>> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>>>
>>> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
>>>
>>> The change doesn't affect the default behavior, only when explicitly requested.
>>>
>>> Peter
>>
>> I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.
>>
>> This option is always going to be slower, but the current implementation might be improved, I think.
>>
>> The key test in #writeOptionalQuotedField:
>>
>> {
>>  lineEnd asString.
>>  separator asString.
>>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
>>
>> will be quite slow.
>>
>> If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.
>>
>> Regards,
>>
>> Sven
>>
>>
>


Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Peter Uhnak
Ah, ByteArrayMap, I was trying ByteArray.

Not sure what the next step is here: should I add it and send you mczs, or will you do it yourself? (it is a simple change).

Peter

On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote:

> Peter,
>
> > On 22 Jul 2017, at 22:27, Peter Uhnak <[hidden email]> wrote:
> >
> > Hi Sven,
> >
> > my use case was hand-edited CSVs (therefore the quotes are extra clutter), which imples that I would be hand-viewing/editing only small CSVs (no performance issues).
> >
> > I agree that we should err on the safe side with CR & LF (e.g. tools may sometimes autoconvert line endings).
> >
> > Regarding performance:
> >
> > #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or I don't know how to use), so I've trried with inCharacterSet:
>
> Yes, it is a bitch to use, the the version you used does not use the primitive.
>
> Try playing with this:
>
> s10 := 'a' repeat: 10.
> s100 := 'a' repeat: 100.
>
> searchSet := ByteArray new: 256 withAll: 0.
> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].
>
> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.
>
> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
>  "15,160,161 per second"
> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench.
>  "8,219,081 per second"
>
> ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: 1.
>
> Sven
>
> > Tested on worst-case scenario - strings don't contain tested symbols.
> >
> > s10 := 'a' repeat: 10.
> > s100 := 'a' repeat: 100.
> >
> > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 includesSubstring: each ] ] bench. "'1,200,046 per second'"
> > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 includesSubstring: each ] ] bench. "'495,482 per second'"
> >
> > [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,819,416 per second'"
> > [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,200,668 per second'"
> >
> > [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'1,187,324 per second'"
> > [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'165,526 per second'"
> >
> >
> > #includesAny: seems to be the best by far.
> >
> > Storing the tested characters didn't improve it by much.
> >
> > Peter
> >
> > On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
> >> Hi Peter,
> >>
> >>> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
> >>>
> >>> Hi,
> >>>
> >>> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> >>>
> >>> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
> >>>
> >>> The change doesn't affect the default behavior, only when explicitly requested.
> >>>
> >>> Peter
> >>
> >> I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.
> >>
> >> This option is always going to be slower, but the current implementation might be improved, I think.
> >>
> >> The key test in #writeOptionalQuotedField:
> >>
> >> {
> >>  lineEnd asString.
> >>  separator asString.
> >>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
> >>
> >> will be quite slow.
> >>
> >> If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.
> >>
> >> Regards,
> >>
> >> Sven
> >>
> >>
> >
>
>

Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Stephane Ducasse-3
In reply to this post by Sven Van Caekenberghe-2
thanks for the explanation.
this is cool.


On Sat, Jul 22, 2017 at 10:06 PM, Sven Van Caekenberghe <[hidden email]> wrote:

>
>> On 22 Jul 2017, at 21:51, Stephane Ducasse <[hidden email]> wrote:
>>
>> Hi peter
>> Do you have an example that I understand?
>
> In its most simple form, CSV looks like this:
>
> a,b,c,d
>
> However, if a field is text, it might contain a comma itself, or other problematic characters like line endings, hence there is a provision for quoting fields, as follows:
>
> "a","b","c",d
>
> Now, "a" could be "a1,a2" and still be one field. Embedded quotes are doubled.
>
> Not all fields need quoting, normally only text or string fields. If you know something is an integer, you don't need quotes, like d above.
>
> Since CSV is normally a collection of identical records, you set these options for all of them.
>
> What Peter wanted, and now implemented himself as an option, is to make the quoting optional per field, per record, by first checking if a field contains dangerous characters and only then adding the quotes. I guess this makes most sense in the context of generating a CSV file for humans to edit.
>
>> @Svn Once the change is integrated could you add a little paragraph to
>> the chapter?
>>
>> Stef
>>
>> On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe <[hidden email]> wrote:
>>> Hi Peter,
>>>
>>>> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>>>>
>>>> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
>>>>
>>>> The change doesn't affect the default behavior, only when explicitly requested.
>>>>
>>>> Peter
>>>
>>> I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.
>>>
>>> This option is always going to be slower, but the current implementation might be improved, I think.
>>>
>>> The key test in #writeOptionalQuotedField:
>>>
>>> {
>>>  lineEnd asString.
>>>  separator asString.
>>>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
>>>
>>> will be quite slow.
>>>
>>> If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.
>>>
>>> Regards,
>>>
>>> Sven
>>>
>>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Sven Van Caekenberghe-2
In reply to this post by Peter Uhnak

> On 23 Jul 2017, at 09:55, Peter Uhnak <[hidden email]> wrote:
>
> Ah, ByteArrayMap, I was trying ByteArray.

ByteArrayMap is not a class, it is still a ByteArray, but of size 256 used as an inclusion map for characters that fit a byte.

BTW, using CharacterSet as an abstraction makes the trick easier.

I committed:

===
Name: Neo-CSV-Core-SvenVanCaekenberghe.26
Author: SvenVanCaekenberghe
Time: 23 July 2017, 1:54:22.095185 pm
UUID: e429f87e-5c11-0d00-a72d-ae5d00db2a8c
Ancestors: Neo-CSV-Core-PeterUhnak.25

make the core test in #writeOptionalQuotedField faster by using a primitive when possible

add #testWideOptionalQuoted
===
Name: Neo-CSV-Tests-SvenVanCaekenberghe.22
Author: SvenVanCaekenberghe
Time: 23 July 2017, 1:54:52.133003 pm
UUID: 5f81c280-5c11-0d00-a72e-7cd500db2a8c
Ancestors: Neo-CSV-Tests-PeterUhnak.21

make the core test in #writeOptionalQuotedField faster by using a primitive when possible

add #testWideOptionalQuoted
===

Extra caution was needed for WideStrings.

Sven

> Not sure what the next step is here: should I add it and send you mczs, or will you do it yourself? (it is a simple change).
>
> Peter
>
> On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote:
>> Peter,
>>
>>> On 22 Jul 2017, at 22:27, Peter Uhnak <[hidden email]> wrote:
>>>
>>> Hi Sven,
>>>
>>> my use case was hand-edited CSVs (therefore the quotes are extra clutter), which imples that I would be hand-viewing/editing only small CSVs (no performance issues).
>>>
>>> I agree that we should err on the safe side with CR & LF (e.g. tools may sometimes autoconvert line endings).
>>>
>>> Regarding performance:
>>>
>>> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or I don't know how to use), so I've trried with inCharacterSet:
>>
>> Yes, it is a bitch to use, the the version you used does not use the primitive.
>>
>> Try playing with this:
>>
>> s10 := 'a' repeat: 10.
>> s100 := 'a' repeat: 100.
>>
>> searchSet := ByteArray new: 256 withAll: 0.
>> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].
>>
>> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.
>>
>> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
>> "15,160,161 per second"
>> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench.
>> "8,219,081 per second"
>>
>> ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: 1.
>>
>> Sven
>>
>>> Tested on worst-case scenario - strings don't contain tested symbols.
>>>
>>> s10 := 'a' repeat: 10.
>>> s100 := 'a' repeat: 100.
>>>
>>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 includesSubstring: each ] ] bench. "'1,200,046 per second'"
>>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 includesSubstring: each ] ] bench. "'495,482 per second'"
>>>
>>> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,819,416 per second'"
>>> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,200,668 per second'"
>>>
>>> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'1,187,324 per second'"
>>> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'165,526 per second'"
>>>
>>>
>>> #includesAny: seems to be the best by far.
>>>
>>> Storing the tested characters didn't improve it by much.
>>>
>>> Peter
>>>
>>> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
>>>> Hi Peter,
>>>>
>>>>> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>>>>>
>>>>> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
>>>>>
>>>>> The change doesn't affect the default behavior, only when explicitly requested.
>>>>>
>>>>> Peter
>>>>
>>>> I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.
>>>>
>>>> This option is always going to be slower, but the current implementation might be improved, I think.
>>>>
>>>> The key test in #writeOptionalQuotedField:
>>>>
>>>> {
>>>> lineEnd asString.
>>>> separator asString.
>>>> '"' } anySatisfy: [ :each | string includesSubstring: each ]
>>>>
>>>> will be quite slow.
>>>>
>>>> If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.
>>>>
>>>> Regards,
>>>>
>>>> Sven
>>>>
>>>>
>>>
>>
>>
>


Reply | Threaded
Open this post in threaded view
|

Re: NeoCSV change proposal

Peter Uhnak
Thank you Sven!

Peter

On Sun, Jul 23, 2017 at 01:59:41PM +0200, Sven Van Caekenberghe wrote:

>
> > On 23 Jul 2017, at 09:55, Peter Uhnak <[hidden email]> wrote:
> >
> > Ah, ByteArrayMap, I was trying ByteArray.
>
> ByteArrayMap is not a class, it is still a ByteArray, but of size 256 used as an inclusion map for characters that fit a byte.
>
> BTW, using CharacterSet as an abstraction makes the trick easier.
>
> I committed:
>
> ===
> Name: Neo-CSV-Core-SvenVanCaekenberghe.26
> Author: SvenVanCaekenberghe
> Time: 23 July 2017, 1:54:22.095185 pm
> UUID: e429f87e-5c11-0d00-a72d-ae5d00db2a8c
> Ancestors: Neo-CSV-Core-PeterUhnak.25
>
> make the core test in #writeOptionalQuotedField faster by using a primitive when possible
>
> add #testWideOptionalQuoted
> ===
> Name: Neo-CSV-Tests-SvenVanCaekenberghe.22
> Author: SvenVanCaekenberghe
> Time: 23 July 2017, 1:54:52.133003 pm
> UUID: 5f81c280-5c11-0d00-a72e-7cd500db2a8c
> Ancestors: Neo-CSV-Tests-PeterUhnak.21
>
> make the core test in #writeOptionalQuotedField faster by using a primitive when possible
>
> add #testWideOptionalQuoted
> ===
>
> Extra caution was needed for WideStrings.
>
> Sven
>
> > Not sure what the next step is here: should I add it and send you mczs, or will you do it yourself? (it is a simple change).
> >
> > Peter
> >
> > On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote:
> >> Peter,
> >>
> >>> On 22 Jul 2017, at 22:27, Peter Uhnak <[hidden email]> wrote:
> >>>
> >>> Hi Sven,
> >>>
> >>> my use case was hand-edited CSVs (therefore the quotes are extra clutter), which imples that I would be hand-viewing/editing only small CSVs (no performance issues).
> >>>
> >>> I agree that we should err on the safe side with CR & LF (e.g. tools may sometimes autoconvert line endings).
> >>>
> >>> Regarding performance:
> >>>
> >>> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or I don't know how to use), so I've trried with inCharacterSet:
> >>
> >> Yes, it is a bitch to use, the the version you used does not use the primitive.
> >>
> >> Try playing with this:
> >>
> >> s10 := 'a' repeat: 10.
> >> s100 := 'a' repeat: 100.
> >>
> >> searchSet := ByteArray new: 256 withAll: 0.
> >> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].
> >>
> >> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.
> >>
> >> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
> >> "15,160,161 per second"
> >> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench.
> >> "8,219,081 per second"
> >>
> >> ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: 1.
> >>
> >> Sven
> >>
> >>> Tested on worst-case scenario - strings don't contain tested symbols.
> >>>
> >>> s10 := 'a' repeat: 10.
> >>> s100 := 'a' repeat: 100.
> >>>
> >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 includesSubstring: each ] ] bench. "'1,200,046 per second'"
> >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 includesSubstring: each ] ] bench. "'495,482 per second'"
> >>>
> >>> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,819,416 per second'"
> >>> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,200,668 per second'"
> >>>
> >>> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'1,187,324 per second'"
> >>> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'165,526 per second'"
> >>>
> >>>
> >>> #includesAny: seems to be the best by far.
> >>>
> >>> Storing the tested characters didn't improve it by much.
> >>>
> >>> Peter
> >>>
> >>> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
> >>>> Hi Peter,
> >>>>
> >>>>> On 22 Jul 2017, at 14:12, Peter Uhnak <[hidden email]> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> >>>>>
> >>>>> I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets).
> >>>>>
> >>>>> The change doesn't affect the default behavior, only when explicitly requested.
> >>>>>
> >>>>> Peter
> >>>>
> >>>> I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you.
> >>>>
> >>>> This option is always going to be slower, but the current implementation might be improved, I think.
> >>>>
> >>>> The key test in #writeOptionalQuotedField:
> >>>>
> >>>> {
> >>>> lineEnd asString.
> >>>> separator asString.
> >>>> '"' } anySatisfy: [ :each | string includesSubstring: each ]
> >>>>
> >>>> will be quite slow.
> >>>>
> >>>> If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Sven
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>