ASCII Streams with mixed Line-End conventions

jtuchel

Hi there,

We're importing some data from CSV files. There are a lot of good reasons not to do it, but anyway... ;-)

Just yesterday I got a complaint from a user who couldn't figure out how to import their data. It turns out their source system exports data with mixed CrLf and Lf line-ends.

I know there are quite a few things we could do here, like pre-processing the file: look for the first occurrence of either Cr, CrLf or Lf and make sure to unify them before starting the CSV import. But maybe someone has an even easier way to do it, maybe even with on-board tools of VAST?

Joachim

--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/d5154674-0601-4e9e-978e-f7725c335e83n%40googlegroups.com.
Re: ASCII Streams with mixed Line-End conventions

Louis LaBrunda
Hi Joachim,

The #nextLine method of #CfsReadFileStream has the following comment:

nextLine
"Answer the elements between the current position and the next lineDelimiter. 
 
If #shouldSearchForAllStandardDelimiters answers false 
(likely, the user has specified an explicit line delimiter via #lineDelimiter:)
then we ONLY search for that delimiter. However, if it answers true
(likely the stream current delimiter is the default one), then we try to look
for any of the standard delimiters: Cr, Lf, and CrLf. This is useful when we are
reading streams that could have been written on different platforms.
Answers:
<ByteArray | String>"

It seems to me that covers your case. I haven't checked, but I guess other streams may have similar methods.

Lou
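
To make the behaviour described in that method comment concrete, here is a minimal workspace sketch of splitting text into lines while accepting Cr, Lf, or CrLf as a line end. It is an illustration only and not the VAST implementation of #nextLine. It assumes nothing beyond the common stream protocol (ReadStream on:, #peek, #next, #atEnd) and that Character cr and Character lf are available in your image. The sample input and all variable names are made up.

```smalltalk
"Sketch only: accept Cr, Lf, or CrLf as a line end. Not the VAST #nextLine source."
| input in lines line ch |
input := 'a;b' , (String with: Character cr with: Character lf) ,
    'c;d' , (String with: Character lf) ,
    'e;f'.
in := ReadStream on: input.
lines := OrderedCollection new.
[in atEnd] whileFalse: [
    line := WriteStream on: String new.
    "Copy characters until the next Cr or Lf, or until the end of the input."
    [in atEnd or: [(ch := in peek) = Character cr or: [ch = Character lf]]]
        whileFalse: [line nextPut: in next].
    in atEnd ifFalse: [
        "Consume the delimiter; a Cr directly followed by Lf counts as one line end."
        (in next = Character cr and: [in atEnd not and: [in peek = Character lf]])
            ifTrue: [in next]].
    lines add: line contents].
lines
```

Evaluated in a workspace, lines should hold the three strings 'a;b', 'c;d' and 'e;f', even though the first line ends in CrLf and the second in Lf.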

Re: ASCII Streams with mixed Line-End conventions

Richard Sargent
Administrator
In reply to this post by jtuchel
Given that a quoted field may contain line breaks, you really need to parse the end of line as part of parsing the fields; i.e., splitting the file into lines first will give you bad results.

e.g. pseudocode

parseLine
    [self parseField] whileTrue.
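
A minimal workspace sketch of why the line end has to be handled together with the quoting: it splits CSV text into records and treats Cr, Lf, or CrLf as a record end only outside a double-quoted field. It does not split fields, it is not how NeoCSVReader is implemented, and the sample data and variable names are made up. It assumes only the common stream protocol plus Character cr and Character lf.

```smalltalk
"Sketch only: a line end inside double quotes is data, not a record end."
| input in records record inQuotes ch |
input := 'id;note' , (String with: Character lf) ,
    '1;"two' , (String with: Character cr with: Character lf) , 'lines"' ,
    (String with: Character lf) ,
    '2;plain'.
in := ReadStream on: input.
records := OrderedCollection new.
record := WriteStream on: String new.
inQuotes := false.
[in atEnd] whileFalse: [
    ch := in next.
    "Every double quote toggles the state; an escaped pair toggles twice and cancels out."
    ch = $" ifTrue: [inQuotes := inQuotes not].
    (inQuotes not and: [ch = Character cr or: [ch = Character lf]])
        ifTrue: [
            "Outside quotes: swallow the Lf of a CrLf pair, then close the record."
            (ch = Character cr and: [in atEnd not and: [in peek = Character lf]])
                ifTrue: [in next].
            records add: record contents.
            record := WriteStream on: String new]
        ifFalse: [record nextPut: ch]].
record contents isEmpty ifFalse: [records add: record contents].
records size
```

For this input the sketch should answer 3 records, with the CrLf inside the quoted note preserved as part of the second record. Splitting on line ends first would have produced four lines and torn that field apart.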



Re: ASCII Streams with mixed Line-End conventions

jtuchel
In reply to this post by Louis LaBrunda
Thanks Lou,

I was somewhat assuming that if I set #lineDelimiter (a variable that our users can modify in the application's GUI), the Stream will only look for this one. The comment suggests I can set (or override) #shouldSearchForAllStandardDelimiters and combine the two somehow.

I will have to play with this....

BUT of course, Richard is bringing up a very valid point. We're using NeoCSVReader to parse CSV fields from these files. So the challenge here will be to find the right way to fiddle the line-end detection into that.... In the worst of cases, we'll have to do the "line reading" outside of NeoCSVReader and only feed each single line into the Reader instead of using #upToEnd....

Sigh. So much trouble for a single exception to a simple rule....

Anyways: Thanks Lou and Richard for your input. You showed me both a possible solution to take a closer look at as well as a very important caveat...

Joachim



Re: ASCII Streams with mixed Line-End conventions

Louis LaBrunda
Hi Joachim,

Perhaps you can modify #nextLine or extend the class with a new method that handles the embedded line-end character. Perhaps something as simple as counting the double quotes and continuing to read if there is an odd number.

Lou
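
The quote-counting idea could look roughly like the following workspace sketch. It starts from lines that have already been split (here a hand-made collection standing in for repeated #nextLine calls) and keeps joining the next line while the number of double quotes seen so far is odd. All names and data are hypothetical. Because the original line end is already gone at that point, the sketch re-inserts an Lf as an assumption, so a CrLf inside a field would not survive unchanged.

```smalltalk
"Sketch only: join physical lines until the double quotes balance."
| lines records pending quotes |
lines := OrderedCollection new.
lines add: 'id;note'; add: '1;"two'; add: 'lines"'; add: '2;plain'.
records := OrderedCollection new.
pending := nil.
lines do: [:line |
    pending := pending isNil
        ifTrue: [line]
        ifFalse: [pending , (String with: Character lf) , line].
    quotes := pending occurrencesOf: $".
    "An even count means every quoted field opened so far has been closed."
    quotes even ifTrue: [
        records add: pending.
        pending := nil]].
records size
```

For this input the sketch should answer 3. A record whose quotes never balance is silently dropped here; real code would need to report that case.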

Re: ASCII Streams with mixed Line-End conventions

Mariano Martinez Peck-2
Hi guys, 

Joachim, yes, that behavior of #nextLine and #shouldSearchForAllStandardDelimiters is somewhat new (9.1 or 9.2) and I hope it can meet your requirements. If not, then I would like to hear about it. I'm not sure I understand the question about NeoCSVReader. If that's still an issue, could you clarify?

Louis: no, please, don't modify base apps. 

Best,

Mariano



--
Mariano Martinez Peck
Software Engineer, Instantiations Inc.

Re: ASCII Streams with mixed Line-End conventions

Louis LaBrunda
Mariano: I wasn't suggesting the base be modified, just that #nextLine could be copied to a new name and modified to do what Joachim needs in order to address Richard's warning.

Re: ASCII Streams with mixed Line-End conventions

jtuchel
In reply to this post by Mariano Martinez Peck-2
Mariano,

Thanks. First of all, I can confirm #nextLine works as expected on the example file that contained lines ending in both <cr><lf> and <lf>. So the key is not to set a lineDelimiter on the Stream.

To my surprise, NeoCSV also scans for cr/crlf/lf and does everything the way it should.

The problem in this case is that I had implemented some clever code that first reads the first line of the file, tries to determine the lineDelimiter, and then sets it on the input Stream. From then on, the Stream didn't handle any lineDelimiter other than the one I had told it. So I was knocking out the cleverness of the Streams. I thought I was more clever. I learned something about this illusion today ;-)

Anyways: thank you Lou, Richard, Mariano for joining me in this journey and pushing me in the right direction.

Conclusion: PositionableStream, Cfs*FileStream and NeoCSVReader all handle line endings perfectly. If you expect Windows or Unix line endings and even a mix of these, just let them do their work and don't interfere. You're in the way if you try to be more clever ;-)

Joachim





Re: ASCII Streams with mixed Line-End conventions

Richard Sargent
Administrator


One of the greatest failings of programmers is when they try to be clever. There are myriad counter-examples. I'll tell you my favourite, if you're interested.



Re: ASCII Streams with mixed Line-End conventions

Mariano Martinez Peck-2
In reply to this post by jtuchel
Hi Joachim, 

Happy to read everything is working. I think you indeed needed to be smarter before, and that's exactly why we improved this in 9.1. Just for the record, the developer case was:

"63547: Improve #nextLine to work with all known delimiters (Cr, Lf and CrLf)"

Cheers,



On Fri, Sep 25, 2020 at 3:36 AM Joachim Tuchel <[hidden email]> wrote:
Mariano,

Thanks. First of all, I can confirm #nextLine works as expected on teh example file that contained lines ending in both <cr><lf> and <lf>. So the key is not to set a lineDelimiter on the Stream.

To my surpise, NeoCSV also scans for cr/crlf/lf and does everything in the way it should.

The problem in this case is that I had implemented some clever code that first reads the first line of the File and tries to determine the lineDelimiter and then sets it on the input Stream. From then on, the Stream didn't handle any other lineDeliimiter than the one I had told it. So I was knocking out the cleverness of the Streams. I thought I was more clever. I learned something about this illusion today ;-)

Anyways: thank you Lou, Richard, Mariano for joining me in this journey and pushing me in the right direction.

Conclusion: PositionableStream, Cfs*FileStream and NeoCSVReader all handle line endings perfectly. If you expect Windows or Unix line endings and even a mix of these, just let them do their work and don't interfere. You're in the way if you try to be more clever ;-)

Joachim





[hidden email] schrieb am Donnerstag, 24. September 2020 um 22:00:01 UTC+2:
Hi guys, 

Joachim, yes, that behavior on #nextLine and #shouldSearchForAllStandardDelimiters is somehow new (91 or 92) and I hope it can meet your requirements. If not, then I would like to hear about it.  Not sure if I understand the question NeoCSVReader. If that's still an issue, could you clarify?

Louis: no, please, don't modify base apps. 

Best,

Mariano



On Tue, Sep 22, 2020 at 10:48 AM Louis LaBrunda <[hidden email]> wrote:
Hi Joachim,

Perhaps you can modify #nextLine or extend the class with a new method that handles the embedded line-end character. Perhaps something as simple as counting the double quotes and continuing to read if there is an odd number.
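
A rough sketch of the quote-counting idea, written as a stand-alone workspace loop rather than as a change to #nextLine or any base class. It assumes that quotes inside a field are doubled, so a complete record always contains an even number of double quotes; the sample data and variable names are made up.

```smalltalk
"Sketch only: a helper loop outside the base classes.
 Assumes a complete CSV record has an even number of double quotes."
| lf data stream records record |
lf := String with: Character lf.
"The second record has a line end inside a quoted field."
data := 'id,comment' , lf , '1,"two' , lf , 'lines"' , lf , '2,plain' , lf.
stream := ReadStream on: data.
records := OrderedCollection new.
[stream atEnd] whileFalse: [
    record := stream nextLine.
    "An odd quote count means the line end was inside a quoted field,
     so keep appending lines until the quotes balance."
    [(record occurrencesOf: $") odd and: [stream atEnd not]] whileTrue: [
        record := record , lf , stream nextLine].
    records add: record].
records
```

Note that this rejoins continuation lines with a plain Lf, so the original embedded line ending is not preserved. A CSV parser that understands quoted fields makes this loop unnecessary.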

Lou

On Tuesday, September 22, 2020 at 2:51:12 AM UTC-4 [hidden email] wrote:
Thanks Lou,

I was somewhat assuming that if I set #lineDelimiter (a variable that our users can modify in the application's GUI), the Stream will only look for this one. The comment suggests I can set (or override) #shouldSearchForAllStandardDelimiters and combine the two somehow.

I will have to play with this....

BUT of course, Richard is bringing up a very valid point. We're using NeoCSVReader to parse CSV fields from these files. So the challenge here will be to find the right way to fiddle the line-end detection into that.... In the worst of cases, we'll have to do the "line reading" outside of NeoCSVReader and only feed each single line into the Reader instead of using #upToEnd....
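
For comparison, a minimal sketch of the simple case, where the whole stream is handed to NeoCSVReader and read with #upToEnd. It assumes the NeoCSV port loaded in the image answers the usual #on: and #upToEnd messages and uses a comma as its default separator; the sample data is made up.

```smalltalk
"Sketch only. Assumes NeoCSVReader answers #on: and #upToEnd, and that
 a comma is the separator. The sample data is made up."
| crlf lf data records |
crlf := String with: Character cr with: Character lf.
lf := String with: Character lf.
"Mixed CrLf and Lf line ends, as in the problem file."
data := 'id,name' , crlf , '1,Anna' , lf , '2,Bert' , crlf.
"Let the reader find the record boundaries itself."
records := (NeoCSVReader on: (ReadStream on: data)) upToEnd.
records
```

Reading line by line and feeding each line to its own reader would be the fallback, but it breaks as soon as a quoted field contains a line end.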

Sigh. So much trouble for a single exception to a simple rule....

Anyways: Thanks Lou and Richard for your input. You showed me both a possible solution to take a closer look at as well as a very important caveat...

Joachim




--
Mariano Martinez Peck
Software Engineer, Instantiations Inc.


Re: ASCII Streams with mixed Line-End conventions

jtuchel
In reply to this post by Richard Sargent
Richard

I love to hear stories from and about other developers that feed my illusion that I am not alone in my imperfection. The cleverest ideas I have often turn out to be the worst, but it takes time and sweat to find out ;-)
Is there something like Anonymous Embarrassed Programmers? I'd like to join.

Joachim


Richard Sargent wrote on Friday, 25 September 2020 at 09:23:13 UTC+2:

One of the greatest failings of programmers is when they try to be clever. There are myriad counter-examples. I'll tell you my favourite, if you're interested.




Re: ASCII Streams with mixed Line-End conventions

jtuchel
In reply to this post by Mariano Martinez Peck-2
Mariano,

I guess, or at least like to think, that this was once a clever choice to overcome some limitation. I just recently migrated to 9.2, so thank you for opening this back door for me ;-)

Anyways: removing my clever code was a satisfying act today, as was seeing how well VAST handles even mixed line endings in files and streams. These are the small but powerful things that I like about your work at Instantiations.

Joachim



Re: ASCII Streams with mixed Line-End conventions

Mariano Martinez Peck-2



How did you know I implemented that? hahahah ;)
Now seriously, I am still not convinced about whether this should happen automatically or not, that is, via #shouldSearchForAllStandardDelimiters.
I evaluated, and still have in mind, using an explicit new boolean instVar to control that, but so far I am still not convinced. 

BTW, to implement that functionality and make it performant, we also implemented new primitives, which allowed speedups in other areas too. 
Below are some notes from our internal bug tracker:

"- New methods in SequenceableCollection:

indexOfAny: aSequenceableCollection
indexOfAny: aSequenceableCollection ifAbsent: exceptionBlock
indexOfAny: aSequenceableCollection startingAt: start
indexOfAny: aSequenceableCollection startingAt: start ifAbsent: exceptionHandler

- These 4 methods give symmetry with indexOfSubCollection...
- indexOfAny will answer the index of the first element that is also in the argument aSequenceableCollection
- Implemented VMprStringIndexOfAny prim
This will provide String/DBString specific prim-assist for character searches using indexOfAny.... It's about 3-4x faster than using the more generic version in SequenceableCollection.

- These also went on all the other streams that have the new delimiter handlers:

skipToAny: aSequentialCollection
upToAny: aSequentialCollection
"
Create a generic/reusable and FAST #indexOfAny:*, #skipToAny: and #upToAny:


So basically.... all these #indexOfAny:*, #skipToAny:  and #upToAny:  are now prim assisted and fast.  
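
A small sketch of how the selectors from those notes might be used. It assumes a VAST version that ships them, and it assumes #upToAny: answers the elements before the first match, by analogy with #upTo:. The sample text and variable names are made up.

```smalltalk
"Sketch only. Assumes #indexOfAny: and #upToAny: are available in the
 image. The sample text is made up."
| delimiters text index stream head |
delimiters := String with: Character cr with: Character lf.
text := 'first' , (String with: Character lf) , 'second'.
"Index of the first character of text that also occurs in delimiters."
index := text indexOfAny: delimiters.
stream := ReadStream on: text.
"Assumed to answer everything before the first delimiter character."
head := stream upToAny: delimiters.
```

Searching for any of several delimiter characters in one pass is what lets #nextLine cope with Cr, Lf and CrLf without scanning the stream once per delimiter.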

--
Mariano Martinez Peck
Software Engineer, Instantiations Inc.


Re: ASCII Streams with mixed Line-End conventions

Richard Sargent
Administrator
In reply to this post by jtuchel

Most of my stories/complaints are more in the "I would like to Slinky the programmer responsible for this!" style.

My favourite and most extreme example involved buying something on eBay when I lived in Zurich. I had navigated through numerous pages to order the product and had navigated through a number of pages to set up the payment. As soon as I entered my credit card's billing address, my web browser presented everything in German. Clever programmer (TM) had decided that, living in a German-speaking city, I must speak and read German and that he (somehow I am sure it was a male) was doing me a great favour by switching to German.


