Hi there,

we're importing some data from CSV files. There are plenty of good reasons not to do it, but anyways... ;-)

Just yesterday I got a complaint from a user who couldn't figure out how to import their data. It turns out their source system exports data with mixed CrLf and Lf line ends.

I know there are quite a few things we could do here, like pre-processing the file: look for the first occurrence of either Cr, CrLf or Lf and make sure to unify them before starting the CSV import. But maybe someone has an even easier way to do it, maybe even with on-board tools of VAST?

Joachim
-- You received this message because you are subscribed to the Google Groups "VA Smalltalk" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/d5154674-0601-4e9e-978e-f7725c335e83n%40googlegroups.com. |
Hi Joachim,

The #nextLine method of #CfsReadFileStream has the following comment:

    nextLine
        "Answer the elements between the current position and the next lineDelimiter.
         If #shouldSearchForAllStandardDelimiters answers false (likely, the user has
         specified an explicit line delimiter via #lineDelimiter:) then we ONLY search
         for that delimiter. However, if it answers true (likely the stream current
         delimiter is the default one), then we try to look for any of the standard
         delimiters: Cr, Lf, and CrLf. This is useful when we are reading streams that
         could have been written on different platforms.

         Answers: <ByteArray | String>"

It seems to me that covers your case. I haven't checked, but I guess other streams may have similar methods.

Lou

On Monday, September 21, 2020 at 4:02:17 AM UTC-4 [hidden email] wrote:
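For readers who don't have a VAST image at hand, the behaviour described in that method comment can be sketched roughly as follows. This is Python, not VAST code, and it is only an illustration of the two modes: the function name `read_lines` and its `line_delimiter` parameter are made up for this sketch and do not correspond to anything in CfsReadFileStream.

```python
# Illustrative sketch only: mimics the two modes described in the #nextLine comment.
# read_lines and line_delimiter are hypothetical names, not part of any VAST API.

# CrLf is listed first so that a CrLf pair is consumed as one delimiter, not two.
STANDARD_DELIMITERS = ("\r\n", "\r", "\n")


def read_lines(text, line_delimiter=None):
    """Split text into lines.

    With an explicit line_delimiter, ONLY that delimiter ends a line.
    Without one, any of Cr, Lf or CrLf ends a line.
    """
    delimiters = (line_delimiter,) if line_delimiter is not None else STANDARD_DELIMITERS
    lines, current, i = [], [], 0
    while i < len(text):
        for delimiter in delimiters:
            if text.startswith(delimiter, i):
                lines.append("".join(current))
                current = []
                i += len(delimiter)
                break
        else:
            # No delimiter starts here: the character belongs to the current line.
            current.append(text[i])
            i += 1
    if current:
        lines.append("".join(current))
    return lines


mixed = "a;b\r\nc;d\ne;f\rg;h"
print(read_lines(mixed))          # ['a;b', 'c;d', 'e;f', 'g;h']
print(read_lines(mixed, "\r\n"))  # ['a;b', 'c;d\ne;f\rg;h']
```

The second call shows what an explicitly set delimiter does to a mixed file: everything after the first CrLf ends up in one "line".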
In reply to this post by jtuchel
On Monday, September 21, 2020 at 1:02:17 AM UTC-7, Joachim Tuchel wrote:
Given that a quoted field may contain line breaks, you really need to parse the end of line as part of parsing the fields; i.e., splitting the file into lines first will give you bad results. E.g., in pseudocode:

    parseLine
        [self parseField] whileTrue.
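A small demonstration of that caveat, using Python's standard `csv` module rather than anything from VAST or NeoCSV (the sample data is invented for the illustration): splitting on line ends first tears a quoted field apart, while a parser that treats the end of line as part of field parsing keeps it whole, even with mixed CrLf and Lf record ends.

```python
import csv
import io

# Invented sample: mixed CrLf/Lf record ends, plus a line break inside a quoted field.
data = 'id,comment\r\n1,"first line\nsecond line"\n2,plain\r\n'

# Naive approach: split into lines first, then split each line into fields.
naive_rows = [line.split(",") for line in data.splitlines()]

# Parser approach: the csv reader decides what a record end is while parsing fields.
# newline="" passes the line ends through to the parser untranslated.
parsed_rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_rows))   # 4 -- the quoted field was cut in two
print(len(parsed_rows))  # 3 -- header plus two records
print(parsed_rows[1])    # ['1', 'first line\nsecond line']
```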
In reply to this post by Louis LaBrunda
Thanks Lou,

I was somewhat assuming that if I set #lineDelimiter (a variable that our users can modify in the application's GUI), the Stream will only look for this one. The comment suggests I can set (or override) #shouldSearchForAllStandardDelimiters and combine the two somehow. I will have to play with this....

BUT of course, Richard is bringing up a very valid point. We're using NeoCSVReader to parse CSV fields from these files. So the challenge here will be to find the right way to fit the line-end detection into that.... In the worst case, we'll have to do the "line reading" outside of NeoCSVReader and only feed each single line into the Reader instead of using #upToEnd.... Sigh. So much trouble for a single exception to a simple rule....

Anyways: Thanks Lou and Richard for your input. You showed me both a possible solution to take a closer look at and a very important caveat...

Joachim

[hidden email] wrote on Monday, 21 September 2020 at 15:02:47 UTC+2:
Hi Joachim,
Perhaps you can modify #nextLine or extend the class with a new method that handles the embedded line-end character. Perhaps something as simple as counting the double quotes and continuing to read if there is an odd number.

Lou

On Tuesday, September 22, 2020 at 2:51:12 AM UTC-4 [hidden email] wrote:
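The quote-counting idea could look roughly like the sketch below. It is Python, not a modified #nextLine, and `logical_records` is a hypothetical helper name. It assumes double quotes are the only quoting character and that an embedded quote is escaped by doubling it (so it still contributes an even count). Note that the original line end inside a quoted field is not preserved here; the pieces are simply rejoined with Lf.

```python
def logical_records(physical_lines):
    """Join physical lines into CSV records by double-quote parity.

    A record is complete once the number of double quotes seen so far is even.
    Hypothetical helper for illustration; assumes "" is the escape for a quote.
    """
    pending = []
    quote_count = 0
    for line in physical_lines:
        pending.append(line)
        quote_count += line.count('"')
        if quote_count % 2 == 0:
            # Balanced quotes: the record ends at this physical line.
            yield "\n".join(pending)
            pending = []
            quote_count = 0
    if pending:
        # Unbalanced quotes at end of input: hand back what is left.
        yield "\n".join(pending)


lines = ['1,"first line', 'second line"', '2,plain', '3,"say ""hi"""']
print(list(logical_records(lines)))
# ['1,"first line\nsecond line"', '2,plain', '3,"say ""hi"""']
```

Each joined record would still need to be handed to a real CSV field parser afterwards.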
Hi guys,

Joachim, yes, that behavior of #nextLine and #shouldSearchForAllStandardDelimiters is somewhat new (9.1 or 9.2) and I hope it can meet your requirements. If not, then I would like to hear about it.

I'm not sure I understand the question about NeoCSVReader. If that's still an issue, could you clarify?

Louis: no, please, don't modify base apps.

Best,
Mariano

On Tue, Sep 22, 2020 at 10:48 AM Louis LaBrunda <[hidden email]> wrote:

Mariano Martinez Peck
Software Engineer, Instantiations Inc.
Email: [hidden email]
Twitter: https://twitter.com/MartinezPeck
Blog: https://marianopeck.wordpress.com/
Mariano: I wasn't suggesting the base be modified, just that #nextLine could be copied to a new name and modified to do what Joachim needs to address Richard's warning.
On Thursday, September 24, 2020 at 4:00:01 PM UTC-4 [hidden email] wrote:
In reply to this post by Mariano Martinez Peck-2
Mariano,

Thanks. First of all, I can confirm #nextLine works as expected on the example file that contained lines ending in both <cr><lf> and <lf>. So the key is not to set a lineDelimiter on the Stream. To my surprise, NeoCSV also scans for cr/crlf/lf and does everything the way it should.

The problem in this case is that I had implemented some clever code that first reads the first line of the file, tries to determine the lineDelimiter and then sets it on the input Stream. From then on, the Stream didn't handle any lineDelimiter other than the one I had told it. So I was knocking out the cleverness of the Streams. I thought I was more clever. I learned something about this illusion today ;-)

Anyways: thank you Lou, Richard, Mariano for joining me on this journey and pushing me in the right direction.

Conclusion: PositionableStream, Cfs*FileStream and NeoCSVReader all handle line endings perfectly. If you expect Windows or Unix line endings and even a mix of these, just let them do their work and don't interfere. You're in the way if you try to be more clever ;-)

Joachim

[hidden email] wrote on Thursday, 24 September 2020 at 22:00:01 UTC+2:
On Thu, Sep 24, 2020, 23:36 Joachim Tuchel <[hidden email]> wrote:
One of the greatest failings of programmers is when they try to be clever. There are myriad counter-examples. I'll tell you my favourite, if you're interested.
In reply to this post by jtuchel
Hi Joachim,

Happy to read everything is working. I think you indeed needed to be smarter before, and that's exactly why we improved this in 9.1. Just for the record, the developer case was: "63547: Improve #nextLine to work with all known delimiters (Cr, Lf and CrLf)"

Cheers,

On Fri, Sep 25, 2020 at 3:36 AM Joachim Tuchel <[hidden email]> wrote:
Mariano Martinez Peck
Software Engineer, Instantiations Inc.
Email: [hidden email]
Twitter: https://twitter.com/MartinezPeck
Blog: https://marianopeck.wordpress.com/
In reply to this post by Richard Sargent
Richard,

I love to hear stories from and about other developers that feed my illusion that I am not alone in my imperfection. The cleverest ideas I have often turn out to be the worst, but it takes time and sweat to find out ;-) Is there something like Embarrassed Programmers Anonymous? I'd like to join.

Joachim

Richard Sargent wrote on Friday, 25 September 2020 at 09:23:13 UTC+2:
In reply to this post by Mariano Martinez Peck-2
Mariano,

I guess, or at least like the idea of thinking, that this was once a clever choice in order to overcome some limitation. I just recently migrated to 9.2, so thank you for opening this back door for me ;-)

Anyways: removing my clever code was a satisfying act today, as was seeing how well VAST handles even mixed line endings in files and streams. These are the small but powerful things that I like about your work at Instantiations.

Joachim

[hidden email] wrote on Friday, 25 September 2020 at 15:09:53 UTC+2:
On Fri, Sep 25, 2020 at 4:18 PM Joachim Tuchel <[hidden email]> wrote:
How did you know I implemented that? hahahah ;)

Now seriously, I am still not convinced about the decision on whether to do that automatically or not, that is, the #shouldSearchForAllStandardDelimiters. I evaluated, and it is still on my mind, using an explicit new boolean instVar to control that... but so far, I am still not convinced.

BTW, to implement that functionality and make it performant, we also implemented new primitives which allowed speedups in other areas too. Below are some notes from our internal bug tracker:

"- New methods in SequenceableCollection
     indexOfAny: aSequenceableCollection
     indexOfAny: aSequenceableCollection ifAbsent: exceptionBlock
     indexOfAny: aSequenceableCollection startingAt: start
     indexOfAny: aSequenceableCollection startingAt: start ifAbsent: exceptionHandler
 - These 4 methods give symmetry with indexOfSubCollection...
 - indexOfAny will answer the index of the first element to be in the argument aSequenceableCollection
 - Implemented VMprStringIndexOfAny prim
     This will provide String/DBString specific prim-assist for character searches using indexOfAny.... It's about 3-4x faster than using the more generic version in SequenceableCollection.
 These also went on all the other streams that have the new delimiter handlers.
     skipToAny: aSequentialCollection
     upToAny: aSequentialCollection"

create a generic/reusable and FAST #indexOfAny:* , #skipToAny: and #upToAny:

So basically.... all these #indexOfAny:*, #skipToAny: and #upToAny: are now prim assisted and fast.

Mariano Martinez Peck
Software Engineer, Instantiations Inc.
Email: [hidden email]
Twitter: https://twitter.com/MartinezPeck
Blog: https://marianopeck.wordpress.com/
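As a rough illustration of what the notes describe ("answer the index of the first element to be in the argument"), here is a plain Python sketch. It is not the VAST implementation and says nothing about the primitive. The names `index_of_any` and `up_to_any` are stand-ins chosen for this example, indices are 0-based (Smalltalk's are 1-based), -1 stands in for the ifAbsent: case, and the assumption that an upToAny:-style read stops before the delimiter and positions the stream just past it is borrowed from the usual #upTo: convention, not confirmed by the thread.

```python
def index_of_any(sequence, candidates, start=0):
    """Index of the first element of sequence (from start) that occurs in candidates.

    Returns -1 when there is none (stand-in for the ifAbsent: variants).
    """
    for i in range(start, len(sequence)):
        if sequence[i] in candidates:
            return i
    return -1


def up_to_any(text, position, delimiters):
    """Return (elements before the next delimiter, position just past that delimiter).

    Assumed semantics, modelled on the usual upTo: convention.
    """
    hit = index_of_any(text, delimiters, position)
    if hit == -1:
        return text[position:], len(text)
    return text[position:hit], hit + 1


text = "a;b\r\nc"
print(index_of_any(text, "\r\n"))  # 3
print(up_to_any(text, 0, "\r\n"))  # ('a;b', 4)
```

A line reader built on such a search would still have to treat a Cr immediately followed by Lf as a single delimiter.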
In reply to this post by jtuchel
On Friday, September 25, 2020 at 12:15:22 PM UTC-7, Joachim Tuchel wrote:
Most of my stories/complaints are more in the "I would like to Slinky the programmer responsible for this!" style.

My favourite and most extreme example involved buying something on eBay when I lived in Zurich. I had navigated through numerous pages to order the product and through a number of further pages to set up the payment. As soon as I entered my credit card's billing address, my web browser presented everything in German. Clever programmer (TM) had decided that, living in a German-speaking city, I must speak and read German and that he (somehow I am sure it was a male) was doing me a great favour by switching to German.