Is there an existing method that will tokenize/chunk(?) data from a file using CR/LF? The use case is to decompose a file into PDF objects, which are defined as strings terminated by CR/LF. (If there is an existing framework/project available, I have not found it, just dead ends :-(
I have been exploring in #String and #ByteString and this is all I have found that is close to what I need.

"Finds first occurrence of the string"
self findString: ( Character cr asString, Character lf asString).
"Breaks at either token value"
self findTokens: ( Character cr asString, Character lf asString)

I have tried poking around in #MultiByteFileStream, but keep running into errors.

If there is no existing method, any suggestions how to write a new one? My naive approach is to scan for CR and then peek for LF, keeping track of my pointers and using them to identify the CR/LF delimited substrings; or to iterate through the contents using #findString:

TIA, jrm

-----
Image
-----
C:\Smalltalk\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit.1.image
Squeak5.1
latest update: #16549
Current Change Set: PDFPlayground
Image format 68021 (64 bit)

Operating System Details
------------------------
Operating System: Windows 7 Professional (Build 7601 Service Pack 1)
Registered Owner: T530
Registered Company:
SP major version: 1
SP minor version: 0
Suite mask: 100
Product type: 1

_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners
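For reference, the scan-for-CR-then-peek-for-LF idea described in the question can be sketched roughly as below. This is only an illustration on an in-memory string, not a tested PDF tokenizer; the variable names (data, in, chunks, current) are made up for the example, and a lone CR or lone LF is deliberately left inside the chunk.

```smalltalk
| crlf data in chunks current ch |
crlf := Character cr asString, Character lf asString.
"Hypothetical sample data standing in for the file contents."
data := 'first', crlf, 'second', crlf, 'third'.
in := ReadStream on: data.
chunks := OrderedCollection new.
current := WriteStream on: String new.
[in atEnd] whileFalse: [
	ch := in next.
	"Only a CR immediately followed by an LF ends a chunk."
	(ch = Character cr and: [in peek = Character lf])
		ifTrue: [
			in next. "consume the LF"
			chunks add: current contents.
			current := WriteStream on: String new]
		ifFalse: [current nextPut: ch]].
"Keep any trailing text that was not terminated by CR/LF."
current contents isEmpty ifFalse: [chunks add: current contents].
```

Evaluating this in a workspace should leave the three substrings in chunks. For a real file the ReadStream would be replaced by a file stream, which is an assumption not tried here.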
Hi John,
Windows files normally use CR/LF as line termination. Linux files normally use LF. Look at #subStrings: and friends. You may want to change all CR/LF to LF, then all CR to LF, and then split the file at LFs. You could also look into the various stream classes. There are lots of ways to do this, and if you are just learning, it doesn't hurt to try a few of them.

Lou

On Tue, 25 Jul 2017 06:00:25 +1200, John-Reed Maffeo <[hidden email]> wrote:

>Is there an existing method that will tokenize/chunk(?) data from a file
>using CR/LF? The use case is to decompose a file into PDF objects defined
>as strings are strings terminated by CR/LF. (if there is an existing
>framework/project available, I have not found it, just dead ends :-(

Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon
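Lou's normalize-then-split suggestion might look something like the sketch below. It assumes the whole file contents are already in a String (here a made-up sample called data) and uses #copyReplaceAll:with: and #subStrings:. Note that #subStrings: drops empty substrings, so blank lines disappear with this approach.

```smalltalk
| cr lf crlf data normalized lines |
cr := Character cr asString.
lf := Character lf asString.
crlf := cr, lf.
"Hypothetical sample mixing all three line-ending conventions."
data := 'one', crlf, 'two', lf, 'three', cr, 'four'.
"First CR/LF to LF, then any remaining CR to LF."
normalized := data copyReplaceAll: crlf with: lf.
normalized := normalized copyReplaceAll: cr with: lf.
"Split at LF; empty substrings are not returned."
lines := normalized subStrings: lf.
```

The order of the two replacements matters: converting lone CRs first would turn each CR/LF into two LFs.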
In reply to this post by jrm
Hi JRM,

I think MultiByteFileStream is where you want to work on this. Since you said it is, specifically, a file that has CR/LF line endings, this is the place. There are tricks to making it work which, unfortunately, aren't clearly documented.

This looks like how MultiByteFileStream is supposed to work:

1. Open the file.
2. Send #wantsLineEndConversion: true to the file.
3. Send #ascii to the file (to tell it that it is a text file, and to determine whether the line ending is CR/LF, CR, or LF).
4. Read data from the file.

It should convert CR/LF to just CR, and all things are happy. Except if you send something like #next: 20, and the last character isn't a CR, then it looks like it would be buggy. But please try this and see if it works. If so, please let me know.

An alternative seems to be that you could just open the file without any of those changes and go through it line by line (sending #nextLine to the file). The implementation of #nextLine in PositionableStream should also take care of the CR/LF issues. If you try this route, please let me know how it goes as well.

Thanks,
cbc

On Mon, Jul 24, 2017 at 11:00 AM, John-Reed Maffeo <[hidden email]> wrote:
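The #nextLine route Chris mentions can be tried without touching a file at all, since #nextLine is defined in PositionableStream. The sketch below runs it over a ReadStream on a made-up string; a stream opened with something like FileStream readOnlyFileNamed: should respond to the same message, though that part is an assumption and is not exercised here.

```smalltalk
| crlf stream lines |
crlf := Character cr asString, Character lf asString.
"Hypothetical sample text with CR/LF line endings."
stream := ReadStream on: 'one', crlf, 'two', crlf, 'three'.
lines := OrderedCollection new.
"nextLine answers each line without its line-end delimiters."
[stream atEnd] whileFalse: [lines add: stream nextLine].
```

Compared with splitting the whole contents at once, this reads one line at a time, which may suit large files better.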
Chris, Lou,

Thanks. After more research on the web, I think I need to rethink my approach to the problem.

"PDFs are actually designed to be read 'backwards', starting at the end."

My question is still valid and I am working on a solution. Will post something if it is useful.

-jrm

On Mon, Jul 24, 2017 at 5:25 PM, Chris Cunningham <[hidden email]> wrote:
In reply to this post by jrm
On 24/07/17 20:00, John-Reed Maffeo wrote:
> Is there an existing method that will tokenize/chunk(?) data from a file
> using CR/LF? The use case is to decompose a file into PDF objects
> defined as strings are strings terminated by CR/LF. (if there is an
> existing framework/project available, I have not found it, just dead
> ends :-(

You know about the work by Christian Haider?

http://christianhaider.de/dokuwiki/doku.php?id=pdf:pdf4smalltalk

Stephan
Stephan,

Thank you. I have seen this reference, but the framework seems to be written in VisualWorks. While it looks like what I need, I am not ready to put the effort into getting up to speed in the VisualWorks ecosystem. There does appear to be a way to download it, but since I am focused on my (hobby) project and the learning opportunity it provides, I will continue to develop a framework in Squeak. There is a page referenced on the link you provided that discusses porting, so perhaps, someday, I will take a look at it.

From what I have learned so far, PDF is a gnarly mess of pointers and offsets which have to be carefully managed. Hard fun!

BTW, I owe the list a reply to this thread about my discovery of the answer to my question.

jrm

On Wed, Aug 16, 2017 at 2:22 PM, Stephan Eggermont <[hidden email]> wrote:
On 04-09-17 19:32, John-Reed Maffeo wrote:
> Thank you, I have seen this reference, but the framework seems to be
> written in VisualWorks. While it looks like what I need, I am not ready
> to put the effort into the process of getting up to speed in the
> VisualWorks ecosystem. There does appear to be a way to download it,
> but, since I am focused on my (hobby) project and the learning
> opportunity it provides, I will continue to develop a framework in Squeak.

Christian recently ported it to GemStone. He has done some work that would help with porting, and yes, it would be a large project. The file parsing part is likely to be rather well portable, though.

A project going in the other direction is Artefact, on Pharo. Adding file parsing to that would be welcomed, I'm sure, and it is probably easy to port.

Stephan