Decomposing Binary Data by CR/LF

jrm
Is there an existing method that will tokenize/chunk(?) data from a file using CR/LF? The use case is to decompose a file into PDF objects, which are defined as strings terminated by CR/LF. (If there is an existing framework/project available, I have not found it, just dead ends :-( )

I have been exploring #String and #ByteString, and this is all I have found that is close to what I need.

"Answers the index of the first occurrence of the CR/LF pair (0 if absent)"
self findString: (Character cr asString, Character lf asString).
"Breaks at either character, not only at the CR/LF pair"
self findTokens: (Character cr asString, Character lf asString)

I have tried poking around in #MultiByteFileStream, but I keep running into errors.

If there is no existing method, any suggestions on how to write a new one? My naive approach is to scan for CR and then peek for LF, keeping track of my pointers and using them to identify the CR/LF-delimited substrings; or to iterate through the contents using #findString:
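
For what it is worth, the second idea (iterating with #findString:) might look roughly like the workspace sketch below. It uses #findString:startingAt: so that each search resumes after the previous match. The sample data and the variable names are made up for illustration, and it assumes the contents are already in memory as a String; it is not an existing library method.

```smalltalk
"Sketch: split a String at each CR/LF pair (and only at the pair)."
| crlf data pieces start index |
crlf := String with: Character cr with: Character lf.
data := '1 0 obj', crlf, 'endobj', crlf, 'trailer'.   "made-up sample data"
pieces := OrderedCollection new.
start := 1.
"#findString:startingAt: answers 0 when there is no further match."
[(index := data findString: crlf startingAt: start) > 0] whileTrue: [
    pieces add: (data copyFrom: start to: index - 1).
    start := index + crlf size].
"Keep whatever follows the last CR/LF."
start <= data size
    ifTrue: [pieces add: (data copyFrom: start to: data size)].
pieces   "an OrderedCollection('1 0 obj' 'endobj' 'trailer')"
```

Unlike #findTokens:, this keeps a lone CR or a lone LF inside a piece, and two consecutive CR/LF pairs yield an empty string rather than being skipped.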

TIA, jrm

-----
Image
-----
C:\Smalltalk\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit.1.image
Squeak5.1
latest update: #16549
Current Change Set: PDFPlayground
Image format 68021 (64 bit)

Operating System Details
------------------------
Operating System: Windows 7 Professional (Build 7601 Service Pack 1)
Registered Owner: T530
Registered Company: 
SP major version: 1
SP minor version: 0
Suite mask: 100
Product type: 1


_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners
Louis LaBrunda
Hi John,

Windows files normally use CR/LF as line termination; Linux files normally use LF.  Look at
#subStrings: and friends.  You may want to change all CR/LF to LF, then all remaining CR to LF,
and then split the file at the LFs.  You could also look into the various stream classes.
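
A rough workspace sketch of the normalize-then-split idea, assuming the file contents are already in a String. The sample data and variable names are invented for illustration.

```smalltalk
"Sketch: turn every line end into LF, then split at LF."
| cr lf data normalized |
cr := String with: Character cr.
lf := String with: Character lf.
data := 'first', cr, lf, 'second', lf, 'third', cr, 'fourth'.   "made-up sample data"
"Replace the CR/LF pairs first, so that the CR pass does not double them up."
normalized := data copyReplaceAll: cr, lf with: lf.
normalized := normalized copyReplaceAll: cr with: lf.
"Answers the pieces 'first' 'second' 'third' 'fourth'."
normalized subStrings: lf
```

As far as I can tell, #subStrings: does not answer empty strings for consecutive separators, so blank lines disappear. Note also that rewriting line ends changes the bytes, which is risky if the file contains binary sections, as PDF streams usually do.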

There are lots of ways to do this and if you are just learning, it doesn't hurt to try a few of
them.

Lou

--
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon

cbc
In reply to this post by jrm
Hi JRM, 

I think MultiByteFileStream is where you want to work on this.  Since you said it is, specifically, a file that has Cr/Lf line endings, then this is the place.

There are tricks to making it work, which aren't clearly documented (unfortunately).  

This looks like how the MultiByteFileStream is supposed to work:

1. Open the file.
2. Send #wantsLineEndConversion: true to the file.
3. Send #ascii to the file (to tell it that it is a text file, and to let it determine whether the encoding is Cr/Lf, Cr, or Lf).
4. Read data from the file. It should convert Cr/Lf to just Cr, and all is well.

Except if you send something like #next: 20, and the last character isn't a #Cr, then it looks like it would be buggy.
But, please try this and see if it works.  If so, please let me know.
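
The four steps above could be sketched in a workspace roughly as follows. The file name is hypothetical, and the sketch assumes that FileStream answers a MultiByteFileStream here, as it appears to in Squeak 5.1.

```smalltalk
"Sketch of the four steps; 'example.pdf' is a hypothetical file name."
| file text |
file := FileStream readOnlyFileNamed: 'example.pdf'.    "1. open the file"
[file wantsLineEndConversion: true.                     "2. ask for line end conversion"
 file ascii.                                            "3. text mode; line end convention is determined"
 text := file upToEnd]                                  "4. read; Cr/Lf should arrive as Cr"
    ensure: [file close].
text
```

If the conversion works as described, splitting the result at Character cr should then give the individual lines.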

An alternative seems to be that you could just open it without any of those changes, and go through the file line by line (sending #nextLine to the file), and the implementation of #nextLine in PositionableStream should also take care of the Cr/Lf issues.
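
A small sketch of the #nextLine route. To keep it self-contained it reads from an in-memory ReadStream with made-up data instead of a file; an opened file stream should accept the same loop.

```smalltalk
"Sketch: collect lines with #nextLine, which copes with CR, LF and CR/LF."
| crlf stream lines |
crlf := String with: Character cr with: Character lf.
stream := ReadStream on: 'a', crlf, 'b', (String with: Character lf), 'c'.   "made-up data"
lines := OrderedCollection new.
[stream atEnd] whileFalse: [lines add: stream nextLine].
lines   "an OrderedCollection('a' 'b' 'c')"
```

Because #nextLine treats a lone CR or a lone LF as a line end too, it is not a strict CR/LF-only splitter.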

If you try this route, please let me know how it goes as well.

Thanks,
cbc


jrm
Chris, Lou,
Thanks. After more research on the web, I think I need to rethink my approach to the problem: PDFs are actually designed to be read "backwards", starting at the end. My question is still valid, and I am working on a solution. I will post something if it turns out to be useful.
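
To illustrate what reading from the end means: a PDF normally ends with the keyword startxref, then the byte offset of the cross-reference table on its own line, then %%EOF. The sketch below pulls that offset out of a made-up file tail held in a String; the sample text and variable names are assumptions for illustration only, and a real reader would first have to fetch the last part of the file in binary mode.

```smalltalk
"Sketch: find the cross-reference offset in a (made-up) PDF tail."
| crlf tail index stream offset |
crlf := String with: Character cr with: Character lf.
tail := 'trailer', crlf, '<< /Size 8 >>', crlf,
    'startxref', crlf, '1234', crlf, '%%EOF', crlf.
index := tail findString: 'startxref'.
stream := ReadStream on: (tail copyFrom: index to: tail size).
stream nextLine.                       "skip the startxref line itself"
offset := stream nextLine asInteger.   "the next line holds the offset"
offset   "1234"
```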

-jrm

Stephan Eggermont-3
In reply to this post by jrm
On 24/07/17 20:00, John-Reed Maffeo wrote:
> Is there an existing method that will tokenize/chunk(?) data from a file
> using  CR/LF? The use case is to decompose a file into PDF objects
> defined as strings are strings terminated by CR/LF. (if there is an
> existing framework/project available, I have not found it, just dead
> ends :-(

You know about the work by Christian Haider?
http://christianhaider.de/dokuwiki/doku.php?id=pdf:pdf4smalltalk

Stephan
