Decomposing Binary Data by CR/LF

Decomposing Binary Data by CR/LF

jrm
Is there an existing method that will tokenize/chunk(?) data from a file using CR/LF? The use case is to decompose a file into PDF objects, defined as strings terminated by CR/LF. If there is an existing framework/project available, I have not found it, just dead ends :-(

I have been exploring in #String and #ByteString and this is all I have found that is close to what I need.

"Finds first occurance of #Sting"
self findString: ( Character cr  asString,  Character lf asString).
"Breaks at either token value"
self findTokens: ( Character cr  asString,  Character lf asString)

I have tried poking around in #MultiByteFileStream, but I keep running into errors.

If there is no existing method, any suggestions on how to write a new one? My naive approach is to scan for CR and then peek for LF, keeping track of my pointers and using them to identify the CR/LF-delimited substrings; or to iterate through the contents using #findString:.
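A minimal sketch of that scan-and-peek idea, working over the file contents already read into a String (untested; 'aString' is a stand-in for those contents):

    | in out lines |
    in := ReadStream on: aString.
    out := WriteStream on: String new.
    lines := OrderedCollection new.
    [in atEnd] whileFalse: [
        | ch |
        ch := in next.
        (ch = Character cr and: [in peekFor: Character lf])
            ifTrue: [
                "CR immediately followed by LF: close the current chunk"
                lines add: out contents.
                out := WriteStream on: String new]
            ifFalse: [out nextPut: ch]].
    out contents isEmpty ifFalse: [lines add: out contents].
    lines

Unlike #findTokens:, this splits only at a CR immediately followed by LF; a lone CR or LF inside an object is left untouched.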

TIA, jrm

-----
Image
-----
C:\Smalltalk\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit.1.image
Squeak5.1
latest update: #16549
Current Change Set: PDFPlayground
Image format 68021 (64 bit)

Operating System Details
------------------------
Operating System: Windows 7 Professional (Build 7601 Service Pack 1)
Registered Owner: T530
Registered Company: 
SP major version: 1
SP minor version: 0
Suite mask: 100
Product type: 1



Decomposing Binary Data by CR/LF

Louis LaBrunda
Hi John,

Windows files normally use CR/LF as line termination. Linux files normally use LF. Look at #subStrings: and friends. You may want to change all CR/LF to LF, then all remaining CR to LF, and then split the file at LFs. You could also look into the various stream classes.
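For example, something along these lines (an untested sketch; 'test.pdf' is a made-up file name):

    | crlf lf raw normalized |
    crlf := Character cr asString, Character lf asString.
    lf := Character lf asString.
    raw := (FileStream readOnlyFileNamed: 'test.pdf') contentsOfEntireFile.
    "Normalize CR/LF to LF, then any remaining CR to LF, then split"
    normalized := (raw copyReplaceAll: crlf with: lf)
        copyReplaceAll: Character cr asString with: lf.
    normalized subStrings: lf

Note that #subStrings: treats runs of separators as one, so empty lines are dropped.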

There are lots of ways to do this and if you are just learning, it doesn't hurt to try a few of
them.

Lou

--
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon


Re: Decomposing Binary Data by CR/LF

cbc
Hi JRM, 

I think MultiByteFileStream is where you want to work on this. Since you said it is specifically a file with Cr/Lf line endings, this is the place.

There are tricks to making it work, which aren't clearly documented (unfortunately).  

This looks like how the MultiByteFileStream is supposed to work:

1. Open the file.
2. Send #wantsLineEndConversion: true to the file.
3. Send #ascii to the file (to tell it that it is a text file, and to determine the Cr/Lf or Cr or Lf encoding).
4. Read data from the file. It should convert Cr/Lf to just Cr, and all things are happy.

Except that if you send something like #next: 20 and the last character read isn't a Cr, it looks like it would be buggy.
But please try this and see if it works. If so, please let me know.
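In code, those steps would be something like this (untested; 'test.pdf' is a made-up file name):

    | file data |
    file := FileStream readOnlyFileNamed: 'test.pdf'.
    file wantsLineEndConversion: true.
    file ascii.
    data := file upToEnd.    "Cr/Lf pairs should come back as plain Cr"
    file close.
    data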

An alternative seems to be that you could just open it without any of those changes, and go through the file line by line (sending #nextLine to the file), and the implementation of #nextLine in PositionableStream should also take care of the Cr/Lf issues.

If you try this route, please let me know how it goes as well.
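That route might look like this (again untested, with the same made-up file name):

    | file lines |
    file := FileStream readOnlyFileNamed: 'test.pdf'.
    lines := OrderedCollection new.
    [file atEnd] whileFalse: [lines add: file nextLine].
    file close.
    lines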

Thanks,
cbc



Re: Decomposing Binary Data by CR/LF

jrm
Chris, Lou,
Thanks. After more research on the web, I think I need to rethink my approach to the problem: PDFs are actually designed to be read "backwards", starting at the end. My question is still valid and I am working on a solution. Will post something if it is useful.
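For reference, a PDF ends with a trailer whose 'startxref' line gives the byte offset of the cross-reference table, followed by '%%EOF'. A rough, untested sketch for peeking at that tail ('test.pdf' is again a made-up name):

    | file tail |
    file := FileStream readOnlyFileNamed: 'test.pdf'.
    file binary.
    file position: (file size - 200 max: 0).    "the trailer lives in the last couple of hundred bytes"
    tail := file upToEnd asString.
    file close.
    tail findString: 'startxref'    "the number on the following line is the xref offset"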

-jrm


Re: Decomposing Binary Data by CR/LF

Stephan Eggermont-3

You know about the work by Christian Haider?
http://christianhaider.de/dokuwiki/doku.php?id=pdf:pdf4smalltalk

Stephan


Re: Decomposing Binary Data by CR/LF

jrm
Stephan,

Thank you, I have seen this reference, but the framework seems to be written in VisualWorks. While it looks like what I need, I am not ready to put the effort into the process of getting up to speed in the VisualWorks ecosystem. There does appear to be a way to download it, but, since I am focused on my (hobby) project and the learning opportunity it provides, I will continue to develop a framework in Squeak. 

There is a page referenced on the link you provided that discusses porting, so perhaps someday I will take a look at it. From what I have learned so far, PDF is a gnarly mess of pointers and offsets which have to be carefully managed - hard fun!

BTW, I owe the list a reply to this thread about my discovery of the answer to my question.

jrm


Re: Decomposing Binary Data by CR/LF

Stephan Eggermont-3

Christian recently ported it to Gemstone. He has done some work that would help with porting, and yes, it would be a large project. The file parsing part is likely to be fairly portable, though.
A project going in the other direction is Artefact, on Pharo. Adding file parsing to that would be welcomed, I'm sure, and it is probably easy to port.

Stephan
