Decomposing Binary Data by CR/LF

Decomposing Binary Data by CR/LF

jrm
Is there an existing method that will tokenize/chunk(?) data from a file using CR/LF? The use case is to decompose a file into PDF objects, defined as strings terminated by CR/LF. If there is an existing framework/project available, I have not found it, just dead ends :-(

I have been exploring in #String and #ByteString and this is all I have found that is close to what I need.

"Finds first occurance of #Sting"
self findString: ( Character cr  asString,  Character lf asString).
"Breaks at either token value"
self findTokens: ( Character cr  asString,  Character lf asString)

I have tried poking around in #MultiByteFileStream, but I keep running into errors.

If there is no existing method, any suggestions on how to write a new one? My naive approach is to scan for CR and then peek for LF, keeping track of my pointers and using them to identify the CR/LF-delimited substrings; or to iterate through the contents using #findString:.
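A minimal sketch of that scan-and-peek idea, working over the file contents already read into a String (untested; 'aString' is a stand-in for those contents):

    | in out lines |
    in := ReadStream on: aString.
    out := WriteStream on: String new.
    lines := OrderedCollection new.
    [in atEnd] whileFalse: [
        | ch |
        ch := in next.
        (ch = Character cr and: [in peekFor: Character lf])
            ifTrue: [
                "CR immediately followed by LF: close the current chunk"
                lines add: out contents.
                out := WriteStream on: String new]
            ifFalse: [out nextPut: ch]].
    out contents isEmpty ifFalse: [lines add: out contents].
    lines

Unlike #findTokens:, this splits only at a CR immediately followed by LF; a lone CR or LF inside an object is left untouched.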

TIA, jrm

-----
Image
-----
C:\Smalltalk\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit.1.image
Squeak5.1
latest update: #16549
Current Change Set: PDFPlayground
Image format 68021 (64 bit)

Operating System Details
------------------------
Operating System: Windows 7 Professional (Build 7601 Service Pack 1)
Registered Owner: T530
Registered Company: 
SP major version: 1
SP minor version: 0
Suite mask: 100
Product type: 1



Decomposing Binary Data by CR/LF

Louis LaBrunda
Hi John,

Windows files normally use CR/LF as line termination. Linux files normally use LF. Look at #subStrings: and friends. You may want to change all CR/LF to LF, then all remaining CR to LF, and then split the file at LFs. You could also look into the various stream classes.
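For example, something along these lines (an untested sketch; 'test.pdf' is a made-up file name):

    | crlf lf raw normalized |
    crlf := Character cr asString, Character lf asString.
    lf := Character lf asString.
    raw := (FileStream readOnlyFileNamed: 'test.pdf') contentsOfEntireFile.
    "Normalize CR/LF to LF, then any remaining CR to LF, then split"
    normalized := (raw copyReplaceAll: crlf with: lf)
        copyReplaceAll: Character cr asString with: lf.
    normalized subStrings: lf

Note that #subStrings: treats runs of separators as one, so empty lines are dropped.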

There are lots of ways to do this and if you are just learning, it doesn't hurt to try a few of
them.

Lou

--
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon


Re: Decomposing Binary Data by CR/LF

cbc
Hi JRM, 

I think MultiByteFileStream is where you want to work on this. Since you said it is specifically a file with Cr/Lf line endings, this is the place.

There are tricks to making it work, which aren't clearly documented (unfortunately).  

This looks like how the MultiByteFileStream is supposed to work:

1. Open the file.
2. Send #wantsLineEndConversion: true to the file.
3. Send #ascii to the file (to tell it that it is a text file, and to determine the Cr/Lf or Cr or Lf encoding).
4. Read data from the file. It should convert Cr/Lf to just Cr, and all things are happy.

Except that if you send something like #next: 20 and the last character read isn't a Cr, it looks like it would be buggy.
But please try this and see if it works. If so, please let me know.
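In code, those steps would be something like this (untested; 'test.pdf' is a made-up file name):

    | file data |
    file := FileStream readOnlyFileNamed: 'test.pdf'.
    file wantsLineEndConversion: true.
    file ascii.
    data := file upToEnd.    "Cr/Lf pairs should come back as plain Cr"
    file close.
    data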

An alternative seems to be that you could just open it without any of those changes, and go through the file line by line (sending #nextLine to the file), and the implementation of #nextLine in PositionableStream should also take care of the Cr/Lf issues.

If you try this route, please let me know how it goes as well.
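That route might look like this (again untested, with the same made-up file name):

    | file lines |
    file := FileStream readOnlyFileNamed: 'test.pdf'.
    lines := OrderedCollection new.
    [file atEnd] whileFalse: [lines add: file nextLine].
    file close.
    lines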

Thanks,
cbc



Re: Decomposing Binary Data by CR/LF

jrm
Chris, Lou,
Thanks. After more research on the web, I think I need to rethink my approach to the problem: PDFs are actually designed to be read "backwards", starting at the end. My question is still valid and I am working on a solution. Will post something if it is useful.
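For reference, a PDF ends with a trailer whose 'startxref' line gives the byte offset of the cross-reference table, followed by '%%EOF'. A rough, untested sketch for peeking at that tail ('test.pdf' is again a made-up name):

    | file tail |
    file := FileStream readOnlyFileNamed: 'test.pdf'.
    file binary.
    file position: (file size - 200 max: 0).    "the trailer lives in the last couple of hundred bytes"
    tail := file upToEnd asString.
    file close.
    tail findString: 'startxref'    "the number on the following line is the xref offset"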

-jrm


Re: Decomposing Binary Data by CR/LF

Stephan Eggermont-3

You know about the work by Christian Haider?
http://christianhaider.de/dokuwiki/doku.php?id=pdf:pdf4smalltalk

Stephan


Re: Decomposing Binary Data by CR/LF

jrm
Stephan,

Thank you, I have seen this reference, but the framework seems to be written in VisualWorks. While it looks like what I need, I am not ready to put the effort into the process of getting up to speed in the VisualWorks ecosystem. There does appear to be a way to download it, but, since I am focused on my (hobby) project and the learning opportunity it provides, I will continue to develop a framework in Squeak. 

There is a page referenced on the link you provided that discusses porting, so perhaps someday I will take a look at it. From what I have learned so far, PDF is a gnarly mess of pointers and offsets which have to be carefully managed - hard fun!

BTW, I owe the list a reply to this thread about my discovery of the answer to my question.

jrm


Re: Decomposing Binary Data by CR/LF

Stephan Eggermont-3

Christian recently ported it to Gemstone. He has done some work that would help with porting, and yes, it would be a large project. The file parsing part is likely to be fairly portable, though.
A project going in the other direction is Artefact, on Pharo. Adding file parsing to that would be welcomed, I'm sure, and it is probably easy to port.

Stephan
