Re: Decomposing Binary Data by CR/LF - Solved (Sort of)

jrm
It turns out that there is no easy answer to my question. A binary file cannot be chunked reliably on line-ending characters, because binary data can contain the CR and LF byte values (hex 0D and 0A) anywhere in an arbitrary byte stream. PDF files contain an xref table that points to all of the objects in the file. I have managed to create classes in my framework which extract that table into a usable object, which I can then use to extract the data. Using "self findTokens: (Character cr asString, Character lf asString)" is still useful in the areas of a PDF file which do not contain binary data, and it is necessary there because the line-end values used in a PDF depend on the defaults of the operating system the file was created on.
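To illustrate the offset-based idea described above, here is a minimal workspace sketch. It is not the framework mentioned in this post: the tiny in-memory "file", the variable names, and the use of #findString: as a stand-in for an offset read from the xref table are all assumptions made for the example. It only shows that, once a byte offset is known, the object can be read from that position without splitting on line ends.

```smalltalk
| lf pdfText offset stream objectText |
lf := String with: Character lf.

"Hypothetical miniature file body; a real PDF would be read from disk in binary mode."
pdfText := '%PDF-1.4', lf, '1 0 obj', lf, '<< /Type /Catalog >>', lf, 'endobj', lf.

"Stand-in for an xref entry: the 0-based position of the object's first byte."
offset := (pdfText findString: '1 0 obj') - 1.

"Jump to the offset and read up to (not including) the endobj keyword."
stream := ReadStream on: pdfText.
stream position: offset.
objectText := stream upToAll: 'endobj'.
```

Note that scanning for 'endobj' is itself only safe when the object holds no binary stream data, for the same reason that scanning for CR/LF is not.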

Thanks again for your interest in my question. 

Jrm

On Tue, Jul 25, 2017 at 3:00 AM, John-Reed Maffeo <[hidden email]> wrote:
Is there an existing method that will tokenize/chunk(?) data from a file using CR/LF? The use case is to decompose a file into PDF objects, defined as strings terminated by CR/LF. (If there is an existing framework/project available, I have not found it, just dead ends :-( )

I have been exploring in #String and #ByteString and this is all I have found that is close to what I need.

"Finds the first occurrence of the whole CR LF string"
self findString: (Character cr asString, Character lf asString).
"Breaks at either token value"
self findTokens: (Character cr asString, Character lf asString)
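The difference between the two selectors above can be seen in a small workspace sketch. The sample string and variable names are made up for illustration; the behaviour described in the comments is that of #findString: and #findTokens: as found in a Squeak 5.1 image, and may differ in other dialects.

```smalltalk
| crlf sample index tokens |
crlf := String with: Character cr with: Character lf.

"Two line ends: a CR LF pair after 'one', and a lone LF after 'two'."
sample := 'one', crlf, 'two', (String with: Character lf), 'three'.

"findString: looks for the argument as one substring and answers the
 index of its first character, or 0 when it is absent."
index := sample findString: crlf.

"findTokens: treats the argument as a set of delimiter characters, so
 CR, LF and CR LF all end a token, and empty tokens are dropped."
tokens := sample findTokens: crlf.
```

Here index is 4 (the position of the CR), and tokens holds 'one', 'two' and 'three'.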

I have tried poking around in #MultiByteFileStream, but keep running into errors.

If there is no existing method, any suggestions on how to write a new one? My naive approach is to scan for CR and then peek for LF, keeping track of my pointers and using them to identify the CR/LF delimited substrings; or to iterate through the contents using #findString:
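The "scan for CR, then peek for LF" approach might look roughly like the following workspace sketch. It is only one possible reading of the idea, written against an in-memory string rather than a file stream, and all variable names are illustrative. It splits only on a CR that is immediately followed by LF, so a lone CR stays inside its chunk.

```smalltalk
| crlf data stream chunks current |
crlf := String with: Character cr with: Character lf.

"Sample data: the middle chunk contains a lone CR that must not split it."
data := 'alpha', crlf, 'be', (String with: Character cr), 'ta', crlf, 'gamma'.

stream := ReadStream on: data.
chunks := OrderedCollection new.
current := WriteStream on: String new.
[stream atEnd] whileFalse: [
	| ch |
	ch := stream next.
	"peekFor: consumes the LF only when it is the next character."
	(ch = Character cr and: [stream peekFor: Character lf])
		ifTrue: [
			chunks add: current contents.
			current := WriteStream on: String new]
		ifFalse: [current nextPut: ch]].

"Whatever follows the last CR LF is the final chunk (possibly empty)."
chunks add: current contents.
```

Because the stream does the bookkeeping, no explicit pointers are needed. The caveat from the top of this thread still applies: inside binary data a CR LF pair may simply be payload bytes.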

TIA, jrm

-----
Image
-----
C:\Smalltalk\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit.1.image
Squeak5.1
latest update: #16549
Current Change Set: PDFPlayground
Image format 68021 (64 bit)

Operating System Details
------------------------
Operating System: Windows 7 Professional (Build 7601 Service Pack 1)
Registered Owner: T530
Registered Company: 
SP major version: 1
SP minor version: 0
Suite mask: 100
Product type: 1

Joseph Alotta
This post has NOT been accepted by the mailing list yet.
Why can’t you just count characters and chunk it into blocks of an arbitrary length?
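One way to read that suggestion, as a minimal workspace sketch: the sample data, the block size of 4, and the variable names are arbitrary choices for illustration.

```smalltalk
| data blockSize chunks |
data := 'abcdefghij'.
blockSize := 4.
chunks := OrderedCollection new.

"Step through the data one block at a time; the last block may be short."
1 to: data size by: blockSize do: [:start |
	chunks add: (data copyFrom: start to: (start + blockSize - 1 min: data size))].
```

This yields 'abcd', 'efgh' and 'ij'. Fixed-length blocks ignore the content, so a PDF object would usually straddle a block boundary and the pieces would still need to be reassembled.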


> On Sep 4, 2017, at 1:01 PM, jrm [via Smalltalk] <[hidden email]> wrote:
>
> It turns out that there is no easy answer to my question. A binary file cannot be chunked reliably on line-ending characters, because binary data can contain the CR and LF byte values (hex 0D and 0A) anywhere in an arbitrary byte stream. PDF files contain an xref table that points to all of the objects in the file. I have managed to create classes in my framework which extract that table into a usable object, which I can then use to extract the data. Using "self findTokens: (Character cr asString, Character lf asString)" is still useful in the areas of a PDF file which do not contain binary data, and it is necessary there because the line-end values used in a PDF depend on the defaults of the operating system the file was created on.
>
> Thanks again for your interest in my question.
>
> Jrm