PDF parser - have I missed anything?

Schwab,Wilhelm K
Hello all,

I've searched for this, but it's tricky because 'pdf' commonly appears
in URLs.  So far, I have uncovered a dare-devil's parser in Squeak and
various ways to write pdf files (status unclear), which is not my
problem at present.

In short, I have a ~2MB pdf containing many tables and some useful text
on either side of those tables.  Naturally, I would like to find a
reader in Dolphin, but just about any dialect would suffice to read,
parse, and export the interesting stuff.  The result could be one or
more text files written for the benefit of Dolphin code that will
further parse the text to create what I need.

Any recommendations or pointers to a parser?

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: PDF parser - have I missed anything?

Stefan Schmiedl
On Wed, 20 Jul 2005 14:37:24 GMT, Bill Schwab <[hidden email]> wrote:


> Any recommendations or pointers to a parser?

what about a linux based toolchain

        pdf2ps | ps2text

s.


Re: PDF parser - have I missed anything?

Janos Kazsoki
In reply to this post by Schwab,Wilhelm K
Bill,

I have seen https://sourceforge.net/projects/pdfcreator

I haven't used it, I just saw it. From the name they may have something
useful also for parsing.

I hope this helps,
Janos


Re: PDF parser - have I missed anything?

Chris Uppal-3
In reply to this post by Schwab,Wilhelm K
Bill,

> In short, I have a ~2MB pdf containing many tables and some useful text
> on either side of those tables.  Naturally, I would like to find a
> reader in Dolphin, but just about any dialect would suffice to read,
> parse, and export the interesting stuff.  The result could be one or
> more text files written for the benefit of Dolphin code that will
> further parse the text to create what I need.

You can certainly do that /somehow/ with Aladdin Ghostscript.  Presumably you
could run the command-line executable from Dolphin in the "normal" way
(although I have no idea what flags to give it).
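
As a hedged sketch of what such an invocation might look like: recent Ghostscript releases include a `txtwrite` output device that writes the extracted text to a file. The Python wrapper below only assembles the argument list and, optionally, runs it. The executable name `gs` (on Windows it is usually `gswin32c` or `gswin64c`), the helper names, and the file names are assumptions for illustration; the same argument list could be handed to an external process from Dolphin.

```python
import subprocess


def ghostscript_text_command(pdf_path, txt_path, gs_executable="gs"):
    """Build the argument list for text extraction with the txtwrite device."""
    return [
        gs_executable,
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit once the input has been processed
        "-sDEVICE=txtwrite",  # text-extraction output device
        f"-sOutputFile={txt_path}",
        pdf_path,
    ]


def extract_text(pdf_path, txt_path, gs_executable="gs"):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    command = ghostscript_text_command(pdf_path, txt_path, gs_executable)
    subprocess.run(command, check=True)
```

Calling `extract_text("report.pdf", "report.txt")` would then leave a text file for further parsing, assuming Ghostscript is installed and on the PATH.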

    -- chris


Re: PDF parser - have I missed anything?

Yanni Chiu
In reply to this post by Schwab,Wilhelm K
Bill Schwab wrote:
>
> ...  So far, I have uncovered a dare-devil's parser in Squeak and
> various ways to write pdf files (status unclear), which is not my
> problem at present.

I've just been trying the PDFReader at:
    http://map1.squeakfoundation.org/sm/package/ac6ea80a-cad6-403b-8286-3db1ff5d53a4

It says that it's for Squeak 3.2, but I'm using a 3.6 image,
and the parsing works. Rendering to morphic doesn't though.
I briefly considered tweaking the rendering engine to
spit out the text, but so far it's been easier to
scan through the page description, line by line.

I just need it to extract the text from a telecom bill,
which it does fine. The real problem is that the text is
mostly out of order, so heuristics have to be used
to pick out the numbers. HTH.

--yanni


Re: PDF parser - have I missed anything?

Schwab,Wilhelm K
Yanni,

> I've just been trying the PDFReader at:
>     http://map1.squeakfoundation.org/sm/package/ac6ea80a-cad6-403b-8286-3db1ff5d53a4
>
> It says that it's for Squeak 3.2, but I'm using a 3.6 image,
> and the parsing works. Rendering to morphic doesn't though.
[snip]
> I just need it to extract the text from a telecom bill,
> which it does fine.

Thanks, I will give it a shot.


> The real problem is that the text is
> mostly out of order, so heuristics have to be used
> to pick out the numbers. HTH.

Can you elaborate on what is out of order and why?  That might be a
problem for me given that proximity (so far) seems to be the only way to
associate the tables and the text.  OTOH, if there is enough structure,
I might be able to use sections etc. to put things together.

Thanks!!

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: PDF parser - have I missed anything?

Yanni Chiu
Bill Schwab wrote:
>
> > The real problem is that the text is
> > mostly out of order, so heuristics have to be used
> > to pick out the numbers. HTH.
>
> Can you elaborate on what is out of order and why?  That might be a
> problem for me given that proximity (so far) seems to be the only way to
> associate the tables and the text.  OTOH, if there is enough structure,
> I might be able to use sections etc. to put things together.

In Postscript, the data stream that is sent to the "printer"
is actually a sequence of instructions to a stack-based
interpreter, which then "draws" each page.

AFAICT, the PDF format seems to use a restricted subset
of Postscript (or maybe just a set of pre-defined Postscript
functions) to describe each page. Each page can be quickly
located using an index structure (unlike Postscript which
uses an "advance to next page" instruction, inserted into
the stream).

The problem is that the PDF page description consists
of instructions which move the drawing point to a given
coordinate. Then a draw text instruction follows. In my
case (and likely in most cases), each word is placed
individually on the page. Otherwise, differences in the
character widths between the computed and the rendered
values may cause lines to be wider than expected.
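
To make the "move, then draw text" pattern concrete, here is a minimal sketch that scans an already decompressed page description for the PDF operators `BT` (begin text object), `Td` (move text position) and `Tj` (show a string). It is a deliberate simplification, not a real parser: it assumes plain decimal operands, ignores `Tm`, `TJ`, fonts and text matrices, and the function name `scan_text_fragments` is made up for this example.

```python
import re

# Matches "BT", "x y Td" (move text position) or "(string) Tj" (show text).
_TOKEN = re.compile(
    r"(?P<bt>\bBT\b)"
    r"|(?P<x>-?\d+(?:\.\d+)?)\s+(?P<y>-?\d+(?:\.\d+)?)\s+Td"
    r"|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj"
)


def scan_text_fragments(content):
    """Return (x, y, text) tuples from a simplified, decompressed content stream."""
    fragments = []
    x = y = 0.0
    for match in _TOKEN.finditer(content):
        if match.group("bt"):
            # Each text object starts again from the origin.
            x = y = 0.0
        elif match.group("text") is not None:
            fragments.append((x, y, match.group("text")))
        else:
            # Td operands are offsets from the start of the current line.
            x += float(match.group("x"))
            y += float(match.group("y"))
    return fragments
```

The fragments come back in stream order, which is exactly why the printed order and the extracted order can differ.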

In my case, the printed page has a line with a text
description followed by an amount. But in the PDF
these bits of text occur in opposite order, or an entire
group of name/values is transposed. Whatever is generating
the text can do it in any order, it seems. However,
paragraph text is appearing in order. Tables are appearing
in order, but empty columns have no artifact. This is the
text only, the lines are drawn by stuff I've skipped.
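
One common heuristic for the ordering problem, sketched here under assumptions rather than taken from PDFReader: collect each piece of text with its coordinates, group pieces whose vertical positions are close into one line, and sort each line from left to right. The function name `fragments_to_lines` and the tolerance of 2 units are hypothetical choices, and the sketch relies on the default PDF coordinate system, where y grows upward from the bottom of the page.

```python
def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right."""
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    # Larger y is higher on the page, so sort by descending y first.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order the pieces by their x coordinate.
    return [" ".join(text for _, text in sorted(cells)) for _, cells in lines]
```

With this, an amount that was emitted before its description still ends up to the right of it, as on the printed page. Empty table columns would still have to be inferred from gaps between the x positions.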

--yanni


Re: PDF parser - have I missed anything?

DiegoC
In reply to this post by Schwab,Wilhelm K
Hi Bill. I have a very good PDF parser, written completely in Dolphin, that I
am currently using in one of my projects. I bought it from a friend, and he
owns the copyright. If you are interested in buying something like this, I
could put you in contact with him. That parser is part of a very complete
framework developed in VS; I only bought the part migrated to Dolphin. I
can say it's very well written, and I'm using it in production now.

Diego


Re: PDF parser - have I missed anything?

Alejandro F. Reimondo
Dear Bill,

I have implemented a PDF parser & renderer,
 and text conversion framework for VisualSmalltalk.

A small DLL is used for decryption and decompression;
 but all important code (+99.9%) is written entirely
 in well-documented and well-factored Smalltalk.

The framework for VisualSmalltalk can:
1.- read, decompress, and decrypt PDF files; extracting
 all text, graphics and lineArt (some shadings are not fully
 implemented).
2.- render PDF pages (text/images/lines) on graph panes
 with zoom and object detection (for point&click interfaces)
3.- render pages on any pen (e.g. printer's pen, or hidden
 in-memory rendering) with alignment & zooming.
4.- render to a conversion framework for easy parsing
 of texts; performing page layout inference (when required)
 to convert to wordprocessor formats (RTF conversion
 is implemented)

I have also implemented part of the framework for
 Dolphin Smalltalk.

Dolphin implementation can:
1.- read, decompress, and decrypt PDF files; extracting all text
 with some information, including text position and font name.
2.- build a simple in-memory structure of the page layout with all
 the text read.
3.- parse text in PDFs at very high speed (only text is scanned,
 skipping images and extracting only relevant information).

The framework for Dolphin has been proven and is currently
 working in commercial applications that read PDF files
 with tables.
The applications use the extracted information (text, positions, font)
 to match pages against known template models.

I have a private swiki for the framework
 at http://www.aleReimondo.com.ar/PDF
 It is written in Spanish, but I think it can
 be helpful for understanding how the VS version of the
 framework works (a runtime demo can be downloaded there).
 To get access to the swiki, please contact me for a userName:password.

Feel free to contact me if you think that it is what you need.
Best,
Ale.
---------
Alejandro F. Reimondo
http://www.smalltalking.net http://www.aleReimondo.com.ar