Hello all,
I've searched for this, but it's tricky because 'pdf' commonly appears in URLs. So far, I have uncovered a dare-devil's parser in Squeak and various ways to write PDF files (status unclear), which is not my problem at present.

In short, I have a ~2MB PDF containing many tables and some useful text on either side of those tables. Naturally, I would like to find a reader in Dolphin, but just about any dialect would suffice to read, parse, and export the interesting stuff. The result could be one or more text files written for the benefit of Dolphin code that will further parse the text to create what I need.

Any recommendations or pointers to a parser?

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
On Wed, 20 Jul 2005 14:37:24 GMT, Bill Schwab <[hidden email]> wrote:
> Any recommendations or pointers to a parser?

What about a Linux-based toolchain: pdf2ps | ps2text?

s.
In reply to this post by Schwab,Wilhelm K
Bill,
I have seen https://sourceforge.net/projects/pdfcreator

I haven't used it, I just saw it. Judging from the name, they may also have something useful for parsing.

I hope this helps,
Janos
In reply to this post by Schwab,Wilhelm K
Bill,
> In short, I have a ~2MB PDF containing many tables and some useful text
> on either side of those tables. Naturally, I would like to find a
> reader in Dolphin, but just about any dialect would suffice to read,
> parse, and export the interesting stuff. The result could be one or
> more text files written for the benefit of Dolphin code that will
> further parse the text to create what I need.

You can certainly do that /somehow/ with Aladdin Ghostscript. Presumably you could run the command-line executable from Dolphin in the "normal" way (although I have no idea what flags to give it).

--
chris
In reply to this post by Schwab,Wilhelm K
Bill Schwab wrote:
> ... So far, I have uncovered a dare-devil's parser in Squeak and
> various ways to write PDF files (status unclear), which is not my
> problem at present.

I've just been trying the PDFReader at:
http://map1.squeakfoundation.org/sm/package/ac6ea80a-cad6-403b-8286-3db1ff5d53a4

It says that it's for Squeak 3.2, but I'm using a 3.6 image, and the parsing works. Rendering to Morphic doesn't, though. I briefly considered tweaking the rendering engine to spit out the text, but so far it's been easier to scan through the page description, line by line. I just need it to extract the text from a telecom bill, which it does fine.

The real problem is that the text is mostly out of order, so heuristics have to be used to pick out the numbers.

HTH.

--yanni
Yanni,
> I've just been trying the PDFReader at:
> http://map1.squeakfoundation.org/sm/package/ac6ea80a-cad6-403b-8286-3db1ff5d53a4
>
> It says that it's for Squeak 3.2, but I'm using a 3.6 image,
> and the parsing works. Rendering to Morphic doesn't, though.
[snip]
> I just need it to extract the text from a telecom bill,
> which it does fine.

Thanks, I will give it a shot.

> The real problem is that the text is
> mostly out of order, so heuristics have to be used
> to pick out the numbers. HTH.

Can you elaborate on what is out of order and why? That might be a problem for me, given that proximity (so far) seems to be the only way to associate the tables and the text. OTOH, if there is enough structure, I might be able to use sections etc. to put things together.

Thanks!!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
Bill Schwab wrote:
> > The real problem is that the text is
> > mostly out of order, so heuristics have to be used
> > to pick out the numbers. HTH.
>
> Can you elaborate on what is out of order and why? That might be a
> problem for me given that proximity (so far) seems to be the only way to
> associate the tables and the text. OTOH, if there is enough structure,
> I might be able to use sections etc. to put things together.

In PostScript, the data stream that is sent to the "printer" is actually a sequence of instructions to a stack-based interpreter, which then "draws" each page. AFAICT, the PDF format uses a restricted subset of PostScript (or maybe just a set of pre-defined PostScript functions) to describe each page. Each page can be quickly located using an index structure (unlike PostScript, which inserts an "advance to next page" instruction into the stream).

The problem is that the PDF page description consists of instructions that move the drawing point to a given coordinate, each followed by a draw-text instruction. In my case (and likely in most cases), each word is placed individually on the page. Otherwise, differences in character widths between the computed and the rendered values could cause lines to be wider than expected.

In my case, the printed page has a line with a text description followed by an amount, but in the PDF these bits of text occur in the opposite order, or an entire group of name/value pairs is transposed. Whatever is generating the text can apparently emit it in any order. Paragraph text, however, does appear in order. Tables appear in order, but empty columns leave no artifact. This applies to the text only; the lines are drawn by stuff I've skipped.

--yanni
In reply to this post by Schwab,Wilhelm K
Hi Bill. I have a very good PDF parser, written entirely in Dolphin, that I am currently using in one of my projects. I bought it from a friend and he owns its copyrights; if you are interested in buying something like this, I could put you in contact with him. That parser is part of a very complete framework developed in VS; I only bought a part migrated to Dolphin, and I can say it's very well written. I am using it in production now.

Diego
Dear Bill,
I have implemented a PDF parser & renderer, and a text conversion framework, for VisualSmalltalk. A small DLL is used for decryption and decompression, but all important code (+99.9%) is entirely written in well documented and factored Smalltalk.

The framework for VisualSmalltalk can:
1.- read, decompress, and decrypt PDF files, extracting all text, graphics, and line art (some shadings are not fully implemented).
2.- render PDF pages (text/images/lines) on graph panes with zoom and object detection (for point&click interfaces).
3.- render pages on any pen (e.g. a printer's pen, or hidden in-memory rendering) with alignment & zooming.
4.- render to a conversion framework for easy parsing of texts, performing page layout inference (when required) to convert to word-processor formats (RTF conversion is implemented).

I have also implemented part of the framework for Dolphin Smalltalk. The Dolphin implementation can:
1.- read, decompress, and decrypt PDF files, extracting all text with some information including text position and name of font.
2.- build a simple in-memory structure of the page layout with all the text read.
3.- parse text in PDFs at very high speed (only text is scanned, skipping images, and extracting only relevant information).

The framework for Dolphin has been proven and is currently working in commercial applications that read PDF files with tables. The applications use the extracted information (text, positions, font) to match each page against known template models.

I have a private swiki for the framework at http://www.aleReimondo.com.ar/PDF. It is written in Spanish, but I think it can be helpful for learning how the framework works in the VS version (a runtime demo can be downloaded there). To have access to the swiki, please contact me for a userName:password.

Feel free to contact me if you think that it is what you need.

best,
Ale.
---------
Alejandro F. Reimondo
http://www.smalltalking.net
http://www.aleReimondo.com.ar