LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Manuel Leuenberger
Hi everyone,

I have been experimenting over the last few weeks with my take on literature research. For me, the corpus of scientific papers forms an interconnected graph, not the plain lists and tables we keep in our bibliographies. So, here is the first prototype: it integrates Google Scholar for search, can fetch PDFs from IEEE and ACM, and extracts metadata from PDFs. All this results in hyperlinked PDFs!

See a demo here: https://youtu.be/EcK3Pt_WnEw
Also slides from the SCG seminar here: http://scg.unibe.ch/download/softwarecomposition/2017-10-31-Leuenberger-ILE.pdf

I plan on packaging it, so that those who are interested can check it out themselves (help wanted!). Currently, it only works on macOS.

What do you think of my approach? Which use cases should be added?

Cheers,
Manuel


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

kilon.alios
Super cool! I will give more detailed recommendations once I try it in practice.

A cheap PDF viewer in Pharo would be to convert the PDF pages to JPG images, which you can load via an image morph, so you won’t need two separate windows. There are tons of converters out there that can do this.

Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

vonbecmann
In reply to this post by Manuel Leuenberger
Really nice! Ted Nelson talks about something like that in his Xanadu project.

--
Bernardo E.C.

Sent from a cheap desktop computer in South America.

Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Offray Vladimir Luna Cárdenas-2
In reply to this post by Manuel Leuenberger

Hi Manuel,

This is really interesting! I like the idea of a graph and the way ILE manages metadata import. Grafoscopio [1][2][3] is the software I'm creating for reproducible research and literate computing (mixing prose, code, queries, data and visualizations). It has alpha support for Zotero, and I think that ILE and Grafoscopio could work together to support researchers in creating and working with research literature.

[1] http://mutabit.com/grafoscopio/index.en.html
[2] http://mutabit.com/repos.fossil/alvicoda/doc/tip/index.html
[3] http://joss.theoj.org/papers/c92ed13fa746bc681081f9b31678841b

My research workflow currently includes Pharo, TeXStudio, Grafoscopio, Zotero, Docear[4] and Hypothesis[5] to map and annotate research literature, but there is a lot of context switching, as you point out in your presentation, and a lack of moldability that we could get rid of if those tools were integrated into Pharo. To manage this, I usually have two screens (see screenshot below) where I do annotated reading (in Hypothesis) and map my readings (in Docear). That is because Docear is not integrated (yet) with Hypothesis, which has a superb annotation system.

[4] http://www.docear.org/
[5] https://web.hypothes.is/

I would like to have a tree-like reading interface similar to Docear's, connected with the annotated reading of Hypothesis (tags and comments), inside Pharo/Grafoscopio. In my ideal workflow I would add a bibliography item to Zotero (using add item by ID or dropping the URL/PDF), open it (getting the table-of-contents map inside Pharo, à la Docear and thanks to Hypothesis) and start to annotate and tag my readings. After that, some DSL would allow me to recover, visualize and export those annotations so they can be put inside, or connected with, a Grafoscopio notebook, and there I would finish the research writing.

What do you think of this workflow? Could ILE support or be part of it in some way?

Once you have packaged ILE, I could help as a tester on GNU/Linux.

Cheers,

Offray




Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Stephane Ducasse-3
In reply to this post by Manuel Leuenberger
Hi Manuel,

This is super cool :)
Could you describe how you did the PDF integration?
And yes, please package it :)
I want to try it.

Stef

On Wed, Nov 1, 2017 at 10:16 PM, Manuel Leuenberger
<[hidden email]> wrote:

> Hi everyone,
>
> I was experimenting in the last few weeks with my take on literature
> research. For me, the corpus of scientific papers form an interconnected
> graph, not those plain lists and tables we keep in our bibliographies. So,
> here is the first prototype that has Google Scholar integration for search,
> can fetch PDFs from IEEE and ACM, extracts metadata from PDFs - all this
> results in hyperlinked PDFs!
>
> See a demo here: https://youtu.be/EcK3Pt_WnEw
> Also slides from the SCG seminar here:
> http://scg.unibe.ch/download/softwarecomposition/2017-10-31-Leuenberger-ILE.pdf
>
> I plan on packaging it, so that those who are interested can check it out
> themselves (help wanted!). Currently, it only works on macOS.
>
> What do you think of my approach? Which use cases should be added?
>
> Cheers,
> Manuel
>


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Tudor Girba-2
Really nice work!

Indeed, the PDF integration can be quite interesting to people.

Doru


--
www.tudorgirba.com
www.feenk.com

"Be rather willing to give than demanding to get."






Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Manuel Leuenberger
In reply to this post by kilon.alios
Hi Dimitris,

I looked around for a way to integrate PDF into Pharo. But as long as we don’t have a way to natively display interactive multimedia (maybe Bloc will support something like this at some point?), I don’t consider it worth the effort to come up with a half-baked (note to self: watch that movie again) solution that is hard to implement and cannot even remotely support all the use cases of external tools that have been crafted over years. Just rendering bitmaps loses a lot of information (the full text) and actionability (hyperlinks). If I wanted interactive PDFs inside Pharo, I would have to invest a lot of time replicating functionality that has been implemented before. So I chose the path of least resistance and pushed this responsibility to a specialized tool. I think Pharo is great for modelling, exploration, and inspection, but OS integration is not really a selling point, and it doesn’t have to be. Using a standard PDF viewer has the further advantage that users are already familiar with it and have their own workflow, which I can extend.

As a side note, I think that being able to embed a web browser inside Pharo would open up a whole new world of applications, as web browsers are currently the vehicles for content distribution. But as long as there is no project where this is a critical feature, I can live without it.

Cheers,
Manuel


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Manuel Leuenberger
In reply to this post by vonbecmann
Hi Bernardo,

Thanks for the Xanadu hint. Indeed, I made observations similar to Ted Nelson’s: we use desktop machines to replicate physical paper on the screen, blatantly ignoring that we can easily break the constraints of paper on a computer. This reminds me a bit of Roy Fielding’s thesis about REST, where he clearly defined what REST is. Yet most implementations that label themselves REST are just RPC. People seem to look at technological concepts, take out all the good parts, keep only the name and the claims, and do something different while still marketing it as innovative and modern.

Cheers,
Manuel


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Manuel Leuenberger
In reply to this post by Offray Vladimir Luna Cárdenas-2
Hi Offray,

That’s a lot for me to process. I need some time to inspect your workflow and find the possible connections with the ILE. I will do that in the next few days and give you more detailed feedback.

As for the Linux testers, I will gladly come back to you. I think it won’t be too hard to make the ILE support Linux and Windows. The only platform dependency comes from the way an OS lets you register custom URI schemes.
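
To illustrate what registering a custom URI scheme could look like on a Linux desktop, here is a minimal sketch in Python. It assumes a freedesktop-style environment, where a scheme is claimed through a desktop entry whose MimeType is x-scheme-handler/<scheme>. The handler path and entry name are hypothetical; nothing here is taken from the ILE itself.

```python
def desktop_entry(scheme: str, handler_path: str,
                  name: str = "Pharo URI handler") -> str:
    """Render a freedesktop desktop entry that claims a URI scheme.

    handler_path is a hypothetical executable that receives the clicked
    URI as its argument (%u) and forwards it to the running image.
    """
    lines = [
        "[Desktop Entry]",
        "Type=Application",
        f"Name={name}",
        f"Exec={handler_path} %u",
        # The desktop environment matches URI schemes via this MIME type.
        f"MimeType=x-scheme-handler/{scheme};",
        "NoDisplay=true",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(desktop_entry("pharo", "/usr/local/bin/pharo-uri-handler"))
```

The resulting text would typically be saved under ~/.local/share/applications/ and made the default handler with xdg-mime, for example `xdg-mime default pharo-uri-handler.desktop x-scheme-handler/pharo` (the file name is again an assumption).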

Cheers,
Manuel


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Manuel Leuenberger
In reply to this post by Stephane Ducasse-3
Hi Stef,

The PDF integration consists of three parts:

1. CERMINE (https://github.com/CeON/CERMINE) is fed with the PDF and outputs metadata as BibTeX and structured XML (title, authors, affiliations, abstract, keywords, references, …). This is not perfect, but way better than any other metadata extractor I could find.
2. From the metadata I generate hyperlinks that are anchored in the PDF by a text key. pdf-linker (https://github.com/maenu/pdf-linker) then searches for the anchors in the PDF text using heuristics, because PDF has a document model that is primarily intended for rendering and printing, not for processing. The hyperlinks are then inserted using the awesome Apache PDFBox (https://pdfbox.apache.org/).
3. Those hyperlinks point to a URI like “pharo://handle/clickReference.in.?args=1&args=2” to represent reference 1 in paper 2. Now comes the magic part: the OS allows you to register custom handlers for custom URI schemes like pharo://. For that I created a simple Objective-C app that handles the event and passes it on as an HTTP message to a server running in Pharo (https://github.com/maenu/PharoUriScheme). The OS will even start the application if it is not yet running.
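
The link format in step 3 can be sketched as follows. This is only an illustration in Python of how such a URI might be built and taken apart again, assuming the path segment after "handle/" names the action and each repeated "args" parameter is one argument. The function names build_link and parse_link are hypothetical and are not part of PharoUriScheme.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_link(selector: str, args: list[str]) -> str:
    """Build a pharo:// link with one repeated 'args' parameter per argument."""
    query = urlencode([("args", arg) for arg in args])
    return f"pharo://handle/{selector}?{query}"


def parse_link(uri: str) -> tuple[str, list[str]]:
    """Split a pharo:// link into its action name and argument list."""
    parts = urlsplit(uri)
    if parts.scheme != "pharo":
        raise ValueError(f"not a pharo:// link: {uri!r}")
    selector = parts.path.lstrip("/")
    # parse_qs collects repeated keys into a list, preserving their order.
    args = parse_qs(parts.query).get("args", [])
    return selector, args


if __name__ == "__main__":
    link = build_link("clickReference.in.", ["1", "2"])
    print(link)
    print(parse_link(link))
```

A handler that receives such a link would only need the selector and the ordered arguments to decide what to do, which is why the sketch returns exactly those two values.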

While the custom URI scheme approach is super powerful, it has critical drawbacks. Any application can request to be the receiver of a URI scheme, just as browsers are for http://. Especially on mobile devices with limited access to the OS, this opens up an attack point for malware apps that imitate original apps using schemes like facebook:// and eavesdrop on all interactions. If an original app transmits unencrypted secrets or user data encoded in those URIs, malware can easily intercept them without the user noticing the leak. I guess this is the reason why many PDF viewers support only the standard http:// and mailto: schemes. E.g., macOS Preview just gives an audible beep when I click on a pharo:// link, and Chrome’s viewer doesn’t even bother giving any feedback. Only Adobe Acrobat allows you to relax security settings to make them work (how could it be anyone other than Adobe, when it’s a security issue? ;).

I finished basic packaging today and will continue with some READMEs and a nearly-all-in-one distribution tomorrow. I’ll keep you posted in this thread.

Cheers,
Manuel


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Stephane Ducasse-3
Hi Manuel,

Thanks for the details. I think that Christian Haider's framework
should be able to read PDFs.

Stef


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Christian Haider
Yes, reading PDFs is fine with PDFtalk, but what you want is more: text extraction (there is a chapter in the spec about this).
This feature is not yet readily available. Some groundwork has been done (content analysis), but full text extraction needs more work.

Cheers,
        Christian


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Manuel Leuenberger
Hi Christian,

For text extraction I used the PDFTextStripper from PDFBox (https://github.com/apache/pdfbox/blob/5991a69ecbcd53775f685755a399304d04accfa2/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java). It’s not perfect, but good enough. Maybe it can serve as an oracle for PDFtalk on how to do it. It basically parses the PDF into a DOM of words, lines, paragraphs, and pages.
Repeating myself from an earlier thread: I would be very interested in a Pharo port of PDFtalk. What I currently need is text extraction, manipulation of annotations, and content streams for drawing rectangles.
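
To make the "DOM of words, lines, paragraphs, and pages" idea concrete, here is a rough sketch in Python. It does not reproduce PDFBox's classes or its algorithm. A real stripper has to infer the structure from glyph positions, whereas this sketch assumes already-extracted plain text in which form feeds separate pages and blank lines separate paragraphs. All class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    words: list[str]


@dataclass
class Paragraph:
    lines: list[Line]


@dataclass
class Page:
    paragraphs: list[Paragraph]


def parse_text(text: str) -> list[Page]:
    """Build a page/paragraph/line/word tree from plain text.

    Assumptions: a form feed ends a page and a blank line ends a paragraph.
    """
    pages = []
    for raw_page in text.split("\f"):
        paragraphs = []
        for block in raw_page.split("\n\n"):
            # Skip empty lines so that only lines with words are kept.
            lines = [Line(row.split()) for row in block.splitlines() if row.strip()]
            if lines:
                paragraphs.append(Paragraph(lines))
        pages.append(Page(paragraphs))
    return pages


if __name__ == "__main__":
    sample = "Hello world\nsecond line\n\nNew paragraph\fPage two"
    for number, page in enumerate(parse_text(sample), start=1):
        print(number, [[line.words for line in p.lines] for p in page.paragraphs])
```

With a tree like this, a text key can be looked up word by word while the surrounding line and page stay known, which is the kind of context a link anchor needs.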

Cheers,
Manuel

On 3 Nov 2017, at 10:15, Christian Haider <[hidden email]> wrote:

Yes, reading PDFs is fine with PDFtalk, but what you want is more: Text extraction (there is a chapter in the spec about this).
This feature is not yet readily available. Some ground work has been done (content analysis), but for full text extraction more work is needed.

Cheers,
Christian

-----Ursprüngliche Nachricht-----
Von: Pharo-users [[hidden email]] Im Auftrag
von Stephane Ducasse
Gesendet: Freitag, 3. November 2017 09:46
An: Any question about pharo is welcome <[hidden email]>
Betreff: Re: [Pharo-users] LiteratureResearcher - where graphs, PDFs, and
BibTex happily live together

Hi manuel

thanks for the details. I think that the framework of christian haidler should
be able to read pdf.

Stef

On Thu, Nov 2, 2017 at 8:33 PM, Manuel Leuenberger
<[hidden email]> wrote:
Hi Stef,

The PDF integration consists of three parts:

1. CERMINE (https://github.com/CeON/CERMINE) is fed with the PDF and
outputs metadata as BibTex and a structured XML (title, authors,
affiliations, abstract, keyword, references, …). This is not perfect,
but way better than any other metadata extractor I could find.
2. From the metadata I generate hyperlinks that are anchored in the
PDF by a text key. pdf-linker (https://github.com/maenu/pdf-linker)
then searches for the anchors in the PDF text, using heuristics, as
PDF has a document model that is primarily intended for rendering and
printing, but not for processing. The hyperlinks are then inserted
using the awesome Apache PDFBox (https://pdfbox.apache.org/).
3. Those hyperlinks point to an URI like
“<a href="pharo://handle/clickReference.in.?args=1&amp;args=2" class="">pharo://handle/clickReference.in.?args=1&args=2” to represent a
reference 1 in the paper 2. Now comes the magic part: The OS allows
you to register custom handlers for custom URI schemes like pharo://.
For that I created a simple Objective-C app that handles the event and
passes it over as a HTTP message to a server running in Pharo
(https://github.com/maenu/PharoUriScheme). The OS will even start the
application if it is not yet running.

While the custom URI scheme approach is super powerful, it has
critical drawbacks. Any application can request to be the receiver of
a URI scheme, just as browsers are for http://. Especially on mobile
devices with limited access to the OS, this opens up an attack point
for malware apps that imitate original apps using schemes like
facebook:// and eavesdrop on all interactions. If an original app
transmits any unencrypted secrets or user data encoded in those URIs,
malware can easily intercept them without the user noticing the leak. I
guess this is the reason why many PDF viewers support just the standard
http:// and mailto: schemes. E.g., macOS Preview gives just an
audible beep when I click on a pharo:// link, and Chrome's viewer
doesn’t even bother giving any feedback. Only Adobe Acrobat allows you
to relax security settings to make them work (How could it be someone
else than Adobe, when it’s a security issue? ;).
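
Since any document or application can trigger a pharo:// link, one way
to limit the damage on the receiving side is to accept only known
handlers with well-formed arguments and to keep secrets out of the URI
entirely. The following Python sketch shows that idea under stated
assumptions: the allowlist, the function name, and the rule that
arguments are numeric identifiers are all hypothetical and not taken
from PharoUriScheme.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical allowlist: handler path -> number of arguments it expects.
ALLOWED_HANDLERS = {"clickReference.in.": 2}


def validate_pharo_uri(uri):
    """Return (handler, integer arguments) or raise ValueError.

    Rejects unknown schemes, unknown handlers, unexpected query
    parameters, a wrong argument count, and non-numeric arguments.
    """
    parts = urlsplit(uri)
    if parts.scheme != "pharo" or parts.netloc != "handle":
        raise ValueError("not a pharo://handle URI")
    handler = parts.path.lstrip("/")
    if handler not in ALLOWED_HANDLERS:
        raise ValueError(f"handler not allowed: {handler!r}")
    query = parse_qs(parts.query)
    args = query.get("args", [])
    if set(query) != {"args"} or len(args) != ALLOWED_HANDLERS[handler]:
        raise ValueError("unexpected arguments")
    if not all(a.isdecimal() for a in args):
        raise ValueError("arguments must be numeric identifiers")
    return handler, [int(a) for a in args]
```

This does not address the hijacking problem itself (another app
registering the same scheme); it only narrows what a forged link can
make the receiving application do.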

I finished basic packaging today and will continue with some READMEs
and a nearly-all-in-one distribution tomorrow, I’ll keep you posted in
this thread.

Cheers,
Manuel

On 2 Nov 2017, at 18:08, Stephane Ducasse <[hidden email]>
wrote:

Hi manuel

this is super cool :)
Could you describe how you did the pdf integration?
And yes please package it :)
I want to try it.

Stef

On Wed, Nov 1, 2017 at 10:16 PM, Manuel Leuenberger
<[hidden email]> wrote:

Hi everyone,

I have been experimenting over the last few weeks with my take on
literature research. For me, the corpus of scientific papers forms an
interconnected graph, not those plain lists and tables we keep in our
bibliographies. So, here is the first prototype that has Google
Scholar integration for search, can fetch PDFs from IEEE and ACM, and
extracts metadata from PDFs - all this results in hyperlinked PDFs!

See a demo here: https://youtu.be/EcK3Pt_WnEw
Also slides from the SCG seminar here:
http://scg.unibe.ch/download/softwarecomposition/2017-10-31-Leuenberger-ILE.pdf

I plan on packaging it, so that those who are interested can check it
out themselves (help wanted!). Currently, it only works on macOS.

What do you think of my approach? Which use cases should be added?

Cheers,
Manuel








Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Evan Donahue
In reply to this post by Manuel Leuenberger
This looks great. What would it take to get it running on Ubuntu?




Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Offray Vladimir Luna Cárdenas-2
In reply to this post by Manuel Leuenberger

I share Manuel's view about delegating some functionality to external apps, like a PDF reader, the web browser or even Pandoc. There is a lot of external mature stuff out there and we can gain users from there if we make bridges to the stuff they already know and use. For example, I would like to try the Pharo-Chrome integration and also the Hypothesis JSON API to annotate and read PDF/HTML.

Cheers,

Offray


On 02/11/17 13:49, Manuel Leuenberger wrote:
Hi Dimitris,

I looked around for a way to integrate PDF into Pharo. But as long as we don’t have a way to natively display interactive multimedia (maybe Bloc will at some point support something like this?), I consider it not worth the effort to come up with a half-baked (note to self: watch that movie again) solution that is hard to implement and cannot even remotely support all the use cases of external tools that have been crafted for years. Just rendering bitmaps loses so much information (full text) and actionability (hyperlinks). If I wanted to have interactive PDFs inside Pharo, I would have to invest a lot of time to replicate functionality that has been implemented before. So, I chose the path of least resistance by pushing this responsibility to a specialized tool. I think Pharo is great for modelling, exploration, and inspection, but OS integration is not really a selling point, and it doesn’t have to be. By using a standard PDF viewer I also gain the advantage that users are already familiar with it and have their own workflow that I can extend.

As a side note, I think that being able to embed a web browser inside Pharo would open up a whole new world of applications, as web browsers are currently the vehicles for content distribution. But as long as there is no project where this is a critical feature, I can live without it.

Cheers,
Manuel


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

aglynn42

I’ve used PDFMiner and pypdf2xml previously and both are easier to use now that Atlas is available. Both work well, though XPdf (in C++) is faster.

http://www.unixuser.org/~euske/python/pdfminer/

https://github.com/zejn/pypdf2xml

pdf2xml (also Python) is slightly quicker than pypdf2xml but doesn’t handle Type 2 fonts properly.


Re: LiteratureResearcher - where graphs, PDFs, and BibTex happily live together

Manuel Leuenberger
In reply to this post by Manuel Leuenberger
Hi everyone,

My estimate that I could package everything over the weekend was overly optimistic. There were just too many issues with portability and dependencies, leading to a long chain of installation requirements. Nevertheless, I decided to publish what I have so far. Maybe some of you have more experience in making Pharo applications like this portable; Python especially is giving me a headache (literally).

So, here are two source repositories and an all-in-one package, which may work for some of you (still macOS only):

LiteratureResearcher sources: https://github.com/maenu/LiteratureResearcher contains the main sources and make scripts. Installation through Metacello should build the project, which takes a while.
PharoUriScheme sources: https://github.com/maenu/PharoUriScheme contains the sources for the pharo:// protocol wrapper. This is mainly an Xcode project, which introduces the platform dependency. The project could also be extended to support Linux and Windows.
All-in-one package: https://figshare.com/articles/LiteratureResearcher_All-in-one/5584837 contains LiteratureResearcher, but still requires Java, Python 2.7, and virtualenv (shame on you, Python). Might work for some.
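
Before trying the all-in-one package, it may help to check that the three external requirements named above are on the PATH. This is a small Python sketch under assumptions: the executable names "java", "python2.7", and "virtualenv" are guesses at how the tools are installed on a given machine, and the helper is not part of the LiteratureResearcher repository.

```python
import shutil

# Assumed executable names for the requirements listed above.
REQUIRED_TOOLS = ("java", "python2.7", "virtualenv")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH.

    'which' is injectable so the check can be exercised without
    depending on what is installed on the current machine.
    """
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing prerequisites: " + ", ".join(missing))
else:
    print("All prerequisites found.")
```

The check only confirms that the commands exist; it does not verify versions, so a Java or virtualenv that is too old would still pass.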

Comments and contributors are very welcome: post issues, fork, create PRs, extend!

Cheers,
Manuel
