[ANN] XMLParserHTML moved to GitHub

[ANN] XMLParserHTML moved to GitHub

Torsten Bergmann
Hi,

the STHub -> PharoExtras project "XMLParserHTML"

has now been moved from http://smalltalkhub.com/#!/~PharoExtras/XMLParserHTML to
https://github.com/pharo-contributions/XML-XMLParserHTML, including the FULL HISTORY.

The old STHub repo is marked as obsolete but links to the new one. I've also
set up a CI job: https://travis-ci.org/pharo-contributions/XML-XMLParserHTML
which is green for Pharo 7. Some cleanups, class comments and documentation were applied, as you can
see from the commit history.

The new version is tagged in git as 1.6.0 (with a movable tag 1.6.x in case further
hotfixes are required).

You can load it using

   Metacello new
        baseline: 'XMLParserHTML';
        repository: 'github://pharo-contributions/XML-XMLParserHTML/src';
        load.

or from the catalog in Pharo 7 or 8.
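
Once loaded, a quick smoke test could look like this (just a sketch; parseHTML is the String extension the package installs, and selectors like allElementsNamed: / contentString come from the underlying XMLParser package):

   | doc |
   doc := '<html><body><p>Hello <b>Pharo</b>' parseHTML.   "tolerates the missing end tags"
   (doc allElementsNamed: 'b') first contentString.        "'Pharo'"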

Attached is the current dependency graph.

More to come soon ...

Bye
T.

[Attachment: xmlparserhtml.png (41K), dependency graph]

Re: [ANN] XMLParserHTML moved to GitHub

Esteban A. Maringolo
Thank you Torsten,

I wasn't aware of this tool; I'm already using it to scrape content
from a website and feed a Pharo-driven system :)

The XML integration in the Inspector is great too.

Regards!

Esteban A. Maringolo

On Tue, Nov 19, 2019 at 8:40 AM Torsten Bergmann <[hidden email]> wrote:



Re: [ANN] XMLParserHTML moved to GitHub

cedreek
Stef and others wrote this book a while ago:

http://books.pharo.org/booklet-Scraping/html/scrapingbook.html

Basically XMLHTMLParser + XPath.

To me, far better than using Soup.
The Google Chrome / Pharo integration also helps to scrape complex, full-JS web sites like Google ;)
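
For the curious, the combination looks roughly like this (a sketch; xpath: comes from the separate XML-XPath package at https://github.com/pharo-contributions/XML-XPath, and the URL is just an example):

   | doc |
   doc := (ZnClient new get: 'https://pharo.org') parseHTML.   "lenient HTML parse via XMLHTMLParser"
   (doc xpath: '//a/@href') asOrderedCollection.               "every link target on the page"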


Cheers,

Cedrick 

On 29 Nov 2019, at 15:41, Esteban Maringolo <[hidden email]> wrote:


Re: [ANN] XMLParserHTML moved to GitHub

Esteban A. Maringolo
Great!

I just added a link to the README.md of the project and created a PR,
because it is very likely that if you're parsing HTML you're doing
some scraping. :-)

Esteban A. Maringolo


On Fri, Nov 29, 2019 at 2:18 PM Cédrick Béler <[hidden email]> wrote:


Re: [ANN] XMLParserHTML moved to GitHub

Sean P. DeNigris
In reply to this post by cedreek
cedreek wrote:
> To me, far better than using Soup.

Ah, interesting! I use Soup almost exclusively. What did you find superior
about XMLParserHTML? I may give it a try...


cedreek wrote:
> Google Chrome / Pharo integration also helps to scrape complex, full-JS web
> sites like Google ;)

Also interesting! Any publicly available examples? How does one load "Google
Chrome / Pharo integration"? Also, there is often the "poor man's" way (albeit
requiring manual intervention) by inspecting the Ajax HTTP requests in a
developer console and then recreating directly in Pharo.
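
For completeness, that "poor man's" approach with Zinc might look like this (only a sketch; the endpoint and parameter are hypothetical, stand-ins for whatever the network tab shows):

   | json |
   json := ZnClient new
        url: 'https://example.com/api/search';      "hypothetical Ajax endpoint from the dev console"
        queryAt: 'q' put: 'pharo';
        accept: ZnMimeType applicationJson;
        get.
   STONJSON fromString: json.                       "parse the JSON reply with what ships in the image"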



Cheers,
Sean

Re: [ANN] XMLParserHTML moved to GitHub

cedreek

> cedreek wrote:
>> To me, far better than using Soup.
>
> Ah, interesting! I use Soup almost exclusively. What did you find superior
> about XMLParserHTML? I may give it a try...


It's mainly XPath, which I find easier than navigating the HTML tree with Soup or even the XMLHTMLParser.

I usually copy the XPath from a web inspector. I have to tweak it a bit though.
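
For illustration (both expressions are made up), the copied XPath is usually absolute and position-based, and the tweak is to rewrite it into something shorter and attribute-based that survives layout changes (dom being an already parsed document):

   "as copied from the browser's 'Copy XPath'"
   dom xpath: '/html/body/div[3]/div[2]/div/div[1]/a'.

   "after tweaking: shorter and more robust"
   dom xpath: '//div[@class="g"]//a/@href'.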


> cedreek wrote:
>> Google Chrome / Pharo integration also helps to scrape complex, full-JS web
>> sites like Google ;)
>
> Also interesting! Any publicly available examples? How does one load "Google
> Chrome / Pharo integration"? Also, there is often the "poor man's" way (albeit
> requiring manual intervention) by inspecting the Ajax HTTP requests in a
> developer console and then recreating directly in Pharo.


I just tried it once.

There is a Google Chrome plugin that lets you use headless Chrome to get the fully loaded HTML page.

I need to try it again. A simple example I'd like to do is to scrape Google and remove advertised content ^^

This is btw Torsten's package:

https://github.com/astares/Pharo-Chrome

Happy scraping ;-)

And thx Torsten for all ^^

Cedrick 





Re: [ANN] XMLParserHTML moved to GitHub

cedreek
Just a quick try to scrape a query from Google search (using the Chrome plugin + XMLHTMLParser + XPath):

Really cool… and in a few lines you get your own "Google"-powered search engine ;-).

Nevertheless it is a bit slow (about 5 seconds per page, around 20 s for the pages fetched by the script); I think it should be possible to speed things up a bit.

HTH,


Cédrick


PS: code below in text

searches := OrderedCollection new.
query := 'black friday' urlEncoded.

"drive headless Chrome through the Pharo-Chrome plugin"
browser := GoogleChrome new.
browser headless: true.
browser open.
page := browser firstTab.

"Google paginates results with the 'start' parameter, 10 hits per page"
0 to: 50 by: 10 do: [:paging |
    page get: 'https://www.google.com/search?q=', query, '&start=', paging asString.
    result := page html.
    dom := result parseHTML. "uses XMLHTMLParser"
    searches addAll: (dom xpath: '//div[@class="g"]/div/div/div/a/@href') ].

searches size.    "64"
searches first.


Re: [ANN] XMLParserHTML moved to GitHub

cedreek
In reply to this post by Sean P. DeNigris

> Also interesting! Any publicly available examples? How does one load "Google
> Chrome / Pharo integration"?

https://github.com/astares/Pharo-Chrome
https://github.com/akgrant43/Pharo-Chrome

Cheers,

Cédrick



Re: [ANN] XMLParserHTML moved to GitHub

Esteban A. Maringolo
Why use Chrome instead of ZnClient? To get a "real" render of the
content? (including JS and whatnot).

Regards!


Esteban A. Maringolo

On Sat, Nov 30, 2019 at 8:11 PM Cédrick Béler <[hidden email]> wrote:



Re: [ANN] XMLParserHTML moved to GitHub

cedreek
I couldn't get it from Zn as (I think) there are some JS libs that defer the full rendering.

I have the same problem with a site in France (leboncoin). They use https://datadome.co to complicate web scraping, so a headless browser is the only solution I know of.
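
The difference in a nutshell (a sketch; the Chrome selectors are the ones from the Google search script earlier in the thread):

   "plain HTTP fetch: only the initial HTML, before any JavaScript has run"
   raw := ZnClient new get: 'https://www.leboncoin.fr'.

   "headless Chrome via Pharo-Chrome: the DOM after the scripts have executed"
   browser := GoogleChrome new.
   browser headless: true.
   browser open.
   page := browser firstTab.
   page get: 'https://www.leboncoin.fr'.
   rendered := page html.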

Cheers,

Cédrick

On 1 Dec 2019, at 00:23, Esteban Maringolo <[hidden email]> wrote:



Re: [ANN] XMLParserHTML moved to GitHub

Peter Kenny
In reply to this post by Sean P. DeNigris
Sean

I used Soup a few times, but found it difficult to interpret the output,
because the parse did not seem to reflect the hierarchy of the nodes in the
original; in particular, sibling nodes were not necessarily at the same
level in the Soup. XMLHTMLParser always gets the structure right, in my
experience. I think this is essential if you want to use XPath to process
the parse. The worked examples in the scraping booklet show how the parser
and XPath can work together.
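
A tiny example of what getting the structure right means for XPath (a sketch, assuming XMLHTMLParser closes the implied </li> tags and the XPath package is loaded):

   | doc |
   doc := '<ul><li>one<li>two<li>three</ul>' parseHTML.
   (doc xpath: '//ul/li') size.   "expected: 3 - the unclosed <li>s end up as siblings at the same level"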

HTH

Peter Kenny

-----Original Message-----
From: Pharo-users <[hidden email]> On Behalf Of Sean P. DeNigris
Sent: 30 November 2019 16:43
To: [hidden email]
Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub
