Scraping HTML chapter 2 (soon chapter 3 coming)

Stephane Ducasse-3
The idea for this booklet came thanks to Peter Kenny, who kindly
answered a question on the Pharo mailing list.
To help, Peter showed a Pharoer how to scrape a web site
using XPath. In addition, some years ago
I was maintaining Soup, a scraping framework, because I was scraping
magic web sites and I wanted an application to manage my magic cards.
Since then I have always wanted to try XPath, and I also wanted to
offer this booklet to Peter. Why? Because I asked Peter
whether he would like to write something, and he told me that he had
reached a great age at which he would not take on any commitment.
I realised that I would like to get as old as him and still be able to
hack like mad in Pharo with new technology.
So this booklet is a gift to Peter, a great and gentle Pharoer.

Stef

Attachment: scrapingChap2-min.pdf (317K)

Re: Scraping HTML chapter 2 (soon chapter 3 coming)

cedreek
Hi Stéphane and all,

I use Soup from time to time. This semester I am also using it with students to build a small app that fetches leboncoin ads programmatically and sends alerts (leboncoin is a French web site for buying and selling between individuals).

I think it should be updated to reflect the important new HTML5 tags. For instance, I needed to add the section tag to the "nestableBlockTags". It's in SoupParserParameter initializeNestableBlockTags (see the screenshot; I don't have my image here, so it's just the method). Without this tag, I could not properly get the classified ads.

I can publish it if you want, but I think there are other important HTML5 tags to include in this parameters class. As Soup is an "incomplete parser" (as far as I understand), there is no need to include all HTML5 tags. But besides <section>, do other people think there are tags to include there?
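
To make the point about nestable tags concrete, here is a toy Python sketch. It is not Soup's code and not Pharo; it only assumes that Soup behaves like the lenient Beautiful Soup style of parser, where a start tag that is not declared nestable implicitly closes an open element of the same name. The names ToyLenientBuilder, parse and max_depth are made up for the illustration.

```python
from html.parser import HTMLParser


class ToyLenientBuilder(HTMLParser):
    """Toy tree builder. An illustration of the nesting rule only,
    NOT the algorithm Soup actually implements."""

    def __init__(self, nestable_tags):
        super().__init__()
        self.nestable_tags = set(nestable_tags)
        self.root = {"name": "[document]", "children": []}
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        if tag not in self.nestable_tags:
            # Assumed rule: a non-nestable tag implicitly closes
            # an open element that has the same name.
            self._pop_to(tag)
        node = {"name": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        self._pop_to(tag)

    def _pop_to(self, tag):
        # Close everything up to and including the nearest open `tag`.
        # Do nothing when no such element is open.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["name"] == tag:
                del self.stack[i:]
                return


def parse(html, nestable_tags):
    builder = ToyLenientBuilder(nestable_tags)
    builder.feed(html)
    builder.close()
    return builder.root


def max_depth(node, tag):
    """Deepest nesting level of `tag` elements at or below `node`."""
    here = 1 if node["name"] == tag else 0
    return here + max((max_depth(c, tag) for c in node["children"]), default=0)


HTML = "<section><section><p>ad</p></section></section>"

# <section> unknown to the parser: the inner one closes the outer one.
flat = parse(HTML, nestable_tags={"div"})
# <section> declared nestable: the nesting of the page is preserved.
nested = parse(HTML, nestable_tags={"div", "section"})

print(max_depth(flat, "section"), max_depth(nested, "section"))
```

In the first parse the two sections end up as siblings, so a query for the ad inside the outer section finds nothing. That matches the kind of symptom described above, although the real cause in Soup may differ in detail.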



Cheers, 

Cédrick



In reply to the message Stephane Ducasse <[hidden email]> wrote on 27 Sept. 2017 at 09:25.


Re: Scraping HTML chapter 2 (soon chapter 3 coming)

Stephane Ducasse-3
Hello Cédrick,

> I use Soup from time to time. This semester I am also using it with students to build a small app that fetches leboncoin ads programmatically and sends alerts (leboncoin is a French web site for buying and selling between individuals).

Cool, I want it, because I'm looking for games on leboncoin.

> I think it should be updated to reflect the important new HTML5 tags. For instance, I needed to add the section tag to the "nestableBlockTags". It's in SoupParserParameter initializeNestableBlockTags (see the screenshot; I don't have my image here, so it's just the method). Without this tag, I could not properly get the classified ads.
>
> I can publish it if you want, but I think there are other important HTML5 tags to include in this parameters class. As Soup is an "incomplete parser" (as far as I understand), there is no need to include all HTML5 tags. But besides <section>, do other people think there are tags to include there?

Please publish it.
You could also write a chapter showing how you script leboncoin with Soup.
You would be welcome to.

Re: Scraping HTML chapter 2 (soon chapter 3 coming)

cedreek
Hi Steph (sorry for the late reply; it seems my automatic processing of mailing-list mail is misconfigured),

I'll publish it soon, and why not write a chapter on that subject? It could be a really fun tutorial :)

I'll add the section tag in the code for nested tags.

Cheers,

Cédrick

In reply to the message Stephane Ducasse <[hidden email]> wrote on 2017-09-27 at 21:52 GMT+02:00.




