If you have to do web data scraping, what tool would you use?


vonbecmann
Hi,
Imagine that you have to do some data scraping work. What tool would you use?
I know about ZnClient, Soup, NeoCSV and NeoJSON. Is there something else that I'm not aware of?

Thanks.


--
Bernardo E.C.

Sent from a cheap desktop computer in South America.

Re: If you have to do web data scraping, what tool would you use?

Offray Vladimir Luna Cárdenas-2
Hi Bernardo,


On 26/06/16 16:14, Bernardo Ezequiel Contreras wrote:
> Hi,
> Imagine that you have to do some data scraping work, what tool would
> you use?
> I know about ZnClient, Soup, NeoCSV, NeoJSON, is there something else
> that i'm not aware of it?
>
> thanks.
>

Not that I'm aware of. This combination in a live coding environment like
Pharo has been really empowering for me.

Cheers,

Offray


Re: If you have to do web data scraping, what tool would you use?

Andy Burnett
In reply to this post by vonbecmann
I had a similar question, which led me to the WebDriver project (now part of Selenium).

I haven't done any experiments yet, but the idea of driving the browser via its API is appealing.

The Parasol project says it is a WebDriver implementation for Pharo, and it looks quite exciting. The build is currently failing, but that seems to be a GemStone problem.

https://github.com/SeasideSt/Parasol/blob/master/README.md

If you have time to play with it I would love to hear your experience.

Cheers
Andy




Re: If you have to do web data scraping, what tool would you use?

Tudor Girba-2
In reply to this post by vonbecmann
Hi,

Could you provide more details about the use case?

Cheers,
Doru



--
www.tudorgirba.com
www.feenk.com

"If you can't say why something is relevant,
it probably isn't."



Re: If you have to do web data scraping, what tool would you use?

Pierce Ng-3
In reply to this post by vonbecmann
On Sun, Jun 26, 2016 at 06:14:50PM -0300, Bernardo Ezequiel Contreras wrote:
> Imagine that you have to do some data scraping work, what tool would you
> use?
> I know about ZnClient, Soup, NeoCSV, NeoJSON, is there something else that
> i'm not aware of it?

I used Todd Blanchard's HTML parser back in Squeak 3.x days and found it very
nice. Not sure about compatibility with current Pharo though.

  http://smalltalkhub.com/#!/~ToddBlanchard/HTMCSSValidatingParser

Pierce


Re: If you have to do web data scraping, what tool would you use?

stepharo
In reply to this post by vonbecmann

I scraped all the magic cards (bad bad practice) using Soup.

It could be easier but it worked for me. This is why I maintain it and added tests.

Stef




Re: If you have to do web data scraping, what tool would you use?

Peter Kenny

Hello

I have used Soup, but I now prefer XMLHTMLParser, because I find it easier to relate the structure of the XML output to that of the original web page. It is also possible to use XPath to locate the target area more quickly. I particularly like Monty’s ‘Smalltalkish’ adaptation of XPath (see the last paragraph of the class comment of XPath).
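
To illustrate the general idea of parsing a page into a tree and then using an XPath expression to jump to the target area, here is a minimal sketch. It uses Python's standard library rather than Pharo, so nothing in it is the XMLHTMLParser or XPath package API. The page fragment, the `results` id, the class names and the `extract_rows` function are all made up for the example. Note also that ElementTree supports only a subset of XPath and requires well-formed markup, whereas a tolerant HTML parser would cope with real-world pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not taken from the thread).
PAGE = """
<html>
  <body>
    <div id="nav"><a href="/home">Home</a></div>
    <div id="results">
      <table>
        <tr><td class="name">Item A</td><td class="value">10</td></tr>
        <tr><td class="name">Item B</td><td class="value">20</td></tr>
      </table>
    </div>
  </body>
</html>
"""


def extract_rows(markup):
    """Return (name, value) string pairs from the rows of the results table."""
    root = ET.fromstring(markup.strip())
    # A single XPath expression skips the navigation and selects the target rows.
    rows = root.findall(".//div[@id='results']//tr")
    pairs = []
    for row in rows:
        # More XPath, evaluated relative to each row, mixed with host-language code.
        name = row.findtext("td[@class='name']")
        value = row.findtext("td[@class='value']")
        pairs.append((name, value))
    return pairs


if __name__ == "__main__":
    for name, value in extract_rows(PAGE):
        print(name, value)
```

The loop mirrors the pattern of collecting nodes with one XPath query and then querying each node again, which keeps each expression short at the cost of a little host-language glue.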

 

Hope this helps

Peter Kenny

 

On 27 June 2016 10:03, stepharo wrote:
> I scraped all the magic cards (bad bad practice) using Soup.
> It could be easier but it worked for me. This is why I maintain it and added tests.

 


Re: If you have to do web data scraping, what tool would you use?

stepharo

I can imagine that too :) because Soup was often trial and error.

Stef


On 27/6/16 at 11:26, PBKResearch wrote:
> I have used Soup, but I now prefer XMLHTMLParser, because I find it easier to relate the structure of the XML output to that of the original web page. It is also possible to use XPath to locate the target area more quickly.




Re: If you have to do web data scraping, what tool would you use?

vonbecmann
In reply to this post by Tudor Girba-2
Doru,
See the attached file; it's a job posting from Upwork.

On Mon, Jun 27, 2016 at 3:58 AM, Tudor Girba <[hidden email]> wrote:
> Could you provide more details about the use case?

Attachment: crawler_usda.pdf (345K)

Re: If you have to do web data scraping, what tool would you use?

Peter Kenny

Bernardo

Being now retired, I do programming just for intellectual stimulation. Your problem looked as though it would provide more interest than cryptic crosswords or Sudoku, and it touches on areas of Pharo use that I have some experience with. So…

The attached file, BernardoDemo.st, shows how to use XMLHTMLParser with XPath and NeoJSON to tackle your problem, or at least a large subset of it. I cobbled it together in a Playground, and the easiest way to use it is to copy it into a Playground and ‘do it and go’ for each block of code. There are liberal comments, but if anything is not clear come back to me.

A few caveats:

1. XPath is a whole other programming language, embedded in Pharo, which takes some learning. I am by no means expert in it, and it may be that I have used it clumsily. One advantage of embedding it in Pharo is that you can intersperse Pharo and XPath, which I do whenever I can’t solve something entirely with XPath. Probably most of the places where I use #collect: followed by more XPath could be done entirely in XPath if I knew how.

2. This is the first time I have tried to use NeoJSON, so do not take my code as an example of how to use it. It all works, as far as I can see. I cannot claim more than that.

3. The easiest way to generate an object (or map) in NeoJSON is to start with a Pharo dictionary, which I have done everywhere. However, this means you have no control over the order in which the attributes appear in the JSON file. This is of no importance to a computer, since by definition the attributes are unordered, but it makes it a little odd to a human reader of the JSON.

4. In your spec, the desired output has a lot of unquoted strings for attribute names, for example nbd_no. The code produces these strings with double quotes, which as far as I can see is necessary for legal JSON.

5. Note that all numerical values appear in the output as strings. No doubt they could be converted to numbers, but I was too lazy to find out how.

6. I have done this using Moose 5.1 (Pharo 4.0, build #40613), with versions of XMLHTMLParser and XPath which I downloaded quite a while ago. There are no particularly abstruse uses, so I hope you will be OK if you use more recent versions.
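
Caveats 3 to 5 are not specific to NeoJSON, so they can be illustrated with a small sketch in another language. The following uses Python's standard `json` module and is not the code from BernardoDemo.st. The record contents and the `to_number` helper are hypothetical. One difference to keep in mind: Python dictionaries keep insertion order, which the caveat says a Pharo dictionary does not, so sorting the keys is shown here only as one way to get a predictable order for human readers.

```python
import json


def to_number(text):
    """Convert a scraped string to int or float; return it unchanged if neither parse succeeds."""
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return text


# Hypothetical scraped record: every value arrives as a string (caveat 5).
scraped = {"name": "Item A", "amount": "12", "ratio": "0.5"}

# Convert the numeric-looking strings before serialising.
record = {key: to_number(value) for key, value in scraped.items()}

# sort_keys gives a stable attribute order (caveat 3); the encoder always
# writes attribute names in double quotes, as JSON requires (caveat 4).
output = json.dumps(record, sort_keys=True)

if __name__ == "__main__":
    print(output)
```

Converting at the point where the record is built keeps the serialising step unchanged, and any value that does not parse as a number simply stays a string.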

 

Hope this is helpful.

Best wishes

Peter Kenny

 

On 27 June 2016 15:17, Bernardo Ezequiel Contreras wrote:
> Doru,
> See attached file, it's a job posting from upwork.


Attachment: BernardoDemo.st (3K)