Web scraping with Pharo Chrome


Web scraping with Pharo Chrome

Offray Vladimir Luna Cárdenas-2

Hi,

I have finally been able to install and use Pharo Chrome properly. The issues I reported in another thread were caused by conflicts between OSProcess and OSSubProcess on Linux. Now I'm able to launch Chrome, point it at particular addresses and get some info from there.

I would like to continue the conversation about web scraping with Pharo Chrome, which comes in handy now that React and other technologies are making the web more and more opaque for digital citizenship and data activism endeavors. So, as a starting example, I would like to scrape Grafoscopio's own page [1]. It uses Material Design Light [2].

[1] http://mutabit.com/grafoscopio/index.en.html
[2] http://getmdl.io/

A starting example would be fine to kickstart myself (from the ones I'm reading, there is still something I don't get). Let's say I want to get all the cards in [1], as shown in the screenshot below. I know the div class of each card, and the div class of the container where they are located. What could a minimal example of a scraper look like, to start with?

Thanks,

Offray



Re: Web scraping with Pharo Chrome

Stephane Ducasse-3
I would love to have a little how-to that we can turn into a document.

Stef





Re: Web scraping with Pharo Chrome

Offray Vladimir Luna Cárdenas-2

Yes, me too. Alistair, any starting points with this example? I will take it from there, and we could get some visibility at the upcoming Open Data Day.

Of course, we're going to document everything and share it back (I have already proposed some documentation improvements via a PR on the Git repo).

Cheers,

Offray







Re: Web scraping with Pharo Chrome

Stephane Ducasse-3
super.
Stef








Re: Web scraping with Pharo Chrome

alistairgrant
Hi Offray,

On 14 February 2018 at 20:29, Offray Vladimir Luna Cárdenas
<[hidden email]> wrote:
> Yes. Me too. Alistair, any starting points with this example? I will take
> from there and we could get visibility in the upcoming Open Data Day.

I'm not sure that I understand what you're after, but maybe the
following will help.

This simply returns a collection of all the h4 headings that are in cells:


| rootNode divs cells cellTitleNodes cellTitles |

"Fetch the page through headless Chrome and take the root node of the DOM"
rootNode := GoogleChrome get: 'http://mutabit.com/grafoscopio/index.en.html'.
"Collect every div, then keep only those whose class attribute contains 'mdl-cell'"
divs := rootNode findAllTags: 'div'.
cells := divs select: [ :each |
	(' ' split: (each attributeAt: 'class')) includes: 'mdl-cell' ].
"Gather the h4 nodes inside each card and extract their text"
cellTitleNodes := cells flatCollect: [ :each | each findAllTags: 'h4' ].
cellTitles := cellTitleNodes collect: [ :each |
	(each findAllStrings: true) first nodeValue ].
{ rootNode. divs. cells. cellTitleNodes. cellTitles }
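
As a possible follow-up (a sketch, not part of the snippet above): the same select:/collect: pattern can pair each card title with the text of the paragraphs in the same cell. The assumption that the card bodies sit in p tags is a guess about the page markup, so check the HTML first.

| rootNode cells cardSummaries |
"Same fetch and card filter as in the snippet above"
rootNode := GoogleChrome get: 'http://mutabit.com/grafoscopio/index.en.html'.
cells := (rootNode findAllTags: 'div') select: [ :each |
	(' ' split: (each attributeAt: 'class')) includes: 'mdl-cell' ].
"For each card that has an h4, build an association: title -> paragraph strings.
The 'p' tag is an assumption about how the card bodies are marked up."
cardSummaries := (cells select: [ :cell | (cell findAllTags: 'h4') notEmpty ])
	collect: [ :cell |
		((cell findAllTags: 'h4') first findAllStrings: true) first nodeValue
			-> ((cell findAllTags: 'p') collect: [ :p |
					(p findAllStrings: true) first nodeValue ]) ].
cardSummaries

Inspecting cardSummaries in a playground should show one association per card.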




> Of course, we're going to document everything and share back (I have already
> proposed some improvements in documentation via PR on the Git repo).

Thanks very much for improving the readme.  I've merged the PR.

Cheers,
Alistair




Re: Web scraping with Pharo Chrome

Offray Vladimir Luna Cárdenas-2
Hi Alistair,


On 15/02/18 12:50, Alistair Grant wrote:

> This simply returns a collection of all the h4 headings that are in cells:

Thanks. This was the starting point I was looking for.

> Thanks very much for improving the readme.  I've merged the PR.

No problem. Thanks to you for starting the project.

Cheers,

Offray


Re: Web scraping with Pharo Chrome

alistairgrant
Hi Offray,

On 16 February 2018 at 19:08, Offray Vladimir Luna Cárdenas
<[hidden email]> wrote:

> Thanks. This was the starting point I was looking for.

Great.  The other place to look is at any examples you can find using
Soup (which can be loaded from the Catalog).  The API for Pharo-Chrome
is a subset of that provided by Soup.
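
For comparison, a rough Soup version of the same query might look like the sketch below. This is untested on my side: I'm assuming Soup's fromUrl: class-side message and that the shared findAllTags:/attributeAt: protocol behaves the same. Also note that Soup only parses the static HTML, so anything Chrome renders with JavaScript won't be visible to it.

"Rough Soup sketch of the same scrape; fromUrl: and the nil guard are assumptions.
Soup itself can be loaded from the Pharo Catalog."
| soup cells |
soup := Soup fromUrl: 'http://mutabit.com/grafoscopio/index.en.html'.
cells := (soup findAllTags: 'div') select: [ :each |
	(' ' split: ((each attributeAt: 'class') ifNil: [ '' ])) includes: 'mdl-cell' ].
"Answer the h4 tags found inside the cards; inspect them to pick the text
extraction message that Soup actually provides"
cells flatCollect: [ :each | each findAllTags: 'h4' ]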


> No problem. Thanks to you for starting the project.

Actually we need to thank Torsten: https://github.com/astares/Pharo-Chrome
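
For the how-to Stef mentioned, it may also be worth recording the load expression. The sketch below is a guess: the baseline name and repository subdirectory are assumptions, so check the README at the link above for the canonical instructions.

"Hypothetical Metacello load; baseline name and subdirectory are assumptions."
Metacello new
	baseline: 'Chrome';
	repository: 'github://astares/Pharo-Chrome/repository';
	load.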

Cheers,
Alistair


