Web scraping with Pharo Chrome


Web scraping with Pharo Chrome

Offray Vladimir Luna Cárdenas-2

Hi,

I have finally been able to install and use Pharo Chrome properly. The issues I reported in another thread were caused by conflicts between OSProcess and OSSubProcess on Linux. Now I'm able to launch Chrome, point it at particular addresses and get some info from there.

I would like to continue the conversation about web scraping with Pharo Chrome, which comes in handy now that React and other technologies are making the web more and more opaque for digital citizenship and data activism endeavors. So, as a starting example, I would like to scrape Grafoscopio's own page [1]. It uses Material Design Light [2].

[1] http://mutabit.com/grafoscopio/index.en.html
[2] http://getmdl.io/

A starting example would be fine to kickstart myself (from the ones I'm reading, there is still something I don't get). Let's say I want to get all the cards in [1], as shown in the screenshot below. I know the div class of each card, and the div class of the container where they are located. What could a minimal example of a scraper look like, to start with?

Thanks,

Offray



Re: Web scraping with Pharo Chrome

Stephane Ducasse-3
I would love to have a little how-to that we can turn into a document.

Stef





Re: Web scraping with Pharo Chrome

Offray Vladimir Luna Cárdenas-2

Yes, me too. Alistair, any starting points with this example? I will take it from there, and we could get some visibility at the upcoming Open Data Day.

Of course, we're going to document everything and share it back (I have already proposed some documentation improvements via a PR on the Git repo).

Cheers,

Offray







Re: Web scraping with Pharo Chrome

Stephane Ducasse-3
super.
Stef








Re: Web scraping with Pharo Chrome

alistairgrant
Hi Offray,

On 14 February 2018 at 20:29, Offray Vladimir Luna Cárdenas
<[hidden email]> wrote:
> Yes. Me too. Alistair, any starting points with this example? I will take
> from there and we could get visibility in the upcoming Open Data Day.

I'm not sure that I understand what you're after, but maybe the
following will help.

This simply returns a collection of all the h4 headings that are in cells:


| rootNode divs cells cellTitleNodes cellTitles |

"Fetch the page through headless Chrome and take the root node of the DOM"
rootNode := GoogleChrome get: 'http://mutabit.com/grafoscopio/index.en.html'.
"Collect every div, then keep only those whose class attribute contains 'mdl-cell'"
divs := rootNode findAllTags: 'div'.
cells := divs select: [ :each |
	(' ' split: (each attributeAt: 'class')) includes: 'mdl-cell' ].
"Gather the h4 nodes inside each card and extract their text"
cellTitleNodes := cells flatCollect: [ :each | each findAllTags: 'h4' ].
cellTitles := cellTitleNodes collect: [ :each |
	(each findAllStrings: true) first nodeValue ].
{ rootNode. divs. cells. cellTitleNodes. cellTitles }
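
As a possible follow-up (a sketch, not part of the snippet above): the same select:/collect: pattern can pair each card title with the text of the paragraphs in the same cell. The assumption that the card bodies sit in p tags is a guess about the page markup, so check the HTML first.

| rootNode cells cardSummaries |
"Same fetch and card filter as in the snippet above"
rootNode := GoogleChrome get: 'http://mutabit.com/grafoscopio/index.en.html'.
cells := (rootNode findAllTags: 'div') select: [ :each |
	(' ' split: (each attributeAt: 'class')) includes: 'mdl-cell' ].
"For each card that has an h4, build an association: title -> paragraph strings.
The 'p' tag is an assumption about how the card bodies are marked up."
cardSummaries := (cells select: [ :cell | (cell findAllTags: 'h4') notEmpty ])
	collect: [ :cell |
		((cell findAllTags: 'h4') first findAllStrings: true) first nodeValue
			-> ((cell findAllTags: 'p') collect: [ :p |
					(p findAllStrings: true) first nodeValue ]) ].
cardSummaries

Inspecting cardSummaries in a playground should show one association per card.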




> Of course, we're going to document everything and share back (I have already
> proposed some improvements in documentation via PR on the Git repo).

Thanks very much for improving the readme.  I've merged the PR.

Cheers,
Alistair




Re: Web scraping with Pharo Chrome

Offray Vladimir Luna Cárdenas-2
Hi Alistair,


On 15/02/18 12:50, Alistair Grant wrote:

> This simply returns a collection of all the h4 headings that are in cells:

Thanks. This was the starting point I was looking for.

> Thanks very much for improving the readme.  I've merged the PR.

No problem. Thanks to you for starting the project.

Cheers,

Offray


Re: Web scraping with Pharo Chrome

alistairgrant
Hi Offray,

On 16 February 2018 at 19:08, Offray Vladimir Luna Cárdenas
<[hidden email]> wrote:

> Thanks. This was the starting point I was looking for.

Great.  The other place to look is at any examples you can find using
Soup (which can be loaded from the Catalog).  The API for Pharo-Chrome
is a subset of that provided by Soup.
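
For comparison, a rough Soup version of the same query might look like the sketch below. This is untested on my side: I'm assuming Soup's fromUrl: class-side message and that the shared findAllTags:/attributeAt: protocol behaves the same. Also note that Soup only parses the static HTML, so anything Chrome renders with JavaScript won't be visible to it.

"Rough Soup sketch of the same scrape; fromUrl: and the nil guard are assumptions.
Soup itself can be loaded from the Pharo Catalog."
| soup cells |
soup := Soup fromUrl: 'http://mutabit.com/grafoscopio/index.en.html'.
cells := (soup findAllTags: 'div') select: [ :each |
	(' ' split: ((each attributeAt: 'class') ifNil: [ '' ])) includes: 'mdl-cell' ].
"Answer the h4 tags found inside the cards; inspect them to pick the text
extraction message that Soup actually provides"
cells flatCollect: [ :each | each findAllTags: 'h4' ]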


> No problem. Thanks to you for starting the project.

Actually we need to thank Torsten: https://github.com/astares/Pharo-Chrome
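
For the how-to Stef mentioned, it may also be worth recording the load expression. The sketch below is a guess: the baseline name and repository subdirectory are assumptions, so check the README at the link above for the canonical instructions.

"Hypothetical Metacello load; baseline name and subdirectory are assumptions."
Metacello new
	baseline: 'Chrome';
	repository: 'github://astares/Pharo-Chrome/repository';
	load.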

Cheers,
Alistair


