Hi,

I have finally been able to install and use Pharo-Chrome properly. The issues I reported in another thread were caused by conflicts between OSProcess and OSSubProcess on Linux. Now I'm able to launch Chrome, point it to particular addresses, and get some info from there.

I would like to continue the conversation about web scraping using Pharo-Chrome, which comes in handy now that React and other technologies are making the web more and more opaque for digital citizenship and data activism endeavors. So, as a starting example, I would like to scrape Grafoscopio's own page [1]. It makes use of Material Design Lite [2].

[1] http://mutabit.com/grafoscopio/index.en.html
[2] http://getmdl.io/

A starting example would be fine to kickstart myself (from the ones I'm reading, there is still something I don't get). Let's say I want to get all the cards in [1], as shown in the screenshot below. I know the div class of each one, and the div class where they are located. What would a minimal example of a scraper look like, to start with?

Thanks,

Offray
I would love to have a little how-to that we can turn into a document.

Stef

On Wed, Feb 14, 2018 at 5:25 PM, Offray Vladimir Luna Cárdenas <[hidden email]> wrote:
Yes. Me too. Alistair, any starting points with this example? I will take it from there and we could get visibility during the upcoming Open Data Day.

Of course, we're going to document everything and share it back (I have already proposed some improvements to the documentation via a PR on the Git repo).

Cheers,

Offray

On 14/02/18 14:13, Stephane Ducasse wrote:
Super.

Stef

On Wed, Feb 14, 2018 at 8:29 PM, Offray Vladimir Luna Cárdenas <[hidden email]> wrote:
Hi Offray,
On 14 February 2018 at 20:29, Offray Vladimir Luna Cárdenas <[hidden email]> wrote:
> Yes. Me too. Alistair, any starting points with this example? I will take
> from there and we could get visibility in the upcoming Open Data Day.

I'm not sure that I understand what you're after, but maybe the following will help.

This simply returns a collection of all the h4 headings that are in cells:

| rootNode divs cells cellTitleNodes cellTitles |
rootNode := GoogleChrome get: 'http://mutabit.com/grafoscopio/index.en.html'.
divs := rootNode findAllTags: 'div'.
cells := divs select: [ :each |
    (' ' split: (each attributeAt: 'class')) includes: 'mdl-cell' ].
cellTitleNodes := cells flatCollect: [ :each | each findAllTags: 'h4' ].
cellTitles := cellTitleNodes collect: [ :each |
    (each findAllStrings: true) first nodeValue ].
{ rootNode. divs. cells. cellTitleNodes. cellTitles }

> Of course, we're going to document everything and share back (I have already
> proposed some improvements in documentation via PR on the Git repo).

Thanks very much for improving the readme. I've merged the PR.

Cheers,
Alistair
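Building on Alistair's snippet, here is an untested sketch of how the same messages could pair each card's title with the links it contains. It reuses only the Pharo-Chrome messages already shown above (GoogleChrome get:, findAllTags:, attributeAt:, findAllStrings:, nodeValue); the assumption that each mdl-cell card holds an `a` tag with an `href` is mine, based on how MDL cards are usually built, not confirmed against the page.

```smalltalk
"Sketch only: assumes each mdl-cell contains an h4 title and a tags with
 href attributes; reuses the messages from the snippet above."
| rootNode cells |
rootNode := GoogleChrome get: 'http://mutabit.com/grafoscopio/index.en.html'.
cells := (rootNode findAllTags: 'div') select: [ :each |
    (' ' split: (each attributeAt: 'class')) includes: 'mdl-cell' ].
cells collect: [ :cell |
    | titles links |
    "First text node under each h4, as in Alistair's example"
    titles := (cell findAllTags: 'h4') collect: [ :h |
        (h findAllStrings: true) first nodeValue ].
    "href of every anchor inside the card"
    links := (cell findAllTags: 'a') collect: [ :a |
        a attributeAt: 'href' ].
    titles -> links ]
```

The result would be a collection of associations, one per card, which you could then inspect or export from a Grafoscopio notebook.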
Hi Alistair,
On 15/02/18 12:50, Alistair Grant wrote:
> This simply returns a collection of all the h4 headings that are in cells:
> [...]

Thanks. This was the starting point I was looking for.

> Thanks very much for improving the readme. I've merged the PR.

No problem. Thanks to you for starting the project.

Cheers,

Offray
Hi Offray,
On 16 February 2018 at 19:08, Offray Vladimir Luna Cárdenas <[hidden email]> wrote:
> Thanks. This was the starting point I was looking for.

Great. The other place to look is at any examples you can find using Soup (which can be loaded from the Catalog). The API for Pharo-Chrome is a subset of that provided by Soup.

> No problem. Thanks to you for starting the project.

Actually we need to thank Torsten: https://github.com/astares/Pharo-Chrome

Cheers,
Alistair
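Since the Pharo-Chrome traversal API is described as a subset of Soup's, the same messages should carry over to Soup when you only need static HTML (no JavaScript rendering). A hedged sketch, assuming Soup's class-side `fromString:` parser entry point (label this hypothetical if your Soup version names it differently):

```smalltalk
"Sketch, assuming Soup class >> #fromString: parses an HTML string;
 the traversal messages mirror the Pharo-Chrome subset shown earlier."
| doc titles |
doc := Soup fromString:
    '<div class="mdl-cell"><h4>Card one</h4></div>
     <div class="mdl-cell"><h4>Card two</h4></div>'.
titles := (doc findAllTags: 'h4') collect: [ :h |
    (h findAllStrings: true) first nodeValue ].
titles
```

The trade-off: Soup works on the fetched HTML as-is, while Pharo-Chrome drives a real browser, so pages whose cards are injected by React or other client-side code will only show their content through the latter.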