[squeak-dev] Extracting data from web pages using Squeak

Louis LaBrunda
Hi All,

I'm interested in extracting data from web pages using Squeak.  Can I use
Scamper or parts of it?  Any ideas or hints on where to start will be greatly
appreciated.  Thanks.

Lou
-----------------------------------------------------------
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon
mailto:[hidden email] http://www.Keystone-Software.com



Re: [squeak-dev] Extracting data from web pages using Squeak

David Mitchell-10
I used Scamper long ago to scrape and test a site I was developing.
Was pretty straightforward. Haven't tried since 2001 or so...

On Mon, Jun 16, 2008 at 10:28 AM, Louis LaBrunda
<[hidden email]> wrote:

> Hi All,
>
> I'm interested in extracting data from web pages using Squeak.  Can I use
> Scamper or parts of it?  Any ideas or hints on where to start will be greatly
> appreciated.  Thanks.


Re: [squeak-dev] Extracting data from web pages using Squeak

cedreek
In reply to this post by Louis LaBrunda
Hi,

2008/6/16 Louis LaBrunda <[hidden email]>:
> Hi All,
>
> I'm interested in extracting data from web pages using Squeak.  Can I use
> Scamper or parts of it?  Any ideas or hints on where to start will be greatly
> appreciated.  Thanks.

I would use:
HTTPClient httpGet: 'http://url.com' to get the html stream.
Then you can parse it...

HTH

Cédrick
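
A minimal sketch of the fetch step, to be evaluated in a Workspace. It assumes that, in Squeak images of this era, HTTPClient httpGet: answers a stream on success and a String describing the problem on failure; that behaviour is an assumption to verify in your own image, and 'http://url.com' is only the placeholder URL used above.

```smalltalk
| response html |
response := HTTPClient httpGet: 'http://url.com'.
"Assumption: a String answer signals an error, anything else is a stream."
response isString
	ifTrue: [Transcript show: 'Download failed: ', response; cr]
	ifFalse: [
		html := response contents.
		Transcript show: html size printString, ' characters fetched'; cr].
```

Checking the kind of answer before sending #contents avoids treating an error message as page content.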


 




[squeak-dev] Re: Extracting data from web pages using Squeak

Louis LaBrunda
Hi Cédrick,

Thanks for the hint.

>I would use:
>HTTPClient httpGet: 'http://url.com' to get the html stream.
>Then you can parse it...

Are there parsers available to get, say, table data into some kind of collection?

Lou
-----------------------------------------------------------
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon
mailto:[hidden email] http://www.Keystone-Software.com



Re: [squeak-dev] Re: Extracting data from web pages using Squeak

cedreek



> Thanks for the hint.

No problem.

> Are there parsers available to get say table data into some kind of collection?

Not directly (as far as I know).

Have a look at XMLDOMParser addressBookXMLWithDTD, but it's for XML files...

Maybe just cut out the part of the stream you're interested in, for example from <td> up to </td>. Something like this, which takes everything between the first <table> and the first </table>:

string := (HTTPClient httpGet: 'http://url.com') contents.
a := (string indexOfSubCollection: '<table>') + '<table>' size.  "if this is the first table..."
b := (string indexOfSubCollection: '</table>') - 1.
string copyFrom: a to: b.
...

Then you work on the string to build your collection (#copyReplaceAll:with: can help). It's quite hacky though ;) ... I'm sure there are better options, but that's all I can see for now.

Cédrick
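
The cutting approach can be extended to collect every cell rather than one span. The following is only a sketch under stated assumptions: it works on a literal HTML string instead of a downloaded page, the variable names are made up for illustration, and it matches the exact text '<td>', so cells with attributes (for example <td class="x">) or upper-case tags would be missed.

```smalltalk
| html cells start stop |
html := '<table><tr><td>one</td><td>two</td></tr></table>'.
cells := OrderedCollection new.
stop := 1.
"Repeatedly look for the next <td>, starting where the last cell ended."
[start := html indexOfSubCollection: '<td>' startingAt: stop.
 start > 0] whileTrue: [
	start := start + '<td>' size.
	stop := html indexOfSubCollection: '</td>' startingAt: start.
	"If the closing tag is missing, take the rest of the string."
	stop = 0 ifTrue: [stop := html size + 1].
	cells add: (html copyFrom: start to: stop - 1)].
cells	"an OrderedCollection holding 'one' and 'two'"
```

For real pages a tokenizer or parser is the more robust choice, since plain substring matching breaks as soon as the markup varies.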








Re: [squeak-dev] Re: Extracting data from web pages using Squeak

Edgar J. De Cleene



On 6/16/08 1:34 PM, "cdrick" <[hidden email]> wrote:

> Maybe just cut the part of the stream you're interested in... when you get
> to <td> until </td>...
> [...]
> quite hacky though ;) ... I'm sure there are better options but that's all
> I can see now

Here is a crude HTML text visualizer. It works in SqueakLightII:
http://ftp.squeak.org/various_images/SqueakLight/SqueakLightII.7069.zip
First load
Network-HTML-md.4.mcz
then file in
HTMLScrollableField.st
and evaluate this in a Workspace:
HTMLScrollableField help

Edgar




Attachment: Network-HTML-md.4.mcz (79K)

Re: [squeak-dev] Re: Extracting data from web pages using Squeak

Edgar J. De Cleene



On 6/16/08 3:41 PM, "Edgar J. De Cleene" <[hidden email]> wrote:

> Here you have a crude html text visualizer.
> Works in SqueakLightII
> http://ftp.squeak.org/various_images/SqueakLight/SqueakLightII.7069.zip
> File first
> Network-HTML-md.4.mcz
> Then
> HTMLScrollableField.st
> and in Workspace do it
> HTMLScrollableField help
>
> Edgar

Here is the other file.

Attachment: HTMLScrollableField.st (4K)

Re: [squeak-dev] Re: Extracting data from web pages using Squeak

John Richards
In reply to this post by Louis LaBrunda

HtmlTokenizer helps here.  Here's a bit of code I added to String class to give you an idea of how to use it.

tagsOfType: aString
        "return all tags found in self of type aString"
       
        | endTag |
        endTag := '</' , aString , '>'.
        ^ ((HtmlTokenizer  on: self) upToEnd
                select: [ :ea | ea name = aString])
                reject: [ :ea | ea source = endTag]



Here's another example that is slightly richer (and probably could be improved but what the heck).

textOfType: aString
        "Answer a collection of triples, one per start tag of type aString: the start tag, the text that follows it (if any), and the end tag after that text (if any)"

        | stream element endTag triple answer |
        endTag := '</' , aString , '>'.
        answer := OrderedCollection new.
        stream := ReadStream on: ((HtmlTokenizer on: self) upToEnd).
        [stream atEnd] whileFalse: [
                element := stream next.
                "end tags carry the same name as start tags, so skip them here"
                (element name = aString and: [element source ~= endTag]) ifTrue: [  "start tag found"
                        triple := Array new: 3.
                        triple at: 1 put: element.
                        "check atEnd before peeking: peek answers nil at the end of the stream"
                        (stream atEnd not and: [stream peek class = HtmlText]) ifTrue: [
                                triple at: 2 put: stream next.
                                (stream atEnd not and: [stream peek source = endTag]) ifTrue: [
                                        triple at: 3 put: stream next
                                        ]
                                ].
                        answer add: triple
                        ]
                ].
        ^ answer
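
A possible way to use textOfType: for the table question, as a sketch only. It assumes the method above has been added to String, that the HtmlTokenizer classes are present in the image, and that text tokens answer their characters to #source as the tag tokens do; the sample HTML and variable names are invented for illustration.

```smalltalk
| html cellTexts |
html := '<table><tr><td>one</td><td>two</td></tr></table>'.
"Each triple is {start tag. text or nil. end tag or nil}; keep only the text."
cellTexts := (html textOfType: 'td') collect: [:triple |
	(triple at: 2) isNil
		ifTrue: ['']
		ifFalse: [(triple at: 2) source]].
cellTexts inspect
```

Empty cells yield an empty string here, so the resulting collection keeps one entry per td start tag.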



Louis LaBrunda <[hidden email]> wrote on 06/16/08 11:57 AM:

> Are there parsers available to get say table data into some kind of collection?






Re: [squeak-dev] Re: Extracting data from web pages using Squeak

Yar Hwee Boon-3
Or port BeautifulSoup :)

--
Hwee-Boon

On Tue, Jun 17, 2008 at 3:51 AM, John Richards <[hidden email]> wrote:

>
> HtmlTokenizer helps here.  Here's a bit of code I added to String class to
> give you an idea of how to use it.

[squeak-dev] Re: Extracting data from web pages using Squeak

Bernd Elkemann
In reply to this post by Louis LaBrunda
Hi!

f := 'http://www.google.com' asUrl retrieveContents.
cont := f content.
html := HtmlParser parse: cont.
html explore

This gives you the DOM tree as Squeak objects that you can then work on.

greetings, Bernd



Re: [squeak-dev] Re: Extracting data from web pages using Squeak

Giuseppe
A simple newbie question about this.

I tried it with lordzealon.com and it was a bit slow. How can I show a
progress bar while the data is being retrieved?

Bernd Elkemann wrote:

> Hi!
>
> f := 'http://www.google.com' asUrl retrieveContents.
> cont := f content.
> html := HtmlParser parse: cont.
> html explore
>
> gives you the dom-tree as squeak-objects you can then work on
>
> greetings, Bernd
>
>



Re: [squeak-dev] Re: Extracting data from web pages using Squeak

cedreek
> I tried with lordzealon.com and was a bit slowly. How can I show a progress
> bar while data is retrieved?

Maybe
String>>displayProgressAt:from:to:during:

See its senders for examples.

Cédrick
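
A short sketch of how displayProgressAt:from:to:during: is typically used, to be evaluated in a Workspace. The label, the range of 0 to 20 and the Delay that stands in for real work are all illustrative assumptions; the block argument (called bar here) is sent #value: with the current position.

```smalltalk
'Processing cells...'
	displayProgressAt: Sensor cursorPoint
	from: 0 to: 20
	during: [:bar |
		1 to: 20 do: [:i |
			bar value: i.	"advance the bar to step i"
			(Delay forMilliseconds: 100) wait]].	"placeholder for one unit of work"
```

This fits work that proceeds in countable steps, such as looping over parsed tokens, better than a single blocking download whose total size is unknown in advance.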



[squeak-dev] Re: Extracting data from web pages using Squeak

Bernd Elkemann
In reply to this post by Giuseppe
Giuseppe Luigi Punzi wrote:
> I tried with lordzealon.com and was a bit slowly.
Wow, that is an understatement. It actually takes forever on that
particular website because it pulls in a lot from unrelated sites like
googlesyndication.com. If it is really that website you want to parse,
use a normal browser to save the .html, then use Squeak to parse it. (You
can use Squeak to grab only the HTML from the site, but it's complicated;
see the posts of the other list members.)

A progress bar is probably not what you want, because what takes long
here is the transmission of the website, not the parsing. If it were the
parsing, you could tell at any moment what percentage is completed; for
the transmission, that's difficult to tell.

But if you need progress bars for something else, there is the class
ProgressMorph. You will have to tell a ProgressMorph which value
(percentage) it is supposed to show.
Greetings, Bernd
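
A rough sketch of driving a ProgressMorph by hand, modelled from memory on the class-side example shipped with Squeak images of this period. The selectors label:, done: and delete, and the convention that done: takes a fraction between 0 and 1, are assumptions to check against the ProgressMorph in your image; the label and the Delay are placeholders.

```smalltalk
| progress |
progress := ProgressMorph label: 'Parsing page'.
progress openInWorld.
"Run the work in a separate process so the UI can keep redrawing."
[1 to: 10 do: [:step |
	(Delay forMilliseconds: 200) wait.	"placeholder for one unit of real work"
	progress done: step / 10].
 progress delete] fork.
```

Unlike displayProgressAt:from:to:during:, this leaves it to the caller to decide when to update and when to remove the morph.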



Re: [squeak-dev] Re: Extracting data from web pages using Squeak

Giuseppe
I don't need to parse this site; I only used it to try the code that was posted.


Re: [squeak-dev] Re: Extracting data from web pages using Squeak

garduino
In reply to this post by Louis LaBrunda
You can also check (as an example)
http://www.squeaksource.com/SqueakPeopleStats.html.

To see it working, check:
http://squeakpeoplestats.seasidehosting.st/seaside/SqueakPeopleStats
(This app reads the HTML from the Squeak People pages.)

HTH.
gsa.

2008/6/16 Louis LaBrunda <[hidden email]>:

> Hi Cédrick,
>
> Thanks for the hint.
>
>>I would use:
>>HTTPClient httpGet: 'http://url.com' to get the html stream.
>>Then you can parse it...
>
> Are there parsers available to get say table data into some kind of collection?