Hi All,
I'm interested in extracting data from web pages using Squeak. Can I use Scamper or parts of it? Any ideas or hints on where to start will be greatly appreciated. Thanks.

Lou
-----------------------------------------------------------
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon
mailto:[hidden email] http://www.Keystone-Software.com
I used Scamper long ago to scrape and test a site I was developing.
Was pretty straightforward. Haven't tried since 2001 or so...

On Mon, Jun 16, 2008 at 10:28 AM, Louis LaBrunda <[hidden email]> wrote:
> I'm interested in extracting data from web pages using Squeak. Can I use
> Scamper or parts of it?
In reply to this post by Louis LaBrunda
Hi,
2008/6/16 Louis LaBrunda <[hidden email]>:
> Hi All,

I would use:

    HTTPClient httpGet: 'http://url.com'

to get the HTML stream. Then you can parse it...

HTH
Cédrick
Hi Cédrick,
Thanks for the hint.

>I would use:
>HTTPClient httpGet: 'http://url.com' to get the html stream.
>Then you can parse it...

Are there parsers available to get, say, table data into some kind of collection?

Lou
No problem.

Not directly (from what I know). Have a look at XMLDOMParser addressBookXMLWithDTD, but it's for XML files...

Maybe just cut out the part of the stream you're interested in, from where you get to <td> until </td>. Something like:

    string := (HTTPClient httpGet: 'http://url.com') contents.
    a := (string indexOfSubCollection: '<table>') + '<table>' size.  "if this is the first table..."
    b := (string indexOfSubCollection: '</table>') - 1.
    string copyFrom: a to: b.

...then you work on the string to build your collection (#copyReplaceAll:with: can help)... quite hacky though ;)

I'm sure there are better options, but that's all I can see now.

Cédrick
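
In the same spirit as the string-cutting idea above, here is a minimal workspace sketch that collects the text of every <td> cell rather than a single slice. The sample HTML string and the variable names are made up for illustration; in practice the string would come from the HTTPClient call. It assumes lowercase tags without attributes, so it is no more robust than the hack it extends.

```smalltalk
"Workspace sketch: collect the text between each <td> and </td>.
 The sample HTML below is invented; real pages may use attributes
 or uppercase tags, which this simple scan does not handle."
| html cells start stop openTag closeTag |
html := '<table><tr><td>Squeak</td><td>3.10</td></tr></table>'.
openTag := '<td>'.
closeTag := '</td>'.
cells := OrderedCollection new.
start := 1.
"indexOfSubCollection:startingAt: answers 0 when nothing more is found."
[(start := html indexOfSubCollection: openTag startingAt: start) > 0] whileTrue: [
	start := start + openTag size.
	stop := html indexOfSubCollection: closeTag startingAt: start.
	"An unclosed cell runs to the end of the string."
	stop = 0 ifTrue: [stop := html size + 1].
	cells add: (html copyFrom: start to: stop - 1).
	start := stop].
cells  "inspect or print it: the two cell strings 'Squeak' and '3.10'"
```

Each pass moves start past the cell just read, so repeated cells in the same row or table are picked up in document order.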
On 6/16/08 1:34 PM, "cdrick" <[hidden email]> wrote:
> I'm sure there are better options but that's all I can see now

Here you have a crude HTML text visualizer. It works in SqueakLightII:
http://ftp.squeak.org/various_images/SqueakLight/SqueakLightII.7069.zip

First file in Network-HTML-md.4.mcz, then HTMLScrollableField.st, and in a Workspace do it:

    HTMLScrollableField help

Edgar

Attachment: Network-HTML-md.4.mcz (79K)
On 6/16/08 3:41 PM, "Edgar J. De Cleene" <[hidden email]> wrote:
> Here you have a crude html text visualizer.

Here is the other file.

Attachment: HTMLScrollableField.st (4K)
In reply to this post by Louis LaBrunda
HtmlTokenizer helps here. Here's a bit of code I added to String class to give you an idea of how to use it.

    tagsOfType: aString
        "return all tags found in self of type aString"

        | endTag |
        endTag := '</' , aString , '>'.
        ^ ((HtmlTokenizer on: self) upToEnd
            select: [ :ea | ea name = aString])
            reject: [ :ea | ea source = endTag]

Here's another example that is slightly richer (and probably could be improved but what the heck).

    textOfType: aString
        "return a collection of triples of all tags found in self of type
        aString with start tag, intermediate text if any, and end tag if any"

        | stream element endTag triple answer |
        endTag := '</' , aString , '>'.
        answer := OrderedCollection new.
        stream := ReadStream on: ((HtmlTokenizer on: self) upToEnd).
        [stream atEnd] whileFalse: [
            (element := stream next) name = aString ifTrue: [ "start tag found"
                triple := Array new: 3.
                triple at: 1 put: element.
                stream peek class = HtmlText ifTrue: [
                    triple at: 2 put: stream next.
                    stream peek source = endTag ifTrue: [
                        triple at: 3 put: stream next
                    ]
                ].
                answer add: triple
            ]
        ].
        ^ answer
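
One possible way to call the second method from a workspace is sketched below. It assumes both methods above have been compiled into String and that the package providing HtmlTokenizer is loaded. The sample HTML string and the variable names are invented, and sending #source to the text token is an assumption based on how the tags are queried in the code above.

```smalltalk
"Hypothetical usage of textOfType: (sample HTML invented for illustration)."
| html triples texts |
html := '<table><tr><td>one</td><td>two</td></tr></table>'.
triples := html textOfType: 'td'.
"Each triple holds: start tag, text token or nil, end tag or nil.
 Keep only the text part, using an empty string for empty cells."
texts := triples collect: [:triple |
	(triple at: 2) isNil
		ifTrue: ['']
		ifFalse: [(triple at: 2) source]].
texts
```

Collecting into plain strings keeps the tokenizer objects out of the rest of the program, which makes the result easy to store in a table-like collection.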
Or port BeautifulSoup :)
-- Hwee-Boon

On Tue, Jun 17, 2008 at 3:51 AM, John Richards <[hidden email]> wrote:
> HtmlTokenizer helps here. Here's a bit of code I added to String class to
> give you an idea of how to use it.
In reply to this post by Louis LaBrunda
Hi!
    f := 'http://www.google.com' asUrl retrieveContents.
    cont := f content.
    html := HtmlParser parse: cont.
    html explore

gives you the DOM tree as Squeak objects that you can then work on.

Greetings,
Bernd
A simple noob question about this.
I tried it with lordzealon.com and it was a bit slow. How can I show a progress bar while the data is retrieved?

Bernd Elkemann wrote:
> gives you the dom-tree as squeak-objects you can then work on
> I tried with lordzealon.com and was a bit slowly. How can I show a progress
> bar while data is retrieved?

Maybe String>>displayProgressAt:from:to:during: (see its senders)...

Cédrick
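
For reference, a minimal workspace sketch of how that message is commonly sent. The label, the range, and the Delay that stands in for real work are all made up for illustration. It only helps when the work can be split into steps that report their position; a single blocking download call offers no such hook.

```smalltalk
"Workspace sketch: show a progress bar while stepwise work runs.
 The Delay simulates slow work; replace the loop body with real steps."
'Working...'
	displayProgressAt: Sensor cursorPoint
	from: 0
	to: 10
	during: [:bar |
		1 to: 10 do: [:step |
			(Delay forMilliseconds: 100) wait.
			"Tell the bar how far along we are (a value between from: and to:)."
			bar value: step]]
```

The block argument is the only link to the bar, so whatever does the work has to send it #value: at each step.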
In reply to this post by Giuseppe
Giuseppe Luigi Punzi wrote:
> I tried with lordzealon.com and was a bit slowly.

Wow, that is an understatement. It actually takes forever on that particular website because it transmits a lot from unrelated sites like googlesyndication.com. If it is really that website you want to parse, use a normal browser to save the .html, then use Squeak to parse it. (You can use Squeak to grab only the HTML from the site, but it's complicated; see the posts of the other newsgroup members.)

A progress bar is probably not what you want, because what takes long here is the transmission of the website, not the parsing. If it were the parsing, you could tell at any moment what percentage is completed; in the case of the transmission, that's difficult to tell.

But if you need progress bars for something else, there is the class ProgressMorph. You will have to tell a ProgressMorph which value (percentage) it is supposed to show.

Greetings,
Bernd
I don't need to parse this site. It was only to try the posted code.

Bernd Elkemann wrote:
> If it is really that website you want to
> parse: use a normal browser to save the .html, then use squeak to
> parse it
In reply to this post by Louis LaBrunda
You can also check (as an example) http://www.squeaksource.com/SqueakPeopleStats.html.
To see it working, check: http://squeakpeoplestats.seasidehosting.st/seaside/SqueakPeopleStats
(This app reads the HTML from the Squeak People pages.)

HTH.
gsa.

2008/6/16 Louis LaBrunda <[hidden email]>:
> Are there parsers available to get say table data into some kind of collection?