Hi,
How can I retrieve a webpage with Dolphin? O:-) I tried looking for classes with 'http' in their names but couldn't find any. Thanks
Fernando,
> How can I retrieve a webpage with Dolphin? O:-) I tried looking for
> classes with 'http' in their names but couldn't find any.

There are a number of ways. The simplest is to use URLMonLibrary - as in

URLMonLibrary default
    urlDownload: 'http://www.whatever.com/'
    toFile: 'c:\whatever.html'

If you want to see the contents, rather than just download to a file, try

WebBrowserShell show openUrl: 'http://www.whatever.com/'

You can also use some of the COM classes (IStream, I think) to achieve more complex behaviour. It all depends on what you want to do.

--
Ian

Use the Reply-To address to contact me.
Mail sent to the From address is ignored.
On Wed, 8 Dec 2004 19:09:07 -0000, "Ian Bartholomew"
<[hidden email]> wrote:

>Fernando,
>
>> How can I retrieve a webpage with Dolphin? O:-) I tried looking for
>> classes with 'http' in their names but couldn't find any.
>
>There are a number of ways. The simplest is to use URLMonLibrary - as in
>
>URLMonLibrary default
>    urlDownload: 'http://www.whatever.com/'
>    toFile: 'c:\whatever.html'
>
>If you want to see the contents, rather than just download to a file, try
>WebBrowserShell show openUrl: 'http://www.whatever.com/'

Actually, I want to read the URL contents into a string, parse it and save some stuff into a database (I'm considering OmniBase). If at all possible, I'd rather have several connections in different threads.
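One possible way to get the page into a String with only the URLMonLibrary call shown above is to download to a temporary file and read it back. This is a rough sketch only; the file path is arbitrary and FileStream class>>read:text: is assumed to be available in the image:

| url file stream contents |
url := 'http://www.whatever.com/'.
file := 'c:\whatever.html'.
"Download the page to a local file..."
URLMonLibrary default urlDownload: url toFile: file.
"...then read that file back into a String."
stream := FileStream read: file text: true.
contents := [stream upToEnd] ensure: [stream close].
contents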
In reply to this post by Fernando Rodríguez
Fernando,
> How can I retrieve a webpage with Dolphin? O:-) I tried looking for
> classes with 'http' in their names but couldn't find any.

Look at LiveUpdate for the Microsoft way of doing it. Search the archives though, because it, being based on much of the same code as IE, will quit with the "work offline" setting, etc. It can also get tricked by caching, etc. Note that if you are going to untrusted sites, you are eventually going to get bitten; the code is too full of defects and too widely attacked to avoid it. However, code such as

   | url file in out |
   url := 'http://www.somewhere.com/something.pdf'.
   file := 'c:\docs\something.pdf'.
   in := (IStream onURL: url) contents readStream.
   out := FileStream write: file text: false.
   [out nextPutAll: in upToEnd] ensure: [out close].

can give you a quick start.

IMHO, you would be better off with one of the HTTP clients written in Smalltalk. You can fork work to background threads, construct your own timeouts, etc., not to mention chuckle the next time a critical update appears.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
On Wed, 08 Dec 2004 14:16:35 -0500, Bill Schwab
<[hidden email]> wrote:

>IMHO, you would be better off with one of the HTTP clients written in
>Smalltalk. You can fork work to background threads, construct your own
>timeouts, etc., not to mention chuckle the next time a critical update
>appears.

I had no idea this was available. Could you give some pointers? O:-)

BTW, what do you mean by 'http clients' - full browsers implemented in Smalltalk? O:-)

Thanks
Fernando,
>>IMHO, you would be better off with one of the HTTP clients written in
>>Smalltalk. You can fork work to background threads, construct your own
>>timeouts, etc., not to mention chuckle the next time a critical update
>>appears.
>
> I had no idea this was available. Could you give some pointers?

Try http://www.dolphinharbor.org/dh/projects/httpclient/download.html

I also have a simple HTTP client, but it is not as complete as the one (hopefully) linked above, and is currently not in a form that I can redistribute. If Steve's client goes poof, I would be willing to separate mine from the stuff I can't share and make it available.

> BTW, what do you mean by 'http clients' - full browsers implemented
> in Smalltalk?

No, they are essentially smart sockets that send requests, process replies, handle authentication, and return the data to the caller.

> Actually, I want to read the URL contents into a string, parse it and
> save some stuff into a database (I'm considering OmniBase). If at all
> possible, I'd rather have several connections in different threads.

Parsing is likely to be the hard part. Squeak has one that has worked for me, but IIRC, others have reported throwing more at it and getting poor results. "Any" Smalltalk client should do fine with reads forked to background Processes. Of course, you will need to properly synchronize the returning data, but Mutex, SharedQueue, etc. are available.

Again, think twice before putting untrusted HTML through a Microsoft parser.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
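As a rough illustration of the forking and synchronisation described above, the following sketch fetches several URLs on background Processes and collects the results through a SharedQueue. It reuses the IStream idiom from the earlier post; the URLs are placeholders, error handling and timeouts are left out, and the downloaded contents may need converting to a String depending on the image:

| urls results pages |
urls := #('http://www.somewhere.com/a.html' 'http://www.somewhere.com/b.html').
results := SharedQueue new.

"Start one background Process per URL; each pushes url -> data onto the queue."
urls do: [:each |
    [results nextPut: each -> (IStream onURL: each) contents] fork].

"Collect the results; #next blocks until another download has finished."
pages := Dictionary new.
urls size timesRepeat: [
    | assoc |
    assoc := results next.
    pages at: assoc key put: assoc value].
pages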
Bill
You wrote in message news:cp812e$1104$[hidden email]...
> ...
> Again, think twice before putting untrusted html through a Microsoft
> parser.

This might be a legitimate opinion, but based on browser usage statistics more than 80% of the people on the internet are doing this every day. My point is that one would be no worse off driving Microsoft's HTML parser from Dolphin than one would be using Internet Explorer. Most businesses do indeed use Internet Explorer, and provided it is fully patched the risk is clearly considered acceptable. Actually, you'll probably be a lot better off just driving the parser, since many of the security holes are in the way IE executes/displays the content itself. There are going to be security holes in all fully functional HTML parsers. And frankly, if one wants a fully functional parser that handles all the vagaries of the HTML one will encounter, one can forget the simple parsers implemented in Smalltalk, since they will reject a lot of it as invalid (which it is, but that's beside the point).

I would encourage everyone to make their own judgements about what they consider an acceptable level of risk. If you are prepared to use IE, or any of the myriad other Microsoft products that embed the IE engine, then you should have no qualms about using the MS HTML parser in Dolphin.

Regards

Blair
Blair,
> > Again, think twice before putting untrusted html through a Microsoft
> > parser.
>
> This might be a legitimate opinion, but based on browser usage statistics
> more than 80% of the people on the internet are doing this every day.

But 80% of people on the Net know bugger-all about computer security, or the risks they are taking. I don't think this is a very sensible argument.

> My point is that one would be no worse off driving Microsoft's HTML parser
> from Dolphin than one would be using Internet Explorer.

Is that actually true? (That's a real question, not rhetorical. I don't claim to know the answer.) The HTML component has quite a lot of stuff for controlling security zones and so on; if you "just use it" from Dolphin (or any other programming language), does it run in its highest possible security mode by default? If not, then I don't think that using the HTML component is even as safe as you describe. Of course it's not an issue at all unless you are pushing untrusted HTML through it.

> I would encourage everyone to make their own judgements about what they
> consider an acceptable level of risk.

Agreed. But it's important to remind people that there /is/ a judgement to be made. It's not an issue that you should ever just ignore, whichever way you decide to go after due consideration.

> If you are prepared to use IE, or
> any of the myriad other Microsoft products that embed the IE engine,
> then you should have no qualms about using the MS HTML parser in Dolphin.

I don't think that follows at all. E.g. the app you write in Dolphin (or whatever) may be running on a very different machine (with sensitive data or whatever) from the desktop machine you use for browsing (which -- if you use IE for browsing -- had damn well better be either very well protected or expendable).

    -- chris
In reply to this post by Schwab,Wilhelm K
On Wed, 08 Dec 2004 18:11:33 -0500, Bill Schwab
<[hidden email]> wrote:

>> Actually, I want to read the URL contents into a string, parse it and
>> save some stuff into a database (I'm considering OmniBase). If at all
>> possible, I'd rather have several connections in different threads.
>
>Parsing is likely to be the hard part. Squeak has one that has worked
>for me, but IIRC, others have reported throwing more at it and getting
>poor results. "Any" Smalltalk client should do fine with reads forked
>to background Processes. Of course, you will need to properly
>synchronize the returning data, but Mutex, SharedQueue, etc. are available.

Maybe I shouldn't have said 'parse'. Actually, I just want to extract a few tokens, so I don't think I'll need a full HTML parser. Some regexes should do it.

Thanks for the tip anyway, I'll take a look at the parser in Squeak. :-)
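For extraction of that kind, a couple of plain string searches may indeed be enough. A minimal sketch follows, written as a method, assuming the page source is already held in a String and that the value of interest sits between two fixed markers; the selector indexOfSubCollection: is common Smalltalk but should be checked against the image in use:

extractTitleFrom: html
    "Answer the text between <title> and </title> in html, or nil if either marker is absent.
     A sketch only; indexOfSubCollection: is assumed to be available."
    | startTag endTag start rest stop |
    startTag := '<title>'.
    endTag := '</title>'.
    start := html indexOfSubCollection: startTag.
    start = 0 ifTrue: [^nil].
    rest := html copyFrom: start + startTag size to: html size.
    stop := rest indexOfSubCollection: endTag.
    stop = 0 ifTrue: [^nil].
    ^rest copyFrom: 1 to: stop - 1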
"Fernando" <[hidden email]> wrote in message
news:[hidden email]...

> Thanks for the tip anyway, I'll take a look at the parser in Squeak.

Fernando

You should know that there are two HTML parsers available from Squeak. The one in the standard Squeak distribution (called HtmlParser) has a lot of weaknesses (i.e. it quite often gives parses which are obviously not what the page author intended). Avi Bryant recommended the parser available from SqueakMap (called HTMLParser - note the upper case). I have tried this, and found that it also gets many things wrong, but I have made some modifications which enable it to do quite well in my application (deconstructing web newspaper stories to extract the text content). I have ported both to Dolphin, though the port of HtmlParser is work in progress, rather than a complete job, because I switched to HTMLParser part way through. If you are interested, I could e-mail you the packages and you could play with them.

As a warning, any programmatic analysis of web pages depends on the page structure remaining stable. In the web newspaper field, there were regular redesigns (I learned to dread the words 'New Improved Page Design!') which meant re-programming. Unless the data you are looking for have a very clear and stable structure, this can be a very frustrating field.

Of course, if the analysis you want to do is not complex enough to need a full HTML parse, you could always try to develop a special-purpose parser for your task using SmaCC. I've tried it, it's fun to play with and the results are very efficient parsers.

> :-)

Your halo comes and goes - not feeling angelic today?

Best wishes

Peter Kenny
In reply to this post by Blair McGlashan-3
Blair,
> I would encourage everyone to make their own judgements about what they
> consider an acceptable level of risk. If you are prepared to use IE, or any
> of the myriad other Microsoft products that embed the IE engine, then you
> should have no qualms about using the MS HTML parser in Dolphin.

I do not agree, because the decision must depend on the intended usage. It sounds to me as though Fernando is working on a crawler of sorts, and using a Microsoft parser for that _will_ get him clobbered eventually; a computer can do a lot more browsing than you or I ever would, and will not be discriminating about the sites it hits.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
In reply to this post by Peter Kenny-2
Peter,
When you are ready, I will gladly provide web space for your port(s), identified as your work, of course.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
In reply to this post by Peter Kenny-2
On Thu, 9 Dec 2004 16:15:22 -0000, "Peter Kenny"
<[hidden email]> wrote:

>(deconstructing web newspaper stories to extract the text content). I have
>ported both to Dolphin, though the port of HtmlParser is work in progress,
>rather than a complete job, because I switched to HTMLParser part way
>through. If you are interested, I could e-mail you the packages and you
>could play with them.

Thanks, it would be really nice if you could send me the HTMLParser. O:-)

>As a warning, any programmatic analysis of web pages depends on the page
>structure remaining stable. In the web newspaper field, there were regular
>redesigns (I learned to dread the words 'New Improved Page Design!') which
>meant re-programming. Unless the data you are looking for have a very clear
>and stable structure, this can be a very frustrating field.

I think (hope) it will.

>Of course, if the analysis you want to do is not complex enough to need a
>full HTML parse, you could always try to develop a special-purpose parser
>for your task using SmaCC. I've tried it, it's fun to play with and the
>results are very efficient parsers.

I'm amazed at the amount of tools provided by the Smalltalk community. A few months ago, after my first 'aha!' moment with Smalltalk, I posted (http://urlmini.us/?i=517):

"(...) PD If you guys don't have to wait for half an hour, while the whole project recompiles, every time you want to see the effect of some minor modification, what do you do with all this extra free time? };-)"

Now I know what you've been doing. ;-)

>> :-)
>
>Your halo comes and goes - not feeling angelic today?

After a whole afternoon fighting against a horde of makefiles? Nah, not really. ;-)
"Fernando" <[hidden email]> wrote in message
news:[hidden email]...

Fernando

> Thanks, it would be really nice if you could send me the HTMLParser.
> O:-)

The zip file with the two ports is on its way (assuming the e-mail address above is correct). It's easier to package them together, because they have prerequisites in common; I hope the .txt file in the zip makes it clear. Any comments or suggestions gratefully received.

> "(...) PD If you guys don't have to wait for half an hour, while the
> whole project recompiles, every time you want to see the effect of
> some minor modification, what do you do with all this extra free time?
> };-)"
>
> Now I know what you've been doing. ;-)

In my case I'm semi-retired, and I do a lot of this for fun. How the people with full-time day jobs do this as well I don't know.

Peter
In reply to this post by Schwab,Wilhelm K
"Bill Schwab" <[hidden email]> wrote in message
news:cpa65g$15o8$[hidden email]...

> Peter,
>
> When you are ready, I will gladly provide web space for your port(s),
> identified as your work, of course.

Bill

It's a vicious circle - I would like to get feedback to confirm that the package is OK before making it widely available, but without making it available I can't get feedback. The recent discussions between you and Chris Demers make me realise that it's OK to circulate experimental stuff (with appropriate warnings), so I'm sending the zip file direct to you, together with another bit of stuff that I like.

Peter
Peter,
> It's a vicious circle - I would like to get feedback to confirm that the
> package is OK before making it widely available, but without making it
> available I can't get feedback. The recent discussions between you and Chris
> Demers make me realise that it's OK to circulate experimental stuff (with
> appropriate warnings),

Absolutely.

> so I'm sending the zip file direct to you, together
> with another bit of stuff that I like.

I see you got past the zip file police - well done :) Have a look at the Smalltalk page. Thanks!!!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]