Retrieving web pages

Retrieving web pages

Fernando Rodríguez
Hi,

How can I retrieve a webpage with Dolphin? O:-)  I tried looking for
classes with http in their names but couldn't find any.

Thanks



Re: Retrieving web pages

Ian Bartholomew-19
Fernando,

> How can I retrieve a webpage with Dolphin? O:-)  I tried looking for
> classes with http in their names but couldn't find any.

There are a number of ways.  The simplest is to use URLMonLibrary, as in:

URLMonLibrary default
    urlDownload: 'http://www.whatever.com/'
    toFile: 'c:\whatever.html'

If you want to see the contents, rather than just download to a file, try
WebBrowserShell show openUrl: 'http://www.whatever.com/'
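
If instead you want the page text in a String, one way is to combine the
download above with a file read-back.  A minimal sketch (the URL and file
path are placeholders, and FileStream class>>read:text: is assumed here as
the counterpart of the write:text: variant):

| url file stream html |
url := 'http://www.whatever.com/'.
file := 'c:\whatever.html'.
"Download the page to a local file first"
URLMonLibrary default urlDownload: url toFile: file.
"Then read the whole file back into a String, closing the stream afterwards"
stream := FileStream read: file text: true.
[html := stream upToEnd] ensure: [stream close].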

You can also use some of the COM classes (IStream, I think) to achieve more
complex behaviour.

It all depends on what you want to do.....

--
Ian

Use the Reply-To address to contact me.
Mail sent to the From address is ignored.



Re: Retrieving web pages

Fernando Rodríguez
On Wed, 8 Dec 2004 19:09:07 -0000, "Ian Bartholomew"
<[hidden email]> wrote:

>Fernando,
>
>> How can I retrieve a webpage with Dolphin? O:-)  I tried looking for
>> classes with http in their names but couldn't find any.
>
>There are a number of ways.  The simplest is to use URLMonLibrary, as in:
>
>URLMonLibrary default
>    urlDownload: 'http://www.whatever.com/'
>    toFile: 'c:\whatever.html'
>
>If you want to see the contents, rather than just download to a file, try
>WebBrowserShell show openUrl: 'http://www.whatever.com/'

Actually, I want to read the url contents into a string, parse it and
save some stuff into a database (I'm considering Omnibase). If at all
possible, I'd rather have several connections in different threads.



Re: Retrieving web pages

Schwab,Wilhelm K
In reply to this post by Fernando Rodríguez
Fernando,

> How can I retrieve a webpage with Dolphin? O:-)  I tried looking for
> classes with http in their names but couldn't find any.

Look at LiveUpdate for the Microsoft way of doing it.  Search the
archives first though: because it is based on much of the same code as IE,
it will quit when the "work offline" setting is on, it can be tricked by
caching, and so on.  Note that if you are going to untrusted sites, you are
eventually going to get bitten; the code is too full of defects and too
widely attacked to avoid it.

However, code such as

| url file in out |
url := 'http://www.somewhere.com/something.pdf'.
file := 'c:\docs\something.pdf'.
"Open a read stream over the downloaded bytes and a binary file stream for output"
in := (IStream onURL: url) contents readStream.
out := FileStream write: file text: false.
"Copy everything across, closing the file even if the copy fails"
[out nextPutAll: in upToEnd] ensure: [out close].

can give you a quick start.
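
Since you want the contents in a String rather than a file, a variant of the
same idea follows.  It is only a sketch: the URL is a placeholder, and the
asString conversion is an assumption about what the downloaded contents answer.

| url html |
url := 'http://www.somewhere.com/index.html'.
"Read the whole resource into memory; asString is assumed to convert the bytes"
html := (IStream onURL: url) contents asString.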

IMHO, you would be better off with one of the HTTP clients written in
Smalltalk.  You can fork work to background threads, construct your own
timeouts, etc., not to mention chuckle the next time a critical update
appears.

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: Retrieving web pages

Fernando Rodríguez
On Wed, 08 Dec 2004 14:16:35 -0500, Bill Schwab
<[hidden email]> wrote:


>IMHO, you would be better off with one of the HTTP clients written in
>Smalltalk.  You can fork work to background threads, construct your own
>timeouts, etc., not to mention chuckle the next time a critical update
>appears.

I had no idea this was available. Could you give some pointers? O:-)

BTW, what do you mean by 'HTTP clients' - full browsers implemented
in Smalltalk? O:-)

Thanks



Re: Retrieving web pages

Schwab,Wilhelm K
Fernando,

>>IMHO, you would be better off with one of the HTTP clients written in
>>Smalltalk.  You can fork work to background threads, construct your own
>>timeouts, etc., not to mention chuckle the next time a critical update
>>appears.
>
> I had no idea this was available. Could you give some pointers?

Try

   http://www.dolphinharbor.org/dh/projects/httpclient/download.html

I also have a simple http client, but it is not as complete as the one
(hopefully) linked above, and is currently not in a form that I can
redistribute.  If Steve's client goes poof, I would be willing to
separate mine from the stuff I can't share and make it available.


> BTW, what do you mean by 'HTTP clients' - full browsers implemented
> in Smalltalk?

No, they are essentially smart sockets that send requests and process
replies, handle authentication, and return the data to the caller.


 > Actually, I want to read the url contents into a string, parse it and
 > save some stuff into a database (I'm considering Omnibase). If at all
 > possible, I'd rather have several connections in different threads.

Parsing is likely to be the hard part.  Squeak has one that has worked
for me, but IIRC others have reported throwing more at it and getting
poor results.  "Any" Smalltalk client should do fine with reads forked
to background Processes.  Of course, you will need to properly
synchronize the returning data, but Mutex, SharedQueue, etc. are available.
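
To make that concrete, here is a rough sketch of one shape it could take (the
URLs are placeholders, and the fetch just reuses the IStream onURL: call from
my earlier post): each fetch runs in its own forked Process and posts its
result to a SharedQueue that the caller drains.

| urls queue |
urls := #('http://www.abc.com/' 'http://www.xyz.com/').
queue := SharedQueue new.
"Fork one background Process per URL; each posts url -> contents to the queue"
urls do: [:each |
    [queue nextPut: each -> (IStream onURL: each) contents] fork].
"The consumer blocks on #next until each forked fetch has delivered"
urls size timesRepeat: [
    | assoc |
    assoc := queue next.
    Transcript show: assoc key; show: ' : '; show: assoc value size printString; show: ' bytes'; cr].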

Again, think twice before putting untrusted html through a Microsoft parser.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: Retrieving web pages

Blair McGlashan-3
Bill

You wrote in message news:cp812e$1104$[hidden email]...
> ...
> Again, think twice before putting untrusted html through a Microsoft
> parser.

This might be a legitimate opinion, but based on browser usage statistics
more than 80% of the people on the internet are doing this every day.

My point is that one would be no worse off driving Microsoft's HTML parser
from Dolphin than one would be using Internet Explorer. Most businesses do
indeed use Internet Explorer, and provided it is fully patched the risk is
clearly considered acceptable. Actually you'll probably be a lot better off
just driving the parser, since many of the security holes are in the way IE
executes/displays the content itself.

There are going to be security holes in all fully functional HTML parsers.
And frankly, if one wants a fully functional parser that handles all the
vagaries of the HTML one will encounter, one can forget the simple parsers
implemented in Smalltalk, since they will reject a lot of it as invalid
(which it is, but that's beside the point).

I would encourage everyone to make their own judgements about what they
consider an acceptable level of risk. If you are prepared to use IE, or any
of the myriad other Microsoft products that embed the IE engine, then you
should have no qualms about using the MS HTML parser in Dolphin.

Regards

Blair



Re: Retrieving web pages

Chris Uppal-3
Blair,

> > Again, think twice before putting untrusted html through a Microsoft
> > parser.
>
> This might be a legitimate opinion, but based on browser usage statistics
> more than 80% of the people on the internet are doing this every day.

But 80% of people on the Net know bugger-all about computer security, or the
risks they are taking.  I don't think this is a very sensible argument.


> My point is that one would be no worse off driving Microsoft's HTML parser
> from Dolphin than one would be using Internet Explorer.

Is that actually true?  (That's a real question, not rhetorical; I don't claim
to know the answer.)  The HTML component has quite a lot of stuff for
controlling security zones and so on; if you "just use it" from Dolphin (or any
other programming language), does it run in its highest possible security mode
by default?  If not, then I don't think that using the HTML component is even
as safe as you describe.

Of course it's not an issue at all unless you are pushing untrusted HTML
through it.


> I would encourage everyone to make their own judgements about what they
> consider an acceptable level of risk.

Agreed.  But it's important to remind people that there /is/ a judgement to be
made.  It's not an issue that you should ever just ignore, whichever way you
decide to go after due consideration.


> If you are prepared to use IE, or
> any of the myriad other Microsoft products that embed the IE engine,
> then you should have no qualms about using the MS HTML parser in Dolphin.

I don't think that follows at all.  E.g. the app you write in Dolphin (or
whatever) may be running on a very different machine (with sensitive data or
whatever) from the desktop machine you use for browsing (which -- if you use IE
for browsing -- had damn well better be either very well protected or
expendable).

    -- chris



Re: Retrieving web pages

Fernando Rodríguez
In reply to this post by Schwab,Wilhelm K
On Wed, 08 Dec 2004 18:11:33 -0500, Bill Schwab
<[hidden email]> wrote:


> > Actually, I want to read the url contents into a string, parse it and
> > save some stuff into a database (I'm considering Omnibase). If at all
> > possible, I'd rather have several connections in different threads.
>
>Parsing is likely to be the hard part.  Squeak has one that has worked
>for me, but IIRC others have reported throwing more at it and getting
>poor results.  "Any" Smalltalk client should do fine with reads forked
>to background Processes.  Of course, you will need to properly
>synchronize the returning data, but Mutex, SharedQueue, etc. are available.

Maybe I shouldn't have said 'parse'. Actually I just want to extract a
few tokens, so I don't think I'll need a full HTML parser. Some regexes
should do it.
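
Something as simple as plain substring searching would probably cover it, too.
A rough sketch (the markup and the <title> tag are made up for illustration;
indexOfSubCollection: is assumed from the standard collection protocol):

| html startTag start stop title |
html := '<html><head><title>Example Page</title></head></html>'.
startTag := '<title>'.
"Find the text between <title> and </title>"
start := (html indexOfSubCollection: startTag) + startTag size.
stop := (html indexOfSubCollection: '</title>') - 1.
title := html copyFrom: start to: stop.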

Thanks for the tip anyway, I'll take a look at the parser in Squeak.
:-)



Re: Retrieving web pages

Peter Kenny-2
"Fernando" <[hidden email]> wrote in message
news:[hidden email]...
>
> Thanks for the tip anyway, I'll take a look at the parser in Squeak.

Fernando

You should know that there are two HTML parsers available from Squeak. The
one in the standard Squeak distribution (called HtmlParser) has a lot of
weaknesses (i.e. it quite often gives parses which are obviously not what
the page author intended). Avi Bryant recommended the parser available from
SqueakMap (called HTMLParser - note the upper case). I have tried this, and
found that it also gets many things wrong, but I have made some
modifications which enable it to do quite well in my application
(deconstructing web newspaper stories to extract the text content). I have
ported both to Dolphin, though the port of HtmlParser is work in progress,
rather than a complete job, because I switched to HTMLParser part way
through. If you are interested, I could e-mail you the packages and you
could play with them.

As a warning, any programmatic analysis of web pages depends on the page
structure remaining stable. In the web newspaper field, there were regular
redesigns (I learned to dread the words 'New Improved Page Design!') which
meant re-programming. Unless the data you are looking for have a very clear
and stable structure, this can be a very frustrating field.

Of course, if the analysis you want to do is not complex enough to need a
full HTML parse, you could always try to develop a special-purpose parser
for your task using SmaCC. I've tried it, it's fun to play with and the
results are very efficient parsers.

> :-)

Your halo comes and goes - not feeling angelic today?

Best wishes

Peter Kenny



Re: Retrieving web pages

Schwab,Wilhelm K
In reply to this post by Blair McGlashan-3
Blair,

> I would encourage everyone to make their own judgements about what they
> consider an acceptable level of risk. If you are prepared to use IE, or any
> of the myriad other Microsoft products that embed the IE engine, then you
> should have no qualms about using the MS HTML parser in Dolphin.

I do not agree, because the decision must depend on the intended usage.
It sounds to me as though Fernando is working on a crawler of sorts,
and using a Microsoft parser for that _will_ get him clobbered
eventually; a computer can do a lot more browsing than you or I ever
would, and will not be discriminating about the sites it hits.

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: Retrieving web pages

Schwab,Wilhelm K
In reply to this post by Peter Kenny-2
Peter,

When you are ready, I will gladly provide web space for your port(s),
identified as your work, of course.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: Retrieving web pages

Fernando Rodríguez
In reply to this post by Peter Kenny-2
On Thu, 9 Dec 2004 16:15:22 -0000, "Peter Kenny"
<[hidden email]> wrote:


>(deconstructing web newspaper stories to extract the text content). I have
>ported both to Dolphin, though the port of HtmlParser is work in progress,
>rather than a complete job, because I switched to HTMLParser part way
>through. If you are interested, I could e-mail you the packages and you
>could play with them.

Thanks, it would be really nice if you could send me the HTMLParser.
O:-)

>As a warning, any programmatic analysis of web pages depends on the page
>structure remaining stable. In the web newspaper field, there were regular
>redesigns (I learned to dread the words 'New Improved Page Design!') which
>meant re-programming. Unless the data you are looking for have a very clear
>and stable structure, this can be a very frustrating field.

I think (hope) it will.

>Of course, if the analysis you want to do is not complex enough to need a
>full HTML parse, you could always try to develop a special-purpose parser
>for your task using SmaCC. I've tried it, it's fun to play with and the
>results are very efficient parsers.

I'm amazed by the number of tools provided by the Smalltalk
community.  A few months ago, after my first 'aha!' moment with
Smalltalk, I posted (http://urlmini.us/?i=517):

"(...) PD If you guys don't have to wait for half an hour, while the
whole project recompiles, every time you want to see the effect of
some minor modification, what do you do with all this extra free time?
};-)"

Now I know what you've been doing. ;-)

>> :-)
>
>Your halo comes and goes - not feeling angelic today?

After a whole afternoon fighting against a horde of makefiles? Nah,
not really. ;-)



Re: Retrieving web pages

Peter Kenny-2
"Fernando" <[hidden email]> wrote in message
news:[hidden email]...

Fernando

> Thanks, it would be really nice if you could send me the HTMLParser.
> O:-)

The zip file with the two ports is on its way (assuming the e-mail address
above is correct). It's easier to package them together, because they have
prerequisites in common; I hope the .txt file in the zip makes it clear. Any
comments or suggestions gratefully received.

> "(...) PD If you guys don't have to wait for half an hour, while the
> whole project recompiles, every time you want to see the effect of
> some minor modification, what do you do with all this extra free time?
> };-)"
>
> Now I know what you've been doing. ;-)
>
In my case I'm semi-retired, and I do a lot of this for fun. How the people
with full-time day jobs do this as well I don't know.

Peter



Re: Retrieving web pages

Peter Kenny-2
In reply to this post by Schwab,Wilhelm K
"Bill Schwab" <[hidden email]> wrote in message
news:cpa65g$15o8$[hidden email]...
> Peter,
>
> When you are ready, I will gladly provide web space for your port(s),
> identified as your work, of course.

Bill

It's a vicious circle - I would like to get feedback to confirm that the
package is OK before making it widely available, but without making it
available I can't get feedback. The recent discussions between you and Chris
Demers make me realise that it's OK to circulate experimental stuff (with
appropriate warnings), so I'm sending the zip file direct to you, together
with another bit of stuff that I like.

Peter



Re: Retrieving web pages

Schwab,Wilhelm K
Peter,

> It's a vicious circle - I would like to get feedback to confirm that the
> package is OK before making it widely available, but without making it
> available I can't get feedback. The recent discussions between you and Chris
> Demers make me realise that it's OK to circulate experimental stuff (with
> appropriate warnings),

Absolutely.


 > so I'm sending the zip file direct to you, together
> with another bit of stuff that I like.

I see you got past the zip file police - well done :)  Have a look at
the Smalltalk page.

Thanks!!!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]