ZnClient GET, but just the content of the <head> tag?

Paul DeBruicker
This is a micro-optimization if there ever was one, but I wondered whether it is possible to stop downloading and get the entity once the </head> tag has been received.

Right now I download the whole page, parse it with Soup, and then extract the tags I want from the head, which works fine. For example:

head := ((Soup fromString: (ZnEasy get: 'http://pharo.org') entity)
    findChildTag: 'html') findChildTag: 'head'.

Re: ZnClient GET, but just the content of the <head> tag?

Sven Van Caekenberghe-2
Paul,

> On 26 Nov 2016, at 18:31, PAUL DEBRUICKER <[hidden email]> wrote:
>
> This is a micro optimization if there ever was one but I wondered if it was possible to stop downloading and get the entity once the </head> tag has been received.  
>
> Right now I download the whole page, parse it with Soup, then extract the tags I want from the head.  Which works fine.  e.g.
>
> head:=((Soup fromString: (ZnEasy get: 'http://pharo.org') entity)
> findChildTag: 'html') findChildTag: 'head'.

This would only be useful for large pages. Dealing with the content of resources (like parsing HTML) is outside the scope of Zinc. However, I can help you get started.

What you want to do is use streaming. That gives you access to the content of a resource using a direct stream, so you could decide to stop reading (but then you have to close the connection, else you need to read everything anyway).

Start by having a look at ZnClient>>#downloadTo: and ZnStreamingEntity. What you want to do is more or less the following.

ZnClient new
  url: 'http://pharo.org';
  streaming: true;
  get.

At this point, the request is done, the response is in, but the entity of the response is not yet read. When you ask for the entity, you get a ZnStreamingEntity which holds the stream that you then have to read from. You can check the response (and its header) for meta info.

Your next challenge then is to process this stream so that you can parse it in a real streaming fashion. I don't know if Soup can do this.
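
As a rough sketch of the approach outlined above, the snippet below reads the streamed body one character at a time, stops as soon as </head> has been seen, and then closes the connection. It is an illustration under assumptions, not a prescribed Zinc recipe: readUntilHeadEnd is a hypothetical helper block, the page is assumed to be UTF-8 encoded, and ZnStreamingEntity>>#stream and ZnCharacterReadStream class>>#on:encoding: are assumed to be available in your Zinc version.

```smalltalk
| readUntilHeadEnd client in headText |

"Hypothetical helper: copy characters from a character stream until
 '</head>' has been read or the stream ends, and answer what was read."
readUntilHeadEnd := [ :charStream |
	| out done |
	out := WriteStream on: String new.
	done := false.
	[ done or: [ charStream atEnd ] ] whileFalse: [
		| char |
		char := charStream next.
		out nextPut: char.
		"Only test for the closing tag right after a $> was read."
		done := char = $>
			and: [ out contents asLowercase endsWith: '</head>' ] ].
	out contents ].

client := ZnClient new.
client
	url: 'http://pharo.org';
	streaming: true;
	get.
"Assumption: the streaming entity answers its raw binary stream via #stream,
 and the page is UTF-8 encoded (check client response contentType)."
in := ZnCharacterReadStream
	on: client response entity stream
	encoding: 'utf8'.
headText := readUntilHeadEnd value: in.
"The rest of the body is left unread, so close rather than reuse the connection."
client close.
headText
```

The helper works on any character stream, so it can be tried offline on an in-memory string readStream before pointing it at a live connection. The text it answers still has to be handed to an HTML parser of your choice.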

Sven



Re: ZnClient GET, but just the content of the <head> tag?

Peter Kenny
Paul

Not sure if this is helpful - I have not tried it out, but it may give you a
pointer.

As Sven says, you need to parse a stream and be able to stop when you reach
the desired point. If instead of Soup you use XMLHTMLParser, this has
streaming siblings called SAXHTMLHandler and SAX2HTMLParser. I think it
should be possible to use one or the other to stop when you reach the
</head> tag.

Personally I find the output of XMLHTMLParser easier to follow than that of
Soup, but this may be a matter of taste.

Hope this helps

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of
Sven Van Caekenberghe
Sent: 26 November 2016 18:19
To: Any question about pharo is welcome <[hidden email]>
Subject: Re: [Pharo-users] ZnClient GET, but just the content of the <head>
tag?

Re: ZnClient GET, but just the content of the <head> tag?

Jan Kurš
Hi,

PetitParser2 [1] supports parsing of streams. I have been experimenting with ZnClient and came up with the following solution:

1) Create a PP2 stream from ZnClient stream:
byteStream := ZnClient new
  url: 'http://pharo.org';
  streaming: true;
  get.
stream := PP2CharacterStream on: byteStream encoder: ZnUTF8Encoder new.

2) Create a parser for the <head> element:
head := '<head>' asPParser, #any asPParser starLazy, '</head>' asPParser.

3) Create a parser that skips everything up to the opening <head> or <body> tag (in case there is no <head>) and then parses the <head> element:
headStart := '<head' asPParser.
bodyStart := '<body' asPParser.
parser := (#any asPParser starLazy: (headStart / bodyStart)), head ==> #second.

result := parser optimize parse: stream.

4) Finally, the content of the <head> element is a collection of characters. I don't know the best way to convert it into a string; perhaps this:
text := (result second inject: (WriteStream on: String new) into: [ :out :char | out nextPut: char. out ]) contents
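
For step 4, one compact way to turn a collection of characters into a String is String class>>#streamContents:. The sketch below is self-contained; chars is a hypothetical stand-in for the collection of characters produced by the parser (result second in the steps above).

```smalltalk
| chars text |
"Hypothetical stand-in for the characters returned by the parser."
chars := OrderedCollection withAll: '<title>Pharo</title>'.
"streamContents: creates the WriteStream, runs the block, and answers the contents."
text := String streamContents: [ :out |
	chars do: [ :each | out nextPut: each ] ].
text
```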

Cheers,
Jan


On Sun, Nov 27, 2016 at 1:38 PM PBKResearch <[hidden email]> wrote:

Re: ZnClient GET, but just the content of the <head> tag?

stepharo
nice :)

On Sun, 27 Nov 2016 19:46:41 +0100, Jan Kurš <[hidden email]> wrote:


Re: ZnClient GET, but just the content of the <head> tag?

jtuchel
In reply to this post by Peter Kenny
I wonder if it is a hard requirement to transport your info as part of
the html page.
A possible alternative might be to just use a HEAD request instead of
GET and transport your info "on the bare HTTP layer"?
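
A HEAD request along these lines might look like the sketch below. Note that an HTTP HEAD response carries only the status line and the HTTP headers, not the HTML <head> element, so this only helps if the server exposes the information you need as response headers (an assumption about the server). ZnHeaders>>#headersDo: is assumed to be present in your Zinc version.

```smalltalk
| client |
client := ZnClient new.
client
	url: 'http://pharo.org';
	head.
"A HEAD response has headers but no body, so nothing is downloaded beyond them."
client response headers headersDo: [ :name :value |
	Transcript show: name; show: ': '; show: value asString; cr ].
client close.
```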

Joachim



Re: ZnClient GET, but just the content of the <head> tag?

Peter Kenny
In reply to this post by Jan Kurš

Paul

Further to my previous post, I have found a way to use XMLHTMLParser to give what you want (more or less). The method is pretty hackish, and I am sure it could be tidied up, but here it is:

par := XMLHTMLParser onURL: 'http://pharo.org'.

par parseDocumentUntil: [ | top |
    (top := par topNode) notNil and: [
        top isDocument not and: [ top isNamed: 'body' ] ] ].

(par parsingResult descendantElementsNamed: 'head') first inspect.

You will see that I cheat by scanning until the opening tag of the body is found, so the resulting parse contains all the head plus an empty body. There may be a way to stop at the </head> tag, but I haven't found it. As written, it depends on the input having both a head and a body.

I have tried this in a playground, and it works as expected. The parse stops after the end of the head. I don't know how to check whether the reading stops at the same point; I suppose that depends on the method 'parseDocumentUntil' behaving sensibly.

Hope this helps

Peter Kenny

 
