Soup bug(fix)


Soup bug(fix)

Siemen Baader
Hi all,

who maintains Soup, the HTML parser? Stef?

It seems to auto-close <button> (and <a>) tags when nested inside another element. I wrote this test that fails:

testNestedButton
    "this works with nested <div> tags instead of <button> and when there is no enclosing <div> at all. but here <button> is auto-closed."

    "a does not work either"

    | soup |
    soup := Soup
        fromString:
            '<div><button>
        <span>text</span>
   </button>
</div>'.
    self assert: soup div button span string equals: 'text'

----


Where should I look to prevent Soup from auto-closing the tag, and where & how should I submit my fix?

cheers,
Siemen

Re: Soup bug(fix)

Stephane Ducasse-3
Hi Siemen

let me know your login and I can add you as a committer. Paul is also
taking care of Soup.
Nowadays I like XPath for scraping. Did you see the tutorial I wrote with Peter?


Stef


Re: Soup bug(fix)

Peter Kenny
Siemen

Stef should have added that XPath depends on using Monty's XMLParser suite. I tried your snippet on XMLDOMParser, and it parses correctly. I always use XMLHTMLParser for parsing HTML, because I can always see the exact relationship between the parsed structure and the original HTML. With Soup I often found the match difficult or even impossible.
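For reference, Peter's check can be reproduced with a snippet along these lines (a minimal sketch, assuming the class-side `parse:` and the `findElementNamed:` / `contentString` navigation messages of Monty's XML-Parser suite):

```smalltalk
"Parse Siemen's snippet with XMLHTMLParser instead of Soup."
| document span |
document := XMLHTMLParser parse: '<div><button>
        <span>text</span>
   </button>
</div>'.
"The nested <button> is kept, so the <span> is still reachable:"
span := document findElementNamed: 'span'.
span contentString "should answer 'text'"
```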

HTH

Peter Kenny


Re: Soup bug(fix)

Kjell Godo
i like to collect some newspaper comics from an online newspaper
     but it takes really long to do it by hand by hand
i tried Soup but i didn’t get anywhere
     the pictures were hidden behind a script or something
is there anything to do about that?         i don’t want to collect them all
i have the XPath .pdf but i haven’t read it yet

these browsers seem to gobble up memory
     and while open they just keep getting bigger till the OS session crash
     might there be a browser that is more minimal?

Vivaldi seems better at not bloating up RAM
Reply | Threaded
Open this post in threaded view
|

Re: Soup bug(fix)

Peter Kenny

Kjell

 

Almost certainly the HTML files will not contain the code for the actual pictures; they will just contain an ‘href’ node with the address to load the picture file from. If the web pages are built to a regular pattern, you should be able to parse them and locate the href nodes you want.

 

I haven’t found any problem with the parse from XMLHTMLParser taking up too much memory. My machine has 4GB ram; if you have much less than that, you might have trouble. If you have found a systematic way to locate the picture file, you could minimise the size of the DOM the parser creates, by using a streaming parser. The streaming version of Monty’s parser is called StAXHTMLParser.
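As a concrete illustration of the first point, locating those addresses might look like this (a hedged sketch over made-up HTML; it assumes the comics are referenced from ordinary `<img src=...>` or `<a href=...>` nodes, and uses the `allElementsNamed:` / `attributeAt:` messages of the XML-Parser suite):

```smalltalk
"Extract candidate picture addresses from a (made-up) page."
| document imageUrls linkUrls |
document := XMLHTMLParser parse: '<html><body>
    <img src="/comics/2017-11-08.png">
    <a href="/comics/2017-11-09.png">next strip</a>
</body></html>'.
imageUrls := (document allElementsNamed: 'img')
    collect: [ :each | each attributeAt: 'src' ].
linkUrls := (document allElementsNamed: 'a')
    collect: [ :each | each attributeAt: 'href' ]
```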

 

I have a bit of experience playing with these parsers. If you get stuck, ask again here with more details; I may be able to help.

 

Peter Kenny

 


Re: Soup bug(fix)

Siemen Baader
In reply to this post by Peter Kenny



Thanks Stef & Peter. I'm going with XMLHTMLParser; it is indeed nicer to work with in the inspector. I'm scraping my own HTML files (to create a mock DOM object of my HTML to work with in Pharo), so I think I will switch to XHTML too to reduce the complexity.

Nice chapter; I only saw it briefly before and didn't realize that XMLHTMLParser is a newer replacement for Soup.

-- Siemen

Re: Soup bug(fix)

Stephane Ducasse-3
In reply to this post by Kjell Godo
You should try the XPath tutorial: the Magic: The Gathering pages we scraped there were quite ugly generated HTML, and I could still find my way.

Stef


Re: Soup bug(fix)

Stephane Ducasse-3
In reply to this post by Siemen Baader
Indeed, the interaction with the inspector is great.
Now, could you still improve Soup?
What is your SmalltalkHub account?
Stef


Re: Soup bug(fix)

alistairgrant
In reply to this post by Kjell Godo

Most of the web pages I want to scrape use javascript to construct the
DOM, which makes Soup, XMLHTMLParser, etc. useless.

I've extended Torsten's Pharo-Chrome library and use that to navigate
the DOM in a way similar to Soup:

https://github.com/akgrant43/Pharo-Chrome

This gets around the issue with javascript since it waits for the
browser to load the page, run the javascript and construct the DOM.

HTH,
Alistair




Re: Soup bug(fix)

Stephane Ducasse-3
Hi alistair

this is cool.
Do you have one little example so that we can see how we can use it?

Stef



Re: Soup bug(fix)

Stephane Ducasse-3
exampleNavigation
    | chrome page logger |
    logger := InMemoryLogger new.
    logger start.
    chrome := GoogleChrome new
        debugOn;
        debugSession;
        open;
        yourself.
    page := chrome tabPages first.
    page enablePage.
    page enableDOM.
    page navigateTo: 'http://pharo.org'.
    page getDocument.
    page getMissingChildren.
    page updateTitle.
    logger stop.
    ^{ chrome. page. logger }

but in fact I realised that I would like a simple doc :)



Re: Soup bug(fix)

Sean P. DeNigris
In reply to this post by alistairgrant
Alistair Grant wrote
> https://github.com/akgrant43/Pharo-Chrome

Wow, that was a wild ride! Lessons learned along the way:
1. On a Mac, to use the snazzy `chrome` terminal command referenced all over
the place in the docs, you must first `alias chrome="/Applications/Google\
Chrome.app/Contents/MacOS/Google\ Chrome"`
2. Chrome must be started with certain flags: `chrome
--remote-debugging-port=9222 --disable-gpu` (not sure if the last flag is
needed, but `#get:` seemed to hang before using it; see
https://developers.google.com/web/updates/2017/04/headless-chrome)
3. Beacon has renamed InMemoryLogger to MemoryLogger
4. I guess Beacon has renamed `#log` to `#emit`
5. I had to comment out `chromeProcess sigterm.` because `chromeProcess` was
nil, and `#sigterm` seemed not to be defined anywhere in the image. I'm
not sure what the issue is there.

Pull request issued for #3 & #4. Also, I'm not sure what platforms you
support, but you may want to tag the example methods with <gtExample> or
similar so that they are runnable from the browser and open an inspector if
there is an interesting return value.




Cheers,
Sean