Hi all,

who maintains Soup, the HTML parser? Stef?

It seems to auto-close <button> (and <a>) tags when nested inside another element. I wrote this test that fails:

testNestedButton
	"this works with nested <div> tags instead of <button> and when there is no enclosing <div> at all. but here <button> is auto-closed."

	"a does not work either"

	| soup |
	soup := Soup fromString: '<div><button>
	<span>text</span>
</button>
</div>'.
	self assert: soup div button span string equals: 'text'

----

Where should I look to prevent Soup from auto-closing the tag, and where & how should I submit my fix?

cheers,
Siemen
Hi Siemen
let me know your login and I can add you as a committer. Paul is also taking care of Soup. Now I like XPath for scraping. Did you see the tutorial I wrote with Peter?

Stef
Siemen
Stef should have added that XPath depends on using Monty's XMLParser suite. I tried your snippet on XMLDOMParser, and it parses correctly. I always use XMLHTMLParser for parsing HTML, because I can always see the exact relationship between the parsed structure and the original HTML. With Soup I often found the match difficult or even impossible.

HTH

Peter Kenny
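For anyone following along, here is a minimal sketch of what Peter describes, assuming Monty's XMLParserHTML and XPath packages are loaded; the exact selectors (#xpath:, #contentString) are assumptions about that API rather than something verified here:

| doc span |
doc := XMLHTMLParser parse: '<div><button> <span>text</span> </button> </div>'.
"//button/span selects the span nested inside the button,
which is exactly the nesting Soup loses by auto-closing the tag."
span := (doc xpath: '//button/span') first.
span contentString

Printed in a playground this should answer 'text', the same assertion Siemen's Soup test makes.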
i like to collect some newspaper comics from an online newspaper
but it takes really long to do it by hand by hand
i tried Soup but i didn’t get anywhere
the pictures were hidden behind a script or something
is there anything to do about that? i don’t want to collect them all
i have the XPath .pdf but i haven’t read it yet

these browsers seem to gobble up memory
and while open they just keep getting bigger till the OS session crash
might there be a browser that is more minimal?

Vivaldi seems better at not bloating up RAM
Kjell

Almost certainly the HTML files will not contain the code for the actual pictures; they will just contain an ‘href’ node with the address to load the picture file from. If the web pages are built to a regular pattern, you should be able to parse them and locate the href nodes you want.

I haven’t found any problem with the parse from XMLHTMLParser taking up too much memory. My machine has 4GB ram; if you have much less than that, you might have trouble. If you have found a systematic way to locate the picture file, you could minimise the size of the DOM the parser creates, by using a streaming parser. The streaming version of Monty’s parser is called StAXHTMLParser.

I have a bit of experience playing with these parsers. If you get stuck, ask again here with more details; I may be able to help.

Peter Kenny
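To make that concrete, a minimal sketch of the extraction Peter describes. The URL is made up; the sketch assumes the XMLParserHTML and XPath packages plus Zinc, and that the page uses absolute image addresses:

| doc urls |
"Fetch the page over HTTP and parse it into a DOM."
doc := XMLHTMLParser parse: (ZnEasy get: 'http://example.com/comics') contents.
"Collect the address attribute of every image on the page; the comics will be among them."
urls := (doc xpath: '//img/@src') collect: [ :each | each value ].
"Download one picture; ZnEasy decodes a JPEG straight to a Form."
ZnEasy getJpeg: urls first.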
In reply to this post by Peter Kenny
Thanks Stef & Peter. I'm going with XMLHTMLParser; it is indeed nicer to work with in the inspector. I'm scraping my own HTML files (to create a mock DOM object of my HTML to work with in Pharo), so I think I will switch to XHTML too to reduce the complexity.

Nice chapter, I only saw it briefly before and didn't realize that XMLHTMLParser is a newer replacement for Soup.

-- Siemen
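In case it helps others doing the same, that workflow is essentially one expression; the file name here is made up, and XMLHTMLParser parse: is the only library call assumed:

"Parse a local HTML file into a DOM and explore it in the inspector."
(XMLHTMLParser parse: 'mockPage.html' asFileReference contents) inspect.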
In reply to this post by Kjell Godo
You should try the XPath tutorial: the generated HTML of the Magic: The Gathering site was quite ugly, and I could still find my way.

Stef
In reply to this post by Siemen Baader
Indeed the interaction with the inspector is great.
Now could you still improve Soup? What is your SmalltalkHub account?

Stef
In reply to this post by Kjell Godo
Most of the web pages I want to scrape use JavaScript to construct the DOM, which makes Soup, XMLHTMLParser, etc. useless.

I've extended Torsten's Pharo-Chrome library and use that to navigate the DOM in a way similar to Soup:

https://github.com/akgrant43/Pharo-Chrome

This gets around the issue with JavaScript since it waits for the browser to load the page, run the JavaScript and construct the DOM.

HTH,
Alistair
Hi Alistair

this is cool. Do you have one little example so that we can see how we can use it?

Stef
exampleNavigation
	| chrome page logger |
	"Record Beacon signals while driving the browser."
	logger := InMemoryLogger new.
	logger start.
	"Open a Chrome session and take its first tab."
	chrome := GoogleChrome new
		debugOn;
		debugSession;
		open;
		yourself.
	page := chrome tabPages first.
	"Load the page, then pull its DOM across into the image."
	page enablePage.
	page enableDOM.
	page navigateTo: 'http://pharo.org'.
	page getDocument.
	page getMissingChildren.
	page updateTitle.
	logger stop.
	^{ chrome. page. logger. }

but in fact I realised that I would like a simple doc :)
In reply to this post by alistairgrant
Alistair Grant wrote
> https://github.com/akgrant43/Pharo-Chrome

Wow, that was a wild ride! Lessons learned along the way:

1. On a Mac, to use the snazzy `chrome` terminal command referenced all over the place in the docs, you must first `alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"`.
2. Chrome must be started with certain flags: `chrome --remote-debugging-port=9222 --disable-gpu` (not sure if the last flag is needed, but `#get:` seemed to hang before using it; reference https://developers.google.com/web/updates/2017/04/headless-chrome).
3. Beacon has renamed InMemoryLogger to MemoryLogger.
4. I guess Beacon has renamed `#log` to `#emit`.
5. I had to comment out `chromeProcess sigterm.` because `chromeProcess` was nil, and `#sigterm` seemed not to be defined anywhere in the image. I'm not sure what the issue is there.

Pull request issued for #3 & #4.

Also, I'm not sure what platforms you support, but you may want to tag the example methods with <gtExample> or similar so that they are runnable from the browser and open an inspector if there is an interesting return value.
Cheers,
Sean
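For anyone retracing these steps against the exampleNavigation method posted earlier, items 3 and 4 amount to the following; this sketch is grounded only in what Sean reports, so treat it as an assumption about the current Beacon API:

| logger |
logger := MemoryLogger new.	"item 3: renamed from InMemoryLogger"
logger start.
"item 4: code that previously announced a signal with #log now sends #emit"
logger stop.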