Hi all,

who maintains Soup, the HTML parser? Stef?

It seems to auto-close <button> (and <a>) tags when nested inside another element. I wrote this test that fails:

testNestedButton
	"this works with nested <div> tags instead of <button> and when there is no enclosing <div> at all. but here <button> is auto-closed."

	"a does not work either"

	| soup |
	soup := Soup fromString: '<div><button>
	<span>text</span>
</button>
</div>'.
	self assert: soup div button span string equals: 'text'

----

Where should I look to prevent Soup from auto-closing the tag, and where & how should I submit my fix?

cheers,
Siemen
Hi Siemen
let me know your login and I can add you as a committer. Paul is also taking care of Soup. Now I like XPath for scraping. Did you see the tutorial I wrote with Peter?

Stef
Siemen
Stef should have added that XPath depends on using Monty's XMLParser suite. I tried your snippet on XMLDOMParser, and it parses correctly. I always use XMLHTMLParser for parsing HTML, because I can always see the exact relationship between the parsed structure and the original HTML. With Soup I often found the match difficult or even impossible.

HTH

Peter Kenny
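For anyone following along, here is a minimal sketch of what Peter describes, assuming Monty's XMLParserHTML and XPath packages are loaded; the exact selectors (#xpath:, #contentString) are assumptions about that API rather than something verified here:

| doc span |
doc := XMLHTMLParser parse: '<div><button> <span>text</span> </button> </div>'.
"//button/span selects the span nested inside the button,
which is exactly the nesting Soup loses by auto-closing the tag."
span := (doc xpath: '//button/span') first.
span contentString

Printed in a playground this should answer 'text', the same assertion Siemen's Soup test makes.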
i like to collect some newspaper comics from an online newspaper
but it takes really long to do it by hand by hand
i tried Soup but i didn’t get anywhere
the pictures were hidden behind a script or something
is there anything to do about that? i don’t want to collect them all
i have the XPath .pdf but i haven’t read it yet

these browsers seem to gobble up memory
and while open they just keep getting bigger till the OS session crash
might there be a browser that is more minimal?

Vivaldi seems better at not bloating up RAM
Kjell

Almost certainly the HTML files will not contain the code for the actual pictures; they will just contain an ‘href’ node with the address to load the picture file from. If the web pages are built to a regular pattern, you should be able to parse them and locate the href nodes you want.

I haven’t found any problem with the parse from XMLHTMLParser taking up too much memory. My machine has 4GB ram; if you have much less than that, you might have trouble. If you have found a systematic way to locate the picture file, you could minimise the size of the DOM the parser creates, by using a streaming parser. The streaming version of Monty’s parser is called StAXHTMLParser.

I have a bit of experience playing with these parsers. If you get stuck, ask again here with more details; I may be able to help.

Peter Kenny
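To make that concrete, a minimal sketch of the extraction Peter describes. The URL is made up; the sketch assumes the XMLParserHTML and XPath packages plus Zinc, and that the page uses absolute image addresses:

| doc urls |
"Fetch the page over HTTP and parse it into a DOM."
doc := XMLHTMLParser parse: (ZnEasy get: 'http://example.com/comics') contents.
"Collect the address attribute of every image on the page; the comics will be among them."
urls := (doc xpath: '//img/@src') collect: [ :each | each value ].
"Download one picture; ZnEasy decodes a JPEG straight to a Form."
ZnEasy getJpeg: urls first.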
In reply to this post by Peter Kenny
Thanks Stef & Peter. I'm going with XMLHTMLParser; it is indeed nicer to work with in the inspector. I'm scraping my own HTML files (to create a mock DOM object of my HTML to work with in Pharo), so I think I will switch to XHTML too to reduce the complexity.

Nice chapter, I only saw it briefly before and didn't realize that XMLHTMLParser is a newer replacement for Soup.

-- Siemen
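In case it helps others doing the same, that workflow is essentially one expression; the file name here is made up, and XMLHTMLParser parse: is the only library call assumed:

"Parse a local HTML file into a DOM and explore it in the inspector."
(XMLHTMLParser parse: 'mockPage.html' asFileReference contents) inspect.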
In reply to this post by Kjell Godo
You should try the XPath tutorial: the generated HTML of the Magic: The Gathering site was quite ugly, and I could still find my way.

Stef
In reply to this post by Siemen Baader
Indeed the interaction with the inspector is great.
Now could you still improve Soup? What is your SmalltalkHub account?

Stef
In reply to this post by Kjell Godo
Most of the web pages I want to scrape use JavaScript to construct the DOM, which makes Soup, XMLHTMLParser, etc. useless.

I've extended Torsten's Pharo-Chrome library and use that to navigate the DOM in a way similar to Soup:

https://github.com/akgrant43/Pharo-Chrome

This gets around the issue with JavaScript since it waits for the browser to load the page, run the JavaScript and construct the DOM.

HTH,
Alistair
Hi Alistair

this is cool. Do you have one little example so that we can see how we can use it?

Stef
exampleNavigation
	| chrome page logger |
	"Record Beacon signals while driving the browser."
	logger := InMemoryLogger new.
	logger start.
	"Open a Chrome session and take its first tab."
	chrome := GoogleChrome new
		debugOn;
		debugSession;
		open;
		yourself.
	page := chrome tabPages first.
	"Load the page, then pull its DOM across into the image."
	page enablePage.
	page enableDOM.
	page navigateTo: 'http://pharo.org'.
	page getDocument.
	page getMissingChildren.
	page updateTitle.
	logger stop.
	^{ chrome. page. logger. }

but in fact I realised that I would like a simple doc :)
In reply to this post by alistairgrant
Alistair Grant wrote
> https://github.com/akgrant43/Pharo-Chrome

Wow, that was a wild ride! Lessons learned along the way:

1. On a Mac, to use the snazzy `chrome` terminal command referenced all over the place in the docs, you must first `alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"`.
2. Chrome must be started with certain flags: `chrome --remote-debugging-port=9222 --disable-gpu` (not sure if the last flag is needed, but `#get:` seemed to hang before using it; reference https://developers.google.com/web/updates/2017/04/headless-chrome).
3. Beacon has renamed InMemoryLogger to MemoryLogger.
4. I guess Beacon has renamed `#log` to `#emit`.
5. I had to comment out `chromeProcess sigterm.` because `chromeProcess` was nil, and `#sigterm` seemed not to be defined anywhere in the image. I'm not sure what the issue is there.

Pull request issued for #3 & #4.

Also, I'm not sure what platforms you support, but you may want to tag the example methods with <gtExample> or similar so that they are runnable from the browser and open an inspector if there is an interesting return value.
Cheers,
Sean
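For anyone retracing these steps against the exampleNavigation method posted earlier, items 3 and 4 amount to the following; this sketch is grounded only in what Sean reports, so treat it as an assumption about the current Beacon API:

| logger |
logger := MemoryLogger new.	"item 3: renamed from InMemoryLogger"
logger start.
"item 4: code that previously announced a signal with #log now sends #emit"
logger stop.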