I came up with the idea for this booklet thanks to Peter Kenny, who kindly answered a question on the Pharo mailing list. To help, Peter showed a Pharoer how to scrape a web site using XPath. In addition, some years ago I was maintaining Soup, a scraping framework, because I was scraping magic web sites and I wanted an application to manage my magic cards. Since then I have always wanted to try XPath, and I also wanted to offer this booklet to Peter. Why? Because I asked Peter if he would like to write something, and he told me that he was at a great age where he would not take on any commitment. I realised that I would like to get as old as him and still be able to hack like mad in Pharo with new technology. So this booklet is a gift to Peter, a great and gentle Pharoer.
Stef
Attachment: scrapingChap2-min.pdf (317K)
|
Hi Stéphane and all,
I use Soup from time to time. I am also using it with students this semester to build a small app that fetches leboncoin ads programmatically and sends alerts (leboncoin is a French web site for buying and selling stuff between people).
I think it should be updated to reflect important new HTML5 tags. For instance, I needed to add the section tag to the "nestableBlockTags" … It's in SoupParserParameter initializeNestableBlockTags (see the screenshot - I don't have my image here, so it's just the method). Without this tag, I could not properly get the classified ads.
I can publish it if you want, but I think there are more important HTML5 tags to include in this parameters class. As Soup is an "incomplete parser" (as far as I understand), there is no need to include all HTML5 tags. But besides <section>, do other people think there are tags to include there?
Cheers, Cédrick
|
Hello Cédrick,
Cool, I want it because I'm looking for games on leboncoin.
Please publish it. What you can also do is write a chapter showing how you script leboncoin with Soup. You are welcome to do so.
|
Hi Steph (sorry for the late reply, it seems my automatic processing of the mailing list is not working properly),
I'll publish it soon, and why not write a chapter on that subject? This could be a really fun tutorial :) I'll add the section tag to the code for nested tags.
Cheers, Cédrick
2017-09-27 21:52 GMT+02:00 Stephane Ducasse <[hidden email]>: