[ANN] XMLParserHTML moved to GitHub


Torsten Bergmann
Hi,

the STHub -> PharoExtras project "XMLParserHTML"

has now been moved from http://smalltalkhub.com/#!/~PharoExtras/XMLParserHTML to
https://github.com/pharo-contributions/XML-XMLParserHTML including the FULL HISTORY.

The old STHub repo is marked as obsolete and links to the new one. I've also
set up a CI job: https://travis-ci.org/pharo-contributions/XML-XMLParserHTML
which is green for Pharo 7. Some cleanups, class comments and docs were applied, as you can
see from the commit history.

The new version is tagged in git as version 1.6.0 (with a movable tag 1.6.x in case further
hotfixes are required).

You can load using

   Metacello new
        baseline: 'XMLParserHTML';
        repository: 'github://pharo-contributions/XML-XMLParserHTML/src';
        load.

or from catalog in Pharo 7 or 8.
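
To pin the load to the movable 1.6.x tag rather than the default branch, a Metacello github:// URL accepts a ref after a colon. A minimal sketch, assuming the tag is spelled exactly 1.6.x in the repository:

```smalltalk
"Load the baseline at the movable 1.6.x tag (tag name taken from the announcement)"
Metacello new
    baseline: 'XMLParserHTML';
    repository: 'github://pharo-contributions/XML-XMLParserHTML:1.6.x/src';
    load.
```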

Attached is current dependency graph.

More to come soon ...

Bye
T.

xmlparserhtml.png (41K) Download Attachment

Re: [ANN] XMLParserHTML moved to GitHub

Esteban A. Maringolo
Thank you Torsten,

I wasn't aware of this tool. I'm already using it to scrape content
from a website and feed a Pharo-driven system :)

The XML integration in the Inspector is great too.

Regards!

Esteban A. Maringolo


Re: [ANN] XMLParserHTML moved to GitHub

cedreek
Stef and others wrote this book a while ago:

http://books.pharo.org/booklet-Scraping/html/scrapingbook.html

Basically XMLHTMLParser + XPath

To me, far better than using Soup.
The Google Chrome Pharo integration also helps to scrape complex, fully JS-driven web sites like Google ;)


Cheers,

Cedrick 


Re: [ANN] XMLParserHTML moved to GitHub

Esteban A. Maringolo
Great!

I just added a link to the README.md of the project and created a PR,
because it is very likely that if you're parsing HTML you're doing
some scraping. :-)

Esteban A. Maringolo


On Fri, Nov 29, 2019 at 2:18 PM Cédrick Béler <[hidden email]> wrote:
>
> Stef and other wrote this book a while ago:
>
> http://books.pharo.org/booklet-Scraping/html/scrapingbook.html
>
> Basically XMLHtmlParser + XPath
>
> To me, far better than using Soup.
> Google chrome pharo integration helps top to scrap complex full JS web site like google ;)


Re: [ANN] XMLParserHTML moved to GitHub

Sean P. DeNigris
Administrator
In reply to this post by cedreek
cedreek wrote
> To me, far better than using Soup.

Ah, interesting! I use Soup almost exclusively. What did you find superior
about XMLParserHTML? I may give it a try...


cedreek wrote
> Google chrome pharo integration helps top to scrap complex full JS web
> site like google ;)

Also interesting! Any publicly available examples? How does one load "Google
chrome pharo integration"? Also, there is often the "poor man's" way (albeit
requiring manual intervention) by inspecting the Ajax http requests in a
developer console and then recreating directly in Pharo.



-----
Cheers,
Sean

Re: [ANN] XMLParserHTML moved to GitHub

cedreek

cedreek wrote
> To me, far better than using Soup.

Sean P. DeNigris wrote
> Ah, interesting! I use Soup almost exclusively. What did you find superior
> about XMLParserHTML? I may give it a try...

It's mainly XPath, which I find easier than navigating the HTML tree with Soup or even with XMLHTMLParser.

I usually copy the XPath from a web inspector. I have to tweak it a bit though.
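
As a rough illustration of the difference (a sketch only, assuming the XPath package is loaded alongside XMLParserHTML; the HTML snippet and its URLs are made up for the example):

```smalltalk
| doc links |
"Parse a small HTML string into a DOM"
doc := XMLHTMLParser parse: '<html><body><div class="g"><a href="https://pharo.org">Pharo</a></div><div class="g"><a href="https://github.com">GitHub</a></div></body></html>'.

"One XPath expression replaces a manual walk down the element tree"
links := doc xpath: '//div[@class="g"]/a/@href'.

"The result holds attribute nodes; ask each one for its string value"
links collect: [ :each | each value ]
```

An expression copied from a browser's web inspector usually starts from the document root, so trimming it down to a relative `//...` form like the one above tends to be the "tweak" that is needed.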


cedreek wrote
> Google chrome pharo integration helps top to scrap complex full JS web
> site like google ;)

Sean P. DeNigris wrote
> Also interesting! Any publicly available examples? How does one load "Google
> chrome pharo integration"? Also, there is often the "poor man's" way (albeit
> requiring manual intervention) by inspecting the Ajax http requests in a
> developer console and then recreating directly in Pharo.

I just tried it once.

There is a Google Chrome plugin that lets you use headless Chrome to get the fully loaded HTML page.

I need to try it again. A simple example I'd like to do is to scrape Google and remove advertised content ^^

This is, btw, Torsten's package:


Happy scraping ;-)

And thx Torsten for all ^^

Cedrick 




Re: [ANN] XMLParserHTML moved to GitHub

cedreek
Just a quick try to scrape a query from Google search (using the Chrome plugin + XMLHTMLParser + XPath):


Really cool… and in a few lines you get your own « google » powered search engine ;-).

Nevertheless this is a bit slow: 5 seconds for one page, around 20s for the 5 from the script (I think it might be doable to speed things up a bit).

HTH,


Cédrick


PS: code below in text

searches := OrderedCollection new.
query := 'black friday' urlEncoded.

"Chrome plugin: start a headless browser and take its first tab"
browser := GoogleChrome new.
browser headless: true.
browser open.
page := browser firstTab.

"Walk the result pages through the start offset and collect the result links"
0 to: 50 by: 10 do: [ :paging |
    page get: 'https://www.google.com/search?q=', query, '&start=', paging asString.
    result := page html.
    dom := result parseHTML. "uses XMLHTMLParser"
    searches addAll: (dom xpath: '//div[@class="g"]/div/div/div/a/@href') ].

searches size. "64"
searches first.


Re: [ANN] XMLParserHTML moved to GitHub

cedreek
In reply to this post by Sean P. DeNigris

> Also interesting! Any publicly available examples? How does one load "Google
> chrome pharo integration"?

https://github.com/astares/Pharo-Chrome
https://github.com/akgrant43/Pharo-Chrome

Cheers,

Cédrick



Re: [ANN] XMLParserHTML moved to GitHub

Esteban A. Maringolo
Why use Chrome instead of ZnClient? To get a "real" render of the
content? (including JS and whatnot).

Regards!


Esteban A. Maringolo


Re: [ANN] XMLParserHTML moved to GitHub

cedreek
I couldn't get it from Zn as (I think) there are some JS libs that defer the full rendering.

I have the same problem with a site in France (leboncoin). They use https://datadome.co to complicate web scraping. So a headless browser is the only solution I know.
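
For pages that do not defer rendering to JavaScript, the plain Zinc route is usually enough. A minimal sketch, assuming a static page (the URL is only a placeholder) and that the XPath package is loaded:

```smalltalk
| html doc |
"Fetch the raw HTML with Zinc; no JavaScript is executed"
html := ZnClient new get: 'https://example.com/'.

"Parse it with XMLHTMLParser and query the resulting DOM"
doc := XMLHTMLParser parse: html.
doc xpath: '//a/@href'
```

If the XPath result is empty here although the links show up in a browser, the content is probably injected by JavaScript, and that is the case where a headless browser is needed.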

Cheers,

Cédrick


Re: [ANN] XMLParserHTML moved to GitHub

Peter Kenny
In reply to this post by Sean P. DeNigris
Sean

I used Soup a few times, but found it difficult to interpret the output,
because the parse did not seem to reflect the hierarchy of the nodes in the
original; in particular, sibling nodes were not necessarily at the same
level in the Soup. XMLHTMLParser always gets the structure right, in my
experience. I think this is essential if you want to use XPath to process
the parse. The worked examples in the scraping booklet show how the parser
and XPath can work together.

HTH

Peter Kenny


Re: [ANN] XMLParserHTML moved to GitHub

LawsonEnglish
In reply to this post by Torsten Bergmann
Torsten Bergmann wrote

> Hi,
>
>
> You can load using
>
>    Metacello new
> baseline: 'XMLParserHTML';
> repository: 'github://pharo-contributions/XML-XMLParserHTML/src';
> load.
>
>
> Bye
> T.

Hi,

I'm trying to use the sample code in the pharo screen scraping booklet —
http://books.pharo.org/booklet-Scraping/pdf/2018-09-02-scrapingbook.pdf —
but while everything appears to load, I'm getting an odd behavior from:

   | ingredientsXML |
   ingredientsXML := XMLHTMLParser parseURL:
      'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
   ingredientsXML inspect

"#new was sent to nil"

No matter what URL I use, I get the same message.

I'm using Mac OS Catalina so I thought I might have some strange Mac OS
security issue (like it was quietly refusing to allow Pharo to access the
internet), but I tested with squeak and the old

   html := (HtmlParser parse:
      'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'
         asUrl retrieveContents content)

and that returns actual html without any problems.


Suggestions?


Thanks.

L





Re: [ANN] XMLParserHTML moved to GitHub

Torsten Bergmann
Works without a problem (Pharo 8 on Windows), see attached. So it looks like a local problem.

Just check the debugger to see where you run into trouble, and compare with the Squeak version.
Maybe the document could not be retrieved on your machine.

Bye
T.


screen.png (277K) Download Attachment

Re: [ANN] XMLParserHTML moved to GitHub

Peter Kenny
It may be a quirk of how Pharo Playground works. It doesn't need local variable declarations - which is convenient - but putting them in can screw things up. Try your snippet again without the first line. Compare Torsten's code.

HTH

Peter Kenny


Re: [ANN] XMLParserHTML moved to GitHub

Torsten Bergmann
I agree with Peter - but "screw things up" then means the user screws up.

Pharo and the Playground are working fine. But one has to know the difference when
working with the Playground:

 1. If you evaluate with an explicit variable declaration, then the variable is freshly defined and used like a temporary variable in a method:

      | ingredientsXML |
      ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
      ingredientsXML inspect

   You have to select the full text and evaluate it (either with "Do it" or "Print it") to get the result.

   If you select only the "ingredientsXML inspect" part first and evaluate it, then the variable "ingredientsXML" is not known, undefined
   and uninitialized, and therefore results in nil.

 2. If in the playground you do not give an explicit variable declaration on the first line, as for example in

          ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
          ingredientsXML inspect

    then a "workspace local variable" is implicitly created by the playground as soon as you evaluate, which means

      - "ingredientsXML" is defined as a workspace variable as soon as you evaluate
      - the contents of "ingredientsXML" are preserved over different evaluations within the workspace / playground
      - you can use "ingredientsXML" only within this playground (not in another playground)

    So you can evaluate the first line doing the assignment (this initializes the workspace variable "ingredientsXML" for the current playground),
    and when you later want to use it again you can just inspect it or evaluate the second line in the same playground.

    If you like, you can open a second playground, which can have its own "ingredientsXML" workspace variable.

Workspace variables (or "playground variables") are convenient for experimenting - as they are preserved - but
yes, they might confuse you when you can't remember what was done with them last.

Bye
T.


Re: [ANN] XMLParserHTML moved to GitHub

LawsonEnglish
In reply to this post by Peter Kenny
I deleted the playground and entered the text thusly


"do it" has no complaints

   ingredientsXML = nil

yields "false"

   ingredientsXML inspect

has errors: #new sent to nil

This makes no sense at all.


L



Re: [ANN] XMLParserHTML moved to GitHub

Peter Kenny

I agree it makes no sense. I repeated exactly what you describe in a new playground (in Pharo 6.1 on Windows 10) and all worked as expected – essentially the same result as Torsten reported in his first post. I wonder if it might be something Mac related in the operation of Playground.

As a desperate try to explain it, please see what happens if you open a Playground with just your single line

ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'

and then select ‘do it and go’. You should find an inspector pane opening to the right in the Playground, with the result of the parse. If this fails, the standard suggestion is to open a debugger on your error message and try to work back through the stack to see how execution got there.
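
If the debugger is awkward to copy from, the top of the stack can also be captured as text. A sketch only, assuming the error is raised in the same process that evaluates the block (an error raised later, while the inspector window renders, would not be caught this way):

```smalltalk
"Evaluate the failing expression and print the error plus the top stack frames to the Transcript"
[ ingredientsXML inspect ]
    on: Error
    do: [ :error |
        Transcript
            show: error messageText; cr;
            show: error signalerContext shortStack; cr ]
```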

 

Just to discourage you further, when you do get to read the contents of the URL, you will find that the USDA have changed everything. All the data are now on a separate web site, probably in a new layout. This is one of the perpetual hassles of web scraping – the web site authors have to justify their existence by rewriting everything. I wrote this section of the scraping booklet, working up something I had done as a one-off a year or so earlier, and then I found that the USDA had changed the layout in the interim and much needed to be rewritten.

HTH – in part at least.

Peter Kenny

To Torsten – I agree I was slipshod in my drafting – I was in a hurry. Instead of saying ‘can screw things up’ I should have said ‘can produce counter-intuitive results’, as exemplified by the fact that, in your first example, ‘ingredientsXML’ can mean different things depending on whether you execute it all in one go or a line at a time.

 


Re: [ANN] XMLParserHTML moved to GitHub

LawsonEnglish
In reply to this post by Torsten Bergmann
Well, as you can see in my response elsewhere, none of that actually works as you describe.

>      ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.

Doesn’t raise any errors, with or without the local variable declaration.

> ingredientsXML = nil

returns false.

> ingredientsXML inspect

Raises the message "#new on nil".

I “do it” on the entire text or on each line in the order entered. It doesn’t matter.

I’m using a Mac with Mac OS X Catalina, using Pharo 7.


L


> On Jan 7, 2020, at 5:31 AM, Torsten Bergmann <[hidden email]> wrote:
>
> Agree with Peter - but "screw things up" means then the users screws up.
>
> Pharo and the Playground is working fine on them. But one has to know the difference when
> working with the Playground:
>
> 1. If you evaluate with an explicit variable declaration than the variable is freshly defined and used like a temporary variable in a method:
>
>      | ingredientsXML |
>      ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
>      ingredientsXML inspect
>
>   You have to selected the full text and evaluate it (either with "do It" or "print it" to get the result.
>
>   If you only select "ingredientsXML inspect" part first and evaluate then the variable "ingredientsXML" is not known, undefined
>   and uninitialized and therefore results in a nil.
>
> 2. If in the playground you do not give an explicit variable declaration at the beginning line like for example in
>
>          ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
>          ingredientsXML inspect    
>
>    then a "workspace local variable" is implicitly created by the playground as soon as you evaluate which means
>
>      - "ingredientsXML" is defined as a workspace variable as soon as you evaluate
>      - the contents of "ingredientsXML" is preserved over different evaluations within the workspace / playground
>      - you can use only "ingredientsXML" within this playground (not in another plaground)
>
>    So you can evaluate the first line doing the assignment (this initializes the workspace variable "ingredientsXML" for the current playground)
>    and when you later want to use it again you can just inspect it or evaluate the second line in the same playground.
>
>    If you like you can open a second playground which can have its own "ingredientsXML" workspace variable.
>
> Workspace variables (or "playground variables") are convenient for experimenting - as they are preserved - but
> yes they might confuse you when you cant remember what was done with them last.
>
> Bye
> T.
>
>> Gesendet: Dienstag, 07. Januar 2020 um 09:55 Uhr
>> Von: "PBKResearch" <[hidden email]>
>> An: "'Any question about pharo is welcome'" <[hidden email]>
>> Betreff: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub
>>
>> It may be a quirk of how Pharo Playground works. It doesn't need local variable declarations - which is convenient - but putting them in can screw things up. Try your snippet again without the first line. Compare Torsten's code.
>>
>> HTH
>>
>> Peter Kenny
>>
>> -----Original Message-----
>> From: Pharo-users <[hidden email]> On Behalf Of Torsten Bergmann
>> Sent: 07 January 2020 07:47
>> To: [hidden email]
>> Cc: [hidden email]
>> Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub
>>
>> Works without a problem (Pharo 8 on Windows), see attached. So it looks like a local problem.
>>
>> Just check the debugger and compare to the squeak version where you run in trouble.
>> Maybe the document could not be retrieved on your machine.
>>
>> Bye
>> T.
>>
>>> Sent: Tuesday, 07 January 2020 at 04:42
>>> From: "LawsonEnglish" <[hidden email]>
>>> To: [hidden email]
>>> Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub
>>>
>>> Torsten Bergmann wrote
>>>> Hi,
>>>>
>>>>
>>>> You can load using
>>>>
>>>>   Metacello new
>>>> baseline: 'XMLParserHTML';
>>>> repository: 'github://pharo-contributions/XML-XMLParserHTML/src';
>>>> load.
>>>>
>>>>
>>>> Bye
>>>> T.
>>>
>>> Hi,
>>>
>>> I'm trying to use the sample code in the pharo screen scraping booklet
>>> —
>>> http://books.pharo.org/booklet-Scraping/pdf/2018-09-02-scrapingbook.pdf — but while everything appears to load, I'm getting an odd behavior from:
>>>
>>> /| ingredientsXML |
>>> ingredientsXML := XMLHTMLParser parseURL:
>>> 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
>>> ingredientsXML inspect/
>>>
>>> "#new was sent to nil"
>>>
>>> No matter what URL I use, I get the same message.
>>>
>>> I'm using Mac OS Catalina so I thought I might have some strange Mac
>>> OS security issue (like it was quietly refusing to allow Pharo to
>>> access the internet), but I tested with squeak and the old
>>>
>>> /html :=(HtmlParser parse:
>>> 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'
>>> asUrl retrieveContents content)/
>>>
>>> and that returns actual html without any problems.
>>>
>>>
>>> Suggestions?
>>>
>>>
>>> Thanks.
>>>
>>> L
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
>>>
>>>
>>
>>
>>
>



Re: [ANN] XMLParserHTML moved to GitHub

Sven Van Caekenberghe-2
Something is wrong in your image.
The XML package adds special GT inspector views, the error is probably there.
This has nothing to do with the platform.

BTW, a stack trace would be much appreciated, like:

ZeroDivide when doing: 1/0

SmallInteger>>/
UndefinedObject>>DoIt
OpalCompiler>>evaluate
RubSmalltalkEditor>>evaluate:andDo:
RubSmalltalkEditor>>highlightEvaluateAndDo:
[ textMorph textArea editor highlightEvaluateAndDo: ann action.
textMorph shoutStyler style: textMorph text ] in [ textMorph textArea
        handleEdit: [ textMorph textArea editor highlightEvaluateAndDo: ann action.
                textMorph shoutStyler style: textMorph text ] ] in GLMMorphicPharoScriptRenderer(GLMMorphicPharoCodeRenderer)>>actOnHighlightAndEvaluate: in Block: [ textMorph textArea editor highlightEvaluateAndDo...etc...
RubEditingArea(RubAbstractTextArea)>>handleEdit:
[ textMorph textArea
        handleEdit: [ textMorph textArea editor highlightEvaluateAndDo: ann action.
                textMorph shoutStyler style: textMorph text ] ] in GLMMorphicPharoScriptRenderer(GLMMorphicPharoCodeRenderer)>>actOnHighlightAndEvaluate: in Block: [ textMorph textArea...
WorldState>>runStepMethodsIn:
WorldMorph>>runStepMethods
WorldState>>doOneCycleNowFor:
WorldState>>doOneCycleFor:
WorldMorph>>doOneCycle
WorldMorph class>>doOneCycle
[ [ WorldMorph doOneCycle.
Processor yield.
false ] whileFalse: [  ] ] in MorphicUIManager>>spawnNewProcess in Block: [ [ WorldMorph doOneCycle....
[ self value.
Processor terminateActive ] in BlockClosure>>newProcess in Block: [ self value....

You can copy that from the debugger extra menu (top right).
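
If the debugger menu is awkward to reach, the trace can also be captured as text from the playground. This is only a sketch under assumptions: it supposes that "ingredientsXML" is already assigned in the same playground and that the error is signaled in the evaluating process itself. If the inspector fails in a later UI cycle, the handler will not see it.

```smalltalk
"Evaluate with 'print it' in the playground that holds ingredientsXML."
[ ingredientsXML inspect.
  'no error signaled' ]
	on: Error
	do: [ :ex |
		"Answer the error description followed by the top of the stack as a String"
		ex description , String cr , ex signalerContext shortStack ]
```

The printed string can then be pasted into a mail.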

Try 'Basic Inspect It' instead of 'Inspect It', this will use an older less complex inspector, that will probably work.
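
The same check can be done from code. A minimal sketch, assuming "ingredientsXML" was assigned in the same playground and that Object>>basicInspect is what the 'Basic Inspect It' menu item maps to in your image:

```smalltalk
"Print it: shows the class of whatever the parser actually returned"
ingredientsXML class.

"Do it: opens the basic inspector instead of the one with the custom XML views"
ingredientsXML basicInspect.
```

If the basic inspector opens without an error, the parse result itself is fine and the problem sits in the inspector views.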

> On 7 Jan 2020, at 23:06, LawsonEnglish <[hidden email]> wrote:
>
> Well, as you can see in my response elsewhere, none of that actually works as you describe.
>
>>     ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
>
> Doesn’t raise any errors, with or without the local variable declaration.
>
>> ingredientsXML = nil returns false
>
>> ingredientsXML inspect
>
> Raises the message “#new on nil”
>
> I “do it” on the entire text or on each line in the order entered. It doesn’t matter.
>
> I’m using a Mac with Mac OS X Catalina, using Pharo 7.
>
>
> L
>
>
>> On Jan 7, 2020, at 5:31 AM, Torsten Bergmann <[hidden email]> wrote:
>>
>> Agree with Peter - but "screw things up" then means that the user screws up.
>>
>> Pharo and the Playground are working fine with them. But one has to know the difference when
>> working with the Playground:
>>
>> 1. If you evaluate with an explicit variable declaration then the variable is freshly defined and used like a temporary variable in a method:
>>
>>     | ingredientsXML |
>>     ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
>>     ingredientsXML inspect
>>
>>  You have to select the full text and evaluate it (either with "do it" or "print it") to get the result.
>>
>>  If you only select "ingredientsXML inspect" part first and evaluate then the variable "ingredientsXML" is not known, undefined
>>  and uninitialized and therefore results in a nil.
>>
>> 2. If in the playground you do not give an explicit variable declaration at the beginning line like for example in
>>
>>         ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
>>         ingredientsXML inspect    
>>
>>   then a "workspace local variable" is implicitly created by the playground as soon as you evaluate which means
>>
>>     - "ingredientsXML" is defined as a workspace variable as soon as you evaluate
>>     - the contents of "ingredientsXML" is preserved over different evaluations within the workspace / playground
>>     - you can use only "ingredientsXML" within this playground (not in another playground)
>>
>>   So you can evaluate the first line doing the assignment (this initializes the workspace variable "ingredientsXML" for the current playground)
>>   and when you later want to use it again you can just inspect it or evaluate the second line in the same playground.
>>
>>   If you like you can open a second playground which can have its own "ingredientsXML" workspace variable.
>>
>> Workspace variables (or "playground variables") are convenient for experimenting - as they are preserved - but
>> yes they might confuse you when you can't remember what was done with them last.
>>
>> Bye
>> T.
>>
>>> Sent: Tuesday, 07 January 2020 at 09:55
>>> From: "PBKResearch" <[hidden email]>
>>> To: "'Any question about pharo is welcome'" <[hidden email]>
>>> Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub
>>>
>>> It may be a quirk of how Pharo Playground works. It doesn't need local variable declarations - which is convenient - but putting them in can screw things up. Try your snippet again without the first line. Compare Torsten's code.
>>>
>>> HTH
>>>
>>> Peter Kenny
>>>
>>> -----Original Message-----
>>> From: Pharo-users <[hidden email]> On Behalf Of Torsten Bergmann
>>> Sent: 07 January 2020 07:47
>>> To: [hidden email]
>>> Cc: [hidden email]
>>> Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub
>>>
>>> Works without a problem (Pharo 8 on Windows), see attached. So it looks like a local problem.
>>>
>>> Just check the debugger and compare to the squeak version where you run in trouble.
>>> Maybe the document could not be retrieved on your machine.
>>>
>>> Bye
>>> T.
>>>
>>>> Sent: Tuesday, 07 January 2020 at 04:42
>>>> From: "LawsonEnglish" <[hidden email]>
>>>> To: [hidden email]
>>>> Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub
>>>>
>>>> Torsten Bergmann wrote
>>>>> Hi,
>>>>>
>>>>>
>>>>> You can load using
>>>>>
>>>>>  Metacello new
>>>>> baseline: 'XMLParserHTML';
>>>>> repository: 'github://pharo-contributions/XML-XMLParserHTML/src';
>>>>> load.
>>>>>
>>>>>
>>>>> Bye
>>>>> T.
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to use the sample code in the pharo screen scraping booklet
>>>> —
>>>> http://books.pharo.org/booklet-Scraping/pdf/2018-09-02-scrapingbook.pdf — but while everything appears to load, I'm getting an odd behavior from:
>>>>
>>>> /| ingredientsXML |
>>>> ingredientsXML := XMLHTMLParser parseURL:
>>>> 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
>>>> ingredientsXML inspect/
>>>>
>>>> "#new was sent to nil"
>>>>
>>>> No matter what URL I use, I get the same message.
>>>>
>>>> I'm using Mac OS Catalina so I thought I might have some strange Mac
>>>> OS security issue (like it was quietly refusing to allow Pharo to
>>>> access the internet), but I tested with squeak and the old
>>>>
>>>> /html :=(HtmlParser parse:
>>>> 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'
>>>> asUrl retrieveContents content)/
>>>>
>>>> and that returns actual html without any problems.
>>>>
>>>>
>>>> Suggestions?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>> L
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
>>>>
>>>>
>>>
>>>
>>>
>>
>
>



Re: [ANN] XMLParserHTML moved to GitHub

LawsonEnglish
In reply to this post by Peter Kenny
Thanks for responding. I won’t say that I’m not screwing up as I’ve had severe health problems (impacting both physical and cognitive abilities to the point that I am permanently on US government disability).

Even so, I did the “Squeak from the very start” videos (https://www.youtube.com/playlist?list=PL6601A198DF14788D) some years ago, and as far as I can tell, I can still understand Smalltalk and its features to that level.

The “do it and go” yields the same “#new was sent to nil” error.

Note that “do it” doesn’t give an error. “inspect” does: “#new was sent to nil”.

As before, 

ingredientsXML = nil.

returns “false”

This could be a plugin issue in Mac Catalina as Apple has added all sorts of arcane security features with the new OS.


Or it just could be my literally crippled brain not seeing something obvious due to the fallout from my health issues.

I have no way of knowing (obviously)


Thanks for responding.

I had hoped to do new videos discussing the neat features of Pharo similar to the “very start” videos, but since I can’t get things started, obviously I can’t make new “from the very start” videos either.

L


On Jan 7, 2020, at 3:04 PM, PBKResearch <[hidden email]> wrote:

I agree it makes no sense. I repeated exactly what you describe in a new playground (in Pharo 6.1 on Windows 10) and all worked as expected – essentially the same result as Torsten reported in his first post. I wonder if it might be something Mac related in the operation of Playground.
 
As a desperate try to explain it, please see what happens if you open a Playground with just your single line
and then select ‘do it and go’. You should find an inspector pane opening to the right in the Playground, with the result of the parse. If this fails, the standard suggestion is to open a debugger on your error message and try to work back through the stack to see how execution got there.
 
Just to discourage you further, when you do get to read the contents of the URL, you will find that the USDA have changed everything. All the data are now on a separate web site, probably in a new layout. This is one of the perpetual hassles of web scraping – the web site authors have to justify their existence by rewriting everything. I wrote this section of the scraping booklet, working up something I had done as a one-off a year or so earlier, and then I found that the USDA had changed the layout in the interim and much needed to be rewritten.
 
HTH – in part at least.
 
Peter Kenny
 
To Torsten – I agree I was slipshod in my drafting – I was in a hurry. Instead of saying ‘can screw things up’ I should have said ‘can produce counter-intuitive results’, as exemplified by the fact that, in your first example, ‘ingredientsXML’ can mean different things depending on whether you execute it all in one go or a line at a time.
 
From: Pharo-users <[hidden email]> On Behalf Of LawsonEnglish
Sent: 07 January 2020 20:55
To: Any question about pharo is welcome <[hidden email]>
Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub
 
I deleted the playground and entered the text thusly
 
 
“do it” has no complaints
 
ingredientsXML = nil 
 
yields “false”
 
ingredientsXML inspect
 
has errors: #new sent to nil
 
 
.
 
This makes no sense at all.
 
 
L
 


On Jan 7, 2020, at 1:55 AM, PBKResearch <[hidden email]> wrote:
 
It may be a quirk of how Pharo Playground works. It doesn't need local variable declarations - which is convenient - but putting them in can screw things up. Try your snippet again without the first line. Compare Torsten's code.

HTH

Peter Kenny

-----Original Message-----
From: Pharo-users <[hidden email]> On Behalf Of Torsten Bergmann
Sent: 07 January 2020 07:47
To: [hidden email]
Cc: [hidden email]
Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub

Works without a problem (Pharo 8 on Windows), see attached. So it looks like a local problem.

Just check the debugger and compare to the squeak version where you run in trouble.
Maybe the document could not be retrieved on your machine.

Bye
T.


Sent: Tuesday, 07 January 2020 at 04:42
From: "LawsonEnglish" <[hidden email]>
To: [hidden email]
Subject: Re: [Pharo-users] [ANN] XMLParserHTML moved to GitHub

Torsten Bergmann wrote

Hi,


You can load using

  Metacello new
               baseline: 'XMLParserHTML';
               repository: 'github://pharo-contributions/XML-XMLParserHTML/src';
               load.


Bye
T.


Hi,

I'm trying to use the sample code in the pharo screen scraping booklet 
 
http://books.pharo.org/booklet-Scraping/pdf/2018-09-02-scrapingbook.pdf — but while everything appears to load, I'm getting an odd behavior from:

/| ingredientsXML |
ingredientsXML := XMLHTMLParser parseURL:
'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'.
ingredientsXML inspect/

"#new was sent to nil"

No matter what URL I use, I get the same message.

I'm using Mac OS Catalina so I thought I might have some strange Mac 
OS security issue (like it was quietly refusing to allow Pharo to 
access the internet), but I tested with squeak and the old

/html :=(HtmlParser parse:
'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'
asUrl retrieveContents content)/

and that returns actual html without any problems.


Suggestions?


Thanks.

L




--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html

