I've been looking for a nice and fast HTML parser.
I've found Zulq Alam's Soup (http://www.squeaksource.com/@vHckXt8_6gVtXFxy/XMrjDbIs). It looks nice, but it's way too slow for me: it takes 5 sec to parse the page, while my current Lisp parser takes about 1 sec.
I found another one, Todd Blanchard's HTML and CSS parser (http://www.squeaksource.com/@iMgHmTKVxU00wEdz/A0jkqk71), but I couldn't load it into Pharo 1.1 or Squeak 4.1. It complains about a syntax error and leaves a progress bar on screen that I can't kill.
Could anyone (Todd?) take a look at the parser and figure out how to fix it?

What other options do I have for an HTML parser?
Looking at Pharo's speed, I wonder if there is any way to optimize it. Is a JIT or some other speed optimization planned for Pharo/Squeak?

Thank you,
Andrei |
On Wed, Aug 18, 2010 at 7:50 AM, Andrei Stebakov <[hidden email]> wrote:
> I've been looking for a nice and fast HTML parser.

What do you need to do?
There's XMLSupport http://www.squeaksource.com/XMLSupport.html
Scamper might have a standalone HTML parser http://www.squeaksource.com/Scamper.html
The CogVM has JIT.

Laurent.
|
Web page scraping. XML parser chokes on bad html input.
On Wed, Aug 18, 2010 at 2:34 AM, laurent laffont <[hidden email]> wrote:
> What do you need to do ?
> There's XMLSupport http://www.squeaksource.com/XMLSupport.html
> Scamper might have a standalone HTML
> parser http://www.squeaksource.com/Scamper.html
> The CogVM has JIT.
> Laurent.
>
>> _______________________________________________
>> Pharo-project mailing list
>> [hidden email]
>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project |
In reply to this post by laurent laffont
I tried to load Scamper's Network-HTML, I got a Syntax Error during reloading:

HtmlTokenizer private-initialization initialize:
initialize: s
text _ s withSqueakLineEndings.
pos _ Nothing more expected ->1.
textAreaLevel _ 0.

On Wed, Aug 18, 2010 at 2:34 AM, laurent laffont <[hidden email]> wrote:
> Scamper might have a standalone HTML
> parser http://www.squeaksource.com/Scamper.html |
In reply to this post by Andrei Stebakov
On Tue, Aug 17, 2010 at 10:50 PM, Andrei Stebakov <[hidden email]> wrote:
> I've been looking for a nice and fast HTML parser.

Have you tried Cog as Laurent suggests? It may make the difference you need. In any case I'd be interested in the speed comparison.
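
One way such a comparison could be measured is to evaluate the same workspace snippet in the same image on each VM. This is only a sketch: 'page.html' is a placeholder file name, and the parse expression assumes Soup is loaded and offers a class-side #fromString: entry point. Any other parser call can be put in the block instead.

```smalltalk
"Workspace sketch: time one parse, in milliseconds, on whichever VM runs the image.
 'page.html' is a placeholder; Soup fromString: is an assumed entry point."
| html elapsed |
html := (FileStream readOnlyFileNamed: 'page.html') contentsOfEntireFile.
elapsed := Time millisecondsToRun: [ Soup fromString: html ].
Transcript show: 'Parse time: ', elapsed printString, ' ms'; cr.
```

Running it a few times and discarding the first run should give a steadier number, since the first run may include one-off warm-up costs.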
> I found another one, Todd Blanchard's HTML and CSS parser

cheers,
Eliot |
In reply to this post by Andrei Stebakov
On Wed, Aug 18, 2010 at 5:55 PM, Andrei Stebakov <[hidden email]> wrote:
> I tried to load Scamper's Network-HTML, I got a Syntax Error during reloading:

That code uses underscore as assignment, which is no longer allowed in Pharo 1.1 unless you explicitly enable a specific setting.

So either enable that setting or update the code (in another image).

cheers
mariano
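
As a sketch of what updating the code could look like for the method in the error report: each underscore assignment becomes := . This assumes the offending line originally read "pos _ 1." (the "Nothing more expected ->" text looks like the compiler's error marker rather than source) and that nothing else in the method needs to change.

```smalltalk
"HtmlTokenizer>>initialize: rewritten with := assignments.
 Assumption: the original source was 'pos _ 1.' before the compiler
 inserted its error marker."
initialize: s
	text := s withSqueakLineEndings.
	pos := 1.
	textAreaLevel := 0.
```

The rest of the package presumably uses underscores in the same way, so every method would need the same treatment before it loads cleanly.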
|
Where can I make this setting?
On Wed, Aug 18, 2010 at 1:14 PM, Mariano Martinez Peck <[hidden email]> wrote:
> That code is using underscore as assigment, don't allowed anymore in Pharo
> 1.1 unless you explicity set a specific setting.
>
> So....or set that setting or update the code (in another image)
>
> cheers
>
> mariano |
In reply to this post by laurent laffont
Is there a one-click image for CogVM somewhere so I can download it?
On Wed, Aug 18, 2010 at 2:34 AM, laurent laffont <[hidden email]> wrote:
> The CogVM has JIT.
> Laurent. |
In reply to this post by laurent laffont
As for Scamper, when I try to evaluate (in Pharo 1.1)

tok := HtmlTokenizer on: '<html />'.

there is an error:

Error: My subclass should have overridden #contents
Proceed
Abandon
Debug
HtmlTokenizer(Object)>>error:
HtmlTokenizer(Object)>>subclassResponsibility
HtmlTokenizer(Stream)>>contents
HtmlTokenizer(Stream)>>printOn:
[] in HtmlTokenizer(Object)>>printStringLimitedTo:
String class(SequenceableCollection class)>>streamContents:limitedTo:
HtmlTokenizer(Object)>>printStringLimitedTo:
HtmlTokenizer(Object)>>printString
TextMorphForShoutEditor(ParagraphEditor)>>printIt
[] in TextMorphForShoutEditor(ParagraphEditor)>>printIt:
TextMorphForShoutEditor(ParagraphEditor)>>terminateAndInitializeAround:
TextMorphForShoutEditor(ParagraphEditor)>>printIt:
TextMorphForShoutEditor(ParagraphEditor)>>dispatchOnKeyEvent:with:
TextMorphForShoutEditor(TextMorphEditor)>>dispatchOnKeyEvent:with:
TextMorphForShoutEditor(ParagraphEditor)>>keystroke:
TextMorphForShoutEditor(TextMorphEditor)>>keystroke:
[] in [] in TextMorphForShout(TextMorph)>>keyStroke:
TextMorphForShout(TextMorph)>>handleInteraction:
TextMorphForShout(TextMorphForEditView)>>handleInteraction:
[] in TextMorphForShout(TextMorph)>>keyStroke:

On Wed, Aug 18, 2010 at 2:34 AM, laurent laffont <[hidden email]> wrote:
> Scamper might have a standalone HTML
> parser http://www.squeaksource.com/Scamper.html |
I am sorry, this error only happens when I try to print it instead of do it.
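
The difference seems to follow from the stack trace: print it sends #printString to the result, which ends up in Stream>>printOn:, and that calls #contents, which HtmlTokenizer apparently does not override. A small workspace sketch of the two cases, assuming Scamper's Network-HTML package is loaded:

```smalltalk
"Workspace sketch, assuming HtmlTokenizer from Network-HTML is loaded."
| tok |
"do it: only creates the tokenizer, nothing is printed."
tok := HtmlTokenizer on: '<html />'.

"print it would also send #printString, which is where the error is raised.
 Catching it here shows the message on the Transcript instead of opening a debugger."
[ tok printString ]
	on: Error
	do: [ :ex | Transcript show: ex messageText; cr ].
```

So the tokenizer itself may well be usable; it is only its printed representation that fails.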
On Wed, Aug 18, 2010 at 2:10 PM, Andrei Stebakov <[hidden email]> wrote:
> As for Scamper when I try to evaluate (in Pharo 1.1)
> tok := HtmlTokenizer on: '<html />'.
>
> There is an error:
>
> Error: My subclass should have overridden #contents |
In reply to this post by Andrei Stebakov
No, CogVM is not ready for us.
> Is there a one-click image for CogVM somewhere so I can download it? |
In reply to this post by Andrei Stebakov
On Wed, Aug 18, 2010 at 7:48 PM, Andrei Stebakov <[hidden email]> wrote:
> Is there a one-click image for CogVM somewhere so I can download it?

It's planned, but for now it seems you have to build it yourself.

Laurent
|
In reply to this post by Andrei Stebakov
Todd Blanchard's HTML and CSS parser at http://www.squeaksource.com/htmlcssparser now loads in Squeak 4.1 and Pharo 1.1. It can be found at http://www.squeaksource.com/SPDProjectUpdates (HTML package).
I'm forwarding my post about this experience from the Pharo list to a new thread to talk about improving the situation for community contribution for non-supported packages.
Cheers,
Sean |
In reply to this post by stephane ducasse-2
I will try to push a CogVM for the Mac this weekend; Eliot and I are planning some time then to get this out the door.
On 2010-08-18, at 2:05 PM, stephane ducasse wrote:
> no CogVM is not ready for us.

--
===========================================================================
John M. McIntosh <[hidden email]> Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
=========================================================================== |
In reply to this post by stephane ducasse-2
I only have a cellphone with me, so I can't check, but I think if you file in changesNecessaryForCogToWork.cs (I might have the filename wrong), Cog will probably work. Do note that after saving an image with Cog, you won't be able to open it with a classic VM.
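
For anyone unsure how to file in a change set from a workspace, a minimal sketch follows. The file name is the one recalled above and may be wrong, and the file is assumed to sit in the image's default directory. Working on a copy of the image seems wise, given the note about classic VMs.

```smalltalk
"Sketch only: the file name is unverified and assumed to be in the default directory."
(FileStream readOnlyFileNamed: 'changesNecessaryForCogToWork.cs') fileIn.
```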
On Aug 18, 2010, at 2:05 PM, stephane ducasse <[hidden email]> wrote:
> no CogVM is not ready for us.
>
>> Is there a one-click image for CogVM somewhere so I can download it? |
In reply to this post by johnmci
That would be really great. As I mentioned before, I have been using the CogVM since its release and it is pretty stable (with the exception of crashes due to this socket problem).

Is there a place to report possible bugs related to it, or is this mailing list the most appropriate place?

Cheers,
Doru

On 19 Aug 2010, at 01:26, John M McIntosh wrote:
> I will try to push a CogVM for the mac this weekend, Eliot and I are
> planing some time then to get this out the door.

--
www.tudorgirba.com

"Speaking louder won't make the point worthier." |