HTML Parser?

Chris Hayes-4
Is there any kind of HTML parser available for Dolphin (preferably,
something written in Smalltalk)?

Thanks.

Chris Hayes


Re: HTML Parser?

ar-4
On Mon, 11 Aug 2003 13:30:30 +0000, Chris Hayes wrote:

> Is there any kind of HTML parser available for Dolphin (preferably,
> something written in Smalltalk)?
>
> Thanks.
>
> Chris Hayes

If you're willing to write your HTML as strict XHTML (all tags closed,
attributes quoted, etc.), you could use the ActiveX control to get a DOM,
e.g.:

IXMLDOMDocument new
    validateOnParse: false;
    loadText: '<html></html>';
    lastChild


It's not Smalltalk, but it works as advertised. I wrote a framework on top
of that which converts the DOM into a set of Smalltalk HTML/XML classes.
That's what I am using to work with HTML in Dolphin. If you're interested,
you can download it at:

http://www.reider.net/dolphinsmalltalk/free/software.html
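
For readers without Dolphin at hand, the same idea (parse strict XHTML with
an XML parser, then convert the DOM into your own classes) can be sketched in
Python. This is only an illustration, not the framework linked above: it
assumes the standard library's xml.dom.minidom in place of the ActiveX
parser, and the Element class and wrap function are hypothetical names made
up for this sketch.

```python
from dataclasses import dataclass, field
from xml.dom import minidom


@dataclass
class Element:
    """Hypothetical wrapper class standing in for a framework's node class."""
    tag: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Element instances or str


def wrap(node):
    """Recursively convert a minidom element into an Element tree."""
    element = Element(node.tagName, dict(node.attributes.items()))
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            element.children.append(wrap(child))
        elif child.nodeType == child.TEXT_NODE and child.data.strip():
            # Keep non-blank text, trimmed of surrounding whitespace.
            element.children.append(child.data.strip())
    return element


XHTML = '<html><body><p class="intro">Hello <b>world</b></p></body></html>'

document = minidom.parseString(XHTML)  # raises if the markup is not well-formed
root = wrap(document.documentElement)

paragraph = root.children[0].children[0]
print(paragraph.tag, paragraph.attributes)
```

Once the tree is in plain objects like these, the rest of the program no
longer needs to know which parser produced it.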

-alan r.


Re: HTML Parser?

Bill Schwab-2
In reply to this post by Chris Hayes-4
Chris,

Rumor control indicates that _some_ HTML can be parsed by an XML parser.
FWIW, I have yet to see that work.  Squeak has a nice HTML parser as part of
Scamper, and I've used it a couple of times to find errors that I wasn't
able to spot myself.  IIRC, I had Squeak read the offending HTML from the
clipboard, and display it in an object explorer.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: HTML Parser?

Christopher J. Demers
In reply to this post by Chris Hayes-4
"Chris Hayes" <[hidden email]> wrote in message
news:WTMZa.6307$[hidden email]...
> Is there any kind of HTML parser available for Dolphin (preferably,
> something written in Smalltalk)?

I have used ActiveX wrapping of Internet Explorer for this.  I don't
remember if I did it from MS Access or Dolphin, but it should work from
Dolphin.  I believe you can hide IE, so you can just use it as a parser.

Chris


Re: HTML Parser?

ar-4
In reply to this post by Bill Schwab-2
On Mon, 11 Aug 2003 12:37:41 -0500, Bill Schwab wrote:

> Chris,
>
> Rumor control indicates that _some_ html can be parsed by an XML parser.
> FWIW, I have yet to see that work.

I am doing it routinely in Dolphin for non-trivial web pages, using the
ActiveX XML parser (see my other response). It's been absolutely
unobtrusive except when I forget to close or match up my tags properly.
Then it puts out a very good error message.

The one gotcha I hit was using a '<' (less-than operator) in JavaScript.
That got interpreted as the beginning of a tag. I just changed it to &lt;
and it went away; however, I think the right way was to enclose the
contents of the script in <!-- and --> (thinking to myself, so *that's*
why they do that... :)
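
This gotcha is not specific to the ActiveX parser; any conforming XML parser
should treat a bare '<' as the start of markup. A minimal sketch in Python,
assuming the standard library's xml.dom.minidom as a stand-in (the parses and
page helpers are hypothetical names for this example only):

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses(markup):
    """Return True if the XML parser accepts the markup as well-formed."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


def page(script_body):
    """Wrap a script body in a tiny XHTML-style page (hypothetical helper)."""
    return "<html><head><script>" + script_body + "</script></head></html>"


SCRIPT = "if (a < b) { go(); }"

results = {
    # Bare '<' inside the script: taken as the start of a tag.
    "raw": parses(page(SCRIPT)),
    # '<' written as the &lt; entity.
    "escaped": parses(page(SCRIPT.replace("<", "&lt;"))),
    # Script body wrapped in an XML comment.
    "commented": parses(page("<!-- " + SCRIPT + " -->")),
    # Script body wrapped in a CDATA section.
    "cdata": parses(page("<![CDATA[" + SCRIPT + "]]>")),
    # Unrelated to scripts: a tag that is never closed.
    "mismatched": parses("<html><body></html>"),
}

for name, ok in results.items():
    print(name, "parses" if ok else "is rejected")
```

The raw and mismatched cases are rejected and the other three parse. One
caveat on the comment trick: to an XML parser the script body is then just a
comment, and a "--" inside it would itself be an error, so a CDATA section is
the usual XML way to carry literal '<' characters.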


Re: HTML Parser?

Avi Bryant-3
In reply to this post by Bill Schwab-2
"Bill Schwab" <[hidden email]> wrote in message news:<bh8jvc$vcqgd$[hidden email]>...
> Chris,
>
> Rumor control indicates that _some_ html can be parsed by an XML parser.
> FWIW, I have yet to see that work.  Squeak has a nice HTML parser as part of
> Scamper, and I've used it a couple of times to find errors that I wasn't
> able to spot myself.  IIRC, I had Squeak read the offending HTML from the
> clipboard, and display it in an object explorer.

There's also a stricter HTML parser (by which I mean that it will do
exactly the right thing with conformant HTML 4, but may give
unexpected results for broken markup) on SqueakMap:
http://map1.squeakfoundation.org/sm/package/69778b2e-3884-4490-b18a-1b9a86a201ec

It requires YAXO, which I think has already been ported to Dolphin,
and shouldn't be too hard to port itself.


Re: HTML Parser?

tgkuo
Hi,
     While porting the "HTML-Parser" package to Dolphin, it seemed to me
that it needs the ProtoObject framework (prototype objects? see
http://russell-allen.com/squeak/prototypes/), which is missing in Dolphin.
(This is for "HTMLDocument class", implementing the "canContain: aNode"
method, which would become an aNode object (XMLNodeWithElements) at runtime.)
    Is there another workaround, or have I misunderstood the meaning?


Best regards,
Tk Kuo.

"Avi Bryant" <[hidden email]>wrote:
> "Bill Schwab" <[hidden email]> wrote in message news:<bh8jvc$vcqgd$[hidden email]>...
>
> There's also a stricter HTML parser (by which I mean that it will do
> exactly the right thing with conformant HTML 4, but may give
> unexpected results for broken markup) on SqueakMap:
> http://map1.squeakfoundation.org/sm/package/69778b2e-3884-4490-b18a-1b9a86a201ec
>
> It requires YAXO, which I think has already been ported to Dolphin,
> and shouldn't be too hard to port itself.


Re: HTML Parser?

Chris Uppal-3
In reply to this post by Christopher J. Demers
Christopher J. Demers wrote:

> > Is there any kind of HTML parser available for Dolphin (preferably,
> > something written in Smalltalk)?
>
> I have used ActiveX wrapping of Internet Explorer for this.  I don't
> remember if I did it from MS Access or Dolphin, but it should work
> from Dolphin.  I believe you can hide IE, so you can just use it as a
> parser.

I'd be a bit cautious about that approach.  I think that, by default, the IE
"parser" will also execute any JavaScript, ActiveX, etc., that the page
contains.  MS give an example of how to change the default, but I, personally,
would not trust MS to have got it right if I were considering using IE to parse
HTML that might be hostile.

It may well not have mattered for Chris's application, but it's an issue to
bear in mind.

(In a previous life I worked on an HTML parser and JavaScript engine for use in
a bulk scanning application.  The likelihood was high that we'd scan hostile
(or accidentally nasty) HTML; it grew disheartening that I had to *keep*
explaining that we couldn't possibly use the IE parser....)
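
To illustrate the distinction: a pure parser only tokenises the markup and
never runs what it finds. A minimal sketch in Python using the standard
library's html.parser (this is neither IE nor the scanner described above,
and InertScanner is a hypothetical name for this example):

```python
from html.parser import HTMLParser


class InertScanner(HTMLParser):
    """Tokenise HTML and record script bodies as plain text, never running them."""

    def __init__(self):
        super().__init__()
        self.tags = []        # every start tag seen, in document order
        self.scripts = []     # raw text of each <script> element
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


scanner = InertScanner()
scanner.feed('<html><body onload="x()"><script>alert("hi")</script><p>text</body></html>')
scanner.close()

print(scanner.tags)
print(scanner.scripts)
```

The script body and the onload attribute are just strings to this parser, so
hostile content can be inspected without being given a chance to run.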

    -- chris


Re: HTML Parser?

Chris Hayes-4
In reply to this post by Chris Hayes-4
Hey everyone,

Thanks (as always) for the many helpful responses!  Hopefully, one of these
approaches will do the trick.

Regards,

Chris Hayes


Re: HTML Parser?

Avi Bryant-3
In reply to this post by tgkuo
"kuo" <[hidden email]> wrote in message news:<[hidden email]>...
> Hi,
>      In doing the porting of "HTML-Parser" package to Dolphin, it seemed
> that it needed the framework of ProtoObject ( prototype object ? in
> http://russell-allen.com/squeak/prototypes/), ( for "HTMLDocument class",
> implementing "canContain: aNode" method, that would become aNode object (
> XMLNodeWithElements ) at runtime), which is missing in Dolphin.
>     Is there other workaround or I misunderstood the meaning?

It definitely doesn't need that framework.  ProtoObject is the
superclass of Object in Squeak - subclassing from ProtoObject is like
subclassing from nil in VW.  You also get subclasses of ProtoObject
when you load code into Squeak which uses a missing superclass.

My guess is that you loaded HTML-Parser without having YAXO loaded
first, which is listed as a dependency.  Any classes in the
HTML-Parser that were subclasses of YAXO classes would show up as
subclasses of ProtoObject instead.

Avi


Re: HTML Parser?

tgkuo
Thanks for the explanation; it really was due to the missing YAXO.
  Some unit tests failed after porting the HTML-Parser code, and I don't know
why.
    I am still working on it, and it may take some time since I'm now
restudying XML books in order to get a clearer and more complete picture.
   The mechanism and theory underlying YAXO are quite massive; it needs to
change, evolve, and conform continually with the XML (W3C) standards.
   I've visited its web site at http://www.squeaklet.com/Yax/index.html. I
think I could get acquainted with it quickly if good tutorials, test
files, or examples were available.

Best regards,
Tk Kuo

"Avi Bryant" <[hidden email]> wrote:
> "kuo" <[hidden email]> wrote in message news:<[hidden email]>...
>
> It definitely doesn't need that framework.  ProtoObject is the
> superclass of Object in Squeak - subclassing from ProtoObject is like
> subclassing from nil in VW.  You also get subclasses of ProtoObject
> when you load code into Squeak which uses a missing superclass.
>
> My guess is that you loaded HTML-Parser without having YAXO loaded
> first, which is listed as a dependency.  Any classes in the
> HTML-Parser that were subclasses of YAXO classes would show up as
> subclasses of ProtoObject instead.
>
> Avi