Getting some tag in an HTML file

abergel
Hi!

Together with Nicolas, we are trying to get all the <script …> … </script> elements from HTML files.
We have tried to use XMLDOMParser, but many web pages are not well formed, so the parser complains.

Has anyone tried to get particular tags from HTML files? This looks like a classic thing to do, so maybe some of you have done it.
Is there a way to configure the parser to accept broken XML/HTML content?

Cheers,
Alexandre
--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.




Re: Getting some tag in an HTML file

Paul DeBruicker
You probably want this:

http://smalltalkhub.com/#!/~PharoExtras/Soup


Re: Getting some tag in an HTML file

Tudor Girba-2
In reply to this post by abergel
Hi,

You can also consider island parsing, a very cool addition to PetitParser developed by Jan:

beginScript := '<script>' asParser.
endScript := '</script>' asParser.
script := beginScript , endScript negate star flatten , endScript ==> #second.
islandScripts := (script island ==> #second) star.

If you apply it to:

code := 'uninteresting part
<script>
some code
</script>
another
uninteresting part
<script>
some other
code
</script>
yet another
uninteresting part
'.

You get:
islandScripts parse: code
==>  "#('some code' 'some other
code')"

Quite cool, no? :)

Doru


--

"Every thing has its own flow"

Re: Getting some tag in an HTML file

Nicolas Anquetil

Yep, indeed very cool.

Will try it.

nicolas


Re: Getting some tag in an HTML file

monty-3
In reply to this post by abergel
You need to install XMLParserHTML from the Configuration Browser (Pharo 4) or the Catalog Browser (Pharo 5).

Disabling validation like Vincent said isn't enough, because non-validating XML parsers still expect documents to be well-formed (which includes proper nesting and closing of tags). Validation is just a set of additional constraints, like checking against a DTD if present.

XMLParserHTML will load SAX and DOM parsers that accept messy HTML, but they were kind of an afterthought, so I don't promote them to people who already use something else (like Soup or Todd's parser) and are happy with it. Their advantage is speed (the fastest HTML parsers for Pharo or Squeak by far) and integration with XMLParser and related libs.
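
As a rough illustration of the DOM route, here is a minimal sketch. It assumes XMLParserHTML is loaded and that its XMLHTMLParser class copes with the unclosed tags in the made-up sample; the names html, document, and scripts are illustrative and not from the thread.

```smalltalk
"Sketch: collect the text of every <script> element from messy HTML.
Assumes the XMLParserHTML package is loaded; html is a made-up sample."
| html document scripts |
html := '<html><body><p>unclosed paragraph
<script>alert(1)</script>
<br>
<script type="text/javascript">alert(2)</script>
</body>'.

"XMLHTMLParser builds a DOM even though <p> and <html> are never closed."
document := XMLHTMLParser parse: html.

"allElementsNamed: searches all descendants; contentString answers the text inside each element."
scripts := (document allElementsNamed: 'script')
	collect: [ :each | each contentString ].
```

Evaluating this in a playground and inspecting scripts should show one entry per <script> element.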


Re: Getting some tag in an HTML file

monty-3
In reply to this post by abergel
Doru's suggestion of PetitParser plus islands is good and makes building complex parsers easy, but his toy example gives wrong results for inputs such as:

<script>
function end() {
        document.writeln('</script>');
}
function start() {
        document.writeln('<script>');
}
</script>

or:
<script>
alert('hello');
// <script> alert('world')
</script>

or:
<!--<script>alert('hello')</script>-->

It is better to use one of the proper HTML parsers, at least if you need more correctness.
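
The original question asks about <script …> tags, which may carry attributes, while the toy grammar only matches a bare <script>. Below is a hedged sketch of one way to loosen the opening tag using standard PetitParser combinators. It is an adaptation for illustration, not code from the thread, and it does not address the failure cases listed above.

```smalltalk
"Sketch only: Doru's grammar with an opening tag that tolerates attributes.
Assumes PetitParser and its island extension are loaded."
| beginScript endScript script islandScripts |

"flatten turns the opening tag into a single string result, so that
#second below still selects the script body rather than part of the tag."
beginScript := ('<script' asParser , $> asParser negate star , $> asParser) flatten.
endScript := '</script>' asParser.
script := beginScript , endScript negate star flatten , endScript ==> #second.
islandScripts := (script island ==> #second) star.

script parse: '<script type="text/javascript">alert(1)</script>'.
"answers 'alert(1)'"
```

The flatten on the opening tag matters: without it, the comma operator would append to the existing sequence, and #second would pick the attribute characters instead of the script body.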