Getting some tag in an HTML file

abergel

Getting some tag in an HTML file

Hi!

Together with Nicolas, we are trying to extract all the <script …> … </script> elements from HTML files.
We have tried XMLDOMParser, but many web pages are not well formed, so the parser complains.

Has anyone tried to extract particular tags from HTML files? This looks like a classic thing to do, so maybe some of you have done it.
Is there a way to configure the parser to accept broken XML/HTML content?

Cheers,
Alexandre
--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.

Paul DeBruicker

Re: Getting some tag in an HTML file

You probably want this:

http://smalltalkhub.com/#!/~PharoExtras/Soup
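
A rough sketch of what that could look like for the <script> case, assuming the Soup package is loaded and that `Soup fromString:` and `findAllTags:` are available as in the versions I have seen. The HTML string is a made-up sample, not taken from the original question:

```smalltalk
"Sketch only: assumes the Soup package is loaded.
 The HTML below is a hypothetical sample with an unclosed <p>."
| html soup scripts |
html := '<html><body><p>unclosed paragraph
<script>alert(1);</script>
</body></html>'.
soup := Soup fromString: html.
"Collect every tag named script, wherever it appears in the document."
scripts := soup findAllTags: 'script'.
scripts size
```

Inspecting `scripts` in a playground shows the matched tags; how the script bodies are exposed depends on the Soup version in your image.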

Tudor Girba-2

Re: Getting some tag in an HTML file

In reply to this post by abergel

Hi,

You can also consider island parsing, a very cool addition to PetitParser developed by Jan:

beginScript := '<script>' asParser.

endScript := '</script>' asParser.

script := beginScript , endScript negate star flatten , endScript ==> #second.

islandScripts := (script island ==> #second) star.

If you apply it on:

code := 'uninteresting part

<script>

some code

</script>

another

uninteresting part

<script>

some other

code

</script>

yet another

uninteresting part

'.

You get:

islandScripts parse: code

==> "#('some code' 'some other

code')"

Quite cool, no? :)

Doru

www.tudorgirba.com

"Every thing has its own flow"

Nicolas Anquetil

Re: Getting some tag in an HTML file

yep, indeed very cool

will try it

nicolas

monty-3

Re: Getting some tag in an HTML file

In reply to this post by abergel

You need to install XMLParserHTML from the Config (Pharo4) or Catalog (Pharo5) browser.

Disabling validation like Vincent said isn't enough, because non-validating XML parsers still expect documents to be well-formed (which includes proper nesting and closing of tags). Validation is just a set of additional constraints, like checking against a DTD if present.

XMLParserHTML will load SAX and DOM parsers that accept messy HTML, but they were kind of an afterthought, so I don't promote them to people who already use something else (like Soup or Todd's parser) and are happy with it. Their advantage is speed (the fastest HTML parsers for Pharo or Squeak by far) and integration with XMLParser and related libs.
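
For the original question, a minimal sketch with the DOM parser might look like the following. It assumes XMLParserHTML is loaded and uses `XMLHTMLParser parse:`, `allElementsNamed:` and `contentString`; the HTML string is a hypothetical sample with a deliberately unclosed <p>:

```smalltalk
"Sketch only: assumes XMLParserHTML (and XMLParser) are loaded.
 The HTML below is a made-up, not well-formed sample."
| html document scripts |
html := '<html><body><p>unclosed paragraph
<script>alert(1);</script>
<script type="text/javascript">alert(2);</script>
</body></html>'.
"XMLHTMLParser tolerates the missing </p> that XMLDOMParser would reject."
document := XMLHTMLParser parse: html.
"Gather every script element in the parsed tree."
scripts := document allElementsNamed: 'script'.
"Print the text content of each script element to the Transcript."
scripts do: [ :each | Transcript show: each contentString; cr ]
```

Because the result is an ordinary XMLParser DOM, the same queries used on well-formed XML documents apply to it.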

monty-3

Re: Getting some tag in an HTML file

In reply to this post by abergel

Doru's suggestion of PP + islands is good and makes building complex parsers easy, but his toy example gives wrong results on input like:

<script>
function end() {
document.writeln('</script>');
}
function start() {
document.writeln('<script>');
}
</script>

or:
<script>
alert('hello');
// <script> alert('world')
</script>

or:


Better to use one of the proper HTML parsers, at least if you need more correctness.