Hi!

Together with Nicolas, we are trying to extract all the <script …> … </script> elements from HTML files. We tried XMLDOMParser, but many web pages are not well formed, so the parser complains.

Has anyone tried to get particular tags out of HTML files? This looks like a classic thing to do, so maybe some of you have done it. Is there a way to configure the parser to accept broken XML/HTML content?

Cheers,
Alexandre

--
Alexandre Bergel http://www.bergel.eu
|
You probably want this:
http://smalltalkhub.com/#!/~PharoExtras/Soup
|
In reply to this post by abergel
Hi,

    beginScript := '<script>' asParser.
    endScript := '</script>' asParser.
    script := beginScript , endScript negate star flatten , endScript ==> #second.
    islandScripts := (script island ==> #second) star.

If you apply it on:

    code := 'uninteresting part <script> some code </script> another uninteresting part <script> some other code </script> yet another uninteresting part '.

You get:

    islandScripts parse: code
    ==> "#('some code' 'some other code')"

Quite cool, no? :)

Doru

On Fri, Aug 14, 2015 at 1:31 AM, Alexandre Bergel <[hidden email]> wrote:
> Hi!
|
Yep, indeed very cool. Will try it.

Nicolas

On 14/08/2015 11:40, Tudor Girba wrote:
|
In reply to this post by abergel
You need to install XMLParserHTML from the Configuration Browser (Pharo 4) or the Catalog Browser (Pharo 5).
Disabling validation like Vincent said isn't enough, because non-validating XML parsers still expect documents to be well-formed (which includes proper nesting and closing of tags). Validation is just a set of additional constraints, like checking against a DTD if one is present.

XMLParserHTML will load SAX and DOM parsers that accept messy HTML, but they were kind of an afterthought, so I don't promote them to people who already use something else (like Soup or Todd's parser) and are happy with it. Their advantages are speed (they are the fastest HTML parsers for Pharo or Squeak by far) and integration with XMLParser and related libs.
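To make the well-formedness point concrete outside Pharo, here is a minimal sketch using only Python's standard library. It is a stand-in illustration, not XMLParser or XMLParserHTML: `xml.etree.ElementTree` plays the role of a non-validating XML parser and `html.parser` the role of a tolerant HTML parser. The names `MESSY`, `is_well_formed`, `TagLogger` and `opened_tags` are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical messy HTML: <p> is never closed and <br> has no end tag.
MESSY = "<html><body><p>first<p>second<br></body></html>"


def is_well_formed(text):
    """Return True if a non-validating XML parser accepts the text."""
    try:
        ET.fromstring(text)  # no DTD validation happens here
        return True
    except ET.ParseError:
        return False


class TagLogger(HTMLParser):
    """Tolerant HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.opened = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)


def opened_tags(text):
    logger = TagLogger()
    logger.feed(text)
    logger.close()
    return logger.opened


print(is_well_formed(MESSY))  # rejected although nothing is validated
print(opened_tags(MESSY))     # the tolerant parser still reports the tags
```

The XML parser rejects the input purely because of nesting and closing errors, with no DTD involved, while the tolerant parser walks through the same text without complaint.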
In reply to this post by abergel
Doru's suggestion of PP (PetitParser) + islands is good and makes building complex parsers easy, but his toy example gives wrong results for inputs like this:

    <script> function end() { document.writeln('</script>'); } function start() { document.writeln('<script>'); } </script>

or:

    <script> alert('hello'); // <script> alert('world') </script>

or:

    <!--<script>alert('hello')</script>-->

It's better to use one of the proper HTML parsers, at least if you need more correctness.
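As a rough illustration of the commented-out case, here is a sketch in Python (standard library only, so not PetitParser and not one of the Pharo HTML parsers). The regex in `naive_scripts` is only an approximation of the toy grammar (a literal `<script>` up to the first `</script>`), and `ScriptCollector`, `parsed_scripts`, `COMMENTED` and `WITH_ATTRS` are hypothetical names for this example. The `WITH_ATTRS` input is an extra case of mine, not one from the examples above.

```python
import re
from html.parser import HTMLParser


def naive_scripts(text):
    """Approximate the toy grammar: literal '<script>' up to the first '</script>'."""
    found = re.findall(r"<script>(.*?)</script>", text, re.DOTALL)
    return [body.strip() for body in found]


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._buffer = None  # list of text chunks while inside <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append("".join(self._buffer))
            self._buffer = None


def parsed_scripts(text):
    collector = ScriptCollector()
    collector.feed(text)
    collector.close()
    return [body.strip() for body in collector.scripts]


COMMENTED = "<!--<script>alert('hello')</script>-->"
WITH_ATTRS = '<script type="text/javascript">alert(1)</script>'

print(naive_scripts(COMMENTED), parsed_scripts(COMMENTED))
print(naive_scripts(WITH_ATTRS), parsed_scripts(WITH_ATTRS))
```

The literal scan reports the script that sits inside the HTML comment and misses the tag that carries attributes, while the HTML-aware parser skips the comment and picks up the attributed tag.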