Getting some tag in an HTML file

abergel
Hi!

Together with Nicolas, we are trying to extract all the <script …> … </script> elements from HTML files.
We have tried to use XMLDOMParser, but many web pages are not well formed, so the parser complains.

Has anyone tried to extract particular tags from HTML files? This looks like a classic thing to do, so maybe some of you have done it.
Is there a way to configure the parser to accept broken XML/HTML content?

Cheers,
Alexandre
--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.




_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

Re: Getting some tag in an HTML file

Vincent Blondeau
Hi,

Look at the class side: there is a method parse: namespace: validation: . Call it instead of parse:, passing false for the last two arguments. It should work.

Anyway, you should use the SAX parser. It is faster and uses less memory, and it makes it very simple to get only one tag.

Cheers
Vincent


Re: Getting some tag in an HTML file

Hannes Hirzel
http://ss3.gemtalksystems.com/ss/Tabular.html

contains an application example of a SAX parser. You only pick what is
of interest.


Re: [Pharo-dev] Getting some tag in an HTML file

Tudor Girba-2
In reply to this post by abergel
Hi,

You can also consider using island parsing, this very cool addition to PetitParser developed by Jan:

beginScript := '<script>' asParser.
endScript := '</script>' asParser.
script := beginScript , endScript negate star flatten , endScript ==> #second.
islandScripts := (script island ==> #second) star.

If you apply it on:

code := 'uninteresting part
<script>
some code
</script>
another
uninteresting part
<script>
some other
code
</script>
yet another
uninteresting part
'.

You get:
islandScripts parse: code
==>  "#('some code' 'some other
code')"

Quite cool, no? :)

Doru


--

"Every thing has its own flow"


Re: [Pharo-dev] Getting some tag in an HTML file

Floyd May
If your scripts contain string literals with '<script>' or '</script>' in them (I've seen this before), then your mileage may vary with Tudor's approach. Also consider that script tags may have attributes, and those attributes may use single or double quotes. Script tags also do not necessarily contain JavaScript: many JavaScript libraries use script tags for HTML template sources, for instance. These tags you'd probably want to keep (and perhaps follow the reference for the third):

<script type='text/javascript'> [code here] </script>
<script type='text/javascript'> document.write('<script src="somewhere.js"></script>');</script> <!-- here be dragons! -->
<script type='text/javascript' src="path/to/javascript/source.js"></script>

However, something like this you might want to ignore:
<script type='text/html' id='someTemplate'>
  <span>{{some template syntax}}</span>
</script>
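
To make the keep/ignore distinction concrete, here is a minimal sketch in Python using the standard library's tolerant, event-driven html.parser module. The names ScriptCollector and is_javascript are made up for this illustration, and treating a missing type attribute as JavaScript is an assumption, not a rule taken from this thread.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect (attributes, body) pairs for every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # finished (attrs dict, body text) pairs
        self._attrs = None   # attributes of the script being read, if any
        self._chunks = []    # text pieces of the script being read

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        # Script bodies may arrive in several pieces, so buffer them.
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._attrs is not None:
            self.scripts.append((self._attrs, "".join(self._chunks)))
            self._attrs = None


def is_javascript(attrs):
    # Assumption: no type attribute means JavaScript, and only these two
    # MIME types count as JavaScript.
    script_type = attrs.get("type", "text/javascript").lower()
    return script_type in ("text/javascript", "application/javascript")


html = """
<p>unclosed paragraph
<script type='text/javascript'> [code here] </script>
<script type='text/javascript' src="path/to/javascript/source.js"></script>
<script type='text/html' id='someTemplate'>
  <span>{{some template syntax}}</span>
</script>
"""

collector = ScriptCollector()
collector.feed(html)
collector.close()

# Keep the JavaScript entries, drop the HTML template.
kept = [(attrs, body) for attrs, body in collector.scripts if is_javascript(attrs)]
for attrs, body in kept:
    print(attrs, repr(body))
```

The unclosed <p> does not bother the parser, and the script with only a src attribute comes back with an empty body, so the reference can be followed separately.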

If you can make some assumptions about what you're parsing you might be able to adapt Tudor's solution to be more robust. However, if you're trying for a general-purpose solution, I'd highly recommend using an existing HTML parsing library, not an XML parser.

In general, parsing HTML as XML is the wrong approach. HTML is technically not a subset of XML (closing tags aren't required), so most true XML parsers are going to barf on it.
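
A small Python sketch of that difference, using only the standard library: the same page that browsers accept is rejected by an XML parser and read without complaint by an HTML parser. The sample page and the TagLogger class are invented for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML that browsers accept, but not well-formed XML:
# the <p> and <br> tags are never closed.
page = "<html><body><p>first<br><p>second</body></html>"

try:
    ET.fromstring(page)
    xml_ok = True
except ET.ParseError as error:
    xml_ok = False
    print("XML parser gave up:", error)


class TagLogger(HTMLParser):
    """Record every start tag the tolerant HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


logger = TagLogger()
logger.feed(page)
logger.close()
print("HTML parser saw:", logger.tags)
```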

Some further reading:
https://en.wikipedia.org/wiki/Tag_soup
https://en.wikipedia.org/wiki/HTML5#XHTML5_.28XML-serialized_HTML5.29

I'm new to Smalltalk so I can't recommend a library, but I've used Tag Soup in Java and Beautiful Soup in Python.

Hope this helps,

Floyd


Re: [Pharo-dev] Getting some tag in an HTML file

Hannes Hirzel
A question about Soup:

https://ci.inria.fr/pharo-contribution/job/Soup/

The tests run for Pharo 2 and Pharo 3.

Who needs to be contacted to set up a test for Pharo 4?

--Hannes


Re: [Pharo-dev] Getting some tag in an HTML file

Tudor Girba-2
In reply to this post by Floyd May
Welcome, Floyd, and thanks for participating :).

Indeed, my suggestion was only a starter and was meant to be used as a prototype. I wanted to remind people that we have this cool parser engine that can be used in many ways.

So, I spent 10 more minutes to deal with the cases you just mentioned:

attributes := '>' asParser negate star flatten.
beginScript := '<script' asParser , attributes , '>' asParser ==> #second.
endScript := '</script>' asParser.
string := $' asParser , $' asParser negate star, $' asParser.
code := (string / endScript negate) star flatten.
script := beginScript , code , endScript ==> [:t | t first -> t second].
islandScripts := (script island ==> #second) star.

When applied:

input := '
something irrelevant
<script> [ simple script ] </script>
something else irrelevant
<script type=''text/javascript''> [code here] </script>
yet something
else irrelevant
<script type=''text/javascript''> document.write(''<script src="somewhere.js"></script>'');</script> <!-- here be dragons! -->'.

(islandScripts parse: input)

You get:

 "{''->' [ simple script ] '.
' type=''text/javascript'''->' [code here] '.
' type=''text/javascript'''->' document.write(''<script src=""somewhere.js""></script>'');'}"
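
For readers without PetitParser at hand, the same idea (step over single-quoted literals while scanning for the closing tag) can be approximated with a regular expression. This is only a rough Python sketch of the technique, not a translation of the grammar above; the pattern and the variable names are illustrative. Like the prototype, it ignores double-quoted literals and a '>' inside attribute values.

```python
import re

# <script, then attributes up to '>', then a body made of either whole
# single-quoted literals or any character that does not start </script>.
SCRIPT = re.compile(
    r"<script([^>]*)>((?:'[^']*'|(?!</script>).)*)</script>",
    re.DOTALL,
)

text = """
something irrelevant
<script> [ simple script ] </script>
something else irrelevant
<script type='text/javascript'> [code here] </script>
yet something
else irrelevant
<script type='text/javascript'> document.write('<script src="somewhere.js"></script>');</script> <!-- here be dragons! -->"""

# Each match is an (attributes, body) pair, like the associations above.
found = SCRIPT.findall(text)
for attributes, body in found:
    print(repr(attributes), "->", repr(body))
```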


And of course, the playground makes it reasonably easy to prototype:

Inline image 1

Cheers,
Tudor

