XMLTokenizer problem with ampersand

Karl Ramberg
Hi,
I'm parsing some HTML docs, but the XMLTokenizer chokes on a '&' followed by a space in a string.
I guess '&' is used for something other than 'and' in HTML, so it causes an error when used in plain text.

Does anybody have a fix for this?

Karl



Re: XMLTokenizer problem with ampersand

Jakob Reschke-2
I guess this will not help you, but a standalone ampersand is not
valid XML. It is the leader for entities; if you want to have a
literal ampersand in the text, the markup must be &amp;. Hence I
would not expect any XML tokenizer or parser implementation to accept
it.
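
To illustrate this outside Squeak, here is a minimal sketch that uses Python's standard library as a stand-in (an assumption; the thread itself concerns Squeak's XMLTokenizer, and `parses_as_xml` is a hypothetical helper name). It shows a conforming XML parser rejecting a bare '&' and accepting the escaped form.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses_as_xml(markup):
    """Return True if markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A bare ampersand starts an entity reference, so this is not well-formed.
bare = "<p>Tom & Jerry</p>"
# escape() rewrites '&' as '&amp;', which is well-formed XML.
escaped = "<p>" + escape("Tom & Jerry") + "</p>"

print(parses_as_xml(bare))
print(parses_as_xml(escaped))
# The parser decodes the entity back to a literal ampersand.
print(ET.fromstring(escaped).text)
```

Escaping is only an option when you control the markup being produced; for HTML pages from elsewhere, a parser built for HTML is usually the better fit.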

HTML is more relaxed about this, so a standalone ampersand is valid,
but you would need some kind of HTMLTokenizer, and I do not know
whether there is such a thing for Squeak. Does anyone else know of one?
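
As a rough sketch of what an HTML-aware tokenizer does differently, the following uses Python's `html.parser` module (an assumed stand-in, not a Squeak class; `TextCollector` and `html_text` are hypothetical names). It extracts the same text whether the ampersand is bare or escaped.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text content of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(markup):
    """Return the concatenated text found in markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# A bare '&' followed by a space is passed through as ordinary text.
print(html_text("<p>Tom & Jerry</p>"))
# The escaped form decodes to the same text.
print(html_text("<p>Tom &amp; Jerry</p>"))
```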

Best regards
Jakob



Re: XMLTokenizer problem with ampersand

Karl Ramberg
Hi,
thanks for the info.
I guess I need an HTMLTokenizer for what I'm doing. I also had issues with &nbsp in the current XMLTokenizer.
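
The non-breaking space entity fails for a related reason: XML predefines only five named entities (amp, lt, gt, apos, quot), while nbsp belongs to HTML's entity set. A minimal sketch, using Python's standard library as an assumed stand-in rather than Squeak's XMLTokenizer:

```python
import html
import xml.etree.ElementTree as ET

# 'nbsp' is not declared in plain XML, so the parser reports an undefined entity.
try:
    ET.fromstring("<p>a&nbsp;b</p>")
    nbsp_ok = True
except ET.ParseError:
    nbsp_ok = False

# A numeric character reference needs no declaration and is valid XML.
numeric = ET.fromstring("<p>a&#160;b</p>").text

# HTML-aware decoding knows the named entity.
decoded = html.unescape("a&nbsp;b")

print(nbsp_ok)
print(numeric == decoded)
```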

Karl


Re: XMLTokenizer problem with ampersand

Levente Uzonyi-2
XMLTokenizer is not suitable to parse HTML documents. XML and HTML may
look similar, but are very different.
We used to use Soup[1] to parse HTML pages.

Levente

[1] http://squeaksource.com/Soup.html (watch out for versions which may
not be Squeak-compatible)
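
The differences go beyond ampersands. As a small sketch (Python's standard library standing in for both kinds of parser, which is an assumption; it is not Soup's API, and `TagCollector` is a hypothetical name), ordinary HTML with an unclosed `<br>` and an unquoted attribute value is tokenized by an HTML parser but rejected by an XML parser:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags in the order the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Ordinary HTML: <br> has no end tag and the attribute value is unquoted.
page = "<p class=intro>one<br>two</p>"

collector = TagCollector()
collector.feed(page)
collector.close()

# The same input is not well-formed XML.
try:
    ET.fromstring(page)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

print(collector.tags)
print(xml_ok)
```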


Re: XMLTokenizer problem with ampersand

Chris Muller-3
On Mon, Jun 1, 2015 at 9:10 PM, Levente Uzonyi <[hidden email]> wrote:
> XMLTokenizer is not suitable to parse HTML documents. XML and HTML may look
> similar, but are very different.
> We used to use Soup[1] to parse HTML pages.

Have you used Todd Blanchard's "HTML & CSS Validating Parser" [1]? If
so, how does it compare to Soup?

[1] -- http://www.squeaksource.com/htmlcssparser.html


Re: XMLTokenizer problem with ampersand

Karl Ramberg
Hi,
I tested three different HTML parsers and found Soup to work best for my needs.
Thank you all.

Karl
