Hi,
I'm parsing some HTML docs, but the XMLTokenizer chokes on a '&' followed by a space in a string.
I guess '&' is used for things other than 'and' in HTML, and it causes an error when used in plain text.

Does anybody have a fix for this?

Karl

I guess this will not help you, but a standalone ampersand is not valid XML. It introduces an entity reference, so a literal ampersand in the text must be written as &amp;. Hence I would not expect any XML tokenizer or parser implementation to accept it.

HTML is more relaxed about this, so a standalone ampersand is valid there, but you would need some kind of HTMLTokenizer, and I do not know whether there is such a thing for Squeak. Does anyone else know of one?

Best regards
Jakob

2015-06-01 20:05 GMT+02:00 karl ramberg <[hidden email]>:
> Hi,
> I'm parsing some html docs but the XMLTokenizer chokes on a '&' followed by
> a space in a string.
> I guess '&' is used for other stuff than a 'and' in html and it causes error
> when used in plain text.
>
> Does anybody have fix for this?
>
> Karl

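To make the point about &amp; concrete, here is a minimal sketch. It uses Python's standard library purely for illustration, since the thread shows no Squeak code; the helper names escape_bare_ampersands and parses_as_xml are made up for this example. It shows an XML parser rejecting a bare '&', and one possible workaround that escapes ampersands that do not already start an entity or character reference. The workaround does not help with named HTML entities such as &nbsp;, which XML does not predefine.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does NOT begin an entity reference (&name;)
# or a numeric character reference (&#38; or &#x26;).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace standalone ampersands with &amp; (hypothetical helper)."""
    return BARE_AMP.sub("&amp;", text)


def parses_as_xml(text):
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


broken = "<p>Tom & Jerry</p>"
fixed = escape_bare_ampersands(broken)

print(parses_as_xml(broken))  # the bare '&' is rejected
print(fixed)
print(parses_as_xml(fixed))   # the escaped form is accepted
```

Pre-escaping like this is a stopgap for input that is almost XML. For real-world HTML, a parser built for HTML is the more robust choice.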
Hi,
thanks for the info.
I guess I need an HTMLTokenizer for what I'm doing. I had issues with &nbsp; as well, with the current XMLTokenizer.

Karl

On Mon, Jun 1, 2015 at 11:01 PM, Jakob Reschke <[hidden email]> wrote:
> I guess this will not help you, but a standalone ampersand is not

XMLTokenizer is not suitable for parsing HTML documents. XML and HTML may look similar, but they are very different.
We used to use Soup[1] to parse HTML pages.

Levente

[1] http://squeaksource.com/Soup.html (watch out for versions which may not be Squeak-compatible)

On Tue, 2 Jun 2015, karl ramberg wrote:
> Hi, thanks for the info.
> I guess I need a HTMLTokenizer for what I'm doing. I had issues with &nbsp; as well, with the current XMLTokenizer
>
> Karl

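As a rough illustration of why an HTML parser copes with this input, here is a small sketch using Python's standard-library html.parser. This is not Soup and not a Squeak API; the class TextCollector and the function html_text are names invented for the example. It only shows that an HTML-oriented parser accepts a bare '&' and decodes &nbsp; without complaint.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text content of an HTML fragment (illustrative only)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; in text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; a bare '&' is passed through as-is.
        self.chunks.append(data)


def html_text(markup):
    """Return the concatenated text of the given HTML markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


text = html_text("<p>Tom & Jerry&nbsp;Show</p>")
print(repr(text))  # the &nbsp; becomes a non-breaking space, U+00A0
```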
On Mon, Jun 1, 2015 at 9:10 PM, Levente Uzonyi <[hidden email]> wrote:
> XMLTokenizer is not suitable to parse HTML documents. XML and HTML may look
> similar, but are very different.
> We used to use Soup[1] to parse HTML pages.

Have you used Todd Blanchard's "HTML & CSS Validating Parser" [1]? If so, how does it compare to Soup?

[1] -- http://www.squeaksource.com/htmlcssparser.html

Hi,
I tested three different HTML parsers and found Soup to work best for my needs.
Thank you all.

Karl

On Tue, Jun 2, 2015 at 6:17 PM, Chris Muller <[hidden email]> wrote:
> On Mon, Jun 1, 2015 at 9:10 PM, Levente Uzonyi <[hidden email]> wrote: