Problems loading XML System ( was Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

Peter Kenny
Monty

As an update, I have rebuilt from the Moose 6.0 download. The version of XML-Parser in that was dated 18 July 2016 (configuration monty.233), so I installed versions of XML-Parser-HTML and XML-Parser-StAX contemporary with that. (The respective configurations are monty.48 and monty.39). With these versions all my previous XMLHTMLParser operations work as before, and I have been able to use the StAX parser in a simple way. So I can start exploring as I intended.

I have made repeated attempts to update this rebuilt image to more recent versions of the HTML and StAX parsers, and every time I run into the same error reported below. I started from the latest version and worked backwards, but gave up quickly; it takes about 6 minutes on my machine to load and compile a version, and it soon gets tedious. If I feel more enthusiastic tomorrow, I might start working forwards from my current versions.

Anyway, I now have a working system with the StAX and HTML parsers, so I can continue to explore.

Best wishes

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of PBKResearch
Sent: 15 May 2017 20:44
To: 'Any question about pharo is welcome' <[hidden email]>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Monty

I have just started trying to use the StAX parsers, and I have found that the update has introduced a problem: XMLHTMLParser no longer works on examples I have used before. I updated to ConfigurationOfXMLParser (monty.302), which is the latest version in the SmalltalkHub repository, and then used the load version in the class comment, which loads the stable default. Similarly, I loaded ConfigurationOfXMLParserHTML (monty.62) and ConfigurationOfXMLParserStAX (monty.51), again using stable and default. When I try to run the XMLHTMLParser example I quoted below, I get the error message 'MessageNotUnderstood: receiver of "critical:" is nil'. The same message comes up with anything else I try with XMLHTMLParser or with StAXHTMLParser.

I am not really up to using the debugger on someone else's code, but the one thing I can see is that the problem lies in XMLKeyValueCache>>critical:, which has the code:
        ^ self mutex critical: aBlock
The problem is that mutex is nil.

In my enthusiasm, I saved the updated image with the same name as the old image, which is now therefore overwritten. If I cannot solve this problem, my only way out is to rebuild my image from the Moose 6.0 download. Any suggestions gratefully received.

Thanks in advance

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of PBKResearch
Sent: 15 May 2017 19:16
To: 'Any question about pharo is welcome' <[hidden email]>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Monty

Many thanks for this. My original purpose was just to answer Paul deBruicker's query, namely to parse an HTML file and stop reading at the end of the <head> section. I solved this by trial and error using the code shown below (which actually stops at the opening tag of the body). This was not my problem at all, but Paul's; I just tackled it for fun.

However, your note has prompted me to update my version of the whole XML system - I was using the version I downloaded with Moose 6.0, which was dated August 2016. I am looking at the StAX parsers as a possible way of simplifying what I currently do, which involves downloading an entire web page as a DOM and then manipulating it with XPath to extract the bits I am interested in. I may be able to use StAX to do some of the selection and manipulation as I am reading.

It's all a new topic to me, so I foresee a lot of experimentation. It all helps to keep the grey matter active.

Thanks again

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of monty
Sent: 15 May 2017 12:15
To: [hidden email]
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages like #nextNode, #nextElement, and #nextElementNamed:, which return the next event object(s) as DOM subtrees (searchable with XPath). See the StAXParser class comment for an example. (The StAXHTMLParser class requires XMLParserHTML be installed to work.)
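
As a rough illustration of the event-stream style described above, the following sketch pulls events one at a time and stops at the opening tag of the body. It uses only #next, #peek, #atEnd and #isStartTagNamed: from this thread. The inline HTML string, the variable names, and the use of #on: to parse from a string (the thread itself only shows #onURL:) are assumptions for illustration, not confirmed by the posts here.

```smalltalk
"A minimal sketch, assuming StAXHTMLParser accepts a string via #on:.
 The HTML string and the variable names are hypothetical."
| parser eventCount |
parser := StAXHTMLParser on:
    '<html><head><title>Demo</title></head><body><p>Text</p></body></html>'.
eventCount := 0.
"Consume events until the next event is the <body> start tag
 or the event stream is exhausted."
[parser atEnd or: [parser peek isStartTagNamed: 'body']]
    whileFalse: [
        parser next.
        eventCount := eventCount + 1].
```

After the loop, the parser is positioned at the body start tag (or at the end of the stream if there is no body), so nothing after the head needs to be turned into DOM nodes.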




Re: Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

monty-3
Something went wrong with class initialization during your upgrade.

Installing the latest versions of these projects into a clean image would work, and so would installing the latest XMLParserHTML and XMLParserStAX into the newest Moose-6.1 image (which has the latest XMLParser and XPath).

But if you insist on upgrading your old image, try the latest ConfigurationOfXMLParser (.303.mcz) and ConfigurationOfXPath (.149.mcz) from their PharoExtras repos and install their latest project versions, and do the same with XMLParserHTML and XMLParserStAX (the older versions aren't compatible with newer XMLParser versions). Then open the test runner and run all "XML|XPath" tests. If you get any failures, evaluate this:

#('XML-Parser' 'XPath-Core') do: [:package |
        (SystemNavigation default allClassesInPackageNamed: package) do: [:class |
                class initialize]]

and try running the tests again.

> Sent: Monday, May 15, 2017 at 6:50 PM
> From: PBKResearch <[hidden email]>
> To: "'Any question about pharo is welcome'" <[hidden email]>
> Subject: [Pharo-users] Problems loading XML System ( was Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

Re: Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

Peter Kenny
Monty

Many thanks for your help. I have followed your advice to start again in a clean Moose 6.1 image, and so far everything is working fine. Apologies for getting you to sort out the results of my stupidity. In Pharo I am really an experienced beginner.

Thanks again

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of monty
Sent: 16 May 2017 03:37
To: [hidden email]
Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)



Re: Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

monty-3
For example, this:
((StAXHTMLParser onURL: aURLString)
        nextElementNamed: 'head')
                ifNotNil: [:headElement | ...]

parses the document up to the next "head" element and returns it and any descendants as a DOM subtree. If there's no next "head" element, it exhausts the event stream looking for one. If you don't want that, test it first:
(parser peek isStartTagNamed: 'head')
        ifTrue: [| headElement |
                headElement := parser nextNode.
                ...].

Because you now know what kind of DOM subtree the next events represent, you can use #nextNode, which builds any DOM subtree out of the next events, including an element with descendants, a string or comment node, or even an entire document (if sent before reading the start-of-document event). So this:
(StAXHTMLParser onURL: aURLString) nextNode

is equivalent to this:
XMLHTMLParser parseURL: aURLString.

StAX is more useful with XML than HTML, because XML documents can be huge.

> Sent: Tuesday, May 16, 2017 at 6:39 PM
> From: PBKResearch <[hidden email]>
> To: "'Any question about pharo is welcome'" <[hidden email]>
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

Re: Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

Peter Kenny
Monty

Many thanks for your patient explanation. I would like to try one supplementary question, if I may. Almost all my work involves reading HTML files from the web and extracting relevant sections, and I am trying to work out how to divide the effort between StAX and XPath. For example, if I am reading an article from Frankfurter Allgemeine, I am looking for two tags:
 <div id="artikelEinleitung" class="FAZArtikelEinleitung">
<div class="FAZArtikelText" itemprop="articleBody">
which contain the intro and body of the article; everything else can be discarded.

Using StAX, I can find the first <div> with something like this (adapted from your second snippet):

[((tag := parser peek) isStartTagNamed: 'div') and: [ tag hasAttributes and: [(tag attributeAt: 'class') = 'FAZArtikelEinleitung']]]
whileFalse: [parser next].
intro := parser nextNode.

and similarly for the body. I suppose if this is common I could subclass and make this a method.

Does this look sensible, and is it more efficient than reading the entire body with StAX and locating the relevant sections with XPath? (I have tried this snippet, and it works. My question is really whether this is the best way to go about it?)
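
One way to make the snippet above reusable, sketched here as a block rather than a subclass method, is to add an #atEnd guard so that a page without the wanted <div> ends the search instead of sending messages to whatever #peek answers at the end of the stream. The block name nextDivWithClass, the sample HTML, and the use of #on: to parse a string are assumptions for illustration; only #peek, #next, #atEnd, #nextNode, #isStartTagNamed:, #hasAttributes and #attributeAt: come from this thread.

```smalltalk
"A minimal sketch, assuming StAXHTMLParser accepts a string via #on:.
 nextDivWithClass and the sample HTML are hypothetical."
| html parser nextDivWithClass intro body |
html := '<html><body><div id="nav">Menu</div>',
    '<div class="FAZArtikelEinleitung">Intro</div>',
    '<div class="FAZArtikelText">Body</div></body></html>'.
"Skip events until a <div> start tag with the given class is next,
 then answer that element as a DOM subtree; answer nil if none is found."
nextDivWithClass := [:aParser :className |
    | found |
    found := false.
    [found or: [aParser atEnd]] whileFalse: [
        | event |
        event := aParser peek.
        found := (event isStartTagNamed: 'div')
            and: [event hasAttributes
                and: [(event attributeAt: 'class') = className]].
        found ifFalse: [aParser next]].
    found ifTrue: [aParser nextNode] ifFalse: [nil]].
parser := StAXHTMLParser on: html.
intro := nextDivWithClass value: parser value: 'FAZArtikelEinleitung'.
body := nextDivWithClass value: parser value: 'FAZArtikelText'.
```

Because the same parser is passed to both calls, the second search continues from where the first one stopped, which matches the order of the two tags in the article page.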

Many thanks for any advice.

Peter Kenny


-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of monty
Sent: 17 May 2017 22:10
To: [hidden email]
Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

For example, this:
((StAXHTMLParser onURL: aURLString)
        nextElementNamed: 'head')
                ifNotNil: [:headElement | ...]

parses the document upto the next "head" element and returns it and any descendants as a DOM subtree. If there's no next "head" element, it exhausts the event stream looking for one. If you don't want that, test it first:
(parser peek isStartTagNamed: 'head')
        ifTrue: [| headElement |
                headElement := parser nextNode.
                ...].

because you now know what kind of DOM subtree the next events represent, #nextNode is used, which builds any DOM subtree out of the next events, including an element with descendants, a string or comment node, or even an entire document (if sent before reading the start-of-document event). So this:
(StAXHTMLParser onURL: aURLString) nextNode

is equivalent to this:
XMLHTMLParser parseURL: aURLString.

StAX is more useful with XML than HTML, because XML documents can be huge.

> Sent: Tuesday, May 16, 2017 at 6:39 PM
> From: PBKResearch <[hidden email]>
> To: "'Any question about pharo is welcome'" <[hidden email]>
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> Monty
>
> Many thanks for your help. I have followed your advice to start again in a clean Moose 6.1 image, and so far everything is working fine. Apologies for getting you to sort out the results of my stupidity. In Pharo I am really an experienced beginner.
>
> Thanks again
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users [mailto:[hidden email]] On Behalf Of monty
> Sent: 16 May 2017 03:37
> To: [hidden email]
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> Something went wrong during your upgrade with class initialization.
>
> Installing the latest versions of these projects into a clean image would work, and so would installing the latest XMLParserHTML and XMLParserStAX into the newest Moose-6.1 image (which has the latest XMLParser and XPath).
>
> But if you insist on upgrading your old image, try the latest ConfigurationOfXMLParser (.303.mcz) and ConfigurationOfXPath (.149.mcz) from their PharoExtras repos and install their latest project versions, and do the same with XMLParserHTML and XMLParserStAX (the older versions aren't compatible with newer XMLParser versions). Then open the test runner and run all "XML|XPath" tests. If you get any failures, evaluate this:
>
> #('XML-Parser' 'XPath-Core') do: [:package |
> (SystemNavigation default allClassesInPackageNamed: package) do: [:class |
> class initialize]]
>
> and try running the tests again.
>
> > Sent: Monday, May 15, 2017 at 6:50 PM
> > From: PBKResearch <[hidden email]>
> > To: "'Any question about pharo is welcome'" <[hidden email]>
> > Subject: [Pharo-users] Problems loading XML System ( was Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
> >
> > Monty
> >
> > As an update, I have rebuilt from the Moose 6.0 download. The version of XML-Parser in that was dated 18 July 2016 (configuration monty.233), so I installed versions of XML-Parser-HTML and XML-Parser-StAX contemporary with that. (The respective configurations are monty.48 and monty.39). With these versions all my previous XMLHTMLParser operations work as before, and I have been able to use the StAX parser in a simple way. So I can start exploring as I intended.
> >
> > I have made repeated attempts to update this rebuilt image to more recent versions of the HTML and StAX parsers, and every time I run into the same error reported below. I started from the latest version and worked backwards, but gave up quickly; it takes about 6 minutes on my machine to load and compile a version, and it soon gets tedious. If I feel more enthusiastic tomorrow, I might start working forwards from my current versions.
> >
> > Anyway, I now have a working system with the StaX and HTML parsers, so I can continue to explore.
> >
> > Best wishes
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:[hidden email]] On Behalf Of PBKResearch
> > Sent: 15 May 2017 20:44
> > To: 'Any question about pharo is welcome' <[hidden email]>
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
> >
> > Monty
> >
> > I have just started trying to use the StAX parsers, and I have found that the update has introduced a problem, which means that XMLHTMLParser no longer works on examples I have used before. I updated to ConfigurationOfXMLParser(monty.302), which is the latest version on the smalltalkhub repository, and then used the load version in the class comment, which loads the stable default. Similarly, I loaded ConfigurationOfXMLParserHTML(monty.62) and ConfigurationOfXMLParserStAX(monty.51), again using stable and default. When I try to run the XMLHTMLParser example I quoted below, I get an error message 'MessageNotunderstood: receiver of "critical:" is nil'. The same message comes up with anything else I try with XMLHTMLParser or with StAXHTMLParser.
> >
> > I am not really up to using the debugger on someone else's code, but the one thing I can see is that the problem lies in XMLKeyValueCache>>critical:, which has the code:
> > ^ self mutex critical: aBlock
> > The problem being that mutex is nil.
> >
> > In my enthusiasm, I saved the updated image with the same name as the old image, which is now therefore overwritten. If I cannot solve this problem, my only way out is to rebuild my image from the Moose 6.0 download. Any suggestions gratefully received.
> >
> > Thanks in advance
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:[hidden email]] On Behalf Of PBKResearch
> > Sent: 15 May 2017 19:16
> > To: 'Any question about pharo is welcome' <[hidden email]>
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
> >
> > Monty
> >
> > Many thanks for this. My original purpose was just to answer Paul deBruicker's query, namely to parse an HTML file and stop reading at the end of the <head> section. I solved this by trial and error using the code shown below (which actually stops at the opening tag of the body). This was not my problem at all, but Paul's; I just tackled it for fun.
> >
> > However, your note has prompted me to update my version of the whole XML system - I was using the version I downloaded with Moose 6.0, which was dated August 2016. I am looking at the StAX parsers as a possible way of simplifying what I currently do, which involves downloading an entire web page as a DOM and then manipulating it with XPath to extract the bits I am interested in. I may be able to use StAX to do some of the selection and manipulation as I am reading.
> >
> > It's all a new topic to me, so I foresee a lot of experimentation. It all helps to keep the grey matter active.
> >
> > Thanks again
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:[hidden email]] On Behalf Of monty
> > Sent: 15 May 2017 12:15
> > To: [hidden email]
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
> >
> > For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages like #nextNode, #nextElement, and #nextElementNamed:, which return the next event object(s) as DOM subtrees (searchable with XPath). See the StAXParser class comment for an example. (The StAXHTMLParser class requires XMLParserHTML be installed to work.)

Re: Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

monty-3
Creating two separate DOM subtrees for two descendants of the body element should be faster and consume less memory than creating one subtree for the entire body. You should also consider benchmarking the different approaches, and profiling to identify whether parsing, querying, network IO, or something else is your bottleneck before optimizing.
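
As a rough sketch of such a comparison (an illustration, not something measured in this thread), the snippet below times the StAX search from the quoted message against a full DOM parse with XMLHTMLParser. The URL is a placeholder, the #atEnd guard is an addition so the loop stops if the <div> is never found, and both timings include the network fetch, so treat the numbers as a first impression and repeat the runs before drawing conclusions.

```smalltalk
| url staxTime domTime |
url := 'http://www.example.com/article.html'. "placeholder URL, an assumption"

"Approach 1: pull events until the wanted <div> starts, then build only that subtree."
staxTime := [
	| parser tag |
	parser := StAXHTMLParser onURL: url.
	[parser atEnd or: [
		((tag := parser peek) isStartTagNamed: 'div')
			and: [tag hasAttributes
				and: [(tag attributeAt: 'class') = 'FAZArtikelEinleitung']]]]
		whileFalse: [parser next].
	parser atEnd ifFalse: [parser nextNode]] timeToRun.

"Approach 2: build the DOM for the whole page (any XPath query would add to this)."
domTime := [XMLHTMLParser parseURL: url] timeToRun.

Transcript
	show: 'StAX subtree: ', staxTime printString; cr;
	show: 'Full DOM: ', domTime printString; cr.
```

If parsing rather than the download turns out to dominate, wrapping either block in MessageTally spyOn: should show where the time goes.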

> Sent: Saturday, May 20, 2017 at 7:08 AM
> From: PBKResearch <[hidden email]>
> To: "'Any question about pharo is welcome'" <[hidden email]>
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> Monty
>
> Many thanks for your patient explanation. I would like to try one supplementary question, if I may. Almost all my work involves reading HTML files from the web and extracting relevant sections, and I am trying to work out how to divide the effort between StAX and XPath. For example, if I am reading an article from Frankfurter Allgemeine, I am looking for two tags:
> <div id="artikelEinleitung" class="FAZArtikelEinleitung">
> <div class="FAZArtikelText" itemprop="articleBody">
> which contain the intro and body of the article; everything else can be discarded.
>
> Using StAX, I can find the first <div> with something like this (adapted from your second snippet):
>
> [((tag := parser peek) isStartTagNamed: 'div') and: [ tag hasAttributes and: [(tag attributeAt: 'class') = 'FAZArtikelEinleitung']]]
> whileFalse: [parser next].
> intro := parser nextNode.
>
> and similarly for the body. I suppose if this is common I could subclass and make this a method.
>
> Does this look sensible, and is it more efficient than reading the entire body with StAX and locating the relevant sections with XPath? (I have tried this snippet, and it works. My question is really whether this is the best way to go about it?)
>
> Many thanks for any advice.
>
> Peter Kenny
>
>
> -----Original Message-----
> From: Pharo-users [mailto:[hidden email]] On Behalf Of monty
> Sent: 17 May 2017 22:10
> To: [hidden email]
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> For example, this:
> ((StAXHTMLParser onURL: aURLString)
> nextElementNamed: 'head')
> ifNotNil: [:headElement | ...]
>
> parses the document up to the next "head" element and returns it and any descendants as a DOM subtree. If there's no next "head" element, it exhausts the event stream looking for one. If you don't want that, test it first:
> (parser peek isStartTagNamed: 'head')
> ifTrue: [| headElement |
> headElement := parser nextNode.
> ...].
>
> Because you now know what kind of DOM subtree the next events represent, #nextNode is used, which builds any DOM subtree out of the next events, including an element with descendants, a string or comment node, or even an entire document (if sent before reading the start-of-document event). So this:
> (StAXHTMLParser onURL: aURLString) nextNode
>
> is equivalent to this:
> XMLHTMLParser parseURL: aURLString.
>
> StAX is more useful with XML than HTML, because XML documents can be huge.
>
> > Sent: Tuesday, May 16, 2017 at 6:39 PM
> > From: PBKResearch <[hidden email]>
> > To: "'Any question about pharo is welcome'" <[hidden email]>
> > Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
> >
> > Monty
> >
> > Many thanks for your help. I have followed your advice to start again in a clean Moose 6.1 image, and so far everything is working fine. Apologies for getting you to sort out the results of my stupidity. In Pharo I am really an experienced beginner.
> >
> > Thanks again
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:[hidden email]] On Behalf Of monty
> > Sent: 16 May 2017 03:37
> > To: [hidden email]
> > Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
> >
> > Something went wrong during your upgrade with class initialization.
> >
> > Installing the latest versions of these projects into a clean image would work, and so would installing the latest XMLParserHTML and XMLParserStAX into the newest Moose-6.1 image (which has the latest XMLParser and XPath).
> >
> > But if you insist on upgrading your old image, try the latest ConfigurationOfXMLParser (.303.mcz) and ConfigurationOfXPath (.149.mcz) from their PharoExtras repos and install their latest project versions, and do the same with XMLParserHTML and XMLParserStAX (the older versions aren't compatible with newer XMLParser versions). Then open the test runner and run all "XML|XPath" tests. If you get any failures, evaluate this:
> >
> > #('XML-Parser' 'XPath-Core') do: [:package |
> > (SystemNavigation default allClassesInPackageNamed: package) do: [:class |
> > class initialize]]
> >
> > and try running the tests again.
> >