All,
I'm hitting an error where fetching web content fails. The website does indeed use invalid characters.

The easiest way to reproduce:

ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'

Is there any way to tell Zinc to simply ignore that error and to continue?

CU,

Udo
Hi Udo,
> On 11 May 2017, at 21:37, Udo Schneider <[hidden email]> wrote:
>
> Is there any way to tell Zinc to simply ignore that error and to continue?

That server/page has a mime-type text/plain with no explicit encoding (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591 does not work. The following does work, but you can't be sure everything went well (beLenient takes some bytes as they are).

ZnDefaultCharacterEncoder
  value: ZnCharacterEncoder latin1 beLenient
  during: [
    ZnClient new
      get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
      yourself ].

I added some API earlier today, so that the following should also work (you need to load Zn #bleedingEdge first).

ZnClient new
  defaultEncoder: ZnCharacterEncoder latin1 beLenient;
  get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
  yourself.

HTH,

Regards,

Sven
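To see what the lenient decoder changes without touching the network, here is a minimal sketch for a Playground. The byte array is made up for illustration: it assumes the offending byte is something like 146 (16r92, a Windows-1252 curly quote), which is a guess and was not checked against the actual page.

```smalltalk
| bytes strictResult lenientResult |
"Hypothetical sample: 'Locky', then byte 146 (16r92), then 's'.
 The real page's offending bytes were not inspected; this is an assumption."
bytes := #[76 111 99 107 121 146 115].

"Strict UTF-8 decoding: 146 cannot start a UTF-8 sequence,
 so an encoding error is signalled and we answer nil instead."
strictResult := [ ZnCharacterEncoder utf8 decodeBytes: bytes ]
	on: ZnCharacterEncodingError
	do: [ :error | error return: nil ].

"Lenient latin1 decoding: bytes without a mapping are taken as they are,
 so decoding completes, but the resulting character may not be the intended one."
lenientResult := ZnCharacterEncoder latin1 beLenient decodeBytes: bytes.
```

Inspecting strictResult and lenientResult side by side shows the trade-off: the lenient variant always yields a string, at the price of possibly wrong characters.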
Hi Sven,
that's perfect. To be honest I don't care about the content - I'm just parsing the header. And even if there is a wrong decoding in there... I can live with that.

Thank you very very much! For your help but also your stuff in general.

CU,

Udo

On 11/05/17 at 22:35, Sven Van Caekenberghe wrote:
> The following does work, but you can't be sure everything went well (beLenient takes some bytes as they are).
Just to mention. If you are not interested in the content body you could do a HEAD request instead of GET.
Norbert

> On 11.05.2017 at 22:44, Udo Schneider <[hidden email]> wrote:
>
> To be honest I don't care about the content - I'm just parsing the header.
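A rough sketch of the HEAD approach, assuming only the HTTP response headers are needed. A HEAD response carries no body, so there should be nothing to decode and the encoding error should not come up; this has not been tried against this particular server. The 'Last-Modified' header is only an example name and may well be absent.

```smalltalk
| client |
client := ZnClient new.

"HEAD asks for the status line and the headers only, no entity body"
client head: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.

"The declared content type (mime-type plus optional charset), if the server sent one"
client response contentType.

"Any header can be looked up by name; answer nil when it is missing"
client response headers at: 'Last-Modified' ifAbsent: [ nil ].
```

This only helps when 'header' means the HTTP headers; anything inside the html <head> element still requires a GET.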
With reference to Norbert's comment, there /may/ be an ambiguity about the word 'header' in Udo's reply. It could refer to the http HEAD section, in which case Norbert is of course right. It could also refer to the <head> section of the html file, which is part of the content of the http response. If it is the latter, this is similar to a question that Paul deBruicker posted last November ("[Pharo-users] ZnClient GET, but just the content of the <head> tag?"). I tried the method I devised for Paul's case on Udo's problem website, and read the html header with no problem. Incidentally, the header includes 'charset=iso-8859-1', which does not agree with Sven's findings.

In case it is of interest, I used XMLHTMLParser to read and parse the header. Try the following in a Playground:

par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
par parseDocumentUntil: [ | top |
  (top := par topNode) notNil
    and: [ top isElement and: [ top isNamed: 'body' ] ] ].
par parsingResult findElementNamed: 'head'.

If you 'Do it and go', the full header appears. The way I get it to stop after the header may not be quite correct, because it uses XMLHTMLParser>>topNode, which is a private method. On the other hand, I can't see how to make the stop condition for XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without using a private method.

Hope this is helpful

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of Norbert Hartl
Sent: 12 May 2017 08:04
To: Any question about pharo is welcome <[hidden email]>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Just to mention. If you are not interested in the content body you could do a HEAD request instead of GET.
In reply to this post by NorbertHartl
Hi Sven,
I didn't tell the whole truth :-) I'm /mainly/ parsing the header (extracting published dates). For some sites however I have to resort to finding a date in the body.

CU,

Udo

On 12/05/17 at 09:03, Norbert Hartl wrote:
> Just to mention. If you are not interested in the content body you could do a HEAD request instead of GET.
In reply to this post by Peter Kenny
> Sent: Friday, May 12, 2017 at 5:30 AM
> From: PBKResearch <[hidden email]>
> To: "'Any question about pharo is welcome'" <[hidden email]>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
>
> If you 'Do it and go', the full header appears. The way I get it to stop
> after the header may not be quite correct, because it uses
> XMLHTMLParser>>topNode, which is a private method. On the other hand, I
> can't see how to make the stop condition for
> XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without
> using a private method.

There's always #document. But since I can't see any possible harm, I'll make it public.
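One possible reading of the #document suggestion, as an untested sketch: the stop condition asks the partially built document for a body element instead of using topNode. It assumes that #document answers the DOM under construction while parsing is still in progress (it is guarded against nil in case it is not set yet) and that the body element is already attached once its start tag has been seen. Neither assumption was verified.

```smalltalk
| par |
par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.

"Stop as soon as the document built so far contains a body element.
 Assumption: #document may be nil before parsing starts, hence the guard."
par parseDocumentUntil: [
	par document notNil
		and: [ (par document findElementNamed: 'body') notNil ] ].

"The head element should be complete at this point"
par parsingResult findElementNamed: 'head'.
```

Searching the whole document on every check does more work than looking at the top node, which is probably acceptable for a <head> section but worth keeping in mind for larger pages.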
In reply to this post by Peter Kenny
For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages like #nextNode, #nextElement, and #nextElementNamed:, which return the next event object(s) as DOM subtrees (searchable with XPath). See the StAXParser class comment for an example. (The StAXHTMLParser class requires XMLParserHTML be installed to work.)
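A small pull-DOM sketch along these lines, using a made-up HTML string so it needs no network access. It assumes StAXHTMLParser can be created with the class-side message on: and that #nextElementNamed: skips ahead to the named element; the title and date in the snippet are invented for illustration.

```smalltalk
| html parser head |
"Made-up HTML snippet; the title and date are illustrative, not from the real page"
html := '<html><head><title>Locky returns</title><meta name="date" content="2017-04-24"></head><body><p>Text</p></body></html>'.

"Assumption: on: accepts a string, as it does for the other XML parsers"
parser := StAXHTMLParser on: html.

"Pull events until the head element and answer it as a DOM subtree"
head := parser nextElementNamed: 'head'.

"The subtree can be searched like any other DOM node"
head findElementNamed: 'title'.
```

Because the parser only pulls as many events as it is asked for, the body does not have to be parsed at all when only the head is of interest.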
Monty
Many thanks for this. My original purpose was just to answer Paul deBruicker's query, namely to parse an html file and stop reading at the end of the <head> section. I solved this by trial and error using the code from my earlier message (which actually stops at the opening tag of the body). This was not my problem at all, but Paul's; I just tackled it for fun.

However, your note has prompted me to update my version of the whole XML system - I was using the version I downloaded with Moose 6.0, which was dated August 2016. I am looking at the StAX parsers as a possible way of simplifying what I currently do, which involves downloading an entire web page as a DOM and then manipulating it with XPath to extract the bits I am interested in. I may be able to use StAX to do some of the selection and manipulation as I am reading. It's all a new topic to me, so I foresee a lot of experimentation. It all helps to keep the grey matter active.

Thanks again

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of monty
Sent: 15 May 2017 12:15
To: [hidden email]
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd.
Monty
I have just started trying to use the StAX parsers, and I have found that the update has introduced a problem, which means that XMLHTMLParser no longer works on examples I have used before. I updated to ConfigurationOfXMLParser(monty.302), which is the latest version on the smalltalkhub repository, and then used the load version in the class comment, which loads the stable default. Similarly, I loaded ConfigurationOfXMLParserHTML(monty.62) and ConfigurationOfXMLParserStAX(monty.51), again using stable and default.

When I try to run the XMLHTMLParser example from my earlier message, I get an error message 'MessageNotUnderstood: receiver of "critical:" is nil'. The same message comes up with anything else I try with XMLHTMLParser or with StAXHTMLParser. I am not really up to using the debugger on someone else's code, but the one thing I can see is that the problem lies in XMLKeyValueCache>>critical:, which has the code:

^ self mutex critical: aBlock

The problem being that mutex is nil.

In my enthusiasm, I saved the updated image with the same name as the old image, which is now therefore overwritten. If I cannot solve this problem, my only way out is to rebuild my image from the Moose 6.0 download.

Any suggestions gratefully received.

Thanks in advance

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of PBKResearch
Sent: 15 May 2017 19:16
To: 'Any question about pharo is welcome' <[hidden email]>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

However, your note has prompted me to update my version of the whole XML system - I was using the version I downloaded with Moose 6.0, which was dated August 2016.