[Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding


Udo Schneider
All,

I'm hitting an error where fetching web content fails. The website does
indeed use invalid characters.

The easiest way to reproduce:

ZnEasy get:
'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'

Is there any way to tell Zinc to simply ignore that error and to continue?

CU,

Udo



Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Sven Van Caekenberghe-2
Hi Udo,

That server/page has a mime-type of text/plain with no explicit encoding (charset) setting, so we have to guess. Neither utf-8 nor pure latin1/iso88591 works. The following does work, but you can't be sure everything went well (beLenient takes some bytes as they are).

ZnDefaultCharacterEncoder
  value: ZnCharacterEncoder latin1 beLenient
  during: [
    ZnClient new
      get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'; 
      yourself ].

I added some API earlier today, so that the following should also work (you need to load Zn #bleedingEdge first).
 
ZnClient new
  defaultEncoder: ZnCharacterEncoder latin1 beLenient;
  get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
  yourself.
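
To make the failure mode concrete, here is a minimal byte-level sketch. The sample bytes are made up for illustration (they are not taken from the page), and it assumes that #utf8 and #decodeBytes: are available on ZnCharacterEncoder in your Zinc version:

```smalltalk
| bytes |
"Hypothetical sample: 'caf' followed by 16r93, which is a UTF-8 continuation
 byte and therefore cannot start a character."
bytes := #[99 97 102 16r93].

"Strict UTF-8 decoding should signal an error (ZnInvalidUTF8 in this thread)."
[ ZnCharacterEncoder utf8 decodeBytes: bytes ]
	on: Error
	do: [ :error | Transcript show: error messageText; cr ].

"The lenient latin1 encoder from the examples above keeps the odd byte as-is."
ZnCharacterEncoder latin1 beLenient decodeBytes: bytes.
```

The last expression should answer a four-character string. Whether its final character is what the page author intended is exactly the uncertainty mentioned above.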

HTH,

Regards,

Sven



Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Udo Schneider
Hi Sven,

That's perfect. To be honest, I don't care about the content; I'm just
parsing the header. Even if something in there is decoded wrongly, I
can live with that.

Thank you very much, both for your help and for your work in general.

CU,

Udo



Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

NorbertHartl
Just to mention: if you are not interested in the content body, you could do a HEAD request instead of a GET.
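
A minimal sketch of that, assuming ZnEasy class>>head: is present in your image. The 'Last-Modified' header name is only an illustration, since the server may not send it:

```smalltalk
| response |
"HEAD asks for the status line and headers only, so no body has to be decoded."
response := ZnEasy head: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.

response contentType.	"the declared mime-type, if any"
response headers at: 'Last-Modified' ifAbsent: [ nil ].
```

This only helps when the information you need is in the HTTP headers rather than in the page itself.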

Norbert


Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Peter Kenny
With reference to Norbert's comment, there /may/ be an ambiguity about the
word 'header' in Udo's reply. It could refer to the HTTP response headers, in
which case Norbert is of course right. It could also refer to the <head>
section of the HTML file, which is part of the content of the HTTP response.
If it is the latter, this is similar to a question that Paul deBruicker
posted last November ("[Pharo-users] ZnClient GET, but just the content of
the <head> tag?"). I tried the method I devised for Paul's case on Udo's
problem website, and read the HTML <head> with no problem. Incidentally, the
<head> includes 'charset=iso-8859-1', which does not agree with Sven's
findings.

In case it is of interest, I used XMLHTMLParser to read and parse the
header. Try the following in a Playground:

par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
"Stop as soon as the parser's current top node is the <body> element (topNode is private)."
par parseDocumentUntil: [ | top |
	(top := par topNode) notNil
		and: [ top isElement and: [ top isNamed: 'body' ] ] ].
par parsingResult findElementNamed: 'head'.

If you 'Do it and go', the full header appears. The way I get it to stop
after the header may not be quite correct, because it uses
XMLHTMLParser>>topNode, which is a private method. On the other hand, I
can't see how to make the stop condition for
XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without
using a private method.

Hope this is helpful

Peter Kenny


Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Udo Schneider
In reply to this post by NorbertHartl
Hi Sven,

I didn't tell the whole truth :-)

I'm /mainly/ parsing the header (extracting published dates). For some
sites however I have to resort to finding a date in the body.

CU,

Udo



Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

monty-3
In reply to this post by Peter Kenny
> If you 'Do it and go', the full header appears. The way I get it to stop
> after the header may not be quite correct, because it uses
> XMLHTMLParser>>topNode, which is a private method. On the other hand, I
> can't see how to make the stop condition for
> XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without
> using a private method.

There's always #document. But since I can't see any possible harm, I'll make it public.
 


Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

monty-3
In reply to this post by Peter Kenny
For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages like #nextNode, #nextElement, and #nextElementNamed:, which return the next event object(s) as DOM subtrees (searchable with XPath). See the StAXParser class comment for an example. (The StAXHTMLParser class requires XMLParserHTML be installed to work.)
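
A rough sketch of what that could look like for the <head> case, assuming that StAXHTMLParser accepts #on: with a string like the other XMLParser entry points. The inline HTML is a made-up stand-in for a downloaded page:

```smalltalk
| parser head |
parser := StAXHTMLParser on: '<html><head><title>Sample</title></head><body><p>Text</p></body></html>'.

"Pull events until the <head> start tag, then answer that element as a DOM subtree."
head := parser nextElementNamed: 'head'.
head
```

Because the parser is pulled on demand, nothing after the <head> element needs to be read unless you ask for more.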


Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Peter Kenny
Monty

Many thanks for this. My original purpose was just to answer Paul deBruicker's query, namely to parse an html file and stop reading at the end of the <head> section. I solved this by trial and error using the code shown below (which actually stops at the opening tag of the body). This was not my problem at all, but Paul's; I just tackled it for fun.

However, your note has prompted me to update my version of the whole XML system - I was using the version I downloaded with Moose 6.0, which was dated August 2016. I am looking at the StAX parsers as a possible way of simplifying what I currently do, which involves downloading an entire web page as a DOM and then manipulating it with XPath to extract the bits I am interested in. I may be able to use StAX to do some of the selection and manipulation as I am reading.

It's all a new topic to me, so I foresee a lot of experimentation. It all helps to keep the grey matter active.

Thanks again

Peter Kenny

> Sent: Friday, May 12, 2017 at 5:30 AM
> From: PBKResearch <[hidden email]>
>
> par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
> par parseDocumentUntil: [|top|(top := par topNode) notNil and: [ top isElement and:[ top isNamed: 'body']]].
> par parsingResult findElementNamed: 'head'.


Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Peter Kenny
Monty

I have just started trying to use the StAX parsers, and I have found that the update has introduced a problem: XMLHTMLParser no longer works on examples I have used before. I updated to ConfigurationOfXMLParser(monty.302), which is the latest version on the SmalltalkHub repository, and then used the load expression in the class comment, which loads the stable default. Similarly, I loaded ConfigurationOfXMLParserHTML(monty.62) and ConfigurationOfXMLParserStAX(monty.51), again using stable and default. When I try to run the XMLHTMLParser example I quoted below, I get the error message 'MessageNotUnderstood: receiver of "critical:" is nil'. The same message comes up with anything else I try with XMLHTMLParser or with StAXHTMLParser.

I am not really up to using the debugger on someone else's code, but the one thing I can see is that the problem lies in XMLKeyValueCache>>critical:, which has the code:
^ self mutex critical: aBlock
The problem being that mutex is nil.

In my enthusiasm, I saved the updated image with the same name as the old image, which is now therefore overwritten. If I cannot solve this problem, my only way out is to rebuild my image from the Moose 6.0 download. Any suggestions gratefully received.

Thanks in advance

Peter Kenny

> Sent: Friday, May 12, 2017 at 5:30 AM
> From: PBKResearch <[hidden email]>
>
> par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
> par parseDocumentUntil: [|top|(top := par topNode) notNil and: [ top isElement and:[ top isNamed: 'body']]].
> par parsingResult findElementNamed: 'head'.