[Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Udo Schneider

[Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

All,

I'm hitting an error where fetching web content fails. The website does
indeed use invalid characters.

The easiest way to reproduce:

ZnEasy get:
'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'

Is there any way to tell Zinc to simply ignore that error and to continue?

CU,

Udo

Sven Van Caekenberghe-2

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Hi Udo,

> On 11 May 2017, at 21:37, Udo Schneider <[hidden email]> wrote:
>
> All,
>
> I'm hitting an error where fetching web content fails. The website does indeed use invalid characters.
>
> The easiest way to reproduce:
>
> ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'
>
> Is there any way to tell Zinc to simply ignore that error and to continue?
>
> CU,
>
> Udo

That server/page has a mime-type text/plain with no explicit encoding (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591 does not work. The following does work, but you can't be sure everything went well (beLenient takes some bytes as they are).

ZnDefaultCharacterEncoder
    value: ZnCharacterEncoder latin1 beLenient
    during: [
        ZnClient new
            get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
            yourself ].

I added some API earlier today, so that the following should also work (you need to load Zn #bleedingEdge first).

ZnClient new
    defaultEncoder: ZnCharacterEncoder latin1 beLenient;
    get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
    yourself.
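
To make the difference between strict and lenient decoding concrete, here is a minimal sketch that can be tried in a Playground without network access. The byte array is a made-up sample and an assumption, not the actual bytes served by the site above: byte 146 (16r92) is not a legal leading byte in UTF-8, so the strict UTF-8 decoder should signal ZnInvalidUTF8, while a lenient latin1 encoder takes the byte as it is.

```smalltalk
| bytes strictResult lenientResult |
"Hypothetical sample: 'It', a stray byte 146 (16r92), then 's'."
bytes := #[73 116 146 115].

"Strict UTF-8 decoding is expected to fail on the stray byte;
 the handler just returns the error's message text."
strictResult := [ (ZnCharacterEncoder newForEncoding: 'utf-8') decodeBytes: bytes ]
    on: ZnInvalidUTF8
    do: [ :error | error messageText ].

"A lenient latin1 encoder passes unmapped bytes through unchanged."
lenientResult := ZnCharacterEncoder latin1 beLenient decodeBytes: bytes.

{ strictResult. lenientResult }
```

Inspecting the resulting array shows both outcomes side by side. As noted above, a lenient decode never fails, so it cannot tell you whether the resulting characters are the ones the author intended.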

HTH,

Regards,

Sven

Udo Schneider

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Hi Sven,

that's perfect. To be honest I don't care about the content - I'm just
parsing the header. And even if there is a wrong decoding in there... I
can live with that.

Thank you very very much! For your help but also your stuff in general.

CU,

Udo

In reply to Sven Van Caekenberghe's message of 11 May 2017, 22:35.

NorbertHartl

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Just to mention: if you are not interested in the content body, you could do a HEAD request instead of a GET.
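
A minimal sketch of that suggestion, assuming the server answers HEAD requests and that the value of interest is present in the HTTP response headers. The Last-Modified header is chosen here purely for illustration. Since a HEAD response carries no body, no decoding of the page content takes place:

```smalltalk
| client response |
client := ZnClient new.
client
    url: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
    head.
response := client response.

"Look up one header; Last-Modified is only an illustrative choice and may be absent."
response headers at: 'Last-Modified' ifAbsent: [ nil ]
```

This only helps when the information lives in the HTTP headers; anything inside the HTML <head> element is part of the body and still requires a GET.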

Norbert

In reply to Udo Schneider's message of 11 May 2017, 22:44.

Peter Kenny

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

With reference to Norbert's comment, there /may/ be an ambiguity about the
word 'header' in Udo's reply. It could refer to the http HEAD section, in
which case Norbert is of course right. It could also refer to the <head>
section of the html file, which is part of the content of the http response.
If it is the latter, this is similar to a question that Paul deBruicker
posted last November ("[Pharo-users] ZnClient GET, but just the content of
the <head> tag?"). I tried the method I devised for Paul's case on Udo's
problem website, and read the html header with no problem. Incidentally, the
header includes 'charset=iso-8859-1', which does not agree with Sven's
findings.

In case it is of interest, I used XMLHTMLParser to read and parse the
header. Try the following in a Playground:

par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
par parseDocumentUntil: [ | top |
    (top := par topNode) notNil
        and: [ top isElement and: [ top isNamed: 'body' ] ] ].
par parsingResult findElementNamed: 'head'.

If you 'Do it and go', the full header appears. The way I get it to stop
after the header may not be quite correct, because it uses
XMLHTMLParser>>topNode, which is a private method. On the other hand, I
can't see how to make the stop condition for
XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without
using a private method.

Hope this is helpful

Peter Kenny

In reply to Norbert Hartl's message of 12 May 2017, 08:04.

Udo Schneider

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

In reply to this post by NorbertHartl

Hi Sven,

I didn't tell the whole truth :-)

I'm /mainly/ parsing the header (extracting published dates). For some
sites however I have to resort to finding a date in the body.

CU,

Udo

monty-3

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

In reply to this post by Peter Kenny

There's always #document. But since I can't see any possible harm, I'll make #topNode public.

monty-3

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

In reply to this post by Peter Kenny

For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages like #nextNode, #nextElement, and #nextElementNamed:, which return the next event object(s) as DOM subtrees (searchable with XPath). See the StAXParser class comment for an example. (The StAXHTMLParser class requires XMLParserHTML be installed to work.)

In reply to Peter Kenny's message of 12 May 2017, 5:30 AM.

Peter Kenny

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Monty

Many thanks for this. My original purpose was just to answer Paul deBruicker's query, namely to parse an html file and stop reading at the end of the <head> section. I solved this by trial and error using the code shown below (which actually stops at the opening tag of the body). This was not my problem at all, but Paul's; I just tackled it for fun.

However, your note has prompted me to update my version of the whole XML system - I was using the version I downloaded with Moose 6.0, which was dated August 2016. I am looking at the StAX parsers as a possible way of simplifying what I currently do, which involves downloading an entire web page as a DOM and then manipulating it with XPath to extract the bits I am interested in. I may be able to use StAX to do some of the selection and manipulation as I am reading.

It's all a new topic to me, so I foresee a lot of experimentation. It all helps to keep the grey matter active.

Thanks again

Peter Kenny

In reply to monty's message of 15 May 2017, 12:15.

Peter Kenny

Re: [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Monty

I have just started trying to use the StAX parsers, and I have found that the update has introduced a problem, which means that XMLHTMLParser no longer works on examples I have used before. I updated to ConfigurationOfXMLParser(monty.302), which is the latest version on the SmalltalkHub repository, and then used the load expression in the class comment, which loads the stable default. Similarly, I loaded ConfigurationOfXMLParserHTML(monty.62) and ConfigurationOfXMLParserStAX(monty.51), again using stable and default. When I try to run the XMLHTMLParser example I quoted below, I get an error message 'MessageNotUnderstood: receiver of "critical:" is nil'. The same message comes up with anything else I try with XMLHTMLParser or with StAXHTMLParser.

I am not really up to using the debugger on someone else's code, but the one thing I can see is that the problem lies in XMLKeyValueCache>>critical:, which has the code:
    ^ self mutex critical: aBlock
The problem is that mutex is nil.

In my enthusiasm, I saved the updated image with the same name as the old image, which is now therefore overwritten. If I cannot solve this problem, my only way out is to rebuild my image from the Moose 6.0 download. Any suggestions gratefully received.

Thanks in advance

Peter Kenny

In reply to Peter Kenny's message of 15 May 2017, 19:16.