Problem using Zinc in Pharo 4 (Moose 5.1)

Peter Kenny

Hello

I have been trying to use Soup class>> fromUrl: to access the contents of a web page. It halts with a message from Zinc about malformed UTF-8. The page displays perfectly in Firefox, so I copied the page source from there to a local file and tried to read it from there. Again a message from Zinc: 'Invalid utf8 input detected'. It’s strange, because the page is not in UTF-8. The head contains: <meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">. I have tried to find how to specify the character set in reading files with Zinc, but without success.*

If it’s relevant, I am using Pharo4.0 Latest update: #40613, downloaded two days ago. The address of the web page is: http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=peter@.... Other pages from the same source are loaded and analysed with no problem. Processing this page seems to go off course as soon as it encounters the character code 246, which is a correct o-umlaut in ISO-8859-1.

Any advice gratefully received.

Peter Kenny

*I would be happy with advice to RTFM, if someone would point out the relevant bit of the FM.


Re: Problem using Zinc in Pharo 4 (Moose 5.1)

Sven Van Caekenberghe-2
Peter,

Thanks for the URL, it makes it much easier to help you.

The answer is easy: the server is at fault. It serves the page in a specific encoding (ISO-8859-1) without declaring it in the Content-Type header.

Consider:

(ZnClient new
   head: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=peter@...'; 
   response) contentType.

 => 'text/html'

If no charset/encoding is specified, the modern default is UTF-8, so Zn tries that but fails.

You can change the default for unspecified encoding as follows:

ZnDefaultCharacterEncoder
  value: ZnByteEncoder iso88591
  during: [
    ZnClient new
      get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=peter@...' ].

The server should have used the following mime type to avoid the confusion:

ZnMimeType textHtml charSet: 'iso-8859-1'

  => 'text/html;charset=iso-8859-1'
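
The difference between a declared and an undeclared charset can also be looked at offline. This is only a small sketch: it assumes that ZnMimeType fromString: parses a Content-Type header value and that charSet answers nil when no charset parameter is present. The two header strings are illustrative, not captured from the server.

```smalltalk
| undeclared declared |
"What the server in question sends: no charset parameter at all."
undeclared := ZnMimeType fromString: 'text/html'.
"What it could send for an ISO-8859-1 page."
declared := ZnMimeType fromString: 'text/html;charset=iso-8859-1'.

"No charset declared: the client has to fall back to its default encoder."
undeclared charSet isNil.
"Charset declared: the client can pick the matching encoder."
declared charSet notNil.
```

In the first case the reader of the response has nothing to go on, which is why the default encoder decides the outcome.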

HTH,

Sven

PS: the encoding declared inside the document cannot be used because (1) Zn does no interpretation of document contents and (2) by that point it is too late: the content has already been converted from bytes to characters.


Re: Problem using Zinc in Pharo 4 (Moose 5.1)

Peter Kenny
Sven

Many thanks for the quick response. I always like to try to solve problems myself before appealing for help, so I had worked out what was wrong, but did not know how to tell Zinc to use a specific coding. I had tried by reading through your very full note on Zinc, but did not find the trick you describe - which works perfectly, of course.

It seems unfortunate that Zinc does not use the coding specified in the html head. Evidently browsers like Firefox must do it, since the page displays correctly. If it cannot be done, I think it would be helpful to reconsider the error message produced when the user is dumped out, because in this context it is misleading. I spent some time tracing debugger output, trying to work out what was wrong with the UTF-8, before I spotted that one of the bytes was displayed in character form as $ö, and began to suspect it might be a different coding; I finally confirmed this by reading the page source in Firefox.

Thanks again for your help.

Peter Kenny


Re: Problem using Zinc in Pharo 4 (Moose 5.1)

Ben Coman


On Sat, May 9, 2015 at 8:18 AM, PBKResearch <[hidden email]> wrote:
> Sven
>
> Many thanks for the quick response. I always like to try to solve problems myself before appealing for help, so I had worked out what was wrong, but did not know how to tell Zinc to use a specific coding. I had tried by reading through your very full note on Zinc, but did not find the trick you describe - which works perfectly, of course.
>
> It seems unfortunate that Zinc does not use the coding specified in the html head. Evidently browsers like Firefox must do it, since the page displays correctly. If it cannot be done, I think it would be helpful to reconsider the error message produced when the user is dumped out, because in this context it is misleading.

Now we have moldable tools by default, I wonder if ZnResponse (which I guess typically people will inspect while troubleshooting) might have a tab called something like "Someone Else's Problem" or "Protocol Errors".

cheers -ben

> I spent some time tracing debugger output, trying to work out what was wrong with the UTF-8, before I spotted that one of the bytes was displayed in character form as $ö, and began to suspect it might be a different coding; I finally confirmed this by reading the page source in Firefox.
>
> Thanks again for your help.
>
> Peter Kenny


Re: Problem using Zinc in Pharo 4 (Moose 5.1)

Sven Van Caekenberghe-2
In reply to this post by Peter Kenny

> On 09 May 2015, at 02:18, PBKResearch <[hidden email]> wrote:
>
> Sven
>
> Many thanks for the quick response. I always like to try to solve problems myself before appealing for help, so I had worked out what was wrong, but did not know how to tell Zinc to use a specific coding. I had tried by reading through your very full note on Zinc, but did not find the trick you describe - which works perfectly, of course.

Good, yes this is a more recent thing.

> It seems unfortunate that Zinc does not use the coding specified in the html head. Evidently browsers like Firefox must do it, since the page displays correctly. If it cannot be done, I think it would be helpful to reconsider the error message produced when the user is dumped out, because in this context it is misleading. I spent some time tracing debugger output, trying to work out what was wrong with the UTF-8, before I spotted that one of the bytes was displayed in character form as $ö, and began to suspect it might be a different coding; I finally confirmed this by reading the page source in Firefox.

Zn deals with HTTP, not with HTML. These are totally different things; a browser obviously does both. But even then there is no easy way to do this, apart from trying. Consider these two byte arrays:

#[85 84 70 56 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72 195 182 108 108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 111 114 115 195 164 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 46]

#[73 83 79 56 56 53 57 49 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72 246 108 108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 111 114 115 228 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 46]

Each of them says, in its own text, how it should be decoded! But you have to decode it first to be able to read that.

The GT tools make this challenge easy because there is a tab that tries both encodings, but in general this is hard to solve (efficiently).
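
The "apart from trying" approach can be sketched as follows: attempt UTF-8 first and fall back to ISO-8859-1 when decoding fails. This is only an illustration, not what Zn does internally. It assumes that the UTF-8 decoder signals ZnInvalidUTF8 on malformed input (check the exception class in your image), and the two short byte arrays are just the word 'Hölle' cut out of the longer arrays above.

```smalltalk
| utf8Bytes latin1Bytes decodeWithFallback |
"The word 'Hölle', taken from the two byte arrays above."
utf8Bytes := #[72 195 182 108 108 101].    "ö encoded as the two bytes 195 182"
latin1Bytes := #[72 246 108 108 101].      "ö encoded as the single byte 246"

"Try UTF-8 first; on malformed input retry with ISO-8859-1."
decodeWithFallback := [ :bytes |
	[ ZnUTF8Encoder new decodeBytes: bytes ]
		on: ZnInvalidUTF8
		do: [ :error | ZnByteEncoder iso88591 decodeBytes: bytes ] ].

decodeWithFallback value: utf8Bytes.     "decodes directly as UTF-8"
decodeWithFallback value: latin1Bytes.   "UTF-8 fails at byte 246, ISO-8859-1 takes over"
```

The order matters. ISO-8859-1 accepts almost any byte sequence, so it can only serve as the fallback, and a Latin-1 text that happens to form valid UTF-8 would still be decoded wrongly. That is the sense in which guessing is hard to do reliably.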

But since Zn does not do HTML, it will never be added at that level.

I will think about the error; it might indeed be useful to tell the user that a default encoding was chosen.

> Thanks again for your help.

You're welcome.


Re: Problem using Zinc in Pharo 4 (Moose 5.1)

Stephan Eggermont-3
In reply to this post by Peter Kenny
On 09/05/15 02:18, PBKResearch wrote:
> Evidently browsers like Firefox must do it, since the page displays correctly.

In my experience with html, if browsers do it, it is probably wrong ;)

Stephan



Re: Problem using Zinc in Pharo 4 (Moose 5.1)

Peter Kenny
In reply to this post by Sven Van Caekenberghe-2
Sven

Thanks for your considered response to my midnight thoughts. I now see the importance of distinguishing the html side (Soup in my work) from the http side (Zinc). I looked at your specimen byte arrays in a Pharo playground, which I am still learning to use; I hadn't seen that proverb in German, but I see the point you are making.

Now that my immediate problem is solved, I am not sure whether it is necessary to take up your time any more with it. It depends how many servers do not say what encoding they use, how many sites use encodings other than UTF-8 and how big is the intersection of those sets. If my problem was a one-off, there is no need to go any further. However, there is one general point I would like to make. The debugger is a very good tool for programmers investigating their own code, but it is a very unfriendly place for a user dumped in the middle of someone else's code; yesterday I learned more than I ever wished to about the innards of UTF-8 decoding, before realising it was all irrelevant. If you do think it worth modifying the handling of this case, I would suggest replacing the call to the debugger with a dialog box of some sort, perhaps with debug as an option for the enthusiast, but perhaps also with an option to restart with an alternative encoding.

If I can, I would like to ask a supplementary question - I am deeply ignorant but eager to learn. As I mentioned, I tried to get round the problem by downloading the page source to a local file and reading from there into Soup, but this also involved the Zinc decoder and so failed. I tried to see how to get round this, using what I now know about the encoding, and came up with the following:

binaryStream := (FileStream readOnlyFileNamed: 'display.html') binary.
charStream := ZnCharacterReadStream on: binaryStream encoding: ZnByteEncoder iso88591.
hbSoup := Soup fromString: charStream contents.

This worked, so in that sense it is OK, but I wonder if there is a neater way of doing it. More importantly, I found that Soup has its own decoder, so I can skip the second line and replace the third by:

hbSoup := Soup fromString: binaryStream contents asString.

At one stage I found myself looking at a debugger on this process (I know - this contradicts what I said above!), because I had not realised that 'asString' was needed. It looked as though Soup was trying three candidate encodings, which it labelled 'latin1', 'utf-8' and 'cp1252', to find which one would work. It showed the one it had 'sniffed' as most likely being 'latin1', which I think is the same as ISO-8859-1, so it was trying that first.

Given this, my question is whether Zinc would allow me to read from a web URL as a binary stream, which I could then feed into the Soup decoder in the same way. If I can, I would use this as my standard procedure; I expect to be visiting a lot of sites, and it would be handy to be able to ignore the encoding issue and hope that Soup can sort it out.

Finally, a general comment. Both this query and the one I posed earlier this week, answered by Vincent Blondeau, showed that Pharo users can come to this site and expect quick, friendly and expert help. I am retired and can devote all the time I want to this, but you people must have day jobs as well! I am really very grateful.

Best wishes

Peter Kenny


Re: Problem using Zinc in Pharo 4 (Moose 5.1)

Sven Van Caekenberghe-2
Hi Peter,

> On 09 May 2015, at 19:00, PBKResearch <[hidden email]> wrote:
>
> Sven
>
> Thanks for your considered response to my midnight thoughts. I now see the importance of distinguishing the html side (Soup in my work) from the http side (Zinc). I looked at your specimen byte arrays in a Pharo playground, which I am still learning to use; I hadn't seen that proverb in German, but I see the point you are making.

Haha, good !

> Now that my immediate problem is solved, I am not sure whether it is necessary to take up your time any more with it. It depends how many servers do not say what encoding they use, how many sites use encodings other than UTF-8 and how big is the intersection of those sets. If my problem was a one-off, there is no need to go any further.

It happens, but not too often (anymore).

> However, there is one general point I would like to make. The debugger is a very good tool for programmers investigating their own code, but it is a very unfriendly place for a user dumped in the middle of someone else's code; yesterday I learned more than I ever wished to about the innards of UTF-8 decoding, before realising it was all irrelevant. If you do think it worth modifying the handling of this case, I would suggest replacing the call to the debugger with a dialog box of some sort, perhaps with debug as an option for the enthusiast, but perhaps also with an option to restart with an alternative encoding.

I understand your idea, but showing dialogs from system code is a no-go. All that we can do is throw better or more specific exceptions; it is up to the code invoking things (your code/application) to handle those.
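
As an illustration of handling this in the invoking code, here is a small sketch that runs a fetch block and, if it fails with a UTF-8 decoding error, runs it again with ISO-8859-1 as the default encoder. The name withLatin1Fallback is made up for this example, and catching ZnInvalidUTF8 is an assumption about which exception your Zinc version signals. The demonstration at the end decodes bytes locally instead of going to the network.

```smalltalk
| withLatin1Fallback result |
"Evaluate fetchBlock; on a UTF-8 decoding error, evaluate it again
 with ISO-8859-1 installed as the default character encoder."
withLatin1Fallback := [ :fetchBlock |
	[ fetchBlock value ]
		on: ZnInvalidUTF8
		do: [ :error |
			ZnDefaultCharacterEncoder
				value: ZnByteEncoder iso88591
				during: [ fetchBlock value ] ] ].

"Stand-in for a real request: decode 'Hölle' (ISO-8859-1 bytes)
 with whatever the current default encoder is."
result := withLatin1Fallback value: [
	ZnDefaultCharacterEncoder value decodeBytes: #[72 246 108 108 101] ].
```

With a real page, the block passed in would be something like [ ZnClient new get: url ]. An application could just as well open its own dialog in the handler; the point is that this decision lives in application code.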

> If I can, I would like to ask a supplementary question - I am deeply ignorant but eager to learn. As I mentioned, I tried to get round the problem by downloading the page source to a local file and reading from there into Soup, but this also involved the Zinc decoder and so failed. I tried to see how to get round this, using what I now know about the encoding, and came up with the following:
>
> binaryStream := (FileStream readOnlyFileNamed: 'display.html') binary.
> charStream := ZnCharacterReadStream on: binaryStream encoding: ZnByteEncoder iso88591.
> hbSoup := Soup fromString: charStream contents.

Yes, that is perfect (did you see http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/ ?)

I would write

'display.html' asFileReference binaryReadStreamDo: [ :in |
  Soup fromString: (ZnCharacterReadStream on: in encoding: ZnByteEncoder iso88591) upToEnd ].

or even

'display.html' asFileReference binaryReadStreamDo: [ :in |
  Soup fromString: (ZnByteEncoder iso88591 decodeBytes: in upToEnd) ].

It seems Soup does not accept streams, only strings.

> This worked, so in that sense it is OK, but I wonder if there is a neater way of doing it. More importantly, I found that Soup has its own decoder, so I can skip the second line and replace the third by:
>
> hbSoup := Soup fromString: binaryStream contents asString.
>
> At one stage I found myself looking at a debugger on this process (I know - this contradicts what I said above!), because I had not realised that 'asString' was needed. It looked as though Soup was trying three candidate encodings, which it labelled 'latin1', 'utf-8' and 'cp1252', to find which one would work. It showed the one it had 'sniffed' as most likely being 'latin1', which I think is the same as ISO-8859-1, so it was trying that first.

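Yes, Latin1 and ISO88591 are the same encoding, and cp1252 is equivalent for most purposes (it differs only in the byte range 128-159, where it adds printable characters such as the euro sign). BTW, #asString is also more or less the same (the difference is that there is a 'hole' in the encoding).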
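
For what it is worth, the near-equivalence of #asString and the ISO-8859-1 decoder can be checked on a small sample. A sketch, assuming ByteArray>>asString maps each byte to the character with the same code; the sample deliberately stays outside the byte range 128-159.

```smalltalk
| bytes viaAsString viaEncoder |
bytes := #[72 246 108 108 101].   "'Hölle' in ISO-8859-1"
viaAsString := bytes asString.
viaEncoder := ZnByteEncoder iso88591 decodeBytes: bytes.
"The two strings compare equal for these bytes."
viaAsString = viaEncoder.
```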
>
> Given this, my question is whether Zinc would allow me to read from a web URL as a binary stream, which I could then feed into the Soup decoder in the same way. If I can, I would use this as my standard procedure; I expect to be visiting a lot of sites, and it would be handy to be able to ignore the encoding issue and hope that Soup can sort it out.

Downloading binary to a file goes like this:

ZnClient new
  url: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=peter@...';
  downloadTo: 'display.html'.

If you want the bytes in memory, you could do:

| client bytes |
(client := ZnClient new)
   streaming: true;
   get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=peter@...'.
bytes := client entity contents.
client close.
bytes.

> Finally, a general comment. Both this query and the one I posed earlier this week, answered by Vincent Blondeau, showed that Pharo users can come to this site and expect quick, friendly and expert help. I am retired and can devote all the time I want to this, but you people must have day jobs as well! I am really very grateful.

To get anywhere, we have to help each other.

Sven
