Invalid utf8 input detected: now what?

Schwab,Wilhelm K
Hello all,

I got an error (on Ubuntu 9.10) trying to open an old text file that I created on Windows some time ago.  The encoding (if gedit's save-as dialog can be trusted??) is Western ISO-8859-15; resaving as utf8 lets me read it.

So, is Pharo working by design?  Did I do the correct/only thing needed to read the file?  What should I be asking?  Is there anything I can do to turn this into a useful test/debugging example?

Bill


_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Re: Invalid utf8 input detected: now what?

Yanni Chiu
Schwab,Wilhelm K wrote:
>
> I got an error (on Ubuntu 9.10) trying to open an old text file that I
> created on Windows some time ago.  The encoding (if gedit's save-as
> dialog can be trusted??) is Western ISO-8859-15; resaving as utf8
> lets me read it.

You could try viewing the original file in a web browser. Try different
encodings until the stuff looks right. Then you might have a better idea
of whether you really have a file in ISO-8859-15.

You could also view your converted UTF-8 file in a web browser too, and
compare the two renderings.

If this checks out, then maybe it's a Pharo issue.



Re: Invalid utf8 input detected: now what?

Stéphane Ducasse

On Jul 23, 2010, at 4:38 AM, Yanni Chiu wrote:

> Schwab,Wilhelm K wrote:
>> I got an error (on Ubuntu 9.10) trying to open an old text file that I
>> created on Windows some time ago.  The encoding (if gedit's save-as
>> dialog can be trusted??) is Western ISO-8859-15; resaving as utf8
>> lets me read it.
>
> You could try viewing the original file in a web browser. Try different encodings until the stuff looks right. Then you might have a better idea of whether you really have a file in ISO-8859-15.
>
> You could also view your converted UTF-8 file in a web browser too, and compare the two renderings.
>
> If this checks out, then maybe it's a Pharo issue.

Please report it, with a test if possible, so that we can fix it.

Re: Invalid utf8 input detected: now what?

Philippe Marschall-2
In reply to this post by Schwab,Wilhelm K
On 07/23/2010 04:09 AM, Schwab,Wilhelm K wrote:
> Hello all,
>
> I got an error (on Ubuntu 9.10) trying to open an old text file that I created on Windows some time ago.  The encoding (if gedit's save-as dialog can be trusted??) is Western ISO-8859-15; resaving as utf8 lets me read it.
>
> So, is Pharo working by design?  Did I do the correct/only thing needed to read the file?

You need to pass the encoding to the file stream.
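As a rough illustration of the same idea outside Pharo, here is a minimal Python sketch (not Pharo code, and not the Pharo stream API). It assumes the file really is ISO-8859-15; the helper name `read_text` and the sample content are hypothetical.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file, decoding its bytes with the given encoding."""
    with open(path, "r", encoding=encoding) as stream:
        return stream.read()


# Build a small ISO-8859-15 sample file (hypothetical content).
sample = "price: 10 €\n"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("iso8859_15"))

# Passing the encoding explicitly avoids relying on a UTF-8 default.
text = read_text(sample_path, "iso8859_15")
os.remove(sample_path)
```

Reading the same file with `encoding="utf-8"` would raise `UnicodeDecodeError`, which is the analogue of the error reported at the top of this thread.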

Cheers
Philippe



Re: Invalid utf8 input detected: now what?

Henrik Sperre Johansen
In reply to this post by Schwab,Wilhelm K

On Jul 23, 2010, at 4:09:30 AM, Schwab,Wilhelm K wrote:

> Hello all,
>
> I got an error (on Ubuntu 9.10) trying to open an old text file that I created on Windows some time ago.  The encoding (if gedit's save-as dialog can be trusted??) is Western ISO-8859-15; resaving as utf8 lets me read it.
>
> So, is Pharo working by design?  Did I do the correct/only thing needed to read the file?  What should I be asking?  Is there anything I can do to turn this into a useful test/debugging example?
>
> Bill
>

This is not an error per se, seeing as the encoding is not utf8 :)

If the import was done from some tool instead of in your code (in which case you'd set the encoding of the file stream), a nicer *behavior* might be for the UI Manager to catch encoding errors when trying to read a file, and offer up a dialogue with a list of encodings which the file *can* be read as, along with a preview window of what the text would look like with the selected encoding, like some word processors do.
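The "list of encodings the file can be read as, with a preview" idea could be sketched roughly as follows. This is Python rather than Pharo, the function name `decodable_previews` and the candidate list are hypothetical, and a real tool would still need a person to pick the right entry.

```python
def decodable_previews(data, candidates, preview_chars=40):
    """Map each candidate encoding that decodes `data` without error
    to a short preview of the decoded text."""
    previews = {}
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent these bytes
        previews[name] = text[:preview_chars]
    return previews


# Sample bytes written as ISO-8859-15 (hypothetical content).
raw = "10 € für Käse".encode("iso8859_15")
previews = decodable_previews(raw, ["utf-8", "iso8859_1", "iso8859_15", "cp1252"])
```

Here UTF-8 is rejected outright, while the single-byte encodings all decode the bytes and differ only in how the previews look, which is why the preview matters.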

Cheers,
Henry

Re: Invalid utf8 input detected: now what?

Schwab,Wilhelm K
No dialogs, please :)  Actually, it would be fine if there were a different stream class or simply a different method/state (encoding =#userInteraction or something??) that is understood to negotiate such details with "the user."  In general, an exception is the correct way to handle this: the stream "knows" what is wrong; the application will know what to make of it.  If the encoding can be detected automatically, that would be great.
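To make the division of labour concrete, here is a minimal Python sketch (an analogue, not Pharo code). The low-level function only raises; the application-level function decides what to do about it. Both function names and the fallback policy are hypothetical.

```python
def decode_strict(data):
    """Low-level layer: decode as UTF-8 or raise. It never prompts anyone."""
    return data.decode("utf-8")


def load_for_application(data, fallback_encoding):
    """Application layer: decides what an invalid-UTF-8 error means here.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return decode_strict(data), "utf-8"
    except UnicodeDecodeError:
        # Policy chosen by the application, not by the stream.
        return data.decode(fallback_encoding), fallback_encoding


legacy = load_for_application("Käse".encode("iso8859_1"), "iso8859_1")
modern = load_for_application("Käse".encode("utf-8"), "iso8859_1")
```

A GUI tool could replace the fallback branch with a dialog, while batch code could log and skip the file; the decoding layer stays the same in both cases.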

Firefox tells me that the encoding is ISO-8859-1; I am not leaving off the 5: Firefox and gedit report it differently.  In fairness to gedit, I am reporting the encodings listed in its save-as dialog.  Unfortunately the offending file contains specifications that are not mine.  I have seen pieces of it published elsewhere (quite recently in fact) but will need to do some checking on the licensing.  I might be able to excerpt the file and end up with the same behavior.

Bill



Re: Invalid utf8 input detected: now what?

Henrik Sperre Johansen


On 23 July 2010 at 15:15, "Schwab,Wilhelm K" <[hidden email]> wrote:

> No dialogs, please :)  Actually, it would be fine if there were a different stream class or simply a different method/state (encoding =#userInteraction or something??) that is understood to negotiate such details with "the user."  
I have no idea what you are suggesting...

> In general, an exception is the correct way to handle this: the stream "knows" what is wrong; the application will know what to make of it.  If the encoding can be detected automatically, that would be great.
>
Then I fail to see what the problem is.
You got an error stating it was not UTF8, which implies that the application needs to choose the correct encoding (by setting the stream's encoding to something else).
As you noticed with gedit/firefox, any "automatic" detection is at best an educated guess, and cannot be relied upon to make the correct choice.
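A small Python sketch (an analogue, not Pharo code) of why detection can only guess: UTF-8 has enough structure that invalid input can be rejected, but a single-byte encoding such as ISO-8859-1 accepts every possible byte, so "it decoded without error" tells you nothing. The helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data):
    """Return True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


legacy_bytes = "café".encode("iso8859_1")
utf8_bytes = "café".encode("utf-8")

# UTF-8 validity is checkable: the legacy bytes are rejected.
legacy_is_utf8 = is_valid_utf8(legacy_bytes)
utf8_is_utf8 = is_valid_utf8(utf8_bytes)

# ISO-8859-1 never fails: UTF-8 bytes decode "successfully" into mojibake,
# and so does every one of the 256 possible byte values.
misread = utf8_bytes.decode("iso8859_1")
all_bytes_accepted = len(bytes(range(256)).decode("iso8859_1")) == 256
```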

Cheers,
Henry

Re: Invalid utf8 input detected: now what?

Schwab,Wilhelm K
I agree that there is apparently not much of a problem.  However, I also stand by "no more dialogs" unless they are in a clearly-identified class/method/state that is known to interact with the user.  Squeak has *far* too much forced and unexpected interaction, and we must not go back down that road.

Those things said, there might be room to grow, as someone suggested the possibility of automatically detecting the encoding.



Re: Invalid utf8 input detected: now what?

Henrik Sperre Johansen

On Jul 24, 2010, at 1:57:11 PM, Schwab,Wilhelm K wrote:

> I agree that there is apparently not much of a problem.  However, I also stand by "no more dialogs" unless they are in a clearly-identified class/method/state that is known to interact with the user.  Squeak has *far* too much forced and unexpected interaction, and we must not go back down that road.

Which is why I asked initially whether you encountered this when using a tool (i.e. a file browser, etc.) or custom code.
For tools, where there is no way to set one encoding that will be correct in all cases, it might be better behaviour to open a dialogue where one can be selected, instead of raising a DNU, if a GUI is present.

>
> Those things said, there might be room to grow, as someone suggested the possibility of automatically detecting the encoding.

Then they were wrong.

Cheers,
Henry

Re: Invalid utf8 input detected: now what?

Philippe Marschall-2-3
In reply to this post by Schwab,Wilhelm K
On 23.07.2010 15:15, Schwab,Wilhelm K wrote:
> No dialogs, please :)  Actually, it would be fine if there were a different stream class or simply a different method/state (encoding =#userInteraction or something??) that is understood to negotiate such details with "the user."  In general, an exception is the correct way to handle this: the stream "knows" what is wrong; the application will know what to make of it.  If the encoding can be detected automatically, that would be great.

There is no way to do this. The only thing that can be determined is
that something is not utf-8. The stream did that reliably. But you said
you already know the encoding, so just set it on the stream and it
should work.

> Firefox tells me that the encoding is ISO-8859-1;

ISO-8859-1 and ISO-8859-15 are not the same.
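For reference, the two encodings disagree on only a handful of byte values, which is why tools can easily report one for the other. A short Python sketch (using Python's standard codecs, not anything from Pharo) that lists the byte values where they differ:

```python
# Byte values whose character differs between the two encodings.
differing = [
    byte
    for byte in range(256)
    if bytes([byte]).decode("iso8859_1") != bytes([byte]).decode("iso8859_15")
]

# The best-known difference: 0xA4 is the currency sign in ISO-8859-1
# and the euro sign in ISO-8859-15.
latin1_a4 = bytes([0xA4]).decode("iso8859_1")
latin9_a4 = bytes([0xA4]).decode("iso8859_15")
```

A file that uses none of these byte values decodes to identical text under both encodings, so either label would be consistent with its contents.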

Cheers
Philippe



Re: Invalid utf8 input detected: now what?

Schwab,Wilhelm K
I'm ok with calling this "works as intended" if the encoding experts are.  Since I am *not* an expert on encoding, I ran it up the flag pole.



