Unix VM path encodings

Andreas.Raab

Unix VM path encodings

Hi -

Due to a bug reported against Qwaq Forums I needed to look into how the
Unix VM encodes file and path names and got terribly confused. My test
case was to create a file with an umlaut ("Jürgen") and to see what both
Squeak and the Unix shell report with varying settings of -pathenc and
-textenc.

I started with the assumption that since the file system I was running
this on is UTF-8 the default settings (-textenc MacRoman and -pathenc
UTF-8) ought to be correct. However, the result was very surprising. The
file name was reported incorrectly both in the file list as well as by
the OS - the file list reported "J?" (truncated after the question mark)
and the Unix shell reported "J?rgen" but with a "funky ?" (the glyph is
hard to describe without a screenshot; it was neither an umlaut nor a
regular question mark).
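
One plausible mechanism for the "funky ?" (an assumption, not something confirmed in this thread): if the name reaches the disk in a single-byte encoding such as MacRoman or Latin-1, a UTF-8 terminal cannot decode the byte for "ü" and shows the replacement character U+FFFD instead. A minimal Python sketch, with Python's codecs standing in for whatever the VM and shell actually do:

```python
# Sketch: how a single-byte "ü" looks to a UTF-8 consumer.
# Assumption: the file name was written with a legacy single-byte encoding.
name = "Jürgen"

for legacy in ("mac_roman", "latin-1"):
    raw = name.encode(legacy)                      # bytes as they might land on disk
    shown = raw.decode("utf-8", errors="replace")  # what a UTF-8 reader would see
    print(legacy, raw, repr(shown))

as_mac_roman = name.encode("mac_roman").decode("utf-8", errors="replace")
as_latin1 = name.encode("latin-1").decode("utf-8", errors="replace")
```

In both cases the lone byte is not valid UTF-8, so the reader substitutes U+FFFD, which many terminals draw as a question mark inside a diamond.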

Playing with the settings I could not find any combination that resulted
in a consistent representation for all the different views - either the
Unix shell was off or Squeak's view was off no matter how I set those
encodings. Can someone explain to me how I need to set these values to
get a consistent view on file names both from Squeak and Unix?

Cheers,
- Andreas

johnmci

Re: Unix VM path encodings

For Sophie we spent hours (days?) getting this correct. However, we
really didn't check out the pure Unix variation.

Actually, let's try some other UTF-32 character instead of ü.

Oh let's say LATIN CAPITAL LETTER SCHWA -> UTF-8 0xC68F UTF-32
0x0000018F

Ə

in the os-x 10.5.1 Finder we see

Ə.png

and in a terminal session we see

-rw-r--r-- 1 johnmci staff 26451 Apr 10 2007 Ə.png

in both cases just in case you can't see this in the email the
character is visually correct.

Using Squeak 3.10Alpha 7092 with a Mac Carbon VM 3.8.18b1 set to UTF-8,
what we see in the Morphic file list is
?.png
where the ? is 0x3F.

of course it says it can't open the file, because the smalltalk code
(which code is an exercise for the reader) has mangled the 0xC68F into
0x3F. In asking about this a few years back I think I was told it
converts the VM data to latin1. However the conversion from macroman
to latin1 and back *usually* is workable, mind only if the characters
are <= 0xFF

However utf8 to latin1 usually ends up broken which is why the mac
carbon VM is set to macroman by default.
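
The failure mode described above can be reproduced outside Squeak. In the following Python sketch, Python's codecs stand in for the VM and image converters (an assumption; the actual Smalltalk code path is not shown in this thread). SCHWA collapses to 0x3F when forced into Latin-1, while a MacRoman/Latin-1 round trip survives for characters that both sets share:

```python
schwa = "\u018f"                                     # LATIN CAPITAL LETTER SCHWA
utf8 = schwa.encode("utf-8")                         # b'\xc6\x8f'
lossy = schwa.encode("latin-1", errors="replace")    # b'?', i.e. 0x3F

# MacRoman -> Latin-1 -> MacRoman works for "ü" ...
mac = "ü".encode("mac_roman")                        # b'\x9f'
latin = mac.decode("mac_roman").encode("latin-1")    # b'\xfc'
back = latin.decode("latin-1").encode("mac_roman")   # b'\x9f' again

# ... but not for MacRoman characters that Latin-1 lacks, such as "∑".
sigma_lossy = "∑".encode("latin-1", errors="replace")  # b'?'

print(utf8, lossy, mac, latin, back, sigma_lossy)
```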

Recall that in os-x HFS Plus converts all file names to decomposed
Unicode, while Macintosh keyboards generally produce precomposed
Unicode. The macintosh carbon VM converts back and forth between the
pre-composed to decomposed unicode when it is using UTF8 encoding.
This also depends on the file system and what it thinks it wants to
store unicode characters as...
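
The precomposed/decomposed distinction is easy to see with Python's standard unicodedata module. This only illustrates the two Unicode normalization forms; it makes no claim about what any particular VM does:

```python
import unicodedata

name = "J\u00fcrgen"                         # precomposed "ü" (U+00FC)
nfd = unicodedata.normalize("NFD", name)     # "u" + COMBINING DIAERESIS (U+0308)
nfc = unicodedata.normalize("NFC", nfd)      # back to the precomposed form

print(name.encode("utf-8"))   # b'J\xc3\xbcrgen'
print(nfd.encode("utf-8"))    # b'Ju\xcc\x88rgen'
print(nfc == name, nfd == name)              # True False
```

The two byte strings name the same file to a human but compare unequal byte for byte, which is why the direction of normalization matters at the VM boundary.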

Now in Sophie when we import this into Sophie the URI that was
generated is

/Users/johnmci/Work%20In%20Progress/squeak%20Bugs/%C6%8F.png

that becomes

'/Users/johnmci/Work In Progress/squeak Bugs/Æ∑.png'

But the VM ensures the proper thing is done. In sophie we store all
media paths as encoded URI objects, and convert to
what is required when we need to access the media.
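
Storing paths as percent-encoded UTF-8 keeps them ASCII-safe until the moment of access. A sketch with Python's urllib.parse follows; Sophie's own URI classes are not shown in this thread, so this is only an analogy:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

path = "/Users/johnmci/Work In Progress/squeak Bugs/\u018f.png"
uri_path = quote(path)                   # percent-encodes the UTF-8 bytes
print(uri_path)
# /Users/johnmci/Work%20In%20Progress/squeak%20Bugs/%C6%8F.png

raw = unquote_to_bytes(uri_path)         # the original UTF-8 bytes
good = unquote(uri_path)                 # decoded as UTF-8 (the default)
bad = unquote(uri_path, encoding="latin-1")   # same bytes, wrong charset
print(good == path, bad == path)         # True False
```

The encoded form is unambiguous; the damage only appears when the unescaped bytes are interpreted with the wrong character set.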

Oh and btw if you enter
file:///Users/johnmci/Work%20In%20Progress/squeak%20Bugs/%C6%8F.png
into FireFox, it's happy too. Oddly when you enter it into Safari it
becomes file:///Users/johnmci/Work%20In%20Progress/squeak%20Bugs/Ə.png

Oh and if I take the Ə.png from a terminal session, or the finder
window
and paste into a Sophie text field, yes it's Ə.png
because the extended clipboard support converts it properly from utf8,
Mind in TextEdit it comes across as RTF which is a different issue,
but *still* is correctly converted into utf-32 in Sophie.

People of course are welcome to uncover Unicode character issues with
Sophie and how it deals with file names or
textual data in text fields.

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================

johnmci

Re: Unix VM path encodings

In reply to this post by Andreas.Raab

On Dec 29, 2007, at 11:32 PM, Andreas Raab wrote:

> Hi -
>
> Due to a bug reported against Qwaq Forums I needed to look into how
> the Unix VM encodes file and path names and got terribly confused.

Also see

From: [hidden email]
Subject: Re: mac carbon VM goes to unix file names, testers needed

Date: February 6, 2006 12:54:17 AM PST (CA)

> We've been using UTF-8 VM encoding for a while, to be able to access
> files with non-ASCII characters in their path. However, I haven't
> found a way to permanently switch the image (3.8) to UTF-8 encoding.
> It keeps resetting to Latin1 on startup. The only solution for me
> was to put this line in my own startup code:
>
> LanguageEnvironment classPool at: #FileNameConverterClass put:
> UTF8TextConverter
>
> Did anybody create a better solution than this horrible hack? I feel
> someone else must be using UTF-8, too, now that we support it ...
>
> - Bert -

Andreas.Raab

Re: Unix VM path encodings

Oh, interesting. That reminds me of the fix that I needed for Windows,
let me see... yes, here it is: LanguageEnvironment
class>>defaultFileNameConverter needs to be fixed since it is (wrongly)
guessing the file name encoding based on the currently active locale
(which makes no sense btw, since the locale doesn't mean Jack for file
name encodings). Hm ... lemme try this ... ah, interesting. It appears
that I can make the Umlauts work on Unix correctly if and only if:
* I fix the above method to return UTF8TextConverter in every case [*1]
* I use -pathenc MacRoman -textenc MacRoman
Which makes no sense to me since neither the path nor the text encoding
is MacRoman but it appears to work. Huh?

[*1] And that of course reminds me that nobody has really made any
comment on why the hell we still deal with all of these nonsensical
legacy encodings and don't just go straight to UTF-8 in the VM interface
which would simplify *lots* of cruft in the code.

Cheers,
- Andreas


José "L. Redrejo" Rodríguez

Re: Re: Unix VM path encodings

El dom, 30-12-2007 a las 10:28 +0100, Andreas Raab escribió:

> Oh, interesting. That reminds me of the fix that I needed for Windows,
> let me see... yes, here it is: LanguageEnvironment
> class>>defaultFileNameConverter needs to be fixed since it is (wrongly)
> guessing the file name encoding based on the currently active locale
> (which makes no sense btw, since the locale doesn't mean Jack for file
> name encodings). Hm ... lemme try this ... ah, interesting. It appears
> that I can make the Umlauts work on Unix correctly if and only if:
> * I fix the above method to return UTF8TextConverter in every case [*1]
> * I use -pathenc MacRoman -textenc MacRoman
> Which makes no sense to me since neither the path nor the text encoding
> is MacRoman but it appears to work. Huh?
>

And what's the result of dragging and dropping a file (a gif, mp3, or
any other with the mime type registered) over the image if the file name
contains the Umlauts?
That has never worked either in the unix vm...

> [*1] And that of course reminds me that nobody has really made any
> comment on why the hell we still deal with all of these nonsensical
> legacy encodings and don't just go straight to UTF-8 in the VM interface
> which would simplify *lots* of cruft in the code.

Totally agree. Maybe every VM hacker thinks everybody uses English, or it
was just too messy to waste time on it...


johnmci

Re: Unix VM path encodings

In reply to this post by Andreas.Raab

On Dec 30, 2007, at 1:28 AM, Andreas Raab wrote:

> Oh, interesting. That reminds me of the fix that I needed for
> Windows, let me see... yes, here it is: LanguageEnvironment
> class>>defaultFileNameConverter needs to be fixed since it is
> (wrongly) guessing the file name encoding based on the currently
> active locale (which makes no sense btw, since the locale doesn't
> mean Jack for file name encodings). Hm ... lemme try this ... ah,
> interesting. It appears that I can make the Umlauts work on Unix
> correctly if and only if:
> * I fix the above method to return UTF8TextConverter in every case
> [*1]
> * I use -pathenc MacRoman -textenc MacRoman
> Which makes no sense to me since neither the path nor the text
> encoding is MacRoman but it appears to work. Huh?

Careful now, is that

ü
0x9F in MacRoman or
0x000000FC in UTF-32

which is coming back from the VM? Then what does
LanguageEnvironment class>>defaultFileNameConverter do,
and which font and character set does that font think it lives in?
The hex value might not have a proper representation if you use a
font without proper Unicode support after converting from something to Unicode...
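
For reference, the same character has a different byte value in each encoding mentioned in this thread. The values below come from Python's codecs, used here purely as a lookup table:

```python
u = "\u00fc"   # ü, LATIN SMALL LETTER U WITH DIAERESIS

table = {
    "mac_roman": u.encode("mac_roman"),     # b'\x9f'
    "latin-1":   u.encode("latin-1"),       # b'\xfc'
    "utf-8":     u.encode("utf-8"),         # b'\xc3\xbc'
    "utf-32-be": u.encode("utf-32-be"),     # b'\x00\x00\x00\xfc'
}
for enc, raw in table.items():
    print(f"{enc:10} {raw.hex()}")
```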


johnmci

Re: Unix VM path encodings

In reply to this post by Andreas.Raab

On Dec 30, 2007, at 1:28 AM, Andreas Raab wrote:

> [*1] And that of course reminds me that nobody has really made any
> comment on why the hell we still deal with all of these nonsensical
> legacy encodings and don't just go straight to UTF-8 in the VM
> interface which would simplify *lots* of cruft in the code.

I think all the vertical squeak packages (scratch, sophie, plopp) use
UTF8 and get to ignore the file tools. And
don't get Tim started on what he thinks of the classes that deal with
files...

Personally I wanted to move the carbon vm to utf8 awhile back, but
well as you see all the morphic file tools would
break in interesting ways.

Yoshiki Ohshima-2

Re: Unix VM path encodings

In reply to this post by Andreas.Raab

> Hm ... lemme try this ... ah, interesting. It appears
> that I can make the Umlauts work on Unix correctly if and only if:
> * I fix the above method to return UTF8TextConverter in every case [*1]
> * I use -pathenc MacRoman -textenc MacRoman
> Which makes no sense to me since neither the path nor the text encoding
> is MacRoman but it appears to work. Huh?

Yes, on the Unix VM another historical mishap caused it: "MacRoman"
still means "no conversion", so if the image passes a UTF-8 string,
the UTF-8 string is passed to the system calls unchanged.
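
If "MacRoman" effectively means "no conversion", the combination Andreas found becomes less mysterious. The following Python sketch models that reading. Here `convert` is a hypothetical stand-in, not the VM's actual routine, and the image is assumed to use a UTF-8 file name converter:

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Hypothetical model: identical encodings mean the bytes pass through."""
    if src == dst:
        return data
    return data.decode(src, errors="replace").encode(dst, errors="replace")

on_disk = "Jürgen".encode("utf-8")      # name as stored on a UTF-8 file system

# -pathenc MacRoman -textenc MacRoman: bytes reach the image untouched,
# so a UTF-8 converter in the image can decode them.
passthrough = convert(on_disk, "mac_roman", "mac_roman")
print(passthrough.decode("utf-8"))      # Jürgen

# -pathenc UTF-8 -textenc MacRoman: the VM hands the image MacRoman bytes,
# which a UTF-8 converter in the image cannot decode.
to_image = convert(on_disk, "utf-8", "mac_roman")
print(repr(to_image.decode("utf-8", errors="replace")))

# Same settings, other direction: UTF-8 bytes from the image are treated
# as MacRoman and re-encoded, so the name on disk ends up double-encoded.
to_disk = convert(on_disk, "mac_roman", "utf-8")
print(to_disk != on_disk)               # True
```

Under this model only the pass-through setting leaves the bytes alone in both directions, which would match the observation that it works "if and only if" the image side decodes UTF-8 itself.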

> [*1] And that of course reminds me that nobody has really made any
> comment on why the hell we still deal with all of these nonsensical
> legacy encodings and don't just go straight to UTF-8 in the VM interface
> which would simplify *lots* of cruft in the code.

Well, nobody has tried to change things on all platforms at once.
Windows is doing OK with the 3.10 VM and the OLPC Etoys image (there is still
code that deals with older VMs... the typical installation for people is to
install stuff from squeakland.org and then use the Etoys image).

-- Yoshiki

johnmci

Re: Re: Unix VM path encodings

On Dec 30, 2007, at 2:11 AM, Yoshiki Ohshima wrote:

>
>> Hm ... lemme try this ... ah, interesting. It appears
>> that I can make the Umlauts work on Unix correctly if and only if:
>> * I fix the above method to return UTF8TextConverter in every case
>> [*1]
>> * I use -pathenc MacRoman -textenc MacRoman
>> Which makes no sense to me since neither the path nor the text
>> encoding
>> is MacRoman but it appears to work. Huh?
>
> Yes, on Unix VM, another historical mishappen caused it; "MacRoman"
> still means "no conversion" so that if the image passes UTF-8 string,
> the UTF-8 string is passed to system calls.

Er, well I'm not sure that's quite accurate? In looking at
sqUnixCharConv.c it seems to say that if the
text encoding is macroman and the path encoding is macroman the
translation from unix path to squeak would be macroman to macroman so
nothing would happen.

Convert(sq,ux, Path, sqTextEncoding, uxPathEncoding, 1, 0); // normalised paths for HFS+
Convert(ux,sq, Path, uxPathEncoding, sqTextEncoding, 0, 0);

in
sqInt dir_Lookup(char *pathString, sqInt pathStringLength, sqInt index,
    /* outputs: */ char *name, sqInt *nameLength, sqInt *creationDate,
    sqInt *modificationDate, sqInt *isDirectory,
    squeakFileOffsetType *sizeIfFile)

we find
*nameLength= ux2sqPath(dirEntry->d_name, nameLen, name, MAXPATHLEN, 0);

However I note the carbon vm does
if (norm) // HFS+ imposes Unicode2.1 decomposed UTF-8 encoding on all path elements
    CFStringNormalize(str, kCFStringNormalizationFormD); // canonical decomposition
else
    CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined

but the unix VM does not do this, which I think is an error based on:

See
From: [hidden email]
Subject: [Vm-dev] Patch for filename normalization of mac vm

Date: March 11, 2007 8:21:06 PM PDT (CA)

To: [hidden email]

> Hi,
>
> I've found the latest mac vm (or recent version) fails to normalize
> UTF file name.
> It seems to be the function convertChars() of
> sqMacUnixFileInterface.c, which normalizes only decompose when
> converting squeak string to unix, but I think it needs pre-combined
> when unix string to squeak, and I noticed normalization form should
> be canonical (exactly should be kCFStringNormalizationFormC) for pre-
> combined.
>
> Patch (diff format of xcode tool) for this problem is attached to
> this mail.
>
> Regards,
> --
> Tetsuya HAYASHI, [hidden email], [hidden email]

PS: I note that if you feed CFStringCreateWithBytes bad data it
returns NULL, and then the lurking CFStringCreateMutableCopy core dumps
the VM. That's why I check for it in the Carbon VM. Normally you
won't see this issue unless you get creative...

Andreas.Raab

Re: Unix VM path encodings

In reply to this post by Yoshiki Ohshima-2

Yoshiki Ohshima wrote:

>> Hm ... lemme try this ... ah, interesting. It appears
>> that I can make the Umlauts work on Unix correctly if and only if:
>> * I fix the above method to return UTF8TextConverter in every case [*1]
>> * I use -pathenc MacRoman -textenc MacRoman
>> Which makes no sense to me since neither the path nor the text encoding
>> is MacRoman but it appears to work. Huh?
>
> Yes, on Unix VM, another historical mishappen caused it; "MacRoman"
> still means "no conversion" so that if the image passes UTF-8 string,
> the UTF-8 string is passed to system calls.

Playing around a little it appears as if the Unix VM always converts
path names with the assumption that Squeak uses MacRoman in the image
and only -pathenc affects the translation between file system and the
image (i.e., -textenc has *no* effect on path name translation
whatsoever). Can someone confirm this? It would explain why -pathenc
MacRoman works (since like you say it's really the "no conversion" flag)
if combined with a proper file name converter in the image.

>> [*1] And that of course reminds me that nobody has really made any
>> comment on why the hell we still deal with all of these nonsensical
>> legacy encodings and don't just go straight to UTF-8 in the VM interface
>> which would simplify *lots* of cruft in the code.
>
> Well, nobody tried to change stuff on the all platforms at once.
> Windows is doing ok with 3.10 VM and OLPC Etoys image (there is still
> code that deals with older VM... typical installation for people is to
> install stuff from squeakland.org and then use Etoys image).

What encoding options are being used on OLPC? Do non-ascii file names,
clipboard, drag and drop etc. work on OLPC?

Cheers,
- Andreas

johnmci

Re: Re: Unix VM path encodings

On Dec 30, 2007, at 4:09 AM, Andreas Raab wrote:

> Yoshiki Ohshima wrote:
>>> Hm ... lemme try this ... ah, interesting. It appears that I can
>>> make the Umlauts work on Unix correctly if and only if:
>>> * I fix the above method to return UTF8TextConverter in every case
>>> [*1]
>>> * I use -pathenc MacRoman -textenc MacRoman
>>> Which makes no sense to me since neither the path nor the text
>>> encoding is MacRoman but it appears to work. Huh?
>> Yes, on Unix VM, another historical mishappen caused it; "MacRoman"
>> still means "no conversion" so that if the image passes UTF-8 string,
>> the UTF-8 string is passed to system calls.
>
> Playing around a little it appears as if the Unix VM always converts
> path names with the assumption that Squeak uses MacRoman in the
> image and only -pathenc affects the translation between file system
> and the image (i.e., -textenc has *no* effect on path name
> translation whatsoever). Can someone confirm this? It would explain
> why -pathenc MacRoman works (since like you say it's really the "no
> conversion" flag) if combined with a proper file name converter in
> the image.

Mmm, for -pathenc and -textenc, from what I can see the data
coming from the file system is assumed to be
in the -pathenc encoding and is translated to a CFString in UTF-32, then
translated back to a byte string in -textenc.

When sending data to the file system, it is assumed to be in the
-textenc encoding, then translated to a CFString in UTF-32,
then CFStringNormalize(str, kCFStringNormalizationFormD) (canonical
decomposition) is applied, and it is then translated back to
a byte string in -pathenc.

I'll note the kCFStringNormalizationFormD operation (and everything
above and below) only occurs if this is a Macintosh. If this is a
Linux/BSD Unix system then iconv is used. So is this on a Mac or some
Linux/BSD system?

That, and I think a kCFStringNormalizationFormC is needed in the first
step to properly compose the characters.
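
The two directions described above can be modelled roughly as follows. This is a Python sketch of the data flow only: the function names are made up for illustration, Python's codecs and unicodedata replace CFString or iconv, and the `compose` flag reflects the suggested extra Form C step rather than anything the Unix VM is known to do.

```python
import unicodedata

def fs_to_image(raw: bytes, pathenc: str, textenc: str, compose: bool = False) -> bytes:
    """File system -> image: decode as pathenc, optionally compose, encode as textenc."""
    text = raw.decode(pathenc, errors="replace")
    if compose:
        text = unicodedata.normalize("NFC", text)
    return text.encode(textenc, errors="replace")

def image_to_fs(raw: bytes, textenc: str, pathenc: str, decompose: bool = True) -> bytes:
    """Image -> file system: decode as textenc, optionally decompose, encode as pathenc."""
    text = raw.decode(textenc, errors="replace")
    if decompose:
        text = unicodedata.normalize("NFD", text)
    return text.encode(pathenc, errors="replace")

sent = image_to_fs("Jürgen".encode("mac_roman"), "mac_roman", "utf-8")
print(sent)                                                   # b'Ju\xcc\x88rgen'
print(fs_to_image(sent, "utf-8", "mac_roman"))                # b'Ju?rgen'
print(fs_to_image(sent, "utf-8", "mac_roman", compose=True))  # b'J\x9frgen'
```

Without the compose step the combining diaeresis has no MacRoman equivalent and is lost on the way back into the image, so the round trip is not symmetric.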

For background

Ok, let's see in the mac carbon vm we get back from the file system for
LATIN CAPITAL LETTER SCHWA + LATIN CAPITAL LETTER A WITH ACUTE
ƏÁ.png

0xC6, 0x8F, A, 0xCC, 0x81, .png

Note how the A followed by 0xCC 0x81 is the decomposed UTF-8; this is
what is stored in the HFS+ file system.

We convert that from UTF8 to the target of MacRoman for path names by
default in the base carbon VM.
This means converting to a CFString from kCFStringEncodingUTF8
then applying
CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined
then pulling back the bytes as MacRoman, that becomes

?, 0xE7, .png

since the translation of the Ə from utf8 to macroman fails, but the
E7 is correct macroman for the Á

Now if I set the vm up to use UTF8 as the path name default.

after we apply the kCFStringNormalizationFormC step and pull back the
data as UTF8 it is

0xC6, 0x8F, 0xC3, 0x81, .png where the 0xC381 is the (LATIN CAPITAL
LETTER A WITH ACUTE) in UTF8 or 0x00c1 in utf-16 or 0x000000c1 in utf-32
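
The byte sequences quoted above can be checked with a few lines of Python, using unicodedata.normalize in place of CFStringNormalize (on the assumption that the two agree for these characters):

```python
import unicodedata

# Name as returned by HFS+: SCHWA, then "A" + COMBINING ACUTE ACCENT (decomposed).
from_fs = b"\xc6\x8f" + b"A\xcc\x81" + b".png"

text = unicodedata.normalize("NFC", from_fs.decode("utf-8"))   # pre-combined

as_mac_roman = text.encode("mac_roman", errors="replace")
as_utf8 = text.encode("utf-8")

print(as_mac_roman)   # b'?\xe7.png'
print(as_utf8)        # b'\xc6\x8f\xc3\x81.png'
```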

Now if I remove the CFStringNormalize(str,
kCFStringNormalizationFormC); // pre-combined
which does not exist in the base unix vm, then I get back

0xC6, 0x8F, A, 0xCC, 0x81, '.png'

which is the decomposed UTF8. I'll note in the file browser it shows
as
Æ∑AÌ∞.png

but it does work....

NOW the question is what does the translation do... mmm Well if I try
LanguageEnvironment classPool at: #FileNameConverterClass put:
UTF8TextConverter
then it shows:
?A?.png

which is mmm, less wrong? But it does work.

Now if in the base unix VM you take path encoding and text encoding
and set to macroman
Convert(ux,sq, Path, uxPathEncoding, sqTextEncoding, 0, 0);
then we are saying the operating system (unix) path encoding is
macroman and the squeak path name encoding is macroman that gives back

0xC6, 0x8F, A, 0xCC, 0x81, '.png'

since as thought the macroman to macroman translation does nothing.

However the UTF8TextConverter does not work with decomposed UTF8 so
what the user would see is not correct, since it assumes we are
working with precomposed UTF8 when it converts it to UTF-32 for the
font system's enjoyment.
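
What a converter that decodes UTF-8 without composing would hand to the font system can be inspected code point by code point. In this Python sketch, plain UTF-8 decoding stands in for UTF8TextConverter, which is an assumption about its behavior based on the description above:

```python
import unicodedata

decomposed = b"A\xcc\x81".decode("utf-8")            # as stored by HFS+
composed = unicodedata.normalize("NFC", decomposed)  # single precomposed code point

for label, s in (("decomposed", decomposed), ("composed", composed)):
    names = [unicodedata.name(ch) for ch in s]
    print(label, [f"U+{ord(ch):04X}" for ch in s], names)

# decomposed ['U+0041', 'U+0301'] ['LATIN CAPITAL LETTER A', 'COMBINING ACUTE ACCENT']
# composed ['U+00C1'] ['LATIN CAPITAL LETTER A WITH ACUTE']
```

A renderer that does not handle combining marks would draw the decomposed form as a plain "A" followed by an unknown glyph, which is consistent with the "?A?.png" display reported earlier.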

I suspect to fix properly a
CFStringNormalize(str, kCFStringNormalizationFormC);
is needed in the Unix sqUnixCharConv.c and applied when applicable.

Notes from Apple's site

For example, an Á (A acute) can be encoded either precomposed, as
U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE), or decomposed, as U+0041
U+0301 (LATIN CAPITAL LETTER A followed by a COMBINING ACUTE ACCENT;
the UTF-8 encoding of U+0301 is 0xCC 0x81). Precomposed characters are
more common in the Windows world, whereas decomposed characters are
more common on the Mac.


Yoshiki Ohshima-2

Re: Unix VM path encodings

In reply to this post by Andreas.Raab

Andreas,

> Playing around a little it appears as if the Unix VM always converts
> path names with the assumption that Squeak uses MacRoman in the image
> and only -pathenc affects the translation between file system and the
> image (i.e., -textenc has *no* effect on path name translation
> whatsoever). Can someone confirm this? It would explain why -pathenc
> MacRoman works (since like you say it's really the "no conversion" flag)
> if combined with a proper file name converter in the image.

I think it is. If I remember correctly, this was introduced before
3.8 (around 3.7 era, I believe) by Ned, and that was only to make
something he was trying to do work.

> > Well, nobody tried to change stuff on the all platforms at once.
> > Windows is doing ok with 3.10 VM and OLPC Etoys image (there is still
> > code that deals with older VM... typical installation for people is to
> > install stuff from squeakland.org and then use Etoys image).
>
> What encoding options are being used on OLPC? Do non-ascii file names,
> clipboard, drag and drop etc. work on OLPC?

Filenames are in UTF-8. Clipboard and drag and drop use a somewhat
strange mechanism, but they are normalized to UTF-8 as well.

-- Yoshiki

Andreas.Raab

Re: Re: Unix VM path encodings

Yoshiki Ohshima wrote:
>> What encoding options are being used on OLPC? Do non-ascii file names,
>> clipboard, drag and drop etc. work on OLPC?
>
> Filenames are in UTF-8. Clipboard and Drag and drop uses somewhat
> strange mechanism, but they are normalized to UTF-8 as well.

Odd. Checking the latest OLPC image I have access to (etoys2.3-1867)
there is nothing I can see that would deal properly in the image with
non-ascii file names on Unix. LanguageEnvironment still returns "self
currentPlatform class fileNameConverterClass" which won't use an UTF-8
converter anywhere. So unless I'm missing something it probably
shouldn't work on OLPC. Maybe try it with some "real" UTF-32 names?

Cheers,
- Andreas

Yoshiki Ohshima-2

Re: Re: Unix VM path encodings

Andreas,

> Odd. Checking the latest OLPC image I have access to (etoys2.3-1867)
> there is nothing I can see that would deal properly in the image with
> non-ascii file names on Unix. LanguageEnvironment still returns "self
> currentPlatform class fileNameConverterClass" which won't use an UTF-8
> converter anywhere. So unless I'm missing something it probably
> shouldn't work on OLPC. Maybe try it with some "real" UTF-32 names?

Ah, you are right. The code wasn't the way I remembered it should be.
Latin1Environment (which should be called WesternEuropeanEnvironment)
fileNameConverterClass should be a bit more elaborate (or just
return UTF8TextConverter).

-- Yoshiki