Extended character sets

Peter Kenny-2
Hi

This is a topic which has come up before here, but I couldn't find any
definite advice. I am looking at documents in French. Some of them seem to
be encoded in UTF-8 (most characters are single bytes like ASCII, but
accented characters occupy two bytes, so for instance e-acute appears as Ã©
in Dolphin text output), while others use for example &eacute; to denote the
same letter. After a fascinating(!) afternoon studying Wikipedia on UTF-8
and ISO 8859-1, I have cobbled together something which will translate all
the UTF-8 text in these pages, but I wonder how I will get on with other
languages. Are there any tools which will cope with a wide range of accented
characters, and with the &-escaped coding? My objective is to be able to
transcribe text in any one of these codes into a standard form, so that I
can do dictionary searches on it, and as far as possible to be able to
display it as something readable in a Dolphin app.
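
To make the goal concrete, here is a minimal sketch of that kind of transcription, written in Python rather than Dolphin Smalltalk and using only the standard library. The function names to_standard_form and search_key are hypothetical, and the "try UTF-8 first, otherwise assume ISO 8859-1" rule is an assumption that suits the French pages described above; it is a heuristic, not a reliable encoding detector.

```python
import html
import unicodedata


def to_standard_form(raw: bytes) -> str:
    """Decode bytes assumed to be UTF-8 or ISO 8859-1 and expand &-escapes."""
    try:
        # Strict UTF-8 decoding rejects most ISO 8859-1 accented text,
        # so a failure here is used as the signal to fall back.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO 8859-1 maps every byte to a character, so this cannot fail.
        text = raw.decode("iso-8859-1")
    # Turn entities such as &eacute; or &#233; into the characters they name.
    text = html.unescape(text)
    # NFC composes a base letter plus combining accent into one code point.
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Hypothetical accent-insensitive, case-insensitive key for lookups."""
    # NFD splits accented letters into base letter + combining marks,
    # which are then dropped before case folding.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

With this sketch, the UTF-8 bytes, the ISO 8859-1 byte, and the &eacute; form of e-acute all come out as the same single character. Languages such as Croatian or Slovenian may arrive in other legacy encodings (ISO 8859-2, for example), in which case the fallback step would need a different codec.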

I know there are contributors to this group whose native languages are
French, German, Spanish, Croatian, Slovenian (and others I can't remember);
are there any general solutions out there?

Thanks for any help.

--
Best wishes

Peter Kenny

