ANSI code pages strangeness with Scintilla

Janos Kazsoki
Hi,

on the way to localisation I tried to fix my ouUO characters, but
was not able to. Any suggestion would be appreciated.

If I type ouUO in a workspace then I get õûÕÛ displayed
independently of setting the Workspace>>Option>>Text. (this seems to be
the Windows Western interpretation instead of the Windows Central
Europe)
If I copy uoUO (from Winword for ex.) and paste in a workspace,
then I get ouOU. (Well, similar, but without the small "-s above the
letters)

What I have tried until now according to the earlier suggestions:

It seems that in my case the problem is not about double-byte or Unicode
encodings at all, but about switching the Windows ANSI code page from
Windows Western to Windows Central Europe (i.e. from CP 1252 to CP 1250).
My ouOU stay in the single-byte range; the byte values are 245 251 213 219
("õûÕÛ" respectively).
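
To make the byte-level picture concrete, here is a minimal sketch in Python (not Dolphin code; Python's built-in cp1250 and cp1252 codecs are assumed to match the Windows code pages of the same number). It decodes the same four bytes under both code pages:

```python
# The four byte values quoted in the post.
raw = bytes([245, 251, 213, 219])

# Same bytes, two different single-byte Windows code pages.
central_european = raw.decode("cp1250")  # Windows Central Europe
western = raw.decode("cp1252")           # Windows Western (Latin 1)

print(central_european)  # o/u/O/U with double acute
print(western)           # o/O with tilde, u/U with circumflex
```

The bytes never change; only the code page used to interpret them decides which glyphs appear.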

0.) The ouOU works fine in D5. There we do not have this problem.
AND it works fine in the basic text presenter (as you can see below).
Only the Scintilla seems to have difficulties with the simple ANSI
codepages...

1.) Davorin's follow-up post:

"i solved my own problem.  windows 2000 looks at 'system locale' when
converting non-unicode text (and changing that system locale is the first
operation i found that required a reboot).  after changing system locale,
croatian characters are displayed ok."

I think that applied to pre-D6 times, because I have set my Windows (XP)
locale and keyboard to Hungarian and restarted the machine, and
the result is the same.
If I type ouUO in a workspace then I get õûÕÛ displayed. If I
copy uoUO (from Winword for ex.) and paste in a workspace, then I
get ouOU.

2.) Then checked Chris' suggestion:

I tried

    bytes := #[
                        245 251 213 219 "õûÕÛ"
                ].
    tm := bytes asString asValue.
    tp := TextPresenter show: 'Scintilla view' on: tm.
    sci := tp view.

And got õûÕÛ.

Strangely enough, the "basic" TextPresenter displays fine.
If I do

    tp1 := TextPresenter showOn: tm.

then I get the correct ouOU displayed.

And even if I show a base TextPresenter on the strange Scintilla
characters:

   tp2 := TextPresenter showOn: 'õûÕÛ'.

then I also get the correct ouOU displayed. This gives hope for the
runtime!

3.) With Chris' "fixed" version of SciLexer.dll: the result is the
same.

Then added to KernelLibrary
getACP
    "Retrieves the current ANSI code-page identifier for the system."

    <stdcall: dword GetACP>
    ^self invalidCall

And evaluated
KernelLibrary default getACP.
It is 1250.

4.) "If it answers 0, then Scintilla is just leaving the formatting to
Windows, which /should/ get it right"

    sci sciSetCodePage: 0.
After setting CP to 0, the result is the same.

5.) Then, following Blair's suggestion, I inserted

 view sciSetCodePage: KernelLibrary default getACP

into the SmalltalkWorkspace>>applyOptions

The result is the same.

Now:
The õûÕÛ displayed by Scintilla changes if I change the sci code page
with #sciSetCodePage:, for example
    sci sciSetCodePage: 1251.
In the Advanced tab of the Windows regional settings all code page
conversions are enabled from 1250 (ANSI Central Europe) to 1258 (ANSI
OEM - Viet Nam).
1250 (Central Europe) gives õûÕÛ
1251 (Cyrillic) gives some Russian characters
1252 (Latin 1) seems to display the same as 1250
1253 (Greek) displays something else
and so on.
I tried every value from 1250 to 1258, but none of them gives me the
correct display, although according to the considerations above at least
0 or 1250 should.
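
For comparison with what Scintilla shows, the following Python sketch lists what each of the code pages 1250 to 1258 itself defines for those four bytes. It assumes Python's built-in `cpNNNN` codecs mirror the Windows tables, and it uses `errors="replace"` because some pages leave a few byte values undefined:

```python
raw = bytes([245, 251, 213, 219])

# Decode the same bytes under every Windows ANSI code page from 1250 to 1258.
# Undefined byte values are shown as U+FFFD instead of raising an error.
renderings = {
    cp: raw.decode(f"cp{cp}", errors="replace") for cp in range(1250, 1259)
}

for cp, text in renderings.items():
    print(cp, text)
```

By these tables 1250 and 1252 should differ, so a control that shows the same glyphs for both is presumably not using the requested code page for display.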

Well, I can live with this if necessary, because the characters that are
displayed "wrongly" in method source seem to be displayed fine in the
basic presenters. Still, it would be nice to see the same text in the
method as is displayed at runtime.

Any idea or suggestion: why does Scintilla behave so strangely with
simple ANSI code pages, and how can I get the correct display?

Thanks in advance,
Janos


Re: ANSI code pages strangeness with Scintilla

Janos Kazsoki
Yeah, well...

it seems Google itself has problems with displaying some ANSI CP 1250
characters.

So to the őűŐŰ above:
ő is "Latin Small letter O with double acute"
ű is "Latin Small letter U with double acute"
Ő is "Latin Capital letter O with double acute"
Ű is "Latin Capital letter U with double acute"

Life is good!
Janos


Re: ANSI code pages strangeness with Scintilla

Chris Uppal-3
Janos,

> it seems Google itself has problems with displaying some ANSI CP 1250
> characters.

Not only Google, but many or most newsreaders too. Thanks for the
clarification.

When those characters were rendered in my newsreader, it interpreted the bytes
    #[16rF5 16rFB 16rD5 16rDB]
as
    o-tilde u-circumflex O-tilde U-circumflex
which is correct for code-page 1252.  In code page 1250 the same bytes should
be interpreted as:
    o-double-acute u-double-acute O-double-acute U-double-acute
It's difficult to see the difference unless you use a large font (or set
#zoomLevel: to something like 20).

I'm not absolutely sure, but I /think/ that the problem you are experiencing is
that when you paste the ouOU-with-double-acute-accents string into a workspace,
the accents don't show up, even though it /ought/ to work (because your Windows
system code page is 1250).  If not then the rest of this might not be much help
;-)

I think I know roughly what's going wrong.  It seems to be a bug in the way
that Dolphin handles Scintilla/Windows charset identifiers (as distinct from
code page identifiers -- Redmond work /hard/ to make this stuff confusing).

The rest of this is aimed as much at OA, or at least Blair, as yourself ;-)

When you paste your string into the workspace, Dolphin instructs Scintilla to
paste the current clipboard text.  Scintilla gets the text off the clipboard
(as Unicode) and attempts to convert it to something it can insert into its
text buffer.  This is where the first odd thing happens.  Scintilla uses the
#characterSet of the current text style in preference to the code page of the
control itself (I don't know why it does that).  So if the #characterSet is not
SC_CHARSET_DEFAULT -- which has value 1, not 0 as you might imagine -- then the
control will attempt to convert the Unicode text from the clipboard into the
code page corresponding to that charset.  If the charset /is/
SC_CHARSET_DEFAULT then the control converts the Unicode into its own
configured code page.
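
The selection rule described above can be sketched as follows. This is an illustrative Python model, not Scintilla's source: the function name `effective_code_page` and the lookup table are hypothetical, the charset values are the Windows ones, and mapping the ANSI charset to CP 1252 is an assumption.

```python
# Windows/Scintilla charset identifiers (numeric values as in the Windows headers).
SC_CHARSET_ANSI = 0
SC_CHARSET_DEFAULT = 1
SC_CHARSET_GREEK = 161
SC_CHARSET_RUSSIAN = 204
SC_CHARSET_EASTEUROPE = 238

# Assumed charset -> code page table (illustrative subset only).
CHARSET_TO_CODE_PAGE = {
    SC_CHARSET_ANSI: 1252,
    SC_CHARSET_GREEK: 1253,
    SC_CHARSET_RUSSIAN: 1251,
    SC_CHARSET_EASTEUROPE: 1250,
}

def effective_code_page(style_charset, control_code_page):
    """Pick the code page that pasted Unicode text is converted to."""
    if style_charset == SC_CHARSET_DEFAULT:
        # The style does not care, so the control's own code page wins.
        return control_code_page
    # Otherwise the style's charset takes precedence over the control.
    return CHARSET_TO_CODE_PAGE.get(style_charset, control_code_page)

print(effective_code_page(SC_CHARSET_DEFAULT, 1250))  # 1250
print(effective_code_page(SC_CHARSET_ANSI, 1250))     # 1252
```

Under this model a control configured for 1250 still converts to the Western code page whenever the style's charset is 0.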

Scintilla's own default for the charset of each text style is
SC_CHARSET_DEFAULT, so it normally works correctly -- and that's why an
ordinary TextPresenter with a Scintilla View works correctly.

But when Dolphin defines ScintillaTextStyles for use in a workspace (or
anywhere else), the code is set up so that the character set defaults to nil,
which ends up being SC_CHARSET_ANSI (=0). So if Dolphin has set up a text
style for the control, then it will tell Scintilla to use a charset of 0.
Scintilla does as it is asked, and the result is that all text which is
pasted into a styled ScintillaView is converted (as far as possible) into
ASCII before being displayed.

In the case of your string, Windows will convert it to ouOU (with no accents),
which is why the accents disappear when you paste into a workspace.   If the
characters have no near equivalent then you'll just end up with a lot of ????s.
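
The loss of accents can be imitated in Python. This is only a rough sketch: dropping combining marks after Unicode decomposition stands in for the Windows best-fit conversion, which uses its own tables, and `strip_accents` is a hypothetical helper name.

```python
import unicodedata

pasted = "\u0151\u0171\u0150\u0170"  # o, u, O, U with double acute

# CP 1250 contains all four letters, so this conversion loses nothing.
central = pasted.encode("cp1250")

def strip_accents(text):
    """Rough stand-in for a best-fit conversion: drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# CP 1252 contains none of them. With errors="replace" each letter becomes
# a question mark; with accents stripped first, plain letters survive.
question_marks = pasted.encode("cp1252", errors="replace")
best_fit = strip_accents(pasted).encode("cp1252")

print(central)         # b'\xf5\xfb\xd5\xdb'
print(question_marks)  # b'????'
print(best_fit)        # b'ouOU'
```

Either way the original letters cannot be recovered from the converted bytes, which matches the behaviour described for pasting into a styled workspace.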

By way of a temporary hack to see if I could fix it (not a serious attempt at a
fix, just to find out if I'd interpreted the problem correctly), I made a few
changes to my image.  I very definitely am /not/ recommending these as fixes
for your image, but they may help confirm that we are seeing the same problem,
and may also help Blair a little bit.

I changed a few methods to stop ScintillaTextStyle defaulting to
SC_CHARSET_ANSI.

===============
initialize
 super initialize.
 flags := 0.
 characterSet := SC_CHARSET_DEFAULT.
===============
characterSet
 ^characterSet ifNil: [SC_CHARSET_DEFAULT].
===============

And in ScintillaView I changed #buildDefaultStyle to use SC_CHARSET_DEFAULT
instead of SC_CHARSET_ANSI.

Unfortunately that still didn't quite fix it.  It turns out that the logic in
ScintillaTextStyle>>mergeFont: will always overwrite the style's #characterSet
with the #characterSet from the Font.  Unfortunately, that seems to be 0 in all
the cases I tried, and that's exactly what we don't want.  I don't really know
what Windows is doing here, I suspect that the value is simply wrong.  In any
case, I just commented out the line:

     characterSet ifNil: [characterSet := aFont characterSet]

And then new workspaces pasted text correctly !

    -- chris


Re: ANSI code pages strangeness with Scintilla

Blair McGlashan-4
"Chris Uppal" wrote:

>...
> Unfortunately that still didn't quite fix it.  It turns out that the logic
> in ScintillaTextStyle>>mergeFont: will always overwrite the style's
> #characterSet with the #characterSet from the Font.  Unfortunately, that
> seems to be 0 in all the cases I tried, and that's exactly what we don't
> want.  I don't really know what Windows is doing here, I suspect that the
> value is simply wrong.

That's odd. It works fine for me (i.e. if I execute 'Font choose', choose a
font, and set the character set to Central European, then I do get back
238). This is on XP SP2.

>...In any
> case, I just commented out the line:
>
>     characterSet ifNil: [characterSet := aFont characterSet]
>
> And then new workspaces pasted text correctly !

I don't think that is right though. It should be respecting the character
set of the Font you set. There is no way to choose "default" in the font
dialog, so I'm afraid we end up shipping with the western (ANSI) character
set anyway. Maybe we could set it up programmatically so it will work for
more people out of the box. In any case Font characterSet should be
returning the correct value (the Scintilla constants are defined to have the
same values as the Windows constants, although Scintilla defines a few extra
over and above those in gdi32.h), so if you change the workspace default
font character set it should work.

Indeed I find that if I just change the default workspace font to use the
Central European character set, that both direct character entry and
copy&paste from Wordpad work fine. This was tested in a fresh 6.02 install
without any patches (i.e without any previously discussed
internationalisation improvements).

Regards

Blair


Re: ANSI code pages strangeness with Scintilla

Janos Kazsoki
Blair,

> Indeed I find that if I just change the default workspace font to use the
> Central European character set, that both direct character entry and
> copy&paste from Wordpad work fine.

How do you set it?

By

Font characterSet: 1250

or somehow else?

Thank you,
Janos


Re: ANSI code pages strangeness with Scintilla

Janos Kazsoki
In reply to this post by Blair McGlashan-4
Blair,

Blair McGlashan wrote:
> Indeed I find that if I just change the default workspace font to use the
> Central European character set, that both direct character entry and
> copy&paste from Wordpad work fine.

How do you change the workspace default font character set?

Thank you,
Janos


Re: ANSI code pages strangeness with Scintilla

Blair McGlashan-4
"Janos" wrote:
> Blair,
>
> Blair McGlashan wrote:
>> Indeed I find that if I just change the default workspace font to use the
>> Central European character set, that both direct character entry and
>> copy&paste from Wordpad work fine.
>
> How do you change the workspace default font character set?

It's in Dolphin Options. You can either navigate to this from the system
launcher window, or in an open workspace invoke Tools/Options/Inspect.
You'll get an inspector. In that inspector double-click the defaultFont node
in the tree. You'll get a common font dialog in which you can choose the
font, including the character set.

Regards

Blair


Re: ANSI code pages strangeness with Scintilla

Janos Kazsoki
Blair,

Wow. So simple it is!!!!!!!!!!!! And it works fine!

Then:
"I'm afraid we end up shipping with the western (ANSI) character set
anyway. "

I do not think it is really necessary to deliver Dolphin with something
other than the Western (ANSI) character set. (Well, we are not talking about
Asian languages here; they are another story, already discussed in another
entry.)


You can change it any time if you need.

Great! Many thanks,
Janos


Re: ANSI code pages strangeness with Scintilla

Chris Uppal-3
In reply to this post by Blair McGlashan-4
Blair,

> That's odd. It works fine for me (i.e. if I execute 'Font choose', choose
> a font, and set the character set to Central European, then I do get back
> 238).

That would probably work for me too; I hadn't noticed the "script" option in
the font dialog.

Janos seems happy, so maybe the issue is closed, but it still seems (to me)
that there's something wrong here.

I just don't understand what the "script" attribute of a font is supposed to
mean.  There seems to be very little about the corresponding field in LOGFONT
in MSDN.  I know very little about pre-Unicode internationalisation, but it
seems plausible that the charset field is some sort of archaic holdover which
has no meaning now.   Fonts don't have charsets or code-pages any more, do
they?  I thought the underlying machinery worked in Unicode.

In this instance, if Dolphin and Scintilla both attempt to honour that field,
then the control inevitably ends up discarding information (unless the field
happens to be set to a value which matches the control's idea of the document
code page).  Unless the charset field does something real that I don't know
about (perfectly possible), I would have thought that it would be more correct
to force it into correspondence with the target document's code page, or --
since that isn't trivial afaik -- to force it always to CHARSET_DEFAULT which
Scintilla effectively ignores.

As I say, I'm far from certain about this, but that's how it seems to me today.

    -- chris