Help: weird bug in inspecting characters

K K Subbu

Help: weird bug in inspecting characters

Hi,

I need help in tracing a bug (see attached picture) which triggered a
MNU while trying to view a .cs file in FileTool. I traced the problem to
peek on aStream returning a nil because a wrong character code was being
returned in generated.

In the attached picture, aStream isBinary is false and the basicNext
returns the correct $^ character which gets stored in character1 local
var. But an inspector displays it as Character 128. In the same
inspector window $^ shows the correct character code as 94.

This is on Squeak5.2alpha-64b-Linux-18127. What is happening here? Has
anyone seen this type of behavior before?

Thanks in advance .. Subbu

Attachment: strangeCharBug.png (73K)

Eliot Miranda-2

Re: Help: weird bug in inspecting characters

Hi Subbu,

> On Jul 2, 2018, at 7:24 AM, K K Subbu <[hidden email]> wrote:
>
> Hi,
>
> I need help in tracing a bug (see attached picture) which triggered a MNU while trying to view a .cs file in FileTool. I traced the problem to peek on aStream returning a nil because a wrong character code was being returned in generated.
>
> In the attached picture, aStream isBinary is false and the basicNext returns the correct $^ character which gets stored in character1 local var. But an inspector displays it as Character 128. In the same inspector window $^ shows the correct character code as 94.
>
> This is on Squeak5.2alpha-64b-Linux-18127. What is happening here?

No idea. Do you have a test case?

> Has anyone seen this type of behavior before?
>
>
> Thanks in advance .. Subbu
> <strangeCharBug.png>

Ron Teitelbaum

Re: Help: weird bug in inspecting characters

That is strange.

On Squeak 4.1

$^ charCode -> 94

94 asCharacter -> $^

128 asCharacter -> $€ charCode -> 128 (doesn't show properly in text on email but does in squeak, see image).

In other words, if I use my keyboard to type it in, it seems to be represented fine and it evaluates to charCode 94 as expected.

But something created with charCode 128 is also represented with the same symbol, and it retains its 128 charCode, as you can see when you send charCode to the string representation that was created.

If this was filed out, it would seem that either version could have been used in the code and you wouldn't notice it. Manually changing it by typing in ^ and filing it out again will probably fix it. An external editor changing 128 to 94 chars will also probably work.

All the best,

Ron Teitelbaum

K K Subbu

Re: Help: weird bug in inspecting characters

In reply to this post by Eliot Miranda-2

On Monday 02 July 2018 09:53 PM, Eliot Miranda wrote:

>
> Hi Subbu,
>
>
> No idea. Do you have a test case?

Attached is the file that triggered the error.
It is extracted from a project file produced on an old Mac (bigendian).

File List -> select x.cs and click Changes. This will bring up MNU window
Select "scanFrom:..." and open inspector on "file" ivar
Debug "self peek"
step into "self next" and then into "nextFromStream".
step over until "character1" is assigned.
Press "Restart" and step over till it is assigned again.
Notice that it shows as $^ in debug panel
inspect character1. It shows as Character 128.

The stream position is 1086 out of 2048. The actual byte in the position
is indeed 128 but I don't know why it is appearing as $^ (94) in the
inspector.

Regards .. Subbu

Attachment: x.cs (141K)

K K Subbu

Re: Help: weird bug in inspecting characters

In reply to this post by Ron Teitelbaum

On Monday 02 July 2018 10:59 PM, Ron Teitelbaum wrote:
> But something created with 128 charCode also is represented with the
> same symbol and it also retains it's 128 charCode as you can see with
> you send charCode to the string representation that was created.

OK. That explains it. Thank you for the screen grab.

Regards .. Subbu

Tobias Pape

Re: Help: weird bug in inspecting characters

In reply to this post by Ron Teitelbaum

Hi all

Maybe I can shed a bit of light on things here.

If you look at the attached image (which is one of the default fonts we use), you see that ^ and _ are present after [\] but ALSO after {|}~. This seems to be intentional so that you, should you want, can switch between caret/underscore and up-arrow/left-arrow printing for return/assignment, and here's how it's done:

StrikeFont>>useLeftArrow

    self characterToGlyphMap.
    characterToGlyphMap at: 96 put: 95.
    characterToGlyphMap at: 95 put: 94

and

StrikeFont>>useUnderscore

    self characterToGlyphMap.
    characterToGlyphMap at: 96 put: 129.
    characterToGlyphMap at: 95 put: 128

There's the 128.
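
A rough model of that remapping, written in Python purely for illustration (this is not Squeak code). It treats the map as a table from character code to glyph slot. The helper names and the 0-based indexing are assumptions of this sketch; the Smalltalk `at:` indices 95 and 96 are read here as 1-based positions for character codes 94 and 95.

```python
# Illustrative model only; names and 0-based indexing are assumptions.
def identity_glyph_map(size=256):
    """Each character code initially selects the glyph slot with the same number."""
    return list(range(size))


def use_underscore(glyph_map):
    """Point caret (94) and underscore (95) at the relocated glyphs in slots 128 and 129."""
    glyph_map[94] = 128
    glyph_map[95] = 129


def use_left_arrow(glyph_map):
    """Point codes 94 and 95 back at their own slots, which hold the arrow glyphs."""
    glyph_map[94] = 94
    glyph_map[95] = 95


glyph_map = identity_glyph_map()
use_underscore(glyph_map)

# Code 94 ($^) and code 128 now select the same glyph slot, so they look alike
# on screen even though they are different characters.
same_glyph = glyph_map[94] == glyph_map[128]
print(same_glyph)
```

If this reading is right, a stray Character 128 is drawn with the relocated caret glyph, which would match what the inspector showed.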

What happens here, too, is that 128 is no proper character to begin with.

Our characters represent Unicode code points, which, for ByteStrings, happen to _exactly_ match the ISO 8859-1 (Latin1) encoding. (In fact, that was a design decision for Unicode to begin with; it does NOT hold for UTF-8, though.)

In both Unicode and ISO 8859-1, certain "character codes" are not actually characters. In 8859-1, the control range (<32) is intentionally left undefined, as are the codes between 128 and 159 (including 128). However, 8859-1 was often combined with the control codes of ISO 6429 (the standard behind ANSI escape codes), which defines the codes 128 to 159 as control block C1, which Unicode subsequently adopted.

Long story short, Characters between 128 and 159 are inherently non-printable. Either they control output or format output, but cannot in themselves be displayed. The StrikeFonts utilize that and use those code points in fonts to relocate caret, underscore, left-arrow and right-arrow so that they can serve as substitutes when you don't want ^ _ in code but rather arrows.
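
The non-printable status of 128 to 159 can be checked outside Squeak as well. A small Python sketch, assuming only the standard `unicodedata` module (the variable names are made up for this example):

```python
import unicodedata

# General category "Cc" means "control character" in the Unicode database.
c1_range = range(128, 160)
c1_categories = {unicodedata.category(chr(code)) for code in c1_range}

# Latin-1 maps every byte to the code point with the same number.
decoded = bytes([128]).decode("latin-1")

print(c1_categories)
print(ord(decoded), decoded.isprintable())
```

Every code in the range reports the control category, and byte 128 read as Latin-1 stays code point 128.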

=-=-=-=-=

That being said, I just saw that the fileList forces MacRoman encoding (deprecated since MacOS X 10.0 in 2001....) which _has_ a proper meaning for 128, namely Ä. However, the respective method probably needs an overhaul:

FileList>>defaultEncoderFor: aFileName

    "This method just illustrates the stupidest possible implementation of encoder selection."
    | l |
    l := aFileName asLowercase.
"   ((l endsWith: FileStream multiCs) or: [
        l endsWith: FileStream multiSt]) ifTrue: [
        ^ UTF8TextConverter new.
    ].
"
    ((l endsWith: FileStream cs) or: [
        l endsWith: FileStream st]) ifTrue: [
        ^ MacRomanTextConverter new.
    ].

    ^ Latin1TextConverter new.
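
To see what that choice of converter means for a byte value of 128, here is a hedged Python mirror of the selection logic. The function name, the `.cs`/`.st` suffixes with a leading dot, and the Python codec names are assumptions of this sketch, not Squeak API.

```python
# Hypothetical mirror of the selection logic above; not Squeak code.
def default_encoding_for(file_name):
    """Pick a codec from the file extension, as the Smalltalk method does."""
    lower = file_name.lower()
    if lower.endswith(".cs") or lower.endswith(".st"):
        return "mac_roman"
    return "latin-1"


raw = bytes([128])
as_mac_roman = raw.decode(default_encoding_for("x.cs"))
as_latin1 = raw.decode(default_encoding_for("x.txt"))

print(as_mac_roman, hex(ord(as_latin1)))
```

Under MacRoman the byte becomes Ä, while under Latin-1 it stays the control code 128.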

=-=-=-=-=-=

Indeed, the file x.cs contains a 128 at the indicated position, which is in the middle of a binary SmartRefStream dump. Maybe we must change the fileIn logic to make the stream binary when it encounters a SmartRefStream? That would certainly help.

Best regards

-Tobias

Tobias Pape

Re: Help: weird bug in inspecting characters

> On 02.07.2018, at 20:26, Tobias Pape <[hidden email]> wrote:
>
> Long story short, Characters between 128 and 159 are inherently non-printable. Either they control output or format output, but cannot in themselves be displayed. The StrikeFonts utilize that and use those code points in fonts to relocate caret, underscore, left-arrow and right-arrow so that they can serve as substitutes when you don't want ^ _ in code but rather arrows.

For followup: this is how I generated that into the font to be compatible with the original one:

https://github.com/krono/Squeak-Fonts/blob/master/render.py#L41

Best regards
-Tobias

Tobias Pape

Re: Help: weird bug in inspecting characters

> On 02.07.2018, at 20:28, Tobias Pape <[hidden email]> wrote:
>
> For followup: this is how I generated that into the font to be compatible with the original one:
>
> https://github.com/krono/Squeak-Fonts/blob/master/render.py#L41

Last followup:

Why do you actually see ^ when it is meant to be a control character?
Because the text morph does not care. Simplified, it hands all characters/codes to the font (modulo characterToGlyphMap) and just displays what comes back. There should not be something at 128, but there is, so it is being displayed (and happens to look equal to $^, which is Character value: 94).

Best regards
-Tobias

Ron Teitelbaum

Re: Help: weird bug in inspecting characters

In reply to this post by Tobias Pape

Hi Tobias,

Interesting! I think 128 is the Euro symbol in extended ASCII on MS Windows.

https://www.petefreitag.com/cheatsheets/ascii-codes/

Thanks for the explanation! It makes sense that it would have been a hack for code printing.

All the best,

Ron Teitelbaum

K K Subbu

Re: Help: weird bug in inspecting characters

In reply to this post by Tobias Pape

On Monday 02 July 2018 11:56 PM, Tobias Pape wrote:
> That being said, I just saw that the fileList forces MacRoman encoding
> (deprecated since MacOS X 10.0 in 2001....) which _has_ a proper meaning
> for 128, namely Ä. However, the respective method probably needs an
> overhaul:
Should FileList default to binary stream and let the service handler set
the converter and ascii/binary flag?

In my case, it is ChangeList class that is setting the converter:
----
browseStream: changesFile
"Opens a changeList on a fileStream"
| changeList charCount |
changesFile readOnly.
changesFile setConverterForCode.
----
The smartRefStream starts at 22705 bytes into x.cs (see pic) but I don't
see the scanner switching the converter back to binary stream.
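
One way to picture such a switch, as a Python sketch only: decode the leading source text, but hand the trailing dump on as raw bytes. The function name, the `binary_start` parameter, and the sample data are hypothetical, and how a scanner would find the real offset is left open here.

```python
# Sketch only: `binary_start` stands for wherever the binary dump begins.
def split_changeset(data, binary_start, text_encoding="mac_roman"):
    """Decode the leading source text, but keep the trailing dump as raw bytes."""
    text = data[:binary_start].decode(text_encoding)
    binary = data[binary_start:]
    return text, binary


header = b"Object subclass: #Foo!"
sample = header + bytes([128, 0, 255])
text, binary = split_changeset(sample, binary_start=len(header))

print(repr(text), list(binary))
```

With this split, the byte 128 in the dump never passes through a text converter.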

Regards .. Subbu

Attachment: refstream.png (18K)

Tobias Pape

Re: Help: weird bug in inspecting characters

In reply to this post by Ron Teitelbaum

Hi Ron

> On 02.07.2018, at 20:42, Ron Teitelbaum <[hidden email]> wrote:
>
> Hi Tobias,
>
> Interesting! I think 128 is the Euro symbol in extended ASCII on MS Windows.

The Euro is 128 in Windows Codepage 1252, true. However, there's not one "true" extended Ascii, but everything that is 8-bit and includes ascii is somehow extended Ascii. To quote Wikipedia[https://en.wikipedia.org/wiki/Extended_ASCII]:

"There are many extended ASCII encodings (more than 220 DOS and Windows codepages)."

For example, in Latin-9 (ISO 8859-15), Euro is at 164.
In IBM CodePage 850 (or, 858, to be precise), Euro replaced dotless i at 213.
Mac Roman replaced the generic currency sign and put Euro at 219,
(all "extended ascii")

To complete the list, Unicode assigned CodePoint U+20AC for Euro.
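
These positions can be checked with Python's bundled codecs. A small sketch, assuming the codec names `cp1252`, `iso8859-15`, `cp858` and `mac_roman` correspond to the encodings named above:

```python
EURO = "\u20ac"

# Byte value of the Euro sign in each single-byte encoding.
positions = {
    name: EURO.encode(name)[0]
    for name in ("cp1252", "iso8859-15", "cp858", "mac_roman")
}

print(positions)
print(hex(ord(EURO)))
```

The same sign lands on four different byte values, and on none of them in plain Latin-1.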

It's hard to get your Euro's worth… :D

Best regards
-Tobias

Tobias Pape

Re: Help: weird bug in inspecting characters

In reply to this post by K K Subbu

> On 02.07.2018, at 20:55, K K Subbu <[hidden email]> wrote:
>
> On Monday 02 July 2018 11:56 PM, Tobias Pape wrote:
>> That being said, I just saw that the fileList forces MacRoman encoding (deprecated since MacOS X 10.0 in 2001....) which _has_ a proper meaning for 128, namely Ä. However, the respective method probably needs an overhaul:
> Should FileList default to binary stream and let the service handler set the converter and ascii/binary flag?
>
> In my case, it is ChangeList class that is setting the converter:
> ----
> browseStream: changesFile
> "Opens a changeList on a fileStream"
> | changeList charCount |
> changesFile readOnly.
> changesFile setConverterForCode.
> ----
> The smartRefStream starts at 22705 bytes into x.cs (see pic) but I don't see the scanner switching the converter back to binary stream.
>

Yeah, I don't actually know what's sensible here.
IIRC, MacRomanTextConverter used to be a 1:1 8-bit mapping, so everything worked by chance, but that time has apparently gone…
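
To illustrate why a 1:1 8-bit converter lets binary data pass through unharmed while a real MacRoman mapping does not, here is a minimal sketch. It uses Python's standard codecs as a stand-in for Squeak's converters; the `payload` bytes and the `code_points` helper are made up for the example and are not taken from x.cs or from Squeak.

```python
# Sketch only: Python's codecs stand in for Squeak's text converters.
payload = bytes([94, 128, 255])  # hypothetical bytes from a binary dump


def code_points(data, encoding):
    """Decode data and return the resulting character codes."""
    return [ord(ch) for ch in data.decode(encoding)]


# Latin-1 maps every byte to the code point of the same value (1:1).
latin1_codes = code_points(payload, "latin-1")

# MacRoman remaps the upper half, e.g. byte 128 becomes Ä (U+00C4).
mac_roman_codes = code_points(payload, "mac_roman")

print(latin1_codes)
print(mac_roman_codes)
```

With the 1:1 mapping the character codes equal the original bytes, so code that reads binary data through a text stream still happens to work. With the MacRoman mapping the ASCII byte 94 survives but 128 no longer does.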

Best regards
-Tobias

> Regards .. Subbu
> <refstream.png>

Ron Teitelbaum

Re: Help: weird bug in inspecting characters

In reply to this post by Tobias Pape

Hi Tobias,

Yeah, I saw that it was an MS Windows only thing.

It makes your head spin!!

:) All the best,

Ron

On Mon, Jul 2, 2018 at 2:57 PM, Tobias Pape <[hidden email]> wrote:

Hi Ron

> On 02.07.2018, at 20:42, Ron Teitelbaum <[hidden email]> wrote:
>
> Hi Tobias,
>
> Interesting! I think 128 is the Euro symbol in extended ASCII on MS Windows.

The Euro is 128 in Windows Codepage 1252, true. However, there's not one "true" extended ASCII; any 8-bit encoding that includes ASCII counts as some kind of extended ASCII. To quote Wikipedia[https://en.wikipedia.org/wiki/Extended_ASCII]:

"There are many extended ASCII encodings (more than 220 DOS and Windows codepages)."

For example, in Latin-9 (ISO 8859-15), Euro is at 164.
In IBM CodePage 850 (or, 858, to be precise), Euro replaced dotless i at 213.
Mac Roman replaced the generic currency sign and put Euro at 219,
(all "extended ascii")

To complete the list, Unicode assigned CodePoint U+20AC for Euro.
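
The positions above can be checked with a short sketch. It assumes Python's standard codec names (`cp1252`, `iso8859_15`, `cp858`, `cp850`, `mac_roman`); the variable names are only for illustration.

```python
# Where does the Euro sign land in each 8-bit encoding?
EURO = "\u20ac"  # Unicode code point U+20AC

encodings = ["cp1252", "iso8859_15", "cp858", "mac_roman"]

# Encode the single character and take the one resulting byte value.
positions = {name: EURO.encode(name)[0] for name in encodings}

# Code page 850 has dotless i (U+0131) at the slot that 858 gives to Euro.
dotless_i_in_cp850 = "\u0131".encode("cp850")[0]

for name, value in positions.items():
    print(name, value)
print("cp850 dotless i", dotless_i_in_cp850)
```

The same character ends up at four different byte values, which is why a byte 128 cannot be interpreted without knowing the encoding.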

It's hard to get your Euro's worth… :D

Best regards
-Tobias

>
> https://www.petefreitag.com/cheatsheets/ascii-codes/
>
> Thanks for the explanation! It makes sense that it would have been a hack for code printing.
>
> All the best,
>
> Ron Teitelbaum
>
> On Mon, Jul 2, 2018 at 2:26 PM, Tobias Pape <[hidden email]> wrote:
> Hi all
>
>> On 02.07.2018, at 19:29, Ron Teitelbaum <[hidden email]> wrote:
>>
>> That is strange.
>>
>> On Squeak 4.1
>>
>> $^ charCode -> 94
>> 94 asCharacter -> $^
>> 128 asCharacter -> $€ charCode -> 128 (doesn't show properly in text on email but does in squeak, see image).
>>
>> <Capture.PNG>
>>
>> In other words, if I use my keyboard to type it in, it seems to be represented fine and it evaluates to charCode 94 as expected.
>>
>> But something created with charCode 128 is also represented by the same symbol, and it retains its 128 charCode, as you can see when you send charCode to the string representation that was created.
>>
>> If this was filed out, it would seem that either version could have been used in the code and you wouldn't notice it. Manually changing it by typing in ^ and filing it out again will probably fix it. An external editor changing 128 to 94 chars will also probably work.
>>
>> All the best,
>
> Maybe I can shed a bit of light on things here.
>
> If you look at the attached image (which is one of the default fonts we use), you see that ^ and _ are present after [\] but ALSO after {|}~. This seems to be intentional so that you, should you want, can switch between caret/underscore and up-arrow/left-arrow printing for return/assignment, and here's how it's done:
>
> StrikeFont>>useLeftArrow
> self characterToGlyphMap.
> characterToGlyphMap at: 96 put: 95.
> characterToGlyphMap at: 95 put: 94
>
> and
>
> StrikeFont>>useUnderscore
> self characterToGlyphMap.
> characterToGlyphMap at: 96 put: 129.
> characterToGlyphMap at: 95 put: 128
>
>
> There's the 128.
>
> What happens here, too, is that 128 is no proper character to begin with.
>
> Our characters represent Unicode code points, which, for ByteStrings, happen to _exactly_ match the ISO 8859-1 (Latin1) encoding. (In fact, that was a design decision for Unicode to begin with; it does NOT hold for UTF-8, though.)
>
> In both Unicode and ISO 8859-1, certain "character codes" are not actually characters. The control characters (<32) are intentionally undefined, as are codes between 128 and 159 (includes 128). However, 8859-1 was often combined with ANSI escape codes (aka ISO 6429), which defines the codes from 128 as Control Block C1, which Unicode subsequently adopted.
>
> Long story short, Characters between 128 and 159 are inherently non-printable. Either they control output or format output, but cannot in themselves be displayed. The StrikeFonts utilize that and use those code points in fonts to relocate caret, underscore, left-arrow and right-arrow so that they can serve as substitutes when you don't want ^ _ in code but rather arrows.
>
> =-=-=-=-=
>
> That being said, I just saw that the fileList forces MacRoman encoding (deprecated since MacOS X 10.0 in 2001....) which _has_ a proper meaning for 128, namely Ä. However, the respective method probably needs an overhaul:
>
> FileList>>defaultEncoderFor: aFileName
>
> "This method just illustrates the stupidest possible implementation of encoder selection."
> | l |
> l := aFileName asLowercase.
> " ((l endsWith: FileStream multiCs) or: [
> l endsWith: FileStream multiSt]) ifTrue: [
> ^ UTF8TextConverter new.
> ].
> "
> ((l endsWith: FileStream cs) or: [
> l endsWith: FileStream st]) ifTrue: [
> ^ MacRomanTextConverter new.
> ].
>
> ^ Latin1TextConverter new.
>
> =-=-=-=-=-=
>
> Indeed, the file x.cs contains a 128 at the indicated position, which is in the middle of a binary SmartRefStream dump. Maybe we must change the fileIn logic to make the stream binary when it encounters a SmartRefStream? That would certainly help.
>
> Best regards
> -Tobias
>
>
>
> <dejavu_new.png>
>>
>> Ron Teitelbaum
>>
>>