Why the change to Character>isHexDigit?


Why the change to Character>isHexDigit?

Joey Gibson-2
Blair and/or Andy,

I just spent quite a chunk o' time tracking down a problem in Pocket
Smalltalk that was rather hard to find. I won't go into the details of
the problem, but it was the result of a change in the #isHexDigit method
in class Character. In Dolphin 2.1 upper as well as lower case letters
A..F were considered valid hex. 3.0 and forward only accept upper case.
I was just wondering what was the reason behind this change. Is it
something to do with ANSI ST, or something else?

The 2.1 version defined it thusly:

isHexDigit
        ^CRTLibrary default iswxdigit: self

while 3.0 and above have this:

isHexDigit

        "Answer whether the receiver is a valid Smalltalk hexadecimal digit
        (i.e. digits and the uppercase characters A through F)."

        ^self isDigit or: [self codePoint >= ##($A codePoint)
                and: [self codePoint <= ##($F codePoint)]]


Just curious.
Joey

--
-- Sun Certified Java2 Programmer
-- Political Rants: www.joeygibson.com
-- My Pocket Smalltalk Stuff: www.joeygibson.com/st
--
-- "We thought about killin' him, but we kinda
--  hated to go that far...."






Re: Why the change to Character>isHexDigit?

Ian Bartholomew
Joey,

I'm not sure if Blair is around this week so here's a reply he posted on the
very subject, in reference to a problem with Pocket Smalltalk raised by a
slightly disgruntled Steve Harris.

~-~-~-~-
 From Blair 24/1/2000

Sorry if this caused a problem for you. The reason for the change (in case
it makes you feel any better) is that:

a) #isHexDigit is supposed to report only those characters which are valid
hex digits in Smalltalk syntax (it is intended for use by the scanner), and
the lowercase letters a..f are not valid as hex digits in Smalltalk.
b) The CRT library call also counts various accented characters and odd
digit symbols as hex digits - try evaluating:
    Character allInstances select: [:c | c isHexDigit]
in 2.1, and you will see what I mean!
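
To make point a) concrete, here is a small Python sketch (not Dolphin code; both function names are made up for illustration) contrasting a strict, scanner-style test with a case-insensitive one of the kind that 2.1-era callers may have been relying on. It makes no attempt to reproduce the locale-dependent CRT behaviour described in point b).

```python
import string

def is_smalltalk_hex_digit(ch):
    """Strict test: decimal digits and uppercase A-F only (hypothetical helper)."""
    return len(ch) == 1 and ("0" <= ch <= "9" or "A" <= ch <= "F")

def is_any_case_hex_digit(ch):
    """Permissive test: additionally accepts lowercase a-f (hypothetical helper)."""
    return len(ch) == 1 and ch in string.hexdigits

sample = "09AFafGg"
accepted_strict = [c for c in sample if is_smalltalk_hex_digit(c)]
accepted_any = [c for c in sample if is_any_case_hex_digit(c)]

print(accepted_strict)  # ['0', '9', 'A', 'F']
print(accepted_any)     # ['0', '9', 'A', 'F', 'a', 'f']
```

Code that needs to accept lowercase input would presumably call a permissive helper of its own rather than depend on #isHexDigit.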



Re: Why the change to Character>isHexDigit?

David Simmons
In reply to this post by Ian Bartholomew
"Ian Bartholomew" <[hidden email]> wrote in message
news:92c5os$6gb48$[hidden email]...
> Joey,
>
> I'm not sure if Blair is around this week so here's a reply he posted on
> the very subject, in reference to a problem with Pocket Smalltalk raised by
> a slightly disgruntled Steve Harris.
>
> ~-~-~-~-
> From Blair 24/1/2000
>
> Sorry if this caused a problem for you. The reason for the change (in case
> it makes you feel any better) is that:
>
> a) #isHexDigit is supposed to report only those characters which are valid
> hex digits in Smalltalk syntax (it is intended for use by the scanner),
> and the lowercase letters a..f are not valid as hex digits in Smalltalk.

Hmm...

I learn something new all the time ;-(

Obviously, that is news to me. We've always allowed them in the "0x" and
"0X" numeric form.

In QKS Smalltalk v1-v1.X there were restrictions against their use in
"<base>r" radix prefixed numeric forms, as of v2.0-v3 those restrictions
were lifted.

I don't know what the "official" rationale Dolphin (Blair) is referring to,
but I can explain some technical issues that may have led to some Smalltalk
dialects concluding that lowercase letters are not valid for hex digits. If
the "official" reference is the ANSI standard, then I would take it with a
grain of salt.

If you consider numeric encoding forms there are some potentially ambiguous
cases, and some outright conflicts, that can occur without some
restrictions or special-case rules regarding lowercase character usage in
radix-based numeric forms.

Here are some of the supported numeric forms (from QKS Smalltalk):

    0x...        - base 16 [0-9,A-F,a-f]
    0X...        - base 16 [0-9,A-F,a-f]
    0b...        - base 2  [0-1]
    0B...        - base 2  [0-1]
    <nn>s...     - ScaledDecimal [0-9]
    <base>r...   - [0-(<base>-1)] max possible [0-9,A-Z]
    <nn>e...     - one of a number of float forms
    <nn>f...     - one of a number of float forms
    <nn>g...     - one of a number of float forms
    <NN>j        - "j" indicates imaginary part of a complex number
    <NN>i        - "i" indicates imaginary part of a complex number

The "r" character is a radix delimiter.
The "e,f,g" characters are delimiters in <Floats>.
The "s" character is a <ScaledDecimal> tag/marker and delimiter.
The "i,j" characters are recognized (as a message) as a tag/marker in a
<Complex>.

The <nn> form includes the optional "." decimal and any subsequent digits.
The <NN> form means any valid numeric form.

QKS Smalltalk v1-v1.X used to allow floats or any number to be expressed
using the radix prefix form. To support this generalization requires
disallowing lowercase characters for all numbers with a "<base>r" prefix.

In QKS v2 or possibly as late as QKS v3 (1996), I can't remember for sure,
the tokenizer was modified to allow upper and lowercase digits and support
for a "." decimal. That meant disallowing recognition of "e,f,g,s,..." in
radix prefixed numeric forms. This change was made because the "e,f,g,s,..."
forms were practically useless and never used with radix notation, whereas
lowercase digits were often desirable.

The ability to declare "." decimal <Float> forms in a radix notation was
fully removed in QKS v4/SmallScript -- because the use of radix based
<Float> forms is not useful and its presence represents a "lingering"
partial support of the original generalization.

I.e., one could write a number like:

    16r7E1    <- Notice some possible problems if lowercase was allowed?
    16r7e1    Is this 7.0e1 meaning a <Float>? or is it 0x7E1 meaning a
              <SmallInteger>?
    35r2s3    Is this 2.0s3 meaning a <ScaledDecimal>? or is it 3433 a
              <SmallInteger>?

If you allow the "<base>r" prefix to be applied to any numeric form, then
any subsequent digits need to be restricted to uppercase letters. If the
"<base>r" prefix is restricted to <Integer> forms then that restriction is
not required.
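
The ambiguity can be sketched in a few lines of Python (an illustration only, not QKS or Dolphin tokenizer code; `radix_readings` and `_to_int` are hypothetical names). It lists every reading of a "<base>r" literal that becomes possible once lowercase letters are allowed as digits. Reporting the mantissa and exponent/scale as a plain pair is my own simplification, since the thread does not say how a real tokenizer would combine them.

```python
import re

def _to_int(digits, base):
    """Return digits read in the given base, or None if they do not fit."""
    try:
        return int(digits, base)  # int() accepts upper- and lowercase digits
    except ValueError:
        return None

def radix_readings(literal):
    """List the plausible readings of a '<base>r<digits>' literal."""
    base_text, digits = literal.split("r", 1)
    base = int(base_text)
    readings = {}

    # Reading 1: every character after the 'r' is a digit in that base.
    as_integer = _to_int(digits, base)
    if as_integer is not None:
        readings["integer"] = as_integer

    # Reading 2: a lowercase 'e' or 's' is a delimiter, not a digit.
    match = re.fullmatch(r"([0-9A-Z]+)([es])([0-9]+)", digits)
    if match:
        mantissa = _to_int(match.group(1), base)
        if mantissa is not None:
            kind = "float" if match.group(2) == "e" else "scaled decimal"
            readings[kind] = (mantissa, int(match.group(3)))
    return readings

print(radix_readings("16r7E1"))  # {'integer': 2017}
print(radix_readings("16r7e1"))  # {'integer': 2017, 'float': (7, 1)}
print(radix_readings("35r2s3"))  # {'integer': 3433, 'scaled decimal': (2, 3)}
```

Only the uppercase spelling has a single reading, which is why restricting radix digits to uppercase resolves the conflict.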

I'm guessing that Dolphin Smalltalk doesn't support 0x, 0X forms, and does
allow the "<base>r" prefix to be applied to any numeric form. Personally,
being able to use upper and lowercase hex digits is very convenient.
Especially when working with documentation or source from other languages,
or needing to code in multiple languages at the same time.

I should mention that QKS Smalltalk also supports "prefix" operator
messages. This was done to both enhance and address some other issues in
Smalltalk regarding numerics/precedence and sign/processing.

    "-"    - unary prefix message mapped to "negate"
    "+"    - unary prefix message mapped to "yourself"
    "~"    - unary prefix message mapped to "complement"
    "!"    - unary prefix message mapped to "not"

QKS Smalltalk compilers have always performed constant folding, and as part
of doing so they recognized certain messages when applied to literals. So
"-(1+3)" would actually generate opcodes for <-4>. "!(1+3)" would generate
opcodes for <false>. "~(0xF | SOME_LITERAL_CONST)" where SOME_LITERAL_CONST
== 0x80 would generate opcodes for <0xFFFFFF70>.
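
The folding arithmetic above can be checked with a short Python sketch (illustrative only; `PREFIX_OPS` and `fold_prefix` are hypothetical names, and the 32-bit mask is an assumption made so that the complement matches the <0xFFFFFF70> result quoted above).

```python
WORD_MASK = 0xFFFFFFFF  # assumption: results are viewed as 32-bit words

# Prefix operator -> the message it maps to, applied to an already-folded literal.
PREFIX_OPS = {
    "-": lambda value: -value,              # negate
    "+": lambda value: value,               # yourself
    "~": lambda value: ~value & WORD_MASK,  # complement
    "!": lambda value: not value,           # not
}

def fold_prefix(op, value):
    """Fold a prefix operator applied to a constant operand."""
    return PREFIX_OPS[op](value)

SOME_LITERAL_CONST = 0x80

print(fold_prefix("-", 1 + 3))                          # -4
print(fold_prefix("!", 1 + 3))                          # False
print(hex(fold_prefix("~", 0xF | SOME_LITERAL_CONST)))  # 0xffffff70
```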

> b) The CRT library call also counts various accented characters and odd
> digit symbols as hex digits - try evaluating:
>     Character allInstances select: [:c | c isHexDigit]
> in 2.1, and you will see what I mean!

Ahh. I think I understand why... It is likely that Dolphin 2.1 was using the
Win32 code-point mapping function for "POSIX (LC_TYPE) 1 character-typing"
and then applying the tag mask "C1_XDIGIT", which is really just a
Microsoft-specific version of Unicode/CodePage code-point mapping facilities.
If you're trying to be portable then you don't want to rely on them, which
may explain some changes in Dolphin 4(?). My solution was to build my own
equivalent routines for v3 of QKS' AOS Platform to enable portability.

The QKS Smalltalk compilers have always been both encoding and font aware.
I.e., you could compile styled string source and the compiler not only
understood and preserved the encoding it also understood and preserved the
font and face/style run information. So the compiler needed rich character
processing facilities to support unicode symbols as binary selector
characters, etc.

In v4 (SmallScript), the font and face/style run processing mechanism for
source code was changed. It no longer pays any attention to style
information contained in the <StyleRuns> of <Text/StyledString> source when
compiling. Rather, it now treats source input as encoded character streams
where it recognizes XML and HTML sequences in comments and strings -- which
actually allows a richer and more portable set of extensible text/style
annotation constructs.

-- Dave Simmons [www.qks.com / www.smallscript.com]
  "Effectively solving a problem begins with how you express it."



Re: Why the change to Character>isHexDigit?

sharris
In reply to this post by Ian Bartholomew
hehehe,
That is putting it mildly ;-)


steve

In article <92c5os$6gb48$[hidden email]>,
  "Ian Bartholomew" <[hidden email]> wrote:
> Joey,
>
> I'm not sure if Blair is around this week so here's a reply he posted on
> the very subject, in reference to a problem with Pocket Smalltalk raised by
> a slightly disgruntled Steve Harris.
>
> ~-~-~-~-
> From Blair 24/1/2000
>
> Sorry if this caused a problem for you. The reason for the change (in case
> it makes you feel any better) is that:
>
> a) #isHexDigit is supposed to report only those characters which are valid
> hex digits in Smalltalk syntax (it is intended for use by the scanner), and
> the lowercase letters a..f are not valid as hex digits in Smalltalk.
> b) The CRT library call also counts various accented characters and odd
> digit symbols as hex digits - try evaluating:
>     Character allInstances select: [:c | c isHexDigit]
> in 2.1, and you will see what I mean!





Re: Why the change to Character>isHexDigit?

Blair McGlashan
In reply to this post by David Simmons
Dave

You wrote in message
news:i8m26.45360$[hidden email]...
> > a) #isHexDigit is supposed to report only those characters which are
> > valid hex digits in Smalltalk syntax (it is intended for use by the
> > scanner), and the lowercase letters a..f are not valid as hex digits in
> > Smalltalk.
>
> Hmm...
>
> I learn something new all the time ;-(
>
> Obviously, that is news to me.
> ...
> I don't know what the "official" rationale Dolphin (Blair) is referring
> to, but I can explain some technical issues that may have led to some
> Smalltalk dialects concluding that lowercase letters are not valid for hex
> digits. If the "official" reference is the ANSI standard, then I would take
> it with a grain of salt.

One could certainly extend the syntax to accept lower-case alphabetic hex
digits if one wishes (and I think we probably did originally), but it isn't
standard Smalltalk, at least by any of the known standards we refer to:

1) ANSI NCITS 319-1998, Section 3.5.6, p27.
    integer ::= decimalInteger | radixInteger
    decimalInteger ::= digits
    digits ::= digit+
    radixInteger ::= radixSpecifier 'r' radixDigits
    radixSpecifier ::= digits
    radixDigits ::= (digit | uppercaseAlphabetic)+
    (the radix is restricted to the range 2..36 inclusive).
2) The IBM Common Base red book also restricts radix digits to decimals and
uppercase letters.
3) I seem to remember the Blue Book being the same, but I no longer have a
copy.

So that's the "official" rationale.
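
For anyone who wants to see the radixInteger rule in executable form, here is a minimal Python sketch of a reader for it (an illustration, not the Dolphin scanner; `parse_radix_integer` is a hypothetical name). Rejecting a digit whose value is not below the radix is my own reading of the rule; the productions quoted above only give the character classes and the 2..36 range.

```python
import re

# radixInteger ::= radixSpecifier 'r' radixDigits, uppercase letters only
_RADIX_INTEGER = re.compile(r"([0-9]+)r([0-9A-Z]+)")
_DIGIT_VALUES = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def parse_radix_integer(text):
    """Parse '<radix>r<digits>' and return its integer value."""
    match = _RADIX_INTEGER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a radix integer: {text!r}")
    radix = int(match.group(1))
    if not 2 <= radix <= 36:
        raise ValueError(f"radix out of range 2..36: {radix}")
    value = 0
    for ch in match.group(2):
        digit = _DIGIT_VALUES.index(ch)
        if digit >= radix:  # assumption: each digit must be below the radix
            raise ValueError(f"digit {ch!r} is not valid in base {radix}")
        value = value * radix + digit
    return value

print(parse_radix_integer("16r7E1"))  # 2017
print(parse_radix_integer("2r101"))   # 5
```

Under this reading, "16r7e1" is rejected outright because the lowercase "e" never matches the radixDigits class.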

>...
> I'm guessing that Dolphin Smalltalk doesn't support 0x, 0X forms,

Yup.

>...and does
> allow the "<base>r" prefix to be applied to any numeric form.

Nope, that's not standard either: The ANSI standard reader will note only
integers can have a radix prefix.

 I'd sometimes prefer it if lower-case hex digits were accepted myself, but
it is one of those restrictions I don't find bothersome enough to mutiny
over - it's "just how it is". Portability at the code-transport level is
more important to me. If we all agree that lower-case should be acceptable
for hex digits (I note that VW does now accept such too), then I'm happy to
go along with it.

With regard to the other syntax enhancements you mention in your post, I
have a suspicion we may have a different attitude to language extensions,
Dave :-)

Regards

Blair