New Dolphin Goodie - ANSI substrings

Bob Jarvis
I decided that 'almost' didn't cut it.  After looking around a bit I
determined that the only place in the base image where the non-ANSI
behavior of String>>subStrings: was used was in String>>lines.  This
was easy to refactor, so now the package on my web site provides my
interpretation of what the ANSI standard requires of
String>>subStrings: while retaining the functionality of
String>>lines.  A separate unit test package is also provided.

The ANSI spec is silent on one issue that I'm interested in hearing
opinions on.  It doesn't say how multiple consecutive separator
characters should be handled.  If we have the following string

        aString := 'abc,def,,ghi,,jkl,'

should aString subStrings: ',' answer

        #('abc' 'def' '' 'ghi' '' 'jkl' '')

or

        #('abc' 'def' 'ghi' 'jkl')

Alternatively, if we have

        aString := '--abc--def--ghi--jkl--'

should aString subStrings: '-' answer

        #('' '' 'abc' '' 'def' '' 'ghi' '' 'jkl' '' '')

or

        #('abc' 'def' 'ghi' 'jkl')

???

Opinions solicited.



Re: New Dolphin Goodie - ANSI substrings

Bob Jarvis
If anyone is interested, I've updated this package.  The
implementation is now a bit more flexible, providing all the options
mentioned in the previous message in this thread.

Share and enjoy.



Keep Your Options Open (Re: New Dolphin Goodie - ANSI substrings

Panu Viljamaa-3
In reply to this post by Bob Jarvis
Bob Jarvis wrote:

> ... If we have the following string
>
>         aString := 'abc,def,,ghi,,jkl,'
>
> should aString subStrings: ',' answer
>
>         #( 'abc'  'def'  ''  'ghi'  ''  'jkl'  '' )

Yes. If you automatically remove the empty substrings, you lose information.

You may want to #reject:  the empty substrings at a later step, but this shouldn't happen automatically.  If you remove them automatically, you "can't get it back". But if you keep them, you can always decide to remove them. And you can  add an alternative method that does the removing automatically. But you keep your options open, which according to the financial theory of options pricing adds value to them.
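
The alternative method mentioned above could be as small as the following sketch. It assumes #subStrings: keeps empty substrings, and #nonEmptySubStrings: is a hypothetical selector used for illustration only, not part of ANSI or Dolphin:

```smalltalk
nonEmptySubStrings: separators
        "Hypothetical convenience method on String (an assumption, not an
        existing selector). Split first, keeping every empty substring, then
        drop the empty ones as a separate, optional step."
        ^(self subStrings: separators) reject: [:each | each isEmpty]
```

Callers who need the empty substrings keep calling #subStrings:; callers who do not can use the convenience method instead.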

I'd like to write this as a 'pattern' since it applies to many things other than extracting substrings. But I don't have the time right now.  "Keep your Options Open"

-Panu Viljamaa



Re: Keep Your Options Open (Re: New Dolphin Goodie - ANSI substrings

Peter van Rooijen
"Panu Viljamaa" <[hidden email]> wrote in message
news:[hidden email]...

> Bob Jarvis wrote:
>
> > ... If we have the following string
> >
> >         aString := 'abc,def,,ghi,,jkl,'
> >
> > should aString subStrings: ',' answer
> >
> >         #( 'abc'  'def'  ''  'ghi'  ''  'jkl'  '' )
>
> Yes. If you automatically remove the empty substrings, you lose
> information.

Hi Panu,

Well, it is true that you will lose information if you do that. But look at
the definition of #subStrings: and you'll see that losing information cannot
be avoided by any reasonable interpretation of it.

I will show you:

'hello,my dear friend' subStrings: ', '

none of the valid answers to that one are lossless, because you can specify
more than one separator.
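
To make the point concrete, here is a workspace sketch. It assumes an implementation that treats each character of the argument as a separate separator, which is one reading of the ANSI wording and not necessarily what any given dialect does. Under that assumption two different inputs give the same answer, so the original string cannot be reconstructed from the result:

```smalltalk
| first second |
"Assumption: ', ' means 'split at a comma OR at a space'."
first := 'hello,my dear friend' subStrings: ', '.
second := 'hello my,dear friend' subStrings: ', '.
"Under that assumption both answers hold the same four words, so the
comparison below is true and the separator actually used at each
position has been lost."
first = second
```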

The service as specified is pretty lame (i.e., of limited use) anyway. For
instance, #('friend' 'hello' 'my' 'dear') is a valid answer according to the
specification, as is #('hello').

It looks like the ANSI committee were out to lunch when they wrote up this
one.

> You may want to #reject: the empty substrings at a later step, but this
> shouldn't happen automatically.  If you remove them automatically, you
> "can't get it back". But if you keep them, you can always decide to remove
> them. And you can add an alternative method that does the removing
> automatically. But you keep your options open, which according to the
> financial theory of options pricing adds value to them.
>
> I'd like to write this as a 'pattern' since it applies to many other
> things than extracting substrings. But I don't have the time right now.
> "Keep your Options Open"

Well, I'm very much in agreement with that one!

Regards,

Peter van Rooijen

> -Panu Viljamaa



Re: Keep Your Options Open (Re: New Dolphin Goodie - ANSI substrings

Richard A. Harmon
On Wed, 26 Dec 2001 11:20:18 +0100, "Peter van Rooijen"
<[hidden email]> wrote:

>"Panu Viljamaa" <[hidden email]> wrote in message
>news:[hidden email]...
>> Bob Jarvis wrote:
>>
>> > ... If we have the following string
>> >
>> >         aString := 'abc,def,,ghi,,jkl,'
>> >
>> > should aString subStrings: ',' answer
>> >
>> >         #( 'abc'  'def'  ''  'ghi'  ''  'jkl'  '' )

That is what I think the ANSI spec. says should be returned.


>> Yes. If you automatically remove the empty substrings, you lose
>> information.
>
>Hi Panu,
>
>Well, it is true that you will lose information if you do that. But look at
>the definition of #subStrings: and you'll see that losing information cannot
>be avoided by any reasonable interpretation of it.
>
>I will show you:
>
>'hello,my dear friend' subStrings: ', '
>
>none of the valid answers to that one are lossless, because you can specify
>more than one separator.

I'm not sure what the discussion about losing information is about in
the context of ANSI subStrings: -- why it matters?


>The service as specified is pretty lame (i.e., of limited use) anyway. For
>instance, #('friend' 'hello' 'my' 'dear') is a valid answer according to the
>specification, as is #('hello').
>
>It looks like the ANSI committee were out to lunch when they wrote up this
>one.

I don't have enough experience in Smalltalk to know what the committee
should have specified as behavior.  I just looked at the (draft) ANSI
spec. for guidance when doing my reworked version of the CS II ANSI
Compatibility Tests.

I've not pondered deeply #subStrings:.  I just assumed those on the
committee were looking at the tasks I've commonly encountered:

        'Hi Bob' to a list of words -> 'Hi' 'Bob'
        'Hi, Bob' comma-delimited row in a table -> 'Hi' ' Bob'
        'Hi<tab>there Bob' space- or tab-delimited row in a table
                -> 'Hi' 'there' 'Bob'

If I want more exotic rules than ANSI's I generally read a stream:

        ' Hi, Bob ' to a list of words with no punctuation and
        white space trimmed
                -> 'Hi' 'Bob'

I don't know how to do the task above with #subStrings: as ANSI
specifies it, or with any other general-purpose method.  I would think
any general-purpose method would need rules too complicated to remember.
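
A minimal workspace sketch of that stream-reading approach, assuming for illustration that a "word" is simply a run of letters (any other rule could be substituted in the isLetter test):

```smalltalk
| source words word char |
source := ReadStream on: ' Hi, Bob '.
words := OrderedCollection new.
word := WriteStream on: String new.
[source atEnd] whileFalse: [
        char := source next.
        char isLetter
                ifTrue: [word nextPut: char]
                ifFalse: [
                        "Any non-letter ends the current word; empty runs are skipped."
                        word contents isEmpty ifFalse: [words add: word contents].
                        word := WriteStream on: String new]].
"Pick up a final word that runs to the end of the input."
word contents isEmpty ifFalse: [words add: word contents].
words asArray "-> #('Hi' 'Bob')"
```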


>> You may want to #reject: the empty substrings at a later step, but this
>> shouldn't happen automatically.  If you remove them automatically, you
>> "can't get it back". But if you keep them, you can always decide to remove
>> them. And you can add an alternative method that does the removing
>> automatically. But you keep your options open, which according to the
>> financial theory of options pricing adds value to them.
>>
>> I'd like to write this as a 'pattern' since it applies to many other
>> things than extracting substrings. But I don't have the time right now.
>> "Keep your Options Open"
>
>Well, I'm very much in agreement with that one!

I encourage folks to use #subStrings: for ANSI #subStrings:.  For a
dialect's own behavior, or any other preferred behavior, use some other
name.  I just changed all #subStrings: references to
#dolphinSubStrings: and implemented ANSI #subStrings: as:

subStrings: separators
        "Answer an array containing the substrings in the receiver
        separated by the elements of separators."
        | char result sourceStream subString |
        #'AStdMsg'. "2001/04/26 Harmon, R. Changed for ANSI <readableString>."
        (separators allSatisfy: [:elem | elem isKindOf: Character])
                ifFalse: [^self error: 'separators must be Characters.'].
        sourceStream := ReadStream on: self.
        result := OrderedCollection new.
        subString := String new.
        [sourceStream atEnd] whileFalse: [
                char := sourceStream next.
                (separators includes: char)
                        ifTrue: [
                                "A separator ends the current substring, even an empty one."
                                result add: subString.
                                subString := String new]
                        ifFalse: [subString := subString , (String with: char)]].
        "Whatever follows the last separator is always added, even if empty."
        result add: subString.
        ^result asArray


It seems to work for me.
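
Tracing the method above by hand on the two strings from the first message suggests it keeps every empty substring. A workspace sketch, with the expected answers derived from reading the code rather than from a recorded run:

```smalltalk
"Assumes the #subStrings: method above is installed on String."
'abc,def,,ghi,,jkl,' subStrings: ','.
        "-> #('abc' 'def' '' 'ghi' '' 'jkl' '')"

'--abc--def--ghi--jkl--' subStrings: '-'.
        "-> #('' '' 'abc' '' 'def' '' 'ghi' '' 'jkl' '' '')"
```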


--
Richard A. Harmon          "The only good zombie is a dead zombie"
[hidden email]           E. G. McCarthy



Re: Keep Your Options Open (Re: New Dolphin Goodie - ANSI substrings

Richard A. Harmon
Peter van Rooijen emailed me noting:
=====
consider dropping the String concatenation for nextPut: -ing to a
Stream.
You are creating an awful number of objects for no good reason.

Mail me if you need to see the code I mean.
=====


Right on the mark.  The ANSI messages I've added or changed are
definitely not built for speed!  I'm still struggling with getting
them to produce the correct behavior, and how to test for it.  So I
appreciate all the help I get.

I found that String >> #_separateSubStringsIn: finds the receiver
separator in the string with:

        stop := aReadableString indexOfSubCollection: self
                        startingAt: start.

and creates a sub-string with:

        aReadableString copyFrom: start to: stop-1

I considered finding the start and stop index of a sub-string and
using #copyFrom:to: also.  Empty sub-strings seemed too complicated
for me to figure out in the short time available, and my brain started
to hurt.  So I just decided to present an implementation easy to
understand but (I hope) a little less clumsy than the original method
I posted.

I'll revisit this when I get more time to explore how the WriteStream
stacks up to #copyFrom:to: and other trade-offs.


subStrings: separators
        "Answer an array containing the substrings in the receiver
        separated by the elements of separators."
        | char result sourceStream subStrStream |
        "2001/04/26 Harmon, R. Changed for ANSI <readableString>."
        "2001/04/26 Harmon, R. Changed impl. that used , (concat)
                - created too many unneeded objects."
        (separators allSatisfy: [:elem | elem isKindOf: Character])
                ifFalse: [^self error: 'separators must be Characters.'].

        sourceStream := ReadStream on: self.
        result := OrderedCollection new: 10.
        subStrStream := WriteStream with: (String new).
        [sourceStream atEnd] whileFalse: [
                char := sourceStream next.
                (separators includes: char)
                        ifTrue: [
                                result add: subStrStream contents.
                                subStrStream := WriteStream with: (String new)]
                        ifFalse: [subStrStream nextPut: char]].
        result add: subStrStream contents.
        ^result asArray
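
For comparison, the #copyFrom:to: alternative mentioned above might look like the following sketch. The selector #subStringsByIndex: is hypothetical, the Character check is omitted for brevity, and the code relies on #copyFrom:to: answering an empty string when stop = start - 1, which is worth confirming in your dialect:

```smalltalk
subStringsByIndex: separators
        "Hypothetical index-based variant (not from the original package).
        Remember where the current substring starts and copy it out each
        time a separator is found."
        | result start |
        result := OrderedCollection new.
        start := 1.
        1 to: self size do: [:index |
                (separators includes: (self at: index)) ifTrue: [
                        "Copies an empty string when two separators are adjacent."
                        result add: (self copyFrom: start to: index - 1).
                        start := index + 1]].
        "The text after the last separator, possibly empty."
        result add: (self copyFrom: start to: self size).
        ^result asArray
```

This creates one new string per substring and no intermediate streams, at the cost of the index bookkeeping that makes the empty-substring cases easy to get wrong.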


On Wed, 26 Dec 2001 14:12:20 GMT, [hidden email] (Richard A.
Harmon) wrote:
[snip]
>subStrings: separators
> "Answer an array containing the substrings in the receiver
>separated by the elements of separators."
[snip]
> subString := String new.
> [sourceStream atEnd] whileFalse: [
> char := sourceStream next.
> (separators includes: char)
> ifTrue: [result add: subString.
> subString := String new]
> ifFalse: [subString := subString , (String with:
>char)]].
[snip]

--
Richard A. Harmon          "The only good zombie is a dead zombie"
[hidden email]           E. G. McCarthy



Re: Keep Your Options Open (Re: New Dolphin Goodie - ANSI substrings

Panu Viljamaa-3
In reply to this post by Richard A. Harmon
"Richard A. Harmon" wrote:

> ...
> I'm not sure what the discussion about losing information is about in
> the context of ANSI subStrings: -- why it matters?

I think it matters as a general principle. If you're trying to implement
a generally (re)useful (library) method, and you have two otherwise
equally good choices for specifying it, choose the one that loses less
information about the original argument.

Why  ?  This is hard to explain, and even harder to prove I guess.
But I believe in it. Here are some arguments:

If you provide more information about the original string,
the users of your method can decide for themselves whether
they want to keep that information or not.

If you decide for them, they can't make that choice.

In a sense you are 'selling' them not only the method, but
also an 'option' to decide for themselves whether they
want to see the empty substrings in the result.

According to options pricing theory, an option has value,
and the more choices it gives you, the greater that value.
(The proof of the pudding is that options are sold
and bought, and even taxed!)

So if you don't know whether the users of your method
would like the empty substrings to be in the result denoting
places where two consecutive commas ( ,, ) appeared
in the argument, leave them there,  because
1) It is easy for them to take empty substrings out of the result
    if they want to
2) It is 'impossible'  for them to put them back by looking
     at the result of the method.

-Panu Viljamaa



Re: Keep Your Options Open (Re: New Dolphin Goodie - ANSI substrings

Richard A. Harmon
On Thu, 27 Dec 2001 19:59:33 -0500, Panu Viljamaa <[hidden email]>
wrote:

>"Richard A. Harmon" wrote:
>
>> ...
>> I'm not sure what the discussion about losing information is about in
>> the context of ANSI subStrings: -- why it matters?
>
>I think it matters as a general principle. If you're trying to implement
>a generally (re)useful (library) method, and you have two otherwise
>equally good choices for specifying it, choose the one that loses less
>information about the original argument.

Thanks.  I see why it was introduced.

In the general case I can see it would be an important consideration.
In the specific case of a long-established and widely used method such
as #subStrings: I think a principle like "Least Surprise" would be more
applicable.  One is sort of past the design phase and in the
implementation phase.

Also it seemed too many disparate functions were being assigned to a
single message.

>
>Why  ?  This is hard to explain, and even harder to prove I guess.
>But I believe in it. Here are some arguments:
[snip]

Excellent explanation.

--
Richard A. Harmon          "The only good zombie is a dead zombie"
[hidden email]           E. G. McCarthy