I decided that 'almost' didn't cut it. After looking around a bit I
determined that the only place in the base image where the non-ANSI
behavior of String>>subStrings: was used was in String>>lines. This was
easy to refactor, so now the package on my web site provides my
interpretation of what the ANSI standard requires of String>>subStrings:
while retaining the functionality of String>>lines. A separate unit test
package is also provided.

The ANSI spec is silent on one issue that I'm interested in hearing
opinions on. It doesn't say how multiple consecutive separator characters
should be handled. If we have the following string

    aString := 'abc,def,,ghi,,jkl,'

should

    aString subStrings: ','

answer

    #('abc' 'def' '' 'ghi' '' 'jkl' '')

or

    #('abc' 'def' 'ghi' 'jkl')

Alternatively, if we have

    aString := '--abc--def--ghi--jkl--'

should

    aString subStrings: '-'

answer

    #('' '' 'abc' '' 'def' '' 'ghi' '' 'jkl' '' '')

or

    #('abc' 'def' 'ghi' 'jkl')

???

Opinions solicited.

If anyone is interested, I've updated this package. The implementation is
now a bit more flexible, providing all the options mentioned in the
previous message in this thread.

Share and enjoy.

In reply to this post by Bob Jarvis
Bob Jarvis wrote:

> ... If we have the following string
>
>     aString := 'abc,def,,ghi,,jkl,'
>
> should aString subStrings: ',' answer
>
>     #( 'abc' 'def' '' 'ghi' '' 'jkl' '' )

Yes. If you automatically remove the empty substrings, you lose
information.

You may want to #reject: the empty substrings at a later step, but this
shouldn't happen automatically. If you remove them automatically, you
"can't get it back". But if you keep them, you can always decide to remove
them. And you can add an alternative method that does the removing
automatically. But you keep your options open, which according to the
financial theory of options pricing adds value to them.

I'd like to write this as a 'pattern' since it applies to many other things
than extracting substrings. But I don't have the time right now. "Keep your
Options Open"

-Panu Viljamaa

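The "alternative method that does the removing automatically" could be sketched as below. The selector #nonEmptySubStrings: is a hypothetical name chosen for illustration; it is not part of the ANSI spec or of the package discussed in this thread, and the sketch assumes that #subStrings: keeps the empty substrings.

```smalltalk
nonEmptySubStrings: separators
    "Hypothetical convenience method on String (name invented for this sketch).
     Assumes #subStrings: answers the empty substrings too, and filters them out."
    ^(self subStrings: separators) reject: [:each | each isEmpty]
```

Callers who care where consecutive separators appeared keep using #subStrings:, and callers who don't can use the filtering variant.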
"Panu Viljamaa" <[hidden email]> wrote in message
news:[hidden email]...
> Bob Jarvis wrote:
>
> > ... If we have the following string
> >
> >     aString := 'abc,def,,ghi,,jkl,'
> >
> > should aString subStrings: ',' answer
> >
> >     #( 'abc' 'def' '' 'ghi' '' 'jkl' '' )
>
> Yes. If you automatically remove the empty substrings, you lose
> information.

Hi Panu,

Well, it is true that you will lose information if you do that. But look at
the definition of #subStrings: and you'll see that losing information
cannot be avoided by any reasonable interpretation of it.

I will show you:

    'hello,my dear friend' subStrings: ', '

None of the valid answers to that one are lossless, because you can specify
more than one separator.

The service as specified is pretty lame (i.e., of limited use) anyway. For
instance, #('friend' 'hello' 'my' 'dear') is a valid answer according to
the specification, as is #('hello').

It looks like the ANSI committee were out to lunch when they wrote up this
one.

> You may want to #reject: the empty substrings at a later step, but this
> shouldn't happen automatically. If you remove them automatically, you
> "can't get it back". But if you keep them, you can always decide to
> remove them. And you can add an alternative method that does the removing
> automatically. But you keep your options open, which according to the
> financial theory of options pricing adds value to them.
>
> I'd like to write this as a 'pattern' since it applies to many other
> things than extracting substrings. But I don't have the time right now.
> "Keep your Options Open"

Well, I'm very much in agreement with that one!

Regards,

Peter van Rooijen

> -Panu Viljamaa

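A minimal workspace sketch of the point about multiple separators, assuming an implementation that keeps empty substrings and answers them in order of appearance (the variable names are illustrative only):

```smalltalk
| first second |
"Two different receivers, each split on comma or space."
first := 'hello,my dear' subStrings: ', '.
second := 'hello my,dear' subStrings: ', '.
"Under the stated assumption both answers hold 'hello', 'my' and 'dear',
 so the result alone does not say which separator stood where."
first = second
```

Keeping the empty substrings preserves how many separators occurred, but not which of the listed separator characters each one was.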
On Wed, 26 Dec 2001 11:20:18 +0100, "Peter van Rooijen"
<[hidden email]> wrote:

>"Panu Viljamaa" <[hidden email]> wrote in message
>news:[hidden email]...
>> Bob Jarvis wrote:
>>
>> > ... If we have the following string
>> >
>> > aString := 'abc,def,,ghi,,jkl,'
>> >
>> > should aString subStrings: ',' answer
>> >
>> > #( 'abc' 'def' '' 'ghi' '' 'jkl' '' )

That is what I think the ANSI spec. says should be returned.

>> Yes. If you automatically remove the empty substrings, you lose
>information.
>
>Hi Panu,
>
>Well, it is true that you will lose information if you do that. But look at
>the definition of #subStrings: and you'll see that losing information cannot
>be avoided by any reasonable interpretation of it.
>
>I will show you:
>
>'hello,my dear friend' subStrings: ', '
>
>none of the valid answers to that one are lossless, because you can specify
>more than one separator.

I'm not sure what the discussion about losing information is about in the
context of ANSI subStrings: -- why does it matter?

>The service as specified is pretty lame (i.e., of limited use) anyway. For
>instance, #('friend' 'hello' 'my' 'dear') is a valid answer according to the
>specification, as is #('hello').
>
>It looks like the ANSI committee were out to lunch when they wrote up this
>one.

I don't have enough experience in Smalltalk to know what the committee
should have specified as behavior. I just looked at the (draft) ANSI spec.
for guidance when doing my reworked version of the CS II ANSI Compatibility
Tests.

I've not pondered #subStrings: deeply. I just assumed those on the
committee were looking at the tasks I've commonly encountered:

    'Hi Bob'            to a list of words
                        -> 'Hi' 'Bob'
    'Hi, Bob'           comma delimited row in table
                        -> 'Hi' ' Bob'
    'Hi<tab>there Bob'  space or tab delimited row in table
                        -> 'Hi' 'there' 'Bob'

If I want more exotic rules than ANSI's I generally read a stream:

    ' Hi, Bob '         to a list of words with no punctuation and
                        white space trimmed
                        -> 'Hi' 'Bob'

I don't know how to do the task above with the ANSI spec. #subStrings: or
any other general purpose method. I would think any general purpose method
would have rules too complicated to remember.

>> You may want to #reject: the empty substrings at a later step, but this
>shouldn't happen automatically. If you remove them automatically, you
>"can't get it back". But if you keep them, you can always decide to remove
>them. And you can add an alternative method that does the removing
>automatically. But you keep your options open, which according to the
>financial theory of options pricing adds value to them.
>>
>> I'd like to write this as a 'pattern' since it applies to many other
>things than extracting substrings. But I don't have the time right now.
>"Keep your Options Open"
>
>Well, I'm very much in agreement with that one!

I encourage folks to use #subStrings: for ANSI #subStrings:. For a
dialect's, or other preferred, behavior of #subStrings: use some other
name.

I just changed all #subStrings: references to #dolphinSubStrings: and
implemented ANSI #subStrings: as:

    subStrings: separators
        "Answer an array containing the substrings in the receiver
         separated by the elements of separators."
        | char result sourceStream subString |
        #'AStdMsg'.
        "2001/04/26 Harmon, R. Changed for ANSI <readableString>."
        (separators allSatisfy: [:elem | elem isKindOf: Character])
            ifFalse: [^self error: 'separators must be Characters.'].
        sourceStream := ReadStream on: self.
        result := OrderedCollection new.
        subString := String new.
        [sourceStream atEnd] whileFalse: [
            char := sourceStream next.
            (separators includes: char)
                ifTrue: [result add: subString.
                    subString := String new]
                ifFalse: [subString := subString , (String with: char)]].
        result add: subString.
        ^result asArray

It seems to work for me.

--
Richard A. Harmon          "The only good zombie is a dead zombie"
[hidden email]             E. G. McCarthy

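For the trimmed-word task mentioned above, one possible approach (a sketch only, assuming the keep-empties behavior of the implementation in this post) is to list both the comma and the space as separators and then drop the empty substrings:

```smalltalk
| words |
"Splitting ' Hi, Bob ' on comma and space yields empty substrings at the
 leading space, between the comma and the following space, and at the
 trailing space. Rejecting them leaves only the words."
words := (' Hi, Bob ' subStrings: ', ') reject: [:each | each isEmpty].
"words now holds 'Hi' and 'Bob'."
```

This only works when every unwanted character can be listed as a separator; rules beyond that would probably still call for reading a stream.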
Peter van Rooijen emailed me noting:

=====
consider dropping the String concatenation for nextPut: -ing to a Stream.
You are creating an awful number of objects for no good reason. Mail me if
you need to see the code I mean.
=====

Right on the mark. The ANSI messages I've added or changed are definitely
not built for speed! I'm still struggling with getting them to produce the
correct behavior, and how to test for it. So I appreciate all the help I
get.

I found that String >> #_separateSubStringsIn: finds the receiver separator
in the string with:

    stop := aReadableString indexOfSubCollection: self startingAt: start.

and creates a sub-string with:

    aReadableString copyFrom: start to: stop-1

I considered finding the start and stop index of a sub-string and using
#copyFrom:to: also. Empty sub-strings seemed too complicated for me to
figure out in the short time available, and my brain started to hurt.

So I just decided to present an implementation that is easy to understand
but (I hope) a little less clumsy than the original method I posted. I'll
revisit this when I get more time to explore how the WriteStream stacks up
against #copyFrom:to: and other trade-offs.

    subStrings: separators
        "Answer an array containing the substrings in the receiver
         separated by the elements of separators."
        | char result sourceStream subStrStream |
        "2001/04/26 Harmon, R. Changed for ANSI <readableString>."
        "2001/04/26 Harmon, R. Changed impl. that used , (concat) -
         created too many unneeded objects."
        (separators allSatisfy: [:elem | elem isKindOf: Character])
            ifFalse: [^self error: 'separators must be Characters.'].
        sourceStream := ReadStream on: self.
        result := OrderedCollection new: 10.
        subStrStream := WriteStream with: (String new).
        [sourceStream atEnd] whileFalse: [
            char := sourceStream next.
            (separators includes: char)
                ifTrue: [result add: subStrStream contents.
                    subStrStream := WriteStream with: (String new)]
                ifFalse: [subStrStream nextPut: char]].
        result add: subStrStream contents.
        ^result asArray

On Wed, 26 Dec 2001 14:12:20 GMT, [hidden email] (Richard A. Harmon)
wrote:

[snip]
>subStrings: separators
>    "Answer an array containing the substrings in the receiver
>separated by the elements of separators."
[snip]
>    subString := String new.
>    [sourceStream atEnd] whileFalse: [
>        char := sourceStream next.
>        (separators includes: char)
>            ifTrue: [result add: subString.
>                subString := String new]
>            ifFalse: [subString := subString , (String with:
>char)]].
[snip]

--
Richard A. Harmon          "The only good zombie is a dead zombie"
[hidden email]             E. G. McCarthy

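The index-based alternative considered in this post might look like the following sketch. #subStringsByIndex: is a hypothetical selector invented for illustration. The sketch assumes that #copyFrom:to: answers an empty string when the stop index is one less than the start index, which should be checked in the dialect at hand.

```smalltalk
subStringsByIndex: separators
    "Hypothetical selector (not from the original posts). Scans the receiver
     once and copies each run between separators with #copyFrom:to:.
     Assumes an empty range (stop = start - 1) answers an empty string."
    | result start |
    result := OrderedCollection new.
    start := 1.
    1 to: self size do: [:index |
        (separators includes: (self at: index)) ifTrue: [
            "Everything from start up to the character before the separator."
            result add: (self copyFrom: start to: index - 1).
            start := index + 1]].
    "The final run, which is empty if the receiver ends with a separator."
    result add: (self copyFrom: start to: self size).
    ^result asArray
```

Under that assumption the empty substrings need no special casing: two adjacent separators simply produce a copy over an empty range.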
In reply to this post by Richard A. Harmon
"Richard A. Harmon" wrote:
> ...
> I'm not sure what the discussion about losing information is about in
> the context of ANSI subStrings: -- why it matters?

I think it matters as a general principle. If you're trying to implement a
generally (re)useful (library) method, and you have two otherwise equally
good choices for specifying it, choose the one that loses less information
about the original argument.

Why? This is hard to explain, and even harder to prove I guess. But I
believe in it. Here are some arguments:

If you provide more information about the original string, the users of
your method can decide for themselves whether they want to keep that
information or not. If you decide for them, they can't make that choice.

In a sense you are 'selling' them not only the method, but also an 'option'
to decide for themselves whether they want to see the empty substrings in
the result. According to options pricing theory there is value in an
option, and the more choices the option gives you, the greater the value.
(The proof of the pudding is that options are sold and bought, and even
taxed!)

So if you don't know whether the users of your method would like the empty
substrings to be in the result, denoting places where two consecutive
commas ( ,, ) appeared in the argument, leave them there, because

1) It is easy for them to take empty substrings out of the result if they
   want to
2) It is 'impossible' for them to put them back by looking at the result of
   the method.

-Panu Viljamaa

On Thu, 27 Dec 2001 19:59:33 -0500, Panu Viljamaa <[hidden email]>
wrote:

>"Richard A. Harmon" wrote:
>
>> ...
>> I'm not sure what the discussion about losing information is about in
>> the context of ANSI subStrings: -- why it matters?
>
>I think it matters as a general principle. If you're trying to implement
>a generally (re)useful (library-) method, and you have two otherwise
>as good choices for specifying it, choose the one the loses less
>information about the original argument

Thanks. I see why it was introduced. In the general case I can see it would
be an important consideration.

In the specific case of a long established and widely used method such as
#subStrings: I think a principle like "Least Surprise" would be more
applicable. One is sort of past the design phase and in the implementation
phase. Also, it seemed too many disparate functions were being assigned to
a single message.

>
>Why ? This is hard to explain, and even harder to prove I guess.
>But I believe in it. Here's some arguments:
[snip]

Excellent explanation.

--
Richard A. Harmon          "The only good zombie is a dead zombie"
[hidden email]             E. G. McCarthy