New methods for the String class

Daniela Meneses

New methods for the String class

Hi to all,

As you may know, I'm working on some improvements to the String class. So far I have implemented some missing tests. Now I'm looking to add new methods that could be useful, based on the Ruby API (http://www.ruby-doc.org/core-2.1.0/String.html). These are a few of the methods that I'm planning to implement:

chomp(separator=$/) -> new_str
chop() -> new_str
ljust(integer, padstr=' ') -> new_str
next -> new_str
partition(sep) -> [head, sep, tail]
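
For reference, here is a minimal Ruby sketch of what the five methods listed above return. The sample strings and variable names are made up for illustration; they are not taken from the Ruby documentation or from Pharo.

```ruby
# Behaviour of the five Ruby String methods listed above.
# Each call returns a new string (or array) and leaves the receiver untouched.

chomped     = "hello\n".chomp             # drops one trailing record separator
chomped_sep = "hello".chomp("llo")        # drops the given suffix, if present
chopped     = "hello".chop                # drops the last character
padded      = "hello".ljust(8, "*")       # pads on the right up to width 8
unpadded    = "hello".ljust(3)            # width <= length: nothing is added
successor   = "az".next                   # increments the rightmost character, with carry
parts       = "key=value".partition("=")  # [head, sep, tail]
no_match    = "key".partition("=")        # separator absent: [str, "", ""]

p chomped, chomped_sep, chopped, padded, unpadded, successor, parts, no_match
```

Any Pharo counterpart would have to decide what to do in the edge cases shown here (separator not found, width smaller than the string), so they are worth covering in the tests.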

Could you help me find out whether these methods are already available in the String class?

If you have any ideas for new methods for the String class, they would be really welcome.

Cheers

Daniela Meneses

pharo4Stef@free.fr

Re: [Pharo-dev] New methods for the String class

Daniela

You should try the Method Finder.

Open it and select the "Examples" entry in the drop-down list.

Then you can type examples and see whether a method already implements them.

For example,

'abcab' . 'a' . 'bcb'

shows that copyWithoutAll: is the method.

But it expects a character as second argument:

'abcab' . $a . 'bcb'

Stef

In reply to the post by Daniela Meneses of 24 Feb 2014, 18:30.

hernanmd

Re: New methods for the String class

In reply to this post by Daniela Meneses

Hi Daniela,

We could have an information retrieval API for approximate string matching, e.g. Levenshtein distance (already implemented, in various versions) and Hamming distance; these two are the most used and simplest edit distances.
Then you have longest common subsequence and longest common substring (they are implemented in a package called "Fuzz", see #longestCommonSubsequenceWith:). There is also the shift-or algorithm adapted for approximate matches (also implemented); fuzzy phrasing is another world again. Many applications use the Damerau edit distance. Bioinformatics uses Needleman-Wunsch and Smith-Waterman, but there they are called "aligners" :) You don't want to code the optimized versions in Smalltalk, though; some say it could take years.

All edit distances out there have specific requirements, and none is better than the others in all cases. For example, Jaro-Winkler is useful for short, one-word strings.
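
To make the two simplest distances concrete, here is a minimal sketch, written in Ruby since that is the API this thread takes as its model. The method names `hamming` and `levenshtein` are hypothetical, and this is not the code of any existing Pharo or Squeak package. The Levenshtein version is the plain two-row dynamic programming formulation, not an optimized one.

```ruby
# Hamming distance: number of positions at which two strings of equal
# length differ. It is undefined for strings of different lengths.
def hamming(a, b)
  raise ArgumentError, "strings must have equal length" unless a.length == b.length
  a.chars.zip(b.chars).count { |x, y| x != y }
end

# Levenshtein distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn a into b.
# Only two rows of the edit matrix are kept in memory at a time.
def levenshtein(a, b)
  previous = (0..b.length).to_a          # distances from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    current = [i]                        # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      current << [
        previous[j] + 1,                 # deletion
        current[j - 1] + 1,              # insertion
        previous[j - 1] + cost           # substitution or match
      ].min
    end
    previous = current
  end
  previous.last
end
```

With these definitions, `hamming("karolin", "kathrin")` and `levenshtein("kitten", "sitting")` both give 3. Note that Hamming rejects inputs of different lengths, which is one of the "specific requirements" mentioned above.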

You have a lot of options to research. Smalltalkers here are very experienced and clever and always give great advice, so don't be afraid to ask.

Cheers,

Hernán

pharo4Stef@free.fr

Re: New methods for the String class

I’m not sure that all these edit distances should be part of the core String API.

Now what would be good is to have a chapter describing them. This chapter would work well with the bioSmalltalk one :)

NorbertHartl

Re: New methods for the String class

In reply to the post by Pharo4Stef of 26 Feb 2014, 09:50.

I’m pretty sure they shouldn’t. Most of these are most likely for special applications, so they are perfect candidates for a string extension package. A real modular entity that could load each of them individually would be perfect, but we don’t have the proper tools yet. Unless, of course, each of those algorithms is composed of multiple classes and would fit naturally in a package of its own.

But the most important prerequisite would be to make a separate package out of it. Did I understand that right that those are part of biosmalltalk? Then the problem is that useful things are buried in a specialized application. I often find that I don’t know about some code because it is buried inside another project, or that I know about it and cannot use it because it is tied closely to that project.

my 2 cents,

Norbert

hernanmd

Re: New methods for the String class

In reply to the post by Norbert Hartl of 2014-02-26, 7:10 GMT-03:00.

I'm absolutely in favor of a separate package for information retrieval algorithms. From what I've seen, some algorithms require optimization through dynamic programming (automata, matrices, etc.), and that would lead to multiple classes, assuming you don't want to clutter the String class.

"But the most important prerequisite would be to make a separate package out of it. Did I understand that right that those are part of biosmalltalk?"

No. Those algorithms are spread over different packages in repositories like SqueakSource, Cincom Store, etc.

Hernán

askoh

Re: New methods for the String class

"No. Those algorithms are spread over different packages in repositories like SqueakSource, Cincom Store, etc"

Can you tell me where to find code for longest common substring? I would appreciate the detailed location.
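
While waiting for a pointer to the existing code, here is a minimal sketch of the textbook dynamic programming approach to the longest common substring, in Ruby. It is not the implementation from the "Fuzz" package or from any other package mentioned in this thread, and the method name `longest_common_substring` is hypothetical. Unlike the longest common subsequence, the match has to be contiguous in both strings.

```ruby
# Longest common substring via dynamic programming.
# current[j] holds the length of the longest common suffix of
# a[0, i] and b[0, j]; the best value seen so far marks the answer.
def longest_common_substring(a, b)
  best_len = 0
  best_end = 0                            # index in a just past the best match
  previous = Array.new(b.length + 1, 0)
  a.each_char.with_index(1) do |ca, i|
    current = Array.new(b.length + 1, 0)
    b.each_char.with_index(1) do |cb, j|
      next unless ca == cb                # a mismatch resets the suffix length to 0
      current[j] = previous[j - 1] + 1
      if current[j] > best_len
        best_len = current[j]
        best_end = i
      end
    end
    previous = current
  end
  a[best_end - best_len, best_len]
end
```

For example, `longest_common_substring("ABABC", "BABCA")` returns `"BABC"`. When several substrings share the maximum length, this sketch returns the one that ends first in the first argument. It runs in time proportional to the product of the two lengths.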

Thanks,
Aik-Siong Koh

Pharo Smalltalk Users mailing list

Re: New methods for the String class

In reply to this post by hernanmd

What fuzzy-string matching tools & packages are available today?

-cam

In reply to the post by Hernán Morales Durand of Wed, Feb 26, 2014, 9:09 AM.

stepharo

Re: New methods for the String class

Hi

I know that Olivier Auverlot has all kinds of string distances.
Hernán Morales also has a package extending String.

Stef

Pharo Smalltalk Users mailing list

Re: New methods for the String class

In reply to this post by Pharo Smalltalk Users mailing list

Thank You!

-cam

On Tue, Jul 28, 2015 at 11:39 AM, Cameron Sanders via Pharo-users <[hidden email]> wrote:

---------- Forwarded message ----------
From: Cameron Sanders <[hidden email]>
To: Any question about pharo is welcome <[hidden email]>
Cc:
Date: Tue, 28 Jul 2015 11:00:11 -0400
Subject: Re: [Pharo-users] New methods for the String class
What fuzzy-string matching tools & packages are available today?

-cam
