New methods for the String class

New methods for the String class

Daniela Meneses
Hi to all,

As you may know, I'm working on some improvements for the String class. So far I have implemented some missing tests. Now I would like to add new methods that could be useful, based on the Ruby API (http://www.ruby-doc.org/core-2.1.0/String.html). These are a few of the methods that I'm planning to implement:

  • chomp(separator=$/) -> new_str
  • chop() -> new_str
  • ljust(integer, padstr=' ') -> new_str
  • next -> new_str
  • partition(sep) -> [head, sep, tail]

Could you help me find out whether these methods are already available for the String class?

If you have any ideas for new methods for the String class, they will be really welcome.
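
For readers who do not know the Ruby methods listed above, here is a rough sketch of their behaviour in Python. The function names simply mirror the Ruby names and are not an existing Pharo or Python API; `next` is left out because its carry rules are more involved, and only the default and explicit-separator cases of `chomp` are covered.

```python
def chomp(s, separator=None):
    """Drop one trailing line ending, or one trailing `separator` if given."""
    if separator is None:
        # Default case: remove a single trailing "\r\n", "\n" or "\r".
        for ending in ("\r\n", "\n", "\r"):
            if s.endswith(ending):
                return s[: -len(ending)]
        return s
    if separator and s.endswith(separator):
        return s[: -len(separator)]
    return s


def chop(s):
    """Drop the last character; a trailing "\r\n" counts as one unit."""
    if s.endswith("\r\n"):
        return s[:-2]
    return s[:-1]


def ljust(s, width, padstr=" "):
    """Pad `s` on the right with repetitions of `padstr` up to `width`."""
    if not padstr:
        raise ValueError("padstr must not be empty")
    missing = width - len(s)
    if missing <= 0:
        return s
    repeats = missing // len(padstr) + 1
    return s + (padstr * repeats)[:missing]


def partition(s, sep):
    """Split at the first `sep` into [head, sep, tail]."""
    head, found, tail = s.partition(sep)
    return [head, found, tail]
```

When the separator is not found, `partition` returns the whole string followed by two empty strings, which is also what the Ruby documentation describes.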

--
Cheers,
Daniela Meneses

Re: [Pharo-dev] New methods for the String class

pharo4Stef@free.fr
Daniela

You should try the Method Finder.
Open it and select the examples option in the drop-down.

Then you can type examples and see whether a method already implements them.

for example

'abcab' . 'a' . 'bcb'

shows that copyWithoutAll: is the method.
But it expects a character as the second argument:

'abcab' . $a . 'bcb'
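
The idea behind this kind of search (give a receiver, arguments and the expected result, and get back the methods that produce it) can be sketched in a few lines of Python. This is only an illustration of the principle, not how the Pharo Method Finder is implemented; `find_methods` is a made-up name, and the sketch is only reasonable for immutable receivers such as strings, since it blindly calls every public method.

```python
def find_methods(receiver, args, expected):
    """Return the names of public methods of `receiver` that map `args` to `expected`."""
    matches = []
    for name in dir(type(receiver)):
        if name.startswith("_"):
            continue  # skip private and special methods
        method = getattr(receiver, name)
        try:
            if method(*args) == expected:
                matches.append(name)
        except Exception:
            # Wrong number or type of arguments: this method is not a candidate.
            continue
    return matches


# Which string method turns 'abcab' into 'bcb' when given 'a' and ''?
candidates = find_methods("abcab", ["a", ""], "bcb")
print(candidates)
```

For the example shown, the candidate list should include `replace`, the closest Python counterpart to the Smalltalk example above.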

Stef

Re: New methods for the String class

hernanmd
In reply to this post by Daniela Meneses

Hi Daniela,

We could have an information retrieval API for approximate string matching, e.g. Levenshtein distance (already implemented, in various versions) and Hamming distance; these two are the most widely used and simplest edit distances.
Then there are longest common subsequence and longest common substring (they are implemented in a package called "Fuzz", see #longestCommonSubsequenceWith:). There is also the shift-or algorithm adapted for approximate matching (also implemented), and fuzzy phrasing is another world again. Many applications use the Damerau edit distance. Bioinformatics uses Needleman-Wunsch and Smith-Waterman, but they call them "aligners" :) You don't want to code the optimized versions in Smalltalk, though; some say it could take years.
All edit distances out there have specific requirements, and none is better than the others in all cases. For example, Jaro-Winkler is useful for short, one-word strings.
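
As a rough illustration of the two simplest distances mentioned above, here is a minimal Python sketch. It is not taken from any of the Smalltalk packages referred to in this thread; the function names are chosen only for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x by y, free if equal
            ))
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

Hamming distance only compares strings of the same length position by position, while Levenshtein also allows insertions and deletions, which is one reason neither fits every use case.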

You have a lot of options for research. Smalltalkers here are very experienced and clever and always give great advice, so don't be afraid to ask.

Cheers,

Hernán

 

Re: New methods for the String class

pharo4Stef@free.fr


I’m not sure that all these edit distances should be part of the String core api.
Now what would be good is to have a chapter describing them. This chapter would work well with the bioSmalltalk one :)



Re: New methods for the String class

NorbertHartl

On 26.02.2014 at 09:50, Pharo4Stef <[hidden email]> wrote:



I’m not sure that all these edit distances should be part of the String core api.
Now what would be good is to have a chapter describing them. This chapter would work well with the bioSmalltalk one :)

I’m pretty sure they shouldn’t. Most of these are most likely for special applications, so they are a perfect candidate for a string extension package. A truly modular entity that could load each of them individually would be perfect, but we don’t have the proper tools yet. Unless, of course, each of those algorithms is composed of multiple classes and would fit naturally in a package.
But the most important prerequisite would be to make a separate package out of it. Did I understand correctly that those are part of bioSmalltalk? Then the problem is that useful things are buried in a specialized application. I often run into this: either I don’t know about some code because it is buried inside another project, or I know about it and cannot use it because it is tied closely to a project.

my 2 cents,

Norbert



Re: New methods for the String class

hernanmd



2014-02-26 7:10 GMT-03:00 Norbert Hartl <[hidden email]>:

I’m pretty sure they shouldn’t. Most of these are most likely for special applications, so they are a perfect candidate for a string extension package. A truly modular entity that could load each of them individually would be perfect, but we don’t have the proper tools yet. Unless, of course, each of those algorithms is composed of multiple classes and would fit naturally in a package.

I'm absolutely in favor of a separate package for information retrieval algorithms. From what I've seen, some algorithms require optimization through dynamic programming (automata, matrices, etc.), and that would lead to multiple classes, assuming you don't want to clutter the String class.
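
To make the dynamic-programming point concrete, here is a minimal Python sketch of longest common substring written as its own small class rather than as a method on the string type. The class name and its `value` method are invented for this example and do not correspond to any existing package.

```python
class LongestCommonSubstring:
    """Finds the longest run of characters shared by two strings."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def value(self):
        best_len, best_end = 0, 0
        # previous[j] = length of the common run ending at left[i-2], right[j-1]
        previous = [0] * (len(self.right) + 1)
        for i, x in enumerate(self.left, start=1):
            current = [0] * (len(self.right) + 1)
            for j, y in enumerate(self.right, start=1):
                if x == y:
                    # Extend the run that ended one character earlier in both strings.
                    current[j] = previous[j - 1] + 1
                    if current[j] > best_len:
                        best_len, best_end = current[j], i
            previous = current
        return self.left[best_end - best_len:best_end]


print(LongestCommonSubstring("xabcdy", "zabcdw").value())  # abcd
```

Keeping the table and bookkeeping inside a dedicated object means the string class itself gains no extra state or helper methods, which matches the packaging argument made above.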
 
But the most important prerequisite would be to make a separate package out of it. Did I understand correctly that those are part of bioSmalltalk?

No. Those algorithms are spread over different packages in repositories like SqueakSource, Cincom Store, etc.
 
Hernán



Re: New methods for the String class

askoh
Administrator
"No. Those algorithms are spread over different packages in repositories like SqueakSource, Cincom Store, etc"

Can you tell me where to find code for longest common substring? I would appreciate the detailed location.

Thanks,
Aik-Siong Koh

Re: New methods for the String class

Pharo Smalltalk Users mailing list
In reply to this post by hernanmd
What fuzzy-string matching tools & packages are available today?

-cam


Re: New methods for the String class

stepharo
Hi

I know that Olivier Auverlot has all kinds of string distances.
Hernán Morales also has a package extending String.

Stef


Re: New methods for the String class

Pharo Smalltalk Users mailing list
In reply to this post by Pharo Smalltalk Users mailing list
Thank You! 
-cam
