Hi,
I want to know for a matched string whether the match is non-ambiguous. For example: 'I walk a walk.' matchesRegex: '([\w. ]*)walk([\w. ]*)' The match is ambiguous: a) first ([\w. ]*) applies to 'I ', second ([\w. ]*) applies to ' a walk.' b) first ([\w. ]*) applies to 'I walk a ', second ([\w. ]*) applies to '.' Which is synonymous to multiple accepting paths in the underlying final state automaton. Is there an easy way to test for ambiguity? Perhaps it one can get all possible successfull matches for the subexpressions? Ciao, Steffen _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
On Sep 15, 2008, at 11:21 PM, Steffen Märcker wrote: > Hi, > > I want to know for a matched string whether the match is non- > ambiguous. > For example: > > > 'I walk a walk.' matchesRegex: '([\w. ]*)walk([\w. ]*)' > > The match is ambiguous: I'm not sure about that: your case should be disambiguated because the match is 'greedy'. > > > a) first ([\w. ]*) applies to 'I ', second ([\w. ]*) applies to ' a > walk.' > b) first ([\w. ]*) applies to 'I walk a ', second ([\w. ]*) applies > to '.' > > Which is synonymous to multiple accepting paths in the underlying > final > state automaton. > Is there an easy way to test for ambiguity? Perhaps it one can get all > possible successfull matches for the subexpressions? If the subject interests you have a look at: http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine R - _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
In reply to this post by Steffen Märcker
Steffen,
Do you want to perform approximate searching? I.e. Approximate searching (called Fuzzy searching too, but has nothing to do with Fuzzy Logic) is used when you cannot preprocess the text (is not possible to build an index on it because it is not know in advance) and happens if, for example, you have a text containing the word "Heidegger" and an user of your system search for "Heideger", then you will want to report a match anyway because both words are approximate. You can use a "text" searching algorithm. Now, depending your domain you'll have different meanings for "text" here. The text is defined by an Alphabet. We all know the common case, when the alphabet is the English Alphabet, but it could be another too, like DNA alphabet, Binary alphabet, etc. An alphabet determines the error model and the most commonly used error model is the "Edit Distance". For text searches (i.e. English Alphabet), you will want to check the Hamming Edit Distance and Levenshtein Edit Distance, which are the simpler ones and applies when: Hamming Distance --> Search with K mismatches. --> Count only substitutions --> From Heideger to Heidegger (5 substitutions) Levenshtein Distance --> Search with K differences --> Count minimum number of insertions, deletions and substitutions --> From Heideger to Heidegger (1 edition, insertion of g). (Note there are distances for data-minning applications, computer vision, machine learning, molecular subsquences, chess applications, and there are many extensions for Levenshtein that take into account Transpositions, Costs, etc) I have implemented the Shift-Or algorithm (with the Hamming Distance, which transforms it in an Approximate search algorithm as opposed to Exact Searching) in Squeak with maximum k of 31, I think it could be easily ported to VisualWorks. I wrote the C version too but it was just too boring to port it as plugin, let me know if this is what you are looking for. Hernán 2008/9/15 Steffen Märcker <[hidden email]> Hi, _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
Free forum by Nabble | Edit this page |