Smalltalk › Cincom › VisualWorks

[vwnc] Regex11

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

3 messages Options

Steffen Märcker

[vwnc] Regex11

Hi,

I want to know for a matched string whether the match is non-ambiguous.
For example:

'I walk a walk.' matchesRegex: '([\w. ]*)walk([\w. ]*)'

The match is ambiguous:

a) first ([\w. ]*) applies to 'I ', second ([\w. ]*) applies to ' a walk.'
b) first ([\w. ]*) applies to 'I walk a ', second ([\w. ]*) applies to '.'

Which is synonymous to multiple accepting paths in the underlying final
state automaton.
Is there an easy way to test for ambiguity? Perhaps it one can get all
possible successfull matches for the subexpressions?

Ciao, Steffen
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Reinout Heeck

Re: [vwnc] Regex11

On Sep 15, 2008, at 11:21 PM, Steffen Märcker wrote:

> Hi,
>
> I want to know for a matched string whether the match is non-
> ambiguous.
> For example:
>
>
> 'I walk a walk.' matchesRegex: '([\w. ]*)walk([\w. ]*)'
>
> The match is ambiguous:

I'm not sure about that: your case should be disambiguated because the
match is 'greedy'.

>
>
> a) first ([\w. ]*) applies to 'I ', second ([\w. ]*) applies to ' a
> walk.'
> b) first ([\w. ]*) applies to 'I walk a ', second ([\w. ]*) applies
> to '.'
>
> Which is synonymous to multiple accepting paths in the underlying
> final
> state automaton.
> Is there an easy way to test for ambiguity? Perhaps it one can get all
> possible successfull matches for the subexpressions?

If the subject interests you have a look at:
http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine

R
-

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

hernanmd

Re: [vwnc] Regex11

In reply to this post by Steffen Märcker

Steffen,
Do you want to perform approximate searching? I.e. Approximate searching (called Fuzzy searching too, but has nothing to do with Fuzzy Logic) is used when you cannot preprocess the text (is not possible to build an index on it because it is not know in advance) and happens if, for example, you have a text containing the word "Heidegger" and an user of your system search for "Heideger", then you will want to report a match anyway because both words are approximate.

You can use a "text" searching algorithm. Now, depending your domain you'll have different meanings for "text" here. The text is defined by an Alphabet. We all know the common case, when the alphabet is the English Alphabet, but it could be another too, like DNA alphabet, Binary alphabet, etc.

An alphabet determines the error model and the most commonly used error model is the "Edit Distance". For text searches (i.e. English Alphabet), you will want to check the Hamming Edit Distance and Levenshtein Edit Distance, which are the simpler ones and applies when:

Hamming Distance --> Search with K mismatches. --> Count only substitutions --> From Heideger to Heidegger (5 substitutions)
Levenshtein Distance --> Search with K differences --> Count minimum number of insertions, deletions and substitutions --> From Heideger to Heidegger (1 edition, insertion of g).

(Note there are distances for data-minning applications, computer vision, machine learning, molecular subsquences, chess applications, and there are many extensions for Levenshtein that take into account Transpositions, Costs, etc)

I have implemented the Shift-Or algorithm (with the Hamming Distance, which transforms it in an Approximate search algorithm as opposed to Exact Searching) in Squeak with maximum k of 31, I think it could be easily ported to VisualWorks. I wrote the C version too but it was just too boring to port it as plugin, let me know if this is what you are looking for.

Hernán

2008/9/15 Steffen Märcker <[hidden email]>

Hi,

I want to know for a matched string whether the match is non-ambiguous.
For example:

'I walk a walk.' matchesRegex: '([\w. ]*)walk([\w. ]*)'

The match is ambiguous:

a) first ([\w. ]*) applies to 'I ', second ([\w. ]*) applies to ' a walk.'
b) first ([\w. ]*) applies to 'I walk a ', second ([\w. ]*) applies to '.'

Which is synonymous to multiple accepting paths in the underlying final
state automaton.
Is there an easy way to test for ambiguity? Perhaps it one can get all
possible successfull matches for the subexpressions?

Ciao, Steffen

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc