Non-greedy RegEx?


Manuel Leuenberger
Hi,

I just noticed that the Pharo regexes do not understand non-greedy matches. A PCRE-compatible regex engine is kind of essential; not having '.*?' parse and work as a regex is a bummer. Are there any more powerful regex engines around for Pharo? I could not find any.

Cheers,
Manuel
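
For readers who have not met the distinction, here is a minimal sketch of what '.*?' buys you. It is written in Python rather than Pharo, purely because Python's re module accepts the Perl-style syntax; the sample string is made up. The last pattern is a common workaround for engines that reject '.*?', assuming they support negated character classes and the terminator is a single character.

```python
import re

text = "<a><b>"

# Greedy: '.*' takes as much as it can, so the match runs to the last '>'.
greedy = re.findall(r"<.*>", text)

# Non-greedy: '.*?' takes as little as it can, so it stops at the first '>'.
lazy = re.findall(r"<.*?>", text)

# Workaround without lazy quantifiers: forbid the terminator inside the repeat.
negated = re.findall(r"<[^>]*>", text)

print(greedy)   # ['<a><b>']
print(lazy)     # ['<a>', '<b>']
print(negated)  # ['<a>', '<b>']
```

The negated-class form stops being an option once the terminator is longer than one character, which is where a real non-greedy operator becomes hard to replace.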



Re: Non-greedy RegEx?

EstebanLM
Hi,

Yes, the Pharo regex implementation is very naive.
We will be moving to a PCRE binding to match outside-world standards, but we have not had the time to work on it :(

Esteban


Re: Non-greedy RegEx?

Thierry Goubier
The regex engine inside SmaCC allows for non-greedy REs. But it's
integrated as the first stage of a parser, not as an independent RE engine.

Regards,

Thierry


Re: Non-greedy RegEx?

Richard O'Keefe
In reply to this post by EstebanLM
Please DON'T move to PCRE.
"Outside world standards"?  There are so many.
There are two important things to know about
PCRE: (1) it is a popular open source regexp
library for Perl-style regexps, (2) because of
that, it is prone to truly horrendous performance
problems.  There are alternatives, such as re2,
which are not subject to PCRE's intrinsic
performance pathologies.  As it happens, re2
supports *? +? and ??.
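
As a quick illustration of the three lazy operators named above, here is a small sketch in Python. This is an assumption-laden stand-in: Python's re module is a backtracking engine, not re2, and is used here only because it accepts the same Perl-style spelling of these operators. The subject string is made up.

```python
import re

subject = "aaa"

# Each pair is (greedy pattern, lazy pattern).
pairs = [(r"a*", r"a*?"), (r"a+", r"a+?"), (r"a?", r"a??")]

results = {}
for greedy, lazy in pairs:
    # re.match anchors at the start of the subject and returns the first match found.
    results[greedy] = re.match(greedy, subject).group()
    results[lazy] = re.match(lazy, subject).group()

for pattern, matched in results.items():
    print(f"{pattern!r:8} -> {matched!r}")
```

The greedy forms match 'aaa', 'aaa' and 'a'; the lazy forms match the empty string, 'a' and the empty string, because each lazy operator prefers the fewest repetitions that still let the overall match succeed.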


Re: Non-greedy RegEx?

Manuel Leuenberger
I am not advocating for PCRE in particular; I just need a regex engine that is just as powerful. I guess re2 serves that purpose, although I haven't used it myself (knowingly).

Looking at https://github.com/google/re2/wiki/WhyRE2, re2 actually seems to be a good target. "match time is linear in the length of the input string" sounds like a really nice property.


Re: Non-greedy RegEx?

Denis Kudriashov
In reply to this post by EstebanLM
We can also update the Pharo version from the original VW repository if the current license is appropriate. I think it covers the missing parts.


Re: Non-greedy RegEx?

Sven Van Caekenberghe-2
In reply to this post by EstebanLM
Still, there are advantages to an in-image solution. I can't say this enough: these external lib dependencies pose their own problems ...


Re: Non-greedy RegEx?

Pierce Ng-3
In reply to this post by Richard O'Keefe
On Wed, Feb 06, 2019 at 12:26:00AM +1300, Richard O'Keefe wrote:

> Please DON'T move to PCRE.
> "Outside world standards"?  There are so many.
> There are two important things to know about
> PCRE: (1) it is a popular open source regexp
> library for Perl-style regexps, (2) because of
> that, it is prone to truly horrendous performance
> problems.  There are alternatives, such as re2,
> https://github.com/google/re2 ,
> which are not subject to PCRE's intrinsic
> performance pathologies.  As it happens, re2
> supports *? +? and ??.

Can you share some examples of PCRE's bad performance?

Pierce


Re: Non-greedy RegEx?

NorbertHartl
In reply to this post by Sven Van Caekenberghe-2


> On 05.02.2019 at 16:16, Sven Van Caekenberghe <[hidden email]> wrote:
>
> Still, there are advantages to an in-image solution, can't say this enough, these external lib dependencies pose their own problems ...

+1


Re: Non-greedy RegEx?

Richard O'Keefe
In reply to this post by Pierce Ng-3
PCRE has exponential worst-case time.  See for example
but searching for PCRE exponential time or worst case
will find more.  That's not the problem.  The problem
is that it isn't *obvious* which regexps are safe and
that people are taught that regular expressions can be
matched in linear time, which is sort of the point of
them.  But PCRE patterns *aren't* regular expressions.
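
To make the exponential-versus-linear point concrete, here is a toy sketch in Python. It is not PCRE or re2 code, and real engines carry optimizations this ignores; the function names and the token representation are invented for the illustration. It matches a classic pathological pattern, n optional 'a's followed by n required 'a's, against a string of n 'a's in two ways: by backtracking (one alternative at a time, retrying on failure) and by tracking the whole set of reachable pattern positions per input character.

```python
def build(n):
    """Tokens for the pattern a?{n}a{n}: n optional 'a's, then n required 'a's."""
    return [("opt", "a")] * n + [("lit", "a")] * n


def backtrack_match(tokens, text):
    """Try one alternative at a time, retrying on failure. Returns (matched, steps)."""
    steps = 0

    def go(ti, si):
        nonlocal steps
        steps += 1
        if ti == len(tokens):
            return si == len(text)
        kind, ch = tokens[ti]
        fits = si < len(text) and text[si] == ch
        if kind == "opt":
            # Greedy: consume first, and fall back to skipping if that fails.
            return (fits and go(ti + 1, si + 1)) or go(ti + 1, si)
        return fits and go(ti + 1, si + 1)

    return go(0, 0), steps


def state_set_match(tokens, text):
    """Track every reachable pattern position at once. Returns (matched, steps)."""

    def closure(states):
        # Also include positions reachable by skipping optional tokens.
        seen, stack = set(), list(states)
        while stack:
            ti = stack.pop()
            if ti not in seen:
                seen.add(ti)
                if ti < len(tokens) and tokens[ti][0] == "opt":
                    stack.append(ti + 1)
        return seen

    steps = 0
    current = closure({0})
    for ch in text:
        advanced = set()
        for ti in current:
            steps += 1
            if ti < len(tokens) and tokens[ti][1] == ch:
                advanced.add(ti + 1)
        current = closure(advanced)
    return len(tokens) in current, steps


for n in (4, 8, 12):
    tokens, text = build(n), "a" * n
    ok_bt, steps_bt = backtrack_match(tokens, text)
    ok_ss, steps_ss = state_set_match(tokens, text)
    print(n, ok_bt, steps_bt, ok_ss, steps_ss)
```

Both matchers agree on the answer, but the backtracking step count more than doubles each time n grows by one, while the state-set count is bounded by the text length times the pattern length. The state-set approach only works for true regular expressions; features such as backreferences fall outside it, which is one way to read the remark that PCRE patterns are not regular expressions.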



Re: Non-greedy RegEx?

Pierce Ng-3
In reply to this post by NorbertHartl
On Tue, Feb 05, 2019 at 07:25:07PM +0100, Norbert Hartl wrote:
> > On 05.02.2019 at 16:16, Sven Van Caekenberghe <[hidden email]> wrote:
> > Still, there are advantages to an in-image solution, can't say this
> > enough, these external lib dependencies pose their own problems ...
> +1

Libraries like libgit2, libssh2 and quite a few more are already a core
part of Pharo, so I'd say philosophically might as well go all in to
make FFI to external libraries an intrinsic part of computing with
Pharo, not just for developing Pharo itself. The more people use UFFI,
the better it will become.




Re: Non-greedy RegEx?

EstebanLM
If cre2 meets the standard and the community agrees, I do not care, to be honest. I just mentioned PCRE because it is the “de facto standard” and we are tired of not being compatible :)

About in-image solutions in general: yes, they are better than relying on a solution like that.
Also about in-image solutions in general: they are harder to maintain. We are a small community and we cannot afford to have everything we need implemented in-image. Using an FFI solution is a perfectly valid way, and IMO it is often preferable because of this maintainability issue.
Of course, this does not apply to all cases with the same intensity :)
I don’t know the cost of maintaining a regex lib in-image. I know that at the moment it is infinite, because there is no one doing it (and I know I do not have the time). So we stay for years with a suboptimal solution :(

Anyway, I want to point to some things that I (in my “architect” role, or whatever it is that I do here) always think about:

1- Is this solution state-of-the-art?
2- Is it maintained, or can the cost of maintaining it be absorbed?
        2.1- By whom is it maintained?
        2.2- Will this kick us back if the maintainers leave?
3- Is it well tested?
4- Is it well documented?

People often forget that making libraries is a commitment to your user community; it is a lot more than “I do this and then I forget”. Of course you can proceed like that (and not a few of my own projects are use-and-throw), but we as “pharo makers” need to take this into account with priority over other variables.

Notice that “performance” is not in this list. I care a lot about performance, but I care a lot more about the other points.

Now, that does not mean we do not make decisions we later regret (we are human, after all… and this is a learning process).

Cheers,
Esteban


Re: Non-greedy RegEx?

Manuel Leuenberger
In reply to this post by Pierce Ng-3
An in-image regex engine would always be preferable to me, but creating a fully fledged engine with all the fancy lookarounds, named/non-capturing groups, non-greedy matches, Unicode support, etc. sounds like a six-month full-time project. Any volunteers? ;)

I am all for a pragmatic approach: if there is a solution that allows reusing Smalltalk streams and strings without much memory copying, combined with a library dependency that is battle-proven and maintained by capable people, why not?

Funnily enough, PCRE already seems to be used in the RePlugin (https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/Cog/src/plugins/RePlugin/RePlugin.c#L356). Why is this not used in the image? Am I missing something?

Regarding third-party dependencies, my only beef is that I do not particularly like that a copy of each is distributed and linked together with the VM. This makes it very portable, but I do not need another copy of a library that I have already installed on my system (git/SSH/SSL/png/FreeType/Cairo). On Linux some libs (Cairo/FreeType/PNG) are already linked to system libs. I would like to have Pharo as an installable package in apt and MacPorts/brew with declared dependencies, which would make it even more lightweight. But that is only marginally related to this thread.

Re: Non-greedy RegEx?

Manuel Leuenberger
As an in-image solution, refactoring the SmaCC scanner into a standalone regex engine might be a pretty efficient way to gain more RE features, but I cannot really judge how big this effort would be.


Re: Non-greedy RegEx?

Sean P. DeNigris
Administrator
In reply to this post by Manuel Leuenberger
Manuel Leuenberger wrote
> An in-image regex engine would always be preferable to me, but... I am all
> for a pragmatic approach

I feel similarly: Ideally I would love everything in image, but given
limited manpower it seems wise to leverage outside libs for standard things
so that we can devote our focus to blue plane invention. Some fantasy future
moment when we have already conquered the world and have nothing to do would
be a perfect time to circle back and reimplement the delegated tasks in
image.



-----
Cheers,
Sean
--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
