Regular Expressions

Regular Expressions

Edgar De Cleene
Folks:
I want to remove tags from HTML.
According to https://regex101.com/, http://www.freeformatter.com/regex-tester.html, and my old Nisus Writer Pro,

<.+?>

should be a valid expression.

But

| regex |
regex := RxMatcher forString: '<.+?>'.

gives me an error.

Any help?

Edgar
@morplenauta

Re: Regular Expressions

Hans-Martin Mosner
On 18.11.2016 at 14:39, Edgar De Cleene wrote:
> Folks:
> I want to remove tags from HTML.
> According to https://regex101.com/, http://www.freeformatter.com/regex-tester.html, and my old Nisus Writer Pro,
>
> <.+?>
>
> should be a valid expression.
>
> But
>
> | regex |
> regex := RxMatcher forString: '<.+?>'.
>
> gives me an error.
>
> Any help?
>
> Edgar
> @morplenauta

I was going to write this:

The "+" already means "match one or more of the previous", where "previous" in this case is ".", which means "any character".

The "?" means "match zero or one of the previous", but it cannot be combined with "+".

But then I realized that "+?" is defined in regex syntax as "lazy" matching, i.e. it consumes as few of the previous tokens as needed to make the pattern match (in contrast, standard "+" matches greedily, so it consumes as much as possible while still matching the pattern).

However, the Rx framework in Squeak is quite old and does not have these extensions. A pattern that should work would be "<[^>]+>" which matches an opening angle bracket, any characters that are not closing angle brackets, and finally the closing bracket.
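To make the difference between the three forms concrete, here is a minimal sketch. It uses Python's re module as a stand-in, because the lazy form cannot be run in Squeak's Rx; the sample string is made up for illustration, and other regex engines may differ in details.

```python
import re

# Hypothetical sample input, not taken from the original question.
html = '<a href="x">foo</a> and <b>bar</b>'

# Greedy: ".+" runs to the LAST ">" on the line, so everything is removed.
greedy = re.sub(r'<.+>', '', html)

# Lazy: ".+?" stops at the FIRST ">" that lets the pattern match.
lazy = re.sub(r'<.+?>', '', html)

# Negated class: no lazy quantifier needed, "[^>]+" cannot run past ">".
negated = re.sub(r'<[^>]+>', '', html)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

On this input the lazy and negated-class forms give the same result, which is why the negated class is a reasonable substitute when lazy quantifiers are unavailable.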

Be aware though that correctly stripping tags from HTML is not possible (or at least not trivial) with regex. For example, in your pattern, the "." would not match newlines, but tags can extend over multiple lines, so you would not be able to strip out a multiline tag. My pattern apparently works with newlines, too, but there are other cases that it does not handle (for example, see http://stackoverflow.com/questions/94528/is-u003e-greater-than-sign-allowed-inside-an-html-element-attribute-value).
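Both limitations can be reproduced in a few lines. This is a sketch in Python's re module, used here only as a stand-in engine; the two input strings are invented for illustration, and whether "." matches a newline in a given Smalltalk regex package is something to check there rather than infer from this.

```python
import re

# A tag that spans two lines (hypothetical input).
multiline = '<a\n href="x">link</a>'

# In Python, "." does not match "\n" by default, so the opening tag survives.
lazy_dot = re.sub(r'<.+?>', '', multiline)

# "[^>]" does match "\n", so the multi-line tag is removed.
negated = re.sub(r'<[^>]+>', '', multiline)

# A ">" inside a quoted attribute value ends the match too early.
tricky = '<img alt="a > b" src="i.png">text'
broken = re.sub(r'<[^>]+>', '', tricky)

print(repr(lazy_dot))
print(repr(negated))
print(repr(broken))
```

The last case leaves the tail of the tag in the output, which is the kind of input the linked Stack Overflow question is about.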

So unless you know that your input is going to be fairly regular, don't rely on regex to strip tags. Use a proper HTML/SGML/XML parser; they are designed to do it right.
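As an illustration of the parser route, here is a minimal sketch using Python's standard html.parser module. The class name TextExtractor and the function strip_tags are made up for this example, and a Smalltalk image would need its own HTML parser package instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of character data outside of tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_tags('<img alt="a > b" src="i.png">text <b>bold</b>'))
print(strip_tags('<a\n href="x">link</a>'))
```

Note that this sketch also returns the contents of script and style elements as text; filtering those out would need extra handling in handle_starttag and handle_endtag.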

Cheers,

Hans-Martin



Re: Regular Expressions

Edgar De Cleene
Many thanks.
I use VB-Regex-damienpollet.17.mcz.
By the way, I use Nisus Writer Pro when I need to learn regex.
Its expression for stripping HTML tags is '(?:</?[0-9A-Za-z]+.*?>)', which of course is not possible in Squeak.



Re: Regular Expressions

Levente Uzonyi
On Fri, 18 Nov 2016, Edgar De Cleene wrote:

> Many thanks. I use VB-Regex-damienpollet.17.mcz.
> By the way, I use Nisus Writer Pro when I need to learn regex.
> Its expression for stripping HTML tags is '(?:</?[0-9A-Za-z]+.*?>)', which of course is not possible in Squeak.

That regular expression has an equivalent form in Squeak:

regex := '</?[0-9A-Za-z]+[^>]*>' asRegex.
html := '<a>foo<b>bar</b></a>'.
regex copy: html replacingMatchesWith: ''

But it is not possible to filter out tags, and only tags, from HTML with
regular expressions.

*?, +? and (?:) are not supported by VBRegex (nor the forked Regex
package), but I think it wouldn't be too hard to implement them.
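For readers who want to check the equivalence outside Squeak, here is a small sketch in Python's re module (used as a stand-in because it accepts both the lazy form and the rewritten form). The second input string is invented to show one way this pattern differs from the simpler '<[^>]+>'.

```python
import re

html = '<a>foo<b>bar</b></a>'

# Original form with a non-capturing group and a lazy quantifier.
lazy_form = re.sub(r'(?:</?[0-9A-Za-z]+.*?>)', '', html)

# Rewritten form that avoids both extensions.
portable_form = re.sub(r'</?[0-9A-Za-z]+[^>]*>', '', html)

# Requiring a tag name right after "<" leaves plain comparisons alone,
# whereas the simpler '<[^>]+>' would eat part of the text.
prose = 'x < y and y > z'
kept = re.sub(r'</?[0-9A-Za-z]+[^>]*>', '', prose)
eaten = re.sub(r'<[^>]+>', '', prose)

print(repr(lazy_form))
print(repr(portable_form))
print(repr(kept))
print(repr(eaten))
```

Both forms agree on the sample input, but neither handles a ">" inside a quoted attribute value.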


Levente

Re: Regular Expressions

Edgar De Cleene
Many thanks for the tip.
Would it be legal to copy the look and feel of the Nisus Writer Pro Find and Replace window and menus?
An enhanced and useful regex package?

Re: [Cuis] Regular Expressions

Phil B
In reply to this post by Edgar De Cleene

I've got a port of Vassili's version at github.com/Cuis-Ports. I've been using it pretty extensively for years without issues.


On Nov 22, 2016 11:31 AM, "KenD" <[hidden email]> wrote:
On Fri, 18 Nov 2016 10:39:52 -0300
Edgar De Cleene <[hidden email]> wrote:

> Folks:
> I want to remove tags from HTML.
> According to https://regex101.com/, http://www.freeformatter.com/regex-tester.html, and my old Nisus Writer Pro,
>
> <.+?>
>
> should be a valid expression.
>
> But
>
> | regex |
> regex := RxMatcher forString: '<.+?>'.
>
> gives me an error.
>
> Any help?

Edgar,

Sorry for the delay.  Busy lives..


I found https://github.com/garduino/Cuis-Smalltalk-RegEx, but this code is very old and, when simply loaded, fails most test cases (see notes below).

The Cuis-Smalltalk-RegEx code definitely signals an error for the example you cite, even though the
  RegEx-Core RxParser DOCUMENTATION c:syntax:
description would appear to support this use case.


What source of RegEx are you using?  Is there a Cuis Package available?


Note that I have zero experience with this package but am happy to help out as I get time.

Best wishes,
-KenD

=========================================================
https://github.com/garduino/Cuis-Smalltalk-RegEx notes:

1) Needs requirements (tests should require core, which should require Squeak compatibility).

2) Line endings should be newLines.

3) Needs added compatibility methods, e.g.:

'
Array>>contains: other

        ^ self includes: other
'

'
Character>>sameAs: otherChar
        "Case independent compare"

        (self class) = (otherChar class) ifFalse: [ ^ false ].

        ^(self asLowercase) = (otherChar asLowercase)
'
=========================================================


_______________________________________________
Cuis mailing list
[hidden email]
http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org


Re: [Cuis] Regular Expressions

Levente Uzonyi
On Fri, 25 Nov 2016, Phil B wrote:

>
> I've got a port of Vassili's version at github.com/Cuis-Ports. I've been using it pretty extensively for years without issues.

I hate to advertise stuff, but IMHO the version in Squeak is far superior
to the version you ported.

Levente
