
Out of ideas...


CdAB63

Out of ideas...

Can someone help me? I'm dealing with the following situation:

I have a string that may contain multiple matches of the regex
'[\s.;\:!?]*#\w+'. I want to replace the #\w+ part of each match with
nothing but keep the [\s.;\:!?]* part, and there seems to be no easy
way to do this with copyWithRegex:matchesTranslatedUsing: or
copyWithRegex:matchesReplacedWith:.

Does anyone know an easy way (meaning a single operation, not several) to do this?

Thanks in advance,

Casimiro Barreto

--
The information contained in this message is confidential and intended
to the recipients specified in the headers. If you received this message
by error, notify the sender immediately. The unauthorized use,
disclosure, copy or alteration of this message are strictly forbidden
and subjected to civil and criminal sanctions.


vonbecmann

Re: Out of ideas...

Hi,

Have you tried the help topic (World>>Help>>Help Browser>>Regular Expressions Framework>>Usage)?

SUBEXPRESSION MATCHES

After a successful match attempt, you can query the specifics of which
part of the original string has matched which part of the whole
expression.

A subexpression is a parenthesized part of a regular expression, or
the whole expression. When a regular expression is compiled, its
subexpressions are assigned indices starting from 1, depth-first,
left-to-right. For example, `((ab)+(c|d))?ef' includes the following
subexpressions with these indices:

1: ((ab)+(c|d))?ef
2: (ab)+(c|d)
3: ab
4: c|d

After a successful match, the matcher can report what part of the
original string matched what subexpression.

And there's an example:

This facility provides a convenient way of extracting parts of input
strings of complex format. For example, the following piece of code
uses the 'MMM DD, YYYY' date format recognizer example from the
`Syntax' section to convert a date to a three-element array with year,
month, and day strings (you can select and evaluate it right here):

| matcher |
matcher := RxMatcher forString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*(19|20)(:isDigit::isDigit:)'.
(matcher matches: 'Aug 6, 1996')
    ifTrue:
        [Array
            with: (matcher subexpression: 5)
            with: (matcher subexpression: 2)
            with: (matcher subexpression: 3)]
    ifFalse: ['no match']

(should answer ` #('96' 'Aug' '6')').

You could make two subexpressions:

'([\s.;\:!?]*)(#\w+)'

The first group is ([\s.;\:!?]*) and the second is (#\w+). Since index 1
is the whole expression, they get indices 2 and 3.

Then use the matcher, which understands these messages:

subexpressionCount
    Answers the total number of subexpressions: the highest value that
    can be used as a subexpression index with this matcher. This value
    is available immediately after initialization and never changes.

subexpression: anIndex
    An index must be a valid subexpression index, and this message
    must be sent only after a successful match attempt. The method
    answers a substring of the original string the corresponding
    subexpression has matched to.
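As a minimal sketch of querying the two groups after a single successful search: the sample text and the simplified delimiter set [ .;!?] are assumptions for illustration, not the exact regex from the question.

```smalltalk
| matcher parts |
"Index 1 is the whole match, 2 the delimiters, 3 the hashtag."
matcher := RxMatcher forString: '([ .;!?]*)(#\w+)'.
parts := (matcher search: 'impera #CRIME. De qualquer forma')
    ifTrue: [ Array
        with: (matcher subexpression: 2)
        with: (matcher subexpression: 3) ]
    ifFalse: [ #() ].
parts "expected: #(' ' '#CRIME')"
```

search: only reports the first occurrence, so handling every hashtag in a line still needs a loop or one of the copy messages.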

On Mon, Aug 8, 2016 at 7:10 PM, Casimiro - GMAIL wrote:

(...)

Bernardo E.C.

Sent from a cheap desktop computer in South America.

CdAB63

Re: Out of ideas...

On 08-08-2016 19:25, Bernardo Ezequiel Contreras wrote:

Hi,

Have you tried the help topic (World>>Help>>Help Browser>>Regular Expressions Framework>>Usage)?

SUBEXPRESSION MATCHES

After a successful match attempt, you can query the specifics of which
part of the original string has matched which part of the whole
expression.

(...)

Thanks, but the thing is, my need is a little more complex than finding sequences. I'm looking for expressions in natural-language text. They must be extracted without ambiguity, so I have separate cases for an occurrence at the beginning of a line (i.e. '^(#\w+)([\s.,;\:!?]*)'), in the middle of a line (i.e. '([\s.,;\:!?]+)(#\w+)([\s.,;\:!?]+)'), or at the end (which may be simplified to the second case...). So, if I find several hashtags in a text like:

'A política no Brasil está complicada #FAIL porque a corrupção impera #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país ao #CAOS'

I want two things:

1st and obvious: #( '#FAIL' '#CRIME' '#PETRALHAS' '#CAOS')
2nd: the line minus hashtags: 'A política no Brasil está complicada porque a corrupção impera. De qualquer forma os, que tudo justificam, levam o país ao'

When I use regexps to process the line, for instance:

bfr := line copyWithRegex: '#\w+' matchesTranslatedUsing: [ :e | '' ].

I can run into trouble, because it will also strip things like #ANOTAÇÃO#, which is not a hashtag but still matches.

And I'm trying to avoid doing the Lex/Yacc thing here :D
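One possible way around the #ANOTAÇÃO# problem without a full parser, sketched under the assumption that a trailing # is what marks a non-hashtag: let the regex optionally consume the closing # and decide in the translation block whether to keep the match. The sample line is made up for illustration.

```smalltalk
| line stripped |
line := 'see #NOTE# and #FAIL now'.
"'#\w+#?' also swallows a closing #, so the block can tell the two cases apart."
stripped := line
    copyWithRegex: '#\w+#?'
    matchesTranslatedUsing: [ :m |
        m last = $#
            ifTrue: [ m ]      "delimited on both sides: keep as is"
            ifFalse: [ '' ] ]. "plain hashtag: drop it"
stripped "expected: 'see #NOTE# and  now'"
```

The whitespace around a removed hashtag is left untouched, so each removal leaves a double space behind.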

Best regards,

CdAB


vonbecmann

Re: Out of ideas...

What about this?

| line matcher exps ranges negateRanges nextStart result |
line := 'A política no Brasil está complicada #FAIL porque a corrupção impera #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país ao #CAOS'.
matcher := '(#\w+)' asRegex.

"The hashtags themselves."
exps := matcher matchesIn: line.

"I believe you can extend the matcher to give the subexpressions
and ranges at the same time."
ranges := matcher matchingRangesIn: line.

"Collect the ranges between the matches."
negateRanges := OrderedCollection new.
nextStart := ranges inject: 1 into: [ :start :interval |
    negateRanges add: (Interval from: start to: interval first - 1).
    interval last + 1 ].

"Keep whatever follows the last match."
nextStart <= line size
    ifTrue: [ negateRanges add: (Interval from: nextStart to: line size) ].

"Glue the non-matching pieces back together."
result := negateRanges inject: String new into: [ :s :interval |
    s , (line copyFrom: interval first to: interval last) ].

Array
    with: exps
    with: ranges
    with: negateRanges
    with: result
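If only the list of hashtags and the stripped line are needed, a shorter sketch is to ask the string twice: once for the matches and once for a copy with the matches replaced. It uses the String convenience messages allRegexMatches: and copyWithRegex:matchesReplacedWith: from the same Regex package, on a shortened sample line chosen for illustration.

```smalltalk
| line tags stripped |
line := 'os #PETRALHAS levam o país ao #CAOS'.
"All hashtags, in order of appearance."
tags := line allRegexMatches: '#\w+'.
"The same line with every hashtag replaced by the empty string."
stripped := line copyWithRegex: '#\w+' matchesReplacedWith: ''.
Array with: tags with: stripped
```

This scans the line twice, whereas the version based on ranges only needs the ranges once, so which one fits better depends on how much the extra pass matters.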

On Tue, Aug 9, 2016 at 12:53 PM, Casimiro - GMAIL wrote:

(...)
