Out of ideas...


CdAB63
I'm hoping someone can help me with the following situation:

I have a string that may contain multiple matches of the regex
'[\s.;\:!?]*#\w+'. I want to replace the #\w+ part of each match with
nothing while keeping the [\s.;\:!?]* part, but there seems to be no
easy way to do this using copyWithRegex:matchesTranslatedUsing: or
copyWithRegex:matchesReplacedWith:.

Does anyone know an easy way (meaning a single operation rather than several) to do this?


Thanks in advance,


Casimiro Barreto


--
The information contained in this message is confidential and intended
to the recipients specified in the headers. If you received this message
by error, notify the sender immediately. The unauthorized use,
disclosure, copy or alteration of this message are strictly forbidden
and subjected to civil and criminal sanctions.

==


---
This email has been scanned by Avast antivirus.
https://www.avast.com/antivirus


Re: Out of ideas...

vonbecmann
Hi,

Have you tried the built-in help (World>>Help>>Help Browser>>Regular Expressions Framework>>Usage)? It has this section:

SUBEXPRESSION MATCHES

After a successful match attempt, you can query the specifics of which
part of the original string has matched which part of the whole
expression.

A subexpression is a parenthesized part of a regular expression, or
the whole expression. When a regular expression is compiled, its
subexpressions are assigned indices starting from 1, depth-first,
left-to-right. For example, `((ab)+(c|d))?ef' includes the following
subexpressions with these indices:

1: ((ab)+(c|d))?ef
2: (ab)+(c|d)
3: ab
4: c|d

After a successful match, the matcher can report what part of the
original string matched what subexpression.


And there's an example:

This facility provides a convenient way of extracting parts of input
strings of complex format. For example, the following piece of code
uses the 'MMM DD, YYYY' date format recognizer example from the
`Syntax' section to convert a date to a three-element array with year,
month, and day strings (you can select and evaluate it right here):

| matcher |
matcher := RxMatcher forString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*(19|20)(:isDigit::isDigit:)'.
(matcher matches: 'Aug 6, 1996')
    ifTrue: [
        Array
            with: (matcher subexpression: 5)
            with: (matcher subexpression: 2)
            with: (matcher subexpression: 3) ]
    ifFalse: [ 'no match' ]

(should answer ` #('96' 'Aug' '6')').


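You could make two subexpressions:

 '([\s.;\:!?]*)(#\w+)'

The first parenthesized subexpression is ([\s.;\:!?]*) and the second
is (#\w+). Since index 1 is the whole expression, they get indices 2
and 3.

You can then query them through the matcher, which understands these
messages: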

subexpressionCount

Answers the total number of subexpressions: the highest value that
can be used as a subexpression index with this matcher. This value
is available immediately after initialization and never changes.

subexpression: anIndex

An index must be a valid subexpression index, and this message
must be sent only after a successful match attempt. The method
answers a substring of the original string the corresponding
subexpression has matched to.
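
As a minimal sketch of how these messages could be applied to the hashtag problem, the snippet below searches a short sample for one hashtag and reads the subexpressions back. The sample text and the simplified pattern '(\s*)(#\w+)' are assumptions made for illustration, not the exact pattern from the question. Index 1 is the whole match, index 2 the leading whitespace, and index 3 the hashtag.

```smalltalk
| matcher parts |
"Simplified, assumed pattern: leading whitespace, then the hashtag."
matcher := '(\s*)(#\w+)' asRegex.
"search: looks for the first match anywhere in the string."
parts := (matcher search: 'impera #CRIME. De')
    ifTrue: [
        Array
            with: (matcher subexpression: 1)
            with: (matcher subexpression: 2)
            with: (matcher subexpression: 3) ]
    ifFalse: [ 'no match' ].
parts
```

Note that search: only finds the first occurrence, so handling several hashtags this way would still need a loop or one of the collection-returning messages.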






On Mon, Aug 8, 2016 at 7:10 PM, Casimiro - GMAIL <[hidden email]> wrote:
(...)

--
Bernardo E.C.

Sent from a cheap desktop computer in South America.
Re: Out of ideas...

CdAB63
On 08-08-2016 19:25, Bernardo Ezequiel Contreras wrote:
Hi,

  have you try with  (World>>Help>>Help Browser>>Regular Expressions Framework>>Usage)

SUBEXPRESSION MATCHES

After a successful match attempt, you can query the specifics of which
part of the original string has matched which part of the whole
expression.

(...)
Thanks, but the thing is, my need is a little more complex than finding sequences. I'm looking for expressions in natural-language text, and they must be extracted without ambiguity. So I have separate cases for occurrences at the beginning of a line (i.e. '^(#\w+)([\s.,;\:!?]*)'), in the middle of a line (i.e. '([\s.,;\:!?]+)(#\w+)([\s.,;\:!?]+)'), and at the end (which may be reduced to the second case...). So, if I find several hashtags in a text like:

'A política no Brasil está complicada #FAIL porque a corrupção impera #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país ao #CAOS'

I want two things:

1st and obvious: #( '#FAIL' '#CRIME' '#PETRALHAS' '#CAOS')
2nd: the line minus hashtags: 'A política no Brasil está complicada porque a corrupção impera. De qualquer forma os, que tudo justificam, levam o país ao'

When I use regexps to process the line, for instance:

bfr := line copyWithRegex: '#\w+' matchesTranslatedUsing: [ :e | '' ].

I can run into trouble because it will also extract things like #ANOTAÇÃO#, which is not a hashtag but still matches.

And I'm trying to avoid doing the Lex/Yacc thing here :D
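
For reference, the two results wanted above can be sketched with two String regex messages, assuming a shortened ASCII excerpt of the sample line. The pattern '\s*#\w+' is an assumption chosen so that the whitespace before each hashtag is dropped while the punctuation after it is kept. This sketch does not deal with the #ANOTAÇÃO# false positive.

```smalltalk
| line tags stripped |
"Shortened excerpt of the sample line (assumed for illustration)."
line := 'impera #CRIME. De qualquer forma os #PETRALHAS, que tudo'.
"1st: collect every hashtag."
tags := line allRegexMatches: '#\w+'.
"2nd: remove each hashtag together with the whitespace in front of it."
stripped := line copyWithRegex: '\s*#\w+' matchesReplacedWith: ''.
Array with: tags with: stripped
```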

Best regards,

CdAB

This email may be signed using PGP key ID: 0x4134A417


Re: Out of ideas...

vonbecmann
What about this?

| line matcher exps ranges negateRanges tailStart result |
line :=
'A política no Brasil está complicada #FAIL porque a corrupção impera #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país ao #CAOS'.
matcher := '(#\w+)' asRegex.
"The hashtags themselves."
exps := matcher matchesIn: line.
"I believe you can extend the matcher to give the subexpressions
and ranges at the same time."
ranges := matcher matchingRangesIn: line.
"Collect the intervals between the matches, i.e. the text to keep."
negateRanges := OrderedCollection new.
tailStart := ranges inject: 1 into: [ :start :interval |
    negateRanges add: (Interval from: start to: interval first - 1).
    interval last + 1 ].
"Also keep whatever follows the last match."
negateRanges add: (Interval from: tailStart to: line size).
"Concatenate the kept pieces."
result := negateRanges inject: String new into: [ :s :interval |
    s , (line copyFrom: interval first to: interval last) ].

Array
    with: exps
    with: ranges
    with: negateRanges
    with: result
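
One possible way to skip things like #ANOTAÇÃO# with the same ranges is to reject any match that is immediately followed by another #. This is only a sketch under assumptions: the sample line and the marker #NOTE# are made up for illustration, and the rule "a hashtag is never directly followed by #" is a guess at what counts as ambiguous.

```smalltalk
| line matcher ranges tags |
"Hypothetical sample: #NOTE# is a marker, #FAIL is a real hashtag."
line := 'a #NOTE# marker and a real #FAIL, done'.
matcher := '#\w+' asRegex.
"Drop every range whose next character is another #."
ranges := (matcher matchingRangesIn: line) reject: [ :range |
    range last < line size and: [ (line at: range last + 1) = $# ] ].
"Turn the remaining ranges back into strings."
tags := ranges collect: [ :range | line copyFrom: range first to: range last ].
tags
```

The surviving ranges could then be fed into the same inject:into: step used above to build the line without hashtags.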


On Tue, Aug 9, 2016 at 12:53 PM, Casimiro - GMAIL <[hidden email]> wrote:
(...)

--
Bernardo E.C.

Sent from a cheap desktop computer in South America.