If someone can help me... I'm dealing with the following situation:
My string may contain several matches of the regex '[\s.;\:!?]*#\w+'. I want to replace the #\w+ part of each match with nothing while keeping the [\s.;\:!?]* part, but there seems to be no easy way to do that with copyWithRegex:matchesTranslatedUsing: or copyWithRegex:matchesReplacedWith:. Does anyone know an easy way (meaning a single operation rather than several) to do this?

Thanks in advance,
Casimiro Barreto
Hi,
Have you tried subexpressions? They are described under World > Help > Help Browser > Regular Expressions Framework > Usage:

SUBEXPRESSION MATCHES

After a successful match attempt, you can query the specifics of which part of the original string has matched which part of the whole expression. A subexpression is a parenthesized part of a regular expression, or the whole expression. When a regular expression is compiled, its subexpressions are assigned indices starting from 1, depth-first, left-to-right. For example, `((ab)+(c|d))?ef' includes the following subexpressions with these indices:

    1: ((ab)+(c|d))?ef
    2: (ab)+(c|d)
    3: ab
    4: c|d

After a successful match, the matcher can report what part of the original string matched what subexpression.

And there's an example:

This facility provides a convenient way of extracting parts of input strings of complex format. For example, the following piece of code uses the 'MMM DD, YYYY' date format recognizer example from the `Syntax' section to convert a date to a three-element array with year, month, and day strings (you can select and evaluate it right here):

    | matcher |
    matcher := RxMatcher forString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*(19|20)(:isDigit::isDigit:)'.
    (matcher matches: 'Aug 6, 1996')
        ifTrue: [Array
            with: (matcher subexpression: 5)
            with: (matcher subexpression: 2)
            with: (matcher subexpression: 3)]
        ifFalse: ['no match']

(should answer ` #('96' 'Aug' '6')').

You could make two subexpressions: '([\s.;\:!?]*)(#\w+)'. The first parenthesized group is ([\s.;\:!?]*) and the second is (#\w+). Going by the numbering above, they get indices 2 and 3, because index 1 is the whole expression. Then use the matcher, which understands these messages:

subexpressionCount
    Answers the total number of subexpressions: the highest value that can be used as a subexpression index with this matcher. This value is available immediately after initialization and never changes.

subexpression: anIndex
    An index must be a valid subexpression index, and this message must be sent only after a successful match attempt. The method answers a substring of the original string the corresponding subexpression has matched to.

Bernardo E.C.
Sent from a cheap desktop computer in South America.
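Applied to the pattern from the question, the subexpression idea might look like the sketch below. It assumes the pattern parses as written in Pharo's regex package. The sample string is made up for illustration. The second half is a one-step variant that is not from the thread: since the prefix characters can never be `#`, everything before the first `#` in a match is the part to keep.

```smalltalk
| matcher sample stripped |
sample := 'Good idea. #YES really'.   "hypothetical input"

"Group the punctuation prefix and the hashtag separately."
matcher := RxMatcher forString: '([\s.;\:!?]*)(#\w+)'.
(matcher search: sample) ifTrue: [
	"Index 1 is the whole match, 2 the prefix, 3 the hashtag."
	Transcript
		show: (matcher subexpression: 2) printString; cr;
		show: (matcher subexpression: 3) printString; cr ].

"One-step variant: keep only what precedes the # in each match."
stripped := sample
	copyWithRegex: '[\s.;\:!?]*#\w+'
	matchesTranslatedUsing: [ :m | m copyFrom: 1 to: (m indexOf: $#) - 1 ].
```

The block receives each whole match, prefix included, so returning the prefix is what keeps the punctuation and whitespace in place.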
On 08-08-2016 19:25, Bernardo Ezequiel Contreras wrote:
Thanks, but the thing is: my need is a little more complex than finding sequences. I'm looking for expressions in natural-language text. The expressions must be extracted without ambiguity, so I have cases for occurrences at the beginning of the line (aka '^(#\w+)([\s.,;\:!?]*)'), in the middle of the line (aka '([\s.,;\:!?]+)(#\w+)([\s.,;\:!?]+)'), or at the end (which may be simplified to the second case...).

So, if I find several hashtags in a text like:

    'A política no Brasil está complicada #FAIL porque a corrupção impera #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país ao #CAOS'

I want two things:

1st, and obvious:

    #( '#FAIL' '#CRIME' '#PETRALHAS' '#CAOS')

2nd, the line minus hashtags:

    'A política no Brasil está complicada porque a corrupção impera. De qualquer forma os, que tudo justificam, levam o país ao'

When I use regexps to process the line, for instance:

    bfr := line copyWithRegex: '#\w+' matchesTranslatedUsing: [ :e | '' ].

I can have trouble, because it will extract things like #ANOTAÇÃO#, which is not a hashtag but will match. And I'm trying to avoid doing the Lex/Yacc thing here :D

Best regards,
CdAB
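One way to keep tokens such as #ANOTAÇÃO# out of the result without a full lexer might be to let the pattern also consume an optional trailing `#` and then decide per match. This is only a sketch of that idea, not something proposed in the thread. The sample line is hypothetical and kept to ASCII. It covers only the trailing-`#` case, not a `#` glued to the end of a preceding word.

```smalltalk
| line candidates tags stripped |
line := 'See #NOTE# before the #FAIL today'.   "hypothetical input"

"The optional trailing # makes #NOTE# arrive as a single match."
candidates := line allRegexMatches: '#\w+#?'.

"Real hashtags are the candidates that do not end in #."
tags := candidates reject: [ :m | m last = $# ].

"Drop real hashtags, but put #...# tokens back unchanged."
stripped := line
	copyWithRegex: '#\w+#?'
	matchesTranslatedUsing: [ :m |
		m last = $# ifTrue: [ m ] ifFalse: [ '' ] ].
```

Because the decision is made in the block, the pattern itself stays simple, which suits a regex engine without lookahead.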
What about this:

    | line matcher exps ranges negateRanges tailStart result |
    line := 'A política no Brasil está complicada #FAIL porque a corrupção impera #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país ao #CAOS'.
    matcher := '(#\w+)' asRegex.
    exps := matcher matchesIn: line.
    "I believe you can extend the matcher to give the subexpressions and ranges at the same time"
    ranges := matcher matchingRangesIn: line.
    "Collect the stretches of text between the matches"
    negateRanges := OrderedCollection new.
    tailStart := ranges inject: 1 into: [ :start :interval |
        negateRanges add: (Interval from: start to: interval first - 1).
        interval last + 1 ].
    "Keep any text that follows the last match"
    tailStart <= line size
        ifTrue: [ negateRanges add: (Interval from: tailStart to: line size) ].
    result := negateRanges inject: String new into: [ :s :interval |
        s , (line copyFrom: interval first to: interval last) ].
    Array with: exps with: ranges with: negateRanges with: result
Bernardo E.C.
Sent from a cheap desktop computer in South America.
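If both results are wanted in a single pass, the translating block can record each match as a side effect before dropping it. A minimal sketch, assuming the plain '#\w+' pattern from the thread (so it does not filter out cases like #ANOTAÇÃO#); the variable names are made up:

```smalltalk
| line tags stripped |
line := 'A política no Brasil está complicada #FAIL porque a corrupção impera #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país ao #CAOS'.
tags := OrderedCollection new.

"Each match is remembered, then replaced by the empty string."
stripped := line
	copyWithRegex: '#\w+'
	matchesTranslatedUsing: [ :m | tags add: m. '' ].

Array with: tags asArray with: stripped
```

The whitespace that surrounded each hashtag stays in the stripped line, so a follow-up cleanup of doubled spaces may still be needed.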