Regular expression to match URIs


Regular expression to match URIs

CdAB63
If anyone needs to parse URIs, the following regex handles it:

    | regStr regex |

    regStr :=     

        '((([a-z]\w+\:)',                     "Match URL protocol and colon"
        '(/|//|///|[A-Za-z0-9%]))',          "Match 1-3 slashes or single letter or digit or %"
        '|',                                 "or"
        '((www|(www\d)|(www\d\d)|(www\d\d\d))[.])',    "match www, or www plus 1-3 digits, followed by ."
        '|',                                   "or"
        '([A-Za-z0-9._\-]+[.]([a-z][a-z]|[a-z][a-z][a-z]|[a-z][a-z][a-z][a-z])/))', "domain name"   
        '(([^\s()<>]+',                      "run of non-space, non ()<>"
        '|',                                 "or"
        '\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)', "balanced parens up to two levels"
        '(\(([^\s()<>]+|(\([^\s()<>]+\)))*\)',   "end with balanced parens up to 2 levels"
        '|',                                 "or"
        '[^\s`!()[]{};:''".,<>?«»“”‘’])'.    "not a space or one of these punct chars"
                   
    regex := RxMatcher forString: regStr.
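
Once a matcher is built, it can be used along the following lines. This is only a minimal sketch: the pattern below is a deliberately simplified stand-in (an assumption made to keep the snippet short, not the expression above), and the sample text and variable names are hypothetical. Substitute regStr to try the full expression.

```smalltalk
| sample matcher found |

"Simplified stand-in pattern, for illustration only.
 The colon is escaped, as in the expression above."
matcher := RxMatcher forString: 'https?\://[^[:space:]]+'.

sample := 'Docs live at http://pharo.org/documentation and https://example.org/page today.'.

"Collect every substring of the text that matches the pattern."
found := matcher matchesIn: sample.
found do: [ :each | Transcript show: each; cr ].

"matches: answers true only if the WHOLE string matches."
(matcher matches: 'http://pharo.org')
    ifTrue: [ Transcript show: 'whole string is a match'; cr ].

"search: looks for the first match anywhere in the string;
 subexpression: 1 then answers the text of that match."
(matcher search: sample)
    ifTrue: [ Transcript show: (matcher subexpression: 1); cr ].
```

matchesIn: is usually the one to reach for when pulling links out of free text, while matches: suits validating a single candidate string.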

--
The information contained in this message is confidential and intended to the recipients specified in the headers. If you received this message by error, notify the sender immediately. The unauthorized use, disclosure, copy or alteration of this message are strictly forbidden and subjected to civil and criminal sanctions.

==

This email may be signed using PGP key ID: 0x4134A417


Re: Regular expression to match URIs

stepharo

thanks

We should add that as an example in the package :)

Stef


On 29/7/16 at 16:13, Casimiro - GMAIL wrote:
> If someone is in need to parse URIs, then the following regex handle it: [...]