Parsing text to discover general data of interest (phone, email, address, ...)

cedreek
Hi all,

I often need to analyse arbitrary unstructured text (emails, for instance) to discover and extract structured information such as:
- emails
- telephone numbers
- addresses
- events
- person names (according to a list of known persons),
- etc…

Apple does this in email, for instance (strangely, it is not generalized).


So my questions are:
- Do we have something equivalent in Smalltalk/Pharo? (I didn’t find anything.)
- If not, what strategy would you use?
=> So far I do really naive text analysis (substrings, looking for @, …, and parsing according to the text structure when there is one, a kind of Soup parsing).
=> I feel this is a job for PetitParser? It would also be a nice fit for the new GToolkit.

All ideas or suggestions are welcome ;-)


TIA,

Cédrick



Re: Parsing text to discover general data of interest (phone, email, address, ...)

Pharo Smalltalk Users mailing list
I couldn't find anything in Smalltalk, but this should give you ideas,
inspire you, or get you started...

https://github.com/search?q=contact+scraping&type=Repositories

I guess we have all that's needed in Pharo : parsers (HTML, XML,
PetitParser), Soup & regex !

On 2019-03-07 04:52, Cédrick Béler wrote:

> [...]
--
-----------------
Benoît St-Jean
Yahoo! Messenger: bstjean
Twitter: @BenLeChialeux
Pinterest: benoitstjean
Instagram: Chef_Benito
IRC: lamneth
Blogue: endormitoire.wordpress.com
"A standpoint is an intellectual horizon of radius zero".  (A. Einstein)


Re: Parsing text to discover general data of interest (phone, email, address, ...)

Peter Kenny
In reply to this post by cedreek
Cédrick

In principle, what you are asking for is to identify 'islands' of structured information in a 'sea' of otherwise unstructured material, which is now a standard pattern in PetitParser. You could imagine a parser spec of the form:

(sea optional, (email/phone/address/....), sea optional) plus

Where email etc are parsers for the individual structures. As a parser this would probably lead to lots of backtracking and be hideously inefficient, but for a short text like an e-mail it could be usable. This also assumes that the items of interest are really structured; there could be many ways of writing phone numbers, for instance.
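
For a short text, the same "islands in a sea" idea can be approximated without a parser library: scan with one alternation of named patterns and treat everything that does not match as sea. The following is only a rough sketch in Python (not PetitParser); the two patterns and the name `find_islands` are illustrative assumptions and are deliberately crude.

```python
import re

# Illustrative, deliberately loose patterns: each named group is one kind of "island".
ISLANDS = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<phone>\+?\d[\d .-]{6,}\d)"
)

def find_islands(text):
    """Return (kind, matched_text) pairs; everything unmatched is 'sea'."""
    return [(m.lastgroup, m.group()) for m in ISLANDS.finditer(text)]

sample = "Call me on +33 6 12 34 56 78 or write to jane.doe@example.org before Friday."
print(find_islands(sample))
# [('phone', '+33 6 12 34 56 78'), ('email', 'jane.doe@example.org')]
```

Adding another kind of island means adding another named alternative; the order of alternatives decides which pattern wins when two could match at the same position.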

HTH

Peter Kenny

Re: Parsing text to discover general data of interest (phone, email, address, ...)

cedreek
> In principle, what you are asking for is to identify 'islands' of structured information in a 'sea' of otherwise unstructured material, which is now a standard pattern in PetitParser.

Exactly :)


> You could imagine a parser spec of the form:
>
> (sea optional, (email/phone/address/....), sea optional) plus
>
> Where email etc are parsers for the individual structures. As a parser this would probably lead to lots of backtracking and be hideously inefficient, but for a short text like an e-mail it could be usable.

Yes, this is only for short texts like an email or, say, a text selection plus a shortcut.


> This also assumes that the items of interest are really structured; there could be many ways of writing phone numbers, for instance.

Phone numbers are actually not easy… I see them as a bounded sequence of digits (if not well structured), possibly preceded by a +countrycode.
I’d actually like fuzzy structuring, but would be perfectly OK with an initial crisp one.
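
A first crisp rule along those lines could be: an optional leading + followed by a run of digits and common separators, accepted only when the digit count is plausible. This is just a sketch in Python; the 7 to 15 digit bounds are an assumption (15 being the E.164 maximum), and `normalize_phone` and `find_phones` are hypothetical helper names.

```python
import re

# Candidate: optional '+', then digits mixed with spaces, dots, dashes or parentheses.
CANDIDATE = re.compile(r"\+?\d[\d ().-]*\d")

def normalize_phone(candidate, min_digits=7, max_digits=15):
    """Strip separators; return '+digits' or 'digits', or None if implausible."""
    digits = re.sub(r"\D", "", candidate)
    if not (min_digits <= len(digits) <= max_digits):
        return None
    return ("+" if candidate.strip().startswith("+") else "") + digits

def find_phones(text):
    """Return normalized phone numbers, dropping digit runs that are too short or long."""
    found = (normalize_phone(m.group()) for m in CANDIDATE.finditer(text))
    return [p for p in found if p is not None]

print(find_phones("Office: +33 5 61 55 00 00, room 204, fax 05.61.55.00.01"))
# ['+33561550000', '0561550001']
```

The room number is rejected only because it is too short; dates, order numbers and the like would still need extra rules, which is where the fuzzy part would come in.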

I find this a nice pet project for diving into PetitParser. When you say "unstructured material ... is now a standard pattern in PetitParser", how could I begin exploring that? Any tutorials?


Thanks Peter,

Cédrick

Re: Parsing text to discover general data of interest (phone, email, address, ...)

cedreek
>
>> This also assumes that the items of interest are really structured; there could be many ways of writing phone numbers, for instance.
>
> Phone numbers are actually not easy… I see them as a bounded sequence of digits (if not well structured), possibly preceded by a +countrycode.
> I’d actually like fuzzy structuring, but would be perfectly OK with an initial crisp one.
>

Yes, and events are even worse (indeed, Apple quite often fails to detect them correctly). An event is a composition of date, time, place, …

But again, I’ll start slowly with emails (the easy one) and person names (possibly with a fuzzy search to tolerate typos, e.g. a Levenshtein-like distance)...
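
For the person-name part, a minimal sketch of matching a word against a list of known persons with Levenshtein distance might look like this in Python. The list of names, the threshold of 2 edits and the helper name `closest_known_name` are assumptions made for illustration only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        previous = current
    return previous[-1]

def closest_known_name(word, known_names, max_distance=2):
    """Return the known name nearest to `word`, or None if none is close enough."""
    best = min(known_names, key=lambda name: levenshtein(word.lower(), name.lower()))
    return best if levenshtein(word.lower(), best.lower()) <= max_distance else None

known = ["Cédrick", "Hernán", "Peter", "Richard"]  # hypothetical list of known persons
print(closest_known_name("Cedrick", known))  # Cédrick
print(closest_known_name("Parser", known))   # None
```

A fixed threshold is crude for short names; scaling it with the length of the name would probably behave better.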

Cheers,

Cédrick


Re: Parsing text to discover general data of interest (phone, email, address, ...)

cedreek
In reply to this post by cedreek



> When you say "unstructured material ... is now a standard pattern in PetitParser", how could I begin exploring that? Any tutorials?


I’ll load it and play around. 
Re: Parsing text to discover general data of interest (phone, email, address, ...)

Peter Kenny

Cedrick

Sorry for not answering sooner. You have probably found what you want. I meant that ‘island + sea’ is now a standard pattern. Follow the link to the tutorial from the readme you referenced – the example of JavaScript in HTML might carry over well enough to your situation.

Peter

Re: Parsing text to discover general data of interest (phone, email, address, ...)

cedreek


> Cedrick
>
> Sorry for not answering sooner.

Don’t worry, it is because I hadn’t searched properly before answering.

> You have probably found what you want.

Not yet. Loading.

> I meant that ‘island + sea’ is now a standard pattern. Follow the link to the tutorial from the readme you referenced – the example of javascript in HTML might carry over well enough to your situation.

Perfect,

Thanks again,

Cédrick

 
Re: Parsing text to discover general data of interest (phone, email, address, ...)

hernanmd
In reply to this post by cedreek
Hi Cédrick,

I wrote some years ago an interface to a named-entity recognizer:
https://80738163270632.blogspot.com/2015/02/stner-interface-to-stanford-named.html

I think that was Pharo 5, so you may want to check if there are load
problems in current Pharo.

The Blogger post didn't render the output correctly, but for the input:

StSocketNERClient new
    tagText: 'Argentina President Kirchner has been asked to testify in court on the death of Alberto Nisman the crusading prosecutor who had accused her of conspiring to cover up involvement of Iran'


output would be:

'<location>Argentina</LOCATION> President <person>Kirchner</PERSON>
has been asked to testify in court on the death of <person>Alberto
Nisman</PERSON> the crusading prosecutor who had accused her of
conspiring to cover up involvement of <location>Iran</LOCATION>'
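
Once the tagged string is available, turning it into (type, text) pairs is a small post-processing step. Here is a hedged sketch in Python; it assumes the inline tag format shown above (including the mixed-case tags, hence the case-insensitive match), and `extract_entities` is a hypothetical helper, not part of the StNER interface.

```python
import re

# Tag case is ignored because the sample output shows <location>...</LOCATION>.
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

def extract_entities(tagged_text):
    """Return (ENTITY_TYPE, text) pairs, collapsing line-wrap whitespace."""
    return [(kind.upper(), " ".join(body.split()))
            for kind, body in TAGGED.findall(tagged_text)]

tagged = ("<location>Argentina</LOCATION> President <person>Kirchner</PERSON> "
          "has been asked to testify in court on the death of <person>Alberto\n"
          "Nisman</PERSON> the crusading prosecutor")
print(extract_entities(tagged))
# [('LOCATION', 'Argentina'), ('PERSON', 'Kirchner'), ('PERSON', 'Alberto Nisman')]
```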

Cheers,

Hernán

Re: Parsing text to discover general data of interest (phone, email, address, ...)

Richard O'Keefe
In reply to this post by cedreek
You say that named entity recognition is not generalised beyond Mail,
but the support library is there for anyone to use.  See for

In Python, you can use NLTK to do roughly the same.

There's no real point in reimplementing this stuff in Pharo.
Just set up a separate process, send text to it, and receive
results back.
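
To make the "separate process" idea concrete, here is a minimal sketch of the round trip, written in Python on both sides so that it runs as is. The worker below is only a stand-in (a regex for emails) for whatever NLP tool would really be used, and exchanging JSON over stdin/stdout is an assumption; a Pharo client would launch the child with whatever process or socket library is at hand.

```python
import json
import subprocess
import sys

# Stand-in worker: a real setup would run an NLP tool here (e.g. a script using NLTK).
WORKER_SOURCE = r'''
import json, re, sys
text = sys.stdin.read()
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
json.dump({"emails": emails}, sys.stdout)
'''

def analyse_in_subprocess(text):
    """Send text to a child process on stdin and read JSON results from stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=text, capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)

print(analyse_in_subprocess("Contact: jane.doe@example.org"))
# {'emails': ['jane.doe@example.org']}
```

Starting a process per text is slow; for many texts a long-running worker reached over a socket would be the more usual arrangement.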


Re: Parsing text to discover general data of interest (phone, email, address, ...)

cedreek
In reply to this post by Pharo Smalltalk Users mailing list

> I couldn't find anything in Smalltalk, but this should give you ideas, inspire you, or get you started...
>
> https://github.com/search?q=contact+scraping&type=Repositories
>
> I guess we have all that's needed in Pharo : parsers (HTML, XML, PetitParser), Soup & regex !

Yes. For markup, I first played quite a lot with Soup, but then moved to XPath, which is far more convenient and direct.

PetitParser(2) is also a pure gem. I finally understand it better. Grammar and Parser are interesting, but maybe more appropriate for structured or semi-structured texts.

For pure text, tools like those Richard just showed are nice.

I’ll continue playing with PetitParser2 today (finishing the tutorial…).

Cheers,

Cédrick


Re: Parsing text to discover general data of interest (phone, email, address, ...)

cedreek
In reply to this post by hernanmd
Hi Hernan,

Really nice. I’ll try it today.

It might be what I need.

I’ll come back if I run into installation problems.

Cheers,

Cédrick

> On 8 March 2019 at 03:34, Hernán Morales Durand <[hidden email]> wrote:
> [...]


Re: Parsing text to discover general data of interest (phone, email, address, ...)

cedreek
In reply to this post by Richard O'Keefe


> You say that named entity recognition is not generalised beyond Mail,
> but the support library is there for anyone to use.  See for

Yes, true.

> In Python, you can use NLTK to do roughly the same.
>
> There's no real point in reimplementing this stuff in Pharo.
> Just set up a separate process, send text to it, and receive
> results back.

I agree, that is an excellent option.

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum." 

Thanks for pointing out NLTK. It is a great tool for sure. I agree that there is generally no point in reimplementing, but in the case of NLP it might be worth it, as I think we already have all the good foundations (basic string manipulation, PP, Zn, XML, etc…). The hard part is probably the integration of the 100 corpora and lexical resources (http://www.nltk.org/nltk_data/)...

The thing is, I need a lighter version, based on my own (growing) corpora (my experience).
So Hernán’s solution, PP, or straight string processing would do the job.

But thanks, this is something I will explore too (especially the huge resources).

Cheers,

Cédrick






