Hapax/CodeFu changes and example script

Hapax/CodeFu changes and example script

hernanmd

Hi guys,
For those working in information retrieval, for example on tf-idf
ranking, adapted versions of the "Hapax" and "CodeFu" packages are
available in the BioSmalltalk repository
http://ss3.gemstone.com/ss/BioSmalltalk.html . I have ported some
VisualWorks-specific code to Pharo 1.4 (on Windows it requires the
ProcessWrapper package) and adapted some Hapax methods to work with
corpora in different languages.

This is an example script for a corpus in Spanish:

| corpus tdm documents |

corpus := HXSpanishCorpus new.

"Each line of this string is treated as a separate document."
documents := 'el río Danubio pasa por Viena, su color es azul
el caudal de un río asciende en Invierno
el río Rhin y el río Danubio tienen mucho caudal
si un río es navegable, es porque tiene mucho caudal'.

"Add every line to the corpus, keyed by its line number."
documents lines doWithIndex: [:doc :index |
        corpus
                addDocument: index asString
                with: (Terms new
                        addString: doc
                        using: CamelcaseScanner;
                        yourself)].
"Drop stopwords and stem the remaining terms before building the matrix."
corpus removeStopwords.
corpus stemAll.
tdm := TermDocumentMatrix on: corpus.

Feel free to integrate this into any repository. If you want to add a
language, look at the methods whose selectors include "spanish".
Cheers,

Hernán

_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

Re: Hapax/CodeFu changes and example script

abergel
What is the result of this script?

Alexandre


Replying to Hernán Morales Durand's message of Jan 17, 2013, 11:02 AM.

--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.




Re: Hapax/CodeFu changes and example script

hernanmd
A TermDocumentMatrix with word mappings and frequencies for the given
documents (consider each line a different document).
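
As a rough illustration of what such a matrix holds, here is a minimal sketch in plain Pharo collections. It does not use the Hapax classes: the tokenizer (lowercase, split on spaces and commas), the `matrix` dictionary and the `tfIdf` block are assumptions made up for this example, and stopword removal and stemming are skipped, so the values may differ from what TermDocumentMatrix computes.

```smalltalk
"Sketch only: plain Pharo collections, no Hapax classes."
| documents counts matrix tfIdf |

documents := #(
	'el río Danubio pasa por Viena, su color es azul'
	'el caudal de un río asciende en Invierno'
	'el río Rhin y el río Danubio tienen mucho caudal'
	'si un río es navegable, es porque tiene mucho caudal').

"Assumed tokenizer: lowercase, split on spaces and commas; one Bag of terms per document."
counts := documents collect: [:doc |
	(doc asLowercase substrings: ' ,') asBag].

"matrix maps each term to an Array holding its frequency in every document."
matrix := Dictionary new.
counts withIndexDo: [:bag :index |
	bag asSet do: [:term |
		(matrix at: term ifAbsentPut: [Array new: documents size withAll: 0])
			at: index put: (bag occurrencesOf: term)]].

"Assumed weighting: raw term frequency times ln(N / document frequency)."
tfIdf := [:term :docIndex | | row df |
	row := matrix at: term.
	df := row count: [:n | n > 0].
	(row at: docIndex) * (documents size / df) ln].

matrix at: 'río'.                  "frequency of 'río' in each of the four documents"
tfIdf value: 'danubio' value: 1.   "weight of 'danubio' in the first document"
```

Under this weighting, 'río' occurs in all four documents, so its weight is 0 everywhere, while 'danubio' occurs in two of the four and gets 1 * ln 2 in the first document.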

Replying to Alexandre Bergel's message of 17/01/2013, 13:14.

Re: [Pharo-users] Hapax/CodeFu changes and example script

Stéphane Ducasse
In reply to this post by hernanmd
Thanks hernan!

Stef

Replying to Hernán Morales Durand's message of Jan 17, 2013, 5:02 PM.

Re: Hapax/CodeFu changes and example script

abergel
In reply to this post by hernanmd
Ah okay, a kind of Latent Semantic Indexing system then. Looks good!

Alexandre


Replying to Hernán Morales Durand's message of Jan 17, 2013, 1:12 PM.

Re: [Pharo-users] Hapax/CodeFu changes and example script

Tudor Girba-2
In reply to this post by Stéphane Ducasse
Thanks, indeed!

It would be great to have this back in Moose. Anyone interested in looking at it?

Cheers,
Doru


Replying to Stéphane Ducasse's message of Jan 17, 2013, 10:43 PM.

--
www.tudorgirba.com

"Every successful trip needs a suitable vehicle."





Re: [Pharo-users] Hapax/CodeFu changes and example script

Gustavo Santos
What is the status of this integration?

I'm new to the team. Some time ago I implemented Hapax (IR, clustering,
visualization) in Java with some improvements, building on tools for IR
and clustering that I had developed earlier.

Now I would like to contribute by implementing these improvements in Hapax.
Has there been any progress since the last bundle from 2011, and who worked
on it most recently?

--
Gustavo Jansen



--
View this message in context: http://moose-dev.97923.n3.nabble.com/Hapax-CodeFu-changes-and-example-script-tp4025949p4029025.html
Sent from the moose-dev mailing list archive at Nabble.com.

Re: [Pharo-users] Hapax/CodeFu changes and example script

abergel
Hi!

I am not really familiar with this, but having strong tools for text analysis is really important.

Alexandre


Replying to Gustavo Jansen's message of Oct 3, 2013, 9:11 AM.

Re: [Pharo-users] Hapax/CodeFu changes and example script

Tudor Girba-2
In reply to this post by Gustavo Santos
Welcome, Gustavo!

I am not aware of any effort in this direction, but as Alex said, anything in the area of text manipulation would be greatly appreciated.

Cheers,
Doru



Replying to Gustavo Jansen's message of Thu, Oct 3, 2013, 2:11 PM.



--

"Every thing has its own flow"
