Hapax/CodeFu changes and example script

hernanmd

Hapax/CodeFu changes and example script

Hi guys,
For those working in information retrieval (for example, tf-idf
ranking), adapted versions of the "Hapax" and "CodeFu" packages are
available in the BioSmalltalk repository
http://ss3.gemstone.com/ss/BioSmalltalk.html . I have translated some
VisualWorks (VW) specific code to Pharo 1.4 (on Windows it requires the
ProcessWrapper package) and adapted some Hapax methods to work with
corpora in different languages.

This is an example script for a corpus in Spanish:

| corpus tdm documents |

corpus := HXSpanishCorpus new.

"Four sample documents, one per line."
documents := 'el río Danubio pasa por Viena, su color es azul
el caudal de un río asciende en Invierno
el río Rhin y el río Danubio tienen mucho caudal
si un río es navegable, es porque tiene mucho caudal'.

"Add each line to the corpus, identified by its line number."
documents lines doWithIndex: [:doc :index |
    corpus
        addDocument: index asString
        with: (Terms new
            addString: doc using: CamelcaseScanner;
            yourself)].

"Drop stopwords, stem the remaining terms, then build the matrix."
corpus removeStopwords.
corpus stemAll.
tdm := TermDocumentMatrix on: corpus.
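
For readers who want to see what a tf-idf weight looks like on these same four sentences, here is a minimal sketch in plain Pharo. It does not use the Hapax API at all: the variable names (tokenized, docFreq, tfidf), the naive tokenizer, and the unsmoothed formula tf * ln(N / df) are all assumptions made for illustration, and it skips the stopword removal and stemming that the script above performs.

```smalltalk
| documents tokenized docFreq tfidf |

documents := #(
    'el río Danubio pasa por Viena, su color es azul'
    'el caudal de un río asciende en Invierno'
    'el río Rhin y el río Danubio tienen mucho caudal'
    'si un río es navegable, es porque tiene mucho caudal').

"Naive tokenizer (assumption): lowercase, split on spaces and commas.
 Each document becomes a Bag, so term counts come for free."
tokenized := documents collect: [:doc |
    (doc asLowercase findTokens: ' ,') asBag].

"Document frequency: the number of documents each term occurs in."
docFreq := Dictionary new.
tokenized do: [:bag |
    bag asSet do: [:term |
        docFreq at: term put: (docFreq at: term ifAbsent: [0]) + 1]].

"tf-idf weight of a term in a document: tf * ln(N / df).
 Terms that occur in no document get weight 0."
tfidf := [:term :docIndex | | tf df |
    tf := (tokenized at: docIndex) occurrencesOf: term.
    df := docFreq at: term ifAbsent: [0].
    df = 0
        ifTrue: [0]
        ifFalse: [tf * (documents size / df) ln]].

Transcript
    show: 'danubio in doc 1: ', (tfidf value: 'danubio' value: 1) printString; cr;
    show: 'río in doc 3: ', (tfidf value: 'río' value: 3) printString; cr.
```

With this formula "río" gets weight zero everywhere, because it occurs in all four documents and ln(4/4) = 0, while "danubio" (two documents out of four) gets a positive weight of 1 * ln 2 in the first document. Frequent but uninformative terms are ranked down in the same way.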

Feel free to integrate these packages into any repository. To add a
language, look at the methods whose selectors include "spanish".
Cheers,

Hernán

Stéphane Ducasse

Re: Hapax/CodeFu changes and example script

Thanks hernan!

Stef

Tudor Girba-2

Re: Hapax/CodeFu changes and example script

Thanks, indeed!

It would be great to have this back in Moose. Anyone interested in looking at it?

Cheers,
Doru

On Jan 17, 2013, at 10:43 PM, Stéphane Ducasse wrote:

> Thanks hernan!
>
> Stef
>
> On Jan 17, 2013, at 5:02 PM, Hernán Morales Durand wrote:

--
www.tudorgirba.com

"Every successful trip needs a suitable vehicle."