[ANN] BioSqueak 0.4

hernanmd

[ANN] BioSqueak 0.4

Hi,

A few days ago I created a port of BioSmalltalk for Squeak as well.
BioSmalltalk is a library for doing bioinformatics in Smalltalk. This
port is called "BioSqueak", and I expect to release a version for
Windows soon. You can find it at:

http://code.google.com/p/biosmalltalk/downloads/list

I'm very interested in feedback.
Thanks for reading.

Hernán

--
Hernán Morales
Institute of Veterinary Genetics (IGEVET)
http://igevet.fcv.unlp.edu.ar
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.

Hannes Hirzel

Re: [ANN] BioSqueak 0.4

Hello Hernán

This is interesting.
http://biosmalltalk.blogspot.com/

I understand that you have constructed an internal domain specific
language (a DSL, a query language) for dealing with genetic data in
Smalltalk

search := BioNCBIWWWBlastClient new nucleotide
    query: 'CCCTCAAACAT...TTTGAGGAG';
    hitListSize: 150;
    filterLowComplexity;
    expectValue: 10;
    wordSize: 11;
    blastn;
    blastPlainService;
    alignmentViewFlatQueryAnchored;
    formatTypeXML;
    fetch.
search outputToFile: 'blast-query-result.xml' contents: search result.

Is there a description of this DSL? The data is kept in XML files and
all is read into the image to be queried. It seems that you don't have
a problem with the image size?

I would welcome a short writeup with a general introduction to what
you are doing in http://biosmalltalk.blogspot.com/.

Or pointers to papers (Castilian is fine)

Kind regards

Hannes Hirzel


hernanmd

Re: [ANN] BioSqueak 0.4

Hello Hannes,
Thanks for the feedback! My answers are inline below:

On 01/02/2013 11:35, H. Hirzel wrote:

> Hello Hernán
>
> This is interesting.
> http://biosmalltalk.blogspot.com/
>
> I understand that you have constructed an internal domain specific
> language (a DSL, a query language) for dealing with genetic data in
> Smalltalk
>
> search := BioNCBIWWWBlastClient new nucleotide query: 'CCCTCAAACAT...TTTGAGGAG';
> hitListSize: 150;
> filterLowComplexity;
> expectValue: 10;
> wordSize: 11;
> blastn;
> blastPlainService;
> alignmentViewFlatQueryAnchored;
> formatTypeXML;
> fetch.
> search outputToFile: 'blast-query-result.xml' contents: search result.
>
> Is there a description of this DSL?

It is not a DSL in the traditional sense, i.e., one built with ANTLR, Lex
or Yacc, but an embedded "DSL" that inherits the syntax and execution
semantics of Smalltalk.
To clarify: I have not built a DSL specification for the QBlast API,
although I would be willing to develop DSLs for bioinformatics APIs in a
Smalltalk language workbench (anyone?).
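
As a rough illustration of what "embedded" means here, the workspace
sketch below uses only core classes (a Dictionary standing in for a
settings object; it is not BioSmalltalk code). The cascade operator (;)
sends each following message to the same receiver, which is what gives
the Blast snippet quoted above its DSL-like look.

```smalltalk
"Workspace sketch with core classes only; not BioSmalltalk code."
| settings |
settings := Dictionary new.
settings
    at: #hitListSize put: 150;   "every message after ; goes to settings"
    at: #expectValue put: 10;
    at: #wordSize put: 11.
settings size   "3: all three entries landed in the same Dictionary"
```

Nothing beyond ordinary message sends is involved, so the usual Smalltalk
tools (debugger, senders/implementors) work on such code unchanged.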

Currently the messages for performing alignments at the NCBI are based
on the API specification,
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html . The unary
sends are the result of a plan to reduce parameterization and to
replicate or customize Blast settings through a UI. This is because
geneticists experiment with changing Blast parameters over time, and I
do not want my system to be tied to textual parameters.
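
One way to picture the idea of unary sends standing in for textual
parameters is a lookup from selector to name/value pair. The sketch below
is hypothetical and uses only core classes: the selector names are taken
from the snippet quoted in this thread, while the parameter names and
values (PROGRAM, FILTER, FORMAT_TYPE) are assumptions based on the NCBI
URL API and are not necessarily how BioSmalltalk does it.

```smalltalk
"Hypothetical mapping from unary selectors to URL API name/value pairs."
| parameterFor chosen queryString |
parameterFor := Dictionary new.
parameterFor
    at: #blastn put: 'PROGRAM' -> 'blastn';
    at: #filterLowComplexity put: 'FILTER' -> 'L';
    at: #formatTypeXML put: 'FORMAT_TYPE' -> 'XML'.

"The settings a user picked, e.g. through a UI, in the order chosen."
chosen := #(#blastn #filterLowComplexity #formatTypeXML).

"Join the selected pairs into a name=value&name=value string."
queryString := String streamContents: [:stream |
    chosen
        do: [:selector | | pair |
            pair := parameterFor at: selector.
            stream nextPutAll: pair key; nextPut: $=; nextPutAll: pair value]
        separatedBy: [stream nextPut: $&]].
queryString   "PROGRAM=blastn&FILTER=L&FORMAT_TYPE=XML"
```

With the textual names confined to one table, a UI only needs to store
and replay selectors.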

> The data is kept in XML files and
> all is read into the image to be queried. It seems that you don't have
> a problem with the image size?

Yes, I had a lot of problems with image size and performance.
When working with the XML DOM on alignments of 5000 or more hits,
Squeak (and Pharo, of course) started to slow down, so I cannot
keep all XML nodes in memory. To overcome this problem I tried the
SAX (push) parser and the XMLPullParser (which is a StAX parser). Then
my idea was to reduce the tree by specifying only the XML nodes
I'm interested in. After reducing the nodes, I wrote custom XML tree
classes with a specific API to query Blast XML results, taken from the
DTD specification. AFAIK this is known as an XML digester, which is
somewhat more "evolved" in Java
(http://commons.apache.org/digester/xmlrules.html). I have built a
dynamic query builder in Morphic for querying the XML, with the
possibility of persisting and updating the filters. Unfortunately for
Squeak users, I'm using the Polymorph API, which I think is not available
in Squeak.

We used the XML push/pull parsers for reading genomes and they
worked acceptably. But it is impossible to keep the nodes for 3 GBytes of
XML in memory, at least for now, in Squeak/Pharo.
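
The idea of reading the input as a stream and retaining only the
elements of interest can be sketched with core classes alone. This is
only an illustration: plain string scanning with ReadStream>>upToAll:
stands in for a real push or pull parser, and the sample input is made
up (Hit_id and Hit_len are element names from the Blast XML output).

```smalltalk
"Plain string scanning stands in for a real XML pull parser here."
| xml stream hitIds |
xml := '<Hit><Hit_id>id-1</Hit_id><Hit_len>120</Hit_len></Hit>' ,
       '<Hit><Hit_id>id-2</Hit_id><Hit_len>95</Hit_len></Hit>'.
stream := ReadStream on: xml.
hitIds := OrderedCollection new.

"Skip ahead to each opening tag; stop once no further tag is found."
[stream upToAll: '<Hit_id>'. stream atEnd] whileFalse: [
    "Keep only the text up to the closing tag; everything else is dropped."
    hitIds add: (stream upToAll: '</Hit_id>')].
hitIds   "holds the two Hit_id strings, and no other nodes"
```

Memory use then grows with the number of retained values rather than
with the size of the document.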

More, and more critical, problems arise when trying to work in Smalltalk
with microarray data (big data), which is not document-oriented. I had to
switch to "solutions" like SQL, or HDF5 using PyTables with a
well-designed schema for our input. The advantages are that they support
indexing and reading data in blocks, and there are tools like ViTables or
HDFView to navigate the data. Until someone provides some bits in this
field, there is little opportunity for using Smalltalk.

> I would welcome a short writeup with a general introduction to what
> you are doing in http://biosmalltalk.blogspot.com/.
>
> Or pointers to papers (Castilian is fine)
>

We submitted a paper recently and are waiting for the review
results. In addition, we are preparing another paper on a
phylogenetics decision support system which includes text mining and a
rule engine. I will try to write a blog entry next week with screenshots.

Best regards,

Hernán


garduino

Re: [ANN] BioSqueak 0.4

Wow, very very interesting Hernán.


Hannes Hirzel

Re: [ANN] BioSqueak 0.4

Hello Hernán

Thank you for your elaboration on the topic of BioSqueak.

On 2/1/13, Hernán Morales Durand <[hidden email]> wrote:

>
> Hello Hannes,
> Thanks for the feedback! Some answers then between the lines:
>
> On 01/02/2013 11:35, H. Hirzel wrote:
>> Hello Hernán
>>
>> This is interesting.
>> http://biosmalltalk.blogspot.com/
>>
>> I understand that you have constructed an internal domain specific
>> language (a DSL, a query language) for dealing with genetic data in
>> Smalltalk
>>
>> search := BioNCBIWWWBlastClient new nucleotide query:
>> 'CCCTCAAACAT...TTTGAGGAG';
>> hitListSize: 150;
>> filterLowComplexity;
>> expectValue: 10;
>> wordSize: 11;
>> blastn;
>> blastPlainService;
>> alignmentViewFlatQueryAnchored;
>> formatTypeXML;
>> fetch.
>> search outputToFile: 'blast-query-result.xml' contents: search result.
>>
>> Is there a description of this DSL?
>
> Is not a DSL in the traditional sense, i.e., using ANTLR, Lex or Yacc,
> but a "DSL" which is embedded thus inheriting the syntax and execution
> semantics of Smalltalk.

Yes, I understand. That is the regular thing in Smalltalk, as every
Smalltalk domain model could be considered a DSL to a certain extent.

Lukas Renggli has a useful classification of DSLs in his PhD dissertation,
'Dynamic Language Embedding':
http://scg.unibe.ch/archive/phd/renggli-phd.pdf
Chapter 2

According to that, you probably have an internal DSL (chapter 2.1), right?

> To clarify: I've not built a DSL specification for the QBlast API,
> although I'm willing to develop DSLs for bioinformatics APIs in a
> Smalltalk language workbench (anyone?).

OK

> Currently the messages for performing alignments at the NCBI are based
> in the API specification,
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html .

> The unary
> sends are the result of a plan to reduce parametrization and to
> replicate or customize Blast settings through a UI. This is because
> geneticists experiment changing Blast parameters over time and I want my
> system not to be tied to textual parameters.
>

> > The data is kept in XML files and
> > all is read into the image to be queried. It seems that you don't have
> > a problem with the image size?
>
> Yes I had problems with image size and performance, a lot indeed.

> Actually working with XML DOM with alignments of 5000 or more hits
> Squeak (and Pharo of course) started to show slowliness. So I cannot
> keep all XML nodes in memory. To overcome this problem I've tried the
> SAX (push) parser and the XMLPullParser (which is a StAX parser). Then
> my idea was to reduce the tree by specifying only the XML nodes which
> I'm interested for. After reducing the nodes, I wrote custom XML tree
> classes with a specific API to query blast XML results, taken form the
> DTD specification. AFAIK this is known as a XML digester, which is
> somewhat "evolved" in Java
> (http://commons.apache.org/digester/xmlrules.html).

I understand that you took
http://www.squeaksource.com/XMLSupport/
(the XML support repository for Pharo; for Squeak, XML support is in
the trunk image)
and modified it.

> I have built a
> dynamic query builder in Morphic for querying the XML providing the
> possibility of persist and update the filters. Unfortunately for Squeak
> users I'm using the Polymorph API, which I think is not available in
> Squeak.

A screen shot would be appreciated... :-)

> We worked using the XML push/pull parsers for reading genomes and they
> worked acceptably. But it is impossible to keep nodes for 3 GBytes of
> XML at least for now in Squeak/Pharo.

In my experience, keeping XML structures in the image is
inefficient in terms of memory usage. More efficient representations are
needed, and XML is then used only for reading from and writing to
external files.

> More and critical problems arise when trying to work with microarray
> data (big data) in Smalltalk which is not document-oriented. I had to
> switch to "solutions" like SQL, or HDF5 using Pytables with
> well-designed scheme for our input. The advantages are that supports
> indexing and reading data in blocks, besides tools like Vitables or
> HDFView to navigate the data. Until someone provides some bits in this
> field, there is little opportunity for using Smalltalk.

But my understanding is that people keep DNA data in memory for speed
reasons and use C++ or Perl programs to deal with it.

>> I would welcome a short writeup with a general introduction to what
>> you are doing in http://biosmalltalk.blogspot.com/.

>
> We have submitted a paper recently and we are waiting for the review
> results. On the other side we are preparing another paper for a
> phylogenetics decision support system which includes text-mining and a
> rule engine. I will try to write an entry in the next week with
> screenshots.

Any news on this?

Kind regards
Hannes


hernanmd

PhyloclassTalk (was: Re: [squeak-dev] [ANN] BioSqueak 0.4)

Hello Hannes,

Sorry for the late response; I have been working intensively on an
application using BioSmalltalk. Here is a post with some screenshots:
http://biosmalltalk.blogspot.com.ar/2013/02/phyloclasstalk-preview.html

As I said, it is developed in Pharo, but most subsystems work in
Squeak too. I am cross-posting to the Pharo users list in case someone is
interested.

On 16/02/2013 16:00, H. Hirzel wrote:

> Hello Hernán
>
> Thank you for your elaboration on the topic of BioSqueak.
>
> On 2/1/13, Hernán Morales Durand <[hidden email]> wrote:
>>
>> Hello Hannes,
>> Thanks for the feedback! Some answers then between the lines:
>>
>> On 01/02/2013 11:35, H. Hirzel wrote:
>>> Hello Hernán
>>>
>>> This is interesting.
>>> http://biosmalltalk.blogspot.com/
>>>
>>> I understand that you have constructed an internal domain specific
>>> language (a DSL, a query language) for dealing with genetic data in
>>> Smalltalk
>>>
>>> search := BioNCBIWWWBlastClient new nucleotide query:
>>> 'CCCTCAAACAT...TTTGAGGAG';
>>> hitListSize: 150;
>>> filterLowComplexity;
>>> expectValue: 10;
>>> wordSize: 11;
>>> blastn;
>>> blastPlainService;
>>> alignmentViewFlatQueryAnchored;
>>> formatTypeXML;
>>> fetch.
>>> search outputToFile: 'blast-query-result.xml' contents: search result.
>>>
>>> Is there a description of this DSL?
>>
>> Is not a DSL in the traditional sense, i.e., using ANTLR, Lex or Yacc,
>> but a "DSL" which is embedded thus inheriting the syntax and execution
>> semantics of Smalltalk.
>
> Yes, I understand, the regular thing in Smalltalk as every Smalltalk
> domain model could be considered a DSL to a certain extent/
>
> Lukas Renggli has a useful classification on DSLs in his PhD dissertation on
> 'Dynamic Language Embedding''
> http://scg.unibe.ch/archive/phd/renggli-phd.pdf
> Chapter 2
>
> According to that you probably have an Internal DSL (chapter 2.1), right?
>

Yes, it would fit into the internal DSL category. I didn't know about
that classification; thanks for sharing.

>
>> To clarify: I've not built a DSL specification for the QBlast API,
>> although I'm willing to develop DSLs for bioinformatics APIs in a
>> Smalltalk language workbench (anyone?).
>
> OK
>
>> Currently the messages for performing alignments at the NCBI are based
>> in the API specification,
>> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html .
>
> The unary
>> sends are the result of a plan to reduce parametrization and to
>> replicate or customize Blast settings through a UI. This is because
>> geneticists experiment changing Blast parameters over time and I want my
>> system not to be tied to textual parameters.
>>
>
>
>> > The data is kept in XML files and
>> > all is read into the image to be queried. It seems that you don't have
>> > a problem with the image size?
>>
>> Yes I had problems with image size and performance, a lot indeed.
>
>> Actually working with XML DOM with alignments of 5000 or more hits
>> Squeak (and Pharo of course) started to show slowliness. So I cannot
>> keep all XML nodes in memory. To overcome this problem I've tried the
>> SAX (push) parser and the XMLPullParser (which is a StAX parser). Then
>> my idea was to reduce the tree by specifying only the XML nodes which
>> I'm interested for. After reducing the nodes, I wrote custom XML tree
>> classes with a specific API to query blast XML results, taken form the
>> DTD specification. AFAIK this is known as a XML digester, which is
>> somewhat "evolved" in Java
>> (http://commons.apache.org/digester/xmlrules.html).
>
> I understand that you took
> http://www.squeaksource.com/XMLSupport/
> (the XML support repo for Pharo, for Squeak XML support is in
> the trunk image)
> and modified it.
>
>> I have built a
>> dynamic query builder in Morphic for querying the XML providing the
>> possibility of persist and update the filters. Unfortunately for Squeak
>> users I'm using the Polymorph API, which I think is not available in
>> Squeak.
>
> A screen shot would be appreciated... :-)
>

Ok, the blog post includes some screenshots.

>> We worked using the XML push/pull parsers for reading genomes and they
>> worked acceptably. But it is impossible to keep nodes for 3 GBytes of
>> XML at least for now in Squeak/Pharo.
>
> According to my experience keeping XML structures in the image is
> inefficient in terms of memory usage. More efficient ways are needed
> and XML is then only for reading/writing to external files.
>

Exactly, XML is not good at all for big data.

>> More and critical problems arise when trying to work with microarray
>> data (big data) in Smalltalk which is not document-oriented. I had to
>> switch to "solutions" like SQL, or HDF5 using Pytables with
>> well-designed scheme for our input. The advantages are that supports
>> indexing and reading data in blocks, besides tools like Vitables or
>> HDFView to navigate the data. Until someone provides some bits in this
>> field, there is little opportunity for using Smalltalk.
>
> But what I understand is that people keep DNA data in memory for speed
> reasons and use C++ or Perl programs to deal with it.
>

It really depends on the type of analysis. I've seen that most beginning
bioinformaticians prefer Python over Perl because of the nicer syntax
and more complete library support.

I don't know of big data projects using C++ with raw DNA data. Compression
with indexing and specialized file formats are used these days,
splitting data across clusters where needed. I would love to see some
Smalltalkers working on dataspaces too.

See these presentations: http://www.slideshare.net/mndoci/presentations

>>> I would welcome a short writeup with a general introduction to what
>>> you are doing in http://biosmalltalk.blogspot.com/.
>
>
>>
>> We have submitted a paper recently and we are waiting for the review
>> results. On the other side we are preparing another paper for a
>> phylogenetics decision support system which includes text-mining and a
>> rule engine. I will try to write an entry in the next week with
>> screenshots.
>
> Any news on this?
>

No news so far, still in the reviewing process.
Best regards,

Hernán


garduino

Re: PhyloclassTalk (was: Re: [squeak-dev] [ANN] BioSqueak 0.4)

Really nice UI Hernán! Congrats!
