[squeak-dev] [ANN] Soup 0.1

Zulq Alam-2
Soup is a port of Beautiful Soup [1]. If you're not familiar with
Beautiful Soup, it is a tolerant HTML/XML parser written in Python and
is extremely useful when you need to scrape data from a web page.

soup := Soup fromUrl: 'http://www.google.co.uk/search?q=squeak'.
results := soup findAllTags:
   [:e |
   e name = 'h3'
     and: [(e attributeAt: 'class') = 'r']].
links := results collect: [:e | e text -> e a href].

Squeak Smalltalk
   -> http://www.squeak.org/
Squeak Smalltalk: Download
   -> http://www.squeak.org/Download/
Squeak - Wikipedia, the free encyclopedia
   -> http://en.wikipedia.org/wiki/Squeak
etc...

The main differences from the Beautiful Soup API are:

   - find*Tag(s) for tags
   - find*String(s) for strings, CData, declarations and processing
     instructions
   - the use of blocks for complex queries

For more usage information, browse the 'searching tags' and 'searching
strings' protocols on SoupElement subclasses. Also look at the tests in
SoupElementTest, SoupTagTest and SoupParserTest. I will write/port
proper documentation later.
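
As a further sketch of a block-based query, the following collects every
link on a page. It reuses only the messages that appear in the example
above (fromUrl:, findAllTags:, name, attributeAt:, text). The URL is a
placeholder, and the sketch assumes that attributeAt: answers nil when the
attribute is missing; check the tests before relying on that.

```smalltalk
"Sketch only. The URL is a placeholder, and attributeAt: is assumed
 to answer nil for a missing attribute."
soup := Soup fromUrl: 'http://www.squeak.org/'.
anchors := soup findAllTags:
   [:e |
   e name = 'a'
     and: [(e attributeAt: 'href') notNil]].
"Pair each link's text with its target, as in the example above."
links := anchors collect: [:e | e text -> (e attributeAt: 'href')].
```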

There are still many things to do:

   - No attempt is made to deal with different character sets and
     encodings. This is a major feature of Beautiful Soup which I have
     so far ignored.

   - The parser does not convert entity or character references.
     Although leaving them unconverted is also Beautiful Soup's default
     for HTML, conversion is still an important feature.

   - The parser will not accept options such as whether to convert
     entities, which entities to convert, what to parse, etc.

   - The parser only handles HTML. Unlike Beautiful Soup, there are no
     configurations for other XML flavors yet.
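
Since entity references are left unconverted for now, a few common ones
can be decoded by hand after extracting text. This is a minimal sketch
that uses only standard String and collection messages
(copyReplaceAll:with:, pairsDo:) and nothing from Soup. The decode block is
a hypothetical helper and covers just three named entities.

```smalltalk
"Hypothetical helper, not part of Soup: decode three named entities.
 '&amp;' goes last so that '&amp;lt;' becomes '&lt;' rather than '<'."
decode := [:aString |
   | s |
   s := aString.
   #('&lt;' '<' '&gt;' '>' '&amp;' '&') pairsDo: [:entity :char |
      s := s copyReplaceAll: entity with: char].
   s].
decode value: 'Tom &amp; Jerry &lt;3'.  "answers 'Tom & Jerry <3'"
```

Numeric character references and the full entity table would need a
proper lookup, which is why this belongs in the parser eventually.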

The project [2] is globally writable. I look forward to your feedback and
contributions.

Thanks,
Zulq.

[1] http://www.crummy.com/software/BeautifulSoup/
[2] http://www.squeaksource.com/Soup.html


RE: [squeak-dev] [ANN] Soup 0.1

Sebastian Sastre-2
Hi Zulq,
The semantic web is here, so this is interesting stuff. I can't think of an
application for myself right now, but knowing we can do this if needed
feels good.
In case you want to share, what are you going to use it for?
Cheers,
Sebastian


[squeak-dev] Re: [ANN] Soup 0.1

Zulq Alam-2
Hi Sebastian,

I haven't any concrete plans yet. I have some ideas which, as you can
guess, revolve around collecting and aggregating data in some new or
useful way.

Zulq.

