[Documentation] Spell checking within Squeak - suggestion of a method

Hannes Hirzel
On 4/30/10, Ian Trudel <[hidden email]> wrote:

> 2010/4/30 Casey Ransberger <[hidden email]>:
>> Ian, thanks. Comments inline.
>
>> I don't know of a spell checker implementation for Squeak. Is there one
>> out there? If not, can you implement one and then get back to me right
>> away? :P
>
> Neither do I.
>

Compared with general word processing, our spell checking needs are
simpler. We are writing technical documentation, so the number of word
forms is limited (my estimate: somewhere between 3000 and 5000).
This means a simple dictionary of allowed word forms could do the job.

How do we get that dictionary?

Method 1)
We create a Bag of all word forms in the existing comments. The words
which are infrequent are candidates for being misspelt. They can be
flagged and put into another collection.
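
A minimal sketch of Method 1, written in Python rather than Squeak so it can be tried outside an image. `Counter` stands in for the Bag; the function names, the tokenising rule and the `max_count` threshold are assumptions made for illustration, not part of the proposal.

```python
import re
from collections import Counter

def word_forms(text):
    """Split text into lowercase word forms (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def rare_word_forms(comments, max_count=1):
    """Return the word forms that occur at most max_count times overall."""
    bag = Counter()  # plays the role of the Bag of word forms
    for comment in comments:
        bag.update(word_forms(comment))
    return sorted(word for word, count in bag.items() if count <= max_count)

# Hypothetical comments; one of them contains a misspelling.
comments = [
    "Answer the receiver as a string.",
    "Answer the reciever as a symbol.",
    "Answer the receiver as a number.",
]
candidates = rare_word_forms(comments)
```

Note that `candidates` also contains correctly spelt but rare words (here "string", "symbol" and "number"), so the flagged collection still needs a human review.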

Method 2)
We paste the list of words obtained from the comments into a regular
word processor and run a spell check there.

A diff against the original word list then gives the misspelt words.

We paste the result back into the Squeak image.

Within the Squeak image resides a collection of acceptable word forms
(as part of the HelpSystem-Tools).

Before accepting a comment this list is consulted.
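
The diff and the check before accepting a comment could look roughly like the following Python sketch. The word lists are made-up sample data, and `misspelt_forms` and `unknown_words` are hypothetical helper names; nothing here is an existing HelpSystem-Tools API.

```python
import re

def word_forms(text):
    """Return the set of lowercase word forms found in text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def misspelt_forms(original_words, checked_words):
    """Words from the original list that did not survive the external spell check."""
    return set(original_words) - set(checked_words)

def unknown_words(comment, allowed):
    """Word forms in a comment that are missing from the allowed list."""
    return sorted(word_forms(comment) - allowed)

# Word list extracted from the comments, and the same list after the
# word processor's spell check (both invented for this example).
original = {"answer", "the", "reciever", "receiver", "as", "a", "string"}
checked = {"answer", "the", "receiver", "as", "a", "string"}

rejected = misspelt_forms(original, checked)
allowed = original - rejected  # the collection kept inside the image

problems = unknown_words("Answer the reciever as a string.", allowed)
```

A comment would be accepted only when `problems` is empty; otherwise the listed words are either corrected or, after review, added to the allowed list.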

This process has to be repeated for a few versions maybe until we have
a comprehensive wordlist.

An additional benefit: we limit the number of words used, thus making the
texts easier to understand and more consistent.

The idea here is 'controlled language', i.e. "Squeak Technical English" :-)

--Hannes

------------------------------------------------------------------------------------
http://en.wikipedia.org/wiki/Controlled_language

Controlled natural languages (CNLs) are subsets of natural languages,
obtained by restricting the grammar and vocabulary in order to reduce
or eliminate ambiguity and complexity. Traditionally, controlled
languages fall into two major types: those that improve readability
for human readers (e.g. non-native speakers), and those that enable
reliable automatic semantic analysis of the language.

The first type of languages (often called "simplified" or "technical"
languages), for example ASD Simplified Technical English, Caterpillar
Technical English, IBM's Easy English, are used in the industry to
increase the quality of technical documentation, and possibly simplify
the (semi-)automatic translation of the documentation. These languages
restrict the writer by general rules such as "write short and
grammatically simple sentences", "use nouns instead of pronouns", "use
determiners", and "use active instead of passive".[1]

Re: [Documentation] Spell checking within Squeak - suggestion of a method

Ian Trudel-2
Hannes,

I am not particularly fond of this solution. There are two major
problems: 1) we will have to build our own dictionary and risk
introducing errors, misspellings, etc. and 2) existing spell checkers have
more features (e.g. suggesting close matches when a word is
misspelled).
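
To illustrate what "suggesting close matches" means, here is a small Python sketch. It does not use Hunspell; it relies on the standard library's `difflib` as a stand-in, and the dictionary contents, the `suggest` helper and the cutoff value are assumptions chosen for the example.

```python
import difflib

def suggest(word, dictionary, limit=3):
    """Return up to `limit` dictionary words that closely resemble `word`."""
    # get_close_matches ranks candidates by similarity ratio, best first.
    return difflib.get_close_matches(word.lower(), dictionary, n=limit, cutoff=0.7)

# A tiny made-up dictionary; a real checker would load a full word list.
dictionary = ["receive", "receiver", "method", "class", "instance"]
suggestions = suggest("reciever", dictionary)
```

A home-grown word list would need something along these lines added on top, whereas an existing spell checker provides it already.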

Hunspell is a good choice because it has been peer reviewed over and
over. It is used in projects with a large audience, such as OpenOffice
and Firefox. We may have to define a terminology dictionary but we
won't have to deal with plain English as it is already available.

Ian.


--
http://mecenia.blogspot.com/

Re: [Documentation] Spell checking within Squeak - suggestion of a method

Ian Trudel-2
By the way, copy-and-paste from a word processor is time-consuming and
error-prone. Many of us will skip this step out of laziness (i.e. will NOT do it).
OpenOffice Writer already uses Hunspell anyway. We might as well
integrate it and enforce it. ;)

Ian.
--
http://mecenia.blogspot.com/

Re: [Documentation] Spell checking within Squeak - suggestion of a method

Hannes Hirzel
Unfortunately, my write-up of the solution I outlined is sketchy, and
from your reaction I realise that you did not get the point.

I do not currently have the time to elaborate on this, but you can
be assured that the concept behind it is a workable solution.

--HJH

On 5/1/10, Ian Trudel <[hidden email]> wrote:

> By the way, copy-and-paste from a word processor is time-consuming and
> error-prone. Many of us will skip this step out of laziness (i.e. will NOT do it).
> OpenOffice Writer already uses Hunspell anyway. We might as well
> integrate it and enforce it. ;)
>
> Ian.
> --
> http://mecenia.blogspot.com/