On 4/30/10, Ian Trudel <[hidden email]> wrote:
> 2010/4/30 Casey Ransberger <[hidden email]>:
>> Ian, thanks. Comments inline.
>
>> I don't know of a spell checker implementation for Squeak. Is there one
>> out there? If not, can you implement one and then get back to me right
>> away? :P
>
> Neither do I know.
>

In contrast to general word processing, our spell-checking needs are
simpler. We are writing technical documentation, and the number of word
forms is more limited (my estimate: somewhere between 3000 and 5000).
This means a simple dictionary of allowed word forms could do the job?

How do we get that dictionary?

Method 1)
We create a Bag of all word forms in the existing comments. The words
which are infrequent are candidates for being misspelt. They can be
flagged and put into another collection.

Method 2)
We paste the list of words obtained from the comments into a regular
word processor and run a spell check there.

A diff against the original word list then gives the misspelt words.

We paste the result back into the Squeak image.

Within the Squeak image resides a collection of acceptable word forms
(as part of the HelpSystem-Tools).

Before accepting a comment this list is consulted.

This process may have to be repeated for a few versions until we have
a comprehensive word list.

Additional benefit: we limit the number of words used, thus making the
texts easier to understand and more consistent.

The idea here is 'controlled language', i.e. "Squeak Technical English" :-)

--Hannes

------------------------------------------------------------------------------------
http://en.wikipedia.org/wiki/Controlled_language

Controlled natural languages (CNLs) are subsets of natural languages,
obtained by restricting the grammar and vocabulary in order to reduce
or eliminate ambiguity and complexity. Traditionally, controlled
languages fall into two major types: those that improve readability
for human readers (e.g. non-native speakers), and those that enable
reliable automatic semantic analysis of the language.

The first type of languages (often called "simplified" or "technical"
languages), for example ASD Simplified Technical English, Caterpillar
Technical English, IBM's Easy English, are used in the industry to
increase the quality of technical documentation, and possibly simplify
the (semi-)automatic translation of the documentation. These languages
restrict the writer by general rules such as "write short and
grammatically simple sentences", "use nouns instead of pronouns", "use
determiners", and "use active instead of passive".[1]
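Method 1 and the check before accepting a comment could look roughly like the following. This is only a minimal sketch, written in Python rather than Squeak Smalltalk, with `Counter` standing in for a Bag. The function names, the `max_count` threshold, and the sample comments are hypothetical and are not taken from HelpSystem-Tools.

```python
import re
from collections import Counter

# A word form is a run of letters, optionally with one apostrophe part ("don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def word_forms(text):
    """Split text into lower-cased word forms."""
    return [w.lower() for w in WORD_RE.findall(text)]


def build_word_bag(comments):
    """Count every word form across all comments (Counter plays the role of a Bag)."""
    bag = Counter()
    for comment in comments:
        bag.update(word_forms(comment))
    return bag


def rare_words(bag, max_count=1):
    """Word forms seen at most max_count times: candidates for review, not proven errors."""
    return sorted(w for w, n in bag.items() if n <= max_count)


def unknown_words(comment, allowed):
    """Word forms in a new comment that are missing from the accepted list."""
    return sorted(set(word_forms(comment)) - set(allowed))


# Hypothetical sample comments; the third one contains a deliberate typo.
comments = [
    "Answer the receiver as a string.",
    "Answer the number of elements in the receiver.",
    "Answer the recevier as a symbol.",
]
bag = build_word_bag(comments)
print(rare_words(bag))

# Assume a reviewer rejected "recevier" and accepted everything else.
allowed = set(bag) - {"recevier"}
print(unknown_words("Answer the recevier.", allowed))
```

The rare-word list here also contains correct words such as "string" and "symbol", so it is only a shortlist for a human to review; the accepted list that results is what a new comment would be checked against.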
Hannes,

I am not particularly fond of this solution. There are two major
problems: 1) we would have to build our own dictionary and risk
introducing errors, misspellings, etc., and 2) existing spell checkers
have more features (e.g. suggesting close matches when a word is
misspelled).

Hunspell is a good choice because it has been peer reviewed over and
over. It is used in projects with a large audience, such as OpenOffice
and Firefox. We may have to define a terminology dictionary, but we
won't have to deal with plain English, as it is already available.

Ian.

2010/4/30 Hannes Hirzel <[hidden email]>:
> On 4/30/10, Ian Trudel <[hidden email]> wrote:
>> 2010/4/30 Casey Ransberger <[hidden email]>:
>>> Ian, thanks. Comments inline.
>>
>>> I don't know of a spell checker implementation for Squeak. Is there one
>>> out
>>> there? If not, can you implement one and then get back to me right away?
>>> :P
>>
>> Neither do I know.
>>
>
> Contrariwise to general word-processing our spell checking needs are
> simpler. We are writing technical documentation and the number of word
> forms is more limited (my estimate - something between 3000...5000).
> This means a simple dictionary of allowed word forms could do the job?
>
> How do we get that dictionary?
>
> Method 1)
> We create a Bag of all word forms in the existing comments. The words
> which are infrequent are candidates for being misspelt. They can be
> flagged and put into another collection.
>
> Method 2)
> We paste the list of words obtained from the comments into a regular
> word processor and run a spell check there.
>
> A diff to the original word list gives then the misspelt words.
>
> We paste the result back into the Squeak image.
>
> Within the Squeak image resides a collection of acceptable word forms
> (as part of the HelpSystem-Tools).
>
> Before accepting a comment this list is consulted.
>
> This process has to be repeated for a few versions maybe until we have
> a comprehensive wordlist.
>
> Additional benefit. We limit the number of words used thus making the
> texts easier to understand and more consistent.
>
> The idea here is 'controlled language' - i.e. "Squeak Technical English" :-)
>
> --Hannes
>
> ------------------------------------------------------------------------------------
> http://en.wikipedia.org/wiki/Controlled_language
>
> Controlled natural languages (CNLs) are subsets of natural languages,
> obtained by restricting the grammar and vocabulary in order to reduce
> or eliminate ambiguity and complexity. Traditionally, controlled
> languages fall into two major types: those that improve readability
> for human readers (e.g. non-native speakers), and those that enable
> reliable automatic semantic analysis of the language.
>
> The first type of languages (often called "simplified" or "technical"
> languages), for example ASD Simplified Technical English, Caterpillar
> Technical English, IBM's Easy English, are used in the industry to
> increase the quality of technical documentation, and possibly simplify
> the (semi-)automatic translation of the documentation. These languages
> restrict the writer by general rules such as "write short and
> grammatically simple sentences", "use nouns instead of pronouns", "use
> determiners", and "use active instead of passive".[1]
>

--
http://mecenia.blogspot.com/
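The "suggesting close matches" feature Ian mentions can be illustrated with a small sketch. It uses Python's standard `difflib` module, not Hunspell, whose suggestion algorithm works differently; the `suggest` helper, the `TERMS` list, and the 0.8 similarity cutoff are assumptions made for illustration only.

```python
import difflib

# Hypothetical terminology list; a real one would come from the project.
TERMS = ["receiver", "selector", "symbol", "collection", "morph"]


def suggest(word, dictionary, limit=3):
    """Return up to `limit` dictionary words that closely resemble `word`."""
    return difflib.get_close_matches(word.lower(), dictionary, n=limit, cutoff=0.8)


print(suggest("recevier", TERMS))  # a transposition typo of "receiver"
print(suggest("xyzzy", TERMS))     # nothing similar enough
```

A plain list of allowed words, as in the dictionary-only proposal, can answer "is this word known?" but needs something like this on top before it can propose a correction.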
By the way, copy-and-paste from a word processor is time-consuming and
error-prone. Many of us will lazy this one out (i.e. will NOT do it).
OpenOffice Writer already uses Hunspell anyway. We might as well
integrate it and enforce it. ;)

Ian.
--
http://mecenia.blogspot.com/
The write-up for the solution I outlined is unfortunately sketchy, and
from your reaction I realise that you did not get the point.

Currently I do not have the time to elaborate on this, but you can be
assured that the concept behind it is a workable solution.

--HJH

On 5/1/10, Ian Trudel <[hidden email]> wrote:
> By the way, copy-and-paste from a word processor is time-consuming and
> error-prone. Many of us will lazy this one out (i.e. will NOT do it).
> OpenOffice Writer already uses Hunspell anyway. We might as well
> integrate it and enforce it. ;)
>
> Ian.
> --
> http://mecenia.blogspot.com/