Hello all and especially Michael Lucas-Smith, the original author of
the Spellchecker2 package :-)

A week ago I posted a question about a problem with #suggestWords: in Spellchecker2 but so far got no response whatsoever. Hence I would like to know the answers to the following questions:
* Is there anyone using the package?
* If so, do you use the #suggestWords: API? Does it work as expected for you?
* If not, what do you use to add spell-checking support to your apps?
* Is there someone actively participating in the development or maintenance of the package?

Thank you,

Ladislav Lenart

-------- Original Message --------
Subject: [Q] Spellchecker2
Date: Fri, 29 Apr 2011 15:15:18 +0200
From: Ladislav Lenart <[hidden email]>
To: [hidden email]

Hello.

We use Spellchecker2 to spell-check text-like documents in our application (#hasWord:) and we are now in the process of adding auto-correct suggestions (#suggestWords:). The first part works like a charm, but I have trouble with the second.

The LevenshteinAutomata class comment says it implements the following edit operations for a distance of one: Add, Remove, Replace, Transpose, Merge, Split. However, it seems to me that it does not return all valid candidates. I can demonstrate the problems with the following vocabulary:

v := Spellchecker2.Vocabulary new.
v mutableWhile: [:n |
    #('a' 'b' 'aa' 'ab' 'ba' 'bb' 'ac') do: [:e | n addWord: e]].

[1] v suggestWords: 'aa'.
Returns: #('a' 'aa'). I would expect the following:
#(
    "deletion" 'a'
    "identity" 'aa'
    "replace" 'ba'
    "replace" 'ab'
    "replace" 'ac'
).

[2] v suggestWords: 'ca'.
Returns an empty collection. I would expect the following:
#(
    "deletion" 'a'
    "replace" 'aa'
    "replace" 'ba'
    "transpose" 'ac'
).

Are these bugs or am I missing something?

Thx,

Ladislav Lenart

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
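For readers who want to sanity-check the expected results above, here is a small brute-force sketch in Python (in the style of Norvig's spell corrector, which comes up later in this thread). The vocabulary and the three-letter alphabet come from the example; everything else is illustrative and is not Spellchecker2 code:

```python
def edits1(word, alphabet="abc"):
    """All strings within one edit: delete, transpose, replace, insert.
    Replacing a letter with itself yields the identity match."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

vocabulary = {"a", "b", "aa", "ab", "ba", "bb", "ac"}

def suggest(word):
    return vocabulary & edits1(word)

print(suggest("aa"))  # {'a', 'aa', 'ab', 'ac', 'ba'} -- matches the expectation above
print(suggest("ca"))  # {'a', 'aa', 'ac', 'ba'}
```

This brute force agrees with the expected lists in both cases [1] and [2], which supports the bug report.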
On May 5, 2011, at 2:29 AM, Ladislav Lenart wrote:
> [...]
> * Is there someone actively participating in development or maintenance
> of the package?

Yep. It's on my todo list to look at your bug report. Do you have the same problem with longer words, or just the short words?

Michael
Hello Michael!
First of all, thank you very much for jumping in, much appreciated :-) Secondly, sorry for a longish post.

Now to your question... I noticed this at first while testing the almost-finished feature in our application. We created our own Czech vocabulary. The problem occurs with longer words too, see for yourself:

wlist := #(
    'abeceda'  "Czech for the word alphabet"
    'abecedy'  "alphabets"
    'abeced'   "one of the grammatical cases of the word alphabet"
    'abecední' "alphabetic"
).
voc := Spellchecker2.Vocabulary new mutableWhile: [:node | wlist do: [:each | node addWord: each]].
voc suggestWords: 'abecedn'.
"result: #('abeced')
expected: #(
    'abeced'   ""remove""
    'abeceda'  ""replace""
    'abecední' ""insertion""
    'abecedy'  ""replace""
)."
voc suggestWords: 'baeced'.
"result: #()
expected: #(
    'abeced' ""transpose""
)."

I am also puzzled by the following:

wlist := #(
    'collect'
    'connect'
    'correct'
    'erect'
    'forect' "The word does not exist, I just wanted to try something like this."
).
voc := Spellchecker2.Vocabulary new mutableWhile: [:node | wlist do: [:each | node addWord: each]].
voc suggestWords: 'corect'.
"result: #(
    'collect' ""why?""
    'connect' ""why?""
    'correct' ""insertion""
    'erect'   ""why?""
    'forect'  ""replace""
).
expected: #(
    'correct' ""insertion""
    'forect'  ""replace""
)."
voc suggestWords: 'xorect'.
"result: #(
    'erect'  ""why?""
    'forect' ""replace""
).
expected: #(
    'forect' ""replace""
)."
voc suggestWords: 'ocrrect'.
"result: #()
expected: #(
    'correct' ""transpose""
)."

But this might stem partially from the fact that I don't know what the merge and split operations do. Can you elaborate a little?

During my development with the Spellchecker2 package I also found a few minor glitches:
* #asMutableVocabulary returns aNode whereas #asImmutableVocabulary returns aByteArray (the vocabulary's data). I suggest adding something like:

Node>>immutableVocabularyDataCopy
    ^self asImmutableVocabulary immutableVocabularyDataCopy. "Or introduce #data as a private accessor."
Vocabulary>>immutableVocabularyDataCopy
    ^data copy.

and implementing #asMutableVocabulary and #asImmutableVocabulary as normal Node <-> Vocabulary conversion methods:

Node>>asMutableVocabulary
    ^self.

Node>>asImmutableVocabulary
    ^Vocabulary data: self immutableVocabularyDataCopy.

Vocabulary>>asMutableVocabulary
    "The current implementation."

Vocabulary>>asImmutableVocabulary
    ^self.

* I think #allWords is broken because it tries to convert each byte to a character, which does not hold for UTF-8 encoded strings. The similar #asOrderedCollection works fine, though I dislike the name (but that's just my subjective opinion really). I suggest renaming #asOrderedCollection to #allWords. I would also like to have a method like:

Node>>allWordsXXX
    ^self allWords collect: [:each | each asStringEncoding: #utf8].

but I don't know how to name the two properly.

* The method #loadWordListFilename: cannot be used on a file with a different encoding, because aFilename is sent #asFilename. I suggest one of (in no particular order of preference):
  * Remove the #asFilename message send from the method.
  * Completely remove #loadWordListFilename: and keep only #loadWordList:.
  * Add #loadWordListFilename:encoding:.
  * Call #asFilename on the argument only if it is a string (I don't like this one, but it is an option nonetheless).

* I would also recommend adding SUnitToo test cases because this functionality is really easy to test. (Or is there a standalone Spellchecker2 test package somewhere in PR?)

Thank you again for your time and this all-in-all very useful package.

Looking forward to hearing from you and please keep me informed,

Ladislav Lenart

On 5.5.2011 16:18, Michael Lucas-Smith wrote:
> [...]
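The #allWords issue above (bytes decoded one at a time instead of per complete word) is easy to see with a byte-keyed trie. The sketch below is a hypothetical Python analogue of the Vocabulary idea, not Spellchecker2's actual layout; all names here are illustrative:

```python
class ByteTrie:
    """Minimal trie keyed on UTF-8 bytes (a sketch, not the package's format)."""

    def __init__(self):
        self.children = {}    # byte value (0..255) -> child ByteTrie
        self.terminal = False  # does a word end at this node?

    def add_word(self, word):
        node = self
        for byte in word.encode("utf-8"):
            node = node.children.setdefault(byte, ByteTrie())
        node.terminal = True

    def has_word(self, word):
        node = self
        for byte in word.encode("utf-8"):
            node = node.children.get(byte)
            if node is None:
                return False
        return node.terminal

    def all_words(self):
        """Decode each complete byte sequence at the end; decoding
        byte-by-byte is exactly what breaks on multi-byte UTF-8."""
        def walk(node, prefix):
            if node.terminal:
                yield bytes(prefix).decode("utf-8")
            for byte, child in sorted(node.children.items()):
                yield from walk(child, prefix + [byte])
        return list(walk(self, []))

trie = ByteTrie()
for w in ["abeceda", "abecední", "abeced"]:
    trie.add_word(w)
print(trie.has_word("abecední"))  # True
print(trie.all_words())           # ['abeced', 'abeceda', 'abecední']
```

Note that 'abecední' contains 'í', which is two bytes in UTF-8, so decoding each byte as a character would produce mojibake; collecting the bytes per word and decoding once keeps it intact.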
Hello again.
One last thing I forgot to mention in my previous post... The following (empty and one-letter strings):

voc suggestWords: ''.
voc suggestWords: 'x'.

cause a SubscriptOutOfBoundsError in LevenshteinAutomata>>accept:, because depth is still zero. The following fixes it, though I am not 100% sure that it does not break anything else:

LevenshteinAutomata>>accept: depth
    | result |
    depth isZero ifTrue: [^self]. "FIX (added this line)"
    (vocabulary hasTerminal: (history at: depth)) ifFalse: [^self].
    result := (ByteArray new: depth) writeStream.
    1 to: depth do: [:index | result nextPut: (vocabulary valueAt: (history at: index))].
    results add: result contents

On the other hand, neither of the error cases is a valid real-life example.

Ladislav Lenart

On 5.5.2011 16:18, Michael Lucas-Smith wrote:
> [...]
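Although the crashing inputs are degenerate, they do have well-defined answers, which a brute-force reference makes obvious. The sketch below (illustrative Python, not Spellchecker2 code) filters the toy vocabulary from the first mail by optimal-string-alignment distance, the standard Levenshtein recurrence extended with adjacent transposition:

```python
def osa(a, b):
    """Optimal string alignment: Levenshtein + adjacent transposition."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # replace / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[len(a)][len(b)]

vocabulary = {"a", "b", "aa", "ab", "ba", "bb", "ac"}

def suggest(word):
    return {w for w in vocabulary if osa(word, w) <= 1}

print(suggest(""))   # {'a', 'b'} -- the empty word is one insert from each one-letter word
print(suggest("x"))  # {'a', 'b'} -- one replace away
```

So the guard clause in #accept: is the right shape: with depth zero there is simply no byte sequence to emit, and the one-letter suggestions come from other branches of the search.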
Hi Ladislav,
Thank you for hammering that part of the package so well. It's been a long time since I've used that behavior. I've rewritten it half a dozen times and I guess the final version just wasn't up to scratch. So -- the RB only uses the "Is this word in the dictionary" behavior, not the "suggest me a correction" behavior.

I found a lot lacking in my old implementation of the Automata. So, see the latest publish for a completely rewritten version. The code, I believe, is a lot more straightforward while still preserving the same execution-time characteristics.

I swapped from the Merge/Split automaton to the Transpose automaton. A regular automaton would miss out on the transposes without going to n=2, which bloats up the code, increases the execution time and introduces lots of extra hits you'd never expect. So, transpose seemed like the way to go. I've been seeing some truly bizarre matches though, such as 'tset' matching 'teeth'.. why? Not a clue! But if you enable the debug information, it looks as though I'm doing everything correctly.. the transition table for the transpose automaton apparently thinks this is an a-okay answer. Maybe it is.

I have not gone through and implemented your improved messages and behavior, but I did fix matching internationalized characters in suggestions (though not #allWords.. #wordsDo: is probably much more useful).

The Czech example has a strange answer with transpose too: baeced brings back abeced, abeceda and abecedy, and again it looks like the transpose automaton is saying that's what it wants. The various English examples appear to be working as you expect now.

Let me know how you fare with it and, where possible, enable the debug information in #word:vocabulary: as well as #compute: when discussing mismatches and your theories as to why they match unexpectedly.

Cheers,
Michael

On May 6, 2011, at 2:31 AM, Ladislav Lenart wrote:
> [...]
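The 'tset' vs 'teeth' puzzle can be checked by simply enumerating every string one edit away from 'tset' (delete, transpose, replace, insert), the same operation set the transpose automaton is meant to cover. This is an illustrative Python sketch, not the automaton itself; the alphabet is restricted to the letters of the two words:

```python
def edits1(word, alphabet):
    """All strings exactly one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

candidates = edits1("tset", "ehst")
print("teeth" in candidates)  # False -- no single edit turns 'tset' into 'teeth'
print("test" in candidates)   # True  -- one transposition does
```

'teeth' is at least two edits away from 'tset' (replace 's' with 'e', then insert 'h'), so an n=1 automaton that accepts it is indeed answering incorrectly; the transition table is not "a-okay" here.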
Hello.
First of all, thank you for such a quick reaction!

I decided to implement a 'reference' implementation of the Levenshtein algorithm with n=1. I took it from

http://norvig.com/spell-correct.html

and rewrote it in Smalltalk (from the Python original). See the attached file-out of the package. You might find it useful to prove that your fast implementation is definitely correct.

With this in place, I can confirm that the following is indeed a bug in the Spellchecker2 implementation:

wlist := #('teeth').
voc := Spellchecker2.Vocabulary new mutableWhile: [:node | wlist do: [:each | node addWord: each]].
voc suggestWords: 'tset'.
"result: #('teeth')
expected: #()."

Using our Czech dictionary, I found a more complex example of the same bug:

wlist := #(
    'kobalt' "Czech for cobalt"
    'kobylí' "horsy"
    'kobyly' "horses"
    'konal'  "he performed"
    'konala' "she performed"
    'konali' "men performed"
    'konalo' "it performed"
    'konaly' "women performed"
    'obal'   "wrapping; the rest are its various singular/plural cases"
    'obale'
    'obalem'
    'obalu'
    'obalů'
    'obalům'
    'obaly'
).
voc := Spellchecker2.Vocabulary new mutableWhile: [:node | wlist do: [:each | node addWord: each]].
voc suggestWords: 'kobal'.
"result: #('kobalt' 'kobylí' 'kobyly' 'konal' 'konala' 'konali' 'konalo' 'konaly' 'obal' 'obale' 'obalem' 'obalu' 'obalů' 'obalům' 'obaly').
expected: #('kobalt' 'konal' 'obal')."

If you look at all those unexpected suggestions, you can see they *all* have one thing in common: if you ignore what's left after the misspelt word has been processed, they all have n=1.
Observe for 'kobal' from the example above:
* kobyl|í (one replace if you ignore the rest)
* kobyl|y
* konal|a
* konal|i
* konal|o
* konal|y
* obal|e (k was processed as a deletion, hence the input ends here)
* obal|em
* obal|ů
* obal|ům
* obal|y

The example with 'tset' works exactly the same way:
* teet|h (one replace if you ignore the rest)

It seems to me that the LA is missing a terminal condition, but unfortunately I don't understand its internals (transition specs and all that stuff). Nevertheless I hope this little observation will help you fix it :-)

I also took a look at the new implementation and have the following questions:
* In the LA's method #compute, #wordsDo:maxDepth: is called with maxDepth being word size + 2, where word is aString (of CHARS), whereas the depth is the length of a BYTE sequence. Is this really correct? I would expect the following maxDepth limit expression: (word size + 2) * 4 "the worst case being aFourByteString".
* The Vocabulary class comment states that a node can have a total of 256 children and that a byte is used to store this number. But a node can have at most 255 children when a byte range is used. Hence it seems that Vocabulary is unable to represent all possible words (byte sequences). What am I missing?

Thank you, and let me know how I can assist you further,

Ladislav Lenart

On 7.5.2011 02:40, Michael Lucas-Smith wrote:
> [...]
SuggestWordsRefImpl.st (21K)
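The maxDepth question above turns on the difference between character count and UTF-8 byte count, which is easy to demonstrate. A short illustrative Python check (the example word comes from the thread; the limit expression is Ladislav's proposal, not confirmed Spellchecker2 behavior):

```python
word = "abecední"                 # the Czech example; 'í' (U+00ED) is two bytes in UTF-8
print(len(word))                  # 8 characters
print(len(word.encode("utf-8")))  # 9 bytes -- depth in a byte trie already exceeds word size

# UTF-8 needs at most four bytes per character (characters outside the
# Basic Multilingual Plane), hence the proposed (word size + 2) * 4 limit:
samples = ["a", "\u00ed", "\u20ac", "\U0001F600"]  # 1, 2, 3 and 4 bytes
print([len(c.encode("utf-8")) for c in samples])   # [1, 2, 3, 4]
```

So a depth limit computed from the character count can cut off legitimate matches as soon as the word contains any non-ASCII character, which is consistent with Michael's later decision to drop the maxDepth: test from #compute.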
On May 9, 2011, at 8:37 AM, Ladislav Lenart wrote:

> I decided to implement a 'reference' implementation of
> the Levenshtein algorithm with n=1. [...] You might find it useful
> to prove that your fast implementation is definitely correct.

You can take the transition table and a word you want to create a Levenshtein automaton for, and create it backwards from the table. I haven't yet done this, but I do believe your observation that its terminal condition is missing is correct. I've not yet figured out how or why, though. Have you attempted to read the paper referenced in the class description of LevenshteinAutomata? It's a heavy and often confusing read, but maybe you'll notice some detail that I've missed.

> * In the LA's method #compute, #wordsDo:maxDepth: is called with
> maxDepth being word size + 2, where word is aString (of CHARS),
> whereas the depth is the length of a BYTE sequence. Is this really
> correct? I would expect the following maxDepth limit expression:
> (word size + 2) * 4 "the worst case being aFourByteString".

This is a bug; I need to get rid of the maxDepth: test from the #compute method. Traversing the whole dictionary doesn't really add that much time to the overall computation (longer words are rarer than shorter words, so the time saved is marginal).

> * The Vocabulary class comment states that a node can have a total
> of 256 children and that a byte is used to store this number.
> But a node can have at most 255 children when a byte range is
> used. Hence it seems that Vocabulary is unable to represent
> all possible words (byte sequences). What am I missing?

I'm not sure I understand your question here. 0..255 is 8 bits, which is 256 combinations.

Cheers,
Michael
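The two are talking slightly past each other here: a byte has 256 distinct values (0..255), but if the byte stores the count itself, the count 256 (a node with every byte value as a child) is unrepresentable. One common fix for that off-by-one — purely a hypothetical sketch; whether Spellchecker2 needs or uses it is not confirmed — is to store count - 1, since an interior node always has at least one child:

```python
def encode_child_count(n):
    """Store counts 1..256 in a single byte by biasing by one.
    Assumes childless leaves are encoded some other way."""
    if not 1 <= n <= 256:
        raise ValueError("a stored interior node has between 1 and 256 children")
    return bytes([n - 1])

def decode_child_count(b):
    return b[0] + 1

for n in (1, 128, 256):
    assert decode_child_count(encode_child_count(n)) == n
print("counts 1..256 round-trip through one byte")
```

The rewrite Michael describes next (UTF-8 bytes in the binary format, values pushed to the end of the node) made this moot for Ladislav, as he notes below.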
On May 9, 2011, at 9:13 AM, Michael Lucas-Smith wrote:
> You can take the transition table and a word you want to create a Levenshtein automaton for, and create it backwards from the table. I haven't yet done this, but I do believe your observation that its terminal condition is missing is correct. [...]

Having just said that, I believe I've just fixed it in version 24.

Cheers,
Michael
On May 9, 2011, at 9:34 AM, Michael Lucas-Smith wrote:
> Having just said that, I believe I've just fixed it in version 24.

Now that I had the algorithm working, I went back to benchmark it and discovered that it was running very slowly.. instead of 10ms it was running in, say, 1.7 seconds. That's pretty darned awful. So I dug into it, and a lot of the time was wasted in decoding bytes into strings. The only way to solve that -and- add in the maxDepth test was to change the binary format. The binary format no longer stores raw character bytes, but UTF-8 encoded bytes. This pushed the value of each Trie node to the end of the data structure, to avoid having to interpret the UTF-8 bytes just to access the children structure. As such, a lot of code changed. I took the liberty of renaming Node to Trie while I was at it. Vocabulary and Trie are now subclasses of Collection too, so #asOrderedCollection, #includes:, #add: and #isEmpty all work in a consistent manner.
While I was working in this area, Niall and I chatted and we have updated TextHighlighting to include the ability for a Highlighter to mess with a text editor's menu. SpellcheckHighlighting now does exactly that. If your input cursor is in a word that is misspelt, a right click will now show a new menu option 'Suggestions...' which will offer you corrections from English and the SymbolTable. Yes, the language should be pluggable, but until we have a Czech dictionary working from Ladislav we basically only have English :) ...so, when we have more dictionaries, we can expand this out to a full setting.

Spellchecker2 32

Cheers,
Michael
Hello Michael!
I wanted to reply to you yesterday, immediately after your (first) post, but did not have enough time to finish it. Sorry for that.

[ByteArray->String conversion bottleneck]
Good catch. That was going to be one of my points in my unfinished e-mail :-)

[Node with 256 children - 256 does not fit into 1 byte]
The new implementation of Vocabulary has rendered my remark completely moot. Excellent idea, by the way. Also thank you for incorporating my suggestions. Much appreciated :-)

[Regarding the Czech dictionary]
What we currently have is only a first and incomplete version of a Czech dictionary for one specific profession, government legislation. I am building a generic Czech dictionary from the aspell source as I write this e-mail, but it will contain plenty of suspicious words (e.g. many of them appear to be the first names of foreigners or other specific entities). However, here is a quick comparison of their sizes:
* The English dictionary has about ~70000 words.
* Our specific Czech dictionary currently has ~81000 words (but nearly all miss at least some grammatical cases).
* The aspell dictionary has ~4700000 (yes, 4.7 million) words (including all grammatical cases of all of them).
I will post the results here when it's done.

[Spellchecker2 33]
I've just tested Spellchecker2 version 33 from the PR with my test suite (yesterday's file-out) and I've found what I believe are the last two bugs:

[1] Sending #includes: (former #hasWord:) to an empty Vocabulary fails with SubscriptOutOfBoundsError, because #sizeAt: tries to access empty data. Trie works correctly in this case.

[2] The following does not work as it should:

wlist := #('abcd').
voc := Spellchecker2.Vocabulary new
    mutableWhile: [:trie | wlist do: [:each | trie add: each]].
"Note that Trie does not work either. The bug is somewhere in the LA."
voc suggestWords: 'abcXd'.
"result: #() expected: #('abcd')"

Debug output of the above (which I don't understand well enough to comment on further) is:

abcXd = D4 E4 F4 A5 C5 A6 B6
abcd  = D3 E3 F3 A4 C4 A5 B5
A1 a abc -> A2
A2 b bcX -> A3
A3 c cXd -> A4
A4 d Xd  -> F4

Note that according to my test suite, this is the only bug in the LA's logic. All other deletes, and all inserts, replaces and transpositions, work correctly now. Excellent work!

Ladislav Lenart

On 9.5.2011 22:58, Michael Lucas-Smith wrote:
> [snip]
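The failing case above is a single-deletion candidate. As a point of reference, the distance-1 edit generation from Norvig's article (the Python original that the Smalltalk reference implementation in this thread was ported from) can be sketched as follows; deleting the 'X' from 'abcXd' yields 'abcd', so a correct n=1 suggester must offer it:

```python
def edits1(word):
    """All strings at edit distance 1 from `word`:
    deletions, transpositions, replacements, and insertions."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print('abcd' in edits1('abcXd'))  # True: one deletion away
print('aa' in edits1('ca'))      # True: one replacement away
```

This brute-force generator materializes every candidate (54n+25 of them for a length-n word over 26 letters), which is exactly what the LevenshteinAutomata avoids by walking the trie and pruning instead.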
In reply to this post by Michael Lucas-Smith-2
Hello again.
I finally managed to build a generic Czech vocabulary from the aspell source, though it was a bit harder than I originally thought... I had to run it on my coworker's machine in the end. Mine just couldn't handle it. Some interesting data for the #loadWordList: task:
* PC model name: Intel Core2 Quad CPU Q6600 @ 2.40GHz (64-bit)
* RAM: 8 GB
* Total number of words: 4 669 281
* Time to run: 13 minutes 38.333799 seconds
* Total memory consumed by the task (Trie representation): 4.5 GB
* Size of the resulting Vocabulary's data ByteArray: 2 272 498 B
* Rough approximation of the Trie/Vocabulary compression ratio: ~2000:1 (= 4500/2.25)
* 4.5 GB is how much the heap shrank at the end. I am sure there was some garbage, but I assume that most of it was occupied by the Trie graph representation.
* Word list file/Vocabulary compression ratio: ~28:1 (= 63/2.25).

I have a TGZ archive with a file-out of the resulting Czech class in the Spellchecker2 namespace (done in the same spirit as the English vocabulary). However, the size of the file is 1.5 MB. Can I e-mail it to your private account? I am not sure vwnc allows attachments of that size.

Ladislav Lenart

On 9.5.2011 22:58, Michael Lucas-Smith wrote:
> [snip]
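For readers unfamiliar with the data structure being measured above: a trie stores each word as a path of characters, so shared prefixes are stored only once, which is where most of the compression comes from. A minimal dict-based sketch (hypothetical, not the Spellchecker2 byte-encoded format):

```python
class Trie:
    """Minimal prefix tree: each node is a dict of children,
    with a sentinel key marking 'a word ends here'."""
    END = '$'  # sentinel; assumes '$' never occurs inside stored words

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self.END in node

t = Trie()
for w in ('byt', 'byty', 'bytem'):  # shared 'byt' prefix stored once
    t.add(w)
print('byty' in t, 'by' in t)  # True False
```

Flattening such a graph into a single ByteArray, as Vocabulary does, trades pointer-heavy heap nodes for a compact scan-friendly encoding, consistent with the ~2000:1 ratio reported above.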
Hey, interesting statistics. Why not publish it as Spellchecker2-Czech? Then anyone interested can load it up and have it available in their image. In fact, if it works (barring any punctuation rules we're getting wrong) you could add it to SpellHighlighting class>>dictionaries and start spell checking Czech words in your comments :)
I've published Spellchecker2-Tests to the public Store repository, with all the tests passing. I'd appreciate it if you applied some of your tests to that as well, since you've done such a good job of identifying flawed cases.

Cheers,
Michael

On May 10, 2011, at 6:14 AM, Ladislav Lenart wrote:
> [snip]
In reply to this post by Ladislav Lenart
The original authors of the paper were testing against German (6 million words) and Bulgarian (870,000 words). So it'll be interesting to see what sorts of results you get for #includes: and #suggestWords: when running against the 4.6 million words.
Cheers,
Michael

On May 10, 2011, at 6:14 AM, Ladislav Lenart wrote:
> [snip]
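Before wiring the real Vocabulary in, a quick way to get a feel for lookup cost at scale is a throwaway benchmark. This sketch uses a plain Python set and randomly generated pseudo-words (far smaller than the 4.6-million-word Czech list, and purely illustrative — the real test would load the aspell word list):

```python
import random
import string
import timeit

random.seed(0)
# Generate 100k pseudo-words of length 3-12.
words = {''.join(random.choices(string.ascii_lowercase, k=random.randint(3, 12)))
         for _ in range(100_000)}
probe = next(iter(words))

t = timeit.timeit(lambda: probe in words, number=100_000)
print(f"~{t / 100_000 * 1e9:.0f} ns per #includes:-style membership test")
```

The interesting comparison for Spellchecker2 is how #includes: and #suggestWords: behave between the ~70k English list and the 4.6M Czech one; the automaton-over-trie approach should be bounded mainly by the portion of the trie it walks rather than by the raw word count.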
In reply to this post by Michael Lucas-Smith-2
Hello.
I've just published the Spellchecker2-Czech package to the Public Repository (on an old company account, because I was unable to obtain a new one for myself via the Cincom website, or maybe it just does not work yet).

I've also taken a look at your Spellchecker2-Tests package, and to the best of my knowledge it tests everything. It contains exactly the same tests I invented, plus some more (the #deny: variants). So I think Spellchecker2 is fully functional now. Nevertheless, I am going to test it tomorrow in our application once again :-)

Thank you very much,

Ladislav Lenart

On 10.5.2011 15:17, Michael Lucas-Smith wrote:
> [snip]
Hello.
I hope you don't mind, but I've just published a new, slightly refactored version of the Spellchecker2-Tests package to the PR:
* Added #vocabularyClass, which is the only difference between the SpellingTest subclasses.
* Added #assertSuggestWords:produces:.
* Rewrote the existing tests using the above assert.
* Added one failing test: #testSuggestWords_zakro. I don't have a clue about the cause of the false suggestion. Please take a look at it. Nevertheless, I am going to use the #suggestWords: feature in our product.

Thank you,

Ladislav Lenart

On 10.5.2011 17:29, Ladislav Lenart wrote:
> [snip]
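For readers following along outside Smalltalk, the shape of that #assertSuggestWords:produces: helper is easy to mirror. This Python sketch uses a hypothetical naive suggester (substitution, insertion, and deletion only — the Damerau transpose case the real automaton also handles is omitted for brevity), not the Spellchecker2 API:

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one substitution,
    insertion, or deletion (transpositions not handled)."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter string
    for i in range(len(a)):
        if a[i] != b[i]:
            if len(a) == len(b):
                return a[i + 1:] == b[i + 1:]  # one substitution
            return a[i:] == b[i + 1:]          # one deletion from b
    return True  # b is a with one trailing character appended

def suggest_words(vocab, word):
    """Hypothetical stand-in for Vocabulary>>suggestWords:, naive scan."""
    return sorted(w for w in vocab if within_one_edit(w, word))

def assert_suggest_words(vocab, word, expected):
    """Analog of the #assertSuggestWords:produces: helper."""
    assert suggest_words(vocab, word) == sorted(expected), (vocab, word)

# The 'abcXd' case discussed earlier in the thread:
assert_suggest_words(['abcd', 'zzz'], 'abcXd', ['abcd'])
```

Expressing each case through one assert helper, as the refactored test package does, keeps the expected-suggestion lists in one readable place per test.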