Hi all,
I wish everybody the best for the still new year.

I thought it would be part of the Collection hierarchy, but I cannot find anything for indexed binary searching.

The problem is that I have around 10,000 hard-coded character patterns which I want to match against natural-language words to find the potential hyphenation points. In simplified terms, a pattern like 'put-er' puts the $- into the word 'comput-er', for example.

Currently I am naively matching every single pattern against each word by string searching, which costs me perhaps 20 ms per word. This *needs* to be faster by a factor of 10 at least. Caching words helps a lot, but IMHO this is not a solution.

How would you implement a faster solution? How can I use existing collections for that purpose?

Does anybody disagree with the hyphenation-pattern method at all?

Thank you for helping

Thomas J. Schrader

--
mailto thomas j schrader at web de
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Thomas,
What would I try? I would use a variant of the trie algorithm. This algorithm was used by the award-winning application of the 2008 dynamic languages shootout. You can find information in the paragraph "Seaside mit Smalltalk gewinnt Dynamic Languages Shootout des Java-Spektrums" ("Seaside with Smalltalk wins the Java-Spektrum Dynamic Languages Shootout") at the bottom of the start page of www.heeg.de. As far as I remember, the complete implementation, including the trie, is available in the public Store repository.

Georg

Georg Heeg eK, Dortmund and Köthen, HR Dortmund A 12812
Tel. +49-3496-214328, Fax +49-3496-214712

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On behalf of Thomas Schrader
> Sent: Wednesday, 27 January 2010 10:25
> To: [hidden email]
> Subject: [vwnc] a simple TextHyphenator, IndexedBinarySearchTree
>
> [...]
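To make the trie suggestion concrete, here is a minimal sketch of how hyphenation patterns could be stored in a trie and matched against a word. It is not the implementation from the shootout application or from the public Store repository; the names `PatternTrie`, `add`, `hyphen_points` and `hyphenate` are invented for illustration, and it uses only the simplified pattern format from the original question ('put-er'). It is written in Python purely as neutral pseudocode; the same structure could presumably be built in Smalltalk from nested Dictionary instances.

```python
class PatternTrie:
    """Hypothetical trie keyed on pattern letters.

    A node is a plain dict mapping a character to a child node.
    The key None marks the end of a pattern and holds the hyphen
    offsets (positions inside the pattern's letters) for it.
    """

    def __init__(self):
        self.root = {}

    def add(self, pattern):
        # Split 'put-er' into the letters 'puter' and the offsets [3].
        letters, offsets = [], []
        for ch in pattern:
            if ch == '-':
                offsets.append(len(letters))
            else:
                letters.append(ch)
        node = self.root
        for ch in letters:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).extend(offsets)

    def hyphen_points(self, word):
        """Return sorted indices in word before which a hyphen may go."""
        points = set()
        for start in range(len(word)):
            # Walk the trie from every start position; the walk stops
            # as soon as no stored pattern continues with this letter.
            node = self.root
            for i in range(start, len(word)):
                node = node.get(word[i])
                if node is None:
                    break
                for offset in node.get(None, ()):
                    points.add(start + offset)
        return sorted(points)


def hyphenate(trie, word):
    """Insert '-' at every interior hyphenation point found by the trie."""
    points = set(trie.hyphen_points(word))
    out = []
    for i, ch in enumerate(word):
        if i > 0 and i in points:
            out.append('-')
        out.append(ch)
    return ''.join(out)


trie = PatternTrie()
trie.add('put-er')
print(hyphenate(trie, 'computer'))  # prints comput-er
```

The point of the design is that the work per word depends on the word length and the longest pattern, not on the number of patterns: each start position follows at most one path through the trie, so 10,000 patterns cost roughly the same as 10.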
In reply to this post by Thomas Schrader
> I would use a variant of the trie algorithm.

Cool! Fast!! Thanks a lot.

Cheers
Thomas J. Schrader
In reply to this post by Thomas Schrader
Thomas,
Have you looked at the VW package "UIBasics-Internationalization", which has support for indexed searches in the message catalogs? Take a look at the classes UserMessage and IndexedFileMessageCatalog; maybe you will find what you need there.