Drawing Blanks

Premature Optimization is a Prerequisite for Success

Inside Xworder: the Scrabble mode

with 5 comments

Given a bunch of letter tiles, how do you find all words that can be built out of them? The naive approach would be to enumerate all permutations of all subsets of the given set, and look them up in the word list. This would be an extremely slow algorithm. If you have 10 letters, the number of all ordered subsets is about 6 million. Simply scanning the whole word list would work about 30 times faster. But that would still be not fast enough.

Xworder maintains an index of words by their length. It is a SortedList<int, HashSet<int>> where the first int is the word length and the second int is the index of the word in the word list. (Did I mention that Xworder does not use any sort of a database engine, and that it doesn’t use the filesystem at all?).

So in order to find subanagrams I can scan all words whose length does not exceed the total number of letters, and check if they are in fact subanagrams. Surprisingly, this method exhibited decent performance. But I took one further step to make it work faster.

Xworder also has an index that for each of the few most frequent letters (“ETAOINSR”), lists words that do not contain them. So when I see that my set of letters does not contain some of those frequent letters, I can immediately exclude a whole lot of words from my scan. Typically, the remaining word count is so small, that I scan them all without even taking the length into account.

This is it. Xworder’s Scrabble mode works by scanning the word length indexes OR by scanning the frequent letter exclusion indexes.

Note that the algorithms that I’ve just described iterate through HashSets. They say, it’s not the fastest data structure for iteration. In theory, SortedList should be must faster to iterate through. In addition, SortedList should consume less memory. I did test my algorithms with SortedLists instead of HashSets and did not find any measurable difference in the speed of iteration. That, and the fact that HashSet.IntersectWith() is extremely fast, made me choose HashSet as the preferred data structure for most of my indexes.

Below is a code snippet (some details omitted) that implements what I described above. It is really simple. It is so simple that after I wrote it I thought that implementing Full Anagrams would be a piece of cake too. But they appeared to be much, much tougher…

            HashSet<int> freqIndex = GetFrequentInclusionIndex(s);
            if (freqIndex.Count > 0)
                var q = from i in freqIndex
                        words[i].Length <= s.Length && words[i].Length >= minLength
                        && IsSubstring(words[i], s)
                        select words[i];
                foreach (string r in q)
                    yield return r;
                for (int l = s.Length; l >= minLength; l--)
                    var index = GetLengthIndex(l);
                    foreach (int i in index)
                        string w = words[i];
                        if (IsSubstring(w, s))
                            yield return w;

Written by bbzippo

11/16/2009 at 7:03 am

Posted in programming

5 Responses

Subscribe to comments with RSS.

  1. […] I described how Xworder’s Scrabble Mode works. The Scrabble mode finds subanagrams simply by scanning the word length index, and I mentioned that […]

  2. If you aren’t using a database, or the file system, how are you storing all the words and phrases? I can see words being stored in memory, but anagram mode supports phrases 20 characters in length!


    05/26/2010 at 5:47 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: