Drawing Blanks

Premature Optimization is a Prerequisite for Success

Inside Xworder: the road to full anagrams 2

with 3 comments

Now that I have described the high-level approach to full anagrams that is used by Xworder, the next thing I’d like to tell is how it finds subanagrams of the given string having the given length. That is, how I implement the function

IEnumerable<string> FindSubanagramsExact(string s, int length)

Previously I described how Xworder’s Scrabble Mode works. The Scrabble mode finds subanagrams simply by scanning the word length index, and I mentioned that it works fast enough. But for finding full anagrams, “fast enough” is not fast enough. We need a faster method.

Fist of all, Xworder has a special index for that. The keys in that index are dictionary words with letters sorted alphabetically. And the values are lists of words that match the key. E.g. the key “aelnp” matches the list {“panel”,”penal”,”plane”}. This index allows to instantly find exact anagrams of a string simply by sorting its letters. The index is a Dictionary<string, List<int>> where the int is the index of the word in the word list.

But how do we generate subanagrams of the given length N using that index? Basically, we need to iterate through all subsets of our string that have the length N (yay, another combinatorial algorithm) and lookup their anagrams. Here is the listing (some details omitted):

        IEnumerable<string> FindSubanagramsExactCombinatorial(string s, int length)
        {
            var index = GetAnagramIndex();
            HashSet<int> result = new HashSet<int>();

            char[] c = s.ToCharArray();
            Array.Sort(c);
            s = new string(c); // s is now sorted

            var combinations = new CombinationGenerator(s.Length, length);
            int[] combination = combinations.GetNext();
            while (combination != null)
            {
                string key = "";
                for (int i = 0; i < length; i++)
                    key += s[combination[i]];
                //key is now a sorted substring of s

                List<int> entries;
                if (index.TryGetValue(key, out entries))
                {
                    //entries are all anagrams of key
                    for (int j = 0; j < entries.Count; j++)
                    {
                        int wordIndex = entries[j];
                        if (!result.Contains(wordIndex))
                            result.Add(wordIndex);
                    }
                }
                combination = combinations.GetNext();
            }
            foreach (int key in result)
                yield return words[key];
        }

I’m going to describe the CombniationGenerator later, and now I just want to mention two things.

One is that the algorithm deals with duplicates in a really ugly manner: it dumps all results into a hashset. I really hate this, but I couldn’t find a good alternative. One alternative I thought about was to generate full anagrams based on set partitions rather than on integer partitions. But that was a little bit over my head and I didn’t have much time to dig into that. I also had some specific reasons to use integer partitions, which I’m going to describe later, when I’m telling about partition generation.

Another thing is that this is still not the final algorithm. Apparently, sometimes (especially for long strings) finding subanagrams of the exact length by scanning all words of that length is still faster than doing that using the above function. So Xworder uses a little bit of “heuristics” to choose the faster method:

        IEnumerable<string> FindSubanagramsExact(string s, int length)
        {
            CombinationGenerator cg = new CombinationGenerator(s.Length, length);
            double combinationCount = cg.GetTotal();
            int wordCount = GetLengthIndex(length).Count;
            if (3 * combinationCount < wordCount)
                return FindSubanagramsExactCombinatorial(s, length);
            else
                return FindSubanagramsExactScan(s, length);
        }
 
This is it. The only things left are partition and combination generation. Stay tuned.

CONTINUIE READING:

https://bbzippo.wordpress.com/2009/12/18/integer-partitions/

https://bbzippo.wordpress.com/2010/01/06/generating-combinations/

Advertisements

Written by bbzippo

12/06/2009 at 11:36 pm

Posted in programming

3 Responses

Subscribe to comments with RSS.

  1. […] Inside Xworder: the road to full anagrams 2 « Drawing Blanks […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: