Drawing Blanks

Premature Optimization is a Prerequisite for Success

Archive for November 2009

Inside Xworder: the road to full anagrams

with 3 comments

What’s a full anagram? It’s one or more words that can be built by rearranging all letters from the given set. E.g. “Clint Eastwood” = “old west action” = “cows and toilet”, “Paul McCartney” = “clean up my cart” = “my nuclear pact”.

I’m going to present some algorithms that Xworder uses to generate them. I’ll start with a function that generates anagrams matching the given integer partition. What’s an integer partition?
4 = 2 + 2 = 2 + 1 + 1 = 1 + 3 = … – these are partitions of 4. So let’s say we have 4 letters. Each partition gives us a possible shape for an anagram: two two-letter words, or a one-letter word plus a three-letter word, and so on.

Here’s the function that builds all anagrams of the given string matching the given partition:

        // Yields all full anagrams of s whose word lengths match the given partition.
        IEnumerable<string> FindFullAnagrams(string s, IEnumerable<int> partition)
        {
            int partitionHead = partition.First();
            var partitionTail = partition.Skip(1);

            // Words of length partitionHead that can be built from the letters of s.
            var headAnagrams = FindSubanagramsExact(s, partitionHead);

            if (!partitionTail.Any())
            {
                // Last part of the partition: each head anagram uses up all remaining letters.
                foreach (string a in headAnagrams)
                    yield return a;
            }
            else
            {
                foreach (string headAnagram in headAnagrams)
                {
                    // Remove the letters we've used and recurse on the rest of the partition.
                    string remainder = SubtractString(s, headAnagram);
                    foreach (string a in FindFullAnagrams(remainder, partitionTail))
                        yield return string.Concat(headAnagram, " ", a);
                }
            }
        }
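
The snippet relies on a SubtractString helper that removes the letters of one string from another; it isn’t shown above. Here is a minimal sketch of how it might look (my own version for illustration, not necessarily what Xworder does):

        // Removes each letter of "used" from "s" once, returning the leftover letters.
        // Assumes "used" is a subanagram of "s", which FindSubanagramsExact guarantees.
        static string SubtractString(string s, string used)
        {
            var remaining = new List<char>(s);
            foreach (char c in used)
                remaining.Remove(c); // removes the first occurrence of c
            return new string(remaining.ToArray());
        }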

Now, in order to actually generate anagrams, only two pieces are missing: a way to generate integer partitions, and IEnumerable<string> FindSubanagramsExact(string s, int length), which finds words of the given length that can be built out of the letters of s.
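
The follow-up posts linked below cover both, but just to sketch the first piece, partitions can be generated with a simple recursion (a generic sketch, not the code Xworder actually uses):

        // Yields every partition of n into parts no larger than maxPart, largest part first.
        static IEnumerable<List<int>> Partitions(int n, int maxPart)
        {
            if (n == 0)
            {
                yield return new List<int>();
                yield break;
            }
            for (int part = Math.Min(n, maxPart); part >= 1; part--)
            {
                foreach (var rest in Partitions(n - part, part))
                {
                    var p = new List<int> { part };
                    p.AddRange(rest);
                    yield return p;
                }
            }
        }

        // Partitions(4, 4) yields {4}, {3, 1}, {2, 2}, {2, 1, 1}, {1, 1, 1, 1}.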

Stay tuned.

CONTINUE READING:

https://bbzippo.wordpress.com/2009/12/06/inside-xworder-the-road-to-full-anagrams-2/

https://bbzippo.wordpress.com/2009/12/18/integer-partitions/

https://bbzippo.wordpress.com/2010/01/06/generating-combinations/


Written by bbzippo

11/27/2009 at 7:36 pm

Posted in programming

Expected number of tosses to get N heads in a row

with 2 comments

I wrote before how to compute the expected number of coin tosses needed to throw a head:
https://bbzippo.wordpress.com/2009/11/05/expected-number-of-coin-tosses/

I mentioned that I didn’t have a rigorous solution for N heads in a row, and I was asked to at least present a non-rigorous one.
Ok, let’s say we know the expected number of tosses to get N heads in a row and it equals E(N). Now we want to compute E(N+1). How many tosses on average do we need to turn N heads into N+1 heads? With probability 1/2 you need only 1 additional toss. But if you fail to throw a head with that one toss, you need to start over and end up tossing the coin E(N+1) more times on average, plus that 1 toss that gave you a tail.
So E(N+1) = E(N) + 1/2 + 1/2*(E(N + 1) + 1)
Or E(N+1) = 2*E(N) + 2
Since we know E(1) = 2, we get E(2) = 6, E(3) = 14, etc.
It’s not difficult to come up with a non-recursive formula:
E(N) = 2*(2^N - 1)
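
A quick Monte Carlo check (just a throwaway sketch, nothing rigorous either) agrees with the formula:

// Estimates the expected number of fair-coin tosses needed to see n heads in a row.
static double SimulateExpectedTosses(int n, int trials)
{
    var rng = new Random();
    long totalTosses = 0;
    for (int t = 0; t < trials; t++)
    {
        int run = 0;
        while (run < n)
        {
            totalTosses++;
            run = (rng.Next(2) == 0) ? run + 1 : 0; // heads extends the run, tails resets it
        }
    }
    return (double)totalTosses / trials;
}

// SimulateExpectedTosses(3, 1000000) comes out close to 14 = 2*(2^3 - 1).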

Why is this solution not rigorous? Because you can’t manipulate expectation values like this without saying a lot of additional blah-blah about your random variables being independent or dependent, and the expectation being a linear function, maybe about Markov chains, etc.

Written by bbzippo

11/25/2009 at 7:19 pm

Posted in math

More Zen of Design

leave a comment »

namespace Mind {
    class Hand {
        void Clap() {

Another Zen quote on Design Patterns. It’s about loose coupling, minimizing dependencies, considering what can vary:

"If you are attached to something and it disappears, you suffer. Don’t be attached to form. Form and name are always changing, changing, changing, nonstop."

Written by bbzippo

11/25/2009 at 3:59 am

Posted in fun, programming

My confusion about Gaussian Integers

leave a comment »

For some reason I was under the impression that Gaussian Integers (http://en.wikipedia.org/wiki/Gaussian_integer) were not uniquely factorable and was trying to come up with a counterexample 🙂 For some reason I was sure that they were not a Euclidean ring. Well, they ARE a Euclidean ring and hence a factorial ring.

(A million years ago, and a smaller number of light years away, when/where I studied math, we called unique factorization domains “factorial rings”. We didn’t use the term “integral domain” at all. Number theory was just part of “algebra”. Such bourbakism 🙂)

Anyway, the Wiki article is surprisingly good. I learned about some unsolved problems that I hadn’t known about before: http://en.wikipedia.org/wiki/Gaussian_integer#Unsolved_problems.

But when it comes to math I always check Mathworld in addition to Wiki (or even before Wiki). http://mathworld.wolfram.com/GaussianInteger.html

And the Mathworld article revealed the cause of my confusion. Gaussian integers only form a Euclidean ring if you define the norm as the sum of squares, not as its square root (the modulus). I was so dead set on using the distance as the norm that I didn’t consider that the square of the distance is still a norm, and that it’s an integer, so it works as a Euclidean degree function.

It’s so funny that the Euclidean distance is not a good norm for the Euclidean algorithm 🙂

I’m curious: can a real-valued norm work as a degree function for the Euclidean algorithm at all?

Now I want to program the Euclidean algorithm and factorization for the Gaussians. But I doubt I’m ever going to do that – too busy/lazy.
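
If I ever do, the only mildly tricky part is the division step: multiply by the conjugate, round both coordinates of the quotient to the nearest integers, and take the remainder. A rough sketch of the Euclidean algorithm along those lines (the representation and names are mine, purely for illustration):

struct GaussianInt
{
    public long Re, Im;
    public GaussianInt(long re, long im) { Re = re; Im = im; }

    // The Euclidean degree function: the sum of squares, not the modulus.
    public long Norm() { return Re * Re + Im * Im; }

    public static GaussianInt operator -(GaussianInt a, GaussianInt b)
    {
        return new GaussianInt(a.Re - b.Re, a.Im - b.Im);
    }

    public static GaussianInt operator *(GaussianInt a, GaussianInt b)
    {
        return new GaussianInt(a.Re * b.Re - a.Im * b.Im, a.Re * b.Im + a.Im * b.Re);
    }

    // Rounds the exact quotient a/b = a*conj(b)/Norm(b) to the nearest Gaussian integer.
    public static GaussianInt RoundedQuotient(GaussianInt a, GaussianInt b)
    {
        long n = b.Norm();
        long re = a.Re * b.Re + a.Im * b.Im;
        long im = a.Im * b.Re - a.Re * b.Im;
        return new GaussianInt((long)Math.Round((double)re / n), (long)Math.Round((double)im / n));
    }

    // Euclidean algorithm: the remainder's norm is at most half of the divisor's norm.
    public static GaussianInt Gcd(GaussianInt a, GaussianInt b)
    {
        while (b.Norm() != 0)
        {
            GaussianInt r = a - RoundedQuotient(a, b) * b;
            a = b;
            b = r;
        }
        return a; // unique up to multiplication by a unit: 1, -1, i, -i
    }
}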

Written by bbzippo

11/23/2009 at 8:54 am

Posted in math

Single Responsibility

leave a comment »

David Cooksey wrote some interesting observations on The Dangers of Single Responsibility in Programming (continued here).

I view the SRP as one of the most important design principles, because it gives birth to loose coupling and reuse.
It can be applied on the OOD pattern level, as well as on the higher levels: layering, deployment packaging, and multi-tier architecture.
Obviously, it can be harmful when misinterpreted and misapplied. And David gave a good description of how this can happen.
Interestingly, in my career, I have very rarely encountered the SRP abuse. Much more frequently I saw designs that could definitely benefit from applying the SRP.
Can we say that “premature application of design principles is the root of all evil”? Well, yes and no. On one hand, the whole point of design patterns is that you apply them in anticipation of change, so they have to be “premature”. On the other hand, good patterns often emerge when you refactor. I’d say it’s not possible to arrive at a well-balanced design without constant refactoring. You need to both anticipate the change and embrace the change. This is the Zen of Design.

Here is a Zen quote about a pattern of two loosely coupled interacting classes:
“Just think of the trees: they let the birds perch and fly, with no
intention to call them when they come
and no longing for their return when they fly away.”
(Zen Quotes, 470-543 AD)

“Design pattern” is an anagram of “Departing nest”, found via Xworder.

Written by bbzippo

11/20/2009 at 5:58 pm

Posted in programming

Finalize this.

leave a comment »

3 years ago I ran into an issue with suppressed finalization. I wrote a small essay on it which I sent to my coworkers and friends. I thought I’d rewrite it and publish, in case others are interested.

Summary: If you inherit from a class that calls GC.SuppressFinalize(this) in the constructor then your finalizer will never execute. To fix, call GC.ReRegisterForFinalize(this) in your constructor.

The Code

Let’s inherit a class from DataTable which will own some disposable resource.

NOTE: In this example I chose a SqlConnection, but this is an extremely bad idea. SqlConnections must be scoped to the method, and only to the method, no exceptions. The reasons are: 1. It is a good pattern and practice. 2. If you don’t follow this good pattern, Connection Pooling will punish you! (https://bbzippo.wordpress.com/2009/11/05/net-connection-pooling-enforces-good-coding-patterns/ , http://social.msdn.microsoft.com/Forums/en-US/adodotnetdataproviders/thread/74a5db87-c63f-4386-b317-e848dedb2cd9)

Ok, back to the subject. Our class will allocate the resource in the constructor and close it in the destructor (finalizer). We are going to observe something weird:

class ConnectedDataTable: DataTable
{
    private SqlConnection connection;

    public ConnectedDataTable() : base()
    {

        // This connection will close when the object goes out of scope 
        //   and gets GC'd. 
        // HAHA, NOT REALLY!
        connection = new SqlConnection("connString");
        connection.Open();
    }

    ~ConnectedDataTable()
    {
        // Problem! This finalizer will never execute! But Why?!
        connection.Close();
    }   

    // other methods...
    // ...
}
 

Explanations

Let me first remind you what the .Net Garbage Collector does with unused (unreferenced) objects that have finalizers. The GC will put them into the finalization queue. The finalizer thread will then iterate through the queue and call the finalizers. And during the next GC cycle the finalized objects will be destroyed (deallocated).

However, if you call GC.SuppressFinalize(myObject), then your object will bypass the finalization queue and will be destroyed right when the GC picks it up. Normally it is recommended to call SuppressFinalize() in IDisposable.Dispose() after disposing of all unmanaged resources (see MSDN on the IDisposable pattern).
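
For reference, here is the usual shape of that pattern (a condensed sketch of the standard guidance, not code from the project in question):

class ResourceHolder : IDisposable
{
    private IntPtr unmanagedHandle; // stands in for some unmanaged resource

    public void Dispose()
    {
        Dispose(true);
        // Everything is cleaned up already, so skip the trip through the finalization queue.
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        // Always release unmanaged resources here; touch managed objects only when disposing == true.
    }

    ~ResourceHolder()
    {
        Dispose(false); // safety net in case Dispose() was never called
    }
}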

One may think: “What happens to objects that don’t utilize any unmanaged stuff, don’t need IDisposable, and don’t implement finalizers? They probably still wait in the Finalization Queue taking up memory and CPU time – what a waste! So why don’t I call GC.SuppressFinalize(this) right in the constructor to speed things up?”

In most cases this is not a great idea. The runtime is smart enough to recognize objects that do not override Object.Finalize() (or implement destructors in C#) and not put them into the queue for finalization.

But guess what: There is at least one class in the .Net framework that calls GC.SuppressFinalize(this) in the constructor. It is DataTable.

Why? DataTable inherits from System.ComponentModel.MarshalByValueComponent that overrides Finalize(). So the runtime will mark it as requiring finalization.

DataTables are usually expected to be large and they allocate only managed memory. And the work performed by the parent’s Finalize() is probably not needed for DataTable (I did not investigate this deeper, but I trust Microsoft on this).

So it does make sense to suppress finalization in DataTable’s constructor.

But when we inherit a class from DataTable that does allocate some resources and implements a finalizer that disposes of them, we run into trouble. The constructor of this class calls the base constructor. And the base constructor will suppress finalization. So our finalizer will never execute.

How do we fix our code? Simply add GC.ReRegisterForFinalize(this) to the constructor.
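
In the example above, the fixed constructor would look like this:

    public ConnectedDataTable() : base()
    {
        // The DataTable constructor has already suppressed finalization for us;
        // undo that so our finalizer actually runs.
        GC.ReRegisterForFinalize(this);

        connection = new SqlConnection("connString");
        connection.Open();
    }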

NOTE: actually it would be better not to inherit from DataTable at all, but rather to contain it. But that’s a different topic.

Lessons learned

  1. In some special cases it may be possible to improve performance by suppressing finalization in the constructor.
  2. Extreme caution should be used when inheriting from classes that suppress finalization.
  3. DataTable is an example of questionable design: it introduces non-standard behavior that can propagate to child classes, and doesn’t make consumers aware of that. One option would be to seal the class. Or at least mark it with some interface or attribute that would make the developer (and possibly the compiler) aware of the special behavior.

Written by bbzippo

11/16/2009 at 7:50 am

Posted in programming

Inside Xworder: the Scrabble mode

with 5 comments

Given a bunch of letter tiles, how do you find all words that can be built out of them? The naive approach would be to enumerate all permutations of all subsets of the given set, and look them up in the word list. This would be an extremely slow algorithm. If you have 10 distinct letters, the number of non-empty ordered subsets is nearly 10 million. Simply scanning the whole word list would work about 30 times faster. But even that would still not be fast enough.

Xworder maintains an index of words by their length. It is a SortedList<int, HashSet<int>> where the key is the word length and the values are the indexes of the words of that length in the word list. (Did I mention that Xworder does not use any sort of database engine, and that it doesn’t use the filesystem at all?)
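
Building such an index is straightforward; something along these lines (a sketch, not the actual Xworder code, assuming words is the flat word list):

var lengthIndex = new SortedList<int, HashSet<int>>();
for (int i = 0; i < words.Count; i++)
{
    int len = words[i].Length;
    if (!lengthIndex.ContainsKey(len))
        lengthIndex.Add(len, new HashSet<int>());
    lengthIndex[len].Add(i); // word length -> set of word indexes
}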

So in order to find subanagrams I can scan all words whose length does not exceed the total number of letters, and check if they are in fact subanagrams. Surprisingly, this method exhibited decent performance. But I took one further step to make it work faster.

Xworder also has an index that, for each of the most frequent letters (“ETAOINSR”), lists the words that do not contain that letter. So when I see that my set of letters lacks some of those frequent letters, I can immediately exclude a whole lot of words from my scan. Typically, the remaining word count is so small that I scan them all without even taking the length into account.

This is it. Xworder’s Scrabble mode works by scanning the word length indexes OR by scanning the frequent letter exclusion indexes.
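
One detail I glossed over: the subanagram check itself (the IsSubstring call in the snippet below) boils down to comparing letter counts. Roughly like this (my own sketch of the idea; the real implementation may differ):

// Returns true if word can be built out of the letters of the string letters.
static bool IsSubanagram(string word, string letters)
{
    var counts = new Dictionary<char, int>();
    foreach (char c in letters)
    {
        int n;
        counts.TryGetValue(c, out n);
        counts[c] = n + 1;
    }
    foreach (char c in word)
    {
        int left;
        if (!counts.TryGetValue(c, out left) || left == 0)
            return false; // the word needs a letter we don't have (or not enough of it)
        counts[c] = left - 1;
    }
    return true;
}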

Note that the algorithms that I’ve just described iterate through HashSets. They say it’s not the fastest data structure for iteration. In theory, SortedList should be much faster to iterate through. In addition, SortedList should consume less memory. I did test my algorithms with SortedLists instead of HashSets and did not find any measurable difference in the speed of iteration. That, and the fact that HashSet.IntersectWith() is extremely fast, made me choose HashSet as the preferred data structure for most of my indexes.

Below is a code snippet (some details omitted) that implements what I described above. It is really simple. It is so simple that after I wrote it I thought that implementing Full Anagrams would be a piece of cake too. But they turned out to be much, much tougher…

            // Candidate words from the frequent-letter exclusion indexes;
            // an empty set means the filter doesn't apply to s.
            HashSet<int> freqIndex = GetFrequentInclusionIndex(s);
            if (freqIndex.Count > 0)
            {
                // The candidate set is small: scan it directly, filtering by length on the fly.
                var q = from i in freqIndex
                        where
                        words[i].Length <= s.Length && words[i].Length >= minLength
                        && IsSubstring(words[i], s)
                        select words[i];
                foreach (string r in q)
                    yield return r;
            }
            else
            {
                // Fall back to scanning the by-length indexes.
                for (int l = s.Length; l >= minLength; l--)
                {
                    var index = GetLengthIndex(l);
                    foreach (int i in index)
                    {
                        string w = words[i];
                        if (IsSubstring(w, s))
                            yield return w;
                    }
                }
            }

Written by bbzippo

11/16/2009 at 7:03 am

Posted in programming