## Archive for **February 2012**

## Russian Elections: Patterns of Fraud

Good news: some Russian bloggers have finally noticed Stephen Coleman’s social conformity theory (and they noticed my blog too), so there is hope that people will finally stop claiming that the correlation between turnout and party support indicates fraud. If you can read Russian, I recommend this analysis as well as other election-related posts in that blog: http://jemmybutton.livejournal.com/2550.html

What I found most interesting there is the presentation of the data from http://ruelect.com (a comparison of the copies of tally sheets obtained by observers with the officially announced results). I hadn’t looked at that data carefully before, and it turned out to be interesting, so I decided to build a few quick plots. Ruelect.com has so far collected only about 1000 voting protocols. That is about 1% of the total, and it is obviously not a random sample, but it gives a lot of insight into how the falsification of voting results shaped the data.

Distribution of votes for United Russia. About 600 of the 1000 voting protocols were altered in favor of the ruling party. Left: the original observers’ copies. Right: the 600 altered protocols:

And the final distribution (left) compared to the distribution among all large-size precincts (virtually all available observers’ copies of tallies are from large precincts; here I describe how and why I break them down by size):

## Distribution of GPS accuracy

*The power of accurate observation is commonly called cynicism by those who have not got it.*

*George Bernard Shaw.*

When we measure something, we expect the errors to be normally distributed, because we expect the causes of the errors to be independent and to have an additive effect on the measurement.

When we measure the two coordinates of a position, we may naively assume that we determine the latitude independently of the longitude, and expect a normal distribution for each coordinate. The resulting distribution of the distance from the true position would then be the distribution of √(x² + y²), where x and y are normal. This is called a “2D-normal” or “circular gaussian” (the distance itself follows the Rayleigh distribution). Here’s the shape of this distribution (mean=0, sd=1 for each coordinate):

It is instructive to see that most measurements fall around 1 standard deviation (per coordinate) away from the true position.
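This is easy to see in a quick simulation (a sketch with hypothetical unit-variance coordinate errors, using numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate independent normal errors in each coordinate (mean=0, sd=1)
# and look at the distance from the true position.
n = 200_000
x = rng.normal(0.0, 1.0, n)
y = rng.normal(0.0, 1.0, n)
r = np.hypot(x, y)  # sqrt(x^2 + y^2): the distance, Rayleigh-distributed

# The peak (mode) of the distance distribution sits near r = sd = 1:
# most measurements fall about one per-coordinate standard deviation
# away from the true position, not right on top of it.
hist, edges = np.histogram(r, bins=60, range=(0.0, 4.0))
mode = edges[np.argmax(hist)]
print(round(float(mode), 1))  # close to 1
```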

Even though the assumption that the coordinates are determined independently is naïve, this distribution is actually observed experimentally: http://users.erols.com/dlwilson/gpsacc.htm

Now, what about the “accuracy” value that your GPS receiver reports? It cannot know your true position, so it can only report an estimated error. How is it estimated? The receiver uses the Dilution Of Precision (DOP) as the main component of accuracy. If you read the definition of DOP, you’ll see that it is in fact the volume of a 3D body. If we assume (naively) that the linear dimensions of that body are normally distributed, we should expect the DOP distribution to be log-normal. We expect the log-normal distribution to arise whenever the causes have a multiplicative effect on the outcome. Here’s the shape of the log-normal distribution:
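The multiplicative story can be illustrated with a toy simulation (entirely hypothetical factors, not real DOP data):

```python
import numpy as np

rng = np.random.default_rng(0)

# When many independent causes act multiplicatively, their product is
# approximately log-normal: the log of the product is a sum of independent
# terms, and sums tend to normality (central limit theorem).
n, k = 100_000, 12
factors = rng.uniform(0.5, 2.0, size=(n, k))  # hypothetical positive factors
product = factors.prod(axis=1)

def skewness(a):
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

# The product itself is strongly right-skewed, but its log is nearly
# symmetric -- the signature of a log-normal shape.
print(round(skewness(product), 1))          # large and positive
print(round(skewness(np.log(product)), 2))  # near 0
```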

And here is the actual distribution of GPS accuracy reported by cellphone GPS receivers (20,000 observations):

Does this look closer to the 2D-gauss or to the log-normal?

There is a simple tool for comparing distributions called a Q-Q plot (quantile-quantile plot). Let’s see how the Q-Q plots look…
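For readers who want to reproduce this kind of check, here is a minimal numeric sketch of a Q-Q comparison (with synthetic log-normal “accuracy” values standing in for the real 20,000 observations, and made-up distribution parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the reported accuracy values (log-normal by construction,
# with hypothetical parameters); the real observations aren't included here.
data = rng.lognormal(mean=1.5, sigma=0.5, size=20_000)

def qq(sample, reference, k=99):
    """Matched quantiles of two samples -- the points of a Q-Q plot."""
    probs = np.linspace(0.01, 0.99, k)
    return np.quantile(sample, probs), np.quantile(reference, probs)

# Large reference samples from the two candidate shapes:
ray_ref = np.hypot(rng.normal(size=100_000), rng.normal(size=100_000))
log_ref = rng.lognormal(mean=1.5, sigma=0.5, size=100_000)

# On a Q-Q plot a good fit is a straight line; the correlation of the
# matched quantiles is a crude numeric proxy for that straightness.
qx, qy = qq(data, ray_ref)
r_ray = np.corrcoef(qx, qy)[0, 1]
qx, qy = qq(data, log_ref)
r_log = np.corrcoef(qx, qy)[0, 1]
print(round(float(r_ray), 4), round(float(r_log), 4))  # log-normal wins
```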

The 2D-gauss is not a good fit…

And the log-normal is a much, much better fit!

So the GPS accuracy reported by the cell phone receiver is distributed close to log-normally. Basically, in most cases, the accuracy value is the dilution of precision multiplied by a constant (the estimated range accuracy for the receiver, typically 6 meters or so).

Disclaimer: this analysis is very inaccurate. The measurements in my sample include ones where the GPS was forced to yield a fix before it was done acquiring satellites. That resulted in outliers with very poor accuracy values.

## Peak load, confidence, Poisson

*Disclaimer: I’m clueless in statistics. I’m just playing with numbers and don’t know if any of this makes sense.*

Here is the distribution of some server load measured in events per minute (real data):

**Mean: 65 **

**Max: 210**

*[We should be alarmed already, read on to see why]*

Nothing special, right? Divide these numbers by 60 (to obtain events per second) and conclude that a peak capacity of 10 events per second should be sufficient, right?

But what if all of those 210 events per minute happen to occur within one second?

How likely is that?!

Let’s take a look at events per second (idle seconds with zero events not included):

**Mean: 3.5**

**Max: 56**

And how often does the rate of events actually exceed the “estimated” capacity of 10 events/sec? Well, it falls at about the 97th percentile, so our estimate was pretty safe, right?

Wrong. A 97th percentile is very bad here. It means that about once every 30 seconds the capacity would be exceeded. And about once every 300 seconds it would be exceeded threefold. In other words, the (1 − 1/300)th quantile (99.7% confidence) is 30 events/sec.
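To make the quantile arithmetic concrete, here is a sketch on synthetic per-second counts (a heavy-tailed stand-in for the real series, which isn’t included here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heavy-tailed per-second counts, mean ~3.5 like the post's data.
per_sec = rng.poisson(rng.exponential(3.5, size=100_000))

# A capacity set at the 97th percentile is exceeded ~3% of the time,
# i.e. roughly every 30 seconds:
q97 = np.quantile(per_sec, 0.97)
exceeded = (per_sec > q97).mean()
print(round(1 / exceeded))  # seconds between overloads: around 30, not much

# For "overloaded about once per 300 seconds" you need the (1 - 1/300)
# quantile instead, which is a noticeably higher bar:
q997 = np.quantile(per_sec, 1 - 1 / 300)
print(q97, q997)
```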

So what’s going on? Well, it looks like the events tend to clump together. And **we could actually conclude this by looking at the first histogram alone.** If the events were independent, they would follow the Poisson distribution. And the actual distribution looks nothing like a Poisson with mean 65. **With a mean that large, the Poisson distribution has practically no tail: its standard deviation is only √65 ≈ 8, so the observed maximum of 210 would sit about 18 standard deviations above the mean.**
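That claim is easy to check numerically (the mean 65 and max 210 are taken from the figures above):

```python
import math

# For a Poisson distribution the variance equals the mean, so with
# mean 65 the standard deviation is only ~8 events/minute:
mean = 65
sd = math.sqrt(mean)
z = (210 - mean) / sd  # the observed max, in standard deviations
print(round(sd, 1), round(z, 1))  # ~8.1 and ~18.0

# Exact upper tail P(X >= 210) for Poisson(65), summed in log space
# to avoid overflow in the factorials:
tail = sum(math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))
           for k in range(210, 500))
print(tail)  # vanishingly small: under Poisson(65), a max of 210 can't happen
```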

And if we try to fit our data with some Poisson shape, the best matches have their means at 3 or 4 (they are rescaled on the plot):

Can we conclude anything from this? For example, that within each minute there are 3-4 independent batches of events?

Honestly, I don’t know if we can interpret it this way.

But I know that the actual events do indeed arrive in batches.
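A quick simulation of that reading (all numbers hypothetical: Poisson batch arrivals at ~4 per minute, geometric batch sizes averaging 16, chosen so the per-minute mean lands near 65):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch model: batches arrive as Poisson(4) per minute,
# and each batch carries a geometric number of events with mean 16.
minutes = 50_000
batches = rng.poisson(4.0, minutes)
events = np.array([rng.geometric(1 / 16.0, size=b).sum() for b in batches])

# The mean per-minute load matches the observed ~65, but the counts are
# far more dispersed than a Poisson with that mean (variance/mean >> 1).
print(round(float(events.mean())))                 # ~64
print(round(float(events.var() / events.mean())))  # Poisson would give ~1
print(int(events.max()))  # far beyond what Poisson(65) could produce
```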