Chad Perrin: SOB

28 August 2009

Significance of a Sample, in Ruby

Filed under: Geek — apotheon @ 11:18

The following is how to calculate the statistical certainty (i.e., the “statistical significance” of your sample, expressed as a percentage) that the results from a randomly selected sample of a given size are actually representative of the total population. Note that this all assumes a normal distribution (i.e., the well-known “bell curve”). Calculations will be represented using Ruby source code and results, since it’s easier to type (and, I think, to understand) Ruby code than the typical mathematical notation used for these calculations. Any line of “code” that starts with => is actually the result of the one-line calculation directly above it.

Disclaimer: The following explanation is correct, to the best of my recollection. If there’s anything wrong with it, I blame the passage of time, because I haven’t done this stuff since college. Please correct any errors in my explanation in comments following this SOB entry. Feel free to offer suggestions for how to make my explanatory Ruby code more clear in the aggregate to readers who may not know Ruby (but can muddle through code in general), or how to prettify the scripts at the end, but keep in mind that the point was to explain how to do some simple statistical calculations and not so much to write Great Software in this case. In fact, clarity of explanation for people who might not know the language is why I chose Ruby instead of Scheme, since I don’t know that I’m familiar enough with Scheme to make it as readable as my Ruby code (especially considering some people might get hung up on the prefix notation).

Without further ado, the process of calculating statistical significance of your sample starts with computing the average of your raw data:

raw_results = [1,2,3,4,5,6]
=> [1,2,3,4,5,6]

raw_total = raw_results.inject {|sum,n| sum + n }
=> 21

raw_mean = raw_total.to_f / raw_results.size
=> 3.5
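
For readers on newer Ruby versions, the same average can be wrapped in a small helper. This is only a sketch: the method name mean is made up for illustration, and Array#sum needs Ruby 2.4 or later (on older versions the inject form above does the same job).

```ruby
# Hypothetical helper, not part of the original walkthrough.
# Returns the arithmetic mean of an array of numbers as a Float.
def mean(values)
  raise ArgumentError, 'need at least one value' if values.empty?
  values.sum.to_f / values.size
end

raw_results = [1, 2, 3, 4, 5, 6]
puts mean(raw_results)   # prints 3.5
```

The guard for an empty list is there because dividing by a size of zero would not give a meaningful average.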

Next, compute the difference of each data point from the average:

mean_difference = raw_results.collect {|n| n - raw_mean }
=> [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]

Then, compute the square of each difference from the average:

mean_diff_squares = mean_difference.collect {|n| n ** 2 }
=> [6.25, 2.25, 0.25, 0.25, 2.25, 6.25]

Because you’re using a sample of the population rather than measuring the total population, you’ll divide by one less than the sample size when calculating the standard deviation. The standard deviation is calculated by adding up the squared differences, dividing that total by the sample size minus one (a slightly adjusted average of the squares), then taking the square root of that number:

square_total = mean_diff_squares.inject {|sum,n| sum + n }
=> 17.5

square_mean = square_total / (mean_diff_squares.size - 1)
=> 3.5

stdev = Math.sqrt square_mean
=> 1.87082869338697
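
To see what subtracting one actually changes, here is a rough side-by-side of the two divisors. The method names squared_diff_total, sample_stdev and population_stdev are invented for this sketch; only sample_stdev matches the calculation above.

```ruby
# Sum of squared differences from the mean, shared by both versions.
def squared_diff_total(values)
  mean = values.inject {|sum,n| sum + n }.to_f / values.size
  values.collect {|n| (n - mean) ** 2 }.inject {|sum,n| sum + n }
end

# Divide by (size - 1): appropriate when the values are a sample.
def sample_stdev(values)
  Math.sqrt(squared_diff_total(values) / (values.size - 1))
end

# Divide by size: appropriate only when the values are the whole population.
def population_stdev(values)
  Math.sqrt(squared_diff_total(values) / values.size)
end

raw_results = [1, 2, 3, 4, 5, 6]
puts sample_stdev(raw_results)       # about 1.8708
puts population_stdev(raw_results)   # about 1.7078
```

The sample version always comes out a little larger, and the gap narrows as the sample grows.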

It’s difficult to pronounce stdev, and typing standard_deviation all the time is annoying, so let’s use the name of the Greek letter usually used to refer to the standard deviation in formulae:

sigma = stdev
=> 1.87082869338697

This all assumes a truly random sampling of the population, of course.

Now you just need to decide how precise your results have to be. A common assumption is that 95% certainty is “enough” for initial experimental results, though until your experimental results can be confirmed by independent experimentation they’re just “interesting” and not “meaningful”. To determine statistical certainty, you need to first determine the standard error of the mean:

sem = sigma / Math.sqrt(raw_results.size)
=> 0.763762615825973

Next, you calculate the relative standard error — which is the standard error of the mean divided by the mean:

rse = sem / raw_mean
=> 0.218217890235992

That’s your uncertainty. To find out your certainty, just subtract the number from 1:

certainty = 1 - rse
=> 0.781782109764008
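
Because the standard error is divided by the mean, this figure is relative to the size of the values being measured. A small sketch (the method name certainty and the second data set are made up for illustration) shows the same spread producing a much higher number when the values themselves are larger:

```ruby
# Hypothetical wrapper around the steps above: 1 - (SEM / mean).
def certainty(values)
  size  = values.size
  mean  = values.inject {|sum,n| sum + n }.to_f / size
  total = values.collect {|n| (n - mean) ** 2 }.inject {|sum,n| sum + n }
  sigma = Math.sqrt(total / (size - 1))
  1 - ( (sigma / Math.sqrt(size)) / mean )
end

puts certainty([1, 2, 3, 4, 5, 6])               # about 0.78, as above
puts certainty([101, 102, 103, 104, 105, 106])   # about 0.99: same spread, larger mean
```

It also means the calculation breaks down when the mean is zero or negative, so it only makes sense for data measured on a positive scale.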

Translate that into a percentage:

(certainty * 100).to_i.to_s + '%'
=> "78%"

As you can see, your rate of certainty is only about 78% — well short of the 95% target certainty (which is to be expected from a sample size of only six). As your sample size grows, your estimated statistical certainty increases, all else being equal.
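
A quick way to watch that happen is to repeat the same six values and recompute. This is only an illustration, with a made-up certainty method wrapping the steps above; repeated values are not a truly random sample, so the numbers show the trend and nothing more.

```ruby
# Hypothetical wrapper around the steps above: 1 - (SEM / mean).
def certainty(values)
  size  = values.size
  mean  = values.inject {|sum,n| sum + n }.to_f / size
  total = values.collect {|n| (n - mean) ** 2 }.inject {|sum,n| sum + n }
  sigma = Math.sqrt(total / (size - 1))
  1 - ( (sigma / Math.sqrt(size)) / mean )
end

base = [1, 2, 3, 4, 5, 6]

# Array * Integer repeats the array, so this builds 6, 30, 60 and 120 values.
[1, 5, 10, 20].each do |copies|
  data = base * copies
  percent = (certainty(data) * 1000).to_i / 10.0
  puts "#{data.size} values: #{percent}%"
end
```

With these inputs the figure climbs from about 78% at six values to a little over 95% at 120 values.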


If you wrote a simplified Ruby script called “stat_sig.rb” for all this, it might look like this:

#!/usr/bin/env ruby

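# Treat every command-line argument as one data point.
raw_results = ARGV.collect {|s| s.to_f }

sample = raw_results.size

# The standard deviation needs at least two data points.
abort 'usage: stat_sig.rb NUMBER NUMBER [NUMBER ...]' if sample < 2

# Average of the raw data.
raw_mean = (raw_results.inject {|sum,n| sum + n }).to_f / sample

# Squared difference of each data point from the average.
diff_squares = raw_results.collect {|n| (n - raw_mean) ** 2 }

# Sample standard deviation (note the sample - 1).
sigma = Math.sqrt( diff_squares.inject {|sum,n| sum + n } / (sample - 1) )

# One minus the relative standard error.
certainty = 1 - ( ( sigma / Math.sqrt(sample) ) / raw_mean )

# Print as a percentage, truncated to one decimal place.
puts "#{(certainty * 1000).to_i / 10.0}%"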

To generate data sets for an off-the-cuff, heuristic “that looks right” test of the “stat_sig.rb” script, I wrote a simple script called “numbers.rb” that looks like this:

#!/usr/bin/env ruby

(1..ARGV[0].to_i).each {|n| puts n }

I used them together by typing something like this (with the 100 indicating I want my data set to consist of every number from 1 to 100) at my Unix shell prompt:

stat_sig.rb `numbers.rb 100`
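
If you would rather not involve the shell, the same check can be sketched in a single Ruby file. The method name significance is hypothetical; the arithmetic is the same as in “stat_sig.rb”, and the data set is built with a range in place of “numbers.rb”.

```ruby
# Hypothetical single-file version of the stat_sig.rb / numbers.rb pairing.
def significance(values)
  sample = values.size
  mean   = values.inject {|sum,n| sum + n }.to_f / sample
  diff_squares = values.collect {|n| (n - mean) ** 2 }
  sigma  = Math.sqrt( diff_squares.inject {|sum,n| sum + n } / (sample - 1) )
  certainty = 1 - ( ( sigma / Math.sqrt(sample) ) / mean )
  # Truncate to one decimal place, as stat_sig.rb does.
  ( (certainty * 1000).to_i / 10.0 ).to_s + '%'
end

data = (1..100).to_a       # the same values "numbers.rb 100" would print
puts significance(data)    # prints 94.2%
```

For the numbers 1 through 100 that works out to 94.2%, just under the 95% rule of thumb.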


Don’t forget that changing any of the underlying assumptions for the above explanation (such as that your results will conform to a normal distribution) can invalidate this methodology for calculating a measure of statistical significance. Also remember that 95% certainty is just a rule-of-thumb threshold for statistical significance, and that number may change depending on the circumstances of your statistical analysis.

All original content Copyright Chad Perrin: Distributed under the terms of the Open Works License