Sample size is an important issue in marketing
testing because it has such a large impact on the validity of
your results and is so often misunderstood. Without sufficient
sample size (i.e., sufficient data), your test will be
meaningless—little more than a guess. Greater sample size
increases your confidence that any change in response or sales
is a real difference and not just random chance.
Calculating Sample Size
Sample size should be based on an equation, not
a rule-of-thumb. General rules like “each test cell should have
at least 100 orders” oversimplify the issue and often result
in a weak test with few significant results.
Different equations are appropriate for
different types of data, but all sample size calculations are
based on the general equation at the top of the page. Sample size,
N, should be based on:
1. How much variation there is in your
data, measured in “standard deviations” (with the symbol, σ,
sigma)
2. How small an effect you want to be able to see (the
change in response rate for the test versus the control)
Two standard deviations are an important
measure for separating significant market changes from natural
variation. Statistically, if nothing has changed, about 95% of results
will fall within ±2σ of the average. So an increase (or
decrease) beyond 2σ most likely means something has changed, something
that you should be able to identify. Also, for response rate (and
other yes/no data), the standard deviation is tied directly to
the average (σ² = p(1 − p)), so a higher response rate has a lower
standard deviation relative to the average.
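To make the ±2σ rule concrete, here is a minimal Python sketch. It assumes the usual binomial formula for the standard deviation of a response rate measured on n names, sqrt(p(1 − p)/n). The function name and the inputs (a 1% response rate on 10,000 names) are illustrative and not taken from this article.

```python
import math

def two_sigma_band(p, n):
    """Return (low, high): the response rate p plus or minus two
    standard deviations, for a rate measured on n names.
    Assumes yes/no (binomial) data."""
    sigma = math.sqrt(p * (1 - p) / n)
    return p - 2 * sigma, p + 2 * sigma

# Hypothetical inputs: 1% response rate, 10,000 names mailed.
low, high = two_sigma_band(0.01, 10_000)
print(f"Natural variation band: {low:.2%} to {high:.2%}")
```

Under these assumptions, a result outside roughly 0.8% to 1.2% would be worth a closer look, while anything inside the band could easily be random chance.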
Generally, the smaller the change you want
to see and/or the lower your response rate, the larger the
sample size you need.
For example:
- If response rate for direct mail
campaigns is usually about 1%, you’ll need about three times the
sample size of someone who normally sees a 3% response
(to detect the same relative lift).
- If you want to see any variable that
increases response by 5% (e.g., from 1% to 1.05%), you’ll
need four times the sample size you would need to see a
10% or larger effect.
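You can check both rules of thumb with a few lines of Python. This is a sketch that assumes sample size is proportional to p(1 − p) divided by the effect squared, where the effect is the relative lift times the response rate. The helper name `relative_n` is hypothetical.

```python
def relative_n(p, lift):
    """Sample size up to a constant factor: p(1 - p) / effect^2,
    where effect = lift * p and lift is a relative change (0.10 = 10%)."""
    effect = lift * p
    return p * (1 - p) / effect ** 2

# 1% versus 3% response rate, same 10% relative lift.
ratio_rate = relative_n(0.01, 0.10) / relative_n(0.03, 0.10)

# 5% versus 10% relative lift, same 1% response rate.
ratio_lift = relative_n(0.01, 0.05) / relative_n(0.01, 0.10)

print(f"1% vs 3% response: {ratio_rate:.2f}x the sample size")
print(f"5% vs 10% lift:    {ratio_lift:.2f}x the sample size")
```

The first ratio comes out slightly above three because of the (1 − p) term. The second is four because halving the effect quadruples the sample size.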
Sample Size for Scientific Testing
(no matter how many variables are tested)
Sample size is a big issue in marketing and
advertising testing, making it difficult to test very many
variables or see any but the largest effects. Here’s where
scientific testing has an immense advantage. With the right
techniques, you can test any number of variables at once with
the same sample size.
If all the variables are a part of the same
statistical test design, the sample size gives equal statistical
confidence whether you’re testing two or two-dozen variables.
For a scientific test measuring response rate, use the sample
size equation:

$$N = \frac{4\,(t_{\alpha/2} + t_{\beta})^2\,\bar{p}\,(1 - \bar{p})}{\text{effect}^2}$$
Though this looks a bit complex, it’s similar
to standard equations (but more accurate) and is fairly easy to
calculate if you can answer three questions:
1. What is your average response
rate?
2. How small an effect would you like to be able to
see?
3. How confident do you want to be that you'll see an
effect of that size?
The symbols and terms include:
N = overall sample size
(equally divided among all recipes, no matter how many
elements or recipes are tested)
p = historical or expected average response rate
(actually “p-bar”; in the equation, the bar over the p means
“average”)
effect = how large a change you want to be able to detect
(e.g., the difference in response rate between the test and
control)
tα/2 = a fancy statistical way of saying “about 2” standard
deviations
- It’s perfectly fine to use 2, since
tα/2 = 1.96 for almost every calculation (with 95%
confidence, meaning there’s only a 5% chance that a
significant effect is really just random chance)
- Always use alpha = 5% for tests;
confidence levels less than 95% (alpha > 5%) are just an
excuse for using too small a sample size
tβ = how confident you want to be in seeing the selected effect
(beta error)
- tβ = 0 for a 50-50 chance
of seeing the “effect” of the chosen size
- tβ = 0.674 for only a 25%
chance of missing the effect (75% power)
- tβ = 0.841 for only a 20%
chance of missing the effect (80% power)
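If you want values for other confidence or power levels, the sketch below looks them up with Python's standard library. It uses the normal (z) distribution as a stand-in for t, which is an assumption that is reasonable only for the large samples typical of marketing tests. The function names are illustrative.

```python
from statistics import NormalDist

def t_alpha_half(confidence=0.95):
    """Two-sided critical value for the given confidence level
    (normal approximation)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def t_beta(power):
    """Value for the chosen power, i.e. the chance of seeing
    an effect of the selected size (normal approximation)."""
    return NormalDist().inv_cdf(power)

print(f"95% confidence: {t_alpha_half():.3f}")
for power in (0.50, 0.75, 0.80):
    print(f"{power:.0%} power: {t_beta(power):.3f}")
```

The results agree with the values listed above to within rounding.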
Alpha and Beta: two types of error
One focus of statistics is reducing error so
you can make the right decisions. Two types of error are
important to consider before every test:
- “Alpha” is your chance of seeing a
significant effect that really doesn’t exist
- “Beta” is your chance of missing an
effect that should be significant
The “beta” term is often not used for sample
size calculations, but it’s always a part of the mix. Leaving it
out is the same as setting tβ=0, which means that you
have only a 50-50 chance of seeing the desired effect. (Note: If
you remove the “4” and “+tβ”, you get the
commonly-used, usually-wrong, sample size calculation for one
test cell).
This equation gives you a sample size
about four times what you get with more common calculations:
(a) Because this equation combines sample size
for both the “test” and “control” versions—since the control is
not separated out in scientific testing—change the 4 to a 2 to
get the best equation for one test cell in a split-run test.
(b) Unless you have just one test cell versus a large control
cell, the common equation—without the 4—is wrong (half the true
sample size you need). You need a 2 in the equation because you
have two groups of data—with two sources of variation—that you
compare for every test (when the control cell is very large, its
variation gets very small, so eliminating the 2 is only
appropriate with one test cell versus a large control cell).
Try out the equation with the following two
examples (and e-mail us if you would like to get a sample size
calculator in Excel):
- A credit card marketer
was planning a scientific test of 19 direct mail elements
using a 20-recipe test design. The control response rate was
only 0.5% and they wanted to see any elements that increased
response by 10%, or 0.05 percentage points. With such a
small response rate, they calculated that the test would
need a sample size of 305,791 for a 50-50 chance of seeing a
10% increase and 624,510 to be 80% confident in seeing a 10%
lift [p=0.005, tα/2=1.96, tβ=0 and 0.841, effect=0.0005,
solving for N].
Now remember, these sample size calculations
are for the overall test—all 19 elements and all 20 recipes can
be tested with a total of 305,791 names, or just about 15,290
names per recipe. Split-run tests of all 19 elements would
require half that number for each test cell, plus at least four
times that for the control, totaling more than 3.5 million names
for equal confidence using split-run tests!
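The credit card numbers can be reproduced with a short Python sketch. The formula in the code, N = 4(tα/2 + tβ)² p(1 − p) / effect², is reconstructed from the article's description (the “4” and “+tβ” terms) and from the inputs listed in the example. The function name is hypothetical, and the split-run total reflects one reading of “four times that for the control”: a control four times the size of one test cell.

```python
def overall_sample_size(p, effect, t_alpha_half=1.96, t_beta=0.0):
    """Overall N for a response-rate test:
    4 * (t_alpha/2 + t_beta)^2 * p * (1 - p) / effect^2."""
    return 4 * (t_alpha_half + t_beta) ** 2 * p * (1 - p) / effect ** 2

p = 0.005        # 0.5% control response rate
effect = 0.0005  # 10% lift = 0.05 percentage points

n_50 = overall_sample_size(p, effect)                # 50-50 chance
n_80 = overall_sample_size(p, effect, t_beta=0.841)  # 80% power
per_recipe = n_50 / 20                               # 20-recipe design

# Split-run alternative: 19 test cells at half the overall N each,
# plus a control assumed to be four times one test cell.
split_cell = n_50 / 2
split_total = 19 * split_cell + 4 * split_cell

print(f"50-50 chance: {n_50:,.0f} names")
print(f"80% power:    {n_80:,.0f} names")
print(f"Per recipe:   {per_recipe:,.0f} names")
print(f"Split-run:    {split_total:,.0f} names")
```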
- With a list of only
35,000 e-mail addresses across three customer segments, a
conversion rate of 1%, and 12 new ideas she wanted to test,
a marketing director wanted to calculate how large an effect
she could expect to see [N=35,000, p=0.01, tα/2=1.96, tβ=0
and then 0.841, solving for “effect”]. She calculated a
50-50 chance of seeing any effect that increased conversion
rate by at least 20.8% (0.208 percentage points) and an 80%
chance of seeing effects that increased (or decreased)
conversion by 29.8%.
Even with a small e-mail list and not as much
power as she would like, the marketing director went ahead with
the test and ended up seeing four significant main effects and
one significant interaction, which together increased
conversion rate by 54% (details are explained in the E-mail case
study).
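The marketing director's calculation runs the same equation in reverse, solving for the effect instead of N. Here is a minimal sketch, assuming the equation N = 4(tα/2 + tβ)² p(1 − p) / effect² rearranged for the effect. The function name is illustrative.

```python
import math

def detectable_effect(n, p, t_alpha_half=1.96, t_beta=0.0):
    """Smallest absolute change in response rate detectable with
    overall sample size n, from rearranging
    N = 4 * (t_alpha/2 + t_beta)^2 * p * (1 - p) / effect^2."""
    return 2 * (t_alpha_half + t_beta) * math.sqrt(p * (1 - p) / n)

n, p = 35_000, 0.01
eff_50 = detectable_effect(n, p)                # 50-50 chance
eff_80 = detectable_effect(n, p, t_beta=0.841)  # 80% power

print(f"50-50 chance: {eff_50 / p:.1%} relative lift")
print(f"80% power:    {eff_80 / p:.1%} relative lift")
```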
The full sample size equation, above, is
all you need. However, you can change it slightly:
- If you are running a few split-run tests,
you can use one-half the calculated sample size for each
test cell.
- If you are running one test cell against
a very large control, then you
can use one-fourth of the calculated number for the test
cell.
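These two adjustments amount to dividing the overall N by 2 or by 4. A small sketch, with hypothetical names for the function and the design labels:

```python
def per_cell_sample_size(overall_n, design):
    """Convert the overall N from the full equation to a per-cell size.

    design: "split_run"     -> one test cell in a split-run test
            "large_control" -> one test cell versus a very large control
    """
    divisors = {"split_run": 2, "large_control": 4}
    return overall_n / divisors[design]

overall_n = 305_791  # overall N from the credit card example
print(per_cell_sample_size(overall_n, "split_run"))
print(per_cell_sample_size(overall_n, "large_control"))
```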
Note: you may see some equations with a term
for the “total universe” of names. Since you never know the full
universe of potential buyers, this term should not be included.
Also, for you statistical gurus, the z value should be used
instead of the t value, but with large sample sizes the z (normal)
and t distributions are about the same.
Sample Size for Sales Data
One more equation is important. The above
equation is used for response rate, conversion rate, and other
types of yes/no data. For retail and other sales data—dollar
sales, average order size, percent change in sales versus
baseline—the term for “standard deviation” is different, so you
need the equation:

$$N = \frac{4\,(t_{\alpha/2} + t_{\beta})^2\,\sigma^2}{\text{effect}^2}$$
Unlike response data, sales data has no fixed
relationship between the average and standard deviation. So
p*(1-p) is replaced by the standard deviation squared, σ². The
standard deviation must be calculated from all of the individual
sales numbers. For example, if five people buy clothes from your
website and each order is: $54, $20, $160, $95, and $76, then
the average order size is $81.00 and the standard deviation
(using any calculator or spreadsheet) is $52.23.
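The five-order example can be checked in Python, and the result fed into a sales-data sample size. In this sketch the sales formula is assumed to be the response-rate equation with p(1 − p) replaced by σ², as described above. The function name and the $10 effect are hypothetical.

```python
from statistics import mean, stdev

orders = [54, 20, 160, 95, 76]  # the five order amounts, in dollars
avg = mean(orders)
sigma = stdev(orders)           # sample standard deviation (n - 1)

def sales_sample_size(sigma, effect, t_alpha_half=1.96, t_beta=0.0):
    """Overall N for sales data:
    4 * (t_alpha/2 + t_beta)^2 * sigma^2 / effect^2."""
    return 4 * (t_alpha_half + t_beta) ** 2 * sigma ** 2 / effect ** 2

# Hypothetical goal: detect a $10 change in average order size.
n_sales = sales_sample_size(sigma, effect=10.0)

print(f"Average order: ${avg:.2f}, standard deviation: ${sigma:.2f}")
print(f"Orders needed for a 50-50 chance: {n_sales:,.0f}")
```

Note that N here counts orders rather than names mailed, since the standard deviation is calculated from individual orders. Five orders are far too few for a reliable standard deviation in practice; they are used here only to match the example.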
Unfortunately, most companies—including some
of the biggest database service providers—measure average sales
data without providing the standard deviation. Without “sigma,”
you have no measure of variation and can do little statistical
analysis. If one catalog has an average order size of $75 and
another $65, are these statistically different? Who knows…
unless you calculate the standard deviation of all the
individual orders.
And a final note on sample size…
We find that most marketers test with
too small a sample size. Using the accurate sample size
equation, the “100 orders” rule-of-thumb means there’s a 50-50
chance that a test cell will not be statistically significant
unless it increases response rate by about 27%. For an 80%
chance of seeing the effect, it must increase response by 39% or
more! These are big hurdles to overcome.
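To see where these hurdles come from, the sketch below applies the one-cell split-run form of the equation (with a 2 in place of the 4) to a cell that produces 100 orders. The 2% response rate is an assumption chosen for illustration; the result changes only slightly for other low response rates. The function name is hypothetical.

```python
import math

def detectable_lift(orders, p, t_alpha_half=1.96, t_beta=0.0):
    """Relative lift detectable by one split-run test cell that is
    sized to yield `orders` orders at response rate p."""
    n = orders / p  # names in the test cell
    effect = (t_alpha_half + t_beta) * math.sqrt(2 * p * (1 - p) / n)
    return effect / p

p = 0.02  # assumed response rate, for illustration only
lift_50 = detectable_lift(100, p)                # 50-50 chance
lift_80 = detectable_lift(100, p, t_beta=0.841)  # 80% power

print(f"50-50 chance: {lift_50:.1%} lift needed")
print(f"80% power:    {lift_80:.1%} lift needed")
```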
Natural variation in the marketplace remains a
big challenge for marketing and testing. If you don't believe
it, here’s a good “test” to try: take your control, give it
five different keycodes,
and split your list into five random groups. Measure response
for each of the five controls. Since each mailing is exactly the
same, these five data points give you a sense of how much
variation you can expect to see. The results may surprise you.
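Before running this with a real mailing, you can preview the idea with a simulation. This is a sketch only: the list size, true response rate, and random seed are made-up inputs, and every name is assumed to respond independently with the same probability.

```python
import random

def simulate_control_split(list_size, true_rate, groups=5, seed=0):
    """Split a list into equal random groups that all receive the same
    control, and return the simulated response rate of each group."""
    rng = random.Random(seed)
    per_group = list_size // groups
    rates = []
    for _ in range(groups):
        # Each name responds independently with probability true_rate.
        responses = sum(rng.random() < true_rate for _ in range(per_group))
        rates.append(responses / per_group)
    return rates

# Hypothetical inputs: 50,000 names, 1% true response rate.
rates = simulate_control_split(50_000, 0.01)
for keycode, rate in enumerate(rates, start=1):
    print(f"Control keycode {keycode}: {rate:.2%}")
print(f"Spread between best and worst: {max(rates) - min(rates):.2%}")
```

Try different seeds or list sizes to see how much identical mailings can differ from one another by chance alone.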
Contact us
for more information or if you would like a sample size
calculator in Excel. Next, you can learn more
about real-world case
studies and articles showing the power of cutting-edge
testing techniques.