Testing
and analytics are like two sides of the statistical coin.
Back-end analytics draw information from raw data. Proactive
testing breaks new ground, guiding you to new insights and
opportunities. Testing remains the only way to prove the
real-world impact of marketing-mix changes. But the value of both depends upon the size of
the coin – the quantity of data available for testing and
analysis. Greater sample size equals a larger investment in your
test and the potential for more valuable insights.
As a greedy tester, you may never have too
much sample size, but when do you have enough? That’s
where sample size equations are useful.
Read on, or click on topics #1-11 below to jump to certain sections:
1. One sample for one test (whether split-run or multivariable testing)
2. Calculating sample size
3. The sample size equation – for response rate
4. The sample size equation – for dollar sales
5. How do you calculate the standard deviation?
6. The deep, dark statistics behind the equation
7. Why are other equations different?
8. Choosing the right metric for sample size
9. When sample size is limited
10. Advanced topics
11. About LucidView - multichannel, multivariable testing
One sample for one test
Each
test requires its own sample size. Testing 4 e-mail subject
lines requires 4x the sample size of testing one subject line versus the “control” (the current best-performing version). An
exception: one multivariable test requires no more sample size
than one A/B split. As long as all test elements are part of the
same multivariable test design, you can use the same sample size
whether testing 3 or 33 marketing-mix elements at once (this
Internet Retailer article explains how it works).
Calculating sample size
Sample size is an important issue in marketing
testing because it has such a large impact on the validity of
your results and is so often misunderstood. Without sufficient
sample size (i.e., sufficient data), your test will be
meaningless—little more than a guess. Greater sample size
increases your confidence that any change in response or sales
is a real difference and not just random chance. Sample size should be based on an equation, not
a rule-of-thumb. General rules like “each test cell should have
at least 100 orders” oversimplify the issue and often result
in a weak test with few significant results.
Different equations are appropriate for
different types of data, but all sample size (N) calculations
are based on:

1. How much variation there
is in your data, measured in “standard deviations” (with the
symbol, σ, sigma)
2. How small a "significant effect" you
want to be able to see (the change in response rate for the test
versus the control)
Two standard deviations are an important
measure in separating significant market changes from natural
variation. Statistically, if nothing has changed, about 95% of
results will fall within ±2σ of the average. So an increase (or
decrease) beyond 2σ means something has changed—something that
you should be able to identify. Also, for response rate (and
other yes/no data), the standard deviation is determined by
the average (σ² = p(1-p)), so a higher response rate has a lower
relative standard deviation.
Generally,
the smaller the change you want to see, or the lower your
response rate, the larger the sample size you need.
- The size of the mailing doesn’t matter
– it’s the number of responses that’s important.
If response rate for direct mail campaigns is usually about
1%, you’ll need about three times the sample size of someone who
normally sees a 3% response.
- Cutting experimental error in half
requires 4x the sample size.
If you want to see any variable that increases response by
5% (e.g., from 1% to 1.05%), you’ll need four times the
sample size as you would need to see a 10% or larger effect.
- More variation in sales = less
confidence = the need for greater sample size.
For example, if almost all of your customers spend $95 to
$105, then increasing the average order to $150 looks like a
significant difference. If your customers average $100 per
order, but some spend only $20 and some spend $250, then one
campaign averaging $150 per order may just be random chance.
- Different numbers are not always
statistically different.
Picture a “bell curve” around every data point. Two points
are statistically different only if their bell curves barely
touch. So the difference in averages is important, but the
wider the bell curve (the picture of all individual orders
lumped together), the farther apart the averages need to be.
(You can learn more about confidence, uncertainty, and bell
curves in
this
Audience Development article.)
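The scaling rules above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the response-rate equation from the next section, N = 31.38 × p(1-p) ÷ (significant effect)², which matches the article's worked e-mail example. The function name `total_sample_size` is only illustrative.

```python
def total_sample_size(response_rate, relative_lift, constant=31.38):
    """Total N (test + control) from the article's response-rate equation.

    relative_lift is the smallest lift to detect, expressed as a fraction
    of the response rate (0.10 means a 10% lift).
    """
    effect = response_rate * relative_lift          # lift in absolute terms
    variance = response_rate * (1 - response_rate)  # p(1-p) for yes/no data
    return constant * variance / effect ** 2


# A 1% responder versus a 3% responder, both looking for a 10% lift.
ratio_by_rate = total_sample_size(0.01, 0.10) / total_sample_size(0.03, 0.10)

# Detecting a 5% lift versus a 10% lift at the same 1% response rate.
ratio_by_lift = total_sample_size(0.01, 0.05) / total_sample_size(0.01, 0.10)

print(f"1% vs 3% response rate: {ratio_by_rate:.2f}x the sample size")
print(f"5% vs 10% lift:         {ratio_by_lift:.2f}x the sample size")
```

The first ratio comes out slightly above 3 because of the (1-p) term. The second is 4 because the effect is squared in the denominator.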
The sample size equation – for response rate
The most accurate, yet straightforward,
equation for calculating sample size based on response rate is
this:
$$N = \frac{31.38 \times \bar{p}\,(1 - \bar{p})}{(\text{significant effect})^2}$$
Where:
- 31.38 is a constant (related to alpha and
beta error: look
here for more details)
- p̄ (p-bar) =
average expected response rate (p for proportion, with the
bar on top meaning average)
For response rate (and other yes/no data), the standard
deviation is determined by the average, so p̄ × (1 - p̄)
equals the standard deviation squared.
- significant effect = how small a change
you want to see statistically significant (in the same units
as p)
- N = the total sample size for the test:
split between one test cell and the control, or split evenly
among all multivariable test recipes
- For one test cell, the sample size =
½ x N
- For a 16-recipe multivariable test,
each recipe will have a sample size = N÷16
For example, if your
e-mail conversion rate is 3% and you want to see if
a new subject line can increase conversion by 10% or
more, then:
• p = 0.03, (1-p) = 0.97
• Significant effect = 3% x 10% = 0.003 and
(significant effect)² = 0.000009
• N = (31.38)(0.03)(0.97)/(0.000009) = 101,462
So, you need to send out about 50,000 of your
control e-mail and 50,000 with the new subject line,
in order to see if the difference is 10% or more (if
the new e-mail has a conversion rate of 3.3% or
more).
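The e-mail example can be reproduced in a few lines. This is a sketch under the assumption that the equation is N = 31.38 × p̄(1 - p̄) ÷ (significant effect)², which is what the numbers above imply. The function and variable names are illustrative, not part of any LucidView tool.

```python
import math


def response_rate_sample_size(p_bar, significant_effect, constant=31.38):
    """Total sample size N for yes/no data such as response or conversion.

    p_bar              -- average expected response rate (0.03 for 3%)
    significant_effect -- smallest change to detect, in the same units
                          as p_bar (0.003 for a 10% lift on 3%)
    constant           -- 31.38 as given in the article
    """
    return constant * p_bar * (1 - p_bar) / significant_effect ** 2


p_bar = 0.03
effect = p_bar * 0.10                 # a 10% lift on a 3% conversion rate
n_total = response_rate_sample_size(p_bar, effect)

per_cell = math.ceil(n_total / 2)     # one test cell versus the control
per_recipe = math.ceil(n_total / 16)  # a 16-recipe multivariable test

print(f"Total N: {n_total:,.0f}")
print(f"Per cell in an A/B split: {per_cell:,}")
print(f"Per recipe in a 16-recipe test: {per_recipe:,}")
```

Rounding up with `math.ceil` is a cautious choice made here for the sketch. The article simply rounds to about 50,000 per cell.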
E-mail us if you would like an Excel version of the
LucidView sample size calculator.
This equation changes when:
- The control is very large
Then the test cell may be reduced 25% or more (but if you
want to compare test cells to other test cells, then sample
size should remain the same).
- You can accept only a 50-50 chance of
seeing the “effect” of interest
In this case, overall sample size can be cut in half. (The
“effect” falls right on the Line of Significance, so with a
bell curve around the effect, about half the time an
individual test will fall below the line and half the time
it will be significant.)
- You have >2 levels for some test
elements in a multivariable test
Generally, you need at least 50% larger sample size for each
additional level, for example...
- A 15-element multivariable test
– with each element at two levels – requires no greater
sample size than a simple A/B split
- Central composite designs may
have only 4 test elements, but with each at 3 levels, they
require a much larger sample size
- Price tests with centerpoints
are very insightful in uncovering curvature and
interactions, but the “centerpoint” (3rd level) requires additional
sample size
- Optimal designs (described in
this article from Quirk’s Marketing Research Review)
not only add immense complexity, but cannot overcome the
need for greater sample size when testing 3 or 4 levels.
If this equation looks too complicated, you
can follow these general (and statistically-valid) guidelines:

For example, if you want to see if a free gift
increases response by 15% or more, then…
- If your response rate is 1%, you need to
mail 69,000 of the test package (and at least 69,000 of the
control), since 1% x 69,000 = 690 = 1,380 ÷ 2
- If your response rate is 5%, then it’s OK
to mail only 13,800 of the test package to see a 15% lift
(to see if another 103 or 104 people respond to the premium
offer).
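The "number of responses" figures in these examples can be derived from the response-rate equation: multiplying N by the response rate gives total responses = 31.38 × (1-p) ÷ (lift)², where lift is the relative change you want to see. The sketch below is that derivation, not a reconstruction of the original guideline table, and the function names are hypothetical.

```python
def responses_per_cell(relative_lift, response_rate, constant=31.38):
    """Responses needed in each of two cells (test and control)."""
    total_responses = constant * (1 - response_rate) / relative_lift ** 2
    return total_responses / 2


def names_per_cell(relative_lift, response_rate):
    """Mail quantity per cell that should yield that many responses."""
    return responses_per_cell(relative_lift, response_rate) / response_rate


for rate in (0.01, 0.05):
    responses = responses_per_cell(0.15, rate)
    names = names_per_cell(0.15, rate)
    print(f"{rate:.0%} response rate: about {responses:,.0f} responses "
          f"and {names:,.0f} names per cell")
```

At a 1% response rate this gives roughly 690 responses and 69,000 names per cell, as in the example above. At 5% the (1-p) term brings the requirement a little below the rounded 13,800, so using 690 responses per cell as a flat rule is slightly conservative.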
The sample size equation – for dollar sales
If your key metric is continuous data, like
average order value, average sales per store, or year-over-year
change in sales, then you need to calculate the standard
deviation of the individual values and include it in the
equation:

$$N = \frac{31.38 \times \sigma^2}{(\text{significant effect})^2}$$
Where:
- 31.38 is a constant
- σ² = the variance
(standard deviation squared) of the individual values
- significant effect = how small a change
you want to see statistically significant (in the same units
as the standard deviation, σ)
- N = the total number of orders (or
stores, weeks, markets) needed for the test:
split between one test cell and the control, or split evenly
among all multivariable test recipes
For example, if an
average customer spends $100, but most individual
purchases are between $50 and $150, then…
1. Calculate
the standard deviation of, perhaps, 5,000 recent
purchases (removing extreme values first). If
the standard deviation is $25, then the variance
(σ²) is 25² = 625
2. If you want to see if a new Facebook
promotion increases sales 12% or more, then
“significant effect” = 12% x $100 = $12.00
3. Calculate N = (31.38)(625)/(144) = 136.2
This means you
need to have at least 68 buyers in the test group
and 68 buyers in the control group. But if only 10%
of the people who see the Facebook offer actually
make a purchase, then you need to be sure about 680
people see the new promotion, with at least 680 in
the control group. Notice that the sample size for
sales – for variables data – tends to be much
smaller than the sample size for response rate.
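The Facebook example works out as follows. This is a sketch assuming the equation N = 31.38 × σ² ÷ (significant effect)², which matches step 3 above. The 10% purchase rate is the figure from the example, and the names are illustrative.

```python
def sales_sample_size(std_dev, significant_effect, constant=31.38):
    """Total number of orders N for continuous data such as order value.

    std_dev and significant_effect must be in the same units (dollars here).
    """
    return constant * std_dev ** 2 / significant_effect ** 2


average_order = 100.0
std_dev = 25.0
effect = 0.12 * average_order          # a 12% lift on $100 is $12

n_total = sales_sample_size(std_dev, effect)
buyers_per_group = n_total / 2         # test group and control group

purchase_rate = 0.10                   # share of viewers who buy
viewers_per_group = buyers_per_group / purchase_rate

print(f"Total buyers needed: {n_total:.1f}")
print(f"Buyers per group: {buyers_per_group:.1f}")
print(f"People who must see the offer, per group: {viewers_per_group:.0f}")
```

The result is driven entirely by the ratio of the standard deviation to the effect, so a less tidy order distribution (a standard deviation of $50, say) would quadruple the requirement.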
How do you calculate the standard deviation?
Averages too often hide the variation among
individual customers. In order to understand the true variation
in sales (or other variables), you need to go back to the
individual orders. Start with a campaign similar to the
up-coming test campaign and:
a) List all individual orders (excluding
any tests in the campaign)
b) Calculate the standard deviation (σ) of all the orders
(simply “=stdev(range-of-values)” in Excel)
c) Remove outliers – extremely high or low numbers (normally
above or below 2-3σ)
d) Recalculate the standard deviation (σ) of all the
remaining orders
e) Repeat (c-d) 1-2 times more, if needed (Note that steps
a-e can get more complicated depending upon the type of
data, consistency, and distribution, but… that’s another
topic).
f) Use the final standard deviation value (σ) in the sample
size equation.
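Steps (a) through (f) can be sketched with Python's standard library in place of Excel. The 3σ cutoff and the limit of three passes are assumptions picked from the "2-3σ" and "1-2 times more" ranges above, and `trimmed_std_dev` is a hypothetical helper name. Real order data may need more care, as the note in step (e) says.

```python
import statistics


def trimmed_std_dev(orders, k=3.0, max_passes=3):
    """Standard deviation after repeatedly dropping values beyond k sigma.

    Returns (sigma, kept_values). Stops early once a pass removes nothing.
    """
    values = list(orders)                          # (a) individual orders
    sigma = statistics.stdev(values)               # (b) first estimate
    for _ in range(max_passes):                    # (e) repeat c-d as needed
        mean = statistics.mean(values)
        kept = [v for v in values if abs(v - mean) <= k * sigma]  # (c)
        if len(kept) == len(values) or len(kept) < 2:
            break
        values = kept
        sigma = statistics.stdev(values)           # (d) recalculate
    return sigma, values                           # (f) use sigma in the equation


# Made-up order values for illustration: thirteen typical orders and one outlier.
orders = [95, 100, 105, 98, 102, 97, 103, 99, 101, 100, 96, 104, 100, 2500]
sigma, kept = trimmed_std_dev(orders)
print(f"Orders kept: {len(kept)} of {len(orders)}")
print(f"Standard deviation after trimming: {sigma:.2f}")
```

`statistics.stdev` is the sample standard deviation, the same quantity Excel's `STDEV` returns.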
The deep, dark statistics behind the equation
The two sample size equations, above, are
really all you need. But if you love statistics (or tend to be
overly curious), here’s the response equation with the magic
number 31.38 broken down into its components:

$$N = \frac{4\,(t_{\alpha/2} + t_{\beta})^2\,\sigma^2}{(\text{significant effect})^2}$$
Where:
- tα/2 =
the t value for type I, or “alpha” risk – basically, a fancy
statistical way of saying “about 2” standard deviations.
- It’s generally accepted to use 95%
confidence – only a 5% chance that a significant effect is
truly random chance – which gives alpha = 0.05. The “t
value” is 1.96, as long as your sample size is over 1,000.
- There's seldom a reason to drop below
95% confidence (especially with multiple test cells or
multivariable testing). Usually lower confidence is
simply a way to find some meaning in insignificant
results.
- tβ = the t value for the other type
of risk – the chance of missing what should be a significant
effect. In other words, 1-β is how
confident you want to be in seeing the selected effect.
- This type of error is ignored in many
sample size equations. If you ignore it – essentially setting tβ
at 0 – then you have only a 50-50 chance of seeing the
"effect" you want to see.
- tβ = 0.841 for 80% confidence
that the effect will actually be significant
(β = 0.20)
- σ² = the variance of the data
- σ² = p(1-p) for response (yes/no) data, so the average
response rate is directly related to the variance
- σ² is not related to average sales (or other continuous
data), so you need to calculate the variance (standard
deviation)² for each new dataset. Plus
you also need a way to calculate σ²
for each test (i.e., have 2+ individual values within
each test recipe)
The number 4 comes from the calculation of the
pooled standard deviation. Comparing two groups (test and
control), 4 is always a part of the equation (in the
calculation, you have 2 samples each over ½ the sample size, so
you end up with 2x2=4 on the top of the equation).
OK, so if you want only 5% chance of a
significant effect being noise and 80% chance of seeing the
“significant effect” you selected, then… 4(tα/2 + tβ)² = 4(1.96 + 0.841)² = 31.3824 ≈ 31.38
If you set tβ = 0, then you cut the
sample in half… and end up with only a 50% chance that the
“significant effect” will actually be significant.
Going back to the basic idea of sample size…
How does (2σ)² become (31.38)(σ²)?
Clearly 4 ≠ 31, but the number gets
bigger because (a) you add beta risk, (b) you have 2 samples to
compare (test and control) and (c) the overall sample size, N,
equals 2x the sample of one test cell. Therefore, you go from
the t value at 95% confidence, “about 2,” squared, to “4(tα/2 + tβ)²”
in order to account for the full test sample
and both types of error.
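The constant can be recomputed for other confidence levels. The sketch below uses standard normal quantiles from Python's `statistics.NormalDist` as a large-sample stand-in for the t values, which is an assumption. It gives about 31.40 rather than 31.38 because the article rounds the two values to 1.96 and 0.841 before squaring. The function name is illustrative.

```python
from statistics import NormalDist


def sample_size_constant(alpha=0.05, power=0.80):
    """4 * (z_alpha/2 + z_beta)^2, the multiplier in the sample size equation.

    alpha -- chance that a "significant" effect is only noise (two-sided)
    power -- chance of seeing the selected effect as significant (1 - beta)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    return 4 * (z_alpha + z_beta) ** 2


full = sample_size_constant()                # both alpha and beta risk
no_beta = sample_size_constant(power=0.50)   # beta ignored: z_beta = 0

print(f"95% confidence, 80% power: {full:.2f}")
print(f"95% confidence, 50% power: {no_beta:.2f}")
print(f"Ratio: {no_beta / full:.2f}")
```

Setting the power to 50% makes the second term zero, which is the "cut the sample in half" case described above.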
Why are other equations different?
Many other sample size equations and
rules-of-thumb give you a different answer. Why do we claim ours
is right and theirs is wrong? Well, first of all, if you have
SAS, Minitab, or other statistical software, you should be able
to confirm that our equation matches the software’s answer. Most
others miss one or more of the following pieces of the equation:
- Variation in both the test and control
groups – ignoring the control group and considering the
“bell curve” around the test group alone.
- Two types of variation – looking at
“alpha” error, without considering “beta” error
- Realistic confidence in seeing the lift
you want to see – plug most “rules” into these equations and
you’ll see a miserably low confidence in results, which
translates into an impossibly large lift needed to show a
significant difference.
- The total universe is unknown. Some
equations ask you for the number of people in the overall
"universe." Pure statistics assumes greater confidence when
your test group is a large proportion of all items that
could possibly be sampled. But the marketplace is not
a closed, static environment – the "universe" is unknowable
and ever-changing, so it's unreasonable to include the
universe in the equation. (Can you give me the total number
of everyone who might someday be your customer?)
- The whisper-down-the-lane effect
Many rules-of-thumb have been passed down in marketing lore
without any statistical basis. Rules like “You need 100 responses per
test cell” (or 50, or 500, or …) have been written in
marketing articles for decades, but these almost always
severely underestimate the true sample size you need.
Someday we’ll find the original instigator and solve the
mystery of this misinformation.
Choosing the right metric for sample size
Sample size equations are only as accurate as
the data put into the calculation. For example, if you plug your
e-mail clickthrough rate into the equation, you will probably
end up with only a fraction of the sample size you’ll need for a
good read on conversion rate. If you want to increase response
rate and average order value, then you should look at the sample
size required for each and probably choose the higher of the
two.
Generally, your test analysis should focus on
simple “behavioral” metrics rather than calculated metrics. You
may want to drive long-term profitability, but focusing simply on response and sales
for each test may make more sense.
Whenever you use a calculated metric within a test analysis –
like lifetime value, or forecast total sales – you add (a)
potential sources of error from the assumptions within the
formulas and (b) a combination of variables that each may be
driven different ways by different effects (the combination may
mute effects that are significant for each variable alone).
Profitability and sales forecasts are important, but when
analyzing test results, focus on a few clear, objective, simple
metrics. Then you can input results for each into your more
complicated formulas.
The metric determines how experimental error
is calculated. In retail tests, if the key metric is per-store
sales (the change in sales versus predicted levels or control
stores), then experimental error will be based on the difference
among all stores in each test group. If the change in sales is
calculated by market or customer (for a loyalty program), then
the error may appear greater or less. Generally, more granular
metrics and “test units” offer a more realistic view of the
marketplace, allowing you to see outliers in the data while
offering the flexibility to aggregate data in different
ways to analyze results by market, region, or type of customer
or store.
Examples by channel – good metrics may
include:
- Direct mail (letter package, catalog,
postcard, etc.): response rate, average order value, total
sales, net revenue
- E-mail and landing page: conversion rate,
average order, total sales, responses/clicks
- Retail and CRM: change in store sales
(versus baseline/control group), unit sales, basket size,
average margin; number of transactions, total annual sales,
and total margin dollars by customer (for a loyalty program)
- Advertising: change in sales by market,
advertising ROI
- Telemarketing: conversion rate, total
sales, customer satisfaction
When sample size is limited
Many small companies have a small total mail
(or e-mail or retail) volume, so any testing is difficult. When
sample size is limited, consider if you can…
1. Combine elements into one multivariable
test design
One big benefit of advanced techniques is sample size efficiency
– unlike A/B splits, for multivariable test designs, sample size can remain constant whether
you’re testing 2 or 22 different elements, as long as all
elements are part of the same statistical test design (see two
examples in this
Hearst Magazines direct mail case study).
A simple example: If you need 50,000 per test
cell, then 2 tests plus the control require 150,000 names. One
multivariable test of the same 2 elements requires only 100,000
for equal confidence. This 33% drop in sample size becomes even
larger as more marketing-mix elements are combined into one
test.
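The arithmetic in this example generalizes easily. The sketch below assumes, as the text states, that each split-run test cell and the control need the same quantity, and that one two-level multivariable test needs the same total as a single A/B split. The function names are illustrative.

```python
def split_run_total(per_cell, n_elements):
    """One test cell per element, plus one control cell."""
    return per_cell * (n_elements + 1)


def multivariable_total(per_cell):
    """Same total as a single A/B split, however many elements are tested."""
    return 2 * per_cell


per_cell = 50_000
for n_elements in (2, 5, 10):
    split = split_run_total(per_cell, n_elements)
    combined = multivariable_total(per_cell)
    saving = 1 - combined / split
    print(f"{n_elements:>2} elements: split-run {split:,} vs "
          f"multivariable {combined:,} ({saving:.0%} fewer names)")
```

With 2 elements the saving is the 33% quoted above, and it grows with every element added to the same design.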
One
non-profit DM test would have needed 4 million more mailings
if the same elements had been tested using split-run techniques.
2. Run the same test across multiple drops,
segments, or promotions
This
Conde Nast e-mail test was run over three different
campaigns to build sample size while also analyzing
campaign-specific differences. A
Whirlpool contact strategy test
focused on weekly service contract renewals, so they continued
to run the test week-after-week until the sample size was large
enough.
3. Focus on fewer, large changes
When sample size is limited, the equations let you see how large
a lift you need from each test element. If you only have 500
buyers in each monthly campaign, then consider new ideas that
each have the chance to increase response by 25% or more.
Perhaps combine elements into one envelope test (changing size,
color, teasers all at once), or copy test (version A versus
version B, instead of testing a new headline, P.S., or sidebar
alone). Also consider bolder differences in each element: test a
15% price increase instead of 5%, a free tote bag instead of 1
free issue, or a 9x12” package instead of 6x9”.
4. Test the most responsive customers
Even if overall response rate is relatively low, are there
high-response segments or more receptive customers who may be a
good focus for testing? For example, in contract renewals –
magazine subscriptions, extended service warranties, Internet or
phone service – the first-time renewal is often a big hurdle.
These customers are valuable and often more responsive to creatives and offers (versus long-time customers who are
profitable, yet with greater inertia to respond at about the
same level no matter what they receive).
Advanced topics
OK, if you're an expert in this stuff, here are a
few (of many) advanced topics to consider:
1. Is response rate truly binomial (yes/no)
data?
The sample size equation for response rate is easy to use, plus
setting the “beta” risk at 50% gives you the Line of
Significance (replacing 31.38 with 15.37). But response rate
assumes an equal probability of each person responding – not
always true if you have different segments mixed together, or
changing “influencers” outside of the test elements. A great
safeguard is to split each test recipe into two equal groups:
each receives the exact same mailing (or e-mail, landing page,
etc.), but you have a different keycode / tracking number for
each. This allows you to calculate the variance between
“replicates” (the difference in response rate between the two
samples from the same recipe) as a measure of experimental
error, as well as calculating experimental error from the
response rate (where the variance is assumed to be related to
the average response). If both versions of the Line of
Significance match, you’re OK. If they differ, the “replicated”
experimental error is generally a more conservative estimate.
2. Consider other sources of variation when
selecting sample size
All sample size equations consider the variation among customers
without considering additional noise from outside sources of
variation. With environmental, personal, competitive,
week-to-week, and numerous other changes going on constantly
(and largely unknown), these “pure” sample size equations may
not incorporate a realistic level of natural
in-market variation. Therefore, think rationally about your
sample size: Do you plan to run the test long enough to balance
out all of these unknowns?
For example, retail and packaged goods (CPG) tests are challenging
because of store, market, and region-specific differences. Local
growth and customer demographics, seasonality (including
temperature differences and back-to-school dates), and
merchandise differences are just a few changes that may affect
week-to-week sales changes. Replication through additional
stores, multiple markets, and comparable control markets running
in parallel can all help give different perspectives of
real-world variation during the test period. Plus extra stores
offer the freedom to drop some unexpected outliers that occur
during the test.
Another example… An Internet test may only
need to run 5 days based on your calculations. But there may be
some daily and weekly variation that justifies a longer run
length. A 3-week test may be unnecessary, but it allows for the
analysis of week-to-week variation and unknown changes that may
occur over time. A good check: analyze your test results each
week and see if they match. If not, then you may need to take
the average results over all 3 weeks to increase reliability (or
perhaps a few days when conversion was much higher – perhaps
when a partner ran a special promotion – should be removed from
the analysis).
3. What is the best Line of Significance?
Experimental error is relatively easy to calculate – basically
the same ±2σ concept as you use for confidence limits (only a
5% probability that any effect would be significant due simply
to random chance). One great thing about testing is that the
most reliable tests should have clear results without the need
for overly complex statistical analyses. Put simply: Does the
Line of Significance make sense? Do the “significant” effects
clearly rise above the non-zero effects due to random chance?
With large sample sizes, using ±3σ
or the Bonferroni method (with alpha ~ 5% per element instead of
per test), you may see a more realistic estimate of real-world
error.
Here are two examples:
- Results of a 7-element, 16-recipe direct
mail test showed 4 significant effects. Rearranging the
sample size equation, the Line of Significance = 0.22% (the
orange dashed line, calculated from a total sample size of
250,000 with an 8.5% average response rate). Even without
any measure of experimental error, you could guess that the
top four effects are significant – these are far from the
steady stair-step effects you generally see from random
variation. But be careful: sometimes your eye can pick
out "significant" effects when the true error is far above
even the largest effect (basically, you have much greater
variation in the variation).

- Results of a 14-element, 32-recipe direct
mail test showed the following main effects and
interactions. With a total sample size of about 500,000 and
average response rate of 3.0%, the Line of Significance =
0.099%, just below the main effect of element E. However, in
this case:
- Elements E and G were removed from
the test and the columns left “empty” with no main
effect (only 3-way interactions), so it’s tough to
believe these effects could be significant.
- With a small difference between
consecutive effects below C or P (the top 3-4 effects),
it’s difficult to draw a clear “line in the sand”
between significant and insignificant effects. The break
between the AF and AE two-way interactions is
reasonable, but adding the next 6 interaction effects is
more difficult to justify.
- With a large sample size and many
effects, a more conservative measure of experimental
error may be needed. In this case, the top 8 effects
(dark blue bars) provide more actionable insights than
simply following the rules for binomial data.
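The Line of Significance in the first example can be reproduced by solving the response-rate equation for the effect. This sketch assumes the constant 15.37 (beta risk set at 50%, as described in advanced topic 1), which yields the 0.22% quoted for the 7-element test. The function name is illustrative.

```python
import math


def line_of_significance(total_n, p_bar, constant=15.37):
    """Smallest effect that reaches significance, in response-rate units.

    Rearranges N = constant * p(1-p) / effect^2 to solve for the effect.
    """
    return math.sqrt(constant * p_bar * (1 - p_bar) / total_n)


# First example: 250,000 total sample size, 8.5% average response rate.
first = line_of_significance(250_000, 0.085)
print(f"Line of Significance: {first:.2%}")
```

Passing a larger constant gives a more conservative line, in the spirit of the ±3σ and Bonferroni suggestions above.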

About LucidView - multichannel, multivariable testing
Testing remains the only way to prove which
creative ideas, offers, and contact strategies have a direct
impact on response. Advanced techniques let you test more
variables, more rapidly, with a small sample size, greater
accuracy, and a significant increase in ROI.
LucidView is a recognized leader in the art
and science of marketing testing. Bridging the gap between
academic statistics and real-world marketing programs, LucidView
consults with industry leaders, sharing best practices in
multivariable testing and cutting-edge analytics for direct
marketing, Internet, retail, and CRM programs. We help build
your in-house skills, uncover new insights, and increase sales
and marketing ROI.
From Boston to Belgium and Brazil (and points
in between), LucidView has given market leaders the guidance
they need to quickly and clearly increase sales in tough,
competitive markets. With unparalleled technical expertise and
marketing experience, we’ve consulted, lectured, and published
in the field of multivariable testing for over a decade... and
continue to prove that, as Stone & Jacobs write, "Testing is still the best way to find true
breakthroughs."
E-mail us with any questions and to
learn more. You may also want to look over a few
case studies & articles, like “Navigating the Depths of
Multivariable Testing” or a quick summary of Hearst's
Food Network Magazine tests.
Since we’ve probably covered more than
you ever cared to learn about sample size, we’ll stop here. Keep
in mind that most every “simple” concept becomes more complex
(and useful) as
your experience grows and you delve deeper into the details.
Like marketing itself, testing is a combination of art and
science, rules and the skill to interpret them, a collection of
tools and the experience to use them correctly.