Testing and analytics are like two sides of the statistical coin. Back-end analytics draw information from raw data. Proactive testing breaks new ground, guiding you to new insights and opportunities. Testing remains the only way to prove the real-world impact of marketing-mix changes. But the value of both depends upon the size of the coin – the quantity of data available for testing and analysis. Greater sample size equals a larger investment in your test and the potential for more valuable insights.
As a tester, you may never have too much sample size, but when do you have enough? That’s where sample size equations are useful.
One sample for one test
Each test requires its own sample size. Testing four e-mail subject lines requires 2.5 times the sample size of testing one subject line versus the “control” (the current best-performing version). An exception: one multivariable test requires no more sample size than one A/B split. As long as all test elements are part of the same multivariable test design, you can use the same sample size whether testing 3 or 33 marketing-mix elements at once. View this Internet Retailer article that explains how it works.
Calculating sample size
Sample size is an important issue in marketing testing because it has such a significant impact on the validity of your results and is so often misunderstood. Without sufficient sample size (i.e., sufficient data), your test will be meaningless, little more than a guess. Greater sample size increases your confidence that any change in response or sales is a real difference and not just random chance. Sample size should be based on an equation, not a rule-of-thumb. General rules like “each test cell should have at least 100 orders” oversimplify the issue and often result in a weak test with few significant results.
Different equations are appropriate for different types of data, but all sample size (N) calculations are based on:
1. How much variation there is in your data, measured in “standard deviations” (with the symbol, σ, sigma)
2. How small a "significant effect" you want to be able to see (the change in response rate for the test versus the control)
Two standard deviations are an important measure in separating significant market changes from natural variation. Statistically, if nothing has changed, about 95% of results will fall within ±2σ of the average. So an increase (or decrease) beyond 2σ means something has changed, something that you should be able to identify. Also, for response rate (and other yes/no data), the standard deviation is determined by the average (σ² = p(1-p)), so a higher response rate has a lower standard deviation relative to the average.
Generally, the smaller the change you want to see, or the lower your response rate, the larger the sample size you need.
- The size of the mailing doesn’t matter – it’s the number of responses that’s important.
If response rate for direct mail campaigns is usually about 1%, you’ll need about three times the sample size of someone who normally sees a 3% response.
- Cutting experimental error in half requires 4x the sample size.
If you want to see any variable that increases response by 5% (e.g., from 1% to 1.05%), you’ll need four times the sample size you would need to see a 10% or larger effect.
- More variation in sales = less confidence = the need for greater sample size.
For example, if almost all of your customers spend $95 to $105, then increasing the average order to $150 looks like a significant difference. If your customers average $100 per order, but some spend only $20 and some spend $250, then one campaign averaging $150 per order may just be random chance.
- Different numbers are not always statistically different.
Picture a “bell curve” around every data point. Two points are statistically different only if their bell curves barely touch. So the difference in averages is important, but the wider the bell curve (the picture of all individual orders lumped together), the farther apart the averages need to be. You can learn more about confidence, uncertainty, and bell curves in this Audience Development article.
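The "4x the sample size" rule above follows from how the standard deviation of an observed response rate shrinks with sample size. Here is a minimal sketch, assuming the usual binomial standard error sqrt(p(1-p)/n); the function name and the 10,000 and 40,000 mailing quantities are hypothetical, chosen only for illustration.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard deviation of an observed response rate p across n pieces mailed."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical mailing sizes at a 1% response rate.
base = standard_error(0.01, 10_000)
quadrupled = standard_error(0.01, 40_000)

print(f"10,000 mailed: +/-2 sigma = {2 * base:.4%}")
print(f"40,000 mailed: +/-2 sigma = {2 * quadrupled:.4%}")
print(f"error ratio = {base / quadrupled:.2f}")  # quadrupling n halves the error
```

Because the error shrinks with the square root of n, halving the error always costs four times the sample.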
The sample size equation – for response rate
The most accurate, yet straightforward, equation for calculating sample size based on response rate is this:
$$N = \frac{31.38 \times \bar{p}(1-\bar{p})}{(\text{significant effect})^2}$$
- 31.38 is a constant
- p̄ (p-bar) = average expected response rate (p for proportion, with the bar on top meaning average). For response rate (and other yes/no data), the standard deviation is determined by the average: p̄(1-p̄) equals the standard deviation squared.
- significant effect = how small a change you want to see statistically significant (in the same units as p)
- N = the total sample size for the test: split between one test cell and the control, or split evenly among all multivariable test recipes
- For one test cell, the sample size = ½ x N
- For a 16-recipe multivariable test, each recipe will have a sample size = N÷16
For example, if your e-mail conversion rate is 3% and you want to see if a new subject line can increase conversion by 10% or more, then:
• p̄ = 0.03, (1-p̄) = 0.97
• Significant effect = 3% x 10% = 0.003 and (significant effect)² = 0.000009
• N = (31.38)(0.03)(0.97)/(0.000009) = 101,462
So, you need to send out about 50,000 of your control e-mail and 50,000 with the new subject line, in order to see if the difference is 10% or more (if the new e-mail has a conversion rate of 3.3% or more).
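If you would rather not do this by hand, the calculation can be sketched in a few lines of Python. This assumes the equation is N = 31.38 × p̄(1-p̄) / (significant effect)², which is the form that reproduces the N = 101,462 in the example; the function and parameter names are hypothetical.

```python
def response_sample_size(p_bar: float, relative_lift: float,
                         constant: float = 31.38) -> float:
    """Total N (test plus control) needed to see a given relative lift in response rate."""
    effect = p_bar * relative_lift  # significant effect, in the same units as p_bar
    return constant * p_bar * (1 - p_bar) / effect ** 2


n_total = response_sample_size(p_bar=0.03, relative_lift=0.10)
print(f"Total N      = {n_total:,.0f}")       # 101,462
print(f"Per A/B cell = {n_total / 2:,.0f}")   # 50,731
print(f"Per recipe (16-recipe multivariable test) = {n_total / 16:,.0f}")
```

Try changing `relative_lift` to 0.05: the total roughly quadruples, which is the "4x" rule in action.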
This equation changes when:
- The control is very large
Then the test cell may be reduced 25% or more (but if you want to compare test cells to other test cells, then sample size should remain the same).
- You can accept only a 50-50 chance of seeing the “effect” of interest
In this case, overall sample size can be cut in half. (The “effect” falls right on the Line of Significance, so with a bell curve around the effect, about half the time an individual test will fall below the line and half the time it will be significant.)
- You have >2 levels for some test elements in a multivariable test
Generally, you need at least 50% larger sample size for each additional level, for example...
- 15-element multivariable test, with each element at two levels, requires no greater sample size than a simple A/B split
- Central composite designs may have only 4 test elements, but with each at 3 levels, they require a much larger sample size
- Price tests with centerpoints are very insightful in uncovering curvature and interactions, but the “centerpoint” (3rd level) requires additional sample size
- Optimal designs not only add immense complexity, but cannot overcome the need for greater sample size when testing 3 or 4 levels.
The sample size equation – for dollar sales
If your key metric is continuous data, like average order value, average sales per store, or year-over-year change in sales, then you need to calculate the standard deviation of the individual values and include it in the equation:
$$N = \frac{31.38 \times \sigma^2}{(\text{significant effect})^2}$$
- 31.38 is a constant
- σ² = the variance (standard deviation squared) of the individual values
- significant effect = how small a change you want to see statistically significant (in the same units as the standard deviation, σ)
- N = the total number of orders (or stores, weeks, markets) needed for the test: split between one test cell and the control, or split evenly among all multivariable test recipes
For example, if an average customer spends $100, but most individual purchases are between $50 and $150, then:
- Calculate the standard deviation of, perhaps, 5,000 recent purchases (removing extreme values first). If the standard deviation is $25, then the variance (σ²) is 25² = 625
- If you want to see if a new Facebook promotion increases sales 12% or more, then “significant effect” = 12% x $100 = $12.00
- Calculate N = (31.38)(625)/(144) = 136.2
This means you need to have at least 68 buyers in the test group and 68 buyers in the control group. But if only 10% of the people who see the Facebook offer actually make a purchase, then you need to be sure about 680 people see the new promotion, with at least 680 in the control group. Notice that the sample size for sales – for variables data – tends to be much smaller than the sample size for response rate.
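The same example can be sketched in Python. This assumes the equation is N = 31.38 × σ² / (significant effect)², the form used in the calculation above; the function name is hypothetical, and the 10% purchase rate comes from the example.

```python
def sales_sample_size(std_dev: float, effect: float,
                      constant: float = 31.38) -> float:
    """Total number of orders (test plus control) needed to see `effect` dollars."""
    return constant * std_dev ** 2 / effect ** 2


n_total = sales_sample_size(std_dev=25.0, effect=12.0)   # $12 = 12% of $100
buyers_per_group = round(n_total / 2)
purchase_rate = 0.10                                     # share of viewers who buy
viewers_per_group = round(buyers_per_group / purchase_rate)

print(f"Total orders needed = {n_total:.1f}")            # 136.2
print(f"Buyers per group    = {buyers_per_group}")       # 68
print(f"Viewers per group   = {viewers_per_group}")      # 680
```

If you prefer to be conservative, round the buyers per group up rather than to the nearest whole number.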
How do you calculate the standard deviation?
Averages too often hide the variation among individual customers. In order to understand the true variation in sales (or other variables), you need to go back to the individual orders. Start with a campaign similar to the upcoming test campaign and:
a) List all individual orders (excluding any tests in the campaign)
b) Calculate the standard deviation (σ) of all the orders (simply “=stdev(range-of-values)” in Excel)
c) Remove outliers – extremely high or low numbers (normally above or below 2-3σ)
d) Recalculate the standard deviation (σ) of all the remaining orders
e) Repeat (c-d) 1-2 times more, if needed
f) Use the final standard deviation value (σ) in the sample size equation.
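Steps (b) through (f) can also be sketched in Python instead of Excel. This is only an illustration under stated assumptions: the function name, the cutoff `k` (anywhere in the 2-3σ range mentioned above), and the sample order values are all hypothetical. It uses the sample standard deviation, which is what Excel's STDEV returns.

```python
import statistics


def trimmed_std_dev(orders, k: float = 2.5, max_passes: int = 3) -> float:
    """Standard deviation after repeatedly dropping orders more than k sigma from the mean."""
    values = list(orders)
    sigma = statistics.stdev(values)                # step (b)
    for _ in range(max_passes):                     # steps (c)-(e)
        mean = statistics.fmean(values)
        kept = [v for v in values if abs(v - mean) <= k * sigma]
        if len(kept) == len(values) or len(kept) < 2:
            break                                   # nothing left to remove
        values = kept
        sigma = statistics.stdev(values)            # step (d)
    return sigma                                    # step (f)


# Hypothetical orders: nine typical purchases and one extreme value.
orders = [95, 100, 105, 98, 102, 110, 90, 101, 99, 2500]
print(f"Raw sigma     = {statistics.stdev(orders):.2f}")
print(f"Trimmed sigma = {trimmed_std_dev(orders, k=2.0):.2f}")
```

With very small lists, one extreme value inflates σ so much that a 3σ cutoff may never flag it, which is one reason to work from a few thousand orders and to look at the data before trusting any automatic cutoff.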
The deep, dark statistics behind the equation
The two sample size equations, above, are really all you need. But if you love statistics (or tend to be overly curious), here’s the response equation with the magic number 31.38 broken down into its components:
$$N = \frac{4\,(t_{\alpha/2} + t_{\beta})^2\,\sigma^2}{(\text{significant effect})^2}$$
- tα/2 = the t value for type I, or “alpha” risk – basically, a fancy statistical way of saying “about 2” standard deviations.
- It’s generally accepted to use 95% confidence – only a 5% chance that a significant effect is truly random chance – which gives alpha = 0.05. The “t value” is 1.96, as long as your sample size is over 1,000.
- There's seldom a reason to drop below 95% confidence (especially with multiple test cells or multivariable testing). Usually lower confidence is simply a way to find some meaning in insignificant results.
- tβ = the t value for the other type of risk – the chance of missing what should be a significant effect. In other words, 1-β is how confident you want to be in seeing the selected effect.
- This type of error is ignored in many sample size equations. If you ignore it – essentially setting tβ at 0 – then you have only a 50-50 chance of seeing the "effect" you want to see.
- tβ = 0.841 for 80% confidence that the effect will actually be significant (β = 0.20)
- σ² = the variance of the data
- σ² = p(1-p) for response (yes/no) data, so the average response rate is directly related to the variance
- σ² is not related to average sales (or other continuous data), so you need to calculate the variance (standard deviation)² for each new dataset. Plus you also need a way to calculate σ² for each test (i.e., have 2+ individual values within each test recipe)
The number 4 comes from the calculation of the pooled standard deviation. Comparing two groups (test and control), 4 is always a part of the equation (in the calculation, you have 2 samples each over ½ the sample size, so you end up with 2x2=4 on the top of the equation).
If you want only a 5% chance of a significant effect being noise and an 80% chance of seeing the “significant effect” you selected, then… 4(tα/2 + tβ)² = 4(1.96 + 0.841)² = 31.3824 ~ 31.38
If you set tβ = 0, then you cut the sample in half and end up with only a 50% chance that the “significant effect” will actually be significant.
How does (2σ)² become (31.38)(σ²)? Clearly 4 ≠ 31, but the number gets bigger because (a) you add beta risk, (b) you have 2 samples to compare (test and control) and (c) the overall sample size, N, equals 2x the sample of one test cell. Therefore, you go from the t value at 95% confidence squared, “about 2” squared, to “4(tα/2 + tβ)²” in order to account for the full test sample and both types of error.
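If you want to see where the constant comes from, or try other confidence levels, here is a minimal sketch. It assumes normal-distribution z values stand in for the t values (reasonable for the large samples discussed above) and uses Python's standard-library `statistics.NormalDist`; the function name is hypothetical.

```python
from statistics import NormalDist


def sample_size_constant(alpha: float = 0.05, power: float = 0.80) -> float:
    """4 * (z_alpha/2 + z_beta)^2, the multiplier in front of the variance."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power, 0 for 50%
    return 4 * (z_alpha + z_beta) ** 2


print(f"95% confidence, 80% power: {sample_size_constant():.2f}")
print(f"95% confidence, 50% power: {sample_size_constant(power=0.50):.2f}")
```

The first result lands within a couple of hundredths of 31.38 (the small gap comes from rounding 1.96 and 0.841), and the second is about half of it, matching the 50-50 case where tβ = 0.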
Why are other equations different?
Many other sample size equations and rules-of-thumb give you a different answer. Why do we claim ours is right and theirs is wrong? Well, first of all, if you have SAS, Minitab, or other statistical software, you should be able to confirm that our equation matches the software’s answer. Most others miss one or more of the following pieces of the equation:
- Variation in both the test and control groups – ignoring the control group and considering the “bell curve” around the test group alone.
- Two types of variation – looking at “alpha” error, without considering “beta” error
- Realistic confidence in seeing the lift you want to see – plug most “rules” into these equations and you’ll see a miserably low confidence in results, which translates into an impossibly large lift needed to show a significant difference.
- The total universe is unknown. Some equations ask you for the number of people in the overall "universe." Pure statistics assumes greater confidence when your test group is a large proportion of all items that could possibly be sampled. But the marketplace is not a closed, static environment – the "universe" is unknowable and ever-changing, so it's unreasonable to include the universe in the equation. (Can you give me the total number of everyone who might someday be your customer?)
- The whisper-down-the-lane effect
- Many rules-of-thumb have been passed down in marketing lore without any statistical basis. Rules like “You need 100 responses per test cell” (or 50, 500, etc.) have been written in marketing articles for decades, but these almost always severely underestimate the true sample size you need.
Choosing the right metric for sample size
Sample size equations are only as accurate as the data put into the calculation. For example, if you plug your e-mail click through rate into the equation, you will probably end up with only a fraction of the sample size you’ll need for a good read on conversion rate. If you want to increase response rate and average order value, then you should look at the sample size required for each and probably choose the higher of the two.
Generally, your test analysis should focus on simple “behavioral” metrics rather than calculated metrics. You may want to drive long-term profitability, but focusing simply on response and sales for each test may make more sense. Whenever you use a calculated metric within a test analysis – like lifetime value, or forecast total sales – you add (a) potential sources of error from the assumptions within the formulas and (b) a combination of variables that each may be driven different ways by different effects (the combination may mute effects that are significant for each variable alone). Profitability and sales forecasts are important, but when analyzing test results, focus on a few clear, objective, simple metrics. Then you can input results for each into your more complicated formulas.
The metric determines how experimental error is calculated. In retail tests, if the key metric is per-store sales (the change in sales versus predicted levels or control stores), then experimental error will be based on the difference among all stores in each test group. If the change in sales is calculated by market or customer (for a loyalty program), then the error may appear greater or less. Generally, more granular metrics and “test units” offer a more realistic view of the marketplace, allowing you to see outliers in the data while offering the flexibility to aggregate data in different ways to analyze results by market, region, or type of customer or store.
Examples by channel – good metrics may include:
- Direct mail (letter package, catalog, postcard, etc.): response rate, average order value, total sales, net revenue
- E-mail and landing page: conversion rate, average order, total sales, responses/clicks
- Retail and CRM: change in store sales (versus baseline/control group), unit sales, basket size, average margin; number of transactions, total annual sales, and total margin dollars by customer (for a loyalty program)
- Advertising: change in sales by market, advertising ROI
- Telemarketing: conversion rate, total sales, customer satisfaction
When sample size is limited
Many small companies have a small total mail (or e-mail or retail) volume, so any testing is difficult. When sample size is limited, consider if you can:
1. Combine elements into one multivariable test design
One big benefit of advanced techniques is sample size efficiency – unlike A/B splits, for multivariable test designs, sample size can remain constant whether you’re testing 2 or 22 different elements, as long as all elements are part of the same statistical test design.
A simple example: If you need 50,000 per test cell, then 2 tests plus the control require 150,000 names. One multivariable test of the same 2 elements requires only 100,000 for equal confidence. This 33% drop in sample size becomes even larger as more marketing-mix elements are combined into one test. One non-profit DM test would have needed 4 million more mailings if the same elements had been tested using split-run techniques.
2. Run the same test across multiple drops, segments, or promotions
This Conde Nast e-mail test was run over three different campaigns to build sample size while also analyzing campaign-specific differences. A Whirlpool contact strategy test focused on weekly service contract renewals, so they continued to run the test week-after-week until the sample size was large enough.
3. Focus on fewer, large changes
When sample size is limited, the equations let you see how large a lift you need from each test element. If you only have 500 buyers in each monthly campaign, then consider new ideas that each have the chance to increase response by 25% or more. Perhaps combine elements into one envelope test (changing size, color, teasers all at once), or copy test (version A versus version B, instead of testing a new headline, P.S., or sidebar alone). Also consider bolder differences in each element: test a 15% price increase instead of 5%, a free tote bag instead of 1 free issue, or a 9x12” package instead of 6x9”.
4. Test the most responsive customers
Even if overall response rate is relatively low, are there high-response segments or more receptive customers who may be a good focus for testing? For example, in contract renewals – magazine subscriptions, extended service warranties, Internet or phone service – the first-time renewal is often a big hurdle. These customers are valuable and often more responsive to creatives and offers (versus long-time customers who are profitable, yet with greater inertia to respond at about the same level no matter what they receive).
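Two of the ideas above are simple arithmetic and can be sketched in Python: the names saved by combining elements into one multivariable test (point 1), and the smallest lift you can expect to see with a fixed sample (point 3). This assumes every element is tested at two levels and that the response rate equation is N = 31.38 × p̄(1-p̄) / (significant effect)²; the function names, the 2% response rate, and the 25,000 mailing quantity are hypothetical.

```python
import math


def split_run_total(per_cell: int, n_elements: int) -> int:
    """Names needed for separate split-run tests: one cell per element plus a control."""
    return per_cell * (n_elements + 1)


def multivariable_total(per_cell: int) -> int:
    """Names needed for one two-level multivariable design: same as one A/B split."""
    return per_cell * 2


def detectable_relative_lift(p_bar: float, n_total: int,
                             constant: float = 31.38) -> float:
    """Smallest relative lift in response the test can be expected to show as significant."""
    effect = math.sqrt(constant * p_bar * (1 - p_bar) / n_total)
    return effect / p_bar


separate = split_run_total(50_000, 2)    # 150,000
combined = multivariable_total(50_000)   # 100,000
print(f"Saving from combining 2 elements: {1 - combined / separate:.0%}")  # 33%

# Hypothetical small mailer: 25,000 names at a 2% response rate (about 500 buyers).
lift = detectable_relative_lift(p_bar=0.02, n_total=25_000)
print(f"Smallest lift you can expect to see: {lift:.0%}")
```

Under those assumed numbers, the second result comes out at about 25%, in line with the advice to look for bolder changes when sample size is limited.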
If you're an expert, here are a few advanced topics to consider:
1. Is response rate truly binomial (yes/no) data?
The sample size equation for response rate is easy to use, plus setting the “beta” risk at 50% gives you the Line of Significance (replacing 31.38 with 15.37). But response rate assumes an equal probability of each person responding – not always true if you have different segments mixed together, or changing “influencers” outside of the test elements. A great safeguard is to split each test recipe into two equal groups: each receives the exact same mailing (or e-mail, landing page, etc.), but you have a different keycode/tracking number for each. This allows you to calculate the variance between “replicates” (the difference in response rate between the two samples from the same recipe) as a measure of experimental error, as well as calculating experimental error from the response rate (where the variance is assumed to be related to the average response). If both versions of the Line of Significance match, you’re OK. If they differ, the “replicated” experimental error is generally a more conservative estimate.
2. Consider other sources of variation when selecting sample size
All sample size equations consider the variation among customers without considering additional noise from outside sources of variation. With environmental, personal, competitive, week-to-week, and numerous other changes going on constantly (and largely unknown), these “pure” sample size equations may not incorporate a realistic level of natural in-market variation. Therefore, think rationally about your sample size: Do you plan to run the test long enough to balance out all of these unknowns?
For example, retail and packaged goods (CPG) tests are challenging because of store, market, and region-specific differences. Local growth and customer demographics, seasonality (including temperature differences and back-to-school dates), and merchandise differences are just a few changes that may affect week-to-week sales changes. Replication through additional stores, multiple markets, and comparable control markets running in parallel can all help give different perspectives of real-world variation during the test period. Plus extra stores offer the freedom to drop some unexpected outliers that occur during the test.
3. What is the best Line of Significance?
Experimental error is relatively easy to calculate – basically the same ±2σ concept as you use for confidence limits (only a 5% probability that any effect would be significant due simply to random chance). One great thing about testing is that the most reliable tests should have clear results without the need for overly complex statistical analyses. Put simply: Does the Line of Significance make sense? Do the “significant” effects clearly rise above the non-zero effects due to random chance? With large sample sizes, using ±3σ or the Bonferroni method (with alpha ~ 5% for the test as a whole instead of per element), you may see a more realistic estimate of real-world error.
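Topics 1 and 3 can be sketched in Python. This is an illustration under stated assumptions, not a definitive method: the replicate estimate uses the fact that, for two equal samples of the same recipe, half the squared difference in response rate is an unbiased estimate of the variance of one sample's rate; the Bonferroni multiplier divides alpha by the number of elements so the 5% risk applies to the test as a whole, and uses normal z values from `statistics.NormalDist`. All function names and the replicate data are hypothetical.

```python
from statistics import NormalDist, fmean


def binomial_variance(p_bar: float, n_per_replicate: int) -> float:
    """Variance of one replicate's response rate if response is truly binomial."""
    return p_bar * (1 - p_bar) / n_per_replicate


def replicate_variance(pairs) -> float:
    """Variance of one replicate's response rate, estimated from replicate pairs."""
    return fmean([(a - b) ** 2 / 2 for a, b in pairs])


def significance_multiplier(alpha: float = 0.05, n_elements: int = 1) -> float:
    """Number of sigmas for the Line of Significance, Bonferroni-adjusted."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_elements))


# Hypothetical response rates for two replicates of each of four recipes,
# with 25,000 names per replicate and about a 3% average response.
pairs = [(0.0310, 0.0295), (0.0288, 0.0301), (0.0322, 0.0309), (0.0297, 0.0312)]
ratio = replicate_variance(pairs) / binomial_variance(0.03, 25_000)
print(f"Replicate variance / binomial variance = {ratio:.2f}")

print(f"Multiplier, 1 element:   {significance_multiplier():.2f}")
print(f"Multiplier, 10 elements: {significance_multiplier(n_elements=10):.2f}")
```

A ratio well above 1 suggests more variation than the binomial assumption allows for. With only a few recipes the replicate estimate is itself noisy, so treat the comparison as a sanity check rather than a formal test.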