A/B Testing: Working with a very small sample size is difficult, but not impossible
A thought for future Web clinics:
There are millions of small businesses like mine. (Think small and local: your dentist, dry cleaner, pizza delivery). We are, in the grand picture, very small. My website generates, on average, 400 visitors in a month. (That’s around 14 a day. It works for me.)
We run tests and split tests all the time, but it is hard to draw any real conclusion for what is working and what is not working with really small amounts of data.
Is there something small businesses can do to better interpret small amounts of data? Thanks for your help and insight.
Thanks for the question, Chris. After having a mini-brainstorm session with one of our data analysts, Anuj Shrestha, I’ve written up some tips for dealing with a small sample size:
Tip #1: Decide how much risk you are willing to take
Testing, sample sizes and level of confidence are really all about risk. At MECLABS, our standard level of confidence (LoC) is 95%. This means we are only willing to take a 5% chance that the results we found were just a fluke. However, you may decide you are willing to accept an 80% LoC.
When looking at LoC with a small sample size, keep in mind that testing tools factor sample size into the LoC calculation; therefore, depending on how small your data pool is, you may never even reach a 50% LoC.
If this is the case, you should look at the relative conversion rate difference, (CRtreatment – CRcontrol) / CRcontrol, between the treatment and the control after the test. If a treatment shows a sizable increase over the control, it may be worth the risk for the possibility of high reward. However, if the relative difference between treatments is small and the LoC is low, you may decide you are not willing to take that risk.
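To make this concrete, here is a minimal sketch of how you might compute both numbers yourself with a standard two-proportion z-test. The visit and conversion counts are hypothetical, chosen to match roughly a month of traffic at Chris's volume:

```python
import math

def ab_test_summary(control_visits, control_conversions,
                    treatment_visits, treatment_conversions):
    """Two-proportion z-test: returns relative lift and level of confidence."""
    cr_c = control_conversions / control_visits
    cr_t = treatment_conversions / treatment_visits
    relative_lift = (cr_t - cr_c) / cr_c  # (CRtreatment - CRcontrol) / CRcontrol
    # Pooled standard error for the difference in proportions
    p = ((control_conversions + treatment_conversions)
         / (control_visits + treatment_visits))
    se = math.sqrt(p * (1 - p) * (1 / control_visits + 1 / treatment_visits))
    z = (cr_t - cr_c) / se
    # Two-sided level of confidence from the normal CDF
    loc = math.erf(abs(z) / math.sqrt(2))
    return relative_lift, loc

# Hypothetical month: traffic split 50/50, 200 visits per arm
lift, loc = ab_test_summary(200, 8, 200, 12)
print(f"Relative lift: {lift:.0%}, Level of confidence: {loc:.0%}")
# → Relative lift: 50%, Level of confidence: 64%
```

Notice the pattern the tip describes: a 50% relative lift paired with only a 64% LoC. Whether that combination is worth acting on is exactly the risk decision you have to make.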
Tip #2: Look at metrics for learnings, not just lifts
While most companies test and analyze metrics with the end goal of increasing some type of monetary number, you can also look at data to better understand your customers. By gathering learnings from your test, even if you don’t validate, you can leverage these learnings on the next treatment you design.
There are four helpful metrics you can look at that generally don’t fluctuate much as sample sizes differ:
- Next page path — Where are users going? Products page? Pricing page? FAQ? This can tell you what type of information people are looking for.
- Pages per visit — This metric helps you gauge how much information users want. Because it is bounded (by the number of pages on your site), it is usually safe to use, but keep in mind that it is an average, meaning outliers can skew the results.
- Exit pages — Knowing where people are leaving from can highlight pages you may need to improve.
- Traffic sources — Where are they coming from, and how does each channel perform separately?
On top of these, create a segment in your data platform that includes only people who completed your conversion action. How did they behave differently from those who did not? Did they view more pages? Different pages? Did they come from a specific traffic channel? Knowing these things will help you optimize your marketing efforts.
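If your platform doesn't make segmenting easy, you can do a rough version by hand once you export the data. A small sketch, with entirely hypothetical visit records:

```python
# Hypothetical visit records: pages viewed, traffic source, and whether
# the visitor completed the conversion action
visits = [
    {"pages": 6, "source": "search", "converted": True},
    {"pages": 2, "source": "social", "converted": False},
    {"pages": 5, "source": "search", "converted": True},
    {"pages": 1, "source": "social", "converted": False},
    {"pages": 4, "source": "email",  "converted": True},
    {"pages": 2, "source": "search", "converted": False},
]

def segment_stats(records, converted):
    """Average pages per visit and traffic-source counts for one segment."""
    segment = [v for v in records if v["converted"] == converted]
    avg_pages = sum(v["pages"] for v in segment) / len(segment)
    sources = {}
    for v in segment:
        sources[v["source"]] = sources.get(v["source"], 0) + 1
    return avg_pages, sources

print(segment_stats(visits, True))   # converters
print(segment_stats(visits, False))  # non-converters
```

In this toy data, converters view more pages and arrive mostly from search — exactly the kind of contrast that tells you where to focus your marketing, even when the counts are too small for a formal test.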
One metric you may not want to look at is average time on page, as it can be misleading with a small sample size. If a few people leave their windows open for an hour, that’s going to drastically skew the metric. Most platforms allow you to exclude outliers, but you should still be careful of this one.
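If your platform can't exclude outliers for you, a trimmed mean is a simple way to do it yourself. A sketch with hypothetical time-on-page samples, where one visitor left a tab open:

```python
import statistics

def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the top and bottom trim_fraction of observations."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    trimmed = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(trimmed)

# Hypothetical time-on-page samples in seconds; the last visitor
# left the window open for an hour
times = [35, 42, 28, 51, 38, 44, 31, 47, 40, 3600]
print(round(statistics.mean(times)))  # plain average: 396, badly skewed
print(round(trimmed_mean(times)))     # trimmed average: 41, closer to reality
```

With only ten visits, a single open tab inflates the plain average nearly tenfold, which is why this metric deserves extra caution at small sample sizes.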
Tip #3: Try sequential testing
An alternative to A/B split testing is sequential testing. Run one treatment, then run the other, and compare. This way, each treatment receives your full traffic instead of half of it.
Anuj says, “As long as user motivation stays constant [during both test periods], sequential testing can work.”
Make sure you schedule your test for a time that historically performs evenly and when no external validity threats are in play, such as holidays, industry peak times, sales, economic events, etc.
And, as with Tip #1, you have to decide how much risk you want to take. While you can mitigate risk by keeping the above points in mind, fielding sequential treatments opens your testing up to a validity threat called the history effect: the effect on a test variable of an extraneous variable associated with the passage of time.
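One way to check whether a window "historically performs evenly" is to look at the coefficient of variation (standard deviation divided by the mean) of your recent conversion rates. A minimal sketch, with a hypothetical stability threshold and made-up weekly numbers:

```python
import statistics

def is_stable_window(weekly_conversion_rates, max_cv=0.15):
    """Judge whether a historical window performs evenly enough for
    sequential testing, using the coefficient of variation (CV).
    The 0.15 threshold is an illustrative assumption, not a standard."""
    mean = statistics.mean(weekly_conversion_rates)
    cv = statistics.stdev(weekly_conversion_rates) / mean
    return cv <= max_cv, cv

# Hypothetical weekly conversion rates over the last eight weeks
rates = [0.038, 0.041, 0.040, 0.037, 0.042, 0.039, 0.040, 0.038]
stable, cv = is_stable_window(rates)
print(stable, round(cv, 3))  # → True 0.043
```

A low CV over the weeks leading up to your test is no guarantee against a history effect, but a high one is a clear warning that sequential results will be hard to trust.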
Tip #4: Test radical redesigns
At MECLABS, when we know we have a small sample size to work with, we usually try to create what is called a radical redesign to make sure we validate on a lift or loss. Radical redesigns make drastic changes, and the more radical the difference between pages, the more likely one is to outperform the other.
Sometimes minor changes can have very little effect on how the visitor behaves (which is why your treatment wouldn’t perform much differently than the control), making it difficult to validate.
While a radical redesign will help you achieve statistical significance, it is difficult to get any true learnings from these tests, as it will likely be unclear what exactly caused the lift or loss. Was it the layout, copy, color, process … all of the above?
It helps to have an overall hypothesis, or theme, to the changes. For example, one set of changes to the layout, copy, color and process is meant to emphasize that the car you’re selling is fuel efficient. Another set of changes is meant to emphasize the car is safe.
In this way, you can learn more about the motivations of your customers even while changing more than one element of your landing page.
Again, it all comes down to risk. You will have to set up and interpret your tests carefully to extract a learning. So for some, this approach might be better used to focus on getting valid results rather than learnings.