A/B Testing: Working with a very small sample size is difficult, but not impossible


A thought for future Web clinics:

There are millions of small businesses like mine. (Think small and local: your dentist, dry cleaner, pizza delivery). We are, in the grand picture, very small. My website generates, on average, 400 visitors in a month. (That’s around 14 a day. It works for me.) 

We run tests and split tests all the time, but it is hard to draw any real conclusions about what is working and what is not working with really small amounts of data.

Is there something small businesses can do to better interpret small amounts of data? Thanks for your help and insight.

– Chris

 

Thanks for the question, Chris. After having a mini-brainstorm session with one of our data analysts, Anuj Shrestha, I’ve written up some tips for dealing with a small sample size:

Tip #1: Decide how much risk you are willing to take

Testing, sample sizes and level of confidence are really all about risk. At MECLABS, our standard level of confidence (LoC) is 95%. This means we are only willing to take a 5% chance that the results we found were just a fluke. However, you may decide you are willing to accept an 80% LoC.

When looking at LoC with a small sample size, keep in mind that testing tools factor sample size into the LoC calculation; depending on how small your data pool is, you may never even reach a 50% LoC.

If this is the case, you should look at the relative conversion rate difference, (CRtreatment – CRcontrol) / CRcontrol, between your treatment and your control after the test. If a treatment has a significant increase over the control, it may be worth the risk for the possibility of high reward. However, if the relative difference between treatments is small and the LoC is low, you may decide you are not willing to take that risk.
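
To make that concrete, here is a minimal sketch in Python (the visit and conversion counts are made up for illustration) of computing the relative conversion rate difference alongside an approximate LoC from a standard two-proportion z-test:

```python
# Minimal sketch: relative lift and an approximate level of confidence
# from a two-proportion z-test. The visit/conversion counts are made up.
from math import sqrt, erf

visits_control, conv_control = 200, 8        # hypothetical control data
visits_treatment, conv_treatment = 200, 14   # hypothetical treatment data

cr_control = conv_control / visits_control
cr_treatment = conv_treatment / visits_treatment

# Relative conversion rate difference: (CRtreatment - CRcontrol) / CRcontrol
relative_lift = (cr_treatment - cr_control) / cr_control

# Pooled two-proportion z-test (normal approximation; rough with small samples)
pooled = (conv_control + conv_treatment) / (visits_control + visits_treatment)
se = sqrt(pooled * (1 - pooled) * (1 / visits_control + 1 / visits_treatment))
z = (cr_treatment - cr_control) / se
confidence = erf(abs(z) / sqrt(2))  # two-sided "level of confidence"

print(f"Relative lift: {relative_lift:.1%}, level of confidence: {confidence:.1%}")
```

With these made-up numbers you would see roughly a 75% relative lift at roughly 80% confidence, which is exactly the kind of trade-off between reward and risk described above.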

 

Tip #2: Look at metrics for learnings, not just lifts

While most companies test and analyze metrics with the end goal of increasing some type of monetary number, you can also look at data to better understand your customers. By gathering learnings from your test, even if it doesn’t validate, you can apply those learnings to the next treatment you design.

There are four helpful metrics you can look at that generally don’t fluctuate much as sample sizes differ:

  • Next page path — Where are users going? Products page? Pricing page? FAQ? This can tell you what type of information people are looking for.
  • Pages per visit — You can determine how much information users are looking for with this metric. Because there is a limit on this (the number of pages on your site), it will often be safe to use, but keep in mind that it is an average, meaning outliers can skew the results.
  • Exit pages — Knowing where people are leaving from can highlight pages you may need to improve.
  • Traffic sources — Where are they coming from, and how does each channel perform separately?
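
Most analytics platforms report these metrics out of the box. If all you have is a raw pageview export, here is a rough sketch of deriving the first and third of them from it (the file name and column names are assumptions):

```python
# Sketch: deriving next-page paths and exit pages from a raw pageview export.
# Column names (visit_id, timestamp, page) are assumptions; most analytics
# tools report these metrics directly.
import pandas as pd

views = pd.read_csv("pageviews_export.csv").sort_values(["visit_id", "timestamp"])

# Next page path: which page follows the landing page within each visit?
views["next_page"] = views.groupby("visit_id")["page"].shift(-1)
landing = views.groupby("visit_id").first()
print(landing["next_page"].value_counts(dropna=False).head())

# Exit pages: the last page viewed in each visit
exits = views.groupby("visit_id")["page"].last()
print(exits.value_counts().head())
```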

On top of these, create a segment in your data platform that includes only people who completed your conversion action. How did they perform differently than those who did not? Did they view more pages? Different pages? Did they come from a specific traffic channel? Knowing these things will help you optimize your marketing efforts.
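
As a rough illustration of that segment comparison, here is a sketch using pandas; the file name and column names (converted, pages_per_visit, traffic_source) are assumptions standing in for whatever your analytics export actually provides:

```python
# Sketch: comparing converters vs. non-converters on a visit-level export.
# Column names (converted, pages_per_visit, traffic_source) are assumptions;
# substitute whatever your analytics export actually contains.
import pandas as pd

visits = pd.read_csv("visits_export.csv")  # hypothetical export file

# Pages per visit, split by whether the visitor converted
print(visits.groupby("converted")["pages_per_visit"].describe())

# Which traffic channels do converters come from, compared with everyone else?
channel_mix = (
    visits.groupby("converted")["traffic_source"]
    .value_counts(normalize=True)
    .rename("share")
)
print(channel_mix)
```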

One metric you may not want to look at is average time on page, as it can be misleading with a small sample size. If a few people leave their windows open for an hour, that’s going to drastically skew the metric. Most platforms allow you to exclude outliers, but you should still be careful of this one.
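
If you do want to look at time on page anyway, one hedge is to report the median or a trimmed mean rather than the raw average; a small sketch, continuing with the same assumed export and an assumed time_on_page_seconds column:

```python
# Time on page is easily skewed by a few abandoned tabs, so look at the
# median (or trim the top few percent) rather than the raw mean.
import pandas as pd

visits = pd.read_csv("visits_export.csv")      # same hypothetical export
time_on_page = visits["time_on_page_seconds"]  # assumed column name

print("Mean:  ", time_on_page.mean())    # easily skewed by outliers
print("Median:", time_on_page.median())  # more robust
trimmed = time_on_page[time_on_page <= time_on_page.quantile(0.95)]
print("Trimmed mean (bottom 95%):", trimmed.mean())
```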

 

Tip #3: Try sequential testing

An alternative to A/B split testing is sequential testing: run one treatment, then run the other, and compare the results. This way you have double the traffic to each treatment.

Anuj says, “As long as user motivation stays constant [during both test periods], sequential testing can work.”

Make sure you schedule your test for a time that historically performs evenly and when no external validity threats are occurring, such as holidays, industry peak times, sales, economic events, etc.

And, as with Tip #1, you have to decide how much risk you want to take. While you can mitigate risk by keeping the above points in mind, fielding sequential treatments opens your testing up to a validity threat called history effect – the effect on a test variable by an extraneous variable associated with the passage of time.
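
If you do go sequential, you can compare the two runs with the same arithmetic you would use for a split test. Here is a minimal sketch with made-up counts for the two periods; note that nothing in this calculation protects you from the history effect described above:

```python
# Sketch: comparing two sequential runs (control period vs. treatment period).
# The counts are made up; history effects are NOT accounted for here, which is
# exactly the validity threat described above.
from math import sqrt, erf

def level_of_confidence(visits_a, conv_a, visits_b, conv_b):
    """Two-sided confidence that the two conversion rates genuinely differ."""
    cr_a, cr_b = conv_a / visits_a, conv_b / visits_b
    pooled = (conv_a + conv_b) / (visits_a + visits_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b))
    return erf(abs(cr_b - cr_a) / se / sqrt(2))

# Period 1 = control page only, period 2 = treatment page only (hypothetical)
print(f"LoC: {level_of_confidence(400, 16, 400, 26):.1%}")
```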

 

Tip #4: Test radical redesigns

At MECLABS, when we know we have a small sample size to work with, we usually try to create what is called a radical redesign to make sure we validate on a lift or loss. Radical redesigns make very drastic changes. The more radical the difference between pages, the more likely one is to outperform the other.

Sometimes minor changes can have very little effect on how the visitor behaves (which is why your treatment wouldn’t perform much differently than the control), making it difficult to validate.

While a radical redesign will help you achieve statistical significance, it is difficult to get any true learnings from these tests, as it will likely be unclear what exactly caused the lift or loss. Was it the layout, copy, color, process … all of the above?

It helps to have an overall hypothesis, or theme, to the changes. For example, one set of changes to the layout, copy, color and process is meant to emphasize that the car you’re selling is fuel efficient. Another set of changes is meant to emphasize the car is safe.

In this way, you can learn more about the motivations of your customers even while changing more than one element of your landing page.

Again, it all comes down to risk. You will have to set up and interpret your tests properly to get a learning. So for some, this approach might be better used to focus on getting valid results rather than learnings.

I wrote a blog post about how to interpret your data correctly that may be of help in this situation, as well. Anuj also wrote a post on testing and risk.

 

Related Resources:

Marketing Optimization: How to determine the proper sample size

Online Marketing Tests: How do you know you’re really learning anything?

Online Testing: 3 takeaways to get the most out of your results

Comments
  1. Tyler says

    Kudos to Chris for being a very web savvy small business owner. Many of the small businesses I’ve interacted with are still at the point where they can significantly increase leads or sales with very basic changes like adding a clear call to action or replacing “Welcome to Our Site” on their homepage with an actual headline.

  2. Clayton says

    Tip #3 doesn’t make sense to me. It says that a sequential test would send twice as much traffic to each treatment, but what is the advantage of doing that instead of sending twice as much traffic into the A/B split test (perhaps by running it for twice as long)? Thanks.

  3. Lauren Maki says

    @Clayton
    Hi Clayton, good question.

    When dealing with low traffic, small businesses will usually push 100% of their traffic into the test, so sending twice as much traffic may not be feasible.

    A/B split testing is definitely the preferred method over sequential testing for validity reasons; however, when looking at daily results for tests with extremely low traffic, split testing will significantly affect your variance. For example, if 10 people visit your site one day and you are running a split test, each page sees 5 visitors. One person converting on the treatment while no one converts on the control is a comparison of a 20% versus a 0% CR; whereas, if you run a sequential test, your conversion rate for that day is 10%, compared against another day’s results. One person has less of an effect on your daily results.

    When your numbers are very low like this example, sequential may be a good option, but if your numbers are closer to 50 visits/day with at least 2 conversions per treatment, A/B split for a longer period of time may be a better option.

    Hope this helps!

  4. Ben says

    @Clayton is right as far as I understand. You’re making the mistake of assuming that if you send twice as many visitors to the treatment, they’re not going to convert. If 1/5 convert, then the next 5 visitors will see 1 convert too, in the long run. We can look at it from a simulation point of view.

    A/B test (2 weeks):
    – A gets 100 visits, converts 4 (4%)
    – B gets 100 visits, converts 10 (10%)

    Sequential (2 x 2 weeks):
    – Period 1: A gets 200 visits, converts 8 (4%); B gets 0 visits (0%)
    – Period 2: A gets 0 visits (0%); B gets 200 visits, converts 20 (10%)

    Back to the article, tips 2 (learning from micro-behavior/interactions) and 4 (making bold changes) are indeed very good. Tip 1 is half good. It’s true that accepting a lower LoC will yield results more often. 80 or 90% could be acceptable LoC in many situations.

    However, I feel it’s very misleading to accept a test with 50% confidence *on the basis that the relative difference is large* (and to add the words “significant increase” is prone to create confusion: 50% LoC is statistically non-significant). When a variation performs much better than another variation, the edge is big (big increase) and as a result the variance is low. Thus, you should get significant results faster than if the edge was small (and the variance higher). If you’re at 50% confidence with a big lift, it means you’re riding on small sample size variance. You need to let the test run.

  5. RJ says

    You can run the split tests in parallel indefinitely. When the sample size is too small the result of the test will be no statistical difference. When they start showing a difference, you know the sample is large enough. The beauty of this method is it doesn’t matter how many people accepted the offer as long as they were homogeneously offered either A or B – the offers were queued up 50% of the time.

    When you realize you are not learning anymore from the test and you are not gaining statistical significance, it’s time to move on to a new one.

    It’s tempting but do not use “click through rates” for these tests – they are interesting but irrelevant.
