Online Marketing Tests: A data analyst’s view of balancing risk and reward
The answer “yes” to the question, “Is this test statistically significant?” — in Daniel Burstein’s MarketingExperiments blog post, “Online Marketing Tests: How could you be sure?” — is not only “misleading” as he mentions, but it is also incomplete.
In fact, if someone asked me that question, as a data analyst, instead of responding with a dichotomous “Yes/No” answer, I would ask, “At what level of significance?” The reason statistical significance is so important in experimental testing is that it quantifies the amount of risk one is willing to take in drawing inferences from a test.
From some recent conversations I have had with marketers running A/B tests, it seems like many of them are really interested in knowing the ‘roots’ of how the testing works, and they find it helpful to understand it so they can draw better conclusions.
So, in today’s MarketingExperiments blog post, I’ll give you a look at what some of the numbers and values you encounter in testing really mean.
In A/B testing, you’re essentially trying to determine if A and B are different. Let me show you how the math works.
In a classical inferential experiment (that is, an experiment from which we’d like to infer a conclusion), the researcher prepares two hypotheses: a null hypothesis (H) and an alternative hypothesis (Ha).
It is necessary that these two hypotheses are mutually exclusive and exhaustive – meaning every possible outcome must satisfy one or the other, but not both.
For instance, a null hypothesis may be that the conversion rates for two landing pages (Page A and Page B) are the same. The alternative hypothesis would be the complement of the null – namely, the conversion rates for Page A and Page B are NOT the same.
By convention, the alternative hypothesis is the one the researcher would ‘like’ to be able to support at a certain level of significance, and the null hypothesis is something we are ‘stuck with’ if the experimental data fails to reach the level of significance.
What are the chances I’ll get the wrong answer?
Since the tests rely only on a small sample and not the whole population, there is always a risk that the group of members we choose is a poor representation of the population.
Returning to the example cited in Burstein’s prior post – suppose we would like to determine if the child automobile safety restraints really do reduce child fatalities compared to seat belts alone. The null hypothesis (H) would be, “The death rate among children in accidents who are secured by child restraints IS the same as that when only seat belts are used.” So then, the alternative hypothesis (Ha) would be, “The death rate among children in accidents who are secured by child restraints is NOT the same as when only seat belts are used.”
Further, we decide that if there is more than a 5% chance that any observed difference in the rates in our sample data is just due to random chance, such as having picked unrepresentative accident cases for our sample, then we will have to presume that the rates are not actually different.
Once sample data is collected and analyzed, there are only two possible valid inferences:
- There is a less than 5% chance that the observed difference in rates is merely due to ‘sampling error,’ in which case the null hypothesis (H) will be rejected (i.e., we do NOT believe that both rates are the same, and it seems likely that the child restraints do make a difference).
- There is a 5% or greater chance that the observed difference is due to ‘sampling error,’ in which case we must concede that not enough evidence exists to compel us to believe that the child restraints actually do reduce fatalities.
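To make this decision rule concrete, here is a minimal sketch of one common way to estimate that chance: a pooled two-proportion z-test. The post does not say which test was used, so the choice of test, the function names, and the counts below are all assumptions made purely for illustration.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for H: rate A == rate B (pooled two-proportion z-test)."""
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_a - rate_b) / std_err
    # P(|Z| >= |z|) for a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))

def decide(p_value, alpha=0.05):
    """Apply the rule above: reject H only if the chance is below alpha."""
    if p_value < alpha:
        return "reject H"
    return "fail to reject H"

# Hypothetical counts (events out of cases observed), for illustration only
p = two_proportion_p_value(120, 2000, 160, 2000)
print(round(p, 4), decide(p))
```

Reporting the p-value itself, rather than only the verdict, is what lets a reader ask “At what level of significance?” and answer it for themselves.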
If this decision rule leads us to infer a difference that really doesn’t exist, we have fallen victim to a Type I Error, a statistical term that simply means a true null hypothesis was incorrectly rejected.
The significance level is the probability of committing a Type I Error that we are willing to accept.
There is also a chance that, due to the standards we have chosen, we may fail to detect a difference that really does exist (i.e., mistakenly failing to reject a false null hypothesis). This mistake is called a Type II Error.
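One way to get a feel for the two kinds of error is to simulate many tests where we already know the truth. The sketch below is only an illustration under assumed numbers: the conversion rates, sample size, number of trials, and the pooled z-test itself are my choices, not figures from any real test.

```python
import math
import random

def p_value(conv_a, conv_b, n):
    """Two-sided pooled z-test p-value for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 1.0
    z = (conv_a - conv_b) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))

def rejection_rate(rate_a, rate_b, n, trials, alpha, rng):
    """Fraction of simulated tests in which the null hypothesis is rejected."""
    rejections = 0
    for _ in range(trials):
        conv_a = sum(rng.random() < rate_a for _ in range(n))
        conv_b = sum(rng.random() < rate_b for _ in range(n))
        if p_value(conv_a, conv_b, n) < alpha:
            rejections += 1
    return rejections / trials

rng = random.Random(42)  # fixed seed so the run is repeatable

# Null hypothesis is true (identical pages): every rejection is a Type I Error.
type_1 = rejection_rate(0.10, 0.10, 500, 1000, 0.05, rng)

# Null hypothesis is false (assumed rates 10% vs. 13%): every non-rejection
# is a Type II Error.
type_2 = 1 - rejection_rate(0.10, 0.13, 500, 1000, 0.05, rng)

print(f"Type I rate: {type_1:.3f}, Type II rate: {type_2:.3f}")
```

Under these assumptions, the Type I rate should land near the chosen 0.05 level, while the Type II rate depends on how large the real difference is and how much traffic each page receives.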
It is a longstanding and widely held convention that researchers take much greater precautions to avoid committing a Type I Error than a Type II Error.
Balancing risk and reward (in theory)
Type I Error is considered a bigger ‘no-no’ because it is the alternative hypothesis, if accepted over the null hypothesis, that challenges the status quo and signals the need for change. So, mistakenly provoking a change that was unnecessary (or possibly even damaging) is considered a more grievous sin than failing to recommend a change that might have made things better.
Think about it as erring on the side of caution. It’s probably safer to falsely assume that your landing page treatment will not generate a lift, and therefore still use a reliable control, than to falsely assume that the landing page treatment does generate a lift, when in fact replacing the control will result in a reduction in conversion.
So how does a rational person work out a risk-reward decision based on the information provided by inferential testing?
Suppose a fair coin is tossed five times, and I offer to give you $500 if we observe five straight heads. If not, you give me only $50.
Here, even if the reward following the five straight heads is much higher for you, the likelihood of such an event is comparatively low (about 3%). While you have only about a 3% chance of winning $500, I have about a 97% chance of winning $50.
So, I am willing to take the 3% risk of losing $500 to win $50.
Now, if we toss the coin only four times, and I have to give you $10,000 for four straight heads, my likelihood of winning $50 declines to about 94%, and your chance of walking away with $10,000 roughly doubles from 3% to 6%.
In this case, I have some soul searching (not to mention wallet searching) to do. Presuming that I could actually cover the $10,000 bet, am I willing to risk that amount, where there is a 6% chance of losing $10,000 vs. a 94% chance of winning $50?
It turns out, there is a comparatively simple way – used by ‘gamblers’ in hundreds of professions from card sharks to hedge fund managers to insurance actuaries – called ‘Expected Value’ that provides the ideal “rational person’s answer.”
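As a quick sketch of how that works for the two coin bets above: expected value weights each outcome by its probability and adds the results. The function and variable names are mine, and the stakes are exactly the ones described above, seen from my side of the bet.

```python
def expected_value(p_lose, loss, win):
    """Average payoff per bet for the person offering it."""
    return (1 - p_lose) * win - p_lose * loss

p_five_heads = 0.5 ** 5  # 0.03125, the "about 3%" above
p_four_heads = 0.5 ** 4  # 0.0625, the "about 6%" above

ev_first = expected_value(p_five_heads, 500, 50)      # five tosses, $500 at risk
ev_second = expected_value(p_four_heads, 10_000, 50)  # four tosses, $10,000 at risk

print(ev_first)   # 32.8125
print(ev_second)  # -578.125
```

On average, the first bet earns me about $33 each time it is played, while the second costs me about $578, even though I would still win it roughly 94% of the time.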
Balancing risk and reward (in practice)
Suppose now, in the child restraint study I mentioned above, we collected a random sample that supported the alternative hypothesis (at a 0.05 level of significance) that the fatality rates are NOT the same – in fact, the data suggests that the rate is lower with child restraints.
How should legislators decide whether to enact the law that mandates the use of such restraints, including fines for non-compliance?
Sadly, this is one of those irksome real-world situations in which it is necessary to take a morally reprehensible and emotionally distasteful step – namely assigning a dollar value to human life (and children’s lives at that).
The coward’s way out would be to simply abdicate and set the number to ‘infinity.’ That would inevitably lead to the conclusion that “… there ought to be a law!” regardless of the cost, “even if it saves just one life.”
In real life, though, such cowardice (or laziness) comes with a high risk of unintended consequences, stemming mainly from the absence of infinite resources. That is, the enactment of one exceptionally expensive mandate on these grounds could derail many other subsequent policy initiatives that would have had a much greater collective impact on lives saved.
So, while they seldom speak of them publicly, insurance actuaries and automobile companies, members of Congress and courts of law, the medical profession and the military all maintain finite ‘provisional’ figures for this invaluable practical purpose.
Happily, as professional marketers, we are seldom faced with such distasteful decisions, though we can use the same statistical methods and tools in making our own tough decisions about which road to take by quantifying the risks and rewards of each through inferential testing.
Traditionally, the significance level is determined by the researcher before all the data-crunching.
However, for our practical purposes, we can decide whether to treat the test as valid by weighing how close our result came to the designated significance level against the amount of risk involved in implementing the changes supported by the test.
Suppose a test would have validated at a 0.06 significance level, but came out inconclusive at the preset 0.05 level. Depending on how willing one is to take the extra 1% “risk” of error, it might not be a bad idea to reset the significance level from 0.05 to 0.06 and validate the test, rather than rerunning it with other treatments to gain the tiny fraction of additional difference needed to reach the level of confidence previously set.