Dances with Science: Are you better off not A/B testing?
When a news headline contains the word “scientists,” I get warm fuzzies: working routinely on the design of experiments affords me a sense of camaraderie.
Occasionally, the news is disappointing. Not because science failed, but because the scientists employed a questionable design of experiments. When this happens in conversion optimization, the marketer would have been better off trusting the infamous “marketer’s intuition” than the test. Relying on improperly generated test results just creates a false sense of confidence.
I wanted to use a recent news report to illustrate several validity threats that can occur in conversion optimization experiments.
The news story in a nutshell: a team of psychologists from Northumbria University modeled the dance moves of 19 men using featureless computer-generated 3D mannequins, and then asked 35 women to identify the one they found most attractive.
The general concept at first seems plausible—isolate dance moves from other attributes that contribute to a man’s attractiveness (like clothes, facial features, etc.) and compare them to determine which ones are more desirable.
We may similarly want to isolate the layout of a Web page in terms of the number of columns of content to determine which layout converts the best. However, both in studying the attractiveness of dance moves and the performance of a landing page, there are several experimental design factors that need to be considered to make a meaningful result even possible.
In the Northumbria experiment, there were 19 treatments (dance move combinations) and 35 human subjects. There are two implicit assumptions that undermine the validity of the reported discoveries:
1) that these 19 men are a good representation of the universe of possible dancing moves, even within one culture, and
2) that the dance move preferences of these 35 women are representative of most women.
The first assumption matters only if they attempted to generalize the results of the experiment to all dance styles. If the objective of the experiment were only to determine the best one of the 19, this in itself would not be a problem. When we test several landing page treatments, a valid research question would be something like “Which one of the three layouts produces the highest conversion rate?”—rather than “Which layout is always the best?”
With only 35 individuals making all the decisions, it’s intuitive that their preferences would be difficult to extrapolate to the entire population. We know from real-life experience that people have widely different mating preferences.
Likewise, in landing page testing, it is essential to make sure that the profile of the subjects sampled is both consistent among all of the treatments and representative of the people who will be visiting the site after the test.
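To put a number on the small-sample concern: with only 35 subjects, any measured preference carries a wide margin of error. A quick back-of-the-envelope sketch in Python (the 40% figure is invented for illustration, not taken from the study) makes the point:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# Hypothetical: suppose 14 of the 35 women (40%) preferred one dancer.
low, high = wilson_interval(14, 35)
print(f"Observed preference: 40%, 95% CI: {low:.0%} to {high:.0%}")
```

A 40% observed preference among 35 subjects is statistically compatible with anywhere from roughly a quarter to more than half of the population, which is a wide range on which to base any advice.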
The story implied that the researchers were using the results of this experiment to advise men on the “better” dance moves. This opens a second sample selection problem: even if the women in the study did represent the preferences of all women, the aggregate result may not be useful to any one man.
The reason is simple: each man would likely not be interested in attracting the average woman—in simplistic terms, he would only be interested in his type. In marketing, we call this segmentation.
This is why when you test your pages, it is critical that you dissect the data by various traffic channels, geography, etc. You may find that Treatment A gives you a 10% lift on weekdays, but underperforms the Control by 50% on weekends. Had you looked only at the aggregate data, you might have concluded that you should keep the Control, whereas the right outcome is to use each version on the days that it performs best.
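The weekday/weekend trap is easy to reproduce with a toy sketch in Python (all of the traffic and conversion numbers below are made up for illustration):

```python
# Hypothetical test data: (visits, conversions) by variant and day type.
data = {
    "control":   {"weekday": (10_000, 500), "weekend": (4_000, 400)},
    "treatment": {"weekday": (10_000, 550), "weekend": (4_000, 200)},
}

def rate(visits, conversions):
    return conversions / visits

for variant, segments in data.items():
    total_v = sum(v for v, _ in segments.values())
    total_c = sum(c for _, c in segments.values())
    by_segment = {s: f"{rate(*vc):.2%}" for s, vc in segments.items()}
    print(variant, f"aggregate={rate(total_v, total_c):.2%}", by_segment)
```

Segment by segment, the treatment wins on weekdays (5.5% vs. 5.0%) and loses badly on weekends (5.0% vs. 10.0%), yet its aggregate conversion rate trails the control’s, so the aggregate view alone would steer you to the wrong decision.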
Finally, the Northumbria experiment entirely ignores the interaction between the independent variable they isolated (dance moves) and other variables that have a powerful impact on the dependent variable (attractiveness).
It is easy to imagine that the same set of dance moves can be seen as very attractive on one man, yet very unattractive on another. That is, the very factors that the Northumbria researchers sought to eliminate—by mechanically separating the human physical form from its movements—may, in a very practical sense, be so entangled in the mysterious gestalt of attraction that the removal of either factor makes the results meaningless outside of the 3D model.
Detecting such effects is the purpose of a family of statistical techniques that includes factor analysis, spanning from simple regression (drawing the best “line” through the data) to more sophisticated methods like principal component analysis. These techniques enable you to recognize when your test design doesn’t support your objective.
While multivariate testing (MVT) poses its own set of challenges and constraints, the interaction between variables cannot be ignored when we interpret even single-factor tests. For example, even if a specific headline improved your conversion rate significantly, you will still need to re-test it after any changes to the body copy or when a different PPC ad sends it traffic.
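As a toy illustration of what an interaction looks like (the conversion rates below are invented), a 2x2 headline-by-body-copy test can be decomposed into a baseline, two main effects, and a leftover term; whatever the main effects cannot explain is the interaction:

```python
# Hypothetical conversion rates for a 2x2 test (headline x body copy).
rates = {
    ("A", "old"): 0.040,
    ("A", "new"): 0.045,
    ("B", "old"): 0.050,
    ("B", "new"): 0.035,
}

# Main effects alone would predict each cell as baseline + headline
# effect + body-copy effect; the residual is the interaction.
baseline = sum(rates.values()) / 4
headline_effect = {
    h: sum(r for (hh, _), r in rates.items() if hh == h) / 2 - baseline
    for h in ("A", "B")
}
body_effect = {
    b: sum(r for (_, bb), r in rates.items() if bb == b) / 2 - baseline
    for b in ("old", "new")
}

for (h, b), observed in rates.items():
    predicted = baseline + headline_effect[h] + body_effect[b]
    print((h, b), f"interaction = {observed - predicted:+.4f}")
```

In this contrived example the main effects are essentially zero: headline B wins with the old body copy and loses with the new one, so “which headline is better?” has no answer independent of the body copy. That is the interaction a single-factor test silently averages over.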
Don’t Drink and Test
One particular factor, seasonality, is an important issue in optimization testing. Website visitors behave differently at different times of the year, days of the week, and even times of day (more data for you to use in your segmentation). The study on dance moves ignored a key variable: the passage of time.
While it wasn’t reported in the story, I suspect that the study was not conducted at a local nightclub, and the subjects were not already at least a bit tipsy. For studies involving human behavior, the environmental and physiological conditions (e.g., blood alcohol level) of the subjects during measurement are essential to the predictive value of the outcome.
As an apparent fellow scientist, who identified himself only as BuckHippo, commented on the story, “This study is so flawed. No one dances when they are sober. And when they are drunk dancing, it all looks good …”