It’s the nightmare scenario for any analyst or executive: Making what seems to be the right decision, only to find out it was based on false data.
Through online optimization testing, we try to discover which webpage or email message will perform best by showing each version to a random sample of target prospects.
During test design, we decide how much of a difference we must see to warrant making a change, and what level of statistical confidence is required before we consider the data valid.
But is sample size the only factor that should be considered when assessing the validity of test results?
In this research brief, we will identify the four greatest threats to test validity, use recent case studies to illustrate the nature and danger of each threat, and provide you with 10 key ways to avoid making bad business decisions on the basis of a “statistically valid” sample size alone.
Editor’s Note: We recently released the audio recording of our clinic on this topic. You can listen to it here:
Optimization Testing Tested: Validity Threats Beyond Sample Size
Case Study 1
Test Design
We conducted a 7-day test for a large industrial parts company whose market is predominantly midsize to large businesses. The company’s largest-volume paid search traffic source is Google AdWords.
Our primary goal for this test was to reduce the bounce rate for the company’s Home page.
Treatments
We tested the control against a products-oriented page and a directory-style page.
Validation
After one week, a large enough sample had accumulated to satisfy the test design threshold criteria: at least a 5% difference in bounce rate at a 95% level of confidence.
Page | Bounce Rate |
---|---|
Control | 22.6% |
Treatment 1 | 20.2% |
Treatment 2 | 12.9% |
Diff. Between Ctr. & Trt. 2 | 75.6% |
What you need to understand: The directory-style page yielded a bounce rate 9.7 percentage points lower than the control page, a 75.6% relative difference.
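For readers who want to see what sits behind a claim like "95% level of confidence," here is a minimal sketch of a standard two-proportion z-test applied to bounce rates. The brief does not publish the raw visitor counts for this test, so the counts below are hypothetical, chosen only so the rates match the Week 1 table.

```python
# A minimal sketch of the kind of significance check behind a "95% level of
# confidence" claim, using a standard two-proportion z-test.
from math import sqrt, erf

def two_proportion_z_test(bounces_a, visitors_a, bounces_b, visitors_b):
    """Return both rates, the z statistic, and a two-sided p-value."""
    p_a = bounces_a / visitors_a
    p_b = bounces_b / visitors_b
    pooled = (bounces_a + bounces_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return p_a, p_b, z, p_value

# Hypothetical counts chosen only so the rates match the Week 1 table
# (22.6% vs. 12.9%); the brief does not publish the real visitor totals.
control = (565, 2500)      # (bounces, visitors) -> 22.6%
treatment2 = (323, 2500)   # (bounces, visitors) -> 12.9%

p_c, p_t, z, p = two_proportion_z_test(*control, *treatment2)
relative_diff = (p_c - p_t) / p_t
print(f"Control {p_c:.1%}  Treatment 2 {p_t:.1%}  relative difference {relative_diff:.1%}")
print(f"z = {z:.2f}, p = {p:.4f}  (significant at the 95% confidence level if p < 0.05)")
```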
On the strength of this statistically validated result, it might have seemed natural to stop gathering data and move on to implementing Treatment 2 and/or conducting follow-on research.
However, while post-test analysis and planning continued, we kept the test running for an additional week…
Case Study 1 – Week 2
After a second week, before closing out the test, we extracted a fresh set of reports, expecting further confirmation of the prior outcome.
What we found instead was quite different.
Page | Bounce Rate |
---|---|
Control | 20.2% |
Treatment 1 | 19.7% |
Treatment 2 | 20.7% |
Diff. Between Ctr. & Trt. 2 | -2.4% |
Trt. 2 Diff. Between Week 1 & Week 2 | -37.7% |
What you need to understand: In Week 2, the Control and Treatment 1 pages performed much as they had in Week 1. But the bounce rate of the Treatment 2 page soared from 12.9% to 20.7%, exceeding both of the other pages. The relative increase in bounce rate from Week 1 to Week 2 was nearly 38%.
What possible causes can you think of for such a dramatic change in only one week?
Research
Since only the “radical redesign” treatment was significantly affected, we first speculated that many returning visitors, who were accustomed to seeing the familiar Control page, may have “bounced” by clicking back to the search results page to verify they were on the “right” page, or by manually re-entering the URL.
To test this hypothesis, we extracted performance data separately for “New” and “Returning” visitors.
Results: New Visitors Only
Page | Week 1 Bounce Rate | Week 2 Bounce Rate | Relative Difference |
---|---|---|---|
Control | 24.7% | 22.3% | -9.6% |
Treatment 1 | 21.1% | 21.4% | 1.1% |
Treatment 2 | 13.5% | 24.0% | 78.5% |
What you need to understand: When returning visitors are filtered out and only new visitors are included, the pattern is the same. The bounce rate for the Control page changed by only 2.4 percentage points, while that of the Treatment 2 page soared from 13.5% to 24.0%. The relative increase in bounce rate from Week 1 to Week 2 was nearly 79%.
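Below is a sketch of how a segmented extract like this might be produced from visit-level analytics rows using pandas. The column names and visit counts are assumptions for illustration (chosen so the new-visitor rates roughly match the table); they are not the actual export format or data of the analytics tool used in this test.

```python
# A sketch of producing a segmented bounce-rate report from visit-level rows.
# Column names and counts are hypothetical, for illustration only.
import pandas as pd

visits = pd.DataFrame({
    "page":         ["Control", "Treatment 2", "Control", "Treatment 2"] * 2,
    "visitor_type": ["New", "New", "Returning", "Returning"] * 2,
    "week":         [1, 1, 1, 1, 2, 2, 2, 2],
    "visits":       [1200, 1180, 300, 310, 1150, 1210, 320, 290],
    "bounces":      [296, 159, 70, 40, 256, 290, 66, 62],
})

report = (
    visits.groupby(["page", "visitor_type", "week"])[["visits", "bounces"]]
          .sum()
          .assign(bounce_rate=lambda d: d["bounces"] / d["visits"])
)

# Compare Week 1 vs. Week 2 for new visitors only, per page.
new_only = report.xs("New", level="visitor_type")["bounce_rate"].unstack("week")
new_only["relative_change"] = (new_only[2] - new_only[1]) / new_only[1]
print(new_only)
```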
Research
So it wasn’t new vs. returning visitors. What else could have caused the reversal?
Test Validity Threats
Since this was a significant and unexpected change, we initiated a formal forensic research process for Test Validity Threats.
This investigation is still underway.
When conducting optimization testing, there are four primary threats to test validity:
- History Effects: The effect on a test variable of an extraneous variable associated with the passage of time.
- Instrumentation Effects: The effect on a test variable of an extraneous variable associated with a change in the measurement instrument.
- Selection Effects: The effect on a test variable of an extraneous variable associated with different types of subjects not being evenly distributed among the experimental treatments.
- Sampling Distortion Effects: The effect on the test outcome caused by failing to collect a sufficient number of observations.
Now, let’s take a look at an example of History Effects.
Case Study 2
Test Design
We conducted a 7-day experiment for a subscription-based site that provides search and mapping services to a nationwide database of registered sex offenders.
The objective was to determine which ad headline would produce the highest click-through rate.
Headlines
- Child Predator Registry (control)
- Predators in Your Area
- Find Child Predators
- Is Your Child Safe?
The Problem
During the test period, the nationally broadcast NBC television program Dateline aired a special called “To Catch a Predator.”
This program was viewed by approximately 10 million people, many of them concerned parents. Throughout the program, sex offenders were referred to as “predators.”
This word was used in some, but not all, of the headlines being tested.
Results
We found the following:
Headline | CTR |
---|---|
Predators in Your Area | 6.7% |
Child Predator Registry | 4.4% |
Find Child Predators | 4.4% |
Is Your Child Safe? | 2.9% |
Diff. Between First and Last | 133.5% |
What you need to understand: In the 48 hours following the Dateline special, there was a dramatic spike in overall click-through to the site. The click-through rates aligned in descending order of the prominence of the word “predator.” The best-performing headline generated a click-through rate 133% higher than the headline without the word “predator.”
In effect, an event external to the experiment that occurred during the test period (the Dateline special) caused a significant (and transient) change in the nature and magnitude of the arriving traffic. Thus the test was invalidated by the History Effects validity threat.
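One practical safeguard is to monitor overall response day by day while a test runs, so a transient external event shows up before the test is closed out. Here is a minimal sketch of that idea; the daily CTR figures and the 30% alert threshold are hypothetical, loosely patterned on the spike described above.

```python
# A sketch of day-by-day CTR monitoring to surface a possible history effect.
# The daily figures and the 30% threshold are hypothetical illustrations.
DAILY_CTR = [0.041, 0.043, 0.040, 0.042, 0.044, 0.067, 0.065]  # overall CTR per test day
ALERT_THRESHOLD = 0.30  # flag a day more than 30% above/below the running mean

for day, ctr in enumerate(DAILY_CTR[1:], start=2):
    running_mean = sum(DAILY_CTR[:day - 1]) / (day - 1)
    relative_shift = (ctr - running_mean) / running_mean
    if abs(relative_shift) > ALERT_THRESHOLD:
        print(f"Day {day}: CTR {ctr:.1%} is {relative_shift:+.0%} vs. the prior days' "
              f"mean ({running_mean:.1%}) -- check for an external event")
```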
History Effects
Ask: “Did anything happen in the external environment during the test that could significantly influence results?”
- Significant industry events (e.g., conferences).
- Holidays or other seasonal factors.
- Company, industry, or news-related events.
Now, let’s take a look at an example of Instrumentation Effects.
Case Study 3
Test Design
We conducted a multivariable landing page optimization experiment for a large subscription-based site.
The goal was to increase conversion by finding the optimal combination of page elements.
The test treatments were rendered by rotating the different values of each of the test variables evenly among arriving visitors.
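The brief does not describe how the testing software implemented that rotation, but one common approach is to assign each arriving visitor to a combination of variable values deterministically, for example by hashing a visitor ID, so that the split stays even and a given visitor keeps seeing the same treatment. The sketch below illustrates that idea with hypothetical variables; it is not the vendor's actual mechanism.

```python
# A sketch of evenly rotating test-variable values among arriving visitors.
# The variables, values, and hashing approach are assumptions for illustration.
import hashlib
from itertools import product

VARIABLES = {
    "headline":   ["A", "B"],
    "hero_image": ["photo", "illustration"],
    "cta_label":  ["Subscribe", "Start now"],
}
# Every combination of variable values is one treatment (2 x 2 x 2 = 8 here).
TREATMENTS = [dict(zip(VARIABLES, combo)) for combo in product(*VARIABLES.values())]

def assign_treatment(visitor_id: str) -> dict:
    """Map a visitor to one treatment; the same visitor always gets the same one."""
    digest = hashlib.sha256(visitor_id.encode()).hexdigest()
    return TREATMENTS[int(digest, 16) % len(TREATMENTS)]

print(assign_treatment("visitor-12345"))
```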
The Problem
We discovered that a “fail-safe” feature was enabled in the testing software, specifying that if for any reason the test was not running correctly, the page would default to the Control page.
As a result, the testing system (the instrumentation) delivered both the treatment page values and the control page values to the browser, which then rendered only the content matching the assigned test condition. Delivering both sets of content added substantially to page weight and load time.
Figure: The Control page, with the five test variables that were set up to rotate circled and highlighted.
Test Validity Impact
Page Load Time
This chart shows load times for the control web page compared to one of the treatment pages. The extra 53 KB is significant, especially for visitors on dial-up or other low-speed connections; at 56K modem speed it adds an extra 9.56 seconds.
Page Load Time Reference Chart (in seconds)
Page Size (KB) | 14.4K | 28.8K | 33.6K | 56K | 128K (ISDN) | 1440K (T1) |
---|---|---|---|---|---|---|
50 | 35.69 | 17.85 | 15.30 | 9.18 | 4.02 | 0.36 |
75 | 53.54 | 26.77 | 22.95 | 13.77 | 6.02 | 0.54 |
84.9 (Regular Web Page) | 60.61 | 30.30 | 25.98 | 15.59 | 6.82 | 0.61 |
100 | 71.39 | 35.69 | 30.60 | 18.36 | 8.03 | 0.71 |
125 | 89.24 | 44.62 | 38.24 | 22.95 | 10.04 | 0.89 |
137 (Treatment Page) | 97.80 | 48.90 | 41.92 | 25.15 | 11.00 | 0.98 |
150 | 107.08 | 53.54 | 45.89 | 27.54 | 12.05 | 1.07 |
175 | 124.93 | 62.47 | 53.54 | 32.13 | 14.05 | 1.25 |
200 | 142.78 | 71.39 | 61.19 | 36.71 | 16.06 | 1.43 |
Additional Load Time (s) | 37.19 | 18.60 | 15.94 | 9.56 | 4.18 | 0.37 |
What you need to understand: The artificially long load times made the user experience both asymmetric among the treatments and significantly different from what any of the treatments would deliver in production, thereby threatening test validity.
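The arithmetic behind the chart is simple: load time is page size divided by effective throughput, and the 56K column implies an effective throughput of roughly 5.45 KB per second (well below the nominal line rate, as is typical for dial-up). A minimal sketch, using that inferred figure:

```python
# A sketch of the arithmetic behind the reference chart above. The effective
# throughput is an assumption inferred from the chart itself, not a measured value.
EFFECTIVE_KB_PER_S_AT_56K = 5.45  # kilobytes per second

def load_time_seconds(page_size_kb: float, throughput_kb_per_s: float) -> float:
    return page_size_kb / throughput_kb_per_s

regular = load_time_seconds(84.9, EFFECTIVE_KB_PER_S_AT_56K)     # ~15.6 s
treatment = load_time_seconds(137.0, EFFECTIVE_KB_PER_S_AT_56K)  # ~25.1 s
print(f"Extra load time at 56K: {treatment - regular:.2f} s")    # ~9.56 s
```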
Instrumentation Effects
Ask: “Did anything happen to the technical environment or the measurement tools or instrumentation that could significantly influence results?”
- Web server or network problems resulting in performance problems.
- Problems with the testing software or reporting tools.
Now, let’s return to Case Study 1 to see how these principles have been applied to-date in investigating the possible causes of test invalidation:
Applying Each Effect
History Effects: Did anything happen during either week to dramatically influence results?
- There were no holidays or major industry events during the test period.
- No extraordinary publicity or media events related to the company or its industry during the test period have been identified.
Instrumentation Effects: Did anything happen to the technical environment or the measurement tools or instrumentation that could significantly influence results?
- The company rotates among several servers, so we are investigating whether any major technical problems occurred during the test.
- The Web analytics product has recently undergone changes, so we are cross-checking and analyzing to detect possible reporting errors.
Selection Effects: Did anything happen to the testing environment that may have caused the nature of the incoming traffic to be different from treatment to treatment or from week to week?
We saw that the ratio of new vs. returning visitors was not a factor. But did the nature of the incoming traffic between these two weeks differ in any other way?
- Was there a significant change in the source of the incoming traffic (which keywords or ads)?
- Were there significant shifts in relative traffic proportion (when multiple sources are present)?
- Were automated tools such as traffic balancers or ROI optimizers being used?
We are extracting and analyzing stratified reports to identify characteristics that may have changed significantly from week 1 to week 2.
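One way to run that kind of stratified check is to compare the traffic-source mix week over week and flag any source whose share shifted sharply. The sketch below illustrates the idea; the source names, visit counts, and 10-point alert threshold are hypothetical, not data from this test.

```python
# A sketch of a stratified week-over-week comparison of traffic-source mix,
# the kind of selection-effect check described above. All figures are hypothetical.
week1 = {"adwords: 'industrial parts'": 1400, "adwords: 'pump fittings'": 600, "organic": 500}
week2 = {"adwords: 'industrial parts'": 700, "adwords: 'pump fittings'": 1300, "organic": 520}

def mix(counts):
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}

mix1, mix2 = mix(week1), mix(week2)
for source in sorted(set(mix1) | set(mix2)):
    share1, share2 = mix1.get(source, 0.0), mix2.get(source, 0.0)
    shift = share2 - share1
    flag = "  <-- large shift" if abs(shift) > 0.10 else ""
    print(f"{source:35s} week1 {share1:5.1%}  week2 {share2:5.1%}  shift {shift:+5.1%}{flag}")
```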
In summary, even though we haven’t yet established with certainty what caused the anomaly in Case Study 1, we continue to investigate and we’re confident we’ll establish what the key factors were.
Avoiding threats to test validity: 10 ways
While it would be a mistake to hastily assume that you know what caused a discrepancy in test results, it is an even bigger mistake to inadvertently act on bad data because you failed to thoroughly check for validity threats.
Following are 10 key ways you can recognize and avoid the most common, most damaging threats to test validity:
- Consider important dates. Avoid scheduling tests that span holidays or industry-specific seasonal boundaries unless they are the subject of the test.
- Monitor industry publications and news during the test period for events that may affect traffic or visitor purchase behavior in a way that is transient or unsustainable.
- Extract and analyze segmented data for consistency throughout the test period. This is especially important for long-duration tests.
- Never assume your metrics system is accurate. Monitor carefully for anomalies.
- Periodically perform cross-checks among multiple sources of the same data when available. Seek to explain any significant differences.
- In your metrics tools, ensure that you are gathering data with sufficient granularity to “drill down” if needed. For example, to look for time-oriented factors, ensure you can compare results by hour-of-day or day-of-week. Categories you should consider include new vs. returning visitors (as we saw in Case Study 1), connection speed, and browser type. Read the “Essential Metrics” brief to learn more about this topic.
- Match the profile of your test subjects with the profile of your market.
- For testing Web pages, use the traffic source(s) that most closely match the demand market for the subject page(s).
- Do not underestimate the importance of browser compatibility testing on treatment pages. New treatment pages often contain flaws you don’t know about.
- Continue to test until you have the necessary level of confidence in your data. Use a consistent and reliable method for establishing when you have a statistically sufficient sample (a minimal example is sketched just after this list). For more about this, including tools for designing and conducting tests and establishing statistical validity, consider enrolling in the MarketingExperiments Professional Certification course in Online Testing.
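As a companion to the last point, here is a minimal sketch of one standard way to estimate how many visitors per treatment are needed before a given absolute difference in rates can be detected at 95% confidence and 80% power. The baseline rate and minimum detectable difference are hypothetical inputs; this is a textbook two-proportion formula, not the method used by any particular testing tool.

```python
# A minimal sketch of estimating visitors needed per treatment before a given
# absolute difference in conversion (or bounce) rate can be detected.
from math import sqrt, ceil

def sample_size_per_arm(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Visitors per arm for a two-sided test at 95% confidence and 80% power."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

baseline = 0.226   # e.g., a 22.6% bounce rate (hypothetical input)
difference = 0.05  # smallest absolute change worth detecting (hypothetical input)
print(sample_size_per_arm(baseline, baseline - difference))  # visitors per treatment
```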
Using the methods and principles outlined in this brief, you can avoid being blind-sided by test-spoiling validity threats and making a bad business decision based on a “statistically valid” sample size alone.
Related MarketingExperiments Reports
As part of our research, we have prepared a review of the best Internet resources on this topic.
Rating System
These sites were rated for usefulness and clarity, but alas, the rating is purely subjective.
* = Decent | ** = Good | *** = Excellent | **** = Indispensable
- Validity (statistics) **
- Populations, Samples, and Validity **
- Exploring the Seasonality of Your Website *
Credits:
Editor(s) — Frank Green
Bob Kemper
Writer(s) — Peg Davis
Bob Kemper
Adam Lapp
Contributor(s) — Adam Lapp
Gabby Diaz
Jimmy Ellis
Bob Kemper
Flint McGlaughlin
HTML Designer — Cliff Rainer