Testing, Testing: Not All Failures Are Created Equal

By Linnea Gandhi
BehavioralSight and Chicago Booth School of Business

April 02, 2018

Editor's Note: Testing, testing, 1-2-3.

We don’t have many hard-and-fast rules at People Science: we try to keep an open mind and learn from the experience and insight of our contributors, our readers, and the comments section of conspiracy theory websites (kidding… mostly).

One rule, however, is gospel: You have to test. Behavioral science doesn’t offer off-the-rack, one-size-fits-all solutions. It offers scientifically supported principles with world-changing potential… but they need to be carefully and thoughtfully crafted to fit each business, each impacted community, each goal. You have to experiment. Anyone telling you otherwise is selling snake oil. Run away.

Since experimentation is so vital to effective application of behavioral science, we’re going to dedicate a lot of time and space to it on this platform. Here’s some more, from the talented and wise Linnea Gandhi.

Your team just presented the results of an experiment testing the impact of a new call center script on customer retention. As you walk out of the conference room you mutter under your breath, “What a complete failure.” Why? (Circle all that apply).

a) Your new script performed no differently than the current script
b) Your new script performed worse than the current script
c) You learn later that your team and the call center define “retention” differently
d) Your new script performed so well in the first week that your boss stopped the experiment early
e) Only the best performing call center reps were given the new script
f) The call center reps didn’t stick to the scripts
g) The language was accidentally jumbled when fed into the call center system
h) Not rolling out the new script risks lowering your annual bonus
i) You learn your predecessor tested an identical script last year, with statistically significant results

Ugh, right? I wish I merely had an overactive imagination, but these failures are all real and have all happened to me. If you’ve run enough experiments, versions of them have probably happened to you, too. Failure is the name of the game when it comes to experimenting in the field.

But not all failures are created equal. This distinction doesn’t come immediately to most of us. Corporate environments aren’t too forgiving of the “F” word, and it’s rare if not unheard of for failure to drive promotions or pay raises. This blanket fear is one reason experimentation as a methodology has been slower to take hold in organizations outside of digital marketing (where A/B tests now prevail). Experimenting effectively in more complex contexts—new product rollouts, recruiting procedures, back office forms—is relatively uncharted, and failure looms large. But it looks a lot less scary when you can tease apart types of failure and triage them.

One solution is what I’ll call a taxonomy of failure, classifying our experimental hiccups to enable us to move beyond “argh!” and towards “aha!” A taxonomy of failure puts these potential problems in perspective, so we can design our experiments to avoid certain failures and embrace others.

A basic taxonomy of failure

An experiment is only as good as its process.

The same logic that I teach my MBA students for evaluating the quality of their decisions applies to evaluating the quality of experiments: Judge process, not outcomes. Both domains are driven by uncertainty—neither the decision-maker nor the experimenter knows what will happen post-decision or post-experiment launch—and, probabilistically, outcomes can’t fall in their favor every time.

Outcome failure

a) Your new script performed no differently than the current script
b) Your new script performed worse than the current script

Outcome failure—or not seeing statistically significant results from your idea or intervention—is the boogeyman of anyone operating in a results-driven corporate environment. Yet it is integral to learning over time and to averting roll-outs of ineffective ideas, products, and communications. If you aren’t seeing negative and null effects, you’re not running experiments (seriously, go back and check your designs). Outcome failures are only “failures” if your organization’s incentives make them so.

If you aren’t seeing negative and null effects, you’re not running experiments
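To make option (a) concrete: “no statistically significant difference” is just a comparison of two retention rates whose gap could plausibly be noise. Here is a minimal sketch of that check in Python, using a standard chi-squared test; the counts and the 0.05 threshold are invented assumptions for illustration, not anything from a real call center.

```python
# Hypothetical retention counts, invented for illustration only.
from scipy.stats import chi2_contingency

# Each row: [customers retained, customers lost] out of 1,000 calls.
current_script = [412, 588]   # 41.2% retention
new_script = [431, 569]       # 43.1% retention

chi2, p_value, dof, expected = chi2_contingency([current_script, new_script])

if p_value < 0.05:  # conventional (and here assumed) significance threshold
    print(f"Detectable difference in retention (p = {p_value:.3f})")
else:
    # Outcome failure (a): no statistically significant difference --
    # disappointing, but still worth knowing before a costly roll-out.
    print(f"No statistically significant difference (p = {p_value:.3f})")
```

With numbers like these, the two-point gap comes out well short of significance, which is exactly the kind of result a healthy experimentation program should expect to see often.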

Process failures

The real failures in experimentation? They’re all about process. Process failures undermine learning. Some prevent accurately measuring changes in outcomes. Others prevent confidently tying changes we do measure back to the idea we’re testing. Still others prevent us from doing all this efficiently.

The first of these—measurement failure—is most fatal: without accurate measurement, we are left with nothing to learn from but noise. The second—attribution failure—may provide salvageable knowledge: we at least know we can get results, even though we cannot provide a causal explanation. The third—efficiency failure—is mostly harmful at scale: we learn something we can act on, but we waste resources in the process.

Measurement Failure: Fix your machine first

c) You learn later that your team and the call center define “retention” differently

Little can be learned from measurement failure. If you don’t have a consistent interpretation of what different metrics mean, or you’re not drawing on the same underlying data, how can you meaningfully measure results? Presumably only the most disorganized companies should be prone to this. But how many people are happy with their company’s data infrastructure? Any company operating in a data-rich environment is likely getting by with a patchwork quilt of programs, sources, and conventions.

Beyond the obvious—i.e. investing in infrastructure—solutions lie in planning. Have clear conversations on definitions and data sources, identify proxy measures or IT workarounds, and carefully review a sample report of results before launching the experiment.
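One lightweight form of that planning is to pin the metric down in a single shared definition that every team and report uses. The sketch below is purely illustrative: it assumes “retention” means the customer is still active 90 days after the experiment call, a window your team and the call center would have to agree on explicitly.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical, explicitly agreed definition: a customer counts as "retained"
# if their account is still active 90 days after the experiment call.
RETENTION_WINDOW_DAYS = 90

def is_retained(call_date: date, churn_date: Optional[date]) -> bool:
    """The single shared definition of retention used by every team and report."""
    cutoff = call_date + timedelta(days=RETENTION_WINDOW_DAYS)
    return churn_date is None or churn_date > cutoff

# A customer who churned 60 days after the call is not retained under this rule.
print(is_retained(date(2018, 1, 5), date(2018, 3, 6)))  # False
print(is_retained(date(2018, 1, 5), None))              # True
```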

Attribution Failure: The greater your control, the greater your confidence in causality

d) Your new script performed so well in the first week that your boss stopped the experiment early
e) Only the best performing call center reps were given the new script
f) The call center reps didn’t stick to the scripts
g) The language was accidentally jumbled when fed into the call center system

Attribution failure is pervasive once you step outside of a controlled lab into the real world, where your experimental setting is less like a sanitized petri dish and more like a complex factory not quite up to code. The machines are cobbled together with patchwork piping, workers jump from issue to issue (or grab a smoke break), and management is pacing upstairs, asking why the product wasn’t finished yesterday.

The call center example provides several scenarios where human and technical complexity, when left uncontrolled, undermines attribution. Pressure to perform could be high enough that leadership stacks the deck with their best reps, or that reps paid on performance switch to whichever script they suspect will yield the best results. Or, more charitably, perhaps an innocent error was made in balancing talent between groups, or reps using the new and old scripts sat next to each other and naively traded tips on phrasing. Or maybe the call center software is just decades old, jumbling the wording or misallocating scripts to reps.

Systems and technologies break, more often than we wish. But if we run a small pilot before our main experiment, we can often figure out whether and where we have “leaky pipes” in our system and then patch them up. The fix becomes harder when it relies on human judgment. Is it in everyone’s interest to run the experiment cleanly, regardless of whether they know they are participating in one? Dry runs help here too, but the best solutions require thoughtful blueprinting for every scenario, designing and redesigning incentives, experiences, and even physical environments.
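For the talent-balancing scenarios in particular, one concrete piece of that blueprinting is random (or stratified random) assignment, so that skill, and anything else you didn’t think to measure, is spread evenly across groups rather than hand-picked. A minimal sketch, using a hypothetical list of reps and last quarter’s performance scores:

```python
import random

# Hypothetical reps and last quarter's performance scores (illustration only).
reps = [
    ("Ana", 92), ("Ben", 88), ("Cal", 85), ("Dee", 81),
    ("Eli", 77), ("Fay", 74), ("Gus", 70), ("Hal", 66),
]

random.seed(42)  # fixed seed so the assignment is reproducible and auditable

# Stratified randomization: sort by performance, then flip a coin within each
# adjacent pair to decide who gets the new script. Talent ends up balanced
# across groups instead of being hand-picked.
reps_sorted = sorted(reps, key=lambda r: r[1], reverse=True)
new_script, current_script = [], []
for i in range(0, len(reps_sorted), 2):
    pair = reps_sorted[i:i + 2]
    random.shuffle(pair)
    new_script.append(pair[0][0])
    if len(pair) > 1:
        current_script.append(pair[1][0])

print("New script:    ", new_script)
print("Current script:", current_script)
```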

Efficiency Failure: Early efforts to stay organized reap exponential rewards

h) Not rolling out the new script risks lowering your annual bonus
i) You learn your predecessor tested an identical script last year, with statistically significant results

Efficiency failure is the least likely to prevent a company from dabbling in experimentation but the most likely to prevent it from generating long-term value at scale. A single inefficient experiment won’t hold you back, but a persistent habit of putting quick wins ahead of slower, thorough thinking will.

A single inefficient experiment won’t hold you back, but a persistent habit of putting quick wins ahead of slower, thorough thinking will.

I’ve frequently kicked myself for realizing too late that someone outside my team already had an answer to the question we were experimentally evaluating. As many organizations grow, knowledge becomes siloed or lost to turnover, driving redundancy and waste. Experiments are no exception. If the goal is to learn over the long run, investing early in good governance—coordination, documentation, and decision rights—goes a long way. Further, aligning incentives to that governance will only amplify impact.
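Documentation can be as simple as a searchable registry of past experiments that anyone checks before launching a new one. The sketch below is hypothetical (the fields, the example entry, and the helper are all invented for illustration), but even a shared spreadsheet playing the same role would have caught scenario (i):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ExperimentRecord:
    """One row in a hypothetical company-wide experiment registry."""
    owner: str
    intervention: str
    metric: str
    result: str
    year: int

registry = [
    ExperimentRecord("Predecessor", "empathy-first call script",
                     "90-day retention", "significant lift", 2017),
]

def prior_work(keyword: str) -> List[ExperimentRecord]:
    """Check the registry before spending months re-running an old test."""
    return [r for r in registry if keyword in r.intervention]

print(prior_work("call script"))  # surfaces last year's near-identical test
```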

Every experiment is perfectly designed to produce the failures it gets

I can’t help but play on the words of health policy and quality professor Paul Batalden (who, in turn, is playing on similar words from P&G executive Arthur Jones): “Every system is perfectly designed to produce the results it gets.”

Look across the failures classified above—those of outcomes and of process: measurement, attribution, and efficiency—and you’ll see that every experiment is perfectly designed to produce the failures it gets. Poor experimental design is often driven by poor organizational—especially incentive—design.

Every experiment is perfectly designed to produce the failures it gets.

If you only reward experimental outcomes that confirm the starting hypothesis, you’ll get fewer experiments, fewer innovative hypotheses, and a whole lot of false positives and buried results. If you reward quick wins from rushed experiments, you’ll have to throw out a lot of poorly measured results. If you reward the supply chain of individuals running the experiment for behaviors that conflict with it, you’re asking for confounds to causality. And if you reward doing over thinking, running experiments over learning from them, you may get measurable, attributable results, but at a high, unsustainable price.

• • •

Effective experimentation outside academia, by companies in the messy real world, hinges on far more than smart statistical design. At the center of experimental success is smart organizational design.

Not all experimental failures are created equal. Some spring from outcomes, some from process. Some undermine learning, some enhance it. The best way to shift the balance is to shift our perspective. To see our way to experimental success, we must look beyond the experiment to the broader organization.

