A pilot that passes every success criterion feels like a green light. The numbers are good, the team is energized, and the natural instinct is to move forward. But “the pilot worked” and “we should proceed” are not the same statement, and the gap between them is where expensive mistakes happen.
In my last two posts, I walked through a real pilot (automating customer prospecting for ResortSteward) from design through execution. The results were strong: operational costs well below my targets, qualification accuracy that exceeded expectations, and a pipeline that runs in 35 minutes without my active involvement. By every metric I defined in advance, the pilot passed.
And yet, the decision to proceed still required deliberate evaluation. This post is about that evaluation: how to read pilot results honestly and make the proceed, modify, or stop call without letting momentum do the thinking for you.
The Three Outcomes
Every pilot ends with one of three decisions: proceed, modify, or stop. All three are valid returns on the pilot investment. The pilot’s job was to generate evidence, and evidence that says “don’t do this” is just as valuable as evidence that says “go.”
The problem is that organizations rarely treat these three outcomes as equally legitimate. There’s almost always pressure, sometimes subtle, sometimes not, to declare success and move forward. The executive sponsor wants a win and the team that built it wants to see their work deployed. The sunk cost of the pilot creates momentum that’s hard to resist.
This is why you define success, rethink, and kill criteria before the pilot begins, as I discussed in Post 15. Those criteria exist to make the decision for you, or at least to anchor it in evidence rather than enthusiasm.
Reading Results Honestly
Pilot results only mean something relative to the baseline you established before you started. “The system processed 250 candidates” is not a result. “The system processed 250 candidates in 35 minutes at a cost of $0.75, compared to roughly 60 hours of manual effort for the same volume” is a result. With a baseline, you have evidence; without one, you have activity metrics.
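The baseline comparison reduces to simple arithmetic. A sketch using the pilot numbers from this post; the hourly labor rate is a hypothetical assumption, not a figure from the pilot:

```python
# Illustrative baseline comparison. HOURLY_RATE is an assumed loaded
# labor cost for manual prospecting, not a number from the pilot.
HOURLY_RATE = 50.0      # assumed $/hour

candidates = 250
pilot_cost = 0.75       # dollars, total pilot spend
pilot_minutes = 35
manual_hours = 60       # baseline estimate for the same volume

manual_cost = manual_hours * HOURLY_RATE
cost_per_candidate_pilot = pilot_cost / candidates
cost_per_candidate_manual = manual_cost / candidates
time_ratio = manual_hours * 60 / pilot_minutes

print(f"pilot:  ${cost_per_candidate_pilot:.4f}/candidate")
print(f"manual: ${cost_per_candidate_manual:.2f}/candidate")
print(f"speedup: {time_ratio:.0f}x faster than manual")
```

The exact rate matters less than having one written down before the pilot runs; the comparison is only evidence if the baseline was measured first.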
There are a few traps to watch for when interpreting results.
Strong numbers can mask fragility. A pilot often succeeds partly because of the attention it receives. In my case, I manually reviewed every output at every stage of the pipeline. That level of oversight produced clean results, but it’s not how the system will run in production. If your pilot’s accuracy depends on someone inspecting every output, you haven’t proven the system works — you’ve proven that the system plus a human reviewer works. Those are different things, and the production version needs to account for the difference.
Weak results can contain valuable signal. A pilot that misses its targets isn’t necessarily a failure. If it reveals that the real problem is different from the one you defined, or that a different part of the workflow is the actual bottleneck, that insight is worth the investment. A “failed” pilot that redirects your effort is more valuable than a “successful” one that answered the wrong question.
Missing metrics matter. During my prospecting pilot, website traffic to ResortSteward increased noticeably after the outreach emails went out. I hadn’t included website traffic as a success metric; somehow, it wasn’t on my radar when I defined the criteria, and that was a mistake. The increase might have been a direct result of the campaign, or it might have been normal fluctuation. I’ll never know because I wasn’t tracking it before the pilot started. The lesson: think carefully about what to monitor, and start monitoring before the pilot begins. Metrics you didn’t plan for are metrics you can’t interpret.
The Proceed Decision
“Proceed” doesn’t mean “keep doing what you did in the pilot.” It means transitioning from a controlled experiment to an operational system, and that transition changes almost everything.
Monitoring changes. In a pilot, you can afford to review every output. In production, you need automated checks, exception handling, and sampling strategies. What does “good enough” monitoring look like when you can’t inspect every result?
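One common answer to that question is review-by-sampling: automated checks flag anything suspicious for mandatory review, and a random sample of the rest gets spot-checked. A minimal sketch, assuming a simple dict-based output format and a flagging mechanism that are illustrative, not the pilot’s actual design:

```python
import random

def select_for_review(outputs, sample_rate=0.05, seed=None):
    """Pick which pipeline outputs a human should review.

    Anything flagged by automated checks is always reviewed; the rest
    is sampled at `sample_rate`. The flag field and the 5% default are
    assumptions for illustration, not a recommendation.
    """
    rng = random.Random(seed)
    flagged = [o for o in outputs if o.get("flagged")]
    unflagged = [o for o in outputs if not o.get("flagged")]
    k = max(1, round(len(unflagged) * sample_rate))
    return flagged + rng.sample(unflagged, min(k, len(unflagged)))

# Example: 100 outputs, 2 flagged by automated checks, ~5% of the rest sampled
outputs = [{"id": i, "flagged": i in (7, 42)} for i in range(100)]
reviewed = select_for_review(outputs, sample_rate=0.05, seed=1)
```

The design choice worth noting: the sample rate is a dial you can turn up early in production and down as confidence grows, which is a gentler transition than jumping from 100% review to none.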
Ownership changes. The pilot champion (often the person who built it or paid to have it built) isn’t necessarily the right long-term owner. Who maintains the system? Who responds when it breaks? Who decides when it needs updating? These questions need answers before you proceed, not after.
Scale changes the cost model. A pilot that costs $0.75 across 9 searches won’t necessarily cost $7.50 across 90. Costs rarely scale linearly: edge cases, API rate limits, and data quality issues that the pilot’s limited scope didn’t surface can all bend the curve.
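A quick way to pressure-test the linear extrapolation is to treat it as a floor and layer assumed friction on top. The retry rate and per-unit overhead below are placeholders you would replace with measurements, not figures from the pilot:

```python
def project_cost(pilot_cost, pilot_units, target_units,
                 retry_rate=0.10, overhead_per_unit=0.0):
    """Naive linear projection plus assumed friction at scale.

    retry_rate and overhead_per_unit are guesses to be replaced with
    real measurements. The point: the linear extrapolation is a floor,
    not a forecast.
    """
    per_unit = pilot_cost / pilot_units
    linear = per_unit * target_units
    # Retries re-spend the per-unit cost; overhead stands in for
    # rate-limit backoff, data cleanup, and other costs the pilot never hit.
    return linear * (1 + retry_rate) + overhead_per_unit * target_units

linear_floor = (0.75 / 9) * 90   # the optimistic case: $7.50
projected = project_cost(0.75, 9, 90, retry_rate=0.10, overhead_per_unit=0.01)
```

Even toy numbers make the conversation concrete: the projected figure lands meaningfully above the floor, and the gap is what the pilot’s limited scope hid.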
Deferred governance needs addressing. During a pilot, it’s reasonable to take shortcuts: manual oversight instead of automated controls, informal review instead of documented processes. Those shortcuts need to be replaced before production. In my case, I realized I needed full “do not contact” management built into the automation platform. During the pilot, I could track this informally. At production scale, that’s a liability.
And sometimes, the right proceed decision includes keeping a human in the loop where your instinct says to automate. I chose not to automate the email sending step, even though the pipeline could do it. In a pilot of 73 emails, I deleted one without sending because the qualification was wrong. At scale, that’s roughly a 1.4% error rate on outreach going to businesses that aren’t a fit. Is that small enough to ignore? Maybe. But for a brand built on credibility and precision, sending the wrong email to the wrong business is a real cost, even if it’s not a financial one. The human review step takes minutes and prevents damage that’s hard to undo.
The Modify Decision
Many pilots end with promising results that are close, but not quite ready for production. The key discipline is scoping the modification clearly. What specifically needs to change, and how will you know if the change worked? Do you need to run another pilot after making those changes?
In my prospecting pilot, the modification was structural: the pilot ran all five jobs as a single pipeline, but production needed a different workflow. I separated the search job so I could create targeted searches for specific locations independently, then have the remaining pipeline (filtering, qualification, contact enrichment, email drafting) run on a schedule. This let me control the input (which locations to target and when) while automating the processing.
That’s a meaningful architectural change, but it didn’t require re-running the entire pilot. The individual jobs were already proven — what changed was how they were orchestrated. Not every modification requires a full re-test, but you need to be honest about whether your change affects the components that were validated or just the workflow around them.
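The orchestration split described above can be sketched as two entry points sharing a queue: searches are triggered on demand per location, and the scheduled pipeline drains whatever the searches have queued. Function names, the queue, and the stand-in stages are illustrative, not the actual ResortSteward implementation:

```python
from collections import deque

# Shared hand-off point between the on-demand and scheduled halves.
search_queue = deque()

def run_search(location):
    """On-demand: a targeted search for one location feeds the queue."""
    results = [f"{location}-candidate-{i}" for i in range(3)]  # stand-in for a real search
    search_queue.extend(results)

def run_pipeline():
    """Scheduled: drain the queue through filter -> qualify -> enrich -> draft."""
    processed = []
    while search_queue:
        candidate = search_queue.popleft()
        # Each stage here is a job the pilot already validated;
        # only the orchestration around them changed.
        processed.append(f"drafted email for {candidate}")
    return processed

run_search("Lake Tahoe")    # operator controls the input...
emails = run_pipeline()     # ...automation handles the processing
```

The queue is what makes the split work: the operator-controlled half and the scheduled half never need to run at the same time or know about each other.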
If your modification is more fundamental (e.g., the core AI capability didn’t perform well enough, or the cost model doesn’t work at the scale you need), that’s closer to a rethink than a tweak. Apply the graduated thresholds you defined before the pilot. Is this a tuning problem or an architecture problem? The answer determines whether you iterate or redesign.
The Stop Decision
Stopping is the hardest call and the most valuable discipline. The money and time already spent on the pilot are gone regardless of whether you continue. Sunk costs are not a reason to proceed.
I’ve experienced this directly. With ResortSteward, I wanted to use AI for automated guest communications by building a flexible system where AI would monitor events (a booking, an upcoming check-in, a payment) and intelligently generate emails in response. The concept was appealing: instead of building rigid templates and triggers, let the AI figure out what to say and when.
The pilot showed quickly that this was the wrong approach. The task didn’t require intelligence — it required reliability. Guests need to receive the right email at the right time, every time, with no variation. That’s a rules-based automation problem, not an AI problem. I stopped the AI approach and built a straightforward trigger-and-template system instead.
That stop decision saved significant development time and produced a better product. The AI approach would have been more complex to build, harder to debug, and less reliable in production — all for a task that didn’t benefit from AI’s strengths. This connects directly to the principle from Post 11: use the lowest-complexity solution that achieves the outcome.
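What “trigger-and-template” means in practice: each event type maps deterministically to exactly one template, so the same event always produces the same email. A minimal sketch with illustrative event names and templates, not ResortSteward’s actual system:

```python
# Each event type maps to exactly one template -- no model, no variation.
TEMPLATES = {
    "booking_confirmed": "Thanks for booking! Your stay begins on {check_in}.",
    "check_in_soon":     "Reminder: your check-in is on {check_in}.",
    "payment_received":  "We received your payment of ${amount:.2f}.",
}

def email_for(event):
    """Same event in, same email out. Unknown events fail loudly
    rather than guessing, which is the reliability property the
    AI approach couldn't offer."""
    template = TEMPLATES.get(event["type"])
    if template is None:
        raise ValueError(f"unhandled event type: {event['type']}")
    return template.format(**event.get("data", {}))

msg = email_for({"type": "payment_received", "data": {"amount": 250.0}})
```

Twenty lines, trivially debuggable, and guaranteed to send the right email every time, which is exactly what the task required.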
Communicating a stop decision matters too. A pilot that concludes “this isn’t the right approach” is a success — it did exactly what it was designed to do. Frame it that way. The alternative was spending far more resources discovering the same thing in production.
Making the Call
The decision framework is straightforward, even when the decision itself is difficult:
Review your results against the criteria you defined before the pilot. Don’t switch gears and evaluate against how you hoped it would perform, or against what the team expected, or against what you told your stakeholders. Evaluate the results against the specific, measurable targets you wrote down before you started.
Ask whether the pilot conditions will hold in production. If success depended on conditions that won’t scale, such as manual oversight, curated data, or a dedicated champion, factor that in.
Be honest about what you don’t know. Missing metrics, untested edge cases, and deferred governance decisions are all risks that need to be weighed, not ignored.
And remember that all three outcomes (proceed, modify, or stop) represent a return on the pilot investment. The worst outcome isn’t stopping; the worst outcome is proceeding without evidence, or in spite of evidence that points to problems.
What Comes Next
The pilot passed. The decision to proceed has been made, perhaps with modifications for production readiness. Next week, I’ll cover what changes when a successful AI initiative scales beyond the controlled pilot: data quality at volume, monitoring that works without reviewing every output, ownership structures, and change management.
Wrap Up
This post is part of a series on the current state of AI, focused on how it can be applied in practical ways to deliver measurable improvements in productivity, cost savings, and response times. If you’d like to explore more, all previous posts are available under Insights; please read them and reach out with any questions or comments you have.