Your First AI Pilot Project: A 90-Day Checklist
About three out of four AI initiatives don't deliver the ROI they promised. The technology isn't usually the reason. The reason is "pilot purgatory" — projects that demo well, never get past the demo, and quietly become a slide in next quarter's roadmap.
This is the 90-day checklist we use with new clients to get out of pilot purgatory and into a real go/no-go decision. It's structured around five phases of two to four weeks each, with concrete deliverables at the end of every phase so you can see slippage early, before it costs you a quarter.
Days 1–14: Pick the right use case
Most pilots fail in week one, when the wrong problem gets picked. Run every candidate through the five criteria below and reject any that misses more than two:
- The data exists and is accessible. If the answer to "where does this data live?" is "in a few different systems, we'd have to consolidate it," that's a data project with an AI hat on. Worth doing — just don't budget it as an AI pilot.
- A single person owns the operational metric. Not a committee. Not "the team." A name. If you can't name the owner, the savings won't show up in any P&L.
- The current process is documented. If nobody can describe the current SOP, you don't have a baseline to measure against. You'll spend the pilot debating what "improvement" means.
- The cost of being wrong is recoverable. First pilots should be back-office, internal-facing, low-stakes-per-decision use cases. Save customer-facing chat and regulated decisions for project number three.
- Volume is high enough to matter. A use case that fires 50 times a year won't pay back the build cost no matter how well it works. Aim for processes that run hundreds or thousands of times per month.
Common good first pilots: ticket triage and routing, internal knowledge search, document classification, structured data extraction from PDFs, draft generation for repetitive written outputs (summaries, briefs, intake notes). Common bad first pilots: customer-facing chatbots, sales lead scoring, anything that touches HR decisions, anything where the data is in screenshots or scanned forms with no OCR pipeline.
Days 15–30: Lock the success metric and the kill criterion
This is the phase teams skip and regret. Before any code gets written, you need four numbers on paper, signed off by the budget owner (a concrete sketch of what they can look like follows the list):
- Baseline. What's the current operational metric? Measure it for at least two weeks of normal operations. If you can't measure the "before," you can't claim improvement on the "after."
- Target. What threshold counts as success? Be specific: "reduce handle time by 30% on tickets in category X" beats "improve efficiency."
- Kill criterion. What number, by what date, makes you stop? This is the most important sentence in the document. Write it down. Have the budget owner sign it.
- Cost ceiling. Total spend authorized for the pilot, including consulting, infra, and internal time. When you hit the ceiling, you stop — even if you're "almost there."
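If it helps to make the sign-off concrete, the four numbers can live in a small version-controlled artifact next to the pilot code. Here is a minimal sketch in Python; the field names and the example values are illustrative assumptions, not a prescribed template.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class PilotCharter:
    """The four numbers agreed before any code is written."""
    metric: str             # the operational metric being moved
    baseline: float         # measured over at least two weeks of normal operations
    target: float           # the threshold that counts as success
    kill_value: float       # if the metric hasn't reached this by kill_date, stop
    kill_date: date         # the date the kill criterion is evaluated
    cost_ceiling_usd: int   # total authorized spend: consulting, infra, internal time
    metric_owner: str       # the single named person who owns the metric
    budget_owner_signed: bool = False

# Illustrative values only (a ticket handle-time pilot, where lower is better).
charter = PilotCharter(
    metric="average handle time, category X tickets (minutes)",
    baseline=18.5,
    target=13.0,            # roughly a 30% reduction
    kill_value=16.5,        # less than ~11% improvement by the kill date means stop
    kill_date=date(2026, 3, 31),
    cost_ceiling_usd=60_000,
    metric_owner="Head of Support Operations",
    budget_owner_signed=True,
)
```

The exact fields matter less than the fact that they exist before day 31 and that the budget owner's sign-off is recorded somewhere the whole team can see.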
Why the kill criterion matters: sunk-cost bias is the dominant failure mode of AI pilots. By month three you'll have invested time, political capital, and a few all-hands moments. The temptation to extend "just two more weeks" will be enormous. The kill criterion is how you make the decision before emotions get involved. We wrote more about why this lands on the CFO's desk in our AI ROI framework.
Days 31–60: Build the smallest thing that proves the point
The build phase is where consultants and clients fight, almost always over scope. The discipline that makes pilots succeed:
- Build on a sample of real data, not synthetic data. A small sample of messy production data tells you ten times more than a clean synthetic set.
- One use case, one channel, one team. Resist the urge to add "and while we're at it, let's also..." Every additional thing doubles the timeline.
- Demo every Friday. Not status updates — actual working demos against the success metric. A pilot that can't demo by week six is a pilot in trouble.
- Track the metric on a dashboard the budget owner can see. If the metric is only visible to the consultant, you don't have the data infrastructure to operate this in production anyway.
- Write the runbook as you go. The pilot isn't done when it works once; it's done when someone other than the original builder can run it.
Plan for one big surprise during this phase. We've never run a pilot that didn't have at least one "the data isn't structured the way the source-system documentation says it is" moment. Reserve 15–20% of the build time for surprise remediation. If you don't use it, congratulations.
Days 61–80: Run it in shadow mode in production
This is the phase most pilots skip, and it's where most go-live failures originate. Before you switch the AI on for real users, run it in shadow mode: same inputs as the live process, outputs go to a parallel queue that nobody acts on. Compare quality, latency, and cost between the AI output and the human-only baseline.
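For the team wiring this up, the pattern is simple enough to sketch. The snippet below is a minimal illustration, not a reference implementation: `live_process`, `ai_pipeline`, and the JSONL shadow log are stand-ins for whatever your queueing, model, and logging stack actually are.

```python
import time
import json
from datetime import datetime, timezone

def handle_item(item, live_process, ai_pipeline, shadow_log_path="shadow_log.jsonl"):
    """Run the live process as usual and the AI pipeline in shadow on the same input."""
    live_output = live_process(item)           # this is what the business acts on

    start = time.perf_counter()
    try:
        ai_output = ai_pipeline(item)          # parallel output; nobody acts on it
        error = None
    except Exception as exc:                   # a shadow failure must never block the live path
        ai_output, error = None, str(exc)
    latency_ms = (time.perf_counter() - start) * 1000

    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "item_id": item.get("id"),
        "live_output": live_output,
        "ai_output": ai_output,
        "latency_ms": round(latency_ms, 1),
        "error": error,
    }
    with open(shadow_log_path, "a") as f:      # append one JSON line per item
        f.write(json.dumps(record) + "\n")

    return live_output                         # downstream behavior is unchanged
```

The property that matters is the last line: downstream behavior never changes, and a failure in the shadow path never touches the live one.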
Three things shadow mode reveals that demos don't:
- Edge cases. The 5% of inputs that look weird. Demos use happy-path examples; production has long tails.
- Cost at volume. Per-call costs that look fine for 50 examples can be brutal at 50,000.
- Operational fit. Does the team trust the output enough to use it? Trust is a function of consistency, not just accuracy.
Run shadow mode for at least two weeks. Track agreement rate (AI vs. human), false-positive and false-negative rates, and latency at the 50th, 95th, and 99th percentiles. If any of those numbers fall outside what you committed to in days 15–30, that's not a tuning problem — that's data for your go/no-go decision.
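As one way to turn a shadow log into those numbers, here is a minimal sketch. It assumes the JSONL format from the earlier snippet and a single-label decision where the human output is treated as ground truth; the `positive_label` value is an illustrative assumption.

```python
import json
import statistics

def summarize_shadow_log(path="shadow_log.jsonl", positive_label="route_to_team_a"):
    """Agreement rate, false-positive/negative rates, and latency percentiles."""
    with open(path) as f:
        records = [json.loads(line) for line in f]

    # Assumes a non-trivial log in which both classes appear.
    scored = [r for r in records if r["error"] is None]
    agreement = sum(r["ai_output"] == r["live_output"] for r in scored) / len(scored)

    # Treat the human (live) output as ground truth for error-rate purposes.
    actual_pos = [r for r in scored if r["live_output"] == positive_label]
    actual_neg = [r for r in scored if r["live_output"] != positive_label]
    fp_rate = sum(r["ai_output"] == positive_label for r in actual_neg) / len(actual_neg)
    fn_rate = sum(r["ai_output"] != positive_label for r in actual_pos) / len(actual_pos)

    # statistics.quantiles with n=100 returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    latencies = sorted(r["latency_ms"] for r in records)
    q = statistics.quantiles(latencies, n=100)
    p50, p95, p99 = q[49], q[94], q[98]

    return {
        "n_records": len(records),
        "agreement_rate": round(agreement, 3),
        "false_positive_rate": round(fp_rate, 3),
        "false_negative_rate": round(fn_rate, 3),
        "latency_p50_ms": round(p50, 1),
        "latency_p95_ms": round(p95, 1),
        "latency_p99_ms": round(p99, 1),
    }
```

Run something like this daily during the shadow window and compare each number against the thresholds signed off in days 15–30, rather than waiting until day 80 for a single read.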
Days 81–90: Make the call
The last two weeks are for the decision, not the build. The deliverables:
- A one-pager comparing baseline vs. shadow-mode results against the success metric and kill criterion.
- An updated cost estimate for production rollout, with the full cost stack (data prep, integration, change management, run cost, maintenance, risk reserve — see our ROI post for the template).
- A go/no-go recommendation from the project owner, signed.
- A written post-mortem covering what worked, what didn't, what surprised you, and what you'd do differently. This is the artifact that makes pilot number two faster than pilot number one.
Three legitimate outcomes from a pilot — all of them count as success:
- Go. Hit the metric, scale to production with the cost estimate above. About 25–35% of well-scoped pilots land here.
- No-go, learn. Didn't hit the metric, but the post-mortem identifies why. Apply the lessons to the next use case. 40–50% of pilots.
- No-go, kill. Didn't hit the metric, post-mortem says the use case is fundamentally a poor fit. Stop. 20–30% of pilots. This is a feature of the process, not a failure.
The wrong outcome — and the one we're trying to prevent — is "extend, extend, extend." That's pilot purgatory. The kill criterion exists to prevent it.
What you'll have at day 91
If you run this checklist, regardless of whether the pilot succeeds, you'll end up with:
- A documented baseline for the operational metric (useful forever).
- A working artifact built on real data (useful even if you don't deploy it — you can reuse the integrations).
- A defensible go/no-go decision with a post-mortem.
- An updated, evidence-based ROI estimate for the production rollout.
- A team that has shipped one AI thing — which makes the second one cheaper, faster, and more accurate.
That's the goal. Not "AI transformation." Not "agent-first enterprise." A real artifact, a real decision, a team that's done it once. Pilot two starts from a much stronger position.
The five patterns we see in pilots that fail
About 75% of mid-market AI pilots don't make it to production with measurable ROI. Across the post-mortems we've sat in on, the same five patterns show up over and over — and almost none of them are about model quality.
- No metric owner. The pilot has a target metric but no single named person whose review or bonus moves with it. When week 8 arrives and the number is ambiguous, nobody is incentivized to push for a clean read. The pilot drifts into "we learned a lot" territory.
- The data wasn't real until week 4. The team starts on synthetic or sample data because the production data is locked behind a security review. By the time real data arrives, half the pilot timeline is gone and the assumptions baked into the model don't hold up.
- The kill criterion was set after the build started. Without an upfront threshold, every result becomes an argument. "It's almost there" becomes the standard answer at every weekly review.
- Adoption was treated as someone else's problem. The model works in isolation but the operators it's meant to help never moved off their old workflow. Six weeks after pilot end, the dashboard is still showing zero queries from production users.
- Scope expanded mid-pilot. A second use case got added in week 5 because "we're already paying for the consultants." Both use cases now have half the time and neither produces a clean signal.
The fix for all five is the same: write them down as risks during the scoping phase, name an owner for each, and revisit at week 4 instead of week 12. The kill criterion is a backstop, not a substitute for managing these risks.
Frequently asked questions
How long should an AI pilot project take?
Plan for roughly 90 days end-to-end: about four weeks of scoping, four weeks of build, three weeks of shadow-mode testing, and the final ten days for the go/no-go decision. Pilots that drag past 16 weeks are usually ones that skipped the kill criterion.
How much does an AI pilot cost?
A scoped pilot typically runs $25,000–$100,000 over 8–12 weeks for external consulting, plus internal time and infrastructure. We covered the full pricing landscape in our AI consultant buyer's guide.
What's the most common reason AI pilots fail?
Use-case selection, not technology. The most common pattern is choosing a problem where the data isn't accessible, the metric owner isn't named, or the cost of being wrong is too high for a first attempt.
Should the pilot be built in-house or by a consultant?
For a first pilot, a consultant or studio is usually faster — they've made the mistakes you're about to make. For pilot two and beyond, lean toward in-house with consultant support on net-new technical patterns. Always own the code and documentation, regardless of who builds it.
This checklist is the same one we use in our scoping engagements at Snow AI. If you'd like a working session against your specific use case, book a free 30-minute call.