Some of the executives I talk to have moved past the question of “should we use AI.” They have followed the advice I have been giving for a while now: pick a small project, define what a successful outcome looks like in metrics they can actually measure, build the thing, and let it run in production long enough to learn from it.
The obvious goal of a pilot is to solve the problem the pilot was scoped against. On every pilot I have helped run, three equally important goals sit alongside that one. The pilot starts the mindset and culture shift that any production AI deployment requires, and it builds organizational knowledge about how AI affects the work and the people doing it. Just as importantly, it produces a sharper view of where AI actually fits in the specific workflows the business runs every day, which is almost never identical to where AI fit in the slide deck that authorized the pilot in the first place.
An important part of the knowledge a pilot builds is a working mental model of what an AI solution actually costs. Leaders who have run a pilot through to production come out the other side with a much sharper view of the cost-benefit math than the ones who have not. Starting out, many leaders do not have that view yet. The typical mental model is that you budget for an AI model and maybe a subscription-based service or two like a coding agent. Real production AI solutions usually require more services than that, sometimes many more. In the solutions I have built, it is not unusual to depend on third-party services that provide functionality like search, contact enrichment, scraping and page-rendering, and bulk emailing.
This post is the first of three on the cost of production AI. It walks through what those systems actually cost in May of 2026, looking at the third-party services that surround a real deployment, the model usage at the core, and the coding-assistant tier that hides volume-based pricing behind a subscription wrapper. The familiar AI subscriptions like ChatGPT for Teams and the orchestration platforms like n8n or Zapier scale with headcount and behave like other SaaS subscriptions; finance teams already track them well, so I am setting those aside. Part 2 covers cost visibility, which is the operational discipline that makes the bill legible enough to manage. Part 3 covers the practical techniques I use to keep the bill under control. The strategic-level coverage you have seen elsewhere groups AI cost under headings like infrastructure, integration, and human capital. This is the operational view of the same territory, with named providers, published prices, and the things to watch out for.
Third-Party API Services
The first place the bill grows beyond the obvious is in the third-party services that surround an AI deployment. Most are billed by usage and will scale directly with job volumes.
Search APIs (Brave, Bing, Serper, Tavily) run roughly $3 to $5 per 1,000 queries on the lower-cost providers, with premium indexes costing more. They are required because any AI pipeline reaching the open web for new information needs a search layer, and the consumer search engines (Google, Bing) block programmatic use of their regular search pages. Quality does vary across search APIs, with the premium services giving markedly better results. That matters, because if your initial search returns poor-quality results, the rest of the pipeline is processing “garbage” as fast and as expensively as it knows how.
Contact enrichment services (Apollo, Hunter, Clay) typically require you to purchase a base subscription plus usage credits, with credits charged at a rate ranging from $0.10 to $0.50 per enriched contact. They are used because it is difficult, at scale, to find good contact information. If enrichment is required, it typically needs to be a separate, paid step. Hit rates vary dramatically by industry and target type, which means the credit math you budgeted at signup is often not the credit math you see in production.
Scraping and page-rendering services (Apify, Browserless, Jina.AI Reader) typically charge per page or per request, with Jina.AI Reader around $0.02 to $0.04 per page on the paid tier. The reason they’re needed is that a growing number of websites refuse to render for “headless” browsers. Without a rendering layer, the website-analysis or qualification step in any real pipeline returns nothing for a meaningful share of accounts, and pipelines designed before this became common often underestimate how much work the rendering layer is doing.
Mailing infrastructure (SendGrid, Postmark, Resend, AWS SES) runs $20 to $100 per month for typical small-business volumes, with AWS SES closer to $0.10 per 1,000 emails on the simple end. These are used because sending bulk email often leads to entire domains being blocked or blacklisted. The most expensive failure mode in any outbound AI workflow is a deliverability incident, and a burned sending domain costs months of recovery work, lost meeting volume, and lost goodwill. Picking the right provider and following their warm-up and reputation-management guidance is more important than the price.
The structural pattern across all four is the same: costs are directly related to volume. The model bill gets the attention because tokens are unfamiliar units, but the billing meter underneath each of these services is doing exactly the same kind of work. On a real production pipeline, the combined bill for this category usually exceeds the model bill outright.
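To make the volume relationship concrete, here is a minimal sketch that turns the per-unit rates quoted above into a monthly range. The monthly volumes are hypothetical and chosen only for illustration. The sketch also ignores base subscriptions and flat monthly plans, and it prices email at the AWS SES rate only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeteredService:
    unit: str
    price_low: float   # dollars per unit, low end of the quoted range
    price_high: float  # dollars per unit, high end of the quoted range


# Per-unit rates derived from the ranges quoted in this post.
SERVICES = {
    "search": MeteredService("query", 3 / 1000, 5 / 1000),
    "enrichment": MeteredService("enriched contact", 0.10, 0.50),
    "rendering": MeteredService("rendered page", 0.02, 0.04),
    "email": MeteredService("email via AWS SES", 0.10 / 1000, 0.10 / 1000),
}


def monthly_cost_range(volumes: dict[str, int]) -> tuple[float, float]:
    """Return the (low, high) usage cost in dollars for the given unit volumes."""
    low = sum(SERVICES[name].price_low * count for name, count in volumes.items())
    high = sum(SERVICES[name].price_high * count for name, count in volumes.items())
    return low, high


# Hypothetical monthly volumes, for illustration only.
volumes = {"search": 20_000, "enrichment": 2_000, "rendering": 10_000, "email": 5_000}
low, high = monthly_cost_range(volumes)
print(f"Usage-based services: ${low:,.2f} to ${high:,.2f} per month")
```

Doubling every volume doubles both ends of the range, which is the point: nothing in this category is a fixed cost.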
Model Usage
This is the line most people already associate with AI cost, and the one where the unit (tokens) is least intuitive. A token is roughly three-quarters of a word. Every model call has an input cost for the prompt and an output cost for the response, and output is typically four to six times more expensive than input.
Published pricing as of May 2026 falls into three rough tiers. Frontier models (GPT-5.5 Pro, Claude Opus class) run around $30 input and $180 output per million tokens. Mid-tier frontier models (Claude Sonnet and equivalents) run around $3 input and $15 output. Mid-low tier hosted models run from $0.20 to $1.50 input and $0.50 to $5 output. Local open-weights models cost nothing per call, but do require an investment in appropriate hardware.
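The arithmetic behind a single model call is simple enough to sketch. The prices below are the tier figures quoted above, with the mid-low tier taken at the top of its range. The token counts are hypothetical and only there to show the calculation.

```python
# Dollars per million tokens as (input, output), from the tiers quoted above.
# The mid-low tier uses the top of its quoted range.
PRICES_PER_MILLION = {
    "frontier": (30.00, 180.00),
    "mid": (3.00, 15.00),
    "mid_low": (1.50, 5.00),
}


def call_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one model call: input and output are priced separately."""
    price_in, price_out = PRICES_PER_MILLION[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical call: a 2,000-token prompt and a 500-token response.
for tier in PRICES_PER_MILLION:
    print(f"{tier:>8}: ${call_cost(tier, 2_000, 500):.4f} per call")
```

On the frontier tier in this example, the 500 output tokens cost more than the 2,000 input tokens, which is why response length deserves as much attention as prompt length.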
Hosted API providers (OpenAI, Anthropic, Google, plus aggregators like Groq, Together, and OpenRouter) are the default for variable workloads and frontier access, with pricing that can shift quarterly and data that leaves your environment unless zero-retention terms are negotiated in writing. Specialized GPU hosting (RunPod, Lambda Labs, Modal, Replicate) gives you predictable per-hour cost on dedicated capacity, which becomes the right answer when a steady workload is large enough that the hosted-API bill has become the dominant line item. Fully local models on hardware you already own cost nothing per call and keep the data in your environment, but the capability ceiling is lower than the frontier and the engineering work of running and updating them is real.
The mistake that costs the most, regardless of hosting path, is defaulting to a frontier model at every step. Frontier is roughly ten times the price of a mid-tier model and twenty to a hundred times the price of a small hosted model in the mid-low tier. If the task does not need frontier capability, that ratio compounds across the volume, and the hosting choice multiplies the compounding.
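A rough sketch of how that compounding plays out across a multi-step pipeline follows. The step names and token counts are hypothetical and do not describe any pipeline in this post. The prices are the tier figures quoted in this post, with the mid-low tier at the bottom of its range. The local tier is priced at zero per call, which leaves out hardware and engineering cost.

```python
# Dollars per million tokens as (input, output), from the tiers quoted in this post.
# "local" is zero per call; hardware and engineering cost are not modeled here.
PRICES_PER_MILLION = {
    "frontier": (30.00, 180.00),
    "mid": (3.00, 15.00),
    "mid_low": (0.20, 0.50),
    "local": (0.00, 0.00),
}

# Hypothetical pipeline: (step name, input tokens, output tokens) per run.
STEPS = [
    ("filter", 4_000, 200),
    ("qualify", 6_000, 500),
    ("draft", 2_000, 800),
]


def run_cost(tier_by_step: dict[str, str]) -> float:
    """Dollar cost of one pipeline run, given the tier assigned to each step."""
    total = 0.0
    for step, tokens_in, tokens_out in STEPS:
        price_in, price_out = PRICES_PER_MILLION[tier_by_step[step]]
        total += (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return total


frontier_everywhere = {step: "frontier" for step, _, _ in STEPS}
mixed = {"filter": "local", "qualify": "mid_low", "draft": "mid"}

runs = 10_000  # hypothetical monthly volume
for label, assignment in [("frontier everywhere", frontier_everywhere), ("mixed tiers", mixed)]:
    per_run = run_cost(assignment)
    print(f"{label:>20}: ${per_run:.4f} per run, ${per_run * runs:,.2f} per {runs:,} runs")
```

The gap per run looks small in absolute terms. Multiplied by volume, it is the difference between a rounding error and a line item.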
The Coding-Assistant Surprise
The model bill does not only show up on the model line. It also shows up wearing a subscription wrapper, and the place it shows up most loudly is the coding assistants.
Entry-level tiers (GitHub Copilot, Cursor base, Claude Code starter) sit around $20 to $40 per seat per month, which is the number most coverage cites. The real production tiers are several times that. Cursor’s higher-usage plans run $60 to $200 per seat per month. Claude Code’s higher-usage plans run $100 to $300, and it is not unusual to be pushed toward the higher-cost plans over time. These plans cost what they cost because the included token allowance scales with the plan, and the underlying activity is the same model usage the rest of this post has been describing, billed up front rather than after the fact. The subscription wrapper makes the bill feel predictable, but the mechanism underneath is a form of volume-based pricing.
The question worth asking for every AI subscription on the corporate card is the same one. Does the price scale with headcount, or with usage? If it scales with usage, the line item belongs in the volume conversation, regardless of how the invoice is shaped.
Two Real Numbers
When I gave a tech talk earlier this month, I used Docora as one of the worked examples. Running 1,000 queries on a frontier-everywhere path costs just over $400. On a hybrid path, with the embedding and retrieval steps handled by small local models and only the answer generation running on a mid-tier hosted model, the same 1,000 queries cost $14. And most importantly, the output quality using the hybrid model was virtually indistinguishable from the output quality using the frontier models.
The prospecting pipeline I have written about over the last two weeks is the other example. The deployed mix uses a small local model on the snippet-filter step, mid-low tier hosted models on the website-qualification and contact-enrichment steps, and a mid-high tier model on the email personalization step. Using this hybrid mix, most pipeline executions cost less than $1, while running the same pipeline with a frontier-tier model at every step would cost over $100 per run, again for output I could not tell apart.
Closing
The model bill scales with usage, and most leaders are watching it, but the rest of the stack also scales with usage and needs the same level of attention. Search, enrichment, scraping, mailing, and the heavier coding-assistant tiers all run the same kind of meter the model line does, just under different unit names.
In part 2 of this series next week, I’m going to look at cost visibility, which sounds dry but is the part that is easy to quietly fail at. Then part 3, in two weeks, will cover the practical techniques I use to bring the bill down at every layer of the stack.