This is the third and final part of a short series on what it actually costs to run AI in production. Part 1, “The Costs of AI You See, and the Ones You Don’t,” made the case that the bill scales at every layer, not just the model line everyone watches. Part 2 showed how to attribute that bill to the project, the feature, and the useful output that caused it.

This final post in the series is about lowering your costs. Specifically, it’s about the techniques I actually use, in real deployments, to bring a production AI bill down without impacting the results. The cost visibility I describe in Part 2 of this series is a prerequisite, because you can’t optimize a layer you can’t measure.

Before we talk about any of the techniques used to lower costs, we have to identify one overarching principle: results come first. The target is never the cheapest configuration, but rather the cheapest configuration that still produces high-quality results. A cheaper but weaker model that quietly degrades your output doesn’t save you anything — it just moves the cost off the API invoice and onto the people who now have to check and correct what the model got wrong. As the old saying goes, that’s penny wise, pound foolish.

The Model Layer: The Biggest Single Multiplier

In a typical AI-integrated project, most of the cost is at the LLM layer. There are several levers you can adjust here, and they work together.

You should always start by matching the model to the task. The rule is easy to state and easy to get wrong: use the cheapest model that does each step well enough, and add a cheap verification step downstream wherever a false positive is expensive. The phrase “well enough” carries most of the weight in that rule. You only push a step down to a cheaper model when the cheaper model genuinely holds the quality, because the moment the output gets worse you’ve simply relocated the cost to human review. When models are appropriately matched to the work, the impact on cost is striking. Our prospecting pipeline runs at about seventy-five cents per typical run with the right model chosen at each step. The same pipeline using a frontier model everywhere costs over a hundred dollars per run, for output that is not materially better.
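To make the arithmetic behind that comparison concrete, here is a minimal sketch of a per-step cost estimate. Everything in it is an assumption for illustration: the model names, the prices, the step names, and the token counts are hypothetical, and it does not reproduce the figures from my own pipeline.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's real rates.
PRICES = {
    "small": {"input": 0.10, "output": 0.40},
    "frontier": {"input": 3.00, "output": 15.00},
}


@dataclass
class Step:
    name: str
    model: str          # key into PRICES
    input_tokens: int   # tokens sent to the model for this step
    output_tokens: int  # tokens the model returns for this step


def step_cost(step: Step) -> float:
    """Cost of one step in USD, from token counts and per-million prices."""
    price = PRICES[step.model]
    total = step.input_tokens * price["input"] + step.output_tokens * price["output"]
    return total / 1_000_000


def run_cost(steps: list[Step]) -> float:
    """Cost of a whole pipeline run in USD."""
    return sum(step_cost(step) for step in steps)


# Illustrative pipeline: bulk steps on the small model, one step on the frontier model.
mixed = [
    Step("classify", "small", 200_000, 10_000),
    Step("extract", "small", 150_000, 30_000),
    Step("draft", "frontier", 20_000, 5_000),
]

# The same steps and volumes with the frontier model used everywhere.
all_frontier = [Step(s.name, "frontier", s.input_tokens, s.output_tokens) for s in mixed]

print(f"mixed: ${run_cost(mixed):.3f}  all frontier: ${run_cost(all_frontier):.3f}")
```

The point of writing it per step is that the gap comes from the high-volume steps, so those are the ones worth testing on a cheaper model first.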

Next, sequence the work so inexpensive models go first and the expensive ones only see what’s left. That implies looking for easy wins upfront to limit the work that needs to be done later. When you’re processing a batch of data, build the pipeline with that in mind. Structure the workflow so a first pass can capture the obvious classifications, filters, and easy tasks. This shrinks the volume of information that reaches the more costly tiers. The savings come from reducing how much work hits the expensive model, not from running a weaker model across the whole job and hoping it copes.
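One way to sketch that sequencing is a cascade in which each tier either answers or passes the item on. The two tiers below are hypothetical stand-ins (a keyword pass and a placeholder for an expensive model call), and the labels are invented for the example.

```python
from typing import Callable, Optional

# A tier takes an item and returns a label, or None when it is not confident.
Tier = Callable[[str], Optional[str]]


def cheap_rules(text: str) -> Optional[str]:
    """Stand-in for a rules pass or small model that handles only obvious cases."""
    lowered = text.lower()
    if "unsubscribe" in lowered:
        return "opt_out"
    if "invoice" in lowered:
        return "billing"
    return None  # not confident, so the next tier decides


def expensive_model(text: str) -> Optional[str]:
    """Stand-in for a call to a costly hosted model."""
    return "needs_review"


def cascade(items: list[str], tiers: list[Tier]):
    """Run each item through the tiers in order, stopping at the first answer.

    Returns the labels and the number of calls each tier received.
    """
    results = {}
    calls = [0] * len(tiers)
    for item in items:
        results[item] = None  # stays None if no tier is confident
        for index, tier in enumerate(tiers):
            calls[index] += 1
            label = tier(item)
            if label is not None:
                results[item] = label
                break
    return results, calls


items = ["Please unsubscribe me", "Invoice attached", "Can we talk pricing?"]
results, calls = cascade(items, [cheap_rules, expensive_model])
print(calls)  # calls per tier, cheapest tier first
```

Tracking the call count per tier is deliberate: it is the number that tells you whether the first pass is actually shrinking what reaches the expensive one.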

Then you need to consider where the models actually run, because most real deployments use a mix. If you have the right hardware, you can run local models to handle AI processing jobs at almost no marginal cost. If you don’t have appropriate hardware, third-party vendors (think RunPod, Lambda Labs, or Modal) offer specialized GPU hosting, and this may be a viable option in some cases. There’s also a third option emerging: companies such as NetAccess Solutions (one of my partners) that host private or open models for a flat monthly fee. At high volume, that flat fee can beat per-token pricing outright, provided one of the models they offer does your job well enough. It’s always worth pricing out, because a flat fee turns a usage meter into a fixed line you can budget against. For high-complexity tasks, for now you still need to look to hosted frontier models — but note that there are an increasing number of providers, and you need to shop around.
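Pricing out a flat fee against a usage meter is a short calculation. The sketch below assumes a single blended per-token price and uses made-up numbers; real pricing usually distinguishes input from output tokens, which the first function accounts for.

```python
def metered_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Monthly cost in USD under per-token pricing (prices are USD per million tokens)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def break_even_tokens(flat_fee: float, blended_price: float) -> float:
    """Monthly token volume at which a flat fee costs the same as metered pricing.

    blended_price is an assumed average USD price per million tokens.
    """
    return flat_fee / blended_price * 1_000_000


# Hypothetical figures: a 500 USD flat fee against a 2 USD per million blended rate.
print(break_even_tokens(500.0, 2.0))
# Hypothetical month: 400M input tokens at 1 USD and 100M output tokens at 4 USD.
print(metered_cost(400_000_000, 100_000_000, 1.0, 4.0))
```

Above the break-even volume the flat fee wins on price, but only if the hosted model does the job well enough, which the calculation cannot tell you.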

However, choosing a model and provider is not a “one and done” decision. Treat model selection as a standing review, because new models arrive almost weekly. Each release can reprice the older models you already use and may put a better model within reach at the same or a lower cost than what you’re running today. A light recurring check, just asking whether there’s now a model that gives the same result for less or a better result for the same money, captures real value for very little effort. Sometimes the win is entirely passive: a simple price drop on the model you were already using.

Finally, mind your prompts and your context. Shorter system prompts, retrieval that returns less context at higher relevance, and structured outputs that compress the response all add up. Output tokens cost roughly four to six times what input tokens cost, so trimming the shape of the response pays out of proportion to the effort it takes.

The Supporting APIs

The model isn’t the only meter running. The services around it — search, enrichment, scraping, and mailing — all bill by usage too, and taken together it’s not unusual for them to exceed the model costs.

The highest-dollar technique here is using a waterfall approach. This is a bit more work upfront but can dramatically reduce your costs. Call the cheapest enrichment provider first and only fall through to the more expensive ones when the cheap one comes back empty. On my prospecting pipeline, this single change does more for the bill than almost anything else because most lookups never need the premium provider at all.
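A waterfall of this kind can be sketched in a few lines. The provider names, prices, and lookup functions below are hypothetical stand-ins rather than any real vendor's API, and the sketch assumes every lookup is billed even when it returns nothing; some providers only bill on a hit, which makes the waterfall cheaper still.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Provider:
    name: str
    cost_per_lookup: float  # hypothetical USD price, assumed charged even on a miss
    lookup: Callable[[str], Optional[dict]]


def enrich(domain: str, providers: list[Provider]):
    """Try providers cheapest first and stop at the first non-empty record.

    Returns (record, provider name, total spent on this lookup).
    """
    spent = 0.0
    for provider in sorted(providers, key=lambda p: p.cost_per_lookup):
        spent += provider.cost_per_lookup
        record = provider.lookup(domain)
        if record:  # None or an empty dict counts as a miss
            return record, provider.name, spent
    return None, None, spent


# Stand-in providers: the cheap one knows a single domain, the premium one always answers.
CHEAP_DATA = {"example.com": {"employees": 40}}
cheap = Provider("cheap", 0.01, CHEAP_DATA.get)
premium = Provider("premium", 0.25, lambda domain: {"employees": 12})

print(enrich("example.com", [premium, cheap]))  # answered by the cheap provider
print(enrich("other.org", [premium, cheap]))    # falls through to the premium provider
```

Logging which provider answered each lookup is worth the extra field, because the hit rate of the cheap tier is what determines how much the waterfall saves.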

On search, the question I hear most often is whether to build your own. My honest answer is that it’s almost always a bad idea, with a few real exceptions: a very narrow corpus, very high volume, or a compliance constraint that rules out third-party providers. For most teams the right answer is choosing the appropriate tier and tuning the queries, not standing up and maintaining a scraper.

Caching and rate limiting round this section out. Semantic caching of model and retrieval responses pays off whenever the same questions recur, and on a customer-facing knowledge base the repeat rate is higher than people expect. You also need to put circuit breakers (i.e., budget limits) on anything that runs unattended, because a bug that fires a paid API in a tight loop can turn into a very expensive event.
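For the budget-limit half of that advice, a minimal sketch of a spend guard follows. The class and function names are hypothetical, the per-call cost is assumed to be known before the call is made, and a real deployment would persist the running total rather than hold it in memory.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the configured limit."""


class BudgetGuard:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        """Record a planned spend, refusing it if it would exceed the limit."""
        if self.spent_usd + amount_usd > self.limit_usd:
            raise BudgetExceeded(
                f"limit {self.limit_usd:.2f} USD reached after {self.spent_usd:.2f} USD"
            )
        self.spent_usd += amount_usd


def runaway_job(guard: BudgetGuard, cost_per_call: float) -> int:
    """Simulate a bug that calls a paid API in a tight loop.

    Returns how many calls were made before the guard stopped the job.
    """
    calls = 0
    try:
        while True:
            guard.charge(cost_per_call)  # charge before making the paid call
            calls += 1                   # the paid call itself would go here
    except BudgetExceeded:
        return calls


print(runaway_job(BudgetGuard(1.00), 0.25))
```

Charging before the call rather than after is the design choice that matters: the guard refuses the request that would break the budget instead of reporting it once the money is spent.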

The Mailing Layer

Email is surprisingly difficult, especially if your solution is intended to send bulk emails. If you send email at any scale, the provider with the lowest price might not provide the cheapest outcome overall. For instance, Amazon SES is inexpensive per message, while providers like SendGrid, Postmark, and Resend cost more but are often better at delivering your emails once you start sending real volume.

For email, cost control means choosing a provider and following best practices to ensure delivery. You must avoid domain-reputation issues, which will send all your emails into spam folders and take months to recover from. Use reputable providers, “warm up” your sending domains, and manage their reputation deliberately — that’s cost control in this layer.

Platform and Observability

The orchestration platform that ties all this together is mostly plumbing. Be careful when selecting platforms that charge per flow or per task, because those costs increase linearly with volume. For high-volume, mission-critical workloads, building your own platform or choosing a fixed-fee one is often the best way to control costs.

That being said, the platform must support full cost observability. As mentioned earlier, you can’t control what you can’t measure, so cost observability needs to be a first-class part of your platform (see Part 2 of this series). A team that can’t answer “what did the last hour or job run cost us” tends to find out on the invoice instead.

While raw data is important for monitoring costs, it’s also very effective to construct a metric to measure “cost per useful output.” This might be something like the cost per qualified lead or per answered question; if you can build this, it keeps attention focused on the cost/benefit impact of the solution rather than just the outright cost.
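As a rough illustration of the metric, the sketch below divides total spend by the count of useful outputs. The run figures are invented, and what counts as "useful" (a qualified lead, an answered question) is something each team has to define for itself.

```python
from typing import Optional


def cost_per_useful_output(total_cost_usd: float, useful_outputs: int) -> Optional[float]:
    """Total spend divided by the outputs that were actually useful.

    Returns None when nothing useful was produced, since the ratio is undefined.
    """
    if useful_outputs == 0:
        return None
    return total_cost_usd / useful_outputs


# Hypothetical runs: B has the smaller bill but produces far fewer qualified leads.
run_a = cost_per_useful_output(12.0, 40)
run_b = cost_per_useful_output(6.0, 10)
print(run_a, run_b)
```

In this made-up comparison the run with the lower bill is the more expensive one per qualified lead, which is exactly the situation the raw invoice hides.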

One Thing I’ve Stopped Doing

I want to draw a careful line here, because two things I’ve said could look like they contradict each other. Reviewing the model landscape constantly is not the same as constantly re-plumbing production to chase it. Monitoring is cheap, mostly passive, and always worth doing. But I’ve stopped aggressively provider-swapping in production for marginal savings, because replacing a working integration every quarter to chase a small per-token difference quickly eats the gain.

The rule that reconciles the two is straightforward: review continuously but only switch when the improvement is material. Material means a clearly better result at the same or lower cost, or the same result at a meaningfully lower cost. It doesn’t mean a few percent that disappears the moment you account for the engineering and testing time the swap demands. Choose well at the start, keep watching, and move only when the math isn’t close.
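One way to put a number on "material" is a payback period: how many months of savings it takes to recover the engineering and testing cost of the swap. The sketch below assumes the new option produces the same quality of result, and the six-month threshold is an arbitrary assumption, not a recommendation.

```python
from typing import Optional


def payback_months(switch_cost_usd: float, current_monthly_usd: float,
                   new_monthly_usd: float) -> Optional[float]:
    """Months of savings needed to recover the one-time cost of switching.

    Returns None when the new option does not save money at all.
    """
    monthly_saving = current_monthly_usd - new_monthly_usd
    if monthly_saving <= 0:
        return None
    return switch_cost_usd / monthly_saving


def worth_switching(switch_cost_usd: float, current_monthly_usd: float,
                    new_monthly_usd: float, max_payback_months: float = 6.0) -> bool:
    """True when the swap pays for itself within the chosen threshold."""
    months = payback_months(switch_cost_usd, current_monthly_usd, new_monthly_usd)
    return months is not None and months <= max_payback_months


# Hypothetical swap costing 6,000 USD of engineering and testing time.
print(worth_switching(6000.0, 2000.0, 1900.0))  # a 5% saving
print(worth_switching(6000.0, 2000.0, 500.0))   # a 75% saving
```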

Diagnose, See, Then Act

When you put all of this together, the overall direction is clear. Controlling production AI costs isn’t one big lever; it’s a dozen small ones spread across every layer. In every case, the overarching principle is quality of results. The target is the cheapest configuration that still produces the result you need. A really cheap pipeline that returns bad answers is the most expensive one you can run, once you count the human time spent cleaning up after it.

That closes this short series on the cost of AI. Part 1 told you where to look, Part 2 told you how to see it, and Part 3 has told you what to pull and how hard.