Essay · June 9, 2026

Why an inference marketplace

For most of computing history, compute was something you owned outright or rented at a fixed rate. AI inference broke that pattern. The same answer, the same resolved bug, the same summarised contract, the same classified row, can cost five dollars or five cents depending on which model produces it, when it runs, and how quickly you need it back. That gap is not a rounding error. It is the central economic fact of the inference market today, and it is quietly costing everyone who buys AI more than it should.

When a good has many sellers, volatile supply, and buyers with very different urgency, the efficient way to price it is a market, not a rate card.

The market nobody designed

The supply side of AI has exploded. As of early 2026, a single broker offered API access to more than 600 distinct models, and one major cloud catalogue listed over 11,000 [1]. A buyer no longer chooses between a handful of frontier labs. They face thousands of endpoints of wildly varying cost and capability, from incumbents, challengers, open-weight specialists, and resellers reselling one another's endpoints.

Abundance like that should drive prices toward efficiency. Instead it produced the opposite: a sprawling, opaque market that no one designed and no one clears. The same level of capability quietly trades at very different prices across different endpoints. A good that sells for different prices at the same quality is the textbook sign of a market that is not working, and the cost of that failure lands on the buyer.

What the chaos costs

Until recently this was an intuition rather than a number. Researchers at the Max Planck Institute for Intelligent Systems have now put a figure on it [1]. Their study uses SWE-bench Verified, a benchmark of 500 real GitHub issues, each paired with a test suite that checks whether a model's proposed fix actually works [2]. Because the result can be verified automatically, a buyer can compare providers on equal footing, exactly the setting in which prices should be sharp. They are not.

The findings show how much the fragmentation costs. Solving half of all the issues runs to roughly $10 with one mid-tier model but about $20 with another. Push to a harder target, a 75% solve rate, and the ranking flips: the second model now gets there for about $120, while the first needs more than $150 [1]. No single model is the cheap choice. Whichever one a buyer standardises on, they overpay on some large share of their work.

The researchers then showed how much was being left on the table. By sending each task to the model best suited to it, easy work to a lighter model and hard work to a stronger one, the same 75% result can be reached for about $80, well below the cost of any single model run on its own. Put differently, a buyer who commits to one model is overpaying by as much as 40% for the same outcome.
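The routing argument above can be made concrete with a toy calculation. The sketch below is a minimal illustration, not the study's method: the two model names, the four tasks, and every cost and outcome are hypothetical, and it assumes the router already knows which model solves which task. Costs are held in whole cents to avoid floating-point noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    cost_cents: int  # what one attempt at the task costs, in cents
    solved: bool     # whether the proposed fix passed the test suite


# Hypothetical per-task outcomes for four tasks; NOT figures from the study.
ATTEMPTS = {
    "light": [Attempt(2, True), Attempt(2, True), Attempt(5, False), Attempt(6, False)],
    "strong": [Attempt(20, True), Attempt(25, True), Attempt(30, True), Attempt(40, False)],
}


def single_model(name):
    """Total cost and solve count when every task goes to one model."""
    runs = ATTEMPTS[name]
    return sum(a.cost_cents for a in runs), sum(a.solved for a in runs)


def routed():
    """Send each task to the cheapest model that solves it.

    A task that no model solves still goes to the cheapest model, so the
    routed total is not flattered by skipping unsolvable work.
    """
    total_cost, total_solved = 0, 0
    for per_task in zip(*ATTEMPTS.values()):
        solvers = [a for a in per_task if a.solved]
        pick = min(solvers or per_task, key=lambda a: a.cost_cents)
        total_cost += pick.cost_cents
        total_solved += pick.solved
    return total_cost, total_solved


if __name__ == "__main__":
    print("light only: ", single_model("light"))
    print("strong only:", single_model("strong"))
    print("routed:     ", routed())
```

With these made-up numbers, the strong model alone solves three tasks for 115 cents, while routing solves the same three for 40 cents. The gap comes entirely from not paying the strong model's price on work the light model already handles.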

Measured on SWE-bench Verified · 500 GitHub issues
$80: the cost to reach a 75% solve rate by using the right model for each task, versus $120 or more than $150 for any single model alone
Up to 40%: how much a buyer overpays by standardising on one model across mixed work
600+: models on a single broker by early 2026, none of them the cheapest choice for every job
Figures from Olmedo, Schölkopf and Hardt, "Computational Arbitrage in AI Model Markets," 2026

Two things make this striking. The waste is large and consistent, not a quirk of one benchmark, and the cheaper path is easy to find. In the study, only a modest amount of testing was needed to work out which model to use for which kind of task. The savings are sitting in plain sight. What is missing is anything that routes the work to capture them.

List pricing leaves money on the table

Step away from the benchmark and the same logic governs everyday work. A frontier model billed per token is priced for its hardest job. But most jobs are not the hardest job. Summarising a support thread, classifying a row, drafting boilerplate: this is work a lighter, often purpose-built model handles at the same quality. Paying frontier rates for it is the AI equivalent of overnighting a letter that could have gone by post.

The same is true of time. A real-time guarantee is expensive to provide, and a great deal of AI work (overnight batches, enrichment, back-office pipelines) does not need it. Yet it is billed as if it does. This matters more every quarter. Deloitte projects that inference will grow from roughly a third of all AI compute in 2023 to two-thirds in 2026 [3]. As inference becomes the dominant workload, charging real-time rates for work that could have waited stops being a minor line item and becomes a structural drag on cost.

A per-token rate card cannot express any of this. It prices every request as if it were urgent and every provider as if it were the only one. The market underneath has many sellers, supply that shifts hour by hour as fleets fill and empty, and buyers whose urgency ranges from milliseconds to overnight. A static list describes none of those dimensions, which is precisely why the gap between list price and real value stays open.

The chaos will not fix itself

It would be comforting to assume the market sorts itself out as it matures, that more competitors mean tighter prices. The evidence points the other way. When the same researchers widened their study from two models to six, the inefficiency did not shrink. It grew, appearing across a far broader range of work [1]. Their conclusion is worth stating plainly: larger markets are not necessarily more efficient. Every new provider adds choice, and choice without structure adds confusion, not clarity.

This is the heart of the problem. The model catalogue will keep expanding, and each addition widens the distance between what inference costs to produce and what buyers actually pay. Nothing in the current structure closes that distance on its own. The matching that the researchers did by hand, finding the right model for each task, needs to happen continuously, for every job, automatically. That requires building the market on purpose rather than waiting for it to organise itself.

A market, built deliberately

That is what Keld is. It treats every job as an order with three simple terms: a ceiling price, a deadline, and the model or use case you need. You tell the marketplace the most you will pay and how long you can wait, and it runs your job on the best-value provider of the model you need, at or below your ceiling and inside your window. If the first-choice provider cannot deliver in time, the job escalates to a guaranteed backup, so it always lands.
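The order terms described above can be sketched as a small matching function. This is an illustrative sketch only, not Keld's implementation or API: the `Order`, `Quote`, and `plan` names, the fields, and the example providers are all hypothetical. It also assumes that "best value" means the lowest price among quotes that fit the ceiling and the deadline, and that the backup must fit both as well.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Order:
    model: str          # the model (or use case) the buyer needs
    ceiling_usd: float  # the most the buyer will pay
    deadline_s: float   # how long the buyer can wait, in seconds


@dataclass(frozen=True)
class Quote:
    provider: str
    model: str
    price_usd: float
    eta_s: float              # provider's estimated completion time
    guaranteed: bool = False  # True if the provider commits to that time


def plan(order: Order, quotes: List[Quote]) -> Tuple[Optional[Quote], Optional[Quote]]:
    """Return (first_choice, backup) for an order; either may be None."""
    # Keep only quotes for the right model, at or below the ceiling,
    # and inside the buyer's window.
    eligible = [
        q for q in quotes
        if q.model == order.model
        and q.price_usd <= order.ceiling_usd
        and q.eta_s <= order.deadline_s
    ]
    # Assumption: the best-value provider is the cheapest eligible one.
    first = min(eligible, key=lambda q: q.price_usd, default=None)
    # Assumption: the backup is the cheapest other eligible quote that
    # carries a delivery guarantee.
    backups = [q for q in eligible if q.guaranteed and q is not first]
    backup = min(backups, key=lambda q: q.price_usd, default=None)
    return first, backup


order = Order(model="model-a", ceiling_usd=0.50, deadline_s=3600)
quotes = [
    Quote("p1", "model-a", 0.10, 1800),
    Quote("p2", "model-a", 0.30, 600, guaranteed=True),
    Quote("p3", "model-a", 0.05, 7200),                  # misses the deadline
    Quote("p4", "model-b", 0.01, 60),                    # wrong model
    Quote("p5", "model-a", 0.90, 60, guaranteed=True),   # above the ceiling
]
first, backup = plan(order, quotes)
```

Here `first` is the quote from `p1` and `backup` is the quote from `p2`. A fuller version would also check that the backup can still finish in whatever time remains after the first choice fails, which this sketch leaves out.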

The effect is to give every buyer the routing the researchers had to construct by hand, running continuously and clearing each job at its honest price. And it is not only buyers who gain. Without a matching layer, an inference market splits into rigid tiers, one model owning the cheap end, another the expensive end. With a matching layer, those tiers dissolve. Lighter and stronger models alike earn across the whole range of work, because each is used wherever it is genuinely the best value [1]. A provider does not need the best-performing model to win business; being the right value for a given job is enough. For an open marketplace that is the point: more providers can compete for your work, which widens supply and pushes prices down further over time.

Neutral by design

A market only works if it favours neither side. Keld does not steer demand toward a preferred provider or take a position on who should win. It matches on price, deadline and performance, and settles each job against an auditable record. Because it clears only on a job's requirements and its price, the marketplace never needs to see your prompt data. Neutrality is not a feature added for comfort. It is what guarantees the price you pay is the real clearing price, with nobody's thumb on the scale. Neutrality is the product.

The inference market is young, abundant, and chaotic, and that chaos is expensive for everyone buying AI today. It will not tidy itself. What the market needs is structure: a neutral place where every job finds the right model at the right price, on time. That is what we are building.

You do not have to change your stack to start. Map what AI costs you today with Keld Atlas, then send the work that can wait to the best model for the job, and watch the gap close in your favour.

Sources
  1. Ricardo Olmedo, Bernhard Schölkopf & Moritz Hardt, "Computational Arbitrage in AI Model Markets," Max Planck Institute for Intelligent Systems, arXiv:2603.22404, 2026 — arxiv.org/abs/2603.22404
  2. Carlos E. Jimenez, John Yang, et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", ICLR 2024 — openreview.net
  3. Deloitte, TMT Predictions 2026: Why AI's Next Phase Will Demand More Compute, Dec 2025, retrieved June 9, 2026 — deloitte.com
