Company · June 23, 2026

Introducing Keld: the world's first AI inference marketplace

By Dave Otten, Federico Enni and Doug Shore · 11 min read

Enterprise AI spending crossed a trillion dollars in 2025, and most teams couldn't tell you where it went. Many enterprises run five or more AI models in production, pay invoices from a dozen providers, and have no single view of what any of it costs, let alone whether the spend was efficient and necessary. That isn't an AI problem. It's a visibility-and-pricing problem, and both halves are fixable.

Static rate cards don't distinguish urgent work from work that can wait four hours. Fragmented provider accounts make spend invisible across teams. And compute sits idle while jobs queue at premium prices, one of the quietly expensive inefficiencies in modern infrastructure.

Today we're launching Keld: map every AI workflow your company runs, then route the ones that don't need a premium model or a real-time response to the best-value provider of the model you need.

Key takeaways

In 2026, global AI infrastructure spending is projected to reach $487 billion, up from $153B in 2024²
Average enterprise GPU utilization sits at just 5%: 95 cents of every dollar spent on AI compute earns nothing³
Set a deadline and a price ceiling on each job, and Keld matches it to the strongest qualifying model at the best price across 100+ providers, running it within your bounds, with a guaranteed backup if the first choice can't deliver in time
34% of enterprises can't adequately observe the external AI services they depend on⁴. Keld Atlas maps that spend, free, from day one

Global AI infrastructure spending trajectory. Source: IDC Worldwide AI Infrastructure Tracker, Q1 2026.

Enterprise AI costs are becoming a balance sheet problem

IDC projects that global AI infrastructure spending will reach $487 billion in 2026, more than triple the $153 billion spent in 2024 and on a path to exceed $1 trillion by 2029². Gartner puts total worldwide AI spending for 2026 at $2.59 trillion, a 47% year-over-year increase¹. The capital commitment is settled. The discipline to manage it isn't.

Three structural failures are driving the waste:

Static pricing. Most inference is billed at flat per-token rates built for a model's peak use case, but most enterprise AI jobs aren't peak-case work. Classifying a ticket, extracting metadata from a video, summarising a thread: on Keld, near-frontier models resolve all of these at the best available quality for a fraction of frontier pricing. Paying list price for them is the compute equivalent of overnighting a letter that could have gone standard post.

Idle compute. Cast AI's 2026 State of Kubernetes Optimization Report, covering production telemetry from more than 23,000 clusters, found that average enterprise GPU utilization sits at just 5%.³ Ninety-five cents of every dollar spent on AI infrastructure earns nothing. That's not a hardware shortage but a matching problem: spare capacity at one provider never reaches the job that could use it, because nothing pairs the two efficiently.

Average GPU utilization across enterprise AI workloads. Source: Cast AI, 2026 State of Kubernetes Optimization Report (n = 23,000+ production clusters).

Undiscovered models. The best model for a job is often not the famous one. Across the market, independent providers run models that are genuinely exceptional at a single thing, whether drafting copy, generating voice and audio, pulling structure out of video and documents, or analysing data, frequently because they were built or fine-tuned for exactly that. But most of these models, and the providers behind them, are effectively invisible. So work defaults to a handful of frontier names by reputation, teams overpay for general-purpose power they don't need, and the specialists that would do the job better, for less, never get the call.

These failures compound. The teams paying real-time rates for batch work are the same teams routing everything to a couple of frontier names, while most Model Providers' GPU fleets sit dark. The result: costs scaling faster than anyone projected, with no mechanism to surface the right model for each job and run it at the right price.

How Keld works

The unit of work on Keld isn't a prompt. It's a job. You define a job with three things: a deadline (the most time it can take), a ceiling (the most you'll pay), and the model or use case you need. Keld runs the job on the strongest model for the task at the best available price, within your bounds. If the first-choice provider can't deliver inside your window, the job automatically escalates to a guaranteed backup, so it always lands, on time, at or under your price.
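To make the job model concrete, here is a minimal Python sketch of a job that carries a deadline, a ceiling, and a use case, together with the escalation to a backup. It is an illustration under stated assumptions, not Keld's matching logic: the names Job, Offer and place, the 0-to-1 quality score, and the rule "strongest model first, cheapest price as the tie-breaker" are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    use_case: str            # e.g. "summarisation"
    deadline_s: int          # the most time the job may take, in seconds
    ceiling_per_mtok: float  # the most the buyer will pay, USD per million tokens


@dataclass(frozen=True)
class Offer:
    provider: str
    use_case: str
    quality: float           # hypothetical 0-1 score for this use case
    price_per_mtok: float    # USD per million tokens
    est_completion_s: int    # provider's estimated time to deliver


def place(job: Job, offers: list[Offer], backup: Offer) -> Offer:
    """Return the offer the job runs on, escalating to the backup if none qualifies."""
    qualifying = [
        o for o in offers
        if o.use_case == job.use_case
        and o.est_completion_s <= job.deadline_s
        and o.price_per_mtok <= job.ceiling_per_mtok
    ]
    if not qualifying:
        return backup
    # Assumed ordering: highest quality wins, lowest price breaks ties.
    return min(qualifying, key=lambda o: (-o.quality, o.price_per_mtok))


# Illustrative data only; providers and prices are made up.
offers = [
    Offer("alpha", "summarisation", 0.90, 1.20, 3_600),
    Offer("beta", "summarisation", 0.90, 0.80, 5_400),
    Offer("gamma", "summarisation", 0.95, 3.00, 600),
]
backup = Offer("direct-backup", "summarisation", 0.95, 3.00, 600)
job = Job("summarisation", deadline_s=7_200, ceiling_per_mtok=1.50)
```

With a two-hour window, place(job, offers, backup) selects "beta": it ties "alpha" on quality and undercuts it on price, while "gamma" is over the ceiling. Shrink the deadline to 30 minutes and no listed offer qualifies, so the job escalates to the backup.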

It all starts one level up. Before Keld optimises anything, Keld Atlas maps every AI workflow you run, so you can see spend by team, project, model, provider, and use case. With that map in hand, Atlas flags the workloads that don't need a premium model or a real-time response, and leaves the ones that do exactly where they are. You decide what moves; the map tells you where the efficiencies are.

Underneath, the Keld marketplace is a real-time, neutral exchange that continuously matches incoming jobs against live supply from a hundred-plus model and compute providers and clears each job at the best price inside its deadline. The set of sellers is open: any model or compute provider can list capacity into the market and compete for your work. You never have to think in those terms; you just send a job. But the engine underneath uses the same structure that prices equities, foreign exchange, and energy markets.

A marketplace fits inference because inference has exactly the conditions that make markets the right answer. Sellers are many and the pool keeps widening as more models run on more machines; supply shifts hour by hour as fleets fill and empty; and buyers' urgency varies enormously, since a live chatbot and an overnight batch job are not the same trade. A static rate card can't express any of that. It prices every request as if it were urgent and every provider as if it were the only one. The order book does the opposite: it discovers the real clearing price continuously, rewards whichever provider offers the best price right now for work that meets the bar, and lets price and deadline decide where each job runs, not a contract signed six months ago. Neutrality is what makes it hold together: Keld earns no margin steering you to a favoured provider, so the market clears on price, deadline and performance, never on who's paying us. And because it settles only on a job's requirements and price, the marketplace never sees your prompt data.
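A toy order book shows why the clearing price moves with supply and urgency. The sketch below is a simplification under assumptions of ours, not the published matching logic: the Ask fields, the OrderBook class, and the rule that one ask must take a job whole are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Ask:
    provider: str
    price_per_mtok: float  # USD per million tokens
    capacity_mtok: float   # spare capacity on offer, in millions of tokens
    turnaround_s: int      # time the provider needs to deliver


class OrderBook:
    def __init__(self) -> None:
        self.asks: list[Ask] = []

    def post(self, ask: Ask) -> None:
        self.asks.append(ask)

    def clear(self, size_mtok: float, deadline_s: int, ceiling_per_mtok: float):
        """Fill one job from the cheapest ask that can take it whole; None if nothing qualifies."""
        eligible = [
            a for a in self.asks
            if a.turnaround_s <= deadline_s
            and a.price_per_mtok <= ceiling_per_mtok
            and a.capacity_mtok >= size_mtok
        ]
        if not eligible:
            return None
        best = min(eligible, key=lambda a: a.price_per_mtok)
        best.capacity_mtok -= size_mtok  # supply shrinks as jobs clear
        return best.provider, best.price_per_mtok


def sample_book() -> OrderBook:
    # Illustrative asks only; providers and prices are made up.
    book = OrderBook()
    book.post(Ask("north", 0.60, 5.0, 7_200))
    book.post(Ask("south", 0.90, 50.0, 3_600))
    book.post(Ask("east", 2.50, 100.0, 60))
    return book


book = sample_book()
first = book.clear(size_mtok=4, deadline_s=14_400, ceiling_per_mtok=1.00)   # ("north", 0.6)
second = book.clear(size_mtok=4, deadline_s=14_400, ceiling_per_mtok=1.00)  # ("south", 0.9)
urgent = book.clear(size_mtok=1, deadline_s=120, ceiling_per_mtok=3.00)     # ("east", 2.5)
```

Two identical four-hour jobs clear at different prices because the first one uses up most of the cheapest capacity, and the urgent job pays more because only one seller can deliver in two minutes. A static rate card would have charged all three the same.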

Why the marketplace trades tokens, not GPUs

Most attempts to turn compute into a market operate one layer too low: they rent GPUs or instances by the hour. That's an infrastructure market, and it solves the wrong problem for an enterprise. You don't actually want a GPU. You want a summarised document, a transcribed call, a classified ticket, delivered at a quality bar, before a deadline, at a price. The unit you consume, and the unit your finance team sees on the invoice, is the token, not the GPU-hour.

An infrastructure market prices the machine. Keld prices the work the machine produces.

So the Keld marketplace clears at the inference and token level. The asset being priced is a unit of model output for a given use case, abstracted away from the hardware that produces it. That single choice has three consequences an infra-level market structurally can't match:

It's directly comparable. When the traded unit is "one million tokens of summarisation," every model and provider competes on the same yardstick, and the best-value option that clears your quality bar wins. You can't run that auction when one seller quotes A100-hours, another quotes H100-hours, and neither maps cleanly to the result you need.

It carries no operational burden. You never provision, schedule, or babysit a cluster. The provider keeps the infrastructure problem; you get the output. The risk of idle hardware sits with the party best placed to absorb it: the provider, through its own order book and micro-batching on Keld Trade, not you.

It optimises the thing that matters. Because a job names a model or a use case and a quality bar, the market can surface and route to the model that is genuinely best at that task, even a specialist most teams have never heard of, not just to cheaper silicon running the same expensive model. Optimising tokens optimises outcomes and COGS; optimising GPU-hours just moves where the meter runs.
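The comparability point comes down to simple arithmetic. A GPU-hour quote only becomes a token price once you know the throughput a given model achieves on that hardware, which is exactly the detail an hourly quote leaves out. The sketch below uses a hypothetical helper and made-up numbers (they are not benchmarks) to show that the cheaper GPU-hour can be the more expensive token.

```python
def usd_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly hardware price into a price per million output tokens."""
    tokens_per_hour = tokens_per_second * 3_600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Illustrative figures only: two hypothetical hardware quotes for the same model.
gpu_a = usd_per_million_tokens(gpu_hour_usd=2.00, tokens_per_second=500)    # about 1.11
gpu_b = usd_per_million_tokens(gpu_hour_usd=4.00, tokens_per_second=1_500)  # about 0.74
```

Here the hardware that costs twice as much per hour delivers tokens roughly a third cheaper. Quoting in tokens from the start removes that hidden conversion from the buyer's side of the trade.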

The Keld product suite, explained

The market Keld plugs into is large and fragmented: the global AI inference market was worth roughly $106 billion in 2025 and is projected to reach $255 billion by 2030⁷, yet it has no standard wire protocol, no neutral matching layer, and no unified view for the teams running the compute. Keld supplies all three, in a compact suite split by audience.

Enterprises: Keld Atlas (map spend, then run Deadline jobs) · Integrations (drop-in over the open IXP protocol)
Keld marketplace: neutral matching
AI Model Providers: Keld Trade (manage orders · micro-batching)

Each side does one thing well, and a single neutral matching engine connects them. For Enterprises, two products work together:

Keld Atlas is the control plane, and it does two jobs. First, it maps your spend across six accounting dimensions (Team, Project, Model, Provider, Job Category, and Capex/Opex) by ingesting telemetry from your existing AI use cases. The enterprises that currently can't see their external AI costs have a direct path to visibility here, free to start, with no routing changes and a mapped view of your AI spend within a day of connecting credentials. Second, Atlas runs Deadline jobs: once you can see which workloads tolerate latency, typically the majority, Atlas executes them asynchronously. Only the job's requirements (deadline, ceiling, use-case category, estimated token count) go to the matching engine; once a price is confirmed, the raw payload streams straight from your infrastructure to the matched provider. Atlas never buffers or stores your prompt data.
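As a rough picture of what a spend map does, the sketch below rolls usage records up along any combination of the six dimensions. The record layout, the field names, and the spend_by helper are assumptions made for illustration; they are not the Atlas telemetry schema.

```python
from collections import defaultdict

# The six accounting dimensions named above, as hypothetical field names.
DIMENSIONS = ("team", "project", "model", "provider", "job_category", "capex_opex")


def spend_by(records, *dims):
    """Total cost_usd grouped by one or more dimensions."""
    unknown = [d for d in dims if d not in DIMENSIONS]
    if unknown:
        raise ValueError(f"unknown dimension(s): {unknown}")
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dims)
        totals[key] += record["cost_usd"]
    return dict(totals)


# Illustrative records only; every name and amount is made up.
records = [
    {"team": "search", "project": "ranking", "model": "model-a", "provider": "provider-x",
     "job_category": "classification", "capex_opex": "opex", "cost_usd": 120.0},
    {"team": "search", "project": "ranking", "model": "model-b", "provider": "provider-y",
     "job_category": "summarisation", "capex_opex": "opex", "cost_usd": 80.0},
    {"team": "support", "project": "triage", "model": "model-a", "provider": "provider-x",
     "job_category": "classification", "capex_opex": "opex", "cost_usd": 40.5},
]

by_team = spend_by(records, "team")
by_provider_and_category = spend_by(records, "provider", "job_category")
```

Grouping by team gives search 200.0 and support 40.5; grouping by provider and job category shows that provider-x classification work accounts for 160.5 of the total. The same records answer both questions, which is the point of mapping once and slicing many ways.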

Integrations make Keld drop-in for the stack you already run. If you use LiteLLM, LangChain, the OpenAI SDK, or any standard HTTP client, you connect over the open Inference eXchange Protocol (IXP), Keld's open wire standard, without rewriting infrastructure. Point your existing code at Keld and the optimisation happens underneath it.

For AI Model Providers, there's Keld Trade, the platform providers use to list and manage their orders, offering spare GPU capacity at prices that absorb excess load without eroding premium direct-contract pricing. Any provider can join and sell into the marketplace. Trade also handles the micro-batching that paces matched jobs into a provider's fleet in real time, so demand hits hardware at exactly the utilisation level it can absorb.
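One simple way to picture micro-batching is as first-in, first-out pacing against a per-interval headroom. The sketch below is an assumption-laden illustration, not how Keld Trade is implemented: the pace function, the token-based headroom, and the fixed tick are all hypothetical.

```python
from collections import deque


def pace(jobs, headroom_tokens_per_tick):
    """Group (job_id, tokens) pairs into FIFO micro-batches that each fit one tick's headroom."""
    queue = deque(jobs)
    batches = []
    while queue:
        batch, used = [], 0
        # Keep admitting jobs in arrival order while they still fit this tick.
        while queue and used + queue[0][1] <= headroom_tokens_per_tick:
            job_id, tokens = queue.popleft()
            batch.append(job_id)
            used += tokens
        if not batch:
            # The job at the head is larger than the headroom and would block the queue forever.
            raise ValueError(f"job {queue[0][0]!r} exceeds per-tick headroom")
        batches.append(batch)
    return batches


# Illustrative matched jobs only: (job id, estimated tokens).
matched = [("a", 300), ("b", 500), ("c", 400), ("d", 200), ("e", 600)]
schedule = pace(matched, headroom_tokens_per_tick=1_000)
```

With 1,000 tokens of headroom per tick, the five jobs are released as three batches: a and b, then c and d, then e. Strict arrival order keeps the sketch short; a production scheduler would presumably also weigh each job's deadline.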

Why this matters in practice

Deloitte projects that inference workloads will represent two-thirds of all AI compute in 2026, up from one-third in 2023⁶. As inference becomes the dominant workload class, running all of it at real-time SLA pricing compounds into a measurable COGS drag for any team that isn't separating its latency tiers.

Not all inference needs to be fast. That's the insight most enterprise AI teams haven't yet operationalised.

Real-time inference (customer-facing chatbots, live coding assistants, interactive API calls) needs low latency and commands a premium for it. It's a minority of most enterprise AI workloads by volume, even if it gets the most architectural attention. The majority looks different: overnight enrichment pipelines, document classification at scale, batch report generation, data extraction across thousands of records. These jobs measure acceptable wait times in minutes or hours, yet they bill at the same per-token rates as live user sessions, because the infrastructure makes no distinction between them.

AI inference share of total AI compute. Source: Deloitte TMT Predictions 2026.

On Keld, a batch enrichment job submitted with a two-hour deadline runs on the best-value provider that fits within that window, not the fastest one. The cost differential between a real-time SLA and a multi-hour window for equivalent work is material. The job gets done; the bill shrinks. And because pricing is set by neutral matching rather than a negotiated rate card, every enterprise gets the same best price, with no undisclosed markups and no preferred-provider steering.

What changes structurally isn't only the price. It's the discipline. When latency and price are explicit parameters on every job, teams start reasoning about what their work actually requires, which is impossible when every API call looks identical on the invoice.

Built for enterprise control from day one

In 2025, a16z reported that 37% of enterprise CIOs run five or more AI models in production, up from 29% the prior year, with enterprise LLM spend growing from $4.5M to $7M over two years and 75% further growth expected⁵. Managing that portfolio across separate provider contracts, API keys, and billing statements isn't governance; it's exposure. Keld consolidates it into a single control plane.

Enterprise multi-model adoption. Source: a16z, How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025.

Every API credential, model connection, and routing rule in Keld lives in Atlas. Budget caps and quota limits are configurable at the team and project level, and spend is auditable across six dimensions from the moment you connect. There are no shared credentials, no provider accounts living in individual engineers' environments, no billing emails going to personal inboxes.
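A budget cap at two levels reduces to a check against both before any spend is recorded. The sketch below assumes a hypothetical Budget record and authorise function; Atlas's actual configuration model is not described here.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    cap_usd: float
    spent_usd: float = 0.0

    @property
    def remaining_usd(self) -> float:
        return self.cap_usd - self.spent_usd


def authorise(cost_usd: float, team: Budget, project: Budget) -> bool:
    """Approve a job only if it fits under both caps, then record the spend against each."""
    if cost_usd > team.remaining_usd or cost_usd > project.remaining_usd:
        return False
    team.spent_usd += cost_usd
    project.spent_usd += cost_usd
    return True


# Illustrative caps only.
team_budget = Budget(cap_usd=1_000.0)
project_budget = Budget(cap_usd=300.0)
first_ok = authorise(250.0, team_budget, project_budget)   # fits under both caps
second_ok = authorise(100.0, team_budget, project_budget)  # project has only 50.0 left
```

The second job is refused even though the team still has headroom, because the tighter project cap applies first, and a refused job leaves both balances unchanged.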

The IXP protocol is open source. Enterprises that want to inspect the wire format, build custom adapters, or connect existing tooling without using the SDK can do so. The spec is public and the matching logic is published. Nothing is hidden.

Multi-model access comes built in, not as a vendor-negotiated bundle. Because Keld connects every AI Model Provider to every enterprise through one neutral matching engine, you reach providers far beyond your current contracts without negotiating each one bilaterally, and the catalogue keeps widening as new providers join. Your team accesses the full model catalogue the day your Atlas workspace opens.

The recommended starting point is Keld Atlas: connect your credentials and you'll have a mapped view of your AI spend within a day, with no routing changes and no commitment to async execution. Start with the map; the decision about what to optimise follows from the data.

Read the IXP protocol spec →

The infrastructure for enterprise AI isn't short of capital. It's short of structure and discipline. Teams scale spend on static pricing while the best models for the job stay hidden, with no mechanism for distinguishing urgent work from work that can wait. That gap is the reason Keld exists.

Keld is live. Atlas is free to start. Map your spend today, and start running the jobs that don't need a premium model at a fraction of the price, with quality intact.

Who's building Keld

Keld is built by Dave, Doug and Federico, three technology-industry veterans who scaled JW Player to thousands of enterprise customers, serving the world's biggest publishers and broadcasters at the intersection of video streaming and advertising. But the three of us are only the start. Keld is a growing, AI-first team of industry experts who have spent their careers building platforms at high scale and enterprise quality, across distributed systems, infrastructure, data, security and developer experience. We started Keld to bring that hard-won discipline to the fragmented world of AI.

Frequently asked questions

Is Keld a router or a gateway?

Neither. Gateways like LiteLLM or OpenRouter route across a fixed set of static integrations and add a markup, and they do that job well. Keld is a different thing: it routes each job into a live marketplace of independent model providers, a dynamic and ever-changing set of companies and people competing on price and quality. A fixed set of integrations is exactly the kind of plumbing that AI will eventually optimise away; an open, competitive marketplace of independent sellers is not.

Does Keld see my prompt data?

No. Keld Atlas handles only the routing and financial metadata (deadline, ceiling price, use-case category, estimated token count), which is all the matching engine needs to find your best price. The raw prompt payload streams directly from your infrastructure to the matched provider. Keld never buffers, stores, or processes prompt content at any point.
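The separation can be pictured as splitting each job into two parts before anything leaves your infrastructure. This is a minimal sketch, assuming a hypothetical dictionary layout and a split_job helper of our own naming; it is not the IXP wire format.

```python
# The four metadata fields the matching engine needs, as hypothetical key names.
ENVELOPE_FIELDS = ("deadline_s", "ceiling_per_mtok", "use_case", "estimated_tokens")


def split_job(job: dict):
    """Separate the routing envelope (sent for matching) from the prompt payload (sent to the provider)."""
    envelope = {field: job[field] for field in ENVELOPE_FIELDS}
    payload = job["prompt"]
    return envelope, payload


# Illustrative job only.
job = {
    "deadline_s": 7_200,
    "ceiling_per_mtok": 1.50,
    "use_case": "summarisation",
    "estimated_tokens": 18_000,
    "prompt": "Summarise the attached quarterly board minutes.",
}
envelope, payload = split_job(job)
```

The envelope holds only the deadline, the ceiling, the use-case category, and a token estimate, so the matching step can price the job without access to the prompt text, which travels separately to whichever provider wins.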

What models can I run through Keld?

Keld launches with 100+ model endpoints across frontier and near-frontier providers, and the catalogue grows as new AI Model Providers join. In Atlas, you can restrict each job to specific models, providers, or use-case buckets, or set open policies that let Keld find the best match within your price and latency parameters. You're never locked to whatever was available the day you connected.

How do the ceiling price and deadline work?

Set the maximum price per token you'll pay and a deadline: "within two hours," "by end of day," whatever the job requires. Keld finds the best-priced option that meets both. If supply exists at or below your ceiling within your window, the job runs immediately. If no provider can deliver inside your window, the job escalates to your configured backup, typically a direct provider at list price, so it always lands on time.

Where do I start if I'm an enterprise?

Start with Keld Atlas. Connect your existing AI provider credentials (OpenAI, Anthropic, Mistral, whatever you're running) and optionally configure an OpenTelemetry (OTel) integration. Within a day you'll have a live map of AI spend broken down by team, project, model, provider, and job category. No routing changes, no architecture work required. See what you're spending first. The decision about which workloads to optimise follows naturally from the map.

Sources

1. Gartner, Gartner Forecasts Worldwide AI Spending to Grow 47% in 2026, May 2026, retrieved June 23, 2026, gartner.com
2. IDC, AI Infrastructure Spending Caps Historic Year at $90B in Q4 2025, Q1 2026, retrieved June 23, 2026, idc.com
3. Cast AI, 2026 State of Kubernetes Optimization Report, Q1 2026, retrieved June 23, 2026, cast.ai
4. Groundcover, The Observability Imperative (n=500, Atomik Research), May 26, 2026, retrieved June 23, 2026, businesswire.com
5. a16z, How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025, 2025, retrieved June 23, 2026, a16z.com
6. Deloitte, TMT Predictions 2026: Why AI's Next Phase Will Demand More Compute, Dec 2025, retrieved June 23, 2026, deloitte.com
7. MarketsandMarkets, AI Inference Market Size, Share & Growth 2025–2030, 2025, retrieved June 23, 2026, marketsandmarkets.com