Essay · June 29, 2026

Understand your job types: batch vs real-time inference

By Federico Enni · 8 min read

Almost everyone who builds with AI today thinks in prompts. You write a request, you send it, you wait for the answer to come back, and you move on. The loop is so familiar that one of its assumptions has gone invisible: that every request needs its answer now, in the seconds you sit watching the cursor blink. For a chat window, that is true. For most of the AI work running inside companies, it is quietly false, and that false assumption sits behind one of the largest avoidable line items in the modern AI bill.

A prompt is something you wait for. A job is something you hand off. Most of your AI work is the second kind, dressed up as the first.

The clock you never chose

Real-time inference is expensive to provide. To answer in milliseconds, a provider has to keep capacity hot and waiting for you, idle between your requests, ready to respond the instant you ask. You pay for that readiness whether or not your work needs it. And a great deal of AI work does not. Overnight enrichment of a data warehouse, creating or analysing a video, classifying a backlog of support tickets, summarising yesterday's documents, generating embeddings, running an evaluation suite across thousands of test cases: none of these care whether the answer lands in two seconds or two hours. They inherited a real-time clock simply because the API you called offered only one speed.

This is the heart of the confusion between batch and real-time inference. Real-time is a guarantee about latency. Batch is the absence of that guarantee: you submit the work, the provider runs it when its fleet has room, and you collect the results inside a window rather than on a stopwatch. The output is identical. The same model produces the same answer. The only thing you give up is the promise of immediacy, and for a job that runs while you sleep, that promise was worth nothing to begin with.

What "now" actually costs

The providers themselves have already priced the difference, and they are not shy about it. Submit work to a batch endpoint and the major labs cut the bill in half. OpenAI's Batch API runs at 50% of the synchronous price and completes within a 24-hour window, often sooner.¹ Anthropic's Message Batches API is also 50% off standard rates, with most batches finishing in under an hour.² Google's batch mode on Vertex AI applies the same 50% discount across its Gemini models, run asynchronously with no latency guarantee.³ Three independent providers, one number. That is not a promotion. It is a measurement of how much of your synchronous bill was paying for speed you did not use.

The price of patience · published batch discounts, June 2026

50% off synchronous pricing on OpenAI's Batch API, completed within a 24-hour window

50% off standard rates on Anthropic's Message Batches API, most batches done in under an hour

50% off Gemini pricing in Google Vertex AI batch mode, run asynchronously with no latency SLA

Source: OpenAI, Anthropic and Google Vertex AI official pricing and batch documentation, retrieved June 29, 2026
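
To make the arithmetic concrete, here is a minimal sketch of what the discount does to a monthly bill. The 50% figure is the published discount cited above; the spend and the share of deferrable work are hypothetical inputs, not measurements, and the function name is ours.

```python
def blended_monthly_cost(sync_spend: float,
                         deferrable_share: float,
                         batch_discount: float = 0.50) -> float:
    """Monthly cost after moving the deferrable share of spend to batch pricing.

    sync_spend       -- what the workload costs today at synchronous rates
    deferrable_share -- fraction of that spend that could wait (0.0 to 1.0)
    batch_discount   -- discount on deferred work (0.50 per the published rates)
    """
    if not 0.0 <= deferrable_share <= 1.0:
        raise ValueError("deferrable_share must be between 0 and 1")
    if not 0.0 <= batch_discount <= 1.0:
        raise ValueError("batch_discount must be between 0 and 1")
    realtime_part = sync_spend * (1.0 - deferrable_share)
    batch_part = sync_spend * deferrable_share * (1.0 - batch_discount)
    return realtime_part + batch_part


# Hypothetical example: $10,000 a month, 60% of which could wait.
before = 10_000.0
after = blended_monthly_cost(before, deferrable_share=0.60)
print(f"before: ${before:,.2f}  after: ${after:,.2f}  saved: ${before - after:,.2f}")
```

The saving scales linearly with the deferrable share, which is why the share, rather than the discount, is the number worth finding out.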

Why can they afford to halve the price? Because deferring the work is worth more to them than the discount costs. A GPU answering one request at a time, on demand, spends most of its life waiting. When a provider can hold a queue of work and decide when to run it, it can pack many requests through the hardware together and keep the silicon busy. NVIDIA reports that this kind of in-flight batching roughly doubles throughput on real-world request loads on its H100 GPUs, before any other optimisation.⁴ More work through the same chip means a lower cost per job, and a provider that controls the timing can also push deferred work into the troughs when its fleet would otherwise sit idle.
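
The reasoning can be put in a few lines of arithmetic. This is a simplified sketch: the hourly price, throughput and utilisation figures are hypothetical, and the model ignores everything except hardware cost.

```python
def cost_per_request(gpu_cost_per_hour: float,
                     peak_requests_per_hour: float,
                     utilisation: float) -> float:
    """Hardware cost attributed to one request.

    utilisation is the fraction of each hour the GPU spends doing useful work,
    so the requests actually served are peak_requests_per_hour * utilisation.
    """
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    if peak_requests_per_hour <= 0:
        raise ValueError("peak_requests_per_hour must be positive")
    return gpu_cost_per_hour / (peak_requests_per_hour * utilisation)


# Hypothetical figures: a $4/hour GPU that is busy a quarter of the time.
on_demand = cost_per_request(4.0, peak_requests_per_hour=1_000, utilisation=0.25)

# Doubling throughput through batching halves the cost of each request.
batched = cost_per_request(4.0, peak_requests_per_hour=2_000, utilisation=0.25)

# Filling idle troughs with deferred work raises utilisation and cuts it again.
batched_and_busy = cost_per_request(4.0, peak_requests_per_hour=2_000, utilisation=0.75)

print(on_demand, batched, batched_and_busy)
```

Both levers, higher throughput and higher utilisation, are only available to a provider that is allowed to choose when the work runs.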

And idle is the normal state. In a 2026 study of tens of thousands of Kubernetes clusters, the cost-optimisation firm Cast AI found average GPU utilisation of just 5%.⁵ The number measures provisioned capacity rather than the efficiency of any single inference engine, but the lesson holds: the compute to run your deferrable work cheaply already exists, mostly unused, waiting for someone to fill it. Batch pricing is the mechanism that routes patient work into that idle capacity and shares the saving with you.

Most jobs are not urgent

If the discount is this large and this consistent, the obvious question is how much of your work could claim it. The honest answer is that there is no clean industry figure for the batch-versus-real-time split. But the early evidence is striking. In a 2025 study of twenty production AI-agent deployments, researchers found that 66% of them tolerated response times of minutes or longer, and three-quarters of the case studies could run asynchronously or on a relaxed schedule, some batching their requests hourly or overnight.⁶ It is a small sample, not a market census, and it deserves to be read as a signal rather than a law. But it points the same way common sense does: the share of AI work that genuinely needs a sub-second answer is far smaller than the share currently billed as if it did.

This matters more every quarter, because inference is becoming the workload. Deloitte projects that inference will grow from roughly a third of all AI compute in 2023, to half in 2025, to about two-thirds in 2026, against global AI data-centre spending in the range of $400 to $450 billion for the year.⁷ As inference takes over, paying real-time rates for work that could have waited stops being a rounding error and becomes a structural tax on everything you build.

From prompts to jobs

Capturing this is less a pricing trick than a change of mental frame. The prompt-based view sees a flat stream of requests, each one urgent because each one is a thing you are waiting for. The job-based view sees something different: a unit of work with a shape. A job has a purpose, a sequence of steps, a tolerance for delay, a budget you would not exceed, and a model or class of models that can do it well. "Summarise every contract uploaded today and flag the unusual clauses" is not a prompt. It is a job, and almost none of its character is captured by the single API call you would use to start it.
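
One way to picture that shape is as a small record. The sketch below is illustrative only: the class, its field names and the ten-second cutoff are assumptions made for this example, not part of any provider's API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Job:
    """A unit of work described by its requirements rather than by a prompt."""
    purpose: str              # what the work is for
    steps: tuple[str, ...]    # the sequence of steps it consists of
    max_wait: timedelta       # tolerance for delay
    ceiling_usd: float        # budget that must not be exceeded
    model_class: str          # model, or class of models, able to do it well

    def can_wait(self, realtime_cutoff: timedelta = timedelta(seconds=10)) -> bool:
        """True when the job tolerates more delay than an interactive request."""
        return self.max_wait > realtime_cutoff


# Hypothetical values for the contract example in the text.
contracts = Job(
    purpose="Summarise today's uploaded contracts and flag unusual clauses",
    steps=("collect uploads", "summarise each contract", "flag unusual clauses"),
    max_wait=timedelta(hours=8),
    ceiling_usd=25.0,
    model_class="long-context summarisation",
)

chat_reply = Job(
    purpose="Answer a user in a chat window",
    steps=("generate reply",),
    max_wait=timedelta(seconds=2),
    ceiling_usd=0.05,
    model_class="general chat",
)

print(contracts.can_wait(), chat_reply.can_wait())
```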

The moment you describe work as a job rather than a prompt, its real requirements become visible — and most of them turn out not to include speed.

Once you see work this way, the right execution model follows naturally. A job that can wait should be told it can wait, so it can be priced as patient work. A job made of several steps should be handed to something that will see it through to the end, retrying the steps that fail, rather than babysat call by call from your own code. A job worth a fixed amount and no more should carry that ceiling with it. None of this fits inside a prompt. All of it fits inside a job.

Hand the whole job to Keld

This is the shift Keld is built around. Instead of issuing prompts and managing the consequences yourself, you describe a job and delegate it. Keld runs it as a hosted durable function: it accepts the work, carries it through its full sequence of steps, survives failures and retries them, and returns the result when the job is done. You are no longer holding a connection open, polling for completion, or writing the orchestration glue that long-running AI work usually demands. The durable function is the runtime; the job is what you hand it.

Each job carries three plain terms. A deadline — the most time you are willing to wait. A ceiling price — the most you are willing to pay. And the model or use case you need. Tell Keld those three things and it runs your job on the best-value provider that can meet them, at or below your ceiling and inside your window, escalating to a backup if the first choice cannot deliver in time. A patient job with a generous deadline naturally clears against the cheap, deferred capacity that batch pricing exists to fill — without you ever wiring up a batch endpoint by hand.
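
The selection logic described here can be sketched in a few lines. This is not Keld's implementation or API: the `Offer` type, the `rank_offers` function and every figure below are hypothetical, and the sketch assumes each provider can quote a price and an expected wait up front.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Offer:
    """A hypothetical quote from one provider for one job."""
    provider: str
    price_usd: float
    expected_wait: timedelta


def rank_offers(offers: list[Offer],
                deadline: timedelta,
                ceiling_usd: float) -> list[Offer]:
    """Offers that meet both terms, cheapest first (faster wins a price tie).

    The first entry is the primary choice; later entries are backups.
    """
    eligible = [o for o in offers
                if o.expected_wait <= deadline and o.price_usd <= ceiling_usd]
    return sorted(eligible, key=lambda o: (o.price_usd, o.expected_wait))


offers = [
    Offer("realtime-a", price_usd=1.00, expected_wait=timedelta(seconds=5)),
    Offer("batch-b", price_usd=0.50, expected_wait=timedelta(hours=6)),
    Offer("batch-c", price_usd=0.60, expected_wait=timedelta(hours=1)),
]

# A patient job: 12-hour deadline, $0.80 ceiling.
patient = rank_offers(offers, deadline=timedelta(hours=12), ceiling_usd=0.80)

# The same job with a 2-hour deadline loses the cheapest option.
tighter = rank_offers(offers, deadline=timedelta(hours=2), ceiling_usd=0.80)

print([o.provider for o in patient], [o.provider for o in tighter])
```

Loosening the deadline never removes an option from the list, so patience can only lower the price or leave it unchanged.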

The cost of adopting this is meant to be near zero. Keld ships SDKs and ready-made plugins so the change in your code is minimal — in many cases a configuration change rather than a rewrite. And the harder question — which of your jobs are deadline-tolerant in the first place — is exactly what Keld Atlas is built to answer.

Knowing your job types is the whole game

You cannot put a deadline on a job you have not identified. Atlas maps your AI spend across the dimensions that actually decide how a job should run — by team, project, model, provider, and job category — so the picture stops being one undifferentiated bill and becomes a sorted inventory of work. From that map, the latency-tolerant jobs sort themselves out: the nightly pipelines, the bulk classifications, the enrichment runs, the evaluations. And because Atlas governs how that work is routed, the remapping of those jobs toward Keld can be orchestrated centrally, with no intervention in application code.
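
The kind of inventory described above can be approximated with a simple group-by. The sketch below is not Atlas: the record fields, the category names and the choice of which categories count as latency-tolerant are all hypothetical.

```python
from collections import defaultdict


def spend_by(records: list[dict], dimension: str) -> dict[str, float]:
    """Total cost per value of one dimension, such as team or job_category."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost_usd"]
    return dict(totals)


def deferrable_share(records: list[dict], tolerant_categories: set[str]) -> float:
    """Fraction of total spend that falls in latency-tolerant job categories."""
    total = sum(r["cost_usd"] for r in records)
    if total == 0:
        return 0.0
    deferrable = sum(r["cost_usd"] for r in records
                     if r["job_category"] in tolerant_categories)
    return deferrable / total


# Hypothetical usage records for one month.
records = [
    {"team": "support", "job_category": "chat", "cost_usd": 1500.0},
    {"team": "data", "job_category": "nightly_enrichment", "cost_usd": 1200.0},
    {"team": "ml", "job_category": "evaluation", "cost_usd": 800.0},
    {"team": "support", "job_category": "bulk_classification", "cost_usd": 500.0},
]
tolerant = {"nightly_enrichment", "evaluation", "bulk_classification"}

print(spend_by(records, "team"))
print(deferrable_share(records, tolerant))
```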

So the path is short, and it runs in one direction. Understand your job types. Notice how many of them never needed to be real-time. Give those a deadline, a ceiling, and a model, and let Keld run them as durable jobs against the cheapest capacity that can still hit the clock you set. The prompt was never the unit of work. The job was. Pricing it as one is how the gap between what AI costs to produce and what you pay finally closes in your favour.

Sources

1. OpenAI, "Batch API" developer guide: 50% discount and 24-hour completion window. Retrieved June 29, 2026, developers.openai.com
2. Anthropic, "Batch processing," Claude Platform Docs: 50% discount, most batches complete in under one hour. Retrieved June 29, 2026, platform.claude.com
3. Google, "Generative AI on Vertex AI pricing" and "Batch inference with Gemini": 50% batch discount, asynchronous with no latency SLA. Retrieved June 29, 2026, cloud.google.com
4. NVIDIA, "NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs": in-flight batching improves GPU usage and roughly doubles throughput. Retrieved June 29, 2026, developer.nvidia.com
5. Cast AI, "2026 State of Kubernetes Optimization Report": average GPU utilisation of 5% across tens of thousands of clusters. Retrieved June 29, 2026, cast.ai
6. Melissa Z. Pan, Ion Stoica, Matei Zaharia, et al., "Measuring Agents in Production," arXiv:2512.04123, December 2025: 66% of 20 production agent deployments tolerate response times of minutes or longer. Retrieved June 29, 2026, arxiv.org/abs/2512.04123
7. Deloitte, "TMT Predictions 2026: Why AI's next phase will demand more compute": inference share of AI compute rising from ~one-third (2023) to ~two-thirds (2026); global AI data-centre spend of $400–450B. Retrieved June 29, 2026, deloitte.com