A group of engineers known for pioneering “continuous batching,” a technique now widely used to speed up large language model inference, is arguing that the industry is still wasting vast amounts of expensive GPU capacity. According to a recent VentureBeat report titled “The team behind continuous batching says your idle GPUs should be running,” the researchers say most AI infrastructure leaves periods of idle compute that could be used for additional workloads without harming user-facing performance.
Continuous batching emerged from work on modern inference servers for large language models, particularly systems such as vLLM, which dynamically add incoming user requests into a batch while previous responses are still being generated. That approach contrasts with traditional static batching, where systems collect a group of requests and then process it to completion before accepting the next. Because generated sequences finish at different lengths, static batching leaves GPU batch slots sitting empty while the longest request is still producing tokens; by letting new inputs join mid-stream, continuous batching refills those slots immediately and significantly raises GPU throughput and utilization.
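The scheduling idea can be illustrated with a toy simulation. This is a minimal sketch of the concept only, not vLLM's actual scheduler or API; the `Request` class, the one-token-per-step loop, and the batch size are all illustrative assumptions. The key behavior is that finished requests free their batch slots immediately, and waiting requests are admitted mid-stream rather than after the whole batch drains.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    """Hypothetical generation request (illustrative, not vLLM's API)."""
    rid: int
    tokens_left: int                       # tokens this request still needs
    output: list = field(default_factory=list)

def continuous_batching(incoming, max_batch=4):
    """Toy scheduler: each step generates one token per in-flight request,
    then admits waiting requests into any freed slots mid-stream, instead
    of draining the whole batch first as static batching would."""
    waiting = deque(incoming)
    running, completed = [], []
    while waiting or running:
        # Admit new requests into free slots before the next decode step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One "decode step": every running request produces one token.
        for req in running:
            req.output.append(f"tok{len(req.output)}")
            req.tokens_left -= 1
        # Retire finished requests immediately, freeing slots for newcomers.
        completed.extend(r for r in running if r.tokens_left == 0)
        running = [r for r in running if r.tokens_left > 0]
    return completed

reqs = [Request(i, tokens_left=n) for i, n in enumerate([2, 5, 3, 1, 4, 2])]
done = continuous_batching(reqs, max_batch=3)
print([r.rid for r in done])  # → [0, 2, 3, 1, 5, 4]
```

Note how the short request (`rid=3`, one token) slips into the slot freed by request 0 and finishes long before the five-token request 1, rather than waiting for an entire batch cycle.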
Even with that advance, the engineers argue that large portions of GPU cycles remain underused. Real-world request traffic often fluctuates unpredictably, producing short windows where hardware sits partially idle. According to the VentureBeat report, the team believes infrastructure software should treat those gaps as opportunities to run background inference, lower-priority jobs, or precomputation tasks. The goal is to ensure that costly accelerators continue performing useful work whenever spare capacity appears.
The issue has major economic implications. Training and serving frontier models now requires clusters packed with specialized chips that can cost tens of thousands of dollars each. For AI companies operating inference services, small improvements in utilization can translate into substantial savings. A GPU that operates at 40% or 50% effective utilization, the engineers argue, represents a large opportunity for optimization through smarter scheduling and workload orchestration.
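The economics can be made concrete with back-of-the-envelope arithmetic. The dollar figures and fleet size below are purely hypothetical assumptions for illustration, not numbers from the report; the point is only that effective cost scales inversely with utilization.

```python
# Illustrative arithmetic only: hourly rate and fleet size are hypothetical.
hourly_rate = 4.00      # $/GPU-hour (assumed)
fleet = 1_000           # GPUs (assumed)

def cost_per_useful_hour(utilization):
    """Effective price paid per hour of *useful* GPU work."""
    return hourly_rate / utilization

for u in (0.40, 0.50, 0.80):
    print(f"{u:.0%} utilization -> ${cost_per_useful_hour(u):.2f} per useful GPU-hour")

# At 40% utilization, 60% of the fleet's hourly spend buys no useful work.
wasted_per_hour_at_40 = fleet * hourly_rate * (1 - 0.40)
print(f"Idle spend at 40% utilization: ${wasted_per_hour_at_40:,.0f}/hour")
```

Under these assumed numbers, lifting utilization from 40% to 80% halves the effective cost per unit of work: the same fleet delivers as much useful output as one twice its size running at the old rate.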
The concept also reflects a broader shift in how AI infrastructure is managed. As model deployment scales, companies are starting to treat inference more like high-performance computing or cloud resource scheduling, where workloads of different priorities compete for shared hardware. Continuous batching addressed one layer of inefficiency at the model-serving level; the next challenge, the engineers say, is coordinating tasks across clusters so that idle compute is automatically filled with useful activity.
As VentureBeat noted in its coverage, the push to keep GPUs constantly active reflects both a technical and economic reality of the current AI boom. With demand for compute continuing to outstrip supply, improving utilization rates may be one of the fastest ways for companies to extract more value from the infrastructure they already have—without waiting for the next generation of hardware to arrive.
