You’re Paying for GPU Power You’ll Never Use: Sergey Speranskiy on Where AI Infrastructure Waste Really Happens


The AI infrastructure boom is often framed as a problem of scarcity. GPUs are expensive, hard to secure, and treated as the defining constraint in scaling modern workloads. But inside many enterprise environments, the more immediate problem is not access to compute. It is how much of that compute sits idle while budgets continue to rise.

Industry data cited in the discussion puts average GPU utilization in enterprise Kubernetes clusters at roughly 10 to 30 percent. For companies paying for on-prem capacity or cloud instances priced by the GPU-hour, that gap has direct financial consequences. Much of the waste does not come from dramatic failures. It comes from ordinary decisions that go unchallenged: conservative sizing, fragmented ownership, stale workload assumptions, and infrastructure designed around worst-case thinking.

Sergey Speranskiy sees that pattern from inside the platform layer. As a Senior Infrastructure Platform Engineer at Social Discovery Group, he works in the kind of large-scale Kubernetes environment where these issues become visible, including a cluster running more than 4,000 pods at the same time. He is not approaching the problem as a vendor or consultant. He is working inside the systems where efficiency, reliability, and cost discipline have to coexist in production.

In this conversation, he talks about where compute budgets quietly leak, why low utilization persists even in sophisticated teams, and what engineering organizations can do to reduce waste before buying another server.

1. You’ve worked inside a Kubernetes cluster running more than 4,000 pods. What does that actually look like day to day, and what keeps you up at night?

Sergey Speranskiy:

What that looks like day to day is less about staring at a dashboard and more about constantly managing trade-offs between reliability, resource efficiency, and operational simplicity.

In our case, the cluster had around 4,000 pods, multiple workload types, and historically a very fragmented node pool. So the day-to-day work was watching scheduling pressure, actual versus requested CPU and memory usage, node utilization, eviction risk, noisy-neighbor effects, and the blast radius of failures. I also care a lot about how evenly workloads are spread, whether limits and requests reflect reality, and whether we have hidden single points of failure.

What keeps me up at night is usually not one big dramatic outage. It is silent inefficiency or hidden fragility: overprovisioned workloads, bad bin-packing, one workload starving another, or infrastructure that looks redundant on paper but fails badly under stress. In large clusters, waste and reliability problems often come from the same place: poor resource discipline.

2. From where you sit, what usually signals that a team has a resource waste problem, and how often do they catch it early?

Sergey Speranskiy:

The clearest signal is when declared demand and real demand are far apart for a long time.

You see pods with large requests and limits, but actual usage is consistently much lower. Or you see too many nodes kept alive for peak assumptions that almost never happen. Another signal is when teams say they need more capacity, but scheduler pressure, utilization graphs, and historical usage do not support that claim.

Most teams do not catch it early unless they already have good visibility and review habits. Waste is usually discovered indirectly: rising bills, low node utilization, poor packing efficiency, or after someone is forced to do a cleanup. In practice, waste can live in a cluster for months because nothing is technically “broken.”
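The gap Speranskiy describes, declared demand far above real demand, is straightforward to detect once you have the numbers. A minimal sketch, with illustrative sample data and a hypothetical 3x threshold (in practice the usage figures would come from a metrics system such as Prometheus):

```python
# Flag workloads whose declared CPU requests far exceed observed usage.
# The sample rows and the 3x threshold are illustrative assumptions.
workloads = [
    # (name, requested millicores, p95 used millicores)
    ("checkout-api",   2000,  350),
    ("search-indexer", 4000, 3600),
    ("ml-batch",       8000,  900),
]

OVERPROVISION_RATIO = 3.0  # requested/used above this is suspicious

def find_overprovisioned(rows):
    """Return (name, ratio) for workloads reserving far more than they use."""
    flagged = []
    for name, requested, used in rows:
        ratio = requested / max(used, 1)  # guard against zero usage
        if ratio >= OVERPROVISION_RATIO:
            flagged.append((name, round(ratio, 1)))
    return flagged

print(find_overprovisioned(workloads))
```

Here checkout-api and ml-batch would be flagged while search-indexer, whose request tracks its real usage, passes, which matches the point above: waste hides in services that are not technically broken.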

3. Industry discussions often cite GPU utilization in the 10 to 30 percent range. Does that match what you’ve seen in practice, and why does utilization stay so low?

Sergey Speranskiy:

I would say the broader pattern definitely matches what I have seen with expensive compute in general: teams often provision for certainty, but run at much lower steady-state utilization.

The reasons are usually very practical. First, people size for peak traffic and leave the capacity allocated all the time. Second, requests are often set defensively, because nobody wants production incidents. Third, specialized workloads are harder to pack efficiently, so you get fragmentation. And fourth, once expensive infrastructure is allocated to one team or one service, it tends to get protected instead of continuously challenged.

So even when the hardware is technically available, the effective utilization can stay low because the operating model is conservative and ownership is fragmented.
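The financial weight of that low steady-state utilization is easy to put in rough numbers. A back-of-envelope sketch, where the hourly rate and fleet size are hypothetical figures, not ones from the interview:

```python
# Back-of-envelope cost of idle GPU capacity.
# The rate and fleet size below are illustrative assumptions.
GPU_HOURLY_RATE = 2.00   # USD per GPU-hour (hypothetical)
FLEET_SIZE = 32          # GPUs provisioned (hypothetical)
HOURS_PER_MONTH = 730

def monthly_idle_cost(utilization: float) -> float:
    """Dollars paid each month for capacity that does no useful work."""
    return FLEET_SIZE * HOURS_PER_MONTH * GPU_HOURLY_RATE * (1 - utilization)

for u in (0.10, 0.30, 0.70):
    print(f"{u:.0%} utilization -> ${monthly_idle_cost(u):,.0f}/month idle")
```

Even at the top of the 10 to 30 percent range cited earlier, most of the monthly bill in this toy model pays for idle silicon.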

4. Where does the money actually leak? If you had to point to the biggest sources of silent compute waste, what would they be?

Sergey Speranskiy:

The first leak is overprovisioned requests and limits. That directly drives unnecessary capacity, larger node pools, and poor scheduling density.

The second is fragmentation and inconsistent node strategy. If you have too many node types or too much special-casing, you reduce packing efficiency and keep extra capacity around just to satisfy edge cases.

The third is lack of lifecycle pressure on workloads. Services get introduced, traffic patterns change, but the original resource assumptions stay forever. So the cluster keeps paying for yesterday’s architecture.

In my experience, those three together can cost more than any one dramatic incident.

5. Is this mainly a technical problem, or is it really an organizational one?

Sergey Speranskiy:

It is both, but the organizational part is usually what allows the technical waste to persist.

Technically, you can measure usage, set requests and limits, apply LimitRanges, improve scheduling, and standardize node pools. But if nobody owns efficiency, teams will still ask for safety margins and keep them indefinitely. That is rational behavior from their point of view.

The best results come when platform engineering gives teams guardrails and visibility, and leadership makes efficiency a real engineering metric, not just a finance complaint.

6. You’re working in an environment where GPU resource management is an active challenge right now. What does that waste actually look like from the inside?

Sergey Speranskiy:

From the inside, the waste usually does not look dramatic. It looks ordinary. It looks like capacity reserved for workloads that do not fully use it, conservative sizing that was never revisited, and infrastructure built around worst-case assumptions.

In my own work, one example was translation. We replaced a managed API approach with a dedicated hosted LLM inference setup for cost control. But once you do that, the question becomes operational: how do you make sure the whole path is efficient and resilient, not just the model itself? We put nginx in front with basic auth, proxied to three model pods, and used two load balancer VMs with BGP anycast for resilience. That architecture solved reliability and control, but it also reinforces an important lesson: model cost is only one layer. Routing, concurrency, redundancy, and utilization discipline matter just as much.
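The nginx layer he describes, basic auth in front and round-robin proxying to the model pods, might look roughly like the following fragment. This is a hypothetical sketch: the names, addresses, and timeouts are illustrative, not the production configuration.

```nginx
# Illustrative sketch of an nginx front for hosted LLM inference:
# basic auth, round-robin across three model pods. All names and
# addresses are hypothetical.
upstream llm_pods {
    server 10.0.1.11:8000;
    server 10.0.1.12:8000;
    server 10.0.1.13:8000;
}

server {
    listen 443 ssl;
    server_name translate.internal.example;

    auth_basic           "LLM inference";
    auth_basic_user_file /etc/nginx/htpasswd;

    location / {
        proxy_pass http://llm_pods;
        proxy_read_timeout 120s;  # inference responses can be slow
    }
}
```

In the setup described, this layer would itself sit behind the two load balancer VMs announcing a shared address via BGP anycast, so no single proxy or pod is a hard dependency.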

At cluster level, the same pattern applies. Waste happens when resource policy is loose, workload sizing is stale, and node strategy evolves reactively instead of intentionally.

7. What is the lowest-effort, highest-impact thing an engineering team can do this week to stop bleeding GPU budget?

Sergey Speranskiy:

The fastest win is to compare actual usage against declared requests for your most expensive workloads and correct the obvious outliers.

Do not start with a huge platform migration. Start with measurement and cleanup. Identify the workloads that reserve the most expensive compute, check what they really use, and fix the top offenders first. In many environments, a small number of services drive a large share of waste.

That is basically the same principle we applied more broadly in Kubernetes: standardize, right-size, and remove guesswork. In our cluster, adding requests and limits everywhere and introducing LimitRanges created much better resource discipline immediately.
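A LimitRange of the kind mentioned above enforces that discipline at the namespace level: containers that declare nothing get sane defaults, and nothing can reserve beyond a ceiling. A minimal example, with illustrative values rather than the ones used in this cluster:

```yaml
# Minimal LimitRange sketch; the specific values are illustrative.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: team-namespace
spec:
  limits:
    - type: Container
      defaultRequest:    # applied when a container declares no request
        cpu: 100m
        memory: 128Mi
      default:           # applied when a container declares no limit
        cpu: 500m
        memory: 512Mi
      max:               # hard ceiling per container
        cpu: "4"
        memory: 8Gi
```

With defaults and ceilings in place, the scheduler has honest numbers to pack against, which is the precondition for the bin-packing and density improvements discussed earlier.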

8. A lot of companies still assume the answer is to buy more GPUs. When is that the wrong move?

Sergey Speranskiy:

It is the wrong call when the real problem is utilization, not shortage.

If your scheduling is inefficient, requests are inflated, or workloads are fragmented across too many node classes, buying more hardware just hides the problem temporarily. You are adding supply without fixing control.

I would only buy more after I was confident I understood actual usage, contention patterns, and packing efficiency. Otherwise you are scaling waste.

9. Cloud GPU prices have jumped sharply. Does that change how you think about where AI workloads should run?

Sergey Speranskiy:

Yes, it pushes the conversation away from hype and toward workload economics.

I do not think there is one universal answer. Some workloads belong in managed services because speed matters more than efficiency. Others justify dedicated infrastructure because they are predictable, steady, and expensive enough to optimize.

For me, the key question is not cloud versus on-prem as ideology. It is: do we understand the traffic pattern, the latency requirement, the operational burden, and the cost per useful unit of work? In our translation case, moving away from a managed API to a dedicated hosted inference setup made sense because we wanted more cost control and architectural control over the path.

10. If a CTO told you, “We’re spending a fortune on AI infrastructure and I don’t know if we’re getting value,” what would you ask first, and what do you think the industry still does not talk about enough?

Sergey Speranskiy:

My first question would be: what are you actually measuring, hardware allocated or useful work delivered?

Because if you cannot connect spend to throughput, latency, quality, and business value, then the infrastructure discussion is still too abstract. A team can be very proud of owning powerful hardware and still be operating inefficiently.

Once that is clear, the next step is straightforward: compare provisioned capacity, real utilization, and business output. That usually tells you very quickly whether you have a scaling problem or a discipline problem.
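Connecting spend to useful work can start as a single division. A sketch of the "cost per useful unit" framing, where all figures are hypothetical:

```python
# Cost per useful unit of work rather than cost per GPU allocated.
# Both inputs are illustrative assumptions.
monthly_infra_cost = 120_000.0   # USD, all-in (hypothetical)
requests_served = 40_000_000     # successful inference requests/month (hypothetical)

cost_per_1k_requests = monthly_infra_cost / (requests_served / 1000)
print(f"${cost_per_1k_requests:.2f} per 1,000 requests")
```

Tracked over time, this one ratio separates a scaling problem (cost per unit stable while volume grows) from a discipline problem (cost per unit rising while volume does not).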

A lot of the cost problem is not caused by the model. It is caused by the environment around the model.

People talk publicly about model size, GPU count, and benchmarks. Internally, the harder reality is that cost and reliability are heavily shaped by boring infrastructure details: resource policies, traffic patterns, queueing, scheduling, failover design, node standardization, and how much unused capacity the organization tolerates.

That is why I think platform discipline matters so much. In our case, one of the biggest wins was not a fancy AI optimization. It was infrastructure cleanup: reducing 169 heterogeneous worker nodes down to 25 standardized nodes, enforcing requests and limits, and improving failure isolation and placement strategy. That kind of work is not flashy, but it is where a lot of real savings come from.

What This Conversation Shows About AI Infrastructure Waste

AI infrastructure waste is rarely the result of one dramatic technical mistake. In Sergey Speranskiy’s account, it is usually the cumulative effect of ordinary engineering decisions left unchallenged over time: oversized requests, fragmented node strategies, stale workload assumptions, and a lack of ownership around efficiency.

The conversation also makes clear that the real cost problem often sits outside the model itself. Routing, failover, scheduling, workload policy, and platform discipline shape whether expensive compute turns into useful work or idle capacity. For teams under pressure to scale AI systems quickly, that distinction matters more than ever.

Sergey Speranskiy is a Senior Infrastructure Platform Engineer at Social Discovery Group. He works on large-scale Kubernetes infrastructure and is currently navigating GPU workloads at scale.
