Kubernetes 1.35: Moving from Static to Fluid Infrastructure

The Kubernetes project has officially announced Kubernetes 1.35, a landmark release that introduces zero-downtime resource scaling for production workloads.

With in-place pod resource adjustment now generally available (GA), platform teams can update CPU and memory on running pods without restarts or service disruption. This marks a fundamental shift in cloud-native operations and a major step forward for AI training, edge deployments, and long-running stateful workloads.

What Makes Version 1.35 Distinct?

  • 🤖 Gang Scheduling (Preview): Atomic scheduling for distributed AI workloads.
  • ⚡ In-Place Resizing (GA): “Hot-swap” resources without pod restarts.
  • 🔐 Identity-Centric Security: Granular, verifiable pod identity controls for Zero Trust.
  • 🌱 Modernized Control Plane: Accelerated removal of legacy technical debt.

Key Feature Deep-Dive

1. Zero-Downtime Resource Resizing

Historically, changing CPU or memory required killing a pod and spinning up a replacement. This “restart tax” disrupted stateful applications and cleared local caches.

  • The 1.35 Advantage: You can now adjust resources on the fly. The pod remains active while the underlying node adjusts its resource allocation, making it ideal for high-pressure environments like databases and AI model training.
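A minimal sketch of what an on-the-fly adjustment looks like, assuming a pod named “web” with a container named “app” (both hypothetical). Recent kubectl versions expose the pod “resize” subresource for this; the local check below only validates the patch document itself.

```shell
# Hypothetical pod "web" / container "app"; in-place resize goes through
# the pod's "resize" subresource rather than a normal spec update.
PATCH='{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"2","memory":"1Gi"},"limits":{"cpu":"2","memory":"1Gi"}}}]}}'
# Against a live cluster this would be:
#   kubectl patch pod web --subresource resize --type strategic -p "$PATCH"
# Local sanity check that the patch targets the right container:
echo "$PATCH" | grep -q '"name":"app"' && echo "patch targets container app"
```

Because the change goes through a dedicated subresource, the scheduler and kubelet can negotiate the new allocation while the container keeps running.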

2. Native AI/ML Optimization

The introduction of Gang Scheduling addresses a historical pain point in distributed computing. Unlike “first-come, first-served” logic, Gang Scheduling ensures a group of related pods only starts if all members can be scheduled at once. This prevents “resource deadlocks” where idle pods sit waiting for others to join, wasting expensive GPU capacity.
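The all-or-nothing contract can be sketched as follows. The exact 1.35 API shape may differ; this example borrows the scheduler-plugins coscheduling PodGroup form to illustrate the idea, and all names are placeholders.

```shell
# Sketch only: uses the scheduler-plugins coscheduling PodGroup shape
# to illustrate gang semantics; the upstream 1.35 API may differ.
cat > podgroup.yaml <<'EOF'
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: llm-train            # hypothetical training job
spec:
  minMember: 8               # schedule all 8 workers atomically, or none
EOF
grep -q 'minMember: 8' podgroup.yaml && echo "gang size: 8"
```

With minMember set, a partial placement (say, 6 of 8 workers) never starts, so no GPU sits idle waiting for stragglers.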

3. Identity-Centric Security

Release 1.35 moves beyond generic service accounts toward granular, verifiable pod identities. This shifts security away from network location and toward workload identity, a core requirement for modern Zero Trust architectures.


Strategic Briefing: Puts & Takes

For technical leadership and DevOps teams, the 1.35 upgrade is an exercise in balancing efficiency gains against new security guardrails and runtime requirements. Review the official changelog carefully before scheduling the rollout.

| The “Puts” (Gains) | The “Takes” (Critical Risks) |
| --- | --- |
| Operational Continuity: Eliminates the “restart tax” on stateful and AI jobs. | 🚨 Breaking Change: The --pod-infra-container-image flag has been removed. If left in your Kubelet config, the Kubelet will fail to start. |
| GPU Optimization: Gang Scheduling prevents resource deadlocks. | 🚨 Cgroup v1 Deprecation: Nodes now fail to start on cgroup v1 by default (failCgroupV1: true). Systems must be on cgroup v2. |
| Zero-Trust Security: Verifiable identities for better compliance. | API Removal: The StorageVersionMigration v1alpha1 API is gone. You must remove these resources before upgrading or the upgrade will fail. |
| Modernized Control Plane: Removal of legacy technical debt. | Network Change: IPVS mode in kube-proxy is now deprecated. Begin planning a migration to nftables. |

Engineer’s Pre-Flight Checklist (The “Must-Haves”)

1. Mandatory Configuration Removals

  • [ ] Strip Kubelet Flags: You must remove the --pod-infra-container-image flag from your Kubelet configuration. In 1.35, this flag is no longer supported; leaving it in will cause the Kubelet to enter a CrashLoop on startup.
  • [ ] Purge Storage v1alpha1: Identify and delete any StorageVersionMigration v1alpha1 objects. These must be replaced by the v1beta1 version before the upgrade initiates.
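A quick audit sketch for both removals. The drop-in file content below is an example only; your node bootstrap paths will differ.

```shell
# Simulated kubelet env drop-in containing the removed flag (example).
cat > kubelet.env <<'EOF'
KUBELET_EXTRA_ARGS=--pod-infra-container-image=registry.k8s.io/pause:3.9
EOF
if grep -q -- '--pod-infra-container-image' kubelet.env; then
  echo "ACTION: strip flag before upgrading"
fi
# Against a live cluster, list any v1alpha1 migration objects to purge:
#   kubectl get storageversionmigrations.storagemigration.k8s.io -o name
```

Run the grep across your node images or configuration management repo, not just one host, so no stray drop-in survives the upgrade window.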

2. Environment & OS Validation

  • [ ] Cgroup v2 Enforcement: Verify your nodes are on cgroup v2. If you are stuck on cgroup v1, you must set failCgroupV1: false in your KubeletConfiguration before upgrading, or the node will refuse to start.
  • [ ] CRI Version Check: Confirm your container runtime (e.g., containerd 2.0+) is compatible with the new cgroup and in-place scaling logic.
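A node-level check for the cgroup item, assuming a Linux host: cgroup v2 mounts a unified hierarchy of filesystem type cgroup2fs at /sys/fs/cgroup.

```shell
# Detect cgroup mode on a node (Linux only).
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
case "$fstype" in
  cgroup2fs) echo "cgroup v2: ready for 1.35" ;;
  unknown)   echo "could not determine cgroup mode" ;;
  *)         echo "cgroup v1/hybrid ($fstype): migrate, or set the escape hatch" ;;
esac
# Escape hatch in KubeletConfiguration (delays migration, does not avoid it):
#   apiVersion: kubelet.config.k8s.io/v1beta1
#   kind: KubeletConfiguration
#   failCgroupV1: false
```

Treat failCgroupV1: false as a temporary bridge; the deprecation table above makes clear that cgroup v2 is the supported destination.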

3. RBAC & Access Control Adjustments

  • [ ] WebSocket Permissions: Because the API server has transitioned from SPDY to WebSockets for streaming, ensure users/service accounts have the create verb for pods/exec and pods/portforward.
  • [ ] Credential Audit: Review imagePullSecrets. 1.35 introduces stricter verification for pre-pulled images; pods may now fail to initialize if they lack explicit pull credentials for images already cached on the node.
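A Role covering the WebSocket permissions item might look like this. The name and namespace are placeholders; the grep only validates the generated manifest locally.

```shell
# Example Role granting the "create" verb the streaming endpoints need.
cat > exec-role.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-streamer         # hypothetical name
  namespace: dev             # hypothetical namespace
rules:
- apiGroups: [""]
  resources: ["pods/exec", "pods/portforward"]
  verbs: ["create"]
EOF
grep -q 'pods/exec' exec-role.yaml && echo "role covers exec"
```

Bind it with a RoleBinding to the service accounts that actually run kubectl exec or port-forward, rather than widening a cluster-wide role.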

4. Observability & Testing

  • [ ] Silence False-Positive Alerts: Update your monitoring to recognize that a CPU/Memory change without a Pod restart is a success, not a failure or a “flap.”
  • [ ] StatefulSet Live Test: In your sandbox, perform a live resource increase on a database pod. Verify the PID (Process ID) stays the same while the cgroup reflects the new limits.
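A procedure sketch for the live test, with hypothetical pod/container names ("db-0"/"db"); only the final arithmetic runs locally.

```shell
# Sandbox sketch: verify an in-place resize keeps the process alive
# while the cgroup limit changes. Run against a test cluster:
#   START_BEFORE=$(kubectl exec db-0 -- cut -d' ' -f22 /proc/1/stat)
#   kubectl patch pod db-0 --subresource resize --type strategic \
#     -p '{"spec":{"containers":[{"name":"db","resources":{"limits":{"memory":"2Gi"}}}]}}'
#   START_AFTER=$(kubectl exec db-0 -- cut -d' ' -f22 /proc/1/stat)
#   [ "$START_BEFORE" = "$START_AFTER" ] && echo "no restart"
#   kubectl exec db-0 -- cat /sys/fs/cgroup/memory.max
# Sanity check: the value memory.max should report for a 2Gi limit:
echo "expected memory.max: $((2 * 1024 * 1024 * 1024))"
```

Comparing the process start time in /proc/1/stat is a stricter signal than the PID alone, since PID 1 would also be 1 after a container restart.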

Technical FAQ: The Nuances of In-Place Scaling

How does the API handle a failure?

If a resize request fails (e.g., exceeds a LimitRange), the API rejects it immediately with no impact on the running pod. If the Node lacks capacity, the request is marked as Deferred—it will not kill the pod to move it to a larger node.
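The Deferred state surfaces as a pod condition. The condition name below follows the in-place resize design (PodResizePending with a reason of Deferred or Infeasible); verify it against your cluster version. The heredoc simulates the status shape for illustration.

```shell
# Live query (pod name "web" is hypothetical):
#   kubectl get pod web -o \
#     jsonpath='{.status.conditions[?(@.type=="PodResizePending")].reason}'
# Simulated status document:
cat > status.json <<'EOF'
{"status":{"conditions":[{"type":"PodResizePending","reason":"Deferred"}]}}
EOF
grep -o '"reason":"[A-Za-z]*"' status.json
```

A Deferred reason means the kubelet will retry as node capacity frees up; the pod itself is never touched.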

Are there scaling limitations?

CPU is “squishy” and scales easily. Memory is “hard.” If you try to scale memory down below what the process is currently using, the runtime may refuse the change to prevent an OOM (Out of Memory) event. Additionally, all containers in a pod (including sidecars like Istio) must support in-place scaling for the full pod footprint to adjust.

How does this affect Error Budgets & SLAs?

This is a major win for your SLA. By removing restarts, you eliminate the 5xx errors and “warm-up” latencies typically seen during rolling updates. We are introducing a new SLO for “Scaling Latency” to track how quickly the platform responds to these requests.


The Goal

A successful Platform Engineering team provides an infrastructure that “stays out of the way.” Kubernetes 1.35 significantly reduces the operational tax on your services, allowing you to right-size applications in real-time without fearing a production outage.

Next Steps:

We recommend beginning validation in staging environments immediately. Your dedicated DevOps or PlatOps lead will reach out to coordinate change management testing for your specific stateful workloads.


Executive Summary: Strategic Value & ROI of Kubernetes 1.35

Target Audience: C-Suite, Board of Directors, and Strategic Stakeholders

The release of Kubernetes 1.35 marks a pivotal shift from static to fluid infrastructure. For organizations invested in digital transformation and AI, this update provides a direct path to improved system availability and significant cloud cost avoidance.

The Bottom Line: Why This Matters to the Business

  • Eliminating Scaling Downtime: Historically, adjusting the “engine” of an application required stopping the car. Kubernetes 1.35 allows us to scale resources (CPU/Memory) while the application is live. This removes the “restart tax,” protecting our customer experience (SLAs) and reducing the risk of 5xx errors during traffic spikes.
  • Maximizing AI/ML Investment: High-performance GPU resources are expensive. The new Gang Scheduling capabilities ensure our AI training jobs only consume resources when the entire system is ready to run. This eliminates “resource leaking,” where idle processes burn budget while waiting for capacity, directly improving our Compute ROI.
  • Infrastructure Agility: In-place scaling allows our engineering teams to “right-size” workloads in real-time. This prevents over-provisioning (paying for more than we use) and under-provisioning (risking an outage), leading to a more cost-efficient cloud footprint.
  • Zero-Trust Security Maturity: By modernizing pod identity and removing legacy code, we are reducing our cybersecurity risk profile. This release reinforces our commitment to a “Zero Trust” architecture, ensuring that every workload is verified and secure.

Strategic ROI Impact

| Strategic Metric | Impact of K8s 1.35 | Business Value |
| --- | --- | --- |
| Availability | Zero-restart resource updates. | Higher uptime and brand trust. |
| Cloud Spend | Reduced GPU idle time & right-sizing. | Lower monthly OpEx. |
| Developer Velocity | Reduced operational “cleanup” and friction. | Faster time-to-market for new features. |
| Security Compliance | Verifiable workload identities. | Reduced risk of lateral-move exploits. |

Cloud Spend Projection Template

For Finance & Procurement teams to track realized savings post-upgrade.

To measure the financial impact of Kubernetes 1.35, we recommend tracking the following Key Performance Indicators (KPIs) over the first 90 days of implementation:

| Cost Center | Metric to Track | Projected Savings Goal |
| --- | --- | --- |
| AI/GPU Clusters | Reduction in “Idle GPU Time” via Gang Scheduling. | 15–25% reduction in wasted compute cycles. |
| SLA Penalties | Decrease in 5xx errors/outages during scaling events. | 90% reduction in scaling-related downtime. |
| Compute Over-provisioning | Difference between “Requested” vs “Actual” CPU usage. | 10–15% reduction in total cluster “slack” capacity. |
| Engineering OpEx | Hours spent on manual pod-restart troubleshooting. | 20% increase in team capacity for feature work. |

The Verdict: Kubernetes 1.35 is, at its core, an operational efficiency tool. By adopting this version, we build a more resilient, cost-aware, and secure foundation for our next generation of digital products.

