Exclusive Interview with 2025 Speaker Prashanth Thinakaran

Prashanth Thinakaran, Distinguished AI Infrastructure Engineer at Clockwork Systems, joined the Enterprise AI track to discuss how Software-Driven Fabrics (SDF) are reshaping AI infrastructure, enabling real-time observability, workload resilience, openness, and massive scalability across rapidly evolving GPU clusters.

In this interview, he shares his key takeaways from the session, offers insights into Clockwork Systems’ technology, and reflects on what sets the AI Infra Summit apart from other industry events.


 

Prashanth Thinakaran onstage at the AI Infra Summit

Key Takeaway From Your Session

The biggest takeaway is that reliability in large-scale GPU clusters has become a first-class engineering priority for cloud operators. Today, the resiliency of an AI training stack isn't dictated by hardware alone; it requires tight coordination between compute, networking, and software, approached as a true hardware/software co-design problem.

As clusters scale into the thousands of GPUs, the real challenges aren't obvious failures like bad cables or dead NICs. They are invisible micro-events, such as microbursts, tail-latency spikes, routing hot spots, and intermittent congestion, that quietly erode training efficiency.

What resonated most with the audience is that software-driven resiliency gives AI clusters the ability to detect and adapt to these conditions in real time, resulting in materially higher performance, application stability, and infrastructure ROI.

Why Software-Defined Fabrics Matter Now

Modern AI clusters behave like distributed supercomputers. They’re extremely sensitive to small changes in network timing since RDMA demands extreme throughput with minimal jitter. In addition, communication between nodes must be tightly synchronized, so overall performance is determined by the slowest node.
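The "slowest node determines performance" point can be made concrete with a toy simulation (not Clockwork's actual stack; the node counts and timings below are illustrative assumptions). In a synchronous collective such as allreduce, every GPU waits for the slowest participant, so one node hit by a tail-latency spike stalls the entire cluster for that step:

```python
import random

random.seed(42)

def synchronous_step_time(node_times_ms):
    """In a synchronous collective (e.g. allreduce), every node waits
    for the slowest participant, so the step takes the max time."""
    return max(node_times_ms)

NUM_NODES = 64
BASE_MS = 100.0    # nominal per-step compute + communication time
JITTER_MS = 2.0    # small random network jitter per node

# Healthy cluster: every node finishes close to nominal.
healthy = [BASE_MS + random.uniform(0.0, JITTER_MS) for _ in range(NUM_NODES)]

# Same cluster with one straggler delayed by a tail-latency spike.
degraded = list(healthy)
degraded[7] += 40.0  # one node delayed ~40 ms by congestion

print(f"healthy step:  {synchronous_step_time(healthy):.1f} ms")
print(f"degraded step: {synchronous_step_time(degraded):.1f} ms")
# A single slow node holds up all 64 GPUs for the full step.
```

Even though 63 of 64 nodes are healthy, the degraded step is roughly 40% slower, which is why invisible per-node micro-events translate directly into cluster-wide efficiency loss.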

Clockwork’s Software-Driven Fabrics matter because they:

  • Provide real-time, sub-microsecond visibility into the network fabric to pinpoint the root cause of congestion and contention.
  • Make workloads resilient to underlying infrastructure issues while they are running, avoiding costly restarts from checkpoints.
  • Dynamically eliminate congestion and contention by optimizing traffic flow.
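To illustrate the first capability, here is a minimal sketch of tail-latency detection from one-way delay probes. This is a hypothetical toy detector, not Clockwork's algorithm; the threshold factor and sample values are assumptions for illustration:

```python
from statistics import median

def flag_congestion(one_way_delays_us, factor=3.0):
    """Toy detector: flag probe samples whose one-way delay exceeds
    `factor` times the path's median delay (a tail-latency spike)."""
    baseline = median(one_way_delays_us)
    return [i for i, d in enumerate(one_way_delays_us) if d > factor * baseline]

# Synthetic one-way delay samples (microseconds) on one GPU-to-GPU path:
# mostly ~8 us, with two congestion-induced spikes.
samples = [8.1, 7.9, 8.3, 8.0, 31.5, 8.2, 7.8, 45.0, 8.1, 8.0]
print(flag_congestion(samples))  # indices of the spike samples
```

The key idea is that one-way (rather than round-trip) delay measurement requires tightly synchronized clocks across nodes, which is what makes sub-microsecond fabric visibility possible in the first place.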

A Surprising Insight About Clockwork’s Tech

Many teams assume their performance issues are caused by GPUs, NCCL, or the model code. After working with us, they quickly realize the real bottlenecks are invisible network-level events: tail latency spikes, straggler GPUs due to network congestion, network pluggable optic flapping/failure, slow-drain flows, or silent failures.

The “aha moment” is that these issues were happening constantly, but they had no way to see them until Clockwork's FleetIQ highlighted the problems.

Which Hardware Shift Excites You Most?

I’m most excited about the shift toward programmable, telemetry-rich fabrics and the growing adoption of Open Compute standards across the AI infrastructure stack. These standards give ecosystem partners a shared blueprint for building interoperable, high-performance components and accelerate the emergence of an open, multi-vendor hardware ecosystem.

This evolution is particularly important as AI workloads move beyond training into multi-node inference and KV-cache disaggregated serving. These workloads introduce extremely tight latency, synchronization, and bandwidth requirements across memory tiers, storage and nodes. Running model execution, memory access, and KV-cache lookups over the network places unprecedented pressure on the underlying fabric: microsecond sensitivity, congestion hotspots, and tail-latency amplification all become critical bottlenecks.

The industry’s move toward heterogeneous, high-bandwidth, low-latency fabric architectures, supported by CXL memory pooling, UALink/SUE/NVLink scale-up accelerator links, and RDMA scale-out networking, is what will make these new inference architectures, the so-called GenAI token factories, viable at scale.

For the first time, we can run true software-driven control on top of these network fabrics to ensure consistency, performance, and reliability, even under dynamic multi-node inference loads.

Is the AI Infrastructure Sector Stable?

While applications may go through hype cycles, the demand for AI infrastructure is structurally durable. Enterprises are still in the early stages of adoption, model sizes continue to grow, and GPU cluster deployments are accelerating worldwide. The sector is far more robust than people assume, and the long-term need for scalable, reliable AI compute is only increasing.

What Sets AI Infra Summit Apart?

AI Infra Summit stands out because it’s built by engineers, for engineers. It brings together GPU architects, network designers, distributed systems researchers, and practitioners who run some of the world’s largest clusters. The interaction with attendees is high-quality, and the sessions are deeply technical and insightful.

It’s increasingly viewed as a strategic event on the corporate calendar: an opportunity for companies to showcase their roadmaps, signal what’s coming next, and engage directly with the experts who shape the future of AI infrastructure.

Register Your Interest in 2026