I’m most excited about the shift toward programmable, telemetry-rich fabrics and the growing adoption of Open Compute standards across the AI infrastructure stack. These standards give ecosystem partners a shared blueprint for building interoperable, high-performance components and accelerate the emergence of an open, multi-vendor hardware ecosystem.
This evolution is particularly important as AI workloads move beyond training into multi-node inference and KV-cache disaggregated serving. These workloads impose extremely tight latency, synchronization, and bandwidth requirements across memory tiers, storage, and nodes. Running model execution, memory access, and KV-cache lookups over the network places unprecedented pressure on the underlying fabric: microsecond-level sensitivity, congestion hotspots, and tail-latency amplification all become critical bottlenecks.
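To make that pressure concrete, here is a minimal sketch of the remote KV-cache fetch path in a disaggregated server. The `fetch_kv_block` helper and all latency numbers are illustrative assumptions, not a real serving stack; the point is that per-fetch fabric latency multiplies across blocks and layers into each token's step budget.

```python
from dataclasses import dataclass

STEP_BUDGET_US = 200   # assumed per-token step latency budget
FETCH_LATENCY_US = 8   # assumed per-fetch fabric latency
BLOCKS_PER_STEP = 16   # assumed remote KV blocks pulled per decode step

@dataclass
class FetchStats:
    fetches: int = 0
    total_us: float = 0.0

def fetch_kv_block(block_id: int, stats: FetchStats) -> bytes:
    """Stand-in for an RDMA read of one KV-cache block from a remote pool."""
    stats.fetches += 1
    stats.total_us += FETCH_LATENCY_US
    return b"\x00" * 4096  # placeholder payload

def decode_step(stats: FetchStats) -> None:
    # One token's worth of remote KV-cache traffic; any added fabric
    # latency here is paid again on every block, layer, and node.
    for block_id in range(BLOCKS_PER_STEP):
        fetch_kv_block(block_id, stats)

stats = FetchStats()
decode_step(stats)
print(f"{stats.fetches} fetches consumed {stats.total_us:.0f} us "
      f"of a {STEP_BUDGET_US} us step budget")
```

Even with these toy numbers, a single decode step spends a meaningful fraction of its budget on the fabric, which is why microsecond regressions show up directly in token latency.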
The industry’s move toward heterogeneous, high-bandwidth, low-latency fabric architectures, supported by CXL memory pooling, scale-up accelerator links such as UALink, SUE, and NVLink, and RDMA-based scale-out networking, is what will make these new inference architectures, sometimes called GenAI token factories, viable at scale.
For the first time, we can run genuine software-driven control on top of these network fabrics, using their telemetry to ensure consistency, performance, and reliability even under dynamic multi-node inference loads.
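As one illustration of what that control could look like, here is a hedged sketch of a telemetry-driven loop. `read_port_telemetry`, the thresholds, and the reroute action are hypothetical placeholders rather than any vendor's API; a real controller would consume in-band telemetry or streamed counters from the fabric.

```python
import random
import time

def read_port_telemetry(port: int) -> dict:
    # Simulated sample; a real fabric would stream these counters.
    return {
        "queue_depth": random.randint(0, 100),    # occupancy, percent
        "tail_latency_us": random.uniform(1, 50)  # simulated p99 latency
    }

QUEUE_THRESHOLD = 80       # assumed congestion trigger
LATENCY_THRESHOLD_US = 30  # assumed tail-latency trigger

def control_loop(ports: list[int], iterations: int = 3) -> None:
    """Poll port telemetry and react before congestion amplifies tail latency."""
    for _ in range(iterations):
        for port in ports:
            sample = read_port_telemetry(port)
            if (sample["queue_depth"] > QUEUE_THRESHOLD
                    or sample["tail_latency_us"] > LATENCY_THRESHOLD_US):
                # Placeholder action: a real controller might shift flows,
                # adjust ECMP weights, or pace senders.
                print(f"port {port}: congestion detected, rerouting flows")
        time.sleep(0.01)  # assumed polling interval

control_loop(ports=[1, 2, 3, 4])
```

The design choice this sketch highlights is the closed loop itself: telemetry-rich, programmable fabrics let software observe congestion as it forms and act within the same timescales at which inference traffic shifts.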