Demo: High-throughput LLM inference with Kubernetes, llm-d, and Google Cloud TPUs

10 Sep 2025
This hands-on session is designed for developers and architects building and scaling generative AI services. We will provide a practical look at Google Kubernetes Engine (GKE) as the foundation for high-performance large language model (LLM) inference. The session will feature a live demo of the GKE Inference Gateway, highlighting its model-aware routing and serving priority features. We will then delve into the open-source llm-d project, showcasing its vLLM-aware scheduling and disaggregated serving capabilities. Finally, we'll explore the performance gains of running vLLM on Cloud TPUs for high throughput and cost efficiency. You will leave with actionable insights and code examples to optimize your LLM serving stack.
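To give a flavor of the model-aware routing and serving-priority idea described above, here is a minimal, purely illustrative sketch in Python. It is not the GKE Inference Gateway's implementation or API; the backend names, priority scheme, and load metric are all hypothetical, chosen only to show how a router might pick an endpoint by model name, priority tier, and current load:

```python
# Toy model-aware router with serving priorities.
# Illustrative only: NOT the GKE Inference Gateway implementation.
from dataclasses import dataclass


@dataclass
class Backend:
    name: str            # hypothetical backend pool name
    model: str           # model this pool serves
    priority: int        # lower number = higher serving priority
    in_flight: int = 0   # requests currently being served


def route(backends: list[Backend], model: str) -> Backend:
    """Pick the highest-priority, least-loaded backend serving `model`."""
    candidates = [b for b in backends if b.model == model]
    if not candidates:
        raise ValueError(f"no backend serves model {model!r}")
    # Sort key: priority tier first, then current load within the tier.
    chosen = min(candidates, key=lambda b: (b.priority, b.in_flight))
    chosen.in_flight += 1
    return chosen


backends = [
    Backend("llama-critical", "llama-3-8b", priority=0),
    Backend("llama-batch", "llama-3-8b", priority=1),
    Backend("gemma-pool", "gemma-7b", priority=0),
]

print(route(backends, "llama-3-8b").name)  # → llama-critical
```

A real gateway would of course route on live telemetry (queue depth, KV-cache utilization) rather than a simple counter, but the selection logic, matching by model and breaking ties by priority and load, is the core idea.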