Google has unveiled a new generation of Tensor Processing Units (TPUs), featuring two specialized chips designed to accelerate model training and agent workflows, the latter requiring continuous, multi-step reasoning and action loops distributed across multiple models. The new TPUs deliver better performance, memory, and energy efficiency, the company says.

According to Google, the rise of AI agents requires distinct chips for training and inference, each designed to unlock significant performance gains for the specific workloads it handles.

TPU 8t, designed with greater compute throughput and more scale-up bandwidth, shines at massive, compute-intensive training workloads. TPU 8i, designed with more memory bandwidth, serves the most latency-sensitive inference workloads.

On the training side, the new chip aims to maximize raw scale and speed. It is designed to reduce the time needed to train frontier models "from months to weeks", says Google. This is achieved by increasing compute density, memory capacity, and bandwidth across large clusters, resulting in nearly 3x the compute performance of the previous generation.

A single TPU 8t superpod now scales to 9,600 chips and two petabytes of shared high-bandwidth memory, with double the inter-chip bandwidth of the previous generation. This architecture delivers 121 exaFLOPS of compute and allows the most complex models to leverage a single, massive pool of memory.
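As a back-of-envelope check, those pod-level figures imply roughly 208 GB of HBM and about 12.6 PFLOPS per chip. The arithmetic below is my own derivation from Google's published pod numbers, not official per-chip specifications:

```python
# Back-of-envelope per-chip figures implied by the pod-level numbers.
# These are derived values, not published per-chip specifications.

chips = 9_600                 # chips per TPU 8t superpod
pod_hbm_bytes = 2e15          # 2 PB of shared high-bandwidth memory
pod_compute_flops = 121e18    # 121 exaFLOPS at the pod level

hbm_per_chip_gb = pod_hbm_bytes / chips / 1e9
compute_per_chip_pflops = pod_compute_flops / chips / 1e15

print(f"HBM per chip:     ~{hbm_per_chip_gb:.0f} GB")              # ~208 GB
print(f"Compute per chip: ~{compute_per_chip_pflops:.1f} PFLOPS")  # ~12.6
```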

This system can scale almost linearly up to a million chips in a single local cluster, according to the company. Beyond scale, the design also maximizes utilization through 10x faster storage and increased reliability, availability, and serviceability to reduce downtime due to hardware failures, network stalls, or checkpoint restarts.
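To see why faster storage and fewer restarts translate into higher utilization, consider a simple goodput model: checkpoint writes and failure recovery both eat into useful training time. The sketch below is a generic approximation of my own with hypothetical numbers, not a model Google has published:

```python
# A simple, illustrative goodput model: the fraction of wall-clock time
# spent on useful training work, given periodic checkpointing and
# occasional failures that roll the job back to the last checkpoint.
# All numbers below are hypothetical, chosen only to show the shape.

def goodput(mtbf_h: float, ckpt_interval_h: float,
            ckpt_write_h: float, restart_h: float) -> float:
    # Overhead 1: pausing to write checkpoints.
    ckpt_overhead = ckpt_write_h / (ckpt_interval_h + ckpt_write_h)
    # Overhead 2: each failure costs the restart time plus, on average,
    # half a checkpoint interval of recomputed work.
    loss_per_failure_h = restart_h + ckpt_interval_h / 2
    failure_overhead = loss_per_failure_h / mtbf_h
    return max(0.0, (1 - ckpt_overhead) * (1 - failure_overhead))

# Slow storage: hourly checkpoints that take 6 minutes to write.
print(f"slow storage: {goodput(24, 1.0, 0.10, 0.5):.1%}")   # ~87.1%
# 10x faster storage: same cadence, checkpoints take ~36 seconds.
print(f"fast storage: {goodput(24, 1.0, 0.01, 0.5):.1%}")   # ~94.9%
```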

On the inference side, the TPU 8i chip shifts priorities toward responsiveness and efficiency under continuous load. Google notes that agent workloads involve long contexts, memory-heavy operations, and concurrent requests from distinct agents. The chip is optimized to reduce latency by offloading global operations, provides higher memory bandwidth and up to 288 GB of memory, and improves performance per dollar by 80%.
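A rough roofline argument shows why memory matters so much for latency-sensitive serving: each generated token must stream the active model weights from HBM, so per-token latency is bounded by bandwidth rather than FLOPS. In the sketch below, only the 288 GB capacity figure comes from the announcement; the bandwidth and model size are assumptions for illustration:

```python
# Why memory bandwidth bounds per-token latency in autoregressive
# decoding: each step streams the active model weights from HBM, so the
# step is memory-bound rather than compute-bound. The model size and
# bandwidth are hypothetical; only the 288 GB capacity figure comes
# from Google's announcement (assumed here to be per chip).

hbm_capacity_gb = 288        # memory capacity from the announcement
hbm_bandwidth_tb_s = 7.0     # assumed HBM bandwidth, TB/s (hypothetical)

params = 100e9               # hypothetical 100B-parameter dense model
bytes_per_param = 1          # served at 8-bit precision

weight_bytes = params * bytes_per_param
assert weight_bytes <= hbm_capacity_gb * 1e9, "model must fit in HBM"

# Lower bound on decode latency: the time to stream the weights once.
ms_per_token = weight_bytes / (hbm_bandwidth_tb_s * 1e12) * 1e3
print(f"~{ms_per_token:.1f} ms/token, ~{1e3 / ms_per_token:.0f} tokens/s")
```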

Google also highlights the networking improvements:

For modern Mixture of Experts (MoE) models, we doubled the Inter-Chip Interconnect (ICI) bandwidth to 19.2 Tb/s. Our new Boardfly architecture reduces the maximum network diameter by more than 50%, ensuring the system works as one cohesive, low-latency unit.
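To put the ICI figure in context: MoE layers route each token's activations to a handful of experts that may live on other chips, so an all-to-all exchange happens at every MoE layer and its cost scales inversely with interconnect bandwidth. In the sketch below, only the 19.2 Tb/s figure comes from Google; the batch, model, and routing parameters are assumptions:

```python
# Illustrative cost of the MoE all-to-all dispatch over inter-chip
# links. Only the 19.2 Tb/s ICI bandwidth comes from Google's numbers;
# batch size, hidden dimension, and top-k routing are assumptions.

ici_bandwidth_bytes_s = 19.2e12 / 8   # 19.2 Tb/s -> 2.4 TB/s per chip

tokens = 8_192            # tokens in flight on this chip (assumed)
hidden_dim = 8_192        # model hidden dimension (assumed)
bytes_per_elem = 2        # bf16 activations (assumed)
top_k = 2                 # experts each token is routed to (assumed)

# Each MoE layer dispatches activations to remote experts and gathers
# the results back, so roughly two transfers of the routed activations.
bytes_moved = 2 * tokens * hidden_dim * bytes_per_elem * top_k
transfer_us = bytes_moved / ici_bandwidth_bytes_s * 1e6
print(f"~{transfer_us:.0f} us per MoE layer")   # ~224 us at these settings
```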

Beyond the significant improvements that the new chips deliver, Google's TPU philosophy has remained largely consistent over the years:

The key insight behind the original TPU design continues to hold today: by customizing and co-designing silicon with hardware, networking and software, including model architecture and application requirements, we can deliver dramatically more power efficiency and absolute performance.

This sentiment is echoed by burnte on Hacker News, who noted:

Google owns everything from the keyboard to the silicon. They've iterated so much they understand how to separate out different functions that compete with each other for resources.

Likewise, pmb highlights another advantage in Google's TPU offering:

When you are doing big AI you basically have to buy it from NVidia or rent it from Google. And Google can design their chips and engine and systems in a whole-datacenter context, centralizing some aspects that are impossible for chip vendors to centralize.

On a different note, amelius warns against "building your castle in someone else's kingdom", suggesting that buying from Nvidia is the only real option, though even that does not fully eliminate concerns around vendor lock-in.