Flash-Lite Is $0.10. Google Cloud's Margin Is 32.9%.

Gemini 2.5 Flash-Lite lists at $0.10 per million input tokens on Vertex AI, in line with what H100-based operators charge for small open-source models. Output tokens run $0.40 per million.

Google Cloud's operating margin hit 32.9% in Q1 2026, up from 9.4% a year earlier, on $20 billion in quarterly revenue. Alphabet does not break out AI inference separately.

API token throughput hit 16 billion tokens per minute in Q1, up 60% from Q4 2025. Sundar Pichai said Cloud revenue would have been higher without capacity limits, not the usual pattern of a service losing margin share.

The Stack Behind the Price

Google designs its own inference silicon, manufactured by TSMC, and runs XLA, the compiler that translates AI workloads to hardware instructions. On April 22 at Google Cloud Next, Google announced TPU 8i, an inference-only chip targeting 80% better price-performance over its Ironwood predecessor. An interest form for 8i access is live; the chip targets general availability by year-end.

The H100 Floor

H100 spot rates cluster around $2.85 to $3.50 per hour in 2026, down 64-75% from peak. DeepInfra serves Llama 3.1 8B on H100-class hardware at $0.06 per million tokens, achieved through pooled customer demand at sustained batch throughput. Any meaningful utilization drop pushes the per-token cost above Flash-Lite's $0.10.

What the $100B Buys

Anthropic's Claude Haiku 4.5 lists at $1.00 per million input tokens and $5.00 per million output tokens. In April 2026, Anthropic committed more than $100 billion to AWS over ten years for access to Trainium2 through Trainium4.

SemiAnalysis estimates Anthropic's inference margins reached 70% in early 2026, up from 38% a year prior, as hardware utilization and model distillation improved. The 10x input gap to Flash-Lite implies where the gains are not flowing: into customer pricing. Anthropic buys access to Trainium2 through Trainium4 rather than owning the silicon, which leaves Amazon setting the rate on each successive chip.

TPU 8i targets general availability by year-end 2026. If Google cuts Flash-Lite's input price when the chip ships, the inference floor for small proprietary models will reach where H100 operators post their deepest open-source discounts today.

Gemini 2.5 Flash-Lite lists at $0.10 per million input tokens on Vertex AI, in line with what H100-based operators charge for small open-source models. Output tokens run $0.40 per million.

Google Cloud's operating margin hit 32.9% in Q1 2026, up from 9.4% a year earlier, on $20 billion in quarterly revenue. Alphabet does not break out AI inference separately.