Google's I/O Table Skipped SWE-Bench Pro. The Model Card Didn't.
Google's I/O pitch for Gemini 3.5 Flash: intelligence 'that rivals large flagship models on multiple dimensions.' The model card includes SWE-Bench Pro, showing Flash at 55.1% against Opus 4.7's 64.3%, with no note that the two scores came from different eval conditions.

Google launched Gemini 3.5 Flash at I/O on May 19, claiming the model "delivers intelligence that rivals large flagship models on multiple dimensions."
Flash is Google's efficiency line, priced at $1.50 per million input tokens. The flagship comparison is Google's own language.
The I/O launch table led with GDPval-AA, where Flash scored 1,656 Elo, and MCP Atlas, where it scored 83.6%. The current GDPval-AA leader sits at 1,769, 113 points above Flash.
The model card, published the same day, includes SWE-Bench Pro, a benchmark the I/O launch table did not feature. Flash scores 55.1% there. The card lists Opus 4.7 at 64.3%.
The two scores came from different conditions. Anthropic reports its SWE-bench results as an average over 25 trials, each run with adaptive thinking at max effort; Google evaluated Flash on a single attempt. The card notes neither difference.
Publishing both figures side by side without a methodology note invites a clean reading that the underlying conditions do not support. The 9.2-point gap separates an optimally run Opus 4.7 from a single-attempt Flash, not two models at the same starting line.
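To give a rough sense of how much the trial count alone can matter, here is a back-of-the-envelope sketch in Python. It treats each benchmark task as an independent pass/fail draw with the same pass probability, which is a simplification, and it uses a hypothetical task count of 500, which is an assumption for illustration and not SWE-Bench Pro's actual size. It covers run-to-run noise only; it says nothing about how much max effort or adaptive thinking would move a score.

```python
import math


def score_standard_error(pass_rate: float, num_tasks: int, num_trials: int = 1) -> float:
    """Standard error of a benchmark score averaged over independent runs.

    Assumes every task is an independent pass/fail draw with the same
    pass probability (a simplification, not how real benchmarks behave).
    """
    per_run_variance = pass_rate * (1.0 - pass_rate) / num_tasks
    return math.sqrt(per_run_variance / num_trials)


NUM_TASKS = 500          # hypothetical task count, chosen only for illustration
FLASH_PASS_RATE = 0.551  # Flash's reported SWE-Bench Pro score

single = score_standard_error(FLASH_PASS_RATE, NUM_TASKS, num_trials=1)
averaged = score_standard_error(FLASH_PASS_RATE, NUM_TASKS, num_trials=25)

print(f"single attempt: +/- {single * 100:.1f} points")
print(f"25-trial mean:  +/- {averaged * 100:.1f} points")
print(f"noise ratio:    {single / averaged:.1f}x")
```

Under these assumptions, a single-attempt score carries five times the run-to-run noise of a 25-trial average, whatever the task count, because the standard error shrinks with the square root of the number of trials.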
Run Flash on SWE-Bench Pro under the conditions Anthropic used for Opus 4.7's 64.3% (25 trials, max effort, adaptive thinking) and publish the result. If the gap holds above 9 points, Google's flagship framing fails by its own benchmark.