
Gemini 3.1 Can Solve Puzzles. It Still Can't Use a Screwdriver.

Google's Gemini 3.1 Pro just dropped with a 77% on ARC-AGI-2 - up from 31%. The benchmarks are staggering. But the people actually building with it keep saying the same thing: it can't call tools.

AI · LLMs · Gemini · Engineering · Developer Tools
February 19, 2026
4 min read

Google dropped Gemini 3.1 Pro today. The numbers are absurd.

ARC-AGI-2 went from 31.1% on Gemini 3 Pro to 77.1%. That's not an incremental gain. That's a different model. GPQA Diamond hit 94.3%. Humanity's Last Exam climbed to 44.4% without tools, 51.4% with them. On paper, this is the smartest model anyone has shipped.

And within an hour of the announcement, the Hacker News thread had a very specific complaint repeating itself: it still can't call tools properly.

The screwdriver problem

Here's what I mean. ARC-AGI-2 tests abstract reasoning - pattern recognition over novel visual puzzles that you can't memorize from training data. A 77.1% score means the model is genuinely thinking, not pattern-matching. That's real progress.

But reasoning and execution are different skills. When you ask Gemini 3.1 to do something that requires calling an API, reading a file, running a function, and chaining the results together - the kind of thing you actually need an AI to do in production - it falls apart in ways that Sonnet 4.6 and GPT-5.3-Codex don't.

One developer in the HN thread put it bluntly: Gemini 3 is "not good at all at tool calling and agentic workflows, especially compared to the recent two mini-generations of models." Another said you can "really notice the tool use problems." A third noted that even with explicit system instructions to change its behavior, the model only obeyed when reminded.

This isn't a niche complaint. This is the entire trajectory of where AI models are headed. Agents. Tools. Workflows. If your model can solve a PhD-level physics problem but can't reliably parse a JSON response from an API call, you've built the world's smartest person who can't use a screwdriver.

Why this keeps happening

The gap between reasoning benchmarks and tool-use performance isn't a bug. It's a training priority.

Reasoning improvements come from scaling compute at inference time - letting the model "think longer" with chain-of-thought approaches. Google has been investing heavily here since Gemini 2.5's thinking modes, and the results show. The ARC-AGI-2 jump is proof that extended reasoning works.

Tool calling is a different beast. It requires the model to generate precise, structured outputs - exact function names, correct parameter types, proper sequencing of dependent calls. It's less about intelligence and more about discipline. And discipline comes from fine-tuning on massive amounts of tool-use data, RLHF specifically targeting structured output quality, and careful evaluation against real-world agentic benchmarks.
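To make "discipline" concrete: a tool call is only useful if the model emits exactly the schema the harness expects, and most harnesses reject anything else before executing it. Here's a minimal sketch of that validation step - the tool name, schema shape, and model output are all hypothetical, not any specific vendor's API:

```python
import json

# Hypothetical tool schema a harness might register with the model.
TOOL_SCHEMA = {
    "name": "get_weather",
    "parameters": {"city": str, "units": str},
}

def validate_call(raw: str) -> dict:
    """Reject a tool call unless the name and parameter types match exactly."""
    call = json.loads(raw)  # malformed JSON fails right here
    if call.get("name") != TOOL_SCHEMA["name"]:
        raise ValueError(f"unknown tool: {call.get('name')}")
    args = call.get("arguments", {})
    for key, expected in TOOL_SCHEMA["parameters"].items():
        if key not in args:
            raise ValueError(f"missing parameter: {key}")
        if not isinstance(args[key], expected):
            raise ValueError(f"wrong type for {key}: {type(args[key]).__name__}")
    return args

# An "almost right" call still fails: `units` is misspelled as `unit`.
try:
    validate_call('{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}')
except ValueError as err:
    print(err)  # missing parameter: units
```

There's no partial credit here, which is exactly why this is a fine-tuning problem rather than a reasoning problem: a brilliant answer with one misspelled key is a failed call.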

Anthropic and OpenAI have been grinding on this for over a year. Their models aren't necessarily smarter. They're more obedient in the specific ways that matter for building things.

The benchmark trap

There's a pattern I keep seeing in these releases. A new model drops. The benchmark table gets posted. Everyone compares numbers. And then the people who actually ship software with these models show up in the comments and say, "Yeah, but does it actually work?"

Benchmarks measure capability. They don't measure reliability. A model that scores 77% on ARC-AGI-2 but hallucinates function parameters 15% of the time is worse for production use than a model scoring 58% that nails every tool call. The second model ships. The first one demos.
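And that per-call failure rate compounds, because an agentic workflow only succeeds if every step in the chain succeeds. A quick back-of-the-envelope, using the illustrative 15% figure above:

```python
# If each tool call succeeds 85% of the time, a five-step agentic
# chain only completes end to end about 44% of the time.
per_call = 0.85
steps = 5
chain_success = per_call ** steps
print(f"{chain_success:.1%}")  # 44.4%
```

The model that nails every call finishes the chain every time; the "smarter" one fails more often than it succeeds.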

Google knows this. Their own Terminal-Bench 2.0 numbers (68.5%) are strong but not dominant - GPT-5.3-Codex hits 77.3% with its own harness. SWE-Bench Verified at 80.6% is competitive with Opus 4.6 at 80.8%. The delta between "smartest" and "most useful" is exactly the tool-calling gap.

What this means if you're building

If you're choosing a model for an agentic workflow today - a coding assistant, an automation pipeline, anything that chains multiple tool calls - Gemini 3.1 Pro probably isn't your first pick yet. Not because it's not smart enough. Because it's not reliable enough at the boring part.

For analysis, research, long-context understanding, multimodal reasoning? It's arguably the best available. A 1M token context window with 64K output and those reasoning scores means it can digest an entire codebase and give you insights that other models miss.

But the industry isn't moving toward "digest and analyze." It's moving toward "digest, analyze, plan, execute, verify, and iterate" - all autonomously. And that last part is where the screwdriver matters more than the IQ score.

Google's next move is obvious: close the tool-calling gap without sacrificing the reasoning gains. If they do that with Gemini 3.2 or even a 3.1 update, the competitive picture changes overnight. A Flash-tier model at those reasoning levels with reliable tool use would be a default choice for most workloads.

Until then, Gemini 3.1 Pro is the smartest model in the room that you still can't fully trust to do the work.
