
What Claude Code Actually Chooses (And Why Tool Vendors Should Pay Attention)
Amplifying.ai ran 2,430 prompts against Claude Code and found it builds custom solutions in 12 of 20 categories. The tools it picks are becoming the default stack for a growing share of new projects.
When a developer types "add a database" and lets Claude Code handle it, the agent doesn't open a browser and compare options. It installs packages, writes imports, configures connections, and commits code. The tool it picks is the tool that ships.
Amplifying.ai just published the most thorough study of this behavior I've seen. They pointed Claude Code at real repos 2,430 times with open-ended prompts ("what should I use?", "add auth", "how do I deploy this"), never naming a specific tool. Then they watched what it chose.
The findings change how I think about tool selection, vendor strategy, and the future of developer ecosystems.
The big finding: agents build, not buy
In 12 of 20 categories, Claude Code's most common response was to build something from scratch.
Asked to "add feature flags"? It doesn't recommend LaunchDarkly. It builds a config system with env vars, React Context providers, and percentage-based rollout with hashing. Feature flags had a 69% Custom/DIY rate.
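The study's generated examples were JavaScript (env vars plus React Context), but the core trick — percentage rollout via deterministic hashing — is a few lines in any language. Here's a hedged Python sketch of that idea; the `FLAGS` table and the 30% rollout are hypothetical stand-ins, not code from the study:

```python
import hashlib

# Hypothetical flag config: flag name -> rollout percentage.
# In the DIY systems the study describes, this would come from env vars.
FLAGS = {"new_checkout": 30}

def is_enabled(flag: str, user_id: str) -> bool:
    """Hash the user into a stable 0-99 bucket; enable if below the rollout %."""
    pct = FLAGS.get(flag, 0)
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < pct
```

The hash makes the rollout sticky: the same user always lands in the same bucket for a given flag, so raising the percentage only ever adds users, never flips existing ones off.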
Asked to "add auth" in a Python project? It writes JWT + passlib + python-jose from scratch. 100% Custom/DIY for Python auth.
If you counted Custom/DIY as a single tool, it would be the most common extracted label in the entire study: 252 primary picks across 12 categories. More than GitHub Actions (152), more than Vitest (101), more than any individual tool.
This isn't a bug. The researchers manually reviewed 50 Custom/DIY extractions and found roughly 80% were genuine build-from-scratch responses. Claude Code has a measurable preference for building over buying.
The "Claude Code stack" is real
Where Claude Code does pick third-party tools, it converges hard. Some categories are effectively locked up:
- CI/CD: GitHub Actions at 93.8%. GitLab CI, CircleCI, Jenkins got zero primary picks.
- Payments: Stripe at 91.4%. No other processor was recommended as primary.
- UI Components: shadcn/ui at 90.1%. Chakra, MUI, Mantine barely register.
- Deployment: Vercel at 100% for JavaScript projects. Railway at 82% for Python.
The strong defaults are just as telling:
| Category | Default Pick | Share |
|---|---|---|
| State Management | Zustand | 64.8% |
| Observability | Sentry | 63.1% |
| Email | Resend | 62.7% |
| Testing | Vitest (JS) / pytest (Python) | 59.1% / 100% |
| Databases | PostgreSQL | 58.4% |
| Package Manager | pnpm | 56.3% |
Redux got zero primary picks. Zero. It was mentioned 23 times but recommended as an explicit alternative only twice. The model knows Redux exists. It just doesn't choose it.
Jest? Seven primary picks out of 171. Vitest owns the JavaScript testing category.
Express? Completely absent from both primary and alternative recommendations. Not even acknowledged as a second choice.
Newer models pick newer tools
The study tested three models: Sonnet 4.5, Opus 4.5, and Opus 4.6. The recency gradient is the clearest pattern in the data.
ORM in JavaScript projects:
- Sonnet 4.5: Prisma 79%, Drizzle 21%
- Opus 4.5: Drizzle 60%, Prisma 40%
- Opus 4.6: Drizzle 100%, Prisma 0%
Prisma went from dominant to invisible in one model generation. The researchers called this "the strongest single-tool signal in the dataset."
Background jobs in Python:
- Sonnet 4.5: Celery 100%
- Opus 4.6: Celery 0%, FastAPI BackgroundTasks 44%
Celery, the decade-old Python standard, collapsed from unanimous to absent.
Caching in Next.js:
- Sonnet 4.5: Redis 46%, Next.js Cache 31%
- Opus 4.6: Next.js Cache 54%, Redis 0%
Redis dropped to zero for Next.js projects as framework-native caching took over.
This creates a feedback loop that should concern every tool vendor: newer training data favors newer tools, which get more recommendations, which drives adoption, which generates more training data. Tools that lose the agent's favor may struggle to recover.
Context awareness is genuinely good
One thing that impressed me: Claude Code isn't working from a fixed list. The same model picks Drizzle for JavaScript and SQLModel for Python. Vercel for Next.js, Railway for FastAPI. TanStack Query for React SPAs, API Routes for Next.js apps.
Phrasing stability averaged 76%. You can ask "what database should I use?" five different ways and get the same answer most of the time. But change the project from Next.js to Python, and the entire recommendation set shifts appropriately.
The 90% cross-model agreement within ecosystems confirms this. All three models agree on the top tool in 18 of 20 categories when you compare within the same language ecosystem. The 10% disagreement is concentrated in genuinely fragmented categories (Real-time, Caching) where there isn't a clear winner anyway.
What this actually means
If you're a developer, your "choices" are increasingly made by whatever model powers your coding agent. That's not automatically bad. Claude Code's picks are context-aware and generally sensible. But you should know it's happening. When your last three projects all use Zustand, pnpm, and Vitest, ask yourself: did you choose those, or did the agent?
If you maintain an open-source tool, the data on alternatives is more interesting than the primary picks suggest. Netlify got 67 alternative recommendations in Deployment despite zero primary picks. SendGrid got 55 in Email. Jest got 31 in Testing. Being a "known alternative" is closer to "default pick" than being truly invisible. But the gap still matters at scale.
If you're a tool vendor, the Custom/DIY finding is the real threat. It's not just that Claude Code might pick your competitor. It's that in 12 categories, it prefers building from scratch over recommending any third-party tool. Your competition isn't just LaunchDarkly vs. Flagsmith. It's LaunchDarkly vs. 15 lines of config that the agent writes in 30 seconds.
If you're on an AI team, response time correlates with uncertainty in a way that's worth investigating. Deployment questions averaged 32 seconds (clear default: Vercel). Authentication averaged 245 seconds (builds from scratch). The model spends more time when it's building custom solutions, which could indicate genuine deliberation about the right approach.
The uncomfortable question
The study is transparent about what it can't tell you: whether high pick rates reflect genuine quality or just training data frequency. GitHub Actions at 94% could mean it's the best CI tool. Or it could mean it dominates the code examples Claude was trained on.
I think the answer is "both, and it doesn't matter." The practical effect is the same. If millions of developers are vibe-coding their way to production apps, and the agent picks GitHub Actions every time, GitHub Actions' market share grows regardless of why it was picked. The recommendation is the distribution channel now.
Amplifying plans to expand this benchmark to Codex, Cursor, and Antigravity. That data will reveal whether these patterns are Claude-specific or universal to AI coding agents. My bet: the broad trends (build over buy, newer tools winning, near-monopolies in clear categories) will hold across agents. The specific percentages will vary.
For now, the takeaway is simple: pay attention to what your tools are choosing for you. The default stack is being written by the agent, not by you. And the agent has opinions.
Methodology and full dataset: amplifying.ai/research/claude-code-picks