I Spun Up ~30 AI Agents in Parallel to Kill My Own Product Idea

Before building 'Jira for AI agents,' I ran one Claude prompt that fanned out into 111 sub-agent runs (roughly twenty minutes and 3.6 million tokens of work) across five research angles searched in parallel. It pulled 29 sources and extracted 133 claims, then three independent skeptics voted on each of the 25 claims worth checking. Only 9 survived. The verdict killed the idea in minutes. A PM's field notes on using AI to disprove yourself, cheaply.

TL;DR: Before committing weeks to a product idea (“Jira, but the team is a fleet of AI agents, not humans”), I needed competitor research. Instead of opening twenty browser tabs, I ran Claude in deep-research mode. One prompt fanned out into 111 sub-agent runs, about twenty minutes and 3.6M tokens, that searched five angles in parallel, pulled 29 sources, and extracted 133 claims. Then the agents argued with each other: three independent skeptics voted on each of the 25 claims worth verifying. Only 9 claims survived. Those 9 said the thing I wanted to build had already shipped from OpenAI, Microsoft, and Linear in the previous two quarters. The research didn’t help me build faster. It helped me not build at all, in an afternoon, for a few dollars. This is a field note on that workflow, written as a PM, not an engineer.

Every founder-brained idea arrives wearing a halo. Mine was “Jira for AI agents”: the workers aren’t a human scrum team but a fleet of coding agents, with a human supervising. It felt fresh. It felt mine.

The actual job-to-be-done of competitor research is unglamorous and slightly hostile to that feeling: find out who already built this before you sink a month into rebuilding it. Not “is my idea good” — that’s confirmation bias with a search bar. The real question is “what’s the smallest amount of work that proves I’m too late?”

The old way of answering it — a dozen tabs, skim, bookmark, lose the thread, conclude what you already believed — is slow and biased. You stop searching the moment you find something encouraging. The tool I reached for is built to do the opposite.

What “deep research” actually did

Deep-research mode takes one question, decomposes it into angles, then spawns sub-agents for each angle and runs them concurrently. For this question it chose five angles:

  1. Incumbents — who already ships this?
  2. Technical feasibility — can it be built the way I imagined?
  3. Market adoption — does anyone actually want it?
  4. Data model — what’s the right core abstraction for agent-run work?
  5. Cost — what would it take to run?

111 sub-agent runs in total fanned out across those five angles — about twenty minutes of wall-clock time and 3.6 million tokens for a single research question. Each one searched, fetched, and read independently of the others, so no one agent’s early conclusion could anchor the rest.
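
For readers who want the shape of that fan-out, here is a minimal sketch in Python. It is not how Claude's deep-research mode is implemented. The `research_angle` stub, the `fan_out` helper, and the question string are all hypothetical placeholders. It only illustrates the one property that matters here: each run gets the question and its own angle, and never sees another run's output.

```python
import asyncio

# The five angles named in the post.
ANGLES = [
    "Incumbents",
    "Technical feasibility",
    "Market adoption",
    "Data model",
    "Cost",
]


async def research_angle(question: str, angle: str) -> dict:
    """Hypothetical stand-in for one sub-agent run.

    A real run would search, fetch, and read sources. This stub yields
    control once and returns an empty result so the sketch runs offline.
    """
    await asyncio.sleep(0)  # placeholder for network and model latency
    return {"angle": angle, "question": question, "claims": []}


async def fan_out(question: str, angles: list[str]) -> list[dict]:
    # Each coroutine receives only the question and its own angle, never
    # another agent's output, so an early conclusion cannot anchor the rest.
    tasks = [research_angle(question, angle) for angle in angles]
    # gather() runs the coroutines concurrently and keeps input order.
    return await asyncio.gather(*tasks)


# Placeholder question, not the author's actual prompt.
results = asyncio.run(fan_out("Jira for AI agents: who already built this?", ANGLES))
```

In this sketch there is one run per angle. The real session used 111 runs across the same five angles, so each angle was evidently worked by many runs, not one.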

The receipt for one question: a single prompt fanned out into 111 sub-agent runs — roughly twenty minutes and 3.6M tokens, end to end — before returning a cited report.

That breadth is the first thing a single human can’t match. Five angles, each chased several levels deep, simultaneously. I read sequentially; the swarm doesn’t.

But breadth alone would just be a faster way to fool myself. The part that mattered was the second stage. After fetching 29 sources and pulling 133 candidate claims, the system didn’t summarize them — it adversarially verified them. Each important claim was handed to a panel of independent agents prompted to refute it, and they voted. A claim survived only on a strong majority. The log is full of lines like 3-0 ✓ (unanimously confirmed), 0-3 ✗ (unanimously refuted), and 1-2 ✗ (died by majority).
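
The voting step is simple enough to sketch. The code below is an illustration under assumptions, not the tool's actual logic. The names `Tally`, `tally_votes`, `survives`, and `log_line` are made up for this example. The post only says "strong majority," so the `min_confirm` threshold is a guess, kept as a parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    confirm: int
    refute: int

    def label(self) -> str:
        # Same "confirm-refute" shape as the log lines quoted above.
        return f"{self.confirm}-{self.refute}"


def tally_votes(votes: list[bool]) -> Tally:
    """Count confirm (True) and refute (False) votes from independent skeptics."""
    confirm = sum(votes)
    return Tally(confirm=confirm, refute=len(votes) - confirm)


def survives(tally: Tally, min_confirm: int = 2) -> bool:
    # Assumption: the exact threshold is not stated in the post. With three
    # voters, min_confirm=2 and min_confirm=3 agree on every quoted log line.
    return tally.confirm >= min_confirm


def log_line(votes: list[bool], min_confirm: int = 2) -> str:
    tally = tally_votes(votes)
    mark = "✓" if survives(tally, min_confirm) else "✗"
    return f"{tally.label()} {mark}"


print(log_line([True, True, True]))     # 3-0 ✓
print(log_line([False, False, False]))  # 0-3 ✗
print(log_line([True, False, False]))   # 1-2 ✗
```

The one case the quoted log lines leave open is a 2-1 split. Whether that counts as a "strong majority" decides how strict the filter is, which is why the threshold is exposed instead of hard-coded.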

Out of 133 claims, 25 were worth verifying, and 9 survived the vote. 16 were killed. The killed ones weren’t junk — many were plausible. That’s exactly why a normal summary is dangerous: it would have served me the plausible-but-wrong claims with the same confident tone as the true ones.
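
The funnel is worth checking by hand. This small Python snippet uses only the numbers stated above. The variable names are mine, and the "never voted on" figure is simply 133 minus 25, assuming the other claims were not put to a vote.

```python
extracted = 133  # candidate claims pulled from the 29 sources
checked = 25     # claims judged worth verifying
survived = 9     # claims that passed the vote

killed = checked - survived           # 16, matching the post
never_voted_on = extracted - checked  # 108, assuming these skipped the vote

print(f"killed: {killed}")
print(f"never voted on: {never_voted_on}")
print(f"survival rate among checked claims: {survived / checked:.0%}")
print(f"survivors as a share of all extracted claims: {survived / extracted:.1%}")
```

Roughly a third of the checked claims held up, and fewer than one in ten of everything extracted.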

The verdict: a red ocean I couldn’t see from my desk

The 9 survivors converged on one uncomfortable fact: the “command center for a fleet of AI agents” I wanted to build had already shipped — three times, from incumbents, in the two quarters before I had the idea.

  • OpenAI’s Codex app positioned itself, in its own words, as a command center for parallel agent work.
  • VS Code / GitHub Copilot shipped an Agent Sessions view — one place to see every agent session, local or cloud — to general availability.
  • Linear, a leading issue tracker, had already made agents first-class workspace members that get assigned issues and report status, with a built-in “awaiting input” state for human approval.

My differentiator was real but narrow: a philosophy (agent-first, designed to reduce human-in-the-loop over time) rather than a technical moat. Worth knowing in an afternoon. Not worth discovering after a month of building.

Just as important was what the research refused to claim. Most of the technical-feasibility, market-size, and cost angles came back unverified — not enough independent confirmation to survive the vote. A lesser tool would have smoothed over those gaps. This one told me, explicitly, where the floor was missing.

Three lessons for PMs using AI as a research instrument

1. The value is subtractive. The best output of competitor research is a confident no. AI made the no cheap: 133 claims down to 9, in minutes, for a few dollars in tokens. A swarm that disproves you fast is worth more than a copilot that helps you build the wrong thing fast.

2. Adversarial verification beats more searching. Throwing more agents at finding things just amplifies noise. The leverage was making agents attack each finding and counting the survivors. If you take one mechanism from this, take the vote — not the fan-out.

3. “I couldn’t verify this” is a feature, not a failure. A research tool that maps its own ignorance is more trustworthy than one that fills every gap with fluent prose. Knowing that four of my five angles — feasibility, market, data model, and cost — came back unproven, while only the incumbent angle was verified, told me precisely where I’d be flying blind if I built anyway.

The meta-point

We talk about AI as an accelerator — write faster, ship faster, build faster. The afternoon that mattered most to me recently was the opposite: AI as a brake. One prompt, ~30 agents, an adversarial vote, and a well-cited answer that talked me out of a month of work.

The most valuable thing an AI can hand a product manager isn’t a feature. It’s a “no” you can trust.