Which "AI agents" actually ship — a 30-day test
We wired up 12 agent platforms to the same support-ticket workload. Four didn't complete a single task without intervention. Here's the scoreboard.
"Agent" became a marketing term in 2025. In 2026 it's still a marketing term — for most of the field. To separate the shippers from the demos, we built one test and ran every platform through it.
The test
300 anonymized inbound support tickets from a real B2B SaaS company: a mix of billing, onboarding, bug reports, and policy questions. Each agent was given identical tools: a ticketing API, knowledge-base search, a refund endpoint (capped at $50), and an escalation action. Success = ticket closed with a response the QA team rated "acceptable or better."
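The tool surface each agent saw can be sketched as a set of JSON-style definitions. The names, parameter shapes, and the guard function below are our illustration, not any vendor's actual schema:

```python
# Illustrative tool definitions for the support-ticket eval.
# Names and parameters are assumptions, not a platform's real schema.
TOOLS = [
    {
        "name": "ticket_update",
        "description": "Read or update a ticket via the ticketing API.",
        "parameters": {"ticket_id": "string", "status": "string", "reply": "string"},
    },
    {
        "name": "kb_search",
        "description": "Search the knowledge base for relevant articles.",
        "parameters": {"query": "string", "top_k": "integer"},
    },
    {
        "name": "issue_refund",
        "description": "Issue a refund, hard-capped at $50.",
        "parameters": {"ticket_id": "string", "amount_usd": "number"},
    },
    {
        "name": "escalate",
        "description": "Hand off to a human with a summary of work so far.",
        "parameters": {"ticket_id": "string", "summary": "string"},
    },
]

def check_refund(amount_usd: float) -> bool:
    """Server-side guard: reject refunds outside (0, 50]."""
    return 0 < amount_usd <= 50.0
```

The $50 cap was enforced server-side in our harness, not trusted to the agent, so a hallucinated refund amount could never clear.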
The scoreboard
Top finishers, ranked by unassisted close rate over 30 days:
- Claude Agents (Anthropic). 78% close rate. Strongest at policy questions; weakest at multi-step refund flows.
- Cursor Composer 3 (agent mode). 71%. Surprising result — code-first tool generalized well when given clear tool definitions.
- OpenAI Assistants v3. 69%. Fastest responses, but highest "premature close" rate.
- Lindy. 61%. Best onboarding of the bunch. Ran out of budget on token-heavy escalations.
- Relevance AI. 58%. Great UI; brittle when ticket language drifted from training examples.
The four that failed
Four platforms closed zero tickets without human intervention in the first week. We won't name them here — each got a 30-day recovery window and two are now on the scoreboard after shipping real fixes. This is why we re-audit.
What separates shippers from demos
- Tool-use stability. The winners emitted correctly formatted JSON tool calls 99%+ of the time. The losers crashed on their own malformed calls instead of retrying or escalating.
- Memory windows that actually persist. Three platforms "forgot" the customer mid-conversation on ticket #42 of the day.
- Honest escalation. The best agents escalated cleanly with a summary for the human. The worst hallucinated a resolution.
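Most of the tool-use stability gap comes down to validating agent output before executing it. A minimal sketch of the guard we'd expect every platform to run (the call format and function names here are hypothetical):

```python
import json

# Required parameters per tool; illustrative, mirroring our test harness.
REQUIRED_PARAMS = {
    "issue_refund": {"ticket_id", "amount_usd"},
    "escalate": {"ticket_id", "summary"},
}

def parse_tool_call(raw: str):
    """Return (tool_name, args) for a well-formed call, or None.

    Guards the two failure modes we saw: malformed JSON, and valid
    JSON that names an unknown tool or omits required parameters.
    A None result should trigger a retry or escalation, never a crash.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    name = call.get("tool")
    args = call.get("args", {})
    required = REQUIRED_PARAMS.get(name)
    if required is None or not isinstance(args, dict) or not required.issubset(args):
        return None
    return name, args
```

The winning platforms all did something equivalent to this internally; the losers passed raw model output straight to the tool layer.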
Cost per closed ticket
This is the number finance cares about. Claude Agents came in at $0.41 per closed ticket. OpenAI at $0.38. The cheapest-per-call platform ended up at $1.92 because it retried constantly.
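The metric is simple but easy to get wrong: divide total spend, including retries and failed attempts, by tickets actually closed, not by API calls made. A sketch with illustrative numbers (the per-call price and call counts are made up to show how retries inflate the figure):

```python
def cost_per_closed_ticket(total_spend_usd: float, closed: int) -> float:
    """All spend, retries included, divided by tickets actually closed."""
    return total_spend_usd / closed

# A platform that looks cheap per call but retries constantly:
# 4,800 calls at $0.10/call to close 250 of 300 tickets (illustrative).
spend = 4800 * 0.10          # $480 total
print(round(cost_per_closed_ticket(spend, 250), 2))  # 1.92
```

This is how a "cheapest per call" platform lands at $1.92 per closed ticket while a pricier model that closes on the first pass comes in under $0.50.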
Takeaway
If you're buying an agent platform today, run your own eval on 100+ real tickets before signing annual. The gap between marketing pages and production behavior is bigger in this category than anywhere else in AI.
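A bare-bones version of that eval, assuming you supply a callable wrapper around the candidate platform and a human QA pass (`run_agent` and `qa_rate` below are placeholders, not any vendor's SDK):

```python
def run_eval(tickets, run_agent, qa_rate):
    """Score one platform on real tickets.

    run_agent(ticket) -> dict with "closed" (bool), "response" (str),
    and "cost_usd" (float). qa_rate(response) -> True if a human QA
    reviewer rates it "acceptable or better". Both are stand-ins you
    implement for the platform under test.
    """
    closed, spend = 0, 0.0
    for ticket in tickets:
        result = run_agent(ticket)
        spend += result["cost_usd"]
        if result["closed"] and qa_rate(result["response"]):
            closed += 1
    return {
        "close_rate": closed / len(tickets),
        "cost_per_closed": spend / closed if closed else float("inf"),
    }
```

Run it on the same 100+ tickets for every platform you're considering; the ranking it produces is the one that will hold in production, not the one on the marketing page.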
For methodology, see the 7-criteria rubric. For the full agent shortlist, see our Best-of lists.