Analysis · The Cloudbase Journal

Which "AI agents" actually ship — a 30-day test

We wired up 12 agent platforms to the same support-ticket workload. Four didn't complete a single task without intervention. Here's the scoreboard.

Sarah Jain · Senior Reviewer · 8 min read

"Agent" became a marketing term in 2025. In 2026 it's still a marketing term — for most of the field. To separate the shippers from the demos, we built one test and ran every platform through it.

The test

300 anonymized inbound support tickets from a real B2B SaaS company, a mix of billing, onboarding, bug reports, and policy questions. Each agent was given identical tools: a ticketing API, a knowledge-base search, a refund endpoint (capped at $50), and an escalation action. Success = ticket closed with a response the QA team rated "acceptable or better."
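To make the setup concrete, here is a minimal sketch of what a shared tool surface like this could look like as plain Python stubs. Every name, signature, and return shape below is an assumption for illustration; the platforms were wired to a real ticketing stack, not this code. Only the $50 refund cap comes from the test itself.

```python
# Hypothetical sketch of the shared tool surface each agent received.
# Names, signatures, and return shapes are illustrative assumptions,
# not any vendor's actual API. Only the $50 cap is from the test.
from dataclasses import dataclass

REFUND_CAP_USD = 50.00  # refunds above this should be escalated, not issued

@dataclass
class Ticket:
    ticket_id: str
    category: str   # "billing", "onboarding", "bug report", "policy"
    body: str

def search_knowledge_base(query: str) -> list[str]:
    """Return the top knowledge-base passages for a query (stubbed)."""
    raise NotImplementedError

def issue_refund(ticket_id: str, amount_usd: float) -> bool:
    """Issue a refund, enforcing the cap the agents were given."""
    if amount_usd > REFUND_CAP_USD:
        return False  # over the cap: the agent should escalate instead
    # ... call the billing system here ...
    return True

def escalate(ticket_id: str, reason: str) -> None:
    """Hand the ticket to a human queue; counts against the close rate."""
    ...

def close_ticket(ticket_id: str, response: str) -> None:
    """Close the ticket with a customer-facing reply; QA rates it afterward."""
    ...
```

The point of holding the tool surface constant is that any difference in close rate comes from the agent's reasoning, not from one platform having better plumbing than another.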

The scoreboard

Top finishers, ranked by unassisted close rate (tickets closed without human intervention) over 30 days:

  1. Claude Agents (Anthropic). 78% close rate. Strongest at policy questions; weakest at multi-step refund flows.
  2. Cursor Composer 3 (agent mode). 71%. Surprising result — code-first tool generalized well when given clear tool definitions.
  3. OpenAI Assistants v3. 69%. Fastest responses, but highest "premature close" rate.
  4. Lindy. 61%. Best onboarding of the bunch. Ran out of budget on token-heavy escalations.
  5. Relevance AI. 58%. Great UI; brittle when ticket language drifted from training examples.

The four that failed

Four platforms closed zero tickets without human intervention in the first week. We won't name them here — each got a 30-day recovery window and two are now on the scoreboard after shipping real fixes. This is why we re-audit.

What separates shippers from demos

Cost per closed ticket

This is the number finance cares about. Claude Agents came in at $0.41 per closed ticket. OpenAI at $0.38. The cheapest-per-call platform ended up at $1.92 because it retried constantly.
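The metric itself is just total spend divided by tickets closed without help, but it's worth writing down, because per-call pricing hides retries. In the sketch below, the inputs (per-call prices, retry counts, close counts) are illustrative assumptions, not measured figures from the audit:

```python
# Cost per closed ticket: total model-and-tool spend divided by tickets
# closed without human intervention. Inputs below are illustrative
# assumptions, not the audit's measured numbers.
def cost_per_closed_ticket(total_spend_usd: float, tickets_closed: int) -> float:
    return total_spend_usd / tickets_closed

# Cheap per call, but retries each ticket ~8 times and closes few of them:
print(cost_per_closed_ticket(0.02 * 8 * 300, 25))   # 1.92
# Pricier per call, few retries, closes most of the 300 tickets:
print(cost_per_closed_ticket(0.32 * 300, 234))      # ~0.41
```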

Takeaway

If you're buying an agent platform today, run your own eval on 100+ real tickets before signing an annual contract. The gap between marketing pages and production behavior is bigger in this category than anywhere else in AI.
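That eval doesn't need to be elaborate. Here's a minimal harness sketch, assuming you've exported your own tickets and can call the candidate platform; agent_respond and qa_acceptable are placeholders you would supply, not any vendor's API:

```python
# Minimal eval-harness sketch for a buy decision: close rate and cost per
# close over your own tickets. agent_respond() and qa_acceptable() are
# placeholders for the platform under test and your own QA review.
def run_eval(tickets, agent_respond, qa_acceptable):
    closed, total_cost = 0, 0.0
    for ticket in tickets:
        result = agent_respond(ticket)  # e.g. {"response": str, "escalated": bool, "cost_usd": float}
        total_cost += result["cost_usd"]
        if not result["escalated"] and qa_acceptable(ticket, result["response"]):
            closed += 1
    return {
        "close_rate": closed / len(tickets),
        "cost_per_closed_ticket": total_cost / closed if closed else float("inf"),
    }
```

Two numbers out of that loop, close rate and cost per closed ticket, are enough to compare a candidate against the scoreboard above and against its own marketing page.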

For methodology, see the 7-criteria rubric. For the full agent shortlist, see our Best-of lists.