Analysis · The Cloudbase Journal

Which "AI agents" actually ship — a 30-day test

We wired up 12 agent platforms to the same support-ticket workload. Four didn't complete a single task without intervention. Here's the scoreboard.

Sarah Jain · Senior Reviewer · 8 min read

"Agent" became a marketing term in 2025. In 2026 it's still a marketing term — for most of the field. To separate the shippers from the demos, we built one test and ran every platform through it.

The test

300 anonymized inbound support tickets from a real B2B SaaS company, a mix of billing, onboarding, bug reports, and policy questions. Each agent was given identical tools: a ticketing API, a knowledge-base search, a refund endpoint (capped at $50), and an escalation action. Success = ticket closed with a response the QA team rated "acceptable or better."
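To make the setup concrete, here is a minimal sketch of what a shared tool surface like this could look like as plain Python stubs. Every name, signature, and return shape below is an assumption for illustration; the platforms were wired to a real ticketing stack, not this code. Only the $50 refund cap comes from the test itself.

```python
# Hypothetical sketch of the shared tool surface each agent received.
# Names, signatures, and return shapes are illustrative assumptions,
# not any vendor's actual API. Only the $50 cap is from the test.
from dataclasses import dataclass

REFUND_CAP_USD = 50.00  # refunds above this should be escalated, not issued

@dataclass
class Ticket:
    ticket_id: str
    category: str   # "billing", "onboarding", "bug report", "policy"
    body: str

def search_knowledge_base(query: str) -> list[str]:
    """Return the top knowledge-base passages for a query (stubbed)."""
    raise NotImplementedError

def issue_refund(ticket_id: str, amount_usd: float) -> bool:
    """Issue a refund, enforcing the cap the agents were given."""
    if amount_usd > REFUND_CAP_USD:
        return False  # over the cap: the agent should escalate instead
    # ... call the billing system here ...
    return True

def escalate(ticket_id: str, reason: str) -> None:
    """Hand the ticket to a human queue; counts against the close rate."""
    ...

def close_ticket(ticket_id: str, response: str) -> None:
    """Close the ticket with a customer-facing reply; QA rates it afterward."""
    ...
```

The point of holding the tool surface constant is that any difference in close rate comes from the agent's reasoning, not from one platform having better plumbing than another.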

The scoreboard

Top finishers, ranked by unassisted close rate (tickets closed without human intervention) over 30 days:

  1. Claude Agents (Anthropic). 78% close rate. Strongest at policy questions; weakest at multi-step refund flows.
  2. Cursor Composer 3 (agent mode). 71%. Surprising result — code-first tool generalized well when given clear tool definitions.
  3. OpenAI Assistants v3. 69%. Fastest responses, but highest "premature close" rate.
  4. Lindy. 61%. Best onboarding of the bunch. Ran out of budget on token-heavy escalations.
  5. Relevance AI. 58%. Great UI; brittle when ticket language drifted from training examples.

The four that failed

Four platforms closed zero tickets without human intervention in the first week. We won't name them here — each got a 30-day recovery window and two are now on the scoreboard after shipping real fixes. This is why we re-audit.

What separates shippers from demos

Cost per closed ticket

This is the number finance cares about. Claude Agents came in at $0.41 per closed ticket. OpenAI at $0.38. The cheapest-per-call platform ended up at $1.92 because it retried constantly.
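The metric itself is just total spend divided by tickets closed without help, but it's worth writing down, because per-call pricing hides retries. In the sketch below, the inputs (per-call prices, retry counts, close counts) are illustrative assumptions, not measured figures from the audit:

```python
# Cost per closed ticket: total model-and-tool spend divided by tickets
# closed without human intervention. Inputs below are illustrative
# assumptions, not the audit's measured numbers.
def cost_per_closed_ticket(total_spend_usd: float, tickets_closed: int) -> float:
    return total_spend_usd / tickets_closed

# Cheap per call, but retries each ticket ~8 times and closes few of them:
print(cost_per_closed_ticket(0.02 * 8 * 300, 25))   # 1.92
# Pricier per call, few retries, closes most of the 300 tickets:
print(cost_per_closed_ticket(0.32 * 300, 234))      # ~0.41
```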

Takeaway

If you're buying an agent platform today, run your own eval on 100+ real tickets before signing an annual contract. The gap between marketing pages and production behavior is bigger in this category than anywhere else in AI.
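That eval doesn't need to be elaborate. Here's a minimal harness sketch, assuming you've exported your own tickets and can call the candidate platform; agent_respond and qa_acceptable are placeholders you would supply, not any vendor's API:

```python
# Minimal eval-harness sketch for a buy decision: close rate and cost per
# close over your own tickets. agent_respond() and qa_acceptable() are
# placeholders for the platform under test and your own QA review.
def run_eval(tickets, agent_respond, qa_acceptable):
    closed, total_cost = 0, 0.0
    for ticket in tickets:
        result = agent_respond(ticket)  # e.g. {"response": str, "escalated": bool, "cost_usd": float}
        total_cost += result["cost_usd"]
        if not result["escalated"] and qa_acceptable(ticket, result["response"]):
            closed += 1
    return {
        "close_rate": closed / len(tickets),
        "cost_per_closed_ticket": total_cost / closed if closed else float("inf"),
    }
```

Two numbers out of that loop, close rate and cost per closed ticket, are enough to compare a candidate against the scoreboard above and against its own marketing page.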

For methodology, see the 7-criteria rubric. For the full agent shortlist, see our Best-of lists.