
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

AI agents falter hard on VAKRA, a new benchmark from IBM Research that tests real-world enterprise workflows.

AI agents falter hard on VAKRA, a new benchmark from IBM Research that tests real-world enterprise workflows. Top models score under 20% on tasks requiring 3-7 steps of API chaining and document retrieval. This exposes a core weakness: current systems can’t reliably compose tools in complex, domain-specific environments. Enterprises betting on agents for automation face high failure risks without fixes.

VAKRA simulates enterprise setups with over 8,000 locally hosted APIs across 62 domains, backed by real databases. Agents interact via natural-language instructions, chaining tools while retrieving from aligned documents. Unlike toy benchmarks, it runs full execution traces to verify outcomes—no credit for plausible reasoning without results. Released April 2026, it includes a leaderboard, dataset, and GitHub repo for submissions.

Breaking Down the Tasks

VAKRA splits into four capabilities, each hitting distinct agent pain points. First up: API chaining with business intelligence tools from expanded SLOT-BIRD and SEL-BIRD collections (Elder et al., 2026). This covers 2,077 test instances over 54 domains, demanding 1-12 sequential calls per task.

Take this example query: “Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?” The ground-truth solution chains filters on a JSON dataset:

{
  "query": "Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?",
  "tool_calls": [
    {
      "name": "get_data",
      "arguments": {"tool_universe_id": "486ea46224d1-aeb8037c5e78"},
      "label": "retrieved_data_1"
    },
    {
      "name": "select_data_equal_to",
      "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31},
      "label": "FILTERED_DF_0"
    },
    {
      "name": "select_data_equal_to",
      "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53},
      "label": "FILTERED_DF_1"
    },
    {
      "name": "select_data_equal_to",
      "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32},
      "label": "FILTERED_DF_2"
    },
    {
      "name": "get_team_name",
      "arguments": {"data_label": "FILTERED_DF_2", "n": 1}
    }
  ],
  "answer": "FC Barcelona"
}

Agents must fetch data, filter step-by-step, then extract the answer. Simple in theory, but models often drop steps or hallucinate filters.
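The trace above can be replayed against a toy dataset to make the chaining concrete. Everything here is a stand-in: the two-row `teams` table and the three function bodies are illustrative assumptions, not VAKRA's hosted implementations of `get_data`, `select_data_equal_to`, or `get_team_name`.

```python
# Toy replay of the ground-truth trace. The dataset and tool bodies are
# invented for illustration; only the call sequence mirrors the JSON above.
teams = [
    {"team": "FC Barcelona", "play_speed": 31, "play_dribble": 53, "play_passing": 32},
    {"team": "Real Madrid", "play_speed": 45, "play_dribble": 60, "play_passing": 70},
]

labels = {}  # named intermediate results, mirroring the trace's "label" field

def get_data(tool_universe_id):
    # Fetch the full dataset for this (toy) tool universe.
    return list(teams)

def select_data_equal_to(data_label, key_name, value):
    # Filter a previously labeled result on one key/value pair.
    return [row for row in labels[data_label] if row[key_name] == value]

def get_team_name(data_label, n):
    # Extract the n-th team name from a labeled result (1-indexed).
    return labels[data_label][n - 1]["team"]

labels["retrieved_data_1"] = get_data("486ea46224d1-aeb8037c5e78")
labels["FILTERED_DF_0"] = select_data_equal_to("retrieved_data_1", "play_speed", 31)
labels["FILTERED_DF_1"] = select_data_equal_to("FILTERED_DF_0", "play_dribble", 53)
labels["FILTERED_DF_2"] = select_data_equal_to("FILTERED_DF_1", "play_passing", 32)
answer = get_team_name("FILTERED_DF_2", 1)
print(answer)  # FC Barcelona
```

Each filter narrows the labeled intermediate from the previous step, which is exactly the state agents lose mid-sequence.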

Other capabilities likely mix in document retrieval and hybrid reasoning—VAKRA stresses natural-language tool constraints, mimicking messy enterprise prompts. Full MCP servers host tools like get_data, ensuring grounded execution.
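That execution model can be sketched generically. This is a hedged illustration assuming VAKRA's JSON trace shape (`name`, `arguments`, `label`); `run_trace` and its registry are hypothetical helpers, not the benchmark's actual harness or MCP server.

```python
# Hypothetical trace executor: given a registry of callable tools and a list
# of tool_calls in the JSON shape shown earlier, run each call in order,
# store labeled intermediates, and return the final result.
def run_trace(tool_calls, registry):
    labels = {}
    result = None
    for call in tool_calls:
        fn = registry[call["name"]]  # tool misselection would fail right here
        args = dict(call["arguments"])
        # Resolve any argument that names an earlier labeled intermediate.
        for key, value in args.items():
            if isinstance(value, str) and value in labels:
                args[key] = labels[value]
        result = fn(**args)
        if "label" in call:
            labels[call["label"]] = result  # keep state for later steps
    return result
```

The label-resolution step is the fragile part: an agent that drops or misnames one intermediate breaks every downstream call.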

Failure Modes and Real-World Implications

Models bomb across the board. Leaderboard toppers like GPT-4o and Claude 3.5 Sonnet hit single digits on chaining tasks, per early results. Common fails: premature answers without full chains, tool misselection (e.g., wrong universe ID), or state loss mid-sequence. Retrieval tasks amplify errors—agents retrieve irrelevant docs or ignore APIs altogether.

Why this matters: Enterprises run on API-heavy workflows—finance queries across ledgers, supply chain traces, HR data pulls. VAKRA mirrors that scale, unlike narrow benchmarks like ToolBench (focused on 20k tools) or AgentBench (simpler sims). Poor scores signal agents aren’t production-ready. A 10% success rate means 90% manual intervention, killing ROI.

Fair credit: VAKRA advances evaluation by executing traces, not just parsing outputs. It scales domains realistically (62 vs. typical 10-20). But skepticism warranted—hosted APIs sidestep auth/compliance headaches; real enterprises add latency, rate limits, errors. Still, it spotlights the gap: reasoning depth lags tool fluency.

Fixes? Fine-tune on execution traces, add better memory (e.g., vector stores for intermediate results), or keep a human in the loop. Watch the leaderboard; new submissions could shift baselines. For now, deploy agents narrowly. VAKRA shows broad autonomy remains hype.
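One of those fixes, a memory for intermediates, can be sketched minimally. This toy `IntermediateStore` (a hypothetical class) stands in for a real vector store: bag-of-words vectors and cosine similarity replace an actual embedding model, which is the loud assumption here.

```python
# Toy memory for intermediate results: store each step's output under a
# bag-of-words "embedding" of its description, then retrieve the closest
# match when a later step needs it. Illustrative only; a production agent
# would use a real encoder and vector database.
from collections import Counter
import math

class IntermediateStore:
    def __init__(self):
        self.items = []  # list of (description, word-count vector, value)

    @staticmethod
    def _embed(text):
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def put(self, description, value):
        self.items.append((description, self._embed(description), value))

    def get(self, query):
        # Return the stored value whose description best matches the query.
        q = self._embed(query)
        best = max(self.items, key=lambda item: self._cosine(q, item[1]))
        return best[2]
```

The point is architectural: if intermediates survive outside the context window, a dropped step is recoverable instead of fatal.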

April 15, 2026 · 3 min · 6 views · Source: Hugging Face
