ai-engineering · agentic-loops · debugging

Why Our AI Robots Kept Standing Idle — And How We Fixed the Scan→Act Loop

How a 2-turn limit caused Claude and Gemini robots to exhaust all their reasoning on observation, producing zero combat actions. The fix required rethinking our entire agentic loop.

AI Arena Team
February 15, 2026 · 10 min read


Early playtesting of AI Arena surfaced a maddening bug. Matches between two Claude Sonnet robots would begin, the combat engine would tick forward — and both robots would just stand there. No movement. No missiles. Nothing. Two expensive LLM calls per tick, zero actions produced.

We stared at the logs for a long time before we understood what was actually happening.

The Setup: Tool-Calling AI Pilots

Every robot in AI Arena is piloted by an AI client that receives a structured prompt describing the current battlefield state and must respond with a sequence of tool calls representing actions — move_forward, fire_missile, activate_shield, and so on.

The interaction follows a standard agentic tool-use loop:

# Simplified — the real loop is in ai_engine/clients/anthropic_client.py
async def generate_plan(self, state: CombatState, context: str) -> list[Action]:
    messages = [{"role": "user", "content": build_prompt(state, context)}]
    tools = get_full_tool_catalog()  # scan + action tools combined

    for turn in range(self._max_turns):
        response = await self.client.messages.create(
            model=self.model,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            break

        # Process tool calls, add results back to messages
        tool_results = await execute_tool_calls(response.content)
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    return extract_actions(messages)

We gave the AI two types of tools: scan tools (read-only observation) and action tools (write combat commands). The idea was that a good AI pilot would scan the battlefield, reason about what it sees, then act.
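As a rough sketch of what that split looks like in practice (the tool names come from the post; the schemas and the internal "category" tag are our own illustration, not the real AI Arena catalog), each entry can carry a category so the loop can tell scans from actions:

```python
# Hypothetical tool catalog. Tool names match the post; schemas are
# illustrative. The "category" key is an internal tag, not part of the
# Anthropic tool spec, so it must be stripped before an API call.

SCAN_TOOLS = [
    {
        "name": "scan_battlefield",
        "description": "Map obstacle positions on the current grid.",
        "input_schema": {"type": "object", "properties": {}},
        "category": "scan",
    },
    {
        "name": "scan_opponent",
        "description": "Report the opponent's distance and orientation.",
        "input_schema": {"type": "object", "properties": {}},
        "category": "scan",
    },
]

ACTION_TOOLS = [
    {
        "name": "fire_missile",
        "description": "Fire a missile at a target cell.",
        "input_schema": {
            "type": "object",
            "properties": {
                "target_x": {"type": "integer"},
                "target_y": {"type": "integer"},
                "angle": {"type": "number"},
            },
            "required": ["target_x", "target_y", "angle"],
        },
        "category": "action",
    },
]

def get_full_tool_catalog() -> list[dict]:
    # Scan + action tools combined, as in the loop above.
    return SCAN_TOOLS + ACTION_TOOLS

def api_tools(tools: list[dict]) -> list[dict]:
    # Drop the internal "category" key before sending tools to the API.
    return [{k: v for k, v in t.items() if k != "category"} for t in tools]
```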

What we didn't account for was how differently models of different capability levels would use those turns.

The Bug: Smarter Models Reason More

Our original _MAX_TURNS (read by the loop above as self._max_turns) was 2. We chose this value because early testing with Claude Haiku worked perfectly: one scan, a decision, and actions fired on the second turn. Fast, efficient, sensible.

Then we added Sonnet and Opus to the roster.

Looking at the raw API responses, the pattern became obvious:

Turn 1 — Haiku:

tool_use: scan_self_status
tool_use: fire_missile(target_x=18, target_y=12, angle=45)

Turn 1 — Sonnet:

tool_use: scan_self_status
tool_use: scan_battlefield
tool_use: scan_opponent
tool_use: scan_wind_conditions

(stop_reason: tool_use — wants another turn)

Turn 2 — Sonnet:

tool_use: scan_battlefield  ← still scanning! analyzing the results

(stop_reason: tool_use again — but the 2-turn budget is exhausted, so the loop exits with zero actions)

Sonnet was building a complete mental model of the battlefield before committing to any action. It called scan_self_status to check its own HP and energy, scan_battlefield to map obstacle positions, scan_opponent to assess threat distance and orientation, and scan_wind_conditions to calculate ballistic adjustments. By the time it had gathered all that information, our 2-turn budget was exhausted.

The AI wasn't malfunctioning. It was being thorough. We were penalizing intelligence.

The Investigation: Token Patterns Don't Lie

We added structured logging to capture full token usage per turn:

logger.info(
    "ai_turn_complete",
    extra={
        "match_id": state.match_id,
        "robot_id": robot.id,
        "model": self.model,
        "turn": turn_number,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "tool_calls": [t.name for t in tool_calls],
        "stop_reason": response.stop_reason,
    }
)

After a week of matches, the pattern was crystal clear:

| Model  | Avg scan turns | Avg action turn      | Actions produced  |
|--------|----------------|----------------------|-------------------|
| Haiku  | 1              | 1                    | 2.1 avg           |
| Sonnet | 2–3            | 1 (if budget allows) | 0 at 2-turn limit |
| Opus   | 3–4            | 1 (if budget allows) | 0 at 2-turn limit |

Haiku is decisive by nature — it makes quick judgments with limited context. Sonnet and Opus are deliberate. They gather more signal before acting, which in real-world use makes them better pilots. But our 2-turn cap was turning their thoroughness into paralysis.
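Getting from raw log lines to per-model numbers like these is a small aggregation job. A sketch (field names follow the ai_turn_complete log call above; the scan_ prefix rule for classifying turns is our assumption):

```python
import json
from collections import defaultdict

def summarize(log_lines: list[str]) -> dict[str, dict[str, int]]:
    """Count scan-only turns vs. action turns per model from JSON log lines."""
    stats: dict[str, dict[str, int]] = defaultdict(
        lambda: {"scan_turns": 0, "action_turns": 0}
    )
    for line in log_lines:
        rec = json.loads(line)
        # A turn counts as an action turn if any non-scan tool was called.
        if any(not name.startswith("scan_") for name in rec["tool_calls"]):
            stats[rec["model"]]["action_turns"] += 1
        elif rec["tool_calls"]:
            stats[rec["model"]]["scan_turns"] += 1
    return dict(stats)
```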

The Fix: Dynamic Tool Stripping

Increasing the turn limit alone wasn't enough. A naive bump to _MAX_TURNS = 10 would let a misbehaving model loop indefinitely through scan calls without ever producing actions — burning tokens and stalling the match.

We needed a mechanism to guarantee an action on the final turn. The solution: strip scan tools from the tool catalog on the last allowed turn, leaving only action tools available.

_MAX_TURNS = 5  # 4 scan rounds + 1 forced action

async def generate_plan(self, state: CombatState, context: str) -> list[Action]:
    messages = [{"role": "user", "content": build_prompt(state, context)}]

    for turn in range(_MAX_TURNS):
        is_final_turn = (turn == _MAX_TURNS - 1)

        # On the final turn, only offer action tools.
        # The model physically cannot scan anymore — it must act.
        tools = get_action_only_tools() if is_final_turn else get_full_tool_catalog()

        if is_final_turn:
            # Nudge the model to wrap up
            messages.append({
                "role": "user",
                "content": "You have completed your reconnaissance. Issue your combat commands now."
            })

        response = await self.client.messages.create(
            model=self.model,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            break

        tool_results = await execute_tool_calls(response.content)
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

        # If the model already produced actions before the final turn, we're done
        if has_action_calls(response.content):
            break

    return extract_actions(messages)
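The two helpers this loop leans on, has_action_calls and extract_actions, can be as simple as scanning content blocks for action-tool calls. A sketch — we model content blocks as plain dicts here, whereas the real Anthropic SDK returns block objects with .type and .name attributes, and the action-tool list is illustrative:

```python
ACTION_TOOL_NAMES = {"move_forward", "fire_missile", "activate_shield"}

def has_action_calls(content: list[dict]) -> bool:
    """True if any block in an assistant response calls an action tool."""
    return any(
        block.get("type") == "tool_use" and block.get("name") in ACTION_TOOL_NAMES
        for block in content
    )

def extract_actions(messages: list[dict]) -> list[dict]:
    """Pull every action tool_use out of the accumulated assistant turns."""
    actions = []
    for msg in messages:
        if msg["role"] != "assistant":
            continue
        for block in msg["content"]:
            if block.get("type") == "tool_use" and block.get("name") in ACTION_TOOL_NAMES:
                actions.append({"name": block["name"], "input": block.get("input", {})})
    return actions
```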

The key insight is that tool availability shapes model behavior. When scan_battlefield simply doesn't exist in the tools list, the model can't call it. It will look at what's available — move_forward, fire_missile, activate_shield — and produce actions instead.

This isn't a hack. It's the standard pattern for constrained agentic loops: use tool availability as a behavioral lever. The model is always acting in good faith — we're just adjusting what options it has.
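get_action_only_tools itself can be a one-line filter over the catalog. A self-contained sketch, assuming (as the tool names in this post suggest) that every read-only tool follows the scan_ prefix convention; the minimal catalog below is illustrative:

```python
# Hypothetical minimal catalog; real entries also carry descriptions
# and input schemas.
FULL_CATALOG = [
    {"name": "scan_battlefield"},
    {"name": "scan_opponent"},
    {"name": "move_forward"},
    {"name": "fire_missile"},
    {"name": "activate_shield"},
]

def get_action_only_tools(catalog: list[dict] = FULL_CATALOG) -> list[dict]:
    # Drop every read-only scan tool; whatever remains is what the
    # model must use, so it has no choice but to act.
    return [t for t in catalog if not t["name"].startswith("scan_")]
```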

The Result: Models Use Exactly What They Need

After deploying the fix, we saw each model naturally settle into its own rhythm:

Haiku  →  scan(1) → act(2)          — fast, decisive
Sonnet →  scan(1) → scan(2) → act(3)  — thorough, still quick
Opus   →  scan(1) → scan(2) → scan(3) → act(4)  — methodical, deep

No more idle robots. Every match tick now produces actions from all AI models. The quality of those actions also improved — Sonnet and Opus robots began reading wind conditions and opponent position before firing, producing more accurate shots.

The Cost Question

Expanding from 2 turns to 5 turns sounds expensive — 2.5x more API calls. In practice, the cost increase was minimal for two reasons.

First, most models still finish in 2–3 turns. The 5-turn maximum is a ceiling, not a default. We rarely hit turn 4 or 5 except with Opus in complex battlefield states.
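Back-of-envelope, using the typical turn counts from the rhythm diagram above and assuming, purely for illustration, an even mix of the three models:

```python
# Typical turns per planning cycle after the fix, taken from the
# observed rhythms above (Haiku 2, Sonnet 3, Opus 4).
typical_turns = {"haiku": 2, "sonnet": 3, "opus": 4}

# With an even model mix, the average cycle makes 3 API calls:
# a 1.5x increase over the old 2-turn cap, not the 2.5x that the
# 5-turn ceiling suggests.
avg_calls = sum(typical_turns.values()) / len(typical_turns)
multiplier = avg_calls / 2  # old cap was 2 turns per cycle
print(avg_calls, multiplier)  # 3.0 1.5
```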

Second, prompt caching absorbs most of the repeated context. The battlefield state prompt is large (serialized grid + robot stats + history), and it doesn't change between scan turns within a single planning cycle. Anthropic's prompt caching covers roughly 90% of input tokens on turns 2+, making the incremental cost of each additional scan turn very small.
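Opting in looks roughly like this with the Anthropic Messages API: tag the large, stable block with cache_control so everything up to that tag is served from cache on turns 2+. The cache_control mechanism is Anthropic's real prompt-caching API; the function name and the exact prompt split are our own sketch:

```python
def build_cached_messages(battlefield_prompt: str, context: str) -> list[dict]:
    """Split the user turn so the large, stable battlefield block is cached.

    Anthropic's prompt caching keys on content blocks tagged with
    cache_control; subsequent turns in the same planning cycle then pay
    the cached rate for everything up to and including that block.
    """
    return [{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": battlefield_prompt,  # serialized grid + robot stats + history
                "cache_control": {"type": "ephemeral"},
            },
            # Small per-cycle context stays outside the cache boundary.
            {"type": "text", "text": context},
        ],
    }]
```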

The real cost driver in AI Arena isn't turns — it's the frequency of planning cycles. And that's controlled separately by the async worker architecture, which is a story for another post.

Lessons

Don't assume reasoning depth. We designed the loop for Haiku's behavior and assumed all models would behave similarly. They don't. More capable models invest more in observation before acting.

Tool availability is a first-class behavioral control. You don't need complex prompt engineering to force a model to act on a deadline. Simply make action tools the only option on the final turn — the model will do what it can with what's available.

Log everything, early. We would have spent weeks debugging this without per-turn structured logging. Token counts and tool call sequences told us exactly what was happening. Add telemetry before you think you need it.

The idle robot problem turned out to be a feature of intelligence, not a bug. Once we understood that, the fix was simple.
