Know what your agent loads before it writes.

hibench measures the default context footprint of coding agents: system prompts, tools, skills, MCP servers, and sub-agents loaded before the first user request is answered.

Top 5 + flop 5 token ranking

heaviest and lightest latest versions · total request tokens

Full ranking →
1 Claude Code: 21,618 2 OpenClaw: 18,540 3 Cursor CLI: 16,379 4 Grok CLI: 15,744 5 Copilot CLI: 12,395 7 Kilo Code: 9,762 8 Codex CLI: 8,730 9 OpenCode: 6,785 10 Cline: 3,890 11 Pi: 1,255

Goal & philosophy

Default cost is real cost

Every tool schema, skill description and system instruction is sent on the very first turn. That baseline is paid on each request, before any useful work happens.

Measure, don't guess

We capture the first real outbound request and count it with a single fixed tokenizer so numbers stay comparable across agents, versions and models.

Track the evolution

Footprints drift over releases. hibench keeps one canonical capture per version so you can watch context grow (or shrink) over time.

How it works

  1. 1

    Isolate

    Run each agent in Docker inside a fresh, empty Git repo.

  2. 2

    Intercept

    Point it at a local recorder with a dummy key — no upstream model call.

  3. 3

    Capture

    Send the prompt Hi and record the first outbound request body.

  4. 4

    Count

    Tokenize every field with o200k_base and break it down.

Benchmarked agents

All agents →