Token Savings Benchmarks

How much does Olaf actually save compared to using Claude Code's built-in tools (Grep, Read, Glob)?

We ran two real tasks on Olaf's own codebase (~5,500 lines of Rust across 15 modules) and measured tokens consumed and tool calls made. Both paths aimed at the same outcome: enough context to understand the answer.

Benchmark 1: "How does context ranking work?"

A focused question about a single subsystem — pivot selection, keyword scoring, and in-degree tiebreaking.

Path A — Regular tools

| Step | Tool | Purpose | Output tokens |
|---|---|---|---|
| 1 | Grep (files) | Find files mentioning rank/score | ~35 |
| 2 | Grep (content) | Find function signatures | ~1,050 |
| 3 | Read (query.rs, 200 lines) | Read ranking function | ~2,150 |
| 4 | Read (query.rs, intent detection) | Understand traversal policy | ~1,050 |
| 5 | Read (store.rs, scoring) | Observation scoring | ~700 |
| Total | 5-7 tool calls | | ~6,000 |

Path B — Olaf

| Step | Tool | Purpose | Output tokens |
|---|---|---|---|
| 1 | get_context | Single call with intent | ~1,950 |
| Total | 1 tool call | | ~1,950 |

Result: 68% fewer tokens, 1 call instead of 5-7.

Olaf's response included the ranking function, the SelectionReason enum, find_pivot_symbols, get_context_from_pivot_scores, plus relevant test cases and session memory — all within a 4,000-token budget. The regular approach required manually discovering each function, reading it, then deciding what to read next.
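
For orientation, here is a rough sketch of what a single get_context call could look like on the wire. The `tools/call` envelope is the standard MCP JSON-RPC shape; the argument names `intent` and `token_budget` are assumptions made for illustration, not Olaf's documented schema.

```python
import json

# Standard MCP tool invocation envelope (JSON-RPC 2.0, method "tools/call").
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_context",  # tool name as used in this benchmark
        "arguments": {
            # Hypothetical argument names; check Olaf's tool schema for the real ones.
            "intent": "How does context ranking work?",
            "token_budget": 4000,
        },
    },
}

print(json.dumps(request, indent=2))
```

The point is only that the whole retrieval is expressed as one request carrying a natural-language intent and a budget, instead of a sequence of grep and read calls.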

Benchmark 2: "Trace MCP get_context request flow"

A cross-module question requiring understanding of how an MCP request flows from the handler in mcp/tools.rs through intent detection, pivot selection, BFS traversal, and brief assembly in graph/query.rs.

Path A — Regular tools

| Step | Tool | Purpose | Output tokens |
|---|---|---|---|
| 1 | Grep (mcp/) | Find get_context references | ~700 |
| 2 | Grep (all src/) | Find all context functions | ~350 |
| 3 | Read (tools.rs, 50 lines) | MCP handler | ~525 |
| 4 | Read (query.rs, 80 lines) | get_context + variants | ~900 |
| 5 | Grep (query.rs) | Locate helper functions | ~75 |
| 6 | Read (query.rs, 90 lines) | Intent detection + traversal policy | ~1,050 |
| 7 | Read (query.rs, 170 lines) | build_context_brief | ~1,950 |
| Total | 7 tool calls | | ~5,550 |

Path B — Olaf

| Step | Tool | Purpose | Output tokens |
|---|---|---|---|
| 1 | get_context | Single call with intent | ~1,800 |
| Total | 1 tool call | | ~1,800 |

Result: 68% fewer tokens, 1 call instead of 7.
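
The 68% figure in both benchmarks follows from the table totals. A minimal sketch of the arithmetic, using only the totals reported above (the function name is illustrative):

```python
def token_reduction(baseline_tokens: int, olaf_tokens: int) -> float:
    """Fraction of the baseline's tokens that the Olaf path avoids."""
    return 1 - olaf_tokens / baseline_tokens

# (Path A total, Path B total) taken from the tables above.
benchmarks = {
    "context ranking": (6_000, 1_950),
    "MCP request flow": (5_550, 1_800),
}

for name, (baseline, olaf) in benchmarks.items():
    print(f"{name}: {token_reduction(baseline, olaf):.1%} fewer tokens")
```

Both benchmarks land within a fraction of a point of each other, which is why they share the same rounded headline.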

What the numbers understate

Tool call overhead

Each tool call in Claude Code adds roughly 1-2 seconds of round-trip latency. Seven sequential calls therefore cost about 7-14 seconds of pure overhead, versus 1-2 seconds for a single Olaf call.
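
The overhead estimate is call count multiplied by per-call latency. A sketch, assuming the 1-2 second per-call figure quoted above and strictly sequential calls (no parallel tool use):

```python
def overhead_range(calls: int, per_call_s: tuple = (1.0, 2.0)) -> tuple:
    """Return (low, high) total round-trip overhead in seconds."""
    low, high = per_call_s
    return calls * low, calls * high

for calls in (5, 7, 1):
    low, high = overhead_range(calls)
    print(f"{calls} call(s): {low:.0f}-{high:.0f} s of round-trip overhead")
```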

Planning cost

With regular tools, the AI must decide what to search for next at each step. That planning consumes tokens from the model's reasoning budget, a real cost that does not show up in the output token counts. Olaf removes the intermediate planning steps: one call, one result.

Session memory is included for free

Olaf's response automatically includes relevant past observations — file changes, signature changes, decisions from prior sessions. With regular tools, you'd need additional git log or git diff calls to get equivalent context.

Scales with codebase size

On a small codebase (15 modules), regular tools need 5-7 calls. On a larger codebase with 50+ files, we would expect the same question to require 10-20 calls and 15,000+ tokens as the search space grows. Olaf's cost stays roughly constant because the response is capped by the token budget and the symbol graph handles the fan-out internally.

Summary

| Metric | Regular tools | Olaf |
|---|---|---|
| Tokens per question | 5,550-6,000 | 1,800-1,950 |
| Tool calls per question | 5-7 | 1 |
| Token reduction | | ~68% |
| Tool call reduction | | 80-86% |
| Latency overhead | 7-14 s | 1-2 s |

Headline: roughly 3x fewer tokens and 5-7x fewer tool calls.
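
A percentage reduction and an "N times fewer" multiplier are two views of the same ratio. The sketch below converts the benchmark totals into multipliers; the helper name is illustrative.

```python
def times_fewer(baseline: float, olaf: float) -> float:
    """How many times smaller the Olaf figure is than the baseline."""
    return baseline / olaf

# Token totals from the two self-benchmarks.
print(f"tokens, benchmark 1: {times_fewer(6_000, 1_950):.2f}x")
print(f"tokens, benchmark 2: {times_fewer(5_550, 1_800):.2f}x")

# Tool calls: 5-7 regular calls versus a single Olaf call.
print(f"tool calls: {times_fewer(5, 1):.0f}x to {times_fewer(7, 1):.0f}x")

# Equivalent view: a reduction r corresponds to a multiplier of 1 / (1 - r).
print(f"68% reduction as a multiplier: {1 / (1 - 0.68):.2f}x")
```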

Methodology

Both benchmarks were conducted on Olaf's own codebase (Rust, ~5,500 lines across 15 source modules). Token counts are estimated at 1 token per 4 characters of tool output. Regular tool paths followed the natural discovery pattern — grep to find relevant files, read to understand them, grep again to find the next piece. Olaf paths used get_context with a 4,000-token budget and natural language intent.
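
The token estimate can be sketched in a few lines. The 4-characters-per-token ratio comes from the methodology above; rounding up is an assumption made here, and the function name is illustrative rather than part of Olaf.

```python
import math

CHARS_PER_TOKEN = 4  # ratio stated in the methodology

def estimate_tokens(tool_output: str) -> int:
    """Approximate token count of a tool's output from its character length."""
    # Rounding up is an assumption; the source only gives the 1:4 ratio.
    return math.ceil(len(tool_output) / CHARS_PER_TOKEN)

# An 8,000-character tool response is counted as 2,000 tokens.
sample_output = "x" * 8_000
print(estimate_tokens(sample_output))
```

This is a character-length heuristic, not a tokenizer, so absolute counts are approximate. Both paths are measured the same way, so the relative comparison is unaffected.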

The benchmarks measure retrieval cost — how many tokens it takes to gather enough context to answer the question. They do not measure answer quality, which depends on the AI model, not the retrieval method.

External Repo Benchmark

Measured on kubernetes/kubernetes at commit 040ca596 using get_brief (wraps get_context + impact analysis). The self-benchmark above uses get_context directly — these are complementary measurements showing realistic agent-facing cost vs core retrieval cost.

Environment

  • CPU: Apple M4 Max (arm64)
  • RAM: 128 GB
  • OS: macOS Darwin 25.2.0
  • Build: --release profile
  • Olaf commit: 8b924f1

Indexing

| Metric | Value |
|---|---|
| Indexed files | 16,794 |
| Symbols | 304,337 |
| Edges | 181,517 (145,115 calls + 36,402 uses_type) |
| Index time | 30.4 s |

Per-query results (budget=4000)

| Query | Tag | Latency (ms) | Olaf tokens | Baseline A tokens | Savings |
|---|---|---|---|---|---|
| keyword-exact-scheduleone | keyword | 948 | 2,004 | 13,402 | 85.0% |
| keyword-module-garbage-collector | keyword | 889 | 930 | 8,321 | 88.8% |
| keyword-cross-module-reflector-informer | keyword | 939 | 1,467 | 21,511 | 93.2% |
| lowconf-broad-handle-errors | low_confidence | 857 | 1,313 | 6,647 | 80.2% |
| bugfix-sync-deployment-stuck | bugfix | 1,173 | 947 | 8,846 | 89.3% |
| impl-pod-eviction-threshold | impl | 1,078 | 772 | 6,765 | 88.6% |
| refactor-endpoint-controller-sync | refactor | 1,027 | 826 | 7,336 | 88.7% |
| filehint-kubelet-syncpod | file_hint | 824 | 67 | 36,962 | 99.8% |
| lowconf-ambiguous-token | low_confidence | 790 | 1,211 | 3,700 | 67.3% |
| fallback-zookeeper-leader | fallback | 1,043 | 1,782 | 0 | n/a |
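
The Savings column can be reproduced from the two token columns. A minimal sketch, with the rows copied from the table above and an illustrative function name; a zero baseline is reported as undefined, matching the n/a entry.

```python
from typing import Optional

def savings_pct(olaf_tokens: int, baseline_tokens: int) -> Optional[float]:
    """Percentage of baseline tokens saved, or None when the baseline is empty."""
    if baseline_tokens == 0:
        return None  # no baseline to compare against (the n/a row)
    return round(100 * (1 - olaf_tokens / baseline_tokens), 1)

# (query, Olaf tokens, Baseline A tokens) from the table above.
rows = [
    ("keyword-exact-scheduleone", 2004, 13402),
    ("keyword-cross-module-reflector-informer", 1467, 21511),
    ("filehint-kubelet-syncpod", 67, 36962),
    ("lowconf-ambiguous-token", 1211, 3700),
    ("fallback-zookeeper-leader", 1782, 0),
]

for query, olaf, baseline in rows:
    pct = savings_pct(olaf, baseline)
    print(f"{query}: {'n/a' if pct is None else f'{pct}%'}")
```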

Multi-budget summary

| Budget | Mean savings | Median savings |
|---|---|---|
| 2,000 | 79.3% | 88.8% |
| 4,000 | 78.1% | 88.7% |
| 8,000 | 77.5% | 88.7% |
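
The gap between mean and median is easier to read with the aggregation spelled out. The source does not state how the fallback query (baseline 0) enters the averages. The sketch below, using the budget=4000 rows, shows that counting it as 0% savings reproduces the reported 78.1% mean, which suggests but does not confirm that convention.

```python
from statistics import mean, median

# (Olaf tokens, Baseline A tokens) for the ten budget=4000 queries.
pairs = [
    (2004, 13402), (930, 8321), (1467, 21511), (1313, 6647), (947, 8846),
    (772, 6765), (826, 7336), (67, 36962), (1211, 3700), (1782, 0),
]

def savings(olaf: int, baseline: int) -> float:
    # Assumption: a query with no baseline counts as 0% savings.
    return 0.0 if baseline == 0 else 1 - olaf / baseline

all_ten = [savings(o, b) for o, b in pairs]
with_baseline = [1 - o / b for o, b in pairs if b > 0]

print(f"fallback counted as 0: mean {mean(all_ten):.1%}, median {median(all_ten):.1%}")
print(f"fallback excluded:     mean {mean(with_baseline):.1%}, median {median(with_baseline):.1%}")
```

Under either convention the median stays near 88.7%, while the mean moves by several points. That makes the median the more stable summary for this query set.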

Latency

| Metric | Value |
|---|---|
| Cold first query | 2,490 ms |
| Warm p50 | 948 ms |
| Warm p95 | 1,173 ms |
| Warm max | 1,173 ms |

NFR1 comparison (warm only): NFR1 defines a 1-second ceiling for get_context. This benchmark measures get_brief, which wraps get_context and adds impact analysis. The warm p50 (948ms) is under 1s; the p95 (1,173ms) exceeds it. Queries exceeding 1s were bugfix-sync-deployment-stuck (1,173ms), impl-pod-eviction-threshold (1,078ms), refactor-endpoint-controller-sync (1,027ms), and fallback-zookeeper-leader (1,043ms).
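
The warm latency figures can be checked against the per-query table. The percentile method is not stated in the source; the indexing convention below is an assumption, chosen because it reproduces the reported p50 and p95 from the ten warm samples.

```python
# Warm latencies (ms) from the per-query table, in table order.
warm_ms = [948, 889, 939, 857, 1173, 1078, 1027, 824, 790, 1043]

def percentile(samples: list, p: float) -> int:
    """Pick the sample at index floor(p * n) of the sorted list (assumed convention)."""
    ordered = sorted(samples)
    idx = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[idx]

NFR1_CEILING_MS = 1000  # the 1-second ceiling referenced above

p50 = percentile(warm_ms, 0.50)
p95 = percentile(warm_ms, 0.95)
over_ceiling = [ms for ms in warm_ms if ms > NFR1_CEILING_MS]

print(f"p50={p50} ms, p95={p95} ms, max={max(warm_ms)} ms")
print(f"{len(over_ceiling)} of {len(warm_ms)} warm queries exceed {NFR1_CEILING_MS} ms")
```

With only ten samples, p95 and the maximum coincide, so the p95 figure is effectively the single slowest query.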

Recall

Expected pivots hit rate: 0% across all queries with defined pivots. These results predate Go edge extraction — retrieval operated in keyword-only mode on a flat symbol table. With 181k edges now indexed, graph traversal and impact analysis can discover symbols that keyword matching alone misses. A benchmark re-run is planned.

Conclusion

Measured ~78% mean token reduction (median 88.7%; Baseline A, budget=4000) on kubernetes/kubernetes, an additional data point alongside the self-benchmark's ~68% figure. The higher reduction on the external repo reflects the larger baseline cost of manually reading Go files (many cost 6k-36k tokens), which Olaf's budget-constrained retrieval avoids. The self-benchmark uses get_context on a smaller Rust codebase; the external benchmark uses get_brief on a large Go codebase.

Note: The per-query results above predate Go edge extraction, so retrieval operated in keyword-only mode (0 edges), whereas the Indexing table reflects the current index with 181k edges. Graph-assisted features are now active and recall is expected to improve. A benchmark re-run is planned.