Token Savings Benchmarks
How much does Olaf actually save compared to using Claude Code's built-in tools (Grep, Read, Glob)?
We ran two real tasks on Olaf's own codebase (~5,500 lines of Rust across 15 modules) and measured tokens consumed and tool calls made. Both paths aimed at the same outcome: enough context to understand the answer.
Benchmark 1: "How does context ranking work?"
A focused question about a single subsystem — pivot selection, keyword scoring, and in-degree tiebreaking.
Path A — Regular tools
| Step | Tool | Purpose | Output tokens |
|---|---|---|---|
| 1 | Grep (files) | Find files mentioning rank/score | ~35 |
| 2 | Grep (content) | Find function signatures | ~1,050 |
| 3 | Read (query.rs, 200 lines) | Read ranking function | ~2,150 |
| 4 | Read (query.rs, intent detection) | Understand traversal policy | ~1,050 |
| 5 | Read (store.rs, scoring) | Observation scoring | ~700 |
| Total | 5-7 tool calls | ~6,000 |
Path B — Olaf
| Step | Tool | Purpose | Output tokens |
|---|---|---|---|
| 1 | get_context |
Single call with intent | ~1,950 |
| Total | 1 tool call | ~1,950 |
Result: 68% fewer tokens, 1 call instead of 5-7.
Olaf's response included the ranking function, the SelectionReason enum, find_pivot_symbols, get_context_from_pivot_scores, plus relevant test cases and session memory — all within a 4,000-token budget. The regular approach required manually discovering each function, reading it, then deciding what to read next.
Benchmark 2: "Trace MCP get_context request flow"
A cross-module question requiring understanding of how an MCP request flows from the handler in mcp/tools.rs through intent detection, pivot selection, BFS traversal, and brief assembly in graph/query.rs.
Path A — Regular tools
| Step | Tool | Purpose | Output tokens |
|---|---|---|---|
| 1 | Grep (mcp/) | Find get_context references | ~700 |
| 2 | Grep (all src/) | Find all context functions | ~350 |
| 3 | Read (tools.rs, 50 lines) | MCP handler | ~525 |
| 4 | Read (query.rs, 80 lines) | get_context + variants | ~900 |
| 5 | Grep (query.rs) | Locate helper functions | ~75 |
| 6 | Read (query.rs, 90 lines) | Intent detection + traversal policy | ~1,050 |
| 7 | Read (query.rs, 170 lines) | build_context_brief | ~1,950 |
| Total | 7 tool calls | ~5,550 |
Path B — Olaf
| Step | Tool | Purpose | Output tokens |
|---|---|---|---|
| 1 | get_context |
Single call with intent | ~1,800 |
| Total | 1 tool call | ~1,800 |
Result: 68% fewer tokens, 1 call instead of 7.
What the numbers understate
Tool call overhead
Each tool call in Claude Code has ~1-2 seconds of latency overhead for the round-trip. Seven sequential calls means 10-14 seconds of pure overhead vs ~2 seconds for a single Olaf call.
Planning cost
With regular tools, the AI must decide what to search for next at each step. That planning burns tokens in the model's reasoning budget — real cost that doesn't show up in the output token counts. Olaf eliminates this entirely: one call, one result.
Session memory is included for free
Olaf's response automatically includes relevant past observations — file changes, signature changes, decisions from prior sessions. With regular tools, you'd need additional git log or git diff calls to get equivalent context.
Scales with codebase size
On a small codebase (15 modules), regular tools need 5-7 calls. On a larger codebase with 50+ files, the same question can easily require 10-20 calls and 15,000+ tokens as the search space grows. Olaf's cost stays roughly constant — the symbol graph handles the fan-out internally.
Summary
| Metric | Regular tools | Olaf |
|---|---|---|
| Tokens per question | 5,000-6,000 | 1,800-1,950 |
| Tool calls per question | 5-7 | 1 |
| Token reduction | — | ~68% |
| Tool call reduction | — | ~85% |
| Latency overhead | 10-14s | ~2s |
Conservative headline: 3-4x fewer tokens, 7x fewer tool calls.
Methodology
Both benchmarks were conducted on Olaf's own codebase (Rust, ~5,500 lines across 15 source modules). Token counts are estimated at 1 token per 4 characters of tool output. Regular tool paths followed the natural discovery pattern — grep to find relevant files, read to understand them, grep again to find the next piece. Olaf paths used get_context with a 4,000-token budget and natural language intent.
The benchmarks measure retrieval cost — how many tokens it takes to gather enough context to answer the question. They do not measure answer quality, which depends on the AI model, not the retrieval method.
External Repo Benchmark
Measured on kubernetes/kubernetes at commit 040ca596 using get_brief (wraps get_context + impact analysis). The self-benchmark above uses get_context directly — these are complementary measurements showing realistic agent-facing cost vs core retrieval cost.
Environment
- CPU: Apple M4 Max (arm64)
- RAM: 128 GB
- OS: macOS Darwin 25.2.0
- Build:
--releaseprofile - Olaf commit:
8b924f1
Indexing
| Metric | Value |
|---|---|
| Indexed files | 16,794 |
| Symbols | 304,337 |
| Edges | 181,517 (145,115 calls + 36,402 uses_type) |
| Index time | 30.4s |
Per-query results (budget=4000)
| Query | Tag | Latency (ms) | Olaf tokens | Baseline A tokens | Savings |
|---|---|---|---|---|---|
| keyword-exact-scheduleone | keyword | 948 | 2,004 | 13,402 | 85.0% |
| keyword-module-garbage-collector | keyword | 889 | 930 | 8,321 | 88.8% |
| keyword-cross-module-reflector-informer | keyword | 939 | 1,467 | 21,511 | 93.2% |
| lowconf-broad-handle-errors | low_confidence | 857 | 1,313 | 6,647 | 80.2% |
| bugfix-sync-deployment-stuck | bugfix | 1,173 | 947 | 8,846 | 89.3% |
| impl-pod-eviction-threshold | impl | 1,078 | 772 | 6,765 | 88.6% |
| refactor-endpoint-controller-sync | refactor | 1,027 | 826 | 7,336 | 88.7% |
| filehint-kubelet-syncpod | file_hint | 824 | 67 | 36,962 | 99.8% |
| lowconf-ambiguous-token | low_confidence | 790 | 1,211 | 3,700 | 67.3% |
| fallback-zookeeper-leader | fallback | 1,043 | 1,782 | 0 | n/a |
Multi-budget summary
| Budget | Mean savings | Median savings |
|---|---|---|
| 2,000 | 79.3% | 88.8% |
| 4,000 | 78.1% | 88.7% |
| 8,000 | 77.5% | 88.7% |
Latency
| Metric | Value |
|---|---|
| Cold first query | 2,490 ms |
| Warm p50 | 948 ms |
| Warm p95 | 1,173 ms |
| Warm max | 1,173 ms |
NFR1 comparison (warm only): NFR1 defines a 1-second ceiling for get_context. This benchmark measures get_brief, which wraps get_context and adds impact analysis. The warm p50 (948ms) is under 1s; the p95 (1,173ms) exceeds it. Queries exceeding 1s were bugfix-sync-deployment-stuck (1,173ms), impl-pod-eviction-threshold (1,078ms), refactor-endpoint-controller-sync (1,027ms), and fallback-zookeeper-leader (1,043ms).
Recall
Expected pivots hit rate: 0% across all queries with defined pivots. These results predate Go edge extraction — retrieval operated in keyword-only mode on a flat symbol table. With 181k edges now indexed, graph traversal and impact analysis can discover symbols that keyword matching alone misses. A benchmark re-run is planned.
Conclusion
Measured ~78% token reduction (Baseline A, budget=4000) on kubernetes — an additional data point alongside the self-benchmark's ~68% figure. The higher reduction on the external repo reflects the larger baseline cost of manually reading Go files (many 6k-36k tokens) that Olaf's budget-constrained retrieval avoids. The self-benchmark uses get_context on a smaller Rust codebase; the external benchmark uses get_brief on a large Go codebase.
Note: Per-query results above predate Go edge extraction. Retrieval operated in keyword-only mode (0 edges). With 181k edges now indexed, graph-assisted features are active and recall is expected to improve. A benchmark re-run is planned.