
We built a perfect cache. Anthropic broke it from the server.

Our cache was working too well. Then it stopped.

Earlier today we wrote about Matrix's cache engineering: byte-identical JSONL reconstruction keeps 800K-token Opus sessions affordable. A few hours later, the same cache system showed us something we couldn't explain.

Between two consecutive calls in the same session, input tokens jumped from 220K to 284K, and cache reads dropped from 220,509 to zero:

json
{"type":"usage","inputTokens":220712,"outputTokens":217,"contextWindow":1000000,"cacheCreationTokens":200,"cacheReadTokens":220509,"taskId":"ea053810-fdba-4c90-9e0c-e7b22bcb5c68","ts":1775332443540,"traceId":"01KND0TDWMV8EM53DKK448TZ3Y"}
{"type":"usage","inputTokens":284800,"outputTokens":397,"contextWindow":1000000,"cacheCreationTokens":284794,"cacheReadTokens":0,"taskId":"ea053810-fdba-4c90-9e0c-e7b22bcb5c68","ts":1775333012661,"traceId":"01KND0TDWMV8EM53DKK448TZ3Y"}

We checked our code. No bugs, no agent restarts, and even if there had been a restart, Matrix's byte-identical reconstruction should produce the exact same request. The two calls were under ten minutes apart (569 seconds by their ts fields), well within the cache TTL. Yet the entire cache prefix was invalidated and rewritten from scratch.
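A full prefix rewrite like this can be spotted mechanically in the usage log. The following is a minimal sketch, not Matrix's actual detection code: `findCacheResets` is a hypothetical helper, the field names are taken from the usage events above, and the 0.9 threshold is an arbitrary assumption.

```javascript
// Hypothetical helper: flag usage events where a previously cached prefix
// was rewritten almost entirely. Field names follow the usage events above;
// the 0.9 threshold is an assumption chosen for illustration.
function findCacheResets(usageEvents, threshold = 0.9) {
  const resets = [];
  for (let i = 1; i < usageEvents.length; i++) {
    const prev = usageEvents[i - 1];
    const cur = usageEvents[i];
    const hadCache = prev.cacheReadTokens > 0;
    const rebuilt = cur.cacheCreationTokens / cur.inputTokens >= threshold;
    if (hadCache && rebuilt) {
      resets.push({
        index: i,
        gapSeconds: (cur.ts - prev.ts) / 1000, // ts is in milliseconds
        inputDelta: cur.inputTokens - prev.inputTokens,
        rewritten: cur.cacheCreationTokens,
      });
    }
  }
  return resets;
}

// The two consecutive usage events from the session above.
const events = [
  { inputTokens: 220712, cacheCreationTokens: 200, cacheReadTokens: 220509, ts: 1775332443540 },
  { inputTokens: 284800, cacheCreationTokens: 284794, cacheReadTokens: 0, ts: 1775333012661 },
];
console.log(findCacheResets(events));
```

For the two events above this reports one reset: 284,794 tokens rewritten, 64,088 more input tokens than the previous call.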

If it wasn't us, was something being added server-side?

Experiment 1: same JSONL, different time

Matrix stores every conversation as a provider-agnostic JSONL event stream. At any point during a session, we can take the current JSONL state, truncate it to an earlier point, and replay it. This is byte-identical reconstruction of the original request.
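As a rough illustration of the rewind step, here is a sketch of truncating a JSONL stream at a timestamp. It is not Matrix's implementation: `truncateJsonl` is a hypothetical name, and the code assumes events are stored in chronological order with a numeric `ts` in milliseconds, as in the usage events shown in this post.

```javascript
// Hypothetical helper: cut a JSONL event stream at a timestamp.
// Assumptions: events are in chronological order, and timed events carry a
// numeric `ts` in milliseconds. Events without `ts` are kept.
function truncateJsonl(text, lastTs) {
  const kept = [];
  for (const line of text.split('\n')) {
    if (!line) continue;
    const event = JSON.parse(line);
    if (typeof event.ts === 'number' && event.ts > lastTs) break;
    // Keep the original line rather than JSON.stringify(event): re-serializing
    // could change whitespace, escapes, or number formatting and break
    // byte-identical replay.
    kept.push(line);
  }
  return kept.map((l) => l + '\n').join('');
}

// Illustrative stream; the session_config line is a made-up minimal event.
const stream = [
  '{"type":"session_config","tools":[]}',
  '{"type":"usage","inputTokens":127423,"ts":1775322844846}',
  '{"type":"usage","inputTokens":128434,"ts":1775322853544}',
  '{"type":"usage","inputTokens":129111,"ts":1775322861024}',
].join('\n') + '\n';

const rewound = truncateJsonl(stream, 1775322853544);
console.log(rewound.split('\n').filter(Boolean).length); // 3
```

Because the kept lines are copied verbatim, the rewound stream is a strict byte prefix of the original, which is the property the replay depends on.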

We rewound our JSONL to an earlier state in the session where usage was stable around 128K:

json
{"type":"usage","inputTokens":127423,"outputTokens":402,"cacheCreationTokens":736,"cacheReadTokens":126684,"ts":1775322844846,"taskId":"ea053810-fdba-4c90-9e0c-e7b22bcb5c68"}
{"type":"usage","inputTokens":128434,"outputTokens":152,"cacheCreationTokens":1011,"cacheReadTokens":127420,"ts":1775322853544,"taskId":"ea053810-fdba-4c90-9e0c-e7b22bcb5c68"}

The original third call from that point reported 129,111 input tokens, a normal incremental increase:

json
{"type":"usage","inputTokens":129111,"outputTokens":249,"cacheCreationTokens":679,"cacheReadTokens":128431,"ts":1775322861024,"taskId":"ea053810-fdba-4c90-9e0c-e7b22bcb5c68"}

We replayed from this state, sending a simple "ping" as the next user message. The content difference should have been negligible: a few tokens at most. Instead we got:

json
{"type":"usage","inputTokens":167658,"outputTokens":218,"cacheCreationTokens":134871,"cacheReadTokens":32781,"ts":1775338001547,"taskId":"01KND67CNPGJ96BR02DA0CRN3X"}

167,658 tokens. Up from 128,434. A 39,224 difference, about 30.5%, for the same content.

The difference is proportional, not a fixed addition. When we tried other states, a 190K original replayed as 245K (+29%) and a 220K original replayed as 284K (+29%). That is roughly the same 30% multiplier across different context sizes, which looks more like a tokenizer change than a fixed content injection.
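The arithmetic behind "proportional, not fixed" can be checked directly. In this small sketch the first and third pairs use the exact counts from the usage events, while the middle pair uses the rounded 190K and 245K figures quoted above, so its result is approximate.

```javascript
// Original vs replayed input token counts for the three states we tried.
// The 190K/245K pair uses rounded figures, so treat its ratio as approximate.
const pairs = [
  { original: 128434, replayed: 167658 },
  { original: 190000, replayed: 245000 },
  { original: 220712, replayed: 284800 },
];

function overhead({ original, replayed }) {
  return { added: replayed - original, ratio: replayed / original };
}

for (const pair of pairs) {
  const { added, ratio } = overhead(pair);
  console.log(`${pair.original} -> ${pair.replayed}: +${added} tokens, x${ratio.toFixed(3)}`);
}
```

The added token count grows from about 39K to about 64K as the context grows, while the ratio stays between 1.29 and 1.31. A fixed injected prompt would show the opposite pattern: a constant added count and a shrinking ratio.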

But we couldn't rule out a bug in our own replay logic. We needed a cleaner experiment where time wasn't a variable at all.

Experiment 2: same request, different accounts

We have two OAuth subscription tokens. We built one request body from 700 messages of our root session, 57 tool definitions, and our full system prompt, targeting Opus 4.6 (claude-opus-4-6). Then we sent that exact body to both accounts.

TOKEN COMPARISON: Same request body, two accounts
====================================================
Messages: 700 from root session
Model: claude-opus-4-6
Tools: 57, System: ~8281 tokens

Results:
----------------------------------------------------
Account A:
  count_tokens API:     236,678
  actual /v1/messages:  304,306
  Difference:           67,628 tokens

Account B:
  count_tokens API:     236,678
  actual /v1/messages:  236,678
  Difference:           0 tokens
====================================================

count_tokens returns the same number for both accounts, confirming the request body is identical. But at the /v1/messages endpoint, one account reports 67,628 additional input tokens. The other reports zero.

We did not send those 67,628 tokens. They appear between the API receiving our request and recording usage. Compared against the count_tokens baseline of 236,678, that's a 28.6% increase, consistent with the ~29-30% multiplier we observed in experiment 1.

The script

javascript
const rootTask = ctx.tracker.getTask('ROOT_TASK_ID');
const allMessages = rootTask.session.messages;

const sessionPath = `PATH_TO_JSONL`;
const fs = await import('fs');
const lines = fs.readFileSync(sessionPath, 'utf-8').split('\n');
let systemStable = '';
let systemVariable = '';
let tools = null;
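for (const line of lines) {
  if (!line) continue;
  try {
    const d = JSON.parse(line);
    // The last session_config event wins: it carries the system prompt parts and tool list.
    if (d.type === 'session_config') {
      systemStable = d.systemStable || '';
      systemVariable = d.systemVariable || '';
      tools = d.tools;
    }
  } catch {
    // Ignore lines that are not valid JSON (for example a partially written last line).
  }
}
if (!tools) throw new Error('No session_config event with tools found in the JSONL');

// Convert the stored tool definitions to the Anthropic tools format.
const anthropicTools = tools.map(t => ({
  name: t.name,
  description: t.description,
  input_schema: t.jsonSchema,
}));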

const { loadGlobalConfig } = await import('PATH_TO_CONFIG');
const global = await loadGlobalConfig();
const token1 = global.authGroups['account-a'].claudeOauthToken;
const token2 = global.authGroups['account-b'].claudeOauthToken;

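// Clean slice at (or just before) 700 msgs: walk back until the slice ends on a
// user message or on an assistant message with no pending tool_use.
// Clamp to the session length so shorter sessions don't index past the end.
let end = Math.min(700, allMessages.length);
while (end > 0) {
  const last = allMessages[end - 1];
  if (last.role === 'user') break;
  if (last.role === 'assistant' && Array.isArray(last.content) && !last.content.some(c => c.type === 'tool_use')) break;
  end--;
}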

// Build request body (same for both)
const body = {
  model: "claude-opus-4-6",
  max_tokens: 5,
  system: [
    { type: "text", text: "You are Claude Code, Anthropic's official CLI for Claude." },
    { type: "text", text: systemStable },
    { type: "text", text: systemVariable },
  ],
  tools: anthropicTools,
  messages: [...allMessages.slice(0, end), {role: "user", content: "ok"}],
};

const countBody = {...body};
delete countBody.max_tokens;

async function runTest(accountName, token) {
  // count_tokens
  const countRes = await fetch('https://api.anthropic.com/v1/messages/count_tokens', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'authorization': `Bearer ${token}`,
      'anthropic-version': '2023-06-01',
      'anthropic-beta': 'oauth-2025-04-20',
    },
    body: JSON.stringify(countBody),
  });
  const countData = await countRes.json();
  
  // actual messages
  const actualRes = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'authorization': `Bearer ${token}`,
      'anthropic-version': '2023-06-01',
      'anthropic-beta': 'oauth-2025-04-20',
    },
    body: JSON.stringify(body),
  });
  const actualData = await actualRes.json();
  
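  // Only compare when both endpoints returned usable numbers; otherwise report the raw error.
  if (actualData.usage && typeof countData.input_tokens === 'number') {
    // Total input = uncached input + tokens written to cache + tokens read from cache.
    const actualTotal = actualData.usage.input_tokens + 
                        (actualData.usage.cache_creation_input_tokens || 0) + 
                        (actualData.usage.cache_read_input_tokens || 0);
    return {
      account: accountName,
      countTokens: countData.input_tokens,
      actualTotal,
      diff: actualTotal - countData.input_tokens,
    };
  }
  const failed = actualData.usage ? countData : actualData;
  return {account: accountName, error: JSON.stringify(failed).slice(0, 200)};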
}

// Run both sequentially to avoid rate limits
const result1 = await runTest('Account A', token1);
const result2 = await runTest('Account B', token2);

console.log('TOKEN COMPARISON: Same request, two accounts');
console.log(`Messages: ${end} from root session`);
console.log(`Model: claude-opus-4-6`);
// System size is a rough estimate at ~4 characters per token.
console.log(`Tools: ${anthropicTools.length}, System: ~${Math.floor((systemStable.length + systemVariable.length)/4)} tokens`);
for (const r of [result1, result2]) {
  if (r.error) {
    console.log(`${r.account}: ERROR ${r.error}`);
  } else {
    console.log(`${r.account}:`);
    console.log(`  count_tokens API: ${r.countTokens.toLocaleString()}`);
    console.log(`  actual /v1/messages: ${r.actualTotal.toLocaleString()}`);
    console.log(`  Difference: ${r.diff.toLocaleString()} tokens`);
  }
}

What we're seeing

Two variables seem to affect how many tokens Anthropic reports for our requests: time (experiment 1, though we can't fully rule out bugs in our replay or tokenizer changes) and account (experiment 2, which is clean).

The account variable is the stronger finding. Identical request body, identical count_tokens response, different /v1/messages response. The 67K difference between the two accounts is not explained by anything we sent.

Anthropic's token counting documentation mentions this possibility:

"Token counts may include tokens added automatically by Anthropic for system optimizations. You are not billed for system-added tokens. Billing reflects only your content."

We hadn't paid much attention to this line before. Today we observed it happen.

Questions this raises

Is this a hidden prompt or a tokenizer change? The ~29% multiplier is proportional to context size, not a fixed addition. This is more consistent with a tokenizer change than a fixed content injection. But either way, the effect is the same: 29% more tokens for the same content, applied to some accounts and not others, with no transparency and no opt-out.

Is the cache invalidation side effect billed? Anthropic says "you are not billed for system-added tokens." But when those tokens appeared mid-session, they invalidated our entire cache prefix. The next call wrote 284,794 cache creation tokens from scratch, where the previous call had read 220,509 from cache. Even if the roughly 64K added tokens are free, it's unclear whether the cache rebuild they caused is also free. There is no documentation on this, and no transparency into what is or isn't counted.

Which accounts are affected? Our two accounts are both OAuth subscription tokens, but only one showed the extra tokens. We don't yet know the deciding factor: plan tier, account age, whether the account has been used with official Claude Code, or some other server-side flag. If you have an API key (non-OAuth) or multiple OAuth accounts, we'd be curious whether you see the same split.

What does this look like for Claude Code users? Claude Code is Anthropic's first-party client. Its API traffic is harder to observe. If the same account-level variation appears there, most users wouldn't see it directly. They'd only notice that their quota runs out faster than expected.

Cache transparency

If you run long-lived sessions against Claude's API, you can do this yourself: take any request you're about to send, run it through count_tokens, then send it to /v1/messages. Compare the numbers. If they match, your session is transparent. If they diverge, something is being added to your traffic that you didn't send.
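For readers who want to run that check, here is a condensed sketch of the comparison from our script, reduced to one function. `compareTokenCounts` is a hypothetical name, and the `fetchImpl` parameter is our addition so the function can be tried against a stub before it touches the network. Passing an `x-api-key` header for API-key accounts is an assumption on our side; we only tested OAuth tokens.

```javascript
// Sketch: compare count_tokens with the usage reported by /v1/messages.
// `fetchImpl` is injectable so the logic can be exercised without network access.
async function compareTokenCounts(body, headers, fetchImpl = fetch) {
  const post = async (url, payload) => {
    const res = await fetchImpl(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...headers },
      body: JSON.stringify(payload),
    });
    return res.json();
  };

  // As in our script, max_tokens is dropped from the count_tokens request.
  const { max_tokens, ...countBody } = body;
  const counted = await post('https://api.anthropic.com/v1/messages/count_tokens', countBody);
  const actual = await post('https://api.anthropic.com/v1/messages', body);

  if (typeof counted.input_tokens !== 'number' || !actual.usage) {
    throw new Error(`Unexpected response: ${JSON.stringify({ counted, actual }).slice(0, 200)}`);
  }

  // Total input = uncached input + cache writes + cache reads.
  const u = actual.usage;
  const reported = u.input_tokens +
    (u.cache_creation_input_tokens || 0) +
    (u.cache_read_input_tokens || 0);

  return {
    counted: counted.input_tokens,
    reported,
    diff: reported - counted.input_tokens,
    ratio: reported / counted.input_tokens,
  };
}

// Example call (performs a real, billable request; keep max_tokens small):
// const result = await compareTokenCounts(body, {
//   'x-api-key': process.env.ANTHROPIC_API_KEY,
//   'anthropic-version': '2023-06-01',
// });
```

A `diff` of 0 matches what we saw on Account B. A `ratio` near 1.29 matches Account A.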


Released under the MIT License.