benchmark: isolate parser hot loop from code-layout shifts

Extract the tight per-byte parsing loop from TerminalParser.step into
a separate noinline function (parseAll). This eliminates a ~20%
benchmark regression that appeared after the highway vendor changes
despite zero changes to the parser source code.

The root cause: the parser benchmark processes 50 MB of input through
a byte-at-a-time DFA loop that is highly sensitive to instruction
cache-line placement on Apple Silicon. The M-series cores fetch
aligned 16-byte blocks; when the loop head lands near the end of a
64-byte cache line (offset 60), only one instruction fits in the
first fetch versus four when aligned to offset 48. This causes ~29%
more cycles for identical instruction counts.
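
The fetch arithmetic above can be checked with a small sketch. It
assumes the 16-byte aligned fetch block described in this message and
the fixed 4-byte AArch64 instruction width; the helper name
instructionsInFirstFetch is hypothetical and is not part of this change.

```zig
const std = @import("std");

/// Hypothetical helper: number of whole instructions between `offset`
/// and the end of the aligned fetch block that contains it.
fn instructionsInFirstFetch(offset: usize) usize {
    const fetch_block: usize = 16; // assumed fetch width, per the message above
    const insn_size: usize = 4; // AArch64 instructions are fixed-width
    return (fetch_block - (offset % fetch_block)) / insn_size;
}

test "loop head placement within a 64-byte cache line" {
    // Loop head at offset 60: only the last 4 bytes of the block remain.
    try std.testing.expectEqual(@as(usize, 1), instructionsInFirstFetch(60));
    // Loop head at offset 48: the whole 16-byte block is usable.
    try std.testing.expectEqual(@as(usize, 4), instructionsInFirstFetch(48));
}
```

The sketch models only how many instructions the first fetch delivers;
it does not attempt to reproduce the measured cycle difference.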

Previously the loop was inlined into the large step() function, so
any code change anywhere in the binary (like the highway vendor
restructuring) could shift the loop across a cache-line boundary.
Making parseAll noinline gives the loop its own function placement,
which stays stable regardless of surrounding code changes.
pull/12402/head
Mitchell Hashimoto 2026-04-23 21:33:01 -07:00
parent 00dfd67bee
commit bf3047b9b2
GPG Key ID: 523D5DC389D273BC
1 changed file with 11 additions and 5 deletions

@@ -88,11 +88,17 @@ fn step(ptr: *anyopaque) Benchmark.Error!void {
             return error.BenchmarkFailed;
         };
         if (n == 0) break; // EOF reached
-        for (buf[0..n]) |c| {
-            const actions = p.next(c);
-            //std.log.warn("actions={any}", .{actions});
-            _ = actions;
-        }
+        parseAll(&p, buf[0..n]);
     }
 }
+
+/// Separated from `step` so that the tight per-byte loop gets its own
+/// function alignment, insulating it from code-layout changes elsewhere
+/// in the binary that would otherwise shift its cache-line placement.
+noinline fn parseAll(p: *terminalpkg.Parser, data: []const u8) void {
+    for (data) |c| {
+        const actions = p.next(c);
+        _ = actions;
+    }
+}