[benchmarks] Use std.mem.doNotOptimizeAway to avoid data collisions (#8548)
I've been playing with benchmarks over in my [branch swapping out ziglyph for uucode](https://github.com/ghostty-org/ghostty/compare/main...jacobsandlund:jacob/uucode?expand=1), and I ran into an interesting issue where the benchmarks were producing odd numbers.

TL;DR: writing to `buf[0]` slows down the benchmark in inconsistent ways, because it's the same buffer that is both written and read inside the loop. Switching to `std.mem.doNotOptimizeAway` fixes this.

## Full story

I ran the `codepoint-width` benchmark with the following (and did the same for `grapheme-bench` and `is-symbol`):

```
zig-out/bin/ghostty-gen +utf8 | head -c 200000000 > data.txt
hyperfine --warmup 4 \
  'zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table' \
  'zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode'
```

... and I was surprised to see that `uucode` was 3% slower than Ghostty, despite the implementations being similar. I debugged this, bringing the `uucode` implementation down to the exact same assembly (minus offsets) as Ghostty, even reusing the same table data (a fun fact I learned: even though these tables are large, Zig or LLVM saw they were byte-for-byte identical and deduplicated them into a single table). Still, 3% slower.

Then I realized that if I wrote to a separate `buf` on `self`, the difference went away, and I figured out it's the write to `buf[0]` that is tripping up the CPU: on the next outer loop iteration that same buffer is overwritten again when reading from the data file, and then read as part of decoding the code point.
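To make the two patterns concrete, here is a minimal, self-contained sketch (not the actual benchmark code; the hypothetical `fakeWidth` stands in for the real `wcwidth`/table lookup, and the loop is simplified):

```zig
const std = @import("std");

// Hypothetical stand-in for the real width computation.
fn fakeWidth(cp: u21) u8 {
    return if (cp <= 0xFF) 1 else 2;
}

pub fn main() void {
    var buf: [1]u8 = .{0};
    var cp: u21 = 0;
    while (cp < 0x1000) : (cp += 1) {
        const width = fakeWidth(cp);

        // Old pattern: store the result into a buffer so it isn't
        // compiled away. In the real benchmark this buffer is also
        // read by the loop, so timing becomes offset/aliasing-sensitive.
        buf[0] = width;

        // New pattern: tell the optimizer the value is used, without
        // writing to memory the loop reads.
        std.mem.doNotOptimizeAway(width);
    }
    std.mem.doNotOptimizeAway(&buf);
}
```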
### with buf[0]

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     944.7 ms ±   0.8 ms    [User: 900.2 ms, System: 42.8 ms]
  Range (min … max):   943.4 ms … 945.9 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     974.0 ms ±   0.7 ms    [User: 929.3 ms, System: 43.1 ms]
  Range (min … max):   973.3 ms … 975.2 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table ran
    1.03 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
```

### with mem.doNotOptimizeAway

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     929.4 ms ±   2.7 ms    [User: 884.8 ms, System: 43.0 ms]
  Range (min … max):   926.7 ms … 936.3 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     931.2 ms ±   2.5 ms    [User: 886.6 ms, System: 42.9 ms]
  Range (min … max):   927.3 ms … 935.7 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table ran
    1.00 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
```

### with buf[0], mode = .uucode

Another interesting thing is that with `buf[0]`, the result is highly dependent on the offsets somehow.
If I switched the default mode line from `mode: Mode = .noop` to `mode: Mode = .uucode`, it shifts the offsets ever so slightly, and even though that default mode is never used (since the mode is passed in), it flips the results of the benchmark around:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     973.3 ms ±   2.2 ms    [User: 928.9 ms, System: 42.9 ms]
  Range (min … max):   968.0 ms … 975.9 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     945.8 ms ±   1.4 ms    [User: 901.2 ms, System: 42.8 ms]
  Range (min … max):   943.5 ms … 948.5 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
    1.03 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```

Looking at the assembly with `mode: Mode = .noop`:

```
# table.txt:
   165            // away
** 166            buf[0] = @intCast(width);
ghostty-bench[0x100017370] <+508>: strb   w11, [x21, #0x4]
ghostty-bench[0x100017374] <+512>: b      0x100017288  ; <+276> at CodepointWidth.zig:168:9
ghostty-bench[0x100017378] <+516>: mov    w0, #0x0     ; =0

# uucode.txt:
** 229            buf[0] = @intCast(width);
ghostty-bench[0x1000177bc] <+508>: strb   w11, [x21, #0x4]
ghostty-bench[0x1000177c0] <+512>: b      0x1000176d4  ; <+276> at CodepointWidth.zig:231:9
ghostty-bench[0x1000177c4] <+516>: mov    w0, #0x0     ; =0
```

vs `mode: Mode = .uucode`:

```
# table.txt:
** 166            buf[0] = @intCast(width);
ghostty-bench[0x100017374] <+508>: strb   w11, [x21, #0x4]
ghostty-bench[0x100017378] <+512>: b      0x10001728c  ; <+276> at CodepointWidth.zig:168:9
ghostty-bench[0x10001737c] <+516>: mov    w0, #0x0     ; =0

# uucode.txt:
** 229            buf[0] = @intCast(width);
ghostty-bench[0x1000177c0] <+508>: strb   w11, [x21, #0x4]
ghostty-bench[0x1000177c4] <+512>: b      0x1000176d8  ; <+276> at CodepointWidth.zig:231:9
ghostty-bench[0x1000177c8] <+516>: mov    w0, #0x0     ; =0
```

...
shows that the only difference is the offsets, which somehow have a large impact on the result of the benchmark.
commit
ae7061efb0
```
@@ -121,11 +121,7 @@ fn stepWcwidth(ptr: *anyopaque) Benchmark.Error!void {
             const cp_, const consumed = d.next(c);
             assert(consumed);
             if (cp_) |cp| {
-                const width = wcwidth(cp);
-
-                // Write the width to the buffer to avoid it being compiled
-                // away
-                buf[0] = @intCast(width);
+                std.mem.doNotOptimizeAway(wcwidth(cp));
             }
         }
     }
```
```
@@ -151,14 +147,10 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
             if (cp_) |cp| {
-                // This is the same trick we do in terminal.zig so we
-                // keep it here.
-                const width = if (cp <= 0xFF)
+                std.mem.doNotOptimizeAway(if (cp <= 0xFF)
                     1
                 else
-                    table.get(@intCast(cp)).width;
-
-                // Write the width to the buffer to avoid it being compiled
-                // away
-                buf[0] = @intCast(width);
+                    table.get(@intCast(cp)).width);
             }
         }
     }
```
```
@@ -182,11 +174,7 @@ fn stepSimd(ptr: *anyopaque) Benchmark.Error!void {
             const cp_, const consumed = d.next(c);
             assert(consumed);
             if (cp_) |cp| {
-                const width = simd.codepointWidth(cp);
-
-                // Write the width to the buffer to avoid it being compiled
-                // away
-                buf[0] = @intCast(width);
+                std.mem.doNotOptimizeAway(simd.codepointWidth(cp));
             }
         }
     }
```
```
@@ -126,8 +126,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
             const cp_, const consumed = d.next(c);
             assert(consumed);
             if (cp_) |cp2| {
-                const v = unicode.graphemeBreak(cp1, @intCast(cp2), &state);
-                buf[0] = @intCast(@intFromBool(v));
+                std.mem.doNotOptimizeAway(unicode.graphemeBreak(cp1, @intCast(cp2), &state));
                 cp1 = cp2;
             }
```