[benchmarks] Use std.mem.doNotOptimizeAway to avoid data collisions (#8548)

I've been playing with benchmarks over in my [branch swapping out
ziglyph for
uucode](https://github.com/ghostty-org/ghostty/compare/main...jacobsandlund:jacob/uucode?expand=1),
and I ran into an interesting issue where benchmarks were giving odd
numbers.

TL;DR: writing the result to `buf[0]` slows the benchmark down in
inconsistent ways, because `buf` is the same buffer that's being written
and read in the loop; switching to `std.mem.doNotOptimizeAway` fixes
this.
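
For context, `std.mem.doNotOptimizeAway` accepts any value and makes the compiler treat it as used, without going through memory. A minimal standalone sketch of the pattern, with a stand-in computation in place of the real width lookup:

```
const std = @import("std");

pub fn main() void {
    for (0..1_000_000) |cp| {
        // Stand-in for the real width computation in the benchmark.
        const width = cp % 3;
        // Marks the value as "used" so the loop body isn't compiled away,
        // without writing to memory that the loop also reads.
        std.mem.doNotOptimizeAway(width);
    }
}
```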

## Full story

I ran the `codepoint-width` benchmark with the following commands (and
did the same for `grapheme-bench` and `is-symbol`):

```
zig-out/bin/ghostty-gen +utf8 | head -c 200000000 > data.txt
hyperfine --warmup 4 'zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table' 'zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode'
```

... and I was surprised to see that `uucode` was 3% slower than Ghostty's
existing table implementation, despite the implementations being very
similar. I debugged this, bringing the `uucode` implementation down to
the exact same assembly (minus offsets) as Ghostty's, even reusing the
same table data (a fun fact I learned: even though these tables are
large, Zig or LLVM saw they were byte-for-byte identical and merged them
into a single table). Still, 3% slower.

Then I realized that if I wrote to a separate `buf` on `self`, the
difference went away, and I figured out that the write to `buf[0]` is
what trips up the CPU: on the next pass of the outer loop, that same
byte gets overwritten again when the next chunk is read from the data
file, and then it's read back as part of decoding the code point.
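
Here's a minimal, self-contained sketch of that loop shape (the UTF-8 decoding and width table are elided, and the stand-in computation, buffer size, and `data.txt` path are just for illustration). The key detail is that `buf` is both the read target for the data file and the destination of the result store:

```
const std = @import("std");

pub fn main() !void {
    var file = try std.fs.cwd().openFile("data.txt", .{});
    defer file.close();

    var buf: [4096]u8 = undefined;
    while (true) {
        // Refill the buffer from the data file; this rewrites buf[0] too.
        const n = try file.read(&buf);
        if (n == 0) break;
        for (buf[0..n]) |c| {
            // Stand-in for the real width lookup.
            const width: u8 = c % 3;
            // Storing into the same buffer the loop is still reading forces
            // the CPU to order this store against the following loads.
            buf[0] = width;
        }
    }
}
```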

### with buf[0]

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     944.7 ms ±   0.8 ms    [User: 900.2 ms, System: 42.8 ms]
  Range (min … max):   943.4 ms … 945.9 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     974.0 ms ±   0.7 ms    [User: 929.3 ms, System: 43.1 ms]
  Range (min … max):   973.3 ms … 975.2 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table ran
    1.03 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
```

### with mem.doNotOptimizeAway

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     929.4 ms ±   2.7 ms    [User: 884.8 ms, System: 43.0 ms]
  Range (min … max):   926.7 ms … 936.3 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     931.2 ms ±   2.5 ms    [User: 886.6 ms, System: 42.9 ms]
  Range (min … max):   927.3 ms … 935.7 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table ran
    1.00 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
```

### with buf[0], mode = .uucode

Another interesting thing: with `buf[0]`, the result somehow depends
heavily on code offsets. If I switch the default mode line from `mode:
Mode = .noop` to `mode: Mode = .uucode`, the offsets shift ever so
slightly, and even though that default mode is never actually used
(since the mode is passed in), it flips the benchmark results around:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     973.3 ms ±   2.2 ms    [User: 928.9 ms, System: 42.9 ms]
  Range (min … max):   968.0 ms … 975.9 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     945.8 ms ±   1.4 ms    [User: 901.2 ms, System: 42.8 ms]
  Range (min … max):   943.5 ms … 948.5 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
    1.03 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```

Looking at the assembly with `mode: Mode = .noop`:

```
# table.txt:

   165 	                // away
** 166 	                buf[0] = @intCast(width);

ghostty-bench[0x100017370] <+508>: strb   w11, [x21, #0x4]
ghostty-bench[0x100017374] <+512>: b      0x100017288               ; <+276> at CodepointWidth.zig:168:9
ghostty-bench[0x100017378] <+516>: mov    w0, #0x0                  ; =0 

# uucode.txt:

** 229 	                buf[0] = @intCast(width);

ghostty-bench[0x1000177bc] <+508>: strb   w11, [x21, #0x4]
ghostty-bench[0x1000177c0] <+512>: b      0x1000176d4               ; <+276> at CodepointWidth.zig:231:9
ghostty-bench[0x1000177c4] <+516>: mov    w0, #0x0                  ; =0 
```

vs `mode: Mode = .uucode`:

```
# table.txt:

** 166 	                buf[0] = @intCast(width);

ghostty-bench[0x100017374] <+508>: strb   w11, [x21, #0x4]
ghostty-bench[0x100017378] <+512>: b      0x10001728c               ; <+276> at CodepointWidth.zig:168:9
ghostty-bench[0x10001737c] <+516>: mov    w0, #0x0                  ; =0 

# uucode.txt:

** 229 	                buf[0] = @intCast(width);

ghostty-bench[0x1000177c0] <+508>: strb   w11, [x21, #0x4]
ghostty-bench[0x1000177c4] <+512>: b      0x1000176d8               ; <+276> at CodepointWidth.zig:231:9
ghostty-bench[0x1000177c8] <+516>: mov    w0, #0x0                  ; =0 
```

... shows that the only difference is the offsets, which somehow have a
large impact on the benchmark results.

Mitchell Hashimoto, 2025-09-06 12:48:40 -07:00, committed by GitHub (commit ae7061efb0)
2 changed files with 5 additions and 18 deletions

The first changed file, `CodepointWidth.zig`:

```
@@ -121,11 +121,7 @@ fn stepWcwidth(ptr: *anyopaque) Benchmark.Error!void {
             const cp_, const consumed = d.next(c);
             assert(consumed);
             if (cp_) |cp| {
-                const width = wcwidth(cp);
-                // Write the width to the buffer to avoid it being compiled
-                // away
-                buf[0] = @intCast(width);
+                std.mem.doNotOptimizeAway(wcwidth(cp));
             }
         }
     }
@@ -151,14 +147,10 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
             if (cp_) |cp| {
                 // This is the same trick we do in terminal.zig so we
                 // keep it here.
-                const width = if (cp <= 0xFF)
-                    1
-                else
-                    table.get(@intCast(cp)).width;
-                // Write the width to the buffer to avoid it being compiled
-                // away
-                buf[0] = @intCast(width);
+                std.mem.doNotOptimizeAway(if (cp <= 0xFF)
+                    1
+                else
+                    table.get(@intCast(cp)).width);
             }
         }
     }
@@ -182,11 +174,7 @@ fn stepSimd(ptr: *anyopaque) Benchmark.Error!void {
             const cp_, const consumed = d.next(c);
             assert(consumed);
             if (cp_) |cp| {
-                const width = simd.codepointWidth(cp);
-                // Write the width to the buffer to avoid it being compiled
-                // away
-                buf[0] = @intCast(width);
+                std.mem.doNotOptimizeAway(simd.codepointWidth(cp));
             }
         }
     }
```

The second changed file, the grapheme-break benchmark:

```
@@ -126,8 +126,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
             const cp_, const consumed = d.next(c);
             assert(consumed);
             if (cp_) |cp2| {
-                const v = unicode.graphemeBreak(cp1, @intCast(cp2), &state);
-                buf[0] = @intCast(@intFromBool(v));
+                std.mem.doNotOptimizeAway(unicode.graphemeBreak(cp1, @intCast(cp2), &state));
                 cp1 = cp2;
             }
         }
```