benchmarks: Align `buf` to cache line for consistency (#8569)

This aligns the `4096`-byte `buf` in the benchmarks to a cache-line
boundary, to ensure a consistent number of cache lines is used and to
avoid any sub-`usize` alignment issues like those seen in
https://github.com/ghostty-org/ghostty/pull/8548.
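
Concretely, each benchmark's stack buffer declaration just gains an explicit alignment (the full diff is at the bottom):

```zig
// Before: no explicit alignment; in my runs the buffer only ended up
// 8-byte aligned (see the investigation notes below).
var buf: [4096]u8 = undefined;

// After: the buffer always starts on a cache-line boundary.
var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
```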

This has less of an effect than
https://github.com/ghostty-org/ghostty/pull/8548, and comparing the
before and after of the current benchmarks in the repo doesn't show any
noticeable difference.

In my case, I've been comparing the `table` option with [uucode in this
branch](https://github.com/ghostty-org/ghostty/compare/main...jacobsandlund:jacob/uucode?expand=1),
and I did see a difference.
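
The benchmark output below looks like hyperfine's; assuming that's the tool (the `--show-output` flag mentioned further down also matches), the comparison would have been run with something like:

```
hyperfine \
  'zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table' \
  'zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode'
```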

### Before

I ran the before code several times (6 runs with the exact same binary,
plus several more with essentially the same code), always getting
something like this, with `table` edging out `uucode` by roughly 3-4 ms:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     927.8 ms ±   1.3 ms    [User: 883.7 ms, System: 42.5 ms]
  Range (min … max):   926.0 ms … 929.8 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     930.9 ms ±   1.4 ms    [User: 886.8 ms, System: 42.5 ms]
  Range (min … max):   928.5 ms … 933.4 ms    10 runs
```

### After

After this change, `uucode` comes in 10-11 ms (~1%) faster:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     930.6 ms ±   1.3 ms    [User: 886.5 ms, System: 42.4 ms]
  Range (min … max):   928.9 ms … 932.4 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     920.1 ms ±   1.4 ms    [User: 876.3 ms, System: 42.1 ms]
  Range (min … max):   918.4 ms … 923.3 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
    1.01 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```

This ~1% speedup checks out: looking at the assembly, the two modes are
an exact match except for this one small spot where the compiler can
optimize `uucode` a little better:

```
# both table.asm/uucode.asm:

   140                     const high = cp >> 8;
   141                     const low = cp & 0xFF;
** 142                     return self.stage3[self.stage2[self.stage1[high] + low]];

<+464>: ubfx   x12, x11, #8, #13
<+468>: ldrh   w12, [x27, x12, lsl #1]
<+472>: add    x11, x28, w11, uxtb #1
<+476>: ldrh   w11, [x11, x12, lsl #1]

# table.asm:

<+480>: lsl    x11, x11, #1

** 158                             table.get(@intCast(cp)).width);
   159                     }
   160                 }

<+484>: ldrb   w11, [x22, x11]

# uucode.asm:

** 148                 return @field(data(stages, cp), name);

<+480>: ldrh   w11, [x22, x11, lsl #1]
```
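
For context, here is a rough Zig sketch of the three-stage lookup that those source lines implement (the element types and field names are illustrative assumptions, not the real table layout):

```zig
const Props = struct { width: u8 };

const Tables = struct {
    stage1: []const u16,
    stage2: []const u16,
    stage3: []const Props,

    // Mirrors source lines 140-142 in the listing above: two narrow index
    // stages funnel the codepoint into a final properties table, so the
    // hot path is three dependent loads.
    fn get(self: Tables, cp: u21) Props {
        const high = cp >> 8;
        const low = cp & 0xFF;
        return self.stage3[self.stage2[self.stage1[high] + low]];
    }
};
```

The only divergence is that final load: for `table` the compiler emits a separate shift plus a byte load (`lsl` + `ldrb`), while for `uucode` the shift folds into a single halfword load (`ldrh`).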

### More confusion with showing addresses

Confusingly, when I added `std.debug.print("buf addr={}\n",
.{@intFromPtr(&buf)})` to show the addresses, this somehow made the
`before` case show `uucode` as faster. Then, when I added alignment,
`uucode` and `table` took about the same time (**edit:** _uucode was
only ~4 ms faster, but see more in "Edit: more investigation"_).
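
For reference, a self-contained version of that probe, extended to also report the offset from a cache-line boundary (the helper name and the modulo are my additions, not what was in the runs above):

```zig
const std = @import("std");

fn reportBufAlignment(buf: []const u8) void {
    // An offset of 0 means the buffer starts exactly on a cache-line boundary.
    std.debug.print("buf addr={} offset={}\n", .{
        @intFromPtr(buf.ptr),
        @intFromPtr(buf.ptr) % std.atomic.cache_line,
    });
}
```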

If I run without the `std.debug.print` but with `--show-output`, the
times also come out different, so I'm just making a note of it here:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     904.2 ms ±   1.2 ms    [User: 884.6 ms, System: 40.3 ms]
  Range (min … max):   902.8 ms … 906.1 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     892.7 ms ±   2.0 ms    [User: 873.2 ms, System: 40.1 ms]
  Range (min … max):   887.9 ms … 895.6 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
    1.01 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```

Even with this confusing case, I think aligning is going to be more
consistent than not aligning.

### Edit: more investigation

I wasn't satisfied with the discovery that adding `std.debug.print` made
this difference, so I wanted to dig in and figure out exactly what was
going on, but I didn't get a satisfactory answer. Here's what I tried:

* I compared the un-aligned addresses from `stepTable` and `stepUucode`,
but both seemed similar (not aligned to 128, different each run, but
aligned to 8). Note though that `uucode` was running ~1% faster still,
similar to the aligned case even though here it was un-aligned.
* Instead of doing `std.debug.print` in the step function, I printed in
teardown, just in case. This made no difference in the unaligned case,
but with alignment it brought the ~4 ms faster `uucode` (as noted above)
back closer to the original "after" at around 11-12 ms faster (~1%).
* I forced the `buf` in `stepUucode` to not be aligned (e.g. by making
it `= other_aligned_buf[3..4096 + 3]`; see the sketch after this list).
It was still ~1% faster.
* I compared the assembly of `stepTable` and `stepUucode` for both
aligned and not aligned cases, including doing a diff of the diff of
these two across aligned and not aligned. The only difference between
`stepTable` and `stepUucode` is what's noted above, and nothing stood
out in the double diff.
* I tried going back to the original un-aligned non-printing code, but
then swapped the lines that get from `table` or `uucode`, so that
`stepTable` and `stepUucode` were actually doing the opposite. And the
result was that `stepTable` (actually running `uucode`) was 10-11 ms
(~1%) faster, just like in the aligned case!
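
For the forced mis-alignment bullet above, the change inside `stepUucode` was along these lines (the backing array's extra size is my assumption; the slice matches what's described above):

```zig
// Back the working buffer with a larger, cache-line-aligned array, then
// slice it 3 bytes in so the 4096-byte view the loop reads into is
// deliberately unaligned.
var other_aligned_buf: [4096 + 64]u8 align(std.atomic.cache_line) = undefined;
const buf = other_aligned_buf[3 .. 4096 + 3]; // *[4096]u8, no longer cache-line aligned
```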

In summary, I wasn't able to replicate the original benchmark behavior
_and print out buffer addresses that pointed to alignment being the
issue_. I still feel like in theory aligning the buffer ought to make
the benchmark more reliable, and indeed the original un-aligned version
gives the result that is more of an outlier, but the evidence here is
weak, so I'm alright if we stick with the status quo and close this. I
think a lesson here is that benchmarks are hard to get precise.

3 changed files with 7 additions and 7 deletions

@@ -109,7 +109,7 @@ fn stepWcwidth(ptr: *anyopaque) Benchmark.Error!void {
const f = self.data_f orelse return;
var r = std.io.bufferedReader(f.reader());
var d: UTF8Decoder = .{};
-var buf: [4096]u8 = undefined;
+var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
while (true) {
const n = r.read(&buf) catch |err| {
log.warn("error reading data file err={}", .{err});
@@ -133,7 +133,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
const f = self.data_f orelse return;
var r = std.io.bufferedReader(f.reader());
var d: UTF8Decoder = .{};
-var buf: [4096]u8 = undefined;
+var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
while (true) {
const n = r.read(&buf) catch |err| {
log.warn("error reading data file err={}", .{err});
@@ -162,7 +162,7 @@ fn stepSimd(ptr: *anyopaque) Benchmark.Error!void {
const f = self.data_f orelse return;
var r = std.io.bufferedReader(f.reader());
var d: UTF8Decoder = .{};
-var buf: [4096]u8 = undefined;
+var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
while (true) {
const n = r.read(&buf) catch |err| {
log.warn("error reading data file err={}", .{err});

@@ -92,7 +92,7 @@ fn stepNoop(ptr: *anyopaque) Benchmark.Error!void {
const f = self.data_f orelse return;
var r = std.io.bufferedReader(f.reader());
var d: UTF8Decoder = .{};
-var buf: [4096]u8 = undefined;
+var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
while (true) {
const n = r.read(&buf) catch |err| {
log.warn("error reading data file err={}", .{err});
@@ -114,7 +114,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
var d: UTF8Decoder = .{};
var state: unicode.GraphemeBreakState = .{};
var cp1: u21 = 0;
-var buf: [4096]u8 = undefined;
+var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
while (true) {
const n = r.read(&buf) catch |err| {
log.warn("error reading data file err={}", .{err});

@@ -91,7 +91,7 @@ fn stepZiglyph(ptr: *anyopaque) Benchmark.Error!void {
const f = self.data_f orelse return;
var r = std.io.bufferedReader(f.reader());
var d: UTF8Decoder = .{};
-var buf: [4096]u8 = undefined;
+var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
while (true) {
const n = r.read(&buf) catch |err| {
log.warn("error reading data file err={}", .{err});
@@ -115,7 +115,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
const f = self.data_f orelse return;
var r = std.io.bufferedReader(f.reader());
var d: UTF8Decoder = .{};
-var buf: [4096]u8 = undefined;
+var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
while (true) {
const n = r.read(&buf) catch |err| {
log.warn("error reading data file err={}", .{err});