benchmarks: Align `buf` to cache line for consistency (#8569)
This aligns the 4096-byte `buf` in the benchmarks to the cache line, to ensure a consistent number of cache lines is used and to avoid any sub-`usize` alignment issues like those seen in https://github.com/ghostty-org/ghostty/pull/8548. This has less of an effect than https://github.com/ghostty-org/ghostty/pull/8548, and looking at the before and after of the current benchmarks in the repo doesn't show any noticeable difference. In my case, I've been comparing the `table` option with [uucode in this branch](https://github.com/ghostty-org/ghostty/compare/main...jacobsandlund:jacob/uucode?expand=1), and there I did see a difference.

### Before

I ran the before code several times (6 with the exact same binary, and several more with essentially the same code), always getting something like this, with `table` edging out `uucode` by roughly 3-4 ms:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     927.8 ms ±   1.3 ms    [User: 883.7 ms, System: 42.5 ms]
  Range (min … max):   926.0 ms … 929.8 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     930.9 ms ±   1.4 ms    [User: 886.8 ms, System: 42.5 ms]
  Range (min … max):   928.5 ms … 933.4 ms    10 runs
```

### After

After this change, `uucode` comes in 10-11 ms (~1%) faster:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     930.6 ms ±   1.3 ms    [User: 886.5 ms, System: 42.4 ms]
  Range (min … max):   928.9 ms … 932.4 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     920.1 ms ±   1.4 ms    [User: 876.3 ms, System: 42.1 ms]
  Range (min … max):   918.4 ms … 923.3 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
    1.01 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```

This ~1% faster time checks out: looking at the assembly, the two are an exact match except for this one small place where the compiler can optimize `uucode` a little better:

```
# both table.asm/uucode.asm:
   140      const high = cp >> 8;
   141      const low = cp & 0xFF;
** 142      return self.stage3[self.stage2[self.stage1[high] + low]];
  <+464>: ubfx  x12, x11, #8, #13
  <+468>: ldrh  w12, [x27, x12, lsl #1]
  <+472>: add   x11, x28, w11, uxtb #1
  <+476>: ldrh  w11, [x11, x12, lsl #1]

# table.asm:
  <+480>: lsl   x11, x11, #1
** 158          table.get(@intCast(cp)).width);
   159      }
   160  }
  <+484>: ldrb  w11, [x22, x11]

# uucode.asm:
** 148      return @field(data(stages, cp), name);
  <+480>: ldrh  w11, [x22, x11, lsl #1]
```
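As a reference for that lookup, here is a minimal sketch of a three-stage table of the shape `stage3[stage2[stage1[high] + low]]` from line 142 above. The `ThreeStage` struct, its element types, and the test values are illustrative assumptions, not the actual `table`/`uucode` definitions:

```zig
const std = @import("std");

// Illustrative three-stage codepoint lookup mirroring the indexing shown in
// the listing above. The field names match the listing; the struct and the
// element types are assumptions, not the real ghostty/uucode layout.
const ThreeStage = struct {
    stage1: []const u16,
    stage2: []const u16,
    stage3: []const u8,

    fn get(self: ThreeStage, cp: u21) u8 {
        const high = cp >> 8; // which 256-codepoint block
        const low = cp & 0xFF; // offset within that block
        return self.stage3[self.stage2[self.stage1[high] + low]];
    }
};

test "three-stage lookup" {
    // Tiny hand-built tables: a single block where every entry resolves to 1.
    const stage2 = [_]u16{0} ** 256;
    const t: ThreeStage = .{
        .stage1 = &[_]u16{0},
        .stage2 = &stage2,
        .stage3 = &[_]u8{1},
    };
    try std.testing.expectEqual(@as(u8, 1), t.get('a'));
}
```

The hot path is just three dependent loads, which is why the only codegen difference the compiler finds is the final load (the `lsl` + `ldrb` for `table` versus the single scaled `ldrh` for `uucode` above).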
### More confusion with showing addresses

Confusingly, when I added `std.debug.print("buf addr={}\n", .{@intFromPtr(&buf)})` to show the addresses, this somehow made the **before** case show `uucode` as being faster. Then, when I added alignment, `uucode` and `table` took about the same time (**edit:** _uucode was only ~4 ms faster, but see more in "Edit: more investigation"_). If I run without the `std.debug.print` and with `--show-output`, the times are different, so I'm just making a note of this.

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     904.2 ms ±   1.2 ms    [User: 884.6 ms, System: 40.3 ms]
  Range (min … max):   902.8 ms … 906.1 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     892.7 ms ±   2.0 ms    [User: 873.2 ms, System: 40.1 ms]
  Range (min … max):   887.9 ms … 895.6 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
    1.01 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```

I think, even with this confusing case, aligning is going to be more consistent than not.

### Edit: more investigation

I wasn't satisfied with the discovery that adding `std.debug.print` made this difference, and I wanted to dig in and figure out exactly what's going on, but I didn't get a satisfactory answer. Here's what I tried:

* I compared the un-aligned addresses from `stepTable` and `stepUucode`, but both seemed similar (not aligned to 128, different each run, but aligned to 8). Note though that `uucode` was still running ~1% faster, similar to the aligned case, even though here it was un-aligned.
* Instead of doing `std.debug.print` in the step function, I printed in teardown, just in case. This made no difference in the unaligned case, but with alignment it brought the ~4 ms faster `uucode` (as noted above) back closer to the original "after", at around 11-12 ms faster (~1%).
* I forced the `buf` in `stepUucode` to not be aligned (e.g. by making it `= other_aligned_buf[3..4096 + 3]`). It was still ~1% faster.
* I compared the assembly of `stepTable` and `stepUucode` for both the aligned and unaligned cases, including diffing the diff of these two across aligned and unaligned. The only difference between `stepTable` and `stepUucode` is what's noted above, and nothing stood out in the double diff.
* I tried going back to the original un-aligned, non-printing code, but swapped the lines that get from `table` or `uucode`, so that `stepTable` and `stepUucode` were actually doing the opposite. The result: `stepTable` (actually `uucode`) was 10-11 ms (~1%) faster, just like the aligned case!

In summary, I wasn't able to replicate the original benchmark behavior _and print out buffer addresses that pointed to alignment being the issue_. I still feel like in theory aligning the buffer ought to make the benchmark more reliable, and indeed the original un-aligned version gives the result that is more of an outlier, but the evidence here is weak, so I'm alright if we stick with the status quo and close. I think a lesson here is that benchmarks are hard to get precise.
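For anyone reproducing the investigation, here is a minimal standalone sketch (a toy `main`, not the actual benchmark harness) of the aligned buffer from this change together with the kind of address check described above; the `@memset` is just a stand-in for the benchmark's `r.read(&buf)`:

```zig
const std = @import("std");

pub fn main() void {
    // The change in this PR: align the 4096-byte read buffer to the cache
    // line so it always spans the same number of cache lines.
    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;

    // Stand-in for the benchmark's r.read(&buf).
    @memset(&buf, 0);

    // The kind of address check used during the investigation. Printing from
    // the hot step function perturbed the timings, so doing it once outside
    // the measured loop (e.g. in teardown) is safer.
    const addr = @intFromPtr(&buf);
    std.debug.print("buf addr={} cache_line={} aligned={}\n", .{
        addr,
        std.atomic.cache_line,
        addr % std.atomic.cache_line == 0,
    });
}
```

On aarch64, `std.atomic.cache_line` is 128, which is where the "aligned to 128" figure in the investigation above comes from.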
commit e7c1b4dd05
@@ -109,7 +109,7 @@ fn stepWcwidth(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -133,7 +133,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -162,7 +162,7 @@ fn stepSimd(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -92,7 +92,7 @@ fn stepNoop(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -114,7 +114,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
     var d: UTF8Decoder = .{};
     var state: unicode.GraphemeBreakState = .{};
     var cp1: u21 = 0;
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -91,7 +91,7 @@ fn stepZiglyph(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -115,7 +115,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});