mirror-linux/block
Jialin Wang f91ffe89b2
blk-iocost: fix busy_level reset when no IOs complete

When a disk is saturated, it is common for no IOs to complete within a
timer period. Currently, in this case, rq_wait_pct and missed_ppm are
calculated as 0, so iocost incorrectly interprets this as meeting the
QoS targets and resets busy_level to 0.

This reset prevents busy_level from reaching the threshold (4) needed
to reduce vrate. On certain cloud storage, such as Azure Premium SSD,
we observed that iocost may fail to reduce vrate for tens of seconds
during saturation, failing to mitigate noisy neighbor issues.

Fix this by tracking the number of IO completions (nr_done) in a period.
If nr_done is 0 and there are lagging IOs, the saturation status is
unknown, so we keep busy_level unchanged.
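
The effect of holding busy_level can be illustrated with a toy model.
This is a sketch only, not the kernel code: the function names, the
argument order, and the reduction of the QoS check to a single
"missed or met" flag are assumptions made for illustration. It replays
a saturated device where periods with late completions alternate with
periods in which nothing completes.

```sh
#!/bin/sh
# Toy model of the per-period busy_level decision; NOT the kernel code.
# All names and the simplified QoS classification are hypothetical.
#
# $1 = current busy_level
# $2 = nr_done      (IOs completed in this period)
# $3 = nr_lagging   (IOs still outstanding from earlier periods)
# $4 = qos_missed   (1 if latency targets were missed, else 0)
# $5 = hold_on_idle (1 = behaviour described by the fix, 0 = old behaviour)
next_busy_level() {
  if [ "$5" -eq 1 ] && [ "$2" -eq 0 ] && [ "$3" -gt 0 ]; then
    # Nothing completed but IOs are outstanding: status unknown, hold.
    echo "$1"
  elif [ "$4" -eq 1 ]; then
    # Targets missed: clamp negative levels to 0, then step up by one.
    if [ "$1" -lt 0 ]; then echo 1; else echo $(( $1 + 1 )); fi
  else
    # Simplification: treat everything else as "targets met" and reset.
    echo 0
  fi
}

# Replay 8 periods: odd periods complete 4 IOs that miss the target,
# even periods complete nothing while 8 IOs are still lagging.
simulate() {
  level=0
  for period in 1 2 3 4 5 6 7 8; do
    if [ $(( period % 2 )) -eq 1 ]; then
      level=$(next_busy_level "$level" 4 8 1 "$1")
    else
      level=$(next_busy_level "$level" 0 8 0 "$1")
    fi
  done
  echo "$level"
}

echo "old behaviour: busy_level=$(simulate 0)"   # prints busy_level=0
echo "with the fix:  busy_level=$(simulate 1)"   # prints busy_level=4
```

In this model the old behaviour never gets past 1 because every empty
period resets the counter, while holding the value lets it reach the
threshold of 4 mentioned above.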

The issue is consistently reproducible on Azure Standard_D8as_v5 (Dasv5)
VMs with 512GB Premium SSD (P20) using the script below. It was not
observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd), and no
regressions were found with this patch. In this script, cgA performs
large IOs with iodepth=128, while cgB performs small IOs with iodepth=1,
rate_iops=100 and rw=randrw. With iocost enabled, we expect cgA to be
throttled: the submission latency (slat) of cgA should be significantly
higher, while cgB should reach 200 IOPS (100 read plus 100 write) with
low completion latency (clat).

  BLK_DEVID="8:0"
  MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566"
  QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000"

  echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model
  echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos

  CG_A="/sys/fs/cgroup/cgA"
  CG_B="/sys/fs/cgroup/cgB"

  FILE_A="/path/to/sda/A.fio.testfile"
  FILE_B="/path/to/sda/B.fio.testfile"
  RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)"

  mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR"

  get_result() {
    local file=$1
    local label=$2

    local results=$(jq -r '
    .jobs[0].mixed |
    ( .iops | tonumber | round ) as $iops |
    ( .bw_bytes / 1024 / 1024 ) as $bps |
    ( .slat_ns.mean / 1000000 ) as $slat |
    ( .clat_ns.mean / 1000000 ) as $avg |
    ( .clat_ns.max / 1000000 ) as $max |
    ( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 |
    ( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 |
    ( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 |
    ( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 |
    "\($iops)|\($bps)|\($slat)|\($avg)|\($max)|\($p90)|\($p99)|\($p999)|\($p9999)"
    ' "$file")

    IFS='|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results"
    printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \
           "$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999"
  }

  run_fio() {
    local cg_path=$1
    local filename=$2
    local name=$3
    local bs=$4
    local qd=$5
    local out=$6
    shift 6
    local extra="$*"

    (
      pid=$(sh -c 'echo $PPID')
      echo $pid >"${cg_path}/cgroup.procs"
      fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \
          --ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \
          --time_based --group_reporting --unified_rw_reporting=mixed \
          --output-format=json --output="$out" $extra >/dev/null 2>&1
    ) &
  }

  echo "Starting Test ..."

  for bs_b in "4k" "32k" "256k"; do
    echo "Running iteration: BS=$bs_b"
    out_a="${RESULT_DIR}/cgA_1m.json"
    out_b="${RESULT_DIR}/cgB_${bs_b}.json"

    # cgA: Heavy background (BS 1MB, QD 128)
    run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a"
    # cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100)
    run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100"

    wait
    SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n'
    SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n'
  done

  echo -e "\nFinal Results Summary:\n"

  printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \
          "" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat"
  printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \
          "CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99"
  echo "$SUMMARY_DATA"

  echo "Results saved in $RESULT_DIR"
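
To check whether vrate is actually being reduced while the script runs,
it may help to sample it once per second. The sketch below assumes that
the root cgroup's io.stat reports a cost.vrate=<percent> field for the
device when iocost is enabled; the helper name and the sample line are
hypothetical and only illustrate the parsing.

```sh
#!/bin/sh
# Hypothetical helper: read io.stat-style text on stdin and print the
# cost.vrate value for the device given as $1 (MAJ:MIN).
# Assumes a "cost.vrate=<percent>" field is present on the device line.
extract_vrate() {
  awk -v dev="$1" '
    $1 == dev {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^cost\.vrate=/) {
          sub(/^cost\.vrate=/, "", $i)
          print $i
        }
    }'
}

# Illustrative input only; the values are made up for the demo.
sample="8:0 rbytes=4096 wbytes=0 rios=1 wios=0 dbytes=0 dios=0 cost.vrate=87.50 cost.usage=1234"
printf '%s\n' "$sample" | extract_vrate "8:0"   # prints 87.50

# Possible use during a test run (not executed here):
#   while sleep 1; do extract_vrate "8:0" < /sys/fs/cgroup/io.stat; done
```

A device line without the field produces no output, so an empty result
suggests iocost is not enabled for that device or the field is not
exposed by the running kernel.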

Before:
                          slat     clat     clat     clat     clat     clat     clat
  CGROUP   IOPS   MB/s    avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

  cgA-1m   166    166.37  3.44     748.95   1298.29  977.27   1233.13  1300.23  1300.23
  cgB-4k   5      0.02    0.02     181.74   761.32   742.39   759.17   759.17   759.17

  cgA-1m   167    166.51  1.98     748.68   1549.41  809.50   1451.23  1551.89  1551.89
  cgB-32k  6      0.18    0.02     169.98   761.76   742.39   759.17   759.17   759.17

  cgA-1m   166    165.55  2.89     750.89   1540.37  851.44   1451.23  1535.12  1535.12
  cgB-256k 5      1.30    0.02     191.35   759.51   750.78   759.17   759.17   759.17

After:
                          slat     clat     clat     clat     clat     clat     clat
  CGROUP   IOPS   MB/s    avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

  cgA-1m   162    162.48  6.14     749.69   850.02   826.28   834.67   843.06   851.44
  cgB-4k   199    0.78    0.01     1.95     42.12    2.57     7.50     34.87    42.21

  cgA-1m   146    146.20  6.83     833.04   908.68   893.39   901.78   910.16   910.16
  cgB-32k  200    6.25    0.01     2.32     31.40    3.06     7.50     16.58    31.33

  cgA-1m   110    110.46  9.04     1082.67  1197.91  1182.79  1199.57  1199.57  1199.57
  cgB-256k 200    49.98   0.02     3.69     22.20    4.88     9.11     20.05    22.15

Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260331100509.182882-1-wjl.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-31 13:56:38 -06:00
partitions block: partitions: Replace pp_buf with struct seq_buf 2026-03-21 08:27:08 -06:00
Kconfig block: Remove obsolete configs BLK_MQ_{PCI,VIRTIO} 2025-05-14 05:43:56 -06:00
Kconfig.iosched
Makefile block: add fs_bio_integrity helpers 2026-03-09 07:47:02 -06:00
badblocks.c badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0 2025-03-10 07:41:58 -06:00
bdev.c block: remove redundant kill_bdev() call in set_blocksize() 2026-02-04 09:28:18 -07:00
bfq-cgroup.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
bfq-iosched.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
bfq-iosched.h block, bfq: update outdated comment 2026-01-01 08:57:37 -07:00
bfq-wf2q.c
bio-integrity-auto.c block: prepare generation / verification helpers for fs usage 2026-03-09 07:47:02 -06:00
bio-integrity-fs.c block: add fs_bio_integrity helpers 2026-03-09 07:47:02 -06:00
bio-integrity.c block: factor out a bio_integrity_setup_default helper 2026-03-09 07:47:02 -06:00
bio.c block: fix bio_alloc_bioset slowpath GFP handling 2026-03-23 07:58:32 -06:00
blk-cgroup-fc-appid.c
blk-cgroup-rwstat.c blk-cgroup: use group allocation/free of per-cpu counters API 2024-04-03 09:10:17 -06:00
blk-cgroup-rwstat.h blk-cgroup: rwstat: fix kernel-doc warnings in header file 2025-01-13 07:47:09 -07:00
blk-cgroup.c blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current() 2026-03-31 13:55:41 -06:00
blk-cgroup.h block: initialize bio issue time in blk_mq_submit_bio() 2025-09-10 05:23:45 -06:00
blk-core.c blk-mq: add a new queue sysfs attribute async_depth 2026-02-03 07:45:36 -07:00
blk-crypto-fallback.c Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses 2026-02-22 08:26:33 -08:00
blk-crypto-internal.h blk-crypto: handle the fallback above the block layer 2026-01-11 12:55:41 -07:00
blk-crypto-profile.c Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
blk-crypto-sysfs.c blk-crypto: make blk_crypto_attr instances const 2026-03-17 19:29:16 -06:00
blk-crypto.c blk-crypto: handle the fallback above the block layer 2026-01-11 12:55:41 -07:00
blk-flush.c block: pass io_comp_batch to rq_end_io_fn callback 2026-01-20 10:12:54 -07:00
blk-ia-ranges.c block: ia-ranges: make blk_ia_range_sysfs_entry instances const 2026-03-17 19:29:16 -06:00
blk-integrity.c block: don't merge bios with different app_tags 2026-01-06 19:10:08 -07:00
blk-ioc.c copy_process: pass clone_flags as u64 across calltree 2025-09-01 15:31:34 +02:00
blk-iocost.c blk-iocost: fix busy_level reset when no IOs complete 2026-03-31 13:56:38 -06:00
blk-iolatency.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-ioprio.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
blk-ioprio.h blk-ioprio: remove per-disk structure 2024-07-28 16:47:51 -06:00
blk-lib.c block: change return type to void 2026-02-12 04:23:53 -07:00
blk-map.c block-7.0-20260305 2026-03-06 08:36:18 -08:00
blk-merge.c for-7.0/block-stable-pages-20260206 2026-02-09 18:14:52 -08:00
blk-mq-cpumap.c blk-mq: add number of queue calc helper 2025-07-01 10:24:19 -06:00
blk-mq-debugfs.c block: allow submitting all zone writes from a single context 2026-03-09 14:30:00 -06:00
blk-mq-debugfs.h blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() 2026-02-02 07:05:19 -07:00
blk-mq-dma.c block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova 2026-02-12 04:23:31 -07:00
blk-mq-sched.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-mq-sched.h blk-mq-sched: unify elevators checking for async requests 2026-02-03 07:45:36 -07:00
blk-mq-sysfs.c blk-mq: make blk_mq_hw_ctx_sysfs_entry instances const 2026-03-17 19:29:16 -06:00
blk-mq-tag.c blk-mq: use array manage hctx map instead of xarray 2025-11-28 09:09:19 -07:00
blk-mq.c block: clear BIO_QOS flags in blk_steal_bios() 2026-03-10 07:11:09 -06:00
blk-mq.h blk-mq: use queue_hctx in blk_mq_map_queue_type 2025-12-01 07:18:31 -07:00
blk-pm.c block: force noio scope in blk_mq_freeze_queue 2025-01-31 07:20:08 -07:00
blk-pm.h
blk-rq-qos.c blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() 2026-02-02 07:05:19 -07:00
blk-rq-qos.h blk-rq-qos: Remove unlikely() hints from QoS checks 2026-01-06 19:08:23 -07:00
blk-settings.c blk-integrity: support arbitrary buffer alignment 2026-03-14 07:44:30 -06:00
blk-stat.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-stat.h blk-stat: convert struct blk_stat_callback to kernel-doc 2026-02-16 10:21:06 -07:00
blk-sysfs.c block: make queue_sysfs_entry instances const 2026-03-17 19:29:16 -06:00
blk-throttle.c block/blk-throttle: Remove throtl_slice from struct throtl_data 2025-11-17 09:39:48 -07:00
blk-throttle.h blk-throttle: fix access race during throttle policy activation 2025-09-08 08:24:44 -06:00
blk-timeout.c
blk-wbt.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
blk-wbt.h blk-wbt: factor out a helper wbt_set_lat() 2026-02-02 07:05:19 -07:00
blk-zoned.c block: fix zones_cond memory leak on zone revalidation error paths 2026-03-31 07:05:49 -06:00
blk.h block: mark bvec_{alloc,free} static 2026-03-17 19:27:14 -06:00
bsg-lib.c bsg: add io_uring command support to generic layer 2026-03-19 11:38:24 -06:00
bsg.c bsg: add io_uring command support to generic layer 2026-03-19 11:38:24 -06:00
disk-events.c loop: fix partition scan race between udev and loop_reread_partitions() 2026-03-31 07:04:34 -06:00
early-lookup.c wrapper for access to ->bd_partno 2024-05-02 17:48:09 -04:00
elevator.c block: use trylock to avoid lockdep circular dependency in sysfs 2026-03-05 04:01:42 -07:00
elevator.h block: fix race between wbt_enable_default and IO submission 2025-12-12 12:51:11 -07:00
fops.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
genhd.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
holder.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
ioctl.c block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ 2026-02-11 10:36:54 -07:00
ioprio.c block: remove test of incorrect io priority level 2025-05-08 09:04:12 -06:00
kyber-iosched.c kyber: covert to use request_queue->async_depth 2026-02-03 07:45:36 -07:00
mq-deadline.c mq-deadline: covert to use request_queue->async_depth 2026-02-03 07:45:36 -07:00
opal_proto.h sed-opal: Add STACK_RESET command 2026-03-31 07:04:00 -06:00
sed-opal.c sed-opal: Add STACK_RESET command 2026-03-31 07:04:00 -06:00
t10-pi.c blk-integrity: support arbitrary buffer alignment 2026-03-14 07:44:30 -06:00