Commit Graph

8649 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Linus Torvalds e1b1d03cee for-6.18/block-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoY0D/9J+11BC88pBxCrLKv/V2TwCNokRMi0dU3L
 r3EUdA46k0oXmvb6ueZqIcfY2e+IX7rdQkaRbh1zRdsNejqHo4548C3ePWGdBAcM
 OdNEGfpehO0aD0td1+mK/NxoJMLhbs5QraPanz+SOkGZOKeF+vGCga5PUDivsr5J
 16T9yb7i+isENLdAc2RJbZVyAphqHQlo5GHi5ZIKOVi5cNt8GU/q2sQl7NYmGvHd
 aq37svvZHFOhLRajP959Fw9WOxEYITewzQ4UYf1FZjUodJUxO+vCnP0ooBQRlyu8
 1B4PYWwSE+Vn3GkQE0Om+mzo9AVPOiLmoAWGxdgJBMyEkZndocr46XEslXOufQ1Z
 T3Gu19G6jCxcyByNVhjVnaajYKmvSQAy1w75m4XlfqTRm4f9Om+LAJavUk3RuaOL
 7lXKQ7Ql1/Tby9Jmf8afjYYXXotNDNku6rz2P3qtOwAA26mNJfgVt0rO+8XGRDe9
 ioLbCkTjslYMc/Oh4jSsbrspsVALbaQMq/Dmah8k0EWb4QAHVgCJyGBoff3hOboI
 jD6B1enaKOQVgcjWcjm/FjOk3jv2h3v4X26YWQZTvEc/1PnSnST78Zi/ePhzDdmt
 sBALUAS37TfTgNMzrhbHl5Zs13k0C0XyANuayuKuo5hlNnC1wbdap+5FZJOmpuOB
 YT+VkYnaOA==
 =kOmc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)

 - MD pull request via Yu:
     - Add support for a lockless bitmap.

       A key feature of the new bitmap is that the IO fastpath is
       lockless: if a user issues lots of write IO to the same bitmap
       bit in a short time, only the first write pays the overhead of
       updating the bitmap bit; the following writes incur no additional
       overhead.

       By supporting resync/recovery of only written data, there is no
       need for a full disk resync/recovery when creating a new array or
       replacing a disk with a new one.

 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather
   than struct block_device.

 - Rust block changes via Andreas. This series adds configuration via
   configfs and remote completion to the rnull driver. The series also
   includes a set of changes to the rust block device driver API: a few
   cleanup patches, and a few features supporting the rnull changes.

   The series removes the raw buffer formatting logic from
   `kernel::block` and improves the logic available in `kernel::string`
   to support the same use as the removed logic.

 - floppy arch cleanups

 - Reduce the number of dereferences needed for ublk commands

 - Restrict supported sockets for nbd. Mostly done to eliminate a class
   of issues perpetually reported by syzbot, which uses nonsensical
   socket setups.

 - A few s390 dasd block fixes

 - Fix a few issues around atomic writes

 - Improve DMA iteration for integrity requests

 - Improve how iovecs are treated with regard to O_DIRECT alignment
   constraints.

   We used to require each segment to adhere to the constraints; now
   only the request as a whole needs to (see the sketch after this
   list).

 - Clean up and improve p2p support, enabling use of p2p for metadata
   payloads

 - Improve locking of request lookup, using SRCU where appropriate

 - Use page references properly for brd, avoiding very long RCU sections

 - Fix ordering of recursively submitted IOs

 - Clean up and improve updating nr_requests for a live device

 - Various fixes and cleanups
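
As a hedged illustration of the relaxed O_DIRECT iovec handling mentioned
above: a userspace sketch with a made-up device path, assuming a 512-byte
logical block size; per-device limits such as dma_alignment still apply.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <sys/uio.h>
  #include <unistd.h>

  int main(void)
  {
          void *buf;
          struct iovec iov[2];
          ssize_t ret;
          int fd;

          if (posix_memalign(&buf, 4096, 4096))
                  return 1;

          fd = open("/dev/sdX", O_RDONLY | O_DIRECT);  /* hypothetical device */
          if (fd < 0)
                  return 1;

          /* Each segment is only 256 bytes, smaller than one logical block,
           * but the request as a whole is 512 bytes; per the summary above,
           * only the request as a whole must satisfy the constraints now. */
          iov[0].iov_base = buf;
          iov[0].iov_len = 256;
          iov[1].iov_base = (char *)buf + 512;
          iov[1].iov_len = 256;

          ret = preadv(fd, iov, 2, 0);
          close(fd);
          free(buf);
          return ret == 512 ? 0 : 1;
  }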

* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-10-02 10:16:56 -07:00
Linus Torvalds a769648f46 dlm for 6.18
This set of patches adds a dlm_release_lockspace() flag to request that
 node-failure recovery be performed for the node leaving the lockspace.
 The implementation of this flag requires coordination with userland
 clustering components.  It's been requested for use by GFS2.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEcGkeEvkvjdvlR90nOBtzx/yAaaoFAmjamPAACgkQOBtzx/yA
 aaql4RAAqdzYLcAkhhM6oge+FMOYGIGsMkd286Oct8zU9RbGr0kqckTgz93k0uUC
 BZApe0NbVAh6oXyglc8enpxEI/he8IIxzFCoy+dM1YEJzLH6GFcFmYIqiGMDA0G0
 nceMeMHFzJyTA/QiLrKqmULsAXsFYI1ZRCEAX08zkH/iCRKhFYUP/eD2U88sN5lF
 Kn29pG+/G4S07VR+1/K6TRRSaFkEWp9Vp6MjgziPT3x4XkRXOew5W5Lskvpgj8nb
 HaU/pbmIwcS7mj5K4iBLaAr5SlSNa27N/YGaDkLcOITuhkLsT5yJv4QyKyMuMJ7Y
 6fK96MxCnVrg+n/SHjIUPEF76T1IzxxPQfwgTi8PgqCEqzQ4I2xTqiAST7PdGMF1
 erbx/Ra28wRDrCo5ZAvaitKtcRG9TR55MOKbSgH4gnUjpiPKHKFjl4onhCiqsWvK
 TPQ1DODZHFqDFun8Jywo9RuEaS8BtM7ikagcO76kM90ml60AaIuXh0kDNgGAdEcw
 fHfGv5FXKhYTqt0eq1mYoH+g3EjlmWZfF6hIKHEykUk5k2XqnilupYcwrAL83AKT
 WGvGrTh4tz/BSx/1iWrwG4XnCAM0jS/KDkQr+SvUw0H05rvoRTDtjDr1+ah6WEmI
 T6+ujYfTnDAEsyxB8cfVrN5cqmIIt+D5WwaihLGVfp9yoGlrjAU=
 =6LSt
 -----END PGP SIGNATURE-----

Merge tag 'dlm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This adds a dlm_release_lockspace() flag to request that node-failure
  recovery be performed for the node leaving the lockspace.

  The implementation of this flag requires coordination with userland
  clustering components. It's been requested for use by GFS2"

* tag 'dlm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: check for undefined release_option values
  dlm: handle release_option as unsigned
  dlm: move to rinfo for all middle conversion cases
  dlm: handle invalid lockspace member remove
  dlm: add new flag DLM_RELEASE_RECOVER for dlm_lockspace_release
  dlm: add new configfs entry release_recover for lockspace members
  dlm: add new RELEASE_RECOVER uevent attribute for release_lockspace
  dlm: use defines for force values in dlm_release_lockspace
  dlm: check for defined force value in dlm_lockspace_release
2025-09-29 15:24:58 -07:00
Linus Torvalds 1522b530ac block-6.17-20250918
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjMlsUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpph+D/9/aXD4SrIOjRRZNDb0HIw5ySOJ36fHIfQU
 6M+DqVOOa+srY9Ja1zH9NJxcZ7Z9hUJKD9KMZd3mGSDuGCR8VTP/DZqYLjCpJmaO
 6m2Bn0Uc90iCAoialOEErPMWMmcJgapjupaeiRKcGvkLbDPUSKX/tHsNPrI9fK/t
 bxdeSZ8971DNC4DAEd0mnBDqgBz4JQa+YfSoLb9OM57LshEjExn8vhUUeu5De4Ek
 wZ5aQ44/1hZAWWMSCWFWvnxTXYc6B2McZgUT3fY6MR2jdgAdmCgXy0TfXxDwbwG9
 cbiFOsM1gCLWXywPS4pAOnw2N6t/Qb6RUv5JPGQbayredSqjxGFBYhS9zYM0cZPu
 tRtsPsCD4GJSEjY/meBGNGFB3G2cqo7IcnrWw86hAh4PLJc8Ov0kVAmS91lGsgjg
 qRJLJlS5Xoic021ccPSHAXg4/nynudDrYbn8AOVu8voMawAOi9/JCHQWgLb4siiJ
 O3CT7NQPZr/UunM7qhux6odyWppCHre3QyUvLf8btIIlrgf9dcwWyflbgiJ//iHt
 oNCHAF2h180S+Y791OcnFvNPO1zYVe1rE53XAZOky9/Cc6B0Jxelc4WUQufrbeBK
 Ug+EtAXSHGdvZkOwniuZoNpcU4Q5xOFhe9fmDArLSCc5Qh9VHH7lmt788SF/TLQ3
 SptiosZe9A==
 =Zbhj
 -----END PGP SIGNATURE-----

Merge tag 'block-6.17-20250918' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "A set of fixes for an issue with md array assembly and drbd for
  devices supporting write zeros"

* tag 'block-6.17-20250918' of git://git.kernel.dk/linux:
  drbd: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
  md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
2025-09-19 12:26:20 -07:00
Linus Torvalds d4b779985a - fix integer overflow in dm-stripe
- limit tag size in dm-integrity to 255 bytes
 
 - fix 'alignment inconsistency' warning in dm-raid
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaMrAdRQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbYhtAQDsWJHG7dOVuvoY+kyM9SmC2My6l5u2
 awYBsjVT//9KDAD/c6s+ffeseY6VpBuFmPx5i75ieCmUcuKTMUxgdIErPwM=
 =FXwZ
 -----END PGP SIGNATURE-----

Merge tag 'for-6.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

 - fix integer overflow in dm-stripe

 - limit tag size in dm-integrity to 255 bytes

 - fix 'alignment inconsistency' warning in dm-raid

* tag 'for-6.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-raid: don't set io_min and io_opt for raid1
  dm-integrity: limit MAX_TAG_SIZE to 255
  dm-stripe: fix a possible integer overflow
2025-09-17 08:07:02 -07:00
Mikulas Patocka a865562646 dm-raid: don't set io_min and io_opt for raid1
These commands
 modprobe brd rd_size=1048576
 vgcreate vg /dev/ram*
 lvcreate -m4 -L10 -n lv vg
trigger the following warnings:
device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency
device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency

The warnings are caused by the fact that io_min is 512 and physical block
size is 4096.

With a chunk-less raid level such as raid1, io_min shouldn't be set to zero,
because the zero would be raised to 512 and that would trigger the warning.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: stable@vger.kernel.org
2025-09-17 16:04:29 +02:00
Zhang Yi f0bd03832f md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
The parameter max_hw_wzeroes_unmap_sectors in queue_limits should be
equal to max_write_zeroes_sectors if it is set to a non-zero value.
However, the stacked md drivers call md_init_stacking_limits() to
initialize this parameter to UINT_MAX but only adjust
max_write_zeroes_sectors when setting limits. Therefore, this
discrepancy triggers a value check failure in blk_validate_limits().

 $ modprobe scsi_debug num_parts=2 dev_size_mb=8 lbprz=1 lbpws=1
 $ mdadm --create /dev/md0 --level=0 --raid-device=2 /dev/sda1 /dev/sda2
   mdadm: Defaulting to version 1.2 metadata
   mdadm: RUN_ARRAY failed: Invalid argument

Fix this failure by explicitly setting max_hw_wzeroes_unmap_sectors to
max_write_zeroes_sectors. Since the linear and raid0 drivers support
write zeroes, they can support the unmap write zeroes operation if all of
the backend devices support it. However, the raid1/10/5 drivers don't
support write zeroes, so we have to set the limit to zero.
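
As a rough sketch of the fix pattern described above (illustrative only;
the function name is made up and this is not the literal patch):

  /* Keep the unmap write-zeroes limit consistent with the write-zeroes
   * limit when stacking, and clear it where write zeroes is unsupported
   * (raid1/10/5). */
  static void md_fixup_wzeroes_limits_sketch(struct queue_limits *lim,
                                             bool supports_write_zeroes)
  {
          if (supports_write_zeroes)
                  lim->max_hw_wzeroes_unmap_sectors =
                          lim->max_write_zeroes_sectors;
          else
                  lim->max_hw_wzeroes_unmap_sectors = 0;
  }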

Fixes: 0c40d7cb5e ("block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits")
Reported-by: John Garry <john.g.garry@oracle.com>
Closes: https://lore.kernel.org/linux-block/803a2183-a0bb-4b7a-92f1-afc5097630d2@oracle.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/linux-raid/20250910111107.3247530-2-yi.zhang@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-17 00:37:12 +08:00
Nathan Chancellor 7935b843ce md/md-llbitmap: Use DIV_ROUND_UP_SECTOR_T
When building for 32-bit platforms, there are several link (if builtin)
or modpost (if a module) errors due to dividends of type 'sector_t' in
DIV_ROUND_UP:

  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_resize':
  drivers/md/md-llbitmap.c:1017:(.text+0xae8): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.c:1020:(.text+0xb10): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_end_discard':
  drivers/md/md-llbitmap.c:1114:(.text+0xf14): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_start_discard':
  drivers/md/md-llbitmap.c:1097:(.text+0x1808): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_read_sb':
  drivers/md/md-llbitmap.c:867:(.text+0x2080): undefined reference to `__aeabi_uldivmod'
  arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o:drivers/md/md-llbitmap.c:895: more undefined references to `__aeabi_uldivmod' follow

Use DIV_ROUND_UP_SECTOR_T instead of DIV_ROUND_UP, which exists to
handle this exact situation.
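
A hedged illustration of the change pattern (the helper and variable names
here are hypothetical, not taken from the driver):

  /* 'sectors' is a sector_t (64-bit), so a plain DIV_ROUND_UP() emits a
   * 64-bit division that 32-bit ARM lowers to __aeabi_uldivmod.
   * DIV_ROUND_UP_SECTOR_T() uses DIV_ROUND_UP_ULL() on 32-bit builds and
   * avoids the unresolved reference. */
  static unsigned long llbitmap_chunks_sketch(sector_t sectors,
                                              unsigned int chunksize)
  {
          return DIV_ROUND_UP_SECTOR_T(sectors, chunksize);
  }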

Fixes: 5ab829f197 ("md/md-llbitmap: introduce new lockless bitmap")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 18:18:41 -06:00
Yu Kuai e0ed2bca7b md/raid0: convert raid0_make_request() to use bio_submit_split_bioset()
Currently, raid0_make_request() remaps the original bio to the underlying
disks to prevent reordered IO. Now that bio_submit_split_bioset() puts the
original bio at the head of current->bio_list, it's safe to convert to
this helper and the bio will still be ordered.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai 6529d41d87 md/md-linear: convert to use bio_submit_split_bioset()
Unify bio split code, prepare to fix reordered split IO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai 9e8a5b37c9 md/raid5: convert to use bio_submit_split_bioset()
Unify bio split code, prepare to fix ordering of split IO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai 6fc07785d9 md/raid10: convert read/write to use bio_submit_split_bioset()
Unify bio split code and prepare to fix ordering of split IO. The error
path is modified a bit, however no functional changes are intended:

- bio_submit_split_bioset() can fail the original bio directly on a
  split error; set R10BIO_Uptodate in this case to notify
  raid_end_bio_io() that the original bio has already been returned.
- Setting R10BIO_Uptodate and setting the error value to -EIO is
  useless now; for an r10_bio without R10BIO_Uptodate, -EIO will be
  returned for the original bio.

Discard is not handled, because discard is only split for an unaligned
head and tail; this can be considered a slow path and the reordering
there does not matter much.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai deeeab3028 md/raid10: add a new r10bio flag R10BIO_Returned
The new helper bio_submit_split_bioset() can fail the original bio on
split errors; prepare to handle this case in raid_end_bio_io().

The flag name follows the corresponding r1bio flag name.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai a6fcc160d6 md/raid1: convert to use bio_submit_split_bioset()
Unify bio split code, and prepare to fix ordering of split IO.

Note that bio_submit_split_bioset() can fail the original bio directly on
a split error; set R1BIO_Returned in this case to notify raid_end_bio_io()
that the original bio has already been returned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai 5b38ee5a4a md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()
Unify bio split code, and prepare to fix ordering of split IO.

Note that commit 319ff40a54 ("md/raid0: Fix performance regression for large
sequential writes") already fixed the ordering of split IO by remapping the
bio to the underlying disks before resubmitting it, relying on the fact that
md_submit_bio() has already split it by sectors and raid0_make_request()
will split at most once for unaligned IO. This is a bit hacky and we'll
convert it to the general solution later.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai 22f166218f md: fix missing blktrace bio split events
If a bio is split by internal handling, such as chunk size or badblocks,
the corresponding trace_block_split() is missing, so blktrace cannot
catch the BIO split events and it becomes harder to analyze the BIO
sequence.

Cc: stable@vger.kernel.org
Fixes: 4b1faf9316 ("block: Kill bio_pair_split()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Jens Axboe 79b24810a2 Merge tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.18/block
Pull MD changes from Yu Kuai:

"Redundant data is used to enhance data fault tolerance, and the storage
 method for redundant data vary depending on the RAID levels. And it's
 important to maintain the consistency of redundant data.

 Bitmap is used to record which data blocks have been synchronized and
 which ones need to be resynchronized or recovered. Each bit in the
 bitmap represents a segment of data in the array. When a bit is set,
 it indicates that the multiple redundant copies of that data segment
 may not be consistent. Data synchronization can be performed based on
 the bitmap after a power failure or re-adding a disk. If there is no
 bitmap, a full disk synchronization is required.

 Due to known performance issues with md-bitmap and its questionable
 implementation choices:

  - self-managed IO submission, like filemap_write_page();
  - a global spin_lock

 I have decided not to keep optimizing the current bitmap implementation.
 This new bitmap is designed without locking in the IO fast path and can
 be used with fast disks.

 Key features of the new bitmap:
  - The IO fastpath is lockless: if a user issues lots of write IO to the
    same bitmap bit in a short time, only the first write pays the
    overhead of updating the bitmap bit; the following writes incur no
    additional overhead.
  - Only written data is resynced or recovered, which means that when
    creating a new array or replacing a disk with a new one, there is no
    need to do a full disk resync/recovery."

* tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: (24 commits)
  md/md-llbitmap: introduce new lockless bitmap
  md/md-bitmap: make method bitmap_ops->daemon_work optional
  md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  md/md-bitmap: add a new sysfs api bitmap_type
  md: add a new mddev field 'bitmap_id'
  md/md-bitmap: support discard for bitmap ops
  md: factor out a helper raid_is_456()
  md: add a new parameter 'offset' to md_super_write()
  md/md-bitmap: introduce CONFIG_MD_BITMAP
  md: check before referencing mddev->bitmap_ops
  md/dm-raid: check before referencing mddev->bitmap_ops
  md/raid5: check before referencing mddev->bitmap_ops
  md/raid10: check before referencing mddev->bitmap_ops
  md/raid1: check before referencing mddev->bitmap_ops
  md/raid1: check bitmap before behind write
  md/md-bitmap: handle the case bitmap is not enabled before end_sync()
  md/md-bitmap: handle the case bitmap is not enabled before start_sync()
  ...
2025-09-09 11:24:03 -06:00
Christoph Hellwig d86eaa0f3c block: remove the bi_inline_vecs variable sized array from struct bio
Bios are embedded into other structures, and at least sparse is unhappy
about embedding structures with variable sized arrays.  There's no
real need for the array anyway; we can replace it with a helper pointing
to the memory just behind the bio, and with the previous cleanups there
are very few sites doing anything special with it.
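
Roughly, such a helper could look like this (a sketch; the upstream
helper's exact name and form may differ):

  /* Inline bvecs live immediately after the bio allocation, so a helper
   * can compute their address instead of a flexible array member. */
  static inline struct bio_vec *bio_inline_vecs_sketch(struct bio *bio)
  {
          return (struct bio_vec *)(bio + 1);
  }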

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 07:31:59 -06:00
Christoph Hellwig 70a6f71b1a block: add a bio_init_inline helper
Just a simpler wrapper around bio_init for callers that want to
initialize a bio with inline bvecs.
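
A plausible shape for the wrapper, as a sketch (it assumes an inline-vecs
helper like the one sketched for the previous patch; the real definition
may differ):

  /* Initialize a bio whose bvec table is the inline area directly behind
   * the bio itself. */
  static inline void bio_init_inline_sketch(struct bio *bio,
                                            struct block_device *bdev,
                                            unsigned short max_vecs,
                                            blk_opf_t opf)
  {
          bio_init(bio, bdev, bio_inline_vecs_sketch(bio), max_vecs, opf);
  }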

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 07:31:59 -06:00
Mikulas Patocka 77b8e6fbf9 dm-integrity: limit MAX_TAG_SIZE to 255
MAX_TAG_SIZE was 0x1a8, so the tag size may be truncated in the
"bi->metadata_size = ic->tag_size" assignment. We need to limit it to 255.
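
A small standalone illustration of the truncation (assuming an 8-bit
metadata_size field, as the 255 limit implies):

  #include <stdio.h>

  int main(void)
  {
          unsigned int tag_size = 0x1a8;          /* 424, the old MAX_TAG_SIZE */
          unsigned char metadata_size = tag_size; /* 8-bit field: truncated */

          /* prints "424 -> 168": the tag size silently shrinks */
          printf("%u -> %u\n", tag_size, (unsigned int)metadata_size);
          return 0;
  }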

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-08 15:57:04 +02:00
Yu Kuai 5ab829f197 md/md-llbitmap: introduce new lockless bitmap
Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data varies depending on the RAID level. It's
important to maintain the consistency of redundant data.

Bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
a power failure or re-adding a disk. If there is no bitmap, a full disk
synchronization is required.

Due to known performance issues with md-bitmap and its questionable
implementation choices:

 - self-managed IO submission, like filemap_write_page();
 - a global spin_lock

I have decided not to keep optimizing the current bitmap implementation.
This new bitmap is designed without locking in the IO fast path and can
be used with fast disks.

For the design and details, see the comments in drivers/md/md-llbitmap.c.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:27:51 +08:00
Yu Kuai 66be318e66 md/md-bitmap: make method bitmap_ops->daemon_work optional
daemon_work() is called by the daemon thread. On the one hand, the daemon
thread doesn't have a strict wake-up time; on the other hand, too much
work is put on the daemon thread, like handling sync IO, handling failed
or special normal IO, handling recovery, and so on. Hence the daemon thread
may be too busy to clear dirty bits in time.

Make bitmap_ops->daemon_work() optional; following patches will use a
separate async work item to clear dirty bits for the new bitmap.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-11-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:20:49 +08:00
Yu Kuai c951ccf0bf md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
This flag is used by llbitmap in later patches to skip the raid456 initial
recovery and delay building the initial xor data until the first write.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-06 17:20:32 +08:00
Yu Kuai a4dd9ba39b md/md-bitmap: add a new method blocks_synced() in bitmap_operations
Currently, raid456 must perform a whole-array initial recovery to build
the initial xor data; afterwards, IO to the array won't have to read all
the blocks from the underlying disks.

This behavior affects IO performance a lot, and nowadays disks are huge
and the initial recovery can take a long time. Hence llbitmap will
support lazy initial recovery in the following patches. This method is
used to check whether data blocks are synced; if not, IO will still have
to read all blocks for raid456.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-06 17:20:01 +08:00
Yu Kuai f196d72888 md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
This method is used to check if blocks can be skipped before calling
into pers->sync_request(). llbitmap will use this method to skip
resync for unwritten/clean data blocks, and recovery/check/repair for
unwritten data blocks.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:19:45 +08:00
Yu Kuai fb8cc3b0d9 md/md-bitmap: delay registration of bitmap_ops until creating bitmap
Currently bitmap_ops is registered while allocating the mddev; this is fine
when there is only one bitmap_ops.

Delay setting bitmap_ops until the bitmap is created, so that the user can
choose which bitmap to use before running the array.

Link: https://lore.kernel.org/linux-raid/20250721171557.34587-7-yukuai@kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:19:19 +08:00
Yu Kuai 180b82c1c7 md/md-bitmap: add a new sysfs api bitmap_type
The API will be used by mdadm to set bitmap_type while creating a new array
or assembling an array, in preparation for adding a new bitmap.

Currently available options are:

cat /sys/block/md0/md/bitmap_type
none [bitmap]

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:19:04 +08:00
Yu Kuai 300bffa870 md: add a new mddev field 'bitmap_id'
Prepare to store the bitmap id selected by the user; also refactor
mddev_set_bitmap_ops() a bit in case the value is invalid.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:18:48 +08:00
Yu Kuai ac9dad8faa md/md-bitmap: support discard for bitmap ops
Use two new methods {start, end}_discard in bitmap_ops and a new field 'rw'
in struct md_io_clone to handle discard IO, in preparation for the new md
bitmap.

Since all bitmap functions that handle write IO are the same, also add a
typedef to make the code cleaner.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:18:19 +08:00
Yu Kuai 7797da149d md: factor out a helper raid_is_456()
There are no functional changes; the helper will be used by llbitmap in
the following patches.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:17:58 +08:00
Yu Kuai d01acbce39 md: add a new parameter 'offset' to md_super_write()
The parameter is always set to 0 for now; following patches will use
this helper to write the llbitmap to the underlying disks, allowing
dirty sectors to be written instead of the whole page.

Also rename md_super_write() to md_write_metadata() since there is nothing
superblock specific about it.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-2-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06 17:17:26 +08:00
Yu Kuai c27474ac1d md/md-bitmap: introduce CONFIG_MD_BITMAP
Now that all implementations are internal, it's sensible to add a config
option for md-bitmap; it's also a good way to isolate the code.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-16-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:12:22 +08:00
Yu Kuai 26292657ad md: check before referencing mddev->bitmap_ops
Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-15-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:12:19 +08:00
Yu Kuai 0e18745420 md/dm-raid: check before referencing mddev->bitmap_ops
Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-14-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:12:15 +08:00
Yu Kuai bb9317b13a md/raid5: check before referencing mddev->bitmap_ops
Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-13-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:12:11 +08:00
Yu Kuai 969f996243 md/raid10: check before referencing mddev->bitmap_ops
Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-12-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:12:07 +08:00
Yu Kuai 8d31ed3b77 md/raid1: check before referencing mddev->bitmap_ops
Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-11-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:12:02 +08:00
Yu Kuai 20cecae877 md/raid1: check bitmap before behind write
Behind writes rely on the bitmap, because the number of IOs is recorded in
bitmap->behind_writes, and callers rely on bitmap_wait_behind_writes()
to wait for the IO to be done.

However, callers currently don't check whether the bitmap is enabled before
calling into the behind methods. Hence if a behind write starts without a
bitmap, readers will not wait for the slow write IO to be done and old data
can be read in some corner cases.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:58 +08:00
Yu Kuai bb74b093c3 md/md-bitmap: handle the case bitmap is not enabled before end_sync()
This case can be handled without knowing the internal implementation.

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:54 +08:00
Yu Kuai 5ae58d1500 md/md-bitmap: handle the case bitmap is not enabled before start_sync()
This case can be handled without knowing the internal implementation.

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:50 +08:00
Yu Kuai 110332074d md/md-bitmap: add md_bitmap_registered/enabled() helper
There are no functional changes; prepare to handle the case where
mddev->bitmap_ops can be NULL, which is possible after introducing
CONFIG_MD_BITMAP.
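
A sketch of what such helpers could look like (hedged; the exact upstream
definitions may differ, and 'enabled' may check more state than shown):

  /* With CONFIG_MD_BITMAP, mddev->bitmap_ops may be NULL, so wrap the
   * checks that callers currently open-code. */
  static inline bool md_bitmap_registered_sketch(struct mddev *mddev)
  {
          return mddev->bitmap_ops != NULL;
  }

  static inline bool md_bitmap_enabled_sketch(struct mddev *mddev)
  {
          return md_bitmap_registered_sketch(mddev) && mddev->bitmap != NULL;
  }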

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-7-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:45 +08:00
Yu Kuai 9c41ead04e md/md-bitmap: add a new parameter 'flush' to bitmap_ops->enabled
The method is only used from the raid1/raid10 IO path, to check if a write
bio should be plugged. The parameter is always set to true for now; a
following patch will use this helper in other contexts, like updating the
superblock.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:42 +08:00
Yu Kuai 9307dbac0e md/md-bitmap: merge md_bitmap_group into bitmap_operations
Now that all bitmap implementations are internal, it doesn't make sense
to export md_bitmap_group anymore.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:40 +08:00
Yu Kuai e57b225c28 md/md-bitmap: remove the parameter 'init' for bitmap_ops->resize()
It's set to 'false' for all callers, hence it's useless and can be
removed.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:33 +08:00
Li Nan 7202082b7b md: prevent incorrect update of resync/recovery offset
In md_do_sync(), when md_sync_action() returns ACTION_FROZEN, the subsequent
call to md_sync_position() will return MaxSector. This causes
'curr_resync' (and later 'recovery_offset') to be set to MaxSector too,
which incorrectly signals that recovery/resync has completed, even though
disk data has not actually been updated.

To fix this issue, skip updating any offset values when the sync action
is FROZEN. The same holds true for IDLE.
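
A hedged sketch of the kind of check described (hypothetical helper, not
the literal patch):

  /* Decide whether md_do_sync() should record sync progress for the
   * current action; frozen/idle must not be treated as completion. */
  static bool md_action_updates_offsets_sketch(enum sync_action action)
  {
          return action != ACTION_FROZEN && action != ACTION_IDLE;
  }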

Fixes: 7d9f107a4e ("md: use new helpers in md_do_sync()")
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250904073452.3408516-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-05 00:31:18 +08:00
Yu Kuai 93dec51e71 md/raid1: fix data lost for writemostly rdev
If writemostly is enabled, alloc_behind_master_bio() will allocate a new
bio for the rdev, with bi_opf set to 0. Later, raid1_write_request() will
clone from this bio, hence bi_opf is still 0 for the cloned bio. Submitting
this cloned bio will end up being treated as a read, causing write data to
be lost.

Fix this problem by inheriting bi_opf from the original bio for the
behind_master_bio.
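
The gist of the fix, as a sketch (hypothetical helper name; not the
literal patch):

  /* Copy the operation and flags from the original write bio to the
   * behind master bio so clones of it are still submitted as writes. */
  static void behind_bio_inherit_opf_sketch(struct bio *behind_bio,
                                            struct bio *orig_bio)
  {
          behind_bio->bi_opf = orig_bio->bi_opf;
  }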

Fixes: e879a0d9cb ("md/raid1,raid10: don't ignore IO flags")
Reported-and-tested-by: Ian Dall <ian@beware.dropbear.id.au>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220507
Link: https://lore.kernel.org/linux-raid/20250903014140.3690499-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-05 00:30:04 +08:00
Jens Axboe 4dbe13c784 switching ->getgeo() from struct block_device to struct gendisk
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaLifHQAKCRBZ7Krx/gZQ
 64qlAPsGU9cVg8tVcbbuf767MXyuQZkUPeA5AWnSkm0jfQzaKAEAmsF4+KsjOFRR
 EmdjHBlN5kk6a0TWzXcADlieJ/ccNA4=
 =Tr1Q
 -----END PGP SIGNATURE-----

Merge tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-6.18/block

Pull struct block_device getgeo changes from Al.

"switching ->getgeo() from struct block_device to struct gendisk

 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>"

* tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block: switch ->getgeo() to struct gendisk
  scsi: switch ->bios_param() to passing gendisk
  scsi: switch scsi_bios_ptable() and scsi_partsize() to gendisk
2025-09-03 15:15:43 -06:00
Zheng Qixing b7ee30f0ef md: fix sync_action incorrect display during resync
During raid resync, if a disk becomes faulty, the operation is
briefly interrupted. The MD_RECOVERY_RECOVER flag triggered by
the disk failure causes sync_action to incorrectly show "recover"
instead of "resync". The same issue affects reshape operations.

Reproduction steps:
  mdadm -Cv /dev/md1 -l1 -n4 -e1.2 /dev/sd{a..d} // -> resync happened
  mdadm -f /dev/md1 /dev/sda                     // -> resync interrupted
  cat sync_action
  -> recover

Add progress checks in md_sync_action() for resync/recover/reshape
to ensure the interface correctly reports the actual operation type.

Fixes: 4b10a3bc67 ("md: ensure resync is prioritized over recovery")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-3-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16 08:52:33 +08:00
Zheng Qixing cb0780ad43 md: add helper rdev_needs_recovery()
Add a helper for checking if an rdev needs recovery.

Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-2-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16 08:51:59 +08:00
Xiao Ni c27973211f md: keep recovery_cp in mdp_superblock_s
commit 907a99c314 ("md: rename recovery_cp to resync_offset") replaced
recovery_cp with resync_offset in mdp_superblock_s, which is in md_p.h.
md_p.h is used in userspace too, so building mdadm fails because of this.
This patch reverts that change.

Fixes: 907a99c314 ("md: rename recovery_cp to resync_offset")
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20250815040028.18085-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16 08:47:38 +08:00
Xiao Ni 25db5f284f md: add legacy_async_del_gendisk mode
commit 9e59d60976 ("md: call del_gendisk in control path") changed the
call to del_gendisk from the async way to the sync way. But it breaks the
mdadm --assemble command. The assemble command runs like this:
1. create the array
2. stop the array
3. access the sysfs files after stopping

The sync way calls del_gendisk in step 2, so all sysfs files are removed.
To avoid breaking the mdadm assemble command, this patch adds the parameter
legacy_async_del_gendisk that can be used to choose between the two ways.
The default is the async way. In the future, we plan to change the default
to the sync way in kernel 7.0; users will then need to upgrade to mdadm
4.5+, which removes step 2.

Fixes: 9e59d60976 ("md: call del_gendisk in control path")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/linux-raid/CAMw=ZnQ=ET2St-+hnhsuq34rRPnebqcXqP1QqaHW5Bh4aaaZ4g@mail.gmail.com/T/#t
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20250813032929.54978-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-13 19:44:17 +08:00