Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this by performing the kthread affinity update from cpuset within
the consolidated kthread code instead, so that preferred affinities are
honoured and applied against the updated cpuset isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per Tejun's nice suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot-defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU-based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
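As a rough usage sketch of the preferred-affinity API this work revolves
around (not taken from the series; the worker function and NUMA node are
illustrative), a driver creates an unbound kthread, expresses a preferred
cpumask before the first wakeup, and the kthread core then re-applies it
across hotplug, housekeeping and cpuset changes:

  #include <linux/kthread.h>
  #include <linux/sched.h>
  #include <linux/topology.h>

  static int my_worker_fn(void *data)             /* illustrative worker */
  {
          set_current_state(TASK_INTERRUPTIBLE);
          while (!kthread_should_stop()) {
                  schedule();
                  set_current_state(TASK_INTERRUPTIBLE);
          }
          __set_current_state(TASK_RUNNING);
          return 0;
  }

  static struct task_struct *start_my_worker(void)
  {
          struct task_struct *t = kthread_create(my_worker_fn, NULL, "my_worker");

          if (!IS_ERR(t)) {
                  /* Prefer node 0 CPUs; must be set before the first wakeup. */
                  kthread_affine_preferred(t, cpumask_of_node(0));
                  wake_up_process(t);
          }
          return t;
  }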
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
Merge tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull bounce buffer dio for stable pages from Jens Axboe:
"This adds support for bounce buffering of dio for stable pages. This
was all done by Christoph. In his words:
This series tries to address the problem that pages can be modified
while under direct I/O, even when the device or file system requires
stable pages during I/O to calculate checksums or parity, or for other
data operations. It does so by adding block layer helpers to bounce
buffer an iov_iter into a bio, then wires that up in iomap and
ultimately XFS.
The reason that the file system even needs to know about it is that
reads need a user context to copy the data back, and the
infrastructure to defer ioends to a workqueue currently sits in XFS.
I'm going to look into moving that into ioend and enabling it for
other file systems. Additionally btrfs already has its own
infrastructure for this, and actually an urgent need to bounce buffer,
so this should be useful there and could be wired up easily. In fact
the idea comes from patches by Qu that did this in btrfs.
This patch fixes all but one xfstests failure on T10 PI capable
devices (generic/095 still seems to have issues with a mix of mmap and
splice; I'm looking into that separately), and makes qemu VMs
running Windows, or Linux with swap enabled, work fine on an XFS file
on a device using PI.
Performance numbers on my (not exactly state of the art) NVMe PI test
setup:
Sequential reads using io_uring, QD=16.
Bandwidth and CPU usage (usr/sys):
| size | zero copy | bounce |
+------+--------------------------+--------------------------+
| 4k | 1316MiB/s (12.65/55.40%) | 1081MiB/s (11.76/49.78%) |
| 64K | 3370MiB/s ( 5.46/18.20%) | 3365MiB/s ( 4.47/15.68%) |
| 1M | 3401MiB/s ( 0.76/23.05%) | 3400MiB/s ( 0.80/09.06%) |
+------+--------------------------+--------------------------+
Sequential writes using io_uring, QD=16.
Bandwidth and CPU usage (usr/sys):
| size | zero copy | bounce |
+------+--------------------------+--------------------------+
| 4k | 882MiB/s (11.83/33.88%) | 750MiB/s (10.53/34.08%) |
| 64K | 2009MiB/s ( 7.33/15.80%) | 2007MiB/s ( 7.47/24.71%) |
| 1M | 1992MiB/s ( 7.26/ 9.13%) | 1992MiB/s ( 9.21/19.11%) |
+------+--------------------------+--------------------------+
Note that the 64k read numbers look really odd to me for the baseline
zero copy case, but are reproducible over many repeated runs.
The bounce read numbers should further improve when moving the PI
validation to the file system and removing the double context switch,
which I have patches for that will be sent out soon"
* tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
xfs: use bounce buffering direct I/O when the device requires stable pages
iomap: add a flag to bounce buffer direct I/O
iomap: support ioends for direct reads
iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED
iomap: free the bio before completing the dio
iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct
iomap: split out the per-bio logic from iomap_dio_bio_iter
iomap: simplify iomap_dio_bio_iter
iomap: fix submission side handling of completion side errors
block: add helpers to bounce buffer an iov_iter into bios
block: remove bio_release_page
iov_iter: extract a iov_iter_extract_bvecs helper from bio code
block: open code bio_add_page and fix handling of mismatching P2P ranges
block: refactor get_contig_folio_len
block: add a BIO_MAX_SIZE constant and use it
Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updates
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a while related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe, if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: ABI/sysfs-block: add documentation for new queue attribute async_depth
block, bfq: convert to use request_queue->async_depth
mq-deadline: convert to use request_queue->async_depth
kyber: convert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Merge tag 'xfs-merge-7.0' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Carlos Maiolino:
"This contains several improvements to zoned device support,
performance improvements for the parent pointers, and a new health
monitoring feature. There are some improvements in the journaling code
too but no behavior change expected.
Last but not least, some code refactoring and bug fixes are also
included in this series"
* tag 'xfs-merge-7.0' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (67 commits)
xfs: add sysfs stats for zoned GC
xfs: give the defer_relog stat a xs_ prefix
xfs: add zone reset error injection
xfs: refactor zone reset handling
xfs: don't mark all discard issued by zoned GC as sync
xfs: allow setting errortags at mount time
xfs: use WRITE_ONCE/READ_ONCE for m_errortag
xfs: move the guts of XFS_ERRORTAG_DELAY out of line
xfs: don't validate error tags in the I/O path
xfs: allocate m_errortag early
xfs: fix the errno sign for the xfs_errortag_{add,clearall} stubs
xfs: validate log record version against superblock log version
xfs: fix spacing style issues in xfs_alloc.c
xfs: remove xfs_zone_gc_space_available
xfs: use a separate member to track space available in the GC scratch buffer
xfs: check for deleted cursors when revalidating two btrees
xfs: fix UAF in xchk_btree_check_block_owner
xfs: check return value of xchk_scrub_create_subord
xfs: only call xf{array,blob}_destroy if we have a valid pointer
xfs: get rid of the xchk_xfile_*_descr calls
...
Secure erase should use max_secure_erase_sectors instead of being limited
by max_discard_sectors. Separate the handling of REQ_OP_SECURE_ERASE from
REQ_OP_DISCARD to allow each operation to use its own size limit.
Signed-off-by: Luke Wang <ziniu.wang_1@nxp.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The second kill_bdev() call in set_blocksize() is redundant as the first
call already clears all buffers and pagecache, and locks prevent new
pagecache creation between the calls.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The default limit is unchanged, and users can now configure async_depth.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In a downstream kernel, we tested mq-deadline with many fio workloads and
found a performance regression after commit 39823b47bb
("block/mq-deadline: Fix the tag reservation code") with the following test:
[global]
rw=randread
direct=1
ramp_time=1
ioengine=libaio
iodepth=1024
numjobs=24
bs=1024k
group_reporting=1
runtime=60
[job1]
filename=/dev/sda
The root cause is that mq-deadline now supports configuring async_depth:
although the default value is nr_requests, the minimal value is 1, hence
min_shallow_depth is set to 1, causing wake_batch to be 1. As a
consequence, the sbitmap_queue will be woken up after each IO instead of
every 8 IOs.
In this test case, sda is an HDD and max_sectors is 128k, hence each
submitted 1M IO is split into 8 sequential 128k requests. However, because
there are 24 jobs and the total tags are exhausted, the 8 requests are
unlikely to be dispatched sequentially, and changing wake_batch to 1
makes this much worse: counting the blktrace D stage, the percentage
of sequential IO decreases from 8% to 0.8%.
Fix this problem by converting to request_queue->async_depth, where
min_shallow_depth is set each time async_depth is updated.
Note that the elevator attribute async_depth is now removed; the queue
attribute with the same name is used instead.
Fixes: 39823b47bb ("block/mq-deadline: Fix the tag reservation code")
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use request_queue->async_depth instead of the internal async_depth, and
remove kqd->async_depth and related helpers.
Note that the elevator attribute async_depth is now removed; the queue
attribute with the same name is used instead.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new field async_depth to request_queue and related APIs. It is
currently unused; following patches will convert elevators to use it
instead of their internal async_depth.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are no functional changes, just make code cleaner.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bfq and mq-deadline consider sync writes as async requests and only
reserve tags for sync reads via async_depth; however, kyber doesn't
consider sync writes as async requests for now.
Consider the case where there are lots of dirty pages and the user uses
fsync to flush them. In this case sched_tags can be exhausted by sync
writes and sync reads can get stuck waiting for a tag. Hence let kyber
follow what mq-deadline and bfq do, and unify the async request check
for all elevators.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block subsystem prevents its workqueue from running on isolated
CPUs, including those defined by cpuset isolated partitions. Since
HK_TYPE_DOMAIN will soon contain both and be subject to runtime
modifications, synchronize against housekeeping using the relevant lock.
For full support of cpuset changes, the block subsystem may need to
propagate changes to isolated cpumask through the workqueue in the
future.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-block@vger.kernel.org
Creating new debugfs entries can trigger fs reclaim, hence we can't do
this with the queue frozen; likewise, other locks that can be held
while the queue is frozen should not be held here either.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In blk_mq_update_nr_hw_queues(), debugfs_mutex is not held while
creating debugfs entries for hctxs. Hence take debugfs_mutex there;
it's safe because the queue is not frozen.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This helper is only used by iocost and iolatency, which don't have
debugfs entries, so it can be removed.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's only used inside blk-mq-debugfs.c now.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently rq-qos debugfs entries are created from rq_qos_add(), which
can be called while the queue is still frozen. This can deadlock
because creating new entries can trigger fs reclaim.
Fix this problem by delaying the creation of rq-qos debugfs entries
until after the queue is unfrozen.
- For wbt: 1) it can be initialized by default, fix it by calling the
new helper after wbt_init() from wbt_init_enable_default(); 2) it can
be initialized via sysfs, fix it by calling the new helper after the
queue is unfrozen from wbt_set_lat().
- For iocost and iolatency, they can only be initialized through blkcg
configuration; however, they don't have debugfs entries for now, hence
they are not handled yet.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is already a helper blk_mq_debugfs_register_rqos() to register
one rqos; however, this helper is called synchronously when the rqos is
created, with the queue frozen.
Prepare to fix a possible deadlock when creating blk-mq debugfs entries
while the queue is still frozen.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If wbt is disabled by default and the user configures wbt via sysfs,
the queue will be frozen first and then pcpu_alloc_mutex will be held
in blk_stat_alloc_callback().
Fix this problem by allocating memory before freezing the queue.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move implementation details inside blk-wbt.c, to prepare for fixing, in
the next patch, a possible deadlock when calling wbt_init() while the
queue is frozen.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The locking ranges count and the array items are always ignored unless
Single User Mode (SUM) is requested in the activate method.
It is useless to enforce limits on the unused array in the non-SUM case.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the helper function bdev_rot() to test if a block device is a
rotational one. The existing function bdev_nonrot() which tests for the
opposite condition is redefined using this new helper.
This avoids the double negation (operator and name) that appears when
testing if a block device is a rotational device, thus making the code a
little easier to read.
Call sites of bdev_nonrot() in the block layer are updated to use this
new helper. Remaining users in other subsystems are left unchanged for
now.
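A plausible shape for the new helper and the redefinition, as a sketch
based on the description above (the exact patch may differ):

  #include <linux/blkdev.h>

  static inline bool bdev_rot(struct block_device *bdev)
  {
          return bdev_get_queue(bdev)->limits.features & BLK_FEAT_ROTATIONAL;
  }

  static inline bool bdev_nonrot(struct block_device *bdev)
  {
          return !bdev_rot(bdev);
  }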
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To check if a request queue is for a rotational device, a double
negation is needed with the pattern "!blk_queue_nonrot(q)". Simplify
this with the introduction of the helper blk_queue_rot(), which tests
if a request queue's limits have the BLK_FEAT_ROTATIONAL feature set.
All call sites of blk_queue_nonrot() are modified to use blk_queue_rot()
and the blk_queue_nonrot() definition is removed.
No functional changes.
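Illustrative sketch of the helper and a converted call site (the call
site shown is hypothetical):

  #include <linux/blkdev.h>

  #define blk_queue_rot(q)        ((q)->limits.features & BLK_FEAT_ROTATIONAL)

  static bool wants_rotational_tuning(struct request_queue *q)
  {
          return blk_queue_rot(q);        /* was: !blk_queue_nonrot(q) */
  }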
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace snprintf("%s", ...) with the faster and more direct strscpy().
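For illustration (buffer and source are placeholders), the pattern being
replaced looks like this:

  char name[DISK_NAME_LEN];

  /* before */
  snprintf(name, sizeof(name), "%s", disk->disk_name);

  /* after */
  strscpy(name, disk->disk_name, sizeof(name));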
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add helpers to implement bounce buffering of data into a bio to implement
direct I/O for cases where direct user access is not possible because
stable in-flight data is required. These are intended to be used as
easily as bio_iov_iter_get_pages for the zero-copy path.
The write side is trivial and just copies data into the bounce buffer.
The read side is a lot more complex because it needs to perform the copy
from the completion context, and without preserving the iov_iter through
the call chain. It steals a trick from the integrity data user interface
and uses the first vector in the bio for the bounce buffer data that is
fed to the block I/O stack, and uses the others to record the user
buffer fragments.
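An illustrative sketch, not the actual helper, of the read-completion
copy-back implied by that layout, assuming a single lowmem bounce
segment in bvec 0 and user page fragments recorded in the remaining
bvecs:

  #include <linux/bio.h>
  #include <linux/highmem.h>

  static void bounce_copy_back(struct bio *bio)           /* illustrative */
  {
          struct bio_vec *bounce = &bio->bi_io_vec[0];
          char *src = page_address(bounce->bv_page) + bounce->bv_offset;
          int i;

          for (i = 1; i < bio->bi_vcnt; i++) {
                  struct bio_vec *uv = &bio->bi_io_vec[i];

                  memcpy_to_page(uv->bv_page, uv->bv_offset, src, uv->bv_len);
                  src += uv->bv_len;
          }
  }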
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge bio_release_page into the only remaining caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Massage __bio_iov_iter_get_pages so that it doesn't need the bio, and
move it to lib/iov_iter.c so that it can be used by block code for
other things than filling a bio and by other subsystems like netfs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_add_page fails to add data to the bio when mixing P2P with non-P2P
ranges, or ranges that map to different P2P providers. In that case
it will trigger the WARN_ON and return an error up the chain instead of
simply starting a new bio as intended. Fix this by open coding
bio_add_page and handling this case explicitly. While doing so, stop
merging physically contiguous data that belongs to multiple folios.
While this merge could lead to more efficient bio packing in some
cases, dropping it allows removing the handling of this corner case in
other places and makes the code more robust.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move all of the logic to find the contiguous length inside a folio into
get_contig_folio_len instead of keeping some of it in the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the only constant for the maximum bio size is BIO_MAX_SECTORS,
which is in units of 512-byte sectors, but a lot of users need a byte
limit.
Add a BIO_MAX_SIZE constant, redefine BIO_MAX_SECTORS in terms of it, and
switch all bio-related uses of UINT_MAX for the maximum size to use the
symbolic names instead.
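Sketched relationship (the exact definition and any rounding may differ):

  /* byte limit; previously open-coded as UINT_MAX at the call sites */
  #define BIO_MAX_SIZE            UINT_MAX
  /* the existing sector limit becomes derived from it */
  #define BIO_MAX_SECTORS         (BIO_MAX_SIZE >> SECTOR_SHIFT)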
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'block-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- A set of selftest fixes for ublk
- Fix for a pid mismatch in ublk, comparing PIDs in different
namespaces if run inside a namespace
- Fix for a regression added in this release with polling, where the
nvme tcp connect code would spin forever
- Zoned device error path fix
- Tweak the blkzoned uapi additions from this kernel release, making
them more easily discoverable
- Fix for a regression in bcache with bio endio handling added in this
release
* tag 'block-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
bcache: use bio cloning for detached device requests
blk-mq: use BLK_POLL_ONESHOT for synchronous poll completion
selftests/ublk: fix garbage output in foreground mode
selftests/ublk: fix error handling for starting device
selftests/ublk: fix IO thread idle check
block: make the new blkzoned UAPI constants discoverable
ublk: fix ublksrv pid handling for pid namespaces
block: Fix an error path in disk_update_zone_resources()
blk_execute_rq() with polling is used in kernel code paths such as
NVMe controller connect. The aggressive spinning in blk_hctx_poll()
can prevent the completion task from getting a chance to run, causing
a lockup.
The spinning with cpu_relax() doesn't yield CPU, so need_resched()
only becomes true on timer tick. This causes unnecessary spinning
while the completion task is already waiting to run.
Before commit f22ecf9c14, the loop would exit early because
task_is_running() was always true. After that commit removed the
check, the loop now spins until need_resched().
Fix this by using BLK_POLL_ONESHOT in blk_rq_poll_completion(). This
causes blk_hctx_poll() to poll once and return immediately, letting
the outer loop's cond_resched() yield CPU so the completion task can
run.
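A simplified sketch of the resulting loop (based on the description
above, not necessarily the exact code):

  static void blk_rq_poll_completion(struct request *rq, struct completion *wait)
  {
          do {
                  /* poll once, then let cond_resched() yield the CPU */
                  blk_hctx_poll(rq->q, rq->mq_hctx, NULL, BLK_POLL_ONESHOT);
                  cond_resched();
          } while (!completion_done(wait));
  }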
Fixes: f22ecf9c14 ("blk-mq: delete task running check in blk_hctx_poll()")
Cc: Diangang Li <lidiangang@bytedance.com>
Cc: Fengnan Chang <changfengnan@bytedance.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to allow an existing bio to be resubmitted without
having to re-add the payload.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add a third parameter 'const struct io_comp_batch *' to the rq_end_io_fn
callback signature. This allows end_io handlers to access the completion
batch context when requests are completed via blk_mq_end_request_batch().
The io_comp_batch is passed from blk_mq_end_request_batch(), while NULL
is passed from __blk_mq_end_request() and blk_mq_put_rq_ref() which don't
have batch context.
This infrastructure change enables drivers to detect whether they're
being called from a batched completion path (like iopoll) and access
additional context stored in the io_comp_batch.
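As a sketch of the resulting callback shape (the handler below is
illustrative):

  /* old: enum rq_end_io_ret (*rq_end_io_fn)(struct request *, blk_status_t); */
  typedef enum rq_end_io_ret (*rq_end_io_fn)(struct request *rq, blk_status_t error,
                                             const struct io_comp_batch *iob);

  static enum rq_end_io_ret my_end_io(struct request *rq, blk_status_t error,
                                      const struct io_comp_batch *iob)
  {
          if (iob) {
                  /* completed via blk_mq_end_request_batch(), e.g. iopoll */
          }
          return RQ_END_IO_NONE;
  }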
Update all rq_end_io_fn implementations:
- block/blk-mq.c: blk_end_sync_rq
- block/blk-flush.c: flush_end_io, mq_flush_data_end_io
- drivers/nvme/host/ioctl.c: nvme_uring_cmd_end_io
- drivers/nvme/host/core.c: nvme_keep_alive_end_io
- drivers/nvme/host/pci.c: abort_endio, nvme_del_queue_end, nvme_del_cq_end
- drivers/nvme/target/passthru.c: nvmet_passthru_req_done
- drivers/scsi/scsi_error.c: eh_lock_door_done
- drivers/scsi/sg.c: sg_rq_end_io
- drivers/scsi/st.c: st_scsi_execute_end
- drivers/target/target_core_pscsi.c: pscsi_req_done
- drivers/md/dm-rq.c: end_clone_request
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Any queue_limits_start_update() call must be followed either by a
queue_limits_commit_update() call or by a queue_limits_cancel_update()
call. Make sure that the error path near the start of
disk_update_zone_resources() follows this requirement. Remove the
"goto unfreeze" statement from that error path to make the code easier
to verify.
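The required pairing, as a hedged usage sketch (validate_something() and
the limit tweaked are illustrative):

  static int example_limits_update(struct request_queue *q)
  {
          struct queue_limits lim = queue_limits_start_update(q);
          int err = validate_something(q);

          if (err) {
                  queue_limits_cancel_update(q);  /* releases limits_lock */
                  return err;
          }
          lim.max_sectors = 1024;                 /* illustrative change */
          return queue_limits_commit_update(q, &lim);
  }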
This was detected by annotating the queue_limits_*() calls with Clang
thread-safety attributes and by building the kernel with thread-safety
checking enabled. Without this patch and with thread-safety checking
enabled, the following error is reported:
block/blk-zoned.c:2020:1: error: mutex 'disk->queue->limits_lock' is not held on every path through here [-Werror,-Wthread-safety-analysis]
2020 | }
| ^
block/blk-zoned.c:1959:8: note: mutex acquired here
1959 | lim = queue_limits_start_update(q);
| ^
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: bba4322e3f ("block: freeze queue when updating zone resources")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260114192803.4171847-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Device quirk to disable faulty temperature (Ilikara)
- TCP target null pointer fix from bad host protocol usage (Shivam)
- Add apple,t8103-nvme-ans2 as a compatible apple controller
(Janne)
- FC tagset leak fix (Chaitanya)
- TCP socket deadlock fix (Hannes)
- Target name buffer overrun fix (Shin'ichiro)
- Fix for an underflow for rnbd during device unmap
- Zero the non-PI part of the auto integrity buffer
- Fix for a configfs memory leak in the null block driver
* tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
rnbd-clt: fix refcount underflow in device unmap path
nvme: fix PCIe subsystem reset controller state transition
nvmet: do not copy beyond subsysnqn string length
nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()
null_blk: fix kmemleak by releasing references to fault configfs items
block: zero non-PI portion of auto integrity buffer
nvme-fc: release admin tagset if init fails
nvme-apple: add "apple,t8103-nvme-ans2" as compatible
nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec
nvme-pci: disable secondary temp for Wodposit WPBSNM8
Replace XXX with what it actually means.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix the comment for blk_zone_cond_str() by replacing the meaningless
BLK_ZONE_ZONE_XXX comment with the correct BLK_ZONE_COND_name, thus also
replacing the XXX with what that actually means.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
DMA IOVA state is not used inside blk_rq_dma_map_iter_next, so get rid
of the argument.
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in fixes that went to 6.19 after for-7.0/block was branched.
Pending ublk changes depend on particularly the async scan work.
* block-6.19:
block: zero non-PI portion of auto integrity buffer
ublk: fix use-after-free in ublk_partition_scan_work
blk-mq: avoid stall during boot due to synchronize_rcu_expedited
loop: add missing bd_abort_claiming in loop_set_status
block: don't merge bios with different app_tags
blk-rq-qos: Remove unlikely() hints from QoS checks
loop: don't change loop device under exclusive opener in loop_set_status
block, bfq: update outdated comment
blk-mq: skip CPU offline notify on unmapped hctx
selftests/ublk: fix Makefile to rebuild on header changes
selftests/ublk: add test for async partition scan
ublk: scan partition in async way
block,bfq: fix aux stat accumulation destination
md: Fix forward incompatibility from configurable logical block size
md: Fix logical_block_size configuration being overwritten
md: suspend array while updating raid_disks via sysfs
md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()
md: Fix static checker warning in analyze_sbs
Add a blk_crypto_submit_bio helper that submits the bio when it is not
encrypted or inline encryption is provided, and otherwise handles the
encryption before going down into the low-level driver.
This reduces the risk from bio reordering and keeps memory allocation
as high up in the stack as possible.
Note that if the submitter knows that inline encryption is supported by
the underlying driver, it can still use plain submit_bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid the relatively high overhead of constructing and walking per-page
segment bio_vecs for data unit alignment checking by merging the checks
into existing loops.
For hardware-supported crypto, perform the check in bio_split_io_at, which
already contains a similar alignment check applied for all I/O. This
means bio-based drivers that do not call bio_split_to_limits, should they
ever grow blk-crypto support, need to implement the check themselves,
just like all other queue limits checks.
For blk-crypto-fallback do it in the encryption/decryption loops. This
means alignment errors for decryption will only be detected after I/O
has completed, but that seems like a worthwhile trade off.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calling mempool_alloc in a loop is not safe unless the maximum allocation
size times the maximum number of threads using it is less than the
minimum pool size. Use the new mempool_alloc_bulk helper to allocate
all missing elements in one pass to remove this deadlock risk. This
also means that non-pool allocations now use alloc_pages_bulk which can
be significantly faster than a loop over individual page allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allocating a skcipher request dynamically can deadlock or cause
unexpected I/O failures when called from writeback context. Avoid the
allocation entirely by using on-stack skciphers, similar to what the
non-blk-crypto fscrypt path already does.
This drops the incomplete support for asynchronous algorithms, which
previously could be used, but only synchronously.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current code in blk_crypto_fallback_encrypt_bio is inefficient and
prone to deadlocks under memory pressure: it first walks the passed-in
plaintext bio to see how much of it can fit into a single encrypted
bio using up to BIO_MAX_VECS PAGE_SIZE segments, and then allocates a
plaintext clone that fits the size, only to allocate another bio for
the ciphertext later. While the plaintext clone uses a bioset to avoid
deadlocks when allocations could fail, the ciphertext one uses
bio_kmalloc, which is a no-go in the file system I/O path.
Switch blk_crypto_fallback_encrypt_bio to walk the source plaintext bio
while consuming bi_iter without cloning it, and instead allocate a
ciphertext bio at the beginning and whenever we fill up the previous
one. The existing bio_set for the plaintext clones is reused for the
ciphertext bios to remove the deadlock risk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Restructure blk_crypto_fallback_bio_prep so that it always submits the
encrypted bio instead of passing it back to the caller. This allows
simplifying the calling conventions of blk_crypto_fallback_bio_prep and
blk_crypto_bio_prep so that they never have to return a bio, and can
use a true return value to indicate that the caller should submit the
bio, and false to indicate that the blk-crypto code consumed it.
The submission is handled by the on-stack bio list in the current
task_struct by the block layer and does not cause additional stack
usage or major overhead. It also prepares for the following optimization
and fixes for the blk-crypto fallback write path.
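Expressed as a simplified caller-side sketch of the new convention (not
the exact code):

  void blk_crypto_submit_bio(struct bio *bio)
  {
          if (blk_crypto_bio_prep(&bio))
                  submit_bio(bio);        /* unencrypted or inline crypto */
          /* else: the blk-crypto fallback consumed (and submits) the bio */
  }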
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When checking if a bio fits in a single segment, bio_may_need_split()
compares bi_size against the current bvec's bv_len. However, for
partially consumed bvecs (bi_bvec_done > 0), such as in cloned or
split bios, the remaining bytes in the current bvec are actually
(bv_len - bi_bvec_done), not bv_len.
This could cause bio_may_need_split() to incorrectly return false,
leading to nr_phys_segments being set to 1 when the bio actually
spans multiple segments. This triggers the WARN_ON in __blk_rq_map_sg()
when the actual mapped segments exceed the expected count.
Fix this by subtracting bi_bvec_done from bv_len in the comparison.
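Illustrative sketch of the changed comparison (the surrounding function
is omitted; bv is the bvec at the current index):

  static bool may_need_split(const struct bio *bio, const struct bio_vec *bv)
  {
          /* was: return bio->bi_iter.bi_size > bv->bv_len; */
          return bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done;
  }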
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/9687cf2b-1f32-44e1-b58d-2492dc6e7185@linux.ibm.com/
Reported-and-bisected-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Tested-by: Christoph Hellwig <hch@infradead.org>
Fixes: ee623c892a ("block: use bvec iterator helper for bio_may_need_split()")
Cc: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bi_offload_capable() returns whether a block device's metadata size
matches its PI tuple size. Use pi_tuple_size instead of switching on
csum_type. This makes the code considerably simpler and less branchy.
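A plausible after-shape, as a sketch following the description above
(field names as described; the exact struct layout may differ):

  static bool bi_offload_capable(struct blk_integrity *bi)
  {
          /* was: a switch on bi->csum_type comparing per-type tuple sizes */
          return bi->metadata_size == bi->pi_tuple_size;
  }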
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>