Commit Graph

7981 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Linus Torvalds 8804d970fa Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from
   Kairui Song improves performance and reduces the failure rate of swap
   cluster allocation.
 
 - The 4 patch series "support large align and nid in Rust allocators"
   from Vitaly Wool permits Rust allocators to set NUMA node and large
   alignment when perforning slub and vmalloc reallocs.
 
 - The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from
   Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets
   for virtual address spaces for ops-level DAMOS filters.
 
 - The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock"
   from Suren Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps.
 
 - The 2 patch series "mm/mincore: minor clean up for swap cache
   checking" from Kairui Song performs some cleanup in the swap code.
 
 - The 11 patch series "mm: vm_normal_page*() improvements" from David
   Hildenbrand provides code cleanup in the pagemap code.
 
 - The 5 patch series "add persistent huge zero folio support" from
   Pankaj Raghav provides a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero.
 
 - The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a
   few touchups to the recently added Kexec Handover feature.
 
 - The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all
   arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap.  To
   end the constant struggle with space shortage on 32-bit conflicting with
   64-bit's needs.
 
 - The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li
   cleans up some swap code.
 
 - The 7 patch series "selftests/mm: Fix false positives and skip
   unsupported tests" from Donet Tom fixes a few things in our selftests
   code.
 
 - The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide
   THPs when advised" from David Hildenbrand "allows individual processes
   to opt-out of THP=always into THP=madvise, without affecting other
   workloads on the system".
 
   It's a long story - the [1/N] changelog spells out the considerations.
 
 - The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox
   gets us started on the memdesc project.  Please see
   https://kernelnewbies.org/MatthewWilcox/Memdescs and
   https://blogs.oracle.com/linux/post/introducing-memdesc.
 
 - The 3 patch series "Tiny optimization for large read operations" from
   Chi Zhiling improves the efficiency of the pagecache read path.
 
 - The 5 patch series "Better split_huge_page_test result check" from Zi
   Yan improves our folio splitting selftest code.
 
 - The 2 patch series "test that rmap behaves as expected" from Wei Yang
   adds some rmap selftests.
 
 - The 3 patch series "remove write_cache_pages()" from Christoph Hellwig
   removes that function and converts its two remaining callers.
 
 - The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain
   fixes some UFFD selftests issues.
 
 - The 3 patch series "introduce kernel file mapped folios" from Boris
   Burkov introduces the concept of "kernel file pages".  Using these
   permits btrfs to account its metadata pages to the root cgroup, rather
   than to the cgroups of random inappropriate tasks.
 
 - The 2 patch series "mm/pageblock: improve readability of some
   pageblock handling" from Wei Yang provides some readability improvements
   to the page allocator code.
 
 - The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae
   Park teaches DAMON to understand arm32 highmem.
 
 - The 4 patch series "tools: testing: Use existing atomic.h for
   vma/maple tests" from Brendan Jackman performs some code cleanups and
   deduplication under tools/testing/.
 
 - The 2 patch series "maple_tree: Fix testing for 32bit compiles" from
   Liam Howlett fixes a couple of 32-bit issues in
   tools/testing/radix-tree.c.
 
 - The 2 patch series "kasan: unify kasan_enabled() and remove
   arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN
   arch-specific initialization code into a common arch-neutral
   implementation.
 
 - The 3 patch series "mm: remove zpool" from Johannes Weiner removes
   zspool - an indirection layer which now only redirects to a single thing
   (zsmalloc).
 
 - The 2 patch series "mm: task_stack: Stack handling cleanups" from
   Pasha Tatashin makes a couple of cleanups in the fork code.
 
 - The 37 patch series "mm: remove nth_page()" from David Hildenbrand
   makes rather a lot of adjustments at various nth_page() callsites,
   eventually permitting the removal of that undesirable helper function.
 
 - The 2 patch series "introduce kasan.write_only option in hw-tags" from
   Yeoreum Yun creates a KASAN read-only mode for ARM, using that
   architecture's memory tagging feature.  It is felt that a read-only mode
   KASAN is suitable for use in production systems rather than debug-only.
 
 - The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation"
   from Kefeng Wang does some tidying in the hugetlb folio allocation code.
 
 - The 12 patch series "mm: establish const-correctness for pointer
   parameters" from Max Kellermann makes quite a number of the MM API
   functions more accurate about the constness of their arguments.  This
   was getting in the way of subsystems (in this case CEPH) when they
   attempt to improving their own const/non-const accuracy.
 
 - The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola
   fixes a number of code sites which were confused over when to use
   free_pages() vs __free_pages().
 
 - The 3 patch series "Add Rust abstraction for Maple Trees" from Alice
   Ryhl makes the mapletree code accessible to Rust.  Required by nouveau
   and by its forthcoming successor: the new Rust Nova driver.
 
 - The 2 patch series "selftests/mm: split_huge_page_test:
   split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and
   some cleanups to the thp selftesting code.
 
 - The 14 patch series "mm, swap: introduce swap table as swap cache
   (phase I)" from Chris Li and Kairui Song is the first step along the
   path to implementing "swap tables" - a new approach to swap allocation
   and state tracking which is expected to yield speed and space
   improvements.  This patchset itself yields a 5-20% performance benefit
   in some situations.
 
 - The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes
   the new memdesc layer to clean up the ptdesc code a little.
 
 - The 3 patch series "Fix va_high_addr_switch.sh test failure" from
   Chunyu Hu fixes some issues in our 5-level pagetable selftesting code.
 
 - The 2 patch series "Minor fixes for memory allocation profiling" from
   Suren Baghdasaryan addresses a couple of minor issues in relatively new
   memory allocation profiling feature.
 
 - The 3 patch series "Small cleanups" from Matthew Wilcox has a few
   cleanups in preparation for more memdesc work.
 
 - The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
   DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in
   furtherance of supporting arm highmem.
 
 - The 2 patch series "selftests/mm: Add -Wunreachable-code and fix
   warnings" from Muhammad Anjum adds that compiler check to selftests code
   and fixes the fallout, by removing dead code.
 
 - The 10 patch series "Improvements to Victim Process Thawing and OOM
   Reaper Traversal Order" from zhongjinji makes a number of improvements
   in the OOM killer: mainly thawing a more appropriate group of victim
   threads so they can release resources.
 
 - The 5 patch series "mm/damon: misc fixups and improvements for 6.18"
   from SeongJae Park is a bunch of small and unrelated fixups for DAMON.
 
 - The 7 patch series "mm/damon: define and use DAMON initialization
   check function" from SeongJae Park implement reliability and
   maintainability improvements to a recently-added bug fix.
 
 - The 2 patch series "mm/damon/stat: expose auto-tuned intervals and
   non-idle ages" from SeongJae Park provides additional transparency to
   userspace clients of the DAMON_STAT information.
 
 - The 2 patch series "Expand scope of khugepaged anonymous collapse"
   from Dev Jain removes some constraints on khubepaged's collapsing of
   anon VMAs.  It also increases the success rate of MADV_COLLAPSE against
   an anon vma.
 
 - The 2 patch series "mm: do not assume file == vma->vm_file in
   compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards
   removal of file_operations.mmap().  This patchset concentrates upon
   clearing up the treatment of stacked filesystems.
 
 - The 6 patch series "mm: Improve mlock tracking for large folios" from
   Kiryl Shutsemau provides some fixes and improvements to mlock's tracking
   of large folios.  /proc/meminfo's "Mlocked" field became more accurate.
 
 - The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters
   during fork" from Donet Tom fixes several user-visible KSM stats
   inaccuracies across forks and adds selftest code to verify these
   counters.
 
 - The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei
   Yang addresses some potential but presently benign issues in KSM's
   mm_slot handling.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA
 jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1
 2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA=
 =4mhY
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "mm, swap: improve cluster scan strategy" from Kairui Song improves
   performance and reduces the failure rate of swap cluster allocation

 - "support large align and nid in Rust allocators" from Vitaly Wool
   permits Rust allocators to set NUMA node and large alignment when
   perforning slub and vmalloc reallocs

 - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
   DAMOS_STAT's handling of the DAMON operations sets for virtual
   address spaces for ops-level DAMOS filters

 - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
   Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps

 - "mm/mincore: minor clean up for swap cache checking" from Kairui Song
   performs some cleanup in the swap code

 - "mm: vm_normal_page*() improvements" from David Hildenbrand provides
   code cleanup in the pagemap code

 - "add persistent huge zero folio support" from Pankaj Raghav provides
   a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero

 - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
   the recently added Kexec Handover feature

 - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
   Stoakes turns mm_struct.flags into a bitmap. To end the constant
   struggle with space shortage on 32-bit conflicting with 64-bit's
   needs

 - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
   code

 - "selftests/mm: Fix false positives and skip unsupported tests" from
   Donet Tom fixes a few things in our selftests code

 - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
   from David Hildenbrand "allows individual processes to opt-out of
   THP=always into THP=madvise, without affecting other workloads on the
   system".

   It's a long story - the [1/N] changelog spells out the considerations

 - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
   the memdesc project. Please see

      https://kernelnewbies.org/MatthewWilcox/Memdescs and
      https://blogs.oracle.com/linux/post/introducing-memdesc

 - "Tiny optimization for large read operations" from Chi Zhiling
   improves the efficiency of the pagecache read path

 - "Better split_huge_page_test result check" from Zi Yan improves our
   folio splitting selftest code

 - "test that rmap behaves as expected" from Wei Yang adds some rmap
   selftests

 - "remove write_cache_pages()" from Christoph Hellwig removes that
   function and converts its two remaining callers

 - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
   selftests issues

 - "introduce kernel file mapped folios" from Boris Burkov introduces
   the concept of "kernel file pages". Using these permits btrfs to
   account its metadata pages to the root cgroup, rather than to the
   cgroups of random inappropriate tasks

 - "mm/pageblock: improve readability of some pageblock handling" from
   Wei Yang provides some readability improvements to the page allocator
   code

 - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
   to understand arm32 highmem

 - "tools: testing: Use existing atomic.h for vma/maple tests" from
   Brendan Jackman performs some code cleanups and deduplication under
   tools/testing/

 - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
   a couple of 32-bit issues in tools/testing/radix-tree.c

 - "kasan: unify kasan_enabled() and remove arch-specific
   implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
   initialization code into a common arch-neutral implementation

 - "mm: remove zpool" from Johannes Weiner removes zspool - an
   indirection layer which now only redirects to a single thing
   (zsmalloc)

 - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
   couple of cleanups in the fork code

 - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
   adjustments at various nth_page() callsites, eventually permitting
   the removal of that undesirable helper function

 - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
   creates a KASAN read-only mode for ARM, using that architecture's
   memory tagging feature. It is felt that a read-only mode KASAN is
   suitable for use in production systems rather than debug-only

 - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
   some tidying in the hugetlb folio allocation code

 - "mm: establish const-correctness for pointer parameters" from Max
   Kellermann makes quite a number of the MM API functions more accurate
   about the constness of their arguments. This was getting in the way
   of subsystems (in this case CEPH) when they attempt to improving
   their own const/non-const accuracy

 - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
   code sites which were confused over when to use free_pages() vs
   __free_pages()

 - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
   mapletree code accessible to Rust. Required by nouveau and by its
   forthcoming successor: the new Rust Nova driver

 - "selftests/mm: split_huge_page_test: split_pte_mapped_thp
   improvements" from David Hildenbrand adds a fix and some cleanups to
   the thp selftesting code

 - "mm, swap: introduce swap table as swap cache (phase I)" from Chris
   Li and Kairui Song is the first step along the path to implementing
   "swap tables" - a new approach to swap allocation and state tracking
   which is expected to yield speed and space improvements. This
   patchset itself yields a 5-20% performance benefit in some situations

 - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
   layer to clean up the ptdesc code a little

 - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
   issues in our 5-level pagetable selftesting code

 - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
   addresses a couple of minor issues in relatively new memory
   allocation profiling feature

 - "Small cleanups" from Matthew Wilcox has a few cleanups in
   preparation for more memdesc work

 - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
   Quanmin Yan makes some changes to DAMON in furtherance of supporting
   arm highmem

 - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
   Anjum adds that compiler check to selftests code and fixes the
   fallout, by removing dead code

 - "Improvements to Victim Process Thawing and OOM Reaper Traversal
   Order" from zhongjinji makes a number of improvements in the OOM
   killer: mainly thawing a more appropriate group of victim threads so
   they can release resources

 - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
   is a bunch of small and unrelated fixups for DAMON

 - "mm/damon: define and use DAMON initialization check function" from
   SeongJae Park implement reliability and maintainability improvements
   to a recently-added bug fix

 - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
   SeongJae Park provides additional transparency to userspace clients
   of the DAMON_STAT information

 - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
   some constraints on khubepaged's collapsing of anon VMAs. It also
   increases the success rate of MADV_COLLAPSE against an anon vma

 - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
   from Lorenzo Stoakes moves us further towards removal of
   file_operations.mmap(). This patchset concentrates upon clearing up
   the treatment of stacked filesystems

 - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
   provides some fixes and improvements to mlock's tracking of large
   folios. /proc/meminfo's "Mlocked" field became more accurate

 - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
   Donet Tom fixes several user-visible KSM stats inaccuracies across
   forks and adds selftest code to verify these counters

 - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
   some potential but presently benign issues in KSM's mm_slot handling

* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
  mm: swap: check for stable address space before operating on the VMA
  mm: convert folio_page() back to a macro
  mm/khugepaged: use start_addr/addr for improved readability
  hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
  alloc_tag: fix boot failure due to NULL pointer dereference
  mm: silence data-race in update_hiwater_rss
  mm/memory-failure: don't select MEMORY_ISOLATION
  mm/khugepaged: remove definition of struct khugepaged_mm_slot
  mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
  hugetlb: increase number of reserving hugepages via cmdline
  selftests/mm: add fork inheritance test for ksm_merging_pages counter
  mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
  drivers/base/node: fix double free in register_one_node()
  mm: remove PMD alignment constraint in execmem_vmalloc()
  mm/memory_hotplug: fix typo 'esecially' -> 'especially'
  mm/rmap: improve mlock tracking for large folios
  mm/filemap: map entire large folio faultaround
  mm/fault: try to map the entire file folio in finish_fault()
  mm/rmap: mlock large folios in try_to_unmap_one()
  mm/rmap: fix a mlock race condition in folio_referenced_one()
  ...
2025-10-02 18:18:33 -07:00
Linus Torvalds e1b1d03cee for-6.18/block-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoY0D/9J+11BC88pBxCrLKv/V2TwCNokRMi0dU3L
 r3EUdA46k0oXmvb6ueZqIcfY2e+IX7rdQkaRbh1zRdsNejqHo4548C3ePWGdBAcM
 OdNEGfpehO0aD0td1+mK/NxoJMLhbs5QraPanz+SOkGZOKeF+vGCga5PUDivsr5J
 16T9yb7i+isENLdAc2RJbZVyAphqHQlo5GHi5ZIKOVi5cNt8GU/q2sQl7NYmGvHd
 aq37svvZHFOhLRajP959Fw9WOxEYITewzQ4UYf1FZjUodJUxO+vCnP0ooBQRlyu8
 1B4PYWwSE+Vn3GkQE0Om+mzo9AVPOiLmoAWGxdgJBMyEkZndocr46XEslXOufQ1Z
 T3Gu19G6jCxcyByNVhjVnaajYKmvSQAy1w75m4XlfqTRm4f9Om+LAJavUk3RuaOL
 7lXKQ7Ql1/Tby9Jmf8afjYYXXotNDNku6rz2P3qtOwAA26mNJfgVt0rO+8XGRDe9
 ioLbCkTjslYMc/Oh4jSsbrspsVALbaQMq/Dmah8k0EWb4QAHVgCJyGBoff3hOboI
 jD6B1enaKOQVgcjWcjm/FjOk3jv2h3v4X26YWQZTvEc/1PnSnST78Zi/ePhzDdmt
 sBALUAS37TfTgNMzrhbHl5Zs13k0C0XyANuayuKuo5hlNnC1wbdap+5FZJOmpuOB
 YT+VkYnaOA==
 =kOmc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)

 - MD pull request via Yu:
     - Add support for a lockless bitmap.

       A key feature for the new bitmap are that the IO fastpath is
       lockless. If a user issues lots of write IO to the same bitmap
       bit in a short time, only the first write has additional overhead
       to update bitmap bit, no additional overhead for the following
       writes.

       By supporting only resync or recover written data, means in the
       case creating new array or replacing with a new disk, there is no
       need to do a full disk resync/recovery.

 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather
   than struct block_device.

 - Rust block changes via Andreas. This series adds configuration via
   configfs and remote completion to the rnull driver. The series also
   includes a set of changes to the rust block device driver API: a few
   cleanup patches, and a few features supporting the rnull changes.

   The series removes the raw buffer formatting logic from
   `kernel::block` and improves the logic available in `kernel::string`
   to support the same use as the removed logic.

 - floppy arch cleanups

 - Reduce the number of dereferencing needed for ublk commands

 - Restrict supported sockets for nbd. Mostly done to eliminate a class
   of issues perpetually reported by syzbot, by using nonsensical socket
   setups.

 - A few s390 dasd block fixes

 - Fix a few issues around atomic writes

 - Improve DMA interation for integrity requests

 - Improve how iovecs are treated with regards to O_DIRECT aligment
   constraints.

   We used to require each segment to adhere to the constraints, now
   only the request as a whole needs to.

 - Clean up and improve p2p support, enabling use of p2p for metadata
   payloads

 - Improve locking of request lookup, using SRCU where appropriate

 - Use page references properly for brd, avoiding very long RCU sections

 - Fix ordering of recursively submitted IOs

 - Clean up and improve updating nr_requests for a live device

 - Various fixes and cleanups

* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-10-02 10:16:56 -07:00
Linus Torvalds 5832d26433 for-6.18/io_uring-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLEcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnEUD/4/FgfQP2LFS/88BBF5ukZjRySe4wmyyZ2Q
 MFh2ehdxzkZxVXjbeA2wRAXdqjw2MbNhx8tzU9VrW7rweNDZxHbwi6jJIP7OAjxE
 4ZP0goAQj7P0TFyXC2KGj7k6dP20FkAltx5gGLVwsuOWDDrQKp2EykAcRnGYAD4W
 3yf+nojVr2bjHyO7dx8dM7jUDjMg7J8nmHD6zgHOlHRLblWwfzw907bhz+eBX/FI
 9kYvtX2c9MgY4Isa+43rZd5qvj9S3Cs8PD6tFPbq+n+3l7yWgMBTu/y+SNI8hupT
 W7CqjPcpvppFHhPkcXDA3yARnW7ccEx5aiQuvUCmRUioHtGwXvC63HMp8OjcQspV
 NNoIHYFsi1alzYq2kJLxY1IleWZ8j0hUkSSU8u7al8VIvtD43LGkv51xavxQUFjg
 BO9mLyS51H2agffySs4vhHJE82lZizvmh/RJfSJ0ezALzE2k42MrximX1D1rBJE6
 KPOhCiPt/jqpQMyqDYnY10FgTXQVwgPIVH1JLpo611tPFHlGW8Y4YxxR1Xduh5JX
 jbGLEjVREsDZ7EHrimLNLmJRAQpyQujv/yhf7k96gWBelVwVuISQLI4Ca5IeVQyk
 9yifgLXNGddgAwj0POMFeKXSm2We9nrrPDYLCKrsBMSN96/3SLveJC7fkW88aUZr
 ye4/K8Y3vA==
 =uc/3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Store ring provided buffers locally for the users, rather than stuff
   them into struct io_kiocb.

   These types of buffers must always be fully consumed or recycled in
   the current context, and leaving them in struct io_kiocb is hence not
   a good ideas as that struct has a vastly different life time.

   Basically just an architecture cleanup that can help prevent issues
   with ring provided buffers in the future.

 - Support for mixed CQE sizes in the same ring.

   Before this change, a CQ ring either used the default 16b CQEs, or it
   was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where
   a few 32b CQEs were needed, this caused everything else to use big
   CQEs. This is wasteful both in terms of memory usage, but also memory
   bandwidth for the posted CQEs.

   With IORING_SETUP_CQE_MIXED, applications may use request types that
   post both normal 16b and big 32b CQEs on the same ring.

 - Add helpers for async data management, to make it harder for opcode
   handlers to mess it up.

 - Add support for multishot for uring_cmd, which ublk can use. This
   helps improve efficiency, by providing a persistent request type that
   can trigger multiple CQEs.

 - Add initial support for ring feature querying.

   We had basic support for probe operations, but the API isn't great.
   Rather than expand that, add support for QUERY which is easily
   expandable and can cover a lot more cases than the existing probe
   support. This will help applications get a better idea of what
   operations are supported on a given host.

 - zcrx improvements from Pavel:
        - Improve refill entry alignment for better caching
        - Various cleanups, especially around deduplicating normal
          memory vs dmabuf setup.
        - Generalisation of the niov size (Patch 12). It's still hard
          coded to PAGE_SIZE on init, but will let the user to specify
          the rx buffer length on setup.
        - Syscall / synchronous bufer return. It'll be used as a slow
          fallback path for returning buffers when the refill queue is
          full. Useful for tolerating slight queue size misconfiguration
          or with inconsistent load.
        - Accounting more memory to cgroups.
        - Additional independent cleanups that will also be useful for
          mutli-area support.

 - Various fixes and cleanups

* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
  io_uring: fix nvme's 32b cqes on mixed cq
  io_uring/query: cap number of queries
  io_uring/query: prevent infinite loops
  io_uring/zcrx: account niov arrays to cgroup
  io_uring/zcrx: allow synchronous buffer return
  io_uring/zcrx: introduce io_parse_rqe()
  io_uring/zcrx: don't adjust free cache space
  io_uring/zcrx: use guards for the refill lock
  io_uring/zcrx: reduce netmem scope in refill
  io_uring/zcrx: protect netdev with pp_lock
  io_uring/zcrx: rename dma lock
  io_uring/zcrx: make niov size variable
  io_uring/zcrx: set sgt for umem area
  io_uring/zcrx: remove dmabuf_offset
  io_uring/zcrx: deduplicate area mapping
  io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
  io_uring/zcrx: check all niovs filled with dma addresses
  io_uring/zcrx: move area reg checks into io_import_area
  io_uring/zcrx: don't pass slot to io_zcrx_create_area
  ...
2025-10-02 09:56:23 -07:00
Linus Torvalds 18b19abc37 namespace-6.18-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQgQAKCRCRxhvAZXjc
 oiFXAQCpbLvkWbld9wLgxUBhq+q+kw5NvGxzpvqIhXwJB9F9YAEA44/Wevln4xGx
 +kRUbP+xlRQqenIYs2dLzVHzAwAdfQ4=
 =EO4Y
 -----END PGP SIGNATURE-----

Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull namespace updates from Christian Brauner:
 "This contains a larger set of changes around the generic namespace
  infrastructure of the kernel.

  Each specific namespace type (net, cgroup, mnt, ...) embedds a struct
  ns_common which carries the reference count of the namespace and so
  on.

  We open-coded and cargo-culted so many quirks for each namespace type
  that it just wasn't scalable anymore. So given there's a bunch of new
  changes coming in that area I've started cleaning all of this up.

  The core change is to make it possible to correctly initialize every
  namespace uniformly and derive the correct initialization settings
  from the type of the namespace such as namespace operations, namespace
  type and so on. This leaves the new ns_common_init() function with a
  single parameter which is the specific namespace type which derives
  the correct parameters statically. This also means the compiler will
  yell as soon as someone does something remotely fishy.

  The ns_common_init() addition also allows us to remove ns_alloc_inum()
  and drops any special-casing of the initial network namespace in the
  network namespace initialization code that Linus complained about.

  Another part is reworking the reference counting. The reference
  counting was open-coded and copy-pasted for each namespace type even
  though they all followed the same rules. This also removes all open
  accesses to the reference count and makes it private and only uses a
  very small set of dedicated helpers to manipulate them just like we do
  for e.g., files.

  In addition this generalizes the mount namespace iteration
  infrastructure introduced a few cycles ago. As reminder, the vfs makes
  it possible to iterate sequentially and bidirectionally through all
  mount namespaces on the system or all mount namespaces that the caller
  holds privilege over. This allow userspace to iterate over all mounts
  in all mount namespaces using the listmount() and statmount() system
  call.

  Each mount namespace has a unique identifier for the lifetime of the
  systems that is exposed to userspace. The network namespace also has a
  unique identifier working exactly the same way. This extends the
  concept to all other namespace types.

  The new nstree type makes it possible to lookup namespaces purely by
  their identifier and to walk the namespace list sequentially and
  bidirectionally for all namespace types, allowing userspace to iterate
  through all namespaces. Looking up namespaces in the namespace tree
  works completely locklessly.

  This also means we can move the mount namespace onto the generic
  infrastructure and remove a bunch of code and members from struct
  mnt_namespace itself.

  There's a bunch of stuff coming on top of this in the future but for
  now this uses the generic namespace tree to extend a concept
  introduced first for pidfs a few cycles ago. For a while now we have
  supported pidfs file handles for pidfds. This has proven to be very
  useful.

  This extends the concept to cover namespaces as well. It is possible
  to encode and decode namespace file handles using the common
  name_to_handle_at() and open_by_handle_at() apis.

  As with pidfs file handles, namespace file handles are exhaustive,
  meaning it is not required to actually hold a reference to nsfs in
  able to decode aka open_by_handle_at() a namespace file handle.
  Instead the FD_NSFS_ROOT constant can be passed which will let the
  kernel grab a reference to the root of nsfs internally and thus decode
  the file handle.

  Namespaces file descriptors can already be derived from pidfds which
  means they aren't subject to overmount protection bugs. IOW, it's
  irrelevant if the caller would not have access to an appropriate
  /proc/<pid>/ns/ directory as they could always just derive the
  namespace based on a pidfd already.

  It has the same advantage as pidfds. It's possible to reliably and for
  the lifetime of the system refer to a namespace without pinning any
  resources and to compare them trivially.

  Permission checking is kept simple. If the caller is located in the
  namespace the file handle refers to they are able to open it otherwise
  they must hold privilege over the owning namespace of the relevant
  namespace.

  The namespace file handle layout is exposed as uapi and has a stable
  and extensible format. For now it simply contains the namespace
  identifier, the namespace type, and the inode number. The stable
  format means that userspace may construct its own namespace file
  handles without going through name_to_handle_at() as they are already
  allowed for pidfs and cgroup file handles"

* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
  ns: drop assert
  ns: move ns type into struct ns_common
  nstree: make struct ns_tree private
  ns: add ns_debug()
  ns: simplify ns_common_init() further
  cgroup: add missing ns_common include
  ns: use inode initializer for initial namespaces
  selftests/namespaces: verify initial namespace inode numbers
  ns: rename to __ns_ref
  nsfs: port to ns_ref_*() helpers
  net: port to ns_ref_*() helpers
  uts: port to ns_ref_*() helpers
  ipv4: use check_net()
  net: use check_net()
  net-sysfs: use check_net()
  user: port to ns_ref_*() helpers
  time: port to ns_ref_*() helpers
  pid: port to ns_ref_*() helpers
  ipc: port to ns_ref_*() helpers
  cgroup: port to ns_ref_*() helpers
  ...
2025-09-29 11:20:29 -07:00
Linus Torvalds 722df25ddf kernel-6.18-rc1.clone3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc
 ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl
 8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY=
 =tUXG
 -----END PGP SIGNATURE-----

Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull copy_process updates from Christian Brauner:
 "This contains the changes to enable support for clone3() on nios2
  which apparently is still a thing.

  The more exciting part of this is that it cleans up the inconsistency
  in how the 64-bit flag argument is passed from copy_process() into the
  various other copy_*() helpers"

[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]

* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nios2: implement architecture-specific portion of sys_clone3
  arch: copy_thread: pass clone_flags as u64
  copy_process: pass clone_flags as u64 across calltree
  copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29 10:36:50 -07:00
Linus Torvalds b7ce6fa90f vfs-6.18-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc
 omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03
 GbMcB2DCFLC4evqYECj6IG7NBmoKsAs=
 =1ICf
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add "initramfs_options" parameter to set initramfs mount options.
     This allows to add specific mount options to the rootfs to e.g.,
     limit the memory size

   - Add RWF_NOSIGNAL flag for pwritev2()

     Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
     signal from being raised when writing on disconnected pipes or
     sockets. The flag is handled directly by the pipe filesystem and
     converted to the existing MSG_NOSIGNAL flag for sockets

   - Allow to pass pid namespace as procfs mount option

     Ever since the introduction of pid namespaces, procfs has had very
     implicit behaviour surrounding them (the pidns used by a procfs
     mount is auto-selected based on the mounting process's active
     pidns, and the pidns itself is basically hidden once the mount has
     been constructed)

     This implicit behaviour has historically meant that userspace was
     required to do some special dances in order to configure the pidns
     of a procfs mount as desired. Examples include:

     * In order to bypass the mnt_too_revealing() check, Kubernetes
       creates a procfs mount from an empty pidns so that user
       namespaced containers can be nested (without this, the nested
       containers would fail to mount procfs)

       But this requires forking off a helper process because you cannot
       just one-shot this using mount(2)

     * Container runtimes in general need to fork into a container
       before configuring its mounts, which can lead to security issues
       in the case of shared-pidns containers (a privileged process in
       the pidns can interact with your container runtime process)

       While SUID_DUMP_DISABLE and user namespaces make this less of an
       issue, the strict need for this due to a minor uAPI wart is kind
       of unfortunate

       Things would be much easier if there was a way for userspace to
       just specify the pidns they want. So this pull request contains
       changes to implement a new "pidns" argument which can be set
       using fsconfig(2):

           fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
           fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

       or classic mount(2) / mount(8):

           // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
           mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

  Cleanups:

   - Remove the last references to EXPORT_OP_ASYNC_LOCK

   - Make file_remove_privs_flags() static

   - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used

   - Use try_cmpxchg() in start_dir_add()

   - Use try_cmpxchg() in sb_init_done_wq()

   - Replace offsetof() with struct_size() in ioctl_file_dedupe_range()

   - Remove vfs_ioctl() export

   - Replace rwlock() with spinlock in epoll code as rwlock causes
     priority inversion on preempt rt kernels

   - Make ns_entries in fs/proc/namespaces const

   - Use a switch() statement() in init_special_inode() just like we do
     in may_open()

   - Use struct_size() in dir_add() in the initramfs code

   - Use str_plural() in rd_load_image()

   - Replace strcpy() with strscpy() in find_link()

   - Rename generic_delete_inode() to inode_just_drop() and
     generic_drop_inode() to inode_generic_drop()

   - Remove unused arguments from fcntl_{g,s}et_rw_hint()

  Fixes:

   - Document @name parameter for name_contains_dotdot() helper

   - Fix spelling mistake

   - Always return zero from replace_fd() instead of the file descriptor
     number

   - Limit the size for copy_file_range() in compat mode to prevent a
     signed overflow

   - Fix debugfs mount options not being applied

   - Verify the inode mode when loading it from disk in minixfs

   - Verify the inode mode when loading it from disk in cramfs

   - Don't trigger automounts with RESOLVE_NO_XDEV

     If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
     through automounts, but could still trigger them

   - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
     tracepoints

   - Fix unused variable warning in rd_load_image() on s390

   - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD

   - Use ns_capable_noaudit() when determining net sysctl permissions

   - Don't call path_put() under namespace semaphore in listmount() and
     statmount()"

* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
  fcntl: trim arguments
  listmount: don't call path_put() under namespace semaphore
  statmount: don't call path_put() under namespace semaphore
  pid: use ns_capable_noaudit() when determining net sysctl permissions
  fs: rename generic_delete_inode() and generic_drop_inode()
  init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
  initramfs: Replace strcpy() with strscpy() in find_link()
  initrd: Use str_plural() in rd_load_image()
  initramfs: Use struct_size() helper to improve dir_add()
  initrd: Fix unused variable warning in rd_load_image() on s390
  fs: use the switch statement in init_special_inode()
  fs/proc/namespaces: make ns_entries const
  filelock: add FL_RECLAIM to show_fl_flags() macro
  eventpoll: Replace rwlock with spinlock
  selftests/proc: add tests for new pidns APIs
  procfs: add "pidns" mount option
  pidns: move is-ancestor logic to helper
  openat2: don't trigger automounts with RESOLVE_NO_XDEV
  namei: move cross-device check to __traverse_mounts
  namei: remove LOOKUP_NO_XDEV check from handle_mounts
  ...
2025-09-29 09:03:07 -07:00
Linus Torvalds 3a654ee549 block-6.17-20250925
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjWKd4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphViEADJF/LndvCUkcNFagHcmAHoJWXajelT/c2w
 yUfiP5kH3nyjC+eXUNT819QyXNxgr7jhJWtOSnWXGcX28aHEx1BuCQSbCW7L0VxR
 DtEfipBeSDt/ixMMokgV0fG/je9D8s4xHLUqFW3tni91jK73G9JxXZoQzj1lQ5t8
 qNDGH4Z5BZvX/ClxWXZf1tIPWy1Bbjx8YncJQCDxg9C1fOJXlP9NuWBk7iL5svPV
 trgIExix3YoyjOk9d5/P6604wbzTffF8CGoPJEC08LZxxjXkob/ipsd6+Wv1aCyF
 3RNIX5bsoN/u0uabyfh5imYxGOkesqqK96sTz+pOExTALNtwKerdmTV790tq73yG
 EeNDAHkRve5xBAgHwRlbU9sH6mQypzhiR7DaLXe/INKp6rUOMOhH3JgYwycrzvnC
 bDgI0kbs1IrEk/rr1yGupu0Fqav30yWlgQ13vVu3rwp2cGabTnmoOPl34siWjEc9
 XL1Q0ftsBtXPxOomKYIDatBCiN8i33/KZ5/IGhDH6qO2or0ydzjK1yEeuDcIDClg
 HCfGCnh5Rs11c4iiMjSBSwmWjwwbciZ4XV5XM9VqcqRDeKg0XHb/jaRYYM3UGu37
 nTUmJlLN/S90S9tYTd9b7iUlpq5PKZV3TvSW1QeZVxelTgRw7MIwyWGz4qH6XUc3
 Im0RjM/ZiA==
 =AKR9
 -----END PGP SIGNATURE-----

Merge tag 'block-6.17-20250925' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:
 "A regression fix for this series where an attempt to silence an EOD
  error got messed up a bit, and then a change of git trees for the
  block and io_uring trees.

  Switching the git trees to kernel.org now, as I've just about had it
  trying to battle AI bots that bring the box to its knees, continually.
  At least I don't have to maintain the kernel.org side"

* tag 'block-6.17-20250925' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  MAINTAINERS: update io_uring and block tree git trees
  block: fix EOD return for device with nr_sectors == 0
2025-09-26 09:46:51 -07:00
Yu Kuai 5d726c4dbe blk-cgroup: fix possible deadlock while configuring policy
Following deadlock can be triggered easily by lockdep:

WARNING: possible circular locking dependency detected
6.17.0-rc3-00124-ga12c2658ced0 #1665 Not tainted
------------------------------------------------------
check/1334 is trying to acquire lock:
ff1100011d9d0678 (&q->sysfs_lock){+.+.}-{4:4}, at: blk_unregister_queue+0x53/0x180

but task is already holding lock:
ff1100011d9d00e0 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: del_gendisk+0xba/0x110

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&q->q_usage_counter(queue)#3){++++}-{0:0}:
       blk_queue_enter+0x40b/0x470
       blkg_conf_prep+0x7b/0x3c0
       tg_set_limit+0x10a/0x3e0
       cgroup_file_write+0xc6/0x420
       kernfs_fop_write_iter+0x189/0x280
       vfs_write+0x256/0x490
       ksys_write+0x83/0x190
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x4608/0x4630
       do_syscall_64+0xdb/0x6b0
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&q->rq_qos_mutex){+.+.}-{4:4}:
       __mutex_lock+0xd8/0xf50
       mutex_lock_nested+0x2b/0x40
       wbt_init+0x17e/0x280
       wbt_enable_default+0xe9/0x140
       blk_register_queue+0x1da/0x2e0
       __add_disk+0x38c/0x5d0
       add_disk_fwnode+0x89/0x250
       device_add_disk+0x18/0x30
       virtblk_probe+0x13a3/0x1800
       virtio_dev_probe+0x389/0x610
       really_probe+0x136/0x620
       __driver_probe_device+0xb3/0x230
       driver_probe_device+0x2f/0xe0
       __driver_attach+0x158/0x250
       bus_for_each_dev+0xa9/0x130
       driver_attach+0x26/0x40
       bus_add_driver+0x178/0x3d0
       driver_register+0x7d/0x1c0
       __register_virtio_driver+0x2c/0x60
       virtio_blk_init+0x6f/0xe0
       do_one_initcall+0x94/0x540
       kernel_init_freeable+0x56a/0x7b0
       kernel_init+0x2b/0x270
       ret_from_fork+0x268/0x4c0
       ret_from_fork_asm+0x1a/0x30

-> #0 (&q->sysfs_lock){+.+.}-{4:4}:
       __lock_acquire+0x1835/0x2940
       lock_acquire+0xf9/0x450
       __mutex_lock+0xd8/0xf50
       mutex_lock_nested+0x2b/0x40
       blk_unregister_queue+0x53/0x180
       __del_gendisk+0x226/0x690
       del_gendisk+0xba/0x110
       sd_remove+0x49/0xb0 [sd_mod]
       device_remove+0x87/0xb0
       device_release_driver_internal+0x11e/0x230
       device_release_driver+0x1a/0x30
       bus_remove_device+0x14d/0x220
       device_del+0x1e1/0x5a0
       __scsi_remove_device+0x1ff/0x2f0
       scsi_remove_device+0x37/0x60
       sdev_store_delete+0x77/0x100
       dev_attr_store+0x1f/0x40
       sysfs_kf_write+0x65/0x90
       kernfs_fop_write_iter+0x189/0x280
       vfs_write+0x256/0x490
       ksys_write+0x83/0x190
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x4608/0x4630
       do_syscall_64+0xdb/0x6b0
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

other info that might help us debug this:

Chain exists of:
  &q->sysfs_lock --> &q->rq_qos_mutex --> &q->q_usage_counter(queue)#3

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&q->q_usage_counter(queue)#3);
                               lock(&q->rq_qos_mutex);
                               lock(&q->q_usage_counter(queue)#3);
  lock(&q->sysfs_lock);

Root cause is that queue_usage_counter is grabbed with rq_qos_mutex
held in blkg_conf_prep(), while queue should be freezed before
rq_qos_mutex from other context.

The blk_queue_enter() from blkg_conf_prep() is used to protect against
policy deactivation, which is already protected with blkcg_mutex, hence
convert blk_queue_enter() to blkcg_mutex to fix this problem. Meanwhile,
consider that blkcg_mutex is held after queue is freezed from policy
deactivation, also convert blkg_alloc() to use GFP_NOIO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 05:22:14 -06:00
Yu Kuai 670bfe6838 blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
blk_mq_free_tags() can be called after blk_mq_init_tags(), while
tags->page_list is still not initialized, causing null-ptr-deref.

Fix this problem by initializing tags->page_list at blk_mq_init_tags(),
meanwhile, also free tags directly from error path because there is no
srcu barrier.

Fixes: ad0d05dbdd ("blk-mq: Defer freeing of tags page_list to SRCU callback")
Reported-by: syzbot+5c5d41e80248d610221f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68d1b079.a70a0220.1b52b.0000.GAE@google.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 01:35:52 -06:00
Bart Van Assche fea55691ac blk-mq: Fix more tag iteration function documentation
Commit 8ab30a3319 ("blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument")
removed the hctx argument from the callback functions called by
bt_for_each() and blk_mq_queue_tag_busy_iter(). Commit 2dd6532e95
("blk-mq: Drop 'reserved' arg of busy_tag_iter_fn") removed the
'reserved' argument of the busy_tag_iter_fn function pointer type. Bring
the documentation of the tag iteration functions in sync with these
changes.

Cc: John Garry <john.g.garry@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 00:20:31 -06:00
Caleb Sander Mateos ef9f603fd3 io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
Commit 79525b51ac ("io_uring: fix nvme's 32b cqes on mixed cq") split
out a separate io_uring_cmd_done32() helper for ->uring_cmd()
implementations that return 32-byte CQEs. The res2 value passed to
io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores
it when is_cqe32 is passed as false. So drop the parameter from
io_uring_cmd_done() to simplify the callers and clarify that it's not
possible to return an extra value beyond the 32-bit CQE result.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 00:15:02 -06:00
Jens Axboe ab073abf6d block: fix EOD return for device with nr_sectors == 0
A recent commit skipped dumping the usual "attempt to access beyond end
of device" message if the device size is 0 sectors, as that's a common
pattern for devices that have been hot removed. But while it stopped
that message, it also prevented returning -EIO for that condition.
Reinstate the -EIO return, while retaining the quiet operation for
triggering EOD for a device with 0 sectors.

Reported-by: syzbot+4b12286339fe4c2700c1@syzkaller.appspotmail.com
Reported-by: Sahil Chandna <chandna.linuxkernel@gmail.com>
Fixes: d0a2b527d8 ("block: tone down bio_check_eod")
Tested-by: Sahil Chandna <chandna.linuxkernel@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-22 09:35:24 -06:00
Christian Brauner 7914f15c5e
Merge branch 'no-rebase-mnt_ns_tree_remove'
Bring in the fix for removing a mount namespace from the mount namespace
rbtree and list.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:14 +02:00
Christian Brauner fa8ee8627b
block: use extensible_ioctl_valid()
Use the new extensible_ioctl_valid() helper which is equivalent to what
is done here.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:05 +02:00
Yu Kuai 336aec7b06 blk-throttle: fix throtl_data leak during disk release
Tightening the throttle activation check in blk_throtl_activated() to
require both q->td presence and policy bit set introduced a memory leak
during disk release:

blkg_destroy_all() clears the policy bit first during queue deactivation,
causing subsequent blk_throtl_exit() to skip throtl_data cleanup when
blk_throtl_activated() fails policy check.

Idealy we should avoid modifying blk_throtl_exit() activation check because
it's intuitive that blk-throtl start from blk_throtl_init() and end in
blk_throtl_exit(). However, call blk_throtl_exit() before
blkg_destroy_all() will make a long term deadlock problem easier to
trigger[1], hence fix this problem by checking if q->td is NULL from
blk_throtl_exit(), and remove policy deactivation as well since it's
useless.

[1] https://lore.kernel.org/all/CAHj4cs9p9H5yx+ywsb3CMUdbqGPhM+8tuBvhW=9ADiCjAqza9w@mail.gmail.com/#t

Fixes: bd9fd5be6b ("blk-throttle: fix access race during throttle policy activation")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs-p-ZwBEKigBj7T6hQCOo-H68-kVwCrV6ZvRovrr9Z+HA@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-17 07:27:29 -06:00
Bart Van Assche 0b507305a0 blk-mq: Fix the blk_mq_tagset_busy_iter() documentation
Commit 2dd6532e95 ("blk-mq: Drop 'reserved' arg of busy_tag_iter_fn")
removed the 'reserved' argument from tag iteration callback functions.
Bring the blk_mq_tagset_busy_iter() documentation in sync with that
change.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-17 07:26:29 -06:00
John Garry da7b97ba0d block: relax atomic write boundary vs chunk size check
blk_validate_atomic_write_limits() ensures that any boundary fits into
and is aligned to any chunk size.

However, it should also be possible to fit the chunk size into any
boundary. That check is already made in
blk_stack_atomic_writes_boundary_head().

Relax the check in blk_validate_atomic_write_limits() by reusing (and
renaming) blk_stack_atomic_writes_boundary_head().

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
John Garry f2d8c5a2f7 block: fix stacking of atomic writes when atomics are not supported
Atomic writes support may not always be possible when stacking devices
which support atomic writes. Such as case is a different atomic write
boundary between stacked devices (which is not supported).

In the case that atomic writes cannot supported, the top device queue HW
limits are set to 0.

However, in blk_stack_atomic_writes_limits(), we detect that we are
stacking the first bottom device by checking the top device
atomic_write_hw_max value == 0. This get confused with the case of atomic
writes not supported, above.

Make the distinction between stacking the first bottom device and no
atomics supported by initializing stacked device atomic_write_hw_max =
UINT_MAX and checking that for stacking the first bottom device.

Fixes: d7f36dc446 ("block: Support atomic writes limits for stacked devices")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
John Garry bfd4037296 block: update validation of atomic writes boundary for stacked devices
In commit 63d092d1c1 ("block: use chunk_sectors when evaluating stacked
atomic write limits"), it was missed to use a chunk sectors limit check
in blk_stack_atomic_writes_boundary_head(), so update that function to
do the proper check.

Fixes: 63d092d1c1 ("block: use chunk_sectors when evaluating stacked atomic write limits")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16 12:29:10 -06:00
chengkaitao 74b1db8684 block/mq-deadline: Remove the redundant rb_entry_rq in the deadline_from_pos().
In commit(fde02699c2), the "if (blk_rq_is_seq_zoned_write(rq))"
was removed, but the "rb_entry_rq(node)" and some other code were
inadvertently left behind. This patch fixed it.

Signed-off-by: chengkaitao <chengkaitao@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-15 13:00:05 -06:00
Mateusz Guzik f99b391778
fs: rename generic_delete_inode() and generic_drop_inode()
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.

The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.

No functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15 16:09:42 +02:00
Pankaj Raghav ea5e101fb6 block: use largest_zero_folio in __blkdev_issue_zero_pages()
Use largest_zero_folio() in __blkdev_issue_zero_pages().  On systems with
CONFIG_PERSISTENT_HUGE_ZERO_FOLIO enabled, we will end up sending larger
bvecs instead of multiple small ones.

Noticed a 4% increase in performance on a commercial NVMe SSD which does
not support OP_WRITE_ZEROES.  The device's MDTS was 128K.  The performance
gains might be bigger if the device supports bigger MDTS.

Link: https://lkml.kernel.org/r/20250811084113.647267-6-kernel@pankajraghav.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:54 -07:00
Yu Kuai 9784041145 blk-mq: remove blk_mq_tag_update_depth()
This helper is not used now.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai b86433721f blk-mq: fix potential deadlock while nr_requests grown
Allocate and free sched_tags while queue is freezed can deadlock[1],
this is a long term problem, hence allocate memory before freezing
queue and free memory after queue is unfreezed.

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Fixes: e3a2b3f931 ("blk-mq: allow changing of queue depth through sysfs")

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai 6293e336f6 blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
This helper only support to allocate the default number of requests,
add a new parameter to support specific number of requests.

Prepare to fix potential deadlock in the case nr_requests grow.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai e632004044 blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
No functional changes are intended, make code cleaner and prepare to fix
the grow case in following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai 7f2799c546 blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
For shared tags case, all hctx->sched_tags/tags are the same, it doesn't
make sense to call into blk_mq_tag_update_depth() multiple times for the
same tags.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai 626ff4f8eb blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
request_queue->nr_requests can be changed by:

a) switch elevator by updating nr_hw_queues
b) switch elevator by elevator sysfs attribute
c) configue queue sysfs attribute nr_requests

Current lock order is:

1) update_nr_hwq_lock, case a,b
2) freeze_queue
3) elevator_lock, case a,b,c

And update nr_requests is seriablized by elevator_lock() already,
however, in the case c, we'll have to allocate new sched_tags if
nr_requests grow, and do this with elevator_lock held and queue
freezed has the risk of deadlock.

Hence use update_nr_hwq_lock instead, make it possible to allocate
memory if tags grow, meanwhile also prevent nr_requests to be changed
concurrently.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai b46d4c447d blk-mq: check invalid nr_requests in queue_requests_store()
queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the
only caller of blk_mq_tag_update_depth(), however, they all have
checkings for nr_requests input by user.

Make code cleaner by moving all the checkings to the top function:

1) nr_requests > reserved tags;
2) if there is elevator, 4 <= nr_requests <= 2048;
3) if elevator is none, 4 <= nr_requests <= tag_set->queue_depth;

Meanwhile, case 2 is the only case tags can grow and -ENOMEM might be
returned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai 8bd7195fea blk-mq: remove useless checkings in blk_mq_update_nr_requests()
1) queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), where queue is already freezed, no need to
check mq_freeze_depth;
2) q->tag_set must be set for request based device, and queue_is_mq() is
already checked in blk_mq_queue_attr_visible(), no need to check
q->tag_set.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai dc1dd13d44 blk-mq: remove useless checking in queue_requests_store()
blk_mq_queue_attr_visible() already checked queue_is_mq(), no need to
check this again in queue_requests_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:25:56 -06:00
Yu Kuai b2f5974079 block: fix ordering of recursive split IO
Currently, split bio will be chained to original bio, and original bio
will be resubmitted to the tail of current->bio_list, waiting for
split bio to be issued. However, if split bio get split again, the IO
order will be messed up. This problem, on the one hand, will cause
performance degradation, especially for mdraid with large IO size; on
the other hand, will cause write errors for zoned block devices[1].

For example, in raid456 IO will first be split by max_sector from
md_submit_bio(), and then later be split again by chunksize for internal
handling:

For example, assume max_sectors is 1M, and chunksize is 512k

1) issue a 2M IO:

bio issuing: 0+2M
current->bio_list: NULL

2) md_submit_bio() split by max_sector:

bio issuing: 0+1M
current->bio_list: 1M+1M

3) chunk_aligned_read() split by chunksize:

bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k

4) after first bio issued, __submit_bio_noacct() will contuine issuing
next bio:

bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k

5) chunk_aligned_read() split by chunksize:

bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k

6) no split afterwards, finally the issue order is:

0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour will cause large IO read on raid456 endup to be small
discontinuous IO in underlying disks. Fix this problem by placing split
bio to the head of current->bio_list.

Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:
Before this patch
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
2) blktrace G stage:
  8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
  8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
  8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
  8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
  8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
  8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
  8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
3) io seek for sdx:
Noted io seek is the result from blktrace D stage, statistic of:
ABS((offset of next IO) - (offset + len of previous IO))

Read|Write seek
cnt 55175, zero cnt 25079
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 25079       45.5% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 12540       22.7% |#####################                   |
        64 .. 128       : 2508         4.5% |#####                                   |
       128 .. 256       : 0            0.0% |                                        |
       256 .. 512       : 10032       18.2% |#################                       |
       512 .. 1024      : 5016         9.1% |#########                               |

After this patch:
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
2) blktrace G stage:
  8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
  8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
  8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
  8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
  8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
  8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
  8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
  8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 21483       75.0% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 7161        25.0% |##############                          |

BTW, this looks like a long term problem from day one, and large
sequential IO read is pretty common case like video playing.

And even with this patch, in this test case IO is merged to at most 128k
is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such
limit can get even better performance. However, we'll figure out how to do
this properly later.

[1] https://lore.kernel.org/all/e40b076d-583d-406b-b223-005910a9f46f@acm.org/

Fixes: d89d87965d ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren <tieren@fnnas.com>
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai 0b64682e78 block: skip unnecessary checks for split bio
Lots of checks are already done while submitting this bio the first
time, and there is no need to check them again when this bio is
resubmitted after split.

Hence open code should_fail_bio() and blk_throtl_bio() that are still
necessary from submit_bio_split_bioset().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai e3290419d9 blk-crypto: convert to use bio_submit_split_bioset()
Unify bio split code, prepare to fix ordering of split IO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:46 -06:00
Yu Kuai e37b5596a1 block: factor out a helper bio_submit_split_bioset()
No functional changes are intended, some drivers like mdraid will split
bio by internal processing, prepare to unify bio split codes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai 06d712d297 blk-crypto: fix missing blktrace bio split events
trace_block_split() is missing, resulting in blktrace inability to catch
BIO split events and making it harder to analyze the BIO sequence.

Cc: stable@vger.kernel.org
Fixes: 488f6682c8 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai ea3d1f104d blk-mq: add QUEUE_FLAG_BIO_ISSUE_TIME
bio->issue_time_ns is initialized for every bio, however, it's only used
by blk-iolatency. Add a new queue_flag and only set this flag when
blk-iolatency is enabled, so that extra blk_time_get_ns() can be saved
for disks that blk-iolatency is not enabled.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai 1f963bdd64 block: initialize bio issue time in blk_mq_submit_bio()
bio->issue_time_ns is only used by blk-iolatency, which can only be
enabled for rq-based disk, hence it's not necessary to initialize
the time for bio-based disk.

Meanwhile, if bio is split by blk_crypto_fallback_split_bio_if_needed(),
the issue time is not initialized for new split bio, this can be fixed
as well.

Noted the next patch will optimize better that bio issue time will
only be used when blk-iolatency is really enabled by the disk.

Fixes: 488f6682c8 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Yu Kuai 1733e88874 block: cleanup bio_issue
Now that bio->bi_issue is only used by blk-iolatency to get bio issue
time, replace bio_issue with u64 time directly and remove bio_issue to
make code cleaner.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10 05:23:45 -06:00
Keith Busch d0d1d52231 blk-map: provide the bdev to bio if one exists
We can now safely provide a block device when extracting user pages for
driver and user passthrough commands. Set the bdev so the caller doesn't
have to do that later. This has an additional  benefit of being able to
extract P2P pages in the passthrough path.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:35:28 -06:00
Keith Busch d57447ffb5 blk-mq-dma: bring back p2p request flags
We only need to consider data and metadata dma mapping types separately.
The request and bio integrity payload have enough flag bits to
internally track the mapping type for each. Use these so the caller
doesn't need to track them, and provide separete request and integrity
helpers to the common code. This will make it easier to scale new
mappings, like the proposed MMIO attribute, without burdening the caller
to track such things.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:33:35 -06:00
Keith Busch 05ceea5d3e blk-integrity: enable p2p source and destination
Set the extraction flags to allow p2p pages for the metadata buffer if
the block device allows it. Similar to data payloads, ensure the bio
does not use merging if we see a p2p page.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:33:27 -06:00
Keith Busch 69d7ed5b9e blk-integrity: use simpler alignment check
We're checking length and addresses against the same alignment value, so
use the more simple iterator check.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Keith Busch 5ff3f74e14 block: simplify direct io validity check
The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Keith Busch 20a0e6276e block: align the bio after building it
Instead of ensuring each vector is block size aligned while constructing
the bio, just ensure the entire size is aligned after it's built. This
makes getting bio pages more flexible to accepting device valid io
vectors that would otherwise get rejected by alignment checks.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Keith Busch 743bf2e0c4 block: add size alignment to bio_iov_iter_get_pages
The block layer tries to align bio vectors to the block device's logical
block size. Some cases don't have a block device, or we may need to
align to something larger, which we can't derive it from the queue
limits. Have the caller specify what they want, or allow any length
alignment if nothing was specified. Since the most common use case
relies on the block device's limits, a helper function is provided.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Keith Busch fec2e70572 block: check for valid bio while splitting
We're already iterating every segment, so check these for a valid IO
lengths at the same time. Individual segment lengths will not be checked
on passthrough commands. The read/write command segments must be sized
to the dma alignment.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 10:27:01 -06:00
Christoph Hellwig d86eaa0f3c block: remove the bi_inline_vecs variable sized array from struct bio
Bios are embedded into other structures, and at least spare is unhappy
about embedding structures with variable sized arrays.  There's no
real need to the array anyway, we can replace it with a helper pointing
to the memory just behind the bio, and with the previous cleanups there
is very few site doing anything special with it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 07:31:59 -06:00
Christoph Hellwig 70a6f71b1a block: add a bio_init_inline helper
Just a simpler wrapper around bio_init for callers that want to
initialize a bio with inline bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09 07:31:59 -06:00
Linus Torvalds f777d1112e vfs-6.17-rc6.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaL6SyQAKCRCRxhvAZXjc
 ouTGAQDGiTnaENiOzRhzNl1XONTRv8a1uV0pxg4W3fNdiRlxgQEA/O90/+nM48KC
 pdV3WHz5eGfcnMTpqgHxK6HYgwklJAY=
 =oKnm
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "fuse:

   - Prevent opening of non-regular backing files.

     Fuse doesn't support non-regular files anyway.

   - Check whether copy_file_range() returns a larger size than
     requested.

   - Prevent overflow in copy_file_range() as fuse currently only
     supports 32-bit sized copies.

   - Cache the blocksize value if the server returned a new value as
     inode->i_blkbits isn't modified directly anymore.

   - Fix i_blkbits handling for iomap partial writes.

     By default i_blkbits is set to PAGE_SIZE which causes iomap to mark
     the whole folio as uptodate even on a partial write. But fuseblk
     filesystems support choosing a blocksize smaller than PAGE_SIZE
     risking data corruption. Simply enforce PAGE_SIZE as blocksize for
     fuseblk's internal inode for now.

   - Prevent out-of-bounds acces in fuse_dev_write() when the number of
     bytes to be retrieved is truncated to the fc->max_pages limit.

  virtiofs:

   - Fix page faults for DAX page addresses.

  Misc:

   - Tighten file handle decoding from userns.

     Check that the decoded dentry itself has a valid idmapping in the
     user namespace.

   - Fix mount-notify selftests.

   - Fix some indentation errors.

   - Add an FMODE_ flag to indicate IOCB_HAS_METADATA availability.

     This will be moved to an FOP_* flag with a bit more rework needed
     for that to happen not suitable for a fix.

   - Don't silently ignore metadata for sync read/write.

   - Don't pointlessly log warning when reading coredump sysctls"

* tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fuse: virtio_fs: fix page fault for DAX page address
  selftests/fs/mount-notify: Fix compilation failure.
  fhandle: use more consistent rules for decoding file handle from userns
  fuse: Block access to folio overlimit
  fuse: fix fuseblk i_blkbits for iomap partial writes
  fuse: reflect cached blocksize if blocksize was changed
  fuse: prevent overflow in copy_file_range return value
  fuse: check if copy_file_range() returns larger than requested size
  fuse: do not allow mapping a non-regular backing file
  coredump: don't pointlessly check and spew warnings
  fs: fix indentation style
  block: don't silently ignore metadata for sync read/write
  fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability
  Please enter a commit message to explain why this merge is necessary,
  especially if it merges an updated upstream into a topic branch.
2025-09-08 07:53:01 -07:00