Commit Graph

2861 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Linus Torvalds 8804d970fa Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from
   Kairui Song improves performance and reduces the failure rate of swap
   cluster allocation.
 
 - The 4 patch series "support large align and nid in Rust allocators"
   from Vitaly Wool permits Rust allocators to set NUMA node and large
   alignment when perforning slub and vmalloc reallocs.
 
 - The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from
   Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets
   for virtual address spaces for ops-level DAMOS filters.
 
 - The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock"
   from Suren Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps.
 
 - The 2 patch series "mm/mincore: minor clean up for swap cache
   checking" from Kairui Song performs some cleanup in the swap code.
 
 - The 11 patch series "mm: vm_normal_page*() improvements" from David
   Hildenbrand provides code cleanup in the pagemap code.
 
 - The 5 patch series "add persistent huge zero folio support" from
   Pankaj Raghav provides a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero.
 
 - The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a
   few touchups to the recently added Kexec Handover feature.
 
 - The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all
   arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap.  To
   end the constant struggle with space shortage on 32-bit conflicting with
   64-bit's needs.
 
 - The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li
   cleans up some swap code.
 
 - The 7 patch series "selftests/mm: Fix false positives and skip
   unsupported tests" from Donet Tom fixes a few things in our selftests
   code.
 
 - The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide
   THPs when advised" from David Hildenbrand "allows individual processes
   to opt-out of THP=always into THP=madvise, without affecting other
   workloads on the system".
 
   It's a long story - the [1/N] changelog spells out the considerations.
 
 - The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox
   gets us started on the memdesc project.  Please see
   https://kernelnewbies.org/MatthewWilcox/Memdescs and
   https://blogs.oracle.com/linux/post/introducing-memdesc.
 
 - The 3 patch series "Tiny optimization for large read operations" from
   Chi Zhiling improves the efficiency of the pagecache read path.
 
 - The 5 patch series "Better split_huge_page_test result check" from Zi
   Yan improves our folio splitting selftest code.
 
 - The 2 patch series "test that rmap behaves as expected" from Wei Yang
   adds some rmap selftests.
 
 - The 3 patch series "remove write_cache_pages()" from Christoph Hellwig
   removes that function and converts its two remaining callers.
 
 - The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain
   fixes some UFFD selftests issues.
 
 - The 3 patch series "introduce kernel file mapped folios" from Boris
   Burkov introduces the concept of "kernel file pages".  Using these
   permits btrfs to account its metadata pages to the root cgroup, rather
   than to the cgroups of random inappropriate tasks.
 
 - The 2 patch series "mm/pageblock: improve readability of some
   pageblock handling" from Wei Yang provides some readability improvements
   to the page allocator code.
 
 - The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae
   Park teaches DAMON to understand arm32 highmem.
 
 - The 4 patch series "tools: testing: Use existing atomic.h for
   vma/maple tests" from Brendan Jackman performs some code cleanups and
   deduplication under tools/testing/.
 
 - The 2 patch series "maple_tree: Fix testing for 32bit compiles" from
   Liam Howlett fixes a couple of 32-bit issues in
   tools/testing/radix-tree.c.
 
 - The 2 patch series "kasan: unify kasan_enabled() and remove
   arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN
   arch-specific initialization code into a common arch-neutral
   implementation.
 
 - The 3 patch series "mm: remove zpool" from Johannes Weiner removes
   zspool - an indirection layer which now only redirects to a single thing
   (zsmalloc).
 
 - The 2 patch series "mm: task_stack: Stack handling cleanups" from
   Pasha Tatashin makes a couple of cleanups in the fork code.
 
 - The 37 patch series "mm: remove nth_page()" from David Hildenbrand
   makes rather a lot of adjustments at various nth_page() callsites,
   eventually permitting the removal of that undesirable helper function.
 
 - The 2 patch series "introduce kasan.write_only option in hw-tags" from
   Yeoreum Yun creates a KASAN read-only mode for ARM, using that
   architecture's memory tagging feature.  It is felt that a read-only mode
   KASAN is suitable for use in production systems rather than debug-only.
 
 - The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation"
   from Kefeng Wang does some tidying in the hugetlb folio allocation code.
 
 - The 12 patch series "mm: establish const-correctness for pointer
   parameters" from Max Kellermann makes quite a number of the MM API
   functions more accurate about the constness of their arguments.  This
   was getting in the way of subsystems (in this case CEPH) when they
   attempt to improving their own const/non-const accuracy.
 
 - The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola
   fixes a number of code sites which were confused over when to use
   free_pages() vs __free_pages().
 
 - The 3 patch series "Add Rust abstraction for Maple Trees" from Alice
   Ryhl makes the mapletree code accessible to Rust.  Required by nouveau
   and by its forthcoming successor: the new Rust Nova driver.
 
 - The 2 patch series "selftests/mm: split_huge_page_test:
   split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and
   some cleanups to the thp selftesting code.
 
 - The 14 patch series "mm, swap: introduce swap table as swap cache
   (phase I)" from Chris Li and Kairui Song is the first step along the
   path to implementing "swap tables" - a new approach to swap allocation
   and state tracking which is expected to yield speed and space
   improvements.  This patchset itself yields a 5-20% performance benefit
   in some situations.
 
 - The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes
   the new memdesc layer to clean up the ptdesc code a little.
 
 - The 3 patch series "Fix va_high_addr_switch.sh test failure" from
   Chunyu Hu fixes some issues in our 5-level pagetable selftesting code.
 
 - The 2 patch series "Minor fixes for memory allocation profiling" from
   Suren Baghdasaryan addresses a couple of minor issues in relatively new
   memory allocation profiling feature.
 
 - The 3 patch series "Small cleanups" from Matthew Wilcox has a few
   cleanups in preparation for more memdesc work.
 
 - The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
   DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in
   furtherance of supporting arm highmem.
 
 - The 2 patch series "selftests/mm: Add -Wunreachable-code and fix
   warnings" from Muhammad Anjum adds that compiler check to selftests code
   and fixes the fallout, by removing dead code.
 
 - The 10 patch series "Improvements to Victim Process Thawing and OOM
   Reaper Traversal Order" from zhongjinji makes a number of improvements
   in the OOM killer: mainly thawing a more appropriate group of victim
   threads so they can release resources.
 
 - The 5 patch series "mm/damon: misc fixups and improvements for 6.18"
   from SeongJae Park is a bunch of small and unrelated fixups for DAMON.
 
 - The 7 patch series "mm/damon: define and use DAMON initialization
   check function" from SeongJae Park implement reliability and
   maintainability improvements to a recently-added bug fix.
 
 - The 2 patch series "mm/damon/stat: expose auto-tuned intervals and
   non-idle ages" from SeongJae Park provides additional transparency to
   userspace clients of the DAMON_STAT information.
 
 - The 2 patch series "Expand scope of khugepaged anonymous collapse"
   from Dev Jain removes some constraints on khubepaged's collapsing of
   anon VMAs.  It also increases the success rate of MADV_COLLAPSE against
   an anon vma.
 
 - The 2 patch series "mm: do not assume file == vma->vm_file in
   compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards
   removal of file_operations.mmap().  This patchset concentrates upon
   clearing up the treatment of stacked filesystems.
 
 - The 6 patch series "mm: Improve mlock tracking for large folios" from
   Kiryl Shutsemau provides some fixes and improvements to mlock's tracking
   of large folios.  /proc/meminfo's "Mlocked" field became more accurate.
 
 - The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters
   during fork" from Donet Tom fixes several user-visible KSM stats
   inaccuracies across forks and adds selftest code to verify these
   counters.
 
 - The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei
   Yang addresses some potential but presently benign issues in KSM's
   mm_slot handling.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA
 jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1
 2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA=
 =4mhY
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "mm, swap: improve cluster scan strategy" from Kairui Song improves
   performance and reduces the failure rate of swap cluster allocation

 - "support large align and nid in Rust allocators" from Vitaly Wool
   permits Rust allocators to set NUMA node and large alignment when
   perforning slub and vmalloc reallocs

 - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
   DAMOS_STAT's handling of the DAMON operations sets for virtual
   address spaces for ops-level DAMOS filters

 - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
   Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps

 - "mm/mincore: minor clean up for swap cache checking" from Kairui Song
   performs some cleanup in the swap code

 - "mm: vm_normal_page*() improvements" from David Hildenbrand provides
   code cleanup in the pagemap code

 - "add persistent huge zero folio support" from Pankaj Raghav provides
   a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero

 - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
   the recently added Kexec Handover feature

 - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
   Stoakes turns mm_struct.flags into a bitmap. To end the constant
   struggle with space shortage on 32-bit conflicting with 64-bit's
   needs

 - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
   code

 - "selftests/mm: Fix false positives and skip unsupported tests" from
   Donet Tom fixes a few things in our selftests code

 - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
   from David Hildenbrand "allows individual processes to opt-out of
   THP=always into THP=madvise, without affecting other workloads on the
   system".

   It's a long story - the [1/N] changelog spells out the considerations

 - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
   the memdesc project. Please see

      https://kernelnewbies.org/MatthewWilcox/Memdescs and
      https://blogs.oracle.com/linux/post/introducing-memdesc

 - "Tiny optimization for large read operations" from Chi Zhiling
   improves the efficiency of the pagecache read path

 - "Better split_huge_page_test result check" from Zi Yan improves our
   folio splitting selftest code

 - "test that rmap behaves as expected" from Wei Yang adds some rmap
   selftests

 - "remove write_cache_pages()" from Christoph Hellwig removes that
   function and converts its two remaining callers

 - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
   selftests issues

 - "introduce kernel file mapped folios" from Boris Burkov introduces
   the concept of "kernel file pages". Using these permits btrfs to
   account its metadata pages to the root cgroup, rather than to the
   cgroups of random inappropriate tasks

 - "mm/pageblock: improve readability of some pageblock handling" from
   Wei Yang provides some readability improvements to the page allocator
   code

 - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
   to understand arm32 highmem

 - "tools: testing: Use existing atomic.h for vma/maple tests" from
   Brendan Jackman performs some code cleanups and deduplication under
   tools/testing/

 - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
   a couple of 32-bit issues in tools/testing/radix-tree.c

 - "kasan: unify kasan_enabled() and remove arch-specific
   implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
   initialization code into a common arch-neutral implementation

 - "mm: remove zpool" from Johannes Weiner removes zspool - an
   indirection layer which now only redirects to a single thing
   (zsmalloc)

 - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
   couple of cleanups in the fork code

 - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
   adjustments at various nth_page() callsites, eventually permitting
   the removal of that undesirable helper function

 - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
   creates a KASAN read-only mode for ARM, using that architecture's
   memory tagging feature. It is felt that a read-only mode KASAN is
   suitable for use in production systems rather than debug-only

 - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
   some tidying in the hugetlb folio allocation code

 - "mm: establish const-correctness for pointer parameters" from Max
   Kellermann makes quite a number of the MM API functions more accurate
   about the constness of their arguments. This was getting in the way
   of subsystems (in this case CEPH) when they attempt to improving
   their own const/non-const accuracy

 - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
   code sites which were confused over when to use free_pages() vs
   __free_pages()

 - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
   mapletree code accessible to Rust. Required by nouveau and by its
   forthcoming successor: the new Rust Nova driver

 - "selftests/mm: split_huge_page_test: split_pte_mapped_thp
   improvements" from David Hildenbrand adds a fix and some cleanups to
   the thp selftesting code

 - "mm, swap: introduce swap table as swap cache (phase I)" from Chris
   Li and Kairui Song is the first step along the path to implementing
   "swap tables" - a new approach to swap allocation and state tracking
   which is expected to yield speed and space improvements. This
   patchset itself yields a 5-20% performance benefit in some situations

 - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
   layer to clean up the ptdesc code a little

 - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
   issues in our 5-level pagetable selftesting code

 - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
   addresses a couple of minor issues in relatively new memory
   allocation profiling feature

 - "Small cleanups" from Matthew Wilcox has a few cleanups in
   preparation for more memdesc work

 - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
   Quanmin Yan makes some changes to DAMON in furtherance of supporting
   arm highmem

 - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
   Anjum adds that compiler check to selftests code and fixes the
   fallout, by removing dead code

 - "Improvements to Victim Process Thawing and OOM Reaper Traversal
   Order" from zhongjinji makes a number of improvements in the OOM
   killer: mainly thawing a more appropriate group of victim threads so
   they can release resources

 - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
   is a bunch of small and unrelated fixups for DAMON

 - "mm/damon: define and use DAMON initialization check function" from
   SeongJae Park implement reliability and maintainability improvements
   to a recently-added bug fix

 - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
   SeongJae Park provides additional transparency to userspace clients
   of the DAMON_STAT information

 - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
   some constraints on khubepaged's collapsing of anon VMAs. It also
   increases the success rate of MADV_COLLAPSE against an anon vma

 - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
   from Lorenzo Stoakes moves us further towards removal of
   file_operations.mmap(). This patchset concentrates upon clearing up
   the treatment of stacked filesystems

 - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
   provides some fixes and improvements to mlock's tracking of large
   folios. /proc/meminfo's "Mlocked" field became more accurate

 - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
   Donet Tom fixes several user-visible KSM stats inaccuracies across
   forks and adds selftest code to verify these counters

 - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
   some potential but presently benign issues in KSM's mm_slot handling

* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
  mm: swap: check for stable address space before operating on the VMA
  mm: convert folio_page() back to a macro
  mm/khugepaged: use start_addr/addr for improved readability
  hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
  alloc_tag: fix boot failure due to NULL pointer dereference
  mm: silence data-race in update_hiwater_rss
  mm/memory-failure: don't select MEMORY_ISOLATION
  mm/khugepaged: remove definition of struct khugepaged_mm_slot
  mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
  hugetlb: increase number of reserving hugepages via cmdline
  selftests/mm: add fork inheritance test for ksm_merging_pages counter
  mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
  drivers/base/node: fix double free in register_one_node()
  mm: remove PMD alignment constraint in execmem_vmalloc()
  mm/memory_hotplug: fix typo 'esecially' -> 'especially'
  mm/rmap: improve mlock tracking for large folios
  mm/filemap: map entire large folio faultaround
  mm/fault: try to map the entire file folio in finish_fault()
  mm/rmap: mlock large folios in try_to_unmap_one()
  mm/rmap: fix a mlock race condition in folio_referenced_one()
  ...
2025-10-02 18:18:33 -07:00
Linus Torvalds e1b1d03cee for-6.18/block-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoY0D/9J+11BC88pBxCrLKv/V2TwCNokRMi0dU3L
 r3EUdA46k0oXmvb6ueZqIcfY2e+IX7rdQkaRbh1zRdsNejqHo4548C3ePWGdBAcM
 OdNEGfpehO0aD0td1+mK/NxoJMLhbs5QraPanz+SOkGZOKeF+vGCga5PUDivsr5J
 16T9yb7i+isENLdAc2RJbZVyAphqHQlo5GHi5ZIKOVi5cNt8GU/q2sQl7NYmGvHd
 aq37svvZHFOhLRajP959Fw9WOxEYITewzQ4UYf1FZjUodJUxO+vCnP0ooBQRlyu8
 1B4PYWwSE+Vn3GkQE0Om+mzo9AVPOiLmoAWGxdgJBMyEkZndocr46XEslXOufQ1Z
 T3Gu19G6jCxcyByNVhjVnaajYKmvSQAy1w75m4XlfqTRm4f9Om+LAJavUk3RuaOL
 7lXKQ7Ql1/Tby9Jmf8afjYYXXotNDNku6rz2P3qtOwAA26mNJfgVt0rO+8XGRDe9
 ioLbCkTjslYMc/Oh4jSsbrspsVALbaQMq/Dmah8k0EWb4QAHVgCJyGBoff3hOboI
 jD6B1enaKOQVgcjWcjm/FjOk3jv2h3v4X26YWQZTvEc/1PnSnST78Zi/ePhzDdmt
 sBALUAS37TfTgNMzrhbHl5Zs13k0C0XyANuayuKuo5hlNnC1wbdap+5FZJOmpuOB
 YT+VkYnaOA==
 =kOmc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)

 - MD pull request via Yu:
     - Add support for a lockless bitmap.

       A key feature for the new bitmap are that the IO fastpath is
       lockless. If a user issues lots of write IO to the same bitmap
       bit in a short time, only the first write has additional overhead
       to update bitmap bit, no additional overhead for the following
       writes.

       By supporting only resync or recover written data, means in the
       case creating new array or replacing with a new disk, there is no
       need to do a full disk resync/recovery.

 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather
   than struct block_device.

 - Rust block changes via Andreas. This series adds configuration via
   configfs and remote completion to the rnull driver. The series also
   includes a set of changes to the rust block device driver API: a few
   cleanup patches, and a few features supporting the rnull changes.

   The series removes the raw buffer formatting logic from
   `kernel::block` and improves the logic available in `kernel::string`
   to support the same use as the removed logic.

 - floppy arch cleanups

 - Reduce the number of dereferencing needed for ublk commands

 - Restrict supported sockets for nbd. Mostly done to eliminate a class
   of issues perpetually reported by syzbot, by using nonsensical socket
   setups.

 - A few s390 dasd block fixes

 - Fix a few issues around atomic writes

 - Improve DMA interation for integrity requests

 - Improve how iovecs are treated with regards to O_DIRECT aligment
   constraints.

   We used to require each segment to adhere to the constraints, now
   only the request as a whole needs to.

 - Clean up and improve p2p support, enabling use of p2p for metadata
   payloads

 - Improve locking of request lookup, using SRCU where appropriate

 - Use page references properly for brd, avoiding very long RCU sections

 - Fix ordering of recursively submitted IOs

 - Clean up and improve updating nr_requests for a live device

 - Various fixes and cleanups

* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-10-02 10:16:56 -07:00
Linus Torvalds 2cb8eeaf00 - Add support on AMD for assigning QOS bandwidth counters to resources
(RMIDs) with the ability for those resources to be tracked by the
   counters as long as they're assigned to them. Previously, due to hw
   limitations, bandwidth counts from untracked resources would get lost
   when those resources are not tracked. Refactor the code and user
   interfaces to be able to also support other, similar features on ARM,
   for example
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmjWZ+cACgkQEsHwGGHe
 VUoRrRAAjy/bn0gQCJPPySUfUa7rTG/qiriI6Ax9tc09X8ahspeUQ0IHCK1RWNwm
 tpzbsrTQImA8AeBkNbnD6ALKGSZdfL1l0hMZkKr1Htg35JoHSDMIalxab+sYCfg8
 cysw0tmEKQce1wC8PtP4iX6ENbci4thS6td/t8rWcWTg1ndfOF9eTKHtZvZ/W0Qr
 zJ8MnBIqNhPMV0VEUxacmLuqPGnyWuXyIXzX9nHTc3fjSzTc94jqIB20wFBRFPAx
 EVFivq/mpu3uXXvvumuZZzXuqgBTIMMx2KjABw+5LNU9BIvkzoFwYZvRlUoU9es8
 ASSBddt2jvYyHLal8MUXYieIHJAESKOe2apxtP74ZJIR0t7pUswobPyYO0GEseuQ
 +H2QbLlmJGNO3kJZUuSUGZ9yfjW4JsqM5UlowlVoHGvAOXIW+wR6r1tbdErfima9
 rkqwUCzCXJg3UJ+a6xrK7hhc7QsbIHoo8mHBD+q5I1Hoipwoqe7j6YV80Oely/IC
 Gy3mFQ3e/KwRplfO7qHT88sfAAh+J+TmsVQTHlM62wrnjQOOPcMm7r9m0A2eTOme
 7ZlvcU2q9nAJBaA8ylXbBt4l2AcFKaJg8btskWiZLAXGIjAb6B+5uepmwPRavY5Q
 NdiVa3pk7vGDl9oJ2DJIFKbq3adr50TCCarUM0304s3BsQvKnX4=
 =2FMs
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 resource control updates from Borislav Petkov:
 "Add support on AMD for assigning QoS bandwidth counters to resources
  (RMIDs) with the ability for those resources to be tracked by the
  counters as long as they're assigned to them.

  Previously, due to hw limitations, bandwidth counts from untracked
  resources would get lost when those resources are not tracked.

  Refactor the code and user interfaces to be able to also support
  other, similar features on ARM, for example"

* tag 'x86_cache_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  fs/resctrl: Fix counter auto-assignment on mkdir with mbm_event enabled
  MAINTAINERS: resctrl: Add myself as reviewer
  x86/resctrl: Configure mbm_event mode if supported
  fs/resctrl: Introduce the interface to switch between monitor modes
  fs/resctrl: Disable BMEC event configuration when mbm_event mode is enabled
  fs/resctrl: Introduce the interface to modify assignments in a group
  fs/resctrl: Introduce mbm_L3_assignments to list assignments in a group
  fs/resctrl: Auto assign counters on mkdir and clean up on group removal
  fs/resctrl: Introduce mbm_assign_on_mkdir to enable assignments on mkdir
  fs/resctrl: Provide interface to update the event configurations
  fs/resctrl: Add event configuration directory under info/L3_MON/
  fs/resctrl: Support counter read/reset with mbm_event assignment mode
  x86/resctrl: Implement resctrl_arch_reset_cntr() and resctrl_arch_cntr_read()
  x86/resctrl: Refactor resctrl_arch_rmid_read()
  fs/resctrl: Introduce counter ID read, reset calls in mbm_event mode
  fs/resctrl: Pass struct rdtgroup instead of individual members
  fs/resctrl: Add the functionality to unassign MBM events
  fs/resctrl: Add the functionality to assign MBM events
  x86,fs/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  fs/resctrl: Introduce event configuration field in struct mon_evt
  ...
2025-09-30 13:29:42 -07:00
Linus Torvalds f2c61db29f Remove bcachefs core code
bcachefs was marked 'externally maintained' in 6.17 but the code
remained to make the transition smoother.

It's now a DKMS module, making the in-kernel code stale, so remove
it to avoid any version confusion.

Link: https://lore.kernel.org/linux-bcachefs/yokpt2d2g2lluyomtqrdvmkl3amv3kgnipmenobkpgx537kay7@xgcgjviv3n7x/T/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-09-29 13:43:52 -07:00
Linus Torvalds 449c2b302c vfs-6.18-rc1.async
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQkAAKCRCRxhvAZXjc
 oijOAQCw+gE9GkvYJ7TOCiqVFj8RZZ1tb7NiCdBRCa4wXM7KMwD9Gwim//G8Hjyy
 dNPZejphGdxv99Skwp/0nfyDGL6x7w4=
 =zIJf
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs async directory updates from Christian Brauner:
 "This contains further preparatory changes for the asynchronous directory
  locking scheme:

   - Add lookup_one_positive_killable() which allows overlayfs to
     perform lookup that won't block on a fatal signal

   - Unify the mount idmap handling in struct renamedata as a rename can
     only happen within a single mount

   - Introduce kern_path_parent() for audit which sets the path to the
     parent and returns a dentry for the target without holding any
     locks on return

   - Rename kern_path_locked() as it is only used to prepare for the
     removal of an object from the filesystem:

	kern_path_locked()    => start_removing_path()
	kern_path_create()    => start_creating_path()
	user_path_create()    => start_creating_user_path()
	user_path_locked_at() => start_removing_user_path_at()
	done_path_create()    => end_creating_path()
	NA                    => end_removing_path()"

* tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  debugfs: rename start_creating() to debugfs_start_creating()
  VFS: rename kern_path_locked() and related functions.
  VFS/audit: introduce kern_path_parent() for audit
  VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata
  VFS: discard err2 in filename_create()
  VFS/ovl: add lookup_one_positive_killable()
2025-09-29 11:55:15 -07:00
Linus Torvalds 3a2a5b278f vfs-6.18-rc1.mount
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQOwAKCRCRxhvAZXjc
 oth/AQDvlOo+23/f2djgDGS8akjkBYVLW14OYzC0q5cbEnnGgAEAycHL50pp3n1o
 3jMYlCByuv507vpCsDupo7QcJapmQAk=
 =/NwE
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:
 "This contains some work around mount api handling:

   - Output the warning message for mnt_too_revealing() triggered during
     fsmount() to the fscontext log. This makes it possible for the
     mount tool to output appropriate warnings on the command line.

     For example, with the newest fsopen()-based mount(8) from
     util-linux, the error messages now look like:

       # mount -t proc proc /tmp
       mount: /tmp: fsmount() failed: VFS: Mount too revealing.
              dmesg(1) may have more information after failed mount system call.

   - Do not consume fscontext log entries when returning -EMSGSIZE

     Userspace generally expects APIs that return -EMSGSIZE to allow for
     them to adjust their buffer size and retry the operation.

     However, the fscontext log would previously clear the message even
     in the -EMSGSIZE case.

     Given that it is very cheap for us to check whether the buffer is
     too small before we remove the message from the ring buffer, let's
     just do that instead.

   - Drop an unused argument from do_remount()"

* tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: fs/namespace.c: remove ms_flags argument from do_remount
  selftests/filesystems: add basic fscontext log tests
  fscontext: do not consume log entries when returning -EMSGSIZE
  vfs: output mount_too_revealing() errors to fscontext
  docs/vfs: Remove mentions to the old mount API helpers
  fscontext: add custom-prefix log helpers
  fs: Remove mount_bdev
  fs: Remove mount_nodev
2025-09-29 09:32:34 -07:00
NeilBrown 3d18f80ce1
VFS: rename kern_path_locked() and related functions.
kern_path_locked() is now only used to prepare for removing an object
from the filesystem (and that is the only credible reason for wanting a
positive locked dentry).  Thus it corresponds to kern_path_create() and
so should have a corresponding name.

Unfortunately the name "kern_path_create" is somewhat misleading as it
doesn't actually create anything.  The recently added
simple_start_creating() provides a better pattern I believe.  The
"start" can be matched with "end" to bracket the creating or removing.

So this patch changes names:

 kern_path_locked -> start_removing_path
 kern_path_create -> start_creating_path
 user_path_create -> start_creating_user_path
 user_path_locked_at -> start_removing_user_path_at
 done_path_create -> end_creating_path

and also introduces end_removing_path() which is identical to
end_creating_path().

__start_removing_path (which was __kern_path_locked) is enhanced to
call mnt_want_write() for consistency with the start_creating_path().

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-23 12:37:36 +02:00
Suren Baghdasaryan b9e2f58ffb alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output
While rare, memory allocation profiling can contain inaccurate counters if
slab object extension vector allocation fails.  That allocation might
succeed later but prior to that, slab allocations that would have used
that object extension vector will not be accounted for.  To indicate
incorrect counters, "accurate:no" marker is appended to the call site line
in the /proc/allocinfo output.  Bump up /proc/allocinfo version to reflect
the change in the file format and update documentation.

Example output with invalid counters:
allocinfo - version: 2.0
           0        0 arch/x86/kernel/kdebugfs.c:105 func:create_setup_data_nodes
           0        0 arch/x86/kernel/alternative.c:2090 func:alternatives_smp_module_add
           0        0 arch/x86/kernel/alternative.c:127 func:__its_alloc accurate:no
           0        0 arch/x86/kernel/fpu/regset.c:160 func:xstateregs_set
           0        0 arch/x86/kernel/fpu/xstate.c:1590 func:fpstate_realloc
           0        0 arch/x86/kernel/cpu/aperfmperf.c:379 func:arch_enable_hybrid_capacity_scale
           0        0 arch/x86/kernel/cpu/amd_cache_disable.c:258 func:init_amd_l3_attrs
       49152       48 arch/x86/kernel/cpu/mce/core.c:2709 func:mce_device_create accurate:no
       32768        1 arch/x86/kernel/cpu/mce/genpool.c:132 func:mce_gen_pool_create
           0        0 arch/x86/kernel/cpu/mce/amd.c:1341 func:mce_threshold_create_device

[surenb@google.com: document new "accurate:no" marker]
  Fixes: 39d117e04d15 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output")
[akpm@linux-foundation.org: simplification per Usama, reflow text]
[akpm@linux-foundation.org: add newline to prevent docs warning, per Randy]
Link: https://lkml.kernel.org/r/20250915230224.4115531-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:36 -07:00
Mateusz Guzik f99b391778
fs: rename generic_delete_inode() and generic_drop_inode()
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.

The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.

No functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15 16:09:42 +02:00
Babu Moger 8004ea01cf fs/resctrl: Introduce the interface to switch between monitor modes
Resctrl subsystem can support two monitoring modes, "mbm_event" or "default".
In mbm_event mode, monitoring event can only accumulate data while it is
backed by a hardware counter. In "default" mode, resctrl assumes there is
a hardware counter for each event within every CTRL_MON and MON group.

Introduce mbm_assign_mode resctrl file to switch between mbm_event and default
modes.

Example:
To list the MBM monitor modes supported:
  $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  [mbm_event]
  default

To enable the "mbm_event" counter assignment mode:
  $ echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode

To enable the "default" monitoring mode:
  $ echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode

Reset MBM event counters automatically as part of changing the mode.  Clear
both architectural and non-architectural event states to prevent overflow
conditions during the next event read. Clear assignable counter configuration
on all the domains. Also, enable auto assignment when switching to "mbm_event"
mode.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:49:18 +02:00
Babu Moger 88bee79640 fs/resctrl: Introduce the interface to modify assignments in a group
Enable the mbm_l3_assignments resctrl file to be used to modify counter
assignments of CTRL_MON and MON groups when the "mbm_event" counter
assignment mode is enabled.

Process the assignment modifications in the following format:
<Event>:<Domain id>=<Assignment state>;<Domain id>=<Assignment state>

Event: A valid MBM event in the
       /sys/fs/resctrl/info/L3_MON/event_configs directory.

Domain ID: A valid domain ID. When writing, '*' applies the changes
	   to all domains.

Assignment states:

    _ : Unassign a counter.

    e : Assign a counter exclusively.

Examples:

  $ cd /sys/fs/resctrl
  $ cat /sys/fs/resctrl/mbm_L3_assignments
    mbm_total_bytes:0=e;1=e
    mbm_local_bytes:0=e;1=e

To unassign the counter associated with the mbm_total_bytes event on
domain 0:

  $ echo "mbm_total_bytes:0=_" > mbm_L3_assignments
  $ cat /sys/fs/resctrl/mbm_L3_assignments
    mbm_total_bytes:0=_;1=e
    mbm_local_bytes:0=e;1=e

To unassign the counter associated with the mbm_total_bytes event on
all the domains:

  $ echo "mbm_total_bytes:*=_" > mbm_L3_assignments
  $ cat /sys/fs/resctrl/mbm_L3_assignments
    mbm_total_bytes:0=_;1=_
    mbm_local_bytes:0=e;1=e

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:47:17 +02:00
Babu Moger cba8222880 fs/resctrl: Introduce mbm_L3_assignments to list assignments in a group
Introduce the mbm_L3_assignments resctrl file associated with CTRL_MON and MON
resource groups to display the counter assignment states of the resource group
when "mbm_event" counter assignment mode is enabled.

Display the list in the following format:
<Event>:<Domain id>=<Assignment state>;<Domain id>=<Assignment state>

Event: A valid MBM event listed in
      /sys/fs/resctrl/info/L3_MON/event_configs directory.

Domain ID: A valid domain ID.

The assignment state can be one of the following:

_ : No counter assigned.

e : Counter assigned exclusively.

Example:
To list the assignment states for the default group
  $ cd /sys/fs/resctrl
  $ cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=e;1=e
  mbm_local_bytes:0=e;1=e

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:45:34 +02:00
Babu Moger ac1df9bb0b fs/resctrl: Introduce mbm_assign_on_mkdir to enable assignments on mkdir
The "mbm_event" counter assignment mode allows users to assign a hardware
counter to an RMID, event pair and monitor the bandwidth as long as it is
assigned.

Introduce a user-configurable option that determines if a counter will
automatically be assigned to an RMID, event pair when its associated
monitor group is created via mkdir. Accessible when "mbm_event" counter
assignment mode is enabled.

Suggested-by: Peter Newman <peternewman@google.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:42:02 +02:00
Babu Moger f9ae5913d4 fs/resctrl: Provide interface to update the event configurations
When "mbm_event" counter assignment mode is enabled, users can modify the
event configuration by writing to the 'event_filter' resctrl file.  The event
configurations for mbm_event mode are located in
/sys/fs/resctrl/info/L3_MON/event_configs/.

Update the assignments of all CTRL_MON and MON resource groups when the event
configuration is modified.

Example:
  $ mount -t resctrl resctrl /sys/fs/resctrl

  $ cd /sys/fs/resctrl/

  $ cat info/L3_MON/event_configs/mbm_local_bytes/event_filter
    local_reads,local_non_temporal_writes,local_reads_slow_memory

  $ echo "local_reads,local_non_temporal_writes" >
    info/L3_MON/event_configs/mbm_total_bytes/event_filter

  $ cat info/L3_MON/event_configs/mbm_total_bytes/event_filter
    local_reads,local_non_temporal_writes

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:40:38 +02:00
Babu Moger ea274cbeaf fs/resctrl: Add event configuration directory under info/L3_MON/
The "mbm_event" counter assignment mode allows the user to assign a hardware
counter to an RMID, event pair and monitor the bandwidth as long as it is
assigned. The user can specify the memory transaction(s) for the counter to
track.

When this mode is supported, the /sys/fs/resctrl/info/L3_MON/event_configs
directory contains a sub-directory for each MBM event that can be assigned to
a counter.  The MBM event sub-directory contains a file named "event_filter"
that is used to view and modify which memory transactions the MBM event is
configured with.

Create /sys/fs/resctrl/info/L3_MON/event_configs directory on resctrl mount
and pre-populate it with directories for the two existing MBM events:
mbm_total_bytes and mbm_local_bytes. Create the "event_filter" file within
each MBM event directory with the needed *show() that displays the memory
transactions with which the MBM event is configured.

Example:
  $ mount -t resctrl resctrl /sys/fs/resctrl
  $ cd /sys/fs/resctrl/
  $ cat info/L3_MON/event_configs/mbm_total_bytes/event_filter
    local_reads,remote_reads,local_non_temporal_writes,
    remote_non_temporal_writes,local_reads_slow_memory,
    remote_reads_slow_memory,dirty_victim_writes_all

  $ cat info/L3_MON/event_configs/mbm_local_bytes/event_filter
    local_reads,local_non_temporal_writes,local_reads_slow_memory

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:39:38 +02:00
Babu Moger 159f36cd4d fs/resctrl: Support counter read/reset with mbm_event assignment mode
When "mbm_event" counter assignment mode is enabled, the architecture requires
a counter ID to read the event data.

Introduce an is_mbm_cntr field in struct rmid_read to indicate whether counter
assignment mode is in use.

Update the logic to call resctrl_arch_cntr_read() and resctrl_arch_reset_cntr()
when the assignment mode is active. Report 'Unassigned' in case the user attempts
to read an event without assigning a hardware counter.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:38:58 +02:00
Babu Moger 16ff6b038f fs/resctrl: Introduce interface to display number of free MBM counters
Introduce the "available_mbm_cntrs" resctrl file to display the number of
counters available for assignment in each domain when "mbm_event" mode is
enabled.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:16:11 +02:00
Babu Moger 8c793336ea fs/resctrl: Add resctrl file to display number of assignable counters
The "mbm_event" counter assignment mode allows users to assign a hardware
counter to an RMID, event pair and monitor bandwidth usage as long as it is
assigned.  The hardware continues to track the assigned counter until it is
explicitly unassigned by the user.

Create 'num_mbm_cntrs' resctrl file that displays the number of counters
supported in each domain. 'num_mbm_cntrs' is only visible to user space when
the system supports "mbm_event" mode.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:12:09 +02:00
Babu Moger 3b497c3f4f fs/resctrl: Introduce the interface to display monitoring modes
Introduce the resctrl file "mbm_assign_mode" to list the supported counter
assignment modes.

The "mbm_event" counter assignment mode allows users to assign a hardware
counter to an RMID, event pair and monitor bandwidth usage as long as it is
assigned. The hardware continues to track the assigned counter until it is
explicitly unassigned by the user. Each event within a resctrl group can be
assigned independently in this mode.

On AMD systems "mbm_event" mode is backed by the ABMC (Assignable Bandwidth
Monitoring Counters) hardware feature and is enabled by default.

The "default" mode is the existing mode that works without the explicit
counter assignment, instead relying on dynamic counter assignment by hardware
that may result in hardware not dedicating a counter resulting in monitoring
data reads returning "Unavailable".

Provide an interface to display the monitor modes on the system.

  $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  [mbm_event]
  default

Add IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED) check to support Arm64.

On x86, CONFIG_RESCTRL_ASSIGN_FIXED is not defined. On Arm64, it will be
defined when the "mbm_event" mode is supported.

Add IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED) check early to ensure the user
interface remains compatible with upcoming Arm64 support. IS_ENABLED() safely
evaluates to 0 when the configuration is not defined.

As a result, for MPAM, the display would be either:

  [default]

or

  [mbm_event]

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:10:58 +02:00
Babu Moger bebf57bf05 x86/resctrl: Add ABMC feature in the command line options
Add a kernel command-line parameter to enable or disable the exposure of
the ABMC (Assignable Bandwidth Monitoring Counters) hardware feature to
resctrl.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15 12:05:23 +02:00
David Hildenbrand 9dc21bbd62 prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when
advised", v5.

This will allow individual processes to opt-out of THP = "always" into THP
= "madvise", without affecting other workloads on the system.  This has
been extensively discussed on the mailing list and has been summarized
very well by David in the first patch which also includes the links to
alternatives, please refer to the first patch commit message for the
motivation for this series.

Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this,
along with the MMF changes.

Patch 2 is a cleanup patch for tva_flags that will allow the forced
collapse case to be transmitted to vma_thp_disabled (which is done in
patch 3).

Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE.

Patches 6-7 implement the selftests for PR_SET_THP_DISABLE for completely
disabling THPs (old behaviour) and only enabling it at advise
(PR_THP_DISABLE_EXCEPT_ADVISED).


This patch (of 7):

People want to make use of more THPs, for example, moving from the "never"
system policy to "madvise", or from "madvise" to "always".

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.

The following scenarios are imaginable:

(1) Switch from "none" system policy to "madvise"/"always", but keep THPs
    disabled for selected workloads.

(2) Stay at "none" system policy, but enable THPs for selected
    workloads, making only these workloads use the "madvise" or "always"
    policy.

(3) Switch from "madvise" system policy to "always", but keep the
    "madvise" policy for selected workloads: allocate THPs only when
    advised.

(4) Stay at "madvise" system policy, but enable THPs even when not advised
    for selected workloads -- "always" policy.

Once can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs.  It requires configuring all workloads, but that is a user-space
problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready yet
(i.e., used by Redis) were able to just disable THPs completely.  Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.  That
essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising.  Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE.  Apparently, that
imposes a problem for relevant workloads, because "not THPs" is certainly
worse than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
explicitly advised by the app through MAD_HUGEPAGE"?  *maybe*, but this
would change the documented semantics quite a bit, and the versatility to
use it for debugging purposes, so I am not 100% sure that is what we want
-- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *less* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
    of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

    prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
    PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

    Previously, it would return 1 if THPs were disabled completely. Now
    it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
    was set.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
    the semantics clearly.

    Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
    with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

    Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
    disabled completely

    Only indicating that THPs are disabled when they are really disabled
    completely, not only partially.

    For now, we don't add another interface to obtained whether THPs
    are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
    ever required, we could add a new entry.

The documented semantics in the man page for PR_SET_THP_DISABLE "is
inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained.  This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g., systemd
where we fork() a helper process to then exec()).

For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
VM_NOHUGEPAGE.  As MADV_COLLAPSE is a clear advise that user space thinks
a THP is a good idea, we'll enable that separately next (requiring a bit
of cleanup first).

There is currently not way to prevent that a process will not issue
PR_SET_THP_DISABLE itself to re-enable THP.  There are not really known
users for re-enabling it, and it's against the purpose of the original
interface.  So if ever required, we could investigate just forbidding to
re-enable them, or make this somehow configurable.

Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:55:05 -07:00
Aleksa Sarai fe49652e36
procfs: add "pidns" mount option
Since the introduction of pid namespaces, their interaction with procfs
has been entirely implicit in ways that require a lot of dancing around
by programs that need to construct sandboxes with different PID
namespaces.

Being able to explicitly specify the pid namespace to use when
constructing a procfs super block will allow programs to no longer need
to fork off a process which does then does unshare(2) / setns(2) and
forks again in order to construct a procfs in a pidns.

So, provide a "pidns" mount option which allows such users to just
explicitly state which pid namespace they want that procfs instance to
use. This interface can be used with fsconfig(2) either with a file
descriptor or a path:

  fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
  fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or with classic mount(2) / mount(8):

  // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
  mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

As this new API is effectively shorthand for setns(2) followed by
mount(2), the permission model for this mirrors pidns_install() to avoid
opening up new attack surfaces by loosening the existing permission
model.

In order to avoid having to RCU-protect all users of proc_pid_ns() (to
avoid UAFs), attempting to reconfigure an existing procfs instance's pid
namespace will error out with -EBUSY. Creating new procfs instances is
quite cheap, so this should not be an impediment to most users, and lets
us avoid a lot of churn in fs/proc/* for a feature that it seems
unlikely userspace would use.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-2-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-02 11:37:24 +02:00
Al Viro 4fc8728aa3 block: switch ->getgeo() to struct gendisk
Instances are happier that way and it makes more sense anyway -
the only part of the result that is related to partition we are given
is the start sector, and that has been filled in by the caller.

Everything else is a function of the disk.  Only one instance
(DASD) is ever looking at anything other than bdev->bd_disk and
that one is trivial to adjust.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-08-13 02:59:29 -04:00
Pedro Falcato 92becd1701
docs/vfs: Remove mentions to the old mount API helpers
Now that mount_bdev(), mount_nodev() and mount_single() have all been
removed, remove mentions to them in vfs.rst.

While we're at it, redirect people looking for mount API docs to
mount_api.rst (which documents the newer API).

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Link: https://lore.kernel.org/20250723132156.225410-4-pfalcato@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-11 14:52:40 +02:00
Linus Torvalds 0974f486f3 f2fs-for-6.17-rc1
In this round, we've mainly updated three parts: 1) folio conversion by Matthew,
 2) switch to a new mount API by Hongbo and Eric, and 3) several sysfs entries
 to tune GCs for ZUFS with finer granularity by Daeho. There are also patches
 to address bugs and issues in the existing features such as GCs, file pinning,
 write-while-dio-read, contingous block allocation, and memory access violations.
 
 Enhancement:
  - switch to new mount API and folio conversion
  - add sysfs nodes to controle F2FS GCs for ZUFS
  - improve performance on the nat entry cache
  - drop inode from the donation list when the last file is closed
  - avoid splitting bio when reading multiple pages
 
 Bug fix:
  - fix to trigger foreground gc during f2fs_map_blocks() in lfs mode
  - make sure zoned device GC to use FG_GC in shortage of free section
  - fix to calculate dirty data during has_not_enough_free_secs()
  - fix to update upper_p in __get_secs_required() correctly
  - wait for inflight dio completion, excluding pinned files read using dio
  - don't break allocation when crossing contiguous sections
  - vm_unmap_ram() may be called from an invalid context
  - fix to avoid out-of-boundary access in dnode page
  - fix to avoid panic in f2fs_evict_inode
  - fix to avoid UAF in f2fs_sync_inode_meta()
  - fix to use f2fs_is_valid_blkaddr_raw() in do_write_page()
  - fix UAF of f2fs_inode_info in f2fs_free_dic
  - fix to avoid invalid wait context issue
  - fix bio memleak when committing super block
  - handle nat.blkaddr corruption in f2fs_get_node_info()
 
 In addition, there are also clean-ups and minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmiRJ+oACgkQQBSofoJI
 UNInMA//ekJJCf/0UyMYiPA9ag4KBb/VA0VaVJbw6BA/DoT5ZII6+lCIfllyELbk
 78+ZppTrKq5OyImwiajcNijEwyDbh/asfUu+uNVsC85fjoboiBgDGVHbUEtSQ20Q
 5JVXIL5PhDDVGdVNPh57ijYK/PxhzBPaFNuaGECYrqnWhkQEb//HmN20KRfzcOjZ
 19QnOyEh0HED/izMjLhtZaCBQP53kfB7VjhTxMdY86l6IZ22gJHPRrnqBQHRTfyb
 iHcMJj4WRd7SpvbD/6bSdnUfpxOYPIm3GwQHdG46cHBEH1scnyQxx2OULlSLUbz6
 yeiG36jcuQQWOev8ikBjNzfAozD0VvUAulPpfIbAoHc5jBYkA1sP3N7JOiao1H4Z
 FnPgw/FyIQE+d9NkbyeVW+6f9WfmKlJlIJ4zKoURbZvARYCZKmiPiI9vPWWe18qV
 nchWniQMJ45TYsABUGmGJwTEe/SFaOkgLpLjAlzCy7ZY9/6LKVUlnxR0E1ZDcjSp
 5/E5fXQhds0Nn7F1jQXV3afxkECW+MNOLS/31ggL+ym6Pce3HPJCxBeRU4XaKrvA
 O0wP7n3g5jhVVWce0PBghF0mwTVVBwohTaUhL7lIIJMxKGkr4A8kH1j8tLLBdD3b
 hqcesDCtqqOZhogbwHXEgUDSikak4/1R1gDXnK0KhL1gg0Z6wR4=
 =XIPU
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "Three main updates: folio conversion by Matthew, switch to a new mount
  API by Hongbo and Eric, and several sysfs entries to tune GCs for ZUFS
  with finer granularity by Daeho.

  There are also patches to address bugs and issues in the existing
  features such as GCs, file pinning, write-while-dio-read, contingous
  block allocation, and memory access violations.

  Enhancements:
   - switch to new mount API and folio conversion
   - add sysfs nodes to controle F2FS GCs for ZUFS
   - improve performance on the nat entry cache
   - drop inode from the donation list when the last file is closed
   - avoid splitting bio when reading multiple pages

  Bug fixes:
   - fix to trigger foreground gc during f2fs_map_blocks() in lfs mode
   - make sure zoned device GC to use FG_GC in shortage of free section
   - fix to calculate dirty data during has_not_enough_free_secs()
   - fix to update upper_p in __get_secs_required() correctly
   - wait for inflight dio completion, excluding pinned files read using dio
   - don't break allocation when crossing contiguous sections
   - vm_unmap_ram() may be called from an invalid context
   - fix to avoid out-of-boundary access in dnode page
   - fix to avoid panic in f2fs_evict_inode
   - fix to avoid UAF in f2fs_sync_inode_meta()
   - fix to use f2fs_is_valid_blkaddr_raw() in do_write_page()
   - fix UAF of f2fs_inode_info in f2fs_free_dic
   - fix to avoid invalid wait context issue
   - fix bio memleak when committing super block
   - handle nat.blkaddr corruption in f2fs_get_node_info()

  In addition, there are also clean-ups and minor bug fixes"

* tag 'f2fs-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (109 commits)
  f2fs: drop inode from the donation list when the last file is closed
  f2fs: add gc_boost_gc_greedy sysfs node
  f2fs: add gc_boost_gc_multiple sysfs node
  f2fs: fix to trigger foreground gc during f2fs_map_blocks() in lfs mode
  f2fs: fix to calculate dirty data during has_not_enough_free_secs()
  f2fs: fix to update upper_p in __get_secs_required() correctly
  f2fs: directly add newly allocated pre-dirty nat entry to dirty set list
  f2fs: avoid redundant clean nat entry move in lru list
  f2fs: zone: wait for inflight dio completion, excluding pinned files read using dio
  f2fs: ignore valid ratio when free section count is low
  f2fs: don't break allocation when crossing contiguous sections
  f2fs: remove unnecessary tracepoint enabled check
  f2fs: merge the two conditions to avoid code duplication
  f2fs: vm_unmap_ram() may be called from an invalid context
  f2fs: fix to avoid out-of-boundary access in dnode page
  f2fs: switch to the new mount api
  f2fs: introduce fs_context_operation structure
  f2fs: separate the options parsing and options checking
  f2fs: Add f2fs_fs_context to record the mount options
  f2fs: Allow sbi to be NULL in f2fs_printk
  ...
2025-08-04 16:27:21 -07:00
Linus Torvalds beace86e61 Summary of significant series in this pull request:
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
   VMAs" from Lorenzo Stoakes addresses an issue with KSM's
   PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
   merging with existing adjacent VMAs.
 
 - The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
   practical access monitoring" from SeongJae Park adds a new kernel module
   which simplifies the setup and usage of DAMON in production
   environments.
 
 - The 6 patch series "stop passing a writeback_control to swap/shmem
   writeout" from Christoph Hellwig is a cleanup to the writeback code
   which removes a couple of pointers from struct writeback_control.
 
 - The 7 patch series "drivers/base/node.c: optimization and cleanups"
   from Donet Tom contains largely uncorrelated cleanups to the NUMA node
   setup and management code.
 
 - The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
   Tal Zussman does some maintenance work on the userfaultfd code.
 
 - The 5 patch series "Readahead tweaks for larger folios" from Ryan
   Roberts implements some tuneups for pagecache readahead when it is
   reading into order>0 folios.
 
 - The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
   Brown provides some cleanups and consistency improvements to the
   selftests code.
 
 - The 4 patch series "Optimize mremap() for large folios" from Dev Jain
   does that.  A 37% reduction in execution time was measured in a
   memset+mremap+munmap microbenchmark.
 
 - The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
   zero_user() in favor of the more modern memzero_page().
 
 - The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
   vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
   which David noticed in the huge page code.  These were not known to be
   causing any issues at this time.
 
 - The 3 patch series "mm/damon: use alloc_migrate_target() for
   DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
   consolidation work in DAMON.
 
 - The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
   uses vm_flags_t in places where we were inappropriately using other
   types.
 
 - The 3 patch series "mm/memfd: Reserve hugetlb folios before
   allocation" from Vivek Kasireddy increases the reliability of large page
   allocation in the memfd code.
 
 - The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
   type" from Alistair Popple removes several now-unneeded PFN_* flags.
 
 - The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
   Park implememnts some cleanup and maintainability work in the DAMON
   sysfs layer.
 
 - The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
   lot of cleanup/maintenance work in the madvise() code.
 
 - The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
   provides additional cleanups on top or Lorenzo's effort.
 
 - The 11 patch series "Implement numa node notifier" from Oscar Salvador
   creates a standalone notifier for NUMA node memory state changes.
   Previously these were lumped under the more general memory on/offline
   notifier.
 
 - The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
   cleans up the pageblock isolation code and fixes a potential issue which
   doesn't seem to cause any problems in practice.
 
 - The 5 patch series "selftests/damon: add python and drgn based DAMON
   sysfs functionality tests" from SeongJae Park adds additional drgn- and
   python-based DAMON selftests which are more comprehensive than the
   existing selftest suite.
 
 - The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
   Salvador fixes a rather obscure deadlock in the hugetlb fault code and
   follows that fix with a series of cleanups.
 
 - The 3 patch series "cma: factor out allocation logic from
   __cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
   up the highmem-specific code in the CMA allocator.
 
 - The 28 patch series "mm/migration: rework movable_ops page migration
   (part 1)" from David Hildenbrand provides cleanups and
   future-preparedness to the migration code.
 
 - The 2 patch series "mm/damon: add trace events for auto-tuned
   monitoring intervals and DAMOS quota" from SeongJae Park adds some
   tracepoints to some DAMON auto-tuning code.
 
 - The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
   SeongJae Park does that.
 
 - The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
   does what it claims.
 
 - The 4 patch series "mm: folio_pte_batch() improvements" from David
   Hildenbrand cleans up the large folio PTE batching code.
 
 - The 13 patch series "mm/damon/vaddr: Allow interleaving in
   migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
   alteration of DAMON's inter-node allocation policy.
 
 - The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
   provides a couple of page->folio conversions.
 
 - The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
   Bueso implements a per-node control of proactive reclaim - beyond the
   current memcg-based implementation.
 
 - The 14 patch series "mm/damon: remove damon_callback" from SeongJae
   Park replaces the damon_callback interface with a more general and
   powerful damon_call()+damos_walk() interface.
 
 - The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
   from Lorenzo Stoakes implements a number of mremap cleanups (of course)
   in preparation for adding new mremap() functionality: newly permit the
   remapping of multiple VMAs when the user is specifying MREMAP_FIXED.  It
   still excludes some specialized situations where this cannot be
   performed reliably.
 
 - The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
   switches some sparc hugetlb code over to the generic version and removes
   the thus-unneeded hugetlb_free_pgd_range().
 
 - The 4 patch series "mm/damon/sysfs: support periodic and automated
   stats update" from SeongJae Park augments the present
   userspace-requested update of DAMON sysfs monitoring files.  Automatic
   update is now provided, along with a tunable to control the update
   interval.
 
 - The 4 patch series "Some randome fixes and cleanups to swapfile" from
   Kemeng Shi does what is claims.
 
 - The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
   and David Hildenbrand provides (and uses) a means by which debug-style
   functions can grab a copy of a pageframe and inspect it locklessly
   without tripping over the races inherent in operating on the live
   pageframe directly.
 
 - The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
   Suren Baghdasaryan addresses the large contention issues which can be
   triggered by reads from that procfs file.  Latencies are reduced by more
   than half in some situations.  The series also introduces several new
   selftests for the /proc/pid/maps interface.
 
 - The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
   __folio_split()!
 
 - The 7 patch series "Optimize mprotect() for large folios" from Dev
   Jain provides some quite large (>3x) speedups to mprotect() when dealing
   with large folios.
 
 - The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
   volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
   cleanup work in the selftests code.
 
 - The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
   Stoakes extends the mremap() selftest in several ways, including adding
   more checking of Lorenzo's recently added "permit mremap() move of
   multiple VMAs" feature.
 
 - The 22 patch series "selftests/damon/sysfs.py: test all parameters"
   from SeongJae Park extends the DAMON sysfs interface selftest so that it
   tests all possible user-requested parameters.  Rather than the present
   minimal subset.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
 jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
 8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
 =XsFy
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implememnts some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top or Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what is claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters. Rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
Linus Torvalds b1cce98493 It has been a relatively busy cycle for docs, especially the build system:
- The Perl kernel-doc script was added to 2.3.52pre1 just after the turn of
   the millennium.  Over the following 25 years, it accumulated a vast
   amount of cruft, all in a language few people want to deal with anymore.
   Mauro's Python replacement in 6.16 faithfully reproduced all of the cruft
   in the hope of avoiding regressions.  Now that we have a more reasonable
   code base, though, we can work on cleaning it up; many of the changes
   this time around are toward that end.
 
 - A reorganization of the ext4 docs into the usual TOC format.
 
 - Various Chinese translations and updates.
 
 - A new script from Mauro to help with docs-build testing.
 
 - A new document for linked lists
 
 - A sweep through MAINTAINERS fixing broken GitHub git:// repository links.
 
 ...and lots of fixes and updates.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCgAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmiHe3oPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Y+EIH/0dMribAlWSrfS1sdisfkVp+nHh+DB6EA+uX
 XqbJvQrukze6GvvOI2L6+3fDp+5CBtBRRSkzsNIXfFQo6p/jEbTmD/JILO0LcyDT
 9iFX+W30nRetu1SqkiTGjLXgu+tF0gUE6zVnI7Lx7H10PUnUPkFbmMuwmOcOV/lC
 7Lml+G1FTByGE6gDjTTyTJOqBf37uLJq33N2YnPK0SHm4DiSsWGvINxGbXrrpR5Z
 7ORA6SnaIxFuy60SxL9pEH92OLS/kHRw74P/DT1dkg9BSdy4TRLM30QkZFGoiG2B
 OOnnT/JJz80BzI1ctzpcwGRWfD+i8DDvujp8+aLxXYbl5N7WYw0=
 =sji4
 -----END PGP SIGNATURE-----

Merge tag 'docs-6.17' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "It has been a relatively busy cycle for docs, especially the build
  system:

   - The Perl kernel-doc script was added to 2.3.52pre1 just after the
     turn of the millennium. Over the following 25 years, it accumulated
     a vast amount of cruft, all in a language few people want to deal
     with anymore. Mauro's Python replacement in 6.16 faithfully
     reproduced all of the cruft in the hope of avoiding regressions.

     Now that we have a more reasonable code base, though, we can work
     on cleaning it up; many of the changes this time around are toward
     that end.

   - A reorganization of the ext4 docs into the usual TOC format.

   - Various Chinese translations and updates.

   - A new script from Mauro to help with docs-build testing.

   - A new document for linked lists

   - A sweep through MAINTAINERS fixing broken GitHub git:// repository
     links.

  ...and lots of fixes and updates"

* tag 'docs-6.17' of git://git.lwn.net/linux: (147 commits)
  scripts: add origin commit identification based on specific patterns
  sphinx: kernel_abi: fix performance regression with O=<dir>
  Documentation: core-api: entry: Replace deprecated KVM entry/exit functions
  docs: fault-injection: drop reference to md-faulty
  docs: document linked lists
  scripts: kdoc: make it backward-compatible with Python 3.7
  docs: kernel-doc: emit warnings for ancient versions of Python
  Documentation/rtla: Describe exit status
  Documentation/rtla: Add include common_appendix.rst
  docs: kernel: Clarify printk_ratelimit_burst reset behavior
  Documentation: ioctl-number: Don't repeat macro names
  Documentation: ioctl-number: Shorten macros table
  Documentation: ioctl-number: Correct full path to papr-physical-attestation.h
  Documentation: ioctl-number: Extend "Include File" column width
  Documentation: ioctl-number: Fix linuxppc-dev mailto link
  overlayfs.rst: fix typos
  docs: kdoc: emit a warning for ancient versions of Python
  docs: kdoc: clean up check_sections()
  docs: kdoc: directly access the always-there KdocItem fields
  docs: kdoc: straighten up dump_declaration()
  ...
2025-07-31 08:36:51 -07:00
Linus Torvalds 283564a433 fscrypt updates for 6.17
Simplify how fscrypt uses the crypto API, resulting in some
 significant performance improvements:
 
  - Drop the incomplete and problematic support for asynchronous
    algorithms. These drivers are bug-prone, and it turns out they are
    actually much slower than the CPU-based code as well.
 
  - Allocate crypto requests on the stack instead of the heap. This
    improves encryption and decryption performance, especially for
    filenames. It also eliminates a point of failure during I/O.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaIZ+fhQcZWJpZ2dlcnNA
 a2VybmVsLm9yZwAKCRDzXCl4vpKOK+dkAQDrHUTj9dGZI/cQ/TjP0kmOv9XfYAfj
 HOQDRikTX+Ip4QEA6L8FS8lJYf9EMznTvTPOkP7hXpwqzuf00vJWr+ySmQs=
 =N9vo
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux

Pull fscrypt updates from Eric Biggers:
 "Simplify how fscrypt uses the crypto API, resulting in some
  significant performance improvements:

   - Drop the incomplete and problematic support for asynchronous
     algorithms. These drivers are bug-prone, and it turns out they are
     actually much slower than the CPU-based code as well.

   - Allocate crypto requests on the stack instead of the heap. This
     improves encryption and decryption performance, especially for
     filenames. This also eliminates a point of failure during I/O"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  ceph: Remove gfp_t argument from ceph_fscrypt_encrypt_*()
  fscrypt: Remove gfp_t argument from fscrypt_encrypt_block_inplace()
  fscrypt: Remove gfp_t argument from fscrypt_crypt_data_unit()
  fscrypt: Switch to sync_skcipher and on-stack requests
  fscrypt: Drop FORBID_WEAK_KEYS flag for AES-ECB
  fscrypt: Don't use asynchronous CryptoAPI algorithms
  fscrypt: Don't use problematic non-inline crypto engines
  fscrypt: Drop obsolete recommendation to enable optimized SHA-512
  fscrypt: Explicitly include <linux/export.h>
2025-07-28 18:07:38 -07:00
Linus Torvalds 4b65b859f5 Crypto library conversions for 6.17
Convert fsverity and apparmor to use the SHA-2 library functions
 instead of crypto_shash. This is simpler and also slightly faster.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaIZ+MhQcZWJpZ2dlcnNA
 a2VybmVsLm9yZwAKCRDzXCl4vpKOK6LlAP42XrHo9hMqCnVs7uCtDVOk2kBUB7ml
 m7rkVqQLiEbgAAEAiFvQYPjiaUedTi+bH45+qP/O/LxY7y5laIby7QKa1wc=
 =X4Xs
 -----END PGP SIGNATURE-----

Merge tag 'libcrypto-conversions-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull crypto library conversions from Eric Biggers:
 "Convert fsverity and apparmor to use the SHA-2 library functions
  instead of crypto_shash. This is simpler and also slightly faster"

* tag 'libcrypto-conversions-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  fsverity: Switch from crypto_shash to SHA-2 library
  fsverity: Explicitly include <linux/export.h>
  apparmor: use SHA-256 library API instead of crypto_shash API
2025-07-28 18:05:46 -07:00
Linus Torvalds b5d760d53a vfs-6.17-rc1.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCtwAKCRCRxhvAZXjc
 ogPuAQChc4tCjlNp+yAwbSmuzWooKTN8PHI6v+3ftjdaKSy9AgD/Yya1i8aBYBA8
 9HBtIKGAqvcgNB3por7yN+GJ8fxb/Ag=
 =YmLL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs iomap updates from Christian Brauner:

 - Refactor the iomap writeback code and split the generic and ioend/bio
   based writeback code.

   There are two methods that define the split between the generic
   writeback code, and the implemementation of it, and all knowledge of
   ioends and bios now sits below that layer.

 - Add fuse iomap support for buffered writes and dirty folio writeback.

   This is needed so that granular uptodate and dirty tracking can be
   used in fuse when large folios are enabled. This has two big
   advantages. For writes, instead of the entire folio needing to be
   read into the page cache, only the relevant portions need to be. For
   writeback, only the dirty portions need to be written back instead of
   the entire folio.

* tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fuse: refactor writeback to use iomap_writepage_ctx inode
  fuse: hook into iomap for invalidating and checking partial uptodateness
  fuse: use iomap for folio laundering
  fuse: use iomap for writeback
  fuse: use iomap for buffered writes
  iomap: build the writeback code without CONFIG_BLOCK
  iomap: add read_folio_range() handler for buffered writes
  iomap: improve argument passing to iomap_read_folio_sync
  iomap: replace iomap_folio_ops with iomap_write_ops
  iomap: export iomap_writeback_folio
  iomap: move folio_unlock out of iomap_writeback_folio
  iomap: rename iomap_writepage_map to iomap_writeback_folio
  iomap: move all ioend handling to ioend.c
  iomap: add public helpers for uptodate state manipulation
  iomap: hide ioends from the generic writeback code
  iomap: refactor the writeback interface
  iomap: cleanup the pending writeback tracking in iomap_writepage_map_blocks
  iomap: pass more arguments using the iomap writeback context
  iomap: header diet
2025-07-28 16:09:03 -07:00
Linus Torvalds 57fcb7d930 vfs-6.17-rc1.fileattr
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCpgAKCRCRxhvAZXjc
 oqfFAQDcy3rROUF3W34KcSi7rDmaKVSX53d1tUoqH+1zDRpSlwEAriKDNC1ybudp
 YAnxVzkRHjHs1296WIuwKq5lfhJ60Q4=
 =geAl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fileattr updates from Christian Brauner:
 "This introduces the new file_getattr() and file_setattr() system calls
  after lengthy discussions.

  Both system calls serve as successors and extensible companions to
  the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have
  started to show their age in addition to being named in a way that
  makes it easy to conflate them with extended attribute related
  operations.

  These syscalls allow userspace to set filesystem inode attributes on
  special files. One of the usage examples is the XFS quota projects.

  XFS has project quotas which could be attached to a directory. All new
  inodes in these directories inherit project ID set on parent
  directory.

  The project is created from userspace by opening and calling
  FS_IOC_FSSETXATTR on each inode. This is not possible for special
  files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
  with empty project ID. Those inodes then are not shown in the quota
  accounting but still exist in the directory. This is not critical but
  in the case when special files are created in the directory with
  already existing project quota, these new inodes inherit extended
  attributes. This creates a mix of special files with and without
  attributes. Moreover, special files with attributes don't have a
  possibility to become clear or change the attributes. This, in turn,
  prevents userspace from re-creating quota project on these existing
  files.

  In addition, these new system calls allow the implementation of
  additional attributes that we couldn't or didn't want to fit into the
  legacy ioctls anymore"

* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: tighten a sanity check in file_attr_to_fileattr()
  tree-wide: s/struct fileattr/struct file_kattr/g
  fs: introduce file_getattr and file_setattr syscalls
  fs: prepare for extending file_get/setattr()
  fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
  selinux: implement inode_file_[g|s]etattr hooks
  lsm: introduce new hooks for setting/getting inode fsxattr
  fs: split fileattr related helpers into separate file
2025-07-28 15:24:14 -07:00
Linus Torvalds 7031769e10 vfs-6.17-rc1.mmap_prepare
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCgQAKCRCRxhvAZXjc
 os+nAP9LFHUwWO6EBzHJJGEVjJvvzsbzqeYrRFamYiMc5ulPJwD+KW4RIgJa/MWO
 pcYE40CacaekD8rFWwYUyszpgmv6ewc=
 =wCwp
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull mmap_prepare updates from Christian Brauner:
 "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b ("mm:
  introduce new .mmap_prepare() file callback").

  This is preferred to the existing f_op->mmap() hook as it does require
  a VMA to be established yet, thus allowing the mmap logic to invoke
  this hook far, far earlier, prior to inserting a VMA into the virtual
  address space, or performing any other heavy handed operations.

  This allows for much simpler unwinding on error, and for there to be a
  single attempt at merging a VMA rather than having to possibly
  reattempt a merge based on potentially altered VMA state.

  Far more importantly, it prevents inappropriate manipulation of
  incompletely initialised VMA state, which is something that has been
  the cause of bugs and complexity in the past.

  The intent is to gradually deprecate f_op->mmap, and in that vein this
  series coverts the majority of file systems to using f_op->mmap_prepare.

  Prerequisite steps are taken - firstly ensuring all checks for mmap
  capabilities use the file_has_valid_mmap_hooks() helper rather than
  directly checking for f_op->mmap (which is now not a valid check) and
  secondly updating daxdev_mapping_supported() to not require a VMA
  parameter to allow ext4 and xfs to be converted.

  Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
  nested file systems") handles the nasty edge-case of nested file
  systems like overlayfs, which introduces a compatibility shim to allow
  f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.

  This allows for nested filesystems to continue to function correctly
  with all file systems regardless of which callback is used. Once we
  finally convert all file systems, this shim can be removed.

  As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
  can nest all other file systems.

  We additionally do not update resctl - as this requires an update to
  remap_pfn_range() (or an alternative to it) which we defer to a later
  series, equally we do not update cramfs which needs a mixed mapping
  insertion with the same issue, nor do we update procfs, hugetlbfs,
  syfs or kernfs all of which require VMAs for internal state and hooks.
  We shall return to all of these later"

* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: update porting, vfs documentation to describe mmap_prepare()
  fs: replace mmap hook with .mmap_prepare for simple mappings
  fs: convert most other generic_file_*mmap() users to .mmap_prepare()
  fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
  mm/filemap: introduce generic_file_*_mmap_prepare() helpers
  fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
  fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
  fs/dax: make it possible to check dev dax support without a VMA
  fs: consistently use can_mmap_file() helper
  mm/nommu: use file_has_valid_mmap_hooks() helper
  mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
2025-07-28 13:43:25 -07:00
Linus Torvalds 0c4ec4a339 vfs-6.17-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCcgAKCRCRxhvAZXjc
 ose6AQDUhwws5T7FYbqQRZC7tc19xJ4CJN2MH6WQsRJ8PrXMtQD/dY/KVPGtOZgb
 +fFGcOPkO9c+D9WUNXjcGtGMv+fsegc=
 =iV8T
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull async directory updates from Christian Brauner:
 "This contains preparatory changes for the asynchronous directory
  locking scheme.

  While the locking scheme is still very much controversial and we're
  still far away from landing any actual changes in that area the
  preparatory work that we've been upstreaming for a while now has been
  very useful. This is another set of minor changes and cleanups"

* tag 'vfs-6.17-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  exportfs: use lookup_one_unlocked()
  coda: use iterate_dir() in coda_readdir()
  VFS: Minor fixes for porting.rst
  VFS: merge lookup_one_qstr_excl_raw() back into lookup_one_qstr_excl()
2025-07-28 13:31:32 -07:00
Linus Torvalds 7879d7aff0 vfs-6.17-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaIM/KwAKCRCRxhvAZXjc
 opT+AP407JwhRSBjUEmHg5JzUyDoivkOySdnthunRjaBKD8rlgEApM6SOIZYucU7
 cPC3ZY6ORFM6Mwaw+iDW9lasM5ucHQ8=
 =CHha
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc VFS updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add ext4 IOCB_DONTCACHE support

     This refactors the address_space_operations write_begin() and
     write_end() callbacks to take const struct kiocb * as their first
     argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
     to the filesystem's buffered I/O path.

     Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
     and advertises support via the FOP_DONTCACHE file operation flag.

     Additionally, the i915 driver's shmem write paths are updated to
     bypass the legacy write_begin/write_end interface in favor of
     directly calling write_iter() with a constructed synchronous kiocb.
     Another i915 change replaces a manual write loop with
     kernel_write() during GEM shmem object creation.

  Cleanups:

   - don't duplicate vfs_open() in kernel_file_open()

   - proc_fd_getattr(): don't bother with S_ISDIR() check

   - fs/ecryptfs: replace snprintf with sysfs_emit in show function

   - vfs: Remove unnecessary list_for_each_entry_safe() from
     evict_inodes()

   - filelock: add new locks_wake_up_waiter() helper

   - fs: Remove three arguments from block_write_end()

   - VFS: change old_dir and new_dir in struct renamedata to dentrys

   - netfs: Remove unused declaration netfs_queue_write_request()

  Fixes:

   - eventpoll: Fix semi-unbounded recursion

   - eventpoll: fix sphinx documentation build warning

   - fs/read_write: Fix spelling typo

   - fs: annotate data race between poll_schedule_timeout() and
     pollwake()

   - fs/pipe: set FMODE_NOWAIT in create_pipe_files()

   - docs/vfs: update references to i_mutex to i_rwsem

   - fs/buffer: remove comment about hard sectorsize

   - fs/buffer: remove the min and max limit checks in __getblk_slow()

   - fs/libfs: don't assume blocksize <= PAGE_SIZE in
     generic_check_addressable

   - fs_context: fix parameter name in infofc() macro

   - fs: Prevent file descriptor table allocations exceeding INT_MAX"

* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
  netfs: Remove unused declaration netfs_queue_write_request()
  eventpoll: fix sphinx documentation build warning
  ext4: support uncached buffered I/O
  mm/pagemap: add write_begin_get_folio() helper function
  fs: change write_begin/write_end interface to take struct kiocb *
  drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
  drm/i915: Use kernel_write() in shmem object create
  eventpoll: Fix semi-unbounded recursion
  vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
  fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable
  fs/buffer: remove the min and max limit checks in __getblk_slow()
  fs: Prevent file descriptor table allocations exceeding INT_MAX
  fs: Remove three arguments from block_write_end()
  fs/ecryptfs: replace snprintf with sysfs_emit in show function
  fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
  docs/vfs: update references to i_mutex to i_rwsem
  fs/buffer: remove comment about hard sectorsize
  fs_context: fix parameter name in infofc() macro
  VFS: change old_dir and new_dir in struct renamedata to dentrys
  proc_fd_getattr(): don't bother with S_ISDIR() check
  ...
2025-07-28 11:22:56 -07:00
Linus Torvalds 794cbac9c0 mount changes. I've got more stuff in the local tree, but
this is getting too much for one merge window as it is.
 
 * mount hash conflicts rudiments are gone now - we do not allow
 	multiple mounts with the same parent/mountpoint to be
 	hashed at the same time.
 * struct mount changes
 	mnt_umounting is gone;
 	mnt_slave_list/mnt_slave is an hlist now;
 	overmounts are kept track of by explicit pointer in mount;
 	a bunch of flags moved out of mnt_flags to a new field,
 	with only namespace_sem for protection;
 	mnt_expiry is protected by mount_lock now (instead of
 	namespace_sem);
 	MNT_LOCKED is used only for mounts that need to remain
 	attached to their parents to prevent mountpoint exposure -
 	no more overloading it for absolute roots;
 	all mnt_list uses are transient now - it's used only to
 	represent temporary sets during umount_tree().
 * mount refcounting change
 	children no longer pin parents for any mounts, whether they'd
 	passed through umount_tree() or not.
 * struct mountpoint changes
 	refcount is no more; what matters is ->m_list emptiness;
 	instead of temporary bumping the refcount, we insert a new object
 	(pinned_mountpoint) into ->m_list;
 	new calling conventions for lock_mount() and friends.
 * do_move_mount()/attach_recursive_mnt() seriously cleaned up.
 * globals in fs/pnode.c are gone.
 * propagate_mnt(), change_mnt_propagation() and propagate_umount() cleaned up
 	(in the last case - pretty much completely rewritten).
 * freeing of emptied mnt_namespace is done in namespace_unlock()
 	for one thing, there are subtle ordering requirements there;
 	for another it simplifies cleanups.
 * assorted cleanups.
 * restore the machinery for long-term mounts from accumulated bitrot.
 	This is going to get a followup come next cycle, when #work.fs_context
 	with its change of vfs_fs_parse_string() calling conventions goes
 	into -next.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIR2dQAKCRBZ7Krx/gZQ
 6/SzAP4x+Fjjc5Tm2UNgGW5dptDY5s9O5RuFauo1MM6rcrekagEApTarcMlPnZvC
 mj1TVJFNfdVhZyTXnz5ocHhGX1udmgU=
 =qT69
 -----END PGP SIGNATURE-----

Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs mount updates from Al Viro:

 - mount hash conflicts rudiments are gone now - we do not allow
     multiple mounts with the same parent/mountpoint to be hashed at the
     same time.

 - 'struct mount' changes:
      - mnt_umounting is gone
      - mnt_slave_list/mnt_slave is an hlist now
      - overmounts are kept track of by explicit pointer in mount
      - a bunch of flags moved out of mnt_flags to a new field, with
        only namespace_sem for protection
      - mnt_expiry is protected by mount_lock now (instead of
        namespace_sem)
      - MNT_LOCKED is used only for mounts that need to remain attached
        to their parents to prevent mountpoint exposure - no more
        overloading it for absolute roots
      - all mnt_list uses are transient now - it's used only to
        represent temporary sets during umount_tree()

 - mount refcounting change: children no longer pin parents for any
   mounts, whether they'd passed through umount_tree() or not

 - 'struct mountpoint' changes:
      - refcount is no more; what matters is ->m_list emptiness
      - instead of temporary bumping the refcount, we insert a new
        object (pinned_mountpoint) into ->m_list
      - new calling conventions for lock_mount() and friends

 - do_move_mount()/attach_recursive_mnt() seriously cleaned up

 - globals in fs/pnode.c are gone

 - propagate_mnt(), change_mnt_propagation() and propagate_umount()
   cleaned up (in the last case - pretty much completely rewritten).

 - freeing of emptied mnt_namespace is done in namespace_unlock(). For
   one thing, there are subtle ordering requirements there; for another
   it simplifies cleanups.

 - assorted cleanups

 - restore the machinery for long-term mounts from accumulated bitrot.

   This is going to get a followup come next cycle, when the change of
   vfs_fs_parse_string() calling conventions goes into -next

* tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (48 commits)
  statmount_mnt_basic(): simplify the logics for group id
  invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED()
  get rid of CL_SHARE_TO_SLAVE
  take freeing of emptied mnt_namespace to namespace_unlock()
  copy_tree(): don't link the mounts via mnt_list
  change_mnt_propagation(): move ->mnt_master assignment into MS_SLAVE case
  mnt_slave_list/mnt_slave: turn into hlist_head/hlist_node
  turn do_make_slave() into transfer_propagation()
  do_make_slave(): choose new master sanely
  change_mnt_propagation(): do_make_slave() is a no-op unless IS_MNT_SHARED()
  change_mnt_propagation() cleanups, step 1
  propagate_mnt(): fix comment and convert to kernel-doc, while we are at it
  propagate_mnt(): get rid of last_dest
  fs/pnode.c: get rid of globals
  propagate_one(): fold into the sole caller
  propagate_one(): separate the "what should be the master for this copy" part
  propagate_one(): separate the "do we need secondary here?" logics
  propagate_mnt(): handle all peer groups in the same loop
  propagate_one(): get rid of dest_master
  mount: separate the flags accessed only under namespace_sem
  ...
2025-07-28 10:49:38 -07:00
Linus Torvalds 11fe69fbd5 Current exclusion rules for ->d_flags stores are rather unpleasant.
The basic rules are simple:
 	* stores to dentry->d_flags are OK under dentry->d_lock.
 	* stores to dentry->d_flags are OK in the dentry constructor, before
 becomes potentially visible to other threads.
 Unfortunately, there's a couple of exceptions to that, and that's where the
 headache comes from.
 
 	Main PITA comes from d_set_d_op(); that primitive sets ->d_op
 of dentry and adjusts the flags that correspond to presence of individual
 methods.  It's very easy to misuse; existing uses _are_ safe, but proof
 of correctness is brittle.
 
 	Use in __d_alloc() is safe (we are within a constructor), but we
 might as well precalculate the initial value of ->d_flags when we set
 the default ->d_op for given superblock and set ->d_flags directly
 instead of messing with that helper.
 
 	The reasons why other uses are safe are bloody convoluted; I'm not going
 to reproduce it here.  See https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/
 for gory details, if you care.  The critical part is using d_set_d_op() only
 just prior to d_splice_alias(), which makes a combination of d_splice_alias()
 with setting ->d_op, etc. a natural replacement primitive.  Better yet, if
 we go that way, it's easy to take setting ->d_op and modifying ->d_flags
 under ->d_lock, which eliminates the headache as far as ->d_flags exclusion
 rules are concerned.  Other exceptions are minor and easy to deal with.
 
 	What this series does:
 * d_set_d_op() is no longer available; new primitive (d_splice_alias_ops())
 is provided, equivalent to combination of d_set_d_op() and d_splice_alias().
 * new field of struct super_block - ->s_d_flags.  Default value of ->d_flags
 to be used when allocating dentries on this filesystem.
 * new primitive for setting ->s_d_op: set_default_d_op().  Replaces stores
 to ->s_d_op at mount time.  All in-tree filesystems converted; out-of-tree
 ones will get caught by compiler (->s_d_op is renamed, so stores to it will
 be caught).  ->s_d_flags is set by the same primitive to match the ->s_d_op.
 * a lot of filesystems had ->s_d_op->d_delete equal to always_delete_dentry;
 that is equivalent to setting DCACHE_DONTCACHE in ->d_flags, so such filesystems
 can bloody well set that bit in ->s_d_flags and drop ->d_delete() from
 dentry_operations.  In quite a few cases that results in empty dentry_operations,
 which means that we can get rid of those.
 * kill simple_dentry_operations - not needed anymore.
 * massage d_alloc_parallel() to get rid of the other exception wrt ->d_flags
 stores - we can set DCACHE_PAR_LOOKUP as soon as we allocate the new dentry;
 no need to delay that until we commit to using the sucker.
 
 As the result, ->d_flags stores are all either under ->d_lock or done before
 the dentry becomes visible in any shared data structures.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIQ/tQAKCRBZ7Krx/gZQ
 66AhAQDgQ+S224x5YevNXc9mDoGUBMF4OG0n0fIla9rfdL4I6wEAqpOWMNDcVPCZ
 GwYOvJ9YuqNdz+MyprAI18Yza4GOmgs=
 =rTYB
 -----END PGP SIGNATURE-----

Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dentry d_flags updates from Al Viro:
 "The current exclusion rules for dentry->d_flags stores are rather
  unpleasant. The basic rules are simple:

   - stores to dentry->d_flags are OK under dentry->d_lock

   - stores to dentry->d_flags are OK in the dentry constructor, before
     becomes potentially visible to other threads

  Unfortunately, there's a couple of exceptions to that, and that's
  where the headache comes from.

  The main PITA comes from d_set_d_op(); that primitive sets ->d_op of
  dentry and adjusts the flags that correspond to presence of individual
  methods. It's very easy to misuse; existing uses _are_ safe, but proof
  of correctness is brittle.

  Use in __d_alloc() is safe (we are within a constructor), but we might
  as well precalculate the initial value of 'd_flags' when we set the
  default ->d_op for given superblock and set 'd_flags' directly instead
  of messing with that helper.

  The reasons why other uses are safe are bloody convoluted; I'm not
  going to reproduce it here. See [1] for gory details, if you care. The
  critical part is using d_set_d_op() only just prior to
  d_splice_alias(), which makes a combination of d_splice_alias() with
  setting ->d_op, etc a natural replacement primitive.

  Better yet, if we go that way, it's easy to take setting ->d_op and
  modifying 'd_flags' under ->d_lock, which eliminates the headache as
  far as 'd_flags' exclusion rules are concerned. Other exceptions are
  minor and easy to deal with.

  What this series does:

   - d_set_d_op() is no longer available; instead a new primitive
     (d_splice_alias_ops()) is provided, equivalent to combination of
     d_set_d_op() and d_splice_alias().

   - new field of struct super_block - 's_d_flags'. This sets the
     default value of 'd_flags' to be used when allocating dentries on
     this filesystem.

   - new primitive for setting 's_d_op': set_default_d_op(). This
     replaces stores to 's_d_op' at mount time.

     All in-tree filesystems converted; out-of-tree ones will get caught
     by the compiler ('s_d_op' is renamed, so stores to it will be
     caught). 's_d_flags' is set by the same primitive to match the
     's_d_op'.

   - a lot of filesystems had sb->s_d_op->d_delete equal to
     always_delete_dentry; that is equivalent to setting
     DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well
     set that bit in 's_d_flags' and drop 'd_delete()' from
     dentry_operations.

     In quite a few cases that results in empty dentry_operations, which
     means that we can get rid of those.

   - kill simple_dentry_operations - not needed anymore

   - massage d_alloc_parallel() to get rid of the other exception wrt
     'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we
     allocate the new dentry; no need to delay that until we commit to
     using the sucker.

  As the result, 'd_flags' stores are all either under ->d_lock or done
  before the dentry becomes visible in any shared data structures"

Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1]

* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
  configfs: use DCACHE_DONTCACHE
  debugfs: use DCACHE_DONTCACHE
  efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry()
  9p: don't bother with always_delete_dentry
  ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE
  kill simple_dentry_operations
  devpts, sunrpc, hostfs: don't bother with ->d_op
  shmem: no dentry retention past the refcount reaching zero
  d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier
  make d_set_d_op() static
  simple_lookup(): just set DCACHE_DONTCACHE
  tracefs: Add d_delete to remove negative dentries
  set_default_d_op(): calculate the matching value for ->d_flags
  correct the set of flags forbidden at d_set_d_op() time
  split d_flags calculation out of d_set_d_op()
  new helper: set_default_d_op()
  fuse: no need for special dentry_operations for root dentry
  switch procfs from d_set_d_op() to d_splice_alias_ops()
  new helper: d_splice_alias_ops()
  procfs: kill ->proc_dops
  ...
2025-07-28 09:17:57 -07:00
Lorenzo Stoakes 425c8bb39b
doc: update porting, vfs documentation to describe mmap_prepare()
Now that we have established .mmap_prepare() as the preferred means by
which filesystems establish state upon memory mapping of a file, update the
VFS and porting documentation to reflect this.

As part of this change, additionally update the VFS documentation to
contain the current state of the file_operations struct.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/20250723123036.35472-1-lorenzo.stoakes@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-23 15:09:14 +02:00
Vlastimil Babka 8356a5a3b0 mm, vmstat: remove the NR_WRITEBACK_TEMP node_stat_item counter
The only user of the counter (FUSE) was removed in commit 0c58a97f91
("fuse: remove tmp folio for writebacks and internal rb tree") so follow
the established pattern of removing the counter and hardcoding 0 in
meminfo output, as done recently with NR_BOUNCE.  Update documentation for
procfs, including for the value for Bounce that was missed when removing
its counter.

Also remove the mention of NR_WRITEBACK_TEMP implications from a comment
in wb_position_ratio(). The rest of the comment there about fuse setting
bdi->max_ratio to 1% is still correct.

[vbabka@suse.cz: v2]
  Link: https://lkml.kernel.org/r/5a848e15-6a57-4ecb-a015-d4f358b8a5d3@suse.cz
Link: https://lkml.kernel.org/r/20250625-nr_writeback_removal-v1-1-7f2a0df70faa@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanne Koong <joannelkoong@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19 18:59:47 -07:00
Taotao Chen e9d8e2bf23
fs: change write_begin/write_end interface to take struct kiocb *
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.

Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.

Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.

Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16 14:48:18 +02:00
Matthias Frank 2afcd62944 overlayfs.rst: fix typos
Grammatical fixes

Signed-off-by: Matthias Frank <frank.mt125@gmail.com>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250710050607.2891-1-frank.mt125@gmail.com
2025-07-15 13:53:46 -06:00
Eric Biggers 998646b3c1 fsverity: Switch from crypto_shash to SHA-2 library
fsverity supports two hash algorithms: SHA-256 and SHA-512.  Since both
of these have a library API now, just use the library API instead of
crypto_shash.  Even with multiple algorithms, the library-based code
still ends up being quite a bit simpler, due to how clumsy the
old-school crypto API is.  The library-based code is also more
efficient, since it avoids overheads such as indirect calls.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630172224.46909-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-14 11:29:32 -07:00
Christoph Hellwig c5690dd019
iomap: add read_folio_range() handler for buffered writes
Add a read_folio_range() handler for buffered writes that filesystems
may pass in if they wish to provide a custom handler for synchronously
reading in the contents of a folio.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
[hch: renamed to read_folio_range, pass less arguments]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-14-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:51:33 +02:00
Christoph Hellwig 2a5574fc57
iomap: replace iomap_folio_ops with iomap_write_ops
The iomap_folio_ops are only used for buffered writes, including the zero
and unshare variants.  Rename them to iomap_write_ops to better describe
the usage, and pass them through the call chain like the other operation
specific methods instead of through the iomap.

xfs_iomap_valid grows a IOMAP_HOLE check to keep the existing behavior
that never attached the folio_ops to a iomap representing a hole.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-12-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:51:33 +02:00
Christoph Hellwig f4fa7981fa
iomap: hide ioends from the generic writeback code
Replace the ioend pointer in iomap_writeback_ctx with a void *wb_ctx
one to facilitate non-block, non-ioend writeback for use.  Rename
the submit_ioend method to writeback_submit and make it mandatory so
that the generic writeback code stops seeing ioends and bios.

Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-6-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:51:31 +02:00
Christoph Hellwig fb7399cf2d
iomap: refactor the writeback interface
Replace ->map_blocks with a new ->writeback_range, which differs in the
following ways:

 - it must also queue up the I/O for writeback, that is called into the
   slightly refactored and extended in scope iomap_add_to_ioend for
   each region
 - can handle only a part of the requested region, that is the retry
   loop for partial mappings moves to the caller
 - handles cleanup on failures as well, and thus also replaces the
   discard_folio method only implemented by XFS.

This will allow to use the iomap writeback code also for file systems
that are not block based like fuse.

Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-5-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>	# zonefs
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:51:31 +02:00
Chao Yu 81b6ecca2f f2fs: doc: fix wrong quota mount option description
We should use "{usr,grp,prj}jquota=" to disable journaled quota,
rather than using off{usr,grp,prj}jquota.

Fixes: 4b2414d04e ("f2fs: support journalled quota")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-09 18:00:52 +00:00
Eric Biggers b41c1d8d07 fscrypt: Don't use problematic non-inline crypto engines
Make fscrypt no longer use Crypto API drivers for non-inline crypto
engines, even when the Crypto API prioritizes them over CPU-based code
(which unfortunately it often does).  These drivers tend to be really
problematic, especially for fscrypt's workload.  This commit has no
effect on inline crypto engines, which are different and do work well.

Specifically, exclude drivers that have CRYPTO_ALG_KERN_DRIVER_ONLY or
CRYPTO_ALG_ALLOCATES_MEMORY set.  (Later, CRYPTO_ALG_ASYNC should be
excluded too.  That's omitted for now to keep this commit backportable,
since until recently some CPU-based code had CRYPTO_ALG_ASYNC set.)

There are two major issues with these drivers: bugs and performance.

First, these drivers tend to be buggy.  They're fundamentally much more
error-prone and harder to test than the CPU-based code.  They often
don't get tested before kernel releases, and even if they do, the crypto
self-tests don't properly test these drivers.  Released drivers have
en/decrypted or hashed data incorrectly.  These bugs cause issues for
fscrypt users who often didn't even want to use these drivers, e.g.:

- https://github.com/google/fscryptctl/issues/32
- https://github.com/google/fscryptctl/issues/9
- https://lore.kernel.org/r/PH0PR02MB731916ECDB6C613665863B6CFFAA2@PH0PR02MB7319.namprd02.prod.outlook.com

These drivers have also similarly caused issues for dm-crypt users,
including data corruption and deadlocks.  Since Linux v5.10, dm-crypt
has disabled most of them by excluding CRYPTO_ALG_ALLOCATES_MEMORY.

Second, these drivers tend to be *much* slower than the CPU-based code.
This may seem counterintuitive, but benchmarks clearly show it.  There's
a *lot* of overhead associated with going to a hardware driver, off the
CPU, and back again.  To prove this, I gathered as many systems with
this type of crypto engine as I could, and I measured synchronous
encryption of 4096-byte messages (which matches fscrypt's workload):

Intel Emerald Rapids server:
   AES-256-XTS:
      xts-aes-vaes-avx512   16171 MB/s  [CPU-based, Vector AES]
      qat_aes_xts             289 MB/s  [Offload, Intel QuickAssist]

Qualcomm SM8650 HDK:
   AES-256-XTS:
      xts-aes-ce             4301 MB/s  [CPU-based, ARMv8 Crypto Extensions]
      xts-aes-qce              73 MB/s  [Offload, Qualcomm Crypto Engine]

i.MX 8M Nano LPDDR4 EVK:
   AES-256-XTS:
      xts-aes-ce              647 MB/s   [CPU-based, ARMv8 Crypto Extensions]
      xts(ecb-aes-caam)        20 MB/s   [Offload, CAAM]
   AES-128-CBC-ESSIV:
      essiv(cbc-aes-caam,sha256-lib) 23 MB/s   [Offload, CAAM]

STM32MP157F-DK2:
   AES-256-XTS:
      xts-aes-neonbs         13.2 MB/s   [CPU-based, ARM NEON]
      xts(stm32-ecb-aes)     3.1 MB/s    [Offload, STM32 crypto engine]
   AES-128-CBC-ESSIV:
      essiv(cbc-aes-neonbs,sha256-lib)
                             14.7 MB/s   [CPU-based, ARM NEON]
      essiv(stm32-cbc-aes,sha256-lib)
                             3.2 MB/s    [Offload, STM32 crypto engine]
   Adiantum:
      adiantum(xchacha12-arm,aes-arm,nhpoly1305-neon)
                             52.8 MB/s   [CPU-based, ARM scalar + NEON]

So, there was no case in which the crypto engine was even *close* to
being faster.  On the first three, which have AES instructions in the
CPU, the CPU was 30 to 55 times faster (!).  Even on STM32MP157F-DK2
which has a Cortex-A7 CPU that doesn't have AES instructions, AES was
over 4 times faster on the CPU.  And Adiantum encryption, which is what
actually should be used on CPUs like that, was over 17 times faster.

Other justifications that have been given for these non-inline crypto
engines (almost always coming from the hardware vendors, not actual
users) don't seem very plausible either:

  - The crypto engine throughput could be improved by processing
    multiple requests concurrently.  Currently irrelevant to fscrypt,
    since it doesn't do that.  This would also be complex, and unhelpful
    in many cases.  2 of the 4 engines I tested even had only one queue.

  - Some of the engines, e.g. STM32, support hardware keys.  Also
    currently irrelevant to fscrypt, since it doesn't support these.
    Interestingly, the STM32 driver itself doesn't support this either.

  - Free up CPU for other tasks and/or reduce energy usage.  Not very
    plausible considering the "short" message length, driver overhead,
    and scheduling overhead.  There's just very little time for the CPU
    to do something else like run another task or enter low-power state,
    before the message finishes and it's time to process the next one.

  - Some of these engines resist power analysis and electromagnetic
    attacks, while the CPU-based crypto generally does not.  In theory,
    this sounds great.  In practice, if this benefit requires the use of
    an off-CPU offload that massively regresses performance and has a
    low-quality, buggy driver, the price for this hardening (which is
    not relevant to most fscrypt users, and tends to be incomplete) is
    just too high.  Inline crypto engines are much more promising here,
    as are on-CPU solutions like RISC-V High Assurance Cryptography.

Fixes: b30ab0e034 ("ext4 crypto: add ext4 encryption facilities")
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250704070322.20692-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-04 10:25:26 -07:00
Christian Brauner ca115d7e75
tree-wide: s/struct fileattr/struct file_kattr/g
Now that we expose struct file_attr as our uapi struct rename all the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.

Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04 16:14:39 +02:00
Eric Biggers 66271c155d fscrypt: Drop obsolete recommendation to enable optimized SHA-512
Since the crypto kconfig options are being fixed to enable optimized
SHA-512 automatically
(https://lore.kernel.org/linux-crypto/20250616014019.415791-1-ebiggers@kernel.org/),
it is no longer necessary to give a recommendation to enable it.

Link: https://lore.kernel.org/r/20250619193149.138315-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-04 00:07:46 -07:00
Bagas Sanjaya 1ce50d4e06 Documentation: ext4: Move inode table short docs into its own file
The short description of inode table is in bitmaps.rst alongside the
proper bitmpas documentation. The docs file is short enough that it fits
whole browser screen on desktop, which implies that when readers click
"Inode Table", they will essentially see bitmaps docs.

Move inode table short description.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250620105643.25141-7-bagasdotme@gmail.com
2025-07-02 16:57:20 -06:00