Commit Graph

4314 Commits (dfca045cd4d0ea07ff4198ba392be3e718acaddc)

Author SHA1 Message Date
David Howells 2c28769a51 rxrpc: Fix recvmsg() unconditional requeue
If rxrpc_recvmsg() fails because MSG_DONTWAIT was specified but the call at
the front of the recvmsg queue already has its mutex locked, it requeues
the call - whether or not the call is already queued.  The call may be on
the queue because MSG_PEEK was also passed and so the call was not dequeued
or because the I/O thread requeued it.

The unconditional requeue may then corrupt the recvmsg queue, leading to
things like UAFs or refcount underruns.

Fix this by only requeuing the call if it isn't already on the queue - and
moving it to the front if it is already queued.  If we don't queue it, we
have to put the ref we obtained by dequeuing it.

Also, MSG_PEEK doesn't dequeue the call so shouldn't call
rxrpc_notify_socket() for the call if we didn't use up all the data on the
queue, so fix that also.
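
The requeue rule described above can be sketched in userspace C.  This is
only an illustration under stated assumptions, not the actual rxrpc code:
the list helpers, struct demo_call and demo_requeue_front() are hypothetical
names, and the reference count is a plain integer rather than a kernel
refcount.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a circular doubly linked list (hypothetical). */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

/* A node that points at itself is not on any queue. */
static bool list_is_empty(const struct list_node *n)
{
	return n->next == n;
}

static void list_unlink(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

static void list_push_front(struct list_node *n, struct list_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Hypothetical call object; refs counts the references held on it. */
struct demo_call {
	struct list_node recvmsg_link;
	int refs;
};

/*
 * Put a call back at the front of the queue.  The caller holds the ref it
 * obtained by dequeuing the call.  If the call is not queued, that ref is
 * handed to the queue.  If it is already queued, the queue already has a
 * ref, so the call is only moved to the front and the caller's ref is put.
 */
static void demo_requeue_front(struct list_node *queue, struct demo_call *call)
{
	if (list_is_empty(&call->recvmsg_link)) {
		list_push_front(&call->recvmsg_link, queue);
	} else {
		list_unlink(&call->recvmsg_link);
		list_push_front(&call->recvmsg_link, queue);
		call->refs--;
	}
}
```

Checking list membership before adding is what keeps a second, unconditional
add from linking the same node twice and corrupting the queue.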

Fixes: 540b1c48c3 ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
Reported-by: Faith <faith@zellic.io>
Reported-by: Pumpkin Chang <pumpkin@devco.re>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Marc Dionne <marc.dionne@auristor.com>
cc: Nir Ohfeld <niro@wiz.io>
cc: Willy Tarreau <w@1wt.eu>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/95163.1768428203@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19 10:07:06 -08:00
Linus Torvalds f0b9d8eb98 nfsd-6.19 fixes:
A set of NFSD fixes that arrived after the 6.19 merge window.
 
 Issues that need expedient stable backports:
 - Remove an invalid NFS status code
 - Fix an fstests failure when using pNFS
 - Fix a UAF in v4_end_grace()
 - Fix the administrative interface used to revoke NFSv4 state
 - Fix a memory leak reported by syzbot

Merge tag 'nfsd-6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "A set of NFSD fixes for stable that arrived after the merge window:

   - Remove an invalid NFS status code

   - Fix an fstests failure when using pNFS

   - Fix a UAF in v4_end_grace()

   - Fix the administrative interface used to revoke NFSv4 state

   - Fix a memory leak reported by syzbot"

* tag 'nfsd-6.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: net ref data still needs to be freed even if net hasn't startup
  nfsd: check that server is running in unlock_filesystem
  nfsd: use correct loop termination in nfsd4_revoke_states()
  nfsd: provide locking for v4_end_grace
  NFSD: Fix permission check for read access to executable-only files
  NFSD: Remove NFSERR_EAGAIN
2026-01-06 09:12:52 -08:00
Linus Torvalds 7f98ab9da0 for-6.19-rc4-tag

Merge tag 'for-6.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix potential deadlock due to mismatching transaction states when
   waiting for the current transaction

 - fix squota accounting with nested snapshots

 - fix quota inheritance of qgroups with multiple parent qgroups

 - fix NULL inode pointer in evict tracepoint

 - fix writes beyond end of file on systems with 64K page size and 4K
   block size

 - fix logging of inodes after exchange rename

 - fix use after free when using ref_tracker feature

 - space reservation fixes

* tag 'for-6.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix reservation leak in some error paths when inserting inline extent
  btrfs: do not free data reservation in fallback from inline due to -ENOSPC
  btrfs: fix use-after-free warning in btrfs_get_or_create_delayed_node()
  btrfs: always detect conflicting inodes when logging inode refs
  btrfs: fix beyond-EOF write handling
  btrfs: fix deadlock in wait_current_trans() due to ignored transaction type
  btrfs: fix NULL dereference on root when tracing inode eviction
  btrfs: qgroup: update all parent qgroups when doing quick inherit
  btrfs: fix qgroup_snapshot_quick_inherit() squota bug
2026-01-05 14:10:48 -08:00
Chuck Lever c6c209ceb8 NFSD: Remove NFSERR_EAGAIN
I haven't found an NFSERR_EAGAIN in RFCs 1094, 1813, 7530, or 8881.
None of these RFCs have an NFS status code that matches the numeric
value "11".

Based on the meaning of the EAGAIN errno, I presume the use of this
status in NFSD means NFS4ERR_DELAY. So replace the one usage of
nfserr_eagain, and remove it from NFSD's NFS status conversion
tables.

As far as I can tell, NFSERR_EAGAIN has existed since the pre-git
era, but was not actually used by any code until commit f4e44b3933
("NFSD: delay unmount source's export after inter-server copy
completed."), at which time it became possible for NFSD to return
a status code of 11 (which is not valid NFS protocol).
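
A minimal sketch of the intended mapping, assuming a simplified conversion
function.  The names demo_nfsstat and demo_errno_to_nfsstat() are
hypothetical and this is not NFSD's real conversion table; the numeric
values 5 (NFSERR_IO) and 10008 (NFS4ERR_DELAY) are the protocol values.

```c
#include <errno.h>

/* Hypothetical subset of NFS status codes, in host byte order. */
enum demo_nfsstat {
	DEMO_NFS_OK        = 0,
	DEMO_NFSERR_IO     = 5,
	DEMO_NFS4ERR_DELAY = 10008,
};

/*
 * Convert a local errno to an NFS status.  EAGAIN is reported as
 * NFS4ERR_DELAY ("try again later") instead of leaking the raw errno
 * value onto the wire.  Anything not listed falls back to a generic I/O
 * error in this sketch.
 */
static enum demo_nfsstat demo_errno_to_nfsstat(int err)
{
	switch (err) {
	case 0:
		return DEMO_NFS_OK;
	case EAGAIN:
		return DEMO_NFS4ERR_DELAY;
	default:
		return DEMO_NFSERR_IO;
	}
}
```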

Fixes: f4e44b3933 ("NFSD: delay unmount source's export after inter-server copy completed.")
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-02 13:43:41 -05:00
Linus Torvalds 44087d3d46 Miscellaneous x86 fixes:
  - Fix FPU core dumps on certain CPU models
  - Fix htmldocs build warning
  - Export TLB tracing event name via header
  - Remove unused constant from <linux/mm_types.h>
  - Fix comments
  - Fix whitespace noise in documentation
  - Fix variadic structure's definition to un-confuse UBSAN
  - Fix posted MSI interrupts irq_retrigger() bug
  - Fix asm build failure with older GCC builds
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'x86-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Fix FPU core dumps on certain CPU models

 - Fix htmldocs build warning

 - Export TLB tracing event name via header

 - Remove unused constant from <linux/mm_types.h>

 - Fix comments

 - Fix whitespace noise in documentation

 - Fix variadic structure's definition to un-confuse UBSAN

 - Fix posted MSI interrupts irq_retrigger() bug

 - Fix asm build failure with older GCC builds

* tag 'x86-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bug: Fix old GCC compile fails
  x86/msi: Make irq_retrigger() functional for posted MSI
  x86/platform/uv: Fix UBSAN array-index-out-of-bounds
  mm: Remove tlb_flush_reason::NR_TLB_FLUSH_REASONS from <linux/mm_types.h>
  x86/mm/tlb/trace: Export the TLB_REMOTE_WRONG_CPU enum in <trace/events/tlb.h>
  x86/sgx: Remove unmatched quote in __sgx_encl_extend function comment
  x86/boot/Documentation: Fix whitespace noise in boot.rst
  x86/fpu: Fix FPU state core dump truncation on CPUs with no extended xfeatures
  x86/boot/Documentation: Fix htmldocs build warning due to malformed table in boot.rst
2025-12-21 14:41:29 -08:00
Miquel Sabaté Solà f157dd6613 btrfs: fix NULL dereference on root when tracing inode eviction
When evicting an inode the first thing we do is to set up tracing for it,
which implies fetching the root's id. But in btrfs_evict_inode() the
root might be NULL, as implied in the next check that we do in
btrfs_evict_inode().

Hence, we either should set the ->root_objectid to 0 in case the root is
NULL, or we move tracing setup after checking that the root is not
NULL. Setting the rootid to 0 at least gives us the possibility to trace
this call even in the case when the root is NULL, so that's the solution
taken here.
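
The chosen approach can be sketched as follows.  This is a simplified
userspace illustration, not the btrfs tracepoint itself: struct demo_root,
struct demo_inode, struct demo_evict_event and demo_trace_evict() are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the root and inode objects. */
struct demo_root {
	uint64_t objectid;
};

struct demo_inode {
	uint64_t ino;
	const struct demo_root *root;	/* may be NULL during eviction */
};

/* The fields a trace event would record. */
struct demo_evict_event {
	uint64_t ino;
	uint64_t root_objectid;
};

/*
 * Build the trace record without dereferencing a NULL root: record a root
 * id of 0 when the inode has no root, so the eviction is still traced.
 */
static struct demo_evict_event demo_trace_evict(const struct demo_inode *inode)
{
	struct demo_evict_event ev;

	ev.ino = inode->ino;
	ev.root_objectid = inode->root ? inode->root->objectid : 0;
	return ev;
}
```

The alternative mentioned above, moving the tracing setup after the NULL
check, would avoid the dereference as well but would leave evictions of
rootless inodes untraced.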

Fixes: 1abe9b8a13 ("Btrfs: add initial tracepoint support for btrfs")
Reported-by: syzbot+d991fea1b4b23b1f6bf8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d991fea1b4b23b1f6bf8
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:14 +01:00
Linus Torvalds 0dfb36b2dc We have a patch that adds an initial set of tracepoints to the MDS
client from Max, a fix that hardens osdmap parsing code from myself
 (marked for stable) and a few assorted fixups.

Merge tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "We have a patch that adds an initial set of tracepoints to the MDS
  client from Max, a fix that hardens osdmap parsing code from myself
  (marked for stable) and a few assorted fixups"

* tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client:
  rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AES
  ceph: stop selecting CRC32, CRYPTO, and CRYPTO_AES
  libceph: make decode_pool() more resilient against corrupted osdmaps
  libceph: Amend checking to fix `make W=1` build breakage
  ceph: Amend checking to fix `make W=1` build breakage
  ceph: add trace points to the MDS client
  libceph: fix log output race condition in OSD client
2025-12-14 15:24:10 +12:00
Tal Zussman 8b62e64e6d x86/mm/tlb/trace: Export the TLB_REMOTE_WRONG_CPU enum in <trace/events/tlb.h>
When the TLB_REMOTE_WRONG_CPU enum was introduced for the tlb_flush
tracepoint, the enum was not exported to user-space. Add it to the
appropriate macro definition to enable parsing by userspace tools, as
per:

  Link: https://lore.kernel.org/all/20150403013802.220157513@goodmis.org

[ mingo: Capitalize IPI, etc. ]
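
Why a missing entry matters can be illustrated with a small value-to-name
table.  This is a userspace sketch under stated assumptions, not the kernel's
trace macros: every DEMO_ and demo_ name, and the strings in the list, are
invented for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flush reasons; the real enum lives in the kernel. */
enum demo_flush_reason {
	DEMO_FLUSH_ON_TASK_SWITCH,
	DEMO_REMOTE_SHOOTDOWN,
	DEMO_REMOTE_WRONG_CPU,
};

/*
 * One list drives the value-to-name table.  An enumerator that exists in
 * the enum but is missing from this list can only be shown as a number.
 */
#define DEMO_FLUSH_REASONS(X)					\
	X(DEMO_FLUSH_ON_TASK_SWITCH, "flush on task switch")	\
	X(DEMO_REMOTE_SHOOTDOWN, "remote shootdown")		\
	X(DEMO_REMOTE_WRONG_CPU, "remote wrong CPU")

struct demo_symbol {
	int value;
	const char *name;
};

#define DEMO_AS_SYMBOL(sym, str) { sym, str },
static const struct demo_symbol demo_symbols[] = {
	DEMO_FLUSH_REASONS(DEMO_AS_SYMBOL)
};

/* Return the symbolic name, or the decimal value if it is not listed. */
static const char *demo_reason_name(int value, char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(demo_symbols) / sizeof(demo_symbols[0]); i++)
		if (demo_symbols[i].value == value)
			return demo_symbols[i].name;
	snprintf(buf, len, "%d", value);
	return buf;
}
```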

Fixes: 2815a56e4b ("x86/mm/tlb: Add tracepoint for TLB flush IPI to stale CPU")
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Link: https://patch.msgid.link/20251212-tlb-trace-fix-v2-1-d322e0ad9b69@columbia.edu
2025-12-13 11:01:16 +01:00
Max Kellermann d927a595ab ceph: add trace points to the MDS client
This patch adds trace points to the Ceph filesystem MDS client:

- request submission (CEPH_MSG_CLIENT_REQUEST) and completion
  (CEPH_MSG_CLIENT_REPLY)
- capabilities (CEPH_MSG_CLIENT_CAPS)

These are the central pieces that are useful for analyzing MDS
latency/performance problems from the client's perspective.

In the long run, all doutc() calls should be replaced with
tracepoints.  This way, the Ceph filesystem can be traced at any time
(without spamming the kernel log).  Additionally, trace points can be
used in BPF programs (which can even dereference the pointer parameters
and extract more values).

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-12-10 11:50:54 +01:00
Linus Torvalds cb015814f8 f2fs-for-6.19-rc1
This series focuses on minor clean-ups and performance optimizations across
 sysfs, documentation, debugfs, tracepoints, slab allocation, and GC.
 Furthermore, it resolves several corner-case bugs caught by xfstests, as
 well as issues related to 16KB page support and f2fs_enable_checkpoint.
 
 Enhancement:
  - wrap ASCII tables in literal blocks to fix LaTeX build
  - optimize trace_f2fs_write_checkpoint with enums
  - support to show curseg.next_blkoff in debugfs
  - add a sysfs entry to show max open zones
  - add fadvise tracepoint
  - use global inline_xattr_slab instead of per-sb slab cache
  - set default valid_thresh_ratio to 80 for zoned devices
  - maintain one time GC mode is enabled during whole zoned GC cycle
 
 Bug fix:
  - ensure node page reads complete before f2fs_put_super() finishes
  - fix to not account invalid blocks in get_left_section_blocks()
  - revert summary entry count from 2048 to 512 in 16kb block support
  - fix to detect recoverable inode during dryrun of find_fsync_dnodes()
  - fix age extent cache insertion skip on counter overflow
  - Add sanity checks before unlinking and loading inodes
  - ensure minimum trim granularity accounts for all devices
  - block cache/dio write during f2fs_enable_checkpoint()
  - fix to propagate error from f2fs_enable_checkpoint()
  - invalidate dentry cache on failed whiteout creation
  - fix to avoid updating compression context during writeback
  - fix to avoid updating zero-sized extent in extent cache
  - fix to avoid potential deadlock

Merge tag 'f2fs-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This series focuses on minor clean-ups and performance optimizations
  across sysfs, documentation, debugfs, tracepoints, slab allocation,
  and GC. Furthermore, it resolves several corner-case bugs caught by
  xfstests, as well as issues related to 16KB page support and
  f2fs_enable_checkpoint.

  Enhancement:
   - wrap ASCII tables in literal blocks to fix LaTeX build
   - optimize trace_f2fs_write_checkpoint with enums
   - support to show curseg.next_blkoff in debugfs
   - add a sysfs entry to show max open zones
   - add fadvise tracepoint
   - use global inline_xattr_slab instead of per-sb slab cache
   - set default valid_thresh_ratio to 80 for zoned devices
   - maintain one time GC mode is enabled during whole zoned GC cycle

  Bug fix:
   - ensure node page reads complete before f2fs_put_super() finishes
   - do not account invalid blocks in get_left_section_blocks()
   - revert summary entry count from 2048 to 512 in 16kb block support
   - detect recoverable inode during dryrun of find_fsync_dnodes()
   - fix age extent cache insertion skip on counter overflow
   - add sanity checks before unlinking and loading inodes
   - ensure minimum trim granularity accounts for all devices
   - block cache/dio write during f2fs_enable_checkpoint()
   - propagate error from f2fs_enable_checkpoint()
   - invalidate dentry cache on failed whiteout creation
   - avoid updating compression context during writeback
   - avoid updating zero-sized extent in extent cache
   - avoid potential deadlock"

* tag 'f2fs-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (39 commits)
  f2fs: ignore discard return value
  f2fs: optimize trace_f2fs_write_checkpoint with enums
  f2fs: fix to not account invalid blocks in get_left_section_blocks()
  f2fs: support to show curseg.next_blkoff in debugfs
  docs: f2fs: wrap ASCII tables in literal blocks to fix LaTeX build
  f2fs: expand scalability of f2fs mount option
  f2fs: change default schedule timeout value
  f2fs: introduce f2fs_schedule_timeout()
  f2fs: use memalloc_retry_wait() as much as possible
  f2fs: add a sysfs entry to show max open zones
  f2fs: wrap all unusable_blocks_per_sec code in CONFIG_BLK_DEV_ZONED
  f2fs: simplify list initialization in f2fs_recover_fsync_data()
  f2fs: revert summary entry count from 2048 to 512 in 16kb block support
  f2fs: fix to detect recoverable inode during dryrun of find_fsync_dnodes()
  f2fs: fix return value of f2fs_recover_fsync_data()
  f2fs: add fadvise tracepoint
  f2fs: fix age extent cache insertion skip on counter overflow
  f2fs: Add sanity checks before unlinking and loading inodes
  f2fs: Rename f2fs_unlink exit label
  f2fs: ensure minimum trim granularity accounts for all devices
  ...
2025-12-09 12:06:20 +09:00
Linus Torvalds cfd4039213 io_uring-6.19-20251208

Merge tag 'io_uring-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:
 "Followup set of fixes for io_uring for this merge window. These are
  either later fixes, or cleanups that don't make sense to defer. This
  pull request contains:

   - Fix for a recent regression in io-wq worker creation

   - Tracing cleanup

   - Use READ_ONCE/WRITE_ONCE consistently for ring mapped kbufs. Mostly
     for documentation purposes, indicating that they are shared with
     userspace

   - Fix for POLL_ADD losing a completion, if the request is updated and
     now is triggerable - e.g., if POLLIN is set with the update and the
     polled file is readable

   - In conjunction with the above fix, also unify how poll wait queue
     entries are deleted with the head update. We had 3 different spots
     doing both the list deletion and head write, with one of them
     nicely documented. Abstract that into a helper and use it
     consistently

   - Small series from Joanne fixing an issue with buffer cloning, and
     cleaning up the arg validation"

* tag 'io_uring-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/poll: unify poll waitqueue entry and list removal
  io_uring/kbuf: use WRITE_ONCE() for userspace-shared buffer ring fields
  io_uring/kbuf: use READ_ONCE() for userspace-mapped memory
  io_uring/rsrc: fix lost entries after cloned range
  io_uring/rsrc: rename misleading src_node variable in io_clone_buffers()
  io_uring/rsrc: clean up buffer cloning arg validation
  io_uring/trace: rename io_uring_queue_async_work event "rw" field
  io_uring/io-wq: always retry worker create on ERESTART*
  io_uring/poll: correctly handle io_poll_add() return value on update
2025-12-09 09:07:28 +09:00
Linus Torvalds 7203ca412f Significant patch series in this merge are as follows:
 - The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from
   Uladzislau Rezki reworks the vmalloc() code to support non-blocking
   allocations (GFP_ATOMIC, GFP_NOWAIT).
 
 - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes
   a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited
   across fork/exec.
 
 - The 4 patch series "mm/zswap: misc cleanup of code and documentations"
   from SeongJae Park does some light maintenance work on the zswap code.
 
 - The 5 patch series "mm/page_owner: add debugfs files 'show_handles'
   and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the
   /sys/kernel/debug/page_owner debug feature.  It adds unique identifiers
   to differentiate the various stack traces so that userspace monitoring
   tools can better match stack traces over time.
 
 - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua
   Hahn makes some minor alterations to the page allocator's per-cpu-pages
   feature.
 
 - The 2 patch series "Improve UFFDIO_MOVE scalability by removing
   anon_vma lock" from Lokesh Gidra addresses a scalability issue in
   userfaultfd's UFFDIO_MOVE operation.
 
 - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from
   Sabyrzhan Tasbolatov performs some cleanup in the KASAN code.
 
 - The 2 patch series "drivers/base/node: fold node register and
   unregister functions" from Donet Tom cleans up the NUMA node handling
   code a little.
 
 - The 4 patch series "mm: some optimizations for prot numa" from Kefeng
   Wang provides some cleanups and small optimizations to the NUMA
   allocation hinting code.
 
 - The 5 patch series "mm/page_alloc: Batch callers of
   free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at
   boot on large machines.  These were causing (harmless) softlockup
   warnings.
 
 - The 2 patch series "optimize the logic for handling dirty file folios
   during reclaim" from Baolin Wang removes some now-unnecessary work from
   page reclaim.
 
 - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg
   per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning
   feature.
 
 - The 2 patch series "mm/damon: fixes for address alignment issues in
   DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT
   and DAMON_RECLAIM with certain userspace configuration.
 
 - The 15 patch series "expand mmap_prepare functionality, port more
   users" from Lorenzo Stoakes enhances the new(ish)
   file_operations.mmap_prepare() method and ports additional callsites
   from the old ->mmap() over to ->mmap_prepare().
 
 - The 8 patch series "Fix stale IOTLB entries for kernel address space"
   from Lu Baolu fixes a bug (and possible security issue on non-x86) in
   the IOMMU code.  In some situations the IOMMU could be left hanging onto
   a stale kernel pagetable entry.
 
 - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()"
   from Wei Yang cleans up and optimizes the folio splitting code.
 
 - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui
   Song implements some cleanups and a minor fix in the swap discard code.
 
 - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae
   Park does as advertised.
 
 - The 9 patch series "mm/damon: support pin-point targets removal" from
   SeongJae Park permits userspace to remove a specific monitoring target
   in the middle of the current targets list.
 
 - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h"
   from Harry Yoo implements a couple of cleanups related to mm header file
   inclusion.
 
 - The 2 patch series "mm/swapfile.c: select swap devices of default
   priority round robin" from Baoquan He improves the selection of swap
   devices for NUMA machines.
 
 - The 3 patch series "mm: Convert memory block states (MEM_*) macros to
   enums" from Israel Batista changes the memory block labels from macros
   to enums so they will appear in kernel debug info.
 
 - The 3 patch series "ksm: perform a range-walk to jump over holes in
   break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM
   unmerges an address range.
 
 - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests"
   from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON
   userspace unit tests.
 
 - The 2 patch series "some cleanups for pageout()" from Baolin Wang
   cleans up a couple of minor things in the page scanner's
   writeback-for-eviction code.
 
 - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from
   Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file.
 
 - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from
   Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps
   and improves the mergeability of guarded VMAs.
 
 - The 2 patch series "mm: perform guard region install/remove under VMA
   lock" from Lorenzo Stoakes reduces mmap lock contention for callers
   performing VMA guard region operations.
 
 - The 2 patch series "vma_start_write_killable" from Matthew Wilcox
   starts work in permitting applications to be killed when they are
   waiting on a read_lock on the VMA lock.
 
 - The 11 patch series "mm/damon/tests: add more tests for online
   parameters commit" from SeongJae Park adds additional userspace testing
   of DAMON's "commit" feature.
 
 - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does
   that.
 
 - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo
   Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when
   that VMA is merged with another.
 
 - The 16 patch series "mm: support device-private THP" from Balbir Singh
   introduces support for Transparent Huge Page (THP) migration in zone
   device-private memory.
 
 - The 3 patch series "Optimize folio split in memory failure" from Zi
   Yan optimizes folio split operations in the memory failure code.
 
 - The 2 patch series "mm/huge_memory: Define split_type and consolidate
   split support checks" from Wei Yang provides some more cleanups in the
   folio splitting code.
 
 - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap
   entries, introduce leaf entries" from Lorenzo Stoakes cleans up our
   handling of pagetable leaf entries by introducing the concept of
   'software leaf entries', of type softleaf_t.
 
 - The 4 patch series "reparent the THP split queue" from Muchun Song
   reparents the THP split queue to its parent memcg.  This is in
   preparation for addressing the long-standing "dying memcg" problem,
   wherein dead memcg's linger for too long, consuming memory resources.
 
 - The 3 patch series "unify PMD scan results and remove redundant
   cleanup" from Wei Yang does a little cleanup in the hugepage collapse
   code.
 
 - The 6 patch series "zram: introduce writeback bio batching" from
   Sergey Senozhatsky improves zram writeback efficiency by introducing
   batched bio writeback support.
 
 - The 4 patch series "memcg: cleanup the memcg stats interfaces" from
   Shakeel Butt cleans up our handling of the interrupt safety of some
   memcg stats.
 
 - The 4 patch series "make vmalloc gfp flags usage more apparent" from
   Vishal Moola cleans up vmalloc's handling of incoming GFP flags.
 
 - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V"
   from Chunyan Zhang teaches soft dirty and userfaultfd write protect
   tracking to use RISC-V's Svrsw60t59b extension.
 
 - The 5 patch series "mm: swap: small fixes and comment cleanups" from
   Youngjun Park fixes a small bug and cleans up some of the swap code.
 
 - The 4 patch series "initial work on making VMA flags a bitmap" from
   Lorenzo Stoakes starts work on converting the vma struct's flags to a
   bitmap, so we stop running out of them, especially on 32-bit.
 
 - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations"
   from Youngjun Park addresses a possible bug in the swap discard code and
   cleans things up a little.

Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

  "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
     Rework the vmalloc() code to support non-blocking allocations
     (GFP_ATOMIC, GFP_NOWAIT)

  "ksm: fix exec/fork inheritance" (xu xin)
     Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
     inherited across fork/exec

  "mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
     Some light maintenance work on the zswap code

  "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
     Enhance the /sys/kernel/debug/page_owner debug feature by adding
     unique identifiers to differentiate the various stack traces so
     that userspace monitoring tools can better match stack traces over
     time

  "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
     Minor alterations to the page allocator's per-cpu-pages feature

  "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
     Address a scalability issue in userfaultfd's UFFDIO_MOVE operation

  "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)

  "drivers/base/node: fold node register and unregister functions" (Donet Tom)
     Clean up the NUMA node handling code a little

  "mm: some optimizations for prot numa" (Kefeng Wang)
     Cleanups and small optimizations to the NUMA allocation hinting
     code

  "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
     Address long lock hold times at boot on large machines. These were
     causing (harmless) softlockup warnings

  "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
     Remove some now-unnecessary work from page reclaim

  "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
     Enhance the DAMOS auto-tuning feature

  "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
     Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
     configuration

  "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
     Enhance the new(ish) file_operations.mmap_prepare() method and port
     additional callsites from the old ->mmap() over to ->mmap_prepare()

  "Fix stale IOTLB entries for kernel address space" (Lu Baolu)
     Fix a bug (and possible security issue on non-x86) in the IOMMU
     code. In some situations the IOMMU could be left hanging onto a
     stale kernel pagetable entry

  "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
     Clean up and optimize the folio splitting code

  "mm, swap: misc cleanup and bugfix" (Kairui Song)
     Some cleanups and a minor fix in the swap discard code

  "mm/damon: misc documentation fixups" (SeongJae Park)

  "mm/damon: support pin-point targets removal" (SeongJae Park)
     Permit userspace to remove a specific monitoring target in the
     middle of the current targets list

  "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
     A couple of cleanups related to mm header file inclusion

  "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
     Improve the selection of swap devices for NUMA machines

  "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
     Change the memory block labels from macros to enums so they will
     appear in kernel debug info

  "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
     Address an inefficiency when KSM unmerges an address range

  "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
     Fix leaks and unhandled malloc() failures in DAMON userspace unit
     tests

  "some cleanups for pageout()" (Baolin Wang)
     Clean up a couple of minor things in the page scanner's
     writeback-for-eviction code

  "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
     Move hugetlb's sysfs/sysctl handling code into a new file

  "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
     Make the VMA guard regions available in /proc/pid/smaps and
     improve the mergeability of guarded VMAs

  "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
     Reduce mmap lock contention for callers performing VMA guard region
     operations

  "vma_start_write_killable" (Matthew Wilcox)
     Start work on permitting applications to be killed when they are
     waiting on a read_lock on the VMA lock

  "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
     Add additional userspace testing of DAMON's "commit" feature

  "mm/damon: misc cleanups" (SeongJae Park)

  "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
     Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
     VMA is merged with another

  "mm: support device-private THP" (Balbir Singh)
     Introduce support for Transparent Huge Page (THP) migration in zone
     device-private memory

  "Optimize folio split in memory failure" (Zi Yan)

  "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
     Some more cleanups in the folio splitting code

  "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
     Clean up our handling of pagetable leaf entries by introducing the
     concept of 'software leaf entries', of type softleaf_t

  "reparent the THP split queue" (Muchun Song)
     Reparent the THP split queue to its parent memcg. This is in
     preparation for addressing the long-standing "dying memcg" problem,
     wherein dead memcgs linger for too long, consuming memory
     resources

  "unify PMD scan results and remove redundant cleanup" (Wei Yang)
     A little cleanup in the hugepage collapse code

  "zram: introduce writeback bio batching" (Sergey Senozhatsky)
     Improve zram writeback efficiency by introducing batched bio
     writeback support

  "memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
     Clean up our handling of the interrupt safety of some memcg stats

  "make vmalloc gfp flags usage more apparent" (Vishal Moola)
     Clean up vmalloc's handling of incoming GFP flags

  "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
     Teach soft dirty and userfaultfd write protect tracking to use
     RISC-V's Svrsw60t59b extension

  "mm: swap: small fixes and comment cleanups" (Youngjun Park)
     Fix a small bug and clean up some of the swap code

  "initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
     Start work on converting the vma struct's flags to a bitmap, so we
     stop running out of them, especially on 32-bit

  "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
     Address a possible bug in the swap discard code and clean things
     up a little

[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
  register device memory for poison handling") because it looks
  broken to me, I've asked for clarification - Linus ]

* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm: fix vma_start_write_killable() signal handling
  mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
  mm/swapfile: fix list iteration when next node is removed during discard
  fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
  mm/kfence: add reboot notifier to disable KFENCE on shutdown
  memcg: remove inc/dec_lruvec_kmem_state helpers
  selftests/mm/uffd: initialize char variable to Null
  mm: fix DEBUG_RODATA_TEST indentation in Kconfig
  mm: introduce VMA flags bitmap type
  tools/testing/vma: eliminate dependency on vma->__vm_flags
  mm: simplify and rename mm flags function for clarity
  mm: declare VMA flags by bit
  zram: fix a spelling mistake
  mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
  mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  pagemap: update BUDDY flag documentation
  mm: swap: remove scan_swap_map_slots() references from comments
  mm: swap: change swap_alloc_slow() to void
  mm, swap: remove redundant comment for read_swap_cache_async
  mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
  ...
2025-12-05 13:52:43 -08:00
Linus Torvalds 69c5079b49 tracing updates for v6.19:
- Merge branch shared with kprobes on extending trace options
 
   The trace options were defined by a 32 bit variable. This limits the
   tracing instances to have a total of 32 different options. As that limit
   has been hit, and more options are being added, increase the option mask
   to a 64 bit number, doubling the number of options available.
 
   As this is required for the kprobe topic branches as well as the tracing
   topic branch, a separate branch was created and merged into both.
 
 - Make trace_user_fault_read() available for the rest of tracing
 
   The function trace_user_fault_read() is used by trace_marker file read to
   allow reading user space to be done fast and without locking or
   allocations. Make this available so that the system call trace events can
   use it too.
 
 - Have system call trace events read user space values
 
   Now that the system call trace events callbacks are called in a faultable
   context, take advantage of this and read the user space buffers for
   various system calls. For example, show the path name of the openat system
   call instead of just showing the pointer to that path name in user space.
   Also show the contents of the buffer of the write system call. Several
   system call trace events are updated to make tracing into a light weight
   strace tool for all applications in the system.
 
 - Update perf system call tracing to do the same
 
 - Add a config and syscall_user_buf_size file to control the size of the buffer
 
   Limit the amount of data that can be read from user space. The default
   size is 63 bytes but that can be expanded to 165 bytes.
 
 - Allow the persistent ring buffer to print system calls normally
 
   The persistent ring buffer prints trace events by their type and ignores
   the print_fmt. This is because the print_fmt may change from kernel to
   kernel. As the system call output is fixed by the system call ABI itself,
   there's no reason to limit that. This makes reading the system call events
   in the persistent ring buffer much nicer and easier to understand.
 
 - Add options to show text offset to function profiler
 
   The function profiler that counts the number of times a function is hit
   currently lists all functions by their name and offset. But this becomes
   ambiguous when there are several functions with the same name. Add a
   tracing option that changes the output to be that of _text+offset
   instead. Now a user space tool can use this information to map the
   _text+offset to the unique function it is counting.
 
 - Report bad dynamic event command
 
   If a bad command is passed to the dynamic_events file, report it properly
   in the error log.
 
 - Clean up tracer options
 
   Clean up the tracer option code a bit, by removing some useless code and
   also using switch statements instead of a series of if statements.
 
 - Have tracing options be instance specific
 
   Tracers can have their own options (function tracer, irqsoff tracer,
   function graph tracer, etc). But now that the same tracer can be enabled
   in multiple trace instances, their options are still global. The API is
   per instance, thus changing one affects other instances. This isn't even
   consistent, as the options take effect differently depending on when a
   tracer started in an instance.  Make the options for instances only affect
   the instance they are changed under.
 
 - Optimize pid_list lock contention
 
   Whenever the pid_list is read, it uses a spin lock. This happens at every
   sched switch. Taking the lock at sched switch can be removed by instead
   using a seqlock counter.
 
 - Clean up the trace trigger structures
 
   The trigger code uses two different structures to implement a single
   trigger. This was due to trying to reuse code for the two different types
   of triggers (always on trigger, and count limited trigger). But by adding
   a single field to one structure, the other structure could be absorbed
   into the first structure, making the code easier to understand.
 
 - Create a bulk garbage collector for trace triggers
 
   If user space has triggers for several hundreds of events and then removes
   them, it can take several seconds to complete. This is because each
   removal calls the slow tracepoint_synchronize_unregister() that can take
   hundreds of milliseconds to complete. Instead, create a helper thread that
   will do the clean up. When a trigger is removed, it will create the
   kthread if it isn't already created, and then add the trigger to a llist.
   The kthread will take the items off the llist, call
   tracepoint_synchronize_unregister(), and then remove the items it took
   off. It will then check if there's more items to free before sleeping.
 
   This lets user space finish removing all these triggers in less than a
   second.
 
 - Allow function tracing of some of the tracing infrastructure code
 
   Because the tracing code can cause recursion issues if it is traced by the
   function tracer, the entire tracing directory disables function tracing.
   But not all of tracing causes issues if it is traced. Namely, the event
   tracing code. Add a config that enables some of the tracing code to be
   traced to help in debugging it. Note, when this is enabled, it does add
   noise to general function tracing, especially if events are enabled as
   well (which is a common case).
 
 - Add boot-time backup instance for persistent buffer
 
   The persistent ring buffer is used mostly for kernel crash analysis in the
   field. One issue is that if there's a crash, the data in the persistent
   ring buffer must be read before tracing can begin using it. This slows
   down the boot process. Once tracing starts in the persistent ring buffer,
   the old data must be freed and the addresses no longer match and old
   events can't be in the buffer with new events.
 
   Create a way to create a backup buffer that copies the persistent ring
   buffer at boot up. Then after a crash, the always on tracer can begin
   immediately as well as the normal boot process while the crash analysis
   tooling uses the backup buffer. After the backup buffer is finished being
   read, it can be removed.
 
 - Enable function graph args and return address options at the same time
 
   Currently, when reading of arguments in the function graph tracer is
   enabled, the option to record the parent function in the entry event can
   not be enabled. Update the code so that it can.
 
 - Add new struct_offset() helper macro
 
   Add a new macro that takes a pointer to a structure and a name of one of
   its members and it will return the offset of that member. This allows the
   ring buffer code to simplify the following:
 
   From:  size = struct_size(entry, buf, cnt - sizeof(entry->id));
     To:  size = struct_offset(entry, id) + cnt;
 
   There should be other simplifications that this macro can help out with as
   well.
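
The bulk garbage collector item above can be illustrated with a minimal userspace sketch in C11. It is not the kernel code: `pending_node`, `pending_list`, `slow_sync()` and the other names are hypothetical stand-ins, with `slow_sync()` playing the role of the expensive tracepoint_synchronize_unregister() call and a lock-free list playing the role of the llist. The point it tries to show is that the consumer detaches the whole batch at once and pays the slow wait once per batch rather than once per removed item.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a trigger queued for deferred freeing. */
struct pending_node {
	struct pending_node *next;
	int released;
};

/* Lock-free singly linked list head (a userspace analogue of an llist). */
struct pending_list {
	_Atomic(struct pending_node *) first;
};

/* Counts how often the simulated slow synchronization ran. */
static int sync_calls;

/* Placeholder for the expensive wait; here it only counts calls. */
static void slow_sync(void)
{
	sync_calls++;
}

static void pending_init(struct pending_list *list)
{
	atomic_init(&list->first, (struct pending_node *)NULL);
}

/* Producer side: push one node with a compare-and-swap loop. */
static void pending_add(struct pending_list *list, struct pending_node *node)
{
	struct pending_node *old = atomic_load(&list->first);

	assert(node != NULL);
	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&list->first, &old, node));
}

/*
 * Consumer side: detach everything queued so far in one atomic step,
 * wait once for the whole batch, then release each node.
 * Returns the number of nodes handled.
 */
static int pending_collect(struct pending_list *list)
{
	struct pending_node *node =
		atomic_exchange(&list->first, (struct pending_node *)NULL);
	int count = 0;

	if (!node)
		return 0;

	slow_sync();
	while (node) {
		struct pending_node *next = node->next;

		node->released = 1;
		node = next;
		count++;
	}
	return count;
}
```

In this sketch, queueing three nodes and calling `pending_collect()` once handles all three while `slow_sync()` runs a single time; a second call on the empty list returns 0 without waiting again.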
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaS9xqxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qj6tAQD4MR1lsE3XpH09asO4CDDfhbtRSQVD
 o8bVKVihWx/j5gD/XezjqE2Q2+DO6dhnsQY6pbtNdXoKgaMuQJGA+dvPsQc=
 =HilC
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Extend tracing option mask to 64 bits

   The trace options were defined by a 32 bit variable. This limits the
   tracing instances to have a total of 32 different options. As that
   limit has been hit, and more options are being added, increase the
   option mask to a 64 bit number, doubling the number of options
   available.

   As this is required for the kprobe topic branches as well as the
   tracing topic branch, a separate branch was created and merged into
   both.

 - Make trace_user_fault_read() available for the rest of tracing

   The function trace_user_fault_read() is used by trace_marker file
   read to allow reading user space to be done fast and without locking
   or allocations. Make this available so that the system call trace
   events can use it too.

 - Have system call trace events read user space values

   Now that the system call trace events callbacks are called in a
   faultable context, take advantage of this and read the user space
   buffers for various system calls. For example, show the path name of
   the openat system call instead of just showing the pointer to that
   path name in user space. Also show the contents of the buffer of the
   write system call. Several system call trace events are updated to
   make tracing into a light weight strace tool for all applications in
   the system.

 - Update perf system call tracing to do the same

 - Add a config and syscall_user_buf_size file to control the size of
   the buffer

   Limit the amount of data that can be read from user space. The
   default size is 63 bytes but that can be expanded to 165 bytes.

 - Allow the persistent ring buffer to print system calls normally

   The persistent ring buffer prints trace events by their type and
   ignores the print_fmt. This is because the print_fmt may change from
   kernel to kernel. As the system call output is fixed by the system
   call ABI itself, there's no reason to limit that. This makes reading
   the system call events in the persistent ring buffer much nicer and
   easier to understand.

 - Add options to show text offset to function profiler

   The function profiler that counts the number of times a function is
   hit currently lists all functions by their name and offset. But this
   becomes ambiguous when there are several functions with the same
   name.

   Add a tracing option that changes the output to be that of
   '_text+offset' instead. Now a user space tool can use this
   information to map the '_text+offset' to the unique function it is
   counting.

 - Report bad dynamic event command

   If a bad command is passed to the dynamic_events file, report it
   properly in the error log.

 - Clean up tracer options

   Clean up the tracer option code a bit, by removing some useless code
   and also using switch statements instead of a series of if
   statements.

 - Have tracing options be instance specific

   Tracers can have their own options (function tracer, irqsoff tracer,
   function graph tracer, etc). But now that the same tracer can be
   enabled in multiple trace instances, their options are still global.
   The API is per instance, thus changing one affects other instances.
   This isn't even consistent, as the options take effect differently
   depending on when a tracer started in an instance. Make the options
   for instances only affect the instance they are changed under.

 - Optimize pid_list lock contention

   Whenever the pid_list is read, it uses a spin lock. This happens at
   every sched switch. Taking the lock at sched switch can be removed by
   instead using a seqlock counter.

 - Clean up the trace trigger structures

   The trigger code uses two different structures to implement a single
   trigger. This was due to trying to reuse code for the two different
   types of triggers (always on trigger, and count limited trigger). But
   by adding a single field to one structure, the other structure could
   be absorbed into the first structure, making the code easier to
   understand.

 - Create a bulk garbage collector for trace triggers

   If user space has triggers for several hundreds of events and then
   removes them, it can take several seconds to complete. This is
   because each removal calls tracepoint_synchronize_unregister() that
   can take hundreds of milliseconds to complete.

   Instead, create a helper thread that will do the clean up. When a
   trigger is removed, it will create the kthread if it isn't already
   created, and then add the trigger to a llist. The kthread will take
   the items off the llist, call tracepoint_synchronize_unregister(),
   and then remove the items it took off. It will then check if there's
   more items to free before sleeping.

   This lets user space finish removing all these triggers in less
   than a second.

 - Allow function tracing of some of the tracing infrastructure code

   Because the tracing code can cause recursion issues if it is traced
   by the function tracer, the entire tracing directory disables function
   tracing. But not all of tracing causes issues if it is traced.
   Namely, the event tracing code. Add a config that enables some of the
   tracing code to be traced to help in debugging it. Note, when this is
   enabled, it does add noise to general function tracing, especially if
   events are enabled as well (which is a common case).

 - Add boot-time backup instance for persistent buffer

   The persistent ring buffer is used mostly for kernel crash analysis
   in the field. One issue is that if there's a crash, the data in the
   persistent ring buffer must be read before tracing can begin using
   it. This slows down the boot process. Once tracing starts in the
   persistent ring buffer, the old data must be freed and the addresses
   no longer match and old events can't be in the buffer with new
   events.

   Create a way to create a backup buffer that copies the persistent
   ring buffer at boot up. Then after a crash, the always on tracer can
   begin immediately as well as the normal boot process while the crash
   analysis tooling uses the backup buffer. After the backup buffer is
   finished being read, it can be removed.

 - Enable function graph args and return address options at the same
   time

   Currently, when reading of arguments in the function graph tracer
   is enabled, the option to record the parent function in the entry
   event can not be enabled. Update the code so that it can.

 - Add new struct_offset() helper macro

   Add a new macro that takes a pointer to a structure and a name of one
   of its members and it will return the offset of that member. This
   allows the ring buffer code to simplify the following:

   From:  size = struct_size(entry, buf, cnt - sizeof(entry->id));
     To:  size = struct_offset(entry, id) + cnt;

   There should be other simplifications that this macro can help out
   with as well
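
The From/To simplification above can be checked with a small userspace sketch. The two macros below are stand-ins written for illustration (the kernel's own definitions may differ in detail), and `struct demo_entry` is a hypothetical layout with an `id` member immediately followed by a flexible array, with no padding between them. Under that assumption, and with `cnt` counting the `id` bytes plus the payload, both expressions yield the same size.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for illustration; not the kernel headers. */
#define struct_offset(p, member) offsetof(__typeof__(*(p)), member)
#define struct_size(p, member, count) \
	(sizeof(*(p)) + (count) * sizeof((p)->member[0]))

/* Hypothetical entry: fixed header, an id, then a flexible payload. */
struct demo_entry {
	unsigned int hdr;
	unsigned int id;
	char buf[];
};

/* Old form: whole struct plus the bytes of cnt that follow the id. */
static size_t size_old(size_t cnt)
{
	struct demo_entry *entry = NULL;

	return struct_size(entry, buf, cnt - sizeof(entry->id));
}

/* New form: everything before the id, plus cnt bytes starting at the id. */
static size_t size_new(size_t cnt)
{
	struct demo_entry *entry = NULL;

	return struct_offset(entry, id) + cnt;
}
```

Neither helper dereferences `entry`; the pointer is only used inside `sizeof` and `offsetof`, which are evaluated at compile time. If a structure had padding between the member and the flexible array, the two forms would no longer agree, so the equivalence holds only for layouts like the one assumed here.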

* tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (42 commits)
  overflow: Introduce struct_offset() to get offset of member
  function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously
  tracing: Add boot-time backup of persistent ring buffer
  ftrace: Allow tracing of some of the tracing code
  tracing: Use strim() in trigger_process_regex() instead of skip_spaces()
  tracing: Add bulk garbage collection of freeing event_trigger_data
  tracing: Remove unneeded event_mutex lock in event_trigger_regex_release()
  tracing: Merge struct event_trigger_ops into struct event_command
  tracing: Remove get_trigger_ops() and add count_func() from trigger ops
  tracing: Show the tracer options in boot-time created instance
  ftrace: Avoid redundant initialization in register_ftrace_direct
  tracing: Remove unused variable in tracing_trace_options_show()
  fgraph: Make fgraph_no_sleep_time signed
  tracing: Convert function graph set_flags() to use a switch() statement
  tracing: Have function graph tracer option sleep-time be per instance
  tracing: Move graph-time out of function graph options
  tracing: Have function graph tracer option funcgraph-irqs be per instance
  trace/pid_list: optimize pid_list->lock contention
  tracing: Have function graph tracer define options per instance
  tracing: Have function tracer define options per instance
  ...
2025-12-05 09:51:37 -08:00
Linus Torvalds fa5ef10561 spi: Updates for v6.19
This release is almost entirely new drivers, with a couple of small
 changes in generic code.  The biggest individual update is a rename of
 the existing Microchip driver and the addition of a new driver for the
 silicon SPI controller in their PolarFire SoCs.  The overlap between the
 soft IP supported by the current driver and this new one is regrettably
 all in the IP and not in the register interface offered to software.
 
  - Add a time offset parameter for offloads, allowing them to be defined
    in relation to each other.  This is useful for IIO type applications
    where you trigger an operation then read the result after a delay.
  - Add a tracepoint for flash exec_ops, bringing the flash support more
    in line with the debuggability of vanilla SPI.
  - Support for Airoha EN7523, Arduino MCUs, Aspeed AST2700, Microchip
    PolarFire SPI controllers, NXP i.MX51 ECSPI target mode, Qualcomm
    IPQ5414 and IPQ5332, Renesas RZ/T2H, RZ/V2N and RZ/N2H and SpacemiT K1
    QuadSPI.
 
 There's also a small set of ASoC cleanups that I mistakenly applied to
 the SPI tree and then put more stuff on top of before it was brought to
 my attention, sorry about that.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEreZoqmdXGLWf4p/qJNaLcl1Uh9AFAmkt8qUACgkQJNaLcl1U
 h9COSAf/buDi3hXYykhEbSlwl839u1XhCU9d8eJF8GhlWR87JeJCb8ou7RdMOQhE
 HFehdoD2d5EcPc169cDp2uis8LovGZIPWuzh3D83eZpOOQTLMaQYBrDVNquBoKOR
 WVFPPy9X/kb/IR63vM0b5Xv9K6l3ud4yauIsa0ingl9pZi5m2fB3ZEOx9siYYwAn
 4fxv43jESbwdTfx33Yc4CkzctZEKuqI2JgLNe/mJZQsdYhS/nLvmDwiZ69k6b4ac
 QSHQkP6i+fQzogcbip2z8dA3IUEhDjNQdBbtzmIot8Qbg7zXJkXpAlx2Wstw7Lt8
 vLTUY/EHqKh39zok5GECq6E2R6W41w==
 =Uw2/
 -----END PGP SIGNATURE-----

Merge tag 'spi-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi updates from Mark Brown:
 "This release is almost entirely new drivers, with a couple of small
  changes in generic code.

  The biggest individual update is a rename of the existing Microchip
  driver and the addition of a new driver for the silicon SPI controller
  in their PolarFire SoCs. The overlap between the soft IP supported by
  the current driver and this new one is regrettably all in the IP and
  not in the register interface offered to software.

   - Add a time offset parameter for offloads, allowing them to be
     defined in relation to each other. This is useful for IIO type
     applications where you trigger an operation then read the result
     after a delay.

   - Add a tracepoint for flash exec_ops, bringing the flash support
     more in line with the debuggability of vanilla SPI.

   - Support for Airoha EN7523, Arduino MCUs, Aspeed AST2700, Microchip
     PolarFire SPI controllers, NXP i.MX51 ECSPI target mode, Qualcomm
     IPQ5414 and IPQ5332, Renesas RZ/T2H, RZ/V2N and RZ/N2H and SpacemiT
     K1 QuadSPI.

  There's also a small set of ASoC cleanups that I mistakenly applied to
  the SPI tree and then put more stuff on top of before it was brought
  to my attention, sorry about that"

* tag 'spi-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (80 commits)
  spi: microchip-core: Refactor FIFO read and write handlers
  spi: ch341: fix out-of-bounds memory access in ch341_transfer_one
  spi: microchip-core: Remove unneeded PM related macro
  spi: microchip-core: Use SPI_MODE_X_MASK
  spi: microchip-core: Utilise temporary variable for struct device
  spi: microchip-core: Replace dead code (-ENOMEM error message)
  spi: microchip-core: use min() instead of min_t()
  spi: dt-bindings: airoha: add compatible for EN7523
  spi: airoha-snfi: en7523: workaround flash damaging if UART_TXD was short to GND
  spi: dt-bindings: renesas,rzv2h-rspi: Document RZ/V2N SoC support
  spi: dt-bindings: renesas,rzv2h-rspi: Document RZ/V2N SoC support
  spi: microchip: Enable compile-testing for FPGA SPI controllers
  spi: Fix potential uninitialized variable in probe()
  spi: rzv2h-rspi: add support for RZ/T2H and RZ/N2H
  spi: dt-bindings: renesas,rzv2h-rspi: document RZ/T2H and RZ/N2H
  spi: rzv2h-rspi: add support for loopback mode
  spi: rzv2h-rspi: add support for variable transfer clock
  spi: rzv2h-rspi: add support for using PCLK for transfer clock
  spi: rzv2h-rspi: make transfer clock rate finding chip-specific
  spi: rzv2h-rspi: avoid recomputing transfer frequency
  ...
2025-12-04 11:24:24 -08:00
Linus Torvalds 2aa680df68 sound updates for 6.19-rc1
The majority of changes at this time were about ASoC with a lot of
 code refactoring work.  From the functionality POV, there isn't much
 to see, but we have a wide range of device-specific fixes and updates.
 Here are some highlights:
 
 - Continued ASoC API cleanup work, spanned over many files
 - Added a SoundWire SDCA generic class driver with regmap support
 - Enhancements and fixes for Cirrus, Intel, Maxim and Qualcomm.
 - Support for ASoC Allwinner A523, Mediatek MT8189, Qualcomm QCM2290,
   QRB2210 and SM6115, SpacemiT K1, and TI TAS2568, TAS5802, TAS5806,
   TAS5815, TAS5828 and TAS5830
 - Usual HD-audio and USB-audio quirks and fixups
 - Support for Onkyo SE-300PCIE, TASCAM IF-FW/DM MkII
 
 Some gpiolib changes for shared GPIOs are included along with this PR
 for covering ASoC drivers changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAmkwQ2UOHHRpd2FpQHN1
 c2UuZGUACgkQLtJE4w1nLE8tIRAAjCHdIlMejNTCzGRlhsRSQVD6bo1wASXcjfJ6
 COH84akbnA0oT5z7H7JnzTOmfjzxLJpwC8j6IpZ/9CQazanT5IIVE41FZquXZ1JB
 RhQVzuGw9Pl4MaYVdFuRqIXjiP+msY1jpbo9/QXQo8D/B41wpmVTgzkFVW2rxPMy
 0aBOu4Wpu+11aBpNBy6dXDiKQ5kDqn7zOLoFGgcf5wlFIvOGZJ0Wg/i0kvCjl+ia
 xYiP+/F6xKOyTY1c98iqExbKzSSy4ddGFUwrkevm6bWpu8hkXiL1O0zMWOe769x6
 0wy0b5zvsbtOQOxbtK5+8gdjJw7ycgDa441hDtsaXBBROYZEV3D6+XZJCfq8Tz8F
 +vLH5lfZeLg+59eqt3GOMGlwBfuhH91qzukIYG3q9EQGOkNkZ19ySJnFMLom68Ei
 TCfNzh/ggSGXA9qAmfBcPoizgC/j9o+v4kbLRQteuRRWxES1FxqeN9Ba3d5JcHT3
 BQpz1bhUli73477D6voPcwXLiQlM+Alv4QUKTFr2nUnWUQKwMvkZFwiv2jTqVdDf
 f71Usv7xdyM7XijgmXuLg+3n0UvCwUPBB+bv3a1Bu7G4iTB1deNKU8t9k+sBJpcX
 aRs5ych3MiU/zG+KRMB5FEx31KzpKu+Kk9NQ207/1HLaNhTgD3cg2wS3T3qdRUPv
 Yf6wFHs=
 =1JUI
 -----END PGP SIGNATURE-----

Merge tag 'sound-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound updates from Takashi Iwai:
 "The majority of changes at this time were about ASoC with a lot of
  code refactoring work. From the functionality POV, there isn't much
  to see, but we have a wide range of device-specific fixes and updates.
  Here are some highlights:

   - Continued ASoC API cleanup work, spanned over many files

   - Added a SoundWire SDCA generic class driver with regmap support

   - Enhancements and fixes for Cirrus, Intel, Maxim and Qualcomm.

   - Support for ASoC Allwinner A523, Mediatek MT8189, Qualcomm QCM2290,
     QRB2210 and SM6115, SpacemiT K1, and TI TAS2568, TAS5802, TAS5806,
     TAS5815, TAS5828 and TAS5830

   - Usual HD-audio and USB-audio quirks and fixups

   - Support for Onkyo SE-300PCIE, TASCAM IF-FW/DM MkII

  Some gpiolib changes for shared GPIOs are included along with this PR
  for covering ASoC drivers changes"

* tag 'sound-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (739 commits)
  ALSA: hda/realtek: Add PCI SSIDs to HP ProBook quirks
  ALSA: usb-audio: Simplify with usb_endpoint_max_periodic_payload()
  ALSA: hda/realtek: fix mute/micmute LEDs don't work for more HP laptops
  ALSA: rawmidi: Fix inconsistent indenting warning reported by smatch
  ALSA: dice: fix buffer overflow in detect_stream_formats()
  ASoC: codecs: Modify awinic amplifier dsp read and write functions
  ASoC: SDCA: Fixup some more Kconfig issues
  ASoC: cs35l56: Log a message if firmware is missing
  ASoC: nau8325: Delete a stray tab
  firmware: cs_dsp: Add test cases for client_ops == NULL
  firmware: cs_dsp: Don't require client to provide a struct cs_dsp_client_ops
  ASoC: fsl_micfil: Set channel range control
  ASoC: fsl_micfil: Add default quality for different platforms
  ASoC: intel: sof_sdw: Add codec_info for cs42l45
  ASoC: sdw_utils: Add cs42l45 support functions
  ASoC: intel: sof_sdw: Add ability to have auxiliary devices
  ASoC: sdw_utils: Move codec_name to dai info
  ASoC: sdw_utils: Add codec_conf for every DAI
  ASoC: SDCA: Add terminal type into input/output widget name
  ASoC: SDCA: Align mute controls to ALSA expectations
  ...
2025-12-04 10:08:40 -08:00
Caleb Sander Mateos f345be751b io_uring/trace: rename io_uring_queue_async_work event "rw" field
The io_uring_queue_async_work tracepoint event stores an int rw field
that represents whether the work item is hashed. Rename it to "hashed"
and change its type to bool to more accurately reflect its value.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:18:02 -07:00
Linus Torvalds fbeea4db51 New features and improvements for the ext4 file system
* Optimize online defragmentation by using folios instead of individual
   buffer heads
 * Improve error codes stored in the superblock when the journal aborts
 * Minor cleanups and clarifications in ext4_map_blocks()
 * Add documentation of the casefold and encrypt flags
 * Add support for file systems with a blocksize greater than the pagesize
 * Improve performance by enabling caching of the fact that an inode does
   not have a POSIX ACL.
 
 Various Bug Fixes
 
 * Fix false positive complaints from smatch
 * Fix error code which is returned by ext4fs_dirhash() when Siphash is
   used without the encryption key
 * Fix races when writing to inline data files which could trigger a BUG
 * Fix potential NULL dereference when there is a corrupt file system with
   an extended attribute value stored in an inode
 * Fix false positive lockdep report when syzbot uses ext4 and ocfs2 together
 * Fix false positive reported by DEPT by adjusting lock annotation
 * Avoid a potential BUG_ON in jbd2 when a file system is massively corrupted
 * Fix a WARN_ON when superblock is corrupted with a non-NUL-terminated
   mount options field
 * Add check if the userspace passes in a non-NUL-terminated mount options
   field to EXT4_IOC_SET_TUNE_SB_PARAM
 * Fix a potential journal checksum failure when a file system is copied while
   it is mounted read-only
 * Fix a potential orphan file tracking error which only showed
   on 32-bit systems
 * Fix assertion checks in mballoc (which have to be explicitly enabled by
   manually enabling AGGRESSIVE_CHECKS and recompiling)
 * Avoid complaining about overly large orphan files created by mke2fs
   with file systems with a 64k block size
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmktFUsACgkQ8vlZVpUN
 gaPA/Af/WMIK5fHG687JOQUoAnxJ5A89aWrJM7LXIXhfyy/hQ2x/dp7rRHuis+WQ
 1AcB7tRN4EuAx+tU8rBKsh7f+xQRkhdl3FHjxAyZdNaVTS/iYh121lMeSDqBVP0V
 tRSk+9DoahueYBJdHwgtFBd7ZHSKF2haqDW1FIYvFZFWZR1NEzNaoB9O4NO5D1dH
 RN7nKB/eggOnPelP8FtD83yY3lwCMmxanqZuFCTn9AFcn/o1yo0w+8L+nHKO0n0Z
 BMbIMLaJ4oJv6G/4vA99btGJjMHKqBqwNNh+2Gq81eubpkutTpTgma6npUBSKdl3
 pfWhHzVaa3vVj1Mxc0j4qb1fzlCH8w==
 =abb+
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "New features and improvements for the ext4 file system:
   - Optimize online defragmentation by using folios instead of
     individual buffer heads
   - Improve error codes stored in the superblock when the journal
     aborts
   - Minor cleanups and clarifications in ext4_map_blocks()
   - Add documentation of the casefold and encrypt flags
   - Add support for file systems with a blocksize greater than the
     pagesize
   - Improve performance by caching the fact that an inode
     does not have a POSIX ACL

  Various Bug Fixes:
   - Fix false positive complaints from smatch
   - Fix error code which is returned by ext4fs_dirhash() when Siphash
     is used without the encryption key
   - Fix races when writing to inline data files which could trigger a
     BUG
   - Fix potential NULL dereference when there is a corrupt file system
     with an extended attribute value stored in an inode
   - Fix false positive lockdep report when syzbot uses ext4 and ocfs2
     together
   - Fix false positive reported by DEPT by adjusting lock annotation
   - Avoid a potential BUG_ON in jbd2 when a file system is massively
     corrupted
   - Fix a WARN_ON when superblock is corrupted with a non-NUL-terminated
     mount options field
   - Add check if the userspace passes in a non-NUL-terminated mount
     options field to EXT4_IOC_SET_TUNE_SB_PARAM
   - Fix a potential journal checksum failure when a file system is
     copied while it is mounted read-only
   - Fix a potential orphan file tracking error which only
     showed on 32-bit systems
   - Fix assertion checks in mballoc (which have to be explicitly enabled
     by manually enabling AGGRESSIVE_CHECKS and recompiling)
   - Avoid complaining about overly large orphan files created by mke2fs
     with file systems with a 64k block size"

* tag 'ext4_for_linus-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
  ext4: mark inodes without acls in __ext4_iget()
  ext4: enable block size larger than page size
  ext4: add checks for large folio incompatibilities when BS > PS
  ext4: support verifying data from large folios with fs-verity
  ext4: make data=journal support large block size
  ext4: support large block size in __ext4_block_zero_page_range()
  ext4: support large block size in mpage_prepare_extent_to_map()
  ext4: support large block size in mpage_map_and_submit_buffers()
  ext4: support large block size in ext4_block_write_begin()
  ext4: support large block size in ext4_mpage_readpages()
  ext4: rename 'page' references to 'folio' in multi-block allocator
  ext4: prepare buddy cache inode for BS > PS with large folios
  ext4: support large block size in ext4_mb_init_cache()
  ext4: support large block size in ext4_mb_get_buddy_page_lock()
  ext4: support large block size in ext4_mb_load_buddy_gfp()
  ext4: add EXT4_LBLK_TO_PG and EXT4_PG_TO_LBLK for block/page conversion
  ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion
  ext4: support large block size in ext4_readdir()
  ext4: support large block size in ext4_calculate_overhead()
  ext4: introduce s_min_folio_order for future BS > PS support
  ...
2025-12-03 20:37:15 -08:00
YH Lin 8d1cb17aca f2fs: optimize trace_f2fs_write_checkpoint with enums
This patch optimizes the trace_f2fs_write_checkpoint tracepoint by replacing
its hardcoded phase strings with a new enumeration, f2fs_cp_phase.

1. Defines enum f2fs_cp_phase with values for each checkpoint phase.
2. Updates trace_f2fs_write_checkpoint to accept a u16 phase argument
   instead of a string pointer.
3. Uses __print_symbolic in TP_printk to convert the enum values
   back to their corresponding strings for human-readable trace output.

This change reduces the storage overhead for each trace event
by replacing a variable-length string with a 2-byte integer,
while maintaining the same readable output in ftrace.
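
The storage saving can be illustrated with a small userspace sketch in C. It is not the kernel code: the phase names, struct cp_event, and cp_phase_name() are hypothetical stand-ins, and the lookup table only mimics what __print_symbolic does at print time (the "unknown" fallback is a simplification made for this sketch).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical phases; the real enum f2fs_cp_phase values are not listed here. */
enum cp_phase {
    CP_PHASE_START,
    CP_PHASE_FLUSH,
    CP_PHASE_FINISH,
};

/* Value-to-name pairs, playing the role of the __print_symbolic table. */
struct phase_sym {
    uint16_t val;
    const char *name;
};

static const struct phase_sym cp_phase_syms[] = {
    { CP_PHASE_START,  "start"  },
    { CP_PHASE_FLUSH,  "flush"  },
    { CP_PHASE_FINISH, "finish" },
};

/* What gets recorded per event: a fixed 2-byte integer, not a string. */
struct cp_event {
    uint16_t phase;
};

/* Translate the stored integer back to a readable name at output time. */
static const char *cp_phase_name(uint16_t phase)
{
    size_t n = sizeof(cp_phase_syms) / sizeof(cp_phase_syms[0]);

    for (size_t i = 0; i < n; i++) {
        if (cp_phase_syms[i].val == phase)
            return cp_phase_syms[i].name;
    }
    return "unknown"; /* simplification of this sketch */
}
```

The event record stays at a fixed size regardless of how long the phase name is; the string only exists in the static table that is consulted when the trace is formatted.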

Signed-off-by: YH Lin <yhli@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:06 +00:00
Jaegeuk Kim 2e2e0d679a f2fs: add fadvise tracepoint
This adds a tracepoint in the fadvise call path.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:03 +00:00
Chao Yu 5b5578c3b0 f2fs: fix to access i_size w/ i_size_read()
It is recommended to use i_size_{read,write}() to access and update i_size;
otherwise, we may read a torn value, because the high 32 bits and the low
32 bits of the i_size field are not updated atomically on 32-bit
architectures.
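
A minimal userspace sketch in C of the kind of torn value the helpers are meant to prevent. It assumes, purely for illustration, that the writer stores the low 32 bits before the high 32 bits; the actual store order depends on the compiler and architecture, and torn_read() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Value a reader could observe if it runs after the writer has stored the
 * low 32 bits of the new size but before it has stored the high 32 bits.
 */
static uint64_t torn_read(uint64_t old_size, uint64_t new_size)
{
    uint64_t lo = new_size & 0xffffffffULL;         /* already updated */
    uint64_t hi = old_size & 0xffffffff00000000ULL; /* still the old half */

    return hi | lo;
}
```

For a file growing from 0xffffffff to 0x100000000 bytes, torn_read(0xffffffffULL, 0x100000000ULL) yields 0, a size that is neither the old value nor the new one.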

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-04 02:00:03 +00:00
Linus Torvalds 8f7aa3d3c7 Networking changes for 6.19.
Core & protocols
 ----------------
 
  - Replace busylock at the Tx queuing layer with a lockless list, resulting
    in a 300% (4x) improvement on heavy TX workloads, sending twice the
    number of packets per second, for half the cpu cycles.
 
  - Allow constantly busy flows to migrate to a more suitable CPU/NIC
    queue. Normally we perform queue re-selection when flow comes out
    of idle, but under extreme circumstances the flows may be constantly
    busy. Add sysctl to allow periodic rehashing even if it'd risk packet
    reordering.
 
  - Optimize the NAPI skb cache, make it larger, use it in more paths.
 
  - Attempt returning Tx skbs to the originating CPU (like we already did
    for Rx skbs).
 
  - Various data structure layout and prefetch optimizations from Eric.
 
  - Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly
    quite expensive on recent AMD machines.
 
  - Extend threaded NAPI polling to allow the kthread to busy poll for packets.
 
  - Make MPTCP use Rx backlog processing. This lowers the lock pressure,
    improving the Rx performance.
 
  - Support memcg accounting of MPTCP socket memory.
 
  - Allow admin to opt sockets out of global protocol memory accounting
    (using a sysctl or BPF-based policy). The global limits are a poor fit
    for modern container workloads, where limits are imposed using cgroups.
 
  - Improve heuristics for when to kick off AF_UNIX garbage collection.
 
  - Allow users to control TCP SACK compression, and default to 33% of RTT.
 
  - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily
    aggressive rcvbuf growth and overshoot when the connection RTT is low.
 
  - Preserve skb metadata space across skb_push / skb_pull operations.
 
  - Support for IPIP encapsulation in the nftables flowtable offload.
 
  - Support appending IP interface information to ICMP messages (RFC 5837).
 
  - Support setting max record size in TLS (RFC 8449).
 
  - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
 
  - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
 
  - Let users configure the number of write buffers in SMC.
 
  - Add new struct sockaddr_unsized for sockaddr of unknown length,
    from Kees.
 
  - Some conversions away from the crypto_ahash API, from Eric Biggers.
 
  - Some preparations for slimming down struct page.
 
  - YAML Netlink protocol spec for WireGuard.
 
  - Add a tool on top of YAML Netlink specs/lib for reporting commonly
    computed derived statistics and summarized system state.
 
 Driver API
 ----------
 
  - Add CAN XL support to the CAN Netlink interface.
 
  - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics,
    as defined by the OPEN Alliance's "Advanced diagnostic features
    for 100BASE-T1 automotive Ethernet PHYs" specification.
 
  - Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x).
 
  - Refactor xfrm_input lock to reduce contention when NIC offloads IPsec
    and performs RSS.
 
  - Add info to devlink params whether the current setting is the default
    or a user override. Allow resetting back to default.
 
  - Add standard device stats for PSP crypto offload.
 
  - Leverage DSA frame broadcast to implement simple HSR frame duplication
    for a lot of switches without dedicated HSR offload.
 
  - Add uAPI defines for 1.6Tbps link modes.
 
 Device drivers
 --------------
 
  - Add Motorcomm YT921x gigabit Ethernet switch support.
 
  - Add MUCSE driver for N500/N210 1GbE NIC series.
 
  - Convert drivers to support dedicated ops for timestamping control,
    and away from the direct IOCTL handling. While at it support GET
    operations for PHY timestamping.
 
  - Add (and convert most drivers to) a dedicated ethtool callback
    for reading the Rx ring count.
 
  - Significant refactoring efforts in the STMMAC driver, which supports
    Synopsys turn-key MAC IP integrated into a ton of SoCs.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - support PPS in/out on all pins
    - Intel (100G, ice, idpf):
      - ice: implement standard ethtool and timestamping stats
      - i40e: support setting the max number of MAC addresses per VF
      - iavf: support RSS of GTP tunnels for 5G and LTE deployments
    - nVidia/Mellanox (mlx5):
      - reduce downtime on interface reconfiguration
      - disable being an XDP redirect target by default (same as other
        drivers) to avoid wasting resources if feature is unused
    - Meta (fbnic):
      - add support for Linux-managed PCS on 25G, 50G, and 100G links
    - Wangxun:
      - support Rx descriptor merge, and Tx head writeback
      - support Rx coalescing offload
      - support 25G SFP and 40G QSFP modules
 
  - Ethernet virtual:
    - Google (gve):
      - allow ethtool to configure rx_buf_len
      - implement XDP HW RX Timestamping support for DQ descriptor format
    - Microsoft vNIC (mana):
      - support HW link state events
      - handle hardware recovery events when probing the device
 
  - Ethernet NICs consumer, and embedded:
    - usbnet: add support for Byte Queue Limits (BQL)
    - AMD (amd-xgbe):
      - add device selftests
    - NXP (enetc):
      - add i.MX94 support
    - Broadcom integrated MACs (bcmgenet, bcmasp):
      - bcmasp: add support for PHY-based Wake-on-LAN
    - Broadcom switches (b53):
      - support port isolation
      - support BCM5389/97/98 and BCM63XX ARL formats
    - Lantiq/MaxLinear switches:
      - support bridge FDB entries on the CPU port
      - use regmap for register access
      - allow user to enable/disable learning
      - support Energy Efficient Ethernet
      - support configuring RMII clock delays
      - add tagging driver for MaxLinear GSW1xx switches
    - Synopsys (stmmac):
      - support using the HW clock in free running mode
      - add Eswin EIC7700 support
      - add Rockchip RK3506 support
      - add Altera Agilex5 support
    - Cadence (macb):
      - cleanup and consolidate descriptor and DMA address handling
      - add EyeQ5 support
    - TI:
      - icssg-prueth: support AF_XDP
    - Airoha access points:
      - add missing Ethernet stats and link state callback
      - add AN7583 support
      - support out-of-order Tx completion processing
    - Power over Ethernet:
      - pd692x0: preserve PSE configuration across reboots
      - add support for TPS23881B devices
 
  - Ethernet PHYs:
    - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
    - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
    - micrel:
      - support for non PTP SKUs of lan8814
      - enable in-band auto-negotiation on lan8814
    - realtek:
      - cable testing support on RTL8224
      - interrupt support on RTL8221B
    - motorcomm: support for PHY LEDs on YT853
    - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
    - mscc: support for PHY LED control
 
  - CAN drivers:
    - m_can: add support for optional reset and system wake up
    - remove can_change_mtu() obsoleted by core handling
    - mcp251xfd: support GPIO controller functionality
 
  - Bluetooth:
    - add initial support for PASTa
 
  - WiFi:
    - split ieee80211.h file, it's way too big
    - improvements in VHT radiotap reporting, S1G, Channel Switch
      Announcement handling, rate tracking in mesh networks
    - improve multi-radio monitor mode support, and add a cfg80211 debugfs
      interface for it
    - HT action frame handling on 6 GHz
    - initial chanctx work towards NAN
    - MU-MIMO sniffer improvements
 
  - WiFi drivers:
    - RealTek (rtw89):
      - support USB devices RTL8852AU and RTL8852CU
      - initial work for RTL8922DE
      - improved injection support
    - Intel:
      - iwlwifi: new sniffer API support
    - MediaTek (mt76):
      - WED support for >32-bit DMA
      - airoha NPU support
      - regdomain improvements
      - continued WiFi7/MLO work
    - Qualcomm/Atheros:
      - ath10k: factory test support
      - ath11k: TX power insertion support
      - ath12k: BSS color change support
      - ath12k: statistics improvements
    - brcmfmac: Acer A1 840 tablet quirk
    - rtl8xxxu: 40 MHz connection fixes/support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmkveRQACgkQMUZtbf5S
 IrvY7A/+Nb0o4BxLHjPkAl1m3t3q2d0Y29B7SNkwnwEtxAV8EkNeZ3GWrdtDnTQY
 MYhmc7LEzvz8/lihapr7UJkcokzSASUV54hbez5jDBKC8EEoyUk8FdWDPerwlcRI
 zmCFNAVFyh9GX8i7wcrzKbDTHT5+GZLbSlGl9U5mhLsDdRlJgH7d8PJ7vWcmtLFY
 XN0paDyaeHfCl8wReWNAYx4C/I0ODOvlscpO0tnAKhB0ngJbQCKY2t6tn3rOYdif
 ZSQ5KwVRnJtQ4fYOFMOy9+FSCjVXtyrxF8KLxD+mqom2ZhmO00UpOMl09tqhq3uT
 WnvwoHUVBt6F+iITHwg5kMgIDPUq1kpUvL4S4UbVSuUm9ZKD+4KRU2ZHRBYMx+MU
 bsqmtY8/IULClUoRz+tZhltA8eb0NEqNZE2JPOFDiJHn1YiCCkFwxibhir893oM3
 sB7x65D7LQI2ty2BBGVGYnwYDPtyaxOA/s3WTwPvLEi3+Y/TGNIIrS9lBLA4U+Yr
 Gi93WQGVjttMmVyaHgXBUGmi3L52hvolm0AZ8zSRGrnIEpecjhly2KfYuaOzuxXC
 IHEQ6AFLdRh6JzafXGb/mQwGCHNmhwsY8A49i94fakWQamaL/L6A+1dyPu4LXMqi
 NwqCmlVb/LKGlfNG+V4wT27srJ+yBA2Vk3tpR1sZQQytFh0LKHI=
 =UoDR
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Replace busylock at the Tx queuing layer with a lockless list.

     This results in a 300% (4x) improvement on heavy TX workloads, sending
     twice the number of packets per second, for half the cpu cycles.

   - Allow constantly busy flows to migrate to a more suitable CPU/NIC
     queue.

     Normally we perform queue re-selection when flow comes out of idle,
     but under extreme circumstances the flows may be constantly busy.

     Add sysctl to allow periodic rehashing even if it'd risk packet
     reordering.

   - Optimize the NAPI skb cache, make it larger, use it in more paths.

   - Attempt returning Tx skbs to the originating CPU (like we already
     did for Rx skbs).

   - Various data structure layout and prefetch optimizations from Eric.

   - Remove ktime_get() from the recvmsg() fast path, ktime_get() is
     sadly quite expensive on recent AMD machines.

   - Extend threaded NAPI polling to allow the kthread to busy poll for
     packets.

   - Make MPTCP use Rx backlog processing. This lowers the lock
     pressure, improving the Rx performance.

   - Support memcg accounting of MPTCP socket memory.

   - Allow admin to opt sockets out of global protocol memory accounting
     (using a sysctl or BPF-based policy). The global limits are a poor
     fit for modern container workloads, where limits are imposed using
     cgroups.

   - Improve heuristics for when to kick off AF_UNIX garbage collection.

   - Allow users to control TCP SACK compression, and default to 33% of
     RTT.

   - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
     unnecessarily aggressive rcvbuf growth and overshoot when the
     connection RTT is low.

   - Preserve skb metadata space across skb_push / skb_pull operations.

   - Support for IPIP encapsulation in the nftables flowtable offload.

   - Support appending IP interface information to ICMP messages (RFC
     5837).

   - Support setting max record size in TLS (RFC 8449).

   - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.

   - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.

   - Let users configure the number of write buffers in SMC.

   - Add new struct sockaddr_unsized for sockaddr of unknown length,
     from Kees.

   - Some conversions away from the crypto_ahash API, from Eric Biggers.

   - Some preparations for slimming down struct page.

   - YAML Netlink protocol spec for WireGuard.

   - Add a tool on top of YAML Netlink specs/lib for reporting commonly
     computed derived statistics and summarized system state.

  Driver API:

   - Add CAN XL support to the CAN Netlink interface.

   - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
     defined by the OPEN Alliance's "Advanced diagnostic features for
     100BASE-T1 automotive Ethernet PHYs" specification.

   - Add DPLL phase-adjust-gran pin attribute (and implement it in
     zl3073x).

   - Refactor xfrm_input lock to reduce contention when NIC offloads
     IPsec and performs RSS.

   - Add info to devlink params whether the current setting is the
     default or a user override. Allow resetting back to default.

   - Add standard device stats for PSP crypto offload.

   - Leverage DSA frame broadcast to implement simple HSR frame
     duplication for a lot of switches without dedicated HSR offload.

   - Add uAPI defines for 1.6Tbps link modes.

  Device drivers:

   - Add Motorcomm YT921x gigabit Ethernet switch support.

   - Add MUCSE driver for N500/N210 1GbE NIC series.

   - Convert drivers to support dedicated ops for timestamping control,
     and away from the direct IOCTL handling. While at it support GET
     operations for PHY timestamping.

   - Add (and convert most drivers to) a dedicated ethtool callback for
     reading the Rx ring count.

   - Significant refactoring efforts in the STMMAC driver, which
     supports Synopsys turn-key MAC IP integrated into a ton of SoCs.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support PPS in/out on all pins
      - Intel (100G, ice, idpf):
         - ice: implement standard ethtool and timestamping stats
         - i40e: support setting the max number of MAC addresses per VF
         - iavf: support RSS of GTP tunnels for 5G and LTE deployments
      - nVidia/Mellanox (mlx5):
         - reduce downtime on interface reconfiguration
         - disable being an XDP redirect target by default (same as
           other drivers) to avoid wasting resources if feature is
           unused
      - Meta (fbnic):
         - add support for Linux-managed PCS on 25G, 50G, and 100G links
      - Wangxun:
         - support Rx descriptor merge, and Tx head writeback
         - support Rx coalescing offload
         - support 25G SFP and 40G QSFP modules

   - Ethernet virtual:
      - Google (gve):
         - allow ethtool to configure rx_buf_len
         - implement XDP HW RX Timestamping support for DQ descriptor
           format
      - Microsoft vNIC (mana):
         - support HW link state events
         - handle hardware recovery events when probing the device

   - Ethernet NICs consumer, and embedded:
      - usbnet: add support for Byte Queue Limits (BQL)
      - AMD (amd-xgbe):
         - add device selftests
      - NXP (enetc):
         - add i.MX94 support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - bcmasp: add support for PHY-based Wake-on-LAN
      - Broadcom switches (b53):
         - support port isolation
         - support BCM5389/97/98 and BCM63XX ARL formats
      - Lantiq/MaxLinear switches:
         - support bridge FDB entries on the CPU port
         - use regmap for register access
         - allow user to enable/disable learning
         - support Energy Efficient Ethernet
         - support configuring RMII clock delays
         - add tagging driver for MaxLinear GSW1xx switches
      - Synopsys (stmmac):
         - support using the HW clock in free running mode
         - add Eswin EIC7700 support
         - add Rockchip RK3506 support
         - add Altera Agilex5 support
      - Cadence (macb):
         - cleanup and consolidate descriptor and DMA address handling
         - add EyeQ5 support
      - TI:
         - icssg-prueth: support AF_XDP
      - Airoha access points:
         - add missing Ethernet stats and link state callback
         - add AN7583 support
         - support out-of-order Tx completion processing
      - Power over Ethernet:
         - pd692x0: preserve PSE configuration across reboots
         - add support for TPS23881B devices

   - Ethernet PHYs:
      - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
      - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
      - micrel:
         - support for non PTP SKUs of lan8814
         - enable in-band auto-negotiation on lan8814
      - realtek:
         - cable testing support on RTL8224
         - interrupt support on RTL8221B
      - motorcomm: support for PHY LEDs on YT853
      - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
      - mscc: support for PHY LED control

   - CAN drivers:
      - m_can: add support for optional reset and system wake up
      - remove can_change_mtu() obsoleted by core handling
      - mcp251xfd: support GPIO controller functionality

   - Bluetooth:
      - add initial support for PASTa

   - WiFi:
      - split ieee80211.h file, it's way too big
      - improvements in VHT radiotap reporting, S1G, Channel Switch
        Announcement handling, rate tracking in mesh networks
      - improve multi-radio monitor mode support, and add a cfg80211
        debugfs interface for it
      - HT action frame handling on 6 GHz
      - initial chanctx work towards NAN
      - MU-MIMO sniffer improvements

   - WiFi drivers:
      - RealTek (rtw89):
         - support USB devices RTL8852AU and RTL8852CU
         - initial work for RTL8922DE
         - improved injection support
      - Intel:
         - iwlwifi: new sniffer API support
      - MediaTek (mt76):
         - WED support for >32-bit DMA
         - airoha NPU support
         - regdomain improvements
         - continued WiFi7/MLO work
      - Qualcomm/Atheros:
         - ath10k: factory test support
         - ath11k: TX power insertion support
         - ath12k: BSS color change support
         - ath12k: statistics improvements
      - brcmfmac: Acer A1 840 tablet quirk
      - rtl8xxxu: 40 MHz connection fixes/support"

* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
  net: page_pool: sanitise allocation order
  net: page pool: xa init with destroy on pp init
  net/mlx5e: Support XDP target xmit with dummy program
  net/mlx5e: Update XDP features in switch channels
  selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
  net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
  wireguard: netlink: generate netlink code
  wireguard: uapi: generate header with ynl-gen
  wireguard: uapi: move flag enums
  wireguard: uapi: move enum wg_cmd
  wireguard: netlink: add YNL specification
  selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
  selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
  selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
  selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
  selftests: drv-net: introduce Iperf3Runner for measurement use cases
  selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
  net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
  Documentation: net: dsa: mention simple HSR offload helpers
  Documentation: net: dsa: mention availability of RedBox
  ...
2025-12-03 17:24:33 -08:00
Linus Torvalds 02baaa67d9 sched_ext: Changes for v6.19
- Improve recovery from misbehaving BPF schedulers. When a scheduler puts many
   tasks with varying affinity restrictions on a shared DSQ, CPUs scanning
   through tasks they cannot run can overwhelm the system, causing lockups.
   Bypass mode now uses per-CPU DSQs with a load balancer to avoid this, and
   hooks into the hardlockup detector to attempt recovery. Add scx_cpu0 example
   scheduler to demonstrate this scenario.
 
 - Add lockless peek operation for DSQs to reduce lock contention for schedulers
   that need to query queue state during load balancing.
 
 - Allow scx_bpf_reenqueue_local() to be called from anywhere in preparation for
   deprecating cpu_acquire/release() callbacks in favor of generic BPF hooks.
 
 - Prepare for hierarchical scheduler support: add scx_bpf_task_set_slice() and
   scx_bpf_task_set_dsq_vtime() kfuncs, make scx_bpf_dsq_insert*() return bool,
   and wrap kfunc args in structs for future aux__prog parameter.
 
 - Implement cgroup_set_idle() callback to notify BPF schedulers when a cgroup's
   idle state changes.
 
 - Fix migration tasks being incorrectly downgraded from stop_sched_class to
   rt_sched_class across sched_ext enable/disable. Applied late as the fix is
   low risk and the bug subtle but needs stable backporting.
 
 - Various fixes and cleanups including cgroup exit ordering, SCX_KICK_WAIT
   reliability, and backward compatibility improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaS4h1A4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGe/MAP9EZ0pLiTpmMtt6mI/11Fmi+aWfL84j1zt13cz9
 W4vb4gEA9eVEH6n9xyC4nhcOk9AQwSDuCWMOzLsnhW8TbEHVTww=
 =8W/B
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - Improve recovery from misbehaving BPF schedulers.

   When a scheduler puts many tasks with varying affinity restrictions
   on a shared DSQ, CPUs scanning through tasks they cannot run can
   overwhelm the system, causing lockups.

   Bypass mode now uses per-CPU DSQs with a load balancer to avoid this,
   and hooks into the hardlockup detector to attempt recovery.

   Add scx_cpu0 example scheduler to demonstrate this scenario.

 - Add lockless peek operation for DSQs to reduce lock contention for
   schedulers that need to query queue state during load balancing.

 - Allow scx_bpf_reenqueue_local() to be called from anywhere in
   preparation for deprecating cpu_acquire/release() callbacks in favor
   of generic BPF hooks.

 - Prepare for hierarchical scheduler support: add
   scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs,
   make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in
   structs for future aux__prog parameter.

 - Implement cgroup_set_idle() callback to notify BPF schedulers when a
   cgroup's idle state changes.

 - Fix migration tasks being incorrectly downgraded from
   stop_sched_class to rt_sched_class across sched_ext enable/disable.
   Applied late as the fix is low risk and the bug subtle but needs
   stable backporting.

 - Various fixes and cleanups including cgroup exit ordering,
   SCX_KICK_WAIT reliability, and backward compatibility improvements.

* tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits)
  sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks
  sched_ext: tools: Removing duplicate targets during non-cross compilation
  sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object
  sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs
  sched_ext: Update comments replacing breather with aborting mechanism
  sched_ext: Implement load balancer for bypass mode
  sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked()
  sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
  sched_ext: Add scx_cpu0 example scheduler
  sched_ext: Hook up hardlockup detector
  sched_ext: Make handle_lockup() propagate scx_verror() result
  sched_ext: Refactor lockup handlers into handle_lockup()
  sched_ext: Make scx_exit() and scx_vexit() return bool
  sched_ext: Exit dispatch and move operations immediately when aborting
  sched_ext: Simplify breather mechanism with scx_aborting flag
  sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
  sched_ext: Refactor do_enqueue_task() local and global DSQ paths
  sched_ext: Use shorter slice in bypass mode
  sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
  sched_ext: Minor cleanups to scx_task_iter
  ...
2025-12-03 13:25:39 -08:00
Linus Torvalds d348c22394 Power management updates for 6.19-rc1
- Introduce and document a QoS limit on CPU exit latency during wakeup
    from suspend-to-idle (Ulf Hansson)
 
  - Add support for building libcpupower statically (Zuo An)
 
  - Add support for sending netlink notifications to user space on energy
    model updates (Changwoo Min, Peng Fan)
 
  - Minor improvements to the Rust OPP interface (Tamir Duberstein)
 
  - Fixes to scope-based pointers in the OPP library (Viresh Kumar)
 
  - Use residency threshold in polling state override decisions in the
    menu cpuidle governor (Aboorva Devarajan)
 
  - Add sanity check for exit latency and target residency in the cpufreq
    core (Rafael Wysocki)
 
  - Use this_cpu_ptr() where possible in the teo governor (Christian
    Loehle)
 
  - Rework the handling of tick wakeups in the teo cpuidle governor to
    increase the likelihood of stopping the scheduler tick in the cases
    when tick wakeups can be counted as non-timer ones (Rafael Wysocki)
 
  - Fix a reverse condition in the teo cpuidle governor and drop a
    misguided target residency check from it (Rafael Wysocki)
 
  - Clean up multiple minor defects in the teo cpuidle governor (Rafael
    Wysocki)
 
  - Update header inclusion to make it follow the Include What You Use
    principle (Andy Shevchenko)
 
  - Enable MSR-based RAPL PMU support in the intel_rapl power capping
    driver and arrange for using it on the Panther Lake and Wildcat Lake
    processors (Kuppuswamy Sathyanarayanan)
 
  - Add support for Nova Lake and Wildcat Lake processors to the
    intel_rapl power capping driver (Kaushlendra Kumar, Srinivas
    Pandruvada)
 
  - Add OPP and bandwidth support for Tegra186 (Aaron Kling)
 
  - Optimizations for parameter array handling in the amd-pstate cpufreq
    driver (Mario Limonciello)
 
  - Fix for mode changes with offline CPUs in the amd-pstate cpufreq
    driver (Gautham Shenoy)
 
  - Preserve freq_table_sorted across suspend/hibernate in the cpufreq
    core (Zihuan Zhang)
 
  - Adjust energy model rules for Intel hybrid platforms in the
    intel_pstate cpufreq driver and improve printing of debug messages
    in it (Rafael Wysocki)
 
  - Replace deprecated strcpy() in cpufreq_unregister_governor()
    (Thorsten Blum)
 
  - Fix duplicate hyperlink target errors in the intel_pstate cpufreq
    driver documentation and use :ref: directive for internal linking in
    it (Swaraj Gaikwad, Bagas Sanjaya)
 
  - Add Diamond Rapids OOB mode support to the intel_pstate cpufreq
    driver (Kuppuswamy Sathyanarayanan)
 
  - Use mutex guard for driver locking in the intel_pstate driver and
    eliminate some code duplication from it (Rafael Wysocki)
 
  - Replace udelay() with usleep_range() in ACPI cpufreq (Kaushlendra
    Kumar)
 
  - Minor improvements to various cpufreq drivers (Christian Marangi, Hal
    Feng, Jie Zhan, Marco Crivellari, Miaoqian Lin, and Shuhao Fu)
 
  - Replace snprintf() with scnprintf() in show_trace_dev_match()
    (Kaushlendra Kumar)
 
  - Fix memory allocation error handling in pm_vt_switch_required()
    (Malaya Kumar Rout)
 
  - Introduce CALL_PM_OP() macro and use it to simplify code in
    generic PM operations (Kaushlendra Kumar)
 
  - Add module param to backtrace all CPUs in the device power management
    watchdog (Sergey Senozhatsky)
 
  - Rework message printing in swsusp_save() (Rafael Wysocki)
 
  - Make it possible to change the number of hibernation compression
    threads (Xueqin Luo)
 
  - Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)
 
  - Add document on debugging shutdown hangs to PM documentation and
    correct a mistaken configuration option in it (Mario Limonciello)
 
  - Shut down wakeup source timer before removing the wakeup source from
    the list (Kaushlendra Kumar, Rafael Wysocki)
 
  - Introduce new PMSG_POWEROFF event for system shutdown handling with
    the help of PM device callbacks (Mario Limonciello)
 
  - Make pm_test delay interruptible by wakeup events (Riwen Lu)
 
  - Clean up kernel-doc comment style usage in the core hibernation
    code and remove useless comments from it (Sunday Adelodun, Rafael
    Wysocki)
 
  - Add support for handling wakeup events and aborting the suspend
    process while it is syncing file systems (Samuel Wu, Rafael Wysocki)
 
  - Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)
 
  - Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
    them in the PCI core and the ACPI TAD driver (Rafael Wysocki)
 
  - Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)
 
  - Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)
 
  - Fix typos in runtime.c comments (Malaya Kumar Rout)
 
  - Move governor.h from devfreq under include/linux/ and rename to
    devfreq-governor.h to allow devfreq governor definitions outside
    of drivers/devfreq/ (Dmitry Baryshkov)
 
  - Use min() to improve readability in tegra30-devfreq.c (Thorsten
    Blum)
 
  - Fix potential use-after-free issue of OPP handling in
    hisi_uncore_freq.c (Pengjie Zhang)
 
  - Fix typo in DFSO_DOWNDIFFERENTIAL macro name in
    governor_simpleondemand.c in devfreq (Riwen Lu)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmkp0BYSHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1Pc8H/2G5d0aD/ym1a8MDTpKqn7t3/rVMHa76
 YGfxXMBr1oY++r5GTJTKBxZBHmF89VH71kdyvsMidTAtHjR+iZAS1ajd2Q5VYjOF
 QNMld1qgPEzAZU8WSetDrBqMr89zls05Uubo4aCoNy6rFmgRaLHh3AmIKSS9aJuo
 C1eH8dRONME5I/rafkOUpFs1+/Agq1vePwPZmwVnZX9A3qI+UOhMRdU9A37kYkx9
 YwfQvR2fKTIPjZ6B9f/wGXPOvdrT37d4+dWT3EABOHMkxlpAPDMvmVzZsUaXSQMr
 0d9NGEjPGo33qciKJJpHqNOdDOhi90606WBBf7aaMF+GMhDX3PznOK4=
 =rzXO
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "There are quite a few interesting things here, including new hardware
  support, new features, some bug fixes and documentation updates. In
  addition, there are a usual bunch of minor fixes and cleanups all
  over.

  In the new hardware support category, there are intel_pstate and
  intel_rapl driver updates to support new processors, Panther Lake,
  Wildcat Lake, Nova Lake, and Diamond Rapids in the OOB mode, OPP and
  bandwidth allocation support in the tegra186 cpufreq driver, and
  JH7110S SOC support in dt-platdev cpufreq.

  The new features are the PM QoS CPU latency limit for suspend-to-idle,
  the netlink support for the energy model management, support for
  terminating system suspend via a wakeup event during the sync of file
  systems, configurable number of hibernation compression threads, the
  runtime PM auto-cleanup macros, and the "poweroff" PM event that is
  expected to be used during system shutdown.

  Bugs are mostly fixed in cpuidle governors, but there are also fixes
  elsewhere, like in the amd-pstate cpufreq driver.

  Documentation updates include, but are not limited to, a new doc on
  debugging shutdown hangs, cross-referencing fixes and cleanups in the
  intel_pstate documentation, and updates of comments in the core
  hibernation code.

  Specifics:

   - Introduce and document a QoS limit on CPU exit latency during
     wakeup from suspend-to-idle (Ulf Hansson)

   - Add support for building libcpupower statically (Zuo An)

   - Add support for sending netlink notifications to user space on
     energy model updates (Changwoo Min, Peng Fan)

   - Minor improvements to the Rust OPP interface (Tamir Duberstein)

   - Fixes to scope-based pointers in the OPP library (Viresh Kumar)

   - Use residency threshold in polling state override decisions in the
     menu cpuidle governor (Aboorva Devarajan)

   - Add sanity check for exit latency and target residency in the
     cpufreq core (Rafael Wysocki)

   - Use this_cpu_ptr() where possible in the teo governor (Christian
     Loehle)

   - Rework the handling of tick wakeups in the teo cpuidle governor to
     increase the likelihood of stopping the scheduler tick in the cases
     when tick wakeups can be counted as non-timer ones (Rafael Wysocki)

   - Fix a reversed condition in the teo cpuidle governor and drop a
     misguided target residency check from it (Rafael Wysocki)

   - Clean up multiple minor defects in the teo cpuidle governor (Rafael
     Wysocki)

   - Update header inclusion to make it follow the Include What You Use
     principle (Andy Shevchenko)

   - Enable MSR-based RAPL PMU support in the intel_rapl power capping
     driver and arrange for using it on the Panther Lake and Wildcat
     Lake processors (Kuppuswamy Sathyanarayanan)

   - Add support for Nova Lake and Wildcat Lake processors to the
     intel_rapl power capping driver (Kaushlendra Kumar, Srinivas
     Pandruvada)

   - Add OPP and bandwidth support for Tegra186 (Aaron Kling)

   - Optimizations for parameter array handling in the amd-pstate
     cpufreq driver (Mario Limonciello)

   - Fix for mode changes with offline CPUs in the amd-pstate cpufreq
     driver (Gautham Shenoy)

   - Preserve freq_table_sorted across suspend/hibernate in the cpufreq
     core (Zihuan Zhang)

   - Adjust energy model rules for Intel hybrid platforms in the
     intel_pstate cpufreq driver and improve printing of debug messages
     in it (Rafael Wysocki)

   - Replace deprecated strcpy() in cpufreq_unregister_governor()
     (Thorsten Blum)

   - Fix duplicate hyperlink target errors in the intel_pstate cpufreq
     driver documentation and use :ref: directive for internal linking
     in it (Swaraj Gaikwad, Bagas Sanjaya)

   - Add Diamond Rapids OOB mode support to the intel_pstate cpufreq
     driver (Kuppuswamy Sathyanarayanan)

   - Use mutex guard for driver locking in the intel_pstate driver and
     eliminate some code duplication from it (Rafael Wysocki)

   - Replace udelay() with usleep_range() in ACPI cpufreq (Kaushlendra
     Kumar)

   - Minor improvements to various cpufreq drivers (Christian Marangi,
     Hal Feng, Jie Zhan, Marco Crivellari, Miaoqian Lin, and Shuhao Fu)

   - Replace snprintf() with scnprintf() in show_trace_dev_match()
     (Kaushlendra Kumar)

   - Fix memory allocation error handling in pm_vt_switch_required()
     (Malaya Kumar Rout)

   - Introduce CALL_PM_OP() macro and use it to simplify code in generic
     PM operations (Kaushlendra Kumar)

   - Add module param to backtrace all CPUs in the device power
     management watchdog (Sergey Senozhatsky)

   - Rework message printing in swsusp_save() (Rafael Wysocki)

   - Make it possible to change the number of hibernation compression
     threads (Xueqin Luo)

   - Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)

   - Add document on debugging shutdown hangs to PM documentation and
     correct a mistaken configuration option in it (Mario Limonciello)

   - Shut down wakeup source timer before removing the wakeup source
     from the list (Kaushlendra Kumar, Rafael Wysocki)

   - Introduce new PMSG_POWEROFF event for system shutdown handling with
     the help of PM device callbacks (Mario Limonciello)

   - Make pm_test delay interruptible by wakeup events (Riwen Lu)

   - Clean up kernel-doc comment style usage in the core hibernation
     code and remove useless comments from it (Sunday Adelodun, Rafael
     Wysocki)

   - Add support for handling wakeup events and aborting the suspend
     process while it is syncing file systems (Samuel Wu, Rafael
     Wysocki)

   - Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)

   - Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
     them in the PCI core and the ACPI TAD driver (Rafael Wysocki)

   - Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)

   - Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)

   - Fix typos in runtime.c comments (Malaya Kumar Rout)

   - Move governor.h from devfreq under include/linux/ and rename to
     devfreq-governor.h to allow devfreq governor definitions outside of
     drivers/devfreq/ (Dmitry Baryshkov)

   - Use min() to improve readability in tegra30-devfreq.c (Thorsten
     Blum)

   - Fix potential use-after-free issue of OPP handling in
     hisi_uncore_freq.c (Pengjie Zhang)

   - Fix typo in DFSO_DOWNDIFFERENTIAL macro name in
     governor_simpleondemand.c in devfreq (Riwen Lu)"
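
One of the items above replaces snprintf() with scnprintf(). As a rough userspace illustration of why that matters: snprintf() returns the length the output would have had without truncation, while the kernel's scnprintf() returns the number of characters actually stored. The helper name scnprintf_sketch() is hypothetical and only approximates the kernel behaviour on top of the standard vsnprintf(); it is not the kernel implementation.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace approximation of scnprintf(): return the number
 * of characters actually written to buf, not counting the trailing NUL.
 */
int scnprintf_sketch(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);	/* "would-be" length */
	va_end(args);

	if (n < 0)
		return 0;			/* output error: nothing usable */
	if ((size_t)n >= size)
		return (int)(size - 1);		/* output was truncated */
	return n;
}
```

With a 4-byte buffer and the string "abcdef", snprintf() reports 6 while the sketch reports 3, so code that advances a write pointer by the return value stays inside the buffer.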

* tag 'pm-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (96 commits)
  PM / devfreq: Fix typo in DFSO_DOWNDIFFERENTIAL macro name
  cpuidle: Warn instead of bailing out if target residency check fails
  cpuidle: Update header inclusion
  Documentation: power/cpuidle: Document the CPU system wakeup latency QoS
  cpuidle: Respect the CPU system wakeup QoS limit for cpuidle
  sched: idle: Respect the CPU system wakeup QoS limit for s2idle
  pmdomain: Respect the CPU system wakeup QoS limit for cpuidle
  pmdomain: Respect the CPU system wakeup QoS limit for s2idle
  PM: QoS: Introduce a CPU system wakeup QoS limit
  cpuidle: governors: teo: Add missing space to the description
  PM: hibernate: Extra cleanup of comments in swap handling code
  PM / devfreq: tegra30: use min to simplify actmon_cpu_to_emc_rate
  PM / devfreq: hisi: Fix potential UAF in OPP handling
  PM / devfreq: Move governor.h to a public header location
  powercap: intel_rapl: Enable MSR-based RAPL PMU support
  powercap: intel_rapl: Prepare read_raw() interface for atomic-context callers
  cpufreq: qcom-nvmem: fix compilation warning for qcom_cpufreq_ipq806x_match_list
  PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper()
  PM: sleep: Add support for wakeup during filesystem sync
  cpufreq: ACPI: Replace udelay() with usleep_range()
  ...
2025-12-02 17:31:22 -08:00
Linus Torvalds d42e504a55 Update to the time/timers core:
   - Prevent a thundering herd problem when the timekeeper CPU is delayed
     and a large number of CPUs compete to acquire jiffies_lock to do the
     update. Limit it to one CPU with a separate "uncontended" atomic
     variable.
 
   - A set of improvements for the timer migration mechanism:
 
     - Support imbalanced NUMA trees correctly
 
     - Support dynamic exclusion of CPUs from the migrator duty to allow the
       cpuset/isolation mechanism to exclude them from handling timers of
       remote idle CPUs.
 
   - The usual small updates, cleanups and enhancements
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmks7doTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaxrD/40nxx+8cEXsVbVLIkP2PQbd2Y8+7sk
 YbNu/Cb7j7Bg7R8YIs4p5GHk+7Yt/hNsW77SmbAzRPUyYYG6L3bUYlBa3yQlvIuo
 xRPbzGA+RJies9skIGHbQ8z6ig1zUASRJPcBYiuaVIAuQhCfLNc4Nii9cEWtjZ24
 +5gfRwV+vy74ArWwRkwaGejDK1tav+gd62OkFQZC8WtjQ08ozGZ6VBJNg7nYq/gH
 FYO1rH2tQ/ZyjlO/x5NF8gFcjYD8iv5PDp8oH35MPx+XTdDccf0G3QB7ug0ffVdV
 b4gA6lZTAmpsu/NHb6ByN4i/kf3wf8la/i+EaAh/Ov7NW078gunvVKVA7jStcbBl
 ZgG5SRHiKRvQF/WXLGVQAnilRDZwRuS0nmJlqfExa44v23l5o3768RwdRYwQlv8g
 X5KSRl0jlVgVtZHgNBlZtgX9+rnQSr9sB5sVGBP2a6a1WhVXQV/2kp0wjdnU0mPw
 jLCnSdsHqBlSf9V7O/na823WCnBFb7blrLBXUoSbHBnICqtVFzhE1kBXWw3S7Kqh
 CiaWM+S4WfR0HRnUlWMTS8BZ82MgiDnd7nGUXWwXBbdqWmoj/9CoU6SZRjbMBkzi
 EY1XvmoYf6eSzdxfydI1hFi0/bbb8K9umHQlrpW3HeN9uXnVc0/+TroVPLuaKUdi
 53ClqXjzE+CpJg==
 =lQKn
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

 - Prevent a thundering herd problem when the timekeeper CPU is delayed
   and a large number of CPUs compete to acquire jiffies_lock to do the
   update. Limit it to one CPU with a separate "uncontended" atomic
   variable.

 - A set of improvements for the timer migration mechanism:

     - Support imbalanced NUMA trees correctly

     - Support dynamic exclusion of CPUs from the migrator duty to allow
       the cpuset/isolation mechanism to exclude them from handling
       timers of remote idle CPUs

 - The usual small updates, cleanups and enhancements
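
The thundering herd fix in the first item can be pictured with a small userspace sketch using C11 atomics. Everything here is an assumption made for illustration: the names update_claimed, last_update_ns, jiffies_sketch and try_update_jiffies() are hypothetical, and the kernel code takes jiffies_lock where this sketch only uses a flag. The point is only that CPUs which fail to claim the flag return immediately instead of queueing on a lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_NS 1000000ULL	/* hypothetical tick length: 1 ms */

/* Hypothetical stand-ins, not kernel symbols. */
static atomic_flag update_claimed = ATOMIC_FLAG_INIT;
static _Atomic uint64_t last_update_ns;
static uint64_t jiffies_sketch;

/* Returns true only for the one caller that performed the update. */
bool try_update_jiffies(uint64_t now_ns)
{
	uint64_t last = atomic_load(&last_update_ns);
	bool updated = false;

	/* Cheap lockless check: less than one tick elapsed, nothing to do. */
	if (now_ns < last || now_ns - last < TICK_NS)
		return false;

	/* Only the first caller to set the flag proceeds; others back off. */
	if (atomic_flag_test_and_set(&update_claimed))
		return false;

	/* Re-check after claiming, another caller may just have finished. */
	last = atomic_load(&last_update_ns);
	if (now_ns >= last && now_ns - last >= TICK_NS) {
		uint64_t ticks = (now_ns - last) / TICK_NS;

		jiffies_sketch += ticks;
		atomic_store(&last_update_ns, last + ticks * TICK_NS);
		updated = true;
	}

	atomic_flag_clear(&update_claimed);
	return updated;
}
```

A caller that loses the race simply skips the update; in this sketch that is safe because the winner accounts for all elapsed ticks.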

* tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Exclude isolated cpus from hierarchy
  cpumask: Add initialiser to use cleanup helpers
  sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any
  cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
  timers/migration: Use scoped_guard on available flag set/clear
  timers/migration: Add mask for CPUs available in the hierarchy
  timers/migration: Rename 'online' bit to 'available'
  selftests/timers/nanosleep: Add tests for return of remaining time
  selftests/timers: Clean up kernel version check in posix_timers
  time: Fix a few typos in time[r] related code comments
  time: tick-oneshot: Add missing Return and parameter descriptions to kernel-doc
  hrtimer: Store time as ktime_t in restart block
  timers/migration: Remove dead code handling idle CPU checking for remote timers
  timers/migration: Remove unused "cpu" parameter from tmigr_get_group()
  timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy
  timers/migration: Fix imbalanced NUMA trees
  timers/migration: Remove locking on group connection
  timers/migration: Convert "while" loops to use "for"
  tick/sched: Limit non-timekeeper CPUs calling jiffies update
2025-12-02 09:58:33 -08:00
Linus Torvalds 2b09f480f0 A large overhaul of the restartable sequences and CID management:
   The recent enablement of RSEQ in glibc resulted in regressions which are
   caused by the related overhead. It turned out that the decision to invoke
   the exit to user work was not really a decision. More or less each
   context switch caused that. There is a long list of small issues which
   add up and result in a 3-4% regression in I/O benchmarks.
 
   The other detail which caused issues due to extra work in context switch
   and task migration is the CID (memory context ID) management. It also
   requires using task work to consolidate the CID space, which is
   executed in the context of an arbitrary task and results in sporadic
   uncontrolled exit latencies.
 
   The rewrite addresses this by:
 
   - Removing deprecated and long unsupported functionality
 
   - Moving the related data into dedicated data structures which are
     optimized for fast path processing.
 
   - Caching values so actual decisions can be made
 
   - Replacing the current implementation with an optimized inlined variant.
 
   - Separating fast and slow path for architectures which use the generic
     entry code, so that only fault and error handling goes into the
     TIF_NOTIFY_RESUME handler.
 
   - Rewriting the CID management so that it becomes mostly invisible in the
     context switch path. That moves the work of switching modes into the
     fork/exit path, which is a reasonable tradeoff. That work is only
     required when a process creates more threads than the cpuset it is
     allowed to run on or when enough threads exit after that. An artificial
     thread pool benchmark which triggers this did not degrade, it actually
     improved significantly.
 
     The main effect in migration heavy scenarios is that runqueue lock held
     time and therefore contention goes down significantly.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmksaRYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoencEADA5he8PAFPmSRRPo6+2G5mHzWe8kIU
 5ZViQStWFNAA0qqy8VXryWiJ6qqrO6la9o7K4YOXASUtlkVjquRp1DF7PabqGwuy
 zshbRCXNlT51J8uqanN8VrGVjlf+bMdHDbGoI1SLkUTxG8b+kDD5PXUQE1ARelPP
 Slbg9u+EMrxj6D5MDTPbuW6TqryJEkPtiNScyOz43emp9ww9+WVxenOcRqU4D+Th
 mjWmrGIzkroSf4XReMoD/wg9TPTpUjXnNCwl2viY9JvBpkMfYtU4tJAGK3aNFOWy
 zsAN0O9CaFGrUEFne7qUmtwhNLdtnjx5HN5pe7yZd1EhdTuQKq4jPiiQnwwm8w72
 c0o6m45FNPmPoSyfaZWCkLjbTEUXonT9JF61iN35JVxim8gBDDJjHFKnLxDmLrH3
 X0eESE48ReY2EneDV6Y8RJRo6oG14Fccvc39aTf/2Rw3trpmtt2agvConQzupQIg
 DzANw4jhUUzFRrHrMHACNsqKFXh9ratue/S9DM3xxTpGO/bKdeK7jGIgzNf8O34M
 J0O6Hvk5jMdcWlIJTx21GoGzoSkkXnR49g/71aCcp+MwdY4x9zFz5SWi8LWQRmkx
 xRo6tY27Bma8/SEwMJjPpAUXDTpq6v+j3cPisybL1yGsyt9lh+p8LX7VUtwcoEqe
 6ZelC5Kgw/+/kg==
 =n5KT
 -----END PGP SIGNATURE-----

Merge tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull rseq updates from Thomas Gleixner:
 "A large overhaul of the restartable sequences and CID management:

  The recent enablement of RSEQ in glibc resulted in regressions which
  are caused by the related overhead. It turned out that the decision to
  invoke the exit to user work was not really a decision. More or less
  each context switch caused that. There is a long list of small issues
  which add up and result in a 3-4% regression in I/O
  benchmarks.

  The other detail which caused issues due to extra work in context
  switch and task migration is the CID (memory context ID) management.
  It also requires using task work to consolidate the CID space,
  which is executed in the context of an arbitrary task and results in
  sporadic uncontrolled exit latencies.

  The rewrite addresses this by:

   - Removing deprecated and long unsupported functionality

   - Moving the related data into dedicated data structures which are
     optimized for fast path processing.

   - Caching values so actual decisions can be made

   - Replacing the current implementation with an optimized inlined
     variant.

   - Separating fast and slow path for architectures which use the
     generic entry code, so that only fault and error handling goes into
     the TIF_NOTIFY_RESUME handler.

   - Rewriting the CID management so that it becomes mostly invisible in
     the context switch path. That moves the work of switching modes
     into the fork/exit path, which is a reasonable tradeoff. That work
     is only required when a process creates more threads than the
     cpuset it is allowed to run on or when enough threads exit after
     that. An artificial thread pool benchmark which triggers this did
     not degrade, it actually improved significantly.

     The main effect in migration heavy scenarios is that runqueue lock
     held time and therefore contention goes down significantly"

* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/mmcid: Switch over to the new mechanism
  sched/mmcid: Implement deferred mode change
  irqwork: Move data struct to a types header
  sched/mmcid: Provide CID ownership mode fixup functions
  sched/mmcid: Provide new scheduler CID mechanism
  sched/mmcid: Introduce per task/CPU ownership infrastructure
  sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
  sched/mmcid: Provide precomputed maximal value
  sched/mmcid: Move initialization out of line
  signal: Move MMCID exit out of sighand lock
  sched/mmcid: Convert mm CID mask to a bitmap
  cpumask: Cache num_possible_cpus()
  sched/mmcid: Use cpumask_weighted_or()
  cpumask: Introduce cpumask_weighted_or()
  sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
  sched/mmcid: Move scheduler code out of global header
  sched: Fixup whitespace damage
  sched/mmcid: Cacheline align MM CID storage
  sched/mmcid: Use proper data structures
  sched/mmcid: Revert the complex CID management
  ...
2025-12-02 08:48:53 -08:00
Linus Torvalds 9368f0f941 vfs-6.19-rc1.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 omMSAP9GLhavxyWQ24Q+49CNWWRQWDY1wTOiUK2BwtIvZ0YEcAD8D1dAiMckL5pC
 RwEAVA5p+y+qi+bZP0KXCBxQddoTIQM=
 =zo/J
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode updates from Christian Brauner:
 "Features:

   - Hide inode->i_state behind accessors. Open-coded accesses prevent
     asserting they are done correctly. One obvious aspect is locking,
     but significantly more can be checked. For example it can be
     detected when the code is clearing flags which are already missing,
     or is setting flags when it is illegal (e.g., I_FREEING when
     ->i_count > 0)

   - Provide accessors for ->i_state, converts all filesystems using
     coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
     overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
     compile

   - Rework I_NEW handling to operate without fences, simplifying the
     code after the accessor infrastructure is in place

  Cleanups:

   - Move wait_on_inode() from writeback.h to fs.h

   - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
     for clarity

   - Cosmetic fixes to LRU handling

   - Push list presence check into inode_io_list_del()

   - Touch up predicts in __d_lookup_rcu()

   - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage

   - Assert on ->i_count in iput_final()

   - Assert ->i_lock held in __iget()

  Fixes:

   - Add missing fences to I_NEW handling"
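
The first feature item argues that accessors make it possible to assert correct use of ->i_state. A minimal userspace sketch of that idea follows. The struct, the flag values and the *_sketch function names are all hypothetical stand-ins, and a plain boolean replaces the real ->i_lock; only the two example checks (clearing flags that are already missing, setting I_FREEING while ->i_count > 0) are taken from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define I_NEW_SK	0x1u
#define I_FREEING_SK	0x2u

struct inode_sketch {
	bool lock_held;			/* stand-in for "->i_lock is held" */
	unsigned int count;		/* stand-in for ->i_count */
	unsigned int state_private;	/* only touched through the accessors */
};

static inline unsigned int state_read_sketch(const struct inode_sketch *inode)
{
	return inode->state_private;
}

static inline void state_set_sketch(struct inode_sketch *inode, unsigned int flags)
{
	assert(inode->lock_held);
	/* Setting I_FREEING while references remain is illegal. */
	assert(!(flags & I_FREEING_SK) || inode->count == 0);
	inode->state_private |= flags;
}

static inline void state_clear_sketch(struct inode_sketch *inode, unsigned int flags)
{
	assert(inode->lock_held);
	/* Clearing flags that are already missing points to a bug. */
	assert((inode->state_private & flags) == flags);
	inode->state_private &= ~flags;
}
```

Because every write goes through one of two functions, each check is written once; with open-coded accesses it would have to be repeated, and could be forgotten, at every call site.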

* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  dcache: touch up predicts in __d_lookup_rcu()
  fs: push list presence check into inode_io_list_del()
  fs: cosmetic fixes to lru handling
  fs: rework I_NEW handling to operate without fences
  fs: make plain ->i_state access fail to compile
  xfs: use the new ->i_state accessors
  nilfs2: use the new ->i_state accessors
  overlayfs: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  f2fs: use the new ->i_state accessors
  smb: use the new ->i_state accessors
  ceph: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  Manual conversion to use ->i_state accessors of all places not covered by coccinelle
  Coccinelle-based conversion to use ->i_state accessors
  fs: provide accessors for ->i_state
  fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
  fs: move wait_on_inode() from writeback.h to fs.h
  fs: add missing fences to I_NEW handling
  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
  ...
2025-12-01 09:02:34 -08:00
Yang Erkun dac092195b ext4: rename EXT4_GET_BLOCKS_PRE_IO
This flag has been generalized to split an unwritten extent when we do
dio or dioread_nolock writeback, or to avoid merging new extents which were
created by an extent split. Update some related comments too.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Message-ID: <20251112084538.1658232-2-yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26 17:13:33 -05:00
Xie Yuanbin 31807483d3 mm/memory-failure: remove the selection of RAS
Commit 97f0b13452 ("tracing: add trace event for
memory-failure") introduced the selection of RAS in memory-failure.  That
commit only adds a tracing feature; in reality, there is no dependency
between memory-failure and RAS.  Selecting RAS increases the size of the
bzImage by 8k, space that is very valuable on embedded devices.

Move the memory-failure tracing code from ras_event.h to
memory-failure.h and remove the selection of RAS.

Link: https://lkml.kernel.org/r/20251119095943.67125-1-xieyuanbin1@huawei.com
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:55 -08:00
Wei Yang 9e01407708 mm/khugepaged: unify SCAN_PMD_NONE and SCAN_PMD_NULL into SCAN_NO_PTE_TABLE
The current hugepage collapse scan results include two separate values,
SCAN_PMD_NONE and SCAN_PMD_NULL, which are handled identically by the
consuming code.

To reduce confusion and improve long-term maintenance, this commit merges
these two functionally equivalent states into a single, clearer
identifier: SCAN_NO_PTE_TABLE.

Link: https://lkml.kernel.org/r/20251114030028.7035-4-richard.weiyang@gmail.com
Suggested-by: "David Hildenbrand (Red Hat)" <david@kernel.org>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nico Pache <npache@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:52 -08:00
Lorenzo Stoakes 5dba5cc2e0 mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps
Patch series "introduce VM_MAYBE_GUARD and make it sticky", v4.

Currently, guard regions are not visible to users except through
/proc/$pid/pagemap, with no explicit visibility at the VMA level.

This makes the feature less useful, as it isn't entirely apparent which
VMAs may have these entries present, especially when performing actions
which walk through memory regions such as those performed by CRIU.

This series addresses this issue by introducing the VM_MAYBE_GUARD flag
which fulfils this role, updating the smaps logic to display an entry for
these.

The semantics of this flag are that a guard region MAY be present if set
(we cannot be sure, as we can't efficiently track whether an
MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if
not set the VMA definitely does NOT have any guard regions present.

It's problematic to establish this flag without further action, because
that means that VMAs with guard regions in them become non-mergeable with
adjacent VMAs for no especially good reason.

To work around this, this series also introduces the concept of 'sticky'
VMA flags - that is flags which:

a. if set in one VMA and not in another still permit those VMAs to be
   merged (if otherwise compatible).

b. When they are merged, the resultant VMA must have the flag set.

The VMA logic is updated to propagate these flags correctly.
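
Rules (a) and (b) can be sketched as two small flag helpers. This is an illustrative model only: VM_STICKY_SKETCH, flags_mergeable() and merged_flags() are hypothetical names, not the kernel's, and real VMA merging checks much more than flags. The 0x800 value for VM_MAYBE_GUARD is the one stated in this series.

```c
#include <stdbool.h>

typedef unsigned long vm_flags_t;

#define VM_MAYBE_GUARD		0x00000800UL	/* value given in this series */
#define VM_STICKY_SKETCH	VM_MAYBE_GUARD	/* hypothetical sticky mask */

/* Rule (a): a difference in sticky bits alone does not prevent a merge. */
bool flags_mergeable(vm_flags_t a, vm_flags_t b)
{
	return ((a ^ b) & ~VM_STICKY_SKETCH) == 0;
}

/* Rule (b): the merged VMA has a sticky flag if either input had it. */
vm_flags_t merged_flags(vm_flags_t a, vm_flags_t b)
{
	return (a & ~VM_STICKY_SKETCH) | ((a | b) & VM_STICKY_SKETCH);
}
```

For example, a VMA with flags 0x3 and a neighbour with 0x3 | VM_MAYBE_GUARD are mergeable under this model, and the result carries VM_MAYBE_GUARD.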

Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve
an issue with file-backed guard regions - previously these established an
anon_vma object for file-backed mappings solely to have vma_needs_copy()
correctly propagate guard region mappings to child processes.

We introduce a new flag alias VM_COPY_ON_FORK (which currently only
specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly
for this flag and to copy page tables if it is present, which resolves
this issue.
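
The explicit check described above might look roughly like the following sketch. The function name needs_copy_sketch() is hypothetical, and the other_reasons parameter is an assumed placeholder for every other condition the real vma_needs_copy() evaluates; only the VM_COPY_ON_FORK test itself comes from the description.

```c
#include <stdbool.h>

typedef unsigned long vm_flags_t;

#define VM_MAYBE_GUARD	0x00000800UL	/* value given in this series */
#define VM_COPY_ON_FORK	VM_MAYBE_GUARD	/* alias, currently a single flag */

/*
 * Hypothetical model of the fork-time decision: copy page tables whenever
 * a copy-on-fork flag is set, otherwise fall back to the other conditions.
 */
bool needs_copy_sketch(vm_flags_t flags, bool other_reasons)
{
	if (flags & VM_COPY_ON_FORK)
		return true;

	return other_reasons;
}
```

With the flag checked directly, a file-backed mapping no longer needs an anon_vma object just to force the copy.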

Additionally, we add the ability for allow-listed VMA flags to be
atomically writable with only mmap/VMA read locks held.

The only flag we allow so far is VM_MAYBE_GUARD, which we carefully ensure
does not cause any races by being allowed to do so.

This allows us to maintain guard region installation as a read-locked
operation and not endure the overhead of obtaining a write lock here.

Finally we introduce extensive VMA userland tests to assert that the
sticky VMA logic behaves correctly as well as guard region self tests to
assert that smaps visibility is correctly implemented.


This patch (of 9):

Currently, if a user needs to determine if guard regions are present in a
range, they have to scan all VMAs (or have knowledge of which ones might
have guard regions).

Since commit 8e2f2aeb8b ("fs/proc/task_mmu: add guard region bit to
pagemap") and the related commit a516403787 ("fs/proc: extend the
PAGEMAP_SCAN ioctl to report guard regions"), users can use either
/proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag,
VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
uncertain because we cannot reasonably determine whether a
MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
additionally VMAs may change across merge/split).

We utilise 0x800 for this flag, which makes it available to 32-bit
architectures too. This bit was previously used by VM_DENYWRITE, which
was removed in commit 8d0920bde5 ("mm: remove VM_DENYWRITE") and hasn't
been reused yet.

We also update the smaps logic and documentation to identify these VMAs.

Another major use of this functionality is that we can use it to identify
that we ought to copy page tables on fork.

We do not actually implement usage of this flag in mm/madvise.c yet as we
need to allow some VMA flags to be applied atomically under mmap/VMA read
lock in order to avoid the need to acquire a write lock for this purpose.

Link: https://lkml.kernel.org/r/cover.1763460113.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/cf8ef821eba29b6c5b5e138fffe95d6dcabdedb9.1763460113.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 13:43:58 -08:00
Gabriele Monaco 8312cab5ff timers/migration: Rename 'online' bit to 'available'
The timer migration hierarchy excludes offline CPUs via the
tmigr_is_not_available function, which is essentially checking the
online bit for the CPU.

Rename the online bit to available, along with all references to it in
function names and the tracepoint, to generalise the concept of available
CPUs.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251120145653.296659-2-gmonaco@redhat.com
2025-11-20 20:17:31 +01:00
Rafael J. Wysocki 37d6d92fe0 Merge back earlier material related to system sleep for 6.19 2025-11-17 16:55:55 +01:00
Kuninori Morimoto 8855eb7d29
ASoC: asoc.h: convert to snd_soc_dapm_xxx()
This patch converts the functions below.

dapm->dev					-> snd_soc_dapm_to_dev()
dapm->card					-> snd_soc_dapm_to_card()
dapm->component					-> snd_soc_dapm_to_component()

dapm_kcontrol_get_value()			-> snd_soc_dapm_kcontrol_get_value()

snd_soc_component_enable_pin()			-> snd_soc_dapm_enable_pin()
snd_soc_component_enable_pin_unlocked()		-> snd_soc_dapm_enable_pin_unlocked()
snd_soc_component_disable_pin()			-> snd_soc_dapm_disable_pin()
snd_soc_component_disable_pin_unlocked()	-> snd_soc_dapm_disable_pin_unlocked()
snd_soc_component_nc_pin()			-> snd_soc_dapm_nc_pin()
snd_soc_component_nc_pin_unlocked()		-> snd_soc_dapm_nc_pin_unlocked()
snd_soc_component_get_pin_status()		-> snd_soc_dapm_get_pin_status()
snd_soc_component_force_enable_pin()		-> snd_soc_dapm_force_enable_pin()
snd_soc_component_force_enable_pin_unlocked()	-> snd_soc_dapm_force_enable_pin_unlocked()
snd_soc_component_force_bias_level()		-> snd_soc_dapm_force_bias_level()
snd_soc_component_get_bias_level()		-> snd_soc_dapm_get_bias_level()
snd_soc_component_init_bias_level()		-> snd_soc_dapm_init_bias_level()
snd_soc_component_get_dapm()			-> snd_soc_component_to_dapm()

snd_soc_dapm_kcontrol_component()		-> snd_soc_dapm_kcontrol_to_component()
snd_soc_dapm_kcontrol_widget()			-> snd_soc_dapm_kcontrol_to_widget()
snd_soc_dapm_kcontrol_dapm()			-> snd_soc_dapm_kcontrol_to_dapm()
snd_soc_dapm_np_pin()				-> snd_soc_dapm_disable_pin()

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Link: https://patch.msgid.link/87346la0cv.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2025-11-17 00:14:25 +00:00
Mario Limonciello (AMD) 0ca04993da PM: Introduce new PMSG_POWEROFF event
PMSG_POWEROFF will be used by the PM core to differentiate between a
hibernation and a shutdown sequence when re-using callbacks for common code.

Hibernation is started by writing the hibernation method to use (such as
'platform', 'shutdown', or 'reboot') into /sys/power/disk and writing 'disk'
to /sys/power/state.

Shutdown is initiated with the reboot() syscall with arguments on whether
to halt the system or power it off.

Tested-by: Eric Naim <dnaim@cachyos.org>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Link: https://patch.msgid.link/20251112224025.2051702-2-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-14 17:05:53 +01:00
Tejun Heo 95d1df610c sched_ext: Implement load balancer for bypass mode
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well
in most cases, there is a failure mode where a BPF scheduler can skew task
placement severely before triggering bypass in highly over-saturated systems.
If most tasks end up concentrated on a few CPUs, those CPUs can accumulate
queues that are too long to drain in a reasonable time, leading to RCU stalls
and hung tasks.

Implement a simple timer-based load balancer that redistributes tasks across
CPUs within each NUMA node. The balancer runs periodically (default 500ms,
tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded
CPUs to underloaded ones.

When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks
to avoid dropping and reacquiring the donor DSQ lock on each iteration, as
donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag
and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load
balancer timer function reads scx_bypass_depth locklessly to check whether
bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair
with the READ_ONCE() in the timer function.

This has been tested on a 192 CPU dual socket AMD EPYC machine with ~20k
runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost
all tasks end up on CPU0 creating severe imbalance. Without the load balancer,
disabling the scheduler can lead to RCU stalls and hung tasks, taking a very
long time to complete. With the load balancer, disable completes in about a
second.

The load balancing operation can be monitored using the sched_ext_bypass_lb
tracepoint and disabled by setting bypass_lb_intv_us to 0.
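
For illustration, the redistribution step can be reduced to the following
user space toy, which is not the kernel implementation. It assumes each CPU
in a node is described only by its queue length, and toy_bypass_lb_node() is
a hypothetical name; locking, the timer and the actual task migration are
left out.

```c
#include <stddef.h>

/*
 * Toy model: nr_queued[i] is the bypass queue length of CPU i within one
 * NUMA node. Tasks are moved from CPUs above the per-CPU target (the
 * rounded-up average) to CPUs below it. Returns the number of tasks moved.
 */
static unsigned int toy_bypass_lb_node(unsigned int *nr_queued, size_t nr_cpus)
{
	unsigned int total = 0, target, moved = 0;
	size_t donor, recv = 0, i;

	if (nr_cpus == 0)
		return 0;

	for (i = 0; i < nr_cpus; i++)
		total += nr_queued[i];
	target = (unsigned int)((total + nr_cpus - 1) / nr_cpus);

	for (donor = 0; donor < nr_cpus; donor++) {
		while (nr_queued[donor] > target) {
			unsigned int excess, room, n;

			/* Skip CPUs that are already at or above the target. */
			while (recv < nr_cpus && nr_queued[recv] >= target)
				recv++;
			if (recv == nr_cpus)
				return moved;

			/* Move as much as the donor can give and the receiver can take. */
			excess = nr_queued[donor] - target;
			room = target - nr_queued[recv];
			n = excess < room ? excess : room;
			nr_queued[donor] -= n;
			nr_queued[recv] += n;
			moved += n;
		}
	}
	return moved;
}
```

With the scx_cpu0-style skew described above (everything queued on one CPU),
a single pass of this toy spreads the queue evenly across the node.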

v2: Lock both rq and DSQ in bypass_lb_cpu() and use dispatch_dequeue_locked()
    to prevent races with dispatch_dequeue() (Andrea Righi).

Cc: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12 06:43:44 -10:00
Zhang Yi 9dbf945320 ext4: add two trace points for moving extents
To facilitate tracking the length, type, and outcome of the move extent,
add a trace point at both the entry and exit of mext_move_extent().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-13-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06 10:44:39 -05:00
Zhang Yi dd064d5101 ext4: introduce seq counter for the extent status entry
In the iomap_write_iter(), the iomap buffered write frame does not hold
any locks between querying the inode extent mapping info and performing
page cache writes. As a result, the extent mapping can be changed due to
concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
write-back process faces a similar problem: concurrent changes can
invalidate the extent mapping before the I/O is submitted.

Therefore, both of these processes must recheck the mapping info after
acquiring the folio lock. To address this, similar to XFS, we propose
introducing an extent sequence number to serve as a validity cookie for
the extent. After commit 24b7a2331f ("ext4: clairfy the rules for
modifying extents"), we can ensure the extent information should always
be processed through the extent status tree, and the extent status tree
is always uptodate under i_rwsem or invalidate_lock or folio lock, so
it's safe to introduce this sequence number. The sequence number will be
increased whenever the extent status tree changes, preparing for the
buffered write iomap conversion.

Besides, this mechanism is also applicable for the moving extents case.
In move_extent_per_page(), it also needs to reacquire data_sem and check
the mapping info again under the folio lock.
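
As a rough sketch of the validity-cookie idea (not the ext4 code), the
pattern is: sample the mapping together with the current sequence number,
then compare the saved number against the live one once the folio lock is
held. The struct and function names below are hypothetical, and locking is
omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extent status state: one mapping plus a change counter. */
struct toy_es_tree {
	uint64_t seq;	/* increased on every change to the tree */
	uint64_t pblk;	/* physical block currently mapped */
};

/* A mapping snapshot together with the cookie it was sampled under. */
struct toy_mapping {
	uint64_t pblk;
	uint64_t seq;
};

/* Query the mapping without holding any lock across the later I/O. */
static struct toy_mapping toy_map_query(const struct toy_es_tree *tree)
{
	struct toy_mapping m = { .pblk = tree->pblk, .seq = tree->seq };

	return m;
}

/* Any change to the extent status tree bumps the sequence number. */
static void toy_es_change(struct toy_es_tree *tree, uint64_t new_pblk)
{
	tree->pblk = new_pblk;
	tree->seq++;
}

/* Called after taking the folio lock: is the snapshot still usable? */
static bool toy_mapping_valid(const struct toy_es_tree *tree,
			      const struct toy_mapping *m)
{
	return m->seq == tree->seq;
}
```

A caller that finds the snapshot stale simply queries the mapping again and
retries, rather than acting on outdated extent information.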

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-3-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06 10:44:39 -05:00
Mark Brown 8b6faa7fdd
spi: tegra210-quad: Improve timeout handling under
Merge series from Vishwaroop A <va@nvidia.com>:

This patch series addresses timeout handling issues in the Tegra QSPI driver
that occur under high system load conditions. We've observed that when CPUs
are saturated (due to error injection, RAS firmware activity, or general CPU
contention), QSPI interrupt handlers can be delayed, causing spurious transfer
failures even though the hardware completed the operation successfully.

These changes have been tested in production environments under various high
load scenarios including RAS testing and CPU saturation workloads.
2025-11-05 11:54:48 +00:00
Tonghao Zhang 27cb3de7f4 net: add net cookie for net device trace events
In a multi-network card or container environment, this is needed in order
to differentiate between trace events relating to net devices that exist
in different network namespaces and share the same name.

Example xmit_timeout trace events:
[002] ..s1.  1838.311662: net_dev_xmit_timeout: dev=eth0 driver=virtio_net queue=10 net_cookie=3
[007] ..s1.  1839.335650: net_dev_xmit_timeout: dev=eth0 driver=virtio_net queue=10 net_cookie=4100
[007] ..s1.  1844.455659: net_dev_xmit_timeout: dev=eth0 driver=virtio_net queue=10 net_cookie=3
[002] ..s1.  1850.087647: net_dev_xmit_timeout: dev=eth0 driver=virtio_net queue=10 net_cookie=3
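
A consumer of these events might tell the devices apart along the lines of
the following sketch. It is a user space illustration only; the struct and
helper names are hypothetical, and the parsing assumes the field layout
shown in the sample lines above.

```c
#include <stdio.h>
#include <string.h>

/* The two fields needed to identify a net device across namespaces. */
struct toy_xmit_timeout {
	char dev[16];
	unsigned long long net_cookie;
};

/* Returns 0 on success, -1 if either field is missing from the line. */
static int toy_parse_xmit_timeout(const char *line, struct toy_xmit_timeout *ev)
{
	const char *dev = strstr(line, " dev=");
	const char *cookie = strstr(line, " net_cookie=");

	if (!dev || !cookie)
		return -1;
	if (sscanf(dev, " dev=%15s", ev->dev) != 1)
		return -1;
	if (sscanf(cookie, " net_cookie=%llu", &ev->net_cookie) != 1)
		return -1;
	return 0;
}

/* Same device only if both the name and the namespace cookie match. */
static int toy_same_netdev(const struct toy_xmit_timeout *a,
			   const struct toy_xmit_timeout *b)
{
	return a->net_cookie == b->net_cookie && strcmp(a->dev, b->dev) == 0;
}
```

Applied to the sample above, the events with net_cookie=3 and
net_cookie=4100 are reported as different devices even though both are
named eth0.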

Cc: Eran Ben Elisha <eranbe@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20251028043244.82288-1-tonghao@bamaicloud.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-04 09:59:19 +01:00
Thomas Gleixner 4b7de6df20 rseq: Cache CPU ID and MM CID values
In preparation for rewriting RSEQ exit to user space handling provide
storage to cache the CPU ID and MM CID values which were written to user
space. That prepares for a quick check, which avoids the update when
nothing changed.
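
The quick check this prepares for could look roughly like the sketch below.
It is a user space model under stated assumptions, not the kernel code: the
struct layouts and toy_rseq_update_ids() are hypothetical, and a write
counter stands in for the cost of touching user memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache of the values last written to user space. */
struct toy_rseq_ids {
	uint32_t cpu_id;
	uint32_t mm_cid;
};

/* Stand-in for the user space rseq area, with a write counter. */
struct toy_user_rseq {
	uint32_t cpu_id;
	uint32_t mm_cid;
	unsigned int nr_writes;
};

/* Returns true if user space had to be updated, false if it was skipped. */
static bool toy_rseq_update_ids(struct toy_rseq_ids *cache,
				struct toy_user_rseq *urseq,
				uint32_t cpu_id, uint32_t mm_cid)
{
	/* Quick check: nothing changed since the last write. */
	if (cache->cpu_id == cpu_id && cache->mm_cid == mm_cid)
		return false;

	urseq->cpu_id = cpu_id;
	urseq->mm_cid = mm_cid;
	urseq->nr_writes++;

	/* Remember what user space now holds. */
	cache->cpu_id = cpu_id;
	cache->mm_cid = mm_cid;
	return true;
}
```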

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.841964081@linutronix.de
2025-11-04 08:32:14 +01:00
Eric Dumazet 24990d89c2 trace: tcp: add three metrics to trace_tcp_rcvbuf_grow()
While chasing yet another receive autotuning bug,
I found it useful to add rcv_ssthresh, window_clamp and rcv_wnd.

tcp_stream 40597 [068]  2172.978198: tcp:tcp_rcvbuf_grow: time=50307 rtt_us=50179 copied=77824 inq=0 space=40960 ooo=0 scaling_ratio=219 rcvbuf=131072 rcv_ssthresh=107474 window_clamp=112128 rcv_wnd=110592
tcp_stream 40597 [068]  2173.028528: tcp:tcp_rcvbuf_grow: time=50336 rtt_us=50206 copied=110592 inq=0 space=77824 ooo=0 scaling_ratio=219 rcvbuf=509444 rcv_ssthresh=328658 window_clamp=435813 rcv_wnd=331776
tcp_stream 40597 [068]  2173.078830: tcp:tcp_rcvbuf_grow: time=50305 rtt_us=50070 copied=270336 inq=0 space=110592 ooo=0 scaling_ratio=219 rcvbuf=509444 rcv_ssthresh=431159 window_clamp=435813 rcv_wnd=434176
tcp_stream 40597 [068]  2173.129137: tcp:tcp_rcvbuf_grow: time=50313 rtt_us=50118 copied=434176 inq=0 space=270336 ooo=0 scaling_ratio=219 rcvbuf=2457847 rcv_ssthresh=1299511 window_clamp=2102611 rcv_wnd=1302528
tcp_stream 40597 [068]  2173.179451: tcp:tcp_rcvbuf_grow: time=50318 rtt_us=50041 copied=1019904 inq=0 space=434176 ooo=0 scaling_ratio=219 rcvbuf=2457847 rcv_ssthresh=2087445 window_clamp=2102611 rcv_wnd=2088960

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-2-74b43ba4c84c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-29 17:30:18 -07:00
Steven Rostedt 011ea0501d tracing: Display some syscall arrays as strings
Some of the system calls that read a fixed length of memory from the user
space address are not arrays but strings. Take a bit away from the nb_args
field in the syscall meta data to use as a flag to denote that the system
call's user_arg_size is being used as a string. The nb_args should never
be more than 6, so 7 bits is plenty to hold that number. When the
user_arg_is_str flag is set, the data array from the user space address is
displayed as a string and not an array.

This will allow the output to look like this:

  sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6)
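
The bit split can be pictured with the sketch below. It is an assumption
about the shape of the idea rather than the actual struct syscall_metadata
layout: the struct, field types and toy_format_arg() are hypothetical.

```c
#include <stdio.h>

/* Hypothetical layout: 7 bits for the argument count, 1 bit for the flag. */
struct toy_syscall_meta {
	unsigned int nb_args : 7;		/* never more than 6 in practice */
	unsigned int user_arg_is_str : 1;	/* display user data as a string */
};

/*
 * Render the data copied from user space. With the flag set the bytes are
 * shown as a quoted string; otherwise only a size placeholder is printed
 * here (the array rendering is out of scope for this sketch).
 */
static int toy_format_arg(char *out, size_t len,
			  const struct toy_syscall_meta *meta,
			  const char *data, int size)
{
	if (meta->user_arg_is_str)
		return snprintf(out, len, "\"%.*s\"", size, data);
	return snprintf(out, len, "(%d bytes)", size);
}
```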

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Takaya Saeki <takayas@google.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/20251028231147.930550359@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-28 20:10:58 -04:00
Steven Rostedt b4f7624cfc tracing: Have system call events record user array data
For system call events that have a length field, add a "user_arg_size"
parameter to the system call meta data. It denotes the index in the args
array of the argument that holds the size of the argument that the
user_mask field has a bit set for.

The "user_mask" has a bit set that denotes the arg that points to an array
in the user space address space. If a system call event has both the
user_mask field and the user_arg_size set, it will record the content of
that address into the trace event, up to the size defined by
SYSCALL_FAULT_BUF_SZ - 1.

This allows the output to look like:

  sys_write(fd: 0xa, buf: 0x5646978d13c0 (01:00:05:00:00:00:00:00:01:87:55:89:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00), count: 0x20)
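
The colon-separated rendering shown above can be reproduced in user space
with a sketch like the following. It is not the tracing code itself;
toy_format_user_array() is a hypothetical helper, and the output buffer
length plays the role of the size cap.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Render up to 'size' bytes of 'data' as "(aa:bb:cc)". The output is
 * truncated to fit 'len' bytes including the terminating NUL.
 * Returns the number of data bytes actually rendered.
 */
static size_t toy_format_user_array(char *out, size_t len,
				    const unsigned char *data, size_t size)
{
	size_t pos = 0, i;

	if (len < 3) {
		if (len)
			out[0] = '\0';
		return 0;
	}

	out[pos++] = '(';
	for (i = 0; i < size; i++) {
		/* This byte ("xx" or ":xx") plus the closing ')' and NUL. */
		size_t need = (i ? 3 : 2) + 2;

		if (pos + need > len)
			break;
		pos += (size_t)snprintf(out + pos, len - pos,
					i ? ":%02x" : "%02x", data[i]);
	}
	out[pos++] = ')';
	out[pos] = '\0';
	return i;
}
```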

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Takaya Saeki <takayas@google.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/20251028231147.763528474@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-28 20:10:58 -04:00
Steven Rostedt a544d9a66b tracing: Have syscall trace events read user space string
As of commit 654ced4a13 ("tracing: Introduce tracepoint_is_faultable()")
system call trace events allow faulting in user space memory. Have some of
the system call trace events take advantage of this.

Use the trace_user_fault_read() logic to read the buffer from user space
and, instead of just saving the pointer to the buffer in the system call
event, also save the string that is passed in.

The syscall event has its nb_args shortened from an int to a short (where
even u8 is plenty big enough) and the freed two bytes are used for
"user_mask".  The new "user_mask" field is used to store the index of the
"args" field array that has the address to read from user space. This
value is set to 0 if the system call event does not need to read user
space for a field. This mask can be used to know if the event may fault or
not. Only one bit set in user_mask is supported at this time.

This allows the output to look like this:

 sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4)
 sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820)
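
One way to read the "only one bit set" rule is sketched below. This is an
interpretation, assuming bit N of the mask corresponds to args[N];
toy_user_arg_index() is a hypothetical helper and not part of the patch.

```c
/*
 * Returns the index of the argument to read from user space, or -1 if
 * the mask is empty (no user read needed) or has more than one bit set
 * (unsupported in this sketch, mirroring the limitation described above).
 */
static int toy_user_arg_index(unsigned short user_mask)
{
	int idx;

	if (user_mask == 0 || (user_mask & (user_mask - 1)) != 0)
		return -1;

	/* Exactly one bit is set: count how far it is from bit 0. */
	for (idx = 0; !(user_mask & 1); idx++)
		user_mask >>= 1;
	return idx;
}
```

Under that reading, a mask of 0x1 would select the filename argument in
the sys_access() and sys_execve() examples above.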

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Takaya Saeki <takayas@google.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/20251028231147.261867956@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-28 20:10:58 -04:00
Sean Anderson 77a58ba7c6
spi: spi-mem: Trace exec_op
The spi subsystem has tracing, which is very convenient when debugging
problems. Add tracing for spi-mem too so that accesses that skip the spi
subsystem can still be seen.

The format is roughly based on the existing spi tracing. We don't bother
tracing the op's address because the tracing happens while the memory is
locked, so there can be no confusion about the matching of start and
stop. The conversion of cmd/addr/dummy to an array is directly analogous
to the conversion in the latter half of spi_mem_exec_op.
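
For readers unfamiliar with that conversion, the sketch below shows the
general shape: opcode first, then the address bytes most significant byte
first, then the dummy bytes. It is a simplified user space model; the
struct is hypothetical, and the big-endian address order and the 0xff fill
for dummy bytes are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, flattened stand-in for the cmd/addr/dummy parts of an op. */
struct toy_spi_mem_op {
	uint8_t opcode;
	uint8_t addr_nbytes;
	uint64_t addr_val;
	uint8_t dummy_nbytes;
};

/* Returns the number of bytes written, or 0 if the op does not fit. */
static size_t toy_op_to_bytes(const struct toy_spi_mem_op *op,
			      uint8_t *buf, size_t len)
{
	size_t total = 1 + (size_t)op->addr_nbytes + op->dummy_nbytes;
	size_t i;

	if (len < total || op->addr_nbytes > 8)
		return 0;

	buf[0] = op->opcode;
	/* Address bytes, most significant byte first. */
	for (i = 0; i < op->addr_nbytes; i++)
		buf[1 + i] = (uint8_t)(op->addr_val >>
				       (8 * (op->addr_nbytes - i - 1)));
	/* Dummy bytes carry no information; fill them with 0xff. */
	memset(buf + 1 + op->addr_nbytes, 0xff, op->dummy_nbytes);
	return total;
}
```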

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20251021144702.1582397-1-sean.anderson@linux.dev
Signed-off-by: Mark Brown <broonie@kernel.org>
2025-10-27 11:10:50 +00:00
Mateusz Guzik f5aa78e2be
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Nothing to look at apart from iput_final().

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Linus Torvalds 16d1ba7c96 dma-mapping fixes for Linux 6.18:
- two small fixes for the recently performed code refactoring (Shigeru
 Yoshida) and missing handling of direction parameter in DMA debug code
 (Petr Tesarik)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaOTg+QAKCRCJp1EFxbsS
 RDFkAQCxV0khAeYDuiPdub+9d5XGVGTjBxG1ErYsvDbbsZ3QpAEAhheuAdbMBYU1
 kOwmGuBUY32d0cMz0/4BfKbIzuQs9wE=
 =aE76
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.18-2025-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fixes from Marek Szyprowski:
 "Two small fixes for the recently performed code refactoring (Shigeru
  Yoshida) and missing handling of direction parameter in DMA debug code
  (Petr Tesarik)"

* tag 'dma-mapping-6.18-2025-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-mapping: fix direction in dma_alloc direction traces
  kmsan: fix kmsan_handle_dma() to avoid false positives
2025-10-07 12:48:06 -07:00
Linus Torvalds f3826aa996 guest_memfd:
* Add support for host userspace mapping of guest_memfd-backed memory for VM
   types that do NOT support KVM_MEMORY_ATTRIBUTE_PRIVATE (which isn't
   precisely the same thing as CoCo VMs, since x86's SEV-MEM and SEV-ES have
   no way to detect private vs. shared).
 
   This lays the groundwork for removal of guest memory from the kernel direct
   map, as well as for limited mmap() for guest_memfd-backed memory.
 
   For more information see:
   * a6ad54137a ("Merge branch 'guest-memfd-mmap' into HEAD", 2025-08-27)
   * https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
     (guest_memfd in Firecracker)
   * https://lore.kernel.org/all/20250221160728.1584559-1-roypat@amazon.co.uk/
     (direct map removal)
   * https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/
     (mmap support)
 
 ARM:
 
 * Add support for FF-A 1.2 as the secure memory conduit for pKVM,
   allowing more registers to be used as part of the message payload.
 
 * Change the way pKVM allocates its VM handles, making sure that the
   privileged hypervisor is never tricked into using uninitialised
   data.
 
 * Speed up MMIO range registration by avoiding unnecessary RCU
   synchronisation, which results in VMs starting much quicker.
 
 * Add the dump of the instruction stream when panic-ing in the EL2
   payload, just like the rest of the kernel has always done. This will
   hopefully help debugging non-VHE setups.
 
 * Add 52bit PA support to the stage-1 page-table walker, and make use
   of it to populate the fault level reported to the guest on failing
   to translate a stage-1 walk.
 
 * Add NV support to the GICv3-on-GICv5 emulation code, ensuring
   feature parity for guests, irrespective of the host platform.
 
 * Fix some really ugly architecture problems when dealing with debug
   in a nested VM. This has some bad performance impacts, but is at
   least correct.
 
 * Add enough infrastructure to be able to disable EL2 features and
   give effective values to the EL2 control registers. This then allows
   a bunch of features to be turned off, which helps cross-host
   migration.
 
 * Large rework of the selftest infrastructure to allow most tests to
   transparently run at EL2. This is the first step towards enabling
   NV testing.
 
 * Various fixes and improvements all over the map, including one BE
   fix, just in time for the removal of the feature.
 
 LoongArch:
 
 * Detect page table walk feature on new hardware
 
 * Add sign extension with kernel MMIO/IOCSR emulation
 
 * Improve in-kernel IPI emulation
 
 * Improve in-kernel PCH-PIC emulation
 
 * Move kvm_iocsr tracepoint out of generic code
 
 RISC-V:
 
 * Added SBI FWFT extension for Guest/VM with misaligned delegation and
   pointer masking PMLEN features
 
 * Added ONE_REG interface for SBI FWFT extension
 
 * Added Zicbop and bfloat16 extensions for Guest/VM
 
 * Enabled more common KVM selftests for RISC-V
 
 * Added SBI v3.0 PMU enhancements in KVM and perf driver
 
 s390:
 
 * Improve interrupt cpu for wakeup, in particular the heuristic to decide
   which vCPU to deliver a floating interrupt to.
 
 * Clear the PTE when discarding a swapped page because of CMMA; this
   bug was introduced in 6.16 when refactoring gmap code.
 
 x86 selftests:
 
 * Add #DE coverage in the fastops test (the only exception that's guest-
   triggerable in fastop-emulated instructions).
 
 * Fix PMU selftests errors encountered on Granite Rapids (GNR), Sierra
   Forest (SRF) and Clearwater Forest (CWF).
 
 * Minor cleanups and improvements
 
 x86 (guest side):
 
 * Force the legacy PCI hole (memory between TOLUD and 4GiB) to UC when
   overriding guest MTRR for TDX/SNP to fix an issue where ACPI auto-mapping
   could map devices as WB and prevent the device drivers from mapping their
   devices with UC/UC-.
 
 * Make kvm_async_pf_task_wake() a local static helper and remove its
   export.
 
 * Use native qspinlocks when running in a VM with dedicated vCPU=>pCPU
   bindings even when PV_UNHALT is unsupported.
 
 Generic:
 
 * Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as __GFP_NOWARN is
   now included in GFP_NOWAIT.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmjcGSkUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPSPAgAnJDswU4fZ5YdJr6jGzsbSQ6utlIV
 FeEltLKQIM7Aq/uvL6PLN5Kx1Pb/d9r9ag39mDT6lq9fOfJdOLjJr2SBXPTCsrPS
 6hyNL1mlgo5qzs54T8dkMbQThlSgA4zaehsc0zl8vnwil6ygoAdrtTHqZm6V0hu/
 F/sVlikCsLix1hC0KtzwscyWYcjWtXfVoi9eU5WY6ALpQaVXfRUtwyOhGDkldr+m
 i3iDiGiLAZ5Iu3igUCIOEzSSQY0FgLJpzbwJAeUxIvomDkHGJLaR14ijvM+NkRZi
 FBo2CLbjrwXb56Rbh2ABcq0CGJ3EiU3L+CC34UaRLzbtl/2BtpetkC3irA==
 =fyov
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "This excludes the bulk of the x86 changes, which I will send
  separately. They have two not complex but relatively unusual conflicts
  so I will wait for other dust to settle.

  guest_memfd:

   - Add support for host userspace mapping of guest_memfd-backed memory
     for VM types that do NOT support KVM_MEMORY_ATTRIBUTE_PRIVATE
     (which isn't precisely the same thing as CoCo VMs, since x86's
     SEV-MEM and SEV-ES have no way to detect private vs. shared).

     This lays the groundwork for removal of guest memory from the
     kernel direct map, as well as for limited mmap() for
     guest_memfd-backed memory.

     For more information see:
       - commit a6ad54137a ("Merge branch 'guest-memfd-mmap' into HEAD")
       - guest_memfd in Firecracker:
           https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
       - direct map removal:
           https://lore.kernel.org/all/20250221160728.1584559-1-roypat@amazon.co.uk/
       - mmap support:
           https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/

  ARM:

   - Add support for FF-A 1.2 as the secure memory conduit for pKVM,
     allowing more registers to be used as part of the message payload.

   - Change the way pKVM allocates its VM handles, making sure that the
     privileged hypervisor is never tricked into using uninitialised
     data.

   - Speed up MMIO range registration by avoiding unnecessary RCU
     synchronisation, which results in VMs starting much quicker.

   - Add the dump of the instruction stream when panic-ing in the EL2
     payload, just like the rest of the kernel has always done. This
     will hopefully help debugging non-VHE setups.

   - Add 52bit PA support to the stage-1 page-table walker, and make use
     of it to populate the fault level reported to the guest on failing
     to translate a stage-1 walk.

   - Add NV support to the GICv3-on-GICv5 emulation code, ensuring
     feature parity for guests, irrespective of the host platform.

   - Fix some really ugly architecture problems when dealing with debug
     in a nested VM. This has some bad performance impacts, but is at
     least correct.

   - Add enough infrastructure to be able to disable EL2 features and
     give effective values to the EL2 control registers. This then
     allows a bunch of features to be turned off, which helps cross-host
     migration.

   - Large rework of the selftest infrastructure to allow most tests to
     transparently run at EL2. This is the first step towards enabling
     NV testing.

   - Various fixes and improvements all over the map, including one BE
     fix, just in time for the removal of the feature.

  LoongArch:

   - Detect page table walk feature on new hardware

   - Add sign extension with kernel MMIO/IOCSR emulation

   - Improve in-kernel IPI emulation

   - Improve in-kernel PCH-PIC emulation

   - Move kvm_iocsr tracepoint out of generic code

  RISC-V:

   - Added SBI FWFT extension for Guest/VM with misaligned delegation
     and pointer masking PMLEN features

   - Added ONE_REG interface for SBI FWFT extension

   - Added Zicbop and bfloat16 extensions for Guest/VM

   - Enabled more common KVM selftests for RISC-V

   - Added SBI v3.0 PMU enhancements in KVM and perf driver

  s390:

   - Improve interrupt cpu for wakeup, in particular the heuristic to
     decide which vCPU to deliver a floating interrupt to.

   - Clear the PTE when discarding a swapped page because of CMMA; this
     bug was introduced in 6.16 when refactoring gmap code.

  x86 selftests:

   - Add #DE coverage in the fastops test (the only exception that's
     guest-triggerable in fastop-emulated instructions).

   - Fix PMU selftests errors encountered on Granite Rapids (GNR),
     Sierra Forest (SRF) and Clearwater Forest (CWF).

   - Minor cleanups and improvements

  x86 (guest side):

   - Force the legacy PCI hole (memory between TOLUD and 4GiB) to UC when
     overriding guest MTRR for TDX/SNP to fix an issue where ACPI
     auto-mapping could map devices as WB and prevent the device drivers
     from mapping their devices with UC/UC-.

   - Make kvm_async_pf_task_wake() a local static helper and remove its
     export.

   - Use native qspinlocks when running in a VM with dedicated
     vCPU=>pCPU bindings even when PV_UNHALT is unsupported.

  Generic:

   - Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as
     __GFP_NOWARN is now included in GFP_NOWAIT.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (178 commits)
  KVM: s390: Fix to clear PTE when discarding a swapped page
  KVM: arm64: selftests: Cover ID_AA64ISAR3_EL1 in set_id_regs
  KVM: arm64: selftests: Remove a duplicate register listing in set_id_regs
  KVM: arm64: selftests: Cope with arch silliness in EL2 selftest
  KVM: arm64: selftests: Add basic test for running in VHE EL2
  KVM: arm64: selftests: Enable EL2 by default
  KVM: arm64: selftests: Initialize HCR_EL2
  KVM: arm64: selftests: Use the vCPU attr for setting nr of PMU counters
  KVM: arm64: selftests: Use hyp timer IRQs when test runs at EL2
  KVM: arm64: selftests: Select SMCCC conduit based on current EL
  KVM: arm64: selftests: Provide helper for getting default vCPU target
  KVM: arm64: selftests: Alias EL1 registers to EL2 counterparts
  KVM: arm64: selftests: Create a VGICv3 for 'default' VMs
  KVM: arm64: selftests: Add unsanitised helpers for VGICv3 creation
  KVM: arm64: selftests: Add helper to check for VGICv3 support
  KVM: arm64: selftests: Initialize VGICv3 only once
  KVM: arm64: selftests: Provide kvm_arch_vm_post_create() in library code
  KVM: selftests: Add ex_str() to print human friendly name of exception vectors
  selftests/kvm: remove stale TODO in xapic_state_test
  KVM: selftests: Handle Intel Atom errata that leads to PMU event overcount
  ...
2025-10-04 08:52:16 -07:00
Linus Torvalds a498d59c46 dma-mapping updates for Linux 6.18:
- refactoring of DMA mapping API to physical addresses as the primary
 interface instead of page+offset parameters; this gets much closer to
 Matthew Wilcox's long term wish for struct-pageless IO to cacheable DRAM and
 supports the memdesc project, which seeks to substantially transform how
 struct page works; an advantage of this approach is the possibility of
 introducing DMA_ATTR_MMIO, which covers the existing 'dma_map_resource' flow
 in the common paths, which in turn allows the recently introduced
 dma_iova_link() API to be used to map PCI P2P MMIO without creating a
 struct page; developed by Leon Romanovsky and Jason Gunthorpe
 - minor clean-up by Petr Tesarik and Qianfeng Rong
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaNugqAAKCRCJp1EFxbsS
 RBvDAQCEd4P6pz6ROQHf5BfiF5J1db2H6bWsFLjajx3KfNWf8gD+P0eQ0hTzLrcd
 zuSKZTivviOiyjXlt/9GOaXXPnmTwA0=
 =b0nZ
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping updates from Marek Szyprowski:

 - Refactoring of DMA mapping API to physical addresses as the primary
   interface instead of page+offset parameters

   This gets much closer to Matthew Wilcox's long term wish for
   struct-pageless IO to cacheable DRAM and supports the memdesc
   project, which seeks to substantially transform how struct page works.

   An advantage of this approach is the possibility of introducing
   DMA_ATTR_MMIO, which covers the existing 'dma_map_resource' flow in the
   common paths. That in turn allows the recently introduced
   dma_iova_link() API to be used to map PCI P2P MMIO without creating a
   struct page.

   Developed by Leon Romanovsky and Jason Gunthorpe

 - Minor clean-up by Petr Tesarik and Qianfeng Rong
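
   As a small illustration of why a physical address is the more general
   interface, the sketch below converts a page+offset description into a
   single physical address. It is a user space toy: the 4 KiB page size,
   toy_phys_addr_t and the helper name are assumptions, and a page frame
   number stands in for struct page.

```c
#include <stdint.h>

/* Assumed 4 KiB pages for this sketch. */
#define TOY_PAGE_SHIFT 12

typedef uint64_t toy_phys_addr_t;

/*
 * Old-style description: a page (here its frame number) plus a byte
 * offset into it. The physical address carries the same information in
 * one value, and can also name memory that has no struct page at all.
 */
static toy_phys_addr_t toy_page_offset_to_phys(uint64_t pfn, uint32_t offset)
{
	return ((toy_phys_addr_t)pfn << TOY_PAGE_SHIFT) + offset;
}
```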

* tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  kmsan: fix missed kmsan_handle_dma() signature conversion
  mm/hmm: properly take MMIO path
  mm/hmm: migrate to physical address-based DMA mapping API
  dma-mapping: export new dma_*map_phys() interface
  xen: swiotlb: Open code map_resource callback
  dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs()
  kmsan: convert kmsan_handle_dma to use physical addresses
  dma-mapping: convert dma_direct_*map_page to be phys_addr_t based
  iommu/dma: implement DMA_ATTR_MMIO for iommu_dma_(un)map_phys()
  iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys
  dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys
  dma-debug: refactor to use physical addresses for page mapping
  iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link().
  dma-mapping: introduce new DMA attribute to indicate MMIO memory
  swiotlb: Remove redundant __GFP_NOWARN
  dma-direct: clean up the logic in __dma_direct_alloc_pages()
2025-10-03 17:41:12 -07:00
Linus Torvalds 070a542f08 NFS Client Updates for Linux 6.18
New Features:
  * Add a Kconfig option to redirect dfprintk() to the trace buffer
  * Enable use of the RWF_DONTCACHE flag on the NFS client
  * Add striped layout handling to pNFS flexfiles
  * Add proper localio handling for READ and WRITE O_DIRECT
 
 Bugfixes:
  * Handle NFS4ERR_GRACE errors during delegation recall
  * Fix NFSv4.1 backchannel max_resp_sz verification check
  * Fix mount hang after CREATE_SESSION failure
  * Fix d_parent->d_inode locking in nfs4_setup_readdir()
 
 Other Cleanups and Improvements:
  * Improvements to write handling tracepoints
  * Fix a few trivial spelling mistakes
  * Cleanups to the rpcbind cleanup call sites
  * Convert the SUNRPC xdr_buf to use a scratch folio instead of scratch page
  * Remove unused NFS_WBACK_BUSY() macro
  * Remove __GFP_NOWARN flags
  * Unexport rpc_malloc() and rpc_free()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmjgOEkACgkQ18tUv7Cl
 QOuX7RAA33AUq4NBxzDOgz4u4eNU/a8z2AazRgAtfmPVLTitrx/kqcfVEtHAdHFi
 cWkN2SO+TVzxIGOrudqNyjV2cjfUJV4ZJBkNY6lJvxPNAH27Dk9P2iMF12QYtHOq
 qyrwqoUQcyBkmtpgFUyHzydA4J17JDl5A7I/tOkro3ZfV4gmYAUwVdS+VtJoosLp
 7FnXv+W5FBWkfKrIT+vPyiBqxl0gZXmzUkJK2lG9m9NvE2Jk2MbPFyhdUEA5JybJ
 akNLdBnFwNWw2rLulSqs68ZbCGz6NY634q1Z+ZsRJ907ZdBqJ7zIBFv4yc/bMpZm
 Q9kh1M0OyvK0MlLRFe3efOLxRoN9nJPd+kuaw9eYw5V57Jrwj6QGV4nud2C8nzs8
 iB+LuJli+FRCeD84SY8NnMFKpXphHCeMXcBMRMsLTOSotJZFithO95+w1pKlK64A
 lxY1JXOQYelwJZxfhGPovwac4t1arpDjsumRlTmq12KaQnM3Z1gR2PUgeLxEPHQM
 f6gEiN9KDOhW/gZrFQxNs2hVAH68RDKpWxeR2XeVJlJYf37Hgh8bNGEiURi3G57n
 ED7tFbK9lzHVFR07hiP3Cvzop4z2mzadgHo+1vzdXmZZQA4gc4MSFfszFLCnQopw
 LEb7RqpVVXtQb+f7A+LuD+a2rLEnW+gTf6iLqCR8hAB5k1xmcYQ=
 =8wnU
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Add a Kconfig option to redirect dfprintk() to the trace buffer
   - Enable use of the RWF_DONTCACHE flag on the NFS client
   - Add striped layout handling to pNFS flexfiles
   - Add proper localio handling for READ and WRITE O_DIRECT

  Bugfixes:
   - Handle NFS4ERR_GRACE errors during delegation recall
   - Fix NFSv4.1 backchannel max_resp_sz verification check
   - Fix mount hang after CREATE_SESSION failure
   - Fix d_parent->d_inode locking in nfs4_setup_readdir()

  Other Cleanups and Improvements:
   - Improvements to write handling tracepoints
   - Fix a few trivial spelling mistakes
   - Cleanups to the rpcbind cleanup call sites
   - Convert the SUNRPC xdr_buf to use a scratch folio instead of
     scratch page
   - Remove unused NFS_WBACK_BUSY() macro
   - Remove __GFP_NOWARN flags
   - Unexport rpc_malloc() and rpc_free()"

* tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (46 commits)
  NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  nfs/localio: add tracepoints for misaligned DIO READ and WRITE support
  nfs/localio: add proper O_DIRECT support for READ and WRITE
  nfs/localio: refactor iocb initialization
  nfs/localio: refactor iocb and iov_iter_bvec initialization
  nfs/localio: avoid issuing misaligned IO using O_DIRECT
  nfs/localio: make trace_nfs_local_open_fh more useful
  NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  sunrpc: unexport rpc_malloc() and rpc_free()
  NFSv4/flexfiles: Add support for striped layouts
  NFSv4/flexfiles: Update layout stats & error paths for striped layouts
  NFSv4/flexfiles: Write path updates for striped layouts
  NFSv4/flexfiles: Commit path updates for striped layouts
  NFSv4/flexfiles: Read path updates for striped layouts
  NFSv4/flexfiles: Update low level helper functions to be DS stripe aware.
  NFSv4/flexfiles: Add data structure support for striped layouts
  NFSv4/flexfiles: Use ds_commit_idx when marking a write commit
  NFSv4/flexfiles: Remove cred local variable dependency
  nfs4_setup_readdir(): insufficient locking for ->d_parent->d_inode dereferencing
  NFS: Enable use of the RWF_DONTCACHE flag on the NFS client
  ...
2025-10-03 14:20:40 -07:00