- Implement masked user access
- Add support for internal only per-CPU instructions and inline the bpf_get_smp_processor_id() and bpf_get_current_task()
- Fix pSeries MSI-X allocation failure when quota is exceeded
- Fix recursive pci_lock_rescan_remove locking in EEH event handling
- Support tailcalls with subprogs & BPF exceptions on 64bit
- Extend "trusted" keys to support the PowerVM Key Wrapping Module (PKWM)
Thanks to: Abhishek Dubey, Christophe Leroy, Gaurav Batra, Guangshuo Li, Jarkko
Sakkinen, Mahesh Salgaonkar, Mimi Zohar, Miquel Sabaté Solà, Nam Cao, Narayana
Murty N, Nayna Jain, Nilay Shroff, Puranjay Mohan, Saket Kumar Bhaskar, Sourabh
Jain, Srish Srinivasan, Venkat Rao Bagalkote,
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEqX2DNAOgU8sBX3pRpnEsdPSHZJQFAmmL7S0ACgkQpnEsdPSH
ZJR2xA/9G+tZp9+2TidbLSyPT5o063uC5H5+j5QcvvHWi/ImGUNtixlDm5HcZLCR
yKywE1piKYBU8HoISjzAt0+JCVd3ZjJ8chTpKgCHLXPRSBTgdR1MG+SXQysDJSWb
yA4pwDikmwoLlfi+pf500F0nX2DRCRdU2Yi28ZFeaF/khJ7ebwj41QJ7LjN22+Q1
G8Kq543obZluzSoVvfG4xUK4ByWER+Zdd2F6699iMP68yw5PJ8PPc0SUGt8nuD4i
FUs0Lw7XV7i/K3+zm/ZgH5+Cvn7wOIcMNkXgFlxJXkFit97KXUDijifYPoXQyKLL
ksD7SPFdV0++Sc+3mWcgW4j+hQZC0Pn864unmh8C6ug3SagQ+3pE1JYWKwCmoyXd
49ROH0y+npArJ4NAc79eweunhafGcRYTSG+zV7swQvpRocMujEqa4CDz4uk1ll5W
1yAac08AN6PnfcXl2VMrcDboziTlQVFcnNQbK/ieYMC7KpgA+udw1hd2rOWNZCPd
u0byXxR1ak5YaAEuyMztd/39hrExx8306Jtkh5FIRZKWGAO+3np5bi3vxk11rDni
c9BGh2JIMtuPUGys3wcFPGMRTKwF2bDFW/pB+5hMHeLUdlkni9WGCX8eLe2klYsy
T7fBVb4d99IVrJGYv3J1lwELgjrgXvv35XOaUiyJyZhcbng15cc=
=zJoL
-----END PGP SIGNATURE-----
Merge tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates for 7.0
- Implement masked user access
- Add bpf support for internal only per-CPU instructions and inline the
bpf_get_smp_processor_id() and bpf_get_current_task() functions
- Fix pSeries MSI-X allocation failure when quota is exceeded
- Fix recursive pci_lock_rescan_remove locking in EEH event handling
- Support tailcalls with subprogs & BPF exceptions on 64bit
- Extend "trusted" keys to support the PowerVM Key Wrapping Module
(PKWM)
Thanks to Abhishek Dubey, Christophe Leroy, Gaurav Batra, Guangshuo Li,
Jarkko Sakkinen, Mahesh Salgaonkar, Mimi Zohar, Miquel Sabaté Solà, Nam
Cao, Narayana Murty N, Nayna Jain, Nilay Shroff, Puranjay Mohan, Saket
Kumar Bhaskar, Sourabh Jain, Srish Srinivasan, and Venkat Rao Bagalkote.
* tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (27 commits)
powerpc/pseries: plpks: export plpks_wrapping_is_supported
docs: trusted-encryped: add PKWM as a new trust source
keys/trusted_keys: establish PKWM as a trusted source
pseries/plpks: add HCALLs for PowerVM Key Wrapping Module
pseries/plpks: expose PowerVM wrapping features via the sysfs
powerpc/pseries: move the PLPKS config inside its own sysfs directory
pseries/plpks: fix kernel-doc comment inconsistencies
powerpc/smp: Add check for kcalloc() failure in parse_thread_groups()
powerpc: kgdb: Remove OUTBUFMAX constant
powerpc64/bpf: Additional NVR handling for bpf_throw
powerpc64/bpf: Support exceptions
powerpc64/bpf: Add arch_bpf_stack_walk() for BPF JIT
powerpc64/bpf: Avoid tailcall restore from trampoline
powerpc64/bpf: Support tailcalls with subprogs
powerpc64/bpf: Moving tail_call_cnt to bottom of frame
powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling
powerpc/pseries: Fix MSI-X allocation failure when quota is exceeded
powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory
powerpc64/bpf: Inline bpf_get_smp_processor_id() and bpf_get_current_task/_btf()
powerpc64/bpf: Support internal-only MOV instruction to resolve per-CPU addrs
...
clock interface, taming the include hell by splitting the pv_ops structure
and removing of a bunch of obsolete code. Work by Juergen Gross.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmmLKHAACgkQEsHwGGHe
VUrURg//Ucf+3EAIkLCmFkH0WwYmQl2JjRYww8bPAw3iJMIVxy4dMnaBbsUiAtUp
kYza+pgEtvyAwwd8RIEs85c9VhZn0DKoaWV8goBH3zFH6YvIRiLwb0w2QvjkF+70
FNU+4zlvt/I3FD+tWNElAgVtkFL3Gmzm44qyLLsPtlYaJ71xFl2XB7V+TlqXMHzE
m8BMenP9/CrbTlBBdNJGzAkAbWi1uAP+IydvuFNolH/F2lqVM2z5Ta3gUWWCIk/q
jWrPLDZCHr2WlBZNUGamKVVH9NEh+7YNwBAGUrSNYGZFoaFjqeX6lN3djzS+wXIj
0nDoW35jN0QNKz239MdXZDf1mfpb6ZQd/iOhFjo4dAvbm+J8WPAMr98ac8wR3Dyb
2LF/BxkoKWRabxQApXSCrLPXEuqT6Qc1+lDA0bNHg51zBoqP5vRNVZRwArnzGB+O
LxDKx+o4VYOf+UCaB6oQHjylbSgFvIedZ9p822hBe3QG9act8indRE8LWip7Utld
peoJGgvlQ0xtClh6FjVHpvmVfAvk7Zki5ywj2GwmB/TZ0yywuGStAjE3UqY168/M
gb7MSajh+HHZNj1/2+b/se4CUYlAgIPDQ+SwHJPm5TqyopvnOVi/2XWmjbx8I5jT
jS0nxaxD+SbESSZ6IMAsppnAAxAYbvRHGIS+6mtNCXVkaV1pMbA=
=AeFt
-----END PGP SIGNATURE-----
Merge tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt updates from Borislav Petkov:
- A nice cleanup to the paravirt code containing a unification of the
paravirt clock interface, taming the include hell by splitting the
pv_ops structure and removing of a bunch of obsolete code (Juergen
Gross)
* tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted()
x86/paravirt: Remove trailing semicolons from alternative asm templates
x86/pvlocks: Move paravirt spinlock functions into own header
x86/paravirt: Specify pv_ops array in paravirt macros
x86/paravirt: Allow pv-calls outside paravirt.h
objtool: Allow multiple pv_ops arrays
x86/xen: Drop xen_mmu_ops
x86/xen: Drop xen_cpu_ops
x86/xen: Drop xen_irq_ops
x86/paravirt: Move pv_native_*() prototypes to paravirt.c
x86/paravirt: Introduce new paravirt-base.h header
x86/paravirt: Move paravirt_sched_clock() related code into tsc.c
x86/paravirt: Use common code for paravirt_steal_clock()
riscv/paravirt: Use common code for paravirt_steal_clock()
loongarch/paravirt: Use common code for paravirt_steal_clock()
arm64/paravirt: Use common code for paravirt_steal_clock()
arm/paravirt: Use common code for paravirt_steal_clock()
sched: Move clock related paravirt code to kernel/sched
paravirt: Remove asm/paravirt_api_clock.h
x86/paravirt: Move thunk macros to paravirt_types.h
...
- Provide the missing 64-bit variant of clock_getres()
This allows the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and
finally the removal of 32-bit time types from the kernel and UAPI.
- Remove the useless and broken getcpu_cache from the VDSO
The intention was to provide a trivial way to retrieve the CPU number from
the VDSO, but as the VDSO data is per process there is no way to make it
work.
- Switch get/put_unaligned() from packed struct to memcpy()
The packed struct violates strict aliasing rules which requires to pass
-fno-strict-aliasing to the compiler. As this are scalar values
__builtin_memcpy() turns them into simple loads and stores
- Use __typeof_unqual__() for __unqual_scalar_typeof()
The get/put_unaligned() changes triggered a new sparse warning when __beNN
types are used with get/put_unaligned() as sparse builds add a special
'bitwise' attribute to them which prevents sparse to evaluate the Generic
in __unqual_scalar_typeof().
Newer sparse versions support __typeof_unqual__() which avoids the problem,
but requires a recent sparse install. So this adds a sanity check to sparse
builds, which validates that sparse is available and capable of handling it.
- Force inline __cvdso_clock_getres_common()
Compilers sometimes un-inline agressively, which results in function call
overhead and problems with automatic stack variable initialization.
Interestingly enough the force inlining results in smaller code than the
un-inlined variant produced by GCC when optimizing for size.
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmmJ2eAQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoXhQD/4mjneVlRBbQ6Nt9LxxhzPlYXvED6u8b4SX
R//DQ4qqqagh2fSpE57tr3f56HqhUmXptGJSgEvr6tVKoyugsLqP6sN9J9J3o181
jnPQR0VBP9dihQZ5X93pKo9NZWxLIn0uLD0RsdCbE9NTx4ciw4VIPFYQqkC4Rw6b
jiRrDR2l8EhV8cmxB6puW5WaQ932M6Awabw9RumzwH3MzIIlbc5Ero51S9eS64LL
byU5XeWUe295W1Gxze5RHHJWyNQEyx1eUCFfe3LWvfpz7FMzc2AQsKnIJDzW3GiO
UGu5MGptbLpG+ccvhVEs6/Ls5pWXcoCw4WuDNAunCCOmqda98oDniKf2LwRRbLT0
nAfLNatMnhXdTPk2zbS45z9uipUQAGKmVAE3/LVqB+ekcutmIGMyqHgR75QX0b4l
CQPkC9rBsV6gGsScWTnhRydhqioNO/uhhrQv0vEXnKZa0ysTbgZKt3JDZbgUEL2B
uDxXKyrqjpnqDZKlMMaoLtwd+l+T80ya4/NhHd4ZNGUpTUrHVw2H47lgE7ahCxEk
/SvXTZSU4Jp8sVQIQ+J6y5z2AQ/xGy++zvNKiZMyP9fQuPqhiDqYkLpmOp61bARx
wqyVsfGXZYSB1l16AiSC/CyUDcMqqsFohGXQ/Yf0SOiVtXu2WUyfofY1N0IXIGu4
WOVV9mH1yg==
=0ogX
-----END PGP SIGNATURE-----
Merge tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO updates from Thomas Gleixner:
- Provide the missing 64-bit variant of clock_getres()
This allows the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and
finally the removal of 32-bit time types from the kernel and UAPI.
- Remove the useless and broken getcpu_cache from the VDSO
The intention was to provide a trivial way to retrieve the CPU number
from the VDSO, but as the VDSO data is per process there is no way to
make it work.
- Switch get/put_unaligned() from packed struct to memcpy()
The packed struct violates strict aliasing rules which requires to
pass -fno-strict-aliasing to the compiler. As this are scalar values
__builtin_memcpy() turns them into simple loads and stores
- Use __typeof_unqual__() for __unqual_scalar_typeof()
The get/put_unaligned() changes triggered a new sparse warning when
__beNN types are used with get/put_unaligned() as sparse builds add a
special 'bitwise' attribute to them which prevents sparse to evaluate
the Generic in __unqual_scalar_typeof().
Newer sparse versions support __typeof_unqual__() which avoids the
problem, but requires a recent sparse install. So this adds a sanity
check to sparse builds, which validates that sparse is available and
capable of handling it.
- Force inline __cvdso_clock_getres_common()
Compilers sometimes un-inline agressively, which results in function
call overhead and problems with automatic stack variable
initialization.
Interestingly enough the force inlining results in smaller code than
the un-inlined variant produced by GCC when optimizing for size.
* tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
vdso/gettimeofday: Force inlining of __cvdso_clock_getres_common()
x86/percpu: Make CONFIG_USE_X86_SEG_SUPPORT work with sparse
compiler: Use __typeof_unqual__() for __unqual_scalar_typeof()
powerpc/vdso: Provide clock_getres_time64()
tools headers: Remove unneeded ignoring of warnings in unaligned.h
tools headers: Update the linux/unaligned.h copy with the kernel sources
vdso: Switch get/put_unaligned() from packed struct to memcpy()
parisc: Inline a type punning version of get_unaligned_le32()
vdso: Remove struct getcpu_cache
MIPS: vdso: Provide getres_time64() for 32-bit ABIs
arm64: vdso32: Provide clock_getres_time64()
ARM: VDSO: Provide clock_getres_time64()
ARM: VDSO: Patch out __vdso_clock_getres() if unavailable
x86/vdso: Provide clock_getres_time64() for x86-32
selftests: vDSO: vdso_test_abi: Add test for clock_getres_time64()
selftests: vDSO: vdso_test_abi: Use UAPI system call numbers
selftests: vDSO: vdso_config: Add configurations for clock_getres_time64()
vdso: Add prototype for __vdso_clock_getres_time64()
- Inline timecounter_cyc2time() as that is now used in the networking
hotpath. Inlining it significantly improves performance.
- Optimize the tick dependency check in case that the tracepoint is disabled,
which improves the hotpath performance in the tick management code, which
is a hotpath on transitions in and out of idle.
- The usual cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmmJ1WUQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoabjD/9cc06W408fLGfqEGlriV8BnMJSGiWO/efe
SJPA4IziIu6DS7aQugdmk1vTqV9PxQWZ+L0tgSe1BNzUY+9x+z+I7X+asY6pjPmp
4zQSGSuotvFpRsWa4MsEgOjXk0QV/1Lm1Ba0eaBBqfY1wacoxJgjxHd5qjZBycUp
H1nLAshXbDBsKYdTLqI4KQq9ZbHKTPz47UhJiKwPHEIRSGNzuAdNkD7KOv34GEDU
ZHlTfWv8dMoH0lp/N/2+NmojYOzVXjH9UeeG3MYyfrcrzxU8Ifg/21zz9KgcyrF/
rrDnUCdnJUk2z8AltgSyQfqw3awUrTDTnHQDFe9oS4t5lexXd7fkmbjqsZBpP8OD
QQJd6yyI2j5GsKKI/fcYz/3NEHZR5HxjgT/Dv3qdHh/7L49FIXFI68kcOqdHFGSf
21CY25cBRr7uPnyAs/G3LtSb/WyBVgTzMaPo798gIqpoDkSIuSlR0x252yaI6kW9
fMXul0ZYwU5j8xGPt39IW9/Xm0HAMBUdvMusKn8CKBPfYCAx2EtH75NXilCuRgjG
UJvzEv20W3f3IXgvrVqTznKUh0Airsfo8blNtX/LqhCZraeJIRV/UGUrhL8QpOp+
Ri8FJ1Dr57oRJDDqM7XVckJuyOGkRvYtvJyPhbyEXiWZpDUqQnrivm+3j7DecHN2
6+ThtJ6P4A==
=Zh3T
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- Inline timecounter_cyc2time() as that is now used in the networking
hotpath. Inlining it significantly improves performance.
- Optimize the tick dependency check in case that the tracepoint is
disabled, which improves the hotpath performance in the tick
management code, which is a hotpath on transitions in and out of
idle.
- The usual cleanups and improvements
* tag 'timers-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/kunit: Document handling of negative years of is_leap()
tick/nohz: Optimize check_tick_dependency() with early return
time/sched_clock: Use ACCESS_PRIVATE() to evaluate hrtimer::function
hrtimer: Drop _tv64() helpers
hrtimer: Remove public definition of HIGH_RES_NSEC
hrtimer: Remove unused resolution constants
time/timecounter: Inline timecounter_cyc2time()
- Add interrupt redirection infrastructure
Some PCI controllers use a single demultiplexing interrupt for the MSI
interrupts of subordinate devices.
This prevents setting the interrupt affinity of device interrupts, which
causes device interrupts to be delivered to a single CPU. That obviously is
counterproductive for multi-queue devices and interrupt balancing.
To work around this limitation the new infrastructure installs a dummy
irq_set_affinity() callback which captures the affinity mask and picks a
redirection target CPU out of the mask.
When the PCI controller demultiplexes the interrupts it invokes a new
handling function in the core, which either runs the interrupt handler in
the context of the target CPU or delegates it to irq_work on the target CPU.
- Utilize the interrupt redirection mechanism in the PCI DWC host controller
driver.
This allows affinity control for the subordinate device MSI interrupts
instead of being randomly executed on the CPU which runs the demultiplex
handler.
- Replace the binary 64-bit MSI flag with a DMA mask
Some PCI devices have PCI_MSI_FLAGS_64BIT in the MSI capability, but
implement less than 64 address bits. This breaks on platforms where such a
device is assigned an MSI address higher than what's supported.
With the binary 64-bit flag there is no other choice than disabling 64-bit
MSI support which leaves the device disfunctional.
By using a DMA mask the address limit of a device can be described
correctly which provides support for the above scenario.
- Make use of the DMA mask based address limit in the hda/intel and radeon
drivers to enable them on affected platforms.
- The usual small cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmmJyPsQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYocekEADAsS5FlUkFuBy6kODhl5J7b9/oqlL3IEnR
3CdOrFO716dce+Gej+Wp3T93dJ3XsfD7nCZuy99+LwUkTubmaBJXfjY9S+Ket0ID
Wc3ltiD6f3GEFB14rXN+fFG/u+OOLkaXdpbQpiTnqL4JAti9qF80D4uon28+FC/o
wc1MhqVBPbOHU9iM196ngkZuXCNVPLcnZN6PNBgIn0sxx06LcK+daY0bNGxfn5Ua
LY9SD8hN7tYlkDi42nB/ZXMrexqT9cxSqHObmPX+G/QLfXCRBtD+gyVbs+KVzpRL
hmFERTlUh9tUdcQFrjgiZP/r4N5ilzsu6w5ZpSOEsGuahFUPZWJWFFC1D8rmq/Ay
X9HKge1jqXJtbCf0pJM/kdbJKSH5S6aLP3iF37y+PqITIEIX8jIT3oVcvL9hI0BW
HFxpuJfhAVg63kMegZCO/iROTusLHUZr8iwYOM7pEiCE6fP46jPijsPffVIWvrlJ
2LVOv/A5wy9q8FW8sF9/M6CW7cdeYQF06Ce3qAyMxjZjEyR3KFBJCVWjhqyMxZJP
3zFl1XXKXgRO+CDrYKVTPIaXR5D76k/l6MnECQpq81CQyQKm2h6A9PyY+n70FfbZ
BimakUlBGCd92ZbSxzC9pAOiHo0ZoKtc5BhnsRhKVyBCmEKDazEplDuf49/OSZUE
p2kaf/PuOw==
=SCSQ
-----END PGP SIGNATURE-----
Merge tag 'irq-msi-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
"Updates for the [PCI] MSI subsystem:
- Add interrupt redirection infrastructure
Some PCI controllers use a single demultiplexing interrupt for the
MSI interrupts of subordinate devices.
This prevents setting the interrupt affinity of device interrupts,
which causes device interrupts to be delivered to a single CPU.
That obviously is counterproductive for multi-queue devices and
interrupt balancing.
To work around this limitation the new infrastructure installs a
dummy irq_set_affinity() callback which captures the affinity mask
and picks a redirection target CPU out of the mask.
When the PCI controller demultiplexes the interrupts it invokes a
new handling function in the core, which either runs the interrupt
handler in the context of the target CPU or delegates it to
irq_work on the target CPU.
- Utilize the interrupt redirection mechanism in the PCI DWC host
controller driver.
This allows affinity control for the subordinate device MSI
interrupts instead of being randomly executed on the CPU which runs
the demultiplex handler.
- Replace the binary 64-bit MSI flag with a DMA mask
Some PCI devices have PCI_MSI_FLAGS_64BIT in the MSI capability,
but implement less than 64 address bits. This breaks on platforms
where such a device is assigned an MSI address higher than what's
supported.
With the binary 64-bit flag there is no other choice than disabling
64-bit MSI support which leaves the device disfunctional.
By using a DMA mask the address limit of a device can be described
correctly which provides support for the above scenario.
- Make use of the DMA mask based address limit in the hda/intel and
radeon drivers to enable them on affected platforms
- The usual small cleanups and improvements"
* tag 'irq-msi-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ALSA: hda/intel: Make MSI address limit based on the device DMA limit
drm/radeon: Make MSI address limit based on the device DMA limit
PCI/MSI: Check the device specific address mask in msi_verify_entries()
PCI/MSI: Convert the boolean no_64bit_msi flag to a DMA address mask
genirq/redirect: Prevent writing MSI message on affinity change
PCI/MSI: Unmap MSI-X region on error
genirq: Update effective affinity for redirected interrupts
PCI: dwc: Enable MSI affinity support
PCI: dwc: Code cleanup
genirq: Add interrupt redirection infrastructure
genirq/msi: Correct kernel-doc in <linux/msi.h>
- Remove the interrupt timing infrastructure
This was added seven years ago to be used for power management purposes, but
that integration never happened.
- Clean up the remaining setup_percpu_irq() users
The memory allocator is available when interrupts can be requested so there
is not need for static irq_action. Move the remaining users to
request_percpu_irq() and delete the historical cruft.
- Warn when interrupt flag inconsistencies are detected in request*_irq().
Inconsistent flags can lead to hard to diagnose malfunction. The fallout of
this new warning has been addressed in next and the fixes are coming in via
the maintainer trees and the tip irq/cleanup pull requests.
- Invoke affinity notifier when CPU hotplug breaks affinity
Otherwise the code using the notifier misses the affinity change and
operates on stale information.
- The usual cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmmJwSAQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoV1rD/9lxepiLVhIvaaauyr69HkHnLDJyyyIuNQk
OZ5Rx8JEnmtJIN49DJ/Ps/VjDZA6GYV2RvtKfqCNtpA3AdpHcLCDw2/mZKetHH3v
0RlTNeu0TKt76Mo+d7vadJZqdcKbBPYEBucaQed4XLGFq9Vb7nuVMFLzkDZe+C5U
zG8zhtIaPQyY+MUxFGpWJdeKcddnIC/lKRFBmhglmuflQ8ZdrydRlJZzFBxfcnPy
dcZVnOVZYD4aaW+uryd9eVWg0de5q97C03OobVdhwt077zcmTZpGanRCBM/+7Kop
UDB/p04Z5FFFKFbd+j0f76gl1l6yQ06K/2Pby8qrLnodsqOEgjEgLs0qbHhkNsVw
g5W6ry9boLWAA1XhGFOnqZ4Dc/Qb1G7DpnRUzpmAlyL04MTManjocIHNPrGhjlNG
+jT3iPuraiRUoNZ9K2PBYX9SikGVnKJXFmOwlffkPpOtaJk/oYVUAdMq2W71C4Du
TZeAbd2RyrSPMYaSh7pnBhM7KBmsYIg+hHU1+RaAi/rFSOT0Se5sUWwmUUUzSfOn
KvCAzMGX6gIg8rkSmTp7OvLQhXEfJ2kEVrVZACS6MBxS7+PNVBkxYuOqxdhdoVj0
sZrfB8Qnm4mAcbq7YJtrD6az3+LH5CkrdNyWIAL62leMmqWDoTOgLRMFh2/yEMTl
BeAOgPO2pw==
=3hJ2
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
"Updates for the interrupt core subsystem:
- Remove the interrupt timing infrastructure
This was added seven years ago to be used for power management
purposes, but that integration never happened.
- Clean up the remaining setup_percpu_irq() users
The memory allocator is available when interrupts can be requested
so there is not need for static irq_action. Move the remaining
users to request_percpu_irq() and delete the historical cruft.
- Warn when interrupt flag inconsistencies are detected in
request*_irq().
Inconsistent flags can lead to hard to diagnose malfunction. The
fallout of this new warning has been addressed in next and the
fixes are coming in via the maintainer trees and the tip
irq/cleanup pull requests.
- Invoke affinity notifier when CPU hotplug breaks affinity
Otherwise the code using the notifier misses the affinity change
and operates on stale information.
- The usual cleanups and improvements"
* tag 'irq-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/proc: Replace snprintf with strscpy in register_handler_proc
genirq/cpuhotplug: Notify about affinity changes breaking the affinity mask
genirq: Move clear of kstat_irqs to free_desc()
genirq: Warn about using IRQF_ONESHOT without a threaded handler
irqdomain: Fix up const problem in irq_domain_set_name()
genirq: Remove setup_percpu_irq()
clocksource/drivers/mips-gic-timer: Move GIC timer to request_percpu_irq()
MIPS: Move IP27 timer to request_percpu_irq()
MIPS: Move IP30 timer to request_percpu_irq()
genirq: Remove __request_percpu_irq() helper
genirq: Remove IRQ timing tracking infrastructure
Scheduler Kconfig space updates:
- Further consolidate configurable preemption modes: reduce
the number of architectures that are allowed to offer
PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number
of preemption models from four to just two: 'full' and 'lazy'
on up-to-date architectures (arm64, loongarch, powerpc,
riscv, s390, x86).
None and voluntary are only available as legacy features
on platforms that don't implement lazy preemption yet,
or which don't even support preemption.
The goal is to eventually remove cond_resched() and
voluntary preemption altogether.
(Peter Zijlstra)
RSEQ based 'scheduler time slice extension' support:
This allows a thread to request a time slice extension when it
enters a critical section to avoid contention on a resource when
the thread is scheduled out inside of the critical section.
- Add fields and constants for time slice extension
- Provide static branch for time slice extensions
- Add statistics for time slice extensions
- Add prctl() to enable time slice extensions
- Implement sys_rseq_slice_yield()
- Implement syscall entry work for time slice extensions
- Implement time slice extension enforcement timer
- Reset slice extension when scheduled
- Implement rseq_grant_slice_extension()
- entry: Hook up rseq time slice extension
- selftests: Implement time slice extension test
(Thomas Gleixner)
- Allow registering RSEQ with slice extension
- Move slice_ext_nsec to debugfs
- Lower default slice extension
- selftests/rseq: Add rseq slice histogram script
(Peter Zijlstra)
Scheduler performance/scalability improvements:
- Update rq->avg_idle when a task is moved to an idle CPU,
which improves the scalability of various workloads.
(Shubhang Kaushik)
- Reorder fields in 'struct rq' for better caching
(Blake Jones)
- Fair scheduler SMP NOHZ balancing code speedups:
- Move checking for nohz cpus after time check
- Change likelyhood of nohz.nr_cpus
- Remove nohz.nr_cpus and use weight of cpumask instead
(Shrikanth Hegde)
- Avoid false sharing for sched_clock_irqtime (Wangyang Guo)
- Drop useless cpumask_empty() in find_energy_efficient_cpu()
- Simplify task_numa_find_cpu()
- Use cpumask_weight_and() in sched_balance_find_dst_group()
(Yury Norov)
DL scheduler updates:
- Add a deadline server for sched_ext tasks (by Andrea Righi and
Joel Fernandes, with fixes by Peter Zijlstra)
RT scheduler updates:
- Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)
Entry code updates and performance improvements, which is part of the
scheduler tree in this cycle due to interdependencies with the RSEQ
based time slice extension work:
- Remove unused syscall argument from syscall_trace_enter()
- Rework syscall_exit_to_user_mode_work() for architecture reuse
- Add arch_ptrace_report_syscall_entry/exit()
- Inline syscall_exit_work() and syscall_trace_enter()
(Jinjie Ruan)
Scheduler core updates:
- Rework sched_class::wakeup_preempt() and rq_modified_*()
- Avoid rq->lock bouncing in sched_balance_newidle()
- Rename rcu_dereference_check_sched_domain() =>
rcu_dereference_sched_domain()
- <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
(Peter Zijlstra)
Fair scheduler updates/refactoring:
- Fold the sched_avg update
- Change rcu_dereference_check_sched_domain() to rcu-sched
- Switch to rcu_dereference_all()
- Remove superfluous rcu_read_lock()
- Limit hrtick work
(Peter Zijlstra)
- Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
- Clean up comments in 'struct cfs_rq'
- Separate se->vlag from se->vprot
- Rename cfs_rq::avg_load to cfs_rq::sum_weight
- Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
- Introduce and use the vruntime_cmp() and vruntime_op() wrappers
for wrapped-signed aritmetics
- Sort out 'blocked_load*' namespace noise
(Ingo Molnar)
Scheduler debugging code updates:
- Export hidden tracepoints to modules (Gabriele Monaco)
- Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
(Fushuai Wang)
- Add assertions to QUEUE_CLASS (Peter Zijlstra)
- hrtimer: Fix tracing oddity (Thomas Gleixner)
Misc fixes and cleanups:
- Re-evaluate scheduling when migrating queued tasks out of
throttled cgroups (Zicheng Qu)
- Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
- Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
- sched/cpufreq: Use %pe format for PTR_ERR() printing (zenghongling)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmJj+IRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1grtQ//WyXYGVE/WicdqslfaCY2Mr0uJnL0tLSM
CJp+0LROdkmy+ChJmftO8RgjCUSsjhC4/xcBhUQXApf/ffQi3b2jH6nkTp/Z64Ms
p2IXLkBiZjwdcO6fGbB0JE2G1J4hGRC5BlqfgkZzWidMf3kIbmrHg99mVWGzODLY
N/cPW4d0WGf9TScl1FgEiOqgF3czMLlqvTDJqaFMpsTzSUcRBnrG4xushb4W/bBx
573eqxgZJ6urNSGu8niY9PAl9F7gskXW3YxI3k8SH7VmJKSevWlwI9vMEhcRDzud
E0XxD7J8iPOKtr7ypXm7anMBv4jWVUdAnPbYi4TDsyDDU/HguqMqT1McTGn8wQ+F
jmdhmMC9/TEIzq93SNLbCYieibqDsmJoNVFFi0FWfPLMtYbcZd5a884SIz532vx4
DegdlDXdazUwhxzDiQR3sq1CsHXpxNS2YdrpadAtF/r2gU86DQjsEew8yBvXi7bb
Wrkzpax70sU1AFI23wJQkEb/OnnXyehAHAhhQN6GVvuiGr9P7C02WLEGLlmSmJrx
zl2F750P76yhTfGcvTfJ/5LTfSB+yRozGvcdXnIkyzWotY6a2D1MKNusAfVax+IR
kyfAWqVdxBhlKnqYbu92lTogvnPh3Lymd6G4TZZRkSH2jixyGd2oS7nZaDBAeBEM
NHQtr9R+KyU=
=Xj2f
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scheduler Kconfig space updates:
- Further consolidate configurable preemption modes (Peter Zijlstra)
Reduce the number of architectures that are allowed to offer
PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
preemption models from four to just two: 'full' and 'lazy' on
up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86).
None and voluntary are only available as legacy features on
platforms that don't implement lazy preemption yet, or which don't
even support preemption.
The goal is to eventually remove cond_resched() and voluntary
preemption altogether.
RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
and Peter Zijlstra):
This allows a thread to request a time slice extension when it enters
a critical section to avoid contention on a resource when the thread
is scheduled out inside of the critical section.
- Add fields and constants for time slice extension
- Provide static branch for time slice extensions
- Add statistics for time slice extensions
- Add prctl() to enable time slice extensions
- Implement sys_rseq_slice_yield()
- Implement syscall entry work for time slice extensions
- Implement time slice extension enforcement timer
- Reset slice extension when scheduled
- Implement rseq_grant_slice_extension()
- entry: Hook up rseq time slice extension
- selftests: Implement time slice extension test
- Allow registering RSEQ with slice extension
- Move slice_ext_nsec to debugfs
- Lower default slice extension
- selftests/rseq: Add rseq slice histogram script
Scheduler performance/scalability improvements:
- Update rq->avg_idle when a task is moved to an idle CPU, which
improves the scalability of various workloads (Shubhang Kaushik)
- Reorder fields in 'struct rq' for better caching (Blake Jones)
- Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
- Move checking for nohz cpus after time check
- Change likelyhood of nohz.nr_cpus
- Remove nohz.nr_cpus and use weight of cpumask instead
- Avoid false sharing for sched_clock_irqtime (Wangyang Guo)
- Cleanups (Yury Norov):
- Drop useless cpumask_empty() in find_energy_efficient_cpu()
- Simplify task_numa_find_cpu()
- Use cpumask_weight_and() in sched_balance_find_dst_group()
DL scheduler updates:
- Add a deadline server for sched_ext tasks (by Andrea Righi and Joel
Fernandes, with fixes by Peter Zijlstra)
RT scheduler updates:
- Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)
Entry code updates and performance improvements (Jinjie Ruan)
This is part of the scheduler tree in this cycle due to inter-
dependencies with the RSEQ based time slice extension work:
- Remove unused syscall argument from syscall_trace_enter()
- Rework syscall_exit_to_user_mode_work() for architecture reuse
- Add arch_ptrace_report_syscall_entry/exit()
- Inline syscall_exit_work() and syscall_trace_enter()
Scheduler core updates (Peter Zijlstra):
- Rework sched_class::wakeup_preempt() and rq_modified_*()
- Avoid rq->lock bouncing in sched_balance_newidle()
- Rename rcu_dereference_check_sched_domain() =>
rcu_dereference_sched_domain()
- <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):
- Fold the sched_avg update
- Change rcu_dereference_check_sched_domain() to rcu-sched
- Switch to rcu_dereference_all()
- Remove superfluous rcu_read_lock()
- Limit hrtick work
- Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
- Clean up comments in 'struct cfs_rq'
- Separate se->vlag from se->vprot
- Rename cfs_rq::avg_load to cfs_rq::sum_weight
- Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
- Introduce and use the vruntime_cmp() and vruntime_op() wrappers for
wrapped-signed aritmetics
- Sort out 'blocked_load*' namespace noise
Scheduler debugging code updates:
- Export hidden tracepoints to modules (Gabriele Monaco)
- Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
(Fushuai Wang)
- Add assertions to QUEUE_CLASS (Peter Zijlstra)
- hrtimer: Fix tracing oddity (Thomas Gleixner)
Misc fixes and cleanups:
- Re-evaluate scheduling when migrating queued tasks out of throttled
cgroups (Zicheng Qu)
- Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
- Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
- sched/cpufreq: Use %pe format for PTR_ERR() printing
(zenghongling)"
* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
sched/cpufreq: Use %pe format for PTR_ERR() printing
sched/rt: Skip currently executing CPU in rto_next_cpu()
sched/clock: Avoid false sharing for sched_clock_irqtime
selftests/sched_ext: Add test for DL server total_bw consistency
selftests/sched_ext: Add test for sched_ext dl_server
sched/debug: Fix dl_server (re)start conditions
sched/debug: Add support to change sched_ext server params
sched_ext: Add a DL server for sched_ext tasks
sched/debug: Stop and start server based on if it was active
sched/debug: Fix updating of ppos on server write ops
sched/deadline: Clear the defer params
entry: Inline syscall_exit_work() and syscall_trace_enter()
entry: Add arch_ptrace_report_syscall_entry/exit()
entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
entry: Remove unused syscall argument from syscall_trace_enter()
sched: remove task_struct->faults_disabled_mapping
sched: Update rq->avg_idle when a task is moved to an idle CPU
selftests/rseq: Add rseq slice histogram script
hrtimer: Fix trace oddity
...
Lock debugging:
- Implement compiler-driven static analysis locking context
checking, using the upcoming Clang 22 compiler's context
analysis features. (Marco Elver)
We removed Sparse context analysis support, because prior to
removal even a defconfig kernel produced 1,700+ context
tracking Sparse warnings, the overwhelming majority of which
are false positives. On an allmodconfig kernel the number of
false positive context tracking Sparse warnings grows to
over 5,200... On the plus side of the balance actual locking
bugs found by Sparse context analysis is also rather ... sparse:
I found only 3 such commits in the last 3 years. So the
rate of false positives and the maintenance overhead is
rather high and there appears to be no active policy in
place to achieve a zero-warnings baseline to move the
annotations & fixers to developers who introduce new code.
Clang context analysis is more complete and more aggressive
in trying to find bugs, at least in principle. Plus it has
a different model to enabling it: it's enabled subsystem by
subsystem, which results in zero warnings on all relevant
kernel builds (as far as our testing managed to cover it).
Which allowed us to enable it by default, similar to other
compiler warnings, with the expectation that there are no
warnings going forward. This enforces a zero-warnings baseline
on clang-22+ builds. (Which are still limited in distribution,
admittedly.)
Hopefully the Clang approach can lead to a more maintainable
zero-warnings status quo and policy, with more and more
subsystems and drivers enabling the feature. Context tracking
can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y
(default disabled), but this will generate a lot of false positives.
( Having said that, Sparse support could still be added back,
if anyone is interested - the removal patch is still
relatively straightforward to revert at this stage. )
Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)
- Add support for Atomic<i8/i16/bool> and replace most Rust native
AtomicBool usages with Atomic<bool>
- Clean up LockClassKey and improve its documentation
- Add missing Send and Sync trait implementation for SetOnce
- Make ARef Unpin as it is supposed to be
- Add __rust_helper to a few Rust helpers as a preparation for
helper LTO
- Inline various lock related functions to avoid additional
function calls.
WW mutexes:
- Extend ww_mutex tests and other test-ww_mutex updates (John Stultz)
Misc fixes and cleanups:
- rcu: Mark lockdep_assert_rcu_helper() __always_inline
(Arnd Bergmann)
- locking/local_lock: Include more missing headers (Peter Zijlstra)
- seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)
- rust: sync: Replace `kernel::c_str!` with C-Strings
(Tamir Duberstein)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmIXiURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gH+A/9GX5UmU6+HuDfDrCtXm9GDve6wkwahvcW
jLDxOYjs764I2BhyjZnjKjyF5zw60hbykem7Wcf5EV2YH30nM4XRgEWVJfkr1UAI
Pra415X4DdOzZ6qYQIpO8Udt1LtR7BMSaXITVLJaLicxEoOVtq3SKxjqyhCFs7UW
MfJdqleB+RMLqq3LlzgB4l43eKk1xyeHh+oQwI0RSxuIpVZme3p4TObnCKjIWnK7
Ihd+dkgC852WBjANgNL7F/sd5UsF5QX3wjtOrLhMKvkIgTPdXln0g398pivjN/G/
Kpnw18SFeb159JfJu8eMotsYvVnQ0D5aOcTBfL4qvOHCImhpcu2s6ik9BcXqt2yT
8IiuWk9xEM3Ok+I/I4ClT5cf5GYpyigV2QsXxn+IjDX5Na8v4zlHh0r8SElP8fOt
7dpQx7iw8UghAib3AzA3suN78Oh39m8l5BNobj7LAjnqOQcVvoPo4o7/48ntuH7A
38EucFrXfxQBMfGbMwvxEmgYuX7MyVfQLaPE06MHy1BkZkffT8Um38TB0iNtZmtf
WUx01yLKWYspehlwFi319uVI4/Zp7FnTfqa5uKv1oSXVdL9vZojSXUzrgDV7FVqT
Z4xAAw/kwNHpUG7y0zNOqd6PukovG1t+CjbLvK+eHPwc5c0vEGG2oTRAfEvvP1z/
kesYDmCyJnk=
=N1gA
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Lock debugging:
- Implement compiler-driven static analysis locking context checking,
using the upcoming Clang 22 compiler's context analysis features
(Marco Elver)
We removed Sparse context analysis support, because prior to
removal even a defconfig kernel produced 1,700+ context tracking
Sparse warnings, the overwhelming majority of which are false
positives. On an allmodconfig kernel the number of false positive
context tracking Sparse warnings grows to over 5,200... On the plus
side of the balance actual locking bugs found by Sparse context
analysis is also rather ... sparse: I found only 3 such commits in
the last 3 years. So the rate of false positives and the
maintenance overhead is rather high and there appears to be no
active policy in place to achieve a zero-warnings baseline to move
the annotations & fixers to developers who introduce new code.
Clang context analysis is more complete and more aggressive in
trying to find bugs, at least in principle. Plus it has a different
model to enabling it: it's enabled subsystem by subsystem, which
results in zero warnings on all relevant kernel builds (as far as
our testing managed to cover it). Which allowed us to enable it by
default, similar to other compiler warnings, with the expectation
that there are no warnings going forward. This enforces a
zero-warnings baseline on clang-22+ builds (Which are still limited
in distribution, admittedly)
Hopefully the Clang approach can lead to a more maintainable
zero-warnings status quo and policy, with more and more subsystems
and drivers enabling the feature. Context tracking can be enabled
for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default
disabled), but this will generate a lot of false positives.
( Having said that, Sparse support could still be added back,
if anyone is interested - the removal patch is still
relatively straightforward to revert at this stage. )
Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)
- Add support for Atomic<i8/i16/bool> and replace most Rust native
AtomicBool usages with Atomic<bool>
- Clean up LockClassKey and improve its documentation
- Add missing Send and Sync trait implementation for SetOnce
- Make ARef Unpin as it is supposed to be
- Add __rust_helper to a few Rust helpers as a preparation for
helper LTO
- Inline various lock related functions to avoid additional function
calls
WW mutexes:
- Extend ww_mutex tests and other test-ww_mutex updates (John
Stultz)
Misc fixes and cleanups:
- rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd
Bergmann)
- locking/local_lock: Include more missing headers (Peter Zijlstra)
- seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)
- rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir
Duberstein)"
* tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK
rcu: Mark lockdep_assert_rcu_helper() __always_inline
compiler-context-analysis: Remove __assume_ctx_lock from initializers
tomoyo: Use scoped init guard
crypto: Use scoped init guard
kcov: Use scoped init guard
compiler-context-analysis: Introduce scoped init guards
cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers
seqlock: fix scoped_seqlock_read kernel-doc
tools: Update context analysis macros in compiler_types.h
rust: sync: Replace `kernel::c_str!` with C-Strings
rust: sync: Inline various lock related methods
rust: helpers: Move #define __rust_helper out of atomic.c
rust: wait: Add __rust_helper to helpers
rust: time: Add __rust_helper to helpers
rust: task: Add __rust_helper to helpers
rust: sync: Add __rust_helper to helpers
rust: refcount: Add __rust_helper to helpers
rust: rcu: Add __rust_helper to helpers
rust: processor: Add __rust_helper to helpers
...
x86 PMU driver updates:
- Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs.
Compared to previous iterations of the Intel PMU code, there's
been a lot of changes, which center around three main areas:
- Introduce the OFF-MODULE RESPONSE (OMR) facility to
replace the Off-Core Response (OCR) facility
- New PEBS data source encoding layout
- Support the new "RDPMC user disable" feature
(Dapeng Mi)
- Likewise, a large series adds uncore PMU support for
Intel Diamond Rapids (DMR) CPUs, which center around these
four main areas:
- DMR may have two Integrated I/O and Memory Hub (IMH) dies,
separate from the compute tile (CBB) dies. Each CBB and
each IMH die has its own discovery domain.
- Unlike prior CPUs that retrieve the global discovery table
portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
discovery and MSR for CBB PMON discovery.
- DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA,
UBR, PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
- IIO free-running counters in DMR are MMIO-based, unlike SPR.
(Zide Chen)
- Also add support for Add missing PMON units for Intel Panther Lake,
and support Nova Lake (NVL), which largely maps to Panther Lake.
(Zide Chen)
- KVM integration: Add support for mediated vPMUs (by Kan Liang
and Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
Sandipan Das and Mingwei Zhang)
- Add Intel cstate driver to support for Wildcat Lake (WCL)
CPUs, which are a low-power variant of Panther Lake.
(Zide Chen)
- Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
(aka MaxLinear Lightning Mountain), which maps to the existing
Airmont code. (Martin Schiller)
Performance enhancements:
- core: Speed up kexec shutdown by avoiding unnecessary
cross CPU calls. (Jan H. Schönherr)
- core: Fix slow perf_event_task_exit() with LBR callstacks
(Namhyung Kim)
User-space stack unwinding support:
- Various cleanups and refactorings in preparation to generalize
the unwinding code for other architectures. (Jens Remus)
Uprobes updates:
- Transition from kmap_atomic to kmap_local_page (Keke Ming)
- Fix incorrect lockdep condition in filter_chain() (Breno Leitao)
- Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)
Misc fixes and cleanups:
- s390: Remove kvm_types.h from Kbuild (Randy Dunlap)
- x86/intel/uncore: Convert comma to semicolon (Chen Ni)
- x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)
- x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmJhTURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i/qw/9F/sjMqbxH8d3kPXB8wk2eUuSkynQ2aNw
Zec9qtfCC5N1U9b2D7ywGJRscTmWYnX/3BKTyzFuyA6SDz6buAgDDIGPlHi+9Fww
+RUUS3lQ7N3pVWZ4Ifu3kbh3Vz4lkQuOXhfcjiyIMS6QIxfrcSLFoKHK+2V6PeU+
x0k+THHz/Ymg+DIpqSjqil1yrKaUmU9xRrbnyy6zJB1duREQrkYBhIWL1+bcd7SA
89RVAGXQ+sWzVQMPaKrMkZj6GavOCB7zseigiiwjBRLznukS2OulDDe8zR6pCJZp
wbdc7TR/nCm+QtNfkHlOmTQvsPAXiXNyXe5Vi8aFjGc0uMGhHaeiL9ah/bwsKA5m
Bm5Y7oVSmBlCJbcr/CTrGYkb+WwvLIgPCwVkn4FYPlsWv+U92qTOx9q7qKmIDFaj
1oUXCwoHbYrYnoZqZqPp2h689m0Lh/lsGhy0QRt8aGnKu0SDjMaqRGbYWH0UI0kA
aDnZstVHG76RnVi0143q8HcvvjNZb82NL0cS749tY/YcwH4kUEGj2XuSK2Ar887T
H0oDJHXijMlGXWqO5bK3WMoCQajR7nRyqBo7/rKYj20OjXmwoXS3vel77/s8WFo2
fUUC469MacDzyxdBNutnkJvvcvsUio3r4MWFsEEWQk2nUE58PtN8YM8j/FdNYpql
zAZ4Jx/A0RM=
=4SRB
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance event updates from Ingo Molnar:
"x86 PMU driver updates:
- Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs
(Dapeng Mi)
Compared to previous iterations of the Intel PMU code, there's been
a lot of changes, which center around three main areas:
- Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the
Off-Core Response (OCR) facility
- New PEBS data source encoding layout
- Support the new "RDPMC user disable" feature
- Likewise, a large series adds uncore PMU support for Intel Diamond
Rapids (DMR) CPUs (Zide Chen)
This centers around these four main areas:
- DMR may have two Integrated I/O and Memory Hub (IMH) dies,
separate from the compute tile (CBB) dies. Each CBB and each IMH
die has its own discovery domain.
- Unlike prior CPUs that retrieve the global discovery table
portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
discovery and MSR for CBB PMON discovery.
- DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR,
PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
- IIO free-running counters in DMR are MMIO-based, unlike SPR.
- Also add support for Add missing PMON units for Intel Panther Lake,
and support Nova Lake (NVL), which largely maps to Panther Lake.
(Zide Chen)
- KVM integration: Add support for mediated vPMUs (by Kan Liang and
Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
Sandipan Das and Mingwei Zhang)
- Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs,
which are a low-power variant of Panther Lake (Zide Chen)
- Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
(aka MaxLinear Lightning Mountain), which maps to the existing
Airmont code (Martin Schiller)
Performance enhancements:
- Speed up kexec shutdown by avoiding unnecessary cross CPU calls
(Jan H. Schönherr)
- Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim)
User-space stack unwinding support:
- Various cleanups and refactorings in preparation to generalize the
unwinding code for other architectures (Jens Remus)
Uprobes updates:
- Transition from kmap_atomic to kmap_local_page (Keke Ming)
- Fix incorrect lockdep condition in filter_chain() (Breno Leitao)
- Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)
Misc fixes and cleanups:
- s390: Remove kvm_types.h from Kbuild (Randy Dunlap)
- x86/intel/uncore: Convert comma to semicolon (Chen Ni)
- x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)
- x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)"
* tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
s390: remove kvm_types.h from Kbuild
uprobes: Fix incorrect lockdep condition in filter_chain()
x86/ibs: Fix typo in dc_l2tlb_miss comment
x86/uprobes: Fix XOL allocation failure for 32-bit tasks
perf/x86/intel/uncore: Convert comma to semicolon
perf/x86/intel: Add support for rdpmc user disable feature
perf/x86: Use macros to replace magic numbers in attr_rdpmc
perf/x86/intel: Add core PMU support for Novalake
perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL
perf/x86/intel: Add core PMU support for DMR
perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR
perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL
perf/core: Fix slow perf_event_task_exit() with LBR callstacks
perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls
uprobes: use kmap_local_page() for temporary page mappings
arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
perf/x86/intel/uncore: Add Nova Lake support
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmGmrgACgkQ6rmadz2v
bTq6NxAAkCHosxzGn9GYYBV8xhrBJoJJDCyEbQ4nR0XNY+zaWnuykmiPP9w1aOAM
zm/po3mQB2pZjetvlrPrgG5RLgBCAUHzqVGy0r+phUvD3vbohKlmSlMm2kiXOb9N
T01BgLWsyqN2ZcNFvORdSsftqIJUHcXxU6RdupGD60sO5XM9ty5cwyewLX8GBOas
UN2bOhbK2DpqYWUvtv+3Q3ykxoStMSkXZvDRurwLKl4RHeLjXZXPo8NjnfBlk/F2
vdFo/F4NO4TmhOave6UPXvKb4yo9IlBRmiPAl0RmNKBxenY8j9XuV/xZxU6YgzDn
+SQfDK+CKQ4IYIygE+fqd4e5CaQrnjmPPcIw12AB2CF0LimY9Xxyyk6FSAhMN7wm
GTVh5K2C3Dk3OiRQk4G58EvQ5QcxzX98IeeCpcckMUkPsFWHRvF402WMUcv9SWpD
DsxxPkfENY/6N67EvH0qcSe/ikdUorQKFl4QjXKwsMCd5WhToeP4Z7Ck1gVSNkAh
9CX++mLzg333Lpsc4SSIuk9bEPpFa5cUIKUY7GCsCiuOXciPeMDP3cGSd5LioqxN
qWljs4Z88QDM2LJpAh8g4m3sA7bMhES3nPmdlI5CfgBcVyLW8D8CqQq4GEZ1McwL
Ky084+lEosugoVjRejrdMMEOsqAfcbkTr2b8jpuAZdwJKm6p/bw=
=cBdK
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Support associating BPF program with struct_ops (Amery Hung)
- Switch BPF local storage to rqspinlock and remove recursion detection
counters which were causing false positives (Amery Hung)
- Fix live registers marking for indirect jumps (Anton Protopopov)
- Introduce execution context detection BPF helpers (Changwoo Min)
- Improve verifier precision for 32bit sign extension pattern
(Cupertino Miranda)
- Optimize BTF type lookup by sorting vmlinux BTF and doing binary
search (Donglin Peng)
- Allow states pruning for misc/invalid slots in iterator loops (Eduard
Zingerman)
- In preparation for ASAN support in BPF arenas teach libbpf to move
global BPF variables to the end of the region and enable arena kfuncs
while holding locks (Emil Tsalapatis)
- Introduce support for implicit arguments in kfuncs and migrate a
number of them to new API. This is a prerequisite for cgroup
sub-schedulers in sched-ext (Ihor Solodrai)
- Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)
- Fix ORC stack unwind from kprobe_multi (Jiri Olsa)
- Speed up fentry attach by using single ftrace direct ops in BPF
trampolines (Jiri Olsa)
- Require frozen map for calculating map hash (KP Singh)
- Fix lock entry creation in TAS fallback in rqspinlock (Kumar
Kartikeya Dwivedi)
- Allow user space to select cpu in lookup/update operations on per-cpu
array and hash maps (Leon Hwang)
- Make kfuncs return trusted pointers by default (Matt Bobrowski)
- Introduce "fsession" support where single BPF program is executed
upon entry and exit from traced kernel function (Menglong Dong)
- Allow bpf_timer and bpf_wq use in all programs types (Mykyta
Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
Starovoitov)
- Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
definition across the tree (Puranjay Mohan)
- Allow BPF arena calls from non-sleepable context (Puranjay Mohan)
- Improve register id comparison logic in the verifier and extend
linked registers with negative offsets (Puranjay Mohan)
- In preparation for BPF-OOM introduce kfuncs to access memcg events
(Roman Gushchin)
- Use CFI compatible destructor kfunc type (Sami Tolvanen)
- Add bitwise tracking for BPF_END in the verifier (Tianci Cao)
- Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
Tang)
- Make BPF selftests work with 64k page size (Yonghong Song)
* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
selftests/bpf: Fix outdated test on storage->smap
selftests/bpf: Choose another percpu variable in bpf for btf_dump test
selftests/bpf: Remove test_task_storage_map_stress_lookup
selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
selftests/bpf: Update task_local_storage/recursion test
selftests/bpf: Update sk_storage_omem_uncharge test
bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
bpf: Support lockless unlink when freeing map or local storage
bpf: Prepare for bpf_selem_unlink_nofail()
bpf: Remove unused percpu counter from bpf_local_storage_map_free
bpf: Remove cgroup local storage percpu counter
bpf: Remove task local storage percpu counter
bpf: Change local_storage->lock and b->lock to rqspinlock
bpf: Convert bpf_selem_unlink to failable
bpf: Convert bpf_selem_link_map to failable
bpf: Convert bpf_selem_unlink_map to failable
bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
selftests/xsk: fix number of Tx frags in invalid packet
selftests/xsk: properly handle batch ending in the middle of a packet
bpf: Prevent reentrance into call_rcu_tasks_trace()
...
Module signing:
- Remove SHA-1 support for signing modules. SHA-1 is no longer
considered secure for signatures due to vulnerabilities that can
lead to hash collisions. None of the major distributions use
SHA-1 anymore, and the kernel has defaulted to SHA-512 since
v6.11. Note that loading SHA-1 signed modules is still supported.
- Update scripts/sign-file to use only the OpenSSL CMS API for
signing. As SHA-1 support is gone, we can drop the legacy PKCS#7
API which was limited to SHA-1. This also cleans up support for
legacy OpenSSL versions.
Cleanups and fixes:
- Use system_dfl_wq instead of the per-cpu system_wq following the
ongoing workqueue API refactoring.
- Avoid open-coded kvrealloc() in module decompression logic by
using the standard helper.
- Improve section annotations by replacing the custom __modinit
with __init_or_module and removing several unused __INIT*_OR_MODULE
macros.
- Fix kernel-doc warnings in include/linux/moduleparam.h.
- Ensure set_module_sig_enforced is only declared when module
signing is enabled.
- Fix gendwarfksyms build failures on 32-bit hosts.
MAINTAINERS:
- Update the module subsystem entry to reflect the maintainer
rotation and update the git repository link.
The changes have been soaking in linux-next since -rc2.
Note that like Daniel mentioned in the previous pull request [1], we
rotate maintainership every 6 months, and I will be handling the module
subsystem pull requests for the first half of this year.
Link: https://lore.kernel.org/r/20251203234840.3720-1-da.gomez@kernel.org [1]
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSE9au1u/dCZerzchhaByWrOaGnegUCaYYeeAAKCRBaByWrOaGn
epIxAQDU/VSAC491S9/5dAUeGbOis9/p6QJKQlNgEqU4oTlOsgEA0p8BZ9Spkwzd
v9BfIl3j9qVt7wUdlLdbHfdvPgtUVgc=
=at6w
-----END PGP SIGNATURE-----
Merge tag 'modules-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull module updates from Sami Tolvanen:
"Module signing:
- Remove SHA-1 support for signing modules.
SHA-1 is no longer considered secure for signatures due to
vulnerabilities that can lead to hash collisions. None of the major
distributions use SHA-1 anymore, and the kernel has defaulted to
SHA-512 since v6.11.
Note that loading SHA-1 signed modules is still supported.
- Update scripts/sign-file to use only the OpenSSL CMS API for
signing.
As SHA-1 support is gone, we can drop the legacy PKCS#7 API which
was limited to SHA-1. This also cleans up support for legacy
OpenSSL versions.
Cleanups and fixes:
- Use system_dfl_wq instead of the per-cpu system_wq following the
ongoing workqueue API refactoring.
- Avoid open-coded kvrealloc() in module decompression logic by using
the standard helper.
- Improve section annotations by replacing the custom __modinit with
__init_or_module and removing several unused __INIT*_OR_MODULE
macros.
- Fix kernel-doc warnings in include/linux/moduleparam.h.
- Ensure set_module_sig_enforced is only declared when module signing
is enabled.
- Fix gendwarfksyms build failures on 32-bit hosts.
MAINTAINERS:
- Update the module subsystem entry to reflect the maintainer
rotation and update the git repository link"
* tag 'modules-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
modules: moduleparam.h: fix kernel-doc comments
module: Only declare set_module_sig_enforced when CONFIG_MODULE_SIG=y
module/decompress: Avoid open-coded kvrealloc()
gendwarfksyms: Fix build on 32-bit hosts
sign-file: Use only the OpenSSL CMS API for signing
module: Remove SHA-1 support for module signing
module: replace use of system_wq with system_dfl_wq
params: Replace __modinit with __init_or_module
module: Remove unused __INIT*_OR_MODULE macros
MAINTAINERS: Update module subsystem maintainers and repository
API:
- Fix race condition in hwrng core by using RCU.
Algorithms:
- Allow authenc(sha224,rfc3686) in fips mode.
- Add test vectors for authenc(hmac(sha384),cbc(aes)).
- Add test vectors for authenc(hmac(sha224),cbc(aes)).
- Add test vectors for authenc(hmac(md5),cbc(des3_ede)).
- Add lz4 support in hisi_zip.
- Only allow clear key use during self-test in s390/{phmac,paes}.
Drivers:
- Set rng quality to 900 in airoha.
- Add gcm(aes) support for AMD/Xilinx Versal device.
- Allow tfms to share device in hisilicon/trng.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmmJlNEACgkQxycdCkmx
i6dfYw//fLKHita7B7k6Rnfv7aTX7ZaF7bwMb1w2OtNu7061ZK1+Ou127ZjFKFxC
qJtI71qmTnhTOXnqeLHDio81QLZ3D9cUwSITv4YS4SCIZlbpKmKNFNfmNd5qweNG
xHRQnD4jiM2Qk8GFx6CmXKWEooev9Z9vvjWtPSbuHSXVUd5WPGkJfLv6s9Oy3W6u
7/Z+KcPtMNx3mAhNy7ZwzttKLCPfLp8YhEP99sOFmrUhehjC2e5z59xcQmef5gfJ
cCTBUJkySLChF2bd8eHWilr8y7jow/pyldu2Ksxv2/o0l01xMqrQoIOXwCeEuEq0
uxpKMCR0wM9jBlA1C59zBfiL5Dacb+Dbc7jcRRAa49MuYclVMRoPmnAutUMiz38G
mk/gpc1BQJIez1rAoTyXiNsXiSeZnu/fR9tOq28pTfNXOt2CXsR6kM1AuuP2QyuP
QC0+UM5UsTE+QIibYklop3HfSCFIaV5LkDI/RIvPzrUjcYkJYgMnG3AoIlqkOl1s
mzcs20cH9PoZG3v5W4SkKJMib6qSx1qfa1YZ7GucYT1nUk04Plcb8tuYabPP4x6y
ow/vfikRjnzuMesJShifJUwplaZqP64RBXMvIfgdoOCXfeQ1tKCKz0yssPfgmSs6
K5mmnmtMvgB6k14luCD3E2zFHO6W+PHZQbSanEvhnlikPo86Dbk=
=n4fL
-----END PGP SIGNATURE-----
Merge tag 'v7.0-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
"API:
- Fix race condition in hwrng core by using RCU
Algorithms:
- Allow authenc(sha224,rfc3686) in fips mode
- Add test vectors for authenc(hmac(sha384),cbc(aes))
- Add test vectors for authenc(hmac(sha224),cbc(aes))
- Add test vectors for authenc(hmac(md5),cbc(des3_ede))
- Add lz4 support in hisi_zip
- Only allow clear key use during self-test in s390/{phmac,paes}
Drivers:
- Set rng quality to 900 in airoha
- Add gcm(aes) support for AMD/Xilinx Versal device
- Allow tfms to share device in hisilicon/trng"
* tag 'v7.0-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (100 commits)
crypto: img-hash - Use unregister_ahashes in img_{un}register_algs
crypto: testmgr - Add test vectors for authenc(hmac(md5),cbc(des3_ede))
crypto: cesa - Simplify return statement in mv_cesa_dequeue_req_locked
crypto: testmgr - Add test vectors for authenc(hmac(sha224),cbc(aes))
crypto: testmgr - Add test vectors for authenc(hmac(sha384),cbc(aes))
hwrng: core - use RCU and work_struct to fix race condition
crypto: starfive - Fix memory leak in starfive_aes_aead_do_one_req()
crypto: xilinx - Fix inconsistant indentation
crypto: rng - Use unregister_rngs in register_rngs
crypto: atmel - Use unregister_{aeads,ahashes,skciphers}
hwrng: optee - simplify OP-TEE context match
crypto: ccp - Add sysfs attribute for boot integrity
dt-bindings: crypto: atmel,at91sam9g46-sha: add microchip,lan9691-sha
dt-bindings: crypto: atmel,at91sam9g46-aes: add microchip,lan9691-aes
dt-bindings: crypto: qcom,inline-crypto-engine: document the Milos ICE
crypto: caam - fix netdev memory leak in dpaa2_caam_probe
crypto: hisilicon/qm - increase wait time for mailbox
crypto: hisilicon/qm - obtain the mailbox configuration at one time
crypto: hisilicon/qm - remove unnecessary code in qm_mb_write()
crypto: hisilicon/qm - move the barrier before writing to the mailbox register
...
affinity of unbound kthreads (node or custom cpumask) against
housekeeping (CPU isolation) constraints and CPU hotplug events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset isolated
partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making nohz_full=
also mutable through cpuset in the future.
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmmE0mYbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox36eMP/0Ls/ArfYVi/MNAXWlpy
rAt6m9Y/X9GBcDM/VI9BXq1ZX4qEr2XjJ8UUb8cM08uHEAt0ErlmpRxREwJFrKbI
H4jzg5EwO0D0c6MnvgQJEAwkHxQVIjsxG9DovRIjxyW4ycx3aSsRg/f2VKyWoLvY
7ZT7CbLFE+I/MQh2ZgUu/9pnCDQVR2anss2WYIej5mmgFL5pyEv3YvYgKYVyK08z
sXyNxpP976g2d9ECJ9OtFJV9we6mlqxlG0MVCiv/Uxh7DBjxWWPsLvlmLAXggQ03
+0GW+nnutDaKz83pgS7Z4zum/+Oa+I1dTLIN27pARUNcMCYip7njM2KNpJwPdov3
+fAIODH2JVX1xewT+U1cCq6gdI55ejbwdQYGFV075dKBUxKQeIyrghvfC3Ga6aKQ
Gw3y68jdrXOw6iyfHR5k/0Mnu2/FDKUW2fZxLKm55PvNZP5jQFmSlz9wyiwwyb3m
UUSgThj6Ozodxks8hDX41rGVezCcm1ni+qNSiNIs8HPaaZQrwbnvKHQFBBJHQzJP
rJ39VWBx3Hq/ly71BOR6pCzoZsfS1f85YKhJ4vsfjLO6BfhI16nBat89eROSRKcz
XptyWqW0PgAD0teDuMCTPNuUym/viBHALXHKuSO12CIizacvftiGcmaQNPlLiiFZ
/Dr2+aOhwYw3UD6djn3u94M9
=nWGh
-----END PGP SIGNATURE-----
Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
- Remove the unused omap-cpufreq driver (Andreas Kemnade)
- Optimize error handling code in cpufreq_boost_trigger_state() and
make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
supports boost (Lifeng Zheng)
- Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
Dhruva Gole, and Konrad Dybcio)
- Minor improvements to the cpufreq and cpumask rust implementation
(Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)
- Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)
- Update arch_freq_scale in the CPPC cpufreq driver's frequency
invariance engine (FIE) in scheduler ticks if the related CPPC
registers are not in PCC (Jie Zhan)
- Assorted minor cleanups and improvements in ARM cpufreq drivers (Juan
Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)
- Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
Gupta)
- Make the scaling_setspeed cpufreq sysfs attribute return the actual
requested frequency to avoid confusion (Pengjie Zhang)
- Simplify the idle CPU time granularity test in the ondemand cpufreq
governor (Frederic Weisbecker)
- Enable asym capacity in intel_pstate only when CPU SMT is not
possible (Yaxiong Tian)
- Update the description of rate_limit_us default value in cpufreq
documentation (Yaxiong Tian)
- Add a command line option to adjust the C-states table in the
intel_idle driver, remove the 'preferred_cstates' module parameter
from it, add C-states validation to it and clean it up (Artem
Bityutskiy)
- Make the menu cpuidle governor always check the time till the closest
timer event when the scheduler tick has been stopped to prevent it
from mistakenly selecting the deepest available idle state (Rafael
Wysocki)
- Update the teo cpuidle governor to avoid making suboptimal decisions
in certain corner cases and generally improve idle state selection
accuracy (Rafael Wysocki)
- Remove an unlikely() annotation on the early-return condition in
menu_select() that leads to branch misprediction 100% of the time
on systems with only 1 idle state enabled, like ARM64 servers (Breno
Leitao)
- Add Christian Loehle to MAINTAINERS as a cpuidle reviewer (Christian
Loehle)
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
- Add PL4 support for Ice Lake to the Intel RAPL power capping
driver (Daniel Tang)
- Replace sprintf() with sysfs_emit() in power capping sysfs show
functions (Sumeet Pawnikar)
- Make dev_pm_opp_get_level() return value match the documentation
after a previous update of the latter (Aleks Todorov)
- Use scoped for each OF child loop in the OPP code (Krzysztof
Kozlowski)
- Fix a bug in an example code snippet and correct typos in the energy
model management documentation (Patrick Little)
- Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
* idle_monitor: Fix incorrect value logged after stop
* Fix inverted APERF capability check
* Use strcspn() to strip trailing newline
* Reset errno before strtoull()
* Show C0 in idle-info dump
- Improve cpupower installation procedure by making the systemd step
optional and allowing users to disable the installation of systemd's
unit file (João Marcos Costa)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmmDr5ASHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1Q8oH/0KRqdidHzesIQl6gd5WSS/sWdxODRUt
R9dEGQQ6LXCY0z05RAq29HZQf618fYuRFX4PSrtyCvrcRJK7MJKuzK55MRq0MC3c
c/2pL1PdpHexjLXUP9pcoxrYjetsr7SnD6Y0M3JfOPg1E/bG8sp1DlnE8cdqrL0W
lrdB2cEGewT2SVkNhCIQ2n6bwfQwmLlfQl1vXTM8BA7xCjoslePUJlRphAFVAt/J
5fQxSOH0eSxK5PYQFUDM2D2J3uMAN0pFb6eIjwVYYqjABqV//BPl99Rv2W3ElJq7
K/SICRWlvzyINCgF15QAUtQHWdINxSb0GzovECVxODHOv0N4mKHdpNU=
=QlVe
-----END PGP SIGNATURE-----
Merge tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"By the number of commits, cpufreq is the leading party (again) and the
most visible change there is the removal of the omap-cpufreq driver
that has not been used for a long time (good riddance). There are also
quite a few changes in the cppc_cpufreq driver, mostly related to
fixing its frequency invariance engine in the case when the CPPC
registers used by it are not in PCC. In addition to that, support for
AM62L3 is added to the ti-cpufreq driver and the cpufreq-dt-platdev
list is updated for some platforms. The remaining cpufreq changes are
assorted fixes and cleanups.
Next up is cpuidle and the changes there are dominated by intel_idle
driver updates, mostly related to the new command line facility
allowing users to adjust the list of C-states used by the driver.
There are also a few updates of cpuidle governors, including two menu
governor fixes and some refinements of the teo governor, and a
MAINTAINERS update adding Christian Loehle as a cpuidle reviewer.
[Thanks for stepping up Christian!]
The most significant update related to system suspend and hibernation
is the one to stop freezing the PM runtime workqueue during system PM
transitions which allows some deadlocks to be avoided. There is also a
fix for possible concurrent bit field updates in the core device
suspend code and a few other minor fixes.
Apart from the above, several drivers are updated to discard the
return value of pm_runtime_put() which is going to be converted to a
void function as soon as everybody stops using its return value, PL4
support for Ice Lake is added to the Intel RAPL power capping driver,
and there are assorted cleanups, documentation fixes, and some
cpupower utility improvements.
Specifics:
- Remove the unused omap-cpufreq driver (Andreas Kemnade)
- Optimize error handling code in cpufreq_boost_trigger_state() and
make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
supports boost (Lifeng Zheng)
- Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
Dhruva Gole, and Konrad Dybcio)
- Minor improvements to the cpufreq and cpumask rust implementation
(Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)
- Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)
- Update arch_freq_scale in the CPPC cpufreq driver's frequency
invariance engine (FIE) in scheduler ticks if the related CPPC
registers are not in PCC (Jie Zhan)
- Assorted minor cleanups and improvements in ARM cpufreq drivers
(Juan Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)
- Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
Gupta)
- Make the scaling_setspeed cpufreq sysfs attribute return the actual
requested frequency to avoid confusion (Pengjie Zhang)
- Simplify the idle CPU time granularity test in the ondemand cpufreq
governor (Frederic Weisbecker)
- Enable asym capacity in intel_pstate only when CPU SMT is not
possible (Yaxiong Tian)
- Update the description of rate_limit_us default value in cpufreq
documentation (Yaxiong Tian)
- Add a command line option to adjust the C-states table in the
intel_idle driver, remove the 'preferred_cstates' module parameter
from it, add C-states validation to it and clean it up (Artem
Bityutskiy)
- Make the menu cpuidle governor always check the time till the
closest timer event when the scheduler tick has been stopped to
prevent it from mistakenly selecting the deepest available idle
state (Rafael Wysocki)
- Update the teo cpuidle governor to avoid making suboptimal
decisions in certain corner cases and generally improve idle state
selection accuracy (Rafael Wysocki)
- Remove an unlikely() annotation on the early-return condition in
menu_select() that leads to branch misprediction 100% of the time
on systems with only 1 idle state enabled, like ARM64 servers
(Breno Leitao)
- Add Christian Loehle to MAINTAINERS as a cpuidle reviewer
(Christian Loehle)
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael
Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
- Add PL4 support for Ice Lake to the Intel RAPL power capping driver
(Daniel Tang)
- Replace sprintf() with sysfs_emit() in power capping sysfs show
functions (Sumeet Pawnikar)
- Make dev_pm_opp_get_level() return value match the documentation
after a previous update of the latter (Aleks Todorov)
- Use scoped for each OF child loop in the OPP code (Krzysztof
Kozlowski)
- Fix a bug in an example code snippet and correct typos in the
energy model management documentation (Patrick Little)
- Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
* idle_monitor: Fix incorrect value logged after stop
* Fix inverted APERF capability check
* Use strcspn() to strip trailing newline
* Reset errno before strtoull()
* Show C0 in idle-info dump
- Improve cpupower installation procedure by making the systemd step
optional and allowing users to disable the installation of
systemd's unit file (João Marcos Costa)"
* tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
PM: sleep: core: Avoid bit field races related to work_in_progress
PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
cpufreq: Documentation: Update description of rate_limit_us default value
cpufreq: intel_pstate: Enable asym capacity only when CPU SMT is not possible
PM: wakeup: Handle empty list in wakeup_sources_walk_start()
PM: EM: Documentation: Fix bug in example code snippet
Documentation: Fix typos in energy model documentation
cpuidle: governors: teo: Refine intercepts-based idle state lookup
cpuidle: governors: teo: Adjust the classification of wakeup events
cpufreq: ondemand: Simplify idle cputime granularity test
cpufreq: userspace: make scaling_setspeed return the actual requested frequency
PM: hibernate: Drop NULL pointer checks before acomp_request_free()
cpufreq: CPPC: Add generic helpers for sysfs show/store
cpufreq: scmi: Fix device_node reference leak in scmi_cpu_domain_id()
cpufreq: ti-cpufreq: add support for AM62L3 SoC
cpufreq: dt-platdev: Add ti,am62l3 to blocklist
cpufreq/amd-pstate: Add comment explaining nominal_perf usage for performance policy
cpufreq: scmi: correct SCMI explanation
cpufreq: dt-platdev: Block the driver from probing on more QC platforms
rust: cpumask: rename methods of Cpumask for clarity and consistency
...
- Update the ACPICA code in the kernel to upstream version 20251212
which includes the following changes:
* Add support for new ACPI table DTPR (Michal Camacho Romero)
* Release objects with acpi_ut_delete_object_desc() (Zilin Guan)
* Add UUIDs for Microsoft fan extensions and UUIDs associated with
TPM 2.0 devices (Armin Wolf)
* Fix NULL pointer dereference in acpi_ev_address_space_dispatch()
(Alexey Simakov)
* Add KEYP ACPI table definition (Dave Jiang)
* Add support for the Microsoft display mux _OSI string (Armin Wolf)
* Add definitions for the IOVT ACPI table (Xianglai Li)
* Abort AML bytecode execution on AML_FATAL_OP (Armin Wolf)
* Include all fields in subtable type1 for PPTT (Ben Horgan)
* Add GICv5 MADT structures and Arm IORT IWB node definitions (Jose
Marinho)
* Update Parameter Block structure for RAS2 and add a new flag in
Memory Affinity Structure for SRAT (Pawel Chmielewski)
* Add _VDM (Voltage Domain) object (Pawel Chmielewski)
- Add support for GICv5 ACPI probing on ARM which is based on the
GICv5 MADT structures and ARM IORT IWB node definitions recently
added to ACPICA (Lorenzo Pieralisi)
- Rework ACPI PM notification setup for PCI root buses and modify the
ACPI PM setup for devices to register wakeup source objects under
physical (that is, PCI, platform, etc.) devices instead of doing that
under their ACPI companions (Rafael Wysocki)
- Adjust debug messages regarding postponed ACPI PM printed during
system resume to be more accurate (Rafael Wysocki)
- Remove dead code from lps0_device_attach() (Gergo Koteles)
- Start to invoke Microsoft Function 9 (Turn On Display) of the Low-
Power S0 Idle (LPS0) _DSM in the suspend-to-idle resume flow on
systems with ACPI LPS0 support to address a functional issue on
Lenovo Yoga Slim 7i Aura (15ILL9), where system fans and keyboard
backlights fail to resume after suspend (Jakob Riemenschneider)
- Add sysfs attribute cid for exposing _CID lists under ACPI device
objects (Rafael Wysocki)
- Replace sprintf() with sysfs_emit() in all of the core ACPI sysfs
interface code (Sumeet Pawnikar)
- Use acpi_get_local_u64_address() in the code implementing ACPI
support for PCI to evaluate _ADR instead of evaluating that object
directly (Andy Shevchenko)
- Add JWIPC JVC9100 to irq1_level_low_skip_override[] to unbreak
serial IRQs on that system (Ai Chao)
- Fix handling of _OSC errors in acpi_run_osc() to avoid failures on
systems where _OSC error bits are set even though the _OSC return
buffer contains acknowledged feature bits (Rafael Wysocki)
- Clean up and rearrange \_SB._OSC handling for general platform
features and USB4 features to avoid code duplication and unnecessary
memory management overhead (Rafael Wysocki)
- Make the ACPI core device enumeration code handle PNP0C01 and PNP0C02
("system resource") device objects directly instead of letting the
legacy PNP system driver handle them to avoid device enumeration
issues on systems where PNP0C02 is present in the _CID list under
ACPI device objects with a _HID matching a proper device driver in
Linux (Rafael Wysocki)
- Drop workarounds for the known device enumeration issues related to
_CID lists containing PNP0C02 (Rafael Wysocki)
- Drop outdated comment regarding removed function in the ACPI-based
device enumeration code (Julia Lawall)
- Make PRP0001 device matching work as expected for ACPI device objects
using it as a _HID for board development and similar purposes (Kartik
Rajput)
- Use async schedule function in acpi_scan_clear_dep_fn() to avoid
races with user space initialization on some systems (Yicong Yang)
- Add a piece of documentation explaining why binding drivers directly
to ACPI device objects is not a good idea in general and why it is
desirable to convert drivers doing so into proper platform drivers
that use struct platform_driver for device binding (Rafael Wysocki)
- Convert multiple "core ACPI" drivers, including the NFIT ACPI device
driver, the generic ACPI button drivers, the generic ACPI thermal
zone driver, the ACPI hardware event device (HED) driver, the ACPI EC
driver, the ACPI SMBUS HC driver, the ACPI Smart Battery Subsystem
(SBS) driver, and the ACPI backlight (video) driver to proper platform
drivers that use struct platform_driver for device binding (Rafael
Wysocki)
- Use acpi_get_local_u64_address() in the ACPI backlight (video) driver
to evaluate _ADR instead of evaluating that object directly (Andy
Shevchenko)
- Convert the generic ACPI battery driver to a proper platform driver
using struct platform_driver for device binding (Rafael Wysocki)
- Fix incorrect charging status when current is zero in the generic
ACPI battery driver (Ata İlhan Köktürk)
- Use LIST_HEAD() for initializing a stack-allocated list in the
generic ACPI watchdog device driver (Can Peng)
- Rework the ACPI idle driver initialization to register it directly
from the common initialization code instead of doing that from a
CPU hotplug "online" callback and clean it up (Huisong Li, Rafael
Wysocki)
- Fix a possible NULL pointer dereference in
acpi_processor_errata_piix4() (Tuo Li)
- Make read-only array non_mmio_desc[] static const (Colin Ian King)
- Prevent the APEI GHES support code on ARM from accessing memory out
of bounds or going past the ARM processor CPER record buffer (Mauro
Carvalho Chehab)
- Prevent cper_print_fw_err() from dumping the entire memory on systems
with defective firmware (Mauro Carvalho Chehab)
- Improve ghes_notify_nmi() status check to avoid unnecessary overhead
in the NMI handler by carrying out all of the requisite preparations
and the NMI registration time (Tony Luck)
- Refactor the GHES driver by extracting common functionality into
reusable helper functions to reduce code duplication and improve
the ghes_notify_sea() status check in analogy with the previous
ghes_notify_nmi() status check improvement (Shuai Xue)
- Make ELOG and GHES log and trace consistently and support the CPER
CXL protocol analogously (Fabio De Francesco)
- Disable KASAN instrumentation in the APEI GHES driver when compile
testing with clang < 18 (Nathan Chancellor)
- Let ghes_edac be the preferred driver to load on __ZX__ and _BYO_
systems by extending the platform detection list in the APEI GHES
driver (Tony W Wang-oc)
- Clean up cppc_perf_caps and cppc_perf_ctrls structs and rename EPP
constants for clarity in the ACPI CPPC library (Sumit Gupta)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmmErOoSHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1wCoH/iWkTc5/kTTmzi5Urh9lsUsOL8Oc9hrx
KlH4XdLf9TUUSGPJtPRrrtXiOotNUSaZpg2ULktJAZHxDWXnA83DO0FjwT3+2WbH
/5YGVEK24Sae+DI38ukEvOI6bXrxxqNPLnFRtvV7tclUJD2OEAdKcUd8Y98s9qNf
64bf2gKoQtbZj0eM33KHiotULYZUNoZ2SWKFAWXzKQr7EOkcQGKcuupeStgQ04Id
uumWr+lIgW5vYPOcPFCufE3+LM0MzDvruxBlMJQEJJFVkY4IEFwwJNGW/lo8hgE3
74pFKqOPhYsIFPM8rtKXS+JmjP22iCI49QjnHIsQdDG03GEKvKFFeCw=
=CG0j
-----END PGP SIGNATURE-----
Merge tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"This one is significantly larger than previous ACPI support pull
requests because several significant updates have coincided in it.
First, there is a routine ACPICA code update, to upstream version
20251212, but this time it covers new ACPI 6.6 material that has not
been covered yet. Among other things, it includes definitions of a few
new ACPI tables and updates of some others, like the GICv5 MADT
structures and ARM IORT IWB node definitions that are used for adding
GICv5 ACPI probing on ARM (that technically is IRQ subsystem material,
but it depends on the ACPICA changes, so it is included here). The
latter alone adds a few hundred lines of new code.
Second, there is an update of ACPI _OSC handling including a fix that
prevents failures from occurring in some corner cases due to careless
handling of _OSC error bits.
On top of that, the "system resource" ACPI device objects with the
PNP0C01 and PNP0C02 are now going to be handled by the ACPI core
device enumeration code instead of handing them over to the legacy PNP
system driver which causes device enumeration issues to occur. Some of
those issues have been worked around in device drivers and elsewhere
and those workarounds should not be necessary any more, so they are
going away.
Moreover, the time has come to convert all "core ACPI" device drivers
that were still using struct acpi_driver objects for device binding
into proper platform drivers that use struct platform_driver for this
purpose. These updates are accompanied by some requisite core ACPI
device enumeration code changes.
Next, there are ACPI APEI updates, including changes to avoid excess
overhead in the NMI handler and in SEA on the ARM side, changes to
unify ACPI-based HW error tracing and logging, and changes to prevent
APEI code from reaching out of its allocated memory.
There are also some ACPI power management updates, mostly related to
the ACPI cpuidle support in the processor driver, suspend-to-idle
handling on systems with ACPI support and to ACPI PM of devices.
In addition to the above, bugs are fixed and the code is cleaned up in
assorted places all over.
Specifics:
- Update the ACPICA code in the kernel to upstream version 20251212
which includes the following changes:
* Add support for new ACPI table DTPR (Michal Camacho Romero)
* Release objects with acpi_ut_delete_object_desc() (Zilin Guan)
* Add UUIDs for Microsoft fan extensions and UUIDs associated with
TPM 2.0 devices (Armin Wolf)
* Fix NULL pointer dereference in acpi_ev_address_space_dispatch()
(Alexey Simakov)
* Add KEYP ACPI table definition (Dave Jiang)
* Add support for the Microsoft display mux _OSI string (Armin
Wolf)
* Add definitions for the IOVT ACPI table (Xianglai Li)
* Abort AML bytecode execution on AML_FATAL_OP (Armin Wolf)
* Include all fields in subtable type1 for PPTT (Ben Horgan)
* Add GICv5 MADT structures and Arm IORT IWB node definitions
(Jose Marinho)
* Update Parameter Block structure for RAS2 and add a new flag in
Memory Affinity Structure for SRAT (Pawel Chmielewski)
* Add _VDM (Voltage Domain) object (Pawel Chmielewski)
- Add support for GICv5 ACPI probing on ARM which is based on the
GICv5 MADT structures and ARM IORT IWB node definitions recently
added to ACPICA (Lorenzo Pieralisi)
- Rework ACPI PM notification setup for PCI root buses and modify the
ACPI PM setup for devices to register wakeup source objects under
physical (that is, PCI, platform, etc.) devices instead of doing
that under their ACPI companions (Rafael Wysocki)
- Adjust debug messages regarding postponed ACPI PM printed during
system resume to be more accurate (Rafael Wysocki)
- Remove dead code from lps0_device_attach() (Gergo Koteles)
- Start to invoke Microsoft Function 9 (Turn On Display) of the Low-
Power S0 Idle (LPS0) _DSM in the suspend-to-idle resume flow on
systems with ACPI LPS0 support to address a functional issue on
Lenovo Yoga Slim 7i Aura (15ILL9), where system fans and keyboard
backlights fail to resume after suspend (Jakob Riemenschneider)
- Add sysfs attribute cid for exposing _CID lists under ACPI device
objects (Rafael Wysocki)
- Replace sprintf() with sysfs_emit() in all of the core ACPI sysfs
interface code (Sumeet Pawnikar)
- Use acpi_get_local_u64_address() in the code implementing ACPI
support for PCI to evaluate _ADR instead of evaluating that object
directly (Andy Shevchenko)
- Add JWIPC JVC9100 to irq1_level_low_skip_override[] to unbreak
serial IRQs on that system (Ai Chao)
- Fix handling of _OSC errors in acpi_run_osc() to avoid failures on
systems where _OSC error bits are set even though the _OSC return
buffer contains acknowledged feature bits (Rafael Wysocki)
- Clean up and rearrange \_SB._OSC handling for general platform
features and USB4 features to avoid code duplication and
unnecessary memory management overhead (Rafael Wysocki)
- Make the ACPI core device enumeration code handle PNP0C01 and
PNP0C02 ("system resource") device objects directly instead of
letting the legacy PNP system driver handle them to avoid device
enumeration issues on systems where PNP0C02 is present in the _CID
list under ACPI device objects with a _HID matching a proper device
driver in Linux (Rafael Wysocki)
- Drop workarounds for the known device enumeration issues related to
_CID lists containing PNP0C02 (Rafael Wysocki)
- Drop outdated comment regarding removed function in the ACPI-based
device enumeration code (Julia Lawall)
- Make PRP0001 device matching work as expected for ACPI device
objects using it as a _HID for board development and similar
purposes (Kartik Rajput)
- Use async schedule function in acpi_scan_clear_dep_fn() to avoid
races with user space initialization on some systems (Yicong Yang)
- Add a piece of documentation explaining why binding drivers
directly to ACPI device objects is not a good idea in general and
why it is desirable to convert drivers doing so into proper
platform drivers that use struct platform_driver for device binding
(Rafael Wysocki)
- Convert multiple "core ACPI" drivers, including the NFIT ACPI
device driver, the generic ACPI button drivers, the generic ACPI
thermal zone driver, the ACPI hardware event device (HED) driver,
the ACPI EC driver, the ACPI SMBUS HC driver, the ACPI Smart
Battery Subsystem (SBS) driver, and the ACPI backlight (video)
driver to proper platform drivers that use struct platform_driver
for device binding (Rafael Wysocki)
- Use acpi_get_local_u64_address() in the ACPI backlight (video)
driver to evaluate _ADR instead of evaluating that object directly
(Andy Shevchenko)
- Convert the generic ACPI battery driver to a proper platform driver
using struct platform_driver for device binding (Rafael Wysocki)
- Fix incorrect charging status when current is zero in the generic
ACPI battery driver (Ata İlhan Köktürk)
- Use LIST_HEAD() for initializing a stack-allocated list in the
generic ACPI watchdog device driver (Can Peng)
- Rework the ACPI idle driver initialization to register it directly
from the common initialization code instead of doing that from a
CPU hotplug "online" callback and clean it up (Huisong Li, Rafael
Wysocki)
- Fix a possible NULL pointer dereference in
acpi_processor_errata_piix4() (Tuo Li)
- Make read-only array non_mmio_desc[] static const (Colin Ian King)
- Prevent the APEI GHES support code on ARM from accessing memory out
of bounds or going past the ARM processor CPER record buffer (Mauro
Carvalho Chehab)
- Prevent cper_print_fw_err() from dumping the entire memory on
systems with defective firmware (Mauro Carvalho Chehab)
- Improve ghes_notify_nmi() status check to avoid unnecessary
overhead in the NMI handler by carrying out all of the requisite
preparations and the NMI registration time (Tony Luck)
- Refactor the GHES driver by extracting common functionality into
reusable helper functions to reduce code duplication and improve
the ghes_notify_sea() status check in analogy with the previous
ghes_notify_nmi() status check improvement (Shuai Xue)
- Make ELOG and GHES log and trace consistently and support the CPER
CXL protocol analogously (Fabio De Francesco)
- Disable KASAN instrumentation in the APEI GHES driver when compile
testing with clang < 18 (Nathan Chancellor)
- Let ghes_edac be the preferred driver to load on __ZX__ and _BYO_
systems by extending the platform detection list in the APEI GHES
driver (Tony W Wang-oc)
- Clean up cppc_perf_caps and cppc_perf_ctrls structs and rename EPP
constants for clarity in the ACPI CPPC library (Sumit Gupta)"
* tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (117 commits)
ACPI: battery: fix incorrect charging status when current is zero
ACPI: scan: Use async schedule function in acpi_scan_clear_dep_fn()
ACPI: x86: s2idle: Invoke Microsoft _DSM Function 9 (Turn On Display)
ACPI: APEI: GHES: Add ghes_edac support for __ZX__ and _BYO_ systems
ACPI: APEI: GHES: Disable KASAN instrumentation when compile testing with clang < 18
ACPI: sysfs: Replace sprintf() with sysfs_emit()
ACPI: CPPC: Rename EPP constants for clarity
ACPI: CPPC: Clean up cppc_perf_caps and cppc_perf_ctrls structs
ACPI: processor: idle: Rework the handling of acpi_processor_ffh_lpi_probe()
ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_dev() to void
ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_states() to void
irqchip/gic-v5: Add ACPI IWB probing
irqchip/gic-v5: Add ACPI ITS probing
irqchip/gic-v5: Add ACPI IRS probing
irqchip/gic-v5: Split IRS probing into OF and generic portions
PCI/MSI: Make the pci_msi_map_rid_ctlr_node() interface firmware agnostic
irqdomain: Add parent field to struct irqchip_fwid
ACPI: PCI: simplify code with acpi_get_local_u64_address()
ACPI: video: simplify code with acpi_get_local_u64_address()
ACPI: PM: Adjust messages regarding postponed ACPI PM
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGLwcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv+TD/48S2HTnMhmW6AtFYWErQ+sEKXpHrxbYe7S
+qR8/g/T+QSfhfqPwZEuagndFKtIP3LJfaXGSP1Lk1RfP9NLQy91v33Ibe4DjHkp
etWSfnMHA9MUAoWKmg8EvncB2G+ZQFiYCpjazj5tKHD9S2+psGMuL8kq6qzMJE83
uhpb8WutUl4aSIXbMSfyGlwBhI1MjjRbbWlIBmg4yC8BWt1sH8Qn2L2GNVylEIcX
U8At3KLgPGn0axSg4yGMAwTqtGhL/jwdDyeczbmRlXuAr4iVL9UX/yADCYkazt6U
ttQ2/H+cxCwfES84COx9EteAatlbZxo6wjGvZ3xOMiMJVTjYe1x6Gkcckq+LrZX6
tjofi2KK78qkrMXk1mZMkZjpyUWgRtCswhDllbQyqFs0SwzQtno2//Rk8HU9dhbt
pkpryDbGFki9X3upcNyEYp5TYflpW6YhAzShYgmE6KXim2fV8SeFLviy0erKOAl+
fwjTE6KQ5QoQv0s3WxkWa4lREm34O6IHrCUmbiPm5CruJnQDhqAN2QZIDgYC4WAf
0gu9cR/O4Vxu7TQXrumPs5q+gCyDU0u0B8C3mG2s+rIo+PI5cVZKs2OIZ8HiPo0F
x73kR/pX3DMe35ZQkQX22ymMuowV+aQouDLY9DTwakP5acdcg7h7GZKABk6VLB06
gUIsnxURiQ==
=jNzW
-----END PGP SIGNATURE-----
Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updated
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a qhilw related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe, if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: add documentation for new queue attribute async_dpeth
block, bfq: convert to use request_queue->async_depth
mq-deadline: covert to use request_queue->async_depth
kyber: covert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGJ1kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpky8EAChIL3uJ5Vmv+oQTxT4EVb1wpc8U/XzXWU5
Q5F9IpZZCGO7+i015Y7iTTqDRixjblRaWpWzZZP8vflWDUS8LESNZLQdcoEnxaiv
P367KNPUGwxejcKsu8PvZvfnX6JWSQoNstcDmrwkCF0ND2UUfvvMZyn3uKhkbBRY
h5Ehcqkvqc1OJDAWC7+yPzYAmB01uRPQ6sc9/GeujznHPlfbvie4u6gBvvfXeirT
592zbVftINMrm6Twd6zl4n+HNAn+CUoyVMppeeddv5IcyFPm9uz/dLOZBXTz6552
jFYNmB0U4g+SxGXMyqp37YISTALnuY+57y5eXmEAtgkEeE3HrF+F/ZdxQHwXSpo3
T2Lb9IOqFyHtSvq678HZ37JB6aIYbBE/mZdNf8FFFpnPJGb5Ey7d50qPp/ywVq0H
p9CahbpkzGUBMsZ+koew0YHiFdWV9tww+/Bnk5dTtn2197uyaHsLdmbf4C36GWke
Bk5cwNgU+3DMFAfTiL9m+AIXYsJkBayRJn+hViTrF5AL7gcGiBryGF43FOSKoYuq
f0mniDnGSwvn86VZPuZQ6wBRHZPEMR3OlaUXn6XrUU6cYyvMg0pBZV+QHF7zlsSP
2sdfUbPL5TxexF3G8dsxlDIypz9Z6TCoUCfU0WiiUETnCrVNkXfIY846A+w08p0b
ejBjzrwRtQ==
=CqJq
-----END PGP SIGNATURE-----
Merge tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring bpf filters from Jens Axboe:
"This adds support for both cBPF filters for io_uring, as well as task
inherited restrictions and filters.
seccomp and io_uring don't play along nicely, as most of the
interesting data to filter on resides somewhat out-of-band, in the
submission queue ring.
As a result, things like containers and systemd that apply seccomp
filters, can't filter io_uring operations.
That leaves them with just one choice if filtering is critical -
filter the actual io_uring_setup(2) system call to simply disallow
io_uring. That's rather unfortunate, and has limited us because of it.
io_uring already has some filtering support. It requires the ring to
be setup in a disabled state, and then a filter set can be applied.
This filter set is completely bi-modal - an opcode is either enabled
or it's not. Once a filter set is registered, the ring can be enabled.
This is very restrictive, and it's not useful at all to systemd or
containers which really want both broader and more specific control.
This first adds support for cBPF filters for opcodes, which enables
tighter control over what exactly a specific opcode may do. As
examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
allowing filtering on resolve flags. And another example is added for
IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
are both common use cases. cBPF was chosen rather than eBPF, because
the latter is often restricted in containers as well.
These filters are run post the init phase of the request, which allows
filters to even dip into data that is being passed in struct in user
memory, as the init side of requests make that data stable by bringing
it into the kernel. This allows filtering without needing to copy this
data twice, or have filters etc know about the exact layout of the
user data. The filters get the already copied and sanitized data
passed.
On top of that support is added for per-task filters, meaning that any
ring created with a task that has a per-task filter will get those
filters applied when it's created. These filters are inherited across
fork as well. Once a filter has been registered, any further added
filters may only further restrict what operations are permitted.
Filters cannot change the return value of an operation, they can only
permit or deny it based on the contents"
* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: allow registration of per-task restrictions
io_uring: add task fork hook
io_uring/bpf_filter: add ref counts to struct io_bpf_filter
io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
io_uring/bpf_filter: allow filtering on contents of struct open_how
io_uring/net: allow filtering on IORING_OP_SOCKET data
io_uring: add support for BPF filtering for opcode restrictions
[mostly] sanitize struct filename hanling
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaYlcJgAKCRBZ7Krx/gZQ
6xlKAP9c9J13sJ/mcobsj1Ov7nSHISNbnYqvRRCu09Wq3UQvJgEApNQYOEdLtpff
zUnWOAQ0nOKY7w9VMLkRRustXpuGjAc=
=Fld4
-----END PGP SIGNATURE-----
Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs 'struct filename' updates from Al Viro:
"[Mostly] sanitize struct filename handling"
* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
sysfs(2): fs_index() argument is _not_ a pathname
alpha: switch osf_mount() to strndup_user()
ksmbd: use CLASS(filename_kernel)
mqueue: switch to CLASS(filename)
user_statfs(): switch to CLASS(filename)
statx: switch to CLASS(filename_maybe_null)
quotactl_block(): switch to CLASS(filename)
chroot(2): switch to CLASS(filename)
move_mount(2): switch to CLASS(filename_maybe_null)
namei.c: switch user pathname imports to CLASS(filename{,_flags})
namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
do_readlinkat(): switch to CLASS(filename_flags)
do_sys_truncate(): switch to CLASS(filename)
do_utimes_path(): switch to CLASS(filename_uflags)
chdir(2): unspaghettify a bit...
do_fchownat(): unspaghettify a bit...
fspick(2): use CLASS(filename_flags)
name_to_handle_at(): use CLASS(filename_uflags)
vfs_open_tree(): use CLASS(filename_uflags)
...
Please consider pulling these changes from the signed vfs-7.0-rc1.misc tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49QAKCRCRxhvAZXjc
ojrZAQD1VJzY46r5FnAVf4jlEHyjIbDnZCP/n+c4x6XnqpU6EQEAgB0yAtAGP6+u
SBuytElqHoTT5VtmEXTAabCNQ9Ks8wo=
=JwZz
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains a mix of VFS cleanups, performance improvements, API
fixes, documentation, and a deprecation notice.
Scalability and performance:
- Rework pid allocation to only take pidmap_lock once instead of
twice during alloc_pid(), improving thread creation/teardown
throughput by 10-16% depending on false-sharing luck. Pad the
namespace refcount to reduce false-sharing
- Track file lock presence via a flag in ->i_opflags instead of
reading ->i_flctx, avoiding false-sharing with ->i_readcount on
open/close hot paths. Measured 4-16% improvement on 24-core
open-in-a-loop benchmarks
- Use a consume fence in locks_inode_context() to match the
store-release/load-consume idiom, eliminating a hardware fence on
some architectures
- Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
false-sharing
- Remove a redundant DCACHE_MANAGED_DENTRY check in
__follow_mount_rcu() that never fires since the caller already
verifies it, eliminating a 100% mispredicted branch
- Fix a 100% mispredicted likely() in devcgroup_inode_permission()
that became wrong after a prior code reorder
Bug fixes and correctness:
- Make insert_inode_locked() wait for inode destruction instead of
skipping, fixing a corner case where two matching inodes could
exist in the hash
- Move f_mode initialization before file_ref_init() in alloc_file()
to respect the SLAB_TYPESAFE_BY_RCU ordering contract
- Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
no buffers attached, preventing a null pointer dereference when
AS_RELEASE_ALWAYS is set but no release_folio op exists
- Fix select restart_block to store end_time as timespec64, avoiding
truncation of tv_sec on 32-bit architectures
- Make dump_inode() use get_kernel_nofault() to safely access inode
and superblock fields, matching the dump_mapping() pattern
API modernization:
- Make posix_acl_to_xattr() allocate the buffer internally since
every single caller was doing it anyway. Reduces boilerplate and
unnecessary error checking across ~15 filesystems
- Replace deprecated simple_strtoul() with kstrtoul() for the
ihash_entries, dhash_entries, mhash_entries, and mphash_entries
boot parameters, adding proper error handling
- Convert chardev code to use guard(mutex) and __free(kfree) cleanup
patterns
- Replace min_t() with min() or umin() in VFS code to avoid silently
truncating unsigned long to unsigned int
- Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
already check the flag
Deprecation:
- Begin deprecating legacy BSD process accounting (acct(2)). The
interface has numerous footguns and better alternatives exist
(eBPF)
Documentation:
- Fix and complete kernel-doc for struct export_operations, removing
duplicated documentation between ReST and source
- Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()
Testing:
- Add a kunit test for initramfs cpio handling of entries with
filesize > PATH_MAX
Misc:
- Add missing <linux/init_task.h> include in fs_struct.c"
* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
posix_acl: make posix_acl_to_xattr() alloc the buffer
fs: make insert_inode_locked() wait for inode destruction
initramfs_test: kunit test for cpio.filesize > PATH_MAX
fs: improve dump_inode() to safely access inode fields
fs: add <linux/init_task.h> for 'init_fs'
docs: exportfs: Use source code struct documentation
fs: move initializing f_mode before file_ref_init()
exportfs: Complete kernel-doc for struct export_operations
exportfs: Mark struct export_operations functions at kernel-doc
exportfs: Fix kernel-doc output for get_name()
acct(2): begin the deprecation of legacy BSD process accounting
device_cgroup: remove branch hint after code refactor
VFS: fix __start_dirop() kernel-doc warnings
fs: Describe @isnew parameter in ilookup5_nowait()
fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
select: store end_time as timespec64 in restart block
chardev: Switch to guard(mutex) and __free(kfree)
namespace: Replace simple_strtoul with kstrtoul to parse boot params
dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmmCurkUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNDWA//RZxjjyY1I0GRDepJXJ8UFEVt4Fdr
VsnSKL3o7sf0SAsQj2HCJsJPiwD5fHm2C2gdxh9rFC0bPpMbTVAkwUL7WhP+nkAt
LA+UZKYurrk1XF6OctILoY3JcXmynb1Oe3lg6uVcWX5b1uEriqRgGKNcMYLb5fmr
D1vZ9LMuZe8WwGTScprQID9FMrZ0TDbdI/vqG7si1W/PCFH7630MPJkmzmjPWvnV
xJISKLOG+qbyWoNGLr+VaNjkmA+jPfsXAKWbfNXUGfikP8g/OHpFd70nIzJs8p7J
dxZD7w6/kqSGhauQjcX8ov0zKxn83Z2Xt0+4Ldl5vOCWI3r4T3Y8WdarmULbq65n
jIN8djDgmCJPqa5zuPmik+womaPk2GmSy1viEJdT4W0iHggTC1snOz1J+BbD+nkh
uEZkmcCZbaeEQmfefxIyHDirrFsJvrunWupGrkfxvfFr+QU8H1xNLfMd6CQzvtI4
P5p/KrnP2e58tJqvPxSY315ewUMy73kZU5DUl+Rq6Y4ai415R7vtwwEEkSKWnyja
LMdEumc9IrsiBMcLmsj8QwobCr7XJtdCQV5ohR8CPxxcsI/G0pR99e1pckD7l7Qm
OG461BKHntU3SFWSiZw+rNWlJuyPcSy5nmUxQvxQHP9pShZPu8rTfYX+CBzrHJk2
OFjAwNJn1N/NfYI=
=cCyp
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Unify the security_inode_listsecurity() calls in NFSv4
While looking at security_inode_listsecurity() with an eye towards
improving the interface, we realized that the NFSv4 code was making
multiple calls to the LSM hook that could be consolidated into one.
- Mark the LSM static branch keys as static - this helps resolve some
sparse warnings
- Add __rust_helper annotations to the LSM and cred wrapper functions
- Remove the unsused set_security_override_from_ctx() function
- Minor fixes to some of the LSM kdoc comment blocks
* tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lsm: make keys for static branch static
cred: remove unused set_security_override_from_ctx()
rust: security: add __rust_helper to helpers
rust: cred: add __rust_helper to helpers
nfs: unify security_inode_listsecurity() calls
lsm: fix kernel-doc struct member names
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmmCuoQUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMRFhAAntv4vmqRciFI4oEqxi5X8wmYmzc9
BUQV2XXcfO63IOHdGrXmYHByx3+mZZddAPpYMTqrzA0p2NCqi4svCVwspUHwUcTY
btl+xlppBJpBtUL5pmLiP6Q4u+zURYCwuA/OKfxuKa5Frm8D3kbkd5MpxJS15Mev
qqEhLT0aj6/rjQpYVwOFGMwehKfE7iuyc8XTBaetvUKHW38sj18ANSpLnN5bmiuE
3lz252kCjyDoOsu+vO0Saa8Rv8lVDjlSMn6mYr4L2fVygYwFDg2Gj7+bmB6LGYy9
YyIm6P+b23E8GOltEObpvrz8ItPR7nvKNiDMEeP1eqGzQ/Mc5OqqljVaNMNPmP+s
XN/jZt02XePKXlje+C08620mDVeIYp35TK1bY2/HrYMqySE0wwO1iSyBI4ftPFtu
CteM8XA8oH49pspFWbEKCHmtFFGxDVjfVM7YrHeDc+qw2tJjZ7R1GRk5hadP1Ou7
emxGLb6jfejT6NMNU8rM2RVmQNs1jcFh+8lHvDgqqQmaCXJd3AEgr+Om5w9kZ6fJ
FyEkh0f9HuZdcEn8tqWaIwCAZXTzECOThj6hhxZGiG9xXFXza1eIXxq7VtIE6fdO
ATAJ6cpcj0LIQmE7QWntS1NloPkWD1OSiWLNis0AgCiN97oRTk0oFJ5q7fD0bLUa
HGAQqd4OJKmNciw=
=pFeD
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
- Improve the NETFILTER_PKT audit records
Add source and destination ports to the NETFILTER_PKT audit records
while also consolidating a lot of the code into a new, singular
audit_log_nf_skb() function. This new approach to structuring the
NETFILTER_PKT record generation should eliminate some unnecessary
overhead when audit is not built into the kernel.
- Update the audit syscall classifier code
Add the listxattrat(), getxattrat(), and fchmodat2() syscall to the
audit code which classifies syscalls into categories of operations,
e.g. "read" or "change attributes".
- Move the syscall classifier declarations into audit_arch.h
Shuffle around some header file declarations to resolve some sparse
warnings.
* tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: move the compat_xxx_class[] extern declarations to audit_arch.h
audit: add missing syscalls to read class
audit: include source and destination ports to NETFILTER_PKT
audit: add audit_log_nf_skb helper function
audit: add fchmodat2() to change attributes class
RCU Tasks Trace:
Re-implement RCU tasks trace in term of SRCU-fast, not only more than 500 lines
of code are saved because of the reimplementation, a new set of API,
rcu_read_{,un}lock_tasks_trace(), becomes possible as well. Compared to the
previous rcu_read_{,un}lock_trace(), the new API avoid the task_struct accesses
thanks to the SRCU-fast semantics. As a result, the old
rcu_read{,un}lock_trace() API is now deprecated.
RCU Torture Test:
- Multiple improvements on kvm-series.sh (parallel run and progress showing
metrics)
- Add context checks to rcu_torture_timer().
- Make config2csv.sh properly handle comments in .boot files.
- Include commit discription in testid.txt.
Miscellaneous RCU changes:
- Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early.
- Use suitable gfp_flags for the init_srcu_struct_nodes().
- Fix rcu_read_unlock() deadloop due to softirq.
- Correctly compute probability to invoke ->exp_current() in rcutorture.
- Make expedited RCU CPU stall warnings detect stall-end races.
RCU nocb:
- Remove unnecessary WakeOvfIsDeferred wake path and callback overload
handling.
- Extract nocb_defer_wakeup_cancel() helper.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCAAvFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAmmARZERHGJvcXVuQGtl
cm5lbC5vcmcACgkQSXnow7UH+rh8SAf+PDIBWAkdbGgs32EfgpFY42RB4CWygH47
YRup/M3+nU0JcBzNnona1srpHBXRySBJQbvRbsOdlM45VoNQ2wPjig/3vFVRUKYx
uqj9Tze00DS74IIGESoTGp0amZde9SS9JakNRoEfTr+Zpj8N6LFERQw0ywUwjR5b
RR6bz7q05TAl3u2BYUAgNdnf3VWWTmj4WYwArlQ+qRFAyGN+TVj8Ezra6+K5TJ7u
SQYrf7WmRGOhHbVVolvVEOVdACccI8dFl3ebJVE2Ky0gp1o3BLPkcDLJ6gBdTCoE
rRrbnkeqs5V7tOkPFDBeUhLPrm1QxrdEDxUQFWjSApbv161sx7AOZA==
=O9QQ
-----END PGP SIGNATURE-----
Merge tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Boqun Feng:
- RCU Tasks Trace:
Re-implement RCU tasks trace in term of SRCU-fast, not only more than
500 lines of code are saved because of the reimplementation, a new
set of API, rcu_read_{,un}lock_tasks_trace(), becomes possible as
well. Compared to the previous rcu_read_{,un}lock_trace(), the new
API avoid the task_struct accesses thanks to the SRCU-fast semantics.
As a result, the old rcu_read{,un}lock_trace() API is now deprecated.
- RCU Torture Test:
- Multiple improvements on kvm-series.sh (parallel run and
progress showing metrics)
- Add context checks to rcu_torture_timer()
- Make config2csv.sh properly handle comments in .boot files
- Include commit discription in testid.txt
- Miscellaneous RCU changes:
- Reduce synchronize_rcu() latency by reporting GP kthread's
CPU QS early
- Use suitable gfp_flags for the init_srcu_struct_nodes()
- Fix rcu_read_unlock() deadloop due to softirq
- Correctly compute probability to invoke ->exp_current()
in rcutorture
- Make expedited RCU CPU stall warnings detect stall-end races
- RCU nocb:
- Remove unnecessary WakeOvfIsDeferred wake path and callback
overload handling
- Extract nocb_defer_wakeup_cancel() helper
* tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (25 commits)
rcu/nocb: Extract nocb_defer_wakeup_cancel() helper
rcu/nocb: Remove dead callback overload handling
rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
rcu: Fix rcu_read_unlock() deadloop due to softirq
rcutorture: Correctly compute probability to invoke ->exp_current()
rcu: Make expedited RCU CPU stall warnings detect stall-end races
rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
rcutorture: Prevent concurrent kvm.sh runs on same source tree
torture: Include commit discription in testid.txt
torture: Make config2csv.sh properly handle comments in .boot files
torture: Make kvm-series.sh give run numbers and totals
torture: Make kvm-series.sh give build numbers and totals
torture: Parallelize kvm-series.sh guest-OS execution
rcutorture: Add context checks to rcu_torture_timer()
rcutorture: Test rcu_tasks_trace_expedite_current()
srcu: Create an rcu_tasks_trace_expedite_current() function
checkpatch: Deprecate rcu_read_{,un}lock_trace()
rcu: Update Requirements.rst for RCU Tasks Trace
...
performance regressions in the recent rewrite
of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path
by removing an unintended transition phase.
- Fix performance/scalability regression on a
thread-pool benchmark by optimizing transitional
CIDs when scheduling out.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmHDvQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hdPBAAgnl/L09wF8WCQLSoLrhr71FmS6fZApDB
Rvov2be8tGJR0BsrJF5uOKTNjulqUIr0mfO73fdHZftdFuhm/WLnWjBO62GhCKMg
d8kXOVZ7PudFN+QwL17pOAub8voh9s9/mceE/hZ3M5eNjXlG4sAcpyGvnrTLLYru
rfzO48NOpy5NMfbxU5/f9nojfr2t8fhnpX2QjquOhEPpl/BeYzexTZK7h2IJXqTK
tkU6IY9X8fT7y8LkKbTCIMJvEuWawHj1DSW2EiWNPJZkX+Hk5ZHttg28JjROavEy
orgairCSCT/cOETKugfToFd0Z4WlmemY6Nk5Kyx//WiFQ/u0HHlFVgMJoJfQEovV
MtIxLVygVbEoQyTszZyFUlTQjrnH8uKxXYhh1mX5wSj9lyDfpfJZycFFA2RpE4Rw
/+pvH08BfR4FgpqTfojfgOnuK/575VsomaVghritoNW3bAie1kpnWIeBaXS8lL4O
0pkK7XX8ng6hXuZTMxgXXfkfUB6oM1Yp1OZJAEzUvftsK0FQ5q3e0WxD+pdVza2s
PfQPaA7bT/G7y8k4LIXm59/tPX2QWPwe0yci00NbyfWiOdxHSgS7crQO8E1+VAiq
TcLGZNj/wFL6B5ghaiUIi22Mo+WnLX8fW+aiIjSiUQILmbNZXYmwtfEFsvsahh9W
/RkE/WQ492E=
=/PkF
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Miscellaneous MMCID fixes to address bugs and performance regressions
in the recent rewrite of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path by removing an
unintended transition phase
- Fix performance/scalability regression on a thread-pool benchmark
by optimizing transitional CIDs when scheduling out"
* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Optimize transitional CIDs when scheduling out
sched/mmcid: Drop per CPU CID immediately when switching to per task mode
sched/mmcid: Protect transition on weakly ordered systems
sched/mmcid: Prevent live lock on task to CPU mode transition
- Bump up the Clang minimum version requirements
for livepatch builds, due to Clang assembler
section handling bugs causing silent
miscompilations.
- Strip livepatching symbol artifacts from
non-livepatch modules.
- Fix livepatch build warnings when certain
Clang LTO options are enabled.
- Fix livepatch build error when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmHCvwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hkTQ/8DhLI5m9CqhMaouuR5Vm9POKgFOXAe6uz
eHTuKhpJlw+anhNjeUA7PtYbnkrj0j+aNo5SmrfD4Yx8CW7dCt+oO3Y5ziVhkPHw
46Q/9KbmcT11uPbYywp/G4b15FF8YYu9slzhGau/Wa9H+oqU/WPoGapJsMPpUtBo
s0qGxdr2G3WZyD9H/wgCyhMwCOkAYMJ0sHxpGgRajLefDutsRvtlau/ktYU79vI3
nUFteD3YDIAUblBtZPogsCP36QJlx7TWCUNK02vPeOYRh3xPjf3iG+vgf1+sjZHV
P20psekpDwhh1KyeVziUyihUy8TmEVVozRvsrUKVlXmEqLqDtNrysqKTSp5/yRdT
MqgNwDrvv2wW/DKYJhefbuttx3ppAErrnJ3zC9TYSmdn27feKPDJcD1OmdS3BIpH
x9u/eVbOS8xbsOc/t3/Al7CRazvjLU0+OXsMJWbmAaO3tE7SwHq2aOnGbLGjvwWC
Ts1AYfNp4H41CLLFnmKR8q2t/DOBhefW8p3cR5U+cVQ7PdqKRT+TwKWVnCrbrBcJ
71702IrqoghwUrmhtxdZR0jZLtwb80s8zqdbrHojXSbXUFfYwwHNNaW1NSEpdSr4
W8xYfSPrK71OGZ6oh/v1Wce6D+mxqb6kYD8DRLgOdbxrnvLwUhGXxxNDjfXA0f4B
abjGHlAyoCs=
=cdu2
-----END PGP SIGNATURE-----
Merge tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar::
- Bump up the Clang minimum version requirements for livepatch
builds, due to Clang assembler section handling bugs causing
silent miscompilations
- Strip livepatching symbol artifacts from non-livepatch modules
- Fix livepatch build warnings when certain Clang LTO options
are enabled
- Fix livepatch build error when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
* tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool/klp: Fix unexported static call key access for manually built livepatch modules
objtool/klp: Fix symbol correlation for orphaned local symbols
livepatch: Free klp_{object,func}_ext data after initialization
livepatch: Fix having __klp_objects relics in non-livepatch modules
livepatch/klp-build: Require Clang assembler >= 20
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}()
properly by switching to bpf_selem_unlink_nofail().
Both functions iterate their own RCU-protected list of selems and call
bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when
both map_free() and destroy() fail to remove a selem from b->list
(extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(),
also switch to hlist_for_each_entry_rcu() since we no longer iterate
local_storage->list under local_storage->lock.
bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths
so reuse_now should always be false. Remove it from the argument and
hardcode it.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
Introduce bpf_selem_unlink_nofail() to properly handle errors returned
from rqspinlock in bpf_local_storage_map_free() and
bpf_local_storage_destroy() where the operation must succeeds.
The idea of bpf_selem_unlink_nofail() is to allow an selem to be
partially linked and use atomic operation on a bit field, selem->state,
to determine when and who can free the selem if any unlink under lock
fails. An selem initially is fully linked to a map and a local storage.
Under normal circumstances, bpf_selem_unlink_nofail() will be able to
grab locks and unlink a selem from map and local storage in sequeunce,
just like bpf_selem_unlink(), and then free it after an RCU grace period.
However, if any of the lock attempts fails, it will only clear
SDATA(selem)->smap or selem->local_storage depending on the caller and
set SELEM_MAP_UNLINKED or SELEM_STORAGE_UNLINKED according to the
caller. Then, after both map_free() and destroy() see the selem and the
state becomes SELEM_UNLINKED, one of two racing caller can succeed in
cmpxchg the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no
double free or memory leak.
To make sure bpf_obj_free_fields() is done only once and when map is
still present, it is called when unlinking an selem from b->list under
b->lock.
To make sure uncharging memory is done only when the owner is still
present in map_free(), block destroy() from returning until there is no
pending map_free().
Since smap may not be valid in destroy(), bpf_selem_unlink_nofail()
skips bpf_selem_unlink_storage_nolock_misc() when called from destroy().
This is okay as bpf_local_storage_destroy() will return the remaining
amount of memory charge tracked by mem_charge to the owner to uncharge.
It is also safe to skip clearing local_storage->owner and owner_storage
as the owner is being freed and no users or bpf programs should be able
to reference the owner and using local_storage.
Finally, access of selem, SDATA(selem)->smap and selem->local_storage
are racy. Callers will protect these fields with RCU.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com
The next patch will introduce bpf_selem_unlink_nofail() to handle
rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be
partially unlinked from map or local storage. Save memory allocation
method in selem so that later an selem can be correctly freed even when
SDATA(selem)->smap is init to NULL.
In addition, keep track of memory charge to the owner in local storage
so that later bpf_selem_unlink_nofail() can return the correct memory
charge to the owner. Updating local_storage->mem_charge is protected by
local_storage->lock.
Finally, extract miscellaneous tasks performed when unlinking an selem
from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will
be reused by bpf_selem_unlink_nofail().
This patch also takes the chance to remove local_storage->smap, which
is no longer used since commit f484f4a3e0 ("bpf: Replace bpf memory
allocator with kmalloc_nolock() in local storage").
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-10-ameryhung@gmail.com
Percpu locks have been removed from cgroup and task local storage. Now
that all local storage no longer use percpu variables as locks preventing
recursion, there is no need to pass them to bpf_local_storage_map_free().
Remove the argument from the function.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-8-ameryhung@gmail.com
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Since the percpu counter is removed, merge back bpf_task_storage_get()
and bpf_task_storage_get_recur(). This will allow the bpf syscalls and
helpers to run concurrently on the same CPU, removing the spurious
-EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always
succeed with enough free memory unless being called recursively.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-7-ameryhung@gmail.com
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock
from raw_spin_lock to rqspinlock.
Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall
return or BPF helper return.
In bpf_local_storage_destroy(), ignore return from
raw_res_spin_lock_irqsave() for now. A later patch will correctly
handle errors correctly in bpf_local_storage_destroy() so that it can
unlink selems even when failing to acquire locks.
For __bpf_local_storage_map_cache(), instead of handling the error,
skip updating the cache.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-6-ameryhung@gmail.com
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.
Open code bpf_selem_unlink_storage() in the only caller,
bpf_selem_unlink(), since unlink_map and unlink_storage must be done
together after all the necessary locks are acquired.
For bpf_local_storage_map_free(), ignore the return from
bpf_selem_unlink() for now. A later patch will allow it to unlink selems
even when failing to acquire locks.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_unlink_map() to failable. It still always succeeds and
returns 0 for now.
Since some operations updating local storage cannot fail in the middle,
open-code bpf_selem_unlink_map() to take the b->lock before the
operation. There are two such locations:
- bpf_local_storage_alloc()
The first selem will be unlinked from smap if cmpxchg owner_storage_ptr
fails, which should not fail. Therefore, hold b->lock when linking
until allocation complete. Helpers that assume b->lock is held by
callers are introduced: bpf_selem_link_map_nolock() and
bpf_selem_unlink_map_nolock().
- bpf_local_storage_update()
The three step update process: link_map(new_selem),
link_storage(new_selem), and unlink_map(old_selem) should not fail in
the middle.
In bpf_selem_unlink(), bpf_selem_unlink_map() and
bpf_selem_unlink_storage() should either all succeed or fail as a whole
instead of failing in the middle. So, return if unlink_map() failed.
Remove the selem_linked_to_map_lockless() check as an selem in the
common paths (not bpf_local_storage_map_free() or
bpf_local_storage_destroy()), will be unlinked under b->lock and
local_storage->lock and therefore no other threads can unlink the selem
from map at the same time.
In bpf_local_storage_destroy(), ignore the return of
bpf_selem_unlink_map() for now. A later patch will allow
bpf_local_storage_destroy() to unlink selems even when failing to
acquire locks.
Note that while this patch removes all callers of selem_linked_to_map(),
a later patch that introduces bpf_selem_unlink_nofail() will use it
again.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-3-ameryhung@gmail.com
A later bpf_local_storage refactor will acquire all locks before
performing any update. To simplified the number of locks needed to take
in bpf_local_storage_map_update(), determine the bucket based on the
local_storage an selem belongs to instead of the selem pointer.
Currently, when a new selem needs to be created to replace the old selem
in bpf_local_storage_map_update(), locks of both buckets need to be
acquired to prevent racing. This can be simplified if the two selem
belongs to the same bucket so that only one bucket needs to be locked.
Therefore, instead of hashing selem, hashing the local_storage pointer
the selem belongs.
Performance wise, this is slightly better as update now requires locking
one bucket. It should not change the level of contention on one bucket
as the pointers to local storages of selems in a map are just as unique
as pointers to selems.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-2-ameryhung@gmail.com
- Fix event format field alignments for 32 bit architectures
The fields in the event format files are used to parse the raw binary
buffer data by applications. If they are incorrect, then the application
produces garbage.
On 32 bit architectures, the function graph 64bit calltime and rettime
were off by 4bytes. That's because the actual fields are in a packed
structure but the macros used by the ftrace events did not mark them as
packed, and instead, gave them their natural alignment which made their
offsets off by 4 bytes.
There are macros to have a packed field within an embedded structure of
an event, but there's no macro for normal fields within a packed
structure of the event. The macro __field_packed() was used for the
packed embedded structure field. Rename that to __field_desc_pcaked() (to
match the non-packed embedded field macro __field_desc()), and make
__field_packed() for fields that are in a packed event structure (which
matches the unpacked __field() macro).
Switch the calltime and rettime fields of the function graph event to use
the new __field_packed() and this makes the offsets correct.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaYZKpRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qk1UAP41H5PL24xmYp+34GIP6lHuJr6iZzUm
KbZi1Zx4zNmXSAD/e3Ra5SZopWszeMTf/tmxUXbl30oLdw4CJgS1WztBggk=
=m36h
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix event format field alignments for 32 bit architectures
The fields in the event format files are used to parse the raw binary
buffer data by applications. If they are incorrect, then the
application produces garbage.
On 32 bit architectures, the function graph 64bit calltime and
rettime were off by 4bytes. That's because the actual fields are in a
packed structure but the macros used by the ftrace events did not
mark them as packed, and instead, gave them their natural alignment
which made their offsets off by 4 bytes.
There are macros to have a packed field within an embedded structure
of an event, but there's no macro for normal fields within a packed
structure of the event. The macro __field_packed() was used for the
packed embedded structure field. Rename that to __field_desc_packed()
(to match the non-packed embedded field macro __field_desc()), and
make __field_packed() for fields that are in a packed event structure
(which matches the unpacked __field() macro).
Switch the calltime and rettime fields of the function graph event to
use the new __field_packed() and this makes the offsets correct.
* tag 'trace-v6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix ftrace event field alignments
Two minor fixes for DMA-mapping subsystem:
- check for the rare case of the allocation failure of the global CMA pool
(Shanker Donthineni)
- avoid perf buffer overflow when tracing large scatter-gather lists
(Deepanshu Kartikey)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaYXTzAAKCRCJp1EFxbsS
REbUAP40hTgNNGjlbV/b6nES4P/SuZ5a8p05+YWF7bTOVf/pMwEA8EHFz5DLsKeS
1bX7/X2wzEYOyJ7v1S+PYxIswn9A5AY=
=2IQo
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two minor fixes for the DMA-mapping subsystem:
- check for the rare case of the allocation failure of the global CMA
pool (Shanker Donthineni)
- avoid perf buffer overflow when tracing large scatter-gather lists
(Deepanshu Kartikey)"
* tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma: contiguous: Check return value of dma_contiguous_reserve_area()
tracing/dma: Cap dma_map_sg tracepoint arrays to prevent buffer overflow
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
call_rcu_tasks_trace() is not safe from in_nmi() and not reentrant.
To prevent deadlock on raw_spin_lock_rcu_node(rtpcp) or memory corruption
defer to irq_work when IRQs are disabled. call_rcu_tasks_generic()
protects itself with local_irq_save().
Note when bpf_async_cb->refcnt drops to zero it's safe to reuse
bpf_async_cb->worker for a different irq_work callback, since
bpf_async_schedule_op() -> irq_work_queue(&cb->worker);
is only called when refcnt >= 1.
Fixes: 1bfbc267ec ("bpf: Enable bpf_timer and bpf_wq in any context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com
Currently, bpf_map_get_info_by_fd calculates and caches the hash of the
map regardless of the map's frozen state.
This leads to a TOCTOU bug where userspace can call
BPF_OBJ_GET_INFO_BY_FD to cache the hash and then modify the map
contents before freezing.
Therefore, a trusted loader can be tricked into verifying the stale hash
while loading the modified contents.
Fix this by returning -EPERM if the map is not frozen when the hash is
requested. This ensures the hash is only generated for the final,
immutable state of the map.
Fixes: ea2e6467ac ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD")
Reported-by: Toshi Piazza <toshi.piazza@microsoft.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260205070755.695776-1-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Practical BPF signatures are significantly smaller than
KMALLOC_MAX_CACHE_SIZE
Allowing larger sizes opens the door for abuse by passing excessive
size values and forcing the kernel into expensive allocation paths (via
kmalloc_large or vmalloc).
Fixes: 3492715683 ("bpf: Implement signature verification for BPF programs")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260205063807.690823-1-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The linker script scripts/module.lds.S specifies that all input
__klp_objects sections should be consolidated into an output section of
the same name, and start/stop symbols should be created to enable
scripts/livepatch/init.c to locate this data.
This start/stop pattern is not ideal for modules because the symbols are
created even if no __klp_objects input sections are present.
Consequently, a dummy __klp_objects section also appears in the
resulting module. This unnecessarily pollutes non-livepatch modules.
Instead, since modules are relocatable files, the usual method for
locating consolidated data in a module is to read its section table.
This approach avoids the aforementioned problem.
The klp_modinfo already stores a copy of the entire section table with
the final addresses. Introduce a helper function that
scripts/livepatch/init.c can call to obtain the location of the
__klp_objects section from this data.
Fixes: dd590d4d57 ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260123102825.3521961-2-petr.pavlu@suse.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
The fields of ftrace specific events (events used to save ftrace internal
events like function traces and trace_printk) are generated similarly to
how normal trace event fields are generated. That is, the fields are added
to a trace_events_fields array that saves the name, offset, size,
alignment and signness of the field. It is used to produce the output in
the format file in tracefs so that tooling knows how to parse the binary
data of the trace events.
The issue is that some of the ftrace event structures are packed. The
function graph exit event structures are one of them. The 64 bit calltime
and rettime fields end up 4 byte aligned, but the algorithm to show to
userspace shows them as 8 byte aligned.
The macros that create the ftrace events has one for embedded structure
fields. There's two macros for theses fields:
__field_desc() and __field_packed()
The difference of the latter macro is that it treats the field as packed.
Rename that field to __field_desc_packed() and create replace the
__field_packed() to be a normal field that is packed and have the calltime
and rettime use those.
This showed up on 32bit architectures for function graph time fields. It
had:
~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
field:unsigned long func; offset:8; size:4; signed:0;
field:unsigned int depth; offset:12; size:4; signed:0;
field:unsigned int overrun; offset:16; size:4; signed:0;
field:unsigned long long calltime; offset:24; size:8; signed:0;
field:unsigned long long rettime; offset:32; size:8; signed:0;
Notice that overrun is at offset 16 with size 4, where in the structure
calltime is at offset 20 (16 + 4), but it shows the offset at 24. That's
because it used the alignment of unsigned long long when used as a
declaration and not as a member of a structure where it would be aligned
by word size (in this case 4).
By using the proper structure alignment, the format has it at the correct
offset:
~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
field:unsigned long func; offset:8; size:4; signed:0;
field:unsigned int depth; offset:12; size:4; signed:0;
field:unsigned int overrun; offset:16; size:4; signed:0;
field:unsigned long long calltime; offset:20; size:8; signed:0;
field:unsigned long long rettime; offset:28; size:8; signed:0;
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: "jempty.liang" <imntjempty@163.com>
Link: https://patch.msgid.link/20260204113628.53faec78@gandalf.local.home
Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Closes: https://lore.kernel.org/all/20260130015740.212343-1-imntjempty@163.com/
Closes: https://lore.kernel.org/all/20260202123342.2544795-1-imntjempty@163.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Replace prog and callback in bpf_async_cb after removing visibility of
bpf_async_cb in bpf_async_cancel_and_free() to increase the chances the
scheduled async callbacks short-circuit execution and exit early, and
not starting a RCU tasks trace section. This improves the overall time
spent in running the wq selftest.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260205003853.527571-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When freeing a bpf_async_cb in bpf_async_cb_rcu_tasks_trace_free(), in
case the wq callback is not scheduled, doing cancel_work() currently
returns false and leads to retry of RCU tasks trace grace period. If the
callback is never scheduled, we keep retrying indefinitely and don't put
the prog reference.
Since the only race we care about here is against a potentially running
wq callback in the first grace period, it should finish by the second
grace period, hence check work_busy() result to detect presence of
running wq callback if it's not pending, otherwise free the object
immediately without retrying.
Reasoning behind the check and its correctness with racing wq callback
invocation: cancel_work is supposed to be synchronized, hence calling it
first and getting false would mean that work is definitely not pending,
at this point, either the work is not scheduled at all or already
running, or we race and it already finished by the time we checked for
it using work_busy(). In case it is running, we synchronize using
pool->lock to check the current work running there, if we match, it
means we extend the wait by another grace period using retry = true,
otherwise either the work already finished running or was never
scheduled, so we can free the bpf_async_cb right away.
Fixes: 1bfbc267ec ("bpf: Enable bpf_timer and bpf_wq in any context")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260205003853.527571-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
All are singletons - please see the changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaYPcbwAKCRDdBJ7gKXxA
jpM9AQDRiBlZRBdYY8/nS2zMc8hE7s5O3koXu/UMf2O01aJjsgD6AssmcJzkbLir
O1mlBSD0wlR3TZLEqSOUYIxgw7evLww=
=ZHeq
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Five hotfixes. Two are cc:stable, two are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Documentation: document liveupdate cmdline parameter
mm, shmem: prevent infinite loop on truncate race
mailmap: update Alexander Mikhalitsyn's emails
liveupdate: luo_file: do not clear serialized_data on unfreeze
x86/kfence: fix booting on 32bit non-PAE systems
- Fix race where sched_class operations (sched_setscheduler() and friends)
could be invoked on dead tasks after sched_ext_dead() already ran, causing
invalid SCX task state transitions and NULL pointer dereferences. This was
a regression from the cgroup exit ordering fix which moved
sched_ext_free() to finish_task_switch().
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYPIhw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGS3HAQChF4sOgoD67cul36LJaeiQzjLCh9iTU9vi2lB4
slJb5QD/dJhrC0T2ZVRm5rHVxckIx7KeFwbzhvlrUD7l+zEaAwo=
=ysQE
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:
- Fix race where sched_class operations (sched_setscheduler() and
friends) could be invoked on dead tasks after sched_ext_dead()
already ran, causing invalid SCX task state transitions and NULL
pointer dereferences.
This was a regression from the cgroup exit ordering fix which
moved sched_ext_free() to finish_task_switch().
* tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Short-circuit sched_class operations on dead tasks
7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free()
to finish_task_switch()") moved sched_ext_free() to finish_task_switch() and
renamed it to sched_ext_dead() to fix cgroup exit ordering issues. However,
this created a race window where certain sched_class ops may be invoked on
dead tasks leading to failures - e.g. sched_setscheduler() may try to switch a
task which finished sched_ext_dead() back into SCX triggering invalid SCX task
state transitions.
Add task_dead_and_done() which tests whether a task is TASK_DEAD and has
completed its final context switch, and use it to short-circuit sched_class
operations which may be called on dead tasks.
Fixes: 7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")
Reported-by: Andrea Righi <arighi@nvidia.com>
Link: http://lkml.kernel.org/r/20260202151341.796959-1-arighi@nvidia.com
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Previously, the verifier only tracked positive constant deltas between
linked registers using BPF_ADD. This limitation meant patterns like:
r1 = r0;
r1 += -4;
if r1 s>= 0 goto l0_%=; // r1 >= 0 implies r0 >= 4
// verifier couldn't propagate bounds back to r0
if r0 != 0 goto l0_%=;
r0 /= 0; // Verifier thinks this is reachable
l0_%=:
Similar limitation exists for 32-bit registers.
With this change, the verifier can now track negative deltas in reg->off
enabling bound propagation for the above pattern.
For alu32, we make sure the destination register has the upper 32 bits
as 0s before creating the link. BPF_ADD_CONST is split into
BPF_ADD_CONST64 and BPF_ADD_CONST32, the latter is used in case of alu32
and sync_linked_regs uses this to zext the result if known_reg has this
flag.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260204151741.2678118-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch implements bitwise tracking (tnum analysis) for BPF_END
(byte swap) operation.
Currently, the BPF verifier does not track value for BPF_END operation,
treating the result as completely unknown. This limits the verifier's
ability to prove safety of programs that perform endianness conversions,
which are common in networking code.
For example, the following code pattern for port number validation:
int test(struct pt_regs *ctx) {
__u64 x = bpf_get_prandom_u32();
x &= 0x3f00; // Range: [0, 0x3f00], var_off: (0x0; 0x3f00)
x = bswap16(x); // Should swap to range [0, 0x3f], var_off: (0x0; 0x3f)
if (x > 0x3f) goto trap;
return 0;
trap:
return *(u64 *)NULL; // Should be unreachable
}
Currently generates verifier output:
1: (54) w0 &= 16128 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=16128,var_off=(0x0; 0x3f00))
2: (d7) r0 = bswap16 r0 ; R0=scalar()
3: (25) if r0 > 0x3f goto pc+2 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f))
Without this patch, even though the verifier knows `x` has certain bits
set, after bswap16, it loses all tracking information and treats port
as having a completely unknown value [0, 65535].
According to the BPF instruction set[1], there are 3 kinds of BPF_END:
1. `bswap(16|32|64)`: opcode=0xd7 (BPF_END | BPF_ALU64 | BPF_TO_LE)
- do unconditional swap
2. `le(16|32|64)`: opcode=0xd4 (BPF_END | BPF_ALU | BPF_TO_LE)
- on big-endian: do swap
- on little-endian: truncation (16/32-bit) or no-op (64-bit)
3. `be(16|32|64)`: opcode=0xdc (BPF_END | BPF_ALU | BPF_TO_BE)
- on little-endian: do swap
- on big-endian: truncation (16/32-bit) or no-op (64-bit)
Since BPF_END operations are inherently bit-wise permutations, tnum
(bitwise tracking) offers the most efficient and precise mechanism
for value analysis. By implementing `tnum_bswap16`, `tnum_bswap32`,
and `tnum_bswap64`, we can derive exact `var_off` values concisely,
directly reflecting the bit-level changes.
Here is the overview of changes:
1. In `tnum_bswap(16|32|64)` (kernel/bpf/tnum.c):
Call `swab(16|32|64)` function on the value and mask of `var_off`, and
do truncation for 16/32-bit cases.
2. In `adjust_scalar_min_max_vals` (kernel/bpf/verifier.c):
Call helper function `scalar_byte_swap`.
- Only do byte swap when
* alu64 (unconditional swap) OR
* switching between big-endian and little-endian machines.
- If need do byte swap:
* Firstly call `tnum_bswap(16|32|64)` to update `var_off`.
* Then reset the bound since byte swap scrambles the range.
- For 16/32-bit cases, truncate dst register to match the swapped size.
This enables better verification of networking code that frequently uses
byte swaps for protocol processing, reducing false positive rejections.
[1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Co-developed-by: Yazhou Tang <tangyazhou518@outlook.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260204111503.77871-2-ziye@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>