Commit Graph

47857 Commits (cff6df108b39b0433aea16c35ee806ffca662bee)

Author SHA1 Message Date
Linus Torvalds 088d13246a Kbuild fixes for v6.15
- Add proper pahole version dependency to CONFIG_GENDWARFKSYMS to avoid
    module loading errors
 
  - Fix UAPI header tests for the OpenRISC architecture
 
  - Add dependency on the libdw package in Debian and RPM packages
 
  - Disable -Wdefault-const-init-unsafe warnings on Clang
 
  - Make "make clean ARCH=um" also clean the arch/x86/ directory
 
  - Revert the use of -fmacro-prefix-map=, which causes issues with
    debugger usability
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmgh4XoVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGj34P/14KpYOyUZxQvz3uvGCwYUsFpeYT
 CKa3s9TxAO9Dxz3dWTGsKNQXM24DXoL94bPvkVhQ1pUP2kugi0KEAmQ21k83hfMe
 m/P0BPaSImTn6Cv+N7GyyuX0q+KO31UBhXkf14MCpyq0NQQXJ+7T2OgOWhZenZ6m
 PzuSZO0/rNhKQNykl2xPcD3TLBP7BEWRbPxADWgQ/353dpNbxXCYC4lWKaWsUpit
 FvLTiUEYRBiP68oZYCCT/26K6+FZMRBicvjowbDMDAXIi3sNeBJo6hWX6GtfHacW
 q+f95edikvu0NcJxyNwkjsf7d7a5yuurQsVW0JT8dG1FZlrfuphBTEjomsWRhKtO
 +AGTMAG3VbkQ+Z/WUN9FItS/+lGfKpMToZbiKsETJHni0sOJSRB1+MLQ6U0NCSAs
 PpjSxm6hHZruZN7nAdhZfB3aWA2EaQYuq4SX+rYDXuSZqvYLuzBC2ZH1P2AmDbp5
 wmj76QJ9JEgXsjg9ewr0/aYrx26we+P3q1pUuxpURvg7M9vHUnVR+QstUS+MhCnR
 pWUjkNHMuWWz48UmqVx7YxIyURlqkqqG0avZJHLGEwRiTWkG8d+j4R5VwB84+73P
 XBAYpfeaiemWdViOq/bYxJHXrKUZX1rhzRDBNq4806JBR26ZWk7GgKW8xE5Rx3qH
 iqkH4uo1EHT8ULHW
 =Wn5A
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Add proper pahole version dependency to CONFIG_GENDWARFKSYMS to avoid
   module loading errors

 - Fix UAPI header tests for the OpenRISC architecture

 - Add dependency on the libdw package in Debian and RPM packages

 - Disable -Wdefault-const-init-unsafe warnings on Clang

 - Make "make clean ARCH=um" also clean the arch/x86/ directory

 - Revert the use of -fmacro-prefix-map=, which causes issues with
   debugger usability

* tag 'kbuild-fixes-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: fix typos "module.builtin" to "modules.builtin"
  Revert "kbuild, rust: use -fremap-path-prefix to make paths relative"
  Revert "kbuild: make all file references relative to source root"
  kbuild: fix dependency on sorttable
  init: remove unused CONFIG_CC_CAN_LINK_STATIC
  um: let 'make clean' properly clean underlying SUBARCH as well
  kbuild: Disable -Wdefault-const-init-unsafe
  kbuild: rpm-pkg: Add (elfutils-devel or libdw-devel) to BuildRequires
  kbuild: deb-pkg: Add libdw-dev:native to Build-Depends-Arch
  usr/include: openrisc: don't HDRTEST bpf_perf_event.h
  kbuild: Require pahole <v1.28 or >v1.29 with GENDWARFKSYMS on X86
2025-05-14 22:24:17 -07:00
Linus Torvalds c94d59a126 tracing fixes for 6.15:
- Fix sample code that uses trace_array_printk()
 
   The sample code for in kernel use of trace_array (that creates an instance
   for use within the kernel) and shows how to use trace_array_printk() that
   writes into the created instance, used trace_printk_init_buffers(). But
   that function is used to initialize normal trace_printk() and produces the
   NOTICE banner which is not needed for use of trace_array_printk(). The
   function to initialize that is trace_array_init_printk() that takes the
   created trace array instance as a parameter.
 
   Update the sample code to reflect the proper usage.
 
 - Fix preemption count output for stacktrace event
 
   The tracing buffer shows the preempt count level when an event executes.
   Because writing the event itself disables preemption, this needs to be
   accounted for when recording. The stacktrace event did not account for
   this so the output of the stacktrace event showed preemption was disabled
   while the event that triggered the stacktrace shows preemption is enabled
   and this leads to confusion. Account for preemption being disabled for the
   stacktrace event.
 
   The same happened for stack traces triggered by function tracer.
 
 - Fix persistent ring buffer when trace_pipe is used
 
   The ring buffer swaps the reader page with the next page to read from the
   write buffer when trace_pipe is used. If there's only a page of data in
   the ring buffer, this swap will cause the "commit" pointer (last data
   written) to be on the reader page. If more data is written to the buffer,
   it is added to the reader page until it falls off back into the write
   buffer.
 
   If the system reboots and the commit pointer is still on the reader page,
   even if new data was written, the persistent buffer validator will miss
   finding the commit pointer because it only checks the write buffer and
   does not check the reader page. This causes the validator to fail the
   validation and clear the buffer, where the new data is lost.
 
   There was a check for this, but it checked the "head pointer", which was
   incorrect, because the "head pointer" always stays on the write buffer and
   is the next page to swap out for the reader page. Fix the logic to catch
   this case and allow the user to still read the data after reboot.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaCTZHBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qu04AQDjOS46Y8d58MuwjLrQAotOUnANZADz
 7d+5snlcMjhqkAEAo+zc2z9LgqBAnv1VG3GEPgac0JmyPeOnqSJRWRpRXAM=
 =UveQ
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix sample code that uses trace_array_printk()

   The sample code for in kernel use of trace_array (that creates an
   instance for use within the kernel) and shows how to use
   trace_array_printk() that writes into the created instance, used
   trace_printk_init_buffers(). But that function is used to initialize
   normal trace_printk() and produces the NOTICE banner which is not
   needed for use of trace_array_printk(). The function to initialize
   that is trace_array_init_printk() that takes the created trace array
   instance as a parameter.

   Update the sample code to reflect the proper usage.

 - Fix preemption count output for stacktrace event

   The tracing buffer shows the preempt count level when an event
   executes. Because writing the event itself disables preemption, this
   needs to be accounted for when recording. The stacktrace event did
   not account for this so the output of the stacktrace event showed
   preemption was disabled while the event that triggered the stacktrace
   shows preemption is enabled and this leads to confusion. Account for
   preemption being disabled for the stacktrace event.

   The same happened for stack traces triggered by function tracer.

 - Fix persistent ring buffer when trace_pipe is used

   The ring buffer swaps the reader page with the next page to read from
   the write buffer when trace_pipe is used. If there's only a page of
   data in the ring buffer, this swap will cause the "commit" pointer
   (last data written) to be on the reader page. If more data is written
   to the buffer, it is added to the reader page until it falls off back
   into the write buffer.

   If the system reboots and the commit pointer is still on the reader
   page, even if new data was written, the persistent buffer validator
   will miss finding the commit pointer because it only checks the write
   buffer and does not check the reader page. This causes the validator
   to fail the validation and clear the buffer, where the new data is
   lost.

   There was a check for this, but it checked the "head pointer", which
   was incorrect, because the "head pointer" always stays on the write
   buffer and is the next page to swap out for the reader page. Fix the
   logic to catch this case and allow the user to still read the data
   after reboot.

* tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Fix persistent buffer when commit page is the reader page
  ftrace: Fix preemption accounting for stacktrace filter command
  ftrace: Fix preemption accounting for stacktrace trigger command
  tracing: samples: Initialize trace_array_printk() with the correct function
2025-05-14 11:24:19 -07:00
Steven Rostedt 1d6c39c89f ring-buffer: Fix persistent buffer when commit page is the reader page
The ring buffer is made up of sub buffers (sometimes called pages as they
are by default PAGE_SIZE). It has the following "pages":

  "tail page" - this is the page that the next write will write to
  "head page" - this is the page that the reader will swap the reader page with.
  "reader page" - This belongs to the reader, where it will swap the head
                  page from the ring buffer so that the reader does not
                  race with the writer.

The writer may end up on the "reader page" if the ring buffer hasn't
written more than one page, where the "tail page" and the "head page" are
the same.

The persistent ring buffer has meta data that points to where these pages
exist so on reboot it can re-create the pointers to the cpu_buffer
descriptor. But when the commit page is on the reader page, the logic is
incorrect.

The check to see if the commit page is on the reader page checked if the
head page was the reader page, which would never happen, as the head page
is always in the ring buffer. The correct check would be to test if the
commit page is on the reader page. If that's the case, then it can exit
out early as the commit page is only on the reader page when there's only
one page of data in the buffer. There's no reason to iterate the ring
buffer pages to find the "commit page" as it is already found.

To trigger this bug:

  # echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/syscalls/sys_enter_fchownat/enable
  # touch /tmp/x
  # chown sshd /tmp/x
  # reboot

On boot up, the dmesg will have:
 Ring buffer meta [0] is from previous boot!
 Ring buffer meta [1] is from previous boot!
 Ring buffer meta [2] is from previous boot!
 Ring buffer meta [3] is from previous boot!
 Ring buffer meta [4] commit page not found
 Ring buffer meta [5] is from previous boot!
 Ring buffer meta [6] is from previous boot!
 Ring buffer meta [7] is from previous boot!

Where the buffer on CPU 4 had a "commit page not found" error and that
buffer is cleared and reset causing the output to be empty and the data lost.

When it works correctly, it has:

  # cat /sys/kernel/tracing/instances/boot_mapped/trace_pipe
        <...>-1137    [004] .....   998.205323: sys_enter_fchownat: __syscall_nr=0x104 (260) dfd=0xffffff9c (4294967196) filename=(0xffffc90000a0002c) user=0x3e8 (1000) group=0xffffffff (4294967295) flag=0x0 (0

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250513115032.3e0b97f7@gandalf.local.home
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Reported-by: Tasos Sahanidis <tasos@tasossah.com>
Tested-by: Tasos Sahanidis <tasos@tasossah.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14 13:53:23 -04:00
pengdonglin 11aff32439 ftrace: Fix preemption accounting for stacktrace filter command
The preemption count of the stacktrace filter command to trace ksys_read
is consistently incorrect:

$ echo ksys_read:stacktrace > set_ftrace_filter

   <...>-453     [004] ...1.    38.308956: <stack trace>
=> ksys_read
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

The root cause is that the trace framework disables preemption when
invoking the filter command callback in function_trace_probe_call:

   preempt_disable_notrace();
   probe_ops->func(ip, parent_ip, probe_opsbe->tr, probe_ops, probe->data);
   preempt_enable_notrace();

Use tracing_gen_ctx_dec() to account for the preempt_disable_notrace(),
which will output the correct preemption count:

$ echo ksys_read:stacktrace > set_ftrace_filter

   <...>-410     [006] .....    31.420396: <stack trace>
=> ksys_read
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

Cc: stable@vger.kernel.org
Fixes: 36590c50b2 ("tracing: Merge irqflags + preempt counter.")
Link: https://lore.kernel.org/20250512094246.1167956-2-dolinux.peng@gmail.com
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14 13:53:23 -04:00
pengdonglin e333332657 ftrace: Fix preemption accounting for stacktrace trigger command
When using the stacktrace trigger command to trace syscalls, the
preemption count was consistently reported as 1 when the system call
event itself had 0 (".").

For example:

root@ubuntu22-vm:/sys/kernel/tracing/events/syscalls/sys_enter_read
$ echo stacktrace > trigger
$ echo 1 > enable

    sshd-416     [002] .....   232.864910: sys_read(fd: a, buf: 556b1f3221d0, count: 8000)
    sshd-416     [002] ...1.   232.864913: <stack trace>
 => ftrace_syscall_enter
 => syscall_trace_enter
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

The root cause is that the trace framework disables preemption in __DO_TRACE before
invoking the trigger callback.

Use the tracing_gen_ctx_dec() that will accommodate for the increase of
the preemption count in __DO_TRACE when calling the callback. The result
is the accurate reporting of:

    sshd-410     [004] .....   210.117660: sys_read(fd: 4, buf: 559b725ba130, count: 40000)
    sshd-410     [004] .....   210.117662: <stack trace>
 => ftrace_syscall_enter
 => syscall_trace_enter
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

Cc: stable@vger.kernel.org
Fixes: ce33c845b0 ("tracing: Dump stacktrace trigger to the corresponding instance")
Link: https://lore.kernel.org/20250512094246.1167956-1-dolinux.peng@gmail.com
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14 13:53:09 -04:00
Linus Torvalds 405e6c37c8 Probes fixes for v6.15-rc6:
- fprobe: Fix RCU warning message in list traversal
   fprobe_module_callback() using hlist_for_each_entry_rcu() traverse
   the fprobe list but it locks fprobe_mutex() instead of rcu lock
   because it is enough. So add lockdep_is_held() to avoid warning.
 
 - tracing: eprobe: Add missing trace_probe_log_clear for eprobe
   __trace_eprobe_create() uses trace_probe_log but forgot to clear
   it at exit. Add trace_probe_log_clear() calls.
 
 - tracing: probes: Fix possible race in trace_probe_log APIs
   trace_probe_log APIs are used in probe event (dynamic_events,
   kprobe_events and uprobe_events) creation. Only dynamic_events uses
   the dyn_event_ops_mutex mutex to serialize it. This makes kprobe and
   uprobe events to lock the same mutex to serialize its creation to
   avoid race in trace_probe_log APIs.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmgjXhcbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b5FgIAJ6yEMCKDnao6CGVto9E
 lRmgTlgJHk/PYoxMt929C+fJyUbAgafQDZIEGwKDmenaEemEJChvYOTXLmAf9qTS
 2YEOUEVZ1p4OSWDriJ59eS5RR4UvMGNNIOLhZH+NV43SzyrVpkBTm+XOWCrulfrB
 UaLy6WAVkGZhEabnXUQIo+lwbvJTNg4/sH0uiUyvkKFl9C6s/BWFIFaa7eE2ibfD
 RIJummZ9xL0EECe9aQcmbVEZTF5y141yTDth+hEjeW3tIlkQrw4GVOufDScZIGhg
 y92OoROiZqOllr0vS5+pSB4Pa+hgtBWFOcrupYb9SeG/vf30a3P5bufI3MCvzhiL
 OkI=
 =MhYZ
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - fprobe: Fix RCU warning message in list traversal

   fprobe_module_callback() using hlist_for_each_entry_rcu() traverse
   the fprobe list but it locks fprobe_mutex() instead of rcu lock
   because it is enough. So add lockdep_is_held() to avoid warning.

 - tracing: eprobe: Add missing trace_probe_log_clear for eprobe

   __trace_eprobe_create() uses trace_probe_log but forgot to clear it
   at exit. Add trace_probe_log_clear() calls.

 - tracing: probes: Fix possible race in trace_probe_log APIs

   trace_probe_log APIs are used in probe event (dynamic_events,
   kprobe_events and uprobe_events) creation. Only dynamic_events uses
   the dyn_event_ops_mutex mutex to serialize it. This makes kprobe and
   uprobe events to lock the same mutex to serialize its creation to
   avoid race in trace_probe_log APIs.

* tag 'probes-fixes-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: probes: Fix a possible race in trace_probe_log APIs
  tracing: add missing trace_probe_log_clear for eprobes
  tracing: fprobe: Fix RCU warning message in list traversal
2025-05-13 12:20:07 -07:00
Masami Hiramatsu (Google) fd837de3c9 tracing: probes: Fix a possible race in trace_probe_log APIs
Since the shared trace_probe_log variable can be accessed and
modified via probe event create operation of kprobe_events,
uprobe_events, and dynamic_events, it should be protected.
In the dynamic_events, all operations are serialized by
`dyn_event_ops_mutex`. But kprobe_events and uprobe_events
interfaces are not serialized.

To solve this issue, introduces dyn_event_create(), which runs
create() operation under the mutex, for kprobe_events and
uprobe_events. This also uses lockdep to check the mutex is
held when using trace_probe_log* APIs.

Link: https://lore.kernel.org/all/174684868120.551552.3068655787654268804.stgit@devnote2/

Reported-by: Paul Cacheux <paulcacheux@gmail.com>
Closes: https://lore.kernel.org/all/20250510074456.805a16872b591e2971a4d221@kernel.org/
Fixes: ab105a4fb8 ("tracing: Use tracing error_log with probe events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-05-13 22:23:34 +09:00
Linus Torvalds e9565e23cd sched_ext: Fixes for v6.15-rc6
A little bit invasive for rc6 but they're important fixes, pass tests fine
 and won't break anything outside sched_ext.
 
 - scx_bpf_cpuperf_set() calls internal functions that require the rq to be
   locked. It assumed that the BPF caller has rq locked but that's not always
   true. Fix it by tracking whether rq is currently held by the CPU and
   grabbing it if necessary.
 
 - bpf_iter_scx_dsq_new() was leaving the DSQ iterator in an uninitialized
   state after an error. However, next() and destroy() can be called on an
   iterator which failed initialization and thus they always need to be
   initialized even after an init error. Fix by always initializing the
   iterator.
 
 - Remove duplicate BTF_ID_FLAGS() entries.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaCKYDw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGV0nAP9UX1sGJV7a8by8L0hq3luZRSQo7Kjw0pTaMeD+
 /oIlOgD/VX3epHmSKvwIOO2W4bv5oSR8B+Yx+4uAhhLfTgdlxgk=
 =clg7
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
 "A little bit invasive for rc6 but they're important fixes, pass tests
  fine and won't break anything outside sched_ext:

   - scx_bpf_cpuperf_set() calls internal functions that require the rq
     to be locked. It assumed that the BPF caller has rq locked but
     that's not always true. Fix it by tracking whether rq is currently
     held by the CPU and grabbing it if necessary

   - bpf_iter_scx_dsq_new() was leaving the DSQ iterator in an
     uninitialized state after an error. However, next() and destroy()
     can be called on an iterator which failed initialization and thus
     they always need to be initialized even after an init error. Fix by
     always initializing the iterator

   - Remove duplicate BTF_ID_FLAGS() entries"

* tag 'sched_ext-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator
  sched_ext: Fix rq lock state in hotplug ops
  sched_ext: Remove duplicate BTF_ID_FLAGS definitions
  sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  sched_ext: Track currently locked rq
2025-05-12 18:02:05 -07:00
Linus Torvalds d471045e75 cgroup: A fix for v6.15-rc6
One low-risk patch to fix a cpuset bug where it over-eagerly tries to modify
 CPU affinity of kernel threads.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaCKVJA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGZJ7AQDJIOHkmRvDnSnwdsaQ7hvsU1afNEWjGsvKcLtp
 VXQUFwD/UIgSc5miCkgi5ucphlr6Vxxnq0PW7hf7KRhdzhqwagg=
 =j/Y4
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:
 "One low-risk patch to fix a cpuset bug where it over-eagerly tries to
  modify CPU affinity of kernel threads"

* tag 'cgroup-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Extend kthread_is_per_cpu() check to all PF_NO_SETAFFINITY tasks
2025-05-12 17:58:57 -07:00
Sami Tolvanen 9520a2b3f0 kbuild: Require pahole <v1.28 or >v1.29 with GENDWARFKSYMS on X86
With CONFIG_GENDWARFKSYMS, __gendwarfksyms_ptr variables are
added to the kernel in EXPORT_SYMBOL() to ensure DWARF type
information is available for exported symbols in the TUs where
they're actually exported. These symbols are dropped when linking
vmlinux, but dangling references to them remain in DWARF.

With CONFIG_DEBUG_INFO_BTF enabled on X86, pahole versions after
commit 47dcb534e253 ("btf_encoder: Stop indexing symbols for
VARs") and before commit 9810758003ce ("btf_encoder: Verify 0
address DWARF variables are in ELF section") place these symbols
in the .data..percpu section, which results in an "Invalid
offset" error in btf_datasec_check_meta() during boot, as all
the variables are at zero offset and have non-zero size. If
CONFIG_DEBUG_INFO_BTF_MODULES is enabled, this also results in a
failure to load modules with:

  failed to validate module [$module] BTF: -22

As the issue occurs in pahole v1.28 and the fix was merged
after v1.29 was released, require pahole <v1.28 or >v1.29 when
GENDWARFKSYMS is enabled with DEBUG_INFO_BTF on X86.

Reported-by: Paolo Pisati <paolo.pisati@canonical.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-05-12 15:03:16 +09:00
Linus Torvalds ac814cbbab Misc timers fixes:
- Fix time keeping bugs in CLOCK_MONOTONIC_COARSE clocks
 
  - Work around absolute relocations into vDSO code that
    GCC erroneously emits in certain arm64 build environments
 
  - Fix a false positive lockdep warning in the i8253 clocksource
    driver
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmggVjYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gcRg//Wiyl1Sth9E3M4++TCTAoZwgty/lEBo/+
 u2T3BTI3cu3q+KLr8NEV+l+EOCcycv7AR7dVBKab/LEaliHwvQ0gGxu/Tc2FsQX+
 jbk/1COYkwafr3XIRR0QZxU7BppSTzXNfoNqn/MjM+rgpG8CdgtLPuHgDCpeuNBn
 NHHL46L4a3L7INh03WlVVh34cHnLC+Hq5DNf++Mr8VvlJG8Q5WaPgrnIMfSL5STJ
 Z6A5l3w4TX1E5C5d/eEJjwUUjbGQDbvWGQRxqLYhXXyS3h379K6BN/5t5pUQgGIU
 ZOV2MYS8DxGZpS3CXtKLTJxyC2VUzP9VeJTyFjlj7IZQ8UBL/JkQ+wZhuXhWsMVd
 puje6gmgSP6CQ14/s5WeAU/BFfj5kakYJuAFSa8u+ucHsAKqEEdvk600WC1cXFfn
 AyKuXq6xZ8M27EoqfemD447b66kh8VDbNmcp9AwiNKBzq5pVVGIfVrXRJakBKKR0
 yV7fJUbgogYol6ra9Yx7FjtscKezan52C+ja9UqgMDb262Ez+zo7mMVINXvS6F+e
 8byUdLqVlnkoj+xgRylVsbsfY8g145pkc03y0rxBS3EgiGtR+PUQq8Fwduhb9Bkr
 obIwiqMvo5bW+ULZcGm0Fu1pInXfumofCLAiXq7Vxz7RdpKO3SfsDdIkc9O8aFMJ
 ubB7DMaGcyg=
 =Ndox
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc timers fixes from Ingo Molnar:

 - Fix time keeping bugs in CLOCK_MONOTONIC_COARSE clocks

 - Work around absolute relocations into vDSO code that GCC erroneously
   emits in certain arm64 build environments

 - Fix a false positive lockdep warning in the i8253 clocksource driver

* tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/i8253: Use raw_spinlock_irqsave() in clockevent_i8253_disable()
  arm64: vdso: Work around invalid absolute relocations from GCC
  timekeeping: Prevent coarse clocks going backwards
2025-05-11 10:33:25 -07:00
Paul Cacheux e41b5af451 tracing: add missing trace_probe_log_clear for eprobes
Make sure trace_probe_log_clear is called in the tracing
eprobe code path, matching the trace_probe_log_init call.

Link: https://lore.kernel.org/all/20250504-fix-trace-probe-log-race-v3-1-9e99fec7eddc@gmail.com/

Signed-off-by: Paul Cacheux <paulcacheux@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-05-10 08:44:50 +09:00
Breno Leitao 9dda18a32b tracing: fprobe: Fix RCU warning message in list traversal
When CONFIG_PROVE_RCU_LIST is enabled, fprobe triggers the following
warning:

    WARNING: suspicious RCU usage
    kernel/trace/fprobe.c:457 RCU-list traversed in non-reader section!!

    other info that might help us debug this:
	#1: ffffffff863c4e08 (fprobe_mutex){+.+.}-{4:4}, at: fprobe_module_callback+0x7b/0x8c0

    Call Trace:
	fprobe_module_callback
	notifier_call_chain
	blocking_notifier_call_chain

This warning occurs because fprobe_remove_node_in_module() traverses an
RCU list using RCU primitives without holding an RCU read lock. However,
the function is only called from fprobe_module_callback(), which holds
the fprobe_mutex lock that provides sufficient protection for safely
traversing the list.

Fix the warning by specifying the locking design to the
CONFIG_PROVE_RCU_LIST mechanism. Add the lockdep_is_held() argument to
hlist_for_each_entry_rcu() to inform the RCU checker that fprobe_mutex
provides the required protection.

Link: https://lore.kernel.org/all/20250410-fprobe-v1-1-068ef5f41436@debian.org/

Fixes: a3dc2983ca ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Antonio Quartulli <antonio@mandelbit.com>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-05-10 08:28:02 +09:00
Waiman Long 39b5ef791d cgroup/cpuset: Extend kthread_is_per_cpu() check to all PF_NO_SETAFFINITY tasks
Commit ec5fbdfb99 ("cgroup/cpuset: Enable update_tasks_cpumask()
on top_cpuset") enabled us to pull CPUs dedicated to child partitions
from tasks in top_cpuset by ignoring per cpu kthreads. However, there
can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY
flag set to indicate that we shouldn't mess with their CPU affinity.
For other kthreads, their affinity will be changed to skip CPUs dedicated
to child partitions whether it is an isolating or a scheduling one.

As all the per cpu kthreads have PF_NO_SETAFFINITY set, the
PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads.
Fix this issue by dropping the kthread_is_per_cpu() check and checking
the PF_NO_SETAFFINITY flag instead.

Fixes: ec5fbdfb99 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-09 07:35:14 -10:00
Dmitry Antipov a6aeb73997 module: ensure that kobject_put() is safe for module type kobjects
In 'lookup_or_create_module_kobject()', an internal kobject is created
using 'module_ktype'. So call to 'kobject_put()' on error handling
path causes an attempt to use an uninitialized completion pointer in
'module_kobject_release()'. In this scenario, we just want to release
kobject without an extra synchronization required for a regular module
unloading process, so adding an extra check whether 'complete()' is
actually required makes 'kobject_put()' safe.

Reported-by: syzbot+7fb8a372e1f6add936dd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7fb8a372e1f6add936dd
Fixes: 942e443127 ("module: Fix mod->mkobj.kobj potentially freed too early")
Cc: stable@vger.kernel.org
Suggested-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://lore.kernel.org/r/20250507065044.86529-1-dmantipov@yandex.ru
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-05-07 20:24:59 +02:00
Tejun Heo 428dc9fc08 sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator
BPF programs may call next() and destroy() on BPF iterators even after new()
returns an error value (e.g. bpf_for_each() macro ignores error returns from
new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized
state after an error return causing bpf_iter_scx_dsq_next() to dereference
garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that
next() and destroy() become noops.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 650ba21b13 ("sched_ext: Implement DSQ iterator")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-07 06:24:07 -10:00
Linus Torvalds 59c9ab3e8c tracing updates for v6.15
- Fix read out of bounds bug in tracing_splice_read_pipe()
 
   The size of the sub page being read can now be greater than a page. But
   the buffer used in tracing_splice_read_pipe() only allocates a page size.
   The data copied to the buffer is the amount in sub buffer which can
   overflow the buffer. Use min((size_t)trace_seq_used(&iter->seq), PAGE_SIZE)
   to limit the amount copied to the buffer to a max of PAGE_SIZE.
 
 - Fix the test for NULL from "!filter_hash" to "!*filter_hash"
 
   The add_next_hash() function checked for NULL at the wrong pointer level.
 
 - Do not use the array in trace_adjust_address() if there are no elements
 
   The trace_adjust_address() finds the offset of a module that was stored in
   the persistent buffer when reading the previous boot buffer to see if the
   address belongs to a module that was loaded in the previous boot. An array
   is created that matches currently loaded modules with previously loaded
   modules. The trace_adjust_address() uses that array to find the new offset
   of the address that's in the previous buffer.  But if no module was
   loaded, it ends up reading the last element in an array that was never
   allocated. Check if nr_entries is zero and exit out early if it is.
 
 - Remove nested lock of trace_event_sem in print_event_fields()
 
   The print_event_fields() function iterates over the ftrace_events list and
   requires the trace_event_sem semaphore held for read. But this function is
   always called with that semaphore held for read. Remove the taking of the
   semaphore and replace it with lockdep_assert_held_read(&trace_event_sem);
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaBeXEBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qvXFAP9JNgi0+ainOppsEP6u9KH+sttxKl76
 14EslzuPqbzgOwD/Sm00a8n7m858iv6UN3AAW9AsX2QK5yG0Wbvterm8FgI=
 =s9qk
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix read out of bounds bug in tracing_splice_read_pipe()

   The size of the sub page being read can now be greater than a page.
   But the buffer used in tracing_splice_read_pipe() only allocates a
   page size. The data copied to the buffer is the amount in sub buffer
   which can overflow the buffer.

   Use min((size_t)trace_seq_used(&iter->seq), PAGE_SIZE) to limit the
   amount copied to the buffer to a max of PAGE_SIZE.

 - Fix the test for NULL from "!filter_hash" to "!*filter_hash"

   The add_next_hash() function checked for NULL at the wrong pointer
   level.

 - Do not use the array in trace_adjust_address() if there are no
   elements

   The trace_adjust_address() finds the offset of a module that was
   stored in the persistent buffer when reading the previous boot buffer
   to see if the address belongs to a module that was loaded in the
   previous boot. An array is created that matches currently loaded
   modules with previously loaded modules. The trace_adjust_address()
   uses that array to find the new offset of the address that's in the
   previous buffer. But if no module was loaded, it ends up reading the
   last element in an array that was never allocated.

   Check if nr_entries is zero and exit out early if it is.

 - Remove nested lock of trace_event_sem in print_event_fields()

   The print_event_fields() function iterates over the ftrace_events
   list and requires the trace_event_sem semaphore held for read. But
   this function is always called with that semaphore held for read.

   Remove the taking of the semaphore and replace it with
   lockdep_assert_held_read(&trace_event_sem)

* tag 'trace-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Do not take trace_event_sem in print_event_fields()
  tracing: Fix trace_adjust_address() when there is no modules in scratch area
  ftrace: Fix NULL memory allocation check
  tracing: Fix oob write in trace_seq_to_buffer()
2025-05-04 10:15:42 -07:00
Linus Torvalds 5aac99c6b5 Two fixes:
- Prevent NULL pointer dereference in msi_domain_debug_show()
 
  - Fix crash in the qcom-mpm irqchip driver when configuring
    interrupts for non-wake GPIOs
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgXEhsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g5kg//evpNs+ulOhUzr3bT+XbOO+hgYS+N/ySq
 JWG+ChJYZItYczPpnYaH/s++fq72ssIhbfgZNcdZiYc2LnTQ0ZXFN450UuB4jeKk
 2Gk7PoZJ3wc/glYG3EXmh7ciGSr1nrbYndiLCfV2x+tD33VsZV21WbPcVIFIzKsD
 hfRzMEmPE0cSQnFOciFm4aj+ZA/wMkkvdFU5FCRMfq2qdfkwQEamXLGSX6pZYiDT
 cAtr5ZiD3AXgY61tNXE0K6wZs3qYuNPJT/9OtdlDJn1VsK9iumczwzp8gAKX4dbH
 eogiaFqymnBXlD4Ah22TMml7dzywY6yjv8Mk/v8UgSbxTmPC2mYMoBGXlXwGHTxY
 ay2VEIZd0SN6+15p4YUgc2KQhsNB573KT6oR6lEgEB9LfIvlEpxAgMxkp5PDRNTJ
 JqkBMJFpdk0JcvUuT19XmRQjZ7cwrghW2wS8C5S7u+1tK+9r4EXy/mg7fGlBcmd2
 3yOaGAyBNdItcg1H1/WaJkScPHmoI68V9GW3wuIpjW4Bo8YhQCxStLMrUGEKxY1M
 INAkXvkiRcGQ9264aZJrDLIPdD60y9uyA+E+EZ3uP7hZn0YCpHH0CMugY9XtUpy9
 Wvs9i7Ztn11BxDcyCzk/yMdVwGGImeM7AUqyuyKgfHQJxei01o5/op2DvkrjWlZT
 V6URyY5pkwc=
 =Kq8w
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2025-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Ingo Molnar:

 - Prevent NULL pointer dereference in msi_domain_debug_show()

 - Fix crash in the qcom-mpm irqchip driver when configuring
   interrupts for non-wake GPIOs

* tag 'irq-urgent-2025-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/qcom-mpm: Prevent crash when trying to handle non-wake GPIOs
  genirq/msi: Prevent NULL pointer dereference in msi_domain_debug_show()
2025-05-04 07:58:53 -07:00
Steven Rostedt 0a8f11f856 tracing: Do not take trace_event_sem in print_event_fields()
On some paths in print_event_fields() it takes the trace_event_sem for
read, even though it should always be held when the function is called.

Remove the taking of that mutex and add a lockdep_assert_held_read() to
make sure the trace_event_sem is held when print_event_fields() is called.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250501224128.0b1f0571@batman.local.home
Fixes: 80a76994b2 ("tracing: Add "fields" option to show raw trace event fields")
Reported-by: syzbot+441582c1592938fccf09@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6813ff5e.050a0220.14dd7d.001b.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 22:44:52 -04:00
Steven Rostedt 1be8e54a1e tracing: Fix trace_adjust_address() when there is no modules in scratch area
The function trace_adjust_address() is used to map addresses of modules
stored in the persistent memory and are also loaded in the current boot to
return the current address for the module.

If there's only one module entry, it will simply use that, otherwise it
performs a bsearch of the entry array to find the modules to offset with.

The issue is if there are no modules in the array. The code does not
account for that and ends up referencing the first element in the array
which does not exist and causes a crash.

If nr_entries is zero, exit out early as if this was a core kernel
address.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250501151909.65910359@gandalf.local.home
Fixes: 35a380ddbc ("tracing: Show last module text symbols in the stacktrace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 16:06:55 -04:00
Colin Ian King 3c1d9cfa84 ftrace: Fix NULL memory allocation check
The check for a failed memory location is incorrectly checking
the wrong level of pointer indirection by checking !filter_hash
rather than !*filter_hash.  Fix this.

Cc: asami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250422221335.89896-1-colin.i.king@gmail.com
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 15:46:19 -04:00
Jeongjun Park f5178c41bb tracing: Fix oob write in trace_seq_to_buffer()
syzbot reported this bug:
==================================================================
BUG: KASAN: slab-out-of-bounds in trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
BUG: KASAN: slab-out-of-bounds in tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
Write of size 4507 at addr ffff888032b6b000 by task syz.2.320/7260

CPU: 1 UID: 0 PID: 7260 Comm: syz.2.320 Not tainted 6.15.0-rc1-syzkaller-00301-g3bde70a2c827 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xc3/0x670 mm/kasan/report.c:521
 kasan_report+0xe0/0x110 mm/kasan/report.c:634
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189
 __asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106
 trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
 tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
 ....
==================================================================

It has been reported that trace_seq_to_buffer() tries to copy more data
than PAGE_SIZE to buf. Therefore, to prevent this, we should use the
smaller of trace_seq_used(&iter->seq) and PAGE_SIZE as an argument.

Link: https://lore.kernel.org/20250422113026.13308-1-aha310510@gmail.com
Reported-by: syzbot+c8cd2d2c412b868263fb@syzkaller.appspotmail.com
Fixes: 3c56819b14 ("tracing: splice support for tracing_pipe")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 15:24:15 -04:00
Andrew Jones e6a3fc4f10 genirq/msi: Prevent NULL pointer dereference in msi_domain_debug_show()
irq_domain_debug_show_one() calls msi_domain_debug_show() with a non-NULL
domain pointer and a NULL irq_data pointer. irq_debug_show_data() calls it
with a NULL domain pointer.

The domain pointer is not used, but the irq_data pointer is required to be
non-NULL and lacks a NULL pointer check.

Add the missing NULL pointer check to ensure there is a non-NULL irq_data
pointer in msi_domain_debug_show() before dereferencing it.

[ tglx: Massaged change log ]

Fixes: 01499ae673 ("genirq/msi: Expose MSI message data in debugfs")
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250430124836.49964-2-ajones@ventanamicro.com
2025-04-30 23:25:10 +02:00
Linus Torvalds 3929527918 Modules fixes for v6.15-rc5
A single series is present to properly handle the module_kobject creation.
 It fixes a problem with missing /sys/module/<module>/drivers for built-in
 modules.
 
 The fix has been on linux-next for two weeks with no reported issues.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEIduBR9MnFA82q/jtumpXJwqY6poFAmgSH2YUHHBldHIucGF2
 bHVAc3VzZS5jb20ACgkQumpXJwqY6poL/wf/TZEux9aieu8VOhPbV1Mo1npVAeT7
 MJ5R4M6QKxPNvvBiXK5lWSVy5IPtcuwkbyEfKxV/CS668FwJeFpGFb91rRY108He
 EUHjj5NtZ1WhEHFRBgJPLydGZGGtJzxy3yg26x6wO58VJrIx/H3HU3jgsnj1m32a
 fA1cbo4Yo9gnk0YzI2KDu6A+bXi8zJVpyYDU9Ir4mdy+CVd5+vN9WypzrjHXbMya
 2xhd9768sVmShY9K5+DlOXF4stVsP6CbgWGxhwIbdfLvY977QaBhrr+emrPE3uYt
 5g+rg3v7ciuW14D5rLPWqZ5aXinjNt4vc7maNA9sJLW5wLOiWGXjhseUFg==
 =6rT4
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull modules fixes from Petr Pavlu:
 "A single series to properly handle the module_kobject creation.

  This fixes a problem with missing /sys/module/<module>/drivers for
  built-in modules"

* tag 'modules-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  drivers: base: handle module_kobject creation
  kernel: globalize lookup_or_create_module_kobject()
  kernel: refactor lookup_or_create_module_kobject()
  kernel: param: rename locate_module_kobject
2025-04-30 08:37:52 -07:00
Andrea Righi e38be1c764 sched_ext: Fix rq lock state in hotplug ops
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume
that the rq involved in the operation is locked, which is not the case
during hotplug, triggering the following warning:

  WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340

Fix by not tracking the target rq as locked in the context of
ops.cpu_online() and ops.cpu_offline().

Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-29 08:01:32 -10:00
Thomas Gleixner b71f9804f6 timekeeping: Prevent coarse clocks going backwards
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
inconsistencies. Lei tracked down that this was being caused by the
adjustment:

    tk->tkr_mono.xtime_nsec -= offset;

which is made to compensate for the unaccumulated cycles in offset when the
multiplicator is adjusted forward, so that the non-_COARSE clockids don't
see inconsistencies.

However, the _COARSE clockid getter functions use the adjusted xtime_nsec
value directly and do not compensate the negative offset via the
clocksource delta multiplied with the new multiplicator. In that case the
caller can observe time going backwards in consecutive calls.

By design, this negative adjustment should be fine, because the logic run
from timekeeping_adjust() is done after it accumulated approximately

     multiplicator * interval_cycles

into xtime_nsec.  The accumulated value is always larger then the

     mult_adj * offset

value, which is subtracted from xtime_nsec. Both operations are done
together under the tk_core.lock, so the net change to xtime_nsec is always
always be positive.

However, do_adjtimex() calls into timekeeping_advance() as well, to
apply the NTP frequency adjustment immediately. In this case,
timekeeping_advance() does not return early when the offset is smaller
then interval_cycles. In that case there is no time accumulated into
xtime_nsec. But the subsequent call into timekeeping_adjust(), which
modifies the multiplicator, subtracts from xtime_nsec to correct for the
new multiplicator.

Here because there was no accumulation, xtime_nsec becomes smaller than
before, which opens a window up to the next accumulation, where the
_COARSE clockid getters, which don't compensate for the offset, can
observe the inconsistency.

This has been tried to be fixed by forwarding the timekeeper in the case
that adjtimex() adjusts the multiplier, which resets the offset to zero:

  757b000f7b ("timekeeping: Fix possible inconsistencies in _COARSE clockids")

That works correctly, but unfortunately causes a regression on the
adjtimex() side. There are two issues:

   1) The forwarding of the base time moves the update out of the original
      period and establishes a new one.

   2) The clearing of the accumulated NTP error is changing the behaviour as
      well.

User-space expects that multiplier/frequency updates are in effect, when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.

Commit 757b000f7b was reverted so that the established expectations of
user space implementations (ntpd, chronyd) are restored, but that obviously
brought the inconsistencies back.

One of the initial approaches to fix this was to establish a separate
storage for the coarse time getter nanoseconds part by calculating it from
the offset. That was dropped on the floor because not having yet another
state to maintain was simpler. But given the result of the above exercise,
this solution turns out to be the right one. Bring it back in a slightly
modified form.

Thus introduce timekeeper::coarse_nsec and store that nanoseconds part in
it, switch the time getter functions and the VDSO update to use that value.
coarse_nsec is set on operations which forward or initialize the timekeeper
and after time was accumulated during a tick. If there is no accumulation
the timestamp is unchanged.

This leaves the adjtimex() behaviour unmodified and prevents coarse time
from going backwards.

[ jstultz: Simplified the coarse_nsec calculation and kept behavior so
  	   coarse clockids aren't adjusted on each inter-tick adjtimex
  	   call, slightly reworked the comments and commit message ]

Fixes: da15cfdae0 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/20250419054706.2319105-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
2025-04-28 11:17:29 +02:00
Linus Torvalds 3d23ef05c3 Fix sporadic crashes in dequeue_entities() due to ... bad math.
[ Arguably if pick_eevdf()/pick_next_entity() was less trusting
   of complex math being correct it could have de-escalated a
   crash into a warning, but that's for a different patch. ]
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgMnbERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iWcg//Tj96bTRqh9c31n0r4+k6xPZE8kyf7FVd
 vT/LraGe9I731+omWQQrybwgxe01Ja/bk2FNShQhVN3safk43OXReTWyNcd213Un
 q4vYvwX8gcVOl8qZibnAUHO2N0B+uTPlFBniG+VyhWqb+zPq0/dRCSy7nc0TNc54
 hHhERcLYOn+WWrUzKSbL5vm2JRowmlthTiw8/li7N5aappM8Hr4XbIZuhvd2aaB4
 ocsXJpOJyDUXP51Zi7jWEbWPr8O3VS/Zdz/F9/MGomPZ6rPBmRyNnadn0w1gjrGB
 ccTvJgBMMRH7Ltp05TslvVsnRnUIRIRe6bx/kW5pkSANxpSP+Ztw90ssAwq1v11G
 38+XIVnRJCjgP9O8/YByyW3dgWrp2o6rrZJIt++50BfQzASmT66//1Z1iV4nQIC0
 szoSa/tOm/WOFNK357pFDhAZyhZZUYlq5etvReG7q4OEHZb0y9/axw40wkzY/rpy
 Jc9XnaMoey/SvyvNHMKNJxFJNHuodosfY3rXRaeuhP22FW3qPqbtCxTM+8nRWMTs
 HojbqlrMFH9rAV9K1STgdH5YjIWsWHwJ9siHfw6SZtocMLOvzWDtfKyCJ+cvtvUz
 z82EBuPsltgq6+LLoGXyY+wlEUo/qvY3Ywv0Y9cAqolbBw5Cw/aEaDSH+K1cmubb
 AgtloYCkmXo=
 =pJ11
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix sporadic crashes in dequeue_entities() due to ... bad math.

  [ Arguably if pick_eevdf()/pick_next_entity() was less trusting of
    complex math being correct it could have de-escalated a crash into
    a warning, but that's for a different patch ]"

* tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
2025-04-26 09:23:20 -07:00
Linus Torvalds 86baa5499c Misc perf events fixes:
- Use POLLERR for events in error state, instead of
    the ambiguous POLLHUP error value
 
  - Fix non-sampling (counting) events on certain x86 platforms
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgMmoERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i6lg//Z2fxDHOXxSxNaNtin6wNb52vSRfmtFFD
 +6lxCbJP+qT66rWR8ZpRNKKQ+vZAKYXm8wGNakhb4wpFe+PJwsQhl5sWOHnoMO5a
 TBQFkvGrHxDxDa8xoQy6IFgee4ckpwxiVaMe0jhwG9/2rbOhXgDZ5dFZvxV4sbAT
 uT0Qfsm4gC+2oRVOx430zYSNlLRieux7mrXcTRpszLWy7n7kG2fzd+f7OFgKHrGd
 Bnx+X2DE2R3k8lNhJGZBc92zhJAjgoBw3R4ajFqsH6v7Fw0DFIhJ3zEn0EBbPvVo
 6hdkdYtpCog7Ek841lhzXlIz4Ofu05q+iUquEtbU3q51QeHF3a00i4SHfLT5L1NS
 xhOLR1nCSi9PMSfBHsdDfQbHr4WqK5NsyFvgQNnH7h31MybhkROzlP2JWN+tA/nJ
 DxBs14DiscA7zIYtl8gx8nVPgo7PBxupqJjorPgW6Fq11diKBe9thcPfjR763QKR
 jt6xyw40KAC8HZKntzrqugeWUGpf/LPwbH4QNX5M9TfgTum8duHaLFR2wGWUb3gr
 jPPxaSIBEPTENb2w9Z+N/5xGRwKlQo/QmROoygcr0Qox7qelp4GfFOxbQYGyppZX
 6k0BCRlgpNIy6EIgORgA8fpL6k5hZS7Jkjrs2nJd07pklYOuRQDYTBd0gh0eAwU5
 8wLrnBKCDCA=
 =LLOs
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc perf events fixes from Ingo Molnar:

 - Use POLLERR for events in error state, instead of the ambiguous
   POLLHUP error value

 - Fix non-sampling (counting) events on certain x86 platforms

* tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix non-sampling (counting) events on certain x86 platforms
  perf/core: Change to POLLERR for pinned events with error
2025-04-26 09:13:09 -07:00
Omar Sandoval bbce3de72b sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.

The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:

1. In the if (entity_is_task(se)) else block at the beginning of
   dequeue_entities(), slice is set to
   cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
   it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
   tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
   first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
   the saved slice, which is still U64_MAX.

This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:

6. In update_entity_lag(), se->slice is used to calculate limit, which
   ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
   is negative, vlag > limit, so se->vlag is set to the same huge
   negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
   another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
   decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
    incorrectly returns false because the vruntime is so far from the
    other vruntimes on the queue, causing the
    (vruntime - cfs_rq->min_vruntime) * load calulation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
    pick_eevdf() and crashes.

Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).

Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.

Fixes: aef6987d89 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
2025-04-26 10:44:36 +02:00
Linus Torvalds f1a3944c86 bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmgMIjIACgkQ6rmadz2v
 bTqeDw/+Mjrvdfav0Xg1OrhPcJKNzi6Wy3q8mkAPf+2tGxJQCOtTJuBESSSbGkM5
 H03dpOQ9TtHhrS67HxzxhyjQdEU7j8F/ZSDfELznT6tYJHdhsbCX73wgJcX+x+Cl
 YhgtQty0Y2UhIrIO+bs3oAYHhRAKKmK4qcCUgRbmPc2stMRxuv28AbxxNaTDnweA
 qsw/2rXb1RYULRXMaWCAqWcejIUwGa16ATMiyLEDw8OCVnulXwMLKXGy4SXw5zUE
 Hw/vh9i6xMrkOgsMgeB6YBKBURX70pfS1WWDLqpBhOqyr4rdu93EGZ2Pg8ae32Qx
 XGyeO2J3Xjr2iUYxcGcOwCrKiz2FoNFLLq9qSOTRq8RQbHBnzGWpgPBaaMryjYId
 ARxqX3tcLdNVIgpy8h6d5Zo/uqxkMwa0Ttnz0tj3RVNiyFfY6EtZpC2YC2IHnriK
 QthDiXRAyS8o3oTO5dMp7lM9Oz1j7JslRym5jZNJNSfO2nXi3GHwRFRA532/7x1i
 JhfoKVwOFe6WybHDU/Dg8EuV8tuFKiPanu9Id9hCKJdW+Ze/KpnxAzZsT3gZ6U3n
 tj9MROu1GPIihpqgiLNgITKWO6G+6Z02ATKfP5PiTUmzhiAe4r/MFux9FpZ2CfUL
 LmryGu7fLb6KXLarHAkN++0yHBHcaw7faAJ9prFv7vFI8CNBMyE=
 =ghBA
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Add namespace to BPF internal symbols (Alexei Starovoitov)

 - Fix possible endless loop in BPF map iteration (Brandon Kammerdiener)

 - Fix compilation failure for samples/bpf on LoongArch (Haoran Jiang)

 - Disable a part of sockmap_ktls test (Ihor Solodrai)

 - Correct typo in __clang_major__ macro (Peilin Ye)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Correct typo in __clang_major__ macro
  samples/bpf: Fix compilation failure for samples/bpf on LoongArch Fedora
  bpf: Add namespace to BPF internal symbols
  selftests/bpf: add test for softlock when modifying hashmap while iterating
  bpf: fix possible endless loop in BPF map iteration
  selftests/bpf: Mitigate sockmap_ktls disconnect_after_delete failure
2025-04-25 17:53:09 -07:00
Andrea Righi e7dcd1304b sched_ext: Remove duplicate BTF_ID_FLAGS definitions
Some kfuncs specific to the idle CPU selection policy are registered in
both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though
they should only be defined in the latter.

Remove the duplicates from scx_kfunc_ids_any.

Fixes: 337d1b354a ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-25 08:41:02 -10:00
Linus Torvalds 882cd65288 dma-maping fixes for Linux 6.15
- avoid unused variable warnings (Arnd Bergmann, Marek Szyprowski)
 - add runtume warnings and debug messages for devices with limited DMA
   capabilities (Balbir Singh, Chen-Yu Tsai)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaAtFeAAKCRCJp1EFxbsS
 RGJwAQDIwyLQdk4XbWUZYokxIl/5jIiuaTqQBoPGPILnwoTkuAD8DlxKQvsnzkdT
 QK7TSFpKwrboSaveGWEG5oB60wIsKgU=
 =TSJY
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.15-2025-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-maping fixes from Marek Szyprowski:

 - avoid unused variable warnings (Arnd Bergmann, Marek Szyprowski)

 - add runtume warnings and debug messages for devices with limited DMA
   capabilities (Balbir Singh, Chen-Yu Tsai)

* tag 'dma-mapping-6.15-2025-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-coherent: Warn if OF reserved memory is beyond current coherent DMA mask
  dma-mapping: Fix warning reported for missing prototype
  dma-mapping: avoid potential unused data compilation warning
  dma/mapping.c: dev_dbg support for dma_addressing_limited
  dma/contiguous: avoid warning about unused size_bytes
2025-04-25 09:44:53 -07:00
Alexei Starovoitov f88886de09 bpf: Add namespace to BPF internal symbols
Add namespace to BPF internal symbols used by light skeleton
to prevent abuse and document with the code their allowed usage.

Fixes: b1d18a7574 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20250425014542.62385-1-alexei.starovoitov@gmail.com
2025-04-25 09:21:23 -07:00
Brandon Kammerdiener 75673fda0c bpf: fix possible endless loop in BPF map iteration
The _safe variant used here gets the next element before running the callback,
avoiding the endless loop condition.

Signed-off-by: Brandon Kammerdiener <brandon.kammerdiener@intel.com>
Link: https://lore.kernel.org/r/20250424153246.141677-2-brandon.kammerdiener@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
2025-04-25 08:36:59 -07:00
Linus Torvalds 0251ddbffb virtio, vhost: fixes
A small number of fixes.
 
 virtgpu is exempt from reset shutdown fow now -
 	 a more complete fix is in the works
 spec compliance fixes in:
 	virtio-pci cap commands
 	vhost_scsi_send_bad_target
 	virtio console resize
 missing locking fix in vhost-scsi
 virtio ring - a KCSAN false positive fix
 VHOST_*_OWNER documentation fix
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCgAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmgIi3cPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRpGU8H/1Fq8pH+irGvyH4E21O03qx0wiM+lcYVhNH5
 2a3rjOwuJBiLvscZTJG/w07hIpx0O4WrbygdT0BTll4Uen2C+OpGn/Y1LfhW6wsr
 3yyeBpTr5hKiY8sOD08rMTHTCM4mD8UdYr13RcNq+eUxNZ6bA+kiGaXpIk0AiRPR
 5pdbx16cTZM7k+/9aXp68hRO7yHnyilGzAJG1hHmfx1L5Mt++RVKsf2KI+3YHWcI
 0ZZj/NP3iZfNm57+QpKX6zYikH4IFIer1r9wotMaR74brpuq8w7HKZUqe3VfG11Y
 TBgq6NfDZVq8G8bCGPv+C+DfDnpYMFVYqytCLn4/AyOhLNCRDs8=
 =m8wk
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
 "A small number of fixes:

   - virtgpu is exempt from reset shutdown fow now - a more complete fix
     is in the works

   - spec compliance fixes in:
       - virtio-pci cap commands
       - vhost_scsi_send_bad_target
       - virtio console resize

   - missing locking fix in vhost-scsi

   - virtio ring - a KCSAN false positive fix

   - VHOST_*_OWNER documentation fix"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vhost-scsi: Fix vhost_scsi_send_status()
  vhost-scsi: Fix vhost_scsi_send_bad_target()
  vhost-scsi: protect vq->log_used with vq->mutex
  vhost_task: fix vhost_task_create() documentation
  virtio_console: fix order of fields cols and rows
  virtio_console: fix missing byte order handling for cols and rows
  virtgpu: don't reset on shutdown
  virtio_ring: Fix data race by tagging event_triggered as racy for KCSAN
  vhost: fix VHOST_*_OWNER documentation
  virtio_pci: Use self group type for cap commands
2025-04-23 08:25:56 -07:00
Namhyung Kim 0db61388b3 perf/core: Change to POLLERR for pinned events with error
Commit:

  f4b07fd62d ("perf/core: Use POLLHUP for pinned events in error")

started to emit POLLHUP for pinned events in an error state.

But the POLLHUP is also used to signal events that the attached task is
terminated.  To distinguish pinned per-task events in the error state
it would need to check if the task is live.

Change it to POLLERR to make it clear.

Suggested-by: Gabriel Marin <gmx@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250422223318.180343-1-namhyung@kernel.org
2025-04-23 09:39:06 +02:00
Andrea Righi a11d6784d7 sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:

[   51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[   51.713836] Call trace:
[   51.713837]  scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[   51.713839]  bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[   51.713841]  bpf__sched_ext_ops_init+0x54/0x8c
[   51.713843]  scx_ops_enable.constprop.0+0x2c0/0x10f0
[   51.713845]  bpf_scx_reg+0x18/0x30
[   51.713847]  bpf_struct_ops_link_create+0x154/0x1b0
[   51.713849]  __sys_bpf+0x1934/0x22a0

Fix by properly acquiring the rq lock when possible or raising an error
if we try to operate on a CPU that is not the one currently locked.

Fixes: d86adb4fc0 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22 09:28:12 -10:00
Andrea Righi 18853ba782 sched_ext: Track currently locked rq
Some kfuncs provided by sched_ext may need to operate on a struct rq,
but they can be invoked from various contexts, specifically, different
scx callbacks.

While some of these callbacks are invoked with a particular rq already
locked, others are not. This makes it impossible for a kfunc to reliably
determine whether it's safe to access a given rq, triggering potential
bugs or unsafe behaviors, see for example [1].

To address this, track the currently locked rq whenever a sched_ext
callback is invoked via SCX_CALL_OP*().

This allows kfuncs that need to operate on an arbitrary rq to retrieve
the currently locked one and apply the appropriate action as needed.

[1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22 09:27:50 -10:00
Chen-Yu Tsai 89461db349 dma-coherent: Warn if OF reserved memory is beyond current coherent DMA mask
When a reserved memory region described in the device tree is attached
to a device, it is expected that the device's limitations are correctly
included in that description.

However, if the device driver failed to implement DMA address masking
or addressing beyond the default 32 bits (on arm64), then bad things
could happen because the DMA address was truncated, such as playing
back audio with no actual audio coming out, or DMA overwriting random
blocks of kernel memory.

Check against the coherent DMA mask when the memory regions are attached
to the device. Give a warning when the memory region can not be covered
by the mask.

A warning instead of a hard error was chosen, because it is possible
that existing drivers could be working fine even if they forgot to
extend the coherent DMA mask.

Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250421083930.374173-1-wenst@chromium.org
2025-04-22 17:44:09 +02:00
Balbir Singh cae5572ec9 dma-mapping: Fix warning reported for missing prototype
lkp reported a warning about missing prototype for a recent patch.

The kernel-doc style comments are out of sync, move them to the right
function.

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Christoph Hellwig <hch@lst.de>

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202504190615.g9fANxHw-lkp@intel.com/

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
[mszyprow: reformatted subject]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250422114034.3535515-1-balbirs@nvidia.com
2025-04-22 15:06:33 +02:00
Linus Torvalds a33b5a08cb sched_ext: Fixes for v6.15-rc3
- Use kvzalloc() so that large exit_dump buffer allocations don't fail
   easily.
 
 - Remove cpu.weight / cpu.idle unimplemented warnings which are more
   annoying than helpful. This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary.
   Mark it for deprecation.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaAaeKw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGcDjAQDM14FReObIKOOuqzlNaYXQqcSW2///ZVf/FR8j
 HpWtyAD/Tsqg6CzBpTxKkpMRLsE2iKI1t770vkUnDbjcnR0Rxgc=
 =rIV5
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Use kvzalloc() so that large exit_dump buffer allocations don't fail
   easily

 - Remove cpu.weight / cpu.idle unimplemented warnings which are more
   annoying than helpful.

   This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary. Mark it for
   deprecation

* tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation
  sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings
  sched_ext: Use kvzalloc for large exit_dump allocation
2025-04-21 19:16:29 -07:00
Linus Torvalds a22509a4ee cgroup: Fixes for v6.15-rc3
- Fix compilation in CONFIG_LOCKDEP && !CONFIG_PROVE_RCU configurations.
 
 - Allow "cpuset_v2_mode" mount option for "cpuset" filesystem type to make
   life easier for android.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaAac+Q4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGUiAAQCbw6eOFAE+sjI6GgAeMVORbLqyufGDNPvBwgzJ
 xPxgcwD/ZLlsJWRG6BzQ/KHeFZnGWSJEiqSSFHGCCr0l4QkIdgA=
 =ay1E
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix compilation in CONFIG_LOCKDEP && !CONFIG_PROVE_RCU configurations

 - Allow "cpuset_v2_mode" mount option for "cpuset" filesystem type to
   make life easier for android

* tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset-v1: Add missing support for cpuset_v2_mode
  cgroup: Fix compilation issue due to cgroup_mutex not being exported
2025-04-21 19:13:25 -07:00
Linus Torvalds 119009db26 vfs-6.15-rc3.fixes.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaAQM5QAKCRCRxhvAZXjc
 olcwAP0RETZn15Jkt5+mKjcx99fuVE7je3lp56UH4Y4XjZmthgEA1n65RDr4Tq6E
 548A2/9Hnt4NWdvoi9VhrG4+5dNRowM=
 =cFFa
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Revert the hfs{plus} deprecation warning that's also included in this
   pull request. The commit introducing the deprecation warning resides
   rather early in this branch. So simply dropping it would've rebased
   all other commits which I decided to avoid. Hence the revert in the
   same branch

   [ Background - the deprecation warning discussion resulted in people
     stepping up, and so hfs{plus} will have a maintainer taking care of
     it after all..   - Linus ]

 - Switch CONFIG_SYSFS_SYCALL default to n and decouple from
   CONFIG_EXPERT

 - Fix an audit bug caused by changes to our kernel path lookup helpers
   this cycle. Audit needs the parent path even if the dentry it tried
   to look up is negative

 - Ensure that the kernel path lookup helpers leave the passed in path
   argument clean when they return an error. This is consistent with all
   our other helpers

 - Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
   information is available to kernel consumers as well

 - Don't set a timer and call schedule() if the timer will expire
   immediately in epoll

 - Make netfs lookup tables with __nonstring

* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Revert "hfs{plus}: add deprecation warning"
  fs: move the bdex_statx call to vfs_getattr_nosec
  netfs: Mark __nonstring lookup tables
  eventpoll: Set epoll timeout if it's in the future
  fs: ensure that *path_locked*() helpers leave passed path pristine
  fs: add kern_path_locked_negative()
  hfs{plus}: add deprecation warning
  Kconfig: switch CONFIG_SYSFS_SYCALL default to n
2025-04-19 14:31:08 -07:00
Linus Torvalds fa6ad96dca tracing fixes for v6.15
- Initialize hash variables in ftrace subops logic
 
   The fix that simplified the ftrace subops logic opened a path where some
   variables could be used without being initialized, and done subtly where
   the compiler did not catch it. Initialize those variables to the
   EMPTY_HASH, which is the default hash.
 
 - Reinitialize the hash pointers after they are freed
 
   Some of the hash pointers in the subop logic were freed but may still be
   referenced later. To prevent use-after-free bugs, initialize them back to
   the EMPTY_HASH.
 
 - Free the ftrace hashes when they are replaced
 
   The fix that simplified the subops logic updated some hash pointers, but
   left the original hash that they were pointing to where they are no longer
   used. This caused a memory leak. Free the hashes that are pointed to by
   the pointers when they are replaced.
 
 - Fix size initialization of ftrace direct function hash
 
   The ftrace direct function hash used by BPF initialized the hash size
   incorrectly. It checked the size of items to a hard coded 32, which made
   the hash bit size of 5. The hash size is supposed to be limited by the bit
   size of the hash, as the bitmask is allowed to be greater than 5. Rework
   the size check to first pass the number of elements to fls() and then
   compare that to FTRACE_HASH_MAX_BITS before allocating the hash.
 
 - Fix format output of ftrace_graph_ent_entry event
 
   The field depth of the ftrace_graph_ent_entry event is of size 4 but the
   output showed it as unsigned long and use "%lu". Change it to unsigned int
   and use "%u" in the print format that is displayed to user space.
 
 - Fix the trace event filter on strings
 
   Events can be filtered on numbers or string values. The return value
   checked from strncpy_from_kernel_nofault() and strncpy_from_user_nofault()
   was used to determine if reading the strings would fault or not. It would
   return fault if the value was non zero, which is basically meant that it
   was always considering the read as a fault.
 
 - Add selftest to test trace event string filtering
 
   In order to catch the breakage of the string filtering, add a self test to
   make sure that it continues to work.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaAPqNRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qv5nAP4mqIgne7tzMhHIH/nQGM/7Dj98n+Vt
 BXm6VifVdVJvtAD+KCDipZ2MspGEeZX3SDSnvBuj0S+OX9T9CTWPv+rFUwE=
 =AWY4
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Initialize hash variables in ftrace subops logic

   The fix that simplified the ftrace subops logic opened a path where
   some variables could be used without being initialized, and done
   subtly where the compiler did not catch it. Initialize those
   variables to the EMPTY_HASH, which is the default hash.

 - Reinitialize the hash pointers after they are freed

   Some of the hash pointers in the subop logic were freed but may still
   be referenced later. To prevent use-after-free bugs, initialize them
   back to the EMPTY_HASH.

 - Free the ftrace hashes when they are replaced

   The fix that simplified the subops logic updated some hash pointers,
   but left the original hash that they were pointing to where they are
   no longer used. This caused a memory leak. Free the hashes that are
   pointed to by the pointers when they are replaced.

 - Fix size initialization of ftrace direct function hash

   The ftrace direct function hash used by BPF initialized the hash size
   incorrectly. It checked the size of items to a hard coded 32, which
   made the hash bit size of 5. The hash size is supposed to be limited
   by the bit size of the hash, as the bitmask is allowed to be greater
   than 5. Rework the size check to first pass the number of elements to
   fls() and then compare that to FTRACE_HASH_MAX_BITS before allocating
   the hash.

 - Fix format output of ftrace_graph_ent_entry event

   The field depth of the ftrace_graph_ent_entry event is of size 4 but
   the output showed it as unsigned long and use "%lu". Change it to
   unsigned int and use "%u" in the print format that is displayed to
   user space.

 - Fix the trace event filter on strings

   Events can be filtered on numbers or string values. The return value
   checked from strncpy_from_kernel_nofault() and
   strncpy_from_user_nofault() was used to determine if reading the
   strings would fault or not. It would return fault if the value was
   non zero, which is basically meant that it was always considering the
   read as a fault.

 - Add selftest to test trace event string filtering

   In order to catch the breakage of the string filtering, add a self
   test to make sure that it continues to work.

* tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: selftests: Add testing a user string to filters
  tracing: Fix filter string testing
  ftrace: Fix type of ftrace_graph_ent_entry.depth
  ftrace: fix incorrect hash size in register_ftrace_direct()
  ftrace: Free ftrace hashes after they are replaced in the subops code
  ftrace: Reinitialize hash to EMPTY_HASH after freeing
  ftrace: Initialize variables for ftrace_startup/shutdown_subops()
2025-04-19 11:57:36 -07:00
Stefano Garzarella fec0abf526 vhost_task: fix vhost_task_create() documentation
Commit cb380909ae ("vhost: return task creation error instead of NULL")
changed the return value of vhost_task_create(), but did not update the
documentation.

Reflect the change in the documentation: on an error, vhost_task_create()
returns an ERR_PTR() and no longer NULL.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20250327124435.142831-1-sgarzare@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-04-18 10:08:11 -04:00
Steven Rostedt a8c5b0ed89 tracing: Fix filter string testing
The filter string testing uses strncpy_from_kernel/user_nofault() to
retrieve the string to test the filter against. The if() statement was
incorrect as it considered 0 as a fault, when it is only negative that it
faulted.

Running the following commands:

  # cd /sys/kernel/tracing
  # echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter
  # echo 1 > events/syscalls/sys_enter_openat/enable
  # ls /proc/$$/maps
  # cat trace

Would produce nothing, but with the fix it will produce something like:

      ls-1192    [007] .....  8169.828333: sys_openat(dfd: ffffffffffffff9c, filename: 7efc18359904, flags: 80000, mode: 0)

Link: https://lore.kernel.org/all/CAEf4BzbVPQ=BjWztmEwBPRKHUwNfKBkS3kce-Rzka6zvbQeVpg@mail.gmail.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250417183003.505835fb@gandalf.local.home
Fixes: 77360f9bbc ("tracing: Add test for user space strings when filtering on string pointers")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 22:16:56 -04:00
Ilya Leoshkevich 3b4e87e6a5 ftrace: Fix type of ftrace_graph_ent_entry.depth
ftrace_graph_ent.depth is int, but ftrace_graph_ent_entry.depth is
unsigned long. This confuses trace-cmd on 64-bit big-endian systems and
makes it print a huge amount of spaces. Fix this by using unsigned int,
which has a matching size, instead.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250412221847.17310-2-iii@linux.ibm.com
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:19:15 -04:00
Menglong Dong 92f1d3b401 ftrace: fix incorrect hash size in register_ftrace_direct()
The maximum of the ftrace hash bits is made fls(32) in
register_ftrace_direct(), which seems illogical. So, we fix it by making
the max hash bits FTRACE_HASH_MAX_BITS instead.

Link: https://lore.kernel.org/20250413014444.36724-1-dongml2@chinatelecom.cn
Fixes: d05cb47066 ("ftrace: Fix modification of direct_function hash while in use")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:16:51 -04:00
Steven Rostedt c45c585dde ftrace: Free ftrace hashes after they are replaced in the subops code
The subops processing creates new hashes when adding and removing subops.
There were some places that the old hashes that were replaced were not
freed and this caused some memory leaks.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417135939.245b128d@gandalf.local.home
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:16:07 -04:00
Steven Rostedt 08275e59a7 ftrace: Reinitialize hash to EMPTY_HASH after freeing
There's several locations that free a ftrace hash pointer but may be
referenced again. Reset them to EMPTY_HASH so that a u-a-f bug doesn't
happen.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417110933.20ab718b@gandalf.local.home
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:15:28 -04:00