mirror-linux/kernel/trace
Linus Torvalds c1fe867b5b Updates for the timer/timekeeping core:

Merge tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

 - A rework of the hrtimer subsystem to reduce the overhead for
   frequently armed timers, especially the hrtick scheduler timer:

     - Better timer locality decision

     - Simplification of the evaluation of the first expiry time by
       keeping track of the neighbor timers in the RB-tree by providing
       a RB-tree variant with neighbor links. That avoids walking the
       RB-tree on removal to find the next expiry time and, more
       importantly, makes it possible to quickly evaluate whether a
       rearmed timer changes its position in the RB-tree with the
       modified expiry time. If it does not, the dequeue/enqueue
       sequence, both halves of which can end up rebalancing the tree,
       can be avoided completely.

     - Deferred reprogramming of the underlying clock event device. This
       optimizes for the situation where a hrtimer callback sets the
       need resched bit. In that case the code attempts to defer the
       re-programming of the clock event device up to the point where
       the scheduler has picked the next task and has the next hrtick
       timer armed. If there is no immediate reschedule, or if soft
       interrupts have to be handled before the reschedule point in the
       interrupt entry code is reached, the clock event is reprogrammed
       in one of those code paths so that the timer does not become
       stale.

     - Support for clocksource coupled clockevents

       The TSC deadline timer is coupled to the TSC. The next event is
       programmed in TSC time. Currently this is done by converting the
       CLOCK_MONOTONIC based expiry value into a relative timeout,
       converting it into TSC ticks, reading the TSC, adding the delta
       ticks and writing the deadline MSR.

       As the timekeeping core has the conversion factors for the TSC
       already, the whole back and forth conversion can be completely
       avoided. The timekeeping core calculates the reverse conversion
       factors from nanoseconds to TSC ticks and utilizes the base
       timestamps of TSC and CLOCK_MONOTONIC which are updated once per
       tick. This allows a direct conversion into the TSC deadline value
       without reading the time and as a bonus keeps the deadline
       conversion in sync with the TSC conversion factors, which are
       updated by adjtimex() on systems with NTP/PTP enabled.

     - Allow inlining of the clocksource read and clockevent write
       functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.

   With all those enhancements in place a hrtick enabled scheduler
   provides the same performance as without hrtick. Other hrtimer
   users benefit from these optimizations as well.
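
The rearm shortcut enabled by the neighbor links can be sketched as
follows. This is a minimal illustration and not the kernel code: the
struct and function names are hypothetical, the tree linkage itself is
left out, and treating an expiry equal to the successor's as a position
change is a conservative assumption made for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical timer node. Only the neighbor links that an RB-tree
 * variant with neighbor links would maintain are modeled here.
 */
struct timer_node {
	int64_t expires;          /* expiry time in nanoseconds */
	struct timer_node *prev;  /* in-order predecessor, NULL if first */
	struct timer_node *next;  /* in-order successor, NULL if last */
};

/*
 * Return true if new_expires keeps the node between its current
 * neighbors. An expiry equal to the successor's is treated as a move
 * (assumption: equal expiries must queue behind existing timers).
 */
static bool rearm_keeps_position(const struct timer_node *node,
				 int64_t new_expires)
{
	if (node->prev && new_expires < node->prev->expires)
		return false;
	if (node->next && new_expires >= node->next->expires)
		return false;
	return true;
}

/*
 * Update the expiry in place when the position is unchanged. A false
 * return tells the caller to fall back to dequeue plus enqueue.
 */
static bool timer_rearm(struct timer_node *node, int64_t new_expires)
{
	if (!rearm_keeps_position(node, new_expires))
		return false;
	node->expires = new_expires;
	return true;
}
```

In this sketch the check costs two comparisons and touches no tree
structure, which is the point of keeping the neighbors at hand.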

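The direct deadline conversion described for clocksource coupled
clockevents can be sketched like this. The struct layout, the field
names and the mult/shift pair are assumptions chosen for illustration;
the sketch only shows the arithmetic of turning a CLOCK_MONOTONIC expiry
into a TSC deadline from a per-tick snapshot without reading the TSC.

```c
#include <stdint.h>

/*
 * Hypothetical snapshot that a timekeeping core could refresh once per
 * tick. None of these names are taken from the kernel.
 */
struct tsc_deadline_conv {
	uint64_t base_ns;   /* CLOCK_MONOTONIC at the last update, in ns */
	uint64_t base_tsc;  /* TSC value captured at the same moment */
	uint32_t mult;      /* ns -> TSC ticks multiplier */
	uint32_t shift;     /* ns -> TSC ticks shift */
};

/*
 * Convert an absolute CLOCK_MONOTONIC expiry into an absolute TSC
 * deadline. An expiry at or before the snapshot yields base_tsc.
 * Assumes delta_ns * mult fits into 64 bits, which holds for deltas in
 * the range of a tick with typical mult values.
 */
static uint64_t ns_to_tsc_deadline(const struct tsc_deadline_conv *c,
				   uint64_t expires_ns)
{
	uint64_t delta_ns = 0;

	if (expires_ns > c->base_ns)
		delta_ns = expires_ns - c->base_ns;

	return c->base_tsc + ((delta_ns * c->mult) >> c->shift);
}
```

For example, a 2 GHz TSC corresponds to mult = 2 << 20 with shift = 20
in this scheme, so an expiry 500 ns after the snapshot lands 1000 ticks
after base_tsc.
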
 - Robustness improvements and cleanups of historical sins in the
   hrtimer and timekeeping code.

 - Rewrite of the clocksource watchdog.

   The clocksource watchdog code has over time reached the state of an
   impenetrable maze of duct tape and staples. The original design,
   which was made in the context of systems far smaller than today's, is
   based on the assumption that the clocksource to be monitored (TSC)
   can be trivially compared against a clocksource known to be stable
   (HPET/ACPI-PM timer).

   Over the years this rather naive approach turned out to have major
   flaws. Long delays between the watchdog invocations can cause
   wraparounds of the reference clocksource. The access to the reference
   clocksource degrades on large multi-socket systems due to
   interconnect congestion. This has been addressed with various
   heuristics which degraded the accuracy of the watchdog to the point
   that it fails to detect actual TSC problems on older hardware which
   exposes slow inter-CPU drifts due to firmware manipulating the TSC to
   hide SMI time.

   The rewrite addresses this by:

     - Restricting the validation against the reference clocksource to
       the boot CPU which is usually closest to the legacy block which
       contains the reference clocksource (HPET/ACPI-PM).

     - Doing a round-robin validation between the boot CPU and the other
       CPUs based only on the TSC with an algorithm similar to the TSC
       synchronization code during CPU hotplug.

     - Being more lenient towards remote timeouts

 - The usual tiny fixes, cleanups and enhancements all over the place
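
The wraparound problem of the reference clocksource mentioned above can
be illustrated with a small sketch. It assumes a 24-bit reference
counter running at 3.579545 MHz, which is how Linux treats the ACPI PM
timer; the helper names are hypothetical and not taken from the
watchdog code.

```c
#include <stdint.h>

#define REF_MASK ((1u << 24) - 1)  /* assumed 24-bit reference counter */
#define REF_HZ   3579545u          /* ACPI PM timer frequency in Hz */

/* Cycles elapsed between two raw reads, modulo the counter width. */
static uint32_t ref_delta(uint32_t prev, uint32_t now)
{
	return (now - prev) & REF_MASK;
}

/* Longest interval in milliseconds the counter can measure unambiguously. */
static uint32_t ref_max_interval_ms(void)
{
	return (uint32_t)(((uint64_t)REF_MASK + 1) * 1000 / REF_HZ);
}
```

Under these assumptions the counter wraps after about 4.7 seconds. The
masked delta stays correct across a single wrap, but once two reads are
further apart than a full counter period the result is too small by a
multiple of 2^24 cycles, and the comparison against the monitored
clocksource becomes meaningless.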

* tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  alarmtimer: Access timerqueue node under lock in suspend
  hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
  posix-timers: Fix stale function name in comment
  timers: Get this_cpu once while clearing the idle state
  clocksource: Rewrite watchdog code completely
  clocksource: Don't use non-continuous clocksources as watchdog
  x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
  MIPS: Don't select CLOCKSOURCE_WATCHDOG
  parisc: Remove unused clocksource flags
  hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
  hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
  hrtimer: Mark index and clockid of clock base as const
  hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
  hrtimer: Drop spurious space in 'enum hrtimer_base_type'
  hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
  hrtimer: Remove hrtimer_get_expires_ns()
  timekeeping: Mark offsets array as const
  timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
  timer_list: Print offset as signed integer
  tracing: Use explicit array size instead of sentinel elements in symbol printing
  ...
2026-04-14 10:27:07 -07:00
..
rv verification/rvgen: Remove unused variable declaration from containers 2026-01-12 07:43:51 +01:00
Kconfig tracing updates for 7.0: 2026-02-13 19:25:16 -08:00
Makefile tracing: Move pid filtering into trace_pid.c 2026-02-08 21:01:13 -05:00
blktrace.c block-7.0-20260305 2026-03-06 08:36:18 -08:00
bpf_trace.c bpf: Reject sleepable kprobe_multi programs at attach time 2026-04-02 09:48:46 -07:00
bpf_trace.h
error_report-traces.c
fgraph.c fgraph: Do not call handlers direct when not using ftrace_ops 2026-02-19 15:21:22 -05:00
fprobe.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
ftrace.c ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod 2026-03-21 16:51:04 -04:00
ftrace_internal.h
kprobe_event_gen_test.c
pid_list.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
pid_list.h trace/pid_list: optimize pid_list->lock contention 2025-11-13 15:15:54 -05:00
power-traces.c PM: cpufreq: powernv/tracing: Move powernv_throttle trace event 2025-07-21 16:40:56 -04:00
preemptirq_delay_test.c kernel: trace: preemptirq_delay_test: use offstack cpu mask 2025-07-08 18:17:38 -04:00
rethook.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
ring_buffer.c ring-buffer: Fix to update per-subbuf entries of persistent ring buffer 2026-03-21 16:47:28 -04:00
ring_buffer_benchmark.c tracing: Fix typo in ring_buffer_benchmark.c 2025-12-05 15:43:40 -05:00
rpm-traces.c
synth_event_gen_test.c
trace.c tracing: Fix trace_marker copy link list updates 2026-03-21 16:43:53 -04:00
trace.h tracing updates for 7.0: 2026-02-13 19:25:16 -08:00
trace_benchmark.c
trace_benchmark.h
trace_boot.c
trace_branch.c
trace_btf.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_btf.h
trace_clock.c
trace_dynevent.c tracing: Report wrong dynamic event command 2025-11-10 19:26:14 -05:00
trace_dynevent.h
trace_entries.h tracing: Fix ftrace event field alignments 2026-02-05 09:47:11 -05:00
trace_eprobe.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_event_perf.c
trace_events.c tracing: Fix enabling multiple events on the kernel command line and bootconfig 2026-03-06 16:54:34 -05:00
trace_events_filter.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_events_filter_test.h
trace_events_hist.c Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
trace_events_inject.c
trace_events_synth.c tracing: Use explicit array size instead of sentinel elements in symbol printing 2026-03-12 12:15:53 +01:00
trace_events_trigger.c tracing: Drain deferred trigger frees if kthread creation fails 2026-03-28 08:32:44 -04:00
trace_events_user.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_export.c tracing: Fix ftrace event field alignments 2026-02-05 09:47:11 -05:00
trace_fprobe.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_functions.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_functions_graph.c fgraph: Fix thresh_return nosleeptime double-adjust 2026-03-03 22:11:20 -05:00
trace_hwlat.c tracing: Fix false sharing in hwlat get_sample() 2026-02-10 03:36:39 -05:00
trace_irqsoff.c tracing: Allow tracer to add more than 32 options 2025-11-04 21:44:00 +09:00
trace_kdb.c tracing: Allow tracer to add more than 32 options 2025-11-04 21:44:00 +09:00
trace_kprobe.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_kprobe_selftest.c
trace_kprobe_selftest.h
trace_mmiotrace.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_nop.c
trace_osnoise.c tracing: Fix potential deadlock in cpu hotplug with osnoise 2026-03-27 15:18:06 -04:00
trace_output.c tracing: Use explicit array size instead of sentinel elements in symbol printing 2026-03-12 12:15:53 +01:00
trace_output.h tracing: Allow tracer to add more than 32 options 2025-11-04 21:44:00 +09:00
trace_pid.c tracing: Move pid filtering into trace_pid.c 2026-02-08 21:01:13 -05:00
trace_preemptirq.c
trace_printk.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_probe.c tracing/probe: reject non-closed empty immediate strings 2026-04-06 09:22:42 +09:00
trace_probe.h tracing: probes: Use __free() for trace_probe_log 2025-11-01 01:10:28 +09:00
trace_probe_kernel.h
trace_probe_tmpl.h
trace_recursion_record.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_sched_switch.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_sched_wakeup.c tracing: Allow tracer to add more than 32 options 2025-11-04 21:44:00 +09:00
trace_selftest.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_selftest_dynamic.c
trace_seq.c tracing: Add bitmask-list option for human-readable bitmask display 2026-01-26 17:00:50 -05:00
trace_stack.c
trace_stat.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
trace_stat.h
trace_synth.h
trace_syscalls.c tracing: Use explicit array size instead of sentinel elements in symbol printing 2026-03-12 12:15:53 +01:00
trace_uprobe.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
tracing_map.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
tracing_map.h