mirror-linux/include/trace/events
Linus Torvalds d08c407f71 A large set of updates and features for timers and timekeeping:
- The hierarchical timer pull model
 
     When timer wheel timers are armed they are placed into the timer wheel
     of a CPU which is likely to be busy at the time of expiry. This is done
     to avoid wakeups on potentially idle CPUs.
 
     This is wrong in several aspects:
 
      1) The heuristics to select the target CPU are wrong by
         definition as the chance to get the prediction right is close
         to zero.
 
      2) Due to #1 it is possible that timers are accumulated on a
         single target CPU
 
      3) The required computation in the enqueue path is just overhead for
      	dubious value especially under the consideration that the vast
      	majority of timer wheel timers are either canceled or rearmed
      	before they expire.
 
     The timer pull model avoids the above by removing the target
     computation on enqueue and queueing timers always on the CPU on which
     they get armed.
 
     This is achieved by having separate wheels for CPU pinned timers and
     global timers which do not care about where they expire.
 
     As long as a CPU is busy it handles both the pinned and the global
     timers which are queued on the CPU local timer wheels.
 
     When a CPU goes idle it evaluates its own timer wheels:
 
       - If the first expiring timer is a pinned timer, then the global
       	timers can be ignored as the CPU will wake up before they expire.
 
       - If the first expiring timer is a global timer, then the expiry time
         is propagated into the timer pull hierarchy and the CPU makes sure
         to wake up for the first pinned timer.
 
     The timer pull hierarchy organizes CPUs in groups of eight at the
     lowest level and at the next levels groups of eight groups up to the
     point where no further aggregation of groups is required, i.e. the
     number of levels is log8(NR_CPUS). The magic number of eight has been
     established by experimention, but can be adjusted if needed.
 
     In each group one busy CPU acts as the migrator. It's only one CPU to
     avoid lock contention on remote timer wheels.
 
     The migrator CPU checks in its own timer wheel handling whether there
     are other CPUs in the group which have gone idle and have global timers
     to expire. If there are global timers to expire, the migrator locks the
     remote CPU timer wheel and handles the expiry.
 
     Depending on the group level in the hierarchy this handling can require
     to walk the hierarchy downwards to the CPU level.
 
     Special care is taken when the last CPU goes idle. At this point the
     CPU is the systemwide migrator at the top of the hierarchy and it
     therefore cannot delegate to the hierarchy. It needs to arm its own
     timer device to expire either at the first expiring timer in the
     hierarchy or at the first CPU local timer, which ever expires first.
 
     This completely removes the overhead from the enqueue path, which is
     e.g. for networking a true hotpath and trades it for a slightly more
     complex idle path.
 
     This has been in development for a couple of years and the final series
     has been extensively tested by various teams from silicon vendors and
     ran through extensive CI.
 
     There have been slight performance improvements observed on network
     centric workloads and an Intel team confirmed that this allows them to
     power down a die completely on a mult-die socket for the first time in
     a mostly idle scenario.
 
     There is only one outstanding ~1.5% regression on a specific overloaded
     netperf test which is currently investigated, but the rest is either
     positive or neutral performance wise and positive on the power
     management side.
 
   - Fixes for the timekeeping interpolation code for cross-timestamps:
 
     cross-timestamps are used for PTP to get snapshots from hardware timers
     and interpolated them back to clock MONOTONIC. The changes address a
     few corner cases in the interpolation code which got the math and logic
     wrong.
 
   - Simplifcation of the clocksource watchdog retry logic to automatically
     adjust to handle larger systems correctly instead of having more
     incomprehensible command line parameters.
 
   - Treewide consolidation of the VDSO data structures.
 
   - The usual small improvements and cleanups all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuAN0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVKXEADIR45rjR1Xtz32js7B53Y65O4WNoOQ
 6/ycWcswuGzg/h4QUpPSJ6gOGVmKSWwZi4n0P/VadCiXGSPPm0aUKsoRUt9DZsPY
 mtj2wjCSXKXiyhTl9OtrZME86ZAIGO1dQXa/sOHsiP5PCjgQkD0b5CYi1+B6eHDt
 1/Uo2Tb9g8VAPppq20V5Uo93GrPf642oyi3FCFrR1M112Uuak5DmqHJYiDpreNcG
 D5SgI+ykSiaUaVyHifvqijoJk0rYXkqEC6evl02477lJ/X0vVo2/M8XPS95BxHST
 s5Iruo4rP+qeAy8QvhZpoPX59fO0m/AgA7cf77XXAtOpVdLH+bs4ILsEbouAIOtv
 lsmRkcYt+TpvrZFHPAxks+6g3afuROiDtxD5sXXpVWxvofi8FwWqubdlqdsbw9MP
 ZCTNyzNyKL47QeDwBfSynYUL1RSyqsphtIwk4oeQklH9rwMAnW21hi30z15hQ0pQ
 FOVkmcwi79JNvl/G+jRkDzw7r8/zcHshWdSjyUM04CDjjnCDjQOFWSIjEPwbQjjz
 S4HXpJKJW963dBgs9Z84/Ctw1GwoBk1qedDWDJE1257Qvmo/Wpe/7GddWcazOGnN
 RRFMzGPbOqBDbjtErOKGU+iCisgNEvz2XK+TI16uRjWde7DxZpiTVYgNDrZ+/Pyh
 rQ23UBms6ZRR+A==
 =iQlu
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "A large set of updates and features for timers and timekeeping:

   - The hierarchical timer pull model

     When timer wheel timers are armed they are placed into the timer
     wheel of a CPU which is likely to be busy at the time of expiry.
     This is done to avoid wakeups on potentially idle CPUs.

     This is wrong in several aspects:

       1) The heuristics to select the target CPU are wrong by
          definition as the chance to get the prediction right is
          close to zero.

       2) Due to #1 it is possible that timers are accumulated on
          a single target CPU

       3) The required computation in the enqueue path is just overhead
          for dubious value especially under the consideration that the
          vast majority of timer wheel timers are either canceled or
          rearmed before they expire.

     The timer pull model avoids the above by removing the target
     computation on enqueue and queueing timers always on the CPU on
     which they get armed.

     This is achieved by having separate wheels for CPU pinned timers
     and global timers which do not care about where they expire.

     As long as a CPU is busy it handles both the pinned and the global
     timers which are queued on the CPU local timer wheels.

     When a CPU goes idle it evaluates its own timer wheels:

       - If the first expiring timer is a pinned timer, then the global
         timers can be ignored as the CPU will wake up before they
         expire.

       - If the first expiring timer is a global timer, then the expiry
         time is propagated into the timer pull hierarchy and the CPU
         makes sure to wake up for the first pinned timer.

     The timer pull hierarchy organizes CPUs in groups of eight at the
     lowest level and at the next levels groups of eight groups up to
     the point where no further aggregation of groups is required, i.e.
     the number of levels is log8(NR_CPUS). The magic number of eight
     has been established by experimention, but can be adjusted if
     needed.

     In each group one busy CPU acts as the migrator. It's only one CPU
     to avoid lock contention on remote timer wheels.

     The migrator CPU checks in its own timer wheel handling whether
     there are other CPUs in the group which have gone idle and have
     global timers to expire. If there are global timers to expire, the
     migrator locks the remote CPU timer wheel and handles the expiry.

     Depending on the group level in the hierarchy this handling can
     require to walk the hierarchy downwards to the CPU level.

     Special care is taken when the last CPU goes idle. At this point
     the CPU is the systemwide migrator at the top of the hierarchy and
     it therefore cannot delegate to the hierarchy. It needs to arm its
     own timer device to expire either at the first expiring timer in
     the hierarchy or at the first CPU local timer, which ever expires
     first.

     This completely removes the overhead from the enqueue path, which
     is e.g. for networking a true hotpath and trades it for a slightly
     more complex idle path.

     This has been in development for a couple of years and the final
     series has been extensively tested by various teams from silicon
     vendors and ran through extensive CI.

     There have been slight performance improvements observed on network
     centric workloads and an Intel team confirmed that this allows them
     to power down a die completely on a mult-die socket for the first
     time in a mostly idle scenario.

     There is only one outstanding ~1.5% regression on a specific
     overloaded netperf test which is currently investigated, but the
     rest is either positive or neutral performance wise and positive on
     the power management side.

   - Fixes for the timekeeping interpolation code for cross-timestamps:

     cross-timestamps are used for PTP to get snapshots from hardware
     timers and interpolated them back to clock MONOTONIC. The changes
     address a few corner cases in the interpolation code which got the
     math and logic wrong.

   - Simplifcation of the clocksource watchdog retry logic to
     automatically adjust to handle larger systems correctly instead of
     having more incomprehensible command line parameters.

   - Treewide consolidation of the VDSO data structures.

   - The usual small improvements and cleanups all over the place"

* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
  timer/migration: Fix quick check reporting late expiry
  tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
  vdso/datapage: Quick fix - use asm/page-def.h for ARM64
  timers: Assert no next dyntick timer look-up while CPU is offline
  tick: Assume timekeeping is correctly handed over upon last offline idle call
  tick: Shut down low-res tick from dying CPU
  tick: Split nohz and highres features from nohz_mode
  tick: Move individual bit features to debuggable mask accesses
  tick: Move got_idle_tick away from common flags
  tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
  tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
  tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
  tick: Start centralizing tick related CPU hotplug operations
  tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
  tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
  tick: Use IS_ENABLED() whenever possible
  tick/sched: Remove useless oneshot ifdeffery
  tick/nohz: Remove duplicate between lowres and highres handlers
  tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
  hrtimer: Select housekeeping CPU during migration
  ...
2024-03-11 14:38:26 -07:00
..
9p.h 9p: prevent read overrun in protocol dump tracepoint 2023-12-05 21:18:44 +09:00
afs.h vfs-6.9.file 2024-03-11 10:37:45 -07:00
alarmtimer.h
asoc.h
avc.h
bcache.h
block.h fs: add CONFIG_BUFFER_HEAD 2023-08-02 09:13:09 -06:00
bpf_test_run.h
bridge.h net: bridge: Add a tracepoint for MDB overflows 2023-02-06 08:48:25 +00:00
btrfs.h btrfs: use the flags of an extent map to identify the compression type 2023-12-15 22:59:02 +01:00
cachefiles.h fscache,cachefiles: add prepare_ondemand_read() callback 2022-12-07 10:56:29 +08:00
cgroup.h
clk.h clk: Add trace events for rate requests 2022-12-07 13:54:09 -08:00
cma.h trace: cma: remove unnecessary event class cma_alloc_class 2023-04-05 19:42:58 -07:00
compaction.h mm: compaction: add trace event for fast freepages isolation 2023-06-09 16:25:43 -07:00
context_tracking.h
cpuhp.h
csd.h smp: Change function signatures to use call_single_data_t 2023-09-13 14:59:24 +02:00
damon.h mm/damon/core: use nr_accesses_bp as a source of damos_before_apply tracepoint 2023-10-04 10:32:31 -07:00
devfreq.h
devlink.h devlink: Fix TP_STRUCT_entry in trace of devlink health report 2023-02-15 19:15:44 -08:00
dlm.h fs: dlm: add plock dev tracepoints 2023-08-10 10:33:03 -05:00
dma_fence.h
erofs.h erofs: adapt folios for z_erofs_read_folio() 2023-08-23 23:47:33 +08:00
error_report.h panic: use error_report_end tracepoint on warnings 2022-01-20 08:52:55 +02:00
ext4.h ext4: remove 'needed' in trace_ext4_discard_preallocations 2024-01-18 10:52:45 -05:00
f2fs.h f2fs: add tracepoint for f2fs_vm_page_mkwrite() 2023-12-11 13:37:53 -08:00
fib.h net: Replace strlcpy with strscpy 2023-07-04 19:40:16 +01:00
fib6.h net: Replace strlcpy with strscpy 2023-07-04 19:40:16 +01:00
filelock.h filelock: split leases out of struct file_lock 2024-02-05 13:11:44 +01:00
filemap.h
fs_dax.h
fscache.h fscache: Fix oops due to race with cookie_lru and use_cookie 2022-12-07 11:49:18 -08:00
fsi.h fsi: core: Add trace events for scan and unregister 2023-08-09 15:43:28 +09:30
fsi_master_aspeed.h fsi: Add trace events in initialization path 2022-02-21 19:38:54 +10:30
fsi_master_ast_cf.h
fsi_master_gpio.h
fsi_master_i2cr.h fsi: Add IBM I2C Responder virtual FSI master 2023-08-11 13:32:14 +09:30
gpio.h
gpu_mem.h
habanalabs.h accel/habanalabs: minor cosmetics update to trace file 2023-10-09 12:37:23 +03:00
handshake.h net/handshake: Trace events for TLS Alert helpers 2023-07-28 14:07:59 -07:00
host1x.h
huge_memory.h mm/khugepaged: skip shmem with userfaultfd 2023-04-18 16:29:52 -07:00
hwmon.h
i2c.h
i2c_slave.h i2c: add tracepoints for I2C slave events 2022-03-20 00:11:05 +01:00
ib_mad.h IB/mad: Don't call to function that might sleep while in atomic context 2022-11-10 10:57:15 +02:00
ib_umad.h
initcall.h
intel-sst.h
intel_ifs.h platform/x86/intel/ifs: Gen2 Scan test support 2023-10-06 13:05:18 +03:00
intel_ish.h
io_uring.h io_uring: remove 'loops' argument from trace_io_uring_task_work_run() 2024-02-08 13:27:06 -07:00
iocost.h blk-iocost: Trace vtime_base_rate instead of vtime_rate 2022-12-01 07:44:12 -07:00
iommu.h iommu: Remove detach_dev callback 2023-01-13 16:39:18 +01:00
ipi.h trace: Add trace_ipi_send_cpu() 2023-03-24 11:01:29 +01:00
irq.h softirq: Add trace points for tasklet entry/exit 2023-04-15 10:17:16 +02:00
irq_matrix.h
iscsi.h scsi: iscsi: tracing: Use the new __vstring() helper 2022-07-19 11:20:25 -04:00
jbd2.h jbd2: remove journal_clean_one_cp_list() 2023-07-10 23:09:21 -04:00
kmem.h mm: convert mm's rss stats into percpu_counter 2022-11-30 15:58:40 -08:00
ksm.h mm/ksm: add tracepoint for ksm advisor 2023-12-29 11:58:27 -08:00
kvm.h KVM: remove CONFIG_HAVE_KVM_IRQFD 2023-12-08 15:43:33 -05:00
kyber.h kyber: Replace strlcpy with strscpy 2023-07-17 08:18:17 -06:00
libata.h ata: libata: add qc->flags in ata_qc_complete_template tracepoint 2022-06-17 16:30:03 +09:00
lock.h locking/mutex: Make contention tracepoints more consistent wrt adaptive spinning 2022-04-05 10:24:36 +02:00
maple_tree.h Maple Tree: add new data structure 2022-09-26 19:46:13 -07:00
mce.h
mctp.h mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag control 2022-02-09 12:00:11 +00:00
mdio.h
migrate.h mm/migrate: add nr_split to trace_mm_migrate_pages stats. 2023-10-25 16:47:13 -07:00
mlxsw.h
mmap.h mm: mmap: remove newline at the end of the trace 2023-03-23 17:18:36 -07:00
mmap_lock.h
mmc.h
mmflags.h arch: Remove Itanium (IA-64) architecture 2023-09-11 08:13:17 +00:00
module.h
mptcp.h net: implement lockless SO_MAX_PACING_RATE 2023-10-01 19:09:54 +01:00
napi.h
nbd.h
neigh.h neighbor: tracing: Move pin6 inside CONFIG_IPV6=y section 2023-10-18 11:16:43 +01:00
net.h net: fix net_dev_start_xmit trace event vs skb_transport_offset() 2023-07-03 09:13:23 +01:00
net_probe_common.h
netfs.h netfs: Implement a write-through caching option 2023-12-28 09:45:27 +00:00
netlink.h
nilfs2.h fs/nilfs2: Use the enum req_op and blk_opf_t types 2022-07-14 12:14:33 -06:00
nmi.h
notifier.h notifiers: add tracepoints to the notifiers infrastructure 2023-04-08 13:45:38 -07:00
objagg.h
oom.h
osnoise.h
page_isolation.h
page_pool.h page_pool: split types and declarations from page_pool.h 2023-08-07 13:05:19 -07:00
page_ref.h
pagemap.h
percpu.h include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace" 2022-05-25 10:47:48 -07:00
power.h cpuidle: Add cpu_idle_miss trace event 2022-08-03 17:50:58 +02:00
power_cpu_migrate.h
preemptirq.h
printk.h
pwc.h
pwm.h pwm/tracing: Also record trace events for failed API calls 2022-12-06 12:46:23 +01:00
qdisc.h tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string 2024-03-04 09:35:54 +00:00
qla.h scsi: qla2xxx: tracing: Use the new __vstring() helper 2022-07-19 11:20:25 -04:00
qrtr.h net: qrtr: correct types of trace event parameters 2023-04-04 18:58:43 -07:00
rcu.h RCU Changes for 6.4: 2023-04-24 12:16:14 -07:00
rdma_core.h
regulator.h
rpcgss.h SUNRPC: Record gss_wrap() errors in svcauth_gss_wrap_priv() 2023-02-20 09:20:25 -05:00
rpcrdma.h svcrdma: Copy construction of svc_rqst::rq_arg to rdma_read_complete() 2024-01-07 17:54:33 -05:00
rpm.h
rseq.h tracing/rseq: Add mm_cid field to rseq_update 2022-12-27 12:52:15 +01:00
rtc.h
rv.h rv/monitor: Add the wwnr monitor 2022-07-30 14:01:30 -04:00
rwmmio.h asm-generic/io: Add _RET_IP_ to MMIO trace for more accurate debug info 2022-11-21 22:02:10 +01:00
rxrpc.h rxrpc: Fix counting of new acks and nacks 2024-02-05 12:34:07 +00:00
sched.h sched: Remove vruntime from trace_sched_stat_runtime() 2023-11-15 09:57:49 +01:00
scmi.h include: trace: Add platform and channel instance references 2023-01-20 11:40:57 +00:00
scsi.h scsi: core: Trace SCSI sense data 2023-05-31 11:05:34 -04:00
sctp.h
signal.h
siox.h
skb.h net: add location to trace_consume_skb() 2023-02-20 08:28:49 +00:00
smbus.h
sock.h inet: preserve const qualifier in inet_sk() 2023-03-17 08:56:37 +00:00
sof.h ASoC: SOF: replace ipc4-loader dev_vdbg with tracepoints 2022-09-19 15:44:08 +01:00
sof_intel.h ASoC: SOF: Intel: replace dev_vdbg with tracepoints 2022-09-19 15:44:06 +01:00
spi.h spi: Fix spelling typos and acronyms capitalization 2023-07-11 14:14:32 +01:00
spmi.h spmi: trace: fix stack-out-of-bound access in SPMI tracing functions 2022-07-24 16:16:44 +02:00
sunrpc.h SUNRPC: Remove RQ_SPLICE_OK 2024-01-07 17:54:26 -05:00
sunvnet.h
swiotlb.h swiotlb: make the swiotlb_init interface more useful 2022-04-18 07:21:11 +02:00
syscalls.h
target.h
task.h tracing: Replace strlcpy with strscpy in trace/events/task.h 2023-09-01 21:00:00 -04:00
tcp.h tcp: add missing family to tcp_set_ca_state() tracepoint 2023-08-09 13:45:19 -07:00
tegra_apb_dma.h
thermal_pressure.h arch_topology: Trace the update thermal pressure 2022-05-06 09:57:38 +02:00
thp.h powerpc/book3s64/mm: enable transparent pud hugepage 2023-08-18 10:12:55 -07:00
timer.h tracing/timers: Add tracepoint for tracking timer base is_idle flag 2023-12-20 16:49:38 +01:00
timer_migration.h timer_migration: Add tracepoints 2024-02-22 17:52:32 +01:00
tlb.h
udp.h
ufs.h scsi: ufs: core: Include the SCSI ID in UFS command tracing output 2023-09-13 20:44:59 -04:00
v4l2.h
vb2.h
vmalloc.h mm: vmalloc: add free_vmap_area_noflush trace event 2022-11-08 17:37:17 -08:00
vmscan.h mm, vmscan: remove ISOLATE_UNMAPPED 2023-10-04 10:32:29 -07:00
vsock_virtio_transport_common.h vsock/virtio: MSG_ZEROCOPY flag support 2023-09-21 12:34:00 +02:00
watchdog.h watchdog: Add tracing events for the most usual watchdog events 2022-10-12 09:47:02 +02:00
wbt.h blk-wbt: Replace strlcpy with strscpy 2023-07-17 08:18:17 -06:00
workqueue.h workqueue: Fix type of cpu in trace event 2022-06-07 07:09:47 -10:00
writeback.h writeback: fix dereferencing NULL mapping->host on writeback_page_template 2023-06-19 13:19:31 -07:00
xdp.h net: invert the netdevice.h vs xdp.h dependency 2023-08-03 08:38:07 -07:00
xen.h x86/xen: move paravirt lazy code 2023-09-19 07:04:49 +02:00