Commit Graph

17732 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Linus Torvalds 98afd4dd3d - Add instruction decoding support for the XOP-prefixed instruction set
present on the AMD Bulldozer uarch
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmjWZM8ACgkQEsHwGGHe
 VUpDwQ//UeVlmtFqIK/KJSoc0/WrL34D7ICqoKuo9ocLoYHrGLNxinV18GhtJ4vD
 o1pmuwcLj6fIEIjTvlNSvs1/mtB7XYNRMm/W21fOLAyhRARg1P3Mi+crpMB/HrfA
 Lowx+YjKYFRo6NUaNARCAfgnkkFz9w715o5ab5rr3yxkqeVSyYUz7Rl2eIDc03vl
 HYEarpRFhrvwi8ccxoz1xKECehmvW0Htn8QzaHGrJkOf/gfWJdIz5KmjFDONt8mM
 AdMVTZ49lM3fzoYRpsr2+x5oHWfjMw1nRYyJliLhriVDga03FNzj6FL+AHt3itvG
 2wFqJfYxOETMjdvSHAQHkOZUu7ZSKFNIwtYdqsdklTPNB5AVNE2ANd21vY2+wgUP
 P5aIXNo7SEIaUSxkE7JfUubJTn/lRExZOkgZx/fwYeFbF1Wc7e5w9Q5bhxq70s1A
 1E0iSAPR30t9HvveDCDLBGqbNAfNsBN0E3m4EULRcEwKZ3P7OvknddBHpVjnwEG+
 +OgDf/mJQt3hW6ubvYHvgEhOYyEnc92z9YEtqFiGG6bRx0ppqO+K/7FE7vBrUp7a
 fDRsyYtWHLQVOZTmeStK7aeslM9BGnsmhACy566oaX1C11f2I7KLYQAI1apKb8A4
 y+NjQUebmTBmlKx4J+QnKBOhGNbl1xbSFJPIoNM0WIlm410Od48=
 =mUtz
 -----END PGP SIGNATURE-----

Merge tag 'x86_misc_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 instruction decoder update from Borislav Petkov:

 - Add instruction decoding support for the XOP-prefixed instruction set
   present on the AMD Bulldozer uarch

[ These instructions don't normally happen, but a X86_NATIVE_CPU build
  on a bulldozer host can make the compiler then use these unusual
  instruction encodings ]

* tag 'x86_misc_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/insn: Add XOP prefix instructions decoder support
2025-09-30 11:48:33 -07:00
Linus Torvalds a5ba183bde hardening updates for v6.18-rc1
- Clean up usage of TRAILING_OVERLAP() (Gustavo A. R. Silva)
 
 - lkdtm: fortify: Fix potential NULL dereference on kmalloc failure
   (Junjie Cao)
 
 - Add str_assert_deassert() helper (Lad Prabhakar)
 
 - gcc-plugins: Remove TODO_verify_il for GCC >= 16
 
 - kconfig: Fix BrokenPipeError warnings in selftests
 
 - kconfig: Add transitional symbol attribute for migration support
 
 - kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaNraNQAKCRA2KwveOeQk
 u/DkAPwKPP5BSmVR2wkdpQaXIr3PGA+cbBYp34DMJNujZ9piIwD/WZ+HfGTLoERy
 +2Q6HLj9hUdd+Rx3IZ8/w1QmnhUIUAU=
 =AwV9
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "One notable addition is the creation of the 'transitional' keyword for
  kconfig so CONFIG renaming can go more smoothly.

  This has been a long-standing deficiency, and with the renaming of
  CONFIG_CFI_CLANG to CONFIG_CFI (since GCC will soon have KCFI
  support), this came up again.

  The breadth of the diffstat is mainly this renaming.

   - Clean up usage of TRAILING_OVERLAP() (Gustavo A. R. Silva)

   - lkdtm: fortify: Fix potential NULL dereference on kmalloc failure
     (Junjie Cao)

   - Add str_assert_deassert() helper (Lad Prabhakar)

   - gcc-plugins: Remove TODO_verify_il for GCC >= 16

   - kconfig: Fix BrokenPipeError warnings in selftests

   - kconfig: Add transitional symbol attribute for migration support

   - kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI"

* tag 'hardening-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  lib/string_choices: Add str_assert_deassert() helper
  kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI
  kconfig: Add transitional symbol attribute for migration support
  kconfig: Fix BrokenPipeError warnings in selftests
  gcc-plugins: Remove TODO_verify_il for GCC >= 16
  stddef: Introduce __TRAILING_OVERLAP()
  stddef: Remove token-pasting in TRAILING_OVERLAP()
  lkdtm: fortify: Fix potential NULL dereference on kmalloc failure
2025-09-29 17:48:27 -07:00
Kees Cook 23ef9d4397 kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI
The kernel's CFI implementation uses the KCFI ABI specifically, and is
not strictly tied to a particular compiler. In preparation for GCC
supporting KCFI, rename CONFIG_CFI_CLANG to CONFIG_CFI (along with
associated options).

Use new "transitional" Kconfig option for old CONFIG_CFI_CLANG that will
enable CONFIG_CFI during olddefconfig.

Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20250923213422.1105654-3-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-09-24 14:29:14 -07:00
Ian Rogers 20c9ccffcc perf maps: Ensure kmap is set up for all inserts
__maps__fixup_overlap_and_insert may split or directly insert a map,
when doing this the map may need to have a kmap set up for the sake of
the kmaps. The missing kmap set up fails the check_invariants test in
maps, later "Internal error" reports from map__kmap and ultimately
causes segfaults.

Similar fixes were added in commit e0e4e0b8b7 ("perf maps: Add
missing map__set_kmap_maps() when replacing a kernel map") and commit
25d9c0301d ("perf maps: Set the kmaps for newly created/added kernel
maps") but they missed cases. To try to reduce the risk of this,
update the kmap directly following any manual insert. This identified
another problem in maps__copy_from.

Fixes: e0e4e0b8b7 ("perf maps: Add missing map__set_kmap_maps() when replacing a kernel map")
Fixes: 25d9c0301d ("perf maps: Set the kmaps for newly created/added kernel maps")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-09-15 10:03:23 -07:00
Ian Rogers 7947ad1561 perf lock: Provide a host_env for session new
When "perf lock con" is run in a live mode, with no data file, a host
environment must be provided. Testing missed this as a failing assert
was creating the 1 line of expected stderr output.

  $ sudo perf lock con -ab true
  perf: util/session.c:195: __perf_session__new: Assertion `host_env != NULL' failed.
  Aborted

Fixes: 525a599bad ("perf env: Remove global perf_env")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-09-12 17:53:03 -07:00
Ian Rogers ca81e74dc3 perf symbol-elf: Add support for the block argument for libbfd
James Clark caught that the BUILD_NONDISTRO=1 build with libbfd was
broken due to an update to the read_build_id function adding a
blocking argument. Add support for this argument by first opening the
file blocking or non-blocking, then switching from bfd_openr to
bfd_fdopenr and passing the opened fd. bfd_fdopenr closes the fd on
error and when bfd_close are called.

Reported-by: James Clark <james.clark@linaro.org>
Closes: https://lore.kernel.org/lkml/20250903-james-perf-read-build-id-fix-v1-2-6a694d0a980f@linaro.org/
Fixes: 2c369d91d0 ("perf symbol: Add blocking argument to filename__read_build_id")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250904161731.1193729-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-09-04 16:37:35 -07:00
Thomas Richter 744175e972 perf test: Checking BPF metadata collection fails on version string
commit edf2cadf01 ("perf test: add test for BPF metadata collection")

fails consistently on the version string check. The perf version
string on some of the constant integration test machines contains
characters with special meaning in grep's extended regular expression
matching algorithm. The output of perf version is:

 # perf version
 perf version 6.17.0-20250814.rc1.git20.24ea63ea3877.63.fc42.s390x+git
 #

and the '+' character has special meaning in egrep command.
Also the use of egrep is deprecated.

Change the perf version string check to fixed character matching
and get rid of egrep's warning being deprecated. Use grep -F instead.

Output before:
 # perf test -F 102
 Checking BPF metadata collection
 egrep: warning: egrep is obsolescent; using grep -E
 Basic BPF metadata test [Failed invalid output]
 102: BPF metadata collection test             : FAILED!
 #

Output after:
 # perf test -F 102
 Checking BPF metadata collection
 Basic BPF metadata test [Success]
 102: BPF metadata collection test             : Ok
 #

Fixes: edf2cadf01 ("perf test: add test for BPF metadata collection")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Blake Jones <blakejones@google.com>
Link: https://lore.kernel.org/r/20250822122540.4104658-1-tmricht@linux.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-09-04 16:37:19 -07:00
James Clark 666d2206f1 perf tests: Fix "PE file support" test build
filename__read_build_id() now takes a blocking/non-blocking argument.
The original behavior of filename__read_build_id() was blocking so add
block=true to fix the build.

Fixes: 2c369d91d0 ("perf symbol: Add blocking argument to filename__read_build_id")
Signed-off-by: James Clark <james.clark@linaro.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250903-james-perf-read-build-id-fix-v1-1-6a694d0a980f@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-09-03 10:50:55 -07:00
Ian Rogers 01be43f2a0 perf bpf-utils: Harden get_bpf_prog_info_linear
In get_bpf_prog_info_linear two calls to bpf_obj_get_info_by_fd are
made, the first to compute memory requirements for a struct perf_bpil
and the second to fill it in. Previously the code would warn when the
second call didn't match the first. Such races can be common place in
things like perf test, whose perf trace tests will frequently load BPF
programs. Rather than a debug message, return actual errors for this
case. Out of paranoia also validate the read bpf_prog_info array
value. Change the type of ptr to avoid mismatched pointer type
compiler warnings. Add some additional debug print outs and sanity
asserts.

Closes: https://lore.kernel.org/lkml/CAP-5=fWJQcmUOP7MuCA2ihKnDAHUCOBLkQFEkQES-1ZZTrgf8Q@mail.gmail.com/
Fixes: 6ac22d036f ("perf bpf: Pull in bpf_program__get_prog_info_linear()")
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250902181713.309797-4-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-09-02 14:55:32 -07:00
Ian Rogers 1654a0e4d5 perf bpf-utils: Constify bpil_array_desc
The array's contents is a compile time constant. Constify to make the
code more intention revealing and avoid unintended errors.

Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250902181713.309797-3-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-09-02 14:55:20 -07:00
Ian Rogers d7b67dd6f9 perf bpf-event: Fix use-after-free in synthesis
Calls to perf_env__insert_bpf_prog_info may fail as a sideband thread
may already have inserted the bpf_prog_info. Such failures may yield
info_linear being freed which then causes use-after-free issues with
the internal bpf_prog_info info struct. Make it so that
perf_env__insert_bpf_prog_info trigger early non-error paths and fix
the use-after-free in perf_event__synthesize_one_bpf_prog. Add proper
return error handling to perf_env__add_bpf_info (that calls
perf_env__insert_bpf_prog_info) and propagate the return value in its
callers.

Closes: https://lore.kernel.org/lkml/CAP-5=fWJQcmUOP7MuCA2ihKnDAHUCOBLkQFEkQES-1ZZTrgf8Q@mail.gmail.com/
Fixes: 03edb7020b ("perf bpf: Fix two memory leakages when calling perf_env__insert_bpf_prog_info()")
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250902181713.309797-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-09-02 14:55:05 -07:00
Ian Rogers 2c369d91d0 perf symbol: Add blocking argument to filename__read_build_id
When synthesizing build-ids, for build ID mmap2 events, they will be
added for data mmaps if -d/--data is specified. The files opened for
their build IDs may block on the open causing perf to hang during
synthesis. There is some robustness in existing calls to
filename__read_build_id by checking the file path is to a regular
file, which unfortunately fails for symlinks. Rather than adding more
is_regular_file calls, switch filename__read_build_id to take a
"block" argument and specify O_NONBLOCK when this is false. The
existing is_regular_file checking callers and the event synthesis
callers are made to pass false and thereby avoiding the hang.

Fixes: 53b00ff358 ("perf record: Make --buildid-mmap the default")
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250823000024.724394-3-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-08-25 15:07:18 -07:00
Ian Rogers ba0b7081f7 perf symbol-minimal: Fix ehdr reading in filename__read_build_id
The e_ident is part of the ehdr and so reading it a second time would
mean the read ehdr was displaced by 16-bytes. Switch from stdio to
open/read/lseek syscalls for similarity with the symbol-elf version of
the function and so that later changes can alter then open flags.

Fixes: fef8f648bb ("perf symbol: Fix use-after-free in filename__read_build_id")
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250823000024.724394-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-08-25 15:07:18 -07:00
Namhyung Kim f79a62f4b3 tools headers: Sync uapi/linux/vhost.h with the kernel source
To pick up the changes in this cset:

  7d9896e9f6 vhost: Reintroduce kthread API and add mode selection
  333c515d18 vhost-net: allow configuring extended features

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/vhost.h include/uapi/linux/vhost.h

Please see tools/include/uapi/README for further details.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux.dev
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-08-18 13:49:26 -07:00
Namhyung Kim e7e79e9972 tools headers: Sync uapi/linux/prctl.h with the kernel source
To pick up the changes in this cset:

  b1fabef37b prctl: Introduce PR_MTE_STORE_ONLY
  a2fc422ed7 syscall_user_dispatch: Add PR_SYS_DISPATCH_INCLUSIVE_ON

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/prctl.h include/uapi/linux/prctl.h

Please see tools/include/uapi/README for further details.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-08-18 13:49:26 -07:00
Namhyung Kim 4a4083af03 tools headers: Sync uapi/linux/fs.h with the kernel source
To pick up the changes in this cset:

  76fdb7eb4e uapi: export PROCFS_ROOT_INO
  ca115d7e75 tree-wide: s/struct fileattr/struct file_kattr/g
  be7efb2d20 fs: introduce file_getattr and file_setattr syscalls
  9eb22f7fed fs: add ioctl to query metadata and protection info capabilities

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/fs.h include/uapi/linux/fs.h

Please see tools/include/uapi/README for further details.

Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-08-18 13:49:26 -07:00
Namhyung Kim b18aabe283 tools headers: Sync uapi/linux/fcntl.h with the kernel source
To pick up the changes in this cset:

  3941e37f62 uapi/fcntl: add FD_PIDFS_ROOT
  cd5d200632 uapi/fcntl: add FD_INVALID
  67fcec2919 fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
  a4c746f068 uapi/fcntl: mark range as reserved

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/perf/trace/beauty/include/uapi/linux/fcntl.h include/uapi/linux/fcntl.h

Please see tools/include/uapi/README for further details.

Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-08-18 13:49:26 -07:00
Namhyung Kim 52174e0eb1 tools headers: Sync syscall tables with the kernel source
To pick up the changes in this cset:

  be7efb2d20 fs: introduce file_getattr and file_setattr syscalls

This addresses these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
    diff -u tools/scripts/syscall.tbl scripts/syscall.tbl
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_32.tbl
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
    diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
    diff -u tools/perf/arch/arm/entry/syscalls/syscall.tbl arch/arm/tools/syscall.tbl
    diff -u tools/perf/arch/sh/entry/syscalls/syscall.tbl arch/sh/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/sparc/entry/syscalls/syscall.tbl arch/sparc/kernel/syscalls/syscall.tbl
    diff -u tools/perf/arch/xtensa/entry/syscalls/syscall.tbl arch/xtensa/kernel/syscalls/syscall.tbl

Please see tools/include/uapi/README for further details.

Cc: Arnd Bergmann <arnd@arndb.de>
CC: linux-api@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-08-18 13:49:25 -07:00
Namhyung Kim 22ec0faa0e perf test: Fix a build error in x86 topdown test
There's an environment that caused the following build error.  Include
"debug.h" (under util directory) to fix it.

  arch/x86/tests/topdown.c: In function 'event_cb':
  arch/x86/tests/topdown.c:53:25: error: implicit declaration of function 'pr_debug'
                                         [-Werror=implicit-function-declaration]
     53 |                         pr_debug("Broken topdown information for '%s'\n", evsel__name(evsel));
        |                         ^~~~~~~~
  cc1: all warnings being treated as errors

Link: https://lore.kernel.org/r/20250815164122.289651-1-namhyung@kernel.org
Fixes: 5b546de9cc ("perf topdown: Use attribute to see an event is a topdown metic or slots")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-08-18 11:52:22 -07:00
Masami Hiramatsu (Google) 26178b713f x86/insn: Add XOP prefix instructions decoder support
Support decoding AMD's XOP prefix encoded instructions.

These instructions are introduced for Bulldozer micro architecture, and not
supported on Intel's processors. But when compiling kernel with
CONFIG_X86_NATIVE_CPU on some AMD processor (e.g. -march=bdver2), these
instructions can be used.

Closes: https://lore.kernel.org/all/871pq06728.fsf@wylie.me.uk/
Reported-by: Alan J. Wylie <alan@wylie.me.uk>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Alan J. Wylie <alan@wylie.me.uk>
Link: https://lore.kernel.org/175386161199.564247.597496379413236944.stgit@devnote2
2025-08-18 17:15:02 +02:00
Ilya Leoshkevich 5e2ac8e857 perf bpf-filter: Enable events manually
On s390, and, in general, on all platforms where the respective event
supports auxiliary data gathering, the command:

   # ./perf record -u 0 -aB --synth=no -- ./perf test -w thloop
   [ perf record: Woken up 1 times to write data ]
   [ perf record: Captured and wrote 0.011 MB perf.data ]
   # ./perf report --stats | grep SAMPLE
   #

does not generate samples in the perf.data file. On x86 the command:

  # sudo perf record -e intel_pt// -u 0 ls

is broken too.

Looking at the sequence of calls in 'perf record' reveals this
behavior:

1. The event 'cycles' is created and enabled:

   record__open()
   +-> evlist__apply_filters()
       +-> perf_bpf_filter__prepare()
	   +-> bpf_program.attach_perf_event()
	       +-> bpf_program.attach_perf_event_opts()
	           +-> __GI___ioctl(..., PERF_EVENT_IOC_ENABLE, ...)

   The event 'cycles' is enabled and active now. However the event's
   ring-buffer to store the samples generated by hardware is not
   allocated yet.

2. The event's fd is mmap()ed to create the ring buffer:

   record__open()
   +-> record__mmap()
       +-> record__mmap_evlist()
	   +-> evlist__mmap_ex()
	       +-> perf_evlist__mmap_ops()
	           +-> mmap_per_cpu()
	               +-> mmap_per_evsel()
	                   +-> mmap__mmap()
	                       +-> perf_mmap__mmap()
	                           +-> mmap()

   This allocates the ring buffer for the event 'cycles'. With mmap()
   the kernel creates the ring buffer:

   perf_mmap(): kernel function to create the event's ring
   |            buffer to save the sampled data.
   |
   +-> ring_buffer_attach(): Allocates memory for ring buffer.
       |        The PMU has auxiliary data setup function. The
       |        has_aux(event) condition is true and the PMU's
       |        stop() is called to stop sampling. It is not
       |        restarted:
       |
       |        if (has_aux(event))
       |                perf_event_stop(event, 0);
       |
       +-> cpumsf_pmu_stop():

   Hardware sampling is stopped. No samples are generated and saved
   anymore.

3. After the event 'cycles' has been mapped, the event is enabled a
   second time in:

   __cmd_record()
   +-> evlist__enable()
       +-> __evlist__enable()
	   +-> evsel__enable_cpu()
	       +-> perf_evsel__enable_cpu()
	           +-> perf_evsel__run_ioctl()
	               +-> perf_evsel__ioctl()
	                   +-> __GI___ioctl(., PERF_EVENT_IOC_ENABLE, .)

   The second

      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

   is just a NOP in this case. The first invocation in (1.) sets the
   event::state to PERF_EVENT_STATE_ACTIVE. The kernel functions

   perf_ioctl()
   +-> _perf_ioctl()
       +-> _perf_event_enable()
           +-> __perf_event_enable()

   return immediately because event::state is already set to
   PERF_EVENT_STATE_ACTIVE.

This happens on s390, because the event 'cycles' offers the possibility
to save auxilary data. The PMU callbacks setup_aux() and free_aux() are
defined. Without both callback functions, cpumsf_pmu_stop() is not
invoked and sampling continues.

To remedy this, remove the first invocation of

   ioctl(..., PERF_EVENT_IOC_ENABLE, ...).

in step (1.) Create the event in step (1.) and enable it in step (3.)
after the ring buffer has been mapped.

Output after:

 # ./perf record -aB --synth=no -u 0 -- ./perf test -w thloop 2
 [ perf record: Woken up 3 times to write data ]
 [ perf record: Captured and wrote 0.876 MB perf.data ]
 # ./perf  report --stats | grep SAMPLE
              SAMPLE events:      16200  (99.5%)
              SAMPLE events:      16200
 #

The software event succeeded both before and after the patch:

 # ./perf record -e cpu-clock -aB --synth=no -u 0 -- \
					  ./perf test -w thloop 2
 [ perf record: Woken up 7 times to write data ]
 [ perf record: Captured and wrote 2.870 MB perf.data ]
 # ./perf  report --stats | grep SAMPLE
              SAMPLE events:      53506  (99.8%)
              SAMPLE events:      53506
 #

Fixes: b4c658d4d6 ("perf target: Remove uid from target")
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Co-developed-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250806162417.19666-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-07 09:03:44 -07:00
Linus Torvalds f4f346c346 [GIT PULL] perf tools changes for v6.17
Build-ID processing goodies
 ---------------------------
 Build-IDs are content based hashes to link regions of memory to ELF files
 in post processing. They have been available in distros for quite a while:
 
     $ file /bin/bash
     /bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV),
     dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
     BuildID[sha1]=707a1c670cd72f8e55ffedfbe94ea98901b7ce3a,
     for GNU/Linux 3.2.0, stripped
 
 It is possible to ask the kernel to get it from mmap executable backing
 storage at time they are being put in place and send it as metadata at
 that moment to have in perf.data.
 
 Prefer that across the board to speed up 'record' time - it post processes
 the samples to find binaries touched by any samples and to save them with
 build-ID.  It can skip reading build-ID in userspace if it comes from the
 kernel.
 
 perf record
 -----------
 * Make --buildid-mmap default.  The kernel can generate MMAP2 events
   with a build-ID from ELF header.  Use that by default instead of using
   inode and device ID to identify binaries.  It also can be disabled
   with --no-buildid-mmap.
 
 * Use BPF for -u/--uid option to sample processes belong to a user.
   BPF can track user processes more accurately and the existing logic
   often fails to get the list of processes due to race with reading the
   /proc filesystem.
 
 * Generate PERF_RECORD_BPF_METADATA when it profiles BPF programs and
   they have variables starting with "bpf_metadata_".  This will help to
   identify BPF objects used in the profile.  This has been supported in
   bpftool for some time and allows the recording of metadata such as
   commit hashes, versions, etc, that now gets recorded in perf.data as
   well.
 
 * Collect list of DSOs touched in the sample callchains as well as in
   the sample itself.  This would increase the processing time at the end
   of record, but can improve the data quality.
 
 perf stat
 ---------
 * Add a new 'drm' pseudo-PMU support like in 'hwmon'.  It can collect
   DRM usage stats using fdinfo in /proc.
 
   On my Intel laptop, it shows like below:
 
     $ perf list drm
     ...
 
     drm:
       drm-active-stolen-system0
            [Total memory active in one or more engines. Unit: drm_i915]
       drm-active-system0
            [Total memory active in one or more engines. Unit: drm_i915]
       drm-engine-capacity-video
            [Engine capacity. Unit: drm_i915]
       drm-engine-copy
            [Utilization in ns. Unit: drm_i915]
       drm-engine-render
            [Utilization in ns. Unit: drm_i915]
       drm-engine-video
            [Utilization in ns. Unit: drm_i915]
       ...
 
     $ sudo perf stat -a -e drm-engine-render,drm-engine-video,drm-engine-capacity-video sleep 1
 
      Performance counter stats for 'system wide':
 
     48,137,316,988,873 ns       drm-engine-render
         34,452,696,746 ns       drm-engine-video
                     20 capacity drm-engine-capacity-video
 
            1.002086194 seconds time elapsed
 
 perf list
 ---------
 * Add description for software events.  The description is in JSON format
   and the event parser now can handle the software events like others
   (for example, it's case-insensitive and subject to wildcard matching).
 
     $ perf list software
 
     List of pre-defined events (to be used in -e or -M):
 
     software:
       alignment-faults
            [Number of kernel handled memory alignment faults. Unit: software]
       bpf-output
            [An event used by BPF programs to write to the perf ring buffer. Unit: software]
       cgroup-switches
            [Number of context switches to a task in a different cgroup. Unit: software]
       context-switches
            [Number of context switches [This event is an alias of cs]. Unit: software]
       cpu-clock
            [Per-CPU high-resolution timer based event. Unit: software]
       cpu-migrations
            [Number of times a process has migrated to a new CPU [This event is an alias of migrations]. Unit: software]
       cs
            [Number of context switches [This event is an alias of context-switches]. Unit: software]
       dummy
            [A placeholder event that doesn't count anything. Unit: software]
       emulation-faults
            [Number of kernel handled unimplemented instruction faults handled through emulation. Unit: software]
       faults
            [Number of page faults [This event is an alias of page-faults]. Unit: software]
       major-faults
            [Number of major page faults. Major faults require I/O to handle. Unit: software]
       migrations
            [Number of times a process has migrated to a new CPU [This event is an alias of cpu-migrations]. Unit: software]
       minor-faults
            [Number of minor page faults. Minor faults don't require I/O to handle. Unit: software]
       page-faults
            [Number of page faults [This event is an alias of faults]. Unit: software]
       task-clock
            [Per-task high-resolution timer based event. Unit: software]
 
 perf ftrace
 -----------
 * Add -e/--events option to perf ftrace latency to measure latency
   between the two events instead of a function.
 
     $ sudo perf ftrace latency -ab -e i915_request_wait_begin,i915_request_wait_end --hide-empty -- sleep 1
     #   DURATION     |      COUNT | GRAPH                                |
        256 -  512 us |          4 | ######                               |
          2 -    4 ms |          2 | ###                                  |
          4 -    8 ms |         12 | ###################                  |
          8 -   16 ms |         10 | ################                     |
 
     # statistics  (in usec)
       total time:               194915
         avg time:                 6961
         max time:                12855
         min time:                  373
            count:                   28
 
 * Add new function graph tracer options (--graph-opts) to display more
   info like arguments and return value.  They will be passed to the
   kernel ftrace directly.
 
     $ sudo perf ftrace -G vfs_write --graph-opts retval,retaddr
     # tracer: function_graph
     #
     # CPU  DURATION                  FUNCTION CALLS
     # |     |   |                     |   |   |   |
     ...
     5)               |  mutex_unlock() { /* <-rb_simple_write+0xda/0x150 */
     5)   0.188 us    |    local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cf90e */
     5)               |    rt_mutex_slowunlock() { /* <-rb_simple_write+0xda/0x150 */
     5)               |      _raw_spin_lock_irqsave() { /* <-rt_mutex_slowunlock+0x4f/0x200 */
     5)   0.123 us    |        preempt_count_add(); /* <-_raw_spin_lock_irqsave+0x23/0x90 ret=0x0 */
     5)   0.128 us    |        local_clock(); /* <-__lock_acquire.isra.0+0x17a/0x740 ret=0x3bf2a3cfc8b */
     5)   0.086 us    |        do_raw_spin_trylock(); /* <-_raw_spin_lock_irqsave+0x4a/0x90 ret=0x1 */
     5)   0.845 us    |      } /* _raw_spin_lock_irqsave ret=0x292 */
     ...
 
 misc
 ----
 * Add perf archive --exclude-buildids <FILE> option to skip some binaries.
   The format of the FILE should be same as an output of perf buildid-list.
 
 * Get rid of dependency of libcrypto.  It was just to get SHA-1 hash so
   implement it directly like in the kernel.  A side effect is that it
   needs -fno-strict-aliasing compiler option (again, like in the kernel).
 
 * Convert all shell script tests to use bash.
 
 Reviewed-by: Arnaldo Carvalho de Melo <acme@kernel.org>
 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSo2x5BnqMqsoHtzsmMstVUGiXMgwUCaIz8IAAKCRCMstVUGiXM
 g4huAP9WDTZIT9E0gx4yLJ0slyBV/5ROaUWX8OUVO3JJ/1sEUgEAp3wsSmDYc1/o
 XTvqNNjxo1LG+bEmZk8yNAJ2FYghPgw=
 =6enW
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-for-v6.17-2025-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools updates from Namhyung Kim:
 "Build-ID processing goodies:

     Build-IDs are content based hashes to link regions of memory to ELF
     files in post processing. They have been available in distros for
     quite a while:

       $ file /bin/bash
       /bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV),
       dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
       BuildID[sha1]=707a1c670cd72f8e55ffedfbe94ea98901b7ce3a,
       for GNU/Linux 3.2.0, stripped

     It is possible to ask the kernel to get it from mmap executable
     backing storage at time they are being put in place and send it as
     metadata at that moment to have in perf.data.

     Prefer that across the board to speed up 'record' time - it post
     processes the samples to find binaries touched by any samples and
     to save them with build-ID. It can skip reading build-ID in
     userspace if it comes from the kernel.

  perf record:

   * Make --buildid-mmap default.  The kernel can generate MMAP2 events
     with a build-ID from ELF header.  Use that by default instead of using
     inode and device ID to identify binaries.  It also can be disabled
     with --no-buildid-mmap.

   * Use BPF for -u/--uid option to sample processes belong to a user.
     BPF can track user processes more accurately and the existing logic
     often fails to get the list of processes due to race with reading the
     /proc filesystem.

   * Generate PERF_RECORD_BPF_METADATA when it profiles BPF programs and
     they have variables starting with "bpf_metadata_".  This will help to
     identify BPF objects used in the profile.  This has been supported in
     bpftool for some time and allows the recording of metadata such as
     commit hashes, versions, etc, that now gets recorded in perf.data as
     well.

   * Collect list of DSOs touched in the sample callchains as well as in
     the sample itself.  This would increase the processing time at the end
     of record, but can improve the data quality.

  perf stat:

   * Add a new 'drm' pseudo-PMU support like in 'hwmon'.  It can collect
     DRM usage stats using fdinfo in /proc.

     On my Intel laptop, it shows like below:

       $ perf list drm
       ...

       drm:
         drm-active-stolen-system0
              [Total memory active in one or more engines. Unit: drm_i915]
         drm-active-system0
              [Total memory active in one or more engines. Unit: drm_i915]
         drm-engine-capacity-video
              [Engine capacity. Unit: drm_i915]
         drm-engine-copy
              [Utilization in ns. Unit: drm_i915]
         drm-engine-render
              [Utilization in ns. Unit: drm_i915]
         drm-engine-video
              [Utilization in ns. Unit: drm_i915]
         ...

       $ sudo perf stat -a -e drm-engine-render,drm-engine-video,drm-engine-capacity-video sleep 1

        Performance counter stats for 'system wide':

       48,137,316,988,873 ns       drm-engine-render
           34,452,696,746 ns       drm-engine-video
                       20 capacity drm-engine-capacity-video

              1.002086194 seconds time elapsed

  perf list

   * Add description for software events.  The description is in JSON format
     and the event parser now can handle the software events like others
     (for example, it's case-insensitive and subject to wildcard matching).

       $ perf list software

       List of pre-defined events (to be used in -e or -M):

       software:
         alignment-faults
              [Number of kernel handled memory alignment faults. Unit: software]
         bpf-output
              [An event used by BPF programs to write to the perf ring buffer. Unit: software]
         cgroup-switches
              [Number of context switches to a task in a different cgroup. Unit: software]
         context-switches
              [Number of context switches [This event is an alias of cs]. Unit: software]
         cpu-clock
              [Per-CPU high-resolution timer based event. Unit: software]
         cpu-migrations
              [Number of times a process has migrated to a new CPU [This event is an alias of migrations]. Unit: software]
         cs
              [Number of context switches [This event is an alias of context-switches]. Unit: software]
         dummy
              [A placeholder event that doesn't count anything. Unit: software]
         emulation-faults
              [Number of kernel handled unimplemented instruction faults handled through emulation. Unit: software]
         faults
              [Number of page faults [This event is an alias of page-faults]. Unit: software]
         major-faults
              [Number of major page faults. Major faults require I/O to handle. Unit: software]
         migrations
              [Number of times a process has migrated to a new CPU [This event is an alias of cpu-migrations]. Unit: software]
         minor-faults
              [Number of minor page faults. Minor faults don't require I/O to handle. Unit: software]
         page-faults
              [Number of page faults [This event is an alias of faults]. Unit: software]
         task-clock
              [Per-task high-resolution timer based event. Unit: software]

  perf ftrace:

   * Add -e/--events option to perf ftrace latency to measure latency
     between the two events instead of a function.

       $ sudo perf ftrace latency -ab -e i915_request_wait_begin,i915_request_wait_end --hide-empty -- sleep 1
       #   DURATION     |      COUNT | GRAPH                                |
          256 -  512 us |          4 | ######                               |
            2 -    4 ms |          2 | ###                                  |
            4 -    8 ms |         12 | ###################                  |
            8 -   16 ms |         10 | ################                     |

       # statistics  (in usec)
         total time:               194915
           avg time:                 6961
           max time:                12855
           min time:                  373
              count:                   28

   * Add new function graph tracer options (--graph-opts) to display more
     info like arguments and return value.  They will be passed to the
     kernel ftrace directly.

       $ sudo perf ftrace -G vfs_write --graph-opts retval,retaddr
       # tracer: function_graph
       #
       # CPU  DURATION                  FUNCTION CALLS
       # |     |   |                     |   |   |   |
       ...
       5)               |  mutex_unlock() { /* <-rb_simple_write+0xda/0x150 */
       5)   0.188 us    |    local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cf90e */
       5)               |    rt_mutex_slowunlock() { /* <-rb_simple_write+0xda/0x150 */
       5)               |      _raw_spin_lock_irqsave() { /* <-rt_mutex_slowunlock+0x4f/0x200 */
       5)   0.123 us    |        preempt_count_add(); /* <-_raw_spin_lock_irqsave+0x23/0x90 ret=0x0 */
       5)   0.128 us    |        local_clock(); /* <-__lock_acquire.isra.0+0x17a/0x740 ret=0x3bf2a3cfc8b */
       5)   0.086 us    |        do_raw_spin_trylock(); /* <-_raw_spin_lock_irqsave+0x4a/0x90 ret=0x1 */
       5)   0.845 us    |      } /* _raw_spin_lock_irqsave ret=0x292 */
       ...

  Misc:

   * Add perf archive --exclude-buildids <FILE> option to skip some binaries.
     The format of the FILE should be same as an output of perf buildid-list.

   * Get rid of dependency of libcrypto.  It was just to get SHA-1 hash so
     implement it directly like in the kernel.  A side effect is that it
     needs -fno-strict-aliasing compiler option (again, like in the kernel).

   * Convert all shell script tests to use bash"

* tag 'perf-tools-for-v6.17-2025-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (179 commits)
  perf record: Cache build-ID of hit DSOs only
  perf test: Ensure lock contention using pipe mode
  perf python: Stop using deprecated PyUnicode_AsString()
  perf list: Skip ABI PMUs when printing pmu values
  perf list: Remove tracepoint printing code
  perf tp_pmu: Add event APIs
  perf tp_pmu: Factor existing tracepoint logic to new file
  perf parse-events: Remove non-json software events
  perf jevents: Add common software event json
  perf tools: Remove libtraceevent in .gitignore
  perf test: Fix comment ordering
  perf sort: Use perf_env to set arch sort keys and header
  perf test: Move PERF_SAMPLE_WEIGHT_STRUCT parsing to common test
  perf sample: Remove arch notion of sample parsing
  perf env: Remove global perf_env
  perf trace: Avoid global perf_env with evsel__env
  perf auxtrace: Pass perf_env from session through to mmap read
  perf machine: Explicitly pass in host perf_env
  perf bench synthesize: Avoid use of global perf_env
  perf top: Make perf_env locally scoped
  ...
2025-08-01 16:55:47 -07:00
Namhyung Kim 6235ce7774 perf record: Cache build-ID of hit DSOs only
It post-processes samples to find which DSO has samples.  Based on that
info, it can save used DSOs in the build-ID cache directory.  But for
some reason, it saves all DSOs without checking the hit mark.  Skipping
unused DSOs can give some speedup especially with --buildid-mmap being
default.

On my idle machine, `time perf record -a sleep 1` goes down from 3 sec
to 1.5 sec with this change.

Fixes: e29386c8f7 ("perf record: Add --buildid-mmap option to enable PERF_RECORD_MMAP2's build id")
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250731070330.57116-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-31 10:46:46 -07:00
Linus Torvalds 63eb28bb14 ARM:
- Host driver for GICv5, the next generation interrupt controller for
   arm64, including support for interrupt routing, MSIs, interrupt
   translation and wired interrupts.
 
 - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
   GICv5 hardware, leveraging the legacy VGIC interface.
 
 - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
   userspace to disable support for SGIs w/o an active state on hardware
   that previously advertised it unconditionally.
 
 - Map supporting endpoints with cacheable memory attributes on systems
   with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
   maintenance on the address range.
 
 - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
   hypervisor to inject external aborts into an L2 VM and take traps of
   masked external aborts to the hypervisor.
 
 - Convert more system register sanitization to the config-driven
   implementation.
 
 - Fixes to the visibility of EL2 registers, namely making VGICv3 system
   registers accessible through the VGIC device instead of the ONE_REG
   vCPU ioctls.
 
 - Various cleanups and minor fixes.
 
 LoongArch:
 
 - Add stat information for in-kernel irqchip
 
 - Add tracepoints for CPUCFG and CSR emulation exits
 
 - Enhance in-kernel irqchip emulation
 
 - Various cleanups.
 
 RISC-V:
 
 - Enable ring-based dirty memory tracking
 
 - Improve perf kvm stat to report interrupt events
 
 - Delegate illegal instruction trap to VS-mode
 
 - MMU improvements related to upcoming nested virtualization
 
 s390x
 
 - Fixes
 
 x86:
 
 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC,
   PIC, and PIT emulation at compile time.
 
 - Share device posted IRQ code between SVM and VMX and
   harden it against bugs and runtime errors.
 
 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).
 
 - For MMIO stale data mitigation, track whether or not a vCPU has access to
   (host) MMIO based on whether the page tables have MMIO pfns mapped; using
   VFIO is prone to false negatives
 
 - Rework the MSR interception code so that the SVM and VMX APIs are more or
   less identical.
 
 - Recalculate all MSR intercepts from scratch on MSR filter changes,
   instead of maintaining shadow bitmaps.
 
 - Advertise support for LKGS (Load Kernel GS base), a new instruction
   that's loosely related to FRED, but is supported and enumerated
   independently.
 
 - Fix a user-triggerable WARN that syzkaller found by setting the vCPU
   in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU
   into VMX Root Mode (post-VMXON).  Trying to detect every possible path
   leading to architecturally forbidden states is hard and even risks
   breaking userspace (if it goes from valid to valid state but passes
   through invalid states), so just wait until KVM_RUN to detect that
   the vCPU state isn't allowed.
 
 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
   APERF/MPERF reads, so that a "properly" configured VM can access
   APERF/MPERF.  This has many caveats (APERF/MPERF cannot be zeroed
   on vCPU creation or saved/restored on suspend and resume, or preserved
   over thread migration let alone VM migration) but can be useful whenever
   you're interested in letting Linux guests see the effective physical CPU
   frequency in /proc/cpuinfo.
 
 - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
   created, as there's no known use case for changing the default
   frequency for other VM types and it goes counter to the very reason
   why the ioctl was added to the vm file descriptor.  And also, there
   would be no way to make it work for confidential VMs with a "secure"
   TSC, so kill two birds with one stone.
 
 - Dynamically allocation the shadow MMU's hashed page list, and defer
   allocating the hashed list until it's actually needed (the TDP MMU
   doesn't use the list).
 
 - Extract many of KVM's helpers for accessing architectural local APIC
   state to common x86 so that they can be shared by guest-side code for
   Secure AVIC.
 
 - Various cleanups and fixes.
 
 x86 (Intel):
 
 - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
   Failure to honor FREEZE_IN_SMM can leak host state into guests.
 
 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent
   L1 from running L2 with features that KVM doesn't support, e.g. BTF.
 
 x86 (AMD):
 
 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
   nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty
   much a static condition and therefore should never happen, but still).
 
 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
 
 - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.
 
 - Extend enable_ipiv module param support to SVM, by simply leaving
   IsRunning clear in the vCPU's physical ID table entry.
 
 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.
 
 - Request GA Log interrupts if and only if the target vCPU is blocking,
   i.e. only if KVM needs a notification in order to wake the vCPU.
 
 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
   vCPU's CPUID model.
 
 - Accept any SNP policy that is accepted by the firmware with respect to
   SMT and single-socket restrictions.  An incompatible policy doesn't put
   the kernel at risk in any way, so there's no reason for KVM to care.
 
 - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
   use WBNOINVD instead of WBINVD when possible for SEV cache maintenance.
 
 - When reclaiming memory from an SEV guest, only do cache flushes on CPUs
   that have ever run a vCPU for the guest, i.e. don't flush the caches for
   CPUs that can't possibly have cache lines with dirty, encrypted data.
 
 Generic:
 
 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large
   numbers of VMs.  Such use cases typically don't actually use irqbypass,
   but eliminating the pointless registration is a future problem to
   solve as it likely requires new uAPI.
 
 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.
 
 - Decouple device posted IRQs from VFIO device assignment, as binding a VM
   to a VFIO group is not a requirement for enabling device posted IRQs.
 
 - Clean up and document/comment the irqfd assignment code.
 
 - Disallow binding multiple irqfds to an eventfd with a priority waiter,
   i.e.  ensure an eventfd is bound to at most one irqfd through the entire
   host, and add a selftest to verify eventfd:irqfd bindings are globally
   unique.
 
 - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
   related to private <=> shared memory conversions.
 
 - Drop guest_memfd's .getattr() implementation as the VFS layer will call
   generic_fillattr() if inode_operations.getattr is NULL.
 
 - Fix issues with dirty ring harvesting where KVM doesn't bound the
   processing of entries in any way, which allows userspace to keep KVM
   in a tight loop indefinitely.
 
 - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking,
   now that KVM no longer uses assigned_device_count as a heuristic for
   either irqbypass usage or MDS mitigation.
 
 Selftests:
 
 - Fix a comment typo.
 
 - Verify KVM is loaded when getting any KVM module param so that attempting
   to run a selftest without kvm.ko loaded results in a SKIP message about
   KVM not being loaded/enabled (versus some random parameter not existing).
 
 - Skip tests that hit EACCES when attempting to access a file, and rpint
   a "Root required?" help message.  In most cases, the test just needs to
   be run with elevated permissions.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmiKXMgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMhMQf/QDhC/CP1aGXph2whuyeD2NMqPKiU
 9KdnDNST+ftPwjg9QxZ9mTaa8zeVz/wly6XlxD9OQHy+opM1wcys3k0GZAFFEEQm
 YrThgURdzEZ3nwJZgb+m0t4wjJQtpiFIBwAf7qq6z1VrqQBEmHXJ/8QxGuqO+BNC
 j5q/X+q6KZwehKI6lgFBrrOKWFaxqhnRAYfW6rGBxRXxzTJuna37fvDpodQnNceN
 zOiq+avfriUMArTXTqOteJNKU0229HjiPSnjILLnFQ+B3akBlwNG0jk7TMaAKR6q
 IZWG1EIS9q1BAkGXaw6DE1y6d/YwtXCR5qgAIkiGwaPt5yj9Oj6kRN2Ytw==
 =j2At
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Host driver for GICv5, the next generation interrupt controller for
     arm64, including support for interrupt routing, MSIs, interrupt
     translation and wired interrupts

   - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
     GICv5 hardware, leveraging the legacy VGIC interface

   - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
     userspace to disable support for SGIs w/o an active state on
     hardware that previously advertised it unconditionally

   - Map supporting endpoints with cacheable memory attributes on
     systems with FEAT_S2FWB and DIC where KVM no longer needs to
     perform cache maintenance on the address range

   - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
     guest hypervisor to inject external aborts into an L2 VM and take
     traps of masked external aborts to the hypervisor

   - Convert more system register sanitization to the config-driven
     implementation

   - Fixes to the visibility of EL2 registers, namely making VGICv3
     system registers accessible through the VGIC device instead of the
     ONE_REG vCPU ioctls

   - Various cleanups and minor fixes

  LoongArch:

   - Add stat information for in-kernel irqchip

   - Add tracepoints for CPUCFG and CSR emulation exits

   - Enhance in-kernel irqchip emulation

   - Various cleanups

  RISC-V:

   - Enable ring-based dirty memory tracking

   - Improve perf kvm stat to report interrupt events

   - Delegate illegal instruction trap to VS-mode

   - MMU improvements related to upcoming nested virtualization

  s390x

   - Fixes

  x86:

   - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
     APIC, PIC, and PIT emulation at compile time

   - Share device posted IRQ code between SVM and VMX and harden it
     against bugs and runtime errors

   - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
     O(1) instead of O(n)

   - For MMIO stale data mitigation, track whether or not a vCPU has
     access to (host) MMIO based on whether the page tables have MMIO
     pfns mapped; using VFIO is prone to false negatives

   - Rework the MSR interception code so that the SVM and VMX APIs are
     more or less identical

   - Recalculate all MSR intercepts from scratch on MSR filter changes,
     instead of maintaining shadow bitmaps

   - Advertise support for LKGS (Load Kernel GS base), a new instruction
     that's loosely related to FRED, but is supported and enumerated
     independently

   - Fix a user-triggerable WARN that syzkaller found by setting the
     vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
     the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
     possible path leading to architecturally forbidden states is hard
     and even risks breaking userspace (if it goes from valid to valid
     state but passes through invalid states), so just wait until
     KVM_RUN to detect that the vCPU state isn't allowed

   - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
     interception of APERF/MPERF reads, so that a "properly" configured
     VM can access APERF/MPERF. This has many caveats (APERF/MPERF
     cannot be zeroed on vCPU creation or saved/restored on suspend and
     resume, or preserved over thread migration let alone VM migration)
     but can be useful whenever you're interested in letting Linux
     guests see the effective physical CPU frequency in /proc/cpuinfo

   - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
     created, as there's no known use case for changing the default
     frequency for other VM types and it goes counter to the very reason
     why the ioctl was added to the vm file descriptor. And also, there
     would be no way to make it work for confidential VMs with a
     "secure" TSC, so kill two birds with one stone

   - Dynamically allocation the shadow MMU's hashed page list, and defer
     allocating the hashed list until it's actually needed (the TDP MMU
     doesn't use the list)

   - Extract many of KVM's helpers for accessing architectural local
     APIC state to common x86 so that they can be shared by guest-side
     code for Secure AVIC

   - Various cleanups and fixes

  x86 (Intel):

   - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
     Failure to honor FREEZE_IN_SMM can leak host state into guests

   - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
     prevent L1 from running L2 with features that KVM doesn't support,
     e.g. BTF

  x86 (AMD):

   - WARN and reject loading kvm-amd.ko instead of panicking the kernel
     if the nested SVM MSRPM offsets tracker can't handle an MSR (which
     is pretty much a static condition and therefore should never
     happen, but still)

   - Fix a variety of flaws and bugs in the AVIC device posted IRQ code

   - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
     supports) instead of rejecting vCPU creation

   - Extend enable_ipiv module param support to SVM, by simply leaving
     IsRunning clear in the vCPU's physical ID table entry

   - Disable IPI virtualization, via enable_ipiv, if the CPU is affected
     by erratum #1235, to allow (safely) enabling AVIC on such CPUs

   - Request GA Log interrupts if and only if the target vCPU is
     blocking, i.e. only if KVM needs a notification in order to wake
     the vCPU

   - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
     the vCPU's CPUID model

   - Accept any SNP policy that is accepted by the firmware with respect
     to SMT and single-socket restrictions. An incompatible policy
     doesn't put the kernel at risk in any way, so there's no reason for
     KVM to care

   - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
     use WBNOINVD instead of WBINVD when possible for SEV cache
     maintenance

   - When reclaiming memory from an SEV guest, only do cache flushes on
     CPUs that have ever run a vCPU for the guest, i.e. don't flush the
     caches for CPUs that can't possibly have cache lines with dirty,
     encrypted data

  Generic:

   - Rework irqbypass to track/match producers and consumers via an
     xarray instead of a linked list. Using a linked list leads to
     O(n^2) insertion times, which is hugely problematic for use cases
     that create large numbers of VMs. Such use cases typically don't
     actually use irqbypass, but eliminating the pointless registration
     is a future problem to solve as it likely requires new uAPI

   - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
     "void *", to avoid making a simple concept unnecessarily difficult
     to understand

   - Decouple device posted IRQs from VFIO device assignment, as binding
     a VM to a VFIO group is not a requirement for enabling device
     posted IRQs

   - Clean up and document/comment the irqfd assignment code

   - Disallow binding multiple irqfds to an eventfd with a priority
     waiter, i.e. ensure an eventfd is bound to at most one irqfd
     through the entire host, and add a selftest to verify eventfd:irqfd
     bindings are globally unique

   - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
     related to private <=> shared memory conversions

   - Drop guest_memfd's .getattr() implementation as the VFS layer will
     call generic_fillattr() if inode_operations.getattr is NULL

   - Fix issues with dirty ring harvesting where KVM doesn't bound the
     processing of entries in any way, which allows userspace to keep
     KVM in a tight loop indefinitely

   - Kill off kvm_arch_{start,end}_assignment() and x86's associated
     tracking, now that KVM no longer uses assigned_device_count as a
     heuristic for either irqbypass usage or MDS mitigation

  Selftests:

   - Fix a comment typo

   - Verify KVM is loaded when getting any KVM module param so that
     attempting to run a selftest without kvm.ko loaded results in a
     SKIP message about KVM not being loaded/enabled (versus some random
     parameter not existing)

   - Skip tests that hit EACCES when attempting to access a file, and
     print a "Root required?" help message. In most cases, the test just
     needs to be run with elevated permissions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
  Documentation: KVM: Use unordered list for pre-init VGIC registers
  RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
  RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
  RISC-V: perf/kvm: Add reporting of interrupt events
  RISC-V: KVM: Enable ring-based dirty memory tracking
  RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
  RISC-V: KVM: Delegate illegal instruction fault to VS mode
  RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
  RISC-V: KVM: Factor-out g-stage page table management
  RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
  RISC-V: KVM: Introduce struct kvm_gstage_mapping
  RISC-V: KVM: Factor-out MMU related declarations into separate headers
  RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
  RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
  RISC-V: KVM: Don't flush TLB when PTE is unchanged
  RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
  RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
  RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
  RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
  KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
  ...
2025-07-30 17:14:01 -07:00
Jan Polensky 022245067f perf test: Ensure lock contention using pipe mode
The 'kernel lock contention analysis test' requires reliable triggering
of lock contention. On some systems, previous benchmark calls failed to
generate sufficient contention due to low system activity or resource
limits.

This patch adds the -p (pipe) option to all calls of perf bench sched
messaging, ensuring consistent lock contention without relying on
socket-based communication.

Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Jan Polensky <japo@linux.ibm.com>
Link: https://lore.kernel.org/r/20250725170801.3176678-1-japo@linux.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-30 13:38:53 -07:00
Arnaldo Carvalho de Melo 59edbec7a5 perf python: Stop using deprecated PyUnicode_AsString()
As noticed while building for Fedora 43:

    GEN     /tmp/build/perf/python/perf.cpython-314-x86_64-linux-gnu.so
  /git/perf-6.16.0-rc3/tools/perf/util/python.c: In function ‘get_tracepoint_field’:
  /git/perf-6.16.0-rc3/tools/perf/util/python.c:340:9: error: ‘_PyUnicode_AsString’ is deprecated [-Werror=deprecated-declarations]
    340 |         const char *str = _PyUnicode_AsString(PyObject_Str(attr_name));
        |         ^~~~~
  In file included from /usr/include/python3.14/unicodeobject.h:1022,
                   from /usr/include/python3.14/Python.h:89,
                   from /git/perf-6.16.0-rc3/tools/perf/util/python.c:2:
  /usr/include/python3.14/cpython/unicodeobject.h:648:1: note: declared here
    648 | _PyUnicode_AsString(PyObject *unicode)
        | ^~~~~~~~~~~~~~~~~~~
  cc1: all warnings being treated as errors
  error: command '/usr/bin/gcc' failed with exit code 1

Use PyUnicode_AsUTF8() instead and also check if PyObject_Str() fails
before doing so.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/aIofXNK8QLtLIaI3@x1
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-30 10:15:17 -07:00
Quan Zhou 3b7270c766 RISC-V: perf/kvm: Add reporting of interrupt events
For `perf kvm stat` on the RISC-V, in order to avoid the
occurrence of `UNKNOWN` event names, interrupts should be
reported in addition to exceptions.

testing without patch:

Event name                    Samples  Sample%       Time(ns)
---------------------------  --------  --------  ------------
STORE_GUEST_PAGE_FAULT        1496461   53.00%    889612544
UNKNOWN                        887514   31.00%    272857968
LOAD_GUEST_PAGE_FAULT          305164   10.00%    189186331
VIRTUAL_INST_FAULT              70625    2.00%    134114260
SUPERVISOR_SYSCALL              32014    1.00%     58577110
INST_GUEST_PAGE_FAULT               1    0.00%         2545

testing with patch:

Event name                    Samples  Sample%       Time(ns)
---------------------------  --------  --------  ------------
IRQ_S_TIMER                   211271    58.00%  738298680600
EXC_STORE_GUEST_PAGE_FAULT    111279    30.00%  130725914800
EXC_LOAD_GUEST_PAGE_FAULT      22039     6.00%   25441480600
EXC_VIRTUAL_INST_FAULT          8913     2.00%   21015381600
IRQ_VS_EXT                      4748     1.00%   10155464300
IRQ_S_EXT                       2802     0.00%   13288775800
IRQ_S_SOFT                      1998     0.00%    4254129300

Signed-off-by: Quan Zhou <zhouquan@iscas.ac.cn>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/9693132df4d0f857b8be3a75750c36b40213fcc0.1726211632.git.zhouquan@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:28:25 +05:30
Ian Rogers b91a9abbf4 perf list: Skip ABI PMUs when printing pmu values
Avoid printing tracepoint, legacy and software events when listing for
the pmu option. Add the PMU type to the print_event callbacks to ease
detection.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250725185202.68671-8-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-26 16:31:43 -07:00
Ian Rogers 55c09681cc perf list: Remove tracepoint printing code
Now that the tp_pmu can iterate and describe events remove the custom
tracepoint printing logic, this avoids perf list showing the
tracepoint events twice.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250725185202.68671-7-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-26 16:31:43 -07:00
Ian Rogers 45b6e281cb perf tp_pmu: Add event APIs
Add event APIs for the tracepoint PMU allowing things like perf list
to function using it. For perf list add the tracepoint format in the
long description (shown with -v).

  $ sudo perf list -v tracepoint

  List of pre-defined events (to be used in -e or -M):

    alarmtimer:alarmtimer_cancel                       [Tracepoint event]
         [name: alarmtimer_cancel
          ID: 416
          format:
          field:unsigned short common_type; offset:0; size:2; signed:0;
          field:unsigned char common_flags; offset:2; size:1; signed:0;
          field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
          field:int common_pid; offset:4; size:4; signed:1;
          field:void * alarm; offset:8; size:8; signed:0;
          field:unsigned char alarm_type; offset:16; size:1; signed:0;
          field:s64 expires; offset:24; size:8; signed:1;
          field:s64 now; offset:32; size:8; signed:1;
          print fmt: "alarmtimer:%p type:%s expires:%llu now:%llu",REC->alarm,__print_flags((1 << REC->alarm_type)," | ",{ 1 << 0,
          "REALTIME" },{ 1 << 1,"BOOTTIME" },{ 1 << 3,"REALTIME Freezer" },{ 1 << 4,"BOOTTIME Freezer" }),REC->expires,REC->now
          . Unit: tracepoint]
    alarmtimer:alarmtimer_fired                        [Tracepoint event]
         [name: alarmtimer_fired
          ID: 418
          ...

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250725185202.68671-6-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-26 16:31:43 -07:00
Ian Rogers d002aab87d perf tp_pmu: Factor existing tracepoint logic to new file
Start the creation of a tracepoint PMU abstraction. Tracepoint events
don't follow the regular sysfs perf conventions. Eventually the new
PMU abstraction will bridge the gap so tracepoint events look more
like regular perf ones.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250725185202.68671-5-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-26 16:31:43 -07:00
Ian Rogers 6e9fa4131a perf parse-events: Remove non-json software events
Remove the hard coded encodings from parse-events. This has the
consequence that software events are matched using the sysfs/json
priority, will be case insensitive and will be wildcarded across PMUs.
As there were software and hardware types in the parsing code, the
removal means software vs hardware logic can be removed and hardware
assumed.

Now the perf json provides detailed descriptions of software events,
remove the previous listing support that didn't contain event
descriptions. When globbing is required for the "sw" option in perf
list, use string PMU globbing as was done previously for the tool PMU.

The output of `perf list sw` command changed like this.

Before:
  List of pre-defined events (to be used in -e or -M):

    alignment-faults                                   [Software event]
    bpf-output                                         [Software event]
    cgroup-switches                                    [Software event]
    context-switches OR cs                             [Software event]
    cpu-clock                                          [Software event]
    cpu-migrations OR migrations                       [Software event]
    dummy                                              [Software event]
    emulation-faults                                   [Software event]
    major-faults                                       [Software event]
    minor-faults                                       [Software event]
    page-faults OR faults                              [Software event]
    task-clock                                         [Software event]

After:
  List of pre-defined events (to be used in -e or -M):

  software:
    alignment-faults
         [Number of kernel handled memory alignment faults. Unit: software]
    bpf-output
         [An event used by BPF programs to write to the perf ring buffer. Unit: software]
    cgroup-switches
         [Number of context switches to a task in a different cgroup. Unit: software]
    context-switches
         [Number of context switches [This event is an alias of cs]. Unit: software]
    cpu-clock
         [Per-CPU high-resolution timer based event. Unit: software]
    cpu-migrations
         [Number of times a process has migrated to a new CPU [This event is an alias of migrations]. Unit: software]
    cs
         [Number of context switches [This event is an alias of context-switches]. Unit: software]
    dummy
         [A placeholder event that doesn't count anything. Unit: software]
    ...

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250725185202.68671-4-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-26 16:31:43 -07:00
Ian Rogers 9957d8c801 perf jevents: Add common software event json
Add json for software events so that in perf list the events can have
a description.  Common json exists for the tool PMU but it has no
sysfs equivalent. Modify the map_for_pmu code to return the common map
(rather than an architecture specific one) when a PMU with a common
name is being looked for, this allows the events to be found.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250725185202.68671-3-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-26 16:31:43 -07:00
Chen Pei af470fb532 perf tools: Remove libtraceevent in .gitignore
The libtraceevent has been removed from the source tree, and .gitignore
needs to be updated as well.

Fixes: 4171925aa9 ("tools lib traceevent: Remove libtraceevent")
Signed-off-by: Chen Pei <cp0613@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250726111532.8031-1-cp0613@linux.alibaba.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-26 15:38:10 -07:00
Blake Jones d89c58068a perf test: Fix comment ordering
The previous commit that introduced this test overlooked a behavior of
"perf test list", causing it to print "SPDX-License-Identifier: GPL-2.0"
as a description for that test.  This reorders the comments to fix that
issue.

Fixes: edf2cadf01 ("perf test: add test for BPF metadata collection")
Signed-off-by: Blake Jones <blakejones@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250726004023.3466563-1-blakejones@google.com
[ update the commit message a little bit ]
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-26 15:34:48 -07:00
Ian Rogers 6e19839a80 perf sort: Use perf_env to set arch sort keys and header
Previously arch_support_sort_key and arch_perf_header_entry used a
weak symbol to compile as appropriate for x86 and powerpc. A
limitation to this is that the handling of a data file could vary in
cross-platform development. Change to using the perf_env of the
current session to determine the architecture kind and set the sort
key and header entries as appropriate.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-23-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:58 -07:00
Ian Rogers a563c9f3bb perf test: Move PERF_SAMPLE_WEIGHT_STRUCT parsing to common test
test__x86_sample_parsing is identical to test__sample_parsing except
it explicitly tested PERF_SAMPLE_WEIGHT_STRUCT. Now the parsing code
is common move the PERF_SAMPLE_WEIGHT_STRUCT to the common sample
parsing test and remove the x86 version.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-22-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:58 -07:00
Ian Rogers 8882095b1d perf sample: Remove arch notion of sample parsing
By definition arch sample parsing and synthesis will inhibit certain
kinds of cross-platform record then analysis (report, script,
etc.). Remove arch_perf_parse_sample_weight and
arch_perf_synthesize_sample_weight replacing with a common
implementation. Combine perf_sample p_stage_cyc and retire_lat as
weight3 to capture the differing uses regardless of compiled for
architecture.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-21-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:58 -07:00
Ian Rogers 525a599bad perf env: Remove global perf_env
The global perf_env was used for the host, but if a perf_env wasn't
easy to come by it was used in a lot of places where potentially
recorded and host data could be confused. Remove the global variable
as now the majority of accesses retrieve the perf_env for the host
from the session.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-20-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:58 -07:00
Ian Rogers 003a86bce0 perf trace: Avoid global perf_env with evsel__env
There is no session in perf trace unless in replay mode, so in host
mode no session can be associated with the evlist. If the evsel__env
call fails resort to the host_env that's part of the trace. Remove
errno_to_name as it becomes a called once 1-line function once the
argument is turned into a perf_env, just call perf_env__arch_strerrno
directly.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-19-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:58 -07:00
Ian Rogers 69ac7472d2 perf auxtrace: Pass perf_env from session through to mmap read
auxtrace_mmap__read and auxtrace_mmap__read_snapshot end up calling
 `evsel__env(NULL)` which returns the global perf_env variable for the
 host. Their only call is in perf record. Rather than use the global
 variable pass through the perf_env for `perf record`.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-18-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:58 -07:00
Ian Rogers e481066388 perf machine: Explicitly pass in host perf_env
When creating a machine for the host explicitly pass in a scoped
perf_env. This removes a use of the global perf_env.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-17-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:57 -07:00
Ian Rogers aa91baa09b perf bench synthesize: Avoid use of global perf_env
The benchmark doesn't use a data file and so the header perf_env isn't
used. Stack allocate a host perf_env for use to avoid the use of the
global perf_env.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-16-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:57 -07:00
Ian Rogers aaa23571fe perf top: Make perf_env locally scoped
The use of the global host perf_env variable is potentially
inconsistent within the code. Switch perf top to using a locally
scoped variable that is generally accessed through the session.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-15-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:57 -07:00
Ian Rogers 740f7ba1e3 perf session: Add host_env argument to perf_session__new
When creating a perf_session the host perf_env may or may not want to
be used. For example, `perf top` uses a host perf_env while `perf
inject` does not. Add a host_env argument to perf_session__new so that
sessions requiring a host perf_env can pass it in. Currently if none
is specified the global perf_env variable is used, but this will
change in later patches.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-14-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:57 -07:00
Ian Rogers 5a156353e5 perf test: Avoid use perf_env
The perf_env global variable holds the host perf_env data but its use
is hit and miss. Switch to using local perf_env variables and ensure
scoped perf_env__init and perf_env__exit. This loses command line
setting of the perf_env, but this doesn't matter for tests. So the
perf_env is fully initialized, clear it with memset in perf_env__init.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-13-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:57 -07:00
Ian Rogers b743a1368d perf header: Clean up use of perf_env
Always use the perf_env from the feat_fd's perf_header. Cache the
value on entry to a function in `env` and use `env->` consistently in
the code. Ensure the header is initialized for use in
perf_session__do_write_header.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-12-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:57 -07:00
Ian Rogers 57ddb9cbb5 perf evlist: Change env variable to session
The session holds a perf_env pointer env. In UI code container_of is
used to turn the env to a session, but this assumes the session
header's env is in use. Rather than a dubious container_of, hold the
session in the evlist and derive the env from the session with
evsel__env, perf_session__env, etc.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-11-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:56 -07:00
Ian Rogers c3e5b9ec96 perf session: Add accessor for session->header.env
The perf_env from the header in the session is frequently accessed,
add an accessor function rather than access directly. Cache the value
to avoid repeated calls. No behavioral change.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-10-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:56 -07:00
Ian Rogers 53b00ff358 perf record: Make --buildid-mmap the default
Support for build IDs in mmap2 perf events has been present since
Linux v5.12:
https://lore.kernel.org/lkml/20210219194619.1780437-1-acme@kernel.org/
Build ID mmap events don't avoid the need to inject build IDs for DSO
touched by samples as the build ID cache is populated by perf
record. They can avoid some cases of symbol mis-resolution caused by
the file system changing from when a sample occurred and when the DSO
is sought.

Unlike the --buildid-mmap option, this chnage doesn't disable the
build ID cache but it does disable the processing of samples looking
for DSOs to inject build IDs for. To disable the build ID cache the -B
(--no-buildid) option should be used.

Making this option the default was raised on the list in:
https://lore.kernel.org/linux-perf-users/CAP-5=fXP7jN_QrGUcd55_QH5J-Y-FCaJ6=NaHVtyx0oyNh8_-Q@mail.gmail.com/

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-9-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-07-25 10:37:56 -07:00