44242 Commits (680302d191b043cf3abe4076794de10171a4ca93)

Author SHA1 Message Date
Christophe Leroy ac88ee7d2b module: Use set_memory_rox()
A couple of architectures seem concerned about calling set_memory_ro()
and set_memory_x() too frequently and have implemented a version of
set_memory_rox(), see commit 60463628c9 ("x86/mm: Implement native
set_memory_rox()") and commit 22e99fa564 ("s390/mm: implement
set_memory_rox()")

Use set_memory_rox() in modules when STRICT_MODULES_RWX is set.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-01-29 12:00:31 -08:00
Tejun Heo 5797b1c189 workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.

In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.

However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.

While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
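
The multiplication described above can be sketched in user-space C. This is
only an illustration of the arithmetic: effective_max_active() and its
nr_pwqs parameter are hypothetical names, not kernel functions.

```c
#include <assert.h>

/*
 * Upper bound on concurrently active work items of one unbound workqueue
 * when max_active is enforced separately by each pwq. nr_pwqs is the number
 * of pwqs: the node count before the change, the CPU count after it.
 */
unsigned int effective_max_active(unsigned int max_active, unsigned int nr_pwqs)
{
    assert(nr_pwqs > 0);
    return max_active * nr_pwqs;
}
```

For example, a max_active of 16 on a 2-node, 64-CPU machine goes from an
effective limit of 32 to an effective limit of 1024.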

636b927eba ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users that self-regulate their concurrency level, which are the vast
majority; however, there are enough use cases which actually depend on
max_active to prevent the level of concurrency from going bonkers,
including several IO handling workqueues that can issue a work item for
each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low, but as soon as we increase max_active a bit,
we can end up with an unreasonable number of in-flight work items when many
CPUs issue IOs at the same time, i.e. the acceptable lowest max_active is
higher than the acceptable highest max_active.

Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:

- Once max_active enforcement is decoupled from pool boundaries, chaining
  execution after a work item finishes requires inter-pool operations which
  would require lock dancing, which is nasty.

- Sharing a single nr_active count across the whole system can be pretty
  expensive on NUMA machines.

- Per-pwq enforcement had been more or less okay while we were using
  per-node pools.

It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements a system-wide nr_active mechanism with
the following design characteristics:

- To avoid sharing a single counter across multiple nodes, the configured
  max_active is split across nodes according to the proportion of each
  workqueue's online effective CPUs per node. e.g. A node with twice as many
  online effective CPUs will get twice as large a portion of max_active.

- Workqueue used to be able to process a chain of interdependent work items
  which is as long as max_active. We can't do this anymore as max_active is
  distributed across the nodes. Instead, a new parameter min_active is
  introduced which determines the minimum level of concurrency within a node
  regardless of how max_active distribution comes out to be.

  It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
  This can lead to higher effective max_active than configured and also
  deadlocks if a workqueue was depending on being able to handle chains of
  interdependent work items that are longer than 8.

  I believe these should be fine given that the number of CPUs in each NUMA
  node is usually higher than 8 and work item chain longer than 8 is pretty
  unlikely. However, if these assumptions turn out to be wrong, we'll need
  to add an interface to adjust min_active.

- Each unbound wq has an array of struct wq_node_nr_active which tracks
  per-node nr_active. When its pwq wants to run a work item, it has to
  obtain the matching node's nr_active. If over the node's max_active, the
  pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
  the completion path round-robins the pending pwqs activating the first
  inactive work item of each, which involves some pool lock dancing and
  kicking other pools. It's not the simplest code but doesn't look too bad.
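
The per-node split described in the list above can be sketched as follows.
This is a user-space approximation, not the patch itself: node_max_active()
and clamp_uint() are hypothetical helpers, and rounding the proportional
share up is an assumption. Only the value 8 for WQ_DFL_MIN_ACTIVE and the
"smaller of max_active and WQ_DFL_MIN_ACTIVE" rule come from the text.

```c
#include <assert.h>

#define WQ_DFL_MIN_ACTIVE 8    /* value stated in the commit message */

static unsigned int clamp_uint(unsigned int v, unsigned int lo, unsigned int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Share of max_active given to a node holding node_cpus of the workqueue's
 * total_cpus online effective CPUs, never below min_active.
 */
unsigned int node_max_active(unsigned int max_active, unsigned int node_cpus,
                             unsigned int total_cpus)
{
    unsigned int min_active = max_active < WQ_DFL_MIN_ACTIVE ?
                              max_active : WQ_DFL_MIN_ACTIVE;
    unsigned int share;

    assert(total_cpus > 0 && node_cpus <= total_cpus);

    /* Proportional share, rounded up (assumption). */
    share = (max_active * node_cpus + total_cpus - 1) / total_cpus;
    return clamp_uint(share, min_active, max_active);
}
```

With max_active of 96 and nodes of 32 and 16 CPUs, the nodes get 64 and 32.
A node with 1 of 64 CPUs and max_active of 16 is raised to min_active, which
is how the sum over nodes can exceed the configured max_active.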

v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().

    - wq_adjust_max_active() is now protected by wq->mutex instead of
      wq_pool_mutex.

v3: - wq_node_max_active() used to calculate per-node max_active on the fly
      based on system-wide CPU online states. Lai pointed out that this can
      lead to skewed distributions for workqueues with restricted cpumasks.
      Update the max_active distribution to use per-workqueue effective
      online CPU counts instead of system-wide and cache the calculation
      results in node_nr_active->max.

v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:25 -10:00
Tejun Heo 91ccc6e723 workqueue: Introduce struct wq_node_nr_active
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create an undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.

As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structure allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.

Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids which is used to track nr_active for pools which don't have an
associated NUMA node, such as the default fallback system-wide pool.
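
A minimal user-space sketch of that layout, assuming a flexible array with
nr_node_ids + 1 slots where the last slot serves pools without a node. The
struct and function names are illustrative stand-ins, and NO_NODE is a
hypothetical marker for "no NUMA node".

```c
#include <assert.h>
#include <stdlib.h>

#define NO_NODE (-1)    /* hypothetical marker for "no NUMA node" */

struct node_nr_active {
    int nr;             /* active work items on this node */
};

struct wq_sketch {
    int nr_node_ids;
    struct node_nr_active nodes[];    /* nr_node_ids + 1 entries */
};

struct wq_sketch *wq_alloc(int nr_node_ids)
{
    struct wq_sketch *wq;

    assert(nr_node_ids > 0);
    /* One extra slot at index nr_node_ids for node-less pools. */
    wq = calloc(1, sizeof(*wq) +
                   (size_t)(nr_node_ids + 1) * sizeof(wq->nodes[0]));
    if (wq)
        wq->nr_node_ids = nr_node_ids;
    return wq;
}

struct node_nr_active *wq_node_nr_active(struct wq_sketch *wq, int node)
{
    if (node == NO_NODE)
        node = wq->nr_node_ids;
    assert(node >= 0 && node <= wq->nr_node_ids);
    return &wq->nodes[node];
}
```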

This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.

v4: - Fixed out-of-bound access when freeing per-cpu workqueues.

v3: - Use flexible array for wq->node_nr_active as suggested by Lai.

v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

    - Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
      pwq->max_active check. Restored. As the next patch replaces the
      max_active enforcement mechanism, this doesn't change the end result.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo dd6c3c5441 workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need to kick off an inactive work item in another pool.

This patch moves the pwq_dec_nr_in_flight() call in try_to_grab_pending() to
the end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().

This shouldn't cause any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo 9f66cff212 workqueue: RCU protect wq->dfl_pwq and implement accessors for it
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is only accessed to install it into wq->cpu_pwq
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.

- Make wq->dfl_pwq RCU protected.

- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
  and ->cpu_pwq. The former returns the double pointer that can be used to
  access and update the pwqs. The latter performs a locking check and
  dereferences the double pointer.

- pwq accesses and updates are converted to use unbound_pwq[_slot]().
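
The two accessors can be sketched in user space as below. This leaves out
RCU and the locking check entirely, uses a fixed-size array instead of
per-cpu storage, and assumes that a negative cpu selects the default slot.
All of these are simplifications for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4    /* hypothetical fixed CPU count */

struct pwq {
    int id;
};

struct wq_sketch {
    struct pwq *dfl_pwq;
    struct pwq *cpu_pwq[NR_CPUS_SKETCH];
};

/* Return the slot holding the pwq pointer; cpu < 0 selects ->dfl_pwq. */
struct pwq **unbound_pwq_slot(struct wq_sketch *wq, int cpu)
{
    assert(cpu < NR_CPUS_SKETCH);
    return cpu >= 0 ? &wq->cpu_pwq[cpu] : &wq->dfl_pwq;
}

/* Dereference the slot; the kernel version would do this under RCU. */
struct pwq *unbound_pwq(struct wq_sketch *wq, int cpu)
{
    return *unbound_pwq_slot(wq, cpu);
}
```

Having one slot accessor means installing a pwq and reading it go through
the same code path for both fields.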

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo c5404d4e6d workqueue: Make wq_adjust_max_active() round-robin pwqs while activating
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once activating all
that could be activated. While this makes sense with per-pwq nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.

In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.
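
A rough user-space sketch of round-robin activation, assuming a shared
budget of activations that stands in for the shared nr_active limit. The
struct layout and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct pwq {
    int nr_inactive;    /* work items waiting to be activated */
    int nr_active;      /* work items already activated */
};

/* Activate one inactive item if the shared budget allows it. */
static bool activate_first_inactive(struct pwq *pwq, int *budget)
{
    if (pwq->nr_inactive == 0 || *budget == 0)
        return false;
    pwq->nr_inactive--;
    pwq->nr_active++;
    (*budget)--;
    return true;
}

/* Visit the pwqs repeatedly, activating at most one item per visit. */
void activate_round_robin(struct pwq *pwqs, int nr_pwqs, int budget)
{
    bool activated;

    assert(nr_pwqs >= 0 && budget >= 0);
    do {
        activated = false;
        for (int i = 0; i < nr_pwqs; i++)
            if (activate_first_inactive(&pwqs[i], &budget))
                activated = true;
    } while (activated);
}
```

With two pwqs holding five inactive items each and a budget of four, each
pwq gets two activations instead of the first pwq taking all four.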

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo 1c270b79ce workqueue: Move nr_active handling into helpers
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.

- pwq_tryinc_nr_active() is added. It increments nr_active if under the
  max_active limit and returns a boolean indicating whether the increment
  was successful. Note that the function is structured to accommodate future
  changes. __queue_work() is updated to use the new helper.

- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
  thus no longer assumes that nr_active is under max_active and returns a
  boolean to indicate whether a work item has been activated.

- wq_adjust_max_active() no longer tests directly whether a work item can be
  activated. Instead, it's updated to use the return value of
  pwq_activate_first_inactive() to tell whether a work item has been
  activated.

- Decrementing nr_active and activating the first inactive work item are
  factored into pwq_dec_nr_active().
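
How the helpers listed above fit together can be sketched in user space.
The function names follow the commit message, but the bodies, the counter
standing in for the inactive_works list, and passing max_active as a
parameter are simplifications rather than the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

struct pwq {
    int nr_active;
    int nr_inactive;    /* stand-in for the inactive_works list */
};

/* Take an nr_active count if still under the limit. */
bool pwq_tryinc_nr_active(struct pwq *pwq, int max_active)
{
    bool obtained = pwq->nr_active < max_active;

    if (obtained)
        pwq->nr_active++;
    return obtained;
}

/* Returns true if a work item was activated. */
bool pwq_activate_first_inactive(struct pwq *pwq, int max_active)
{
    if (pwq->nr_inactive == 0 || !pwq_tryinc_nr_active(pwq, max_active))
        return false;
    pwq->nr_inactive--;
    return true;
}

/* Retire one count and hand it to the first inactive item, if any. */
void pwq_dec_nr_active(struct pwq *pwq, int max_active)
{
    assert(pwq->nr_active > 0);
    pwq->nr_active--;
    pwq_activate_first_inactive(pwq, max_active);
}
```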

v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
      now we're calling the function unconditionally from
      pwq_activate_first_inactive().

v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo 4c6380305d workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.

pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
test. This is temporary and will be removed by the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo afa87ce853 workqueue: Factor out pwq_is_empty()
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().
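
A self-contained sketch of the factored-out test. The expression is the one
quoted above. The minimal list_head, list_empty() and pwq_init() here are
user-space stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's circular doubly linked list head. */
struct list_head {
    struct list_head *next, *prev;
};

bool list_empty(const struct list_head *head)
{
    return head->next == head;
}

struct pwq {
    int nr_active;
    struct list_head inactive_works;
};

/* Hypothetical initializer: no active items, empty inactive list. */
void pwq_init(struct pwq *pwq)
{
    assert(pwq);
    pwq->nr_active = 0;
    pwq->inactive_works.next = &pwq->inactive_works;
    pwq->inactive_works.prev = &pwq->inactive_works;
}

bool pwq_is_empty(const struct pwq *pwq)
{
    return !pwq->nr_active && list_empty(&pwq->inactive_works);
}
```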

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo a045a272d8 workqueue: Move pwq->max_active to wq->max_active
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes the max_active update
process more complicated and gets in the way of the planned max_active
semantic updates for unbound workqueues.

This patch moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.

As wq->max_active is updated while holding wq mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking rule WO is added for
it.
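
The access pattern can be approximated in user space as below. The two
macros are simplified stand-ins built on volatile accesses and the GNU
__typeof__ extension, not the kernel definitions, and the mutex the writer
would hold is omitted. The accessor names are hypothetical.

```c
#include <assert.h>

/* Simplified stand-ins: force a single, non-elided access. */
#define READ_ONCE(x)        (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)  (*(volatile __typeof__(x) *)&(x) = (val))

struct wq_sketch {
    int max_active;    /* written under a mutex, read without locking */
};

/* Writer side; the caller is assumed to hold the workqueue mutex. */
void wq_set_max_active(struct wq_sketch *wq, int max_active)
{
    assert(max_active > 0);
    WRITE_ONCE(wq->max_active, max_active);
}

/* Lockless reader side. */
int wq_max_active(struct wq_sketch *wq)
{
    return READ_ONCE(wq->max_active);
}
```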

v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Bartosz Golaszewski aafd753555 genirq/irq_sim: Shrink code by using <linux/cleanup.h> helpers
Use the new __free() mechanism to remove all gotos and simplify the error
paths.
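
The idea behind scope-based cleanup can be sketched in user space with the
GCC/Clang cleanup variable attribute. The auto_free macro, free_charp() and
copy_len() below are hypothetical illustrations and not the helpers from
<linux/cleanup.h>.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Called with a pointer to the variable when it goes out of scope. */
static void free_charp(char **p)
{
    free(*p);
}

#define auto_free __attribute__((cleanup(free_charp)))

/* Returns 0 and stores the length on success, -1 on any error. */
int copy_len(const char *src, size_t limit, size_t *out)
{
    char *buf auto_free = NULL;
    size_t len;

    assert(src && out);
    len = strlen(src);
    if (len > limit)
        return -1;          /* nothing allocated yet; free(NULL) is fine */

    buf = malloc(len + 1);
    if (!buf)
        return -1;
    memcpy(buf, src, len + 1);

    if (buf[0] == '\0')
        return -1;          /* buf is freed automatically, no goto needed */

    *out = len;
    return 0;               /* buf is freed here as well */
}
```

Every return path releases the buffer without an explicit label, which is
what allows the gotos to be dropped.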

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240122124243.44002-5-brgl@bgdev.pl
2024-01-29 11:07:57 +01:00
Linus Torvalds 648f575d5e - Prevent an inconsistent futex operation leading to stale state
exposure

Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Prevent an inconsistent futex operation leading to stale state
   exposure

* tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Prevent the reuse of stale pi_state
2024-01-28 10:38:16 -08:00
Linus Torvalds 0e4363ac1a - Initialize the resend node of each IRQ descriptor, not only the first one

Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Borislav Petkov:

 - Initialize the resend node of each IRQ descriptor, not only the first
   one

* tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Initialize resend_node hlist for all interrupt descriptors
2024-01-28 10:34:55 -08:00
Linus Torvalds 90db544eba - Preserve the number of idle calls and sleep entries across CPU hotplug
events in order to be able to compute correct averages
 
 - Limit the duration of the clocksource watchdog checking interval as
   too long intervals lead to wrongly marking the TSC as unstable

Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Borislav Petkov:

 - Preserve the number of idle calls and sleep entries across CPU
   hotplug events in order to be able to compute correct averages

 - Limit the duration of the clocksource watchdog checking interval as
   too long intervals lead to wrongly marking the TSC as unstable

* tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/sched: Preserve number of idle sleeps across CPU hotplug events
  clocksource: Skip watchdog check for large watchdog intervals
2024-01-28 10:33:14 -08:00
Jakub Kicinski 92046e83c0 bpf-next-for-netdev

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-01-26

We've added 107 non-merge commits during the last 4 day(s) which contain
a total of 101 files changed, 6009 insertions(+), 1260 deletions(-).

The main changes are:

1) Add BPF token support to delegate a subset of BPF subsystem
   functionality from privileged system-wide daemons such as systemd
   through special mount options for userns-bound BPF fs to a trusted
   & unprivileged application. With addressed changes from Christian
   and Linus' reviews, from Andrii Nakryiko.

2) Support registration of struct_ops types from modules which helps
   projects like fuse-bpf that seek to implement a new struct_ops type,
   from Kui-Feng Lee.

3) Add support for retrieval of cookies for perf/kprobe multi links,
   from Jiri Olsa.

4) Bigger batch of prep-work for the BPF verifier to eventually support
   preserving boundaries and tracking scalars on narrowing fills,
   from Maxim Mikityanskiy.

5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help
   with the scenario of SYN floods, from Kuniyuki Iwashima.

6) Add code generation to inline the bpf_kptr_xchg() helper which
   improves performance when stashing/popping the allocated BPF objects,
   from Hou Tao.

7) Extend BPF verifier to track aligned ST stores as imprecise spilled
   registers, from Yonghong Song.

8) Several fixes to BPF selftests around inline asm constraints and
   unsupported VLA code generation, from Jose E. Marchesi.

9) Various updates to the BPF IETF instruction set draft document such
   as the introduction of conformance groups for instructions,
   from Dave Thaler.

10) Fix BPF verifier to make infinite loop detection in is_state_visited()
    exact to catch some too lax spill/fill corner cases,
    from Eduard Zingerman.

11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly
    instead of implicitly for various register types, from Hao Sun.

12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness
    in neighbor advertisement at setup time, from Martin KaFai Lau.

13) Change BPF selftests to skip callback tests for the case when the
    JIT is disabled, from Tiezhu Yang.

14) Add a small extension to libbpf which allows auto-creating
    a map-in-map's inner map, from Andrey Grafin.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits)
  selftests/bpf: Add missing line break in test_verifier
  bpf, docs: Clarify definitions of various instructions
  bpf: Fix error checks against bpf_get_btf_vmlinux().
  bpf: One more maintainer for libbpf and BPF selftests
  selftests/bpf: Incorporate LSM policy to token-based tests
  selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar
  libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar
  selftests/bpf: Add tests for BPF object load with implicit token
  selftests/bpf: Add BPF object loading tests with explicit token passing
  libbpf: Wire up BPF token support at BPF object level
  libbpf: Wire up token_fd into feature probing logic
  libbpf: Move feature detection code into its own file
  libbpf: Further decouple feature checking logic from bpf_object
  libbpf: Split feature detectors definitions from cached results
  selftests/bpf: Utilize string values for delegate_xxx mount options
  bpf: Support symbolic BPF FS delegation mount options
  bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
  bpf,selinux: Allocate bpf_security_struct per BPF token
  selftests/bpf: Add BPF token-enabled tests
  libbpf: Add BPF token support to bpf_prog_load() API
  ...
====================

Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26 21:08:22 -08:00
Tejun Heo e563d0a7cd workqueue: Break up enum definitions and give names to the types
workqueue is collecting different sorts of enums into a single unnamed enum
type which can increase confusion around enum width. Also, unnamed enums
can't be accessed from BPF. Let's break up enum definitions according to
their purposes and give them type names.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26 11:55:50 -10:00
Tejun Heo 6a229b0e2f workqueue: Drop unnecessary kick_pool() in create_worker()
After creating a new worker, create_worker() is calling kick_pool() to wake
up the new worker task. However, as kick_pool() doesn't do anything if there
is no work pending, it also calls wake_up_process() explicitly. There's no
reason to call kick_pool() at all. wake_up_process() is enough by itself.
Drop the unnecessary kick_pool() call.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26 11:55:46 -10:00
Masami Hiramatsu (Google) 0958b33ef5 tracing/trigger: Fix to return error if failed to alloc snapshot
Fix register_snapshot_trigger() to return an error code if it failed to
allocate a snapshot instead of 0 (success). Otherwise, it will register
the snapshot trigger without reporting an error.

Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2

Fixes: 0bbe7f7199 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation")
Cc: stable@vger.kernel.org
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-26 15:10:24 -05:00
Li Zhijian effe6d278e kernel/cpu: Convert snprintf() to sysfs_emit()
Per filesystems/sysfs.rst, show() should only use sysfs_emit()
or sysfs_emit_at() when formatting the value to be returned to user space.

coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().

No functional change intended.

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240116045151.3940401-40-lizhijian@fujitsu.com
2024-01-26 18:25:16 +01:00
Randy Dunlap ef7e585bf4 cpu/hotplug: Delete an extraneous kernel-doc description
struct cpuhp_cpu_state has an extraneous kernel-doc comment for @cpu.
There is no struct member by that name, so remove the comment to
prevent the kernel-doc warning:

  kernel/cpu.c:85: warning: Excess struct member 'cpu' description in 'cpuhp_cpu_state'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240114030615.30441-1-rdunlap@infradead.org
2024-01-26 17:44:42 +01:00
Bartosz Golaszewski 8dab7fd47e genirq/irq_sim: Order headers alphabetically
For better readability and maintenance keep headers in alphabetical
order.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122124243.44002-4-brgl@bgdev.pl
2024-01-26 13:44:48 +01:00
Bartosz Golaszewski 3832f39042 genirq/irq_sim: Remove unused field from struct irq_sim_irq_ctx
The irqnum field is unused. Remove it.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122124243.44002-3-brgl@bgdev.pl
2024-01-26 13:44:48 +01:00
David Hildenbrand 4dca82d141 uprobes: use pagesize-aligned virtual address when replacing pages
uprobes passes an unaligned page mapping address to
folio_add_new_anon_rmap(), which ends up triggering a VM_BUG_ON() we
recently extended in commit 372cbd4d5a ("mm: non-pmd-mappable, large
folios for folio_add_new_anon_rmap()").

Arguably, this is uprobes code doing something wrong; however, for the
time being it would have likely worked in rmap code because
__folio_set_anon() would set folio->index to the same value.

Looking at __replace_page(), we'd also pass slightly wrong values to
mmu_notifier_range_init(), page_vma_mapped_walk(), flush_cache_page(),
ptep_clear_flush() and set_pte_at_notify().  I suspect most of them are
fine, but let's just mark the introducing commit as the one that needs
fixing. I don't think CC stable is warranted.

We'll add more sanity checks in rmap code separately, to make sure that we
always get properly aligned addresses.
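
Aligning an address down to its page boundary can be sketched as below.
page_align_down() is a hypothetical user-space helper, and the page size is
passed in explicitly rather than taken from the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to the start of its page; page_size is a power of two. */
uintptr_t page_align_down(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}
```

With 4096-byte pages, 0x12345 is rounded down to 0x12000, while an address
that is already aligned is returned unchanged.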

Link: https://lkml.kernel.org/r/20240115100731.91007-1-david@redhat.com
Fixes: c517ee744b ("uprobes: __replace_page() should not use page_address_in_vma()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lkml.kernel.org/r/ZaMR2EWN-HvlCfUl@krava
Tested-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25 23:52:20 -08:00
Kui-Feng Lee e6be8cd5d3 bpf: Fix error checks against bpf_get_btf_vmlinux().
In bpf_struct_ops_map_alloc, it needs to check for NULL in the returned
pointer of bpf_get_btf_vmlinux() when CONFIG_DEBUG_INFO_BTF is not set.
ENOTSUPP is used to preserve the same behavior before the
struct_ops kmod support.

In the function check_struct_ops_btf_id(), instead of redoing the
bpf_get_btf_vmlinux() that has already been done in syscall.c, the fix
here is to check for prog->aux->attach_btf_id.
BPF_PROG_TYPE_STRUCT_OPS must require attach_btf_id and syscall.c
guarantees a valid attach_btf as long as attach_btf_id is set.
When attach_btf_id is not set, this patch returns -ENOTSUPP
because it is what the selftest in test_libbpf_probe_prog_types()
and libbpf_probes.c are expecting for feature probing purposes.

Changes from v1:

 - Remove an unnecessary NULL check in check_struct_ops_btf_id()

Reported-by: syzbot+88f0aafe5f950d7489d7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000040d68a060fc8db8c@google.com/
Reported-by: syzbot+1336f3d4b10bcda75b89@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000026353b060fc21c07@google.com/
Fixes: fcc2c1fb06 ("bpf: pass attached BTF to the bpf_struct_ops subsystem")
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240126023113.1379504-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-25 20:49:59 -08:00
Audra Mitchell 8318d6a636 workqueue: Shorten events_freezable_power_efficient name
Since we have set the WQ_NAME_LEN to 32, shorten the name of
events_freezable_power_efficient so that it does not trip the name length
warning when the workqueue is created.
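
The length problem can be checked with a small user-space sketch. It
assumes that WQ_NAME_LEN is the size of a buffer that must also hold the
terminating NUL. wq_name_fits() is a hypothetical helper, not a kernel
function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define WQ_NAME_LEN 32    /* value stated in the commit message */

/* True if name plus its terminating NUL fits in WQ_NAME_LEN bytes. */
bool wq_name_fits(const char *name)
{
    assert(name);
    return strlen(name) < WQ_NAME_LEN;
}
```

"events_freezable_power_efficient" is exactly 32 characters long, so under
that assumption it is one byte too long for the buffer.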

Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-25 09:11:40 -10:00
Linus Torvalds 3cb9871f81 Urgent RCU pull request for v6.8
This commit fixes RCU grace period stalls, which are observed when
 an outgoing CPU's quiescent state reporting results in wakeup of
 one of the grace period kthreads, to complete the grace period. If
 those kthreads have SCHED_FIFO policy, the wake up can indirectly
 arm the RT bandwidth timer to the local offline CPU. Earlier migration
 of the hrtimers from the CPU introduced in commit 5c0930ccaa
 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
 results in this timer getting ignored. If the RCU grace period
 kthreads are waiting for RT bandwidth to be available, they may
 never be actually scheduled, resulting in RCU stall warnings.

Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux

Pull RCU fix from Neeraj Upadhyay:
 "This fixes RCU grace period stalls, which are observed when an
  outgoing CPU's quiescent state reporting results in wakeup of one of
  the grace period kthreads, to complete the grace period.

  If those kthreads have SCHED_FIFO policy, the wake up can indirectly
  arm the RT bandwidth timer to the local offline CPU.

  Earlier migration of the hrtimers from the CPU introduced in commit
  5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU
  earlier") results in this timer getting ignored.

  If the RCU grace period kthreads are waiting for RT bandwidth to be
  available, they may never be actually scheduled, resulting in RCU
  stall warnings"

* tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux:
  rcu: Defer RCU kthreads wakeup when CPU is dying
2024-01-25 10:21:21 -08:00
Kunwu Chan aa13b70908 audit: use KMEM_CACHE() instead of kmem_cache_create()
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-01-25 10:12:22 -05:00
Li zeming 3104d0e94f audit: remove unnecessary assignment in audit_dupe_lsm_field()
The ret variable does not need to be initialized where it is defined, as
it is always assigned before it is used.

Signed-off-by: Li zeming <zeming@nfschina.com>
[PM: rewrite subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-01-25 09:59:27 -05:00
Tim Chen 9a574ea906 tick/sched: Preserve number of idle sleeps across CPU hotplug events
Commit 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs
CPU hotplug") preserved total idle sleep time and iowait sleeptime across
CPU hotplug events.

Similar reasoning applies to the number of idle calls and idle sleeps to
get the proper average of sleep time per idle invocation.

Preserve those fields too.
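
A small worked example (a Python sketch with illustrative field names
and made-up numbers, not the kernel structure) shows why the counters
matter for the average: if the accumulated sleep time survives a
hotplug cycle but the sleep counter restarts, the average is inflated.

```python
from dataclasses import dataclass


@dataclass
class TickStats:
    idle_sleeptime_ns: int  # accumulated idle sleep time
    idle_sleeps: int        # number of idle sleeps


def avg_sleep_ns(stats: TickStats) -> float:
    # Average sleep length per idle sleep; guards against a zero counter.
    if stats.idle_sleeps == 0:
        return 0.0
    return stats.idle_sleeptime_ns / stats.idle_sleeps


# Hypothetical state before hotplug: four sleeps of 2 ms each.
before = TickStats(idle_sleeptime_ns=8_000_000, idle_sleeps=4)

# After hotplug one more 2 ms sleep is accounted.
counter_reset = TickStats(before.idle_sleeptime_ns + 2_000_000, 1)
counter_kept = TickStats(before.idle_sleeptime_ns + 2_000_000,
                         before.idle_sleeps + 1)

print(avg_sleep_ns(counter_reset), avg_sleep_ns(counter_kept))
```

With the counter kept, the average stays at 2 ms; with the counter
reset, the same data reads as a single 10 ms sleep.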

Fixes: 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug")
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com
2024-01-25 09:52:40 +01:00
Jiri Wiesner 6446495535 clocksource: Skip watchdog check for large watchdog intervals
There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:

  clocksource: timekeeping watchdog on CPU373:
  Marking clocksource 'tsc' as unstable because the skew is too large:
  clocksource:   'hpet' wd_nsec: 14523447520
  clocksource:   'tsc'  cs_nsec: 14524115132

The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:

  cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612

The kernel used 200 microseconds for the uncertainty_margin of both the
clocksource and watchdog, resulting in a threshold of 400 microseconds (the
md variable). Both the cs_nsec and the wd_nsec value indicate that the
readout interval was circa 14.5 seconds.  The observed behaviour is that
watchdog checks failed for large readout intervals on 8 NUMA node
machines. This indicates that the size of the skew was directly proportional
to the length of the readout interval on those machines. The measured
clocksource skew, 668 microseconds, was evaluated against a threshold (the
md variable) that is suited for readout intervals of roughly
WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.

The intention of 2e27e793e2 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135 ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use a
clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use default uncertainty_margin. When the
readout interval gets stretched the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is far stricter than the skew with which NTP can deal.

The root causes of the skew being directly proportional to the length of
the readout interval are:

  * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
  * the conversion to nanoseconds is imprecise for large readout intervals

Prevent this by skipping the current watchdog check if the readout
interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
(of the TSC and HPET) corresponds to a limit on clocksource skew of 250
ppm (microseconds of skew per second).  To keep the limit imposed by NTP
(500 microseconds of skew per second) for all possible readout intervals,
the margins would have to be scaled so that the threshold value is
proportional to the length of the actual readout interval.
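
The arithmetic above can be replayed with a short Python sketch (an
illustrative model, not the kernel implementation). The readout values
and the 400 microsecond threshold come from the report; the function
name and the rule "skip when either measured interval exceeds
2 * WATCHDOG_INTERVAL" are assumptions made for the sketch.

```python
NSEC_PER_SEC = 1_000_000_000
NSEC_PER_USEC = 1_000
WATCHDOG_INTERVAL_NS = NSEC_PER_SEC // 2  # HZ >> 1, i.e. 0.5 s


def watchdog_verdict(cs_nsec: int, wd_nsec: int, md_ns: int) -> str:
    # Assumption: a readout is too long when either measured interval
    # exceeds 2 * WATCHDOG_INTERVAL; such a check is skipped, not failed.
    if max(cs_nsec, wd_nsec) > 2 * WATCHDOG_INTERVAL_NS:
        return "skip"
    return "unstable" if abs(cs_nsec - wd_nsec) > md_ns else "ok"


cs_nsec, wd_nsec = 14_524_115_132, 14_523_447_520  # values from the report
md_ns = 2 * 200 * NSEC_PER_USEC                    # 400 us threshold

skew_ns = abs(cs_nsec - wd_nsec)
skew_per_sec_ns = skew_ns * NSEC_PER_SEC / wd_nsec

print(skew_ns, skew_ns > md_ns, watchdog_verdict(cs_nsec, wd_nsec, md_ns))
print(round(skew_per_sec_ns / NSEC_PER_USEC, 1), "us of skew per second")
```

Normalized to the readout interval, the reported skew is roughly 46
microseconds per second, far below the 500 microseconds per second that
NTP tolerates, even though the absolute skew exceeds the threshold.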

As for why the readout interval may get stretched: Since the watchdog is
executed in softirq context the expiration of the watchdog timer can get
severely delayed on account of a ksoftirqd thread not getting to run in a
timely manner. Surely, a system with such belated softirq execution is not
working well and the scheduling issue should be looked into but the
clocksource watchdog should be able to deal with it accordingly.

Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122172350.GA740@incl
2024-01-25 09:13:16 +01:00
Andrii Nakryiko 6c1752e0b6 bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.

Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.

There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.

Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:

  $ sudo mkdir -p /sys/fs/bpf/token
  $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
               -o delegate_cmds=prog_load:MAP_CREATE \
               -o delegate_progs=kprobe \
               -o delegate_attachs=xdp
  $ mount | grep token
  bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
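
As a rough model of the parsing and printing rules described above (a
Python sketch, not the kernel code): names are matched case-insensitively,
values are separated by ':', and output is lower case in bit order. The
table holds only the first six `enum bpf_cmd` values with the BPF_ prefix
stripped; the function names, the 64-bit width of "any", and the exact
hex fallback format are assumptions.

```python
ANY = (1 << 64) - 1  # assumption: "any" sets every bit of a 64-bit mask

# First six values of enum bpf_cmd, BPF_ prefix stripped, lower case.
BPF_CMDS = {
    "map_create": 0,
    "map_lookup_elem": 1,
    "map_update_elem": 2,
    "map_delete_elem": 3,
    "map_get_next_key": 4,
    "prog_load": 5,
}


def parse_delegate(value: str, names: dict) -> int:
    mask = 0
    for token in value.split(":"):
        key = token.lower()
        if key == "any":
            mask |= ANY
        elif key in names:
            mask |= 1 << names[key]
        else:
            mask |= int(token, 0)  # decimal or 0x-prefixed bit mask
    return mask


def show_delegate(mask: int, names: dict) -> str:
    if mask == ANY:
        return "any"
    known = sum(1 << bit for bit in names.values())
    if mask & ~known:
        return hex(mask)  # unrecognized bits: fall back to the raw mask
    ordered = sorted(names.items(), key=lambda item: item[1])
    return ":".join(name for name, bit in ordered if mask & (1 << bit))


mask = parse_delegate("prog_load:MAP_CREATE", BPF_CMDS)
print(hex(mask), show_delegate(mask, BPF_CMDS))
```

Printing in bit order is what turns the input "prog_load:MAP_CREATE"
into "map_create:prog_load", matching the mount output above.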

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-24 16:21:02 -08:00
Andrii Nakryiko aeaa97b006 bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
It's quite confusing in practice when it's possible to successfully
create a BPF token from BPF FS that didn't have any of delegate_xxx
mount options set up. While it's not wrong, it's actually more
meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT)
to let user-space know that no token delegation is set up.

So, instead of creating empty BPF token that will be always ignored
because it doesn't have any of the allow_xxx bits set, reject it with
-ENOENT. If we ever need empty BPF token to be possible, we can support
that with extra flag passed into BPF_TOKEN_CREATE.
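
The new behaviour can be sketched in Python as follows (a toy model,
not the kernel code). The class and function names are hypothetical, and
the check "all four delegate_xxx masks are zero" is an assumption
derived from the description above.

```python
import errno
from dataclasses import dataclass


@dataclass
class BpfFsOpts:
    # Bit masks set through the delegate_xxx mount options.
    delegate_cmds: int = 0
    delegate_maps: int = 0
    delegate_progs: int = 0
    delegate_attachs: int = 0


def token_create(opts: BpfFsOpts) -> int:
    # Returns 0 on success or a negative errno, kernel style.
    if not (opts.delegate_cmds or opts.delegate_maps
            or opts.delegate_progs or opts.delegate_attachs):
        return -errno.ENOENT  # nothing delegated: refuse to create a token
    return 0


print(token_create(BpfFsOpts()), token_create(BpfFsOpts(delegate_cmds=0x21)))
```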

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-19-andrii@kernel.org
2024-01-24 16:21:02 -08:00
Andrii Nakryiko f568a3d49a bpf,lsm: Add BPF token LSM hooks
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
allocate LSM security blob (we add `void *security` field to struct
bpf_token for that), but also control who can instantiate BPF token.
This follows existing pattern for BPF map and BPF prog.

Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
LSM hooks that allow LSM implementation to control and negate (if
necessary) BPF token's delegation of a specific bpf_cmd and capability,
respectively.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-12-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko a2431c7eab bpf,lsm: Refactor bpf_map_alloc/bpf_map_free LSM hooks
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
hook into bpf_map_create, taking not just struct bpf_map, but also
bpf_attr and bpf_token, to give a fuller context to LSMs.

Unlike bpf_prog_alloc, there is no need to move the hook around, as it
currently is firing right before allocating BPF map ID and FD, which
seems to be a sweet spot.

But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free
LSM hook is called even if bpf_map_create hook returned error: if several
LSMs are combined together it could be that one LSM successfully
allocated security blob for its needs, while subsequent LSM rejected BPF
map creation. The former LSM would still need to free up LSM blob, so we
need to ensure security_bpf_map_free() is called regardless of the
outcome.
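
The leak scenario can be illustrated with a toy Python model of two
stacked LSMs (hypothetical classes and helpers, not the LSM framework
API): the first allocates a blob, the second rejects the map, and the
blob is only released if the free hook runs despite the error.

```python
import errno


class BlobLsm:
    """Toy LSM that allocates a security blob in its create hook."""

    def __init__(self):
        self.blobs = set()

    def map_create(self, bpf_map):
        self.blobs.add(bpf_map)
        return 0

    def map_free(self, bpf_map):
        self.blobs.discard(bpf_map)


class DenyLsm:
    """Toy LSM that rejects every map creation."""

    def map_create(self, bpf_map):
        return -errno.EPERM

    def map_free(self, bpf_map):
        pass


def create_map(lsms, bpf_map, free_on_error=True):
    rc = 0
    for lsm in lsms:  # hooks run in order; stop at the first error
        rc = lsm.map_create(bpf_map)
        if rc:
            break
    if rc and free_on_error:
        for lsm in lsms:  # free hook runs regardless of the outcome
            lsm.map_free(bpf_map)
    return rc


fixed, leaky = BlobLsm(), BlobLsm()
create_map([fixed, DenyLsm()], "map0")
create_map([leaky, DenyLsm()], "map0", free_on_error=False)
print(len(fixed.blobs), len(leaky.blobs))
```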

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-11-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko 1b67772e4e bpf,lsm: Refactor bpf_prog_alloc/bpf_prog_free LSM hooks
Based on upstream discussion ([0]), rework existing
bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
program struct. Also, we pass bpf_attr union with all the user-provided
arguments for BPF_PROG_LOAD command.  This will give LSMs as much
information as we can basically provide.

The hook is also BPF token-aware now, and optional bpf_token struct is
passed as a third argument. bpf_prog_load LSM hook is called after
a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
allocated and filled out, but right before performing full-fledged BPF
verification step.

bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
consistency. SELinux code is adjusted to all new names, types, and
signatures.

Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
used by some LSMs to allocate extra security blob, but also by other
LSMs to reject BPF program loading, we need to make sure that
bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
*even* if the hook itself returned error. If we don't do that, we run
the risk of leaking memory. This seems to be possible today when
combining SELinux and BPF LSM, as one example, depending on their
relative ordering.

Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
sleepable LSM hooks list, as they are both executed in sleepable
context. Also drop bpf_prog_load hook from untrusted, as there is no
issue with refcount or anything else anymore, that originally forced us
to add it to untrusted list in c0c852dd18 ("bpf: Do not mark certain LSM
hook arguments as trusted"). We now trigger this hook much later and it
should not be an issue anymore.

  [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-10-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko d79a354975 bpf: Consistently use BPF token throughout BPF verifier logic
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-9-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko bbc1d24724 bpf: Take into account BPF token when fetching helper protos
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko caf8f28e03 bpf: Add BPF token support to BPF_PROG_LOAD command
Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
should be set in prog_flags field when providing prog_token_fd.

Wire through a set of allowed BPF program types and attach types,
derived from BPF FS at BPF token creation time. Then make sure we
perform bpf_token_capable() checks everywhere where it's relevant.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko 9ea7c4bf17 bpf: Add BPF token support to BPF_BTF_LOAD command
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BPF_F_TOKEN_FD flag has to be specified
when passing BPF token FD. Given BPF_BTF_LOAD command didn't have flags
field before, we also add btf_flags field.

BTF loading is a pretty straightforward operation, so as long as BPF
token is created with allow_cmds granting BPF_BTF_LOAD command, kernel
proceeds to parsing BTF data and creating BTF object.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-6-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko a177fc2bf6 bpf: Add BPF token support to BPF_MAP_CREATE command
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.
New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD
for BPF_MAP_CREATE command.

Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.
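
A minimal Python sketch of such a narrowly-focused token (a model, not
the kernel data structure): the enum values are the UAPI ones, while the
Token class, its field names and the helper are assumptions made for
illustration.

```python
from dataclasses import dataclass

BPF_MAP_CREATE = 0      # enum bpf_cmd
BPF_MAP_TYPE_HASH = 1   # enum bpf_map_type
BPF_MAP_TYPE_ARRAY = 2


@dataclass(frozen=True)
class Token:
    allowed_cmds: int  # bit mask indexed by enum bpf_cmd
    allowed_maps: int  # bit mask indexed by enum bpf_map_type


def token_allows_map_create(token: Token, map_type: int) -> bool:
    # Both the command and the specific map type must be delegated.
    cmd_ok = bool(token.allowed_cmds & (1 << BPF_MAP_CREATE))
    type_ok = bool(token.allowed_maps & (1 << map_type))
    return cmd_ok and type_ok


# Token that only permits creating array maps.
token = Token(allowed_cmds=1 << BPF_MAP_CREATE,
              allowed_maps=1 << BPF_MAP_TYPE_ARRAY)
print(token_allows_map_create(token, BPF_MAP_TYPE_ARRAY),
      token_allows_map_create(token, BPF_MAP_TYPE_HASH))
```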

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko 35f96de041 bpf: Introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having a BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).

Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.
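
The capability story can be modelled roughly as below (a Python sketch;
see bpf_token_capable() for the real logic). Capabilities are looked up
in a flat per-namespace table, which ignores the user namespace
hierarchy; the data layout and helper signatures are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

INIT_USERNS = "init_user_ns"


@dataclass(frozen=True)
class Token:
    userns: str  # owning user namespace captured from the BPF FS


def ns_capable(task_caps: Dict[str, Set[str]], userns: str, cap: str) -> bool:
    # Simplification: flat lookup, no ancestor-namespace handling.
    return cap in task_caps.get(userns, set())


def bpf_token_capable(task_caps: Dict[str, Set[str]],
                      token: Optional[Token], cap: str) -> bool:
    # Without a token the check is made against the init userns.
    userns = token.userns if token else INIT_USERNS
    if ns_capable(task_caps, userns, cap):
        return True
    # CAP_SYS_ADMIN in the same namespace acts as a fallback.
    return cap != "CAP_SYS_ADMIN" and ns_capable(task_caps, userns,
                                                 "CAP_SYS_ADMIN")


caps = {"container": {"CAP_BPF"}}
print(bpf_token_capable(caps, Token("container"), "CAP_BPF"),
      bpf_token_capable(caps, None, "CAP_BPF"),
      bpf_token_capable({}, Token("container"), "CAP_BPF"))
```

The last case shows that a token alone is not sufficient: the process
must also hold the capability inside the token's user namespace.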

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko 6fe01d3cbb bpf: Add BPF token delegation mount options to BPF FS
Add a few new mount options to BPF FS that specify whether a given
BPF FS instance allows creation of a BPF token (added in the next patch),
and what sort of operations are allowed under a BPF token. As such, we get
4 new mount options, each of which is a bit mask:
  - `delegate_cmds` specifies which bpf() syscall commands are
    allowed with BPF token derived from this BPF FS instance;
  - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
    a set of allowable BPF map types that could be created with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
    a set of allowable BPF program types that could be loaded with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
    a set of allowable BPF program attach types that could be loaded with
    BPF token; delegate_progs and delegate_attachs are meant to be used
    together, as full BPF program type is, in general, determined
    through both program type and program attach type.

Currently, these mount options accept the following forms of values:
  - a special value "any", that enables all possible values of a given
  bit set;
  - numeric value (decimal or hexadecimal, determined by kernel
  automatically) that specifies a bit mask value directly;
  - all the values for a given mount option are combined, if specified
  multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
  delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
  mask.
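
The combination rule can be sketched in Python (a model of the
described behaviour, not the mount option parser itself). The helper
name and the 64-bit width of "any" are assumptions.

```python
ANY = (1 << 64) - 1  # assumption: "any" enables all bits of a 64-bit mask


def combine_option(values):
    # OR together every occurrence of the same mount option.
    mask = 0
    for value in values:
        # int(value, 0) accepts decimal and 0x-prefixed hexadecimal.
        mask |= ANY if value == "any" else int(value, 0)
    return mask


# -o delegate_maps=0x1 -o delegate_maps=0x2
print(hex(combine_option(["0x1", "0x2"])))
```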

Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.

Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.

This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constrain BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to an intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.

There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.

This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.

  [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-24 16:21:00 -08:00
Andrii Nakryiko ed1ad5a741 bpf: Align CAP_NET_ADMIN checks with bpf_capable() approach
Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.

Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.

But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
  - when creating DEVMAP/XSKMAP/CPU_MAP maps;
  - when attaching CGROUP_SKB programs;
  - when handling BPF_PROG_QUERY command.

This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.

This also makes it cleaner in subsequent BPF token patches to switch
wholesale to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.
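
The difference between the two models can be shown with a short Python
sketch (illustrative helpers operating on a set of capability names,
not kernel functions):

```python
def strict_check(caps, cap):
    # Old behaviour in the three listed cases: only the exact
    # capability is accepted, CAP_SYS_ADMIN does not help.
    return cap in caps


def check_with_fallback(caps, cap):
    # BPF_PROG_LOAD model: the requested capability or CAP_SYS_ADMIN.
    return cap in caps or "CAP_SYS_ADMIN" in caps


admin_only = {"CAP_SYS_ADMIN"}
print(strict_check(admin_only, "CAP_NET_ADMIN"),
      check_with_fallback(admin_only, "CAP_NET_ADMIN"))
```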

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-2-andrii@kernel.org
2024-01-24 16:21:00 -08:00
Linus Torvalds cf10015a24 execve fixes for v6.8-rc2
- Fix error handling in begin_new_exec() (Bernd Edlinger)
 
 - MAINTAINERS: specifically mention ELF (Alexey Dobriyan)
 
 - Various cleanups related to earlier open() (Askar Safin, Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWxbGsWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJiQZD/9Lxd6ntRORthvCGk07g12fGZhQ
 OstFdbHyk5/Z+6/uKxSMvkoPZwJkXF2n3D/AvlfMFgyDBvLCFUu08jZOV31YFbeQ
 OFXVGcbY7nexkAmC6eN2k3SX8E+jzLdbcHeVk/iJomNUYBNTpExXhGMEyqZ53Pzo
 fo1uaRNGreCdSP04aHU1LE0vx7p16553oBeBZFT+iLd4glLte+E1TOZh4cIaSZbK
 5h0e+vG1XSBd9uP3fbYEyf+1JzKuhmm1RrVVaDkds1CLgJzUxh0cE1U9otKfnrwf
 xyBu556wTb001vYAIIcLlOJq+ROdiuA12RSyyHbKZmYAWTkQnBgKPV8BGDbshtzN
 zykJEsbRnWV3vN1n6+UzCEknE/xjvywEEdJgghZh46zk2NjnbtULOonLq8aMw7SA
 O+kcr4rqPLuRnxnkBw7QqA1y09QD9+M/iRQdgahsBIaDM3mMXGQsqeJAo9tFxO2M
 oJ1gJ9A7IdeULMBQ7zKVxTvC5c5fF2/CA5jpHUjASiUOTqcfHkPRYX2GINE62Heb
 xfsc3c1RhDrknMA/O01c8ziEBzZqhHUq4vGgWn0VjwIspYyfOOJYneeIx6/pJyTY
 OXbgaK+NetDCOKcv91Jjj0xfxrP0WogzvDbT9j2NuViqX24aQR1oZrredWPCTt5S
 wKouTaLVsM10EwR/Rw==
 =oOcx
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fixes from Kees Cook:

 - Fix error handling in begin_new_exec() (Bernd Edlinger)

 - MAINTAINERS: specifically mention ELF (Alexey Dobriyan)

 - Various cleanups related to earlier open() (Askar Safin, Kees Cook)

* tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Distinguish in_execve from in_exec
  exec: Fix error handling in begin_new_exec()
  exec: Add do_close_execat() helper
  exec: remove useless comment
  ELF, MAINTAINERS: specifically mention ELF
2024-01-24 13:32:29 -08:00
Kees Cook 90383cc078 exec: Distinguish in_execve from in_exec
Just to help distinguish the fs->in_exec flag from the current->in_execve
flag, add comments in check_unsafe_exec() and copy_fs() for more
context. Also note that in_execve is only used by TOMOYO now.

Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-24 11:48:52 -08:00
Frederic Weisbecker e787644caf rcu: Defer RCU kthreads wakeup when CPU is dying
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).

If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwidth timer to the local offline CPU. Since this happens
after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never be actually scheduled.

This triggers TREE03 rcutorture hangs:

	 rcu: INFO: rcu_preempt self-detected stall on CPU
	 rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
	 rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
	 rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
	 rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
	 rcu: RCU grace-period kthread stack dump:
	 task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
	 Call Trace:
	  <TASK>
	  __schedule+0x2eb/0xa80
	  schedule+0x1f/0x90
	  schedule_timeout+0x163/0x270
	  ? __pfx_process_timeout+0x10/0x10
	  rcu_gp_fqs_loop+0x37c/0x5b0
	  ? __pfx_rcu_gp_kthread+0x10/0x10
	  rcu_gp_kthread+0x17c/0x200
	  kthread+0xde/0x110
	  ? __pfx_kthread+0x10/0x10
	  ret_from_fork+0x2b/0x40
	  ? __pfx_kthread+0x10/0x10
	  ret_from_fork_asm+0x1b/0x30
	  </TASK>

The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.

So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2024-01-24 22:46:17 +05:30
Dawei Li 22653244a9 genirq: Deduplicate interrupt descriptor initialization
alloc_desc() and early_irq_init() contain duplicated code to initialize
interrupt descriptors.

Replace that with a helper function.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122085716.2999875-6-dawei.li@shingroup.cn
2024-01-24 16:02:48 +01:00
Dawei Li b184c8c288 genirq: Initialize resend_node hlist for all interrupt descriptors
For a CONFIG_SPARSE_IRQ=n kernel, early_irq_init() is supposed to
initialize all interrupt descriptors.

It does except for irq_desc::resend_node, which is only initialized for the
first descriptor.

Use the indexed descriptor and not the base pointer to address that.
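
The class of bug can be reproduced in a few lines of Python (a toy
model with hypothetical names, not the genirq code): the loop visits
every index but keeps initializing the element behind the base pointer.

```python
class IrqDesc:
    def __init__(self):
        self.resend_node_ready = False


def init_buggy(descs):
    for _ in range(len(descs)):
        # Base pointer: always touches the first descriptor.
        descs[0].resend_node_ready = True


def init_fixed(descs):
    for i in range(len(descs)):
        # Indexed descriptor: touches each entry in turn.
        descs[i].resend_node_ready = True


buggy = [IrqDesc() for _ in range(4)]
fixed = [IrqDesc() for _ in range(4)]
init_buggy(buggy)
init_fixed(fixed)
print([d.resend_node_ready for d in buggy],
      [d.resend_node_ready for d in fixed])
```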

Fixes: bc06a9e087 ("genirq: Use hlist for managing resend handlers")
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122085716.2999875-5-dawei.li@shingroup.cn
2024-01-24 14:15:41 +01:00
Kui-Feng Lee 7c81c2490c bpf: export btf_ctx_access to modules.
Modules need btf_ctx_access() to be exported so that they can call
bpf_tracing_btf_ctx_access(). This function is valuable for
implementing validation functions that ensure proper access to ctx.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-14-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 17:12:52 -08:00
Kui-Feng Lee f6be98d199 bpf, net: switch to dynamic registration
Replace the static list of struct_ops types with per-btf struct_ops_tab to
enable dynamic registration.

Both bpf_dummy_ops and bpf_tcp_ca now utilize the registration function
instead of being listed in bpf_struct_ops_types.h.

Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-12-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 17:12:46 -08:00
Kui-Feng Lee 612d087d4b bpf: validate value_type
A value_type should consist of three components: refcnt, state, and data.
refcnt and state have been moved to struct bpf_struct_ops_common_value to
make it easier to check the value type.
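
A rough sketch of why a shared header makes the check simpler: once refcnt
and state live in one common struct, a value type only needs two members to
be validated. The member names "common" and "data", struct demo_member, and
demo_value_type_ok() are assumptions made for this illustration, not taken
from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one struct member: its name and a type id. */
struct demo_member {
	const char *name;
	int type_id;
};

/*
 * Accept a value type only if it has exactly two members: a "common" header
 * (refcnt + state) followed by "data" of the struct_ops type itself.
 */
bool demo_value_type_ok(const struct demo_member *m, size_t nr,
			int common_type_id, int ops_type_id)
{
	assert(m != NULL);
	if (nr != 2)
		return false;
	if (strcmp(m[0].name, "common") != 0 || m[0].type_id != common_type_id)
		return false;
	if (strcmp(m[1].name, "data") != 0 || m[1].type_id != ops_type_id)
		return false;
	return true;
}
```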

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-11-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:45 -08:00
Kui-Feng Lee e3f87fdfed bpf: hold module refcnt in bpf_struct_ops map creation and prog verification.
Ensure that a module remains accessible for as long as a struct_ops object
of a struct_ops type provided by that module is still in use.

struct bpf_struct_ops_map doesn't hold a refcnt to btf anymore since a
module will hold a refcnt to its btf already. But struct_ops programs are
different. They hold their associated btf, not the module, since they need
only btf to assure their types (signatures).

However, the verifier temporarily holds the refcnt of the module associated
with a struct_ops type while verifying a struct_ops prog. The verifier needs
the help of the verifier operators (struct bpf_verifier_ops) provided by the
owner module to verify data access of a prog, provide information, and
generate code.

This patch also adds a count of links (links_cnt) to bpf_struct_ops_map. It
prevents bpf_struct_ops_map_put_progs() from accessing btf after calling
module_put() in bpf_struct_ops_map_free().

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-10-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee fcc2c1fb06 bpf: pass attached BTF to the bpf_struct_ops subsystem
Pass the fd of a btf from the userspace to the bpf() syscall, and then
convert the fd into a btf. The btf is generated from the module that
defines the target BPF struct_ops type.

In order to inform the kernel about the module that defines the target
struct_ops type, the userspace program needs to provide a btf fd for the
respective module's btf. This btf contains essential information on the
types defined within the module, including the target struct_ops type.

A btf fd must be provided to the kernel for struct_ops maps and for the bpf
programs attached to those maps.

In the case of the bpf programs, the attach_btf_obj_fd parameter is passed
as part of the bpf_attr and is converted into a btf. This btf is then
stored in the prog->aux->attach_btf field. Here, it just lets the verifier
access attach_btf directly.

In the case of struct_ops maps, a btf fd is passed as value_type_btf_obj_fd
of bpf_attr. The bpf_struct_ops_map_alloc() function converts the fd to a
btf and stores it as st_map->btf. A flag BPF_F_VTYPE_BTF_OBJ_FD is added
for map_flags to indicate that the value of value_type_btf_obj_fd is set.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-9-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee 689423db3b bpf: lookup struct_ops types from a given module BTF.
This is preparation for searching for struct_ops types in a specified
module. BTF is always btf_vmlinux for now. This patch passes a pointer to
the BTF to bpf_struct_ops_find_value() and bpf_struct_ops_find(). Once the
new registration API of struct_ops types is used, other BTFs besides
btf_vmlinux can also be passed to them.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-8-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee 1338b93346 bpf: pass btf object id in bpf_map_info.
Include btf object id (btf_obj_id) in bpf_map_info so that tools (e.g.
bpftool struct_ops dump) know the correct btf from the kernel to look up
type information of struct_ops types.

Since struct_ops types can be defined and registered in a module, the
type information of a struct_ops type is defined in the btf of the
module defining it. The userspace tools need to know which btf is for
the module defining a struct_ops type.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-7-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee 47f4f657ac bpf: make struct_ops_map support btfs other than btf_vmlinux.
Once new struct_ops can be registered from modules, btf_vmlinux is no
longer the only btf that struct_ops_map would face.  st_map should remember
what btf it should use to get type information.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-6-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee e61995111a bpf: add struct_ops_tab to btf.
Maintain a registry of registered struct_ops types in the per-btf (module)
struct_ops_tab. This registry allows for easy lookup of struct_ops types
that are registered by a specific module.

It is preparation work for supporting kernel module struct_ops in a
later patch. Each struct_ops will be registered under its own kernel
module btf and will be stored in the newly added btf->struct_ops_tab. The
bpf verifier and bpf syscall (e.g. prog and map cmd) can find the
struct_ops and its btf type/size/id... information from
btf->struct_ops_tab.
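
A per-BTF registry of this kind could look roughly like the sketch below.
It is a userspace illustration only: struct demo_ops_tab, struct
demo_ops_desc and the demo_ops_* functions are hypothetical names, and the
growth policy is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical descriptor for one registered struct_ops type. */
struct demo_ops_desc {
	int type_id;		/* BTF type id of the struct_ops type */
	const char *name;
};

/* Hypothetical per-BTF table; one such table would hang off each module's BTF. */
struct demo_ops_tab {
	struct demo_ops_desc *ops;
	size_t cnt;
	size_t capacity;
};

/* Append a descriptor, growing the table as needed. Returns 0 or -1. */
int demo_ops_register(struct demo_ops_tab *tab, int type_id, const char *name)
{
	assert(tab != NULL);
	if (tab->cnt == tab->capacity) {
		size_t new_cap = tab->capacity ? tab->capacity * 2 : 4;
		struct demo_ops_desc *p = realloc(tab->ops, new_cap * sizeof(*p));

		if (!p)
			return -1;
		tab->ops = p;
		tab->capacity = new_cap;
	}
	tab->ops[tab->cnt].type_id = type_id;
	tab->ops[tab->cnt].name = name;
	tab->cnt++;
	return 0;
}

/* Look up a descriptor by type id within this BTF's table only. */
const struct demo_ops_desc *demo_ops_find(const struct demo_ops_tab *tab, int type_id)
{
	for (size_t i = 0; i < tab->cnt; i++)
		if (tab->ops[i].type_id == type_id)
			return &tab->ops[i];
	return NULL;
}
```

Keeping one table per BTF matters because type ids are only meaningful
within the BTF that defines them.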

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-5-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee 4c5763ed99 bpf, net: introduce bpf_struct_ops_desc.
Move some of the members of bpf_struct_ops to bpf_struct_ops_desc. type_id
is no longer available in bpf_struct_ops. Modules should get it from the btf
received by kmod's init function.

Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee 9567839538 bpf: get type information with BTF_ID_LIST
Get ready to remove bpf_struct_ops_init() in the future. By using
BTF_ID_LIST, it is possible to gather type information at build time
instead of at runtime.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:43 -08:00
Kui-Feng Lee 3b1f89e747 bpf: refactory struct_ops type initialization to a function.
Move the majority of the code to bpf_struct_ops_init_one(), which can then
be utilized for the initialization of newly registered dynamically
allocated struct_ops types in the following patches.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:43 -08:00
Jiri Olsa 9fd112b1f8 bpf: Store cookies in kprobe_multi bpf_link_info data
Store cookies in kprobe_multi bpf_link_info data. The cookies
field is optional and, if provided, it needs to be an array of
__u64 with kprobe_multi.count length.

Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240119110505.400573-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 16:05:27 -08:00
Jiri Olsa d5c16492c6 bpf: Add cookie to perf_event bpf_link_info records
At the moment we don't store cookie for perf_event probes,
while we do that for the rest of the probes.

Adding cookie fields to struct bpf_link_info perf event
probe records:

  perf_event.uprobe
  perf_event.kprobe
  perf_event.tracepoint
  perf_event.perf_event

And the code to store that in bpf_link_info struct.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20240119110505.400573-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 16:05:27 -08:00
Hao Sun 2ce793ebe2 bpf: Refactor ptr alu checking rules to allow alu explicitly
Current checking rules are structured to disallow alu on particular ptr
types explicitly, so default cases are allowed implicitly. This may lead
to newly added ptr types being allowed unexpectedly. So restructure it to
allow alu explicitly. The tradeoff is mainly a bit more cases added in
the switch. The following table from Eduard summarizes the rules:

        | Pointer type        | Arithmetics allowed |
        |---------------------+---------------------|
        | PTR_TO_CTX          | yes                 |
        | CONST_PTR_TO_MAP    | conditionally       |
        | PTR_TO_MAP_VALUE    | yes                 |
        | PTR_TO_MAP_KEY      | yes                 |
        | PTR_TO_STACK        | yes                 |
        | PTR_TO_PACKET_META  | yes                 |
        | PTR_TO_PACKET       | yes                 |
        | PTR_TO_PACKET_END   | no                  |
        | PTR_TO_FLOW_KEYS    | conditionally       |
        | PTR_TO_SOCKET       | no                  |
        | PTR_TO_SOCK_COMMON  | no                  |
        | PTR_TO_TCP_SOCK     | no                  |
        | PTR_TO_TP_BUFFER    | yes                 |
        | PTR_TO_XDP_SOCK     | no                  |
        | PTR_TO_BTF_ID       | yes                 |
        | PTR_TO_MEM          | yes                 |
        | PTR_TO_BUF          | yes                 |
        | PTR_TO_FUNC         | yes                 |
        | CONST_PTR_TO_DYNPTR | yes                 |

The refactored rules are equivalent to the original ones. Note that
PTR_TO_FUNC and CONST_PTR_TO_DYNPTR are not rejected here because: (1)
check_mem_access() rejects load/store on those ptrs, and those ptrs
with offset passed to calls are rejected by check_func_arg_reg_off();
(2) someone may rely on the verifier not rejecting programs early.
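
The "allow explicitly, reject by default" structure can be sketched with a
small switch. This is not the verifier code: the enum below covers only a
subset of the table under demo-local names, and DEMO_PTR_TO_NEWLY_ADDED is
a hypothetical type standing in for any pointer type added later.

```c
#include <assert.h>

/* Subset of the pointer types from the table, under demo-local names. */
enum demo_ptr_type {
	DEMO_PTR_TO_CTX,
	DEMO_CONST_PTR_TO_MAP,
	DEMO_PTR_TO_MAP_VALUE,
	DEMO_PTR_TO_STACK,
	DEMO_PTR_TO_PACKET_END,
	DEMO_PTR_TO_FLOW_KEYS,
	DEMO_PTR_TO_SOCKET,
	DEMO_PTR_TO_NEWLY_ADDED,	/* hypothetical type added after the switch was written */
};

enum demo_alu_verdict { DEMO_ALU_NO, DEMO_ALU_YES, DEMO_ALU_CONDITIONAL };

/* Rejection is the zero value, so a zeroed verdict never allows anything. */
static_assert(DEMO_ALU_NO == 0, "rejection must be the zero value");

/* Every allowed type is listed; anything not listed falls through to "no". */
enum demo_alu_verdict demo_ptr_alu_allowed(enum demo_ptr_type type)
{
	switch (type) {
	case DEMO_PTR_TO_CTX:
	case DEMO_PTR_TO_MAP_VALUE:
	case DEMO_PTR_TO_STACK:
		return DEMO_ALU_YES;
	case DEMO_CONST_PTR_TO_MAP:
	case DEMO_PTR_TO_FLOW_KEYS:
		return DEMO_ALU_CONDITIONAL;
	default:
		return DEMO_ALU_NO;
	}
}
```

With this shape, a newly introduced pointer type is rejected until someone
adds it to the switch on purpose.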

Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240117094012.36798-1-sunhao.th@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 15:01:39 -08:00
Yonghong Song 9a4c57f52b bpf: Track aligned st store as imprecise spilled registers
With patch set [1], precision backtracking supports register spill/fill
to/from the stack. The patch [2] allows initial imprecise register spill
with content 0. This is a common case for cpuv3 and lower for
initializing the stack variables with pattern
  r1 = 0
  *(u64 *)(r10 - 8) = r1
and [2] has demonstrated a good verification improvement.

For cpuv4, the initialization could be
  *(u64 *)(r10 - 8) = 0
The current verifier marks the r10-8 contents with STACK_ZERO.
Similar to [2], let us permit the above insn to behave like an
imprecise register spill, which can reduce the number of verified states.
The change is in function check_stack_write_fixed_off().

Before this patch, spilled zero will be marked as STACK_ZERO
which can provide precise values. In check_stack_write_var_off(),
STACK_ZERO will be maintained if writing a const zero
so later it can provide precise values if needed.

The above handling of '*(u64 *)(r10 - 8) = 0' as a spill
will have issues in check_stack_write_var_off() as the spill
will be converted to STACK_MISC and the precise value 0
is lost. To fix this issue, if the spill slots hold const
zero and the BPF_ST write also stores const zero, the spill slots
are preserved, which can later provide precise values
if needed. Without the change in check_stack_write_var_off(),
the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset'
will fail.

I checked cpuv3 and cpuv4 with and without this patch with veristat.
There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0'
is only generated with cpuv4.

For cpuv4:
$ ../veristat -C old.cpuv4.csv new.cpuv4.csv -e file,prog,insns,states -f 'insns_diff!=0'
File                                        Program              Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
------------------------------------------  -------------------  ---------  ---------  ---------------  ----------  ----------  -------------
local_storage_bench.bpf.linked3.o           get_local                  228        168    -60 (-26.32%)          17          14   -3 (-17.65%)
pyperf600_bpf_loop.bpf.linked3.o            on_event                  6066       4889  -1177 (-19.40%)         403         321  -82 (-20.35%)
test_cls_redirect.bpf.linked3.o             cls_redirect             35483      35387     -96 (-0.27%)        2179        2177    -2 (-0.09%)
test_l4lb_noinline.bpf.linked3.o            balancer_ingress          4494       4522     +28 (+0.62%)         217         219    +2 (+0.92%)
test_l4lb_noinline_dynptr.bpf.linked3.o     balancer_ingress          1432       1455     +23 (+1.61%)          92          94    +2 (+2.17%)
test_xdp_noinline.bpf.linked3.o             balancer_ingress_v6       3462       3458      -4 (-0.12%)         216         216    +0 (+0.00%)
verifier_iterating_callbacks.bpf.linked3.o  widening                    52         41    -11 (-21.15%)           4           3   -1 (-25.00%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_tc             12412      11719    -693 (-5.58%)         345         330   -15 (-4.35%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_xdp            12478      11794    -684 (-5.48%)         346         331   -15 (-4.34%)

test_l4lb_noinline and test_l4lb_noinline_dynptr have minor regressions, but
pyperf600_bpf_loop and local_storage_bench get pretty good improvements.

  [1] https://lore.kernel.org/all/20231205184248.1502704-1-andrii@kernel.org/
  [2] https://lore.kernel.org/all/20231205184248.1502704-9-andrii@kernel.org/

Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240110051348.2737007-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy 8ecfc371d8 bpf: Assign ID to scalars on spill
Currently, when a scalar bounded register is spilled to the stack, its
ID is preserved, but only if it was already assigned, i.e. if this register
was MOVed before.

Assign an ID on spill if none is set, so that equal scalars could be
tracked if a register is spilled to the stack and filled into another
register.
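
The effect of assigning an ID at spill time can be modeled in a few lines.
This is a toy model, not the verifier: struct demo_scalar tracks only an ID
and an upper bound, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal scalar state: an ID linking equal values, and an upper bound. */
struct demo_scalar {
	uint32_t id;	/* 0 means "no ID assigned" */
	uint64_t umax;
};

static uint32_t demo_last_id;

/* Give the scalar an ID if it has none. */
void demo_assign_id(struct demo_scalar *reg)
{
	assert(reg != NULL);
	if (!reg->id)
		reg->id = ++demo_last_id;
}

/* Spill: assign an ID first so the stack slot and the register stay linked. */
void demo_spill(struct demo_scalar *slot, struct demo_scalar *src)
{
	demo_assign_id(src);
	*slot = *src;
}

/* Fill: the destination inherits the slot's ID along with its bounds. */
void demo_fill(struct demo_scalar *dst, const struct demo_scalar *slot)
{
	*dst = *slot;
}

/* A bound learned for one scalar applies to every scalar sharing its ID. */
void demo_learn_umax(struct demo_scalar **all, size_t nr, uint32_t id, uint64_t umax)
{
	for (size_t i = 0; i < nr; i++)
		if (id && all[i]->id == id && all[i]->umax > umax)
			all[i]->umax = umax;
}
```

After a spill followed by a fill into another register, a bound learned on
either register also narrows the other one, because both share one ID.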

One test is adjusted to reflect the change in register IDs.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-9-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy 87e51ac6cb bpf: Add the get_reg_width function
Put calculation of the register value width into a dedicated function.
This function will also be used in a following commit.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Link: https://lore.kernel.org/r/20240108205209.838365-8-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy 8e0e074aaf bpf: Add the assign_scalar_id_before_mov function
Extract the common code that generates a register ID for src_reg before
MOV if needed into a new function. This function will also be used in
a following commit.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-7-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Eduard Zingerman d5b892fd60 bpf: make infinite loop detection in is_state_visited() exact
The current infinite loop detection mechanism is speculative:
- first, a states_maybe_looping() check is done, which simply does memcmp
  for R1-R10 in the current frame;
- second, states_equal(..., exact=false) is called. With exact=false,
  states_equal() compares scalars for equality only if the scalar in the
  old state has a precision mark.

Such logic might be problematic if the compiler makes some unlucky stack
spill/fill decisions. An artificial example of a false positive looks
as follows:

        r0 = ... unknown scalar ...
        r0 &= 0xff;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
    loop:
        r0 = *(u64 *)(r10 - 8);
        if r0 > 10 goto exit_;
        r0 += 1;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
        goto loop;

This commit updates the call to states_equal() to use exact=true, forcing
all scalar comparisons to be exact.
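
Why the inexact comparison misfires on the example can be shown with a toy
comparison function. This is a deliberate simplification: real verifier
scalars carry ranges rather than single values, and struct
demo_scalar_state and demo_scalar_equal() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy scalar: a known value plus the verifier-style precision mark. */
struct demo_scalar_state {
	long value;
	bool precise;
};

/*
 * With exact=false, an old scalar without a precision mark matches anything.
 * With exact=true, the values must really be equal.
 */
bool demo_scalar_equal(const struct demo_scalar_state *old,
		       const struct demo_scalar_state *cur, bool exact)
{
	assert(old != NULL && cur != NULL);
	if (!exact && !old->precise)
		return true;
	return old->value == cur->value;
}
```

Taking the stack slot at r10 - 8 in two consecutive iterations (values 1
and 2, no precision mark), the inexact check calls the states equal and so
suggests an infinite loop, while the exact check tells them apart.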

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-3-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Andrii Nakryiko 522bb2c1f8 bpf: support multiple tags per argument
Add the ability to iterate over multiple decl_tag types pointing to the
same function argument. Use this to support multiple __arg_xxx tags per
global subprog argument.

We leave btf_find_decl_tag_value() intact, but change its implementation
to use a new btf_find_next_decl_tag() which can be straightforwardly
used to find the next BTF type ID of a matching btf_decl_tag type.
btf_prepare_func_args() is switched from btf_find_decl_tag_value() to
btf_find_next_decl_tag() to gain multiple tags per argument support.
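
The "find next" iteration pattern might be sketched like this. The
flattened struct demo_decl_tag array, the enum demo_arg_tag flags, and the
tag strings are assumptions for illustration; the real code walks BTF types
rather than an array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-argument flags, one bit per recognized tag. */
enum demo_arg_tag {
	DEMO_ARG_TAG_CTX = 1 << 0,
	DEMO_ARG_TAG_NONNULL = 1 << 1,
};

/* Hypothetical flattened decl_tag record: which argument, which tag string. */
struct demo_decl_tag {
	int arg_idx;
	const char *value;
};

/* Return the index of the next tag for arg_idx after position last, or -1. */
int demo_find_next_tag(const struct demo_decl_tag *tags, int nr, int arg_idx, int last)
{
	assert(tags != NULL);
	for (int i = last + 1; i < nr; i++)
		if (tags[i].arg_idx == arg_idx)
			return i;
	return -1;
}

/* Walk every tag attached to one argument and OR the known ones together. */
unsigned int demo_collect_arg_tags(const struct demo_decl_tag *tags, int nr, int arg_idx)
{
	unsigned int flags = 0;
	int i = -1;

	while ((i = demo_find_next_tag(tags, nr, arg_idx, i)) >= 0) {
		if (strcmp(tags[i].value, "arg:ctx") == 0)
			flags |= DEMO_ARG_TAG_CTX;
		else if (strcmp(tags[i].value, "arg:nonnull") == 0)
			flags |= DEMO_ARG_TAG_NONNULL;
	}
	return flags;
}
```

Resuming the search from the last match is what lets a single argument
accumulate several flags instead of stopping at the first tag found.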

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Andrii Nakryiko 54c11ec493 bpf: prepare btf_prepare_func_args() for multiple tags per argument
Add btf_arg_tag flags enum to be able to record multiple tags per
argument. Also streamline pointer argument processing some more.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Andrii Nakryiko 18810ad392 bpf: make sure scalar args don't accept __arg_nonnull tag
Move scalar arg processing in btf_prepare_func_args() after all pointer
arg processing is done. This makes it easier to do validation. One
example of unintended behavior right now is the ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Hou Tao 7c05e7f3e7 bpf: Support inlining bpf_kptr_xchg() helper
The motivation for inlining bpf_kptr_xchg() comes from the performance
profiling of the bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inlining bpf_kptr_xchg(), the performance for
object free on an 8-CPU VM increases by about 2%~10%. Inlining also has
a downside: both the kasan and kcsan checks on the pointer will be
unavailable.

bpf_kptr_xchg() can be inlined by converting the calling of
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) JIT backend supports atomic_xchg() on pointer-sized word
2) For the specific arch, the implementation of xchg is the same as
   atomic_xchg() on pointer-sized words.

It seems most 64-bit JIT backends satisfy these two conditions. But
as a precaution, define a weak function bpf_jit_supports_ptr_xchg()
to state whether such conversion is safe, and only support inlining for
64-bit hosts.

x86-64 supports the BPF_XCHG atomic operation, and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enable
inlining of bpf_kptr_xchg() on x86-64 first.
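
The stash/pop usage that the helper serves boils down to an atomic exchange
on a pointer-sized slot. The userspace sketch below uses C11
atomic_exchange() as an analogue; struct demo_stash and demo_kptr_xchg()
are hypothetical names, not the kernel helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A slot holding at most one stashed object pointer. */
struct demo_stash {
	_Atomic(void *) ptr;
};

/* Swap new_obj into the slot and return whatever was there before. */
void *demo_kptr_xchg(struct demo_stash *slot, void *new_obj)
{
	assert(slot != NULL);
	return atomic_exchange(&slot->ptr, new_obj);
}
```

Stashing is an exchange with the new object (the old value, if any, comes
back to the caller), and popping is an exchange with NULL.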

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Petr Pavlu 2b44760609 tracing: Ensure visibility when inserting an element into tracing_map
Running the following two commands in parallel on a multi-processor
AArch64 machine can sporadically produce an unexpected warning about
duplicate histogram entries:

 $ while true; do
     echo hist:key=id.syscall:val=hitcount > \
       /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
     cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
     sleep 0.001
   done
 $ stress-ng --sysbadaddr $(nproc)

The warning looks as follows:

[ 2911.172474] ------------[ cut here ]------------
[ 2911.173111] Duplicates detected: 1
[ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
[ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
[ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
[ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G            E      6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
[ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
[ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
[ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
[ 2911.185310] sp : ffff8000a1513900
[ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
[ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
[ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
[ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
[ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
[ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
[ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
[ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
[ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
[ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
[ 2911.194259] Call trace:
[ 2911.194626]  tracing_map_sort_entries+0x3e0/0x408
[ 2911.195220]  hist_show+0x124/0x800
[ 2911.195692]  seq_read_iter+0x1d4/0x4e8
[ 2911.196193]  seq_read+0xe8/0x138
[ 2911.196638]  vfs_read+0xc8/0x300
[ 2911.197078]  ksys_read+0x70/0x108
[ 2911.197534]  __arm64_sys_read+0x24/0x38
[ 2911.198046]  invoke_syscall+0x78/0x108
[ 2911.198553]  el0_svc_common.constprop.0+0xd0/0xf8
[ 2911.199157]  do_el0_svc+0x28/0x40
[ 2911.199613]  el0_svc+0x40/0x178
[ 2911.200048]  el0t_64_sync_handler+0x13c/0x158
[ 2911.200621]  el0t_64_sync+0x1a8/0x1b0
[ 2911.201115] ---[ end trace 0000000000000000 ]---

The problem appears to be caused by CPU reordering of writes issued from
__tracing_map_insert().

The check for the presence of an element with a given key in this
function is:

 val = READ_ONCE(entry->val);
 if (val && keys_match(key, val->key, map->key_size)) ...

The write of a new entry is:

 elt = get_free_elt(map);
 memcpy(elt->key, key, map->key_size);
 entry->val = elt;

The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
stores may become visible in the reversed order on another CPU. This
second CPU might then incorrectly determine that a new key doesn't match
an already present val->key and subsequently insert a new element,
resulting in a duplicate.

Fix the problem by adding a write barrier between
"memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
and the "val->key" check which has an address dependency.

The barrier is placed on a path executed when adding an element for
a new key. Subsequent updates targeting the same key remain unaffected.
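
The publish ordering can be approximated in userspace C11. Note the
substitution: the kernel fix uses a write barrier plus WRITE_ONCE() paired
with READ_ONCE() and an address dependency, whereas this sketch uses a
release store paired with an acquire load. All demo_* names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_KEY_SIZE 8

struct demo_elt {
	unsigned char key[DEMO_KEY_SIZE];
};

struct demo_entry {
	_Atomic(struct demo_elt *) val;
};

/* Writer: fill in the key first, then publish the element pointer. */
void demo_publish(struct demo_entry *entry, struct demo_elt *elt, const void *key)
{
	assert(entry != NULL && elt != NULL);
	memcpy(elt->key, key, DEMO_KEY_SIZE);
	/* Release ordering: the key bytes are visible before the pointer is. */
	atomic_store_explicit(&entry->val, elt, memory_order_release);
}

/* Reader: load the pointer, then compare the key it points to. */
bool demo_entry_matches(struct demo_entry *entry, const void *key)
{
	struct demo_elt *val = atomic_load_explicit(&entry->val, memory_order_acquire);

	return val && memcmp(val->key, key, DEMO_KEY_SIZE) == 0;
}
```

Without the ordering on the store, a reader on another CPU could see the
pointer before the key bytes and wrongly conclude that the key is absent.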

From the user's perspective, the issue was introduced by commit
c193707dde ("tracing: Remove code which merges duplicates"), which
followed commit cbf4100efb ("tracing: Add support to detect and avoid
duplicates"). The previous code operated differently; it inherently
expected potential races which result in duplicates but merged them
later when they occurred.

Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com

Fixes: c193707dde ("tracing: Remove code which merges duplicates")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-22 17:15:40 -05:00
Linus Torvalds 4fbbed7872 Updates for time and clocksources:
- A fix for the idle and iowait time accounting vs. CPU hotplug.
     The time is reset on CPU hotplug which makes the accumulated
     systemwide time jump backwards.
 
  - Assorted fixes and improvements for clocksource/event drivers

Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Updates for time and clocksources:

   - A fix for the idle and iowait time accounting vs CPU hotplug.

     The time is reset on CPU hotplug which makes the accumulated
     systemwide time jump backwards.

   - Assorted fixes and improvements for clocksource/event drivers"

* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
  clocksource/drivers/ep93xx: Fix error handling during probe
  clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
  clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
  clocksource/timer-riscv: Add riscv_clock_shutdown callback
  dt-bindings: timer: Add StarFive JH8100 clint
  dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
2024-01-21 11:14:40 -08:00
Marcelo Tosatti 7bd20b6b87 workqueue: mark power efficient workqueue as unbounded if nohz_full enabled
A customer using nohz_full has experienced the following interruption:

oslat-1004510 [018] timer_cancel:         timer=0xffff90a7ca663cf8
oslat-1004510 [018] timer_expire_entry:   timer=0xffff90a7ca663cf8 function=delayed_work_timer_fn now=4709188240 baseclk=4709188240
oslat-1004510 [018] workqueue_queue_work: work struct=0xffff90a7ca663cd8 function=fb_flashcursor workqueue=events_power_efficient req_cpu=8192 cpu=18
oslat-1004510 [018] workqueue_activate_work: work struct 0xffff90a7ca663cd8
oslat-1004510 [018] sched_wakeup:         kworker/18:1:326 [120] CPU:018
oslat-1004510 [018] timer_expire_exit:    timer=0xffff90a7ca663cf8
oslat-1004510 [018] irq_work_entry:       vector=246
oslat-1004510 [018] irq_work_exit:        vector=246
oslat-1004510 [018] tick_stop:            success=0 dependency=SCHED
oslat-1004510 [018] hrtimer_start:        hrtimer=0xffff90a70009cb00 function=tick_sched_timer/0x0 ...
oslat-1004510 [018] softirq_exit:         vec=1 [action=TIMER]
oslat-1004510 [018] softirq_entry:        vec=7 [action=SCHED]
oslat-1004510 [018] softirq_exit:         vec=7 [action=SCHED]
oslat-1004510 [018] tick_stop:            success=0 dependency=SCHED
oslat-1004510 [018] sched_switch:         oslat:1004510 [120] R ==> kworker/18:1:326 [120]
kworker/18:1-326 [018] workqueue_execute_start: work struct 0xffff90a7ca663cd8: function fb_flashcursor
kworker/18:1-326 [018] workqueue_queue_work: work struct=0xffff9078f119eed0 function=drm_fb_helper_damage_work workqueue=events req_cpu=8192 cpu=18
kworker/18:1-326 [018] workqueue_activate_work: work struct 0xffff9078f119eed0
kworker/18:1-326 [018] timer_start:          timer=0xffff90a7ca663cf8 function=delayed_work_timer_fn ...

Set wq_power_efficient to true if nohz_full is enabled.
This makes the power-efficient workqueue unbound, which allows
work items queued there to be moved to housekeeping (HK) CPUs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-19 13:55:47 -10:00
Heiko Carstens 71fee48fb7 tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
When offlining and onlining CPUs the overall reported idle and iowait
times as reported by /proc/stat jump backward and forward:

cpu  132 0 176 225249 47 6 6 21 0 0
cpu0 80 0 115 112575 33 3 4 18 0 0
cpu1 52 0 60 112673 13 3 1 2 0 0

cpu  133 0 177 226681 47 6 6 21 0 0
cpu0 80 0 116 113387 33 3 4 18 0 0

cpu  133 0 178 114431 33 6 6 21 0 0 <---- jump backward
cpu0 80 0 116 114247 33 3 4 18 0 0
cpu1 52 0 61 183 0 3 1 2 0 0        <---- idle + iowait start with 0

cpu  133 0 178 228956 47 6 6 21 0 0 <---- jump forward
cpu0 81 0 117 114929 33 3 4 18 0 0

The reason for this is that get_idle_time() in fs/proc/stat.c has different
sources for both values depending on whether a CPU is online or offline:

- if a CPU is online the values may be taken from its per cpu
  tick_cpu_sched structure

- if a CPU is offline the values are taken from its per cpu cpustat
  structure

The problem is that the per cpu tick_cpu_sched structure is set to zero on
CPU offline. See tick_cancel_sched_timer() in kernel/time/tick-sched.c.

Therefore when a CPU is brought offline and online afterwards both its idle
and iowait sleeptime will be zero, causing a jump backward in total system
idle and iowait sleeptime. In a similar way if a CPU is then brought
offline again the total idle and iowait sleeptimes will jump forward.

It looks like this behavior was introduced with commit 4b0c0f294f
("tick: Cleanup NOHZ per cpu data on cpu down").

This was only noticed now on s390, since we switched to generic idle time
reporting with commit be76ea6144 ("s390/idle: remove arch_cpu_idle_time()
and corresponding code").

Fix this by preserving the values of idle_sleeptime and iowait_sleeptime
members of the per-cpu tick_sched structure on CPU hotplug.
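
The effect described above can be reproduced with a small toy model. This is
only a sketch in Python, not kernel code: the `Cpu` class, the helper names and
the assumption that both counters advance in lockstep while a CPU is online
are illustrative simplifications. `preserve=False` models zeroing the whole
tick structure on offline; `preserve=True` models keeping the sleeptime value.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    online: bool = True
    tick_idle: int = 0     # stand-in for idle_sleeptime in the per-CPU tick structure
    cpustat_idle: int = 0  # stand-in for the idle value in the per-CPU cpustat


def idle_tick(cpu, amount):
    """Account idle time on an online CPU (toy model: both counters advance)."""
    cpu.tick_idle += amount
    cpu.cpustat_idle += amount


def offline(cpu, preserve):
    cpu.online = False
    if not preserve:
        cpu.tick_idle = 0  # models the tick structure being set to zero


def get_idle_time(cpu):
    # Online CPUs report from the tick structure, offline CPUs from cpustat.
    return cpu.tick_idle if cpu.online else cpu.cpustat_idle


def total_idle_history(preserve):
    """Return the system-wide idle total sampled after each hotplug step."""
    cpus = [Cpu(), Cpu()]
    history = []

    def sample():
        history.append(sum(get_idle_time(c) for c in cpus))

    for c in cpus:
        idle_tick(c, 100)
    sample()                      # both CPUs online
    offline(cpus[1], preserve)
    sample()                      # cpu1 offline
    cpus[1].online = True
    sample()                      # cpu1 online again
    for c in cpus:
        idle_tick(c, 10)
    sample()
    offline(cpus[1], preserve)
    sample()                      # cpu1 offline again
    return history
```

In this model the total goes backward and then forward when the value is
zeroed, and stays monotonic when it is preserved.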

Fixes: 4b0c0f294f ("tick: Cleanup NOHZ per cpu data on cpu down")
Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240115163555.1004144-1-hca@linux.ibm.com
2024-01-19 16:40:38 +01:00
Sebastian Andrzej Siewior e626cb02ee futex: Prevent the reuse of stale pi_state
Jiri Slaby reported a futex state inconsistency resulting in -EINVAL during
a lock operation for a PI futex. It requires that the lock operation is
interrupted by a timeout or signal:

  T1 Owns the futex in user space.

  T2 Tries to acquire the futex in kernel (futex_lock_pi()). Allocates a
     pi_state and attaches itself to it.

  T2 Times out and removes its rt_waiter from the rt_mutex. Drops the
     rtmutex lock and tries to acquire the hash bucket lock to remove
     the futex_q. The lock is contended and T2 schedules out.

  T1 Unlocks the futex (futex_unlock_pi()). Finds a futex_q but no
     rt_waiter. Unlocks the futex (do_uncontended) and makes it available
     to user space.

  T3 Acquires the futex in user space.

  T4 Tries to acquire the futex in kernel (futex_lock_pi()). Finds the
     existing futex_q of T2 and tries to attach itself to the existing
     pi_state.  This (attach_to_pi_state()) fails with -EINVAL because uval
     contains the TID of T3 but pi_state points to T1.

It's incorrect to unlock the futex and make it available for user space to
acquire as long as there is still an existing state attached to it in the
kernel.

T1 cannot hand over the futex to T2 because T2 already gave up and started
to clean up and is blocked on the hash bucket lock, so T2's futex_q with
the pi_state pointing to T1 is still queued.

T1 observes the futex_q, but ignores it as there is no waiter on the
corresponding rt_mutex and takes the uncontended path which allows the
subsequent caller of futex_lock_pi() (T4) to observe that stale state.

To prevent this the unlock path must dequeue all futex_q entries which
point to the same pi_state when there is no waiter on the rt mutex. This
obviously requires making the dequeue conditional in the locking path to
prevent a double dequeue. With that it's guaranteed that user space cannot
observe an uncontended futex which has kernel state attached.
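
The sequence above can be replayed in a toy model. This is a hedged sketch in
Python, not the kernel implementation: `PiState`, `Futex`, `give_up_waiting()`
and the `dequeue_stale` switch are hypothetical names, the hash bucket is
reduced to a plain list, and handover to a real waiter is not modelled.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class PiState:
    owner: int                                   # TID the kernel state points to
    rt_waiters: list = field(default_factory=list)


@dataclass
class Futex:
    uval: int = 0                                # owner TID seen by user space, 0 if free
    queue: list = field(default_factory=list)    # one pi_state reference per queued futex_q


def lock_pi(futex, tid):
    """Contended lock attempt by `tid` in the kernel."""
    if futex.queue:
        pi_state = futex.queue[0]
        if pi_state.owner != futex.uval:
            return -errno.EINVAL                 # existing state disagrees with uval
    else:
        pi_state = PiState(owner=futex.uval)
    pi_state.rt_waiters.append(tid)
    futex.queue.append(pi_state)
    return 0


def give_up_waiting(futex, tid):
    """`tid` timed out: its rt_waiter is gone, but its futex_q is still queued."""
    futex.queue[0].rt_waiters.remove(tid)


def unlock_pi(futex, dequeue_stale):
    """Unlock by the owner. Only the no-rt_waiter path is modelled."""
    pi_state = futex.queue[0] if futex.queue else None
    if pi_state is not None and pi_state.rt_waiters:
        raise NotImplementedError("handover to a waiter is not modelled")
    if pi_state is not None and dequeue_stale:
        # Drop every queued entry that points to this pi_state.
        futex.queue = [q for q in futex.queue if q is not pi_state]
    futex.uval = 0                               # release the futex to user space


def scenario(dequeue_stale):
    futex = Futex(uval=1)                 # T1 owns the futex in user space
    assert lock_pi(futex, 2) == 0         # T2 blocks in the kernel
    give_up_waiting(futex, 2)             # T2 times out, its futex_q stays queued
    unlock_pi(futex, dequeue_stale)       # T1 unlocks
    futex.uval = 3                        # T3 acquires in user space
    return lock_pi(futex, 4)              # T4 tries to lock in the kernel
```

With `dequeue_stale=False` T4 finds the leftover entry pointing to T1 and gets
-EINVAL; with `dequeue_stale=True` the entry is gone and T4 attaches to fresh
state.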

Fixes: fbeb558b0d ("futex/pi: Fix recursive rt_mutex waiter state")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20240118115451.0TkD_ZhB@linutronix.de
Closes: https://lore.kernel.org/all/4611bcf2-44d0-4c34-9b84-17406f881003@kernel.org
2024-01-19 12:58:17 +01:00
Linus Torvalds 2a668d2176 kgdb patches for 6.8
The entire changeset for kgdb this cycle is a single two-line change to
 remove some dead code that, had it not been dead, would have called
 strncpy() in an unsafe manner.
 
 To be fair, other modest cleanups were discussed this cycle,
 but they are not finalized and will have to wait until next time.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmWpMp8ACgkQfOMlXTn3
 iKE2YQ/9HGgROWCSQPHXmly/VFbcvp/0J77+XA4yNLqTKMfyV1ZHQ5xWoiDp2Q0Q
 o689iYQ6V6YWl1/KNR5c7Xct82zgKaSksypTvIZBvk7vPVCElkQM9tpvE9VahMr7
 /YB59GFr83Rks7a0tfQ6SMUFWyFFtaU3YdV1374CfGTmlFiqMBbfCz40izjgT7so
 yLbbDCZxIaNBlRnRore1kPTt/3KAih6udai74H+OmlifWR0saS57Hwywlfd2H3yv
 Q3bFK4NkYLkptJZVmAC8B68YK5c4W24Y+lQ9fouKiFMajym5ONHLFsmdbZe6MrPk
 CUU2cGgDvei8nyAXAucocr5ndvJMi8R/JNunBNbnI6oYnT7K1MW3u+v1R+SRYm22
 Ixza37hcFkR+7wWnXsPB/POQo74C5ylZt5Of20ogzBEm8qjY7zAGoVZhNh3Ny4t2
 NGHwm6MR9613dIh/cVuD4waZdWS2TWYkaTA1E8PwIIqSnKfmMo0pJQU7ddy5V8Kb
 F6un3/+9IA6mBQLOqPuFpHZNzIMkDdgE4GvZIrF/DIzlHeWxnk6hDOO2ftbGgsIY
 BTelXfE1L8UoHlhixnjnFrTxRNjsZE8CGLY/Zd++b1mhY6tCwSGUS+cInUiIf/3k
 X9bN+rwzxV+MVx2V5j/g2i2sEP2YELgCGDKtb7WbXDv43T+Q6/I=
 =dLyK
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb update from Daniel Thompson:
 "The entire changeset for kgdb this cycle is a single two-line change
  to remove some dead code that, had it not been dead, would have called
  strncpy() in an unsafe manner.

  To be fair, other modest cleanups were discussed this cycle,
  but they are not finalized and will have to wait until next time"

* tag 'kgdb-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Fix a potential buffer overflow in kdb_local()
2024-01-18 17:53:22 -08:00
Linus Torvalds 736b5545d3 Including fixes from bpf and netfilter.
Previous releases - regressions:
 
  - Revert "net: rtnetlink: Enslave device before bringing it up",
    breaks the case inverse to the one it was trying to fix
 
  - net: dsa: fix oob access in DSA's netdevice event handler
    dereference netdev_priv() before checking it's a DSA port
 
  - sched: track device in tcf_block_get/put_ext() only for clsact
    binder types
 
  - net: tls, fix WARNING in __sk_msg_free when record becomes full
    during splice and MORE hint set
 
  - sfp-bus: fix SFP mode detect from bitrate
 
  - drv: stmmac: prevent DSA tags from breaking COE
 
 Previous releases - always broken:
 
  - bpf: fix no forward progress in bpf_iter_udp if output
    buffer is too small
 
  - bpf: reject variable offset alu on registers with a type
    of PTR_TO_FLOW_KEYS to prevent oob access
 
  - netfilter: tighten input validation
 
  - net: add more sanity check in virtio_net_hdr_to_skb()
 
  - rxrpc: fix use of Don't Fragment flag on RESPONSE packets,
    avoid infinite loop
 
  - amt: do not use the portion of skb->cb area which may get clobbered
 
  - mptcp: improve validation of the MPTCPOPT_MP_JOIN MPTCP option
 
 Misc:
 
  - spring cleanup of inactive maintainers
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmWpnvoACgkQMUZtbf5S
 Irvskg/+Or5tETxOmpQXxnj6ECZyrSp0Jcyd7+TIcos/7JfPdn3Kebl004SG4h/s
 bwKDOIIP1iSjQ+0NFsPjyYIVd6wFuCElSB7npV5uQAT6ptXx7A4Ym68/rVxodI8T
 6hiYV/mlPuZF8JjRhtp/VJL8sY1qnG7RIUB4oH3y9HQNfwZX0lIWChuUilHuWfbq
 zQ2Iu97tMkoIBjXrkIT3Qaj0aFxYbjCOrg9zy+FZ69a7Rmrswr//7amlCH6saNTx
 Ku7Wl8FXhe7O23OiM6GSl7AechSM1aJ5kOS3orseej0+aSp9eH3ekYGmbsQr6sjz
 ix/eZ7V7SUkJK3bEH5haeymk4TDV3lHE8SziMbosK4wVbHOyPwEmqCxppADYJLZs
 WycHZKcTBluFBOxknAofH7m5Hh0ToXkeTfpptSSGtRB4WncAOMsMapr3yS4WXg/q
 AnOo/tzCBgMrnSJtD/kjqgUiCk8vYoLc8lBR9K74l0zqI1sf13OfuTHvEgqIS6z1
 Ir/ewlAV6fCH8gQbyzjKUVlyjZS4+vFv19xg/2GgLf+LdyzcCOxUZkND3/DE6+OA
 Dgf9gtABYU4hGXMUfTfml3KCBTF65QmY8dIh17zraNylYUHEJ2lI4D+sdiqWUrXb
 mXPBJh4nOPwIV5t2gT80skNwF3aWPr6l4ieY2codSbP04rO74S8=
 =YhQQ
 -----END PGP SIGNATURE-----

Merge tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Previous releases - regressions:

   - Revert "net: rtnetlink: Enslave device before bringing it up",
     breaks the case inverse to the one it was trying to fix

   - net: dsa: fix oob access in DSA's netdevice event handler
     dereference netdev_priv() before checking it's a DSA port

   - sched: track device in tcf_block_get/put_ext() only for clsact
     binder types

   - net: tls, fix WARNING in __sk_msg_free when record becomes full
     during splice and MORE hint set

   - sfp-bus: fix SFP mode detect from bitrate

   - drv: stmmac: prevent DSA tags from breaking COE

  Previous releases - always broken:

   - bpf: fix no forward progress in bpf_iter_udp if output buffer is
     too small

   - bpf: reject variable offset alu on registers with a type of
     PTR_TO_FLOW_KEYS to prevent oob access

   - netfilter: tighten input validation

   - net: add more sanity check in virtio_net_hdr_to_skb()

   - rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid
     infinite loop

   - amt: do not use the portion of skb->cb area which may get clobbered

   - mptcp: improve validation of the MPTCPOPT_MP_JOIN MPTCP option

  Misc:

   - spring cleanup of inactive maintainers"

* tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  i40e: Include types.h to some headers
  ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
  selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
  selftests: mlxsw: qos_pfc: Remove wrong description
  mlxsw: spectrum_router: Register netdevice notifier before nexthop
  mlxsw: spectrum_acl_tcam: Fix stack corruption
  mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
  mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
  ethtool: netlink: Add missing ethnl_ops_begin/complete
  selftests: bonding: Add more missing config options
  selftests: netdevsim: add a config file
  libbpf: warn on unexpected __arg_ctx type when rewriting BTF
  selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
  bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
  bpf: extract bpf_ctx_convert_map logic and make it more reusable
  libbpf: feature-detect arg:ctx tag support in kernel
  ipvs: avoid stat macros calls from preemptible context
  netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
  netfilter: nf_tables: skip dead set elements in netlink dump
  netfilter: nf_tables: do not allow mismatch field size and set key length
  ...
2024-01-18 17:33:50 -08:00
Linus Torvalds 17e232b6d2 dma-mapping fixes for Linux 6.8
- fix kerneldoc warnings (Randy Dunlap)
  - better bounds checking in swiotlb (ZhangPeng)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmWo0rYLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPxwRAAy/TfnQJZQX+VqLmdQ5JsHGoPkYCrdMdcGtM7JxHP
 0QJZSCN/FRIRIB/kXXceCLlls79pQUdrfyiIM4ZyWZU6nWdKlUVw3xJZ8VtbSXla
 W0tKUlAaVwt6HuK+H0GkSs/l1JlU/qWUwrVSXdsAeWT8aQes6OqRbesm/N8vvgJE
 Do/uyGG4WFQaiSB4vv6M6S32KGzNN+BBZGLEk+K4Et4fRhwmMh2ciGMK8Df8yyDU
 iKFVPVii26PySCpkoupxFw59pS7LNEn7NCE0hwuCqQt1VE0+8Hyb8BG4sh8U66FM
 sP9FRqv8HAMofJmn3OGIQXIS3J2/0OPoDAGJUbl0JqMfOUj9B3X2Iu3xFYUVK7MJ
 C1OG3vsKUJ8Dq3SStTNan2PsOXekfcLzzY5iuUoWysl5YiDmJs7zut22oX43z+k5
 lJ91QiEtejat39lAXERZYJ4BVuIoGM46xhJLbNjhdAcoakO4IpdETAEowsmZLxte
 6LMGXKUd/swcy0BCuRtoKheAvLfE+OGPd6RA299nsSxfn6T9zGmnl5pvXqrS17yV
 n3riHqgohcOaUFTaj88aO/y+WunVN5J/bpdbrPeitEnE94DW3Y2ah+5tx5sj+MiW
 p5qMO2AiHqx0Go1mlpYbk78adtBMogdwHjSdYBX0NfZVt9agTIB3FG6UVCkNWYta
 ing=
 =5+TR
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.8-2024-01-18' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix kerneldoc warnings (Randy Dunlap)

 - better bounds checking in swiotlb (ZhangPeng)

* tag 'dma-mapping-6.8-2024-01-18' of git://git.infradead.org/users/hch/dma-mapping:
  dma-debug: fix kernel-doc warnings
  swiotlb: check alloc_size before the allocation of a new memory pool
2024-01-18 16:49:34 -08:00
Linus Torvalds 0dde2bf67b IOMMU Updates for Linux v6.8
Including:
 
 	- Core changes:
 	  - Fix race conditions in device probe path
 	  - Retire IOMMU bus_ops
 	  - Support for passing custom allocators to page table drivers
 	  - Clean up Kconfig around IOMMU_SVA
 	  - Support for sharing SVA domains with all devices bound to
 	    a mm
 	  - Firmware data parsing cleanup
 	  - Tracing improvements for iommu-dma code
 	  - Some smaller fixes and cleanups
 
 	- ARM-SMMU drivers:
 	  - Device-tree binding updates:
 	     - Add additional compatible strings for Qualcomm SoCs
 	     - Document Adreno clocks for Qualcomm's SM8350 SoC
 	  - SMMUv2:
 	    - Implement support for the ->domain_alloc_paging() callback
 	    - Ensure Secure context is restored following suspend of Qualcomm SMMU
 	      implementation
 	  - SMMUv3:
 	    - Disable stalling mode for the "quiet" context descriptor
 	    - Minor refactoring and driver cleanups
 
 	 - Intel VT-d driver:
 	   - Cleanup and refactoring
 
 	 - AMD IOMMU driver:
 	   - Improve IO TLB invalidation logic
 	   - Small cleanups and improvements
 
 	 - Rockchip IOMMU driver:
 	   - DT binding update to add Rockchip RK3588
 
 	 - Apple DART driver:
 	   - Apple M1 USB4/Thunderbolt DART support
 	   - Cleanups
 
 	 - Virtio IOMMU driver:
 	   - Add support for iotlb_sync_map
 	   - Enable deferred IO TLB flushes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmWecQoACgkQK/BELZcB
 GuN5ZxAAzC5QUKAzANx0puk7QhPpKKlbSvj6Q7iRgCLk00KJO1+VQh9v4ouCmXqF
 kn3Ko8gddjhtrgwN0OQ54F39cLUrp1SBemy71K5YOR+vu8VKtwtmawZGeeRZ+k+B
 Eohw58oaXTiR1maYvoLixLYczLrjklqyJOQ1vZ0GxFGxDqrFByAryHDgG/3OCpJx
 C9e6PsLbbfhfqA8Kv97iKcBqniGbXxAMuodqSUG0buQ3oZgfpIP6Bt3EgUzFGPGk
 3BTlYxowS/gkjUWd3fgjQFIFLTA01u9FhpA2Jb0a4v67pUCR64YxHN7rBQ6ZChtG
 kB9laQfU9re79RsHhqQzr0JT9x/eyq7pzGzjp5TV5TPW6IW+sqjMIPhzd9P08Ef7
 BclkCVobx0jSAHOhnnG4QJiKANr2Y2oM3HfsAJccMMY45RRhUKmVqM7jxMPfGn3A
 i+inlee73xTjZXJse1EWG1fmKKMLvX9LDEp4DyOfn9CqVT+7hpZvzPjfbGr937Rm
 JlwXhF3rQXEpOCagEsbt1vOf+V0e9QiCLf1Y2KpkIkDbE5wwSD/2qLm3tFhJG3oF
 fkW+J14Cid0pj+hY0afGe0kOUOIYlimu0nFmSf0pzMH+UktZdKogSfyb1gSDsy+S
 rsZRGPFhMJ832ExqhlDfxqBebqh+jsfKynlskui6Td5C9ZULaHA=
 =q751
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:
 "Core changes:
   - Fix race conditions in device probe path
   - Retire IOMMU bus_ops
   - Support for passing custom allocators to page table drivers
   - Clean up Kconfig around IOMMU_SVA
   - Support for sharing SVA domains with all devices bound to a mm
   - Firmware data parsing cleanup
   - Tracing improvements for iommu-dma code
   - Some smaller fixes and cleanups

  ARM-SMMU drivers:
   - Device-tree binding updates:
      - Add additional compatible strings for Qualcomm SoCs
      - Document Adreno clocks for Qualcomm's SM8350 SoC
   - SMMUv2:
      - Implement support for the ->domain_alloc_paging() callback
      - Ensure Secure context is restored following suspend of Qualcomm
        SMMU implementation
   - SMMUv3:
      - Disable stalling mode for the "quiet" context descriptor
      - Minor refactoring and driver cleanups

  Intel VT-d driver:
   - Cleanup and refactoring

  AMD IOMMU driver:
   - Improve IO TLB invalidation logic
   - Small cleanups and improvements

  Rockchip IOMMU driver:
   - DT binding update to add Rockchip RK3588

  Apple DART driver:
   - Apple M1 USB4/Thunderbolt DART support
   - Cleanups

  Virtio IOMMU driver:
   - Add support for iotlb_sync_map
   - Enable deferred IO TLB flushes"

* tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits)
  iommu: Don't reserve 0-length IOVA region
  iommu/vt-d: Move inline helpers to header files
  iommu/vt-d: Remove unused vcmd interfaces
  iommu/vt-d: Remove unused parameter of intel_pasid_setup_pass_through()
  iommu/vt-d: Refactor device_to_iommu() to retrieve iommu directly
  iommu/sva: Fix memory leak in iommu_sva_bind_device()
  dt-bindings: iommu: rockchip: Add Rockchip RK3588
  iommu/dma: Trace bounce buffer usage when mapping buffers
  iommu/arm-smmu: Convert to domain_alloc_paging()
  iommu/arm-smmu: Pass arm_smmu_domain to internal functions
  iommu/arm-smmu: Implement IOMMU_DOMAIN_BLOCKED
  iommu/arm-smmu: Convert to a global static identity domain
  iommu/arm-smmu: Reorganize arm_smmu_domain_add_master()
  iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
  iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent()
  iommu/arm-smmu-v3: Add a type for the STE
  iommu/arm-smmu-v3: disable stall for quiet_cd
  iommu/qcom: restore IOMMU state if needed
  iommu/arm-smmu-qcom: Add QCM2290 MDSS compatible
  iommu/arm-smmu-qcom: Add missing GMU entry to match table
  ...
2024-01-18 15:16:57 -08:00
Linus Torvalds a2ded784cd tracing updates for 6.8:
- Allow kernel trace instance creation to specify what events are created
   Inside the kernel, a subsystem may create a tracing instance that it can
   use to send events to user space. This sub-system may not care about the
   thousands of events that exist in eventfs. Allow the sub-system to specify
   what sub-systems of events it cares about, and only those events are exposed
   to this instance.
 
 - Allow the ring buffer to be broken up into bigger sub-buffers than just the
   architecture page size. A new tracefs file called "buffer_subbuf_size_kb"
   is created. The user can now specify a minimum size the sub-buffer may be
   in kilobytes. Note that the implementation currently makes the sub-buffer
   size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user only writes in
   kilobyte size, and the sub-buffer will be updated to the next size that
   can accommodate it. If the user writes in 10, it will change the
   size to be 4 pages on x86 (16K), as that is the next available size that
   can hold 10K.
 
 - Update the debug output when a corrupt time is detected in the ring buffer.
   If the ring buffer detects inconsistent timestamps, there's a debug config
   option that will dump the contents of the meta data of the sub-buffer that
   is used for debugging. Add some more information to this dump that helps
   with debugging.
 
 - Add more timestamp debugging checks (only triggers when the config is enabled)
 
 - Increase the trace_seq iterator to 2 page sizes.
 
 - Allow strings written into tracefs_marker to be larger. Up to just under
   2 page sizes (based on what trace_seq can hold).
 
 - Increase the trace_marker_raw write to be as big as a sub-buffer can hold.
 
 - Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has been
   removed.
 
 - More selftests were added.
 
 - Some code clean ups as well.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZ8p3BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6ql2GAQDZg/zlFEiJHyTfWbCIE8pA3T5xbzKo
 26TNxIZAxJJZpQEAvGFU5Smy14pG6soEoVMp8B6ZOANbqU8VVamhOL+r+Qw=
 =0OYG
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Allow kernel trace instance creation to specify what events are
   created

   Inside the kernel, a subsystem may create a tracing instance that it
   can use to send events to user space. This sub-system may not care
   about the thousands of events that exist in eventfs. Allow the
   sub-system to specify what sub-systems of events it cares about, and
   only those events are exposed to this instance.

 - Allow the ring buffer to be broken up into bigger sub-buffers than
   just the architecture page size.

   A new tracefs file called "buffer_subbuf_size_kb" is created. The
   user can now specify a minimum size the sub-buffer may be in
   kilobytes. Note that the implementation currently makes the
   sub-buffer size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user
   only writes in kilobyte size, and the sub-buffer will be updated to
   the next size that can accommodate it. If the user writes in
   10, it will change the size to be 4 pages on x86 (16K), as that is
   the next available size that can hold 10K.

 - Update the debug output when a corrupt time is detected in the ring
   buffer. If the ring buffer detects inconsistent timestamps, there's a
   debug config option that will dump the contents of the meta data of
   the sub-buffer that is used for debugging. Add some more information
   to this dump that helps with debugging.

 - Add more timestamp debugging checks (only triggers when the config is
   enabled)

 - Increase the trace_seq iterator to 2 page sizes.

 - Allow strings written into tracefs_marker to be larger. Up to just
   under 2 page sizes (based on what trace_seq can hold).

 - Increase the trace_marker_raw write to be as big as a sub-buffer can
   hold.

 - Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has
   been removed.

 - More selftests were added.

 - Some code clean ups as well.
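
The "buffer_subbuf_size_kb" rounding described in the list above can be
illustrated with a short calculation. This is a sketch in Python, not the
kernel code: the function name is made up, a 4K page size is assumed (as on
x86), and any per-sub-buffer header overhead is ignored.

```python
def subbuf_pages(requested_kb, page_kb=4):
    """Return the power-of-2 page count that can hold `requested_kb` kilobytes."""
    pages_needed = -(-requested_kb // page_kb)  # ceiling division
    pages = 1
    while pages < pages_needed:
        pages *= 2                              # only 1, 2, 4, 8, 16, ... are allowed
    return pages
```

Under these assumptions, writing 10 needs 3 pages, which is rounded up to 4
pages (16K), matching the example in the text.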

* tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits)
  ring-buffer: Remove stale comment from ring_buffer_size()
  tracing histograms: Simplify parse_actions() function
  tracing/selftests: Remove exec permissions from trace_marker.tc test
  ring-buffer: Use subbuf_order for buffer page masking
  tracing: Update subbuffer with kilobytes not page order
  ringbuffer/selftest: Add basic selftest to test changing subbuf order
  ring-buffer: Add documentation on the buffer_subbuf_order file
  ring-buffer: Just update the subbuffers when changing their allocation order
  ring-buffer: Keep the same size when updating the order
  tracing: Stop the tracing while changing the ring buffer subbuf size
  tracing: Update snapshot order along with main buffer order
  ring-buffer: Make sure the spare sub buffer used for reads has same size
  ring-buffer: Do no swap cpu buffers if order is different
  ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure
  ring-buffer: Read and write to ring buffers with custom sub buffer size
  ring-buffer: Set new size of the ring buffer sub page
  ring-buffer: Add interface for configuring trace sub buffer size
  ring-buffer: Page size per ring buffer
  ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter
  ring-buffer: Check if absolute timestamp goes backwards
  ...
2024-01-18 14:35:29 -08:00
Linus Torvalds 5b890ad456 Probes update for v6.8:
- Kprobes trace event to show the actual function name in notrace-symbol
   warning. Instead of using user specified symbol name, use "%ps" printk
   format to show the actual symbol at the probe address. Since kprobe
   event accepts the offset from symbol which is bigger than the symbol
   size, user specified symbol may not be the actual probed symbol.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmWdZB0bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b9AcH/R8mNbgAbKlxSXUm0NAG
 xrUcN9vyb9yaLgvoIEvW+XF6EMaCM6G2kG+wSaJB6xFiPlJgf9FhILjDjHAtV2x1
 wXL8r3eLyKvkU3HXfS7RphUTPecgblI16FHZ12x2TkQ41KoRzQf2c7cSQs4B8SHP
 W5LPqvxxqjbV84iqZPScez99S0ZS0Of3ubmepVEm2LDshfhUVMIUH1vfvEn3vQI7
 k5PoNiVRem+rjduERM3I7Zd51K7Lz/5hN56q6ok2vY8hVoRdp0j83Ly36h21ClS9
 CtvlzPX0YjaogVd8Gyc3z+vqy61YiNA1q0fRqIhagmfIy/26s1ORaq/0S2gywxXn
 piA=
 =mfU0
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes update from Masami Hiramatsu:

 - Update the Kprobes trace event to show the actual function name in
   notrace-symbol warning.

   Instead of using the user specified symbol name, use "%ps" printk
   format to show the actual symbol at the probe address. Since kprobe
   event accepts the offset from symbol which is bigger than the symbol
   size, the user specified symbol may not be the actual probed symbol.
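
Why the user specified symbol can differ from the probed one is easy to see
with a made-up symbol table. The sketch below is Python, not kernel code; the
addresses, the function names and the `probed_symbol()` helper are all
hypothetical and only illustrate resolving "symbol + offset" back to the
symbol that actually contains the address.

```python
import bisect

# Made-up symbol table: (start address, name), sorted by address.
SYMBOLS = [(0x1000, "func_a"), (0x1040, "func_b"), (0x10C0, "func_c")]
STARTS = [start for start, _ in SYMBOLS]
BY_NAME = {name: start for start, name in SYMBOLS}


def symbol_at(addr):
    """Return the name of the last symbol starting at or before `addr`."""
    index = bisect.bisect_right(STARTS, addr) - 1
    if index < 0:
        raise ValueError("address is below the first symbol")
    return SYMBOLS[index][1]


def probed_symbol(name, offset):
    """Resolve 'name + offset' to the symbol that contains the address."""
    return symbol_at(BY_NAME[name] + offset)
```

Here `probed_symbol("func_a", 0x10)` stays inside `func_a`, while
`probed_symbol("func_a", 0x50)` lands in `func_b`, so a warning that printed
the user specified name would point at the wrong function.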

* tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  trace/kprobe: Display the actual notrace function when rejecting a probe
2024-01-18 14:21:22 -08:00
Linus Torvalds b0d326da46 Fix a cpufreq related performance regression on certain systems,
where the CPU would remain at the lowest frequency, degrading
 performance substantially.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWpM0sRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1giEg/+Mn9hdLqgE7xPPvCa8UWoJzFGTIYgTT3O
 gma5Ras/kqB6cJTb1zn/HocAIj1Y2gZAsRU/U3IpOfPzklwIKQLBID1PE+d0izAc
 NC9N0LuPau+XbMY5U+G0YNQZzDW+Zioe/9I6uDRKRTtLTdZAk8Plk9yh+tRtpSG8
 aEswyoDOJfvkLbl7kJGymHgxDiDtmXEcz6j2pNlFtcEdHFjiSHo2Jq09DMia9sHr
 W563FSvO7DVBMOosKH8sq7sSPdCBi0zshaWDiyz2M7Ry2uBsqJvx+9qxDnloafTp
 Yqp5rkSVzOxtQwxjtYD+WWy+AgwQqo+O5FHsm0JmoiGVkmpB95bdhQxk2gtshSCo
 IwUt2Gqsndd0JM4v5gOn4G/qCPxFUA/Tx1OMWM89nQUVp3OmIwm8z99f5gFxoSYa
 DFn2P2Ku/A/fiKfWcNDOCyMgYcJNmqRKSjWEh+mfFeexiuWR3jPrQ4GKbSl9Gusw
 vLmBM9pMSyGvivptu+ALXERDDm95wEVVkULgxlcUgpuT8jjpmovbtFj2xYcnzvc4
 EKOgJ0FmXCM/B6QFnnbzgMzu2IThoQpL8Ud3JlMeGDRLGDvZip9AA+0RsnirURwX
 +EuE7fHcDzfAA+Fv9sGosaFmxD1dUh1EJL41XrFZSYfMsZzzzlj+k9PWf9ABCE4R
 6gEHuRza+rU=
 =c7Ib
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix a cpufreq related performance regression on certain systems, where
  the CPU would remain at the lowest frequency, degrading performance
  substantially"

* tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix frequency selection for non-invariant case
2024-01-18 11:57:33 -08:00
Linus Torvalds 80955ae955 Driver core changes for 6.8-rc1
Here is the set of driver core and kernfs changes for 6.8-rc1.  Nothing
 major in here this release cycle, just lots of small cleanups and some
 tweaks on kernfs that in the very end, got reverted and will come back
 in a safer way next release cycle.
 
 Included in here are:
   - more driver core 'const' cleanups and fixes
   - fw_devlink=rpm is now the default behavior
   - kernfs tiny changes to remove some string functions
   - cpu handling in the driver core is updated to work better on many
     systems that add topologies and cpus after booting
   - other minor changes and cleanups
 
 All of the cpu handling patches have been acked by the respective
 maintainers and are coming in here in one series.  Everything has been
 in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeOrg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymtcwCffzvKKkSY9qAp6+0v2WQNkZm1JWoAoJCPYUwF
 If6wEoPLWvRfKx4gIoq9
 =D96r
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of driver core and kernfs changes for 6.8-rc1.
  Nothing major in here this release cycle, just lots of small cleanups
  and some tweaks on kernfs that in the very end, got reverted and will
  come back in a safer way next release cycle.

  Included in here are:

   - more driver core 'const' cleanups and fixes

   - fw_devlink=rpm is now the default behavior

   - kernfs tiny changes to remove some string functions

   - cpu handling in the driver core is updated to work better on many
     systems that add topologies and cpus after booting

   - other minor changes and cleanups

  All of the cpu handling patches have been acked by the respective
  maintainers and are coming in here in one series. Everything has been
  in linux-next for a while with no reported issues"

* tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits)
  Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"
  kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
  class: fix use-after-free in class_register()
  PM: clk: make pm_clk_add_notifier() take a const pointer
  EDAC: constantify the struct bus_type usage
  kernfs: fix reference to renamed function
  driver core: device.h: fix Excess kernel-doc description warning
  driver core: class: fix Excess kernel-doc description warning
  driver core: mark remaining local bus_type variables as const
  driver core: container: make container_subsys const
  driver core: bus: constantify subsys_register() calls
  driver core: bus: make bus_sort_breadthfirst() take a const pointer
  kernfs: d_obtain_alias(NULL) will do the right thing...
  driver core: Better advertise dev_err_probe()
  kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
  kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()
  kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()
  initramfs: Expose retained initrd as sysfs file
  fs/kernfs/dir: obey S_ISGID
  kernel/cgroup: use kernfs_create_dir_ns()
  ...
2024-01-18 09:48:40 -08:00
Andrii Nakryiko 0ba971511d bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
Add enforcement of expected types for context arguments tagged with
arg:ctx (__arg_ctx) tag.

First, any program type will accept generic `void *` context type when
combined with __arg_ctx tag.

Besides accepting "canonical" struct names and `void *`, for a bunch of
program types for which program context is actually a named struct, we
allow a bunch of pragmatic exceptions to match real-world and expected
usage:

  - for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
    canonical context argument type, where `bpf_user_pt_regs_t` is a
    *typedef*, not a struct;
  - for kprobes, we also always accept `struct pt_regs *`, as that's what
    actually is passed as a context to any kprobe program;
  - for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
    down to actual struct type and accept `struct pt_regs *`, or
    `struct user_pt_regs *`, or `struct user_regs_struct *`, depending
    on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
    typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
    expected;
  - for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
    what's expected with BPF_PROG() usage; otherwise, canonical
    `struct bpf_raw_tracepoint_args *` is expected;
  - tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
    formats, both are coded as exceptions as tp_btf is actually a TRACING
    program type, which has no canonical context type;
  - iterator programs accept `struct bpf_iter__xxx *` structs, currently
    with no further iterator-type specific enforcement;
  - fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
  - classic tracepoint programs, as well as syscall and freplace
    programs allow any user-provided type.

In all other cases kernel will enforce exact match of struct name to
expected canonical type. And if user-provided type doesn't match that
expectation, verifier will emit helpful message with expected type name.
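
A subset of the rules listed above can be written down as a lookup table. This
is only an illustrative sketch in Python, not the verifier logic: the table
covers a few of the program types named in this message, "u64/long *" is read
as both `u64 *` and `long *`, the perf_event and iterator cases are left out,
and `ctx_type_ok()` is a hypothetical helper.

```python
ANY = object()  # marker: any user-provided type is allowed

# Accepted context argument types per program type (partial, from the list above).
ACCEPTED = {
    "kprobe": {"bpf_user_pt_regs_t *", "struct pt_regs *"},
    "raw_tp": {"u64 *", "long *", "struct bpf_raw_tracepoint_args *"},
    "tp_btf": {"struct bpf_raw_tracepoint_args *", "u64 *"},
    "fentry": {"u64 *"},
    "tracepoint": ANY,
    "syscall": ANY,
}


def ctx_type_ok(prog_type, arg_type):
    """Check an __arg_ctx argument type against the toy table."""
    if arg_type == "void *":
        return True                 # generic context type, accepted everywhere
    allowed = ACCEPTED[prog_type]
    return allowed is ANY or arg_type in allowed
```

For example, `struct pt_regs *` passes for a kprobe program but not for a
raw_tp program in this table, while `void *` passes for both.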

Note the slightly unnatural way the check is done after processing all the
arguments. This is done to avoid conflict between bpf and bpf-next
trees. Once trees converge, a small follow up patch will place a simple
btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
(which bpf-next tree patch refactored already), removing duplicated
arg:ctx detection logic.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-17 20:20:06 -08:00
Andrii Nakryiko 66967a32d3 bpf: extract bpf_ctx_convert_map logic and make it more reusable
Refactor btf_get_prog_ctx_type() a bit to allow reuse of
bpf_ctx_convert_map logic in more than one place. Simplify the interface by
returning btf_type instead of btf_member (field reference in BTF).

To do the above we need to touch and start untangling
btf_translate_to_vmlinux() implementation. We do the bare minimum to
not regress anything for btf_translate_to_vmlinux(), but its
implementation is very questionable for what it claims to be doing.
Mapping kfunc argument types to corresponding kernel types conceptually
is quite different from recognizing program context types. Fixing this
is out of scope for this change though.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-17 20:20:05 -08:00
Linus Torvalds 296455ade1 Char/Misc and other Driver changes for 6.8-rc1
Here is the big set of char/misc and other driver subsystem changes for
 6.8-rc1.  Lots of stuff in here, but first off, you will get a merge
 conflict in drivers/android/binder_alloc.c when merging this tree due to
 changes coming in through the -mm tree.
 
 The resolution of the merge issue can be found here:
 	https://lore.kernel.org/r/20231207134213.25631ae9@canb.auug.org.au
 or in a simpler patch form in that thread:
 	https://lore.kernel.org/r/ZXHzooF07LfQQYiE@google.com
 
 If there are issues with the merge of this file, please let me know.
 
 Other than lots of binder driver changes (as you can see by the merge
 conflicts) included in here are:
  - lots of iio driver updates and additions
  - spmi driver updates
  - eeprom driver updates
  - firmware driver updates
  - ocxl driver updates
  - mhi driver updates
  - w1 driver updates
  - nvmem driver updates
  - coresight driver updates
  - platform driver remove callback api changes
  - tags.sh script updates
  - bus_type constant marking cleanups
  - lots of other small driver updates
 
 All of these have been in linux-next for a while with no reported issues
 (other than the binder merge conflict.)
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeMMQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynWNgCfQ/Yz7QO6EMLDwHO5LRsb3YMhjL4AoNVdanjP
 YoI7f1I4GBcC0GKNfK6s
 =+Kyv
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc and other driver updates from Greg KH:
 "Here is the big set of char/misc and other driver subsystem changes
  for 6.8-rc1.

  Other than lots of binder driver changes (as you can see by the merge
  conflicts) included in here are:

   - lots of iio driver updates and additions

   - spmi driver updates

   - eeprom driver updates

   - firmware driver updates

   - ocxl driver updates

   - mhi driver updates

   - w1 driver updates

   - nvmem driver updates

   - coresight driver updates

   - platform driver remove callback api changes

   - tags.sh script updates

   - bus_type constant marking cleanups

   - lots of other small driver updates

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (341 commits)
  android: removed duplicate linux/errno
  uio: Fix use-after-free in uio_open
  drivers: soc: xilinx: add check for platform
  firmware: xilinx: Export function to use in other module
  scripts/tags.sh: remove find_sources
  scripts/tags.sh: use -n to test archinclude
  scripts/tags.sh: add local annotation
  scripts/tags.sh: use more portable -path instead of -wholename
  scripts/tags.sh: Update comment (addition of gtags)
  firmware: zynqmp: Convert to platform remove callback returning void
  firmware: turris-mox-rwtm: Convert to platform remove callback returning void
  firmware: stratix10-svc: Convert to platform remove callback returning void
  firmware: stratix10-rsu: Convert to platform remove callback returning void
  firmware: raspberrypi: Convert to platform remove callback returning void
  firmware: qemu_fw_cfg: Convert to platform remove callback returning void
  firmware: mtk-adsp-ipc: Convert to platform remove callback returning void
  firmware: imx-dsp: Convert to platform remove callback returning void
  firmware: coreboot_table: Convert to platform remove callback returning void
  firmware: arm_scpi: Convert to platform remove callback returning void
  firmware: arm_scmi: Convert to platform remove callback returning void
  ...
2024-01-17 16:47:17 -08:00
Linus Torvalds 7f5e47f785 17 hotfixes. 10 address post-6.7 issues and the other 7 are cc:stable.
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZaHe5gAKCRDdBJ7gKXxA
 jrAiAQCYZQuwsNVyGJUuPD/GGQzqVUZNpWcuYwMXXAi6dO5rSAD+LDeFviun2K52
 uHCz4iRq5EwNLA+MbdHtAnQzr+e5CQ8=
 =Jjkw
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc hotfixes from Andrew Morton:
 "For once not mostly MM-related.

  17 hotfixes. 10 address post-6.7 issues and the other 7 are cc:stable"

* tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  userfaultfd: avoid huge_zero_page in UFFDIO_MOVE
  MAINTAINERS: add entry for shrinker
  selftests: mm: hugepage-vmemmap fails on 64K page size systems
  mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval
  mailmap: switch email for Tanzir Hasan
  mailmap: add old address mappings for Randy
  kernel/crash_core.c: make __crash_hotplug_lock static
  efi: disable mirror feature during crashkernel
  kexec: do syscore_shutdown() in kernel_kexec
  mailmap: update entry for Manivannan Sadhasivam
  fs/proc/task_mmu: move mmu notification mechanism inside mm lock
  mm: zswap: switch maintainers to recently active developers and reviewers
  scripts/decode_stacktrace.sh: optionally use LLVM utilities
  kasan: avoid resetting aux_lock
  lib/Kconfig.debug: disable CONFIG_DEBUG_INFO_BTF for Hexagon
  MAINTAINERS: update LTP maintainers
  kdump: defer the insertion of crashkernel resources
2024-01-17 09:31:36 -08:00
Christophe JAILLET 4f41d30cd6 kdb: Fix a potential buffer overflow in kdb_local()
When appending "[defcmd]" to 'kdb_prompt_str', the size of the string
already in the buffer should be taken into account.

An option could be to switch from strncat() to strlcat() which does the
correct test to avoid such an overflow.
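
The size argument matters because strncat() bounds the number of bytes
copied from the source, not the total size of the destination. The
arithmetic can be sketched as a small Python model rather than kernel C;
the buffer size and the length of the existing string are made-up
values, not the ones kdb uses:

```python
def strncat_end(dst_len: int, src_len: int, n: int) -> int:
    """Offset one past the last byte written by strncat(dst, src, n).

    strncat() appends at most n bytes of src followed by a NUL terminator,
    no matter how large the destination buffer really is.
    """
    return dst_len + min(src_len, n) + 1


def strlcat_end(dst_len: int, src_len: int, size: int) -> int:
    """Offset one past the last byte written by strlcat(dst, src, size).

    strlcat() takes the full size of the destination buffer, so it leaves
    room for the string that is already there and for the terminator.
    Returns dst_len unchanged when nothing can be appended.
    """
    if dst_len >= size:
        return dst_len
    room = size - dst_len - 1
    return dst_len + min(src_len, room) + 1


BUF_SIZE = 16                 # hypothetical buffer size
existing = 12                 # hypothetical length of the prompt already stored
suffix = len("[defcmd]")      # 8 bytes to append

# Passing the buffer size as n ignores the 12 bytes already in use.
overflow = strncat_end(existing, suffix, BUF_SIZE) > BUF_SIZE
# strlcat() truncates the suffix so the result still fits.
fits = strlcat_end(existing, suffix, BUF_SIZE) <= BUF_SIZE
```

In this model the strncat() call runs past the end of the buffer, while
the strlcat() call truncates the suffix to the three bytes that still
fit.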

However, this actually looks like dead code, because 'defcmd_in_progress'
can't be true here.
See a more detailed explanation at [1].

[1]: https://lore.kernel.org/all/CAD=FV=WSh7wKN7Yp-3wWiDgX4E3isQ8uh0LCzTmd1v9Cg9j+nQ@mail.gmail.com/

Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
2024-01-17 17:19:06 +00:00
Xuewen Yan 1a65a6d17c workqueue: Add rcu lock check at the end of work item execution
Currently the workqueue only checks the atomic and locking states after
work execution ends. However, a work item may sometimes fail to call
rcu_read_unlock() after acquiring rcu_read_lock(). This causes an RCU
stall, but the RCU stall warning cannot dump the work function, because
the work has already finished.

To quickly discover work items that do not call rcu_read_unlock()
after rcu_read_lock(), add an RCU lock check.

Use rcu_preempt_depth() to check the work's RCU status. Normally, this value
is 0. If it is greater than 0, the work item is still holding the RCU
lock. In that case, print an error message and the work function.
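
The check described above can be sketched as a toy Python model. This is
not the kernel code: `Task`, `run_work_item` and the two work functions
are hypothetical names, and the counter only stands in for what
rcu_preempt_depth() reports.

```python
class Task:
    """Toy stand-in for a task's RCU read-side nesting depth."""

    def __init__(self):
        self.rcu_depth = 0

    def rcu_read_lock(self):
        self.rcu_depth += 1

    def rcu_read_unlock(self):
        self.rcu_depth -= 1


def run_work_item(task, work_fn):
    """Run work_fn, then report it if it left a read-side section open."""
    work_fn(task)
    if task.rcu_depth > 0:
        # The function name is still known here, unlike at stall-warning time.
        return f"BUG: {work_fn.__name__} leaked RCU lock, depth={task.rcu_depth}"
    return None


def balanced_work(task):
    task.rcu_read_lock()
    task.rcu_read_unlock()


def leaky_work(task):
    task.rcu_read_lock()  # missing rcu_read_unlock()


ok_report = run_work_item(Task(), balanced_work)
leak_report = run_work_item(Task(), leaky_work)
```

The point of checking right after the work function returns is that the
offending function can still be named, which the later stall warning
cannot do.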

tj: Reworded the description for clarity. Minor formatting tweak.

Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-16 10:20:44 -10:00
Juri Lelli 85f0ab43f9 kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND
At the time they are created, rescuers of unbound workqueues currently use
cpu_possible_mask as their affinity, but this can be too wide if the
workqueue unbound mask has been set to a subset of cpu_possible_mask.

Make new rescuers use their associated workqueue unbound cpumask from
the start.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-16 08:47:30 -10:00
Audra Mitchell 31c8900728 workqueue.c: Increase workqueue name length
Currently we limit the size of the workqueue name to 24 characters due to
commit ecf6881ff3 ("workqueue: make workqueue->name[] fixed len").
Increase the size to 32 characters and print a warning if the requested
name is longer than the limit of 32 characters.
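
A minimal sketch of the truncate-and-warn behaviour, written in Python
for illustration only. `NAME_LIMIT` and `set_wq_name` are hypothetical
names, and the sketch does not model whether the C limit also has to
hold a terminating NUL byte.

```python
NAME_LIMIT = 32  # limit taken from the commit message


def set_wq_name(requested: str):
    """Return (stored_name, warning); warning is None when the name fits."""
    if len(requested) > NAME_LIMIT:
        stored = requested[:NAME_LIMIT]
        warning = f"workqueue: name exceeds limit, truncating to: {stored}"
        return stored, warning
    return requested, None


short_name, short_warning = set_wq_name("events_unbound")
long_name, long_warning = set_wq_name("x" * 40)
```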

Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-16 08:31:24 -10:00
Hao Sun 22c7fa171a bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS
For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses the fixed offset
for validation. However, variable-offset pointer ALU is not prohibited
for this pointer kind, so the variable offset is never checked.

The following prog is accepted:

  func#0 @0
  0: R1=ctx() R10=fp0
  0: (bf) r6 = r1                       ; R1=ctx() R6_w=ctx()
  1: (79) r7 = *(u64 *)(r6 +144)        ; R6_w=ctx() R7_w=flow_keys()
  2: (b7) r8 = 1024                     ; R8_w=1024
  3: (37) r8 /= 1                       ; R8_w=scalar()
  4: (57) r8 &= 1024                    ; R8_w=scalar(smin=smin32=0,
  smax=umax=smax32=umax32=1024,var_off=(0x0; 0x400))
  5: (0f) r7 += r8
  mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
  mark_precise: frame0: regs=r8 stack= before 4: (57) r8 &= 1024
  mark_precise: frame0: regs=r8 stack= before 3: (37) r8 /= 1
  mark_precise: frame0: regs=r8 stack= before 2: (b7) r8 = 1024
  6: R7_w=flow_keys(smin=smin32=0,smax=umax=smax32=umax32=1024,var_off
  =(0x0; 0x400)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1024,
  var_off=(0x0; 0x400))
  6: (79) r0 = *(u64 *)(r7 +0)          ; R0_w=scalar()
  7: (95) exit

This prog loads flow_keys into r7, adds the variable offset r8 to r7,
and finally causes an out-of-bounds access:

  BUG: unable to handle page fault for address: ffffc90014c80038
  [...]
  Call Trace:
   <TASK>
   bpf_dispatcher_nop_func include/linux/bpf.h:1231 [inline]
   __bpf_prog_run include/linux/filter.h:651 [inline]
   bpf_prog_run include/linux/filter.h:658 [inline]
   bpf_prog_run_pin_on_cpu include/linux/filter.h:675 [inline]
   bpf_flow_dissect+0x15f/0x350 net/core/flow_dissector.c:991
   bpf_prog_test_run_flow_dissector+0x39d/0x620 net/bpf/test_run.c:1359
   bpf_prog_test_run kernel/bpf/syscall.c:4107 [inline]
   __sys_bpf+0xf8f/0x4560 kernel/bpf/syscall.c:5475
   __do_sys_bpf kernel/bpf/syscall.c:5561 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:5559 [inline]
   __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5559
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x63/0x6b

Fix this by rejecting ptr alu with variable offset on flow_keys.
Applying the patch rejects the program with "R7 pointer arithmetic
on flow_keys prohibited".
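
The rule can be illustrated with a small Python model of the
(value; mask) pairs shown as var_off in the log above. It is a sketch,
not the verifier code: `Tnum` and `check_flow_keys_alu` are hypothetical
names, and only the rejection message is taken from this commit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tnum:
    """Tracked number: bits set in mask are unknown to the verifier."""
    value: int
    mask: int

    def is_const(self) -> bool:
        return self.mask == 0


def check_flow_keys_alu(reg: str, offset: Tnum) -> Optional[str]:
    """Return an error string if the added offset is not a known constant."""
    if not offset.is_const():
        return f"{reg} pointer arithmetic on flow_keys prohibited"
    return None


# r8 after "r8 &= 1024": var_off=(0x0; 0x400), so bit 10 is unknown.
variable_off = Tnum(value=0x0, mask=0x400)
# A plain "r8 = 1024" would be fully known.
constant_off = Tnum(value=1024, mask=0x0)

rejected = check_flow_keys_alu("R7", variable_off)
accepted = check_flow_keys_alu("R7", constant_off)
```

With a constant offset the fixed-offset bounds check still sees the whole
access; with an unknown bit in the mask it would not, which is why the
addition itself is refused in this model.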

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240115082028.9992-1-sunhao.th@gmail.com
2024-01-16 17:12:29 +01:00
Vincent Guittot e37617c8e5 sched/fair: Fix frequency selection for non-invariant case
Linus reported a ~50% performance regression on single-threaded
workloads on his AMD Ryzen system, and bisected it to:

  9c0b4bb7f6 ("sched/cpufreq: Rework schedutil governor performance estimation")

When frequency invariance is not enabled, get_capacity_ref_freq(policy)
is supposed to return the current frequency and the performance margin
applied by map_util_perf(), enabling the utilization to go above the
maximum compute capacity and to select a higher frequency than the current one.

After the changes in 9c0b4bb7f6, the performance margin was applied
earlier in the path to take into account utilization clampings and
we couldn't get a utilization higher than the maximum compute capacity,
and the CPU remained 'stuck' at lower frequencies.

To fix this, we must use a frequency above the current frequency to
get a chance to select a higher OPP when the current one becomes fully used.
Apply the same margin and return a frequency 25% higher than the current
one in order to switch to the next OPP before we fully use the CPU
at the current one.
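
As a worked example of the margin, the Python sketch below assumes a
simple linear mapping from utilization to frequency
(freq = ref_freq * util / max_capacity). The function names and the
2 GHz figure are illustrative only and are not taken from the patch.

```python
def ref_freq_with_margin(cur_freq_khz: int) -> int:
    """Reference frequency 25% above the current one (cur + cur / 4)."""
    return cur_freq_khz + (cur_freq_khz >> 2)


def requested_freq(util: int, max_capacity: int, ref_freq_khz: int) -> int:
    """Assumed linear utilization-to-frequency mapping."""
    return ref_freq_khz * util // max_capacity


CUR = 2_000_000   # hypothetical current frequency, in kHz
MAX_CAP = 1024    # utilization can no longer exceed this value

# Reference equal to the current frequency: even a fully used CPU asks
# for no more than CUR, so it stays at the current OPP.
stuck = requested_freq(MAX_CAP, MAX_CAP, CUR)

# Reference with the 25% margin: a fully used CPU asks for more than CUR.
unstuck = requested_freq(MAX_CAP, MAX_CAP, ref_freq_with_margin(CUR))
```

Under this assumed mapping, the request crosses the current frequency
once utilization reaches about 80% of the maximum capacity, which is the
"before we fully use the CPU" behaviour the message describes.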

[ mingo: Clarified the changelog. ]

Fixes: 9c0b4bb7f6 ("sched/cpufreq: Rework schedutil governor performance estimation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wyes Karny <wkarny@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Wyes Karny <wkarny@gmail.com>
Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
2024-01-16 10:41:25 +01:00
Randy Dunlap 7c65aa3cc0 dma-debug: fix kernel-doc warnings
Update the kernel-doc comments to catch up with the code changes and
fix the kernel-doc warnings:

debug.c:83: warning: Excess struct member 'stacktrace' description in 'dma_debug_entry'
debug.c:83: warning: Function parameter or struct member 'stack_len' not described in 'dma_debug_entry'
debug.c:83: warning: Function parameter or struct member 'stack_entries' not described in 'dma_debug_entry'

Fixes: 746017ed8d ("dma/debug: Simplify stracktrace retrieval")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: iommu@lists.linux.dev
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-01-15 08:20:39 +01:00
Linus Torvalds 23a80d462c RCU pull request for v6.8
This pull request contains the following branches:
 
 doc.2023.12.13a: Documentation and comment updates.
 
 torture.2023.11.23a: RCU torture, locktorture updates that include
         cleanups; nolibc init build support for mips, ppc and rv64;
         testing of mid stall duration scenario and fixing fqs task
         creation conditions.
 
 fixes.2023.12.13a: Misc fixes, most notably restricting usage of
         RCU CPU stall notifiers, to confine their usage primarily
         to debug kernels.
 
 rcu-tasks.2023.12.12b: RCU tasks minor fixes.
 
 srcu.2023.12.13a: lockdep annotation fix for NMI-safe accesses,
         callback advancing/acceleration cleanup and documentation
         improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZYUS0AAKCRAAHS7/6Z0w
 pRXgAQD+k8oqjvKL6la61ppWm5Y7NLjdj/IbV+cOd42jKnM6PAEAyavNhX0n7zGx
 o9cDlvIDxJfHnFrOTc5WLH9yEs3IiQQ=
 =8rdu
 -----END PGP SIGNATURE-----

Merge tag 'rcu.release.v6.8' of https://github.com/neeraju/linux

Pull RCU updates from Neeraj Upadhyay:

 - Documentation and comment updates

 - RCU torture, locktorture updates that include cleanups; nolibc init
   build support for mips, ppc and rv64; testing of mid stall duration
   scenario and fixing fqs task creation conditions

 - Misc fixes, most notably restricting usage of RCU CPU stall
   notifiers, to confine their usage primarily to debug kernels

 - RCU tasks minor fixes

 - lockdep annotation fix for NMI-safe accesses, callback
   advancing/acceleration cleanup and documentation improvements

* tag 'rcu.release.v6.8' of https://github.com/neeraju/linux:
  rcu: Force quiescent states only for ongoing grace period
  doc: Clarify historical disclaimers in memory-barriers.txt
  doc: Mention address and data dependencies in rcu_dereference.rst
  doc: Clarify RCU Tasks reader/updater checklist
  rculist.h: docs: Fix wrong function summary
  Documentation: RCU: Remove repeated word in comments
  srcu: Use try-lock lockdep annotation for NMI-safe access.
  srcu: Explain why callbacks invocations can't run concurrently
  srcu: No need to advance/accelerate if no callback enqueued
  srcu: Remove superfluous callbacks advancing from srcu_gp_start()
  rcu: Remove unused macros from rcupdate.h
  rcu: Restrict access to RCU CPU stall notifiers
  rcu-tasks: Mark RCU Tasks accesses to current->rcu_tasks_idle_cpu
  rcutorture: Add fqs_holdoff check before fqs_task is created
  rcutorture: Add mid-sized stall to TREE07
  rcutorture: add nolibc init support for mips, ppc and rv64
  locktorture: Increase Hamming distance between call_rcu_chain and rcu_call_chains
2024-01-12 16:35:58 -08:00
Andrew Morton 4e87ff59ce kernel/crash_core.c: make __crash_hotplug_lock static
sparse warnings:
kernel/crash_core.c:749:1: sparse: sparse: symbol '__crash_hotplug_lock' was not declared. Should it be static?

Fixes: e2a8f20dd8 ("Crash: add lock to serialize crash hotplug handling")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401080654.IjjU5oK7-lkp@intel.com/
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-12 15:20:47 -08:00
James Gowans 7bb943806f kexec: do syscore_shutdown() in kernel_kexec
syscore_shutdown() runs driver and module callbacks to get the system into
a state where it can be correctly shut down.  In commit 6f389a8f1d ("PM
/ reboot: call syscore_shutdown() after disable_nonboot_cpus()")
syscore_shutdown() was removed from kernel_restart_prepare() and hence got
(incorrectly?) removed from the kexec flow.  This was innocuous until
commit 6735150b69 ("KVM: Use syscore_ops instead of reboot_notifier to
hook restart/shutdown") changed the way that KVM registered its shutdown
callbacks, switching from reboot notifiers to syscore_ops.shutdown.  As
syscore_shutdown() is missing from kexec, KVM's shutdown hook is not run
and virtualisation is left enabled on the boot CPU which results in triple
faults when switching to the new kernel on Intel x86 VT-x with VMXE
enabled.

Fix this by adding syscore_shutdown() to the kexec sequence.  In terms of
where to add it, it is being added after migrating the kexec task to the
boot CPU, but before APs are shut down.  It is not totally clear if this
is the best place: in commit 6f389a8f1d ("PM / reboot: call
syscore_shutdown() after disable_nonboot_cpus()") it is stated that
"syscore_ops operations should be carried with one CPU on-line and
interrupts disabled." APs are only offlined later in machine_shutdown(),
so this syscore_shutdown() is being run while APs are still online.  This
seems to be the correct place as it matches where syscore_shutdown() is
run in the reboot and halt flows - they also run it before APs are shut
down.  The assumption is that the commit message in commit 6f389a8f1d
("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()") is
no longer valid.

KVM has been discussed here as it is what broke loudly by not having
syscore_shutdown() in kexec, but this change impacts more than just KVM;
all drivers/modules which register a syscore_ops.shutdown callback will
now be invoked in the kexec flow.  Looking at some of them like x86 MCE it
is probably more correct to also shut these down during kexec. 
Maintainers of all drivers which use syscore_ops.shutdown are added on CC
for visibility.  They are:

arch/powerpc/platforms/cell/spu_base.c  .shutdown = spu_shutdown,
arch/x86/kernel/cpu/mce/core.c	        .shutdown = mce_syscore_shutdown,
arch/x86/kernel/i8259.c                 .shutdown = i8259A_shutdown,
drivers/irqchip/irq-i8259.c	        .shutdown = i8259A_shutdown,
drivers/irqchip/irq-sun6i-r.c	        .shutdown = sun6i_r_intc_shutdown,
drivers/leds/trigger/ledtrig-cpu.c	.shutdown = ledtrig_cpu_syscore_shutdown,
drivers/power/reset/sc27xx-poweroff.c	.shutdown = sc27xx_poweroff_shutdown,
kernel/irq/generic-chip.c	        .shutdown = irq_gc_shutdown,
virt/kvm/kvm_main.c	                .shutdown = kvm_shutdown,

This has been tested by doing a kexec on x86_64 and aarch64.

Link: https://lkml.kernel.org/r/20231213064004.2419447-1-jgowans@amazon.com
Fixes: 6735150b69 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
Signed-off-by: James Gowans <jgowans@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Jernej Skrabec <jernej.skrabec@gmail.com>
Cc: Samuel Holland <samuel@sholland.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Orson Zhai <orsonzhai@gmail.com>
Cc: Alexander Graf <graf@amazon.de>
Cc: Jan H. Schoenherr <jschoenh@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-12 15:20:47 -08:00
Huacai Chen 4a693ce65b kdump: defer the insertion of crashkernel resources
In /proc/iomem, sub-regions should be inserted after their parent,
otherwise the insertion of the parent resource fails.  But after the
generic crashkernel reservation was applied, in both RISC-V and ARM64
(LoongArch will also use generic reservation later on), crashkernel
resources are inserted before their parent, which causes the parent to
disappear in /proc/iomem.  So we defer the insertion of crashkernel
resources to an early_initcall().

1, Without 'crashkernel' parameter:

 100d0100-100d01ff : LOON0001:00
   100d0100-100d01ff : LOON0001:00 LOON0001:00
 100e0000-100e0bff : LOON0002:00
   100e0000-100e0bff : LOON0002:00 LOON0002:00
 1fe001e0-1fe001e7 : serial
 90400000-fa17ffff : System RAM
   f6220000-f622ffff : Reserved
   f9ee0000-f9ee3fff : Reserved
   fa120000-fa17ffff : Reserved
 fa190000-fe0bffff : System RAM
   fa190000-fa1bffff : Reserved
 fe4e0000-47fffffff : System RAM
   43c000000-441ffffff : Reserved
   47ff98000-47ffa3fff : Reserved
   47ffa4000-47ffa7fff : Reserved
   47ffa8000-47ffabfff : Reserved
   47ffac000-47ffaffff : Reserved
   47ffb0000-47ffb3fff : Reserved

2, With 'crashkernel' parameter, before this patch:

 100d0100-100d01ff : LOON0001:00
   100d0100-100d01ff : LOON0001:00 LOON0001:00
 100e0000-100e0bff : LOON0002:00
   100e0000-100e0bff : LOON0002:00 LOON0002:00
 1fe001e0-1fe001e7 : serial
 e6200000-f61fffff : Crash kernel
 fa190000-fe0bffff : System RAM
   fa190000-fa1bffff : Reserved
 fe4e0000-47fffffff : System RAM
   43c000000-441ffffff : Reserved
   47ff98000-47ffa3fff : Reserved
   47ffa4000-47ffa7fff : Reserved
   47ffa8000-47ffabfff : Reserved
   47ffac000-47ffaffff : Reserved
   47ffb0000-47ffb3fff : Reserved

3, With 'crashkernel' parameter, after this patch:

 100d0100-100d01ff : LOON0001:00
   100d0100-100d01ff : LOON0001:00 LOON0001:00
 100e0000-100e0bff : LOON0002:00
   100e0000-100e0bff : LOON0002:00 LOON0002:00
 1fe001e0-1fe001e7 : serial
 90400000-fa17ffff : System RAM
   e6200000-f61fffff : Crash kernel
   f6220000-f622ffff : Reserved
   f9ee0000-f9ee3fff : Reserved
   fa120000-fa17ffff : Reserved
 fa190000-fe0bffff : System RAM
   fa190000-fa1bffff : Reserved
 fe4e0000-47fffffff : System RAM
   43c000000-441ffffff : Reserved
   47ff98000-47ffa3fff : Reserved
   47ffa4000-47ffa7fff : Reserved
   47ffa8000-47ffabfff : Reserved
   47ffac000-47ffaffff : Reserved
   47ffb0000-47ffb3fff : Reserved
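
The ordering problem in the three listings can be reproduced with a toy
resource tree in Python. This is a simplified sketch: it assumes the
parent "System RAM" region is registered with a request_resource()-style
call that fails on any overlap, and the `Resource` class and helper
functions here are hypothetical models, not the kernel implementation.

```python
class Resource:
    def __init__(self, name, start, end):
        self.name, self.start, self.end = name, start, end
        self.children = []


def request_resource(parent, new):
    """Add new as a direct child of parent; fail if it overlaps a sibling."""
    if new.start < parent.start or new.end > parent.end:
        return False
    for child in parent.children:
        if new.start <= child.end and child.start <= new.end:
            return False
    parent.children.append(new)
    return True


def insert_resource(root, new):
    """Descend into the child that fully contains new, then request it there."""
    parent = root
    while True:
        host = next((c for c in parent.children
                     if c.start <= new.start and new.end <= c.end), None)
        if host is None:
            return request_resource(parent, new)
        parent = host


def build(crash_first):
    root = Resource("iomem", 0, 2**64 - 1)
    ram = Resource("System RAM", 0x90400000, 0xFA17FFFF)
    crash = Resource("Crash kernel", 0xE6200000, 0xF61FFFFF)
    if crash_first:
        insert_resource(root, crash)         # child goes in at the top level
        ram_ok = request_resource(root, ram) # parent now conflicts with it
    else:
        ram_ok = request_resource(root, ram)
        insert_resource(root, crash)         # deferred: lands under System RAM
    return root, ram, ram_ok


early_root, early_ram, early_ok = build(crash_first=True)
late_root, late_ram, late_ok = build(crash_first=False)
```

In the early ordering the parent request fails and only "Crash kernel"
appears at the top level, matching listing 2; in the deferred ordering
"Crash kernel" ends up nested under "System RAM", matching listing 3.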

Link: https://lkml.kernel.org/r/20231229080213.2622204-1-chenhuacai@loongson.cn
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Fixes: 0ab97169aa ("crash_core: add generic function to do reservation")
Cc: Baoquan He <bhe@redhat.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: <stable@vger.kernel.org>	[6.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-12 15:20:45 -08:00
Linus Torvalds 893e2f9eac dma-mapping updates for Linux 6.8
- reduce area lock contention for non-primary IO TLB pools (Petr Tesarik)
  - don't store redundant offsets in the dma_ranges structures
    (Robin Murphy)
  - clear dev->dma_mem when freeing per-device pools (Joakim Zhang)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmWcGxQLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNf6hAAi9wP1ehnIqHCczCxpideyJnE76r+LgjInjudQUqE
 cnpl7E+dO/e/7Trk+L7hIrzi5uz8m9e+DZgL9wUY4h5mvJ+8ELet3Ec62UMVL1g0
 cWSYOtlMdUZn9Oy+qy2TTCa//1HyzcWQdplVwcOqD7zCLO4PavUR18+Vw5eDUpBR
 TE8EB+7P8ta8XSFnsryZS4zI1AhTTjZfh8ZgPdp+niBh7XAqOFNn3WiGK4qvA9o6
 nIjIV6ydBjZYkyYPeDsqszqmZG64mEeGUZhLWmjAyg1/c8so7uFviNfJ05od34js
 aWpmFxrM9Mm4BaBiU3FsSQkMBGCGaD/H2UXjIl1Qayt+pzUfaP7+8UWWo/T7Mj35
 RFKe9xzlPY8rqOszdBBvy6lCWguHXw4d4IFoqOz+YoUaxlV+RAbFOCHtW2BNvtPe
 b1YCr/FKNQ8NxsJWnbcehDtClY461pqBbaDrio3K7eTJgG10biAoWBfPhV+5VEer
 aB14krQcn7v1vXjfLu2huSrPt1ZjXuWVfXA3nO3Mt3VWxZWat82gLkFyt5N6ZfQ4
 juaDMX3Vzlz3VPf4MHFC+yFRx55b/9X26lC1BlSoo4tAknoo746Lvy/PasZarILC
 sGPt+2BVlPQ466zkjky4GtoNof1TMNuPF0Xr/mNCEjxCYUGQcTIdmOvA9y8mu/V+
 7M4=
 =v5F+
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.8-2024-01-08' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - reduce area lock contention for non-primary IO TLB pools (Petr
   Tesarik)

 - don't store redundant offsets in the dma_ranges structures (Robin
   Murphy)

 - clear dev->dma_mem when freeing per-device pools (Joakim Zhang)

* tag 'dma-mapping-6.8-2024-01-08' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: clear dev->dma_mem to NULL after freeing it
  swiotlb: reduce area lock contention for non-primary IO TLB pools
  dma-mapping: don't store redundant offsets
2024-01-11 13:46:50 -08:00
Linus Torvalds 3e7aeb78ab Networking changes for 6.8.
Core & protocols
 ----------------
 
  - Analyze and reorganize core networking structs (socks, netdev,
    netns, mibs) to optimize cacheline consumption and set up
    build time warnings to safeguard against future header changes.
    This improves TCP performance with many concurrent connections
    by up to 40%.
 
  - Add page-pool netlink-based introspection, exposing the
    memory usage and recycling stats. This helps identify
    bad PP users and possible leaks.
 
  - Refine TCP/DCCP source port selection to no longer favor even
    source port at connect() time when IP_LOCAL_PORT_RANGE is set.
    This lowers the time taken by connect() for hosts having
    many active connections to the same destination.
 
  - Refactor the TCP bind conflict code, shrinking related socket
    structs.
 
  - Refactor TCP SYN-Cookie handling, as a preparation step to
    allow arbitrary SYN-Cookie processing via eBPF.
 
  - Tune optmem_max for 0-copy usage, increasing the default value
    to 128KB and namespacifying it.
 
  - Allow coalescing for cloned skbs coming from page pools, improving
    RX performance with some common configurations.
 
  - Reduce extension header parsing overhead at GRO time.
 
  - Add bridge MDB bulk deletion support, allowing user-space to
    request the deletion of matching entries.
 
  - Reorder nftables struct members, to keep data accessed by the
    datapath first.
 
  - Introduce TC block ports tracking and use. This allows supporting
    multicast-like behavior at the TC layer.
 
  - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
    classifiers (RSVP and tcindex).
 
  - More data-race annotations.
 
  - Extend the diag interface to dump TCP bound-only sockets.
 
  - Conditional notification of events for TC qdisc class and actions.
 
  - Support for WPAN dynamic associations with nearby devices, to form
    a sub-network using a specific PAN ID.
 
  - Implement SMCv2.1 virtual ISM device support.
 
  - Add support for batman-adv multicast packet type.
 
 BPF
 ---
 
  - Tons of verifier improvements:
    - BPF register bounds logic and range support along with a large
      test suite
    - log improvements
    - complete precision tracking support for register spills
    - track aligned STACK_ZERO cases as imprecise spilled registers. It
      improves the verifier "instructions processed" metric from single
      digit to 50-60% for some programs
    - support for user's global BPF subprogram arguments with few
      commonly requested annotations for a better developer experience
    - support tracking of BPF_JNE which helps cases when the compiler
      transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
      like
    - several fixes
 
  - Add initial TX metadata implementation for AF_XDP with support in
    mlx5 and stmmac drivers. Two types of offloads are supported right
    now, that is, TX timestamp and TX checksum offload.
 
  - Fix kCFI bugs in BPF; all forms of indirect calls from BPF into
    the kernel and from the kernel into BPF now work with CFI enabled.
    This allows BPF to work with CONFIG_FINEIBT=y.
 
  - Change BPF verifier logic to validate global subprograms lazily
    instead of unconditionally before the main program, so they can be
    guarded using BPF CO-RE techniques.
 
  - Support uid/gid options when mounting bpffs.
 
  - Add a new kfunc which acquires the associated cgroup of a task
    within a specific cgroup v1 hierarchy where the latter is identified
    by its id.
 
  - Extend verifier to allow bpf_refcount_acquire() of a map value field
    obtained via direct load which is a use-case needed in sched_ext.
 
  - Add BPF link_info support for uprobe multi link along with bpftool
    integration for the latter.
 
  - Support for VLAN tag in XDP hints.
 
  - Remove deprecated bpfilter kernel leftovers given the project
    is developed in user-space (https://github.com/facebook/bpfilter).
 
 Misc
 ----
 
  - Support for parallel TC self-tests execution.
 
  - Increase MPTCP self-tests coverage.
 
  - Updated the bridge documentation, including several so-far
    undocumented features.
 
  - Convert all the net self-tests to run in unique netns, to
    avoid random failures due to conflict and allow concurrent
    runs.
 
  - Add TCP-AO self-tests.
 
  - Add kunit tests for both cfg80211 and mac80211.
 
  - Autogenerate Netlink families documentation from YAML spec.
 
  - Add yml-gen support for fixed headers and recursive nests, the
    tool can now generate user-space code for all genetlink families
    for which we have specs.
 
  - A bunch of additional module descriptions fixes.
 
  - Catch incorrect freeing of pages belonging to a page pool.
 
 Driver API
 ----------
 
  - Rust abstractions for network PHY drivers; they do not yet cover
    the full C API, but already allow implementing functional PHY
    drivers in Rust.
 
  - Introduce queue and NAPI support in the netdev Netlink interface,
    allowing complete access to the device <> NAPIs <> queues
    relationship.
 
  - Introduce notifications filtering for devlink to allow control
    applications to scale to thousands of instances.
 
  - Improve PHY validation, requesting rate matching information for
    each ethtool link mode supported by both the PHY and host.
 
  - Add support for ethtool symmetric-xor RSS hash.
 
  - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
    platform.
 
  - Expose pin fractional frequency offset value over new DPLL generic
    netlink attribute.
 
  - Convert older drivers to platform remove callback returning void.
 
  - Add support for PHY package MMD read/write.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Octeon CN10K devices
    - Broadcom 5760X P7
    - Qualcomm SM8550 SoC
    - Texas Instruments DP83TG720S PHY
 
  - Bluetooth:
    - IMC Networks Bluetooth radio
 
 Removed
 -------
 
  - WiFi:
    - libertas 16-bit PCMCIA support
    - Atmel at76c50x drivers
    - HostAP ISA/PCMCIA style 802.11b driver
    - zd1201 802.11b USB dongles
    - Orinoco ISA/PCMCIA 802.11b driver
    - Aviator/Raytheon driver
    - Planet WL3501 driver
    - RNDIS USB 802.11b driver
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - Intel (100G, ice, idpf):
      - allow one by one port representors creation and removal
      - add temperature and clock information reporting
      - add get/set for ethtool's header split ringparam
      - add FW logging again
      - add support for switchdev hardware packet mirroring
      - iavf: implement symmetric-xor RSS hash
      - igc: add support for concurrent physical and free-running timers
      - i40e: increase the allowable descriptors
    - nVidia/Mellanox:
      - Preparation for Socket-Direct multi-dev netdev. This will allow
        future releases to combine multiple PF devices attached to
        different NUMA nodes under the same netdev
    - Broadcom (bnxt):
      - TX completion handling improvements
      - add basic ntuple filter support
      - reduce MSIX vectors usage for MQPRIO offload
      - add VXLAN support, USO offload and TX coalesce completion for P7
    - Marvell Octeon EP:
      - xmit-more support
      - add PF-VF mailbox support and use it for FW notifications for VFs
    - Wangxun (ngbe/txgbe):
      - implement ethtool functions to operate pause param, ring param,
        coalesce channel number and msglevel
    - Netronome/Corigine (nfp):
      - add flow-steering support
      - support UDP segmentation offload
 
  - Ethernet NICs embedded, slower, virtual:
    - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver
    - stmmac: add support for HW-accelerated VLAN stripping
    - TI AM654x sw: add mqprio, frame preemption & coalescing
    - gve: add support for non-4k page sizes.
    - virtio-net: support dynamic coalescing moderation
 
  - nVidia/Mellanox Ethernet datacenter switches:
    - allow firmware upgrade without a reboot
    - more flexible support for bridge flooding via the compressed
      FID flooding mode
 
  - Ethernet embedded switches:
    - Microchip:
      - fine-tune flow control and speed configurations in KSZ8xxx
      - KSZ88X3: enable setting rmii reference
    - Renesas:
      - add jumbo frames support
    - Marvell:
      - 88E6xxx: add "eth-mac" and "rmon" stats support
 
  - Ethernet PHYs:
    - aquantia: add firmware load support
    - at803x: refactor the driver to simplify adding support for more
      chip variants
    - NXP C45 TJA11xx: Add MACsec offload support
 
  - Wifi:
    - MediaTek (mt76):
      - NVMEM EEPROM improvements
      - mt7996 Extremely High Throughput (EHT) improvements
      - mt7996 Wireless Ethernet Dispatcher (WED) support
      - mt7996 36-bit DMA support
    - Qualcomm (ath12k):
      - support for a single MSI vector
      - WCN7850: support AP mode
    - Intel (iwlwifi):
      - new debugfs file fw_dbg_clear
      - allow concurrent P2P operation on DFS channels
 
  - Bluetooth:
    - QCA2066: support HFP offload
    - ISO: more broadcast-related improvements
    - NXP: better recovery in case receiver/transmitter get out of sync
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk
 PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35
 98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ
 HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU
 Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf
 k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK
 yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy
 C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw
 q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo
 PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX
 pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF
 ucqPEcZB5cRE
 =1bW6
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "The most interesting thing is probably the networking structs
  reorganization and a significant amount of changes is around
  self-tests.

  Core & protocols:

   - Analyze and reorganize core networking structs (socks, netdev,
     netns, mibs) to optimize cacheline consumption and set up build
     time warnings to safeguard against future header changes

     This improves TCP performance with many concurrent connections by
     up to 40%

   - Add page-pool netlink-based introspection, exposing the memory
     usage and recycling stats. This helps identify bad PP users and
     possible leaks

   - Refine TCP/DCCP source port selection to no longer favor even
     source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
     lowers the time taken by connect() for hosts having many active
     connections to the same destination

   - Refactor the TCP bind conflict code, shrinking related socket
     structs

   - Refactor TCP SYN-Cookie handling, as a preparation step to allow
     arbitrary SYN-Cookie processing via eBPF

   - Tune optmem_max for 0-copy usage, increasing the default value to
     128KB and making it per network namespace

   - Allow coalescing for cloned skbs coming from page pools, improving
     RX performance with some common configurations

   - Reduce extension header parsing overhead at GRO time

   - Add bridge MDB bulk deletion support, allowing user-space to
     request the deletion of matching entries

   - Reorder nftables struct members, to keep data accessed by the
     datapath first

   - Introduce TC block ports tracking and use. This allows supporting
     multicast-like behavior at the TC layer

   - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
     classifiers (RSVP and tcindex)

   - More data-race annotations

   - Extend the diag interface to dump TCP bound-only sockets

   - Conditional notification of events for TC qdisc class and actions

   - Support for WPAN dynamic associations with nearby devices, to form
     a sub-network using a specific PAN ID

   - Implement SMCv2.1 virtual ISM device support

   - Add support for batman-adv multicast packet type
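
The struct reorganization item above (grouping hot fields to optimize cacheline consumption, with build-time checks against future header changes) can be illustrated with a small userspace sketch. Everything here is an assumption for illustration: `struct sketch_sock`, the 64-byte line size and the `SKETCH_*` names are hypothetical and are not the kernel's types or macros. The sketch also uses a hard `static_assert`, whereas the pull message speaks of build-time warnings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size for this sketch; real targets vary. */
#define SKETCH_CACHELINE_BYTES 64

/*
 * Hypothetical structure, not a kernel type. Fields read together on the
 * receive fast path are placed next to each other at the start.
 */
struct sketch_sock {
	/* hot group: touched for every received packet */
	uint32_t rcv_nxt;
	uint32_t copied_seq;
	uint64_t rx_bytes;
	void *rx_queue_head;

	/* cold data: configuration that is rarely read */
	char name[96];
	uint32_t flags;
};

/* Bytes covered from the start of `first` to the end of `last`. */
#define SKETCH_GROUP_SPAN(type, first, last)                \
	(offsetof(type, last) + sizeof(((type *)0)->last) - \
	 offsetof(type, first))

/* Fail the build if a later edit pushes the hot group past one line. */
static_assert(SKETCH_GROUP_SPAN(struct sketch_sock, rcv_nxt, rx_queue_head)
		      <= SKETCH_CACHELINE_BYTES,
	      "hot RX fields no longer fit in one cache line");

/* Runtime view of the same number, handy for printing or testing. */
static size_t sketch_hot_group_span(void)
{
	return SKETCH_GROUP_SPAN(struct sketch_sock, rcv_nxt, rx_queue_head);
}
```

The span check only says something about cache behaviour if the group starts on a line boundary; here it starts at offset 0 and the sketch assumes the structure itself is allocated line-aligned.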

  BPF:

   - Tons of verifier improvements:
       - BPF register bounds logic and range support along with a large
         test suite
       - log improvements
       - complete precision tracking support for register spills
       - track aligned STACK_ZERO cases as imprecise spilled registers.
         This improves the verifier "instructions processed" metric from
         single digit to 50-60% for some programs
       - support for user's global BPF subprogram arguments with few
         commonly requested annotations for a better developer
         experience
       - support tracking of BPF_JNE which helps cases when the compiler
         transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
         like
       - several fixes

   - Add initial TX metadata implementation for AF_XDP with support in
     mlx5 and stmmac drivers. Two types of offloads are supported right
     now, that is, TX timestamp and TX checksum offload

   - Fix kCFI bugs in BPF, so that all forms of indirect calls from BPF
     into kernel and from kernel into BPF work with CFI enabled. This
     allows BPF to work with CONFIG_FINEIBT=y

   - Change BPF verifier logic to validate global subprograms lazily
     instead of unconditionally before the main program, so they can be
     guarded using BPF CO-RE techniques

   - Support uid/gid options when mounting bpffs

   - Add a new kfunc which acquires the associated cgroup of a task
     within a specific cgroup v1 hierarchy where the latter is
     identified by its id

   - Extend verifier to allow bpf_refcount_acquire() of a map value
     field obtained via direct load which is a use-case needed in
     sched_ext

   - Add BPF link_info support for uprobe multi link along with bpftool
     integration for the latter

   - Support for VLAN tag in XDP hints

   - Remove deprecated bpfilter kernel leftovers given the project is
     developed in user-space (https://github.com/facebook/bpfilter)

  Misc:

   - Support for parallel TC self-tests execution

   - Increase MPTCP self-tests coverage

   - Updated the bridge documentation, including several so-far
     undocumented features

   - Convert all the net self-tests to run in unique netns, to avoid
     random failures due to conflict and allow concurrent runs

   - Add TCP-AO self-tests

   - Add kunit tests for both cfg80211 and mac80211

   - Autogenerate Netlink families documentation from YAML spec

   - Add ynl-gen support for fixed headers and recursive nests; the tool
     can now generate user-space code for all genetlink families for
     which we have specs

   - A bunch of additional module descriptions fixes

   - Catch incorrect freeing of pages belonging to a page pool

  Driver API:

   - Rust abstractions for network PHY drivers; they do not yet cover
     the full C API, but already allow implementing functional PHY
     drivers in Rust

   - Introduce queue and NAPI support in the netdev Netlink interface,
     allowing complete access to the device <> NAPIs <> queues
     relationship

   - Introduce notification filtering for devlink to allow control
     applications to scale to thousands of instances

   - Improve PHY validation, requesting rate matching information for
     each ethtool link mode supported by both the PHY and host

   - Add support for ethtool symmetric-xor RSS hash

   - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
     platform

   - Expose pin fractional frequency offset value over new DPLL generic
     netlink attribute

   - Convert older drivers to platform remove callback returning void

   - Add support for PHY package MMD read/write

  New hardware / drivers:

   - Ethernet:
       - Octeon CN10K devices
       - Broadcom 5760X P7
       - Qualcomm SM8550 SoC
       - Texas Instruments DP83TG720S PHY

   - Bluetooth:
       - IMC Networks Bluetooth radio

  Removed:

   - WiFi:
       - libertas 16-bit PCMCIA support
       - Atmel at76c50x drivers
       - HostAP ISA/PCMCIA style 802.11b driver
       - zd1201 802.11b USB dongles
       - Orinoco ISA/PCMCIA 802.11b driver
       - Aviator/Raytheon driver
       - Planet WL3501 driver
       - RNDIS USB 802.11b driver

  Driver updates:

   - Ethernet high-speed NICs:
       - Intel (100G, ice, idpf):
          - allow one-by-one creation and removal of port representors
          - add temperature and clock information reporting
          - add get/set for ethtool's header split ringparam
          - re-add FW logging
          - add support for switchdev hardware packet mirroring
          - iavf: implement symmetric-xor RSS hash
          - igc: add support for concurrent physical and free-running
            timers
          - i40e: increase the allowable descriptors
       - nVidia/Mellanox:
          - Preparation for Socket-Direct multi-dev netdev. This will
            allow future releases to combine multiple PF devices
            attached to different NUMA nodes under the same netdev
       - Broadcom (bnxt):
          - TX completion handling improvements
          - add basic ntuple filter support
          - reduce MSIX vectors usage for MQPRIO offload
          - add VXLAN support, USO offload and TX coalesce completion
            for P7
       - Marvell Octeon EP:
          - xmit-more support
          - add PF-VF mailbox support and use it for FW notifications
            for VFs
       - Wangxun (ngbe/txgbe):
          - implement ethtool functions to operate pause param, ring
            param, coalesce channel number and msglevel
       - Netronome/Corigine (nfp):
          - add flow-steering support
          - support UDP segmentation offload

   - Ethernet NICs embedded, slower, virtual:
       - Xilinx AXI: remove duplicate DMA code adopting the dma engine
         driver
       - stmmac: add support for HW-accelerated VLAN stripping
       - TI AM654x sw: add mqprio, frame preemption & coalescing
       - gve: add support for non-4k page sizes.
       - virtio-net: support dynamic coalescing moderation

   - nVidia/Mellanox Ethernet datacenter switches:
       - allow firmware upgrade without a reboot
       - more flexible support for bridge flooding via the compressed
         FID flooding mode

   - Ethernet embedded switches:
       - Microchip:
          - fine-tune flow control and speed configurations in KSZ8xxx
          - KSZ88X3: enable setting rmii reference
       - Renesas:
          - add jumbo frames support
       - Marvell:
          - 88E6xxx: add "eth-mac" and "rmon" stats support

   - Ethernet PHYs:
       - aquantia: add firmware load support
       - at803x: refactor the driver to simplify adding support for more
         chip variants
       - NXP C45 TJA11xx: Add MACsec offload support

   - Wifi:
       - MediaTek (mt76):
          - NVMEM EEPROM improvements
          - mt7996 Extremely High Throughput (EHT) improvements
          - mt7996 Wireless Ethernet Dispatcher (WED) support
          - mt7996 36-bit DMA support
       - Qualcomm (ath12k):
          - support for a single MSI vector
          - WCN7850: support AP mode
       - Intel (iwlwifi):
          - new debugfs file fw_dbg_clear
          - allow concurrent P2P operation on DFS channels

   - Bluetooth:
       - QCA2066: support HFP offload
       - ISO: more broadcast-related improvements
       - NXP: better recovery in case receiver/transmitter get out of sync"

* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
  lan78xx: remove redundant statement in lan78xx_get_eee
  lan743x: remove redundant statement in lan743x_ethtool_get_eee
  bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
  bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
  bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
  tcp: Revert no longer abort SYN_SENT when receiving some ICMP
  Revert "mlx5 updates 2023-12-20"
  Revert "net: stmmac: Enable Per DMA Channel interrupt"
  ipvlan: Remove usage of the deprecated ida_simple_xx() API
  ipvlan: Fix a typo in a comment
  net/sched: Remove ipt action tests
  net: stmmac: Use interrupt mode INTM=1 for per channel irq
  net: stmmac: Add support for TX/RX channel interrupt
  net: stmmac: Make MSI interrupt routine generic
  dt-bindings: net: snps,dwmac: per channel irq
  net: phy: at803x: make read_status more generic
  net: phy: at803x: add support for cdt cross short test for qca808x
  net: phy: at803x: refactor qca808x cable test get status function
  net: phy: at803x: generalize cdt fault length function
  net: ethernet: cortina: Drop TSO support
  ...
2024-01-11 10:07:29 -08:00
Linus Torvalds 4cd083d531 Modules changes for v6.8-rc1
Just one cleanup and one documentation improvement change. No functional
 changes. However, this has been tested on linux-next for over 1 month.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmWdW0USHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinL5gQAMUtU4nv+HahLwJK3ha5fRmcHm2kV9Q4
 g9X7JUscEe+mNYLNy2Kjpl/BWatEHWn8jQD2nsMQJOcEkX2Mf5tfqBR3561wrfSZ
 dnMmNDG7Ym6Y9kOSDz6cpCxsu8Xm5Dj9MLKJ51qPfyXgeobD8IiKBe2oCDqcKzGY
 ZpDnmpUaYCOIloNhJNK9ybrNLsVDQwdPiC8vVQXULC4ePBw3i+mnh8c1wr442wEO
 G6xfi0wNXIoB9S8ynzakW2lJPD1XeMQYu/SJR0nz61KhJxAjs/LVAt9k7itcnUgG
 Zbc1Fn944oWoe7/ywwDIxstR56NYVpcXoTJexeHxrLe6PiEmRbh6phrBMUCOpUsh
 0DrHJE8z4dsHovo6w6m1zvMF6FphLHhUU6L/opBwrUJ5CGrYfegG95v8d3jRSphe
 GSoMo9iGHvr0PgY7OcG77m5NJjrFwwbPO88Fe3IAXmOXIrKYuzBoUnt3Te1dw8vX
 6vYPUUZ3HLDOACWtKf2Tjhr7pM6b1C72vrg7uYc2560BH6ERqjAtuOWKrx8QOeeo
 SUT4ACs4qa3DrPz7zpNhwwnfwpFoVuJd+fooMV/WhOU9KsCVCokC7zZub+J8UD7m
 1j1nhWkKaF08c0BKW4zRMmUVSV6Dh/AO0YybetBq9b4o7NPEcij2HlIk+jJa0VMJ
 XSx5UcuxRyf7
 =hvRa
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull module updates from Luis Chamberlain:
 "Just one cleanup and one documentation improvement change. No
  functional changes"

* tag 'modules-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  kernel/module: improve documentation for try_module_get()
  module: Remove redundant TASK_UNINTERRUPTIBLE
2024-01-10 18:00:18 -08:00
Linus Torvalds 78273df7f6 header cleanups for 6.8
The goal is to get sched.h down to a type only header, so the main thing
 happening in this patchset is splitting out various _types.h headers and
 dependency fixups, as well as moving some things out of sched.h to
 better locations.
 
 This is prep work for the memory allocation profiling patchset which
 adds new sched.h interdependencies.
 
 Testing - it's been in -next, and fixes from pretty much all
 architectures have percolated in - nothing major.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmWfBwwACgkQE6szbY3K
 bnZPwBAAmuRojXaeWxi01IPIOehSGDe68vw44PR9glEMZvxdnZuPOdvE4/+245/L
 bRKU2WBCjBUokUbV9msIShwRkFTZAmEMPNfPAAsFMA+VXeDYHKB+ZRdwTggNAQ+I
 SG6fZgh5m0HsewCDxU8oqVHkjVq4fXn0cy+aL6xLEd9gu67GoBzX2pDieS2Kvy6j
 jnyoKTxFwb+LTQgph0P4EIpq5I2umAsdLwdSR8EJ+8e9NiNvMo1pI00Lx/ntAnFZ
 JftWUJcMy3TQ5u1GkyfQN9y/yThX1bZK5GvmHS9SJ2Dkacaus5d+xaKCHtRuFS1I
 7C6b8PsNgRczUMumBXus44HdlNfNs1yU3lvVxFvBIPE1qC9pYRHrkWIXXIocXLLC
 oxTEJ6B2G3BQZVQgLIA4fOaxMVhmvKffi/aEZLi9vN9VVosd1a6XNKI6KbyRnXFp
 GSs9qDqszhn5I3GYNlDNQTc/8UsRlhPFgS6nS0By6QnvxtGi9QkU2tBRBsXvqwCy
 cLoCYIhc2tvugHvld70dz26umiJ4rnmxGlobStNoigDvIKAIUt1UmIdr1so8P8eH
 xehnL9ZcOX6xnANDL0AqMFFHV6I58CJynhFdUoXfVQf/DWLGX48mpi9LVNsYBzsI
 CAwVOAQ0UjGrpdWmJ9ueY/ABYqg9vRjzaDEXQ+MhAYO55CLaVsg=
 =3tyT
 -----END PGP SIGNATURE-----

Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs

Pull header cleanups from Kent Overstreet:
 "The goal is to get sched.h down to a type only header, so the main
  thing happening in this patchset is splitting out various _types.h
  headers and dependency fixups, as well as moving some things out of
  sched.h to better locations.

  This is prep work for the memory allocation profiling patchset which
  adds new sched.h interdependencies"

* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
  Kill sched.h dependency on rcupdate.h
  kill unnecessary thread_info.h include
  Kill unnecessary kernel.h include
  preempt.h: Kill dependency on list.h
  rseq: Split out rseq.h from sched.h
  LoongArch: signal.c: add header file to fix build error
  restart_block: Trim includes
  lockdep: move held_lock to lockdep_types.h
  sem: Split out sem_types.h
  uidgid: Split out uidgid_types.h
  seccomp: Split out seccomp_types.h
  refcount: Split out refcount_types.h
  uapi/linux/resource.h: fix include
  x86/signal: kill dependency on time.h
  syscall_user_dispatch.h: split out *_types.h
  mm_types_task.h: Trim dependencies
  Split out irqflags_types.h
  ipc: Kill bogus dependency on spinlock.h
  shm: Slim down dependencies
  workqueue: Split out workqueue_types.h
  ...
2024-01-10 16:43:55 -08:00
Linus Torvalds 120a201bd2 hardening updates for v6.8-rc1
- Introduce the param_unknown_fn type and other clean ups (Andy Shevchenko)
 
 - Various __counted_by annotations (Christophe JAILLET, Gustavo A. R. Silva,
   Kees Cook)
 
 - Add KFENCE test to LKDTM (Stephen Boyd)
 
 - Various strncpy() refactorings (Justin Stitt)
 
 - Fix qnx4 to avoid writing into the smaller of two overlapping buffers
 
 - Various strlcpy() refactorings
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWcOsQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJoiDD/9gNhalNG+6MNF5TDwSvO9X7pvL
 bQ6D3clByRxYjnJ4dMQ7p3s+rJ937uQt9PezIWHgRoldjQy3x7AJ5BxkhjeMlD2B
 YLbfdVYPy09X0Ewk1Efvfm/ta6tJpBGYF7Bc7LIneZrdQ6gemBpLW1PNZAFYzcWX
 oDjV+M1NytxaiF0aebxPZvZ1W+NGQ105Sxvj5MheDoezyO/j0CTe+ZYtCzFguFY0
 8SPpR5FG4AFidb8GHd5Ndv0trVWjF1jat0FUFgEFOCE0fJNWLVR0Bbr2MtXiG7wL
 LF7IZ/Mn+mi+O3BmcD6JiaYf9EPlMUXCyqc8NvsnoWGqhWhWmQPCInZVrpplMUNK
 V/UHVMkmjDs4f/lAHBJoJHDK6fmOD+cAFaNMOltfErcjV4s+lEo6vHoiKl8hfPnH
 EzpQaK3funGroVYwTc35e07NrJJHCzqIUhZ0FJO7ByuOE2tIomiVo9Xy9gy54iCT
 qzC7zkrZ0MKqui4qiUY9FWayRRYLX4qNxELm4yie6Pzmk8943hNOaDofcyKWuZFC
 eqvhIkvqb4LasLrzCBk+ehA2KWSRmTrR6E9IygwbBXUTsvn2yj2RRYeAlGQNBTBZ
 adgSXQpRBmtKYqyihWLhP4QcunknEiQdDS3lS2qJmPH33Iv3jGH4yS6BNIBufMGL
 PoC2UxSfGd+YT079fw==
 =1Wxx
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:

 - Introduce the param_unknown_fn type and other clean ups (Andy
   Shevchenko)

 - Various __counted_by annotations (Christophe JAILLET, Gustavo A. R.
   Silva, Kees Cook)

 - Add KFENCE test to LKDTM (Stephen Boyd)

 - Various strncpy() refactorings (Justin Stitt)

 - Fix qnx4 to avoid writing into the smaller of two overlapping buffers

 - Various strlcpy() refactorings
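
As background for the strncpy()/strlcpy() refactorings listed above: strncpy() leaves the destination without a terminating NUL when the source is at least as long as the buffer, while strscpy() always terminates and reports truncation. The function below is a hypothetical userspace stand-in written for illustration, not the kernel implementation; `sketch_strscpy` and its `long` return type are assumptions.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical userspace stand-in for the kernel's strscpy(); it is not
 * the kernel implementation. Copies at most size - 1 bytes, always
 * NUL-terminates when size > 0, and reports truncation with -E2BIG.
 */
static long sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;

	/* Copy until the source ends or only the terminator slot is left. */
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* Anything left in src means the copy was truncated. */
	return src[i] == '\0' ? (long)i : -E2BIG;
}
```

A caller that previously had to terminate the buffer by hand after strncpy() can instead check for a negative return value to detect truncation.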

* tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  qnx4: Use get_directory_fname() in qnx4_match()
  qnx4: Extract dir entry filename processing into helper
  atags_proc: Add __counted_by for struct buffer and use struct_size()
  tracing/uprobe: Replace strlcpy() with strscpy()
  params: Fix multi-line comment style
  params: Sort headers
  params: Use size_add() for kmalloc()
  params: Do not go over the limit when getting the string length
  params: Introduce the param_unknown_fn type
  lkdtm: Add kfence read after free crash type
  nvme-fc: replace deprecated strncpy with strscpy
  nvdimm/btt: replace deprecated strncpy with strscpy
  nvme-fabrics: replace deprecated strncpy with strscpy
  drm/modes: replace deprecated strncpy with strscpy_pad
  afs: Add __counted_by for struct afs_acl and use struct_size()
  VMCI: Annotate struct vmci_handle_arr with __counted_by
  i40e: Annotate struct i40e_qvlist_info with __counted_by
  HID: uhid: replace deprecated strncpy with strscpy
  samples: Replace strlcpy() with strscpy()
  SUNRPC: Replace strlcpy() with strscpy()
2024-01-10 11:03:52 -08:00
Breno Leitao aefb2f2e61 x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINE
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options.

[ mingo: Converted a few more uses in comments/messages as well. ]

Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ariel Miculas <amiculas@cisco.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2024-01-10 10:52:28 +01:00
Steven Rostedt (Google) 25742aeb13 ring-buffer: Remove stale comment from ring_buffer_size()
It's been 11 years since the ring_buffer_size() function was updated to
use the nr_pages from the buffer->buffers[cpu] structure instead of using
the buffer->nr_pages that no longer exists.

The comment in the code is more of what a change log should have and is
pretty much useless for development. It's saying how things worked back in
2012, which has no bearing on today's code. Remove it.

Link: https://lore.kernel.org/linux-trace-kernel/84d3b41a72bd43dbb9d44921ef535c92@AcuMS.aculab.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231220081028.7cd7e8e2@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-01-09 22:54:37 -05:00
Linus Torvalds 41daf06ea1 linux_kselftest-kunit-6.8-rc1
This KUnit update for Linux 6.8-rc1 consists of:
 
 - a new feature that adds APIs for managing devices introducing
   a set of helper functions which allow devices (internally a
   struct kunit_device) to be created and managed by KUnit.
   These devices will be automatically unregistered on
   test exit. These helpers can either use a user-provided
   struct device_driver, or have one automatically created and
   managed by KUnit. In both cases, the device lives on a new
   kunit_bus.
 
 - changes to switch drm/tests to use kunit devices
 
 - several fixes and enhancements to attribute feature
 
 - changes to reorganize deferred action function introducing
   KUNIT_DEFINE_ACTION_WRAPPER
 
 - new feature adds ability to run tests after boot using debugfs
 
 - fixes and enhancements to string-stream-test:
   - parse ERR_PTR in string_stream_destroy()
   - unchecked dereference bug fix in debugfs_print_results()
   - handling errors from alloc_string_stream()
   - NULL-dereference bug fix in kunit_init_suite()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmWdiHIACgkQCwJExA0N
 QxxzCxAAmhn+rkKV4DfuGXxUAJbO109H7LSP1Y7FKMYCVp83msWKASziujb2IQR9
 87jnmgeJMbmQaPcc9m//NHuFhZmJwQZwAdGZryoDiz7XK+1MwLxYeUj92HI7FPaD
 o5Jz6tlqFdehx5jCOymgwbvhI5kJMkQCTTtnEaiHCByfaA02UqmTtt3bXK5OeJkZ
 UG0HqdvI/6Xo01i+BnerRBZFcQV49GMhl4acw1k+dJnPLkqusL6txftRBoKtxuVd
 mXQHKS1SmNgiNA+nqs4d/8qERoMJWuwj6wV4pldVBXhgZwOHXbBxBf67i7hTakE/
 TkEURCkOb5X0QrT6akJj6phJ2xqXsF7xwzBJh9G4jF2Pdwwo8GGuAXW+ol0TRrm8
 ZEQ4eMBGIK07Lb9FeBMLO2bZ0Ox+oiN+YNGY/bs1d6Ibf4PnBUfy7IlmMjKL9h/V
 A/EpYdaq5q72IZZQ2pu1rYkBRPbnP7vHmjLXVYIq7Pq8bLA9/ycKO/0jnGHdo1oz
 rBK/6t7yB+ATi4KeKQpjpmUTX/vdEenUQI/QDn9ngXIEwYQfNrEUZitEvBXR1Kw+
 T8iKDIPFkvb/yEZgjWgNpxETooDx3yBkeeC29gKMj4QoN38wEdfy0Xltp8eqq9cS
 6lijRoipUypHRAuXeSJMW2dflLnFIt4mtC25hBNF+DmyNVT+IF4=
 =79+u
 -----END PGP SIGNATURE-----

Merge tag 'linux_kselftest-kunit-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull KUnit updates from Shuah Khan:

 - a new feature that adds APIs for managing devices introducing a set
   of helper functions which allow devices (internally a struct
   kunit_device) to be created and managed by KUnit.

   These devices will be automatically unregistered on test exit. These
   helpers can either use a user-provided struct device_driver, or have
   one automatically created and managed by KUnit. In both cases, the
   device lives on a new kunit_bus.

 - changes to switch drm/tests to use kunit devices

 - several fixes and enhancements to attribute feature

 - changes to reorganize deferred action function introducing
   KUNIT_DEFINE_ACTION_WRAPPER

 - new feature adds ability to run tests after boot using debugfs

 - fixes and enhancements to string-stream-test:
     - parse ERR_PTR in string_stream_destroy()
     - unchecked dereference bug fix in debugfs_print_results()
     - handling errors from alloc_string_stream()
     - NULL-dereference bug fix in kunit_init_suite()

* tag 'linux_kselftest-kunit-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (27 commits)
  kunit: Fix some comments which were mistakenly kerneldoc
  kunit: Protect string comparisons against NULL
  kunit: Add example of kunit_activate_static_stub() with pointer-to-function
  kunit: Allow passing function pointer to kunit_activate_static_stub()
  kunit: Fix NULL-dereference in kunit_init_suite() if suite->log is NULL
  kunit: Reset test->priv after each param iteration
  kunit: Add example for using test->priv
  drm/tests: Switch to kunit devices
  ASoC: topology: Replace fake root_device with kunit_device in tests
  overflow: Replace fake root_device with kunit_device
  fortify: test: Use kunit_device
  kunit: Add APIs for managing devices
  Documentation: Add debugfs docs with run after boot
  kunit: add ability to run tests after boot using debugfs
  kunit: add is_init test attribute
  kunit: add example suite to test init suites
  kunit: add KUNIT_INIT_TABLE to init linker section
  kunit: move KUNIT_TABLE out of INIT_DATA
  kunit: tool: add test for parsing attributes
  kunit: tool: fix parsing of test attributes
  ...
2024-01-09 17:16:58 -08:00
Linus Torvalds 7da71072e1 Power management updates for 6.8-rc1
- Add support for the Sierra Forest, Grand Ridge and Meteorlake SoCs to
    the intel_idle cpuidle driver (Artem Bityutskiy, Zhang Rui).
 
  - Do not enable interrupts when entering idle in the haltpoll cpuidle
    driver (Borislav Petkov).
 
  - Add Emerald Rapids support in no-HWP mode to the intel_pstate cpufreq
    driver (Zhenguo Yao).
 
  - Use EPP values programmed by the platform firmware as balanced
    performance ones by default in intel_pstate (Srinivas Pandruvada).
 
  - Add a missing function return value check to the SCMI cpufreq driver
    to avoid unexpected behavior (Alexandra Diupina).
 
  - Fix parameter type warning in the armada-8k cpufreq driver (Gregory
    CLEMENT).
 
  - Rework trans_stat_show() in the devfreq core code to avoid buffer
    overflows (Christian Marangi).
 
  - Synchronize devfreq_monitor_[start/stop] so as to prevent a timer
    list corruption from occurring when devfreq governors are switched
    frequently (Mukesh Ojha).
 
  - Fix possible deadlocks in the core system-wide PM code that occur if
    device-handling functions cannot be executed asynchronously during
    resume from system-wide suspend (Rafael J. Wysocki).
 
  - Clean up unnecessary local variable initializations in multiple
    places in the hibernation code (Wang chaodong, Li zeming).
 
  - Adjust core hibernation code to avoid missing wakeup events that
    occur after saving an image to persistent storage (Chris Feng).
 
  - Update hibernation code to enforce correct ordering during image
    compression and decompression (Hongchen Zhang).
 
  - Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
    during hibernation and restore (Chen Haonan).
 
  - Adjust documentation and code comments to reflect recent tasks freezer
    changes (Kevin Hao).
 
  - Repair excess function parameter description warning in the
    hibernation image-saving code (Randy Dunlap).
 
  - Fix _set_required_opps when opp is NULL (Bryan O'Donoghue).
 
  - Use device_get_match_data() in the OPP code for TI (Rob Herring).
 
  - Clean up OPP level and other parts and call dev_pm_opp_set_opp()
    recursively for required OPPs (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmWb8o8SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxbOQP/2D+YGyd9R1awYqxQoIQSGbpIr7a9oTR
 BOIcOn2PvWLXH8w9wVbIUJsSL9Nx90D9T4S5QcyHrS2qHmR+1Gb0gX3D4QAmEBcG
 +wFCLt2//5PwqShtPcJEUcGdL274aVEpmnEAmzKnk20MkLQM3twxe6FKSkwWMLYb
 u9OKgdN8Vah0iSBUCpyT52O0x4d65MD/tka0QaGjLg64TtyqhTSKi+XgWtZkSZ7H
 lRgn9qMoMXq/h1aeK4MKp5UtJKRxBWRdMijIFFXAfgO8dwbDbyXTo0d2LMR6DiEM
 VsvRIjEePoRcGf7bwAbrUeSoNb5Ec32RW3v9GSNn2sWutW+vhD//frZq48zAR6lm
 i8Xlf2Ar63Z+qNcFpCZjlNwAbfEuZ1vIr0Pu3oDd0GkOXjxiVMgAwtatTp1nSW7/
 wWFuMA5G+wdzU/Z5KcV1p7S8CP1gC8S05LHGwtKKGm9pLbzhauF8GK6Xpa4711T8
 oI3uDFIgxaxW8B/ymsM5cNa2QbfYUuQbOFTwXvBcy4gizrbZwwXRSpfaKoDIYAXZ
 2kfwmFbu3IbrRypboY58lG3SzbnN94oEMANtsVYuxMimGz2x3ZmHBAFm2l9YPYRz
 dBq/RUM7sMIvM1SwqR4tG8rt206L7KpPyW99pUa2AhEdof4iV2bpyujHFdkm83MK
 nJ0OF/xcc98Z
 =zh4c
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add support for new processors (Sierra Forest, Grand Ridge and
  Meteor Lake) to the intel_idle driver, make intel_pstate run on
  Emerald Rapids without HWP support and adjust it to utilize EPP values
  supplied by the platform firmware, fix issues, clean up code and
  improve documentation.

  The most significant fix addresses deadlocks in the core system-wide
  resume code that occur if async_schedule_dev() attempts to run its
  argument function synchronously (for example, due to a memory
  allocation failure). It rearranges the code in question which may
  increase the system resume time in some cases, but this basically is a
  removal of a premature optimization. That optimization will be added
  back later, but properly this time.
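
  The hazard described in the paragraph above (a scheduling helper that falls back to running its argument function synchronously) can be sketched in userspace as follows. This is only an illustration under stated assumptions: `sketch_queue`, `sketch_schedule()` and `sketch_schedule_nocall()` are hypothetical names, a full queue stands in for an allocation failure, and nothing here reproduces the kernel's actual code or the way the fix uses it.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*sketch_fn)(void *data);

#define SKETCH_QUEUE_CAP 1

/* Hypothetical work queue standing in for the async machinery. */
struct sketch_queue {
	sketch_fn fn[SKETCH_QUEUE_CAP];
	void *data[SKETCH_QUEUE_CAP];
	size_t len;
};

/* Returns false when the item cannot be queued (models a failed allocation). */
static bool sketch_enqueue(struct sketch_queue *q, sketch_fn fn, void *data)
{
	if (q->len == SKETCH_QUEUE_CAP)
		return false;
	q->fn[q->len] = fn;
	q->data[q->len] = data;
	q->len++;
	return true;
}

/*
 * Fallback variant: when queueing fails, fn runs in the caller's context.
 * Any lock the caller holds is still held while fn runs, which is where
 * a deadlock can come from if fn needs that lock.
 */
static void sketch_schedule(struct sketch_queue *q, sketch_fn fn, void *data)
{
	if (!sketch_enqueue(q, fn, data))
		fn(data);
}

/* "nocall" variant: never invokes fn itself, the caller decides what to do. */
static bool sketch_schedule_nocall(struct sketch_queue *q, sketch_fn fn,
				   void *data)
{
	return sketch_enqueue(q, fn, data);
}

/* Example callback that counts how often it was invoked. */
static void sketch_count(void *data)
{
	++*(int *)data;
}
```

  With the second variant the caller learns that queueing failed and can run the function at a point where doing so is safe, instead of having it invoked implicitly.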

  Specifics:

   - Add support for the Sierra Forest, Grand Ridge and Meteorlake SoCs
     to the intel_idle cpuidle driver (Artem Bityutskiy, Zhang Rui)

   - Do not enable interrupts when entering idle in the haltpoll cpuidle
     driver (Borislav Petkov)

   - Add Emerald Rapids support in no-HWP mode to the intel_pstate
     cpufreq driver (Zhenguo Yao)

   - Use EPP values programmed by the platform firmware as balanced
     performance ones by default in intel_pstate (Srinivas Pandruvada)

   - Add a missing function return value check to the SCMI cpufreq
     driver to avoid unexpected behavior (Alexandra Diupina)

   - Fix parameter type warning in the armada-8k cpufreq driver (Gregory
     CLEMENT)

   - Rework trans_stat_show() in the devfreq core code to avoid buffer
     overflows (Christian Marangi)

   - Synchronize devfreq_monitor_[start/stop] so as to prevent a timer
     list corruption from occurring when devfreq governors are switched
     frequently (Mukesh Ojha)

   - Fix possible deadlocks in the core system-wide PM code that occur
     if device-handling functions cannot be executed asynchronously
     during resume from system-wide suspend (Rafael J. Wysocki)

   - Clean up unnecessary local variable initializations in multiple
     places in the hibernation code (Wang chaodong, Li zeming)

   - Adjust core hibernation code to avoid missing wakeup events that
     occur after saving an image to persistent storage (Chris Feng)

   - Update hibernation code to enforce correct ordering during image
     compression and decompression (Hongchen Zhang)

   - Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
     during hibernation and restore (Chen Haonan)

   - Adjust documentation and code comments to reflect recent tasks
     freezer changes (Kevin Hao)

   - Repair excess function parameter description warning in the
     hibernation image-saving code (Randy Dunlap)

   - Fix _set_required_opps when opp is NULL (Bryan O'Donoghue)

   - Use device_get_match_data() in the OPP code for TI (Rob Herring)

   - Clean up OPP level and other parts and call dev_pm_opp_set_opp()
     recursively for required OPPs (Viresh Kumar)"

* tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
  OPP: Rename 'rate_clk_single'
  OPP: Pass rounded rate to _set_opp()
  OPP: Relocate dev_pm_opp_sync_regulators()
  PM: sleep: Fix possible deadlocks in core system-wide PM code
  OPP: Move dev_pm_opp_icc_bw to internal opp.h
  async: Introduce async_schedule_dev_nocall()
  async: Split async_schedule_node_domain()
  cpuidle: haltpoll: Do not enable interrupts when entering idle
  OPP: Fix _set_required_opps when opp is NULL
  OPP: The level field is always of unsigned int type
  PM: hibernate: Repair excess function parameter description warning
  PM: sleep: Remove obsolete comment from unlock_system_sleep()
  cpufreq: intel_pstate: Add Emerald Rapids support in no-HWP mode
  Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes
  PM: hibernate: Use kmap_local_page() in copy_data_page()
  intel_idle: add Sierra Forest SoC support
  intel_idle: add Grand Ridge SoC support
  PM / devfreq: Synchronize devfreq_monitor_[start/stop]
  cpufreq: armada-8k: Fix parameter type warning
  PM: hibernate: Enforce ordering during image compression/decompression
  ...
2024-01-09 16:32:11 -08:00
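
Note on the deadlock fix in the pm-6.8-rc1 merge above: the message says the deadlocks occur when async_schedule_dev() ends up running its argument function synchronously, and the shortlog lists async_schedule_dev_nocall(). The toy Python model below is only a sketch of that idea, not kernel code. ToyScheduler, resume_device, resume_all and list_lock are hypothetical names, and the assumption that the caller holds a lock which the callback also needs is made purely for illustration. A non-blocking acquire is used so the demo reports the would-be deadlock instead of hanging.

```python
import threading


class ToyScheduler:
    """Toy model (hypothetical): queueing fails when can_allocate is False."""

    def __init__(self, can_allocate=True):
        self.can_allocate = can_allocate
        self.queue = []

    def schedule(self, func, dev):
        # Models the fallback behaviour: if the entry cannot be queued,
        # run func synchronously in the caller's context.
        if self.can_allocate:
            self.queue.append((func, dev))
        else:
            func(dev)

    def schedule_nocall(self, func, dev):
        # Models the "nocall" behaviour: never run func here, just
        # report whether it was queued.
        if not self.can_allocate:
            return False
        self.queue.append((func, dev))
        return True

    def run_pending(self):
        while self.queue:
            func, dev = self.queue.pop(0)
            func(dev)


list_lock = threading.Lock()  # non-reentrant, stands in for a list mutex
log = []


def resume_device(dev):
    # The callback needs list_lock. acquire(blocking=False) lets the demo
    # record a would-be deadlock instead of blocking forever.
    if not list_lock.acquire(blocking=False):
        log.append((dev, "would deadlock"))
        return
    try:
        log.append((dev, "resumed"))
    finally:
        list_lock.release()


def resume_all(sched, devices, use_nocall):
    deferred = []
    with list_lock:
        for dev in devices:
            if use_nocall:
                if not sched.schedule_nocall(resume_device, dev):
                    deferred.append(dev)  # handle later, lock not held
            else:
                sched.schedule(resume_device, dev)
    for dev in deferred:
        resume_device(dev)
    sched.run_pending()
```

With queueing forced to fail, the fallback variant invokes the callback while the lock is held, whereas the nocall variant lets the caller defer the work until the lock has been dropped.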
Linus Torvalds 7f73ba68cf Thermal control updates for 6.8-rc1
  - Add dynamic thresholds for trip point crossing detection to prevent
    trip point crossing notifications from being sent at incorrect times
    or not at all in some cases (Rafael J. Wysocki).
 
  - Fix synchronization issues related to the resume of thermal zones
    during a system-wide resume and allow thermal zones to be resumed
    concurrently (Rafael J. Wysocki).
 
  - Modify the thermal zone unregistration to wait for the given zone to
    go away completely before returning to the caller and rework the
    sysfs interface for trip points on top of that (Rafael J. Wysocki).
 
  - Fix a possible NULL pointer dereference in thermal zone registration
    error path (Rafael J. Wysocki).
 
  - Clean up the IPA thermal governor and modify it (with the help of a
    new governor callback) to avoid allocating and freeing memory every
    time its throttling callback is invoked (Lukasz Luba).
 
  - Make the IPA thermal governor handle thermal instance weight changes
    via sysfs correctly (Lukasz Luba).
 
  - Update the thermal netlink code to avoid sending messages if there
    are no recipients (Stanislaw Gruszka).
 
  - Convert Mediatek Thermal to the json-schema (Rafał Miłecki).
 
  - Fix thermal DT bindings issue on Loongson (Binbin Zhou).
 
  - Fix returning NULL instead of -ENODEV during thermal probe on
    Loongson (Binbin Zhou).
 
  - Add thermal DT binding for tsens on the SM8650 platform (Neil
    Armstrong).
 
  - Add reboot on the critical trip point crossing option feature (Fabio
    Estevam).
 
  - Use DEFINE_SIMPLE_DEV_PM_OPS to define PM functions for thermal
    suspend/resume on AmLogic (Uwe Kleine-König).
 
  - Add D1/T113s THS controller support to the Sun8i thermal control
    driver (Maxim Kiselev)
 
  - Fix example in the thermal DT binding for QCom SPMI (Johan Hovold).
 
  - Fix compilation warning in the tmon utility (Florian Eckert).
 
  - Add support for interrupt-based thermal configuration on Exynos along
    with a set of related cleanups (Mateusz Majewski).
 
  - Make the Intel HFI thermal driver enable an HFI instance (e.g. processor
    package) from its first online CPU and disable it when the last CPU in
    it goes offline (Ricardo Neri).
 
  - Fix a kernel-doc warning and a spello in the cpuidle_cooling thermal
    driver (Randy Dunlap).
 
  - Move the .get_temp() thermal zone callback presence check to the
    thermal zone registration code (Daniel Lezcano).
 
  - Use the for_each_trip() macro for trip points table walks in a few
    places in the thermal core (Rafael J. Wysocki).
 
  - Make all trip point updates (via sysfs as well as from the platform
    firmware) trigger trip change notifications (Rafael J. Wysocki).
 
  - Drop redundant code from the thermal core and make one function in
    it take a const pointer argument (Rafael J. Wysocki).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmWb8iUSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxKLkP/iDsuDwmhZAjbAu2iftk/8ad8Trm2VoK
 +9eZ5Eqa8lKEcJLb0RxueTnFT4ppvT/hY99HOG4FM+mCnWeH/Z32N697DhqiUg4v
 GZUpeOPzxYgsfOOTeuL5XgfrVMgBjJrJunTXmzgAd8lIhTmRbAMVmFVJ18CJO11O
 RHgqvYznYFi5cywA9/NkG2xkhFB0VDoiTuIiuMMV+pMjqF0d5ooBMkhmjvPQ5Rp9
 FjNJ7hqiTamAsDPdULAFqhIGGhKZWWFbh4+S+JPCwBW8nqvxyJpemsm20vrwctJR
 bSXWQkgkDpWEeg9yrEAOO/Uk9yGd3jiLfkvPBKbK0x/YxGZ4hOYHcbF3cOUvmPYP
 5K3ZJ61DNrzB/5S3LY54VYrWmTVRdK6Lk3HYNvfAUYFJZMZ5oMYZLCUmo4SswUdy
 UUEIY27H7L18eLhP9zCcKo4njdaVG+vXQn/rJIFOpG0k9OElzPs1X8Dp/m9pKQDR
 rDUsMXqB34NUVrIEhjAgqvwF5xHooW8gykpuJgxwBetA9w8Pls2A/mzLsDY3wgdQ
 htiANGpKTDqBQSn+HrjzYckv9/R+1tDyTJmEDNZwllA1DJfrOlpCRD2VHRpgTZEA
 Ldnq0bhyq6RQnousqxhgpYkIAoGaebs9XasRH0YtBG5gIumeWfqeVzmTcM5xdsNB
 yf6RdQy8QunS
 =QVyh
 -----END PGP SIGNATURE-----

Merge tag 'thermal-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull thermal control updates from Rafael Wysocki:
 "These add support for the D1/T113s THS controller to the sun8i driver
  and a DT-based mechanism for platforms to indicate a preference to
  reboot (instead of shutting down) on crossing a critical trip point,
  fix issues, make other improvements (in the IPA governor, the Intel
  HFI driver, the exynos driver and the thermal netlink interface among
  other places) and clean up code.

  One long-standing issue addressed here is that trip point crossing
  notifications sent to user space might be unreliable due to the
  incorrect handling of trip point hysteresis in the thermal core:
  multiple notifications might be sent for the same event or there might
  be events without any notification at all.

  Specifics:

   - Add dynamic thresholds for trip point crossing detection to prevent
     trip point crossing notifications from being sent at incorrect
     times or not at all in some cases (Rafael J. Wysocki)

   - Fix synchronization issues related to the resume of thermal zones
     during a system-wide resume and allow thermal zones to be resumed
     concurrently (Rafael J. Wysocki)

   - Modify the thermal zone unregistration to wait for the given zone
     to go away completely before returning to the caller and rework the
     sysfs interface for trip points on top of that (Rafael J. Wysocki)

   - Fix a possible NULL pointer dereference in thermal zone
     registration error path (Rafael J. Wysocki)

   - Clean up the IPA thermal governor and modify it (with the help of a
     new governor callback) to avoid allocating and freeing memory every
     time its throttling callback is invoked (Lukasz Luba)

   - Make the IPA thermal governor handle thermal instance weight
     changes via sysfs correctly (Lukasz Luba)

   - Update the thermal netlink code to avoid sending messages if there
     are no recipients (Stanislaw Gruszka)

   - Convert Mediatek Thermal to the json-schema (Rafał Miłecki)

   - Fix thermal DT bindings issue on Loongson (Binbin Zhou)

   - Fix returning NULL instead of -ENODEV during thermal probe on
     Loongson (Binbin Zhou)

   - Add thermal DT binding for tsens on the SM8650 platform (Neil
     Armstrong)

   - Add reboot on the critical trip point crossing option feature
     (Fabio Estevam)

   - Use DEFINE_SIMPLE_DEV_PM_OPS to define PM functions for thermal
     suspend/resume on AmLogic (Uwe Kleine-König)

   - Add D1/T113s THS controller support to the Sun8i thermal control
     driver (Maxim Kiselev)

   - Fix example in the thermal DT binding for QCom SPMI (Johan Hovold)

   - Fix compilation warning in the tmon utility (Florian Eckert)

   - Add support for interrupt-based thermal configuration on Exynos
     along with a set of related cleanups (Mateusz Majewski)

   - Make the Intel HFI thermal driver enable an HFI instance (e.g.
     processor package) from its first online CPU and disable it when
     the last CPU in it goes offline (Ricardo Neri)

   - Fix a kernel-doc warning and a spello in the cpuidle_cooling
     thermal driver (Randy Dunlap)

   - Move the .get_temp() thermal zone callback presence check to the
     thermal zone registration code (Daniel Lezcano)

   - Use the for_each_trip() macro for trip points table walks in a few
     places in the thermal core (Rafael J. Wysocki)

   - Make all trip point updates (via sysfs as well as from the platform
     firmware) trigger trip change notifications (Rafael J. Wysocki)

   - Drop redundant code from the thermal core and make one function in
     it take a const pointer argument (Rafael J. Wysocki)"

* tag 'thermal-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits)
  thermal: trip: Constify thermal zone argument of thermal_zone_trip_id()
  thermal: intel: hfi: Disable an HFI instance when all its CPUs go offline
  thermal: intel: hfi: Enable an HFI instance from its first online CPU
  thermal: intel: hfi: Refactor enabling code into helper functions
  thermal/drivers/exynos: Use set_trips ops
  thermal/drivers/exynos: Use BIT wherever possible
  thermal/drivers/exynos: Split initialization of TMU and the thermal zone
  thermal/drivers/exynos: Stop using the threshold mechanism on Exynos 4210
  thermal/drivers/exynos: Simplify regulator (de)initialization
  thermal/drivers/exynos: Handle devm_regulator_get_optional return value correctly
  thermal/drivers/exynos: Wwitch from workqueue-driven interrupt handling to threaded interrupts
  thermal/drivers/exynos: Drop id field
  thermal/drivers/exynos: Remove an unnecessary field description
  tools/thermal/tmon: Fix compilation warning for wrong format
  dt-bindings: thermal: qcom-spmi-adc-tm5/hc: Clean up examples
  dt-bindings: thermal: qcom-spmi-adc-tm5/hc: Fix example node names
  thermal/drivers/sun8i: Add D1/T113s THS controller support
  dt-bindings: thermal: sun8i: Add binding for D1/T113s THS controller
  thermal: amlogic: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions
  thermal: amlogic: Make amlogic_thermal_disable() return void
  ...
2024-01-09 16:20:17 -08:00
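
Note on the trip point hysteresis fix in the thermal-6.8-rc1 merge above: the message describes duplicate or missing crossing notifications and a move to dynamic thresholds. The Python sketch below is one way to model such a threshold, under the assumption that a trip is treated as crossed on the way up at its temperature and on the way down at its temperature minus hysteresis. Trip, check_trip and crossings are hypothetical names, temperatures are assumed to be in millidegrees Celsius, and this is not the thermal core's code.

```python
from dataclasses import dataclass, field


@dataclass
class Trip:
    temperature: int   # trip temperature (assumed millidegrees Celsius)
    hysteresis: int    # how far below the trip the zone must cool
    threshold: int = field(init=False)  # level the previous sample is compared to

    def __post_init__(self):
        self.threshold = self.temperature


def check_trip(trip, last_temp, temp):
    """Return "up", "down" or None for one temperature update."""
    if last_temp < trip.threshold:
        # Previously below the trip: only reaching the trip temperature counts.
        if temp >= trip.temperature:
            trip.threshold = trip.temperature - trip.hysteresis
            return "up"
        trip.threshold = trip.temperature
    else:
        # Previously above: only falling below temperature - hysteresis counts.
        if temp < trip.temperature - trip.hysteresis:
            trip.threshold = trip.temperature
            return "down"
        trip.threshold = trip.temperature - trip.hysteresis
    return None


def crossings(trip, samples):
    """Collect (event, temperature) pairs over a list of samples."""
    events = []
    last = samples[0]
    for temp in samples[1:]:
        event = check_trip(trip, last, temp)
        if event is not None:
            events.append((event, temp))
        last = temp
    return events
```

In this model a reading that dips below the trip temperature but stays inside the hysteresis band produces no event, so each upward crossing is paired with exactly one downward crossing.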
Linus Torvalds 063a7ce32d lsm/stable-6.8 PR 20240105
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmWYKUIUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNyHw/+IKnqL1MZ5QS+/HtSzi4jCL47N9yZ
 OHLol6XswyEGHH9myKPPGnT5lVA93v98v4ty2mws7EJUSGZQQUntYBPbU9Gi40+B
 XDzYSRocoj96sdlKeOJMgaWo3NBRD9HYSoGPDNWZixy6m+bLPk/Dqhn3FabKf1lo
 2qQSmstvChFRmVNkmgaQnBCAtWVqla4EJEL0EKX6cspHbuzRNTeJdTPn6Q/zOUVL
 O2znOZuEtSVpYS7yg3uJT0hHD8H0GnIciAcDAhyPSBL5Uk5l6gwJiACcdRfLRbgp
 QM5Z4qUFdKljV5XBCzYnfhhrx1df08h1SG84El8UK8HgTTfOZfYmawByJRWNJSQE
 TdCmtyyvEbfb61CKBFVwD7Tzb9/y8WgcY5N3Un8uCQqRzFIO+6cghHri5NrVhifp
 nPFlP4klxLHh3d7ZVekLmCMHbpaacRyJKwLy+f/nwbBEID47jpPkvZFIpbalat+r
 QaKRBNWdTeV+GZ+Yu0uWsI029aQnpcO1kAnGg09fl6b/dsmxeKOVWebir25AzQ++
 a702S8HRmj80X+VnXHU9a64XeGtBH7Nq0vu0lGHQPgwhSx/9P6/qICEPwsIriRjR
 I9OulWt4OBPDtlsonHFgDs+lbnd0Z0GJUwYT8e9pjRDMxijVO9lhAXyglVRmuNR8
 to2ByKP5BO+Vh8Y=
 =Py+n
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull security module updates from Paul Moore:

 - Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and
   lsm_set_self_attr().

   The first syscall simply lists the LSMs enabled, while the second and
   third get and set the current process' LSM attributes. Yes, these
   syscalls may provide similar functionality to what can be found under
   /proc or /sys, but they were designed to support multiple,
   simultaneous (stacked) LSMs from the start as opposed to the current
   /proc based solutions which were created at a time when only one LSM
   was allowed to be active at a given time.

   We have spent considerable time discussing ways to extend the
   existing /proc interfaces to support multiple, simultaneous LSMs and
   even our best ideas have been far too ugly to support as a kernel
   API; after 20+ years in the kernel, I felt the LSM layer had
   established itself enough to justify a handful of syscalls.

   Support amongst the individual LSM developers has been nearly
   unanimous, with a single objection coming from Tetsuo (TOMOYO) as he
   is worried that the LSM_ID_XXX token concept will make it more
   difficult for out-of-tree LSMs to survive. Several members of the LSM
   community have demonstrated the ability for out-of-tree LSMs to
   continue to exist by picking high/unused LSM_ID values as well as
   pointing out that many kernel APIs rely on integer identifiers, e.g.
   syscalls (!), but unfortunately Tetsuo's objections remain.

   My personal opinion is that while I have no interest in penalizing
   out-of-tree LSMs, I'm not going to penalize in-tree development to
   support out-of-tree development, and I view this as a necessary step
   forward to support the push for expanded LSM stacking and reduce our
   reliance on /proc and /sys which has occasionally been problematic
   for some container users. Finally, we have included the linux-api
   folks on (all?) recent revisions of the patchset and addressed all of
   their concerns.

 - Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit
   ioctls on 64-bit systems problem.

   This patch includes support for all of the existing LSMs which
   provide ioctl hooks, although it turns out only SELinux actually
   cares about the individual ioctls. It is worth noting that while
   Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this
   patch, they did both indicate they are okay with the changes.

 - Fix a potential memory leak in the CALIPSO code when IPv6 is disabled
   at boot.

   While it's good that we are fixing this, I doubt this is something
   users are seeing in the wild as you need to both disable IPv6 and
   then attempt to configure IPv6 labeled networking via
   NetLabel/CALIPSO; that just doesn't make much sense.

   Normally this would go through netdev, but Jakub asked me to take
   this patch and of all the trees I maintain, the LSM tree seemed like
   the best fit.

 - Update the LSM MAINTAINERS entry with additional information about
   our process docs, patchwork, bug reporting, etc.

   I also noticed that the Lockdown LSM is missing a dedicated
   MAINTAINERS entry so I've added that to the pull request. I've been
   working with one of the major Lockdown authors/contributors to see if
   they are willing to step up and assume a Lockdown maintainer role;
   hopefully that will happen soon, but in the meantime I'll continue to
   look after it.

 - Add a handful of mailmap entries for Serge Hallyn and myself.

* tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits)
  lsm: new security_file_ioctl_compat() hook
  lsm: Add a __counted_by() annotation to lsm_ctx.ctx
  calipso: fix memory leak in netlbl_calipso_add_pass()
  selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test
  MAINTAINERS: add an entry for the lockdown LSM
  MAINTAINERS: update the LSM entry
  mailmap: add entries for Serge Hallyn's dead accounts
  mailmap: update/replace my old email addresses
  lsm: mark the lsm_id variables are marked as static
  lsm: convert security_setselfattr() to use memdup_user()
  lsm: align based on pointer length in lsm_fill_user_ctx()
  lsm: consolidate buffer size handling into lsm_fill_user_ctx()
  lsm: correct error codes in security_getselfattr()
  lsm: cleanup the size counters in security_getselfattr()
  lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation
  lsm: drop LSM_ID_IMA
  LSM: selftests for Linux Security Module syscalls
  SELinux: Add selfattr hooks
  AppArmor: Add selfattr hooks
  Smack: implement setselfattr and getselfattr hooks
  ...
2024-01-09 12:57:46 -08:00
Linus Torvalds eab23bc8a8 audit/stable-6.8 PR 20240105
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmWYKJAUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXML3hAAyP/6RwJjUMM9Gsi9ZJNRz79X8uIp
 /MYONbzy1xKq/d78jZhjsJm9yIlLk3muVdd3oRcXdmahA5zs3jOKRaM+OfNLOrt6
 nuNwS+yaMUYKsKNh/A8TLoxcmBuNAN6lubCKwbccR6hvugqrZuZFkAqCIkiWUDeb
 N64u1rL1q/tLI+jI76GIiK4SMMQihF3MMVVTmBWYDiIdrfPhFIHxipLgZaEBUqZM
 43+2Y/blV75jcqPTZRgT9tk0LVLkiFtO97qUp9j+pYZbeoJ7CAaDH5A8NVm38yIX
 tyzYiTV2lGS3qf/HdLc3OpJQlBVkhbq6cRiLGvyiKQp60xiqYffoL7iFP4/DJMoT
 JKzoqXCixINRqdHWYbVY9hHBGg6R5c+1QqZzsnEy2MnBF++iLwJQAMz5JO9Qdh8F
 tD6fD82QzvfoNPuP0lBA67preqN3wiH1Zsv6cstoI/6QKCAMeTMZt/ywniBTKhX6
 WMmhdmMQKTwGrnCosydAOonYesieiYPhxz6oSeRIqoHRHtNow8rjnFh7DR7yi8uc
 nv1x5bDqEI+QTrDys0cAq6fvdUZT2B9joqSovzXUGllRRS7w17WNf1Cu16jMTrHH
 FeZ2P1BvKE7YIFkqxcE/RY5NHX3ylxA4unFM8UgIheYiWbWLm5+xrwZdNL30KQJ4
 4Hvvy3Buq6kb4HE=
 =908g
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "The audit updates are fairly minor with only two patches:

   - Send an audit ACK to userspace immediately upon receiving an auditd
     registration event as opposed to waiting until the registration has
     been fully processed and the audit backlog starts filling the
     netlink buffers.

     Sending the ACK earlier, as done here, is still safe as the
     operation should not fail at the point when the ACK is done, and
     doing so helps avoid the ACK being dropped in extreme situations.

   - Update the audit MAINTAINERS entry with additional information.

     There isn't anything in this update that should be new to regular
     contributors or list subscribers, but I'm pushing to start
     documenting our processes, conventions, etc. and this seems like an
     important part of that"

* tag 'audit-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  MAINTAINERS: update the audit entry
  audit: Send netlink ACK before setting connection in auditd_set
2024-01-09 12:01:15 -08:00
Linus Torvalds 9f2a635235 Quite a lot of kexec work this time around. Many singleton patches in
many places.  The notable patch series are:
 
 - nilfs2 folio conversion from Matthew Wilcox in "nilfs2: Folio
   conversions for file paths".
 
 - Additional nilfs2 folio conversion from Ryusuke Konishi in "nilfs2:
   Folio conversions for directory paths".
 
 - IA64 remnant removal in Heiko Carstens's "Remove unused code after
   IA-64 removal".
 
 - Arnd Bergmann has enabled the -Wmissing-prototypes warning everywhere
   in "Treewide: enable -Wmissing-prototypes".  This had some followup
   fixes:
 
   - Nathan Chancellor has cleaned up the hexagon build in the series
     "hexagon: Fix up instances of -Wmissing-prototypes".
 
   - Nathan also addressed some s390 warnings in "s390: A couple of
     fixes for -Wmissing-prototypes".
 
   - Arnd Bergmann addresses the same warnings for MIPS in his series
     "mips: address -Wmissing-prototypes warnings".
 
 - Baoquan He has made kexec_file operate in a top-down-fitting manner
   similar to kexec_load in the series "kexec_file: Load kernel at top of
   system RAM if required"
 
 - Baoquan He has also added the self-explanatory "kexec_file: print out
   debugging message if required".
 
 - Some checkstack maintenance work from Tiezhu Yang in the series
   "Modify some code about checkstack".
 
 - Douglas Anderson has disentangled the watchdog code's logging when
   multiple reports are occurring simultaneously.  The series is "watchdog:
   Better handling of concurrent lockups".
 
 - Yuntao Wang has contributed some maintenance work on the crash code in
   "crash: Some cleanups and fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZ2R6AAKCRDdBJ7gKXxA
 juCVAP4t76qUISDOSKugB/Dn5E4Nt9wvPY9PcufnmD+xoPsgkQD+JVl4+jd9+gAV
 vl6wkJDiJO5JZ3FVtBtC3DFA/xHtVgk=
 =kQw+
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Quite a lot of kexec work this time around. Many singleton patches in
  many places. The notable patch series are:

   - nilfs2 folio conversion from Matthew Wilcox in 'nilfs2: Folio
     conversions for file paths'.

   - Additional nilfs2 folio conversion from Ryusuke Konishi in 'nilfs2:
     Folio conversions for directory paths'.

   - IA64 remnant removal in Heiko Carstens's 'Remove unused code after
     IA-64 removal'.

   - Arnd Bergmann has enabled the -Wmissing-prototypes warning
     everywhere in 'Treewide: enable -Wmissing-prototypes'. This had
     some followup fixes:

      - Nathan Chancellor has cleaned up the hexagon build in the series
        'hexagon: Fix up instances of -Wmissing-prototypes'.

      - Nathan also addressed some s390 warnings in 's390: A couple of
        fixes for -Wmissing-prototypes'.

      - Arnd Bergmann addresses the same warnings for MIPS in his series
        'mips: address -Wmissing-prototypes warnings'.

   - Baoquan He has made kexec_file operate in a top-down-fitting manner
     similar to kexec_load in the series 'kexec_file: Load kernel at top
     of system RAM if required'

   - Baoquan He has also added the self-explanatory 'kexec_file: print
     out debugging message if required'.

   - Some checkstack maintenance work from Tiezhu Yang in the series
     'Modify some code about checkstack'.

   - Douglas Anderson has disentangled the watchdog code's logging when
     multiple reports are occurring simultaneously. The series is
     'watchdog: Better handling of concurrent lockups'.

   - Yuntao Wang has contributed some maintenance work on the crash code
     in 'crash: Some cleanups and fixes'"

* tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (157 commits)
  crash_core: fix and simplify the logic of crash_exclude_mem_range()
  x86/crash: use SZ_1M macro instead of hardcoded value
  x86/crash: remove the unused image parameter from prepare_elf_headers()
  kdump: remove redundant DEFAULT_CRASH_KERNEL_LOW_SIZE
  scripts/decode_stacktrace.sh: strip unexpected CR from lines
  watchdog: if panicking and we dumped everything, don't re-enable dumping
  watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
  watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
  watchdog/hardlockup: adopt softlockup logic avoiding double-dumps
  kexec_core: fix the assignment to kimage->control_page
  x86/kexec: fix incorrect end address passed to kernel_ident_mapping_init()
  lib/trace_readwrite.c:: replace asm-generic/io with linux/io
  nilfs2: cpfile: fix some kernel-doc warnings
  stacktrace: fix kernel-doc typo
  scripts/checkstack.pl: fix no space expression between sp and offset
  x86/kexec: fix incorrect argument passed to kexec_dprintk()
  x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs
  nilfs2: add missing set_freezable() for freezable kthread
  kernel: relay: remove relay_file_splice_read dead code, doesn't work
  docs: submit-checklist: remove all of "make namespacecheck"
  ...
2024-01-09 11:46:20 -08:00
Linus Torvalds fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintenance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ directory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintenance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against the buffer_head code in the series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY's alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both core and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintenance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ directory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintenance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against the buffer_head code in the series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY's alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both core and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Linus Torvalds d30e51aa7b slab updates for 6.8
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmWWu9EACgkQu+CwddJF
 iJpXvQf/aGL7uEY57VpTm0t4gPwoZ9r2P89HxI/nQs9XgVzDcBmVp/cC0LDvSdcm
 t91kJO538KeGjMgvlhLMTEuoShH5FlPs6cOwrGAYUoAGa4NwiOpGvliGky+nNHqY
 w887ZgSzVLq0UOuSvn86N6enumMvewt4V+872+OWo6O1HWOJhC0SgHTIa8QPQtwb
 yZ9BghO5IqMRXiZEsSIwyO+tQHcaU6l2G5huFXzgMFUhkQqAB9KTFc3h6rYI+i80
 L4ppNXo2KNPGTDRb9dA8LNMWgvmfjhCb7chs8o1zSY2PwZlkzOix7EUBLCAIbc/2
 EIaFC8AsZjfT47D1t72r8QpHB+C14Q==
 =J+E7
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - SLUB: delayed freezing of CPU partial slabs (Chengming Zhou)

   Freezing is an operation involving cmpxchg_double() that makes a slab
   exclusive for a particular CPU. Chengming noticed that we use it also
   in situations where we are not yet installing the slab as the CPU
   slab, because freezing also indicates that the slab is not on the
   shared list. This results in redundant freeze/unfreeze operation and
   can be avoided by marking separately the shared list presence by
   reusing the PG_workingset flag.

   This approach neatly avoids the issues described in 9b1ea29bc0
   ("Revert "mm, slub: consider rest of partial list if acquire_slab()
   fails"") as we can now grab a slab from the shared list in a quick
   and guaranteed way without the cmpxchg_double() operation that
   amplifies the lock contention and can fail.

   As a result, lkp has reported 34.2% improvement of
   stress-ng.rawudp.ops_per_sec

 - SLAB removal and SLUB cleanups (Vlastimil Babka)

   The SLAB allocator has been deprecated since 6.5 and nobody has
   objected so far. We agreed at LSF/MM to wait until the next LTS,
   which is 6.6, so we should be good to go now.

   This doesn't yet erase all traces of SLAB outside of mm/ so some dead
   code, comments or documentation remain, and will be cleaned up
   gradually (some series are already in the works).

   Removing the choice of allocators has already allowed us to simplify
   and optimize the code wiring up the kmalloc APIs to the SLUB
   implementation.

* tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
  mm/slub: free KFENCE objects in slab_free_hook()
  mm/slub: handle bulk and single object freeing separately
  mm/slub: introduce __kmem_cache_free_bulk() without free hooks
  mm/slub: fix bulk alloc and free stats
  mm/slub: optimize free fast path code layout
  mm/slub: optimize alloc fastpath code layout
  mm/slub: remove slab_alloc() and __kmem_cache_alloc_lru() wrappers
  mm/slab: move kmalloc() functions from slab_common.c to slub.c
  mm/slab: move kmalloc_slab() to mm/slab.h
  mm/slab: move kfree() from slab_common.c to slub.c
  mm/slab: move struct kmem_cache_node from slab.h to slub.c
  mm/slab: move memcg related functions from slab.h to slub.c
  mm/slab: move pre/post-alloc hooks from slab.h to slub.c
  mm/slab: consolidate includes in the internal mm/slab.h
  mm/slab: move the rest of slub_def.h to mm/slab.h
  mm/slab: move struct kmem_cache_cpu declaration to slub.c
  mm/slab: remove mm/slab.c and slab_def.h
  mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs
  mm/slab: remove CONFIG_SLAB code from slab common code
  cpu/hotplug: remove CPUHP_SLAB_PREPARE hooks
  ...
2024-01-09 10:36:07 -08:00
ZhangPeng 3dc2f20920 swiotlb: check alloc_size before the allocation of a new memory pool
The allocation request for swiotlb contiguous memory greater than
128*2KB cannot be fulfilled because it exceeds the maximum contiguous
memory limit. If the swiotlb memory we allocate is larger than 128*2KB,
swiotlb_find_slots() will still schedule the allocation of a new memory
pool, which will increase memory overhead.

Fix it by checking that alloc_size is no more than 128*2KB before
scheduling the allocation of a new memory pool in swiotlb_find_slots().
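
The check described above can be sketched in plain C. This is only an illustration of the idea, not the kernel code: the constants mirror the "128*2KB" limit from the text, and should_grow_pool() and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed values mirroring the "128*2KB" limit in the commit message. */
#define SLOT_SIZE       2048u   /* 2 KB per slot */
#define SLOTS_PER_SEG   128u    /* slots in one contiguous segment */
#define MAX_CONTIG_SIZE ((size_t)SLOTS_PER_SEG * SLOT_SIZE)

/* Hypothetical helper: should a failed slot lookup schedule pool growth? */
static bool should_grow_pool(size_t alloc_size, bool can_grow)
{
    if (!can_grow)
        return false;
    /* A new pool cannot help when the request exceeds the contiguous limit. */
    if (alloc_size > MAX_CONTIG_SIZE)
        return false;
    return true;
}
```

With this guard, an oversized request fails immediately instead of triggering an allocation that could never satisfy it.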

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-01-09 16:58:36 +01:00
Linus Torvalds 9f8413c4a6 cgroup: Changes for v6.8
- Yafang Shao added task_get_cgroup1() helper to enable a similar BPF helper
   so that BPF progs can be more useful on cgroup1 hierarchies. While cgroup1
   is mostly in maintenance mode, this addition is very small while having an
   outsized usefulness for users who are still on cgroup1. Yafang also
   optimized root cgroup list access by making it RCU protected in the
   process.
 
 - Waiman Long optimized rstat operation leading to substantially lower and
   more consistent lock hold time while flushing the hierarchical statistics.
   As the lock can be acquired briefly in various hot paths, this reduction
   has cascading benefits.
 
 - Waiman also improved the quality of isolation for cpuset's isolated
   partitions. CPUs which are allocated to isolated partitions are now
   excluded from running unbound work items and cpu_is_isolated() test which
   is used by vmstat and memcg to reduce interference now includes cpuset
   isolated CPUs. While it isn't there yet, the hope is eventually reaching
   parity with the isolation level provided by the `isolcpus` boot param but
   in a dynamic manner.
 
   This involved a couple workqueue patches which were applied directly to
   cgroup/for-6.8 rather than ping-ponged through the wq tree. This was
   because the wq code change was small and the area is usually very static
   and unlikely to cause conflicts. However, luck had it that there was a wq
   bug fix in the area during the 6.7 cycle which caused a conflict. The
   conflict is contextual but can be a bit confusing to resolve, so there is
   one merge from wq/for-6.7-fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZYnuJg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGQ5kAP9nMMWqi+R1HeG7+hWROTVjQZ0OM9KRcpZ1TmjF
 FNbkJgEAzt+sPnoWwYDTSI7pkNeZ/IM7x1qkkKGvENNtUXrz0Ac=
 =PyYN
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Yafang Shao added task_get_cgroup1() helper to enable a similar BPF
   helper so that BPF progs can be more useful on cgroup1 hierarchies.
   While cgroup1 is mostly in maintenance mode, this addition is very
   small while having an outsized usefulness for users who are still on
   cgroup1. Yafang also optimized root cgroup list access by making it
   RCU protected in the process.

 - Waiman Long optimized rstat operation leading to substantially lower
   and more consistent lock hold time while flushing the hierarchical
   statistics. As the lock can be acquired briefly in various hot paths,
   this reduction has cascading benefits.

 - Waiman also improved the quality of isolation for cpuset's isolated
   partitions. CPUs which are allocated to isolated partitions are now
   excluded from running unbound work items and cpu_is_isolated() test
   which is used by vmstat and memcg to reduce interference now includes
   cpuset isolated CPUs. While it isn't there yet, the hope is
   eventually reaching parity with the isolation level provided by the
   `isolcpus` boot param but in a dynamic manner.
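
   As a rough illustration of excluding isolated CPUs from unbound work,
   the mask arithmetic can be sketched as below. A 64-bit word stands in
   for the kernel's struct cpumask, and both helper names are hypothetical.

```c
#include <stdint.h>

/* One bit per CPU; a 64-bit word stands in for the kernel's struct cpumask. */
typedef uint64_t cpu_mask_t;

/* Hypothetical helper: CPUs still allowed to run unbound work items. */
static cpu_mask_t unbound_allowed(cpu_mask_t requested, cpu_mask_t isolated)
{
    return requested & ~isolated;
}

/* Hypothetical helper: is the given CPU set in the mask? */
static int cpu_in_mask(cpu_mask_t mask, unsigned int cpu)
{
    return cpu < 64 && ((mask >> cpu) & 1u);
}
```

   For example, with CPUs 0-7 requested and CPUs 2-3 isolated, the
   resulting mask covers CPUs 0-1 and 4-7.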

* tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Move rcu_head up near the top of cgroup_root
  cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check
  cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
  cgroup/rstat: Optimize cgroup_rstat_updated_list()
  cgroup: Fix documentation for cpu.idle
  cgroup/cpuset: Expose cpuset.cpus.isolated
  workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
  cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
  cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
  cgroup/cpuset: Keep track of CPUs in isolated partitions
  selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh
  workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  selftests: cgroup: Fixes a typo in a comment
  cgroup: Add a new helper for cgroup1 hierarchy
  cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root()
  cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()
  cgroup: Make operations on the cgroup root_list RCU safe
  cgroup: Remove unnecessary list_empty()
2024-01-08 20:04:02 -08:00
Linus Torvalds bfe8eb3b85 Scheduler changes for v6.8:
- Energy scheduling:
 
     - Consolidate how the max compute capacity is
       used in the scheduler and how we calculate
       the frequency for a level of utilization.
 
     - Rework interface between the scheduler and
       the schedutil governor
 
     - Simplify the util_est logic
 
  - Deadline scheduler:
 
     - Work more towards reducing SCHED_DEADLINE
       starvation of low priority tasks (e.g., SCHED_OTHER)
       when higher priority tasks monopolize CPU
       cycles, via the introduction of 'deadline servers'
       (nested/2-level scheduling).
       "Fair servers" to make use of this facility are
       not introduced yet.
 
  - EEVDF:
 
     - Introduce O(1) fastpath for EEVDF task selection
 
  - NUMA balancing:
 
     - Tune the NUMA-balancing vma scanning logic some more,
       to better distribute the probability
       of a particular vma getting scanned.
 
  - Plus misc fixes, cleanups and updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWcASMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jLbg/+NOwF18M6klF1/3jUaV1PU09vRzYnnA7w
 oF7Tru7JLV+/vZK+rwI1zxzj5Nj3sVBQPIyp1embEHx7Z/QH8MIaIVpcSFsDDCYY
 Q8n6ZVRB+lKWEo5+Ti6JEJftDAWuLHXwFWDa57oWPuR0Tc736+zYHUfj7jdKk0RI
 nT/lnOT6hXU8q26O4QFrBrrhvCCxc4byo7buKPQfqie0bDA70ppIWkFQoQME6mvQ
 US9jvOyUipOiPV06DPwFvPDJUQBGq2VdJNk+5zCEtcqEfLREuo/Xq1Ww1x1BWaZI
 761532EuDo73iMK4IFZrvVmj1ioz957qbje11MSSkDdKj692xxjXyvnY0NBvZuho
 Ueog/jQ4D4I2qu7pPSCF8UfnI/Hw4Q+KJ89j3pcywRm4hmCTf9k3MGpAaVLVxH7G
 e5REZ5MSsFZi4Cs+zF87Of5KCKLhTr1qSetNtShinKahg06WZ+MZ8tW4jb52qy0j
 F8PMlvfBI3f7SOtA8s2P26mDGQ21YQehN2d5P+Fbwj/U3fjIlSTOyx6NwLpFwYaS
 Vf+fctchGFV1Sh7c2JjCh+ecYfXx3ghT/pvyPOImJtxtCKSRUQ8c26ApC1OsWfOE
 FdHv4f2dPqcyswCZzIv/2fyDXc9eaS2E05EMDNqVuMCGnzidzSs81n7hBioNMrnH
 ZgHK90TmEbw=
 =wTVh
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Energy scheduling:

   - Consolidate how the max compute capacity is used in the scheduler
     and how we calculate the frequency for a level of utilization.

   - Rework interface between the scheduler and the schedutil governor

   - Simplify the util_est logic

  Deadline scheduler:

   - Work more towards reducing SCHED_DEADLINE starvation of low
     priority tasks (e.g., SCHED_OTHER) when higher priority tasks
     monopolize CPU cycles, via the introduction of 'deadline servers'
     (nested/2-level scheduling).

     "Fair servers" to make use of this facility are not introduced yet.

  EEVDF:

   - Introduce O(1) fastpath for EEVDF task selection

  NUMA balancing:

   - Tune the NUMA-balancing vma scanning logic some more, to better
     distribute the probability of a particular vma getting scanned.

  Plus misc fixes, cleanups and updates"

* tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  sched/fair: Fix tg->load when offlining a CPU
  sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
  sched/fair: Use all little CPUs for CPU-bound workloads
  sched/fair: Simplify util_est
  sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
  arm64/amu: Use capacity_ref_freq() to set AMU ratio
  cpufreq/cppc: Set the frequency used for computing the capacity
  cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
  energy_model: Use a fixed reference frequency
  cpufreq/schedutil: Use a fixed reference frequency
  cpufreq: Use the fixed and coherent frequency for scaling capacity
  sched/topology: Add a new arch_scale_freq_ref() method
  freezer,sched: Clean saved_state when restoring it during thaw
  sched/fair: Update min_vruntime for reweight_entity() correctly
  sched/doc: Update documentation after renames and synchronize Chinese version
  sched/cpufreq: Rework iowait boost
  sched/cpufreq: Rework schedutil governor performance estimation
  sched/pelt: Avoid underestimation of task utilization
  sched/timers: Explain why idle task schedules out on remote timer enqueue
  sched/cpuidle: Comment about timers requirements VS idle handler
  ...
2024-01-08 19:49:17 -08:00
Linus Torvalds aac4de465a Performance events changes for v6.8 are:
- Add branch stack counters ABI extension to better capture
    the growing amount of information the PMU exposes via
    branch stack sampling. There's matching tooling support.
 
  - Fix race when creating the nr_addr_filters sysfs file
 
  - Add Intel Sierra Forest and Grand Ridge intel/cstate
    PMU support.
 
  - Add Intel Granite Rapids, Sierra Forest and Grand Ridge
    uncore PMU support.
 
  - Misc cleanups & fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb4lURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jlnQ/+NSzrPQ9hEiS5a1iMMxdwC6IoXCmeFVsv
 s5NsGaVC7FEgjm3oCfvQlP63HolMO9R7TNLZsgINzOda5IHtE7WUcgBK7gbZr+NT
 WabdTyFrdmUr+Br0rLrEe0bxDSQU7r41ptqKE5HZRM9/3SbLhWgaXSJbfFAG2JV0
 xboZ/2qzb7Puch6VTWv1YhuIpr1Pi817As4SOo7JR4V8jBB2bh2eZ7XBN1z23aw2
 xuglbYml5gs4dOaFTqkRLWyn2PmrZ9wYKcdp63FVUscZ4LxvSw749BxEcNpTbxLp
 PT6uXIKw9PnStNfscfrsk6fDocVJzqrOK71blgiOKbmhWTE0UimEpFf1Hd3ooewg
 hFp3hmkE5Bc2MTUnwivkBxj96fz5rXH+3+Cue/5NsvDNlhlkswIIxzDw8M1G4rOI
 KQMDUYFOhQPa3Hi1lSp2SgHI5AcYHudepr/Z3QMxD3iLs+Wo2cmDcp8d2VrMLfb7
 GHSITG592iYcZPYsJosxby8CSFaUPxIl9l3AODQwWuEjd4PcOYa6iB2HbEa/mC3R
 wXcs8mFIMAaH/HRYUlqUDA5pOqN5chb13iDtS4JqJqBKyWgdrDLCVxoZSQvB64+I
 bldyy1e5oQSVVwJ42WLkUK3Eld2x75ki1JLZFwMgYuOgQv3jfu2VNenUWJ5ig0La
 dPpHP8PwOoc=
 =2O/5
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:

 - Add branch stack counters ABI extension to better capture the growing
   amount of information the PMU exposes via branch stack sampling.
   There's matching tooling support.

 - Fix race when creating the nr_addr_filters sysfs file

 - Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support

 - Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU
   support

 - Misc cleanups & fixes

* tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Factor out topology_gidnid_map()
  perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology()
  perf/x86/amd: Reject branch stack for IBS events
  perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge
  perf/x86/intel/uncore: Support IIO free-running counters on GNR
  perf/x86/intel/uncore: Support Granite Rapids
  perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array
  perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR
  perf: Fix the nr_addr_filters fix
  perf/x86/intel/cstate: Add Grand Ridge support
  perf/x86/intel/cstate: Add Sierra Forest support
  x86/smp: Export symbol cpu_clustergroup_mask()
  perf/x86/intel/cstate: Cleanup duplicate attr_groups
  perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file
  perf/x86/intel: Support branch counters logging
  perf/x86/intel: Reorganize attrs and is_visible
  perf: Add branch_sample_call_stack
  perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag
  perf: Add branch stack counters
2024-01-08 19:37:20 -08:00
Linus Torvalds f24dc33f8e Timer subsystem changes for v6.8:
- Various preparatory cleanups & enhancements of the timer-wheel code,
    in preparation for the WIP 'pull timers at expiry' timer migration model
    series (which will replace the current 'push timers at enqueue' migration
    model), by Anna-Maria Behnsen:
 
       - Update comments and clean up confusing variable names
 
       - Add debug check to warn about time travel
 
       - Improve/expand timer-wheel tracepoints
 
       - Optimize away unnecessary IPIs for deferrable timers
 
       - Restructure & clean up next_expiry_recalc()
 
       - Clean up forward_timer_base()
 
       - Introduce __forward_timer_base() and use it to simplify
         and micro-optimize get_next_timer_interrupt()
 
  - Restructure the get_next_timer_interrupt()'s idle logic
    for better readability and to enable a minor optimization.
 
  - Fix the nextevt calculation when no timers are pending
 
  - Fix the sysfs_get_uname() prototype declaration
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb0XIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h9kg/9FpjbiogIKrDXb/pJHyhYkK6jzN4aNrQo
 wsOz4FDKyvioqLfr5ndpFE++DwsyzUibPfHJzfwD5IilTyolm2jW44VSCBzNdm72
 lI6NGIcIxmIeCuO4bLmJj/fuQAugQ+ajmA2pyC/aBSO4Q2jtnxjYMGiV9zMWmOsa
 E816CK5zp6IVx+w0GWwK5yW5YR5dscSQCD+mBYVAdTWYoRNTy6xonsMTRuNi0ePx
 donetpu0eWG9NGwUdax/65oKVLZMR/rKAI/3pInhkOS9BsL2o8Rt4o2Y+9aBFi2t
 2m+ZbFg5hngJwhP8Mfc7A+I3qiWgCOMGNGrebyzlwb+0PnNBPzrwnNPveW3R9QRx
 LMxSU3aH66bXeX+YCF4y2tjWSmYooAnztPstUGrs+sq36+NF0wyY6ip/36S6MRGk
 zjedqWnrHQeeZlzOLiKNzB+FIBnOt6bhZEh1Wk1/zwi9UWxw+7+I6tR0b57NqRxZ
 VHq38fp+O2OEAj5JvwJ6FomOd+onqQ2wwveG5OMCa+hwM2ZCuVXQRYgM2ohMfwl3
 BMSd3KMZsBiHT0zyun3k/uJ7CaIjArPh016baSS10ArSl9sE64aJj7ELtuSLqtaD
 idJFXu3tv6VgDT2rMhLWNHvzQoK+gb8+/qnms4Ea+wY2f7nubi0aH20qHfugkgis
 4KOkw9cQw0U=
 =n40J
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer subsystem updates from Ingo Molnar:

 - Various preparatory cleanups & enhancements of the timer-wheel code,
   in preparation for the WIP 'pull timers at expiry' timer migration
   model series (which will replace the current 'push timers at enqueue'
   migration model), by Anna-Maria Behnsen:

      - Update comments and clean up confusing variable names

      - Add debug check to warn about time travel

      - Improve/expand timer-wheel tracepoints

      - Optimize away unnecessary IPIs for deferrable timers

      - Restructure & clean up next_expiry_recalc()

      - Clean up forward_timer_base()

      - Introduce __forward_timer_base() and use it to simplify and
        micro-optimize get_next_timer_interrupt()

 - Restructure the get_next_timer_interrupt()'s idle logic for better
   readability and to enable a minor optimization.

 - Fix the nextevt calculation when no timers are pending

 - Fix the sysfs_get_uname() prototype declaration

* tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Fix nextevt calculation when no timers are pending
  timers: Rework idle logic
  timers: Use already existing function for forwarding timer base
  timers: Split out forward timer base functionality
  timers: Clarify check in forward_timer_base()
  timers: Move store of next event into __next_timer_interrupt()
  timers: Do not IPI for deferrable timers
  tracing/timers: Add tracepoint for tracking timer base is_idle flag
  tracing/timers: Enhance timer_start tracepoint
  tick-sched: Warn when next tick seems to be in the past
  tick/sched: Cleanup confusing variables
  tick-sched: Fix function names in comments
  time: Make sysfs_get_uname() function visible in header
2024-01-08 18:44:11 -08:00
Linus Torvalds cdc202281a Move various entry functions from kernel/entry/common.c to <linux/entry-common.h>,
and always-inline them, to improve syscall entry performance on s390 by ~11%.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWbxQ4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h5Cw//TEIlWPCLpIeiDsOCKb5g2e4U+AatNIGt
 ysmCvTWsKOiBEItDbZpwdpcdv/Ed41UXkS7Zmwetw81P50rz/i+kIJZW4gdl9GiV
 qhjj0gbhGQ43myQkGdYIcmdVaHl9fuyDGZSai6c17zgdOoL5CvCGGiL5Dn4Cn36x
 skm8P66r9DuM9cLTnhqQHMKp7cf4HQAX+awhFeppCquhzh3M2I8GsUVrT7tZV+Jw
 zOMLVjsI8Va4JyGsl07DoqWlyFWcoYvJ5ayzvDCaBxgeFIK9uZgwkKV0HT9q5tvg
 RhsHQK4zbxgkaMMCgEt/WdT14YesO2+5+ml91Zkjp2NMud0O0gmd2YXZju1aOQQw
 XCL3pm6DB4oN+IkW9lo6k3rqo9PEip9rt/FAfkNLeb50elHfSZSvE1ZxXSQwx5N4
 pHDNMcK6SMsJhEdJInNotViKrpXX0Rjr7x1pY/2DA9bMP/jX/9+J3ODuGCDZrvjp
 eq4JM15VSq6tVmg+LMcszThWz+9gIaLFAqQwFt3G082ANDkOvg0mK7T65gccDuyA
 Gl6f/p3tAYHYxOI9KOBN6Daq3QAqlMT+M4YgNbbv8fanWYIzRd3U/Y+YrUCnryu8
 Db+8FHlUkbJb/clUofJ5nj0Ene7xReJX/8m8XxA95Cc/UGYU8w0cachXDoPyKZUP
 xtFW3xn8K4M=
 =VHW9
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull generic syscall updates from Ingo Molnar:
 "Move various entry functions from kernel/entry/common.c to a header
  file, and always-inline them, to improve syscall entry performance
  on s390 by ~11%"

* tag 'core-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Move syscall_enter_from_user_mode() to header file
  entry: Move enter_from_user_mode() to header file
  entry: Move exit to usermode functions to header file
2024-01-08 18:37:13 -08:00
Linus Torvalds 6cbf5b3105 Locking changes for v6.8:
- lock guards:
 
    - Use lock guards in the ptrace code
 
    - Introduce conditional guards to extend to conditional lock
      primitives like mutex_trylock()/mutex_lock_interruptible()/etc.
 
 - lockdep:
 
    - Optimize 'struct lock_class' to be smaller
 
    - Update file patterns in MAINTAINERS
 
 - mutexes: Document mutex lifetime rules a bit more
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWbur0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i19xAAtAZs8Xi5X3IuLoygfMg81k91GvavLpv/
 sVYGbKLm6+r0dOr4w3w76jVyWXNrVtTUFJysfK6nTNInEO+P0stL6aDmQVDYIaCw
 jaQDO/UgPfxUxnBgMy7dUvsnmxBw4GO3zcQpx8GuAuuuEtNcQcnMoP2t/RvBpBEI
 K2xCCkIT5LQPKbu9LkVZ/fAhZHtMypipuIMtEpfVYEKCMEwDmXoHuj3SNo4LGt04
 0wZ5hHVhTcOQDm1/tjSXKsmxwQRVhI6OCcjXJ8hxDiPXg9vWO0+CzOibujl/jfhs
 Dw45D7wwiCHFUsmKHKz335Jtk8wBpgWUtlxZ+GB/TVfAQkLv1xBdoE1F8yLcBEfx
 yBNxh+0wecPWSrIsRLZEotRRu7obmBsIW04qUVP3oLWIu3tzhcud9gR43CeeplkT
 RW1lkdLt5SzN//MLx1cWPqKjfi6wiolaveD1RrqIbVXRhhrFH3o7+EDEOqGNwOvu
 0D9yx4Su7SYlAYXgGnGLnnmcmDt2cEoOXkY3K608sYW45dZOpNIu56+EfHiVH1fI
 Q0lZNHXNDkX85Zzxoam6Y0SYo74lXYBmtL+RSYvvaKjRYgrJGxGcOciK+/c8kmY2
 +Eazx5sxoR5nKDMP8MrZCZ5CQtqdPB9IYk7hUvQ31BYL1LSKhA5Mi/xjsFwMSNwv
 pEBazHxty58=
 =fOTd
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Lock guards:

   - Use lock guards in the ptrace code

   - Introduce conditional guards to extend to conditional lock
     primitives like mutex_trylock()/mutex_lock_interruptible()/etc.

  lockdep:

   - Optimize 'struct lock_class' to be smaller

   - Update file patterns in MAINTAINERS

  mutexes:

   - Document mutex lifetime rules a bit more"
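
The idea behind conditional guards can be sketched in userspace C with the GCC/Clang cleanup attribute. This is a minimal illustration under stated assumptions, not the kernel's macros: struct toy_lock, TOY_COND_GUARD and the other names are hypothetical, and a toy non-blocking lock stands in for a real mutex.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy non-blocking lock standing in for a kernel mutex (illustration only). */
struct toy_lock { bool locked; };

static bool toy_trylock(struct toy_lock *l)
{
    if (l->locked)
        return false;
    l->locked = true;
    return true;
}

static void toy_unlock(struct toy_lock *l) { l->locked = false; }

/* Cleanup handler: runs when the guard variable goes out of scope. */
static void toy_guard_release(struct toy_lock **guard)
{
    if (*guard)           /* NULL means the trylock failed: nothing to undo */
        toy_unlock(*guard);
}

/* Declare a guard that holds `lock` if it was acquired, or NULL otherwise. */
#define TOY_COND_GUARD(name, lock) \
    struct toy_lock *name __attribute__((cleanup(toy_guard_release))) = \
        toy_trylock(lock) ? (lock) : NULL

/* Returns 0 on success, -1 if the lock was busy; unlocks on every path. */
static int bump_counter(struct toy_lock *lock, int *counter)
{
    TOY_COND_GUARD(g, lock);
    if (!g)
        return -1;
    (*counter)++;
    return 0;             /* toy_guard_release() runs here automatically */
}
```

The point of the conditional form is that a failed acquisition leaves the guard empty, so the scope exit releases the lock only when it was actually taken.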

* tag 'locking-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/mutex: Clarify that mutex_unlock(), and most other sleeping locks, can still use the lock object after it's unlocked
  locking/mutex: Document that mutex_unlock() is non-atomic
  ptrace: Convert ptrace_attach() to use lock guards
  locking/lockdep: Slightly reorder 'struct lock_class' to save some memory
  MAINTAINERS: Add include/linux/lockdep*.h
  cleanup: Add conditional guard support
2024-01-08 18:19:44 -08:00
Kirill A. Shutemov 5e0a760b44 mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive.  This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.

Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-08 15:27:15 -08:00
Kirill A. Shutemov fd37721803 mm, treewide: introduce NR_PAGE_ORDERS
NR_PAGE_ORDERS defines the number of page orders supported by the page
allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.

NR_PAGE_ORDERS assists in defining arrays of page orders and allows for
more natural iteration over them.
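
A minimal sketch of the pattern this enables, assuming MAX_ORDER is 10 purely for illustration (the kernel derives the real value from its configuration); struct order_stats and total_free_pages() are hypothetical names:

```c
/* Assumed value for illustration only. */
#define MAX_ORDER      10
#define NR_PAGE_ORDERS (MAX_ORDER + 1)

/* One counter per order, 0..MAX_ORDER inclusive. */
struct order_stats {
    unsigned long nr_free[NR_PAGE_ORDERS];
};

/* Hypothetical helper: total base pages described by the per-order counters. */
static unsigned long total_free_pages(const struct order_stats *s)
{
    unsigned long total = 0;

    /* "order < NR_PAGE_ORDERS" replaces the easy-to-misread "order <= MAX_ORDER". */
    for (unsigned int order = 0; order < NR_PAGE_ORDERS; order++)
        total += s->nr_free[order] << order;
    return total;
}
```

Sizing the array and bounding the loop with the same constant avoids the off-by-one that an inclusive MAX_ORDER invites.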

[kirill.shutemov@linux.intel.com: fixup for kerneldoc warning]
  Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box
Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-08 15:27:15 -08:00
Linus Torvalds 5db8752c3b vfs-6.8.iov_iter
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzBQAKCRCRxhvAZXjc
 ot+3AQCZw1PBD4azVxFMWH76qwlAGoVIFug4+ogKU/iUa4VLygEA2FJh1vLJw5iI
 LpgBEIUTPVkwtzinAW94iJJo1Vr7NAI=
 =p6PB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs iov_iter cleanups from Christian Brauner:
 "This contains a minor cleanup. The patches drop an unused argument
  from import_single_range(), which allows replacing it with
  import_ubuf() and dropping import_single_range() completely"

* tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iov_iter: replace import_single_range() with import_ubuf()
  iov_iter: remove unused 'iov' argument from import_single_range()
2024-01-08 11:43:04 -08:00
Linus Torvalds c604110e66 vfs-6.8.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
 ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
 KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
 =0bRl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Add Jan Kara as VFS reviewer

   - Show correct device and inode numbers in proc/<pid>/maps for vma
     files on stacked filesystems. This is now easily doable thanks to
     the backing file work from the last cycles. This comes with
     selftests

  Cleanups:

   - Remove a redundant might_sleep() from wait_on_inode()

   - Initialize pointer with NULL, not 0

   - Clarify comment on access_override_creds()

   - Rework and simplify eventfd_signal() and eventfd_signal_mask()
     helpers

   - Process aio completions in batches to avoid needless wakeups

   - Completely decouple struct mnt_idmap from namespaces. We now only
     keep the actual idmapping around and don't stash references to
     namespaces

   - Reformat maintainer entries to indicate that a given subsystem
     belongs to fs/

   - Simplify fput() for files that were never opened

   - Get rid of various pointless file helpers

   - Rename various file helpers

   - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
     last cycle

   - Make relatime_need_update() return bool

   - Use GFP_KERNEL instead of GFP_USER when allocating superblocks

   - Replace deprecated ida_simple_*() calls with their current ida_*()
     counterparts

  Fixes:

   - Fix comments on user namespace id mapping helpers. They aren't
     kernel doc comments so they shouldn't be using /**

   - s/Retuns/Returns/g in various places

   - Add missing parameter documentation on can_move_mount_beneath()

   - Rename i_mapping->private_data to i_mapping->i_private_data

   - Fix a false-positive lockdep warning in pipe_write() for watch
     queues

   - Improve __fget_files_rcu() code generation to improve performance

   - Only notify writer that pipe resizing has finished after setting
     pipe->max_usage otherwise writers are never notified that the pipe
     has been resized and hang

   - Fix some kernel docs in hfsplus

   - s/passs/pass/g in various places

   - Fix kernel docs in ntfs

   - Fix kcalloc() arguments order reported by gcc 14

   - Fix uninitialized value in reiserfs"

* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  reiserfs: fix uninit-value in comp_keys
  watch_queue: fix kcalloc() arguments order
  ntfs: dir.c: fix kernel-doc function parameter warnings
  fs: fix doc comment typo fs tree wide
  selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
  fs/proc: show correct device and inode numbers in /proc/pid/maps
  eventfd: Remove usage of the deprecated ida_simple_xx() API
  fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
  fs/hfsplus: wrapper.c: fix kernel-doc warnings
  fs: add Jan Kara as reviewer
  fs/inode: Make relatime_need_update return bool
  pipe: wakeup wr_wait after setting max_usage
  file: remove __receive_fd()
  file: stop exposing receive_fd_user()
  fs: replace f_rcuhead with f_task_work
  file: remove pointless wrapper
  file: s/close_fd_get_file()/file_close_fd()/g
  Improve __fget_files_rcu() code generation (and thus __fget_light())
  file: massage cleanup of files that failed to open
  fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
  ...
2024-01-08 10:26:08 -08:00
Steven Rostedt (Google) 4f1991a92c tracing histograms: Simplify parse_actions() function
The parse_actions() function uses 'len = str_has_prefix()' to test which
action is in the string being parsed. But then it goes and repeats the
logic for each different action. This logic can be simplified and
duplicate code can be removed as 'len' contains the length of the found
prefix which should be used for all actions.
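
The idea can be illustrated with a small Python model. This is only a sketch: str_has_prefix() below mimics the contract described above (it returns the length of the prefix on a match and 0 otherwise), and the action prefixes and the parse_action() helper are illustrative assumptions, not the kernel code.

```python
def str_has_prefix(s: str, prefix: str) -> int:
    """Return len(prefix) if s starts with prefix, else 0."""
    return len(prefix) if s.startswith(prefix) else 0


# Illustrative prefixes only (assumption); the real list lives in the kernel.
ACTION_PREFIXES = ("onmatch(", "onmax(", "onchange(")


def parse_action(s: str):
    """Return (prefix, remainder) for the first matching prefix, or None."""
    for prefix in ACTION_PREFIXES:
        n = str_has_prefix(s, prefix)
        if n:
            # 'n' already holds the length of the matched prefix, so one
            # slice serves every action instead of per-action duplicates.
            return prefix, s[n:]
    return None
```

Because the length comes back from the prefix test itself, the per-action branches collapse into a single code path.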

Link: https://lore.kernel.org/all/20240107112044.6702cb66@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240107203258.37e26d2b@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-08 13:24:56 -05:00
Rafael J. Wysocki f1e5e46397 Merge branch 'pm-sleep'
Merge system-wide power management updates for 6.8-rc1:

 - Fix possible deadlocks in the core system-wide PM code that occur if
   device-handling functions cannot be executed asynchronously during
   resume from system-wide suspend (Rafael J. Wysocki).

 - Clean up unnecessary local variable initializations in multiple
   places in the hibernation code (Wang chaodong, Li zeming).

 - Adjust core hibernation code to avoid missing wakeup events that
   occur after saving an image to persistent storage (Chris Feng).

 - Update hibernation code to enforce correct ordering during image
   compression and decompression (Hongchen Zhang).

 - Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
   during hibernation and restore (Chen Haonan).

 - Adjust documentation and code comments to reflect recent task freezer
   changes (Kevin Hao).

 - Repair excess function parameter description warning in the
   hibernation image-saving code (Randy Dunlap).

* pm-sleep:
  PM: sleep: Fix possible deadlocks in core system-wide PM code
  async: Introduce async_schedule_dev_nocall()
  async: Split async_schedule_node_domain()
  PM: hibernate: Repair excess function parameter description warning
  PM: sleep: Remove obsolete comment from unlock_system_sleep()
  Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes
  PM: hibernate: Use kmap_local_page() in copy_data_page()
  PM: hibernate: Enforce ordering during image compression/decompression
  PM: hibernate: Avoid missing wakeup events during hibernation
  PM: hibernate: Do not initialize error in snapshot_write_next()
  PM: hibernate: Do not initialize error in swap_write_page()
  PM: hibernate: Drop unnecessary local variable initialization
2024-01-08 13:42:48 +01:00
Ingo Molnar cdb3033e19 Merge branch 'sched/urgent' into sched/core, to pick up pending v6.7 fixes for the v6.8 merge window
This fix didn't make it upstream in time, pick it up
for the v6.8 merge window.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-01-08 12:57:28 +01:00
Jakub Kicinski 8158a50f90 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZZgrfgAKCRDbK58LschI
 g87JAQDu+oUG3aWnRJi+lJTK8vGnKRuBwUxgnI5Ze99N0tuPmAEAz1gpXLYP+fKE
 eqRhZGGhujdHC9if3Le+nG6nvf8Gvw0=
 =KPkZ
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-01-05

We've added 40 non-merge commits during the last 2 day(s) which contain
a total of 73 files changed, 1526 insertions(+), 951 deletions(-).

The main changes are:

1) Fix a memory leak when streaming AF_UNIX sockets were inserted
   into multiple sockmap slots/maps, from John Fastabend.

2) Fix gotol in s390 BPF JIT with large offsets, from Ilya Leoshkevich.

3) Fix reattachment branch in bpf_tracing_prog_attach() and reject
   the request if there is no valid attach_btf, from Jiri Olsa.

4) Remove deprecated bpfilter kernel leftovers given the project
   is developed in user space (https://github.com/facebook/bpfilter),
   from Quentin Deslandes.

5) Relax tracing BPF program recursive attach rules given right now
   it is not possible to create tracing program call cycles,
   from Dmitrii Dolgov.

6) Fix excessive memory consumption for the bpf_global_percpu_ma
   for systems with a large number of CPUs, from Yonghong Song.

7) Small x86 BPF JIT cleanup to reuse emit_nops instead of open-coding
   memcpy of x86_nops, from Leon Hwang.

8) Follow-up for libbpf to support __arg_ctx global function argument tag
   semantics to complement the merged kernel side, from Andrii Nakryiko.

9) Introduce "volatile compare" macros for BPF selftests in order
   to make the latter more robust against compiler optimization,
   from Alexei Starovoitov.

10) Small simplification in verifier's size checking of helper accesses
    along with additional selftests, from Andrei Matei.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (40 commits)
  selftests/bpf: Test re-attachment fix for bpf_tracing_prog_attach
  bpf: Fix re-attachment branch in bpf_tracing_prog_attach
  selftests/bpf: Add test for recursive attachment of tracing progs
  bpf: Relax tracing prog recursive attach rules
  bpf, x86: Use emit_nops to replace memcpy x86_nops
  selftests/bpf: Test gotol with large offsets
  selftests/bpf: Double the size of test_loader log
  s390/bpf: Fix gotol with large offsets
  bpfilter: remove bpfilter
  bpf: Remove unnecessary cpu == 0 check in memalloc
  selftests/bpf: add __arg_ctx BTF rewrite test
  selftests/bpf: add arg:ctx cases to test_global_funcs tests
  libbpf: implement __arg_ctx fallback logic
  libbpf: move BTF loading step after relocation step
  libbpf: move exception callbacks assignment logic into relocation step
  libbpf: use stable map placeholder FDs
  libbpf: don't rely on map->fd as an indicator of map being created
  libbpf: use explicit map reuse flag to skip map creation steps
  libbpf: make uniform use of btf__fd() accessor inside libbpf
  selftests/bpf: Add a selftest with > 512-byte percpu allocation size
  ...
====================

Link: https://lore.kernel.org/r/20240105170105.21070-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-05 19:15:32 -08:00
Yuntao Wang 6dff315972 crash_core: fix and simplify the logic of crash_exclude_mem_range()
The purpose of crash_exclude_mem_range() is to remove all memory ranges
that overlap with [mstart-mend].  However, the current logic only removes
the first overlapping memory range.

Commit a2e9a95d21 ("kexec: Improve & fix crash_exclude_mem_range() to
handle overlapping ranges") attempted to address this issue, but it did
not fix all error cases.

Let's fix and simplify the logic of crash_exclude_mem_range().
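
The intended behaviour can be sketched as a small Python model. It assumes ranges are (start, end) pairs with inclusive ends and that the helper name exclude_range() is hypothetical; the kernel function works on a fixed-capacity array and has error handling that this model omits.

```python
def exclude_range(ranges, mstart, mend):
    """Remove [mstart, mend] from every range that overlaps it."""
    out = []
    for start, end in ranges:
        if mend < start or mstart > end:
            # No overlap: keep the range untouched.
            out.append((start, end))
            continue
        if start < mstart:
            # Keep the part below the excluded region.
            out.append((start, mstart - 1))
        if end > mend:
            # Keep the part above the excluded region.
            out.append((mend + 1, end))
    return out
```

The key point is that the loop visits every range rather than stopping at the first overlap, so an exclusion that spans several ranges trims or drops all of them.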

Link: https://lkml.kernel.org/r/20240102144905.110047-4-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05 10:45:25 -08:00
Kinsey Ho 61dd3f246b mm/mglru: add CONFIG_LRU_GEN_WALKS_MMU
Add CONFIG_LRU_GEN_WALKS_MMU such that if disabled, the code that
walks page tables to promote pages into the youngest generation will
not be built.

Also improves code readability by adding two helper functions
get_mm_state() and get_next_mm().

Link: https://lkml.kernel.org/r/20231227141205.2200125-3-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Donet Tom <donettom@linux.vnet.ibm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05 10:17:44 -08:00
Jiri Olsa 715d82ba63 bpf: Fix re-attachment branch in bpf_tracing_prog_attach
The following case can cause a crash due to missing attach_btf:

1) load rawtp program
2) load fentry program with rawtp as target_fd
3) create tracing link for fentry program with target_fd = 0
4) repeat 3

In the end we have:

- prog->aux->dst_trampoline == NULL
- tgt_prog == NULL (because we did not provide target_fd to link_create)
- prog->aux->attach_btf == NULL (the program was loaded with attach_prog_fd=X)
- the program was loaded for tgt_prog but we have no way to find out which one

    BUG: kernel NULL pointer dereference, address: 0000000000000058
    Call Trace:
     <TASK>
     ? __die+0x20/0x70
     ? page_fault_oops+0x15b/0x430
     ? fixup_exception+0x22/0x330
     ? exc_page_fault+0x6f/0x170
     ? asm_exc_page_fault+0x22/0x30
     ? bpf_tracing_prog_attach+0x279/0x560
     ? btf_obj_id+0x5/0x10
     bpf_tracing_prog_attach+0x439/0x560
     __sys_bpf+0x1cf4/0x2de0
     __x64_sys_bpf+0x1c/0x30
     do_syscall_64+0x41/0xf0
     entry_SYSCALL_64_after_hwframe+0x6e/0x76

Return -EINVAL in this situation.

Fixes: f3a9507554 ("bpf: Allow trampoline re-attach for tracing and lsm programs")
Cc: stable@vger.kernel.org
Signed-off-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Link: https://lore.kernel.org/r/20240103190559.14750-4-9erthalion6@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-04 20:40:19 -08:00
Dmitrii Dolgov 19bfcdf949 bpf: Relax tracing prog recursive attach rules
Currently, it's not allowed to attach an fentry/fexit prog to another
fentry/fexit prog. At the same time it's not uncommon to see a tracing
program with lots of logic in use, and the attachment limitation
prevents usage of fentry/fexit for performance analysis (e.g. with
"bpftool prog profile" command) in this case. An example could be
falcosecurity libs project that uses tp_btf tracing programs.

Following the corresponding discussion [1], the reason for that is to
avoid tracing progs call cycles without introducing more complex
solutions. But currently it seems impossible to load and attach tracing
programs in a way that will form such a cycle. The limitation is coming
from the fact that attach_prog_fd is specified at the prog load (thus
making it impossible to attach to a program loaded after it in this
way), as well as tracing progs not implementing link_detach.

Replace "no same type" requirement with verification that no more than
one level of attachment nesting is allowed. In this way only one
fentry/fexit program could be attached to another fentry/fexit to cover
profiling use case, and still no cycle could be formed. To implement,
add a new field into bpf_prog_aux to track nested attachment for tracing
programs.
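
The "one level of nesting" rule can be sketched with a toy Python model. The class and the attached_to_tracing_prog attribute are hypothetical stand-ins for the new bpf_prog_aux field, and every object here is assumed to be a tracing program; this is not the verifier's actual check.

```python
class TracingProg:
    """Toy model of a tracing program whose target is fixed at load time."""

    def __init__(self, name, target=None):
        # Reject a target that is itself attached to a tracing program:
        # that would create a second level of nesting.
        if target is not None and target.attached_to_tracing_prog:
            raise ValueError("cannot nest tracing program attach more than once")
        self.name = name
        self.target = target
        # Hypothetical flag mirroring the new tracking field.
        self.attached_to_tracing_prog = target is not None
```

Since the target must already exist when a program is created, and a nested program can never become a target, no attachment cycle can be built in this model.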

[1]: https://lore.kernel.org/bpf/20191108064039.2041889-16-ast@kernel.org/

Acked-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Link: https://lore.kernel.org/r/20240103190559.14750-2-9erthalion6@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-04 20:31:34 -08:00
Jakub Kicinski e63c1822ac Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  e009b2efb7 ("bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()")
  0f2b214779 ("bnxt_en: Fix compile error without CONFIG_RFS_ACCEL")
https://lore.kernel.org/all/20240105115509.225aa8a2@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-04 18:06:46 -08:00
Yonghong Song 9ddf872b47 bpf: Remove unnecessary cpu == 0 check in memalloc
After merging the patch set [1] to reduce memory usage
for bpf_global_percpu_ma, Alexei found a redundant check (cpu == 0)
in function bpf_mem_alloc_percpu_unit_init() ([2]).
Indeed, the check is unnecessary since c->unit_size will
be all NULL or all non-NULL for all cpus before
for_each_possible_cpu() loop.
Removing the check makes code less confusing.

  [1] https://lore.kernel.org/all/20231222031729.1287957-1-yonghong.song@linux.dev/
  [2] https://lore.kernel.org/all/20231222031745.1289082-1-yonghong.song@linux.dev/

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240104165744.702239-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-04 10:18:14 -08:00
Greg Kroah-Hartman 86438841e4 dma-debug: make dma_debug_add_bus take a const pointer
The driver core now can handle a const struct bus_type pointer, and the
dma_debug_add_bus() call just passes on the pointer given to it to the
driver core, so make this pointer const as well to allow everyone to use
read-only struct bus_type pointers going forward.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc:  <iommu@lists.linux.dev>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/2023121941-dejected-nugget-681e@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-04 14:38:29 +01:00
Yonghong Song 5c1a376532 bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
For percpu data structure allocation with bpf_global_percpu_ma,
the maximum data size is 4K. But for a system with large
number of cpus, bigger data size (e.g., 2K, 4K) might consume
a lot of memory. For example, the percpu memory consumption
with unit size 2K and 1024 cpus will be 2K * 1K * 1K = 2GB
memory.

We should discourage such usage. Let us limit the maximum data
size to be 512 for bpf_global_percpu_ma allocation.
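
The arithmetic above can be checked with a short Python sketch. The function name is hypothetical, and the model assumes the worst case in which every cpu's cache holds one percpu object of the given unit size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def percpu_worst_case(unit_size: int, nr_cpus: int) -> int:
    """Worst-case bytes when each cpu's cache holds one percpu object."""
    # One percpu object costs unit_size bytes on every cpu, and each of
    # the nr_cpus caches may hold such an object.
    return unit_size * nr_cpus * nr_cpus


before = percpu_worst_case(2 * KIB, 1024)   # the 2K example from the text
after = percpu_worst_case(512, 1024)        # the same machine at the new limit
```

Under these assumptions the 2K case costs 2GB, while the 512-byte limit caps the same scenario at 512MB.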

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031801.1290841-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:26 -08:00
Yonghong Song 0e2ba9f96f bpf: Use smaller low/high marks for percpu allocation
Currently, refill low/high marks are set with the assumption
of normal non-percpu memory allocation. For example, for
an allocation size 256, for non-percpu memory allocation,
low mark is 32 and high mark is 96, resulting in the
batch allocation of 48 elements and the allocated memory
will be 48 * 256 = 12KB for this particular cpu.
Assuming a 128-cpu system, the total memory consumption
across all cpus will be 12K * 128 = 1.5MB memory.

This might be okay for non-percpu allocation, but may not be
good for percpu allocation, which will consume 1.5MB * 128 = 192MB
memory in the worst case if every cpu has a chance of memory
allocation.

In practice, percpu allocation is very rare compared to
non-percpu allocation. So let us have smaller low/high marks
which can avoid unnecessary memory consumption.
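
The figures quoted above can be reproduced with a few lines of Python. The function name and parameters are hypothetical; the batch of 48 elements is taken from the text as given rather than derived from the low/high marks.

```python
def refill_footprint(batch: int, size: int, nr_cpus: int, percpu: bool) -> int:
    """Bytes consumed if every cpu's cache performs one batch refill."""
    per_cache = batch * size        # one batch refill on one cpu
    total = per_cache * nr_cpus     # every cpu refills its own cache
    if percpu:
        # Each percpu element has a copy on every cpu.
        total *= nr_cpus
    return total


non_percpu = refill_footprint(48, 256, 128, percpu=False)
worst_percpu = refill_footprint(48, 256, 128, percpu=True)
```

The extra factor of nr_cpus in the percpu case is what turns 1.5MB into 192MB.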

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031755.1289671-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Yonghong Song 5b95e638f1 bpf: Refill only one percpu element in memalloc
Typically for percpu map element or data structure, once allocated,
most operations are lookups or in-place updates. Deletions are really
rare. Currently, for percpu data structures, 4 elements will be
refilled if the size is <= 256. Let us just do with one element
for percpu data. For example, for size 256 and 128 cpus, the
potential saving will be 3 * 256 * 128 * 128 = 12MB.
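
As a quick check of the saving, here is a minimal Python sketch. The helper name is hypothetical, and it assumes each refilled percpu element costs its size on every cpu, for every cpu's cache.

```python
def prefill_bytes(elements: int, size: int, nr_cpus: int) -> int:
    """Bytes used when each cpu's cache is refilled with 'elements' objects."""
    return elements * size * nr_cpus * nr_cpus


# Refilling 4 elements versus 1 element for size 256 on 128 cpus.
saving = prefill_bytes(4, 256, 128) - prefill_bytes(1, 256, 128)
```

The difference is the three elements that are no longer refilled, which matches the 12MB quoted above.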

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031750.1289290-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Yonghong Song c39aa3b289 bpf: Allow per unit prefill for non-fix-size percpu memory allocator
Commit 41a5db8d81 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation.
Such allocation will allocate percpu memory for all buckets on all
cpus, so the memory consumption grows quadratically with the number
of cpus. For example, with 4 cpus and a unit size of 16 bytes, each
cpu has 16 * 4 = 64 bytes, and with 4 cpus the total is 64 * 4 = 256 bytes.
With 8 cpus and the same unit size, each cpu has 16 * 8 = 128 bytes,
and with 8 cpus the total is 128 * 8 = 1024 bytes.
So if the number of cpus doubles, the memory consumption
quadruples. For a system with a large number of cpus, the
memory consumption therefore goes up quickly.
For example, for a 4KB percpu allocation and 128 cpus, the total memory
consumption will be 4KB * 128 * 128 = 64MB. Things become
worse if the number of cpus is bigger (e.g., 512, 1024, etc.).

In Commit 41a5db8d81, the non-fix-size percpu memory allocation is
done at boot time, so for a system with a large number of cpus, the initial
percpu memory consumption is very visible. For example, for 128 cpu
system, the total percpu memory allocation will be at least
(16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
  * 128 * 128 = ~138MB.
which is pretty big. It will be even bigger for larger number of cpus.

Note that the current prefill also allocates 4 entries if the unit size
is less than 256. So on top of 138MB memory consumption, this will
add more consumption with
3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB.
Next patch will try to reduce this memory consumption.

Later on, Commit 1fda5bb66a ("bpf: Do not allocate percpu memory
at init stage") moved the non-fix-size percpu memory allocation
to the bpf verification stage. Once a particular bpf_percpu_obj_new()
is called by bpf program, the memory allocator will try to fill in
the cache with all sizes, causing the same amount of percpu memory
consumption as in the boot stage.

To reduce the initial percpu memory consumption for non-fix-size
percpu memory allocation, instead of filling the cache with all
supported allocation sizes, this patch intends to fill the cache
only for the requested size. As users typically will not use large
percpu data structures, this can save memory significantly.
For example, the allocation size is 64 bytes with 128 cpus.
Then total percpu memory amount will be 64 * 128 * 128 = 1MB,
much less than previous 138MB.
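
The comparison can be sketched in Python. The bucket list is copied as written in this message; the function names are hypothetical, and the model only counts one element per bucket per cpu.

```python
# Bucket sizes exactly as listed in the commit message above.
BUCKETS = [16, 32, 64, 96, 128, 196, 256, 512, 1024, 2048, 4096]


def prefill_all(nr_cpus: int) -> int:
    """Bytes used when one element of every bucket is prefilled on every cpu."""
    return sum(BUCKETS) * nr_cpus * nr_cpus


def prefill_one(size: int, nr_cpus: int) -> int:
    """Bytes used when only the requested size is prefilled."""
    return size * nr_cpus * nr_cpus


all_sizes = prefill_all(128)       # about 138MB (decimal)
only_64 = prefill_one(64, 128)     # 1MB
```

The quadratic growth also shows up directly: prefill_one(16, 8) is four times prefill_one(16, 4).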

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031745.1289082-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Yonghong Song 9fc8e80204 bpf: Add objcg to bpf_mem_alloc
The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's
share the same objcg. This patch makes such a property explicit.
The next patch will use this property to save and restore objcg
for percpu unit allocator.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031739.1288590-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Yonghong Song 9beda16c25 bpf: Avoid unnecessary extra percpu memory allocation
Currently, for percpu memory allocation, say if the user
requests allocation size to be 32 bytes, the actually
calculated size will be 40 bytes and it further rounds
to 64 bytes, and eventually 64 bytes are allocated,
wasting 32-byte memory.

Change bpf_mem_alloc() to calculate the cache index
based on the user-provided allocation size so unnecessary
extra memory can be avoided.
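
The effect can be illustrated with a small Python model. The bucket sizes are illustrative, and HEADER is a hypothetical 8-byte per-object overhead inferred from the 32 to 40 byte example above; neither is taken from the kernel source.

```python
import bisect

BUCKETS = [16, 32, 64, 128, 256, 512]   # illustrative bucket sizes (assumption)
HEADER = 8                              # hypothetical overhead: 40 - 32 bytes


def bucket_for(size: int) -> int:
    """Return the smallest bucket that can hold 'size' bytes."""
    i = bisect.bisect_left(BUCKETS, size)
    if i == len(BUCKETS):
        raise ValueError("size exceeds the largest bucket")
    return BUCKETS[i]


def old_choice(user_size: int) -> int:
    # Before: the overhead was added first, pushing 32 into the 64 bucket.
    return bucket_for(user_size + HEADER)


def new_choice(user_size: int) -> int:
    # After: the bucket is chosen from the user-provided size alone.
    return bucket_for(user_size)
```

For a 32-byte request the old path picks the 64-byte bucket, while the new path picks the 32-byte bucket.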

Suggested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031734.1288400-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Andrei Matei 8a021e7fa1 bpf: Simplify checking size of helper accesses
This patch simplifies the verification of size arguments associated to
pointer arguments to helpers and kfuncs. Many helpers take a pointer
argument followed by the size of the memory access to be
performed through that pointer. Before this patch, the handling of the
size argument in check_mem_size_reg() was confusing and wasteful: if the
size register's lower bound was 0, then the verification was done twice:
once considering the size of the access to be the lower-bound of the
respective argument, and once considering the upper bound (even if the
two are the same). The upper bound checking is a super-set of the
lower-bound checking(*), except: the only point of the lower-bound check
is to handle the case where zero-sized-accesses are explicitly not
allowed and the lower-bound is zero. This static condition is now
checked explicitly, replacing a much more complex, expensive and
confusing verification call to check_helper_mem_access().

Error messages change in this patch. Before, messages about illegal
zero-size accesses depended on the type of the pointer and on other
conditions, and sometimes the message was plain wrong: in some tests
that changed you'll see that the old message was something like "R1 min
value is outside of the allowed memory range", where R1 is the pointer
register; the error was wrongly claiming that the pointer was bad
instead of the size being bad. Other times the information that the size
came from a register with a possible range of values was wrong, and the
error presented the size as a fixed zero. Now the errors refer to the
right register. However, the old error messages did contain useful
information about the pointer register which is now lost; recovering
this information was deemed not important enough.

(*) Besides standing to reason that the checks for a bigger size access
are a super-set of the checks for a smaller size access, I have also
mechanically verified this by reading the code for all types of
pointers. I could convince myself that it's true for all but
PTR_TO_BTF_ID (check_ptr_to_btf_access). There, simply looking
line-by-line does not immediately prove what we want. If anyone has any
qualms, let me know.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231221232225.568730-2-andreimatei1@gmail.com
2024-01-03 10:37:56 -08:00
Rafael J. Wysocki 7d4b5d7a37 async: Introduce async_schedule_dev_nocall()
In preparation for subsequent changes, introduce a specialized variant
of async_schedule_dev() that will not invoke the argument function
synchronously when it cannot be scheduled for asynchronous execution.

The new function, async_schedule_dev_nocall(), will be used for fixing
possible deadlocks in the system-wide power management core code.
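
The difference between the two variants can be sketched with a toy Python model. The AsyncPool class, its capacity limit and the boolean return value are assumptions made for illustration; they are not the kernel's async API.

```python
from collections import deque


class AsyncPool:
    """Toy model of an async queue with limited capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()

    def schedule(self, func, arg) -> None:
        """Existing behaviour: run synchronously if it cannot be queued."""
        if len(self.queue) < self.capacity:
            self.queue.append((func, arg))
        else:
            func(arg)

    def schedule_nocall(self, func, arg) -> bool:
        """New variant: never call func directly; report failure instead."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append((func, arg))
        return True

    def run_pending(self) -> None:
        """Execute everything that was queued, in order."""
        while self.queue:
            func, arg = self.queue.popleft()
            func(arg)
```

With the "nocall" variant the caller decides what to do when scheduling fails, which matters if running the function in the caller's context could deadlock.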

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> for the series.
Tested-by: Youngmin Nam <youngmin.nam@samsung.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-01-03 11:31:12 +01:00
Rafael J. Wysocki 6aa09a5bcc async: Split async_schedule_node_domain()
In preparation for subsequent changes, split async_schedule_node_domain()
in two pieces so as to allow the bottom part of it to be called from a
somewhat different code path.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Tested-by: Youngmin Nam <youngmin.nam@samsung.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-01-03 11:29:58 +01:00
Joerg Roedel 75f74f85a4 Merge branches 'apple/dart', 'arm/rockchip', 'arm/smmu', 'virtio', 'x86/vt-d', 'x86/amd' and 'core' into next 2024-01-03 09:59:32 +01:00
Ingo Molnar 67a1723344 Linux 6.7-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmWR1E0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGYb8H/j5KAXUH3rvY3I7z
 pLZkdQiFX/ix2p25VYCNEes/h9gRLNTX0ndr+qd/LSjunaF17x+j0odHpjaK7dV0
 YwSiP+ffNY8zDjHJP3mAbNfyov37kZjrtjciEs9/Ldk0w9Swp91iJZC6mRsJ6LbV
 kneZ/5nQIfZn0KFgqNCrrfCvUf3TcRskNFDvguDLCXf+lg5nJmhJnbGRExDqe5vs
 VIbfI9nnIjZd0Pt5rsXSQfwk0RaK/85unz6Dfshny9vAHnfTC5tX1B0sAqga+Zji
 nQtHQ/UT/gLSY0b/PoRXjA4wenv5UdTwKs7XaKaGKpyaDlD35CxiWXi8xhdgw+qG
 X4LFgDY=
 =N/Fv
 -----END PGP SIGNATURE-----

Merge tag 'v6.7-rc8' into locking/core, to pick up dependent changes

Pick up these commits from Linus's tree:

  b106bcf0f9 ("locking/osq_lock: Clarify osq_wait_next()")
  563adbfc35 ("locking/osq_lock: Clarify osq_wait_next() calling convention")
  7c22309821 ("locking/osq_lock: Move the definition of optimistic_spin_node into osq_lock.c")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-01-02 10:41:38 +01:00
Fabio Estevam 79fa723ba8 reboot: Introduce thermal_zone_device_critical_reboot()
Introduce thermal_zone_device_critical_reboot() to trigger an
emergency reboot.

It is a counterpart of thermal_zone_device_critical() with the
difference that it will force a reboot instead of shutdown.

The motivation for doing this is to allow the thermal subsystem
to trigger a reboot when the temperature reaches the critical
temperature.

Signed-off-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231129124330.519423-3-festevam@gmail.com
2024-01-02 09:33:18 +01:00
Fabio Estevam 5a0e241003 thermal/core: Prepare for introduction of thermal reboot
Add some helper functions to make it easier to introduce support
for thermal reboot.

No functional change.

Signed-off-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231129124330.519423-2-festevam@gmail.com
2024-01-02 09:33:18 +01:00
David S. Miller 240436c06c bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZYVEqQAKCRDbK58LschI
 gzH6AP9hVXLpHFTWMT0+2GK2lx69VX8zW1C0SmN7WHaxUbPN9QEAwzGnELfKk00P
 0IKRHSl5abhVMX7JOM3sSOhCILeKjQg=
 =wRLJ
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next-for-netdev
The following pull-request contains BPF updates for your *net-next* tree.

We've added 22 non-merge commits during the last 3 day(s) which contain
a total of 23 files changed, 652 insertions(+), 431 deletions(-).

The main changes are:

1) Add verifier support for annotating user's global BPF subprogram arguments
   with few commonly requested annotations for a better developer experience,
   from Andrii Nakryiko.

   These tags are:
     - Ability to annotate a special PTR_TO_CTX argument
     - Ability to annotate a generic PTR_TO_MEM as non-NULL

2) Support BPF verifier tracking of BPF_JNE which helps cases when the compiler
   transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like, from
   Menglong Dong.

3) Fix a warning in bpf_mem_cache's check_obj_size() as reported by LKP, from Hou Tao.

4) Re-support uid/gid options when mounting bpffs which had to be reverted with
   the prior token series revert to avoid conflicts, from Daniel Borkmann.

5) Fix a libbpf NULL pointer dereference in bpf_object__collect_prog_relos() found
   from fuzzing the library with malformed ELF files, from Mingyi Zhang.

6) Skip DWARF sections in libbpf's linker sanity check given compiler options to
   generate compressed debug sections can trigger a rejection due to misalignment,
   from Alyssa Ross.

7) Fix an unnecessary use of the comma operator in BPF verifier, from Simon Horman.

8) Fix format specifier for unsigned long values in cpustat sample, from Colin Ian King.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-01 14:45:21 +00:00
Linus Torvalds 453f5db061 tracing fixes for v6.7-rc7:
- Fix readers that are blocked on the ring buffer when buffer_percent is
   100%. They are supposed to wake up when the buffer is full, but
   because the sub-buffer that the writer is on is never considered
   "dirty" in the calculation, dirty pages will never equal nr_pages.
   Add +1 to the dirty count in order to account for the sub-buffer that
   the writer is on.
 
 - When a reader is blocked on the "snapshot_raw" file, it is to be
   woken up when a snapshot is done and be able to read the snapshot
   buffer. But because the snapshot swaps the buffers (the main one
   with the snapshot one), and the snapshot reader is waiting on the
   old snapshot buffer, it was not woken up (because it is now on
   the main buffer after the swap). Worse yet, when it reads the buffer
   after a snapshot, it's not reading the snapshot buffer, it's reading
   the live active main buffer.
 
   Fix this by forcing a wakeup of all readers on the snapshot buffer when
   a new snapshot happens, and then update the buffer that the reader
   is reading to be back on the snapshot buffer.
 
 - Fix the modification of the direct_function hash. There was a race
   when new functions were added to the direct_function hash as when
   it moved function entries from the old hash to the new one, a direct
   function trace could be hit and not see its entry.
 
   This is fixed by allocating the new hash, copying all the old entries
   onto it as well as the new entries, and then using rcu_assign_pointer()
   to update the new direct_function hash with it.
 
   This also fixes a memory leak in that code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZAzTxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qs9IAP9e6wZ74aEjMED9nsbC49EpyCNTqa72
 y0uDS/p9ppv52gD7Be+l+kJQzYNh6bZU0+B19hNC2QVn38jb7sOadfO/1Q8=
 =NDkf
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix readers that are blocked on the ring buffer when buffer_percent
   is 100%. They are supposed to wake up when the buffer is full, but
   because the sub-buffer that the writer is on is never considered
   "dirty" in the calculation, dirty pages will never equal nr_pages.
   Add +1 to the dirty count in order to account for the sub-buffer that
   the writer is on.

 - When a reader is blocked on the "snapshot_raw" file, it is to be
   woken up when a snapshot is done and be able to read the snapshot
   buffer. But because the snapshot swaps the buffers (the main one with
   the snapshot one), and the snapshot reader is waiting on the old
   snapshot buffer, it was not woken up (because it is now on the main
   buffer after the swap). Worse yet, when it reads the buffer after a
   snapshot, it's not reading the snapshot buffer, it's reading the live
   active main buffer.

   Fix this by forcing a wakeup of all readers on the snapshot buffer
   when a new snapshot happens, and then update the buffer that the
   reader is reading to be back on the snapshot buffer.

 - Fix the modification of the direct_function hash. There was a race
   when new functions were added to the direct_function hash as when it
   moved function entries from the old hash to the new one, a direct
   function trace could be hit and not see its entry.

   This is fixed by allocating the new hash, copying all the old entries
   onto it as well as the new entries, and then using rcu_assign_pointer()
   to update the new direct_function hash with it.

   This also fixes a memory leak in that code.

 - Fix eventfs ownership

* tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Fix modification of direct_function hash while in use
  tracing: Fix blocked reader of snapshot buffer
  ring-buffer: Fix wake ups when buffer_percent is set to 100
  eventfs: Fix file and directory uid and gid ownership
2023-12-30 11:37:35 -08:00
David Laight b106bcf0f9 locking/osq_lock: Clarify osq_wait_next()
Directly return NULL or 'next' instead of breaking out of the loop.

Signed-off-by: David Laight <david.laight@aculab.com>
[ Split original patch into two independent parts  - Linus ]
Link: https://lore.kernel.org/lkml/7c8828aec72e42eeb841ca0ee3397e9a@AcuMS.aculab.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-30 10:25:51 -08:00
David Laight 563adbfc35 locking/osq_lock: Clarify osq_wait_next() calling convention
osq_wait_next() is passed 'prev' from osq_lock() and NULL from
osq_unlock() but only needs the 'cpu' value to write to lock->tail.

Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.

Should have no effect on the generated code since gcc manages to assume
that 'prev != NULL' due to an earlier dereference.

Signed-off-by: David Laight <david.laight@aculab.com>
[ Changed 'old' to 'old_cpu' by request from Waiman Long  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-30 10:25:51 -08:00
David Laight 7c22309821 locking/osq_lock: Move the definition of optimistic_spin_node into osq_lock.c
struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.

Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-30 10:25:51 -08:00
Steven Rostedt (Google) d05cb47066 ftrace: Fix modification of direct_function hash while in use
Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
if the number of new entries added is large enough to cause two
allocations in the loop:

        for (i = 0; i < size; i++) {
                hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
                        new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
                        if (!new)
                                goto out_remove;
                        entry->direct = addr;
                }
        }

Where ftrace_add_rec_direct() has:

        if (ftrace_hash_empty(direct_functions) ||
            direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
                struct ftrace_hash *new_hash;
                int size = ftrace_hash_empty(direct_functions) ? 0 :
                        direct_functions->count + 1;

                if (size < 32)
                        size = 32;

                new_hash = dup_hash(direct_functions, size);
                if (!new_hash)
                        return NULL;

                *free_hash = direct_functions;
                direct_functions = new_hash;
        }

The "*free_hash = direct_functions;" can happen twice, losing the previous
allocation of direct_functions.

But this also exposed a more serious bug.

The modification of direct_functions above is not safe. As
direct_functions can be referenced at any time to find what direct caller
it should call, the time between:

                new_hash = dup_hash(direct_functions, size);
 and
                direct_functions = new_hash;

can have a race with another CPU (or even this one if it gets interrupted),
and the entries being moved to the new hash are not referenced.

That's because the "dup_hash()" is really misnamed and is really a
"move_hash()". It moves the entries from the old hash to the new one.

Now even if that was changed, this code is not proper as direct_functions
should not be updated until the end. That is the best way to handle
function reference changes, and is the way other parts of ftrace handle
this.

The following is done:

 1. Change add_hash_entry() to return the entry it created and inserted
    into the hash, and not just return success or not.

 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
    the former.

 3. Allocate a "new_hash" at the start that is made for holding both the
    new hash entries as well as the existing entries in direct_functions.

 4. Copy (not move) the direct_function entries over to the new_hash.

 5. Copy the entries of the added hash to the new_hash.

 6. If everything succeeds, then use rcu_assign_pointer() to update the
    direct_functions with the new_hash.

This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.
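
The copy-then-publish pattern from the steps above can be sketched in
user space as follows. This is only an illustration under stated
assumptions: struct table, live_table, table_contains() and table_add()
are hypothetical names, C11 release/acquire atomics stand in for
rcu_assign_pointer() and the RCU read side, and a flat array stands in
for the ftrace hash.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the direct_functions hash: a flat array. */
struct table {
	size_t count;
	unsigned long ips[];
};

/* Readers only ever see NULL or a fully built table through this pointer. */
static _Atomic(struct table *) live_table;

/* Reader side: search whichever table is currently published. */
static int table_contains(unsigned long ip)
{
	struct table *t = atomic_load_explicit(&live_table, memory_order_acquire);
	size_t i;

	if (!t)
		return 0;
	for (i = 0; i < t->count; i++)
		if (t->ips[i] == ip)
			return 1;
	return 0;
}

/*
 * Writer side: build a fresh table holding the old entries (copied, not
 * moved) plus the new ones, and publish it only once it is complete.
 * On success the previous table is handed back through old_out so the
 * caller can free it once no reader can still be using it. On failure
 * the live table is left untouched.
 */
static int table_add(const unsigned long *new_ips, size_t n, struct table **old_out)
{
	struct table *old = atomic_load_explicit(&live_table, memory_order_relaxed);
	size_t old_count = old ? old->count : 0;
	struct table *fresh;

	fresh = malloc(sizeof(*fresh) + (old_count + n) * sizeof(fresh->ips[0]));
	if (!fresh)
		return -1;

	if (old_count)
		memcpy(fresh->ips, old->ips, old_count * sizeof(fresh->ips[0]));
	if (n)
		memcpy(fresh->ips + old_count, new_ips, n * sizeof(fresh->ips[0]));
	fresh->count = old_count + n;

	/* Single publication point, playing the role of rcu_assign_pointer(). */
	atomic_store_explicit(&live_table, fresh, memory_order_release);
	*old_out = old;
	return 0;
}
```

Because the old table is never modified, a lookup that races with
table_add() finds every pre-existing entry in either the old or the new
table, and there is exactly one old table to free per update.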

Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-30 10:07:27 -05:00
Douglas Anderson 55efe4abf9 watchdog: if panicking and we dumped everything, don't re-enable dumping
If, as part of handling a hardlockup or softlockup, we've already dumped
all CPUs and we're just about to panic, don't reenable dumping and give
some other CPU a chance to hop in there and add some confusing logs right
as the panic is happening.

Link: https://lkml.kernel.org/r/20231220131534.4.Id3a9c7ec2d7d83e4080da6f8662ba2226b40543f@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:30 -08:00
Douglas Anderson ee6bdb3f4b watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
If two CPUs end up reporting a hardlockup at the same time then their logs
could get interleaved which is hard to read.

The interleaving problem was especially bad with the "perf" hardlockup
detector where the locked up CPU is always the same as the running CPU and
we end up in show_regs().  show_regs() has no inherent serialization so we
could mix together two crawls if two hardlockups happened at the same time
(and if we didn't have `sysctl_hardlockup_all_cpu_backtrace` set).  With
this change we'll fully serialize hardlockups when using the "perf"
hardlockup detector.

The interleaving problem was less bad with the "buddy" hardlockup
detector.  With "buddy" we always end up calling
`trigger_single_cpu_backtrace(cpu)` on some CPU other than the running
one.  trigger_single_cpu_backtrace() always at least serializes the
individual stack crawls because it eventually uses
printk_cpu_sync_get_irqsave().  Unfortunately the fact that
trigger_single_cpu_backtrace() eventually calls
printk_cpu_sync_get_irqsave() (on a different CPU) means that we have to
drop the "lock" before calling it and we can't fully serialize all
printouts associated with a given hardlockup.  However, we still do get
the advantage of serializing the output of print_modules() and
print_irqtrace_events().

Aside from serializing hardlockups from each other, this change also has
the advantage of serializing hardlockups and softlockups from each other
if they happen to occur at the same time since they are both using the
same "lock".

Even though nobody is expected to hang while holding the lock associated
with printk_cpu_sync_get_irqsave(), out of an abundance of caution, we
don't call printk_cpu_sync_get_irqsave() until after we print out about
the hardlockup.  This makes extra sure that, even if
printk_cpu_sync_get_irqsave() somehow never runs we at least print that we
saw the hardlockup.  This is different than the choice made for softlockup
because hardlockup is really our last resort.

Link: https://lkml.kernel.org/r/20231220131534.3.I6ff691b3b40f0379bc860f80c6e729a0485b5247@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:30 -08:00
Douglas Anderson 896260a6d6 watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
Instead of introducing a spinlock, use printk_cpu_sync_get_irqsave() and
printk_cpu_sync_put_irqrestore() to serialize softlockup reporting.  Alone
this doesn't have any real advantage over the spinlock, but this will
allow us to use the same function in a future change to also serialize
hardlockup crawls.

NOTE: for the most part this serialization is important because we often
end up in the show_regs() path and that has no built-in serialization if
there are multiple callers at once.  However, even in the case where we
end up in the dump_stack() path this still has some advantages because the
stack will be guaranteed to be together in the logs with the lockup
message with no interleaving.

NOTE: the fact that printk_cpu_sync_get_irqsave() is allowed to be called
multiple times on the same CPU is important here.  Specifically we hold
the "lock" while calling dump_stack() which also gets the same "lock". 
This is explicitly documented to be OK and means we don't need to
introduce a variant of dump_stack() that doesn't grab the lock.
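
The re-entrancy described in the second NOTE can be illustrated with a
small user-space sketch of an owner-plus-nesting "lock". It is only
loosely modeled on the behaviour described above and is not the kernel
implementation: cpu_sync_try_get() and cpu_sync_put() are hypothetical
names, the CPU number is passed in explicitly, and interrupt
save/restore is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>

/* -1 means nobody holds the "lock"; otherwise the owning CPU number. */
static atomic_int sync_owner = -1;
/* Nesting depth; only ever touched by the current owner. */
static int sync_nested;

/* Returns 1 if @cpu now holds the lock (fresh or nested), 0 otherwise. */
static int cpu_sync_try_get(int cpu)
{
	int expected = -1;

	if (atomic_compare_exchange_strong(&sync_owner, &expected, cpu))
		return 1;		/* took the free lock */
	if (expected == cpu) {
		sync_nested++;		/* same CPU again: just nest */
		return 1;
	}
	return 0;			/* held by another CPU */
}

/* Undo one successful cpu_sync_try_get() by @cpu. */
static void cpu_sync_put(int cpu)
{
	assert(atomic_load(&sync_owner) == cpu);
	if (sync_nested) {
		sync_nested--;
		return;
	}
	atomic_store(&sync_owner, -1);
}
```

With this shape, a reporting path that already holds the lock can call a
function that takes it again on the same CPU without deadlocking, while
other CPUs stay excluded until the outermost put.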

Link: https://lkml.kernel.org/r/20231220131534.2.Ia5906525d440d8e8383cde31b7c61c2aadc8f907@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Li Zhe <lizhe.67@bytedance.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:30 -08:00
Douglas Anderson 6dcde5d5f2 watchdog/hardlockup: adopt softlockup logic avoiding double-dumps
Patch series "watchdog: Better handling of concurrent lockups".

When we get multiple lockups at roughly the same time, the output in the
kernel logs can be very confusing since the reports about the lockups end
up interleaved in the logs.  There is some code in the kernel to try to
handle this but it wasn't that complete.

Li Zhe recently made this a bit better for softlockups (specifically for
the case where `kernel.softlockup_all_cpu_backtrace` is not set) in commit
9d02330abd ("softlockup: serialized softlockup's log"), but that only
handled softlockup reports.  Hardlockup reports still had similar issues.

This series also has a small fix to avoid dumping all stacks a second time
in the case of a panic.  This is a bit unrelated to the interleaving fixes
but it does also improve the clarity of lockup reports.


This patch (of 4):

The hardlockup detector and softlockup detector both have the ability to
dump the stack of all CPUs (`kernel.hardlockup_all_cpu_backtrace` and
`kernel.softlockup_all_cpu_backtrace`).  Both detectors also have some
logic to attempt to avoid interleaving printouts if two CPUs were trying
to do dumps of all CPUs at the same time.  However:

- The hardlockup detector's logic still allowed interleaving some
  information. Specifically another CPU could print modules and dump
  the stack of the locked CPU at the same time we were dumping all
  CPUs.

- In the case where `kernel.hardlockup_panic` was set in addition to
  `kernel.hardlockup_all_cpu_backtrace`, when two CPUs both detected
  hardlockups at the same time the second CPU could call panic() while
  the first was still dumping stacks. This was especially bad if the
  locked up CPU wasn't responding to the request for a backtrace since
  the function nmi_trigger_cpumask_backtrace() can wait up to 10
  seconds.

Let's resolve this by adopting the softlockup logic in the hardlockup
handler.

NOTES:

- As part of this, one might think that we should make a helper
  function that both the hard and softlockup detectors call. This
  turns out not to be super trivial since it would have to be
  parameterized quite a bit since there are separate global variables
  controlling each lockup detector and they print log messages that
  are just different enough that it would be a pain. We probably don't
  want to change the messages that are printed without good reason to
  avoid throwing log parsers for a loop.

- One might also think that it would be a good idea to have the
  hardlockup and softlockup detector use the same global variable to
  prevent interleaving. This would make sure that softlockups and
  hardlockups can't interleave each other. That _almost_ works but has
  a dangerous flaw if `kernel.hardlockup_panic` is not the same as
  `kernel.softlockup_panic` because we might skip a call to panic() if
  one type of lockup was detected at the same time as another.

Link: https://lkml.kernel.org/r/20231220211640.2023645-1-dianders@chromium.org
Link: https://lkml.kernel.org/r/20231220131534.1.I4f35a69fbb124b5f0c71f75c631e11fabbe188ff@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:30 -08:00
Yuntao Wang 2861b37732 kexec_core: fix the assignment to kimage->control_page
image->control_page represents the starting address for allocating the
next control page, while hole_end represents the address of the last valid
byte of the currently allocated control page.

This bug actually does not affect the correctness of allocating control
pages, because image->control_page is currently only used in
kimage_alloc_crash_control_pages(), and this function, when allocating
control pages, will first align image->control_page up to the nearest
`(1 << order) << PAGE_SHIFT` boundary, then use this value as the
starting address of the next control page.  This ensures that the newly
allocated control page will use the correct starting address and not
overlap with previously allocated control pages.

Although it does not affect the correctness of the final result, it is
better for us to set image->control_page to the correct value, in case
it might be used elsewhere in the future, potentially causing errors.

Therefore, after successfully allocating a control page,
image->control_page should be updated to `hole_end + 1`, rather than
hole_end.
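
A small sketch of why the old value was harmless in practice, assuming
the align-up step described above: because hole_end is the last byte of
a block that starts on a size-aligned boundary, rounding up either
hole_end or hole_end + 1 gives the same next start address. align_up()
is a hypothetical stand-in for the kernel's alignment macro.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of size; size must be a power of two. */
static uint64_t align_up(uint64_t addr, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (addr + size - 1) & ~(size - 1);
}
```

For a control page of 0x2000 bytes at 0x2000..0x3fff, both
align_up(0x3fff, 0x2000) and align_up(0x4000, 0x2000) yield 0x4000, so
only a future user that skips the alignment step would see a difference.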

Link: https://lkml.kernel.org/r/20231221042308.11076-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:29 -08:00
Randy Dunlap 5f981878c7 stacktrace: fix kernel-doc typo
Change @task to @tsk to prevent kernel-doc warnings:

kernel/stacktrace.c:138: warning: Excess function parameter 'task' description in 'stack_trace_save_tsk'
kernel/stacktrace.c:138: warning: Function parameter or member 'tsk' not described in 'stack_trace_save_tsk'

Link: https://lkml.kernel.org/r/20231220054945.17663-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:29 -08:00
Ahelenia Ziemiańska d391615618 kernel: relay: remove relay_file_splice_read dead code, doesn't work
Documentation/filesystems/relay.rst says to use
	return debugfs_create_file(filename, mode, parent, buf,
	                           &relay_file_operations);
and this is the only way relay_file_operations is used.

Thus: debugfs_create_file(&relay_file_operations)
   -> __debugfs_create_file(&debugfs_full_proxy_file_operations,
                            &relay_file_operations)
   -> dentry{inode: {i_fop: &debugfs_full_proxy_file_operations},
             d_fsdata: &relay_file_operations
                       | DEBUGFS_FSDATA_IS_REAL_FOPS_BIT}

debugfs_full_proxy_file_operations.open is full_proxy_open, which extracts
the &relay_file_operations from the dentry, and allocates via
__full_proxy_fops_init() new fops, with trivial wrappers around release,
llseek, read, write, poll, and unlocked_ioctl, then replaces the fops on
the opened file therewith.

Naturally, all thusly-created debugfs files have .splice_read = NULL. 
This was introduced in commit 49d200deaa ("debugfs: prevent access to
removed files' private data") from 2016-03-22.

AFAICT, relay_file_operations is the only struct file_operations used for
debugfs which defines a .splice_read callback.  Hooking it up with

>	diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
>	index 5063434be0fc..952fcf5b2afa 100644
>	--- a/fs/debugfs/file.c
>	+++ b/fs/debugfs/file.c
>	@@ -328,6 +328,11 @@ FULL_PROXY_FUNC(write, ssize_t, filp,
>	 			loff_t *ppos),
>	 		ARGS(filp, buf, size, ppos));
>
>	+FULL_PROXY_FUNC(splice_read, long, in,
>	+		PROTO(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe,
>	+			size_t len, unsigned int flags),
>	+		ARGS(in, ppos, pipe, len, flags));
>	+
>	 FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
>	 		PROTO(struct file *filp, unsigned int cmd, unsigned long arg),
>	 		ARGS(filp, cmd, arg));
>	@@ -382,6 +387,8 @@ static void __full_proxy_fops_init(struct file_operations *proxy_fops,
>	 		proxy_fops->write = full_proxy_write;
>	 	if (real_fops->poll)
>	 		proxy_fops->poll = full_proxy_poll;
>	+	if (real_fops->splice_read)
>	+		proxy_fops->splice_read = full_proxy_splice_read;
>	 	if (real_fops->unlocked_ioctl)
>	 		proxy_fops->unlocked_ioctl = full_proxy_unlocked_ioctl;
>	 }

shows it just doesn't work, and splicing always instantly returns empty
(subsequent reads actually return the contents).

No-one noticed it became dead code in 2016, who knows if it worked back
then. Clearly no-one cares; just delete it.

Link: https://lkml.kernel.org/r/dtexwpw6zcdx7dkx3xj5gyjp5syxmyretdcbcdtvrnukd4vvuh@tarta.nabijaczleweli.xyz
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Li kunyu <kunyu@nfschina.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zhang Zhengming <zhang.zhengming@h3c.com>
Cc: Zhao Lei <zhao_lei1@hoperun.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:27 -08:00
Yuntao Wang 18d565ea95 kexec_file: fix incorrect temp_start value in locate_mem_hole_top_down()
temp_end represents the address of the last available byte.  Therefore,
the starting address of the memory segment with temp_end as its last
available byte and a size of `kbuf->memsz`, that is, the value of
temp_start, should be `temp_end - kbuf->memsz + 1` instead of `temp_end -
kbuf->memsz`.

Additionally, use the ALIGN_DOWN macro instead of open-coding it directly
in locate_mem_hole_top_down() to improve code readability.
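
The inclusive-end arithmetic can be checked with a short sketch.
hole_start_top_down() is a hypothetical helper written for this
illustration, not the kernel function; it assumes the alignment is a
power of two and that the region fits below temp_end.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Start of a region of memsz bytes whose last byte is temp_end,
 * rounded down to a multiple of align (a power of two).
 */
static uint64_t hole_start_top_down(uint64_t temp_end, uint64_t memsz,
				    uint64_t align)
{
	uint64_t start;

	assert(memsz > 0 && memsz - 1 <= temp_end);
	assert(align != 0 && (align & (align - 1)) == 0);

	start = temp_end - memsz + 1;	/* temp_end is inclusive */
	return start & ~(align - 1);
}
```

For temp_end = 0x2fff, memsz = 0x1000 and align = 0x1000 this gives
0x2000, which uses the topmost page exactly. Without the "+ 1" the
unaligned start would be 0x1fff and aligning down would land on 0x1000,
one page lower than necessary.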

Link: https://lkml.kernel.org/r/20231217033528.303333-3-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:25 -08:00
Yuntao Wang 816d334afa kexec: modify the meaning of the end parameter in kimage_is_destination_range()
The end parameter received by kimage_is_destination_range() should be the
last valid byte address of the target memory segment plus 1.  However, in
the locate_mem_hole_bottom_up() and locate_mem_hole_top_down() functions,
the corresponding value passed to kimage_is_destination_range() is the
last valid byte address of the target memory segment, which is 1 less.

There are two ways to fix this bug.  We can either correct the logic of
the locate_mem_hole_bottom_up() and locate_mem_hole_top_down() functions,
or we can fix kimage_is_destination_range() by making the end parameter
represent the last valid byte address of the target memory segment.  Here,
we choose the second approach.

Due to the modification to kimage_is_destination_range(), we also need to
adjust its callers, such as kimage_alloc_normal_control_pages() and
kimage_alloc_page().
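
The convention chosen here, where both ends of a range name its last
valid byte, leads to an overlap test of the following shape.
ranges_overlap_inclusive() is a hypothetical helper for illustration
only; the kernel function walks the image's segments rather than taking
a single range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [start, end] and [mstart, mend] share at least one byte.
 * All four bounds are inclusive, i.e. each end is the last valid byte.
 */
static bool ranges_overlap_inclusive(uint64_t start, uint64_t end,
				     uint64_t mstart, uint64_t mend)
{
	assert(start <= end && mstart <= mend);
	return end >= mstart && start <= mend;
}
```

Two adjacent pages, 0x1000..0x1fff and 0x2000..0x2fff, do not overlap
under this test. Passing "last byte plus 1" (0x2000) as the end of the
first range would wrongly report an overlap, which is the kind of
mismatch between caller and callee this change removes.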

Link: https://lkml.kernel.org/r/20231217033528.303333-2-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 12:22:25 -08:00
David Hildenbrand 5cc9695f06 kernel/events/uprobes: page_remove_rmap() -> folio_remove_rmap_pte()
Let's convert __replace_page().

Link: https://lkml.kernel.org/r/20231220224504.646757-25-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 11:58:53 -08:00
Matthew Wilcox (Oracle) 2853b66b60 mm: remove some calls to page_add_new_anon_rmap()
We already have the folio in these functions, we just need to use it. 
folio_add_new_anon_rmap() didn't exist at the time they were converted to
folios.

Link: https://lkml.kernel.org/r/20231211162214.2146080-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 11:58:25 -08:00
Steven Rostedt (Google) 39a7dc23a1 tracing: Fix blocked reader of snapshot buffer
If an application blocks on the snapshot or snapshot_raw files, expecting
to be woken up when a snapshot occurs, it will not happen. Or it may
happen with an unexpected result.

That result is that the application will be reading the main buffer
instead of the snapshot buffer. That is because when the snapshot occurs,
the main and snapshot buffers are swapped. But the reader has a descriptor
still pointing to the buffer that it originally connected to.

This is fine for the main buffer readers, as they may be blocked waiting
for a watermark to be hit, and when a snapshot occurs, the data that the
main readers want is now on the snapshot buffer.

But for waiters of the snapshot buffer, they are waiting for an event to
occur that will trigger the snapshot and they can then consume it quickly
to save the snapshot before the next snapshot occurs. But to do this, they
need to read the new snapshot buffer, not the old one that is now
receiving new data.

Also, it does not make sense to have a watermark "buffer_percent" on the
snapshot buffer, as the snapshot buffer is static and does not receive new
data except all at once.

Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: debdd57f51 ("tracing: Make a snapshot feature available from userspace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-29 09:18:49 -05:00
Steven Rostedt (Google) 623b1f896f ring-buffer: Fix wake ups when buffer_percent is set to 100
The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.

 0 - is to wait until any data is in the buffer
 1 - is to wait for 1% of the sub buffers to be filled
 50 - would be half of the sub buffers are filled with data
 100 - is not to wake the waiter until the ring buffer is completely full

Unfortunately the test for being full was:

	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
	return (dirty * 100) > (full * nr_pages);

Where "full" is the value for "buffer_percent".

There are two issues with the above when full == 100.

1. dirty * 100 > 100 * nr_pages will never be true
   That is, the above is basically saying that if the user sets
   buffer_percent to 100, more pages need to be dirty than exist in the
   ring buffer!

2. The page that the writer is on is never considered dirty, as dirty
   pages are only those that are full. When the writer goes to a new
   sub-buffer, it clears the contents of that sub-buffer.

That is, even if the check was ">=" it would still not be equal as the
most pages that can be considered "dirty" is nr_pages - 1.

To fix this, add one to dirty and use ">=" in the compare.
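
The two comparisons can be put side by side in a small sketch.
full_hit_old() and full_hit_new() are names made up for this
illustration; the dirty count is passed in directly rather than read
from the ring buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Test as it was before the fix. With full == 100 it can never be true,
 * because dirty is at most nr_pages - 1.
 */
static bool full_hit_old(size_t dirty, size_t nr_pages, size_t full)
{
	return (dirty * 100) > (full * nr_pages);
}

/* Test after the fix: count the writer's sub-buffer and compare with >=. */
static bool full_hit_new(size_t dirty, size_t nr_pages, size_t full)
{
	assert(nr_pages > 0 && full <= 100);
	return ((dirty + 1) * 100) >= (full * nr_pages);
}
```

With nr_pages = 8 and full = 100, the largest possible dirty count is 7:
the old test evaluates 700 > 800 and never wakes the reader, while the
new test evaluates 800 >= 800 and does.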

Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 03329f9939 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-29 09:18:30 -05:00
Vincent Guittot f60a631ab9 sched/fair: Fix tg->load when offlining a CPU
When a CPU is taken offline, the contribution of its cfs_rqs to task_groups'
load may remain and will negatively impact the calculation of the share of
the online CPUs.

To fix this bug, clear the contribution of an offlining CPU to task groups'
load and skip its contribution while it is inactive.

Here's the reproducer of the anomaly, by Imran Khan:

	"So far I have encountered only one rather lengthy way of reproducing this issue,
	which is as follows:

	1. Take a KVM guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and
	   create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and
	   /sys/fs/cgroup/cpu/test_group_2

	2. Assign a CPU intensive workload to each of these cgroups and start the
	   workload.

	For my tests I am using the following app:

	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char *argv[])
	{
		unsigned long count, i, val;
		if (argc != 2) {
		      printf("usage: ./a.out <number of random nums to generate> \n");
		      return 0;
		}

		count = strtoul(argv[1], NULL, 10);

		printf("Generating %lu random numbers \n", count);
		for (i = 0; i < count; i++) {
			val = rand();
			val = val % 2;
			//usleep(1);
		}
		printf("Generated %lu random numbers \n", count);
		return 0;
	}

	Also since the system is booted with 4 CPUs, in order to completely load the
	system I am also launching 4 instances of the same test app under:

	   /sys/fs/cgroup/cpu/

	3. We can see that both of the cgroups get similar CPU time:

        # systemd-cgtop --depth 1
	Path                                 Tasks    %CPU  Memory  Input/s    Output/s
	/                                      659      -     5.5G        -        -
	/system.slice                            -      -     5.7G        -        -
	/test_group_1                            4      -        -        -        -
	/test_group_2                            3      -        -        -        -
	/user.slice                             31      -    56.5M        -        -

	Path                                 Tasks   %CPU   Memory  Input/s    Output/s
	/                                      659  394.6     5.5G        -        -
	/test_group_2                            3   65.7        -        -        -
	/user.slice                             29   55.1    48.0M        -        -
	/test_group_1                            4   47.3        -        -        -
	/system.slice                            -    2.2     5.7G        -        -

	Path                                 Tasks  %CPU    Memory  Input/s    Output/s
	/                                      659  394.8     5.5G        -        -
	/test_group_1                            4   62.9        -        -        -
	/user.slice                             28   44.9    54.2M        -        -
	/test_group_2                            3   44.7        -        -        -
	/system.slice                            -    0.9     5.7G        -        -

	Path                                 Tasks  %CPU    Memory  Input/s     Output/s
	/                                      659  394.4     5.5G        -        -
	/test_group_2                            3   58.8        -        -        -
	/test_group_1                            4   51.9        -        -        -
	/user.slice                             30   39.3    59.6M        -        -
	/system.slice                            -    1.9     5.7G        -        -

	Path                                 Tasks  %CPU     Memory  Input/s    Output/s
	/                                      659  394.7     5.5G        -        -
	/test_group_1                            4   60.9        -        -        -
	/test_group_2                            3   57.9        -        -        -
	/user.slice                             28   43.5    36.9M        -        -
	/system.slice                            -    3.0     5.7G        -        -

	Path                                 Tasks  %CPU     Memory  Input/s     Output/s
	/                                      659  395.0     5.5G        -        -
	/test_group_1                            4   66.8        -        -        -
	/test_group_2                            3   56.3        -        -        -
	/user.slice                             29   43.1    51.8M        -        -
	/system.slice                            -    0.7     5.7G        -        -

	4. Now move systemd-udevd to one of these test groups, say test_group_1, and
	   perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
	   host side.

	5. Run the same workload, i.e. 4 instances of the CPU hogger under /sys/fs/cgroup/cpu
	   and one instance of the CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
	   /sys/fs/cgroup/cpu/test_group_2.

	It can be seen that test_group_1 (the one where systemd-udevd was moved) is getting
	much less CPU time than test_group_2, even though at this point in time both of
	these groups have only the CPU hogger running:

        # systemd-cgtop --depth 1
	Path                                   Tasks   %CPU   Memory  Input/s   Output/s
	/                                      1219     -     5.4G        -        -
	/system.slice                           -       -     5.6G        -        -
	/test_group_1                           4       -        -        -        -
	/test_group_2                           3       -        -        -        -
	/user.slice                            26       -    91.3M        -        -

	Path                                   Tasks  %CPU     Memory  Input/s   Output/s
	/                                      1221  394.3     5.4G        -        -
	/test_group_2                             3   82.7        -        -        -
	/test_group_1                             4   14.3        -        -        -
	/system.slice                             -    0.8     5.6G        -        -
	/user.slice                              26    0.4    91.2M        -        -

	Path                                   Tasks  %CPU    Memory  Input/s    Output/s
	/                                      1221  394.6     5.4G        -        -
	/test_group_2                             3   67.4        -        -        -
	/system.slice                             -   24.6     5.6G        -        -
	/test_group_1                             4   12.5        -        -        -
	/user.slice                              26    0.4    91.2M        -        -

	Path                                  Tasks  %CPU    Memory  Input/s    Output/s
	/                                     1221  395.2     5.4G        -        -
	/test_group_2                            3   60.9        -        -        -
	/system.slice                            -   27.9     5.6G        -        -
	/test_group_1                            4   12.2        -        -        -
	/user.slice                             26    0.4    91.2M        -        -

	Path                                  Tasks  %CPU    Memory  Input/s    Output/s
	/                                     1221  395.2     5.4G        -        -
	/test_group_2                            3   69.4        -        -        -
	/test_group_1                            4   13.9        -        -        -
	/user.slice                             28    1.6    92.0M        -        -
	/system.slice                            -    1.0     5.6G        -        -

	Path                                  Tasks  %CPU    Memory  Input/s    Output/s
	/                                      1221  395.6     5.4G        -        -
	/test_group_2                             3   59.3        -        -        -
	/test_group_1                             4   14.1        -        -        -
	/user.slice                              28    1.3    92.2M        -        -
	/system.slice                             -    0.7     5.6G        -        -

	Path                                  Tasks  %CPU    Memory  Input/s    Output/s
	/                                      1221  395.5     5.4G        -        -
	/test_group_2                            3   67.2        -        -        -
	/test_group_1                            4   11.5        -        -        -
	/user.slice                             28    1.3    92.5M        -        -
	/system.slice                            -    0.6     5.6G        -        -

	Path                                  Tasks  %CPU    Memory  Input/s    Output/s
	/                                      1221  395.1     5.4G        -        -
	/test_group_2                             3   76.8        -        -        -
	/test_group_1                             4   12.9        -        -        -
	/user.slice                              28    1.3    92.8M        -        -
	/system.slice                             -    1.2     5.6G        -        -

	From sched_debug data it can be seen that in the bad case the load.weight of per-CPU
	sched entities corresponding to test_group_1 has reduced significantly and
	also load_avg of test_group_1 remains much higher than that of test_group_2,
	even though systemd-udevd stopped running a long time ago and at this point
	both cgroups just have the CPU hogger app as the running entity."

[ mingo: Added details from the original discussion, plus minor edits to the patch. ]

Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/r/20231223111545.62135-1-vincent.guittot@linaro.org
2023-12-29 13:22:03 +01:00
Linus Torvalds f5837722ff 11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues or
are not considered backporting material.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZYys4AAKCRDdBJ7gKXxA
 jtmaAQC+o04Ia7IfB8MIqp1p7dNZQo64x/EnGA8YjUnQ8N6IwQD+ImU7dHl9g9Oo
 ROiiAbtMRBUfeJRsExX/Yzc1DV9E9QM=
 =ZGcs
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues
  or are not considered backporting material"

* tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add an old address for Naoya Horiguchi
  mm/memory-failure: cast index to loff_t before shifting it
  mm/memory-failure: check the mapcount of the precise page
  mm/memory-failure: pass the folio and the page to collect_procs()
  selftests: secretmem: floor the memory size to the multiple of page_size
  mm: migrate high-order folios in swap cache correctly
  maple_tree: do not preallocate nodes for slot stores
  mm/filemap: avoid buffered read/write race to read inconsistent data
  kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset
  kexec: select CRYPTO from KEXEC_FILE instead of depending on it
  kexec: fix KEXEC_FILE dependencies
2023-12-27 16:14:41 -08:00
Kent Overstreet 1e2f2d3199 Kill sched.h dependency on rcupdate.h
By moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big
sched.h dependency.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-27 11:50:20 -05:00
Kent Overstreet 932562a604 rseq: Split out rseq.h from sched.h
We're trying to get sched.h down to more or less just types only, not
code - rseq can live in its own header.

This helps us kill the dependency on preempt.h in sched.h.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-27 11:49:56 -05:00
Wang Jinchao fbb66ce0b1 sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
This variable became unused in:

    5e963f2bd4 ("sched/fair: Commit to EEVDF")

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/202312141319+0800-wangjinchao@xfusion.com
2023-12-23 16:12:21 +01:00
Pierre Gondois 3af7524b14 sched/fair: Use all little CPUs for CPU-bound workloads
Running N CPU-bound tasks on an N CPUs platform:

- with asymmetric CPU capacity

- not being a DynamIQ system (i.e. having a PKG level sched domain
  without the SD_SHARE_PKG_RESOURCES flag set)

.. might result in a task placement where two tasks run on a big CPU
and none on a little CPU. This placement could be more optimal by
using all CPUs.

Testing platform:

  Juno-r2:
    - 2 big CPUs (1-2), maximum capacity of 1024
    - 4 little CPUs (0,3-5), maximum capacity of 383

Testing workload ([1]):

  Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each task
  is affine to a CPU, except for:

    - one little CPU which is left idle.
    - one big CPU which has 2 tasks affine.

  After the 100ms (step 2), remove the cpumask affinity.

Behavior before the patch:

  During step 2, the load balancer running from the idle CPU tags sched
  domains as:

  - little CPUs: 'group_has_spare'. Cf. group_has_capacity() and
    group_is_overloaded(), 3 CPU-bound tasks run on a 4 CPUs
    sched-domain, and the idle CPU provides enough spare capacity
    regarding the imbalance_pct

  - big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2 CPUs
    sched-domain, so the following path is used:

      group_is_overloaded()
      \-if (sgs->sum_nr_running <= sgs->group_weight) return true;

    The following path which would change the migration type to
    'migrate_task' is not taken:

      calculate_imbalance()
      \-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)

    as the local group has some spare capacity, so the imbalance
    is not 0.

  The migration type requested is 'migrate_util' and the busiest
  runqueue is the big CPU's runqueue having 2 tasks (each having a
  utilization of 512). The idle little CPU cannot pull one of these
  tasks as its capacity is too small for the task. The following path
  is used:

   detach_tasks()
   \-case migrate_util:
     \-if (util > env->imbalance) goto next;

After the patch:

As the number of failed balancing attempts grows (with
'nr_balance_failed'), progressively make it easier to migrate
a big task to the idling little CPU. A similar mechanism is
used for the 'migrate_load' migration type.
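
The relaxation described above can be sketched roughly as follows. This is
a toy model, not the kernel code: it assumes the task utilization is
right-shifted by the number of failed balance attempts before it is compared
with the imbalance (modelled on the 'migrate_load' handling mentioned above),
and the imbalance of 383 is a hypothetical value borrowed from the little CPU
capacity of the test platform.

```python
def can_detach(util: int, imbalance: int, nr_balance_failed: int) -> bool:
    """Toy model of the 'migrate_util' check in detach_tasks()."""
    # Assumption: each failed attempt halves the utilization used in the test.
    # The shift is capped so it stays well defined for large failure counts.
    scaled_util = util >> min(nr_balance_failed, 63)
    return scaled_util <= imbalance


# Task utilization (512) comes from the message; imbalance (383) is hypothetical.
for failed in range(3):
    print(failed, can_detach(512, 383, failed))
```

With these numbers the task is skipped while no attempt has failed and becomes
eligible once at least one attempt has failed.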

Improvement:

Running the testing workload [1] with the step 2 representing
a ~10s load for a big CPU:

  Before patch: ~19.3s
  After patch:  ~18s (-6.7%)

Similar issue reported at:

  https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20231206090043.634697-1-pierre.gondois@arm.com
2023-12-23 16:06:36 +01:00
Vincent Guittot 11137d3849 sched/fair: Simplify util_est
With UTIL_EST_FASTUP now being permanent, we can take advantage of the
fact that the ewma jumps directly to a higher utilization at dequeue to
simplify util_est and remove the enqueued field.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/20231201161652.1241695-3-vincent.guittot@linaro.org
2023-12-23 15:59:58 +01:00
Vincent Guittot 7736ae5572 sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
in order to check for possibly related regressions. After 3 years, it has
never been used and no regression has been reported. Let's remove it
and make fast increase a permanent behavior.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Yanteng Si <siyanteng@loongson.cn> [for the Chinese translation]
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/20231201161652.1241695-2-vincent.guittot@linaro.org
2023-12-23 15:59:56 +01:00
Vincent Guittot b3edde44e5 cpufreq/schedutil: Use a fixed reference frequency
cpuinfo.max_freq can change at runtime, for example because of boost. This
implies that the value could be different from the one that was
used when computing the capacity of a CPU.

The new arch_scale_freq_ref() returns a fixed and coherent reference
frequency that can be used when computing a frequency based on utilization.

Use this arch_scale_freq_ref() when available and fall back to
policy otherwise.
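
Why the reference matters can be illustrated with a small sketch. It assumes
a plain linear mapping from utilization to frequency and leaves out any
headroom the governor applies; the function name and the frequencies are
hypothetical, only the 1024 capacity scale matches the kernel.

```python
SCHED_CAPACITY_SCALE = 1024  # full-scale utilization


def freq_for_util(util: int, ref_freq_khz: int,
                  max_cap: int = SCHED_CAPACITY_SCALE) -> int:
    """Map a utilization value linearly onto a frequency in kHz."""
    return ref_freq_khz * util // max_cap


# Hypothetical CPU: capacity was computed against 2.0 GHz, and boost later
# raises the reported maximum to 2.4 GHz.
fixed_ref = freq_for_util(512, 2_000_000)
boosted_ref = freq_for_util(512, 2_400_000)
print(fixed_ref, boosted_ref)
```

The same utilization maps to two different frequencies depending on which
maximum is used, which is the incoherence a fixed reference avoids.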

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20231211104855.558096-4-vincent.guittot@linaro.org
2023-12-23 15:52:35 +01:00
Ingo Molnar d2e9f53ac5 Linux 6.7-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmV/ggAeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGVsQIAKOWsoJRP11U2N9z
 X+GjfDZ7JjV3iWZezDJ6Hmtw1H47PBofhJJXwCaUbIYtDImJxK2mSA7bDF0LKDZQ
 lCupi8R4SPVugFD6Z+cFOLz4dHD1LorlPopldlBmWJRkp85uWdE+Bzbuu8SboypM
 +8e4QxT+XOPXZoGxI9bOjVWN/mnIKcrCINRrhgbUGaCizQG08Mah1oW/QVLYE8at
 hZdLhDkWkV2sbcRMEx0vq7L99Ym5fXkmW1BXC1Uu6SgQ4KX4+28plUROtLGnm4MV
 QwmURUFcURDIqUEaPu66P+1xkAGeEtAYC7N7375pJ++VeuFpHiBjGrT1HTtXfCYx
 Z0FcvsI=
 =r3Rg
 -----END PGP SIGNATURE-----

Merge tag 'v6.7-rc6' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-12-23 15:52:13 +01:00
Sven Schnelle 221a164035 entry: Move syscall_enter_from_user_mode() to header file
To allow inlining of syscall_enter_from_user_mode(), move it
to entry-common.h.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231218074520.1998026-4-svens@linux.ibm.com
2023-12-21 23:12:18 +01:00
Sven Schnelle caf4062e35 entry: Move enter_from_user_mode() to header file
To allow inlining of enter_from_user_mode(), move it to
entry-common.h.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231218074520.1998026-3-svens@linux.ibm.com
2023-12-21 23:12:18 +01:00
Sven Schnelle d680194719 entry: Move exit to usermode functions to header file
To allow inlining, move exit_to_user_mode() to
entry-common.h.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231218074520.1998026-2-svens@linux.ibm.com
2023-12-21 23:12:18 +01:00
Simon Horman 5abde62465 bpf: Avoid unnecessary use of comma operator in verifier
Although it does not seem to have any untoward side-effects, the use
of ';' to separate two assignments seems more appropriate than ','.

Flagged by clang-17 -Wcomma

No functional change intended. Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20231221-bpf-verifier-comma-v1-1-cde2530912e9@kernel.org
2023-12-21 22:40:25 +01:00
Paolo Abeni 56794e5358 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
  23c93c3b62 ("bnxt_en: do not map packet buffers twice")
  6d1add9553 ("bnxt_en: Modify TX ring indexing logic.")

tools/testing/selftests/net/Makefile
  2258b66648 ("selftests: add vlan hw filter tests")
  a0bc96c0cd ("selftests: net: verify fq per-band packet limit")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-12-21 22:17:23 +01:00
Linus Torvalds 13b734465a Tracing fixes for 6.7:
- Fix another kerneldoc warning
 
 - Fix eventfs files to inherit the ownership of their parent directory.
   The dynamic creation of dentries in eventfs did not take into
   account if the tracefs file system was mounted with a gid/uid,
   and would still default to the gid/uid of root. This is a regression.
 
 - Fix warning when synthetic event testing is enabled along with
   startup event tracing testing
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZYRYjhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qs0aAQCXWcBeDEWsi8VxAOBU5Q6isvXn2koM
 +xSX6LJPh6hFVAD+Pc3oLgvyE5IyqNUM9RYtpwPVMhpAsyE9FIz3TWarEww=
 =LY0i
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.7-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix another kerneldoc warning

 - Fix eventfs files to inherit the ownership of their parent directory.

   The dynamic creation of dentries in eventfs did not take into account
   if the tracefs file system was mounted with a gid/uid, and would
   still default to the gid/uid of root. This is a regression.

 - Fix warning when synthetic event testing is enabled along with
   startup event tracing testing

* tag 'trace-v6.7-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing / synthetic: Disable events after testing in synth_event_gen_test_init()
  eventfs: Have event files and directories default to parent uid and gid
  tracing/synthetic: fix kernel-doc warnings
2023-12-21 09:31:45 -08:00
Steven Rostedt (Google) 3cb3091138 ring-buffer: Use subbuf_order for buffer page masking
The comparisons to PAGE_SIZE were all converted to use the
buffer->subbuf_order, but the use of PAGE_MASK was missed.

Convert all the PAGE_MASK usages over to:

  (PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1
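
As a rough illustration of the mask above, the following sketch splits an
address into a sub-buffer base and an offset. It assumes 4 KiB pages; the
helper names are hypothetical and not taken from the ring buffer code.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def subbuf_mask(order: int) -> int:
    """Low-bit mask covering one sub-buffer of PAGE_SIZE << order bytes."""
    return (PAGE_SIZE << order) - 1


def split_address(addr: int, order: int) -> tuple[int, int]:
    """Return (sub-buffer base, offset inside the sub-buffer)."""
    mask = subbuf_mask(order)
    return addr & ~mask, addr & mask


base, offset = split_address(0x12345, 2)
print(hex(base), hex(offset))
```

With order 0 the mask covers a single page; with a larger order the same
address resolves to a lower base, which is why a page-sized mask gives the
wrong answer once sub-buffers span several pages.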

Link: https://lore.kernel.org/linux-trace-kernel/20231219173800.66eefb7a@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: 139f840021 ("ring-buffer: Page size per ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 11:04:48 -05:00
Steven Rostedt (Google) 2f84b39f48 tracing: Update subbuffer with kilobytes not page order
Using page order for deciding what the size of the ring buffer sub buffers
are is exposing a bit too much of the implementation. Although the sub
buffers are only allocated in orders of pages, allow the user to specify
the minimum size of each sub-buffer via kilobytes like they can with the
buffer size itself.

If the user specifies 3 via:

  echo 3 > buffer_subbuf_size_kb

Then the sub-buffer size will round up to 4kb (on a 4kb page size system).

If they specify:

  echo 6 > buffer_subbuf_size_kb

The sub-buffer size will become 8kb.

and so on.
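
The rounding can be sketched as below. This is only an illustration that
assumes a 4kb page size and that the size is rounded up to the smallest
page order that fits; the function name is hypothetical.

```python
PAGE_SIZE_KB = 4  # assumed 4kb pages, as in the examples above


def subbuf_size_kb(requested_kb: int) -> int:
    """Round a requested size up to PAGE_SIZE_KB << order, smallest order."""
    order = 0
    while (PAGE_SIZE_KB << order) < requested_kb:
        order += 1
    return PAGE_SIZE_KB << order


for kb in (3, 6, 9):
    print(kb, "->", subbuf_size_kb(kb))
```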

Link: https://lore.kernel.org/linux-trace-kernel/20231219185631.809766769@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 11:04:15 -05:00
Steven Rostedt (Google) 8e7b58c27b ring-buffer: Just update the subbuffers when changing their allocation order
The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu
cpu_buffers with the new subbuffers with the updated order, and if they
were all successfully created, then the ring_buffer's per_cpu buffers
would be freed and replaced by them.

The problem is that the freed per_cpu buffers contain state that would be
lost. Running the following commands:

1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order
2. # echo 0 > /sys/kernel/tracing/tracing_cpumask
3. # echo 1 > /sys/kernel/tracing/snapshot
4. # echo ff > /sys/kernel/tracing/tracing_cpumask
5. # echo test > /sys/kernel/tracing/trace_marker

Would result in:

 -bash: echo: write error: Bad file descriptor

That's because the state of the per_cpu buffers of the snapshot buffer is
lost when the order is changed (the order of a freed snapshot buffer goes
to 0 to save memory, and when the snapshot buffer is allocated again, it
goes back to what the main buffer is).

In operation 2, the snapshot buffers were set to "disable" (as all the
ring buffers CPUs were disabled).

In operation 3, the snapshot is allocated and a call to
ring_buffer_subbuf_order_set() replaced the per_cpu buffers losing the
"record_disable" count.

When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was
decrementing a zero, setting it to -1. Writing 1 into the snapshot would
swap the snapshot buffer with the main buffer, so now the main buffer is
"disabled", and nothing can write to the ring buffer anymore.

Instead of creating new per_cpu buffers and losing the state of the old
buffers, basically do what the resize does and just allocate new subbuf
pages into the new_pages linked list of the per_cpu buffer and if they all
succeed, then replace the old sub buffers with the new ones. This keeps
the per_cpu buffer descriptor intact and by doing so, keeps its state.
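
The difference between the two approaches can be shown with a toy model of
the commands above. Everything here is a hypothetical simplification: only
the record_disable counter is taken from the description, and the class and
function names are made up for illustration.

```python
class PerCpuBuffer:
    """Toy stand-in for a per-CPU buffer descriptor."""

    def __init__(self, order: int):
        self.order = order
        self.record_disable = 0
        self.pages = [f"subbuf(order={order})"]


def set_order_by_replacing(buf: PerCpuBuffer, order: int) -> PerCpuBuffer:
    # Old behaviour: a fresh descriptor, so record_disable starts at 0 again.
    return PerCpuBuffer(order)


def set_order_in_place(buf: PerCpuBuffer, order: int) -> PerCpuBuffer:
    # New behaviour: allocate the new pages first, then swap them in.
    new_pages = [f"subbuf(order={order})"]
    buf.pages, buf.order = new_pages, order
    return buf


def demo(set_order) -> int:
    buf = PerCpuBuffer(order=0)
    buf.record_disable += 1   # operation 2: all CPUs disabled
    buf = set_order(buf, 3)   # operation 3: snapshot allocation sets the order
    buf.record_disable -= 1   # operation 4: CPUs enabled again
    return buf.record_disable


print(demo(set_order_by_replacing), demo(set_order_in_place))
```

Replacing the descriptor leaves the counter at -1 after re-enabling, while
updating the pages in place brings it back to 0.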

Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.944104939@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: f9b94daa54 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 11:02:52 -05:00
Steven Rostedt (Google) 353cc21937 ring-buffer: Keep the same size when updating the order
The function ring_buffer_subbuf_order_set() just updated the sub-buffers
to the new size, but this also changes the size of the buffer in doing so,
as the size is determined by nr_pages * subbuf_size. If the subbuf_size is
increased without decreasing the nr_pages, this causes the total size of
the buffer to increase.

This broke the latency tracers as the snapshot needs to be the same size
as the main buffer. The size of the snapshot buffer is only expanded when
needed, and because the order is still the same, the size becomes out of
sync with the main buffer, as the main buffer increased in size without
the tracing system knowing.

Calculate the nr_pages to allocate with the new subbuf_size to be
buffer_size / new_subbuf_size.
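
A minimal sketch of that calculation, assuming 4 KiB pages and ignoring any
per-sub-buffer header or minimum page count (the function name is
hypothetical):

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def nr_pages_for_order(old_nr_pages: int, old_order: int, new_order: int) -> int:
    """Number of sub-buffers that keeps the total buffer size unchanged."""
    buffer_size = old_nr_pages * (PAGE_SIZE << old_order)
    new_subbuf_size = PAGE_SIZE << new_order
    return buffer_size // new_subbuf_size


# 64 sub-buffers of order 0 become 16 sub-buffers of order 2.
print(nr_pages_for_order(64, 0, 2))
```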

Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.649397785@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: f9b94daa54 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 11:02:01 -05:00
Steven Rostedt (Google) fa4b54af5b tracing: Stop the tracing while changing the ring buffer subbuf size
Because the main buffer and the snapshot buffer need to be the same for
some tracers, otherwise it will fail and disable all tracing, the tracers
need to be stopped while updating the sub buffer sizes so that the tracers
see the main and snapshot buffers with the same sub buffer size.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.353222794@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: 2808e31ec1 ("ring-buffer: Add interface for configuring trace sub buffer size")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 11:00:56 -05:00
Steven Rostedt (Google) aa067682ad tracing: Update snapshot order along with main buffer order
When updating the order of the sub buffers for the main buffer, make sure
that if the snapshot buffer exists, it gets its order updated as
well.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.054668186@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 10:55:57 -05:00
Steven Rostedt (Google) 4e958db34f ring-buffer: Make sure the spare sub buffer used for reads has same size
Now that the ring buffer specifies the size of its sub buffers, they all
need to be the same size. When doing a read, a swap is done with a spare
page. Make sure they are the same size before doing the swap, otherwise
the read will fail.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.763664788@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 10:55:04 -05:00
Steven Rostedt (Google) b81e03a249 ring-buffer: Do no swap cpu buffers if order is different
As the subbuffer order (subbuffer size) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.467894710@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 10:54:50 -05:00
Steven Rostedt (Google) 22887dfba0 ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure
On failure to allocate ring buffer pages, the pointer to the CPU buffer
pages is freed, but the pages that were allocated previously were not.
Make sure they are freed too.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.179352802@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: f9b94daa54 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 10:53:59 -05:00
Steven Rostedt (Google) 88b30c7f5d tracing / synthetic: Disable events after testing in synth_event_gen_test_init()
The synth_event_gen_test module can be built in, if someone wants to run
the tests at boot up and not have to load them.

The synth_event_gen_test_init() function creates and enables the synthetic
events and runs its tests.

The synth_event_gen_test_exit() disables the events it created and
destroys the events.

If the module is built in, the events are never disabled. The issue is, the
events should be disabled after the tests are run. This could be an issue
if the rest of the boot up tests are enabled, as they expect the events to
be in a known state before testing. That known state happens to be
disabled.

When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
a warning will trigger:

 Running tests on trace events:
 Testing event create_synth_test:
 Enabled event during self test!
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 event_trace_self_tests+0x1c2/0x480
 Modules linked in:
 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
 RIP: 0010:event_trace_self_tests+0x1c2/0x480
 Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
 RSP: 0000:ffffc9000001fdc0 EFLAGS: 00010246
 RAX: 0000000000000029 RBX: ffff88810399ca80 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffffffffb9f19478 RDI: ffff88823c734e64
 RBP: ffff88810399f300 R08: 0000000000000000 R09: fffffbfff79eb32a
 R10: ffffffffbcf59957 R11: 0000000000000001 R12: ffff888104068090
 R13: ffffffffbc89f0a0 R14: ffffffffbc8a0f08 R15: 0000000000000078
 FS:  0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000001f6282001 CR4: 0000000000170ef0
 Call Trace:
  <TASK>
  ? __warn+0xa5/0x200
  ? event_trace_self_tests+0x1c2/0x480
  ? report_bug+0x1f6/0x220
  ? handle_bug+0x6f/0x90
  ? exc_invalid_op+0x17/0x50
  ? asm_exc_invalid_op+0x1a/0x20
  ? tracer_preempt_on+0x78/0x1c0
  ? event_trace_self_tests+0x1c2/0x480
  ? __pfx_event_trace_self_tests_init+0x10/0x10
  event_trace_self_tests_init+0x27/0xe0
  do_one_initcall+0xd6/0x3c0
  ? __pfx_do_one_initcall+0x10/0x10
  ? kasan_set_track+0x25/0x30
  ? rcu_is_watching+0x38/0x60
  kernel_init_freeable+0x324/0x450
  ? __pfx_kernel_init+0x10/0x10
  kernel_init+0x1f/0x1e0
  ? _raw_spin_unlock_irq+0x33/0x50
  ret_from_fork+0x34/0x60
  ? __pfx_kernel_init+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>

This is because the synth_event_gen_test_init() left the synthetic events
that it created enabled. By having it disable them after testing, the
other selftests will run fine.

Link: https://lore.kernel.org/linux-trace-kernel/20231220111525.2f0f49b0@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 9fe41efaca ("tracing: Add synth event generation test module")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Alexander Graf <graf@amazon.com>
Tested-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21 10:04:45 -05:00
Daniel Borkmann b08c8fc041 bpf: Re-support uid and gid when mounting bpffs
For a clean, conflict-free revert of the token-related patches in commit
d17aff807f ("Revert BPF token-related functionality"), the bpf fs commit
750e785796 ("bpf: Support uid and gid when mounting bpffs") was undone
temporarily as well.

This patch manually re-adds the functionality from the original one back
in 750e785796, no other functional changes intended.

Testing:

  # mount -t bpf -o uid=65534,gid=65534 bpffs ./foo
  # ls -la . | grep foo
  drwxrwxrwt   2 nobody nogroup          0 Dec 20 13:16 foo
  # mount -t bpf
  bpffs on /root/foo type bpf (rw,relatime,uid=65534,gid=65534)

Also, invalid arguments for uid/gid are properly rejected as expected.

Fixes: d17aff807f ("Revert BPF token-related functionality")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Cc: Jie Jiang <jiejiang@chromium.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/bpf/20231220133805.20953-1-daniel@iogearbox.net
2023-12-21 14:24:30 +01:00
Christian Brauner 2137e15642 Merge branch 'vfs.file'
Bring in the changes to the file infrastructure for this cycle. Mostly
cleanups and some performance tweaks.

* file: remove __receive_fd()
* file: stop exposing receive_fd_user()
* fs: replace f_rcuhead with f_task_work
* file: remove pointless wrapper
* file: s/close_fd_get_file()/file_close_fd()/g
* Improve __fget_files_rcu() code generation (and thus __fget_light())
* file: massage cleanup of files that failed to open

Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-21 13:21:52 +01:00
Dmitry Antipov 1bfc466b13 watch_queue: fix kcalloc() arguments order
When compiling with gcc version 14.0.0 20231220 (experimental)
and W=1, I've noticed the following warning:

kernel/watch_queue.c: In function 'watch_queue_set_size':
kernel/watch_queue.c:273:32: warning: 'kcalloc' sizes specified with 'sizeof'
in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
  273 |         pages = kcalloc(sizeof(struct page *), nr_pages, GFP_KERNEL);
      |                                ^~~~~~

Since 'n' and 'size' arguments of 'kcalloc()' are multiplied to
calculate the final size, their actual order doesn't affect the
result and so this is not a bug. But it's still worth fixing.
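
A minimal userspace sketch of the point above, assuming standard calloc()
as a stand-in for kcalloc(); the helper names are hypothetical and not part
of the patch. The element count goes first and the element size second,
while the product is the same in either order:

```c
#include <stdlib.h>

/* Opaque stand-in for the kernel's struct page (illustration only). */
struct page;

/* Preferred order: number of elements first, element size second. */
static struct page **alloc_page_array(size_t nr_pages)
{
	return calloc(nr_pages, sizeof(struct page *));
}

/* The allocation size is n * size, so swapping the two is harmless. */
static size_t total_bytes(size_t n, size_t size)
{
	return n * size;
}
```

This only illustrates why the warning is about convention rather than
correctness.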

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://lore.kernel.org/r/20231221090139.12579-1-dmantipov@yandex.ru
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-21 13:17:54 +01:00
Linus Torvalds a4aebe9365 posix-timers: Get rid of [COMPAT_]SYS_NI() uses
Only the posix timer system calls use this (when the posix timer support
is disabled, which does not actually happen in any normal case), because
they had debug code to print out a warning about missing system calls.

Get rid of that special case, and just use the standard COND_SYSCALL
interface that creates weak system call stubs that return -ENOSYS for
when the system call does not exist.
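
As a rough userspace illustration of a weak stub that returns -ENOSYS,
assuming a GCC/Clang toolchain on an ELF target: COND_STUB_SKETCH and
sys_example_timer_create are hypothetical names, and this is not the
kernel's actual COND_SYSCALL definition.

```c
#include <errno.h>

/*
 * Hypothetical macro: emit a weak function that reports "not implemented".
 * A strong definition of the same symbol elsewhere would override it at
 * link time.
 */
#define COND_STUB_SKETCH(name)                     \
	__attribute__((weak)) long sys_##name(void) \
	{                                           \
		return -ENOSYS;                     \
	}

COND_STUB_SKETCH(example_timer_create)
```

With no strong definition linked in, calling the stub simply yields
-ENOSYS.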

This fixes a kCFI issue with the SYS_NI() hackery:

  CFI failure at int80_emulation+0x67/0xb0 (target: sys_ni_posix_timers+0x0/0x70; expected type: 0xb02b34d9)
  WARNING: CPU: 0 PID: 48 at int80_emulation+0x67/0xb0

Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-20 21:30:27 -08:00
Matthew Wilcox (Oracle) 6dfeff09d5 wait: Remove uapi header file from main header file
There's really no overlap between uapi/linux/wait.h and linux/wait.h.
There are two files which rely on the uapi file being implicitly included,
so explicitly include it there and remove it from the main header file.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Christian Brauner <brauner@kernel.org>
2023-12-20 19:26:31 -05:00
Kent Overstreet 8b7787a543 plist: Split out plist_types.h
Trimming down sched.h dependencies: we don't want to include more than
the base types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20 19:26:31 -05:00
Kent Overstreet f551103cb9 sched.h: move pid helpers to pid.h
This is needed for killing the sched.h dependency on rcupdate.h, and
pid.h is a better place for this code anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20 19:26:31 -05:00
Kent Overstreet d7a73e3f08 kernel/numa.c: Move logging out of numa.h
Moving these stub functions to a .c file means we can kill a sched.h
dependency on printk.h.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20 19:26:30 -05:00
Kent Overstreet a2bef835d3 kernel/fork.c: add missing include
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20 19:26:30 -05:00
Wang Jinchao 4459cd2e16 crash_core: remove duplicated including of kexec.h
Remove second include of linux/kexec.h

Link: https://lkml.kernel.org/r/202312151654+0800-wangjinchao@xfusion.com
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20 15:02:58 -08:00
Yuntao Wang db6b6fb701 kexec: use ALIGN macro instead of open-coding it
Use ALIGN macro instead of open-coding it to improve code readability.
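
A small userspace sketch of the difference, assuming a power-of-two
alignment; ALIGN_UP_SKETCH and both helpers are hypothetical names that
only mimic the idea behind the kernel's ALIGN():

```c
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two. */
#define ALIGN_UP_SKETCH(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Open-coded form: correct, but the intent is harder to see. */
static size_t align_open_coded(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Same computation expressed through the named macro. */
static size_t align_with_macro(size_t x, size_t a)
{
	return ALIGN_UP_SKETCH(x, a);
}
```

Both helpers compute the same value; the macro form just names the
operation.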

Link: https://lkml.kernel.org/r/20231212142706.25149-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20 15:02:58 -08:00
Kevin Hao a903904c5f fork: remove redundant TASK_UNINTERRUPTIBLE
TASK_KILLABLE already includes TASK_UNINTERRUPTIBLE, so there is no
need to add a separate TASK_UNINTERRUPTIBLE.
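
The redundancy can be sketched with plain bit flags. The _SKETCH names are
hypothetical, and the numeric values below are recalled from
include/linux/sched.h, so they should be treated as assumptions:

```c
/* Illustrative values; check include/linux/sched.h for the real ones. */
#define TASK_UNINTERRUPTIBLE_SKETCH 0x0002
#define TASK_WAKEKILL_SKETCH        0x0100
#define TASK_KILLABLE_SKETCH \
	(TASK_WAKEKILL_SKETCH | TASK_UNINTERRUPTIBLE_SKETCH)

/* OR-ing in a bit that is already set does not change the state mask. */
static int extra_flag_is_redundant(void)
{
	return (TASK_KILLABLE_SKETCH | TASK_UNINTERRUPTIBLE_SKETCH) ==
	       TASK_KILLABLE_SKETCH;
}
```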

Link: https://lkml.kernel.org/r/20231208084115.1973285-1-haokexin@gmail.com
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20 15:02:58 -08:00
Baoquan He a85ee18c79 kexec_file: print out debugging message if required
With this change, when '-d' is specified for the kexec_file_load
interface, the loaded locations of kernel/initrd/cmdline etc. can be
printed out to help debugging.

Replace pr_debug() with the newly added kexec_dprintk() in the
kexec_file loading code.

Also print out the type/start/head of the kimage and the flags to help
debugging.

Link: https://lkml.kernel.org/r/20231213055747.61826-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Conor Dooley <conor@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20 15:02:57 -08:00
Baoquan He cbc2fe9d9c kexec_file: add kexec_file flag to control debug printing
Patch series "kexec_file: print out debugging message if required", v4.

Currently, specifying '-d' on the kexec command prints a lot of debugging
information about kexec/kdump loading with the kexec_load interface.

However, kexec_file_load prints nothing even when '-d' is specified.
This makes it very inconvenient to debug or analyze kexec/kdump loading
when something goes wrong with kexec/kdump itself, or when a developer
wants to inspect the loading process.

In this patchset, a kexec_file flag, KEXEC_FILE_DEBUG, is added and checked
in the code.  If it is passed in, the debugging messages of the kexec_file
code are printed out and can be seen on the console and in dmesg.
Otherwise, the debugging messages are printed as before, via pr_debug().

Note:
=====
1) The code in the kexec-tools utility also needs to be changed to support
passing KEXEC_FILE_DEBUG to the kernel when 'kexec -s -d' is specified.
The patch link is here:
[PATCH] kexec_file: add kexec_file flag to support debug printing
http://lists.infradead.org/pipermail/kexec/2023-November/028505.html

2) s390 also has kexec_file code, but I am not sure what debugging
information is necessary there, so that is left to the s390 developers.

Test:
=====
Testing was done in v1 on x86_64 and arm64.  For v4, testing was done on
x86_64 again.  On x86_64, the printed messages look like this:
--------------------------------------------------------------
kexec measurement buffer for the loaded kernel at 0x207fffe000.
Loaded purgatory at 0x207fff9000
Loaded boot_param, command line and misc at 0x207fff3000 bufsz=0x1180 memsz=0x1180
Loaded 64bit kernel at 0x207c000000 bufsz=0xc88200 memsz=0x3c4a000
Loaded initrd at 0x2079e79000 bufsz=0x2186280 memsz=0x2186280
Final command line is: root=/dev/mapper/fedora_intel--knightslanding--lb--02-root ro
rd.lvm.lv=fedora_intel-knightslanding-lb-02/root console=ttyS0,115200N81 crashkernel=256M
E820 memmap:
0000000000000000-000000000009a3ff (1)
000000000009a400-000000000009ffff (2)
00000000000e0000-00000000000fffff (2)
0000000000100000-000000006ff83fff (1)
000000006ff84000-000000007ac50fff (2)
......
000000207fff6150-000000207fff615f (128)
000000207fff6160-000000207fff714f (1)
000000207fff7150-000000207fff715f (128)
000000207fff7160-000000207fff814f (1)
000000207fff8150-000000207fff815f (128)
000000207fff8160-000000207fffffff (1)
nr_segments = 5
segment[0]: buf=0x000000004e5ece74 bufsz=0x211 mem=0x207fffe000 memsz=0x1000
segment[1]: buf=0x000000009e871498 bufsz=0x4000 mem=0x207fff9000 memsz=0x5000
segment[2]: buf=0x00000000d879f1fe bufsz=0x1180 mem=0x207fff3000 memsz=0x2000
segment[3]: buf=0x000000001101cd86 bufsz=0xc88200 mem=0x207c000000 memsz=0x3c4a000
segment[4]: buf=0x00000000c6e38ac7 bufsz=0x2186280 mem=0x2079e79000 memsz=0x2187000
kexec_file_load: type:0, start:0x207fff91a0 head:0x109e004002 flags:0x8
---------------------------------------------------------------------------


This patch (of 7):

When specifying 'kexec -c -d', the kexec_load interface prints loading
information, e.g. the regions where kernel/initrd/purgatory/cmdline are
placed, the memmap passed to the 2nd kernel as system RAM ranges, and
the full contents of struct kexec_segment.  These are very helpful for
analyzing or locating the problem when kexec/kdump itself fails.  The
debugging output for the kexec_load interface is produced by the user
space utility kexec-tools.

With the kexec_file_load interface, however, 'kexec -s -d' prints
nothing, because the kexec_file code is mostly implemented in kernel
space and the debugging print functionality is missing there.  This is
inconvenient when debugging kexec/kdump loading and jumping with the
kexec_file_load interface.

Now add KEXEC_FILE_DEBUG to kexec_file flag to control the debugging
message printing.  And add global variable kexec_file_dbg_print and macro
kexec_dprintk() to facilitate the printing.

This is a preparation; later, kexec_dprintk() will be used to replace the
existing pr_debug() calls.  Once 'kexec -s -d' is specified, it will print
out kexec/kdump loading information.  If '-d' is not specified, it falls
back to pr_debug().
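
The flag-gated printing described above might look roughly like the
following userspace sketch. It is not the kernel implementation: the
_sketch names are hypothetical, stderr stands in for the kernel log, and
the 0x8 flag value is taken from the "flags:0x8" line in the test output
quoted earlier. When the flag is clear the sketch simply prints nothing,
whereas the patch falls back to pr_debug().

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed flag value, inferred from "flags:0x8" in the quoted log. */
#define KEXEC_FILE_DEBUG_SKETCH 0x8UL

/* Userspace stand-in for the global switch named in the commit. */
static bool dbg_print_sketch;

/* Record whether the caller asked for debug output. */
static void set_debug_from_flags(unsigned long flags)
{
	dbg_print_sketch = (flags & KEXEC_FILE_DEBUG_SKETCH) != 0;
}

/* Print only when enabled; returns the number of characters written. */
static int dprintk_sketch(const char *fmt, ...)
{
	va_list ap;
	int n;

	if (!dbg_print_sketch)
		return 0;

	va_start(ap, fmt);
	n = vfprintf(stderr, fmt, ap);
	va_end(ap);
	return n;
}
```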

Link: https://lkml.kernel.org/r/20231213055747.61826-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20231213055747.61826-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Conor Dooley <conor@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20 15:02:57 -08:00
Andrew Morton a721aeac8b sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes
2023-12-20 14:47:18 -08:00
Arnd Bergmann e63bde3d94 kexec: select CRYPTO from KEXEC_FILE instead of depending on it
All other users of crypto code use 'select' instead of 'depends on', so do
the same thing with KEXEC_FILE for consistency.

In practice this makes very little difference as kernels with kexec
support are very likely to also include some other feature that already
selects both crypto and crypto_sha256, but being consistent here helps for
usability as well as to avoid potential circular dependencies.

This reverts the dependency back to what it was originally before commit
74ca317c26 ("kexec: create a new config option CONFIG_KEXEC_FILE for
new syscall"), which changed it with the comment "This should be
safer as "select" is not recursive", but that appears to have been done in
error, as "select" is indeed recursive, and there are no other
dependencies that prevent CRYPTO_SHA256 from being selected here.

Link: https://lkml.kernel.org/r/20231023110308.1202042-2-arnd@kernel.org
Fixes: 74ca317c26 ("kexec: create a new config option CONFIG_KEXEC_FILE for new syscall")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Conor Dooley <conor@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20 13:46:19 -08:00
Arnd Bergmann c1ad12ee0e kexec: fix KEXEC_FILE dependencies
The cleanup for the CONFIG_KEXEC Kconfig logic accidentally changed the
'depends on CRYPTO=y' dependency to a plain 'depends on CRYPTO', which
causes a link failure when all the crypto support is in a loadable module
and kexec_file support is built-in:

x86_64-linux-ld: vmlinux.o: in function `__x64_sys_kexec_file_load':
(.text+0x32e30a): undefined reference to `crypto_alloc_shash'
x86_64-linux-ld: (.text+0x32e58e): undefined reference to `crypto_shash_update'
x86_64-linux-ld: (.text+0x32e6ee): undefined reference to `crypto_shash_final'

Both s390 and x86 have this problem, while ppc64 and riscv have the
correct dependency already.  On riscv, the dependency is only used for the
purgatory, not for the kexec_file code itself, which may be a bit
surprising as it means that with CONFIG_CRYPTO=m, it is possible to enable
KEXEC_FILE but then the purgatory code is silently left out.

Move this into the common Kconfig.kexec file in a way that is correct
everywhere, using the dependency on CRYPTO_SHA256=y only when the
purgatory code is available.  This requires reversing the dependency
between ARCH_SUPPORTS_KEXEC_PURGATORY and KEXEC_FILE, but the effect
remains the same, other than making riscv behave like the other ones.

On s390, there is an additional dependency on CRYPTO_SHA256_S390, which
should technically not be required but gives better performance.  Remove
this dependency here, noting that it was not present in the initial
Kconfig code but was brought in without an explanation in commit
71406883fd ("s390/kexec_file: Add kexec_file_load system call").

[arnd@arndb.de: fix riscv build]
  Link: https://lkml.kernel.org/r/67ddd260-d424-4229-a815-e3fcfb864a77@app.fastmail.com
Link: https://lkml.kernel.org/r/20231023110308.1202042-1-arnd@kernel.org
Fixes: 6af5138083 ("x86/kexec: refactor for kernel/Kconfig.kexec")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Conor Dooley <conor@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20 13:46:19 -08:00
Hou Tao 7ac5c53e00 bpf: Use c->unit_size to select target cache during free
At present, the bpf memory allocator uses check_obj_size() to ensure that
ksize() of the allocated pointer is equal to the unit_size of the
bpf_mem_cache in use. Its purpose is to prevent bpf_mem_free() from
selecting a bpf_mem_cache whose unit_size differs from that of the
bpf_mem_cache used for allocation. But as reported by lkp, the return
value of ksize() or kmalloc_size_roundup() may change due to slab merge,
which leads to the warning report in check_obj_size().

The reported warning happened as follows:
(1) in bpf_mem_cache_adjust_size(), kmalloc_size_roundup(96) returns the
object_size of kmalloc-96 instead of kmalloc-cg-96. The object_size of
kmalloc-96 is 96, so size_index for 96 is not adjusted accordingly.
(2) the object_size of kmalloc-cg-96 is adjusted from 96 to 128 due to
slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is
enabled by default for kmalloc slab, so align is 64 and size is 128 for
kmalloc-cg-96. SLUB has a similar merge logic, but its object_size will
not be changed, because its align is 8 under x86-64.
(3) when unit_alloc() does kmalloc_node(96, __GFP_ACCOUNT, node),
ksize() returns 128 instead of 96 for the returned pointer.
(4) the warning in check_obj_size() is triggered.

Considering that slab merge can happen at any time (e.g., when a slab is
created in a new module), the following case is also possible: during the
initialization of bpf_global_ma, there is no slab merge and ksize() for
a 96-byte object returns 96. But after that, a new slab created by a
kernel module is merged into kmalloc-cg-96 and the object_size of
kmalloc-cg-96 is adjusted from 96 to 128 (which is possible for x86-64 +
CONFIG_SLAB, because its alignment requirement is 64 for a 96-byte slab).
So sooner or later, when bpf_global_ma frees a 96-byte pointer
which was allocated from a bpf_mem_cache with unit_size=96, bpf_mem_free()
will free the pointer through a bpf_mem_cache whose unit_size is 128,
because the return value of ksize() has changed. The warning for the
mismatch will be triggered again.

One feasible fix would be to introduce APIs similar to ksize() and
kmalloc_size_roundup() that return the actually-allocated size instead of
a size which may change due to slab merge, but that would introduce an
unnecessary dependency on the implementation details of the mm subsystem.

Since the bpf_mem_cache pointer is currently saved in the 8-byte area
(or 4-byte area on a 32-bit host) above the returned pointer, use the
unit_size of the saved bpf_mem_cache to select the target cache instead
of inferring the size from the pointer itself. Besides adding no extra
dependency on the mm subsystem, this also improves the performance of
bpf_mem_free_rcu(), as shown below.

Before applying the patch, the performance of bpf_mem_alloc() and
bpf_mem_free_rcu() on an 8-CPU VM with one producer is as follows:

kmalloc : alloc 11.69 ± 0.28M/s free 29.58 ± 0.93M/s
percpu  : alloc 14.11 ± 0.52M/s free 14.29 ± 0.99M/s

After applying the patch, the performance of bpf_mem_free_rcu() increases
by 9% and 146% for kmalloc memory and per-cpu memory, respectively:

kmalloc: alloc 11.01 ± 0.03M/s free   32.42 ± 0.48M/s
percpu:  alloc 12.84 ± 0.12M/s free   35.24 ± 0.23M/s

After the fixes, there is no need to adjust size_index to fix the
mismatch between allocation and free, so remove it as well. Also return
NULL instead of ZERO_SIZE_PTR for zero-sized alloc in bpf_mem_alloc(),
because there is no bpf_mem_cache pointer saved above ZERO_SIZE_PTR.
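
The idea of selecting the cache from a pointer stored just ahead of the
object can be sketched in userspace as follows. This assumes malloc() as
the backing allocator; struct cache_sketch and both helpers are
hypothetical and are not the bpf_mem_alloc code:

```c
#include <stdlib.h>

/* Hypothetical stand-in for bpf_mem_cache; only unit_size matters here. */
struct cache_sketch {
	size_t unit_size;
};

/* Allocate one object, storing the owning cache in a pointer-sized header. */
static void *cache_alloc_sketch(struct cache_sketch *c)
{
	void **hdr = malloc(sizeof(void *) + c->unit_size);

	if (!hdr)
		return NULL;
	hdr[0] = c;      /* header sits directly in front of the object */
	return hdr + 1;  /* caller only ever sees the object itself */
}

/*
 * Free an object and report the unit_size of the cache it came from.
 * The size is read from the saved cache pointer, not inferred from the
 * allocator, so it cannot change behind the caller's back.
 */
static size_t cache_free_sketch(void *ptr)
{
	void **hdr = (void **)ptr - 1;
	struct cache_sketch *c = hdr[0];
	size_t unit = c->unit_size;

	free(hdr);
	return unit;
}
```

Because there is no header in front of a zero-sized sentinel, such a
scheme has nothing to read for it, which matches the reason given above
for returning NULL on zero-sized allocations.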

Fixes: 9077fc228f ("bpf: Use kmalloc_size_roundup() to adjust size_index")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/bpf/202310302113.9f8fe705-oliver.sang@intel.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231216131052.27621-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-20 13:25:46 -08:00
Randy Dunlap dadce3fbaf PM: hibernate: Repair excess function parameter description warning
Function swsusp_close() does not have any parameters, so remove the
description of parameter @exclusive to prevent this warning.

swap.c:1573: warning: Excess function parameter 'exclusive' description in 'swsusp_close'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-20 19:19:26 +01:00
Kevin Hao e0f4bd26e2 PM: sleep: Remove obsolete comment from unlock_system_sleep()
With the freezer changes introduced by commit f5d39b0208
("freezer,sched: Rewrite core freezer logic"), the comment in
unlock_system_sleep() has become obsolete, so there is no need to
retain it.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-20 19:16:55 +01:00
Randy Dunlap 7beb82b7d5 tracing/synthetic: fix kernel-doc warnings
scripts/kernel-doc warns about using @args: for variadic arguments to
functions. Documentation/doc-guide/kernel-doc.rst says that this should
be written as @...: instead, so update the source code to match that,
preventing the warnings.

trace_events_synth.c:1165: warning: Excess function parameter 'args' description in '__synth_event_gen_cmd_start'
trace_events_synth.c:1714: warning: Excess function parameter 'args' description in 'synth_event_trace'

Link: https://lore.kernel.org/linux-trace-kernel/20231220061226.30962-1-rdunlap@infradead.org

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 35ca5207c2 ("tracing: Add synthetic event command generation functions")
Fixes: 8dcc53ad95 ("tracing: Add synth_event_trace() and related functions")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20 12:51:03 -05:00
Anna-Maria Behnsen da65f29dad timers: Fix nextevt calculation when no timers are pending
When no timer is queued into an empty timer base, the next_expiry will not
be updated. It was originally calculated as

  base->clk + NEXT_TIMER_MAX_DELTA

When the timer base stays empty long enough (> NEXT_TIMER_MAX_DELTA), the
next_expiry value of the empty base suggests that there is a timer pending
soon. This may be more of a theoretical problem, but the fix doesn't
hurt.

Use only base->next_expiry value as nextevt when timers are
pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all
information is in place, update base->next_expiry value of the empty timer
base as well.
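
The rule described above can be sketched as follows. The struct, the
function and the constant's value are assumptions made for illustration
and are not copied from kernel/time/timer.c:

```c
#include <stdbool.h>

/* Assumed value for illustration; the kernel defines its own constant. */
#define NEXT_TIMER_MAX_DELTA_SKETCH ((1UL << 30) - 1)

/* Hypothetical, heavily reduced view of a timer base. */
struct timer_base_sketch {
	unsigned long next_expiry;
	bool timers_pending;
};

/*
 * Use base->next_expiry only when timers are pending. Otherwise derive
 * the next event from the current jiffies value and refresh the stale
 * next_expiry of the empty base at the same time.
 */
static unsigned long next_event_sketch(struct timer_base_sketch *base,
				       unsigned long jiffies_now)
{
	if (base->timers_pending)
		return base->next_expiry;

	base->next_expiry = jiffies_now + NEXT_TIMER_MAX_DELTA_SKETCH;
	return base->next_expiry;
}
```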

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-13-anna-maria@linutronix.de
2023-12-20 16:49:39 +01:00
Thomas Gleixner bb8caad508 timers: Rework idle logic
To improve readability of the code, split base->idle calculation and
expires calculation into separate parts. While at it, update the comment
about timer base idle marking.

Thereby the following subtle change happens if the next event is just one
jiffy ahead and the tick was already stopped: Originally base->is_idle
remains true in this situation. Now base->is_idle turns to false. This may
spare an IPI if a timer is enqueued remotely to an idle CPU that is going
to tick on the next jiffy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-12-anna-maria@linutronix.de
2023-12-20 16:49:39 +01:00
Anna-Maria Behnsen 7a39a5080e timers: Use already existing function for forwarding timer base
There is an already existing function for forwarding the timer
base. Forwarding the timer base is implemented directly in
get_next_timer_interrupt() as well.

Remove the code duplication and invoke __forward_timer_base() instead.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-11-anna-maria@linutronix.de
2023-12-20 16:49:38 +01:00
Anna-Maria Behnsen 1e490484aa timers: Split out forward timer base functionality
Forwarding timer base is done when the next expiry value is calculated and
when a new timer is enqueued. When the next expiry value is calculated the
jiffies value is already available and does not need to be reread a second
time.

Split out the forward timer base functionality to make it usable from
both contexts: those where jiffies is already known and those where
jiffies needs to be read.

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-10-anna-maria@linutronix.de
2023-12-20 16:49:38 +01:00
Anna-Maria Behnsen 8a2c9c7e78 timers: Clarify check in forward_timer_base()
The current check whether a forward of the timer base is required can be
simplified by using an already existing comparison function which is easier
to read. The related comment is outdated and was not updated when the check
changed in commit 36cd28a4cd ("timers: Lower base clock forwarding
threshold").

Use time_before_eq() for the check and replace the comment by copying the
comment from the same check inside get_next_timer_interrupt(). Move the
precious information of the outdated comment to the proper place in
__run_timers().
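
For readers unfamiliar with the comparison helper, a wrap-safe "a is
before or equal to b" check can be sketched as a signed difference. This
is a userspace approximation with a hypothetical name, assuming the usual
two's-complement conversion, and not the kernel macro itself:

```c
/*
 * Returns nonzero when a is before or equal to b, even if the counter
 * wrapped between the two samples (valid while they are less than half
 * the counter range apart).
 */
static int before_eq_sketch(unsigned long a, unsigned long b)
{
	return (long)(b - a) >= 0;
}
```

A plain `a <= b` would give the wrong answer right after the counter
wraps, which is why the named helper is easier to trust when reading the
check.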

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-9-anna-maria@linutronix.de
2023-12-20 16:49:38 +01:00
Anna-Maria Behnsen b5e6f59888 timers: Move store of next event into __next_timer_interrupt()
Both call sites of __next_timer_interrupt() store the return value directly
in base->next_expiry. Move the store into __next_timer_interrupt() and, to
make its purpose clearer, rename the function to next_expiry_recalc().

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-8-anna-maria@linutronix.de
2023-12-20 16:49:38 +01:00
Anna-Maria Behnsen d124c3393e timers: Do not IPI for deferrable timers
Deferrable timers do not prevent CPU from going idle and are not taken into
account on idle path. Sending an IPI to a remote CPU when a new first
deferrable timer was enqueued will wake up the remote CPU but nothing will
be done regarding the deferrable timers.

Drop IPI completely when a new first deferrable timer was enqueued.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-7-anna-maria@linutronix.de
2023-12-20 16:49:38 +01:00
Anna-Maria Behnsen b573c73101 tracing/timers: Add tracepoint for tracking timer base is_idle flag
When debugging timer code, the timer tracepoints are very important. There
is no tracepoint for when the is_idle flag of the timer base changes.
Instead of always adding trace_printk() manually, add tracepoints which can
be easily enabled whenever required.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-6-anna-maria@linutronix.de
2023-12-20 16:49:38 +01:00
Anna-Maria Behnsen dbcdcb62b5 tracing/timers: Enhance timer_start tracepoint
For starting a timer, the timer is enqueued into a bucket of the timer
wheel. The bucket expiry is the de facto expiry of the timer, but it is not
equal to the timer expiry because granularity increases when the bucket is
in a higher level of the wheel. To be able to figure out in a trace whether
a timer expired in time or not, the bucket expiry time is required as well.

Add bucket expiry time to the timer_start tracepoint and thereby simplify
the arguments.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-5-anna-maria@linutronix.de
2023-12-20 16:49:38 +01:00
Anna-Maria Behnsen cbf04a2202 tick-sched: Warn when next tick seems to be in the past
When the next tick is in the past, the delta between basemono and the next
tick becomes negative. But the next tick should never be in the past. The
negative effect of a wrong next tick might be that the tick is stopped and
timers expire late.

To prevent expensive debugging when changing underlying code, add a
WARN_ON_ONCE into this code path. To prevent complete misbehaviour, also
reset next_tick to basemono in this case.
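
A minimal sketch of that guard, assuming plain 64-bit nanosecond values;
the struct and function names are hypothetical, and a one-shot flag stands
in for WARN_ON_ONCE():

```c
#include <stdint.h>

/* One-shot marker standing in for the kernel's WARN_ON_ONCE() state. */
struct tick_check_sketch {
	int warned;
};

/*
 * If next_tick lies before basemono, note it once and fall back to
 * basemono so the caller never works with a negative delta.
 */
static uint64_t sanitize_next_tick(struct tick_check_sketch *s,
				   uint64_t basemono, uint64_t next_tick)
{
	if (next_tick < basemono) {
		if (!s->warned)
			s->warned = 1;
		return basemono;
	}
	return next_tick;
}
```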

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-4-anna-maria@linutronix.de
2023-12-20 16:49:38 +01:00
Anna-Maria Behnsen 318050671a tick/sched: Cleanup confusing variables
tick_nohz_stop_tick() contains the expires (u64 variable) and tick
(ktime_t) variable. In the beginning the value of expires is written to
tick. Afterwards none of the variables is changed. They are only used for
checks.

Drop the unneeded variable tick and always use expires instead.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-3-anna-maria@linutronix.de
2023-12-20 16:49:37 +01:00
Anna-Maria Behnsen cb665db94f tick-sched: Fix function names in comments
When referencing functions in comments, it might be helpful to use full
function names (including the prefix) to be able to find it when grepping.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-2-anna-maria@linutronix.de
2023-12-20 16:49:37 +01:00
Tzvetomir Stoyanov (VMware) bce761d757 ring-buffer: Read and write to ring buffers with custom sub buffer size
As the size of the ring sub buffer page can be changed dynamically,
the logic that reads and writes to the buffer should be fixed to take
that into account. Some internal ring buffer APIs are changed:
 ring_buffer_alloc_read_page()
 ring_buffer_free_read_page()
 ring_buffer_read_page()
A new API is introduced:
 ring_buffer_read_page_data()

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-6-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.875145995@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
[ Fixed kerneldoc on data_page parameter in ring_buffer_free_read_page() ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20 07:54:56 -05:00
Tzvetomir Stoyanov (VMware) f9b94daa54 ring-buffer: Set new size of the ring buffer sub page
There are two approaches when changing the size of the ring buffer
sub page:
 1. Destroying all pages and allocating new pages with the new size.
 2. Allocating new pages, copying the content of the old pages before
    destroying them.
The first approach is easier, so it is the one selected in the proposed
implementation. Changing the ring buffer sub page size is not expected to
happen frequently. Usually, that size should be set only once,
when the buffer is not in use yet and is supposed to be empty.

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.588995543@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20 07:54:56 -05:00
Tzvetomir Stoyanov (VMware) 2808e31ec1 ring-buffer: Add interface for configuring trace sub buffer size
The trace ring buffer sub page size can be configured, per trace
instance. A new ftrace file "buffer_subbuf_order" is added to get and
set the size of the ring buffer sub page for current trace instance.
The size must be a power-of-two multiple of the system page size, which
is why the new interface works with the system page order instead of an
absolute page size: 0 means the ring buffer sub page is equal to 1 system
page, and so forth:
0 - 1 system page
1 - 2 system pages
2 - 4 system pages
...
The ring buffer sub page size is limited between 1 and 128 system
pages. The default value is 1 system page.
New ring buffer APIs are introduced:
 ring_buffer_subbuf_order_set()
 ring_buffer_subbuf_order_get()
 ring_buffer_subbuf_size_get()

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-4-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.298324722@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20 07:54:55 -05:00
Tzvetomir Stoyanov (VMware) 139f840021 ring-buffer: Page size per ring buffer
Currently the size of one sub buffer page is global for all buffers and
it is hard coded to one system page. In order to introduce configurable
ring buffer sub page size, the internal logic should be refactored to
work with sub page size per ring buffer.

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-3-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.009147038@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20 07:54:55 -05:00
Tzvetomir Stoyanov (VMware) d5cfbdfc96 ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter
In order to introduce sub-buffer size per ring buffer, some internal
refactoring is needed. As ring_buffer_print_page_header() will depend on
the trace_buffer structure, it is moved after the structure definition.

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-2-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185627.723857541@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20 07:52:47 -05:00
Andrii Nakryiko a64bfe6186 bpf: add support for passing dynptr pointer to global subprog
Add ability to pass a pointer to dynptr into global functions.
This allows having global subprogs that accept and work with generic
dynptrs that are created by caller. Dynptr argument is detected based on
the name of a struct type, if it's "bpf_dynptr", it's assumed to be
a proper dynptr pointer. Both actual struct and forward struct
declaration types are supported.

This is conceptually exactly the same semantics as
bpf_user_ringbuf_drain()'s use of dynptr to pass a variable-sized
pointer to ringbuf record. So we heavily rely on CONST_PTR_TO_DYNPTR
bits of already existing logic in the verifier.

During global subprog validation, we mark such CONST_PTR_TO_DYNPTR as
having LOCAL type, as that's the most unassuming type of dynptr and it
doesn't have any special helpers that can try to free or acquire extra
references (unlike skb, xdp, or ringbuf dynptr). So that seems like a safe
"choice" to make from correctness standpoint. It's still possible to
pass any type of dynptr to such subprog, though, because generic dynptr
helpers, like getting data/slice pointers, read/write memory copying
routines, dynptr adjustment and getter routines all work correctly with
any type of dynptr.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko 94e1c70a34 bpf: support 'arg:xxx' btf_decl_tag-based hints for global subprog args
Add support for annotating global BPF subprog arguments to provide more
information about expected semantics of the argument. Currently,
verifier relies purely on argument's BTF type information, and supports
three general use cases: scalar, pointer-to-context, and
pointer-to-fixed-size-memory.

Scalar and pointer-to-fixed-mem work well in practice and are quite
natural to use. But pointer-to-context is a bit problematic, as typical
BPF users don't realize that they need to use a special type name to
signal to verifier that argument is not just some pointer, but actually
a PTR_TO_CTX. Further, even if users do know which type to use, it is
limiting in situations where the same BPF program logic is used across
few different program types. Common case is kprobes, tracepoints, and
perf_event programs having a helper to send some data over BPF perf
buffer. bpf_perf_event_output() requires `ctx` argument, and so it's
quite cumbersome to share such global subprog across few BPF programs of
different types, necessitating extra static subprog that is context
type-agnostic.

Long story short, there is a need to go beyond types and allow users to
add hints to global subprog arguments to define expectations.

This patch adds such support for two initial special tags:
  - pointer to context;
  - non-null qualifier for generic pointer arguments.

All of the above came up in practice already and seem generally useful
additions. Non-null qualifier is an often requested feature, which
currently has to be worked around by having unnecessary NULL checks
inside subprogs even if we know that arguments are never NULL. Pointer
to context was discussed earlier.

As for implementation, we utilize btf_decl_tag attribute and set up an
"arg:xxx" convention to specify argument hint. As such:
  - btf_decl_tag("arg:ctx") is a PTR_TO_CTX hint;
  - btf_decl_tag("arg:nonnull") marks pointer argument as not allowed to
    be NULL, making NULL check inside global subprog unnecessary.
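
A minimal sketch of how such a hint could look on a subprog argument is
shown here. The ARG_NONNULL macro, struct pair and pair_sum() are
hypothetical names for illustration; the macro falls back to nothing on
compilers that do not know btf_decl_tag, so the snippet also builds as
plain user-space C. In a real BPF object the function would additionally
need to be a global (non-static, non-inlined) subprog.

```c
/*
 * Hypothetical convenience macro: expands to the "arg:nonnull" tag when
 * the compiler knows btf_decl_tag, and to nothing otherwise.
 */
#if defined(__has_attribute)
# if __has_attribute(btf_decl_tag)
#  define ARG_NONNULL __attribute__((btf_decl_tag("arg:nonnull")))
# endif
#endif
#ifndef ARG_NONNULL
# define ARG_NONNULL
#endif

struct pair {
	int a;
	int b;
};

/* With the tag, the subprog body carries no NULL check of its own. */
int pair_sum(const struct pair *p ARG_NONNULL)
{
	return p->a + p->b;
}
```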

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko f18c3d88de bpf: reuse subprog argument parsing logic for subprog call checks
Remove duplicated BTF parsing logic when it comes to subprog call check.
Instead, use (potentially cached) results of btf_prepare_func_args() to
abstract away expectations of each subprog argument in generic terms
(e.g., "this is pointer to context", or "this is a pointer to memory of
size X"), and then use those simple high-level argument type
expectations to validate actual register states to check if they match
expectations.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko c5a7244759 bpf: move subprog call logic back to verifier.c
Subprog call logic in btf_check_subprog_call() currently has both a lot
of BTF parsing logic (which is, presumably, what justified putting it
into btf.c), but also a bunch of register state checks, some of which
utilize deep verifier logic helpers, necessarily exported from
verifier.c: check_ptr_off_reg(), check_func_arg_reg_off(),
and check_mem_reg().

Going forward, btf_check_subprog_call() will have a minimum of
BTF-related logic, but will get more internal verifier logic related to
register state manipulation. So move it into verifier.c to minimize
amount of verifier-specific logic exposed to btf.c.

We do this move before refactoring btf_check_func_arg_match() to
preserve as much history post-refactoring as possible.

No functional changes.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko e26080d0da bpf: prepare btf_prepare_func_args() for handling static subprogs
Generalize btf_prepare_func_args() to support both global and static
subprogs. We are going to utilize this property in the next patch,
reusing btf_prepare_func_args() for subprog call logic instead of
reparsing BTF information in a completely separate implementation.

btf_prepare_func_args() now detects whether subprog is global or static
and makes slight logic adjustments for static func cases, like not failing
fatally (-EFAULT) for conditions that are allowable for static subprogs.

Somewhat subtle (but major!) difference is the handling of pointer arguments.
Both global and static functions need to handle special context
arguments (which are pointers to predefined type names), but static
subprogs give up on any other pointers, falling back to marking subprog
as "unreliable", disabling the use of BTF type information altogether.

For global functions, though, we are assuming that such pointers to
unrecognized types are just pointers to fixed-sized memory region (or
error out if size cannot be established, like for `void *` pointers).

This patch accommodates these small differences and sets up a stage for
refactoring in the next patch, eliminating a separate BTF-based parsing
logic in btf_check_func_arg_match().

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko 5eccd2db42 bpf: reuse btf_prepare_func_args() check for main program BTF validation
Instead of btf_check_subprog_arg_match(), use btf_prepare_func_args()
logic to validate "trustworthiness" of main BPF program's BTF information,
if it is present.

We ignored results of original BTF check anyway, often times producing
confusing and ominously-sounding "reg type unsupported for arg#0
function" message, which has no apparent effect on program correctness
and verification process.

All the -EFAULT returning sanity checks are already performed in
check_btf_info_early(), so there is zero reason to have this duplication
of logic between btf_check_subprog_call() and btf_check_subprog_arg_match().
Dropping btf_check_subprog_arg_match() simplifies
btf_check_func_arg_match() further removing `bool processing_call` flag.

One subtle bit that was done by btf_check_subprog_arg_match() was
potentially marking main program's BTF as unreliable. We do this
explicitly now with a dedicated simple check, preserving the original
behavior, but now based on well factored btf_prepare_func_args() logic.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko 4ba1d0f234 bpf: abstract away global subprog arg preparation logic from reg state setup
btf_prepare_func_args() is used to understand expectations and
restrictions on global subprog arguments. But current implementation is
hard to extend, as it intermixes BTF-based func prototype parsing and
interpretation logic with setting up register state at subprog entry.

Worse still, those registers are not completely set up inside
btf_prepare_func_args(), requiring some more logic later in
do_check_common(). Like calling mark_reg_unknown() and similar
initialization operations.

This intermixing of BTF interpretation and register state setup is
problematic. First, it causes duplication of BTF parsing logic for global
subprog verification (to set up initial state of global subprog) and
global subprog call sites analysis (when we need to check that whatever
is being passed into global subprog matches expectations), performed in
btf_check_subprog_call().

Given we want to extend global func argument with tags later, this
duplication is problematic. So refactor btf_prepare_func_args() to do
only BTF-based func proto and args parsing, returning high-level
argument "expectations" only, with no regard to specifics of register
state. I.e., if it's a context argument, instead of setting register
state to PTR_TO_CTX, we return ARG_PTR_TO_CTX enum for that argument as
"an argument specification" for further processing inside
do_check_common(). Similarly for SCALAR arguments, PTR_TO_MEM, etc.

This allows reusing btf_prepare_func_args() in the following patches at
global subprog call site analysis time. It also keeps register setup
code consistently in one place, do_check_common().

Besides all this, we cache this argument specs information inside
env->subprog_info, eliminating the need to redo these potentially
expensive BTF traversals, especially if BPF program's BTF is big and/or
there are lots of global subprog calls.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Menglong Dong d028f87517 bpf: make the verifier tracks the "not equal" for regs
We can derive some new information for BPF_JNE in regs_refine_cond_op().
Take following code for example:

  /* The type of "a" is u32 */
  if (a > 0 && a < 100) {
    /* the range of the register for a is [0, 99], not [1, 99],
     * and will cause the following error:
     *
     *   invalid zero-sized read
     *
     * as a can be 0.
     */
    bpf_skb_store_bytes(skb, xx, xx, a, 0);
  }

In the code above, "a > 0" will be compiled to "jmp xxx if a == 0". In the
TRUE branch, the dst_reg will be marked as known to be 0. However, in the
fallthrough (FALSE) branch, the dst_reg will not be handled, which leaves
the [min, max] range for a as [0, 99], not [1, 99].

For BPF_JNE, we can reduce the range of the dst reg if the src reg is a
const and is exactly the edge of the dst reg.
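
The edge rule above can be sketched in user-space C as follows. The
struct u32_range type and refine_not_equal() are hypothetical stand-ins
for the verifier's register bounds and are not the code in
regs_refine_cond_op(); only unsigned 32-bit bounds are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the unsigned bounds tracked per register. */
struct u32_range {
	uint32_t min;
	uint32_t max;
};

/*
 * Refine the range of a register in the branch where "reg != val" is
 * known to hold. Only an edge value can be trimmed; a value strictly
 * inside the range would split it in two, so it is left untouched.
 */
struct u32_range refine_not_equal(struct u32_range r, uint32_t val)
{
	assert(r.min <= r.max);

	if (r.min == r.max)
		return r;	/* single value: nothing left to trim */
	if (val == r.min)
		r.min++;
	else if (val == r.max)
		r.max--;
	return r;
}
```

For the example in this message, refining [0, 99] with "a != 0" yields
[1, 99], which is what removes the zero-sized read error.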

Signed-off-by: Menglong Dong <menglong8.dong@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231219134800.1550388-2-menglong8.dong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 17:18:55 -08:00
Kevin Hao f17f2c13d6 module: Remove redundant TASK_UNINTERRUPTIBLE
TASK_KILLABLE already includes TASK_UNINTERRUPTIBLE, so there is no
need to add a separate TASK_UNINTERRUPTIBLE.
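
The redundancy can be checked with a small user-space sketch. The flag
values below are illustrative assumptions that mirror the usual layout
(TASK_KILLABLE being TASK_WAKEKILL OR-ed with TASK_UNINTERRUPTIBLE); the
authoritative definitions are in include/linux/sched.h. The helper
state_flag_is_redundant() is a hypothetical name.

```c
#include <assert.h>

/* Illustrative values only; see include/linux/sched.h for the real ones. */
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_WAKEKILL		0x0100
#define TASK_KILLABLE		(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

/* Return 1 if OR-ing "extra" into "state" would not change it. */
int state_flag_is_redundant(unsigned int state, unsigned int extra)
{
	assert(extra != 0);

	return (state | extra) == state;
}
```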

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-12-19 13:30:28 -08:00
Linus Torvalds 55cb5f4368 tracing fix for 6.7-rc6
While working on the ring buffer, I found one more bug with the timestamp
 code, and the fix for this removed the need for the final 64-bit cmpxchg!
 
 The ring buffer events hold a "delta" from the previous event. If it is
 determined that the delta can not be calculated, it falls back to adding an
 absolute timestamp value. The way to know if the delta can be used is via
 two stored timestamps in the per-cpu buffer meta data:
 
  before_stamp and write_stamp
 
 The before_stamp is written by every event before it tries to allocate its
 space on the ring buffer. The write_stamp is written after it allocates its
 space and knows that nothing came in after it read the previous
 before_stamp and write_stamp and the two matched.
 
 A previous fix dd93942570 ("ring-buffer: Do not try to put back
 write_stamp") removed putting back the write_stamp to match the
 before_stamp so that the next event could use the delta, but races were
 found where the two would match, but not be for the previous event.
 
 It was determined to allow the event reservation to not have a valid
 write_stamp when it is finished, and this fixed a lot of races.
 
 The last use of the 64-bit timestamp cmpxchg depended on the write_stamp
 being valid after an interruption. But this is no longer the case, as if an
 event is interrupted by a softirq that writes an event, and that event gets
 interrupted by a hardirq or NMI and that writes an event, then the softirq
 could finish its reservation without a valid write_stamp.
 
 In the slow path of the event reservation, a delta can still be used if the
 write_stamp is valid. Instead of using a cmpxchg against the write stamp,
 the before_stamp needs to be read again to validate the write_stamp. The
 cmpxchg is not needed.
 
 This updates the slowpath to validate the write_stamp by comparing it to
 the before_stamp and removes all rb_time_cmpxchg() as there are no more
 users of that function.
 
 The removal of the 32-bit updates of rb_time_t will be done in the next
 merge window.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZYHVxhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhk5AQDT56Uis34ewzeEzkwBSs8nsV2HDhnA
 d0CU4BHsf0GUVQD9E2eWVbIB9z8MiQwNMvKslpFJYmGCzr359pCMzoOmcws=
 =0rcD
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:
 "While working on the ring buffer, I found one more bug with the
  timestamp code, and the fix for this removed the need for the final
  64-bit cmpxchg!

  The ring buffer events hold a "delta" from the previous event. If it
  is determined that the delta can not be calculated, it falls back to
  adding an absolute timestamp value. The way to know if the delta can
  be used is via two stored timestamps in the per-cpu buffer meta data:

   before_stamp and write_stamp

  The before_stamp is written by every event before it tries to allocate
  its space on the ring buffer. The write_stamp is written after it
  allocates its space and knows that nothing came in after it read the
  previous before_stamp and write_stamp and the two matched.

  A previous fix dd93942570 ("ring-buffer: Do not try to put back
  write_stamp") removed putting back the write_stamp to match the
  before_stamp so that the next event could use the delta, but races
  were found where the two would match, but not be for the previous
  event.

  It was determined to allow the event reservation to not have a valid
  write_stamp when it is finished, and this fixed a lot of races.

  The last use of the 64-bit timestamp cmpxchg depended on the
  write_stamp being valid after an interruption. But this is no longer
  the case, as if an event is interrupted by a softirq that writes an
  event, and that event gets interrupted by a hardirq or NMI and that
  writes an event, then the softirq could finish its reservation without
  a valid write_stamp.

  In the slow path of the event reservation, a delta can still be used
  if the write_stamp is valid. Instead of using a cmpxchg against the
  write stamp, the before_stamp needs to be read again to validate the
  write_stamp. The cmpxchg is not needed.

  This updates the slowpath to validate the write_stamp by comparing it
  to the before_stamp and removes all rb_time_cmpxchg() as there are no
  more users of that function.

  The removal of the 32-bit updates of rb_time_t will be done in the
  next merge window"

* tag 'trace-v6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Fix slowpath of interrupted event
2023-12-19 12:25:43 -08:00
Chen Haonan 489c693bd0 PM: hibernate: Use kmap_local_page() in copy_data_page()
kmap_atomic() has been deprecated in favor of kmap_local_page().

kmap_atomic() disables page-faults and preemption (the latter
only for !PREEMPT_RT kernels). The code between the mapping and
un-mapping in this patch does not depend on the above-mentioned
side effects. So simply replace kmap_atomic() with kmap_local_page().

Signed-off-by: Chen Haonan <chen.haonan2@zte.com.cn>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-19 20:55:16 +01:00
Andrii Nakryiko d17aff807f Revert BPF token-related functionality
This patch includes the following reverts (one conflicting BPF FS
patch and three token patch sets, represented by merge commits):
  - revert 0f5d5454c7 "Merge branch 'bpf-fs-mount-options-parsing-follow-ups'";
  - revert 750e785796 "bpf: Support uid and gid when mounting bpffs";
  - revert 733763285a "Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'";
  - revert c35919dcce "Merge branch 'bpf-token-and-bpf-fs-based-delegation'".

Link: https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2023-12-19 08:23:03 -08:00
Steven Rostedt (Google) f50345b49b ring-buffer: Check if absolute timestamp goes backwards
The check_buffer() which checks the timestamps of the ring buffer
sub-buffer page, when enabled, only checks if the adding of deltas of the
events from the last absolute timestamp or the timestamp of the sub-buffer
page adds up to the current event.

What it does not check is if the absolute timestamp causes the time of the
events to go backwards, as that can cause issues elsewhere.

Test for the timestamp going backwards too.
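
A simplified user-space sketch of the two checks is given here. The
struct ts_entry layout and timestamps_are_sane() are hypothetical and
much simpler than the real check_buffer(); they only illustrate replaying
deltas while also rejecting an absolute timestamp that moves time
backwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one entry in a sub-buffer. */
struct ts_entry {
	int is_absolute;	/* 1: value is an absolute timestamp */
	uint64_t value;		/* absolute timestamp or delta */
};

/*
 * Replay the entries starting from "page_ts". Return 1 if the running
 * timestamp never goes backwards and ends at "expected_ts", else 0.
 */
int timestamps_are_sane(const struct ts_entry *e, size_t n,
			uint64_t page_ts, uint64_t expected_ts)
{
	uint64_t ts = page_ts;
	size_t i;

	assert(e != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (e[i].is_absolute) {
			if (e[i].value < ts)
				return 0;	/* time went backwards */
			ts = e[i].value;
		} else {
			ts += e[i].value;
		}
	}
	return ts == expected_ts;
}
```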

This also fixes a slight issue where if the warning triggers at boot up
(because of the resetting of the tsc), it will disable all further checks,
even those that are after boot. Have it continue checking if the warning
was ignored during boot up.

Link: https://lore.kernel.org/linux-trace-kernel/20231219074732.18b092d4@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-19 07:49:47 -05:00
Steven Rostedt (Google) d40dbb617a ring-buffer: Add interrupt information to dump of data sub-buffer
When the ring buffer timestamp verifier triggers, it dumps the content of
the sub-buffer. But currently it only dumps the timestamps and the offset
of the data as well as the deltas. It would be even more informative if
the event data also showed the interrupt context level it was in.

That is, each event would show whether it was written in normal,
softirq, irq or NMI context, giving a better idea of how the events may
have interrupted each other.

As the payload of the ring buffer is really a black box of the ring
buffer, just assume that if the payload is larger than a trace entry, that
it is a trace entry. As trace entries have the interrupt context
information saved in a flags field, look at that location and report the
output of the flags.

If the payload is not a trace entry, there's no way to really know, and
the information will be garbage. But that's OK, because this is for
debugging only (this output is not used in production as the buffer check
that calls it causes a huge overhead to the tracing). This information,
when available, is crucial for debugging timestamp issues. If it's
garbage, it will also be pretty obvious that it's garbage.

As this output usually happens in kselftests of the tracing code, the user
will know what the payload is at the time.

Link: https://lore.kernel.org/linux-trace-kernel/20231219074542.6f304601@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-19 07:49:22 -05:00
Steven Rostedt (Google) c84897c0ff ring-buffer: Remove 32bit timestamp logic
Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.
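
The 27-bit limit can be illustrated with a short user-space sketch. The
names TS_DELTA_BITS, delta_fits_in_event() and pack_delta() are
assumptions for this example and do not come from the ring buffer code.

```c
#include <assert.h>
#include <stdint.h>

#define TS_DELTA_BITS	27	/* width of the per-event delta field */

/* Return 1 if the delta fits in the 27-bit field of an event. */
int delta_fits_in_event(uint64_t delta)
{
	return delta < (UINT64_C(1) << TS_DELTA_BITS);
}

/* Narrow a delta to the event field; the caller must check it fits. */
uint32_t pack_delta(uint64_t delta)
{
	assert(delta_fits_in_event(delta));
	return (uint32_t)delta;
}
```

When delta_fits_in_event() returns 0, an absolute timestamp would be
recorded instead of the delta.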

Until a389d86f7f ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, the deltas
of all the events would be zero, to avoid dealing with the races that
would otherwise need to be handled. Commit a389d86f7f changed that to
handle the races, so that all events, even those that preempt other
events, still have an accurate timestamp.

To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.

Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.

After some more refactoring of the code, all 4 64-bit cmpxchg were removed:

 https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
 https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
 https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
 https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home/

With all the 64-bit cmpxchg removed, the complex 32-bit workaround can also be
removed.

The 32-bit and 64-bit logic is now exactly the same.

Link: https://lore.kernel.org/all/20231213214632.15047c40@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231219074303.28f9abda@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-19 07:47:04 -05:00
Steven Rostedt (Google) 76ca20c748 tracing: Increase size of trace_marker_raw to max ring buffer entry
There's no reason to give an arbitrary limit to the size of a raw trace
marker. Just let it be as big as the size that is allowed by the ring
buffer itself.

And there's also no reason to artificially break up the write to
TRACE_BUF_SIZE, as that's not even used.

Link: https://lore.kernel.org/linux-trace-kernel/20231213104218.2efc70c1@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18 23:14:16 -05:00
Steven Rostedt (Google) 9482341d9b tracing: Have trace_marker break up by lines by size of trace_seq
If a trace_marker write is bigger than what trace_seq can hold, then it
will print "LINE TOO BIG" message and not what was written.

Instead, check if the write is bigger than the trace_seq and break it
up by that size.
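
The splitting idea can be sketched in user-space C as below. The helper
for_each_chunk() and its callback are hypothetical; the real code works on
trace entries and the trace_seq size, neither of which is modelled here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk a write of "len" bytes in pieces of at most "max_chunk" bytes,
 * calling "emit" (if not NULL) for each piece. Returns the number of
 * pieces produced.
 */
size_t for_each_chunk(const char *buf, size_t len, size_t max_chunk,
		      void (*emit)(const char *piece, size_t piece_len))
{
	size_t count = 0;

	assert(max_chunk > 0);

	while (len > 0) {
		size_t n = len < max_chunk ? len : max_chunk;

		if (emit)
			emit(buf, n);
		buf += n;
		len -= n;
		count++;
	}
	return count;
}
```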

Ideally, we could make the trace_seq dynamic that could hold this. But
that's for another time.

Link: https://lore.kernel.org/linux-trace-kernel/20231212190422.1eaf224f@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18 23:14:16 -05:00
Steven Rostedt (Google) 40fc60e36c trace_seq: Increase the buffer size to almost two pages
Now that trace_marker can hold more than 1KB string, and can write as much
as the ring buffer can hold, the trace_seq is not big enough to hold
writes:

 ~# a="1234567890"
 ~# cnt=4080
 ~# s=""
 ~# while [ $cnt -gt 10 ]; do
 ~#	s="${s}${a}"
 ~#	cnt=$((cnt-10))
 ~# done
 ~# echo $s > trace_marker
 ~# cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 2/2   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
            <...>-860     [002] .....   105.543465: tracing_mark_write[LINE TOO BIG]
            <...>-860     [002] .....   105.543496: tracing_mark_write: 789012345678901234567890

By increasing the trace_seq buffer to almost two pages, it can now print
out the first line.

This also subtracts the rest of the trace_seq fields from the buffer, so
that the entire trace_seq is now PAGE_SIZE aligned.

Link: https://lore.kernel.org/linux-trace-kernel/20231209175220.19867af4@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18 23:14:16 -05:00
Steven Rostedt (Google) 8ec90be7f1 tracing: Allow for max buffer data size trace_marker writes
Allow a trace write to be as big as the ring buffer tracing data will
allow. Currently, it only allows writes of 1KB in size, but there's no
reason that it cannot allow what the ring buffer can hold.

Link: https://lore.kernel.org/linux-trace-kernel/20231212131901.5f501e72@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18 23:14:16 -05:00
Steven Rostedt (Google) 0b9036efd8 ring-buffer: Add offset of events in dump on mismatch
For bugs where the ring buffer timestamps get out of sync, the config
CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS checks for this and, if it is
detected, causes a dump of the bad sub buffer.

It shows each event and its timestamp as well as the delta in the event.
But it's also good to see the offset into the subbuffer for that event to
know how close to the end it is.

Also print where the last event actually ended compared to where it was
expected to end.

Link: https://lore.kernel.org/linux-trace-kernel/20231211131623.59eaebd2@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18 23:14:16 -05:00
Steven Rostedt (Google) d23569979c tracing: Allow creating instances with specified system events
A trace instance may only need to enable specific events. As the eventfs
directory of an instance currently creates all events which adds overhead,
allow internal instances to be created with just the events in systems
that they care about. This currently only deals with systems and not
individual events, but this should bring down the overhead of creating
instances for specific use cases quite a bit.

The trace_array_get_by_name() now has another parameter "systems". This
parameter is a const string pointer of a comma/space separated list of
event systems that should be created by the trace_array. (Note if the
trace_array already exists, this parameter is ignored).

The list of systems is saved and if a module is loaded, its events will
not be added unless the system for those events also matches the systems
string.
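
The filtering described above can be sketched in userspace C. This is only
an illustration, not the kernel's implementation: system_in_list() is a
hypothetical helper, and treating a NULL list as "no filter" is an
assumption made for the example.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not kernel code): return true if `system` appears
 * as a whole token in `systems`, a comma- and/or space-separated list.
 * Assumption: a NULL list means "no filter", so every system matches.
 */
static bool system_in_list(const char *systems, const char *system)
{
	size_t want = strlen(system);
	const char *p = systems;

	if (!systems)
		return true;

	while (*p) {
		/* Skip any run of separators. */
		p += strspn(p, ", ");

		/* Measure the next token (0 at the end of the string). */
		size_t len = strcspn(p, ", ");

		if (len > 0 && len == want && strncmp(p, system, len) == 0)
			return true;
		p += len;
	}
	return false;
}
```

Matching whole tokens rather than substrings keeps a request for "sched"
from also pulling in a system that merely starts with those letters.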

Link: https://lore.kernel.org/linux-trace-kernel/20231213093701.03fddec0@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: Arun Easi <aeasi@marvell.com>
Cc: Daniel Wagner <dwagner@suse.de>
Tested-by: Dmytro Maluka <dmaluka@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18 23:14:16 -05:00
Steven Rostedt (Google) b803d7c664 ring-buffer: Fix slowpath of interrupted event
To synchronize the timestamps with the ring buffer reservation, there are
two timestamps that are saved in the buffer meta data.

1. before_stamp
2. write_stamp

When the two are equal, the write_stamp is considered valid, as in, it may
be used to calculate the delta of the next event as the write_stamp is the
timestamp of the previous reserved event on the buffer.

This is done by the following:

 /*A*/	w = current position on the ring buffer
	before = before_stamp
	after = write_stamp
	ts = read current timestamp

	if (before != after) {
		write_stamp is not valid, force adding an absolute
		timestamp.
	}

 /*B*/	before_stamp = ts

 /*C*/	write = local_add_return(event length, position on ring buffer)

	if (w == write - event length) {
		/* Nothing interrupted between A and C */
 /*E*/		write_stamp = ts;
		delta = ts - after
		/*
		 * If nothing interrupted again,
		 * before_stamp == write_stamp and write_stamp
		 * can be used to calculate the delta for
		 * events that come in after this one.
		 */
	} else {

		/*
		 * The slow path!
		 * Was interrupted between A and C.
		 */

This is where the bug is. We currently have:

		after = write_stamp
		ts = read current timestamp

 /*F*/		if (write == current position on the ring buffer &&
		    after < ts && cmpxchg(write_stamp, after, ts)) {

			delta = ts - after;

		} else {
			delta = 0;
		}

The assumption is that if the current position on the ring buffer hasn't
moved between C and F, then it also was not interrupted, and that the last
event written has a timestamp that matches the write_stamp. That is the
write_stamp is valid.

But this may not be the case:

Suppose a task context event is interrupted by a softirq between B and C,

the softirq writes an event that is itself interrupted by a hard irq
between C and E,

and the hard irq writes an event (it does not need to be interrupted).

We have:

 /*B*/ before_stamp = ts of normal context

   ---> interrupted by softirq

	/*B*/ before_stamp = ts of softirq context

	  ---> interrupted by hardirq

		/*B*/ before_stamp = ts of hard irq context
		/*E*/ write_stamp = ts of hard irq context

		/* matches and write_stamp valid */
	  <----

	/*E*/ write_stamp = ts of softirq context

	/* No longer matches before_stamp, write_stamp is not valid! */

   <---

 w != write - length, go to slow path

// Right now the order of events in the ring buffer is:
//
// |-- softirq event --|-- hard irq event --|-- normal context event --|
//

 after = write_stamp (this is the ts of softirq)
 ts = read current timestamp

 if (write == current position on the ring buffer [true] &&
     after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {

	delta = ts - after  [Wrong!]

The delta is to be between the hard irq event and the normal context
event, but the above logic made the delta between the softirq event and
the normal context event, where the hard irq event is between the two. This
will shift all the remaining event timestamps on the sub-buffer
incorrectly.

The write_stamp is only valid if it matches the before_stamp. The cmpxchg
does nothing to help this.

Instead, the following logic can be done to fix this:

	before = before_stamp
	ts = read current timestamp
	before_stamp = ts

	after = write_stamp

	if (write == current position on the ring buffer &&
	    after == before && after < ts) {

		delta = ts - after

	} else {
		delta = 0;
	}

The above will only use the write_stamp if it still matches before_stamp
and was tested to not have changed since C.

As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!

This means the 32-bit rb_time_t workaround can finally be removed. But
that's for a later time.
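
As a rough userspace model of the difference (not the kernel code), the two
slow-path decisions can be written as pure functions. The function names
and the tail_unmoved flag are made up for this sketch, and the old variant
assumes the cmpxchg succeeds, as it does in the scenario above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old slow path, simplified: trusts write_stamp (`after`) whenever the
 * write position has not moved and the stamp is older than `ts`.
 * Assumes the cmpxchg on write_stamp succeeds.
 */
static uint64_t old_slow_path_delta(uint64_t after, uint64_t ts,
				    bool tail_unmoved)
{
	if (tail_unmoved && after < ts)
		return ts - after;
	return 0;
}

/*
 * Fixed slow path, simplified: `before` is before_stamp as read prior to
 * overwriting it with `ts`. write_stamp is only used when it still equals
 * before_stamp. A return value of 0 means "write_stamp is not trusted".
 */
static uint64_t slow_path_delta(uint64_t before, uint64_t after,
				uint64_t ts, bool tail_unmoved)
{
	if (tail_unmoved && after == before && after < ts)
		return ts - after;
	return 0;
}
```

With illustrative timestamps of 20 for the softirq and 30 for the hard irq,
the normal context at ts 40 sees before_stamp == 30 and write_stamp == 20.
The old check yields a delta of 20 against the softirq event, while the
fixed check sees the mismatch and returns 0.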

Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: dd93942570 ("ring-buffer: Do not try to put back write_stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18 23:12:22 -05:00
Jakub Kicinski c49b292d03 netdev
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-12-18

This PR is larger than usual and contains changes in various parts
of the kernel.

The main changes are:

1) Fix kCFI bugs in BPF, from Peter Zijlstra.

End result: all forms of indirect calls from BPF into kernel
and from kernel into BPF work with CFI enabled. This allows BPF
to work with CONFIG_FINEIBT=y.

2) Introduce BPF token object, from Andrii Nakryiko.

It adds an ability to delegate a subset of BPF features from privileged
daemon (e.g., systemd) through special mount options for userns-bound
BPF FS to a trusted unprivileged application. The design accommodates
suggestions from Christian Brauner and Paul Moore.

Example:
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
             -o delegate_cmds=prog_load:MAP_CREATE \
             -o delegate_progs=kprobe \
             -o delegate_attachs=xdp

3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei.

 - Complete precision tracking support for register spills
 - Fix verification of possibly-zero-sized stack accesses
 - Fix access to uninit stack slots
 - Track aligned STACK_ZERO cases as imprecise spilled registers.
   It improves the verifier "instructions processed" metric from single
   digit to 50-60% for some programs.
 - Fix verifier retval logic

4) Support for VLAN tag in XDP hints, from Larysa Zaremba.

5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu.

End result: better memory utilization and lower I$ miss for calls to BPF
via BPF trampoline.

6) Fix race between BPF prog accessing inner map and parallel delete,
from Hou Tao.

7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu.

It allows BPF to interact with IPSEC infra. The intent is to support
software RSS (via XDP) for the upcoming ipsec pcpu work.
Experiments on AWS demonstrate single tunnel pcpu ipsec reaching
line rate on 100G ENA nics.

8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao.

9) BPF file verification via fsverity, from Song Liu.

It allows BPF progs to get the fsverity digest.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits)
  bpf: Ensure precise is reset to false in __mark_reg_const_zero()
  selftests/bpf: Add more uprobe multi fail tests
  bpf: Fail uprobe multi link with negative offset
  selftests/bpf: Test the release of map btf
  s390/bpf: Fix indirect trampoline generation
  selftests/bpf: Temporarily disable dummy_struct_ops test on s390
  x86/cfi,bpf: Fix bpf_exception_cb() signature
  bpf: Fix dtor CFI
  cfi: Add CFI_NOSEAL()
  x86/cfi,bpf: Fix bpf_struct_ops CFI
  x86/cfi,bpf: Fix bpf_callback_t CFI
  x86/cfi,bpf: Fix BPF JIT call
  cfi: Flip headers
  selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment
  selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test
  selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment
  bpf: Limit the number of kprobes when attaching program to multiple kprobes
  bpf: Limit the number of uprobes when attaching program to multiple uprobes
  bpf: xdp: Register generic_kfunc_set with XDP programs
  selftests/bpf: utilize string values for delegate_xxx mount options
  ...
====================

Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-18 16:46:08 -08:00
Andrii Nakryiko 8e432e6197 bpf: Ensure precise is reset to false in __mark_reg_const_zero()
It is safe to always start with imprecise SCALAR_VALUE register.
Previously __mark_reg_const_zero() relied on caller to reset precise
mark, but it's very error prone and we already missed it in a few
places. So instead make __mark_reg_const_zero() reset precision always,
as it's a safe default for SCALAR_VALUE. Explanation is basically the
same as for why we are resetting (or rather not setting) precision in
current state. If necessary, precision propagation will set it to
precise correctly.

As such, also remove a big comment about forward precision propagation
in mark_reg_stack_read() and avoid unnecessarily setting precision to
true after reading from STACK_ZERO stack. Again, precision propagation
will correctly handle this, if that SCALAR_VALUE register ever needs to
be precise.

Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231218173601.53047-1-andrii@kernel.org
2023-12-18 23:54:21 +01:00
Rae Moar d81f0d7b8b kunit: add KUNIT_INIT_TABLE to init linker section
Add KUNIT_INIT_TABLE to the INIT_DATA linker section.

Alter the KUnit macros to create init tests:
kunit_test_init_section_suites

Update lib/kunit/executor.c to run both the suites in KUNIT_TABLE and
KUNIT_INIT_TABLE.

Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Rae Moar <rmoar@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-12-18 13:21:15 -07:00
Jiri Olsa 3983c00281 bpf: Fail uprobe multi link with negative offset
Currently the __uprobe_register will return 0 (success) when called with
negative offset. The reason is that the call to register_for_each_vma and
then build_map_info won't return error for negative offset. They just won't
do anything - no matching vma is found so there's no registered breakpoint
for the uprobe.

I don't think we can change the behaviour of __uprobe_register and fail
for negative uprobe offset, because apps might depend on that already.

But I think we can still make the change and check for it on bpf multi
link syscall level.

Also move the __get_user call and check for the offsets to the top of
loop, to fail early without extra __get_user calls for ref_ctr_offset
and cookie arrays.
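
A minimal sketch of the fail-early idea, written as plain userspace C. The
helper name check_uprobe_offsets() is hypothetical, the offsets are taken
from an ordinary array rather than fetched with __get_user, and returning
-EINVAL is an assumption, since the message only says the link should fail.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Illustrative check only: reject the whole request as soon as one offset
 * is negative, before any further per-entry work is done.
 * Assumption: -EINVAL is used as the error code.
 */
static int check_uprobe_offsets(const long long *offsets, size_t cnt)
{
	for (size_t i = 0; i < cnt; i++) {
		if (offsets[i] < 0)
			return -EINVAL;
	}
	return 0;
}
```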

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20231217215538.3361991-2-jolsa@kernel.org
2023-12-18 09:51:30 -08:00
Linus Torvalds 177c2ffe69 - Avoid iterating over newly created group leader event's siblings
because there are none, and thus prevent a lockdep splat
Merge tag 'perf_urgent_for_v6.7_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Avoid iterating over newly created group leader event's siblings
   because there are none, and thus prevent a lockdep splat

* tag 'perf_urgent_for_v6.7_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix perf_event_validate_size() lockdep splat
2023-12-17 14:03:11 -08:00
Linus Torvalds 134fdb80bc cxl fixes for 6.7-rc6
Merge tag 'cxl-fixes-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull CXL (Compute Express Link) fixes from Dan Williams:
 "A collection of CXL fixes.

  The touch outside of drivers/cxl/ is for a helper that allocates
  physical address space. Device hotplug tests showed that the driver
  failed to utilize (skipped over) valid capacity when allocating a new
  memory region. Outside of that, new tests uncovered a small crop of
  lockdep reports.

  There are also some miscellaneous error path and leak fixups that are
  not urgent, but useful to clean up now.

   - Fix alloc_free_mem_region()'s scan for address space, prevent false
     negative out-of-space events

   - Fix sleeping lock acquisition from CXL trace event (atomic context)

   - Fix put_device() leak for the new CXL PMU driver

   - Fix wrong pointer freed on error path

   - Fixup several lockdep reports (missing lock hold) from new
     assertion in cxl_num_decoders_committed() and new tests"

* tag 'cxl-fixes-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/pmu: Ensure put_device on pmu devices
  cxl/cdat: Free correct buffer on checksum error
  cxl/hdm: Fix dpa translation locking
  kernel/resource: Increment by align value in get_free_mem_region()
  cxl: Add cxl_num_decoders_committed() usage to cxl_test
  cxl/memdev: Hold region_rwsem during inject and clear poison ops
  cxl/core: Always hold region_rwsem while reading poison lists
  cxl/hdm: Fix a benign lockdep splat
2023-12-17 09:07:34 -08:00
Naveen N Rao 9c556b7c3f trace/kprobe: Display the actual notrace function when rejecting a probe
Trying to probe update_sd_lb_stats() using perf results in the below
message in the kernel log:
	trace_kprobe: Could not probe notrace function _text

This is because 'perf probe' specifies the kprobe location as an offset
from '_text':
	$ sudo perf probe -D update_sd_lb_stats
	p:probe/update_sd_lb_stats _text+1830728

However, the error message is misleading and doesn't help convey the
actual notrace function that is being probed. Fix this by looking up the
actual function name that is being probed. With this fix, we now get the
below message in the kernel log:
	trace_kprobe: Could not probe notrace function update_sd_lb_stats.constprop.0

Link: https://lore.kernel.org/all/20231214051702.1687300-1-naveen@kernel.org/

Signed-off-by: Naveen N Rao <naveen@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-17 16:28:54 +09:00
Linus Torvalds 3b8a9b2e68 Tracing fixes for v6.7-rc5:
Merge tag 'trace-v6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix eventfs to check creating new files for events with names greater
   than NAME_MAX. The eventfs lookup needs to check the return result of
   simple_lookup().

 - Fix the ring buffer to check the proper max data size. Events must be
   able to fit on the ring buffer sub-buffer, if it cannot, then it
   fails to be written and the logic to add the event is avoided. The
   code to check if an event can fit failed to add the possible absolute
   timestamp which may make the event not be able to fit. This causes
   the ring buffer to go into an infinite loop trying to find a
   sub-buffer that would fit the event. Luckily, there's a check that
   will bail out if it looped over 1000 times and it also warns.

   The real fix is not to add the absolute timestamp to an event that is
   starting at the beginning of a sub-buffer because it uses the
   sub-buffer timestamp.

   Avoiding the timestamp at the start of the sub-buffer allows events
   that pass the first check to always find a sub-buffer that they can
   fit on.

 - Have large events that do not fit on a trace_seq to print "LINE TOO
   BIG" like it does for the trace_pipe instead of what it does now
   which is to silently drop the output.

 - Fix a memory leak caused by forgetting to free the spare page that is
   saved by a trace instance.

 - Update the size of the snapshot buffer when the main buffer is
   updated if the snapshot buffer is allocated.

 - Fix ring buffer timestamp logic by removing all the places that tried
   to put the before_stamp back to the write stamp so that the next
   event doesn't add an absolute timestamp. But each of these updates
   added a race whereby making the two timestamps equal validated the
   write_stamp, so that it could be incorrectly used for calculating the
   delta of an event.

 - There's a temp buffer used for printing the event that was using the
   event data size for allocation when it needed to use the size of the
   entire event (meta-data and payload data)

 - For hardening, use "%.*s" for printing the trace_marker output, to
   limit the amount that is printed by the size of the event. This was
   discovered by development that added a bug that truncated the '\0'
   and caused a crash.

 - Fix a use-after-free bug in the use of the histogram files when an
   instance is being removed.

 - Remove a useless update in the rb_try_to_discard of the write_stamp.
   The before_stamp was already changed to force the next event to add
   an absolute timestamp, so the write_stamp is not used. But the
   write_stamp is modified again using an unneeded 64-bit cmpxchg.

 - Fix several races in the 32-bit implementation of the
   rb_time_cmpxchg() that does a 64-bit cmpxchg.

 - While looking at fixing the 64-bit cmpxchg, I noticed that because
   the ring buffer uses normal cmpxchg, and this can be done in NMI
   context, there are some architectures that do not have a working
   cmpxchg in NMI context. For these architectures, fail recording
   events that happen in NMI context.
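
The "%.*s" hardening item in the list above relies on standard printf
precision semantics. A small userspace sketch follows; format_bounded() is
a hypothetical name and snprintf stands in for the kernel's trace_seq
printing.

```c
#include <stdio.h>

/*
 * Format at most `len` bytes of `buf` into `out`. With "%.*s" the
 * precision caps how many bytes are read from `buf`, so `buf` does not
 * need a terminating '\0' within those `len` bytes.
 * Returns the number of characters that were (or would be) written.
 */
static int format_bounded(char *out, size_t out_size,
			  const char *buf, int len)
{
	return snprintf(out, out_size, "%.*s", len, buf);
}
```

Passing the event's data size as the precision means a payload that lost
its terminator is cut off at the event boundary instead of being read past
its end.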

* tag 'trace-v6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
  ring-buffer: Have rb_time_cmpxchg() set the msb counter too
  ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
  ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
  ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
  ring-buffer: Do not try to put back write_stamp
  tracing: Fix uaf issue when open the hist or hist_debug file
  tracing: Add size check when printing trace_marker output
  ring-buffer: Have saved event hold the entire event
  ring-buffer: Do not update before stamp when switching sub-buffers
  tracing: Update snapshot buffer on resize if it is allocated
  ring-buffer: Fix memory leak of free page
  eventfs: Fix events beyond NAME_MAX blocking tasks
  tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
  ring-buffer: Fix writing to the buffer with max_data_size
2023-12-16 10:40:51 -08:00
Alexei Starovoitov 852486b35f x86/cfi,bpf: Fix bpf_exception_cb() signature
As per the earlier patches, BPF sub-programs have bpf_callback_t
signature and CFI expects callers to have matching signature. This is
violated by bpf_prog_aux::bpf_exception_cb().

[peterz: Changelog]
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/CAADnVQ+Z7UcXXBBhMubhcMM=R-dExk-uHtfOLtoLxQ1XxEpqEA@mail.gmail.com
Link: https://lore.kernel.org/r/20231215092707.910319166@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15 16:25:55 -08:00
Peter Zijlstra e4c0033989 bpf: Fix dtor CFI
Ensure the various dtor functions match their prototype and retain
their CFI signatures. Since they don't have their address taken, they
are prone to not getting CFI, making them impossible to call
indirectly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.799451071@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15 16:25:55 -08:00
Peter Zijlstra 2cd3e3772e x86/cfi,bpf: Fix bpf_struct_ops CFI
BPF struct_ops uses __arch_prepare_bpf_trampoline() to write
trampolines for indirect function calls. These trampolines must have
matching CFI.

In order to obtain the correct CFI hash for the various methods, add a
matching structure that contains stub functions; the compiler will
generate correct CFI which we can pilfer for the trampolines.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15 16:25:55 -08:00
Peter Zijlstra 4f9087f166 x86/cfi,bpf: Fix BPF JIT call
The current BPF call convention is __nocfi, except when it calls !JIT things,
then it calls regular C functions.

It so happens that with FineIBT the __nocfi and C calling conventions are
incompatible. Specifically __nocfi will call at func+0, while FineIBT will have
endbr-poison there, which is not a valid indirect target, causing #CP.

Notably this only triggers on IBT enabled hardware, which is probably why this
hasn't been reported (also, most people will have JIT on anyway).

Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for
x86.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15 16:25:55 -08:00
Jens Axboe ae1914174a cred: get rid of CONFIG_DEBUG_CREDENTIALS
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-15 14:19:48 -08:00
Jens Axboe f8fa5d7692 cred: switch to using atomic_long_t
There are multiple ways to grab references to credentials, and the only
protection we have against overflowing it is the memory required to do
so.

With memory sizes only moving in one direction, let's bump the reference
count to 64-bit and move it outside the realm of feasibly overflowing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-15 14:08:46 -08:00
Hou Tao d6d1e6c17c bpf: Limit the number of kprobes when attaching program to multiple kprobes
An abnormally big cnt may also be assigned to kprobe_multi.cnt when
attaching multiple kprobes. It will trigger the following warning in
kvmalloc_node():

	if (unlikely(size > INT_MAX)) {
	    WARN_ON_ONCE(!(flags & __GFP_NOWARN));
	    return NULL;
	}

Fix the warning by limiting the maximal number of kprobes in
bpf_kprobe_multi_link_attach(). If the number of kprobes is greater than
MAX_KPROBE_MULTI_CNT, the attachment will fail and return -E2BIG.
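
The bound check can be sketched as below. This is an illustration rather
than the actual patch: check_multi_cnt() is a hypothetical helper, and the
cap of 1U << 20 is an assumed value, since the message does not state what
MAX_KPROBE_MULTI_CNT is.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed cap for illustration; the real value is not given above. */
#define MAX_MULTI_CNT (1U << 20)

/*
 * Reject oversized counts before any allocation is sized from them, so
 * that cnt * sizeof(entry) stays far below the INT_MAX limit that
 * triggers the warning in kvmalloc_node().
 */
static int check_multi_cnt(uint32_t cnt)
{
	if (cnt > MAX_MULTI_CNT)
		return -E2BIG;
	return 0;
}
```

With this assumed cap and 8-byte entries, the largest array is 8 MiB, which
is well under INT_MAX bytes.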

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231215100708.2265609-3-houtao@huaweicloud.com
2023-12-15 22:54:55 +01:00
Hou Tao 8b2efe51ba bpf: Limit the number of uprobes when attaching program to multiple uprobes
An abnormally big cnt may be passed to link_create.uprobe_multi.cnt,
and it will trigger the following warning in kvmalloc_node():

	if (unlikely(size > INT_MAX)) {
		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
		return NULL;
	}

Fix the warning by limiting the maximal number of uprobes in
bpf_uprobe_multi_link_attach(). If the number of uprobes is greater than
MAX_UPROBE_MULTI_CNT, the attachment will return -E2BIG.

Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Reported-by: Xingwei Lee <xrivendell7@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Closes: https://lore.kernel.org/bpf/CABOYnLwwJY=yFAGie59LFsUsBAgHfroVqbzZ5edAXbFE3YiNVA@mail.gmail.com
Link: https://lore.kernel.org/bpf/20231215100708.2265609-2-houtao@huaweicloud.com
2023-12-15 22:54:46 +01:00
Linus Torvalds a62aa88ba1 17 hotfixes. 8 are cc:stable and the other 9 pertain to post-6.6 issues.
Merge tag 'mm-hotfixes-stable-2023-12-15-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "17 hotfixes. 8 are cc:stable and the other 9 pertain to post-6.6
  issues"

* tag 'mm-hotfixes-stable-2023-12-15-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/mglru: reclaim offlined memcgs harder
  mm/mglru: respect min_ttl_ms with memcgs
  mm/mglru: try to stop at high watermarks
  mm/mglru: fix underprotected page cache
  mm/shmem: fix race in shmem_undo_range w/THP
  Revert "selftests: error out if kernel header files are not yet built"
  crash_core: fix the check for whether crashkernel is from high memory
  x86, kexec: fix the wrong ifdeffery CONFIG_KEXEC
  sh, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC
  mips, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC
  m68k, kexec: fix the incorrect ifdeffery and build dependency of CONFIG_KEXEC
  loongarch, kexec: change dependency of object files
  mm/damon/core: make damon_start() waits until kdamond_fn() starts
  selftests/mm: cow: print ksft header before printing anything else
  mm: fix VMA heap bounds checking
  riscv: fix VMALLOC_START definition
  kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP
2023-12-15 12:00:54 -08:00
Kees Cook ff6d413b0b kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
One of the last remaining users of strlcpy() in the kernel is
kernfs_path_from_node_locked(), which passes back the problematic "length
we _would_ have copied" return value to indicate truncation.  Convert the
chain of all callers to use the negative return value (some of which
were already doing this explicitly). All callers were also already
checking for negative return values, so the risk of missed checks looks
very low.

In this analysis, it was found that cgroup1_release_agent() actually
didn't handle the "too large" condition, so this is technically also a
bug fix. :)
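
For illustration only, the difference between the two return conventions
can be sketched in user space. The sketch_* helpers below are hypothetical
stand-ins, not the kernel implementations; they only mirror the return
values described above (source length for the strlcpy() style, -E2BIG on
truncation for the strscpy() style).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * strlcpy()-style copy (hypothetical stand-in): returns strlen(src),
 * i.e. the length that _would_ have been copied, so the caller has to
 * detect truncation by comparing the result against the buffer size.
 */
static size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/*
 * strscpy()-style copy (hypothetical stand-in): returns the number of
 * bytes copied, not counting the NUL, or -E2BIG if src did not fit.
 */
static long sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size == 0)
		return -E2BIG;
	if (len >= size) {
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}
	memcpy(dst, src, len + 1);
	return (long)len;
}
```

With the second convention a caller that only checks for a negative
return value also catches truncation, which is the property the
conversion relies on.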

Here's the chain of callers, and resolution identifying each one as now
handling the correct return value:

kernfs_path_from_node_locked()
        kernfs_path_from_node()
                pr_cont_kernfs_path()
                        returns void
                kernfs_path()
                        sysfs_warn_dup()
                                return value ignored
                        cgroup_path()
                                blkg_path()
                                        bfq_bic_update_cgroup()
                                                return value ignored
                                TRACE_IOCG_PATH()
                                        return value ignored
                                TRACE_CGROUP_PATH()
                                        return value ignored
                                perf_event_cgroup()
                                        return value ignored
                                task_group_path()
                                        return value ignored
                                damon_sysfs_memcg_path_eq()
                                        return value ignored
                                get_mm_memcg_path()
                                        return value ignored
                                lru_gen_seq_show()
                                        return value ignored
                        cgroup_path_from_kernfs_id()
                                return value ignored
                cgroup_show_path()
                        already converted "too large" error to negative value
                cgroup_path_ns_locked()
                        cgroup_path_ns()
                                bpf_iter_cgroup_show_fdinfo()
                                        return value ignored
                                cgroup1_release_agent()
                                        wasn't checking "too large" error
                        proc_cgroup_show()
                                already converted "too large" to negative value

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc:  <cgroups@vger.kernel.org>
Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-15 17:25:10 +01:00
Max Kellermann fe3de0102b kernel/cgroup: use kernfs_create_dir_ns()
By passing the fsugid to kernfs_create_dir_ns(), we don't need
cgroup_kn_set_ugid() any longer.  That function was added for exactly
this purpose by commit 49957f8e2a ("cgroup: newly created dirs and
files should be owned by the creator").

Eliminating this piece of duplicate code means we benefit from future
improvements to kernfs_create_dir_ns(); for example, both are lacking
S_ISGID support currently, which my next patch will add to
kernfs_create_dir_ns().  It cannot (easily) be added to
cgroup_kn_set_ugid() because we can't dereference struct kernfs_iattrs
from there.

--
v1 -> v2: 12-digit commit id

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231208093310.297233-1-max.kellermann@ionos.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-15 17:22:40 +01:00
Steven Rostedt (Google) 712292308a ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
As the ring buffer recording requires cmpxchg() to work, if the
architecture does not support cmpxchg in NMI, then do not do any recording
within an NMI.

Link: https://lore.kernel.org/linux-trace-kernel/20231213175403.6fc18540@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15 08:54:26 -05:00
Steven Rostedt (Google) 0aa0e5289c ring-buffer: Have rb_time_cmpxchg() set the msb counter too
The rb_time_cmpxchg() on 32-bit architectures requires setting three
32-bit words to represent the 64-bit timestamp, with some salt for
synchronization. Those are: msb, top, and bottom

The issue is, the rb_time_cmpxchg() did not properly salt the msb portion,
and the msb that was written was stale.

Link: https://lore.kernel.org/linux-trace-kernel/20231215084114.20899342@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: f03f2abce4 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15 08:43:35 -05:00
Mathieu Desnoyers dec890089b ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
The following race can cause rb_time_read() to observe a corrupted time
stamp:

rb_time_cmpxchg()
[...]
        if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
                return false;
        if (!rb_time_read_cmpxchg(&t->top, top, top2))
                return false;
<interrupted before updating bottom>
__rb_time_read()
[...]
        do {
                c = local_read(&t->cnt);
                top = local_read(&t->top);
                bottom = local_read(&t->bottom);
                msb = local_read(&t->msb);
        } while (c != local_read(&t->cnt));

        *cnt = rb_time_cnt(top);

        /* If top and msb counts don't match, this interrupted a write */
        if (*cnt != rb_time_cnt(msb))
                return false;
          ^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong.

Fix this by checking that all three of msb, top, and bottom 2-bit cnt
values match.

The reason to favor checking all three fields over requiring a specific
update order for both rb_time_set() and rb_time_cmpxchg() is because
checking all three fields is more robust to handle partial failures of
rb_time_cmpxchg() when interrupted by nested rb_time_set().
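
A minimal user-space sketch of the "all three counts must match" check
follows. It is not the kernel code: the sk_* names are hypothetical, the
word layout (30 payload bits plus a 2-bit count per 32-bit word) is an
assumption of this sketch, and the separate t->cnt counter and local_t
accessors are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_SHIFT     30                       /* payload bits per word (assumed) */
#define SK_VAL_MASK  ((1u << SK_SHIFT) - 1)
#define SK_MSB_SHIFT 60                       /* msb word carries bits 60..63 */

struct sk_time {
	uint32_t msb, top, bottom;            /* each word: [cnt:2][payload:30] */
};

/* Extract the 2-bit update count stored in the upper bits of a word. */
static uint32_t sk_cnt(uint32_t word)
{
	return (word >> SK_SHIFT) & 3;
}

/* Combine a 30-bit payload with a 2-bit count. */
static uint32_t sk_pack(uint32_t payload, uint32_t cnt)
{
	return (payload & SK_VAL_MASK) | ((cnt & 3) << SK_SHIFT);
}

/* Store a 64-bit value, tagging all three words with the same count. */
static void sk_set(struct sk_time *t, uint64_t val, uint32_t cnt)
{
	t->msb    = sk_pack((uint32_t)(val >> SK_MSB_SHIFT), cnt);
	t->top    = sk_pack((uint32_t)(val >> SK_SHIFT), cnt);
	t->bottom = sk_pack((uint32_t)val, cnt);
}

/* Read succeeds only if msb, top and bottom carry the same count. */
static bool sk_read(const struct sk_time *t, uint64_t *val)
{
	uint32_t c = sk_cnt(t->top);

	if (c != sk_cnt(t->msb) || c != sk_cnt(t->bottom))
		return false;

	*val = ((uint64_t)(t->msb & SK_VAL_MASK) << SK_MSB_SHIFT) |
	       ((uint64_t)(t->top & SK_VAL_MASK) << SK_SHIFT) |
	       (t->bottom & SK_VAL_MASK);
	return true;
}
```

If only msb and top have been rewritten with a new count, sk_read()
refuses to return a value, which corresponds to the stale "bottom" case
in the race above.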

Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212193049.680122-1-mathieu.desnoyers@efficios.com

Fixes: f458a14534 ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15 08:39:34 -05:00
Steven Rostedt (Google) fff88fa0fb ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
architectures. That is:

 static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 {
	unsigned long cnt, top, bottom, msb;
	unsigned long cnt2, top2, bottom2, msb2;
	u64 val;

	/* The cmpxchg always fails if it interrupted an update */
	 if (!__rb_time_read(t, &val, &cnt2))
		 return false;

	 if (val != expect)
		 return false;

<<<< interrupted here!

	 cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read
*after* the value of the timestamp is read. That means if an interrupt
were to come in between the value being read and the counter being read,
it can change the value and the counter and the interrupted process would
be clueless about it!

The counter needs to be read first and then the value. That way it is easy
to tell if the value is stale or not. If the counter hasn't been updated,
then the value is still good.
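
The ordering can be illustrated with a small single-threaded sketch. The
sk_* names are hypothetical and nothing here is atomic; the "interrupt"
is simulated by calling sk_update() between the two steps. It only shows
why sampling the counter before the value lets the later check notice an
intervening update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sk_stamp {
	unsigned long cnt;      /* bumped on every update */
	uint64_t val;           /* the timestamp itself */
};

/* What an interrupting writer would do: bump the counter, set the value. */
static void sk_update(struct sk_stamp *s, uint64_t val)
{
	s->cnt++;
	s->val = val;
}

/* Step 1: sample the counter FIRST, then compare the value. */
static bool sk_begin(const struct sk_stamp *s, uint64_t expect,
		     unsigned long *cnt)
{
	*cnt = s->cnt;
	return s->val == expect;
}

/* Step 2: commit only if no update slipped in since sk_begin(). */
static bool sk_commit(struct sk_stamp *s, unsigned long cnt, uint64_t set)
{
	if (s->cnt != cnt)
		return false;
	s->cnt = cnt + 1;
	s->val = set;
	return true;
}
```

Note that the check fails even when the interrupting update wrote the
same value back, because the counter still moved.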

Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212115301.7a9c9a64@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 10464b4aa6 ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15 08:39:34 -05:00
Steven Rostedt (Google) 083e9f65bd ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
When filtering is enabled, a temporary buffer is created to place the
content of the trace event output so that the filter logic can decide
from the trace event output if the trace event should be filtered out or
not. If it is to be filtered out, the content in the temporary buffer is
simply discarded, otherwise it is written into the trace buffer.

But if an interrupt were to come in while a previous event was using that
temporary buffer, the event written by the interrupt would actually go
into the ring buffer itself to prevent corrupting the data on the
temporary buffer. If the event is to be filtered out, the event in the
ring buffer is discarded, or if it fails to discard because another event
has already come in, it is turned into padding.

The update to the write_stamp in the rb_try_to_discard() happens after a
fix was made to force the next event after the discard to use an absolute
timestamp by setting the before_stamp to zero so it does not match the
write_stamp (which causes an event to use the absolute timestamp).

But there's an effort in rb_try_to_discard() to put back the write_stamp
to what it was before the event was added. But this is useless and
wasteful because nothing is going to be using that write_stamp for
calculations as it still will not match the before_stamp.

Remove this useless update, and in doing so, we remove another
cmpxchg64()!

Also update the comments to reflect this change as well as remove some
extra white space in another comment.

Link: https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort   <vdonnefort@google.com>
Fixes: b2dd797543 ("ring-buffer: Force absolute timestamp on discard of event")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15 08:38:11 -05:00
Steven Rostedt (Google) dd93942570 ring-buffer: Do not try to put back write_stamp
If an update to an event is interrupted by another event between the time
the initial event allocated its buffer and where it wrote to the
write_stamp, the code tries to reset the write_stamp back to what it had
just overwritten. It knows that it was overwritten via checking the
before_stamp, and if it didn't match what it wrote to the before_stamp
before it allocated its space, it knows it was overwritten.

To put back the write_stamp, it uses the before_stamp it read. The problem
here is that by writing the before_stamp to the write_stamp it makes the
two equal again, which means that the write_stamp can be considered valid
as the last timestamp written to the ring buffer. But this is not
necessarily true. The event that interrupted this event could itself have
been interrupted, and can end up leaving with an invalid write_stamp. If
that happens and control returns to this context, which then uses the
before_stamp to update the write_stamp again, it can incorrectly make the
write_stamp valid, causing later events to have incorrect time stamps.

As it is OK to leave this function with an invalid write_stamp (one that
doesn't match the before_stamp), there's no reason to try to make it valid
again in this case. If this race happens, then just leave with the invalid
write_stamp and the next event to come along will just add an absolute
timestamp and validate everything again.

Bonus points: This gets rid of another cmpxchg64!

Link: https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15 08:38:05 -05:00
Hongchen Zhang 71cd7e80cf PM: hibernate: Enforce ordering during image compression/decompression
An S4 (suspend to disk) test on the LoongArch 3A6000 platform sometimes
fails with the following error message in the dmesg log:

	Invalid LZO compressed length

That happens because when compressing/decompressing the image, the
synchronization between the control thread and the compress/decompress/crc
thread is based on a relaxed ordering interface, which is unreliable, and the
following situation may occur:

CPU 0					CPU 1
save_image_lzo				lzo_compress_threadfn
					  atomic_set(&d->stop, 1);
  atomic_read(&data[thr].stop)
  data[thr].cmp = data[thr].cmp_len;
	  				  WRITE data[thr].cmp_len

Then CPU0 gets a stale cmp_len and writes it to disk. During resume from S4,
the wrong cmp_len is loaded.

To maintain data consistency between the two threads, use the acquire/release
variants of atomic set and read operations.
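
As a rough user-space analogy, the pairing can be sketched with C11
atomics. The sk_* names are hypothetical and this is not the hibernation
code; in the kernel the corresponding primitives would be
atomic_set_release() and atomic_read_acquire(). In real use the two
functions run on different threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sk_cmp_data {
	atomic_int stop;        /* set to 1 once the worker has finished */
	size_t cmp_len;         /* result produced by the worker */
};

/* Worker side: write the result first, then publish it with release. */
static void sk_worker_finish(struct sk_cmp_data *d, size_t len)
{
	d->cmp_len = len;
	atomic_store_explicit(&d->stop, 1, memory_order_release);
}

/*
 * Control side: an acquire load that observes stop == 1 is guaranteed
 * to also observe the cmp_len written before the release store.
 * Returns 1 and fills *len if the result is ready, 0 otherwise.
 */
static int sk_control_collect(struct sk_cmp_data *d, size_t *len)
{
	if (!atomic_load_explicit(&d->stop, memory_order_acquire))
		return 0;
	*len = d->cmp_len;
	return 1;
}
```

With relaxed ordering on either side, the load of cmp_len could be
satisfied with a value from before the worker's write, which matches the
stale cmp_len scenario shown in the table above.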

Fixes: 081a9d043c ("PM / Hibernate: Improve performance of LZO/plain hibernation, checksum image")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Co-developed-by: Weihao Li <liweihao@loongson.cn>
Signed-off-by: Weihao Li <liweihao@loongson.cn>
[ rjw: Subject rewrite and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-15 12:34:12 +01:00
Chris Feng 0c4cae1bc0 PM: hibernate: Avoid missing wakeup events during hibernation
Wakeup events that occur in the hibernation process's
hibernation_platform_enter() cannot wake up the system. Although the
current hibernation framework will execute part of the recovery process
after a wakeup event occurs, it ultimately performs a shutdown operation
because the system does not check the return value of
hibernation_platform_enter(). In short, if a wakeup event occurs before
putting the system into the final low-power state, it will be missed.

To solve this problem, check the return value of
hibernation_platform_enter(). When it returns -EAGAIN or -EBUSY (indicating
the occurrence of a wakeup event), execute the hibernation recovery
process, discard the previously saved image, and ultimately return to the
working state.

Signed-off-by: Chris Feng <chris.feng@mediatek.com>
[ rjw: Rephrase the message printed when going back to the working state ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-15 12:33:39 +01:00
Mark Rutland 7e2c1e4b34 perf: Fix perf_event_validate_size() lockdep splat
When lockdep is enabled, the for_each_sibling_event(sibling, event)
macro checks that event->ctx->mutex is held. When creating a new group
leader event, we call perf_event_validate_size() on a partially
initialized event where event->ctx is NULL, and so when
for_each_sibling_event() attempts to check event->ctx->mutex, we get a
splat, as reported by Lucas De Marchi:

  WARNING: CPU: 8 PID: 1471 at kernel/events/core.c:1950 __do_sys_perf_event_open+0xf37/0x1080

This only happens for a new event which is its own group_leader, and in
this case there cannot be any sibling events. Thus it's safe to skip the
check for siblings, which avoids having to make invasive and ugly
changes to for_each_sibling_event().

Avoid the splat by bailing out early when the new event is its own
group_leader.

Fixes: 382c27f4ed ("perf: Fix perf_event_validate_size()")
Closes: https://lore.kernel.org/lkml/20231214000620.3081018-1-lucas.demarchi@intel.com/
Closes: https://lore.kernel.org/lkml/ZXpm6gQ%2Fd59jGsuW@xpf.sh.intel.com/
Reported-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231215112450.3972309-1-mark.rutland@arm.com
2023-12-15 12:33:23 +01:00
Joakim Zhang b07bc23476 dma-mapping: clear dev->dma_mem to NULL after freeing it
Reproduced with the sequence below:
dma_declare_coherent_memory()->dma_release_coherent_memory()
->dma_declare_coherent_memory()->"return -EBUSY" error

The second dma_declare_coherent_memory() call gets -EBUSY from
dma_assign_coherent_memory() because the dev->dma_mem pointer was not
set to NULL after it was freed.
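
The failure mode can be reproduced in a few lines of user-space C. The
sk_* types and functions are hypothetical simplifications, not the
kernel API; they only model "declare fails with -EBUSY while the device
still has a region pointer".

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct sk_mem { int dummy; };
struct sk_dev { struct sk_mem *mem; };

/* Assign a region to the device; refuse if one is already assigned. */
static int sk_declare(struct sk_dev *dev)
{
	if (dev->mem)
		return -EBUSY;
	dev->mem = calloc(1, sizeof(*dev->mem));
	return dev->mem ? 0 : -ENOMEM;
}

/* Free the region and clear the pointer so a later declare can succeed. */
static void sk_release(struct sk_dev *dev)
{
	free(dev->mem);
	dev->mem = NULL;        /* omitting this leaves a stale pointer behind */
}
```

Without the assignment to NULL in sk_release(), the second sk_declare()
in a declare/release/declare sequence would see the stale pointer and
return -EBUSY.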

Fixes: cf65a0f6f6 ("dma-mapping: move all DMA mapping code to kernel/dma")
Signed-off-by: Joakim Zhang <joakim.zhang@cixtech.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-12-15 12:32:45 +01:00
Petr Tesarik 55c543865b swiotlb: reduce area lock contention for non-primary IO TLB pools
If multiple areas and multiple IO TLB pools exist, first iterate the
current CPU specific area in all pools. Then move to the next area index.

This is best illustrated by a diagram:

        area 0 |  area 1 | ... | area M |
pool 0    A         B              C
pool 1    D         E
...
pool N    F         G              H

Currently, each pool is searched before moving on to the next pool,
i.e. the search order is A, B ... C, D, E ... F, G ... H. With this patch,
each area is searched in all pools before moving on to the next area,
i.e. the search order is A, D ... F, B, E ... G ... C ... H.
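
The two iteration orders can be sketched as follows. This is an
illustration only: the sk_* names are hypothetical, every pool is
assumed to have the same number of areas, and both orders are assumed to
start at the current CPU's area index.

```c
#include <assert.h>
#include <stddef.h>

struct sk_slot { unsigned int pool, area; };

/* Old order: try every area of a pool before moving to the next pool. */
static size_t sk_pool_major(unsigned int npools, unsigned int nareas,
			    unsigned int start, struct sk_slot *out)
{
	size_t n = 0;

	for (unsigned int p = 0; p < npools; p++)
		for (unsigned int i = 0; i < nareas; i++)
			out[n++] = (struct sk_slot){ p, (start + i) % nareas };
	return n;
}

/* New order: try one area index in every pool, then advance the index. */
static size_t sk_area_major(unsigned int npools, unsigned int nareas,
			    unsigned int start, struct sk_slot *out)
{
	size_t n = 0;

	for (unsigned int i = 0; i < nareas; i++)
		for (unsigned int p = 0; p < npools; p++)
			out[n++] = (struct sk_slot){ p, (start + i) % nareas };
	return n;
}
```

With two pools, three areas and a starting area of 1, the area-major
order visits area 1 of pool 0 and pool 1 before touching any other area,
so the CPU's own area lock is preferred in every pool.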

Note that preemption is not disabled, and raw_smp_processor_id() may not
return a stable result, but it is called only once to determine the initial
area index. The search will iterate over all areas eventually, even if the
current task is preempted.

Next, some pools may have fewer (but not more) areas than default_nareas.
Skip such pools if the distance from the initial area index is greater than
pool->nareas. This logic ensures that for every pool the search starts in
the initial CPU's own area and never tries any area twice.

To verify performance impact, I booted the kernel with a minimum pool
size ("swiotlb=512,4,force"), so multiple pools get allocated, and I ran
these benchmarks:

- small: single-threaded I/O of 4 KiB blocks,
- big: single-threaded I/O of 64 KiB blocks,
- 4way: 4-way parallel I/O of 4 KiB blocks.

The "var" column in the tables below is the coefficient of variation over 5
runs of the test, the "diff" column is the relative difference against base
in read-write I/O bandwidth (MiB/s).

Tested on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based
block device on the host:

	base	   patched
	var	var	diff
small	0.69%	0.62%	+25.4%
big	2.14%	2.27%	+25.7%
4way	2.65%	1.70%	+23.6%

Tested on a Raspberry Pi against a class-10 A1 microSD card:

	base	   patched
	var	var	diff
small	0.53%	1.96%	-0.3%
big	0.02%	0.57%	+0.8%
4way	6.17%	0.40%	+0.3%

These results confirm that there is a significant performance boost in the
software IO TLB slot allocation itself. Where performance is dominated by
actual hardware, there is no measurable change.

Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Reviewed-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-12-15 12:32:45 +01:00
Robin Murphy 4ad4c1f394 dma-mapping: don't store redundant offsets
A bus_dma_region necessarily stores both CPU and DMA base addresses for
a range, so there's no need to also store the difference between them.
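
A minimal sketch of why the stored offset is redundant: the translation
can be computed from the two base addresses alone. The struct and
function names are hypothetical and simplified, not the kernel
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified region: two bases and a size, no stored offset. */
struct sk_region {
	uint64_t cpu_start;
	uint64_t dma_start;
	uint64_t size;
};

/*
 * Translate a CPU physical address to a DMA address using only the two
 * base addresses. Returns 1 and fills *dma on a hit, 0 if no region
 * covers paddr.
 */
static int sk_phys_to_dma(const struct sk_region *map, unsigned int n,
			  uint64_t paddr, uint64_t *dma)
{
	for (unsigned int i = 0; i < n; i++) {
		const struct sk_region *r = &map[i];

		if (paddr >= r->cpu_start && paddr - r->cpu_start < r->size) {
			*dma = paddr - r->cpu_start + r->dma_start;
			return 1;
		}
	}
	return 0;
}
```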

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-12-15 12:32:42 +01:00
Daniel Xu 7489723c2e bpf: xdp: Register generic_kfunc_set with XDP programs
Registering generic_kfunc_set with XDP programs enables some of the
newer BPF features inside XDP -- namely tree based data structures and
BPF exceptions.

The current motivation for this commit is to enable assertions inside
XDP bpf progs. Assertions are a standard and useful tool to encode
intent.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/d07d4614b81ca6aada44fcb89bb6b618fb66e4ca.1702594357.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-14 19:12:16 -08:00
Andrii Nakryiko c5707b2146 bpf: support symbolic BPF FS delegation mount options
Besides the already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.

Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
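
As a rough sketch of the parsing idea (not the kernel implementation):
split the option on ':' and match each token case-insensitively against
a name table. The table and bit positions below are hypothetical; the
real code derives names and values from BTF enum information, and "any"
and hex masks are not handled here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name table; index i maps to bit i in this sketch only. */
static const char *const sk_names[] = { "map_create", "prog_load", "btf_load" };
#define SK_NR_NAMES (sizeof(sk_names) / sizeof(sk_names[0]))

/* Case-insensitive compare of tok[0..len) against a NUL-terminated name. */
static int sk_name_eq(const char *tok, size_t len, const char *name)
{
	if (strlen(name) != len)
		return 0;
	for (size_t i = 0; i < len; i++)
		if (tolower((unsigned char)tok[i]) !=
		    tolower((unsigned char)name[i]))
			return 0;
	return 1;
}

/* Parse "a:b:c" into a bit mask; returns 0 on success, -1 on a bad name. */
static int sk_parse_mask(const char *opt, uint64_t *mask)
{
	uint64_t m = 0;

	while (*opt) {
		size_t len = strcspn(opt, ":");
		size_t i;

		for (i = 0; i < SK_NR_NAMES; i++)
			if (sk_name_eq(opt, len, sk_names[i]))
				break;
		if (i == SK_NR_NAMES)
			return -1;
		m |= (uint64_t)1 << i;

		opt += len;
		if (*opt == ':')
			opt++;
	}
	*mask = m;
	return 0;
}
```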

There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.

Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:

  $ sudo mkdir -p /sys/fs/bpf/token
  $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
               -o delegate_cmds=prog_load:MAP_CREATE \
               -o delegate_progs=kprobe \
               -o delegate_attachs=xdp
  $ mount | grep token
  bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-14 17:30:27 -08:00
Jakub Kicinski 8f674972d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/intel/iavf/iavf_ethtool.c
  3a0b5a2929 ("iavf: Introduce new state machines for flow director")
  95260816b4 ("iavf: use iavf_schedule_aq_request() helper")
https://lore.kernel.org/all/84e12519-04dc-bd80-bc34-8cf50d7898ce@intel.com/

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  c13e268c07 ("bnxt_en: Fix HWTSTAMP_FILTER_ALL packet timestamp logic")
  c2f8063309 ("bnxt_en: Refactor RX VLAN acceleration logic.")
  a7445d6980 ("bnxt_en: Add support for new RX and TPA_START completion types for P7")
  1c7fd6ee2f ("bnxt_en: Rename some macros for the P5 chips")
https://lore.kernel.org/all/20231211110022.27926ad9@canb.auug.org.au/

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c
  bd6781c18c ("bnxt_en: Fix wrong return value check in bnxt_close_nic()")
  84793a4995 ("bnxt_en: Skip nic close/open when configuring tstamp filters")
https://lore.kernel.org/all/20231214113041.3a0c003c@canb.auug.org.au/

drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
  3d7a3f2612 ("net/mlx5: Nack sync reset request when HotPlug is enabled")
  cecf44ea1a ("net/mlx5: Allow sync reset flow when BF MGT interface device is present")
https://lore.kernel.org/all/20231211110328.76c925af@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-14 17:14:41 -08:00
Yonghong Song 59e5791f59 bpf: Fix a race condition between btf_put() and map_free()
When running `./test_progs -j` in my local vm with latest kernel,
I once hit a kasan error like below:

  [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
  [ 1887.186498]
  [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706d486-dirty #494
  [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
  [ 1887.190341] Call Trace:
  [ 1887.190666]  <TASK>
  [ 1887.190949]  dump_stack_lvl+0xac/0xe0
  [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
  [ 1887.192019]  ? panic+0x3c0/0x3c0
  [ 1887.192449]  print_report+0x14f/0x720
  [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
  [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.194572]  kasan_report+0xc3/0x100
  [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
  [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
  [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
  [ 1887.198883]  array_map_free+0x1a3/0x260
  [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
  [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
  [ 1887.200549]  worker_thread+0x633/0x890
  [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
  [ 1887.201574]  ? kthread+0x102/0x1d0
  [ 1887.202020]  kthread+0x1ab/0x1d0
  [ 1887.202447]  ? pr_cont_work+0x270/0x270
  [ 1887.202954]  ? kthread_blkcg+0x50/0x50
  [ 1887.203444]  ret_from_fork+0x34/0x50
  [ 1887.203914]  ? kthread_blkcg+0x50/0x50
  [ 1887.204397]  ret_from_fork_asm+0x11/0x20
  [ 1887.204913]  </TASK>
  [ 1887.205209]
  [ 1887.205416] Allocated by task 2197:
  [ 1887.205881]  kasan_set_track+0x3f/0x60
  [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
  [ 1887.206856]  __kmalloc+0xac/0x1a0
  [ 1887.207293]  btf_parse_fields+0xa15/0x1480
  [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
  [ 1887.208387]  btf_new_fd+0x294/0x4d0
  [ 1887.208851]  __sys_bpf+0x4ba/0x600
  [ 1887.209292]  __x64_sys_bpf+0x41/0x50
  [ 1887.209762]  do_syscall_64+0x4c/0xf0
  [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
  [ 1887.210868]
  [ 1887.211074] Freed by task 36:
  [ 1887.211460]  kasan_set_track+0x3f/0x60
  [ 1887.211951]  kasan_save_free_info+0x28/0x40
  [ 1887.212485]  ____kasan_slab_free+0x101/0x180
  [ 1887.213027]  __kmem_cache_free+0xe4/0x210
  [ 1887.213514]  btf_free+0x5b/0x130
  [ 1887.213918]  rcu_core+0x638/0xcc0
  [ 1887.214347]  __do_softirq+0x114/0x37e

The error happens at bpf_rb_root_free+0x1f8/0x2b0:

  00000000000034c0 <bpf_rb_root_free>:
  ; {
    34c0: f3 0f 1e fa                   endbr64
    34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
    34c9: 55                            pushq   %rbp
    34ca: 48 89 e5                      movq    %rsp, %rbp
  ...
  ;       if (rec && rec->refcount_off >= 0 &&
    36aa: 4d 85 ed                      testq   %r13, %r13
    36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
    36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
    36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                        <==== kasan function
    36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                        <==== use-after-free load
    36bc: 45 85 ff                      testl   %r15d, %r15d
    36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>

So the problem is at rec->refcount_off in the above.

I did some source code analysis and found the reason:
                                  CPU A                        CPU B
  bpf_map_put:
    ...
    btf_put with rcu callback
    ...
    bpf_map_free_deferred
      with system_unbound_wq
    ...                          ...                           ...
    ...                          btf_free_rcu:                 ...
    ...                          ...                           bpf_map_free_deferred:
    ...                          ...
    ...         --------->       btf_struct_metas_free()
    ...         | race condition ...
    ...         --------->                                     map->ops->map_free()
    ...
    ...                          btf->struct_meta_tab = NULL

In the above, map_free() corresponds to array_map_free(), which eventually
calls bpf_rb_root_free(), which calls:
  ...
  __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
  ...

Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code:

  meta = btf_find_struct_meta(btf, btf_id);
  if (!meta)
    return -EFAULT;
  rec->fields[i].graph_root.value_rec = meta->record;

So basically, 'value_rec' is a pointer to the record in struct_meta_tab.
That particular record may already have been freed by
btf_struct_metas_free(), hence the kasan error here.

Actually it is very hard to reproduce the failure with current bpf/bpf-next
code; I only got the above error once. To increase reproducibility, I added
a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
significantly increased reproducibility.

  diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
  index 5e43ddd1b83f..aae5b5213e93 100644
  --- a/kernel/bpf/syscall.c
  +++ b/kernel/bpf/syscall.c
  @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
        struct bpf_map *map = container_of(work, struct bpf_map, work);
        struct btf_record *rec = map->record;

  +     mdelay(100);
        security_bpf_map_free(map);
        bpf_map_release_memcg(map);
        /* implementation dependent freeing */

Hou also provided test cases ([1]) for easily reproducing the above issue.

There are two ways to fix the issue: v1 of the patch ([2]) moves
btf_put() after the map_free callback, and v5 of the patch ([3]) uses
a kptr style fix which tries to get a btf reference during
map_check_btf(). Each approach has its pros and cons. The first approach
delays freeing btf, while the second approach needs to acquire a reference
depending on context, which makes the logic not very elegant and may
complicate things with future new data structures. Alexei
suggested in [4] going back to v1, which is what this patch
tries to do.

I reran './test_progs -j' with the above mdelay() hack a couple of
times and didn't observe the error for the above rb_root test cases.
Running Hou's test ([1]) is also successful.

  [1] https://lore.kernel.org/bpf/20231207141500.917136-1-houtao@huaweicloud.com/
  [2] v1: https://lore.kernel.org/bpf/20231204173946.3066377-1-yonghong.song@linux.dev/
  [3] v5: https://lore.kernel.org/bpf/20231208041621.2968241-1-yonghong.song@linux.dev/
  [4] v4: https://lore.kernel.org/bpf/CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com/

Cc: Hou Tao <houtao@huaweicloud.com>
Fixes: 958cf2e273 ("bpf: Introduce bpf_obj_new")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231214203815.1469107-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-14 17:10:32 -08:00
Hou Tao dc68540913 bpf: Use GFP_KERNEL in bpf_event_entry_gen()
rcu_read_lock() is no longer held when invoking bpf_event_entry_gen()
which is called by perf_event_fd_array_get_ptr(), so use GFP_KERNEL
instead of GFP_ATOMIC to reduce the possibility of failures due to
out-of-memory.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231214043010.3458072-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 20:49:11 -08:00
Hou Tao 8f82583f95 bpf: Reduce the scope of rcu_read_lock when updating fd map
There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
callbacks.

For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
rcu-read-lock because array->ptrs must still be allocated. For
bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
rcu_read_lock() during the invocation of htab_map_update_elem().

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231214043010.3458072-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 20:49:11 -08:00
Zheng Yejian 1cc111b9cd tracing: Fix uaf issue when open the hist or hist_debug file
KASAN reports the following issue. The root cause is that opening the
'hist' file of an instance and accessing 'trace_event_file' in hist_show()
can happen after 'trace_event_file' has been freed due to the instance
being removed. The 'hist_debug' file has the same problem. To fix it, call
tracing_{open,release}_file_tr() in the file_operations callbacks to take
a reference and prevent 'trace_event_file' from being freed.

  BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
  Read of size 8 at addr ffff242541e336b8 by task head/190

  CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   dump_backtrace+0x98/0xf8
   show_stack+0x1c/0x30
   dump_stack_lvl+0x44/0x58
   print_report+0xf0/0x5a0
   kasan_report+0x80/0xc0
   __asan_report_load8_noabort+0x1c/0x28
   hist_show+0x11e0/0x1278
   seq_read_iter+0x344/0xd78
   seq_read+0x128/0x1c0
   vfs_read+0x198/0x6c8
   ksys_read+0xf4/0x1e0
   __arm64_sys_read+0x70/0xa8
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Allocated by task 188:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_alloc_info+0x20/0x30
   __kasan_slab_alloc+0x6c/0x80
   kmem_cache_alloc+0x15c/0x4a8
   trace_create_new_event+0x84/0x348
   __trace_add_new_event+0x18/0x88
   event_trace_add_tracer+0xc4/0x1a0
   trace_array_create_dir+0x6c/0x100
   trace_array_create+0x2e8/0x568
   instance_mkdir+0x48/0x80
   tracefs_syscall_mkdir+0x90/0xe8
   vfs_mkdir+0x3c4/0x610
   do_mkdirat+0x144/0x200
   __arm64_sys_mkdirat+0x8c/0xc0
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Freed by task 191:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_free_info+0x34/0x58
   __kasan_slab_free+0xe4/0x158
   kmem_cache_free+0x19c/0x508
   event_file_put+0xa0/0x120
   remove_event_file_dir+0x180/0x320
   event_trace_del_tracer+0xb0/0x180
   __remove_instance+0x224/0x508
   instance_rmdir+0x44/0x78
   tracefs_syscall_rmdir+0xbc/0x140
   vfs_rmdir+0x1cc/0x4c8
   do_rmdir+0x220/0x2b8
   __arm64_sys_unlinkat+0xc0/0x100
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

Link: https://lore.kernel.org/linux-trace-kernel/20231214012153.676155-1-zhengyejian1@huawei.com

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-13 23:29:59 -05:00
Hou Tao 2a0c6b41ee bpf: Update the comments in maybe_wait_bpf_programs()
Since commit 638e4b825d ("bpf: Allows per-cpu maps and map-in-map in
sleepable programs"), sleepable BPF programs can also use map-in-map, but
maybe_wait_bpf_programs() doesn't handle it accordingly. The main reason
is that using synchronize_rcu_tasks_trace() to wait for the completion
of these sleepable BPF programs may incur a very long delay and
userspace may think it is hung, so the wait for sleepable BPF programs
is skipped. Update the comments in maybe_wait_bpf_programs() to reflect
the reason.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20231211083447.1921178-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 17:01:42 -08:00
Matt Bobrowski b13cddf633 bpf: add small subset of SECURITY_PATH hooks to BPF sleepable_lsm_hooks list
security_path_* based LSM hooks appear to be generally missing from
the sleepable_lsm_hooks list. Initially add a small subset of them to
the preexisting sleepable_lsm_hooks list so that sleepable BPF helpers
like bpf_d_path() can be used from sleepable BPF LSM based programs.

The security_path_* hooks added in this patch are similar to the
security_inode_* counterparts that already exist in the
sleepable_lsm_hooks list, and are called at roughly similar points and
contexts, presumably making them OK to also be annotated as
sleepable.

Building a kernel with DEBUG_ATOMIC_SLEEP options enabled and running
reasonable workloads stimulating activity that would be intercepted by
such security hooks didn't show any splats.

Notably, I haven't added all the security_path_* LSM hooks that are
available as I don't need them at this point in time.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/ZXM3IHHXpNY9y82a@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 16:56:19 -08:00
Andrii Nakryiko f5fdb51fb9 bpf: fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
It's quite confusing in practice when it's possible to successfully
create a BPF token from a BPF FS that didn't have any of the delegate_xxx
mount options set up. While it's not wrong, it's actually more
meaningful to reject BPF_TOKEN_CREATE with a specific error code (-ENOENT)
to let user-space know that no token delegation is set up.

So, instead of creating an empty BPF token that will always be ignored
because it doesn't have any of the allow_xxx bits set, reject it with
-ENOENT. If we ever need an empty BPF token to be possible, we can support
that with an extra flag passed into BPF_TOKEN_CREATE.
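The rule described above can be sketched in plain C. This is only an illustration: the struct, its field names and the helper token_create_check() are hypothetical stand-ins for the delegate_xxx mount options, not the kernel's actual data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the delegate_xxx mount options of a BPF FS. */
struct bpffs_opts_sketch {
	unsigned long long delegate_cmds;
	unsigned long long delegate_maps;
	unsigned long long delegate_progs;
	unsigned long long delegate_attachs;
};

/* Return -ENOENT when no delegation option is set, 0 otherwise. */
int token_create_check(const struct bpffs_opts_sketch *opts)
{
	assert(opts != NULL);

	if (!opts->delegate_cmds && !opts->delegate_maps &&
	    !opts->delegate_progs && !opts->delegate_attachs)
		return -ENOENT;	/* a token here would always be ignored */

	return 0;
}
```

A caller that sees -ENOENT can tell "no delegation configured" apart from other failures such as a permission error.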

Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 15:47:04 -08:00
Jie Jiang 750e785796 bpf: Support uid and gid when mounting bpffs
Parse uid and gid in bpf_parse_param() so that they can be passed in as
the `data` parameter when mounting bpffs with mount(). This will be useful
when we want to control which user/group has control of the mounted bpffs,
otherwise a separate chown() call will be needed.
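A minimal userspace sketch of how a caller might pass these options, assuming the option string takes the form "uid=N,gid=N". The helper names bpffs_mount_opts() and mount_bpffs_owned() are made up for this example, and the mount() call needs sufficient privileges to succeed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>

/* Build the mount data string; returns 0 on success, -1 if buf is too small. */
int bpffs_mount_opts(char *buf, size_t size, uid_t uid, gid_t gid)
{
	int n = snprintf(buf, size, "uid=%u,gid=%u",
			 (unsigned int)uid, (unsigned int)gid);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Mount bpffs at target owned by uid/gid, with no follow-up chown(). */
int mount_bpffs_owned(const char *target, uid_t uid, gid_t gid)
{
	char opts[64];

	assert(target != NULL);
	if (bpffs_mount_opts(opts, sizeof(opts), uid, gid))
		return -1;

	/* The option string is handed over as the `data` argument. */
	return mount("bpffs", target, "bpf", 0, opts);
}
```

For example, mount_bpffs_owned("/sys/fs/bpf", 1000, 1000) would request a bpffs instance owned by user and group 1000.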

Signed-off-by: Jie Jiang <jiejiang@chromium.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Mike Frysinger <vapier@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231212093923.497838-1-jiejiang@chromium.org
2023-12-13 15:37:42 -08:00
Neeraj Upadhyay (AMD) 7dfb03dd24 Merge branches 'doc.2023.12.13a', 'torture.2023.11.23a', 'fixes.2023.12.13a', 'rcu-tasks.2023.12.12b' and 'srcu.2023.12.13a' into rcu-merge.2023.12.13a 2023-12-14 01:21:31 +05:30
Zqiang dee39c0c1e rcu: Force quiescent states only for ongoing grace period
If an rcutorture test scenario creates an fqs_task kthread, it will
periodically invoke rcu_force_quiescent_state() in order to start
force-quiescent-state (FQS) operations.  However, an FQS operation
will be started even if there is no RCU grace period in progress.
Although testing FQS operations startup when there is no grace period in
progress is necessary, it need not happen all that often.  This commit
therefore causes rcu_force_quiescent_state() to take an early exit
if there is no grace period in progress.

Note that there will still be attempts to start an FQS scan in the
absence of a grace period because the grace period might end right
after the rcu_force_quiescent_state() function's check.  In actual
testing, this happens about once every ten minutes, which should
provide adequate testing.
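The early exit can be pictured with the simplified sketch below. The struct and function names are hypothetical, and the real rcu_force_quiescent_state() does considerably more work than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified view of the grace-period state. */
struct rcu_sketch {
	bool gp_in_progress;		/* is a grace period ongoing? */
	unsigned long fqs_started;	/* number of FQS scans started */
};

/* Start an FQS scan only if a grace period is ongoing. */
bool force_quiescent_state_sketch(struct rcu_sketch *rs)
{
	assert(rs != NULL);

	if (!rs->gp_in_progress)
		return false;	/* early exit: nothing to force */

	/*
	 * In the kernel the grace period may end right after the check
	 * above, so an FQS scan can still occasionally start without one.
	 */
	rs->fqs_started++;
	return true;
}
```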

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-14 01:19:02 +05:30
Steven Rostedt (Google) 60be76eeab tracing: Add size check when printing trace_marker output
If for some reason the trace_marker write does not have a nul byte for the
string, it will overflow the print:

  trace_seq_printf(s, ": %s", field->buf);

The field->buf could be missing the nul byte. To prevent overflow, add the
max size that the buf can be by using the event size and the field
location.

  int max = iter->ent_size - offsetof(struct print_entry, buf);

  trace_seq_printf(s, ": %.*s", max, field->buf);
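
As a hedged userspace illustration of bounding a string print by a maximum size, the sketch below uses snprintf() in place of trace_seq_printf(). The helper name print_bounded() is an assumption made for this example. It relies on the "%.*s" conversion, whose precision limits how many bytes are read, so the buffer does not need a terminating nul byte:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format at most max_len bytes of buf into out.  buf does not need to be
 * nul terminated as long as max_len does not exceed its size.
 * Hypothetical userspace stand-in for a bounded trace_seq_printf() call.
 */
int print_bounded(char *out, size_t out_size, const char *buf, int max_len)
{
	/* A negative precision is treated as if it were omitted: refuse it. */
	assert(max_len >= 0);

	/* "%.*s" takes the precision (maximum bytes read) from max_len. */
	return snprintf(out, out_size, ": %.*s", max_len, buf);
}
```

Note that the order of '.' and '*' matters: with the star before the dot, the int argument becomes the minimum field width and the precision becomes zero, which does not bound the read in the intended way.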

Link: https://lore.kernel.org/linux-trace-kernel/20231212084444.4619b8ce@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12 22:07:22 -05:00
Steven Rostedt (Google) b049525855 ring-buffer: Have saved event hold the entire event
For the ring buffer iterator (non-consuming read), the event needs to be
copied into the iterator buffer to make sure that a writer does not
overwrite it while the user is reading it. If a write happens during the
copy, the buffer is simply discarded.

But the temp buffer itself was not big enough. The allocation of the
buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can
be passed into the ring buffer and saved. But the temp buffer needs to
hold the meta data as well. That would be BUF_PAGE_SIZE and not
BUF_MAX_DATA_SIZE.

Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 785888c544 ("ring-buffer: Have rb_iter_head_event() handle concurrent writer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12 20:56:10 -05:00
Steven Rostedt (Google) 9e45e39dc2 ring-buffer: Do not update before stamp when switching sub-buffers
The ring buffer timestamps are synchronized by two timestamp placeholders.
One is the "before_stamp" and the other is the "write_stamp" (sometimes
referred to as the "after stamp" but only in the comments). These two
stamps are key to knowing how to handle nested events coming in with a
lockless system.

When moving across sub-buffers, the before stamp is updated but the write
stamp is not. There's an effort to put back the before stamp to something
that seems logical in case there's nested events. But as the current event
is about to cross sub-buffers, and so will any new nested event that happens,
updating the before stamp is useless, and could even introduce new race
conditions.

The first event on a sub-buffer simply uses the sub-buffer's timestamp
and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not
used in the algorithm in this case. There's no reason to try to fix the
before_stamp when this happens.

As a bonus, it removes a cmpxchg() when crossing sub-buffers!

Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12 20:55:45 -05:00
Andrew Morton fc0fbad122 merge mm-hotfixes-stable into mm-nonmm-stable to pick up depended-upon changes 2023-12-12 17:39:11 -08:00
Yuntao Wang 1dd11e9773 crash_core: fix the check for whether crashkernel is from high memory
If crash_base is equal to CRASH_ADDR_LOW_MAX, it also indicates that
the crashkernel memory is allocated from high memory. However, the
current check only considers the case where crash_base is greater than
CRASH_ADDR_LOW_MAX. Fix it.
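
The boundary case can be shown with a small sketch. The 4G value and the function names are assumptions for illustration only; the real CRASH_ADDR_LOW_MAX is architecture-specific.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative 4G boundary; the real CRASH_ADDR_LOW_MAX is per-architecture. */
#define CRASH_ADDR_LOW_MAX 0x100000000ULL

/* Old check: a base exactly at the boundary is wrongly treated as low. */
bool crash_base_is_high_old(unsigned long long crash_base)
{
	return crash_base > CRASH_ADDR_LOW_MAX;
}

/* Fixed check: the boundary itself already belongs to high memory. */
bool crash_base_is_high(unsigned long long crash_base)
{
	return crash_base >= CRASH_ADDR_LOW_MAX;
}
```

The two checks differ only when crash_base equals CRASH_ADDR_LOW_MAX, which is exactly the case the old code missed.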

The runtime effect is that crashkernel high memory is successfully
reserved, whereas the crashkernel low memory is bypassed in this case,
then kdump kernel bootup will fail because of no low memory under 4G.

This patch also includes some minor cleanups.

Link: https://lkml.kernel.org/r/20231209141438.77233-1-ytcoode@gmail.com
Fixes: 0ab97169aa ("crash_core: add generic function to do reservation")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 17:20:18 -08:00
Ignat Korchagin c41bd25141 kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP
In commit f8ff23429c62 ("kernel/Kconfig.kexec: drop select of KEXEC for
CRASH_DUMP") we tried to fix a config regression, where CONFIG_CRASH_DUMP
required CONFIG_KEXEC.

However, it was not enough at least for arm64 platforms.  While further
testing the patch with our arm64 config I noticed that CONFIG_CRASH_DUMP
is unavailable in menuconfig.  This is because CONFIG_CRASH_DUMP still
depends on the new CONFIG_ARCH_SUPPORTS_KEXEC introduced in commit
91506f7e5d ("arm64/kexec: refactor for kernel/Kconfig.kexec") and on
arm64 CONFIG_ARCH_SUPPORTS_KEXEC requires CONFIG_PM_SLEEP_SMP=y, which in
turn requires either CONFIG_SUSPEND=y or CONFIG_HIBERNATION=y neither of
which are set in our config.

Given that we already established that CONFIG_KEXEC (which is a switch for
the kexec system call itself) is not required for CONFIG_CRASH_DUMP, drop
the CONFIG_ARCH_SUPPORTS_KEXEC dependency as well.  The arm64 kernel builds
just fine with CONFIG_CRASH_DUMP=y and with both CONFIG_KEXEC=n and
CONFIG_KEXEC_FILE=n after f8ff23429c62 ("kernel/Kconfig.kexec: drop select
of KEXEC for CRASH_DUMP") and this patch are applied given that the
necessary shared bits are included via CONFIG_KEXEC_CORE dependency.

[bhe@redhat.com: don't export some symbols when CONFIG_MMU=n]
  Link: https://lkml.kernel.org/r/ZW03ODUKGGhP1ZGU@MiWiFi-R3L-srv
[bhe@redhat.com: riscv, kexec: fix dependency of two items]
  Link: https://lkml.kernel.org/r/ZW04G/SKnhbE5mnX@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20231129220409.55006-1-ignat@cloudflare.com
Fixes: 91506f7e5d ("arm64/kexec: refactor for kernel/Kconfig.kexec")
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: <stable@vger.kernel.org> # 6.6+: f8ff234: kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
Cc: <stable@vger.kernel.org> # 6.6+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 17:20:16 -08:00
Steven Rostedt (Google) d06aff1cb1 tracing: Update snapshot buffer on resize if it is allocated
The snapshot buffer is to mimic the main buffer so that when a snapshot is
needed, the snapshot and main buffer are swapped. When the snapshot buffer
is not allocated, it is set to the minimal size that the ring buffer may be
at and still be functional. When it is allocated it becomes the same size as
the main ring buffer, and when the main ring buffer changes in size, the
snapshot buffer should change as well.

Currently, the resize only updates the snapshot buffer if it's used by the
current tracer (i.e. the preemptirqsoff tracer). But it needs to be updated
anytime it is allocated.

When changing the size of the main buffer, instead of looking to see if
the current tracer is utilizing the snapshot buffer, just check if it is
allocated to know if it should be updated or not.
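
A simplified sketch of the decision, using a hypothetical struct rather than the kernel's trace_array. The point is only that the resize path keys on whether the snapshot buffer is allocated, not on whether the current tracer uses it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the tracing instance state. */
struct trace_sketch {
	bool allocated_snapshot;	/* snapshot buffer is allocated */
	bool tracer_uses_snapshot;	/* current tracer uses the snapshot */
	size_t main_size;
	size_t snapshot_size;
};

/* Resize the main buffer and keep an allocated snapshot in step with it. */
void resize_buffers(struct trace_sketch *tr, size_t size)
{
	assert(tr != NULL);

	tr->main_size = size;

	/* Check allocation only; tracer_uses_snapshot is deliberately ignored. */
	if (tr->allocated_snapshot)
		tr->snapshot_size = size;
}
```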

Also fix typo in comment just above the code change.

Link: https://lore.kernel.org/linux-trace-kernel/20231210225447.48476a6a@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: ad909e21bb ("tracing: Add internal tracing_snapshot() functions")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12 19:00:38 -05:00
Steven Rostedt (Google) 17d8017581 ring-buffer: Fix memory leak of free page
Reading the ring buffer does a swap of a sub-buffer within the ring buffer
with an empty sub-buffer. This allows the reader to have full access to the
content of the sub-buffer that was swapped out without having to worry
about contention with the writer.

The readers call ring_buffer_alloc_read_page() to allocate a page that
will be used to swap with the ring buffer. When the code is finished with
the reader page, it calls ring_buffer_free_read_page(). Instead of freeing
the page, it stores it as a spare. The next call to
ring_buffer_alloc_read_page() will return this spare instead of calling
into the memory management system to allocate a new page.

Unfortunately, on freeing of the ring buffer, this spare page is not
freed, and causes a memory leak.
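
The spare-page pattern and the missing teardown step can be sketched in userspace as follows. The struct and function names are hypothetical, and malloc()/free() stand in for the kernel page allocator:

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_BYTES 4096	/* illustrative page size */

/* Hypothetical stand-in for the per-CPU buffer that caches one spare page. */
struct ring_sketch {
	void *spare;
};

/* Hand out the cached spare if there is one, else allocate a new page. */
void *ring_alloc_read_page(struct ring_sketch *r)
{
	void *page = r->spare;

	if (page) {
		r->spare = NULL;
		return page;
	}
	return malloc(PAGE_BYTES);
}

/* Keep the page as the spare instead of freeing it, if the slot is empty. */
void ring_free_read_page(struct ring_sketch *r, void *page)
{
	assert(page != NULL);

	if (!r->spare)
		r->spare = page;
	else
		free(page);
}

/* Teardown must release the cached spare too, otherwise it leaks. */
void ring_destroy(struct ring_sketch *r)
{
	free(r->spare);
	r->spare = NULL;
}
```

Without the free() in ring_destroy(), the last page handed to ring_free_read_page() would never be released, which mirrors the leak described above.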

Link: https://lore.kernel.org/linux-trace-kernel/20231210221250.7b9cc83c@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12 19:00:37 -05:00
Steven Rostedt (Google) b55b0a0d7c tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
If a large event was added to the ring buffer that is larger than what the
trace_seq can handle, it just drops the output:

 ~# cat /sys/kernel/tracing/trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 2/2   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
            <...>-859     [001] .....   141.118951: tracing_mark_write           <...>-859     [001] .....   141.148201: tracing_mark_write: 78901234

Instead, catch this case and add some context:

 ~# cat /sys/kernel/tracing/trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 2/2   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
            <...>-852     [001] .....   121.550551: tracing_mark_write[LINE TOO BIG]
            <...>-852     [001] .....   121.550581: tracing_mark_write: 78901234

This now emulates the same output as trace_pipe.

Link: https://lore.kernel.org/linux-trace-kernel/20231209171058.78c1a026@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12 19:00:36 -05:00
Steven Rostedt (Google) b3ae7b67b8 ring-buffer: Fix writing to the buffer with max_data_size
The maximum ring buffer data size is the maximum size of data that can be
recorded on the ring buffer. Events must be smaller than the sub buffer
data size minus any meta data. This size is checked before trying to
allocate from the ring buffer because the allocation assumes that the size
will fit on the sub buffer.

The maximum size was calculated as the size of a sub buffer page (which is
currently PAGE_SIZE minus the sub buffer header) minus the size of the
meta data of an individual event. But it missed the possible adding of a
time stamp for events that are added long enough apart that the event meta
data can't hold the time delta.

When an event is added that is greater than the current BUF_MAX_DATA_SIZE
minus the size of a time stamp, but still less than or equal to
BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
for a page that can hold the event. Luckily, there's a check for this loop:
after 1000 iterations a warning is emitted and the ring buffer is
disabled. But this should never happen.

This can happen when a large event is added first, or after a long period
where an absolute timestamp is prefixed to the event, increasing its size
by 8 bytes. This passes the check and then goes into the algorithm that
causes the infinite loop.

For events that are the first event on the sub-buffer, it does not need to
add a timestamp, because the sub-buffer itself contains an absolute
timestamp, and adding one is redundant.

The fix is to check if the event is to be the first event on the
sub-buffer, and if it is, then do not add a timestamp.

This also fixes 32 bit adding a timestamp when a read of before_stamp or
write_stamp is interrupted. There's still no need to add that timestamp if
the event is going to be the first event on the sub buffer.

Also, if the buffer has "time_stamp_abs" set, then also check if the
length plus the timestamp is greater than the BUF_MAX_DATA_SIZE.
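
The two size rules described above can be sketched as follows. The constant values and function names are assumptions for illustration (the text only states that the timestamp adds 8 bytes), and the real ring buffer code is considerably more involved:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BUF_MAX_DATA_SIZE 4080	/* illustrative value only */
#define TIMESTAMP_LEN 8		/* size added by a prefixed timestamp */

/* Up-front check: reject events that could never fit on a sub-buffer. */
bool event_fits(size_t len, bool time_stamp_abs)
{
	if (len > BUF_MAX_DATA_SIZE)
		return false;

	/* With absolute timestamps the prefix must fit as well. */
	if (time_stamp_abs && len + TIMESTAMP_LEN > BUF_MAX_DATA_SIZE)
		return false;

	return true;
}

/* Space to reserve: no timestamp for the first event on a sub-buffer. */
size_t reserved_len(size_t len, bool first_on_sub_buffer, bool need_timestamp)
{
	assert(len <= BUF_MAX_DATA_SIZE);

	/* The sub-buffer already carries an absolute timestamp of its own. */
	if (need_timestamp && !first_on_sub_buffer)
		len += TIMESTAMP_LEN;

	return len;
}
```

With these rules a maximum-sized event that starts a new sub-buffer reserves exactly its own length, so it can no longer grow past what a sub-buffer holds.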

Link: https://lore.kernel.org/all/20231212104549.58863438@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6c13@gandalf.local.home
Link: https://lore.kernel.org/linux-trace-kernel/20231212111617.39e02849@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a4543a2fa9 ("ring-buffer: Get timestamp after event is allocated")
Fixes: 58fbc3c632 ("ring-buffer: Consolidate add_timestamp to remove some branches")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev> # (on IRC)
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12 19:00:19 -05:00
Andrei Matei 745e031130 bpf: Comment on check_mem_size_reg
This patch adds a comment to check_mem_size_reg -- a function whose
meaning is not very transparent. The function implicitly deals with two
registers connected by convention, which is not obvious.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231210225149.67639-1-andreimatei1@gmail.com
2023-12-12 15:35:32 -08:00
Yang Li 56c26d5ad8 bpf: Remove unused backtrack_state helper functions
These functions are defined in the verifier.c file, but not called
anywhere, so delete the unused functions.

kernel/bpf/verifier.c:3448:20: warning: unused function 'bt_set_slot'
kernel/bpf/verifier.c:3453:20: warning: unused function 'bt_clear_slot'
kernel/bpf/verifier.c:3488:20: warning: unused function 'bt_is_slot_set'

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231212005436.103829-1-yang.lee@linux.alibaba.com

Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7714
2023-12-12 09:52:07 -08:00
Christian Brauner 4e94ddfe2a file: remove __receive_fd()
Honestly, there's little value in having a helper with and without that
int __user *ufd argument. It's just messy and doesn't really give us
anything. Just expose receive_fd() with that argument and get rid of
that helper.

Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-5-e73ca6f4ea83@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12 14:24:14 +01:00
Jason Gunthorpe 8f23f5dba6 iommu: Change kconfig around IOMMU_SVA
Linus suggested that the kconfig here is confusing:

https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/

Let's break it into three kconfigs controlling distinct things:

 - CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional
   fields for the IOMMU. Currently only PASID, but later patches store
   a struct iommu_mm_data *

 - CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit
   for keeping track of the ENQCMD instruction. x86 will select this if
   IOMMU_SVA is enabled

 - IOMMU_SVA controls if the IOMMU core compiles in the SVA support code
   for iommu driver use and the IOMMU exported API

This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20231027000525.1278806-2-tina.zhang@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-12-12 10:11:27 +01:00
Andrii Nakryiko 1a1ad782dc bpf: tidy up exception callback management a bit
Use the fact that we are passing subprog index around and have
a corresponding struct bpf_subprog_info in bpf_verifier_env for each
subprogram. We don't need to separately pass around a flag whether
subprog is exception callback or not, each relevant verifier function
can determine this using provided subprog index if we maintain
bpf_subprog_info properly.

Also move out exception callback-specific logic from
btf_prepare_func_args(), keeping it generic. We can enforce all these
restrictions right before the exception callback verification pass. We add
an out parameter, arg_cnt, for now, but this will be unnecessary with
subsequent refactoring and will be removed.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231204233931.49758-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-11 19:23:32 -08:00
Andrii Nakryiko 22b769bb4f bpf: emit more dynptr information in verifier log
Emit dynptr type for CONST_PTR_TO_DYNPTR register. Also emit id,
ref_obj_id, and dynptr_id fields for STACK_DYNPTR stack slots.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231204233931.49758-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-11 19:21:22 -08:00
Andrii Nakryiko 1e68485d82 bpf: log PTR_TO_MEM memory size in verifier log
Emit valid memory size addressable through PTR_TO_MEM register.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231204233931.49758-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-11 19:21:22 -08:00
Frederic Weisbecker c21357e446 srcu: Explain why callbacks invocations can't run concurrently
If an SRCU barrier is queued while callbacks are running and a new
callbacks invocator for the same sdp were to run concurrently, the
RCU barrier might execute too early. As this requirement is non-obvious,
make sure to keep a record.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12 02:41:17 +05:30
Frederic Weisbecker 94c55b9e21 srcu: No need to advance/accelerate if no callback enqueued
At grace-period start, if no callback is to be enqueued, there is
nothing to accelerate and therefore no need to advance the callbacks
either.

Spare these needless operations in this case.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12 02:41:16 +05:30
Frederic Weisbecker 20eb414239 srcu: Remove superfluous callbacks advancing from srcu_gp_start()
Callbacks advancing on SRCU must be performed in two specific places:

1) At enqueue time in order to make room for the acceleration of the
   new callback.

2) At invocation time in order to move the callbacks ready to invoke.

Any other callback advancing callsite is needless. Remove the remaining
one in srcu_gp_start().

Co-developed-by: Yong He <zhuangel570@gmail.com>
Signed-off-by: Yong He <zhuangel570@gmail.com>
Co-developed-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12 02:40:45 +05:30
Li zeming 4ac934b1aa PM: hibernate: Do not initialize error in snapshot_write_next()
The error variable in snapshot_write_next() gets a value before it is
used, so don't initialize it to 0 upfront.

Signed-off-by: Li zeming <zeming@nfschina.com>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-11 22:09:52 +01:00
Li zeming bbeaa4691f PM: hibernate: Do not initialize error in swap_write_page()
'error' first receives the function result before it is used, and it
does not need to be assigned a value during definition.

Signed-off-by: Li zeming <zeming@nfschina.com>
[ rjw: Subject rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-11 22:06:15 +01:00
Wang chaodong a1ca8295ee PM: hibernate: Drop unnecessary local variable initialization
It is not necessary to initialize the error variable in
create_basic_memory_bitmaps(), because it is only read after
being assigned a value.

Signed-off-by: Wang chaodong <chaodong@nfschina.com>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-11 22:02:48 +01:00
Paul E. McKenney 4e58aaeebb rcu: Restrict access to RCU CPU stall notifiers
The RCU CPU stall notifiers can be useful for dumping state when
tracking down delicate forward-progress bugs where NUMA effects cause
cache lines to be delivered to a given CPU regularly, but always in a
state that prevents that CPU from making forward progress.  These bugs can
be detected by the RCU CPU stall-warning mechanism, but in some cases,
the stall-warnings printk()s disrupt the forward-progress bug before
any useful state can be obtained.

Unfortunately, the notifier mechanism added by commit 5b404fdaba ("rcu:
Add RCU CPU stall notifier") can make matters worse if used at all
carelessly. For example, if the stall warning was caused by a lock not
being released, then any attempt to acquire that lock in the notifier
will hang. This will prevent not only the notifier from producing any
useful output, but it will also prevent the stall-warning message from
ever appearing.

This commit therefore hides this new RCU CPU stall notifier
mechanism under a new RCU_CPU_STALL_NOTIFIER Kconfig option that
depends on both DEBUG_KERNEL and RCU_EXPERT.  In addition, the
rcupdate.rcu_cpu_stall_notifiers=1 kernel boot parameter must also
be specified.  The RCU_CPU_STALL_NOTIFIER Kconfig option's help text
contains a warning and explains the dangers of careless use, recommending
lockless notifier code.  In addition, a WARN() is triggered each time
that an attempt is made to register a stall-warning notifier in kernels
built with CONFIG_RCU_CPU_STALL_NOTIFIER=y.

This combination of measures will keep use of this mechanism confined to
debug kernels and away from routine deployments.

[ paulmck: Apply Dan Carpenter feedback. ]

Fixes: 5b404fdaba ("rcu: Add RCU CPU stall notifier")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12 02:31:22 +05:30
Paul E. McKenney 18966f7b94 rcu-tasks: Mark RCU Tasks accesses to current->rcu_tasks_idle_cpu
The task_struct structure's ->rcu_tasks_idle_cpu can be concurrently
read and written from the RCU Tasks grace-period kthread and from the
CPU on which the task_struct structure's task is running.  This commit
therefore marks the accesses appropriately.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-12-12 02:22:47 +05:30
Li Zhe 9d02330abd softlockup: serialized softlockup's log
If multiple CPUs trigger softlockup at the same time with
'softlockup_all_cpu_backtrace=0', their softlockup logs are interleaved
in dmesg, which makes the logs hard for developers to read.  Since the
code path for outputting softlockup logs is not a kernel hotspot and its
performance requirements are not strict, use a lock to serialize the
softlockup log output and improve the readability of the logs.

Link: https://lkml.kernel.org/r/20231123084022.10302-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:44 -08:00
Baoquan He b3ba234171 kexec_file: load kernel at top of system RAM if required
Patch series "kexec_file: Load kernel at top of system RAM if required".

Justification:
==============

The kexec_load interface has been doing top-down searching and loading of
kernel/initrd/purgatory etc. to prepare for kexec reboot.  The benefits
are that it avoids consuming and fragmenting limited low memory, which is
needed for DMA buffer allocation and for big chunks of contiguous memory
during system init; and it avoids interfering with BIOS/FW reserved or
occupied areas, or areas occupied by corner case handling/work
arounds/quirks during system init.  By the way, the top-down searching
and loading of the kexec-ed kernel is done in user space utility code.

For kexec_file loading, even if kexec_buf.top_down is 'true', it's simply
ignored.  It calls walk_system_ram_res() directly to go through all
resources of System RAM bottom up, to find an available memory region,
then calls locate_mem_hole_callback() to allocate memory in that found
memory region from top to down.  This is not expected and inconsistent
with kexec_load.

Implementation
===============

In patch 1, introduce a new function walk_system_ram_res_rev(), a variant
of walk_system_ram_res() that walks through a list of all the resources
of System RAM in reversed order, i.e., from higher to lower.

In patch 2, check if kexec_buf.top_down is 'true' in
kexec_walk_resources(); if so, call walk_system_ram_res_rev() to find a
memory region of system RAM from the top down to load kernel/initrd etc.

Background information:
=======================

I tried this in the past in a different way; please see the link below.
In that post, I tried to adjust the struct sibling linking code, replacing
the singly linked list with list_head so that walk_system_ram_res_rev()
could be implemented in a much easier way.  In the end, that attempt
failed.
https://lore.kernel.org/all/20180718024944.577-4-bhe@redhat.com/

This time, I picked up the patch from AKASHI Takahiro's old post and made
some changes to use it as the current patch 1:
https://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/531456.html


This patch (of 2):

The kexec_load interface has been doing top-down searching and loading of
kernel/initrd/purgatory etc. to prepare for kexec reboot.  The benefits
are that it avoids consuming and fragmenting limited low memory, which is
needed for DMA buffer allocation and for big chunks of contiguous memory
during system init; and it avoids interfering with BIOS/FW reserved or
occupied areas, or areas occupied by corner case handling/work
arounds/quirks during system init.  By the way, the top-down searching
and loading of the kexec-ed kernel is done in user space utility code.

For kexec_file loading, even if kexec_buf.top_down is 'true', it's simply
ignored.  It calls walk_system_ram_res() directly to go through all
resources of System RAM bottom up, to find an available memory region,
then calls locate_mem_hole_callback() to allocate memory in that found
memory region from top to down.  This is not expected and inconsistent
with kexec_load.

Here, check if kexec_buf.top_down is 'true' in kexec_walk_resources(); if
so, call the newly added walk_system_ram_res_rev() to find a memory region
of system RAM from the top down to load kernel/initrd etc.

Link: https://lkml.kernel.org/r/20231114091658.228030-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20231114091658.228030-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:44 -08:00
Baoquan He 7acf164b25 resource: add walk_system_ram_res_rev()
This function, being a variant of walk_system_ram_res() introduced in
commit 8c86e70ace ("resource: provide new functions to walk through
resources"), walks through a list of all the resources of System RAM in
reversed order, i.e., from higher to lower.
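
As an illustration only, the difference between the two walk directions
can be sketched in user space.  Everything below (struct ram_range,
walk_ranges(), walk_ranges_rev(), take_top_of_range()) is a hypothetical
stand-in and not the kernel's resource API; it assumes a sorted array of
ranges and a callback that returns non-zero to stop the walk.

```c
#include <stddef.h>

/* Hypothetical stand-in for a System RAM resource; bounds are inclusive. */
struct ram_range {
	unsigned long start;
	unsigned long end;
};

/* Callback convention assumed here: return non-zero to stop the walk. */
typedef int (*range_fn)(const struct ram_range *r, void *arg);

/* Visit the ranges from the lowest address to the highest. */
static int walk_ranges(const struct ram_range *r, size_t n, void *arg,
		       range_fn fn)
{
	int ret = 0;

	for (size_t i = 0; i < n && !ret; i++)
		ret = fn(&r[i], arg);
	return ret;
}

/* Visit the same ranges from the highest address to the lowest. */
static int walk_ranges_rev(const struct ram_range *r, size_t n, void *arg,
			   range_fn fn)
{
	int ret = 0;

	for (size_t i = n; i > 0 && !ret; i--)
		ret = fn(&r[i - 1], arg);
	return ret;
}

/* Hypothetical request: find 'size' bytes and report where they start. */
struct hole_request {
	unsigned long size;
	unsigned long start;
};

/* Accept the first range that is large enough and use its top end. */
static int take_top_of_range(const struct ram_range *r, void *arg)
{
	struct hole_request *req = arg;

	if (r->end - r->start + 1 < req->size)
		return 0;	/* too small, keep walking */
	req->start = r->end - req->size + 1;
	return 1;		/* found, stop the walk */
}
```

With the same callback, the forward walk settles on the lowest range that
fits, while the reversed walk settles on the highest one, which is the
behavior a top-down loader wants.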

It will be used in kexec_file code to load kernel, initrd etc when
preparing kexec reboot.

Link: https://lkml.kernel.org/r/ZVTA6z/06cLnWKUz@MiWiFi-R3L-srv
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:44 -08:00
Arnd Bergmann b1c3efe079 sched: fair: move unused stub functions to header
These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED,
and an empty one that is only referenced when FAIR_GROUP_SCHED is disabled
but CGROUP_SCHED is still enabled.  If both are turned off, the functions
are still defined but the missing prototype causes a W=1 warning:

kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group'
kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group'
kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group'
kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group'

Move the alternatives into the header as static inline functions with the
correct combination of #ifdef checks to avoid the warning without adding
even more complexity.

[A different patch with the same description got applied by accident
 and was later reverted, but the original patch is still missing]

Link: https://lkml.kernel.org/r/20231123110506.707903-4-arnd@kernel.org
Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:43 -08:00
Uros Bizjak 0311d82724 kexec: use atomic_try_cmpxchg in crash_kexec
Use atomic_try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
crash_kexec().  x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg.

No functional change intended.
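
For readers unfamiliar with the pattern, here is a rough user-space sketch
using C11 atomics as a stand-in for the kernel's atomic_t helpers.  The
names CPU_NONE, cmpxchg_int(), claim_with_cmpxchg() and
claim_with_try_cmpxchg() are made up for this illustration and are not
the code touched by this patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CPU_NONE (-1)	/* hypothetical "nobody owns it yet" value */

/*
 * Emulate a cmpxchg() that returns the previous value: on failure, C11
 * stores the observed value into 'old'; on success 'old' is unchanged.
 */
static int cmpxchg_int(atomic_int *ptr, int old, int new_val)
{
	atomic_compare_exchange_strong(ptr, &old, new_val);
	return old;
}

/* Old style: do the exchange, then compare the returned value again. */
static bool claim_with_cmpxchg(atomic_int *owner, int cpu)
{
	return cmpxchg_int(owner, CPU_NONE, cpu) == CPU_NONE;
}

/* try_cmpxchg style: the success flag comes back directly. */
static bool claim_with_try_cmpxchg(atomic_int *owner, int cpu)
{
	int old = CPU_NONE;

	return atomic_compare_exchange_strong(owner, &old, cpu);
}
```

Both helpers let exactly one caller claim the slot; the second form simply
avoids the extra comparison against the expected value.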

Link: https://lkml.kernel.org/r/20231114161228.108516-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:33 -08:00
Oleg Nesterov 27bbb2a0fd __ptrace_unlink: kill the obsolete "FIXME" code
The corner case described by the comment is no longer possible after the
commit 7b3c36fc4c ("ptrace: fix task_join_group_stop() for the case when
current is traced"), task_join_group_stop() ensures that the new thread
has the correct signr in JOBCTL_STOP_SIGMASK regardless of ptrace.

Link: https://lkml.kernel.org/r/20231121162650.GA6635@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:32 -08:00
Oleg Nesterov b454ec2922 kernel/signal.c: simplify force_sig_info_to_task(), kill recalc_sigpending_and_wake()
The purpose of recalc_sigpending_and_wake() is not clear, it looks
"obviously unneeded" because we are going to send the signal which can't
be blocked or ignored.

Add the comment to explain why we can't rely on send_signal_locked() and
make this logic more simple/explicit.  recalc_sigpending_and_wake() has no
other users, it can die.

In fact I think we don't even need signal_wake_up(), the target task must
be either current or a TASK_TRACED child, otherwise the usage of siglock
is not safe.  But this needs another change.

Link: https://lkml.kernel.org/r/20231120151649.GA15995@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:32 -08:00
Heiko Carstens 3888750e21 arch: remove ARCH_TASK_STRUCT_ALLOCATOR
IA-64 was the only architecture which selected ARCH_TASK_STRUCT_ALLOCATOR.
IA-64 was removed with commit cf8e865810 ("arch: Remove Itanium (IA-64)
architecture"). Therefore remove support for ARCH_TASK_STRUCT_ALLOCATOR
as well.

Link: https://lkml.kernel.org/r/20231116133638.1636277-3-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:31 -08:00
Heiko Carstens f72709ab69 arch: remove ARCH_THREAD_STACK_ALLOCATOR
Patch series "Remove unused code after IA-64 removal".

While looking into something different I noticed that there are a couple
of Kconfig options which were only selected by IA-64 and which are now
unused.

So remove them and simplify the code a bit.


This patch (of 3):

IA-64 was the only architecture which selected ARCH_THREAD_STACK_ALLOCATOR.
IA-64 was removed with commit cf8e865810 ("arch: Remove Itanium (IA-64)
architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR as
well.

Link: https://lkml.kernel.org/r/20231116133638.1636277-1-hca@linux.ibm.com
Link: https://lkml.kernel.org/r/20231116133638.1636277-2-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:30 -08:00
Oleg Nesterov 61a7a5e25f introduce for_other_threads(p, t)
Cosmetic, but imho it makes the usage look clearer and simpler; the new
helper doesn't require the caller to initialize "t".

After this change while_each_thread() has only 3 users, and it is only
used in the do/while loops.
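
A minimal user-space sketch of the idea, assuming a circular singly
linked ring of threads.  struct task, next_thread() and
count_other_threads() below are simplified stand-ins written for this
illustration, and the macro body is one plausible shape rather than a
copy of the kernel definition.

```c
#include <stddef.h>

/* Hypothetical thread descriptor; 'next' links the group into a ring. */
struct task {
	int id;
	struct task *next;
};

static struct task *next_thread(struct task *t)
{
	return t->next;
}

/*
 * Visit every thread in the ring except 'p' itself.  The macro assigns
 * 't' before first use, so the caller does not have to initialize it.
 */
#define for_other_threads(p, t) \
	for ((t) = (p); ((t) = next_thread(t)) != (p); )

/* Example user: count the siblings of 'p'. */
static int count_other_threads(struct task *p)
{
	struct task *t;
	int n = 0;

	for_other_threads(p, t)
		n++;
	return n;
}
```

In a ring of three threads the loop body runs twice; for a
single-threaded group it does not run at all.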

Link: https://lkml.kernel.org/r/20231030155710.GA9095@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:25 -08:00
Dongmin Lee a9a1d6ad66 kernel/reboot: explicitly notify if halt occurred instead of power off
When kernel_can_power_off() returns false and reboot has been called with
LINUX_REBOOT_CMD_POWER_OFF, kernel_halt() is initiated instead of the
actual power off function.

However, in this situation, the kernel never explicitly notifies the user
that the system halted instead of performing the requested power off.

Since halt and power off behave differently, and the user initiated the
reboot call with the power off command, not halt, this could be unintended
behavior for the user, like this:

~ # poweroff -f
[    3.581482] reboot: System halted

Therefore, explicitly notify the user that power off is not available and
that the system has been halted instead:

~ # poweroff -f
[    4.123668] reboot: Power off not available: System halted instead

[akpm@linux-foundation.org: tweak comment text]
Link: https://lkml.kernel.org/r/20231104113320.72440-1-ldmldm05@gmail.com
Signed-off-by: Dongmin Lee <ldmldm05@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:25 -08:00
Peng Zhang d240629148 fork: use __mt_dup() to duplicate maple tree in dup_mmap()
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result in
better performance.  __mt_dup() uses DFS pre-order to duplicate the maple
tree, so it is efficient.

The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs.  The proof of the time complexity is provided in the commit log
that introduces __mt_dup().  After duplicating the maple tree, each
element is traversed and replaced (ignoring the cases of deletion, which
are rare).  Since it is only a replacement operation for each element,
this process is also O(n).

Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node,
pushing data to adjacent nodes, or even splitting nodes.  The frequency of
each action is difficult to calculate.  The worst-case scenario for a
single insertion is when the tree undergoes splitting at every level.  If
we consider each insertion as the worst-case scenario, we can determine
that the upper bound of the time complexity is O(n*log(n)), although this
is a loose upper bound.  However, based on the test data, it appears that
the actual time complexity is likely to be O(n).

As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree.  To handle this, we mark the failure point with
XA_ZERO_ENTRY.  In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.

There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork().  I modified it slightly to make it work with
different number of VMAs.

Below are the test results.  The first row shows the number of VMAs.  The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and this patchset, respectively.  The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results.  There are still some
fluctuations in the test results, but at least they are better than the
original performance.

21     121   221    421    821    1621   3221   6421   12821  25621  51221
112100 76261 54227  34035  20195  11112  6017   3161   1606   802    393
114558 83067 65008  45824  28751  16072  8922   4747   2436   1233   599
2.19%  8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%

[1] https://github.com/kdlucas/byte-unixbench/tree/master

Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:34 -08:00
Linus Torvalds 3a87498869 - Make sure tasks are thawed exactly and only once to avoid their state
getting corrupted
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmV1nDUACgkQEsHwGGHe
 VUqC3Q/9GF3IjEzKZAwTqw9ir2Nq9fFKkDZVT1ZCkXcg3bc6t5Dp68NcMPEoPdNE
 6ONaEwKhZxqPyivI7u1ExdZnHYDMRWolZmjs/x19c+g3Zo6QzT+6blMdoWvl6nV2
 RD3macPt5w5bcJ8ugSM4ekTQgo4nPU5VhBS52zDARx0W9ufpIk3YKmxmVQjhuV5J
 z/nfewUuUtAHDxnbF8pRvN8WoSg15Z5iERksdcj8Wagjx79cMAR6liuauJNkj9dP
 lldG69ODdJeZc9L/SUkLEgYPVaq+G6BOKgWXbzeiRM9LedHN3iQlT9JUttLHN383
 NdTbQ6lboViP1O64WuoqJFVDYvY0DvVLUll4URywfT3lPbISGvxhg0Xj+4E8F5W9
 A9pB9TDZwRXwrNuRLksaY0v/Glfo7eUr6252aDbgrUovJCDOwfRB+pI4ywpfoL/+
 2eKkJR1mUjoCXirkbYjcm7EhnTSKxiKmCYK7pyol3fJCsK/4bQF7mJ4UyDFIB3Na
 VXVD41KkMsaAdIQp4HbdduYaPSCQvQee6ahtobQwcxyBWGXRzurTw4ubHlzSeN9F
 fIfxF9PfSY+So2J9IrU1uYKPvfbUWfU3b1urQPhPvVlbVlZmfG579ek6+4bhagsg
 UztDRvv9lCxvBskruIMfelAduXsDkDi0UwJ0/TXlPnQGzYlDdeI=
 =07a8
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

 - Make sure tasks are thawed exactly and only once to avoid their state
   getting corrupted

* tag 'sched_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  freezer,sched: Do not restore saved_state of a thawed task
2023-12-10 11:09:16 -08:00
Linus Torvalds 537ccb5d28 - Make sure perf event size validation is done on every event in the
group
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmV1mfQACgkQEsHwGGHe
 VUrW9w/9EVMf1/cu1rY4XN68NbOgdfoic2oPan60WJwiYhYto9uA1quR4Q8ziwVh
 UbuO4e0up1ZCUzutZGFnx0ZHrlQIb0/YTQj8kDKX6m7g8s2Vers7YbkRwneDsNqA
 JDp58yGXdc1TipVYrKqa0leNrezvaEeoVFPIPKoelzi3673xrlslRseJ/n7vJd4u
 SnMjT7LQZIlEe/pecz01nHAo6SSwfI/Ynh2WSorHnhSTuE5gMUzJwBYSXvpZ2gyg
 207keTiIcrvxgT+a32NMeEYsFFFvpYKFHI5nxxV1pB8AWXdWaNpuYHNItTDIh81D
 fSb8hu+EpNSWtZYzXl/esgULfMgHXez+4VknTpX/vsbfcV1Yif4aHlZP8tgP6gZ5
 QyA2NMA5vJypjzLsAgCyZjpTyEVPYQ3f4+iYg4EGlMlgLgoXtHIV+zP765SzDVkC
 yPO4xVf+Ypo9AKcGKjBrxyMlRq40zos40k6l2yOjSUlTE2IfOLMhjgVHeLcgD+uv
 E9pi0/KtfGvrm3nWgIhDtcvd5Jg6vrilaRWl9bAN6g6xgaqLPXuIZbOjPaRpKSNa
 L32XBMg5fUt4eesZv458qu4Zw1ybHCd6qoe3OieFzW5ocR61O946MHX3kkbpmsWC
 PzH1mBsPa3F8/utJ06p+9pank3M5yKHdkDPQXfSvImuZ3DPKEGI=
 =QxHj
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fix from Borislav Petkov:

 - Make sure perf event size validation is done on every event in the
   group

* tag 'perf_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix perf_event_validate_size()
2023-12-10 11:03:15 -08:00
David Vernet a6de18f310 bpf: Add bpf_cpumask_weight() kfunc
It can be useful to query how many bits are set in a cpumask. For
example, if you want to perform special logic for the last remaining
core that's set in a mask. Let's therefore add a new
bpf_cpumask_weight() kfunc which checks how many bits are set in a mask.
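
As a rough illustration of what "weight" means here, the following
user-space sketch counts the set bits in a mask stored as an array of
words.  mask_weight() and is_last_remaining_cpu() are hypothetical helpers
for this example only and are not the kfunc added by this patch.

```c
#include <stddef.h>

/* Count the set bits across all words of a (hypothetical) CPU mask. */
static unsigned int mask_weight(const unsigned long *words, size_t nwords)
{
	unsigned int weight = 0;

	for (size_t i = 0; i < nwords; i++) {
		unsigned long w = words[i];

		/* Clear the lowest set bit until none are left. */
		while (w) {
			w &= w - 1;
			weight++;
		}
	}
	return weight;
}

/* The "last remaining core" case: exactly one bit is still set. */
static int is_last_remaining_cpu(const unsigned long *words, size_t nwords)
{
	return mask_weight(words, nwords) == 1;
}
```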

Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231207210843.168466-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 21:37:33 -08:00
Andrii Nakryiko 482d548d40 bpf: handle fake register spill to stack with BPF_ST_MEM instruction
When verifier validates BPF_ST_MEM instruction that stores known
constant to stack (e.g., *(u64 *)(r10 - 8) = 123), it effectively spills
a fake register with a constant (but initially imprecise) value to
a stack slot. Because read-side logic treats it as a proper register
fill from stack slot, we need to mark such stack slot initialization as
INSN_F_STACK_ACCESS instruction to stop precision backtracking from
missing it.

Fixes: 41f6f64e69 ("bpf: support non-r10 register spill/fill to/from stack in precision tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231209010958.66758-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:43:20 -08:00
Hou Tao 06e5c999f1 bpf: Set uattr->batch.count as zero before batched update or deletion
generic_map_{delete,update}_batch() doesn't set uattr->batch.count as
zero before it tries to allocate memory for key. If the memory
allocation fails, the value of uattr->batch.count will be incorrect.

Fix it by setting uattr->batch.count as zero before batched update or
deletion.
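
The ordering problem can be sketched in user space as follows.  This is
only an illustration under stated assumptions: struct batch_attr,
batch_update(), the alloc_fn hook and failing_alloc() are invented for
the example, and the real code reports the count to user space rather
than writing a plain struct field.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical attribute: 'count' is the request on entry, result on exit. */
struct batch_attr {
	unsigned int count;
};

/* Allocator hook so that an allocation failure can be simulated. */
typedef void *(*alloc_fn)(size_t size);

static int batch_update(struct batch_attr *attr, size_t key_size,
			alloc_fn alloc)
{
	unsigned int requested = attr->count;
	unsigned int done = 0;
	void *key;

	/* Report zero processed elements before anything can fail. */
	attr->count = 0;

	key = alloc(key_size);
	if (!key)
		return -ENOMEM;

	while (done < requested) {
		/* One element would be updated here. */
		done++;
	}

	attr->count = done;
	free(key);
	return 0;
}

/* Test helper that always fails, standing in for memory pressure. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Without the early reset, a caller that sees -ENOMEM would still find its
own requested count in the attribute and could misread it as the number
of elements processed.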

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:34 -08:00
Hou Tao 67ad2c73ff bpf: Only call maybe_wait_bpf_programs() when map operation succeeds
There is no need to call maybe_wait_bpf_programs() if update or deletion
operation fails. So only call maybe_wait_bpf_programs() if update or
deletion operation succeeds.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:34 -08:00
Hou Tao 012772581d bpf: Add missed maybe_wait_bpf_programs() for htab of maps
When doing batched lookup and deletion operations on htab of maps,
maybe_wait_bpf_programs() is needed to ensure all programs don't use the
inner map after the bpf syscall returns.

Instead of adding the wait in __htab_map_lookup_and_delete_batch(),
adding the wait in bpf_map_do_batch() and also removing the calling of
maybe_wait_bpf_programs() from generic_map_{delete,update}_batch().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:34 -08:00
Hou Tao 37ba5b59d6 bpf: Call maybe_wait_bpf_programs() only once for generic_map_update_batch()
Just like commit 9087c6ff8d ("bpf: Call maybe_wait_bpf_programs() only
once from generic_map_delete_batch()"), there is also no need to call
maybe_wait_bpf_programs() for each update in batched update, so only
call it once in generic_map_update_batch().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:33 -08:00
Hou Tao c26f2a8901 bpf: Remove unnecessary wait from bpf_map_copy_value()
Both map_lookup_elem() and generic_map_lookup_batch() use
bpf_map_copy_value() to lookup and copy the value, and there is no
update operation in bpf_map_copy_value(), so just remove the invocation
of maybe_wait_bpf_programs() from it.

Fixes: 15c14a3dca ("bpf: Add bpf_map_{value_size, update_value, map_copy_value} functions")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:33 -08:00
Yafang Shao 73d9eb340d bpf: Enable bpf_cgrp_storage for cgroup1 non-attach case
In the current cgroup1 environment, associating operations between cgroups
and applications in a BPF program requires storing a mapping of cgroup_id
to application either in a hash map or maintaining it in userspace.
However, by enabling bpf_cgrp_storage for cgroup1, it becomes possible to
conveniently store application-specific information in cgroup-local storage
and utilize it within BPF programs. Furthermore, enabling this feature for
cgroup1 involves minor modifications for the non-attach case, streamlining
the process.

However, when it comes to enabling this functionality for the cgroup1
attach case, it presents challenges. Therefore, the decision is to focus on
enabling it solely for the cgroup1 non-attach case at present. If
attempting to attach to a cgroup1 fd, the operation will simply fail with
the error code -EBADF.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231206115326.4295-2-laoar.shao@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-12-08 17:08:18 -08:00
Andrei Matei 2929bfac00 bpf: Minor cleanup around stack bounds
Push the rounding up of stack offsets into the function responsible for
growing the stack, rather than relying on all the callers to do it.
Uncertainty about whether the callers did it or not tripped up people in
a previous review.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231208032519.260451-4-andreimatei1@gmail.com
2023-12-08 14:19:00 -08:00
Andrei Matei 6b4a64bafd bpf: Fix accesses to uninit stack slots
Privileged programs are supposed to be able to read uninitialized stack
memory (ever since 6715df8d5) but, before this patch, these accesses
were permitted inconsistently. In particular, accesses were permitted
above state->allocated_stack, but not below it. In other words, if the
stack was already "large enough", the access was permitted, but
otherwise the access was rejected instead of being allowed to "grow the
stack". This undesired rejection was happening in two places:
- in check_stack_slot_within_bounds()
- in check_stack_range_initialized()
This patch arranges for these accesses to be permitted. A bunch of tests
that were relying on the old rejection had to change; all of them were
changed to also run unprivileged, in which case the old behavior
persists. One test couldn't be updated - global_func16 - because it
can't run unprivileged for other reasons.

This patch also fixes the tracking of the stack size for variable-offset
reads. This second fix is bundled in the same commit as the first one
because they're inter-related. Before this patch, writes to the stack
using registers containing a variable offset (as opposed to registers
with fixed, known values) were not properly contributing to the
function's needed stack size. As a result, it was possible for a program
to verify, but then to attempt to read out-of-bounds data at runtime
because a too small stack had been allocated for it.

Each function tracks the size of the stack it needs in
bpf_subprog_info.stack_depth, which is maintained by
update_stack_depth(). For regular memory accesses, check_mem_access()
was calling update_stack_depth() but it was passing in only the fixed
part of the offset register, ignoring the variable offset. This was
incorrect; the minimum possible value of that register should be used
instead.
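
The arithmetic behind this can be sketched as below.  The struct and
function names are hypothetical and only model the idea: stack offsets
are negative relative to the frame pointer, and the register's variable
part is known only as a signed [smin, smax] range.

```c
/* Hypothetical model of the variable part of an offset register. */
struct var_off_range {
	long smin;	/* smallest possible variable offset */
	long smax;	/* largest possible variable offset */
};

/* Old accounting: only the fixed part of the offset is considered. */
static long stack_bytes_fixed_only(long fixed_off)
{
	return fixed_off < 0 ? -fixed_off : 0;
}

/*
 * Accounting described above: use the minimum possible value of the
 * whole offset, i.e. the deepest slot the access can reach.
 */
static long stack_bytes_min_off(long fixed_off, struct var_off_range var)
{
	long min_off = fixed_off + var.smin;

	return min_off < 0 ? -min_off : 0;
}
```

For a fixed offset of -8 combined with a variable part in [-24, 0], the
first helper reports 8 bytes while the access can reach 32 bytes below
the frame pointer, which is what the second helper reports.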

This tracking is now fixed by centralizing the tracking of stack size in
grow_stack_state(), and by lifting the calls to grow_stack_state() to
check_stack_access_within_bounds() as suggested by Andrii. The code is
now simpler and more convincingly tracks the correct maximum stack size.
check_stack_range_initialized() can now rely on enough stack having been
allocated for the access; this helps with the fix for the first issue.

A few tests were changed to also check the stack depth computation. The
one that fails without this patch is verifier_var_off:stack_write_priv_vs_unpriv.

Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231208032519.260451-3-andreimatei1@gmail.com

Closes: https://lore.kernel.org/bpf/CABWLsev9g8UP_c3a=1qbuZUi20tGoUXoU07FPf-5FLvhOKOY+Q@mail.gmail.com/
2023-12-08 14:19:00 -08:00
Linus Torvalds 17894c2a7a tracing fixes for v6.7-rc4:
- Snapshot buffer issues
 
   1. When instances started allowing latency tracers, it uses
      a snapshot buffer (another buffer that is not written to
      but swapped with the main buffer that is). The snapshot buffer
      needs to be the same size as the main buffer. But when the
      snapshot buffers were added to instances, the code to make
      the snapshot equal to the main buffer still was only doing it
      for the main buffer and not the instances.
 
   2. Need to stop the current tracer when resizing the buffers.
      Otherwise there can be a race if the tracer decides to make
      a snapshot between resizing the main buffer and the snapshot
      buffer.
 
   3. When a tracer is "stopped" it disables both the main buffer
      and the snapshot buffer. This needs to be done for instances
      and not only the main buffer, now that instances also have
      a snapshot buffer.
 
 - Buffered event for filtering issues
 
   When filtering is enabled, because events can be dropped often,
   it is quicker to copy the event into a temp buffer and write that
   into the main buffer if it is not filtered or just drop the event
   if it is, than to write the event into the ring buffer and then
   try to discard it. This temp buffer is allocated and needs special
   synchronization to do so. But there were some issues with that:
 
   1. When disabling the filter and freeing the buffer, a call to all
      CPUs is required to stop each per_cpu usage. But the code
      called smp_call_function_many() which does not include the
      current CPU. If the task is migrated to another CPU when it
      enables the CPUs via smp_call_function_many(), it will not enable
      the one it is currently on and this causes issues later on.
      Use on_each_cpu_mask() instead, which includes the current CPU.
 
    2. When the allocation of the buffered event fails, it can give
       a warning. But the buffered event is just an optimization
       (it's still OK to write to the ring buffer and free it).
       Do not WARN in this case.
 
    3. The freeing of the buffer event requires synchronization.
       First a counter is decremented to zero so that no new uses
       of it will happen. Then it sets the buffered event to NULL,
       and finally it frees the buffered event. There's a synchronize_rcu()
       between the counter decrement and the setting the variable to
       NULL, but only a smp_wmb() between that and the freeing of the
       buffer. It is theoretically possible that a user missed seeing
       the decrement, but will use the buffer after it is free. Another
       synchronize_rcu() is needed in place of that smp_wmb().
 
 - ring buffer timestamps on 32 bit machines
 
   The ring buffer timestamp on 32 bit machines has to break the 64 bit
   number into multiple values as cmpxchg is required on it, and a
   64 bit cmpxchg on 32 bit architectures is very slow. The code used
   to just use two 32 bit values and make it a 60 bit timestamp where
   the other 4 bits were used as counters for synchronization. It later
   became known that the timestamp on 32 bit still needs all 64 bits in
   some cases. So 3 words were created to handle the 64 bits. But issues
   arose with this:
 
    1. The synchronization logic still only compared the counter
       with the first two, but not with the third number, so the
       synchronization could fail unknowingly.
 
    2. A check on discard of an event could race if an event happened
       between the discard and updating one of the counters. The
       counter needs to be updated (forcing an absolute timestamp
       and not to use a delta) before the actual discard happens.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZXIP5hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmJxAQDXBZwBUFQjWqZHLJn0S9aaz5FggkeR
 RmlsOMND0PXcjwD+N6U905i553ehu3SSyOP+5svoi0hyCB2qhj3ZF0LzZQU=
 =us1V
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Snapshot buffer issues:

   1. When instances started allowing latency tracers, it uses a
      snapshot buffer (another buffer that is not written to but swapped
      with the main buffer that is). The snapshot buffer needs to be the
      same size as the main buffer. But when the snapshot buffers were
      added to instances, the code to make the snapshot equal to the
      main buffer still was only doing it for the main buffer and not
      the instances.

   2. Need to stop the current tracer when resizing the buffers.
      Otherwise there can be a race if the tracer decides to make a
      snapshot between resizing the main buffer and the snapshot buffer.

   3. When a tracer is "stopped" it disables both the main buffer and
      the snapshot buffer. This needs to be done for instances and not
      only the main buffer, now that instances also have a snapshot
      buffer.

 - Buffered event for filtering issues:

   When filtering is enabled, because events can be dropped often, it is
   quicker to copy the event into a temp buffer and write that into the
   main buffer if it is not filtered or just drop the event if it is,
   than to write the event into the ring buffer and then try to discard
   it. This temp buffer is allocated and needs special synchronization
   to do so. But there were some issues with that:

   1. When disabling the filter and freeing the buffer, a call to all
      CPUs is required to stop each per_cpu usage. But the code called
      smp_call_function_many() which does not include the current CPU.
      If the task is migrated to another CPU when it enables the CPUs
      via smp_call_function_many(), it will not enable the one it is
      currently on and this causes issues later on. Use
      on_each_cpu_mask() instead, which includes the current CPU.

   2. When the allocation of the buffered event fails, it can give a
      warning. But the buffered event is just an optimization (it's
      still OK to write to the ring buffer and free it). Do not WARN in
      this case.

   3. The freeing of the buffer event requires synchronization. First a
      counter is decremented to zero so that no new uses of it will
      happen. Then it sets the buffered event to NULL, and finally it
      frees the buffered event. There's a synchronize_rcu() between the
      counter decrement and the setting the variable to NULL, but only a
      smp_wmb() between that and the freeing of the buffer. It is
      theoretically possible that a user missed seeing the decrement,
      but will use the buffer after it is free. Another
      synchronize_rcu() is needed in place of that smp_wmb().

 - ring buffer timestamps on 32 bit machines

   The ring buffer timestamp on 32 bit machines has to break the 64 bit
   number into multiple values as cmpxchg is required on it, and a 64
   bit cmpxchg on 32 bit architectures is very slow. The code used to
   just use two 32 bit values and make it a 60 bit timestamp where the
   other 4 bits were used as counters for synchronization. It later
   became known that the timestamp on 32 bit still needs all 64 bits in
   some cases. So 3 words were created to handle the 64 bits. But issues
   arose with this:

    1. The synchronization logic still only compared the counter with
       the first two, but not with the third number, so the
       synchronization could fail unknowingly.

    2. A check on discard of an event could race if an event happened
       between the discard and updating one of the counters. The counter
       needs to be updated (forcing an absolute timestamp and not to use
       a delta) before the actual discard happens.

* tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Test last update in 32bit version of __rb_time_read()
  ring-buffer: Force absolute timestamp on discard of event
  tracing: Fix a possible race when disabling buffered events
  tracing: Fix a warning when allocating buffered events fails
  tracing: Fix incomplete locking when disabling buffered events
  tracing: Disable snapshot buffer when stopping instance tracers
  tracing: Stop current tracer when resizing buffer
  tracing: Always update snapshot buffer size
2023-12-08 08:44:43 -08:00
Linus Torvalds 8e819a7623 31 hotfixes. 10 of these address pre-6.6 issues and are marked cc:stable.
The remainder address post-6.6 issues or aren't considered serious enough
 to justify backporting.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZXKEfwAKCRDdBJ7gKXxA
 jlRpAQCiAp1nSqIz/fOKTzoQRaTDXU/m+C+6ZAXdKLDfvQBhpwEAnxxjZ8IgF+8Z
 Klz/GirHX5w5o7jE2wb8iObo1nR75Qo=
 =omRq
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "31 hotfixes. Ten of these address pre-6.6 issues and are marked
  cc:stable. The remainder address post-6.6 issues or aren't considered
  serious enough to justify backporting"

* tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
  mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()
  nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
  mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI
  scripts/gdb: fix lx-device-list-bus and lx-device-list-class
  MAINTAINERS: drop Antti Palosaari
  highmem: fix a memory copy problem in memcpy_from_folio
  nilfs2: fix missing error check for sb_set_blocksize call
  kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
  units: add missing header
  drivers/base/cpu: crash data showing should depends on KEXEC_CORE
  mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions
  scripts/gdb/tasks: fix lx-ps command error
  mm/Kconfig: make userfaultfd a menuconfig
  selftests/mm: prevent duplicate runs caused by TEST_GEN_PROGS
  mm/damon/core: copy nr_accesses when splitting region
  lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly
  checkstack: fix printed address
  mm/memory_hotplug: fix error handling in add_memory_resource()
  mm/memory_hotplug: add missing mem_hotplug_lock
  .mailmap: add a new address mapping for Chester Lin
  ...
2023-12-08 08:36:23 -08:00
Jakub Kicinski 2483e7f04c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/stmicro/stmmac/dwmac5.c
drivers/net/ethernet/stmicro/stmmac/dwmac5.h
drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c
drivers/net/ethernet/stmicro/stmmac/hwif.h
  37e4b8df27 ("net: stmmac: fix FPE events losing")
  c3f3b97238 ("net: stmmac: Refactor EST implementation")
https://lore.kernel.org/all/20231206110306.01e91114@canb.auug.org.au/

Adjacent changes:

net/ipv4/tcp_ao.c
  9396c4ee93 ("net/tcp: Don't store TCP-AO maclen on reqsk")
  7b0f570f87 ("tcp: Move TCP-AO bits from cookie_v[46]_check() to tcp_ao_syncookie().")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 17:53:17 -08:00
Linus Torvalds 5e3f5b81de Including fixes from bpf and netfilter.
Current release - regressions:
 
  - veth: fix packet segmentation in veth_convert_skb_to_xdp_buff
 
 Current release - new code bugs:
 
  - tcp: assorted fixes to the new Auth Option support
 
 Older releases - regressions:
 
  - tcp: fix mid stream window clamp
 
  - tls: fix incorrect splice handling
 
  - ipv4: ip_gre: handle skb_pull() failure in ipgre_xmit()
 
  - dsa: mv88e6xxx: restore USXGMII support for 6393X
 
  - arcnet: restore support for multiple Sohard Arcnet cards
 
 Older releases - always broken:
 
  - tcp: do not accept ACK of bytes we never sent
 
  - require admin privileges to receive packet traces via netlink
 
  - packet: move reference count in packet_sock to atomic_long_t
 
  - bpf:
    - fix incorrect branch offset comparison with cpu=v4
    - fix prog_array_map_poke_run map poke update
 
  - netfilter:
    - 3 fixes for crashes on bad admin commands
    - xt_owner: fix race accessing sk->sk_socket, TOCTOU null-deref
    - nf_tables: fix 'exist' matching on bigendian arches
 
  - leds: netdev: fix RTNL handling to prevent potential deadlock
 
  - eth: tg3: prevent races in error/reset handling
 
  - eth: r8169: fix rtl8125b PAUSE storm when suspended
 
  - eth: r8152: improve reset and surprise removal handling
 
  - eth: hns: fix race between changing features and sending
 
  - eth: nfp: fix sleep in atomic for bonding offload
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmVyGxsACgkQMUZtbf5S
 IrvziA//XZQLEQ3OsZnnYuuGkH0lPnY6ABaK/hcjCHnk9xs8SfIKPVYpq1LaShEp
 TY6mBhLMIANbdNO+yPzaszWVTkBPyb0w8JNy43bhLhOL3m/6FS6qwsgN8SAL2qVv
 8rnDF9Gsb4yU27aMZ6+2m92WiuyPptf4HrWU2ISSv/oCYH9TWsPUrTwt+QuVUboN
 eSbvMzgIAkFIQVSbhMuinR9bOzAypSJPi18m1kkID5NsNUP/OToxPE7IFDEVS/oo
 f4P7Ru6g1Gw9pAJmVXy5c0528Hy2P4Pyyw3LD5i2FWZ7rhYJRADOC4EMs9lINzrn
 uscNUyztldaMHkKcZRqKbaXsnA3MPvuf3qycRH0wyHa1+OjL9N4A9P077FugtBln
 UlmgVokfONVlxRgwy7AqapQbZ30QmnUEOvWjFWV3dsCBS3ziq1h7ujCTaQkl6R/6
 i96xuiUPMrAnxAlbFOjoF8NeGvcvwujYCqs/q5JC43f+xZRGf52Pwf5U/AliOFym
 aBX1mF/mdMLjYIBlGwFABiybACRPMceT2RuCfvhfIdQiM01OHlydO933jS+R3I4O
 cB03ppK0QiNo5W4RlMqDGuXfVnBJ36pv/2tY8IUOZGXSR+jSQOxZHrhYrtzMM5F8
 sWjpEIrfzdtuz0ssEg9wwGBTffEf07uZyPttov3Pm+VnDrsmCMU=
 =bkyC
 -----END PGP SIGNATURE-----

Merge tag 'net-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - veth: fix packet segmentation in veth_convert_skb_to_xdp_buff

  Current release - new code bugs:

   - tcp: assorted fixes to the new Auth Option support

  Older releases - regressions:

   - tcp: fix mid stream window clamp

   - tls: fix incorrect splice handling

   - ipv4: ip_gre: handle skb_pull() failure in ipgre_xmit()

   - dsa: mv88e6xxx: restore USXGMII support for 6393X

   - arcnet: restore support for multiple Sohard Arcnet cards

  Older releases - always broken:

   - tcp: do not accept ACK of bytes we never sent

   - require admin privileges to receive packet traces via netlink

   - packet: move reference count in packet_sock to atomic_long_t

   - bpf:
      - fix incorrect branch offset comparison with cpu=v4
      - fix prog_array_map_poke_run map poke update

   - netfilter:
      - three fixes for crashes on bad admin commands
      - xt_owner: fix race accessing sk->sk_socket, TOCTOU null-deref
      - nf_tables: fix 'exist' matching on bigendian arches

   - leds: netdev: fix RTNL handling to prevent potential deadlock

   - eth: tg3: prevent races in error/reset handling

   - eth: r8169: fix rtl8125b PAUSE storm when suspended

   - eth: r8152: improve reset and surprise removal handling

   - eth: hns: fix race between changing features and sending

   - eth: nfp: fix sleep in atomic for bonding offload"

* tag 'net-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  vsock/virtio: fix "comparison of distinct pointer types lacks a cast" warning
  net/smc: fix missing byte order conversion in CLC handshake
  net: dsa: microchip: provide a list of valid protocols for xmit handler
  drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
  psample: Require 'CAP_NET_ADMIN' when joining "packets" group
  bpf: sockmap, updating the sg structure should also update curr
  net: tls, update curr on splice as well
  nfp: flower: fix for take a mutex lock in soft irq context and rcu lock
  net: dsa: mv88e6xxx: Restore USXGMII support for 6393X
  tcp: do not accept ACK of bytes we never sent
  selftests/bpf: Add test for early update in prog_array_map_poke_run
  bpf: Fix prog_array_map_poke_run map poke update
  netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
  netfilter: nf_tables: validate family when identifying table via handle
  netfilter: nf_tables: bail out on mismatching dynset and set expressions
  netfilter: nf_tables: fix 'exist' matching on bigendian arches
  netfilter: nft_set_pipapo: skip inactive elements during set walk
  netfilter: bpf: fix bad registration on nf_defrag
  leds: trigger: netdev: fix RTNL handling to prevent potential deadlock
  octeontx2-af: Update Tx link register range
  ...
2023-12-07 17:04:13 -08:00
Andrei Matei 1d38a9ee81 bpf: Guard stack limits against 32bit overflow
This patch promotes the arithmetic around checking stack bounds to be
done in the 64-bit domain, instead of the current 32bit. The arithmetic
implies adding together a 64-bit register with an int offset. The
register was checked to be below 1<<29 when it was variable, but not
when it was fixed. The offset either comes from an instruction (in which
case it is 16 bit), from another register (in which case the caller
checked it to be below 1<<29 [1]), or from the size of an argument to a
kfunc (in which case it can be a u32 [2]). Between the register being
inconsistently checked to be below 1<<29, and the offset being up to a
u32, it appears that we were open to overflowing the `int`s which were
currently used for arithmetic.

[1] 815fb87b75/kernel/bpf/verifier.c (L7494-L7498)
[2] 815fb87b75/kernel/bpf/verifier.c (L11904)
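
The overflow risk described above can be illustrated with a small
standalone sketch. This is not the verifier code: the helper names
(sk_in_bounds_32, sk_in_bounds_64) and the SK_MAX_STACK limit are
hypothetical, and the 32-bit wraparound is modelled explicitly in unsigned
arithmetic so the example stays well defined in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stack depth limit, for illustration only. */
#define SK_MAX_STACK 512

/* 32-bit addition with explicit wraparound (unsigned, so no UB). */
static int32_t sk_sum32_wrapping(int32_t reg_off, uint32_t size)
{
	return (int32_t)((uint32_t)reg_off + size);
}

/* Bounds check done in the 32-bit domain: the sum can wrap. */
static bool sk_in_bounds_32(int32_t reg_off, uint32_t access_size)
{
	int32_t max_off = sk_sum32_wrapping(reg_off, access_size);

	return reg_off >= -SK_MAX_STACK && max_off <= 0;
}

/* Same check with the arithmetic promoted to 64 bits: no wrap. */
static bool sk_in_bounds_64(int64_t reg_off, uint32_t access_size)
{
	int64_t max_off = reg_off + (int64_t)access_size;

	return reg_off >= -SK_MAX_STACK && max_off <= 0;
}
```

With an offset of -8 and a size of 0xFFFFFFFF, the 32-bit sum wraps to a
negative value and the access is wrongly accepted, while the 64-bit
version rejects it.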

Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231207041150.229139-4-andreimatei1@gmail.com
2023-12-07 13:58:10 -08:00
Andrei Matei a833a17aea bpf: Fix verification of indirect var-off stack access
This patch fixes a bug around the verification of possibly-zero-sized
stack accesses. When the access was done through a var-offset stack
pointer, check_stack_access_within_bounds was incorrectly computing the
maximum-offset of a zero-sized read to be the same as the register's min
offset. Instead, we have to take into account the register's maximum
possible value. The patch also simplifies how the max offset is checked;
the check is now simpler than for min offset.

The bug was allowing accesses to erroneously pass the
check_stack_access_within_bounds() checks, only to later crash in
check_stack_range_initialized() when all the possibly-affected stack
slots are iterated (this time with a correct max offset).
check_stack_range_initialized() is relying on
check_stack_access_within_bounds() for its accesses to the
stack-tracking vector to be within bounds; in the case of zero-sized
accesses, we were essentially only verifying that the lowest possible
slot was within bounds. We would crash when the max-offset of the stack
pointer was >= 0 (which shouldn't pass verification, and hopefully is
not something anyone's code attempts to do in practice).
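
A minimal sketch of the difference, under assumptions: the struct and
function names below are hypothetical and only model the description
above (a register with a [smin, smax] range plus a fixed offset), not the
actual check_stack_access_within_bounds() code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a variable-offset stack pointer. */
struct sk_reg {
	int64_t smin;	/* smallest possible value of the register */
	int64_t smax;	/* largest possible value of the register */
};

/* Old behaviour as described: a zero-sized access reuses the min offset. */
static bool sk_max_ok_buggy(const struct sk_reg *r, int64_t off, int64_t size)
{
	int64_t min_off = r->smin + off;
	int64_t max_off = size > 0 ? r->smax + off + size - 1 : min_off;

	return max_off < 0;	/* last accessed byte must be below the frame top */
}

/* Fixed behaviour as described: always start from the register's maximum. */
static bool sk_max_ok_fixed(const struct sk_reg *r, int64_t off, int64_t size)
{
	int64_t max_off = r->smax + off + size;	/* exclusive end of the access */

	return max_off <= 0;
}
```

For a register ranging over [-16, 8] and a zero-sized access, the old
computation only looks at -16 and accepts, whereas the fixed one sees the
maximum of 8 and rejects.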

Thanks Hao for reporting!

Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231207041150.229139-2-andreimatei1@gmail.com

Closes: https://lore.kernel.org/bpf/CACkBjsZGEUaRCHsmaX=h-efVogsRfK1FPxmkgb0Os_frnHiNdw@mail.gmail.com/
2023-12-07 13:57:53 -08:00
Linus Torvalds 9ace34a8e4 cgroup: Fixes for v6.7-rc4
Just one patch.
 
 f5d39b0208 ("freezer,sched: Rewrite core freezer logic") changed how
 freezing state is recorded which made cgroup_freezing() disagree with the
 actual state of the task while thawing triggering a warning. Fix it by
 updating cgroup_freezing().
 -----BEGIN PGP SIGNATURE-----
 
 iIMEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZXDMtQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGX2kAQDqVMZJ+fq3+nGYZNWdsCP+GtDOjsMf8GaadsMT
 e8Iu4QDzBo6QwRyFIA4glPQVWiTh3R35XeN0TCN1qEfCTYEICw==
 =F8Zn
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:
 "Just one fix.

  Commit f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
  changed how freezing state is recorded which made cgroup_freezing()
  disagree with the actual state of the task while thawing triggering a
  warning. Fix it by updating cgroup_freezing()"

* tag 'cgroup-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup_freezer: cgroup_freezing: Check if not frozen
2023-12-07 12:42:40 -08:00
Linus Torvalds e0348c1f68 workqueue: Fixes for v6.7-rc4
Just one patch to fix a bug which can crash the kernel if the housekeeping
 and wq_unbound_cpu cpumask configuration combination leaves the latter
 empty.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZXDKTg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGTmMAP9kuC9JkII2J5JnxQpkJLDd/qeRHrigrClx3F0+
 gBiK8AD/XgsGY5J/OOMjsU1Px7BYvy6w0MEEqqhx2vOVEkEFPAo=
 =pH9n
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
 "Just one patch to fix a bug which can crash the kernel if the
  housekeeping and wq_unbound_cpu cpumask configuration combination
  leaves the latter empty"

* tag 'wq-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Make sure that wq_unbound_cpumask is never empty
2023-12-07 12:36:32 -08:00
Song Liu 26ef208c20 bpf: Use arch_bpf_trampoline_size
Instead of blindly allocating PAGE_SIZE for each trampoline, check the size
of the trampoline with arch_bpf_trampoline_size(). This size is saved in
bpf_tramp_image->size, and used for modmem charge/uncharge. The fallback
arch_alloc_bpf_trampoline() still allocates a whole page because we need to
use set_memory_* to protect the memory.

struct_ops trampoline still uses a whole page for multiple trampolines.

With this size check at caller (regular trampoline and struct_ops
trampoline), remove arch_bpf_trampoline_size() from
arch_prepare_bpf_trampoline() in archs.

Also, update bpf_image_ksym_add() to handle symbol of different sizes.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv
Link: https://lore.kernel.org/r/20231206224054.492250-7-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Song Liu 96d1b7c081 bpf: Add arch_bpf_trampoline_size()
This helper will be used to calculate the size of the trampoline before
allocating the memory.

arch_prepare_bpf_trampoline() for arm64 and riscv64 can use
arch_bpf_trampoline_size() to check the trampoline fits in the image.

OTOH, arch_prepare_bpf_trampoline() for s390 has to call the JIT process
twice, so it cannot use arch_bpf_trampoline_size().

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv
Link: https://lore.kernel.org/r/20231206224054.492250-6-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Song Liu 82583daa2e bpf: Add helpers for trampoline image management
As BPF trampoline of different archs moves from bpf_jit_[alloc|free]_exec()
to bpf_prog_pack_[alloc|free](), we need to use different _alloc, _free for
different archs during the transition. Add the following helpers for this
transition:

void *arch_alloc_bpf_trampoline(unsigned int size);
void arch_free_bpf_trampoline(void *image, unsigned int size);
void arch_protect_bpf_trampoline(void *image, unsigned int size);
void arch_unprotect_bpf_trampoline(void *image, unsigned int size);

The fallback version of these helpers require size <= PAGE_SIZE, but they
are only called with size == PAGE_SIZE. They will be called with size <
PAGE_SIZE when arch_bpf_trampoline_size() helper is introduced later.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-4-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Song Liu 7a3d9a159b bpf: Adjust argument names of arch_prepare_bpf_trampoline()
We are using "im" for "struct bpf_tramp_image" and "tr" for "struct
bpf_trampoline" in most of the code base. The only exception is the
prototype and fallback version of arch_prepare_bpf_trampoline(). Update
them to match the rest of the code base.

We mix "orig_call" and "func_addr" for the argument in different versions
of arch_prepare_bpf_trampoline(). s/orig_call/func_addr/g so they match.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-3-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Song Liu f08a1c6582 bpf: Let bpf_prog_pack_free handle any pointer
Currently, bpf_prog_pack_free can only free a pointer to struct
bpf_binary_header, which is not flexible. Add a size argument to
bpf_prog_pack_free so that it can handle any pointer.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Andrew Morton 0c92218f4e Merge branch 'master' into mm-hotfixes-stable 2023-12-06 17:03:50 -08:00
Baoquan He dccf78d39f kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
Ignat Korchagin complained that a potential config regression was
introduced by commit 89cde45591 ("kexec: consolidate kexec and crash
options into kernel/Kconfig.kexec").  Before the commit, CONFIG_CRASH_DUMP
has no dependency on CONFIG_KEXEC.  After the commit, CRASH_DUMP selects
KEXEC.  That forces the system to have CONFIG_KEXEC=y whenever
CONFIG_CRASH_DUMP=y, which people may not want.

In Ignat's case, he sets CONFIG_CRASH_DUMP=y, CONFIG_KEXEC_FILE=y and
CONFIG_KEXEC=n because the kexec_load interface could have a security issue
if the kernel/initrd has no chance to be signed and verified.

CRASH_DUMP has the select of KEXEC because Eric, the author of the above
commit, hit an LKP build failure report when posting an earlier version of
the patch.  See the link below for details of the LKP report:

    https://lore.kernel.org/all/3e8eecd1-a277-2cfb-690e-5de2eb7b988e@oracle.com/T/#u

In fact, that LKP report is triggered because arm's <asm/kexec.h> is
wrapped in CONFIG_KEXEC ifdeffery scope.  That is wrong.  CONFIG_KEXEC
controls the enabling/disabling of kexec_load interface, but not kexec
feature.  Removing the wrongly added CONFIG_KEXEC ifdeffery scope in
<asm/kexec.h> of arm allows us to drop the select KEXEC for CRASH_DUMP. 
Meanwhile, change arch/arm/kernel/Makefile to let machine_kexec.o
relocate_kernel.o depend on KEXEC_CORE.

Link: https://lkml.kernel.org/r/20231128054457.659452-1-bhe@redhat.com
Fixes: 89cde45591 ("kexec: consolidate kexec and crash options into kernel/Kconfig.kexec")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>	[compile-time only]
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:48 -08:00
Jiri Olsa 4b7de80160 bpf: Fix prog_array_map_poke_run map poke update
Lee pointed out an issue found by syzkaller [0]: a BUG is hit in the prog
array map poke update in prog_array_map_poke_run() because of an error
value returned from bpf_arch_text_poke().

There's a race window where bpf_arch_text_poke can fail due to missing
bpf program kallsym symbols, which is accounted for with a check for
-EINVAL in that BUG_ON call.

The problem is that in such a case we won't update the tail call jump,
which causes an imbalance for the next tail call update check, which will
fail with -EBUSY in bpf_arch_text_poke.

I'm hitting the following race during the program load:

  CPU 0                             CPU 1

  bpf_prog_load
    bpf_check
      do_misc_fixups
        prog_array_map_poke_track

                                    map_update_elem
                                      bpf_fd_array_map_update_elem
                                        prog_array_map_poke_run

                                          bpf_arch_text_poke returns -EINVAL

    bpf_prog_kallsyms_add

After bpf_arch_text_poke (CPU 1) fails to update the tail call jump, the next
poke update fails on expected jump instruction check in bpf_arch_text_poke
with -EBUSY and triggers the BUG_ON in prog_array_map_poke_run.

A similar race exists on the program unload.

Fix this by moving the update to the bpf_arch_poke_desc_update function,
which makes sure we call __bpf_arch_text_poke, which skips the bpf address
check.

Each architecture has a slightly different approach to looking up the bpf
address in bpf_arch_text_poke, so instead of splitting the function or
adding a new 'checkip' argument as in the previous version, it seems best
to move the whole map_poke_run update into arch specific code.

  [0] https://syzkaller.appspot.com/bug?extid=97a4fe20470e9bc30810

Fixes: ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Reported-by: syzbot+97a4fe20470e9bc30810@syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Cc: Lee Jones <lee@kernel.org>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20231206083041.1306660-2-jolsa@kernel.org
2023-12-06 22:40:16 +01:00
Steven Rostedt (Google) f458a14534 ring-buffer: Test last update in 32bit version of __rb_time_read()
Since 64 bit cmpxchg() is very expensive on 32bit architectures, the
timestamp used by the ring buffer does some interesting tricks to be able
to still have an atomic 64 bit number. It originally just used 60 bits and
broke it up into two 32 bit words where the extra 2 bits were used for
synchronization. But this was not enough for all use cases, and all 64
bits were required.

The 32bit version of the ring buffer timestamp was then broken up into 3
32bit words using the same counter trick. But one update was not done. The
check to see if the read operation was done without interruption only
checked the first two words and not last one (like it had before this
update). Fix it by making sure all three updates happen without
interruption by comparing the initial counter with the last updated
counter.
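
As a rough, single-threaded sketch of the idea (not the kernel's rb_time
code): each 32 bit word carries 30 bits of payload and a 2 bit counter,
and a read is only trusted when the counters in all three words agree. The
names sk_time, sk_pack, sk_time_set and sk_time_read are hypothetical, and
this sketch simply compares all three counters rather than reproducing the
exact comparison order of the real fix.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_SHIFT	30
#define SK_VAL_MASK	((1u << SK_SHIFT) - 1)
#define SK_CNT(w)	((w) >> SK_SHIFT)	/* top 2 bits of a word */

struct sk_time {
	uint32_t cnt;		/* update counter, only low 2 bits are stored */
	uint32_t top;		/* timestamp bits 30..59 */
	uint32_t bottom;	/* timestamp bits 0..29 */
	uint32_t msb;		/* timestamp bits 60..63 */
};

/* Combine a 30 bit payload with a 2 bit counter in one word. */
static uint32_t sk_pack(uint32_t val, uint32_t cnt)
{
	return (val & SK_VAL_MASK) | ((cnt & 3u) << SK_SHIFT);
}

/* Write all three words, tagging each with the same counter value. */
static void sk_time_set(struct sk_time *t, uint64_t val)
{
	uint32_t cnt = ++t->cnt;

	t->top = sk_pack((uint32_t)(val >> SK_SHIFT), cnt);
	t->bottom = sk_pack((uint32_t)val, cnt);
	t->msb = sk_pack((uint32_t)(val >> 60), cnt);
}

/* Return false if the three words do not belong to the same update. */
static bool sk_time_read(const struct sk_time *t, uint64_t *ret)
{
	uint32_t c = SK_CNT(t->top);

	if (c != SK_CNT(t->bottom) || c != SK_CNT(t->msb))
		return false;

	*ret = ((uint64_t)(t->msb & SK_VAL_MASK) << 60) |
	       ((uint64_t)(t->top & SK_VAL_MASK) << SK_SHIFT) |
	       (t->bottom & SK_VAL_MASK);
	return true;
}
```

If the third word is left out of the comparison, a reader interrupted
between the second and third update could combine words from two
different timestamps without noticing.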

Link: https://lore.kernel.org/linux-trace-kernel/20231206100050.3100b7bb@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: f03f2abce4 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-06 15:01:49 -05:00
Steven Rostedt (Google) b2dd797543 ring-buffer: Force absolute timestamp on discard of event
There's a race where if an event is discarded from the ring buffer and an
interrupt were to happen at that time and insert an event, the time stamp
is still used from the discarded event as an offset. This can screw up the
timings.

If the event is going to be discarded, set the "before_stamp" to zero.
When a new event comes in, it compares the "before_stamp" with the
"write_stamp" and if they are not equal, it will insert an absolute
timestamp. This will prevent the timings from getting out of sync due to
the discarded event.
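
The mechanism can be sketched in a simplified, single-threaded form. The
struct and function names are hypothetical, and the real ring buffer
updates the two stamps at different points of the reserve path; here they
are collapsed into one step purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state holding the two stamps named above. */
struct sk_cpu_buffer {
	uint64_t before_stamp;
	uint64_t write_stamp;
};

/* On discard, zero before_stamp so the next event sees a mismatch. */
static void sk_discard(struct sk_cpu_buffer *b)
{
	b->before_stamp = 0;
}

/*
 * Reserve an event at time 'now'. Returns true when an absolute
 * timestamp must be written; otherwise *delta holds the offset from
 * the previous event.
 */
static bool sk_reserve(struct sk_cpu_buffer *b, uint64_t now, uint64_t *delta)
{
	bool absolute = b->before_stamp != b->write_stamp;

	*delta = absolute ? 0 : now - b->write_stamp;
	b->before_stamp = now;
	b->write_stamp = now;
	return absolute;
}
```

After a discard, the first following reservation is forced to use an
absolute timestamp, and later ones go back to deltas.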

Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 6f6be606e7 ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-06 15:00:59 -05:00
Waiman Long 3232e7aad1 cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check
Currently, the cpu_is_isolated() function checks only the statically
isolated CPUs specified via the "isolcpus" and "nohz_full" kernel
command line options. This function is used by vmstat and memcg to
reduce interference with isolated CPUs by not doing stat flushing
or scheduling works on those CPUs.

Workloads running on isolated CPUs within isolated cpuset
partitions should receive the same treatment to reduce unnecessary
interference. This patch introduces a new cpuset_cpu_is_isolated()
function to be called by cpu_is_isolated() so that the set of dynamically
created cpuset isolated CPUs will be included in the check.

Assuming that testing a bit in a cpumask is atomic, no synchronization
primitive is currently used to synchronize access to the cpuset's
isolated_cpus mask.
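
A minimal sketch of the combined check, assuming at most 64 CPUs and
plain bitmasks instead of cpumasks. The struct and the sk_ prefixed
functions are hypothetical stand-ins for the boot-time isolation state and
the cpuset isolated_cpus mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical isolation state: one bit per CPU. */
struct sk_isolation {
	uint64_t boot_mask;	/* from "isolcpus" / "nohz_full" */
	uint64_t cpuset_mask;	/* from isolated cpuset partitions */
};

/* Test a single bit; no locking, mirroring the assumption above. */
static bool sk_test_cpu(uint64_t mask, unsigned int cpu)
{
	return cpu < 64 && ((mask >> cpu) & 1);
}

/* Dynamic part of the check. */
static bool sk_cpuset_cpu_is_isolated(const struct sk_isolation *s,
				      unsigned int cpu)
{
	return sk_test_cpu(s->cpuset_mask, cpu);
}

/* Static check extended with the cpuset check. */
static bool sk_cpu_is_isolated(const struct sk_isolation *s, unsigned int cpu)
{
	return sk_test_cpu(s->boot_mask, cpu) ||
	       sk_cpuset_cpu_is_isolated(s, cpu);
}
```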

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-12-06 09:37:28 -10:00
Andrii Nakryiko d734ca7b33 bpf,lsm: add BPF token LSM hooks
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow an LSM
to allocate a security blob (we add a `void *security` field to struct
bpf_token for that) and also to control who can instantiate a BPF token.
This follows existing pattern for BPF map and BPF prog.

Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
LSM hooks that allow LSM implementation to control and negate (if
necessary) BPF token's delegation of a specific bpf_cmd and capability,
respectively.

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:03:00 -08:00
Andrii Nakryiko 66d636d70a bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooks
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
hook into bpf_map_create, taking not just struct bpf_map, but also
bpf_attr and bpf_token, to give a fuller context to LSMs.

Unlike bpf_prog_alloc, there is no need to move the hook around, as it
currently is firing right before allocating BPF map ID and FD, which
seems to be a sweet spot.

But like the bpf_prog_alloc/bpf_prog_free combo, make sure that the
bpf_map_free LSM hook is called even if the bpf_map_create hook returned
an error. If several LSMs are combined, one LSM could successfully
allocate a security blob for its needs while a subsequent LSM rejects BPF
map creation. The former LSM would still need to free its blob, so we
need to ensure security_bpf_map_free() is called regardless of the
outcome.

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-11-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko c3dd6e94df bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks
Based on upstream discussion ([0]), rework existing
bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
program struct. Also, we pass bpf_attr union with all the user-provided
arguments for BPF_PROG_LOAD command.  This will give LSMs as much
information as we can basically provide.

The hook is also BPF token-aware now, and optional bpf_token struct is
passed as a third argument. bpf_prog_load LSM hook is called after
a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
allocated and filled out, but right before performing full-fledged BPF
verification step.

bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
consistency. SELinux code is adjusted to all new names, types, and
signatures.

Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
used by some LSMs to allocate extra security blob, but also by other
LSMs to reject BPF program loading, we need to make sure that
bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
*even* if the hook itself returned error. If we don't do that, we run
the risk of leaking memory. This seems to be possible today when
combining SELinux and BPF LSM, as one example, depending on their
relative ordering.
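
The leak scenario and the fix can be modeled in user space roughly as
follows. This is a sketch under simplifying assumptions: there are
exactly two LSMs, they share a single blob pointer, and all names
(prog_model, lsm_a_prog_load, lsm_b_prog_load, model_prog_free,
model_prog_load) are hypothetical rather than taken from the kernel.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct bpf_prog with one LSM blob. */
struct prog_model {
	void *security;
};

/* First LSM: allocates a blob and allows the load. */
int lsm_a_prog_load(struct prog_model *p)
{
	p->security = malloc(16);
	return p->security ? 0 : -1;
}

/* Second LSM: rejects the load without touching the blob. */
int lsm_b_prog_load(struct prog_model *p)
{
	(void)p;
	return -1;
}

/* Free hook: releases whatever the load hooks allocated. */
void model_prog_free(struct prog_model *p)
{
	free(p->security);
	p->security = NULL;
}

/* Run the load hooks in order; on any failure still call the free hook. */
int model_prog_load(struct prog_model *p)
{
	int err = lsm_a_prog_load(p);

	if (!err)
		err = lsm_b_prog_load(p);
	if (err)
		model_prog_free(p);
	return err;
}
```

Without the model_prog_free() call on the error path, the blob allocated
by the first LSM would never be released once the second LSM rejects the
load.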

Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
sleepable LSM hooks list, as they are both executed in sleepable
context. Also drop bpf_prog_load hook from untrusted, as there is no
issue with refcount or anything else anymore, that originally forced us
to add it to untrusted list in c0c852dd18 ("bpf: Do not mark certain LSM
hook arguments as trusted"). We now trigger this hook much later and it
should not be an issue anymore.

  [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko 8062fb12de bpf: consistently use BPF token throughout BPF verifier logic
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko 4cbb270e11 bpf: take into account BPF token when fetching helper protos
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko e1cef620f5 bpf: add BPF token support to BPF_PROG_LOAD command
Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
allowed BPF program types and attach types, derived from BPF FS at BPF
token creation time. Then make sure we perform bpf_token_capable()
checks everywhere where it's relevant.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko ee54b1a910 bpf: add BPF token support to BPF_BTF_LOAD command
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BTF loading is a pretty straightforward
operation, so as long as BPF token is created with allow_cmds granting
BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating
BTF object.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko 688b7270b3 bpf: add BPF token support to BPF_MAP_CREATE command
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.

Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds,
makes it possible to create a narrowly focused BPF token (controlled by a
privileged agent) with a restrictive set of BPF map types that an
application can attempt to create.
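
A minimal sketch of how the two bit sets combine, written as a user-space
model. The struct layout and the function names here are assumptions made
for illustration, not the kernel's definitions, and the model limits
itself to 64 commands and 64 map types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the bit sets a token copies from BPF FS. */
struct token_model {
	uint64_t allowed_cmds;
	uint64_t allowed_maps;
};

/* True if the token delegates the given bpf() command number. */
bool model_token_allow_cmd(const struct token_model *t, unsigned int cmd)
{
	return cmd < 64 && (t->allowed_cmds & (1ull << cmd));
}

/* True if the token delegates the given map type number. */
bool model_token_allow_map_type(const struct token_model *t, unsigned int type)
{
	return type < 64 && (t->allowed_maps & (1ull << type));
}

/* Map creation needs both the command bit and the map-type bit. */
bool model_token_can_create_map(const struct token_model *t,
				unsigned int create_cmd, unsigned int type)
{
	return model_token_allow_cmd(t, create_cmd) &&
	       model_token_allow_map_type(t, type);
}
```

A token with only one map-type bit set therefore permits creating that
one kind of map, and only if the map creation command is delegated too.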

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko 4527358b76 bpf: introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having a BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).

Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding out the ns_capable()
story of BPF token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko 40bba140c6 bpf: add BPF token delegation mount options to BPF FS
Add a few new mount options to BPF FS that specify that a given BPF FS
instance allows creation of BPF token (added in the next patch), and
what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each of which is a bit mask:
  - `delegate_cmds` specifies which bpf() syscall commands are
    allowed with BPF token derived from this BPF FS instance;
  - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
    a set of allowable BPF map types that could be created with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
    a set of allowable BPF program types that could be loaded with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
    a set of allowable BPF program attach types that could be loaded with
    BPF token; delegate_progs and delegate_attachs are meant to be used
    together, as full BPF program type is, in general, determined
    through both program type and program attach type.

Currently, these mount options accept the following forms of values:
  - a special value "any", that enables all possible values of a given
  bit set;
  - numeric value (decimal or hexadecimal, determined by kernel
  automatically) that specifies a bit mask value directly;
  - all the values for a given mount option are combined, if specified
  multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
  delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
  mask.
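
The value handling listed above could look roughly like this in user
space. This is a sketch, not the BPF FS option parser:
model_parse_delegate() is a hypothetical helper, and it uses strtoull()
with base 0, which also accepts octal values with a leading 0.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical parser for one delegate_* value. "any" sets every bit;
 * otherwise the value is numeric, parsed with base 0 so that a "0x"
 * prefix selects hexadecimal. Numeric values are ORed into *mask, so
 * repeated options accumulate. Returns 0 on success, -1 on bad input.
 */
int model_parse_delegate(const char *val, uint64_t *mask)
{
	char *end;
	unsigned long long bits;

	if (strcmp(val, "any") == 0) {
		*mask = ~0ull;
		return 0;
	}

	errno = 0;
	bits = strtoull(val, &end, 0);
	if (errno || end == val || *end != '\0')
		return -1;

	*mask |= bits;
	return 0;
}
```

Calling it with "0x1" and then "0x2" on the same mask yields 0x3,
matching the combined delegate_maps example above.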

Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.

Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.

This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constrain BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to an intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.

There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.

This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.

  [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:58 -08:00
Andrii Nakryiko 909fa05dd3 bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.

Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.

But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
  - when creating DEVMAP/XSKMAP/CPU_MAP maps;
  - when attaching CGROUP_SKB programs;
  - when handling BPF_PROG_QUERY command.

This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.

This also makes it cleaner in subsequent BPF token patches to switch
wholesale to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.
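
The "requested capability or CAP_SYS_ADMIN" rule can be modeled in user
space as below. This is a sketch only: the capability numbers are made
up, the held capabilities are a plain bit mask, and
model_capable_or_admin() is a hypothetical name rather than a kernel
function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability numbers; only their distinctness matters. */
enum model_cap { MODEL_CAP_NET_ADMIN, MODEL_CAP_SYS_ADMIN, MODEL_CAP_BPF };

/* True if the bit for cap is set in the held capability mask. */
bool model_has_cap(uint32_t held, enum model_cap cap)
{
	return held & (1u << cap);
}

/* Grant access if the requested capability or CAP_SYS_ADMIN is held. */
bool model_capable_or_admin(uint32_t held, enum model_cap cap)
{
	return model_has_cap(held, cap) ||
	       model_has_cap(held, MODEL_CAP_SYS_ADMIN);
}
```

With this rule, a process holding only the modeled CAP_SYS_ADMIN passes a
CAP_NET_ADMIN check, which is the behavior this patch extends to the
three cases listed above.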

Cc: Jakub Kicinski <kuba@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:58 -08:00
Petr Pavlu c0591b1ccc tracing: Fix a possible race when disabling buffered events
Function trace_buffered_event_disable() is responsible for freeing pages
backing buffered events and this process can run concurrently with
trace_event_buffer_lock_reserve().

The following race is currently possible:

* Function trace_buffered_event_disable() is called on CPU 0. It
  increments trace_buffered_event_cnt on each CPU and waits via
  synchronize_rcu() for each user of trace_buffered_event to complete.

* After synchronize_rcu() is finished, function
  trace_buffered_event_disable() has the exclusive access to
  trace_buffered_event. All counters trace_buffered_event_cnt are at 1
  and all pointers trace_buffered_event are still valid.

* At this point, on a different CPU 1, the execution reaches
  trace_event_buffer_lock_reserve(). The function calls
  preempt_disable_notrace() and only now enters an RCU read-side
  critical section. The function proceeds and reads a still valid
  pointer from trace_buffered_event[CPU1] into the local variable
  "entry". However, it doesn't yet read trace_buffered_event_cnt[CPU1]
  which happens later.

* Function trace_buffered_event_disable() continues. It frees
  trace_buffered_event[CPU1] and decrements
  trace_buffered_event_cnt[CPU1] back to 0.

* Function trace_event_buffer_lock_reserve() continues. It reads and
  increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
  believe that it can use the "entry" that it already obtained but the
  pointer is now invalid and any access results in a use-after-free.

Fix the problem by making a second synchronize_rcu() call after all
trace_buffered_event values are set to NULL. This waits on all potential
users in trace_event_buffer_lock_reserve() that still read a previous
pointer from trace_buffered_event.

Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-4-petr.pavlu@suse.com

Cc: stable@vger.kernel.org
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05 17:17:00 -05:00
Petr Pavlu 34209fe83e tracing: Fix a warning when allocating buffered events fails
Function trace_buffered_event_disable() produces an unexpected warning
when the previous call to trace_buffered_event_enable() fails to
allocate pages for buffered events.

The situation can occur as follows:

* The counter trace_buffered_event_ref is at 0.

* The soft mode gets enabled for some event and
  trace_buffered_event_enable() is called. The function increments
  trace_buffered_event_ref to 1 and starts allocating event pages.

* The allocation fails for some page and trace_buffered_event_disable()
  is called for cleanup.

* Function trace_buffered_event_disable() decrements
  trace_buffered_event_ref back to 0, recognizes that it was the last
  use of buffered events and frees all allocated pages.

* The control goes back to trace_buffered_event_enable() which returns.
  The caller of trace_buffered_event_enable() has no information that
  the function actually failed.

* Some time later, the soft mode is disabled for the same event.
  Function trace_buffered_event_disable() is called. It warns on
  "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.

Buffered events are just an optimization and can handle failures. Make
trace_buffered_event_enable() exit on the first failure and leave any
cleanup for later, when trace_buffered_event_disable() is called.

Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-3-petr.pavlu@suse.com

Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05 17:16:48 -05:00
Petr Pavlu 7fed14f7ac tracing: Fix incomplete locking when disabling buffered events
The following warning appears when using buffered events:

[  203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
[...]
[  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G            E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
[  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
[  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
[  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
[  203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
[  203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
[  203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
[  203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
[  203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
[  203.781846] FS:  00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
[  203.781851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
[  203.781862] Call Trace:
[  203.781870]  <TASK>
[  203.851949]  trace_event_buffer_commit+0x1ea/0x250
[  203.851967]  trace_event_raw_event_sys_enter+0x83/0xe0
[  203.851983]  syscall_trace_enter.isra.0+0x182/0x1a0
[  203.851990]  do_syscall_64+0x3a/0xe0
[  203.852075]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  203.852090] RIP: 0033:0x7f4cd870fa77
[  203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
[  203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
[  203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
[  203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
[  203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
[  203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
[  204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
[  204.049256]  </TASK>

For instance, it can be triggered by running these two commands in
parallel:

 $ while true; do
    echo hist:key=id.syscall:val=hitcount > \
      /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
  done
 $ stress-ng --sysinfo $(nproc)

The warning indicates that the current ring_buffer_per_cpu is not in the
committing state. It happens because the active ring_buffer_event
doesn't actually come from the ring_buffer_per_cpu but is allocated from
trace_buffered_event.

The bug is in function trace_buffered_event_disable() where the
following normally happens:

* The code invokes disable_trace_buffered_event() via
  smp_call_function_many() and follows it by synchronize_rcu(). This
  increments the per-CPU variable trace_buffered_event_cnt on each
  target CPU and grants trace_buffered_event_disable() the exclusive
  access to the per-CPU variable trace_buffered_event.

* Maintenance is performed on trace_buffered_event, all per-CPU event
  buffers get freed.

* The code invokes enable_trace_buffered_event() via
  smp_call_function_many(). This decrements trace_buffered_event_cnt and
  releases the access to trace_buffered_event.

A problem is that smp_call_function_many() runs a given function on all
target CPUs except on the current one. The following can then occur:

* Task X executing trace_buffered_event_disable() runs on CPU 0.

* The control reaches synchronize_rcu() and the task gets rescheduled on
  another CPU 1.

* The RCU synchronization finishes. At this point,
  trace_buffered_event_disable() has the exclusive access to all
  trace_buffered_event variables except trace_buffered_event[CPU0]
  because trace_buffered_event_cnt[CPU0] is never incremented and if the
  buffer is currently unused, remains set to 0.

* A different task Y is scheduled on CPU 0 and hits a trace event. The
  code in trace_event_buffer_lock_reserve() sees that
  trace_buffered_event_cnt[CPU0] is set to 0 and decides to use the
  buffer provided by trace_buffered_event[CPU0].

* Task X continues its execution in trace_buffered_event_disable(). The
  code incorrectly frees the event buffer pointed to by
  trace_buffered_event[CPU0] and resets the variable to NULL.

* Task Y writes event data to the now freed buffer and later detects the
  created inconsistency.

The issue is observable since commit dea499781a ("tracing: Fix warning
in trace_buffered_event_disable()") which moved the call of
trace_buffered_event_disable() in __ftrace_event_enable_disable()
earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
The underlying problem in trace_buffered_event_disable() is however
present since the original implementation in commit 0fc1b09ff1
("tracing: Use temp buffer when filtering events").

Fix the problem by replacing the two smp_call_function_many() calls with
on_each_cpu_mask() which invokes a given callback on all CPUs.
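
The difference between the two calls can be illustrated with a small
user-space model. It only mimics which CPUs get the callback, assuming a
mask that contains every CPU; percpu_model, model_call_others() and
model_call_all() are hypothetical names, and the counter array stands in
for trace_buffered_event_cnt.

```c
#include <stddef.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU counters, one slot per CPU. */
struct percpu_model {
	int cnt[MODEL_NR_CPUS];
};

/* Models smp_call_function_many(): every CPU except the caller's. */
void model_call_others(struct percpu_model *m, size_t this_cpu)
{
	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (cpu != this_cpu)
			m->cnt[cpu]++;
}

/* Models on_each_cpu_mask(): every CPU, including the caller's. */
void model_call_all(struct percpu_model *m)
{
	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		m->cnt[cpu]++;
}
```

After model_call_others(&m, 0) the counter for CPU 0 is still 0, which
corresponds to the unprotected trace_buffered_event[CPU0] in the race
above; model_call_all() leaves no such gap.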

Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-2-petr.pavlu@suse.com

Cc: stable@vger.kernel.org
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Fixes: dea499781a ("tracing: Fix warning in trace_buffered_event_disable()")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05 17:13:51 -05:00
Steven Rostedt (Google) b538bf7d0e tracing: Disable snapshot buffer when stopping instance tracers
It used to be that only the top level instance had a snapshot buffer (for
latency tracers like wakeup and irqsoff). Stopping a tracer in an
instance would not disable the snapshot buffer. This could have some
unintended consequences if the irqsoff tracer is enabled.

Consolidate the tracing_start/stop() with tracing_start/stop_tr() so that
all instances behave the same. The tracing_start/stop() functions will
just call their respective tracing_start/stop_tr() with the global_trace
passed in.

Link: https://lkml.kernel.org/r/20231205220011.041220035@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 6d9b3fa5e7 ("tracing: Move tracing_max_latency into trace_array")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05 17:06:12 -05:00
Steven Rostedt (Google) d78ab79270 tracing: Stop current tracer when resizing buffer
When the ring buffer is being resized, it can cause side effects to the
running tracer. For instance, there's a race with irqsoff tracer that
swaps individual per cpu buffers between the main buffer and the snapshot
buffer. The resize operation modifies the main buffer and then the
snapshot buffer. If a swap happens in between those two operations it will
break the tracer.

Simply stop the running tracer before resizing the buffers and enable it
again when finished.

Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 3928a8a2d9 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05 17:06:12 -05:00
Steven Rostedt (Google) 7be76461f3 tracing: Always update snapshot buffer size
It used to be that only the top level instance had a snapshot buffer (for
latency tracers like wakeup and irqsoff). The update of the ring buffer
size would check if the instance was the top level and if so, it would
also update the snapshot buffer as it needs to be the same as the main
buffer.

Now that lower level instances also have a snapshot buffer, they too need
to update their snapshot buffer sizes when the main buffer is changed,
otherwise the following can be triggered:

 # cd /sys/kernel/tracing
 # echo 1500 > buffer_size_kb
 # mkdir instances/foo
 # echo irqsoff > instances/foo/current_tracer
 # echo 1000 > instances/foo/buffer_size_kb

Produces:

 WARNING: CPU: 2 PID: 856 at kernel/trace/trace.c:1938 update_max_tr_single.part.0+0x27d/0x320

Which is:

	ret = ring_buffer_swap_cpu(tr->max_buffer.buffer, tr->array_buffer.buffer, cpu);

	if (ret == -EBUSY) {
		[..]
	}

	WARN_ON_ONCE(ret && ret != -EAGAIN && ret != -EBUSY);  <== here

That's because ring_buffer_swap_cpu() has:

	int ret = -EINVAL;

	[..]

	/* At least make sure the two buffers are somewhat the same */
	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
		goto out;

	[..]
 out:
	return ret;
 }

Instead, update all instances' snapshot buffer sizes when their main
buffer size is updated.
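
To make the failure mode concrete, here is a user-space sketch of the
size check and of the two resize behaviors. The types and function names
(buf_model, instance_model, model_swap, model_resize_main_only,
model_resize_both) are hypothetical; only the "sizes must match or the
swap returns -EINVAL" rule is taken from the snippet above.

```c
#include <errno.h>

/* Hypothetical stand-ins for a ring buffer and a tracing instance. */
struct buf_model {
	unsigned long nr_pages;
};

struct instance_model {
	struct buf_model main_buf;
	struct buf_model snapshot_buf;
};

/* Like ring_buffer_swap_cpu(), refuse to swap buffers of different sizes. */
int model_swap(struct buf_model *a, struct buf_model *b)
{
	struct buf_model tmp;

	if (a->nr_pages != b->nr_pages)
		return -EINVAL;
	tmp = *a;
	*a = *b;
	*b = tmp;
	return 0;
}

/* Resize the main buffer only (old behavior for non-top-level instances). */
void model_resize_main_only(struct instance_model *tr, unsigned long nr_pages)
{
	tr->main_buf.nr_pages = nr_pages;
}

/* Resize both buffers together (behavior after this change). */
void model_resize_both(struct instance_model *tr, unsigned long nr_pages)
{
	tr->main_buf.nr_pages = nr_pages;
	tr->snapshot_buf.nr_pages = nr_pages;
}
```

In the model, resizing only the main buffer makes the next swap fail with
-EINVAL, which is the return value that trips the WARN_ON_ONCE() shown
above.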

Link: https://lkml.kernel.org/r/20231205220010.454662151@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 6d9b3fa5e7 ("tracing: Move tracing_max_latency into trace_array")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05 17:06:12 -05:00
Andrii Nakryiko 18a433b620 bpf: track aligned STACK_ZERO cases as imprecise spilled registers
Now that precision backtracing is supporting register spill/fill to/from
stack, there is another opportunity to be exploited here: minimizing
precise STACK_ZERO cases. With a simple code change we can rely on
initially imprecise register spill tracking for cases when register
spilled to stack was a known zero.

This is a very common case for initializing variables on the stack,
including rather large structures. Often times zero has no special
meaning for the subsequent BPF program logic and is often overwritten
with non-zero values soon afterwards. But due to STACK_ZERO vs
STACK_MISC tracking, such initial zero initialization actually causes
duplication of verifier states as STACK_ZERO is clearly different than
STACK_MISC or spilled SCALAR_VALUE register.

The effect of this (now) trivial change is huge, as can be seen below.
These are differences between BPF selftests, Cilium, and Meta-internal
BPF object files relative to previous patch in this series. You can see
improvements ranging from single-digit percentage improvement for
instructions and states, all the way to 50-60% reduction for some of
Meta-internal host agent programs, and even some Cilium programs.

For Meta-internal ones I left only the differences for largest BPF
object files by states/instructions, as there were too many differences
in the overall output. All the differences were improvements, reducing
number of states and thus instructions validated.

Note, Meta-internal BPF object file names are not printed below.
Many copies of balancer_ingress are actually many different
configurations of Katran, so they are different BPF programs, which
explains state reduction going from -16% all the way to 31%, depending
on BPF program logic complexity.

I also took a closer look at a few small-ish BPF programs to validate
the behavior. Let's take bpf_iter_netlink.bpf.o (first row below).
While it's just 8 vs 5 states, the verifier log is still too long to
include here. But the reduction in states is due to the following
piece of C code:

        unsigned long ino;

        ...

        sk = s->sk_socket;
        if (!sk) {
                ino = 0;
        } else {
                inode = SOCK_INODE(sk);
                bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino);
        }
        BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino);
        return 0;

You can see that in some situations `ino` is zero-initialized, while in
others it's an unknown value filled out by bpf_probe_read_kernel(). Before
this change, the code after the if/else branches had to be validated
twice: once with (precise) ino == 0, due to eager STACK_ZERO logic, and
then again for when ino is just STACK_MISC. But BPF_SEQ_PRINTF() doesn't
care about the precise value of ino, so with the change in this patch the
verifier is able to prune states from after one of the branches, reducing
the total number of states (and instructions) required for successful
validation.

Similar principle applies to bigger real-world applications, just at
a much larger scale.
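
To make the pruning argument above more concrete, here is a toy model in
C. It is only a sketch under simplifying assumptions and not the
verifier's actual state comparison: the toy_* names, the two-value slot
enum and the single `precise` flag are all hypothetical. The idea it tries
to capture is that an already verified state whose zero was never needed
precisely can cover a later state in which the same slot holds an unknown
value.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for what the stack slot holding `ino` may contain. */
enum toy_slot { TOY_MISC, TOY_ZERO };

struct toy_state {
	enum toy_slot ino; /* contents of the slot at the join point */
	bool precise;      /* did verified code rely on the exact value? */
};

/*
 * Hypothetical check: may `cur` be pruned because the already verified
 * `old` state covers it?
 */
static bool toy_can_prune(const struct toy_state *old,
			  const struct toy_state *cur)
{
	if (old->ino == TOY_MISC)
		return true;  /* an unknown value covers any initialized value */
	if (!old->precise)
		return true;  /* zero, but nothing depended on it being zero */
	return cur->ino == TOY_ZERO; /* a precise zero only matches zero */
}
```

In this model, an eagerly precise zero (precise == true) forces the branch
with the unknown value to be walked again, while an imprecise zero lets it
be pruned.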

SELFTESTS
=========
File                                     Program                  Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
---------------------------------------  -----------------------  ---------  ---------  ---------------  ----------  ----------  -------------
bpf_iter_netlink.bpf.linked3.o           dump_netlink                   148        104    -44 (-29.73%)           8           5   -3 (-37.50%)
bpf_iter_unix.bpf.linked3.o              dump_unix                     8474       8404     -70 (-0.83%)         151         147    -4 (-2.65%)
bpf_loop.bpf.linked3.o                   stack_check                    560        324   -236 (-42.14%)          42          24  -18 (-42.86%)
local_storage_bench.bpf.linked3.o        get_local                      120         77    -43 (-35.83%)           9           6   -3 (-33.33%)
loop6.bpf.linked3.o                      trace_virtqueue_add_sgs      10167       9868    -299 (-2.94%)         226         206   -20 (-8.85%)
pyperf600_bpf_loop.bpf.linked3.o         on_event                      4872       3423  -1449 (-29.74%)         322         229  -93 (-28.88%)
strobemeta.bpf.linked3.o                 on_event                    180697     176036   -4661 (-2.58%)        4780        4734   -46 (-0.96%)
test_cls_redirect.bpf.linked3.o          cls_redirect                 65594      65401    -193 (-0.29%)        4230        4212   -18 (-0.43%)
test_global_func_args.bpf.linked3.o      test_cls                       145        136      -9 (-6.21%)          10           9   -1 (-10.00%)
test_l4lb.bpf.linked3.o                  balancer_ingress              4760       2612  -2148 (-45.13%)         113         102   -11 (-9.73%)
test_l4lb_noinline.bpf.linked3.o         balancer_ingress              4845       4877     +32 (+0.66%)         219         221    +2 (+0.91%)
test_l4lb_noinline_dynptr.bpf.linked3.o  balancer_ingress              2072       2087     +15 (+0.72%)          97          98    +1 (+1.03%)
test_seg6_loop.bpf.linked3.o             __add_egr_x                  12440       9975  -2465 (-19.82%)         364         353   -11 (-3.02%)
test_tcp_hdr_options.bpf.linked3.o       estab                         2558       2572     +14 (+0.55%)         179         180    +1 (+0.56%)
test_xdp_dynptr.bpf.linked3.o            _xdp_tx_iptunnel               645        596     -49 (-7.60%)          26          24    -2 (-7.69%)
test_xdp_noinline.bpf.linked3.o          balancer_ingress_v6           3520       3516      -4 (-0.11%)         216         216    +0 (+0.00%)
xdp_synproxy_kern.bpf.linked3.o          syncookie_tc                 82661      81241   -1420 (-1.72%)        5073        5155   +82 (+1.62%)
xdp_synproxy_kern.bpf.linked3.o          syncookie_xdp                84964      82297   -2667 (-3.14%)        5130        5157   +27 (+0.53%)

META-INTERNAL
=============
Program                                 Insns (A)  Insns (B)  Insns      (DIFF)  States (A)  States (B)  States   (DIFF)
--------------------------------------  ---------  ---------  -----------------  ----------  ----------  ---------------
balancer_ingress                            27925      23608    -4317 (-15.46%)        1488        1482      -6 (-0.40%)
balancer_ingress                            31824      27546    -4278 (-13.44%)        1658        1652      -6 (-0.36%)
balancer_ingress                            32213      27935    -4278 (-13.28%)        1689        1683      -6 (-0.36%)
balancer_ingress                            32213      27935    -4278 (-13.28%)        1689        1683      -6 (-0.36%)
balancer_ingress                            31824      27546    -4278 (-13.44%)        1658        1652      -6 (-0.36%)
balancer_ingress                            38647      29562    -9085 (-23.51%)        2069        1835   -234 (-11.31%)
balancer_ingress                            38647      29562    -9085 (-23.51%)        2069        1835   -234 (-11.31%)
balancer_ingress                            40339      30792    -9547 (-23.67%)        2193        1934   -259 (-11.81%)
balancer_ingress                            37321      29055    -8266 (-22.15%)        1972        1795    -177 (-8.98%)
balancer_ingress                            38176      29753    -8423 (-22.06%)        2008        1831    -177 (-8.81%)
balancer_ingress                            29193      20910    -8283 (-28.37%)        1599        1422   -177 (-11.07%)
balancer_ingress                            30013      21452    -8561 (-28.52%)        1645        1447   -198 (-12.04%)
balancer_ingress                            28691      24290    -4401 (-15.34%)        1545        1531     -14 (-0.91%)
balancer_ingress                            34223      28965    -5258 (-15.36%)        1984        1875    -109 (-5.49%)
balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
balancer_ingress                            35868      26455    -9413 (-26.24%)        2140        1827   -313 (-14.63%)
balancer_ingress                            35868      26455    -9413 (-26.24%)        2140        1827   -313 (-14.63%)
balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
balancer_ingress                            34844      29485    -5359 (-15.38%)        2036        1918    -118 (-5.80%)
fbflow_egress                                3256       2652     -604 (-18.55%)         218         192    -26 (-11.93%)
fbflow_ingress                               1026        944       -82 (-7.99%)          70          63     -7 (-10.00%)
sslwall_tc_egress                            8424       7360    -1064 (-12.63%)         498         458     -40 (-8.03%)
syar_accept_protect                         15040       9539    -5501 (-36.58%)         364         220   -144 (-39.56%)
syar_connect_tcp_v6                         15036       9535    -5501 (-36.59%)         360         216   -144 (-40.00%)
syar_connect_udp_v4                         15039       9538    -5501 (-36.58%)         361         217   -144 (-39.89%)
syar_connect_connect4_protect4              24805      15833    -8972 (-36.17%)         756         480   -276 (-36.51%)
syar_lsm_file_open                         167772     151813    -15959 (-9.51%)        1836        1667    -169 (-9.20%)
syar_namespace_create_new                   14805       9304    -5501 (-37.16%)         353         209   -144 (-40.79%)
syar_python3_detect                         17531      12030    -5501 (-31.38%)         391         247   -144 (-36.83%)
syar_ssh_post_fork                          16412      10911    -5501 (-33.52%)         405         261   -144 (-35.56%)
syar_enter_execve                           14728       9227    -5501 (-37.35%)         345         201   -144 (-41.74%)
syar_enter_execveat                         14728       9227    -5501 (-37.35%)         345         201   -144 (-41.74%)
syar_exit_execve                            16622      11121    -5501 (-33.09%)         376         232   -144 (-38.30%)
syar_exit_execveat                          16622      11121    -5501 (-33.09%)         376         232   -144 (-38.30%)
syar_syscalls_kill                          15288       9787    -5501 (-35.98%)         398         254   -144 (-36.18%)
syar_task_enter_pivot_root                  14898       9397    -5501 (-36.92%)         357         213   -144 (-40.34%)
syar_syscalls_setreuid                      16678      11177    -5501 (-32.98%)         429         285   -144 (-33.57%)
syar_syscalls_setuid                        16678      11177    -5501 (-32.98%)         429         285   -144 (-33.57%)
syar_syscalls_process_vm_readv              14959       9458    -5501 (-36.77%)         364         220   -144 (-39.56%)
syar_syscalls_process_vm_writev             15757      10256    -5501 (-34.91%)         390         246   -144 (-36.92%)
do_uprobe                                   15519      10018    -5501 (-35.45%)         373         229   -144 (-38.61%)
edgewall                                   179715      55783  -123932 (-68.96%)       12607        3999  -8608 (-68.28%)
bictcp_state                                 7570       4131    -3439 (-45.43%)         496         269   -227 (-45.77%)
cubictcp_state                               7570       4131    -3439 (-45.43%)         496         269   -227 (-45.77%)
tcp_rate_skb_delivered                        447        272     -175 (-39.15%)          29          18    -11 (-37.93%)
kprobe__bbr_set_state                        4566       2615    -1951 (-42.73%)         209         124    -85 (-40.67%)
kprobe__bictcp_state                         4566       2615    -1951 (-42.73%)         209         124    -85 (-40.67%)
inet_sock_set_state                          1501       1337     -164 (-10.93%)          93          85      -8 (-8.60%)
tcp_retransmit_skb                           1145        981     -164 (-14.32%)          67          59     -8 (-11.94%)
tcp_retransmit_synack                        1183        951     -232 (-19.61%)          67          55    -12 (-17.91%)
bpf_tcptuner                                 1459       1187     -272 (-18.64%)          99          80    -19 (-19.19%)
tw_egress                                     801        776       -25 (-3.12%)          69          66      -3 (-4.35%)
tw_ingress                                    795        770       -25 (-3.14%)          69          66      -3 (-4.35%)
ttls_tc_ingress                             19025      19383      +358 (+1.88%)         470         465      -5 (-1.06%)
ttls_nat_egress                               490        299     -191 (-38.98%)          33          20    -13 (-39.39%)
ttls_nat_ingress                              448        285     -163 (-36.38%)          32          21    -11 (-34.38%)
tw_twfw_egress                             511127     212071  -299056 (-58.51%)       16733        8504  -8229 (-49.18%)
tw_twfw_ingress                            500095     212069  -288026 (-57.59%)       16223        8504  -7719 (-47.58%)
tw_twfw_tc_eg                              511113     212064  -299049 (-58.51%)       16732        8504  -8228 (-49.18%)
tw_twfw_tc_in                              500095     212069  -288026 (-57.59%)       16223        8504  -7719 (-47.58%)
tw_twfw_egress                              12632      12435      -197 (-1.56%)         276         260     -16 (-5.80%)
tw_twfw_ingress                             12631      12454      -177 (-1.40%)         278         261     -17 (-6.12%)
tw_twfw_tc_eg                               12595      12435      -160 (-1.27%)         274         259     -15 (-5.47%)
tw_twfw_tc_in                               12631      12454      -177 (-1.40%)         278         261     -17 (-6.12%)
tw_xdp_dump                                   266        209      -57 (-21.43%)           9           8     -1 (-11.11%)

CILIUM
======
File           Program                           Insns (A)  Insns (B)  Insns     (DIFF)  States (A)  States (B)  States  (DIFF)
-------------  --------------------------------  ---------  ---------  ----------------  ----------  ----------  --------------
bpf_host.o     cil_to_netdev                          6047       4578   -1469 (-24.29%)         362         249  -113 (-31.22%)
bpf_host.o     handle_lxc_traffic                     2227       1585    -642 (-28.83%)         156         103   -53 (-33.97%)
bpf_host.o     tail_handle_ipv4_from_netdev           2244       1458    -786 (-35.03%)         163         106   -57 (-34.97%)
bpf_host.o     tail_handle_nat_fwd_ipv4              21022      10479  -10543 (-50.15%)        1289         670  -619 (-48.02%)
bpf_host.o     tail_handle_nat_fwd_ipv6              15433      11375   -4058 (-26.29%)         905         643  -262 (-28.95%)
bpf_host.o     tail_ipv4_host_policy_ingress          2219       1367    -852 (-38.40%)         161          96   -65 (-40.37%)
bpf_host.o     tail_nodeport_nat_egress_ipv4         22460      19862   -2598 (-11.57%)        1469        1293  -176 (-11.98%)
bpf_host.o     tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
bpf_host.o     tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
bpf_host.o     tail_nodeport_nat_ipv6_egress          3702       3542     -160 (-4.32%)         215         205    -10 (-4.65%)
bpf_lxc.o      tail_handle_nat_fwd_ipv4              21022      10479  -10543 (-50.15%)        1289         670  -619 (-48.02%)
bpf_lxc.o      tail_handle_nat_fwd_ipv6              15433      11375   -4058 (-26.29%)         905         643  -262 (-28.95%)
bpf_lxc.o      tail_ipv4_ct_egress                    5073       3374   -1699 (-33.49%)         262         172   -90 (-34.35%)
bpf_lxc.o      tail_ipv4_ct_ingress                   5093       3385   -1708 (-33.54%)         262         172   -90 (-34.35%)
bpf_lxc.o      tail_ipv4_ct_ingress_policy_only       5093       3385   -1708 (-33.54%)         262         172   -90 (-34.35%)
bpf_lxc.o      tail_ipv6_ct_egress                    4593       3878    -715 (-15.57%)         194         151   -43 (-22.16%)
bpf_lxc.o      tail_ipv6_ct_ingress                   4606       3891    -715 (-15.52%)         194         151   -43 (-22.16%)
bpf_lxc.o      tail_ipv6_ct_ingress_policy_only       4606       3891    -715 (-15.52%)         194         151   -43 (-22.16%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
bpf_overlay.o  tail_handle_nat_fwd_ipv4              20524      10114  -10410 (-50.72%)        1271         638  -633 (-49.80%)
bpf_overlay.o  tail_nodeport_nat_egress_ipv4         22718      19490   -3228 (-14.21%)        1475        1275  -200 (-13.56%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
bpf_overlay.o  tail_nodeport_nat_ipv6_egress          3638       3548      -90 (-2.47%)         209         203     -6 (-2.87%)
bpf_overlay.o  tail_rev_nodeport_lb4                  4368       3820    -548 (-12.55%)         248         215   -33 (-13.31%)
bpf_overlay.o  tail_rev_nodeport_lb6                  2867       2428    -439 (-15.31%)         167         140   -27 (-16.17%)
bpf_sock.o     cil_sock6_connect                      1718       1703      -15 (-0.87%)         100          99     -1 (-1.00%)
bpf_xdp.o      tail_handle_nat_fwd_ipv4              12917      12443     -474 (-3.67%)         875         849    -26 (-2.97%)
bpf_xdp.o      tail_handle_nat_fwd_ipv6              13515      13264     -251 (-1.86%)         715         702    -13 (-1.82%)
bpf_xdp.o      tail_lb_ipv4                          39492      36367    -3125 (-7.91%)        2430        2251   -179 (-7.37%)
bpf_xdp.o      tail_lb_ipv6                          80441      78058    -2383 (-2.96%)        3647        3523   -124 (-3.40%)
bpf_xdp.o      tail_nodeport_ipv6_dsr                 1038        901    -137 (-13.20%)          61          55     -6 (-9.84%)
bpf_xdp.o      tail_nodeport_nat_egress_ipv4         13027      12096     -931 (-7.15%)         868         809    -59 (-6.80%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv4         7617       5900   -1717 (-22.54%)         522         413  -109 (-20.88%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv6         7575       7395     -180 (-2.38%)         383         374     -9 (-2.35%)
bpf_xdp.o      tail_rev_nodeport_lb4                  6808       6739      -69 (-1.01%)         403         396     -7 (-1.74%)
bpf_xdp.o      tail_rev_nodeport_lb6                 16173      15847     -326 (-2.02%)        1010         990    -20 (-1.98%)

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:21 -08:00
Andrii Nakryiko e322f0bcb8 bpf: preserve constant zero when doing partial register restore
Similar to special handling of STACK_ZERO, when reading 1/2/4 bytes from
stack from slot that has register spilled into it and that register has
a constant value zero, preserve that zero and mark spilled register as
precise for that. This makes spilled const zero register and STACK_ZERO
cases equivalent in their behavior.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:21 -08:00
Andrii Nakryiko eaf18febd6 bpf: preserve STACK_ZERO slots on partial reg spills
Instead of always forcing STACK_ZERO slots to STACK_MISC, preserve it in
situations where this is possible. E.g., when spilling register as
1/2/4-byte subslots on the stack, all the remaining bytes in the stack
slot do not automatically become unknown. If we knew they contained
zeroes, we can preserve those STACK_ZERO markers.

Add a helper mark_stack_slot_misc(), similar to scrub_spilled_slot(),
but one that overwrites neither STACK_INVALID nor STACK_ZERO. Note
that we need to take into account the possibility of being in unprivileged
mode, in which case STACK_INVALID is forced to STACK_MISC for correctness,
as treating STACK_INVALID as equivalent to STACK_MISC is only enabled in
privileged mode.
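
The rule described above can be sketched in userspace C as follows. This
is a minimal illustration, not the kernel helper itself: the enum ordering,
the toy_ prefix and the plain `privileged` flag (standing in for whatever
the verifier consults to detect privileged mode) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Slot markers named after the ones in the text; values are arbitrary. */
enum toy_stack_type { STACK_INVALID, STACK_SPILL, STACK_MISC, STACK_ZERO };

/*
 * Downgrade a slot byte to STACK_MISC unless it is known-zero, or it is
 * uninitialized and we are in privileged mode.
 */
static void toy_mark_stack_slot_misc(bool privileged,
				     enum toy_stack_type *stype)
{
	if (*stype == STACK_ZERO)
		return; /* keep the knowledge that this byte is zero */
	if (privileged && *stype == STACK_INVALID)
		return; /* privileged mode may leave it uninitialized */
	*stype = STACK_MISC;
}
```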

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:20 -08:00
Andrii Nakryiko ab125ed3ec bpf: fix check for attempt to corrupt spilled pointer
When register is spilled onto a stack as a 1/2/4-byte register, we set
slot_type[BPF_REG_SIZE - 1] (plus potentially few more below it,
depending on actual spill size). So to check if some stack slot has
spilled register we need to consult slot_type[7], not slot_type[0].

To avoid the need to remember and double-check this in the future, just
use is_spilled_reg() helper.
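
A small userspace sketch of why the last byte is the one to check. The
toy_ names and the choice to fill the untouched bytes with a MISC marker
are assumptions made for illustration; only the "mark from index
BPF_REG_SIZE - 1 downwards" layout is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BPF_REG_SIZE 8

enum { TOY_STACK_MISC = 0, TOY_STACK_SPILL = 1 };

struct toy_slot {
	unsigned char slot_type[BPF_REG_SIZE];
};

/* Mark a `size`-byte spill: the top `size` entries, ending at index 7. */
static void toy_spill(struct toy_slot *s, int size)
{
	memset(s->slot_type, TOY_STACK_MISC, sizeof(s->slot_type));
	for (int i = 0; i < size; i++)
		s->slot_type[BPF_REG_SIZE - 1 - i] = TOY_STACK_SPILL;
}

/* Works for 1/2/4/8-byte spills, unlike a check of slot_type[0]. */
static bool toy_is_spilled(const struct toy_slot *s)
{
	return s->slot_type[BPF_REG_SIZE - 1] == TOY_STACK_SPILL;
}
```

After a 4-byte spill, slot_type[0] is not marked as a spill in this model,
so a check of index 0 would miss it, while the check of index 7 does not.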

Fixes: 27113c59b6 ("bpf: Check the other end of slot_type for STACK_SPILL")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:20 -08:00
Andrii Nakryiko 41f6f64e69 bpf: support non-r10 register spill/fill to/from stack in precision tracking
Use instruction (jump) history to record instructions that performed
register spill/fill to/from stack, regardless of whether this was done through
the read-only r10 register, or any other register after copying r10 into it
*and* potentially adjusting the offset.

To make this work reliably, we push extra per-instruction flags into
instruction history, encoding stack slot index (spi) and stack frame
number in extra 10 bit flags we take away from prev_idx in instruction
history. We don't touch idx field for maximum performance, as it's
checked most frequently during backtracking.
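
As an illustration of packing this information into 10 bits, here is a
userspace sketch. The exact split used below (3 bits of frame number,
6 bits of stack slot index, 1 marker bit) and all TOY_/toy_ names are
assumptions for the example; the text above only states that 10 flag bits
are taken away from prev_idx.

```c
#include <assert.h>

#define TOY_FRAMENO_BITS 3
#define TOY_SPI_BITS     6
#define TOY_FRAMENO_MASK ((1u << TOY_FRAMENO_BITS) - 1)
#define TOY_SPI_MASK     ((1u << TOY_SPI_BITS) - 1)
#define TOY_SPI_SHIFT    TOY_FRAMENO_BITS
#define TOY_STACK_ACCESS (1u << 9) /* marks "this insn touched the stack" */

struct toy_hist_entry {
	unsigned int idx;           /* left untouched, checked most often */
	unsigned int prev_idx : 22; /* gives up 10 bits ... */
	unsigned int flags : 10;    /* ... which hold the packed flags */
};

/* Pack frame number and stack slot index into the 10 flag bits. */
static unsigned int toy_encode(unsigned int frameno, unsigned int spi)
{
	return TOY_STACK_ACCESS | (spi << TOY_SPI_SHIFT) | frameno;
}

static unsigned int toy_frameno(unsigned int flags)
{
	return flags & TOY_FRAMENO_MASK;
}

static unsigned int toy_spi(unsigned int flags)
{
	return (flags >> TOY_SPI_SHIFT) & TOY_SPI_MASK;
}
```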

This change removes basically the last remaining practical limitation of
precision backtracking logic in BPF verifier. It fixes known
deficiencies, but also opens up new opportunities to reduce number of
verified states, explored in the subsequent patches.

There are only three differences in selftests' BPF object files
according to veristat, all in the positive direction (fewer states).

File                                    Program        Insns (A)  Insns (B)  Insns  (DIFF)  States (A)  States (B)  States (DIFF)
--------------------------------------  -------------  ---------  ---------  -------------  ----------  ----------  -------------
test_cls_redirect_dynptr.bpf.linked3.o  cls_redirect        2987       2864  -123 (-4.12%)         240         231    -9 (-3.75%)
xdp_synproxy_kern.bpf.linked3.o         syncookie_tc       82848      82661  -187 (-0.23%)        5107        5073   -34 (-0.67%)
xdp_synproxy_kern.bpf.linked3.o         syncookie_xdp      85116      84964  -152 (-0.18%)        5162        5130   -32 (-0.62%)

Note, I avoided renaming jmp_history to more generic insn_hist to
minimize number of lines changed and potential merge conflicts between
bpf and bpf-next trees.

Notice also that the cur_hist_entry pointer is reset to NULL at the
beginning of the instruction verification loop. This pointer avoids the
problem of relying on the last jump history entry's insn_idx to determine
whether we already have an entry for the current instruction or not. It can
happen that we added a jump history entry because the current instruction
is_jmp_point(), but we also need to add instruction flags for stack access.
In this case, we don't want two entries, so we need to reuse the last added
entry, if it is present.

Relying on insn_idx comparison has the same ambiguity problem as the one
that was fixed recently in [0], so we avoid that.

  [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/
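
The entry-reuse idea can be sketched in userspace C like this. It is a
simplified model with hypothetical names (toy_hist, toy_begin_insn,
toy_push) and a fixed-size array; it only aims to show why a per-instruction
pointer is less ambiguous than comparing insn_idx with the last entry.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HIST_MAX 16

struct toy_entry {
	int idx;
	unsigned int flags;
};

struct toy_hist {
	struct toy_entry entries[TOY_HIST_MAX];
	int cnt;
	struct toy_entry *cur; /* entry of the insn being verified, if any */
};

/* Called at the start of verifying each instruction. */
static void toy_begin_insn(struct toy_hist *h)
{
	h->cur = NULL;
}

/* Record `flags` for instruction `idx`, reusing this insn's entry. */
static int toy_push(struct toy_hist *h, int idx, unsigned int flags)
{
	if (h->cur) {
		h->cur->flags |= flags; /* same instruction: merge flags */
		return 0;
	}
	if (h->cnt == TOY_HIST_MAX)
		return -1;
	h->cur = &h->entries[h->cnt++];
	h->cur->idx = idx;
	h->cur->flags = flags;
	return 0;
}
```

If the same instruction index is verified again, for example on a later
loop iteration, the pointer has been reset in between and a new entry is
created, which a comparison of insn_idx alone could not distinguish.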

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:20 -08:00
Jens Axboe 9fd7874c0e iov_iter: replace import_single_range() with import_ubuf()
With the removal of the 'iov' argument to import_single_range(), the two
functions are now fully identical. Convert the import_single_range()
callers to import_ubuf(), and remove the former fully.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-05 11:57:37 +01:00
Jens Axboe 6ac805d138 iov_iter: remove unused 'iov' argument from import_single_range()
It is entirely unused, just get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-2-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-05 11:57:34 +01:00
Vlastimil Babka 70da1d01ed cpu/hotplug: remove CPUHP_SLAB_PREPARE hooks
The CPUHP_SLAB_PREPARE hooks are only used by SLAB which is removed.
SLUB defines them as NULL, so we can remove those altogether.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-12-05 11:17:58 +01:00
Hou Tao af66bfd3c8 bpf: Optimize the free of inner map
When removing the inner map from the outer map, the inner map will be
freed after one RCU grace period and one RCU tasks trace grace
period, so it is certain that the bpf program, which may access the
inner map, has exited before the inner map is freed.

However, there is no need to wait for one RCU tasks trace grace period if
the outer map is only accessed by non-sleepable programs. So add
sleepable_refcnt to bpf_map and increase it when adding the outer map
into env->used_maps for a sleepable program. Although the max number of
bpf programs is INT_MAX - 1, the number of bpf programs which are being
loaded may be greater than INT_MAX, so use atomic64_t instead of atomic_t
for sleepable_refcnt. When removing the inner map from the outer map, use
sleepable_refcnt to decide whether or not an RCU tasks trace grace period
is needed before freeing the inner map.
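
A userspace sketch of that decision, using C11 atomics in place of the
kernel's atomic64_t. The toy_ names are hypothetical and the model leaves
out everything except the counter and the check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_map {
	atomic_llong sleepable_refcnt; /* 64-bit, so it cannot realistically wrap */
};

/* Called when the outer map is recorded as used by a sleepable program. */
static void toy_used_by_sleepable_prog(struct toy_map *outer)
{
	atomic_fetch_add(&outer->sleepable_refcnt, 1);
}

/* Decide whether freeing an inner map must also wait for a tasks trace GP. */
static bool toy_needs_tasks_trace_gp(struct toy_map *outer)
{
	return atomic_load(&outer->sleepable_refcnt) > 0;
}
```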

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Hou Tao 8766733641 bpf: Defer the free of inner map when necessary
When updating or deleting an inner map in map array or map htab, the map
may still be accessed by a non-sleepable or sleepable program.
However, bpf_map_fd_put_ptr() decreases the ref-counter of the inner map
directly through bpf_map_put(). If the ref-counter is the last one
(which is true for most cases), the inner map will be freed by
ops->map_free() in a kworker. But for now, most .map_free() callbacks
don't use synchronize_rcu() or its variants to wait for an RCU grace
period to elapse, so after the invocation of ops->map_free completes,
the bpf program which is accessing the inner map may incur a
use-after-free problem.

Fix the free of inner map by invoking bpf_map_free_deferred() after both
one RCU grace period and one tasks trace RCU grace period if the inner
map has been removed from the outer map before. The deferment is
accomplished by using call_rcu() or call_rcu_tasks_trace() when
releasing the last ref-counter of bpf map. The newly-added rcu_head
field in bpf_map shares the same storage space with work field to
reduce the size of bpf_map.

Fixes: bba1dc0b55 ("bpf: Remove redundant synchronize_rcu.")
Fixes: 638e4b825d ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Hou Tao 79d93b3c6f bpf: Set need_defer as false when clearing fd array during map free
The map deletion, map release and map free operations all use
fd_array_map_delete_elem() to remove the element from the fd array, and
need_defer is always true in fd_array_map_delete_elem(). For the map
deletion operation and map release operation, need_defer=true is
necessary, because the bpf program which accesses the element in the fd
array may still be alive. However, for the map free operation, it is certain
that the bpf program which owns the fd array has already exited, so
setting need_defer to false is appropriate for the map free operation.

So fix it by adding need_defer parameter to bpf_fd_array_map_clear() and
adding a new helper __fd_array_map_delete_elem() to handle the map
deletion, map release and map free operations correspondingly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Hou Tao 20c20bd11a bpf: Add map and need_defer parameters to .map_fd_put_ptr()
map is the pointer of outer map, and need_defer needs some explanation.
need_defer tells the implementation to defer the reference release of
the passed element and ensure that the element is still alive before
the bpf program, which may manipulate it, exits.

The following three cases will invoke map_fd_put_ptr() and different
need_defer values will be passed to these callers:

1) release the reference of the old element in the map during map update
   or map deletion. The release must be deferred, otherwise the bpf
   program may incur use-after-free problem, so need_defer needs to be
   true.
2) release the reference of the to-be-added element in the error path of
   map update. The to-be-added element is not visible to any bpf
   program, so it is OK to pass false for need_defer parameter.
3) release the references of all elements in the map during map release.
   Any bpf program which has access to the map must have exited and
   been released, so need_defer=false will be OK.

These two parameters will be used by the following patches to fix the
potential use-after-free problem for map-in-map.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Hou Tao 169410eba2 bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
These three bpf_map_{lookup,update,delete}_elem() helpers are also
available for sleepable bpf program, so add the corresponding lock
assertion for sleepable bpf program, otherwise the following warning
will be reported when a sleepable bpf program manipulates bpf map under
interpreter mode (aka bpf_jit_enable=0):

  WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ......
  CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:bpf_map_lookup_elem+0x54/0x60
  ......
  Call Trace:
   <TASK>
   ? __warn+0xa5/0x240
   ? bpf_map_lookup_elem+0x54/0x60
   ? report_bug+0x1ba/0x1f0
   ? handle_bug+0x40/0x80
   ? exc_invalid_op+0x18/0x50
   ? asm_exc_invalid_op+0x1b/0x20
   ? __pfx_bpf_map_lookup_elem+0x10/0x10
   ? rcu_lockdep_current_cpu_online+0x65/0xb0
   ? rcu_is_watching+0x23/0x50
   ? bpf_map_lookup_elem+0x54/0x60
   ? __pfx_bpf_map_lookup_elem+0x10/0x10
   ___bpf_prog_run+0x513/0x3b70
   __bpf_prog_run32+0x9d/0xd0
   ? __bpf_prog_enter_sleepable_recur+0xad/0x120
   ? __bpf_prog_enter_sleepable_recur+0x3e/0x120
   bpf_trampoline_6442580665+0x4d/0x1000
   __x64_sys_getpgid+0x5/0x30
   ? do_syscall_64+0x36/0xb0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Alison Schofield 659aa050a5 kernel/resource: Increment by align value in get_free_mem_region()
Currently get_free_mem_region() searches for available capacity
in increments equal to the region size being requested. This can
cause the search to take giant steps through the resource leaving
needless gaps and missing available space.

Specifically 'cxl create-region' fails with ERANGE even though capacity
of the given size and CXL's expected 256M x InterleaveWays alignment can
be satisfied.

Replace the total-request-size increment with a next alignment increment
so that the next possible address is always examined for availability.
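
The difference between the two increments can be illustrated with a toy
search in userspace C. This is a sketch only: the toy_ names, the
ascending search direction and returning UINT64_MAX on failure are
assumptions made for the example, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_range {
	uint64_t start, end; /* half-open: [start, end) */
};

/* True if [addr, addr + size) overlaps none of the busy ranges. */
static bool toy_is_free(const struct toy_range *busy, int nbusy,
			uint64_t addr, uint64_t size)
{
	for (int i = 0; i < nbusy; i++)
		if (addr < busy[i].end && busy[i].start < addr + size)
			return false;
	return true;
}

/* Probe candidate addresses in [0, limit), advancing by `step`. */
static uint64_t toy_find_free(const struct toy_range *busy, int nbusy,
			      uint64_t limit, uint64_t size, uint64_t step)
{
	for (uint64_t addr = 0; addr + size <= limit; addr += step)
		if (toy_is_free(busy, nbusy, addr, size))
			return addr;
	return UINT64_MAX; /* nothing found */
}
```

With busy ranges [0, 256) and [1280, 2048), a request of size 1024 in a
2048-unit window fails when stepping by the request size (only addresses 0
and 1024 are probed), but succeeds at address 256 when stepping by an
alignment of 256.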

Fixes: 14b80582c4 ("resource: Introduce alloc_free_mem_region()")
Reported-by: Dmytro Adamenko <dmytro.adamenko@intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/20231113221324.1118092-1-alison.schofield@intel.com
Cc: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-12-04 17:19:03 -08:00
Andrei Matei 5bd90cdc65 bpf: Minor logging improvement
One place where we were logging a register was only logging the variable
part, not also the fixed part.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231204011248.2040084-1-andreimatei1@gmail.com
2023-12-04 15:57:27 +01:00
Linus Torvalds 669fc83452 Probes fixes for v6.7-rc3:

Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - objpool: Fix objpool overrun case on memory/cache access delay,
   especially on the big.LITTLE SoC. The objpool uses a local copy of the
   object slot index in its internal loop, but the slot index can be
   changed by another processor in parallel. In that case, the difference
   between the local 'head' copy and the 'slot->last' index can be bigger
   than the local slot size, and slot::head must be re-read to update it.

 - kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
   kretprobe_holder::rp is RCU managed, it should use
   rcu_assign_pointer() and rcu_dereference_check() correctly. Also
   adding __rcu tag for finding wrong usage by sparse.

 - rethook: Fix to use appropriate rcu API for rethook::handler. The
   same as kretprobe, rethook::handler is RCU managed and it should use
   rcu_assign_pointer() and rcu_dereference_check(). This also adds
   __rcu tag for finding wrong usage by sparse.

* tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rethook: Use __rcu pointer for rethook::handler
  kprobes: consistent rcu api usage for kretprobe holder
  lib: objpool: fix head overrun on RK3588 SBC
2023-12-03 08:02:49 +09:00
Andrii Nakryiko 81eff2e364 bpf: simplify tnum output if a fully known constant
Emit tnum representation as just a constant if all bits are known.
Use decimal-vs-hex logic to determine exact format of emitted
constant value, just like it's done for register range values.
For that move tnum_strn() to kernel/bpf/log.c to reuse decimal-vs-hex
determination logic and constants.
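
A minimal user-space sketch of the idea, not the kernel code: `tnum_to_str()` is a hypothetical helper, and the decimal cutoff of 0xffff is an assumption made for this illustration. A tnum whose mask is zero has no unknown bits, so it can be printed as a bare constant.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Tristate number: a bit set in 'mask' is unknown; otherwise 'value' holds it. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/*
 * Format a tnum into buf. A fully known tnum (mask == 0) is printed as a
 * plain constant; the decimal cutoff of 0xffff is an assumption made for
 * this sketch. Anything else is printed as a (value; mask) pair.
 */
static int tnum_to_str(char *buf, size_t size, struct tnum t)
{
	if (t.mask == 0) {
		if (t.value <= UINT16_MAX)
			return snprintf(buf, size, "%" PRIu64, t.value);
		return snprintf(buf, size, "%#" PRIx64, t.value);
	}
	return snprintf(buf, size, "(%#" PRIx64 "; %#" PRIx64 ")",
			t.value, t.mask);
}
```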

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:51 -08:00
Andrii Nakryiko eabe518de5 bpf: enforce precision of R0 on program/async callback return
Given we enforce a valid range for program and async callback return
value, we must mark R0 as precise to avoid incorrect state pruning.

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:51 -08:00
Andrii Nakryiko 0ef24c8dfa bpf: unify async callback and program retval checks
Use common logic to verify program return values and async callback
return values. This avoids duplicating any extra steps that are
necessary, like precision marking, which will be added in the next
patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Andrii Nakryiko c871d0e00f bpf: enforce precise retval range on program exit
Similarly to subprog/callback logic, enforce return value of BPF program
using more precise smin/smax range.

We need to adjust a bunch of tests due to a changed format of an error
message.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Andrii Nakryiko 8fa4ecd49b bpf: enforce exact retval range on subprog/callback exit
Instead of relying on potentially imprecise tnum representation of
expected return value range for callbacks and subprogs, validate that
the smin/smax range satisfies the exact expected range of return values.

E.g., if callback would need to return [0, 2] range, tnum can't
represent this precisely and instead will allow [0, 3] range. By
checking smin/smax range, we can make sure that subprog/callback indeed
returns only valid [0, 2] range.
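
The [0, 2] versus [0, 3] example can be worked through in user space. This is a sketch under assumptions: `tnum_range()` here is a simplified re-implementation written for this illustration, and `tnum_contains()` and `retval_in_range()` are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tristate number: a bit set in 'mask' is unknown; otherwise 'value' holds it. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Smallest tnum covering [min, max]; a simplified sketch for illustration. */
static struct tnum tnum_range(uint64_t min, uint64_t max)
{
	uint64_t chi = min ^ max;
	unsigned int bits = 0;

	while (chi) {		/* index of the highest differing bit, plus one */
		bits++;
		chi >>= 1;
	}
	if (bits > 63)		/* 1ULL << 64 would be undefined */
		return (struct tnum){ 0, UINT64_MAX };

	uint64_t delta = (1ULL << bits) - 1;
	return (struct tnum){ min & ~delta, delta };
}

/* True if x is a value the tnum admits. */
static bool tnum_contains(struct tnum t, uint64_t x)
{
	return (x & ~t.mask) == t.value;
}

/* Exact check on signed bounds, as the commit describes. */
static bool retval_in_range(int64_t smin, int64_t smax, int64_t lo, int64_t hi)
{
	return smin >= lo && smax <= hi;
}
```

For [0, 2] the two bounds differ in bit 1, so both low bits become unknown and the resulting tnum (value 0, mask 3) also admits 3. The bounds check rejects a register whose smax is 3.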

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Andrii Nakryiko 0acd03a5bd bpf: enforce precision of R0 on callback return
Given verifier checks actual value, r0 has to be precise, so we need to
propagate precision properly. r0 also has to be marked as read,
otherwise subsequent state comparisons will ignore such register as
unimportant and precision won't really help here.

Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Andrii Nakryiko 5fad52bee3 bpf: provide correct register name for exception callback retval check
bpf_throw() is checking R1, so let's report R1 in the log.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Song Liu ac9c05e0e4 bpf: Add kfunc bpf_get_file_xattr
It is common practice for security solutions to store tags/labels in
xattrs. To implement similar functionalities in BPF LSM, add new kfunc
bpf_get_file_xattr().

The first use case of bpf_get_file_xattr() is to implement file
verifications with asymmetric keys. Specifically, security applications
could use fsverity for file hashes and use xattr to store file signatures.
(kfunc for fsverity hash will be added in a separate commit.)

Currently, only xattrs with "user." prefix can be read with kfunc
bpf_get_file_xattr(). As use cases evolve, we may add a dedicated prefix
for bpf_get_file_xattr().

To avoid recursion, bpf_get_file_xattr can be only called from LSM hooks.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20231129234417.856536-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-01 16:21:03 -08:00
Yonghong Song dfce9cb314 bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4
Bpf cpu=v4 support is introduced in [1] and Commit 4cd58e9af8
("bpf: Support new 32bit offset jmp instruction") added support for new
32bit offset jmp instruction. Unfortunately, in function
bpf_adj_delta_to_off(), for the new branch insn with 32bit offset, the offset
(plus/minus a small delta) is compared against the 16-bit offset bound
[S16_MIN, S16_MAX], which caused the following verification failure:
  $ ./test_progs-cpuv4 -t verif_scale_pyperf180
  ...
  insn 10 cannot be patched due to 16-bit range
  ...
  libbpf: failed to load object 'pyperf180.bpf.o'
  scale_test:FAIL:expect_success unexpected error: -12 (errno 12)
  #405     verif_scale_pyperf180:FAIL

Note that due to recent llvm18 development, the patch [2] (already applied
in bpf-next) needs to be applied to bpf tree for testing purpose.

The fix is rather simple. For a 32bit offset branch insn, the adjusted
offset is compared against [S32_MIN, S32_MAX], and verification then succeeds.
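
The bound selection can be sketched as follows. This is a user-space illustration, not the body of bpf_adj_delta_to_off(): `adjusted_off_fits()` and its `is_jmp32_off` flag are hypothetical names standing in for "this insn uses the 32-bit offset form".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a jump offset can absorb 'delta' without leaving the
 * range its encoding supports. 'is_jmp32_off' is a hypothetical flag
 * standing in for "this is the 32-bit-offset jump form".
 */
static bool adjusted_off_fits(int64_t off, int64_t delta, bool is_jmp32_off)
{
	int64_t min = is_jmp32_off ? INT32_MIN : INT16_MIN;
	int64_t max = is_jmp32_off ? INT32_MAX : INT16_MAX;
	int64_t adj = off + delta;

	return adj >= min && adj <= max;
}
```

An offset of 32767 adjusted by 1 is rejected against the 16-bit bound but accepted against the 32-bit one, which matches the failure mode described above.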

  [1] https://lore.kernel.org/all/20230728011143.3710005-1-yonghong.song@linux.dev
  [2] https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev

Fixes: 4cd58e9af8 ("bpf: Support new 32bit offset jmp instruction")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231201024640.3417057-1-yonghong.song@linux.dev
2023-12-01 15:41:50 -08:00
Kees Cook 8a3750ecf8 tracing/uprobe: Replace strlcpy() with strscpy()
strlcpy() reads the entire source buffer first. This read may exceed
the destination size limit. This is both inefficient and can lead
to linear read overflows if a source string is not NUL-terminated[1].
Additionally, it returns the size of the source string, not the
resulting size of the destination string. In an effort to remove strlcpy()
completely[2], replace strlcpy() here with strscpy().

The negative return value is already handled by this code so no new
handling is needed here.
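
The return-value contract can be illustrated with a user-space sketch. `sketch_strscpy()` is a hypothetical stand-in written for this note, not the kernel implementation; it assumes the behaviour described above, namely a bounded read of the source and a negative error on truncation.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Copy at most size - 1 bytes and always NUL-terminate. Returns the
 * number of bytes copied, or -E2BIG if the source did not fit.
 * The source is never read past its first 'size' bytes.
 */
static ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] == '\0' ? (ssize_t)i : -E2BIG;
}
```

A caller that already treats a negative result as an error needs no change, whereas strlcpy() callers had to compare the returned source length against the buffer size to detect truncation.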

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1]
Link: https://github.com/KSPP/linux/issues/89 [2]
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: linux-trace-kernel@vger.kernel.org
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20231130205607.work.463-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01 10:25:35 -08:00
Andy Shevchenko b5e3f86a47 params: Fix multi-line comment style
The multi-line comment style in the file is rather arbitrary.
Make it follow the standard one.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-6-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01 09:51:44 -08:00
Andy Shevchenko a05f096c2c params: Sort headers
Sort the headers alphabetically to ease
the maintenance of this part.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01 09:51:44 -08:00
Andy Shevchenko 0fc79cbc93 params: Use size_add() for kmalloc()
Prevent allocations from integer overflow by using size_add().
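
A minimal sketch of the saturating behaviour, assuming a hypothetical user-space helper `sat_size_add()` rather than the kernel's own macro: on overflow the sum is pinned to SIZE_MAX, so a later allocation of that size fails instead of silently returning an undersized buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturating add: returns SIZE_MAX instead of wrapping on overflow. */
static size_t sat_size_add(size_t a, size_t b)
{
	size_t sum = a + b;	/* unsigned wraparound is well defined */

	return sum < a ? SIZE_MAX : sum;
}
```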

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01 09:51:43 -08:00
Andy Shevchenko fd0cd057a1 params: Do not go over the limit when getting the string length
We can use strnlen() even in early stages, and it prevents
going past the string boundary in case the string is already too long.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01 09:51:43 -08:00
Andy Shevchenko 12cd3cd8c7 params: Introduce the param_unknown_fn type
Introduce a new type for the callback to parse an unknown argument.
This unifies function prototypes which take that as a parameter.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01 09:51:43 -08:00
Waiman Long d499fd418f cgroup/rstat: Optimize cgroup_rstat_updated_list()
The current design of cgroup_rstat_cpu_pop_updated() is to traverse
the updated tree in a way to pop out the leaf nodes first before
their parents. This can cause traversal of multiple nodes before a
leaf node can be found and popped out. IOW, a given node in the tree
can be visited multiple times before the whole operation is done. So
it is not very efficient and the code can be hard to read.

With the introduction of cgroup_rstat_updated_list() to build a list
of cgroups to be flushed first before any flushing operation is being
done, we can optimize the way the updated tree nodes are being popped
by pushing the parents first to the tail end of the list before their
children. In this way, most updated tree nodes will be visited only
once with the exception of the subtree root as we still need to go
back to its parent and pop it out of its updated_children list.
This also makes the code easier to read.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-12-01 07:40:04 -10:00
Jann Horn a51749ab34 locking/mutex: Document that mutex_unlock() is non-atomic
I have seen several cases of attempts to use mutex_unlock() to release an
object such that the object can then be freed by another task.

This is not safe because mutex_unlock(), in the
MUTEX_FLAG_WAITERS && !MUTEX_FLAG_HANDOFF case, accesses the mutex
structure after having marked it as unlocked; so mutex_unlock() requires
its caller to ensure that the mutex stays alive until mutex_unlock()
returns.

If MUTEX_FLAG_WAITERS is set and there are real waiters, those waiters
have to keep the mutex alive, but we could have a spurious
MUTEX_FLAG_WAITERS left if an interruptible/killable waiter bailed
between the points where __mutex_unlock_slowpath() did the cmpxchg
reading the flags and where it acquired the wait_lock.

( With spinlocks, that kind of code pattern is allowed and, from what I
  remember, used in several places in the kernel. )

Document this: such a semantic difference between mutexes and spinlocks
is fairly unintuitive.

[ mingo: Made the changelog a bit more assertive, refined the comments. ]

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231130204817.2031407-1-jannh@google.com
2023-12-01 11:27:43 +01:00
Masami Hiramatsu (Google) a1461f1fd6 rethook: Use __rcu pointer for rethook::handler
Since rethook::handler is an RCU-managed pointer, used to tell readers
whether the rethook has been stopped (unregistered) or not, it should
be an __rcu pointer and be accessed through the appropriate functions,
which apply the required memory barriers. OTOH,
rethook::data is never changed, so we don't need to check it in
get_kretprobe().

NOTE: To avoid sparse warning, rethook::handler is defined by a raw
function pointer type with __rcu instead of rethook_handler_t.

Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/

Fixes: 54ecbe6f1e ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/
Tested-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01 14:53:56 +09:00
JP Kobryn d839a656d0 kprobes: consistent rcu api usage for kretprobe holder
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
The thought behind this patch is to make use of the RCU API where possible
when accessing this pointer so that the needed barriers are always in place
and to self-document the code.

The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
done to the "rp" pointer are changed to make use of the RCU macro for
assignment. For the single read, the implementation of get_kretprobe()
is simplified by making use of an RCU macro which accomplishes the same,
but note that the log warning text will be more generic.

I did find that there is a difference in assembly generated between the
usage of the RCU macros vs without. For example, on arm64, when using
rcu_assign_pointer(), the corresponding store instruction is a
store-release (STLR) which has an implicit barrier. When normal assignment
is done, a regular store (STR) is found. In the macro case, this seems to
be a result of rcu_assign_pointer() using smp_store_release() when the
value to write is not NULL.
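
A rough user-space analogue of the publish/read pairing, using C11 atomics rather than the kernel RCU API. Everything here is an assumption for illustration: `struct probe`, `slot`, `publish()` and `read_slot()` are hypothetical, and the acquire load is a stronger substitute for the dependency ordering that rcu_dereference() relies on.

```c
#include <stdatomic.h>
#include <stddef.h>

struct probe {
	int id;
};

/* Shared slot, standing in for an __rcu-annotated pointer. */
static _Atomic(struct probe *) slot;

/* Publish: release store, so the pointee's initialisation is visible first. */
static void publish(struct probe *p)
{
	atomic_store_explicit(&slot, p, memory_order_release);
}

/* Read side: acquire load pairs with the release store above. */
static struct probe *read_slot(void)
{
	return atomic_load_explicit(&slot, memory_order_acquire);
}
```

On arm64 a release store of this kind typically compiles to STLR, which is consistent with the assembly difference noted above.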

Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/

Fixes: d741bf41d7 ("kprobes: Remove kretprobe hash")
Cc: stable@vger.kernel.org
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01 14:53:55 +09:00
Jakub Kicinski 753c8608f3 bpf-next-for-netdev

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-30

We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 58 files changed, 1598 insertions(+), 154 deletions(-).

The main changes are:

1) Add initial TX metadata implementation for AF_XDP with support in mlx5
   and stmmac drivers. Two types of offloads are supported right now, that
   is, TX timestamp and TX checksum offload, from Stanislav Fomichev with
   stmmac implementation from Song Yoong Siang.

2) Change BPF verifier logic to validate global subprograms lazily instead
   of unconditionally before the main program, so they can be guarded using
   BPF CO-RE techniques, from Andrii Nakryiko.

3) Add BPF link_info support for uprobe multi link along with bpftool
   integration for the latter, from Jiri Olsa.

4) Use pkg-config in BPF selftests to determine ld flags which is
   in particular needed for linking statically, from Akihiko Odaki.

5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
  bpf/tests: Remove duplicate JSGT tests
  selftests/bpf: Add TX side to xdp_hw_metadata
  selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
  selftests/bpf: Add TX side to xdp_metadata
  selftests/bpf: Add csum helpers
  selftests/xsk: Support tx_metadata_len
  xsk: Add option to calculate TX checksum in SW
  xsk: Validate xsk_tx_metadata flags
  xsk: Document tx_metadata_len layout
  net: stmmac: Add Tx HWTS support to XDP ZC
  net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  tools: ynl: Print xsk-features from the sample
  xsk: Add TX timestamp and TX checksum offload support
  xsk: Support tx_metadata_len
  selftests/bpf: Use pkg-config for libelf
  selftests/bpf: Override PKG_CONFIG for static builds
  selftests/bpf: Choose pkg-config for the target
  bpftool: Add support to display uprobe_multi links
  selftests/bpf: Add link_info test for uprobe_multi link
  selftests/bpf: Use bpf_link__destroy in fill_link_info tests
  ...
====================

Conflicts:

Documentation/netlink/specs/netdev.yaml:
  839ff60df3 ("net: page_pool: add nlspec for basic access to page pools")
  48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/

While at it also regen, tree is dirty after:
  48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.

Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30 16:58:42 -08:00
Jakub Kicinski 975f2d73a9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30 16:11:19 -08:00
Linus Torvalds 6172a5180f Including fixes from bpf and wifi.
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and wifi.

  Current release - regressions:

   - neighbour: fix __randomize_layout crash in struct neighbour

   - r8169: fix deadlock on RTL8125 in jumbo mtu mode

  Previous releases - regressions:

   - wifi:
       - mac80211: fix warning at station removal time
       - cfg80211: fix CQM for non-range use

   - tools: ynl-gen: fix unexpected response handling

   - octeontx2-af: fix possible buffer overflow

   - dpaa2: recycle the RX buffer only after all processing done

   - rswitch: fix missing dev_kfree_skb_any() in error path

  Previous releases - always broken:

   - ipv4: fix uaf issue when receiving igmp query packet

   - wifi: mac80211: fix debugfs deadlock at device removal time

   - bpf:
       - sockmap: af_unix stream sockets need to hold ref for pair sock
       - netdevsim: don't accept device bound programs

   - selftests: fix a char signedness issue

   - dsa: mv88e6xxx: fix marvell 6350 probe crash

   - octeontx2-pf: restore TC ingress police rules when interface is up

   - wangxun: fix memory leak on msix entry

   - ravb: keep reverse order of operations in ravb_remove()"

* tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
  net: ravb: Keep reverse order of operations in ravb_remove()
  net: ravb: Stop DMA in case of failures on ravb_open()
  net: ravb: Start TX queues after HW initialization succeeded
  net: ravb: Make write access to CXR35 first before accessing other EMAC registers
  net: ravb: Use pm_runtime_resume_and_get()
  net: ravb: Check return value of reset_control_deassert()
  net: libwx: fix memory leak on msix entry
  ice: Fix VF Reset paths when interface in a failed over aggregate
  bpf, sockmap: Add af_unix test with both sockets in map
  bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
  tools: ynl-gen: always construct struct ynl_req_state
  ethtool: don't propagate EOPNOTSUPP from dumps
  ravb: Fix races between ravb_tx_timeout_work() and net related ops
  r8169: prevent potential deadlock in rtl8169_close
  r8169: fix deadlock on RTL8125 in jumbo mtu mode
  neighbour: Fix __randomize_layout crash in struct neighbour
  octeontx2-pf: Restore TC ingress police rules when interface is up
  octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64
  net: stmmac: xgmac: Disable FPE MMC interrupts
  octeontx2-af: Fix possible buffer overflow
  ...
2023-12-01 08:24:46 +09:00
Peter Zijlstra 5431fdd2c1 ptrace: Convert ptrace_attach() to use lock guards
Created as testing for the conditional guard infrastructure.
Specifically this makes use of the following form:

  scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
		     &task->signal->cred_guard_mutex) {
    ...
  }
  ...
  return 0;
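
The mechanism behind scope-based guards can be sketched in user space with the GCC/Clang `cleanup` variable attribute. This is only an illustration under assumptions: `struct toy_lock`, `toy_trylock()`, `toy_guard_release()`, `TOY_GUARD` and `do_work()` are hypothetical, and the toy lock is a plain flag rather than a real mutex.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy lock for illustration; not a real mutex. */
struct toy_lock {
	bool held;
};

static bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

/* Cleanup handler: runs when the guard variable leaves scope. */
static void toy_guard_release(struct toy_lock **lp)
{
	if (*lp)
		(*lp)->held = false;
}

#define TOY_GUARD __attribute__((cleanup(toy_guard_release)))

/* Returns -1 if the lock could not be taken, else 0 with the lock released. */
static int do_work(struct toy_lock *l, int *counter)
{
	struct toy_lock *guard TOY_GUARD = toy_trylock(l) ? l : NULL;

	if (!guard)
		return -1;	/* conditional-guard failure path */
	(*counter)++;
	return 0;		/* lock dropped automatically here */
}
```

The failure path leaves the guard pointer NULL, so the cleanup handler does nothing and a lock held by someone else is not released by mistake.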

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20231102110706.568467727%40infradead.org
2023-11-29 15:43:54 +01:00
Elliot Berman 418146e398 freezer,sched: Clean saved_state when restoring it during thaw
Clean saved_state after using it during thaw. Cleaning the saved_state
allows us to avoid some unnecessary branches in ttwu_state_match.

Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231120-freezer-state-multiple-thaws-v1-2-f2e1dd7ce5a2@quicinc.com
2023-11-29 15:43:53 +01:00
Yiwei Lin 5068d84054 sched/fair: Update min_vruntime for reweight_entity() correctly
Since reweight_entity() may change the weight of the
cfs_rq->curr entity, we should also call update_min_vruntime() in
that case.

Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Yiwei Lin <s921975628@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lore.kernel.org/r/20231117080106.12890-1-s921975628@gmail.com
2023-11-29 15:43:52 +01:00
Peter Zijlstra 382c27f4ed perf: Fix perf_event_validate_size()
Budimir noted that perf_event_validate_size() only checks the size of
the newly added event, even though the sizes of all existing events
can also change due to not all events having the same read_format.

When we attach the new event, perf_group_attach(), we do re-compute
the size for all events.

Fixes: a723968c0e ("perf: Fix u16 overflows")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-29 15:43:50 +01:00
Elliot Berman 23ab79e8e4 freezer,sched: Do not restore saved_state of a thawed task
It is possible for a task to be thawed multiple times when mixing the
*legacy* cgroup freezer and system-wide freezer. To do this, freeze the
cgroup, do system-wide freeze/thaw, then thaw the cgroup. When this
happens, then a stale saved_state can be written to the task's state
and cause task to hang indefinitely. Fix this by only trying to thaw
tasks that are actually frozen.

This change also has the marginal benefit of avoiding an unnecessary
wake_up_state(p, TASK_FROZEN) if we know the task is already thawed.
There is no possibility of a time-of-compare/time-of-use race when we skip
the wake_up_state() because entering/exiting TASK_FROZEN is guarded by
freezer_lock.

Fixes: 8f0eed4a78 ("freezer,sched: Use saved_state to reduce some spurious wakeups")
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abhijeet Dharmapurikar <quic_adharmap@quicinc.com>
Link: https://lore.kernel.org/r/20231120-freezer-state-multiple-thaws-v1-1-f2e1dd7ce5a2@quicinc.com
2023-11-29 15:43:48 +01:00
Jiri Olsa e56fdbfb06 bpf: Add link_info support for uprobe multi link
Adding support to get uprobe_link details through bpf_link_info
interface.

Adding new struct uprobe_multi to struct bpf_link_info to carry
the uprobe_multi link details.

The uprobe_multi.count is passed from user space to denote size
of array fields (offsets/ref_ctr_offsets/cookies). The actual
array size is stored back to uprobe_multi.count (allowing user
to find out the actual array size) and array fields are populated
up to the user passed size.

All the non-array fields (path/count/flags/pid) are always set.
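
The count handshake described above can be sketched as a plain C function. `fill_offsets()` is a hypothetical helper for illustration only; it uses ordinary memory copies where the kernel would copy to user space.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy up to *count entries of 'src' into 'dst', then store the real
 * number of entries back into *count so the caller can size a retry.
 * Returns the number of entries actually copied.
 */
static size_t fill_offsets(uint64_t *dst, uint32_t *count,
			   const uint64_t *src, uint32_t actual)
{
	uint32_t n = *count < actual ? *count : actual;

	for (uint32_t i = 0; i < n; i++)
		dst[i] = src[i];
	*count = actual;
	return n;
}
```

A caller that passes a buffer of two entries for a link with three uprobes gets two entries filled and reads back a count of three, which tells it how large a buffer to pass on the next call.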

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20231125193130.834322-4-jolsa@kernel.org
2023-11-28 21:50:09 -08:00
Jiri Olsa 4930b7f53a bpf: Store ref_ctr_offsets values in bpf_uprobe array
We will need to return ref_ctr_offsets values through the link_info
interface in a following change, so we need to keep them around.

Storing ref_ctr_offsets values directly into bpf_uprobe array.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20231125193130.834322-3-jolsa@kernel.org
2023-11-28 21:50:09 -08:00
Tim Van Patten cff5f49d43 cgroup_freezer: cgroup_freezing: Check if not frozen
__thaw_task() was recently updated to warn if the task being thawed was
part of a freezer cgroup that is still currently freezing:

	void __thaw_task(struct task_struct *p)
	{
	...
		if (WARN_ON_ONCE(freezing(p)))
			goto unlock;

This has exposed a bug in cgroup1 freezing where when CGROUP_FROZEN is
asserted, the CGROUP_FREEZING bits are not also cleared at the same
time. Meaning, when a cgroup is marked FROZEN it continues to be marked
FREEZING as well. This causes the WARNING to trigger, because
cgroup_freezing() thinks the cgroup is still freezing.

There are two ways to fix this:

1. Whenever FROZEN is set, clear FREEZING for the cgroup and all
children cgroups.
2. Update cgroup_freezing() to also verify that FROZEN is not set.

This patch implements option (2), since it's smaller and more
straightforward.
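
Option (2) can be sketched with plain bit flags. The flag names and values below are hypothetical and chosen only for this illustration; what matters is that the FREEZING bits may remain set once FROZEN is asserted.

```c
#include <stdbool.h>

/* Hypothetical flag values; only the relationships between them matter here. */
enum {
	SK_FREEZING_SELF   = 1 << 0,
	SK_FREEZING_PARENT = 1 << 1,
	SK_FROZEN          = 1 << 2,
	SK_FREEZING        = SK_FREEZING_SELF | SK_FREEZING_PARENT,
};

/* Old behaviour: any FREEZING bit counts, even once FROZEN is set. */
static bool freezing_old(unsigned int state)
{
	return state & SK_FREEZING;
}

/* Option (2): still freezing only while FROZEN has not been reached. */
static bool freezing_new(unsigned int state)
{
	return (state & SK_FREEZING) && !(state & SK_FROZEN);
}
```

For a state with both a FREEZING bit and FROZEN set, the old check still reports "freezing" and would trip the warning, while the new check does not.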

Signed-off-by: Tim Van Patten <timvp@google.com>
Tested-by: Mark Hasemeyer <markhas@chromium.org>
Fixes: f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
Cc: stable@vger.kernel.org # v6.1+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-28 08:04:03 -10:00
Waiman Long 877c737db9 cgroup/cpuset: Expose cpuset.cpus.isolated
The root-only cpuset.cpus.isolated control file shows the current set
of isolated CPUs in isolated partitions. This control file is currently
exposed only with the cgroup_debug boot command line option which also
adds the ".__DEBUG__." prefix. This is actually a useful control file if
users want to find out which CPUs are currently in an isolated state by
the cpuset controller. Remove CFTYPE_DEBUG flag for this control file and
make it available by default without any prefix.

The test_cpuset_prs.sh test script and the cgroup-v2.rst documentation
file are also updated accordingly. Minor code change is also made in
test_cpuset_prs.sh to avoid false test failure when running on debug
kernel.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-28 06:45:11 -10:00
Christian Brauner 783822e445 mnt_idmapping: decouple from namespaces
There's no reason we need to couple mnt idmapping to namespaces in the
way we currently do. Copy the idmapping when an idmapped mount is
created and don't take any reference on the namespace at all.

We also can't easily refcount struct uid_gid_map because it needs to
stay the size of a cacheline otherwise we risk performance regressions
(Ignoring for a second that right now struct uid_gid_map isn't actually
 64 byte but 72 but that's a fix for another patch series.).

Link: https://lore.kernel.org/r/20231122-vfs-mnt_idmap-v1-3-dae4abdde5bd@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28 14:08:47 +01:00
Hou Tao 75a442581d bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()
bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no
free object in free list, but it doesn't initialize the allocation hint
for the returned pointer. It may lead to bad memory dereference when
freeing the pointer, so fix it by initializing the allocation hint.

Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231111043821.2258513-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-26 18:00:26 -08:00
Linus Torvalds 1d0dbc3d16 Fix lockdep block chain corruption resulting in KASAN warnings.
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'locking-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix lockdep block chain corruption resulting in KASAN warnings"

* tag 'locking-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix block chain corruption
2023-11-26 08:30:11 -08:00
Peter Zijlstra bca4104b00 lockdep: Fix block chain corruption
Kent reported an occasional KASAN splat in lockdep. Mark then noted:

> I suspect the dodgy access is to chain_block_buckets[-1], which hits the last 4
> bytes of the redzone and gets (incorrectly/misleadingly) attributed to
> nr_large_chain_blocks.

That would mean @size == 0, at which point size_to_bucket() returns -1
and the above happens.

alloc_chain_hlocks() has 'size - req', the first instance with the
precondition 'size >= req', which allows the 0.

This code is trying to split a block, del_chain_block() takes what we
need, and add_chain_block() puts back the remainder, except in the
above case the remainder is 0 sized and things go sideways.
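
The failure mode can be reproduced in a toy model. The sketch below is not the lockdep code: the bucket bookkeeping is reduced to per-bucket counters, split_chain_block() is a hypothetical helper, and guarding the put-back with 'size > req' is one plausible way to avoid the zero-sized remainder, not necessarily the exact form of the fix.

```c
#include <assert.h>

#define NR_BUCKETS 16

/* Toy free-list accounting: number of free blocks per bucket. */
static int bucket_count[NR_BUCKETS];

/* As described above: a size of 0 maps to bucket -1. */
static int size_to_bucket(int size)
{
	return size - 1;
}

static void add_chain_block(int size)
{
	int bucket = size_to_bucket(size);

	/* A zero-sized block would index bucket_count[-1]. */
	assert(bucket >= 0 && bucket < NR_BUCKETS);
	bucket_count[bucket]++;
}

static void del_chain_block(int size)
{
	int bucket = size_to_bucket(size);

	assert(bucket >= 0 && bucket < NR_BUCKETS);
	assert(bucket_count[bucket] > 0);
	bucket_count[bucket]--;
}

/* Take req entries out of a free block of the given size; returns the remainder. */
static int split_chain_block(int size, int req)
{
	assert(size >= req);
	del_chain_block(size);
	if (size > req)		/* nothing to put back when the block is used up */
		add_chain_block(size - req);
	return size - req;
}
```

Without the 'size > req' guard, an exact-fit request (size == req) would call add_chain_block(0) and trip the bounds assertion, which mirrors the out-of-bounds access in the report.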

Fixes: 810507fe6f ("locking/lockdep: Reuse freed chain_hlocks entries")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lkml.kernel.org/r/20231121114126.GH8262@noisy.programming.kicks-ass.net
2023-11-24 11:04:54 +01:00
Andrii Nakryiko 2afae08c9d bpf: Validate global subprogs lazily
Slightly change BPF verifier logic around eagerness and order of global
subprog validation. Instead of going over every global subprog eagerly
and validating it before main (entry) BPF program is verified, turn it
around. Validate main program first, mark subprogs that were called from
main program for later verification, but otherwise assume it is valid.
Afterwards, go over marked global subprogs and validate those,
potentially marking some more global functions as being called. Continue
this process until all (transitively) callable global subprogs are
validated. It's a BFS traversal at its heart and will always converge.
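
The traversal can be pictured with a small model. Everything here is hypothetical (struct prog_model, validate_one(), validate_lazily()); the real verifier walks BPF instructions rather than a call matrix, and this sketch only shows the mark-then-validate loop and why it terminates.

```c
#include <stdbool.h>

#define MAX_SUBPROGS 8

/* Toy call graph: calls[i][j] is true if subprog i calls global subprog j. */
struct prog_model {
	int nr_subprogs;			/* index 0 is the main program */
	bool calls[MAX_SUBPROGS][MAX_SUBPROGS];
	bool called[MAX_SUBPROGS];		/* marked for later validation */
	bool validated[MAX_SUBPROGS];
};

/* Stand-in for verifying one body: mark every global subprog it calls. */
static void validate_one(struct prog_model *p, int idx)
{
	for (int j = 0; j < p->nr_subprogs; j++)
		if (p->calls[idx][j])
			p->called[j] = true;
	p->validated[idx] = true;
}

/*
 * Validate main first, then keep validating marked subprogs until no
 * unvalidated marked subprog remains. Returns how many bodies were
 * validated. Each body is validated at most once, so the loop ends.
 */
static int validate_lazily(struct prog_model *p)
{
	int nr_validated = 0;
	bool progress = true;

	p->called[0] = true;	/* the main program is always reachable */
	while (progress) {
		progress = false;
		for (int i = 0; i < p->nr_subprogs; i++) {
			if (p->called[i] && !p->validated[i]) {
				validate_one(p, i);
				nr_validated++;
				progress = true;
			}
		}
	}
	return nr_validated;
}
```

A subprog that is never reached from main stays unvalidated in this model, which is the property that makes feature-gating possible.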

This is an important change because it allows feature-gating some
subprograms that might not be verifiable on some older kernels, depending
on the supported set of features.

E.g., at some point, global functions were allowed to accept a pointer
to memory whose size is identified by a user-provided type.
Unfortunately, older kernels don't support this feature. With BPF CO-RE
approach, the natural way would be to still compile BPF object file once
and guard calls to this global subprog with some CO-RE check or using
.rodata variables. That's what people do to guard usage of new helpers
or kfuncs, and any other new BPF-side feature that might be missing on
old kernels.

That's currently impossible to do with global subprogs, unfortunately,
because they are eagerly and unconditionally validated. This patch set
aims to change this, so that in the future when global funcs gain new
features, those can be guarded using BPF CO-RE techniques in the same
fashion as any other new kernel feature.

Two selftests had to be adjusted in sync with these changes.

test_global_func12 relied on eager global subprog validation failing
before main program failure is detected (unknown return value). Fix by
making sure that main program is always valid.

verifier_subprog_precision's parent_stack_slot_precise subtest relied on
verifier checkpointing heuristic to do a checkpoint at instruction #5,
but that's no longer true because we don't have enough jumps validated
before reaching insn #5 due to global subprogs being validated later.

Other than that, no changes, as one would expect.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-3-andrii@kernel.org
2023-11-24 10:40:06 +01:00
Andrii Nakryiko 491dd8edec bpf: Emit global subprog name in verifier logs
We have the name, instead of emitting just func#N to identify global
subprog, augment verifier log messages with actual function name to make
it more user-friendly.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-2-andrii@kernel.org
2023-11-24 10:40:06 +01:00
Jakub Kicinski 45c226dde7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  c9663f79cd ("ice: adjust switchdev rebuild path")
  7758017911 ("ice: restore timestamp configuration after device reset")
https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/

Adjacent changes:

kernel/bpf/verifier.c
  bb124da69c ("bpf: keep track of max number of bpf_loop callback iterations")
  5f99f312bd ("bpf: add register bounds sanity checks and sanitization")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-23 12:20:58 -08:00
Linus Torvalds d3fa86b1a7 Including fixes from bpf.
Current release - regressions:
 
  - Revert "net: r8169: Disable multicast filter for RTL8168H
    and RTL8107E"
 
  - kselftest: rtnetlink: fix ip route command typo
 
 Current release - new code bugs:
 
  - s390/ism: make sure ism driver implies smc protocol in kconfig
 
  - two build fixes for tools/net
 
 Previous releases - regressions:
 
  - rxrpc: couple of ACK/PING/RTT handling fixes
 
 Previous releases - always broken:
 
  - bpf: verify bpf_loop() callbacks as if they are called unknown
    number of times
 
  - improve stability of auto-bonding with Hyper-V
 
  - account BPF-neigh-redirected traffic in interface statistics
 
 Misc:
 
  - net: fill in some more MODULE_DESCRIPTION()s
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf.

  Current release - regressions:

   - Revert "net: r8169: Disable multicast filter for RTL8168H and
     RTL8107E"

   - kselftest: rtnetlink: fix ip route command typo

  Current release - new code bugs:

   - s390/ism: make sure ism driver implies smc protocol in kconfig

   - two build fixes for tools/net

  Previous releases - regressions:

   - rxrpc: couple of ACK/PING/RTT handling fixes

  Previous releases - always broken:

   - bpf: verify bpf_loop() callbacks as if they are called unknown
     number of times

   - improve stability of auto-bonding with Hyper-V

   - account BPF-neigh-redirected traffic in interface statistics

  Misc:

   - net: fill in some more MODULE_DESCRIPTION()s"

* tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  tools: ynl: fix duplicate op name in devlink
  tools: ynl: fix header path for nfsd
  net: ipa: fix one GSI register field width
  tls: fix NULL deref on tls_sw_splice_eof() with empty record
  net: axienet: Fix check for partial TX checksum
  vsock/test: fix SEQPACKET message bounds test
  i40e: Fix adding unsupported cloud filters
  ice: restore timestamp configuration after device reset
  ice: unify logic for programming PFINT_TSYN_MSK
  ice: remove ptp_tx ring parameter flag
  amd-xgbe: propagate the correct speed and duplex status
  amd-xgbe: handle the corner-case during tx completion
  amd-xgbe: handle corner-case during sfp hotplug
  net: veth: fix ethtool stats reporting
  octeontx2-pf: Fix ntuple rule creation to direct packet to VF with higher Rx queue than its PF
  net: usb: qmi_wwan: claim interface 4 for ZTE MF290
  Revert "net: r8169: Disable multicast filter for RTL8168H and RTL8107E"
  net/smc: avoid data corruption caused by decline
  nfc: virtual_ncidev: Add variable to check if ndev is running
  dpll: Fix potential msg memleak when genlmsg_put_reply failed
  ...
2023-11-23 10:40:13 -08:00
Peter Zijlstra 388a1fb7da perf: Fix the nr_addr_filters fix
Thomas reported that commit 652ffc2104 ("perf/core: Fix narrow
startup race when creating the perf nr_addr_filters sysfs file") made
the entire attribute group vanish, instead of only the nr_addr_filters
attribute.

Additionally a stray return.

Insufficient coffee was involved with both writing and merging the
patch.

Fixes: 652ffc2104 ("perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file")
Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Link: https://lkml.kernel.org/r/20231122100756.GP8262@noisy.programming.kicks-ass.net
2023-11-23 16:08:31 +01:00
Vincent Guittot f12560779f sched/cpufreq: Rework iowait boost
Use the max value that has already been computed inside sugov_get_util()
to cap the iowait boost and remove dependency with uclamp_rq_util_with()
which is not used anymore.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-3-vincent.guittot@linaro.org
2023-11-23 11:32:02 +01:00
Vincent Guittot 9c0b4bb7f6 sched/cpufreq: Rework schedutil governor performance estimation
The current method to take into account uclamp hints when estimating the
target frequency can end in a situation where the selected target
frequency is finally higher than uclamp hints, whereas there are no real
needs. Such cases mainly happen because we are currently mixing the
traditional scheduler utilization signal with the uclamp performance
hints. By adding these 2 metrics, we lose important information when
it comes to selecting the target frequency, and we have to make some
assumptions which can't fit all cases.

Rework the interface between the scheduler and schedutil governor in order
to propagate all information down to the cpufreq governor.

effective_cpu_util() interface changes and now returns the actual
utilization of the CPU with 2 optional inputs:

- The minimum performance for this CPU; typically the capacity to handle
  the deadline task and the interrupt pressure. But also uclamp_min
  request when available.

- The maximum targeting performance for this CPU which reflects the
  maximum level that we would like to not exceed. By default it will be
  the CPU capacity but can be reduced because of some performance hints
  set with uclamp. The value can be lower than actual utilization and/or
  min performance level.

A new sugov_effective_cpu_perf() interface is also available to compute
the final performance level that is targeted for the CPU, after applying
some cpufreq headroom and taking into account all inputs.
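
As a rough illustration of how the three inputs might combine, consider the sketch below. The function names and the 25% headroom are assumptions for the example; this is not the body of sugov_effective_cpu_perf().

```c
/* Hypothetical headroom policy: add 25% on top of the actual utilization. */
static unsigned long add_headroom(unsigned long util)
{
	return util + (util >> 2);
}

/*
 * Sketch of combining the actual utilization with the minimum
 * performance to guarantee and the maximum we would like not to exceed.
 */
static unsigned long effective_perf(unsigned long actual,
				    unsigned long min_perf,
				    unsigned long max_perf)
{
	unsigned long target = add_headroom(actual);

	if (target > max_perf)	/* do not target more than the max hint */
		target = max_perf;
	if (target < min_perf)	/* but always honour the minimum */
		target = min_perf;
	return target;
}
```

Because the maximum may be lower than the minimum performance level, the minimum is applied last in this sketch, so a deadline or interrupt floor wins over a low uclamp cap.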

With these 2 functions, schedutil is now able to decide when it must go
above uclamp hints. It now also has a generic way to get the min
performance level.

The dependency between energy model and cpufreq governor and its headroom
policy doesn't exist anymore.

eenv_pd_max_util() asks schedutil for the targeted performance after
applying the impact of the waking task.

[ mingo: Refined the changelog & C comments. ]

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org
2023-11-23 11:32:01 +01:00
Vincent Guittot 50181c0cff sched/pelt: Avoid underestimation of task utilization
Lukasz Luba reported that a thread's util_est can significantly decrease as
a result of sharing the CPU with other threads.

The use case can be easily reproduced with a periodic task TA that runs 1ms
and sleeps 100us. When the task is alone on the CPU, its max utilization and
its util_est are around 888. If another similar task starts to run on the
same CPU, TA will have to share the CPU runtime and its maximum utilization
will decrease to around half the CPU capacity (512). TA's util_est will then
follow this new maximum trend, which is only the result of sharing the CPU
with other tasks.

Such a situation can be detected with runnable_avg, which is close or
equal to util_avg when TA is alone, but increases above util_avg when TA
shares the CPU with other threads and waits on the runqueue.

[ We prefer a util_est that overestimates rather than underestimates:
  with an underestimate we will not provide enough performance to the
  task, which will remain under-provisioned, whereas with an overestimate
  we will create some idle time, which will reduce contention and, as a
  result, reduce the util_est. So the overestimate will be transient
  whereas the underestimate will remain. ]
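
The detection idea can be sketched as below. The helper name update_util_est() and the UTIL_MARGIN value are hypothetical, and the sketch deliberately leaves out the averaging that a real estimator would apply; it only shows the decision of when a decrease is accepted.

```c
/* Hypothetical margin below which the two signals are treated as equal. */
#define UTIL_MARGIN 10UL

/*
 * Sketch: only let the estimate follow a lower utilization when the task
 * did not have to wait on the runqueue, i.e. runnable_avg is close to
 * util_avg.
 */
static unsigned long update_util_est(unsigned long util_est,
				     unsigned long util_avg,
				     unsigned long runnable_avg)
{
	if (util_avg >= util_est)
		return util_avg;	/* utilization grew: follow it */
	if (runnable_avg > util_avg + UTIL_MARGIN)
		return util_est;	/* CPU was shared: keep the old estimate */
	return util_avg;		/* task ran alone: accept the decrease */
}
```

In the TA example, a drop from 888 to 512 with runnable_avg well above 512 would leave the estimate at 888.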

[ mingo: Refined the changelog, added comments from the LKML discussion. ]

Reported-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
Link: https://lore.kernel.org/r/20231122140119.472110-1-vincent.guittot@linaro.org
Cc: Hongyan Xia <hongyan.xia2@arm.com>
2023-11-23 11:24:28 +01:00
Zqiang 90f1015dfe rcutorture: Add fqs_holdoff check before fqs_task is created
For rcutorture tests on RCU implementations that support
force-quiescent-state operations and that set the fqs_duration module
parameter greater than zero, the fqs_task kthread will be created.
However, if the fqs_holdoff module parameter is not set, then its default
value of zero will cause fqs_task to enter a long-term busy loop until
stopped by kthread_stop().  This commit therefore adds a fqs_holdoff
check before the fqs_task is created, making sure that whenever the
fqs_task is created, the fqs_holdoff will be greater than zero.
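
The added condition amounts to something like the following sketch. The helper should_create_fqs_task() is hypothetical, and clamping negative parameters to zero is an assumption of the example rather than something stated above.

```c
#include <stdbool.h>

/* Hypothetical helper: decide whether the fqs kthread should be started. */
static bool should_create_fqs_task(int fqs_duration, int fqs_holdoff)
{
	/* Negative module parameters are treated as zero in this sketch. */
	if (fqs_duration < 0)
		fqs_duration = 0;
	if (fqs_holdoff < 0)
		fqs_holdoff = 0;

	/* A zero holdoff would make the kthread spin, so require both. */
	return fqs_duration > 0 && fqs_holdoff > 0;
}
```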

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-11-23 11:58:18 +05:30
Paul E. McKenney 69dcbbd804 locktorture: Increase Hamming distance between call_rcu_chain and rcu_call_chains
One letter difference is really not enough, so this commit changes
call_rcu_chain to call_rcu_chain_list.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-11-23 11:58:18 +05:30
Tejun Heo 2025956639 Merge branch 'for-6.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-6.8
cgroup/for-6.8 is carrying two workqueue changes to allow cpuset to restrict
the CPUs used by unbound workqueues. Unfortunately, this conflicts with a
new bug fix in wq/for-6.7-fixes. The conflict is contextual but can be a bit
confusing to resolve. Pull the fix branch to resolve the conflict.

Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-22 06:18:49 -10:00
Tejun Heo 4a6c5607d4 workqueue: Make sure that wq_unbound_cpumask is never empty
During boot, depending on how the housekeeping and workqueue.unbound_cpus
masks are set, wq_unbound_cpumask can end up empty. Since 8639ecebc9
("workqueue: Implement non-strict affinity scope for unbound workqueues"),
this may end up feeding -1 as a CPU number into scheduler leading to oopses.

  BUG: unable to handle page fault for address: ffffffff8305e9c0
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  ...
  Call Trace:
   <TASK>
   select_idle_sibling+0x79/0xaf0
   select_task_rq_fair+0x1cb/0x7b0
   try_to_wake_up+0x29c/0x5c0
   wake_up_process+0x19/0x20
   kick_pool+0x5e/0xb0
   __queue_work+0x119/0x430
   queue_work_on+0x29/0x30
  ...

An empty wq_unbound_cpumask is a clear misconfiguration and already
disallowed once system is booted up. Let's warn on and ignore
unbound_cpumask restrictions which lead to no unbound cpus. While at it,
also remove the now unnecessary empty check on wq_unbound_cpumask in
wq_select_unbound_cpu().
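
The "warn on and ignore" behaviour can be illustrated with a plain 64-bit word standing in for a cpumask. The helper apply_cpumask_restriction() and its message text are made up for this sketch and do not reproduce the workqueue code.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: apply a restriction to the current mask, but ignore it (with a
 * warning) if the result would contain no CPUs at all.
 */
static uint64_t apply_cpumask_restriction(uint64_t cur, uint64_t restriction,
					  const char *name)
{
	uint64_t result = cur & restriction;

	if (result == 0) {
		fprintf(stderr,
			"restricting unbound cpumask with %s leaves no CPUs, ignoring\n",
			name);
		return cur;
	}
	return result;
}
```

Since the returned mask is never empty as long as the starting mask is not, a later CPU selection from it cannot come back with -1.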

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Yong He <alexyonghe@tencent.com>
Link: http://lkml.kernel.org/r/20231120121623.119780-1-alexyonghe@tencent.com
Fixes: 8639ecebc9 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Cc: stable@vger.kernel.org # v6.6+
Reviewed-by: Waiman Long <longman@redhat.com>
2023-11-22 06:17:26 -10:00
Arnd Bergmann a89299c409 time: Make sysfs_get_uname() function visible in header
This function is defined globally in clocksource.c and used conditionally
in clockevent.c, with the declaration hidden when clockevent support
is disabled. This causes a harmless warning in the definition:

kernel/time/clocksource.c:1324:9: warning: no previous prototype for 'sysfs_get_uname' [-Wmissing-prototypes]
 1324 | ssize_t sysfs_get_uname(const char *buf, char *dst, size_t cnt)

Move the declaration out of the #ifdef so it is always visible.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20231108125843.3806765-5-arnd@kernel.org
2023-11-22 14:12:10 +01:00
Jakub Kicinski 53475287da bpf-next-for-netdev

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-21

We've added 85 non-merge commits during the last 12 day(s) which contain
a total of 63 files changed, 4464 insertions(+), 1484 deletions(-).

The main changes are:

1) Huge batch of verifier changes to improve BPF register bounds logic
   and range support along with a large test suite, and verifier log
   improvements, all from Andrii Nakryiko.

2) Add a new kfunc which acquires the associated cgroup of a task within
   a specific cgroup v1 hierarchy where the latter is identified by its id,
   from Yafang Shao.

3) Extend verifier to allow bpf_refcount_acquire() of a map value field
   obtained via direct load which is a use-case needed in sched_ext,
   from Dave Marchevsky.

4) Fix bpf_get_task_stack() helper to add the correct crosstask check
   for the get_perf_callchain(), from Jordan Rome.

5) Fix BPF task_iter internals where lockless usage of next_thread()
   was wrong. The rework also simplifies the code, from Oleg Nesterov.

6) Fix uninitialized tail padding via LIBBPF_OPTS_RESET, and another
   fix for certain BPF UAPI structs to fix verifier failures seen
   in bpf_dynptr usage, from Yonghong Song.

7) Add BPF selftest fixes for map_percpu_stats flakes due to per-CPU BPF
   memory allocator not being able to allocate per-CPU pointer successfully,
   from Hou Tao.

8) Add prep work around dynptr and string handling for kfuncs which
   is later going to be used by file verification via BPF LSM and fsverity,
   from Song Liu.

9) Improve BPF selftests to update multiple prog_tests to use ASSERT_*
   macros, from Yuran Pereira.

10) Optimize LPM trie lookup to check prefixlen before walking the trie,
    from Florian Lehner.

11) Consolidate virtio/9p configs from BPF selftests in config.vm file
    given they are needed consistently across archs, from Manu Bretelle.

12) Small BPF verifier refactor to remove register_is_const(),
    from Shung-Hsi Yu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
  selftests/bpf: reduce verboseness of reg_bounds selftest logs
  bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
  bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
  bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
  bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
  bpf: smarter verifier log number printing logic
  bpf: omit default off=0 and imm=0 in register state log
  bpf: emit map name in register state if applicable and available
  bpf: print spilled register state in stack slot
  bpf: extract register state printing
  bpf: move verifier state printing code to kernel/bpf/log.c
  bpf: move verbose_linfo() into kernel/bpf/log.c
  bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
  bpf: Remove test for MOVSX32 with offset=32
  selftests/bpf: add iter test requiring range x range logic
  veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag
  ...
====================

Link: https://lore.kernel.org/r/20231122000500.28126-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21 17:53:20 -08:00
Waiman Long 49277a5b76 workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
Commit fe28f631fa ("workqueue: Add workqueue_unbound_exclude_cpumask()
to exclude CPUs from wq_unbound_cpumask") makes
workqueue_set_unbound_cpumask() static as it is not used elsewhere in
the kernel. However, this triggers a kernel test robot warning about
'workqueue_set_unbound_cpumask' defined but not used when CONFIG_SYSFS
isn't defined. It happens that workqueue_set_unbound_cpumask() is only
called when CONFIG_SYSFS is defined.

Move workqueue_set_unbound_cpumask() and its helpers inside the
CONFIG_SYSFS compilation block to avoid the warning. There is no
functional change.

Fixes: fe28f631fa ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311130831.uh0AoCd1-lkp@intel.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-21 06:19:12 -10:00
Eduard Zingerman bb124da69c bpf: keep track of max number of bpf_loop callback iterations
In some cases verifier can't infer convergence of the bpf_loop()
iteration. E.g. for the following program:

    static int cb(__u32 idx, struct num_context* ctx)
    {
        ctx->i++;
        return 0;
    }

    SEC("?raw_tp")
    int prog(void *_)
    {
        struct num_context ctx = { .i = 0 };
        __u8 choice_arr[2] = { 0, 1 };

        bpf_loop(2, cb, &ctx, 0);
        return choice_arr[ctx.i];
    }

Each 'cb' simulation would eventually return to 'prog' and reach
'return choice_arr[ctx.i]' statement. At which point ctx.i would be
marked precise, thus forcing verifier to track multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.

This commit allows "brute force" handling for such cases by limiting
number of callback body simulations using 'umax' value of the first
bpf_loop() parameter.

For this, extend bpf_func_state with 'callback_depth' field.
Increment this field when callback visiting state is pushed to states
traversal stack. For frame #N its 'callback_depth' field counts how
many times callback with frame depth N+1 had been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:36:40 -08:00
Eduard Zingerman cafe2c2150 bpf: widening for callback iterators
Callbacks are similar to open coded iterators, so add imprecise
widening logic for callback body processing. This makes callback based
loops behave identically to open coded iterators, e.g. allowing to
verify programs like below:

  struct ctx { u32 i; };
  int cb(u32 idx, struct ctx* ctx)
  {
          ++ctx->i;
          return 0;
  }
  ...
  struct ctx ctx = { .i = 0 };
  bpf_loop(100, cb, &ctx, 0);
  ...

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:36:40 -08:00
Eduard Zingerman ab5cfac139 bpf: verify callbacks as if they are called unknown number of times
Prior to this patch, callbacks were handled as regular function calls:
execution of the callback body was modeled exactly once.
This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback
  body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification
  upon BPF_EXIT;
- like calls to bpf_*_iter_next(), calls to callback-invoking functions
  are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when
  some identical parent state is found.

Paths with callback function invoked zero times are now verified first,
which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop
  calls to avoid previously masked unrelated error reports:
  - cb_refs.c:underflow_prog
  - exceptions_fail.c:reject_rbtree_add_throw
  - exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected
  log trace:
  - verifier_subprog_precision.c:callback_result_precise
    (note: r0 precision is no longer propagated inside callback and
           I think this is a correct behavior)
  - verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
  - verifier_subprog_precision.c:parent_stack_slot_precise_with_callback

Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:35:44 -08:00
Eduard Zingerman 58124a98cb bpf: extract setup_func_entry() utility function
Move code for simulated stack frame creation to a separate utility
function. This function would be used in the follow-up change for
callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:33:35 -08:00
Eduard Zingerman 683b96f960 bpf: extract __check_reg_arg() utility function
Split check_reg_arg() into two utility functions:
- check_reg_arg() operating on registers from current verifier state;
- __check_reg_arg() operating on a specific set of registers passed as
  a parameter;

The __check_reg_arg() function would be used by a follow-up change for
callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:33:35 -08:00
Linus Torvalds b0014556a2 - Do the push of pending hrtimers away from a CPU which is being
offlined earlier in the offlining process in order to prevent
   a deadlock

Merge tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Do the push of pending hrtimers away from a CPU which is being
   offlined earlier in the offlining process in order to prevent a
   deadlock

* tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimers: Push pending hrtimers away from outgoing CPU earlier
2023-11-19 13:35:07 -08:00
Linus Torvalds 2a0adc4954 - Fix virtual runtime calculation when recomputing a sched entity's
weights
 
 - Fix wrongly rejected unprivileged poll requests to the cgroup psi
   pressure files
 
 - Make sure the load balancing is done by only one CPU

Merge tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Fix virtual runtime calculation when recomputing a sched entity's
   weights

 - Fix wrongly rejected unprivileged poll requests to the cgroup psi
   pressure files

 - Make sure the load balancing is done by only one CPU

* tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix the decision for load balance
  sched: psi: fix unprivileged polling against cgroups
  sched/eevdf: Fix vruntime adjustment on reweight
2023-11-19 13:32:00 -08:00
Linus Torvalds 2f84f8232e - Fix a hardcoded futex flags case which led to one robust futex test
failure

Merge tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Fix a hardcoded futex flags case which led to one robust futex test
   failure

* tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Fix hardcoded flags
2023-11-19 13:30:21 -08:00
Linus Torvalds c8b3443cbd - Make sure the context refcount is transferred too when migrating perf
events

Merge tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Make sure the context refcount is transferred too when migrating perf
   events

* tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix cpuctx refcounting
2023-11-19 13:26:42 -08:00
Oleg Nesterov ac8148d957 bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
This looks more clear and simplifies the code. While at it, remove the
unnecessary initialization of pos/task at the start of bpf_iter_task_new().

Note that we can even kill kit->task, we can just use pos->group_leader,
but I don't understand the BUILD_BUG_ON() checks in bpf_iter_task_new().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163239.GA903@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Oleg Nesterov 5a34f9dabd bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.

bpf_iter_task_next() can loop forever, "kit->pos == kit->task" can never
happen if kit->pos execs. Change this code to use __next_thread().

With or without this change the usage of kit->pos/task and next_task()
doesn't look nice, see the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163237.GA897@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Oleg Nesterov 2d1618054f bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.

task_group_seq_get_next() can return the group leader twice if it races
with mt-thread exec which changes the group->leader's pid.

Change the main loop to use __next_thread(), kill "next_tid == common->pid"
check.

__next_thread() can't loop forever, we can also change this code to retry
if next_tid == 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163234.GA890@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Linus Torvalds 2254005ef1 parisc architecture fixes for kernel v6.7-rc2:
- Fix power soft-off on qemu
 - Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
   writeable stacks
 - Use strscpy instead of strlcpy in show_cpuinfo()

Merge tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc fixes from Helge Deller:
 "On parisc we still sometimes need writeable stacks, e.g. if programs
  aren't compiled with gcc-14. To avoid issues with the upcoming
  systemd-254 we therefore have to disable prctl(PR_SET_MDWE) for now
  (for parisc only).

  The other two patches are minor: a bugfix for the soft power-off on
  qemu with 64-bit kernel and prefer strscpy() over strlcpy():

   - Fix power soft-off on qemu

   - Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
     writeable stacks

   - Use strscpy instead of strlcpy in show_cpuinfo()"

* tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  prctl: Disable prctl(PR_SET_MDWE) on parisc
  parisc/power: Fix power soft-off when running on qemu
  parisc: Replace strlcpy() with strscpy()
2023-11-18 15:13:10 -08:00
Andrii Nakryiko 46862ee854 bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
It's possible to pass a pointer to a parent's stack to child subprogs. In
such a case the verifier state output is ambiguous, not showing whether
a register contains a pointer to the "current" stack, belonging to the current
subprog (frame), or is actually a pointer to one of the parent frames.

So emit this information if the register's frame number differs from that
of the state which the register is part of. E.g., if current state is in frame 2 and it has
a register pointing to stack in grand parent state (frame #0), we'll see
something like 'R1=fp[0]-16', while "local stack pointer" will be just
'R2=fp-16'.
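
The naming rule described above can be modelled in a few lines. This is a
minimal Python sketch, not the kernel's C code; the function name and its
parameters are hypothetical and only reproduce the two spellings quoted in
the message.

```python
def format_stack_ptr(reg_frameno, cur_frameno, off):
    """Render a PTR_TO_STACK register for a verifier-log-like line.

    Hypothetical helper: the frame index is spelled out only when the
    register points into a frame other than the one being printed.
    """
    frame = f"[{reg_frameno}]" if reg_frameno != cur_frameno else ""
    # Negative offsets carry their own '-' sign, giving e.g. "fp-16".
    return f"fp{frame}{off}"


# State is in frame 2; R1 points into frame 0, R2 into the current frame.
print("R1=" + format_stack_ptr(0, 2, -16))
print("R2=" + format_stack_ptr(2, 2, -16))
```

Keeping the bare "fp" spelling for the local frame leaves the common case
as short as before and adds the index only where it disambiguates.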

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 0f8dbdbc64 bpf: smarter verifier log number printing logic
Instead of always printing numbers as decimals (or, in some cases, like
for "imm=%llx", as hexadecimals), decide the form based on
actual values. For numbers in a reasonably small range (currently,
[0, U16_MAX] for unsigned values, and [S16_MIN, S16_MAX] for signed ones),
emit them as decimals. In all other cases, even for signed values,
emit them in hexadecimals.

For large values hex form is oftentimes way more useful: it's easier to
see an exact difference between 0xffffffff80000000 and 0xffffffff7fffffff,
than between 18446744071562067968 and 18446744071562067967, as one
particular example.

Small values representing small pointer offsets or application
constants, on the other hand, are way more useful to be represented in
decimal notation.

Adjust reg_bounds register state parsing logic to take into account this
change.
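
The selection rule can be illustrated with a small Python model. It is a
sketch under stated assumptions, not the kernel implementation: the helper
names are hypothetical, and showing out-of-range signed values as 64-bit
two's complement hex is an assumption about the exact output form.

```python
U16_MAX = 0xFFFF
S16_MIN, S16_MAX = -0x8000, 0x7FFF
U64_MASK = (1 << 64) - 1


def fmt_unsigned(value):
    """Decimal for [0, U16_MAX], hexadecimal for anything larger."""
    value &= U64_MASK
    return str(value) if value <= U16_MAX else hex(value)


def fmt_signed(value):
    """Decimal for [S16_MIN, S16_MAX], otherwise 64-bit two's complement hex."""
    if S16_MIN <= value <= S16_MAX:
        return str(value)
    return hex(value & U64_MASK)


print(fmt_unsigned(16))                  # small offset stays decimal
print(fmt_unsigned(0xFFFFFFFF80000000))  # large value switches to hex
print(fmt_signed(-8))                    # small negative stays decimal
print(fmt_signed(-(1 << 31)))            # large negative switches to hex
```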

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 1db747d75b bpf: omit default off=0 and imm=0 in register state log
Simplify BPF verifier log further by omitting default (and frequently
irrelevant) off=0 and imm=0 parts for non-SCALAR_VALUE registers. As can
be seen from fixed tests, this is often visual noise for PTR_TO_CTX
register and even for PTR_TO_PACKET registers.

Omitting default values follows the rest of register state logic: we
omit default values to keep verifier log succinct and to highlight
interesting state that deviates from default one. E.g., we do the same
for var_off, when it's unknown, which gives no additional information.
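
As an illustration of the "omit defaults" rule, here is a minimal Python
sketch. The function and the field dictionary are hypothetical stand-ins
for the verifier's register state; only the off/imm defaults named in the
message are modelled.

```python
DEFAULTS = {"off": 0, "imm": 0}  # fields dropped when they hold these values


def format_reg(type_name, fields):
    """Print 'type(k=v,...)' while skipping fields that hold a default."""
    shown = [
        f"{key}={value}"
        for key, value in fields.items()
        if not (key in DEFAULTS and DEFAULTS[key] == value)
    ]
    return f"{type_name}({','.join(shown)})"


print(format_reg("map_ptr", {"map": "array_map", "ks": 4, "vs": 8, "off": 0, "imm": 0}))
print(format_reg("pkt", {"off": 14, "imm": 0}))
```

A non-default value such as off=14 is still printed, so the interesting
state stands out.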

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 0c95c9fdb6 bpf: emit map name in register state if applicable and available
In complicated real-world applications, whenever debugging some
verification error through verifier log, it often would be very useful
to see map name for PTR_TO_MAP_VALUE register. Usually this needs to be
inferred from key/value sizes and maybe trying to guess C code location,
but it's not always clear.

Given verifier has the name, and it's never too long, let's just emit it
for ptr_to_map_key, ptr_to_map_value, and const_ptr_to_map registers. We
reshuffle the order a bit, so that map name, key size, and value size
appear before offset and immediate values, which seems like a more
logical order.

Current output:

  R1_w=map_ptr(map=array_map,ks=4,vs=8,off=0,imm=0)

But we'll get rid of useless off=0 and imm=0 parts in the next patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 67d43dfbb4 bpf: print spilled register state in stack slot
Print the same register state representation when printing stack state,
as we do for normal registers. Note that if stack slot contains
subregister spill (1, 2, or 4 byte long), we'll still emit "m0?" mask
for those bytes that are not part of spilled register.

This means we can get something like fp-8=0000scalar() for a 4-byte
spill with other 4 bytes still being STACK_ZERO.
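
A rough Python model of how such a string could be assembled is sketched
below. The slot-type names, the byte ordering of the list, and the helper
itself are assumptions made for illustration; only the mask characters
"m", "0" and "?" come from the message.

```python
# Mask characters for bytes that are not part of the spilled register.
SLOT_CHAR = {"MISC": "m", "ZERO": "0", "INVALID": "?"}


def format_spilled_slot(slot_types, reg_repr):
    """Prefix the register representation with one mask char per
    non-spill byte that precedes the spilled bytes in `slot_types`."""
    prefix = []
    for slot_type in slot_types:
        if slot_type == "SPILL":
            break
        prefix.append(SLOT_CHAR[slot_type])
    return "".join(prefix) + reg_repr


# 4-byte spill, remaining 4 bytes known to be zero.
print("fp-8=" + format_spilled_slot(["ZERO"] * 4 + ["SPILL"] * 4, "scalar()"))
# Full 8-byte spill: no mask characters at all.
print("fp-256=" + format_spilled_slot(["SPILL"] * 8, "ctx(off=0,imm=0)"))
```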

Some examples before and after, taken from the log of
pyperf_subprogs.bpf.o:

49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx
49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx(off=0,imm=0)

150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null
150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0)

5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=
5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=????scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))

While at it, do a few other simple clean ups:
  - skip slot if it's not scratched before detecting whether it's valid;
  - move taking spilled_reg pointer outside of switch (only DYNPTR has
    to adjust that to get to the "main" slot);
  - don't recalculate types_buf second time for MISC/ZERO/default case.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 009f5465be bpf: extract register state printing
Extract printing register state representation logic into a separate
helper, as we are going to reuse it for spilled register state printing
in the next patch. This also nicely reduces code nesting.

No functional changes.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 42feb6620a bpf: move verifier state printing code to kernel/bpf/log.c
Move a good chunk of code from verifier.c to log.c: verifier state
verbose printing logic. This is important and very much
logging/debugging-oriented code. It fits log.c's overall focus on
verifier logging, and moving it allows that code to keep growing without
unnecessarily adding to verifier.c, which otherwise contains the core
verification logic.

There are not many shared dependencies between this code and the rest of
verifier.c code, except a few single-line helpers for various register
type checks and a bit of state "scratching" helpers. We move all such
trivial helpers into include/bpf/bpf_verifier.h as static inlines.

No functional changes in this patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko db840d389b bpf: move verbose_linfo() into kernel/bpf/log.c
verifier.c is huge. Let's try to move out parts that are logging-related
into log.c, as we previously did with bpf_log() and other related stuff.
This patch moves line info verbose output routines: it's pretty
self-contained and isolated code, so there is no problem with this.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:58 -08:00
Helge Deller 793838138c prctl: Disable prctl(PR_SET_MDWE) on parisc
systemd-254 tries to use prctl(PR_SET_MDWE) for its MemoryDenyWriteExecute
functionality, but fails on parisc which still needs executable stacks in
certain combinations of gcc/glibc/kernel.

Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until
userspace has caught up.

Signed-off-by: Helge Deller <deller@gmx.de>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sam James <sam@gentoo.org>
Closes: https://github.com/systemd/systemd/issues/29775
Tested-by: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/
Cc: <stable@vger.kernel.org> # v6.3+
2023-11-18 19:35:31 +01:00
Randy Dunlap f73f6181eb userns: eliminate many kernel-doc warnings
Drop the kernel-doc "/**" notation from 8 structs or functions to
prevent 22 kernel-doc warnings (samples below).

user_namespace.c:239: warning: Function parameter or member 'map_up' not described in 'idmap_key'
user_namespace.c:246: warning: Function parameter or member 'k' not described in 'cmp_map_id'
user_namespace.c:277: warning: Function parameter or member 'extents' not described in 'map_id_range_down_max'
user_namespace.c:295: warning: Function parameter or member 'extents' not described in 'map_id_range_down_base'
user_namespace.c:344: warning: Function parameter or member 'extents' not described in 'map_id_up_base'
user_namespace.c:364: warning: Function parameter or member 'extents' not described in 'map_id_up_max'
user_namespace.c:776: warning: Function parameter or member 'map' not described in 'insert_extent'
user_namespace.c:844: warning: Function parameter or member 'map' not described in 'sort_idmaps'

Fixes: 6397fac491 ("userns: bump idmap limits to 340")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20230830163215.13193-1-rdunlap@infradead.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 15:00:01 +01:00
Andrii Nakryiko ff8867af01 bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral
BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0].

A few selftests and veristat need to be adjusted in the same patch as
well.

  [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-17 10:30:02 -08:00
Linus Torvalds bf786e2a78 audit/stable-6.7 PR 20231116

Merge tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "One small audit patch to convert a WARN_ON_ONCE() into a normal
  conditional to avoid scary looking console warnings when eBPF code
  generates audit records from unexpected places"

* tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
2023-11-17 08:42:05 -05:00
Linus Torvalds 7475e51b87 Including fixes from BPF and netfilter.
Current release - regressions:
 
  - core: fix undefined behavior in netdev name allocation
 
  - bpf: do not allocate percpu memory at init stage
 
  - netfilter: nf_tables: split async and sync catchall in two functions
 
  - mptcp: fix possible NULL pointer dereference on close
 
 Current release - new code bugs:
 
  - eth: ice: dpll: fix initial lock status of dpll
 
 Previous releases - regressions:
 
  - bpf: fix precision backtracking instruction iteration
 
  - af_unix: fix use-after-free in unix_stream_read_actor()
 
  - tipc: fix kernel-infoleak due to uninitialized TLV value
 
  - eth: bonding: stop the device in bond_setup_by_slave()
 
  - eth: mlx5:
    - fix double free of encap_header
    - avoid referencing skb after free-ing in drop path
 
  - eth: hns3: fix VF reset
 
  - eth: mvneta: fix calls to page_pool_get_stats
 
 Previous releases - always broken:
 
  - core: set SOCK_RCU_FREE before inserting socket into hashtable
 
  - bpf: fix control-flow graph checking in privileged mode
 
  - eth: ppp: limit MRU to 64K
 
  - eth: stmmac: avoid rx queue overrun
 
  - eth: icssg-prueth: fix error cleanup on failing initialization
 
  - eth: hns3: fix out-of-bounds access may occur when coalesce info is
  	      read via debugfs
 
  - eth: cortina: handle large frames
 
 Misc:
 
  - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from BPF and netfilter.

  Current release - regressions:

   - core: fix undefined behavior in netdev name allocation

   - bpf: do not allocate percpu memory at init stage

   - netfilter: nf_tables: split async and sync catchall in two
     functions

   - mptcp: fix possible NULL pointer dereference on close

  Current release - new code bugs:

   - eth: ice: dpll: fix initial lock status of dpll

  Previous releases - regressions:

   - bpf: fix precision backtracking instruction iteration

   - af_unix: fix use-after-free in unix_stream_read_actor()

   - tipc: fix kernel-infoleak due to uninitialized TLV value

   - eth: bonding: stop the device in bond_setup_by_slave()

   - eth: mlx5:
      - fix double free of encap_header
      - avoid referencing skb after free-ing in drop path

   - eth: hns3: fix VF reset

   - eth: mvneta: fix calls to page_pool_get_stats

  Previous releases - always broken:

   - core: set SOCK_RCU_FREE before inserting socket into hashtable

   - bpf: fix control-flow graph checking in privileged mode

   - eth: ppp: limit MRU to 64K

   - eth: stmmac: avoid rx queue overrun

   - eth: icssg-prueth: fix error cleanup on failing initialization

   - eth: hns3: fix out-of-bounds access may occur when coalesce info is
     read via debugfs

   - eth: cortina: handle large frames

  Misc:

   - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45"

* tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
  macvlan: Don't propagate promisc change to lower dev in passthru
  net: sched: do not offload flows with a helper in act_ct
  net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors
  net/mlx5e: Check return value of snprintf writing to fw_version buffer
  net/mlx5e: Reduce the size of icosq_str
  net/mlx5: Increase size of irq name buffer
  net/mlx5e: Update doorbell for port timestamping CQ before the software counter
  net/mlx5e: Track xmit submission to PTP WQ after populating metadata map
  net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe
  net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload
  net/mlx5e: Fix pedit endianness
  net/mlx5e: fix double free of encap_header in update funcs
  net/mlx5e: fix double free of encap_header
  net/mlx5: Decouple PHC .adjtime and .adjphase implementations
  net/mlx5: DR, Allow old devices to use multi destination FTE
  net/mlx5: Free used cpus mask when an IRQ is released
  Revert "net/mlx5: DR, Supporting inline WQE when possible"
  bpf: Do not allocate percpu memory at init stage
  net: Fix undefined behavior in netdev name allocation
  dt-bindings: net: ethernet-controller: Fix formatting error
  ...
2023-11-16 07:51:26 -05:00
Andrii Nakryiko cf5fe3c71c bpf: make __reg{32,64}_deduce_bounds logic more robust
This change doesn't seem to have any effect on selftests and production
BPF object files, but we preemptively try to make it more robust.

First, "learn sign from signed bounds" comment is misleading, as we are
learning not just sign, but also values.

Second, we simplify the check for determining whether entire range is
positive or negative similarly to other checks added earlier, using
appropriate u32/u64 cast and single comparisons. As explained in comments
in __reg64_deduce_bounds(), the checks are equivalent.

Last but not least, smin/smax and s32_min/s32_max reassignment based on
min/max of both umin/umax and smin/smax (and 32-bit equivalents) is hard
to explain and justify. We are updating unsigned bounds from signed
bounds, why would we update signed bounds at the same time? This might
be correct, but it's far from obvious why and the code or comments don't
try to justify this. Given we've added a separate deduction of signed
bounds from unsigned bounds earlier, this seems at least redundant, if
not just wrong.

In short, we remove doubtful pieces, and streamline the rest to follow
the logic and approach of the rest of reg_bounds_sync() checks.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko 3cf98cf594 bpf: remove redundant s{32,64} -> u{32,64} deduction logic
Equivalent checks were recently added in more succinct and, arguably,
safer form in:
  - f188765f23a5 ("bpf: derive smin32/smax32 from umin32/umax32 bounds");
  - 2e74aef782d3 ("bpf: derive smin/smax from umin/max bounds").

The checks we are removing in this patch set do similar checks to detect
if the entire u32/u64 range has the sign bit set or not set, but do it with
two separate checks.

Further, we forcefully overwrite either smin or smax (and 32-bit equivalents)
without applying normal min/max intersection logic. It's not clear why
that would be correct in all cases and seems to work by accident. This
logic is also "gated" by previous signed -> unsigned derivation, which
returns early.

All this is quite confusing and seems error-prone, while we already have
at least equivalent checks happening earlier. So remove this duplicate
and error-prone logic to simplify things a bit.
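
To make the comparison concrete, the following Python sketch models both
ways of asking "does the whole unsigned range share one sign bit?" and the
intersection-style update. It is an illustration under assumptions, not the
kernel code: the names are hypothetical and an 8-bit width is used so the
equivalence can be checked exhaustively (the verifier works on 32 and 64
bits).

```python
BITS = 8  # small width for exhaustive checking; an assumption of this sketch


def to_signed(value, bits=BITS):
    """Reinterpret an unsigned value as two's complement."""
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def same_sign_two_checks(umin, umax, bits=BITS):
    """Two separate tests: sign bit clear everywhere, or set everywhere."""
    sign = 1 << (bits - 1)
    return umax < sign or umin >= sign


def same_sign_single_check(umin, umax, bits=BITS):
    """One comparison after casting both bounds to signed."""
    return to_signed(umin, bits) <= to_signed(umax, bits)


def derive_signed(umin, umax, smin, smax, bits=BITS):
    """Intersect (rather than overwrite) signed bounds with unsigned ones."""
    if same_sign_single_check(umin, umax, bits):
        smin = max(smin, to_signed(umin, bits))
        smax = min(smax, to_signed(umax, bits))
    return smin, smax


print(derive_signed(0x80, 0xFF, -128, 127))  # whole range negative
print(derive_signed(1, 200, -128, 127))      # crosses the sign boundary
```

For every well-formed range (umin <= umax) the two predicates agree, which
is why the single-comparison form can stand in for the pair of checks.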

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko 5f99f312bd bpf: add register bounds sanity checks and sanitization
Add simple sanity checks that validate well-formed ranges (min <= max)
across u64, s64, u32, and s32 ranges. Also for cases when the value is
constant (either 64-bit or 32-bit), we validate that ranges and tnums
are in agreement.

These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
operations, on conditional jumps, and for LDX instructions (where subreg
zero/sign extension is probably the most important to check). This
covers most of the interesting cases.

Also, we validate the sanity of the return register when manually
adjusting it for some special helpers.

By default, sanity violation will trigger a warning in verifier log and
resetting register bounds to "unbounded" ones. But to aid development
and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will
trigger hard failure of verification with -EFAULT on register bounds
violations. This allows selftests to catch such issues. veristat will
also gain a CLI option to enable this behavior.
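
The checks and the two failure modes can be modelled roughly as follows.
This is a Python sketch under assumptions, not the verifier's code: the
Reg class, its field names, and the choice to reset the whole register
(including the tnum) in non-strict mode are simplifications made here.

```python
import errno
from dataclasses import dataclass

U64_MAX = (1 << 64) - 1
U32_MAX = (1 << 32) - 1


def to_signed(value, bits):
    """Reinterpret an unsigned `bits`-wide value as two's complement."""
    return value - (1 << bits) if value >> (bits - 1) else value


@dataclass
class Reg:
    # Defaults describe a completely unknown ("unbounded") scalar.
    umin: int = 0
    umax: int = U64_MAX
    smin: int = -(1 << 63)
    smax: int = (1 << 63) - 1
    u32_min: int = 0
    u32_max: int = U32_MAX
    s32_min: int = -(1 << 31)
    s32_max: int = (1 << 31) - 1
    tnum_value: int = 0
    tnum_mask: int = U64_MAX  # set bits mark unknown bit positions


def bounds_violation(r):
    """Return a description of the first violated invariant, or None."""
    if r.umin > r.umax or r.smin > r.smax:
        return "64-bit range has min > max"
    if r.u32_min > r.u32_max or r.s32_min > r.s32_max:
        return "32-bit range has min > max"
    if r.tnum_mask == 0:  # all 64 bits known: constant
        s = to_signed(r.tnum_value, 64)
        if (r.umin, r.umax, r.smin, r.smax) != (r.tnum_value, r.tnum_value, s, s):
            return "64-bit constant tnum disagrees with ranges"
    if r.tnum_mask & U32_MAX == 0:  # low 32 bits known: 32-bit constant
        u = r.tnum_value & U32_MAX
        s = to_signed(u, 32)
        if (r.u32_min, r.u32_max, r.s32_min, r.s32_max) != (u, u, s, s):
            return "32-bit constant tnum disagrees with ranges"
    return None


def sanity_check(r, strict):
    """Return (register, error). Strict mode fails hard with -EFAULT;
    otherwise the register is reset to unbounded and checking continues."""
    if bounds_violation(r) is None:
        return r, 0
    if strict:
        return r, -errno.EFAULT
    return Reg(), 0
```

For example, sanity_check(Reg(umin=5, umax=3), strict=False) yields a fresh
unbounded register with error 0, while strict=True reports -EFAULT.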

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko be41a203bb bpf: enhance BPF_JEQ/BPF_JNE is_branch_taken logic
Use 32-bit subranges to prune some 64-bit BPF_JEQ/BPF_JNE conditions
that otherwise would be "inconclusive" (i.e., is_branch_taken() would
return -1). This can happen, for example, when registers are initialized
as 64-bit u64/s64, then compared for inequality as 32-bit subregisters,
and then followed by 64-bit equality/inequality check. That 32-bit
inequality can establish some pattern for lower 32 bits of a register
(e.g., s< 0 condition determines whether the bit #31 is zero or not),
while overall 64-bit value could be anything (according to a value range
representation).

This is not a fancy quirky special case, but actually handling that's
necessary to prevent a correctness issue with BPF verifier's range
tracking: set_range_min_max() assumes that register ranges are
non-overlapping, and if that condition is not guaranteed by
is_branch_taken() we can end up with invalid ranges, where min > max.
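
The idea of consulting the 32-bit subranges can be sketched in Python as
below. This is a simplified model, not the kernel function: the Range and
Reg tuples and the helper names are hypothetical, and constants are
detected from the u64 range rather than from tnums.

```python
from typing import NamedTuple

U64_MAX = (1 << 64) - 1
U32_MAX = (1 << 32) - 1
S64_MIN, S64_MAX = -(1 << 63), (1 << 63) - 1
S32_MIN = -(1 << 31)


class Range(NamedTuple):
    lo: int
    hi: int


class Reg(NamedTuple):
    u64: Range
    s64: Range
    u32: Range
    s32: Range


def disjoint(a, b):
    return a.lo > b.hi or a.hi < b.lo


def jeq_branch_taken(r1, r2):
    """1 = always equal, 0 = never equal, -1 = inconclusive (64-bit JEQ)."""
    if r1.u64.lo == r1.u64.hi and r2.u64.lo == r2.u64.hi:
        return int(r1.u64.lo == r2.u64.lo)
    # Equal 64-bit values must also agree in their low 32 bits, so any
    # pair of non-overlapping ranges rules the branch out.
    for a, b in zip(r1, r2):
        if disjoint(a, b):
            return 0
    return -1


def jne_branch_taken(r1, r2):
    res = jeq_branch_taken(r1, r2)
    return res if res < 0 else 1 - res


# r1: 64-bit value unknown, but an earlier 32-bit "s< 0" check set bit 31.
r1 = Reg(Range(0, U64_MAX), Range(S64_MIN, S64_MAX),
         Range(0x80000000, U32_MAX), Range(S32_MIN, -1))
five = Reg(Range(5, 5), Range(5, 5), Range(5, 5), Range(5, 5))
print(jeq_branch_taken(r1, five), jne_branch_taken(r1, five))
```

Without the u32/s32 comparison the example would be inconclusive, because
the 64-bit ranges of r1 cover every value including 5.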

  [0] https://lore.kernel.org/bpf/CACkBjsY2q1_fUohD7hRmKGqv1MV=eP2f6XK8kjkYNw7BaiF8iQ@mail.gmail.com/

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko 96381879a3 bpf: generalize is_scalar_branch_taken() logic
Generalize is_branch_taken logic for SCALAR_VALUE register to handle
cases when both registers are not constants. Previously supported
<range> vs <scalar> cases are a natural subset of more generic <range>
vs <range> set of cases.

Generalized logic relies on straightforward segment intersection checks.
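
For illustration, a segment intersection check for an unsigned
"a < b" comparison might look like the sketch below. The type and
function names are hypothetical; only the 1 / 0 / -1 return convention
follows the is_branch_taken() convention mentioned in this series.

```c
#include <stdint.h>

/* Hypothetical closed unsigned 64-bit range [min, max]. */
struct sketch_u64_range {
	uint64_t min, max;
};

/*
 * Unsigned "a < b" on two ranges:
 *   1  if every value of a is below every value of b (always taken),
 *   0  if no value of a is below any value of b (never taken),
 *  -1  if the segments overlap in a way that leaves it undecided.
 */
static int sketch_jlt_taken(struct sketch_u64_range a,
			    struct sketch_u64_range b)
{
	if (a.max < b.min)
		return 1;
	if (a.min >= b.max)
		return 0;
	return -1;
}
```

With b collapsed to a single value (min == max) this reduces to the
previously supported <range> vs <scalar> case.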

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:41 -08:00
Andrii Nakryiko 67420501e8 bpf: generalize reg_set_min_max() to handle non-const register comparisons
Generalize bounds adjustment logic of reg_set_min_max() to handle not
just register vs constant case, but in general any register vs any
register cases. For most of the operations it's trivial extension based
on range vs range comparison logic, we just need to properly pick
min/max of a range to compare against min/max of the other range.
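
A minimal sketch of what "picking min/max of the other range" means for
the TRUE branch of an unsigned "a < b". Names are hypothetical, and the
sketch assumes the caller has already established that the branch can
be taken (so a.min < b.max and the +1/-1 adjustments cannot wrap).

```c
#include <stdint.h>

/* Hypothetical closed unsigned 64-bit range [min, max]. */
struct sketch_range {
	uint64_t min, max;
};

static uint64_t sketch_min(uint64_t x, uint64_t y) { return x < y ? x : y; }
static uint64_t sketch_max(uint64_t x, uint64_t y) { return x > y ? x : y; }

/*
 * Refine both ranges knowing that "a < b" held.
 * Precondition (assumed, not checked): a->min < b->max.
 */
static void sketch_refine_jlt_true(struct sketch_range *a,
				   struct sketch_range *b)
{
	/* a must stay strictly below the largest possible b. */
	a->max = sketch_min(a->max, b->max - 1);
	/* b must stay strictly above the smallest possible a. */
	b->min = sketch_max(b->min, a->min + 1);
}
```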

For BPF_JSET we keep the original capabilities, just make sure JSET is
integrated in the common framework. This is manifested in the
internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more
uniform rev_opcode() handling. See the code for details. This allows
reusing exactly the same code for both TRUE and FALSE branches without
explicitly handling both conditions with custom code.

Note also that we no longer need special handling of the BPF_JEQ/BPF_JNE
case when none of the registers are constants. This is now just a normal
generic case handled by reg_set_min_max().

To make tnum handling cleaner, tnum_with_subreg() helper is added, as
that's a common operator when dealing with 32-bit subregister bounds.
This keeps the overall logic much less noisy when it comes to tnums.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:41 -08:00
Yonghong Song 1fda5bb66a bpf: Do not allocate percpu memory at init stage
Kirill Shutemov reported significant percpu memory consumption increase after
booting in 288-cpu VM ([1]) due to commit 41a5db8d81 ("bpf: Add support for
non-fix-size percpu mem allocation"). The percpu memory consumption is
increased from 111MB to 969MB. The number is from /proc/meminfo.

I tried to reproduce the issue with my local VM, which supports at most
255 cpus. With 252 cpus, without the above commit, the percpu memory
consumption immediately after boot is 57MB while with the above commit the
percpu memory consumption is 231MB.

This is not good since so far percpu memory from bpf memory allocator is not
widely used yet. Let us change pre-allocation in init stage to on-demand
allocation when verifier detects there is a need of percpu memory for bpf
program. With this change, percpu memory consumption after boot can be reduced
significantly.
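
The general shape of the change (allocate on first use instead of at
init) can be sketched in userspace as below. Everything here is
hypothetical: calloc() stands in for the per-CPU allocator, and there
is no locking, which real code would need.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cache standing in for the per-CPU allocator state. */
struct sketch_lazy_cache {
	void  *mem;	/* NULL until the first real user shows up */
	size_t size;	/* bytes backing the cache once allocated */
};

/*
 * Allocate the backing memory on first use only. Returns 0 on success
 * (including when the memory already exists) and -1 on failure.
 * Not thread-safe: a real implementation would serialize this.
 */
static int sketch_lazy_cache_ensure(struct sketch_lazy_cache *c,
				    size_t nr_cpus, size_t unit)
{
	if (c->mem)
		return 0;
	c->mem = calloc(nr_cpus, unit);
	if (!c->mem)
		return -1;
	c->size = nr_cpus * unit;
	return 0;
}
```

Nothing is allocated until sketch_lazy_cache_ensure() is first called,
which corresponds to the point where a need for percpu memory is
detected.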

  [1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/

Fixes: 41a5db8d81 ("bpf: Add support for non-fix-size percpu mem allocation")
Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 07:51:06 -08:00
Greg KH 652ffc2104 perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/2023061204-decal-flyable-6090@gregkh
2023-11-15 10:15:50 +01:00
Peter Zijlstra 5d2d4a9f60 Merge branch 'tip/perf/urgent'
Avoid conflicts, base on fixes.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2023-11-15 10:15:40 +01:00
Frederic Weisbecker 194600008d sched/timers: Explain why idle task schedules out on remote timer enqueue
Trying to avoid that didn't bring much value after testing, add comment
about this.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-3-frederic@kernel.org
2023-11-15 09:57:52 +01:00
Frederic Weisbecker dd5403869a sched/cpuidle: Comment about timers requirements VS idle handler
Add missing explanation concerning IRQs re-enablement constraints in
the cpuidle path against timers.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-2-frederic@kernel.org
2023-11-15 09:57:51 +01:00
Peter Zijlstra 63ba8422f8 sched/deadline: Introduce deadline servers
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

RT Throttling was introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).

Introduce deadline servers to service the needs of low priority tasks under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
2023-11-15 09:57:51 +01:00
Peter Zijlstra 2f7a0f5894 sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
In preparation for introducing !task sched_dl_entity, move the
bandwidth accounting into {en,de}queue_dl_entity().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/a86dccbbe44e021b8771627e1dae01a69b73466d.1699095159.git.bristot@kernel.org
2023-11-15 09:57:50 +01:00
Peter Zijlstra 9e07d45c52 sched/deadline: Collect sched_dl_entity initialization
Create a single function that initializes a sched_dl_entity.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org
2023-11-15 09:57:50 +01:00
Peter Zijlstra c708a4dc5a sched: Unify more update_curr*()
Now that trace_sched_stat_runtime() no longer takes a vruntime
argument, the task specific bits are identical between
update_curr_common() and update_curr().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-15 09:57:49 +01:00
Peter Zijlstra 5fe6ec8f6a sched: Remove vruntime from trace_sched_stat_runtime()
Tracing the runtime delta makes sense: an observer can sum it over time.
Tracing the absolute vruntime makes less sense. It is inconsistent
(absolute vs. delta), and the vruntime delta can be computed from the
runtime delta.

Removing the vruntime thing also makes the two tracepoint sites
identical, allowing to unify the code in a later patch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-15 09:57:49 +01:00
Peter Zijlstra 5d69eca542 sched: Unify runtime accounting across classes
All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.
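
The duplicated pattern is roughly the following, shown here as a
self-contained userspace sketch. The struct and helper names are
hypothetical; only the exec_start field name is taken from the text
above.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down scheduling entity. */
struct sketch_se {
	uint64_t exec_start;		/* timestamp of the last update */
	uint64_t sum_exec_runtime;	/* total runtime accumulated so far */
};

/*
 * Account the time elapsed since exec_start and return it.
 * A non-positive delta leaves the entity untouched.
 */
static int64_t sketch_update_runtime(struct sketch_se *se, uint64_t now)
{
	int64_t delta = (int64_t)(now - se->exec_start);

	if (delta <= 0)
		return delta;

	se->exec_start = now;
	se->sum_exec_runtime += (uint64_t)delta;
	return delta;
}
```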

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org
2023-11-15 09:57:48 +01:00
Abel Wu ee4373dc90 sched/eevdf: O(1) fastpath for task selection
Since the RB-tree is now sorted by deadline, let's first try the
leftmost entity which has the earliest virtual deadline. I've done
some benchmarks to see its effectiveness.

All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.

  hackbench: process/thread + pipe/socket + 1/2/4/8 groups
  netperf:   TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
  tbench:    loopback 24/48/72/96/192 threads
  schbench:  1/2/4/8 mthreads

  direct:    cfs_rq has only one entity
  parity:    RUN_TO_PARITY
  fast:      O(1) fastpath
  slow:      heap search

    (%)		direct	parity	fast	slow
  hackbench	92.95	2.02	4.91	0.12
  netperf	68.08	6.60	24.18	1.14
  tbench	67.55	11.22	20.61	0.62
  schbench	69.91	2.65	25.73	1.71

The above results indicate that this fastpath really makes task
selection more efficient.
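
For readers unfamiliar with the idea, here is a simplified userspace
sketch of the fastpath. It is not the scheduler code: a deadline-sorted
array stands in for the RB-tree, a linear scan stands in for the heap
search, and the eligibility rule (vruntime not ahead of the average) is
an assumption made for the example. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified scheduling entity. */
struct sketch_entity {
	uint64_t deadline;	/* virtual deadline, the sort key */
	int64_t  vruntime;
};

/* Assumed rule: eligible when not ahead of the average vruntime. */
static int sketch_eligible(const struct sketch_entity *se, int64_t avg)
{
	return se->vruntime <= avg;
}

/*
 * q[] is sorted by ascending deadline, so q[0] plays the role of the
 * cached leftmost node. *used_fastpath reports which path was taken.
 */
static const struct sketch_entity *
sketch_pick(const struct sketch_entity *q, size_t n, int64_t avg,
	    int *used_fastpath)
{
	size_t i;

	*used_fastpath = 0;
	if (n == 0)
		return NULL;

	/* O(1) fastpath: earliest deadline is also eligible. */
	if (sketch_eligible(&q[0], avg)) {
		*used_fastpath = 1;
		return &q[0];
	}

	/* Slow path: first eligible entity in deadline order. */
	for (i = 1; i < n; i++)
		if (sketch_eligible(&q[i], avg))
			return &q[i];
	return NULL;
}
```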

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-4-wuyun.abel@bytedance.com
2023-11-15 09:57:47 +01:00
Abel Wu 2227a957e1 sched/eevdf: Sort the rbtree by virtual deadline
Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in next patch.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-3-wuyun.abel@bytedance.com
2023-11-15 09:57:47 +01:00
Raghavendra K T 84db47ca71 sched/numa: Fix mm numa_scan_seq based unconditional scan
Since commit fc137c0dda ("sched/numa: enhance vma scanning logic"),
NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the
task had previously accessed VMA. However unconditional scan of VMAs is
allowed during initial phase of VMA creation until process's
mm numa_scan_seq reaches 2 even though current task had not accessed VMA.

Rationale:
 - Without initial scan subsequent PTE update may never happen.
 - Give fair opportunity to all the VMAs to be scanned and subsequently
   understand the access pattern of all the VMAs.

But it has a corner case where, if a VMA is created after some time,
process's mm numa_scan_seq could be already greater than 2.

For example, the values of mm numa_scan_seq at VMA creation time, observed
while briefly running the mmtest autonuma benchmark, look like:
start_seq=0 : 459
start_seq=2 : 138
start_seq=3 : 144
start_seq=4 : 8
start_seq=8 : 1
start_seq=9 : 1
This results in no unconditional PTE updates for those VMAs created after
some time.

Fix:
 - Note down the initial value of mm numa_scan_seq in per VMA start_seq.
 - Allow unconditional scan till start_seq + 2.
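
The difference between the old and new rule can be sketched as two
small predicates. The function and parameter names are hypothetical;
the thresholds come from the description above.

```c
#include <stdbool.h>

/* Old rule: only VMAs seen while the mm-wide counter is below 2. */
static bool sketch_old_unconditional_scan(unsigned int mm_scan_seq)
{
	return mm_scan_seq < 2;
}

/*
 * New rule: measure from the value recorded when the VMA was created,
 * so late VMAs still get their initial unconditional scans. Unsigned
 * subtraction keeps this correct if the counter wraps.
 */
static bool sketch_new_unconditional_scan(unsigned int mm_scan_seq,
					  unsigned int vma_start_seq)
{
	return (mm_scan_seq - vma_start_seq) < 2;
}
```

For a VMA created at start_seq=8, the old rule never scans it
unconditionally, while the new rule does so for scan sequences 8 and 9.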

Result:
SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
base kernel: upstream 6.6-rc6 with Mel's patches [1] applied.

kernbench
=========
                            base                patched (%gain)
Amean    elsp-128      165.09 ( 0.00%)      164.78 *   0.19%*

Duration User       41404.28    41375.08
Duration System      9862.22     9768.48
Duration Elapsed      519.87      518.72

Ops NUMA PTE updates           1041416.00      831536.00
Ops NUMA hint faults            263296.00      220966.00
Ops NUMA pages migrated         258021.00      212769.00
Ops AutoNUMA cost                 1328.67        1114.69

autonumabench

NUMA01_THREADLOCAL
==================
Amean  elsp-NUMA01_THREADLOCAL   81.79 (0.00%)  67.74 *  17.18%*

Duration User       54832.73    47379.67
Duration System        75.00      185.75
Duration Elapsed      576.72      476.09

Ops NUMA PTE updates                  394429.00    11121044.00
Ops NUMA hint faults                    1001.00     8906404.00
Ops NUMA pages migrated                  288.00     2998694.00
Ops AutoNUMA cost                          7.77       44666.84

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
2023-11-15 09:57:46 +01:00
Paul E. McKenney d6111cf45c sched: Use WRITE_ONCE() for p->on_rq
Since RCU-tasks uses READ_ONCE(p->on_rq), ensure the write-side
matches with WRITE_ONCE().
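
As a userspace illustration of the pairing, the sketch below uses
simplified volatile-cast macros. These are approximations written for
this example (they rely on the GCC/Clang __typeof__ extension) and are
not the kernel's definitions; the struct and helpers are hypothetical.

```c
/* Simplified userspace approximations, not the kernel definitions. */
#define SKETCH_WRITE_ONCE(x, val) \
	(*(volatile __typeof__(x) *)&(x) = (val))
#define SKETCH_READ_ONCE(x) \
	(*(const volatile __typeof__(x) *)&(x))

/* Hypothetical task with only the field of interest. */
struct sketch_task {
	int on_rq;
};

/* Writer side: marked store, so the compiler cannot tear or fuse it. */
static void sketch_set_on_rq(struct sketch_task *p, int val)
{
	SKETCH_WRITE_ONCE(p->on_rq, val);
}

/* Reader side: marked load matching the marked store above. */
static int sketch_get_on_rq(struct sketch_task *p)
{
	return SKETCH_READ_ONCE(p->on_rq);
}
```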

Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/e4896e0b-eacc-45a2-a7a8-de2280a51ecc@paulmck-laptop
2023-11-15 09:57:45 +01:00
Peter Zijlstra 889c58b315 perf/core: Fix cpuctx refcounting
Audit of the refcounting turned up that perf_pmu_migrate_context()
fails to migrate the ctx refcount.

Fixes: bd27568117 ("perf: Rewrite core context handling")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20230612093539.085862001@infradead.org
Cc: <stable@vger.kernel.org>
2023-11-15 04:18:31 +01:00
Peter Zijlstra c9bd1568d5 futex: Fix hardcoded flags
Xi reported that commit 5694289ce1 ("futex: Flag conversion") broke
glibc's robust futex tests.

This was narrowed down to the change of FLAGS_SHARED from 0x01 to
0x10, at which point Florian noted that handle_futex_death() has a
hardcoded flags argument of 1.

Change this to: FLAGS_SIZE_32 | FLAGS_SHARED, matching how
futex_to_flags() unconditionally sets FLAGS_SIZE_32 for all legacy
futex ops.
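
To see why a hardcoded 1 stopped meaning "shared", consider the sketch
below. FLAGS_SHARED = 0x10 is the value given above; the value used for
FLAGS_SIZE_32 (0x02) is an assumption made for this example, and the
SKETCH_ prefixes and helper names are hypothetical.

```c
/* Value stated in the commit message above. */
#define SKETCH_FLAGS_SHARED	0x10
/* Assumed value, for illustration only. */
#define SKETCH_FLAGS_SIZE_32	0x02

/* Flags built symbolically survive any renumbering of the bits. */
static unsigned int sketch_robust_death_flags(void)
{
	return SKETCH_FLAGS_SIZE_32 | SKETCH_FLAGS_SHARED;
}

static int sketch_is_shared(unsigned int flags)
{
	return (flags & SKETCH_FLAGS_SHARED) != 0;
}
```

With these definitions, sketch_is_shared(1) is false, which is the kind
of silent breakage a literal flags argument invites.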

Reported-by: Xi Ruoyao <xry111@xry111.site>
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20231114201402.GA25315@noisy.programming.kicks-ass.net
Fixes: 5694289ce1 ("futex: Flag conversion")
Cc: <stable@vger.kernel.org>
2023-11-15 04:02:25 +01:00
Paul Moore 969d90ec21 audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
eBPF can end up calling into the audit code from some odd places, and
some of these places don't have @current set properly so we end up
tripping the `WARN_ON_ONCE(!current->mm)` near the top of
`audit_exe_compare()`.  While the basic `!current->mm` check is good,
the `WARN_ON_ONCE()` results in some scary console messages so let's
drop that and just do the regular `!current->mm` check to avoid
problems.

Cc: <stable@vger.kernel.org>
Fixes: 47846d5134 ("audit: don't take task_lock() in audit_exe_compare() code path")
Reported-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-11-14 17:34:27 -05:00