Commit Graph

46548 Commits (c83508da5620ef89232cb614fb9e02dfdfef2b8f)

Author SHA1 Message Date
Priya Bala Govindasamy c83508da56 bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
BPF program types like kprobe and fentry can cause deadlocks in certain
situations. If a function takes a lock and one of these bpf programs is
hooked to some point in the function's critical section, and if the
bpf program tries to call the same function and take the same lock it will
lead to deadlock. These situations have been reported in the following
bug reports.

In percpu_freelist -
Link: https://lore.kernel.org/bpf/CAADnVQLAHwsa+2C6j9+UC6ScrDaN9Fjqv1WjB1pP9AzJLhKuLQ@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com/T/
In bpf_lru_list -
Link: https://lore.kernel.org/bpf/CAPPBnEajj+DMfiR_WRWU5=6A7KKULdB5Rob_NJopFLWF+i9gCA@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEZQDVN6VqnQXvVqGoB+ukOtHGZ9b9U0OLJJYvRoSsMY_g@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEaCB1rFAYU7Wf8UxqcqOWKmRPU1Nuzk3_oLk6qXR7LBOA@mail.gmail.com/T/

Similar bugs have been reported by syzbot.
In queue_stack_maps -
Link: https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@google.com/
Link: https://lore.kernel.org/all/20240418230932.2689-1-hdanton@sina.com/T/
In lpm_trie -
Link: https://lore.kernel.org/linux-kernel/00000000000035168a061a47fa38@google.com/T/
In ringbuf -
Link: https://lore.kernel.org/bpf/20240313121345.2292-1-hdanton@sina.com/T/

Prevent kprobe and fentry bpf programs from attaching to these critical
sections by removing CC_FLAGS_FTRACE for percpu_freelist.o,
bpf_lru_list.o, queue_stack_maps.o, lpm_trie.o, ringbuf.o files.

The bugs reported by syzbot are due to tracepoint bpf programs being
called in the critical sections. This patch does not aim to fix deadlocks
caused by tracepoint programs. However, it does prevent deadlocks from
occurring in similar situations due to kprobe and fentry programs.

Signed-off-by: Priya Bala Govindasamy <pgovind2@uci.edu>
Link: https://lore.kernel.org/r/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-14 09:49:27 -08:00
Kumar Kartikeya Dwivedi 838a10bd2e bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL.  However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].

Thus, there is a discrepancy between the reality, that raw_tp arguments can
actually be NULL, and the verifier's knowledge, that they are never NULL,
causing explicit NULL check branch to be dead code eliminated.

A previous attempt [1], i.e. the second fixed commit, was made to
simulate symbolic execution as if in most accesses, the argument is a
non-NULL raw_tp, except for conditional jumps.  This tried to suppress
branch prediction while preserving compatibility, but surfaced issues
with production programs that were difficult to solve without increasing
verifier complexity. A more complete discussion of issues and fixes is
available at [2].

Fix this by maintaining an explicit list of tracepoints where the
arguments are known to be NULL, and mark the positional arguments as
PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments
are known to be ERR_PTR, and mark these arguments as scalar values to
prevent potential dereference.

Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2),
shifted by the zero-indexed argument number x 4. This can be represented
as follows:
1st arg: 0x1
2nd arg: 0x10
3rd arg: 0x100
... and so on (likewise for ERR_PTR case).

In the future, an automated pass will be used to produce such a list, or
insert __nullable annotations automatically for tracepoints. Each
compilation unit will be analyzed and results will be collated to find
whether a tracepoint pointer is definitely not null, maybe null, or an
unknown state where verifier conservatively marks it PTR_MAYBE_NULL.
A proof of concept of this tool from Eduard is available at [3].

Note that in case we don't find a specification in the raw_tp_null_args
array and the tracepoint belongs to a kernel module, we will
conservatively mark the arguments as PTR_MAYBE_NULL. This is because
unlike for in-tree modules, out-of-tree module tracepoints may pass NULL
freely to the tracepoint. We don't protect against such tracepoints
passing ERR_PTR (which is uncommon anyway), lest we mark all such
arguments as SCALAR_VALUE.

While we are it, let's adjust the test raw_tp_null to not perform
dereference of the skb->mark, as that won't be allowed anymore, and make
it more robust by using inline assembly to test the dead code
elimination behavior, which should still stay the same.

  [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb
  [1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com
  [2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
  [3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params

Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug
Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix
Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Fixes: cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13 16:24:53 -08:00
Kumar Kartikeya Dwivedi c00d738e16 bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
This patch reverts commit
cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The
patch was well-intended and meant to be as a stop-gap fixing branch
prediction when the pointer may actually be NULL at runtime. Eventually,
it was supposed to be replaced by an automated script or compiler pass
detecting possibly NULL arguments and marking them accordingly.

However, it caused two main issues observed for production programs and
failed to preserve backwards compatibility. First, programs relied on
the verifier not exploring == NULL branch when pointer is not NULL, thus
they started failing with a 'dereference of scalar' error.  Next,
allowing raw_tp arguments to be modified surfaced the warning in the
verifier that warns against reg->off when PTR_MAYBE_NULL is set.

More information, context, and discusson on both problems is available
in [0]. Overall, this approach had several shortcomings, and the fixes
would further complicate the verifier's logic, and the entire masking
scheme would have to be removed eventually anyway.

Hence, revert the patch in preparation of a better fix avoiding these
issues to replace this commit.

  [0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com

Reported-by: Manu Bretelle <chantra@meta.com>
Fixes: cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13 16:24:53 -08:00
Kumar Kartikeya Dwivedi 659b9ba7cb bpf: Check size for BTF-based ctx access of pointer members
Robert Morris reported the following program type which passes the
verifier in [0]:

SEC("struct_ops/bpf_cubic_init")
void BPF_PROG(bpf_cubic_init, struct sock *sk)
{
	asm volatile("r2 = *(u16*)(r1 + 0)");     // verifier should demand u64
	asm volatile("*(u32 *)(r2 +1504) = 0");   // 1280 in some configs
}

The second line may or may not work, but the first instruction shouldn't
pass, as it's a narrow load into the context structure of the struct ops
callback. The code falls back to btf_ctx_access to ensure correctness
and obtaining the types of pointers. Ensure that the size of the access
is correctly checked to be 8 bytes, otherwise the verifier thinks the
narrow load obtained a trusted BTF pointer and will permit loads/stores
as it sees fit.

Perform the check on size after we've verified that the load is for a
pointer field, as for scalar values narrow loads are fine. Access to
structs passed as arguments to a BPF program are also treated as
scalars, therefore no adjustment is needed in their case.

Existing verifier selftests are broken by this change, but because they
were incorrect. Verifier tests for d_path were performing narrow load
into context to obtain path pointer, had this program actually run it
would cause a crash. The same holds for verifier_btf_ctx_access tests.

  [0]: https://lore.kernel.org/bpf/51338.1732985814@localhost

Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Reported-by: Robert Morris <rtm@mit.edu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241212092050.3204165-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-12 11:40:18 -08:00
Eduard Zingerman ac6542ad92 bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
bpf_prog_aux->func field might be NULL if program does not have
subprograms except for main sub-program. The fixed commit does
bpf_prog_aux->func access unconditionally, which might lead to null
pointer dereference.

The bug could be triggered by replacing the following BPF program:

    SEC("tc")
    int main_changes(struct __sk_buff *sk)
    {
        bpf_skb_pull_data(sk, 0);
        return 0;
    }

With the following BPF program:

    SEC("freplace")
    long changes_pkt_data(struct __sk_buff *sk)
    {
        return bpf_skb_pull_data(sk, 0);
    }

bpf_prog_aux instance itself represents the main sub-program,
use this property to fix the bug.

Fixes: 81f6d0530b ("bpf: check changes_pkt_data property for extension programs")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412111822.qGw6tOyB-lkp@intel.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241212070711.427443-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-12 11:37:19 -08:00
Jann Horn 7d0d673627 bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
Currently, the pointer stored in call->prog_array is loaded in
__uprobe_perf_func(), with no RCU annotation and no immediately visible
RCU protection, so it looks as if the loaded pointer can immediately be
dangling.
Later, bpf_prog_run_array_uprobe() starts a RCU-trace read-side critical
section, but this is too late. It then uses rcu_dereference_check(), but
this use of rcu_dereference_check() does not actually dereference anything.

Fix it by aligning the semantics to bpf_prog_run_array(): Let the caller
provide rcu_read_lock_trace() protection and then load call->prog_array
with rcu_dereference_check().

This issue seems to be theoretical: I don't know of any way to reach this
code without having handle_swbp() further up the stack, which is already
holding a rcu_read_lock_trace() lock, so where we take
rcu_read_lock_trace() in __uprobe_perf_func()/bpf_prog_run_array_uprobe()
doesn't actually have any effect.

Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241210-bpf-fix-uprobe-uaf-v4-1-5fc8959b2b74@google.com
2024-12-10 13:06:51 -08:00
Anton Protopopov c4441ca86a bpf: fix potential error return
The bpf_remove_insns() function returns WARN_ON_ONCE(error), where
error is a result of bpf_adj_branches(), and thus should be always 0
However, if for any reason it is not 0, then it will be converted to
boolean by WARN_ON_ONCE and returned to user space as 1, not an actual
error value. Fix this by returning the original err after the WARN check.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241210114245.836164-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 11:17:53 -08:00
Eduard Zingerman 81f6d0530b bpf: check changes_pkt_data property for extension programs
When processing calls to global sub-programs, verifier decides whether
to invalidate all packet pointers in current state depending on the
changes_pkt_data property of the global sub-program.

Because of this, an extension program replacing a global sub-program
must be compatible with changes_pkt_data property of the sub-program
being replaced.

This commit:
- adds changes_pkt_data flag to struct bpf_prog_aux:
  - this flag is set in check_cfg() for main sub-program;
  - in jit_subprogs() for other sub-programs;
- modifies bpf_check_attach_btf_id() to check changes_pkt_data flag;
- moves call to check_attach_btf_id() after the call to check_cfg(),
  because it needs changes_pkt_data flag to be set:

    bpf_check:
      ...                             ...
    - check_attach_btf_id             resolve_pseudo_ldimm64
      resolve_pseudo_ldimm64   -->    bpf_prog_is_offloaded
      bpf_prog_is_offloaded           check_cfg
      check_cfg                     + check_attach_btf_id
      ...                             ...

The following fields are set by check_attach_btf_id():
- env->ops
- prog->aux->attach_btf_trace
- prog->aux->attach_func_name
- prog->aux->attach_func_proto
- prog->aux->dst_trampoline
- prog->aux->mod
- prog->aux->saved_dst_attach_type
- prog->aux->saved_dst_prog_type
- prog->expected_attach_type

Neither of these fields are used by resolve_pseudo_ldimm64() or
bpf_prog_offload_verifier_prep() (for netronome and netdevsim
drivers), so the reordering is safe.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 10:24:57 -08:00
Eduard Zingerman 51081a3f25 bpf: track changes_pkt_data property for global functions
When processing calls to certain helpers, verifier invalidates all
packet pointers in a current state. For example, consider the
following program:

    __attribute__((__noinline__))
    long skb_pull_data(struct __sk_buff *sk, __u32 len)
    {
        return bpf_skb_pull_data(sk, len);
    }

    SEC("tc")
    int test_invalidate_checks(struct __sk_buff *sk)
    {
        int *p = (void *)(long)sk->data;
        if ((void *)(p + 1) > (void *)(long)sk->data_end) return TCX_DROP;
        skb_pull_data(sk, 0);
        *p = 42;
        return TCX_PASS;
    }

After a call to bpf_skb_pull_data() the pointer 'p' can't be used
safely. See function filter.c:bpf_helper_changes_pkt_data() for a list
of such helpers.

At the moment verifier invalidates packet pointers when processing
helper function calls, and does not traverse global sub-programs when
processing calls to global sub-programs. This means that calls to
helpers done from global sub-programs do not invalidate pointers in
the caller state. E.g. the program above is unsafe, but is not
rejected by verifier.

This commit fixes the omission by computing field
bpf_subprog_info->changes_pkt_data for each sub-program before main
verification pass.
changes_pkt_data should be set if:
- subprogram calls helper for which bpf_helper_changes_pkt_data
  returns true;
- subprogram calls a global function,
  for which bpf_subprog_info->changes_pkt_data should be set.

The verifier.c:check_cfg() pass is modified to compute this
information. The commit relies on depth first instruction traversal
done by check_cfg() and absence of recursive function calls:
- check_cfg() would eventually visit every call to subprogram S in a
  state when S is fully explored;
- when S is fully explored:
  - every direct helper call within S is explored
    (and thus changes_pkt_data is set if needed);
  - every call to subprogram S1 called by S was visited with S1 fully
    explored (and thus S inherits changes_pkt_data from S1).

The downside of such approach is that dead code elimination is not
taken into account: if a helper call inside global function is dead
because of current configuration, verifier would conservatively assume
that the call occurs for the purpose of the changes_pkt_data
computation.

Reported-by: Nick Zavaritsky <mejedi@gmail.com>
Closes: https://lore.kernel.org/bpf/0498CA22-5779-4767-9C0C-A9515CEA711F@gmail.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 10:24:57 -08:00
Eduard Zingerman b238e187b4 bpf: refactor bpf_helper_changes_pkt_data to use helper number
Use BPF helper number instead of function pointer in
bpf_helper_changes_pkt_data(). This would simplify usage of this
function in verifier.c:check_cfg() (in a follow-up patch),
where only helper number is easily available and there is no real need
to lookup helper proto.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 10:24:57 -08:00
Eduard Zingerman 27e88bc4df bpf: add find_containing_subprog() utility function
Add a utility function, looking for a subprogram containing a given
instruction index, rewrite find_subprog() to use this function.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 10:24:57 -08:00
Jiri Olsa 978c4486cc bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
Syzbot reported [1] crash that happens for following tracing scenario:

  - create tracepoint perf event with attr.inherit=1, attach it to the
    process and set bpf program to it
  - attached process forks -> chid creates inherited event

    the new child event shares the parent's bpf program and tp_event
    (hence prog_array) which is global for tracepoint

  - exit both process and its child -> release both events
  - first perf_event_detach_bpf_prog call will release tp_event->prog_array
    and second perf_event_detach_bpf_prog will crash, because
    tp_event->prog_array is NULL

The fix makes sure the perf_event_detach_bpf_prog checks prog_array
is valid before it tries to remove the bpf program from it.

[1] https://lore.kernel.org/bpf/Z1MR6dCIKajNS6nU@krava/T/#m91dbf0688221ec7a7fc95e896a7ef9ff93b0b8ad

Fixes: 0ee288e69d ("bpf,perf: Fix perf_event_detach_bpf_prog error handling")
Reported-by: syzbot+2e0d2840414ce817aaac@syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241208142507.1207698-1-jolsa@kernel.org
2024-12-10 10:16:28 -08:00
Jann Horn ef1b808e3b bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
Uprobes always use bpf_prog_run_array_uprobe() under tasks-trace-RCU
protection. But it is possible to attach a non-sleepable BPF program to a
uprobe, and non-sleepable BPF programs are freed via normal RCU (see
__bpf_prog_put_noref()). This leads to UAF of the bpf_prog because a normal
RCU grace period does not imply a tasks-trace-RCU grace period.

Fix it by explicitly waiting for a tasks-trace-RCU grace period after
removing the attachment of a bpf_prog to a perf_event.

Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20241210-bpf-fix-actual-uprobe-uaf-v1-1-19439849dd44@google.com
2024-12-10 10:14:02 -08:00
Linus Torvalds eadaac4dd2 - Fix a /proc/interrupts formatting regression
- Have the BCM2836 interrupt controller enter power management states properly
 
 - Other fixlets
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdVeI0ACgkQEsHwGGHe
 VUpE9A/9Gs2K9ImUlQu5giKEM2dhu0U/oEN/KNKayd6mjSR4MWEovvMeE16M2AUA
 uALQVxCviJKy6vl5J2aYw5QIhktfGwrdmr4zGK61pfhGWD8dClaCxJpbfUvtx1Bu
 1K9U8PpSh1grScbcjDFBOIIzHsmQgpHFTJxVwktLXEktJsWJyzhxqvdiduMXvDlF
 T5WOIr/A5MwGtoP0kySQA49k0ymgUkXf1UgAl7nTksLEI50SMo3Wt7vpQ+qLk6xD
 3RdywlARaVOLY5GlRsFtUTZP6o06/8aDrEmrUHnOiip9u7pKzMiJR4aKzlScCdTN
 BRKEegLxgJqA0uvQAfYP7kCP4l5fFAAVHmcenZCuXDlcexJsnLxYJChxJKUK7CAt
 wCMZifysp13aF3gyT6BfNKEPZOEqFDykxUvzT/F1d2t0Z7yq1GHIXcDHC8eG0p2H
 3f/YOSi+5KgDTi+xzT1hOPn3HcTADhF7wrj2oppOmJa/FyJVrRVeL+DP+uNPD+ux
 tMlVknBp3nlYQ4Ll773GZmvdaamkUfw9U5eZRi04CrvVOcanq/KjKIL24bq9ODS0
 i6dgPfZPwYI+nuenGv7VEpgo92/oynqGxuShESk1c1S/rCL+D2kMiNaRvR9cVsU3
 +Gvr2v9B0KqLy79gw9Fhdqzbhnzhkcpa72B4JNJwEtHdO+t4iwI=
 =Uyoi
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix a /proc/interrupts formatting regression

 - Have the BCM2836 interrupt controller enter power management states
   properly

 - Other fixlets

* tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing
  genirq/proc: Add missing space separator back
  irqchip/bcm2836: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/gic-v3: Fix irq_complete_ack() comment
2024-12-08 11:54:04 -08:00
Linus Torvalds c25ca0c2e4 - Handle the case where clocksources with small counter width can, in
conjunction with overly long idle sleeps, falsely trigger the negative
   motion detection of clocksources
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdVb44ACgkQEsHwGGHe
 VUr/bA/+L5vH39rL4IGQ05LZ/TaZC5iSadsWeW26UccbV+5jw4q2hmVterH3TXDp
 ujsvPmRLPN/tjCdwMNKDRmVTc5nBoO+OHBErf+o3mqZMI8XdXz7bOEvNhjnlIPnT
 ILu/c8h2g3bHpFBYHsJjyhrIkfvfspG/yeD6V7SXI1r0StqRfoo6QmbbZwYpdrn3
 +ORs8TdW6GEp7gJhdCzSXxzbfXnvtRsBZvsBLxoIter9Kqd+pFpVxj7CqoHWEiBM
 NQHN/2DG3uczoVVtOB7VK9edAYlpe9mzokB4wRClXo21D7JFze0m2TJGJ3hf9eRZ
 RbzZea0CQNa11NUlxoRUrN+jG/CHjnptNFycRJIEtb6YgKyoizJ/x8CBOWI3a8nU
 NTGBIwXAeYcYrrsP5f3bmDRcks9OO+E2quZiGJZorq1zDxzxnPs8ALmtwCB64UhD
 ro0VAT1d7JuMdnmFUKSwf35nLydnUiBqRC2cb03jMZAze+YmFCgMh5xjtzTPw+WE
 QDIR9Eu6ebSM80ldXGISHYn1wHxQVFtJ2cN9gmV+Lnaekys6huDQkARCoWfLhdYc
 CiqZSvlMPis+VAhkglwnHlxc/mGpfFTZXh47oXzNbXw2J2bWhhbyzl9X8UiSqw+A
 UOfq/JRis2j6NsmKkGvTU/hGtEPOzXw2EStAuXm+OVg5TDKwr74=
 =mARf
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Handle the case where clocksources with small counter width can,
   in conjunction with overly long idle sleeps, falsely trigger the
   negative motion detection of clocksources

* tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Make negative motion detection more robust
2024-12-08 11:51:29 -08:00
Linus Torvalds 553c89ec31 24 hotfixes. 17 are cc:stable. 15 are MM and 9 are non-MM.
The usual bunch of singletons - please see the relevant changelogs for
 details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ1U/QwAKCRDdBJ7gKXxA
 jnE7AQC0eyNNvaL5pLCIxN/Vmr8YeuWP1dldgI29TjrH/JKjSQEAihZNqVZYjoIT
 Gf7Y+IKnc4LbfAXcTe+MfJFeDexM5AU=
 =U5LQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "24 hotfixes.  17 are cc:stable.  15 are MM and 9 are non-MM.

  The usual bunch of singletons - please see the relevant changelogs for
  details"

* tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
  iio: magnetometer: yas530: use signed integer type for clamp limits
  sched/numa: fix memory leak due to the overwritten vma->numab_state
  mm/damon: fix order of arguments in damos_before_apply tracepoint
  lib: stackinit: hide never-taken branch from compiler
  mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
  scatterlist: fix incorrect func name in kernel-doc
  mm: correct typo in MMAP_STATE() macro
  mm: respect mmap hint address when aligning for THP
  mm: memcg: declare do_memsw_account inline
  mm/codetag: swap tags when migrate pages
  ocfs2: update seq_file index in ocfs2_dlm_seq_next
  stackdepot: fix stack_depot_save_flags() in NMI context
  mm: open-code page_folio() in dump_page()
  mm: open-code PageTail in folio_flags() and const_folio_flags()
  mm: fix vrealloc()'s KASAN poisoning logic
  Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()"
  selftests/damon: add _damon_sysfs.py to TEST_FILES
  selftest: hugetlb_dio: fix test naming
  ocfs2: free inode when ocfs2_get_init_inode() fails
  nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry()
  ...
2024-12-08 11:26:13 -08:00
Linus Torvalds b5f217084a BPF fixes:
- Fix several issues for BPF LPM trie map which were found by
   syzbot and during addition of new test cases (Hou Tao)
 
 - Fix a missing process_iter_arg register type check in the
   BPF verifier (Kumar Kartikeya Dwivedi, Tao Lyu)
 
 - Fix several correctness gaps in the BPF verifier when
   interacting with the BPF stack without CAP_PERFMON
   (Kumar Kartikeya Dwivedi, Eduard Zingerman, Tao Lyu)
 
 - Fix OOB BPF map writes when deleting elements for the case of
   xsk map as well as devmap (Maciej Fijalkowski)
 
 - Fix xsk sockets to always clear DMA mapping information when
   unmapping the pool (Larysa Zaremba)
 
 - Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge
   after sent bytes have been finalized (Zijian Zhang)
 
 - Fix BPF sockmap with vsocks which was missing a queue check
   in poll and sockmap cleanup on close (Michal Luczaj)
 
 - Fix tools infra to override makefile ARCH variable if defined
   but empty, which addresses cross-building tools. (Björn Töpel)
 
 - Fix two resolve_btfids build warnings on unresolved bpf_lsm
   symbols (Thomas Weißschuh)
 
 - Fix a NULL pointer dereference in bpftool (Amir Mohammadi)
 
 - Fix BPF selftests to check for CONFIG_PREEMPTION instead of
   CONFIG_PREEMPT (Sebastian Andrzej Siewior)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ1N8bhUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIO6ZAD+ITpujJgxvFGC0R7E9o3XJ7V1SpmR
 SlW0lGpj6vOHTUAA/2MRoZurJSTbdT3fbWiCUgU1rMcwkoErkyxUaPuBci0D
 =kgXL
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann::

 - Fix several issues for BPF LPM trie map which were found by syzbot
   and during addition of new test cases (Hou Tao)

 - Fix a missing process_iter_arg register type check in the BPF
   verifier (Kumar Kartikeya Dwivedi, Tao Lyu)

 - Fix several correctness gaps in the BPF verifier when interacting
   with the BPF stack without CAP_PERFMON (Kumar Kartikeya Dwivedi,
   Eduard Zingerman, Tao Lyu)

 - Fix OOB BPF map writes when deleting elements for the case of xsk map
   as well as devmap (Maciej Fijalkowski)

 - Fix xsk sockets to always clear DMA mapping information when
   unmapping the pool (Larysa Zaremba)

 - Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge after
   sent bytes have been finalized (Zijian Zhang)

 - Fix BPF sockmap with vsocks which was missing a queue check in poll
   and sockmap cleanup on close (Michal Luczaj)

 - Fix tools infra to override makefile ARCH variable if defined but
   empty, which addresses cross-building tools. (Björn Töpel)

 - Fix two resolve_btfids build warnings on unresolved bpf_lsm symbols
   (Thomas Weißschuh)

 - Fix a NULL pointer dereference in bpftool (Amir Mohammadi)

 - Fix BPF selftests to check for CONFIG_PREEMPTION instead of
   CONFIG_PREEMPT (Sebastian Andrzej Siewior)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (31 commits)
  selftests/bpf: Add more test cases for LPM trie
  selftests/bpf: Move test_lpm_map.c to map_tests
  bpf: Use raw_spinlock_t for LPM trie
  bpf: Switch to bpf mem allocator for LPM trie
  bpf: Fix exact match conditions in trie_get_next_key()
  bpf: Handle in-place update for full LPM trie correctly
  bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
  bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
  bpf: Remove unnecessary check when updating LPM trie
  selftests/bpf: Add test for narrow spill into 64-bit spilled scalar
  selftests/bpf: Add test for reading from STACK_INVALID slots
  selftests/bpf: Introduce __caps_unpriv annotation for tests
  bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
  bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
  samples/bpf: Remove unnecessary -I flags from libbpf EXTRA_CFLAGS
  bpf: Zero index arg error string for dynptr and iter
  selftests/bpf: Add tests for iter arg check
  bpf: Ensure reg is PTR_TO_STACK in process_iter_arg
  tools: Override makefile ARCH variable if defined, but empty
  selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap
  ...
2024-12-06 15:07:48 -08:00
Hou Tao 6a5c63d43c bpf: Use raw_spinlock_t for LPM trie
After switching from kmalloc() to the bpf memory allocator, there will be
no blocking operation during the update of LPM trie. Therefore, change
trie->lock from spinlock_t to raw_spinlock_t to make LPM trie usable in
atomic context, even on RT kernels.

The max value of prefixlen is 2048. Therefore, update or deletion
operations will find the target after at most 2048 comparisons.
Constructing a test case which updates an element after 2048 comparisons
under a 8 CPU VM, and the average time and the maximal time for such
update operation is about 210us and 900us.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-8-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao 3d8dc43eb2 bpf: Switch to bpf mem allocator for LPM trie
Multiple syzbot warnings have been reported. These warnings are mainly
about the lock order between trie->lock and kmalloc()'s internal lock.
See report [1] as an example:

======================================================
WARNING: possible circular locking dependency detected
6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted
------------------------------------------------------
syz.3.2069/15008 is trying to acquire lock:
ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ...

but task is already holding lock:
ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ...

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&trie->lock){-.-.}-{2:2}:
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x3a/0x60
       trie_delete_elem+0xb0/0x820
       ___bpf_prog_run+0x3e51/0xabd0
       __bpf_prog_run32+0xc1/0x100
       bpf_dispatcher_nop_func
       ......
       bpf_trace_run2+0x231/0x590
       __bpf_trace_contention_end+0xca/0x110
       trace_contention_end.constprop.0+0xea/0x170
       __pv_queued_spin_lock_slowpath+0x28e/0xcc0
       pv_queued_spin_lock_slowpath
       queued_spin_lock_slowpath
       queued_spin_lock
       do_raw_spin_lock+0x210/0x2c0
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x42/0x60
       __put_partials+0xc3/0x170
       qlink_free
       qlist_free_all+0x4e/0x140
       kasan_quarantine_reduce+0x192/0x1e0
       __kasan_slab_alloc+0x69/0x90
       kasan_slab_alloc
       slab_post_alloc_hook
       slab_alloc_node
       kmem_cache_alloc_node_noprof+0x153/0x310
       __alloc_skb+0x2b1/0x380
       ......

-> #0 (&n->list_lock){-.-.}-{2:2}:
       check_prev_add
       check_prevs_add
       validate_chain
       __lock_acquire+0x2478/0x3b30
       lock_acquire
       lock_acquire+0x1b1/0x560
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x3a/0x60
       get_partial_node.part.0+0x20/0x350
       get_partial_node
       get_partial
       ___slab_alloc+0x65b/0x1870
       __slab_alloc.constprop.0+0x56/0xb0
       __slab_alloc_node
       slab_alloc_node
       __do_kmalloc_node
       __kmalloc_node_noprof+0x35c/0x440
       kmalloc_node_noprof
       bpf_map_kmalloc_node+0x98/0x4a0
       lpm_trie_node_alloc
       trie_update_elem+0x1ef/0xe00
       bpf_map_update_value+0x2c1/0x6c0
       map_update_elem+0x623/0x910
       __sys_bpf+0x90c/0x49a0
       ...

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&trie->lock);
                               lock(&n->list_lock);
                               lock(&trie->lock);
  lock(&n->list_lock);

 *** DEADLOCK ***

[1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7

A bpf program attached to trace_contention_end() triggers after
acquiring &n->list_lock. The program invokes trie_delete_elem(), which
then acquires trie->lock. However, it is possible that another
process is invoking trie_update_elem(). trie_update_elem() will acquire
trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke
get_partial_node() and try to acquire &n->list_lock (not necessarily the
same lock object). Therefore, lockdep warns about the circular locking
dependency.

Invoking kmalloc() before acquiring trie->lock could fix the warning.
However, since BPF programs call be invoked from any context (e.g.,
through kprobe/tracepoint/fentry), there may still be lock ordering
problems for internal locks in kmalloc() or trie->lock itself.

To eliminate these potential lock ordering problems with kmalloc()'s
internal locks, replacing kmalloc()/kfree()/kfree_rcu() with equivalent
BPF memory allocator APIs that can be invoked in any context. The lock
ordering problems with trie->lock (e.g., reentrance) will be handled
separately.

Three aspects of this change require explanation:

1. Intermediate and leaf nodes are allocated from the same allocator.
Since the value size of LPM trie is usually small, using a single
alocator reduces the memory overhead of the BPF memory allocator.

2. Leaf nodes are allocated before disabling IRQs. This handles cases
where leaf_size is large (e.g., > 4KB - 8) and updates require
intermediate node allocation. If leaf nodes were allocated in
IRQ-disabled region, the free objects in BPF memory allocator would not
be refilled timely and the intermediate node allocation may fail.

3. Paired migrate_{disable|enable}() calls for node alloc and free. The
BPF memory allocator uses per-CPU struct internally, these paired calls
are necessary to guarantee correctness.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao 27abc7b3fa bpf: Fix exact match conditions in trie_get_next_key()
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
an exact match, However, it is incorrect because when the target key
doesn't fully match the found node (e.g., node->prefixlen != matchlen),
these two nodes may also have the same prefixlen. It will return
expected result when the passed key exist in the trie. However when a
recently-deleted key or nonexistent key is passed to
trie_get_next_key(), it may skip keys and return incorrect result.

Fix it by using node->prefixlen == matchlen to identify exact matches.
When the condition is true after the search, it also implies
node->prefixlen equals key->prefixlen, otherwise, the search would
return NULL instead.

Fixes: b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao 532d6b36b2 bpf: Handle in-place update for full LPM trie correctly
When a LPM trie is full, in-place updates of existing elements
incorrectly return -ENOSPC.

Fix this by deferring the check of trie->n_entries. For new insertions,
n_entries must not exceed max_entries. However, in-place updates are
allowed even when the trie is full.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao eae6a075e9 bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST
flags. These flags can be specified by users and are relevant since LPM
trie supports exact matches during update.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao 3d5611b4d7 bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
There is no need to call kfree(im_node) when updating element fails,
because im_node must be NULL. Remove the unnecessary kfree() for
im_node.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:25 -08:00
Hou Tao 156c977c53 bpf: Remove unnecessary check when updating LPM trie
When "node->prefixlen == matchlen" is true, it means that the node is
fully matched. If "node->prefixlen == key->prefixlen" is false, it means
the prefix length of key is greater than the prefix length of node,
otherwise, matchlen will not be equal with node->prefixlen. However, it
also implies that the prefix length of node must be less than
max_prefixlen.

Therefore, "node->prefixlen == trie->max_prefixlen" will always be false
when the check of "node->prefixlen == key->prefixlen" returns false.
Remove this unnecessary comparison.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:25 -08:00
Adrian Huang 5f1b64e9a9 sched/numa: fix memory leak due to the overwritten vma->numab_state
[Problem Description]
When running the hackbench program of LTP, the following memory leak is
reported by kmemleak.

  # /opt/ltp/testcases/bin/hackbench 20 thread 1000
  Running with 20*40 (== 800) tasks.

  # dmesg | grep kmemleak
  ...
  kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
  kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xffff888cd8ca2c40 (size 64):
    comm "hackbench", pid 17142, jiffies 4299780315
    hex dump (first 32 bytes):
      ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00  .tI.....L.I.....
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace (crc bff18fd4):
      [<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0
      [<ffffffff8113f715>] task_numa_work+0x725/0xa00
      [<ffffffff8110f878>] task_work_run+0x58/0x90
      [<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0
      [<ffffffff81dd78d5>] do_syscall_64+0x85/0x150
      [<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  ...

This issue can be consistently reproduced on three different servers:
  * a 448-core server
  * a 256-core server
  * a 192-core server

[Root Cause]
Since multiple threads are created by the hackbench program (along with
the command argument 'thread'), a shared vma might be accessed by two or
more cores simultaneously. When two or more cores observe that
vma->numab_state is NULL at the same time, vma->numab_state will be
overwritten.

Although current code ensures that only one thread scans the VMAs in a
single 'numa_scan_period', there might be a chance for another thread
to enter in the next 'numa_scan_period' while we have not gotten till
numab_state allocation [1].

Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000`
cannot the reproduce the issue. It is verified with 200+ test runs.

[Solution]
Use the cmpxchg atomic operation to ensure that only one thread executes
the vma->numab_state assignment.

[1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/

Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com
Fixes: ef6a22b70f ("sched/numa: apply the scan delay to every new vma")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Jiwei Sun <sunjw10@lenovo.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05 19:54:48 -08:00
Linus Torvalds b8f52214c6 audit/stable-6.13 PR 20241205
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmdSMGQUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNAZg//aDeY3r9tIBdE8FKXdhLFUuggkBAL
 BY4LR67FGFTklelO6oy+mGPxuby06BkwL3VLo086req55vL19pUOceNnBtWA6d1G
 ncZz6qnmFquEXTO5aDXrCIf9fG9zSSdd/D+sDjeUpq/35YJlXnEuyUMY88nm/sAQ
 LKSYOhiooTNRkE6MK60Wd9c6geAi8ER3dxO7l0agD3FGRrK3TOQkK2R/WsOcZTkE
 HtVI/s1EL+ao0s/UiY+xGuRhOgujJ0Gtokqc29m5F6a8I6SiXtXpG6okZK/7KCv0
 ta/87U3VGqXeTCf2aWKWkcDZQLZLajggJthWx0vb3OsA6ppON3FCkKqtSShqhphY
 7QQZV+CWXhHabwDhm4E5UrKu5JDb9wa1duyDnuVXLoPYFAfgMcukDrkrc9PWiKCr
 rbfV/tuwQO3XRKp643sXRcnn6lyM9KPyG77kmxyOD8duYdwX/B5Szf0o8DROwXz0
 /2htu2ZH9UPfwrQhf/MduKuN7Izz+vHFzMnMZZ2l0CVQXVgYq86kcQ4BS7C+ruXw
 PUD+7K1l1Q4tK0mk43K8I5zUqIf8wHERgYCiGkCwJew+q4rea7LNmE4vXvEnLl4W
 sMMYwtFynnNuEedMY3rhk66iz9+epZyou9W3VMoNKiwMdm5DCsPcuSX88zYUDsrP
 TFwMQjfqgpzOtFM=
 =RX05
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20241205' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit build problem workaround from Paul Moore:
 "A minor audit patch that shuffles some code slightly to workaround a
  GCC bug affecting a number of people.

  The GCC folks have been able to reproduce the problem and are
  discussing solutions (see the bug report link in the commit), but
  since the workaround is trivial let's do that in the kernel so we can
  unblock people who are hitting this"

* tag 'audit-pr-20241205' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: workaround a GCC bug triggered by task comm changes
2024-12-05 15:11:39 -08:00
Linus Torvalds 9d6a414ad3 tracing fixes for v6.13:
- Fix trace histogram sort function cmp_entries_dup()
 
   The sort function cmp_entries_dup() returns either 1 or 0, and not
   -1 if parameter "a" is less than "b" by memcmp().
 
 - Fix archs that call trace_hardirqs_off() without RCU watching
 
   Both x86 and arm64 no longer call any tracepoints with RCU not
   watching. It was assumed that it was safe to get rid of
   trace_*_rcuidle() version of the tracepoint calls. This was needed
   to get rid of the SRCU protection and be able to implement features
   like faultable traceponits and add rust tracepoints.
 
   Unfortunately, there were a few architectures that still relied on
   that logic. There's only one file that has tracepoints that are
   called without RCU watching. Add macro logic around the tracepoints
   for architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined
   will check if the code is in the idle path (the only place RCU isn't
   watching), and enable RCU around calling the tracepoint, but only
   do it if the tracepoint is enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ1G5gxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnsXAQCzFHRbTrrmSmvKRHdWxUxhlYjKALHA
 v6DCySLgdNtv0QD8D5hHeGzhVXUhECG0mUcduZ7wvaym+yAQWU5V9gUcRwU=
 =E8i1
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix trace histogram sort function cmp_entries_dup()

   The sort function cmp_entries_dup() returns either 1 or 0, and not -1
   if parameter "a" is less than "b" by memcmp().

 - Fix archs that call trace_hardirqs_off() without RCU watching

   Both x86 and arm64 no longer call any tracepoints with RCU not
   watching. It was assumed that it was safe to get rid of
   trace_*_rcuidle() version of the tracepoint calls. This was needed to
   get rid of the SRCU protection and be able to implement features like
   faultable traceponits and add rust tracepoints.

   Unfortunately, there were a few architectures that still relied on
   that logic. There's only one file that has tracepoints that are
   called without RCU watching. Add macro logic around the tracepoints
   for architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined
   will check if the code is in the idle path (the only place RCU isn't
   watching), and enable RCU around calling the tracepoint, but only do
   it if the tracepoint is enabled.

* tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix archs that still call tracepoints without RCU watching
  tracing: Fix cmp_entries_dup() to respect sort() comparison rules
2024-12-05 10:17:55 -08:00
Thomas Gleixner 76031d9536 clocksource: Make negative motion detection more robust
Guenter reported boot stalls on a emulated ARM 32-bit platform, which has a
24-bit wide clocksource.

It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.

  max_idle_ns:                    597268854 ns
  negative motion tripping point: 671088640 ns

If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock to advance and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.

Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.

For the case at hand this results in a tripping point of 1174405120ns.

Note, that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.

Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.

Fixes: c163e40af9 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.net
2024-12-05 16:03:24 +01:00
Steven Rostedt dc1b157b82 tracing: Fix archs that still call tracepoints without RCU watching
Tracepoints require having RCU "watching" as it uses RCU to do updates to
the tracepoints. There are some cases that would call a tracepoint when
RCU was not "watching". This was usually in the idle path where RCU has
"shutdown". For the few locations that had tracepoints without RCU
watching, there was an trace_*_rcuidle() variant that could be used. This
used SRCU for protection.

There are tracepoints that trace when interrupts and preemption are
enabled and disabled. In some architectures, these tracepoints are called
in a path where RCU is not watching. When x86 and arm64 removed these
locations, it was incorrectly assumed that it would be safe to remove the
trace_*_rcuidle() variant and also remove the SRCU logic, as it made the
code more complex and harder to implement new tracepoint features (like
faultable tracepoints and tracepoints in rust).

Instead of bringing back the trace_*_rcuidle(), as it will not be trivial
to do as new code has already been added depending on its removal, add a
workaround to the one file that still requires it (trace_preemptirq.c). If
the architecture does not define CONFIG_ARCH_WANTS_NO_INSTR, then check if
the code is in the idle path, and if so, call ct_irq_enter/exit() which
will enable RCU around the tracepoint.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20241204100414.4d3e06d0@gandalf.local.home
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 48bcda6848 ("tracing: Remove definition of trace_*_rcuidle()")
Closes: https://lore.kernel.org/all/bddb02de-957a-4df5-8e77-829f55728ea2@roeck-us.net/
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-05 09:28:58 -05:00
Yafang shao d9381508ea audit: workaround a GCC bug triggered by task comm changes
A build failure has been reported with the following details:

   In file included from include/linux/string.h:390,
                    from include/linux/bitmap.h:13,
                    from include/linux/cpumask.h:12,
                    from include/linux/smp.h:13,
                    from include/linux/lockdep.h:14,
                    from include/linux/spinlock.h:63,
                    from include/linux/wait.h:9,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from kernel/auditsc.c:37:
   In function 'sized_strscpy',
       inlined from '__audit_ptrace' at kernel/auditsc.c:2732:2:
>> include/linux/fortify-string.h:293:17:
   error: call to '__write_overflow' declared with attribute error:
   detected write beyond size of object (1st parameter)
     293 |                 __write_overflow();
         |                 ^~~~~~~~~~~~~~~~~~
   In function 'sized_strscpy',
       inlined from 'audit_signal_info_syscall' at kernel/auditsc.c:2759:3:
>> include/linux/fortify-string.h:293:17:
   error: call to '__write_overflow' declared with attribute error:
   detected write beyond size of object (1st parameter)
     293 |                 __write_overflow();
         |                 ^~~~~~~~~~~~~~~~~~

The issue appears to be a GCC bug, though the root cause remains
unclear at this time. For now, let's implement a workaround.

A bug report has also been filed with GCC [0].

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117912 [0]

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410171420.1V00ICVG-lkp@intel.com/
Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/20241128182435.57a1ea6f@gandalf.local.home/
Reported-by: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
Closes: https://lore.kernel.org/all/CY8PR11MB71348E568DBDA576F17DAFF389362@CY8PR11MB7134.namprd11.prod.outlook.com/
Originally-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/linux-hardening/202410171059.C2C395030@keescook/
Signed-off-by: Yafang shao <laoar.shao@gmail.com>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
[PM: subject tweak, description line wrapping]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04 22:57:46 -05:00
Tao Lyu b0e66977dc bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
When CAP_PERFMON and CAP_SYS_ADMIN (allow_ptr_leaks) are disabled, the
verifier aims to reject partial overwrite on an 8-byte stack slot that
contains a spilled pointer.

However, in such a scenario, it rejects all partial stack overwrites as
long as the targeted stack slot is a spilled register, because it does
not check if the stack slot is a spilled pointer.

Incomplete checks will result in the rejection of valid programs, which
spill narrower scalar values onto scalar slots, as shown below.

0: R1=ctx() R10=fp0
; asm volatile ( @ repro.bpf.c:679
0: (7a) *(u64 *)(r10 -8) = 1          ; R10=fp0 fp-8_w=1
1: (62) *(u32 *)(r10 -8) = 1
attempt to corrupt spilled pointer on stack
processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0.

Fix this by expanding the check to not consider spilled scalar registers
when rejecting the write into the stack.

Previous discussion on this patch is at link [0].

  [0]: https://lore.kernel.org/bpf/20240403202409.2615469-1-tao.lyu@epfl.ch

Fixes: ab125ed3ec ("bpf: fix check for attempt to corrupt spilled pointer")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241204044757.1483141-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04 09:19:50 -08:00
Kumar Kartikeya Dwivedi 69772f509e bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
Inside mark_stack_slot_misc, we should not upgrade STACK_INVALID to
STACK_MISC when allow_ptr_leaks is false, since invalid contents
shouldn't be read unless the program has the relevant capabilities.
The relaxation only makes sense when env->allow_ptr_leaks is true.

However, such conversion in privileged mode becomes unnecessary, as
invalid slots can be read without being upgraded to STACK_MISC.

Currently, the condition is inverted (i.e. checking for true instead of
false), simply remove it to restore correct behavior.

Fixes: eaf18febd6 ("bpf: preserve STACK_ZERO slots on partial reg spills")
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reported-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241204044757.1483141-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04 09:19:50 -08:00
Kuan-Wei Chiu e63fbd5f68 tracing: Fix cmp_entries_dup() to respect sort() comparison rules
The cmp_entries_dup() function used as the comparator for sort()
violated the symmetry and transitivity properties required by the
sorting algorithm. Specifically, it returned 1 whenever memcmp() was
non-zero, which broke the following expectations:

* Symmetry: If x < y, then y > x.
* Transitivity: If x < y and y < z, then x < z.

These violations could lead to incorrect sorting and failure to
correctly identify duplicate elements.

Fix the issue by directly returning the result of memcmp(), which
adheres to the required comparison properties.

Cc: stable@vger.kernel.org
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Link: https://lore.kernel.org/20241203202228.1274403-1-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-04 10:38:24 -05:00
Thomas Gleixner 9d9f204bdf genirq/proc: Add missing space separator back
The recent conversion of show_interrupts() to seq_put_decimal_ull_width()
caused a formatting regression as it drops a previosuly existing space
separator.

Add it back by unconditionally inserting a space after the interrupt
counts and removing the extra leading space from the chip name prints.

Fixes: f9ed1f7c2e ("genirq/proc: Use seq_put_decimal_ull_width() for decimal values")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/all/87zfldt5g4.ffs@tglx
Closes: https://lore.kernel.org/all/4ce18851-6e9f-bbe-8319-cc5e69fb45c@linux-m68k.org
2024-12-03 14:59:34 +01:00
Kumar Kartikeya Dwivedi bd74e238ae bpf: Zero index arg error string for dynptr and iter
Andrii spotted that process_dynptr_func's rejection of incorrect
argument register type will print an error string where argument numbers
are not zero-indexed, unlike elsewhere in the verifier.  Fix this by
subtracting 1 from regno. The same scenario exists for iterator
messages. Fix selftest error strings that match on the exact argument
number while we're at it to ensure clean bisection.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241203002235.3776418-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02 18:47:41 -08:00
Tao Lyu 12659d2861 bpf: Ensure reg is PTR_TO_STACK in process_iter_arg
Currently, KF_ARG_PTR_TO_ITER handling missed checking the reg->type and
ensuring it is PTR_TO_STACK. Instead of enforcing this in the caller of
process_iter_arg, move the check into it instead so that all callers
will gain the check by default. This is similar to process_dynptr_func.

An existing selftest in verifier_bits_iter.c fails due to this change,
but it's because it was passing a NULL pointer into iter_next helper and
getting an error further down the checks, but probably meant to pass an
uninitialized iterator on the stack (as is done in the subsequent test
below it). We will gain coverage for non-PTR_TO_STACK arguments in later
patches hence just change the declaration to zero-ed stack object.

Fixes: 06accc8779 ("bpf: add support for open-coded iterator loops")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Tao Lyu <tao.lyu@epfl.ch>
[ Kartikeya: move check into process_iter_arg, rewrite commit log ]
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241203000238.3602922-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02 17:47:56 -08:00
Peter Zijlstra cdd30ebb1b module: Convert symbol namespace to string literal
Clean up the existing export namespace code along the same lines of
commit 33def8498f ("treewide: Convert macro and uses of __section(foo)
to __section("foo")") and for the same reason, it is not desired for the
namespace argument to be a macro expansion itself.

Scripted using

  git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file;
  do
    awk -i inplace '
      /^#define EXPORT_SYMBOL_NS/ {
        gsub(/__stringify\(ns\)/, "ns");
        print;
        next;
      }
      /^#define MODULE_IMPORT_NS/ {
        gsub(/__stringify\(ns\)/, "ns");
        print;
        next;
      }
      /MODULE_IMPORT_NS/ {
        $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g");
      }
      /EXPORT_SYMBOL_NS/ {
        if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) {
  	if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ &&
  	    $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ &&
  	    $0 !~ /^my/) {
  	  getline line;
  	  gsub(/[[:space:]]*\\$/, "");
  	  gsub(/[[:space:]]/, "", line);
  	  $0 = $0 " " line;
  	}

  	$0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/,
  		    "\\1(\\2, \"\\3\")", "g");
        }
      }
      { print }' $file;
  done

Requested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-02 11:34:44 -08:00
Linus Torvalds f788b5ef1c - Fix a case where posix timers with a thread-group-wide target would miss
signals if some of the group's threads are exiting
 
 - Fix a hang caused by ndelay() calling the wrong delay function __udelay()
 
 - Fix a wrong offset calculation in adjtimex(2) when using ADJ_MICRO
   (microsecond resolution) and a negative offset
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdMQ2sACgkQEsHwGGHe
 VUoRGBAAt8luiDBdMHIcD053RHsLr7Oocg5AI/t0PVxYxJ+89o0cSdDx2vaaXiyX
 +vRSkdvH5mfwvwW4XRJZkVWbzOjMiA6m7FwH667XGzEedIq4vtgs5Rd/1YStSfIx
 ceQfD2N+34esamxiGGBlzjNO2GdqI2XMo/Fc6LuPCTfPBqELCL8OpbEdOV8Ltwxr
 mRsmbCNazBtw31Yo3zp9UZIVVSAzJFmWOoK0M+xm6S91YPYaKQ9RYk2QQwLizVgR
 N++dniNV6yZuSLTzr4dNckrvl744Iqc4Sy8iy2CL9rNFZkb+3q5CAAQggGNlY2U9
 0W95tgwpy/Qt6drfsyam3+PR5Smwjnh/0mrk3sLzUCdy9Y6L2HgKmrvHk4Rqq/66
 N6uIjIDmou+L0FUcdUducRnMOgQnvfIB/l6hIAHHkDap7iL8oy74JDzzk0jnNKHw
 1I5kGbKqXz0ucdxge6H1BHqCc/roobwC05/TWLPAQ5IG0BtQFPGAwd901AZtANkk
 /FfWUq7IT6PW05T2co7O75NjgMvU3QV0Sf5E9vkV/+R9WtTKT13FmZ8+rC6zaC7o
 Juml/lRWeTCyuot3vv29NtcvY6j+gy/RrKWL4iNWDlXznntR2DAhIkzRCF+1yTSb
 z0RSOrY2BSsk2iqeUh8ydet5OEyPiMXwiHVbxUHzJ4R/7qaxsB8=
 =X4bb
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Borislav Petkov:

 - Fix a case where posix timers with a thread-group-wide target would
   miss signals if some of the group's threads are exiting

 - Fix a hang caused by ndelay() calling the wrong delay function
   __udelay()

 - Fix a wrong offset calculation in adjtimex(2) when using ADJ_MICRO
   (microsecond resolution) and a negative offset

* tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-timers: Target group sigqueue to current task only if not exiting
  delay: Fix ndelay() spuriously treated as udelay()
  ntp: Remove invalid cast in time offset math
2024-12-01 12:41:21 -08:00
Linus Torvalds 133577cad6 dma-mapping fix for Linux 6.13
- fix physical address calculation for struct dma_debug_entry
    (Fedor Pchelkin)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmdKvyMLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYO6Sg/+LruMOlIBJ+X9E3H+c39JSiMteM5XVDPKLGpOXW01
 W3UpOh1vRhvmsoYmQaL/6Nalr0tc/bxb+obklHzimBbBsztuwaUEuY0DPcmeYpZw
 RZkUf/YX0lsf5cf5i2/bmozbiXnnbfp2g1FEv34m3W3ehLydLoBhyNZ8lqDGAt+a
 JN4s30j1CG6k5/NOnhzpMa2qVfs9GNR1MC0XJaWWybdtGYQr9tFVibS/7X8K5IOk
 dPUsoF2QFF5ODWBzhJqZnXlX23N0EC2EzVsgywTyKc2uCrSmcldidH2K8LnkmLPH
 gdNDwSAA48AbIdL1WnfVT4zyJKBl6TBTGqAkvreY6DyIfGZN8u9++3FowLJ13jdK
 vCJltoF1tf/66CBpMZAI+s9TnGT6YiwUqyheTVEIzbCSvH0Nby52iSci3FVTndoj
 otVPQMBbtzo/ZgC0tWQ0Fb1030p4OJrQJsdqHH6Y/a8J6px6AqTFf1tVumeO52P8
 pb3cadyX5VD3ACrqd5xl17AEwfatIBremFTq8XOlEohwRrSwSACsHValK+Mxrvzw
 6NpRuNPpz51u+Ii4/AzAOTHAZ/8+9AcVc26/ARpIW04nw3sJzy5mL+ND56/6oMOd
 J3T3fy+OTMZ6tKbmwTgjg/MAh8wQ7L+thlZaDGz5ubXVNqra/wHnTqFx+Gou9tRv
 9cc=
 =TU77
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.13-2024-11-30' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:

 - fix physical address calculation for struct dma_debug_entry (Fedor
   Pchelkin)

* tag 'dma-mapping-6.13-2024-11-30' of git://git.infradead.org/users/hch/dma-mapping:
  dma-debug: fix physical address calculation for struct dma_debug_entry
2024-11-30 15:36:17 -08:00
Linus Torvalds 55cb93fd24 Driver core changes for 6.13-rc1
Here is a small set of driver core changes for 6.13-rc1.
 
 Nothing major for this merge cycle, except for the 2 simple merge
 conflicts are here just to make life interesting.
 
 Included in here are:
   - sysfs core changes and preparations for more sysfs api cleanups that
     can come through all driver trees after -rc1 is out
   - fw_devlink fixes based on many reports and debugging sessions
   - list_for_each_reverse() removal, no one was using it!
   - last-minute seq_printf() format string bug found and fixed in many
     drivers all at once.
   - minor bugfixes and changes full details in the shortlog
 
 As mentioned above, there is 2 merge conflicts with your tree, one is
 where the file is removed (easy enough to resolve), the second is a
 build time error, that has been found in linux-next and the fix can be
 seen here:
 	https://lore.kernel.org/r/20241107212645.41252436@canb.auug.org.au
 
 Other than that, the changes here have been in linux-next with no other
 reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ0lEog8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ym+0ACgw6wN+LkLVIHWhxTq5DYHQ0QCxY8AoJrRIcKe
 78h0+OU3OXhOy8JGz62W
 =oI5S
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is a small set of driver core changes for 6.13-rc1.

  Nothing major for this merge cycle, except for the two simple merge
  conflicts are here just to make life interesting.

  Included in here are:

   - sysfs core changes and preparations for more sysfs api cleanups
     that can come through all driver trees after -rc1 is out

   - fw_devlink fixes based on many reports and debugging sessions

   - list_for_each_reverse() removal, no one was using it!

   - last-minute seq_printf() format string bug found and fixed in many
     drivers all at once.

   - minor bugfixes and changes full details in the shortlog"

* tag 'driver-core-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (35 commits)
  Fix a potential abuse of seq_printf() format string in drivers
  cpu: Remove spurious NULL in attribute_group definition
  s390/con3215: Remove spurious NULL in attribute_group definition
  perf: arm-ni: Remove spurious NULL in attribute_group definition
  driver core: Constify bin_attribute definitions
  sysfs: attribute_group: allow registration of const bin_attribute
  firmware_loader: Fix possible resource leak in fw_log_firmware_info()
  drivers: core: fw_devlink: Fix excess parameter description in docstring
  driver core: class: Correct WARN() message in APIs class_(for_each|find)_device()
  cacheinfo: Use of_property_present() for non-boolean properties
  cdx: Fix cdx_mmap_resource() after constifying attr in ->mmap()
  drivers: core: fw_devlink: Make the error message a bit more useful
  phy: tegra: xusb: Set fwnode for xusb port devices
  drm: display: Set fwnode for aux bus devices
  driver core: fw_devlink: Stop trying to optimize cycle detection logic
  driver core: Constify attribute arguments of binary attributes
  sysfs: bin_attribute: add const read/write callback variants
  sysfs: implement all BIN_ATTR_* macros in terms of __BIN_ATTR()
  sysfs: treewide: constify attribute callback of bin_attribute::llseek()
  sysfs: treewide: constify attribute callback of bin_attribute::mmap()
  ...
2024-11-29 11:43:29 -08:00
Frederic Weisbecker 63dffecfba posix-timers: Target group sigqueue to current task only if not exiting
A sigqueue belonging to a posix timer, which target is not a specific
thread but a whole thread group, is preferrably targeted to the current
task if it is part of that thread group.

However nothing prevents a posix timer event from queueing such a
sigqueue from a reaped yet running task. The interruptible code space
between exit_notify() and the final call to schedule() is enough for
posix_timer_fn() hrtimer to fire.

If that happens while the current task is part of the thread group
target, it is proposed to handle it but since its sighand pointer may
have been cleared already, the sigqueue is dropped even if there are
other tasks running within the group that could handle it.

As a result posix timers with thread group wide target may miss signals
when some of their threads are exiting.

Fix this with verifying that the current task hasn't been through
exit_notify() before proposing it as a preferred target so as to ensure
that its sighand is still here and stable.

complete_signal() might still reconsider the choice and find a better
target within the group if current has passed retarget_shared_pending()
already.

Fixes: bcb7ee7902 ("posix-timers: Prefer delivery of signals to the current thread")
Reported-by: Anthony Mallet <anthony.mallet@laas.fr>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241122234811.60455-1-frederic@kernel.org
Closes: https://lore.kernel.org/all/26411.57288.238690.681680@gargle.gargle.HOWL
2024-11-29 13:19:09 +01:00
Linus Torvalds 7af08b57bc Tracing updates for 6.13:
- Add trace flag for NEED_RESCHED_LAZY
 
   Now that NEED_RESCHED_LAZY is upstream, add it to the status bits of the
   common_flags. This will now show when the NEED_RESCHED_LAZY flag is set that
   is used for debugging latency issues in the kernel via a trace.
 
 - Remove leftover "__idx" variable when SRCU was removed from the tracepoint
   code
 
 - Add rcu_tasks_trace guard
 
   To add a guard() around the tracepoint code, a rcu_tasks_trace guard needs
   to be created first.
 
 - Remove __DO_TRACE() macro and just call __DO_TRACE_CALL() directly
 
   The DO_TRACE() macro has conditional locking depending on what was passed
   into the macro parameters. As the guts of the macro has been moved to
   __DO_TRACE_CALL() to handle static call logic, there's no reason to keep
   the __DO_TRACE() macro around. It is better to just do the locking in
   place without the conditionals and call __DO_TRACE_CALL() from those
   locations. The "cond" passed in can also be moved out of that macro.
   This simplifies the code.
 
 - Remove the "cond" from the system call tracepoint macros
 
   The "cond" variable was added to allow some tracepoints to check a
   condition within the static_branch (jump/nop) logic. The system calls do
   not need this. Removing it simplifies the code.
 
 - Replace scoped_guard() with just guard() in the tracepoint logic
 
   guard() works just as well as scoped_guard() in the tracepoint logic and
   the scoped_guard() causes some issues.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ0dGmBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsZkAP9cm2psIGp2n1BgVjA+0tBRQJUnexEG
 RualDkF5wAETLwD9FNFI/EUwDR/E8gNt0SY309EJZ1ijRiLjtU0spbQmdgs=
 =awid
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull more tracing updates from Steven Rostedt:

 - Add trace flag for NEED_RESCHED_LAZY

   Now that NEED_RESCHED_LAZY is upstream, add it to the status bits of
   the common_flags. This will now show when the NEED_RESCHED_LAZY flag
   is set that is used for debugging latency issues in the kernel via a
   trace.

 - Remove leftover "__idx" variable when SRCU was removed from the
   tracepoint code

 - Add rcu_tasks_trace guard

   To add a guard() around the tracepoint code, a rcu_tasks_trace guard
   needs to be created first.

 - Remove __DO_TRACE() macro and just call __DO_TRACE_CALL() directly

   The DO_TRACE() macro has conditional locking depending on what was
   passed into the macro parameters. As the guts of the macro has been
   moved to __DO_TRACE_CALL() to handle static call logic, there's no
   reason to keep the __DO_TRACE() macro around.

   It is better to just do the locking in place without the conditionals
   and call __DO_TRACE_CALL() from those locations. The "cond" passed in
   can also be moved out of that macro. This simplifies the code.

 - Remove the "cond" from the system call tracepoint macros

   The "cond" variable was added to allow some tracepoints to check a
   condition within the static_branch (jump/nop) logic. The system calls
   do not need this. Removing it simplifies the code.

 - Replace scoped_guard() with just guard() in the tracepoint logic

   guard() works just as well as scoped_guard() in the tracepoint logic
   and the scoped_guard() causes some issues.

* tag 'trace-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Use guard() rather than scoped_guard()
  tracing: Remove cond argument from __DECLARE_TRACE_SYSCALL
  tracing: Remove conditional locking from __DO_TRACE()
  rcupdate_trace: Define rcu_tasks_trace lock guard
  tracing: Remove __idx variable from __DO_TRACE
  tracing: Move it_func[0] comment to the relevant context
  tracing: Record task flag NEED_RESCHED_LAZY.
2024-11-28 11:46:13 -08:00
Marcelo Dalmas f5807b0606 ntp: Remove invalid cast in time offset math
Due to an unsigned cast, adjtimex() returns the wrong offest when using
ADJ_MICRO and the offset is negative. In this case a small negative offset
returns approximately 4.29 seconds (~ 2^32/1000 milliseconds) due to the
unsigned cast of the negative offset.

This cast was added when the kernel internal struct timex was changed to
use type long long for the time offset value to address the problem of a
64bit/32bit division on 32bit systems.

The correct cast would have been (s32), which is correct as time_offset can
only be in the range of [INT_MIN..INT_MAX] because the shift constant used
for calculating it is 32. But that's non-obvious.

Remove the cast and use div_s64() to cure the issue.

[ tglx: Fix white space damage, use div_s64() and amend the change log ]

Fixes: ead25417f8 ("timex: use __kernel_timex internally")
Signed-off-by: Marcelo Dalmas <marcelo.dalmas@ge.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/SJ0P101MB03687BF7D5A10FD3C49C51E5F42E2@SJ0P101MB0368.NAMP101.PROD.OUTLOOK.COM
2024-11-28 12:02:38 +01:00
Fedor Pchelkin aef7ee7649 dma-debug: fix physical address calculation for struct dma_debug_entry
Offset into the page should also be considered while calculating a physical
address for struct dma_debug_entry. page_to_phys() just shifts the value
PAGE_SHIFT bits to the left so offset part is zero-filled.

An example (wrong) debug assertion failure with CONFIG_DMA_API_DEBUG
enabled which is observed during systemd boot process after recent
dma-debug changes:

DMA-API: e1000 0000:00:03.0: cacheline tracking EEXIST, overlapping mappings aren't supported
WARNING: CPU: 4 PID: 941 at kernel/dma/debug.c:596 add_dma_entry
CPU: 4 UID: 0 PID: 941 Comm: ip Not tainted 6.12.0+ #288
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:add_dma_entry kernel/dma/debug.c:596
Call Trace:
 <TASK>
debug_dma_map_page kernel/dma/debug.c:1236
dma_map_page_attrs kernel/dma/mapping.c:179
e1000_alloc_rx_buffers drivers/net/ethernet/intel/e1000/e1000_main.c:4616
...

Found by Linux Verification Center (linuxtesting.org).

Fixes: 9d4f645a1f ("dma-debug: store a phys_addr_t in struct dma_debug_entry")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
[hch: added a little helper to clean up the code]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-11-28 10:19:16 +01:00
Linus Torvalds b5361254c9 Modules changes for v6.13-rc1
Highlights for this merge window:
 
   * The whole caching of module code into huge pages by Mike Rapoport is going
     in through Andrew Morton's tree due to some other code dependencies. That's
     really the biggest highlight for Linux kernel modules in this release. With
     it we share huge pages for modules, starting off with x86. Expect to see that
     soon through Andrew!
 
   * Helge Deller addressed some lingering low hanging fruit alignment
     enhancements by. It is worth pointing out that from his old patch series
     I dropped his vmlinux.lds.h change at Masahiro's request as he would
     prefer this to be specified in asm code [0].
 
     [0] https://lore.kernel.org/all/20240129192644.3359978-5-mcgrof@kernel.org/T/#m9efef5e700fbecd28b7afb462c15eed8ba78ef5a
 
   * Matthew Maurer and Sami Tolvanen have been tag teaming to help
     get us closer to a modversions for Rust. In this cycle we take in
     quite a lot of the refactoring for ELF validation. I expect modversions
     for Rust will be merged by v6.14 as that code is mostly ready now.
 
   * Adds a new modules selftests: kallsyms which helps us tests find_symbol()
     and the limits of kallsyms on Linux today.
 
   * We have a realtime mailing list to kernel-ci testing for modules now
     which relies and combines patchwork, kpd and kdevops:
 
     - https://patchwork.kernel.org/project/linux-modules/list/
     - https://github.com/linux-kdevops/kdevops/blob/main/docs/kernel-ci/README.md
     - https://github.com/linux-kdevops/kdevops/blob/main/docs/kernel-ci/kernel-ci-kpd.md
     - https://github.com/linux-kdevops/kdevops/blob/main/docs/kernel-ci/linux-modules-kdevops-ci.md
 
     If you want to help avoid Linux kernel modules regressions, now its simple,
     just add a new Linux modules sefltests under tools/testing/selftests/module/
     That is it. All new selftests will be used and leveraged automatically by
     the CI.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmdGbrcSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinIDEQAMa1H7hsneNT0Z/YewzOfdSKZIkTzpk3
 /fLl7PfWyFvk7yHT1JiUXidS/80SEMnWb+u8Sn00/uvcJomnPcK9oTwTzBQ0vefl
 FWIUM0DmBzBOi5xdjrPLjg5o6TFt7hVae3hoRJzIlLD02vGfrPYpyHo7XmRrLM4C
 8p+3geziwZMpjcGM254eSiTGxNL8z1iZVRsz8QrrBruRfBDnHNgwtmK097v13Xdb
 qmLX6CN2irmNPZSZwDqP8QL2sJk9qQpNdPmpjMvaY3VfaMVkM46FLy0k9yeXXNqw
 E1p/GuylCZq4NG1hic9zB1I1CE910ugCztJnPcGw4C7CSm54YoLiUJrIeRyTZhk6
 et9N25AlJHxyq72GIRTMQCA9Njxaavx5KilvuWYZmaILfeI0k/3gvcxUqp/EJQ9Q
 axPu69HJFRSKMVh1o+QrSaPmEtSydpYwuuNJ6ONRpq5I3bzOVDSCroceAdXEMO9K
 yoSfm4KwN/BSnmX6KVLonrSM91nv2/v9UokuaZMV/CsDpXIZs996PvAoopCm1Twb
 K3fv0uD+2q2FTOOBInkuRJo2zBUvNnDRPAS2pE3DMXy8xhsQXdovEpjijuCGb8eC
 y0R+I4RIugIB2n6YBUFfyma1veGlT3PtrWQnO6E3YJpv8bqIJoYVT5IGo9M9YRO9
 lzjtR9NzGtmh
 =Ny84
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull modules updates from Luis Chamberlain:

 - The whole caching of module code into huge pages by Mike Rapoport is
   going in through Andrew Morton's tree due to some other code
   dependencies. That's really the biggest highlight for Linux kernel
   modules in this release. With it we share huge pages for modules,
   starting off with x86. Expect to see that soon through Andrew!

 - Helge Deller addressed some lingering low hanging fruit alignment
   enhancements by. It is worth pointing out that from his old patch
   series I dropped his vmlinux.lds.h change at Masahiro's request as he
   would prefer this to be specified in asm code [0].

    [0] https://lore.kernel.org/all/20240129192644.3359978-5-mcgrof@kernel.org/T/#m9efef5e700fbecd28b7afb462c15eed8ba78ef5a

 - Matthew Maurer and Sami Tolvanen have been tag teaming to help get us
   closer to a modversions for Rust. In this cycle we take in quite a
   lot of the refactoring for ELF validation. I expect modversions for
   Rust will be merged by v6.14 as that code is mostly ready now.

 - Adds a new modules selftests: kallsyms which helps us tests
   find_symbol() and the limits of kallsyms on Linux today.

 - We have a realtime mailing list to kernel-ci testing for modules now
   which relies and combines patchwork, kpd and kdevops:

     https://patchwork.kernel.org/project/linux-modules/list/
     https://github.com/linux-kdevops/kdevops/blob/main/docs/kernel-ci/README.md
     https://github.com/linux-kdevops/kdevops/blob/main/docs/kernel-ci/kernel-ci-kpd.md
     https://github.com/linux-kdevops/kdevops/blob/main/docs/kernel-ci/linux-modules-kdevops-ci.md

   If you want to help avoid Linux kernel modules regressions, now its
   simple, just add a new Linux modules sefltests under
   tools/testing/selftests/module/ That is it. All new selftests will be
   used and leveraged automatically by the CI.

* tag 'modules-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  tests/module/gen_test_kallsyms.sh: use 0 value for variables
  scripts: Remove export_report.pl
  selftests: kallsyms: add MODULE_DESCRIPTION
  selftests: add new kallsyms selftests
  module: Reformat struct for code style
  module: Additional validation in elf_validity_cache_strtab
  module: Factor out elf_validity_cache_strtab
  module: Group section index calculations together
  module: Factor out elf_validity_cache_index_str
  module: Factor out elf_validity_cache_index_sym
  module: Factor out elf_validity_cache_index_mod
  module: Factor out elf_validity_cache_index_info
  module: Factor out elf_validity_cache_secstrings
  module: Factor out elf_validity_cache_sechdrs
  module: Factor out elf_validity_ehdr
  module: Take const arg in validate_section_offset
  modules: Add missing entry for __ex_table
  modules: Ensure 64-bit alignment on __ksymtab_* sections
2024-11-27 10:20:50 -08:00
Christian Brauner 3b83203538
Revert "fs: don't block i_writecount during exec"
This reverts commit 2a010c4128.

Rui Ueyama <rui314@gmail.com> writes:

> I'm the creator and the maintainer of the mold linker
> (https://github.com/rui314/mold). Recently, we discovered that mold
> started causing process crashes in certain situations due to a change
> in the Linux kernel. Here are the details:
>
> - In general, overwriting an existing file is much faster than
> creating an empty file and writing to it on Linux, so mold attempts to
> reuse an existing executable file if it exists.
>
> - If a program is running, opening the executable file for writing
> previously failed with ETXTBSY. If that happens, mold falls back to
> creating a new file.
>
> - However, the Linux kernel recently changed the behavior so that
> writing to an executable file is now always permitted
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a010c412853).
>
> That caused mold to write to an executable file even if there's a
> process running that file. Since changes to mmap'ed files are
> immediately visible to other processes, any processes running that
> file would almost certainly crash in a very mysterious way.
> Identifying the cause of these random crashes took us a few days.
>
> Rejecting writes to an executable file that is currently running is a
> well-known behavior, and Linux had operated that way for a very long
> time. So, I don’t believe relying on this behavior was our mistake;
> rather, I see this as a regression in the Linux kernel.

Quoting myself from commit 2a010c4128 ("fs: don't block i_writecount during exec")

> Yes, someone in userspace could potentially be relying on this. It's not
> completely out of the realm of possibility but let's find out if that's
> actually the case and not guess.

It seems we found out that someone is relying on this obscure behavior.
So revert the change.

Link: https://github.com/rui314/mold/issues/1361
Link: https://lore.kernel.org/r/4a2bc207-76be-4715-8e12-7fc45a76a125@leemhuis.info
Cc: <stable@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-27 12:51:30 +01:00
Linus Torvalds f5f4745a7f - The series "resource: A couple of cleanups" from Andy Shevchenko
performs some cleanups in the resource management code.
 
 - The series "Improve the copy of task comm" from Yafang Shao addresses
   possible race-induced overflows in the management of task_struct.comm[].
 
 - The series "Remove unnecessary header includes from
   {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
   small fix to the list_sort library code and to its selftest.
 
 - The series "Enhance min heap API with non-inline functions and
   optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
   min_heap library code.
 
 - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
   finishes off nilfs2's folioification.
 
 - The series "add detect count for hung tasks" from Lance Yang adds more
   userspace visibility into the hung-task detector's activity.
 
 - Apart from that, singelton patches in many places - please see the
   individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ0L6lQAKCRDdBJ7gKXxA
 jmEIAPwMSglNPKRIOgzOvHh8MUJW1Dy8iKJ2kWCO3f6QTUIM2AEA+PazZbUd/g2m
 Ii8igH0UBibIgva7MrCyJedDI1O23AA=
 =8BIU
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "resource: A couple of cleanups" from Andy Shevchenko
   performs some cleanups in the resource management code

 - The series "Improve the copy of task comm" from Yafang Shao addresses
   possible race-induced overflows in the management of
   task_struct.comm[]

 - The series "Remove unnecessary header includes from
   {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
   small fix to the list_sort library code and to its selftest

 - The series "Enhance min heap API with non-inline functions and
   optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
   min_heap library code

 - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
   finishes off nilfs2's folioification

 - The series "add detect count for hung tasks" from Lance Yang adds
   more userspace visibility into the hung-task detector's activity

 - Apart from that, singelton patches in many places - please see the
   individual changelogs for details

* tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits)
  gdb: lx-symbols: do not error out on monolithic build
  kernel/reboot: replace sprintf() with sysfs_emit()
  lib: util_macros_kunit: add kunit test for util_macros.h
  util_macros.h: fix/rework find_closest() macros
  Improve consistency of '#error' directive messages
  ocfs2: fix uninitialized value in ocfs2_file_read_iter()
  hung_task: add docs for hung_task_detect_count
  hung_task: add detect count for hung tasks
  dma-buf: use atomic64_inc_return() in dma_buf_getfile()
  fs/proc/kcore.c: fix coccinelle reported ERROR instances
  resource: avoid unnecessary resource tree walking in __region_intersects()
  ocfs2: remove unused errmsg function and table
  ocfs2: cluster: fix a typo
  lib/scatterlist: use sg_phys() helper
  checkpatch: always parse orig_commit in fixes tag
  nilfs2: convert metadata aops from writepage to writepages
  nilfs2: convert nilfs_recovery_copy_block() to take a folio
  nilfs2: convert nilfs_page_count_clean_buffers() to take a folio
  nilfs2: remove nilfs_writepage
  nilfs2: convert checkpoint file to be folio-based
  ...
2024-11-25 16:09:48 -08:00
Maciej Fijalkowski ab244dd7cf bpf: fix OOB devmap writes when deleting elements
Jordy reported issue against XSKMAP which also applies to DEVMAP - the
index used for accessing map entry, due to being a signed integer,
causes the OOB writes. Fix is simple as changing the type from int to
u32, however, when compared to XSKMAP case, one more thing needs to be
addressed.

When map is released from system via dev_map_free(), we iterate through
all of the entries and an iterator variable is also an int, which
implies OOB accesses. Again, change it to be u32.

Example splat below:

[  160.724676] BUG: unable to handle page fault for address: ffffc8fc2c001000
[  160.731662] #PF: supervisor read access in kernel mode
[  160.736876] #PF: error_code(0x0000) - not-present page
[  160.742095] PGD 0 P4D 0
[  160.744678] Oops: Oops: 0000 [#1] PREEMPT SMP
[  160.749106] CPU: 1 UID: 0 PID: 520 Comm: kworker/u145:12 Not tainted 6.12.0-rc1+ #487
[  160.757050] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[  160.767642] Workqueue: events_unbound bpf_map_free_deferred
[  160.773308] RIP: 0010:dev_map_free+0x77/0x170
[  160.777735] Code: 00 e8 fd 91 ed ff e8 b8 73 ed ff 41 83 7d 18 19 74 6e 41 8b 45 24 49 8b bd f8 00 00 00 31 db 85 c0 74 48 48 63 c3 48 8d 04 c7 <48> 8b 28 48 85 ed 74 30 48 8b 7d 18 48 85 ff 74 05 e8 b3 52 fa ff
[  160.796777] RSP: 0018:ffffc9000ee1fe38 EFLAGS: 00010202
[  160.802086] RAX: ffffc8fc2c001000 RBX: 0000000080000000 RCX: 0000000000000024
[  160.809331] RDX: 0000000000000000 RSI: 0000000000000024 RDI: ffffc9002c001000
[  160.816576] RBP: 0000000000000000 R08: 0000000000000023 R09: 0000000000000001
[  160.823823] R10: 0000000000000001 R11: 00000000000ee6b2 R12: dead000000000122
[  160.831066] R13: ffff88810c928e00 R14: ffff8881002df405 R15: 0000000000000000
[  160.838310] FS:  0000000000000000(0000) GS:ffff8897e0c40000(0000) knlGS:0000000000000000
[  160.846528] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  160.852357] CR2: ffffc8fc2c001000 CR3: 0000000005c32006 CR4: 00000000007726f0
[  160.859604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  160.866847] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  160.874092] PKRU: 55555554
[  160.876847] Call Trace:
[  160.879338]  <TASK>
[  160.881477]  ? __die+0x20/0x60
[  160.884586]  ? page_fault_oops+0x15a/0x450
[  160.888746]  ? search_extable+0x22/0x30
[  160.892647]  ? search_bpf_extables+0x5f/0x80
[  160.896988]  ? exc_page_fault+0xa9/0x140
[  160.900973]  ? asm_exc_page_fault+0x22/0x30
[  160.905232]  ? dev_map_free+0x77/0x170
[  160.909043]  ? dev_map_free+0x58/0x170
[  160.912857]  bpf_map_free_deferred+0x51/0x90
[  160.917196]  process_one_work+0x142/0x370
[  160.921272]  worker_thread+0x29e/0x3b0
[  160.925082]  ? rescuer_thread+0x4b0/0x4b0
[  160.929157]  kthread+0xd4/0x110
[  160.932355]  ? kthread_park+0x80/0x80
[  160.936079]  ret_from_fork+0x2d/0x50
[  160.943396]  ? kthread_park+0x80/0x80
[  160.950803]  ret_from_fork_asm+0x11/0x20
[  160.958482]  </TASK>

Fixes: 546ac1ffb7 ("bpf: add devmap, a map for storing net device references")
CC: stable@vger.kernel.org
Reported-by: Jordy Zomer <jordyzomer@google.com>
Suggested-by: Jordy Zomer <jordyzomer@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20241122121030.716788-3-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25 14:25:48 -08:00
Thomas Weißschuh 8618f5ffba bpf, lsm: Remove getlsmprop hooks BTF IDs
These hooks are not useful for BPF LSM currently.
Furthermore a recent renaming introduced build warnings:

  BTFIDS  vmlinux
WARN: resolve_btfids: unresolved symbol bpf_lsm_task_getsecid_obj
WARN: resolve_btfids: unresolved symbol bpf_lsm_current_getsecid_subj

Link: https://lore.kernel.org/lkml/20241123-bpf_lsm_task_getsecid_obj-v1-1-0d0f94649e05@weissschuh.net/
Fixes: 37f670aacd ("lsm: use lsm_prop in security_current_getsecid")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20241125-bpf_lsm_task_getsecid_obj-v2-1-c8395bde84e0@weissschuh.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25 14:14:17 -08:00
Linus Torvalds 43a43faf53 futex: improve user space accesses
Josh Poimboeuf reports that he got a "will-it-scale.per_process_ops 1.9%
improvement" report for his patch that changed __get_user() to use
pointer masking instead of the explicit speculation barrier.  However,
that patch doesn't actually work in the general case, because some (very
bad) architecture-specific code actually depends on __get_user() also
working on kernel addresses.

A profile showed that the offending __get_user() was the futex code,
which really should be fixed up to not use that horrid legacy case.
Rewrite futex_get_value_locked() to use the modern user acccess helpers,
and inline it so that the compiler not only avoids the function call for
a few instructions, but can do CSE on the address masking.

It also turns out the x86 futex functions have unnecessary barriers in
other places, so let's fix those up too.

Link: https://lore.kernel.org/all/20241115230653.hfvzyf3aqqntgp63@jpoimboe/
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-11-25 12:11:55 -08:00