Commit Graph

15759 Commits (master)

Author SHA1 Message Date
Linus Torvalds 36ae1c45b2 Scheduler changes for v7.0:
Scheduler Kconfig space updates:
 
  - Further consolidate configurable preemption modes: reduce
    the number of architectures that are allowed to offer
    PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number
    of preemption models from four to just two: 'full' and 'lazy'
    on up-to-date architectures (arm64, loongarch, powerpc,
    riscv, s390, x86).
 
    None and voluntary are only available as legacy features
    on platforms that don't implement lazy preemption yet,
    or which don't even support preemption.
 
    The goal is to eventually remove cond_resched() and
    voluntary preemption altogether.
 
    (Peter Zijlstra)
 
 RSEQ based 'scheduler time slice extension' support:
 
 This allows a thread to request a time slice extension when it
 enters a critical section to avoid contention on a resource when
 the thread is scheduled out inside of the critical section.
 
  - Add fields and constants for time slice extension
  - Provide static branch for time slice extensions
  - Add statistics for time slice extensions
  - Add prctl() to enable time slice extensions
  - Implement sys_rseq_slice_yield()
  - Implement syscall entry work for time slice extensions
  - Implement time slice extension enforcement timer
  - Reset slice extension when scheduled
  - Implement rseq_grant_slice_extension()
  - entry: Hook up rseq time slice extension
  - selftests: Implement time slice extension test
 
    (Thomas Gleixner)
 
  - Allow registering RSEQ with slice extension
  - Move slice_ext_nsec to debugfs
  - Lower default slice extension
  - selftests/rseq: Add rseq slice histogram script
 
    (Peter Zijlstra)
 
 Scheduler performance/scalability improvements:
 
  - Update rq->avg_idle when a task is moved to an idle CPU,
    which improves the scalability of various workloads.
    (Shubhang Kaushik)
 
  - Reorder fields in 'struct rq' for better caching
    (Blake Jones)
 
  - Fair scheduler SMP NOHZ balancing code speedups:
 
    - Move checking for nohz cpus after time check
    - Change likelyhood of nohz.nr_cpus
    - Remove nohz.nr_cpus and use weight of cpumask instead
 
      (Shrikanth Hegde)
 
  - Avoid false sharing for sched_clock_irqtime (Wangyang Guo)
 
  - Drop useless cpumask_empty() in find_energy_efficient_cpu()
  - Simplify task_numa_find_cpu()
  - Use cpumask_weight_and() in sched_balance_find_dst_group()
 
    (Yury Norov)
 
 DL scheduler updates:
 
  - Add a deadline server for sched_ext tasks (by Andrea Righi and
    Joel Fernandes, with fixes by Peter Zijlstra)
 
 RT scheduler updates:
 
  - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)
 
 Entry code updates and performance improvements, which is part of the
 scheduler tree in this cycle due to interdependencies with the RSEQ
 based time slice extension work:
 
   - Remove unused syscall argument from syscall_trace_enter()
   - Rework syscall_exit_to_user_mode_work() for architecture reuse
   - Add arch_ptrace_report_syscall_entry/exit()
   - Inline syscall_exit_work() and syscall_trace_enter()
 
     (Jinjie Ruan)
 
 Scheduler core updates:
 
  - Rework sched_class::wakeup_preempt() and rq_modified_*()
  - Avoid rq->lock bouncing in sched_balance_newidle()
  - Rename rcu_dereference_check_sched_domain() =>
           rcu_dereference_sched_domain()
  - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
 
    (Peter Zijlstra)
 
 Fair scheduler updates/refactoring:
 
  - Fold the sched_avg update
  - Change rcu_dereference_check_sched_domain() to rcu-sched
  - Switch to rcu_dereference_all()
  - Remove superfluous rcu_read_lock()
  - Limit hrtick work
 
    (Peter Zijlstra)
 
  - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
  - Clean up comments in 'struct cfs_rq'
  - Separate se->vlag from se->vprot
  - Rename cfs_rq::avg_load to cfs_rq::sum_weight
  - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
  - Introduce and use the vruntime_cmp() and vruntime_op() wrappers
    for wrapped-signed aritmetics
  - Sort out 'blocked_load*' namespace noise
 
    (Ingo Molnar)
 
 Scheduler debugging code updates:
 
  - Export hidden tracepoints to modules (Gabriele Monaco)
 
  - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
    (Fushuai Wang)
 
  - Add assertions to QUEUE_CLASS (Peter Zijlstra)
 
  - hrtimer: Fix tracing oddity (Thomas Gleixner)
 
 Misc fixes and cleanups:
 
  - Re-evaluate scheduling when migrating queued tasks out of
    throttled cgroups (Zicheng Qu)
 
  - Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
 
  - Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
 
  - sched/cpufreq: Use %pe format for PTR_ERR() printing (zenghongling)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmJj+IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1grtQ//WyXYGVE/WicdqslfaCY2Mr0uJnL0tLSM
 CJp+0LROdkmy+ChJmftO8RgjCUSsjhC4/xcBhUQXApf/ffQi3b2jH6nkTp/Z64Ms
 p2IXLkBiZjwdcO6fGbB0JE2G1J4hGRC5BlqfgkZzWidMf3kIbmrHg99mVWGzODLY
 N/cPW4d0WGf9TScl1FgEiOqgF3czMLlqvTDJqaFMpsTzSUcRBnrG4xushb4W/bBx
 573eqxgZJ6urNSGu8niY9PAl9F7gskXW3YxI3k8SH7VmJKSevWlwI9vMEhcRDzud
 E0XxD7J8iPOKtr7ypXm7anMBv4jWVUdAnPbYi4TDsyDDU/HguqMqT1McTGn8wQ+F
 jmdhmMC9/TEIzq93SNLbCYieibqDsmJoNVFFi0FWfPLMtYbcZd5a884SIz532vx4
 DegdlDXdazUwhxzDiQR3sq1CsHXpxNS2YdrpadAtF/r2gU86DQjsEew8yBvXi7bb
 Wrkzpax70sU1AFI23wJQkEb/OnnXyehAHAhhQN6GVvuiGr9P7C02WLEGLlmSmJrx
 zl2F750P76yhTfGcvTfJ/5LTfSB+yRozGvcdXnIkyzWotY6a2D1MKNusAfVax+IR
 kyfAWqVdxBhlKnqYbu92lTogvnPh3Lymd6G4TZZRkSH2jixyGd2oS7nZaDBAeBEM
 NHQtr9R+KyU=
 =Xj2f
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Scheduler Kconfig space updates:

   - Further consolidate configurable preemption modes (Peter Zijlstra)

     Reduce the number of architectures that are allowed to offer
     PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
     preemption models from four to just two: 'full' and 'lazy' on
     up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
     x86).

     None and voluntary are only available as legacy features on
     platforms that don't implement lazy preemption yet, or which don't
     even support preemption.

     The goal is to eventually remove cond_resched() and voluntary
     preemption altogether.

  RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
  and Peter Zijlstra):

  This allows a thread to request a time slice extension when it enters
  a critical section to avoid contention on a resource when the thread
  is scheduled out inside of the critical section.

   - Add fields and constants for time slice extension
   - Provide static branch for time slice extensions
   - Add statistics for time slice extensions
   - Add prctl() to enable time slice extensions
   - Implement sys_rseq_slice_yield()
   - Implement syscall entry work for time slice extensions
   - Implement time slice extension enforcement timer
   - Reset slice extension when scheduled
   - Implement rseq_grant_slice_extension()
   - entry: Hook up rseq time slice extension
   - selftests: Implement time slice extension test
   - Allow registering RSEQ with slice extension
   - Move slice_ext_nsec to debugfs
   - Lower default slice extension
   - selftests/rseq: Add rseq slice histogram script

  Scheduler performance/scalability improvements:

   - Update rq->avg_idle when a task is moved to an idle CPU, which
     improves the scalability of various workloads (Shubhang Kaushik)

   - Reorder fields in 'struct rq' for better caching (Blake Jones)

   - Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
      - Move checking for nohz cpus after time check
      - Change likelyhood of nohz.nr_cpus
      - Remove nohz.nr_cpus and use weight of cpumask instead

   - Avoid false sharing for sched_clock_irqtime (Wangyang Guo)

   - Cleanups (Yury Norov):
      - Drop useless cpumask_empty() in find_energy_efficient_cpu()
      - Simplify task_numa_find_cpu()
      - Use cpumask_weight_and() in sched_balance_find_dst_group()

  DL scheduler updates:

   - Add a deadline server for sched_ext tasks (by Andrea Righi and Joel
     Fernandes, with fixes by Peter Zijlstra)

  RT scheduler updates:

   - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)

  Entry code updates and performance improvements (Jinjie Ruan)

  This is part of the scheduler tree in this cycle due to inter-
  dependencies with the RSEQ based time slice extension work:

    - Remove unused syscall argument from syscall_trace_enter()
    - Rework syscall_exit_to_user_mode_work() for architecture reuse
    - Add arch_ptrace_report_syscall_entry/exit()
    - Inline syscall_exit_work() and syscall_trace_enter()

  Scheduler core updates (Peter Zijlstra):

   - Rework sched_class::wakeup_preempt() and rq_modified_*()
   - Avoid rq->lock bouncing in sched_balance_newidle()
   - Rename rcu_dereference_check_sched_domain() =>
            rcu_dereference_sched_domain()
   - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper

  Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):

   - Fold the sched_avg update
   - Change rcu_dereference_check_sched_domain() to rcu-sched
   - Switch to rcu_dereference_all()
   - Remove superfluous rcu_read_lock()
   - Limit hrtick work
   - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
   - Clean up comments in 'struct cfs_rq'
   - Separate se->vlag from se->vprot
   - Rename cfs_rq::avg_load to cfs_rq::sum_weight
   - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
   - Introduce and use the vruntime_cmp() and vruntime_op() wrappers for
     wrapped-signed aritmetics
   - Sort out 'blocked_load*' namespace noise

  Scheduler debugging code updates:

   - Export hidden tracepoints to modules (Gabriele Monaco)

   - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
     (Fushuai Wang)

   - Add assertions to QUEUE_CLASS (Peter Zijlstra)

   - hrtimer: Fix tracing oddity (Thomas Gleixner)

  Misc fixes and cleanups:

   - Re-evaluate scheduling when migrating queued tasks out of throttled
     cgroups (Zicheng Qu)

   - Remove task_struct->faults_disabled_mapping (Christoph Hellwig)

   - Fix math notation errors in avg_vruntime comment (Zhan Xusheng)

   - sched/cpufreq: Use %pe format for PTR_ERR() printing
     (zenghongling)"

* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
  sched/cpufreq: Use %pe format for PTR_ERR() printing
  sched/rt: Skip currently executing CPU in rto_next_cpu()
  sched/clock: Avoid false sharing for sched_clock_irqtime
  selftests/sched_ext: Add test for DL server total_bw consistency
  selftests/sched_ext: Add test for sched_ext dl_server
  sched/debug: Fix dl_server (re)start conditions
  sched/debug: Add support to change sched_ext server params
  sched_ext: Add a DL server for sched_ext tasks
  sched/debug: Stop and start server based on if it was active
  sched/debug: Fix updating of ppos on server write ops
  sched/deadline: Clear the defer params
  entry: Inline syscall_exit_work() and syscall_trace_enter()
  entry: Add arch_ptrace_report_syscall_entry/exit()
  entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
  entry: Remove unused syscall argument from syscall_trace_enter()
  sched: remove task_struct->faults_disabled_mapping
  sched: Update rq->avg_idle when a task is moved to an idle CPU
  selftests/rseq: Add rseq slice histogram script
  hrtimer: Fix trace oddity
  ...
2026-02-10 12:50:10 -08:00
Linus Torvalds 4d84667627 Performance events changes for v7.0:
x86 PMU driver updates:
 
  - Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs.
    Compared to previous iterations of the Intel PMU code, there's
    been a lot of changes, which center around three main areas:
 
     - Introduce the OFF-MODULE RESPONSE (OMR) facility to
       replace the Off-Core Response (OCR) facility
 
     - New PEBS data source encoding layout
 
     - Support the new "RDPMC user disable" feature
 
    (Dapeng Mi)
 
  - Likewise, a large series adds uncore PMU support for
    Intel Diamond Rapids (DMR) CPUs, which center around these
    four main areas:
 
     - DMR may have two Integrated I/O and Memory Hub (IMH) dies,
       separate from the compute tile (CBB) dies.  Each CBB and
       each IMH die has its own discovery domain.
 
     - Unlike prior CPUs that retrieve the global discovery table
       portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
       discovery and MSR for CBB PMON discovery.
 
     - DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA,
       UBR, PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
 
     - IIO free-running counters in DMR are MMIO-based, unlike SPR.
 
    (Zide Chen)
 
  - Also add support for Add missing PMON units for Intel Panther Lake,
    and support Nova Lake (NVL), which largely maps to Panther Lake.
    (Zide Chen)
 
  - KVM integration: Add support for mediated vPMUs (by Kan Liang
    and Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
    Sandipan Das and Mingwei Zhang)
 
  - Add Intel cstate driver to support for Wildcat Lake (WCL)
    CPUs, which are a low-power variant of Panther Lake.
    (Zide Chen)
 
  - Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
    (aka MaxLinear Lightning Mountain), which maps to the existing
    Airmont code. (Martin Schiller)
 
 Performance enhancements:
 
  - core: Speed up kexec shutdown by avoiding unnecessary
    cross CPU calls. (Jan H. Schönherr)
 
  - core: Fix slow perf_event_task_exit() with LBR callstacks
    (Namhyung Kim)
 
 User-space stack unwinding support:
 
  - Various cleanups and refactorings in preparation to generalize
    the unwinding code for other architectures. (Jens Remus)
 
 Uprobes updates:
 
  - Transition from kmap_atomic to kmap_local_page (Keke Ming)
 
  - Fix incorrect lockdep condition in filter_chain() (Breno Leitao)
 
  - Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)
 
 Misc fixes and cleanups:
 
  - s390: Remove kvm_types.h from Kbuild (Randy Dunlap)
 
  - x86/intel/uncore: Convert comma to semicolon (Chen Ni)
 
  - x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)
 
  - x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmJhTURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i/qw/9F/sjMqbxH8d3kPXB8wk2eUuSkynQ2aNw
 Zec9qtfCC5N1U9b2D7ywGJRscTmWYnX/3BKTyzFuyA6SDz6buAgDDIGPlHi+9Fww
 +RUUS3lQ7N3pVWZ4Ifu3kbh3Vz4lkQuOXhfcjiyIMS6QIxfrcSLFoKHK+2V6PeU+
 x0k+THHz/Ymg+DIpqSjqil1yrKaUmU9xRrbnyy6zJB1duREQrkYBhIWL1+bcd7SA
 89RVAGXQ+sWzVQMPaKrMkZj6GavOCB7zseigiiwjBRLznukS2OulDDe8zR6pCJZp
 wbdc7TR/nCm+QtNfkHlOmTQvsPAXiXNyXe5Vi8aFjGc0uMGhHaeiL9ah/bwsKA5m
 Bm5Y7oVSmBlCJbcr/CTrGYkb+WwvLIgPCwVkn4FYPlsWv+U92qTOx9q7qKmIDFaj
 1oUXCwoHbYrYnoZqZqPp2h689m0Lh/lsGhy0QRt8aGnKu0SDjMaqRGbYWH0UI0kA
 aDnZstVHG76RnVi0143q8HcvvjNZb82NL0cS749tY/YcwH4kUEGj2XuSK2Ar887T
 H0oDJHXijMlGXWqO5bK3WMoCQajR7nRyqBo7/rKYj20OjXmwoXS3vel77/s8WFo2
 fUUC469MacDzyxdBNutnkJvvcvsUio3r4MWFsEEWQk2nUE58PtN8YM8j/FdNYpql
 zAZ4Jx/A0RM=
 =4SRB
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance event updates from Ingo Molnar:
 "x86 PMU driver updates:

   - Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs
     (Dapeng Mi)

     Compared to previous iterations of the Intel PMU code, there's been
     a lot of changes, which center around three main areas:

      - Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the
        Off-Core Response (OCR) facility

      - New PEBS data source encoding layout

      - Support the new "RDPMC user disable" feature

   - Likewise, a large series adds uncore PMU support for Intel Diamond
     Rapids (DMR) CPUs (Zide Chen)

     This centers around these four main areas:

      - DMR may have two Integrated I/O and Memory Hub (IMH) dies,
        separate from the compute tile (CBB) dies. Each CBB and each IMH
        die has its own discovery domain.

      - Unlike prior CPUs that retrieve the global discovery table
        portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
        discovery and MSR for CBB PMON discovery.

      - DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR,
        PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.

      - IIO free-running counters in DMR are MMIO-based, unlike SPR.

   - Also add support for Add missing PMON units for Intel Panther Lake,
     and support Nova Lake (NVL), which largely maps to Panther Lake.
     (Zide Chen)

   - KVM integration: Add support for mediated vPMUs (by Kan Liang and
     Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
     Sandipan Das and Mingwei Zhang)

   - Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs,
     which are a low-power variant of Panther Lake (Zide Chen)

   - Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
     (aka MaxLinear Lightning Mountain), which maps to the existing
     Airmont code (Martin Schiller)

  Performance enhancements:

   - Speed up kexec shutdown by avoiding unnecessary cross CPU calls
     (Jan H. Schönherr)

   - Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim)

  User-space stack unwinding support:

   - Various cleanups and refactorings in preparation to generalize the
     unwinding code for other architectures (Jens Remus)

  Uprobes updates:

   - Transition from kmap_atomic to kmap_local_page (Keke Ming)

   - Fix incorrect lockdep condition in filter_chain() (Breno Leitao)

   - Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)

  Misc fixes and cleanups:

   - s390: Remove kvm_types.h from Kbuild (Randy Dunlap)

   - x86/intel/uncore: Convert comma to semicolon (Chen Ni)

   - x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)

   - x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)"

* tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  s390: remove kvm_types.h from Kbuild
  uprobes: Fix incorrect lockdep condition in filter_chain()
  x86/ibs: Fix typo in dc_l2tlb_miss comment
  x86/uprobes: Fix XOL allocation failure for 32-bit tasks
  perf/x86/intel/uncore: Convert comma to semicolon
  perf/x86/intel: Add support for rdpmc user disable feature
  perf/x86: Use macros to replace magic numbers in attr_rdpmc
  perf/x86/intel: Add core PMU support for Novalake
  perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL
  perf/x86/intel: Add core PMU support for DMR
  perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR
  perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL
  perf/core: Fix slow perf_event_task_exit() with LBR callstacks
  perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls
  uprobes: use kmap_local_page() for temporary page mappings
  arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
  mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
  arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
  riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
  perf/x86/intel/uncore: Add Nova Lake support
  ...
2026-02-10 12:00:46 -08:00
Linus Torvalds f17b474e36 bpf-next-7.0
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmGmrgACgkQ6rmadz2v
 bTq6NxAAkCHosxzGn9GYYBV8xhrBJoJJDCyEbQ4nR0XNY+zaWnuykmiPP9w1aOAM
 zm/po3mQB2pZjetvlrPrgG5RLgBCAUHzqVGy0r+phUvD3vbohKlmSlMm2kiXOb9N
 T01BgLWsyqN2ZcNFvORdSsftqIJUHcXxU6RdupGD60sO5XM9ty5cwyewLX8GBOas
 UN2bOhbK2DpqYWUvtv+3Q3ykxoStMSkXZvDRurwLKl4RHeLjXZXPo8NjnfBlk/F2
 vdFo/F4NO4TmhOave6UPXvKb4yo9IlBRmiPAl0RmNKBxenY8j9XuV/xZxU6YgzDn
 +SQfDK+CKQ4IYIygE+fqd4e5CaQrnjmPPcIw12AB2CF0LimY9Xxyyk6FSAhMN7wm
 GTVh5K2C3Dk3OiRQk4G58EvQ5QcxzX98IeeCpcckMUkPsFWHRvF402WMUcv9SWpD
 DsxxPkfENY/6N67EvH0qcSe/ikdUorQKFl4QjXKwsMCd5WhToeP4Z7Ck1gVSNkAh
 9CX++mLzg333Lpsc4SSIuk9bEPpFa5cUIKUY7GCsCiuOXciPeMDP3cGSd5LioqxN
 qWljs4Z88QDM2LJpAh8g4m3sA7bMhES3nPmdlI5CfgBcVyLW8D8CqQq4GEZ1McwL
 Ky084+lEosugoVjRejrdMMEOsqAfcbkTr2b8jpuAZdwJKm6p/bw=
 =cBdK
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Support associating BPF program with struct_ops (Amery Hung)

 - Switch BPF local storage to rqspinlock and remove recursion detection
   counters which were causing false positives (Amery Hung)

 - Fix live registers marking for indirect jumps (Anton Protopopov)

 - Introduce execution context detection BPF helpers (Changwoo Min)

 - Improve verifier precision for 32bit sign extension pattern
   (Cupertino Miranda)

 - Optimize BTF type lookup by sorting vmlinux BTF and doing binary
   search (Donglin Peng)

 - Allow states pruning for misc/invalid slots in iterator loops (Eduard
   Zingerman)

 - In preparation for ASAN support in BPF arenas teach libbpf to move
   global BPF variables to the end of the region and enable arena kfuncs
   while holding locks (Emil Tsalapatis)

 - Introduce support for implicit arguments in kfuncs and migrate a
   number of them to new API. This is a prerequisite for cgroup
   sub-schedulers in sched-ext (Ihor Solodrai)

 - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)

 - Fix ORC stack unwind from kprobe_multi (Jiri Olsa)

 - Speed up fentry attach by using single ftrace direct ops in BPF
   trampolines (Jiri Olsa)

 - Require frozen map for calculating map hash (KP Singh)

 - Fix lock entry creation in TAS fallback in rqspinlock (Kumar
   Kartikeya Dwivedi)

 - Allow user space to select cpu in lookup/update operations on per-cpu
   array and hash maps (Leon Hwang)

 - Make kfuncs return trusted pointers by default (Matt Bobrowski)

 - Introduce "fsession" support where single BPF program is executed
   upon entry and exit from traced kernel function (Menglong Dong)

 - Allow bpf_timer and bpf_wq use in all programs types (Mykyta
   Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
   Starovoitov)

 - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
   definition across the tree (Puranjay Mohan)

 - Allow BPF arena calls from non-sleepable context (Puranjay Mohan)

 - Improve register id comparison logic in the verifier and extend
   linked registers with negative offsets (Puranjay Mohan)

 - In preparation for BPF-OOM introduce kfuncs to access memcg events
   (Roman Gushchin)

 - Use CFI compatible destructor kfunc type (Sami Tolvanen)

 - Add bitwise tracking for BPF_END in the verifier (Tianci Cao)

 - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
   Tang)

 - Make BPF selftests work with 64k page size (Yonghong Song)

* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
  selftests/bpf: Fix outdated test on storage->smap
  selftests/bpf: Choose another percpu variable in bpf for btf_dump test
  selftests/bpf: Remove test_task_storage_map_stress_lookup
  selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
  selftests/bpf: Update task_local_storage/recursion test
  selftests/bpf: Update sk_storage_omem_uncharge test
  bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
  bpf: Support lockless unlink when freeing map or local storage
  bpf: Prepare for bpf_selem_unlink_nofail()
  bpf: Remove unused percpu counter from bpf_local_storage_map_free
  bpf: Remove cgroup local storage percpu counter
  bpf: Remove task local storage percpu counter
  bpf: Change local_storage->lock and b->lock to rqspinlock
  bpf: Convert bpf_selem_unlink to failable
  bpf: Convert bpf_selem_link_map to failable
  bpf: Convert bpf_selem_unlink_map to failable
  bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
  selftests/xsk: fix number of Tx frags in invalid packet
  selftests/xsk: properly handle batch ending in the middle of a packet
  bpf: Prevent reentrance into call_rcu_tasks_trace()
  ...
2026-02-10 11:26:21 -08:00
Linus Torvalds 85f24b0ace hardening updates for v7.0-rc1
- Various missed __counted_by annotations (Thorsten Blum)
 
 - Various missed -Wflex-array-member-not-at-end fixes (Gustavo A. R. Silva)
 
 - Avoid leftover tempfiles for interrupted compile-time FORTIFY tests
   (Nicolas Schier)
 
 - Remove non-existant CONFIG_UBSAN_REPORT_FULL from docs (Stefan Wiehler)
 
 - fortify: Use C arithmetic not FIELD_xxx() in FORTIFY_REASON defines
   (David Laight)
 
 - Add __counted_by_ptr attribute, tests, and first user (Bill Wendling,
   Kees Cook)
 
 - Update MAINTAINERS file to make hardening section not include pstore
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaYopPQAKCRA2KwveOeQk
 u5QDAQDj9ygsbzLj9EtXtU3T03GTvix4Rx7RkaBAMPSDEJGhNQD+M3dP6Z2ogEMz
 1Km6dAC7nTEujsVFur9BVpyEgoBjjQM=
 =G1gb
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "Mostly small cleanups and various scattered annotations and flex array
  warning fixes that we reviewed by unlanded in other trees. Introduces
  new annotation for expanding counted_by to pointer members, now that
  compiler behavior between GCC and Clang has been normalized.

   - Various missed __counted_by annotations (Thorsten Blum)

   - Various missed -Wflex-array-member-not-at-end fixes (Gustavo A. R.
     Silva)

   - Avoid leftover tempfiles for interrupted compile-time FORTIFY tests
     (Nicolas Schier)

   - Remove non-existant CONFIG_UBSAN_REPORT_FULL from docs (Stefan
     Wiehler)

   - fortify: Use C arithmetic not FIELD_xxx() in FORTIFY_REASON defines
     (David Laight)

   - Add __counted_by_ptr attribute, tests, and first user (Bill
     Wendling, Kees Cook)

   - Update MAINTAINERS file to make hardening section not include
     pstore"

* tag 'hardening-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  MAINTAINERS: pstore: Remove L: entry
  nfp: tls: Avoid -Wflex-array-member-not-at-end warnings
  carl9170: Avoid -Wflex-array-member-not-at-end warning
  coredump: Use __counted_by_ptr for struct core_name::corename
  lkdtm/bugs: Add __counted_by_ptr() test PTR_BOUNDS
  compiler_types.h: Attributes: Add __counted_by_ptr macro
  fortify: Cleanup temp file also on non-successful exit
  fortify: Rename temporary file to match ignore pattern
  fortify: Use C arithmetic not FIELD_xxx() in FORTIFY_REASON defines
  ecryptfs: Annotate struct ecryptfs_message with __counted_by
  fs/xattr: Annotate struct simple_xattr with __counted_by
  crypto: af_alg - Annotate struct af_alg_iv with __counted_by
  Kconfig.ubsan: Remove CONFIG_UBSAN_REPORT_FULL from documentation
  drm/nouveau: fifo: Avoid -Wflex-array-member-not-at-end warning
2026-02-10 08:54:13 -08:00
Linus Torvalds 45bf4bc87c arm64 updates for 7.0
ACPI:
  - Add interrupt signalling support to the AGDI handler.
  - Add Catalin and myself to the arm64 ACPI MAINTAINERS entry.
 
 CPU features:
 - Drop Kconfig options for PAN and LSE (these are detected at runtime).
 - Add support for 64-byte single-copy atomic instructions (LS64/LS64V).
 - Reduce MTE overhead when executing in the kernel on Ampere CPUs.
 - Ensure POR_EL0 value exposed via ptrace is up-to-date.
 - Fix error handling on GCS allocation failure.
 
 CPU frequency:
 - Add CPU hotplug support to the FIE setup in the AMU driver.
 
 Entry code:
 - Minor optimisations and cleanups to the syscall entry path.
 - Preparatory rework for moving to the generic syscall entry code.
 
 Hardware errata:
 - Work around Spectre-BHB on TSV110 processors.
 - Work around broken CMO propagation on some systems with the SI-L1
   interconnect.
 
 Miscellaneous:
 - Disable branch profiling for arch/arm64/ to avoid issues with noinstr.
 - Minor fixes and cleanups (kexec + ubsan, WARN_ONCE() instead of
   WARN_ON(), reduction of boolean expression).
 - Fix custom __READ_ONCE() implementation for LTO builds when operating
   on non-atomic types.
 
 Perf and PMUs:
 - Support for CMN-600AE.
 - Be stricter about supported hardware in the CMN driver.
 - Support for DSU-110 and DSU-120.
 - Support for the cycles event in the DSU driver (alongside the
   dedicated cycles counter).
 - Use IRQF_NO_THREAD instead of IRQF_ONESHOT in the cxlpmu driver.
 - Use !bitmap_empty() as a faster alternative to bitmap_weight().
 - Fix SPE error handling when failing to resume profiling.
 
 Selftests:
 - Add support for the FORCE_TARGETS option to the arm64 kselftests.
 - Avoid nolibc-specific my_syscall() function.
 - Add basic test for the LS64 HWCAP.
 - Extend fp-pidbench to cover additional workload patterns.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmmJ16QQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNFO0CACxYy5klTAdqB+RjhxSbibZ6mmch/mX3zLl
 qBp0HeKuwSv+EWEblmr6KMrr/S5YjtEMboOC7GnmEdSeM6hucu5gPiEb7bIDQfbB
 5KJXoUlN1C+7Na4CimfordWiHIZw2d7mTumJcnsqMo7XN83t+4yqs98sET7WRCy5
 WUyqHH/dGZsGcsBMrRRO3UriHCpU425GE0jopF5OkwCs4dwyUVIV8SmZtK4Ovi3u
 Kin75UYgeVum8A218oHosTfYTBt4p/cSdfVVv5f3P14Q7lIstyJR7ZDqepTP2XSJ
 95W3nuMfOT1lx7f6pqvHc+4Ccl/+DQnbW1PWwv8f3nXAKVVMafWf
 =3SfC
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "There's a little less than normal, probably due to LPC & Christmas/New
  Year meaning that a few series weren't quite ready or reviewed in
  time. It's still useful across the board, despite the only real
  feature being support for the LS64 feature enabling 64-byte atomic
  accesses to endpoints that support it.

  ACPI:
   - Add interrupt signalling support to the AGDI handler
   - Add Catalin and myself to the arm64 ACPI MAINTAINERS entry

  CPU features:
   - Drop Kconfig options for PAN and LSE (these are detected at runtime)
   - Add support for 64-byte single-copy atomic instructions (LS64/LS64V)
   - Reduce MTE overhead when executing in the kernel on Ampere CPUs
   - Ensure POR_EL0 value exposed via ptrace is up-to-date
   - Fix error handling on GCS allocation failure

  CPU frequency:
   - Add CPU hotplug support to the FIE setup in the AMU driver

  Entry code:
   - Minor optimisations and cleanups to the syscall entry path
   - Preparatory rework for moving to the generic syscall entry code

  Hardware errata:
   - Work around Spectre-BHB on TSV110 processors
   - Work around broken CMO propagation on some systems with the SI-L1
     interconnect

  Miscellaneous:
   - Disable branch profiling for arch/arm64/ to avoid issues with
     noinstr
   - Minor fixes and cleanups (kexec + ubsan, WARN_ONCE() instead of
     WARN_ON(), reduction of boolean expression)
   - Fix custom __READ_ONCE() implementation for LTO builds when
     operating on non-atomic types

  Perf and PMUs:
   - Support for CMN-600AE
   - Be stricter about supported hardware in the CMN driver
   - Support for DSU-110 and DSU-120
   - Support for the cycles event in the DSU driver (alongside the
     dedicated cycles counter)
   - Use IRQF_NO_THREAD instead of IRQF_ONESHOT in the cxlpmu driver
   - Use !bitmap_empty() as a faster alternative to bitmap_weight()
   - Fix SPE error handling when failing to resume profiling

  Selftests:
   - Add support for the FORCE_TARGETS option to the arm64 kselftests
   - Avoid nolibc-specific my_syscall() function
   - Add basic test for the LS64 HWCAP
   - Extend fp-pidbench to cover additional workload patterns"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (43 commits)
  perf/arm-cmn: Reject unsupported hardware configurations
  perf: arm_spe: Properly set hw.state on failures
  arm64/gcs: Fix error handling in arch_set_shadow_stack_status()
  arm64: Fix non-atomic __READ_ONCE() with CONFIG_LTO=y
  arm64: poe: fix stale POR_EL0 values for ptrace
  kselftest/arm64: Raise default number of loops in fp-pidbench
  kselftest/arm64: Add a no-SVE loop after SVE in fp-pidbench
  perf/cxlpmu: Replace IRQF_ONESHOT with IRQF_NO_THREAD
  arm64: mte: Set TCMA1 whenever MTE is present in the kernel
  arm64/ptrace: Return early for ptrace_report_syscall_entry() error
  arm64/ptrace: Split report_syscall()
  arm64: Remove unused _TIF_WORK_MASK
  kselftest/arm64: Add missing file in .gitignore
  arm64: errata: Workaround for SI L1 downstream coherency issue
  kselftest/arm64: Add HWCAP test for FEAT_LS64
  arm64: Add support for FEAT_{LS64, LS64_V}
  KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest
  arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1
  KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory
  KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B
  ...
2026-02-09 20:28:45 -08:00
Linus Torvalds 0c00ed308d for-7.0/block-20260206
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGLwcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv+TD/48S2HTnMhmW6AtFYWErQ+sEKXpHrxbYe7S
 +qR8/g/T+QSfhfqPwZEuagndFKtIP3LJfaXGSP1Lk1RfP9NLQy91v33Ibe4DjHkp
 etWSfnMHA9MUAoWKmg8EvncB2G+ZQFiYCpjazj5tKHD9S2+psGMuL8kq6qzMJE83
 uhpb8WutUl4aSIXbMSfyGlwBhI1MjjRbbWlIBmg4yC8BWt1sH8Qn2L2GNVylEIcX
 U8At3KLgPGn0axSg4yGMAwTqtGhL/jwdDyeczbmRlXuAr4iVL9UX/yADCYkazt6U
 ttQ2/H+cxCwfES84COx9EteAatlbZxo6wjGvZ3xOMiMJVTjYe1x6Gkcckq+LrZX6
 tjofi2KK78qkrMXk1mZMkZjpyUWgRtCswhDllbQyqFs0SwzQtno2//Rk8HU9dhbt
 pkpryDbGFki9X3upcNyEYp5TYflpW6YhAzShYgmE6KXim2fV8SeFLviy0erKOAl+
 fwjTE6KQ5QoQv0s3WxkWa4lREm34O6IHrCUmbiPm5CruJnQDhqAN2QZIDgYC4WAf
 0gu9cR/O4Vxu7TQXrumPs5q+gCyDU0u0B8C3mG2s+rIo+PI5cVZKs2OIZ8HiPo0F
 x73kR/pX3DMe35ZQkQX22ymMuowV+aQouDLY9DTwakP5acdcg7h7GZKABk6VLB06
 gUIsnxURiQ==
 =jNzW
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - Support for batch request processing for ublk, improving the
   efficiency of the kernel/ublk server communication. This can yield
   nice 7-12% performance improvements

 - Support for integrity data for ublk

 - Various other ublk improvements and additions, including a ton of
   selftests additions and updated

 - Move the handling of blk-crypto software fallback from below the
   block layer to above it. This reduces the complexity of dealing with
   bio splitting

 - Series fixing a number of potential deadlocks in blk-mq related to
   the queue usage counter and writeback throttling and rq-qos debugfs
   handling

 - Add an async_depth queue attribute, to resolve a performance
   regression that's been around for a qhilw related to the scheduler
   depth handling

 - Only use task_work for IOPOLL completions on NVMe, if it is necessary
   to do so. An earlier fix for an issue resulted in all these
   completions being punted to task_work, to guarantee that completions
   were only run for a given io_uring ring when it was local to that
   ring. With the new changes, we can detect if it's necessary to use
   task_work or not, and avoid it if possible.

 - rnbd fixes:
      - Fix refcount underflow in device unmap path
      - Handle PREFLUSH and NOUNMAP flags properly in protocol
      - Fix server-side bi_size for special IOs
      - Zero response buffer before use
      - Fix trace format for flags
      - Add .release to rnbd_dev_ktype

 - MD pull requests via Yu Kuai
      - Fix raid5_run() to return error when log_init() fails
      - Fix IO hang with degraded array with llbitmap
      - Fix percpu_ref not resurrected on suspend timeout in llbitmap
      - Fix GPF in write_page caused by resize race
      - Fix NULL pointer dereference in process_metadata_update
      - Fix hang when stopping arrays with metadata through dm-raid
      - Fix any_working flag handling in raid10_sync_request
      - Refactor sync/recovery code path, improve error handling for
        badblocks, and remove unused recovery_disabled field
      - Consolidate mddev boolean fields into mddev_flags
      - Use mempool to allocate stripe_request_ctx and make sure
        max_sectors is not less than io_opt in raid5
      - Fix return value of mddev_trylock
      - Fix memory leak in raid1_run()
      - Add Li Nan as mdraid reviewer

 - Move phys_vec definitions to the kernel types, mostly in preparation
   for some VFIO and RDMA changes

 - Improve the speed for secure erase for some devices

 - Various little rust updates

 - Various other minor fixes, improvements, and cleanups

* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  blk-mq: ABI/sysfs-block: fix docs build warnings
  selftests: ublk: organize test directories by test ID
  block: decouple secure erase size limit from discard size limit
  block: remove redundant kill_bdev() call in set_blocksize()
  blk-mq: add documentation for new queue attribute async_dpeth
  block, bfq: convert to use request_queue->async_depth
  mq-deadline: covert to use request_queue->async_depth
  kyber: covert to use request_queue->async_depth
  blk-mq: add a new queue sysfs attribute async_depth
  blk-mq: factor out a helper blk_mq_limit_depth()
  blk-mq-sched: unify elevators checking for async requests
  block: convert nr_requests to unsigned int
  block: don't use strcpy to copy blockdev name
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
  blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  blk-rq-qos: fix possible debugfs_mutex deadlock
  blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
  blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
  ...
2026-02-09 17:57:21 -08:00
Linus Torvalds 591beb0e3a io_uring-bpf-restrictions.4-20260206
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGJ1kQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpky8EAChIL3uJ5Vmv+oQTxT4EVb1wpc8U/XzXWU5
 Q5F9IpZZCGO7+i015Y7iTTqDRixjblRaWpWzZZP8vflWDUS8LESNZLQdcoEnxaiv
 P367KNPUGwxejcKsu8PvZvfnX6JWSQoNstcDmrwkCF0ND2UUfvvMZyn3uKhkbBRY
 h5Ehcqkvqc1OJDAWC7+yPzYAmB01uRPQ6sc9/GeujznHPlfbvie4u6gBvvfXeirT
 592zbVftINMrm6Twd6zl4n+HNAn+CUoyVMppeeddv5IcyFPm9uz/dLOZBXTz6552
 jFYNmB0U4g+SxGXMyqp37YISTALnuY+57y5eXmEAtgkEeE3HrF+F/ZdxQHwXSpo3
 T2Lb9IOqFyHtSvq678HZ37JB6aIYbBE/mZdNf8FFFpnPJGb5Ey7d50qPp/ywVq0H
 p9CahbpkzGUBMsZ+koew0YHiFdWV9tww+/Bnk5dTtn2197uyaHsLdmbf4C36GWke
 Bk5cwNgU+3DMFAfTiL9m+AIXYsJkBayRJn+hViTrF5AL7gcGiBryGF43FOSKoYuq
 f0mniDnGSwvn86VZPuZQ6wBRHZPEMR3OlaUXn6XrUU6cYyvMg0pBZV+QHF7zlsSP
 2sdfUbPL5TxexF3G8dsxlDIypz9Z6TCoUCfU0WiiUETnCrVNkXfIY846A+w08p0b
 ejBjzrwRtQ==
 =CqJq
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring bpf filters from Jens Axboe:
 "This adds support for both cBPF filters for io_uring, as well as task
  inherited restrictions and filters.

  seccomp and io_uring don't play along nicely, as most of the
  interesting data to filter on resides somewhat out-of-band, in the
  submission queue ring.

  As a result, things like containers and systemd that apply seccomp
  filters, can't filter io_uring operations.

  That leaves them with just one choice if filtering is critical -
  filter the actual io_uring_setup(2) system call to simply disallow
  io_uring. That's rather unfortunate, and has limited us because of it.

  io_uring already has some filtering support. It requires the ring to
  be setup in a disabled state, and then a filter set can be applied.
  This filter set is completely bi-modal - an opcode is either enabled
  or it's not. Once a filter set is registered, the ring can be enabled.
  This is very restrictive, and it's not useful at all to systemd or
  containers which really want both broader and more specific control.

  This first adds support for cBPF filters for opcodes, which enables
  tighter control over what exactly a specific opcode may do. As
  examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
  allowing filtering on resolve flags. And another example is added for
  IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
  are both common use cases. cBPF was chosen rather than eBPF, because
  the latter is often restricted in containers as well.

  These filters are run post the init phase of the request, which allows
  filters to even dip into data that is being passed in struct in user
  memory, as the init side of requests make that data stable by bringing
  it into the kernel. This allows filtering without needing to copy this
  data twice, or have filters etc know about the exact layout of the
  user data. The filters get the already copied and sanitized data
  passed.

  On top of that support is added for per-task filters, meaning that any
  ring created with a task that has a per-task filter will get those
  filters applied when it's created. These filters are inherited across
  fork as well. Once a filter has been registered, any further added
  filters may only further restrict what operations are permitted.

  Filters cannot change the return value of an operation, they can only
  permit or deny it based on the contents"

* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: allow registration of per-task restrictions
  io_uring: add task fork hook
  io_uring/bpf_filter: add ref counts to struct io_bpf_filter
  io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
  io_uring/bpf_filter: allow filtering on contents of struct open_how
  io_uring/net: allow filtering on IORING_OP_SOCKET data
  io_uring: add support for BPF filtering for opcode restrictions
2026-02-09 17:31:17 -08:00
Linus Torvalds f5d4feed17 for-7.0/io_uring-20260206
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGJxsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpk6+EACamMdw6WU4VVNjUtjT93FuXxor4ioyhowJ
 myRtKG3ZvYrE63Z8F1dCQE28RXi9n6MhGxabCq8WZVGkhTv27DuaBkDjU4T8oCnP
 EYhs5a3sdRXfKuIlqVbxuiFdmiPHEP0vh3/MviKx9Ju3/Po3OEWKBalNMevfGkS4
 bRNp9IQkAYNSRhGma2ni9Rnc5welWmhpsxUKFdGtPRX53ZlYegiZxKlfKMB4/SQ+
 7XAWKhy9dOGVo4DpLof7mCX6hMeX+FoNkJzF6cTMO/IF//lCLjI9BN4SMiI6mmEN
 RY6PLJiFraoQx8wdr3J1LtBCNXzzj6cPk6PNHKtsodoafe2oYFNLNgfAa9pHDzfM
 12kvy58au0cQG6TnS2eNlqM2GN116mJi+k00E+UW4iaXXtpqcdcBrLlS+Q5hJ78C
 9MBLQofv7D06C6kbpxV2pVS1u4oxefjl19wWLqLKx/VytCHrsaTm50n1r0k7YLCc
 plvPkQRQobqpp2GtcaXcfmsi1Vfu4jzMBAN+rTN4/te0kudNqL9+hPvrejIMEURc
 2AcktMAHC8wjpr93dFASXiWh/fdyhV4e2a/D/ML4PXxhnCfnGx5s5Tp/pGjePHEU
 dLZm9vadmr/Yrdgycf9gQ8mz9IxI9FNJCKbI7lf7+/KJXe7DwngOa6VHNblWBRHv
 YoX6bG1yQQ==
 =Q248
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Clean up the IORING_SETUP_R_DISABLED and submitter task checking,
   mostly just in preparation for relaxing the locking for SINGLE_ISSUER
   in the future.

 - Improve IOPOLL by using a doubly linked list to manage completions.

   Previously it was singly listed, which meant that to complete request
   N in the chain 0..N-1 had to have completed first. With a doubly
   linked list we can complete whatever request completes in that order,
   rather than need to wait for a consecutive range to be available.
   This reduces latencies.

 - Improve the restriction setup and checking. Mostly in preparation for
   adding further features on top of that. Coming in a separate pull
   request.

 - Split out task_work and wait handling into separate files. These are
   mostly nicely abstracted already, but still remained in the
   io_uring.c file which is on the larger side.

 - Use GFP_KERNEL_ACCOUNT in a few more spots, where appropriate.

 - Ensure even the idle io-wq worker exits if a task no longer has any
   rings open.

 - Add support for a non-circular submission queue.

   By default, the SQ ring keeps moving around, even if only a few
   entries are used for each submission. This can be wasteful in terms
   of cachelines.

   If IORING_SETUP_SQ_REWIND is set for the ring when created, each
   submission will start at offset 0 instead of where we last left off
   doing submissions.

 - Various little cleanups

* tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (30 commits)
  io_uring/kbuf: fix memory leak if io_buffer_add_list fails
  io_uring: Add SPDX id lines to remaining source files
  io_uring: allow io-wq workers to exit when unused
  io_uring/io-wq: add exit-on-idle state
  io_uring/net: don't continue send bundle if poll was required for retry
  io_uring/rsrc: use GFP_KERNEL_ACCOUNT consistently
  io_uring/futex: use GFP_KERNEL_ACCOUNT for futex data allocation
  io_uring/io-wq: handle !sysctl_hung_task_timeout_secs
  io_uring: fix bad indentation for setup flags if statement
  io_uring/rsrc: take unsigned index in io_rsrc_node_lookup()
  io_uring: introduce non-circular SQ
  io_uring: split out CQ waiting code into wait.c
  io_uring: split out task work code into tw.c
  io_uring/io-wq: don't trigger hung task for syzbot craziness
  io_uring: add IO_URING_EXIT_WAIT_MAX definition
  io_uring/sync: validate passed in offset
  io_uring/eventfd: remove unused ctx->evfd_last_cq_tail member
  io_uring/timeout: annotate data race in io_flush_timeouts()
  io_uring/uring_cmd: explicitly disallow cancelations for IOPOLL
  io_uring: fix IOPOLL with passthrough I/O
  ...
2026-02-09 17:22:00 -08:00
Linus Torvalds d10a88ce16 nilfs2 updates for v7.0
- nilfs2: fix missing struct keywords in nilfs2_api.h kernel-doc
 - nilfs2: convert nilfs_super_block to kernel-doc
 - nilfs2: Fix potential block overflow that cause system hang
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQT4wVoLCG92poNnMFAhI4xTh21NnQUCaYZw2gAKCRAhI4xTh21N
 nZuFAQD8vJk3OJKbxxroCyTGxDT3C/WL7PMC9Z1QNQ/yF0zBYwD+JSPN/dfQY6uF
 LracWCoFQcdaWRRWn1SA2k7L1OchFgQ=
 =WXsp
 -----END PGP SIGNATURE-----

Merge tag 'nilfs2-v7.0-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/nilfs2

Pull nilfs2 updates from Viacheslav Dubeyko:

 - Fix potential block overflow that cause system hang

   When executing the FITRIM command, an underflow can occur in the
   calculation of nblocks. This ultimately leads to the block layer
   function __blkdev_issue_discard() taking an excessively long time
   to process the bio chain, and the ns_segctor_sem lock remains held
   for a long period.

   This prevents other tasks from acquiring the ns_segctor_sem lock,
   resulting in a hang reported by syzbot (Edward Adam Davis)

 - Fix missing struct keywords in nilfs2_api.h kernel-doc (Ryusuke
   Konishi)

 - Convert nilfs_super_block to kernel-doc

   Eliminate 40+ kernel-doc warnings in nilfs2_ondisk.h by converting
   all of the struct member comments to kernel-doc comments (Randy
   Dunlap)

* tag 'nilfs2-v7.0-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/nilfs2:
  nilfs2: fix missing struct keywords in nilfs2_api.h kernel-doc
  nilfs2: convert nilfs_super_block to kernel-doc
  nilfs2: Fix potential block overflow that cause system hang
2026-02-09 15:55:41 -08:00
Linus Torvalds 8912c2fd58 for-6.20-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmDT6sACgkQxWXV+ddt
 WDteIBAAnQBKtHZOrefnA/SjbT4N+IV20x8sxVc3XI2MXw6RpjEN6k+0oGMLdvMy
 5NBryJ43q5CwCV6iNkWQE4mT86gcPa6Bqv1nFOC5Q2BDkvbVBpOfOq7kC2+fQ7ay
 HF2Mr0PUHc0Y0MhkRSljO+T2QD4tDpWaxbEeVY+TxiAsepD1paK4fHV6Lwu2sk25
 17RJQvm/2XRY32g9Sa6NZIc7mGuyIasMCBcTpDKDJW10hP61NNtK4wHgPLtMRtzx
 qzCAPSMS6QkeJZHcDa/Atg+iqpR5U8pdKAUSYJii3Kgcmjr5n1U1ZTp5WRLlXSS2
 tHiR62a983ya022wKR1ApsdjN7ncE8iIeT/GrezZVcPtm9jTxaSzgd7dDNfSmr29
 my4crJWvlEuD9Qt+/oz//eLAjkgEe2Q5RtaAworCAG00MzaGOEwNiXXP7DDMQApI
 VTxx9dvY0s/W3UF/IuJWTTN9q95KjvlmZ9ELAPxwwtyq+sAD41CvlYhJqCaLLec5
 6xMotP5cy3Ur+yp+J7RCDprQ7x6YcU98PYIXQxf1/77f3Lz/7QA2TWafPzJ5V2Bk
 UtprVCrlqwCmSFrSISN6HzNf0UYY/ZI36WRoUj/ZJkGNfkQwvs9aBjb+lVYRb8T8
 OcMlJrJvoUwIY//ef5K97ma8HOecodxszdEIafOgmnJtE9H3foI=
 =Ie8n
 -----END PGP SIGNATURE-----

Merge tag 'for-6.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "User visible changes, feature updates:

   - when using block size > page size, enable direct IO

   - fallback to buffered IO if the data profile has duplication,
     workaround to avoid checksum mismatches on block group profiles
     with redundancy, real direct IO is possible on single or RAID0

   - redo export of zoned statistics, moved from sysfs to
     /proc/pid/mountstats due to size limitations of the former

  Experimental features:

   - remove offload checksum tunable, intended to find best way to do it
     but since we've switched to offload to thread for everything we
     don't need it anymore

   - initial support for remap-tree feature, a translation layer of
     logical block addresses that allow changes without moving/rewriting
     blocks to do eg. relocation, or other changes that require COW

  Notable fixes:

   - automatic removal of accidentally leftover chunks when
     free-space-tree is enabled since mkfs.btrfs v6.16.1

   - zoned mode:
      - do not try to append to conventional zones when RAID is mixing
        zoned and conventional drives
      - fixup write pointers when mixing zoned and conventional on
        DUP/RAID* profiles

   - when using squota, relax deletion rules for qgroups with 0 members
     to allow easier recovery from accounting bugs, also add more checks
     to detect bad accounting

   - fix periodic reclaim scanning, properly check boundary conditions
     not to trigger it unexpectedly or miss the time to run it

   - trim:
      - continue after first error
      - change reporting to the first detected error
      - add more cancellation points
      - reduce contention of big device lock that can block other
        operations when there's lots of trimmed space

   - when chunk allocation is forced (needs experimental build) fix
     transaction abort when unexpected space layout is detected

  Core:

   - switch to crypto library API for checksumming, removed module
     dependencies, pointer indirections, etc.

   - error handling improvements

   - adjust how and where transaction commit or abort are done and are
     maybe not necessary

   - minor compression optimization to skip single block ranges

   - improve how compression folios are handled

   - new and updated selftests

   - cleanups, refactoring:
      - auto-freeing and other automatic variable cleanup conversion
      - structure size optimizations
      - condition annotations"

* tag 'for-6.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (137 commits)
  btrfs: get rid of compressed_bio::compressed_folios[]
  btrfs: get rid of compressed_folios[] usage for encoded writes
  btrfs: get rid of compressed_folios[] usage for compressed read
  btrfs: remove the old btrfs_compress_folios() infrastructure
  btrfs: switch to btrfs_compress_bio() interface for compressed writes
  btrfs: introduce btrfs_compress_bio() helper
  btrfs: zlib: introduce zlib_compress_bio() helper
  btrfs: zstd: introduce zstd_compress_bio() helper
  btrfs: lzo: introduce lzo_compress_bio() helper
  btrfs: zoned: factor out the zone loading part into a testable function
  btrfs: add cleanup function for btrfs_free_chunk_map
  btrfs: tests: add cleanup functions for test specific functions
  btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
  btrfs: tests: add unit tests for pending extent walking functions
  btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
  btrfs: fix transaction commit blocking during trim of unallocated space
  btrfs: handle user interrupt properly in btrfs_trim_fs()
  btrfs: preserve first error in btrfs_trim_fs()
  btrfs: continue trimming remaining devices on failure
  btrfs: do not BUG_ON() in btrfs_remove_block_group()
  ...
2026-02-09 15:45:21 -08:00
Linus Torvalds 157d3d6efd vfs-7.0-rc1.namespace
Please consider pulling these changes from the signed vfs-7.0-rc1.namespace tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc
 ovzgAP9BpqMQhMy2VCurru8/T5VAd6eJdgXzEfXqMksL5BNm8gEAsLx666KJNKgm
 Sh/yVA2KBjf51gvcLZ4gHOISaMU8bAI=
 =RGLf
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:

 - statmount: accept fd as a parameter

   Extend struct mnt_id_req with a file descriptor field and a new
   STATMOUNT_BY_FD flag. When set, statmount() returns mount information
   for the mount the fd resides on — including detached mounts
   (unmounted via umount2(MNT_DETACH)).

   For detached mounts the STATMOUNT_MNT_POINT and STATMOUNT_MNT_NS_ID
   mask bits are cleared since neither is meaningful. The capability
   check is skipped for STATMOUNT_BY_FD since holding an fd already
   implies prior access to the mount and equivalent information is
   available through fstatfs() and /proc/pid/mountinfo without
   privilege. Includes comprehensive selftests covering both attached
   and detached mount cases.

 - fs: Remove internal old mount API code (1 patch)

   Now that every in-tree filesystem has been converted to the new
   mount API, remove all the legacy shim code in fs_context.c that
   handled unconverted filesystems. This deletes ~280 lines including
   legacy_init_fs_context(), the legacy_fs_context struct, and
   associated wrappers. The mount(2) syscall path for userspace remains
   untouched. Documentation references to the legacy callbacks are
   cleaned up.

 - mount: add OPEN_TREE_NAMESPACE to open_tree()

   Container runtimes currently use CLONE_NEWNS to copy the caller's
   entire mount namespace — only to then pivot_root() and recursively
   unmount everything they just copied. With large mount tables and
   thousands of parallel container launches this creates significant
   contention on the namespace semaphore.

   OPEN_TREE_NAMESPACE copies only the specified mount tree (like
   OPEN_TREE_CLONE) but returns a mount namespace fd instead of a
   detached mount fd. The new namespace contains the copied tree mounted
   on top of a clone of the real rootfs.

   This functions as a combined unshare(CLONE_NEWNS) + pivot_root() in a
   single syscall. Works with user namespaces: an unshare(CLONE_NEWUSER)
   followed by OPEN_TREE_NAMESPACE creates a mount namespace owned by
   the new user namespace. Mount namespace file mounts are excluded from
   the copy to prevent cycles. Includes ~1000 lines of selftests"

* tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/open_tree: add OPEN_TREE_NAMESPACE tests
  mount: add OPEN_TREE_NAMESPACE
  fs: Remove internal old mount API code
  selftests: statmount: tests for STATMOUNT_BY_FD
  statmount: accept fd as a parameter
  statmount: permission check should return EPERM
2026-02-09 14:43:47 -08:00
Linus Torvalds c84bb79f70 vfs-7.0-rc1.nullfs
Please consider pulling these changes from the signed vfs-7.0-rc1.nullfs tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc
 olG7AQD9TywOR0HC9PMT8jrhC1TKODnZ4H1aLNlYVltzfJ09xwEAwFSGO4rQmGAF
 aZdD0RQw4bkf7IC1PIZHEGUqmVXJCQ8=
 =NvyI
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.0-rc1.nullfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs nullfs update from Christian Brauner:
 "Add a completely catatonic minimal pseudo filesystem called "nullfs"
  and make pivot_root() work in the initramfs.

  Currently pivot_root() does not work on the real rootfs because it
  cannot be unmounted. Userspace has to recursively delete initramfs
  contents manually before continuing boot, using the fragile
  switch_root sequence (overmount + chroot).

  Add nullfs, a minimal immutable filesystem that serves as the true
  root of the mount hierarchy. The mutable rootfs (tmpfs/ramfs) is
  mounted on top of it. This allows userspace to simply:

      chdir(new_root);
      pivot_root(".", ".");
      umount2(".", MNT_DETACH);

  without the traditional switch_root workarounds. systemd already
  handles this correctly. It tries pivot_root() first and falls back
  to MS_MOVE only when that fails.

  This also means rootfs mounts in unprivileged namespaces no longer
  need MNT_LOCKED, since the immutable nullfs guarantees nothing can be
  revealed by unmounting the covering mount.

  nullfs is a single-instance filesystem (get_tree_single()) marked
  SB_NOUSER | SB_I_NOEXEC | SB_I_NODEV with an immutable empty root
  directory. This means sooner or later it can be used to overmount
  other directories to hide their contents without any additional
  protection needed.

  We enable it unconditionally. If we see any real regression we'll
  hide it behind a boot option.

  nullfs has extensions beyond this in the future. It will serve as a
  concept to support the creation of completely empty mount namespaces -
  which is work coming up in the next cycle"

* tag 'vfs-7.0-rc1.nullfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: use nullfs unconditionally as the real rootfs
  docs: mention nullfs
  fs: add immutable rootfs
  fs: add init_pivot_root()
  fs: ensure that internal tmpfs mount gets mount id zero
2026-02-09 13:41:34 -08:00
Linus Torvalds dd466ea002 vfs-7.0-rc1.fserror
Please consider pulling these changes from the signed vfs-7.0-rc1.fserror tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc
 orUJAP9taSsjaB9zD9gU/rs8RfaPjhDXbVuPkBiDFARvGPSegwD/ZxTygHYsYarv
 7JtAuKI/njOcfhl+fvHSHT1BgcO+nQ8=
 =nUTi
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs error reporting updates from Christian Brauner:
 "This contains the changes to support generic I/O error reporting.

  Filesystems currently have no standard mechanism for reporting
  metadata corruption and file I/O errors to userspace via fsnotify.
  Each filesystem (xfs, ext4, erofs, f2fs, etc.) privately defines
  EFSCORRUPTED, and error reporting to fanotify is inconsistent or
  absent entirely.

  This introduces a generic fserror infrastructure built around struct
  super_block that gives filesystems a standard way to queue metadata
  and file I/O error reports for delivery to fsnotify.

  Errors are queued via mempools and queue_work to avoid holding
  filesystem locks in the notification path; unmount waits for pending
  events to drain. A new super_operations::report_error callback lets
  filesystem drivers respond to file I/O errors themselves (to be used
  by an upcoming XFS self-healing patchset).

  On the uapi side, EFSCORRUPTED and EUCLEAN are promoted from private
  per-filesystem definitions to canonical errno.h values across all
  architectures"

* tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ext4: convert to new fserror helpers
  xfs: translate fsdax media errors into file "data lost" errors when convenient
  xfs: report fs metadata errors via fsnotify
  iomap: report file I/O errors to the VFS
  fs: report filesystem and file I/O errors to fsnotify
  uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
2026-02-09 12:21:37 -08:00
Linus Torvalds 996812c453 vfs-7.0-rc1.initrd
Please consider pulling these changes from the signed vfs-7.0-rc1.initrd tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc
 ordBAQD4d6Y5Zvr852s9deMTDv+ng3bFk1YHqybGe3wEuATstwD/QcvAoNFW9Nn5
 n7/268Nk6jTEygT7Fm3tn42SnwOxhgU=
 =T203
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.0-rc1.initrd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs initrd removal from Christian Brauner:
 "Remove the deprecated linuxrc-based initrd code path and related dead
  code. The linuxrc initrd path was deprecated in 2020 and this series
  completes its removal. If we see real-life regressions we'll revert.

  The core change removes handle_initrd() and init_linuxrc() — the
  entire flow that ran /linuxrc from an initrd, pivoted roots, and
  handed off to the real root filesystem. With that gone, initrd_load()
  becomes void (no longer short-circuits prepare_namespace()),
  rd_load_image() is simplified to always load /initrd.image instead of
  taking a path, and rd_load_disk() is deleted.

  The /proc/sys/kernel/real-root-dev sysctl and its backing variable are
  removed since they only existed for linuxrc to communicate the real
  root device back to the kernel.

  The no-op load_ramdisk= and prompt_ramdisk= parameters are dropped,
  and noinitrd and ramdisk_start= gain deprecation warnings.

  Initramfs is entirely unaffected. The non-linuxrc initrd path
  (root=/dev/ram0) is preserved but now carries a deprecation warning
  targeting January 2027 removal"

* tag 'vfs-7.0-rc1.initrd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  init: remove /proc/sys/kernel/real-root-dev
  initrd: remove deprecated code path (linuxrc)
  init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command line parameters
2026-02-09 11:03:25 -08:00
Jens Axboe ed82f35b92 io_uring: allow registration of per-task restrictions
Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be setup in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.

This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
== -1, like the other "blind" register opcodes which work on the task
rather than a specific ring. This allows registration of the same kind
of restrictions as can been done on a specific ring, but with the task
itself. Once done, any ring created will inherit these restrictions.

If a restriction filter is registered with a task, then it's inherited
on fork for its children. Children may only further restrict operations,
not extend them.

Inheriting restrictions include both the classic
IORING_REGISTER_RESTRICTIONS based restrictions, as well as the BPF
filters that have been registered with the task via
IORING_REGISTER_BPF_FILTER.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-06 07:29:19 -07:00
Peter Zijlstra 3e4067169c Linux 6.19-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAml/zSkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG+bwIAJ0jbbeKDyeJJxPo
 8PgScnPJx9vBL3hGpphZrhbV3GOe9bDhKM/0Xk9qMDbpAm9C6qiBMTiDWyvWv5Qi
 qzDlZfoymMaDLPMxw9WHjJ++i1Z2StNdrz57Vze98C3/iG6gBcKnUEUzvF9nigri
 HIoxoOKlbSXLPUIzt49xE7YX+CRJhLF/kXmfoauZn5ghpv+uqSpWvRbUQJa3dmc0
 S4Ie/nbPtdVHmy1Fz9LJFDOzsdhGyjzHF4kc4shDkjAs8RAr8fJh74mQHO5a3MWA
 3WZ7GAAAc4XXNqj76X2dnVlMWpQNJ4p2e+OalsuXGA6VQ7OgbrJGMX8P6dMFn5AF
 8hFsXn4=
 =IdZ1
 -----END PGP SIGNATURE-----

Merge branch 'v6.19-rc8'

Update to avoid conflicts with /urgent patches.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2026-02-03 12:04:13 +01:00
Mark Harmstone 8620da16fb btrfs: allow mounting filesystems with remap-tree incompat flag
If we encounter a filesystem with the remap-tree incompat flag set,
validate its compatibility with the other flags, and load the remap tree
using the values that have been added to the superblock.

The remap-tree feature depends on the free-space-tree, but no-holes and
block-group-tree have been made dependencies to reduce the testing
matrix. Similarly I'm not aware of any reason why mixed-bg and zoned would be
incompatible with remap-tree, but this is blocked for the time being
until it can be fully tested.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone 7977011460 btrfs: add extended version of struct block_group_item
Add a struct btrfs_block_group_item_v2, which is used in the block group
tree if the remap-tree incompat flag is set.

This adds two new fields to the block group item: `remap_bytes` and
`identity_remap_count`.

`remap_bytes` records the amount of data that's physically within this
block group, but nominally in another, remapped block group. This is
necessary because this data will need to be moved first if this block
group is itself relocated. If `remap_bytes` > 0, this is an indicator to
the relocation thread that it will need to search the remap-tree for
backrefs. A block group must also have `remap_bytes` == 0 before it can
be dropped.

`identity_remap_count` records how many identity remap items are located
in the remap tree for this block group. When relocation is begun for
this block group, this is set to the number of holes in the free-space
tree for this range. As identity remaps are converted into actual remaps
by the relocation process, this number is decreased. Once it reaches 0,
either because of relocation or because extents have been deleted, the
block group has been fully remapped and its chunk's device extents are
removed.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:34 +01:00
Mark Harmstone 0b4d29fa98 btrfs: add METADATA_REMAP chunk type
Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the
remap tree.

This is needed for bootstrapping purposes: the remap tree can't itself
be remapped, and must be relocated the existing way, by COWing every
leaf. The remap tree can't go in the SYSTEM chunk as space there is
limited, because a copy of the chunk item gets placed in the superblock.

The changes in fs/btrfs/volumes.h are because we're adding a new block
group type bit after the profile bits, and so can no longer rely on the
const_ilog2 trick.

The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate
here, we can adjust it later if it proves to be too big or too small.
This works out to be ~500,000 remap items, which for a 4KB block size
covers ~2GB of remapped data in the worst case and ~500TB in the best case.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:27 +01:00
Mark Harmstone ef6a31d035 btrfs: add definitions and constants for remap-tree
Add an incompat flag for the new remap-tree feature, and the constants
and definitions needed to support it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:02 +01:00
Ming Lei 8443e2087e ublk: add UBLK_F_NO_AUTO_PART_SCAN feature flag
Add a new feature flag UBLK_F_NO_AUTO_PART_SCAN to allow users to suppress
automatic partition scanning when starting a ublk device.

This is useful for some cases in which user don't want to scan
partitions.

Users still can manually trigger partition scanning later when appropriate
using standard tools (e.g., partprobe, blockdev --rereadpt).

Reported-by: Yoav Cohen <yoav@nvidia.com>
Link: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-31 06:36:41 -07:00
Jens Axboe 8768770cf5 io_uring/bpf_filter: allow filtering on contents of struct open_how
This adds custom filtering for IORING_OP_OPENAT and IORING_OP_OPENAT2,
where the open_how flags, mode, and resolve can be checked by filters.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-27 11:10:46 -07:00
Jens Axboe cff1c26b42 io_uring/net: allow filtering on IORING_OP_SOCKET data
Example population method for the BPF based opcode filtering. This
exposes the socket family, type, and protocol to a registered BPF
filter. This in turn enables the filter to make decisions based on
what was passed in to the IORING_OP_SOCKET request type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-27 11:10:46 -07:00
Jens Axboe d42eb05e60 io_uring: add support for BPF filtering for opcode restrictions
Add support for loading classic BPF programs with io_uring to provide
fine-grained filtering of SQE operations. Unlike
IORING_REGISTER_RESTRICTIONS which only allows bitmap-based allow/deny
of opcodes, BPF filters can inspect request attributes and make dynamic
decisions.

The filter is registered via IORING_REGISTER_BPF_FILTER with a struct
io_uring_bpf:

struct io_uring_bpf_filter {
	__u32	opcode;		/* io_uring opcode to filter */
	__u32	flags;
	__u32	filter_len;	/* number of BPF instructions */
	__u32	resv;
	__u64	filter_ptr;	/* pointer to BPF filter */
	__u64	resv2[5];
};

enum {
	IO_URING_BPF_CMD_FILTER	= 1,
};

struct io_uring_bpf {
	__u16	cmd_type;	/* IO_URING_BPF_* values */
	__u16	cmd_flags;	/* none so far */
	__u32	resv;
	union {
		struct io_uring_bpf_filter	filter;
	};
};

and the filters get supplied a struct io_uring_bpf_ctx:

struct io_uring_bpf_ctx {
	__u64	user_data;
	__u8	opcode;
	__u8	sqe_flags;
	__u8	pdu_size;
	__u8	pad[5];
};

where it's possible to filter on opcode and sqe_flags, with pdu_size
indicating how much extra data is being passed in beyond the pad field.
This will used for specific finer grained filtering inside an opcode.
An example of that for sockets is in one of the following patches.
Anything the opcode supports can end up in this struct, populated by
the opcode itself, and hence can be filtered for.

Filters have the following semantics:
  - Return 1 to allow the request
  - Return 0 to deny the request with -EACCES
  - Multiple filters can be stacked per opcode. All filters must
    return 1 for the opcode to be allowed.
  - Filters are evaluated in registration order (most recent first)

The implementation uses classic BPF (cBPF) rather than eBPF for as
that's required for containers, and since they can be used by any
user in the system.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-27 11:09:57 -07:00
Matt Bobrowski 752b807028 bpf: add new BPF_CGROUP_ITER_CHILDREN control option
Currently, the BPF cgroup iterator supports walking descendants in
either pre-order (BPF_CGROUP_ITER_DESCENDANTS_PRE) or post-order
(BPF_CGROUP_ITER_DESCENDANTS_POST). These modes perform an exhaustive
depth-first search (DFS) of the hierarchy. In scenarios where a BPF
program may need to inspect only the direct children of a given parent
cgroup, a full DFS is unnecessarily expensive.

This patch introduces a new BPF cgroup iterator control option,
BPF_CGROUP_ITER_CHILDREN. This control option restricts the traversal
to the immediate children of a specified parent cgroup, allowing for
more targeted and efficient iteration, particularly when exhaustive
depth-first search (DFS) traversal is not required.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20260127085112.3608687-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-27 09:05:54 -08:00
Linus Torvalds 0a6dce0a5c Char/Misc/IIO driver fixes for 6.19-rc7
Here are some small char/misc/iio and some other minor driver subsystem
 fixes for 6.19-rc7.  Nothing huge here, just some fixes for reported
 issues including:
   - lots of little iio driver fixes
   - comedi driver fixes
   - mux driver fix
   - w1 driver fixes
   - uio driver fix
   - slimbus driver fixes
   - hwtracing bugfix
   - other tiny bugfixes
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCaXY0dA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylkhgCgvwYNIuJosuQRKK0sFTOR1Utig3sAn2g6E2H9
 AOZZ43qoosl++HsuXDLP
 =XzfC
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc/iio driver fixes from Greg KH:
 "Here are some small char/misc/iio and some other minor driver
  subsystem fixes for 6.19-rc7. Nothing huge here, just some fixes for
  reported issues including:

   - lots of little iio driver fixes

   - comedi driver fixes

   - mux driver fix

   - w1 driver fixes

   - uio driver fix

   - slimbus driver fixes

   - hwtracing bugfix

   - other tiny bugfixes

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (36 commits)
  comedi: dmm32at: serialize use of paged registers
  mei: trace: treat reg parameter as string
  uio: pci_sva: correct '-ENODEV' check logic
  uacce: ensure safe queue release with state management
  uacce: implement mremap in uacce_vm_ops to return -EPERM
  uacce: fix isolate sysfs check condition
  uacce: fix cdev handling in the cleanup path
  slimbus: core: clean up of_slim_get_device()
  slimbus: core: fix of_slim_get_device() kernel doc
  slimbus: core: amend slim_get_device() kernel doc
  slimbus: core: fix device reference leak on report present
  slimbus: core: fix runtime PM imbalance on report present
  slimbus: core: fix OF node leak on registration failure
  intel_th: rename error label
  intel_th: fix device leak on output open()
  comedi: Fix getting range information for subdevices 16 to 255
  mux: mmio: Fix IS_ERR() vs NULL check in probe()
  interconnect: debugfs: initialize src_node and dst_node to empty strings
  iio: dac: ad3552r-hs: fix out-of-bound write in ad3552r_hs_write_data_source
  iio: accel: iis328dq: fix gain values
  ...
2026-01-25 09:57:31 -08:00
Menglong Dong 2d419c4465 bpf: add fsession support
The fsession is something that similar to kprobe session. It allow to
attach a single BPF program to both the entry and the exit of the target
functions.

Introduce the struct bpf_fsession_link, which allows to add the link to
both the fentry and fexit progs_hlist of the trampoline.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-24 18:49:35 -08:00
Linus Torvalds 00d20db21e block-6.19-20260122
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmly5rUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv30EACNx0n1MlRvrIHvqIH+gq2NmpweLhJphars
 3e4lyDD5bBUuQ1V2iRc5k7MiKPEtQJpF83m1WVIvycNjjx2qWLBsZyCc9PCNoZIn
 nBe1/qCuKOs8kmrNkTXLKjr1uFVp+0Nm5jWSjeIgNI1zd7IDe6fpkjyH6JrnKsw8
 DD7aCYE4jHGJH8q9Ks3rOu3M2syiFwkqmOD12Xgqiz29fP4qJJDvC/y4yOMDWbZY
 G7ZKFHZK5QS8tmjRUhgJacVpJlN3NCL6lZ16f0fRTJoIlWzOAaBafeJTAjiS3S01
 TS8sadfTLrVZocFYSKZjixVgUocHqE5QeKBrDe09N0A5XsWz/YJkcNisNw0oRcvb
 BKCrO1TCHiPzliPCobPcaFZII4z2HeiIfVGSTaJAzSM4ZCt/EH9BFTKrgohGRZi4
 v+bPNdJaCLaFOipgsUeA63NZpt9xN+PsfmzFGC1PvF50gQMM+JeFpqjNh37XjCIt
 6QtK9rzLzC9GvdFXV3o1CRbIBAli3UEgjyLeThhsCu2HPnab/yrrf9AwF5udZQlJ
 wK7S8kB3zl25nyM3jwKzzaVLqi+aUWr1Jn+D+DA2BwwPg3m8h7yIrysJryF5Leon
 5VAYHCOg+MhPuvhNFjqkgahjreJqV/3Qou+sbGcLur4PXdKXeW8SY5wrDlplumFb
 +BZSUZ06Kw==
 =RK+t
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - A set of selftest fixes for ublk

 - Fix for a pid mismatch in ublk, comparing PIDs in different
   namespaces if run inside a namespace

 - Fix for a regression added in this release with polling, where the
   nvme tcp connect code would spin forever

 - Zoned device error path fix

 - Tweak the blkzoned uapi additions from this kernel release, making
   them more easily discoverable

 - Fix for a regression in bcache with bio endio handling added in this
   release

* tag 'block-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  bcache: use bio cloning for detached device requests
  blk-mq: use BLK_POLL_ONESHOT for synchronous poll completion
  selftests/ublk: fix garbage output in foreground mode
  selftests/ublk: fix error handling for starting device
  selftests/ublk: fix IO thread idle check
  block: make the new blkzoned UAPI constants discoverable
  ublk: fix ublksrv pid handling for pid namespaces
  block: Fix an error path in disk_update_zone_resources()
2026-01-23 12:53:56 -08:00
Ming Lei e2723e6ce6 ublk: add new feature UBLK_F_BATCH_IO
Add new feature UBLK_F_BATCH_IO which replaces the following two
per-io commands:

	- UBLK_U_IO_FETCH_REQ

	- UBLK_U_IO_COMMIT_AND_FETCH_REQ

with three per-queue batch io uring_cmd:

	- UBLK_U_IO_PREP_IO_CMDS

	- UBLK_U_IO_COMMIT_IO_CMDS

	- UBLK_U_IO_FETCH_IO_CMDS

Then ublk can deliver batch io commands to ublk server in single
multishort uring_cmd, also allows to prepare & commit multiple
commands in batch style via single uring_cmd, communication cost is
reduced a lot.

This feature also doesn't limit task context any more for all supported
commands, so any allowed uring_cmd can be issued in any task context.
ublk server implementation becomes much easier.

Meantime load balance becomes much easier to support with this feature.
The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task
contexts, so each task can adjust this command's buffer length or number
of inflight commands for controlling how much load is handled by current
task.

Later, priority parameter will be added to command `UBLK_U_IO_FETCH_IO_CMDS`
for improving load balance support.

UBLK_U_IO_NEED_GET_DATA isn't supported in batch io yet, but it may be
enabled in future via its batch pair.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei a4d8837553 ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
of I/O requests. This multishot uring_cmd allows the ublk server to fetch
multiple I/O commands in a single operation, significantly reducing
submission overhead compared to individual FETCH_REQ* commands.

Key Design Features:

1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
   commands, with the batch size limited by the provided buffer length.

2. Dynamic Load Balancing: Multiple fetch commands can be submitted
   simultaneously, but only one is active at any time. This enables
   efficient load distribution across multiple server task contexts.

3. Implicit State Management: The implementation uses three key variables
   to track state:
   - evts_fifo: Queue of request tags awaiting processing
   - fcmd_head: List of available fetch commands
   - active_fcmd: Currently active fetch command (NULL = none active)

   States are derived implicitly:
   - IDLE: No fetch commands available
   - READY: Fetch commands available, none active
   - ACTIVE: One fetch command processing events

4. Lockless Reader Optimization: The active fetch command can read from
   evts_fifo without locking (single reader guarantee), while writers
   (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
   barrier pairing plays key role for the single lockless reader
   optimization.

Implementation Details:

- ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
- __ublk_acquire_fcmd() selects an available fetch command when
  events arrive and no command is currently active
- ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
  buffer and posts completion via io_uring_mshot_cmd_post_cqe()
- State transitions are coordinated via evts_lock to maintain consistency

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei 1e500e106d ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd fixed buffer:

- read each element into one temp buffer in batch style

- parse and apply each element for committing io result

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei b256795b36 ublk: handle UBLK_U_IO_PREP_IO_CMDS
This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command,
which allows userspace to prepare a batch of I/O requests.

The core of this change is the `ublk_walk_cmd_buf` function, which iterates
over the elements in the uring_cmd fixed buffer. For each element, it parses
the I/O details, finds the corresponding `ublk_io` structure, and prepares it
for future dispatch.

Add per-io lock for protecting concurrent delivery and committing.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Ming Lei e86f89ab24 ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS
Add new command UBLK_U_IO_PREP_IO_CMDS, which is the batch version of
UBLK_IO_FETCH_REQ.

Add new command UBLK_U_IO_COMMIT_IO_CMDS, which is for committing io command
result only, still the batch version.

The new command header type is `struct ublk_batch_io`.

This patch doesn't actually implement these commands yet, just validates the
SQE fields.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 20:05:40 -07:00
Pavel Begunkov 5247c034a6 io_uring: introduce non-circular SQ
Outside of SQPOLL, normally SQ entries are consumed by the time the
submission syscall returns. For those cases we don't need a circular
buffer and the head/tail tracking, instead the kernel can assume that
entries always start from the beginning of the SQ at index 0. This patch
introduces a setup flag doing exactly that. It's a simpler and helps
to keeps SQEs hot in cache.

The feature is optional and enabled by setting IORING_SETUP_SQ_REWIND.
The flag is rejected if passed together with SQPOLL as it'd require
waiting for SQ before each submission. It also requires
IORING_SETUP_NO_SQARRAY, which can be supported but it's unlikely there
will be users, so leave more space for future optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 15:47:23 -07:00
Linus Torvalds 0a80e38d0f Including fixes from CAN and wireless.
Pretty big PR, but hard to make up any cohesive story that would
 explain it, a random collection of fixes. The two reverts of bad
 patches from this release here feel like stuff that'd normally
 show up by rc5 or rc6. Perhaps obvious thing to say, given the MW
 timing.
 
 That said, no active investigations / regressions. Let's see what
 the next week brings.
 
 Current release - fix to a fix:
 
  - can: alloc_candev_mqs(): add missing default CAN capabilities
 
 Current release - regressions:
 
  - usbnet: fix crash due to missing BQL accounting after resume
 
  - Revert "net: wwan: mhi_wwan_mbim: Avoid -Wflex-array-member-not ...
 
 Previous releases - regressions:
 
  - Revert "nfc/nci: Add the inconsistency check between the input ...
 
 Previous releases - always broken:
 
  - number of driver fixes for incorrect use of seqlocks on stats
 
  - rxrpc: fix recvmsg() unconditional requeue, don't corrupt rcv queue
    when MSG_PEEK was set
 
  - ipvlan: make the addrs_lock be per port avoid races in the port
    hash table
 
  - sched: enforce that teql can only be used as root qdisc
 
  - virtio: coalesce only linear skb
 
  - wifi: ath12k: fix dead lock while flushing management frames
 
  - eth: igc: reduce TSN TX packet buffer from 7KB to 5KB per queue
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmlyVNQACgkQMUZtbf5S
 IrvoMQ/+N+KikpNLtHFpkqKAsTnNC6gR9yG+NYvUck1Oo+CQQpc/wJ78pS2cyPpM
 siMa7lLzRuCKPYni3RVpG9nqG/jv9HKvFXpgpSb71Evs6syO85kI8Dy4z8G4mClk
 A4QFH6/8V4uZpH/ewy9DTIjxdWPEJA6CuXXnmolTEBhrNkgskv0Sz3jFVTxDjFzN
 xTMk6H2PhgANUsYFree9PQ+gR+QAHKHPvjobWCsepYRHlT4YVi8PrmqoYBvycLwa
 Q//xy6PUOSXZM4sa6YSUMZk2Nou/84BvqouQ05ljaLSZDl4Ecfxl5MDkh/kzZHjR
 wPbMhiZUrJngwwyXQWcSnir1M17IPKjC/FAEDFUFVWDT/wYYHODt9npCRpdglsa5
 SGoy0yDgUN5Lq9Q5dPL0PMLHVLkuDdlzmyva4uso/dQYovXgTi/5Kx2bPc6LLqMJ
 CP5QxjN5OsOHKnGdI/UO28OU5Yn2KRCmJiW9nW3XdxR316lkmo6BWHAxWG9XhbJE
 aCBXu3EoqNSDVuWaZFVQ8g2uTN3uOhnuJqBllhahtENbdqjsat0cxcjYTuUDvuMm
 B1W0My5qq7O6TRjlFl6JfC3k2gqbZ+HlLg3LfmC74bmddOL52up+9HKyeRoWiIaA
 9BQ3DKpjD3VFEbsbBVPfoZ4Ch4jrB+G+ck8f7g/l9BCNM+Ji1Ds=
 =Ehzr
 -----END PGP SIGNATURE-----

Merge tag 'net-6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from CAN and wireless.

  Pretty big, but hard to make up any cohesive story that would explain
  it, a random collection of fixes. The two reverts of bad patches from
  this release here feel like stuff that'd normally show up by rc5 or
  rc6. Perhaps obvious thing to say, given the holiday timing.

  That said, no active investigations / regressions. Let's see what the
  next week brings.

  Current release - fix to a fix:

   - can: alloc_candev_mqs(): add missing default CAN capabilities

  Current release - regressions:

   - usbnet: fix crash due to missing BQL accounting after resume

   - Revert "net: wwan: mhi_wwan_mbim: Avoid -Wflex-array-member-not ...

  Previous releases - regressions:

   - Revert "nfc/nci: Add the inconsistency check between the input ...

  Previous releases - always broken:

   - number of driver fixes for incorrect use of seqlocks on stats

   - rxrpc: fix recvmsg() unconditional requeue, don't corrupt rcv queue
     when MSG_PEEK was set

   - ipvlan: make the addrs_lock be per port avoid races in the port
     hash table

   - sched: enforce that teql can only be used as root qdisc

   - virtio: coalesce only linear skb

   - wifi: ath12k: fix dead lock while flushing management frames

   - eth: igc: reduce TSN TX packet buffer from 7KB to 5KB per queue"

* tag 'net-6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (96 commits)
  Octeontx2-af: Add proper checks for fwdata
  dpll: Prevent duplicate registrations
  net/sched: act_ife: avoid possible NULL deref
  hinic3: Fix netif_queue_set_napi queue_index input parameter error
  vsock/test: add stream TX credit bounds test
  vsock/virtio: cap TX credit to local buffer size
  vsock/test: fix seqpacket message bounds test
  vsock/virtio: fix potential underflow in virtio_transport_get_credit()
  net: fec: account for VLAN header in frame length calculations
  net: openvswitch: fix data race in ovs_vport_get_upcall_stats
  octeontx2-af: Fix error handling
  net: pcs: pcs-mtk-lynxi: report in-band capability for 2500Base-X
  rxrpc: Fix data-race warning and potential load/store tearing
  net: dsa: fix off-by-one in maximum bridge ID determination
  net: bcmasp: Fix network filter wake for asp-3.0
  bonding: provide a net pointer to __skb_flow_dissect()
  selftests: net: amt: wait longer for connection before sending packets
  be2net: Fix NULL pointer dereference in be_cmd_get_mac_from_list
  Revert "net: wwan: mhi_wwan_mbim: Avoid -Wflex-array-member-not-at-end warning"
  netrom: fix double-free in nr_route_frame()
  ...
2026-01-22 09:32:11 -08:00
Jakub Kicinski 9146fe2829 Another set of updates:
- various small fixes for ath10k/ath12k/mwifiex/rsi
  - cfg80211 fix for HE bitrate overflow
  - mac80211 fixes
    - S1G beacon handling in scan
    - skb tailroom handling for HW encryption
    - CSA fix for multi-link
    - handling of disabled links during association
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmlyArkACgkQ10qiO8sP
 aACZxA/+N/q+DAHhVgqETwqOh80WAFTSuhDZsUXc6PNtFkOHvuZaeHePU+fn8hco
 +CkUVEnWYoNgUiaVg1697PJubp0psN7H+3+cq8kn8C/oNB2YnEV2kPCk4x7R8LCj
 TP6mad/Fb+I6Ct7XaCFymCS49eP4Bju7UBgYgLTiKJYvabA+Jim7LavZr8j0Tvra
 h9TNeA0I8+dgGppAWLTssrnsxp65xfSdq71mtRCFUrpEUHjzCl589PEv6BYcIRwv
 N50pm6Am5KJ1TZn5sVSYVfKiiG7UtL/pbXbsM5Cj/54yIFIgmE3bGI5MGAXRlWNG
 o/d/bo0rJg1xppipyZDEN5OJS6S0ijyC5TNKFFRX6IU2eZ8jJs7CWrQw6L8hWCbY
 G+lnVSh4yPAzLEk80S/zBHQccAPXnONtm+cFyPsPab79oAboxQVauuDdH1t5cxbQ
 1HRn0RopyfEoPLmxsCcCSVcdF+hDwRZxAUO735Opnz/amDJNjBKgXTKezXSubfPH
 5hvoAs/VZh7xSyJmEViDO5gavW1SX4nKlUixLYLUrXyq4i+eZkEIvqlCkIWuN9EW
 y/leooWqFzoXY3K8QL8nKQyHWg10WJL1l56tVz+Y+YAh8TvqgWWTB/O/u3g+/ZqO
 eDkaeKbbexUy6iyN0/TMb2RNnTERO0xOepMmIu3sw0HXhptkl1Q=
 =njuT
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Another set of updates:
 - various small fixes for ath10k/ath12k/mwifiex/rsi
 - cfg80211 fix for HE bitrate overflow
 - mac80211 fixes
   - S1G beacon handling in scan
   - skb tailroom handling for HW encryption
   - CSA fix for multi-link
   - handling of disabled links during association

* tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: cfg80211: ignore link disabled flag from userspace
  wifi: mac80211: apply advertised TTLM from association response
  wifi: mac80211: parse all TTLM entries
  wifi: mac80211: don't increment crypto_tx_tailroom_needed_cnt twice
  wifi: mac80211: don't perform DA check on S1G beacon
  wifi: ath12k: Fix wrong P2P device link id issue
  wifi: ath12k: fix dead lock while flushing management frames
  wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel
  wifi: ath12k: cancel scan only on active scan vdev
  wifi: mwifiex: Fix a loop in mwifiex_update_ampdu_rxwinsize()
  wifi: mac80211: correctly check if CSA is active
  wifi: cfg80211: Fix bitrate calculation overflow for HE rates
  wifi: rsi: Fix memory corruption due to not set vif driver data size
  wifi: ath12k: don't force radio frequency check in freq_to_idx()
  wifi: ath12k: fix dma_free_coherent() pointer
  wifi: ath10k: fix dma_free_coherent() pointer
====================

Link: https://patch.msgid.link/20260122110248.15450-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 07:54:31 -08:00
Marc Zyngier f174a9ffcd KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslots
The main use of {LD,ST}64B* is to talk to a device, which is hopefully
directly assigned to the guest and requires no additional handling.

However, this does not preclude a VMM from exposing a virtual device
to the guest, and to allow 64 byte accesses as part of the programming
interface. A direct consequence of this is that we need to be able
to forward such access to userspace.

Given that such a contraption is very unlikely to ever exist, we choose
to offer a limited service: userspace gets (as part of a new exit reason)
the ESR, the IPA, and that's it. It is fully expected to handle the full
semantics of the instructions, deal with ACCDATA, the return values and
increment PC. Much fun.

A canonical implementation can also simply inject an abort and be done
with it. Frankly, don't try to do anything else unless you have time
to waste.

Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
2026-01-22 13:24:49 +00:00
Peter Zijlstra d6200245c7 rseq: Allow registering RSEQ with slice extension
Since glibc cares about the number of syscalls required to initialize a new
thread, allow initializing rseq with slice extension on. This avoids having to
do another prctl().

Requested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143207.814193010@infradead.org
2026-01-22 11:11:19 +01:00
Thomas Gleixner 99d2592023 rseq: Implement sys_rseq_slice_yield()
Provide a new syscall which has the only purpose to yield the CPU after the
kernel granted a time slice extension.

sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows to have a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
2026-01-22 11:11:17 +01:00
Thomas Gleixner 28621ec2d4 rseq: Add prctl() to enable time slice extensions
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails, when time slice extensions are disabled at compile
time or on the kernel command line and when no rseq pointer is registered
in the kernel.

That allows to implement a single trivial check in the exit to user mode
hotpath, to decide whether the whole mechanism needs to be invoked.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.858717691@linutronix.de
2026-01-22 11:11:17 +01:00
Thomas Gleixner d7a5da7a0f rseq: Add fields and constants for time slice extension
Aside of a Kconfig knob add the following items:

   - Two flag bits for the rseq user space ABI, which allow user space to
     query the availability and enablement without a syscall.

   - A new member to the user space ABI struct rseq, which is going to be
     used to communicate request and grant between kernel and user space.

   - A rseq state struct to hold the kernel state of this

   - Documentation of the new mechanism

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
2026-01-22 11:11:16 +01:00
Christoph Hellwig f5f2bad67a block: make the new blkzoned UAPI constants discoverable
The Linux 6.19 merge window added the new BLKREPORTZONESV2 ioctl, and
with it the new BLK_ZONE_REP_CACHED and BLK_ZONE_COND_ACTIVE constants.

The two constants are defined as part of enums, which makes it very
painful for userspace to discover if they are present in the installed
system headers.

Use the #define to the same name trick to make them trivially
discoverable using CPP directives.

Fixes: 0bf0e2e466 ("block: track zone conditions")
Fixes: b30ffcdc0c ("block: introduce BLKREPORTZONESV2 ioctl")
Reported-by: Andrey Albershteyn <aalbersh@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21 07:47:44 -07:00
Benjamin Berg 50b359896f wifi: cfg80211: ignore link disabled flag from userspace
When the AP has an advertised TID to Link Mapping (TTLM) it shall
include the element in the association response. As such, when this
element is present it needs to be used for the currently dormant links.
See Draft P802.11REVmf_D1.0 section 35.3.7.2.3 ("Negotiation of TTLM")
for the details. The flag is also not usable in case userspace wants to
specify a negotiated TTLM during association.

Note that for the link reconfiguration case, mac80211 did not use the
information. Draft P802.11REVmf_D1.0 states in section 35.3.6.4 ("Link
reconfiguration to the setup links) that we "shall operate with all the
TIDs mapped to the newly added links ..."

All this means that the flag is not needed. The implementation should
parse the information from the association response.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260118093904.754e057896a5.Ifd06f5ef839a93bfd54d0593dc932870f95f3242@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-20 10:02:01 +01:00
Linus Torvalds 90a855e75a Landlock fix for v6.19-rc6
-----BEGIN PGP SIGNATURE-----
 
 iIYEABYKAC4WIQSVyBthFV4iTW/VU1/l49DojIL20gUCaWledRAcbWljQGRpZ2lr
 b2QubmV0AAoJEOXj0OiMgvbSBowBAID4pRQB8EKRNaoqadUWnoSE/wl929Y6KY7i
 FBf8aODOAP9lOtU/CjL5jHKjt00zKW5gX3o0LIuFzSeuzCFTVx2MBA==
 =Aef0
 -----END PGP SIGNATURE-----

Merge tag 'landlock-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull landlock fixes from Mickaël Salaün:
 "This fixes TCP handling, tests, documentation, non-audit elided code,
  and minor cosmetic changes"

* tag 'landlock-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Clarify documentation for the IOCTL access right
  selftests/landlock: Properly close a file descriptor
  landlock: Improve the comment for domain_is_scoped
  selftests/landlock: Use scoped_base_variants.h for ptrace_test
  selftests/landlock: Fix missing semicolon
  selftests/landlock: Fix typo in fs_test
  landlock: Optimize stack usage when !CONFIG_AUDIT
  landlock: Fix spelling
  landlock: Clean up hook_ptrace_access_check()
  landlock: Improve erratum documentation
  landlock: Remove useless include
  landlock: Fix wrong type usage
  selftests/landlock: NULL-terminate unix pathname addresses
  selftests/landlock: Remove invalid unix socket bind()
  selftests/landlock: Add missing connect(minimal AF_UNSPEC) test
  selftests/landlock: Fix TCP bind(AF_UNSPEC) test case
  landlock: Fix TCP handling of short AF_UNSPEC addresses
  landlock: Fix formatting
2026-01-18 15:15:47 -08:00
Linus Torvalds f8907398a6 Ext4 bug fixes for 6.19:
* Fix an inconsistency in structure size on 32-bit platforms caused by padding
      differences for the new EXT4_IOC_[GS]ET_TUNE_SB_PARAM ioctls
    * Fix a buffer leak on the error path when dropping the refcount an xattr
      value stored in an inode
    * Fix missing locking on the error path for the file defragmentation ioctl
      leading to a BUG
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmltDvoACgkQ8vlZVpUN
 gaOe8Af/cKUtk6niuANObDZ4BKIxiTVq1dujTVfftfDQoTsi+G3R4GrZS3UeZ8FB
 9K7UdnnmMkIVgv4e6mYjyjfvNY9oUF2PI9y19E+bk+sxhEEtIJfw0DCgbTfwtIc2
 6NKUS66DDlM1vy8qTrStIM4qShmFjfP+xqiOHWQKD0Xfd9vdxbPV/otVr+Qjl/Xj
 W2kT2erpvPnD9WLB+iaRv4G8zH3RSXEBPGCx43bP6E6JqwFhMw9qzIg1tcy4bVqG
 WPROiAuctzgnE9yEVvi5gF8gG3e1naB72rXi2aP6Ush4QMeNKLOWnDHhyCDr/V+5
 0fme7qlZYyIJZ7gys2sBAW4PkTaHEA==
 =SXcS
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:

 - Fix an inconsistency in structure size on 32-bit platforms caused by
   padding differences for the new EXT4_IOC_[GS]ET_TUNE_SB_PARAM ioctls

 - Fix a buffer leak on the error path when dropping the refcount an
   xattr value stored in an inode

 - Fix missing locking on the error path for the file defragmentation
   ioctl leading to a BUG

* tag 'ext4_for_linus-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix iloc.bh leak in ext4_xattr_inode_update_ref
  ext4: add missing down_write_data_sem in mext_move_extent().
  ext4: fix ext4_tune_sb_params padding
2026-01-18 14:01:20 -08:00
Arnd Bergmann cd16edba1c ext4: fix ext4_tune_sb_params padding
The padding at the end of struct ext4_tune_sb_params is architecture
specific and in particular is different between x86-32 and x86-64,
since the __u64 member only enforces struct alignment on the latter.

This shows up as a new warning when test-building the headers with
-Wpadded:

include/linux/ext4.h:144:1: error: padding struct size to alignment boundary with 4 bytes [-Werror=padded]

All members inside the structure are naturally aligned, so the only
difference here is the amount of padding at the end. Make the padding
explicit, to have a consistent sizeof(struct ext4_tune_sb_params) of
232 on all architectures and avoid adding compat ioctl handling for
EXT4_IOC_GET_TUNE_SB_PARAM/EXT4_IOC_SET_TUNE_SB_PARAM.

This is an ABI break on x86-32 but hopefully this can go into 6.18.y early
enough as a fixup so no actual users will be affected.  Alternatively, the
kernel could handle the ioctl commands for both sizes (232 and 228 bytes)
on all architectures.

Fixes: 04a91570ac ("ext4: implemet new ioctls to set and get superblock parameters")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20251204101914.1037148-1-arnd@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2026-01-18 11:22:53 -05:00
Bill Wendling 150a04d817 compiler_types.h: Attributes: Add __counted_by_ptr macro
Introduce __counted_by_ptr(), which works like __counted_by(), but for
pointer struct members.

struct foo {
	int a, b, c;
	char *buffer __counted_by_ptr(bytes);
	short nr_bars;
	struct bar *bars __counted_by_ptr(nr_bars);
	size_t bytes;
};

Because "counted_by" can only be applied to pointer members in very
recent compiler versions, its application ends up needing to be distinct
from flexibe array "counted_by" annotations, hence a separate macro.

This is a reworking of Kees' previous patch [1].

Link: https://lore.kernel.org/all/20251020220118.1226740-1-kees@kernel.org/ [1]
Co-developed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Bill Wendling <morbo@google.com>
Link: https://patch.msgid.link/20260116005838.2419118-1-morbo@google.com
Signed-off-by: Kees Cook <kees@kernel.org>
2026-01-17 11:00:28 -08:00
Linus Torvalds b62ce2547f Power management fixes for 6.19-rc6
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
 
  - Fix stale description of the cost field in struct em_perf_state to
    reflect the current code (Yaxiong Tian)
 
  - Fix and revamp the energy model YNL specification added recently
    along with the energy model netlink interface (Changwoo Min)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlqWS8SHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1BhAH/iSA/9YP9/xIW+rIZj2irsncGVc/hJ+V
 8PcM2tar966AEQBlmDPc7ug2MXT0Mb/5WRtQo/IvZ1tdAALB+bC70FHjdu3y7SAX
 IzFGyHKJ9OVJHGxYq0TCBbRXgYubUZiqvAnaBTBZ/ZIFlt3B4KEyatfFfxJMb1Pe
 H9zQIyUVBN1tjPfgNVbc22XGfkwfmmla72le6GmHseKZRqpwutIwzxm1PoMNVeLb
 fQv625LsWrDD9NU6j5uGq3WEq3xPltubtHN545FTFi4aGwBYJaG1FuIi+kPiQuCR
 VZmxTWVEiVwjttiZf/xWbOkPfwQqdgeXlTk5j+iBlF1jkrrW75IuLYQ=
 =eJ2p
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix an error path memory leak in the energy model management
  code, fix a kerneldoc comment in it, and fix and revamp the energy
  model YNL specification added recently along with the new energy model
  management netlink interface (that received feedback after being
  added):

   - Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)

   - Fix stale description of the cost field in struct em_perf_state to
     reflect the current code (Yaxiong Tian)

   - Fix and revamp the energy model YNL specification added recently
     along with the energy model netlink interface (Changwoo Min)"

* tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: EM: Add dump to get-perf-domains in the EM YNL spec
  PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
  PM: EM: Rename em.yaml to dev-energymodel.yaml
  PM: EM: Fix yamllint warnings in the EM YNL spec
  PM: EM: Fix memory leak in em_create_pd() error path
  PM: EM: Fix incorrect description of the cost field in struct em_perf_state
2026-01-16 12:08:19 -08:00
Christian Brauner 9b8a0ba682
mount: add OPEN_TREE_NAMESPACE
When creating containers the setup usually involves using CLONE_NEWNS
via clone3() or unshare(). This copies the caller's complete mount
namespace. The runtime will also assemble a new rootfs and then use
pivot_root() to switch the old mount tree with the new rootfs. Afterward
it will recursively umount the old mount tree thereby getting rid of all
mounts.

On a basic system here where the mount table isn't particularly large
this still copies about 30 mounts. Copying all of these mounts only to
get rid of them later is pretty wasteful.

This is exacerbated if intermediary mount namespaces are used that only
exist for a very short amount of time and are immediately destroyed
again causing a ton of mounts to be copied and destroyed needlessly.

With a large mount table and a system where thousands or ten-thousands
of containers are spawned in parallel this quickly becomes a bottleneck
increasing contention on the semaphore.

Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
returning a file descriptor referring to that mount tree
OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
to a new mount namespace. In that new mount namespace the copied mount
tree has been mounted on top of a copy of the real rootfs.

The caller can setns() into that mount namespace and perform any
additionally required setup such as move_mount() detached mounts in
there.

This allows OPEN_TREE_NAMESPACE to function as a combined
unshare(CLONE_NEWNS) and pivot_root().

A caller may for example choose to create an extremely minimal rootfs:

fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

This will create a mount namespace where "wootwoot" has become the
rootfs mounted on top of the real rootfs. The caller can now setns()
into this new mount namespace and assemble additional mounts.

This also works with user namespaces:

unshare(CLONE_NEWUSER);
fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

which creates a new mount namespace owned by the earlier created user
namespace with "wootwoot" as the rootfs mounted on top of the real
rootfs.

Link: https://patch.msgid.link/20251229-work-empty-namespace-v1-1-bfb24c7b061f@kernel.org
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Christian Brauner <brauner@kernel.org>
Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 19:21:40 +01:00
Ian Abbott 10d28cffb3 comedi: Fix getting range information for subdevices 16 to 255
The `COMEDI_RANGEINFO` ioctl does not work properly for subdevice
indices above 15.  Currently, the only in-tree COMEDI drivers that
support more than 16 subdevices are the "8255" driver and the
"comedi_bond" driver.  Making the ioctl work for subdevice indices up to
255 is achievable.  It needs minor changes to the handling of the
`COMEDI_RANGEINFO` and `COMEDI_CHANINFO` ioctls that should be mostly
harmless to user-space, apart from making them less broken.  Details
follow...

The `COMEDI_RANGEINFO` ioctl command gets the list of supported ranges
(usually with units of volts or milliamps) for a COMEDI subdevice or
channel.  (Only some subdevices have per-channel range tables, indicated
by the `SDF_RANGETYPE` flag in the subdevice information.)  It uses a
`range_type` value and a user-space pointer, both supplied by
user-space, but the `range_type` value should match what was obtained
using the `COMEDI_CHANINFO` ioctl (if the subdevice has per-channel
range tables)  or `COMEDI_SUBDINFO` ioctl (if the subdevice uses a
single range table for all channels).  Bits 15 to 0 of the `range_type`
value contain the length of the range table, which is the only part that
user-space should care about (so it can use a suitably sized buffer to
fetch the range table).  Bits 23 to 16 store the channel index, which is
assumed to be no more than 255 if the subdevice has per-channel range
tables, and is set to 0 if the subdevice has a single range table.  For
`range_type` values produced by the `COMEDI_SUBDINFO` ioctl, bits 31 to
24 contain the subdevice index, which is assumed to be no more than 255.
But for `range_type` values produced by the `COMEDI_CHANINFO` ioctl,
bits 27 to 24 contain the subdevice index, which is assumed to be no
more than 15, and bits 31 to 28 contain the COMEDI device's minor device
number for some unknown reason lost in the mists of time.  The
`COMEDI_RANGEINFO` ioctl extract the length from bits 15 to 0 of the
user-supplied `range_type` value, extracts the channel index from bits
23 to 16 (only used if the subdevice has per-channel range tables),
extracts the subdevice index from bits 27 to 24, and ignores bits 31 to
28.  So for subdevice indices 16 to 255, the `COMEDI_SUBDINFO` or
`COMEDI_CHANINFO` ioctl will report a `range_type` value that doesn't
work with the `COMEDI_RANGEINFO` ioctl.  It will either get the range
table for the subdevice index modulo 16, or will fail with `-EINVAL`.

To fix this, always use bits 31 to 24 of the `range_type` value to hold
the subdevice index (assumed to be no more than 255).  This affects the
`COMEDI_CHANINFO` and `COMEDI_RANGEINFO` ioctls.  There should not be
anything in user-space that depends on the old, broken usage, although
it may now see different values in bits 31 to 28 of the `range_type`
values reported by the `COMEDI_CHANINFO` ioctl for subdevices that have
per-channel subdevices.  User-space should not be trying to decode bits
31 to 16 of the `range_type` values anyway.

Fixes: ed9eccbe89 ("Staging: add comedi core")
Cc: stable@vger.kernel.org #5.17+
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Link: https://patch.msgid.link/20251203162438.176841-1-abbotti@mev.co.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-16 16:42:15 +01:00