Here is the big set of char/misc and other small driver subsystem
updates for 5.18-rc1.
Included in here are merges from driver subsystems which contain:
- iio driver updates and new drivers
- fsi driver updates
- fpga driver updates
- habanalabs driver updates and support for new hardware
- soundwire driver updates and new drivers
- phy driver updates and new drivers
- coresight driver updates
- icc driver updates
Individual changes include:
- mei driver updates
- interconnect driver updates
- new PECI driver subsystem added
- vmci driver updates
- lots of tiny misc/char driver updates
There will be two merge conflicts with your tree, one in MAINTAINERS
which is obvious to fix up, and one in drivers/phy/freescale/Kconfig
which also should be easy to resolve.
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG3fQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykNEgCfaRG8CRxewDXOO4+GSeA3NGK+AIoAnR89donC
R4bgCjfg8BWIBcVVXg3/
=WWXC
-----END PGP SIGNATURE-----
Merge tag 'char-misc-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc and other driver updates from Greg KH:
"Here is the big set of char/misc and other small driver subsystem
updates for 5.18-rc1.
Included in here are merges from driver subsystems which contain:
- iio driver updates and new drivers
- fsi driver updates
- fpga driver updates
- habanalabs driver updates and support for new hardware
- soundwire driver updates and new drivers
- phy driver updates and new drivers
- coresight driver updates
- icc driver updates
Individual changes include:
- mei driver updates
- interconnect driver updates
- new PECI driver subsystem added
- vmci driver updates
- lots of tiny misc/char driver updates
All of these have been in linux-next for a while with no reported
problems"
* tag 'char-misc-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (556 commits)
firmware: google: Properly state IOMEM dependency
kgdbts: fix return value of __setup handler
firmware: sysfb: fix platform-device leak in error path
firmware: stratix10-svc: add missing callback parameter on RSU
arm64: dts: qcom: add non-secure domain property to fastrpc nodes
misc: fastrpc: Add dma handle implementation
misc: fastrpc: Add fdlist implementation
misc: fastrpc: Add helper function to get list and page
misc: fastrpc: Add support to secure memory map
dt-bindings: misc: add fastrpc domain vmid property
misc: fastrpc: check before loading process to the DSP
misc: fastrpc: add secure domain support
dt-bindings: misc: add property to support non-secure DSP
misc: fastrpc: Add support to get DSP capabilities
misc: fastrpc: add support for FASTRPC_IOCTL_MEM_MAP/UNMAP
misc: fastrpc: separate fastrpc device from channel context
dt-bindings: nvmem: brcm,nvram: add basic NVMEM cells
dt-bindings: nvmem: make "reg" property optional
nvmem: brcm_nvram: parse NVRAM content into NVMEM cells
nvmem: dt-bindings: Fix the error of dt-bindings check
...
Pull i2c updates from Wolfram Sang:
- tracepoints when Linux acts as an I2C client
- added support for AMD PSP
- whole subsystem now uses generic_handle_irq_safe()
- piix4 driver gained MMIO access enabling so far missed controllers
with AMD chipsets
- a bulk of device driver updates, refactorization, and fixes.
* 'i2c/for-mergewindow' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (61 commits)
i2c: mux: demux-pinctrl: do not deactivate a master that is not active
i2c: meson: Fix wrong speed use from probe
i2c: add tracepoints for I2C slave events
i2c: designware: Remove code duplication
i2c: cros-ec-tunnel: Fix syntax errors in comments
MAINTAINERS: adjust XLP9XX I2C DRIVER after removing the devicetree binding
i2c: designware: Mark dw_i2c_plat_{suspend,resume}() as __maybe_unused
i2c: mediatek: Add i2c compatible for Mediatek MT8168
dt-bindings: i2c: update bindings for MT8168 SoC
i2c: mt65xx: Simplify with clk-bulk
i2c: i801: Drop two outdated comments
i2c: xiic: Make bus names unique
i2c: i801: Add support for the Process Call command
i2c: i801: Drop useless masking in i801_access
i2c: tegra: Add SMBus block read function
i2c: designware: Use the i2c_mark_adapter_suspended/resumed() helpers
i2c: designware: Lock the adapter while setting the suspended flag
i2c: mediatek: remove redundant null check
i2c: mediatek: modify bus speed calculation formula
i2c: designware: Fix improper usage of readl
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI1AHwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplPjEACVJzKg5NkxpdkDThvq5tejws9KxB/4mHJg
NoDMcv1TF+Orsd/HNW6XrgYnbU0ObHom3568/xb8BNegRVFe7V4ME/4IYNRyGOmV
qbfciu04L1UkJhI52CIidkOioFABL3r1zgLCIz5vk0Cv9X7Le9x0UabHxJf7u9S+
Z3lNdyxezN0SGx8VT86l/7lSoHtG3VHO9IsQCuNGF02SB+6uGpXBlptbEoQ4nTxd
T7/H9FNOe2Wf7eKvcOOds8UlvZYAfYcY0GcRrIOXdHIy25mKFWwn5cDgFTMOH5ID
xXpm+JFkDkrfSW1o4FFPxbN9Z6RbVXbGCsrXlIragLO2MJQdXiIUxS1OPT5oAado
H9MlX6QtkwziLW9zUWa/N/jmRjc2vzHAxD6JFg/wXxNdtY0kd8TQpaxwTB8mVDPN
VCGutt7lJS1CQInQ+ppzbdqzzuLHC1RHAyWSmfUE9rb8cbjxtJBnSIorYRLUesMT
GRwqVTXW0osxSgCb1iDiBCJANrX1yPZcemv4Wh1gzbT6IE9sWxWXsE5sy9KvswNc
M+E4nu/TYYTfkynItJjLgmDLOoi+V0FBY6ba0mRPBjkriSP4AVlwsZLGVsAHQzuA
o5paW1GjRCCwhIQ6+AzZIoOz6wqvprBlUgUkUneyYAQ2ZKC3pZi8zPnpoVdFucVa
VaTzP71C1Q==
=efaq
-----END PGP SIGNATURE-----
Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block
Pull NVMe write streams removal from Jens Axboe:
"This removes the write streams support in NVMe. No vendor ever really
shipped working support for this, and they are not interested in
supporting it.
With the NVMe support gone, we have nothing in the tree that supports
this. Remove passing around of the hints.
The only discussion point in this patchset imho is the fact that the
file specific write hint setting/getting fcntl helpers will now return
-1/EINVAL like they did before we supported write hints. No known
applications use these functions, I only know of one prototype that I
help do for RocksDB, and that's not used. That said, with a change
like this, it's always a bit controversial. Alternatively, we could
just make them return 0 and pretend it worked. It's placement based
hints after all"
* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
fs: remove fs.f_write_hint
fs: remove kiocb.ki_hint
block: remove the per-bio/request write hint
nvme: remove support or stream based temperature hint
reuse_swap_page() currently indicates if we can write to an anon page
without COW. A COW is required if the page is shared by multiple
processes (either already mapped or via swap entries) or if there is
concurrent writeback that cannot tolerate concurrent page modifications.
However, in the context of khugepaged we're not actually going to write to
a read-only mapped page, we'll copy the page content to our newly
allocated THP and map that THP writable. All we have to make sure is that
the read-only mapped page we're about to copy won't get reused by another
process sharing the page, otherwise, page content would get modified. But
that is already guaranteed via multiple mechanisms (e.g., holding a
reference, holding the page lock, removing the rmap after copying the
page).
The swapcache handling was introduced in commit 10359213d0 ("mm:
incorporate read-only pages into transparent huge pages") and it sounds
like it merely wanted to mimic what do_swap_page() would do when trying to
map a page obtained via the swapcache writable.
As that logic is unnecessary, let's just remove it, removing the last user
of reuse_swap_page().
Link: https://lkml.kernel.org/r/20220131162940.210846-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new GFP flag __GFP_SKIP_ZERO that allows to skip memory
initialization. The flag is only effective with HW_TAGS KASAN.
This flag will be used by vmalloc code for page_alloc allocations backing
vmalloc() mappings in a following patch. The reason to skip memory
initialization for these pages in page_alloc is because vmalloc code will
be initializing them instead.
With the current implementation, when __GFP_SKIP_ZERO is provided,
__GFP_ZEROTAGS is ignored. This doesn't matter, as these two flags are
never provided at the same time. However, if this is changed in the
future, this particular implementation detail can be changed as well.
Link: https://lkml.kernel.org/r/0d53efeff345de7d708e0baa0d8829167772521e.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new GFP flag __GFP_SKIP_KASAN_UNPOISON that allows skipping KASAN
poisoning for page_alloc allocations. The flag is only effective with
HW_TAGS KASAN.
This flag will be used by vmalloc code for page_alloc allocations backing
vmalloc() mappings in a following patch. The reason to skip KASAN
poisoning for these pages in page_alloc is because vmalloc code will be
poisoning them instead.
Also reword the comment for __GFP_SKIP_KASAN_POISON.
Link: https://lkml.kernel.org/r/35c97d77a704f6ff971dd3bfe4be95855744108e.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only define the ___GFP_SKIP_KASAN_POISON flag when CONFIG_KASAN_HW_TAGS is
enabled.
This patch it not useful by itself, but it prepares the code for additions
of new KASAN-specific GFP patches.
Link: https://lkml.kernel.org/r/44e5738a584c11801b2b8f1231898918efc8634a.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds two trace events for base page and HugeTLB page migrations.
These events, closely follow the implementation details like setting and
removing of PTE migration entries, which are essential operations for
migration. The new CREATE_TRACE_POINTS in <mm/rmap.c> covers both
<events/migration.h> and <events/tlb.h> based trace events. Hence drop
redundant CREATE_TRACE_POINTS from other places which could have otherwise
conflicted during build.
Link: https://lkml.kernel.org/r/1643368182-9588-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/migration: Add trace events", v3.
This adds trace events for all migration scenarios including base page,
THP and HugeTLB.
This patch (of 3):
This adds two trace events for PMD based THP migration without split.
These events closely follow the implementation details like setting and
removing of PMD migration entries, which are essential operations for THP
migration. This moves CREATE_TRACE_POINTS into generic THP from powerpc
for these new trace events to be available on other platforms as well.
Link: https://lkml.kernel.org/r/1643368182-9588-1-git-send-email-anshuman.khandual@arm.com
Link: https://lkml.kernel.org/r/1643368182-9588-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Core
----
- Introduce XDP multi-buffer support, allowing the use of XDP with
jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
- Speed up netns dismantling (5x) and lower the memory cost a little.
Remove unnecessary per-netns sockets. Scope some lists to a netns.
Cut down RCU syncing. Use batch methods. Allow netdev registration
to complete out of order.
- Support distinguishing timestamp types (ingress vs egress) and
maintaining them across packet scrubbing points (e.g. redirect).
- Continue the work of annotating packet drop reasons throughout
the stack.
- Switch netdev error counters from an atomic to dynamically
allocated per-CPU counters.
- Rework a few preempt_disable(), local_irq_save() and busy waiting
sections problematic on PREEMPT_RT.
- Extend the ref_tracker to allow catching use-after-free bugs.
BPF
---
- Introduce "packing allocator" for BPF JIT images. JITed code is
marked read only, and used to be allocated at page granularity.
Custom allocator allows for more efficient memory use, lower
iTLB pressure and prevents identity mapping huge pages from
getting split.
- Make use of BTF type annotations (e.g. __user, __percpu) to enforce
the correct probe read access method, add appropriate helpers.
- Convert the BPF preload to use light skeleton and drop
the user-mode-driver dependency.
- Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
its use as a packet generator.
- Allow local storage memory to be allocated with GFP_KERNEL if called
from a hook allowed to sleep.
- Introduce fprobe (multi kprobe) to speed up mass attachment (arch
bits to come later).
- Add unstable conntrack lookup helpers for BPF by using the BPF
kfunc infra.
- Allow cgroup BPF progs to return custom errors to user space.
- Add support for AF_UNIX iterator batching.
- Allow iterator programs to use sleepable helpers.
- Support JIT of add, and, or, xor and xchg atomic ops on arm64.
- Add BTFGen support to bpftool which allows to use CO-RE in kernels
without BTF info.
- Large number of libbpf API improvements, cleanups and deprecations.
Protocols
---------
- Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
- Adjust TSO packet sizes based on min_rtt, allowing very low latency
links (data centers) to always send full-sized TSO super-frames.
- Make IPv6 flow label changes (AKA hash rethink) more configurable,
via sysctl and setsockopt. Distinguish between server and client
behavior.
- VxLAN support to "collect metadata" devices to terminate only
configured VNIs. This is similar to VLAN filtering in the bridge.
- Support inserting IPv6 IOAM information to a fraction of frames.
- Add protocol attribute to IP addresses to allow identifying where
given address comes from (kernel-generated, DHCP etc.)
- Support setting socket and IPv6 options via cmsg on ping6 sockets.
- Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
Define dscp_t and stop taking ECN bits into account in fib-rules.
- Add support for locked bridge ports (for 802.1X).
- tun: support NAPI for packets received from batched XDP buffs,
doubling the performance in some scenarios.
- IPv6 extension header handling in Open vSwitch.
- Support IPv6 control message load balancing in bonding, prevent
neighbor solicitation and advertisement from using the wrong port.
Support NS/NA monitor selection similar to existing ARP monitor.
- SMC
- improve performance with TCP_CORK and sendfile()
- support auto-corking
- support TCP_NODELAY
- MCTP (Management Component Transport Protocol)
- add user space tag control interface
- I2C binding driver (as specified by DMTF DSP0237)
- Multi-BSSID beacon handling in AP mode for WiFi.
- Bluetooth:
- handle MSFT Monitor Device Event
- add MGMT Adv Monitor Device Found/Lost events
- Multi-Path TCP:
- add support for the SO_SNDTIMEO socket option
- lots of selftest cleanups and improvements
- Increase the max PDU size in CAN ISOTP to 64 kB.
Driver API
----------
- Add HW counters for SW netdevs, a mechanism for devices which
offload packet forwarding to report packet statistics back to
software interfaces such as tunnels.
- Select the default NIC queue count as a fraction of number of
physical CPU cores, instead of hard-coding to 8.
- Expose devlink instance locks to drivers. Allow device layer of
drivers to use that lock directly instead of creating their own
which always runs into ordering issues in devlink callbacks.
- Add header/data split indication to guide user space enabling
of TCP zero-copy Rx.
- Allow configuring completion queue event size.
- Refactor page_pool to enable fragmenting after allocation.
- Add allocation and page reuse statistics to page_pool.
- Improve Multiple Spanning Trees support in the bridge to allow
reuse of topologies across VLANs, saving HW resources in switches.
- DSA (Distributed Switch Architecture):
- replay and offload of host VLAN entries
- offload of static and local FDB entries on LAG interfaces
- FDB isolation and unicast filtering
New hardware / drivers
----------------------
- Ethernet:
- LAN937x T1 PHYs
- Davicom DM9051 SPI NIC driver
- Realtek RTL8367S, RTL8367RB-VB switch and MDIO
- Microchip ksz8563 switches
- Netronome NFP3800 SmartNICs
- Fungible SmartNICs
- MediaTek MT8195 switches
- WiFi:
- mt76: MediaTek mt7916
- mt76: MediaTek mt7921u USB adapters
- brcmfmac: Broadcom BCM43454/6
- Mobile:
- iosm: Intel M.2 7360 WWAN card
Drivers
-------
- Convert many drivers to the new phylink API built for split PCS
designs but also simplifying other cases.
- Intel Ethernet NICs:
- add TTY for GNSS module for E810T device
- improve AF_XDP performance
- GTP-C and GTP-U filter offload
- QinQ VLAN support
- Mellanox Ethernet NICs (mlx5):
- support xdp->data_meta
- multi-buffer XDP
- offload tc push_eth and pop_eth actions
- Netronome Ethernet NICs (nfp):
- flow-independent tc action hardware offload (police / meter)
- AF_XDP
- Other Ethernet NICs:
- at803x: fiber and SFP support
- xgmac: mdio: preamble suppression and custom MDC frequencies
- r8169: enable ASPM L1.2 if system vendor flags it as safe
- macb/gem: ZynqMP SGMII
- hns3: add TX push mode
- dpaa2-eth: software TSO
- lan743x: multi-queue, mdio, SGMII, PTP
- axienet: NAPI and GRO support
- Mellanox Ethernet switches (mlxsw):
- source and dest IP address rewrites
- RJ45 ports
- Marvell Ethernet switches (prestera):
- basic routing offload
- multi-chain TC ACL offload
- NXP embedded Ethernet switches (ocelot & felix):
- PTP over UDP with the ocelot-8021q DSA tagging protocol
- basic QoS classification on Felix DSA switch using dcbnl
- port mirroring for ocelot switches
- Microchip high-speed industrial Ethernet (sparx5):
- offloading of bridge port flooding flags
- PTP Hardware Clock
- Other embedded switches:
- lan966x: PTP Hardward Clock
- qca8k: mdio read/write operations via crafted Ethernet packets
- Qualcomm 802.11ax WiFi (ath11k):
- add LDPC FEC type and 802.11ax High Efficiency data in radiotap
- enable RX PPDU stats in monitor co-exist mode
- Intel WiFi (iwlwifi):
- UHB TAS enablement via BIOS
- band disablement via BIOS
- channel switch offload
- 32 Rx AMPDU sessions in newer devices
- MediaTek WiFi (mt76):
- background radar detection
- thermal management improvements on mt7915
- SAR support for more mt76 platforms
- MBSSID and 6 GHz band on mt7915
- RealTek WiFi:
- rtw89: AP mode
- rtw89: 160 MHz channels and 6 GHz band
- rtw89: hardware scan
- Bluetooth:
- mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
- Microchip CAN (mcp251xfd):
- multiple RX-FIFOs and runtime configurable RX/TX rings
- internal PLL, runtime PM handling simplification
- improve chip detection and error handling after wakeup
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmI7YBcACgkQMUZtbf5S
IrveSBAAmSNJlUK6vPsnNzs7IhsZnfI/AUjm2TCLZnlhKttbpI4A/4Pohk33V7RS
FGX7f8kjEfhUwrIiLDgeCnztNHRECrCmk6aZc/jLEvecmTauJ+f6kjShkDY/wix+
AkPHmrZnQeLPAEVuljDdV+sL6ik08+zQL7PazIYHsaSKKC0MGQptRwcri8PLRAKE
KPBAhVhleq2rAZ/ntprSN52F4Af6rpFTrPIWuN8Bqdbc9dy5094LT0mpOOWYvgr3
/DLvvAPuLemwyIQkjWknVKBRUAQcmNPC+BY3J8K3LRaiNhekGqOFan46BfqP+k2J
6DWu0Qrp2yWt4BMOeEToZR5rA6v5suUAMIBu8PRZIDkINXQMlIxHfGjZyNm0rVfw
7edNri966yus9OdzwPa32MIG3oC6PnVAwYCJAjjBMNS8sSIkp7wgHLkgWN4UFe2H
K/e6z8TLF4UQ+zFM0aGI5WZ+9QqWkTWEDF3R3OhdFpGrznna0gxmkOeV2YvtsgxY
cbS0vV9Zj73o+bYzgBKJsw/dAjyLdXoHUGvus26VLQ78S/VGunVKtItwoxBAYmZo
krW964qcC89YofzSi8RSKLHuEWtNWZbVm8YXr75u6jpr5GhMBu0CYefLs+BuZcxy
dw8c69cGneVbGZmY2J3rBhDkchbuICl8vdUPatGrOJAoaFdYKuw=
=ELpe
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"The sprinkling of SPI drivers is because we added a new one and Mark
sent us a SPI driver interface conversion pull request.
Core
----
- Introduce XDP multi-buffer support, allowing the use of XDP with
jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
- Speed up netns dismantling (5x) and lower the memory cost a little.
Remove unnecessary per-netns sockets. Scope some lists to a netns.
Cut down RCU syncing. Use batch methods. Allow netdev registration
to complete out of order.
- Support distinguishing timestamp types (ingress vs egress) and
maintaining them across packet scrubbing points (e.g. redirect).
- Continue the work of annotating packet drop reasons throughout the
stack.
- Switch netdev error counters from an atomic to dynamically
allocated per-CPU counters.
- Rework a few preempt_disable(), local_irq_save() and busy waiting
sections problematic on PREEMPT_RT.
- Extend the ref_tracker to allow catching use-after-free bugs.
BPF
---
- Introduce "packing allocator" for BPF JIT images. JITed code is
marked read only, and used to be allocated at page granularity.
Custom allocator allows for more efficient memory use, lower iTLB
pressure and prevents identity mapping huge pages from getting
split.
- Make use of BTF type annotations (e.g. __user, __percpu) to enforce
the correct probe read access method, add appropriate helpers.
- Convert the BPF preload to use light skeleton and drop the
user-mode-driver dependency.
- Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
its use as a packet generator.
- Allow local storage memory to be allocated with GFP_KERNEL if
called from a hook allowed to sleep.
- Introduce fprobe (multi kprobe) to speed up mass attachment (arch
bits to come later).
- Add unstable conntrack lookup helpers for BPF by using the BPF
kfunc infra.
- Allow cgroup BPF progs to return custom errors to user space.
- Add support for AF_UNIX iterator batching.
- Allow iterator programs to use sleepable helpers.
- Support JIT of add, and, or, xor and xchg atomic ops on arm64.
- Add BTFGen support to bpftool which allows to use CO-RE in kernels
without BTF info.
- Large number of libbpf API improvements, cleanups and deprecations.
Protocols
---------
- Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
- Adjust TSO packet sizes based on min_rtt, allowing very low latency
links (data centers) to always send full-sized TSO super-frames.
- Make IPv6 flow label changes (AKA hash rethink) more configurable,
via sysctl and setsockopt. Distinguish between server and client
behavior.
- VxLAN support to "collect metadata" devices to terminate only
configured VNIs. This is similar to VLAN filtering in the bridge.
- Support inserting IPv6 IOAM information to a fraction of frames.
- Add protocol attribute to IP addresses to allow identifying where
given address comes from (kernel-generated, DHCP etc.)
- Support setting socket and IPv6 options via cmsg on ping6 sockets.
- Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
Define dscp_t and stop taking ECN bits into account in fib-rules.
- Add support for locked bridge ports (for 802.1X).
- tun: support NAPI for packets received from batched XDP buffs,
doubling the performance in some scenarios.
- IPv6 extension header handling in Open vSwitch.
- Support IPv6 control message load balancing in bonding, prevent
neighbor solicitation and advertisement from using the wrong port.
Support NS/NA monitor selection similar to existing ARP monitor.
- SMC
- improve performance with TCP_CORK and sendfile()
- support auto-corking
- support TCP_NODELAY
- MCTP (Management Component Transport Protocol)
- add user space tag control interface
- I2C binding driver (as specified by DMTF DSP0237)
- Multi-BSSID beacon handling in AP mode for WiFi.
- Bluetooth:
- handle MSFT Monitor Device Event
- add MGMT Adv Monitor Device Found/Lost events
- Multi-Path TCP:
- add support for the SO_SNDTIMEO socket option
- lots of selftest cleanups and improvements
- Increase the max PDU size in CAN ISOTP to 64 kB.
Driver API
----------
- Add HW counters for SW netdevs, a mechanism for devices which
offload packet forwarding to report packet statistics back to
software interfaces such as tunnels.
- Select the default NIC queue count as a fraction of number of
physical CPU cores, instead of hard-coding to 8.
- Expose devlink instance locks to drivers. Allow device layer of
drivers to use that lock directly instead of creating their own
which always runs into ordering issues in devlink callbacks.
- Add header/data split indication to guide user space enabling of
TCP zero-copy Rx.
- Allow configuring completion queue event size.
- Refactor page_pool to enable fragmenting after allocation.
- Add allocation and page reuse statistics to page_pool.
- Improve Multiple Spanning Trees support in the bridge to allow
reuse of topologies across VLANs, saving HW resources in switches.
- DSA (Distributed Switch Architecture):
- replay and offload of host VLAN entries
- offload of static and local FDB entries on LAG interfaces
- FDB isolation and unicast filtering
New hardware / drivers
----------------------
- Ethernet:
- LAN937x T1 PHYs
- Davicom DM9051 SPI NIC driver
- Realtek RTL8367S, RTL8367RB-VB switch and MDIO
- Microchip ksz8563 switches
- Netronome NFP3800 SmartNICs
- Fungible SmartNICs
- MediaTek MT8195 switches
- WiFi:
- mt76: MediaTek mt7916
- mt76: MediaTek mt7921u USB adapters
- brcmfmac: Broadcom BCM43454/6
- Mobile:
- iosm: Intel M.2 7360 WWAN card
Drivers
-------
- Convert many drivers to the new phylink API built for split PCS
designs but also simplifying other cases.
- Intel Ethernet NICs:
- add TTY for GNSS module for E810T device
- improve AF_XDP performance
- GTP-C and GTP-U filter offload
- QinQ VLAN support
- Mellanox Ethernet NICs (mlx5):
- support xdp->data_meta
- multi-buffer XDP
- offload tc push_eth and pop_eth actions
- Netronome Ethernet NICs (nfp):
- flow-independent tc action hardware offload (police / meter)
- AF_XDP
- Other Ethernet NICs:
- at803x: fiber and SFP support
- xgmac: mdio: preamble suppression and custom MDC frequencies
- r8169: enable ASPM L1.2 if system vendor flags it as safe
- macb/gem: ZynqMP SGMII
- hns3: add TX push mode
- dpaa2-eth: software TSO
- lan743x: multi-queue, mdio, SGMII, PTP
- axienet: NAPI and GRO support
- Mellanox Ethernet switches (mlxsw):
- source and dest IP address rewrites
- RJ45 ports
- Marvell Ethernet switches (prestera):
- basic routing offload
- multi-chain TC ACL offload
- NXP embedded Ethernet switches (ocelot & felix):
- PTP over UDP with the ocelot-8021q DSA tagging protocol
- basic QoS classification on Felix DSA switch using dcbnl
- port mirroring for ocelot switches
- Microchip high-speed industrial Ethernet (sparx5):
- offloading of bridge port flooding flags
- PTP Hardware Clock
- Other embedded switches:
- lan966x: PTP Hardward Clock
- qca8k: mdio read/write operations via crafted Ethernet packets
- Qualcomm 802.11ax WiFi (ath11k):
- add LDPC FEC type and 802.11ax High Efficiency data in radiotap
- enable RX PPDU stats in monitor co-exist mode
- Intel WiFi (iwlwifi):
- UHB TAS enablement via BIOS
- band disablement via BIOS
- channel switch offload
- 32 Rx AMPDU sessions in newer devices
- MediaTek WiFi (mt76):
- background radar detection
- thermal management improvements on mt7915
- SAR support for more mt76 platforms
- MBSSID and 6 GHz band on mt7915
- RealTek WiFi:
- rtw89: AP mode
- rtw89: 160 MHz channels and 6 GHz band
- rtw89: hardware scan
- Bluetooth:
- mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
- Microchip CAN (mcp251xfd):
- multiple RX-FIFOs and runtime configurable RX/TX rings
- internal PLL, runtime PM handling simplification
- improve chip detection and error handling after wakeup"
* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
llc: fix netdevice reference leaks in llc_ui_bind()
drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
ice: don't allow to run ice_send_event_to_aux() in atomic ctx
ice: fix 'scheduling while atomic' on aux critical err interrupt
net/sched: fix incorrect vlan_push_eth dest field
net: bridge: mst: Restrict info size queries to bridge ports
net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
drivers: net: xgene: Fix regression in CRC stripping
net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
net: dsa: fix missing host-filtered multicast addresses
net/mlx5e: Fix build warning, detected write beyond size of field
iwlwifi: mvm: Don't fail if PPAG isn't supported
selftests/bpf: Fix kprobe_multi test.
Revert "rethook: x86: Add rethook x86 implementation"
Revert "arm64: rethook: Add arm64 rethook implementation"
Revert "powerpc: Add rethook support"
Revert "ARM: rethook: Add rethook arm implementation"
netdevice: add missing dm_private kdoc
net: bridge: mst: prevent NULL deref in br_mst_info_size()
selftests: forwarding: Use same VRF for port and VLAN upper
...
There are a few separately maintained driver subsystems that we merge through
the SoC tree, notable changes are:
- Memory controller updates, mainly for Tegra and Mediatek SoCs,
and clarifications for the memory controller DT bindings
- SCMI firmware interface updates, in particular a new transport based
on OPTEE and support for atomic operations.
- Cleanups to the TEE subsystem, refactoring its memory management
For SoC specific drivers without a separate subsystem, changes include
- Smaller updates and fixes for TI, AT91/SAMA5, Qualcomm and NXP
Layerscape SoCs.
- Driver support for Microchip SAMA5D29, Tesla FSD, Renesas RZ/G2L,
and Qualcomm SM8450.
- Better power management on Mediatek MT81xx, NXP i.MX8MQ
and older NVIDIA Tegra chips
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmI4nOUACgkQmmx57+YA
GNlNNhAApPQw+FKQ6yVj2EZYcaAgik8PJAJoNQWYED52iQfm5uXgjt3aQewvrPNW
nkKx5Mx+fPUfaKx5mkVOFMhME5Bw9tYbXHm2/RpRp+n8jOdUlQpAhzIPOyWPHOJS
QX6qu4t+agrQzjbOCGouAJXgyxhTJFUMviM2EgVHbQHXPtdF8i2kyanfCP7Rw8cx
sVtLwpvhbLm849+deYRXuv2Xw9I3M1Np7018s5QciimI2eLLEb+lJ/C5XWz5pMYn
M1nZ7uwCLKPCewpMETTuhKOv0ioOXyY9C1ghyiGZFhHQfoCYTu94Hrx9t8x5gQmL
qWDinXWXVk8LBegyrs8Bp4wcjtmvMMLnfWtsGSfT5uq24JOGg22OmtUNhNJbS9+p
VjEvBgkXYD7UEl5npI9v9/KQWr3/UDir0zvkuV40gJyeBWNEZ/PB8olXAxgL7wZv
cXRYSaUYYt3DKQf1k5I4GUyQtkP/4RaBy6AqvH5Sx0lCwuY6G6ISK+kCPaaSRKnX
WR+nFw84dKCu7miehmW9qSzMQ4kiSCKIDqk7ilHcwv0J2oXDrlqVPKGGGTzZjUc8
+feqM/eSoYvDDEDemuXNSnl3hc1Zlvm7Apd5AN6kdTaNgoACDYdyvGuJ3CvzcA+K
1gBHUBvGS/ODA25KnYabr7wCMgxYqf7dXfkyKIBwFHwxOnRHtgs=
=Cfbk
-----END PGP SIGNATURE-----
Merge tag 'arm-drivers-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM driver updates from Arnd Bergmann:
"There are a few separately maintained driver subsystems that we merge
through the SoC tree, notable changes are:
- Memory controller updates, mainly for Tegra and Mediatek SoCs, and
clarifications for the memory controller DT bindings
- SCMI firmware interface updates, in particular a new transport
based on OPTEE and support for atomic operations.
- Cleanups to the TEE subsystem, refactoring its memory management
For SoC specific drivers without a separate subsystem, changes include
- Smaller updates and fixes for TI, AT91/SAMA5, Qualcomm and NXP
Layerscape SoCs.
- Driver support for Microchip SAMA5D29, Tesla FSD, Renesas RZ/G2L,
and Qualcomm SM8450.
- Better power management on Mediatek MT81xx, NXP i.MX8MQ and older
NVIDIA Tegra chips"
* tag 'arm-drivers-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (154 commits)
ARM: spear: fix typos in comments
soc/microchip: fix invalid free in mpfs_sys_controller_delete
soc: s4: Add support for power domains controller
dt-bindings: power: add Amlogic s4 power domains bindings
ARM: at91: add support in soc driver for new SAMA5D29
soc: mediatek: mmsys: add sw0_rst_offset in mmsys driver data
dt-bindings: memory: renesas,rpc-if: Document RZ/V2L SoC
memory: emif: check the pointer temp in get_device_details()
memory: emif: Add check for setup_interrupts
dt-bindings: arm: mediatek: mmsys: add support for MT8186
dt-bindings: mediatek: add compatible for MT8186 pwrap
soc: mediatek: pwrap: add pwrap driver for MT8186 SoC
soc: mediatek: mmsys: add mmsys reset control for MT8186
soc: mediatek: mtk-infracfg: Disable ACP on MT8192
soc: ti: k3-socinfo: Add AM62x JTAG ID
soc: mediatek: add MTK mutex support for MT8186
soc: mediatek: mmsys: add mt8186 mmsys routing table
soc: mediatek: pm-domains: Add support for mt8186
dt-bindings: power: Add MT8186 power domains
soc: mediatek: pm-domains: Add support for mt8195
...
- New user_events interface. User space can register an event with the kernel
describing the format of the event. Then it will receive a byte in a page
mapping that it can check against. A privileged task can then enable that
event like any other event, which will change the mapped byte to true,
telling the user space application to start writing the event to the
tracing buffer.
- Add new "ftrace_boot_snapshot" kernel command line parameter. When set,
the tracing buffer will be saved in the snapshot buffer at boot up when
the kernel hands things over to user space. This will keep the traces that
happened at boot up available even if user space boot up has tracing as
well.
- Have TRACE_EVENT_ENUM() also update trace event field type descriptions.
Thus if a static array defines its size with an enum, the user space trace
event parsers can still know how to parse that array.
- Add new TRACE_CUSTOM_EVENT() macro. This acts the same as the
TRACE_EVENT() macro, but will attach to an existing tracepoint. This will
make one tracepoint be able to trace different content and not be stuck at
only what the original TRACE_EVENT() macro exports.
- Fixes to tracing error logging.
- Better saving of cmdlines to PIDs when tracing (use the wakeup events for
mapping).
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYjiO3RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhQzAQDtek5p80p/zkMGymm14wSH6qq0NdgN
Kv7fTBwEewUa0gD/UCOVLw4Oj+JtHQhCa3sCGZopmRv0BT1+4UQANqosKQY=
=Au08
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- New user_events interface. User space can register an event with the
kernel describing the format of the event. Then it will receive a
byte in a page mapping that it can check against. A privileged task
can then enable that event like any other event, which will change
the mapped byte to true, telling the user space application to start
writing the event to the tracing buffer.
- Add new "ftrace_boot_snapshot" kernel command line parameter. When
set, the tracing buffer will be saved in the snapshot buffer at boot
up when the kernel hands things over to user space. This will keep
the traces that happened at boot up available even if user space boot
up has tracing as well.
- Have TRACE_EVENT_ENUM() also update trace event field type
descriptions. Thus if a static array defines its size with an enum,
the user space trace event parsers can still know how to parse that
array.
- Add new TRACE_CUSTOM_EVENT() macro. This acts the same as the
TRACE_EVENT() macro, but will attach to an existing tracepoint. This
will make one tracepoint be able to trace different content and not
be stuck at only what the original TRACE_EVENT() macro exports.
- Fixes to tracing error logging.
- Better saving of cmdlines to PIDs when tracing (use the wakeup events
for mapping).
* tag 'trace-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (30 commits)
tracing: Have type enum modifications copy the strings
user_events: Add trace event call as root for low permission cases
tracing/user_events: Use alloc_pages instead of kzalloc() for register pages
tracing: Add snapshot at end of kernel boot up
tracing: Have TRACE_DEFINE_ENUM affect trace event types as well
tracing: Fix strncpy warning in trace_events_synth.c
user_events: Prevent dyn_event delete racing with ioctl add/delete
tracing: Add TRACE_CUSTOM_EVENT() macro
tracing: Move the defines to create TRACE_EVENTS into their own files
tracing: Add sample code for custom trace events
tracing: Allow custom events to be added to the tracefs directory
tracing: Fix last_cmd_set() string management in histogram code
user_events: Fix potential uninitialized pointer while parsing field
tracing: Fix allocation of last_cmd in last_cmd_set()
user_events: Add documentation file
user_events: Add sample code for typical usage
user_events: Add self-test for validator boundaries
user_events: Add self-test for perf_event integration
user_events: Add self-test for dynamic_events integration
user_events: Add self-test for ftrace integration
...
Primarily this series converts some of the address_space operations
to take a folio instead of a page.
->is_partially_uptodate() takes a folio instead of a page and changes the
type of the 'from' and 'count' arguments to make it obvious they're bytes.
->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
->launder_page() becomes ->launder_folio()
->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
=cOiq
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
- Rewrite how munlock works to massively reduce the contention
on i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
=n9Ad
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Rewrite how munlock works to massively reduce the contention on
i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew
Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
mm/damon: minor cleanup for damon_pa_young
selftests/vm/transhuge-stress: Support file-backed PMD folios
mm/filemap: Support VM_HUGEPAGE for file mappings
mm/readahead: Switch to page_cache_ra_order
mm/readahead: Align file mappings for non-DAX
mm/readahead: Add large folio readahead
mm: Support arbitrary THP sizes
mm: Make large folios depend on THP
mm: Fix READ_ONLY_THP warning
mm/filemap: Allow large folios to be added to the page cache
mm: Turn can_split_huge_page() into can_split_folio()
mm/vmscan: Convert pageout() to take a folio
mm/vmscan: Turn page_check_references() into folio_check_references()
mm/vmscan: Account large folios correctly
mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
mm/vmscan: Free non-shmem folios without splitting them
mm/rmap: Constify the rmap_walk_control argument
mm/rmap: Convert rmap_walk() to take a folio
mm: Turn page_anon_vma() into folio_anon_vma()
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
...
As Steven suggested [1], we should access the pointers from the trace
event to avoid dereferencing them to the tracepoint function when the
tracepoint is disabled.
[1] https://lkml.org/lkml/2021/11/3/409
Link: https://lkml.kernel.org/r/4cd393b4d57f8f01ed72c001509b28e3a3b1a8c1.1646985115.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This framework is no longer used - so discard it.
Link: https://lkml.kernel.org/r/164549983747.9187.6171768583526866601.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Cleanups for SCHED_DEADLINE
- Tracing updates/fixes
- CPU Accounting fixes
- First wave of changes to optimize the overhead of the scheduler build,
from the fast-headers tree - including placeholder *_api.h headers for
later header split-ups.
- Preempt-dynamic using static_branch() for ARM64
- Isolation housekeeping mask rework; preperatory for further changes
- NUMA-balancing: deal with CPU-less nodes
- NUMA-balancing: tune systems that have multiple LLC cache domains per node (eg. AMD)
- Updates to RSEQ UAPI in preparation for glibc usage
- Lots of RSEQ/selftests, for same
- Add Suren as PSI co-maintainer
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI5rg8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hGrw/+M3QOk6fH7G48wjlNnBvcOife6ls+Ni4k
ixOAcF4JKoixO8HieU5vv0A7yf/83tAa6fpeXeMf1hkCGc0NSlmLtuIux+WOmoAL
LzCyDEYfiP8KnVh0A1Tui/lK0+AkGo21O6ADhQE2gh8o2LpslOHQMzvtyekSzeeb
mVxMYQN+QH0m518xdO2D8IQv9ctOYK0eGjmkqdNfntOlytypPZHeNel/tCzwklP/
dElJUjNiSKDlUgTBPtL3DfpoLOI/0mHF2p6NEXvNyULxSOqJTu8pv9Z2ADb2kKo1
0D56iXBDngMi9MHIJLgvzsA8gKzHLFSuPbpODDqkTZCa28vaMB9NYGhJ643NtEie
IXTJEvF1rmNkcLcZlZxo0yjL0fjvPkczjw4Vj27gbrUQeEBfb4mfuI4BRmij63Ep
qEkgQTJhduCqqrQP1rVyhwWZRk1JNcVug+F6N42qWW3fg1xhj0YSrLai2c9nPez6
3Zt98H8YGS1Z/JQomSw48iGXVqfTp/ETI7uU7jqHK8QcjzQ4lFK5H4GZpwuqGBZi
NJJ1l97XMEas+rPHiwMEN7Z1DVhzJLCp8omEj12QU+tGLofxxwAuuOVat3CQWLRk
f80Oya3TLEgd22hGIKDRmHa22vdWnNQyS0S15wJotawBzQf+n3auS9Q3/rh979+t
ES/qvlGxTIs=
=Z8uT
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Cleanups for SCHED_DEADLINE
- Tracing updates/fixes
- CPU Accounting fixes
- First wave of changes to optimize the overhead of the scheduler
build, from the fast-headers tree - including placeholder *_api.h
headers for later header split-ups.
- Preempt-dynamic using static_branch() for ARM64
- Isolation housekeeping mask rework; preperatory for further changes
- NUMA-balancing: deal with CPU-less nodes
- NUMA-balancing: tune systems that have multiple LLC cache domains per
node (eg. AMD)
- Updates to RSEQ UAPI in preparation for glibc usage
- Lots of RSEQ/selftests, for same
- Add Suren as PSI co-maintainer
* tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
sched/headers: ARM needs asm/paravirt_api_clock.h too
sched/numa: Fix boot crash on arm64 systems
headers/prep: Fix header to build standalone: <linux/psi.h>
sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
cgroup: Fix suspicious rcu_dereference_check() usage warning
sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers
sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
sched/deadline,rt: Remove unused functions for !CONFIG_SMP
sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently
sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file
sched/deadline: Remove unused def_dl_bandwidth
sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
sched/tracing: Don't re-read p->state when emitting sched_switch event
sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
sched/cpuacct: Remove redundant RCU read lock
sched/cpuacct: Optimize away RCU read lock
sched/cpuacct: Fix charge percpu cpuusage
sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmI44SgACgkQxWXV+ddt
WDtzyg//YgMKr05jRsU3I/pIQ9znuKZmmllThwF63ZRG4PvKz2QfzvKdrMuzNjru
5kHbG59iJqtLmU/aVsdp8mL6mmg5U3Ym2bIRsrW5m4HTtTowKdirvL/lQ3/tWm8j
CSDJhUdCL2SwFjpru+4cxOeHLXNSfsk4BoCu8nsLitL+oXv/EPo/dkmu6nPjiMY3
RjsIDBeDEf7J20KOuP/qJuN2YOAT7TeISPD3Ow4aDsmndWQ8n6KehEmAZb7QuqZQ
SYubZ2wTb9HuPH/qpiTIA7innBIr+JkYtUYlz2xxixM2BUWNfqD6oKHw9RgOY5Sg
CULFssw0i7cgGKsvuPJw1zdM002uG4wwXKigGiyljTVWvxneyr4mNDWiGad+LyFJ
XWhnABPidkLs/1zbUkJ23DVub5VlfZsypkFDJAUXI0nGu3VrhjDfTYMa8eCe2L/F
YuGG6CrAC+5K/arKAWTVj7hOb+52UzBTEBJz60LJJ6dS9eQoBy857V6pfo7w7ukZ
t/tqA6q75O4tk/G3Ix3V1CjuAH3kJE6qXrvBxhpu8aZNjofopneLyGqS5oahpcE8
8edtT+ZZhNuU9sLSEJCJATVxXRDdNzpQ8CHgOR5HOUbmM/vwKNzHPfRQzDnImznw
UaUlFaaHwK17M6Y/6CnMecz26U2nVSJ7pyh39mb784XYe2a1efE=
=YARd
-----END PGP SIGNATURE-----
Merge tag 'for-5.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This contains feature updates, performance improvements, preparatory
and core work and some related VFS updates:
Features:
- encoded read/write ioctls, allows user space to read or write raw
data directly to extents (now compressed, encrypted in the future),
will be used by send/receive v2 where it saves processing time
- zoned mode now works with metadata DUP (the mkfs.btrfs default)
- error message header updates:
- print error state: transaction abort, other error, log tree
errors
- print transient filesystem state: remount, device replace,
ignored checksum verifications
- tree-checker: verify the transaction id of the to-be-written dirty
extent buffer
Performance improvements for fsync:
- directory logging speedups (up to -90% run time)
- avoid logging all directory changes during renames (up to -60% run
time)
- avoid inode logging during rename and link when possible (up to
-60% run time)
- prepare extents to be logged before locking a log tree path
(throughput +7%)
- stop copying old file extents when doing a full fsync()
- improved logging of old extents after truncate
Core, fixes:
- improved stale device identification by dev_t and not just path
(for devices that are behind other layers like device mapper)
- continued extent tree v2 preparatory work
- disable features that won't work yet
- add wrappers and abstractions for new tree roots
- improved error handling
- add super block write annotations around background block group
reclaim
- fix device scanning messages potentially accessing stale pointer
- cleanups and refactoring
VFS:
- allow reflinks/deduplication from two different mounts of the same
filesystem
- export and add helpers for read/write range verification, for the
encoded ioctls"
* tag 'for-5.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (98 commits)
btrfs: zoned: put block group after final usage
btrfs: don't access possibly stale fs_info data in device_list_add
btrfs: add lockdep_assert_held to need_preemptive_reclaim
btrfs: verify the tranisd of the to-be-written dirty extent buffer
btrfs: unify the error handling of btrfs_read_buffer()
btrfs: unify the error handling pattern for read_tree_block()
btrfs: factor out do_free_extent_accounting helper
btrfs: remove last_ref from the extent freeing code
btrfs: add a alloc_reserved_extent helper
btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block
btrfs: add and use helper for unlinking inode during log replay
btrfs: extend locking to all space_info members accesses
btrfs: zoned: mark relocation as writing
fs: allow cross-vfsmount reflink/dedupe
btrfs: remove the cross file system checks from remap
btrfs: pass btrfs_fs_info to btrfs_recover_relocation
btrfs: pass btrfs_fs_info for deleting snapshots and cleaner
btrfs: add filesystems state details to error messages
btrfs: deal with unexpected extent type during reflinking
btrfs: fix unexpected error path when reflinking an inline extent
...
more bug fixes and clean ups in the ext4 fast_commit feature (most
notably, in the tracepoints). In the jbd2 layer, the t_handle_lock
spinlock has been removed, with the last place where it was actually
needed replaced with an atomic cmpxchg.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEOrBXt+eNlFyMVZH70292m8EYBPAFAmI5QPsACgkQ0292m8EY
BPD99Q//ZO7L0XUicNEsuxvtXDj4GBWg/R62JXcHsIetIRhCpM4+L2vS3kZVXCBh
Pd+QMYsfAhuXFK3yK5MfvAb+XJ2e0iVNaw9ke2ueAsrGidlKWOstbMWWqEa4Dcr2
age1DdrRB5hXIFxOg+Qz8bPMpP9XfxEiMqT2K7PgB+R/Z2chBbwPJ/G1VmdOMNGe
FzxZgvXFY5EIJgjqj09ioD5pUSlAj5sgI4y/Dcv4liW11kvgBxg7Mm8ZouoPRHa3
gZF5mj/40izhD8gYPmZJtHFFYuqBBLsrmJnjUF2Y2yvZt5I14j4w/5OfcjuJqWn6
t2mlnk8/o6ZHzyuUV0/8/x35rZBlwqMq9ep6L9pN07XVB0/xOMG7a59pWOElhFGp
sG/ELhP251ZE+glWC9/X1VjnhM+agduiWAUY0wF0X62LM3sw6sVO3q6hsI1EKunN
O5VlvzPXNTZn9bcKzXzpvbxP9vF6s4exf32rMxd1rfSptE6myavGwss7MlB1HM7v
xiM/A4AZT/qlUH9eOJxF1aGcswpxnFwTwHKZqBVW32iyG9mbqRT2lKajx4WstowU
pdI4X7YfF7GhI5tMFR5S4Fa9FiOivO4EEc6z/db753kVuxSjwtyDd8ip56ZuJUhU
LqpwZO2bh+UhR0vRO0pQH/ZqCzCfCaJTGtBwAvJUwo3g/zRTmoc=
=naSC
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Fix some bugs in converting ext4 to use the new mount API, as well as
more bug fixes and clean ups in the ext4 fast_commit feature (most
notably, in the tracepoints).
In the jbd2 layer, the t_handle_lock spinlock has been removed, with
the last place where it was actually needed replaced with an atomic
cmpxchg"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (35 commits)
ext4: fix kernel doc warnings
ext4: fix remaining two trace events to use same printk convention
ext4: add commit tid info in ext4_fc_commit_start/stop trace events
ext4: add commit_tid info in jbd debug log
ext4: add transaction tid info in fc_track events
ext4: add new trace event in ext4_fc_cleanup
ext4: return early for non-eligible fast_commit track events
ext4: do not call FC trace event in ext4_fc_commit() if FS does not support FC
ext4: convert ext4_fc_track_dentry type events to use event class
ext4: fix ext4_fc_stats trace point
ext4: remove unused enum EXT4_FC_COMMIT_FAILED
ext4: warn when dirtying page w/o buffers in data=journal mode
doc: fixed a typo in ext4 documentation
ext4: make mb_optimize_scan performance mount option work with extents
ext4: make mb_optimize_scan option work with set/unset mount cmd
ext4: don't BUG if someone dirty pages without asking ext4 first
ext4: remove redundant assignment to variable split_flag1
ext4: fix underflow in ext4_max_bitmap_size()
ext4: fix ext4_mb_clear_bb() kernel-doc comment
ext4: fix fs corruption when tring to remove a non-empty directory with IO error
...
- NFSv3 support in NFSD is now always built
- Added NFSD support for the NFSv4 birth-time file attribute
- Added support for storing and displaying sockaddrs in trace points
- NFSD now recognizes RPC_AUTH_TLS probes
Performance improvements:
- Optimized the svc transport enqueuing mechanism
- Added micro-optimizations for the duplicate reply cache
Notable bug fixes:
- Allocation of the NFSD file cache hash table is more reliable
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmI3XNkACgkQM2qzM29m
f5dNERAAqJ/nzfVp2H5BKLszJ7p/s13wFqbW719Rzymzl6/tUUHOqIsVdBsJsa/b
BdZQLLDwa6ZB5zOAnC6FosRKYu4lwixOOf94pC6a9ZDD/glYVKF8mZG+RZXPAy16
g3JUOi/bcyHXv0ZUhbv7DqW+HHM+owPP4vlNJ9ChiiLr/Xdp8NBKj+4Qtn/wcAo+
Xuvx7fU/5Mbemh+dd5mWker4afHvpt9V6U6s04m5LiTPPnHVnxmeyekJGUCOY0QO
cm/6SPNDqyn/VEfM/SRxEnLE9QcHRhZo/4PKRGF4wYolcviIogbILE1M7Ig/r/Gv
6Du2kcRAhyZ3zgWnu799Ivn3Q6IrVjxZwqmsi7YHURTwYKyZtxYsUk0MCBcpnxrE
WyTS2onpElbMop3viKCnQdpIetbsHnUNg3udUV6ugbdCbnZuhUw5B/d6x0o8ZWDE
C0f+jnX+GnBstn0vkcj2H0+VQTd5hUJtXMrooI42ODJoboQRZcmePwoXjqCmw3sy
PXTxLZC/5+4zNHGUuz4Pq4V7FKr4nHhDzaW4ZDO3mILx4ahceotulY1B/yoBUu8o
/LAhu2kJ6nFQkmpzdrGzPeOstgJYHm9CaitRvMzg+NJxEAJdebypdQDbX5iNpgfW
MDXH4n8eIqroTlQ/mQYEV0EbC7BaTqSCL6rQdcrcFUPu9n4Fcno=
=5nac
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"New features:
- NFSv3 support in NFSD is now always built
- Added NFSD support for the NFSv4 birth-time file attribute
- Added support for storing and displaying sockaddrs in trace points
- NFSD now recognizes RPC_AUTH_TLS probes
Performance improvements:
- Optimized the svc transport enqueuing mechanism
- Added micro-optimizations for the duplicate reply cache
Notable bug fixes:
- Allocation of the NFSD file cache hash table is more reliable"
* tag 'nfsd-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (30 commits)
nfsd: fix using the correct variable for sizeof()
nfsd: use correct format characters
NFSD: prevent integer overflow on 32 bit systems
NFSD: prevent underflow in nfssvc_decode_writeargs()
fs/lock: documentation cleanup. Replace inode->i_lock with flc_lock.
NFSD: Fix nfsd_breaker_owns_lease() return values
NFSD: Clean up _lm_ operation names
arch: Remove references to CONFIG_NFSD_V3 in the default configs
NFSD: Remove CONFIG_NFSD_V3
nfsd: more robust allocation failure handling in nfsd_file_cache_init
SUNRPC: Teach server to recognize RPC_AUTH_TLS
NFSD: Move svc_serv_ops::svo_function into struct svc_serv
NFSD: Remove svc_serv_ops::svo_module
SUNRPC: Remove svc_shutdown_net()
SUNRPC: Rename svc_close_xprt()
SUNRPC: Rename svc_create_xprt()
SUNRPC: Remove svo_shutdown method
SUNRPC: Merge svc_do_enqueue_xprt() into svc_enqueue_xprt()
SUNRPC: Remove the .svo_enqueue_xprt method
SUNRPC: Record endpoint information in trace log
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI09cgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjI1D/4uo1FnG+mF9ZEPAy/FKBh6I74gHj7eBJyw
obKyH2oszN19VI+HLYYQgBrPwlc4lOzxmYM0kbAQWfB+PCgxAExX+OkJrGrxWW3R
Hlc1RSQkravh+CBzEc1VIYIWK3XF+guZIwqLPa3gS0hwea8LYBae9h3gKjhvb2kH
4BbSh7Ax5k9klVUcNJuUXnG7nytyaiaQAbMl88iGV3S2bpseWNi0WTHvJ3Eizbyz
sLponJcgUb0/H2G00MXf3M3+V40s7AiLXeEMwFcv29PZUQIEyp5P8Rx43//48nbE
0Rms88Uz4tFHRyFe/48IKa4tjJgk6fy1Cjd3+Thyn8TeJaYhD6vZu4LA/uGo8fkm
EZEmLFHw5mqn7Xz3OA31ErnnpSmwT/T164vMPcvrlLao/1ZSHD1fK2PjyDMTX+aO
1wec5+mMsTbOFgzYJRH3tyEH/GfWxpsNJIPfdCjV5BRPd5HwFBBTvRwFEVTX6+Go
CS4JAGWkTgmr2u1OeyzaQ1lYPZWD1GTt/HJj7rEvuqPzlreX9nRlvTtIcVubw39a
dgx7nWLLCOCN3o4tFHEvmf05/92TiQd1Vw8wEQT0NdmkLuqUuJJOnVo53YWzHA5Y
qqMwPcciiYMi92btrJdsZOgj3GCKxP6rLxyHJm1mY5kFSjEB5e0swV7xryme40r4
ZJ6sUvEyHg==
=OBma
-----END PGP SIGNATURE-----
Merge tag 'for-5.18/io_uring-2022-03-18' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Fixes for current file position. Still doesn't have the f_pos_lock
sorted, but it's a step in the right direction (Dylan)
- Tracing updates (Dylan, Stefan)
- Improvements to io-wq locking (Hao)
- Improvements for provided buffers (me, Pavel)
- Support for registered file descriptors (me, Xiaoguang)
- Support for ring messages (me)
- Poll improvements (me)
- Fix for fixed buffers and non-iterator reads/writes (me)
- Support for NAPI on sockets (Olivier)
- Ring quiesce improvements (Usama)
- Misc fixes (Olivier, Pavel)
* tag 'for-5.18/io_uring-2022-03-18' of git://git.kernel.dk/linux-block: (42 commits)
io_uring: terminate manual loop iterator loop correctly for non-vecs
io_uring: don't check unrelated req->open.how in accept request
io_uring: manage provided buffers strictly ordered
io_uring: fold evfd signalling under a slower path
io_uring: thin down io_commit_cqring()
io_uring: shuffle io_eventfd_signal() bits around
io_uring: remove extra barrier for non-sqpoll iopoll
io_uring: fix provided buffer return on failure for kiocb_done()
io_uring: extend provided buf return to fails
io_uring: refactor timeout cancellation cqe posting
io_uring: normilise naming for fill_cqe*
io_uring: cache poll/double-poll state with a request flag
io_uring: cache req->apoll->events in req->cflags
io_uring: move req->poll_refs into previous struct hole
io_uring: make tracing format consistent
io_uring: recycle apoll_poll entries
io_uring: remove duplicated member check for io_msg_ring_prep()
io_uring: allow submissions to continue on error
io_uring: recycle provided buffers if request goes async
io_uring: ensure reads re-import for selected buffers
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmIzwtEACgkQSfxwEqXe
A67NCBAA1+U01HXx4ethmmy1m2pXHAIwngI7PP0QzyZtmoloWockdN1lRfQ1C0uJ
Whk/9Hc9G7iujznsxOnCS+LeNwRzd7CjtFbTgK+yGIRKwL9GFcVwA5nrifP9TjqZ
FWmTIomjjmA06YRYsNOdNSQdN6DdpQz8xLw0EqVOZerI4ITFErYlW8lLqOOKY99N
f9glQK75kh41SUgo+K3JSn46fhB95HldL6dYSZzjQ6QsVKBQuQTDE9ryfrH2XZDw
xI2nf/ycXPUBv7Bb+0op+7ES++CoDigM2nIyxapEj3ZkpplxL4M+cCIHq3Juzfwm
jDdbZbs5SqDszOQM/dvCJSR+S/D3QIKdv3fwwWHDTigByZdgpudT3rr9k7dY60Z8
aNvOzNWOzGH9/0boLl55WysF6cBQnazbgtzeWpzeuWFhAyfxN/DJx2sf8U+TmN6n
3bDUafamAvmkkIOoHUzOXfjo2lhXxlmRZ40rWVNX5JvcJj5+5jRmTawrQj+9fn8/
MhiIZ6KBDV1OxPwJzG6jm++JP6rgXfXsxduomO7cIEWs10itf/cE8WD9qJrtZTtg
kfjYUguFOd/QyzY0A1w6FD865vy8YhATk71Ywgwj9AI+cfH8QUajpDkXOutjop8x
8HBxIGx6Itgzilfuo5jpJxlVhNO3G6v1fX/A+mUMAfHufkmnfiQ=
=cyDR
-----END PGP SIGNATURE-----
Merge tag 'random-5.18-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"There have been a few important changes to the RNG's crypto, but the
intent for 5.18 has been to shore up the existing design as much as
possible with modern cryptographic functions and proven constructions,
rather than actually changing up anything fundamental to the RNG's
design.
So it's still the same old RNG at its core as before: it still counts
entropy bits, and collects from the various sources with the same
heuristics as before, and so forth. However, the cryptographic
algorithms that transform that entropic data into safe random numbers
have been modernized.
Just as important, if not more, is that the code has been cleaned up
and re-documented. As one of the first drivers in Linux, going back to
1.3.30, its general style and organization was showing its age and
becoming both a maintenance burden and an auditability impediment.
Hopefully this provides a more solid foundation to build on for the
future. I encourage you to open up the file in full, and maybe you'll
remark, "oh, that's what it's doing," and enjoy reading it. That, at
least, is the eventual goal, which this pull begins working toward.
Here's a summary of the various patches in this pull:
- /dev/urandom and /dev/random now do the same thing, per the patch
we discussed on the list. I think this is worth trying out. If it
does appear problematic, I've made sure to keep it standalone and
revertible without any conflicts.
- Fixes and cleanups for numerous integer type problems, locking
issues, and general code quality concerns.
- The input pool's LFSR has been replaced with a cryptographically
secure hash function, which has security and performance benefits
alike, and consequently allows us to count entropy bits linearly.
- The pre-init injection now uses a real hash function too, instead
of an LFSR or vanilla xor.
- The interrupt handler's fast_mix() function now uses one round of
SipHash, rather than the fake crypto that was there before.
- All additions of RDRAND and RDSEED now go through the input pool's
hash function, in part to mitigate ridiculous hypothetical CPU
backdoors, but more so to have a consistent interface for ingesting
entropy that's easy to analyze, making everything happen one way,
instead of a potpourri of different ways.
- The crng now works on per-cpu data, while also being in accordance
with the actual "fast key erasure RNG" design. This allows us to
fix several boot-time race complications associated with the prior
dynamically allocated model, eliminates much locking, and makes our
backtrack protection more robust.
- Batched entropy now erases doled out values so that it's backtrack
resistant.
- Working closely with Sebastian, the interrupt handler no longer
needs to take any locks at all, as we punt the
synchronized/expensive operations to a workqueue. This is
especially nice for PREEMPT_RT, where taking spinlocks in irq
context is problematic. It also makes the handler faster for the
rest of us.
- Also working with Sebastian, we now do the right thing on CPU
hotplug, so that we don't use stale entropy or fail to accumulate
new entropy when CPUs come back online.
- We handle virtual machines that fork / clone / snapshot, using the
"vmgenid" ACPI specification for retrieving a unique new RNG seed,
which we can use to also make WireGuard (and in the future, other
things) safe across VM forks.
- Around boot time, we now try to reseed more often if enough entropy
is available, before settling on the usual 5 minute schedule.
- Last, but certainly not least, the documentation in the file has
been updated considerably"
* tag 'random-5.18-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (60 commits)
random: check for signal and try earlier when generating entropy
random: reseed more often immediately after booting
random: make consistent usage of crng_ready()
random: use SipHash as interrupt entropy accumulator
wireguard: device: clear keys on VM fork
random: provide notifier for VM fork
random: replace custom notifier chain with standard one
random: do not export add_vmfork_randomness() unless needed
virt: vmgenid: notify RNG of VM fork and supply generation ID
ACPI: allow longer device IDs
random: add mechanism for VM forks to reinitialize crng
random: don't let 644 read-only sysctls be written to
random: give sysctl_random_min_urandom_seed a more sensible value
random: block in /dev/urandom
random: do crng pre-init loading in worker rather than irq
random: unify cycles_t and jiffies usage and types
random: cleanup UUID handling
random: only wake up writers after zap if threshold was passed
random: round-robin registers as ulong, not u32
random: clear fast pool, crng, and batches in cpuhp bring up
...
This pull request contains the following branches:
exp.2022.02.24a: Contains a fix for idle detection from Neeraj Upadhyay
and missing access marking detected by KCSAN.
fixes.2022.02.14a: Miscellaneous fixes.
rcu_barrier.2022.02.08a: Reduces coupling between rcu_barrier() and
CPU-hotplug operations, so that rcu_barrier() no longer needs
to do cpus_read_lock(). This may also someday allow system
boot to bring CPUs online concurrently.
rcu-tasks.2022.02.08a: Enable more aggressive movement to per-CPU
queueing when reacting to excessive lock contention due
to workloads placing heavy update-side stress on RCU tasks.
rt.2022.02.01b: Improvements to RCU priority boosting, including
changes from Neeraj Upadhyay, Zqiang, and Alison Chaiken.
torture.2022.02.01b: Various fixes improving test robustness and
debug information.
torturescript.2022.02.08a: Add tests for SRCU size transitions, further
compress torture.sh build products, and improve debug output.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmIusb0THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jAklD/9VXLK7crcg2YeRXUIg1IOdnancsVCV
MNtTfxNYqYIis+W2UfuHKuQu2yEXF5fihdY0J9TQv0byHsprp6FIZT+i1An4Ukgd
0vyHjd/DaIKgs2txsB1DjhlatWlJUfQuBwhtNUkpYFLFwKdCI1l813bPbNlL+GiL
p0ZejVMpBC5HgE6sDOtaaQSAB+AEUp+Lgr+yaG/On8hfzwWFKO8KldxhiKY9n07v
SNDfKDgXB+80hx4RBVGbkuogV3s9brFULoNRXJy7Uf79DtiY09uazhhA3G0TjO34
zGwmF91dqsXDF/Uz8g4aZO0xYRXUchOrsQ5lgO/GhTVbM9I0wWlMHEk/8WHyBJkU
vlXOMuwzBc9/5uwZE3rnkA4a3nkXhPQjLlCr+/I7A/7Vsv9IBW9WSlgMvUN0Qf4S
XAwTnIqfErnR60a+L0+HRr5kIV5VoXcxqI/Nv0/4/BMLRubS/c7cYjOTxXNJL9SU
50pv5vty9xk3HSpuz0JAOyLf+PUT773uUQhFr5xCBSCVqbAm5WFg6hWPAgrN/tUS
wstBc0wlA73rKVJxeLDQwHc/oT1zTUEzswVZITQ5zLHK0t0GbeR6QHccsdeaJyTe
DisX+66A6YQrEuJmx5xUZqjYHqtYLDOBTbHA3ZwQmvjKu8ibWZ8Fg9ioURLCS4bF
+FVkp/5KdcAN9w==
=ljVY
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Fix idle detection (Neeraj Upadhyay) and missing access marking
detected by KCSAN.
- Reduce coupling between rcu_barrier() and CPU-hotplug operations, so
that rcu_barrier() no longer needs to do cpus_read_lock(). This may
also someday allow system boot to bring CPUs online concurrently.
- Enable more aggressive movement to per-CPU queueing when reacting to
excessive lock contention due to workloads placing heavy update-side
stress on RCU tasks.
- Improvements to RCU priority boosting, including changes from Neeraj
Upadhyay, Zqiang, and Alison Chaiken.
- Various fixes improving test robustness and debug information.
- Add tests for SRCU size transitions, further compress torture.sh
build products, and improve debug output.
- Miscellaneous fixes.
* tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
rcu: Replace cpumask_weight with cpumask_empty where appropriate
rcu: Remove __read_mostly annotations from rcu_scheduler_active externs
rcu: Uninline multi-use function: finish_rcuwait()
rcu: Mark writes to the rcu_segcblist structure's ->flags field
kasan: Record work creation stack trace with interrupts enabled
rcu: Inline __call_rcu() into call_rcu()
rcu: Add mutex for rcu boost kthread spawning and affinity setting
rcu: Fix description of kvfree_rcu()
MAINTAINERS: Add Frederic and Neeraj to their RCU files
rcutorture: Provide non-power-of-two Tasks RCU scenarios
rcutorture: Test SRCU size transitions
torture: Make torture.sh help message match reality
rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
rcu-tasks: Use order_base_2() instead of ilog2()
rcu: Create and use an rcu_rdp_cpu_online()
rcu: Make rcu_barrier() no longer block CPU-hotplug operations
rcu: Rework rcu_barrier() and callback-migration logic
rcu: Refactor rcu_barrier() empty-list handling
rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion
torture: Change KVM environment variable to RCUTORTURE
...
We always write out an entire folio at once. This conversion removes
a few calls to compound_head() and gets the NR_VMSCAN_WRITE statistic
right when writing out a large folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Make the tracing formatting for user_data and flags consistent.
Having consistent formatting allows one for example to grep for a specific
user_data/flags and be able to trace a single sqe through easily.
Change user_data to 0x%llx and flags to 0x%x everywhere. The '0x' is
useful to disambiguate for example "user_data 100".
Additionally remove the '=' for flags in io_uring_req_failed, again for consistency.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220316095204.2191498-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All ext4 & jbd2 trace events starts with "dev Major:Minor".
While we are still improving/adding the ftrace events for FC,
let's fix last two remaining trace events to follow the same
convention.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/8f33b163f0f29df2491c03b79f8ac96890ea5184.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This adds commit_tid info in ext4_fc_commit_start/stop which is helpful
in debugging fast_commit issues.
For e.g. issues where due to jbd2 journal full commit, FC miss to commit
updates to a file.
Also improves TP_prink format string i.e. all ext4 and jbd2 trace events
starts with "dev MAjOR,MINOR". Let's follow the same convention while we
are still at it.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/ebcd6b9ab5b718db30f90854497886801ce38c63.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds the transaction & inode tid info in trace events for
callers of ext4_fc_track_template(). This is helpful in debugging race
conditions where an inode could belong to two different transaction tids.
It also fixes the checkpatch warnings which says use tabs instead of
spaces.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c203c09dc11bb372803c430f621f25a4b8c2c8b4.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Extensive changes, but fairly mechanical.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This code adds the on disk structures for the block group root, which
will hold the block group items for extent tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests. This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.
RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.
So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.
Remove both.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
ftrace's __print_symbolic() requires that any enum values used in the
symbol to string translation table be wrapped in a TRACE_DEFINE_ENUM
so that the enum value can be decoded from the ftrace ring buffer by
user space tooling.
This patch also fixes few other problems found in this trace point.
e.g. dereferencing structures in TP_printk which should not be done
at any cost.
Also to avoid checkpatch warnings, this patch removes those
whitespaces/tab stops issues.
Cc: stable@kernel.org
Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/b4b9691414c35c62e570b723e661c80674169f9a.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A network filesystem may set coherency data on a volume cookie, and if
given, cachefiles will store this in an xattr on the directory in the
cache corresponding to the volume.
The function that sets the xattr just stores the contents of the volume
coherency buffer directly into the xattr, with nothing added; the
checking function, on the other hand, has a cut'n'paste error whereby it
tries to interpret the xattr contents as would be the xattr on an
ordinary file (using the cachefiles_xattr struct). This results in a
failure to match the coherency data because the buffer ends up being
shifted by 18 bytes.
Fix this by defining a structure specifically for the volume xattr and
making both the setting and checking functions use it.
Since the volume coherency doesn't work if used, take the opportunity to
insert a reserved field for future use, set it to 0 and check that it is
0. Log mismatch through the appropriate tracepoint.
Note that this only affects cifs; 9p, afs, ceph and nfs don't use the
volume coherency data at the moment.
Fixes: 32e150037d ("fscache, cachefiles: Store the volume coherency data")
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Steve French <smfrench@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To make it really easy to add custom events from modules, add a
TRACE_CUSTOM_EVENT() macro that acts just like the TRACE_EVENT() macro,
but creates a custom event to an already existing tracepoint.
The trace_custom_sched.[ch] has been updated to use this new macro to show
how simple it is.
Link: https://lkml.kernel.org/r/20220303220625.738622494@goodmis.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In an effort to add custom event macros that can be used to create your
own custom events based on existing tracepoints, move the defines of the
special macros used in TRACE_EVENT() into their own files such that they
can be reused for TRACE_CUSTOM_EVENT().
Link: https://lkml.kernel.org/r/20220303220625.553406495@goodmis.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This makes the io-uring tracepoints consistent. Where it makes sense
the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode.
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The information on whether eventfd is registered is not very useful and
would result in the tracepoint being enclosed in an rcu_readlock in a
later patch that tries to avoid ring quiesce for registering eventfd.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The tracepoint in get_mapping_status() only dumped the incoming mpext
fields. This patch added a new tracepoint in mptcp_sendmsg_frag() to dump
the outgoing mpext too.
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This field is entirely unused now except for a tracepoint in f2fs, so
remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The TUN can be used as vhost-net backend. E.g, the tun_net_xmit() is the
interface to forward the skb from TUN to vhost-net/virtio-net.
However, there are many "goto drop" in the TUN driver. Therefore, the
kfree_skb_reason() is involved at each "goto drop" to help userspace
ftrace/ebpf to track the reason for the loss of packets.
The below reasons are introduced:
- SKB_DROP_REASON_DEV_READY
- SKB_DROP_REASON_NOMEM
- SKB_DROP_REASON_HDR_TRUNC
- SKB_DROP_REASON_TAP_FILTER
- SKB_DROP_REASON_TAP_TXFILTER
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TAP can be used as vhost-net backend. E.g., the tap_handle_frame() is
the interface to forward the skb from TAP to vhost-net/virtio-net.
However, there are many "goto drop" in the TAP driver. Therefore, the
kfree_skb_reason() is involved at each "goto drop" to help userspace
ftrace/ebpf to track the reason for the loss of packets.
The below reasons are introduced:
- SKB_DROP_REASON_SKB_CSUM
- SKB_DROP_REASON_SKB_GSO_SEG
- SKB_DROP_REASON_SKB_UCOPY_FAULT
- SKB_DROP_REASON_DEV_HDR
- SKB_DROP_REASON_FULL_RING
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add reason for skb drops to __netif_receive_skb_core() when packet_type
not found to handle the skb. For this purpose, the drop reason
SKB_DROP_REASON_PTYPE_ABSENT is introduced. Take ether packets for
example, this case mainly happens when L3 protocol is not supported.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in sch_handle_ingress() with
kfree_skb_reason(). Following drop reasons are introduced:
SKB_DROP_REASON_TC_INGRESS
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in do_xdp_generic() with kfree_skb_reason().
The drop reason SKB_DROP_REASON_XDP is introduced for this case.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in enqueue_to_backlog() with
kfree_skb_reason(). The skb rop reason SKB_DROP_REASON_CPU_BACKLOG is
introduced for the case of failing to enqueue the skb to the per CPU
backlog queue. The further reason can be backlog queue full or RPS
flow limition, and I think we needn't to make further distinctions.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add reasons for skb drops to __dev_xmit_skb() by replacing
kfree_skb_list() with kfree_skb_list_reason(). The drop reason of
SKB_DROP_REASON_QDISC_DROP is introduced for qdisc enqueue fails.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in sch_handle_egress() with kfree_skb_reason().
The drop reason SKB_DROP_REASON_TC_EGRESS is introduced. Considering
the code path of tc egerss, we make it distinct with the drop reason
of SKB_DROP_REASON_QDISC_DROP in the next commit.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As of commit
c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
the following sequence becomes possible:
p->__state = TASK_INTERRUPTIBLE;
__schedule()
deactivate_task(p);
ttwu()
READ !p->on_rq
p->__state=TASK_WAKING
trace_sched_switch()
__trace_sched_switch_state()
task_state_index()
return 0;
TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
the trace event.
Prevent this by pushing the value read from __schedule() down the trace
event.
Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
To make server-side trace events more useful in container-ized
environments, capture not just the remote's IP address, but the
local IP address and network namespace as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Enable a struct sockaddr to be stored in a trace record as a
dynamically-sized field. The common cases are AF_INET and AF_INET6
which are different sizes, and are vastly smaller than a struct
sockaddr_storage.
These are safer because, when used properly, the size of the
sockaddr destination field in each trace record is now guaranteed
to be the same as the source address that is being copied into it.
Link: https://lore.kernel.org/all/164182978641.8391.8277203495236105391.stgit@bazille.1015granger.net/
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Replace kfree_skb() used in __neigh_event_send() with
kfree_skb_reason(). Following drop reasons are added:
SKB_DROP_REASON_NEIGH_FAILED
SKB_DROP_REASON_NEIGH_QUEUEFULL
SKB_DROP_REASON_NEIGH_DEAD
The first two reasons above should be the hot path that skb drops
in neighbour subsystem.
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() which is used in the packet egress path of IP layer
with kfree_skb_reason(). Functions that are involved include:
__ip_queue_xmit()
ip_finish_output()
ip_mc_finish_output()
ip6_output()
ip6_finish_output()
ip6_finish_output2()
Following new drop reasons are introduced:
SKB_DROP_REASON_IP_OUTNOROUTES
SKB_DROP_REASON_BPF_CGROUP_EGRESS
SKB_DROP_REASON_IPV6DISABLED
SKB_DROP_REASON_NEIGH_CREATEFAIL
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Few main additions include:
- Support for OPTEE based SCMI transport to enable using SCMI service
provided by OPTEE on some platforms
- Support for atomic SCMI transports which enables few SCMI transactions
to be completed in atomic context. This involves other refactoring work
associated with it. It also marks SMC and OPTEE as atomic transport as
the commands are completed once the return.
- Support for polling mode in SCMI VirtIO transport in order to support
atomic operations
- Support for atomic clock operations based on availability of atomic
capability in the underlying SCMI transport
Other changes involves some trace and log enhancements and miscellaneous
bug fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEunHlEgbzHrJD3ZPhAEG6vDF+4pgFAmIVQcYACgkQAEG6vDF+
4pgiLw/+NA01jx1VMNSUTCsSLvDQ0ZXCjEWEXaIR1KkdKELGu7wwItIqBSisXfmx
9/NBGBydz9q700NGevIjRDiBo6JThmDf/tTg21OdLvRY8Dwqfktz7JO4Jsne3cwW
hz1UTxyStUTNqt4Omprf1xzlKKerSuzav4KB1MtpAyzoBV6RFHglaznPCXL7mkic
9+9WPeVcwQxQ5ZXuwV2WDrJ9S8IZzMzZh6woIM/O4fdbyXGv28NfVt1oRidOPJOE
uMsxF/LcvUsQpQBOU9cTqkkmBCKRmzY7xT1JK58DcteSrleYoiZJqT7uFxI92i4Z
DqKMMUcbJVzKErBYyBHCA62cfRBvOV+OLhh8dzR2c40a+Ecoy5vQHGAEgMUwZBWx
8ztRDlBFzWxNSCyn5fh4IuV1POygrH8UTq57s+Mwr+hrr93Gb2JoxH3UL3kgGTJX
c9AFhLPNWDD+cwJouSD4y55kqik/53+vUqOrCHQkPjV4cz5Y3+P0uIraesh7NHo9
Mhnoudt7IZdJVowvEZyjQ4kp38t0G/M13BK/3Tw9EtlcFXoxOSP0AfLc1Il8nbRT
afKWVBfTZbxpTWnLsf5SkHXJgyO1tCEd51wZLqejWZqqwrhbNoqA+HyxdtIIvi1o
n5bS7qQ2wiuu3DsnJlYrJXTj2YlpUGm1MBHgmFUHtMUj55XpmEA=
=jIoO
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmIY/ksACgkQmmx57+YA
GNkNRw//YzZGhm0MtpPCe1jNv2bDWVq75v8bzDL217n9QJw1dpx0ru/xkDJDhmDP
t7vfSFFPfE2o/jXzMOuO8c6E1/b0cdxd9z8V1SPIHSwPXA/ZM4OdMKZJdwK6OK6T
HCW1PJJi+5ktu0mWbeflCW/isIQv0Y485yNH4q8gLSWisufevm+1LktErhNarQst
lG7CR9KclOZzqhVfT49vXhly/HSWVjUvYJfAPPERIQDwDvKSWjkx05ZPuc8l1DJU
8Z2164E04uTwU1p0154pmay9eY6xcBpTXhBkwFxwCfjKu0Yc2ViVRuGJh27x5TXB
ScAF7iRhjC2sk4v7H7UEFjVVAQix8aNyeueXcB1+6adSkCnerO/f4yT5uylbm7e0
Qj9XrqV1cF6GwE1ukRGa0hCySGH9xEz3No6ueEqCQQgbvVJ4o7RoUBoapnZUfx8J
ee/C4EHd/7uxBm0ygQ07F1+RbLeWw6aLSVeM03bY+bz9LKkPxELrgP98WLvvGtzy
Xj6k98DMdmZsCwcp3Z4va9gjopjgM4Z7pWnABihXEkidZXGodMMAQCAUU5gmZeKz
FSubYgjrGojOy00XOQsggw+CmNHLGNvQoFLjaRtRxHLJYewyjuE7zCcSOdqTsjI4
BKEUgtakDEEFoSA27ycjne23VyhxI3mF3yC8LBiAkzbksuMQXvA=
=qTGM
-----END PGP SIGNATURE-----
Merge tag 'scmi-updates-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/drivers
Arm SCMI firmware interface updates for v5.18
Few main additions include:
- Support for OPTEE based SCMI transport to enable using SCMI service
provided by OPTEE on some platforms
- Support for atomic SCMI transports which enables few SCMI transactions
to be completed in atomic context. This involves other refactoring work
associated with it. It also marks SMC and OPTEE as atomic transport as
the commands are completed once the return.
- Support for polling mode in SCMI VirtIO transport in order to support
atomic operations
- Support for atomic clock operations based on availability of atomic
capability in the underlying SCMI transport
Other changes involves some trace and log enhancements and miscellaneous
bug fixes.
* tag 'scmi-updates-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux: (28 commits)
clk: scmi: Support atomic clock enable/disable API
firmware: arm_scmi: Add support for clock_enable_latency
firmware: arm_scmi: Add atomic support to clock protocol
firmware: arm_scmi: Support optional system wide atomic-threshold-us
dt-bindings: firmware: arm,scmi: Add atomic-threshold-us optional property
firmware: arm_scmi: Add atomic mode support to virtio transport
firmware: arm_scmi: Review virtio free_list handling
firmware: arm_scmi: Add a virtio channel refcount
firmware: arm_scmi: Disable ftrace for Clang Thumb2 builds
firmware: arm_scmi: Add new parameter to mark_txdone
firmware: arm_scmi: Add atomic mode support to smc transport
firmware: arm_scmi: Add support for atomic transports
firmware: arm_scmi: Make optee support sync_cmds_completed_on_ret
firmware: arm_scmi: Make smc support sync_cmds_completed_on_ret
firmware: arm_scmi: Add sync_cmds_completed_on_ret transport flag
firmware: arm_scmi: Make smc transport use common completions
firmware: arm_scmi: Add configurable polling mode for transports
firmware: arm_scmi: Use new trace event scmi_xfer_response_wait
include: trace: Add new scmi_xfer_response_wait event
firmware: arm_scmi: Refactor message response path
...
Link: https://lore.kernel.org/r/20220222201742.3338589-1-sudeep.holla@arm.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
These explicit tracepoints aren't really used and show sign of aging.
It's work to keep these up to date, and before I attempted to keep them
up to date, they weren't up to date, which indicates that they're not
really used. These days there are better ways of introspecting anyway.
Cc: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
We've been using a flurry of int, unsigned int, size_t, and ssize_t.
Let's unify all of this into size_t where it makes sense, as it does in
most places, and leave ssize_t for return values with possible errors.
In addition, keeping with the convention of other functions in this
file, functions that are dealing with raw bytes now take void *
consistently instead of a mix of that and u8 *, because much of the time
we're actually passing some other structure that is then interpreted as
bytes by the function.
We also take the opportunity to fix the outdated and incorrect comment
in get_random_bytes_arch().
Cc: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* Improvements in SCOM and OCC drivers for error handling and retries
* Addition of tracepoints for initialisation path
* API for setting long running SBE FIFO operations
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+nHMAt9PCBDH63wBa3ZZB4FHcJ4FAmITXCIACgkQa3ZZB4FH
cJ6tRQ//eBrtmh0pkfccNMw1tqGqPJAqsANxBwTBFlG02MD7YhUH6VEyULdp2V8O
icwweLcUwM5Q6d6mDSCTEWxRDyLDsQ3lIBCq/j6DLCOZ3gaFAm7bwdl24p59lK51
DNkNB4quAhsc1eXQ3hmALXEUZ1QRLSrbw4+utYPhaiyCU/8LORqaelmip52HBVK5
8WDsOs/Vj7o+Yiu6dNtYfRTkjH3AHyMCgPjSI6tZ1sUZGGeyBTWywDNcbX7wIVAn
wuf+4G+v5XtY7uqmnvMRmhoUU01FvdJ573/UGLyQ5qlZIN6C6i2q5shyvUP839FC
8GNeRJJFABq0qB9nMK2S9hhGvaFG8AhcWMNvT+/IRPa9p4wyvwClwzqeeScATDWg
762bQSl9NOrkGtggh3GDDH5hywQeFuboPhiNfCp9Y0Hw6Ng+OBx1Z9BzxuqayFQi
N1hsd+AC/K8qkcYiayV4euCbA2rinSzUZOYWCHlaPrqG78wACgcI2YWrwmz1prBH
y4aSibeehcXIvHseQQnLHUzLWfZiRr1UsEGA94lva2NnjUADiO/Wty9F2gI/i7zv
X+VP77/9s14EGrx8w4SunPn24i67xEoHMFzwuzr69HuT32w/CKLX52UqUdPCjecF
TbdEJDAwQH/Pml/t+rBfN4ntHhkoMRDhvbM7yovUJ6Y3gcDTMDY=
=kOpN
-----END PGP SIGNATURE-----
Merge tag 'fsi-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joel/fsi into char-misc-next
Joel writes:
FSI changes for v5.18
* Improvements in SCOM and OCC drivers for error handling and retries
* Addition of tracepoints for initialisation path
* API for setting long running SBE FIFO operations
* tag 'fsi-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joel/fsi:
fsi: Add trace events in initialization path
fsi: sbefifo: Implement FSI_SBEFIFO_READ_TIMEOUT_SECONDS ioctl
fsi: sbefifo: Use specified value of start of response timeout
fsi: occ: Improve response status checking
fsi: scom: Remove retries in indirect scoms
fsi: scom: Fix error handling
Our pool is 256 bits, and we only ever use all of it or don't use it at
all, which is decided by whether or not it has at least 128 bits in it.
So we can drastically simplify the accounting and cmpxchg loop to do
exactly this. While we're at it, we move the minimum bit size into a
constant so it can be shared between the two places where it matters.
The reason we want any of this is for the case in which an attacker has
compromised the current state, and then bruteforces small amounts of
entropy added to it. By demanding a particular minimum amount of entropy
be present before reseeding, we make that bruteforcing difficult.
Note that this rationale no longer includes anything about /dev/random
blocking at the right moment, since /dev/random no longer blocks (except
for at ~boot), but rather uses the crng. In a former life, /dev/random
was different and therefore required a more nuanced account(), but this
is no longer.
Behaviorally, nothing changes here. This is just a simplification of
the code.
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Add definitions for trace events to show the scanning flow.
Signed-off-by: Eddie James <eajames@linux.ibm.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220207161640.35605-1-eajames@linux.ibm.com
Signed-off-by: Joel Stanley <joel@jms.id.au>
Replace tcp_drop() used in tcp_data_queue_ofo with tcp_drop_reason().
Following drop reasons are introduced:
SKB_DROP_REASON_TCP_OFOMERGE
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace tcp_drop() used in tcp_data_queue() with tcp_drop_reason().
Following drop reasons are introduced:
SKB_DROP_REASON_TCP_ZEROWINDOW
SKB_DROP_REASON_TCP_OLD_DATA
SKB_DROP_REASON_TCP_OVERWINDOW
SKB_DROP_REASON_TCP_OLD_DATA is used for the case that end_seq of skb
less than the left edges of receive window. (Maybe there is a better
name?)
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace tcp_drop() used in tcp_rcv_established() with tcp_drop_reason().
Following drop reasons are added:
SKB_DROP_REASON_TCP_FLAGS
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the address of drop_reason to tcp_add_backlog() to store the
reasons for skb drops when fails. Following drop reasons are
introduced:
SKB_DROP_REASON_SOCKET_BACKLOG
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the address of drop reason to tcp_v4_inbound_md5_hash() and
tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this
function fails. Therefore, the drop reason can be passed to
kfree_skb_reason() when the skb needs to be freed.
Following drop reasons are added:
SKB_DROP_REASON_TCP_MD5NOTFOUND
SKB_DROP_REASON_TCP_MD5UNEXPECTED
SKB_DROP_REASON_TCP_MD5FAILURE
SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5*
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
work in all contexts and get rid of netif_rx_ni()". Eric agreed and
pointed out that modern devices should use netif_receive_skb() to avoid
the overhead.
In the meantime someone added another variant, netif_rx_any_context(),
which behaves as suggested.
netif_rx() must be invoked with disabled bottom halves to ensure that
pending softirqs, which were raised within the function, are handled.
netif_rx_ni() can be invoked only from process context (bottom halves
must be enabled) because the function handles pending softirqs without
checking if bottom halves were disabled or not.
netif_rx_any_context() invokes on the former functions by checking
in_interrupts().
netif_rx() could be taught to handle both cases (disabled and enabled
bottom halves) by simply disabling bottom halves while invoking
netif_rx_internal(). The local_bh_enable() invocation will then invoke
pending softirqs only if the BH-disable counter drops to zero.
Eric is concerned about the overhead of BH-disable+enable especially in
regard to the loopback driver. As critical as this driver is, it will
receive a shortcut to avoid the additional overhead which is not needed.
Add a local_bh_disable() section in netif_rx() to ensure softirqs are
handled if needed.
Provide __netif_rx() which does not disable BH and has a lockdep assert
to ensure that interrupts are disabled. Use this shortcut in the
loopback driver and in drivers/net/*.c.
Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
can be removed once they are no more users left.
Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, rasdaemon uses the existing tracepoint block_rq_complete
and filters out non-error cases in order to capture block disk errors.
But there are a few problems with this approach:
1. Even kernel trace filter could do the filtering work, there is
still some overhead after we enable this tracepoint.
2. The filter is merely based on errno, which does not align with kernel
logic to check the errors for print_req_error().
3. block_rq_complete only provides dev major and minor to identify
the block device, it is not convenient to use in user-space.
So introduce a new tracepoint block_rq_error just for the error case.
With this patch, rasdaemon could switch to block_rq_error.
Since the new tracepoint has the similar implementation with
block_rq_complete, so move the existing code from TRACE_EVENT
block_rq_complete() into new event class block_rq_completion(). Then add
event for block_rq_complete and block_rq_err respectively from the newly
created event class per the suggestion from Chaitanya Kulkarni.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220210225222.260069-1-shy828301@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This change adds a couple of new ioctls for mctp sockets:
SIOCMCTPALLOCTAG and SIOCMCTPDROPTAG. These ioctls provide facilities
for explicit allocation / release of tags, overriding the automatic
allocate-on-send/release-on-reply and timeout behaviours. This allows
userspace more control over messages that may not fit a simple
request/response model.
In order to indicate a pre-allocated tag to the sendmsg() syscall, we
introduce a new flag to the struct sockaddr_mctp.smctp_tag value:
MCTP_TAG_PREALLOC.
Additional changes from Jeremy Kerr <jk@codeconstruct.com.au>.
Contains a fix that was:
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit saves a few lines by checking first for an empty callback
list. If the callback list is empty, then that CPU is taken care of,
regardless of its online or nocb state. Also simplify tracing accordingly
and fold a few lines together.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Replace kfree_skb() with kfree_skb_reason() in __udp_queue_rcv_skb().
Following new drop reasons are introduced:
SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_protocol_deliver_rcu().
Following new drop reasons are introduced:
SKB_DROP_REASON_XFRM_POLICY
SKB_DROP_REASON_IP_NOPROTO
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_rcv_finish_core(),
following drop reasons are introduced:
SKB_DROP_REASON_IP_RPFILTER
SKB_DROP_REASON_UNICAST_IN_L2_MULTICAST
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_rcv_core(). Three new
drop reasons are introduced:
SKB_DROP_REASON_OTHERHOST
SKB_DROP_REASON_IP_CSUM
SKB_DROP_REASON_IP_INHDR
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in nf_hook_slow() when
skb is dropped by reason of NF_DROP. Following new drop reasons
are introduced:
SKB_DROP_REASON_NETFILTER_DROP
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Couple of main additions:
- Support for OPTEE based SCMI transport to enable using SCMI service
provided by OPTEE on some platforms
- Support for atomic SCMI transports which enables few SCMI transactions
to be completed in atomic context. This involves other refactoring work
associated with it. It also marks SMC and OPTEE as atomic transport as
the commands are completed once the return
Other changes involves some trace and log enhancements and a miscellaneous
bug fix.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEunHlEgbzHrJD3ZPhAEG6vDF+4pgFAmHDMMoACgkQAEG6vDF+
4pjx9g/+IbMZFxhJkO7kar72cNA3Nf4qzQIjYoM6gxOwly+5+s3QMPdXhBCYLf/9
knDz1JVbEMtz3u6jNHGliF/p+tUq/dGnqAZvsDn3lthmhPGt1EmW9CLiGweXXfU6
hZIEpHVQGJue0US/g7BgwYDAZHuAy71BAZr5M4WcR7gjGH9KMRXFdu8mqiXMM1k5
nMdvcjBv38ZQHtJXQSwVd9yv+d31vt01t57G3NeavINlbHzHCqnKA6VdbjuFLnX5
q0ovaKDKjzurK2cL2quep7nlAAujk0q5vMSDPYmPqY2K0+aHsHlTwIQHNYOEvNTO
GXtk7DL6lMwnfKSbJXRIDMK7MgCIlk9vhm1MZNMVkmJ/moMfS/nQjMkqBmg7BRCT
sGKzLEHRuzud+KHecI/vNCZxokvC9er3OyOYD7bIVn26kWJu7EmwwHXK5E7rt8JW
o0JqtgOo0gs2yjW/A7nSVVEBA/pEYftKSkaEn/gHFSvGcB6eu4LT7s16FaHf6d++
LsBVoTuPrjTYZdkAZXbyXJbX3wuw4yyCJWB1ynrJ+udRw3lguXlI02CKshAXHIKn
BSyLPYg9juWt7fqx7/DHrzw7ZdOQW0zMugNImZBHtub2UnmGw6I7t82bQILJmMG1
wbmA9MCggo7q1D4yZeRmPvFLqOLnNyKEcviSDm4rvApsZgCo9mM=
=12H2
-----END PGP SIGNATURE-----
Merge tag 'scmi-updates-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into for-next/scmi
Arm SCMI firmware interface updates for v5.17
Couple of main additions:
- Support for OPTEE based SCMI transport to enable using SCMI service
provided by OPTEE on some platforms
- Support for atomic SCMI transports which enables few SCMI transactions
to be completed in atomic context. This involves other refactoring work
associated with it. It also marks SMC and OPTEE as atomic transport as
the commands are completed once the return
Other changes involves some trace and log enhancements and a miscellaneous
bug fix.
* tag 'scmi-updates-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_scmi: Add new parameter to mark_txdone
firmware: arm_scmi: Add atomic mode support to smc transport
firmware: arm_scmi: Add support for atomic transports
firmware: arm_scmi: Make optee support sync_cmds_completed_on_ret
firmware: arm_scmi: Make smc support sync_cmds_completed_on_ret
firmware: arm_scmi: Add sync_cmds_completed_on_ret transport flag
firmware: arm_scmi: Make smc transport use common completions
firmware: arm_scmi: Add configurable polling mode for transports
firmware: arm_scmi: Use new trace event scmi_xfer_response_wait
include: trace: Add new scmi_xfer_response_wait event
firmware: arm_scmi: Refactor message response path
firmware: arm_scmi: Set polling timeout to max_rx_timeout_ms
firmware: arm_scmi: Perform earlier cinfo lookup call in do_xfer
firmware: arm_scmi: optee: Drop the support for the OPTEE shared dynamic buffer
firmware: arm_scmi: optee: Fix missing mutex_init()
firmware: arm_scmi: Make virtio Version_1 compliance optional
firmware: arm_scmi: Add optee transport
dt-bindings: arm: Add OP-TEE transport for SCMI
firmware: arm_scmi: Review some virtio log messages
- Limit mcount build time sorting to only those archs that
we know it works for.
- Fix memory leak in error path of histogram setup
- Fix and clean up rel_loc array out of bounds issue
- tools/rtla documentation fixes
- Fix issues with histogram logic
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYfQlIhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlXXAP9orNMD5Rkj4/2kTULIXhdx/O6l7d6f
Qq/Hy09evN1h+wEAlMIEE2Yr6tyIbO3uaoW6D8RbwG3napr/w4aUkzxGtgM=
=3MmM
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pulltracing fixes from Steven Rostedt:
- Limit mcount build time sorting to only those archs that we know it
works for.
- Fix memory leak in error path of histogram setup
- Fix and clean up rel_loc array out of bounds issue
- tools/rtla documentation fixes
- Fix issues with histogram logic
* tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Don't inc err_log entry count if entry allocation fails
tracing: Propagate is_signed to expression
tracing: Fix smatch warning for do while check in event_hist_trigger_parse()
tracing: Fix smatch warning for null glob in event_hist_trigger_parse()
tools/tracing: Update Makefile to build rtla
rtla: Make doc build optional
tracing/perf: Avoid -Warray-bounds warning for __rel_loc macro
tracing: Avoid -Warray-bounds warning for __rel_loc macro
tracing/histogram: Fix a potential memory leak for kstrdup()
ftrace: Have architectures opt-in for mcount build time sorting
As done for trace_events.h, also fix the __rel_loc macro in perf.h,
which silences the -Warray-bounds warning:
In file included from ./include/linux/string.h:253,
from ./include/linux/bitmap.h:11,
from ./include/linux/cpumask.h:12,
from ./include/linux/mm_types_task.h:14,
from ./include/linux/mm_types.h:5,
from ./include/linux/buildid.h:5,
from ./include/linux/module.h:14,
from samples/trace_events/trace-events-sample.c:2:
In function '__fortify_strcpy',
inlined from 'perf_trace_foo_rel_loc' at samples/trace_events/./trace-events-sample.h:519:1:
./include/linux/fortify-string.h:47:33: warning: '__builtin_strcpy' offset 12 is out of the bounds [
0, 4] [-Warray-bounds]
47 | #define __underlying_strcpy __builtin_strcpy
| ^
./include/linux/fortify-string.h:445:24: note: in expansion of macro '__underlying_strcpy'
445 | return __underlying_strcpy(p, q);
| ^~~~~~~~~~~~~~~~~~~
Also make __data struct member a proper flexible array to avoid future
problems.
Link: https://lkml.kernel.org/r/20220125220037.2738923-1-keescook@chromium.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 55de2c0b56 ("tracing: Add '__rel_loc' using trace event macros")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since -Warray-bounds checks the destination size from the type of given
pointer, __assign_rel_str() macro gets warned because it passes the
pointer to the 'u32' field instead of 'trace_event_raw_*' data structure.
Pass the data address calculated from the 'trace_event_raw_*' instead of
'u32' __rel_loc field.
Link: https://lkml.kernel.org/r/20220125233154.dac280ed36944c0c2fe6f3ac@kernel.org
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ This did not fix the warning, but is still a nice clean up ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Current release - new code bugs:
- tcp: add a missing sk_defer_free_flush() in tcp_splice_read()
- tcp: add a stub for sk_defer_free_flush(), fix CONFIG_INET=n
- nf_tables: set last expression in register tracking area
- nft_connlimit: fix memleak if nf_ct_netns_get() fails
- mptcp: fix removing ids bitmap setting
- bonding: use rcu_dereference_rtnl when getting active slave
- fix three cases of sleep in atomic context in drivers: lan966x, gve
- handful of build fixes for esoteric drivers after netdev->dev_addr
was made const
Previous releases - regressions:
- revert "ipv6: Honor all IPv6 PIO Valid Lifetime values", it broke
Linux compatibility with USGv6 tests
- procfs: show net device bound packet types
- ipv4: fix ip option filtering for locally generated fragments
- phy: broadcom: hook up soft_reset for BCM54616S
Previous releases - always broken:
- ipv4: raw: lock the socket in raw_bind()
- ipv4: decrease the use of shared IPID generator to decrease the
chance of attackers guessing the values
- procfs: fix cross-netns information leakage in /proc/net/ptype
- ethtool: fix link extended state for big endian
- bridge: vlan: fix single net device option dumping
- ping: fix the sk_bound_dev_if match in ping_lookup
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHy4mMACgkQMUZtbf5S
IrvOZg/9HyOFAJrCYBlgA3zskHBdqYOGn9M3LCIevBcrCzQeigT+U1uWCINfBn+H
DmsljeYKTicHZ38+HjdNXmzdnMqHtU+iJl4Ep1mcDNywygofW8JcS2Nf0n6Y+hK6
nzyEa23DBt9UAiLmGXUTIoJwEhDRbuL/eH1/ZkkPLG7GtShtEDAKHg+dJBgHbYgJ
0MQs3Q4s6AQ1PYOC0Z0zByhpSrAo2c4X/tr6g2ExNxU0vnydUbjIME0a5clFULr+
ziVeOo4e83FINPaZiYAXEDbMGUC0z+rp1RoGsgRCdTnixi5BclkmEeGRaChYJHTZ
T7tfIC2H0vZHu5/pAXFqwEHiRbminLv9jLkvA1/J67jbnpoNWTLD2jkuIWFlaY/Z
xDm7LnVBB1CdLmXYo2ItSC/8ws9GANpJOq/vFvm+uOYZNKUVctfQ5viA3+hOSULC
6BJHC0m5UminHZPVge9s1XZClarHK4jMMTH9Du2sHLsl3fedNxbgvcVPFdHswLdF
uYiUGMSrIXuQjXw6SNmR4/voJgzikvYhT+jwMn4vTeWoFQFi5eNUch0MPfUImlXG
e3T6WJHrOY3yJFyWQQ9GGLStchD72+iGq2uWLfOIyu9NRKCNBj4Kkm3bUvfqYp5b
d5sP/nl93o3um4WskxB/fDLyhSXWjprgM9mKI45ilPhUC8bWQyo=
=mwR3
-----END PGP SIGNATURE-----
Merge tag 'net-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter and can.
Current release - new code bugs:
- tcp: add a missing sk_defer_free_flush() in tcp_splice_read()
- tcp: add a stub for sk_defer_free_flush(), fix CONFIG_INET=n
- nf_tables: set last expression in register tracking area
- nft_connlimit: fix memleak if nf_ct_netns_get() fails
- mptcp: fix removing ids bitmap setting
- bonding: use rcu_dereference_rtnl when getting active slave
- fix three cases of sleep in atomic context in drivers: lan966x, gve
- handful of build fixes for esoteric drivers after netdev->dev_addr
was made const
Previous releases - regressions:
- revert "ipv6: Honor all IPv6 PIO Valid Lifetime values", it broke
Linux compatibility with USGv6 tests
- procfs: show net device bound packet types
- ipv4: fix ip option filtering for locally generated fragments
- phy: broadcom: hook up soft_reset for BCM54616S
Previous releases - always broken:
- ipv4: raw: lock the socket in raw_bind()
- ipv4: decrease the use of shared IPID generator to decrease the
chance of attackers guessing the values
- procfs: fix cross-netns information leakage in /proc/net/ptype
- ethtool: fix link extended state for big endian
- bridge: vlan: fix single net device option dumping
- ping: fix the sk_bound_dev_if match in ping_lookup"
* tag 'net-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (86 commits)
net: bridge: vlan: fix memory leak in __allowed_ingress
net: socket: rename SKB_DROP_REASON_SOCKET_FILTER
ipv4: remove sparse error in ip_neigh_gw4()
ipv4: avoid using shared IP generator for connected sockets
ipv4: tcp: send zero IPID in SYNACK messages
ipv4: raw: lock the socket in raw_bind()
MAINTAINERS: add missing IPv4/IPv6 header paths
MAINTAINERS: add more files to eth PHY
net: stmmac: dwmac-sun8i: use return val of readl_poll_timeout()
net: bridge: vlan: fix single net device option dumping
net: stmmac: skip only stmmac_ptp_register when resume from suspend
net: stmmac: configure PTP clock source prior to PTP initialization
Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"
connector/cn_proc: Use task_is_in_init_pid_ns()
pid: Introduce helper task_is_in_init_pid_ns()
gve: Fix GFP flags when allocing pages
net: lan966x: Fix sleep in atomic context when updating MAC table
net: lan966x: Fix sleep in atomic context when injecting frames
ethernet: seeq/ether3: don't write directly to netdev->dev_addr
ethernet: 8390/etherh: don't write directly to netdev->dev_addr
...
Rename SKB_DROP_REASON_SOCKET_FILTER, which is used
as the reason of skb drop out of socket filter before
it's part of a released kernel. It will be used for
more protocols than just TCP in future series.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/all/20220127091308.91401-2-imagedong@tencent.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- New Features:
- Basic handling for case insensitive filesystems
- Initial support for fs_locations and server trunking
- Bugfixes and Cleanups:
- Cleanups to how the "struct cred *" is handled for the nfs_access_entry
- Ensure the server has an up to date ctimes before hardlinking or renaming
- Update 'blocks used' after writeback, fallocate, and clone
- nfs_atomic_open() fixes
- Improvements to sunrpc tracing
- Various null check & indenting related cleanups
- Some improvements to the sunrpc sysfs code
- Use default_groups in kobj_type
- Fix some potential races and reference leaks
- A few tracepoint cleanups in xprtrdma
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmHodsIACgkQ18tUv7Cl
QOt3xQ//c9JPmMJZZoZtaD5UrHg28iyxaJpOUUpwC/jxQhLOETCf+nU1cELYgLq5
4W06NBYEmjDJ/tihUvcGMKLvbCtQR9Zl9HepFKDTLTQpGmRFD4enwSmMNvW/AV+h
I7PoN6J1DX/TZ5InOHH9asyoC2MjwrNHMn3bbQVT0qy+i3T76zJiBF79eWTnPR48
kKPnF1I0p4LKGJy+y+y/z2mdCsz7tzFkhssxVhot0nafxXzbUOp1H9aiwxroRiUC
ljbBA0TX8FWkGpGFt3y2QK2fMD7ovDpRhLFYiJClmeERXJVH5mXL9O5XfN5AL0xe
W/QqT5lbWfeHLkpm2j87yTyaHASC7hGKsAyPD0zWLDcNZws61l1Sy4BHymSE5ZVh
zt7sJjBnOWAtntyUGBg78G2vhBsd63GzrtcqAOlrngwA5ohJ8f32qvBQGyw4MQu9
75CjRcO8K8mnf9BJ6I1vYPycjkUh9RSFfNdnUEAI9ZwiTEC/hfEvH/omvEtZsNol
jBgv2SItTkdMZlEppEL4gxuaYT2wiZf2C6Gco215iPAqLC6dudoroN6yoLk/LRd0
OWZLl5XTr3j6m5QDm22k5CG080vl6XiAxmAFaFSLza6Q34Jmuluc0gLAZZxvqXk9
Ay7dQt9PQQk6mXD5Hreb0E5N9zcm2LkfvWpyGJ7mTV7sSHjA2DU=
=wcVT
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Basic handling for case insensitive filesystems
- Initial support for fs_locations and server trunking
Bugfixes and Cleanups:
- Cleanups to how the "struct cred *" is handled for the
nfs_access_entry
- Ensure the server has an up to date ctimes before hardlinking or
renaming
- Update 'blocks used' after writeback, fallocate, and clone
- nfs_atomic_open() fixes
- Improvements to sunrpc tracing
- Various null check & indenting related cleanups
- Some improvements to the sunrpc sysfs code:
- Use default_groups in kobj_type
- Fix some potential races and reference leaks
- A few tracepoint cleanups in xprtrdma"
[ This should have gone in during the merge window, but didn't. The
original pull request - sent during the merge window - had gotten
marked as spam and discarded due missing DKIM headers in the email
from Anna. - Linus ]
* tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (35 commits)
SUNRPC: Don't dereference xprt->snd_task if it's a cookie
xprtrdma: Remove definitions of RPCDBG_FACILITY
xprtrdma: Remove final dprintk call sites from xprtrdma
sunrpc: Fix potential race conditions in rpc_sysfs_xprt_state_change()
net/sunrpc: fix reference count leaks in rpc_sysfs_xprt_state_change
NFSv4.1 test and add 4.1 trunking transport
SUNRPC allow for unspecified transport time in rpc_clnt_add_xprt
NFSv4 handle port presence in fs_location server string
NFSv4 expose nfs_parse_server_name function
NFSv4.1 query for fs_location attr on a new file system
NFSv4 store server support for fs_location attribute
NFSv4 remove zero number of fs_locations entries error check
NFSv4: nfs_atomic_open() can race when looking up a non-regular file
NFSv4: Handle case where the lookup of a directory fails
NFSv42: Fallocate and clone should also request 'blocks used'
NFSv4: Allow writebacks to request 'blocks used'
SUNRPC: use default_groups in kobj_type
NFS: use default_groups in kobj_type
NFS: Fix the verifier for case sensitive filesystem in nfs_atomic_open()
NFS: Add a helper to remove case-insensitive aliases
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmHrJ4UACgkQ+7dXa6fL
C2taFg/+K0nKY3TDKc7JVxe9BEUa/uUvNba5j16qKWJKnjCrjqU0RWJSUnVOxcwH
eBwxjaoZTmtGqS3snvKMYNIGLf7FAEeRY3a1JO81sdoCA2X7P/4HtfDO+DdaCFpB
tYr9+Hfjy+2baMLEimTWp2SYsfVKD/J4gHeqdRF12QMQJ01m+X9fcV7TqpkyPo5C
L35Gi1RVDXzwn81qOV9yrmkOBhPK4hiREmCmD0BGEGT/+FYlYVBOv92n31CQMl1Q
ef8eMNiDYgXQ0iWoUYIJqMBs88M7DwAmeaA/qi85gzfGM0Z5pBE3SNy02F/81b8b
PHgmzDfj42AuAQ9ZOuwREwtvQrpGsCsw5+60klzZ6PiFtO+q/PtS/ODTdi4foDDd
ura6RM3wqlE9P4Vi7gj9NYhdUlp5p0YYPlJOGr6CepYS63O3DgwXyoVMPUE5s5fF
Boc3Ef7fKy25Wc1s/ZVSbvvgh9KYDJOKqxs4ELeNSW8GD7uA2d0XWkhjzZIsVdKZ
kd7VEYr2/j6gpPR63yw2MVOfymJN1WYuu6ittuL40hSYW8hbBg/m8hVE1E7b7NlP
+u10xvCwN9LBK+757I733wV3hB4rYa6pg9KBjHDXyCLQ1uCKVE0QaILTSjqC/mCM
OTPFmi50y1oa/9BxooEg5+ShXOH95EE4jTBCaL1lv7JYfYNUbYA=
=CgMV
-----END PGP SIGNATURE-----
Merge tag 'fscache-fixes-20220121' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull more fscache updates from David Howells:
"A set of fixes and minor updates for the fscache rewrite:
- Fix mishandling of volume collisions (the wait condition is
inverted and so it was only waiting if the volume collision was
already resolved).
- Fix miscalculation of whether there's space available in
cachefiles.
- Make sure a default cache name is set on a cache if the user hasn't
set one by the time they bind the cache.
- Adjust the way the backing inode is presented in tracepoints, add a
tracepoint for mkdir and trace directory lookup.
- Add a tracepoint for failure to set the active file mark.
- Add an explanation of the checks made on the backing filesystem.
- Check that the backing filesystem supports tmpfile.
- Document how the page-release cancellation of the read-skip
optimisation works.
And I've included a change for netfslib:
- Make ops->init_rreq() optional"
* tag 'fscache-fixes-20220121' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: Make ops->init_rreq() optional
fscache: Add a comment explaining how page-release optimisation works
cachefiles: Check that the backing filesystem supports tmpfiles
cachefiles: Explain checks in a comment
cachefiles: Trace active-mark failure
cachefiles: Make some tracepoint adjustments
cachefiles: set default tag name if it's unspecified
cachefiles: Calculate the blockshift in terms of bytes, not pages
fscache: Fix the volume collision wait condition
Make some adjustments to tracepoints to make the tracing a bit more
followable:
(1) Standardise on displaying the backing inode number as "B=<hex>" with
no leading zeros.
(2) Make the cachefiles_lookup tracepoint log the directory inode number
as well as the looked-up inode number.
(3) Add a cachefiles_lookup tracepoint into cachefiles_get_directory() to
log directory lookup.
(4) Add a new cachefiles_mkdir tracepoint and use that to log a successful
mkdir from cachefiles_get_directory().
(5) Make the cachefiles_unlink and cachefiles_rename tracepoints log the
inode number of the affected file/dir rather than dentry struct
pointers.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164251403694.3435901.9797725381831316715.stgit@warthog.procyon.org.uk/ # v1
Merge more updates from Andrew Morton:
"55 patches.
Subsystems affected by this patch series: percpu, procfs, sysctl,
misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2,
hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits)
lib: remove redundant assignment to variable ret
ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR
lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB
btrfs: use generic Kconfig option for 256kB page size limit
arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB
configs: introduce debug.config for CI-like setup
delayacct: track delays from memory compact
Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact
delayacct: cleanup flags in struct task_delay_info and functions use it
delayacct: fix incomplete disable operation when switch enable to disable
delayacct: support swapin delay accounting for swapping without blkio
panic: remove oops_id
panic: use error_report_end tracepoint on warnings
fs/adfs: remove unneeded variable make code cleaner
FAT: use io_schedule_timeout() instead of congestion_wait()
hfsplus: use struct_group_attr() for memcpy() region
nilfs2: remove redundant pointer sbufs
fs/binfmt_elf: use PT_LOAD p_align values for static PIE
const_structs.checkpatch: add frequently used ops structs
...
Introduce the error detector "warning" to the error_report event and use
the error_report_end tracepoint at the end of a warning report.
This allows in-kernel tests but also userspace to more easily determine
if a warning occurred without polling kernel logs.
[akpm@linux-foundation.org: add comma to enum list, per Andy]
Link: https://lkml.kernel.org/r/20211115085630.1756817-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In this round, we've tried to address some performance issues in f2fs_checkpoint
and direct IO flows. Also, there was a work to enhance the page cache management
used for compression. Other than them, we've done typical work including sysfs,
code clean-ups, tracepoint, sanity check, in addition to bug fixes on corner
cases.
Enhancement:
- use iomap for direct IO
- try to avoid lock contention to improve f2fs_ckpt speed
- avoid unnecessary memory allocation in compression flow
- POSIX_FADV_DONTNEED drops the page cache containing compression pages
- add some sysfs entries (gc_urgent_high_remaining, pending_discard)
Bug fix:
- try not to expose unwritten blocks to user by DIO
: this was added to avoid merge conflict; another patch is coming to address
other missing case.
- relax minor error condition for file pinning feature used in Android OTA
- fix potential deadlock case in compression flow
- should not truncate any block on pinned file
In addition, we've done some code clean-ups and tracepoint/sanity check
improvement.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmHnY0sACgkQQBSofoJI
UNIOkg//UmjCSSG63/YZM/lQQQe4kK/tT6QTT8W/VQtzWL9vXcL7bcaxzwX3LQbR
Gb47Zmsw9bzVJt6GQ2VRbODE1py/KPNMl5SDXJXHo6fOZ/dOnHve32gLwcLEzhPd
casB0TbwQJ6bpEsJiZ5ho741mURxUrSCHAAX6QIQVXh8ofm9qAqlWu74OLI6UHiV
MM84XmXcHtGUZG5SCTWfSCJhJM6Az/3A83ws9KVeu86dlE7IrigphU2nI2vdCKiO
trR3CiLC/364fiM+9ssLS3X2wKFPD/unEU7ljBv5UaG36jsVfW+tisjTKldzpiKK
44cNgDv1FEDxC0g3FKUhEGezAhxT8AJZB0in0zn8+5scarKGJtFCy9XhCGMVaeP+
usxvHVy8Ga1I7sMV6oHEBcGiPJWkmurzq1XXobtj6oL/JxN4gqUJeHTcod89hQHA
lx9kZs7MLKm2au+T3gZf5xyx35YCie8sY/N1qoPy8tU9Q7FJ54NdqqAc9JEZ6mSk
k9ybMaa/srHG/EI/XYPw0DrobHg6P5+bYtmsRvw2vP/nsNsD3ZI/EwBBEll2ITxC
V5Dn7MljYWI/5kB41Hl5xz6X65WeIN7koRyTXw5mp9tkNrLugqII5hzhwhSlcqJ1
3k9TAN3RbVpWHBcyryDyLbm/+dcbwIJ4v/eJEMIDk8F2SrBGOZs=
=LCJH
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've tried to address some performance issues in
f2fs_checkpoint and direct IO flows. Also, there was a work to enhance
the page cache management used for compression. Other than them, we've
done typical work including sysfs, code clean-ups, tracepoint, sanity
check, in addition to bug fixes on corner cases.
Enhancements:
- use iomap for direct IO
- try to avoid lock contention to improve f2fs_ckpt speed
- avoid unnecessary memory allocation in compression flow
- POSIX_FADV_DONTNEED drops the page cache containing compression
pages
- add some sysfs entries (gc_urgent_high_remaining, pending_discard)
Bug fixes:
- try not to expose unwritten blocks to user by DIO (this was added
to avoid merge conflict; another patch is coming to address other
missing case)
- relax minor error condition for file pinning feature used in
Android OTA
- fix potential deadlock case in compression flow
- should not truncate any block on pinned file
In addition, we've done some code clean-ups and tracepoint/sanity
check improvement"
* tag 'f2fs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
f2fs: do not allow partial truncation on pinned file
f2fs: remove redunant invalidate compress pages
f2fs: Simplify bool conversion
f2fs: don't drop compressed page cache in .{invalidate,release}page
f2fs: fix to reserve space for IO align feature
f2fs: fix to check available space of CP area correctly in update_ckpt_flags()
f2fs: support fault injection to f2fs_trylock_op()
f2fs: clean up __find_inline_xattr() with __find_xattr()
f2fs: fix to do sanity check on last xattr entry in __f2fs_setxattr()
f2fs: do not bother checkpoint by f2fs_get_node_info
f2fs: avoid down_write on nat_tree_lock during checkpoint
f2fs: compress: fix potential deadlock of compress file
f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file
f2fs: add gc_urgent_high_remaining sysfs node
f2fs: fix to do sanity check in is_alive()
f2fs: fix to avoid panic in is_alive() if metadata is inconsistent
f2fs: fix to do sanity check on inode type during garbage collection
f2fs: avoid duplicate call of mark_inode_dirty
f2fs: show number of pending discard commands
f2fs: support POSIX_FADV_DONTNEED drop compressed page cache
...
Originally, the RNG used several pools, so having things abstracted out
over a generic entropy_store object made sense. These days, there's only
one input pool, and then an uneven mix of usage via the abstraction and
usage via &input_pool. Rather than this uneasy mixture, just get rid of
the abstraction entirely and have things always use the global. This
simplifies the code and makes reading it a bit easier.
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
A larger than usual set of changes for this cycle. The bulk of the changes are
part of a rework of libata messages and debugging features from Hannes. In more
details, the changes are as follows.
* Small code cleanups in the pata_ali driver (unnecessary variable
initialization and simplified return statement, from Jason and Colin.
* Switch to using struct_group() in the sata_fsl driver, from Kees.
* Convert many sysfs attribute show functions to use sysfs_emit() instead of
snprintf(), from me.
* sata_dwc_460ex driver code cleanups, from Andy.
* Improve DMA setup and remove superfluous error message in libahci_platform,
from Andy
* A small code cleanup in libata to use min() instead of open coding test,
from Changcheng.
* Rework of libata messages from Hannes. This is especially focused on
replacing compile time defined debugging messages (DPRINTK() and VPRINTK())
with regular dynamic debugging messages (pr_debug()) and traceipoint events.
Both libata-core and many drivers are updated to have a consistent debugging
level control for all drivers.
* Extend compile test support to as many drivers as possible in ATA Kconfig to
improve compile test coverage, from me.
* Fixes to avoid compile time warnings (W=1) and sparse warnings in sata_fsl
and ahci_xgene drivers, from me.
* Fix the interface of the read_id() port operation method to clarify that the
data buffer passed as an argument is little endian. This avoids sparse
warnings in the pata_netcell, pata_it821x, ahci_xgene, ahci_cevaxi and
ahci_brcm drivers. From me.
* Small code cleanup in the pata_octeon_cf driver, from Minghao.
* Improved IRQ configuration code in pata_of_platform, from Lad.
* Simplified implementation of __ata_scsi_queuecmd(), from Wenchao.
* Debounce delay flag renaming, from Paul.
* Add support for AMD A85 FCH (Hudson D4) AHCI adapters, from Paul
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYeEexQAKCRDdoc3SxdoY
dgGfAQCfiWOwstxl8InSJKCeTzspu6L4mo3jtjSL+X/dihs91wEAj7MGO/8Yv/Z7
mnnM7GJ5vB0qReFnDEGw9uUnlkmMewU=
=t+rc
-----END PGP SIGNATURE-----
Merge tag 'ata-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA updates from Damien Le Moal:
"A larger than usual set of changes for this cycle. The bulk of the
changes are part of a rework of libata messages and debugging features
from Hannes. In more detail, the changes are as follows.
- Small code cleanups in the pata_ali driver (unnecessary variable
initialization and simplified return statement, from Jason and
Colin.
- Switch to using struct_group() in the sata_fsl driver, from Kees.
- Convert many sysfs attribute show functions to use sysfs_emit()
instead of snprintf(), from me.
- sata_dwc_460ex driver code cleanups, from Andy.
- Improve DMA setup and remove superfluous error message in
libahci_platform, from Andy
- A small code cleanup in libata to use min() instead of open coding
test, from Changcheng.
- Rework of libata messages from Hannes. This is especially focused
on replacing compile time defined debugging messages (DPRINTK() and
VPRINTK()) with regular dynamic debugging messages (pr_debug()) and
traceipoint events. Both libata-core and many drivers are updated
to have a consistent debugging level control for all drivers.
- Extend compile test support to as many drivers as possible in ATA
Kconfig to improve compile test coverage, from me.
- Fixes to avoid compile time warnings (W=1) and sparse warnings in
sata_fsl and ahci_xgene drivers, from me.
- Fix the interface of the read_id() port operation method to clarify
that the data buffer passed as an argument is little endian. This
avoids sparse warnings in the pata_netcell, pata_it821x,
ahci_xgene, ahci_cevaxi and ahci_brcm drivers. From me.
- Small code cleanup in the pata_octeon_cf driver, from Minghao.
- Improved IRQ configuration code in pata_of_platform, from Lad.
- Simplified implementation of __ata_scsi_queuecmd(), from Wenchao.
- Debounce delay flag renaming, from Paul.
- Add support for AMD A85 FCH (Hudson D4) AHCI adapters, from Paul"
* tag 'ata-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: (106 commits)
ata: pata_ali: remove redundant return statement
ata: ahci: Add support for AMD A85 FCH (Hudson D4)
ata: libata: Rename link flag ATA_LFLAG_NO_DB_DELAY
ata: libata-scsi: simplify __ata_scsi_queuecmd()
ata: pata_of_platform: Use platform_get_irq_optional() to get the interrupt
ata: pata_samsung_cf: add compile test support
ata: pata_pxa: add compile test support
ata: pata_imx: add compile test support
ata: pata_ftide010: add compile test support
ata: pata_cs5535: add compile test support
ata: pata_octeon_cf: remove redundant val variable
ata: fix read_id() ata port operation interface
ata: ahci_xgene: use correct type for port mmio address
ata: sata_fsl: fix cmdhdr_tbl_entry and prde struct definitions
ata: sata_fsl: fix scsi host initialization
ata: pata_bk3710: add compile test support
ata: ahci_seattle: add compile test support
ata: ahci_xgene: add compile test support
ata: ahci_tegra: add compile test support
ata: ahci_sunxi: add compile test support
...
New:
- The Real Time Linux Analysis (RTLA) tool is added to the tools directory.
- Can safely filter on user space pointers with: field.ustring ~ "match-string"
- eprobes can now be filtered like any other event.
- trace_marker(_raw) now uses stream_open() to allow multiple threads to safely
write to it. Note, this could possibly break existing user space, but we will
not know until we hear about it, and then can revert the change if need be.
- New field in events to display when bottom halfs are disabled.
- Sorting of the ftrace functions are now done at compile time instead of
at bootup.
Infrastructure changes to support future efforts:
- Added __rel_loc type for trace events. Similar to __data_loc but the offset
to the dynamic data is based off of the location of the descriptor and not
the beginning of the event. Needed for user defined events.
- Some simplification of event trigger code.
- Make synthetic events process its callback better to not hinder other
event callbacks that are registered. Needed for user defined events.
And other small fixes and clean ups.
-
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYeGvcxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrZtAP9ICjJxX54MTErElhhUL/NFLV7wqhJi
OIAgmp6jGVRqPAD+JxQtBnGH+3XMd71ioQkTfQ1rp+jBz2ERBj2DmELUAg0=
=zmda
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New:
- The Real Time Linux Analysis (RTLA) tool is added to the tools
directory.
- Can safely filter on user space pointers with: field.ustring ~
"match-string"
- eprobes can now be filtered like any other event.
- trace_marker(_raw) now uses stream_open() to allow multiple threads
to safely write to it. Note, this could possibly break existing
user space, but we will not know until we hear about it, and then
can revert the change if need be.
- New field in events to display when bottom halfs are disabled.
- Sorting of the ftrace functions are now done at compile time
instead of at bootup.
Infrastructure changes to support future efforts:
- Added __rel_loc type for trace events. Similar to __data_loc but
the offset to the dynamic data is based off of the location of the
descriptor and not the beginning of the event. Needed for user
defined events.
- Some simplification of event trigger code.
- Make synthetic events process its callback better to not hinder
other event callbacks that are registered. Needed for user defined
events.
And other small fixes and cleanups"
* tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
tracing: Add ustring operation to filtering string pointers
rtla: Add rtla timerlat hist documentation
rtla: Add rtla timerlat top documentation
rtla: Add rtla timerlat documentation
rtla: Add rtla osnoise hist documentation
rtla: Add rtla osnoise top documentation
rtla: Add rtla osnoise man page
rtla: Add Documentation
rtla/timerlat: Add timerlat hist mode
rtla: Add timerlat tool and timelart top mode
rtla/osnoise: Add the hist mode
rtla/osnoise: Add osnoise top mode
rtla: Add osnoise tool
rtla: Helper functions for rtla
rtla: Real-Time Linux Analysis tool
tracing/osnoise: Properly unhook events if start_per_cpu_kthreads() fails
tracing: Remove duplicate warnings when calling trace_create_file()
tracing/kprobes: 'nmissed' not showed correctly for kretprobe
tracing: Add test for user space strings when filtering on string pointers
tracing: Have syscall trace events use trace_event_buffer_lock_reserve()
...
- Bruce steps down as NFSD maintainer
- Prepare for dynamic nfsd thread management
- More work on supporting re-exporting NFS mounts
- One fs/locks patch on behalf of Jeff Layton
Notable bug fixes:
- Fix zero-length NFSv3 WRITEs
- Fix directory cinfo on FS's that do not support iversion
- Fix WRITE verifiers for stable writes
- Fix crash on COPY_NOTIFY with a special state ID
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmHcWOMACgkQM2qzM29m
f5dh0Q/+MjEL0IK551FdChx9Es1JqKRggv9KwJkLIoa1bw/PMSwP2pnKz6eL0Yun
mdhE9AZQgyFH1IAGdqjeLZKIYRin6bvAdDrnlqQ9SvTviPLWniSUI6AuyUqK6Zyk
wMcXpyOze0fhpxkYmz8/g7i66w967tmLh5MRvV1dkpOYAe99rYwGhvj+9ZeEWfNI
TgmptntMG6YEb+xY0E73otXZHMr2DL67ZYvOUYWemJA1uxcX4joaWBg8sx74dB6k
DUB4BFuoURk6viDD1QYh3qPU3dz9RCJNMz/cWd8+2t7BdaujTSXRIcaFslrQnKfL
Rm+O7pi5W+XohFDjeuMZ1g0c1ot/aoZSaAz00LoCVhejJ/sK9NiPAN1+LyY91Lja
cUBMVPNfW7ClIpiZcORP/chNmVn2qlaL2nxzSY/Uegnd5pIIeVD0pFVgx4+NlEat
mbrrQBcMpBRM0B+RzHS6AusqHrGdSEcwqWoVXWdxsBigJQT/AxWmii3U88k0Z54i
ooMWLaQ9EBBmygV01JN/OBySW2M/dvbfz3eFROvAVqsIP9JWP3FlUOlRDl8GcjXA
azi9fTysBom7WtL6NPcxDJbJ2t9hYr2YaztTpdo9YCHOuQbSQT6IWR5PAa3zvwMu
Bfz6Y8Hoo/KZHCqmkPGYM+x1ENCyDPv788E+erdnw1PFP5F3Pbo=
=/kX3
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"Bruce has announced he is leaving Red Hat at the end of the month and
is stepping back from his role as NFSD co-maintainer. As a result,
this includes a patch removing him from the MAINTAINERS file.
There is one patch in here that Jeff Layton was carrying in the locks
tree. Since he had only one for this cycle, he asked us to send it to
you via the nfsd tree.
There continues to be 0-day reports from Robert Morris @MIT. This time
we include a fix for a crash in the COPY_NOTIFY operation.
Highlights:
- Bruce steps down as NFSD maintainer
- Prepare for dynamic nfsd thread management
- More work on supporting re-exporting NFS mounts
- One fs/locks patch on behalf of Jeff Layton
Notable bug fixes:
- Fix zero-length NFSv3 WRITEs
- Fix directory cinfo on FS's that do not support iversion
- Fix WRITE verifiers for stable writes
- Fix crash on COPY_NOTIFY with a special state ID"
* tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (51 commits)
SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points
SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
fs/locks: fix fcntl_getlk64/fcntl_setlk64 stub prototypes
nfsd: fix crash on COPY_NOTIFY with special stateid
MAINTAINERS: remove bfields
NFSD: Move fill_pre_wcc() and fill_post_wcc()
Revert "nfsd: skip some unnecessary stats in the v4 case"
NFSD: Trace boot verifier resets
NFSD: Rename boot verifier functions
NFSD: Clean up the nfsd_net::nfssvc_boot field
NFSD: Write verifier might go backwards
nfsd: Add a tracepoint for errors in nfsd4_clone_file_range()
NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)
NFSD: De-duplicate net_generic(SVC_NET(rqstp), nfsd_net_id)
NFSD: Clean up nfsd_vfs_write()
nfsd: Replace use of rwsem with errseq_t
NFSD: Fix verifier returned in stable WRITEs
nfsd: Retry once in nfsd_open on an -EOPENSTALE return
nfsd: Add errno mapping for EREMOTEIO
nfsd: map EBADF
...
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
DAMON's virtual address spaces monitoring primitive uses 'struct pid *'
of the target process as its monitoring target id. The kernel address
is exposed as-is to the user space via the DAMON tracepoint,
'damon_aggregated'.
Though primarily only privileged users are allowed to access that, it
would be better to avoid unnecessarily exposing kernel pointers so.
Because the trace result is only required to be able to distinguish each
target, we aren't need to use the pointer as-is.
This makes the tracepoint to use the index of the target in the
context's targets list as its id in the tracepoint, to hide the kernel
space address.
Link: https://lkml.kernel.org/r/20211229131016.23641-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In Damon, we can get age information by analyzing the nr_access change,
But short time sampling is not effective, we have to obtain enough data
for analysis through long time trace, this also means that we need to
consume more cpu resources and storage space.
Now the region add a new 'age' variable, we only need to get the change of
age value through a little time trace, for example, age has been
increasing to 141, but nr_access shows a value of 0 at the same time,
Through this,we can conclude that the region has a very low nr_access
value for a long time.
Link: https://lkml.kernel.org/r/b9def1262af95e0dc1d0caea447886434db01161.1636989871.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The trace events hugepage_[invalidate|splitting], were added via the
commit 9e813308a5 ("powerpc/thp: Add tracepoints to track hugepage
invalidate"). Afterwards their call sites i.e
trace_hugepage_[invalidate|splitting] were just dropped off, leaving
these trace points unused.
Link: https://lkml.kernel.org/r/1641546351-15109-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now the migrate_pages() has changed to return the number of {normal
page, THP, hugetlb} instead, thus we should not use the return value to
calculate the number of pages migrated successfully. Instead we can
just use the 'nr_succeeded' which indicates the number of normal pages
migrated successfully to calculate the non-migrated pages in
trace_mm_compaction_migratepages().
Link: https://lkml.kernel.org/r/b4225251c4bec068dcd90d275ab7de88a39e2bd7.1636275127.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: e26d997272 ("SUNRPC: Clean up scheduling of autoclose")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Including:
- Identity domain support for virtio-iommu
- Move flush queue code into iommu-dma
- Some fixes for AMD IOMMU suspend/resume support when x2apic
is used
- Arm SMMU Updates from Will Deacon:
- Revert evtq and priq back to their former sizes
- Return early on short-descriptor page-table allocation failure
- Fix page fault reporting for Adreno GPU on SMMUv2
- Make SMMUv3 MMU notifier ops 'const'
- Numerous new compatible strings for Qualcomm SMMUv2 implementations
- Various smaller fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmHfBvsACgkQK/BELZcB
GuOURRAAqHBTUUgT/f2dJbVsWMOxSx0dhDnKuLRhAREMbmcn88tR8dxPRPYB1/6q
yyFmkh13UzEr1gosEd34P5G8GS6fMD/G4yz8p9tK6Mqu7UeU+DoiDIi3s194WhYa
iNPBlGgQ3XznTrbBnFV+n/LjvkxSguNfjj809Jiw7Ew3i50K4GFnzRA+IZVAT4+F
zdNmwY12Y0v4SCfHKiVR1ZB9kcn2/Dx78dlBQ6ALFuejeZGa2OVr94byKnfz+AJh
OssrgRcgUKclWbMk4Tljf4/FIdwkLMKD6gieBFb3sNKTUpzkG8elYmETO5d+hM+l
27CKOCz1OOqVRzKe3r5n3c0wf9UAmi0Q91zW+UVZp2i0GjxpfIeIS9/NwRy+IXHS
U9zybU47q10WF6cVO0n6wWHPRbjPii2OZpjqhSTq57qsnniCPLwkrry9H2fP71zz
NDAZv5qvHCvRF7QoZfkBvCCJ12ZhNnhZqTfZR2wGGITMIk6dokG4NCsU93rSVKvZ
4xQDPm45rECmunibdc9c1vrifKC7BIWCSU5DH3AEDBU/i9QfYpVPXfJlGdz3enIV
/FA+kcvYrh21sokly/TqiZXGSaOFBqFEN13KJReXOgbENNq6kT/4lSNjK5Q1WSWp
qDq12EQyv0RtTEcKVDpRVunI+/G5MquO8gbIrVmsRV0SU1Z0yoM=
=Q8LV
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Identity domain support for virtio-iommu
- Move flush queue code into iommu-dma
- Some fixes for AMD IOMMU suspend/resume support when x2apic is used
- Arm SMMU Updates from Will Deacon:
- Revert evtq and priq back to their former sizes
- Return early on short-descriptor page-table allocation failure
- Fix page fault reporting for Adreno GPU on SMMUv2
- Make SMMUv3 MMU notifier ops 'const'
- Numerous new compatible strings for Qualcomm SMMUv2 implementations
- Various smaller fixes and cleanups
* tag 'iommu-updates-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (38 commits)
iommu/iova: Temporarily include dma-mapping.h from iova.h
iommu: Move flush queue data into iommu_dma_cookie
iommu/iova: Move flush queue code to iommu-dma
iommu/iova: Consolidate flush queue code
iommu/vt-d: Use put_pages_list
iommu/amd: Use put_pages_list
iommu/amd: Simplify pagetable freeing
iommu/iova: Squash flush_cb abstraction
iommu/iova: Squash entry_dtor abstraction
iommu/iova: Fix race between FQ timeout and teardown
iommu/amd: Fix typo in *glues … together* in comment
iommu/vt-d: Remove unused dma_to_mm_pfn function
iommu/vt-d: Drop duplicate check in dma_pte_free_pagetable()
iommu/vt-d: Use bitmap_zalloc() when applicable
iommu/amd: Remove useless irq affinity notifier
iommu/amd: X2apic mode: mask/unmask interrupts on suspend/resume
iommu/amd: X2apic mode: setup the INTX registers on mask/unmask
iommu/amd: X2apic mode: re-enable after resume
iommu/amd: Restore GA log/tail pointer on host resume
iommu/iova: Move fast alloc size roundup into alloc_iova_fast()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmHeBGsACgkQ+7dXa6fL
C2tyLw/8C2Gs/XvOZvRO7KPetKI9BbQSFoCe7uvGbiPq5CEmgcjWzQxvQGklBiZD
qYa6pMNye1iGpsHOY3Yu210b7vMQiRLnnxvVle0UrjpZR7CcxYS0gGV+6yRdbDGy
W1X6GFiX06qiNsgBH4msYp0SmbhhfkTyAx1BeBZAEtX8iFgaPfOldPY2nLMcTDD6
6FT1nTzRcMHx9IUQZJtpeatzc70Qg8+fOr2UAY2nOIypXh6+vAMBO80xtUjGVU+1
pWD1E+8cXSLfwEEzquFWoWTsTX7hNfsesEN10FmBf1bVCH9ZDFE01MOl6B8+CkFl
+xfkvDNFC3yyUwAMVAV4+A4Be+cVLSqN2R91QIKJnAj9w1OjxASrwZJ1YeZp6KP4
h0XKuPs3sRwwbNPVL/nP0UPNexoJnOUAaHesl4uKkRrExmxz9xGOIqIri2+tUIO+
HkGyNns1huymj1K1ja4AQbDiZZX39GgYVleyg9g3uuy1FS4k+/myJcXo/CqWn3ON
4oeNwxwLvlcqIQnPrESvwev50lFZYB4pfwvez6T2C5dL/Wk/xdeJK9iG81RWgx7y
5XcDeoGDE08gMCGWVPjuhOCXypeiRGHhRNlcxTtq5kLwBZGkcYg/wFFnWn+6hzc4
kyXw2kS5WZq4Q/FPh7BdY0eHp6xv0EpAOZwceneLB9lhNINdxcQ=
=ISJ6
-----END PGP SIGNATURE-----
Merge tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache rewrite from David Howells:
"This is a set of patches that rewrites the fscache driver and the
cachefiles driver, significantly simplifying the code compared to
what's upstream, removing the complex operation scheduling and object
state machine in favour of something much smaller and simpler.
The series is structured such that the first few patches disable
fscache use by the network filesystems using it, remove the cachefiles
driver entirely and as much of the fscache driver as can be got away
with without causing build failures in the network filesystems.
The patches after that recreate fscache and then cachefiles,
attempting to add the pieces in a logical order. Finally, the
filesystems are reenabled and then the very last patch changes the
documentation.
[!] Note: I have dropped the cifs patch for the moment, leaving local
caching in cifs disabled. I've been having trouble getting that
working. I think I have it done, but it needs more testing (there
seem to be some test failures occurring with v5.16 also from
xfstests), so I propose deferring that patch to the end of the
merge window.
WHY REWRITE?
============
Fscache's operation scheduling API was intended to handle sequencing
of cache operations, which were all required (where possible) to run
asynchronously in parallel with the operations being done by the
network filesystem, whilst allowing the cache to be brought online and
offline and to interrupt service for invalidation.
With the advent of the tmpfile capacity in the VFS, however, an
opportunity arises to do invalidation much more simply, without having
to wait for I/O that's actually in progress: Cachefiles can simply
create a tmpfile, cut over the file pointer for the backing object
attached to a cookie and abandon the in-progress I/O, dismissing it
upon completion.
Future work here would involve using Omar Sandoval's vfs_link() with
AT_LINK_REPLACE[1] to allow an extant file to be displaced by a new
hard link from a tmpfile as currently I have to unlink the old file
first.
These patches can also simplify the object state handling as I/O
operations to the cache don't all have to be brought to a stop in
order to invalidate a file. To that end, and with an eye on to writing
a new backing cache model in the future, I've taken the opportunity to
simplify the indexing structure.
I've separated the index cookie concept from the file cookie concept
by C type now. The former is now called a "volume cookie" (struct
fscache_volume) and there is a container of file cookies. There are
then just the two levels. All the index cookie levels are collapsed
into a single volume cookie, and this has a single printable string as
a key. For instance, an AFS volume would have a key of something like
"afs,example.com,1000555", combining the filesystem name, cell name
and volume ID. This is freeform, but must not have '/' chars in it.
I've also eliminated all pointers back from fscache into the network
filesystem. This required the duplication of a little bit of data in
the cookie (cookie key, coherency data and file size), but it's not
actually that much. This gets rid of problems with making sure we keep
netfs data structures around so that the cache can access them.
These patches mean that most of the code that was in the drivers
before is simply gone and those drivers are now almost entirely new
code. That being the case, there doesn't seem any particular reason to
try and maintain bisectability across it. Further, there has to be a
point in the middle where things are cut over as there's a single
point everything has to go through (ie. /dev/cachefiles) and it can't
be in use by two drivers at once.
ISSUES YET OUTSTANDING
======================
There are some issues still outstanding, unaddressed by this patchset,
that will need fixing in future patchsets, but that don't stop this
series from being usable:
(1) The cachefiles driver needs to stop using the backing filesystem's
metadata to store information about what parts of the cache are
populated. This is not reliable with modern extent-based
filesystems.
Fixing this is deferred to a separate patchset as it involves
negotiation with the network filesystem and the VM as to how much
data to download to fulfil a read - which brings me on to (2)...
(2) NFS (and CIFS with the dropped patch) do not take account of how
the cache would like I/O to be structured to meet its granularity
requirements. Previously, the cache used page granularity, which
was fine as the network filesystems also dealt in page
granularity, and the backing filesystem (ext4, xfs or whatever)
did whatever it did out of sight. However, we now have folios to
deal with and the cache will now have to store its own metadata to
track its contents.
The change I'm looking at making for cachefiles is to store
content bitmaps in one or more xattrs and making a bit in the map
correspond to something like a 256KiB block. However, the size of
an xattr and the fact that they have to be read/updated in one go
means that I'm looking at covering 1GiB of data per 512-byte map
and storing each map in an xattr. Cachefiles has the potential to
grow into a fully fledged filesystem of its very own if I'm not
careful.
However, I'm also looking at changing things even more radically
and going to a different model of how the cache is arranged and
managed - one that's more akin to the way, say, openafs does
things - which brings me on to (3)...
(3) The way cachefilesd does culling is very inefficient for large
caches and it would be better to move it into the kernel if I can
as cachefilesd has to keep asking the kernel if it can cull a
file. Changing the way the backend works would allow this to be
addressed.
BITS THAT MAY BE CONTROVERSIAL
==============================
There are some bits I've added that may be controversial:
(1) I've provided a flag, S_KERNEL_FILE, that cachefiles uses to check
if a files is already being used by some other kernel service
(e.g. a duplicate cachefiles cache in the same directory) and
reject it if it is. This isn't entirely necessary, but it helps
prevent accidental data corruption.
I don't want to use S_SWAPFILE as that has other effects, but
quite possibly swapon() should set S_KERNEL_FILE too.
Note that it doesn't prevent userspace from interfering, though
perhaps it should. (I have made it prevent a marked directory from
being rmdir-able).
(2) Cachefiles wants to keep the backing file for a cookie open whilst
we might need to write to it from network filesystem writeback.
The problem is that the network filesystem unuses its cookie when
its file is closed, and so we have nothing pinning the cachefiles
file open and it will get closed automatically after a short time
to avoid EMFILE/ENFILE problems.
Reopening the cache file, however, is a problem if this is being
done due to writeback triggered by exit(). Some filesystems will
oops if we try to open a file in that context because they want to
access current->fs or suchlike.
To get around this, I added the following:
(A) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network
filesystem inode to indicate that we have a usage count on the
cookie caching that inode.
(B) A flag in struct writeback_control, unpinned_fscache_wb, that
is set when __writeback_single_inode() clears the last dirty
page from i_pages - at which point it clears
I_PINNING_FSCACHE_WB and sets this flag.
This has to be done here so that clearing I_PINNING_FSCACHE_WB
can be done atomically with the check of PAGECACHE_TAG_DIRTY
that clears I_DIRTY_PAGES.
(C) A function, fscache_set_page_dirty(), which if it is not set,
sets I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to
pin the cache resources.
(D) A function, fscache_unpin_writeback(), to be called by
->write_inode() to unuse the cookie.
(E) A function, fscache_clear_inode_writeback(), to be called when
the inode is evicted, before clear_inode() is called. This
cleans up any lingering I_PINNING_FSCACHE_WB.
The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the
cache as well as to the server.
For the future, I'm working on write helpers for netfs lib that
should allow this facility to be removed by keeping track of the
dirty regions separately - but that's incomplete at the moment and
is also going to be affected by folios, one way or another, since
it deals with pages"
Link: https://lore.kernel.org/all/510611.1641942444@warthog.procyon.org.uk/
Tested-by: Dominique Martinet <asmadeus@codewreck.org> # 9p
Tested-by: kafs-testing@auristor.com # afs
Tested-by: Jeff Layton <jlayton@kernel.org> # ceph
Tested-by: Dave Wysochanski <dwysocha@redhat.com> # nfs
Tested-by: Daire Byrne <daire@dneg.com> # nfs
* tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (67 commits)
9p, afs, ceph, nfs: Use current_is_kswapd() rather than gfpflags_allow_blocking()
fscache: Add a tracepoint for cookie use/unuse
fscache: Rewrite documentation
ceph: add fscache writeback support
ceph: conversion to new fscache API
nfs: Implement cache I/O by accessing the cache directly
nfs: Convert to new fscache volume/cookie API
9p: Copy local writes to the cache when writing to the server
9p: Use fscache indexing rewrite and reenable caching
afs: Skip truncation on the server of data we haven't written yet
afs: Copy local writes to the cache when writing to the server
afs: Convert afs to use the new fscache API
fscache, cachefiles: Display stat of culling events
fscache, cachefiles: Display stats of no-space events
cachefiles: Allow cachefiles to actually function
fscache, cachefiles: Store the volume coherency data
cachefiles: Implement the I/O routines
cachefiles: Implement cookie resize for truncate
cachefiles: Implement begin and end I/O operation
cachefiles: Implement backing file wrangling
...
This patchset stops just short of actually enabling large folios.
It converts everything that I noticed needs to be converted, but there may
still be places I've overlooked which still have page size assumptions.
The big change here is using large entries in the page cache XArray
instead of many small entries. That only affects shmem for now, but
it's a pretty big change for shmem since it changes where memory needs
to be allocated (at split time instead of insertion).
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmHcraoACgkQDpNsjXcp
gj7C3wgAl0cjtdVzTpkLmbnInsicW1m3thnbkSXYbpqRccFjpu2kEBGj31PT+oGz
dzgXP7SNZ/VkFT+qWtmHSRF/J41B6f9bFojO81B2aQdpRiziU+5QbSbXbfUjwVhE
GJF0WGSJtVqySKynXP/iYTEt2zj6BiVperAwIqzhZpPY7gNoyDgeRD34Xy5bQqdD
ey6/Uwkh7oFHLEDcgxsEnyF0tUR3q+gpe5XZW1fb79p3crWw44xATc3UvKv8qCLC
Rd4oHmKkOj4MvdiUxJEfXI+XxgrkQ8XRO70B+p6ZljhDaoDZYw7ullxA0gvlSpNX
6pnjSQlKA1VQXsi6PMSt+9vf26XxaQ==
=KeYZ
-----END PGP SIGNATURE-----
Merge tag 'folio-5.17' of git://git.infradead.org/users/willy/pagecache
Pull folio conversion updates from Matthew Wilcox:
"Convert much of the page cache to use folios
This stops just short of actually enabling large folios. It converts
everything that I noticed needs to be converted, but there may still
be places I've overlooked which still have page size assumptions.
The big change here is using large entries in the page cache XArray
instead of many small entries. That only affects shmem for now, but
it's a pretty big change for shmem since it changes where memory needs
to be allocated (at split time instead of insertion)"
* tag 'folio-5.17' of git://git.infradead.org/users/willy/pagecache: (49 commits)
mm: Use multi-index entries in the page cache
XArray: Add xas_advance()
truncate,shmem: Handle truncates that split large folios
truncate: Convert invalidate_inode_pages2_range to folios
fs: Convert vfs_dedupe_file_range_compare to folios
mm: Remove pagevec_remove_exceptionals()
mm: Convert find_lock_entries() to use a folio_batch
filemap: Return only folios from find_get_entries()
filemap: Convert filemap_get_read_batch() to use a folio_batch
filemap: Convert filemap_read() to use a folio
truncate: Add invalidate_complete_folio2()
truncate: Convert invalidate_inode_pages2_range() to use a folio
truncate: Skip known-truncated indices
truncate,shmem: Add truncate_inode_folio()
shmem: Convert part of shmem_undo_range() to use a folio
mm: Add unmap_mapping_folio()
truncate: Add truncate_cleanup_folio()
filemap: Add filemap_release_folio()
filemap: Use a folio in filemap_page_mkwrite
filemap: Use a folio in filemap_map_pages
...
This set includes the normal collection of minor fixes and
cleanups, new kmem caches for network messaging structs,
a start on some basic tracepoints, and some new debugfs
files for inserting test messages.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJh3J8BAAoJEDgbc8f8gGmqC0IP/0xCyLKDLmv1KcR/SbUxiP9Y
ZmiuF1FdxqRLPN5HfU1GU4734EbWEiIZFGXDk36b4kE3UpbeddUMA6dDf9sDOJ73
H5BYgNfSgzybIq1D3wZe4C3nH4YTSrqvZhAI2bD3mcNnUDx20LzUMaGqLkzJcsOa
iFDVDlkNrXee/RPSuZDv/5U+L5YEC7WlKEHJ2u/INAFitakb0i1ChJGAgrJZADCw
giznPhfkmWyesgfOxOQ7JcxcVQSedQ+7vtHYhZ5tZ5v1h2yM9LPYJ0bNpNKfHUiA
0gf0DIpOjt4q9ClYxCsaLeK0t32qWOjiNoZaAm+lrLvZWE75CC+WP6KrDR+2x3p5
JlgBZ24h3v7t/ldYmEo7SSZ/1lfb4+0fyk7UjPzko9ErhRFrAopAfl924bRLN3ES
9iSgDslBT5apivNwrByDKI9flhDpMnfWtSufnYeMzFAbHzJCWJuUS8bbi9M+K0mK
ENjmXQu1OVOmht8WM7HOA91tIUcLhr/YD2fKtYNx054TErKKRjlRb/V+NHffyYJc
SqqyaLsAamhMBcLrAk3a88hj6bp8N4NYwrMQMus0xv3aZ23OaGCvtrj7fXuYMlcu
sjLOTQfFFDPN1B1WvI18TDdh/TIgP/8AC+KbHf4Y+hbntO0BgEzL1/CKqT3WqN8y
hAnK+Avj6CILRvO8ardm
=Cst4
-----END PGP SIGNATURE-----
Merge tag 'dlm-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes the normal collection of minor fixes and cleanups,
new kmem caches for network messaging structs, a start on some basic
tracepoints, and some new debugfs files for inserting test messages"
* tag 'dlm-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (32 commits)
fs: dlm: print cluster addr if non-cluster node connects
fs: dlm: memory cache for lowcomms hotpath
fs: dlm: memory cache for writequeue_entry
fs: dlm: memory cache for midcomms hotpath
fs: dlm: remove wq_alloc mutex
fs: dlm: use event based wait for pending remove
fs: dlm: check for pending users filling buffers
fs: dlm: use list_empty() to check last iteration
fs: dlm: fix build with CONFIG_IPV6 disabled
fs: dlm: replace use of socket sk_callback_lock with sock_lock
fs: dlm: don't call kernel_getpeername() in error_report()
fs: dlm: fix potential buffer overflow
fs: dlm:Remove unneeded semicolon
fs: dlm: remove double list_first_entry call
fs: dlm: filter user dlm messages for kernel locks
fs: dlm: add lkb waiters debugfs functionality
fs: dlm: add lkb debugfs functionality
fs: dlm: allow create lkb with specific id range
fs: dlm: add debugfs rawmsg send functionality
fs: dlm: let handle callback data as void
...
FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls. In addition the usual
large number of clean ups and bug fixes, in particular for the
fast_commit feature.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmHcr8sACgkQ8vlZVpUN
gaOuMwgAgunXEni8yED1QXAp+F1nigxx8K6O6fpjzVjiUFlH/cgc3M5PEZiV7Pgj
rIsNliCLQjDQPA7XxInvMuDXEvGRdleaqGS5MCKBR9898acaCTn0eBCK8ZKfhlIs
tJus706kOjy8IXRgEJkCrxEdrsJ57rgqHk/bstipNVg4lnfYYSUKH8k0ob4aaoiW
irufs1dSo7KmpwL0c/z9yIei3ST4DA2Dc/eBS0hsdgKtr/zMmwSanxqbQJTXz8ra
FTZ23HAu/zxWZ9GM4p3DhuVaMWaulm6eEqJKO+loeU9xOeItn0WmSJW260KR0cmj
0nPAflMmpQVBhtO9gjXcSiORbd6/zg==
=9HNa
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Convert ext4 to use the new mount API, and add support for the
FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls.
In addition the usual large number of clean ups and bug fixes, in
particular for the fast_commit feature"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (48 commits)
ext4: don't use the orphan list when migrating an inode
ext4: use BUG_ON instead of if condition followed by BUG
ext4: fix a copy and paste typo
ext4: set csum seed in tmp inode while migrating to extents
ext4: remove unnecessary 'offset' assignment
ext4: remove redundant o_start statement
ext4: drop an always true check
ext4: remove unused assignments
ext4: remove redundant statement
ext4: remove useless resetting io_end_size in mpage_process_page()
ext4: allow to change s_last_trim_minblks via sysfs
ext4: change s_last_trim_minblks type to unsigned long
ext4: implement support for get/set fs label
ext4: only set EXT4_MOUNT_QUOTA when journalled quota file is specified
ext4: don't use kfree() on rcu protected pointer sbi->s_qf_names
ext4: avoid trim error on fs with small groups
ext4: fix an use-after-free issue about data=journal writeback mode
ext4: fix null-ptr-deref in '__ext4_journal_ensure_credits'
ext4: initialize err_blk before calling __ext4_get_inode_loc
ext4: fix a possible ABBA deadlock due to busy PA
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmHcX64ACgkQxWXV+ddt
WDsIyA//UGMG0waxz40RksQL+4AnTIJ+apmmxq5VGEE8y6aUPAGKm++5Cs9cC+Os
5xuZhNnD8hjB+eWH+tRCx8ll+T4g4skGKDCyD0fWNtZAlW8pnsIbztdO0nNYx9C0
V++vu/hQR6M8E7ORlayEKBWy2/UnBG5p/XVLPG4RJ4vMETJPl2RLWVDpu3dj09kf
YnD3AY0vmKEyCu/b9NtSzfZMO2/lXT2U41ezLJJfmPAXcMJ0EeSAazACVDQyq24p
wnnr6xmdo7ZR0oGFLUmBmfxbKwd3l0JIUsi/XRysXe+8y7raIE0gKCg7NpCS276T
BZrKmefxHhdMCA1HLlH6AkrKmQUgIaceLkXTanTYv4cnVzb6XoeV6R4IO9/JQ0yv
YsdCL7eZ4Vl3ToPlQkYWdwUNP5UjVMg5qMwxchigbwq7jViLJtiu3WKy0O0TitzB
n4o0cCdlv3lHRp8FS6cFmbCrsavFT8/q3vz/aRdkrojOKE0jEWVJiz0bf39j5BVO
IiJ3RF3kbpG7TZz0+eNUbgebME8zwaWfyd6t2L7Ztvx9F4elzSW95iwGc9//TZvl
ciNFI9LQvnZRymUItxg0HNXfbJMXJAE9ImTxA8HLkfQYpaBlwbt4avkQK56LhpN+
nFCd5cxJy5HLFIvkRDfqpF+C247p5FICzd/hs59/cXlCynulglg=
=9lg+
-----END PGP SIGNATURE-----
Merge tag 'for-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This end of the year branch is intentionally not that exciting. Most
of the changes are under the hood, but there are some minor user
visible improvements and several performance improvements too.
Features:
- make send work with concurrent block group relocation.
We're not allowed to prevent send failing or silently producing
some bad stream but with more fine grained locking and checks it's
possible. The send vs deduplication exclusion could reuse the same
logic in the future.
- new exclusive operation 'balance paused' to allow adding a device
to filesystem with paused balance
- new sysfs file for fsid stored in the per-device directory to help
distinguish devices when seeding is enabled, the fsid may differ
from the one reported by the filesystem
Performance improvements:
- less metadata needed for directory logging, directory deletion is
20-40% faster
- in zoned mode, cache zone information during mount to speed up
repeated queries (about 50% speedup)
- free space tree entries get indexed and searched by size (latency
-30%, search run time -30%)
- less contention in tree node locking when inserting a key and no
splits are needed (files/sec in fsmark improves by 1-20%)
Fixes:
- fix ENOSPC failure when attempting direct IO write into NOCOW range
- fix deadlock between quota enable and other quota operations
- global reserve minimum calculations fixed to account for free space
tree
- in zoned mode, fix condition for chunk allocation that may not find
the right zone for reuse and could lead to early ENOSPC
Core:
- global reserve stealing got simplified and cleaned up in evict
- remove async transaction commit based on manual transaction refs,
reuse existing kthread and mechanisms to let it commit transaction
before timeout
- preparatory work for extent tree v2, add wrappers for global tree
roots, truncation path cleanups
- remove readahead framework, it's a bit overengineered and used only
for scrub, and yet it does not cover all its needs, there is
another readahead built in the b-tree search that is now used,
performance drop on HDD is about 5% which is acceptable and scrub
is often throttled anyway, on SSDs there's no reported drop but
slight improvement
- self tests report extent tree state when error occurs
- replace assert with debugging information when an uncommitted
transaction is found at unmount time
Other:
- error handling improvements
- other cleanups and refactoring"
* tag 'for-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
btrfs: output more debug messages for uncommitted transaction
btrfs: respect the max size in the header when activating swap file
btrfs: fix argument list that the kdoc format and script verified
btrfs: remove unnecessary parameter type from compression_decompress_bio
btrfs: selftests: dump extent io tree if extent-io-tree test failed
btrfs: scrub: cleanup the argument list of scrub_stripe()
btrfs: scrub: cleanup the argument list of scrub_chunk()
btrfs: remove reada infrastructure
btrfs: scrub: use btrfs_path::reada for extent tree readahead
btrfs: scrub: remove the unnecessary path parameter for scrub_raid56_parity()
btrfs: refactor unlock_up
btrfs: skip transaction commit after failure to create subvolume
btrfs: zoned: fix chunk allocation condition for zoned allocator
btrfs: add extent allocator hook to decide to allocate chunk or not
btrfs: zoned: unset dedicated block group on allocation failure
btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned
btrfs: zoned: sink zone check into btrfs_repair_one_zone
btrfs: zoned: simplify btrfs_check_meta_write_pointer
btrfs: zoned: encapsulate inode locking for zoned relocation
btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the device
...
- add sysfs interface and a sysfs node to control sync decompression;
- add tail-packing inline support for compressed files;
- get rid of erofs_get_meta_page().
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYduH/BEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBNPdAP9hKomD1hRiFeCWlLA1nDXYkqGbbt6+D3HT
cm4G7DgVBAD+O+RWv6JVYg1zAAFlKmxqEKEfoDLKI65wAjH1V/h/dQE=
=cEam
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, tail-packing data inline for compressed files is now
supported so that tail pcluster can be stored and read together with
inode metadata in order to save data I/O and storage space.
In addition to that, to prepare for the upcoming subpage, folio and
fscache features, we also introduce meta buffers to get rid of
erofs_get_meta_page() since it was too close to the page itself.
In addition, in order to show supported kernel features and control
sync decompression strategy, new sysfs nodes are introduced in this
cycle as well.
Summary:
- add sysfs interface and a sysfs node to control sync decompression
- add tail-packing inline support for compressed files
- get rid of erofs_get_meta_page()"
* tag 'erofs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: use meta buffers for zmap operations
erofs: use meta buffers for xattr operations
erofs: use meta buffers for super operations
erofs: use meta buffers for inode operations
erofs: introduce meta buffer operations
erofs: add on-disk compressed tail-packing inline support
erofs: support inline data decompression
erofs: support unaligned data decompression
erofs: introduce z_erofs_fixup_insize
erofs: tidy up z_erofs_lz4_decompress
erofs: clean up erofs_map_blocks tracepoints
erofs: Replace zero-length array with flexible-array member
erofs: add sysfs node to control sync decompression strategy
erofs: add sysfs interface
erofs: rename lz4_0pading to zero_padding
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. The only two noticeable changes are a subtle
cpuset behavior fix and trace event id field being expanded to u64
from int. Most others are code cleanups"
* 'for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: convert 'allowed' in __cpuset_node_allowed() to be boolean
cgroup/rstat: check updated_next only for root
cgroup: rstat: explicitly put loop variant in while
cgroup: return early if it is already on preloaded list
cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy
cgroup: Trace event cgroup id fields should be u64
cgroup: fix a typo in comment
cgroup: get the wrong css for css_alloc() during cgroup_init_subsys()
cgroup: rstat: Mark benign data race to silence KCSAN
Implement support for FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls for
online reading and setting of file system label.
ext4_ioctl_getlabel() is simple, just get the label from the primary
superblock. This might not be the first sb on the file system if
'sb=' mount option is used.
In ext4_ioctl_setlabel() we update what ext4 currently views as a
primary superblock and then proceed to update backup superblocks. There
are two caveats:
- the primary superblock might not be the first superblock and so it
might not be the one used by userspace tools if read directly
off the disk.
- because the primary superblock might not be the first superblock we
potentialy have to update it as part of backup superblock update.
However the first sb location is a bit more complicated than the rest
so we have to account for that.
The superblock modification is created generic enough so the
infrastructure can be used for other potential superblock modification
operations, such as chaning UUID.
Tested with generic/492 with various configurations. I also checked the
behavior with 'sb=' mount options, including very large file systems
with and without sparse_super/sparse_super2.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20211213135618.43303-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Avoid potentially hazardous memory copying and the needless use of
"%pIS" -- in the kernel, an RPC service listener is always bound to
ANYADDR. Having the network namespace is helpful when recording
errors, though.
Fixes: a0469f46fa ("SUNRPC: Replace dprintk call sites in TCP state change callouts")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
While testing, I got an unexpected KASAN splat:
Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: BUG: KASAN: stack-out-of-bounds in trace_event_raw_event_svc_xprt_create_err+0x190/0x210 [sunrpc]
Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: Read of size 28 at addr ffffc9000008f728 by task mount.nfs/4628
The memcpy() in the TP_fast_assign section of this trace point
copies the size of the destination buffer in order that the buffer
won't be overrun.
In other similar trace points, the source buffer for this memcpy is
a "struct sockaddr_storage" so the actual length of the source
buffer is always long enough to prevent the memcpy from reading
uninitialized or unallocated memory.
However, for this trace point, the source buffer can be as small as
a "struct sockaddr_in". For AF_INET sockaddrs, the memcpy() reads
memory that follows the source buffer, which is not always valid
memory.
To avoid copying past the end of the passed-in sockaddr, make the
source address's length available to the memcpy(). It would be a
little nicer if the tracing infrastructure was more friendly about
storing socket addresses that are not AF_INET, but I could not find
a way to make printk("%pIS") work with a dynamic array.
Reported-by: KASAN
Fixes: 4b8f380e46 ("SUNRPC: Tracepoint to record errors in svc_xpo_create()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Replace kfree_skb() with kfree_skb_reason() in __udp4_lib_rcv.
New drop reason 'SKB_DROP_REASON_UDP_CSUM' is added for udp csum
error.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace kfree_skb() with kfree_skb_reason() in tcp_v4_rcv(). Following
drop reasons are added:
SKB_DROP_REASON_NO_SOCKET
SKB_DROP_REASON_PKT_TOO_SMALL
SKB_DROP_REASON_TCP_CSUM
SKB_DROP_REASON_TCP_FILTER
After this patch, 'kfree_skb' event will print message like this:
$ TASK-PID CPU# ||||| TIMESTAMP FUNCTION
$ | | | ||||| | |
<idle>-0 [000] ..s1. 36.113438: kfree_skb: skbaddr=(____ptrval____) protocol=2048 location=(____ptrval____) reason: NO_SOCKET
The reason of skb drop is printed too.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce the interface kfree_skb_reason(), which is able to pass
the reason why the skb is dropped to 'kfree_skb' tracepoint.
Add the 'reason' field to 'trace_kfree_skb', therefor user can get
more detail information about abnormal skb with 'drop_monitor' or
eBPF.
All drop reasons are defined in the enum 'skb_drop_reason', and
they will be print as string in 'kfree_skb' tracepoint in format
of 'reason: XXX'.
( Maybe the reasons should be defined in a uapi header file, so that
user space can use them? )
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide a pair of functions to perform raw I/O on the cache. The first
function allows an arbitrary asynchronous direct-IO read to be made against
a cache object, though the read should be aligned and sized appropriately
for the backing device:
int fscache_read(struct netfs_cache_resources *cres,
loff_t start_pos,
struct iov_iter *iter,
enum netfs_read_from_hole read_hole,
netfs_io_terminated_t term_func,
void *term_func_priv);
The cache resources must have been previously initialised by
fscache_begin_read_operation(). A read operation is sent to the backing
filesystem, starting at start_pos within the file. The size of the read is
specified by the iterator, as is the location of the output buffer.
If there is a hole in the data it can be ignored and left to the backing
filesystem to deal with (NETFS_READ_HOLE_IGNORE), a hole at the beginning
can be skipped over and the buffer padded with zeros
(NETFS_READ_HOLE_CLEAR) or -ENODATA can be given (NETFS_READ_HOLE_FAIL).
If term_func is not NULL, the operation may be performed asynchronously.
Upon completion, successful or otherwise, (*term_func)() will be called and
passed term_func_priv, along with an error or the amount of data
transferred. If the op is run asynchronously, fscache_read() will return
-EIOCBQUEUED.
The second function allows an arbitrary asynchronous direct-IO write to be
made against a cache object, though the write should be aligned and sized
appropriately for the backing device:
int fscache_write(struct netfs_cache_resources *cres,
loff_t start_pos,
struct iov_iter *iter,
netfs_io_terminated_t term_func,
void *term_func_priv);
This works in very similar way to fscache_read(), except that there's no
need to deal with holes (they're just overwritten).
The caller is responsible for preventing concurrent overlapping writes.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819613224.215744.7877577215582621254.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906915386.143852.16936177636106480724.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967122632.1823006.7487049517698562172.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021521420.640689.12747258780542678309.stgit@warthog.procyon.org.uk/ # v4
Provide a function to begin a read operation:
int fscache_begin_read_operation(
struct netfs_cache_resources *cres,
struct fscache_cookie *cookie)
This is primarily intended to be called by network filesystems on behalf of
netfslib, but may also be called to use the I/O access functions directly.
It attaches the resources required by the cache to cres struct from the
supplied cookie.
This holds access to the cache behind the cookie for the duration of the
operation and forces cache withdrawal and cookie invalidation to perform
synchronisation on the operation. cres->inval_counter is set from the
cookie at this point so that it can be compared at the end of the
operation.
Note that this does not guarantee that the cache state is fully set up and
able to perform I/O immediately; looking up and creation may be left in
progress in the background. The operations intended to be called by the
network filesystem, such as reading and writing, are expected to wait for
the cookie to move to the correct state.
This will, however, potentially sleep, waiting for a certain minimum state
to be set or for operations such as invalidate to advance far enough that
I/O can resume.
Also provide a function for the cache to call to wait for the cache object
to get to a state where it can be used for certain things:
bool fscache_wait_for_operation(struct netfs_cache_resources *cres,
enum fscache_want_stage stage);
This looks at the cache resources provided by the begin function and waits
for them to get to an appropriate stage. There's a choice of wanting just
some parameters (FSCACHE_WANT_PARAM) or the ability to do I/O
(FSCACHE_WANT_READ or FSCACHE_WANT_WRITE).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819603692.215744.146724961588817028.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906910672.143852.13856103384424986357.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967110245.1823006.2239170567540431836.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021513617.640689.16627329360866150606.stgit@warthog.procyon.org.uk/ # v4
Provide a pair of functions to count the number of users of a cookie (open
files, writeback, invalidation, resizing, reads, writes), to obtain and pin
resources for the cookie and to prevent culling for the whilst there are
users.
The first function marks a cookie as being in use:
void fscache_use_cookie(struct fscache_cookie *cookie,
bool will_modify);
The caller should indicate the cookie to use and whether or not the caller
is in a context that may modify the cookie (e.g. a file open O_RDWR).
If the cookie is not already resourced, fscache will ask the cache backend
in the background to do whatever it needs to look up, create or otherwise
obtain the resources necessary to access data. This is pinned to the
cookie and may not be culled, though it may be withdrawn if the cache as a
whole is withdrawn.
The second function removes the in-use mark from a cookie and, optionally,
updates the coherency data:
void fscache_unuse_cookie(struct fscache_cookie *cookie,
const void *aux_data,
const loff_t *object_size);
If non-NULL, the aux_data buffer and/or the object_size will be saved into
the cookie and will be set on the backing store when the object is
committed.
If this removes the last usage on a cookie, the cookie is placed onto an
LRU list from which it will be removed and closed after a couple of seconds
if it doesn't get reused. This prevents resource overload in the cache -
in particular it prevents it from holding too many files open.
Changes
=======
ver #2:
- Fix fscache_unuse_cookie() to use atomic_dec_and_lock() to avoid a
potential race if the cookie gets reused before it completes the
unusement.
- Added missing transition to LRU_DISCARDING state.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819600612.215744.13678350304176542741.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906907567.143852.16979631199380722019.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967106467.1823006.6790864931048582667.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021511674.640689.10084988363699111860.stgit@warthog.procyon.org.uk/ # v4
Add a number of helper functions to manage access to a cookie, pinning the
cache object in place for the duration to prevent cache withdrawal from
removing it:
(1) void fscache_init_access_gate(struct fscache_cookie *cookie);
This function initialises the access count when a cache binds to a
cookie. An extra ref is taken on the access count to prevent wakeups
while the cache is active. We're only interested in the wakeup when a
cookie is being withdrawn and we're waiting for it to quiesce - at
which point the counter will be decremented before the wait.
The FSCACHE_COOKIE_NACC_ELEVATED flag is set on the cookie to keep
track of the extra ref in order to handle a race between
relinquishment and withdrawal both trying to drop the extra ref.
(2) bool fscache_begin_cookie_access(struct fscache_cookie *cookie,
enum fscache_access_trace why);
This function attempts to begin access upon a cookie, pinning it in
place if it's cached. If successful, it returns true and leaves a the
access count incremented.
(3) void fscache_end_cookie_access(struct fscache_cookie *cookie,
enum fscache_access_trace why);
This function drops the access count obtained by (2), permitting
object withdrawal to take place when it reaches zero.
A tracepoint is provided to track changes to the access counter on a
cookie.
Changes
=======
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819595085.215744.1706073049250505427.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906895313.143852.10141619544149102193.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095980.1823006.1133648159424418877.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021503063.640689.8870918985269528670.stgit@warthog.procyon.org.uk/ # v4
Add a pair of helper functions to manage access to a volume, pinning the
volume in place for the duration to prevent cache withdrawal from removing
it:
bool fscache_begin_volume_access(struct fscache_volume *volume,
enum fscache_access_trace why);
void fscache_end_volume_access(struct fscache_volume *volume,
enum fscache_access_trace why);
The way the access gate on the volume works/will work is:
(1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE),
then we return false to indicate access was not permitted.
(2) If the cache tests as live, then we increment the volume's n_accesses
count and then recheck the cache liveness, ending the access if it
ceased to be live.
(3) When we end the access, we decrement the volume's n_accesses and wake
up the any waiters if it reaches 0.
(4) Whilst the cache is caching, the volume's n_accesses is kept
artificially incremented to prevent wakeups from happening.
(5) When the cache is taken offline, the state is changed to prevent new
accesses, the volume's n_accesses is decremented and we wait for it to
become 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819594158.215744.8285859817391683254.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906894315.143852.5454793807544710479.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095028.1823006.9173132503876627466.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021501546.640689.9631510472149608443.stgit@warthog.procyon.org.uk/ # v4
Add functions to the fscache API to allow data file cookies to be acquired
and relinquished by the network filesystem. It is intended that the
filesystem will create such cookies per-inode under a volume.
To request a cookie, the filesystem should call:
struct fscache_cookie *
fscache_acquire_cookie(struct fscache_volume *volume,
u8 advice,
const void *index_key,
size_t index_key_len,
const void *aux_data,
size_t aux_data_len,
loff_t object_size)
The filesystem must first have created a volume cookie, which is passed in
here. If it passes in NULL then the function will just return a NULL
cookie.
A binary key should be passed in index_key and is of size index_key_len.
This is saved in the cookie and is used to locate the associated data in
the cache.
A coherency data buffer of size aux_data_len will be allocated and
initialised from the buffer pointed to by aux_data. This is used to
validate cache objects when they're opened and is stored on disk with them
when they're committed. The data is stored in the cookie and will be
updateable by various functions in later patches.
The object_size must also be given. This is also used to perform a
coherency check and to size the backing storage appropriately.
This function disallows a cookie from being acquired twice in parallel,
though it will cause the second user to wait if the first is busy
relinquishing its cookie.
When a network filesystem has finished with a cookie, it should call:
void
fscache_relinquish_cookie(struct fscache_volume *volume,
bool retire)
If retire is true, any backing data will be discarded immediately.
Changes
=======
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[1].
- Add a check to see if the cookie is still hashed at the point of
freeing.
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
- Remove the unused cookie pointer field from the fscache_acquire
tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/163819590658.215744.14934902514281054323.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906891983.143852.6219772337558577395.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967088507.1823006.12659006350221417165.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021498432.640689.12743483856927722772.stgit@warthog.procyon.org.uk/ # v4
Add functions to the fscache API to allow volumes to be acquired and
relinquished by the network filesystem. A volume is an index of data
storage cache objects. A volume is represented by a volume cookie in the
API. A filesystem would typically create a volume for a superblock and
then create per-inode cookies within it.
To request a volume, the filesystem calls:
struct fscache_volume *
fscache_acquire_volume(const char *volume_key,
const char *cache_name,
const void *coherency_data,
size_t coherency_len)
The volume_key is a printable string used to match the volume in the cache.
It should not contain any '/' characters. For AFS, for example, this would
be "afs,<cellname>,<volume_id>", e.g. "afs,example.com,523001".
The cache_name can be NULL, but if not it should be a string indicating the
name of the cache to use if there's more than one available.
The coherency data, if given, is an arbitrarily-sized blob that's attached
to the volume and is compared when the volume is looked up. If it doesn't
match, the old volume is judged to be out of date and it and everything
within it is discarded.
Acquiring a volume twice concurrently is disallowed, though the function
will wait if an old volume cookie is being relinquishing.
When a network filesystem has finished with a volume, it should return the
volume cookie by calling:
void
fscache_relinquish_volume(struct fscache_volume *volume,
const void *coherency_data,
bool invalidate)
If invalidate is true, the entire volume will be discarded; if false, the
volume will be synced and the coherency data will be updated.
Changes
=======
ver #4:
- Removed an extraneous param from kdoc on fscache_relinquish_volume()[3].
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[2].
- Make the coherency data an arbitrary blob rather than a u64, but don't
store it for the moment.
ver #2:
- Fix error check[1].
- Make a fscache_acquire_volume() return errors, including EBUSY if a
conflicting volume cookie already exists. No error is printed now -
that's left to the netfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20211203095608.GC2480@kili/ [1]
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [2]
Link: https://lore.kernel.org/r/20211220224646.30e8205c@canb.auug.org.au/ [3]
Link: https://lore.kernel.org/r/163819588944.215744.1629085755564865996.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906890630.143852.13972180614535611154.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967086836.1823006.8191672796841981763.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021495816.640689.4403156093668590217.stgit@warthog.procyon.org.uk/ # v4
Add tracepoints for the HSM state machine and drop DPRINTK calls
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Add tracepoints for bus-master DMA and taskfile related functions.
That allows us to drop the relevant DPRINTK() calls.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Pass the folio instead of a page. The page was already implicitly a
folio as it accessed page->mapping directly. Add the order of the folio
to the tracepoint, as this is important information. Also drop printing
the address of the struct page as the pfn provides better information
than the struct page address.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Convert the existing ata_qc_issue() tracepoint into a template,
and add tracepoints for ata_qc_prep() and ata_qc_issue() based
on that template.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
To follow the flow of control we should be using tracepoints, as
they will tie in with the actual I/O flow and deliver a better
overview about what it happening.
This patch adds tracepoints for hard reset, soft reset, and postreset
and adds them in the libata-eh control flow.
With that we can drop the reset DPRINTK calls in the various drivers.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
The root on the trans->root can be anything, and generally we're
committing from the transaction kthread so it's usually the tree_root.
Change this to just take an fs_info, and to maintain compatibility
simply put the ROOT_TREE_OBJECTID as the root objectid for the
tracepoint. This will allow use to remove trans->root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
problems due to reclaim throttling for excessive lengths of time. In
Alexey's case, a memory hog that should go OOM quickly stalls for
several minutes before stalling. In Mike and Darrick's cases, a small
memcg environment stalled excessively even though the system had enough
memory overall.
Commit 69392a403f ("mm/vmscan: throttle reclaim when no progress is
being made") introduced the problem although commit a19594ca4a
("mm/vmscan: increase the timeout if page reclaim is not making
progress") made it worse. Systems at or near an OOM state that cannot
be recovered must reach OOM quickly and memcg should kill tasks if a
memcg is near OOM.
To address this, only stall for the first zone in the zonelist, reduce
the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if
the scan control nr_reclaimed is 0, kswapd is still active and there
were excessive pages pending for writeback. If kswapd has stopped
reclaiming due to excessive failures, do not stall at all so that OOM
triggers relatively quickly. Similarly, if an LRU is simply congested,
only lightly throttle similar to NOPROGRESS.
Alexey's original case was the most straight forward
for i in {1..3}; do tail /dev/zero; done
On vanilla 5.16-rc1, this test stalled heavily, after the patch the test
completes in a few seconds similar to 5.15.
Alexey's second test case added watching a youtube video while tail runs
10 times. On 5.15, playback only jitters slightly, 5.16-rc1 stalls a
lot with lots of frames missing and numerous audio glitches. With this
patch applies, the video plays similarly to 5.15.
[lkp@intel.com: Fix W=1 build warning]
Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de
Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv
Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lore.kernel.org/r/20211202150614.22440-1-mgorman@techsingularity.net
Link: https://linux-regtracking.leemhuis.info/regzbot/regression/20211124011954.7cab9bb4@mail.inbox.lv/
Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv>
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Tracked-by: Thorsten Leemhuis <regressions@leemhuis.info>
Fixes: 69392a403f ("mm/vmscan: throttle reclaim when no progress is being made")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm about to add more information to the server-side SUNRPC
tracepoints, so I'm going to offset the increased trace log
consumption by getting rid of some tracepoints that fire frequently
but don't offer much value.
trace_svc_xprt_received() was useful for debugging, perhaps, but
is not generally informative.
trace_svc_handle_xprt() reports largely the same information as
trace_svc_xdr_recvfrom().
As a clean-up, rename trace_svc_xprt_do_enqueue() to match
svc_xprt_dequeue().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Having a new step to trace SCMI stack while it waits for synchronous
responses is useful to analyze system performance when changing waiting
mode between polling and interrupt completion.
Link: https://lore.kernel.org/r/20211129191156.29322-5-cristian.marussi@arm.com
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Since the new type of chunk-based files is introduced, there is no
need to leave flatmode tracepoints.
Rename to erofs_map_blocks instead.
Link: https://lore.kernel.org/r/20211209012918.30337-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Add '__rel_loc' using trace event macros. These macros are usually
not used in the kernel, except for testing purpose.
This also add "rel_" variant of macros for dynamic_array string,
and bitmask.
Link: https://lkml.kernel.org/r/163757342119.510314.816029622439099016.stgit@devnote2
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In case of an iommu page fault, the faulting iova is logged
in trace_io_page_fault. It is therefore convenient to log
the iova range in mapping/unmapping trace events so that it
is easier to see if the faulting iova was recently in any of
those ranges.
Signed-off-by: Dafna Hirschfeld <dafna.hirschfeld@collabora.com>
Link: https://lore.kernel.org/r/20211104071620.27290-1-dafna.hirschfeld@collabora.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Pass in the original position and count rather than the position and
count that were updated by the write. Also use the correct types for
all arguments, in particular the file offset which was being truncated
to 32 bits on 32-bit platforms.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Various trace event fields that store cgroup IDs were declared as
ints, but cgroup_id(() returns a u64 and the structures and associated
TP_printk() calls were not updated to reflect this.
Fixes: 743210386c ("cgroup: use cgrp->kn->id as the cgroup ID")
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Just use the disk attached to the request_queue instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Highlights include:
Stable fixes:
- NFSv42: Fix pagecache invalidation after COPY/CLONE
Bugfixes:
- NFSv42: Don't fail clone() just because the server failed to return
post-op attributes
- SUNRPC: use different lockdep keys for INET6 and LOCAL
- NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
- SUNRPC: fix header include guard in trace header
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGiTnMACgkQZwvnipYK
APJoJQ//VZYSCx/mGaTIj5oUwjBKE/n/9rz2EUGS1cfYjZjPpb5Xgm1tn+1A4c01
Ztu9/hKgwrDqknqkmtKvP1GsX5vYUqgfqAlc880Q2nXAqaLJBBZgB6BFMmTtcoQx
C24L0tgxlZVD9Vw0DJEVDVgDxXA/9VmdSQK6uptQRQhcYf4VtR1wAzELHWdkdkfq
1WrREeAwGWw1BaPTrlPn9XwW9qTaMlBH05XRHh6dM7gFmoIe3td7kq7BOnFxFsnA
AQZ/nCgeMTE04kQQMzYqUc4YqZvnzUHxueZ6q8s0K1RKJBpNIQNgdWUMa285Qo4a
JA9oBCPPo0JjmsEge2Km12zyBJoA7lLQDfc6UQJON50ADF0sTu3wszgGuC63KkhE
V+kUogK2WOlnGky2yYrHmv43mcCcyoJ/g+g+38GXNYGorsFi/XUhvctEpFFFPF71
0umQwWhA6Dhc52hMj5DN4nfspp//hEuV9o7/zJzlPi0elC+xtVaBWZCorNvznlXW
C/O5yobJVd89PuIE17Stg+c0Rq3k7RVPPoyS2IaMkZ5cs7DtT5Tz3nKClPAA8Bur
mPLAMkHSOSLO26cy30SVZCIx1JDMJU9dx/PkFemCexkzYXQxgp9px/8wmM2Xe6Oc
/hDqi8V7ayJUuYpYuJ6sA8oUqj3j+NkoP4w5HhlnmZDBhFMEBhM=
=jeHm
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- NFSv42: Fix pagecache invalidation after COPY/CLONE
Bugfixes:
- NFSv42: Don't fail clone() just because the server failed to return
post-op attributes
- SUNRPC: use different lockdep keys for INET6 and LOCAL
- NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
- SUNRPC: fix header include guard in trace header"
* tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: use different lock keys for INET6 and LOCAL
sunrpc: fix header include guard in trace header
NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION
NFSv42: Fix pagecache invalidation after COPY/CLONE
NFS: Add a tracepoint to show the results of nfs_set_cache_invalid()
NFSv42: Don't fail clone() unless the OP_CLONE operation failed
rpcgss.h include protection was protecting against the define for
rpcrdma.h.
Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In this cycle, we've applied relatively small number of patches which fix subtle
corner cases mainly, while introducing a new mount option to be able to fragment
the disk intentionally for performance tests.
Enhancement:
- add a mount option to fragmente on-disk layout to understand the performance
- support direct IO for multi-partitions
- add a fault injection of dquot_initialize
Bug fix:
- address some lockdep complaints
- fix a deadlock issue with quota
- fix a memory tuning condition
- fix compression condition to improve the ratio
- fix disabling compression on the non-empty compressed file
- invalidate cached pages before IPU/DIO writes
And, we've added some minor clean-ups as usual.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmGMILwACgkQQBSofoJI
UNJDRA/+KPyCXdY0OqL26BuGKj+z7hW6bz7tlh6h3wdnPdsR/W3ehbqQEr3GBb+q
yokmD75/in7vZwGsDHGowFWMAfWOHYEqHz5UAq91sHjhfZzLDNUgLFWJedBX2XJb
UoEAa7KzRt9M9K2p/5vSTs07RN3okUiRkFhVBBQJIaL7xi6MpadN/XAqpyoBqsiP
pAV6J3GF6WNF19P/hkN1CJI8rV+PFrvY6C23lMkP7mnsWh03jMSgDDuhLHMQpAba
EJYq7QbSatsLDRdR+jUQwIfMucvvzN7M6ja9+NTGlbeACvND8vXKYXOwngCq9+je
2PIU4J8zNqnEkLsPn8STm4zwZHCA7VFdeCobCZcaVZCZFBzVqCkVYE9wqFVaQmr1
bCrRFvEb+D1pkHYFujVXwCAfPlO6twiAInFNMa3WQ3FduJq2nhc8OLCJJ46D1KT2
ZzzLv2EIIlncxPvgLIhiEE9DgPOyV56PQAO3OTsBZcvycU32aHo4hyexju1ubKiD
CZFEHLnPbxX8Ulh3NX4uUxqPAEVhM/aw4l4e8xhmVRY3uj75geY7M6rt1vD+Y5Et
EwbUE8XbLy+GhqbbO/SX9G38pftOiIquH1J0RuhuVNNmkIDkQvnSNp8WqHTdjEJE
NiHZ5bkRkii34Wfrax9UccqGDswh/gjHAXEfGD8nFfcQZwLP1n8=
=KGQ3
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've applied relatively small number of patches which
fix subtle corner cases mainly, while introducing a new mount option
to be able to fragment the disk intentionally for performance tests.
Enhancements:
- add a mount option to fragmente on-disk layout to understand the
performance
- support direct IO for multi-partitions
- add a fault injection of dquot_initialize
Bug fixes:
- address some lockdep complaints
- fix a deadlock issue with quota
- fix a memory tuning condition
- fix compression condition to improve the ratio
- fix disabling compression on the non-empty compressed file
- invalidate cached pages before IPU/DIO writes
And, we've added some minor clean-ups as usual"
* tag 'f2fs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix UAF in f2fs_available_free_memory
f2fs: invalidate META_MAPPING before IPU/DIO write
f2fs: support fault injection for dquot_initialize()
f2fs: fix incorrect return value in f2fs_sanity_check_ckpt()
f2fs: compress: disallow disabling compress on non-empty compressed file
f2fs: compress: fix overwrite may reduce compress ratio unproperly
f2fs: multidevice: support direct IO
f2fs: introduce fragment allocation mode mount option
f2fs: replace snprintf in show functions with sysfs_emit
f2fs: include non-compressed blocks in compr_written_block
f2fs: fix wrong condition to trigger background checkpoint correctly
f2fs: fix to use WHINT_MODE
f2fs: fix up f2fs_lookup tracepoints
f2fs: set SBI_NEED_FSCK flag when inconsistent node block found
f2fs: introduce excess_dirty_threshold()
f2fs: avoid attaching SB_ACTIVE flag during mount
f2fs: quota: fix potential deadlock
f2fs: should use GFP_NOFS for directory inodes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmGNO7sACgkQ+7dXa6fL
C2sDTA//SLJBMoY719Z/RvZcZb3PUkuwYtqu8j/F6C1n231yg5TrlkchslV635Ph
4lUuy/+pEVtsWot4JBxBsziBsi5WCjucn6opFAO2UOyT0aiE9ucY2MG+fNo6/b5k
RRGqbEODKHScC7pm00AJlqd8gJUvHz6Zy08KetHvkSI4bqBz5VpKxDSNxcWvsbx1
T6FMY+E61vSd0bEOp1/sZ1gRKK5nG9BJrba9V/MuOzj94MqVd9Ajdarr2vfQ7IyS
Qe16rOBM8u6oCRJqRcwdz2Ma0Zf/0Bm6JpoP7LsEdbzXLKPiHPWM6kNd9WZFmMt1
KFGGC3xG3Yxufasdpf5KCa6wlw1U5hWVITqRubHGxg49IdnrwxNK2zLqpJNr/lZC
vsOg31PkAWmJiMCAxhwL+u++Qar27jlXcdiO6tyqIDYHWyzwGWqF5zlzZ46NR26W
SX7oB36drIzH+UMDqxKGti2hRYTPKKwjCJ6p0EfhEYQ0oGa/bNFJA3bPxupJCkJe
PD0pnXdEmhKwvY0fLH6Ghr/PAckQttstTFpaHZ40XFzglgd3Sm5DEcg2Xm38LQ+n
4elMUA+c807ZTDLSDkGTL9QKPmgoM3AFoAsVxV9eqtEYAzYdnYM1GJ0CGOWM1lvs
vDrB7rqn/CFB6ks8k1BOalq2DTmPVU1I1f1GwnkyOiVtlxyrslk=
=K23c
-----END PGP SIGNATURE-----
Merge tag 'netfs-folio-20211111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfs, 9p, afs and ceph (partial) foliation from David Howells:
"This converts netfslib, 9p and afs to use folios. It also partially
converts ceph so that it uses folios on the boundaries with netfslib.
To help with this, a couple of folio helper functions are added in the
first two patches.
These patches don't touch fscache and cachefiles as I intend to remove
all the code that deals with pages directly from there. Only nfs and
cifs are using the old fscache I/O API now. The new API uses iov_iter
instead.
Thanks to Jeff Layton, Dominique Martinet and AuriStor for testing and
retesting the patches"
* tag 'netfs-folio-20211111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Use folios in directory handling
netfs, 9p, afs, ceph: Use folios
folio: Add a function to get the host inode for a folio
folio: Add a function to change the private data attached to a folio
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGMPYkVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+JVwQAKbrpgbzl91u+T6W9MUGgQVzDpeP
XIy3NxCu/4pZ8SToWF3trz71sskokmkPPaZyuISD2C8e4DxO5LQ3fJLhtS9CjRFB
x4iZUxH7V2BoWrb5SY6TDWBEqaq4MY9f7tIbvUu5xpa0FIupLqJjYh2CP8vqtsbm
lblQKXz4ao0jwDzSVimNnPcTccpB25VIzwHsSOszRhN4rTjMgyHoETx2cqJne5IU
Tx/hH0UlpnwuQ7aVpcjMoKqIyUWDTMejx51pyZhHB47DVKL7HsnZvg59mTpXFcBx
29edvWT9yy1+w3nGkTYSkOgO9DyHvCbmQzIsvoYlmbZ2sdmTKK8Wuv2Ehcw3OfvL
MXGmy2EXIhzvTZXyN6pL1bBwwNSxdqJhVSxvrPLz1EymIkxf/IDI8eyUicVXd3Vq
K2xOn+CXyIbXWCU85ru8UA77r1+x//gSwqcJvtKUavbNJUwNt935CE2n3+o/0OL/
pToZ89nhcaRyDP1jJKA37K48VLNtBXzZZQlRovyLelNojam/kzZkXX8dI6oV9VD1
Ymjm0mbdZzwhE3C1HxKlxwZqhN+7YoyxMQuWjFMp28wxH+dkz/USCulKZ3/H+neD
0YBSgvwe92JqkZTW2AOjipL+beAuKJ4zsfCCl2XZig/rHGutiwOf2GfgdRmJM6AD
6aiufVWKNNRQef9y
=yKBl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck"
* tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
nfsd4: remove obselete comment
nfsd: document server-to-server-copy parameters
NFSD:fix boolreturn.cocci warning
nfsd: update create verifier comment
SUNRPC: Change return value type of .pc_encode
SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
NFSD: Save location of NFSv4 COMPOUND status
SUNRPC: Change return value type of .pc_decode
SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
SUNRPC: De-duplicate .pc_release() call sites
SUNRPC: Simplify the SVC dispatch code path
SUNRPC: Capture value of xdr_buf::page_base
SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
svcrdma: Split svcrmda_wc_{read,write} tracepoints
svcrdma: Split the svcrdma_wc_send() tracepoint
svcrdma: Split the svcrdma_wc_receive() tracepoint
NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
NFSD: Initialize pointer ni with NULL and not plain integer 0
NFSD: simplify struct nfsfh
...
Highlights include:
Features:
- NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
- Optimisations for READDIR and the 'ls -l' style workload
- Further replacements of dprintk() with tracepoints and other tracing
improvements
- Ensure we re-probe NFSv4 server capabilities when the user does a
"mount -o remount"
Bugfixes:
- Fix an Oops in pnfs_mark_request_commit()
- Fix up deadlocks in the commit code
- Fix regressions in NFSv2/v3 attribute revalidation due to the
change_attr_type optimisations
- Fix some dentry verifier races
- Fix some missing dentry verifier settings
- Fix a performance regression in nfs_set_open_stateid_locked()
- SUNRPC was sending multiple SYN calls when re-establishing a TCP
connection.
- Fix multiple NFSv4 issues due to missing sanity checking of server
return values
- Fix a potential Oops when FREE_STATEID races with an unmount
Cleanups:
- Clean up the labelled NFS code
- Remove unused header <linux/pnfs_osd_xdr.h>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGL5c4ACgkQZwvnipYK
APLFyQ//endoc1HYNpTNpcvlWiAgombBQumjBLrk73Qr+M2Vq9uK6+WmaqYTCHhU
SfX6kbptiyGrd+f/pdIXCjIfPCnCRPRZYpRx8BxHwNr5vqOQIr9rvT/1Mvg2G9Oi
IkdwVDmrN3ZjK/dbvyYSxhsLwuwrnaNm0oHkHxDO/EFghqEsesU1Aj1yywbFIZZA
onRXVXh8r1T9pqL25HyHzZjD1kxvEiKuAMFis2NCKHexSmsvGF4Xs71J3AiCKuc2
XXLged3ng7WRhNCvvrZmfA0AVkZ+iklpVJQzBeXzxuYB81pRZr99yXuv3FKE5aEl
UIPv73b2uTq2SlXtZe2ggsVOdB0JDIRx+9jIH0iV3tOOjapfaTGdTwDx8JR1qHza
wVxB24evk3rW6EFrZNPogaf3JiZmwlVCSUlSZZ3T5c+5l36yZV+WuoSTOe4ajttm
y/uUkA1p2iFpYb9qNoO6kQ1ue3YO34TCqYPrUipzXWvTG1ZjJ5yGV5LZR0VvB4QT
bYpInua7SC/t9RwJ1/HWBrk1G9/xufC4WI7xJf6dJzSDSEo8n6x24nxY0OwUIClb
YzoVWv+bwTHgqkVlTO52XH3VX9E3XBgt5GLtxstQT3hXIndIEoitBqPms0buP/Af
RveTtV1pNCqhmGrmZJGInH3veIELn3l/pTywqITuhIBNCG3Rj5g=
=n8lj
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
- Optimisations for READDIR and the 'ls -l' style workload
- Further replacements of dprintk() with tracepoints and other
tracing improvements
- Ensure we re-probe NFSv4 server capabilities when the user does a
"mount -o remount"
Bugfixes:
- Fix an Oops in pnfs_mark_request_commit()
- Fix up deadlocks in the commit code
- Fix regressions in NFSv2/v3 attribute revalidation due to the
change_attr_type optimisations
- Fix some dentry verifier races
- Fix some missing dentry verifier settings
- Fix a performance regression in nfs_set_open_stateid_locked()
- SUNRPC was sending multiple SYN calls when re-establishing a TCP
connection.
- Fix multiple NFSv4 issues due to missing sanity checking of server
return values
- Fix a potential Oops when FREE_STATEID races with an unmount
Cleanups:
- Clean up the labelled NFS code
- Remove unused header <linux/pnfs_osd_xdr.h>"
* tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
NFS: Remove the nfs4_label argument from decode_getattr_*() functions
NFS: Remove the nfs4_label argument from nfs_setsecurity
NFS: Remove the nfs4_label argument from nfs_fhget()
NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
NFS: Remove the nfs4_label argument from nfs_instantiate()
NFS: Remove the nfs4_label from the nfs_setattrres
NFS: Remove the nfs4_label from the nfs4_getattr_res
NFS: Remove the f_label from the nfs4_opendata and nfs_openres
NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
NFS: Remove the label from the nfs4_lookup_res struct
NFS: Remove the nfs4_label from the nfs4_link_res struct
NFS: Remove the nfs4_label from the nfs4_create_res struct
NFS: Remove the nfs4_label from the nfs_entry struct
NFS: Create a new nfs_alloc_fattr_with_label() function
NFS: Always initialise fattr->label in nfs_fattr_alloc()
NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
NFSv4: Remove unnecessary 'minor version' check
NFSv4: Fix potential Oops in decode_op_map()
...
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense, it might be due to writeback or a host of other
factors.
For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
in the page allocator if it is failing to make progress. Kswapd
throttles if too many pages are under writeback and marked for immediate
reclaim.
This patch explicitly throttles if reclaim is failing to make progress.
[vbabka@suse.cz: Remove redundant code]
Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page reclaim throttles on congestion if too many parallel reclaim
instances have isolated too many pages. This makes no sense, excessive
parallelisation has nothing to do with writeback or congestion.
This patch creates an additional workqueue to sleep on when too many
pages are isolated. The throttled tasks are woken when the number of
isolated pages is reduced or a timeout occurs. There may be some false
positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
throttle again if necessary.
[shy828301@gmail.com: Wake up from compaction context]
[vbabka@suse.cz: Account number of throttled tasks only for writeback]
Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Remove dependency on congestion_wait in mm/", v5.
This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but
congestion_wait has been broken for a long time [1].
Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).
This series replaces the "congestion" throttling with 3 different types.
- If there are too many dirty/writeback pages, sleep until a timeout or
enough pages get cleaned
- If too many pages are isolated, sleep until enough isolated pages are
either reclaimed or put back on the LRU
- If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.
This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.
stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.
- One "anon latency" worker creates small mappings with mmap() and
times how long it takes to fault the mapping reading it 4K at a time
- X file writers which is fio randomly writing X files where the total
size of the files add up to the allowed dirty_ratio. fio is allowed
to run for a warmup period to allow some file-backed pages to
accumulate. The duration of the warmup is based on the best-case
linear write speed of the storage.
- Y file readers which is fio randomly reading small files
- Z anon memory hogs which continually map (100-dirty_ratio)% of memory
- Total estimated WSS = (100+dirty_ration) percentage of memory
X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.
The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.
The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.
The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.
Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.
First the results of the "anon latency" latency
stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r4
Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%)
Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%)
Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%)
Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%)
Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%)
Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%)
Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%)
Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%)
Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%)
Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%)
Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%)
Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%)
Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%)
Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%)
Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%)
Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%)
Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%)
Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%)
Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%)
Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%)
Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%)
Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%)
Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%)
Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%)
For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.
It is expected that this will be very machine dependant. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory
Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%*
Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%*
Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%)
Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%)
Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%)
The overall system CPU usage and elapsed time is as follows
5.15.0-rc3 5.15.0-rc3
vanilla mm-reclaimcongest-v5r4
Duration User 6989.03 983.42
Duration System 7308.12 799.68
Duration Elapsed 2277.67 2092.98
The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.
The high-level /proc/vmstats show
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned 1056608451.00 503594991.00
Ops Kswapd pages scanned 109795048.00 147289810.00
Ops Kswapd pages reclaimed 63269243.00 31036005.00
Ops Direct pages reclaimed 10803973.00 6328887.00
Ops Kswapd efficiency % 57.62 21.07
Ops Kswapd velocity 48204.98 57572.86
Ops Direct efficiency % 1.02 1.26
Ops Direct velocity 463898.83 196845.97
Kswapd scanned less pages but the detailed pattern is different. The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity. Direct reclaim scanning is reduced by
52% due to stalling.
The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.
Ops Percentage direct scans 90.59 77.37
For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling
Ops Page writes by reclaim 2613590.00 1687131.00
Page writes from reclaim context are reduced.
Ops Page writes anon 2932752.00 1917048.00
And there is less swapping.
Ops Page reclaim immediate 996248528.00 107664764.00
The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.
Ops Slabs scanned 164284.00 153608.00
Slab scan activity is similar.
ftrace was used to gather stall activity
Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
The fast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).
1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.
Bottom line: Vanilla will throttle but it's not effective.
Patch series
------------
Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU
1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry. For direct
reclaim, the number of times stalled for each reason were
6624 reason=VMSCAN_THROTTLE_ISOLATED
93246 reason=VMSCAN_THROTTLE_NOPROGRESS
96934 reason=VMSCAN_THROTTLE_WRITEBACK
The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward. A relatively small number were due to too many pages isolated
from the LRU by parallel threads
For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
Most did not stall at all. A small number reached the timeout.
For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map
1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.
While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.
For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.
Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.
This patch (of 5):
Page reclaim throttles on wait_iff_congested under the following
conditions:
- kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.
- Direct reclaim will stall if all dirty pages are backed by congested
inodes.
wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleeps until the
timeout expires.
[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]
Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By using DECLARE_EVENT_CLASS and TRACE_EVENT_FN, we can save a lot of
space from duplicate code.
Link: https://lkml.kernel.org/r/20211009071243.70286-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ftrace core will add newline automatically on printing, so using it in
TP_printkcreates a blank line.
Link: https://lkml.kernel.org/r/20211009071105.69544-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds tracepoints for dlm socket receive and send
functionality. We can use it to track how much data was send or received
to or from a specific nodeid.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch adds initial support for dlm tracepoints. It will introduce
tracepoints to dlm main functionality dlm_lock()/dlm_unlock() and their
complete ast() callback or blocking bast() callback.
The lock/unlock functionality has a start and end tracepoint, this is
because there exists a race in case if would have a tracepoint at the
end position only the complete/blocking callbacks could occur before. To
work with eBPF tracing and using their lookup hash functionality there
could be problems that an entry was not inserted yet. However use the
start functionality for hash insert and check again in end functionality
if there was an dlm internal error so there is no ast callback. In further
it might also that locks with local masters will occur those callbacks
immediately so we must have such functionality.
I did not make everything accessible yet, although it seems eBPF can be
used to access a lot of internal datastructures if it's aware of the
struct definitions of the running kernel instance. We still can change
it, if you do eBPF experiments e.g. time measurements between lock and
callback functionality you can simple use the local lkb_id field as hash
value in combination with the lockspace id if you have multiple
lockspaces. Otherwise you can simple use trace-cmd for some functionality,
e.g. `trace-cmd record -e dlm` and `trace-cmd report` afterwards.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Refactor: surface useful show_ macros so they can be shared between
the client and server trace code.
Additional clean up:
- Housekeeping: ensure the correct #include files are pulled in
and add proper TRACE_DEFINE_ENUM where they are missing
- Use a consistent naming scheme for the helpers
- Store values to be displayed symbolically as unsigned long, as
that is the type that the __print_yada() functions take
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Refactor: Surface useful show_ macros for use by other trace
subsystems.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
- Remove socket skb caches
- Add a SO_RESERVE_MEM socket op to forward allocate buffer space
and avoid memory accounting overhead on each message sent
- Introduce managed neighbor entries - added by control plane and
resolved by the kernel for use in acceleration paths (BPF / XDP
right now, HW offload users will benefit as well)
- Make neighbor eviction on link down controllable by userspace
to work around WiFi networks with bad roaming implementations
- vrf: Rework interaction with netfilter/conntrack
- fq_codel: implement L4S style ce_threshold_ect1 marking
- sch: Eliminate unnecessary RCU waits in mini_qdisc_pair_swap()
BPF:
- Add support for new btf kind BTF_KIND_TAG, arbitrary type tagging
as implemented in LLVM14
- Introduce bpf_get_branch_snapshot() to capture Last Branch Records
- Implement variadic trace_printk helper
- Add a new Bloomfilter map type
- Track <8-byte scalar spill and refill
- Access hw timestamp through BPF's __sk_buff
- Disallow unprivileged BPF by default
- Document BPF licensing
Netfilter:
- Introduce egress hook for looking at raw outgoing packets
- Allow matching on and modifying inner headers / payload data
- Add NFT_META_IFTYPE to match on the interface type either from
ingress or egress
Protocols:
- Multi-Path TCP:
- increase default max additional subflows to 2
- rework forward memory allocation
- add getsockopts: MPTCP_INFO, MPTCP_TCPINFO, MPTCP_SUBFLOW_ADDRS
- MCTP flow support allowing lower layer drivers to configure msg
muxing as needed
- Automatic Multicast Tunneling (AMT) driver based on RFC7450
- HSR support the redbox supervision frames (IEC-62439-3:2018)
- Support for the ip6ip6 encapsulation of IOAM
- Netlink interface for CAN-FD's Transmitter Delay Compensation
- Support SMC-Rv2 eliminating the current same-subnet restriction,
by exploiting the UDP encapsulation feature of RoCE adapters
- TLS: add SM4 GCM/CCM crypto support
- Bluetooth: initial support for link quality and audio/codec
offload
Driver APIs:
- Add a batched interface for RX buffer allocation in AF_XDP
buffer pool
- ethtool: Add ability to control transceiver modules' power mode
- phy: Introduce supported interfaces bitmap to express MAC
capabilities and simplify PHY code
- Drop rtnl_lock from DSA .port_fdb_{add,del} callbacks
New drivers:
- WiFi driver for Realtek 8852AE 802.11ax devices (rtw89)
- Ethernet driver for ASIX AX88796C SPI device (x88796c)
Drivers:
- Broadcom PHYs
- support 72165, 7712 16nm PHYs
- support IDDQ-SR for additional power savings
- PHY support for QCA8081, QCA9561 PHYs
- NXP DPAA2: support for IRQ coalescing
- NXP Ethernet (enetc): support for software TCP segmentation
- Renesas Ethernet (ravb) - support DMAC and EMAC blocks of
Gigabit-capable IP found on RZ/G2L SoC
- Intel 100G Ethernet
- support for eswitch offload of TC/OvS flow API, including
offload of GRE, VxLAN, Geneve tunneling
- support application device queues - ability to assign Rx and Tx
queues to application threads
- PTP and PPS (pulse-per-second) extensions
- Broadcom Ethernet (bnxt)
- devlink health reporting and device reload extensions
- Mellanox Ethernet (mlx5)
- offload macvlan interfaces
- support HW offload of TC rules involving OVS internal ports
- support HW-GRO and header/data split
- support application device queues
- Marvell OcteonTx2:
- add XDP support for PF
- add PTP support for VF
- Qualcomm Ethernet switch (qca8k): support for QCA8328
- Realtek Ethernet DSA switch (rtl8366rb)
- support bridge offload
- support STP, fast aging, disabling address learning
- support for Realtek RTL8365MB-VC, a 4+1 port 10M/100M/1GE switch
- Mellanox Ethernet/IB switch (mlxsw)
- multi-level qdisc hierarchy offload (e.g. RED, prio and shaping)
- offload root TBF qdisc as port shaper
- support multiple routing interface MAC address prefixes
- support for IP-in-IP with IPv6 underlay
- MediaTek WiFi (mt76)
- mt7921 - ASPM, 6GHz, SDIO and testmode support
- mt7915 - LED and TWT support
- Qualcomm WiFi (ath11k)
- include channel rx and tx time in survey dump statistics
- support for 80P80 and 160 MHz bandwidths
- support channel 2 in 6 GHz band
- spectral scan support for QCN9074
- support for rx decapsulation offload (data frames in 802.3
format)
- Qualcomm phone SoC WiFi (wcn36xx)
- enable Idle Mode Power Save (IMPS) to reduce power consumption
during idle
- Bluetooth driver support for MediaTek MT7922 and MT7921
- Enable support for AOSP Bluetooth extension in Qualcomm WCN399x
and Realtek 8822C/8852A
- Microsoft vNIC driver (mana)
- support hibernation and kexec
- Google vNIC driver (gve)
- support for jumbo frames
- implement Rx page reuse
Refactor:
- Make all writes to netdev->dev_addr go thru helpers, so that we
can add this address to the address rbtree and handle the updates
- Various TCP cleanups and optimizations including improvements
to CPU cache use
- Simplify the gnet_stats, Qdisc stats' handling and remove
qdisc->running sequence counter
- Driver changes and API updates to address devlink locking
deficiencies
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGAzX4ACgkQMUZtbf5S
IrvW3g//Q0ZLrOuHK9pZ8sCXMMhDj8qL6ajm0otMddHWA/+1UglwVBKFhsajfxOf
wJ/5LZis+XKLpLqKTU5chKVfn39HuDGe/D3l+egi01Gv5BW0+XzEhagfyR5tJX5z
wsGG5CXO/we/laVSzRiFtwwVEKHKN20YC+tIQwYOYP5Wy3q4G7qDsFhT7GqgsGCS
n74QUEAIB5Tz0ODWFqLtbsySzIurXrskibwt5T9bvAAlPw/lCU68mmG+NVJ7VddO
lBbNkLMOo8yW9Ci20H09SrYd4jZTmMARo9tsFO1tAvAMk7qpn0Wd8pnOYTjFFoMD
+qjiFSVMh7E0JGb8Y7NCvwaB99suAK5rfGP68Xwe62DfP7vYWEx4pZGxBP19F4ld
6Kn1ME33BX9rUF9tBecf0bdKfJUwB2Q2Xou/b9laG04bwiqsc9iG5FQq1C46lnLZ
QdzNiS1My4dJMczkWt66HF3Kx30ibwHfvKMIHjf4PqkzEatkv6Y6SBZ57KXL+Lde
0BQSFhbf0tm2Gf55etzrczLElI3uqHSFWUNZZ2Bt6WmzO1e6tpV9nAtRWF4C/dFg
QDpLJtOOOY65uq+qz09zoPfv2lem868SrCAuFrVn99bEpYjx/CGNFDeEI02l6jyr
84eUxd364UcbIk3fc+eTGdXHLQNVk30G0AHVBBxaWNIidwfqXeE=
=srde
-----END PGP SIGNATURE-----
Merge tag 'net-next-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Remove socket skb caches
- Add a SO_RESERVE_MEM socket op to forward allocate buffer space and
avoid memory accounting overhead on each message sent
- Introduce managed neighbor entries - added by control plane and
resolved by the kernel for use in acceleration paths (BPF / XDP
right now, HW offload users will benefit as well)
- Make neighbor eviction on link down controllable by userspace to
work around WiFi networks with bad roaming implementations
- vrf: Rework interaction with netfilter/conntrack
- fq_codel: implement L4S style ce_threshold_ect1 marking
- sch: Eliminate unnecessary RCU waits in mini_qdisc_pair_swap()
BPF:
- Add support for new btf kind BTF_KIND_TAG, arbitrary type tagging
as implemented in LLVM14
- Introduce bpf_get_branch_snapshot() to capture Last Branch Records
- Implement variadic trace_printk helper
- Add a new Bloomfilter map type
- Track <8-byte scalar spill and refill
- Access hw timestamp through BPF's __sk_buff
- Disallow unprivileged BPF by default
- Document BPF licensing
Netfilter:
- Introduce egress hook for looking at raw outgoing packets
- Allow matching on and modifying inner headers / payload data
- Add NFT_META_IFTYPE to match on the interface type either from
ingress or egress
Protocols:
- Multi-Path TCP:
- increase default max additional subflows to 2
- rework forward memory allocation
- add getsockopts: MPTCP_INFO, MPTCP_TCPINFO, MPTCP_SUBFLOW_ADDRS
- MCTP flow support allowing lower layer drivers to configure msg
muxing as needed
- Automatic Multicast Tunneling (AMT) driver based on RFC7450
- HSR support the redbox supervision frames (IEC-62439-3:2018)
- Support for the ip6ip6 encapsulation of IOAM
- Netlink interface for CAN-FD's Transmitter Delay Compensation
- Support SMC-Rv2 eliminating the current same-subnet restriction, by
exploiting the UDP encapsulation feature of RoCE adapters
- TLS: add SM4 GCM/CCM crypto support
- Bluetooth: initial support for link quality and audio/codec offload
Driver APIs:
- Add a batched interface for RX buffer allocation in AF_XDP buffer
pool
- ethtool: Add ability to control transceiver modules' power mode
- phy: Introduce supported interfaces bitmap to express MAC
capabilities and simplify PHY code
- Drop rtnl_lock from DSA .port_fdb_{add,del} callbacks
New drivers:
- WiFi driver for Realtek 8852AE 802.11ax devices (rtw89)
- Ethernet driver for ASIX AX88796C SPI device (x88796c)
Drivers:
- Broadcom PHYs
- support 72165, 7712 16nm PHYs
- support IDDQ-SR for additional power savings
- PHY support for QCA8081, QCA9561 PHYs
- NXP DPAA2: support for IRQ coalescing
- NXP Ethernet (enetc): support for software TCP segmentation
- Renesas Ethernet (ravb) - support DMAC and EMAC blocks of
Gigabit-capable IP found on RZ/G2L SoC
- Intel 100G Ethernet
- support for eswitch offload of TC/OvS flow API, including
offload of GRE, VxLAN, Geneve tunneling
- support application device queues - ability to assign Rx and Tx
queues to application threads
- PTP and PPS (pulse-per-second) extensions
- Broadcom Ethernet (bnxt)
- devlink health reporting and device reload extensions
- Mellanox Ethernet (mlx5)
- offload macvlan interfaces
- support HW offload of TC rules involving OVS internal ports
- support HW-GRO and header/data split
- support application device queues
- Marvell OcteonTx2:
- add XDP support for PF
- add PTP support for VF
- Qualcomm Ethernet switch (qca8k): support for QCA8328
- Realtek Ethernet DSA switch (rtl8366rb)
- support bridge offload
- support STP, fast aging, disabling address learning
- support for Realtek RTL8365MB-VC, a 4+1 port 10M/100M/1GE switch
- Mellanox Ethernet/IB switch (mlxsw)
- multi-level qdisc hierarchy offload (e.g. RED, prio and shaping)
- offload root TBF qdisc as port shaper
- support multiple routing interface MAC address prefixes
- support for IP-in-IP with IPv6 underlay
- MediaTek WiFi (mt76)
- mt7921 - ASPM, 6GHz, SDIO and testmode support
- mt7915 - LED and TWT support
- Qualcomm WiFi (ath11k)
- include channel rx and tx time in survey dump statistics
- support for 80P80 and 160 MHz bandwidths
- support channel 2 in 6 GHz band
- spectral scan support for QCN9074
- support for rx decapsulation offload (data frames in 802.3
format)
- Qualcomm phone SoC WiFi (wcn36xx)
- enable Idle Mode Power Save (IMPS) to reduce power consumption
during idle
- Bluetooth driver support for MediaTek MT7922 and MT7921
- Enable support for AOSP Bluetooth extension in Qualcomm WCN399x and
Realtek 8822C/8852A
- Microsoft vNIC driver (mana)
- support hibernation and kexec
- Google vNIC driver (gve)
- support for jumbo frames
- implement Rx page reuse
Refactor:
- Make all writes to netdev->dev_addr go thru helpers, so that we can
add this address to the address rbtree and handle the updates
- Various TCP cleanups and optimizations including improvements to
CPU cache use
- Simplify the gnet_stats, Qdisc stats' handling and remove
qdisc->running sequence counter
- Driver changes and API updates to address devlink locking
deficiencies"
* tag 'net-next-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2122 commits)
Revert "net: avoid double accounting for pure zerocopy skbs"
selftests: net: add arp_ndisc_evict_nocarrier
net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter
net: arp: introduce arp_evict_nocarrier sysctl parameter
libbpf: Deprecate AF_XDP support
kbuild: Unify options for BTF generation for vmlinux and modules
selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c
net: avoid double accounting for pure zerocopy skbs
tcp: rename sk_wmem_free_skb
netdevsim: fix uninit value in nsim_drv_configure_vfs()
selftests/bpf: Fix also no-alu32 strobemeta selftest
bpf: Add missing map_delete_elem method to bloom filter map
selftests/bpf: Add bloom map success test for userspace calls
bpf: Add alignment padding for "map_extra" + consolidate holes
bpf: Bloom filter map naming fixups
selftests/bpf: Add test cases for struct_ops prog
bpf: Add dummy BPF STRUCT_OPS for test purpose
...
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-11-01
We've added 181 non-merge commits during the last 28 day(s) which contain
a total of 280 files changed, 11791 insertions(+), 5879 deletions(-).
The main changes are:
1) Fix bpf verifier propagation of 64-bit bounds, from Alexei.
2) Parallelize bpf test_progs, from Yucong and Andrii.
3) Deprecate various libbpf apis including af_xdp, from Andrii, Hengqi, Magnus.
4) Improve bpf selftests on s390, from Ilya.
5) bloomfilter bpf map type, from Joanne.
6) Big improvements to JIT tests especially on Mips, from Johan.
7) Support kernel module function calls from bpf, from Kumar.
8) Support typeless and weak ksym in light skeleton, from Kumar.
9) Disallow unprivileged bpf by default, from Pawan.
10) BTF_KIND_DECL_TAG support, from Yonghong.
11) Various bpftool cleanups, from Quentin.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (181 commits)
libbpf: Deprecate AF_XDP support
kbuild: Unify options for BTF generation for vmlinux and modules
selftests/bpf: Add a testcase for 64-bit bounds propagation issue.
bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
selftests/bpf: Fix also no-alu32 strobemeta selftest
bpf: Add missing map_delete_elem method to bloom filter map
selftests/bpf: Add bloom map success test for userspace calls
bpf: Add alignment padding for "map_extra" + consolidate holes
bpf: Bloom filter map naming fixups
selftests/bpf: Add test cases for struct_ops prog
bpf: Add dummy BPF STRUCT_OPS for test purpose
bpf: Factor out helpers for ctx access checking
bpf: Factor out a helper to prepare trampoline for struct_ops prog
selftests, bpf: Fix broken riscv build
riscv, libbpf: Add RISC-V (RV64) support to bpf_tracing.h
tools, build: Add RISC-V to HOSTARCH parsing
riscv, bpf: Increase the maximum number of iterations
selftests, bpf: Add one test for sockmap with strparser
selftests, bpf: Fix test_txmsg_ingress_parser error
...
====================
Link: https://lore.kernel.org/r/20211102013123.9005-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- support multiple devices for multi-layer container images;
- support the secondary compression head;
- support readmore decompression strategy;
- support new LZMA algorithm (specifically called MicroLZMA);
- some bugfixes & cleanups.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYX8j7hEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBE+SAQChAmAUav03OQujm8PvVNX7VUGusGNvww8E
qu5+zasC8wEArypW2Z75ZZ3IZNPCk6QWFlaC2I5Xnz7NNl0OGPKOCAg=
=DZQ4
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"There are some new features available for this cycle. Firstly, EROFS
LZMA algorithm support, specifically called MicroLZMA, is available as
an option for embedded devices, LiveCDs and/or as the secondary
auxiliary compression algorithm besides the primary algorithm in one
file.
In order to better support the LZMA fixed-sized output compression,
especially for 4KiB pcluster size (which has lowest memory pressure
thus useful for memory-sensitive scenarios), Lasse introduced a new
LZMA header/container format called MicroLZMA to minimize the original
LZMA1 header (for example, we don't need to waste 4-byte dictionary
size and another 8-byte uncompressed size, which can be calculated by
fs directly, for each pcluster) and enable EROFS fixed-sized output
compression.
Note that MicroLZMA can also be later used by other things in addition
to EROFS too where wasting minimal amount of space for headers is
important and it can be only compiled by enabling XZ_DEC_MICROLZMA.
MicroLZMA has been supported by the latest upstream XZ embedded [1] &
XZ utils [2], apply the latest related XZ embedded upstream patches by
the XZ author Lasse here.
Secondly, multiple device is also supported in this cycle, which is
designed for multi-layer container images. By working together with
inter-layer data deduplication and compression, we can achieve the
next high-performance container image solution. Our team will announce
the new Nydus container image service [3] implementation with new RAFS
v6 (EROFS-compatible) format in Open Source Summit 2021 China [4]
soon.
Besides, the secondary compression head support and readmore
decompression strategy are also included in this cycle. There are also
some minor bugfixes and cleanups, as always.
Summary:
- support multiple devices for multi-layer container images;
- support the secondary compression head;
- support readmore decompression strategy;
- support new LZMA algorithm (specifically called MicroLZMA);
- some bugfixes & cleanups"
* tag 'erofs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: don't trigger WARN() when decompression fails
erofs: get rid of ->lru usage
erofs: lzma compression support
erofs: rename some generic methods in decompressor
lib/xz, lib/decompress_unxz.c: Fix spelling in comments
lib/xz: Add MicroLZMA decoder
lib/xz: Move s->lzma.len = 0 initialization to lzma_reset()
lib/xz: Validate the value before assigning it to an enum variable
lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression
erofs: introduce readmore decompression strategy
erofs: introduce the secondary compression head
erofs: get compression algorithms directly on mapping
erofs: add multiple device support
erofs: decouple basic mount options from fs_context
erofs: remove the fast path of per-CPU buffer decompression
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KHcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphvVEADHMsZP3fOGyJNqnIibIrDL5ZdUGtr5iH3c
0UIi9It0jo9xOyPX/aY2n1pInXK4vvND9ULC+XGYttSJZXWuYEbMGYQ34du2EP0r
dypN4JPwO6X+mFkJND6x8IeDCzj/fy6LCFbWbRlDNsndTZ/gavVTOybMpOLdCJx9
IyXE1iHismaIaD7I3Q77zvN0ei87cEwBfg9R0vRAXKBKUh5raSiLWsOYOiXQkZH4
8iUeDmOLlaWghgXwweODxARXuWq+gWZgiBMd0tp0QCECXMv+NIpfJYauvLHJDa/u
QScr9uRMrJS3KgRgt61o+Z2fcpzJF/bL0e0s5Ul9CgflRWucARbgodUMl4rZCi9D
WOwxPxv8Oab8IT7Qc/ZHdY3ULJsULRgbtmc/9OqPL5Y/Ww9/9E63Is8O4q/QFc7T
xJ1p5yZKw3G+G7oG0YBYE0U+x3RUzi4b/Ob+ECeLcAAAcp+XFg6epK6Aj8HDWd8K
kGYlEBKEq1hILM44K59YTwAT/Cp+fkwe+x7pNQ3JjqtPpVpqGT7RoMUuCduofT1J
ROtB+S8/AwhdABL6KKUYSVF8zlfoXbQpQs3SUKjaBtPVjwXLZwXERy7ttD/4STtT
QjC+5/qAWnMR8CYADE0E3rlicUkHJm1+AHukYLz0REphDcNO8GuB9PCDzX4SX/ol
SGJ6hoprYQ==
=5U4u
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Light on new features - basically just the hybrid mode support.
Outside of that it's just fixes, cleanups, and performance
improvements.
In detail:
- Add ring related information to the fdinfo output (Hao)
- Hybrid async mode (Hao)
- Support for batched issue on block (me)
- sqe error trace improvement (me)
- IOPOLL efficiency improvements (Pavel)
- submit state cleanups and improvements (Pavel)
- Completion side improvements (Pavel)
- Drain improvements (Pavel)
- Buffer selection cleanups (Pavel)
- Fixed file node improvements (Pavel)
- io-wq setup cancelation fix (Pavel)
- Various other performance improvements and cleanups (Pavel)
- Misc fixes (Arnd, Bixuan, Changcheng, Hao, me, Noah)"
* tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block: (97 commits)
io-wq: remove worker to owner tw dependency
io_uring: harder fdinfo sq/cq ring iterating
io_uring: don't assign write hint in the read path
io_uring: clusterise ki_flags access in rw_prep
io_uring: kill unused param from io_file_supports_nowait
io_uring: clean up timeout async_data allocation
io_uring: don't try io-wq polling if not supported
io_uring: check if opcode needs poll first on arming
io_uring: clean iowq submit work cancellation
io_uring: clean io_wq_submit_work()'s main loop
io-wq: use helper for worker refcounting
io_uring: implement async hybrid mode for pollable requests
io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR())
io_uring: split logic of force_nonblock
io_uring: warning about unused-but-set parameter
io_uring: inform block layer of how many requests we are submitting
io_uring: simplify io_file_supports_nowait()
io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags
io_uring: arm poll for non-nowait files
fs/io_uring: Prioritise checking faster conditions first in io_write
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KDgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmQ2D/wO0nH3U+3+OZChi3XUwYck9Dev3o6BANCF
ClATiK/kivZY0xY1r8J4ixirZo2gcjIMpWSC3JGYZ5LdspfmYGLUbMjfZsaeU23i
lAKaX1IqfArmHN76k3IU1bKCg7B0/LFwC0q9QTFWTSwNSs8RK/EZLJ61U1hEXUb3
OfIpaMmvPiMaU7yuPqhcZK14m1cg1srrLM4rFB/PqsWWStF07pHq32WeArGDAU0e
Fe0YSnYD7qqA5Qc37KwqjCTmmxKX5YZf7etIcA6p3DNmwcuQrVNzKoCH/ZEDijaD
E2bS/BWbN1x96+rtoEZfBYEaNIrkmJzmW6+fJ53OITbJF3KqP6V66erhqNcFYCzC
mhFlRe7voXb/8AP7zQqSIhK529BUBM36sQ6nF7EiQcDrfLc1z39mq6eblUxbknIA
DDPISD5Tseik9N9x0bc7vINseKyHI1E90VAU/XKADcuGbzLvehPx+2p+Iq5ch5Ah
oa1G3RdlWWQOZxphJHWJhu1qMfo5+FP9dFZj1aoo7b8Kbc/CedyoQe71cpIE5wNh
Jj/EpWJnuyKXwuTic2VYGC+6ezM9O5DSdqCfP3YuZky95VESyvRCKJYMMgBYRVdC
/LuxhnBXIY2G8An7ZTnX0kLCCvLbapIwa0NyA98/xeOngO843coJ6wn8ZmE9LJNH
kMmpCygUrA==
=QWC+
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- mq-deadline accounting improvements (Bart)
- blk-wbt timer fix (Andrea)
- Untangle the block layer includes (Christoph)
- Rework the poll support to be bio based, which will enable adding
support for polling for bio based drivers (Christoph)
- Block layer core support for multi-actuator drives (Damien)
- blk-crypto improvements (Eric)
- Batched tag allocation support (me)
- Request completion batching support (me)
- Plugging improvements (me)
- Shared tag set improvements (John)
- Concurrent queue quiesce support (Ming)
- Cache bdev in ->private_data for block devices (Pavel)
- bdev dio improvements (Pavel)
- Block device invalidation and block size improvements (Xie)
- Various cleanups, fixes, and improvements (Christoph, Jackie,
Masahira, Tejun, Yu, Pavel, Zheng, me)
* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
blk-mq-debugfs: Show active requests per queue for shared tags
block: improve readability of blk_mq_end_request_batch()
virtio-blk: Use blk_validate_block_size() to validate block size
loop: Use blk_validate_block_size() to validate block size
nbd: Use blk_validate_block_size() to validate block size
block: Add a helper to validate the block size
block: re-flow blk_mq_rq_ctx_init()
block: prefetch request to be initialized
block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
block: add rq_flags to struct blk_mq_alloc_data
block: add async version of bio_set_polled
block: kill DIO_MULTI_BIO
block: kill unused polling bits in __blkdev_direct_IO()
block: avoid extra iter advance with async iocb
block: Add independent access ranges support
blk-mq: don't issue request directly in case that current is to be blocked
sbitmap: silence data race warning
blk-cgroup: synchronize blkg creation against policy deactivation
block: refactor bio_iov_bvec_set()
block: add single bio async direct IO helper
...
Add memory folios, a new type to represent either order-0 pages or
the head page of a compound page. This should be enough infrastructure
to support filesystems converting from pages to folios.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp
gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf
rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc
T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54
Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3
Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd
LYCtbeawtkikPVXZMfWybsx5vn0c3Q==
=7SBe
-----END PGP SIGNATURE-----
Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache
Pull memory folios from Matthew Wilcox:
"Add memory folios, a new type to represent either order-0 pages or the
head page of a compound page. This should be enough infrastructure to
support filesystems converting from pages to folios.
The point of all this churn is to allow filesystems and the page cache
to manage memory in larger chunks than PAGE_SIZE. The original plan
was to use compound pages like THP does, but I ran into problems with
some functions expecting only a head page while others expect the
precise page containing a particular byte.
The folio type allows a function to declare that it's expecting only a
head page. Almost incidentally, this allows us to remove various calls
to VM_BUG_ON(PageTail(page)) and compound_head().
This converts just parts of the core MM and the page cache. For 5.17,
we intend to convert various filesystems (XFS and AFS are ready; other
filesystems may make it) and also convert more of the MM and page
cache to folios. For 5.18, multi-page folios should be ready.
The multi-page folios offer some improvement to some workloads. The
80% win is real, but appears to be an artificial benchmark (postgres
startup, which isn't a serious workload). Real workloads (eg building
the kernel, running postgres in a steady state, etc) seem to benefit
between 0-10%. I haven't heard of any performance losses as a result
of this series. Nobody has done any serious performance tuning; I
imagine that tweaking the readahead algorithm could provide some more
interesting wins. There are also other places where we could choose to
create large folios and currently do not, such as writes that are
larger than PAGE_SIZE.
I'd like to thank all my reviewers who've offered review/ack tags:
Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
Babka, William Kucharski, Yu Zhao and Zi Yan.
I'd also like to thank those who gave feedback I incorporated but
haven't offered up review tags for this part of the series: Nick
Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
Hugh Dickins, and probably a few others who I forget"
* tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
mm/writeback: Add folio_write_one
mm/filemap: Add FGP_STABLE
mm/filemap: Add filemap_get_folio
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Add filemap_alloc_folio
mm/page_alloc: Add folio allocation functions
mm/lru: Add folio_add_lru()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm: Add folio_evictable()
mm/workingset: Convert workingset_refault() to take a folio
mm/filemap: Add readahead_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add i_blocks_per_folio()
mm/writeback: Add folio_redirty_for_writepage()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add filemap_dirty_folio()
...
Commit 3c62be17d4 ("f2fs: support multiple devices") missed
to support direct IO for multiple device feature, this patch
adds to support the missing part of multidevice feature.
In addition, for multiple device image, we should be aware of
any issued direct write IO rather than just buffered write IO,
so that fsync and syncfs can issue a preflush command to the
device where direct write IO goes, to persist user data for
posix compliant.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Clean up: BIT() is preferred over open-coding the shift.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For certain special cases, RPC-related tracepoints record a -1 as
the task ID or the client ID. It's ugly for a trace event to display
4 billion in these cases.
To help keep SUNRPC tracepoints consistent, create a macro that
defines the print format specifiers for tk_pid and cl_clid. At some
point in the future we might try tk_pid with a wider range of values
than 0..64K so this makes it easier to make that change.
RPC tracepoints now look like this:
<...>-1276 [009] 149.720358: rpc_clnt_new: client=00000005 peer=[192.168.2.55]:20049 program=nfs server=klimt.ib
<...>-1342 [004] 149.921234: rpc_xdr_recvfrom: task:0000001a@00000005 head=[0xff1242d9ab6dc01c,144] page=0 tail=[(nil),0] len=144
<...>-1342 [004] 149.921235: xprt_release_cong: task:0000001a@00000005 snd_task:ffffffff cong=256 cwnd=16384
<...>-1342 [004] 149.921235: xprt_put_cong: task:0000001a@00000005 snd_task:ffffffff cong=0 cwnd=16384
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This is a buffer to be left persistently registered while a
connection is up. Connection tear-down will automatically DMA-unmap,
invalidate, and dereg the MR. A persistently registered buffer is
lower in cost to provide, and it can never be coalesced into the
RDMA segment that carries the data payload.
An RPC that provisions a Write chunk with a non-aligned length now
uses this MR rather than the tail buffer of the RPC's rq_rcv_buf.
Reviewed-By: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We only need to call it to resolve the blk_status_t -> errno mapping for
tracing, so move the conversion into the tracepoints that are not called
at all when tracing isn't enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I recently had to look at a production problem where a request ended
up getting the dreaded -EINVAL error on submit. The most used and
hence useless of error codes, as it just tells you that something
was wrong with your request, but not more than that.
Let's dump the full sqe contents if we run into an issue failure,
that'll allow easier diagnosing of a wide variety of issues.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This saves five calls to compound_head(), totalling 60 bytes of text.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Rename writeback_dirty_page() to writeback_dirty_folio() and
wait_on_page_writeback() to folio_wait_writeback().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
This replaces activate_page() and eliminates lots of calls to
compound_head(). Saves net 118 bytes of kernel text. There are still
some redundant calls to page_folio() here which will be removed when
pagevec_lru_move_fn() is converted to use folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFsIqAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppbBEACDewLUv7bg1VFIGdroRN51OGiOv1oV+8HP
ruY7O9CPtV7wcb3lA1Zy9igICuzuC5culHjbRJrNIUeTdWCQHCFk/sfKSD6VGMoT
cFTqpKxV7M3vYr9G2m5TFWgY2mfS+I5fxyDZxK2z2esHCFw6TZ7A5W13xScVXKP+
QdNFSlTrGkpggsSIEeHApG+NLsIecnkT4qzm8zPfUodUtQ3A8JMjQjnYUFEAWfWv
l9x9zDIzaGjPtXf5soFEvmdh1ALh3WWiYb1kIwK1FeP/PYX0JV/3zCMgqOwpK+4b
69OM3Q0NPHvu2TgSRK+ghekAtz5qgPDMCrzdhSgLYJEL/PGAOboqjrB9E+wWoEjd
IKrYLx4Xao2TUZLJF2y34hHfODGdasx7d+wS191UpVFEZHFhDhIaazZ2rDd5xnQK
LdzQw1JQF/igJovHauhSkGFIdJWBSDneLQoMimBnitZlsWARUmFSZej34FFRLZsW
8ZXfqipn/x+fh4sQ/HdEfWxnGHtveDpU+0Ka5bMUe/tJ9RPtmn/Ye7nFjYecC6NY
4UzFSNn+4e9DpHaDuP3I/eA1YBmVlcB5Hum3ve7X6ovwpjArYg3dgJOEi8uCZjfb
hdMANmkVptcPiEO9njEHhC7S8+Nm3t+8o3qQceN81j6Vcjgzt/Y/n3Z6UkKeSlkn
Ila+cZI1oA==
=J/e4
-----END PGP SIGNATURE-----
Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Bigger than usual for this point in time, the majority is fixing some
issues around BDI lifetimes with the move from the request_queue to
the disk in this release. In detail:
- Series on draining fs IO for del_gendisk() (Christoph)
- NVMe pull request via Christoph:
- fix the abort command id (Keith Busch)
- nvme: fix per-namespace chardev deletion (Adam Manzanares)
- brd locking scope fix (Tetsuo)
- BFQ fix (Paolo)"
* tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
block, bfq: reset last_bfqq_created on group change
block: warn when putting the final reference on a registered disk
brd: reduce the brd_devices_mutex scope
kyber: avoid q->disk dereferences in trace points
block: keep q_usage_counter in atomic mode after del_gendisk
block: drain file system I/O on del_gendisk
block: split bio_queue_enter from blk_queue_enter
block: factor out a blk_try_enter_queue helper
block: call submit_bio_checks under q_usage_counter
nvme: fix per-namespace chardev deletion
block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs
nvme-pci: Fix abort command id
Currently, z_erofs_map_blocks_iter() returns whether extents are
compressed or not, and the decompression frontend gets the specific
algorithms then.
It works but not quite well in many aspests, for example:
- The decompression frontend has to deal with whether extents are
compressed or not again and lookup the algorithms if compressed.
It's duplicated and too detailed about the on-disk mapping.
- A new secondary compression head will be introduced later so that
each file can have 2 compression algorithms at most for different
type of data. It could increase the complexity of the decompression
frontend if still handled in this way;
- A new readmore decompression strategy will be introduced to get
better performance for much bigger pcluster and lzma, which needs
the specific algorithm in advance as well.
Let's look up compression algorithms in z_erofs_map_blocks_iter()
directly instead.
Link: https://lore.kernel.org/r/20211008200839.24541-2-xiang@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
q->disk becomes invalid after the gendisk is removed. Work around this
by caching the dev_t for the tracepoints. The real fix would be to
properly tear down the I/O schedulers with the gendisk, but that is
a much more invasive change.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012093301.GA27795@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The declaration of struct devlink in general header provokes the
situation where internal fields can be accidentally used by the driver
authors. In order to reduce such possible situations, let's reduce the
namespace exposure of struct devlink.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 9df1c28bb7 ("bpf: add writable context for raw tracepoints")
supports writable context for tracepoint, but it misses the support
for bare tracepoint which has no associated trace event.
Bare tracepoint is defined by DECLARE_TRACE(), so adding a corresponding
DECLARE_TRACE_WRITABLE() macro to generate a definition in __bpf_raw_tp_map
section for bare tracepoint in a similar way to DEFINE_TRACE_WRITABLE().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211004094857.30868-2-hotforest@gmail.com
When cachefiles_cull() calls cachefiles_bury_object(), it passes
a NULL object. When this occurs, either trace_cachefiles_unlink()
or trace_cachefiles_rename() may oops due to the NULL object.
Check for NULL object in the tracepoint and if so, set debug_id
to MAX_UINT as was done in 2908f5e101.
The following oops was seen with xfstests generic/100.
BUG: kernel NULL pointer dereference, address: 0000000000000010
...
RIP: 0010:trace_event_raw_event_cachefiles_unlink+0x4e/0xa0 [cachefiles]
...
Call Trace:
cachefiles_bury_object+0x242/0x430 [cachefiles]
? __vfs_removexattr_locked+0x10f/0x150
? vfs_removexattr+0x51/0xd0
cachefiles_cull+0x84/0x120 [cachefiles]
cachefiles_daemon_cull+0xd1/0x120 [cachefiles]
cachefiles_daemon_write+0x158/0x190 [cachefiles]
vfs_write+0xbc/0x260
ksys_write+0x4f/0xc0
do_syscall_64+0x3b/0x90
The following oops was seen with xfstests generic/290.
BUG: kernel NULL pointer dereference, address: 0000000000000010
...
RIP: 0010:trace_event_raw_event_cachefiles_rename+0x54/0xa0 [cachefiles]
...
Call Trace:
cachefiles_bury_object+0x35c/0x430 [cachefiles]
cachefiles_cull+0x84/0x120 [cachefiles]
cachefiles_daemon_cull+0xd1/0x120 [cachefiles]
cachefiles_daemon_write+0x158/0x190 [cachefiles]
vfs_write+0xbc/0x260
ksys_write+0x4f/0xc0
do_syscall_64+0x3b/0x90
Fixes: 2908f5e101 ("fscache: Add a cookie debug ID and use that in traces")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://listman.redhat.com/archives/linux-cachefs/2021-October/msg00009.html
This value is usually zero, but will be non-zero more often in the
future. Knowing its value can be important diagnostic information.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is an operational low memory situation that needs to be
flagged. The new tracepoint records a timestamp and the nfsd thread
that failed to allocate pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are currently three separate purposes being served by single
tracepoints. Split them up, as was done with wc_send.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.
svcrdma_wc_send:
- status is always zero, so there's no value in recording it.
- vendor_err is meaningless unless status is not zero, so
there's no value in recording it.
- This tracepoint is needed only when developing modifications,
so it should be left disabled most of the time.
svcrdma_wc_send_flush:
- As above, needed only rarely, and not an error.
svcrdma_wc_send_err:
- This tracepoint can be left persistently enabled because
completion errors are run-time problems (except for FLUSHED_ERR).
- Tracepoint name now ends in _err to reflect its purpose.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.
svcrdma_wc_recv:
- status is always zero, so there's no value in recording it.
- vendor_err is meaningless unless status is not zero, so
there's no value in recording it.
- This tracepoint is needed only when developing modifications,
so it should be left disabled most of the time.
svcrdma_wc_recv_flush:
- As above, needed only rarely, and not an error.
svcrdma_wc_recv_err:
- received is always zero, so there's no value in recording it.
- This tracepoint can be left enabled because completion
errors are run-time problems (except for FLUSHED_ERR).
- Tracepoint name now ends in _err to reflect its purpose.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In cachefiles_mark_object_buried, the dentry in question may not have an
owner, and thus our cachefiles_object pointer may be NULL when calling
the tracepoint, in which case we will also not have a valid debug_id to
print in the tracepoint.
Check for NULL object in the tracepoint and if so, just set debug_id to
MAX_UINT as was done in 2908f5e101 ("fscache: Add a cookie debug ID
and use that in traces").
This fixes the following oops:
FS-Cache: Cache "mycache" added (type cachefiles)
CacheFiles: File cache on vdc registered
...
Workqueue: fscache_object fscache_object_work_func [fscache]
RIP: 0010:trace_event_raw_event_cachefiles_mark_buried+0x4e/0xa0 [cachefiles]
....
Call Trace:
cachefiles_mark_object_buried+0xa5/0xb0 [cachefiles]
cachefiles_bury_object+0x270/0x430 [cachefiles]
cachefiles_walk_to_object+0x195/0x9c0 [cachefiles]
cachefiles_lookup_object+0x5a/0xc0 [cachefiles]
fscache_look_up_object+0xd7/0x160 [fscache]
fscache_object_work_func+0xb2/0x340 [fscache]
process_one_work+0x1f1/0x390
worker_thread+0x53/0x3e0
kthread+0x127/0x150
Fixes: 2908f5e101 ("fscache: Add a cookie debug ID and use that in traces")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The tag allocation, release and bind events are somewhat opaque outside
the kernel; this change adds a few tracepoints to assist in
instrumentation and debugging.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix up a misuse that the filename pointer isn't always valid in
the ring buffer, and we should copy the content instead.
Fixes: 0c5e36db17 ("f2fs: trace f2fs_lookup")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The page was only being used for the memcg and to gather trace
information, so this is a simple conversion. The only caller of
mem_cgroup_track_foreign_dirty() will be converted to folios in a later
patch, so doing this now makes that patch simpler.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Handle arbitrary-order folios being added to the LRU. By definition,
all pages being added to the LRU were already head or base pages, but
call page_folio() on them anyway to get the type right and avoid the
buried calls to compound_head().
Saves 783 bytes of kernel text; no functions grow.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
- fix the dangling pointer use in erofs_lookup tracepoint;
- fix unsupported chunk format check;
- zero out compacted_2b if compacted_4b_initial > totalidx.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYU9CBxEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBDgBAQDaj1NWjIleK4Q7hoerl++6MMhzEJrmpxSE
EENs9NPuiQEAp7dN0T05a2J+Szp5xJeLYg67LoYbAnDmbmzGH/jQQg0=
=6e/B
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Two bugfixes to fix the 4KiB blockmap chunk format availability and a
dangling pointer usage. There is also a trivial cleanup to clarify
compacted_2b if compacted_4b_initial > totalidx.
Summary:
- fix the dangling pointer use in erofs_lookup tracepoint
- fix unsupported chunk format check
- zero out compacted_2b if compacted_4b_initial > totalidx"
* tag 'erofs-for-5.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: clear compacted_2b if compacted_4b_initial > totalidx
erofs: fix misbehavior of unsupported chunk format check
erofs: fix up erofs_lookup tracepoint
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmE/CK0ACgkQ+7dXa6fL
C2vR+A/3ZOlda7wl9grj+qPPiJE1jCav7myLJJR73Yog5T8ZfFkaK6a20IOAyOBu
1v9GzTEODCA12uomYfvIZqNHrcBr2oV6jf8twcnioELQELEP4KPQsXpd1eqq/Kho
O3JUaY7BRiKIk5jUL7IEt2hdBgYCBU2FMoQa+M3FiKfoq601rDDsb5YnwWP0og26
MxXpVmn8uY+QTfwCI4uoJaRZmEX5tu7DnPX3VNHbno9uuI2VJo16S/jmw5CAkG5B
K9p9VdWbGkelM3CXl2rYBG4cA56uwEhVDfTze+A/Eg9JYD2WCFrsehGWC1DR/QtZ
LMM5FxiajF2tvg8KQE/Ou+er96qujwfIJKUgI+vqYLh2s6b5ZLqIyzUpTk4fIrf4
MbHBb4ec0AMXrGapO0fu7UZ2x7f+T7CkYrtIMYxddjlv8YQ860TtzEp/esing4IW
2DHe6xe72LiqoZ09DBaFq0DJKxtFYKQ94GcHjVGxOaFf4nx4OVkQP3gPz3jrhIy8
boWJZQ3xv4cuSbX23GBdELzPbkaTRUjI1siYM2zVk31S4YkZVyy5LbgjQL93C+Bp
BzQwhMGiFQOz17J5eBehVIvHoKDi5fVBuX3WK7aMFmPtUxNhh3KnLKjaxERxdUYw
6pHq3P23rX15TVC24djqtDevv+otITqJ7dKDovKnGm6hoPRqnw==
=BLd7
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20210913' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Fixes for AFS problems that can cause data corruption due to
interaction with another client modifying data cached locally:
- When d_revalidating a dentry, don't look at the inode to which it
points. Only check the directory to which the dentry belongs. This
was confusing things and causing the silly-rename cleanup code to
remove the file now at the dentry of a file that got deleted.
- Fix mmap data coherency. When a callback break is received that
relates to a file that we have cached, the data content may have
been changed (there are other reasons, such as the user's rights
having been changed). However, we're checking it lazily, only on
entry to the kernel, which doesn't happen if we have a writeable
shared mapped page on that file.
We make the kernel keep track of mmapped files and clear all PTEs
mapping to that file as soon as the callback comes in by calling
unmap_mapping_pages() (we don't necessarily want to zap the
pagecache). This causes the kernel to be reentered when userspace
tries to access the mmapped address range again - and at that point
we can query the server and, if we need to, zap the page cache.
Ideally, I would check each file at the point of notification, but
that involves poking the server[*] - which is holding an exclusive
lock on the vnode it is changing, waiting for all the clients it
notified to reply. This could then deadlock against the server.
Further, invalidating the pagecache might call ->launder_page(),
which would try to write to the file, which would definitely
deadlock. (AFS doesn't lease file access).
[*] Checking to see if the file content has changed is a matter of
comparing the current data version number, but we have to ask
the server for that. We also need to get a new callback promise
and we need to poke the server for that too.
- Add some more points at which the inode is validated, since we're
doing it lazily, notably in ->read_iter() and ->page_mkwrite(), but
also when performing some directory operations.
Ideally, checking in ->read_iter() would be done in some derivation
of filemap_read(). If we're going to call the server to read the
file, then we get the file status fetch as part of that.
- The above is now causing us to make a lot more calls to
afs_validate() to check the inode - and afs_validate() takes the
RCU read lock each time to make a quick check (ie.
afs_check_validity()). This is entirely for the purpose of checking
cb_s_break to see if the server we're using reinitialised its list
of callbacks - however this isn't a very common event, so most of
the time we're taking this needlessly.
Add a new cell-wide counter to count the number of
reinitialisations done by any server and check that - and only if
that changes, take the RCU read lock and check the server list (the
server list may change, but the cell a file is part of won't).
- Don't update vnode->cb_s_break and ->cb_v_break inside the validity
checking loop. The cb_lock is done with read_seqretry, so we might
go round the loop a second time after resetting those values - and
that could cause someone else checking validity to miss something
(I think).
Also included are patches for fixes for some bugs encountered whilst
debugging this:
- Fix a leak of afs_read objects and fix a leak of keys hidden by
that.
- Fix a leak of pages that couldn't be added to extend a writeback.
- Fix the maintenance of i_blocks when i_size is changed by a local
write or a local dir edit"
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217 [1]
Link: https://lore.kernel.org/r/163111665183.283156.17200205573146438918.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/ # i_blocks patch
* tag 'afs-fixes-20210913' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix updating of i_blocks on file/dir extension
afs: Fix corruption in reads at fpos 2G-4G from an OpenAFS server
afs: Try to avoid taking RCU read lock when checking vnode validity
afs: Fix mmap coherency vs 3rd-party changes
afs: Fix incorrect triggering of sillyrename on 3rd-party invalidation
afs: Add missing vnode validation checks
afs: Fix page leak
afs: Fix missing put on afs_read objects and missing get on the key therein
Try to avoid taking the RCU read lock when checking the validity of a
vnode's callback state. The only thing it's needed for is to pin the
parent volume's server list whilst we search it to find the record of the
server we're currently using to see if it has been reinitialised (ie. it
sent us a CB.InitCallBackState* RPC).
Do this by the following means:
(1) Keep an additional per-cell counter (fs_s_break) that's incremented
each time any of the fileservers in the cell reinitialises.
Since the new counter can be accessed without RCU from the vnode, we
can check that first - and only if it differs, get the RCU read lock
and check the volume's server list.
(2) Replace afs_get_s_break_rcu() with afs_check_server_good() which now
indicates whether the callback promise is still expected to be present
on the server. This does the checks as described in (1).
(3) Restructure afs_check_validity() to take account of the change in (2).
We can also get rid of the valid variable and just use the need_clear
variable with the addition of the afs_cb_break_no_promise reason.
(4) afs_check_validity() probably shouldn't be altering vnode->cb_v_break
and vnode->cb_s_break when it doesn't have cb_lock exclusively locked.
Move the change to vnode->cb_v_break to __afs_break_callback().
Delegate the change to vnode->cb_s_break to afs_select_fileserver()
and set vnode->cb_fs_s_break there also.
(5) afs_validate() no longer needs to get the RCU read lock around its
call to afs_check_validity() - and can skip the call entirely if we
don't have a promise.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111669583.283156.1397603105683094563.stgit@warthog.procyon.org.uk/
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
This commit adds a tracepoint for DAMON. It traces the monitoring results
of each region for each aggregation interval. Using this, DAMON can
easily integrated with tracepoints supporting tools such as perf.
Link: https://lkml.kernel.org/r/20210716081449.22187-7-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_idle and PG_young allow the two PTE Accessed bit users, Idle Page
Tracking and the reclaim logic concurrently work while not interfering
with each other. That is, when they need to clear the Accessed bit, they
set PG_young to represent the previous state of the bit, respectively.
And when they need to read the bit, if the bit is cleared, they further
read the PG_young to know whether the other has cleared the bit meanwhile
or not.
For yet another user of the PTE Accessed bit, we could add another page
flag, or extend the mechanism to use the flags. For the DAMON usecase,
however, we don't need to do that just yet. IDLE_PAGE_TRACKING and DAMON
are mutually exclusive, so there's only ever going to be one user of the
current set of flags.
In this commit, we split out the CONFIG options to allow for the use of
PG_young and PG_idle outside of idle page tracking.
In the next commit, DAMON's reference implementation of the virtual memory
address space monitoring primitives will use it.
[sjpark@amazon.de: set PAGE_EXTENSION for non-64BIT]
Link: https://lkml.kernel.org/r/20210806095153.6444-1-sj38.park@gmail.com
[akpm@linux-foundation.org: tweak Kconfig text]
[sjpark@amazon.de: hide PAGE_IDLE_FLAG from users]
Link: https://lkml.kernel.org/r/20210813081238.34705-1-sj38.park@gmail.com
Link: https://lkml.kernel.org/r/20210716081449.22187-5-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of hard-coding ((1UL << NR_PAGEFLAGS) - 1) everywhere, introducing
PAGEFLAGS_MASK to make the code clear to get the page flags.
Link: https://lkml.kernel.org/r/20210819150712.59948-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEz5eEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmk1D/wML8Im2erR5s0PaWZgYxXlgEKrJDwJm/p+
2Uixrn/9kQAhwH+0kJnCiI+HwlL3LU+5/iAdeGtdYMcVaotPPmm5V3jfud8+RuAi
E+uIOdULXgQKj8pkiQ2h5mvYd0BxGkGH38gUqilSwFrY2HTpbfxreCHhYoQaE/7o
DiGNgbhJglSFIBuIgS4cfpLkI3FdaAmrCydZ9zaqEv/G/bx9aA9lwSbAJadhTbmt
Qc1vvbh2FB9YvgZX8qfaneyDKzQbwqTvKxCe2SOVMOp/X0feJym7WZUvrPr04EoZ
zBaLDkmn44re4iWPbide7+KQJ8NMQQDBiuxwF5WxdF3hrcsiwqmKgDtBEGWXFMeV
CUZ9Osrfb480UKsDExtxLhQqGz1JZqIPZdtDvSJb8MunPZtvTz27NNFyyb9aBrlX
WiwEHqAOE1W33buPCNyuYLGDVYis4/TkwF0NZpMwsyPdN0Iz/M8Z5F5BHhC7BYoP
U8KMsX3XvddxB113U+IMVqI/SuvT125U65brklQlQeLEHnH57ceII9mNGfNic6LR
bcIu7Fb5J1U5nAMeeLCSXsEYXs+peYgI1UOWXaWgSVixUAyU8H+OqsBVIl8eiMjr
TTbdIMmfWqENE3wBM709FQQLoMmGl1YjBkGmBXKZjNHcDrf9X56rimSxRD2i2okg
r2JczxQ5uQ==
=QoQg
-----END PGP SIGNATURE-----
Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"As sometimes happens, two reports came in around the merge window open
that led to some fixes. Hence this one is a bit bigger than usual
followup fixes, but most of it will be going towards stable, outside
of the fixes that are addressing regressions from this merge window.
In detail:
- postgres is a heavy user of signals between tasks, and if we're
unlucky this can interfere with io-wq worker creation. Make sure
we're resilient against unrelated signal handling. This set of
changes also includes hardening against allocation failures, which
could previously had led to stalls.
- Some use cases that end up having a mix of bounded and unbounded
work would have starvation issues related to that. Split the
pending work lists to handle that better.
- Completion trace int -> unsigned -> long fix
- Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL
- Fix regression with hash wait lock in this merge window
- Fix retry issued on block devices (Ming)
- Fix regression with links in this merge window (Pavel)
- Fix race with multi-shot poll and completions (Xiaoguang)
- Ensure regular file IO doesn't inadvertently skip completion
batching (Pavel)
- Ensure submissions are flushed after running task_work (Pavel)"
* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
io_uring: io_uring_complete() trace should take an integer
io_uring: fix possible poll event lost in multi shot mode
io_uring: prolong tctx_task_work() with flushing
io_uring: don't disable kiocb_done() CQE batching
io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
io-wq: make worker creation resilient against signals
io-wq: get rid of FIXED worker flag
io-wq: only exit on fatal signals
io-wq: split bounded and unbounded work into separate lists
io-wq: fix queue stalling race
io_uring: don't submit half-prepared drain request
io_uring: fix queueing half-created requests
io-wq: ensure that hash wait lock is IRQ disabling
io_uring: retry in case of short read on block device
io_uring: IORING_OP_WRITE needs hash_reg_file set
io-wq: fix race between adding work and activating a free worker
Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.
There are some (minor) user-visible changes:
- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().
- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).
- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
In this cycle, we've addressed some performance issues such as lock contention,
misbehaving compress_cache, allowing extent_cache for compressed files, and new
sysfs to adjust ra_size for fadvise. In order to diagnose the performance issues
quickly, we also added an iostat which shows the IO latencies periodically. On
the stability side, we've found two memory leakage cases in the error path in
compression flow. And, we've also fixed various corner cases in fiemap, quota,
checkpoint=disable, zstd, and so on.
Enhancement:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fix:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code paths such
as dio, unnecessary operations, debugfs/f2fs/status, sanity check, and typos.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmEyw1sACgkQQBSofoJI
UNLJmA/+NHUgwUjLMcHvmLyp6QYpQDZtKj93/sRDo+YHOYNdYFjWWUb329PYTKWS
kEdzApCP+KHfVxeSkiL/x3qWP+RlTkIf96P0kR3/BKi0tjg25G2riFWztusDDFpt
xi+AW5sUFDvIx1tFumvQHAQedSwBgcZ96ovT5EwxEuONkljhZC9phEC6vSXz9nOR
e2EQIyezbC5O21np1KSeqSgqRMpVkJkVcEHy4VmpMBCLMOOYPepWwKw+yPaV/jR/
zUXdo2/53vma50M5LCDPCtjCtWQgLoeNeGLxyjfzQuTJU6TmtPY65JObLPt6pUSj
fRW6qIziTZbVYXzOWBD0EYilv2N4c3BNJdhQCpx2Vyjw9/LLxzqKPOUyzBoa1kjY
eZVvmaLXVCKsoJdHDSi7OH/4BqS6SuSZE8eO/nGkgswqiErHZ0Vwl3bFCWC7r/Bk
r2U5spJx/83XO6c9H1bzeWEies1DRtwnCDIRRuw35RtJ4uHZaqCfkuJ7rOBwC90X
4SpaAKdUxP2RWc3GKELBIhaqPn7vyMy9ile6VU14PjM8UcY5hyE87T2azqR8gGut
nVjRL4cbMGTPj6m1Qj8KqBRSaLuShe6AncUy7bNGiM+JlcLcdB6OJ1ZYLl9hjx2r
TbIouXThgcZ4SIK0DEaBLKz2b9/0TfaO9gw1XzpRma+bWA1pApM=
=W67o
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've addressed some performance issues such as lock
contention, misbehaving compress_cache, allowing extent_cache for
compressed files, and new sysfs to adjust ra_size for fadvise.
In order to diagnose the performance issues quickly, we also added an
iostat which shows the IO latencies periodically.
On the stability side, we've found two memory leakage cases in the
error path in compression flow. And, we've also fixed various corner
cases in fiemap, quota, checkpoint=disable, zstd, and so on.
Enhancements:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fixes:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code
paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
check, and typos"
* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
f2fs: should put a page beyond EOF when preparing a write
f2fs: deallocate compressed pages when error happens
f2fs: enable realtime discard iff device supports discard
f2fs: guarantee to write dirty data when enabling checkpoint back
f2fs: fix to unmap pages from userspace process in punch_hole()
f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
f2fs: fix to account missing .skipped_gc_rwsem
f2fs: adjust unlock order for cleanup
f2fs: Don't create discard thread when device doesn't support realtime discard
f2fs: rebuild nat_bits during umount
f2fs: introduce periodic iostat io latency traces
f2fs: separate out iostat feature
f2fs: compress: do sanity check on cluster
f2fs: fix description about main_blkaddr node
f2fs: convert S_IRUGO to 0444
f2fs: fix to keep compatibility of fault injection interface
f2fs: support fault injection for f2fs_kmem_cache_alloc()
f2fs: compress: allow write compress released file after truncate to zero
f2fs: correct comment in segment.h
f2fs: improve sbi status info in debugfs/f2fs/status
...
- New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`
- Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmExP7AACgkQ18tUv7Cl
QOshTg/zBz7OfrS23CcLLgNidTJ6S7JOuj1DShG+YzsYXT8f9Nl1DadLM7yAEyok
6JZzC8rXYzJcmYztHZzRyTuzj1+tGGb0u/MrD0bBk42VEel6eOjH/Y9ybn12Gf/E
aqlcJh8hPx44U8oo5EFjRJsg2h28O06vywqhJz+sTbkqKN4hlAgMOo5ysAB+1thg
BrTlR84EKBw5QqxPJ1WPmq9tEyGebU9Yrj1p8f0Uf015IeRNeTOXx3NzmdPshphf
2yJvjumwEzqkcHXTJFDfP6ikIcGPPMNVAOK8DHb+vDGzNsOXW7dDM7GuWA3U8DlU
ZHvyyb05Wwe6Wwg8xwx90FEXcYZFfZbSKmI9z2uoOuGFzNG07zWzPDzRft+qrOvU
VMMwP9oEh71+qesmWTvqIbR2RjxqbCYlTcc8mBrD66DROi6jZ2jznraNC85sxG0Q
b8GE+2SnYr2Q25yehj2xrRlOXyiYNkeeYmIpIquEqH9o7cSyDNJhBWbzIv6x+ith
O/S06ZVKMc9X1nH5t5121XcHrSTMMVA/67WMyKfKMxWnrADAWPQALG+ttoTcbRu7
Txew3Jb+hB8+ZdHAqbPf1l1i+7USQl1CRHMw3GRvNjCL2qcjZb1R7eyJRSQQtUyw
q6SJRGe6Sn1FTUnn96Hv15Zy8VHx+q0cOL/EQVzL1RzJIXYcag==
=Ad/3
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`
Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers"
* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
NFS: Always provide aligned buffers to the RPC read layers
NFSv4.1 add network transport when session trunking is detected
SUNRPC enforce creation of no more than max_connect xprts
NFSv4 introduce max_connect mount options
SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
SUNRPC keep track of number of transports to unique addresses
NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
SUNRPC: Tweak TCP socket shutdown in the RPC client
SUNRPC: Simplify socket shutdown when not reusing TCP ports
NFSv4.2: remove restriction of copy size for inter-server copy.
NFS: Clean up the synopsis of callback process_op()
NFS: Extract the xdr_init_encode/decode() calls from decode_compound
NFS: Remove unused callback void decoder
NFS: Add a private local dispatcher for NFSv4 callback operations
SUNRPC: Eliminate the RQ_AUTHERR flag
SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
SUNRPC: Add svc_rqst::rq_auth_stat
SUNRPC: Add dst_port to the sysfs xprt info file
SUNRPC: Add srcaddr as a file in sysfs
sunrpc: Fix return value of get_srcport()
...
It currently takes a long, and while that's normally OK, the io_uring
limit is an int. Internally in io_uring it's an int, but sometimes it's
passed as a long. That can yield confusing results where a completions
seems to generate a huge result:
ou-sqp-1297-1298 [001] ...1 788.056371: io_uring_complete: ring 000000000e98e046, user_data 0x0, result 4294967171, cflags 0
which is due to -ECANCELED being stored in an unsigned, and then passed
in as a long. Using the right int type, the trace looks correct:
iou-sqp-338-339 [002] ...1 15.633098: io_uring_complete: ring 00000000e0ac60cf, user_data 0x0, result -125, cflags 0
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
This is mostly derived from a patch from Yang Shi:
https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/
Add code to the reclaim path (shrink_page_list()) to "demote" data to
another NUMA node instead of discarding the data. This always avoids the
cost of I/O needed to read the page back in and sometimes avoids the
writeout cost when the page is dirty.
A second pass through shrink_page_list() will be made if any demotions
fail. This essentially falls back to normal reclaim behavior in the case
that demotions fail. Previous versions of this patch may have simply
failed to reclaim pages which were eligible for demotion but were unable
to be demoted in practice.
For some cases, for example, MADV_PAGEOUT, the pages are always discarded
instead of demoted to follow the kernel API definition. Because
MADV_PAGEOUT is defined as freeing specified pages regardless in which
tier they are.
Note: This just adds the start of infrastructure for migration. It is
actually disabled next to the FIXME in migrate_demote_page_ok().
[dave.hansen@linux.intel.com: v11]
Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210721063926.3024591-4-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Wei Xu <weixugc@google.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
set from user space, so all users are gone; let's remove it.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmEujioACgkQ+7dXa6fL
C2twVg/+LuSWFgfgrKYQzN2KyeeoZPqR5MiFt3G/j6sLJHAnAzKxWMQV8uwvxA/H
9HpyWHSOH4qeErApThqr3+xmK8V7LzIU0YVrY8hsFEtynnwxcjNQ/oRm3AXu0jAx
eWa1vyM4/kRTh+0TtKQZ93uAmT/eUeIeHzin0Ozg6ED37a2xTIPMG2OprndKKqIM
xTlqaxv1BAYQrKyZtZFFQvsa29yQWgCeiq7lXjELkxGd/aN5ITUHNhYgmppe6ycO
v7/ZdOisyQAUWvzKb7ME0GDIoU2AOJHmbVF9Xe6we1Bhhnl1lWFwfXG9QOOxKLwZ
8KVI7eH/HRBdn77LyWIYv+SwViDaJpPmUeQgt21VeJuw/pT+wuq3lDa+o0JBE6BA
Di8AVBEzO1B2Bvt3H9ZOliKErhCvEsL7zd0PQK+/SdyXlRdqnxP5uG7Zwj0uFjAj
WfqdgtlgfPiKlAQD688BffFGx3MgOAJyq0oPJ2FQ/ZwUxpRL+aT6ChFxLR7rk5Xj
HEJ1b9dNs87pMMjU3csnH6iI7gxdp9mNtLRnKcTBsddn3+JlGFiswsTReto+HqT/
ORVTuBhHXjYxC2YJlkKtMxBD7jv6KL0Xs469/FAhQeDfuYT59MkwejGeKy/TDhl0
6iWx9LzWZlgewAt2rpw0dsZOxXxm2eSeP4NdAfrvjQUlSZdxe48=
=FR0t
-----END PGP SIGNATURE-----
Merge tag 'fscache-next-20210829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache updates from David Howells:
"Preparatory work for the fscache rewrite that's being worked on and
fix some bugs. These include:
- Always select netfs stats when enabling fscache stats since they're
displayed through the same procfile.
- Add a cookie debug ID that can be used in tracepoints instead of a
pointer and cache it in the netfs_cache_resources struct rather
than in the netfs_read_request struct to make it more available.
- Use file_inode() in cachefiles rather than dereferencing
file->f_inode directly.
- Provide a procfile to display fscache cookies.
- Remove the fscache and cachefiles histogram procfiles.
- Remove the fscache object list procfile.
- Avoid using %p in fscache and cachefiles as the value is hashed and
not comparable to the register dump in an oops trace.
- Fix the cookie hash function to actually achieve useful dispersion.
- Fix fscache_cookie_put() so that it doesn't dereference the cookie
pointer in the tracepoint after the refcount has been decremented
(we're only allowed to do that if we decremented it to zero).
- Use refcount_t rather than atomic_t for the fscache_cookie
refcount"
* tag 'fscache-next-20210829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: Use refcount_t for the cookie refcount instead of atomic_t
fscache: Fix fscache_cookie_put() to not deref after dec
fscache: Fix cookie key hashing
cachefiles: Change %p in format strings to something else
fscache: Change %p in format strings to something else
fscache: Remove the object list procfile
fscache, cachefiles: Remove the histogram stuff
fscache: Procfile to display cookies
fscache: Add a cookie debug ID and use that in traces
cachefiles: Use file_inode() rather than accessing ->f_inode
netfs: Move cookie debug ID to struct netfs_cache_resources
fscache: Select netfs stats if fscache stats are enabled
- Enable memcg accounting for various networking objects.
BPF:
- Introduce bpf timers.
- Add perf link and opaque bpf_cookie which the program can read
out again, to be used in libbpf-based USDT library.
- Add bpf_task_pt_regs() helper to access user space pt_regs
in kprobes, to help user space stack unwinding.
- Add support for UNIX sockets for BPF sockmap.
- Extend BPF iterator support for UNIX domain sockets.
- Allow BPF TCP congestion control progs and bpf iterators to call
bpf_setsockopt(), e.g. to switch to another congestion control
algorithm.
Protocols:
- Support IOAM Pre-allocated Trace with IPv6.
- Support Management Component Transport Protocol.
- bridge: multicast: add vlan support.
- netfilter: add hooks for the SRv6 lightweight tunnel driver.
- tcp:
- enable mid-stream window clamping (by user space or BPF)
- allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
- more accurate DSACK processing for RACK-TLP
- mptcp:
- add full mesh path manager option
- add partial support for MP_FAIL
- improve use of backup subflows
- optimize option processing
- af_unix: add OOB notification support.
- ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by
the router.
- mac80211: Target Wake Time support in AP mode.
- can: j1939: extend UAPI to notify about RX status.
Driver APIs:
- Add page frag support in page pool API.
- Many improvements to the DSA (distributed switch) APIs.
- ethtool: extend IRQ coalesce uAPI with timer reset modes.
- devlink: control which auxiliary devices are created.
- Support CAN PHYs via the generic PHY subsystem.
- Proper cross-chip support for tag_8021q.
- Allow TX forwarding for the software bridge data path to be
offloaded to capable devices.
Drivers:
- veth: more flexible channels number configuration.
- openvswitch: introduce per-cpu upcall dispatch.
- Add internet mix (IMIX) mode to pktgen.
- Transparently handle XDP operations in the bonding driver.
- Add LiteETH network driver.
- Renesas (ravb):
- support Gigabit Ethernet IP
- NXP Ethernet switch (sja1105)
- fast aging support
- support for "H" switch topologies
- traffic termination for ports under VLAN-aware bridge
- Intel 1G Ethernet
- support getcrosststamp() with PCIe PTM (Precision Time
Measurement) for better time sync
- support Credit-Based Shaper (CBS) offload, enabling HW traffic
prioritization and bandwidth reservation
- Broadcom Ethernet (bnxt)
- support pulse-per-second output
- support larger Rx rings
- Mellanox Ethernet (mlx5)
- support ethtool RSS contexts and MQPRIO channel mode
- support LAG offload with bridging
- support devlink rate limit API
- support packet sampling on tunnels
- Huawei Ethernet (hns3):
- basic devlink support
- add extended IRQ coalescing support
- report extended link state
- Netronome Ethernet (nfp):
- add conntrack offload support
- Broadcom WiFi (brcmfmac):
- add WPA3 Personal with FT to supported cipher suites
- support 43752 SDIO device
- Intel WiFi (iwlwifi):
- support scanning hidden 6GHz networks
- support for a new hardware family (Bz)
- Xen pv driver:
- harden netfront against malicious backends
- Qualcomm mobile
- ipa: refactor power management and enable automatic suspend
- mhi: move MBIM to WWAN subsystem interfaces
Refactor:
- Ambient BPF run context and cgroup storage cleanup.
- Compat rework for ndo_ioctl.
Old code removal:
- prism54 remove the obsoleted driver, deprecated by the p54 driver.
- wan: remove sbni/granch driver.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEukBYACgkQMUZtbf5S
IrsyHA//TO8dw18NYts4n9LmlJT2naJ7yBUUSSXK/M+DtW0MQ9nnHhqzPm5uJdRl
IgQTNJrW3dYzRwgqaWZqEwO1t5/FI+f87ND1Nsekg7x9tF66a6ov5WxU26TwwSba
U+si/inQ/4chuQ+LxMQobqCDxaLE46I2dIoRl+YfndJ24DRzYSwAEYIPPbSdfyU+
+/l+3s4GaxO4k/hLciPAiOniyxLoUNiGUTNh+2yqRBXelSRJRKVnl+V22ANFrxRW
nTEiplfVKhlPU1e4iLuRtaxDDiePHhw9I3j/lMHhfeFU2P/gKJIvz4QpGV0CAZg2
1VvDU32WEx1GQLXJbKm0KwoNRUq1QSjOyyFti+BO7ugGaYAR4gKhShOqlSYLzUtB
tbtzQhSNLWOGqgmSJOztZb5kFDm2EdRSll5/lP2uyFlPkIsIp0QbscJVzNTnS74b
Xz15ZOw41Z4TfWPEMWgfrx6Zkm7pPWkly+7WfUkPcHa1gftNz6tzXXxSXcXIBPdi
yQ5JCzzxrM5573YHuk5YedwZpn6PiAt4A/muFGk9C6aXP60TQAOS/ppaUzZdnk4D
NfOk9mj06WEULjYjPcKEuT3GGWE6kmjb8Pu0QZWKOchv7vr6oZly1EkVZqYlXELP
AfhcrFeuufie8mqm0jdb4LnYaAnqyLzlb1J4Zxh9F+/IX7G3yoc=
=JDGD
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Enable memcg accounting for various networking objects.
BPF:
- Introduce bpf timers.
- Add perf link and opaque bpf_cookie which the program can read out
again, to be used in libbpf-based USDT library.
- Add bpf_task_pt_regs() helper to access user space pt_regs in
kprobes, to help user space stack unwinding.
- Add support for UNIX sockets for BPF sockmap.
- Extend BPF iterator support for UNIX domain sockets.
- Allow BPF TCP congestion control progs and bpf iterators to call
bpf_setsockopt(), e.g. to switch to another congestion control
algorithm.
Protocols:
- Support IOAM Pre-allocated Trace with IPv6.
- Support Management Component Transport Protocol.
- bridge: multicast: add vlan support.
- netfilter: add hooks for the SRv6 lightweight tunnel driver.
- tcp:
- enable mid-stream window clamping (by user space or BPF)
- allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
- more accurate DSACK processing for RACK-TLP
- mptcp:
- add full mesh path manager option
- add partial support for MP_FAIL
- improve use of backup subflows
- optimize option processing
- af_unix: add OOB notification support.
- ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by the
router.
- mac80211: Target Wake Time support in AP mode.
- can: j1939: extend UAPI to notify about RX status.
Driver APIs:
- Add page frag support in page pool API.
- Many improvements to the DSA (distributed switch) APIs.
- ethtool: extend IRQ coalesce uAPI with timer reset modes.
- devlink: control which auxiliary devices are created.
- Support CAN PHYs via the generic PHY subsystem.
- Proper cross-chip support for tag_8021q.
- Allow TX forwarding for the software bridge data path to be
offloaded to capable devices.
Drivers:
- veth: more flexible channels number configuration.
- openvswitch: introduce per-cpu upcall dispatch.
- Add internet mix (IMIX) mode to pktgen.
- Transparently handle XDP operations in the bonding driver.
- Add LiteETH network driver.
- Renesas (ravb):
- support Gigabit Ethernet IP
- NXP Ethernet switch (sja1105):
- fast aging support
- support for "H" switch topologies
- traffic termination for ports under VLAN-aware bridge
- Intel 1G Ethernet
- support getcrosststamp() with PCIe PTM (Precision Time
Measurement) for better time sync
- support Credit-Based Shaper (CBS) offload, enabling HW traffic
prioritization and bandwidth reservation
- Broadcom Ethernet (bnxt)
- support pulse-per-second output
- support larger Rx rings
- Mellanox Ethernet (mlx5)
- support ethtool RSS contexts and MQPRIO channel mode
- support LAG offload with bridging
- support devlink rate limit API
- support packet sampling on tunnels
- Huawei Ethernet (hns3):
- basic devlink support
- add extended IRQ coalescing support
- report extended link state
- Netronome Ethernet (nfp):
- add conntrack offload support
- Broadcom WiFi (brcmfmac):
- add WPA3 Personal with FT to supported cipher suites
- support 43752 SDIO device
- Intel WiFi (iwlwifi):
- support scanning hidden 6GHz networks
- support for a new hardware family (Bz)
- Xen pv driver:
- harden netfront against malicious backends
- Qualcomm mobile
- ipa: refactor power management and enable automatic suspend
- mhi: move MBIM to WWAN subsystem interfaces
Refactor:
- Ambient BPF run context and cgroup storage cleanup.
- Compat rework for ndo_ioctl.
Old code removal:
- prism54 remove the obsoleted driver, deprecated by the p54 driver.
- wan: remove sbni/granch driver"
* tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1715 commits)
net: Add depends on OF_NET for LiteX's LiteETH
ipv6: seg6: remove duplicated include
net: hns3: remove unnecessary spaces
net: hns3: add some required spaces
net: hns3: clean up a type mismatch warning
net: hns3: refine function hns3_set_default_feature()
ipv6: remove duplicated 'net/lwtunnel.h' include
net: w5100: check return value after calling platform_get_resource()
net/mlxbf_gige: Make use of devm_platform_ioremap_resourcexxx()
net: mdio: mscc-miim: Make use of the helper function devm_platform_ioremap_resource()
net: mdio-ipq4019: Make use of devm_platform_ioremap_resource()
fou: remove sparse errors
ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
octeontx2-af: Set proper errorcode for IPv4 checksum errors
octeontx2-af: Fix static code analyzer reported issues
octeontx2-af: Fix mailbox errors in nix_rss_flowkey_cfg
octeontx2-af: Fix loop in free and unmap counter
af_unix: fix potential NULL deref in unix_dgram_connect()
dpaa2-eth: Replace strlcpy with strscpy
octeontx2-af: Use NDC TX for transmit packet data
...
- Support for server-side disconnect injection via debugfs
- Protocol definitions for new RPC_AUTH_TLS authentication flavor
Performance improvements:
- Reduce page allocator traffic in the NFSD splice read actor
- Reduce CPU utilization in svcrdma's Send completion handler
Notable bug fixes:
- Stabilize lockd operation when re-exporting NFS mounts
- Fix the use of %.*s in NFSD tracepoints
- Fix /proc/sys/fs/nfs/nsm_use_hostnames
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmEqq0AACgkQM2qzM29m
f5dYig/5AaPN2BWYf4D1VkrAS3+zGS+3IN23WVgpbA54jgfjPEH+Aa00YhEQQa0j
Y5u/jE5g/tWvenDefq5BmvdRfZMWCVc2JkngctOSflhaREUWK+HgCkH+5DQs6zUM
rbX7qy0v6wJnEMSlwCKJ2AuZbYw7Bsg2nvOgEbb718/ent3umeoXEK09x3HTWLEp
eVcMU5uicB5wRRPpROYG792oWzUScQ8kyiRCKJfQDoR7bINhBeVHObAIFMBo1UaH
x9CMX4RlPYGmoMYUc+AqcOM7hizucHpXqM1r3oVjQ7FyI+pmDLuLL/3OTjtRUX7+
nYLqNW/PijH9PjFe4BPjGHAUQfKiTIXANAe8VdjQj70D40jYkP+jQ9SPdV+pEgi4
U4azfK3S+85/bRYYq/1alcLiP1+6dgcL++rVvnKESTH9NRgNoEw2WZHeKxXiYaxU
p7oOC4XdnYDwcz/3QVWa0sK2kA5IJHzOsCQR7OilD09NAJ+AbJTAp0H3xFXTllzb
AV2CAEBVZlP+pZYOehuVnKpZPa7YAWx92wRK2anbRUMZN3lF1wWBEOTd6KweIpTx
l2GJSf3GWBqL1x9PjSet/cBusxYjTA+S1hE7KMrsNPhzbvpIgAZEtSqOfn9apDCV
uAFIN2DSiHm3Tv0aFSJWo+CMyKkyktuiS8JFKaFdzCp9NtsBM2M=
=TGkK
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"New features:
- Support for server-side disconnect injection via debugfs
- Protocol definitions for new RPC_AUTH_TLS authentication flavor
Performance improvements:
- Reduce page allocator traffic in the NFSD splice read actor
- Reduce CPU utilization in svcrdma's Send completion handler
Notable bug fixes:
- Stabilize lockd operation when re-exporting NFS mounts
- Fix the use of %.*s in NFSD tracepoints
- Fix /proc/sys/fs/nfs/nsm_use_hostnames"
* tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
nfsd: fix crash on LOCKT on reexported NFSv3
nfs: don't allow reexport reclaims
lockd: don't attempt blocking locks on nfs reexports
nfs: don't atempt blocking locks on nfs reexports
Keep read and write fds with each nlm_file
lockd: update nlm_lookup_file reexport comment
nlm: minor refactoring
nlm: minor nlm_lookup_file argument change
lockd: lockd server-side shouldn't set fl_ops
SUNRPC: Add documentation for the fail_sunrpc/ directory
SUNRPC: Server-side disconnect injection
SUNRPC: Move client-side disconnect injection
SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
nfsd4: Fix forced-expiry locking
rpc: fix gss_svc_init cleanup on failure
SUNRPC: Add RPC_AUTH_TLS protocol numbers
lockd: change the proc_handler for nsm_use_hostnames
sysctl: introduce new proc handler proc_dobool
SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEs2NIACgkQxWXV+ddt
WDsJMQ/+PJ/yXfI85mAeAzTJLWQ0zD6YO3iBhf3wOeyychWC4on435pj+zW8zR/U
/bix25ygoWF4MvGF6p0uyv4Z5mnvkZXE5lapUcJu6wXG7se1QRPH0broTh05IBXK
SnT93Eb9RexaiNFk7DVma9XkviqZ/ZISPtkJ9wYrfIba7j/U/wa+PtEFS7wk58hP
rFQXgV64xm/pcP28YYHfOkCjdyUMdJrnBUvfKOlX6d94lmYbP5lyiTL+XJEXExzN
wPakD0UsnXPr4TRvf+YRTPeFHPPUgyORII7otVUOKmGywWtcJrELX8rXFoW+6GwB
dzZIcSYXHUxU5UrtMbZgiztVBJ+bQY5juYMIrj13eYOMYkijxAqPP84iDO15+TSV
zNqyAVjUglHCGUGjhSpAxnAmtp+IJTZfVAWcvIKq3VqvJtb8tssQsk9bqFjH1xlH
qNJLE57CYe3tjw05K9y0keMh2iJWRWkXZYkgI/zjwo5nreemobpN+3fO4yneVLh7
ecdBmSl/JVSzAB1NamLOCZNGZLUqiiuTvZlJtI6ZsekrN1+4A6QzVcU/MGjSYL1v
C7W0hK0LF+e3xIBkxTKVq8noolsgbmlWacxJq8fZq9HwZy5IVJOVm9STDlCuLaIo
gPr0V0itkclcsMU0CHTyCjMsfuHYUwJZXwg93wKfJf5UCzS4OWU=
=ALO9
-----END PGP SIGNATURE-----
Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The highlights of this round are integrations with fs-verity and
idmapped mounts, the rest is usual mix of minor improvements, speedups
and cleanups.
There are some patches outside of btrfs, namely updating some VFS
interfaces, all straightforward and acked.
Features:
- fs-verity support, using standard ioctls, backward compatible with
read-only limitation on inodes with previously enabled fs-verity
- idmapped mount support
- make mount with rescue=ibadroots more tolerant to partially damaged
trees
- allow raid0 on a single device and raid10 on two devices,
degenerate cases but might be useful as an intermediate step during
conversion to other profiles
- zoned mode block group auto reclaim can be disabled via sysfs knob
Performance improvements:
- continue readahead of node siblings even if target node is in
memory, could speed up full send (on sample test +11%)
- batching of delayed items can speed up creating many files
- fsync/tree-log speedups
- avoid unnecessary work (gains +2% throughput, -2% run time on
sample load)
- reduced lock contention on renames (on dbench +4% throughput,
up to -30% latency)
Fixes:
- various zoned mode fixes
- preemptive flushing threshold tuning, avoid excessive work on
almost full filesystems
Core:
- continued subpage support, preparation for implementing remaining
features like compression and defragmentation; with some
limitations, write is now enabled on 64K page systems with 4K
sectors, still considered experimental
- no readahead on compressed reads
- inline extents disabled
- disabled raid56 profile conversion and mount
- improved flushing logic, fixing early ENOSPC on some workloads
- inode flags have been internally split to read-only and read-write
incompat bit parts, used by fs-verity
- new tree items for fs-verity
- descriptor item
- Merkle tree item
- inode operations extended to be namespace-aware
- cleanups and refactoring
Generic code changes:
- fs: new export filemap_fdatawrite_wbc
- fs: removed sync_inode
- block: bio_trim argument type fixups
- vfs: add namespace-aware lookup"
* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reset replace target device to allocation state on close
btrfs: zoned: fix ordered extent boundary calculation
btrfs: do not do preemptive flushing if the majority is global rsv
btrfs: reduce the preemptive flushing threshold to 90%
btrfs: tree-log: check btrfs_lookup_data_extent return value
btrfs: avoid unnecessarily logging directories that had no changes
btrfs: allow idmapped mount
btrfs: handle ACLs on idmapped mounts
btrfs: allow idmapped INO_LOOKUP_USER ioctl
btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
btrfs: allow idmapped SNAP_DESTROY ioctls
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
btrfs: check whether fsgid/fsuid are mapped during subvolume creation
btrfs: allow idmapped permission inode op
btrfs: allow idmapped setattr inode op
btrfs: allow idmapped tmpfile inode op
btrfs: allow idmapped symlink inode op
btrfs: allow idmapped mkdir inode op
...
fscache_cookie_put() accesses the cookie it has just put inside the
tracepoint that monitors the change - but this is something it's not
allowed to do if we didn't reduce the count to zero.
Fix this by dropping most of those values from the tracepoint and grabbing
the cookie debug ID before doing the dec.
Also take the opportunity to switch over the usage and where arguments on
the tracepoint to put the reason last.
Fixes: a18feb5576 ("fscache: Add tracepoints")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431203107.2908479.3259582550347000088.stgit@warthog.procyon.org.uk/
Replace the magic lookup through the kobject tree with an explicit
backpointer, given that the device model links are set up and torn
down at times when I/O is still possible, leading to potential
NULL or invalid pointer dereferences.
Fixes: edb0872f44 ("block: move the bdi from the request_queue to the gendisk")
Reported-by: syzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But, it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on the profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tools, the overall timing of the issue will be
slightly changed and it sometimes makes us hard to figure it out.
So, I added the feature printing out IO latency statistics tracepoint
events, which are minimal things to understand filesystem's I/O related
behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs
node on, we can get this statistics info in a periodic way and it
would cause the least overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added F2FS_IOSTAT config option to support getting IO statistics through
sysfs and printing out periodic IO statistics tracepoint events and
moved I/O statistics related codes into separate files for better
maintenance.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: set default=y]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be. With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.
Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations. Now assume we have 128MiB of delalloc
outstanding. With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.
Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space. This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.
Fix this two ways. First, change the calculations to be a fraction of
the total delalloc bytes on the system. Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.
Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once. This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.
I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again. This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging early enospc problems it was useful to have a tracepoint
where we failed all tickets so I could check the state of the enospc
counters at failure time to validate my fixes. This adds the tracpoint
so you can easily get that information.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to debug delalloc flushing issues I added delalloc_bytes and
ordered_bytes to this tracepoint to see if they were non-zero when we
were going ENOSPC. This was valuable for me and showed me cases where we
weren't waiting on ordered extents properly. In order to add this to the
tracepoint we need to take away the const modifier for fs_info, as
percpu_sum_counter_positive() will change the counter when it adds up
the percpu buckets. This is needed to make sure we're getting accurate
information at these tracepoints, as the wrong information could send us
down the wrong path when debugging problems.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
printk("%pGg") outputs these two flags as hexadecimal number, rather
than as a string, e.g:
GFP_KERNEL|0x1800000
Fix this by adding missing names of __GFP_ZEROTAGS and
__GFP_SKIP_KASAN_POISON flags to __def_gfpflag_names.
Link: https://lkml.kernel.org/r/20210816133502.590-1-rppt@kernel.org
Fixes: 013bb59dbb ("arm64: mte: handle tags zeroing at page allocation time")
Fixes: c275c5c6d5 ("kasan: disable freed user page poisoning with HW tags")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some paths through svc_process() leave rqst->rq_procinfo set to
NULL, which triggers a crash if tracing happens to be enabled.
Fixes: 89ff87494c ("SUNRPC: Display RPC procedure names instead of proc numbers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There's a few cases that a string that is to be recorded in a trace event,
does not have a terminating 'nul' character, and instead, the tracepoint
passes in the length of the string to record.
Add two helper macros to the trace event code that lets this work easier,
than tricks with "%.*s" logic.
__string_len() which is similar to __string() for declaration, but takes a
length argument.
__assign_str_len() which is similar to __assign_str() for assiging the
string, but it too takes a length argument.
Note, the TRACE_EVENT() macro will allocate the location on the ring
buffer to 'len + 1', that will be used to store the string into. It is a
requirement that the 'len' used for this is a most the length of the
string being recorded.
This string can still use __get_str() just like strings created with
__string() can use to retrieve the string.
Link: https://lore.kernel.org/linux-nfs/20210513105018.7539996a@gandalf.local.home/
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that there is an alternate method for returning an auth_stat
value, replace the RQ_AUTHERR flag with use of that new method.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I'd like to take commit 4532608d71 ("SUNRPC: Clean up generic
dispatcher code") even further by using only private local SVC
dispatchers for all kernel RPC services. This change would enable
the removal of the logic that switches between
svc_generic_dispatch() and a service's private dispatcher, and
simplify the invocation of the service's pc_release method
so that humans can visually verify that it is always invoked
properly.
All that will come later.
First, let's provide a better way to return authentication errors
from SVC dispatcher functions. Instead of overloading the dispatch
method's *statp argument, add a field to struct svc_rqst that can
hold an error value.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Unlike xprtrdma_post_send(), this one can be left enabled all the
time, and should almost never fire. But we do want to know about
immediate errors when they happen.
Note that there is already a similar post_linv_err tracepoint.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In the vast majority of cases, rc=0. Don't record that in the
post_recvs tracepoint. Instead, add a separate tracepoint that can
be left enabled all the time to capture the very rare immediate
errors returned by ib_post_recv().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The client can alter the timeout value after each retransmit. Record
the updated timeout value in the trace log.
Suggested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Recent patches added RPC_TASK_MOVEABLE, XPRT_OFFLINE, and
XPRT_REMOVE. Update the tracepoint display macros to display these
flags properly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: TRACE_DEFINE_ENUM is needed only for enums, not for
C macros.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
qdisc_enqueue tracepoint can work with qdisc:qdisc_dequeue
to measure packets latency in qdisc queues.
Add a new field txq for it, then we can retrieve more info.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix type of bind option flag in af_xdp, from Baruch Siach.
2) Fix use after free in bpf_xdp_link_release(), from Xuan Zhao.
3) PM refcnt imbakance in r8152, from Takashi Iwai.
4) Sign extension ug in liquidio, from Colin Ian King.
5) Mising range check in s390 bpf jit, from Colin Ian King.
6) Uninit value in caif_seqpkt_sendmsg(), from Ziyong Xuan.
7) Fix skb page recycling race, from Ilias Apalodimas.
8) Fix memory leak in tcindex_partial_destroy_work, from Pave Skripkin.
9) netrom timer sk refcnt issues, from Nguyen Dinh Phi.
10) Fix data races aroun tcp's tfo_active_disable_stamp, from Eric
Dumazet.
11) act_skbmod should only operate on ethernet packets, from Peilin Ye.
12) Fix slab out-of-bpunds in fib6_nh_flush_exceptions(),, from Psolo
Abeni.
13) Fix sparx5 dependencies, from Yajun Deng.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
dpaa2-switch: seed the buffer pool after allocating the swp
net: sched: cls_api: Fix the the wrong parameter
net: sparx5: fix unmet dependencies warning
net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum
net: dsa: ensure linearized SKBs in case of tail taggers
ravb: Remove extra TAB
ravb: Fix a typo in comment
net: dsa: sja1105: make VID 4095 a bridge VLAN too
tcp: disable TFO blackhole logic by default
sctp: do not update transport pathmtu if SPP_PMTUD_ENABLE is not set
net: ixp46x: fix ptp build failure
ibmvnic: Remove the proper scrq flush
selftests: net: add ESP-in-UDP PMTU test
udp: check encap socket in __udp_lib_err
sctp: update active_key for asoc when old key is being replaced
r8169: Avoid duplicate sysfs entry creation error
ixgbe: Fix packet corruption due to missing DMA sync
Revert "qed: fix possible unpaired spin_{un}lock_bh in _qed_mcp_cmd_and_union()"
ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions
fsl/fman: Add fibre support
...
To quote Alexey[1]:
I was adding custom tracepoint to the kernel, grabbed full F34 kernel
.config, disabled modules and booted whole shebang as VM kernel.
Then did
perf record -a -e ...
It crashed:
general protection fault, probably for non-canonical address 0x435f5346592e4243: 0000 [#1] SMP PTI
CPU: 1 PID: 842 Comm: cat Not tainted 5.12.6+ #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
RIP: 0010:t_show+0x22/0xd0
Then reproducer was narrowed to
# cat /sys/kernel/tracing/printk_formats
Original F34 kernel with modules didn't crash.
So I started to disable options and after disabling AFS everything
started working again.
The root cause is that AFS was placing char arrays content into a
section full of _pointers_ to strings with predictable consequences.
Non canonical address 435f5346592e4243 is "CB.YFS_" which came from
CM_NAME macro.
Steps to reproduce:
CONFIG_AFS=y
CONFIG_TRACING=y
# cat /sys/kernel/tracing/printk_formats
Fix this by the following means:
(1) Add enum->string translation tables in the event header with the AFS
and YFS cache/callback manager operations listed by RPC operation ID.
(2) Modify the afs_cb_call tracepoint to print the string from the
translation table rather than using the string at the afs_call name
pointer.
(3) Switch translation table depending on the service we're being accessed
as (AFS or YFS) in the tracepoint print clause. Will this cause
problems to userspace utilities?
Note that the symbolic representation of the YFS service ID isn't
available to this header, so I've put it in as a number. I'm not sure
if this is the best way to do this.
(4) Remove the name wrangling (CM_NAME) macro and put the names directly
into the afs_call_type structs in cmservice.c.
Fixes: 8e8d7f13b6 ("afs: Add some tracepoints")
Reported-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YLAXfvZ+rObEOdc%2F@localhost.localdomain/ [1]
Link: https://lore.kernel.org/r/643721.1623754699@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162430903582.2896199.6098150063997983353.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162609463957.3133237.15916579353149746363.stgit@warthog.procyon.org.uk/ # v1 (repost)
Link: https://lore.kernel.org/r/162610726860.3408253.445207609466288531.stgit@warthog.procyon.org.uk/ # v2
Tracepoint trace_qdisc_enqueue() is introduced to trace skb at
the entrance of TC layer on TX side. This is similar to
trace_qdisc_dequeue():
1. For both we only trace successful cases. The failure cases
can be traced via trace_kfree_skb().
2. They are called at entrance or exit of TC layer, not for each
->enqueue() or ->dequeue(). This is intentional, because
we want to make trace_qdisc_enqueue() symmetric to
trace_qdisc_dequeue(), which is easier to use.
The return value of qdisc_enqueue() is not interesting here,
we have Qdisc's drop packets in ->dequeue(), it is impossible to
trace them even if we have the return value, the only way to trace
them is tracing kfree_skb().
We only add information we need to trace ring buffer. If any other
information is needed, it is easy to extend it without breaking ABI,
see commit 3dd344ea84 ("net: tracepoint: exposing sk_family in all
tcp:tracepoints").
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Qitao Xu <qitao.xu@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Print format of skbaddr is changed to %px from %p, because we want
to use skb address as a quick way to identify a packet.
Note, trace ring buffer is only accessible to privileged users,
it is safe to use a real kernel address here.
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Qitao Xu <qitao.xu@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The print format of skb adress in tracepoint class net_dev_template
is changed to %px from %p, because we want to use skb address
as a quick way to identify a packet.
Note, trace ring buffer is only accessible to privileged users,
it is safe to use a real kernel address here.
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Qitao Xu <qitao.xu@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull RCU updates from Paul McKenney:
- Bitmap parsing support for "all" as an alias for all bits
- Documentation updates
- Miscellaneous fixes, including some that overlap into mm and lockdep
- kvfree_rcu() updates
- mem_dump_obj() updates, with acks from one of the slab-allocator
maintainers
- RCU NOCB CPU updates, including limited deoffloading
- SRCU updates
- Tasks-RCU updates
- Torture-test updates
* 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
rcu: Add missing __releases() annotation
rcu: Remove obsolete rcu_read_unlock() deadlock commentary
rcu: Improve comments describing RCU read-side critical sections
rcu: Create an unrcu_pointer() to remove __rcu from a pointer
srcu: Early test SRCU polling start
rcu: Fix various typos in comments
rcu/nocb: Unify timers
rcu/nocb: Prepare for fine-grained deferred wakeup
rcu/nocb: Only cancel nocb timer if not polling
rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
rcu/nocb: Allow de-offloading rdp leader
rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
rcu: Don't penalize priority boosting when there is nothing to boost
rcu: Point to documentation of ordering guarantees
rcu: Make rcu_gp_cleanup() be noinline for tracing
rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
...
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts, softirqs
and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail what
sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event.
This has been hardcoded as "1" since 2015, no tooling should be looking
at it now. If one exists, we can revert this commit, fix that tool and
try to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes trace
events to write to console. When user space starts, this can easily live
lock the system. Having a boot option to stop just after boot up is
useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that match
the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path from
user space that can do this. All other paths are considered a bug.
- Small clean ups and fixes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYN8YPhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhxLAP9Mo5hHv7Hg6W7Ddv77rThm+qclsMR/
yW0P+eJpMm4+xAD8Cq03oE1DimPK+9WZBKU5rSqAkqG6CjgDRw6NlIszzQQ=
=WEPR
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts,
softirqs and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail
what sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event. This has
been hardcoded as "1" since 2015, no tooling should be looking at it
now. If one exists, we can revert this commit, fix that tool and try
to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes
trace events to write to console. When user space starts, this can
easily live lock the system. Having a boot option to stop just after
boot up is useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that
match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path
from user space that can do this. All other paths are considered a
bug.
- Small clean ups and fixes
* tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
tracing: Simplify & fix saved_tgids logic
treewide: Add missing semicolons to __assign_str uses
tracing: Change variable type as bool for clean-up
trace/timerlat: Fix indentation on timerlat_main()
trace/osnoise: Make 'noise' variable s64 in run_osnoise()
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
Documentation: Fix a typo on trace/osnoise-tracer
trace/osnoise: Fix return value on osnoise_init_hotplug_support
trace/osnoise: Make interval u64 on osnoise_main
trace/osnoise: Fix 'no previous prototype' warnings
tracing: Have osnoise_main() add a quiescent state for task rcu
seq_buf: Make trace_seq_putmem_hex() support data longer than 8
seq_buf: Fix overflow in seq_buf_putmem_hex()
trace/osnoise: Support hotplug operations
trace/hwlat: Support hotplug operations
trace/hwlat: Protect kdata->kthread with get/put_online_cpus
trace: Add timerlat tracer
trace: Add osnoise tracer
...
This series consists of the usual driver updates (ufs, ibmvfc,
megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
elx and mpi3mr being new drivers. The major core change is a rework
to drop the status byte handling macros and the old bit shifted
definitions and the rest of the updates are minor fixes.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYN7I6iYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishXpRAQCkngYZ
35yQrqOxgOk2pfrysE95tHrV1MfJm2U49NFTwAEAuZutEvBUTfBF+sbcJ06r6q7i
H0hkJN/Io7enFs5v3WA=
=zwIa
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (ufs, ibmvfc,
megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
elx and mpi3mr being new drivers.
The major core change is a rework to drop the status byte handling
macros and the old bit shifted definitions and the rest of the updates
are minor fixes"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (287 commits)
scsi: aha1740: Avoid over-read of sense buffer
scsi: arcmsr: Avoid over-read of sense buffer
scsi: ips: Avoid over-read of sense buffer
scsi: ufs: ufs-mediatek: Add missing of_node_put() in ufs_mtk_probe()
scsi: elx: libefc: Fix IRQ restore in efc_domain_dispatch_frame()
scsi: elx: libefc: Fix less than zero comparison of a unsigned int
scsi: elx: efct: Fix pointer error checking in debugfs init
scsi: elx: efct: Fix is_originator return code type
scsi: elx: efct: Fix link error for _bad_cmpxchg
scsi: elx: efct: Eliminate unnecessary boolean check in efct_hw_command_cancel()
scsi: elx: efct: Do not use id uninitialized in efct_lio_setup_session()
scsi: elx: efct: Fix error handling in efct_hw_init()
scsi: elx: efct: Remove redundant initialization of variable lun
scsi: elx: efct: Fix spelling mistake "Unexected" -> "Unexpected"
scsi: lpfc: Fix build error in lpfc_scsi.c
scsi: target: iscsi: Remove redundant continue statement
scsi: qla4xxx: Remove redundant continue statement
scsi: ppa: Switch to use module_parport_driver()
scsi: imm: Switch to use module_parport_driver()
scsi: mpt3sas: Fix error return value in _scsih_expander_add()
...
Including:
- SMMU Updates from Will Deacon:
- SMMUv3: Support stalling faults for platform devices
- SMMUv3: Decrease defaults sizes for the event and PRI queues
- SMMUv2: Support for a new '->probe_finalize' hook, needed by Nvidia
- SMMUv2: Even more Qualcomm compatible strings
- SMMUv2: Avoid Adreno TTBR1 quirk for DB820C platform
- Intel VT-d updates from Lu Baolu:
- Convert Intel IOMMU to use sva_lib helpers in iommu core
- ftrace and debugfs supports for page fault handling
- Support asynchronous nested capabilities
- Various misc cleanups
- Support for new VIOT ACPI table to make the VirtIO IOMMU:
available on x86
- Add the amd_iommu=force_enable command line option to
enable the IOMMU on platforms where they are known to cause
problems
- Support for version 2 of the Rockchip IOMMU
- Various smaller fixes, cleanups and refactorings
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmDexqwACgkQK/BELZcB
GuOy/w//cr331yKDZDF8DSkWxUHYNXitBlW12nShgYselhlREb5mB1vmpOHuWKus
K++w7tWA19/qMs/p7yPoS+zCEA3xiZjO+OgyjxTzkxfaQD4GMmP7hK+ItRCXiz9E
6QZXLOqexniydpa+KEg3rewibcrJwgIpz6QHT8FBrMISiEPRUw5oLeytv6rNjPWx
WyBRNA+TjNvnybFbWp9gTdgWCshygJMv1WlU7ySZcH45Mq4VKxS4Kihe1fTLp38s
vBqfRuUHhYcuNmExgjBuK3y8dq7hU8XelKjSm2yvp9srGbhD0NFT1Ng7iZQj43bi
Eh2Ic8O9miBvm/uJ0ug6PGcEUcdfoHf/PIMqLZMRBj79+9RKxNBzWOAkBd2RwH3A
Wy98WdOsX4+3MB5EKznxnIQMMA5Rtqyy/gLDd5j4xZnL5Ha3j0oQ9WClD+2iMfpV
v150GXNOKNDNNjlzXulBxNYzUOK8KKxse9OPg8YDevZPEyVNPH2yXZ6xjoe7SYCW
FGhaHXdCfRxUk9lsQNtb23CNTKl7Qd6JOOJLZqAfpWzQjQu4wB4kfbnv0zEhMAvi
XxVqijV/XTWyVXe3HkzByHnqQxrn7YKbC/Ql/BMmk8kbwFvpJbFIaTLvRVsBnE2r
8Edn5Cjuz4CC83YPQ3EBQn85TpAaUNMDRMc1vz5NrQTqJTzYZag=
=IxNB
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- SMMU Updates from Will Deacon:
- SMMUv3:
- Support stalling faults for platform devices
- Decrease defaults sizes for the event and PRI queues
- SMMUv2:
- Support for a new '->probe_finalize' hook, needed by Nvidia
- Even more Qualcomm compatible strings
- Avoid Adreno TTBR1 quirk for DB820C platform
- Intel VT-d updates from Lu Baolu:
- Convert Intel IOMMU to use sva_lib helpers in iommu core
- ftrace and debugfs supports for page fault handling
- Support asynchronous nested capabilities
- Various misc cleanups
- Support for new VIOT ACPI table to make the VirtIO IOMMU
available on x86
- Add the amd_iommu=force_enable command line option to enable
the IOMMU on platforms where they are known to cause problems
- Support for version 2 of the Rockchip IOMMU
- Various smaller fixes, cleanups and refactorings
* tag 'iommu-updates-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits)
iommu/virtio: Enable x86 support
iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()
ACPI: Add driver for the VIOT table
ACPI: Move IOMMU setup code out of IORT
ACPI: arm64: Move DMA setup operations out of IORT
iommu/vt-d: Fix dereference of pointer info before it is null checked
iommu: Update "iommu.strict" documentation
iommu/arm-smmu: Check smmu->impl pointer before dereferencing
iommu/arm-smmu-v3: Remove unnecessary oom message
iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation
iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails
iommu/vt-d: Fix linker error on 32-bit
iommu/vt-d: No need to typecast
iommu/vt-d: Define counter explicitly as unsigned int
iommu/vt-d: Remove unnecessary braces
iommu/vt-d: Removed unused iommu_count in dmar domain
iommu/vt-d: Use bitfields for DMAR capabilities
iommu/vt-d: Use DEVICE_ATTR_RO macro
iommu/vt-d: Fix out-bounds-warning in intel/svm.c
iommu/vt-d: Add PRQ handling latency sampling
...
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
mm_vmscan_inactive_list_is_low has no users after commit b91ac37434
("mm: vmscan: enforce inactive:active ratio at the reclaim root").
Remove it.
Link: https://lkml.kernel.org/r/20210614194554.2683395-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext4 in 5.14:
- Allow applications to poll on changes to /sys/fs/ext4/*/errors_count
- Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
checkpointed, truncated and discarded or zero'ed.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmDcjRgACgkQ8vlZVpUN
gaMAMQgAjRYUQ+tdJVZzInFwukudhgLyuCP9AdCx76fisaH22yNCakQ7M2XGz59i
/YbJerLaueYpHZzpA9p5+sSjVhMwILO3scBSJbOwdsbrFAsFLzcgQKQhGGqK2KvX
IAOEArC8/hm1wnVb7sfQYdBHlWyeJpI8hd/8WZPlYtySlRnP1TZCd+X7y7lmNs1H
QU1KECwstI2t8Lug0QeKx2B9PI9AWcCs0lTJ4LfcANZAh3HIJi9aUCk4SFDRkf3/
8AazvMqTHJD9yc+BNyZOro2ykDFCStkNqf0cDYTzvKrr66CHScPUtyI0oAEdspxN
+SNNARPGZgNOuR3ZRbGivtwgEB+GpQ==
=jSd4
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"In addition to bug fixes and cleanups, there are two new features for
ext4 in 5.14:
- Allow applications to poll on changes to
/sys/fs/ext4/*/errors_count
- Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
checkpointed, truncated and discarded or zero'ed"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
jbd2: export jbd2_journal_[un]register_shrinker()
ext4: notify sysfs on errors_count value change
fs: remove bdev_try_to_free_page callback
ext4: remove bdev_try_to_free_page() callback
jbd2: simplify journal_clean_one_cp_list()
jbd2,ext4: add a shrinker to release checkpointed buffers
jbd2: remove redundant buffer io error checks
jbd2: don't abort the journal when freeing buffers
jbd2: ensure abort the journal if detect IO error when writing original buffer back
jbd2: remove the out label in __jbd2_journal_remove_checkpoint()
ext4: no need to verify new add extent block
jbd2: clean up misleading comments for jbd2_fc_release_bufs
ext4: add check to prevent attempting to resize an fs with sparse_super2
ext4: consolidate checks for resize of bigalloc into ext4_resize_begin
ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()
ext4: fsmap: fix the block/inode bitmap comment
ext4: fix comment for s_hash_unsigned
ext4: use local variable ei instead of EXT4_I() macro
ext4: fix avefreec in find_group_orlov
ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
...
Core:
- BPF:
- add syscall program type and libbpf support for generating
instructions and bindings for in-kernel BPF loaders (BPF loaders
for BPF), this is a stepping stone for signed BPF programs
- infrastructure to migrate TCP child sockets from one listener
to another in the same reuseport group/map to improve flexibility
of service hand-off/restart
- add broadcast support to XDP redirect
- allow bypass of the lockless qdisc to improving performance
(for pktgen: +23% with one thread, +44% with 2 threads)
- add a simpler version of "DO_ONCE()" which does not require
jump labels, intended for slow-path usage
- virtio/vsock: introduce SOCK_SEQPACKET support
- add getsocketopt to retrieve netns cookie
- ip: treat lowest address of a IPv4 subnet as ordinary unicast address
allowing reclaiming of precious IPv4 addresses
- ipv6: use prandom_u32() for ID generation
- ip: add support for more flexible field selection for hashing
across multi-path routes (w/ offload to mlxsw)
- icmp: add support for extended RFC 8335 PROBE (ping)
- seg6: add support for SRv6 End.DT46 behavior
- mptcp:
- DSS checksum support (RFC 8684) to detect middlebox meddling
- support Connection-time 'C' flag
- time stamping support
- sctp: packetization Layer Path MTU Discovery (RFC 8899)
- xfrm: speed up state addition with seq set
- WiFi:
- hidden AP discovery on 6 GHz and other HE 6 GHz improvements
- aggregation handling improvements for some drivers
- minstrel improvements for no-ack frames
- deferred rate control for TXQs to improve reaction times
- switch from round robin to virtual time-based airtime scheduler
- add trace points:
- tcp checksum errors
- openvswitch - action execution, upcalls
- socket errors via sk_error_report
Device APIs:
- devlink: add rate API for hierarchical control of max egress rate
of virtual devices (VFs, SFs etc.)
- don't require RCU read lock to be held around BPF hooks
in NAPI context
- page_pool: generic buffer recycling
New hardware/drivers:
- mobile:
- iosm: PCIe Driver for Intel M.2 Modem
- support for Qualcomm MSM8998 (ipa)
- WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
- sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
- Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
- NXP SJA1110 Automotive Ethernet 10-port switch
- Qualcomm QCA8327 switch support (qca8k)
- Mikrotik 10/25G NIC (atl1c)
Driver changes:
- ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP
(our first foray into MAC/PHY description via ACPI)
- HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
- Mellanox/Nvidia NIC (mlx5)
- NIC VF offload of L2 bridging
- support IRQ distribution to Sub-functions
- Marvell (prestera):
- add flower and match all
- devlink trap
- link aggregation
- Netronome (nfp): connection tracking offload
- Intel 1GE (igc): add AF_XDP support
- Marvell DPU (octeontx2): ingress ratelimit offload
- Google vNIC (gve): new ring/descriptor format support
- Qualcomm mobile (rmnet & ipa): inline checksum offload support
- MediaTek WiFi (mt76)
- mt7915 MSI support
- mt7915 Tx status reporting
- mt7915 thermal sensors support
- mt7921 decapsulation offload
- mt7921 enable runtime pm and deep sleep
- Realtek WiFi (rtw88)
- beacon filter support
- Tx antenna path diversity support
- firmware crash information via devcoredump
- Qualcomm 60GHz WiFi (wcn36xx)
- Wake-on-WLAN support with magic packets and GTK rekeying
- Micrel PHY (ksz886x/ksz8081): add cable test support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDb+fUACgkQMUZtbf5S
Irs2Jg//aqN0Q8CgIvYCVhPxQw1tY7pTAbgyqgBZ01vwjyvtIOgJiWzSfFEU84mX
M8fcpFX5eTKrOyJ9S6UFfQ/JG114n3hjAxFFT4Hxk2gC1Tg0vHuFQTDHcUl28bUE
mTm61e1YpdorILnv2k5JVQ/wu0vs5QKDrjcYcrcPnh+j93wvnPOgAfDBV95nZzjS
OTt4q2fR8GzLcSYWWsclMbDNkzyTG50RW/0Yd6aGjr5QGvXfrMeXfUJNz533PMf/
w5lNyjRKv+x9mdTZJzU0+msNUrZgUdRz7W8Ey8lD3hJZRE+D6/uU7FtsE8Mi3+uc
HWxeZUyzA3YF1MfVl/eesbxyPT7S/OkLzk4O5B35FbqP0YltaP+bOjq1/nM3ce1/
io9Dx9pIl/2JANUgRCAtLi8Z2dkvRoqTaBxZ/nPudCCljFwDwl6joTMJ7Ow22i5Y
5aIkcXFmZq4LbJDiHvbTlqT7yiuaEvu2UK/23bSIg/K3nF4eAmkY9Y1EgiMf60OF
78Ttw0wk2tUegwaS5MZnCniKBKDyl9gM2F6rbZ/IxQRR2LTXFc1B6gC+ynUxgXfh
Ub8O++6qGYGYZ0XvQH4pzco79p3qQWBTK5beIp2eu6BOAjBVIXq4AibUfoQLACsu
hX7jMPYd0kc3WFgUnKgQP8EnjFSwbf4XiaE7fIXvWBY8hzCw2h4=
=LvtX
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- BPF:
- add syscall program type and libbpf support for generating
instructions and bindings for in-kernel BPF loaders (BPF loaders
for BPF), this is a stepping stone for signed BPF programs
- infrastructure to migrate TCP child sockets from one listener to
another in the same reuseport group/map to improve flexibility
of service hand-off/restart
- add broadcast support to XDP redirect
- allow bypass of the lockless qdisc to improving performance (for
pktgen: +23% with one thread, +44% with 2 threads)
- add a simpler version of "DO_ONCE()" which does not require jump
labels, intended for slow-path usage
- virtio/vsock: introduce SOCK_SEQPACKET support
- add getsocketopt to retrieve netns cookie
- ip: treat lowest address of a IPv4 subnet as ordinary unicast
address allowing reclaiming of precious IPv4 addresses
- ipv6: use prandom_u32() for ID generation
- ip: add support for more flexible field selection for hashing
across multi-path routes (w/ offload to mlxsw)
- icmp: add support for extended RFC 8335 PROBE (ping)
- seg6: add support for SRv6 End.DT46 behavior
- mptcp:
- DSS checksum support (RFC 8684) to detect middlebox meddling
- support Connection-time 'C' flag
- time stamping support
- sctp: packetization Layer Path MTU Discovery (RFC 8899)
- xfrm: speed up state addition with seq set
- WiFi:
- hidden AP discovery on 6 GHz and other HE 6 GHz improvements
- aggregation handling improvements for some drivers
- minstrel improvements for no-ack frames
- deferred rate control for TXQs to improve reaction times
- switch from round robin to virtual time-based airtime scheduler
- add trace points:
- tcp checksum errors
- openvswitch - action execution, upcalls
- socket errors via sk_error_report
Device APIs:
- devlink: add rate API for hierarchical control of max egress rate
of virtual devices (VFs, SFs etc.)
- don't require RCU read lock to be held around BPF hooks in NAPI
context
- page_pool: generic buffer recycling
New hardware/drivers:
- mobile:
- iosm: PCIe Driver for Intel M.2 Modem
- support for Qualcomm MSM8998 (ipa)
- WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
- sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
- Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
- NXP SJA1110 Automotive Ethernet 10-port switch
- Qualcomm QCA8327 switch support (qca8k)
- Mikrotik 10/25G NIC (atl1c)
Driver changes:
- ACPI support for some MDIO, MAC and PHY devices from Marvell and
NXP (our first foray into MAC/PHY description via ACPI)
- HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
- Mellanox/Nvidia NIC (mlx5)
- NIC VF offload of L2 bridging
- support IRQ distribution to Sub-functions
- Marvell (prestera):
- add flower and match all
- devlink trap
- link aggregation
- Netronome (nfp): connection tracking offload
- Intel 1GE (igc): add AF_XDP support
- Marvell DPU (octeontx2): ingress ratelimit offload
- Google vNIC (gve): new ring/descriptor format support
- Qualcomm mobile (rmnet & ipa): inline checksum offload support
- MediaTek WiFi (mt76)
- mt7915 MSI support
- mt7915 Tx status reporting
- mt7915 thermal sensors support
- mt7921 decapsulation offload
- mt7921 enable runtime pm and deep sleep
- Realtek WiFi (rtw88)
- beacon filter support
- Tx antenna path diversity support
- firmware crash information via devcoredump
- Qualcomm WiFi (wcn36xx)
- Wake-on-WLAN support with magic packets and GTK rekeying
- Micrel PHY (ksz886x/ksz8081): add cable test support"
* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
tcp: change ICSK_CA_PRIV_SIZE definition
tcp_yeah: check struct yeah size at compile time
gve: DQO: Fix off by one in gve_rx_dqo()
stmmac: intel: set PCI_D3hot in suspend
stmmac: intel: Enable PHY WOL option in EHL
net: stmmac: option to enable PHY WOL with PMT enabled
net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
net: use netdev_info in ndo_dflt_fdb_{add,del}
ptp: Set lookup cookie when creating a PTP PPS source.
net: sock: add trace for socket errors
net: sock: introduce sk_error_report
net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
net: dsa: include fdb entries pointing to bridge in the host fdb list
net: dsa: include bridge addresses which are local in the host fdb list
net: dsa: sync static FDB entries on foreign interfaces to hardware
net: dsa: install the host MDB and FDB entries in the master's RX filter
net: dsa: reference count the FDB addresses at the cross-chip notifier level
net: dsa: introduce a separate cross-chip notifier type for host FDBs
net: dsa: reference count the MDB entries at the cross-chip notifier level
...
The __assign_str macro has an unusual ending semicolon but the vast
majority of uses of the macro already have semicolon termination.
$ git grep -P '\b__assign_str\b' | wc -l
551
$ git grep -P '\b__assign_str\b.*;' | wc -l
480
Add semicolons to the __assign_str() uses without semicolon termination
and all the other uses without semicolon termination via additional defines
that are equivalent to __assign_str() with the eventual goal of removing
the semicolon from the __assign_str() macro definition.
Link: https://lore.kernel.org/lkml/1e068d21106bb6db05b735b4916bb420e6c9842a.camel@perches.com/
Link: https://lkml.kernel.org/r/48a056adabd8f70444475352f617914cef504a45.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge misc updates from Andrew Morton:
"191 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
mm,hwpoison: make get_hwpoison_page() call get_any_page()
mm,hwpoison: send SIGBUS with error virutal address
mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
docs: remove description of DISCONTIGMEM
arch, mm: remove stale mentions of DISCONIGMEM
mm: remove CONFIG_DISCONTIGMEM
m68k: remove support for DISCONTIGMEM
arc: remove support for DISCONTIGMEM
arc: update comment about HIGHMEM implementation
alpha: remove DISCONTIGMEM and NUMA
mm/page_alloc: move free_the_page
mm/page_alloc: fix counting of managed_pages
mm/page_alloc: improve memmap_pages dbg msg
mm: drop SECTION_SHIFT in code comments
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
mm/page_alloc: scale the number of pages that are batch freed
...
This patch will add tracers to trace inet socket errors only. A user
space monitor application can track connection errors indepedent from
socket lifetime and do additional handling. For example a cluster
manager can fence a node if errors occurs in a specific heuristic.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some trace event formats print PFNs as hex while others print them as
decimal. This is rather annoying when attempting to grep through traces
to understand what's going on with a particular page.
$ git grep -ho 'pfn=[0x%lu]\+' include/trace/events/ | sort | uniq -c
11 pfn=0x%lx
12 pfn=%lu
2 pfn=%lx
Printing as hex is in the majority in the trace events, and all the normal
printks in mm/ also print PFNs as hex, so change all the PFN formats in
the trace events to use 0x%lx.
Link: https://lkml.kernel.org/r/20210602092608.1493-1-vincent.whitchurch@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDZ4TwACgkQxWXV+ddt
WDvWCQ/8Dgnk+FBC25JOkqgu29VZtvhfWkY1poDRuG+tca6VeMMnDbPgnTQFyeS1
38F4uNNi/F5UdFuLz3RK0jYgGFKXTp+sFjavFuXeJQpFxe7VSu7JrilZPaA1Dti8
E8Dp42ilrHDikDbZaT8JB9GSnR7a8tHnIs0RfZSIkHsd+rPs7QPtM0TTzEZyLHqH
2uYoVyd5EvclvM5JLVGxRZ3lTU64zfZlJg+TnoAkBpilqUHqpD+x5cEoNYbdhbAb
j3sF11h/zEa/wmU5w5LRd4Qvl3JygCrnAo+6VAxB/u0yzJnH+UwOEJdDDeUpB/9k
2F/Zy69CUQ7DdXM+Es4TOfAyQ9fpPLt8Z96GIBrdD5BxWbam4pyU5xH4cDPNpsHo
zRCepdU1zwD6z3cfEYKmUAx89ewC8SE8XlUOWiGun4pBKdi3tgwcrytTnu+02JND
mEkP4vTWG2bU+S0Si0u/aAKHcFvOwiY9iHM9tmblVvvlSFYrhFAclsytihPwu9NQ
d9FRQMo9JZbQZXqaWpcmd8eXACz9+5AulIhofpuZLciyhvWpL+CQ+xGNnzJ1DnTH
ct0m+ByFb33bTpAnblkgCMQa9xuwlM57NxvIclRaDPXWipqyZReih9fbF1TkHbXQ
0dkrKe8cHn9w+DI1Hs1Hu1zdD7WJJxNMY2x9MowMU9gDVNBbbVs=
=htVu
-----END PGP SIGNATURE-----
Merge tag 'for-5.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"A normal mix of improvements, core changes and features that user have
been missing or complaining about.
User visible changes:
- new sysfs exports:
- add sysfs knob to limit scrub IO bandwidth per device
- device stats are also available in
/sys/fs/btrfs/FSID/devinfo/DEVID/error_stats
- support cancellable resize and device delete ioctls
- change how the empty value is interpreted when setting a property,
so far we have only 'btrfs.compression' and we need to distinguish
a reset to defaults and setting "do not compress", in general the
empty value will always mean 'reset to defaults' for any other
property, for compression it's either 'no' or 'none' to forbid
compression
Performance improvements:
- no need for full sync when truncation does not touch extents,
reported run time change is -12%
- avoid unnecessary logging of xattrs during fast fsyncs (+17%
throughput, -17% runtime on xattr stress workload)
Core:
- preemptive flushing improvements and fixes
- adjust clamping logic on multi-threaded workloads to avoid
flushing too soon
- take into account global block reserve, may help on almost full
filesystems
- continue flushing when there are enough pending delalloc and
ordered bytes
- simplify logic around conditional transaction commit, a workaround
used in the past for throttling that's been superseded by ticket
reservations that manage the throttling in a better way
- subpage blocksize preparation:
- submit read time repair only for each corrupted sector
- scrub repair now works with sectors and not pages
- free space cache (v1) works with sectors and not pages
- more fine grained bio tracking for extents
- subpage support in page callbacks, extent callbacks, end io
callbacks
- simplify transaction abort logic and always abort and don't check
various potentially unreliable stats tracked by the transaction
- exclusive operations can do more checks when started and allow eg.
cancellation of the same running operation
- ensure relocation never runs while we have send operations running,
e.g. when zoned background auto reclaim starts
Fixes:
- zoned: more sanity checks of write pointer
- improve error handling in delayed inodes
- send:
- fix invalid path for unlink operations after parent
orphanization
- fix crash when memory allocations trigger reclaim
- skip compression of we have only one page (can't make things
better)
- empty value of a property newly means reset to default
Other:
- lots of cleanups, comment updates, yearly typo fixing
- disable build on platforms having page size 256K"
* tag 'for-5.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (101 commits)
btrfs: remove unused btrfs_fs_info::total_pinned
btrfs: rip out btrfs_space_info::total_bytes_pinned
btrfs: rip the first_ticket_bytes logic from fail_all_tickets
btrfs: remove FLUSH_DELAYED_REFS from data ENOSPC flushing
btrfs: rip out may_commit_transaction
btrfs: send: fix crash when memory allocations trigger reclaim
btrfs: ensure relocation never runs while we have send operations running
btrfs: shorten integrity checker extent data mount option
btrfs: switch mount option bits to enums and use wider type
btrfs: props: change how empty value is interpreted
btrfs: compression: don't try to compress if we don't have enough pages
btrfs: fix unbalanced unlock in qgroup_account_snapshot()
btrfs: sysfs: export dev stats in devinfo directory
btrfs: fix typos in comments
btrfs: remove a stale comment for btrfs_decompress_bio()
btrfs: send: use list_move_tail instead of list_del/list_add_tail
btrfs: disable build on platforms having page size 256K
btrfs: send: fix invalid path for unlink operations after parent orphanization
btrfs: inline wait_current_trans_commit_start in its caller
btrfs: sink wait_for_unblock parameter to async commit
...
- Optimise SVE switching for CPUs with 128-bit implementations.
- Fix output format from SVE selftest.
- Add support for versions v1.2 and 1.3 of the SMC calling convention.
- Allow Pointer Authentication to be configured independently for
kernel and userspace.
- PMU driver cleanups for managing IRQ affinity and exposing event
attributes via sysfs.
- KASAN optimisations for both hardware tagging (MTE) and out-of-line
software tagging implementations.
- Relax frame record alignment requirements to facilitate 8-byte
alignment with KASAN and Clang.
- Cleanup of page-table definitions and removal of unused memory types.
- Reduction of ARCH_DMA_MINALIGN back to 64 bytes.
- Refactoring of our instruction decoding routines and addition of some
missing encodings.
- Move entry code moved into C and hardened against harmful compiler
instrumentation.
- Update booting requirements for the FEAT_HCX feature, added to v8.7
of the architecture.
- Fix resume from idle when pNMI is being used.
- Additional CPU sanity checks for MTE and preparatory changes for
systems where not all of the CPUs support 32-bit EL0.
- Update our kernel string routines to the latest Cortex Strings
implementation.
- Big cleanup of our cache maintenance routines, which were confusingly
named and inconsistent in their implementations.
- Tweak linker flags so that GDB can understand vmlinux when using RELR
relocations.
- Boot path cleanups to enable early initialisation of per-cpu
operations needed by KCSAN.
- Non-critical fixes and miscellaneous cleanup.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmDUh1YQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNDaUCAC+2Jy2Yopd94uBPYajGybM0rqCUgE7b5n1
A7UzmQ6fia2hwqCPmxGG+sRabovwN7C1bKrUCc03RIbErIa7wum1edeyqmF/Aw44
DUDY1MAOSZaFmX8L62QCvxG1hfdLPtGmHMd1hdXvxYK7PCaigEFnzbLRWTtgE+Ok
JhdvNfsoeITJObHnvYPF3rV3NAbyYni9aNJ5AC/qb3dlf6XigEraXaMj29XHKfwc
+vmn+25oqFkLHyFeguqIoK+vUQAy/8TjFfjX83eN3LZknNhDJgWS1Iq1Nm+Vxt62
RvDUUecWJjAooCWgmil6pt0enI+q6E8LcX3A3cWWrM6psbxnYzkU
=I6KS
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's a reasonable amount here and the juicy details are all below.
It's worth noting that the MTE/KASAN changes strayed outside of our
usual directories due to core mm changes and some associated changes
to some other architectures; Andrew asked for us to carry these [1]
rather that take them via the -mm tree.
Summary:
- Optimise SVE switching for CPUs with 128-bit implementations.
- Fix output format from SVE selftest.
- Add support for versions v1.2 and 1.3 of the SMC calling
convention.
- Allow Pointer Authentication to be configured independently for
kernel and userspace.
- PMU driver cleanups for managing IRQ affinity and exposing event
attributes via sysfs.
- KASAN optimisations for both hardware tagging (MTE) and out-of-line
software tagging implementations.
- Relax frame record alignment requirements to facilitate 8-byte
alignment with KASAN and Clang.
- Cleanup of page-table definitions and removal of unused memory
types.
- Reduction of ARCH_DMA_MINALIGN back to 64 bytes.
- Refactoring of our instruction decoding routines and addition of
some missing encodings.
- Move entry code moved into C and hardened against harmful compiler
instrumentation.
- Update booting requirements for the FEAT_HCX feature, added to v8.7
of the architecture.
- Fix resume from idle when pNMI is being used.
- Additional CPU sanity checks for MTE and preparatory changes for
systems where not all of the CPUs support 32-bit EL0.
- Update our kernel string routines to the latest Cortex Strings
implementation.
- Big cleanup of our cache maintenance routines, which were
confusingly named and inconsistent in their implementations.
- Tweak linker flags so that GDB can understand vmlinux when using
RELR relocations.
- Boot path cleanups to enable early initialisation of per-cpu
operations needed by KCSAN.
- Non-critical fixes and miscellaneous cleanup"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (150 commits)
arm64: tlb: fix the TTL value of tlb_get_level
arm64: Restrict undef hook for cpufeature registers
arm64/mm: Rename ARM64_SWAPPER_USES_SECTION_MAPS
arm64: insn: avoid circular include dependency
arm64: smp: Bump debugging information print down to KERN_DEBUG
drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
perf/arm-cmn: Fix invalid pointer when access dtc object sharing the same IRQ number
arm64: suspend: Use cpuidle context helpers in cpu_suspend()
PSCI: Use cpuidle context helpers in psci_cpu_suspend_enter()
arm64: Convert cpu_do_idle() to using cpuidle context helpers
arm64: Add cpuidle context save/restore helpers
arm64: head: fix code comments in set_cpu_boot_mode_flag
arm64: mm: drop unused __pa(__idmap_text_start)
arm64: mm: fix the count comments in compute_indices
arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
arm64: mm: Pass original fault address to handle_mm_fault()
arm64/mm: Drop SECTION_[SHIFT|SIZE|MASK]
arm64/mm: Use CONT_PMD_SHIFT for ARM64_MEMSTART_SHIFT
arm64/mm: Drop SWAPPER_INIT_MAP_SIZE
arm64: Conditionally configure PTR_AUTH key of the kernel.
...
There is a spelling mistake in a TP_printk message, the word interferences
is not the plural of interference. Fix this.
Link: https://lkml.kernel.org/r/20210628125522.56361-1-colin.king@canonical.com
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In the context of high-performance computing (HPC), the Operating System
Noise (*osnoise*) refers to the interference experienced by an application
due to activities inside the operating system. In the context of Linux,
NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
system. Moreover, hardware-related jobs can also cause noise, for example,
via SMIs.
The osnoise tracer leverages the hwlat_detector by running a similar
loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
the sources of *osnoise* during its execution. Using the same approach
of hwlat, osnoise takes note of the entry and exit point of any
source of interferences, increasing a per-cpu interference counter. The
osnoise tracer also saves an interference counter for each source of
interference. The interference counter for NMI, IRQs, SoftIRQs, and
threads is increased anytime the tool observes these interferences' entry
events. When a noise happens without any interference from the operating
system level, the hardware noise counter increases, pointing to a
hardware-related noise. In this way, osnoise can account for any
source of interference. At the end of the period, the osnoise tracer
prints the sum of all noise, the max single noise, the percentage of CPU
available for the thread, and the counters for the noise sources.
Usage
Write the ASCII text "osnoise" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).
For example::
[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo osnoise > current_tracer
It is possible to follow the trace by reading the trace trace file::
[root@f32 tracing]# cat trace
# tracer: osnoise
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth MAX
# || / SINGLE Interference counters:
# |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+
# TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
# | | | |||| | | | | | | | | | |
<...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1
<...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3
<...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21
<...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0
<...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41
<...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2
<...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1
<...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19
In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
tracer prints a message at the end of each period for each CPU that is
running an osnoise/CPU thread. The osnoise specific fields report:
- The RUNTIME IN USE reports the amount of time in microseconds that
the osnoise thread kept looping reading the time.
- The NOISE IN US reports the sum of noise in microseconds observed
by the osnoise tracer during the associated runtime.
- The % OF CPU AVAILABLE reports the percentage of CPU available for
the osnoise thread during the runtime window.
- The MAX SINGLE NOISE IN US reports the maximum single noise observed
during the runtime window.
- The Interference counters display how many each of the respective
interference happened during the runtime window.
Note that the example above shows a high number of HW noise samples.
The reason being is that this sample was taken on a virtual machine,
and the host interference is detected as a hardware interference.
Tracer options
The tracer has a set of options inside the osnoise directory, they are:
- osnoise/cpus: CPUs at which a osnoise thread will execute.
- osnoise/period_us: the period of the osnoise thread.
- osnoise/runtime_us: how long an osnoise thread will look for noise.
- osnoise/stop_tracing_us: stop the system tracing if a single noise
higher than the configured value happens. Writing 0 disables this
option.
- osnoise/stop_tracing_total_us: stop the system tracing if total noise
higher than the configured value happens. Writing 0 disables this
option.
- tracing_threshold: the minimum delta between two time() reads to be
considered as noise, in us. When set to 0, the default value will
be used, which is currently 5 us.
Additional Tracing
In addition to the tracer, a set of tracepoints were added to
facilitate the identification of the osnoise source.
- osnoise:sample_threshold: printed anytime a noise is higher than
the configurable tolerance_ns.
- osnoise:nmi_noise: noise from NMI, including the duration.
- osnoise:irq_noise: noise from an IRQ, including the duration.
- osnoise:softirq_noise: noise from a SoftIRQ, including the
duration.
- osnoise:thread_noise: noise from a thread, including the duration.
Note that all the values are *net values*. For example, if while osnoise
is running, another thread preempts the osnoise thread, it will start a
thread_noise duration at the start. Then, an IRQ takes place, preempting
the thread_noise, starting a irq_noise. When the IRQ ends its execution,
it will compute its duration, and this duration will be subtracted from
the thread_noise, in such a way as to avoid the double accounting of the
IRQ execution. This logic is valid for all sources of noise.
Here is one example of the usage of these tracepoints::
osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2
In this example, a noise sample of 8 microseconds was reported in the last
line, pointing to two interferences. Looking backward in the trace, the
two previous entries were about the migration thread running after a
timer IRQ execution. The first event is not part of the noise because
it took place one millisecond before.
It is worth noticing that the sum of the duration reported in the
tracepoints is smaller than eight us reported in the sample_threshold.
The reason roots in the overhead of the entry and exit code that happens
before and after any interference execution. This justifies the dual
approach: measuring thread and tracing.
Link: https://lkml.kernel.org/r/e649467042d60e7b62714c9c6751a56299d15119.1624372313.git.bristot@redhat.com
Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Willaims <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[
Made the following functions static:
trace_irqentry_callback()
trace_irqexit_callback()
trace_intel_irqentry_callback()
trace_intel_irqexit_callback()
Added to include/trace.h:
osnoise_arch_register()
osnoise_arch_unregister()
Fixed define logic for LATENCY_FS_NOTIFY
Reported-by: kernel test robot <lkp@intel.com>
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
To have nanosecond output displayed in a more human readable format, its
nicer to convert it to a seconds format (XXX.YYYYYYYYY). The problem is that
to do so, the numbers must be divided by NSEC_PER_SEC, and moded too. But as
these numbers are 64 bit, this can not be done simply with '/' and '%'
operators, but must use do_div() instead.
Instead of performing the expensive do_div() in the hot path of the
tracepoint, it is more efficient to perform it during the output phase. But
passing in do_div() can confuse the parser, and do_div() doesn't work
exactly like a normal C function. It modifies the number in place, and we
don't want to modify the actual values in the ring buffer.
Two helper functions are now created:
__print_ns_to_secs() and __print_ns_without_secs()
They both take a value of nanoseconds, and the former will return that
number divided by NSEC_PER_SEC, and the latter will mod it with NSEC_PER_SEC
giving a way to print a nice human readable format:
__print_fmt("time=%llu.%09u",
__print_ns_to_secs(REC->nsec_val),
__print_ns_without_secs(REC->nsec_val))
Link: https://lkml.kernel.org/r/e503b903045496c4ccde52843e1e318b422f7a56.1624372313.git.bristot@redhat.com
Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Willaims <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Current metadata buffer release logic in bdev_try_to_free_page() have
a lot of use-after-free issues when umount filesystem concurrently, and
it is difficult to fix directly because ext4 is the only user of
s_op->bdev_try_to_free_page callback and we may have to add more special
refcount or lock that is only used by ext4 into the common vfs layer,
which is unacceptable.
One better solution is remove the bdev_try_to_free_page callback, but
the real problem is we cannot easily release journal_head on the
checkpointed buffer, so try_to_free_buffers() cannot release buffers and
page under memory pressure, which is more likely to trigger
out-of-memory. So we cannot remove the callback directly before we find
another way to release journal_head.
This patch introduce a shrinker to free journal_head on the checkpointed
transaction. After the journal_head got freed, try_to_free_buffers()
could free buffer properly.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210610112440.3438139-6-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
may_commit_transaction was introduced before the ticketing
infrastructure existed. There was a problem where we'd legitimately be
out of space, but every reservation would trigger a transaction commit
and then fail. Thus if you had 1000 things trying to make a
reservation, they'd all do the flushing loop and thus commit the
transaction 1000 times before they'd get their ENOSPC.
This helper was introduced to short circuit this, if there wasn't space
that could be reclaimed by committing the transaction then simply ENOSPC
out. This made true ENOSPC tests much faster as we didn't waste a bunch
of time.
However many of our bugs over the years have been from cases where we
didn't account for some space that would be reclaimed by committing a
transaction. The delayed refs rsv space, delayed rsv, many pinned bytes
miscalculations, etc. And in the meantime the original problem has been
solved with ticketing. We no longer will commit the transaction 1000
times. Instead we'll get 1000 waiters, we will go through the flushing
mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
out. The ticketing infrastructure gives us a deterministic way to see
if we're making progress or not, thus we avoid a lot of extra work.
So simplify this step by simply unconditionally committing the
transaction. This removes what is arguably our most common source of
early ENOSPC bugs and will allow us to drastically simplify many of the
things we track because we simply won't need them with this stuff gone.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().
It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.
Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
make @page parameter optional.
By this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.
Also, to cooperate with such modification, replace @page parameter for
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes page_index info, the existing start/len should be
enough for most usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In mptcp_dump_mpext, dump the csum fields, csum and csum_reqd in struct
mptcp_dump_mpext too.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-06-17
The following pull-request contains BPF updates for your *net-next* tree.
We've added 50 non-merge commits during the last 25 day(s) which contain
a total of 148 files changed, 4779 insertions(+), 1248 deletions(-).
The main changes are:
1) BPF infrastructure to migrate TCP child sockets from a listener to another
in the same reuseport group/map, from Kuniyuki Iwashima.
2) Add a provably sound, faster and more precise algorithm for tnum_mul() as
noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan.
3) Streamline error reporting changes in libbpf as planned out in the
'libbpf: the road to v1.0' effort, from Andrii Nakryiko.
4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu.
5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map
types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek.
6) Support new LLVM relocations in libbpf to make them more linker friendly,
also add a doc to describe the BPF backend relocations, from Yonghong Song.
7) Silence long standing KUBSAN complaints on register-based shifts in
interpreter, from Daniel Borkmann and Eric Biggers.
8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when
target arch cannot be determined, from Lorenz Bauer.
9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson.
10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi.
11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF
programs to bpf_helpers.h header, from Florent Revest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The req pointer uniquely identify a specific request.
Having it in traces can provide valuable insights that is not possible
to have if the calling process is reusing the same user_data value.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add SEQPACKET socket type to vsock trace event.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'success' is left here for a long time and also it is meaningless
for the upper user. Just remove it.
[ There were some tools expecting this, and this may break them. But
hopefully they've been fixed in the mean time. Otherwise this may be
likely reverted - SDR ]
Link: https://lkml.kernel.org/r/20210422122226.9415-1-ed.tsai@mediatek.com
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Poisoning freed pages protects against kernel use-after-free. The
likelihood of such a bug involving kernel pages is significantly higher
than that for user pages. At the same time, poisoning freed pages can
impose a significant performance cost, which cannot always be justified
for user pages given the lower probability of finding a bug. Therefore,
disable freed user page poisoning when using HW tags. We identify
"user" pages via the flag set GFP_HIGHUSER_MOVABLE, which indicates
a strong likelihood of not being directly accessible to the kernel.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://linux-review.googlesource.com/id/I716846e2de8ef179f44e835770df7e6307be96c9
Link: https://lore.kernel.org/r/20210602235230.3928842-5-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
Remove last vestiges of SCSI status message bytes.
Link: https://lore.kernel.org/r/20210427083046.31620-39-hare@suse.de
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver_byte field in the result is now unused, so we can drop the
definitions.
Link: https://lore.kernel.org/r/20210427083046.31620-15-hare@suse.de
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
It is helpful to see what state of CS signal was during one
or another SPI operation. All the same for SPI setup.
Enable tracing of the SPI setup and CS selection.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Message-Id: <20210526195655.75691-1-andriy.shevchenko@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
extend xdp_redirect_map for broadcast support.
With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when do broadcasting.
When getting the devices in dev hash map via dev_map_hash_get_next_key(),
there is a possibility that we fall back to the first key when a device
was removed. This will duplicate packets on some interfaces. So just walk
the whole buckets to avoid this issue. For dev array map, we also walk the
whole map to find valid interfaces.
Function bpf_clear_redirect_map() was removed in
commit ee75aef23a ("bpf, xdp: Restructure redirect actions").
Add it back as we need to use ri->map again.
With test topology:
+-------------------+ +-------------------+
| Host A (i40e 10G) | ---------- | eno1(i40e 10G) |
+-------------------+ | |
| Host B |
+-------------------+ | |
| Host C (i40e 10G) | ---------- | eno2(i40e 10G) |
+-------------------+ | |
| +------+ |
| veth0 -- | Peer | |
| veth1 -- | | |
| veth2 -- | NS | |
| +------+ |
+-------------------+
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64
On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
All the veth peers in the NS have a XDP_DROP program loaded. The
forward_map max_entries in xdp_redirect_map_multi is modify to 4.
Testing the performance impact on the regular xdp_redirect path with and
without patch (to check impact of additional check for broadcast mode):
5.12 rc4 | redirect_map i40e->i40e | 2.0M | 9.7M
5.12 rc4 | redirect_map i40e->veth | 1.7M | 11.8M
5.12 rc4 + patch | redirect_map i40e->i40e | 2.0M | 9.6M
5.12 rc4 + patch | redirect_map i40e->veth | 1.7M | 11.7M
Testing the performance when cloning packets with the redirect_map_multi
test, using a redirect map size of 4, filled with 1-3 devices:
5.12 rc4 + patch | redirect_map multi i40e->veth (x1) | 1.7M | 11.4M
5.12 rc4 + patch | redirect_map multi i40e->veth (x2) | 1.1M | 4.3M
5.12 rc4 + patch | redirect_map multi i40e->veth (x3) | 0.8M | 2.6M
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
Add a tracepoint for capturing TCP segments with
a bad checksum. This makes it easy to identify
sources of bad frames in the fleet (e.g. machines
with faulty NICs).
It should also help tools like IOvisor's tcpdrop.py
which are used today to get detailed information
about such packets.
We don't have a socket in many cases so we must
open code the address extraction based just on
the skb.
v2: add missing export for ipv6=m
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar,
this commit merges them together. A new RCU_NOCB_WAKE_BYPASS wake level
is introduced. As a result, timers perform all kinds of deferred wake
ups but other deferred wakeup callsites only handle non-bypass wakeups
in order not to wake up rcuo too early.
The timer also unconditionally executes a full barrier so as to order
timer_pending() and callback enqueue although the path performing
RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also
test against the rdp leader instead of the current rdp.
This unconditional full barrier shouldn't bring visible overhead since
these timers almost never fire.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
During runtime-suspend of ufs host, the SCSI devices are already suspended
and so are the queues associated with them. However, the ufs host sends SSU
(START_STOP_UNIT) to the wlun during runtime-suspend.
During the process blk_queue_enter() checks if the queue is not in suspended
state. If so, it waits for the queue to resume, and never comes out of
it. Commit 52abca64fd ("scsi: block: Do not accept any requests while
suspended") adds the check to see if the queue is in suspended state in
blk_queue_enter().
Call trace:
__switch_to+0x174/0x2c4
__schedule+0x478/0x764
schedule+0x9c/0xe0
blk_queue_enter+0x158/0x228
blk_mq_alloc_request+0x40/0xa4
blk_get_request+0x2c/0x70
__scsi_execute+0x60/0x1c4
ufshcd_set_dev_pwr_mode+0x124/0x1e4
ufshcd_suspend+0x208/0x83c
ufshcd_runtime_suspend+0x40/0x154
ufshcd_pltfrm_runtime_suspend+0x14/0x20
pm_generic_runtime_suspend+0x28/0x3c
__rpm_callback+0x80/0x2a4
rpm_suspend+0x308/0x614
rpm_idle+0x158/0x228
pm_runtime_work+0x84/0xac
process_one_work+0x1f0/0x470
worker_thread+0x26c/0x4c8
kthread+0x13c/0x320
ret_from_fork+0x10/0x18
Fix this by registering ufs device wlun as a SCSI driver and registering it
for block runtime-pm. Also make this a supplier for all other LUNs. This
way the wlun device suspends after all the consumers and resumes after HBA
resumes. This also registers a new SCSI driver for rpmb wlun. This new
driver is mostly used to clear rpmb uac.
[mkp: resolve merge conflict with 5.13-rc1 and fix doc warning]
Fixed smatch warnings:
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/4662c462e79e3e7f541f54f88f8993f421026d83.1619223249.git.asutoshd@codeaurora.org
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Co-developed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Asutosh Das <asutoshd@codeaurora.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Highlights include:
Stable fixes:
- Add validation of the UDP retrans parameter to prevent shift out-of-bounds
- Don't discard pNFS layout segments that are marked for return
Bugfixes:
- Fix a NULL dereference crash in xprt_complete_bc_request() when the
NFSv4.1 server misbehaves.
- Fix the handling of NFS READDIR cookie verifiers
- Sundry fixes to ensure attribute revalidation works correctly when the
server does not return post-op attributes.
- nfs4_bitmask_adjust() must not change the server global bitmasks
- Fix major timeout handling in the RPC code.
- NFSv4.2 fallocate() fixes.
- Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling
- Copy offload attribute revalidation fixes
- Fix an incorrect filehandle size check in the pNFS flexfiles driver
- Fix several RDMA transport setup/teardown races
- Fix several RDMA queue wrapping issues
- Fix a misplaced memory read barrier in sunrpc's call_decode()
Features:
- Micro optimisation of the TCP transmission queue using TCP_CORK
- statx() performance improvements by further splitting up the tracking
of invalid cached file metadata.
- Support the NFSv4.2 "change_attr_type" attribute and use it to
optimise handling of change attribute updates.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmCVLooACgkQZwvnipYK
APJB5BAAtIJyhx40ooMBzcucDmXd1qovlKsb8ZlvnSI6c7wvHhFPNk9z4zwThnjL
FpVYzJzK6XzAQY/PtgbrPwnSUmW925ngPWYR/hiYe+OGPBnYV+tXP8izCyEkNgMg
45goDOxojGWl7AGTuAJiKcDSdH9PyIrbvt28iwcNSGjslasGSbAoL/836l4OIGr1
Ymxs/NDML11dPco8GIKLGtHd8leFGleDx089VeNsgud8MdaFErp16O5Iz8DdzRKd
W1l2zDMb05j8eDZIfy3w3FyrLkDXA+KgLSADiC8TcpxoadPaQJMeCvoIq8oqVndn
bZBoxduXdLgf54Aec0WnNKFAOyc7pGvZoSNmFouT7EGV73g+g1LQ+ZbEE1bb8fCQ
XHqCVaBt2+47NiTUgdxjXlZRfcn8fYKx0tVxfG3mQVMXUAWfsjmMyQMNgijDRJI2
8Wz3lZMRGMILbR9j4QpP1biVy/2zGNWG/TB5ZZyZMSY4uT+aOpzlqdknb4UsRaSp
f7MfmB7xEWpS4DJr9RIBrJ/hIdnMu1mNInxDPFo5Kl5HNp4TaPm2dPir2ZD2wMZI
daURTX7giUhpE15ZebQDBqWD+mTR0bVDqLLeo131JRmMfMEHugNrr49xe+NkBu/R
QWnFzgkGdQsOeiKRRwEUuhsi74JspqfwzdZzHqcRM5WuXVvBLcA=
=h01b
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Add validation of the UDP retrans parameter to prevent shift
out-of-bounds
- Don't discard pNFS layout segments that are marked for return
Bugfixes:
- Fix a NULL dereference crash in xprt_complete_bc_request() when the
NFSv4.1 server misbehaves.
- Fix the handling of NFS READDIR cookie verifiers
- Sundry fixes to ensure attribute revalidation works correctly when
the server does not return post-op attributes.
- nfs4_bitmask_adjust() must not change the server global bitmasks
- Fix major timeout handling in the RPC code.
- NFSv4.2 fallocate() fixes.
- Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling
- Copy offload attribute revalidation fixes
- Fix an incorrect filehandle size check in the pNFS flexfiles driver
- Fix several RDMA transport setup/teardown races
- Fix several RDMA queue wrapping issues
- Fix a misplaced memory read barrier in sunrpc's call_decode()
Features:
- Micro optimisation of the TCP transmission queue using TCP_CORK
- statx() performance improvements by further splitting up the
tracking of invalid cached file metadata.
- Support the NFSv4.2 'change_attr_type' attribute and use it to
optimise handling of change attribute updates"
* tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (85 commits)
xprtrdma: Fix a NULL dereference in frwr_unmap_sync()
sunrpc: Fix misplaced barrier in call_decode
NFSv4.2: Remove ifdef CONFIG_NFSD from NFSv4.2 client SSC code.
xprtrdma: Move fr_mr field to struct rpcrdma_mr
xprtrdma: Move the Work Request union to struct rpcrdma_mr
xprtrdma: Move fr_linv_done field to struct rpcrdma_mr
xprtrdma: Move cqe to struct rpcrdma_mr
xprtrdma: Move fr_cid to struct rpcrdma_mr
xprtrdma: Remove the RPC/RDMA QP event handler
xprtrdma: Don't display r_xprt memory addresses in tracepoints
xprtrdma: Add an rpcrdma_mr_completion_class
xprtrdma: Add tracepoints showing FastReg WRs and remote invalidation
xprtrdma: Avoid Send Queue wrapping
xprtrdma: Do not wake RPC consumer on a failed LocalInv
xprtrdma: Do not recycle MR after FastReg/LocalInv flushes
xprtrdma: Clarify use of barrier in frwr_wc_localinv_done()
xprtrdma: Rename frwr_release_mr()
xprtrdma: rpcrdma_mr_pop() already does list_del_init()
xprtrdma: Delete rpcrdma_recv_buffer_put()
xprtrdma: Fix cwnd update ordering
...
Merge more updates from Andrew Morton:
"The remainder of the main mm/ queue.
143 patches.
Subsystems affected by this patch series (all mm): pagecache, hugetlb,
userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap,
kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and
kfence"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (143 commits)
kfence: use power-efficient work queue to run delayed work
kfence: maximize allocation wait timeout duration
kfence: await for allocation using wait_event
kfence: zero guard page after out-of-bounds access
mm/process_vm_access.c: remove duplicate include
mm/mempool: minor coding style tweaks
mm/highmem.c: fix coding style issue
btrfs: use memzero_page() instead of open coded kmap pattern
iov_iter: lift memzero_page() to highmem.h
mm/zsmalloc: use BUG_ON instead of if condition followed by BUG.
mm/zswap.c: switch from strlcpy to strscpy
arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
mm,memory_hotplug: add kernel boot option to enable memmap_on_memory
acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported
mm,memory_hotplug: allocate memmap from the added memory range
mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count()
mm,memory_hotplug: relax fully spanned sections check
drivers/base/memory: introduce memory_block_{online,offline}
mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove
...
We should not pin pages in ZONE_MOVABLE. Currently, we do not pin only
movable CMA pages. Generalize the function that migrates CMA pages to
migrate all movable pages. Use is_pinnable_page() to check which pages
need to be migrated
Link: https://lkml.kernel.org/r/20210215161349.246722-10-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
size_t in cma_alloc is confusing since it makes people think it's byte
count, not pages. Change it to unsigned long[1].
The unsigned int in cma_release is also not right so change it. Since we
have unsigned long in cma_release, free_contig_range should also respect
it.
[1] 67a2e213e7, mm: cma: fix incorrect type conversion for size during dma allocation
Link: https://lore.kernel.org/linux-mm/20210324043434.GP1719932@casper.infradead.org/
Link: https://lkml.kernel.org/r/20210331164018.710560-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There were missing places to add cma instance name. To identify each CMA
instance, let's add the name for every cma trace. This patch also changes
the existing cma_trace_alloc to cma_trace_finish since we have
cma_alloc_start[1].
[1] https://lore.kernel.org/linux-mm/20210324160740.15901-1-georgi.djakov@linaro.org
Link: https://lkml.kernel.org/r/20210330220237.748899-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Liam Mark <lmark@codeaurora.org>
Cc: Georgi Djakov <georgi.djakov@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add cma and migrate trace events to enable CMA allocation performance to
be measured via ftrace.
[georgi.djakov@linaro.org: add the CMA instance name to the cma_alloc_start trace event]
Link: https://lkml.kernel.org/r/20210326155414.25006-1-georgi.djakov@linaro.org
Link: https://lkml.kernel.org/r/20210324160740.15901-1-georgi.djakov@linaro.org
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
New feature:
The "func-no-repeats" option in tracefs/options directory. When set
the function tracer will detect if the current function being traced
is the same as the previous one, and instead of recording it, it will
keep track of the number of times that the function is repeated in a row.
And when another function is recorded, it will write a new event that
shows the function that repeated, the number of times it repeated and
the time stamp of when the last repeated function occurred.
Enhancements:
In order to implement the above "func-no-repeats" option, the ring
buffer timestamp can now give the accurate timestamp of the event
as it is being recorded, instead of having to record an absolute
timestamp for all events. This helps the histogram code which no longer
needs to waste ring buffer space.
New validation logic to make sure all trace events that access
dereferenced pointers do so in a safe way, and will warn otherwise.
Fixes:
No longer limit the PIDs of tasks that are recorded for "saved_cmdlines"
to PID_MAX_DEFAULT (32768), as systemd now allows for a much larger
range. This caused the mapping of PIDs to the task names to be dropped
for all tasks with a PID greater than 32768.
Change trace_clock_global() to never block. This caused a deadlock.
Clean ups:
Typos, prototype fixes, and removing of duplicate or unused code.
Better management of ftrace_page allocations.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYI/1vBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiL0AP9EemIC5TDh2oihqLRNeUjdTu0ryEoM
HRFqxozSF985twD/bfkt86KQC8rLHwxTbxQZ863bmdaC6cMGFhWiF+H/MAs=
=psYt
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New feature:
- A new "func-no-repeats" option in tracefs/options directory.
When set the function tracer will detect if the current function
being traced is the same as the previous one, and instead of
recording it, it will keep track of the number of times that the
function is repeated in a row. And when another function is
recorded, it will write a new event that shows the function that
repeated, the number of times it repeated and the time stamp of
when the last repeated function occurred.
Enhancements:
- In order to implement the above "func-no-repeats" option, the ring
buffer timestamp can now give the accurate timestamp of the event
as it is being recorded, instead of having to record an absolute
timestamp for all events. This helps the histogram code which no
longer needs to waste ring buffer space.
- New validation logic to make sure all trace events that access
dereferenced pointers do so in a safe way, and will warn otherwise.
Fixes:
- No longer limit the PIDs of tasks that are recorded for
"saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows
for a much larger range. This caused the mapping of PIDs to the
task names to be dropped for all tasks with a PID greater than
32768.
- Change trace_clock_global() to never block. This caused a deadlock.
Clean ups:
- Typos, prototype fixes, and removing of duplicate or unused code.
- Better management of ftrace_page allocations"
* tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits)
tracing: Restructure trace_clock_global() to never block
tracing: Map all PIDs to command lines
ftrace: Reuse the output of the function tracer for func_repeats
tracing: Add "func_no_repeats" option for function tracing
tracing: Unify the logic for function tracing options
tracing: Add method for recording "func_repeats" events
tracing: Add "last_func_repeats" to struct trace_array
tracing: Define new ftrace event "func_repeats"
tracing: Define static void trace_print_time()
ftrace: Simplify the calculation of page number for ftrace_page->records some more
ftrace: Store the order of pages allocated in ftrace_page
tracing: Remove unused argument from "ring_buffer_time_stamp()
tracing: Remove duplicate struct declaration in trace_events.h
tracing: Update create_system_filter() kernel-doc comment
tracing: A minor cleanup for create_system_filter()
kernel: trace: Mundane typo fixes in the file trace_events_filter.c
tracing: Fix various typos in comments
scripts/recordmcount.pl: Make vim and emacs indent the same
scripts/recordmcount.pl: Make indent spacing consistent
tracing: Add a verifier to check string pointers for trace events
...
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation,
zap under read lock, enable/disably dirty page logging under
read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing
the architecture-specific code
- Some selftests improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
+2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
=AXUi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
Including:
- Big cleanup of almost unsused parts of the IOMMU API by
Christoph Hellwig. This mostly affects the Freescale PAMU
driver.
- New IOMMU driver for Unisoc SOCs
- ARM SMMU Updates from Will:
- SMMUv3: Drop vestigial PREFETCH_ADDR support
- SMMUv3: Elide TLB sync logic for empty gather
- SMMUv3: Fix "Service Failure Mode" handling
- SMMUv2: New Qualcomm compatible string
- Removal of the AMD IOMMU performance counter writeable check
on AMD. It caused long boot delays on some machines and is
only needed to work around an errata on some older (possibly
pre-production) chips. If someone is still hit by this
hardware issue anyway the performance counters will just
return 0.
- Support for targeted invalidations in the AMD IOMMU driver.
Before that the driver only invalidated a single 4k page or the
whole IO/TLB for an address space. This has been extended now
and is mostly useful for emulated AMD IOMMUs.
- Several fixes for the Shared Virtual Memory support in the
Intel VT-d driver
- Mediatek drivers can now be built as modules
- Re-introduction of the forcedac boot option which got lost
when converting the Intel VT-d driver to the common dma-iommu
implementation.
- Extension of the IOMMU device registration interface and
support iommu_ops to be const again when drivers are built as
modules.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmCMEIoACgkQK/BELZcB
GuOu9xAAvg6aR0uHlxvRq6cgNnHN9Ltp5+t3qFYtRRrauY0iOPMO62k0QQli5shX
CGeczD0e59KAZqI0zNJnQn8hMY5dg7XVkFCC5BrSzuCDCtwJZ0N5Tq3pfUlaV1rw
BJf41t79Fd+jp7kn53tu+vRAfYZ3+sLOx/6U3c15pqKRZSkyFWbQllOtD3J5LnLu
1PyPlfiNpMwCajiS7aQbN+fuJ/lKIFeA2MDPOsCBzhbfxiJUqJxZOKAZO3rOjFfK
feTibqQ+3Zz6MPXt9st1cvPpy8jCosv81OY6Knqvxf/oB5q+fEdi2uNrKISonb/t
Fw331oOIwg2A+HOpwC9MN1AumOIqiHSWWENAMk9SlP+TMIWKQ8kZreyI6IEB23dV
+QvP3DVA+CfLwtNY/Zh0IqKh28D+IHlKbpWNU1m+9AUe468mV/MTjfwxr9Yfffhm
LZ6C0DgFdmtqv8jPuDGUOgo3RNeN8bLnUSEHG9gHibA+RKujl5BWDjKkwILqMQTt
Ysdsu8TiNtFIULomizqCpgqEbQfW8TLFvASXCM1VMQ/PDURxvchZPxFDJonYXy+K
z2HGaG3eUE07YrAdRKH69aMVIbmS+sjEhvmi4xZ1Lh7wWcIE2AZVvO8qNb+Ckcp3
4tLPPDksm/iQngnFf6gdgH3qv4rgbzE4+74GXqeANiQCjY9dSJI=
=qF2C
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Big cleanup of almost unsused parts of the IOMMU API by Christoph
Hellwig. This mostly affects the Freescale PAMU driver.
- New IOMMU driver for Unisoc SOCs
- ARM SMMU Updates from Will:
- Drop vestigial PREFETCH_ADDR support (SMMUv3)
- Elide TLB sync logic for empty gather (SMMUv3)
- Fix "Service Failure Mode" handling (SMMUv3)
- New Qualcomm compatible string (SMMUv2)
- Removal of the AMD IOMMU performance counter writeable check on AMD.
It caused long boot delays on some machines and is only needed to
work around an errata on some older (possibly pre-production) chips.
If someone is still hit by this hardware issue anyway the performance
counters will just return 0.
- Support for targeted invalidations in the AMD IOMMU driver. Before
that the driver only invalidated a single 4k page or the whole IO/TLB
for an address space. This has been extended now and is mostly useful
for emulated AMD IOMMUs.
- Several fixes for the Shared Virtual Memory support in the Intel VT-d
driver
- Mediatek drivers can now be built as modules
- Re-introduction of the forcedac boot option which got lost when
converting the Intel VT-d driver to the common dma-iommu
implementation.
- Extension of the IOMMU device registration interface and support
iommu_ops to be const again when drivers are built as modules.
* tag 'iommu-updates-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (84 commits)
iommu: Streamline registration interface
iommu: Statically set module owner
iommu/mediatek-v1: Add error handle for mtk_iommu_probe
iommu/mediatek-v1: Avoid build fail when build as module
iommu/mediatek: Always enable the clk on resume
iommu/fsl-pamu: Fix uninitialized variable warning
iommu/vt-d: Force to flush iotlb before creating superpage
iommu/amd: Put newline after closing bracket in warning
iommu/vt-d: Fix an error handling path in 'intel_prepare_irq_remapping()'
iommu/vt-d: Fix build error of pasid_enable_wpe() with !X86
iommu/amd: Remove performance counter pre-initialization test
Revert "iommu/amd: Fix performance counter initialization"
iommu/amd: Remove duplicate check of devid
iommu/exynos: Remove unneeded local variable initialization
iommu/amd: Page-specific invalidations for more than one page
iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command
iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown
iommu/vt-d: Invalidate PASID cache when root/context entry changed
iommu/vt-d: Remove WO permissions on second-level paging entries
iommu/vt-d: Report the right page fault address
...
casefold, ensure that deleted file names are cleared in directory
blocks by zeroing directory entries when they are unlinked or moved as
part of a hash tree node split. We also improve the block allocator's
performance on a freshly mounted file system by prefetching block
bitmaps.
There are also the usual cleanups and bug fixes, including fixing a
page cache invalidation race when there is mixed buffered and direct
I/O and the block size is less than page size, and allow the dax flag
to be set and cleared on inline directories.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmCLei4ACgkQ8vlZVpUN
gaPZkgf/VH08xjMf3VthC+BpvVmChQXfV4yjigHbO2pmPyYWZhyJzkEGCQD8u2eB
b7ShW+B1NCifcTU34xAkKHwEtakzzEv3WIMrT1oZNWrpfo8tt850EkwQggaGGDpd
/HnP1/wLtziJ5hE6DwutmX7qB4VFghVj898MjDrEPSOBqItOjWps9mn/JWL7SHyI
Dqzhf5XZTYPaXWuJmSmKw3q8O70JDHnZe/rRWlfX1jLI5KDtqp71Nw1B+gszUB66
IUdncyZKvInsyjYhkbCQ8U6WFih82MrbKeuGYDp/RFvg5eMELEYkwT9j0ofuDHq8
zn62sAlbOXv1DiqkPDHKVm9GkHx8/g==
=UpnH
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New features for ext4 this cycle include support for encrypted
casefold, ensure that deleted file names are cleared in directory
blocks by zeroing directory entries when they are unlinked or moved as
part of a hash tree node split. We also improve the block allocator's
performance on a freshly mounted file system by prefetching block
bitmaps.
There are also the usual cleanups and bug fixes, including fixing a
page cache invalidation race when there is mixed buffered and direct
I/O and the block size is less than page size, and allow the dax flag
to be set and cleared on inline directories"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
ext4: wipe ext4_dir_entry2 upon file deletion
ext4: Fix occasional generic/418 failure
fs: fix reporting supported extra file attributes for statx()
ext4: allow the dax flag to be set and cleared on inline directories
ext4: fix debug format string warning
ext4: fix trailing whitespace
ext4: fix various seppling typos
ext4: fix error return code in ext4_fc_perform_commit()
ext4: annotate data race in jbd2_journal_dirty_metadata()
ext4: annotate data race in start_this_handle()
ext4: fix ext4_error_err save negative errno into superblock
ext4: fix error code in ext4_commit_super
ext4: always panic when errors=panic is specified
ext4: delete redundant uptodate check for buffer
ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()
ext4: make prefetch_block_bitmaps default
ext4: add proc files to monitor new structures
ext4: improve cr 0 / cr 1 group scanning
ext4: add MB_NUM_ORDERS macro
ext4: add mballoc stats proc file
...
Adjust the rss_stat tracepoint to print the name of the resident page type
that got updated (e.g. MM_ANONPAGES/MM_FILEPAGES), rather than the numeric
index corresponding to it (the __entry->member value):
Before this patch:
------------------
rss_stat: mm_id=1216113068 curr=0 member=1 size=28672B
rss_stat: mm_id=1216113068 curr=0 member=1 size=0B
rss_stat: mm_id=534402304 curr=1 member=0 size=188416B
rss_stat: mm_id=534402304 curr=1 member=1 size=40960B
After this patch:
-----------------
rss_stat: mm_id=1726253524 curr=1 type=MM_ANONPAGES size=40960B
rss_stat: mm_id=1726253524 curr=1 type=MM_FILEPAGES size=663552B
rss_stat: mm_id=1726253524 curr=1 type=MM_ANONPAGES size=65536B
rss_stat: mm_id=1726253524 curr=1 type=MM_FILEPAGES size=647168B
Use TRACE_DEFINE_ENUM()/__print_symbolic() logic to map the enum values to
the strings they represent, so that userspace tools can also parse the raw
data correctly.
Link: https://lkml.kernel.org/r/20210310162305.4862-1-ovidiu.panait@windriver.com
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Core:
- bpf:
- allow bpf programs calling kernel functions (initially to
reuse TCP congestion control implementations)
- enable task local storage for tracing programs - remove the
need to store per-task state in hash maps, and allow tracing
programs access to task local storage previously added for
BPF_LSM
- add bpf_for_each_map_elem() helper, allowing programs to
walk all map elements in a more robust and easier to verify
fashion
- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
redirection
- lpm: add support for batched ops in LPM trie
- add BTF_KIND_FLOAT support - mostly to allow use of BTF
on s390 which has floats in its headers files
- improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers
- libbpf, bpftool: support static linking of BPF ELF files
- improve support for encapsulation of L2 packets
- xdp: restructure redirect actions to avoid a runtime lookup,
improving performance by 4-8% in microbenchmarks
- xsk: build skb by page (aka generic zerocopy xmit) - improve
performance of software AF_XDP path by 33% for devices
which don't need headers in the linear skb part (e.g. virtio)
- nexthop: resilient next-hop groups - improve path stability
on next-hops group changes (incl. offload for mlxsw)
- ipv6: segment routing: add support for IPv4 decapsulation
- icmp: add support for RFC 8335 extended PROBE messages
- inet: use bigger hash table for IP ID generation
- tcp: deal better with delayed TX completions - make sure we don't
give up on fast TCP retransmissions only because driver is
slow in reporting that it completed transmitting the original
- tcp: reorder tcp_congestion_ops for better cache locality
- mptcp:
- add sockopt support for common TCP options
- add support for common TCP msg flags
- include multiple address ids in RM_ADDR
- add reset option support for resetting one subflow
- udp: GRO L4 improvements - improve 'forward' / 'frag_list'
co-existence with UDP tunnel GRO, allowing the first to take
place correctly even for encapsulated UDP traffic
- micro-optimize dev_gro_receive() and flow dissection, avoid
retpoline overhead on VLAN and TEB GRO
- use less memory for sysctls, add a new sysctl type, to allow using
u8 instead of "int" and "long" and shrink networking sysctls
- veth: allow GRO without XDP - this allows aggregating UDP
packets before handing them off to routing, bridge, OvS, etc.
- allow specifing ifindex when device is moved to another namespace
- netfilter:
- nft_socket: add support for cgroupsv2
- nftables: add catch-all set element - special element used
to define a default action in case normal lookup missed
- use net_generic infra in many modules to avoid allocating
per-ns memory unnecessarily
- xps: improve the xps handling to avoid potential out-of-bound
accesses and use-after-free when XPS change race with other
re-configuration under traffic
- add a config knob to turn off per-cpu netdev refcnt to catch
underflows in testing
Device APIs:
- add WWAN subsystem to organize the WWAN interfaces better and
hopefully start driving towards more unified and vendor-
-independent APIs
- ethtool:
- add interface for reading IEEE MIB stats (incl. mlx5 and
bnxt support)
- allow network drivers to dump arbitrary SFP EEPROM data,
current offset+length API was a poor fit for modern SFP
which define EEPROM in terms of pages (incl. mlx5 support)
- act_police, flow_offload: add support for packet-per-second
policing (incl. offload for nfp)
- psample: add additional metadata attributes like transit delay
for packets sampled from switch HW (and corresponding egress
and policy-based sampling in the mlxsw driver)
- dsa: improve support for sandwiched LAGs with bridge and DSA
- netfilter:
- flowtable: use direct xmit in topologies with IP
forwarding, bridging, vlans etc.
- nftables: counter hardware offload support
- Bluetooth:
- improvements for firmware download w/ Intel devices
- add support for reading AOSP vendor capabilities
- add support for virtio transport driver
- mac80211:
- allow concurrent monitor iface and ethernet rx decap
- set priority and queue mapping for injected frames
- phy: add support for Clause-45 PHY Loopback
- pci/iov: add sysfs MSI-X vector assignment interface
to distribute MSI-X resources to VFs (incl. mlx5 support)
New hardware/drivers:
- dsa: mv88e6xxx: add support for Marvell mv88e6393x -
11-port Ethernet switch with 8x 1-Gigabit Ethernet
and 3x 10-Gigabit interfaces.
- dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
and BCM63xx switches
- Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
- ath11k: support for QCN9074 a 802.11ax device
- Bluetooth: Broadcom BCM4330 and BMC4334
- phy: Marvell 88X2222 transceiver support
- mdio: add BCM6368 MDIO mux bus controller
- r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
- mana: driver for Microsoft Azure Network Adapter (MANA)
- Actions Semi Owl Ethernet MAC
- can: driver for ETAS ES58X CAN/USB interfaces
Pure driver changes:
- add XDP support to: enetc, igc, stmmac
- add AF_XDP support to: stmmac
- virtio:
- page_to_skb() use build_skb when there's sufficient tailroom
(21% improvement for 1000B UDP frames)
- support XDP even without dedicated Tx queues - share the Tx
queues with the stack when necessary
- mlx5:
- flow rules: add support for mirroring with conntrack,
matching on ICMP, GTP, flex filters and more
- support packet sampling with flow offloads
- persist uplink representor netdev across eswitch mode
changes
- allow coexistence of CQE compression and HW time-stamping
- add ethtool extended link error state reporting
- ice, iavf: support flow filters, UDP Segmentation Offload
- dpaa2-switch:
- move the driver out of staging
- add spanning tree (STP) support
- add rx copybreak support
- add tc flower hardware offload on ingress traffic
- ionic:
- implement Rx page reuse
- support HW PTP time-stamping
- octeon: support TC hardware offloads - flower matching on ingress
and egress ratelimitting.
- stmmac:
- add RX frame steering based on VLAN priority in tc flower
- support frame preemption (FPE)
- intel: add cross time-stamping freq difference adjustment
- ocelot:
- support forwarding of MRP frames in HW
- support multiple bridges
- support PTP Sync one-step timestamping
- dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
learning, flooding etc.
- ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
SC7280 SoCs)
- mt7601u: enable TDLS support
- mt76:
- add support for 802.3 rx frames (mt7915/mt7615)
- mt7915 flash pre-calibration support
- mt7921/mt7663 runtime power management fixes
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
=vcbA
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- bpf:
- allow bpf programs calling kernel functions (initially to
reuse TCP congestion control implementations)
- enable task local storage for tracing programs - remove the
need to store per-task state in hash maps, and allow tracing
programs access to task local storage previously added for
BPF_LSM
- add bpf_for_each_map_elem() helper, allowing programs to walk
all map elements in a more robust and easier to verify fashion
- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
redirection
- lpm: add support for batched ops in LPM trie
- add BTF_KIND_FLOAT support - mostly to allow use of BTF on
s390 which has floats in its headers files
- improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers
- libbpf, bpftool: support static linking of BPF ELF files
- improve support for encapsulation of L2 packets
- xdp: restructure redirect actions to avoid a runtime lookup,
improving performance by 4-8% in microbenchmarks
- xsk: build skb by page (aka generic zerocopy xmit) - improve
performance of software AF_XDP path by 33% for devices which don't
need headers in the linear skb part (e.g. virtio)
- nexthop: resilient next-hop groups - improve path stability on
next-hops group changes (incl. offload for mlxsw)
- ipv6: segment routing: add support for IPv4 decapsulation
- icmp: add support for RFC 8335 extended PROBE messages
- inet: use bigger hash table for IP ID generation
- tcp: deal better with delayed TX completions - make sure we don't
give up on fast TCP retransmissions only because driver is slow in
reporting that it completed transmitting the original
- tcp: reorder tcp_congestion_ops for better cache locality
- mptcp:
- add sockopt support for common TCP options
- add support for common TCP msg flags
- include multiple address ids in RM_ADDR
- add reset option support for resetting one subflow
- udp: GRO L4 improvements - improve 'forward' / 'frag_list'
co-existence with UDP tunnel GRO, allowing the first to take place
correctly even for encapsulated UDP traffic
- micro-optimize dev_gro_receive() and flow dissection, avoid
retpoline overhead on VLAN and TEB GRO
- use less memory for sysctls, add a new sysctl type, to allow using
u8 instead of "int" and "long" and shrink networking sysctls
- veth: allow GRO without XDP - this allows aggregating UDP packets
before handing them off to routing, bridge, OvS, etc.
- allow specifing ifindex when device is moved to another namespace
- netfilter:
- nft_socket: add support for cgroupsv2
- nftables: add catch-all set element - special element used to
define a default action in case normal lookup missed
- use net_generic infra in many modules to avoid allocating
per-ns memory unnecessarily
- xps: improve the xps handling to avoid potential out-of-bound
accesses and use-after-free when XPS change race with other
re-configuration under traffic
- add a config knob to turn off per-cpu netdev refcnt to catch
underflows in testing
Device APIs:
- add WWAN subsystem to organize the WWAN interfaces better and
hopefully start driving towards more unified and vendor-
independent APIs
- ethtool:
- add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
support)
- allow network drivers to dump arbitrary SFP EEPROM data,
current offset+length API was a poor fit for modern SFP which
define EEPROM in terms of pages (incl. mlx5 support)
- act_police, flow_offload: add support for packet-per-second
policing (incl. offload for nfp)
- psample: add additional metadata attributes like transit delay for
packets sampled from switch HW (and corresponding egress and
policy-based sampling in the mlxsw driver)
- dsa: improve support for sandwiched LAGs with bridge and DSA
- netfilter:
- flowtable: use direct xmit in topologies with IP forwarding,
bridging, vlans etc.
- nftables: counter hardware offload support
- Bluetooth:
- improvements for firmware download w/ Intel devices
- add support for reading AOSP vendor capabilities
- add support for virtio transport driver
- mac80211:
- allow concurrent monitor iface and ethernet rx decap
- set priority and queue mapping for injected frames
- phy: add support for Clause-45 PHY Loopback
- pci/iov: add sysfs MSI-X vector assignment interface to distribute
MSI-X resources to VFs (incl. mlx5 support)
New hardware/drivers:
- dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
interfaces.
- dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
BCM63xx switches
- Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
- ath11k: support for QCN9074 a 802.11ax device
- Bluetooth: Broadcom BCM4330 and BMC4334
- phy: Marvell 88X2222 transceiver support
- mdio: add BCM6368 MDIO mux bus controller
- r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
- mana: driver for Microsoft Azure Network Adapter (MANA)
- Actions Semi Owl Ethernet MAC
- can: driver for ETAS ES58X CAN/USB interfaces
Pure driver changes:
- add XDP support to: enetc, igc, stmmac
- add AF_XDP support to: stmmac
- virtio:
- page_to_skb() use build_skb when there's sufficient tailroom
(21% improvement for 1000B UDP frames)
- support XDP even without dedicated Tx queues - share the Tx
queues with the stack when necessary
- mlx5:
- flow rules: add support for mirroring with conntrack, matching
on ICMP, GTP, flex filters and more
- support packet sampling with flow offloads
- persist uplink representor netdev across eswitch mode changes
- allow coexistence of CQE compression and HW time-stamping
- add ethtool extended link error state reporting
- ice, iavf: support flow filters, UDP Segmentation Offload
- dpaa2-switch:
- move the driver out of staging
- add spanning tree (STP) support
- add rx copybreak support
- add tc flower hardware offload on ingress traffic
- ionic:
- implement Rx page reuse
- support HW PTP time-stamping
- octeon: support TC hardware offloads - flower matching on ingress
and egress ratelimitting.
- stmmac:
- add RX frame steering based on VLAN priority in tc flower
- support frame preemption (FPE)
- intel: add cross time-stamping freq difference adjustment
- ocelot:
- support forwarding of MRP frames in HW
- support multiple bridges
- support PTP Sync one-step timestamping
- dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
learning, flooding etc.
- ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
SC7280 SoCs)
- mt7601u: enable TDLS support
- mt76:
- add support for 802.3 rx frames (mt7915/mt7615)
- mt7915 flash pre-calibration support
- mt7921/mt7663 runtime power management fixes"
* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
net: selftest: fix build issue if INET is disabled
net: netrom: nr_in: Remove redundant assignment to ns
net: tun: Remove redundant assignment to ret
net: phy: marvell: add downshift support for M88E1240
net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
net/sched: act_ct: Remove redundant ct get and check
icmp: standardize naming of RFC 8335 PROBE constants
bpf, selftests: Update array map tests for per-cpu batched ops
bpf: Add batched ops support for percpu array
bpf: Implement formatted output helpers with bstr_printf
seq_file: Add a seq_bprintf function
sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
net: fix a concurrency bug in l2tp_tunnel_register()
net/smc: Remove redundant assignment to rc
mpls: Remove redundant assignment to err
llc2: Remove redundant assignment to rc
net/tls: Remove redundant initialization of record
rds: Remove redundant assignment to nr_sig
dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
...
- Implement concurrent TLB flushes, which overlaps the local TLB flush with the
remote TLB flush. In testing this improved sysbench performance measurably by
a couple of percentage points, especially if TLB-heavy security mitigations
are active.
- Further micro-optimizations to improve the performance of TLB flushes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCKbNcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hjYBAAsyNUa/gOu0g6/Cx8R86w9HtHHmm5vso/
6nJjWj2fd2qJ9JShlddxvXEMeXtPTYabVWQkiiriFMuofk6JeKnlHm1Jzl6keABX
OQFwjIFeNASPRcdXvuuYPOVWAJJdr2oL9QUr6OOK1ccQJTz/Cd0zA+VQ5YqcsCon
yaWbkxELwKXpgql+qt66eAZ6Q2Y1TKXyrTW7ZgxQi0yeeWqMaEOub0/oyS7Ax1Rg
qEJMwm1prb76NPzeqR/G3e4KTrDZfQ/B/KnSsz36GTJpl4eye6XqWDUgm1nAGNIc
5dbc4Vx7JtZsUOuC0AmzWb3hsDyzVcN/lQvijdZ2RsYR3gvuYGaBhKqExqV0XH6P
oqaWOKWCz+LqWbsgJmxCpqkt1LZl5+VUOcfJ97WkIS7DyIPtSHTzQXbBMZqKLeat
mn5UcKYB2Gi7wsUPv6VC2ChKbDqN0VT8G86XbYylGo4BE46KoZKPUNY/QWKLUPd6
0UKcVeNM2HFyf1C73p/tO/z7hzu3qLuMMnsphP6/c2pKLpdgawEXgbnVKNId1B/c
NrzyhTvVaMt+Um28bBRhHONIlzPJwWcnZbdY7NqMnu+LBKQ68cL/h4FOIV/RDLNb
GJLgfAr8fIw/zIpqYuFHiiMNo9wWqVtZko1MvXhGceXUL69QuzTra2XR/6aDxkPf
6gQVesetTvo=
=3Cyp
-----END PGP SIGNATURE-----
Merge tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 tlb updates from Ingo Molnar:
"The x86 MM changes in this cycle were:
- Implement concurrent TLB flushes, which overlaps the local TLB
flush with the remote TLB flush.
In testing this improved sysbench performance measurably by a
couple of percentage points, especially if TLB-heavy security
mitigations are active.
- Further micro-optimizations to improve the performance of TLB
flushes"
* tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Micro-optimize smp_call_function_many_cond()
smp: Inline on_each_cpu_cond() and on_each_cpu()
x86/mm/tlb: Remove unnecessary uses of the inline keyword
cpumask: Mark functions as pure
x86/mm/tlb: Do not make is_lazy dirty for no reason
x86/mm/tlb: Privatize cpu_tlbstate
x86/mm/tlb: Flush remote and local TLBs concurrently
x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
smp: Run functions concurrently in smp_call_function_many_cond()
This series consists of the usual driver updates (ufs, target, tcmu,
smartpqi, lpfc, zfcp, qla2xxx, mpt3sas, pm80xx). The major core
change is using a sbitmap instead of an atomic for queue tracking.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYInvqCYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishYh2AP0SgqqL
WYZRT2oiyBOKD28v+ceOSiXvgjPlqABwVMC0BAEAn29/wNCxyvzZ1k/b0iPJ4M+S
klkSxLzXKQLzJBgdK5w=
=p5B/
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This consists of the usual driver updates (ufs, target, tcmu,
smartpqi, lpfc, zfcp, qla2xxx, mpt3sas, pm80xx).
The major core change is using a sbitmap instead of an atomic for
queue tracking"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (412 commits)
scsi: target: tcm_fc: Fix a kernel-doc header
scsi: target: Shorten ALUA error messages
scsi: target: Fix two format specifiers
scsi: target: Compare explicitly with SAM_STAT_GOOD
scsi: sd: Introduce a new local variable in sd_check_events()
scsi: dc395x: Open-code status_byte(u8) calls
scsi: 53c700: Open-code status_byte(u8) calls
scsi: smartpqi: Remove unused functions
scsi: qla4xxx: Remove an unused function
scsi: myrs: Remove unused functions
scsi: myrb: Remove unused functions
scsi: mpt3sas: Fix two kernel-doc headers
scsi: fcoe: Suppress a compiler warning
scsi: libfc: Fix a format specifier
scsi: aacraid: Remove an unused function
scsi: core: Introduce enum scsi_disposition
scsi: core: Modify the scsi_send_eh_cmnd() return value for the SDEV_BLOCK case
scsi: core: Rename scsi_softirq_done() into scsi_complete()
scsi: core: Remove an incorrect comment
scsi: core: Make the scsi_alloc_sgtables() documentation more accurate
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIRBUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjt5D/9de6zCaha6CyfIIPiU+crropQ2jPzO49cb
WzcOCmdhSv0GtYlhdnIqCOo5p8mRDWJAEBU9upTDTCWOx9hwr5Ms0TCNQHxuQ/T0
4Ll+/cMsOxeTypiykfMtOG9TEmYSria2vTJKLgpyaP4ohfJa3uT7r2NZ8NK/8T4t
wwbJ+jCSKewelI1l0XD8k8LBU39FS/KRgLTdfYj/rCW3PWt/ZE2eSIYjZQvMCVOC
3fIdgOOJAMQVQafz+YAeJd2E+/l5/8YcJVKpJMVtBNbqTHIjA4EsInZauy8TpBgW
OzJ3I+XdF70qZM119tI/nXw3sb0e+UV0fRsIXLkOwTEBzowernrAtsEwAOP+qFKS
2YnqSKOSjMO5d5Mpkz6T0MDMloU45jph88lUH0RoShVxGa7jv+TMOL6QU1oOyxc1
+gPPbApQs9WtSZDHsTJ0xFLpol804UDQmwb38mHdzedDVSE7iip1jANkw6LEhKkJ
Mlg60ZF1Z305G+cDhrbs02ZGVa+fzbrtXtLlTqZw8bNX9lBp0JLtDpzskjbnUmck
6A04nfg+Eto5GvAn+FRBuOCPridLEk2K6ygko/gwQWsYCgqkCgRuqjlIQCSZy5iu
jHEFixIXKn6eACf+YzLVxSLyEQrmFyDSypbN7LvzoKJYo/loy8Q1+42nGlrVC3zi
+CB1NokPng==
=ZJ8L
-----END PGP SIGNATURE-----
Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Support for multi-shot mode for POLL requests
- More efficient reference counting. This is shamelessly stolen from
the mm side. Even though referencing is mostly single/dual user, the
128 count was retained to keep the code the same. Maybe this
should/could be made generic at some point.
- Removal of the need to have a manager thread for each ring. The
manager threads only job was checking and creating new io-threads as
needed, instead we handle this from the queue path.
- Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this
thread is "just" a regular application thread, so no need to restrict
use of it anymore.
- Cleanup of how internal async poll data lifetime is managed.
- Fix for syzbot reported crash on SQPOLL cancelation.
- Make buffer registration more like file registrations, which includes
flexibility in avoiding full set unregistration and re-registration.
- Fix for io-wq affinity setting.
- Be a bit more defensive in task->pf_io_worker setup.
- Various SQPOLL fixes.
- Cleanup of SQPOLL creds handling.
- Improvements to in-flight request tracking.
- File registration cleanups.
- Tons of cleanups and little fixes
* tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits)
io_uring: maintain drain logic for multishot poll requests
io_uring: Check current->io_uring in io_uring_cancel_sqpoll
io_uring: fix NULL reg-buffer
io_uring: simplify SQPOLL cancellations
io_uring: fix work_exit sqpoll cancellations
io_uring: Fix uninitialized variable up.resv
io_uring: fix invalid error check after malloc
io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker
kernel: always initialize task->pf_io_worker to NULL
io_uring: update sq_thread_idle after ctx deleted
io_uring: add full-fledged dynamic buffers support
io_uring: implement fixed buffers registration similar to fixed files
io_uring: prepare fixed rw for dynanic buffers
io_uring: keep table of pointers to ubufs
io_uring: add generic rsrc update with tags
io_uring: add IORING_REGISTER_RSRC
io_uring: enumerate dynamic resources
io_uring: add generic path for rsrc update
io_uring: preparation for rsrc tagging
io_uring: decouple CQE filling from requests
...
- Bitmap support for "N" as alias for last bit
- kvfree_rcu updates
- mm_dump_obj() updates. (One of these is to mm, but was suggested by Andrew Morton.)
- RCU callback offloading update
- Polling RCU grace-period interfaces
- Realtime-related RCU updates
- Tasks-RCU updates
- Torture-test updates
- Torture-test scripting updates
- Miscellaneous fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJCZERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hRjw/+Jkb9KvR9odPt/zqN/KPtIlburCUWgsFb
2zAlWN4uMocPAiXT2Xq58/8gqMkpyn7ZVZtL1tD8fZSvlwEr0U8Z74+/NdoQvYE+
kMXIYIuhIAGRyAupmzkriqN33iY+BSZPacX3u6ziPj57/0OZzbWVN/DAhbuvyLqG
J/oL4PHCa7XAqXbf95rd5Zjs680QJ3CbTRh4nA8uHArzJmKZOaaHJ05Pxd1LpULe
SJ+5p1GQnnwxd1HqmlHMDu/dW+2hE35BGykF8zi78je9OJXualDoM/6JpIYGhMNY
5qlhU55QYP1jzjuNGVZZUS4L77eS2/W7SpPAaTmMEy/SsVB59G8Kf22oNDpVaEqQ
m+2ErqwaHvlkMjqnsx+JQbsOP0yCi2NZBoEPFdfk1H23E2deVlSDbxPso4Zb1oUD
E12769kN+SWDytuLSOAe1PY/KXqmNUKjPZl1GDCGXL7HlCnWyggUDschTsKJa19O
XXl+yCTGMUH4XAPSqavAKQbBjurqpT6i4zfooSH4TBtOHm1ExgZOUS8gglZ1JuJd
q+uJdZIgS8BcGkGw/k1bYDWY5TA4Rjv3sAOKQL1PgYBl1t/yLK441mE7LI9gWOwz
Crz7vlSxD6Jc2cYQeUVW0KPGt5aVd63Gd9HjpXxGkqYQSDRqYMCebHEAGagz+jj7
Nv/nOnf34Uc=
=mpNt
-----END PGP SIGNATURE-----
Merge tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
- Support for "N" as alias for last bit in bitmap parsing library (eg
using syntax like "nohz_full=2-N")
- kvfree_rcu updates
- mm_dump_obj() updates. (One of these is to mm, but was suggested by
Andrew Morton.)
- RCU callback offloading update
- Polling RCU grace-period interfaces
- Realtime-related RCU updates
- Tasks-RCU updates
- Torture-test updates
- Torture-test scripting updates
- Miscellaneous fixes
* tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
rcu: Provide polling interfaces for Tiny RCU grace periods
torture: Fix kvm.sh --datestamp regex check
torture: Consolidate qemu-cmd duration editing into kvm-transform.sh
torture: Print proper vmlinux path for kvm-again.sh runs
torture: Make TORTURE_TRUST_MAKE available in kvm-again.sh environment
torture: Make kvm-transform.sh update jitter commands
torture: Add --duration argument to kvm-again.sh
torture: Add kvm-again.sh to rerun a previous torture-test
torture: Create a "batches" file for build reuse
torture: De-capitalize TORTURE_SUITE
torture: Make upper-case-only no-dot no-slash scenario names official
torture: Rename SRCU-t and SRCU-u to avoid lowercase characters
torture: Remove no-mpstat error message
torture: Record kvm-test-1-run.sh and kvm-test-1-run-qemu.sh PIDs
torture: Record jitter start/stop commands
torture: Extract kvm-test-1-run-qemu.sh from kvm-test-1-run.sh
torture: Record TORTURE_KCONFIG_GDB_ARG in qemu-cmd
torture: Abstract jitter.sh start/stop into scripts
rcu: Provide polling interfaces for Tree RCU grace periods
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHJJAACgkQ+7dXa6fL
C2uv0A//S/sJyToPtj3xbzmRVmSGGWFYNRMaxBD2gYAq7swbDNiX4ZbBCe8A4FBY
zedeMfoNztHIRB2M9vvnhG4HJWXPKq2BaT0xzeteCcmZ65b5zBOrAXue0PQPqE20
xmK1RDls/y5Y2FaF92Ay0VZzXW7+y/M+RRSo+FCFzrIgpJrPprTnlZigrECYauGJ
Qdsv26rQ0flK6tyi6GVuWZIMvpINCt3WwpwQTkAUewz2VewA1tZ1xFe70sP0vF7R
MJNaS6A4uJmvoJJzb8rqdnBGiu76+TxmPaXn0IZKJBECZjBVJyk/duce0jgqbQ7C
PZz5j4C2xrPyu3Y98joj37HPEAHCy0DPRx2Es1mz5cHPzI1TDRClHzPrxyycz9gr
D9WnMiPj9ff9aDaV6XpWKyuHhPxaHpoOD3VGdrhx6bU19Jd3/mLHB3lSt1kJzWdg
QrSAk3KzMWAZigz/+I5xetOpbygKTPLEYgpdmdOSTrtACcm1wjnhIougu0FUIWXK
arPNFOIV9liN0qCQyDOcLx4UEcxXrb2W0AYeHHJDBFxJ7sT2WWUCjPZFW5bh3G+Y
goKv/XJRVWJxFlTXLZLZ5siclzzIlAAmSylh661ji836yRhqTQ3NJTB8QfnrGGsZ
QlD1hjpyqC8uwIGUvoh56KdLRTxj9Gj70gpVe/Lk3Z16mivqDUE=
=fSr0
-----END PGP SIGNATURE-----
Merge tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"Use the new netfs lib.
Begin the process of overhauling the use of the fscache API by AFS and
the introduction of support for features such as Transparent Huge
Pages (THPs).
- Add some support for THPs, including using core VM helper functions
to find details of pages.
- Use the ITER_XARRAY I/O iterator to mediate access to the pagecache
as this handles THPs and doesn't require allocation of large bvec
arrays.
- Delegate address_space read/pre-write I/O methods for AFS to the
netfs helper library. A method is provided to the library that
allows it to issue a read against the server.
This includes a change in use for PG_fscache (it now indicates a
DIO write in progress from the marked page), so a number of waits
need to be deployed for it.
- Split the core AFS writeback function to make it easier to modify
in future patches to handle writing to the cache. [This might
feasibly make more sense moved out into my fscache-iter branch].
I've tested these with "xfstests -g quick" against an AFS volume
(xfstests needs patching to make it work). With this, AFS without a
cache passes all expected xfstests; with a cache, there's an extra
failure, but that's also there before these patches. Fixing that
probably requires a greater overhaul (as can be found on my
fscache-iter branch, but that's for a later time).
Thanks should go to Marc Dionne and Jeff Altman of AuriStor for
exercising the patches in their test farm also"
Link: https://lore.kernel.org/lkml/3785063.1619482429@warthog.procyon.org.uk/
* tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Use the netfs_write_begin() helper
afs: Use new netfs lib read helper API
afs: Use the fs operation ops to handle FetchData completion
afs: Prepare for use of THPs
afs: Extract writeback extension into its own function
afs: Wait on PG_fscache before modifying/releasing a page
afs: Use ITER_XARRAY for writing
afs: Set up the iov_iter before calling afs_extract_data()
afs: Log remote unmarshalling errors
afs: Don't truncate iter during data fetch
afs: Move key to afs_read struct
afs: Print the operation debug_id when logging an unexpected data version
afs: Pass page into dirty region helpers to provide THP size
afs: Disable use of the fscache I/O routines
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHPZwACgkQ+7dXa6fL
C2uJxw/9FVNssHxtA8iFDvZskE4YHiL6vMgOgKOeVmBfUvxqJcxWQXcF8ycbon5y
jGcDRV1DWTv395ckALHqmD6SlH/5q+OBt4cCOXCebOlzbC63JmjJ6xOjHntZKw3i
9c3GITNca5AsPXHXHGIcoRY4/4FntpLoVpyfYJ4ZZJCY7a7QUbgnEIIy9/Ps8Clw
BahhiKChl2JCgV3KZBk/ypkf0IBduxKgT+IUxA9o7H5UsLzvUgnfd5uMIALLPMI1
NXzUHBJoUtnWcB52nWPufJx9YwkMfSx70mutT0T74CFxbJakwRgAl2tWr5g989qM
/fQrsOhMlU3NaXYaRPelbxkuzvy3hU1xSe3GLiZcxmh4Cb/YAX0TrHRecO62NWff
pu/UWQS8Du5Gy8DrHScuo8baI1KFfyiV2lWQPfBO8kPaEB2ERw+PN6fWSh993Cn9
4UHaR3Oyn4qyVXeirNZg+frado+BEZAbNMZwn0lyi6jnLeyir6qABOdpQk34SB35
D4jfdPOBxeh3OVFkc+EBJ98i3/nal2+yXrNOqkP4OwmF0HqGt0YKKSaLNigXaDdO
3CKmQlBqBZsUdRYHJyJsofrifkKjP78zx2WyUJPms8MGX9z+9kYR3f1erifLesCT
Kb2TrAFx4ZgqS5tFh6UHnX4x0qy2RckgNrKTMpv38K8lNqplvLo=
=tZgy
-----END PGP SIGNATURE-----
Merge tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull network filesystem helper library updates from David Howells:
"Here's a set of patches for 5.13 to begin the process of overhauling
the local caching API for network filesystems. This set consists of
two parts:
(1) Add a helper library to handle the new VM readahead interface.
This is intended to be used unconditionally by the filesystem
(whether or not caching is enabled) and provides a common
framework for doing caching, transparent huge pages and, in the
future, possibly fscrypt and read bandwidth maximisation. It also
allows the netfs and the cache to align, expand and slice up a
read request from the VM in various ways; the netfs need only
provide a function to read a stretch of data to the pagecache and
the helper takes care of the rest.
(2) Add an alternative fscache/cachfiles I/O API that uses the kiocb
facility to do async DIO to transfer data to/from the netfs's
pages, rather than using readpage with wait queue snooping on one
side and vfs_write() on the other. It also uses less memory, since
it doesn't do buffered I/O on the backing file.
Note that this uses SEEK_HOLE/SEEK_DATA to locate the data
available to be read from the cache. Whilst this is an improvement
from the bmap interface, it still has a problem with regard to a
modern extent-based filesystem inserting or removing bridging
blocks of zeros. Fixing that requires a much greater overhaul.
This is a step towards overhauling the fscache API. The change is
opt-in on the part of the network filesystem. A netfs should not try
to mix the old and the new API because of conflicting ways of handling
pages and the PG_fscache page flag and because it would be mixing DIO
with buffered I/O. Further, the helper library can't be used with the
old API.
This does not change any of the fscache cookie handling APIs or the
way invalidation is done at this time.
In the near term, I intend to deprecate and remove the old I/O API
(fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(),
fscache_write_page() and fscache_uncache_page()) and eventually
replace most of fscache/cachefiles with something simpler and easier
to follow.
This patchset contains the following parts:
- Some helper patches, including provision of an ITER_XARRAY iov
iterator and a function to do readahead expansion.
- Patches to add the netfs helper library.
- A patch to add the fscache/cachefiles kiocb API.
- A pair of patches to fix some review issues in the ITER_XARRAY and
read helpers as spotted by Al and Willy.
Jeff Layton has patches to add support in Ceph for this that he
intends for this merge window. I have a set of patches to support AFS
that I will post a separate pull request for.
With this, AFS without a cache passes all expected xfstests; with a
cache, there's an extra failure, but that's also there before these
patches. Fixing that probably requires a greater overhaul. Ceph also
passes the expected tests.
I also have patches in a separate branch to tidy up the handling of
PG_fscache/PG_private_2 and their contribution to page refcounting in
the core kernel here, but I haven't included them in this set and will
route them separately"
Link: https://lore.kernel.org/lkml/3779937.1619478404@warthog.procyon.org.uk/
* tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: Miscellaneous fixes
iov_iter: Four fixes for ITER_XARRAY
fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
netfs: Add a tracepoint to log failures that would be otherwise unseen
netfs: Define an interface to talk to a cache
netfs: Add write_begin helper
netfs: Gather stats
netfs: Add tracepoints
netfs: Provide readahead and readpage netfs helpers
netfs, mm: Add set/end/wait_on_page_fscache() aliases
netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
netfs: Documentation for helper library
netfs: Make a netfs helper module
mm: Implement readahead_control pageset expansion
mm/readahead: Handle ractl nr_pages being modified
fs: Document file_ra_state
mm/filemap: Pass the file_ra_state in the ractl
mm: Add set/end/wait functions for PG_private_2
iov_iter: Add ITER_XARRAY
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCHGMYACgkQxWXV+ddt
WDsFeA/+MZ+5UiYYucH5RVw/VExOQzSvlRVxxnOeR2s8V/gFj/Ip7d9E9UezA+lX
di2byKXPzfjL9+xoyqfEZLpPgDQrhHTIQCzuNSvzMBykJIL6Sf1OZTtUZWU3HDH7
S51UZtghgTPzeOhxsiBHqSFo9danT0w3KQhliE10Ur855ziKSvL2Tb7dvM6q5TS7
mFTAj/Y2aanDaFKQjjBzzA+GZ0LFIGuErg1PADmF5XjbyY06ho1xqQ1A0t2/XL9x
UpHdRP3E5XRAMl4uyYOUtbvUB1cROzoS6ySHJOJ9Bbz+IC0cLf5xTJkLE25bGkSi
GjFNvnQOha1s8oMIlqkw64hKQqwp+gu2iZ7m1o76Z31k7CpLAC+rg11gbpODRuoh
7B3EzowKyVihMHF8URAdC3A+9gbpPvyuGKDSy07yULh/2vas6dEzR3cPVEU0yXyJ
3DO1ds0lVY3B/T9LKPQ785hQ7VdpgZ8BdIOVRtjgV2QQEa9eFh9VzybQjU8yBXGd
vflBe8kQfASIZ5E0rcUGPUVIJoesM8U1pSlx9jvvTQVkOC/DQjtBx/5ePCL2iVfd
izY8uWlCdguF/P1CYFf1M0auASSzl3bip1NnSMVvZ90dgDEK4XaIyd16kMGDCbU2
UMOePMsLDApWcCVTqM/J+lFLa7rajRccdKby7F/zSpZIRgadPF8=
=J5jh
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The updates this time are mostly stabilization, preparation and minor
improvements.
User visible improvements:
- readahead for send, improving run time of full send by 10% and for
incremental by 25%
- make reflinks respect O_SYNC, O_DSYNC and S_SYNC flags
- export supported sectorsize values in sysfs (currently only page
size, more once full subpage support lands)
- more graceful errors and warnings on 32bit systems when logical
addresses for metadata reach the limit posed by unsigned long in
page::index
- error: fail mount if there's a metadata block beyond the limit
- error: new metadata block would be at unreachable address
- warn when 5/8th of the limit is reached, for 4K page systems
it's 10T, for 64K page it's 160T
- zoned mode
- relocated zones get reset at the end instead of discard
- automatic background reclaim of zones that have 75%+ of unusable
space, the threshold is tunable in sysfs
Fixes:
- fsync and tree mod log fixes
- fix inefficient preemptive reclaim calculations
- fix exhaustion of the system chunk array due to concurrent
allocations
- fix fallback to no compression when racing with remount
- preemptive fix for dm-crypt on zoned device that does not properly
advertise zoned support
Core changes:
- add inode lock to synchronize mmap and other block updates (eg.
deduplication, fallocate, fsync)
- kmap conversions to new kmap_local API
- subpage support (continued)
- new helpers for page state/extent buffer tracking
- metadata changes now support read and write
- error handling through out relocation call paths
- many other cleanups and code simplifications"
* tag 'for-5.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (112 commits)
btrfs: zoned: automatically reclaim zones
btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock
btrfs: zoned: reset zones of relocated block groups
btrfs: more graceful errors/warnings on 32bit systems when reaching limits
btrfs: zoned: fix unpaired block group unfreeze during device replace
btrfs: fix race when picking most recent mod log operation for an old root
btrfs: fix metadata extent leak after failure to create subvolume
btrfs: handle remount to no compress during compression
btrfs: zoned: fail mount if the device does not support zone append
btrfs: fix race between transaction aborts and fsyncs leading to use-after-free
btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
btrfs: make lock_extent_buffer_for_io() to be subpage compatible
btrfs: introduce write_one_subpage_eb() function
btrfs: introduce end_bio_subpage_eb_writepage() function
btrfs: check return value of btrfs_commit_transaction in relocation
btrfs: do proper error handling in merge_reloc_roots
btrfs: handle extent corruption with select_one_root properly
btrfs: cleanup error handling in prepare_to_merge
btrfs: do not panic in __add_reloc_root
btrfs: handle __add_reloc_root failures in btrfs_recover_relocation
...
- Update NFSv2 and NFSv3 XDR encoding functions
- Add batch Receive posting to the server's RPC/RDMA transport (take 2)
- Reduce page allocator traffic in svcrdma
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmBzLCIACgkQM2qzM29m
f5evWA//fE2WlZDoTP8Iq1BGreGrGyzqOIkJakDGoZs4VOaUJN9WxWEcmBHI4t22
yom7aZ7S7VtMF6SoGMnohYoNwkloPJ1kfYBVZuUxDUCIHGVrLaAGZwjtojQftUS0
19ZdiSx8D8xWEqI/cpbHsj+CNCH1F5IDGjJzNhlz+rIFLrRPBMeHwbY8qf/zusm/
uZ4tGJtKWmFXvdT9duGkoNXRd0gBdcfDeFN//JYrLPS9sX4zs4/2KnDh25YJR8Jf
EQryc19l4ztlVT8PXJfI4I/fG0Sfv5AWuYYzaFIncp1PkmiunqGL1yai6eeW4LEN
8v3QEUNy9J7J20FrsL1ge1icbEyObfNFvkgYqNhmGIBdbEDdLWPppL6fQKUYl/zi
HSnAoJsJYyzYKbE7BMLy3wwZ775GsTrkU/7tfiu/M9KgvXJtDy0m7vsxt3/qULXn
Bg4KAKnmYIcuigdG6BATJ23jISRK7cHH/YYlORj3lsB/KQi+c2U/5zQFaROOwbtc
Ny+y92zQZVu9ZMkUfs2+b5qdg8Z/J0p8n9MfS7GcpCIyGZTnvfs8SfM8BXuHH2Yn
BdL3xDoPlbYQnAzp16oVfdoYm8RWzFBv+xHc360ielMOQbP4ntdC0oexIApsXzyz
Yv4OK9itsLagwcuAxPWnfnIgGxj9hPsp/peUsChsglhx+4NSGWA=
=BOO0
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"Highlights:
- Update NFSv2 and NFSv3 XDR encoding functions
- Add batch Receive posting to the server's RPC/RDMA transport (take 2)
- Reduce page allocator traffic in svcrdma"
* tag 'nfsd-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (70 commits)
NFSD: Use DEFINE_SPINLOCK() for spinlock
sunrpc: Remove unused function ip_map_lookup
NFSv4.2: fix copy stateid copying for the async copy
UAPI: nfsfh.h: Replace one-element array with flexible-array member
svcrdma: Clean up dto_q critical section in svc_rdma_recvfrom()
svcrdma: Remove svc_rdma_recv_ctxt::rc_pages and ::rc_arg
svcrdma: Remove sc_read_complete_q
svcrdma: Single-stage RDMA Read
SUNRPC: Move svc_xprt_received() call sites
SUNRPC: Export svc_xprt_received()
svcrdma: Retain the page backing rq_res.head[0].iov_base
svcrdma: Remove unused sc_pages field
svcrdma: Normalize Send page handling
svcrdma: Add a "deferred close" helper
svcrdma: Maintain a Receive water mark
svcrdma: Use svc_rdma_refresh_recvs() in wc_receive
svcrdma: Add a batch Receive posting mechanism
svcrdma: Remove stale comment for svc_rdma_wc_receive()
svcrdma: Provide an explanatory comment in CMA event handler
svcrdma: RPCDBG_FACILITY is no longer used
...
Pull crypto updates from Herbert Xu:
"API:
- crypto_destroy_tfm now ignores errors as well as NULL pointers
Algorithms:
- Add explicit curve IDs in ECDH algorithm names
- Add NIST P384 curve parameters
- Add ECDSA
Drivers:
- Add support for Green Sardine in ccp
- Add ecdh/curve25519 to hisilicon/hpre
- Add support for AM64 in sa2ul"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (184 commits)
fsverity: relax build time dependency on CRYPTO_SHA256
fscrypt: relax Kconfig dependencies for crypto API algorithms
crypto: camellia - drop duplicate "depends on CRYPTO"
crypto: s5p-sss - consistently use local 'dev' variable in probe()
crypto: s5p-sss - remove unneeded local variable initialization
crypto: s5p-sss - simplify getting of_device_id match data
ccp: ccp - add support for Green Sardine
crypto: ccp - Make ccp_dev_suspend and ccp_dev_resume void functions
crypto: octeontx2 - add support for OcteonTX2 98xx CPT block.
crypto: chelsio/chcr - Remove useless MODULE_VERSION
crypto: ux500/cryp - Remove duplicate argument
crypto: chelsio - remove unused function
crypto: sa2ul - Add support for AM64
crypto: sa2ul - Support for per channel coherency
dt-bindings: crypto: ti,sa2ul: Add new compatible for AM64
crypto: hisilicon - enable new error types for QM
crypto: hisilicon - add new error type for SEC
crypto: hisilicon - support new error types for ZIP
crypto: hisilicon - dynamic configuration 'err_info'
crypto: doc - fix kernel-doc notation in chacha.c and af_alg.c
...
Clean up: The last remaining field in struct rpcrdma_frwr has been
removed, so the struct can be eliminated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: The handler only recorded a trace event. If indeed no
action is needed by the RPC/RDMA consumer, then the event can be
ignored.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The remote peer's IP address is sufficient, and does not expose
details of the kernel's memory layout.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
I found it confusing that the MR_EVENT class displays the mr.id but
the associated COMPLETION_EVENT class displays a cid (that happens
to contain the mr.id!). To make it a little easier on humans who
have to read and interpret these events, create an MR_COMPLETION
class that displays the mr.id in the same way as the MR_EVENT class.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The Send signaling logic is a little subtle, so add some
observability around it. For every xprtrdma_mr_fastreg event, there
should be an xprtrdma_mr_localinv or xprtrdma_mr_reminv event.
When these tracepoints are enabled, we can see exactly when an MR is
DMA-mapped, registered, invalidated (either locally or remotely) and
then DMA-unmapped.
kworker/u25:2-190 [000] 787.979512: xprtrdma_mr_map: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979515: xprtrdma_chunk_read: task:351@5 pos=148 5608@0x8679e0c8f6f56000:0x00000503 (last)
kworker/u25:2-190 [000] 787.979519: xprtrdma_marshal: task:351@5 xid=0x8679e0c8: hdr=52 xdr=148/5608/0 read list/inline
kworker/u25:2-190 [000] 787.979525: xprtrdma_mr_fastreg: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979526: xprtrdma_post_send: task:351@5 cq.id=0 cid=73 (2 SGEs)
...
kworker/5:1H-219 [005] 787.980567: xprtrdma_wc_receive: cq.id=1 cid=161 status=SUCCESS (0/0x0) received=164
kworker/5:1H-219 [005] 787.980571: xprtrdma_post_recvs: peer=[192.168.100.55]:20049 r_xprt=0xffff8884974d4000: 0 new recvs, 70 active (rc 0)
kworker/5:1H-219 [005] 787.980573: xprtrdma_reply: task:351@5 xid=0x8679e0c8 credits=64
kworker/5:1H-219 [005] 787.980576: xprtrdma_mr_reminv: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/5:1H-219 [005] 787.980577: xprtrdma_mr_unmap: mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
Note that I've moved the xprtrdma_post_send tracepoint so that event
always appears after the xprtrdma_mr_fastreg tracepoint. Otherwise
the event log looks counterintuitive (FastReg is always supposed to
happen before Send).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Better not to touch MRs involved in a flush or post error until the
Send and Receive Queues are drained and the transport is fully
quiescent. Simply don't insert such MRs back onto the free list.
They remain on mr_all and will be released when the connection is
torn down.
I had thought that recycling would prevent hardware resources from
being tied up for a long time. However, since v5.7, a transport
disconnect destroys the QP and other hardware-owned resources. The
MRs get cleaned up nicely at that point.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When a file gets deleted on a zoned file system, the space freed is not
returned back into the block group's free space, but is migrated to
zone_unusable.
As this zone_unusable space is behind the current write pointer it is not
possible to use it for new allocations. In the current implementation a
zone is reset once all of the block group's space is accounted as zone
unusable.
This behaviour can lead to premature ENOSPC errors on a busy file system.
Instead of only reclaiming the zone once it is completely unusable,
kick off a reclaim job once the amount of unusable bytes exceeds a user
configurable threshold between 51% and 100%. It can be set per mounted
filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
by default.
Similar to reclaiming unused block groups, these dirty block groups are
added to a to_reclaim list and then on a transaction commit, the reclaim
process is triggered but after we deleted unused block groups, which will
free space for the relocation process.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove x86's trace_kvm_age_page() tracepoint. It's mostly redundant with
the common trace_kvm_age_hva() tracepoint, and if there is a need for the
extra details, e.g. gfn, referenced, etc... those details should be added
to the common tracepoint so that all architectures and MMUs benefit from
the info.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move arm64's MMU notifier trace events into common code in preparation
for doing the hva->gfn lookup in common code. The alternative would be
to trace the gfn instead of hva, but that's not obviously better and
could also be done in common code. Tracing the notifiers is also quite
handy for debug regardless of architecture.
Remove a completely redundant tracepoint from PPC e500.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch added a tracepoint in subflow_check_data_avail() to show the
mapping status.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a tracepoint in ack_update_msk() to track the
incoming data_ack and window/snd_una updates.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a tracepoint in the mapping status function
get_mapping_status() to dump every mpext field.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a tracepoint in the packet scheduler function
mptcp_subflow_get_send().
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This tracepoint can crash when dereferencing snd_task because
when some transports connect, they put a cookie in that field
instead of a pointer to an rpc_task.
BUG: KASAN: use-after-free in trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc]
Read of size 2 at addr ffff8881a83bd3a0 by task git/331872
CPU: 11 PID: 331872 Comm: git Tainted: G S 5.12.0-rc2-00007-g3ab6e585a7f9 #1453
Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Call Trace:
dump_stack+0x9c/0xcf
print_address_description.constprop.0+0x18/0x239
kasan_report+0x174/0x1b0
trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc]
xprt_prepare_transmit+0x8e/0xc1 [sunrpc]
call_transmit+0x4d/0xc6 [sunrpc]
Fixes: 9ce07ae5eb ("SUNRPC: Replace dprintk() call site in xprt_prepare_transmit")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
A separate tracepoint can be left enabled all the time to capture
rare but important retransmission events. So for example:
kworker/u26:3-568 [009] 156.967933: xprt_retransmit: task:44093@5 xid=0xa25dbc79 nfsv3 WRITE ntrans=2
Or, for example, enable all nfs and nfs4 tracepoints, and set up a
trigger to disable tracing when xprt_retransmit fires to capture
everything that leads up to it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Sometimes we need to get the corresponding gendisk from request_queue.
It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5d ("dm: stop using ->queuedata").
So if only request_queue is given, we need to get its corresponding
gendisk to get the private data stored in that gendisk.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should be including the completion flags for better introspection on
exactly what completion event was logged.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull RCU changes from Paul E. McKenney:
- Bitmap support for "N" as alias for last bit
- kvfree_rcu updates
- mm_dump_obj() updates. (One of these is to mm, but was suggested by Andrew Morton.)
- RCU callback offloading update
- Polling RCU grace-period interfaces
- Realtime-related RCU updates
- Tasks-RCU updates
- Torture-test updates
- Torture-test scripting updates
- Miscellaneous fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With commit c588072bba ("iommu/vt-d: Convert intel iommu driver to
the iommu ops"), the trace events for dma map/unmap have no users any
more. Cleanup them to make the code neat.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210323010600.678627-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
A number of tracepoint instances have been removed from ext4 by past
patches but the definitions of those tracepoints have not.
All instances of ext4_ext_in_cache and ext4_ext_put_in_cache were
removed by commit 69eb33dc24 ("ext4: remove single extent cache").
ext4_get_reserved_cluster_alloc was removed by commit b6bf9171ef
("ext4: reduce reserved cluster count by number of allocated
clusters"). ext4_find_delalloc_range was removed by commit
7d1b1fbc95 ("ext4: reimplement ext4_find_delay_alloc_range on extent
status tree").
All instances of ext4_direct_IO_enter and ext4_direct_IO_exit were
removed by commit 378f32bab3 ("ext4: introduce direct I/O write
using iomap infrastructure").
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210216191634.20957-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove some dead code that was left over following commit 90ea1c6436
("random: remove the blocking pool").
Cc: linux-crypto@vger.kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prepare svc_xprt_received() to be called from transport code instead
of from generic RPC server code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The trace event "workqueue_queue_work" references an unsafe string in
dereferencing the name of the workqueue. As the name is allocated, it
could later be freed, and the pointer to that string could stay on the
tracing buffer. If the trace buffer is read after the string is freed, it
will reference an unsafe pointer.
I added a new verifier to make sure that all strings referenced in the
output of the trace buffer is safe to read and this triggered on the
workqueue_queue_work trace event:
workqueue_queue_work: work struct=00000000b2b235c7 function=gc_worker workqueue=(0xffff888100051160:events_power_efficient)[UNSAFE-MEMORY] req_cpu=256 cpu=1
workqueue_queue_work: work struct=00000000c344caec function=flush_to_ldisc workqueue=(0xffff888100054d60:events_unbound)[UNSAFE-MEMORY] req_cpu=256 cpu=4294967295
workqueue_queue_work: work struct=00000000b2b235c7 function=gc_worker workqueue=(0xffff888100051160:events_power_efficient)[UNSAFE-MEMORY] req_cpu=256 cpu=1
workqueue_queue_work: work struct=000000000b238b3f function=vmstat_update workqueue=(0xffff8881000c3760:mm_percpu_wq)[UNSAFE-MEMORY] req_cpu=1 cpu=1
Also, if this event is read via a user space application like perf or
trace-cmd, the name would only be an address and useless information:
workqueue_queue_work: work struct=0xffff953f80b4b918 function=disk_events_workfn workqueue=ffff953f8005d378 req_cpu=8192 cpu=5
Cc: Zqiang <qiang.zhang@windriver.com>
Cc: Tejun Heo <tj@kernel.org>
Fixes: 7bf9c4a88e ("workqueue: tracing the name of the workqueue instead of it's address")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This commit adds a trace event which allows tracing the beginnings of RCU
CPU stall warnings on systems where sysctl_panic_on_rcu_stall is disabled.
The first parameter is the name of RCU flavor like other trace events.
The second parameter indicates whether this is a stall of an expedited
grace period, a self-detected stall of a normal grace period, or a stall
of a normal grace period detected by some CPU other than the one that
is stalled.
RCU CPU stall warnings are often caused by external-to-RCU issues,
for example, in interrupt handling or task scheduling. Therefore,
this event uses TRACE_EVENT, not TRACE_EVENT_RCU, to avoid requiring
those interested in tracing RCU CPU stalls to rebuild their kernels
with CONFIG_RCU_TRACE=y.
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Sangmoon Kim <sangmoon.kim@samsung.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-03-09
The following pull-request contains BPF updates for your *net-next* tree.
We've added 90 non-merge commits during the last 17 day(s) which contain
a total of 114 files changed, 5158 insertions(+), 1288 deletions(-).
The main changes are:
1) Faster bpf_redirect_map(), from Björn.
2) skmsg cleanup, from Cong.
3) Support for floating point types in BTF, from Ilya.
4) Documentation for sys_bpf commands, from Joe.
5) Support for sk_lookup in bpf_prog_test_run, form Lorenz.
6) Enable task local storage for tracing programs, from Song.
7) bpf_for_each_map_elem() helper, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The XDP_REDIRECT implementations for maps and non-maps are fairly
similar, but obviously need to take different code paths depending on
if the target is using a map or not. Today, the redirect targets for
XDP either uses a map, or is based on ifindex.
Here, the map type and id are added to bpf_redirect_info, instead of
the actual map. Map type, map item/ifindex, and the map_id (if any) is
passed to xdp_do_redirect().
For ifindex-based redirect, used by the bpf_redirect() XDP BFP helper,
a special map type/id are used. Map type of UNSPEC together with map id
equal to INT_MAX has the special meaning of an ifindex based
redirect. Note that valid map ids are 1 inclusive, INT_MAX exclusive
([1,INT_MAX[).
In addition to making the code easier to follow, using explicit type
and id in bpf_redirect_info has a slight positive performance impact
by avoiding a pointer indirection for the map type lookup, and instead
use the cacheline for bpf_redirect_info.
Since the actual map is not passed via bpf_redirect_info anymore, the
map lookup is only done in the BPF helper. This means that the
bpf_clear_redirect_map() function can be removed. The actual map item
is RCU protected.
The bpf_redirect_info flags member is not used by XDP, and not
read/written any more. The map member is only written to when
required/used, and not unconditionally.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-3-bjorn.topel@gmail.com
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).
While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.
Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires to check mm_cpumask() one more time, but unless this mask is
updated very frequently, this should impact performance negatively.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> # Hyper-v parts
Reviewed-by: Juergen Gross <jgross@suse.com> # Xen and paravirt parts
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-5-namit@vmware.com
Currently, exception event status can be read from wExceptionEventStatus
attribute (sysfs file attributes/exception_event_status under the UFS host
controller device directory). Polling that attribute to track UFS exception
events is impractical, so add a tracepoint to track exception events for
testing and debugging purposes.
Note, by the time the exception event status is read, the exception event
may have cleared, so the value can be zero - see example below.
Note also, only enabled exception events can be reported. A subsequent
patch adds the ability for users to enable selected exception events via
debugfs.
Example with driver instrumented to enable all exception events:
# echo 1 > /sys/kernel/debug/tracing/events/ufs/ufshcd_exception_event/enable
... do some I/O ...
# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 3/3 #P:5
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
kworker/2:2-173 [002] .... 731.486419: ufshcd_exception_event: 0000:00:12.5: status 0x0
kworker/2:2-173 [002] .... 732.608918: ufshcd_exception_event: 0000:00:12.5: status 0x4
kworker/2:2-173 [002] .... 732.609312: ufshcd_exception_event: 0000:00:12.5: status 0x4
Link: https://lore.kernel.org/r/20210209062437.6954-2-adrian.hunter@intel.com
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Reviewed-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA6njIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprolD/9zWti9LsZvA7yE+PhVwrwF3CsNzLfQlClw
99HaA7HxtAc/VLJrnD/SubhCAPdBC5B2xPv6faajdwF2iUR3Rr1Uc93CQ3uP2KKq
kvm6ALTpzPTMI6YSABhY74sg9BkkoDbMo54JQYVQPleiE+5eDLbuFZck6ObfUHyY
a4aaImlndWp/t14GzrClL4hucF+5KJy846P+QCVclkh0yl8xSsqZ5LIFU7tu3iQb
HpZ5HKLT/2ma/EOr3wknnsIe97AUZQU0q5aMparhYlm+qR511eop3QXx850FL/oC
tEGceKLij6qazmkiocKVzML8Fs+Y9/a4vCMjLCScWJmzDlmKdlH2uudeahN6b9Hm
15qRQHOjl1Hc2bdr5ZVn87nq9RWhSm18C+SRMwOKHCOnEhwxqM3RjRfAgj4BJ6QB
PFbFqdY+8Y1YLPFmn9hph72ePaEcN4L2IXW6TI/WX8mot8ODAnkq9Hr38dKwzO+i
0mon6DVyJKKho6XwvVu5IYurkR2beQprjeVUxwZjjT6DxUgsc+J6itK5LDHFSkeZ
qZlXn5Di8MkiXg0DFJYDQiFXnO0Z5GlRWOGPVfBaOr3x+1dqzDdHGw4oz1oGqvnr
GNNYCsYIpDGm7eauX5lqL5MUFpjqRCceXy5JSHPhnWWw617nYkr4H9jdsV9HiTX1
tQFx05QW3w==
=ccMs
-----END PGP SIGNATURE-----
Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"A few stragglers (and one due to me missing it originally), and fixes
for changes in this merge window mostly. In particular:
- blktrace cleanups (Chaitanya, Greg)
- Kill dead blk_pm_* functions (Bart)
- Fixes for the bio alloc changes (Christoph)
- Fix for the partition changes (Christoph, Ming)
- Fix for turning off iopoll with polled IO inflight (Jeffle)
- nbd disconnect fix (Josef)
- loop fsync error fix (Mauricio)
- kyber update depth fix (Yang)
- max_sectors alignment fix (Mikulas)
- Add bio_max_segs helper (Matthew)"
* tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits)
block: Add bio_max_segs
blktrace: fix documentation for blk_fill_rw()
block: memory allocations in bounce_clone_bio must not fail
block: remove the gfp_mask argument to bounce_clone_bio
block: fix bounce_clone_bio for passthrough bios
block-crypto-fallback: use a bio_set for splitting bios
block: fix logging on capacity change
blk-settings: align max_sectors on "logical_block_size" boundary
block: reopen the device in blkdev_reread_part
block: don't skip empty device in in disk_uevent
blktrace: remove debugfs file dentries from struct blk_trace
nbd: handle device refs for DESTROY_ON_DISCONNECT properly
kyber: introduce kyber_depth_updated()
loop: fix I/O error on fsync() in detached loop devices
block: fix potential IO hang when turning off io_poll
block: get rid of the trace rq insert wrapper
blktrace: fix blk_rq_merge documentation
blktrace: fix blk_rq_issue documentation
blktrace: add blk_fill_rwbs documentation comment
block: remove superfluous param in blk_fill_rwbs()
...
Two fixes:
- Fix an unsafe printf string usage in a kmem trace event
- Fix spelling in output from the latency-collector tool
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYDhhwBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtclAQCztNe2l9UrmkADtaJNHOcmujRvxi+9
j0ppoSe7uKI4zAD+JVG40IyxvprfxengjmwIdvihL+uuWL50ULmvozRdzQM=
=8nVm
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two fixes:
- Fix an unsafe printf string usage in a kmem trace event
- Fix spelling in output from the latency-collector tool"
* tag 'trace-v5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/tools: fix a couple of spelling mistakes
mm, tracing: Fix kmem_cache_free trace event to not print stale pointers
Merge more updates from Andrew Morton:
"118 patches:
- The rest of MM.
Includes kfence - another runtime memory validator. Not as thorough
as KASAN, but it has unmeasurable overhead and is intended to be
usable in production builds.
- Everything else
Subsystems affected by this patch series: alpha, procfs, sysctl,
misc, core-kernel, MAINTAINERS, lib, bitops, checkpatch, init,
coredump, seq_file, gdb, ubsan, initramfs, and mm (thp, cma,
vmstat, memory-hotplug, mlock, rmap, zswap, zsmalloc, cleanups,
kfence, kasan2, and pagemap2)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
MIPS: make userspace mapping young by default
initramfs: panic with memory information
ubsan: remove overflow checks
kgdb: fix to kill breakpoints on initmem after boot
scripts/gdb: fix list_for_each
x86: fix seq_file iteration for pat/memtype.c
seq_file: document how per-entry resources are managed.
fs/coredump: use kmap_local_page()
init/Kconfig: fix a typo in CC_VERSION_TEXT help text
init: clean up early_param_on_off() macro
init/version.c: remove Version_<LINUX_VERSION_CODE> symbol
checkpatch: do not apply "initialise globals to 0" check to BPF progs
checkpatch: don't warn about colon termination in linker scripts
checkpatch: add kmalloc_array_node to unnecessary OOM message check
checkpatch: add warning for avoiding .L prefix symbols in assembly files
checkpatch: improve TYPECAST_INT_CONSTANT test message
checkpatch: prefer ftrace over function entry/exit printks
checkpatch: trivial style fixes
checkpatch: ignore warning designated initializers using NR_CPUS
checkpatch: improve blank line after declaration test
...
Patch series "Add error_report_end tracepoint to KFENCE and KASAN", v3.
This patchset adds a tracepoint, error_repor_end, that is to be used by
KFENCE, KASAN, and potentially other bug detection tools, when they print
an error report. One of the possible use cases is userspace collection of
kernel error reports: interested parties can subscribe to the tracing
event via tracefs, and get notified when an error report occurs.
This patch (of 3):
Introduce error_report_end tracepoint. It can be used in debugging tools
like KASAN, KFENCE, etc. to provide extensions to the error reporting
mechanisms (e.g. allow tests hook into error reporting, ease error report
collection from production kernels). Another benefit would be making use
of ftrace for debugging or benchmarking the tools themselves.
Should we need it, the tracepoint name leaves us with the possibility to
introduce a complementary error_report_start tracepoint in the future.
Link: https://lkml.kernel.org/r/20210121131915.1331302-1-glider@google.com
Link: https://lkml.kernel.org/r/20210121131915.1331302-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Suggested-by: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- New Features:
- Support for eager writes, and the write=eager and write=wait mount options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmAwOLwACgkQ18tUv7Cl
QOsUfw//W2KoJ+2IQohQNFcoi+bG1OQE7jnqHtQ+tsKfpJKemcDcu8wQEAqrwALg
vXioG1Ye0QU7P5PZtNxCorylqSTVGvJSIOrfa3lTdn/PDbI7NIgN52w56TzzfeXn
pJ4gDwZzPwUFUblF0LBQUIhJv5IQvOXVgUsMqezbIbMXSiuLR/bjnZ96Q/woKpoL
eg2IZ5EO9Jb0QjuQ1e9U303X7c2qOl1jzpxyQLQfD7ONnWBx3HnJk1l+3JJRi8JV
smnae3I0L3nUZ7rBqoqsvK7YUjUchCEBvkmEMsnHT94D5tI9mxxX5OquREee6QHn
NuJRSNbsIiCD3Ne27fkCut78d6SetoMko7jZ97T6smhyijtXJiLG/6dycMPV9rt/
bVIudWMm9/A9AsXyY2YP5LC6Y6W6dhQRXygUjVgEPBl6kVsb2Eca8IA9QZghF9IL
+XSEulASvxo2rWPylJJ+3aLynfqoHrowVN/Tu61svDnJWTcb+FCxQ5zyLox7erEH
mUhraf1D0uoX9odH1069toN6favZFE6SIDvlUk1QTOjr6p3Jxmkuyl6PNs5t66/S
550z5JVb2deIHOPQxOie7xz/Dk6dnRoaFhTNq/Ootkt9GNe0A+NqSUdoRA5XxN5m
wW11ecLSZSehDksuXjyFmkHtkagLreFxLsHbVnaAtwEm7h/thRI=
=Dssn
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS Client Updates from Anna Schumaker:
"New Features:
- Support for eager writes, and the write=eager and write=wait mount
options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support
RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files"
* tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo
NFS: Set the stable writes flag when initialising the super block
NFS: Add mount options supporting eager writes
NFS: Add support for eager writes
NFS: 'flags' field should be unsigned in struct nfs_server
NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
NFS: Always clear an invalid mapping when attempting a buffered write
NFS: Optimise sparse writes past the end of file
NFS: Fix documenting comment for nfs_revalidate_file_size()
NFSv4: Fixes for nfs4_bitmask_adjust()
xprtrdma: Clean up rpcrdma_prepare_readch()
rpcrdma: Capture bytes received in Receive completion tracepoints
xprtrdma: Pad optimization, revisited
rpcrdma: Fix comments about reverse-direction operation
xprtrdma: Refactor invocations of offset_in_page()
xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map()
xprtrdma: Remove FMR support in rpcrdma_convert_iovs()
NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async()
NFS: Call readpage_async_filler() from nfs_readpage_async()
NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc
...
The update to kmem_cache_free trace event added printing of the slab name in
the trace event. But it only stores the pointer of the name which will be
printed as a string when the event is read some time in the future. This is
dangerous because the name could be freed in the mean time and when reading
the trace event it would try to dereference the string name by the pointer
to the name that has been freed.
Instead, use the trace event helper macros __string(), __assign_str(), and
__get_str() that are for this very case.
Cc: Jacob Wen <jian.w.wen@oracle.com>
Fixes: 3544de8ee6 ("mm, tracing: record slab name for kmem_cache_free()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The parameter is redundant in the sense that it can be extracted
from the "struct page" parameter by page_lru() correctly.
Link: https://lore.kernel.org/linux-mm/20201207220949.830352-5-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-5-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, a trace record generated by the RCU core is as below.
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f3b49a66
It doesn't tell us what the RCU core has freed.
This patch adds the slab name to trace_kmem_cache_free().
The new format is as follows.
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000037f79c8d name=dentry
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f78cb7b5 name=sock_inode_cache
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000018768985 name=pool_workqueue
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=000000006a6cb484 name=radix_tree_node
We can use it to understand what the RCU core is going to free. For
example, some users maybe interested in when the RCU core starts
freeing reclaimable slabs like dentry to reduce memory pressure.
Link: https://lkml.kernel.org/r/20201216072804.8838-1-jian.w.wen@oracle.com
Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull qorkqueue updates from Tejun Heo:
"Tracepoint and comment updates only"
* 'for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Use %s instead of function name
workqueue: tracing the name of the workqueue instead of it's address
workqueue: fix annotation for WQ_SYSFS
- Update to the way irqs and preemption is tracked via the trace event PC field
- Fix handling of unregistering event failing due to allocate memory.
This is only triggered by failure injection, as it is pretty much guaranteed
to have less than a page allocation succeed.
- Do not show the useless "filter" or "enable" files for the "ftrace" trace
system, as they have no effect on doing anything.
- Add a warning if kprobes are registered more than once.
- Synthetic events now have their fields parsed by semicolons.
Old formats without semicolons will still work, but new features will
require them.
- New option to allow trace events to show %p without hashing in trace file.
The trace file can only be read by root, and reading the raw event buffer
did not have any pointers hashed, so this does not expose anything new.
- New directory in tools called tools/tracing, where a new tool that reads
sequential latency reports from the ftrace latency tracers.
- Other minor fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYDL2wBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qti6AP0RUcSU5U1onx8DcwPQLC5Xr3CPqJkm
RvKeJDdgFP+sVgEAiMTFsy2UMc0gmlHZMFd5nZLSiJCu1I2hHmhS5yKbHgY=
=fD9+
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Update to the way irqs and preemption is tracked via the trace event
PC field
- Fix handling of unregistering event failing due to allocate memory.
This is only triggered by failure injection, as it is pretty much
guaranteed to have less than a page allocation succeed.
- Do not show the useless "filter" or "enable" files for the "ftrace"
trace system, as they have no effect on doing anything.
- Add a warning if kprobes are registered more than once.
- Synthetic events now have their fields parsed by semicolons. Old
formats without semicolons will still work, but new features will
require them.
- New option to allow trace events to show %p without hashing in trace
file. The trace file can only be read by root, and reading the raw
event buffer did not have any pointers hashed, so this does not
expose anything new.
- New directory in tools called tools/tracing, where a new tool that
reads sequential latency reports from the ftrace latency tracers.
- Other minor fixes and cleanups.
* tag 'trace-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
kprobes: Fix to delay the kprobes jump optimization
tracing/tools: Add the latency-collector to tools directory
tracing: Make hash-ptr option default
tracing: Add ptr-hash option to show the hashed pointer value
tracing: Update the stage 3 of trace event macro comment
tracing: Show real address for trace event arguments
selftests/ftrace: Add '!event' synthetic event syntax check
selftests/ftrace: Update synthetic event syntax errors
tracing: Add a backward-compatibility check for synthetic event creation
tracing: Update synth command errors
tracing: Rework synthetic event command parsing
tracing/dynevent: Delegate parsing to create function
kprobes: Warn if the kprobe is reregistered
ftrace: Remove unused ftrace_force_update()
tracepoints: Code clean up
tracepoints: Do not punish non static call users
tracepoints: Remove unnecessary "data_args" macro parameter
tracing: Do not create "enable" or "filter" files for ftrace event subsystem
kernel: trace: preemptirq_delay_test: add cpu affinity
tracepoint: Do not fail unregistering a probe due to memory failure
...
Including:
- ARM SMMU and Mediatek updates from Will Deacon:
- Support for MT8192 IOMMU from Mediatek
- Arm v7s io-pgtable extensions for MT8192
- Removal of TLBI_ON_MAP quirk
- New Qualcomm compatible strings
- Allow SVA without hardware broadcast TLB maintenance
on SMMUv3
- Virtualization Host Extension support for SMMUv3 (SVA)
- Allow SMMUv3 PMU (perf) driver to be built
independently from IOMMU
- Some tidy-up in IOVA and core code
- Conversion of the AMD IOMMU code to use the generic
IO-page-table framework
- Intel VT-d updates from Lu Baolu:
- Audit capability consistency among different IOMMUs
- Add SATC reporting structure support
- Add iotlb_sync_map callback support
- SDHI Support for Renesas IOMMU driver
- Misc Cleanups and other small improvments
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmAz2AcACgkQK/BELZcB
GuMflRAAyOfXzFqfmniYKxXmxAhOIlibCoojXzecItifdyaFkvUhxCadRg5u3dLH
IeACDvUiaA5VQnVhjI0a7gCckKdq5YDwwAS+GQICUKNEZtWwrHwm2QTRBmL9MOlA
v+iTrhYCqZWIAPe16BP5L4u6q4JWS2N9oNmDp6ia1VIhPjfsHU+gXpYKSxiLicmV
VECAHJk5/JrwKXBP2nMg3ZqGz9RoJc2CzC4zvKu7ADDB9Zl+pXs74mb4ta81Y3G4
vf07G/ORYJjbMskz5KcmYw2897I9ejMrNaHrYNjlh2IDpGqmoCJ4+8vuVO0zslrm
GzMOHMshaI653BmuRDHyczwNrNxMxSX3NOeR2fp76d3MSouK7RoFMr7ghMAegp1u
qmSrqFbgnOT4cdzN8QpPyU22lmVHtQHm0P4EpZZzC95dtJo1nzt8BFrDjPdJDOyZ
D7oKvZq+OA6MtjCN4gZ2ClxQoiUZ8E/jP1uGIknpzR1oeWnyEFtx8aI9q4yRjNcp
n8UR7wFqbOIV0O/QC7UlEp/xSpG7BDN4BkvIgbH2qe6LmRmejCrnUWv6pkmPc9sI
wFgV/Qnh9oo7yf6zvmCsi1r0kmfGLPRe8GB9+eN3wY3xxzDOLJEJNdVghWaP8Rz7
MBcL/0u8+fMfQlRiOPX64BSIF7fFPY0r0+erINawf1VUbyjUsUE=
=xSX1
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- ARM SMMU and Mediatek updates from Will Deacon:
- Support for MT8192 IOMMU from Mediatek
- Arm v7s io-pgtable extensions for MT8192
- Removal of TLBI_ON_MAP quirk
- New Qualcomm compatible strings
- Allow SVA without hardware broadcast TLB maintenance on SMMUv3
- Virtualization Host Extension support for SMMUv3 (SVA)
- Allow SMMUv3 PMU perf driver to be built independently from IOMMU
- Some tidy-up in IOVA and core code
- Conversion of the AMD IOMMU code to use the generic IO-page-table
framework
- Intel VT-d updates from Lu Baolu:
- Audit capability consistency among different IOMMUs
- Add SATC reporting structure support
- Add iotlb_sync_map callback support
- SDHI support for Renesas IOMMU driver
- Misc cleanups and other small improvments
* tag 'iommu-updates-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (94 commits)
iommu/amd: Fix performance counter initialization
MAINTAINERS: repair file pattern in MEDIATEK IOMMU DRIVER
iommu/mediatek: Fix error code in probe()
iommu/mediatek: Fix unsigned domid comparison with less than zero
iommu/vt-d: Parse SATC reporting structure
iommu/vt-d: Add new enum value and structure for SATC
iommu/vt-d: Add iotlb_sync_map callback
iommu/vt-d: Move capability check code to cap_audit files
iommu/vt-d: Audit IOMMU Capabilities and add helper functions
iommu/vt-d: Fix 'physical' typos
iommu: Properly pass gfp_t in _iommu_map() to avoid atomic sleeping
iommu/vt-d: Fix compile error [-Werror=implicit-function-declaration]
driver/perf: Remove ARM_SMMU_V3_PMU dependency on ARM_SMMU_V3
MAINTAINERS: Add entry for MediaTek IOMMU
iommu/mediatek: Add mt8192 support
iommu/mediatek: Remove unnecessary check in attach_device
iommu/mediatek: Support master use iova over 32bit
iommu/mediatek: Add iova reserved function
iommu/mediatek: Support for multi domains
iommu/mediatek: Add get_domain_id from dev->dma_range_map
...
This series consists of the usual driver updates (ufs, ibmvfc,
qla2xxx, hisi_sas, pm80xx) plus the removal of the gdth driver (which
is bound to cause conflicts with a trivial change somewhere). The
only big major rework of note is the one from Hannes trying to clean
up our result handling code in the drivers to make it consistent.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYDAdliYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishTblAQCk6wD8
fcb4TItSRp0DpRzs37zhppEbrBgveuAFHhr5swEA0gL2mHcq0vnyNBinCLnERrE7
TPYJqUKJNktnjVG7ZWc=
=wW6p
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (ufs, ibmvfc,
qla2xxx, hisi_sas, pm80xx) plus the removal of the gdth driver (which
is bound to cause conflicts with a trivial change somewhere).
The only big major rework of note is the one from Hannes trying to
clean up our result handling code in the drivers to make it
consistent"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (194 commits)
scsi: MAINTAINERS: Adjust to reflect gdth scsi driver removal
scsi: ufs: Give clk scaling min gear a value
scsi: lpfc: Fix 'physical' typos
scsi: megaraid_mbox: Fix spelling of 'allocated'
scsi: qla2xxx: Simplify the calculation of variables
scsi: message: fusion: Fix 'physical' typos
scsi: target: core: Change ASCQ for residual write
scsi: target: core: Signal WRITE residuals
scsi: target: core: Set residuals for 4Kn devices
scsi: hisi_sas: Add trace FIFO debugfs support
scsi: hisi_sas: Flush workqueue in hisi_sas_v3_remove()
scsi: hisi_sas: Enable debugfs support by default
scsi: hisi_sas: Don't check .nr_hw_queues in hisi_sas_task_prep()
scsi: hisi_sas: Remove deferred probe check in hisi_sas_v2_probe()
scsi: lpfc: Add auto select on IRQ_POLL
scsi: ncr53c8xx: Fix typos
scsi: lpfc: Fix ancient double free
scsi: qla2xxx: Fix some memory corruption
scsi: qla2xxx: Remove redundant NULL check
scsi: megaraid: Fix ifnullfree.cocci warnings
...
The commit f3bdc62fd8 ("blktrace: Provide event for request merging")
added the comment for blk_rq_merge() tracepoint. Remove the duplicate
word from the tracepoint documentation.
Fixes: f3bdc62fd8 ("blktrace: Provide event for request merging")
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit 881245dcff ("Add DocBook documentation for the block tracepoints.")
added the comment for blk_rq_issue() tracepoint. Remove the duplicate
word from the tracepoint documentation.
Fixes: 881245dcff ("Add DocBook documentation for the block tracepoints.")
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The last parameter for the function blk_fill_rwbs() was added in
5782138e47 ("tracing/events: convert block trace points to
TRACE_EVENT()") in order to signal read request and use of that parameter
was replaced with using switch case REQ_OP_READ with
1b9a9ab78b ("blktrace: use op accessors"), but the parameter was never
removed.
Remove the unused parameter and adjust the respective call sites.
Fixes: 1b9a9ab78b ("blktrace: use op accessors")
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Documentation updates.
- Miscellaneous fixes.
- kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return
addresses to more easily locate bugs. This has a couple of RCU-related commits,
but is mostly MM. Was pulled in with akpm's agreement.
- Per-callback-batch tracking of numbers of callbacks,
which enables better debugging information and smarter
reactions to large numbers of callbacks.
- The first round of changes to allow CPUs to be runtime switched from and to
callback-offloaded state.
- CONFIG_PREEMPT_RT-related changes.
- RCU CPU stall warning updates.
- Addition of polling grace-period APIs for SRCU.
- Torture-test and torture-test scripting updates, including a "torture everything"
script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale.
Plus does an allmodconfig build.
- nolibc fixes for the torture tests
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmAs9lgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j/axAAsqIvarDD6OLmgcOPCyWSvfG6LsIFgqI9
CY0JdQBtvFBTvE8Q2No5ktbLVmuYsBh0dGeFkv4HQZJyRlr7mjstVMNN4SeBDVIS
/+zZO1wlwzaXfKQopLctTK1O/UzFqIN2sqyzA3nLzGGj8DqgxXJyreJ10feK5XM+
6ttZPd1qm4hqtpA22ZEODbct5OFqZuvnK8VNqBb2YHabA1rasUXbIEJPBpsuv/W2
l9W5AGP4erdOFm3nHJxiCpvLJtgHy4njvw0HJp5f99Abj6OVeAzw5kFjvRB3n1Qd
ayKyTw8T/1mfmkjvYkGsMAqhEmqwXcryFX0dR/14/XPdXyjPhZlbkz+MfRKrn4NT
LBJPX+MlX9lVFWBNR9HMe2o/083+gorlwZt9wtyt0OBBGGgudYo4uKNdbyy6tB3Y
Gb98P2vtVSO24EsQce6M+ppHN4TgVBd6id82MQxNuFw+PQJdBiCY0JJfNQApbAry
cIKOchSSR2SkJHlAevNVaKAeiTnkAXd1jDBKtCnvCqOUyvtnhE3rQCqwS5xT2Cno
oQydpudwBKT7uO/GUyS0ESErjHuy9zhExNSYD0ydxlBCrGbzrrgPg57ntXHA1die
mtFyvc2tfT/AshWRNYiuCG+eaUG3qK7n7jN7Vc6/K5DR4GMb5tOhL9wPx2ljCRGu
Z8WDg0pJGz4=
=31yj
-----END PGP SIGNATURE-----
Merge tag 'core-rcu-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
"These are the latest RCU updates for v5.12:
- Documentation updates.
- Miscellaneous fixes.
- kfree_rcu() updates: Addition of mem_dump_obj() to provide
allocator return addresses to more easily locate bugs. This has a
couple of RCU-related commits, but is mostly MM. Was pulled in with
akpm's agreement.
- Per-callback-batch tracking of numbers of callbacks, which enables
better debugging information and smarter reactions to large numbers
of callbacks.
- The first round of changes to allow CPUs to be runtime switched
from and to callback-offloaded state.
- CONFIG_PREEMPT_RT-related changes.
- RCU CPU stall warning updates.
- Addition of polling grace-period APIs for SRCU.
- Torture-test and torture-test scripting updates, including a
"torture everything" script that runs rcutorture, locktorture,
scftorture, rcuscale, and refscale. Plus does an allmodconfig
build.
- nolibc fixes for the torture tests"
* tag 'core-rcu-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
percpu_ref: Dump mem_dump_obj() info upon reference-count underflow
rcu: Make call_rcu() print mem_dump_obj() info for double-freed callback
mm: Make mem_obj_dump() vmalloc() dumps include start and length
mm: Make mem_dump_obj() handle vmalloc() memory
mm: Make mem_dump_obj() handle NULL and zero-sized pointers
mm: Add mem_dump_obj() to print source of memory block
tools/rcutorture: Fix position of -lgcc in mkinitrd.sh
tools/nolibc: Fix position of -lgcc in the documented example
tools/nolibc: Emit detailed error for missing alternate syscall number definitions
tools/nolibc: Remove incorrect definitions of __ARCH_WANT_*
tools/nolibc: Get timeval, timespec and timezone from linux/time.h
tools/nolibc: Implement poll() based on ppoll()
tools/nolibc: Implement fork() based on clone()
tools/nolibc: Make getpgrp() fall back to getpgid(0)
tools/nolibc: Make dup2() rely on dup3() when available
tools/nolibc: Add the definition for dup()
rcutorture: Add rcutree.use_softirq=0 to RUDE01 and TASKS01
torture: Maintain torture-specific set of CPUs-online books
torture: Clean up after torture-test CPU hotplugging
rcutorture: Make object_debug also double call_rcu() heap object
...
- Update NFSv2 and NFSv3 XDR decoding functions
- Further improve support for re-exporting NFS mounts
- Convert NFSD stats to per-CPU counters
- Add batch Receive posting to the server's RPC/RDMA transport
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmAYVsAACgkQM2qzM29m
f5f1Lg/+IBC7Bhnnc8jNr4nv4IntCwwKdx2VzSzQszbN/kkhLZK89u36nZyqp0RB
Vg3olyS5DseEisMMx0rI0KkHBz7pz+kXVdOGvve8fHBZvewnJ/FpxNZPChG4aMDc
mfjHLvDHO0/GoUqSftrBrjSEJ2jHoNdDcmvzgdAlugTuLOjGX3HhmKa3ZYVTNgFn
kDmFMaEHjS3pb3LqNDHNIYYpNnvtIukxHUh9weDvr+AH8Rmt/WVfjDc26xBS0FQu
jDJUk9AP06VYgZx0dLKp4In8GJYwz9DNjNrWm91+RyJml9AWrFswdBHHcfi0W/Yy
GipkBZGYE6ZblyMlITZCB4etyHQsq7qLuqicTlcXjL/Fdkd7xlT8DwFlZ8LjpyCU
LeHTI2cGzRSJ/JjL2hvhPvT3gR5hln/qk17jSP7V4S6psZAqAEvw/Xa/+MDJhB/b
vnzltFPvEgZc59Q/SJLbaWZLHy1q0enbrOBLMZDmUlk911/tgAuflHJM60N8o732
vkfy05pvZlrV0cFY546pQd7zTKZcAOYPVHHoP25wPa2ibKBu6eQ6kZEi5zu+tVK3
CkvqIhePFspBMQ6GOPKixTiFV4KFoO1HBtk+JEeMkiHXHk1xATCWbg1m7wkaagsq
NNS/qFkLRnftGYpFViBaxTFBGxiBOSbsTIS/zfj5L7JOpW4FRD4=
=02xw
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
- Update NFSv2 and NFSv3 XDR decoding functions
- Further improve support for re-exporting NFS mounts
- Convert NFSD stats to per-CPU counters
- Add batch Receive posting to the server's RPC/RDMA transport
* tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (65 commits)
nfsd: skip some unnecessary stats in the v4 case
nfs: use change attribute for NFS re-exports
NFSv4_2: SSC helper should use its own config.
nfsd: cstate->session->se_client -> cstate->clp
nfsd: simplify nfsd4_check_open_reclaim
nfsd: remove unused set_client argument
nfsd: find_cpntf_state cleanup
nfsd: refactor set_client
nfsd: rename lookup_clientid->set_client
nfsd: simplify nfsd_renew
nfsd: simplify process_lock
nfsd4: simplify process_lookup1
SUNRPC: Correct a comment
svcrdma: DMA-sync the receive buffer in svc_rdma_recvfrom()
svcrdma: Reduce Receive doorbell rate
svcrdma: Deprecate stat variables that are no longer used
svcrdma: Restore read and write stats
svcrdma: Convert rdma_stat_sq_starve to a per-CPU counter
svcrdma: Convert rdma_stat_recv to a per-CPU counter
svcrdma: Refactor svc_rdma_init() and svc_rdma_clean_up()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyGEACgkQxWXV+ddt
WDuU6BAAhfI5BndMm6a1LooMsBHTR7Mh/aFXZEKX7vCDRnrkr+WiihDFhXu4tH3y
arRsdwMnJCnta2/JMI5xCZZRg9Bsb/Sa0qWoR9sDBVoGRMnE1DS5YHQyv0bfJYk0
qYOW/jorBV1n/hL19+WbDFajwajP86uGtlDKV7cJ/C3lIogQma7zQ7ygwxbDcZqm
ZQVHg7ooM4P1t7EV0eDlatxn0Sm8KFkxXD7dbu37qDLWr3Aw8N4IwT7I9h4b+/tg
hL4dqMPxX6AyRiI0VBsqKnmcRWtT9cN7yw0+J+/JK5KuaFFx3qyZZ+EQu1jAGZDt
2m432YKya8LQfyBuSe8uoCIcczhGoD0EPIhspecDMfWTvxdo+AeTJZzZzj3u1y+v
3pih+gBN1sa8vRVSX08mIBF/k0pPfxRu7gIjvl4wl18bm3Khq5VJ93ImP7DNroNg
bKiUG35K+kvXGBNaLY71zZfO6aLMddK73aDudSbYOS8XcbKhor1G8j5o5/EkcVQA
wio4Gw5BmfVeRuXOl2h1aEXThk+469s0DR7MiMiAA6917cUjQiFUgFOaogR0XY3S
8ffX+S50AFW834J0eIGHPLmzi70WwSSXCS2q+zl87PPRK5+jCp9ZzWGi9MGG1qdh
fp7XVMkzHVSKGK5GXB+ICUfzkShxfTCh+EbxcXIulONxsEdADsc=
=0O6r
-----END PGP SIGNATURE-----
Merge tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings updates of space handling, performance improvements or bug
fixes. The subpage block size and zoned mode features have reached
state where they're usable but with limitations.
Performance or related:
- do not block on deleted block group mutex in the cleaner, avoids
some long stalls
- improved flushing: make it work better with ticket space
reservations and avoid excessive transaction commits in some
scenarios, slightly improves throughput for random write load
- preemptive background flushing: separate the logic from ticket
reservations, improve the accounting and decisions when to flush in
low space conditions
- less lock contention related to running delayed refs, let just one
thread do the flushing when there are many inside transaction
commit
- dbench workload improvements: avoid unnecessary work when logging
inodes, fewer fallbacks to transaction commit and thus less waiting
for it (+7% throughput, -20% latency)
Core:
- subpage block size
- currently read-only support
- refactor and generalize code where sectorsize is assumed to be
page size, add the subpage handling everywhere
- the read-write support is on the way, page sizes are still
limited to 4K or 64K
- zoned mode, first working version but with limitations
- SMR/ZBC/ZNS friendly allocation mode, utilizing the "no fixed
location for structures" and chunked allocation
- superblock as the only fixed data structure needs special
handling, uses 2 consecutive zones as a ring buffer
- tree-log support with a dedicated block group to avoid unordered
writes
- emulated zones on non-zoned devices
- not yet working
- all non-single block group profiles, requires more zone write
pointer synchronization between the multiple block groups
- fitrim due to dependency on space cache, can be implemented
Fixes:
- ref-verify: proper tree owner and node level tracking
- fix pinned byte accounting, causing some early ENOSPC now more
likely due to other changes in delayed refs
Other:
- error handling fixes and improvements
- more error injection points
- more function documentation
- more and updated tracepoints
- subset of W=1 checked by default
- update comments to allow more automatic kdoc parameter checks"
* tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (144 commits)
btrfs: zoned: enable to mount ZONED incompat flag
btrfs: zoned: deal with holes writing out tree-log pages
btrfs: zoned: reorder log node allocation on zoned filesystem
btrfs: zoned: serialize log transaction on zoned filesystems
btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
btrfs: split alloc_log_tree()
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
btrfs: zoned: enable relocation on a zoned filesystem
btrfs: zoned: support dev-replace in zoned filesystems
btrfs: zoned: implement copying for zoned device-replace
btrfs: zoned: implement cloning for zoned device-replace
btrfs: zoned: mark block groups to copy for device-replace
btrfs: zoned: do not use async metadata checksum on zoned filesystems
btrfs: zoned: wait for existing extents before truncating
btrfs: zoned: serialize metadata IO
btrfs: zoned: introduce dedicated data write path for zoned filesystems
btrfs: zoned: enable zone append writing for direct IO
btrfs: zoned: use ZONE_APPEND write for zoned mode
btrfs: save irq flags when looking up an ordered extent
btrfs: zoned: cache if block group is on a sequential zone
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-02-16
The following pull-request contains BPF updates for your *net-next* tree.
There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:
[...]
lock_sock(sk);
err = tcp_zerocopy_receive(sk, &zc, &tss);
err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
&zc, &len, err);
release_sock(sk);
[...]
We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).
The main changes are:
1) Adds support of pointers to types with known size among global function
args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.
2) Add bpf_iter for task_vma which can be used to generate information similar
to /proc/pid/maps, from Song Liu.
3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
rewriting bind user ports from BPF side below the ip_unprivileged_port_start
range, both from Stanislav Fomichev.
4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
as well as per-cpu maps for the latter, from Alexei Starovoitov.
5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
for sleepable programs, both from KP Singh.
6) Extend verifier to enable variable offset read/write access to the BPF
program stack, from Andrei Matei.
7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
query device MTU from programs, from Jesper Dangaard Brouer.
8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
tracing programs, from Florent Revest.
9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
otherwise too many passes are required, from Gary Lin.
10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
verification both related to zero-extension handling, from Ilya Leoshkevich.
11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.
12) Batch of AF_XDP selftest cleanups and small performance improvement
for libbpf's xsk map redirect for newer kernels, from Björn Töpel.
13) Follow-up BPF doc and verifier improvements around atomics with
BPF_FETCH, from Brendan Jackman.
14) Permit zero-sized data sections e.g. if ELF .rodata section contains
read-only data from local variables, from Yonghong Song.
15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Miscellaneous fixes.
- kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return
addresses to more easily locate bugs. This has a couple of RCU-related commits,
but is mostly MM. Was pulled in with akpm's agreement.
- Per-callback-batch tracking of numbers of callbacks,
which enables better debugging information and smarter
reactions to large numbers of callbacks.
- The first round of changes to allow CPUs to be runtime switched from and to
callback-offloaded state.
- CONFIG_PREEMPT_RT-related changes.
- RCU CPU stall warning updates.
- Addition of polling grace-period APIs for SRCU.
- Torture-test and torture-test scripting updates, including a "torture everything"
script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale.
Plus does an allmodconfig build.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update the comment of the 3rd stage of trace event macro
expansion code. Now there are 2 macros makes different
trace_raw_output_<call>() functions.
Link: https://lkml.kernel.org/r/160277371605.29307.8586817119278606720.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
To help debugging kernel, show real address for trace event arguments
in tracefs/trace{,pipe} instead of hashed pointer value.
Since ftrace human-readable format uses vsprintf(), all %p are
translated to hash values instead of pointer address.
However, when debugging the kernel, raw address value gives a
hint when comparing with the memory mapping in the kernel.
(Those are sometimes used with crash log, which is not hashed too)
So converting %p with %px when calling trace_seq_printf().
Moreover, this is not improving the security because the tracefs
can be used only by root user and the raw address values are readable
from tracefs/percpu/cpu*/trace_pipe_raw file.
Link: https://lkml.kernel.org/r/160277370703.29307.5134475491761971203.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Often when I'm debugging ENOSPC related issues I have to resort to
printing the entire ENOSPC state with trace_printk() in different spots.
This gets pretty annoying, so add a trace state that does this for us.
Then add a trace point at the end of preemptive flushing so you can see
the state of the space_info when we decide to exit preemptive flushing.
This helped me figure out we weren't kicking in the preemptive flushing
soon enough.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have normal ticketed flushing and preemptive flushing, adjust
the tracepoint so that we know the source of the flushing action to make
it easier to debug problems.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Solely for preemptive flushing, we want to be able to force the
transaction commit without any of the ambiguity of
may_commit_transaction(). This is because may_commit_transaction()
checks tickets and such, and in preemptive flushing we already know
it'll be helpful, so use this to keep the code nice and clean and
straightforward.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
While debugging a ENOSPC related performance problem I needed to see the
time difference between start and end of a reserve ticket, so add a
trace point to report when we handle a reserve ticket.
I opted to spit out start_ns itself without calculating the difference
because there could be a gap between enabling the tracepoint and setting
start_ns. Doing it this way allows us to filter on 0 start_ns so we
don't get bogus entries, and we can easily calculate the time difference
with bpftrace or something else.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a long existing bug in the last parameter of
btrfs_add_ordered_extent(), in commit 771ed689d2 ("Btrfs: Optimize
compressed writeback and reads") back to 2008.
In that ancient commit btrfs_add_ordered_extent() expects the @type
parameter to be one of the following:
- BTRFS_ORDERED_REGULAR
- BTRFS_ORDERED_NOCOW
- BTRFS_ORDERED_PREALLOC
- BTRFS_ORDERED_COMPRESSED
But we pass 0 in cow_file_range(), which means BTRFS_ORDERED_IO_DONE.
Ironically extra check in __btrfs_add_ordered_extent() won't set the bit
if we see (type == IO_DONE || type == IO_COMPLETE), and avoid any
obvious bug.
But this still leads to regular COW ordered extent having no bit to
indicate its type in various trace events, rendering REGULAR bit
useless.
[FIX]
Change the following aspects to avoid such problem:
- Reorder btrfs_ordered_extent::flags
Now the type bits go first (REGULAR/NOCOW/PREALLCO/COMPRESSED), then
DIRECT bit, finally extra status bits like IO_DONE/COMPLETE/IOERR.
- Add extra ASSERT() for btrfs_add_ordered_extent_*()
- Remove @type parameter for btrfs_add_ordered_extent_compress()
As the only valid @type here is BTRFS_ORDERED_COMPRESSED.
- Remove the unnecessary special check for IO_DONE/COMPLETE in
__btrfs_add_ordered_extent()
This is just to make the code work, with extra ASSERT(), there are
limited values can be passed in.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make it easier to spot messages of an unusual size.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>