Block groups documentation has three, first-level section headings.
These headings' text become toctree entries and the first one "Layout"
becomes docs title in the output, which isn't conveying the docs
contents.
Add explicit title heading and demote the rest.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250620105643.25141-6-bagasdotme@gmail.com
Last three sections of atomic block writes documentation are adorned as
first-level title headings, which erroneously increase toctree entries
in overview.rst. Demote them.
Fixes: 0bf1f51e34 ("ext4: Add atomic block write documentation")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250620105643.25141-5-bagasdotme@gmail.com
Reduce toctree depth from 6 to 2 to only show individual docs titles
on top-level toctree (index.rst) and to not spoil the entire hierarchy.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250620105643.25141-4-bagasdotme@gmail.com
ext4 docs are organized in three master docs (overview.rst, globals.rst,
and dynamic.rst), in which these include other docs via include::
directive. These docs sturcture is better served by toctrees instead.
Convert the master docs to use toctrees.
Fixes: 0bf1f51e34 ("ext4: Add atomic block write documentation")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250620105643.25141-3-bagasdotme@gmail.com
The HTML output seems to be correct, but when reading the raw rst file
it's annoying.
So use "|" for table the border.
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250628083205.1066472-1-richard@nod.at
Several flags are updated and checked only under namespace_sem; we are
already making use of that when we are checking them without mount_lock,
but we have to hold mount_lock for all updates, which makes things
clumsier than they have to be.
Take MNT_SHARED, MNT_UNBINDABLE, MNT_MARKED and MNT_UMOUNT_CANDIDATE
into a separate field (->mnt_t_flags), renaming them to T_SHARED,
etc. to avoid confusion. All accesses must be under namespace_sem.
That changes locking requirements for mnt_change_propagation() and
set_mnt_shared() - only namespace_sem is needed now. The same goes
for SET_MNT_MARKED et.al.
There might be more flags moved from ->mnt_flags to that field;
this is just the initial set.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The variant currently in the tree has problems; trying to prove
correctness has caught at least one class of bugs (reparenting
that ends up moving the visible location of reparented mount, due
to not excluding some of the counterparts on propagation that
should've been included).
I tried to prove that it's the only bug there; I'm still not sure
whether it is. If anyone can reconstruct and write down an analysis
of the mainline implementation, I'll gladly review it; as it is,
I ended up doing a different implementation. Candidate collection
phase is similar, but trimming the set down until it satisfies the
constraints turned out pretty different.
I hoped to do transformation as a massage series, but that turns out
to be too convoluted. So it's a single patch replacing propagate_umount()
and friends in one go, with notes and analysis in D/f/propagate_umount.txt
(in addition to inline comments).
As far I can tell, it is provably correct and provably linear by the number
of mounts we need to look at in order to decide what should be unmounted.
It even builds and seems to survive testing...
Another nice thing that fell out of that is that ->mnt_umounting is no longer
needed.
Compared to the first version:
* explicit MNT_UMOUNT_CANDIDATE flag for is_candidate()
* trim_ancestors() only clears that flag, leaving the suckers on list
* trim_one() and handle_locked() take the stuff with flag cleared off
the list. That allows to iterate with list_for_each_entry_safe() when calling
trim_one() - it removes at most one element from the list now.
* no globals - I didn't bother with any kind of context, not worth it.
* Notes updated accordingly; I have not touch the terms yet.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaFx0bQAKCRBZ7Krx/gZQ
63yTAQC4NS7qopT8BQGn3aM+t8YjYo36BTeSRcSy4hVEAFrEJAD/WyW5Dcy1lWZR
S8g8rqRimsCepwxqTinYJlS7H8S56ws=
=CmGc
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount fixes from Al Viro:
"Several mount-related fixes"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
userns and mnt_idmap leak in open_tree_attr(2)
attach_recursive_mnt(): do not lock the covering tree when sliding something under it
replace collect_mounts()/drop_collected_mounts() with a safer variant
brd hasn't supported DAX for a long time but dax.rst
still suggests it as an example of how to write a DAX
supporting block driver.
Remove the reference, confuse less people.
Fixes: 7a862fbbde ("brd: remove dax support")
Signed-off-by: Daniel Palmer <daniel.palmer@sony.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250610-fixdasrstbrd20250610-v1-1-4abe3b7f381a@sony.com
collect_mounts() has several problems - one can't iterate over the results
directly, so it has to be done with callback passed to iterate_mounts();
it has an oopsable race with d_invalidate(); it creates temporary clones
of mounts invisibly for sync umount (IOW, you can have non-lazy umount
succeed leaving filesystem not mounted anywhere and yet still busy).
A saner approach is to give caller an array of struct path that would pin
every mount in a subtree, without cloning any mounts.
* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
* collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or
a pointer to array of struct path, one for each chunk of tree visible under
'where' (i.e. the first element is a copy of where, followed by (mount,root)
for everything mounted under it - the same set collect_mounts() would give).
Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning
references to the roots of subtrees in the caller's namespace.
Array is terminated by {NULL, NULL} struct path. If it fits into
preallocated array (on-stack, normally), that's where it goes; otherwise
it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated'
is ignored (and expected to be NULL).
* drop_collected_paths(paths, preallocated) is given the array returned
by an earlier call of collect_paths() and the preallocated array passed to that
call. All mount/dentry references are dropped and array is kfree'd if it's not
equal to 'preallocated'.
* instead of iterate_mounts(), users should just iterate over array
of struct path - nothing exotic is needed for that. Existing users (all in
audit_tree.c) are converted.
[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]
Fixes: 80b5dce8c5 ("vfs: Add a function to lazily unmount all mounts from any dentry")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
VFS has switched to i_rwsem for ten years now (9902af79c01a: parallel
lookups actual switch to rwsem), but the VFS documentation and comments
still has references to i_mutex.
Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu>
Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu
Signed-off-by: Christian Brauner <brauner@kernel.org>
Long before introduction of lore.kernel.org, people would link
to LKML threads on third-party archives (here spinics.net), which
in some cases can be unreliable (as these were outside of
kernel.org control). Replace links to them with lore counterparts
(if any).
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250611065254.36608-2-bagasdotme@gmail.com
This patch fixes two minor typos in Documentation/filesystems/f2fs.rst:
- "ramdom" → "random"
- "reenable" → "re-enable"
The changes improve spelling and consistency in the documentation.
These issues were identified using the 'codespell' tool with the
following command:
$ find Documentation/ -path Documentation/translations -prune -o \
-name '*.rst' -print | xargs codespell
Signed-off-by: Yuanye Ma <yuanye.ma20@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250618225546.104949-1-yuanye.ma20@gmail.com
Convert the last user (d_alloc_pseudo()) and be done with that.
Any out-of-tree filesystem using it should switch to d_splice_alias_ops()
or, better yet, check whether it really needs to have ->d_op vary among
its dentries.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This paragraph was relevant for an earlier version of the code which
passed the qstr as a struct instead of a point. The version that landed
passed the pointer in all cases so this para is now pointless.
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/20250608230952.20539-3-neil@brown.name
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystem converted (and field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by compiler).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmhEsJkACgkQiiy9cAdy
T1Gjlgv+P1I4uD8iDbkJvyrh7mVcnIDnjttbgN1pTZZnpcX7JIyMPR9YPUn+QTKf
3GStGgLHqKtnXYYCdrkpPipsH/fjt7kjVLUC/YMhUGEr5dSFKzygOVaRTo4LDgNO
sSlOxTPyTaPgxdvIXriU5CeJGtm95GqowLDG29We9ubX1P5GLjIjGlMbAisAk4/Q
ek478D0qBSO5u6FkMYUFuSbjGwOYWAlabMxeOVHpjgEH/36tVtDKOc/bORnVZtHB
6bRjhymn9CZQOE2P+gDSqF6pZwHWhU7uYJYW+kbDbmDIgWDk9xl5/kbbqZ6RR3t5
joPWzZkO1I58avmJ31aauhvRpYIOXfOhx2dHB6VM/7bSHTvE6CW60RIlpXo1xbO6
d30pTy7ICqP/RWNZlt5ZgoPTDB1hPmVDdMXoELp7uZzChhrAz4qc/2BSD0Y5fCjE
e/U0bPfsyjscB7NtpcrQYVMbOsrweQ4qKUgolEk1klLolsKxDGE176wbPmLNAzo4
C9SzyxU3
=RvlW
-----END PGP SIGNATURE-----
Merge tag '6.16-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:
- multichannel/reconnect fixes
- move smbdirect (smb over RDMA) defines to fs/smb/common so they will
be able to be used in the future more broadly, and a documentation
update explaining setting up smbdirect mounts
- update email address for Paulo
* tag '6.16-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal version number
MAINTAINERS, mailmap: Update Paulo Alcantara's email address
cifs: add documentation for smbdirect setup
cifs: do not disable interface polling on failure
cifs: serialize other channels when query server interfaces is pending
cifs: deal with the channel loading lag while picking channels
smb: client: make use of common smbdirect_socket_parameters
smb: smbdirect: introduce smbdirect_socket_parameters
smb: client: make use of common smbdirect_socket
smb: smbdirect: add smbdirect_socket.h
smb: client: make use of common smbdirect.h
smb: smbdirect: add smbdirect.h with public structures
smb: client: make use of common smbdirect_pdu.h
smb: smbdirect: add smbdirect_pdu.h with protocol definitions
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaEKzdwAKCRDh3BK/laaZ
PLpuAQCK2B/LsbyLslWVN6lWbQNwiPHF7l49+GjS2BaWVDxnTwEAwdpaktgg7tRI
wsMp9CEc0lbp8lMDjHDOEqhc/Qvejg4=
=Y7HA
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-v2-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs update from Miklos Szeredi:
- Fix a regression in getting the path of an open file (e.g. in
/proc/PID/maps) for a nested overlayfs setup (André Almeida)
- Support data-only layers and verity in a user namespace (unprivileged
composefs use case)
- Fix a gcc warning (Kees)
- Cleanups
* tag 'ovl-update-v2-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: Annotate struct ovl_entry with __counted_by()
ovl: Replace offsetof() with struct_size() in ovl_stack_free()
ovl: Replace offsetof() with struct_size() in ovl_cache_entry_new()
ovl: Check for NULL d_inode() in ovl_dentry_upper()
ovl: Use str_on_off() helper in ovl_show_options()
ovl: don't require "metacopy=on" for "verity"
ovl: relax redirect/metacopy requirements for lower -> data redirect
ovl: make redirect/metacopy rejection consistent
ovl: Fix nested backing file paths
Document steps to use SMB over RDMA using the linux SMB client and
KSMBD server
Signed-off-by: Meetakshi Setiya <msetiya@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaD2Y1wAKCRDh3BK/laaZ
PFSHAP4q1+mOlQfZJPH/PFDwa+F0QW/uc3szXatS0888nxui/gEAsIeyyJlf+Mr8
/1JPXxCqcapRFw9xsS0zioiK54Elfww=
=2KxA
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Remove tmp page copying in writeback path (Joanne).
This removes ~300 lines and with that a lot of complexity related to
avoiding reclaim related deadlock. The old mechanism is replaced with
a mapping flag that tells the MM not to block reclaim waiting for
writeback to complete. The MM parts have been reviewed/acked by
respective maintainers.
- Convert more code to handle large folios (Joanne). This still just
adds the code to deal with large folios and does not enable them yet.
- Allow invalidating all cached lookups atomically (Luis Henriques).
This feature is useful for CernVMFS, which currently does this
iteratively.
- Align write prefaulting in fuse with generic one (Dave Hansen)
- Fix race causing invalid data to be cached when setting attributes on
different nodes of a distributed fs (Guang Yuan Wu)
- Update documentation for passthrough (Chen Linxuan)
- Add fdinfo about the device number associated with an opened
/dev/fuse instance (Chen Linxuan)
- Increase readdir buffer size (Miklos). This depends on a patch to VFS
readdir code that was already merged through Christians tree.
- Optimize io-uring request expiration (Joanne)
- Misc cleanups
* tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
fuse: increase readdir buffer size
readdir: supply dir_context.count as readdir buffer size hint
fuse: don't allow signals to interrupt getdents copying
fuse: support large folios for writeback
fuse: support large folios for readahead
fuse: support large folios for queued writes
fuse: support large folios for stores
fuse: support large folios for symlinks
fuse: support large folios for folio reads
fuse: support large folios for writethrough writes
fuse: refactor fuse_fill_write_pages()
fuse: support large folios for retrieves
fuse: support copying large folios
fs: fuse: add dev id to /dev/fuse fdinfo
docs: filesystems: add fuse-passthrough.rst
MAINTAINERS: update filter of FUSE documentation
fuse: fix race between concurrent setattrs from multiple nodes
fuse: remove tmp folio for writebacks and internal rb tree
mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings
fuse: optimize over-io-uring request expiration check
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPUAAKCRCRxhvAZXjc
ouMEAQCrviYPG/WMtPTH7nBIbfVQTfNEXt/TvN7u7OjXb+RwRAEAwe9tLy4GrS/t
GuvUPWAthbhs77LTvxj6m3Gf49BOVgQ=
=6FqN
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
- The main API document has been extensively updated/rewritten
- Fix an oops in write-retry due to mis-resetting the I/O iterator
- Fix the recording of transferred bytes for short DIO reads
- Fix a request's work item to not require a reference, thereby
avoiding the need to get rid of it in BH/IRQ context
- Fix waiting and waking to be consistent about the waitqueue used
- Remove NETFS_SREQ_SEEK_DATA_READ, NETFS_INVALID_WRITE,
NETFS_ICTX_WRITETHROUGH, NETFS_READ_HOLE_CLEAR,
NETFS_RREQ_DONT_UNLOCK_FOLIOS, and NETFS_RREQ_BLOCKED
- Reorder structs to eliminate holes
- Remove netfs_io_request::ractl
- Only provide proc_link field if CONFIG_PROC_FS=y
- Remove folio_queue::marks3
- Fix undifferentiation of DIO reads from unbuffered reads
* tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix undifferentiation of DIO reads from unbuffered reads
netfs: Fix wait/wake to be consistent about the waitqueue used
netfs: Fix the request's work item to not require a ref
netfs: Fix setting of transferred bytes with short DIO reads
netfs: Fix oops in write-retry from mis-resetting the subreq iterator
fs/netfs: remove unused flag NETFS_RREQ_BLOCKED
fs/netfs: remove unused flag NETFS_RREQ_DONT_UNLOCK_FOLIOS
folio_queue: remove unused field `marks3`
fs/netfs: declare field `proc_link` only if CONFIG_PROC_FS=y
fs/netfs: remove `netfs_io_request.ractl`
fs/netfs: reorder struct fields to eliminate holes
fs/netfs: remove unused enum choice NETFS_READ_HOLE_CLEAR
fs/netfs: remove unused flag NETFS_ICTX_WRITETHROUGH
fs/netfs: remove unused source NETFS_INVALID_WRITE
fs/netfs: remove unused flag NETFS_SREQ_SEEK_DATA_READ
semaphore" from Lance Yang enhances the hung task detector. The
detector presently dumps the blocking tasks's stack when it is blocked
on a mutex. Lance's series extends this to semaphores.
- The 2 patch series "nilfs2: improve sanity checks in dirty state
propagation" from Wentao Liang addresses a couple of minor flaws in
nilfs2.
- The 2 patch series "scripts/gdb: Fixes related to lx_per_cpu()" from
Illia Ostapyshyn fixes a couple of issues in the gdb scripts.
- The 9 patch series "Support kdump with LUKS encryption by reusing LUKS
volume keys" from Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in the
series [0/N] cover letter.
- The 2 patch series "sysfs: add counters for lockups and stalls" from
Max Kellermann adds /sys/kernel/hardlockup_count and
/sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count.
- The 3 patch series "fork: Page operation cleanups in the fork code"
from Pasha Tatashin implements a number of code cleanups in fork.c.
- The 3 patch series "scripts/gdb/symbols: determine KASLR offset on
s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in
the gdb scripts.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDuCvQAKCRDdBJ7gKXxA
jrkxAQCnFAp/uK9ckkbN4nfpJ0+OMY36C+A+dawSDtuRsIkXBAEAq3e6MNAUdg5W
Ca0cXdgSIq1Op7ZKEA+66Km6Rfvfow8=
=g45L
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "hung_task: extend blocking task stacktrace dump to semaphore" from
Lance Yang enhances the hung task detector.
The detector presently dumps the blocking tasks's stack when it is
blocked on a mutex. Lance's series extends this to semaphores
- "nilfs2: improve sanity checks in dirty state propagation" from
Wentao Liang addresses a couple of minor flaws in nilfs2
- "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
fixes a couple of issues in the gdb scripts
- "Support kdump with LUKS encryption by reusing LUKS volume keys" from
Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in
the series [0/N] cover letter
- "sysfs: add counters for lockups and stalls" from Max Kellermann adds
/sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and
/sys/kernel/rcu_stall_count
- "fork: Page operation cleanups in the fork code" from Pasha Tatashin
implements a number of code cleanups in fork.c
- "scripts/gdb/symbols: determine KASLR offset on s390 during early
boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
scripts
* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
llist: make llist_add_batch() a static inline
delayacct: remove redundant code and adjust indentation
squashfs: add optional full compressed block caching
crash_dump, nvme: select CONFIGFS_FS as built-in
scripts/gdb/symbols: determine KASLR offset on s390 during early boot
scripts/gdb/symbols: factor out pagination_off()
scripts/gdb/symbols: factor out get_vmlinux()
kernel/panic.c: format kernel-doc comments
mailmap: update and consolidate Casey Connolly's name and email
nilfs2: remove wbc->for_reclaim handling
fork: define a local GFP_VMAP_STACK
fork: check charging success before zeroing stack
fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
fork: clean-up ifdef logic around stack allocation
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
x86/crash: make the page that stores the dm crypt keys inaccessible
x86/crash: pass dm crypt keys to kdump kernel
Revert "x86/mm: Remove unused __set_memory_prot()"
crash_dump: retrieve dm crypt keys in kdump kernel
...
Calling conventions of ->d_automount() made saner (flagday change)
vfs_submount() is gone - its sole remaining user (trace_automount) had
been switched to saner primitives.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDoRWQAKCRBZ7Krx/gZQ
6wxMAQCzuMc2GiGBMXzeK4SGA7d5rsK71unf+zczOd8NvbTImQEAs1Cu3u3bF3pq
EmHQWFTKBpBf+RHsLSoDHwUA+9THowM=
=GXLi
-----END PGP SIGNATURE-----
Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull automount updates from Al Viro:
"Automount wart removal
A bunch of odd boilerplate gone from instances - the reason for
those was the need to protect the yet-to-be-attched mount from
mark_mounts_for_expiry() deciding to take it out.
But that's easy to detect and take care of in mark_mounts_for_expiry()
itself; no need to have every instance simulate mount being busy by
grabbing an extra reference to it, with finish_automount() undoing
that once it attaches that mount.
Should've done it that way from the very beginning... This is a
flagday change, thankfully there are very few instances.
vfs_submount() is gone - its sole remaining user (trace_automount)
had been switched to saner primitives"
* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill vfs_submount()
saner calling conventions for ->d_automount()
In this round, Matthew converted most of page operations to using folio. Beyond
the work, we've applied some performance tunings such as GC and linear lookup,
in addition to enhancing fault injection and sanity checks.
Enhancement:
- large number of folio conversions
- add a control to turn on/off the linear lookup for performance
- tune GC logics for zoned block device
- improve fault injection and sanity checks
Bug fix:
- handle error cases of memory donation
- fix to correct check conditions in f2fs_cross_rename
- fix to skip f2fs_balance_fs() if checkpoint is disabled
- don't over-report free space or inodes in statvfs
- prevent the current section from being selected as a victim during GC
- fix to calculate first_zoned_segno correctly
- fix to avoid inconsistence in between SIT and SSA for zoned block device
As usual, there are several debugging patches and clean-ups as well.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmg3PdcACgkQQBSofoJI
UNL/mQ/9Hkru4XSCokhxt8+/HoFRnTliAlzfD45Vzkkhz1YP7J8VdvWOzJV/WEai
D3Ib50Q6/y2ptxu7cwOpmToR3fI3RzAlgQsYooFAiZOBnyUkBOLA1oaVuT4s/EYg
u85xxLx0SW/IMX5CKKbYzhbXnocGAvRUkp/k30kjKJxpCeQ7pw/mLhw/2XeNIb9h
FxJbECWPpf4PA6ot22YUNvQn0plF/s9873PPhv50vpGyXTHIlTbDCSMeEC1r1E5v
xWsPcWmTkyPIyBhNFEONWJw1l3wcVIVKNBfBqwMEDr+Tgqi5UDEREeTDV9q5C6y+
vw3KnsOqX7RTdLExGfefTOnBsTqqMwSZQSH2HL5/Poayg5obXf3D/fUqAQajJpt/
FbAtfKaXElJcC7l3DJQU3Trh+WpdEPbuMiJo43OzX0YGvMfkA/sYrAHTYm5Q4nsC
wrRLaWiBgG6nQDKNXz+amD9kL1SMxp+Vsf6ybtChH3gvMqDAJsR7DY1F/Cxe3ry8
8JoJiGRYq70lw5xNACfJNQwWwRbtySy63nIwMA7FGR9zaXBQJx+cSPhEeLsS+0hI
zgijgtgRjbfuojlh7qvfFArHEIL4A67Um3RhjHbLWSFhREPaTB0665ElUNTGPe+y
hVdYtkb0X2ngsYdV/Xdmp/OThpSxI8x1ZCXVsrElawVIMpjP+nA=
=G8sl
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, Matthew converted most of page operations to using
folio. Beyond the work, we've applied some performance tunings such as
GC and linear lookup, in addition to enhancing fault injection and
sanity checks.
Enhancements:
- large number of folio conversions
- add a control to turn on/off the linear lookup for performance
- tune GC logics for zoned block device
- improve fault injection and sanity checks
Bug fixes:
- handle error cases of memory donation
- fix to correct check conditions in f2fs_cross_rename
- fix to skip f2fs_balance_fs() if checkpoint is disabled
- don't over-report free space or inodes in statvfs
- prevent the current section from being selected as a victim during GC
- fix to calculate first_zoned_segno correctly
- fix to avoid inconsistence between SIT and SSA for zoned block device
As usual, there are several debugging patches and clean-ups as well"
* tag 'f2fs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (195 commits)
f2fs: fix to correct check conditions in f2fs_cross_rename
f2fs: use d_inode(dentry) cleanup dentry->d_inode
f2fs: fix to skip f2fs_balance_fs() if checkpoint is disabled
f2fs: clean up to check bi_status w/ BLK_STS_OK
f2fs: introduce is_{meta,node}_folio
f2fs: add ckpt_valid_blocks to the section entry
f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode.
f2fs: introduce FAULT_VMALLOC
f2fs: use vmalloc instead of kvmalloc in .init_{,de}compress_ctx
f2fs: add f2fs_bug_on() in f2fs_quota_read()
f2fs: add f2fs_bug_on() to detect potential bug
f2fs: remove unused sbi argument from checksum functions
f2fs: fix 32-bits hexademical number in fault injection doc
f2fs: don't over-report free space or inodes in statvfs
f2fs: return bool from __write_node_folio
f2fs: simplify return value handling in f2fs_fsync_node_pages
f2fs: always unlock the page in f2fs_write_single_data_page
f2fs: remove wbc->for_reclaim handling
f2fs: return bool from __f2fs_write_meta_folio
f2fs: fix to return correct error number in f2fs_sync_node_pages()
...
Here are the driver core / kernfs changes for 6.16-rc1.
Not a huge number of changes this development cycle, here's the summary
of what is included in here:
- kernfs locking tweaks, pushing some global locks down into a per-fs
image lock
- rust driver core and pci device bindings added for new features.
- sysfs const work for bin_attributes. This churn should now be
completed for those types of attributes
- auxbus device creation helpers added
- fauxbus fix for creating sysfs files after the probe completed
properly
- other tiny updates for driver core things.
All of these have been in linux-next for over a week with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCaDbe+g8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylYbACgl/MngU9pRnx5jZIQh6bWveFSeo8AnRE4U5x0
X+lgTPjGKL1RrV3C5HJp
=+0BA
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Greg KH:
"Here are the driver core / kernfs changes for 6.16-rc1.
Not a huge number of changes this development cycle, here's the
summary of what is included in here:
- kernfs locking tweaks, pushing some global locks down into a per-fs
image lock
- rust driver core and pci device bindings added for new features.
- sysfs const work for bin_attributes.
The final churn of switching away from and removing the
transitional struct members, "read_new", "write_new" and
"bin_attrs_new" will come after the merge window to avoid
unnecesary merge conflicts.
- auxbus device creation helpers added
- fauxbus fix for creating sysfs files after the probe completed
properly
- other tiny updates for driver core things.
All of these have been in linux-next for over a week with no reported
issues"
* tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
kernfs: Relax constraint in draining guard
Documentation: embargoed-hardware-issues.rst: Remove myself
drivers: hv: fix up const issue with vmbus_chan_bin_attrs
firmware_loader: use SHA-256 library API instead of crypto_shash API
docs: debugfs: do not recommend debugfs_remove_recursive
PM: wakeup: Do not expose 4 device wakeup source APIs
kernfs: switch global kernfs_rename_lock to per-fs lock
kernfs: switch global kernfs_idr_lock to per-fs lock
driver core: auxiliary bus: Fix IS_ERR() vs NULL mixup in __devm_auxiliary_device_create()
sysfs: constify attribute_group::bin_attrs
sysfs: constify bin_attribute argument of bin_attribute::read/write()
software node: Correct a OOB check in software_node_get_reference_args()
devres: simplify devm_kstrdup() using devm_kmemdup()
platform: replace magic number with macro PLATFORM_DEVID_NONE
component: do not try to unbind unbound components
driver core: auxiliary bus: add device creation helpers
driver core: faux: Add sysfs groups after probing
* Fast commit performance improvements
* Multi-fsblock atomic write support for bigalloc file systems
* Large folio support for regular files
This last can result in really stupendous performance for the right
workloads. For example, see [1] where the Kernel Test Robot reported
over 37% improvement on a large sequential I/O workload.
[1] https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/
There are also the usual bug fixes and cleanups. Of note are cleanups
of the extent status tree to fix potential races that could result in
the extent status tree getting corrupted under heavy siulatneous
allocation and deallocation to a single file.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmg2GJgACgkQ8vlZVpUN
gaOmmgf/fEh2OPDG6aAJpQ6hjy2WIbrxqTyuWC/+AFnyI5/Jy0Iskis3lHBiFdKP
IFjgC1h9CB5ARVvOLd7NOgflPHSSHsnYqoTCS6J4tdWvFN4VRiHe2J3fdTZd/bea
dzdWniHS3SAJiQm4wvbkhluFgecItBHYzDltapkHI0OGepxZt3thWVvbay6veO9R
ChXQ7T7/9eUZa5N5IVUeJmWobgh0RD+DgtwCih59UDfnezGqiDr6/shpyNC6EvWV
oZdvJw2+2DCPn5+DF4Ut77mLpKnxorQ4osNPOovZf59JnSyEcCmbBDuvyNfRldfC
yQYoCFkOv0Fz8tgJbtoAN71+YXl66w==
=fxDh
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New ext4 features and performance improvements:
- Fast commit performance improvements
- Multi-fsblock atomic write support for bigalloc file systems
- Large folio support for regular files
This last can result in really stupendous performance for the right
workloads. For example, see [1] where the Kernel Test Robot reported
over 37% improvement on a large sequential I/O workload.
There are also the usual bug fixes and cleanups. Of note are cleanups
of the extent status tree to fix potential races that could result in
the extent status tree getting corrupted under heavy simultaneous
allocation and deallocation to a single file"
Link: https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/ [1]
* tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead
ext4: Simplify flags in ext4_map_query_blocks()
ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER
ext4: Simplify last in leaf check in ext4_map_query_blocks
ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE
ext4: only dirty folios when data journaling regular files
ext4: Add atomic block write documentation
ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
ext4: Add multi-fsblock atomic write support with bigalloc
ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
ext4: Make ext4_meta_trans_blocks() non-static for later use
ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
ext4: Document an edge case for overwrites
jbd2: remove journal_t argument from jbd2_superblock_csum()
jbd2: remove journal_t argument from jbd2_chksum()
ext4: remove sb argument from ext4_superblock_csum()
ext4: remove sbi argument from ext4_chksum()
ext4: enable large folio for regular file
ext4: make online defragmentation support large folios
ext4: make the writeback path support large folios
...
Introduce a new fault type FAULT_VMALLOC to simulate no memory error in
f2fs_vmalloc().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- The most significant change is the replacement of the old kernel-doc
script (a monstrous collection of Perl regexes that predates the Git era)
with a Python reimplementation. That, too, is a horrifying collection of
regexes, but in a much cleaner and more maintainable structure that
integrates far better with the Sphinx build system.
This change has been in linux-next for the full 6.15 cycle; the small
number of problems that turned up have been addressed, seemingly to
everybody's satisfaction. The Perl kernel-doc script remains in tree (as
scripts/kernel-doc.pl) and can be used with a command-line option if need
be. Unless some reason to keep it around materializes, it will probably
go away in 6.17.
Credit goes to Mauro Carvalho Chehab for doing all this work.
- Some RTLA documentation updates
- A handful of Chinese translations
- The usual collection of typo fixes, general updates, etc.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmg0j/IPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Yu7sH/1w2LtO8XB/KTRNmuz3tV6KzGtDvQVwqgxB2
X8bbeJlBtYenvuak66RjCfucOh7Y8Ni3UN0G2BGa67KBAxmZEYc6u+IF4SrJUg5g
DuS6+ZXgqV4TrjWMRof5LtPS8KbNJLGnqgxSVdEPSBV0jJ13r3gb3/e7X06iNAKR
X4Nq+h5aa1tCwZTkPOSHHQn4qm3Tb1LQreDSn8gnBn6e8nVJIakNlwaVYkClhI9B
byvItInv32LPAXPDkcEWITvLNUTiMobTyfBYHOD6i3nImQ+j4ZiMMmOUjiB+0jDO
UQDvoUa46ipXkLBsBOrYEkM/iKXBawMwTa3CcudxR4scvVgATJs=
=BQ9X
-----END PGP SIGNATURE-----
Merge tag 'docs-6.16' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A moderately busy cycle for documentation this time around:
- The most significant change is the replacement of the old
kernel-doc script (a monstrous collection of Perl regexes that
predates the Git era) with a Python reimplementation. That, too, is
a horrifying collection of regexes, but in a much cleaner and more
maintainable structure that integrates far better with the Sphinx
build system.
This change has been in linux-next for the full 6.15 cycle; the
small number of problems that turned up have been addressed,
seemingly to everybody's satisfaction. The Perl kernel-doc script
remains in tree (as scripts/kernel-doc.pl) and can be used with a
command-line option if need be. Unless some reason to keep it
around materializes, it will probably go away in 6.17.
Credit goes to Mauro Carvalho Chehab for doing all this work.
- Some RTLA documentation updates
- A handful of Chinese translations
- The usual collection of typo fixes, general updates, etc"
* tag 'docs-6.16' of git://git.lwn.net/linux: (85 commits)
Docs: doc-guide: update sphinx.rst Sphinx version number
docs: doc-guide: clarify latest theme usage
Documentation/scheduler: Fix typo in sched-stats domain field description
scripts: kernel-doc: prevent a KeyError when checking output
docs: kerneldoc.py: simplify exception handling logic
MAINTAINERS: update linux-doc entry to cover new Python scripts
docs: align with scripts/syscall.tbl migration
Documentation: NTB: Fix typo
Documentation: ioctl-number: Update table intro
docs: conf.py: drop backward support for old Sphinx versions
Docs: driver-api/basics: add kobject_event interfaces
Docs: relay: editing cleanups
docs: fix "incase" typo in coresight/panic.rst
Fix spelling error for 'parallel'
docs: admin-guide: fix typos in reporting-issues.rst
docs: dmaengine: add explanation for DMA_ASYNC_TX capability
Documentation: leds: improve readibility of multicolor doc
docs: fix typo in firmware-related section
docs: Makefile: Inherit PYTHONPYCACHEPREFIX setting as env variable
Documentation: ioctl-number: Update outdated submission info
...
multiple architectures can share the fs API for manipulating their
respective hw resource control implementation. This is the second step
in the work towards sharing the resctrl filesystem interface, the next
one being plugging ARM's MPAM into the aforementioned fs API.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmg0UDwACgkQEsHwGGHe
VUqsZw//SNSNcVHF7Gz2YvHrMXGYQFBETScg6fRWn/pTe3x1NrKEJedzMANXpAIy
1sBAsfDSOyi8MxIZnvMYapLcRdfLGAD+6FQTkyu/IQ3oSsjAxPgrTXornhxUswMY
LUs40hCv/UaEMkg35NVrRqDlT973kWLwA4iDNNnm6IGtrC8qv4EmdJvgVWHyPTjk
D80KA5ta+iPzK4l8noBrqyhUIZN3ZAJVJLrjS3Tx/gabuolLURE6p4IdlF/O6WzC
4NcqUjpwDeFpHpl2M9QJLVEKXHxKz9zZF2gLpT8Eon/ftqqQigBjzsUx/FKp07hZ
fe2AiQsd4gN9GZa3BGX+Lv+bjvyFadARsOoFbY45szuiUb0oceaRYtFF1ihmO0bV
bD4nAROE1kAfZpr/9ZRZT63LfE/DAm9TR1YBsViq1rrJvp4odvL15YbdOlIDHZD3
SmxhTxAokj058MRnhGdHoiMtPa54iw186QYDp0KxLQHLrToBPd7RBtRE8jsYrqrv
2EvwUxYKyO4vtwr9tzr0ZfptZ/DEsGovoTYD5EtlEGjotQUqsmi5Rxx4+SEQuwFw
CKSJ3j73gpxqDXTujjOe9bCeeXJqyEbrIkaWpkiBRwm5of7eFPG3Sw74jaCGvm4L
NM4UufMSDtyVAKfu3HmPkGhujHv0/7h1zYND51aW+GXEroKxy9s=
=eNCr
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
"Carve out the resctrl filesystem-related code into fs/resctrl/ so that
multiple architectures can share the fs API for manipulating their
respective hw resource control implementation.
This is the second step in the work towards sharing the resctrl
filesystem interface, the next one being plugging ARM's MPAM into the
aforementioned fs API"
* tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
MAINTAINERS: Add reviewers for fs/resctrl
x86,fs/resctrl: Move the resctrl filesystem code to live in /fs/resctrl
x86/resctrl: Always initialise rid field in rdt_resources_all[]
x86/resctrl: Relax some asm #includes
x86/resctrl: Prefer alloc(sizeof(*foo)) idiom in rdt_init_fs_context()
x86/resctrl: Squelch whitespace anomalies in resctrl core code
x86/resctrl: Move pseudo lock prototypes to include/linux/resctrl.h
x86/resctrl: Fix types in resctrl_arch_mon_ctx_{alloc,free}() stubs
x86/resctrl: Move enum resctrl_event_id to resctrl.h
x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl
fs/resctrl: Add boiler plate for external resctrl code
x86/resctrl: Add 'resctrl' to the title of the resctrl documentation
x86/resctrl: Split trace.h
x86/resctrl: Expand the width of domid by replacing mon_data_bits
x86/resctrl: Add end-marker to the resctrl_event_id enum
x86/resctrl: Move is_mba_sc() out of core.c
x86/resctrl: Drop __init/__exit on assorted symbols
x86/resctrl: Resctrl_exit() teardown resctrl but leave the mount point
x86/resctrl: Check all domains are offline in resctrl_exit()
x86/resctrl: Rename resctrl_sched_in() to begin with "resctrl_arch_"
...
Add support for "hardware-wrapped inline encryption keys" to fscrypt.
When enabled on supported platforms, this feature protects file contents
keys from certain attacks, such as cold boot attacks.
This feature uses the block layer support for wrapped keys which was
merged in 6.15. Wrapped key support has existed out-of-tree in Android
for a long time, and it's finally ready for upstream now that there is a
platform on which it works end-to-end with upstream. Specifically,
it works on the Qualcomm SM8650 HDK, using the Qualcomm ICE (Inline
Crypto Engine) and HWKM (Hardware Key Manager). The corresponding
driver support is included in the SCSI tree for 6.16. Validation for
this feature includes two new tests that were already merged into
xfstests (generic/368 and generic/369).
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaDNaqxQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK1K7AP92naB88sRzH1KG7Oic9+dMK+PImARP
f15ebG2TzQ3qBgEAreqtNmtCNOH7pguYsTeAcX3Y243vzIkwkDRGk7k+aAI=
=P6Sj
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt update from Eric Biggers:
"Add support for 'hardware-wrapped inline encryption keys' to fscrypt.
When enabled on supported platforms, this feature protects file
contents keys from certain attacks, such as cold boot attacks.
This feature uses the block layer support for wrapped keys which was
merged in 6.15. Wrapped key support has existed out-of-tree in Android
for a long time, and it's finally ready for upstream now that there is
a platform on which it works end-to-end with upstream.
Specifically, it works on the Qualcomm SM8650 HDK, using the Qualcomm
ICE (Inline Crypto Engine) and HWKM (Hardware Key Manager). The
corresponding driver support is included in the SCSI tree for 6.16.
Validation for this feature includes two new tests that were already
merged into xfstests (generic/368 and generic/369)"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: add support for hardware-wrapped keys
- Add a `fsoffset` mount option to specify the filesystem offset;
- Support Intel QAT accelerators to boost up the DEFLATE algorithm;
- Initialize per-CPU workers and CPU hotplug hooks lazily to avoid
unnecessary overhead when EROFS is not mounted;
- Fix file handle encoding for 64-bit NIDs;
- Minor cleanups.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmgz6IQRHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qrp8w//V8rQpo/jQwUXP2xZDWUGe/iS5APQU/w+
IQRn8LRt1RYLD1ssShW60y5mc40pa/PktxLlddIfcDDFfhAv4zYEK7Iosrd5FeGX
vDawKcFvjzozpveqtjWR63QPO0Ff/ldSsnl9FsdQopffWNFw+X7D+/4fgUJah+CF
p5jnyp6D7RvNMHdLIjQjiqvvmmAdllqb+nbyLy0jGQkzjIGR2RdJtqrM5gdsE/B1
zKQRzs6NwYaBQ2MO6XmLAd2P0603RBGplR9OyLEpfFmUHX877pUxuGLQW2o+NbRY
TodevQdzSJPlvHNrO0T+ztistwRhKGkCmyrP7+Vl4ackgRmA5ozT23CUxFX2hwQM
GhE24aXyqO/vIA/RCsy+Tb8vxVY3ysNd4fz001HtWq0tOqLVyFkVEhvaZwLGqi1A
PAV6WHqtYo/gjc8nrvq88GMGTUH0orIwlJpS9YQHhStzexyePDjl3cgQlmS0Q8J3
JHtf8S+pnaModsvqKJJ9LQW0bHrbry9Bfo0M6yQ5sirehcrqGeDFZ0m+ny16Ki9N
bv8Mx811KNtAVoeuwAidH2NqUxnz1/faiIs0yYE/2Vg2QfuEKjVXbpkDo2wfQj1i
TVsQ9gPJB9mZpvnuaGYGdgzxN/lQAIo3JxWAHvHhMz/1suike97vqKms4W4lSoBY
JPbJjs/4uUA=
=+2IX
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, Intel QAT hardware accelerators are supported to
improve DEFLATE decompression performance. I've tested it with the
enwik9 dataset of 1 MiB pclusters on our Intel Sapphire Rapids
bare-metal server and a PL0 ESSD, and the sequential read performance
even surpasses LZ4 software decompression on this setup.
In addition, a `fsoffset` mount option is introduced for file-backed
mounts to specify the filesystem offset in order to adapt customized
container formats.
And other improvements and minor cleanups. Summary:
- Add a `fsoffset` mount option to specify the filesystem offset
- Support Intel QAT accelerators to boost up the DEFLATE algorithm
- Initialize per-CPU workers and CPU hotplug hooks lazily to avoid
unnecessary overhead when EROFS is not mounted
- Fix file handle encoding for 64-bit NIDs
- Minor cleanups"
* tag 'erofs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: support DEFLATE decompression by using Intel QAT
erofs: clean up erofs_{init,exit}_sysfs()
erofs: add 'fsoffset' mount option to specify filesystem offset
erofs: lazily initialize per-CPU workers and CPU hotplug hooks
erofs: refine readahead tracepoint
erofs: avoid using multiple devices with different type
erofs: fix file handle encoding for 64-bit NIDs
Lots of changes:
- Poisoned extents can now be moved: this lets us handle bitrotted data
without deleting it. For now, reading from poisoned extents only
returns -EIO: in the future we'll have an API for specifying "read
this data even if there were bitflips".
- Incompatible features may now be enabled at runtime, via
"opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
toggle it back - option changes via the sysfs interface are
persistent.
- Various changes to support deployable disk images:
- RO mounts now use less memory
- Images may be stripped of alloc info, particularly useful for
slimming them down if they will primarily be mounted RO. Alloc info
will be automatically regenerated on first RW mount, and this is
quite fast.
- Filesystem images generated with 'bcachefs image' will be
automatically resized the first time they're mounted on a larger
device.
The images 'bcachefs image' generates with compression enabled have
been comparable in size to those generated by squashfs and erofs -
but you get a full RW capable filesystem.
- Major error message improvements for btree node reads, data reads,
and elsewhere. We now build up a single error message that lists all
the errors encountered, actions taken to repair, and success/failure
of the IO. This extends to other error paths that may kick off other
actions, e.g. scheduling recovery passes: actions we took because of
an error are included in that error message, with grouping/indentation
so we can see what caused what.
- Repair/self healing:
- We can now kick off recovery passes and run them in the background
if we detect errors. Currently, this is just used by code that walks
backpointers; we now also check for missing backpointers at runtime
and run check_extents_to_backpointers if required. The messy 6.14
upgrade left missing backpointers for some users, and this will
correct that automatically instead of requiring a manual fsck - some
users noticed this as copygc spinning and not making progress.
In the future, as more recovery passes come online, we'll be able to
repair and recover from nearly anything - except for unreadable
btree nodes, and that's why you're using replication, of course -
without shutting down the filesystem.
- There's a new recovery pass, for checking the rebalance_work btree,
which tracks extents that rebalance will process later.
- Hardening:
- Close the last known hole in btree iterator/btree locking
assertions: path->should_be_locked paths must stay locked until the
end of the transaction. This shook out a few bugs, including a
performance issue that was causing unnecessary path_upgrade
transaction restarts.
- Performance;
- Faster snapshot deletion: this is an incompatible feature, as it
requires new sentinal values, for safety. Snapshot deletion no
longer has to do a full metadata scan, it now just scans the inodes
btree: if an extent/dirent/xattr is present for a given snapshot ID,
we already require that an inode be present with that same snapshot
ID.
If/when users hit scalability limits again (ridiculously huge
filesystems with lots of inodes, and many sparse snapshots), let me
know - the next step will be to add an index from snapshot ID ->
inode number, which won't be too hard.
- Faster device removal: the "scan for pointers to this device" no
longer does a full metadata scan, instead it walks backpointers.
Like fast snapshot deletion this is another incompat feature: it
also requires a new sentinal value, because we don't want to reuse
these device IDs until after a fsck.
- We're now coalescing redundant accounting updates prior to
transaction commit, taking some pressure off the journal. Shortly
we'll also be doing multiple extent updates in a transaction in the
main write path, which combined with the previous should drastically
cut down on the amount of metadata updates we have to journal.
- Stack usage improvements: All allocator state has been moved off the
stack
- Debug improvements:
- enumerated refcounts: The debug code previously used for filesystem
write refs is now a small library, and used for other heavily used
refcounts. Different users of a refcount are enumerated, making it
much easier to debug refcount issues.
- Async object debugging: There's a new kconfig option that makes
various async objects (different types of bios, data updates, write
ops, etc.) visible in debugfs, and it should be fast enough to leave
on in production.
- Various sets of assertions no longer require CONFIG_BCACHEFS_DEBUG,
instead they're controlled by module parameters and static keys,
meaning users won't need to compile custom kernels as often to help
debug issues.
- bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
option); with it on you can check the btree_transaction_stats in
debugfs to see the bch2_trans_kmalloc() calls a transaction did when
it used the most memory.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmgyaC0ACgkQE6szbY3K
bnYcGQ//ZOCe34wjVFub+dNn9os0llaIFaShTC9Baoi+Ly8qmMBkiVR8h0XZWJ6I
Xue8FaPksEDUF+pXSPjI+L/WA2uW/qNm2Q2RxEfxigSMSzUUZvHs/jU3ZkpZ1JQb
l327tun1XNNY2JagcTj09X+VoasLuhQtvBKXM6gAWozXNszLesd1vaFexPsk13bV
GwqSxlfayYt5DwzEf7OCL9CXWfW86qs8snLYAPpv/pyoVNKw+iuPFlhDA1AD1ZMG
s+syQ5R7u5ikcfpYnaakDsn3KhxsX+jLk5PoSHk/6kGy/5BdJ1AUYQEsSNfdcxHy
pxNht12Nuoo2q2qI0gL4oegnz36cndtveCf9vs6K0Vg24ZRylhh8uz3v/ZcAu0Ne
CwFvpxMn5jtIgqh75i9R1/W6aiuKffkE29D4Me5RJxEqoM8yKKhKx6tHHzZftT3a
QSvbgsfBghetfTqcajBvDDN5GQM2Z8pz2iLrIw/EHuAh15hAhzf+7ULHprIh6IDz
m/Px72xrh39CAKI8IdsjD7QLT9a7xN3WKQXbSvFMEPjnJtGL3JGARZfsKB2gL7ZO
551ONexueFkilQmGQfy20VYGF1Mu9mWTUqyVnNaQUMbgKKDcAivy71UyFe/n3GOB
xJyEKTfrJg8Qn+vEJvlhXevVnz5FO/hiOAMIrMPKQq8XT0iNdAA=
=srxl
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
- Poisoned extents can now be moved: this lets us handle bitrotted data
without deleting it. For now, reading from poisoned extents only
returns -EIO: in the future we'll have an API for specifying "read
this data even if there were bitflips".
- Incompatible features may now be enabled at runtime, via
"opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
toggle it back - option changes via the sysfs interface are
persistent.
- Various changes to support deployable disk images:
- RO mounts now use less memory
- Images may be stripped of alloc info, particularly useful for
slimming them down if they will primarily be mounted RO. Alloc
info will be automatically regenerated on first RW mount, and
this is quite fast
- Filesystem images generated with 'bcachefs image' will be
automatically resized the first time they're mounted on a larger
device
The images 'bcachefs image' generates with compression enabled have
been comparable in size to those generated by squashfs and erofs -
but you get a full RW capable filesystem
- Major error message improvements for btree node reads, data reads,
and elsewhere. We now build up a single error message that lists all
the errors encountered, actions taken to repair, and success/failure
of the IO. This extends to other error paths that may kick off other
actions, e.g. scheduling recovery passes: actions we took because of
an error are included in that error message, with
grouping/indentation so we can see what caused what.
- New option, 'rebalance_on_ac_only'. Does exactly what the name
suggests, quite handy with background compression.
- Repair/self healing:
- We can now kick off recovery passes and run them in the
background if we detect errors. Currently, this is just used by
code that walks backpointers. We now also check for missing
backpointers at runtime and run check_extents_to_backpointers if
required. The messy 6.14 upgrade left missing backpointers for
some users, and this will correct that automatically instead of
requiring a manual fsck - some users noticed this as copygc
spinning and not making progress.
In the future, as more recovery passes come online, we'll be able
to repair and recover from nearly anything - except for
unreadable btree nodes, and that's why you're using replication,
of course - without shutting down the filesystem.
- There's a new recovery pass, for checking the rebalance_work
btree, which tracks extents that rebalance will process later.
- Hardening:
- Close the last known hole in btree iterator/btree locking
assertions: path->should_be_locked paths must stay locked until
the end of the transaction. This shook out a few bugs, including
a performance issue that was causing unnecessary path_upgrade
transaction restarts.
- Performance:
- Faster snapshot deletion: this is an incompatible feature, as it
requires new sentinal values, for safety. Snapshot deletion no
longer has to do a full metadata scan, it now just scans the
inodes btree: if an extent/dirent/xattr is present for a given
snapshot ID, we already require that an inode be present with
that same snapshot ID.
If/when users hit scalability limits again (ridiculously huge
filesystems with lots of inodes, and many sparse snapshots), let
me know - the next step will be to add an index from snapshot ID
-> inode number, which won't be too hard.
- Faster device removal: the "scan for pointers to this device" no
longer does a full metadata scan, instead it walks backpointers.
Like fast snapshot deletion this is another incompat feature: it
also requires a new sentinal value, because we don't want to
reuse these device IDs until after a fsck.
- We're now coalescing redundant accounting updates prior to
transaction commit, taking some pressure off the journal. Shortly
we'll also be doing multiple extent updates in a transaction in
the main write path, which combined with the previous should
drastically cut down on the amount of metadata updates we have to
journal.
- Stack usage improvements: All allocator state has been moved off the
stack
- Debug improvements:
- enumerated refcounts: The debug code previously used for
filesystem write refs is now a small library, and used for other
heavily used refcounts. Different users of a refcount are
enumerated, making it much easier to debug refcount issues.
- Async object debugging: There's a new kconfig option that makes
various async objects (different types of bios, data updates,
write ops, etc.) visible in debugfs, and it should be fast enough
to leave on in production.
- Various sets of assertions no longer require
CONFIG_BCACHEFS_DEBUG, instead they're controlled by module
parameters and static keys, meaning users won't need to compile
custom kernels as often to help debug issues.
- bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
option). With it on you can check the btree_transaction_stats in
debugfs to see the bch2_trans_kmalloc() calls a transaction did
when it used the most memory.
* tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs: (218 commits)
bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
bcachefs: Fix btree_iter_next_node() for new locking asserts
bcachefs: Ensure we don't use a blacklisted journal seq
bcachefs: Small check_fix_ptr fixes
bcachefs: Fix opts.recovery_pass_last
bcachefs: Fix allocate -> self healing path
bcachefs: Fix endianness in casefold check/repair
bcachefs: Path must be locked if trans->locked && should_be_locked
bcachefs: Simplify bch2_path_put()
bcachefs: Plumb btree_trans for more locking asserts
bcachefs: Clear trans->locked before unlock
bcachefs: Clear should_be_locked before unlock in key_cache_drop()
bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
bcachefs: Give out new path if upgrade fails
bcachefs: Fix btree_path_get_locks when not doing trans restart
bcachefs: btree_node_locked_type_nowrite()
bcachefs: Kill bch2_path_put_nokeep()
bcachefs: bch2_journal_write_checksum()
bcachefs: Reduce stack usage in data_update_index_update()
bcachefs: bch2_trans_log_str()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPUAAKCRCRxhvAZXjc
opWHAP9xpS4Z/MvxYpRMQ7G6MSECDNZq0ru8k6HXuq4BIeMgNQEA7nI0JiyVjanY
ZCkRuBpoWMxR5OsiNIpL0GbhTVFwvwk=
=or0/
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
- More fallout and preparatory work associated with the folio batch
prototype posted a while back.
Mainly this just cleans up some of the helpers and pushes some
pos/len trimming further down in the write begin path.
- Add missing flag descriptions to the iomap documentation
* tag 'vfs-6.16-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: rework iomap_write_begin() to return folio offset and length
iomap: push non-large folio check into get folio path
iomap: helper to trim pos/bytes to within folio
iomap: drop pos param from __iomap_[get|put]_folio()
iomap: drop unnecessary pos param from iomap_write_[begin|end]
iomap: resample iter->pos after iomap_write_begin() calls
iomap: trace: Add missing flags to [IOMAP_|IOMAP_F_]FLAGS_STRINGS
Documentation: iomap: Add missing flags description
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
om0+AQDMxKLweJXplqQQ7jxuvW2dEa60YpE2EalEKWGg9YA3KgEA3nI4kyKMKn7Y
PRFXgIcKvhs62oJLKsq8SGQUqExqvAE=
=atEw
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Use folios for symlinks in the page cache
FUSE already uses folios for its symlinks. Mirror that conversion
in the generic code and the NFS code. That lets us get rid of a few
folio->page->folio conversions in this path, and some of the few
remaining users of read_cache_page() / read_mapping_page()
- Try and make a few filesystem operations killable on the VFS
inode->i_mutex level
- Add sysctl vfs_cache_pressure_denom for bulk file operations
Some workloads need to preserve more dentries than we currently
allow through out sysctl interface
A HDFS servers with 12 HDDs per server, on a HDFS datanode startup
involves scanning all files and caching their metadata (including
dentries and inodes) in memory. Each HDD contains approximately 2
million files, resulting in a total of ~20 million cached dentries
after initialization
To minimize dentry reclamation, they set vfs_cache_pressure to 1.
Despite this configuration, memory pressure conditions can still
trigger reclamation of up to 50% of cached dentries, reducing the
cache from 20 million to approximately 10 million entries. During
the subsequent cache rebuild period, any HDFS datanode restart
operation incurs substantial latency penalties until full cache
recovery completes
To maintain service stability, more dentries need to be preserved
during memory reclamation. The current minimum reclaim ratio (1/100
of total dentries) remains too aggressive for such workload. This
patch introduces vfs_cache_pressure_denom for more granular cache
pressure control
The configuration [vfs_cache_pressure=1,
vfs_cache_pressure_denom=10000] effectively maintains the full 20
million dentry cache under memory pressure, preventing datanode
restart performance degradation
- Avoid some jumps in inode_permission() using likely()/unlikely()
- Avid a memory access which is most likely a cache miss when
descending into devcgroup_inode_permission()
- Add fastpath predicts for stat() and fdput()
- Anonymous inodes currently don't come with a proper mode causing
issues in the kernel when we want to add useful VFS debug assert.
Fix that by giving them a proper mode and masking it off when we
report it to userspace which relies on them not having any mode
- Anonymous inodes currently allow to change inode attributes because
the VFS falls back to simple_setattr() if i_op->setattr isn't
implemented. This means the ownership and mode for every single
user of anon_inode_inode can be changed. Block that as it's either
useless or actively harmful. If specific ownership is needed the
respective subsystem should allocate anonymous inodes from their
own private superblock
- Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock
- Add proper tests for anonymous inode behavior
- Make it easy to detect proper anonymous inodes and to ensure that
we can detect them in codepaths such as readahead()
Cleanups:
- Port pidfs to the new anon_inode_{g,s}etattr() helpers
- Try to remove the uselib() system call
- Add unlikely branch hint return path for poll
- Add unlikely branch hint on return path for core_sys_select
- Don't allow signals to interrupt getdents copying for fuse
- Provide a size hint to dir_context for during readdir()
- Use writeback_iter directly in mpage_writepages
- Update compression and mtime descriptions in initramfs
documentation
- Update main netfs API document
- Remove useless plus one in super_cache_scan()
- Remove unnecessary NULL-check guards during setns()
- Add separate separate {get,put}_cgroup_ns no-op cases
Fixes:
- Fix typo in root= kernel parameter description
- Use KERN_INFO for infof()|info_plog()|infofc()
- Correct comments of fs_validate_description()
- Mark an unlikely if condition with unlikely() in
vfs_parse_monolithic_sep()
- Delete macro fsparam_u32hex()
- Remove unused and problematic validate_constant_table()
- Fix potential unsigned integer underflow in fs_name()
- Make file-nr output the total allocated file handles"
* tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits)
fs: Pass a folio to page_put_link()
nfs: Use a folio in nfs_get_link()
fs: Convert __page_get_link() to use a folio
fs/read_write: make default_llseek() killable
fs/open: make do_truncate() killable
fs/open: make chmod_common() and chown_common() killable
include/linux/fs.h: add inode_lock_killable()
readdir: supply dir_context.count as readdir buffer size hint
vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations
fuse: don't allow signals to interrupt getdents copying
Documentation: fix typo in root= kernel parameter description
include/cgroup: separate {get,put}_cgroup_ns no-op case
kernel/nsproxy: remove unnecessary guards
fs: use writeback_iter directly in mpage_writepages
fs: remove useless plus one in super_cache_scan()
fs: add S_ANON_INODE
fs: remove uselib() system call
device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
fs/fs_parse: Remove unused and problematic validate_constant_table()
fs: touch up predicts in inode_permission()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTgAKCRCRxhvAZXjc
ovkTAP9tyN24Oo+koY/2UedYBxM54cW4BCCRsVmkzfr8NSVdwwD/dg+v6gS8+nyD
3jlR0Z/08UyMHapB7fnAuFxPXXc8oAo=
=e55o
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull final writepage conversion from Christian Brauner:
"This converts vboxfs from ->writepage() to ->writepages().
This was the last user of the ->writepage() method. So remove
->writepage() completely and all references to it"
* tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Remove aops->writepage
mm: Remove swap_writepage() and shmem_writepage()
ttm: Call shmem_writeout() from ttm_backup_backup_page()
i915: Use writeback_iter()
shmem: Add shmem_writeout()
writeback: Remove writeback_use_writepage()
migrate: Remove call to ->writepage
vboxsf: Convert to writepages
9p: Add a migrate_folio method
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBN6wAKCRCRxhvAZXjc
ok32AQD9DTiSCAoVg+7s+gSBuLTi8drPTN++mCaxdTqRh5WpRAD9GVyrGQT0s6LH
eo9bm8d1TAYjilEWM0c0K0TxyQ7KcAA=
=IW7H
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs directory lookup updates from Christian Brauner:
"This contains cleanups for the lookup_one*() family of helpers.
We expose a set of functions with names containing "lookup_one_len"
and others without the "_len". This difference has nothing to do with
"len". It's rater a historical accident that can be confusing.
The functions without "_len" take a "mnt_idmap" pointer. This is found
in the "vfsmount" and that is an important question when choosing
which to use: do you have a vfsmount, or are you "inside" the
filesystem. A related question is "is permission checking relevant
here?".
nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len
functions. They pass nop_mnt_idmap and refuse to work on filesystems
which have any other idmap.
This work changes nfsd and cachefile to use the lookup_one family of
functions and to explictily pass &nop_mnt_idmap which is consistent
with all other vfs interfaces used where &nop_mnt_idmap is explicitly
passed.
The remaining uses of the "_one" functions do not require permission
checks so these are renamed to be "_noperm" and the permission
checking is removed.
This series also changes these lookup function to take a qstr instead
of separate name and len. In many cases this simplifies the call"
* tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
VFS: change lookup_one_common and lookup_noperm_common to take a qstr
Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
VFS: rename lookup_one_len family to lookup_noperm and remove permission check
cachefiles: Use lookup_one() rather than lookup_one_len()
nfsd: Use lookup_one() rather than lookup_one_len()
VFS: improve interface for lookup_one functions
When attempting to use an archive file, such as APEX on android,
as a file-backed mount source, it fails because EROFS image within
the archive file does not start at offset 0. As a result, a loop
or a dm device is still needed to attach the image file at an
appropriate offset first. Similarly, if an EROFS image within a
block device does not start at offset 0, it cannot be mounted
directly either.
To address this issue, this patch adds a new mount option `fsoffset=x'
to accept a start offset for the primary device. The offset should be
aligned to the block size. EROFS will add this offset before performing
read requests.
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250517090544.2687651-1-shengyong1@xiaomi.com
[ Gao Xiang: minor update on documentation and the error message. ]
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
People have been asking to see the plan for this, so -
bcachefs has various background tasks that need to be scheduled to
balance efficiency, predictability of performance, etc.
The design and philosophy hasn't changed too much since bcache, which
was primarily designed for server usage, with sustained load in mind.
These days we're seeing more desktop usage - where we really want to let
the system idle effictively, to reduce total power usage - while also
still balancing previous concerns, we still want to let work accumulate
to a degree.
This lays out all the requirements and starts to sketch out the
algorithm I have in mind.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cleanup some punctuation, capital letter, and a missing word
in relay.rst.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20250512023233.107582-1-rdunlap@infradead.org>
Resctrl is a filesystem interface to hardware that provides cache
allocation policy and bandwidth control for groups of tasks or CPUs.
To support more than one architecture, resctrl needs to live in /fs/.
Move the code that is concerned with the filesystem interface to
/fs/resctrl.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-25-james.morse@arm.com
FAULT_KMALLOC 0x000000001
There is one redundant '0' in 32-bits hexademical number of fault type,
remove it.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The last use of relay_late_setup_files() was removed in 2018 by commit
2b47733045 ("drm/i915/guc: Merge log relay file and channel creation")
Remove it and the helper it used.
relay_late_setup_files() was used for eventually registering 'buffer only'
channels. With it gone, delete the docs that explain how to do that.
Which suggests it should be possible to lose the 'has_base_filename'
flags.
(Are there any other uses??)
Link: https://lkml.kernel.org/r/20250418234932.490863-1-linux@treblig.org
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Support to inject a timeout fault into function, currently it only
support to inject timeout to commit_atomic_write flow to reproduce
inconsistent bug, like the bug fixed by commit f098aeba04 ("f2fs:
fix to avoid atomicity corruption of atomic file").
By default, the new type fault will inject 1000ms timeout, and the
timeout process can be interrupted by SIGKILL.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently the calling conventions for ->d_automount() instances have
an odd wart - returned new mount to be attached is expected to have
refcount 2.
That kludge is intended to make sure that mark_mounts_for_expiry() called
before we get around to attaching that new mount to the tree won't decide
to take it out. finish_automount() drops the extra reference after it's
done with attaching mount to the tree - or drops the reference twice in
case of error. ->d_automount() instances have rather counterintuitive
boilerplate in them.
There's a much simpler approach: have mark_mounts_for_expiry() skip the
mounts that are yet to be mounted. And to hell with grabbing/dropping
those extra references. Makes for simpler correctness analysis, at that...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Update the debugfs documentation to indicate that debugfs_remove()
should be used to clean up debugfs entries.
In commit a3d1e7eb5a ("simple_recursive_removal(): kernel-side rm -rf
for ramfs-style filesystems"), function debugfs_remove_recursive()
was made into an alias for debugfs_remove():
#define debugfs_remove_recursive debugfs_remove
Therefore, drivers should just use debugfs_remove() going forward.
Signed-off-by: Timur Tabi <ttabi@nvidia.com>
Link: https://lore.kernel.org/r/20250429173958.3973958-1-ttabi@nvidia.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Allow the special case of a redirect from a lower layer to a data layer
without having to turn on metacopy. This makes the feature work with
userxattr, which in turn allows data layers to be usable in user
namespaces.
Minimize the risk by only enabling redirect from a single lower layer to a
data layer iff a data layer is specified. The only way to access a data
layer is to enable this, so there's really no reason not to enable this.
This can be used safely if the lower layer is read-only and the
user.overlay.redirect xattr cannot be modified.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Remove validate_constant_table() since:
- It has no caller.
- It has below 3 bugs for good constant table array array[] which must
end with a empty entry, and take below invocation for explaination:
validate_constant_table(array, ARRAY_SIZE(array), ...)
- Always return wrong value due to the last empty entry.
- Imprecise error message for missorted case.
- Potential NULL pointer dereference since the last pr_err() may use
@tbl[i].name NULL pointer to print the last empty entry's name.
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250415-fix_fs-v4-1-5d575124a3ff@quicinc.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Delete macro fsparam_u32hex() since:
- it has no caller.
- it uses as type @fs_param_is_u32_hex which is never defined, so will
cause compile error when caller uses it.
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250411-fix_fs-v2-1-5d3395c102e4@quicinc.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Let's document the use of these flags in iomap design doc where other
flags are defined too -
- IOMAP_F_BOUNDARY was added by XFS to prevent merging of I/O and I/O
completions across RTG boundaries.
- IOMAP_F_ATOMIC_BIO was added for supporting atomic I/O operations
for filesystems to inform the iomap that it needs HW-offload based
mechanism for torn-write protection.
While we are at it, let's also fix the description of IOMAP_F_PRIVATE
flag after a recent:
commit 923936efeb ("iomap: Fix conflicting values of iomap flags")
Signed-off-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/8d8534a704c4f162f347a84830710db32a927b2e.1744432270.git.ritesh.list@gmail.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
syzbot failures and fixing a stale file handing refeencing an inode
previously used as a regular file, but which has been deleted and
reused as an ea_inode would result in ext4 erroneously consider this a
case of fs corrupotion.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmf7r3YACgkQ8vlZVpUN
gaPl9QgApwE5BAQdO6miW0sDMPj5b4sMc25aG4OPlfKhFqiIJB0Ub4zC2n0OFnaf
HXk8P5oVeepH9ciTnYFF30X20Ythzjwmd9j5eyq2wsfYASQUjfcvmR9WovbqZtGQ
3Zerd9QFp7SvZa+K4sADBhEb/7HAnxDGfiqSQptY6WQTwD+it1bnuhmzG0m6AH4m
R1ItREDx7D2QrudDToFBd8XQ+FgRETZ8Qrs7PqIznw/dBNMdHRnAiw2eiyuoPU/S
T8cmCxii3Z9sJ6LtohKYuWOmOmdxg951V5ZcekVRuaFSljSUsRsIplO7OlaMvQDs
9vGVKiiZLdU2B0Wd90IeQUdJmP4xPg==
=I8qx
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A few more miscellaneous ext4 bug fixes and cleanups including some
syzbot failures and fixing a stale file handing refeencing an inode
previously used as a regular file, but which has been deleted and
reused as an ea_inode would result in ext4 erroneously considering
this a case of fs corruption"
* tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix off-by-one error in do_split
ext4: make block validity check resistent to sb bh corruption
ext4: avoid -Wflex-array-member-not-at-end warning
Documentation: ext4: Add fields to ext4_super_block documentation
ext4: don't treat fhandle lookup of ea_inode as FS corruption
Documentation and implementation of the ext4 super block have
slightly diverged: Padding has been removed in order to make room for
new fields that are still missing in the documentation.
Add the new fields s_encryption_level, s_first_error_errorcode,
s_last_error_errorcode to the documentation of the ext4 super block.
Fixes: f542fbe8d5 ("ext4 crypto: reserve codepoints used by the ext4 encryption feature")
Fixes: 878520ac45 ("ext4: save the error code which triggered an ext4_error() in the superblock")
Signed-off-by: Tom Vierjahn <tom.vierjahn@acm.org>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20250324221004.5268-1-tom.vierjahn@acm.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add support for hardware-wrapped keys to fscrypt. Such keys are
protected from certain attacks, such as cold boot attacks. For more
information, see the "Hardware-wrapped keys" section of
Documentation/block/inline-encryption.rst.
To support hardware-wrapped keys in fscrypt, we allow the fscrypt master
keys to be hardware-wrapped. File contents encryption is done by
passing the wrapped key to the inline encryption hardware via
blk-crypto. Other fscrypt operations such as filenames encryption
continue to be done by the kernel, using the "software secret" which the
hardware derives. For more information, see the documentation which
this patch adds to Documentation/filesystems/fscrypt.rst.
Note that this feature doesn't require any filesystem-specific changes.
However it does depend on inline encryption support, and thus currently
it is only applicable to ext4 and f2fs.
The version of this feature introduced by this patch is mostly
equivalent to the version that has existed downstream in the Android
Common Kernels since 2020. However, a couple fixes are included.
First, the flags field in struct fscrypt_add_key_arg is now placed in
the proper location. Second, key identifiers for HW-wrapped keys are
now derived using a distinct HKDF context byte; this fixes a bug where a
raw key could have the same identifier as a HW-wrapped key. Note that
as a result of these fixes, the version of this feature introduced by
this patch is not UAPI or on-disk format compatible with the version in
the Android Common Kernels, though the divergence is limited to just
those specific fixes. This version should be used going forwards.
This patch has been heavily rewritten from the original version by
Gaurav Kashyap <quic_gaurkash@quicinc.com> and
Barani Muthukumaran <bmuthuku@codeaurora.org>.
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650
Link: https://lore.kernel.org/r/20250404225859.172344-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
try_lookup_noperm() and d_hash_and_lookup() are nearly identical. The
former does some validation of the name where the latter doesn't.
Outside of the VFS that validation is likely valuable, and having only
one exported function for this task is certainly a good idea.
So make d_hash_and_lookup() local to VFS files and change all other
callers to try_lookup_noperm(). Note that the arguments are swapped.
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250319031545.2999807-6-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
The lookup_one_len family of functions is (now) only used internally by
a filesystem on itself either
- in a context where permission checking is irrelevant such as by a
virtual filesystem populating itself, or xfs accessing its ORPHANAGE
or dquota accessing the quota file; or
- in a context where a permission check (MAY_EXEC on the parent) has just
been performed such as a network filesystem finding in "silly-rename"
file in the same directory. This is also the context after the
_parentat() functions where currently lookup_one_qstr_excl() is used.
So the permission check is pointless.
The name "one_len" is unhelpful in understanding the purpose of these
functions and should be changed. Most of the callers pass the len as
"strlen()" so using a qstr and QSTR() can simplify the code.
This patch renames these functions (include lookup_positive_unlocked()
which is part of the family despite the name) to have a name based on
"lookup_noperm". They are changed to receive a 'struct qstr' instead
of separate name and len. In a few cases the use of QSTR() results in a
new call to strlen().
try_lookup_noperm() takes a pointer to a qstr instead of the whole
qstr. This is consistent with d_hash_and_lookup() (which is nearly
identical) and useful for lookup_noperm_unlocked().
The new lookup_noperm_common() doesn't take a qstr yet. That will be
tidied up in a subsequent patch.
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-5-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
All callers and implementations are now removed, so remove the operation
and update the documentation to match.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250402150005.2309458-10-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The family of functions:
lookup_one()
lookup_one_unlocked()
lookup_one_positive_unlocked()
appear designed to be used by external clients of the filesystem rather
than by filesystems acting on themselves as the lookup_one_len family
are used.
They are used by:
btrfs/ioctl - which is a user-space interface rather than an internal
activity
exportfs - i.e. from nfsd or the open_by_handle_at interface
overlayfs - at access the underlying filesystems
smb/server - for file service
They should be used by nfsd (more than just the exportfs path) and
cachefs but aren't.
It would help if the documentation didn't claim they should "not be
called by generic code".
Also the path component name is passed as "name" and "len" which are
(confusingly?) separate by the "base". In some cases the len in simply
"strlen" and so passing a qstr using QSTR() would make the calling
clearer.
Other callers do pass separate name and len which are stored in a
struct. Sometimes these are already stored in a qstr, other times it
easily could be.
So this patch changes these three functions to receive a 'struct qstr *',
and improves the documentation.
QSTR_LEN() is added to make it easy to pass a QSTR containing a known
len.
[brauner@kernel.org: take a struct qstr pointer]
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
- fix handling of bogus (negative/too long) replies
- fix crash on mkdir with ACLs
(... looks like nobody is using ACLs with semi-recent kernels...)
- ipv6 support for trans=tcp
- minor concurrency fix to make syzbot happy
- minor cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmfuAoIACgkQq06b7GqY
5nBb9w/+K9WnU4MdSTFSXDJ+ZZTY//fPpFaUTqHl1hTeRjmIBtBdngy9ASvnPrPj
n6DHnd+qkdFV6cMvs5wPUskRxJZuRDugzZMAd6yzjJoRNPmNFN2Ux7EXWEdFwvFG
mk4EJtzgiZhp7XWlNzQeMziuDmMZJijzLsd4zVYNo9fNKEh5jLKjKWyHTVRxfuCc
i22Y8oUgcghK0YSSLoL59xF4nRrvn57DBF3wnrW6pqVvVQ05NJRH4fNgXp4wW497
jxQq01ela7IgNUoMgib7F0ov1fu8pSEd95T+fzcqynZCePQ9rzDbvt3MR7rjJuqo
/VXwW7N3KT6DrQG6Wu21B9VcfBeWjdbtJ/GWGVp8d2iP04Sv0escx53qETZSD0iZ
pMIZLthJuXlq9dmxZ/j+BPLlbm7uAFPbP15/O9Un5xVvrisANFm1TPvM77btnrEP
KovWfooheoUrK6DmkKbkzS5HJH2ko4CASAG7c8GL+R1hXwVDswC06cecyvXaKQQK
Um4nOe59hRqbqWXmIEs4jssoUjfg8MfuX71DvX0p6+r1WR+eySieG2HiTz/mTj0q
/27cCWlAvjYxa42opxASAD1/HvW2tZfcPKtSQbh/3s0FBpTVqbof3fxmnTjcb0Po
V7WpuRSD7DnmawjbQQLXznUQokagO23/ySO1vARnluKyGwsn5yI=
=Q0mE
-----END PGP SIGNATURE-----
Merge tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- fix handling of bogus (negative/too long) replies
- fix crash on mkdir with ACLs (... looks like nobody is using ACLs
with semi-recent kernels...)
- ipv6 support for trans=tcp
- minor concurrency fix to make syzbot happy
- minor cleanup
* tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux:
docs: fs/9p: Add missing "not" in cache documentation
9p: Use hashtable.h for hash_errmap
Documentation/fs/9p: fix broken link
9p/trans_fd: mark concurrent read and writes to p9_conn->err
9p/net: return error on bogus (longer than requested) replies
9p/net: fix improper handling of bogus negative read/write replies
fs/9p: fix NULL pointer dereference on mkdir
net/9p/fd: support ipv6 for trans=tcp
A quick fix for what I assume is a typo.
Signed-off-by: Tingmao Wang <m@maowtm.org>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Message-ID: <20250330213443.98434-1-m@maowtm.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
reservation" from Sourabh Jain changes powerpc's kexec code to use more
of the generic layers.
- The 2 patch series "get_maintainer: report subsystem status
separately" from Vlastimil Babka makes some long-requested improvements
to the get_maintainer output.
- The 4 patch series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizing the refcounting in the ucount
code.
- The 12 patch series "reboot: support runtime configuration of
emergency hw_protection action" from Ahmad Fatoum improves the ability
for a driver to perform an emergency system shutdown or reboot.
- The 16 patch series "Converge on using secs_to_jiffies() part two"
from Easwar Hariharan performs further migrations from
msecs_to_jiffies() to secs_to_jiffies().
- The 7 patch series "lib/interval_tree: add some test cases and
cleanup" from Wei Yang permits more userspace testing of kernel library
code, adds some more tests and performs some cleanups.
- The 2 patch series "hung_task: Dump the blocking task stacktrace" from
Masami Hiramatsu arranges for the hung_task detector to dump the stack
of the blocking task and not just that of the blocked task.
- The 4 patch series "resource: Split and use DEFINE_RES*() macros" from
Andy Shevchenko provides some cleanups to the resource definition
macros.
- Plus the usual shower of singleton patches - please see the individual
changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nuqwAKCRDdBJ7gKXxA
jtNqAQDxqJpjWkzn4yN9CNSs1ivVx3fr6SqazlYCrt3u89WQvwEA1oRrGpETzUGq
r6khQUIcQImPPcjFqEFpuiSOU0MBZA0=
=Kii8
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "powerpc/crash: use generic crashkernel reservation" from
Sourabh Jain changes powerpc's kexec code to use more of the generic
layers.
- The series "get_maintainer: report subsystem status separately" from
Vlastimil Babka makes some long-requested improvements to the
get_maintainer output.
- The series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizing the refcounting in the
ucount code.
- The series "reboot: support runtime configuration of emergency
hw_protection action" from Ahmad Fatoum improves the ability for a
driver to perform an emergency system shutdown or reboot.
- The series "Converge on using secs_to_jiffies() part two" from Easwar
Hariharan performs further migrations from msecs_to_jiffies() to
secs_to_jiffies().
- The series "lib/interval_tree: add some test cases and cleanup" from
Wei Yang permits more userspace testing of kernel library code, adds
some more tests and performs some cleanups.
- The series "hung_task: Dump the blocking task stacktrace" from Masami
Hiramatsu arranges for the hung_task detector to dump the stack of
the blocking task and not just that of the blocked task.
- The series "resource: Split and use DEFINE_RES*() macros" from Andy
Shevchenko provides some cleanups to the resource definition macros.
- Plus the usual shower of singleton patches - please see the
individual changelogs for details.
* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
mailmap: consolidate email addresses of Alexander Sverdlin
fs/procfs: fix the comment above proc_pid_wchan()
relay: use kasprintf() instead of fixed buffer formatting
resource: replace open coded variant of DEFINE_RES()
resource: replace open coded variants of DEFINE_RES_*_NAMED()
resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
samples: add hung_task detector mutex blocking sample
hung_task: show the blocker task if the task is hung on mutex
kexec_core: accept unaccepted kexec segments' destination addresses
watchdog/perf: optimize bytes copied and remove manual NUL-termination
lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
lib/interval_tree: skip the check before go to the right subtree
lib/interval_tree: add test case for span iteration
lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
lib/rbtree: add random seed
lib/rbtree: split tests
lib/rbtree: enable userland test suite for rbtree related data structure
checkpatch: describe --min-conf-desc-length
scripts/gdb/symbols: determine KASLR offset on s390
...
Uros Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was founf to be incorrect.
- The 4 patch series "Some cleanup for memcg" from Chen Ridong
implements some relatively monir cleanups for the memcontrol code.
- The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
from David Hildenbrand fixes a boatload of issues which David found then
using device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now succeed.
- The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
Ahmed remove the z3fold and zbud implementations. They have been
deprecated for half a year and nobody has complained.
- The 5 patch series "mm: further simplify VMA merge operation" from
Lorenzo Stoakes implements numerous simplifications in this area. No
runtime effects are anticipated.
- The 4 patch series "mm/madvise: remove redundant mmap_lock operations
from process_madvise()" from SeongJae Park rationalizes the locking in
the madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The 12 patch series "Tiny cleanup and improvements about SWAP code"
from Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak user-visible
output.
- The 2 patch series "mm/damon/paddr: fix large folios access and
schemes handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The 3 patch series "mm/damon/core: fix wrong and/or useless
damos_walk() behaviors" from SeongJae Park fixes a few issues with the
accuracy of kdamond's walking of DAMON regions.
- The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
Lorenzo Stoakes changes the interaction between framebuffer deferred-io
and core MM. No functional changes are anticipated - this is
preparatory work for the future removal of page structure fields.
- The 4 patch series "mm/damon: add support for hugepage_size DAMOS
filter" from Usama Arif adds a DAMOS filter which permits the filtering
by huge page sizes.
- The 4 patch series "mm: permit guard regions for file-backed/shmem
mappings" from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The 4 patch series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping for
pte-mapped large folios.
- The 18 patch series "reimplement per-vma lock as a refcount" from
Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
fixes and improves" from SeongJae Park does some maintenance work on the
DAMON docs.
- The 27 patch series "hugetlb/CMA improvements for large systems" from
Frank van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The 12 patch series "selftests/mm: Some cleanups from trying to run
them" from Brendan Jackman fixes a pile of unrelated issues which
Brendan encountered while runnimg our selftests.
- The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply wasn't
being effective.
- The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
from David Hildenbrand implements a number of unrelated cleanups in this
code.
- The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
Khandual implements a number of preparatoty cleanups to the
GENERIC_PTDUMP Kconfig logic.
- The 8 patch series "mm/damon: auto-tune aggregation interval" from
SeongJae Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did
this in preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The 2 patch series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the code
easier to follow.
- The 3 patch series "page_counter cleanup and size reduction" from
Shakeel Butt cleans up the page_counter code and fixes a size increase
which we accidentally added late last year.
- The 3 patch series "Add a command line option that enables control of
how many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The 3 patch series "Fix calculations in trace_balance_dirty_pages()
for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The 9 patch series "mm/damon: make allow filters after reject filters
useful and intuitive" from SeongJae Park improves the handling of allow
and reject filters. Behaviour is made more consistent and the
documention is updated accordingly.
- The 5 patch series "Switch zswap to object read/write APIs" from Yosry
Ahmed updates zswap to the new object read/write APIs and thus permits
the removal of some legacy code from zpool and zsmalloc.
- The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
does as it claims.
- The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
is a preparatoty refactoring and cleanup of the mremap() code.
- The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
+ CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
filters based on handling layers" from SeongJae Park adds a couple of
new sysfs directories to ease the management of DAMON/DAMOS filters.
- The 13 patch series "arch, mm: reduce code duplication in mem_init()"
from Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The 13 patch series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The 3 patch series "mm: page_ext: Introduce new iteration API" from
Luiz Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The 8 patch series "Buddy allocator like (or non-uniform) folio split"
from Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios are
generated.
- The 2 patch series "Minimize xa_node allocation during xarry split"
from Zi Yan reduces the number of xarray xa_nodes which are generated
during an xarray split.
- The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The 3 patch series "Add tracepoints for lowmem reserves, watermarks
and totalreserve_pages" from Martin Liu adds some more tracepoints to
the page allocator code.
- The 4 patch series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which SeongJae
observed during his earlier madvise work.
- The 3 patch series "mm/hwpoison: Fix regressions in memory failure
handling" from Shuai Xue addresses two quite serious regressions which
Shuai has observed in the memory-failure implementation.
- The 5 patch series "mm: reliable huge page allocator" from Johannes
Weiner makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The 5 patch series "Minor memcg cleanups & prep for memdescs" from
Matthew Wilcox is preparatory work for the future implementation of
memdescs.
- The 4 patch series "track memory used by balloon drivers" from Nico
Pache introduces a way to track memory used by our various balloon
drivers.
- The 2 patch series "mm/damon: introduce DAMOS filter type for active
pages" from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
Hao Jia separates the proactive reclaim statistics from the direct
reclaim statistics.
- The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
from Jinjiang Tu fixes our handling of hwpoisoned pages within the
reclaim code.
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
=Pn2K
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
Neil Brown contributed more scalability improvements to NFSD's
open file cache, and Jeff Layton contributed a menagerie of
repairs to NFSD's NFSv4 callback / backchannel implementation.
Mike Snitzer contributed a change to NFS re-export support that
disables support for file locking on a re-exported NFSv4 mount.
This is because NFSv4 state recovery is currently difficult if
not impossible for re-exported NFS mounts. The change aims to
prevent data integrity exposures after the re-export server
crashes.
Work continues on the evolving NFSD netlink administrative API.
Many thanks to the contributors, reviewers, testers, and bug
reporters who participated during the v6.15 development cycle.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmfmpMIACgkQM2qzM29m
f5f6DA/+P0YqoRg3Zk/4oWwXZWbfEOMhWFltT+D1PE2QjUfOZpiwUSFQfsfYgXO6
OFu0iDQ4g8BxBeP6Umv61qy7Cv6n4fVzIHqzymXQvymh9JzoQiXlE9/fA8nAHuiH
u7kkNPRi7faBz1sMg/WpN9CHctg7STPOhhG/JrZcSFZnh87mU1i4i4bZBNz8tVnK
ZWf483OUuSmJY2/bUTkwvr4GbceTKBlLWFFjiRhfAKvJBWvu4myfC0DI5QzxmsgI
MJ62do7AFJP1ww2Ih9LLi2kFIt/yyInSVAgyts1CPhlJ4BfPnTSOw/i2+CuF3D/M
bZYEAOjH3AqjBZmq58sIQezpD5f9/TOrTSwYwS31zl/THYE413WiW80/MDoWqo0y
9cSNkD3nJlPVLLCfF58vXLoe7wpLoN/ZbTdxoozzUWEFR5A4Jz3XP8F/Cws0cjem
uWWAQMItiQpg1+RYJYfu4dg5+iN6dbgYbvzlr7buISwFNXi3Zo99MkJ4wHj9TJbL
Tpjth1rWGPwwSOMT6ojKiYMq1oUzx5PuAm9Saq9oIzQAbBySmxHF/LSDz3wEuBoO
MK1jzKroEmMk3fJOOAajSDLOdAbL3vfj6H/xi2IHvKnaz9yHCZNu2YGV05BBMprd
hWePf69AO5Ky5Q9KuGClEtwvJ9ZR5pb4DO2dqaYu8ximu3O4vPo=
=e2E2
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"Neil Brown contributed more scalability improvements to NFSD's open
file cache, and Jeff Layton contributed a menagerie of repairs to
NFSD's NFSv4 callback / backchannel implementation.
Mike Snitzer contributed a change to NFS re-export support that
disables support for file locking on a re-exported NFSv4 mount. This
is because NFSv4 state recovery is currently difficult if not
impossible for re-exported NFS mounts. The change aims to prevent data
integrity exposures after the re-export server crashes.
Work continues on the evolving NFSD netlink administrative API.
Many thanks to the contributors, reviewers, testers, and bug reporters
who participated during the v6.15 development cycle"
* tag 'nfsd-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (45 commits)
NFSD: Add a Kconfig setting to enable delegated timestamps
sysctl: Fixes nsm_local_state bounds
nfsd: use a long for the count in nfsd4_state_shrinker_count()
nfsd: remove obsolete comment from nfs4_alloc_stid
nfsd: remove unneeded forward declaration of nfsd4_mark_cb_fault()
nfsd: reorganize struct nfs4_delegation for better packing
nfsd: handle errors from rpc_call_async()
nfsd: move cb_need_restart flag into cb_flags
nfsd: replace CB_GETATTR_BUSY with NFSD4_CALLBACK_RUNNING
nfsd: eliminate cl_ra_cblist and NFSD4_CLIENT_CB_RECALL_ANY
nfsd: prevent callback tasks running concurrently
nfsd: disallow file locking and delegations for NFSv4 reexport
nfsd: filecache: drop the list_lru lock during lock gc scans
nfsd: filecache: don't repeatedly add/remove files on the lru list
nfsd: filecache: introduce NFSD_FILE_RECENT
nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc()
nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync()
NFSD: Re-organize nfsd_file_gc_worker()
nfsd: filecache: remove race handling.
fs: nfs: acl: Avoid -Wflex-array-member-not-at-end warning
...
* hardening against maliciously fuzzed file systems
* backwards compatibility for the brief period when we attempted to
ignore zero-width characters
* avoid potentially BUG'ing if there is a file system corruption found
during the file system unmount
* fix free space reporting by statfs when project quotas are enabled
and the free space is less than the remaining project quota
Also improve performance when replaying a journal with a very large
number of revoke records (applicable for Lustre volumes).
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmflfY4ACgkQ8vlZVpUN
gaMx7Qf/akTELvyBZ7iPCCHh2HwayuO8qLhPNqrU0TmYMFvgwgYUPcQ3BLn8CE+/
j5UeT8XxNaLU4GJn3z+q6yW6PnNHfqZqKry9j/iPc3s1mjTslntr/xENlgu6i4Bp
Q58xc7Pj45vdmP+xmYhRnJcefgsZMvB/N1SEHxwIP8bntZqsEvP9pI82r9Ouc8SA
ZLQ1/K4OADmk7f3GhlPr9AtgH7O0CjlAas30h/AW77DXBQl7ZgbDsGDlgTwaGqkR
jHcvfr6hLnWy+MUVGmlNZ2HY6iUgBPItWlYCP/fsrUdnc+CONyl5E17JPSl1QQtR
CLYlo4xV8j1+zJ094DjhDWMKI2G7jw==
=oudL
-----END PGP SIGNATURE-----
Merge tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Ext4 bug fixes and cleanups, including:
- hardening against maliciously fuzzed file systems
- backwards compatibility for the brief period when we attempted to
ignore zero-width characters
- avoid potentially BUG'ing if there is a file system corruption
found during the file system unmount
- fix free space reporting by statfs when project quotas are enabled
and the free space is less than the remaining project quota
Also improve performance when replaying a journal with a very large
number of revoke records (applicable for Lustre volumes)"
* tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (71 commits)
ext4: fix OOB read when checking dotdot dir
ext4: on a remount, only log the ro or r/w state when it has changed
ext4: correct the error handle in ext4_fallocate()
ext4: Make sb update interval tunable
ext4: avoid journaling sb update on error if journal is destroying
ext4: define ext4_journal_destroy wrapper
ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)
jbd2: add a missing data flush during file and fs synchronization
ext4: don't over-report free space or inodes in statvfs
ext4: clear DISCARD flag if device does not support discard
jbd2: remove jbd2_journal_unfile_buffer()
ext4: reorder capability check last
ext4: update the comment about mb_optimize_scan
jbd2: fix off-by-one while erasing journal
ext4: remove references to bh->b_page
ext4: goto right label 'out_mmap_sem' in ext4_setattr()
ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()
ext4: introduce ITAIL helper
jbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature
ext4: remove redundant function ext4_has_metadata_csum
...
On disk format is now soft frozen: no more required/automatic are
anticipated before taking off the experimental label.
Major changes/features since 6.14:
- Scrub
- Blocksize greater than page size support
- A number of "rebalance spinning and doing no work" issues have been
fixed; we now check if the write allocation will succeed in
bch2_data_update_init(), before kicking off the read.
There's still more work to do in this area. Later we may want to add
another bitset btree, like rebalance_work, to track "extents that
rebalance was requested to move but couldn't", e.g. due to destination
target having insufficient online devices.
- We can now support scaling well into the petabyte range: latest
bcachefs-tools will pick an appropriate bucket size at format time to
ensure fsck can run in available memory (e.g. a server with 256GB of
ram and 100PB of storage would want 16MB buckets).
On disk format changes:
- 1.21: cached backpointers (scalability improvement)
Cached replicas now get backpointers, which means we no longer rely on
incrementing bucket generation numbers to invalidate cached data: this
lets us get rid of the bucket generation number garbage collection,
which had to periodically rescan all extents to recompute bucket
oldest_gen.
Bucket generation numbers are now only used as a consistency check,
but they're quite useful for that.
- 1.22: stripe backpointers
Stripes now have backpointers: erasure coded stripes have their own
checksums, separate from the checksums for the extents they contain
(and stripe checksums also cover the parity blocks). This is required
for implementing scrub for stripes.
- 1.23: stripe lru (scalability improvement)
Persistent lru for stripes, ordered by "number of empty blocks". This
is used by the stripe creation path, which depending on free space
may create a new stripe out of a partially empty existing stripe
instead of starting a brand new stripe.
This replaces an in-memory heap, and means we no longer have to read
in the stripes btree at startup.
- 1.24: casefolding
Case insensitive directory support, courtesy of Valve.
This is an incompatible feature, to enable mount with
-o version_upgrade=incompatible
- 1.25: extent_flags
Another incompatible feature requiring explicit opt-in to enable.
This adds a flags entry to extents, and a flag bit that marks extents
as poisoned.
A poisoned extent is an extent that was unreadable due to checksum
errors. We can't move such extents without giving them a new checksum,
and we may have to move them (for e.g. copygc or device evacuate).
We also don't want to delete them: in the future we'll have an API
that lets userspace ignore checksum errors and attempt to deal with
simple bitrot itself. Marking them as poisoned lets us continue to
return the correct error to userspace on normal read calls.
Other changes/features:
- BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
command, which shows a live view of all internal filesystem counters.
- Improved journal pipelining: we can now have 16 journal writes in
flight concurrently, up from 4. We're logging significantly more to
the journal than we used to with all the recent disk accounting
changes and additions, so some users should see a performance
increase on some workloads.
- BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
devices marked as failed. Now we will attempt to read from them, but
only if we have no better options.
- New option, write_error_timeout: devices will be kicked out of the
filesystem if all writes have been failing for x number of seconds.
We now also kick devices out when notified by blk_holder_ops that
they've gone offline.
- Device option handling improvements: the discard option should now be
working as expected (additionally, in -tools, all device options that
can be set at format time can now be set at device add time, i.e.
data_allowed, state).
- We now try harder to read data after a checksum error: we'll do
additional retries if necessary to a device after after it gave us
data with a checksum error.
- More self healing work: the full inode <-> dirent consistency checks
that are currently run by fsck are now also run every time we do a
lookup, meaning we'll be able to correct errors at runtime. Runtime
self healing will be flipped on after the new changes have seen more
testing, currently they're just checking for consistency.
- KMSAN fixes: our KMSAN builds should be nearly clean now, which will
put a massive dent in the syzbot dashboard.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfhbnsACgkQE6szbY3K
bnY6ew/9FXh3m71BvVpuqTYcUGzIC7gVrnkFy6n4W96v07OjSOoTNHOVVovajxc3
P9LvA77BHC4Xro3H7ORpsIurOZUc6yx18ZizzulVbQFuYa7LY/kNri4ZBtGHcRiV
pIdQDLSNmwFjPA4x2S1qTFSF1c586lad+UNQiLam5ophBwQPEO6vG51ZEHa4wld9
+OhWTDYfrvij4D3Lt1ppvhuDP+PQBjhu/QFc0bGjHvKOjfV6sw9XU91sCYKOJIzd
qzpsiQd5sepnX717Br3f5SLdxMq2lJYvRp9756vltOCaMBvJYJtHqtXCglHQEkFw
yjhmPjk4r3VlKTF8K+wEJfAHwbC2kEn7csJNbt0+Nko5PPtFyrb8ok6QUbHCKscL
L0VMnzaXHVqvG2VgYa31temfdz7HM/zHjQ8Al3eQPaqTHIoTXIBQxOQSea/apVMt
TIlastvLoHfR8W7+LrwOmTjnBJGCJ+MrdcJzJDVk2tQmmcMA0boeZvl4aSklFuyB
zNN5fxp0VMsxNyIHLJjQ3UcwVqHXC5w+f5H1ByQLUyQh+m/xaAaz7S+BTVdVbFPa
1Z1xDuvuHOTnjIOamnOD1l36afJnhq5RciPCXCNtQSB819mc+AfNGQNQTVNOTReC
iTiUCcNxu0/DIPlPmeJzAlukVJUgz+/knOI/6zPs3eI7/o88ZGg=
=k3cV
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
"On disk format is now soft frozen: no more required/automatic are
anticipated before taking off the experimental label.
Major changes/features since 6.14:
- Scrub
- Blocksize greater than page size support
- A number of "rebalance spinning and doing no work" issues have been
fixed; we now check if the write allocation will succeed in
bch2_data_update_init(), before kicking off the read.
There's still more work to do in this area. Later we may want to
add another bitset btree, like rebalance_work, to track "extents
that rebalance was requested to move but couldn't", e.g. due to
destination target having insufficient online devices.
- We can now support scaling well into the petabyte range: latest
bcachefs-tools will pick an appropriate bucket size at format time
to ensure fsck can run in available memory (e.g. a server with
256GB of ram and 100PB of storage would want 16MB buckets).
On disk format changes:
- 1.21: cached backpointers (scalability improvement)
Cached replicas now get backpointers, which means we no longer rely
on incrementing bucket generation numbers to invalidate cached
data: this lets us get rid of the bucket generation number garbage
collection, which had to periodically rescan all extents to
recompute bucket oldest_gen.
Bucket generation numbers are now only used as a consistency check,
but they're quite useful for that.
- 1.22: stripe backpointers
Stripes now have backpointers: erasure coded stripes have their own
checksums, separate from the checksums for the extents they contain
(and stripe checksums also cover the parity blocks). This is
required for implementing scrub for stripes.
- 1.23: stripe lru (scalability improvement)
Persistent lru for stripes, ordered by "number of empty blocks".
This is used by the stripe creation path, which depending on free
space may create a new stripe out of a partially empty existing
stripe instead of starting a brand new stripe.
This replaces an in-memory heap, and means we no longer have to
read in the stripes btree at startup.
- 1.24: casefolding
Case insensitive directory support, courtesy of Valve.
This is an incompatible feature, to enable mount with
-o version_upgrade=incompatible
- 1.25: extent_flags
Another incompatible feature requiring explicit opt-in to enable.
This adds a flags entry to extents, and a flag bit that marks
extents as poisoned.
A poisoned extent is an extent that was unreadable due to checksum
errors. We can't move such extents without giving them a new
checksum, and we may have to move them (for e.g. copygc or device
evacuate). We also don't want to delete them: in the future we'll
have an API that lets userspace ignore checksum errors and attempt
to deal with simple bitrot itself. Marking them as poisoned lets us
continue to return the correct error to userspace on normal read
calls.
Other changes/features:
- BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
command, which shows a live view of all internal filesystem
counters.
- Improved journal pipelining: we can now have 16 journal writes in
flight concurrently, up from 4. We're logging significantly more to
the journal than we used to with all the recent disk accounting
changes and additions, so some users should see a performance
increase on some workloads.
- BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
devices marked as failed. Now we will attempt to read from them,
but only if we have no better options.
- New option, write_error_timeout: devices will be kicked out of the
filesystem if all writes have been failing for x number of seconds.
We now also kick devices out when notified by blk_holder_ops that
they've gone offline.
- Device option handling improvements: the discard option should now
be working as expected (additionally, in -tools, all device options
that can be set at format time can now be set at device add time,
i.e. data_allowed, state).
- We now try harder to read data after a checksum error: we'll do
additional retries if necessary to a device after after it gave us
data with a checksum error.
- More self healing work: the full inode <-> dirent consistency
checks that are currently run by fsck are now also run every time
we do a lookup, meaning we'll be able to correct errors at runtime.
Runtime self healing will be flipped on after the new changes have
seen more testing, currently they're just checking for consistency.
- KMSAN fixes: our KMSAN builds should be nearly clean now, which
will put a massive dent in the syzbot dashboard"
* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
bcachefs: Kill unnecessary bch2_dev_usage_read()
bcachefs: btree node write errors now print btree node
bcachefs: Fix race in print_chain()
bcachefs: btree_trans_restart_foreign_task()
bcachefs: bch2_disk_accounting_mod2()
bcachefs: zero init journal bios
bcachefs: Eliminate padding in move_bucket_key
bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
bcachefs: kmsan asserts
bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
bcachefs: Disable asm memcpys when kmsan enabled
bcachefs: Handle backpointers with unknown data types
bcachefs: Count BCH_DATA_parity backpointers correctly
bcachefs: Run bch2_check_dirent_target() at lookup time
bcachefs: Refactor bch2_check_dirent_target()
bcachefs: Move bch2_check_dirent_target() to namei.c
bcachefs: fs-common.c -> namei.c
bcachefs: EIO cleanup
bcachefs: bch2_write_prep_encoded_data() now returns errcode
bcachefs: Simplify bch2_write_op_error()
...
In this round, there are three major updates: 1) folio conversion, 2) refactor
for mount API conversion, 3) some performance improvement such as direct IO,
checkpoint speed, and IO priority hints. For stability, there are patches
which add more sanity checks and fixes some major issues like i_size in
atomic write operations and write pointer recovery in zoned devices.
Enhancement:
- huge folio converion work by Matthew Wilcox
- clean up for mount API conversion by Eric Sandeen
- improve direct IO speed in the overwrite case
- add some sanity check on node consistency
- set highest IO priority for checkpoint thread
- keep POSIX_FADV_NOREUSE ranges and add sysfs entry to reclaim pages
- add ioctl to get IO priority hint
- add carve_out sysfs node for fsstat
Bug fix:
- disable nat_bits during umount to avoid potential nat entry corruption
- fix missing i_size update on atomic writes
- fix missing discard for active segments
- fix running out of free segments
- fix out-of-bounds access in f2fs_truncate_inode_blocks()
- call f2fs_recover_quota_end() correctly
- fix potential deadloop in prepare_compress_overwrite()
- fix the missing write pointer correction for zoned device
- fix to avoid panic once fallocation fails for pinfile
- don't retry IO for corrupted data scenario
There are many other clean up patches and minor bug fixes as usual.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmfhjuYACgkQQBSofoJI
UNJF+hAAip0Sf6bXwt43KR3lmbXg/YBTtPDACOs375Pn8pfRVsAPIAf5e1AnlBET
rA0KpqUrEZHXenQFF9n4LIoHYqfuUw3EMempuvp4qgQcx15Kaajw+EGWVLYstNy2
9dELc9DA55f/i1uvHezGfQFy6hMfXasf+tkaYk0z0ZYDStYboLgkVAY+F869q1Dl
D16Y7Lna12h4eCxDdssIPDsLjFH/2LDn7SmhXsnpZxwK0Zx8JSo83WxXHnzXHxz4
vUkukInKgqcBDvf6ufW/YF3/tqSs20XEXNK3cI1vyHx1dwij6Us+G0n0WJHL30QE
zsliecR0X28JP1rJ3ldCjr+4Kxm6/u/Uwpinm2Fm1jL67UY4TZcaARBE8k20I6ND
j/L1+sXrIdZ1aILM9/bwCjgXiVFdZbvlfGfpTj1duAkQgRd+/s9cjPlo5j5HTad5
XJmzJz6YOaUNarhP/E31Z9SV9M2kEcmoDOTxKBg6ZcMWXZ27B6Z0ag92prd0GWk6
rWDuVj0eP/LDY1QedbHRPbg1D84jVgcnldfPaf9ptln993skJGdgS0dkegTqMR0L
H8RgpWOzWZI53gnQdePdej8diHmD8uRTrLf/oABm//GzTHC5BdwVpArOl1DCuLFF
YkMtVEicgnWRmL3PgODAdTJXaDi3uGvT116i+lfMlUn9p+t93r0=
=E+TE
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, there are three major updates: (1) folio conversion,
(2) refactoring for mount API conversion, (3) some performance
improvement such as direct IO, checkpoint speed, and IO priority
hints.
For stability, there are patches which add more sanity checks and
fixes some major issues like i_size in atomic write operations and
write pointer recovery in zoned devices.
Enhancements:
- huge folio converion work by Matthew Wilcox
- clean up for mount API conversion by Eric Sandeen
- improve direct IO speed in the overwrite case
- add some sanity check on node consistency
- set highest IO priority for checkpoint thread
- keep POSIX_FADV_NOREUSE ranges and add sysfs entry to reclaim pages
- add ioctl to get IO priority hint
- add carve_out sysfs node for fsstat
Bug fixes:
- disable nat_bits during umount to avoid potential nat entry corruption
- fix missing i_size update on atomic writes
- fix missing discard for active segments
- fix running out of free segments
- fix out-of-bounds access in f2fs_truncate_inode_blocks()
- call f2fs_recover_quota_end() correctly
- fix potential deadloop in prepare_compress_overwrite()
- fix the missing write pointer correction for zoned device
- fix to avoid panic once fallocation fails for pinfile
- don't retry IO for corrupted data scenario
There are many other clean up patches and minor bug fixes as usual"
* tag 'f2fs-for-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
f2fs: fix missing discard for active segments
f2fs: optimize f2fs DIO overwrites
f2fs: fix to avoid atomicity corruption of atomic file
f2fs: pass sbi rather than sb to parse_options()
f2fs: pass sbi rather than sb to quota qf_name helpers
f2fs: defer readonly check vs norecovery
f2fs: Pass sbi rather than sb to f2fs_set_test_dummy_encryption
f2fs: make LAZYTIME a mount option flag
f2fs: make INLINECRYPT a mount option flag
f2fs: factor out an f2fs_default_check function
f2fs: consolidate unsupported option handling errors
f2fs: use f2fs_sb_has_device_alias during option parsing
f2fs: add carve_out sysfs node
f2fs: fix to avoid running out of free segments
f2fs: Remove f2fs_write_node_page()
f2fs: Remove f2fs_write_meta_page()
f2fs: Remove f2fs_write_data_page()
f2fs: Remove check for ->writepage
Revert "f2fs: rebuild nat_bits during umount"
f2fs: fix to avoid accessing uninitialized curseg
...
A fix for an issue where CONFIG_FS_ENCRYPTION could be enabled without
some of its dependencies, and a small documentation update.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ+CBShQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKxj0AQC4OlE/HbIY2w3zO8Az9WEF9+4Dz9od
EpxTwO/fk9PC+gD/Z7r8J3xn+ykcy9QcZW+Qucd64k+rbmvqj36gXce6hAI=
=r+lX
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers:
"A fix for an issue where CONFIG_FS_ENCRYPTION could be enabled without
some of its dependencies, and a small documentation update"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: mention init_on_free instead of page poisoning
fscrypt: drop obsolete recommendation to enable optimized ChaCha20
Revert "fscrypt: relax Kconfig dependencies for crypto API algorithms"
A fix for an issue where CONFIG_FS_VERITY could be enabled without some
of its dependencies, and a small documentation update.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ+CBZBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK5WSAP9sihsazS5s2FwkY2ukwxuPBm448O6R
w457RM/j2Dz6oQEAhQF4oKU6mWJk0dOFMoBJCEUUJinb1diYTE0BMtSbXAk=
=BwSf
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux
Pull fsverity updates from Eric Biggers:
"A fix for an issue where CONFIG_FS_VERITY could be enabled without
some of its dependencies, and a small documentation update"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
Revert "fsverity: relax build time dependency on CRYPTO_SHA256"
Documentation: add a usecase for FS_IOC_READ_VERITY_METADATA
- Significant changes throughout the tree to bring Python code up to
current standards and raise the minimum Python required to 3.9. Much of
this is preparatory to replacing the ancient Perl scripts/kernel-doc
horror with a slightly less horrifying Python implementation, expected
for 6.16.
- Update the minimum Sphinx required to 3.4.3, allowing us to remove a
bunch of older compatibility code.
- Rework and improve the generation of the ABI documentation.
(All of the above done by Mauro)
- Lots of translation updates. Alex Shi and Yanteng Si are taking on
responsibility for the Chinese translations going forward; that work will
still get to you via docs-next
- Try to standardize the format for indicating a developer's affiliation in
commit tags.
- Clarify the TAB's role in CoC enforcement actions.
- Try to spell out the rules for when a commit tag can name another
developer without their explicit permission.
Plus lots of other typo fixes and updates.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmfccxwPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YoYcH/jL/nS8YAiJ3awF5PH5tR3m5ddt9l+fKXWJx
PB3KcHtDORbWltTA+Tvo2aP1jxGY9wqsIIvl+nvjJyUcfd72g4HNfTDUDXwP3OFU
wTkaEAQp3n/hqnLXtJ2AzV3Ir5cIfEL2d7F6QsN1Gnof8iu2OuMk5iMeb0iexUX6
FYjJq+jknh30VdAp2hxHy8q17R7h7PySh5OsjeAYJJroLv60n3DwQgnzHjXC/FT2
Qq1UuEzlSpRoso2o2NwVTND6OVW081umo6YrioqD7ZC2G2fhRgLFJJtJGXDNcyUl
gQv9xLSaTD97V4zaWPm28ObNBpY/GnAd4hMjB17wAH5xUfVS5Aw=
=Gvdp
-----END PGP SIGNATURE-----
Merge tag 'docs-6.15' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"It has been a reasonably busy cycle for docs...
- Significant changes throughout the tree to bring Python code up to
current standards and raise the minimum Python required to 3.9
Much of this is preparatory to replacing the ancient Perl
scripts/kernel-doc horror with a slightly less horrifying Python
implementation, expected for 6.16
- Update the minimum Sphinx required to 3.4.3, allowing us to remove
a bunch of older compatibility code
- Rework and improve the generation of the ABI documentation
(All of the above done by Mauro)
- Lots of translation updates. Alex Shi and Yanteng Si are taking on
responsibility for the Chinese translations going forward; that
work will still get to you via docs-next
- Try to standardize the format for indicating a developer's
affiliation in commit tags
- Clarify the TAB's role in CoC enforcement actions
- Try to spell out the rules for when a commit tag can name another
developer without their explicit permission
Plus lots of other typo fixes and updates"
* tag 'docs-6.15' of git://git.lwn.net/linux: (98 commits)
docs/zh_CN: fix spelling mistake
docs/Chinese: change the disclaimer words
docs/zh_CN: Add snp-tdx-threat-model index Chinese translation
docs: driver-api: firmware: clarify userspace requirements
docs: clarify rules wrt tagging other people
docs: Remove outdated highuid.rst documentation
Documentation: dma-buf: heaps: Add heap name definitions
docs/.../submit-checklist: Use Documentation/admin-guide/abi.rst for cross-ref of README
docs: Correct installation instruction
Documentation: kcsan: fix "Plain Accesses and Data Races" URL in kcsan.rst
Documentation/CoC: Spell out the TAB role in enforcement decisions
Documentation: ocxl.rst: Update consortium site
scripts: get_feat.pl: substitute s390x with s390
scripts/kernel-doc: drop dead code for Wcontents_before_sections
scripts/kernel-doc: don't add not needed new lines
docs: driver-api/infiniband.rst: fix Kerneldoc markup
drivers: firewire: firewire-cdev.h: fix identation on a kernel-doc markup
drivers: media: intel-ipu3.h: fix identation on a kernel-doc markup
include/asm-generic/io.h: fix kerneldoc markup
Docs/arch/arm64: Fix spelling in amu.rst
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rswAKCRCRxhvAZXjc
ogkpAQD7/kR7QiOQTdDztDtAavZELOE3p4pUZfAcg8XFfH8QKQD/cgDpxgBbyvSQ
VBInBrv1vPeSvlPWvFkyqT/n8eiR9AQ=
=KQVR
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs sysv removal from Christian Brauner:
"This removes the sysv filesystem. We've discussed this various times.
It's time to try"
* tag 'vfs-6.15-rc1.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
sysv: Remove the filesystem
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rNwAKCRCRxhvAZXjc
onBJAP9Z8Ywmlb5KQ1E3HvDmkwyY6yOSyZ9/CmbzrkCJ8ywYkQD/d9/xt0EP/O/q
N8YtzXArHWt7u0YbcVpy9WK3F72BdwU=
=VJgY
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory
handling:
- Change kern_path_locked() and user_path_locked_at() to never return
a negative dentry. This simplifies the usability of these helpers
in various places
- Drop d_exact_alias() from the remaining place in NFS where it is
still used. This also allows us to drop the d_exact_alias() helper
completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry
Change vfs_mkdir() to return a dentry provided by the filesystems
which is hashed and positive. This allows us to reduce the number
of cases where the resulting dentry is not positive to very few
cases. The code in these places becomes simpler and easier to
understand.
- Repack DENTRY_* and LOOKUP_* flags"
* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: fix inline emphasis warning
VFS: Change vfs_mkdir() to return the dentry.
nfs: change mkdir inode_operation to return alternate dentry if needed.
fuse: return correct dentry for ->mkdir
ceph: return the correct dentry on mkdir
hostfs: store inode in dentry after mkdir if possible.
Change inode_operations.mkdir to return struct dentry *
nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
nfs/vfs: discard d_exact_alias()
VFS: add common error checks to lookup_one_qstr_excl()
VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
VFS: repack LOOKUP_ bit flags.
VFS: repack DENTRY_ flags.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rUAAKCRCRxhvAZXjc
opI3AP9ws4S/JXOjxNKoTYmNM2nZ8+r1v8tUxbLIiqdvzx9PygD/V1ZjXtn6lwZr
OK8d5Y8UnlPZTlBF8D61op3AjnXYzws=
=KV4p
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.overlayfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs overlayfs updates from Christian Brauner:
"Currently overlayfs uses the mounter's credentials for its
override_creds() calls. That provides a consistent permission model.
This patches allows a caller to instruct overlayfs to use its
credentials instead. The caller must be located in the same user
namespace hierarchy as the user namespace the overlayfs instance will
be mounted in. This provides a consistent and simple security model.
With this it is possible to e.g., mount an overlayfs instance where
the mounter must have CAP_SYS_ADMIN but the credentials used for
override_creds() have dropped CAP_SYS_ADMIN. It also allows the usage
of custom fs{g,u}id different from the callers and other tweaks"
* tag 'vfs-6.15-rc1.overlayfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests/ovl: add third selftest for "override_creds"
selftests/ovl: add second selftest for "override_creds"
selftests/filesystems: add utils.{c,h}
selftests/ovl: add first selftest for "override_creds"
ovl: allow to specify override credentials
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rGwAKCRCRxhvAZXjc
okVnAP9VgaYjWGzaeep/dLzWtu7C/Cg5Swl1P84Vj+SJ+hFPEAD/auzWTV0D0Ko5
5GLyUsLZehfeVDOSRqmiyt1po8iVsQo=
=ANks
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iomap updates from Christian Brauner:
- Allow the filesystem to submit the writeback bios.
- Allow the filsystem to track completions on a per-bio bases
instead of the entire I/O.
- Change writeback_ops so that ->submit_bio can be done by the
filesystem.
- A new ANON_WRITE flag for writes that don't have a block number
assigned to them at the iomap level leaving the filesystem to do
that work in the submission handler.
- Incremental iterator advance
The folio_batch support for zero range where the filesystem provides
a batch of folios to process that might not be logically continguous
requires more flexibility than the current offset based iteration
currently offers.
Update all iomap operations to advance the iterator within the
operation and thus remove the need to advance from the core iomap
iterator.
- Make buffered writes work with RWF_DONTCACHE
If RWF_DONTCACHE is set for a write, mark the folios being written as
uncached. On writeback completion the pages will be dropped.
- Introduce infrastructure for large atomic writes
This will eventually be used by xfs and ext4.
* tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
iomap: rework IOMAP atomic flags
iomap: comment on atomic write checks in iomap_dio_bio_iter()
iomap: inline iomap_dio_bio_opflags()
iomap: fix inline data on buffered read
iomap: Lift blocksize restriction on atomic writes
iomap: Support SW-based atomic writes
iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
xfs: flag as supporting FOP_DONTCACHE
iomap: make buffered writes work with RWF_DONTCACHE
iomap: introduce a full map advance helper
iomap: rename iomap_iter processed field to status
iomap: remove unnecessary advance from iomap_iter()
dax: advance the iomap_iter on pte and pmd faults
dax: advance the iomap_iter on dedupe range
dax: advance the iomap_iter on unshare range
dax: advance the iomap_iter on zero range
dax: push advance down into dax_iomap_iter() for read and write
dax: advance the iomap_iter in the read/write path
iomap: convert misc simple ops to incremental advance
iomap: advance the iter on direct I/O
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90p4AAKCRCRxhvAZXjc
ojMIAP9atkG3u7+490+NGWLdulQlaHnD51Owa9MiW87UfKpsTQEArwi/NrJqXJNT
PFQ2xIa5TxG+9haChR89w3kjZ6b/hgs=
=iDkx
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Add CONFIG_DEBUG_VFS infrastucture:
- Catch invalid modes in open
- Use the new debug macros in inode_set_cached_link()
- Use debug-only asserts around fd allocation and install
- Place f_ref to 3rd cache line in struct file to resolve false
sharing
Cleanups:
- Start using anon_inode_getfile_fmode() helper in various places
- Don't take f_lock during SEEK_CUR if exclusion is guaranteed by
f_pos_lock
- Add unlikely() to kcmp()
- Remove legacy ->remount_fs method from ecryptfs after port to the
new mount api
- Remove invalidate_inodes() in favour of evict_inodes()
- Simplify ep_busy_loopER by removing unused argument
- Avoid mmap sem relocks when coredumping with many missing pages
- Inline getname()
- Inline new_inode_pseudo() and de-staticize alloc_inode()
- Dodge an atomic in putname if ref == 1
- Consistently deref the files table with rcu_dereference_raw()
- Dedup handling of struct filename init and refcounts bumps
- Use wq_has_sleeper() in end_dir_add()
- Drop the lock trip around I_NEW wake up in evict()
- Load the ->i_sb pointer once in inode_sb_list_{add,del}
- Predict not reaching the limit in alloc_empty_file()
- Tidy up do_sys_openat2() with likely/unlikely
- Call inode_sb_list_add() outside of inode hash lock
- Sort out fd allocation vs dup2 race commentary
- Turn page_offset() into a wrapper around folio_pos()
- Remove locking in exportfs around ->get_parent() call
- try_lookup_one_len() does not need any locks in autofs
- Fix return type of several functions from long to int in open
- Fix return type of several functions from long to int in ioctls
Fixes:
- Fix watch queue accounting mismatch"
* tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
fs: sort out fd allocation vs dup2 race commentary, take 2
fs: call inode_sb_list_add() outside of inode hash lock
fs: tidy up do_sys_openat2() with likely/unlikely
fs: predict not reaching the limit in alloc_empty_file()
fs: load the ->i_sb pointer once in inode_sb_list_{add,del}
fs: drop the lock trip around I_NEW wake up in evict()
fs: use wq_has_sleeper() in end_dir_add()
VFS/autofs: try_lookup_one_len() does not need any locks
fs: dedup handling of struct filename init and refcounts bumps
fs: consistently deref the files table with rcu_dereference_raw()
exportfs: remove locking around ->get_parent() call.
fs: use debug-only asserts around fd allocation and install
fs: dodge an atomic in putname if ref == 1
vfs: Remove invalidate_inodes()
ecryptfs: remove NULL remount_fs from super_operations
watch_queue: fix pipe accounting mismatch
fs: place f_ref to 3rd cache line in struct file to resolve false sharing
epoll: simplify ep_busy_loop by removing always 0 argument
fs: Turn page_offset() into a wrapper around folio_pos()
kcmp: improve performance adding an unlikely hint to task comparisons
...
In b529c06f9d (Update the documentation referencing Plan 9 from User
Space., 2020-04-26), another instance of the link was left unfixed.
Fix that as well.
Signed-off-by: Tuomas Ahola <taahol@utu.fi>
Message-ID: <20250322153639.4917-1-taahol@utu.fi>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Commit dcdfdd40fa ("mm: Add support for unaccepted memory") added a
entry to meminfo but did not document it in the proc.rst file.
This counter tracks the amount of "Unaccepted" guest memory for some
Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP.
Add the missing entry in the documentation.
Link: https://lkml.kernel.org/r/20250317230403.79632-1-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "track memory used by balloon drivers", v2.
This series introduces a way to track memory used by balloon drivers.
Add a NR_BALLOON_PAGES counter to track how many pages are reclaimed by
the balloon drivers. First add the accounting, then updates the balloon
drivers (virtio, Hyper-V, VMware, Pseries-cmm, and Xen) to maintain this
counter. The virtio, Vmware, and pseries-cmm balloon drivers utilize the
balloon_compaction interface to allocate and free balloon pages. Other
balloon drivers will have to maintain this counter manually.
This makes the information visible in memory reporting interfaces like
/proc/meminfo, show_mem, and OOM reporting.
This provides admins visibility into their VM balloon sizes without
requiring different virtualization tooling. Furthermore, this information
is helpful when debugging an OOM inside a VM.
This patch (of 4):
Add NR_BALLOON_PAGES counter to track memory used by balloon drivers and
expose it through /proc/meminfo and other memory reporting interfaces.
[npache@redhat.com: document Balloon Meminfo entry]
Link: https://lkml.kernel.org/r/a0315ccf-f244-460e-8643-fd7388724fe5@redhat.com
Link: https://lkml.kernel.org/r/20250314213757.244258-1-npache@redhat.com
Link: https://lkml.kernel.org/r/20250314213757.244258-2-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juegren Gross <jgross@suse.com>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Flag IOMAP_ATOMIC_SW is not really required. The idea of having this flag
is that the FS ->iomap_begin callback could check if this flag is set to
decide whether to do a SW (FS-based) atomic write. But the FS can set
which ->iomap_begin callback it wants when deciding to do a FS-based
atomic write.
Furthermore, it was thought that IOMAP_ATOMIC_HW is not a proper name, as
the block driver can use SW-methods to emulate an atomic write. So change
back to IOMAP_ATOMIC.
The ->iomap_begin callback needs though to indicate to iomap core that
REQ_ATOMIC needs to be set, so add IOMAP_F_ATOMIC_BIO for that.
These changes were suggested by Christoph Hellwig and Dave Chinner.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250320120250.4087011-4-john.g.garry@oracle.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Everything is in place to stop using the per-page mapcounts in large
folios: the mapcount of tail pages will always be logically 0 (-1 value),
just like it currently is for hugetlb folios already, and the page
mapcount of the head page is either 0 (-1 value) or contains a page type
(e.g., hugetlb).
Maintaining _nr_pages_mapped without per-page mapcounts is impossible, so
that one also has to go with CONFIG_NO_PAGE_MAPCOUNT.
There are two remaining implications:
(1) Per-node, per-cgroup and per-lruvec stats of "NR_ANON_MAPPED"
("mapped anonymous memory") and "NR_FILE_MAPPED"
("mapped file memory"):
As soon as any page of the folio is mapped -- folio_mapped() -- we
now account the complete folio as mapped. Once the last page is
unmapped -- !folio_mapped() -- we account the complete folio as
unmapped.
This implies that ...
* "AnonPages" and "Mapped" in /proc/meminfo and
/sys/devices/system/node/*/meminfo
* cgroup v2: "anon" and "file_mapped" in "memory.stat" and
"memory.numa_stat"
* cgroup v1: "rss" and "mapped_file" in "memory.stat" and
"memory.numa_stat
... can now appear higher than before. But note that these folios do
consume that memory, simply not all pages are actually currently
mapped.
It's worth nothing that other accounting in the kernel (esp. cgroup
charging on allocation) is not affected by this change.
[why oh why is "anon" called "rss" in cgroup v1]
(2) Detecting partial mappings
Detecting whether anon THPs are partially mapped gets a bit more
unreliable. As long as a single MM maps such a large folio
("exclusively mapped"), we can reliably detect it. Especially before
fork() / after a short-lived child process quit, we will detect
partial mappings reliably, which is the common case.
In essence, if the average per-page mapcount in an anon THP is < 1,
we know for sure that we have a partial mapping.
However, as soon as multiple MMs are involved, we might miss detecting
partial mappings: this might be relevant with long-lived child
processes. If we have a fully-mapped anon folio before fork(), once
our child processes and our parent all unmap (zap/COW) the same pages
(but not the complete folio), we might not detect the partial mapping.
However, once the child processes quit we would detect the partial
mapping.
How relevant this case is in practice remains to be seen.
Swapout/migration will likely mitigate this.
In the future, RMAP walkers could check for that for that case
(e.g., when collecting access bits during reclaim) and simply flag
them for deferred-splitting.
Link: https://lkml.kernel.org/r/20250303163014.1128035-21-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirks^H^Hski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: tejun heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's implement an alternative when per-page mapcounts in large folios are
no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
When computing the output for smaps / smaps_rollups, in particular when
calculating the USS (Unique Set Size) and the PSS (Proportional Set Size),
we still rely on per-page mapcounts.
To determine private vs. shared, we'll use folio_likely_mapped_shared(),
similar to how we handle PM_MMAP_EXCLUSIVE. Similarly, we might now
under-estimate the USS and count pages towards "shared" that are actually
"private" ("exclusively mapped").
When calculating the PSS, we'll now also use the average per-page mapcount
for large folios: this can result in both, an over-estimation and an
under-estimation of the PSS. The difference is not expected to matter
much in practice, but we'll have to learn as we go.
We can now provide folio_precise_page_mapcount() only with
CONFIG_PAGE_MAPCOUNT, and remove one of the last users of per-page
mapcounts when CONFIG_NO_PAGE_MAPCOUNT is enabled.
Document the new behavior.
Link: https://lkml.kernel.org/r/20250303163014.1128035-20-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirks^H^Hski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: tejun heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's implement an alternative when per-page mapcounts in large folios are
no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.
For calculating "mapmax", we now use the average per-page mapcount in a
large folio instead of the per-page mapcount.
For hugetlb folios and folios that are not partially mapped into MMs,
there is no change.
Likely, this change will not matter much in practice, and an alternative
might be to simple remove this stat with CONFIG_NO_PAGE_MAPCOUNT.
However, there might be value to it, so let's keep it like that and
document the behavior.
Link: https://lkml.kernel.org/r/20250303163014.1128035-19-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirks^H^Hski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: tejun heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The dcssblk driver has long needed special case supoprt to enable limited
dax operation, so called CONFIG_FS_DAX_LIMITED. This mode works around
the incomplete support for ZONE_DEVICE on s390 by forgoing the ability of
dax-mapped pages to support GUP.
Now, pending cleanups to fsdax that fix its reference counting [1] depend
on the ability of all dax drivers to supply ZONE_DEVICE pages.
To allow that work to move forward, dax support needs to be paused for
dcssblk until ZONE_DEVICE support arrives. That work has been known for a
few years [2], and the removal of "pte_devmap" requirements [3] makes the
conversion easier.
For now, place the support behind CONFIG_BROKEN, and remove PFN_SPECIAL
(dcssblk was the only user).
Link: http://lore.kernel.org/cover.9f0e45d52f5cff58807831b6b867084d0b14b61c.1725941415.git-series.apopple@nvidia.com [1]
Link: http://lore.kernel.org/20210820210318.187742e8@thinkpad/ [2]
Link: http://lore.kernel.org/4511465a4f8429f45e2ac70d2e65dc5e1df1eb47.1725941415.git-series.apopple@nvidia.com [3]
Link: https://lkml.kernel.org/r/33eef2379c0d240f40cc15453fad2df1a4ae34c8.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a paragraph explaining what sort of capabilities a process would need
to read procfs data for some other process. Also mention that reading
data for its own process doesn't require any extra permissions.
Link: https://lkml.kernel.org/r/20250129001747.759990-1-andrii@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Footnotes list are outputted in htmldocs simply as long-running
paragraph instead. Use reST numbered footnotes syntax for the job.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
SubmttingPatches.rst has 4 section headings, all under the same heading
levels. In absence of title headings, these section headings are all
ended up as title headings in the docs output, which also affect
the index toctree (increasing titles to 6 from the original 2)
due to :numbered: option.
Demote second-to-last section headings, making "Submitting patches
to bcachefs" as title heading.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs subsystem currently has 4 docs: two are development notes and
the rest are actual filesystem docs. These two groups are clearly
distinct and can be organized.
Split the toctree into two, one for each docs group. While at it, also
reduce :maxdepth: so that only title headings are listed in the
toctrees.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The doc lists dirent structure for both regular and casefolded names,
yet it is written (and rendered) as long paragraph instead.
Write the structure list as bullet list.
Fixes: bc5cc09246c5 ("bcachefs: bcachefs_metadata_version_casefolding")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Sphinx reports htmldocs warnings on dentry/dcache section:
Documentation/filesystems/bcachefs/casefolding.rst:75: WARNING: Title underline too short.
dentry/dcache considerations
--------- [docutils]
Documentation/filesystems/bcachefs/casefolding.rst:84: WARNING: Definition list ends without a blank line; unexpected unindent. [docutils]
Fix the section by:
* Extending the section underline to match the section title length;
* Separating problem list from surrounding paragraphs.
Fixes: bc5cc09246c5 ("bcachefs: bcachefs_metadata_version_casefolding")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250221161911.2d16138b@canb.auug.org.au/
Closes: https://lore.kernel.org/linux-next/20250221162135.79be0147@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Sphinx reports htmldocs warning:
Documentation/filesystems/bcachefs/casefolding.rst:36: WARNING: Inline interpreted text or phrase reference start-string without end-string. [docutils]
That's because NUL word is italicized but it is written in plural form
instead (`NUL`s). Sphinx, however, doesn't tip over when the italicized
word in this fashion is followed by punctuation instead.
Do not italicize the word to keep Sphinx happy.
Fixes: bc5cc09246c5 ("bcachefs: bcachefs_metadata_version_casefolding")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250221162135.79be0147@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch implements support for case-insensitive file name lookups
in bcachefs.
The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs is using.
More information is provided in Documentation/bcachefs/casefolding.rst
Compatibility notes:
This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.
Additionally, and old style incompat feature bit is used, so that
kernels without utf8 casefolding support know if casefolding
specifically is in use and they're allowed to mount.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We do not and cannot support file locking with NFS reexport over
NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport
server reboot cannot allow clients to recover locks because the source
NFS server has not rebooted, and so it is not in grace. Since the
source NFS server is not in grace, it cannot offer any guarantees that
the file won't have been changed between the locks getting lost and
any attempt to recover/reclaim them. The same applies to delegations
and any associated locks, so disallow them too.
Clients are no longer allowed to get file locks or delegations from a
reexport server, any attempts will fail with operation not supported.
Update the "Reboot recovery" section accordingly in
Documentation/filesystems/nfs/reexport.rst
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Introduce a new mount option "nat_bits" to control nat_bits feature,
by default nat_bits feature is disabled.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The function can be replaced by evict_inodes. The only difference is
that evict_inodes() skips the inodes with positive refcount without
touching ->i_lock, but they are equivalent as evict_inodes() repeats the
refcount check after having grabbed ->i_lock.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20250307144318.28120-2-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently atomic write support requires dedicated HW support. This imposes
a restriction on the filesystem that disk blocks need to be aligned and
contiguously mapped to FS blocks to issue atomic writes.
XFS has no method to guarantee FS block alignment for regular,
non-RT files. As such, atomic writes are currently limited to 1x FS block
there.
To deal with the scenario that we are issuing an atomic write over
misaligned or discontiguous data blocks - and raise the atomic write size
limit - support a SW-based software emulated atomic write mode. For XFS,
this SW-based atomic writes would use CoW support to issue emulated untorn
writes.
It is the responsibility of the FS to detect discontiguous atomic writes
and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
SW-based atomic writes could be used always when the mounted bdev does
not support HW offload, but this strategy is not initially expected to be
used.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250303171120.2837067-6-john.g.garry@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>