mirror-linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	18b19abc37	namespace-6.18-rc1 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQgQAKCRCRxhvAZXjc oiFXAQCpbLvkWbld9wLgxUBhq+q+kw5NvGxzpvqIhXwJB9F9YAEA44/Wevln4xGx +kRUbP+xlRQqenIYs2dLzVHzAwAdfQ4= =EO4Y -----END PGP SIGNATURE----- Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_() helpers net: port to ns_ref_() helpers uts: port to ns_ref_() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_() helpers time: port to ns_ref_() helpers pid: port to ns_ref_() helpers ipc: port to ns_ref_() helpers cgroup: port to ns_ref_() helpers ...	2025-09-29 11:20:29 -07:00
Linus Torvalds	722df25ddf	kernel-6.18-rc1.clone3 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl 8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY= =tUXG -----END PGP SIGNATURE----- Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)	2025-09-29 10:36:50 -07:00
Linus Torvalds	3a2a5b278f	vfs-6.18-rc1.mount -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQOwAKCRCRxhvAZXjc oth/AQDvlOo+23/f2djgDGS8akjkBYVLW14OYzC0q5cbEnnGgAEAycHL50pp3n1o 3jMYlCByuv507vpCsDupo7QcJapmQAk= =/NwE -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains some work around mount api handling: - Output the warning message for mnt_too_revealing() triggered during fsmount() to the fscontext log. This makes it possible for the mount tool to output appropriate warnings on the command line. For example, with the newest fsopen()-based mount(8) from util-linux, the error messages now look like: # mount -t proc proc /tmp mount: /tmp: fsmount() failed: VFS: Mount too revealing. dmesg(1) may have more information after failed mount system call. - Do not consume fscontext log entries when returning -EMSGSIZE Userspace generally expects APIs that return -EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the -EMSGSIZE case. Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead. - Drop an unused argument from do_remount()" * tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: fs/namespace.c: remove ms_flags argument from do_remount selftests/filesystems: add basic fscontext log tests fscontext: do not consume log entries when returning -EMSGSIZE vfs: output mount_too_revealing() errors to fscontext docs/vfs: Remove mentions to the old mount API helpers fscontext: add custom-prefix log helpers fs: Remove mount_bdev fs: Remove mount_nodev	2025-09-29 09:32:34 -07:00
Christian Brauner	6c7ca6a02f	mount: handle NULL values in mnt_ns_release() When calling in listmount() mnt_ns_release() may be passed a NULL pointer. Handle that case gracefully. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-09-29 09:08:18 -07:00
Linus Torvalds	b7ce6fa90f	vfs-6.18-rc1.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03 GbMcB2DCFLC4evqYECj6IG7NBmoKsAs= =1ICf -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...	2025-09-29 09:03:07 -07:00
Christian Brauner	c1f86d0ac3	listmount: don't call path_put() under namespace semaphore Massage listmount() and make sure we don't call path_put() under the namespace semaphore. If we put the last reference we're fscked. Fixes: `b4c2bea8ce` ("add listmount(2) syscall") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-26 10:20:29 +02:00
Christian Brauner	e8c84e2082	statmount: don't call path_put() under namespace semaphore Massage statmount() and make sure we don't call path_put() under the namespace semaphore. If we put the last reference we're fscked. Fixes: `46eae99ef7` ("add statmount(2) syscall") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-26 10:16:06 +02:00
Christian Brauner	4055526d35	ns: move ns type into struct ns_common It's misplaced in struct proc_ns_operations and ns->ops might be NULL if the namespace is compiled out but we still want to know the type of the namespace for the initial namespace struct. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-25 09:23:54 +02:00
Christian Brauner	d7610cb745	ns: simplify ns_common_init() further Simply derive the ns operations from the namespace type. Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-22 14:47:10 +02:00
Christian Brauner	7cf7303211	ns: use inode initializer for initial namespaces Just use the common helper we have. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:22:38 +02:00
Christian Brauner	024596a4e2	ns: rename to __ns_ref Make it easier to grep and rename to ns_count. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:22:38 +02:00
Christian Brauner	2e9e697227	mnt: port to ns_ref_*() helpers Stop accessing ns.count directly. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:22:36 +02:00
Christian Brauner	be5f21d398	ns: add ns_common_free() And drop ns_free_inum(). Anything common that can be wasted centrally should be wasted in the new common helper. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:22:36 +02:00
Christian Brauner	5612ff3ec5	nscommon: simplify initialization There's a lot of information that namespace implementers don't need to know about at all. Encapsulate this all in the initialization helper. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:19 +02:00
Christian Brauner	86cdbae5c6	mnt: simplify ns_common_init() handling Assign the reserved MNT_NS_ANON_INO sentinel to anonymous mount namespaces and cleanup the initial mount ns allocation. This is just a preparatory patch and the ns->inum check in ns_common_init() will be dropped in the next patch. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:18 +02:00
Christian Brauner	b2a0b19208	mnt: expose pointer to init_mnt_ns There's various scenarios where we need to know whether we are in the initial set of namespaces or not to e.g., shortcut permission checking. All namespaces expose that information. Let's do that too. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:18 +02:00
Christian Brauner	7d7d164989	mnt: support ns lookup Move the mount namespace to the generic ns lookup infrastructure. This allows us to drop a bunch of members from struct mnt_namespace. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:15 +02:00
Christian Brauner	7914f15c5e	Merge branch 'no-rebase-mnt_ns_tree_remove' Bring in the fix for removing a mount namespace from the mount namespace rbtree and list. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:14 +02:00
Christian Brauner	96ece8eb67	mnt: use ns_common_init() Don't cargo-cult the same thing over and over. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:13 +02:00
Al Viro	38f4885088	mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Actual removal is done under the lock, but for checking if need to bother the lockless RB_EMPTY_NODE() is safe - either that namespace had never been added to mnt_ns_tree, in which case the the node will stay empty, or whoever had allocated it has called mnt_ns_tree_add() and it has already run to completion. After that point RB_EMPTY_NODE() will become false and will remain false, no matter what we do with other nodes in the tree. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-16 00:33:37 -04:00
Simon Schuster	edd3cb05c0	copy_process: pass clone_flags as u64 across calltree With the introduction of clone3 in commit `7f192e3cd3` ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit, with a new type of u64 for the flags. However, for most consumers of clone_flags the interface was not changed from the previous type of unsigned long. While this works fine as long as none of the new 64-bit flag bits (CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still undesirable in terms of the principle of least surprise. Thus, this commit fixes all relevant interfaces of callees to sys_clone3/copy_process (excluding the architecture-specific copy_thread) to consistently pass clone_flags as u64, so that no truncation to 32-bit integers occurs on 32-bit architectures. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-01 15:31:34 +02:00
Guopeng Zhang	41a86f6242	fs: fix indentation style Replace 8 leading spaces with a tab to follow kernel coding style. Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Link: https://lore.kernel.org/20250820133424.1667467-1-zhangguopeng@kylinos.cn Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-21 10:27:05 +02:00
Lichen Liu	278033a225	fs: Add 'initramfs_options' to set initramfs mount options When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs. By default, a tmpfs mount is limited to using 50% of the available RAM for its content. This can be problematic in memory-constrained environments, particularly during a kdump capture. In a kdump scenario, the capture kernel boots with a limited amount of memory specified by the 'crashkernel' parameter. If the initramfs is large, it may fail to unpack into the tmpfs rootfs due to insufficient space. This is because to get X MB of usable space in tmpfs, 2*X MB of memory must be available for the mount. This leads to an OOM failure during the early boot process, preventing a successful crash dump. This patch introduces a new kernel command-line parameter, initramfs_options, which allows passing specific mount options directly to the rootfs when it is first mounted. This gives users control over the rootfs behavior. For example, a user can now specify initramfs_options=size=75% to allow the tmpfs to use up to 75% of the available memory. This can significantly reduce the memory pressure for kdump. Consider a practical example: To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With the default 50% limit, this requires a memory pool of 96MB to be available for the tmpfs mount. The total memory requirement is therefore approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB. By using initramfs_options=size=75%, the memory pool required for the 48MB tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a smaller crashkernel size, such as 192MB. An alternative approach of reusing the existing rootflags parameter was considered. However, a new, dedicated initramfs_options parameter was chosen to avoid altering the current behavior of rootflags (which applies to the final root filesystem) and to prevent any potential regressions. Also add documentation for the new kernel parameter "initramfs_options" This approach is inspired by prior discussions and patches on the topic. Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128 Ref: https://landley.net/notes-2015.html#01-01-2015 Ref: https://lkml.org/lkml/2021/6/29/783 Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs Signed-off-by: Lichen Liu <lichliu@redhat.com> Link: https://lore.kernel.org/20250815121459.3391223-1-lichliu@redhat.com Tested-by: Rob Landley <rob@landley.net> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-21 10:23:48 +02:00
Linus Torvalds	b19a97d57c	fixes for several recent mount-related regressions Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaKShewAKCRBZ7Krx/gZQ 6+p0AQD0vMecDWcRALFSInfGofuozEZ+98i3G0stlzU8KrvvpQD/a2jEfxOZOBph 0RL73MLt+eSl03tnsbqGRoHrcH9epQM= =3DGl -----END PGP SIGNATURE----- Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount fixes from Al Viro: "Fixes for several recent mount-related regressions" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: change_mnt_propagation(): calculate propagation source only if we'll need it use uniform permission checks for all mount propagation changes propagate_umount(): only surviving overmounts should be reparented fix the softlockups in attach_recursive_mnt()	2025-08-19 10:12:10 -07:00
Al Viro	cffd044187	use uniform permission checks for all mount propagation changes do_change_type() and do_set_group() are operating on different aspects of the same thing - propagation graph. The latter asks for mounts involved to be mounted in namespace(s) the caller has CAP_SYS_ADMIN for. The former is a mess - originally it didn't even check that mount is mounted. That got fixed, but the resulting check turns out to be too strict for userland - in effect, we check that mount is in our namespace, having already checked that we have CAP_SYS_ADMIN there. What we really need (in both cases) is * only touch mounts that are mounted. That's a must-have constraint - data corruption happens if it get violated. * don't allow to mess with a namespace unless you already have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns). That's an equivalent of what do_set_group() does; let's extract that into a helper (may_change_propagation()) and use it in both do_set_group() and do_change_type(). Fixes: `12f147ddd6` "do_change_type(): refuse to operate on unmounted/not ours mounts" Acked-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Tested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-08-19 12:03:23 -04:00
Al Viro	0ddfb62f5d	fix the softlockups in attach_recursive_mnt() In case when we mounting something on top of a large stack of overmounts, all of them being peers of each other, we get quadratic time by the depth of overmount stack. Easily fixed by doing commit_tree() before reparenting the overmount; simplifies commit_tree() as well - it doesn't need to skip the already mounted stuff that had been reparented on top of the new mounts. Since we are holding mount_lock through both reparenting and call of commit_tree(), the order does not matter from the mount hash point of view. Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: `663206854f` "copy_tree(): don't link the mounts via mnt_list" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-08-19 11:58:18 -04:00
Askar Safin	1e5f0fb41f	vfs: fs/namespace.c: remove ms_flags argument from do_remount ...because it is not used Signed-off-by: Askar Safin <safinaskar@zohomail.com> Link: https://lore.kernel.org/20250811045444.1813009-1-safinaskar@zohomail.com Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 16:08:31 +02:00
Yuntao Wang	593d9e4c3d	fs: fix incorrect lflags value in the move_mount syscall The lflags value used to look up from_path was overwritten by the one used to look up to_path. In other words, from_path was looked up with the wrong lflags value. Fix it. Fixes: `f9fde814de` ("fs: support getname_maybe_null() in move_mount()") Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev> Link: https://lore.kernel.org/20250811052426.129188-1-yuntao.wang@linux.dev [Christian Brauner <brauner@kernel.org>: massage patch] Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 16:05:53 +02:00
Aleksa Sarai	807602d8cf	vfs: output mount_too_revealing() errors to fscontext It makes little sense for fsmount() to output the warning message when mount_too_revealing() is violated to kmsg. Instead, the warning should be output (with a "VFS" prefix) to the fscontext log. In addition, include the same log message for mount_too_revealing() when doing a regular mount for consistency. With the newest fsopen()-based mount(8) from util-linux, the error messages now look like # mount -t proc proc /tmp mount: /tmp: fsmount() failed: VFS: Mount too revealing. dmesg(1) may have more information after failed mount system call. which could finally result in mount_too_revealing() errors being easier for users to detect and understand. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250806-errorfc-mount-too-revealing-v2-2-534b9b4d45bb@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:52:40 +02:00
Aleksa Sarai	9308366f06	open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE As described in commit `7a54947e72` ('Merge patch series "fs: allow changing idmappings"'), open_tree_attr(2) was necessary in order to allow for a detached mount to be created and have its idmappings changed without the risk of any racing threads operating on it. For this reason, mount_setattr(2) still does not allow for id-mappings to be changed. However, there was a bug in commit `2462651ffa` ("fs: allow changing idmappings") which allowed users to bypass this restriction by calling open_tree_attr(2) without OPEN_TREE_CLONE. can_idmap_mount() prevented this bug from allowing an attached mountpoint's id-mapping from being modified (thanks to an is_anon_ns() check), but this still allows for detached (but visible) mounts to have their be id-mapping changed. This risks the same UAF and locking issues as described in the merge commit, and was likely unintentional. Fixes: `2462651ffa` ("fs: allow changing idmappings") Cc: stable@vger.kernel.org # v6.15+ Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250808-open_tree_attr-bugfix-idmap-v1-1-0ec7bc05646c@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:51:49 +02:00
Linus Torvalds	f70d24c230	vfs-6.17-rc1.nsfs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCaAAKCRCRxhvAZXjc ouTCAQCrNCM3h9MpgcMDDUbi9b+lIR11JlvtWNGRUACv3RSNLgEA1vm30+u+JM87 KpVYg1RrkXDyMFXXuzy7UWpzLcBRywI= =L0eW -----END PGP SIGNATURE----- Merge tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains namespace updates. This time specifically for nsfs: - Userspace heavily relies on the root inode numbers for namespaces to identify the initial namespaces. That's already a hard dependency. So we cannot change that anymore. Move the initial inode numbers to a public header and align the only two namespaces that currently don't do that with all the other namespaces. - The root inode of /proc having a fixed inode number has been part of the core kernel ABI since its inception, and recently some userspace programs (mainly container runtimes) have started to explicitly depend on this behaviour. The main reason this is useful to userspace is that by checking that a suspect /proc handle has fstype PROC_SUPER_MAGIC and is PROCFS_ROOT_INO, they can then use openat2() together with RESOLVE_{NO_{XDEV,MAGICLINK},BENEATH} to ensure that there isn't a bind-mount that replaces some procfs file with a different one. This kind of attack has lead to security issues in container runtimes in the past (such as CVE-2019-19921) and libraries like libpathrs[1] use this feature of procfs to provide safe procfs handling functions" * tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: uapi: export PROCFS_ROOT_INO mntns: use stable inode number for initial mount ns netns: use stable inode number for initial mount ns nsfs: move root inode number to uapi	2025-07-28 12:50:56 -07:00
Linus Torvalds	7879d7aff0	vfs-6.17-rc1.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaIM/KwAKCRCRxhvAZXjc opT+AP407JwhRSBjUEmHg5JzUyDoivkOySdnthunRjaBKD8rlgEApM6SOIZYucU7 cPC3ZY6ORFM6Mwaw+iDW9lasM5ucHQ8= =CHha -----END PGP SIGNATURE----- Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc VFS updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add ext4 IOCB_DONTCACHE support This refactors the address_space_operations write_begin() and write_end() callbacks to take const struct kiocb * as their first argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate to the filesystem's buffered I/O path. Ext4 is updated to implement handling of the IOCB_DONTCACHE flag and advertises support via the FOP_DONTCACHE file operation flag. Additionally, the i915 driver's shmem write paths are updated to bypass the legacy write_begin/write_end interface in favor of directly calling write_iter() with a constructed synchronous kiocb. Another i915 change replaces a manual write loop with kernel_write() during GEM shmem object creation. Cleanups: - don't duplicate vfs_open() in kernel_file_open() - proc_fd_getattr(): don't bother with S_ISDIR() check - fs/ecryptfs: replace snprintf with sysfs_emit in show function - vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() - filelock: add new locks_wake_up_waiter() helper - fs: Remove three arguments from block_write_end() - VFS: change old_dir and new_dir in struct renamedata to dentrys - netfs: Remove unused declaration netfs_queue_write_request() Fixes: - eventpoll: Fix semi-unbounded recursion - eventpoll: fix sphinx documentation build warning - fs/read_write: Fix spelling typo - fs: annotate data race between poll_schedule_timeout() and pollwake() - fs/pipe: set FMODE_NOWAIT in create_pipe_files() - docs/vfs: update references to i_mutex to i_rwsem - fs/buffer: remove comment about hard sectorsize - fs/buffer: remove the min and max limit checks in __getblk_slow() - fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable - fs_context: fix parameter name in infofc() macro - fs: Prevent file descriptor table allocations exceeding INT_MAX" * tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits) netfs: Remove unused declaration netfs_queue_write_request() eventpoll: fix sphinx documentation build warning ext4: support uncached buffered I/O mm/pagemap: add write_begin_get_folio() helper function fs: change write_begin/write_end interface to take struct kiocb * drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter drm/i915: Use kernel_write() in shmem object create eventpoll: Fix semi-unbounded recursion vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable fs/buffer: remove the min and max limit checks in __getblk_slow() fs: Prevent file descriptor table allocations exceeding INT_MAX fs: Remove three arguments from block_write_end() fs/ecryptfs: replace snprintf with sysfs_emit in show function fs: annotate suspected data race between poll_schedule_timeout() and pollwake() docs/vfs: update references to i_mutex to i_rwsem fs/buffer: remove comment about hard sectorsize fs_context: fix parameter name in infofc() macro VFS: change old_dir and new_dir in struct renamedata to dentrys proc_fd_getattr(): don't bother with S_ISDIR() check ...	2025-07-28 11:22:56 -07:00
Al Viro	a7cce09945	statmount_mnt_basic(): simplify the logics for group id We are holding namespace_sem shared and we have not done any group id allocations since we grabbed it. Therefore IS_MNT_SHARED(m) is equivalent to non-zero m->mnt_group_id. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:46 -04:00
Al Viro	f6cc2f4e3d	invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED() All places where we call set_mnt_shared() are guaranteed to have non-zero ->mnt_group_id - either by explicit test, or by having done successful invent_group_ids() covering the same mount since we'd grabbed namespace_sem. The opposite combination (non-zero ->mnt_group_id and !IS_MNT_SHARED()) is possible - it means that we have allocated group id, but didn't get around to set_mnt_shared() yet; such state is transient - by the time we do namespace_unlock(), we must either do set_mnt_shared() or unroll the group id allocations by cleanup_group_ids(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:46 -04:00
Al Viro	725ab435ff	get rid of CL_SHARE_TO_SLAVE the only difference between it and CL_SLAVE is in this predicate in clone_mnt(): if ((flag & CL_SLAVE) \|\| ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) { However, in case of CL_SHARED_TO_SLAVE we have not allocated any mount group ids since the time we'd grabbed namespace_sem, so IS_MNT_SHARED() is equivalent to non-zero ->mnt_group_id. And in case of CL_SLAVE old has come either from the original tree, which had ->mnt_group_id allocated for all nodes or from result of sequence of CL_MAKE_SHARED or CL_MAKE_SHARED\|CL_SLAVE copies, ultimately going back to the original tree. In both cases we are guaranteed that old->mnt_group_id will be non-zero. In other words, the predicate is always equal to (flags & (CL_SLAVE \| CL_SHARED_TO_SLAVE)) && old->mnt_group_id and with that replacement CL_SLAVE and CL_SHARED_TO_SLAVE have exact same behaviour. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:46 -04:00
Al Viro	aab771f34e	take freeing of emptied mnt_namespace to namespace_unlock() Freeing of a namespace must be delayed until after we'd dealt with mount notifications (in namespace_unlock()). The reasons are not immediately obvious (they are buried in ->prev_ns handling in mnt_notify()), and having that free_mnt_ns() explicitly called after namespace_unlock() is asking for trouble - it does feel like they should be OK to free as soon as they've been emptied. Make the things more explicit by setting 'emptied_ns' under namespace_sem and having namespace_unlock() free the sucker as soon as it's safe to free. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:46 -04:00
Al Viro	663206854f	copy_tree(): don't link the mounts via mnt_list The only place that really needs to be adjusted is commit_tree() - there we need to iterate through the copy and we might as well use next_mnt() for that. However, in case when our tree has been slid under something already mounted (propagation to a mountpoint that already has something mounted on it or a 'beneath' move_mount) we need to take care not to walk into the overmounting tree. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:37 -04:00
Al Viro	8c5a853f58	mnt_slave_list/mnt_slave: turn into hlist_head/hlist_node Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:30 -04:00
Al Viro	406fea7999	mount: separate the flags accessed only under namespace_sem Several flags are updated and checked only under namespace_sem; we are already making use of that when we are checking them without mount_lock, but we have to hold mount_lock for all updates, which makes things clumsier than they have to be. Take MNT_SHARED, MNT_UNBINDABLE, MNT_MARKED and MNT_UMOUNT_CANDIDATE into a separate field (->mnt_t_flags), renaming them to T_SHARED, etc. to avoid confusion. All accesses must be under namespace_sem. That changes locking requirements for mnt_change_propagation() and set_mnt_shared() - only namespace_sem is needed now. The same goes for SET_MNT_MARKED et.al. There might be more flags moved from ->mnt_flags to that field; this is just the initial set. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:29 -04:00
Al Viro	493a4bebf5	don't have mounts pin their parents Simplify the rules for mount refcounts. Current rules include: * being a namespace root => +1 * being someone's child => +1 * being someone's child => +1 to parent's refcount, unless you've already been through umount_tree(). The last part is not needed at all. It makes for more places where need to decrement refcounts and it creates an asymmetry between the situations for something that has never been a part of a namespace and something that left one, both for no good reason. If mount's refcount has additions from its children, we know that * it's either someone's child itself (and will remain so until umount_tree(), at which point contributions from children will disappear), or * or is the root of namespace (and will remain such until it either becomes someone's child in another namespace or goes through umount_tree()), or * it is the root of some tree copy, and is currently pinned by the caller of copy_tree() (and remains such until it either gets into namespace, or goes to umount_tree()). In all cases we already have contribution(s) to refcount that will last as long as the contribution from children remains. In other words, the lifetime is not affected by refcount contributions from children. It might be useful for "is it busy" checks, but those are actually no harder to express without it. NB: propagate_mnt_busy() part is an equivalent transformation, ugly as it is; the current logics is actually wrong and may give false negatives, but fixing that is for a separate patch (probably earlier in the queue). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	d72c773237	get rid of mountpoint->m_count struct mountpoint has an odd kinda-sorta refcount in it. It's always either equal to or one above the number of mounts attached to that mountpoint. "One above" happens when a function takes a temporary reference to mountpoint. Things get simpler if we express that as inserting a local object into ->m_list and removing it to drop the reference. New calling conventions: 1) lock_mount(), do_lock_mount(), get_mountpoint() and lookup_mountpoint() take an extra struct pinned_mountpoint * argument and returns 0/-E... (or true/false in case of lookup_mountpoint()) instead of returning struct mountpoint pointers. In case of success, the struct mountpoint * we used to get can be found as pinned_mountpoint.mp 2) unlock_mount() (always paired with lock_mount()/do_lock_mount()) takes an address of struct pinned_mountpoint - the same that had been passed to lock_mount()/do_lock_mount(). 3) put_mountpoint() for a temporary reference (paired with get_mountpoint() or lookup_mountpoint()) is replaced with unpin_mountpoint(), which takes the address of pinned_mountpoint we passed to matching {get,lookup}_mountpoint(). 4) all instances of pinned_mountpoint are local variables; they always live on stack. {} is used for initializer, after successful {get,lookup}_mountpoint() we must make sure to call unpin_mountpoint() before leaving the scope and after successful {do_,}lock_mount() we must make sure to call unlock_mount() before leaving the scope. 5) all manipulations of ->m_count are gone, along with ->m_count itself. struct mountpoint lives while its ->m_list is non-empty. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	86f6398096	combine __put_mountpoint() with unhash_mnt() A call of unhash_mnt() is immediately followed by passing its return value to __put_mountpoint(); the shrink list given to __put_mountpoint() will be ex_mountpoints when called from umount_mnt() and list when called from mntput_no_expire(). Replace with __umount_mnt(mount, shrink_list), moving the call of __put_mountpoint() into it (and returning nothing), adjust the callers. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	e30da2a20e	pivot_root(): reorder tree surgeries, collapse unhash_mnt() and put_mountpoint() attach new_mnt before detaching root_mnt; that way we don't need to keep hold on the mountpoint and one more pair of unhash_mnt()/put_mountpoint() gets folded together into umount_mnt(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	ec3265a245	take ->mnt_expire handling under mount_lock [read_seqlock_excl] Doesn't take much massage, and we no longer need to make sure that by the time of final mntput() the victim has been removed from the list. Makes life safer for ->d_automount() instances... Rules: * all ->mnt_expire accesses are under mount_lock. * insertion into the list is done by mnt_set_expiry(), and caller (->d_automount() instance) must hold a reference to mount in question. It shouldn't be done more than once for a mount. * if a mount on an expiry list is not yet mounted, it will be ignored by anything that walks that list. * if the final mntput() finds its victim still on an expiry list (in which case it must've never been mounted - umount_tree() would've taken it out), it will remove the victim from the list. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	a8c764e1a5	attach_recursive_mnt(): remove from expiry list on move ... rather than doing that in do_move_mount(). That's the main obstacle to moving the protection of ->mnt_expire from namespace_sem to mount_lock (spinlock-only), which would simplify several failure exits. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	ee1ee33ccc	do_move_mount(): get rid of 'attached' flag 'attached' serves as a proxy for "source is a subtree of our namespace and not the entirety of anon namespace"; finish massaging it away. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	761de25854	do_move_mount(): take dropping the old mountpoint into attach_recursive_mnt() ... and fold it with unhash_mnt() there - there's no need to retain a reference to old_mp beyond that point, since by then all mountpoints we were going to add are either explicitly pinned by get_mountpoint() or have stuff already added to them. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	86b1da96c5	attach_recursive_mnt(): get rid of flags entirely move vs. attach is trivially detected as mnt_has_parent(source_mnt)... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	18959bf585	attach_recursive_mnt(): pass destination mount in all cases ... and 'beneath' is no longer used there Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:42 -04:00
Al Viro	96f5d2e051	attach_recursive_mnt(): unify the mnt_change_mountpoint() logics The logics used for tucking under existing mount differs for original and copies; copies do a mount hash lookup to see if mountpoint to be is already overmounted, while the original is told explicitly. But the same logics that is used for copies works for the original, at which point the only place where we get very close to eliminating the need of passing 'beneath' flag to attach_recursive_mnt(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00

1 2 3 4 5 ...

860 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)