mirror-linux

Commit Graph

Author	SHA1	Message	Date
Al Viro	7c6fb47b2b	make commit_tree() usable in same-namespace move case Once attach_recursive_mnt() has created all copies of original subtree, it needs to put them in place(s). Steps needed for those are slightly different: 1) in 'move' case, original copy doesn't need any rbtree manipulations (everything's already in the same namespace where it will be), but it needs to be detached from the current location 2) in 'attach' case, original may be in anon namespace; if it is, all those mounts need to removed from their current namespace before insertion into the target one 3) additional copies have a couple of extra twists - in case of cross-userns propagation we need to lock everything other the root of subtree and in case when we end up inserting under an existing mount, that mount needs to be found (for original copy we have it explicitly passed by the caller). Quite a bit of that can be unified; as the first step, make commit_tree() helper (inserting mounts into namespace, hashing the root of subtree and marking the namespace as updated) usable in all cases; (2) and (3) are already using it and for (1) we only need to make the insertion of mounts into namespace conditional. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	f0d0ba1998	Rewrite of propagate_umount() The variant currently in the tree has problems; trying to prove correctness has caught at least one class of bugs (reparenting that ends up moving the visible location of reparented mount, due to not excluding some of the counterparts on propagation that should've been included). I tried to prove that it's the only bug there; I'm still not sure whether it is. If anyone can reconstruct and write down an analysis of the mainline implementation, I'll gladly review it; as it is, I ended up doing a different implementation. Candidate collection phase is similar, but trimming the set down until it satisfies the constraints turned out pretty different. I hoped to do transformation as a massage series, but that turns out to be too convoluted. So it's a single patch replacing propagate_umount() and friends in one go, with notes and analysis in D/f/propagate_umount.txt (in addition to inline comments). As far I can tell, it is provably correct and provably linear by the number of mounts we need to look at in order to decide what should be unmounted. It even builds and seems to survive testing... Another nice thing that fell out of that is that ->mnt_umounting is no longer needed. Compared to the first version: * explicit MNT_UMOUNT_CANDIDATE flag for is_candidate() * trim_ancestors() only clears that flag, leaving the suckers on list * trim_one() and handle_locked() take the stuff with flag cleared off the list. That allows to iterate with list_for_each_entry_safe() when calling trim_one() - it removes at most one element from the list now. * no globals - I didn't bother with any kind of context, not worth it. * Notes updated accordingly; I have not touch the terms yet. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	24368a744b	sanitize handling of long-term internal mounts Original rationale for those had been the reduced cost of mntput() for the stuff that is mounted somewhere. Mount refcount increments and decrements are frequent; what's worse, they tend to concentrate on the same instances and cacheline pingpong is quite noticable. As the result, mount refcounts are per-cpu; that allows a very cheap increment. Plain decrement would be just as easy, but decrement-and-test is anything but (we need to add the components up, with exclusion against possible increment-from-zero, etc.). Fortunately, there is a very common case where we can tell that decrement won't be the final one - if the thing we are dropping is currently mounted somewhere. We have an RCU delay between the removal from mount tree and dropping the reference that used to pin it there, so we can just take rcu_read_lock() and check if the victim is mounted somewhere. If it is, we can go ahead and decrement without and further checks - the reference we are dropping is not the last one. If it isn't, we get all the fun with locking, carefully adding up components, etc., but the majority of refcount decrements end up taking the fast path. There is a major exception, though - pipes and sockets. Those live on the internal filesystems that are not going to be mounted anywhere. They are not going to be _un_mounted, of course, so having to take the slow path every time a pipe or socket gets closed is really obnoxious. Solution had been to mark them as long-lived ones - essentially faking "they are mounted somewhere" indicator. With minor modification that works even for ones that do eventually get dropped - all it takes is making sure we have an RCU delay between clearing the "mounted somewhere" indicator and dropping the reference. There are some additional twists (if you want to drop a dozen of such internal mounts, you'd be better off with clearing the indicator on all of them, doing an RCU delay once, then dropping the references), but in the basic form it had been * use kern_mount() if you want your internal mount to be a long-term one. * use kern_unmount() to undo that. Unfortunately, the things did rot a bit during the mount API reshuffling. In several cases we have lost the "fake the indicator" part; kern_unmount() on the unmount side remained (it doesn't warn if you use it on a mount without the indicator), but all benefits regaring mntput() cost had been lost. To get rid of that bitrot, let's add a new helper that would work with fs_context-based API: fc_mount_longterm(). It's a counterpart of fc_mount() that does, on success, mark its result as long-term. It must be paired with kern_unmount() or equivalents. Converted: 1) mqueue (it used to use kern_mount_data() and the umount side is still as it used to be) 2) hugetlbfs (used to use kern_mount_data(), internal mount is never unmounted in this one) 3) i915 gemfs (used to be kern_mount() + manual remount to set options, still uses kern_unmount() on umount side) 4) v3d gemfs (copied from i915) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	c93ff74ff1	do_umount(): simplify the "is it still mounted" checks Calls of do_umount() are always preceded by can_umount(), where we'd done a racy check for mount belonging to our namespace; if it wasn't, can_unmount() would've failed with -EINVAL and we wouldn't have reached do_umount() at all. That check needs to be redone once we have acquired namespace_sem and in do_umount() we do that. However, that's done in a very odd way; we check that mount is still in rbtree of _some_ namespace or its mnt_list is not empty. It is equivalent to check_mnt(mnt) - we know that earlier mnt was mounted in our namespace; if it has stayed there, it's going to remain in rbtree of our namespace. OTOH, if it ever had been removed from out namespace, it would be removed from rbtree and it never would've re-added to a namespace afterwards. As for ->mnt_list, for something that had been mounted in a namespace we'll never observe non-empty ->mnt_list while holding namespace_sem - it does temporarily become non-empty during umount_tree(), but that doesn't outlast the call of umount_tree(), let alone dropping namespace_sem. Things get much easier to follow if we replace that with (equivalent) check_mnt(mnt) there. What's more, currently we treat a failure of that test as "quietly do nothing"; we might as well pretend that we'd lost the race and fail on that the same way can_umount() would have. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	49acacdc7c	clone_mnt(): simplify the propagation-related logics The underlying rules are simple: * MNT_SHARED should be set iff ->mnt_group_id of new mount ends up non-zero. * mounts should be on the same ->mnt_share cyclic list iff they have the same non-zero ->mnt_group_id value. * CL_PRIVATE is mutually exclusive with MNT_SHARED, MNT_SLAVE, MNT_SHARED_TO_SLAVE and MNT_EXPIRE; the whole point of that thing is to get a clone of old mount that would not be on any namespace-related lists. The above allows to make the logics more straightforward; what's more, it makes the proof that invariants are maintained much simpler. The variant in mainline is safe (aside of a very narrow race with unsafe modification of mnt_flags right after we had the mount exposed in superblock's ->s_mounts; theoretically it can race with ro remount of the original, but it's not easy to hit), but proof of its correctness is really unpleasant. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	d08fa7f44a	don't set MNT_LOCKED on parentless mounts Originally MNT_LOCKED meant only one thing - "don't let this mount to be peeled off its parent, we don't want to have its mountpoint exposed". Accordingly, it had only been set on mounts that do have a parent. Later it got overloaded with another use - setting it on the absolute root had given free protection against umount(2) of absolute root (was possible to trigger, oopsed). Not a bad trick, but it ended up costing more than it bought us. Unfortunately, the cost included both hard-to-reason-about logics and a subtle race between mount -o remount,ro and mount --[r]bind - lockless &= ~MNT_LOCKED in the end of __do_loopback() could race with sb_prepare_remount_readonly() setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should be). The race wouldn't be much of a problem (there are other ways to deal with it), but the subtlety is. Turns out that nobody except umount(2) had ever made use of having MNT_LOCKED set on absolute root. So let's give up on that trick, clever as it had been, add an explicit check in do_umount() and return to using MNT_LOCKED only for mounts that have a parent. It means that * clone_mnt() no longer copies MNT_LOCKED * copy_tree() sets it on submounts if their counterparts had been marked such, and does that right next to attach_mnt() in there, in the same mount_lock scope. * __do_loopback() no longer needs to strip MNT_LOCKED off the root of subtree it's about to return; no store, no race. * init_mount_tree() doesn't bother setting MNT_LOCKED on absolute root. * lock_mnt_tree() does not set MNT_LOCKED on the subtree's root; accordingly, its caller (loop in attach_recursive_mnt()) does not need to bother stripping that MNT_LOCKED on root. Note that lock_mnt_tree() setting MNT_LOCKED on submounts happens in the same mount_lock scope as __attach_mnt() (from commit_tree()) that makes them reachable. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	1a867d729f	__attach_mnt(): lose the second argument It's always ->mnt_parent of the first one. What the function does is making a mount (with already set parent and mountpoint) visible - in mount hash and in the parent's list of children. IOW, it takes the existing rootwards linkage and sets the matching crownwards linkage. Renamed to make_visible(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	9ed4b9eaea	dissolve_on_fput(): use anon_ns_root() that's the condition we are actually trying to check there... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	05da054d43	new predicate: anon_ns_root(mount) checks if mount is the root of an anonymouns namespace. Switch open-coded equivalents to using it. For mounts that belong to anon namespace !mnt_has_parent(mount) is the same as mount == ns->root, and intent is more obvious in the latter form. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	e031251cb2	constify is_local_mountpoint() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	9cb79ed60e	new predicate: mount_is_ancestor() mount_is_ancestor(p1, p2) returns true iff there is a possibly empty ancestry chain from p1 to p2. Convert the open-coded checks. Unlike those open-coded variants it does not depend upon p1 not being root... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	cf53a2d423	copy_tree(): don't set ->mnt_mountpoint on the root of copy It never made any sense - neither when copy_tree() had been introduced (2.4.11-pre5), nor at any point afterwards. Mountpoint is meaningless without parent mount and the root of copied tree has no parent until we get around to attaching it somewhere. At that time we'll have mountpoint set; before that we have no idea which dentry will be used as mountpoint. IOW, copy_tree() should just leave the default value. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	ffdc52fbbd	prevent mount hash conflicts Currently it's still possible to run into a pathological situation when two hashed mounts share both parent and mountpoint. That does not work well, for obvious reasons. We are not far from getting rid of that; the only remaining gap is attach_recursive_mnt() not being careful enough when sliding a tree under existing mount (for propagated copies or in 'beneath' case for the original one). To deal with that cleanly we need to be able to find overmounts (i.e. mounts on top of parent's root); we could do hash lookups or scan the list of children but either would be costly. Since one of the results we get from that will be prevention of multiple parallel overmounts, let's just bite the bullet and store a (non-counting) reference to overmount in struct mount. With that done, closing the hole in attach_recursive_mnt() becomes easy - we just need to follow the chain of overmounts before we change the mountpoint of the mount we are sliding things under. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	431cc1d8e2	get rid of mnt_set_mountpoint_beneath() mnt_set_mountpoint_beneath() consists of attaching new mount side-by-side with the one we want to mount beneath (by mnt_set_mountpoint()), followed by mnt_change_mountpoint() shifting the the top mount onto the new one (by mnt_change_mountpoint()). Both callers of mnt_set_mountpoint_beneath (both in attach_recursive_mnt()) have the same form - in 'beneath' case we call mnt_set_mountpoint_beneath(), otherwise - mnt_set_mountpoint(). The thing is, expressing that as unconditional mnt_set_mountpoint(), followed, in 'beneath' case, by mnt_change_mountpoint() is just as easy. And these mnt_change_mountpoint() callers are similar to the ones we do when it comes to attaching propagated copies, which will allow more cleanups in the next commits. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	8c6ce8e86d	attach_mnt(): expand in attach_recursive_mnt(), then lose the flag argument simpler that way - all but one caller pass false as 'beneath' argument, and that one caller is actually happier with the call expanded - the logics with choice of mountpoint is identical for 'moving' and 'attaching' cases, and now that is no longer hidden. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 18:13:41 -04:00
Al Viro	0748e553df	userns and mnt_idmap leak in open_tree_attr(2) Once want_mount_setattr() has returned a positive, it does require finish_mount_kattr() to release ->mnt_userns. Failing do_mount_setattr() does not change that. As the result, we can end up leaking userns and possibly mnt_idmap as well. Fixes: `c4a16820d9` ("fs: add open_tree_attr()") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-24 10:25:04 -04:00
Al Viro	ce7df19686	attach_recursive_mnt(): do not lock the covering tree when sliding something under it If we are propagating across the userns boundary, we need to lock the mounts added there. However, in case when something has already been mounted there and we end up sliding a new tree under that, the stuff that had been there before should not get locked. IOW, lock_mnt_tree() should be called before we reparent the preexisting tree on top of what we are adding. Fixes: `3bd045cc9c` ("separate copying and locking mount tree on cross-userns copies") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-23 14:02:08 -04:00
Al Viro	7484e15dbb	replace collect_mounts()/drop_collected_mounts() with a safer variant collect_mounts() has several problems - one can't iterate over the results directly, so it has to be done with callback passed to iterate_mounts(); it has an oopsable race with d_invalidate(); it creates temporary clones of mounts invisibly for sync umount (IOW, you can have non-lazy umount succeed leaving filesystem not mounted anywhere and yet still busy). A saner approach is to give caller an array of struct path that would pin every mount in a subtree, without cloning any mounts. * collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone * collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or a pointer to array of struct path, one for each chunk of tree visible under 'where' (i.e. the first element is a copy of where, followed by (mount,root) for everything mounted under it - the same set collect_mounts() would give). Unlike collect_mounts(), the mounts are not cloned - we just get pinning references to the roots of subtrees in the caller's namespace. Array is terminated by {NULL, NULL} struct path. If it fits into preallocated array (on-stack, normally), that's where it goes; otherwise it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated' is ignored (and expected to be NULL). * drop_collected_paths(paths, preallocated) is given the array returned by an earlier call of collect_paths() and the preallocated array passed to that call. All mount/dentry references are dropped and array is kfree'd if it's not equal to 'preallocated'. * instead of iterate_mounts(), users should just iterate over array of struct path - nothing exotic is needed for that. Existing users (all in audit_tree.c) are converted. [folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>] Fixes: `80b5dce8c5` ("vfs: Add a function to lazily unmount all mounts from any dentry") Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-23 14:01:49 -04:00
Junxuan Liao	2773d282cd	docs/vfs: update references to i_mutex to i_rwsem VFS has switched to i_rwsem for ten years now (9902af79c01a: parallel lookups actual switch to rwsem), but the VFS documentation and comments still has references to i_mutex. Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu> Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-23 12:17:33 +02:00
Christian Brauner	7f4f229195	mntns: use stable inode number for initial mount ns Apart from the network and mount namespace all other namespaces expose a stable inode number and userspace has been relying on that for a very long time now. It's very much heavily used API. Align the mount namespace and use a stable inode number from the reserved procfs inode number space so this is consistent across all namespaces. Link: https://lore.kernel.org/20250606-work-nsfs-v1-3-b8749c9a8844@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-11 11:59:08 +02:00
Linus Torvalds	35b574a6c2	mount-related bugfixes this cycle regression (well, bugfix for this cycle bugfix for v6.15-rc1 regression) do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases selftests/mount_setattr: adapt detached mount propagation test v6.15 fs: allow clone_private_mount() for a path on real rootfs v6.11 fs/fhandle.c: fix a race in call of has_locked_children() v5.15 fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2) v5.15 clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns v5.7 path_overmount(): avoid false negatives v3.12 finish_automount(): don't leak MNT_LOCKED from parent to child v2.6.15 do_change_type(): refuse to operate on unmounted/not ours mounts Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaEXGDwAKCRBZ7Krx/gZQ 61WQAPwNBpcwum3F5fqT8rcKymqAUFpc0+rluJoBi+qfCQA9ywEAwn+Kh5qqtz++ cdVnUYQxBrh0u5IOzMEFITlgfYFJZA4= =BIeU -----END PGP SIGNATURE----- Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount fixes from Al Viro: "Various mount-related bugfixes: - split the do_move_mount() checks in subtree-of-our-ns and entire-anon cases and adapt detached mount propagation selftest for mount_setattr - allow clone_private_mount() for a path on real rootfs - fix a race in call of has_locked_children() - fix move_mount propagation graph breakage by MOVE_MOUNT_SET_GROUP - make sure clone_private_mnt() caller has CAP_SYS_ADMIN in the right userns - avoid false negatives in path_overmount() - don't leak MNT_LOCKED from parent to child in finish_automount() - do_change_type(): refuse to operate on unmounted/not ours mounts" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_change_type(): refuse to operate on unmounted/not ours mounts clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns selftests/mount_setattr: adapt detached mount propagation test do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases fs: allow clone_private_mount() for a path on real rootfs fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2) finish_automount(): don't leak MNT_LOCKED from parent to child path_overmount(): avoid false negatives fs/fhandle.c: fix a race in call of has_locked_children()	2025-06-08 10:35:12 -07:00
Al Viro	12f147ddd6	do_change_type(): refuse to operate on unmounted/not ours mounts Ensure that propagation settings can only be changed for mounts located in the caller's mount namespace. This change aligns permission checking with the rest of mount(2). Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: `07b20889e3` ("beginning of the shared-subtree proper") Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-07 01:37:56 -04:00
Al Viro	c28f922c9d	clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns What we want is to verify there is that clone won't expose something hidden by a mount we wouldn't be able to undo. "Wouldn't be able to undo" may be a result of MNT_LOCKED on a child, but it may also come from lacking admin rights in the userns of the namespace mount belongs to. clone_private_mnt() checks the former, but not the latter. There's a number of rather confusing CAP_SYS_ADMIN checks in various userns during the mount, especially with the new mount API; they serve different purposes and in case of clone_private_mnt() they usually, but not always end up covering the missing check mentioned above. Reviewed-by: Christian Brauner <brauner@kernel.org> Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com> Fixes: `427215d85e` ("ovl: prevent private clone if bind mount is not allowed") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-07 01:37:24 -04:00
Al Viro	290da20e33	do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases ... and fix the breakage in anon-to-anon case. There are two cases acceptable for do_move_mount() and mixing checks for those is making things hard to follow. One case is move of a subtree in caller's namespace. * source and destination must be in caller's namespace * source must be detachable from parent Another is moving the entire anon namespace elsewhere * source must be the root of anon namespace * target must either in caller's namespace or in a suitable anon namespace (see may_use_mount() for details). * target must not be in the same namespace as source. It's really easier to follow if tests are not mixed together... Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: `3b5260d12b` ("Don't propagate mounts into detached trees") Reported-by: Allison Karlitskaya <lis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-07 00:41:20 -04:00
KONDO KAZUMA(近藤　和真)	4954346d80	fs: allow clone_private_mount() for a path on real rootfs Mounting overlayfs with a directory on real rootfs (initramfs) as upperdir has failed with following message since commit `db04662e2f` ("fs: allow detached mounts in clone_private_mount()"). [ 4.080134] overlayfs: failed to clone upperpath Overlayfs mount uses clone_private_mount() to create internal mount for the underlying layers. The commit made clone_private_mount() reject real rootfs because it does not have a parent mount and is in the initial mount namespace, that is not an anonymous mount namespace. This issue can be fixed by modifying the permission check of clone_private_mount() following [1]. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: `db04662e2f` ("fs: allow detached mounts in clone_private_mount()") Link: https://lore.kernel.org/all/20250514190252.GQ2023217@ZenIV/ [1] Link: https://lore.kernel.org/all/20250506194849.GT2023217@ZenIV/ Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-07 00:41:02 -04:00
Al Viro	d8cc0362f9	fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2) `9ffb14ef61` "move_mount: allow to add a mount into an existing group" breaks assertions on ->mnt_share/->mnt_slave. For once, the data structures in question are actually documented. Documentation/filesystem/sharedsubtree.rst: All vfsmounts in a peer group have the same ->mnt_master. If it is non-NULL, they form a contiguous (ordered) segment of slave list. do_set_group() puts a mount into the same place in propagation graph as the old one. As the result, if old mount gets events from somewhere and is not a pure event sink, new one needs to be placed next to the old one in the slave list the old one's on. If it is a pure event sink, we only need to make sure the new one doesn't end up in the middle of some peer group. "move_mount: allow to add a mount into an existing group" ends up putting the new one in the beginning of list; that's definitely not going to be in the middle of anything, so that's fine for case when old is not marked shared. In case when old one _is_ marked shared (i.e. is not a pure event sink), that breaks the assumptions of propagation graph iterators. Put the new mount next to the old one on the list - that does the right thing in "old is marked shared" case and is just as correct as the current behaviour if old is not marked shared (kudos to Pavel for pointing that out - my original suggested fix changed behaviour in the "nor marked" case, which complicated things for no good reason). Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: `9ffb14ef61` ("move_mount: allow to add a mount into an existing group") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-07 00:40:35 -04:00
Al Viro	5f31c54938	path_overmount(): avoid false negatives Holding namespace_sem is enough to make sure that result remains valid. It is not enough to avoid false negatives from __lookup_mnt(). Mounts can be unhashed outside of namespace_sem (stuck children getting detached on final mntput() of lazy-umounted mount) and having an unrelated mount removed from the hash chain while we traverse it may end up with false negative from __lookup_mnt(). We need to sample and recheck the seqlock component of mount_lock... Bug predates the introduction of path_overmount() - it had come from the code in finish_automount() that got abstracted into that helper. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: `26df6034fd` ("fix automount/automount race properly") Fixes: `6ac3928156` ("fs: allow to mount beneath top mount") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-07 00:38:34 -04:00
Al Viro	1f282cdc1d	fs/fhandle.c: fix a race in call of has_locked_children() may_decode_fh() is calling has_locked_children() while holding no locks. That's an oopsable race... The rest of the callers are safe since they are holding namespace_sem and are guaranteed a positive refcount on the mount in question. Rename the current has_locked_children() to __has_locked_children(), make it static and switch the fs/namespace.c users to it. Make has_locked_children() a wrapper for __has_locked_children(), calling the latter under read_seqlock_excl(&mount_lock). Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: `620c266f39` ("fhandle: relax open_by_handle_at() permission checks") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-07 00:37:38 -04:00
Linus Torvalds	0f70f5b08a	automount wart removal Calling conventions of ->d_automount() made saner (flagday change) vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDoRWQAKCRBZ7Krx/gZQ 6wxMAQCzuMc2GiGBMXzeK4SGA7d5rsK71unf+zczOd8NvbTImQEAs1Cu3u3bF3pq EmHQWFTKBpBf+RHsLSoDHwUA+9THowM= =GXLi -----END PGP SIGNATURE----- Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull automount updates from Al Viro: "Automount wart removal A bunch of odd boilerplate gone from instances - the reason for those was the need to protect the yet-to-be-attched mount from mark_mounts_for_expiry() deciding to take it out. But that's easy to detect and take care of in mark_mounts_for_expiry() itself; no need to have every instance simulate mount being busy by grabbing an extra reference to it, with finish_automount() undoing that once it attaches that mount. Should've done it that way from the very beginning... This is a flagday change, thankfully there are very few instances. vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives" * tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill vfs_submount() saner calling conventions for ->d_automount()	2025-05-30 15:38:29 -07:00
Linus Torvalds	a82ba83991	6.15 behaviour wrt mount propagation into detached trees breaks userland. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDn/AQAKCRBZ7Krx/gZQ 6wAcAQCuiKUqmmrhssYEn/FsboM8eR36XzjlWUYOgzARXR0CIAEA6RmWIY3a/e0n DlbKbxmNkQM6xIUjml7nXYzoxaWX0A8= =9UbY -----END PGP SIGNATURE----- Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount propagation fix from Al Viro: "6.15 allowed mount propagation to destinations in detached trees; unfortunately, that breaks existing userland, so the old behaviour needs to be restored. It's not exactly a revert - the original behaviour had a bug, where existence of detached tree might disrupt propagation between locations not in detached trees. Thankfully, userland did not depend upon that bug, so we want to keep the fix" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Don't propagate mounts into detached trees	2025-05-30 15:04:11 -07:00
Al Viro	3b5260d12b	Don't propagate mounts into detached trees All versions up to 6.14 did not propagate mount events into detached tree. Shortly after 6.14 a merge of vfs-6.15-rc1.mount.namespace (`130e696aa6`) has changed that. Unfortunately, that has caused userland regressions (reported in https://lore.kernel.org/all/CAOYeF9WQhFDe+BGW=Dp5fK8oRy5AgZ6zokVyTj1Wp4EUiYgt4w@mail.gmail.com/) Straight revert wouldn't be an option - in particular, the variant in 6.14 had a bug that got fixed in `d1ddc6f1d9` ("fix IS_MNT_PROPAGATING uses") and we don't want to bring the bug back. This is a modification of manual revert posted by Christian, with changes needed to avoid reintroducing the breakage in scenario described in `d1ddc6f1d9`. Cc: stable@vger.kernel.org Reported-by: Allison Karlitskaya <lis@redhat.com> Tested-by: Allison Karlitskaya <lis@redhat.com> Acked-by: Christian Brauner <brauner@kernel.org> Co-developed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-05-26 17:35:32 -04:00
Linus Torvalds	2ca3534623	vfs-6.16-rc1.mount -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBoOgAKCRCRxhvAZXjc oibJAP90kOnGsm/7a/4Ul9da0nxcPr4NPKFMX3yd8Zegr9d1MgEAplL0xnhDp3SB YCcSdfKwUTQS5+lyfVt2ewzEEQnyZQI= =lgNp -----END PGP SIGNATURE----- Merge tag 'vfs-6.16-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains minor mount updates for this cycle: - mnt->mnt_devname can never be NULL so simplify the code handling that case - Add a comment about concurrent changes during statmount() and listmount() - Update the STATMOUNT_SUPPORTED macro - Convert mount flags to an enum" * tag 'vfs-6.16-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: statmount: update STATMOUNT_SUPPORTED macro fs: convert mount flags to enum ->mnt_devname is never NULL mount: add a comment about concurrent changes with statmount()/listmount()	2025-05-26 09:55:56 -07:00
Dmitry V. Levin	2b3c61b875	statmount: update STATMOUNT_SUPPORTED macro According to commit `8f6116b5b7` ("statmount: add a new supported_mask field"), STATMOUNT_SUPPORTED macro shall be updated whenever a new flag is added. Fixes: `7a54947e72` ("Merge patch series "fs: allow changing idmappings"") Signed-off-by: "Dmitry V. Levin" <ldv@strace.io> Link: https://lore.kernel.org/20250511224953.GA17849@strace.io Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-05-23 14:20:45 +02:00
Al Viro	7fc711739e	->mnt_devname is never NULL Not since `8f2918898e` "new helpers: vfs_create_mount(), fc_mount()" back in 2018. Get rid of the dead checks... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250421033509.GV2023217@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-05-23 14:20:44 +02:00
Christian Brauner	a68cb18624	mount: add a comment about concurrent changes with statmount()/listmount() Add some comments in there highlighting a few non-obvious assumptions. Link: https://lore.kernel.org/20250416-zerknirschen-aluminium-14a55639076f@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-05-23 14:20:44 +02:00
Al Viro	d1ddc6f1d9	fix IS_MNT_PROPAGATING uses propagate_mnt() does not attach anything to mounts created during propagate_mnt() itself. What's more, anything on ->mnt_slave_list of such new mount must also be new, so we don't need to even look there. When move_mount() had been introduced, we've got an additional class of mounts to skip - if we are moving from anon namespace, we do not want to propagate to mounts we are moving (i.e. all mounts in that anon namespace). Unfortunately, the part about "everything on their ->mnt_slave_list will also be ignorable" is not true - if we have propagation graph A -> B -> C and do OPEN_TREE_CLONE open_tree() of B, we get A -> [B <-> B'] -> C as propagation graph, where B' is a clone of B in our detached tree. Making B private will result in A -> B' -> C C still gets propagation from A, as it would after making B private if we hadn't done that open_tree(), but now the propagation goes through B'. Trying to move_mount() our detached tree on subdirectory in A should have * moved B' on that subdirectory in A * skipped the corresponding subdirectory in B' itself * copied B' on the corresponding subdirectory in C. As it is, the logics in propagation_next() and friends ends up skipping propagation into C, since it doesn't consider anything downstream of B'. IOW, walking the propagation graph should only skip the ->mnt_slave_list of new mounts; the only places where the check for "in that one anon namespace" are applicable are propagate_one() (where we should treat that as the same kind of thing as "mountpoint we are looking at is not visible in the mount we are looking at") and propagation_would_overmount(). The latter is better dealt with in the caller (can_move_mount_beneath()); on the first call of propagation_would_overmount() the test is always false, on the second it is always true in "move from anon namespace" case and always false in "move within our namespace" one, so it's easier to just use check_mnt() before bothering with the second call and be done with that. Fixes: `064fe6e233` ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-05-09 18:06:27 -04:00
Al Viro	267fc3a06a	do_move_mount(): don't leak MNTNS_PROPAGATING on failures as it is, a failed move_mount(2) from anon namespace breaks all further propagation into that namespace, including normal mounts in non-anon namespaces that would otherwise propagate there. Fixes: `064fe6e233` ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-05-09 18:06:10 -04:00
Al Viro	65781e19dc	do_umount(): add missing barrier before refcount checks in sync case do_umount() analogue of the race fixed in `119e1ef80e` "fix __legitimize_mnt()/mntput() race". Here we want to make sure that if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will notice their refcount increment. Harder to hit than mntput_no_expire() one, fortunately, and consequences are milder (sync umount acting like umount -l on a rare race with RCU pathwalk hitting at just the wrong time instead of use-after-free galore mntput_no_expire() counterpart used to be hit). Still a bug... Fixes: `48a066e72d` ("RCU'd vfsmounts") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-05-09 18:05:55 -04:00
Al Viro	250cf36930	__legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock ... or we risk stealing final mntput from sync umount - raising mnt_count after umount(2) has verified that victim is not busy, but before it has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see that it's safe to quietly undo mnt_count increment and leaves dropping the reference to caller, where it'll be a full-blown mntput(). Check under mount_lock is needed; leaving the current one done before taking that makes no sense - it's nowhere near common enough to bother with. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-05-09 18:02:12 -04:00
Al Viro	2dbf6e0df4	kill vfs_submount() The last remaining user of vfs_submount() (tracefs) is easy to convert to fs_context_for_submount(); do that and bury that thing, along with SB_SUBMOUNT Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-05-06 12:49:07 -04:00
Al Viro	006ff7498f	saner calling conventions for ->d_automount() Currently the calling conventions for ->d_automount() instances have an odd wart - returned new mount to be attached is expected to have refcount 2. That kludge is intended to make sure that mark_mounts_for_expiry() called before we get around to attaching that new mount to the tree won't decide to take it out. finish_automount() drops the extra reference after it's done with attaching mount to the tree - or drops the reference twice in case of error. ->d_automount() instances have rather counterintuitive boilerplate in them. There's a much simpler approach: have mark_mounts_for_expiry() skip the mounts that are yet to be mounted. And to hell with grabbing/dropping those extra references. Makes for simpler correctness analysis, at that... Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Acked-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-05-05 13:42:49 -04:00
Al Viro	0d039eac6e	fix a couple of races in MNT_TREE_BENEATH handling by do_move_mount() Normally do_lock_mount(path, _) is locking a mountpoint pinned by path and at the time when matching unlock_mount() unlocks that location it is still pinned by the same thing. Unfortunately, for 'beneath' case it's no longer that simple - the object being locked is not the one path points to. It's the mountpoint of path->mnt. The thing is, without sufficient locking ->mnt_parent may change under us and none of the locks are held at that point. The rules are * mount_lock stabilizes m->mnt_parent for any mount m. * namespace_sem stabilizes m->mnt_parent, provided that m is mounted. * if either of the above holds and refcount of m is positive, we are guaranteed the same for refcount of m->mnt_parent. namespace_sem nests inside inode_lock(), so do_lock_mount() has to take inode_lock() before grabbing namespace_sem. It does recheck that path->mnt is still mounted in the same place after getting namespace_sem, and it does take care to pin the dentry. It is needed, since otherwise we might end up with racing mount --move (or umount) happening while we were getting locks; in that case dentry would no longer be a mountpoint and could've been evicted on memory pressure along with its inode - not something you want when grabbing lock on that inode. However, pinning a dentry is not enough - the matching mount is also pinned only by the fact that path->mnt is mounted on top it and at that point we are not holding any locks whatsoever, so the same kind of races could end up with all references to that mount gone just as we are about to enter inode_lock(). If that happens, we are left with filesystem being shut down while we are holding a dentry reference on it; results are not pretty. What we need to do is grab both dentry and mount at the same time; that makes inode_lock() safe and avoids the problem with fs getting shut down under us. After taking namespace_sem we verify that path->mnt is still mounted (which stabilizes its ->mnt_parent) and check that it's still mounted at the same place. From that point on to the matching namespace_unlock() we are guaranteed that mount/dentry pair we'd grabbed are also pinned by being the mountpoint of path->mnt, so we can quietly drop both the dentry reference (as the current code does) and mnt one - it's OK to do under namespace_sem, since we are not dropping the final refs. That solves the problem on do_lock_mount() side; unlock_mount() also has one, since dentry is guaranteed to stay pinned only until the namespace_unlock(). That's easy to fix - just have inode_unlock() done earlier, while it's still pinned by mp->m_dentry. Fixes: `6ac3928156` "fs: allow to mount beneath top mount" # v6.5+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-23 08:06:22 +02:00
Jan Stancek	47a742fd97	fs: use namespace_{lock,unlock} in dissolve_on_fput() In commit `b73ec10a45` ("fs: add fastpath for dissolve_on_fput()"), the namespace_{lock,unlock} has been replaced with scoped_guard using the namespace_sem. This however now also skips processing of 'unmounted' list in namespace_unlock(), and mount is not (immediately) cleaned up. For example, this causes LTP move_mount02 fail: ... move_mount02.c:80: TPASS: invalid-from-fd: move_mount() failed as expected: EBADF (9) move_mount02.c:80: TPASS: invalid-from-path: move_mount() failed as expected: ENOENT (2) move_mount02.c:80: TPASS: invalid-to-fd: move_mount() failed as expected: EBADF (9) move_mount02.c:80: TPASS: invalid-to-path: move_mount() failed as expected: ENOENT (2) move_mount02.c:80: TPASS: invalid-flags: move_mount() failed as expected: EINVAL (22) tst_test.c:1833: TINFO: === Testing on ext3 === tst_test.c:1170: TINFO: Formatting /dev/loop0 with ext3 opts='' extra opts='' mke2fs 1.47.2 (1-Jan-2025) /dev/loop0 is apparently in use by the system; will not make a filesystem here! tst_test.c:1170: TBROK: mkfs.ext3 failed with exit code 1 The test makes number of move_mount() calls but these are all designed to fail with specific errno. Even after test, 'losetup -d' can't detach loop device. Define a new guard for dissolve_on_fput, that will use namespace_{lock,unlock}. Fixes: `b73ec10a45` ("fs: add fastpath for dissolve_on_fput()") Signed-off-by: Jan Stancek <jstancek@redhat.com> Link: https://lore.kernel.org/cad2f042b886bf0ced3d8e3aff120ec5e0125d61.1744297468.git.jstancek@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-11 16:07:50 +02:00
Christian Brauner	d43dbf7322	mount: ensure we don't pointlessly walk the mount tree This logic got broken recently. Add it back. Fixes: `474f7825d5` ("fs: add copy_mount_setattr() helper") Link: https://lore.kernel.org/20250409-sektflaschen-gecko-27c021fbd222@brauner Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-11 15:24:29 +02:00
Christian Brauner	c0dbd11ada	fs: actually hold the namespace semaphore Don't use a scoped guard that only protects the next statement. Use a regular guard to make sure that the namespace semaphore is held across the whole function. Signed-off-by: Christian Brauner <brauner@kernel.org> Reported-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/all/20250401170715.GA112019@unreal/ Fixes: `db04662e2f` ("fs: allow detached mounts in clone_private_mount()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-04-03 15:45:35 -07:00
Gustavo A. R. Silva	9e6901f17a	fs: namespace: Avoid -Wflex-array-member-not-at-end warning -Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Move the conflicting declaration to the end of the structure. Notice that `struct statmount` is a flexible structure --a structure that contains a flexible-array member. Fix the following warning: fs/namespace.c:5329:26: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/Z-SZKNdCiAkVJvqm@kspp Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-28 10:18:34 +01:00
Linus Torvalds	130e696aa6	vfs-6.15-rc1.mount.namespace -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90r2wAKCRCRxhvAZXjc ouC6AQCk3MoqskN0WeNcaZT23dB7dHbEhf/7YXOFC9MFRMKXqQD9Fbn95+GuIe3U nBVPbVyQfDtfXE08ml6gbDJrCsbkkQI= =Xm1C -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount namespace updates from Christian Brauner: "This expands the ability of anonymous mount namespaces: - Creating detached mounts from detached mounts Currently, detached mounts can only be created from attached mounts. This limitaton prevents various use-cases. For example, the ability to mount a subdirectory without ever having to make the whole filesystem visible first. The current permission modelis: (1) Check that the caller is privileged over the owning user namespace of it's current mount namespace. (2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of. While it is not strictly necessary to do it this way it is consistently applied in the new mount api. This model will also be used when allowing the creation of detached mount from another detached mount. The (1) requirement can simply be met by performing the same check as for the non-detached case, i.e., verify that the caller is privileged over its current mount namespace. To meet the (2) requirement it must be possible to infer the origin mount namespace that the anonymous mount namespace of the detached mount was created from. The origin mount namespace of an anonymous mount is the mount namespace that the mounts that were copied into the anonymous mount namespace originate from. In order to check the origin mount namespace of an anonymous mount namespace the sequence number of the original mount namespace is recorded in the anonymous mount namespace. With this in place it is possible to perform an equivalent check (2') to (2). The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared. The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number. The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace. If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well. Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to copy the mount tree. - Allow mount detached mounts on detached mounts Currently, detached mounts can only be mounted onto attached mounts. This limitation makes it impossible to assemble a new private rootfs and move it into place. Instead, a detached tree must be created, attached, then mounted open and then either moved or detached again. Lift this restriction. In order to allow mounting detached mounts onto other detached mounts the same permission model used for creating detached mounts from detached mounts can be used (cf. above). Allowing to mount detached mounts onto detached mounts leaves three cases to consider: (1) The source mount is an attached mount and the target mount is a detached mount. This would be equivalent to moving a mount between different mount namespaces. A caller could move an attached mount to a detached mount. The detached mount can now be freely attached to any mount namespace. This changes the current delegatioh model significantly for no good reason. So this will fail. (2) Anonymous mount namespaces are always attached fully, i.e., it is not possible to only attach a subtree of an anoymous mount namespace. This simplifies the implementation and reasoning. Consequently, if the anonymous mount namespace of the source detached mount and the target detached mount are the identical the mount request will fail. (3) The source mount's anonymous mount namespace is different from the target mount's anonymous mount namespace. In this case the source anonymous mount namespace of the source mount tree must be freed after its mounts have been moved to the target anonymous mount namespace. The source anonymous mount namespace must be empty afterwards. By allowing to mount detached mounts onto detached mounts a caller may do the following: fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE) fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE) fd_tree1 and fd_tree2 refer to two different detached mount trees that belong to two different anonymous mount namespace. It is important to note that fd_tree1 and fd_tree2 both refer to the root of their respective anonymous mount namespaces. By allowing to mount detached mounts onto detached mounts the caller may now do: move_mount(fd_tree1, "", fd_tree2, "", MOVE_MOUNT_F_EMPTY_PATH \| MOVE_MOUNT_T_EMPTY_PATH) This will cause the detached mount referred to by fd_tree1 to be mounted on top of the detached mount referred to by fd_tree2. Thus, the detached mount fd_tree1 is moved from its separate anonymous mount namespace into fd_tree2's anonymous mount namespace. It also means that while fd_tree2 continues to refer to the root of its respective anonymous mount namespace fd_tree1 doesn't anymore. This has the consequence that only fd_tree2 can be moved to another anonymous or non-anonymous mount namespace. Moving fd_tree1 will now fail as fd_tree1 doesn't refer to the root of an anoymous mount namespace anymore. Now fd_tree1 and fd_tree2 refer to separate detached mount trees referring to the same anonymous mount namespace. This is conceptually fine. The new mount api does allow for this to happen already via: mount -t tmpfs tmpfs /mnt mkdir -p /mnt/A mount -t tmpfs tmpfs /mnt/A fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE \| AT_RECURSIVE) fd_tree4 = open_tree(-EBADF, "/mnt/A", 0) Both fd_tree3 and fd_tree4 refer to two different detached mount trees but both detached mount trees refer to the same anonymous mount namespace. An as with fd_tree1 and fd_tree2, only fd_tree3 may be moved another mount namespace as fd_tree3 refers to the root of the anonymous mount namespace just while fd_tree4 doesn't. However, there's an important difference between the fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example. Closing fd_tree4 and releasing the respective struct file will have no further effect on fd_tree3's detached mount tree. However, closing fd_tree3 will cause the mount tree and the respective anonymous mount namespace to be destroyed causing the detached mount tree of fd_tree4 to be invalid for further mounting. By allowing to mount detached mounts on detached mounts as in the fd_tree1/fd_tree2 example both struct files will affect each other. Both fd_tree1 and fd_tree2 refer to struct files that have FMODE_NEED_UNMOUNT set. To handle this we use the fact that @fd_tree1 will have a parent mount once it has been attached to @fd_tree2. When dissolve_on_fput() is called the mount that has been passed in will refer to the root of the anonymous mount namespace. If it doesn't it would mean that mounts are leaked. So before allowing to mount detached mounts onto detached mounts this would be a bug. Now that detached mounts can be mounted onto detached mounts it just means that the mount has been attached to another anonymous mount namespace and thus dissolve_on_fput() must not unmount the mount tree or free the anonymous mount namespace as the file referring to the root of the namespace hasn't been closed yet. If it had been closed yet it would be obvious because the mount namespace would be NULL, i.e., the @fd_tree1 would have already been unmounted. If @fd_tree1 hasn't been unmounted yet and has a parent mount it is safe to skip any cleanup as closing @fd_tree2 will take care of all cleanup operations. - Allow mount propagation for detached mount trees In commit `ee2e3f5062` ("mount: fix mounting of detached mounts onto targets that reside on shared mounts") I fixed a bug where propagating the source mount tree of an anonymous mount namespace into a target mount tree of a non-anonymous mount namespace could be used to trigger an integer overflow in the non-anonymous mount namespace causing any new mounts to fail. The cause of this was that the propagation algorithm was unable to recognize mounts from the source mount tree that were already propagated into the target mount tree and then reappeared as propagation targets when walking the destination propagation mount tree. When fixing this I disabled mount propagation into anonymous mount namespaces. Make it possible for anonymous mount namespace to receive mount propagation events correctly. This is now also a correctness issue now that we allow mounting detached mount trees onto detached mount trees. Mark the source anonymous mount namespace with MNTNS_PROPAGATING indicating that all mounts belonging to this mount namespace are currently in the process of being propagated and make the propagation algorithm discard those if they appear as propagation targets" * tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) selftests: test subdirectory mounting selftests: add test for detached mount tree propagation fs: namespace: fix uninitialized variable use mount: handle mount propagation for detached mount trees fs: allow creating detached mounts from fsmount() file descriptors selftests: seventh test for mounting detached mounts onto detached mounts selftests: sixth test for mounting detached mounts onto detached mounts selftests: fifth test for mounting detached mounts onto detached mounts selftests: fourth test for mounting detached mounts onto detached mounts selftests: third test for mounting detached mounts onto detached mounts selftests: second test for mounting detached mounts onto detached mounts selftests: first test for mounting detached mounts onto detached mounts fs: mount detached mounts onto detached mounts fs: support getname_maybe_null() in move_mount() selftests: create detached mounts from detached mounts fs: create detached mounts from detached mounts fs: add may_copy_tree() fs: add fastpath for dissolve_on_fput() fs: add assert for move_mount() fs: add mnt_ns_empty() helper ...	2025-03-24 11:41:41 -07:00
Linus Torvalds	fd101da676	vfs-6.15-rc1.mount -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90qAwAKCRCRxhvAZXjc on7lAP0akpIsJMWREg9tLwTNTySI1b82uKec0EAgM6T7n/PYhAD/T4zoY8UYU0Pr qCxwTXHUVT6bkNhjREBkfqq9OkPP8w8= =GxeN -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - Mount notifications The day has come where we finally provide a new api to listen for mount topology changes outside of /proc/<pid>/mountinfo. A mount namespace file descriptor can be supplied and registered with fanotify to listen for mount topology changes. Currently notifications for mount, umount and moving mounts are generated. The generated notification record contains the unique mount id of the mount. The listmount() and statmount() api can be used to query detailed information about the mount using the received unique mount id. This allows userspace to figure out exactly how the mount topology changed without having to generating diffs of /proc/<pid>/mountinfo in userspace. - Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api - Support detached mounts in overlayfs Since last cycle we support specifying overlayfs layers via file descriptors. However, we don't allow detached mounts which means userspace cannot user file descriptors received via open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach them to a mount namespace via move_mount() first. This is cumbersome and means they have to undo mounts via umount(). Allow them to directly use detached mounts. - Allow to retrieve idmappings with statmount Currently it isn't possible to figure out what idmapping has been attached to an idmapped mount. Add an extension to statmount() which allows to read the idmapping from the mount. - Allow creating idmapped mounts from mounts that are already idmapped So far it isn't possible to allow the creation of idmapped mounts from already idmapped mounts as this has significant lifetime implications. Make the creation of idmapped mounts atomic by allow to pass struct mount_attr together with the open_tree_attr() system call allowing to solve these issues without complicating VFS lookup in any way. The system call has in general the benefit that creating a detached mount and applying mount attributes to it becomes an atomic operation for userspace. - Add a way to query statmount() for supported options Allow userspace to query which mount information can be retrieved through statmount(). - Allow superblock owners to force unmount * tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) umount: Allow superblock owners to force umount selftests: add tests for mount notification selinux: add FILE__WATCH_MOUNTNS samples/vfs: fix printf format string for size_t fs: allow changing idmappings fs: add kflags member to struct mount_kattr fs: add open_tree_attr() fs: add copy_mount_setattr() helper fs: add vfs_open_tree() helper statmount: add a new supported_mask field samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP selftests: add tests for using detached mount with overlayfs samples/vfs: check whether flag was raised statmount: allow to retrieve idmappings uidgid: add map_id_range_up() fs: allow detached mounts in clone_private_mount() selftests/overlayfs: test specifying layers as O_PATH file descriptors fs: support O_PATH fds with FSCONFIG_SET_FD vfs: add notifications for mount attach and detach fanotify: notify on mount attach and detach ...	2025-03-24 09:34:10 -07:00
Trond Myklebust	e1ff7aa34d	umount: Allow superblock owners to force umount Loosen the permission check on forced umount to allow users holding CAP_SYS_ADMIN privileges in namespaces that are privileged with respect to the userns that originally mounted the filesystem. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/r/12f212d4ef983714d065a6bb372fbb378753bf4c.1742315194.git.trond.myklebust@hammerspace.com Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-19 09:19:04 +01:00
Christian Brauner	064fe6e233	mount: handle mount propagation for detached mount trees In commit `ee2e3f5062` ("mount: fix mounting of detached mounts onto targets that reside on shared mounts") I fixed a bug where propagating the source mount tree of an anonymous mount namespace into a target mount tree of a non-anonymous mount namespace could be used to trigger an integer overflow in the non-anonymous mount namespace causing any new mounts to fail. The cause of this was that the propagation algorithm was unable to recognize mounts from the source mount tree that were already propagated into the target mount tree and then reappeared as propagation targets when walking the destination propagation mount tree. When fixing this I disabled mount propagation into anonymous mount namespaces. Make it possible for anonymous mount namespace to receive mount propagation events correctly. This is no also a correctness issue now that we allow mounting detached mount trees onto detached mount trees. Mark the source anonymous mount namespace with MNTNS_PROPAGATING indicating that all mounts belonging to this mount namespace are currently in the process of being propagated and make the propagation algorithm discard those if they appear as propagation targets. Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-1-e6e3724500eb@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:54 +01:00
Arnd Bergmann	99b6a1dee0	fs: namespace: fix uninitialized variable use clang correctly notices that the 'uflags' variable initialization only happens in some cases: fs/namespace.c:4622:6: error: variable 'uflags' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] 4622 \| if (flags & MOVE_MOUNT_F_EMPTY_PATH) uflags = AT_EMPTY_PATH; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/namespace.c:4623:48: note: uninitialized use occurs here 4623 \| from_name = getname_maybe_null(from_pathname, uflags); \| ^~~~~~ fs/namespace.c:4622:2: note: remove the 'if' if its condition is always true 4622 \| if (flags & MOVE_MOUNT_F_EMPTY_PATH) uflags = AT_EMPTY_PATH; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: b1e9423d65e3 ("fs: support getname_maybe_null() in move_mount()") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20250226081201.1876195-1-arnd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:54 +01:00
Christian Brauner	f8b6cd66e4	fs: allow creating detached mounts from fsmount() file descriptors The previous patch series only enabled the creation of detached mounts from detached mounts that were created via open_tree(). In such cases we know that the origin sequence number for the newly created anonymous mount namespace will be set to the sequence number of the mount namespace the source mount belonged to. But fsmount() creates an anonymous mount namespace that does not have an origin mount namespace as the anonymous mount namespace was derived from a filesystem context created via fsopen(). Account for this case and allow the creation of detached mounts from mounts created via fsmount(). Consequently, any such detached mount created from an fsmount() mount will also have a zero origin sequence number. This allows to mount subdirectories without ever having to expose the filesystem to a a non-anonymous mount namespace: fd_context = sys_fsopen("tmpfs", 0); sys_fsconfig(fd_context, FSCONFIG_CMD_CREATE, NULL, NULL, 0); fd_tmpfs = sys_fsmount(fd_context, 0, 0); mkdirat(fd_tmpfs, "subdir", 0755); fd_tree = sys_open_tree(fd_tmpfs, "subdir", OPEN_TREE_CLONE); sys_move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:54 +01:00
Christian Brauner	2110772383	fs: mount detached mounts onto detached mounts Currently, detached mounts can only be mounted onto attached mounts. This limitation makes it impossible to assemble a new private rootfs and move it into place. That's an extremely powerful concept for container and service workloads that we should support. Right now, a detached tree must be created, attached, then it can gain additional mounts and then it can either be moved (if it doesn't reside under a shared mount) or a detached mount created again. Lift this restriction. In order to allow mounting detached mounts onto other detached mounts the same permission model used for creating detached mounts from detached mounts can be used: (1) Check that the caller is privileged over the owning user namespace of it's current mount namespace. (2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of. The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared. The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number. The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace. If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well. Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to attach the mount tree. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-9-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:53 +01:00
Christian Brauner	f9fde814de	fs: support getname_maybe_null() in move_mount() Allow move_mount() to work with NULL path arguments. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-8-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:53 +01:00
Christian Brauner	c5c12f871a	fs: create detached mounts from detached mounts Add the ability to create detached mounts from detached mounts. Currently, detached mounts can only be created from attached mounts. This limitaton prevents various use-cases. For example, the ability to mount a subdirectory without ever having to make the whole filesystem visible first. The current permission model for the OPEN_TREE_CLONE flag of the open_tree() system call is: (1) Check that the caller is privileged over the owning user namespace of it's current mount namespace. (2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of. While it is not strictly necessary to do it this way it is consistently applied in the new mount api. This model will also be used when allowing the creation of detached mount from another detached mount. The (1) requirement can simply be met by performing the same check as for the non-detached case, i.e., verify that the caller is privileged over its current mount namespace. To meet the (2) requirement it must be possible to infer the origin mount namespace that the anonymous mount namespace of the detached mount was created from. The origin mount namespace of an anonymous mount is the mount namespace that the mounts that were copied into the anonymous mount namespace originate from. The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared. The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number. The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace. If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well. Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to copy the mount tree. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-6-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:52 +01:00
Christian Brauner	9ed72af4e2	fs: add may_copy_tree() Add a helper that verifies whether a caller may copy a given mount tree. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-5-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:52 +01:00
Christian Brauner	b73ec10a45	fs: add fastpath for dissolve_on_fput() Instead of acquiring the namespace semaphore and the mount lock everytime we close a file with FMODE_NEED_UNMOUNT set add a fastpath that checks whether we need to at all. Most of the time the caller will have attached the mount to the filesystem hierarchy and there's nothing to do. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-4-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:52 +01:00
Christian Brauner	043bc81efb	fs: add assert for move_mount() After we've attached a detached mount tree the anonymous mount namespace must be empty. Add an assert and make this assumption explicit. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-3-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:52 +01:00
Christian Brauner	2f576220cd	fs: add mnt_ns_empty() helper Add a helper that checks whether a give mount namespace is empty instead of open-coding the specific data structure check. This also be will be used in follow-up patches. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-2-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:52 +01:00
Christian Brauner	3b0cdba4da	fs: record sequence number of origin mount namespace Store the sequence number of the mount namespace the anonymous mount namespace has been created from. This information will be used in follow-up patches. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-1-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-04 09:29:19 +01:00
Christian Brauner	7a54947e72	Merge patch series "fs: allow changing idmappings" Christian Brauner <brauner@kernel.org> says: Currently, it isn't possible to change the idmapping of an idmapped mount. This is becoming an obstacle for various use-cases. /* idmapped home directories with systemd-homed / On newer systems /home is can be an idmapped mount such that each file on disk is owned by 65536 and a subfolder exists for foreign id ranges such as containers. For example, a home directory might look like this (using an arbitrary folder as an example): user1@localhost:~/data/mount-idmapped$ ls -al /data/ total 16 drwxrwxrwx 1 65536 65536 36 Jan 27 12:15 . drwxrwxr-x 1 root root 184 Jan 27 12:06 .. -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 aaa -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 bbb -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 cc drwxr-xr-x 1 2147352576 2147352576 0 Jan 27 19:06 containers When logging in home is mounted as an idmapped mount with the following idmappings: 65536:$(id -u):1 // uid mapping 65536:$(id -g):1 // gid mapping 2147352576:2147352576:65536 // uid mapping 2147352576:2147352576:65536 // gid mapping So for a user with uid/gid 1000 an idmapped /home would like like this: user1@localhost:~/data/mount-idmapped$ ls -aln /mnt/ total 16 drwxrwxrwx 1 1000 1000 36 Jan 27 12:15 . drwxrwxr-x 1 0 0 184 Jan 27 12:06 .. -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 aaa -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 bbb -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 cc drwxr-xr-x 1 2147352576 2147352576 0 Jan 27 19:06 containers In other words, 65536 is mapped to the user's uid/gid and the range 2147352576 up to 2147352576 + 65536 is an identity mapping for containers. When a container is started a transient uid/gid range is allocated outside of both mappings of the idmapped mount. For example, the container might get the idmapping: $ cat /proc/1742611/uid_map 0 537985024 65536 This container will be allowed to write to disk within the allocated foreign id range 2147352576 to 2147352576 + 65536. To do this an idmapped mount must be created from an already idmapped mount such that: - The mappings for the user's uid/gid must be dropped, i.e., the following mappings are removed: 65536:$(id -u):1 // uid mapping 65536:$(id -g):1 // gid mapping - A mapping for the transient uid/gid range to the foreign uid/gid range is added: 2147352576:537985024:65536 In combination this will mean that the container will write to disk within the foreign id range 2147352576 to 2147352576 + 65536. / nested containers / When the outer container makes use of idmapped mounts it isn't posssible to create an idmapped mount for the inner container with a differen idmapping from the outer container's idmapped mount. There are other usecases and the two above just serve as an illustration of the problem. This patchset makes it possible to create a new idmapped mount from an already idmapped mount. It aims to adhere to current performance constraints and requirements: - Idmapped mounts aim to have near zero performance implications for path lookup. That is why no refernce counting, locking or any other mechanism can be required that would impact performance. This works be ensuring that a regular mount transitions to an idmapped mount once going from a static nop_mnt_idmap mapping to a non-static idmapping. - The idmapping of a mount change anymore for the lifetime of the mount afterwards. This not just avoids UAF issues it also avoids pitfalls such as generating non-matching uid/gid values. Changing idmappings could be solved by: - Idmappings could simply be reference counted (above the simple reference count when sharing them across multiple mounts). This would require pairing mnt_idmap_get() with mnt_idmap_put() which would end up being sprinkled everywhere into the VFS and some filesystems that access idmappings directly. It wouldn't just be quite ugly and introduce new complexity it would have a noticeable performance impact. - Idmappings could gain RCU protection. This would help the LOOKUP_RCU case and avoids taking reference counts under RCU. When not under LOOKUP_RCU reference counts need to be acquired on each idmapping. This would require pairing mnt_idmap_get() with mnt_idmap_put() which would end up being sprinkled everywhere into the VFS and some filesystems that access idmappings directly. This would have the same downsides as mentioned earlier. - The earlier solutions work by updating the mnt->mnt_idmap pointer with the new idmapping. Instead of this it would be possible to change the idmapping itself to avoid UAF issues. To do this a sequence counter would have to be added to struct mount. When retrieving the idmapping to generate uid/gid values the sequence counter would need to be sampled and the generation of the uid/gid would spin until the update of the idmap is finished. This has problems as well but the biggest issue will be that this can lead to inconsistent permission checking and inconsistent uid/gid pairs even more than this is already possible today. Specifically, during creation it could happen that: idmap = mnt_idmap(mnt); inode_permission(idmap, ...); may_create(idmap); // create file with uid/gid based on @idmap in between the permission checking and the generation of the uid/gid value the idmapping could change leading to the permission checking and uid/gid value that is actually used to create a file on disk being out of sync. Similarly if two values are generated like: idmap = mnt_idmap(mnt) vfsgid = make_vfsgid(idmap); // idmapping gets update concurrently vfsuid = make_vfsuid(idmap); @vfsgid and @vfsuid could be out of sync if the idmapping was changed in between. The generation of vfsgid/vfsuid could span a lot of codelines so to guard against this a sequence count would have to be passed around. The performance impact of this solutio are less clear but very likely not zero. - Using SRCU similar to fanotify that can sleep. I find that not just ugly but it would have memory consumption implications and is overall pretty ugly. / solution / So, to avoid all of these pitfalls creating an idmapped mount from an already idmapped mount will be done atomically, i.e., a new detached mount is created and a new set of mount properties applied to it without it ever having been exposed to userspace at all. This can be done in two ways. A new flag to open_tree() is added OPEN_TREE_CLEAR_IDMAP that clears the old idmapping and returns a mount that isn't idmapped. And then it is possible to set mount attributes on it again including creation of an idmapped mount. This has the consequence that a file descriptor must exist in userspace that doesn't have any idmapping applied and it will thus never work in unpriviledged scenarios. As a container would be able to remove the idmapping of the mount it has been given. That should be avoided. Instead, we add open_tree_attr() which works just like open_tree() but takes an optional struct mount_attr parameter. This is useful beyond idmappings as it fills a gap where a mount never exists in userspace without the necessary mount properties applied. This is particularly useful for mount options such as MOUNT_ATTR_{RDONLY,NOSUID,NODEV,NOEXEC}. To create a new idmapped mount the following works: // Create a first idmapped mount struct mount_attr attr = { .attr_set = MOUNT_ATTR_IDMAP .userns_fd = fd_userns }; fd_tree = open_tree(-EBADF, "/", OPEN_TREE_CLONE, &attr, sizeof(attr)); move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); // Create a second idmapped mount from the first idmapped mount attr.attr_set = MOUNT_ATTR_IDMAP; attr.userns_fd = fd_userns2; fd_tree2 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); // Create a second non-idmapped mount from the first idmapped mount: memset(&attr, 0, sizeof(attr)); attr.attr_clr = MOUNT_ATTR_IDMAP; fd_tree2 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); patches from https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org: fs: allow changing idmappings fs: add kflags member to struct mount_kattr fs: add open_tree_attr() fs: add copy_mount_setattr() helper fs: add vfs_open_tree() helper Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:34 +01:00
Christian Brauner	2462651ffa	fs: allow changing idmappings This patchset makes it possible to create a new idmapped mount from an already idmapped mount and to clear idmappings. // Create a first idmapped mount struct mount_attr attr = { .attr_set = MOUNT_ATTR_IDMAP .userns_fd = fd_userns }; fd_tree = open_tree(-EBADF, "/", OPEN_TREE_CLONE, &attr, sizeof(attr)); move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); // Create a second idmapped mount from the first idmapped mount attr.attr_set = MOUNT_ATTR_IDMAP; attr.userns_fd = fd_userns2; fd_tree2 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); // Create a second non-idmapped mount from the first idmapped mount: memset(&attr, 0, sizeof(attr)); attr.attr_clr = MOUNT_ATTR_IDMAP; fd_tree2 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-5-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:28 +01:00
Christian Brauner	325cca846f	fs: add kflags member to struct mount_kattr Instead of using a boolean use a flag so we can add new flags in following patches. Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-4-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:28 +01:00
Christian Brauner	c4a16820d9	fs: add open_tree_attr() Add open_tree_attr() which allow to atomically create a detached mount tree and set mount options on it. If OPEN_TREE_CLONE is used this will allow the creation of a detached mount with a new set of mount options without it ever being exposed to userspace without that set of mount options applied. Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-3-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:28 +01:00
Christian Brauner	474f7825d5	fs: add copy_mount_setattr() helper Split out copy_mount_setattr() from mount_setattr() so we can use it in later patches. Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-2-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:28 +01:00
Christian Brauner	901766df44	fs: add vfs_open_tree() helper Split out vfs_open_tree() from open_tree() so we can use it in later patches. Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-1-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:28 +01:00
Jeff Layton	8f6116b5b7	statmount: add a new supported_mask field Some of the fields in the statmount() reply can be optional. If the kernel has nothing to emit in that field, then it doesn't set the flag in the reply. This presents a problem: There is currently no way to know what mask flags the kernel supports since you can't always count on them being in the reply. Add a new STATMOUNT_SUPPORTED_MASK flag and field that the kernel can set in the reply. Userland can use this to determine if the fields it requires from the kernel are supported. This also gives us a way to deprecate fields in the future, if that should become necessary. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250206-statmount-v2-1-6ae70a21c2ab@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:28 +01:00
Christian Brauner	37c4a9590e	statmount: allow to retrieve idmappings This adds the STATMOUNT_MNT_UIDMAP and STATMOUNT_MNT_GIDMAP options. It allows the retrieval of idmappings via statmount(). Currently it isn't possible to figure out what idmappings are applied to an idmapped mount. This information is often crucial. Before statmount() the only realistic options for an interface like this would have been to add it to /proc/<pid>/fdinfo/<nr> or to expose it in /proc/<pid>/mountinfo. Both solution would have been pretty ugly and would've shown information that is of strong interest to some application but not all. statmount() is perfect for this. The idmappings applied to an idmapped mount are shown relative to the caller's user namespace. This is the most useful solution that doesn't risk leaking information or confuse the caller. For example, an idmapped mount might have been created with the following idmappings: mount --bind -o X-mount.idmap="0:10000:1000 2000:2000:1 3000:3000:1" /srv /opt Listing the idmappings through statmount() in the same context shows: mnt_id: 2147485088 mnt_parent_id: 2147484816 fs_type: btrfs mnt_root: /srv mnt_point: /opt mnt_opts: ssd,discard=async,space_cache=v2,subvolid=5,subvol=/ mnt_uidmap[0]: 0 10000 1000 mnt_uidmap[1]: 2000 2000 1 mnt_uidmap[2]: 3000 3000 1 mnt_gidmap[0]: 0 10000 1000 mnt_gidmap[1]: 2000 2000 1 mnt_gidmap[2]: 3000 3000 1 But the idmappings might not always be resolvable in the caller's user namespace. For example: unshare --user --map-root In this case statmount() will skip any mappings that fil to resolve in the caller's idmapping: mnt_id: 2147485087 mnt_parent_id: 2147484016 fs_type: btrfs mnt_root: /srv mnt_point: /opt mnt_opts: ssd,discard=async,space_cache=v2,subvolid=5,subvol=/ The caller can differentiate between a mount not being idmapped and a mount that is idmapped but where all mappings fail to resolve in the caller's idmapping by check for the STATMOUNT_MNT_{G,U}IDMAP flag being raised but the number of mappings in ->mnt_{g,u}idmap_num being zero. Note that statmount() requires that the whole range must be resolvable in the caller's user namespace. If a subrange fails to map it will still list the map as not resolvable. This is a practical compromise to avoid having to find which subranges are resovable and wich aren't. Idmappings are listed as a string array with each mapping separated by zero bytes. This allows to retrieve the idmappings and immediately use them for writing to e.g., /proc/<pid>/{g,u}id_map and it also allow for simple iteration like: if (stmnt->mask & STATMOUNT_MNT_UIDMAP) { const char *idmap = stmnt->str + stmnt->mnt_uidmap; for (size_t idx = 0; idx < stmnt->mnt_uidmap_nr; idx++) { printf("mnt_uidmap[%lu]: %s\n", idx, idmap); idmap += strlen(idmap) + 1; } } Link: https://lore.kernel.org/r/20250204-work-mnt_idmap-statmount-v2-2-007720f39f2e@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:27 +01:00
Christian Brauner	db04662e2f	fs: allow detached mounts in clone_private_mount() In container workloads idmapped mounts are often used as layers for overlayfs. Recently I added the ability to specify layers in overlayfs as file descriptors instead of path names. It should be possible to simply use the detached mounts directly when specifying layers instead of having to attach them beforehand. They are discarded after overlayfs is mounted anyway so it's pointless system calls for userspace and pointless locking for the kernel. This just recently come up again in [1]. So enable clone_private_mount() to use detached mounts directly. Following conditions must be met: - Provided path must be the root of a detached mount tree. - Provided path may not create mount namespace loops. - Provided path must be mounted. It would be possible to be stricter and require that the caller must have CAP_SYS_ADMIN in the owning user namespace of the anonymous mount namespace but since this restriction isn't enforced for move_mount() there's no point in enforcing it for clone_private_mount(). This contains a folded fix for: Reported-by: syzbot+62dfea789a2cedac1298@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=62dfea789a2cedac1298 provided by Lizhi Xu <lizhi.xu@windriver.com> in [2]. Link: https://lore.kernel.org/r/20250207071331.550952-1-lizhi.xu@windriver.com [2] Link: https://lore.kernel.org/r/fd8f6574-f737-4743-b220-79c815ee1554@mbaynton.com [1] Link: https://lore.kernel.org/r/20250123-avancieren-erfreuen-3d61f6588fdd@brauner Tested-by: Mike Baynton <mike@mbaynton.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-12 12:12:24 +01:00
Miklos Szeredi	5eb9871053	fs: fix adding security options to statmount.mnt_opt Prepending security options was made conditional on sb->s_op->show_options, but security options are independent of sb options. Fixes: `056d33137b` ("fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()") Fixes: `f9af549d1f` ("fs: export mount options via statmount()") Cc: stable@vger.kernel.org # v6.11 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250129151253.33241-1-mszeredi@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-07 10:27:26 +01:00
Miklos Szeredi	e52e97f09f	statmount: let unset strings be empty Just like it's normal for unset values to be zero, unset strings should be empty instead of containing random values. It seems to be a typical mistake that the mask returned by statmount is not checked, which can result in various bugs. With this fix, these bugs are prevented, since it is highly likely that userspace would just want to turn the missing mask case into an empty string anyway (most of the recently found cases are of this type). Link: https://lore.kernel.org/all/CAJfpegsVCPfCn2DpM8iiYSS5DpMsLB8QBUCHecoj6s0Vxf4jzg@mail.gmail.com/ Fixes: `68385d77c0` ("statmount: simplify string option retrieval") Fixes: `46eae99ef7` ("add statmount(2) syscall") Cc: stable@vger.kernel.org # v6.8 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250130121500.113446-1-mszeredi@redhat.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-07 10:27:24 +01:00
Miklos Szeredi	bf630c4016	vfs: add notifications for mount attach and detach Add notifications for attaching and detaching mounts to fs/namespace.c Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250129165803.72138-4-mszeredi@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-05 17:21:11 +01:00
Miklos Szeredi	0f46d81f2b	fanotify: notify on mount attach and detach Add notifications for attaching and detaching mounts. The following new event masks are added: FAN_MNT_ATTACH - Mount was attached FAN_MNT_DETACH - Mount was detached If a mount is moved, then the event is reported with (FAN_MNT_ATTACH \| FAN_MNT_DETACH). These events add an info record of type FAN_EVENT_INFO_TYPE_MNT containing these fields identifying the affected mounts: __u64 mnt_id - the ID of the mount (see statmount(2)) FAN_REPORT_MNT must be supplied to fanotify_init() to receive these events and no other type of event can be received with this report type. Marks are added with FAN_MARK_MNTNS, which records the mount namespace from an nsfs file (e.g. /proc/self/ns/mnt). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250129165803.72138-3-mszeredi@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-05 17:21:07 +01:00
Joel Granados	1751f872cc	treewide: const qualify ctl_tables where applicable Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function. Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit `78eb4ea25c` ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers. Created this by running an spatch followed by a sed command: Spatch: virtual patch @ depends on !(file in "net") disable optional_qualifier @ identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@ + const struct ctl_table table_name [] = { ... }; sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c Reviewed-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/ Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>	2025-01-28 13:48:37 +01:00
Linus Torvalds	100ceb4817	vfs-6.14-rc1.mount.v2 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ44+LwAKCRCRxhvAZXjc orNaAQCGDqtxgqgGLsdx9dw7yTxOm9opYBaG5qN7KiThLAz2PwD+MsHNNlLVEOKU IQo9pa23UFUhTipFSeszOWza5SGlxg4= =hdst -----END PGP SIGNATURE----- Merge tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - Add a mountinfo program to demonstrate statmount()/listmount() Add a new "mountinfo" sample userland program that demonstrates how to use statmount() and listmount() to get at the same info that /proc/pid/mountinfo provides - Remove pointless nospec.h include - Prepend statmount.mnt_opts string with security_sb_mnt_opts() Currently these mount options aren't accessible via statmount() - Add new mount namespaces to mount namespace rbtree outside of the namespace semaphore - Lockless mount namespace lookup Currently we take the read lock when looking for a mount namespace to list mounts in. We can make this lockless. The simple search case can just use a sequence counter to detect concurrent changes to the rbtree For walking the list of mount namespaces sequentially via nsfs we keep a separate rcu list as rb_prev() and rb_next() aren't usable safely with rcu. Currently there is no primitive for retrieving the previous list member. To do this we need a new deletion primitive that doesn't poison the prev pointer and a corresponding retrieval helper Since creating mount namespaces is a relatively rare event compared with querying mounts in a foreign mount namespace this is worth it. Once libmount and systemd pick up this mechanism to list mounts in foreign mount namespaces this will be used very frequently - Add extended selftests for lockless mount namespace iteration - Add a sample program to list all mounts on the system, i.e., in all mount namespaces - Improve mount namespace iteration performance Make finding the last or first mount to start iterating the mount namespace from an O(1) operation and add selftests for iterating the mount table starting from the first and last mount - Use an xarray for the old mount id While the ida does use the xarray internally we can use it explicitly which allows us to increment the unique mount id under the xa lock. This allows us to remove the atomic as we're now allocating both ids in one go - Use a shared header for vfs sample programs - Fix build warnings for new sample program to list all mounts * tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: samples/vfs: fix build warnings samples/vfs: use shared header samples/vfs/mountinfo: Use __u64 instead of uint64_t fs: remove useless lockdep assertion fs: use xarray for old mount id selftests: add listmount() iteration tests fs: cache first and last mount samples: add test-list-all-mounts selftests: remove unneeded include selftests: add tests for mntns iteration seltests: move nsfs into filesystems subfolder fs: simplify rwlock to spinlock fs: lockless mntns lookup for nsfs rculist: add list_bidir_{del,prev}_rcu() fs: lockless mntns rbtree lookup fs: add mount namespace to rbtree late fs: prepend statmount.mnt_opts string with security_sb_mnt_opts() mount: remove inlude/nospec.h include samples: add a mountinfo program to demonstrate statmount()/listmount()	2025-01-20 10:44:51 -08:00
Linus Torvalds	5f85bd6aec	vfs-6.14-rc1.pidfs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pRdwAKCRCRxhvAZXjc otQjAP9ooUH2d/jHZ49Rw4q/3BkhX8R2fFEZgj2PMvtYlr0jQwD/d8Ji0k4jINTL AIFRfPdRwrD+X35IUK3WPO42YFZ4rAg= =5wgo -----END PGP SIGNATURE----- Merge tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - Rework inode number allocation Recently we received a patchset that aims to enable file handle encoding and decoding via name_to_handle_at(2) and open_by_handle_at(2). A crucical step in the patch series is how to go from inode number to struct pid without leaking information into unprivileged contexts. The issue is that in order to find a struct pid the pid number in the initial pid namespace must be encoded into the file handle via name_to_handle_at(2). This can be used by containers using a separate pid namespace to learn what the pid number of a given process in the initial pid namespace is. While this is a weak information leak it could be used in various exploits and in general is an ugly wart in the design. To solve this problem a new way is needed to lookup a struct pid based on the inode number allocated for that struct pid. The other part is to remove the custom inode number allocation on 32bit systems that is also an ugly wart that should go away. Allocate unique identifiers for struct pid by simply incrementing a 64 bit counter and insert each struct pid into the rbtree so it can be looked up to decode file handles avoiding to leak actual pids across pid namespaces in file handles. On both 64 bit and 32 bit the same 64 bit identifier is used to lookup struct pid in the rbtree. On 64 bit the unique identifier for struct pid simply becomes the inode number. Comparing two pidfds continues to be as simple as comparing inode numbers. On 32 bit the 64 bit number assigned to struct pid is split into two 32 bit numbers. The lower 32 bits are used as the inode number and the upper 32 bits are used as the inode generation number. Whenever a wraparound happens on 32 bit the 64 bit number will be incremented by 2 so inode numbering starts at 2 again. When a wraparound happens on 32 bit multiple pidfds with the same inode number are likely to exist. This isn't a problem since before pidfs pidfds used the anonymous inode meaning all pidfds had the same inode number. On 32 bit sserspace can thus reconstruct the 64 bit identifier by retrieving both the inode number and the inode generation number to compare, or use file handles. This gives the same guarantees on both 32 bit and 64 bit. - Implement file handle support This is based on custom export operation methods which allows pidfs to implement permission checking and opening of pidfs file handles cleanly without hacking around in the core file handle code too much. - Support bind-mounts Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts for pidfds. This allows pidfds to be safely recovered and checked for process recycling. Instead of checking d_ops for both nsfs and pidfs we could in a follow-up patch add a flag argument to struct dentry_operations that functions similar to file_operations->fop_flags. * tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: add pidfd bind-mount tests pidfs: allow bind-mounts pidfs: lookup pid through rbtree selftests/pidfd: add pidfs file handle selftests pidfs: check for valid ioctl commands pidfs: implement file handle support exportfs: add permission method fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh() exportfs: add open method fhandle: simplify error handling pseudofs: add support for export_ops pidfs: support FS_IOC_GETVERSION pidfs: remove 32bit inode number handling pidfs: rework inode number allocation	2025-01-20 09:59:00 -08:00
Linus Torvalds	4b84a4c8d4	vfs-6.14-rc1.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pRjQAKCRCRxhvAZXjc omUyAP9k31Qr7RY1zNtmpPfejqc+3Xx+xXD7NwHr+tONWtUQiQEA/F94qU2U3ivS AzyDABWrEQ5ZNsm+Rq2Y3zyoH7of3ww= =s3Bu -----END PGP SIGNATURE----- Merge tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support caching symlink lengths in inodes The size is stored in a new union utilizing the same space as i_devices, thus avoiding growing the struct or taking up any more space When utilized it dodges strlen() in vfs_readlink(), giving about 1.5% speed up when issuing readlink on /initrd.img on ext4 - Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP - Enable VBOXGUEST and VBOXSF_FS on ARM64 Now that VirtualBox is able to run as a host on arm64 (e.g. the Apple M3 processors) we can enable VBOXSF_FS (and in turn VBOXGUEST) for this architecture. Tested with various runs of bonnie++ and dbench on an Apple MacBook Pro with the latest Virtualbox 7.1.4 r165100 installed Cleanups: - Delay sysctl_nr_open check in expand_files() - Use kernel-doc includes in fiemap docbook - Use page->private instead of page->index in watch_queue - Use a consume fence in mnt_idmap() as it's heavily used in link_path_walk() - Replace magic number 7 with ARRAY_SIZE() in fc_log - Sort out a stale comment about races between fd alloc and dup2() - Fix return type of do_mount() from long to int - Various cosmetic cleanups for the lockref code Fixes: - Annotate spinning as unlikely() in __read_seqcount_begin The annotation already used to be there, but got lost in commit `52ac39e5db` ("seqlock: seqcount_t: Implement all read APIs as statement expressions") - Fix proc_handler for sysctl_nr_open - Flush delayed work in delayed fput() - Fix grammar and spelling in propagate_umount() - Fix ESP not readable during coredump In /proc/PID/stat, there is the kstkesp field which is the stack pointer of a thread. While the thread is active, this field reads zero. But during a coredump, it should have a valid value However, at the moment, kstkesp is zero even during coredump - Don't wake up the writer if the pipe is still full - Fix unbalanced user_access_end() in select code" * tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) gfs2: use lockref_init for qd_lockref erofs: use lockref_init for pcl->lockref dcache: use lockref_init for d_lockref lockref: add a lockref_init helper lockref: drop superfluous externs lockref: use bool for false/true returns lockref: improve the lockref_get_not_zero description lockref: remove lockref_put_not_zero fs: Fix return type of do_mount() from long to int select: Fix unbalanced user_access_end() vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64 pipe_read: don't wake up the writer if the pipe is still full selftests: coredump: Add stackdump test fs/proc: do_task_stat: Fix ESP not readable during coredump fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag fs: sort out a stale comment about races between fd alloc and dup2 fs: Fix grammar and spelling in propagate_umount() fs: fc_log replace magic number 7 with ARRAY_SIZE() fs: use a consume fence in mnt_idmap() file: flush delayed work in delayed fput() ...	2025-01-20 09:40:49 -08:00
Sentaro Onizuka	4f3b63e8a8	fs: Fix return type of do_mount() from long to int Fix the return type of do_mount() function from long to int to match its ac tual behavior. The function only returns int values, and all callers, inclu ding those in fs/namespace.c and arch/alpha/kernel/osf_sys.c, already treat the return value as int. This change improves type consistency across the filesystem code and aligns the function signature with its existing impleme ntation and usage. Signed-off-by: Sentaro Onizuka <sentaro@amazon.com> Link: https://lore.kernel.org/r/20250113151400.55512-1-sentaro@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-16 11:48:06 +01:00
Christian Brauner	482d520d86	vfs-6.14-rc7.mount.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ3/zQwAKCRCRxhvAZXjc oucWAQCggue6avf84aT1kOugdMR+7TRp3OFnvnb5EYKnJNwOgQEAl+kmIHTZRU4F o9cUdjfyPDSha+03owHtPJdXZ7mdOwY= =ISxi -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ3/2ggAKCRCRxhvAZXjc omqsAQDuLR/Yw0Tt73iTic5gJlupnKdcBXL+RPtrdnvRZD2IHAEAzINwYp7T3ujT MYC/gey8ppgsDKd9FkN0WZdThBjBlAc= =0LOi -----END PGP SIGNATURE----- Merge tag 'vfs-6.14-rc7.mount.fixes' Bring in the fix for the mount namespace rbtree. It is used as the base for the vfs mount work for this cycle and so shouldn't be applied directly. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 17:03:21 +01:00
Christian Brauner	22eb23b8a7	fs: remove useless lockdep assertion mnt_ns_release() can run asynchronously via call_rcu() so hitting that lockdep assertion means someone else already grabbed the mnt_ns_tree_lock and causes a false positive. That assertion has likely always been wrong. call_rcu() just makes it more likely to hit. Link: https://lore.kernel.org/r/Z2PlT5rcRTIhCpft@ly-workstation Link: https://lore.kernel.org/r/20241219-darben-quietschen-b6e1d80327bb@brauner Reported-by: Lai, Yi <yi1.lai@linux.intel.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:55 +01:00
Christian Brauner	7f9bfafc5f	fs: use xarray for old mount id While the ida does use the xarray internally we can use it explicitly which allows us to increment the unique mount id under the xa lock. This allows us to remove the atomic as we're now allocating both ids in one go. Link: https://lore.kernel.org/r/20241217-erhielten-regung-44bb1604ca8f@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:55 +01:00
Christian Brauner	2ce23285d7	fs: cache first and last mount Speed up listmount() by caching the first and last node making retrieval of the first and last mount of each mount namespace O(1). Link: https://lore.kernel.org/r/20241215-vfs-6-14-mount-work-v1-2-fd55922c4af8@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:54 +01:00
Christian Brauner	e7c8dde368	fs: simplify rwlock to spinlock We're not taking the read_lock() anymore now that all lookup is lockless. Just use a simple spinlock. Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-6-6e3cdaf9b280@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:52 +01:00
Christian Brauner	4368898b27	fs: lockless mntns lookup for nsfs We already made the rbtree lookup lockless for the simple lookup case. However, walking the list of mount namespaces via nsfs still happens with taking the read lock blocking concurrent additions of new mount namespaces pointlessly. Plus, such additions are rare anyway so allow lockless lookup of the previous and next mount namespace by keeping a separate list. This also allows to make some things simpler in the code. Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-5-6e3cdaf9b280@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:52 +01:00
Christian Brauner	5dcbd85d35	fs: lockless mntns rbtree lookup Currently we use a read-write lock but for the simple search case we can make this lockless. Creating a new mount namespace is a rather rare event compared with querying mounts in a foreign mount namespace. Once this is picked up by e.g., systemd to list mounts in another mount in it's isolated services or in containers this will be used a lot so this seems worthwhile doing. Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-3-6e3cdaf9b280@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:52 +01:00
Christian Brauner	144acef333	fs: add mount namespace to rbtree late There's no point doing that under the namespace semaphore it just gives the false impression that it protects the mount namespace rbtree and it simply doesn't. Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-2-6e3cdaf9b280@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:51 +01:00
Christian Brauner	62b8dee925	mount: remove inlude/nospec.h include It's not needed, so remove it. Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-1-6e3cdaf9b280@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:51 +01:00
Jeff Layton	056d33137b	fs: prepend statmount.mnt_opts string with security_sb_mnt_opts() Currently these mount options aren't accessible via statmount(). The read handler for /proc/#/mountinfo calls security_sb_show_options() to emit the security options after emitting superblock flag options, but before calling sb->s_op->show_options. Have statmount_mnt_opts() call security_sb_show_options() before calling ->show_options. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241115-statmount-v2-2-cd29aeff9cbb@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:51 +01:00
Christian Brauner	344bac8f0d	fs: kill MNT_ONRB Move mnt->mnt_node into the union with mnt->mnt_rcu and mnt->mnt_llist instead of keeping it with mnt->mnt_list. This allows us to use RB_CLEAR_NODE(&mnt->mnt_node) in umount_tree() as well as list_empty(&mnt->mnt_node). That in turn allows us to remove MNT_ONRB. This also fixes the bug reported in [1] where seemingly MNT_ONRB wasn't set in @mnt->mnt_flags even though the mount was present in the mount rbtree of the mount namespace. The root cause is the following race. When a btrfs subvolume is mounted a temporary mount is created: btrfs_get_tree_subvol() { mnt = fc_mount() // Register the newly allocated mount with sb->mounts: lock_mount_hash(); list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts); unlock_mount_hash(); } and registered on sb->s_mounts. Later it is added to an anonymous mount namespace via mount_subvol(): -> mount_subvol() -> mount_subtree() -> alloc_mnt_ns() mnt_add_to_ns() vfs_path_lookup() put_mnt_ns() The mnt_add_to_ns() call raises MNT_ONRB in @mnt->mnt_flags. If someone concurrently does a ro remount: reconfigure_super() -> sb_prepare_remount_readonly() { list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { } all mounts registered in sb->s_mounts are visited and first MNT_WRITE_HOLD is raised, then MNT_READONLY is raised, and finally MNT_WRITE_HOLD is removed again. The flag modification for MNT_WRITE_HOLD/MNT_READONLY and MNT_ONRB race so MNT_ONRB might be lost. Fixes: `2eea9ce431` ("mounts: keep list of mounts in an rbtree") Cc: <stable@kernel.org> # v6.8+ Link: https://lore.kernel.org/r/20241215-vfs-6-14-mount-work-v1-1-fd55922c4af8@kernel.org Link: https://lore.kernel.org/r/ec6784ed-8722-4695-980a-4400d4e7bd1a@gmx.com [1] Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:58:50 +01:00
Christian Brauner	ef4144ac2d	pidfs: allow bind-mounts Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts for pidfds. This allows pidfds to be safely recovered and checked for process recycling. Link: https://lore.kernel.org/r/20241219-work-pidfs-mount-v1-1-dbc56198b839@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-22 11:03:10 +01:00
Miklos Szeredi	aa21f333c8	fs: fix is_mnt_ns_file() Commit `1fa08aece4` ("nsfs: convert to path_from_stashed() helper") reused nsfs dentry's d_fsdata, which no longer contains a pointer to proc_ns_operations. Fix the remaining use in is_mnt_ns_file(). Fixes: `1fa08aece4` ("nsfs: convert to path_from_stashed() helper") Cc: stable@vger.kernel.org # v6.9 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20241211121118.85268-1-mszeredi@redhat.com Acked-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-13 13:47:04 +01:00
Christian Brauner	3e5360167a	statmount: fix security option retrieval Fix the inverted check for security_sb_show_options(). Link: https://lore.kernel.org/r/c8eaa647-5d67-49b6-9401-705afcb7e4d7@stanley.mountain Link: https://lore.kernel.org/r/20241120-verehren-rhabarber-83a11b297bcc@brauner Fixes: `aefff51e1c` ("statmount: retrieve security mount options") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: stable@vger.kernel.org # mainline only Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-21 09:35:31 +01:00
Miklos Szeredi	d18516a021	statmount: clean up unescaped option handling Move common code from opt_array/opt_sec_array to helper. This helper does more than just unescape options, so rename to statmount_opt_process(). Handle corner case of just a single character in options. Rename some local variables to better describe their function. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20241120142732.55210-1-mszeredi@redhat.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-21 09:35:26 +01:00
Linus Torvalds	0f25f0e4ef	the bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ 69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9 CN18glYmD3wRyU6Bwl4vGODouSJvDgA= =gVH3 -----END PGP SIGNATURE----- Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...	2024-11-18 12:24:06 -08:00
Linus Torvalds	70e7730c2a	vfs-6.13.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli pqlG+38UdATImpfqqWjPbb72sBYcfQg= =wLUh -----END PGP SIGNATURE----- Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Fixup and improve NLM and kNFSD file lock callbacks Last year both GFS2 and OCFS2 had some work done to make their locking more robust when exported over NFS. Unfortunately, part of that work caused both NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send lock notifications to clients This in itself is not a huge problem because most NFS clients will still poll the server in order to acquire a conflicted lock It's important for NLM and kNFSD that they do not block their kernel threads inside filesystem's file_lock implementations because that can produce deadlocks. We used to make sure of this by only trusting that posix_lock_file() can correctly handle blocking lock calls asynchronously, so the lock managers would only setup their file_lock requests for async callbacks if the filesystem did not define its own lock() file operation However, when GFS2 and OCFS2 grew the capability to correctly handle blocking lock requests asynchronously, they started signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting posix_lock_file() was inadvertently dropped, so now most filesystems no longer produce lock notifications when exported over NFS Fix this by using an fop_flag which greatly simplifies the problem and grooms the way for future uses by both filesystems and lock managers alike - Add a sysctl to delete the dentry when a file is removed instead of making it a negative dentry Commit `681ce86235` ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as ilebench.sum_operations/s, prompting a revert in commit `4a4be1ad3a` ("Revert "vfs: Delete the associated dentry when deleting a file""). This reintroduces the concept conditionally through a sysctl - Expand the statmount() system call: * Report the filesystem subtype in a new fs_subtype field to e.g., report fuse filesystem subtypes * Report the superblock source in a new sb_source field * Add a new way to return filesystem specific mount options in an option array that returns filesystem specific mount options separated by zero bytes and unescaped. This allows caller's to retrieve filesystem specific mount options and immediately pass them to e.g., fsconfig() without having to unescape or split them * Report security (LSM) specific mount options in a separate security option array. We don't lump them together with filesystem specific mount options as security mount options are generic and most users aren't interested in them The format is the same as for the filesystem specific mount option array - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command - Optimize acl_permission_check() to avoid costly {g,u}id ownership checks if possible - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict() - Add synchronous wakeup support for ep_poll_callback. Currently, epoll only uses wake_up() to wake up task. But sometimes there are epoll users which want to use the synchronous wakeup flag to give a hint to the scheduler, e.g., the Android binder driver. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback() Fixes: - Fix kernel documentation for inode_insert5() and iget5_locked() - Annotate racy epoll check on file->f_ep - Make F_DUPFD_QUERY associative - Avoid filename buffer overrun in initramfs - Don't let statmount() return empty strings - Add a cond_resched() to dump_user_range() to avoid hogging the CPU - Don't query the device logical blocksize multiple times for hfsplus - Make filemap_read() check that the offset is positive or zero Cleanups: - Various typo fixes - Cleanup wbc_attach_fdatawrite_inode() - Add __releases annotation to wbc_attach_and_unlock_inode() - Add hugetlbfs tracepoints - Fix various vfs kernel doc parameters - Remove obsolete TODO comment from io_cancel() - Convert wbc_account_cgroup_owner() to take a folio - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add() - Reorder struct posix_acl to save 8 bytes - Annotate struct posix_acl with __counted_by() - Replace one-element array with flexible array member in freevxfs - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()" * tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) statmount: retrieve security mount options vfs: make evict() use smp_mb__after_spinlock instead of smp_mb statmount: add flag to retrieve unescaped options fs: add the ability for statmount() to report the sb_source writeback: wbc_attach_fdatawrite_inode out of line writeback: add a __releases annoation to wbc_attach_and_unlock_inode fs: add the ability for statmount() to report the fs_subtype fs: don't let statmount return empty strings fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel() hfsplus: don't query the device logical block size multiple times freevxfs: Replace one-element array with flexible array member fs: optimize acl_permission_check() initramfs: avoid filename buffer overrun fs/writeback: convert wbc_account_cgroup_owner to take a folio acl: Annotate struct posix_acl with __counted_by() acl: Realign struct posix_acl to save 8 bytes epoll: Add synchronous wakeup support for ep_poll_callback coredump: add cond_resched() to dump_user_range mm/page-writeback.c: Fix comment of wb_domain_writeout_add() mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL ...	2024-11-18 09:35:30 -08:00
Christian Brauner	aefff51e1c	statmount: retrieve security mount options Add the ability to retrieve security mount options. Keep them separate from filesystem specific mount options so it's easy to tell them apart. Also allow to retrieve them separate from other mount options as most of the time users won't be interested in security specific mount options. Link: https://lore.kernel.org/r/20241114-radtour-ofenrohr-ff34b567b40a@brauner Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-14 17:03:25 +01:00
Miklos Szeredi	2f4d4503e9	statmount: add flag to retrieve unescaped options Filesystem options can be retrieved with STATMOUNT_MNT_OPTS, which returns a string of comma separated options, where some characters are escaped using the \OOO notation. Add a new flag, STATMOUNT_OPT_ARRAY, which instead returns the raw option values separated with '\0' charaters. Since escaped charaters are rare, this inteface is preferable for non-libmount users which likley don't want to deal with option de-escaping. Example code: if (st->mask & STATMOUNT_OPT_ARRAY) { const char *opt = st->str + st->opt_array; for (unsigned int i = 0; i < st->opt_num; i++) { printf("opt_array[%i]: <%s>\n", i, opt); opt += strlen(opt) + 1; } } Example ouput: (1) mnt_opts: <lowerdir+=/l\054w\054r,lowerdir+=/l\054w\054r1,upperdir=/upp\054r,workdir=/w\054rk,redirect_dir=nofollow,uuid=null> (2) opt_array[0]: <lowerdir+=/l,w,r> opt_array[1]: <lowerdir+=/l,w,r1> opt_array[2]: <upperdir=/upp,r> opt_array[3]: <workdir=/w,rk> opt_array[4]: <redirect_dir=nofollow> opt_array[5]: <uuid=null> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20241112101006.30715-1-mszeredi@redhat.com Acked-by: Jeff Layton <jlayton@kernel.org> [brauner: tweak variable naming and parsing add example output] Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-13 17:27:02 +01:00
Jeff Layton	44010543fc	fs: add the ability for statmount() to report the sb_source /proc/self/mountinfo displays the source for the mount, but statmount() doesn't yet have a way to return it. Add a new STATMOUNT_SB_SOURCE flag, claim the 32-bit __spare1 field to hold the offset into the str[] array. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241111-statmount-v4-3-2eaf35d07a80@kernel.org Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-13 14:08:17 +01:00
Jeff Layton	ed9d95f691	fs: add the ability for statmount() to report the fs_subtype /proc/self/mountinfo prints out the sb->s_subtype after the type. This is particularly useful for disambiguating FUSE mounts (at least when the userland driver bothers to set it). Add STATMOUNT_FS_SUBTYPE and claim one of the __spare2 fields to point to the offset into the str[] array. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ian Kent <raven@themaw.net> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241111-statmount-v4-2-2eaf35d07a80@kernel.org Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-12 14:37:12 +01:00
Jeff Layton	75ead69a71	fs: don't let statmount return empty strings When one of the statmount_string() handlers doesn't emit anything to seq, the kernel currently sets the corresponding flag and emits an empty string. Given that statmount() returns a mask of accessible fields, just leave the bit unset in this case, and skip any NULL termination. If nothing was emitted to the seq, then the EOVERFLOW and EAGAIN cases aren't applicable and the function can just return immediately. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241111-statmount-v4-1-2eaf35d07a80@kernel.org Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-11-12 14:37:12 +01:00

1 2 3 4 5 ...

860 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)