Commit Graph

547 Commits (0dfb36b2dcb666f116ba314e631bd3bc632c44d1)

Author SHA1 Message Date
Max Kellermann d927a595ab ceph: add trace points to the MDS client
This patch adds trace points to the Ceph filesystem MDS client:

- request submission (CEPH_MSG_CLIENT_REQUEST) and completion
  (CEPH_MSG_CLIENT_REPLY)
- capabilities (CEPH_MSG_CLIENT_CAPS)

These are the central pieces that are useful for analyzing MDS
latency/performance problems from the client's perspective.

In the long run, all doutc() calls should be replaced with
tracepoints.  This way, the Ceph filesystem can be traced at any time
(without spamming the kernel log).  Additionally, trace points can be
used in BPF programs (which can even deference the pointer parameters
and extract more values).

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-12-10 11:50:54 +01:00
Linus Torvalds 8bd9238e51 Some messenger improvements from Eric and Max, a patch to address the
issue (also affected userspace) of incorrect permissions being granted
 to users who have access to multiple different CephFS instances within
 the same cluster from Kotresh and a bunch of assorted CephFS fixes from
 Slava.
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmjpShsTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizoaB/C7qTw5Olh8NDpX8I+ljEO50XNduurf
 fp2eNn0ub5brhcvh8iACSPKE2oer/bDvv3b2SN9310GmBX3f7H2Ht5TeH2tGBRN0
 clg+C2DmY/2watHovo+ua7YAd+HiPH2XMbpeU38Pu1nEdmiU6cQ0YaOn8n2p+c1E
 bID0dMHWb4HTmFRURqWqKPDkM1fLHRxIVgyOMaov5vs0T7XdglwPja3S2W6epvqF
 hKSMSvO/j9qYlOsBM6G6IuHDMJomzBqOQKqsQqC4XZN6uXeaKPTLYRnzxKfJUEWj
 P5JTaum7NGGtfIs0L9wr6zpou/GY2zTFiyXhLZsLJMn894bBO5nArg==
 =wGIB
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.18-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

 - some messenger improvements (Eric and Max)

 - address an issue (also affected userspace) of incorrect permissions
   being granted to users who have access to multiple different CephFS
   instances within the same cluster (Kotresh)

 - a bunch of assorted CephFS fixes (Slava)

* tag 'ceph-for-6.18-rc1' of https://github.com/ceph/ceph-client:
  ceph: add bug tracking system info to MAINTAINERS
  ceph: fix multifs mds auth caps issue
  ceph: cleanup in ceph_alloc_readdir_reply_buffer()
  ceph: fix potential NULL dereference issue in ceph_fill_trace()
  libceph: add empty check to ceph_con_get_out_msg()
  libceph: pass the message pointer instead of loading con->out_msg
  libceph: make ceph_con_get_out_msg() return the message pointer
  ceph: fix potential race condition on operations with CEPH_I_ODIRECT flag
  ceph: refactor wake_up_bit() pattern of calling
  ceph: fix potential race condition in ceph_ioctl_lazyio()
  ceph: fix overflowed constant issue in ceph_do_objects_copy()
  ceph: fix wrong sizeof argument issue in register_session()
  ceph: add checking of wait_for_completion_killable() return value
  ceph: make ceph_start_io_*() killable
  libceph: Use HMAC-SHA256 library instead of crypto_shash
2025-10-10 11:30:19 -07:00
Kotresh HR 22c73d52a6 ceph: fix multifs mds auth caps issue
The mds auth caps check should also validate the
fsname along with the associated caps. Not doing
so would result in applying the mds auth caps of
one fs on to the other fs in a multifs ceph cluster.
The bug causes multiple issues w.r.t user
authentication, following is one such example.

Steps to Reproduce (on vstart cluster):
1. Create two file systems in a cluster, say 'fsname1' and 'fsname2'
2. Authorize read only permission to the user 'client.usr' on fs 'fsname1'
    $ceph fs authorize fsname1 client.usr / r
3. Authorize read and write permission to the same user 'client.usr' on fs 'fsname2'
    $ceph fs authorize fsname2 client.usr / rw
4. Update the keyring
    $ceph auth get client.usr >> ./keyring

With above permssions for the user 'client.usr', following is the
expectation.
  a. The 'client.usr' should be able to only read the contents
     and not allowed to create or delete files on file system 'fsname1'.
  b. The 'client.usr' should be able to read/write on file system 'fsname2'.

But, with this bug, the 'client.usr' is allowed to read/write on file
system 'fsname1'. See below.

5. Mount the file system 'fsname1' with the user 'client.usr'
     $sudo bin/mount.ceph usr@.fsname1=/ /kmnt_fsname1_usr/
6. Try creating a file on file system 'fsname1' with user 'client.usr'. This
   should fail but passes with this bug.
     $touch /kmnt_fsname1_usr/file1
7. Mount the file system 'fsname1' with the user 'client.admin' and create a
   file.
     $sudo bin/mount.ceph admin@.fsname1=/ /kmnt_fsname1_admin
     $echo "data" > /kmnt_fsname1_admin/admin_file1
8. Try removing an existing file on file system 'fsname1' with the user
   'client.usr'. This shoudn't succeed but succeeds with the bug.
     $rm -f /kmnt_fsname1_usr/admin_file1

For more information, please take a look at the corresponding mds/fuse patch
and tests added by looking into the tracker mentioned below.

v2: Fix a possible null dereference in doutc
v3: Don't store fsname from mdsmap, validate against
    ceph_mount_options's fsname and use it
v4: Code refactor, better warning message and
    fix possible compiler warning

[ Slava.Dubeyko: "fsname check failed" -> "fsname mismatch" ]

Link: https://tracker.ceph.com/issues/72167
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-10-08 23:30:47 +02:00
Viacheslav Dubeyko c66120c842 ceph: cleanup in ceph_alloc_readdir_reply_buffer()
The Coverity Scan service has reported potential issue
in ceph_alloc_readdir_reply_buffer() [1]. If order could
be negative one, then it expects the issue in the logic:

num_entries = (PAGE_SIZE << order) / size;

Technically speaking, this logic [2] should prevent from
making the order variable negative:

if (!rinfo->dir_entries)
    return -ENOMEM;

However, the allocation logic requires some cleanup.
This patch makes sure that calculated bytes count
will never exceed ULONG_MAX before get_order()
calculation. And it adds the checking of order
variable on negative value to guarantee that second
half of the function's code will never operate by
negative value of order variable even if something
will be wrong or to be changed in the first half of
the function's logic.

v2
Alex Markuze suggested to add unlikely() macro
for introduced condition checks.

[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1198252
[2] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/mds_client.c#L2553

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-10-08 23:30:47 +02:00
Viacheslav Dubeyko 1ed4471a4e ceph: fix wrong sizeof argument issue in register_session()
The Coverity Scan service has detected the wrong sizeof
argument in register_session() [1]. The CID 1598909 defect
contains explanation: "The wrong sizeof value is used in
an expression or as argument to a function. The result is
an incorrect value that may cause unexpected program behaviors.
In register_session: The sizeof operator is invoked on
the wrong argument (CWE-569)".

The patch introduces a ptr_size variable that is initialized
by sizeof(struct ceph_mds_session *). And this variable is used
instead of sizeof(void *) in the code.

[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1598909

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-10-08 23:30:46 +02:00
Linus Torvalds 56e7b31071 vfs-6.18-rc1.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQQgAKCRCRxhvAZXjc
 oud9AQD5IG4sNnzCjsvcTDpQkbX5eZW+LFIiAiiN+nztZ+OcRQEAvC2N7YovfqM3
 TWpVoNDKvEPdtDc9ttFMUKqBZYvxvgE=
 =sEaL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode updates from Christian Brauner:
 "This contains a series I originally wrote and that Eric brought over
  the finish line. It moves out the i_crypt_info and i_verity_info
  pointers out of 'struct inode' and into the fs-specific part of the
  inode.

  So now the few filesytems that actually make use of this pay the price
  in their own private inode storage instead of forcing it upon every
  user of struct inode.

  The pointer for the crypt and verity info is simply found by storing
  an offset to its address in struct fsverity_operations and struct
  fscrypt_operations. This shrinks struct inode by 16 bytes.

  I hope to move a lot more out of it in the future so that struct inode
  becomes really just about very core stuff that we need, much like
  struct dentry and struct file, instead of the dumping ground it has
  become over the years.

  On top of this are a various changes associated with the ongoing inode
  lifetime handling rework that multiple people are pushing forward:

   - Stop accessing inode->i_count directly in f2fs and gfs2. They
     simply should use the __iget() and iput() helpers

   - Make the i_state flags an enum

   - Rework the iput() logic

     Currently, if we are the last iput, and we have the I_DIRTY_TIME
     bit set, we will grab a reference on the inode again and then mark
     it dirty and then redo the put. This is to make sure we delay the
     time update for as long as possible

     We can rework this logic to simply dec i_count if it is not 1, and
     if it is do the time update while still holding the i_count
     reference

     Then we can replace the atomic_dec_and_lock with locking the
     ->i_lock and doing atomic_dec_and_test, since we did the
     atomic_add_unless above

   - Add an icount_read() helper and convert everyone that accesses
     inode->i_count directly for this purpose to use the helper

   - Expand dump_inode() to dump more information about an inode helping
     in debugging

   - Add some might_sleep() annotations to iput() and associated
     helpers"

* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: add might_sleep() annotation to iput() and more
  fs: expand dump_inode()
  inode: fix whitespace issues
  fs: add an icount_read helper
  fs: rework iput logic
  fs: make the i_state flags an enum
  fs: stop accessing ->i_count directly in f2fs and gfs2
  fsverity: check IS_VERITY() in fsverity_cleanup_inode()
  fs: remove inode::i_verity_info
  btrfs: move verity info pointer to fs-specific part of inode
  f2fs: move verity info pointer to fs-specific part of inode
  ext4: move verity info pointer to fs-specific part of inode
  fsverity: add support for info in fs-specific part of inode
  fs: remove inode::i_crypt_info
  ceph: move crypt info pointer to fs-specific part of inode
  ubifs: move crypt info pointer to fs-specific part of inode
  f2fs: move crypt info pointer to fs-specific part of inode
  ext4: move crypt info pointer to fs-specific part of inode
  fscrypt: add support for info in fs-specific part of inode
  fscrypt: replace raw loads of info pointer with helper function
2025-09-29 09:42:30 -07:00
Alex Markuze 15f519e9f8 ceph: fix race condition validating r_parent before applying state
Add validation to ensure the cached parent directory inode matches the
directory info in MDS replies. This prevents client-side race conditions
where concurrent operations (e.g. rename) cause r_parent to become stale
between request initiation and reply processing, which could lead to
applying state changes to incorrect directory inodes.

[ idryomov: folded a kerneldoc fixup and a follow-up fix from Alex to
  move CEPH_CAP_PIN reference when r_parent is updated:

  When the parent directory lock is not held, req->r_parent can become
  stale and is updated to point to the correct inode.  However, the
  associated CEPH_CAP_PIN reference was not being adjusted.  The
  CEPH_CAP_PIN is a reference on an inode that is tracked for
  accounting purposes.  Moving this pin is important to keep the
  accounting balanced. When the pin was not moved from the old parent
  to the new one, it created two problems: The reference on the old,
  stale parent was never released, causing a reference leak.
  A reference for the new parent was never acquired, creating the risk
  of a reference underflow later in ceph_mdsc_release_request().  This
  patch corrects the logic by releasing the pin from the old parent and
  acquiring it for the new parent when r_parent is switched.  This
  ensures reference accounting stays balanced. ]

Cc: stable@vger.kernel.org
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-09-09 12:57:02 +02:00
Josef Bacik 37b27bd5d6
fs: add an icount_read helper
Instead of doing direct access to ->i_count, add a helper to handle
this. This will make it easier to convert i_count to a refcount later.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/9bc62a84c6b9d6337781203f60837bd98fbc4a96.1756222464.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 12:41:09 +02:00
Al Viro 0d2da2561b ceph: fix a race with rename() in ceph_mdsc_build_path()
Lift copying the name into callers of ceph_encode_encrypted_dname()
that do not have it already copied; ceph_encode_encrypted_fname()
disappears.

That fixes a UAF in ceph_mdsc_build_path() - while the initial copy
of plaintext into buf is done under ->d_lock, we access the
original name again in ceph_encode_encrypted_fname() and that is
done without any locking.  With ceph_encode_encrypted_dname() using
the stable copy the problem goes away.

Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-17 17:58:14 -04:00
Viacheslav Dubeyko fd7449d937
ceph: fix generic/421 test failure
The generic/421 fails to finish because of the issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are several issues here:
(1) ceph_kill_sb() doesn't wait ending of flushing
all dirty folios/pages because of racy nature of
mdsc->stopping_blockers. As a result, mdsc->stopping
becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) fails
to increment mdsc->stopping_blockers. Finally,
already locked folios/pages are never been unlocked
and the logic tries to lock the same page second time.
(3) The folio_batch with found dirty pages by
filemap_get_folios_tag() is not processed properly.
And this is why some number of dirty pages simply never
processed and we have dirty folios/pages after unmount
anyway.

This patch fixes the issues by means of:
(1) introducing dirty_folios counter and flush_end_wq
waiting queue in struct ceph_mds_client;
(2) ceph_dirty_folio() increments the dirty_folios
counter;
(3) writepages_finish() decrements the dirty_folios
counter and wake up all waiters on the queue
if dirty_folios counter is equal or lesser than zero;
(4) adding in ceph_kill_sb() method the logic of
checking the value of dirty_folios counter and
waiting if it is bigger than zero;
(5) adding ceph_inc_osd_stopping_blocker() call in the
beginning of the ceph_writepages_start() and
ceph_dec_osd_stopping_blocker() at the end of
the ceph_writepages_start() with the goal to resolve
the racy nature of mdsc->stopping_blockers.

sudo ./check generic/421
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb  3 20:30:08 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch

generic/421 7s ...  4s
Ran: generic/421
Passed all 1 tests

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-5-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28 11:20:16 +01:00
Linus Torvalds 626d1a1e99 A fix for a memory leak from Antoine (marked for stable) and two
cleanups from Liang and Slava, all in CephFS.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmedEZkTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi7XSB/9bx32SkLccSCzXwXoVdoAAxB9hmPso
 jY1iWcd6rOGSg2ZGe8HCMi82g5wc69JlUdTE5M8EAeVkwfLzQhYY0fhnPBat7tbO
 7bufm7w6ZhQMnnTqaRfOB9D88AfpZV+whMl97/meu6TD7TZnkShO4rQsuCXFikdZ
 2ygLBKH+lwHUcoGlEcD5EMTQtARepBy/lAnHcZkeuajOpfWapNBtBHWhXIXCdPiB
 t5ffg49w770H64tEDlo4GPaQeM9yIraPcNeat21CrhxlKdOxcdZ5Vtz8SZAsk6p2
 vCzhS7qt0U3qWeoeztN4Xs8Y9Z3QJULVbVozN1Af3r5kKDyCMX6FJEoa
 =OoAG
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.14-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A fix for a memory leak from Antoine (marked for stable) and two
  cleanups from Liang and Slava"

* tag 'ceph-for-6.14-rc1' of https://github.com/ceph/ceph-client:
  ceph: exchange hardcoded value on NAME_MAX
  ceph: streamline request head structures in MDS client
  ceph: fix memory leak in ceph_mds_auth_match()
2025-01-31 10:30:34 -08:00
Linus Torvalds d3d90cc289 Provide stable parent and name to ->d_revalidate() instances
Most of the filesystem methods where we care about dentry name
 and parent have their stability guaranteed by the callers;
 ->d_revalidate() is the major exception.
 
 It's easy enough for callers to supply stable values for
 expected name and expected parent of the dentry being
 validated.  That kills quite a bit of boilerplate in
 ->d_revalidate() instances, along with a bunch of races
 where they used to access ->d_name without sufficient
 precautions.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZ5gkoQAKCRBZ7Krx/gZQ
 6w9FAP4nyxNNWMjE1TwuWR/DNDMYYuw/qn/miZ88B5BUM8hzqgD/W2SjRvcbSaIm
 xSIYpbtKgtqNU34P1PU+dBvL8Utz2AE=
 =TWY8
 -----END PGP SIGNATURE-----

Merge tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs d_revalidate updates from Al Viro:
 "Provide stable parent and name to ->d_revalidate() instances

  Most of the filesystem methods where we care about dentry name and
  parent have their stability guaranteed by the callers;
  ->d_revalidate() is the major exception.

  It's easy enough for callers to supply stable values for expected name
  and expected parent of the dentry being validated. That kills quite a
  bit of boilerplate in ->d_revalidate() instances, along with a bunch
  of races where they used to access ->d_name without sufficient
  precautions"

* tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  9p: fix ->rename_sem exclusion
  orangefs_d_revalidate(): use stable parent inode and name passed by caller
  ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller
  nfs: fix ->d_revalidate() UAF on ->d_name accesses
  nfs{,4}_lookup_validate(): use stable parent inode passed by caller
  gfs2_drevalidate(): use stable parent inode and name passed by caller
  fuse_dentry_revalidate(): use stable parent inode and name passed by caller
  vfat_revalidate{,_ci}(): use stable parent inode passed by caller
  exfat_d_revalidate(): use stable parent inode passed by caller
  fscrypt_d_revalidate(): use stable parent inode passed by caller
  ceph_d_revalidate(): propagate stable name down into request encoding
  ceph_d_revalidate(): use stable parent inode passed by caller
  afs_d_revalidate(): use stable name and parent inode passed by caller
  Pass parent directory inode and expected name to ->d_revalidate()
  generic_ci_d_compare(): use shortname_storage
  ext4 fast_commit: make use of name_snapshot primitives
  dissolve external_name.u into separate members
  make take_dentry_name_snapshot() lockless
  dcache: back inline names with a struct-wrapped array of unsigned long
  make sure that DNAME_INLINE_LEN is a multiple of word size
2025-01-30 09:13:35 -08:00
Al Viro 541795cb0b ceph_d_revalidate(): propagate stable name down into request encoding
Currently get_fscrypt_altname() requires ->r_dentry->d_name to be stable
and it gets that in almost all cases.  The only exception is ->d_revalidate(),
where we have a stable name, but it's passed separately - dentry->d_name
is not stable there.

Propagate it down to get_fscrypt_altname() as a new field of struct
ceph_mds_request - ->r_dname, to be used instead ->r_dentry->d_name
when non-NULL.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:23 -05:00
Liang Jie 2f0805d7c0 ceph: streamline request head structures in MDS client
The existence of the ceph_mds_request_head_old structure in the MDS
client code is no longer required due to improvements in handling
different MDS request header versions. This patch removes the now
redundant ceph_mds_request_head_old structure and replaces its usage
with the flexible and extensible ceph_mds_request_head structure.

Changes include:
- Modification of find_legacy_request_head to directly cast the
  pointer to ceph_mds_request_head_legacy without going through the
  old structure.
- Update sizeof calculations in create_request_message to use
  offsetofend for consistency and future-proofing, rather than
  referencing the old structure.
- Use of the structured ceph_mds_request_head directly instead of the
  old one.

Additionally, this consolidation normalizes the handling of
request_head_version v1 to align with versions v2 and v3, leading to
a more consistent and maintainable codebase.

These changes simplify the codebase and reduce potential confusion
stemming from the existence of an obsolete structure.

Signed-off-by: Liang Jie <liangjie@lixiang.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-01-27 16:07:42 +01:00
Antoine Viallon 3b7d93db45 ceph: fix memory leak in ceph_mds_auth_match()
We now free the temporary target path substring allocation on every
possible branch, instead of omitting the default branch.  In some
cases, a memory leak occured, which could rapidly crash the system
(depending on how many file accesses were attempted).

This was detected in production because it caused a continuous memory
growth, eventually triggering kernel OOM and completely hard-locking
the kernel.

Relevant kmemleak stacktrace:

    unreferenced object 0xffff888131e69900 (size 128):
      comm "git", pid 66104, jiffies 4295435999
      hex dump (first 32 bytes):
        76 6f 6c 75 6d 65 73 2f 63 6f 6e 74 61 69 6e 65  volumes/containe
        72 73 2f 67 69 74 65 61 2f 67 69 74 65 61 2f 67  rs/gitea/gitea/g
      backtrace (crc 2f3bb450):
        [<ffffffffaa68fb49>] __kmalloc_noprof+0x359/0x510
        [<ffffffffc32bf1df>] ceph_mds_check_access+0x5bf/0x14e0 [ceph]
        [<ffffffffc3235722>] ceph_open+0x312/0xd80 [ceph]
        [<ffffffffaa7dd786>] do_dentry_open+0x456/0x1120
        [<ffffffffaa7e3729>] vfs_open+0x79/0x360
        [<ffffffffaa832875>] path_openat+0x1de5/0x4390
        [<ffffffffaa834fcc>] do_filp_open+0x19c/0x3c0
        [<ffffffffaa7e44a1>] do_sys_openat2+0x141/0x180
        [<ffffffffaa7e4945>] __x64_sys_open+0xe5/0x1a0
        [<ffffffffac2cc2f7>] do_syscall_64+0xb7/0x210
        [<ffffffffac400130>] entry_SYSCALL_64_after_hwframe+0x77/0x7f

It can be triggered by mouting a subdirectory of a CephFS filesystem,
and then trying to access files on this subdirectory with an auth token
using a path-scoped capability:

    $ ceph auth get client.services
    [client.services]
            key = REDACTED
            caps mds = "allow rw fsname=cephfs path=/volumes/"
            caps mon = "allow r fsname=cephfs"
            caps osd = "allow rw tag cephfs data=cephfs"

    $ cat /proc/self/mounts
    services@[REDACTED].cephfs=/volumes/containers /ceph/containers ceph rw,noatime,name=services,secret=<hidden>,ms_mode=prefer-crc,mount_timeout=300,acl,mon_addr=[REDACTED]:3300,recover_session=clean 0 0

    $ seq 1 1000000 | xargs -P32 --replace={} touch /ceph/containers/file-{} && \
    seq 1 1000000 | xargs -P32 --replace={} cat /ceph/containers/file-{}

[ idryomov: combine if statements, rename rc to path_matched and make
            it a bool, formatting ]

Cc: stable@vger.kernel.org
Fixes: 596afb0b89 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Antoine Viallon <antoine@lesviallon.fr>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-01-15 20:35:06 +01:00
Max Kellermann 550f7ca98e ceph: give up on paths longer than PATH_MAX
If the full path to be built by ceph_mdsc_build_path() happens to be
longer than PATH_MAX, then this function will enter an endless (retry)
loop, effectively blocking the whole task.  Most of the machine
becomes unusable, making this a very simple and effective DoS
vulnerability.

I cannot imagine why this retry was ever implemented, but it seems
rather useless and harmful to me.  Let's remove it and fail with
ENAMETOOLONG instead.

Cc: stable@vger.kernel.org
Reported-by: Dario Weißer <dario@cure53.de>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-12-16 23:25:43 +01:00
Max Kellermann c5cf420303 ceph: fix cred leak in ceph_mds_check_access()
get_current_cred() increments the reference counter, but the
put_cred() call was missing.

Cc: stable@vger.kernel.org
Fixes: 596afb0b89 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-25 15:51:13 +01:00
Max Kellermann 23426309a4 ceph: pass cred pointer to ceph_mds_auth_match()
This eliminates a redundant get_current_cred() call, because
ceph_mds_check_access() has already obtained this pointer.

As a side effect, this also fixes a reference leak in
ceph_mds_auth_match(): by omitting the get_current_cred() call, no
additional cred reference is taken.

Cc: stable@vger.kernel.org
Fixes: 596afb0b89 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-25 15:51:13 +01:00
Patrick Donnelly 50f42c4895 ceph: correct ceph_mds_cap_item field name
The issue_seq is sent with bulk cap releases, not the current sequence
number. See also ceph.git commit 655cddb7c9f3 ("include/ceph_fs: correct
ceph_mds_cap_item field name").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:08:17 +01:00
Dmitry Antipov 3500000bb1 ceph: miscellaneous spelling fixes
Correct spelling here and there as suggested by codespell.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Dr. David Alan Gilbert 6025b482e4 ceph: Remove fs/ceph deadcode
ceph_caps_revoking() has been unused since 2017's commit
3fb99d483e ("ceph: nuke startsync op")

ceph_mdsc_open_export_target_sessions() has been unused since 2013's
commit 11df2dfb61 ("ceph: add imported caps when handling cap export message")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:35 +01:00
Yan Zhen 0039aebfe8 ceph: Fix typo in the comment
Correctly spelled comments make it easier for the reader to understand
the code.

replace 'tagert' with 'target' in the comment &
replace 'vaild' with 'valid' in the comment &
replace 'carefull' with 'careful' in the comment &
replace 'trsaverse' with 'traverse' in the comment.

Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Luis Henriques (SUSE) d97079e97e ceph: fix a memory leak on cap_auths in MDS client
The cap_auths that are allocated during an MDS session opening are never
released, causing a memory leak detected by kmemleak.  Fix this by freeing
the memory allocated when shutting down the MDS client.

Fixes: 1d17de9534 ("ceph: save cap_auths in MDS client when session is opened")
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Xiubo Li adc5246176 ceph: flush all caps releases when syncing the whole filesystem
We have hit a race between cap releases and cap revoke request
that will cause the check_caps() to miss sending a cap revoke ack
to MDS. And the client will depend on the cap release to release
that revoking caps, which could be delayed for some unknown reasons.

In Kclient we have figured out the RCA about race and we need
a way to explictly trigger this manually could help to get rid
of the caps revoke stuck issue.

Link: https://tracker.ceph.com/issues/67221
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:28 +02:00
Xiubo Li c085f6ca95 ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
Prepare for adding a helper to flush the cap releases for all
sessions.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:19 +02:00
Xiubo Li 578eb54c4a ceph: periodically flush the cap releases
The MDS could be waiting the caps releases infinitely in some corner
case and then reporting the caps revoke stuck warning. To fix this
we should periodically flush the cap releases.

Link: https://tracker.ceph.com/issues/57244
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00
Max Kellermann 65d284a38c ceph: use cap_wait_list only if debugfs is enabled
Only debugfs uses this list.  By omitting it, we save some memory and
reduce lock contention on `caps_list_lock`.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00
Xiubo Li 596afb0b89 ceph: add ceph_mds_check_access() helper
This will help check the mds auth access in client side. Always
insert the server path in front of the target path when matching
the paths.

[ idryomov: use u32 instead of uint32_t ]

Link: https://tracker.ceph.com/issues/61333
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-05-23 10:35:46 +02:00
Xiubo Li 1d17de9534 ceph: save cap_auths in MDS client when session is opened
Save the cap_auths, which have been parsed by the MDS, in the opened
session.

[ idryomov: use s64 and u32 instead of int64_t and uint32_t, switch to
            bool for root_squash, readable and writeable ]

Link: https://tracker.ceph.com/issues/61333
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-05-23 10:35:46 +02:00
Xiubo Li 17f8dc2db5 ceph: switch to use cap_delay_lock for the unlink delay list
The same list item will be used in both cap_delay_list and
cap_unlink_delay_list, so it's buggy to use two different locks
to protect them.

Cc: stable@vger.kernel.org
Fixes: dbc347ef7f ("ceph: add ceph_cap_unlink_work to fire check_caps() immediately")
Link: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AODC76VXRAMXKLFDCTK4TKFDDPWUSCN5
Reported-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-04-11 22:56:28 +02:00
Xiubo Li dbc347ef7f ceph: add ceph_cap_unlink_work to fire check_caps() immediately
When unlinking a file the check caps could be delayed for more than
5 seconds, but in MDS side it maybe waiting for the clients to
release caps.

This will use the cap_wq work queue and a dedicated list to help
fire the check_caps() and dirty buffer flushing immediately.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-13 11:22:54 +01:00
Xiubo Li 07045648c0 ceph: always check dir caps asynchronously
The MDS will issue the 'Fr' caps for async dirop, while there is
buggy in kclient and it could miss releasing the async dirop caps,
which is 'Fsxr'. And then the MDS will complain with:

"[WRN] client.xxx isn't responding to mclientcaps(revoke) ..."

So when releasing the dirop async requests or when they fail we
should always make sure that being revoked caps could be released.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-07 14:58:02 +01:00
Linus Torvalds 556e2d17ca Assorted CephFS fixes and cleanups with nothing standing out.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmWqmP8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3cQB/0XJABiPkolqNtd3dSGw8x2YnpS6ciV
 yHxpJViF0+qmnS5l6Vn2lEDr/57h/jts0t3kXUUSDVbitK9glim5ar2FsBeuY7gi
 lQbqhFPfQ+G3APDn2Dn27JYvO1VQLMmvuFJyE4rJ03XZjvOYpq4zM3zPO0jPGvCN
 Gnw0VqPst/h4eobcsFEsHvHuMkkVy6YIOQPsDkiYUShaY6OBUWM4kewrlztmEvaK
 fyuo/FSNmZeEkoc5R7Pfo1FE4PZzfdUie7RmEznxqgHUWFmx2jKZ5TwnCZt1D2av
 dV2e2JWnZUZZL9vAnCQddvnYrj8j+an/IbGZ+0Wa5DZo/eMglDd01VV2
 =kNSw
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Assorted CephFS fixes and cleanups with nothing standing out"

* tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client:
  ceph: get rid of passing callbacks in __dentry_leases_walk()
  ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing
  ceph: fix invalid pointer access if get_quota_realm return ERR_PTR
  ceph: remove duplicated code in ceph_netfs_issue_read()
  ceph: send oldest_client_tid when renewing caps
  ceph: rename create_session_open_msg() to create_session_full_msg()
  ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
  ceph: fix deadlock or deadcode of misusing dget()
  ceph: try to allocate a smaller extent map for sparse read
  libceph: remove MAX_EXTENTS check for sparse reads
  ceph: reinitialize mds feature bit even when session in open
  ceph: skip reconnecting if MDS is not ready
2024-01-19 09:58:55 -08:00
Xiubo Li 6df89bf220 ceph: send oldest_client_tid when renewing caps
Update the oldest_client_tid via the session renew caps msg to
make sure that the MDSs won't pile up the completed request list
in a very large size.

[ idryomov: drop inapplicable comment ]

Link: https://tracker.ceph.com/issues/63364
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Xiubo Li 66207de308 ceph: rename create_session_open_msg() to create_session_full_msg()
Makes the create session msg helper to be more general and could
be used by other ops.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Venky Shankar f48e0342a7 ceph: reinitialize mds feature bit even when session in open
Following along the same lines as per the user-space fix. Right
now this isn't really an issue with the ceph kernel driver because
of the feature bit laginess, however, that can change over time
(when the new snaprealm info type is ported to the kernel driver)
and depending on the MDS version that's being upgraded can cause
message decoding issues - so, fix that early on.

Link: http://tracker.ceph.com/issues/63188
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:50 +01:00
Xiubo Li cbcb358b74 ceph: skip reconnecting if MDS is not ready
When MDS closed the session the kclient will send to reconnect to
it immediately, but if the MDS just restarted and still not ready
yet, such as still in the up:replay state and the sessionmap journal
logs hasn't be replayed, the MDS will close the session.

And then the kclient could remove the session and later when the
mdsmap is in RECONNECT phrase it will skip reconnecting. But the MDS
will wait until timeout and then evict the kclient.

Just skip sending the reconnection request until the MDS is ready.

Link: https://tracker.ceph.com/issues/62489
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:50 +01:00
Al Viro da549bdd15 dentry: switch the lists of children to hlist
Saves a pointer per struct dentry and actually makes the things less
clumsy.  Cleaned the d_walk() and dcache_readdir() a bit by use
of hlist_for_... iterators.

A couple of new helpers - d_first_child() and d_next_sibling(),
to make the expressions less awful.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:32:13 -05:00
Linus Torvalds e21165bfbc Two items:
- support for idmapped mounts in CephFS (Christian Brauner, Alexander
   Mikhalitsyn).  The series was originally developed by Christian and
   later picked up and brought over the finish line by Alexander, who
   also contributed an enabler on the MDS side (separate owner_{u,g}id
   fields on the wire).  The required exports for mnt_idmap_{get,put}()
   in VFS have been acked by Christian and received no objection from
   Christoph.
 
 - a churny change in CephFS logging to include cluster and client
   identifiers in log and debug messages (Xiubo Li).  This would help
   in scenarios with dozens of CephFS mounts on the same node which are
   getting increasingly common, especially in the Kubernetes world.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmVNDyMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziy6xCACmUikhgo+pZDO7oQS7HChdZFz5Q2Jb
 5K9K7sJr7Zb2gFKLAV93cyFrz0JRljmgZuA3DmDTH1omrkrVAJfHJ6Md1UFG3W7o
 LaowP3kECsykOiz+YtVU2957sfoFqds/q6KCXdp1Pc8WfNnL1vysCim6EGYtCqUm
 c6vv0zkvEDQp+3kjTN01LHzHFZdHg8+STNM4BMiB0sO/NqbADnaQBEtDpYrgzwh7
 YPwVIKcXutUmAKlb7vjUF9ptSICYkGV7B+loPS4BDva1I6dT42xqLQOu89tTKU0D
 DiQAfE2oRgU+fl+mMooFJNSqD25Q6IkXrPs0HuiSHS4wtLGqYZAwwfHB
 =F9o1
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

 - support for idmapped mounts in CephFS (Christian Brauner, Alexander
   Mikhalitsyn).

   The series was originally developed by Christian and later picked up
   and brought over the finish line by Alexander, who also contributed
   an enabler on the MDS side (separate owner_{u,g}id fields on the
   wire).

   The required exports for mnt_idmap_{get,put}() in VFS have been acked
   by Christian and received no objection from Christoph.

 - a churny change in CephFS logging to include cluster and client
   identifiers in log and debug messages (Xiubo Li).

   This would help in scenarios with dozens of CephFS mounts on the same
   node which are getting increasingly common, especially in the
   Kubernetes world.

* tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client:
  ceph: allow idmapped mounts
  ceph: allow idmapped atomic_open inode op
  ceph: allow idmapped set_acl inode op
  ceph: allow idmapped setattr inode op
  ceph: pass idmap to __ceph_setattr
  ceph: allow idmapped permission inode op
  ceph: allow idmapped getattr inode op
  ceph: pass an idmapping to mknod/symlink/mkdir
  ceph: add enable_unsafe_idmap module parameter
  ceph: handle idmapped mounts in create_request_message()
  ceph: stash idmapping in mdsc request
  fs: export mnt_idmap_get/mnt_idmap_put
  libceph, ceph: move mdsmap.h to fs/ceph
  ceph: print cluster fsid and client global_id in all debug logs
  ceph: rename _to_client() to _to_fs_client()
  ceph: pass the mdsc to several helpers
  libceph: add doutc and *_client debug macros support
2023-11-10 09:52:56 -08:00
Alexander Mikhalitsyn 673478b6e5 ceph: add enable_unsafe_idmap module parameter
This parameter is used to decide if we allow
to perform IO on idmapped mount in case when MDS lacks
support of CEPHFS_FEATURE_HAS_OWNER_UIDGID feature.

In this case we can't properly handle MDS permission
checks and if UID/GID-based restrictions are enabled
on the MDS side then IO requests which go through an
idmapped mount may fail with -EACCESS/-EPERM.
Fortunately, for most of users it's not a case and
everything should work fine. But we put work "unsafe"
in the module parameter name to warn users about
possible problems with this feature and encourage
update of cephfs MDS.

Suggested-by: Stéphane Graber <stgraber@ubuntu.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Christian Brauner 5ccd8530dd ceph: handle idmapped mounts in create_request_message()
Inode operations that create a new filesystem object such as ->mknod,
->create, ->mkdir() and others don't take a {g,u}id argument explicitly.
Instead the caller's fs{g,u}id is used for the {g,u}id of the new
filesystem object.

In order to ensure that the correct {g,u}id is used map the caller's
fs{g,u}id for creation requests. This doesn't require complex changes.
It suffices to pass in the relevant idmapping recorded in the request
message. If this request message was triggered from an inode operation
that creates filesystem objects it will have passed down the relevant
idmaping. If this is a request message that was triggered from an inode
operation that doens't need to take idmappings into account the initial
idmapping is passed down which is an identity mapping.

This change uses a new cephfs protocol extension CEPHFS_FEATURE_HAS_OWNER_UIDGID
which adds two new fields (owner_{u,g}id) to the request head structure.
So, we need to ensure that MDS supports it otherwise we need to fail
any IO that comes through an idmapped mount because we can't process it
in a proper way. MDS server without such an extension will use caller_{u,g}id
fields to set a new inode owner UID/GID which is incorrect because caller_{u,g}id
values are unmapped. At the same time we can't map these fields with an
idmapping as it can break UID/GID-based permission checks logic on the
MDS side. This problem was described with a lot of details at [1], [2].

[1] https://lore.kernel.org/lkml/CAEivzxfw1fHO2TFA4dx3u23ZKK6Q+EThfzuibrhA3RKM=ZOYLg@mail.gmail.com/
[2] https://lore.kernel.org/all/20220104140414.155198-3-brauner@kernel.org/

Link: https://github.com/ceph/ceph/pull/52575
Link: https://tracker.ceph.com/issues/62217
Co-Developed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Christian Brauner 9c2df2271c ceph: stash idmapping in mdsc request
When sending a mds request cephfs will send relevant data for the
requested operation. For creation requests the caller's fs{g,u}id is
used to set the ownership of the newly created filesystem object. For
setattr requests the caller can pass in arbitrary {g,u}id values to
which the relevant filesystem object is supposed to be changed.

If the caller is performing the relevant operation via an idmapped mount
cephfs simply needs to take the idmapping into account when it sends the
relevant mds request.

In order to support idmapped mounts for cephfs we stash the idmapping
whenever they are relevant for the operation for the duration of the
request. Since mds requests can be queued and performed asynchronously
we make sure to keep the idmapping around and release it once the
request has finished.

In follow-up patches we will use this to send correct ownership
information over the wire. This patch just adds the basic infrastructure
to keep the idmapping around. The actual conversion patches are all
fairly minimal.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li 38d46409c4 ceph: print cluster fsid and client global_id in all debug logs
Multiple CephFS mounts on a host is increasingly common so
disambiguating messages like this is necessary and will make it easier
to debug issues.

At the same this will improve the debug logs to make them easier to
troubleshooting issues, such as print the ino# instead only printing
the memory addresses of the corresponding inodes and print the dentry
names instead of the corresponding memory addresses for the dentry,etc.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li 5995d90d2d ceph: rename _to_client() to _to_fs_client()
We need to covert the inode to ceph_client in the following commit,
and will add one new helper for that, here we rename the old helper
to _fs_client().

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Xiubo Li 197b7d792d ceph: pass the mdsc to several helpers
We will use the 'mdsc' to get the global_id in the following commits.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03 23:28:33 +01:00
Linus Torvalds 14ab6d425e vfs-6.7.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
 okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
 YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
 =m4pB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode time accessor updates from Christian Brauner:
 "This finishes the conversion of all inode time fields to accessor
  functions as discussed on list. Changing timestamps manually as we
  used to do before is error prone. Using accessors function makes this
  robust.

  It does not contain the switch of the time fields to discrete 64 bit
  integers to replace struct timespec and free up space in struct inode.
  But after this, the switch can be trivially made and the patch should
  only affect the vfs if we decide to do it"

* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
  fs: rename inode i_atime and i_mtime fields
  security: convert to new timestamp accessors
  selinux: convert to new timestamp accessors
  apparmor: convert to new timestamp accessors
  sunrpc: convert to new timestamp accessors
  mm: convert to new timestamp accessors
  bpf: convert to new timestamp accessors
  ipc: convert to new timestamp accessors
  linux: convert to new timestamp accessors
  zonefs: convert to new timestamp accessors
  xfs: convert to new timestamp accessors
  vboxsf: convert to new timestamp accessors
  ufs: convert to new timestamp accessors
  udf: convert to new timestamp accessors
  ubifs: convert to new timestamp accessors
  tracefs: convert to new timestamp accessors
  sysv: convert to new timestamp accessors
  squashfs: convert to new timestamp accessors
  server: convert to new timestamp accessors
  client: convert to new timestamp accessors
  ...
2023-10-30 09:47:13 -10:00
Al Viro dc32464a5f ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
Use of dget() after we'd dropped ->d_lock is too late - dentry might
be gone by that point.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-27 20:14:38 -04:00
Jeff Layton c453bdb535
ceph: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-22-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18 13:26:19 +02:00
Xiubo Li ce0d5bd3a6 ceph: make num_fwd and num_retry to __u32
The num_fwd in MClientRequestForward is int32_t, while the num_fwd
in ceph_mds_request_head is __u8. This is buggy when the num_fwd
is larger than 256 it will always be truncate to 0 again. But the
client couldn't recoginize this.

This will make them to __u32 instead. Because the old cephs will
directly copy the raw memories when decoding the reqeust's head,
so we need to make sure this kclient will be compatible with old
cephs. For newer cephs they will decode the requests depending
the version, which will be much simpler and easier to extend new
members.

Link: https://tracker.ceph.com/issues/62145
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-31 14:56:27 +02:00
Xiubo Li e3dfcab208 ceph: drop messages from MDS when unmounting
When unmounting all the dirty buffers will be flushed and after
the last osd request is finished the last reference of the i_count
will be released. Then it will flush the dirty cap/snap to MDSs,
and the unmounting won't wait the possible acks, which will ihold
the inodes when updating the metadata locally but makes no sense
any more, of this. This will make the evict_inodes() to skip these
inodes.

If encrypt is enabled the kernel generate a warning when removing
the encrypt keys when the skipped inodes still hold the keyring:

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

Later the kernel will crash when iput() the inodes and dereferencing
the "sb->s_master_keys", which has been released by the
generic_shutdown_super().

Link: https://tracker.ceph.com/issues/59162
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00