-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmoZxw8ACgkQiiy9cAdy
T1HhnwwAvivw/s84qQhbkgQllNMdb4SPl+Ph+DRMiwyrZjXr36kv8jtiPIRIlplB
Uk+jXpswQXNk6qVKriUzbM1xGTyBin4iFhDzXfLoMmtZtETAmnbHWX9cVFblOibb
o+kMYMRXo+TGvQE5d47VKMioL7W5AUFoXfrIfOvWMhnRBaPwgb/aTblUxLFtHYLw
rhDm24p5JKxHv9YsR5+XWofGP2STstMDgkKBYjqYolmrEaq1ho3qBVQtcGY/DJFT
5heZ/b+Tv8N0s9ccMOAipAW509Qjn3Tml5SvgRCTZ56nEuZHeZBYCoXLhdV1tPG9
iuCPxTKrgFkDOZNSdweZscR5OD3MlbDC103K6W/mDEZk3IIv3ZGYe4atBwiz8kMl
09xvct3UJviHuOWjVgI7TBDV+Y0Gpf7zTeOLfixhn2RrVjU2IwrKUjZBjKGkZAFI
r5YcTK1FOe3a7WwXNYkVXVvTfwqvpIclQCs+qnQqAiEjvBNWvmTtgGg2eOlxEnBo
j4AE8Ryh
=0uMS
-----END PGP SIGNATURE-----
Merge tag 'v7.1-rc6-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- security fix for FSCTL_SET_SPARSE
- fix leak in ksmbd_query_inode_status()
- fix OOB read in smb_check_perm_dacl()
* tag 'v7.1-rc6-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix FSCTL permission bypass by adding a permission check for FSCTL_SET_SPARSE
ksmbd: release ksmbd_inode ref via ksmbd_inode_put on lookup paths
ksmbd: OOB read regression in smb_check_perm_dacl() ACE-walk loops
compiling with W=2 pointed out that "written may be used uninitialized"
Fixes: 20d72b00ca ("netfs: Fix the request's work item to not require a ref")
Cc: stable@vger.kernel.org
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
cifs_copy_folioq_to_iter() copies a requested number of bytes from
a folio queue into the destination iterator. Since the encrypted
SMB2 READ path was changed to pass the server-declared payload
length (data_len) instead of the larger folioq buffer length, the
caller can ask for fewer bytes than the folio queue holds.
In that case the helper continues walking the remaining folios after
data_size has reached zero and calls copy_folio_to_iter() with
len = 0, which is unnecessary work.
The helper also returns 0 (success) when the folio queue is
exhausted before data_size bytes have been copied. The caller has
no way to distinguish that from a full copy and the reported
transfer count ends up larger than the amount of data placed in the
iterator.
Add an early exit when data_size reaches zero, and return an error
when the folio queue is exhausted before all requested bytes have
been copied.
Signed-off-by: Jeremy Erazo <mendozayt13@gmail.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
FSCTL_SET_SPARSE in fsctl_set_sparse() modifies the file's sparse
attribute and saves it through xattr without any permission checks.
This exposes two issues:
1) A client on a read-only share can change the sparse attribute
on files it opened, even though the share is read-only.
Other FSCTL write operations already check
test_tree_conn_flag(work->tcon, KSMBD_TREE_CONN_FLAG_WRITABLE),
but FSCTL_SET_SPARSE does not.
2) Even on writable shares, clients without FILE_WRITE_DATA or
FILE_WRITE_ATTRIBUTES access should not modify the sparse
attribute. Similar handle-level checks exist in other functions
but are missing here.
Add both share-level writable check and per-handle access check.
Use goto out on error to avoid leaking file references.
Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Sean Shen <grayhat@foxmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd_query_inode_status() and ksmbd_lookup_fd_inode() both take a
reference on a ksmbd_inode via __ksmbd_inode_lookup() (which performs
atomic_inc_not_zero()) and later release it using a bare
atomic_dec(&ci->m_count). Unlike ksmbd_inode_put(), a bare
atomic_dec() does not check whether the reference count has reached
zero, so if the caller happens to drop the last reference, the
ksmbd_inode is leaked: it stays in the global inode hash table with
m_count == 0, future __ksmbd_inode_lookup() calls reject it via
atomic_inc_not_zero(), and ksmbd_inode_free() is never invoked.
The race is:
T1: __ksmbd_inode_lookup() -> atomic_inc_not_zero(): m_count = 2
T2: ksmbd_inode_put() -> atomic_dec_and_test(): m_count = 1
(not freed)
T1: atomic_dec(&ci->m_count) -> m_count = 0
return (LEAK)
In ksmbd_lookup_fd_inode() the matched-fp path (which now also uses
ksmbd_inode_put()) cannot currently reach m_count == 0 because the
matched ksmbd_file holds its own reference on ci, but converting it to
the proper API keeps the three call sites consistent and avoids
future regressions if the locking changes.
Because ksmbd_inode_put() may free the ksmbd_inode if this drops the
last reference, the call must happen after up_read(&ci->m_lock) on the
two affected paths in ksmbd_lookup_fd_inode(). On the no-match path
this is a pure reordering; on the matched path ksmbd_fp_get() is
moved above the unlock so that the returned ksmbd_file is pinned
before the inode reference is released.
Signed-off-by: Aleksandr Golovnya <cofedish@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Commit d07b26f392 ("ksmbd: require minimum ACE size in
smb_check_perm_dacl()") introduced a transposed bounds check:
if (offsetof(struct smb_ace, sid) + aces_size < CIFS_SID_BASE_SIZE)
Since offsetof(..sid) is 8 and CIFS_SID_BASE_SIZE is 8, this evaluates
to `aces_size < 0`. Because `aces_size` is always non-negative, this
check becomes dead code and never breaks the loop.
Worse, that commit removed the old 4-byte guard, meaning the loop now
reads `ace->size` (offset 2) even when `aces_size` is 0-3 bytes. This
re-opens a 2-byte heap out-of-bounds (OOB) read past the pntsd allocation
during subsequent SMB2_CREATE operations.
Fix this by properly transposing the comparison to require at least
16 bytes (8-byte offset + 8-byte SID base), matching the correct form
used in smb_inherit_dacl().
Fixes: d07b26f392 ("ksmbd: require minimum ACE size in smb_check_perm_dacl()")
Cc: stable@vger.kernel.org
Signed-off-by: Ali Ganiyev <ali.qaniyev@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Issues reported with v7.1-rc:
- Tighten bounds checking for sunrpc cache hash tables
- Don't report key material in the ftrace log
Issues that need expedient stable backports:
- Fix lockd's implementation of the NLM TEST procedure
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmoV9/kACgkQM2qzM29m
f5fdpxAAtT3hl4wKNJsVLhFlFhG+9ABL74fwaQ06j5vTSIgXqPm12NuO5YbrkC78
ZzV/B/YqoHLAw/t8Pgq2taBBuSeLF+H8JqjJRDYE5H2NB/KQOT8n9KTLZtac4/1V
Dvrk3mP2h12Q//BC3pF9bU9gMR1DO/+yLt9SkH+dtqcW+dWxiyVZWtK0eESIsMfh
IzkHNKOS0edMZmHl5O7VZSlbyq1jPA4hTZT+NCG7JwnK6YqSkpRGDiZdZIT2FBEI
C9a9hZHoP9JAJs9fR+xzTCVsIPpNW9OO3fknR2Lg7IScssVc1GIpqjU+g1O1XSVf
XsMfAl+pEipDBpULu46KM1TDqAKtjaAx8Z+hDmiPxSOCKWuPn/9LMdzwVrzC7Bw8
S7ftOxUZQLHtbS8Y0eECzwK9tdfBUHN26LAJfvg4P5ZOIsFoUj0LeDryPy0r9xxb
aEdEI8wro0O3p0krjtW2i+FJB8dtlKEu19LT6PN4MQtmv5a+DY4Hypt4Xovol0i+
eEugZVmLYE11b52ZFfcTcXf8n89jiWg7rgRBdBdy+vQl/32dKK3SMSIB/zCZYmBc
JZNywtri6JHeJjkohWJ4xmwrmMaDj4hNr3OqWh7bOQTHleg7igpCuy+9/LHzEF6G
BX4DgMJ6LqcdG8p4biGr2I2NF/+MJpXO5kNAdS44wpP983T26WE=
=onHe
-----END PGP SIGNATURE-----
Merge tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Regressions:
- Tighten bounds checking for sunrpc cache hash tables
- Don't report key material in the ftrace log
Stable fix:
- Fix lockd's implementation of the NLM TEST procedure"
* tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
lockd: fix TEST handling when not all permissions are available.
NFSD: Report whether fh_key was actually updated
sunrpc: prevent out-of-bounds read in __cache_seq_start()
post-7.1 issues or aren't considered suitable for backporting.
All patches are singletons - please see the individual changelogs for
details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCahTZ2QAKCRDdBJ7gKXxA
ju+UAQDUga+l95O1iOnrraKFWvT1ghQKTgbNxGMwefHjVLLFBQD+Ln2wPfz73Ks7
H8WK0k5D0g+6lKs6tFGAALdQnTU0BAU=
=MYsv
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
post-7.1 issues or aren't considered suitable for backporting.
All patches are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "mm: introduce a new page type for page pool in page type"
mm/vmalloc: do not trigger BUG() on BH disabled context
MAINTAINERS, mailmap: change email for Eugen Hristev
mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
kernel/fork: validate exit_signal in kernel_clone()
mm: memcontrol: propagate NMI slab stats to memcg vmstats
mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
zram: fix use-after-free in zram_writeback_endio
memfd: deny writeable mappings when implying SEAL_WRITE
ipc: limit next_id allocation to the valid ID range
Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
MAINTAINERS: .mailmap: update after GEHC spin-off
-----BEGIN PGP SIGNATURE-----
iQJPBAABCgA5FiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmoRu2cbFIAAAAAABAAO
bWFudTIsMi41KzEuMTIsMiwyAAoJEMVl1fnXbVg7jk0P/iokJwhO+YP7rdnDPAl6
wXYq/e4HxSl65//wwkR2Q7zc/GWOSdGZoP7fHCtXY52RERDKq2xrkQ+fBtnEMvS3
N6bwM0kVBoGPkXRHrf55SdfH7ITldeVAYIRIajO5bVT/j/F8l9s7ivXmI0Ep1xgZ
Eiv5MuOiW/kPLzYW1pHP+UniaUhIgQGTRcHs7NotAn55q2odnLdddRWx8NGT1kAe
Owydf6c0/B/+7NLhTLQl/w4WmeFL3OR0b0HuHiVYBNuQBkgCxwcUsERfPjnWpAbr
Pll32JKmJxH1Rthr8qA++Xv72D31VNAYVwxyieq/kPSFg6rwjcKw2lLdFKv99fCT
3OcDg0N9X20RK4ZcyMSwiCkS92DFStmVy4FtVIdNXbgpRcbw/jHNB6CnFq+RHOVU
wBNYdLte7zSmroDSQ/U2l/xY0n1KCf/KcYBxnkIkri6fdl/f2o8/monPfbsUvYiK
0qI3ODSomBpQRU0vYddJ28KfEx0iHqSQzmDyRFDlDuNb7M24d5W82jWLf60Nlk4h
ngehWVaVvLm8y4YiRteD10TGD7ClBE6ilu0t0dS2ys7o7stIAuXbjIP435tYz2T4
B0ddujn7S0mwNCoT+5yRfmxPQFJpyt93jU65VTJ95Mc7Pg43/D6b5ju7tvZlVdNw
NT4nY8sOiLy1KR72SvguXPSr
=hKb2
-----END PGP SIGNATURE-----
Merge tag 'for-7.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A batch of fixes to simple quotas:
- add conditional rescheduling point not dependent on the lock during
inode iterations to avoid delays with PREEMPT_NONE enabled
- fix subvolume deletion so it does not break the squota invariants
- properly handle enabling squota, tracking extents in the initial
transaction
- catch and warn about underflows, clamp to zero to avoid further
problems
And one fix to inode size handling:
- fix handling of preallocated extents beyond i_size when not using
the no-holes feature"
* tag 'for-7.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: swallow btrfs_record_squota_delta() ENOENT
btrfs: clamp to avoid squota underflow
btrfs: fix squota accounting during enable generation
btrfs: check for subvolume before deleting squota qgroup
btrfs: always drop root->inodes lock before cond_resched()
btrfs: mark file extent range dirty after converting prealloc extents
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCahG3QAAKCRBcsMJ8RxYu
Y92cAYC0vDsgBcwRgeLBhrEHmzvkNBwKJAqAvXzYoxP/ux+N7hvS3k7kieXXTmmw
v/fX7AMBf2C9hjNn6la4uRxR0kkQWneqQWAvkxiy/584xnCheCFR3D+olAgh0ySp
BWRCCbEzwg==
=lIaC
-----END PGP SIGNATURE-----
Merge tag 'xfs-fixes-7.1-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Carlos Maiolino:
"A single fix for a race in xfs buffer cache which may lead to
filesystem shutdown due to inconsistent metadata if the buffer
lookup happens to find an old dead buffer still in the cache"
* tag 'xfs-fixes-7.1-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix a buffer lookup against removal race
- Remove the software node on platform device release(); without this,
the software node remains registered after the device is gone and a
subsequent platform_device_register_full() reusing the same node fails
with -EBUSY
- In sysfs_update_group(), do not remove a pre-existing directory when
create_files() fails; the previous code would silently destroy a sysfs
group that the caller did not create
- Set fwnode->secondary to NULL in fwnode_init() to avoid dereferencing
uninitialized memory (e.g. in dev_to_swnode()) when the firmware node
is allocated on the stack or via a non-zeroing allocator
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCahGkDAAKCRBFlHeO1qrK
Ls4tAP0Ti0VhRJOHz2h3boyp0tXH1bIt082073oJGasHAjnoeQEAg0zbdA2Qkj9o
DDv5iX6KYGkeFUzv8MjuWUexc5HlYQ0=
=Lc0O
-----END PGP SIGNATURE-----
Merge tag 'driver-core-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:
- Remove the software node on platform device release(); without this,
the software node remains registered after the device is gone and a
subsequent platform_device_register_full() reusing the same node
fails with -EBUSY
- In sysfs_update_group(), do not remove a pre-existing directory when
create_files() fails; the previous code would silently destroy a
sysfs group that the caller did not create
- Set fwnode->secondary to NULL in fwnode_init() to avoid dereferencing
uninitialized memory (e.g. in dev_to_swnode()) when the firmware node
is allocated on the stack or via a non-zeroing allocator
* tag 'driver-core-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
device property: set fwnode->secondary to NULL in fwnode_init()
sysfs: don't remove existing directory on update failure
driver core: platform: remove software node on release()
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Lines starting with '#' will be ignored.
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCahG0TgAKCRBcsMJ8RxYu
Y18FAX0cD9LJOQetOPnFHYl6AdO4f0OmaBnjbyF19iRQjrLo9GthdiQf7WXD6QaG
b00H18ABgMYDhTO70T1zH2KkpaaVBOKHT9oFQ+Bzki6R1J+rCU2ZXOiQARmFLvd3
xE25p55FiQ==
=UWBN
-----END PGP SIGNATURE-----
Merge tag 'xfs-fixes-7.1-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux into test_merge
xfs: fixes for v7.1-rc5
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Lines starting with '#' will be ignored.
- Avoid potential overflow when converting a zonefs file number string
to an inode number (from Johannes)
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCahAOdQAKCRDdoc3SxdoY
dnxHAPwMJClZOV6J0RtQqK2zxoDMLGIcE+z0MHq3stFbJBcjWgEA9fjB0rklUwaW
saPUjQUTj/mZJJYmce1MrXI0qYjXxAs=
=J44t
-----END PGP SIGNATURE-----
Merge tag 'zonefs-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
- Avoid potential overflow when converting a zonefs file number string
to an inode number (from Johannes)
* tag 'zonefs-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: handle integer overflow in zonefs_fname_to_fno
This reverts commit ea52cb24cd ("mm/hugetlbfs: update hugetlbfs to use
mmap_prepare") with conflict resolution to account for changes in commit
ea52cb24cd ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare").
The patch incorrectly handled hugetlb VMA lock allocation at the
mmap_prepare stage, where a failed allocation occurring after mmap_prepare
is called might result in the lock leaking.
There is no risk of a merge causing a similar issues, as
VMA_DONTEXPAND_BIT is set for hugetlb mappings.
As a first step in addressing this issue, simply revert the change so we
can rework how we do this having corrected the underlying issues.
We maintain the VMA flags changes as best we can, accounting for the fact
that we were working with a VMA descriptor previously and propagating
like-for-like changes for this.
Note that we invoke vma_set_flags() and do not call vma_start_write() as
vm_flags_set() does. This is OK as it's being done in an .mmap hook where
the VMA is not yet linked into the tree so nobody else can be accessing
it.
Link: https://lore.kernel.org/20260512160643.266960-1-ljs@kernel.org
Fixes: ea52cb24cd ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare")
Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
Closes: https://lore.kernel.org/linux-mm/20260425070700.562229-1-25181214217@stu.xidian.edu.cn/
Acked-by: Muchun Song <muchun.song@linux.dev>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce smb_validate_ntsd_sid() helper to safely validate Owner SID
and Group SID inside the NT Security Descriptor (smb_ntsd) retrieved
from the parent directory.
Cc: stable@vger.kernel.org
Signed-off-by: Junyi Liu <moss80199@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
After a durable reconnect succeeds, ksmbd_reopen_durable_fd() republishes
the same ksmbd_file into the session volatile-id table. If smb2_open()
then takes a later error path, cleanup first calls ksmbd_fd_put(work, fp)
and then unconditionally calls ksmbd_put_durable_fd(dh_info.fp).
In this case fp and dh_info.fp are the same object. The first put drops the
reconnect lookup reference, but the final durable put can run
__ksmbd_close_fd(NULL, fp). Because the final close is not session-aware,
it can free the file object without removing the volatile-id entry that was
just published into the session table.
Use the session-aware put for the final reconnect drop when the reconnect
had already succeeded and the error path is cleaning up the republished
file. Earlier reconnect failures, before fp is assigned to dh_info.fp, keep
using the durable-only put path.
Fixes: 1baff47b81 ("ksmbd: fix use-after-free in smb2_open during durable reconnect")
Signed-off-by: Junyi Liu <moss80199@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The F_GETLK fcntl can work with either read access or write access or
both. It can query F_RDLCK and F_WRLCK locks in either case.
However lockd currently treats F_GETLK similar to F_SETLK in that read
access is required to query an F_RDLCK lock and write access is required
to query a F_WRLCK lock.
This is wrong and can cause problems - e.g. when qemu accesses a
read-only (e.g. iso) filesystem image over NFS (though why it queries
if it can get a write lock - I don't know. But it does, and this works
with local filesystems).
So we need TEST requests to be handled differently. To do this:
- change nlm_do_fopen() to accept O_RDWR as a mode and in that case
succeed if either a O_RDONLY or O_WRONLY file can be opened.
- change nlm_lookup_file() to accept a mode argument from caller,
instead of deducing base on lock time, and pass that on to nlm_do_fopen()
- change nlm4svc_retrieve_args() and nlmsvc_retrieve_args() to detect
TEST requests and pass O_RDWR as a mode to nlm_lookup_file, passing
the same mode as before for other requests. Also set
lock->fl.c.flc_file to whichever file is available for TEST requests.
- change nlmsvc_testlock() to also not calculate the mode, but to use
whatever was stored in lock->fl.c.flc_file.
This behaviour of lockd - requesting O_WRONLY access to TEST for
exclusive locks - has been present at least since git history began.
However it was hidden until recently because knfsd ignored the access
requested by lockd and required only READ access for all locking
requests (unless the underlying filesystem provided an f_op->open
function which checked access permissions).
The commit mentioned in Fixes: below changed nfsd_permission() to NOT
override the access request for LOCK requests and this exposed the bug
that we are now fixing.
Note that there is another issue that this patch does not address.
The flock(.., LOCK_EX) call is permitted on a read-only file descriptor.
Linux NFS maps this to NLM locking as whole-file byte-range locks.
nfsd will see this as though it were fcntl( F_SETLK (F_WRLCK)) and will
now require write access, which it might not be able to get.
It is not clear if this is a problem in practice, or what the best
solution might be. So no attempt is made to address it.
Reported-by: Tj <tj.iam.tj@proton.me>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1128861
Fixes: 4cc9b9f2bf ("nfsd: refine and rename NFSD_MAY_LOCK")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The nfsd_ctl_fh_key_set tracepoint was introduced to capture
operator activity on the filehandle signing key. Earlier revisions
logged the key bytes verbatim; the version that landed hashes the
16 key bytes through crc32_le and stores the result.
CRC32 is a linear projection of its input rather than a one-way
function, and truncating any hash of fixed-size secret material
leaves the key recoverable under offline brute force when the
threat model includes an attacker with access to the trace ring.
The operational question the fingerprint was meant to answer is
whether a NFSD_CMD_THREADS_SET call that carries an
NFSD_A_SERVER_FH_KEY attribute actually replaced the active key or
re-installed the value already in place. Answer that question
directly: compare the incoming key bytes against the current
nn->fh_key inside nfsd_nl_fh_key_set() and surface a single bit to
the tracepoint. The event now prints "updated" when the stored
key changed and "unmodified" otherwise. A first set that fails
kmalloc reports "unmodified" because no key was installed.
Reported-by: jaeyeong <fin@spl.team>
Fixes: 62346217fd ("NFSD: Add a key for signing filehandles")
Cc: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently, the macro DUP_CTX_STR allocates new_ctx->field using
GFP_ATOMIC. DUP_CTX_STR is only used in smb3_fs_context_dup(), which
is never called in an atomic context. Using GFP_ATOMIC puts unnecessary
pressure on emergency memory pools.
Change GFP_ATOMIC to GFP_KERNEL.
Signed-off-by: Fredric Cover <fredric.cover.lkernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CIFS_GENL_CMD_SWN_NOTIFY is the userspace witness-notify command. The
intended sender is the cifs.witness helper, but the generic-netlink
operation currently has no capability flag, so any local process can send
RESOURCE_CHANGE or CLIENT_MOVE notifications to the in-kernel witness
handler.
The same family exposes CIFS_GENL_MCGRP_SWN without multicast-group
capability flags. Register messages sent to that group include the witness
registration id and, for NTLM-authenticated mounts, the username, domain,
and password attributes copied from the CIFS session. An unprivileged
local process should not be able to join that group and receive those
messages.
Require CAP_NET_ADMIN for incoming SWN_NOTIFY commands with
GENL_ADMIN_PERM, and require CAP_NET_ADMIN over the network namespace for
joining the SWN multicast group with GENL_MCAST_CAP_NET_ADMIN. The
cifs.witness service runs with the privileges needed for both operations.
Fixes: fed979a7e0 ("cifs: Set witness notification handler for messages from userspace daemon")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Steve French <stfrench@microsoft.com>
Unless smbdirect_connection_legacy_debug_proc_show()
wants to debug-log keep_alive_interval as microseconds,
a magnitude higher precision than available by the way,
keepalive_interval_msec should not be multiplied by 1000.
Fixes: cc55f65dd3 ("smb: client: make use of common smbdirect_socket_parameters")
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Since commit 340cea84f6 ("cifs: open files should not hold ref on
superblock"), cifs file only holds the dentry ref_cnt, the cifs file
close work(cfile->deferred) could be executed after unmounting, which
will trigger a warning in generic_shutdown_super:
BUG: Dentry 00000000a14a6845{i=c,n=file} still in use (1) [unmount of
cifs cifs]
The detailed processs is:
process A process B kworker
fd = open(PATH)
vfs_open
file->__f_path = *path // dentry->d_lockref.count = 1
cifs_open
cifs_new_fileinfo
cfile->dentry = dget(dentry) // dentry->d_lockref.count = 2
close(fd)
__fput
cifs_close
queue_delayed_work(deferredclose_wq, cfile->deferred)
dput(dentry) // dentry->d_lockref.count = 1
smb2_deferred_work_close
_cifsFileInfo_put
list_del(&cifs_file->flist)
umount
cleanup_mnt
deactivate_super
cifs_kill_sb
cifs_close_all_deferred_files_sb
cifs_close_all_deferred_files
// cannot find cfile, skip _cifsFileInfo_put
kill_anon_super
generic_shutdown_super
shrink_dcache_for_umount
umount_check
WARN ! // dentry->d_lockref.count = 1
cifsFileInfo_put_final
dput(cifs_file->dentry)
// dentry->d_lockref.count = 0
Fix it by flushing 'deferredclose_wq' before calling kill_anon_super.
Fetch a reproducer in https://bugzilla.kernel.org/show_bug.cgi?id=221548.
Fixes: 340cea84f6 ("cifs: open files should not hold ref on superblock")
Cc: stable@vger.kernel.org
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a buffer is freed either by LRU eviction or because it is unset,
the lockref is marked as dead instantly, which prevents the buffer from
being used after finding it in the buffer hash in xfs_buf_lookup and
xfs_buf_find_insert. But the latter will then not add the new buffer to
the hash because it already found an existing buffer.
Fix this using in two places: Remove the buffer from the hash before
marking the lockref dead so that that no buffer with a dead lockref can
be found in the hash, but if we find one in xfs_buf_find_insert due to
store reordering, handle this case correctly instead of returning an
unhashed buffer.
Fixes: 67fe430397 ("xfs: don't keep a reference for buffers on the LRU")
Reported-by: Andrey Albershteyn <aalbersh@redhat.com>
Reported-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When sysfs_update_group() is called for a named group and create_files()
fails (e.g. -ENOMEM), internal_create_group() calls kernfs_remove(kn) on
the group directory. In the update path, kn was obtained via
kernfs_find_and_get() and refers to a directory that already existed
before this call. Removing it silently destroys a sysfs group that the
caller did not create.
Only remove the directory if we created it ourselves. On update failure
the directory remains as it is left empty by remove_files() inside
create_files(), but can be repopulated by a retry.
Cc: Rajat Jain <rajatja@google.com>
Fixes: c855cf2759 ("sysfs: Fix internal_create_group() for named group updates")
Cc: stable <stable@kernel.org>
Assisted-by: gkh_clanker_t1000
Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/2026052003-uniquely-hastily-c093@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bb88e8da00 ("erofs: use meta buffers for xattr operations")
converted xattr operations to use on-stack erofs_buf instances.
erofs_init_inode_xattrs() uses such a metabuf while reading the inline
xattr header and shared xattr id array.
Some error paths after erofs_read_metabuf() leave through out_unlock
without dropping the metabuf, so the folio reference can leak.
Consolidate the cleanup at out_unlock. erofs_put_metabuf() is a
no-op if no folio has been acquired, and this keeps all paths after
taking EROFS_I_BL_XATTR_BIT covered by a single cleanup site.
Fixes: bb88e8da00 ("erofs: use meta buffers for xattr operations")
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Fixes: bb88e8da00 ("erofs: use meta buffers for xattr operations")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
After unaligned compressed extents were introduced, the following race
could occur:
[Thread 1] [Thread 2]
(z_erofs_fill_bio_vec)
<handle a Z_EROFS_PREALLOCATED_FOLIO folio>
...
filemap_add_folio (1)
(z_erofs_bind_cache)
<the same folio is found..>
..
..
folio_attach_private (2)
filemap_add_folio (3) again
Since (1) is executed but (2) hasn't been executed yet, it's possible
that another thread finds the same managed folio in z_erofs_bind_cache()
for a different pcluster and calls filemap_add_folio() again since
folio->private is still Z_EROFS_PREALLOCATED_FOLIO.
Fix this by explicitly clearing folio->private before making the folio
visible in the managed cache so that another pcluster can simply wait
on the locked managed folio as what we did for other shared cases [1].
This only impacts unaligned data compression (`-E48bit` with zstd,
for example).
[1] Commit 9e2f9d34dd ("erofs: handle overlapped pclusters out of
crafted images properly") was originally introduced to handle crafted
overlapped extents, but it addresses unaligned extents as well.
Fixes: 7361d1e376 ("erofs: support unaligned encoded data")
Reported-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Closes: https://lore.kernel.org/r/4a2f3801-fac1-42fe-ae75-da315822e088@salutedevices.com
Tested-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmoGfMUACgkQiiy9cAdy
T1EoHgv/R82Vg16H+cNsL+o9vY4O/so59lbqLnkTAFDKhrNR5wPWjWuomQ/8Kq9d
u4Atc6bt/cqvhjNSrI46xlecZt+ArpmXXtW0fREYmrFPdIJaQFmjA5jjabI3jxVT
QTmwXFKdB2MhUIQIXBByrko/AIqpzJkrOM9EAP/0zHvuahBBEJUs2k+IjuI2yNhD
LVPTMrOlSqWGYVSPsqnLnUpa//sHV3NBSsonKQOTtvQYdacRzY+20AbSiczkMxmo
JOR839XkaxP0nUfUIxBtwguNvOgxKfJ+X4nEiKVA9cDo0yK9djOm8SeQshz/YmOq
1BXB94DiaWQRtkSRAs3XqgXfT53EiA0xa6DBg4JIBllAuGGlCyoa9Db4rzWGeY+/
S2zfGxxDOeAH0pddEqUWLfcgufyBTK5on20+YYQZx6njn3NYatMI/HC5KfgsfrTJ
i+/r06/foVQrbIA6pGeaCwwyQvNar8pS8SHbARmaC4N1jvgoMVEvs7tv/pPKA2Bt
at23kXHC
=TmyE
-----END PGP SIGNATURE-----
Merge tag 'v7.1-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix two null pointer dereferences and a memory leak
* tag 'v7.1-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix null pointer dereference in compare_guid_key()
ksmbd: fix null pointer dereference in proc_show_files()
ksmbd: fix SID memory leak in set_posix_acl_entries_dacl() on overflow
- Check the index depth limit via ntfs_icx_parent_inc(), avoiding context
corruption from excessively deep child chains.
- Switch security descriptor allocation to kzalloc() to avoid leaking
uninitialized memory.
- Prevent an inconsistent state where vol->volume_label becomes NULL on
allocation failure.
- Validate MFT records by verifying that attrs_offset sits within
bytes_in_use.
- Fix an off-by-one boundary comparison, correctly catching the
out-of-range MFT record number
- Validate the attribute name offset and length bounds prior to AT_UNUSED
enumeration.
- Check for a valid left neighbor before runlist merges to prevent an
8byte out-of-bounds write on crafted volumes.
- Add the missing record comparison against $MFTMirr during mount.
- Fix wrong inode lookup when writing extent MFT records.
- Redirty folio on memory allocation failure in ntfs_write_mft_block().
- Capture and propagate $MFTMirr sync errors during writeback.
- Ensure MFT mirror and synchronous writes wait for I/O completion.
- Fix buffer overflow/heap over-read in ntfs_bdev_write() when cluster
size is smaller than PAGE_SIZE.
- Fix use-after-free in ntfs_inode_sync_filename() when parent index inode
is evicted while still holding its mrec_lock.
- Update resident attribute length validation to match $AttrDef.
- Fix refcount underflow and UAF of the global upcase table.
- Fix two smatch warnings.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmoMMfoWHGxpbmtpbmpl
b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCFr9D/9gDFxn25jF6HM5dJ1TAQMCxjBE
UqOUtGpZkifQTc2860DnVavkg1RhIRR6cZMkY+5QmXqCr0464PpwWMOxaKrltITY
5E2eRV3PDfI3VL04eW2XwWPbruCpcFS7BF/A4SVOj5XNBVRbaLao1VRadDyZzAEN
Lua3RUUjHfz5PUjLrd4joa6zkuubYV135AB9jBzJkdAsFq5r1F4vi0jI14ozhm4j
BAlXgcJusGnPtNVfmCmUu/Ve3v6uM79sDlhBqoFSMccgV0FT+3KEl4TX8noNDXYF
fLk75EZvESR9rwb214OIdYesE6tXjP6Dy+pwatsXbk/7WXOuitgbtQ+nYnbG26Dl
/HcjBkuakq9W8Z99VRwctyjOQJydTQGWgINZXNM6yvJjCryADKtI9Yakj4y6+QL4
zQ9hyKkxoDLDif0XX5jTcaXuZ6fyDsL0tQr/QnH0vsjJE0A3gMnJfPnXej+yNVeP
bTeJowB68L0oV9/FTU6KDCTKf5YcWpbGeoisGL18PzUgEltc9DmbSkDObwpVTOp/
M/b4y1qltjvs8LzHFQWJWtfAJ8Ut0UyW5efNGMBC3ou5DRGxQeIIOmqGVkqGFHGy
K6kJBG3DnSW/trX1vd8tb1x+7EObwKt16VbHK9gutoO6aH9AqUedDWO07DnkSZyn
4c8CHBYnGRteWqbp8A==
=NZAT
-----END PGP SIGNATURE-----
Merge tag 'ntfs-for-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/ntfs
Pull ntfs fixes from Namjae Jeon:
- Check the index depth limit via ntfs_icx_parent_inc(), avoiding
context corruption from excessively deep child chains
- Switch security descriptor allocation to kzalloc() to avoid leaking
uninitialized memory
- Prevent an inconsistent state where vol->volume_label becomes NULL on
allocation failure
- Validate MFT records by verifying that attrs_offset sits within
bytes_in_use
- Fix an off-by-one boundary comparison, correctly catching the
out-of-range MFT record number
- Validate the attribute name offset and length bounds prior to
AT_UNUSED enumeration
- Check for a valid left neighbor before runlist merges to prevent an
8byte out-of-bounds write on crafted volumes
- Add the missing record comparison against $MFTMirr during mount
- Fix wrong inode lookup when writing extent MFT records
- Redirty folio on memory allocation failure in ntfs_write_mft_block()
- Capture and propagate $MFTMirr sync errors during writeback
- Ensure MFT mirror and synchronous writes wait for I/O completion
- Fix buffer overflow/heap over-read in ntfs_bdev_write() when cluster
size is smaller than PAGE_SIZE
- Fix use-after-free in ntfs_inode_sync_filename() when parent index
inode is evicted while still holding its mrec_lock
- Update resident attribute length validation to match $AttrDef
- Fix refcount underflow and UAF of the global upcase table
- Fix two smatch warnings
* tag 'ntfs-for-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/ntfs:
ntfs: restore $MFT mirror contents check
ntfs: fix empty_buf and ra lifetime bugs in ntfs_empty_logfile()
ntfs: validate attribute name bounds before returning it
ntfs: fix MFT bitmap scan 2^32 boundary check
ntfs: validate MFT attrs_offset against bytes_in_use
ntfs: fix missing kstrdup() error check in ntfs_write_volume_label()
ntfs: avoid leaking uninitialised bytes in new security descriptors
ntfs: fix out-of-bounds write in ntfs_index_walk_down()
ntfs: fix out-of-bounds write in ntfs_rl_collapse_range() merge path
ntfs: fix variable dereferenced before check ni in ntfs_attr_open()
ntfs: fix default_upcase refcount underflow and UAF on fs_context teardown
ntfs: match ntfs_resident_attr_min_value_length with $AttrDef
ntfs: avoid use-after-free of index inode in ntfs_inode_sync_filename()
ntfs: fix copy length in ntfs_bdev_write() for non-page-aligned start
ntfs: wait for sync mft writes to complete
ntfs: capture mft mirror sync errors in ntfs_write_mft_block()
ntfs: redirty folio when ntfs_write_mft_block() runs out of memory
ntfs: use base mft_no when looking up base inode for extent record
ntfs: fix variable dereferenced before check ni and attr in ntfs_attrlist_entry_add()
In handle_read_data() the encrypted/folioq branch
(buf_len <= data_offset, reached via receive_encrypted_read for
transform PDUs > CIFSMaxBufSize + MAX_HEADER_SIZE) copies the READ
payload using buffer_len rather than data_len:
rdata->result = cifs_copy_folioq_to_iter(buffer, buffer_len,
cur_off,
&rdata->subreq.io_iter);
...
rdata->got_bytes = buffer_len;
buffer_len comes from the SMB3 transform header OriginalMessageSize
field (OriginalMessageSize - read_rsp_size); it represents the size
of the decrypted message after the SMB2 header. data_len comes from
the SMB2 READ response DataLength field; it represents the actual
READ payload size and may be smaller than buffer_len when the
decrypted message contains padding or other trailing bytes after the
READ payload. The existing check `data_len > buffer_len - pad_len`
only enforces an upper bound, so a server that emits
OriginalMessageSize larger than read_rsp_size + pad_len + data_len
passes the check and the kernel copies buffer_len bytes per response,
ignoring the server-asserted DataLength.
Two observable failures with a crafted server (DataLength=4,
buffer_len=20000):
- the kernel returns 20000 bytes per sub-request to userspace and
sets got_bytes = buffer_len, even though the response claimed
only 4 bytes of payload;
- on a partial netfs sub-request whose iterator is sized to
data_len, the over-large copy_folio_to_iter() short-reads,
cifs_copy_folioq_to_iter() returns -EIO via the n != len path,
and the entire netfs read collapses to -EIO even though the
leading sub-requests succeeded.
Use data_len for the copy length and for got_bytes so the kernel
honours the server-asserted READ payload size. For well-formed
servers (where buffer_len == pad_len + data_len) the change is
behaviour-equivalent.
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Erazo <mendozayt13@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
cifs.spnego key descriptions contain authority-bearing fields such as
pid, uid, creduid, and upcall_target that cifs.upcall treats as
kernel-originating inputs. However, userspace can also create keys of
this type through request_key(2) or add_key(2), allowing those fields to
be supplied without CIFS origin.
Only accept cifs.spnego descriptions while CIFS is using its private
spnego_cred to request the key.
Fixes: f1d662a7d5 ("[CIFS] Add upcall files for cifs to use spnego/kerberos")
Assisted-by: avom-custom-harness:gpt-5.5-qwen3.6-mod-mix
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Asim Viladi Oglu Manizada <manizada@pm.me>
Signed-off-by: Steve French <stfrench@microsoft.com>
Commit 96c4af4185 ("cifs: Fix locking usage for tcon fields")
refactored cifs code to change cifs_tcp_ses_lock for tc_lock around
tc_count changes.
There was missing lock around tc_count increment inside
smb2_find_smb_sess_tcon_unlocked().
Cc: stable@vger.kernel.org
Fixes: 96c4af4185 ("cifs: Fix locking usage for tcon fields")
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Please consider pulling these changes from the signed vfs-7.1-rc5.fixes tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCagq67gAKCRCRxhvAZXjc
ooRHAP0Scrpsiloo7JPM1u0DZZwvTdb9JRlx6k/KXkeN0j5L/wD9FVA9AXarcta5
h37k+SZpz8FuWkoY5LxTvUNbV6mr0w0=
=Enhi
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.1-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"This contains a fixes for the current development cycle. Note that AI
related review sometimes delays fixes a bit because we find more fixes
for the fixes. I might try and send smaller but more fixes PRs if this
trend keeps up.
- Fix various netfslib bugs
- Fix an out-of-bounds write when listing idmappings
- Fix the return values in jfs_mkdir() and orangefs_mkdir()
- Fix a writeback writeback array overflow in fuse
- Fix a forced iversion increment on lazytime timestamp updates
- Reject a negative timeval component in kern_select()
- Fix error return when vfs_mkdir() fails in the cachefiles code
- Fix wrong error code returned for pidns ioctls"
* tag 'vfs-7.1-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
cachefiles: Fix error return when vfs_mkdir() fails
afs: Fix the locking used by afs_get_link()
netfs, afs: Fix write skipping in dir/link writepages
netfs: Fix netfs_read_folio() to wait on writeback
netfs: Fix folio->private handling in netfs_perform_write()
netfs: Fix partial invalidation of streaming-write folio
netfs: Fix potential UAF in netfs_unlock_abandoned_read_pages()
netfs: Fix leak of request in netfs_write_begin() error handling
netfs: Fix early put of sink folio in netfs_read_gaps()
netfs: Fix write streaming disablement if fd open O_RDWR
netfs: Fix read-gaps to remove netfs_folio from filled folio
netfs: Fix potential deadlock in write-through mode
netfs: Fix streaming write being overwritten
netfs: Defer the emission of trace_netfs_folio()
netfs: Fix netfs_invalidate_folio() to clear dirty bit if all changes gone
netfs: Fix overrun check in netfs_extract_user_iter()
netfs: fix error handling in netfs_extract_user_iter()
netfs: Fix potential uninitialised var in netfs_extract_user_iter()
netfs: fix VM_BUG_ON_FOLIO() issue in netfs_write_begin() call
netfs: Fix zeropoint update where i_size > remote_i_size
...
I thought that it was likely I could harden squota deletion to the point
that it was impossible to end up with an extent accounted to a qgroup
outliving its qgroup. Several recent bugs have made me re-consider that
position.
Ultimately, this is a tradeoff between short term stability and long
term strictness, but I think given that there could be another layer of
bugs behind the 2-3 I just fixed, I would feel much more confident in
people using squotas if the risk was "your values can get a bit out of
whack which you can fix by deleting stuff or
disabling/re-enabling/repairing" vs "it will abort your filesystem".
As the final nail in the coffin, the Meta production kernel was lacking
earlier fixes from me and Qu regarding subvol qgroup lifetime, so this
is what we have been testing at scale, so I think at least for now
upstream should have the same extra layer of protection.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simple quota accounting can undercount metadata tree block allocations
in certain scenarios. When an undercounted subvolume is deleted and its
tree blocks freed, the free deltas decrement rfer/excl past zero,
wrapping the u64 to a value near U64_MAX.
Once wrapped, can_delete_squota_qgroup() sees non-zero rfer and refuses
to delete the qgroup. The qgroup becomes permanently orphaned in the
quota tree, since there is no subvolume left to generate frees that
would bring the counter back to zero.
While we ultimately want to fix any mis-accounting at the source, it is
also helpful and worthwhile to mitigate the damage by clamping rfer and
excl to zero on underflow rather than allowing the u64 to wrap. This at
least allows us to clean up the messed up qgroups on subvol deletion.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first transaction that enables squotas is special and a bit tricky.
We have to set BTRFS_FS_QUOTA_ENABLED after the transaction to avoid a
deadlock, so any delayed refs that run before we set the bit are not
squota accounted. For data this is fine, we don't get an owner_ref, so
there is no real harm, it's as if the extent predated squotas. However
for metadata, the tree block will have gen == enable_gen so when we free
it later, we will decrement the squota accounting, which can result in
an underflow. Before it is freed, btrfs check shows errors, as we have
mismatched usage between the node generations/owners and the squota
values.
There are two angles to this fix:
1. For extents that come in delayed_refs that run during the
enable_gen transaction, we must actually set enable_gen to the *next*
transaction. That is the first transaction that we can really
properly account in any way.
2. For extents that come in between the end of our transaction handle
and the time we set the BTRFS_FS_QUOTA_ENABLED bit, we need an
additional bit, BTRFS_FS_SQUOTA_ENABLING which only affects recording
squota deltas, so we do pick up those extents. Otherwise, we would
miss them, even for enable_gen + 1.
Fixes: bd7c1ea3a3 ("btrfs: qgroup: check generation when recording simple quota delta")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The invariant that we want to maintain with subvolume qgroups is that
the qgroup can only be deleted if there is no root. With squotas, we
thought that it was sufficient to just check the usage, because we
assumed that deleting a subvolume will drive it's qgroups usage to 0,
and thus 0 usage implies no subvolume.
However, this is false, for two reasons:
- A subvol whose extents are all from before squotas was enabled.
- A subvol that was created in this transaction and for which we have
not yet run any delayed refs.
In both cases, deleting the qgroup breaks the desired invariant and we
are left with a subvolume with no qgroup but squotas are enabled.
Fix this by unifying the deletion check logic between full qgroups and
squotas. Squotas do all the same checks *and* the additional usage == 0
check, which is the one extra rule peculiar to squotas.
Link: https://lore.kernel.org/linux-btrfs/adnBhWfJQ1n3hZC8@merlins.org/
Fixes: a8df356199 ("btrfs: forbid deleting live subvol qgroup")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_first_inode() and find_first_inode_to_shrink() lock root->inodes,
then loop over them, occasionally skipping some inodes. When they skip
an inode, they attempt to share the cpu/lock with cond_resched_lock().
However, that has a subtle problem associated with it.
cond_resched_lock() only drops the lock if it needs to actually call
schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as
detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So
regardless of HZ, we will run for at least 2.8ms in this loop without
dropping the lock, assuming it finds no suitable inodes. If HZ is
small enough, it might be even worse as the tick granularity becomes
bigger than the timeslice.
The knock-on effect of this is that callers to
btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab
or userspace threads calling evict() will spin on xa_lock(&root->inodes)
for 2.8ms, so the extent map shrinker dominates the lock even though
ostensibly it is intending to share it. This produces memory pressure as
there is only one kswapd and it runs sequentially so it can get stuck in
the inode slab shrinking.
To fix it, simply replace cond_resched_lock() with an open coded variant
which unconditionally does unlock/lock around cond_resched. Sharing the
lock is decoupled from sharing the CPU, and all the users of the lock
now share it fairly.
I was able to reproduce this on test systems by producing a lot of empty
files (to make a big root->inodes xarray), then producing memory
pressure by reading large files larger than ram, triggering kswapd and
the extent_map shrinker. The lock contention is visible with perf or
lockstat. This patch also relieved a user-apparent bottleneck on a
production system from the original report.
Tested-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When writing into a preallocated extent, ordered extent completion calls
btrfs_mark_extent_written() to convert the file extent item from the
BTRFS_FILE_EXTENT_PREALLOC type to the BTRFS_FILE_EXTENT_REG type.
If the preallocated extent was created beyond i_size with fallocate
keep-size, and the inode is evicted and loaded again before the write,
the inode's file_extent_tree is initialized only up to i_size.
The beyond i_size prealloc extent is therefore not tracked there.
After a write into that extent extends i_size, btrfs_mark_extent_written()
updates the file extent item, but the corresponding range is not marked
dirty in the inode's file_extent_tree.
This can leave disk_i_size stale when the filesystem does not use the
no-holes feature, so after remount the file size can go back to the old
value.
The following reproducer triggers the problem:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f -O ^no-holes $DEV
mount $DEV $MNT
touch $MNT/file
fallocate -n -l 2M $MNT/file
umount $MNT
mount $DEV $MNT
dd if=/dev/zero of=$MNT/file bs=1M count=1 conv=notrunc
ls -lh $MNT/file
umount $MNT
mount $DEV $MNT
ls -lh $MNT/file
umount $MNT
Running the reproducer gives the following result:
$ ./test.sh
(...)
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000596024 s, 1.8 GB/s
-rw-rw-r-- 1 root root 1.0M May 8 16:34 /mnt/sdi/file
-rw-rw-r-- 1 root root 0 May 8 16:34 /mnt/sdi/file
Fix this by marking the written range dirty in the inode's
file_extent_tree after successfully converting the prealloc extent to a
regular extent.
Fixes: 9ddc959e80 ("btrfs: use the file extent tree infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
[ Minor change log updates ]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
that could lead to OOM kills in CephFS and a number of miscellaneous
fixes from Raphael and Slava. All but two are marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmoHWPETHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi8SVB/9naVkGM41Vb98EDywE0TPOY2uzKUDC
RW6pxTCtfbJbnqB+L3HuahbKYXz44h/WPk5Gl4+jO8FvizUz75CkwwjTLsPpGbpe
lgmSISrNFtWtYUS+9/X0x+I5BHz4EwX9sKclniizQ7Uick6SQWaPNhPvxiwEWpko
DnAv9T/dYP7Z5Y7RBNhAFrNgsOQh5qpjoJvZMmvLrzAoKROaWKEzc6G5FIOaoRRu
XolZ2KNnCD0kdN2r66LZFEIE+DpIwrrJ1M6geLwb9LyQ5pwcyCYKPz3AHAqjpBI1
TYXRl2ocMeciJFO0FeLGqpfGy2wcDxwc/ndWK6T/LWnelEfgm3qFZzqE
=KnhJ
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-7.1-rc4' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"An important patch from Hristo that squashes a folio reference leak
that could lead to OOM kills in CephFS and a number of miscellaneous
fixes from Raphael and Slava.
All but two are marked for stable"
* tag 'ceph-for-7.1-rc4' of https://github.com/ceph/ceph-client:
libceph: Fix potential null-ptr-deref in decode_choose_args()
libceph: handle rbtree insertion error in decode_choose_args()
libceph: Fix potential out-of-bounds access in osdmap_decode()
ceph: put folios not suitable for writeback
ceph: add ceph_has_realms_with_quotas() check to ceph_quota_update_statfs()
libceph: Fix potential out-of-bounds access in __ceph_x_decrypt()
ceph: fix BUG_ON in __ceph_build_xattrs_blob() due to stale blob size
ceph: fix a buffer leak in __ceph_setxattr()
libceph: Fix unnecessarily high ceph_decode_need() for uniform bucket
libceph: Fix potential out-of-bounds access in crush_decode()
-----BEGIN PGP SIGNATURE-----
iQJPBAABCgA5FiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmoHVdcbFIAAAAAABAAO
bWFudTIsMi41KzEuMTIsMiwyAAoJEMVl1fnXbVg7M8QQAIdNt3hsHMd/0oWtDpTz
WW/QhdghGJoE1NDR+tDRCDbjwIRagiJYViMLdmjCmO/a16IdxZUwF2xBVEL6X7qV
OzFWIBiywVSQy+znCxOrpddSEEC5a55k+GZUCq55rehIoyq1A5kI++qYYQ2j7eQB
Ld7QeLaLmfCuWzfW/Yx+DhAc+DEiw8IYJBWzw7FVxj3775gGk7OftpjYNqoP726U
P3CQHeSRTFcIQ+pREk0LZ31RoaPZQKMGYxdqxc/cz+t2FoIYKVs/0H0/Fpmn7fzR
bfVGXiSXfWU2/08i2JYAyom7kdyBeBu/6wrde9AtpZyK26qgYkzoiocOMbxCuNgQ
Om4ccHKEu8r/pGhwRwNzu2xtmPD2YS9Gh5UVXQOMuTCMXuTAAFQnYRTEnBkCDPD9
MuJVGA8JZXT8kRTQMg77WxdfMzUEQRc8QNNXOlk2uYCecKjyQ5cldzHclkHRGPvX
mwUCT/XYWhGPc/HKwU0cqcLB/YmIAjuq+dqztusJeIjaJ8wqu/LDgc2j1fnv9HW6
G8LtZw6gUOMOcaybqbQ4rYNPK0Tee63CeS1IcQnC5iw6ezLLkW7mf1uVnOIywtq6
aAv5SwR/8JAnjiLjAeLePq1r7VFPY8I+AKMATer7uNW30pKyPfNS80GfvPxMI1dP
ACalqskniyNanM2qxgeQxiga
=Q5Hr
-----END PGP SIGNATURE-----
Merge tag 'for-7.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fixup warning when allocating memory for readahead, __GFP_NOWARN was
accidentally dropped when setting mapping constraints
- in tracepoint of file sync, fix sleeping in atomic context when
handling dentries
- harden initial loading of block group on crafted/fuzzed images,
iterate all chunk mapping entries unconditionally
- fix freeing pages of submitted io after checking for errors
- fix incorrect inode size after remount when using fallocate KEEP_SIZE
mode (also requires disabled 'no-holes' feature)
* tag 'for-7.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix incorrect i_size after remount caused by KEEP_SIZE prealloc gap
btrfs: only release the dirty pages io tree after successful writes
btrfs: tracepoints: fix sleep while in atomic context in btrfs_sync_file()
btrfs: always pass __GFP_NOWARN from add_ra_bio_pages()
btrfs: fix check_chunk_block_group_mappings() to iterate all chunk maps
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCagc+5gAKCRBcsMJ8RxYu
YzpAAYDu3tfcjmoypj+s5aUuAI9zbjB1UNbcJkVgtiyv5tn5+A14Y3NADvcyMJhi
kZj3/T4BgKmemTPlPjaOSG+zznu11cZSL7dRHQY56hATQwCrY4IS9s/fi80PnV+e
SG1PpVRKcA==
=buSz
-----END PGP SIGNATURE-----
Merge tag 'xfs-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"A few bug fixes, nothing really special stands out"
* tag 'xfs-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Fix typo in comment
xfs: fix the "limiting open zones" message
xfs: flush delalloc blocks on ENOSPC in xfs_trans_alloc_icreate
xfs: check da node block pad field during scrub
xfs: fix memory leak for data allocated by xfs_zone_gc_data_alloc()
xfs: fix memory leak on error in xfs_alloc_zone_info()
xfs: check directory data block header padding in scrub
xfs: zero directory data block padding on write verification
xfs: zero entire directory data block header region at init
xfs: remove the meaningless XFS_ALLOC_FLAG_FREEING
Issues reported with v7.1-rc:
- Correctness fix for the new sunrpc cache netlink protocol
Issues that need expedient stable backports:
- Correctness fixes for delegated attributes
- Prevent an infinite loop when revoking layouts
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmoHKYwACgkQM2qzM29m
f5c77g/6A8bx4hrHThSebjCGrskbEKga5T9xd4jpODjy9K0y1ZawdmNoeCLBKQIv
R/PN9Kb2u5LPkrzHpmIVFsdLiKtw6knnyBFmOHSoiP0Q8EjsakRuBsqZgWJ3jmNU
WCHl5xtxDpgWZ7eaL4/Wrhmui1LKuTsHMrhDY1fbmyHbWO4EvaOR2kB7/dkdorGZ
fG13Lz4axy4fU9598NMqAZuo/LrMeE+VhJwbFpHqqJuPN1/m3BJTiQo3/iZHQnJ/
6azFBEgYSrQC7gdIywVo19lBLIglcMrQwmZnLj9YxftE7hM2ocI+y6jCBkmcqOkp
ajs+h2Sn/vR+f2Hwe7rsvBi3MswouA/tZ0wL2ALUJdpf1UaktF++tqgnPl4yOssq
9YMRqv/khgA9MXCa3IbHJ3s0MN4YEph6DixdaRTN2Dg2fF1ii+5qTipmaDZX7M8B
p1NMRX/S2D9u/zFkAHekK+sqI620hc6OpHqOmTscAWRT5aKs+O6ynq8NSregATX5
oefxJQIuD4dNb4NiVqWCxfr8vQ+3EAwjVTa60DWQOV6Hpvz8V21wnAw4TfvuH0sO
fjZhznrG0x7RLhddZ0/HXimALmJD97Uy2tjoI2B2df/LCroCk9x3wrDS2jAitiel
iPWTVC+awShgRPRAMrki0KcJhBrb4MChvI4oO3Nn9nPh4+QlMjY=
=6XME
-----END PGP SIGNATURE-----
Merge tag 'nfsd-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Fixes for this release:
- Correctness fix for the new sunrpc cache netlink protocol
Marked for stable:
- Correctness fixes for delegated attributes
- Prevent an infinite loop when revoking layouts"
* tag 'nfsd-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix infinite loop in layout state revocation
sunrpc: start cache request seqno at 1 to fix netlink GET_REQS
nfsd: update mtime/ctime on COPY in presence of delegated attributes
nfsd: update mtime/ctime on CLONE in presense of delegated attributes
nfsd: fix file change detection in CB_GETATTR
nfsd: fix GET_DIR_DELEGATION when VFS leases are disabled