mirror-linux

Commit Graph

Author	SHA1	Message	Date
Christoph Hellwig	2d263debd7	xfs: allow setting errortags at mount time Add an errortag mount option that enables an errortag with the default injection frequency. This allows injecting errors into the mount process instead of just on live file systems, and thus test mount error handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-30 10:41:42 +01:00
Christoph Hellwig	4d8f42466a	xfs: use WRITE_ONCE/READ_ONCE for m_errortag There is no synchronization for updating m_errortag, which is fine as it's just a debug tool. It would still be nice to fully avoid the theoretical case of torn values, so use WRITE_ONCE and READ_ONCE to access the members. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-30 10:41:42 +01:00
Christoph Hellwig	e2d62bfd99	xfs: move the guts of XFS_ERRORTAG_DELAY out of line Mirror what is done for the more common XFS_ERRORTAG_TEST version, and also only look at the error tag value once now that we can easily have a local variable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-30 10:41:42 +01:00
Christoph Hellwig	b8862a09d8	xfs: don't validate error tags in the I/O path We can trust XFS developers enough to not pass random stuff to XFS_ERROR_TEST/DELAY. Open code the validity check in xfs_errortag_add, which is the only place that receives unvalidated error tag values from user space, and drop the now pointless xfs_errortag_enabled helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-30 10:41:42 +01:00
Christoph Hellwig	394969e2f9	xfs: allocate m_errortag early Ensure the mount structure always has a valid m_errortag for debug builds. This removes the NULL checking from the runtime code, and prepares for allowing to set errortags from mount. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-30 10:41:42 +01:00
Christoph Hellwig	9a228d1415	xfs: fix the errno sign for the xfs_errortag_{add,clearall} stubs All errno values should be negative in the kernel. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-30 10:41:42 +01:00
Raphael Pinsonneault-Thibeault	44b9553c3d	xfs: validate log record version against superblock log version Syzbot creates a fuzzed record where xfs_has_logv2() but the xlog_rec_header h_version != XLOG_VERSION_2. This causes a KASAN: slab-out-of-bounds read in xlog_do_recovery_pass() -> xlog_recover_process() -> xlog_cksum(). Fix by adding a check to xlog_valid_rec_header() to abort journal recovery if the xlog_rec_header h_version does not match the super block log version. A file system with a version 2 log will only ever set XLOG_VERSION_2 in its headers (and v1 will only ever set V_1), so if there is any mismatch, either the journal or the superblock has been corrupted and therefore we abort processing with a -EFSCORRUPTED error immediately. Also, refactor the structure of the validity checks for better readability. At the default error level (LOW), XFS_IS_CORRUPT() emits the condition that failed, the file and line number it is located at, then dumps the stack. This gives us everything we need to know about the failure if we do a single validity check per XFS_IS_CORRUPT(). Reported-by: syzbot+9f6d080dece587cfdd4c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9f6d080dece587cfdd4c Tested-by: syzbot+9f6d080dece587cfdd4c@syzkaller.appspotmail.com Fixes: `45cf976008` ("xfs: fix log recovery buffer allocation for the legacy h_size fixup") Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-30 10:41:42 +01:00
Shin Seong-jun	0ead3b7246	xfs: fix spacing style issues in xfs_alloc.c Fix checkpatch.pl errors regarding missing spaces around assignment operators in xfs_alloc_compute_diff() and xfs_alloc_fixup_trees(). Adhere to the Linux kernel coding style by ensuring spaces are placed around the assignment operator '='. Signed-off-by: Shin Seong-jun <shinsj4653@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-29 10:57:18 +01:00
Christoph Hellwig	7da4ebea83	xfs: remove xfs_zone_gc_space_available xfs_zone_gc_space_available only has one caller left, so fold it into that. Reorder the checks so that the cheaper scratch_available check is done first. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-29 10:57:18 +01:00
Christoph Hellwig	c17a1c0349	xfs: use a seprate member to track space availabe in the GC scatch buffer When scratch_head wraps back to 0 and scratch_tail is also 0 because no I/O has completed yet, the ring buffer could be mistaken for empty. Fix this by introducing a separate scratch_available member in struct xfs_zone_gc_data. This actually ends up simplifying the code as well. Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-29 10:57:18 +01:00
Carlos Maiolino	692243cac6	xfs: syzbot fixes for online fsck [3/3] Fix various syzbot complaints about scrub that Jiaming Zhang found. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCaXbguQAKCRBKO3ySh0YR pp4yAP9RG8daogpIwkB6LkGQNxA0gxYXBt8npuYCGoisT2pKqgD9F6Xsd724mc3y TC9ii+EEv0cslwphy60J7d+DfaaPPgw= =kE9t -----END PGP SIGNATURE----- Merge tag 'scrub-syzbot-fixes-7.0_2026-01-25' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: syzbot fixes for online fsck [3/3] Fix various syzbot complaints about scrub that Jiaming Zhang found. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-28 10:22:09 +01:00
Carlos Maiolino	c04ed39d85	xfs: improve shortform attr performance [2/3] Improve performance of the xattr (and parent pointer) code when the attr structure is in short format and we can therefore perform all updates in a single transaction. Avoiding the attr intent code brings a very nice speedup in those operations. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCaXbguQAKCRBKO3ySh0YR pkGhAP4q0606NWz+XcF+5f3KlehLBOnpmnozVvudVMCd1rCmpgD9HecarQThh0VI ZHo7LrKQpl+jrg0fhuKcbocQzxGNpgI= =2NAV -----END PGP SIGNATURE----- Merge tag 'attr-pptr-speedup-7.0_2026-01-25' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: improve shortform attr performance [2/3] Improve performance of the xattr (and parent pointer) code when the attr structure is in short format and we can therefore perform all updates in a single transaction. Avoiding the attr intent code brings a very nice speedup in those operations. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-28 10:16:12 +01:00
Carlos Maiolino	2744d7adb2	xfs: fix problems in the attr leaf freemap code [1/3] Running generic/753 for hours revealed data corruption problems in the attr leaf block space management code. Under certain circumstances, freemap entries are left with zero size but a nonzero offset. If that offset happens to be the same offset as the end of the entries array during an attr set operation, the leaf entry table expansion will push the freemap record offset upwards without checking for overlap with any other freemap entries. If there happened to be a second freemap entry overlapping with the newly allocated leaf entry space, then the next attr set operation might find that space and overwrite the leaf entry, thereby corrupting the leaf block. Fix this by zeroing the freemap offset any time we set the size to zero. If a subsequent attr set operation finds no space in the freemap, it will compact the block and regenerate the freemaps. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCaXbguQAKCRBKO3ySh0YR pkF1AP4z66Qfdm4+mY6fs7N33FponI2dzp8R7nnljENif1yq0wEAqH63iSpDplhR ojPVgBKTQR6atxFJVl/AF9YtFPnXHwc= =BeEI -----END PGP SIGNATURE----- Merge tag 'attr-leaf-freemap-fixes-7.0_2026-01-25' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: fix problems in the attr leaf freemap code [1/3] Running generic/753 for hours revealed data corruption problems in the attr leaf block space management code. Under certain circumstances, freemap entries are left with zero size but a nonzero offset. If that offset happens to be the same offset as the end of the entries array during an attr set operation, the leaf entry table expansion will push the freemap record offset upwards without checking for overlap with any other freemap entries. If there happened to be a second freemap entry overlapping with the newly allocated leaf entry space, then the next attr set operation might find that space and overwrite the leaf entry, thereby corrupting the leaf block. Fix this by zeroing the freemap offset any time we set the size to zero. If a subsequent attr set operation finds no space in the freemap, it will compact the block and regenerate the freemaps. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-28 10:15:07 +01:00
Carlos Maiolino	04a65666a6	xfs: autonomous self healing of filesystems [v7] This patchset builds new functionality to deliver live information about filesystem health events to userspace. This is done by creating an anonymous file that can be read() for events by userspace programs. Events are captured by hooking various parts of XFS and iomap so that metadata health failures, file I/O errors, and major changes in filesystem state (unmounts, shutdowns, etc.) can be observed by programs. When an event occurs, the hook functions queue an event object to each event anonfd for later processing. Programs must have CAP_SYS_ADMIN to open the anonfd and there's a maximum event lag to prevent resource overconsumption. The events themselves can be read() from the anonfd as C structs for the xfs_healer daemon. In userspace, we create a new daemon program that will read the event objects and initiate repairs automatically. This daemon is managed entirely by systemd and will not block unmounting of the filesystem unless repairs are ongoing. They are auto-started by a starter service that uses fanotify. This patchset depends on the new fserror code that Christian Brauner has tentatively accepted for Linux 7.0: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.0.fserror v7: more cleanups of the media verification ioctl, improve comments, and reuse the bio v6: fix pi-breaking bugs, make verify failures trigger health reports and filter bio status flags better v5: add verify-media ioctl, collapse small helper funcs with only one caller v4: drop multiple client support so we can make direct calls into healthmon instead of chasing pointers and doing indirect calls v3: drag out of rfc status With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCaXB3wQAKCRBKO3ySh0YR pkyzAQD/6Yuzlbc/NDUeyHOeSYYB8zAtrbw1Pdky6dtR16FR3QD/Yb5/M9E4MFz7 IX7KeL00cF/fDFl6c3h9qaEx+w23KgA= =ADAl -----END PGP SIGNATURE----- Merge tag 'health-monitoring-7.0_2026-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: autonomous self healing of filesystems [v7] This patchset builds new functionality to deliver live information about filesystem health events to userspace. This is done by creating an anonymous file that can be read() for events by userspace programs. Events are captured by hooking various parts of XFS and iomap so that metadata health failures, file I/O errors, and major changes in filesystem state (unmounts, shutdowns, etc.) can be observed by programs. When an event occurs, the hook functions queue an event object to each event anonfd for later processing. Programs must have CAP_SYS_ADMIN to open the anonfd and there's a maximum event lag to prevent resource overconsumption. The events themselves can be read() from the anonfd as C structs for the xfs_healer daemon. In userspace, we create a new daemon program that will read the event objects and initiate repairs automatically. This daemon is managed entirely by systemd and will not block unmounting of the filesystem unless repairs are ongoing. They are auto-started by a starter service that uses fanotify. This patchset depends on the new fserror code that Christian Brauner has tentatively accepted for Linux 7.0: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.0.fserror v7: more cleanups of the media verification ioctl, improve comments, and reuse the bio v6: fix pi-breaking bugs, make verify failures trigger health reports and filter bio status flags better v5: add verify-media ioctl, collapse small helper funcs with only one caller v4: drop multiple client support so we can make direct calls into healthmon instead of chasing pointers and doing indirect calls v3: drag out of rfc status With a bit of luck, this should all go splendidly. Conflicts: This merge required an update on files: - fs/xfs/xfs_healthmon.c - fs/xfs/xfs_verify_media.c Such change was required because a parallel developement changed XFS header file xfs.h naming to xfs_platform.h, so the merge required to update those includes in both files above Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-28 10:02:20 +01:00
Darrick J. Wong	55e03b8cbe	xfs: check for deleted cursors when revalidating two btrees The free space and inode btree repair functions will rebuild both btrees at the same time, after which it needs to evaluate both btrees to confirm that the corruptions are gone. However, Jiaming Zhang ran syzbot and produced a crash in the second xchk_allocbt call. His root-cause analysis is as follows (with minor corrections): In xrep_revalidate_allocbt(), xchk_allocbt() is called twice (first for BNOBT, second for CNTBT). The cause of this issue is that the first call nullified the cursor required by the second call. Let's first enter xrep_revalidate_allocbt() via following call chain: xfs_file_ioctl() -> xfs_ioc_scrubv_metadata() -> xfs_scrub_metadata() -> `sc->ops->repair_eval(sc)` -> xrep_revalidate_allocbt() xchk_allocbt() is called twice in this function. In the first call: /* Note that sc->sm->sm_type is XFS_SCRUB_TYPE_BNOPT now / xchk_allocbt() -> xchk_btree() -> `bs->scrub_rec(bs, recp)` -> xchk_allocbt_rec() -> xchk_allocbt_xref() -> xchk_allocbt_xref_other() since sm_type is XFS_SCRUB_TYPE_BNOBT, pur is set to &sc->sa.cnt_cur. Kernel called xfs_alloc_get_rec() and returned -EFSCORRUPTED. Call chain: xfs_alloc_get_rec() -> xfs_btree_get_rec() -> xfs_btree_check_block() -> (XFS_IS_CORRUPT \|\| XFS_TEST_ERROR), the former is false and the latter is true, return -EFSCORRUPTED. This should be caused by ioctl$XFS_IOC_ERROR_INJECTION I guess. Back to xchk_allocbt_xref_other(), after receiving -EFSCORRUPTED from xfs_alloc_get_rec(), kernel called xchk_should_check_xref(). In this function, curpp (points to sc->sa.cnt_cur) is nullified. Back to xrep_revalidate_allocbt(), since sc->sa.cnt_cur has been nullified, it then triggered null-ptr-deref via xchk_allocbt() (second call) -> xchk_btree(). So. The bnobt revalidation failed on a cross-reference attempt, so we deleted the cntbt cursor, and then crashed when we tried to revalidate the cntbt. Therefore, check for a null cntbt cursor before that revalidation, and mark the repair incomplete. Also we can ignore the second tree entirely if the first tree was rebuilt but is already corrupt. Apply the same fix to xrep_revalidate_iallocbt because it has the same problem. Cc: r772577952@gmail.com Link: https://lore.kernel.org/linux-xfs/CANypQFYU5rRPkTy=iG5m1Lp4RWasSgrHXAh3p8YJojxV0X15dQ@mail.gmail.com/T/#m520c7835fad637eccf843c7936c200589427cc7e Cc: <stable@vger.kernel.org> # v6.8 Fixes: `dbfbf3bdf6` ("xfs: repair inode btrees") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>	2026-01-24 08:46:43 -08:00
Darrick J. Wong	1c253e1122	xfs: fix UAF in xchk_btree_check_block_owner We cannot dereference bs->cur when trying to determine if bs->cur aliases bs->sc->sa.{bno,rmap}_cur after the latter has been freed. Fix this by sampling before type before any freeing could happen. The correct temporal ordering was broken when we removed xfs_btnum_t. Cc: r772577952@gmail.com Cc: <stable@vger.kernel.org> # v6.9 Fixes: `ec793e690f` ("xfs: remove xfs_btnum_t") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>	2026-01-24 08:46:41 -08:00
Darrick J. Wong	ca27313fb3	xfs: check return value of xchk_scrub_create_subord Fix this function to return NULL instead of a mangled ENOMEM, then fix the callers to actually check for a null pointer and return ENOMEM. Most of the corrections here are for code merged between 6.2 and 6.10. Cc: r772577952@gmail.com Cc: <stable@vger.kernel.org> # v6.12 Fixes: `1a5f6e08d4` ("xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>	2026-01-24 08:46:38 -08:00
Darrick J. Wong	ba408d299a	xfs: only call xf{array,blob}_destroy if we have a valid pointer Only call the xfarray and xfblob destructor if we have a valid pointer, and be sure to null out that pointer afterwards. Note that this patch fixes a large number of commits, most of which were merged between 6.9 and 6.10. Cc: r772577952@gmail.com Cc: <stable@vger.kernel.org> # v6.12 Fixes: `ab97f4b1c0` ("xfs: repair AGI unlinked inode bucket lists") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>	2026-01-24 08:46:36 -08:00
Darrick J. Wong	60382993a2	xfs: get rid of the xchk_xfile__descr calls The xchk_xfile__descr macros call kasprintf, which can fail to allocate memory if the formatted string is larger than 16 bytes (or whatever the nofail guarantees are nowadays). Some of them could easily exceed that, and Jiaming Zhang found a few places where that can happen with syzbot. The descriptions are debugging aids and aren't required to be unique, so let's just pass in static strings and eliminate this path to failure. Note this patch touches a number of commits, most of which were merged between 6.6 and 6.14. Cc: r772577952@gmail.com Cc: <stable@vger.kernel.org> # v6.12 Fixes: `ab97f4b1c0` ("xfs: repair AGI unlinked inode bucket lists") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>	2026-01-23 09:27:37 -08:00
Darrick J. Wong	eaec8aeff3	xfs: add a method to replace shortform attrs If we're trying to replace an xattr in a shortform attr structure and the old entry fits the new entry, we can just memcpy and exit without having to delete, compact, and re-add the entry (or worse use the attr intent machinery). For parent pointers this only advantages renaming where the filename length stays the same (e.g. mv autoexec.bat scandisk.exe) but for regular xattrs it might be useful for updating security labels and the like. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:36 -08:00
Darrick J. Wong	d693534513	xfs: speed up parent pointer operations when possible After a recent fsmark benchmarking run, I observed that the overhead of parent pointers on file creation and deletion can be a bit high. On a machine with 20 CPUs, 128G of memory, and an NVME SSD capable of pushing 750000iops, I see the following results: $ mkfs.xfs -f -l logdev=/dev/nvme1n1,size=1g /dev/nvme0n1 -n parent=0 meta-data=/dev/nvme0n1 isize=512 agcount=40, agsize=9767586 blks = sectsz=4096 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 = exchange=0 metadir=0 data = bsize=4096 blocks=390703440, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0 log =/dev/nvme1n1 bsize=4096 blocks=262144, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 = rgcount=0 rgsize=0 extents = zoned=0 start=0 reserved=0 So we created 40 AGs, one per CPU. Now we create 40 directories and run fsmark: $ time fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d ... # Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025 # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) # Files info: size 0 bytes, written with an IO size of 16384 bytes per write # App overhead is time in microseconds spent in the test not doing file writing related system calls. parent=0 parent=1 ================== ================== real 0m57.573s real 1m2.934s user 3m53.578s user 3m53.508s sys 19m44.440s sys 25m14.810s $ time rm -rf ... parent=0 parent=1 ================== ================== real 0m59.649s real 1m12.505s user 0m41.196s user 0m47.489s sys 13m9.566s sys 20m33.844s Parent pointers increase the system time by 28% overhead to create 32 million files that are totally empty. Removing them incurs a system time increase of 56%. Wall time increases by 9% and 22%. For most filesystems, each file tends to have a single owner and not that many xattrs. If the xattr structure is shortform, then all xattr changes are logged with the inode and do not require the the xattr intent mechanism to persist the parent pointer. Therefore, we can speed up parent pointer operations by calling the shortform xattr functions directly if the child's xattr is in short format. Now the overhead looks like: $ time fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d ... parent=0 parent=1 ================== ================== real 0m58.030s real 1m0.983s user 3m54.141s user 3m53.758s sys 19m57.003s sys 21m30.605s $ time rm -rf ... parent=0 parent=1 ================== ================== real 0m58.911s real 1m4.420s user 0m41.329s user 0m45.169s sys 13m27.857s sys 15m58.564s Now parent pointers only increase the system time by 8% for creation and 19% for deletion. Wall time increases by 5% and 9% now. Close the performance gap by creating helpers for the attr set, remove, and replace operations that will try to make direct shortform updates, and fall back to the attr intent machinery if that doesn't work. This works for regular xattrs and for parent pointers. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:35 -08:00
Darrick J. Wong	1ef7729df1	xfs: reduce xfs_attr_try_sf_addname parameters The dp parameter to this function is an alias of args->dp, so remove it for clarity before we go adding new callers. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:34 -08:00
Darrick J. Wong	bd3138e891	xfs: fix remote xattr valuelblk check In debugging other problems with generic/753, it turns out that it's possible for the system go to down in the middle of a remote xattr set operation such that the leaf block entry is marked incomplete and valueblk is set to zero. Make this no longer a failure. Cc: <stable@vger.kernel.org> # v4.15 Fixes: `13791d3b83` ("xfs: scrub extended attribute leaf space") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:33 -08:00
Darrick J. Wong	6fed827044	xfs: fix the xattr scrub to detect freemap/entries array collisions In the previous patches, we observed that it's possible for there to be freemap entries with zero size but a nonzero base. This isn't an inconsistency per se, but older kernels can get confused by this and corrupt the block, leading to corruption. If we see this, flag the xattr structure for optimization so that it gets rebuilt. Cc: <stable@vger.kernel.org> # v4.15 Fixes: `13791d3b83` ("xfs: scrub extended attribute leaf space") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:33 -08:00
Darrick J. Wong	27a0c41f33	xfs: strengthen attr leaf block freemap checking Check for erroneous overlapping freemap regions and collisions between freemap regions and the xattr leaf entry array. Note that we must explicitly zero out the extra freemaps in xfs_attr3_leaf_compact so that the in-memory buffer has a correctly initialized freemap array to satisfy the new verification code, even if subsequent code changes the contents before unlocking the buffer. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:32 -08:00
Darrick J. Wong	a165f7e763	xfs: refactor attr3 leaf table size computation Replace all the open-coded callsites with a single static inline helper. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:31 -08:00
Darrick J. Wong	3eefc0c2b7	xfs: fix freemap adjustments when adding xattrs to leaf blocks xfs/592 and xfs/794 both trip this assertion in the leaf block freemap adjustment code after ~20 minutes of running on my test VMs: ASSERT(ichdr->firstused >= ichdr->count * sizeof(xfs_attr_leaf_entry_t) + xfs_attr3_leaf_hdr_size(leaf)); Upon enabling quite a lot more debugging code, I narrowed this down to fsstress trying to set a local extended attribute with namelen=3 and valuelen=71. This results in an entry size of 80 bytes. At the start of xfs_attr3_leaf_add_work, the freemap looks like this: i 0 base 448 size 0 rhs 448 count 46 i 1 base 388 size 132 rhs 448 count 46 i 2 base 2120 size 4 rhs 448 count 46 firstused = 520 where "rhs" is the first byte past the end of the leaf entry array. This is inconsistent -- the entries array ends at byte 448, but freemap[1] says there's free space starting at byte 388! By the end of the function, the freemap is in worse shape: i 0 base 456 size 0 rhs 456 count 47 i 1 base 388 size 52 rhs 456 count 47 i 2 base 2120 size 4 rhs 456 count 47 firstused = 440 Important note: 388 is not aligned with the entries array element size of 8 bytes. Based on the incorrect freemap, the name area starts at byte 440, which is below the end of the entries array! That's why the assertion triggers and the filesystem shuts down. How did we end up here? First, recall from the previous patch that the freemap array in an xattr leaf block is not intended to be a comprehensive map of all free space in the leaf block. In other words, it's perfectly legal to have a leaf block with: * 376 bytes in use by the entries array * freemap[0] has [base = 376, size = 8] * freemap[1] has [base = 388, size = 1500] * the space between 376 and 388 is free, but the freemap stopped tracking that some time ago If we add one xattr, the entries array grows to 384 bytes, and freemap[0] becomes [base = 384, size = 0]. So far, so good. But if we add a second xattr, the entries array grows to 392 bytes, and freemap[0] gets pushed up to [base = 392, size = 0]. This is bad, because freemap[1] hasn't been updated, and now the entries array and the free space claim the same space. The fix here is to adjust all freemap entries so that none of them collide with the entries array. Note that this fix relies on commit `2a2b5932db` ("xfs: fix attr leaf header freemap.size underflow") and the previous patch that resets zero length freemap entries to have base = 0. Cc: <stable@vger.kernel.org> # v2.6.12 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:31 -08:00
Darrick J. Wong	6f13c1d2a6	xfs: delete attr leaf freemap entries when empty Back in commit `2a2b5932db` ("xfs: fix attr leaf header freemap.size underflow"), Brian Foster observed that it's possible for a small freemap at the end of the end of the xattr entries array to experience a size underflow when subtracting the space consumed by an expansion of the entries array. There are only three freemap entries, which means that it is not a complete index of all free space in the leaf block. This code can leave behind a zero-length freemap entry with a nonzero base. Subsequent setxattr operations can increase the base up to the point that it overlaps with another freemap entry. This isn't in and of itself a problem because the code in _leaf_add that finds free space ignores any freemap entry with zero size. However, there's another bug in the freemap update code in _leaf_add, which is that it fails to update a freemap entry that begins midway through the xattr entry that was just appended to the array. That can result in the freemap containing two entries with the same base but different sizes (0 for the "pushed-up" entry, nonzero for the entry that's actually tracking free space). A subsequent _leaf_add can then allocate xattr namevalue entries on top of the entries array, leading to data loss. But fixing that is for later. For now, eliminate the possibility of confusion by zeroing out the base of any freemap entry that has zero size. Because the freemap is not intended to be a complete index of free space, a subsequent failure to find any free space for a new xattr will trigger block compaction, which regenerates the freemap. It looks like this bug has been in the codebase for quite a long time. Cc: <stable@vger.kernel.org> # v2.6.12 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2026-01-23 09:27:30 -08:00
Wenwu Hou	a1ca658d64	xfs: fix incorrect context handling in xfs_trans_roll The memalloc_nofs_save() and memalloc_nofs_restore() calls are incorrectly paired in xfs_trans_roll. Call path: xfs_trans_alloc() __xfs_trans_alloc() // tp->t_pflags = memalloc_nofs_save(); xfs_trans_set_context() ... xfs_defer_trans_roll() xfs_trans_roll() xfs_trans_dup() // old_tp->t_pflags = 0; xfs_trans_switch_context() __xfs_trans_commit() xfs_trans_free() // memalloc_nofs_restore(tp->t_pflags); xfs_trans_clear_context() The code passes 0 to memalloc_nofs_restore() when committing the original transaction, but memalloc_nofs_restore() should always receive the flags returned from the paired memalloc_nofs_save() call. Before commit `3f6d5e6a46` ("mm: introduce memalloc_flags_{save,restore}"), calling memalloc_nofs_restore(0) would unset the PF_MEMALLOC_NOFS flag, which could cause memory allocation deadlocks[1]. Fortunately, after that commit, memalloc_nofs_restore(0) does nothing, so this issue is currently harmless. Fixes: `756b1c3433` ("xfs: use current->journal_info for detecting transaction recursion") Link: https://lore.kernel.org/linux-xfs/20251104131857.1587584-1-leo.lilong@huawei.com [1] Signed-off-by: Wenwu Hou <hwenwur@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:59:10 +01:00
Hans Holmberg	01a2896154	xfs: always allocate the free zone with the lowest index Zones in the beginning of the address space are typically mapped to higer bandwidth tracks on HDDs than those at the end of the address space. So, in stead of allocating zones "round robin" across the whole address space, always allocate the zone with the lowest index. This increases average write bandwidth for overwrite workloads when less than the full capacity is being used. At ~50% utilization this improves bandwidth for a random file overwrite benchmark with 128MiB files and 256MiB zone capacity by 30%. Running the same benchmark with small 2-8 MiB files at 67% capacity shows no significant difference in performance. Due to heavy fragmentation the whole zone range is in use, greatly limiting the number of free zones with high bw. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:17 +01:00
Darrick J. Wong	4d6d335ea9	xfs: promote metadata directories and large block support Large block support was merged upstream in 6.12 (Dec 2024) and metadata directories was merged in 6.13 (Jan 2025). We've not received any serious complaints about the ondisk formats of these two features in the past year, so let's remove the experimental warnings. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:17 +01:00
Christoph Hellwig	12d12dcc15	xfs: use blkdev_get_zone_info to simplify zone reporting Unwind the callback based programming model by querying the cached zone information using blkdev_get_zone_info. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:17 +01:00
Christoph Hellwig	b37c1e4e9a	xfs: check that used blocks are smaller than the write pointer Any used block must have been written, this reject used blocks > write pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:17 +01:00
Christoph Hellwig	19c5b6051e	xfs: split and refactor zone validation Currently xfs_zone_validate mixes validating the software zone state in the XFS realtime group with validating the hardware state reported in struct blk_zone and deriving the write pointer from that. Move all code that works on the realtime group to xfs_init_zone, and only keep the hardware state validation in xfs_zone_validate. This makes the code more clear, and allows for better reuse in userspace. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:17 +01:00
Christoph Hellwig	776b76f754	xfs: pass the write pointer to xfs_init_zone Move the two methods to query the write pointer out of xfs_init_zone into the callers, so that xfs_init_zone doesn't have to bother with the blk_zone structure and instead operates purely at the XFS realtime group level. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:17 +01:00
Christoph Hellwig	fc633b5c5b	xfs: add a xfs_rtgroup_raw_size helper Add a helper to figure the on-disk size of a group, accounting for the XFS_SB_FEAT_INCOMPAT_ZONE_GAPS feature if needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:17 +01:00
Damien Le Moal	41263267ef	xfs: add missing forward declaration in xfs_zones.h Add the missing forward declaration for struct blk_zone in xfs_zones.h. This avoids headaches with the order of header file inclusion to avoid compilation errors. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	3a65ea768b	xfs: remove xfs_attr_leaf_hasname The calling convention of xfs_attr_leaf_hasname() is problematic, because it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a non-NULL buffer pointer for an already released buffer when xfs_attr3_leaf_lookup_int fails with other error values. Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so that the buffer release code is done by each caller of xfs_attr3_leaf_read. Cc: stable@vger.kernel.org # v5.19+ Fixes: `07120f1abd` ("xfs: Add xfs_has_attr and subroutines") Reported-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Darrick J. Wong	f39854a3fb	xfs: mark data structures corrupt on EIO and ENODATA I learned a few things this year: first, blk_status_to_errno can return ENODATA for critical media errors; and second, the scrub code doesn't mark data structures as corrupt on ENODATA or EIO. Currently, scrub failing to capture these errors isn't all that impactful -- the checking code will exit to userspace with EIO/ENODATA, and xfs_scrub will log a complaint and exit with nonzero status. Most people treat fsck tools failing as a sign that the fs is corrupt, but online fsck should mark the metadata bad and keep moving. Cc: stable@vger.kernel.org # v4.15 Fixes: `4700d22980` ("xfs: create helpers to record and deal with scrub problems") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	102f444b57	xfs: rework zone GC buffer management The double buffering where just one scratch area is used at a time does not efficiently use the available memory. It was originally implemented when GC I/O could happen out of order, but that was removed before upstream submission to avoid fragmentation. Now that all GC I/Os are processed in order, just use a number of buffers as a simple ring buffer. For a synthetic benchmark that fills 256MiB HDD zones and punches out holes to free half the space this leads to a decrease of GC time by a little more than 25%. Thanks to Hans Holmberg <hans.holmberg@wdc.com> for testing and benchmarking. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	0506d32f7c	xfs: use bio_reuse in the zone GC code Replace our somewhat fragile code to reuse the bio, which caused a regression in the past with the block layer bio_reuse helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	7ca44303f9	block: add a bio_reuse helper Add a helper to allow an existing bio to be resubmitted without having to re-add the payload. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	cf9b52fa7d	xfs: directly include xfs_platform.h The xfs.h header conflicts with the public xfs.h in xfsprogs, leading to a spurious difference in all shared libxfs files that have to include libxfs_priv.h in userspace. Directly include xfs_platform.h so that we can add a header of the same name to xfsprogs and remove this major annoyance for the shared code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	19a46f1246	xfs: move the remaining content from xfs.h to xfs_platform.h Move the global defines from xfs.h to xfs_platform.h to prepare for removing xfs.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	501a5161d2	xfs: include global headers first in xfs_platform.h Ensure we have all kernel headers included by the time we do our own thing, just like the rest of the tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	971ffb6341	xfs: rename xfs_linux.h to xfs_platform.h Rename xfs_linux.h to prepare for including including it directly from source files including those shared with xfsprogs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	a10b44cf10	xfs: factor out a xlog_write_space_advance helper Add a new xlog_write_space_advance that returns the current place in the iclog that data is written to, and advances the various counters by the amount taken from xlog_write_iovec, and also use it xlog_write_partial, which open codes the counter adjustments, but misses the asserts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	e2663443da	xfs: improve the iclog space assert in xlog_write_iovec We need enough space for the length we copy into the iclog, not just some space, so tighten up the check a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	865970d49a	xfs: add a xlog_write_space_left helper Various places check how much space is left in the current iclog, add a helper for that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00
Christoph Hellwig	a3eb1f9cf8	xfs: improve the calling convention for the xlog_write helpers The xlog_write chain passes around the same seven variables that are often passed by reference. Add a xlog_write_data structure to contain them to improve code generation and readability. This change increases the generated code size by about 140 bytes for my x86_64 build, which is hopefully worth the much easier to follow code: $ size fs/xfs/xfs_log.o* text data bss dec hex filename 29300 1730 176 31206 79e6 fs/xfs/xfs_log.o 29160 1730 176 31066 795a fs/xfs/xfs_log.o.old Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-01-21 12:57:16 +01:00

1 2 3 4 5 ...

1413600 Commits (2d263debd7f1df2091efb4c06eda02ab04b68562) All Branches Search

1413600 Commits (2d263debd7f1df2091efb4c06eda02ab04b68562)

All Branches