mirror-linux

Commit Graph

Author	SHA1	Message	Date
Suchit Karunakaran	530e3d4af5	btrfs: fix NULL pointer dereference in do_abort_log_replay() Coverity reported a NULL pointer dereference issue (CID 1666756) in do_abort_log_replay(). When btrfs_alloc_path() fails in replay_one_buffer(), wc->subvol_path is NULL, but btrfs_abort_log_replay() calls do_abort_log_replay() which unconditionally dereferences wc->subvol_path when attempting to print debug information. Fix this by adding a NULL check before dereferencing wc->subvol_path in do_abort_log_replay(). Fixes: `2753e49176` ("btrfs: dump detailed info and specific messages on log replay failures") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-01-06 01:23:00 +01:00
Qu Wenruo	cefd809251	btrfs: force free space tree for bs > ps cases [BUG] Currently we only enforcing the free space tree for bs < ps cases, but with the recently added bs > ps support, we lack the free space tree enforcing, causing explicit v1 cache mount option to fail on bs > ps cases: # mount -o space_cache=v1 /dev/test/scratch1 /mnt/btrfs/ mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/mapper/test-scratch1, missing codepage or helper program, or other error. dmesg(1) may have more information after failed mount system call. # dmesg -t \| tail -n7 BTRFS: device fsid ac14a6fa-4ec9-449e-aec9-7d1777bfdc06 devid 1 transid 11 /dev/mapper/test-scratch1 (253:3) scanned by mount (2849) BTRFS info (device dm-3): first mount of filesystem ac14a6fa-4ec9-449e-aec9-7d1777bfdc06 BTRFS info (device dm-3): using crc32c checksum algorithm BTRFS warning (device dm-3): support for block size 8192 with page size 4096 is experimental, some features may be missing BTRFS warning (device dm-3): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2 BTRFS warning (device dm-3): v1 space cache is not supported for page size 4096 with sectorsize 8192 BTRFS error (device dm-3): open_ctree failed: -22 [FIX] Just enable the same free space tree for bs > ps cases, aligning the behavior to bs < ps cases. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-01-06 01:22:59 +01:00
Qu Wenruo	30bcf4e824	btrfs: only enforce free space tree if v1 cache is required for bs < ps cases [BUG] Since the introduction of btrfs bs < ps support, v1 cache was never on the plan due to its hard coded PAGE_SIZE usage, and the future plan to properly deprecate it. However for bs < ps cases, even if 'nospace_cache,clear_cache' mount option is specified, it's never respected and free space tree is always enabled: mkfs.btrfs -f -O ^bgt,fst $dev mount $dev $mnt -o clear_cache,nospace_cache umount $mnt btrfs ins dump-super $dev ... compat_ro_flags 0x3 ( FREE_SPACE_TREE \| FREE_SPACE_TREE_VALID ) ... This means a different behavior compared to bs >= ps cases. [CAUSE] The forcing usage of v2 space cache is done inside btrfs_set_free_space_cache_settings(), however it never checks if we're even using space cache but always enabling v2 cache. [FIX] Instead unconditionally enable v2 cache, only forcing v2 cache if the old v1 cache is required. Now v2 space cache can be properly disabled on bs < ps cases: mkfs.btrfs -f -O ^bgt,fst $dev mount $dev $mnt -o clear_cache,nospace_cache umount $mnt btrfs ins dump-super $dev ... compat_ro_flags 0x0 ... Fixes: `9f73f1aef9` ("btrfs: force v2 space cache usage for subpage mount") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-01-06 01:22:59 +01:00
Filipe Manana	8731f2c50b	btrfs: release path before initializing extent tree in btrfs_read_locked_inode() In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree() while holding a path with a read locked leaf from a subvolume tree, and btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can trigger reclaim. This can create a circular lock dependency which lockdep warns about with the following splat: [6.1433] ====================================================== [6.1574] WARNING: possible circular locking dependency detected [6.1583] 6.18.0+ #4 Tainted: G U [6.1591] ------------------------------------------------------ [6.1599] kswapd0/117 is trying to acquire lock: [6.1606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1625] but task is already holding lock: [6.1633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [6.1646] which lock already depends on the new lock. [6.1657] the existing dependency chain (in reverse order) is: [6.1667] -> #2 (fs_reclaim){+.+.}-{0:0}: [6.1677] fs_reclaim_acquire+0x9d/0xd0 [6.1685] __kmalloc_cache_noprof+0x59/0x750 [6.1694] btrfs_init_file_extent_tree+0x90/0x100 [6.1702] btrfs_read_locked_inode+0xc3/0x6b0 [6.1710] btrfs_iget+0xbb/0xf0 [6.1716] btrfs_lookup_dentry+0x3c5/0x8e0 [6.1724] btrfs_lookup+0x12/0x30 [6.1731] lookup_open.isra.0+0x1aa/0x6a0 [6.1739] path_openat+0x5f7/0xc60 [6.1746] do_filp_open+0xd6/0x180 [6.1753] do_sys_openat2+0x8b/0xe0 [6.1760] __x64_sys_openat+0x54/0xa0 [6.1768] do_syscall_64+0x97/0x3e0 [6.1776] entry_SYSCALL_64_after_hwframe+0x76/0x7e [6.1784] -> #1 (btrfs-tree-00){++++}-{3:3}: [6.1794] lock_release+0x127/0x2a0 [6.1801] up_read+0x1b/0x30 [6.1808] btrfs_search_slot+0x8e0/0xff0 [6.1817] btrfs_lookup_inode+0x52/0xd0 [6.1825] __btrfs_update_delayed_inode+0x73/0x520 [6.1833] btrfs_commit_inode_delayed_inode+0x11a/0x120 [6.1842] btrfs_log_inode+0x608/0x1aa0 [6.1849] btrfs_log_inode_parent+0x249/0xf80 [6.1857] btrfs_log_dentry_safe+0x3e/0x60 [6.1865] btrfs_sync_file+0x431/0x690 [6.1872] do_fsync+0x39/0x80 [6.1879] __x64_sys_fsync+0x13/0x20 [6.1887] do_syscall_64+0x97/0x3e0 [6.1894] entry_SYSCALL_64_after_hwframe+0x76/0x7e [6.1903] -> #0 (&delayed_node->mutex){+.+.}-{3:3}: [6.1913] __lock_acquire+0x15e9/0x2820 [6.1920] lock_acquire+0xc9/0x2d0 [6.1927] __mutex_lock+0xcc/0x10a0 [6.1934] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1944] btrfs_evict_inode+0x20b/0x4b0 [6.1952] evict+0x15a/0x2f0 [6.1958] prune_icache_sb+0x91/0xd0 [6.1966] super_cache_scan+0x150/0x1d0 [6.1974] do_shrink_slab+0x155/0x6f0 [6.1981] shrink_slab+0x48e/0x890 [6.1988] shrink_one+0x11a/0x1f0 [6.1995] shrink_node+0xbfd/0x1320 [6.1002] balance_pgdat+0x67f/0xc60 [6.1321] kswapd+0x1dc/0x3e0 [6.1643] kthread+0xff/0x240 [6.1965] ret_from_fork+0x223/0x280 [6.1287] ret_from_fork_asm+0x1a/0x30 [6.1616] other info that might help us debug this: [6.1561] Chain exists of: &delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim [6.1503] Possible unsafe locking scenario: [6.1110] CPU0 CPU1 [6.1411] ---- ---- [6.1707] lock(fs_reclaim); [6.1998] lock(btrfs-tree-00); [6.1291] lock(fs_reclaim); [6.1581] lock(&delayed_node->mutex); [6.1874] * DEADLOCK * [6.1716] 2 locks held by kswapd0/117: [6.1999] #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [6.1294] #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0 [6.1596] stack backtrace: [6.1183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G U 6.18.0+ #4 PREEMPT(lazy) [6.1185] Tainted: [U]=USER [6.1186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [6.1187] Call Trace: [6.1187] <TASK> [6.1189] dump_stack_lvl+0x6e/0xa0 [6.1192] print_circular_bug.cold+0x17a/0x1c0 [6.1194] check_noncircular+0x175/0x190 [6.1197] __lock_acquire+0x15e9/0x2820 [6.1200] lock_acquire+0xc9/0x2d0 [6.1201] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1204] __mutex_lock+0xcc/0x10a0 [6.1206] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1208] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1211] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1213] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1215] btrfs_evict_inode+0x20b/0x4b0 [6.1217] ? lock_acquire+0xc9/0x2d0 [6.1220] evict+0x15a/0x2f0 [6.1222] prune_icache_sb+0x91/0xd0 [6.1224] super_cache_scan+0x150/0x1d0 [6.1226] do_shrink_slab+0x155/0x6f0 [6.1228] shrink_slab+0x48e/0x890 [6.1229] ? shrink_slab+0x2d2/0x890 [6.1231] shrink_one+0x11a/0x1f0 [6.1234] shrink_node+0xbfd/0x1320 [6.1236] ? shrink_node+0xa2d/0x1320 [6.1236] ? shrink_node+0xbd3/0x1320 [6.1239] ? balance_pgdat+0x67f/0xc60 [6.1239] balance_pgdat+0x67f/0xc60 [6.1241] ? finish_task_switch.isra.0+0xc4/0x2a0 [6.1246] kswapd+0x1dc/0x3e0 [6.1247] ? __pfx_autoremove_wake_function+0x10/0x10 [6.1249] ? __pfx_kswapd+0x10/0x10 [6.1250] kthread+0xff/0x240 [6.1251] ? __pfx_kthread+0x10/0x10 [6.1253] ret_from_fork+0x223/0x280 [6.1255] ? __pfx_kthread+0x10/0x10 [6.1257] ret_from_fork_asm+0x1a/0x30 [6.1260] </TASK> This is because: 1) The fsync task is holding an inode's delayed node mutex (for a directory) while calling __btrfs_update_delayed_inode() and that needs to do a search on the subvolume's btree (therefore read lock some extent buffers); 2) The lookup task, at btrfs_lookup(), triggered reclaim with the GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while holding a read lock on a subvolume leaf; 3) The reclaim triggered kswapd which is doing inode eviction for the directory inode the fsync task is using as an argument to btrfs_commit_inode_delayed_inode() - but in that call chain we are trying to read lock the same leaf that the lookup task is holding while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL allocation. Fix this by calling btrfs_init_file_extent_tree() after we don't need the path anymore and release it in btrfs_read_locked_inode(). Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/linux-btrfs/6e55113a22347c3925458a5d840a18401a38b276.camel@linux.intel.com/ Fixes: `8679d2687c` ("btrfs: initialize inode::file_extent_tree after i_mode has been set") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-01-06 01:22:59 +01:00
Qu Wenruo	1aff297ffb	btrfs: avoid access-beyond-folio for bs > ps encoded writes [POTENTIAL BUG] If the system page size is 4K and fs block size is 8K, and max_inline mount option is set to 6K, we can inline a 6K sized data extent. Then a encoded write submitted a compressed extent which is at file offset 0, and the compressed length is 6K, which is allowed to be inlined. Now a read beyond page boundary is triggered inside write_extent_buffer() from insert_inline_extent(). [CAUSE] Currently the function __cow_file_range_inline() can only accept a single folio. For regular compressed write path, we always allocate the compressed folios using the minimal order matching the block size, thus the @compressed_folio should always cover a full fs block thus it is fine. But for encoded writes, they allocate page size folios, this means we can hit a case where the compressed data is smaller than block size but still larger than page size, in that case __cow_file_range_inline() will be called with @compressed_size larger than a page. In that case we will trigger a read beyond the folio inside insert_inline_extent(). Thankfully this is not that common, as the default max_inline is only 2048 bytes, smaller than PAGE_SIZE, and bs > ps support is still experimental. [FIX] We need to either allow insert_inline_extent() to accept a page array to properly support such case, or reject such inline extent. The latter is a much simpler solution, and considering bs > ps will stay as a corner case and non-default max_inline will be even rarer, I don't think we really need to fulfill such niche. So just reject any inline extent that's larger than PAGE_SIZE, and add an extra ASSERT() to insert_inline_extent() to catch such beyond-boundary access. Fixes: `ec20799064` ("btrfs: enable encoded read/write/send for bs > ps cases") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-01-06 01:22:59 +01:00
Filipe Manana	c1c050f92d	btrfs: fix reservation leak in some error paths when inserting inline extent If we fail to allocate a path or join a transaction, we return from __cow_file_range_inline() without freeing the reserved qgroup data, resulting in a leak. Fix this by ensuring we call btrfs_qgroup_free_data() in such cases. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:15 +01:00
Filipe Manana	f8da41de0b	btrfs: do not free data reservation in fallback from inline due to -ENOSPC If we fail to create an inline extent due to -ENOSPC, we will attempt to go through the normal COW path, reserve an extent, create an ordered extent, etc. However we were always freeing the reserved qgroup data, which is wrong since we will use data. Fix this by freeing the reserved qgroup data in __cow_file_range_inline() only if we are not doing the fallback (ret is <= 0). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:15 +01:00
Leo Martins	83f59076a1	btrfs: fix use-after-free warning in btrfs_get_or_create_delayed_node() Previously, btrfs_get_or_create_delayed_node() set the delayed_node's refcount before acquiring the root->delayed_nodes lock. Commit `e8513c012d` ("btrfs: implement ref_tracker for delayed_nodes") moved refcount_set inside the critical section, which means there is no longer a memory barrier between setting the refcount and setting btrfs_inode->delayed_node. Without that barrier, the stores to node->refs and btrfs_inode->delayed_node may become visible out of order. Another thread can then read btrfs_inode->delayed_node and attempt to increment a refcount that hasn't been set yet, leading to a refcounting bug and a use-after-free warning. The fix is to move refcount_set back to where it was to take advantage of the implicit memory barrier provided by lock acquisition. Because the allocations now happen outside of the lock's critical section, they can use GFP_NOFS instead of GFP_ATOMIC. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202511262228.6dda231e-lkp@intel.com Fixes: `e8513c012d` ("btrfs: implement ref_tracker for delayed_nodes") Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:15 +01:00
Filipe Manana	7ba0b6461b	btrfs: always detect conflicting inodes when logging inode refs After rename exchanging (either with the rename exchange operation or regular renames in multiple non-atomic steps) two inodes and at least one of them is a directory, we can end up with a log tree that contains only of the inodes and after a power failure that can result in an attempt to delete the other inode when it should not because it was not deleted before the power failure. In some case that delete attempt fails when the target inode is a directory that contains a subvolume inside it, since the log replay code is not prepared to deal with directory entries that point to root items (only inode items). 1) We have directories "dir1" (inode A) and "dir2" (inode B) under the same parent directory; 2) We have a file (inode C) under directory "dir1" (inode A); 3) We have a subvolume inside directory "dir2" (inode B); 4) All these inodes were persisted in a past transaction and we are currently at transaction N; 5) We rename the file (inode C), so at btrfs_log_new_name() we update inode C's last_unlink_trans to N; 6) We get a rename exchange for "dir1" (inode A) and "dir2" (inode B), so after the exchange "dir1" is inode B and "dir2" is inode A. During the rename exchange we call btrfs_log_new_name() for inodes A and B, but because they are directories, we don't update their last_unlink_trans to N; 7) An fsync against the file (inode C) is done, and because its inode has a last_unlink_trans with a value of N we log its parent directory (inode A) (through btrfs_log_all_parents(), called from btrfs_log_inode_parent()). 8) So we end up with inode B not logged, which now has the old name of inode A. At copy_inode_items_to_log(), when logging inode A, we did not check if we had any conflicting inode to log because inode A has a generation lower than the current transaction (created in a past transaction); 9) After a power failure, when replaying the log tree, since we find that inode A has a new name that conflicts with the name of inode B in the fs tree, we attempt to delete inode B... this is wrong since that directory was never deleted before the power failure, and because there is a subvolume inside that directory, attempting to delete it will fail since replay_dir_deletes() and btrfs_unlink_inode() are not prepared to deal with dir items that point to roots instead of inodes. When that happens the mount fails and we get a stack trace like the following: [87.2314] BTRFS info (device dm-0): start tree-log replay [87.2318] BTRFS critical (device dm-0): failed to delete reference to subvol, root 5 inode 256 parent 259 [87.2332] ------------[ cut here ]------------ [87.2338] BTRFS: Transaction aborted (error -2) [87.2346] WARNING: CPU: 1 PID: 638968 at fs/btrfs/inode.c:4345 __btrfs_unlink_inode+0x416/0x440 [btrfs] [87.2368] Modules linked in: btrfs loop dm_thin_pool (...) [87.2470] CPU: 1 UID: 0 PID: 638968 Comm: mount Tainted: G W 6.18.0-rc7-btrfs-next-218+ #2 PREEMPT(full) [87.2489] Tainted: [W]=WARN [87.2494] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [87.2514] RIP: 0010:__btrfs_unlink_inode+0x416/0x440 [btrfs] [87.2538] Code: c0 89 04 24 (...) [87.2568] RSP: 0018:ffffc0e741f4b9b8 EFLAGS: 00010286 [87.2574] RAX: 0000000000000000 RBX: ffff9d3ec8a6cf60 RCX: 0000000000000000 [87.2582] RDX: 0000000000000002 RSI: ffffffff84ab45a1 RDI: 00000000ffffffff [87.2591] RBP: ffff9d3ec8a6ef20 R08: 0000000000000000 R09: ffffc0e741f4b840 [87.2599] R10: ffff9d45dc1fffa8 R11: 0000000000000003 R12: ffff9d3ee26d77e0 [87.2608] R13: ffffc0e741f4ba98 R14: ffff9d4458040800 R15: ffff9d44b6b7ca10 [87.2618] FS: 00007f7b9603a840(0000) GS:ffff9d4658982000(0000) knlGS:0000000000000000 [87.2629] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [87.2637] CR2: 00007ffc9ec33b98 CR3: 000000011273e003 CR4: 0000000000370ef0 [87.2648] Call Trace: [87.2651] <TASK> [87.2654] btrfs_unlink_inode+0x15/0x40 [btrfs] [87.2661] unlink_inode_for_log_replay+0x27/0xf0 [btrfs] [87.2669] check_item_in_log+0x1ea/0x2c0 [btrfs] [87.2676] replay_dir_deletes+0x16b/0x380 [btrfs] [87.2684] fixup_inode_link_count+0x34b/0x370 [btrfs] [87.2696] fixup_inode_link_counts+0x41/0x160 [btrfs] [87.2703] btrfs_recover_log_trees+0x1ff/0x7c0 [btrfs] [87.2711] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [87.2719] open_ctree+0x10bb/0x15f0 [btrfs] [87.2726] btrfs_get_tree.cold+0xb/0x16c [btrfs] [87.2734] ? fscontext_read+0x15c/0x180 [87.2740] ? rw_verify_area+0x50/0x180 [87.2746] vfs_get_tree+0x25/0xd0 [87.2750] vfs_cmd_create+0x59/0xe0 [87.2755] __do_sys_fsconfig+0x4f6/0x6b0 [87.2760] do_syscall_64+0x50/0x1220 [87.2764] entry_SYSCALL_64_after_hwframe+0x76/0x7e [87.2770] RIP: 0033:0x7f7b9625f4aa [87.2775] Code: 73 01 c3 48 (...) [87.2803] RSP: 002b:00007ffc9ec35b08 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [87.2817] RAX: ffffffffffffffda RBX: 0000558bfa91ac20 RCX: 00007f7b9625f4aa [87.2829] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [87.2842] RBP: 0000558bfa91b120 R08: 0000000000000000 R09: 0000000000000000 [87.2854] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [87.2864] R13: 00007f7b963f1580 R14: 00007f7b963f326c R15: 00007f7b963d8a23 [87.2877] </TASK> [87.2882] ---[ end trace 0000000000000000 ]--- [87.2891] BTRFS: error (device dm-0 state A) in __btrfs_unlink_inode:4345: errno=-2 No such entry [87.2904] BTRFS: error (device dm-0 state EAO) in do_abort_log_replay:191: errno=-2 No such entry [87.2915] BTRFS critical (device dm-0 state EAO): log tree (for root 5) leaf currently being processed (slot 7 key (258 12 257)): [87.2929] BTRFS info (device dm-0 state EAO): leaf 30736384 gen 10 total ptrs 7 free space 15712 owner 18446744073709551610 [87.2929] BTRFS info (device dm-0 state EAO): refs 3 lock_owner 0 current 638968 [87.2929] item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160 [87.2929] inode generation 9 transid 10 size 0 nbytes 0 [87.2929] block group 0 mode 40755 links 1 uid 0 gid 0 [87.2929] rdev 0 sequence 7 flags 0x0 [87.2929] atime 1765464494.678070921 [87.2929] ctime 1765464494.686606513 [87.2929] mtime 1765464494.686606513 [87.2929] otime 1765464494.678070921 [87.2929] item 1 key (257 INODE_REF 256) itemoff 16109 itemsize 14 [87.2929] index 4 name_len 4 [87.2929] item 2 key (257 DIR_LOG_INDEX 2) itemoff 16101 itemsize 8 [87.2929] dir log end 2 [87.2929] item 3 key (257 DIR_LOG_INDEX 3) itemoff 16093 itemsize 8 [87.2929] dir log end 18446744073709551615 [87.2930] item 4 key (257 DIR_INDEX 3) itemoff 16060 itemsize 33 [87.2930] location key (258 1 0) type 1 [87.2930] transid 10 data_len 0 name_len 3 [87.2930] item 5 key (258 INODE_ITEM 0) itemoff 15900 itemsize 160 [87.2930] inode generation 9 transid 10 size 0 nbytes 0 [87.2930] block group 0 mode 100644 links 1 uid 0 gid 0 [87.2930] rdev 0 sequence 2 flags 0x0 [87.2930] atime 1765464494.678456467 [87.2930] ctime 1765464494.686606513 [87.2930] mtime 1765464494.678456467 [87.2930] otime 1765464494.678456467 [87.2930] item 6 key (258 INODE_REF 257) itemoff 15887 itemsize 13 [87.2930] index 3 name_len 3 [87.2930] BTRFS critical (device dm-0 state EAO): log replay failed in unlink_inode_for_log_replay:1045 for root 5, stage 3, with error -2: failed to unlink inode 256 parent dir 259 name subvol root 5 [87.2963] BTRFS: error (device dm-0 state EAO) in btrfs_recover_log_trees:7743: errno=-2 No such entry [87.2981] BTRFS: error (device dm-0 state EAO) in btrfs_replay_log:2083: errno=-2 No such entry (Failed to recover log tr So fix this by changing copy_inode_items_to_log() to always detect if there are conflicting inodes for the ref/extref of the inode being logged even if the inode was created in a past transaction. A test case for fstests will follow soon. CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:15 +01:00
Qu Wenruo	e9e3b22ddf	btrfs: fix beyond-EOF write handling [BUG] For the following write sequence with 64K page size and 4K fs block size, it will lead to file extent items to be inserted without any data checksum: mkfs.btrfs -s 4k -f $dev > /dev/null mount $dev $mnt xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \ -c "truncate 16k" $mnt/foobar umount $mnt This will result the following 2 file extent items to be inserted (extra trace point added to insert_ordered_extent_file_extent()): btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0 btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384 Note for file offset 60K, we're inserting a file extent without any data checksum. Also note that range [32K, 36K) didn't reach insert_ordered_extent_file_extent(), which is the correct behavior as that OE is fully truncated, should not result any file extent. Although file extent at 60K will be later dropped by btrfs_truncate(), if the transaction got committed after file extent inserted but before the file extent dropping, we will have a small window where we have a file extent beyond EOF and without any data checksum. That will cause "btrfs check" to report error. [CAUSE] The sequence happens like this: - Buffered write dirtied the page cache and updated isize Now the inode size is 64K, with the following page cache layout: 0 16K 32K 48K 64K \|/////////////\| \|//\| \|//\| - Truncate the inode to 16K Which will trigger writeback through: btrfs_setsize() \|- truncate_setsize() \| Now the inode size is set to 16K \| \|- btrfs_truncate() \|- btrfs_wait_ordered_range() for [16K, u64(-1)] \|- btrfs_fdatawrite_range() for [16K, u64(-1)} \|- extent_writepage() for folio 0 \|- writepage_delalloc() \| Generated OE for [0, 16K), [32K, 36K] and [60K, 64K) \| \|- extent_writepage_io() Then inside extent_writepage_io(), the dirty fs blocks are handled differently: - Submit write for range [0, 16K) As they are still inside the inode size (16K). - Mark OE [32K, 36K) as truncated Since we only call btrfs_lookup_first_ordered_range() once, which returned the first OE after file offset 16K. - Mark all OEs inside range [16K, 64K) as finished Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished. For OE [32K, 36K) since it's already marked as truncated, and its truncated length is 0, no file extent will be inserted. For OE [60K, 64K) it has never been submitted thus has no data checksum, and we insert the file extent as usual. This is the root cause of file extent at 60K to be inserted without any data checksum. - Clear dirty flags for range [16K, 64K) It is the function btrfs_folio_clear_dirty() which searches and clears any dirty blocks inside that range. [FIX] The bug itself was introduced a long time ago, way before subpage and large folio support. At that time, fs block size must match page size, thus the range [cur, end) is just one fs block. But later with subpage and large folios, the same range [cur, end) can have multiple blocks and ordered extents. Later commit `18de34daa7` ("btrfs: truncate ordered extent when skipping writeback past i_size") was fixing a bug related to subpage/large folios, but it's still utilizing the old range [cur, end), meaning only the first OE will be marked as truncated. The proper fix here is to make EOF handling block-by-block, not trying to handle the whole range to @end. By this we always locate and truncate the OE for every dirty block. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:14 +01:00
Robbie Ko	5037b34282	btrfs: fix deadlock in wait_current_trans() due to ignored transaction type When wait_current_trans() is called during start_transaction(), it currently waits for a blocked transaction without considering whether the given transaction type actually needs to wait for that particular transaction state. The btrfs_blocked_trans_types[] array already defines which transaction types should wait for which transaction states, but this check was missing in wait_current_trans(). This can lead to a deadlock scenario involving two transactions and pending ordered extents: 1. Transaction A is in TRANS_STATE_COMMIT_DOING state 2. A worker processing an ordered extent calls start_transaction() with TRANS_JOIN 3. join_transaction() returns -EBUSY because Transaction A is in TRANS_STATE_COMMIT_DOING 4. Transaction A moves to TRANS_STATE_UNBLOCKED and completes 5. A new Transaction B is created (TRANS_STATE_RUNNING) 6. The ordered extent from step 2 is added to Transaction B's pending ordered extents 7. Transaction B immediately starts commit by another task and enters TRANS_STATE_COMMIT_START 8. The worker finally reaches wait_current_trans(), sees Transaction B in TRANS_STATE_COMMIT_START (a blocked state), and waits unconditionally 9. However, TRANS_JOIN should NOT wait for TRANS_STATE_COMMIT_START according to btrfs_blocked_trans_types[] 10. Transaction B is waiting for pending ordered extents to complete 11. Deadlock: Transaction B waits for ordered extent, ordered extent waits for Transaction B This can be illustrated by the following call stacks: CPU0 CPU1 btrfs_finish_ordered_io() start_transaction(TRANS_JOIN) join_transaction() # -EBUSY (Transaction A is # TRANS_STATE_COMMIT_DOING) # Transaction A completes # Transaction B created # ordered extent added to # Transaction B's pending list btrfs_commit_transaction() # Transaction B enters # TRANS_STATE_COMMIT_START # waiting for pending ordered # extents wait_current_trans() # waits for Transaction B # (should not wait!) Task bstore_kv_sync in btrfs_commit_transaction waiting for ordered extents: __schedule+0x2e7/0x8a0 schedule+0x64/0xe0 btrfs_commit_transaction+0xbf7/0xda0 [btrfs] btrfs_sync_file+0x342/0x4d0 [btrfs] __x64_sys_fdatasync+0x4b/0x80 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Task kworker in wait_current_trans waiting for transaction commit: Workqueue: btrfs-syno_nocow btrfs_work_helper [btrfs] __schedule+0x2e7/0x8a0 schedule+0x64/0xe0 wait_current_trans+0xb0/0x110 [btrfs] start_transaction+0x346/0x5b0 [btrfs] btrfs_finish_ordered_io.isra.0+0x49b/0x9c0 [btrfs] btrfs_work_helper+0xe8/0x350 [btrfs] process_one_work+0x1d3/0x3c0 worker_thread+0x4d/0x3e0 kthread+0x12d/0x150 ret_from_fork+0x1f/0x30 Fix this by passing the transaction type to wait_current_trans() and checking btrfs_blocked_trans_types[cur_trans->state] against the given type before deciding to wait. This ensures that transaction types which are allowed to join during certain blocked states will not unnecessarily wait and cause deadlocks. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:14 +01:00
Miquel Sabaté Solà	f157dd6613	btrfs: fix NULL dereference on root when tracing inode eviction When evicting an inode the first thing we do is to setup tracing for it, which implies fetching the root's id. But in btrfs_evict_inode() the root might be NULL, as implied in the next check that we do in btrfs_evict_inode(). Hence, we either should set the ->root_objectid to 0 in case the root is NULL, or we move tracing setup after checking that the root is not NULL. Setting the rootid to 0 at least gives us the possibility to trace this call even in the case when the root is NULL, so that's the solution taken here. Fixes: `1abe9b8a13` ("Btrfs: add initial tracepoint support for btrfs") Reported-by: syzbot+d991fea1b4b23b1f6bf8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d991fea1b4b23b1f6bf8 Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:14 +01:00
Qu Wenruo	68d4b3fa18	btrfs: qgroup: update all parent qgroups when doing quick inherit [BUG] There is a bug that if a subvolume has multi-level parent qgroups, and is able to do a quick inherit, only the direct parent qgroup got updated: mkfs.btrfs -f -O quota $dev mount $dev $mnt btrfs subv create $mnt/subv1 btrfs qgroup create 1/100 $mnt btrfs qgroup create 2/100 $mnt btrfs qgroup assign 1/100 2/100 $mnt btrfs qgroup assign 0/256 1/100 $mnt btrfs qgroup show -p --sync $mnt Qgroupid Referenced Exclusive Parent Path -------- ---------- --------- ------ ---- 0/5 16.00KiB 16.00KiB - <toplevel> 0/256 16.00KiB 16.00KiB 1/100 subv1 1/100 16.00KiB 16.00KiB 2/100 2/100<1 member qgroup> 2/100 16.00KiB 16.00KiB - <0 member qgroups> btrfs subv snap -i 1/100 $mnt/subv1 $mnt/snap1 btrfs qgroup show -p --sync $mnt Qgroupid Referenced Exclusive Parent Path -------- ---------- --------- ------ ---- 0/5 16.00KiB 16.00KiB - <toplevel> 0/256 16.00KiB 16.00KiB 1/100 subv1 0/257 16.00KiB 16.00KiB 1/100 snap1 1/100 32.00KiB 32.00KiB 2/100 2/100<1 member qgroup> 2/100 16.00KiB 16.00KiB - <0 member qgroups> # Note that 2/100 is not updated, and qgroup numbers are inconsistent umount $mnt [CAUSE] If the snapshot source subvolume belongs to a parent qgroup, and the new snapshot target is also added to the new same parent qgroup, we allow a quick update without marking qgroup inconsistent. But that quick update only update the parent qgroup, without checking if there is any more parent qgroups. [FIX] Iterate through all parent qgroups during the quick inherit. Reported-by: Boris Burkov <boris@bur.io> Fixes: `b20fe56cd2` ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:53:14 +01:00
Boris Burkov	7ee19a59a7	btrfs: fix qgroup_snapshot_quick_inherit() squota bug qgroup_snapshot_quick_inherit() detects conditions where the snapshot destination would land in the same parent qgroup as the snapshot source subvolume. In this case we can avoid costly qgroup calculations and just add the nodesize of the new snapshot to the parent. However, in the case of squotas this is actually a double count, and also an undercount for deeper qgroup nestings. The following annotated script shows the issue: btrfs quota enable --simple "$mnt" # Create 2-level qgroup hierarchy btrfs qgroup create 2/100 "$mnt" # Q2 (level 2) btrfs qgroup create 1/100 "$mnt" # Q1 (level 1) btrfs qgroup assign 1/100 2/100 "$mnt" # Create base subvolume btrfs subvolume create "$mnt/base" >/dev/null base_id=$(btrfs subvolume show "$mnt/base" \| grep 'Subvolume ID:' \| awk '{print $3}') # Create intermediate snapshot and add to Q1 btrfs subvolume snapshot "$mnt/base" "$mnt/intermediate" >/dev/null inter_id=$(btrfs subvolume show "$mnt/intermediate" \| grep 'Subvolume ID:' \| awk '{print $3}') btrfs qgroup assign "0/$inter_id" 1/100 "$mnt" # Create working snapshot with --inherit (auto-adds to Q1) # src=intermediate (in only Q1) # dst=snap (inheriting only into Q1) # This double counts the 16k nodesize of the snapshot in Q1, and # undercounts it in Q2. btrfs subvolume snapshot -i 1/100 "$mnt/intermediate" "$mnt/snap" >/dev/null snap_id=$(btrfs subvolume show "$mnt/snap" \| grep 'Subvolume ID:' \| awk '{print $3}') # Fully complete snapshot creation sync # Delete working snapshot # Q1 and Q2 will lose the full snap usage btrfs subvolume delete "$mnt/snap" >/dev/null # Delete intermediate and remove from Q1 # Q1 and Q2 will lose the full intermediate usage btrfs qgroup remove "0/$inter_id" 1/100 "$mnt" btrfs subvolume delete "$mnt/intermediate" >/dev/null # Q1 should be at 0, but still has 16k. Q2 is "correct" at 0 (for now...) # Trigger cleaner, wait for deletions mount -o remount,sync=1 "$mnt" btrfs subvolume sync "$mnt" "$snap_id" btrfs subvolume sync "$mnt" "$inter_id" # Remove Q1 from Q2 # Frees 16k more from Q2, underflowing it to 16EiB btrfs qgroup remove 1/100 2/100 "$mnt" # And show the bad state: btrfs qgroup show -pc "$mnt" Qgroupid Referenced Exclusive Parent Child Path -------- ---------- --------- ------ ----- ---- 0/5 16.00KiB 16.00KiB - - <toplevel> 0/256 16.00KiB 16.00KiB - - base 1/100 16.00KiB 16.00KiB - - <0 member qgroups> 2/100 16.00EiB 16.00EiB - - <0 member qgroups> Fix this by simply not doing this quick inheritance with squotas. I suspect that it is also wrong in normal qgroups to not recurse up the qgroup tree in the quick inherit case, though other consistency checks will likely fix it anyway. Fixes: `b20fe56cd2` ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-16 22:09:52 +01:00
Filipe Manana	37343524f0	btrfs: fix changeset leak on mmap write after failure to reserve metadata If the call to btrfs_delalloc_reserve_metadata() fails we jump to the 'out_noreserve' label and there we never free the extent_changeset allocated by the previous call to btrfs_check_data_free_space() (if qgroups are enabled). Fix this by calling extent_changeset_free() under the 'out_noreserve' label. Fixes: `6599716de2` ("btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents") Reported-by: syzbot+2f8aa76e6acc9fce6638@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/693a635a.a70a0220.33cd7b.0029.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-12 16:33:18 +01:00
Deepanshu Kartikey	b57f2ddd28	btrfs: fix memory leak of fs_devices in degraded seed device path In open_seed_devices(), when find_fsid() fails and we're in DEGRADED mode, a new fs_devices is allocated via alloc_fs_devices() but is never added to the seed_list before returning. This contrasts with the normal path where fs_devices is properly added via list_add(). If any error occurs later in read_one_dev() or btrfs_read_chunk_tree(), the cleanup code iterates seed_list to free seed devices, but this orphaned fs_devices is never found and never freed, causing a memory leak. Any devices allocated via add_missing_dev() and attached to this fs_devices are also leaked. Fix this by adding the newly allocated fs_devices to seed_list in the degraded path, consistent with the normal path. Fixes: `5f37583569` ("Btrfs: move the missing device to its own fs device list") Reported-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=eadd98df8bceb15d7fed Tested-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-12 16:33:12 +01:00
Qu Wenruo	313ef70a9f	btrfs: fix a potential path leak in print_data_reloc_error() Inside print_data_reloc_error(), if extent_from_logical() failed we return immediately. However there are the following cases where extent_from_logical() can return error but still holds a path: - btrfs_search_slot() returned 0 - No backref item found in extent tree - No flags_ret provided This is not possible in this call site though. So for the above two cases, we can return without releasing the path, causing extent buffer leaks. Fixes: `b9a9a85059` ("btrfs: output affected files when relocation fails") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-09 04:35:28 +01:00
Qu Wenruo	428e1b114c	Revert "btrfs: add ASSERTs on prealloc in qgroup functions" This reverts commit `252877a870`. Commit `252877a870` ("btrfs: add ASSERTs on prealloc in qgroup functions") tries to remove the kfree() on preallocated qgroup during several call sites, but this cannot work as intended: - btrfs_quota_enable() - btrfs_create_qgroup() If add_qgroup_item() failed, we go out_free_path() and at that time prealloc is not yet utilized and will trigger the new ASSERT(). - btrfs_qgroup_inherit() If qgroup_auto_inherit() failed, prealloc is not yet utilized and will trigger the new ASSERT() Reported-by: syzbot+b44d4a4885bc82af2a06@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/69369331.a70a0220.38f243.009e.GAE@google.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-09 04:32:46 +01:00
Filipe Manana	5630f7557d	btrfs: do not skip logging new dentries when logging a new name When we are logging a directory and the log context indicates that we are logging a new name for some other file (that is or was inside that directory), we skip logging the inodes for new dentries in the directory. This is ok most of the time, but if after the rename or link operation that triggered the logging of that directory, we have an explicit fsync of that directory without the directory inode being evicted and reloaded, we end up never logging the inodes for the new dentries that we found during the new name logging, as the next directory fsync will only process dentries that were added after the last time we logged the directory (we are doing an incremental directory logging). So make sure we always log new dentries for a directory even if we are in a context of logging a new name. We started skipping logging inodes for new dentries as of commit `c48792c6ee` ("btrfs: do not log new dentries when logging that a new name exists") and it was fine back then, because when logging a directory we always iterated over all the directory entries (for leaves changed in the current transaction) so a subsequent fsync would always log anything that was previously skipped while logging a directory when logging a new name (with btrfs_log_new_name()). But later support for incrementally logging a directory was added in commit `dc2872247e` ("btrfs: keep track of the last logged keys when logging a directory"), to avoid checking all dir items every time we log a directory, so the check to skip dentry logging added in the first commit should have been removed when the incremental support for logging a directory was added. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/84c4e713-85d6-42b9-8dcf-0722ed26cb05@gmail.com/ Fixes: `dc2872247e` ("btrfs: keep track of the last logged keys when logging a directory") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-08 21:46:56 +01:00
Filipe Manana	266273eaf4	btrfs: don't log conflicting inode if it's a dir moved in the current transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate\|private\|writeback\|node=0\|zone=2\|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slva.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/7bbc9419-5c56-450a-b5a0-efeae7457113@gmail.com/ CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-08 21:46:47 +01:00
Dan Carpenter	564d59410c	btrfs: tests: fix double btrfs_path free in remove_extent_ref() We converted this code to use auto free cleanup.h magic but one remaining free was accidentally left behind which leads to a double free bug. Fixes: `a320476ca8` ("btrfs: tests: do trivial BTRFS_PATH_AUTO_FREE conversions") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-08 21:46:43 +01:00
Filipe Manana	9e0e6577b3	btrfs: remove unnecessary inode key in btrfs_log_all_parents() We are setting up an inode key to lookup parent directory inode but all we need is the inode's objectid. The use of the key was necessary in the past but since commit `0202e83fda` ("btrfs: simplify iget helpers") we only need the objectid. So remove the key variable in the stack and use instead a simple u64 for the inode's objectid. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	1c3e03b340	btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root() We have allocated the root with kzalloc() so all the memory is already zero initialized, therefore it's redundant to assign 0 and NULL to several of the root members. Remove all of them except the atomic initializations since atomic_t is an opaque type and it's not a good practice to assume its internals. This slightly reduces the binary size. With gcc 14.2.0-19 from Debian on x86_64, before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939404 162963 15592 2117959 205147 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939212 162963 15592 2117767 205087 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
David Sterba	10934c131f	btrfs: remaining BTRFS_PATH_AUTO_FREE conversions Do the remaining btrfs_path conversion to the auto cleaning, this seems to be the last one. Most of the conversions are trivial, only adding the declaration and removing the freeing, or changing the goto patterns to return. There are some functions with many changes, like __btrfs_free_extent(), btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree() but it still follows the same pattern. Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	5c9cac55b7	btrfs: send: do not allocate memory for xattr data when checking it exists When checking if xattrs were deleted we don't care about their data, but we are allocating memory for the data and copying it, which only wastes time and can result in an unnecessary error in case the allocation fails. So stop allocating memory and copying data by making find_xattr() and __find_xattr() skip those steps if the given data buffer is NULL. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	7c3acdb998	btrfs: send: add unlikely to all unexpected overflow checks There are several checks for unexpected overflows of buffers and path lengths that makes us fail the send operation with an error if for some highly unexpected reason they happen. So add the unlikely tag to those checks to hint the compiler to generate better code, while also making it more explicit in the source that it's highly unexpected. With gcc 14.2.0-19 from Debian on x86_64, I also got a small reduction the text size of the btrfs module. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1936917 162723 15592 2115232 2046a0 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1936789 162723 15592 2115104 204620 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	139e3167d8	btrfs: reduce arguments to btrfs_del_inode_ref_in_log() Instead of passing a root and the objectid of the parent directory, just pass the directory inode, as like that we can extract both the root and the objectid, reducing the number of arguments by one. It also makes the function more consistent with other log tree functions in the sense that we pass the inode and not only its objectid. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	1361f7d8da	btrfs: remove root argument from btrfs_del_dir_entries_in_log() There's no need to pass the root as we can extract it from the directory inode, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:32 +01:00
Filipe Manana	9c78fe4a85	btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref() Instead of testing and setting the BTRFS_DELAYED_NODE_DEL_IREF bit in the delayed node's flags, use test_and_set_bit() which makes the code shorter without compromising readability and getting rid of the label and goto. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:32 +01:00
Josef Bacik	70085399b1	btrfs: don't search back for dir inode item in INO_LOOKUP_USER We don't need to search back to the inode item, the directory inode number is in key.offset, so simply use that. If we can't find the directory we'll get an ENOENT at the iget(). Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:27 +01:00
Josef Bacik	0185c2292c	btrfs: don't rewrite ret from inode_permission In our user safe ino resolve ioctl we'll just turn any ret into -EACCES from inode_permission(). This is redundant, and could potentially be wrong if we had an ENOMEM in the security layer or some such other error, so simply return the actual return value. Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Fixes: `23d0b79dfa` ("btrfs: Add unprivileged version of ino_lookup ioctl") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:52:24 +01:00
Josef Bacik	bd45e9e3f6	btrfs: add orig_logical to btrfs_bio for encryption When checksumming the encrypted bio on writes we need to know which logical address this checksum is for. At the point where we get the encrypted bio the bi_sector is the physical location on the target disk, so we need to save the original logical offset in the btrfs_bio. Then we can use this when checksumming the bio instead of the bio->iter.bi_sector. Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:52:23 +01:00
Sweet Tea Dorminy	45d99129b6	btrfs: disable verity on encrypted inodes Right now there isn't a way to encrypt things that aren't either filenames in directories or data on blocks on disk with extent encryption, so for now, disable verity usage with encryption on btrfs. fscrypt with fsverity should be possible and it can be implemented in the future. Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Omar Sandoval	f968340053	btrfs: disable various operations on encrypted inodes Initially, only normal data extents will be encrypted. This change forbids various other bits: - allows reflinking only if both inodes have the same encryption status - disable inline data on encrypted inodes Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Sun YangKai	4357dd76f5	btrfs: remove redundant level reset in btrfs_del_items() When btrfs_del_items() empties a leaf, it deletes the leaf unless it's the root node. For the root leaf case, the code used to reset its level to 0 via btrfs_set_header_level(). This is redundant as leaf nodes always have level == 0. Remove the unnecessary level assignment and invert the conditional to handle only the non-root leaf deletion. The root leaf is correctly left as-is. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Sun YangKai	139f75a3b1	btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf() After releasing the path in btrfs_next_old_leaf(), we need to re-check the leaf because a balance operation may have added items or removed the last item. The original code handled this with two separate conditional blocks, the second marked with a lengthy comment explaining a "missed case". Merge these two blocks into a single logical structure that handles both scenarios more clearly. Also update the comment to be more concise and accurate, incorporating the explanation directly into the main block rather than a separate annotation. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Sun YangKai	3afa17bf24	btrfs: optimize balance_level() path reference handling Instead of incrementing refcount on 'left' node when it's referenced by path, simply transfer ownership to path and set left to NULL. This eliminates: - Unnecessary refcount increment/decrement operations - Redundant conditional checks for left node cleanup The path now consistently owns the left node reference when used. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Sun YangKai	31b37b7667	btrfs: factor out root promotion logic into promote_child_to_root() The balance_level() function is overly long and contains a cold code path that handles promoting a child node to root when the root has only one item. This code has distinct logic that is clearer and more maintainable when isolated in its own function. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Qu Wenruo	1a332a6d70	btrfs: raid56: remove the "_step" infix The following functions are introduced as a middle step for bs > ps support: - rbio_streip_step_paddr() - rbio_pstripe_step_paddr() - rbio_qstripe_step_paddr() - sector_step_paddr_in_rbio() As there is already an existing function without the infix, and has a different parameter list. But the existing functions have been cleaned up, there is no need to keep the "_step" infix, just remove it completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:19 +01:00
Qu Wenruo	8870dbeedc	btrfs: raid56: enable bs > ps support The support code for bs > ps is complete, enable it and update assertions. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:48:52 +01:00
Qu Wenruo	89ca1a403e	btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases The function finish_parity_scrub() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce a helper, verify_one_parity_step() Since the P/Q generation is always done in a vertical stripe, we have to handle the range step by step. - Only clear the rbio->dbitmap if all steps of an fs block match - Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers Now we either use the paddrs version for checksum, or the step version for P/Q generation/recovery. - Make alloc_rbio_essential_pages() to handle bs > ps cases Since for bs > ps cases, one fs block needs multiple pages, the existing simple check against rbio->stripe_pages[] is not enough. Extract a dedicated helper, alloc_rbio_sector_pages(), for the existing alloc_rbio_essential_pages(), which is still based on sector number. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:48:02 +01:00
Qu Wenruo	ba88278c69	btrfs: raid56: prepare rbio_bio_add_io_paddr() to support bs > ps cases The function rbio_bio_add_io_paddr() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce a helper bio_add_paddrs() Previously we only need to add a single page to a bio for a fs block, but now we need to add multiple pages, this means we can fail halfway. In that case we need to properly revert the bio (only for its size though) for halfway failed cases. - Rename rbio_add_io_paddr() to rbio_add_io_paddrs() And change the @paddr parameter to @paddrs[]. - Change all callers to use the updated rbio_add_io_paddrs() For the @paddrs pointer used for the new function, it can be grabbed using sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:54 +01:00
Qu Wenruo	53474a2ae1	btrfs: raid56: prepare steal_rbio() to support bs > ps cases The function steal_rbio() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce two helpers to calculate the sector number Previously we assume one page will contain at least one fs block, thus can use something like "sectors_per_page = PAGE_SIZE / sectorsize;", but with bs > ps support that above number will be 0. Instead introduce two helpers: * page_nr_to_sector_nr() Returns the sector number of the first sector covered by the page. * page_nr_to_num_sectors() Return how many sectors are covered by the page. And use the returned values for bitmap operations other than open-coded "PAGE_SIZE / sectorsize". Those helpers also have extra ASSERT()s to catch weird numbers. - Use above helpers The involved functions are: * steal_rbio_page() * is_data_stripe_page() * full_page_sectors_uptodate() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:41 +01:00
Qu Wenruo	05ddf35a5d	btrfs: raid56: prepare set_bio_pages_uptodate() to support bs > ps cases The function set_bio_pages_uptodate() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Update find_stripe_sector_nr() to check only the first step paddr We don't need to check each paddr, as the bios are still aligned to fs block size, thus checking the first step is enough. - Use step size to iterate the bio This means we only need to find the sector number for the first step of each fs block, and skip the remaining part. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:31 +01:00
Qu Wenruo	64e7b8c7c5	btrfs: raid56: prepare verify_bio_data_sectors() to support bs > ps cases The function verify_bio_data_sectors() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Make get_bio_sector_nr() to consider bs > ps cases The function is utilized to calculate the sector number of a device bio submitted by btrfs raid56 layer. - Assemble a local paddrs[] for checksum calculation - Open code btrfs_check_block_csum() btrfs_check_block_csum() only supports fs blocks backed by large folios. But for raid56 we can have fs blocks backed by multiple non-contiguous pages, e.g. direct IO, encoded read/write/send. So instead of using btrfs_check_block_csum(), open code it to use btrfs_calculate_block_csum_pages(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:20 +01:00
Qu Wenruo	e0eadfcc95	btrfs: raid56: prepare verify_one_sector() to support bs > ps cases The function verify_one_sector() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce helpers to get a paddrs pointer Thankfully all the higher layer bio should still be aligned to fs block size, thus a fs block should still be fully covered by the bio. Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or stripe_paddrs[]. The pointer can be directly passed to btrfs_calculate_block_csum_pages() to verify the checksum. - Open code btrfs_check_block_csum() btrfs_check_block_csum() only supports fs blocks backed by large folios. But for raid56 we can have fs blocks backed by multiple non-contiguous pages, e.g. direct IO, encoded read/write/send. So instead of using btrfs_check_block_csum(), open code it to use btrfs_calculate_block_csum_pages(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:16 +01:00
Qu Wenruo	9ba67fd616	btrfs: raid56: prepare recover_vertical() to support bs > ps cases Currently recover_vertical() assumes that every fs block can be mapped by one page, this is blocking bs > ps support for raid56. Prepare recover_vertical() to support bs > ps cases by: - Introduce recover_vertical_step() helper Which will recover a full step (min(PAGE_SIZE, sectorsize)). Now recover_vertical() will do the error check for the specified sector, do the recover step by step, then do the sector verification. - Fix a spelling error of get_rbio_vertical_errors() The old name has a typo: "veritical". Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:11 +01:00
Qu Wenruo	826325b6d0	btrfs: raid56: prepare generate_pq_vertical() for bs > ps cases Unlike btrfs_calculate_block_csum_pages(), we cannot handle multiple pages at the same time for P/Q generation. So here we introduce a new @step_nr, and various helpers to grab the sub-block page from the rbio, and generate the P/Q stripe page by page. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:05 +01:00
Qu Wenruo	91cd1b5865	btrfs: raid56: introduce a new parameter to locate a sector Since we cannot ensure that all bios from the higher layer are backed by large folios (e.g. direct IO, encoded read/write/send), we need the ability to locate sub-block (aka, a page) inside a full stripe. So the existing @stripe_nr + @sector_nr combination is not enough to locate such page for bs > ps cases. Introduce a new parameter, @step_nr, to locate the page of a larger fs block. The naming is following the conventions used inside btrfs elsewhere, where one step is min(sectorsize, PAGE_SIZE). It's still a preparation, only touching the following aspects: - btrfs_dump_rbio() To show the new @sector_nsteps member. - btrfs_raid_bio::sector_nsteps Recording how many steps there are inside a fs block. - Enlarge btrfs_raid_bio::_paddrs[] size To take @sector_nsteps into consideration. - index_one_bio() - index_stripe_sectors() - memcpy_from_bio_to_stripe() - cache_rbio_pages() - need_read_stripe_sectors() Those functions are iterating _paddrs[], which needs to take sector_nsteps into consideration. - Rename rbio_stripe_sector_index() to rbio_sector_index() The "stripe" part is not that helpful. And an extra ASSERT() before returning the result. - Add a new rbio_paddr_index() helper This will take the extra @step_nr into consideration. - The comments of btrfs_raid_bio Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:46:58 +01:00
Qu Wenruo	9042dc0002	btrfs: raid56: add an overview for the btrfs_raid_bio structure The structure needs to track both the pages from higher layer bio and internal pages, thus it can be a little complex to grasp. Add an overview of the structure, especially how we track different pages from higher layer bios and internal ones, to save some time for future developers. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:46:29 +01:00

1 2 3 4 5 ...

1398708 Commits (530e3d4af566ca44807d79359b90794dea24c4f3) All Branches Search

1398708 Commits (530e3d4af566ca44807d79359b90794dea24c4f3)

All Branches