Commit Graph

1398708 Commits (530e3d4af566ca44807d79359b90794dea24c4f3)

Author SHA1 Message Date
Suchit Karunakaran 530e3d4af5 btrfs: fix NULL pointer dereference in do_abort_log_replay()
Coverity reported a NULL pointer dereference issue (CID 1666756) in
do_abort_log_replay(). When btrfs_alloc_path() fails in
replay_one_buffer(), wc->subvol_path is NULL, but btrfs_abort_log_replay()
calls do_abort_log_replay() which unconditionally dereferences
wc->subvol_path when attempting to print debug information. Fix this by
adding a NULL check before dereferencing wc->subvol_path in
do_abort_log_replay().

Fixes: 2753e49176 ("btrfs: dump detailed info and specific messages on log replay failures")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-06 01:23:00 +01:00
Qu Wenruo cefd809251 btrfs: force free space tree for bs > ps cases
[BUG]
Currently we only enforcing the free space tree for bs < ps cases, but
with the recently added bs > ps support, we lack the free space tree
enforcing, causing explicit v1 cache mount option to fail on bs > ps
cases:

  # mount -o space_cache=v1 /dev/test/scratch1  /mnt/btrfs/
  mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/mapper/test-scratch1, missing codepage or helper program, or other error.
         dmesg(1) may have more information after failed mount system call.

  # dmesg -t | tail -n7
  BTRFS: device fsid ac14a6fa-4ec9-449e-aec9-7d1777bfdc06 devid 1 transid 11 /dev/mapper/test-scratch1 (253:3) scanned by mount (2849)
  BTRFS info (device dm-3): first mount of filesystem ac14a6fa-4ec9-449e-aec9-7d1777bfdc06
  BTRFS info (device dm-3): using crc32c checksum algorithm
  BTRFS warning (device dm-3): support for block size 8192 with page size 4096 is experimental, some features may be missing
  BTRFS warning (device dm-3): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2
  BTRFS warning (device dm-3): v1 space cache is not supported for page size 4096 with sectorsize 8192
  BTRFS error (device dm-3): open_ctree failed: -22

[FIX]
Just enable the same free space tree for bs > ps cases, aligning the
behavior to bs < ps cases.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-06 01:22:59 +01:00
Qu Wenruo 30bcf4e824 btrfs: only enforce free space tree if v1 cache is required for bs < ps cases
[BUG]
Since the introduction of btrfs bs < ps support, v1 cache was never on
the plan due to its hard coded PAGE_SIZE usage, and the future plan to
properly deprecate it.

However for bs < ps cases, even if 'nospace_cache,clear_cache' mount
option is specified, it's never respected and free space tree is always
enabled:

  mkfs.btrfs -f -O ^bgt,fst $dev
  mount $dev $mnt -o clear_cache,nospace_cache
  umount $mnt
  btrfs ins dump-super $dev
  ...
  compat_ro_flags		0x3
         		( FREE_SPACE_TREE |
         		  FREE_SPACE_TREE_VALID )
  ...

This means a different behavior compared to bs >= ps cases.

[CAUSE]
The forcing usage of v2 space cache is done inside
btrfs_set_free_space_cache_settings(), however it never checks if we're
even using space cache but always enabling v2 cache.

[FIX]
Instead unconditionally enable v2 cache, only forcing v2 cache if the
old v1 cache is required.

Now v2 space cache can be properly disabled on bs < ps cases:

  mkfs.btrfs -f -O ^bgt,fst $dev
  mount $dev $mnt -o clear_cache,nospace_cache
  umount $mnt
  btrfs ins dump-super $dev
  ...
  compat_ro_flags		0x0
  ...

Fixes: 9f73f1aef9 ("btrfs: force v2 space cache usage for subpage mount")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-06 01:22:59 +01:00
Filipe Manana 8731f2c50b btrfs: release path before initializing extent tree in btrfs_read_locked_inode()
In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree()
while holding a path with a read locked leaf from a subvolume tree, and
btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can
trigger reclaim.

This can create a circular lock dependency which lockdep warns about with
the following splat:

   [6.1433] ======================================================
   [6.1574] WARNING: possible circular locking dependency detected
   [6.1583] 6.18.0+ #4 Tainted: G     U
   [6.1591] ------------------------------------------------------
   [6.1599] kswapd0/117 is trying to acquire lock:
   [6.1606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [6.1625]
            but task is already holding lock:
   [6.1633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60
   [6.1646]
            which lock already depends on the new lock.

   [6.1657]
            the existing dependency chain (in reverse order) is:
   [6.1667]
            -> #2 (fs_reclaim){+.+.}-{0:0}:
   [6.1677]        fs_reclaim_acquire+0x9d/0xd0
   [6.1685]        __kmalloc_cache_noprof+0x59/0x750
   [6.1694]        btrfs_init_file_extent_tree+0x90/0x100
   [6.1702]        btrfs_read_locked_inode+0xc3/0x6b0
   [6.1710]        btrfs_iget+0xbb/0xf0
   [6.1716]        btrfs_lookup_dentry+0x3c5/0x8e0
   [6.1724]        btrfs_lookup+0x12/0x30
   [6.1731]        lookup_open.isra.0+0x1aa/0x6a0
   [6.1739]        path_openat+0x5f7/0xc60
   [6.1746]        do_filp_open+0xd6/0x180
   [6.1753]        do_sys_openat2+0x8b/0xe0
   [6.1760]        __x64_sys_openat+0x54/0xa0
   [6.1768]        do_syscall_64+0x97/0x3e0
   [6.1776]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [6.1784]
            -> #1 (btrfs-tree-00){++++}-{3:3}:
   [6.1794]        lock_release+0x127/0x2a0
   [6.1801]        up_read+0x1b/0x30
   [6.1808]        btrfs_search_slot+0x8e0/0xff0
   [6.1817]        btrfs_lookup_inode+0x52/0xd0
   [6.1825]        __btrfs_update_delayed_inode+0x73/0x520
   [6.1833]        btrfs_commit_inode_delayed_inode+0x11a/0x120
   [6.1842]        btrfs_log_inode+0x608/0x1aa0
   [6.1849]        btrfs_log_inode_parent+0x249/0xf80
   [6.1857]        btrfs_log_dentry_safe+0x3e/0x60
   [6.1865]        btrfs_sync_file+0x431/0x690
   [6.1872]        do_fsync+0x39/0x80
   [6.1879]        __x64_sys_fsync+0x13/0x20
   [6.1887]        do_syscall_64+0x97/0x3e0
   [6.1894]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [6.1903]
            -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
   [6.1913]        __lock_acquire+0x15e9/0x2820
   [6.1920]        lock_acquire+0xc9/0x2d0
   [6.1927]        __mutex_lock+0xcc/0x10a0
   [6.1934]        __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [6.1944]        btrfs_evict_inode+0x20b/0x4b0
   [6.1952]        evict+0x15a/0x2f0
   [6.1958]        prune_icache_sb+0x91/0xd0
   [6.1966]        super_cache_scan+0x150/0x1d0
   [6.1974]        do_shrink_slab+0x155/0x6f0
   [6.1981]        shrink_slab+0x48e/0x890
   [6.1988]        shrink_one+0x11a/0x1f0
   [6.1995]        shrink_node+0xbfd/0x1320
   [6.1002]        balance_pgdat+0x67f/0xc60
   [6.1321]        kswapd+0x1dc/0x3e0
   [6.1643]        kthread+0xff/0x240
   [6.1965]        ret_from_fork+0x223/0x280
   [6.1287]        ret_from_fork_asm+0x1a/0x30
   [6.1616]
            other info that might help us debug this:

   [6.1561] Chain exists of:
              &delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim

   [6.1503]  Possible unsafe locking scenario:

   [6.1110]        CPU0                    CPU1
   [6.1411]        ----                    ----
   [6.1707]   lock(fs_reclaim);
   [6.1998]                                lock(btrfs-tree-00);
   [6.1291]                                lock(fs_reclaim);
   [6.1581]   lock(&delayed_node->mutex);
   [6.1874]
             *** DEADLOCK ***

   [6.1716] 2 locks held by kswapd0/117:
   [6.1999]  #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60
   [6.1294]  #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0
   [6.1596]
            stack backtrace:
   [6.1183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G     U 6.18.0+ #4 PREEMPT(lazy)
   [6.1185] Tainted: [U]=USER
   [6.1186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
   [6.1187] Call Trace:
   [6.1187]  <TASK>
   [6.1189]  dump_stack_lvl+0x6e/0xa0
   [6.1192]  print_circular_bug.cold+0x17a/0x1c0
   [6.1194]  check_noncircular+0x175/0x190
   [6.1197]  __lock_acquire+0x15e9/0x2820
   [6.1200]  lock_acquire+0xc9/0x2d0
   [6.1201]  ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [6.1204]  __mutex_lock+0xcc/0x10a0
   [6.1206]  ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [6.1208]  ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [6.1211]  ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [6.1213]  __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [6.1215]  btrfs_evict_inode+0x20b/0x4b0
   [6.1217]  ? lock_acquire+0xc9/0x2d0
   [6.1220]  evict+0x15a/0x2f0
   [6.1222]  prune_icache_sb+0x91/0xd0
   [6.1224]  super_cache_scan+0x150/0x1d0
   [6.1226]  do_shrink_slab+0x155/0x6f0
   [6.1228]  shrink_slab+0x48e/0x890
   [6.1229]  ? shrink_slab+0x2d2/0x890
   [6.1231]  shrink_one+0x11a/0x1f0
   [6.1234]  shrink_node+0xbfd/0x1320
   [6.1236]  ? shrink_node+0xa2d/0x1320
   [6.1236]  ? shrink_node+0xbd3/0x1320
   [6.1239]  ? balance_pgdat+0x67f/0xc60
   [6.1239]  balance_pgdat+0x67f/0xc60
   [6.1241]  ? finish_task_switch.isra.0+0xc4/0x2a0
   [6.1246]  kswapd+0x1dc/0x3e0
   [6.1247]  ? __pfx_autoremove_wake_function+0x10/0x10
   [6.1249]  ? __pfx_kswapd+0x10/0x10
   [6.1250]  kthread+0xff/0x240
   [6.1251]  ? __pfx_kthread+0x10/0x10
   [6.1253]  ret_from_fork+0x223/0x280
   [6.1255]  ? __pfx_kthread+0x10/0x10
   [6.1257]  ret_from_fork_asm+0x1a/0x30
   [6.1260]  </TASK>

This is because:

1) The fsync task is holding an inode's delayed node mutex (for a
   directory) while calling __btrfs_update_delayed_inode() and that needs
   to do a search on the subvolume's btree (therefore read lock some
   extent buffers);

2) The lookup task, at btrfs_lookup(), triggered reclaim with the
   GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while
   holding a read lock on a subvolume leaf;

3) The reclaim triggered kswapd which is doing inode eviction for the
   directory inode the fsync task is using as an argument to
   btrfs_commit_inode_delayed_inode() - but in that call chain we are
   trying to read lock the same leaf that the lookup task is holding
   while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL
   allocation.

Fix this by calling btrfs_init_file_extent_tree() after we don't need the
path anymore and release it in btrfs_read_locked_inode().

Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/linux-btrfs/6e55113a22347c3925458a5d840a18401a38b276.camel@linux.intel.com/
Fixes: 8679d2687c ("btrfs: initialize inode::file_extent_tree after i_mode has been set")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-06 01:22:59 +01:00
Qu Wenruo 1aff297ffb btrfs: avoid access-beyond-folio for bs > ps encoded writes
[POTENTIAL BUG]
If the system page size is 4K and fs block size is 8K, and max_inline
mount option is set to 6K, we can inline a 6K sized data extent.

Then a encoded write submitted a compressed extent which is at file
offset 0, and the compressed length is 6K, which is allowed to be inlined.

Now a read beyond page boundary is triggered inside write_extent_buffer()
from insert_inline_extent().

[CAUSE]
Currently the function __cow_file_range_inline() can only accept a
single folio.

For regular compressed write path, we always allocate the compressed
folios using the minimal order matching the block size, thus the
@compressed_folio should always cover a full fs block thus it is fine.

But for encoded writes, they allocate page size folios, this means we
can hit a case where the compressed data is smaller than block size but
still larger than page size, in that case __cow_file_range_inline() will
be called with @compressed_size larger than a page.

In that case we will trigger a read beyond the folio inside
insert_inline_extent().

Thankfully this is not that common, as the default max_inline is only
2048 bytes, smaller than PAGE_SIZE, and bs > ps support is still
experimental.

[FIX]
We need to either allow insert_inline_extent() to accept a page array to
properly support such case, or reject such inline extent.

The latter is a much simpler solution, and considering bs > ps will stay
as a corner case and non-default max_inline will be even rarer, I don't
think we really need to fulfill such niche.

So just reject any inline extent that's larger than PAGE_SIZE, and add
an extra ASSERT() to insert_inline_extent() to catch such beyond-boundary
access.

Fixes: ec20799064 ("btrfs: enable encoded read/write/send for bs > ps cases")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-06 01:22:59 +01:00
Filipe Manana c1c050f92d btrfs: fix reservation leak in some error paths when inserting inline extent
If we fail to allocate a path or join a transaction, we return from
__cow_file_range_inline() without freeing the reserved qgroup data,
resulting in a leak. Fix this by ensuring we call btrfs_qgroup_free_data()
in such cases.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:15 +01:00
Filipe Manana f8da41de0b btrfs: do not free data reservation in fallback from inline due to -ENOSPC
If we fail to create an inline extent due to -ENOSPC, we will attempt to
go through the normal COW path, reserve an extent, create an ordered
extent, etc. However we were always freeing the reserved qgroup data,
which is wrong since we will use data. Fix this by freeing the reserved
qgroup data in __cow_file_range_inline() only if we are not doing the
fallback (ret is <= 0).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:15 +01:00
Leo Martins 83f59076a1 btrfs: fix use-after-free warning in btrfs_get_or_create_delayed_node()
Previously, btrfs_get_or_create_delayed_node() set the delayed_node's
refcount before acquiring the root->delayed_nodes lock.
Commit e8513c012d ("btrfs: implement ref_tracker for delayed_nodes")
moved refcount_set inside the critical section, which means there is
no longer a memory barrier between setting the refcount and setting
btrfs_inode->delayed_node.

Without that barrier, the stores to node->refs and
btrfs_inode->delayed_node may become visible out of order. Another
thread can then read btrfs_inode->delayed_node and attempt to
increment a refcount that hasn't been set yet, leading to a
refcounting bug and a use-after-free warning.

The fix is to move refcount_set back to where it was to take
advantage of the implicit memory barrier provided by lock
acquisition.

Because the allocations now happen outside of the lock's critical
section, they can use GFP_NOFS instead of GFP_ATOMIC.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202511262228.6dda231e-lkp@intel.com
Fixes: e8513c012d ("btrfs: implement ref_tracker for delayed_nodes")
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:15 +01:00
Filipe Manana 7ba0b6461b btrfs: always detect conflicting inodes when logging inode refs
After rename exchanging (either with the rename exchange operation or
regular renames in multiple non-atomic steps) two inodes and at least
one of them is a directory, we can end up with a log tree that contains
only of the inodes and after a power failure that can result in an attempt
to delete the other inode when it should not because it was not deleted
before the power failure. In some case that delete attempt fails when
the target inode is a directory that contains a subvolume inside it, since
the log replay code is not prepared to deal with directory entries that
point to root items (only inode items).

1) We have directories "dir1" (inode A) and "dir2" (inode B) under the
   same parent directory;

2) We have a file (inode C) under directory "dir1" (inode A);

3) We have a subvolume inside directory "dir2" (inode B);

4) All these inodes were persisted in a past transaction and we are
   currently at transaction N;

5) We rename the file (inode C), so at btrfs_log_new_name() we update
   inode C's last_unlink_trans to N;

6) We get a rename exchange for "dir1" (inode A) and "dir2" (inode B),
   so after the exchange "dir1" is inode B and "dir2" is inode A.
   During the rename exchange we call btrfs_log_new_name() for inodes
   A and B, but because they are directories, we don't update their
   last_unlink_trans to N;

7) An fsync against the file (inode C) is done, and because its inode
   has a last_unlink_trans with a value of N we log its parent directory
   (inode A) (through btrfs_log_all_parents(), called from
   btrfs_log_inode_parent()).

8) So we end up with inode B not logged, which now has the old name
   of inode A. At copy_inode_items_to_log(), when logging inode A, we
   did not check if we had any conflicting inode to log because inode
   A has a generation lower than the current transaction (created in
   a past transaction);

9) After a power failure, when replaying the log tree, since we find that
   inode A has a new name that conflicts with the name of inode B in the
   fs tree, we attempt to delete inode B... this is wrong since that
   directory was never deleted before the power failure, and because there
   is a subvolume inside that directory, attempting to delete it will fail
   since replay_dir_deletes() and btrfs_unlink_inode() are not prepared
   to deal with dir items that point to roots instead of inodes.

   When that happens the mount fails and we get a stack trace like the
   following:

   [87.2314] BTRFS info (device dm-0): start tree-log replay
   [87.2318] BTRFS critical (device dm-0): failed to delete reference to subvol, root 5 inode 256 parent 259
   [87.2332] ------------[ cut here ]------------
   [87.2338] BTRFS: Transaction aborted (error -2)
   [87.2346] WARNING: CPU: 1 PID: 638968 at fs/btrfs/inode.c:4345 __btrfs_unlink_inode+0x416/0x440 [btrfs]
   [87.2368] Modules linked in: btrfs loop dm_thin_pool (...)
   [87.2470] CPU: 1 UID: 0 PID: 638968 Comm: mount Tainted: G        W           6.18.0-rc7-btrfs-next-218+ #2 PREEMPT(full)
   [87.2489] Tainted: [W]=WARN
   [87.2494] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [87.2514] RIP: 0010:__btrfs_unlink_inode+0x416/0x440 [btrfs]
   [87.2538] Code: c0 89 04 24 (...)
   [87.2568] RSP: 0018:ffffc0e741f4b9b8 EFLAGS: 00010286
   [87.2574] RAX: 0000000000000000 RBX: ffff9d3ec8a6cf60 RCX: 0000000000000000
   [87.2582] RDX: 0000000000000002 RSI: ffffffff84ab45a1 RDI: 00000000ffffffff
   [87.2591] RBP: ffff9d3ec8a6ef20 R08: 0000000000000000 R09: ffffc0e741f4b840
   [87.2599] R10: ffff9d45dc1fffa8 R11: 0000000000000003 R12: ffff9d3ee26d77e0
   [87.2608] R13: ffffc0e741f4ba98 R14: ffff9d4458040800 R15: ffff9d44b6b7ca10
   [87.2618] FS:  00007f7b9603a840(0000) GS:ffff9d4658982000(0000) knlGS:0000000000000000
   [87.2629] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [87.2637] CR2: 00007ffc9ec33b98 CR3: 000000011273e003 CR4: 0000000000370ef0
   [87.2648] Call Trace:
   [87.2651]  <TASK>
   [87.2654]  btrfs_unlink_inode+0x15/0x40 [btrfs]
   [87.2661]  unlink_inode_for_log_replay+0x27/0xf0 [btrfs]
   [87.2669]  check_item_in_log+0x1ea/0x2c0 [btrfs]
   [87.2676]  replay_dir_deletes+0x16b/0x380 [btrfs]
   [87.2684]  fixup_inode_link_count+0x34b/0x370 [btrfs]
   [87.2696]  fixup_inode_link_counts+0x41/0x160 [btrfs]
   [87.2703]  btrfs_recover_log_trees+0x1ff/0x7c0 [btrfs]
   [87.2711]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [87.2719]  open_ctree+0x10bb/0x15f0 [btrfs]
   [87.2726]  btrfs_get_tree.cold+0xb/0x16c [btrfs]
   [87.2734]  ? fscontext_read+0x15c/0x180
   [87.2740]  ? rw_verify_area+0x50/0x180
   [87.2746]  vfs_get_tree+0x25/0xd0
   [87.2750]  vfs_cmd_create+0x59/0xe0
   [87.2755]  __do_sys_fsconfig+0x4f6/0x6b0
   [87.2760]  do_syscall_64+0x50/0x1220
   [87.2764]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [87.2770] RIP: 0033:0x7f7b9625f4aa
   [87.2775] Code: 73 01 c3 48 (...)
   [87.2803] RSP: 002b:00007ffc9ec35b08 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [87.2817] RAX: ffffffffffffffda RBX: 0000558bfa91ac20 RCX: 00007f7b9625f4aa
   [87.2829] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [87.2842] RBP: 0000558bfa91b120 R08: 0000000000000000 R09: 0000000000000000
   [87.2854] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [87.2864] R13: 00007f7b963f1580 R14: 00007f7b963f326c R15: 00007f7b963d8a23
   [87.2877]  </TASK>
   [87.2882] ---[ end trace 0000000000000000 ]---
   [87.2891] BTRFS: error (device dm-0 state A) in __btrfs_unlink_inode:4345: errno=-2 No such entry
   [87.2904] BTRFS: error (device dm-0 state EAO) in do_abort_log_replay:191: errno=-2 No such entry
   [87.2915] BTRFS critical (device dm-0 state EAO): log tree (for root 5) leaf currently being processed (slot 7 key (258 12 257)):
   [87.2929] BTRFS info (device dm-0 state EAO): leaf 30736384 gen 10 total ptrs 7 free space 15712 owner 18446744073709551610
   [87.2929] BTRFS info (device dm-0 state EAO): refs 3 lock_owner 0 current 638968
   [87.2929]      item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
   [87.2929]              inode generation 9 transid 10 size 0 nbytes 0
   [87.2929]              block group 0 mode 40755 links 1 uid 0 gid 0
   [87.2929]              rdev 0 sequence 7 flags 0x0
   [87.2929]              atime 1765464494.678070921
   [87.2929]              ctime 1765464494.686606513
   [87.2929]              mtime 1765464494.686606513
   [87.2929]              otime 1765464494.678070921
   [87.2929]      item 1 key (257 INODE_REF 256) itemoff 16109 itemsize 14
   [87.2929]              index 4 name_len 4
   [87.2929]      item 2 key (257 DIR_LOG_INDEX 2) itemoff 16101 itemsize 8
   [87.2929]              dir log end 2
   [87.2929]      item 3 key (257 DIR_LOG_INDEX 3) itemoff 16093 itemsize 8
   [87.2929]              dir log end 18446744073709551615
   [87.2930]      item 4 key (257 DIR_INDEX 3) itemoff 16060 itemsize 33
   [87.2930]              location key (258 1 0) type 1
   [87.2930]              transid 10 data_len 0 name_len 3
   [87.2930]      item 5 key (258 INODE_ITEM 0) itemoff 15900 itemsize 160
   [87.2930]              inode generation 9 transid 10 size 0 nbytes 0
   [87.2930]              block group 0 mode 100644 links 1 uid 0 gid 0
   [87.2930]              rdev 0 sequence 2 flags 0x0
   [87.2930]              atime 1765464494.678456467
   [87.2930]              ctime 1765464494.686606513
   [87.2930]              mtime 1765464494.678456467
   [87.2930]              otime 1765464494.678456467
   [87.2930]      item 6 key (258 INODE_REF 257) itemoff 15887 itemsize 13
   [87.2930]              index 3 name_len 3
   [87.2930] BTRFS critical (device dm-0 state EAO): log replay failed in unlink_inode_for_log_replay:1045 for root 5, stage 3, with error -2: failed to unlink inode 256 parent dir 259 name subvol root 5
   [87.2963] BTRFS: error (device dm-0 state EAO) in btrfs_recover_log_trees:7743: errno=-2 No such entry
   [87.2981] BTRFS: error (device dm-0 state EAO) in btrfs_replay_log:2083: errno=-2 No such entry (Failed to recover log tr

So fix this by changing copy_inode_items_to_log() to always detect if
there are conflicting inodes for the ref/extref of the inode being logged
even if the inode was created in a past transaction.

A test case for fstests will follow soon.

CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:15 +01:00
Qu Wenruo e9e3b22ddf btrfs: fix beyond-EOF write handling
[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:

  mkfs.btrfs -s 4k -f $dev > /dev/null
  mount $dev $mnt
  xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
            -c "truncate 16k" $mnt/foobar
  umount $mnt

This will result the following 2 file extent items to be inserted (extra
trace point added to insert_ordered_extent_file_extent()):

  btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
  btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384

Note for file offset 60K, we're inserting a file extent without any
data checksum.

Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.

Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.

That will cause "btrfs check" to report error.

[CAUSE]
The sequence happens like this:

- Buffered write dirtied the page cache and updated isize

  Now the inode size is 64K, with the following page cache layout:

  0             16K             32K              48K           64K
  |/////////////|               |//|                        |//|

- Truncate the inode to 16K
  Which will trigger writeback through:

  btrfs_setsize()
  |- truncate_setsize()
  |  Now the inode size is set to 16K
  |
  |- btrfs_truncate()
     |- btrfs_wait_ordered_range() for [16K, u64(-1)]
        |- btrfs_fdatawrite_range() for [16K, u64(-1)}
	   |- extent_writepage() for folio 0
	      |- writepage_delalloc()
	      |  Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
	      |
	      |- extent_writepage_io()

  Then inside extent_writepage_io(), the dirty fs blocks are handled
  differently:

  - Submit write for range [0, 16K)
    As they are still inside the inode size (16K).

  - Mark OE [32K, 36K) as truncated
    Since we only call btrfs_lookup_first_ordered_range() once, which
    returned the first OE after file offset 16K.

  - Mark all OEs inside range [16K, 64K) as finished
    Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.

    For OE [32K, 36K) since it's already marked as truncated, and its
    truncated length is 0, no file extent will be inserted.

    For OE [60K, 64K) it has never been submitted thus has no data
    checksum, and we insert the file extent as usual.
    This is the root cause of file extent at 60K to be inserted without
    any data checksum.

  - Clear dirty flags for range [16K, 64K)
    It is the function btrfs_folio_clear_dirty() which searches and clears
    any dirty blocks inside that range.

[FIX]
The bug itself was introduced a long time ago, way before subpage and
large folio support.

At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.

But later with subpage and large folios, the same range [cur, end)
can have multiple blocks and ordered extents.

Later commit 18de34daa7 ("btrfs: truncate ordered extent when skipping
writeback past i_size") was fixing a bug related to subpage/large
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.

The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.

By this we always locate and truncate the OE for every dirty block.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:14 +01:00
Robbie Ko 5037b34282 btrfs: fix deadlock in wait_current_trans() due to ignored transaction type
When wait_current_trans() is called during start_transaction(), it
currently waits for a blocked transaction without considering whether
the given transaction type actually needs to wait for that particular
transaction state. The btrfs_blocked_trans_types[] array already defines
which transaction types should wait for which transaction states, but
this check was missing in wait_current_trans().

This can lead to a deadlock scenario involving two transactions and
pending ordered extents:

  1. Transaction A is in TRANS_STATE_COMMIT_DOING state

  2. A worker processing an ordered extent calls start_transaction()
     with TRANS_JOIN

  3. join_transaction() returns -EBUSY because Transaction A is in
     TRANS_STATE_COMMIT_DOING

  4. Transaction A moves to TRANS_STATE_UNBLOCKED and completes

  5. A new Transaction B is created (TRANS_STATE_RUNNING)

  6. The ordered extent from step 2 is added to Transaction B's
     pending ordered extents

  7. Transaction B immediately starts commit by another task and
     enters TRANS_STATE_COMMIT_START

  8. The worker finally reaches wait_current_trans(), sees Transaction B
     in TRANS_STATE_COMMIT_START (a blocked state), and waits
     unconditionally

  9. However, TRANS_JOIN should NOT wait for TRANS_STATE_COMMIT_START
     according to btrfs_blocked_trans_types[]

  10. Transaction B is waiting for pending ordered extents to complete

  11. Deadlock: Transaction B waits for ordered extent, ordered extent
      waits for Transaction B

This can be illustrated by the following call stacks:
  CPU0                              CPU1
                                    btrfs_finish_ordered_io()
                                      start_transaction(TRANS_JOIN)
                                        join_transaction()
                                          # -EBUSY (Transaction A is
                                          # TRANS_STATE_COMMIT_DOING)
  # Transaction A completes
  # Transaction B created
  # ordered extent added to
  # Transaction B's pending list
  btrfs_commit_transaction()
    # Transaction B enters
    # TRANS_STATE_COMMIT_START
    # waiting for pending ordered
    # extents
                                        wait_current_trans()
                                          # waits for Transaction B
                                          # (should not wait!)

Task bstore_kv_sync in btrfs_commit_transaction waiting for ordered
extents:

  __schedule+0x2e7/0x8a0
  schedule+0x64/0xe0
  btrfs_commit_transaction+0xbf7/0xda0 [btrfs]
  btrfs_sync_file+0x342/0x4d0 [btrfs]
  __x64_sys_fdatasync+0x4b/0x80
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Task kworker in wait_current_trans waiting for transaction commit:

  Workqueue: btrfs-syno_nocow btrfs_work_helper [btrfs]
  __schedule+0x2e7/0x8a0
  schedule+0x64/0xe0
  wait_current_trans+0xb0/0x110 [btrfs]
  start_transaction+0x346/0x5b0 [btrfs]
  btrfs_finish_ordered_io.isra.0+0x49b/0x9c0 [btrfs]
  btrfs_work_helper+0xe8/0x350 [btrfs]
  process_one_work+0x1d3/0x3c0
  worker_thread+0x4d/0x3e0
  kthread+0x12d/0x150
  ret_from_fork+0x1f/0x30

Fix this by passing the transaction type to wait_current_trans() and
checking btrfs_blocked_trans_types[cur_trans->state] against the given
type before deciding to wait. This ensures that transaction types which
are allowed to join during certain blocked states will not unnecessarily
wait and cause deadlocks.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:14 +01:00
Miquel Sabaté Solà f157dd6613 btrfs: fix NULL dereference on root when tracing inode eviction
When evicting an inode the first thing we do is to setup tracing for it,
which implies fetching the root's id. But in btrfs_evict_inode() the
root might be NULL, as implied in the next check that we do in
btrfs_evict_inode().

Hence, we either should set the ->root_objectid to 0 in case the root is
NULL, or we move tracing setup after checking that the root is not
NULL. Setting the rootid to 0 at least gives us the possibility to trace
this call even in the case when the root is NULL, so that's the solution
taken here.

Fixes: 1abe9b8a13 ("Btrfs: add initial tracepoint support for btrfs")
Reported-by: syzbot+d991fea1b4b23b1f6bf8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d991fea1b4b23b1f6bf8
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:14 +01:00
Qu Wenruo 68d4b3fa18 btrfs: qgroup: update all parent qgroups when doing quick inherit
[BUG]
There is a bug that if a subvolume has multi-level parent qgroups, and
is able to do a quick inherit, only the direct parent qgroup got
updated:

  mkfs.btrfs  -f -O quota $dev
  mount $dev $mnt
  btrfs subv create $mnt/subv1
  btrfs qgroup create 1/100 $mnt
  btrfs qgroup create 2/100 $mnt
  btrfs qgroup assign 1/100 2/100 $mnt
  btrfs qgroup assign 0/256 1/100 $mnt
  btrfs qgroup show -p --sync $mnt

  Qgroupid    Referenced    Exclusive Parent     Path
  --------    ----------    --------- ------     ----
  0/5           16.00KiB     16.00KiB -          <toplevel>
  0/256         16.00KiB     16.00KiB 1/100      subv1
  1/100         16.00KiB     16.00KiB 2/100      2/100<1 member qgroup>
  2/100         16.00KiB     16.00KiB -          <0 member qgroups>

  btrfs subv snap -i 1/100 $mnt/subv1 $mnt/snap1
  btrfs qgroup show -p --sync $mnt

  Qgroupid    Referenced    Exclusive Parent     Path
  --------    ----------    --------- ------     ----
  0/5           16.00KiB     16.00KiB -          <toplevel>
  0/256         16.00KiB     16.00KiB 1/100      subv1
  0/257         16.00KiB     16.00KiB 1/100      snap1
  1/100         32.00KiB     32.00KiB 2/100      2/100<1 member qgroup>
  2/100         16.00KiB     16.00KiB -          <0 member qgroups>
  # Note that 2/100 is not updated, and qgroup numbers are inconsistent

  umount $mnt

[CAUSE]
If the snapshot source subvolume belongs to a parent qgroup, and the new
snapshot target is also added to the new same parent qgroup, we allow a
quick update without marking qgroup inconsistent.

But that quick update only update the parent qgroup, without checking if
there is any more parent qgroups.

[FIX]
Iterate through all parent qgroups during the quick inherit.

Reported-by: Boris Burkov <boris@bur.io>
Fixes: b20fe56cd2 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:53:14 +01:00
Boris Burkov 7ee19a59a7 btrfs: fix qgroup_snapshot_quick_inherit() squota bug
qgroup_snapshot_quick_inherit() detects conditions where the snapshot
destination would land in the same parent qgroup as the snapshot source
subvolume. In this case we can avoid costly qgroup calculations and just
add the nodesize of the new snapshot to the parent.

However, in the case of squotas this is actually a double count, and
also an undercount for deeper qgroup nestings.

The following annotated script shows the issue:

  btrfs quota enable --simple "$mnt"

  # Create 2-level qgroup hierarchy
  btrfs qgroup create 2/100 "$mnt"  # Q2 (level 2)
  btrfs qgroup create 1/100 "$mnt"  # Q1 (level 1)
  btrfs qgroup assign 1/100 2/100 "$mnt"

  # Create base subvolume
  btrfs subvolume create "$mnt/base" >/dev/null
  base_id=$(btrfs subvolume show "$mnt/base" | grep 'Subvolume ID:' | awk '{print $3}')

  # Create intermediate snapshot and add to Q1
  btrfs subvolume snapshot "$mnt/base" "$mnt/intermediate" >/dev/null
  inter_id=$(btrfs subvolume show "$mnt/intermediate" | grep 'Subvolume ID:' | awk '{print $3}')
  btrfs qgroup assign "0/$inter_id" 1/100 "$mnt"

  # Create working snapshot with --inherit (auto-adds to Q1)
  # src=intermediate (in only Q1)
  # dst=snap (inheriting only into Q1)
  # This double counts the 16k nodesize of the snapshot in Q1, and
  # undercounts it in Q2.
  btrfs subvolume snapshot -i 1/100 "$mnt/intermediate" "$mnt/snap" >/dev/null
  snap_id=$(btrfs subvolume show "$mnt/snap" | grep 'Subvolume ID:' | awk '{print $3}')

  # Fully complete snapshot creation
  sync

  # Delete working snapshot
  # Q1 and Q2 will lose the full snap usage
  btrfs subvolume delete "$mnt/snap" >/dev/null

  # Delete intermediate and remove from Q1
  # Q1 and Q2 will lose the full intermediate usage
  btrfs qgroup remove "0/$inter_id" 1/100 "$mnt"
  btrfs subvolume delete "$mnt/intermediate" >/dev/null

  # Q1 should be at 0, but still has 16k. Q2 is "correct" at 0 (for now...)

  # Trigger cleaner, wait for deletions
  mount -o remount,sync=1 "$mnt"
  btrfs subvolume sync "$mnt" "$snap_id"
  btrfs subvolume sync "$mnt" "$inter_id"

  # Remove Q1 from Q2
  # Frees 16k more from Q2, underflowing it to 16EiB
  btrfs qgroup remove 1/100 2/100 "$mnt"

  # And show the bad state:
  btrfs qgroup show -pc "$mnt"

        Qgroupid    Referenced    Exclusive Parent   Child   Path
        --------    ----------    --------- ------   -----   ----
        0/5           16.00KiB     16.00KiB -        -       <toplevel>
        0/256         16.00KiB     16.00KiB -        -       base
        1/100         16.00KiB     16.00KiB -        -       <0 member qgroups>
        2/100         16.00EiB     16.00EiB -        -       <0 member qgroups>

Fix this by simply not doing this quick inheritance with squotas.

I suspect that it is also wrong in normal qgroups to not recurse up the
qgroup tree in the quick inherit case, though other consistency checks
will likely fix it anyway.

Fixes: b20fe56cd2 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:09:52 +01:00
Filipe Manana 37343524f0 btrfs: fix changeset leak on mmap write after failure to reserve metadata
If the call to btrfs_delalloc_reserve_metadata() fails we jump to the
'out_noreserve' label and there we never free the extent_changeset
allocated by the previous call to btrfs_check_data_free_space() (if
qgroups are enabled). Fix this by calling extent_changeset_free() under
the 'out_noreserve' label.

Fixes: 6599716de2 ("btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents")
Reported-by: syzbot+2f8aa76e6acc9fce6638@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/693a635a.a70a0220.33cd7b.0029.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-12 16:33:18 +01:00
Deepanshu Kartikey b57f2ddd28 btrfs: fix memory leak of fs_devices in degraded seed device path
In open_seed_devices(), when find_fsid() fails and we're in DEGRADED
mode, a new fs_devices is allocated via alloc_fs_devices() but is never
added to the seed_list before returning. This contrasts with the normal
path where fs_devices is properly added via list_add().

If any error occurs later in read_one_dev() or btrfs_read_chunk_tree(),
the cleanup code iterates seed_list to free seed devices, but this
orphaned fs_devices is never found and never freed, causing a memory
leak. Any devices allocated via add_missing_dev() and attached to this
fs_devices are also leaked.

Fix this by adding the newly allocated fs_devices to seed_list in the
degraded path, consistent with the normal path.

Fixes: 5f37583569 ("Btrfs: move the missing device to its own fs device list")
Reported-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=eadd98df8bceb15d7fed
Tested-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-12 16:33:12 +01:00
Qu Wenruo 313ef70a9f btrfs: fix a potential path leak in print_data_reloc_error()
Inside print_data_reloc_error(), if extent_from_logical() failed we
return immediately.

However there are the following cases where extent_from_logical() can
return error but still holds a path:

- btrfs_search_slot() returned 0

- No backref item found in extent tree

- No flags_ret provided
  This is not possible in this call site though.

So for the above two cases, we can return without releasing the path,
causing extent buffer leaks.

Fixes: b9a9a85059 ("btrfs: output affected files when relocation fails")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-09 04:35:28 +01:00
Qu Wenruo 428e1b114c Revert "btrfs: add ASSERTs on prealloc in qgroup functions"
This reverts commit 252877a870.

Commit 252877a870 ("btrfs: add ASSERTs on prealloc in qgroup
functions") tries to remove the kfree() on preallocated qgroup during
several call sites, but this cannot work as intended:

- btrfs_quota_enable()
- btrfs_create_qgroup()
  If add_qgroup_item() failed, we go out_free_path() and at that time
  prealloc is not yet utilized and will trigger the new ASSERT().

- btrfs_qgroup_inherit()
  If qgroup_auto_inherit() failed, prealloc is not yet utilized and
  will trigger the new ASSERT()

Reported-by: syzbot+b44d4a4885bc82af2a06@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/69369331.a70a0220.38f243.009e.GAE@google.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-09 04:32:46 +01:00
Filipe Manana 5630f7557d btrfs: do not skip logging new dentries when logging a new name
When we are logging a directory and the log context indicates that we
are logging a new name for some other file (that is or was inside that
directory), we skip logging the inodes for new dentries in the directory.

This is ok most of the time, but if after the rename or link operation
that triggered the logging of that directory, we have an explicit fsync
of that directory without the directory inode being evicted and reloaded,
we end up never logging the inodes for the new dentries that we found
during the new name logging, as the next directory fsync will only process
dentries that were added after the last time we logged the directory (we
are doing an incremental directory logging).

So make sure we always log new dentries for a directory even if we are
in a context of logging a new name.

We started skipping logging inodes for new dentries as of commit
c48792c6ee ("btrfs: do not log new dentries when logging that a new name
exists") and it was fine back then, because when logging a directory we
always iterated over all the directory entries (for leaves changed in the
current transaction) so a subsequent fsync would always log anything that
was previously skipped while logging a directory when logging a new name
(with btrfs_log_new_name()). But later support for incrementally logging
a directory was added in commit dc2872247e ("btrfs: keep track of the
last logged keys when logging a directory"), to avoid checking all dir
items every time we log a directory, so the check to skip dentry logging
added in the first commit should have been removed when the incremental
support for logging a directory was added.

A test case for fstests will follow soon.

Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/84c4e713-85d6-42b9-8dcf-0722ed26cb05@gmail.com/
Fixes: dc2872247e ("btrfs: keep track of the last logged keys when logging a directory")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-08 21:46:56 +01:00
Filipe Manana 266273eaf4 btrfs: don't log conflicting inode if it's a dir moved in the current transaction
We can't log a conflicting inode if it's a directory and it was moved
from one parent directory to another parent directory in the current
transaction, as this can result an attempt to have a directory with
two hard links during log replay, one for the old parent directory and
another for the new parent directory.

The following scenario triggers that issue:

1) We have directories "dir1" and "dir2" created in a past transaction.
   Directory "dir1" has inode A as its parent directory;

2) We move "dir1" to some other directory;

3) We create a file with the name "dir1" in directory inode A;

4) We fsync the new file. This results in logging the inode of the new file
   and the inode for the directory "dir1" that was previously moved in the
   current transaction. So the log tree has the INODE_REF item for the
   new location of "dir1";

5) We move the new file to some other directory. This results in updating
   the log tree to included the new INODE_REF for the new location of the
   file and removes the INODE_REF for the old location. This happens
   during the rename when we call btrfs_log_new_name();

6) We fsync the file, and that persists the log tree changes done in the
   previous step (btrfs_log_new_name() only updates the log tree in
   memory);

7) We have a power failure;

8) Next time the fs is mounted, log replay happens and when processing
   the inode for directory "dir1" we find a new INODE_REF and add that
   link, but we don't remove the old link of the inode since we have
   not logged the old parent directory of the directory inode "dir1".

As a result after log replay finishes when we trigger writeback of the
subvolume tree's extent buffers, the tree check will detect that we have
a directory a hard link count of 2 and we get a mount failure.
The errors and stack traces reported in dmesg/syslog are like this:

   [ 3845.729764] BTRFS info (device dm-0): start tree-log replay
   [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c
   [ 3845.731236] memcg:ffff9264c02f4e00
   [ 3845.731751] aops:btree_aops [btrfs] ino:1
   [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
   [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8
   [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00
   [ 3845.735305] page dumped because: eb page dump
   [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir
   [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5
   [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701
   [ 3845.737792] 	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
   [ 3845.737794] 		inode generation 3 transid 9 size 16 nbytes 16384
   [ 3845.737795] 		block group 0 mode 40755 links 1 uid 0 gid 0
   [ 3845.737797] 		rdev 0 sequence 2 flags 0x0
   [ 3845.737798] 		atime 1764259517.0
   [ 3845.737800] 		ctime 1764259517.572889464
   [ 3845.737801] 		mtime 1764259517.572889464
   [ 3845.737802] 		otime 1764259517.0
   [ 3845.737803] 	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
   [ 3845.737805] 		index 0 name_len 2
   [ 3845.737807] 	item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34
   [ 3845.737808] 		location key (257 1 0) type 2
   [ 3845.737810] 		transid 9 data_len 0 name_len 4
   [ 3845.737811] 	item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34
   [ 3845.737813] 		location key (258 1 0) type 2
   [ 3845.737814] 		transid 9 data_len 0 name_len 4
   [ 3845.737815] 	item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
   [ 3845.737816] 		location key (257 1 0) type 2
   [ 3845.737818] 		transid 9 data_len 0 name_len 4
   [ 3845.737819] 	item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
   [ 3845.737820] 		location key (258 1 0) type 2
   [ 3845.737821] 		transid 9 data_len 0 name_len 4
   [ 3845.737822] 	item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
   [ 3845.737824] 		inode generation 9 transid 10 size 6 nbytes 0
   [ 3845.737825] 		block group 0 mode 40755 links 2 uid 0 gid 0
   [ 3845.737826] 		rdev 0 sequence 1 flags 0x0
   [ 3845.737827] 		atime 1764259517.572889464
   [ 3845.737828] 		ctime 1764259517.572889464
   [ 3845.737830] 		mtime 1764259517.572889464
   [ 3845.737831] 		otime 1764259517.572889464
   [ 3845.737832] 	item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
   [ 3845.737833] 		index 2 name_len 4
   [ 3845.737834] 	item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14
   [ 3845.737836] 		index 2 name_len 4
   [ 3845.737837] 	item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33
   [ 3845.737838] 		location key (259 1 0) type 1
   [ 3845.737839] 		transid 10 data_len 0 name_len 3
   [ 3845.737840] 	item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33
   [ 3845.737842] 		location key (259 1 0) type 1
   [ 3845.737843] 		transid 10 data_len 0 name_len 3
   [ 3845.737844] 	item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160
   [ 3845.737846] 		inode generation 9 transid 10 size 8 nbytes 0
   [ 3845.737847] 		block group 0 mode 40755 links 1 uid 0 gid 0
   [ 3845.737848] 		rdev 0 sequence 1 flags 0x0
   [ 3845.737849] 		atime 1764259517.572889464
   [ 3845.737850] 		ctime 1764259517.572889464
   [ 3845.737851] 		mtime 1764259517.572889464
   [ 3845.737852] 		otime 1764259517.572889464
   [ 3845.737853] 	item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14
   [ 3845.737855] 		index 3 name_len 4
   [ 3845.737856] 	item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34
   [ 3845.737857] 		location key (257 1 0) type 2
   [ 3845.737858] 		transid 10 data_len 0 name_len 4
   [ 3845.737860] 	item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34
   [ 3845.737861] 		location key (257 1 0) type 2
   [ 3845.737862] 		transid 10 data_len 0 name_len 4
   [ 3845.737863] 	item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160
   [ 3845.737865] 		inode generation 10 transid 10 size 0 nbytes 0
   [ 3845.737866] 		block group 0 mode 100600 links 1 uid 0 gid 0
   [ 3845.737867] 		rdev 0 sequence 2 flags 0x0
   [ 3845.737868] 		atime 1764259517.580874966
   [ 3845.737869] 		ctime 1764259517.586121869
   [ 3845.737870] 		mtime 1764259517.580874966
   [ 3845.737872] 		otime 1764259517.580874966
   [ 3845.737873] 	item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13
   [ 3845.737874] 		index 2 name_len 3
   [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
   [ 3845.739448] ------------[ cut here ]------------
   [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs]
   [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...)
   [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G        W           6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full)
   [ 3845.752414] Tainted: [W]=WARN
   [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs]
   [ 3845.755460] Code: 31 f6 48 89 (...)
   [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246
   [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000
   [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000
   [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468
   [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680
   [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8
   [ 3845.765094] FS:  00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000
   [ 3845.766226] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0
   [ 3845.768009] Call Trace:
   [ 3845.768392]  <TASK>
   [ 3845.768714]  btrfs_submit_bbio+0x6ee/0x7f0 [btrfs]
   [ 3845.769640]  ? write_one_eb+0x28e/0x340 [btrfs]
   [ 3845.770588]  btree_write_cache_pages+0x2f0/0x550 [btrfs]
   [ 3845.771286]  ? alloc_extent_state+0x19/0x100 [btrfs]
   [ 3845.771967]  ? merge_next_state+0x1a/0x90 [btrfs]
   [ 3845.772586]  ? set_extent_bit+0x233/0x8b0 [btrfs]
   [ 3845.773198]  ? xas_load+0x9/0xc0
   [ 3845.773589]  ? xas_find+0x14d/0x1a0
   [ 3845.773969]  do_writepages+0xc6/0x160
   [ 3845.774367]  filemap_fdatawrite_wbc+0x48/0x60
   [ 3845.775003]  __filemap_fdatawrite_range+0x5b/0x80
   [ 3845.775902]  btrfs_write_marked_extents+0x61/0x170 [btrfs]
   [ 3845.776707]  btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
   [ 3845.777379]  ? _raw_spin_unlock_irqrestore+0x23/0x40
   [ 3845.777923]  btrfs_commit_transaction+0x5ea/0xd20 [btrfs]
   [ 3845.778551]  ? _raw_spin_unlock+0x15/0x30
   [ 3845.778986]  ? release_extent_buffer+0x34/0x160 [btrfs]
   [ 3845.779659]  btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs]
   [ 3845.780416]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [ 3845.781499]  open_ctree+0x10bb/0x15f0 [btrfs]
   [ 3845.782194]  btrfs_get_tree.cold+0xb/0x16c [btrfs]
   [ 3845.782764]  ? fscontext_read+0x15c/0x180
   [ 3845.783202]  ? rw_verify_area+0x50/0x180
   [ 3845.783667]  vfs_get_tree+0x25/0xd0
   [ 3845.784047]  vfs_cmd_create+0x59/0xe0
   [ 3845.784458]  __do_sys_fsconfig+0x4f6/0x6b0
   [ 3845.784914]  do_syscall_64+0x50/0x1220
   [ 3845.785340]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [ 3845.785980] RIP: 0033:0x7f751fc7f4aa
   [ 3845.786759] Code: 73 01 c3 48 (...)
   [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa
   [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000
   [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23
   [ 3845.798633]  </TASK>
   [ 3845.799067] ---[ end trace 0000000000000000 ]---
   [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction)
   [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction.
   [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5)
   [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure
   [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree)
   [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5

Fix this by never logging a conflicting inode that is a directory and was
moved in the current transaction (its last_unlink_trans equals the current
transaction) and instead fallback to a transaction commit.

A test case for fstests will follow soon.

Reported-by: Vyacheslav Kovalevsky <slva.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/7bbc9419-5c56-450a-b5a0-efeae7457113@gmail.com/
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-08 21:46:47 +01:00
Dan Carpenter 564d59410c btrfs: tests: fix double btrfs_path free in remove_extent_ref()
We converted this code to use auto free cleanup.h magic but one
remaining free was accidentally left behind which leads to a double free
bug.

Fixes: a320476ca8 ("btrfs: tests: do trivial BTRFS_PATH_AUTO_FREE conversions")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-08 21:46:43 +01:00
Filipe Manana 9e0e6577b3 btrfs: remove unnecessary inode key in btrfs_log_all_parents()
We are setting up an inode key to lookup parent directory inode but all we
need is the inode's objectid. The use of the key was necessary in the past
but since commit 0202e83fda ("btrfs: simplify iget helpers") we only
need the objectid.

So remove the key variable in the stack and use instead a simple u64 for
the inode's objectid.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
Filipe Manana 1c3e03b340 btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
We have allocated the root with kzalloc() so all the memory is already
zero initialized, therefore it's redundant to assign 0 and NULL to several
of the root members. Remove all of them except the atomic initializations
since atomic_t is an opaque type and it's not a good practice to assume
its internals.

This slightly reduces the binary size.
With gcc 14.2.0-19 from Debian on x86_64, before this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1939404	 162963	  15592	2117959	 205147	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1939212	 162963	  15592	2117767	 205087	fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
David Sterba 10934c131f btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
Do the remaining btrfs_path conversion to the auto cleaning, this seems
to be the last one. Most of the conversions are trivial, only adding the
declaration and removing the freeing, or changing the goto patterns to
return.

There are some functions with many changes, like __btrfs_free_extent(),
btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree()
but it still follows the same pattern.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
Filipe Manana 5c9cac55b7 btrfs: send: do not allocate memory for xattr data when checking it exists
When checking if xattrs were deleted we don't care about their data, but
we are allocating memory for the data and copying it, which only wastes
time and can result in an unnecessary error in case the allocation fails.
So stop allocating memory and copying data by making find_xattr() and
__find_xattr() skip those steps if the given data buffer is NULL.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
Filipe Manana 7c3acdb998 btrfs: send: add unlikely to all unexpected overflow checks
There are several checks for unexpected overflows of buffers and path
lengths that makes us fail the send operation with an error if for some
highly unexpected reason they happen. So add the unlikely tag to those
checks to hint the compiler to generate better code, while also making
it more explicit in the source that it's highly unexpected.

With gcc 14.2.0-19 from Debian on x86_64, I also got a small reduction
the text size of the btrfs module.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1936917	 162723	  15592	2115232	 2046a0	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1936789	 162723	  15592	2115104	 204620	fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
Filipe Manana 139e3167d8 btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
Instead of passing a root and the objectid of the parent directory, just
pass the directory inode, as like that we can extract both the root and
the objectid, reducing the number of arguments by one. It also makes the
function more consistent with other log tree functions in the sense that
we pass the inode and not only its objectid.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
Filipe Manana 1361f7d8da btrfs: remove root argument from btrfs_del_dir_entries_in_log()
There's no need to pass the root as we can extract it from the directory
inode, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:32 +01:00
Filipe Manana 9c78fe4a85 btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
Instead of testing and setting the BTRFS_DELAYED_NODE_DEL_IREF bit in the
delayed node's flags, use test_and_set_bit() which makes the code shorter
without compromising readability and getting rid of the label and goto.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:32 +01:00
Josef Bacik 70085399b1 btrfs: don't search back for dir inode item in INO_LOOKUP_USER
We don't need to search back to the inode item, the directory inode
number is in key.offset, so simply use that.  If we can't find the
directory we'll get an ENOENT at the iget().

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:27 +01:00
Josef Bacik 0185c2292c btrfs: don't rewrite ret from inode_permission
In our user safe ino resolve ioctl we'll just turn any ret into -EACCES
from inode_permission().  This is redundant, and could potentially be
wrong if we had an ENOMEM in the security layer or some such other
error, so simply return the actual return value.

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Fixes: 23d0b79dfa ("btrfs: Add unprivileged version of ino_lookup ioctl")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:52:24 +01:00
Josef Bacik bd45e9e3f6 btrfs: add orig_logical to btrfs_bio for encryption
When checksumming the encrypted bio on writes we need to know which
logical address this checksum is for.  At the point where we get the
encrypted bio the bi_sector is the physical location on the target disk,
so we need to save the original logical offset in the btrfs_bio.  Then
we can use this when checksumming the bio instead of the
bio->iter.bi_sector.

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:52:23 +01:00
Sweet Tea Dorminy 45d99129b6 btrfs: disable verity on encrypted inodes
Right now there isn't a way to encrypt things that aren't either
filenames in directories or data on blocks on disk with extent
encryption, so for now, disable verity usage with encryption on btrfs.

fscrypt with fsverity should be possible and it can be implemented
in the future.

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:50:56 +01:00
Omar Sandoval f968340053 btrfs: disable various operations on encrypted inodes
Initially, only normal data extents will be encrypted. This change
forbids various other bits:

- allows reflinking only if both inodes have the same encryption status
- disable inline data on encrypted inodes

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:50:56 +01:00
Sun YangKai 4357dd76f5 btrfs: remove redundant level reset in btrfs_del_items()
When btrfs_del_items() empties a leaf, it deletes the leaf unless it's
the root node. For the root leaf case, the code used to reset its level
to 0 via btrfs_set_header_level(). This is redundant as leaf nodes
always have level == 0.

Remove the unnecessary level assignment and invert the conditional to
handle only the non-root leaf deletion. The root leaf is correctly left
as-is.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:50:56 +01:00
Sun YangKai 139f75a3b1 btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
After releasing the path in btrfs_next_old_leaf(), we need to re-check
the leaf because a balance operation may have added items or removed the
last item. The original code handled this with two separate conditional
blocks, the second marked with a lengthy comment explaining a "missed
case".

Merge these two blocks into a single logical structure that handles both
scenarios more clearly.

Also update the comment to be more concise and accurate, incorporating the
explanation directly into the main block rather than a separate annotation.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:50:56 +01:00
Sun YangKai 3afa17bf24 btrfs: optimize balance_level() path reference handling
Instead of incrementing refcount on 'left' node when it's referenced by
path, simply transfer ownership to path and set left to NULL. This
eliminates:

- Unnecessary refcount increment/decrement operations
- Redundant conditional checks for left node cleanup

The path now consistently owns the left node reference when used.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:50:56 +01:00
Sun YangKai 31b37b7667 btrfs: factor out root promotion logic into promote_child_to_root()
The balance_level() function is overly long and contains a cold code path
that handles promoting a child node to root when the root has only one item.
This code has distinct logic that is clearer and more maintainable when
isolated in its own function.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:50:56 +01:00
Qu Wenruo 1a332a6d70 btrfs: raid56: remove the "_step" infix
The following functions are introduced as a middle step for bs > ps
support:

- rbio_streip_step_paddr()
- rbio_pstripe_step_paddr()
- rbio_qstripe_step_paddr()
- sector_step_paddr_in_rbio()

As there is already an existing function without the infix, and has a
different parameter list.

But the existing functions have been cleaned up, there is no need to
keep the "_step" infix, just remove it completely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:50:19 +01:00
Qu Wenruo 8870dbeedc btrfs: raid56: enable bs > ps support
The support code for bs > ps is complete, enable it and update
assertions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:48:52 +01:00
Qu Wenruo 89ca1a403e btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
The function finish_parity_scrub() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce a helper, verify_one_parity_step()
  Since the P/Q generation is always done in a vertical stripe, we have
  to handle the range step by step.

- Only clear the rbio->dbitmap if all steps of an fs block match

- Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers
  Now we either use the paddrs version for checksum, or the step version
  for P/Q generation/recovery.

- Make alloc_rbio_essential_pages() to handle bs > ps cases
  Since for bs > ps cases, one fs block needs multiple pages, the
  existing simple check against rbio->stripe_pages[] is not enough.

  Extract a dedicated helper, alloc_rbio_sector_pages(), for the
  existing alloc_rbio_essential_pages(), which is still based on sector
  number.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:48:02 +01:00
Qu Wenruo ba88278c69 btrfs: raid56: prepare rbio_bio_add_io_paddr() to support bs > ps cases
The function rbio_bio_add_io_paddr() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce a helper bio_add_paddrs()
  Previously we only need to add a single page to a bio for a fs block,
  but now we need to add multiple pages, this means we can fail halfway.

  In that case we need to properly revert the bio (only for its size
  though) for halfway failed cases.

- Rename rbio_add_io_paddr() to rbio_add_io_paddrs()
  And change the @paddr parameter to @paddrs[].

- Change all callers to use the updated rbio_add_io_paddrs()
  For the @paddrs pointer used for the new function, it can be grabbed
  using sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:54 +01:00
Qu Wenruo 53474a2ae1 btrfs: raid56: prepare steal_rbio() to support bs > ps cases
The function steal_rbio() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce two helpers to calculate the sector number
  Previously we assume one page will contain at least one fs block, thus
  can use something like "sectors_per_page = PAGE_SIZE / sectorsize;",
  but with bs > ps support that above number will be 0.

  Instead introduce two helpers:

  * page_nr_to_sector_nr()
    Returns the sector number of the first sector covered by the page.

  * page_nr_to_num_sectors()
    Return how many sectors are covered by the page.

  And use the returned values for bitmap operations other than
  open-coded "PAGE_SIZE / sectorsize".
  Those helpers also have extra ASSERT()s to catch weird numbers.

- Use above helpers
  The involved functions are:
  * steal_rbio_page()
  * is_data_stripe_page()
  * full_page_sectors_uptodate()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:41 +01:00
Qu Wenruo 05ddf35a5d btrfs: raid56: prepare set_bio_pages_uptodate() to support bs > ps cases
The function set_bio_pages_uptodate() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Update find_stripe_sector_nr() to check only the first step paddr
  We don't need to check each paddr, as the bios are still aligned to fs
  block size, thus checking the first step is enough.

- Use step size to iterate the bio
  This means we only need to find the sector number for the first step
  of each fs block, and skip the remaining part.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:31 +01:00
Qu Wenruo 64e7b8c7c5 btrfs: raid56: prepare verify_bio_data_sectors() to support bs > ps cases
The function verify_bio_data_sectors() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Make get_bio_sector_nr() to consider bs > ps cases
  The function is utilized to calculate the sector number of a device
  bio submitted by btrfs raid56 layer.

- Assemble a local paddrs[] for checksum calculation

- Open code btrfs_check_block_csum()
  btrfs_check_block_csum() only supports fs blocks backed by large
  folios.

  But for raid56 we can have fs blocks backed by multiple non-contiguous
  pages, e.g. direct IO, encoded read/write/send.

  So instead of using btrfs_check_block_csum(), open code it to use
  btrfs_calculate_block_csum_pages().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:20 +01:00
Qu Wenruo e0eadfcc95 btrfs: raid56: prepare verify_one_sector() to support bs > ps cases
The function verify_one_sector() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce helpers to get a paddrs pointer
  Thankfully all the higher layer bio should still be aligned to fs
  block size, thus a fs block should still be fully covered by the bio.

  Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will
  return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or
  stripe_paddrs[].

  The pointer can be directly passed to
  btrfs_calculate_block_csum_pages() to verify the checksum.

- Open code btrfs_check_block_csum()
  btrfs_check_block_csum() only supports fs blocks backed by large
  folios.

  But for raid56 we can have fs blocks backed by multiple non-contiguous
  pages, e.g. direct IO, encoded read/write/send.

  So instead of using btrfs_check_block_csum(), open code it to use
  btrfs_calculate_block_csum_pages().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:16 +01:00
Qu Wenruo 9ba67fd616 btrfs: raid56: prepare recover_vertical() to support bs > ps cases
Currently recover_vertical() assumes that every fs block can be mapped
by one page, this is blocking bs > ps support for raid56.

Prepare recover_vertical() to support bs > ps cases by:

- Introduce recover_vertical_step() helper
  Which will recover a full step (min(PAGE_SIZE, sectorsize)).

  Now recover_vertical() will do the error check for the specified
  sector, do the recover step by step, then do the sector verification.

- Fix a spelling error of get_rbio_vertical_errors()
  The old name has a typo: "veritical".

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:11 +01:00
Qu Wenruo 826325b6d0 btrfs: raid56: prepare generate_pq_vertical() for bs > ps cases
Unlike btrfs_calculate_block_csum_pages(), we cannot handle multiple
pages at the same time for P/Q generation.

So here we introduce a new @step_nr, and various helpers to grab the
sub-block page from the rbio, and generate the P/Q stripe page by page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:05 +01:00
Qu Wenruo 91cd1b5865 btrfs: raid56: introduce a new parameter to locate a sector
Since we cannot ensure that all bios from the higher layer are backed by
large folios (e.g. direct IO, encoded read/write/send), we need the
ability to locate sub-block (aka, a page) inside a full stripe.

So the existing @stripe_nr + @sector_nr combination is not enough to
locate such page for bs > ps cases.

Introduce a new parameter, @step_nr, to locate the page of a larger fs
block.  The naming is following the conventions used inside btrfs
elsewhere, where one step is min(sectorsize, PAGE_SIZE).

It's still a preparation, only touching the following aspects:

- btrfs_dump_rbio()
  To show the new @sector_nsteps member.

- btrfs_raid_bio::sector_nsteps
  Recording how many steps there are inside a fs block.

- Enlarge btrfs_raid_bio::*_paddrs[] size
  To take @sector_nsteps into consideration.

- index_one_bio()
- index_stripe_sectors()
- memcpy_from_bio_to_stripe()
- cache_rbio_pages()
- need_read_stripe_sectors()
  Those functions are iterating *_paddrs[], which needs to take
  sector_nsteps into consideration.

- Rename rbio_stripe_sector_index() to rbio_sector_index()
  The "stripe" part is not that helpful.

  And an extra ASSERT() before returning the result.

- Add a new rbio_paddr_index() helper
  This will take the extra @step_nr into consideration.

- The comments of btrfs_raid_bio

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:46:58 +01:00
Qu Wenruo 9042dc0002 btrfs: raid56: add an overview for the btrfs_raid_bio structure
The structure needs to track both the pages from higher layer bio and
internal pages, thus it can be a little complex to grasp.

Add an overview of the structure, especially how we track different
pages from higher layer bios and internal ones, to save some time for
future developers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:46:29 +01:00