mirror-linux

Commit Graph

Author	SHA1	Message	Date
Filipe Manana	601ea9c42a	btrfs: use btrfs_inode local variable at btrfs_page_mkwrite() Most of the time we want to use the btrfs_inode, so change the local inode variable to be a btrfs_inode instead of a VFS inode, reducing verbosity by eliminating a lot of BTRFS_I() calls. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	11ad7983c2	btrfs: use variable for io_tree when clearing range in btrfs_page_mkwrite() We have the inode's io_tree already stored in a local variable, so use it instead of grabbing it again in the call to btrfs_clear_extent_bit(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Filipe Manana	6599716de2	btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents If we attempt a mmap write into a NOCOW file or a prealloc extent when there is no more available data space (or unallocated space to allocate a new data block group) and we can do a NOCOW write (there are no reflinks for the target extent or snapshots), we always fail due to -ENOSPC, unlike for the regular buffered write and direct IO paths where we check that we can do a NOCOW write in case we can't reserve data space. Simple reproducer: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi umount $DEV &> /dev/null mkfs.btrfs -f -b $((512 * 1024 * 1024)) $DEV mount $DEV $MNT touch $MNT/foobar # Make it a NOCOW file. chattr +C $MNT/foobar # Add initial data to file. xfs_io -c "pwrite -S 0xab 0 1M" $MNT/foobar # Fill all the remaining data space and unallocated space with data. dd if=/dev/zero of=$MNT/filler bs=4K &> /dev/null # Overwrite the file with a mmap write. Should succeed. xfs_io -c "mmap -w 0 1M" \ -c "mwrite -S 0xcd 0 1M" \ -c "munmap" \ $MNT/foobar # Unmount, mount again and verify the new data was persisted. umount $MNT mount $DEV $MNT od -A d -t x1 $MNT/foobar umount $MNT Running this: $ ./test.sh (...) wrote 1048576/1048576 bytes at offset 0 1 MiB, 256 ops; 0.0008 sec (1.188 GiB/sec and 311435.5231 ops/sec) ./test.sh: line 24: 234865 Bus error xfs_io -c "mmap -w 0 1M" -c "mwrite -S 0xcd 0 1M" -c "munmap" $MNT/foobar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 1048576 Fix this by not failing in case we can't allocate data space and we can NOCOW into the target extent - reserving only metadata space in this case. After this change the test passes: $ ./test.sh (...) wrote 1048576/1048576 bytes at offset 0 1 MiB, 256 ops; 0.0007 sec (1.262 GiB/sec and 330749.3540 ops/sec) 0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd * 1048576 A test case for fstests will be added soon. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
David Sterba	e8d2e254dc	btrfs: use clear_and_wake_up_bit() where open coded There are two cases open coding the clear and wake up pattern, we can use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	72b2b199d5	btrfs: accessors: rename variable for folio offset There used to be 'oip' short for offset in page, which got changed during conversion to folios, the name is a bit confusing so rename it. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	ae80748225	btrfs: accessors: factor out split memcpy with two sources The case of a reading the bytes from 2 folios needs two memcpy()s, the compiler does not emit calls but two inline loops. Factoring out the code makes some improvement (stack, code) and in the future will provide an optimized implementation as well. (The analogical version with two destinations is not done as it increases stack usage but can be done if needed.) The address of the second folio is reordered before the first memcpy, which leads to an optimization reusing the vmemmap_base and page_offset_base (implementing folio_address()). Stack usage reduction: btrfs_get_32 -8 (32 -> 24) btrfs_get_64 -8 (32 -> 24) Code size reduction: text data bss dec hex filename 1454279 115665 16088 1586032 183370 pre/btrfs.ko 1454229 115665 16088 1585982 18333e post/btrfs.ko DELTA: -50 As this is the last patch in this series, here's the overall diff starting and including commit "btrfs: accessors: simplify folio bounds checks": Stack: btrfs_set_16 -72 (88 -> 16) btrfs_get_32 -56 (80 -> 24) btrfs_set_8 -72 (88 -> 16) btrfs_set_64 -64 (88 -> 24) btrfs_get_8 -72 (80 -> 8) btrfs_get_16 -64 (80 -> 16) btrfs_set_32 -64 (88 -> 24) btrfs_get_64 -56 (80 -> 24) NEW (48): report_setget_bounds 48 LOST/NEW DELTA: +48 PRE/POST DELTA: -472 Code: text data bss dec hex filename 1456601 115665 16088 1588354 183c82 pre/btrfs.ko 1454229 115665 16088 1585982 18333e post/btrfs.ko DELTA: -2372 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	c8b33a57fb	btrfs: accessors: set target address at initialization The target address for the read/write can be simplified as it's the same expression for the first folio. This improves the generated code as the folio address does not have to be cached on stack. Stack usage reduction: btrfs_set_32 -8 (32 -> 24) btrfs_set_64 -8 (32 -> 24) btrfs_get_16 -8 (24 -> 16) Code size reduction: text data bss dec hex filename 1454459 115665 16088 1586212 183424 pre/btrfs.ko 1454279 115665 16088 1586032 183370 post/btrfs.ko DELTA: -180 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	1ed0f75d57	btrfs: accessors: compile-time fast path for u16 Reading/writing 2 bytes (u16) may need 2 folios to be written to, each time it's just one byte so using memcpy for that is an overkill. Add a branch for the split case so that memcpy is now used for u32 and u64. Another side effect is that the u16 types now don't need additional stack space, everything fits to registers. Stack usage is reduced: btrfs_get_16 -8 (32 -> 24) btrfs_set_16 -16 (32 -> 16) Code size reduction: text data bss dec hex filename 1454691 115665 16088 1586444 18350c pre/btrfs.ko 1454459 115665 16088 1586212 183424 post/btrfs.ko DELTA: -232 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	58383c6866	btrfs: accessors: compile-time fast path for u8 Reading/writing 1 byte (u8) is a special case compared to the others as it's always contained in the folio we find, so the split memcpy will never be needed. Turn it to a compile-time check that the memcpy part can be optimized out. The stack usage is reduced: btrfs_set_8 -16 (32 -> 16) btrfs_get_8 -16 (24 -> 8) Code size reduction: text data bss dec hex filename 1454951 115665 16088 1586704 183610 pre/btrfs.ko 1454691 115665 16088 1586444 18350c post/btrfs.ko DELTA: -260 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	d5a87dbd95	btrfs: accessors: inline eb bounds check and factor out the error report There's a check in each set/get helper if the requested range is within extent buffer bounds, and if it's not then report it. This was in an ASSERT statement so with CONFIG_BTRFS_ASSERT this crashes right away, on other configs this is only reported but reading out of the bounds is done anyway. There are currently no known reports of this particular condition failing. There are some drawbacks though. The behaviour dependence on the assertions being compiled in or not and a less visible effect of inlining report_setget_bounds() into each helper. As the bounds check is expected to succeed almost always it's ok to inline it but make the report a function and move it out of the helper completely (__cold puts it to a different section). This also skips reading/writing the requested range in case it fails. This improves stack usage significantly: btrfs_get_16 -48 (80 -> 32) btrfs_get_32 -48 (80 -> 32) btrfs_get_64 -48 (80 -> 32) btrfs_get_8 -48 (72 -> 24) btrfs_set_16 -56 (88 -> 32) btrfs_set_32 -56 (88 -> 32) btrfs_set_64 -56 (88 -> 32) btrfs_set_8 -48 (80 -> 32) NEW (48): report_setget_bounds 48 LOST/NEW DELTA: +48 PRE/POST DELTA: -360 Same as .ko size: text data bss dec hex filename 1456079 115665 16088 1587832 183a78 pre/btrfs.ko 1454951 115665 16088 1586704 183610 post/btrfs.ko DELTA: -1128 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	378c95c477	btrfs: accessors: use type sizeof constants directly Now unit_size is used only once, so use it directly in 'part' calculation. Don't cache sizeof(type) in a variable. While this is a compile-time constant, forcing the type 'int' generates worse code as it leads to additional conversion from 32 to 64 bit type on x86_64. The sizeof() is used only a few times and it does not make the code that harder to read, so use it directly and let the compiler utilize the immediate constants in the context it needs. The .ko code size slightly increases (+50) but further patches will reduce that again. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	00c0cf8444	btrfs: accessors: simplify folio bounds checks As we can a have non-contiguous range in the eb->folios, any item can be straddling two folios and we need to check if it can be read in one go or in two parts. For that there's a check which is not implemented in the simplest way: offset in folio + size <= folio size With a simple expression transformation: oil + size <= unit_size size <= unit_size - oil sizeof() <= part this can be simplified and reusing existing run-time or compile-time constants. Add likely() annotation for this expression as this is the fast path and compiler sometimes reorders that after the followup block with the memcpy (observed in practice with other simplifications). Overall effect on stack consumption: btrfs_get_8 -8 (80 -> 72) btrfs_set_8 -8 (88 -> 80) And .ko size (due to optimizations making use of the direct constants): text data bss dec hex filename 1456601 115665 16088 1588354 183c82 pre/btrfs.ko 1456093 115665 16088 1587846 183a86 post/btrfs.ko DELTA: -508 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	c76841362f	btrfs: remove struct rcu_string The only use for device name has been removed so we can kill the RCU string API. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
David Sterba	e8d58aef11	btrfs: open code RCU for device name The RCU protected string is only used for a device name, and RCU is used so we can print the name and eventually synchronize against the rare device rename in device_list_add(). We don't need the whole API just for that. Open code all the helpers and access to the string itself. Notable change is in device_list_add() when the device name is changed, which is the only place that can actually happen at the same time as message prints using the device name under RCU read lock. Previously there was kfree_rcu() which used the embedded rcu_head to delay freeing the object depending on the RCU mechanism. Now there's kfree_rcu_mightsleep() which does not need the rcu_head and waits for the grace period. Sleeping is safe in this context and as this is a rare event it won't interfere with the rest as it's holding the device_list_mutex. Straightforward changes: - rcu_string_strdup -> kstrdup - rcu_str_deref -> rcu_dereference - drop ->str from safe contexts and use rcu_dereference_raw() so it does not trigger RCU validators Historical notes: Introduced in `606686eeac` ("Btrfs: use rcu to protect device->name") with a vague reference of the potential problem described in https://lore.kernel.org/all/20120531155304.GF11775@ZenIV.linux.org.uk/ . The RCU protection looks like the easiest and most lightweight way of protecting the rare event of device rename racing device_list_add() with a random printk() that uses the device name. Alternatives: a spin lock would require to protect the printk anyway, a fixed buffer for the name would be eventually wrong in case the new name is overwritten when being printed, an array switching pointers and cleaning them up eventually resembles RCU too much. The cleanups up to this patch should hide special case of RCU to the minimum that only the name needs rcu_dereference(), which can be further cleaned up to use btrfs_dev_name(). Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:21 +02:00
Daniel Vacek	f2cb97ee96	btrfs: index buffer_tree using node size So far we've been deriving the buffer tree index using the sector size. But each extent buffer covers multiple sectors. This makes the buffer tree rather sparse. For example the typical and quite common configuration uses sector size of 4KiB and node size of 16KiB. In this case it means the buffer tree is using up to the maximum of 25% of it's slots. Or in other words at least 75% of the tree slots are wasted as never used. We can score significant memory savings on the required tree nodes by indexing the tree using the node size instead. As a result far less slots are wasted and the tree can now use up to all 100% of it's slots this way. Note: This works even with unaligned tree blocks as we can still get unique index by doing eb->start >> nodesize_shift. Getting some stats from running fio write test, there is a bit of variance. The values presented in the table below are medians from 5 test runs. The numbers are: - # of allocated ebs in the tree - # of leaf tree nodes - highest index in the tree (radix tree width)): ebs / leaves / Index \| bare for-next \| with fix ---------------------+--------------------+------------------- post mount \| 16 / 11 / 10e5c \| 16 / 10 / 4240 post test \| 5810 / 891 / 11cfc \| 4420 / 252 / 473a post rm \| 574 / 300 / 10ef0 \| 540 / 163 / 46e9 In this case (10GiB filesystem) the height of the tree is still 3 levels but the 4x width reduction is clearly visible as expected. But since the tree is more dense we can see the 54-72% reduction of leaf nodes. That's very close to ideal with this test. It means the tree is getting really dense with this kind of workload. Also, the fio results show no performance change. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Filipe Manana	2b759eea98	btrfs: send: directly return strcmp() result when comparing recorded refs There's no point in converting the return values from strcmp() as all we need is that it returns a negative value if the first argument is less than the second, a positive value if it's greater and 0 if equal. We do not have a need for -1 instead of any other negative value and no need for 1 instead of any other positive value - that's all that rb_find() needs and no where else we need specific negative and positive values. So remove the intermediate local variable and checks and return directly the result from strcmp(). This also reduces the module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1888116 161347 16136 2065599 1f84bf fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1888052 161347 16136 2065535 1f847f fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Filipe Manana	e560afc1a8	btrfs: set search_commit_root to false in iterate_inodes_from_logical() There's no point in checking at iterate_inodes_from_logical() if the path has search_commit_root set, the only caller never sets search_commit_root to true and it doesn't make sense for it ever to be true for the current use case (logical_to_ino ioctl). So stop checking for that and since the only caller allocates the path just for it to be used by iterate_inodes_from_logical(), move the path allocation into that function. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Filipe Manana	aee10fe4e4	btrfs: reduce size of struct tree_mod_elem Several members are used for specific types of tree mod log operations so they can be placed in a union in order to reduce the structure's size. This reduces the structure size from 112 bytes to 88 bytes on x86_64, so we can now use the kmalloc-96 slab instead of the kmalloc-128 slab. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Filipe Manana	d30b236a3e	btrfs: avoid logging tree mod log elements for irrelevant extent buffers We are logging tree mod log operations for extent buffers from any tree but we only need to log for the extent tree and subvolume tree, since the tree mod log is used to get a consistent view, within a transaction, of extents and their backrefs. So it's pointless to log operations for trees such as the csum tree, free space tree, root tree, chunk tree, log trees, data relocation tree, etc, as these trees are not used for backref walking and all tree mod log users are about backref walking. So skip extent buffers that don't belong neither to the extent nor to subvolume trees. This avoids unnecessary memory allocations and having a larger tree mod log rbtree with nodes that are never needed. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:20 +02:00
Boris Burkov	9e9ff875e4	btrfs: use readahead_expand() on compressed extents We recently received a report of poor performance doing sequential buffered reads of a file with compressed extents. With bs=128k, a naive sequential dd ran as fast on a compressed file as on an uncompressed (1.2GB/s on my reproducing system) while with bs<32k, this performance tanked down to ~300MB/s. i.e., slow: dd if=some-compressed-file of=/dev/null bs=4k count=X vs fast: dd if=some-compressed-file of=/dev/null bs=128k count=Y The cause of this slowness is overhead to do with looking up extent_maps to enable readahead pre-caching on compressed extents (add_ra_bio_pages()), as well as some overhead in the generic VFS readahead code we hit more in the slow case. Notably, the main difference between the two read sizes is that in the large sized request case, we call btrfs_readahead() relatively rarely while in the smaller request we call it for every compressed extent. So the fast case stays in the btrfs readahead loop: while ((folio = readahead_folio(rac)) != NULL) btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start); where the slower one breaks out of that loop every time. This results in calling add_ra_bio_pages a lot, doing lots of extent_map lookups, extent_map locking, etc. This happens because although add_ra_bio_pages() does add the appropriate un-compressed file pages to the cache, it does not communicate back to the ractl in any way. To solve this, we should be using readahead_expand() to signal to readahead to expand the readahead window. This change passes the readahead_control into the btrfs_bio_ctrl and in the case of compressed reads sets the expansion to the size of the extent_map we already looked up anyway. It skips the subpage case as that one already doesn't do add_ra_bio_pages(). With this change, whether we use bs=4k or bs=128k, btrfs expands the readahead window up to the largest compressed extent we have seen so far (in the trivial example: 128k) and the call stacks of the two modes look identical. Notably, we barely call add_ra_bio_pages at all. And the performance becomes identical as well. So this change certainly "fixes" this performance problem. Of course, it does seem to beg a few questions: 1. Will this waste too much page cache with a too large ra window? 2. Will this somehow cause bugs prevented by the more thoughtful checking in add_ra_bio_pages? 3. Should we delete add_ra_bio_pages? My stabs at some answers: 1. Hard to say. See attempts at generic performance testing below. Is there a "readahead_shrink" we should be using? Should we expand more slowly, by half the remaining em size each time? 2. I don't think so. Since the new behavior is indistinguishable from reading the file with a larger read size passed in, I don't see why one would be safe but not the other. 3. Probably! I tested that and it was fine in fstests, and it seems like the pages would get re-used just as well in the readahead case. However, it is possible some reads that use page cache but not btrfs_readahead() could suffer. I will investigate this further as a follow up. I tested the performance implications of this change in 3 ways (using compress-force=zstd:3 for compression): Directly test the affected workload of small sequential reads on a compressed file (improved from ~250MB/s to ~1.2GB/s) ==========for-next========== dd /mnt/lol/non-cmpr 4k 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.02983 s, 712 MB/s dd /mnt/lol/non-cmpr 128k 32768+0 records in 32768+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.92403 s, 725 MB/s dd /mnt/lol/cmpr 4k 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.8832 s, 240 MB/s dd /mnt/lol/cmpr 128k 32768+0 records in 32768+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.71001 s, 1.2 GB/s ==========ra-expand========== dd /mnt/lol/non-cmpr 4k 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.09001 s, 705 MB/s dd /mnt/lol/non-cmpr 128k 32768+0 records in 32768+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.07664 s, 707 MB/s dd /mnt/lol/cmpr 4k 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.79531 s, 1.1 GB/s dd /mnt/lol/cmpr 128k 32768+0 records in 32768+0 records out 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.69533 s, 1.2 GB/s Built the linux kernel from clean (no change) Ran fsperf. Mostly neutral results with some improvements and regressions here and there. Reported-by: Dimitrios Apostolou <jimis@gmx.net> Link: https://lore.kernel.org/linux-btrfs/34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net/ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:00 +02:00
Qu Wenruo	1ef94169db	btrfs: populate otime when logging an inode item [TEST FAILURE WITH EXPERIMENTAL FEATURES] When running test case generic/508, the test case will fail with the new btrfs shutdown support: generic/508 - output mismatch (see /home/adam/xfstests/results//generic/508.out.bad) --- tests/generic/508.out 2022-05-11 11:25:30.806666664 +0930 +++ /home/adam/xfstests/results//generic/508.out.bad 2025-07-02 14:53:22.401824212 +0930 @@ -1,2 +1,6 @@ QA output created by 508 Silence is golden +Before: +After : stat.btime = Thu Jan 1 09:30:00 1970 +Before: +After : stat.btime = Wed Jul 2 14:53:22 2025 ... (Run 'diff -u /home/adam/xfstests/tests/generic/508.out /home/adam/xfstests/results//generic/508.out.bad' to see the entire diff) Ran: generic/508 Failures: generic/508 Failed 1 of 1 tests Please note that the test case requires shutdown support, thus the test case will be skipped using the current upstream kernel, as it doesn't have shutdown ioctl support. [CAUSE] The direct cause the 0 time stamp in the log tree: leaf 30507008 items 2 free space 16057 generation 9 owner TREE_LOG leaf 30507008 flags 0x1(WRITTEN) backref revision 1 checksum stored e522548d checksum calced e522548d fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0 chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398 item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 9 transid 9 size 0 nbytes 0 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) atime 1751432947.492000000 (2025-07-02 14:39:07) ctime 1751432947.492000000 (2025-07-02 14:39:07) mtime 1751432947.492000000 (2025-07-02 14:39:07) otime 0.0 (1970-01-01 09:30:00) <<< But the old fs tree has all the correct time stamp: btrfs-progs v6.12 fs tree key (FS_TREE ROOT_ITEM 0) leaf 30425088 items 2 free space 16061 generation 5 owner FS_TREE leaf 30425088 flags 0x1(WRITTEN) backref revision 1 checksum stored 48f6c57e checksum calced 48f6c57e fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0 chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 3 transid 0 size 0 nbytes 16384 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 sequence 0 flags 0x0(none) atime 1751432947.0 (2025-07-02 14:39:07) ctime 1751432947.0 (2025-07-02 14:39:07) mtime 1751432947.0 (2025-07-02 14:39:07) otime 1751432947.0 (2025-07-02 14:39:07) <<< The root cause is that fill_inode_item() in tree-log.c is only populating a/c/m time, not the otime (or btime in statx output). Part of the reason is that, the vfs inode only has a/c/m time, no native btime support yet. [FIX] Thankfully btrfs has its otime stored in btrfs_inode::i_otime_sec and btrfs_inode::i_otime_nsec. So what we really need is just fill the otime time stamp in fill_inode_item() of tree-log.c There is another fill_inode_item() in inode.c, which is doing the proper otime population. Fixes: `94edf4ae43` ("Btrfs: don't bother committing delayed inode updates when fsyncing") CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:08:35 +02:00
Filipe Manana	a943812bff	btrfs: qgroup: use btrfs_qgroup_enabled() in ioctls We have a publicly exported btrfs_qgroup_enabled() and an ioctl.c private qgroup_enabled() helper. Both of these test if qgroups are enabled, the first check if the flag BTRFS_FS_QUOTA_ENABLED is set in fs_info->flags while the second checks if fs_info->quota_root is not NULL while holding the mutex fs_info->qgroup_ioctl_lock. We can get away with the private ioctl.c:qgroup_enabled(), as all entry points into the qgroup code check if fs_info->quota_root is NULL or not while holding the mutex fs_info->qgroup_ioctl_lock, and returning the error -ENOTCONN in case it's NULL. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:07:37 +02:00
Filipe Manana	08530d6e63	btrfs: qgroup: fix qgroup create ioctl returning success after quotas disabled When quotas are disabled qgroup ioctls are supposed to return -ENOTCONN, but the qgroup create ioctl stopped doing that when it races with a quota disable operation, returning 0 instead. This change of behaviour happened in commit `6ed05643dd` ("btrfs: create qgroup earlier in snapshot creation"). The issue happens as follows: 1) Task A enters btrfs_ioctl_qgroup_create(), qgroups are enabled and so qgroup_enabled() returns true since fs_info->quota_root is not NULL; 2) Task B enters btrfs_ioctl_quota_ctl() -> btrfs_quota_disable() and disables qgroups, so now fs_info->quota_root is NULL; 3) Task A enters btrfs_create_qgroup() and calls btrfs_qgroup_mode(), which returns BTRFS_QGROUP_MODE_DISABLED since quotas are disabled, and then btrfs_create_qgroup() returns 0 to the caller, which makes the ioctl return 0 instead of -ENOTCONN. The check for fs_info->quota_root and returning -ENOTCONN if it's NULL is made only after the call btrfs_qgroup_mode(). Fix this by moving the check for disabled quotas with btrfs_qgroup_mode() into transaction.c:create_pending_snapshot(), so that we don't abort the transaction if btrfs_create_qgroup() returns -ENOTCONN and quotas are disabled. Fixes: `6ed05643dd` ("btrfs: create qgroup earlier in snapshot creation") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:07:30 +02:00
Filipe Manana	e41c75ca31	btrfs: qgroup: set quota enabled bit if quota disable fails flushing reservations Before waiting for the rescan worker to finish and flushing reservations, we clear the BTRFS_FS_QUOTA_ENABLED flag from fs_info. If we fail flushing reservations we leave with the flag not set which is not correct since quotas are still enabled - we must set back the flag on error paths, such as when we fail to start a transaction, except for error paths that abort a transaction. The reservation flushing happens very early before we do any operation that actually disables quotas and before we start a transaction, so set back BTRFS_FS_QUOTA_ENABLED if it fails. Fixes: `af0e2aab3b` ("btrfs: qgroup: flush reservations during quota disable") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:07:08 +02:00
Qu Wenruo	736bd9d2e3	btrfs: restrict writes to opened btrfs devices [FLAG EXCLUSION] Commit `ead622674d` ("btrfs: Do not restrict writes to btrfs devices") removes the BLK_OPEN_RESTRICT_WRITES flag when opening the devices during mount. This was an exception at the time as it depended on other patches. [REASON TO EXCLUDE THAT FLAG] Btrfs needs to call btrfs_scan_one_device() to determine the fsid, no matter if we're mounting a new fs or an existing one. But if a fs is already mounted and the BLK_OPEN_RESTRICT_WRITES is honored, meaning no other write open is allowed for the block device. Then we want to mount a subvolume of the mounted fs to another mount point, we will call btrfs_scan_one_device() again, but it will fail due to the BLK_OPEN_RESTRICT_WRITES flag (no more write open allowed), causing only one mount point for the fs. Thus at that time, we had to exclude the BLK_OPEN_RESTRICT_WRITES to allow multiple mount points for one fs. [WHY IT'S SAFE NOW] The root problem is, we do not need to nor should use BLK_OPEN_WRITE for btrfs_scan_one_device(). That function is only to read out the super block, no write at all, and BLK_OPEN_WRITE is only going to cause problems for such usage. The root problem has been fixed by patch "btrfs: always open the device read-only in btrfs_scan_one_device", so btrfs_scan_one_device() will always work no matter if the device is opened with BLK_OPEN_RESTRICT_WRITES. [ENHANCEMENT] Just remove the btrfs_open_mode(), as the only call site can be replaced with regular sb_open_mode(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:20 +02:00
Qu Wenruo	08fa138864	btrfs: use fs_holder_ops for all opened devices Since we have btrfs_fs_info::sb (struct super_block) as our block device holder, we can safely use fs_holder_ops for all of our block devices. This enables freezing/thawing the filesystem from the underlying block devices. This may enhance hibernation/suspend support since previously freezing/thawing a block device managed by btrfs won't do anything btrfs specific, but only syncing the block device. Thus before this change, freezing the underlying block devices won't prevent future writes into the filesystem, thus may cause problems for hibernation/suspend when btrfs is involved. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:20 +02:00
Christoph Hellwig	40426dd147	btrfs: use the super_block as holder when mounting file systems The file system type is not a very useful holder as it doesn't allow us to go back to the actual file system instance. Pass the super_block instead which is useful when passed back to the file system driver. This matches what is done for all other block device based file systems, and allows us to remove btrfs_fs_info::bdev_holder completely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Qu Wenruo	bddf57a707	btrfs: delay btrfs_open_devices() until super block is created Currently we always call btrfs_open_devices() before creating the super block. It's fine for now because: - No blk_holder_ops is provided - btrfs_fs_type is used as a holder This means no matter who wins the device opening race, the holder will be the same thus not affecting the later sget_fc() race. And since no blk_holder_ops is provided, no bdev operation is depending on the holder. But this will no longer be true if we want to implement a proper blk_holder_ops using fs_holder_ops. This means we will need a proper super block as the bdev holder. To prepare for such change: - Add btrfs_fs_devices::holding member This will prevent btrfs_free_stale_devices() and btrfs_close_device() from deleting the fs_devices when there is another process trying to mount the fs. Along with the new member, here come the two helpers, btrfs_fs_devices_inc_holding() and btrfs_fs_devices_dec_holding(). This will allow us to hold fs_devices without opening it. This is needed because we cannot hold uuid_mutex while calling sget_fc(), this will reverse the lock sequence with s_umount, causing a lockdep warning. - Delay btrfs_open_devices() until a super block is returned This means we have to hold the initial fs_devices first, then unlock uuid_mutex, call sget_fc(), then re-lock uuid_mutex, and decrease the holding number. For new super block case, we continue to btrfs_open_devices() with uuid_mutex hold. For existing super block case, we can unlock uuid_mutex and continue. Although this means a more complex error handling path, as if we didn't call btrfs_open_devices() (either got an existing sb, or sget_fc() failed), we cannot let btrfs_put_fs_info() cleanup the fs_devices, as it can be freed at any time after we decrease the hold on fs_devices and unlock uuid_mutex. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Qu Wenruo	de339cbfb4	btrfs: call bdev_fput() to reclaim the blk_holder immediately As part of the preparation for btrfs blk_holder_ops, we want to ensure the holder of a block device has a proper lifespan. However btrfs is always using fput() to close a block device, which has one problem: - fput() is deferred Meaning we can have a block device with invalid (aka, freed) holder. To avoid the problem and align the behavior to other code, just call bdev_fput() instead. There is some extra requirement on the locking, but that's all resolved by previous patches and we should be safe to call bdev_fput(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Christoph Hellwig	9f43d0ff55	btrfs: call btrfs_close_devices() from ->kill_sb Although btrfs is not yet implementing blk_holder_ops, there is a requirement for proper blk_holder_ops: - blkdev_put() must not be called under sb->s_umount The blkdev_put()/bdev_fput() must not be called under sb->s_umount to avoid lock order reversal with disk->open_mutex. This is for the proper blk_holder_ops callbacks. Currently we're fine because we call regular fput() which defers the blk holder reclaiming. To prepare for the future of blk_holder_ops, move the btrfs_close_devices() calls into btrfs_free_fs_info(). That will be called from kill_sb() callbacks, which is also called for error handing during mount failures, or there is already an existing super block. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Qu Wenruo	2936a6ac8d	btrfs: add assertions to make super block creation more clear When calling sget_fc(), there are 3 different situations: a) Critical error No super block created. b) A new super block is created The fc->s_fs_info is transferred to the super block, and fc->s_fs_info is reset to NULL. In this case sb->s_root should still be NULL, and needs to be properly initialized later by btrfs_fill_super(). c) An existing super block is returned The fc->s_fs_info is untouched, and anything related to that fs_info should be properly cleaned up. This is not obvious even with the extra comments at sget_fc(). Enhance the situation by: - Add comments for case b) and c) Especially for case c), the fs_info and fs_devices cleanup happens at different timing, thus needs extra explanation. - Move the comments closer to case b) and case c) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Qu Wenruo	35ea448b75	btrfs: get rid of re-entering of btrfs_get_tree() [EXISTING PROBLEM] Currently btrfs mount is split into two parts: - btrfs_get_tree_subvol() Which sets up the very basic fs_info, and eventually calls mount_subvol() to mount the target subvolume. - btrfs_get_tree_super() This is the part doing super block allocation and if there is no existing super block, do the real open_ctree() to open the fs. However currently we're doing this in a complex re-entering way: vfs_get_tree() \|- btrfs_get_tree() \|- btrfs_get_tree_subvol() \|- vfs_get_tree() \| \|- btrfs_get_tree() \| \|- btrfs_get_tree_super() \|- mount_subvol() This is definitely not that easy to grasp. [ENHANCEMENT] The function vfs_get_tree() is only doing the following work: - Call get_tree() call back - Call super_wake() - Call security_sb_set_mnt_opts() In our case, super_wake() can be skipped, as after btrfs_get_tree_subvol() finishes, vfs_get_tree() will call super_wake() on the super block we got anyway. The same applies to security_sb_set_mnt_opts(), as long as we do not free the security from our original fc in btrfs_get_tree_subvol(), the first vfs_get_tree() call will handle the security correctly. So here we only need to: - Replace vfs_get_tree() call with btrfs_get_tree_super() - Keep the existing fc->security for vfs_get_tree() to handle the security This will remove the re-entering behavior and make thing much easier to follow. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:19 +02:00
Christoph Hellwig	ae818824a2	btrfs: always open the device read-only in btrfs_scan_one_device() btrfs_scan_one_device() opens the block device only to read the super block. Instead of passing a blk_mode_t argument to sometimes open it for writing, just hard code BLK_OPEN_READ as it will never write to the device or hand the block_device out to someone else. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:06:02 +02:00
Caleb Sander Mateos	ea124ec327	btrfs: don't skip accounting in early ENOTTY return in btrfs_uring_encoded_read() btrfs_uring_encoded_read() returns early with -ENOTTY if the uring_cmd is issued with IO_URING_F_COMPAT but the kernel doesn't support compat syscalls. However, this early return bypasses the syscall accounting. Go to out_acct instead to ensure the syscall is counted. Fixes: `34310c442e` ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)") CC: stable@vger.kernel.org # 6.15+ Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:30 +02:00
David Sterba	9950c31ad9	btrfs: rename inode number parameter passed to btrfs_check_dir_item_collision() The name 'dir' is misleading as it's the inode number of the directory, so rename it accordingly. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	8320febc64	btrfs: pass bool to indicate subvolume/snapshot creation type Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	a5f0e0a4df	btrfs: pass dentry to btrfs_mksubvol() and btrfs_mksnapshot() There's no reason to pass 'struct path' to btrfs_mksubvol(), though it's been like the since the first commit `76dda93c6a` ("Btrfs: add snapshot/subvolume destroy ioctl"). We only use the dentry so we should pass it directly. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	34f6cc5b18	btrfs: use struct qstr for subvolume ioctl helpers We pass name and length of subvolumes separately to the related functions, while this can be a struct qstr which is otherwise used for dentry interfaces. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
Brahmajit Das	164299ba11	btrfs: replace strcpy() with strscpy() strcpy() is discouraged from use due to lack of bounds checking. Replaces it with strscpy(), the recommended alternative for null terminated strings, to follow best practices. There are instances where strscpy() cannot be used such as where both the source and destination are character pointers. In that instance we can use sysfs_emit(). Link: https://github.com/KSPP/linux/issues/88 Suggested-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Brahmajit Das <bdas@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	b37eb352c4	btrfs: accessors: delete token versions of set/get helpers Once upon a time there was a need to cache address of extent buffer pages, as it was a costly operation (map_private_extent_buffer(), `cfed81a04e` ("Btrfs: add the ability to cache a pointer into the eb")). This was not even due to use of HIGHMEM, this had been removed before that due to possible locking issues (`a65917156e` ("Btrfs: stop using highmem for extent_buffers")). Over time the amount of work in the set/get helpers got reduced and became quite straightforward bounds checking with an unaligned read/write, commit `db3756c879` ("btrfs: remove unused map_private_extent_buffer"). The actual caching of the page_address()/folio_address() in the token was more work for very little gain. This depended on subsequent access into the same page/folio, otherwise the cached pointer had to be updated. For metadata-heavy operations this showed up in the 'perf top' profile where the btrfs_get_token_32() calls were at the top, on my testing machine consuming about 2-3%. The other generic 32/64 bit helpers also appeared in the profile with similar fraction. After removing use of the token helpers we can remove them completely, this leads to reduction of btrfs.ko by 6.7KiB on release config. text data bss dec hex filename 1463289 115665 16088 1595042 1856a2 pre/btrfs.ko 1456601 115665 16088 1588354 183c82 post/btrfs.ko DELTA: -6688 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	c418a15045	btrfs: tree-log: don't use token set/get accessors in fill_inode_item() The token versions of set/get accessors will be removed, use the normal helpers. There's additional overhead of the token helpers that update the cached address in case it moves to another page/folio. The normal versions don't need to do that. Note this is similar to fill_inode_item() in inode.c but with slight differences. The two functions could be deduplicated eventually. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:05:00 +02:00
David Sterba	e3df5141a4	btrfs: don't use token set/get accessors in inode.c:fill_inode_item() The token versions of set/get accessors will be removed, use the normal helpers. There's additional overhead of the token helpers that update the cached address in case it moves to another page/folio. The normal versions don't need to do that. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:04:59 +02:00
David Sterba	114b806a73	btrfs: don't use token set/get accessors for btrfs_item members The token versions of set/get accessors will be removed, use the normal helpers. The btrfs_item members use that interface the most but there are no real benefits anymore. This reduces stack consumption on x86_64 release config: setup_items_for_insert -32 (144 -> 112) __push_leaf_left -32 (176 -> 144) btrfs_extend_item -16 (104 -> 88) copy_for_split -32 (144 -> 112) btrfs_del_items -16 (144 -> 128) btrfs_truncate_item -24 (152 -> 128) __push_leaf_right -24 (168 -> 144) and module size: text data bss dec hex filename 1463615 115665 16088 1595368 1857e8 pre/btrfs.ko 1463413 115665 16088 1595166 18571e post/btrfs.ko DELTA: -202 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:04:59 +02:00
Filipe Manana	60127c29f1	btrfs: qgroup: remove no longer used fs_info->qgroup_ulist It's not used anymore after commit `0913445082` ("btrfs: qgroup: use qgroup_iterator in qgroup_convert_meta()"), so remove it. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:04:59 +02:00
Filipe Manana	e124966775	btrfs: qgroup: fix race between quota disable and quota rescan ioctl There's a race between a task disabling quotas and another running the rescan ioctl that can result in a use-after-free of qgroup records from the fs_info->qgroup_tree rbtree. This happens as follows: 1) Task A enters btrfs_ioctl_quota_rescan() -> btrfs_qgroup_rescan(); 2) Task B enters btrfs_quota_disable() and calls btrfs_qgroup_wait_for_completion(), which does nothing because at that point fs_info->qgroup_rescan_running is false (it wasn't set yet by task A); 3) Task B calls btrfs_free_qgroup_config() which starts freeing qgroups from fs_info->qgroup_tree without taking the lock fs_info->qgroup_lock; 4) Task A enters qgroup_rescan_zero_tracking() which starts iterating the fs_info->qgroup_tree tree while holding fs_info->qgroup_lock, but task B is freeing qgroup records from that tree without holding the lock, resulting in a use-after-free. Fix this by taking fs_info->qgroup_lock at btrfs_free_qgroup_config(). Also at btrfs_qgroup_rescan() don't start the rescan worker if quotas were already disabled. Reported-by: cen zhang <zzzccc427@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAFRLqsV+cMDETFuzqdKSHk_FDm6tneea45krsHqPD6B3FetLpQ@mail.gmail.com/ CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:04:49 +02:00
Filipe Manana	c0d013495a	btrfs: clear dirty status from extent buffer on error at insert_new_root() If we failed to insert the tree mod log operation, we are not removing the dirty status from the allocated and dirtied extent buffer before we free it. Removing the dirty status is needed for several reasons such as to adjust the fs_info->dirty_metadata_bytes counter and remove the dirty status from the respective folios. So add the missing call to btrfs_clear_buffer_dirty(). Fixes: `f61aa7ba08` ("btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:02:09 +02:00
Johannes Thumshirn	9669fcb77e	btrfs: change dump_block_groups() in btrfs_dump_space_info() from int to bool btrfs_dump_space_info()'s parameter dump_block_groups is used as a boolean although it is defined as an integer. Change it from int to bool. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
David Sterba	ab5fcbb1ad	btrfs: use pgoff_t for page index variables Any conversion of offsets in the logical or the physical mapping space of the pages is done by a shift and the target type should be pgoff_t (type of struct page::index). Fix the locations where it's still unsigned long. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
George Hu	afd1dacbd0	btrfs: replace nested usage of min & max with clamp in btrfs_compress_set_level() Refactor the btrfs_compress_set_level() function by replacing the nested usage of min() and max() macro with clamp() to simplify the code and improve readability. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: George Hu <integral@archlinux.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
Dmitry Antipov	2fb5e56f52	btrfs: send: avoid extra calls to strlen() in gen_unique_name() Since 'snprintf()' returns the number of characters which would be emitted and output truncation is handled by 'ASSERT()', it should be safe to use that return value instead of the subsequent calls to 'strlen()' in 'gen_unique_name()'. This also reduces the module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1897006 161571 16136 2074713 1fa859 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1896848 161571 16136 2074555 1fa7bb fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
Filipe Manana	6633a416ed	btrfs: qgroup: avoid memory allocation if qgroups are not enabled At btrfs_qgroup_inherit() we allocate a qgroup record even if qgroups are not enabled, which is unnecessary overhead and can result in subvolume creation to fail with -ENOMEM, as create_subvol() calls this function. Improve on this by making btrfs_qgroup_inherit() check if qgroups are enabled earlier and return if they are not, avoiding the unnecessary memory allocation and taking some locks. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	2fda07effb	btrfs: qgroup: remove pointless error check for add_qgroup_rb() call The add_qgroup_rb() function never returns an error pointer anymore since commit `8d54518b5e` ("btrfs: qgroup: pre-allocate btrfs_qgroup to reduce GFP_ATOMIC usage"), so checking for an error pointer result at btrfs_quota_enable() is pointless. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	da7f005239	btrfs: split btrfs_is_fstree() into multiple if statements for readability Instead of a single if statement with multiple conditions, split it into several if statements testing only one condition at a time and return true or false immediately after. This makes it more immediate to reason. The module's text size also slightly decreases, at least with gcc 14.2.0 on x86_64. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1897138 161583 16136 2074857 1fa8e9 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename `1896976` 161583 16136 2074695 1fa847 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	fd00922abc	btrfs: add btrfs prefix to is_fstree() and make it return bool This is an exported function and therefore it should have a 'btrfs_' prefix, to make it clear it's btrfs specific, avoid future name collisions with code outside btrfs, and make its naming consistent with most other btrfs exported functions. So add a 'btrfs_' prefix to it and make it return bool instead of int, since all we need is to return true or false. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	0c6f37eaa5	btrfs: split inode extref processing from __add_inode_ref() into a helper The __add_inode_ref() function is quite big and with too much nesting, so move the code that processes inode extrefs into a helper function, to make the function easier to read and reduce the level of indentation too. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	06f77c659e	btrfs: split inode ref processing from __add_inode_ref() into a helper The __add_inode_ref() function is quite big and with too much nesting, so move the code that processes inode refs into a helper function, to make the function easier to read and reduce the level of indentation too. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	98060e1611	btrfs: use btrfs inodes in btrfs_rmdir() to avoid so much usage of BTRFS_I() Almost everywhere we want to use a btrfs inode and therefore we have a lot of calls to BTRFS_I(), making the code more verbose. Instead use btrfs inode local variables to avoid so much use of BTRFS_I(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Filipe Manana	9f82a4ed34	btrfs: use inode already stored in local variable at btrfs_rmdir() There's no need to call d_inode(dentry) when calling btrfs_unlink_inode() since we have already stored that in a local inode variable. So just use the local variable to make the code less verbose. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
David Sterba	44cac52341	btrfs: use our message helpers instead of pr_err/pr_warn/pr_info Our message helpers accept NULL for the fs_info in the context that does not provide and print the common header of the message. The use of pr_* helpers is only for special reasons, like module loading, device scanning or multi-line output (print-tree). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Sun YangKai	27260dd190	btrfs: remove partial support for lowest level from btrfs_search_forward() Commit `323ac95bce` ("Btrfs: don't read leaf blocks containing only checksums during truncate") changed the condition from `level == 0` to `level == path->lowest_level`, while its original purpose was just to do some leaf node handling (calling btrfs_item_key_to_cpu()) and skip some code that doesn't fit leaf nodes. After changing the condition, the code path: 1. Also handles the non-leaf nodes when path->lowest_level is nonzero, which is wrong. However btrfs_search_forward() is never called with a nonzero path->lowest_level, which makes this bug not found before. 2. Makes the later if block with the same condition, which was originally used to handle non-leaf node (calling btrfs_node_key_to_cpu()) when lowest_level is not zero, dead code. Since btrfs_search_forward() is never called for a path with a lowest_level different from zero, just completely remove the partial support for a non-zero lowest_level, simplifying a bit the code, and assert that lowest_level is zero at the start of the function. Suggested-by: Qu Wenruo <wqu@suse.com> Fixes: `323ac95bce` ("Btrfs: don't read leaf blocks containing only checksums during truncate") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:04 +02:00
Qianfeng Rong	c9da22428e	btrfs: use folio_next_index() helper in check_range_has_page() Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index(). Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Sun YangKai	23a6abdada	btrfs: remove unused parameters from btrfs_lookup_inode_extref() The function btrfs_lookup_inode_extref(` no longer requires transaction handle, insert length, or COW flag, as the only caller now performs a read-only lookup using trans == NULL, ins_len == 0 and cow == 0. This function was introduced in the early days where extref feature was introduced by commit `f186373fef` ("btrfs: extended inode refs"). Then some cleanup was done in commit `33b98f2271` ("btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index'"), which removed the only caller passing trans and other COW specific options. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
David Sterba	6be75e891c	btrfs: rename error to ret in device_list_add() Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
David Sterba	6631c67ca1	btrfs: rename error to ret in btrfs_sysfs_add_mounted() Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
David Sterba	6dfe71e6ab	btrfs: rename error to ret in btrfs_sysfs_add_fsid() Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
David Sterba	64b8c3851f	btrfs: rename error to ret in btrfs_mksubvol() Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
David Sterba	bfa13b82cc	btrfs: rename error to ret in btrfs_may_delete() Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com>yy Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Filipe Manana	2abd9e1c58	btrfs: cache if we are using free space bitmaps for a block group Every time we add free space to the free space tree or we remove free space from the free space tree, we do a lookup for the block group's free space info item in the free space tree. This takes time, navigating the btree and we may block either on IO when reading extent buffers from disk or on extent buffer lock contention due to concurrency. Instead of doing this lookup every time, cache the result in the block structure and use it after the first lookup. This adds two boolean members to the block group structure but doesn't increase the structure's size. The following script that runs fs_mark was used to measure the time spent on run_delayed_tree_ref(), since down that call chain we have calls to add and remove free space to/from the free space tree (calls to btrfs_add_to_free_space_tree() and btrfs_remove_from_free_space_tree()): $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt FILES=100000 THREADS=$(nproc --all) echo "performance" \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT This is a heavy metadata test as it's exercising only file creation, so a lot of allocations of metadata extents, creating delayed refs for adding new metadata extents and dropping existing ones due to COW. The results of the times it took to execute run_delayed_tree_ref(), in nanoseconds, are the following. Before this change: Range: 1868.000 - 6482857.000; Mean: 10231.430; Median: 7005.000; Stddev: 27993.173 Percentiles: 90th: 13342.000; 95th: 23279.000; 99th: 82448.000 1868.000 - 4222.038: 270696 ############ 4222.038 - 9541.029: 1201327 ##################################################### 9541.029 - 21559.383: 385436 ################# 21559.383 - 48715.063: 64942 ### 48715.063 - 110073.800: 31454 # 110073.800 - 248714.944: 8218 \| 248714.944 - 561977.042: 1030 \| 561977.042 - 1269798.254: 295 \| 1269798.254 - 2869132.711: 116 \| 2869132.711 - 6482857.000: 28 \| After this change: Range: 1554.000 - 4557014.000; Mean: 9168.164; Median: 6391.000; Stddev: 21467.060 Percentiles: 90th: 12478.000; 95th: 20964.000; 99th: 72234.000 1554.000 - 3453.820: 219004 ############ 3453.820 - 7674.743: 980645 ##################################################### 7674.743 - 17052.574: 552486 ############################## 17052.574 - 37887.762: 68558 #### 37887.762 - 84178.322: 31557 ## 84178.322 - 187024.331: 12102 # 187024.331 - 415522.355: 1364 \| 415522.355 - 923187.626: 256 \| 923187.626 - 2051092.468: 125 \| 2051092.468 - 4557014.000: 21 \| Approximate improvement in the first four buckets is about 20%. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Filipe Manana	fdeffeb4f5	btrfs: add and use helper to determine if using bitmaps in free space tree When adding and removing free space to the free space tree, we need to lookup the respective block group's free info item in the free space tree, check its flags for the BTRFS_FREE_SPACE_USING_BITMAPS bit and then release the path. Move these steps into a helper function and use it in both sites. This will also help avoiding duplicate code in a subsequent change. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Filipe Manana	d1ac35ae2a	btrfs: use fs_info from local variable in btrfs_convert_free_space_to_extents() There's no need to dereference the block group to extract fs_info as we have already stored fs_info in a local variable. So use the local variable instead. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Filipe Manana	497c726ff8	btrfs: avoid double slot decrement at btrfs_convert_free_space_to_extents() There's no need to subtract 1 from path->slots[0] and then decrement the slot, we can reduce the generated assembly code by decrementing the slot and then use it. Module size before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1846220 162998 16136 `2025354` 1ee78a fs/btrfs/btrfs.ko Module size after: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1846204 162998 16136 2025338 1ee77a fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	a8da443c9b	btrfs: turn remove argument of modify_free_space_bitmap() to boolean The argument is used as a boolean, so switch its type from int to bool. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	3887067f55	btrfs: rename free_space_set_bits() and make it less confusing The free_space_set_bits() is used both to set a range of bits or to clear range of bits, depending on the 'bit' argument value. So the name is very misleading since it's not used only to set bits. Furthermore the 'bit' argument is an integer when a boolean is all that is needed plus its name is singular, which gives the idea the function operates on a single bit and not on a range of bits. So rename the function to free_space_modify_bits(), rename the 'bit' argument to 'set_bits' and turn the argument to bool instead of int. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	6fc5ef7829	btrfs: add btrfs prefix to free space tree exported functions A few of the free space tree exported functions have a 'btrfs_' prefix in their name, but most don't. Not only is this inconsistent, the preferred style is to have such a prefix, to avoid potential collisions in the future with other kernel code and offer a somewhat better readibility by making it obvious in calls sites that we are calling btrfs specific code. So add the 'btrfs_' prefix to all free space tree functions that are missing it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	8bfa3727ea	btrfs: remove pointless out label from load_free_space_extents() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	b7db594bc2	btrfs: remove pointless out label from load_free_space_bitmaps() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	5801a749a9	btrfs: remove pointless out label from add_free_space_extent() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	e3ecf6f164	btrfs: remove pointless out label from remove_free_space_extent() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	ffb7068f16	btrfs: remove pointless out label from modify_free_space_bitmap() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	22b609768c	btrfs: make free_space_test_bit() return a boolean instead The function returns the result of another function that returns a boolean (extent_buffer_test_bit()), and all the callers need is a boolean an not an integer. So change its return type from int to bool, and modify the callers to store results in booleans instead of integers, which also makes them simpler. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	790b88c4dd	btrfs: make extent_buffer_test_bit() return a boolean instead All the callers want is to determine if a bit is set and all of them call the function and do a double negation (!!) on its result to get a boolean. So change it to return a boolean and simplify callers. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	e4e5fcbc62	btrfs: remove pointless out label from update_free_space_extent_count() Just return directly, we don't need the label since all we do under it is to return. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
Filipe Manana	61b43a9374	btrfs: remove pointless out label from add_new_free_space_info() We can just return directly if btrfs_insert_empty_item() fails, there is no need to release the path before returning, as all callers (or upper in the call chain) will free the path if they get an error from the call to add_new_free_space_info(), which is also a common pattern everywhere in btrfs. Finally there's no need to set 'ret' to 0 if the call to btrfs_insert_empty_item() didn't fail, since 'ret' is already 0. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	44892c5a3e	btrfs: tree-log: add and rename extent bits for dirty_log_pages tree The dirty_log_pages tree is used for tree logging and marks extents based on log_transid. The bits could be renamed to resemble the LOG1/LOG2 naming used for the BTRFS_FS_LOG1_ERR bits. The DIRTY bit is renamed to LOG1 and NEW to LOG2. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	55cd57faa5	btrfs: use folio_end() where appropriate Simplify folio_pos() + folio_size() and use the new helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	89a3cc19e4	btrfs: add helper folio_end() There are several cases of folio_pos + folio_size, add a convenience helper for that. This is a local helper and not proposed as folio API because it does not seem to be heavily used elsewhere: A quick grep (folio_size + folio_end) in fs/ shows 24 btrfs 4 iomap 4 ext4 2 xfs 2 netfs 1 gfs2 1 f2fs 1 bcachefs 1 buffer.c Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	d549391fc6	btrfs: rename variables for locked range in defrag_prepare_one_folio() In preparation to use a helper for folio_pos + folio_size, rename the variables for the locked range so they don't use the 'folio_' prefix. As the locking ranges take inclusive end of the range (hence the "-1") this would be confusing as the folio helpers typically use exclusive end of the range. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	e47c8a4767	btrfs: simplify range end calculations in truncate_block_zero_beyond_eof() The way zero_end is calculated and used does a -1 and +1 that effectively cancel out, so this can be simplified. This is also preparatory patch for using a helper for folio_pos + folio_size with the semantics of exclusive end of the range. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
Filipe Manana	bdd01fb036	btrfs: check BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE at __add_block_group_free_space() Every caller of __add_block_group_free_space() is checking if the flag BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is set before calling it. This is duplicate code and it's prone to some mistake in case we add more callers in the future. So move the check for that flag into the start of __add_block_group_free_space(), and, as a consequence, the path allocation from add_block_group_free_space() is moved into __add_block_group_free_space(), to preserve the behaviour of allocating a path only if the flag BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is set. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
Filipe Manana	1f06c942aa	btrfs: always abort transaction on failure to add block group to free space tree Only one of the callers of __add_block_group_free_space() aborts the transaction if the call fails, while the others don't do it and it's either never done up the call chain or much higher in the call chain. So make sure we abort the transaction at __add_block_group_free_space() if it fails, which brings a couple benefits: 1) If some call chain never aborts the transaction, we avoid having some metadata inconsistency because BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is cleared when we enter __add_block_group_free_space() and therefore __add_block_group_free_space() is never called again to add the block group items to the free space tree, since the function is only called when that flag is set in a block group; 2) If the call chain already aborts the transaction, then we get a better trace that points to the exact step from __add_block_group_free_space() which failed, which is better for analysis. So abort the transaction at __add_block_group_free_space() if any of its steps fails. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:57:56 +02:00
Qu Wenruo	936f0b49dc	btrfs: add extra warning when qgroup is marked inconsistent Unlike qgroup rescan, which always shows whether it cleared the inconsistent flag, we do not have a proper way to show if qgroup is marked inconsistent. This was not a big deal before as there aren't that many locations that can mark qgroup inconsistent. But with the introduction of drop_subtree_threshold, qgroup can be marked inconsistent very frequently, especially when dropping subvolumes. Although most user space tools relying on qgroup should do their own checks and queue a rescan if needed, we have no idea when qgroup is marked inconsistent, and this would be much harder to debug. So this patch will add an extra warning (btrfs_warn_rl()) when the qgroup flag is flipped into inconsistent for the first time. And add extra reason why qgroup flips inconsistent. This means we can move the error message immediately before qgroup_inconsistent_warning() into that function. For call sites without an obvious reason, or is a shared error handling, output the function that failed and the error code instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:39 +02:00
David Sterba	b37532bffd	btrfs: merge btrfs_printk_ratelimited() and its only caller There's only one caller of btrfs_printk_ratelimited(), merge it there. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:39 +02:00
David Sterba	2f3f1ad7f1	btrfs: simplify debug print helpers without enabled printk The btrfs_debug() helpers depend on various config options. In case of "no_printk" we don't need to define a special helper that in the end does nothing but validates the parameters. As the default build config is to validate the parameters we can simplify it to let the debug helpers expand to nothing and remove btrfs_no_printk_in_rcu(). To avoid warnings use fs_info and inline one variable in extent_from_logical(). Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:39 +02:00
David Sterba	f9095103f2	btrfs: remove remaining unused message helpers Remove the critical level message helpers as they're not used, the RCU protection is provided by the plain helpers. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:39 +02:00
David Sterba	80f4fab544	btrfs: switch RCU helper versions to btrfs_debug() The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	2eac2ae8b2	btrfs: switch RCU helper versions to btrfs_info() The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	0fe04bf132	btrfs: switch RCU helper versions to btrfs_warn() The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	9db18fe3ac	btrfs: switch RCU helper versions to btrfs_err() The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	0e26727a73	btrfs: switch all message helpers to be RCU safe We have two versions of message helpers, one that provides RCU protection around the call in case we need to dereference device name. As messages are not performance critical we can set up the RCU protection for all of them and drop the distinction for those where device name is needed. This will lead to further simplifications. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	4d4b489ef1	btrfs: remove unused levels of message helpers We're using the following levels: crit, err, warn, info, debug. This covers our needs and further specializations is not needed, so let's remove emerg, alert and notice. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	ee3af49a05	btrfs: remove unused rcu-string printk helpers The RCU-string API has never taken off and we don't use the printk helpers provided as we do the protection in our helpers. Remove the "in RCU" wrappers. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	d1d1c85427	btrfs: open code rcu_string_free() and remove it The helper is trivial and we can simply use kfree_rcu() if needed. In our case it's just one place where we rename a device in device_list_add() and the old name can still be used until the end of the RCU grace period. The other case is freeing a device and there nothing should reach the device, so we can use plain kfree(). Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
Johannes Thumshirn	694ce5e143	btrfs: zoned: reserve data_reloc block group on mount Create a block group dedicated for data relocation on mount of a zoned filesystem. If there is already more than one empty DATA block group on mount, this one is picked for the data relocation block group, instead of a newly created one. This is done to ensure, there is always space for performing garbage collection and the filesystem is not hitting ENOSPC under heavy overwrite workloads. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:31 +02:00
David Sterba	f1f22dfbea	btrfs: use btrfs_root_id() where not done yet A few more remaining cases where we can use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
David Sterba	918fb77073	btrfs: use btrfs_is_data_reloc_root() where not done yet Two remaining cases where we can use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
David Sterba	c6aeae86b9	btrfs: use on-stack variable for block reserve in btrfs_replace_file_extents() We can avoid potential memory allocation failure in btrfs_replace_file_extents() as the block reserve lifetime is limited to the scope of the function. This requires +48 bytes on stack. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
David Sterba	7ce22f62b2	btrfs: use on-stack variable for block reserve in btrfs_truncate() We can avoid potential memory allocation failure in btrfs_truncate() as the block reserve lifetime is limited to the scope of the function. This requires +48 bytes on stack. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
David Sterba	ec41c34547	btrfs: use on-stack variable for block reserve in btrfs_evict_inode() We can avoid potential memory allocation failure in btrfs_evict_inode() as the block reserve lifetime is limited to the scope of the function. This requires +48 bytes on stack. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
Sun YangKai	8811ace439	btrfs: update comment for xarray fields in struct btrfs_root The inode_lock field of struct btrfs_root was removed in commit e2844cce75c9e61("btrfs: remove inode_lock from struct btrfs_root and use xarray locks") but the related comment haven't been updated. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
Qu Wenruo	cc38d178ff	btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL With all the preparation patches already merged, it's pretty easy to enable large data folios: - Remove the ASSERT() on folio size in btrfs_end_repair_bio() - Add a helper to properly set the max folio order Currently due to several call sites that are fetching the bitmap content directly into an unsigned long, we can only support BITS_PER_LONG blocks for each bitmap. - Call the helper when reading/creating an inode The support has the following limitations: - No large folios for data reloc inode The relocation code still requires page sized folio. But it's not that hot nor common compared to regular buffered ios. Will be improved in the future. - Requires CONFIG_BTRFS_EXPERIMENTAL - Will require all folio related operations to check if it needs the extra btrfs_subpage structure Now any folio larger than block size will need btrfs_subpage structure handling. Unfortunately I do not have a physical machine for performance test, but if everything goes like XFS/EXT4, it should mostly bring single digits percentage performance improvement in the real world. Although I believe there are still quite some optimizations to be done, let's focus on testing the current large data folio support first. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	b769777d92	btrfs: use refcount_t type for the extent buffer reference counter Instead of using a bare atomic, use the refcount_t type, which despite being a structure that contains only an atomic, has an API that checks for underflows and other hazards. This doesn't change the size of the extent_buffer structure. This removes the need to do things like this: WARN_ON(atomic_read(&eb->refs) == 0); if (atomic_dec_and_test(&eb->refs)) { (...) } And do just: if (refcount_dec_and_test(&eb->refs)) { (...) } Since refcount_dec_and_test() already triggers a warning when we decrement a ref count that has a value of 0 (or below zero). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	2697b61597	btrfs: add comment for optimization in free_extent_buffer() There's this special atomic compare and exchange logic which serves to avoid locking the extent buffers refs_lock spinlock and therefore reduce lock contention, so add a comment to make it more obvious. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	71c086b30d	btrfs: reorganize logic at free_extent_buffer() for better readability It's hard to read the logic to break out of the while loop since it's a very long expression consisting of a logical or of two composite expressions, each one composed by a logical and. Further each one is also testing for the EXTENT_BUFFER_UNMAPPED bit, making it more verbose than necessary. So change from this: if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3) \|\| (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs == 1)) break; To this: if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) { if (refs == 1) break; } else if (refs <= 3) { break; } At least on x86_64 using gcc 9.3.0, this doesn't change the object size. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	41e4ea0bf5	btrfs: make btrfs_readdir_delayed_dir_index() return a bool instead There's no need to return errors, all we do is return 1 or 0 depending on whether we should or should not stop iterating over delayed dir indexes. So change the function to return bool instead of an int. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	4106eb9bda	btrfs: make btrfs_should_delete_dir_index() return a bool instead There's no need to return errors and we currently return 1 in case the index should be deleted and 0 otherwise, so change the return type from int to bool as this is a boolean function and it's more clear this way. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	adc1ef55dc	btrfs: add details to error messages at btrfs_delete_delayed_dir_index() Update the error messages with: 1) Fix typo in the first one, deltiona -> deletion; 2) Remove redundant part of the first message, the part following the comma, and including all the useful information: root, inode, index and error value; 3) Update the second message to use more formal language (example 'error' instead of 'err'), , remove redundant part about "deletion tree of delayed node..." and print the relevant information in the same format and order as the first message, without the ugly opening parenthesis without a space separating from the previous word. This also makes the message similar in format to the one we have at btrfs_insert_delayed_dir_index(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	0187acef35	btrfs: make btrfs_delete_delayed_insertion_item() return a boolean There's no need to return an integer as all we need to do is return true or false to tell whether we deleted a delayed item or not. Also the logic is inverted since we return 1 (true) if we didn't delete and 0 (false) if we did, which is somewhat counter intuitive. Change the return type to a boolean and make it return true if we deleted and false otherwise. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	7077d7b872	btrfs: switch del_all argument of replay_dir_deletes() from int to bool The argument has boolean semantics, so change its type from int to bool, making it more clear. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	5f8882c854	btrfs: pass NULL index to btrfs_del_inode_ref() where not needed There are two callers of btrfs_del_inode_ref() that declare a local index variable and then pass a pointer for it to btrfs_del_inode_ref(), but then don't use that index at all. Since btrfs_del_inode_ref() accepts a NULL index pointer, pass NULL instead and stop declaring those useless index variables. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	93612a92ba	btrfs: allocate scratch eb earlier at btrfs_log_new_name() Instead of allocating the scratch eb after joining the log transaction, allocate it before so that we're not delaying log commits for longer than necessary, as allocating the scratch eb means allocating an extent_buffer structure, which comes from a dedicated kmem_cache, plus pages/folios to attach to the eb. Both of these allocations may take time when we're under memory pressure. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	841324a8e6	btrfs: allocate path earlier at btrfs_log_new_name() Instead of allocating the path after joining the log transaction, allocate it before so that we're not delaying log commits for the rare cases where the allocation takes a significant time (under memory pressure and all slabs are full, there's the need to allocate a new page, etc). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	b32efae7b8	btrfs: allocate path earlier at btrfs_del_dir_entries_in_log() Instead of allocating the path after joining the log transaction, allocate it before so that we're not delaying log commits for the rare cases where the allocation takes a significant time (under memory pressure and all slabs are full, there's the need to allocate a new page, etc). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	181436a85b	btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log() We are supposed to be able to join a log transaction at that point, since we have determined that the inode was logged in the current transaction with the call to inode_logged(). So ASSERT() we joined a log transaction and also warn if we didn't in case assertions are disabled (the kernel config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed and reported if it ever happens. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	1ed0cfc89e	btrfs: use btrfs_del_item() at del_logged_dentry() There's no need to use btrfs_delete_one_dir_name() at del_logged_dentry() because we are processing a dir index key which can contain only a single name, unlike dir item keys which can encode multiple names in case of name hash collisions. We have explicitly looked up for a dir index key by calling btrfs_lookup_dir_index_item() and we don't log dir item keys anymore (since commit `339d035424` ("btrfs: only copy dir index keys when logging a directory")). So simplify and use btrfs_del_item() directly instead of btrfs_delete_one_dir_name(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	0ef4c6120e	btrfs: free path sooner at __btrfs_unlink_inode() After calling btrfs_delete_one_dir_name() there's no need for the path anymore so we can free it immediately after. After that point we do some btree operations that take time and in those call chains we end up allocating paths for these operations, so we're unnecessarily holding on to the path we allocated early at __btrfs_unlink_inode(). So free the path as soon as we don't need it and add a comment. This also allows to simplify the error path, removing two exit labels and returning directly when an error happens. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	d94edb0d7e	btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log() We are supposed to be able to join a log transaction at that point, since we have determined that the inode was logged in the current transaction with the call to inode_logged(). So ASSERT() we joined a log transaction and also warn if we didn't in case assertions are disabled (the kernel config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed and reported if it ever happens. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Al Viro	75764b41bf	btrfs: open code fc_mount() to avoid releasing s_umount rw_sempahore [CURRENT BEHAVIOR] Currently inside btrfs_get_tree_subvol(), we call fc_mount() to grab a tree, then re-lock s_umount inside btrfs_reconfigure_for_mount() to avoid race with remount. However fc_mount() itself is just doing two things: 1. Call vfs_get_tree() 2. Release s_umount then call vfs_create_mount() [ENHANCEMENT] Instead of calling fc_mount(), we can open-code it with vfs_get_tree() first. This provides a benefit that, since we have the full control of s_umount, we do not need to re-lock that rw_sempahore when calling btrfs_reconfigure_for_mount(), meaning less race between RO/RW remount. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Rework the subject and commit message, refactor the error handling ] Signed-off-by: Qu Wenruo <wqu@suse.com> Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
David Sterba	4013cde56e	btrfs: rename err to ret in scrub_submit_extent_sector_read() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
David Sterba	56ccdd9af2	btrfs: rename err to ret in btrfs_create_common() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
David Sterba	7d13ea864e	btrfs: rename err to ret in btrfs_wait_tree_log_extents() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
David Sterba	0b2cd9e2c7	btrfs: rename err to ret in btrfs_wait_extents() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	69c5c6130d	btrfs: rename err to ret in quota_override_store() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	148961dac3	btrfs: rename err to ret in btrfs_fill_super() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	60a8bab08c	btrfs: rename err to ret in calc_pct_ratio() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	3b5742f379	btrfs: rename err to ret in btrfs_symlink() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	af6f6c3af7	btrfs: rename err to ret in btrfs_link() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	9cf280e2bd	btrfs: rename err to ret in btrfs_setattr() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	b71a348513	btrfs: rename err to ret in btrfs_init_inode_security() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	d64ef1d23f	btrfs: rename err to ret in btrfs_alloc_from_bitmap() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	8d9e877919	btrfs: rename err to ret in btrfs_lock_extent_bits() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	886240cbcd	btrfs: rename err to ret in btrfs_try_lock_extent_bits() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	986b6aa185	btrfs: rename err to ret2 in btrfs_truncate_inode_items() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	a579ddca43	btrfs: rename err to ret2 in btrfs_add_link() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	8f38507068	btrfs: rename err to ret2 in btrfs_setsize() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	df20be9f02	btrfs: rename err to ret2 in btrfs_search_old_slot() Unify naming of return value to the preferred way, move the variable to the closest scope. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	644dcb4316	btrfs: rename err to ret2 in btrfs_search_slot() Unify naming of return value to the preferred way, move the variable to the closest scope. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	56fc5b18c9	btrfs: rename err to ret2 in search_leaf() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	58019c1dd4	btrfs: rename err to ret2 in read_block_for_search() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	66ca7ea650	btrfs: rename err to ret2 in resolve_indirect_refs() Unify naming of return value to the preferred way, move the variable to the closest scope. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
Qu Wenruo	582cd4bad4	btrfs: rename btrfs_subpage structure With the incoming large data folios support, the structure name btrfs_subpage is no longer correct, as for we can have multiple blocks inside a large folio, and the block size is still page size. So to follow the schema of iomap, rename btrfs_subpage to btrfs_folio_state, along with involved enums. There are still exported functions with "btrfs_subpage_" prefix, and I believe for metadata the name "subpage" will stay forever as we will never allocate a folio larger than nodesize anyway. The full cleanup of the word "subpage" will happen in much smaller steps in the future. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
Qu Wenruo	1e17738d6b	btrfs: add comments on the extra btrfs specific subpage bitmaps Unlike the iomap_folio_state structure, the btrfs_subpage structure has a lot of extra sub-bitmaps, namely: - writeback sub-bitmap - locked sub-bitmap iomap_folio_state uses an atomic for writeback tracking, while it has no per-block locked tracking. This is because iomap always locks a single folio, and submits dirty blocks with that folio locked. But btrfs has async delalloc ranges (for compression), which are queued with their range locked, until the compression is done, then marks the involved range writeback and unlocked. This means a range can be unlocked and marked writeback at seemingly random timing, thus it needs the extra tracking. This needs a huge rework on the lifespan of async delalloc range before we can remove/simplify these two sub-bitmaps. - ordered sub-bitmap - checked sub-bitmap These are for COW-fixup, but as I mentioned in the past, the COW-fixup is not really needed anymore and these two flags are already marked deprecated, and will be removed in the near future after comprehensive tests. Add related comments to indicate we're actively trying to align the sub-bitmaps to the iomap ones. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
Daniel Vacek	3f093ccb95	btrfs: harden parsing of compression mount options Btrfs happily but incorrectly accepts the `-o compress=zlib+foo` and similar options with any random suffix. Fix that by explicitly checking the end of the strings. Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Daniel Vacek	3f0e865ae6	btrfs: factor out compression mount options parsing There are many options making the parsing a bit lengthy. Factor the compress options out into a helper function. The next patch is going to harden this function. Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
David Sterba	ccb42a6eed	btrfs: constify more pointer parameters Another batch of pointer parameter constifications. This is for clarity and minor addition to type safety. There are no observable effects in the assembly code and .ko measured on release config. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Boris Burkov	c7f04fbc98	btrfs: sysfs: track current commit duration in commit_stats When debugging/detecting outlier commit stalls, having an indicator that we are currently in a long commit critical section can be very useful. Extend the commit_stats sysfs file to also include the current commit critical section duration. Since this requires storing the last commit start time, use that rather than a separate stack variable for storing the finished commit durations as well. This also requires slightly moving up the timing of the stats updating to inside the critical section to avoid the transaction T+1 setting the critical_section_start_time to 0 before transaction T can update its stats, which would trigger the new ASSERT. This is an improvement in and of itself, as it makes the stats more accurately represent the true critical section time. It may be yet better to pull the stats up to where start_transaction gets unblocked, rather than the next commit, but this seems like a good enough place as well. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Pan Chuang	46d549928c	btrfs: use rb_find_add() in rb_simple_insert() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Pan Chuang	c52ea14d05	btrfs: pass struct rb_simple_node pointer directly in rb_simple_insert() Replace struct embedding with union to enable safe type conversion in btrfs_backref_node, tree_block and mapping_node. Adjust function calls to use the new unified API, eliminating redundant parameters. Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	fbec9a5d3e	btrfs: use rb_find_add() in btrfs_qgroup_add_swapped_blocks() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	844e5f902d	btrfs: use rb_find() in btrfs_qgroup_trace_subtree_after_cow() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	e3def6ce67	btrfs: use rb_find_add() in add_qgroup_rb() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	1e0f0239a3	btrfs: use rb_find() in find_qgroup_rb() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	287480e269	btrfs: use rb_find_add() in insert_ref_entry() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	c6e3ae8ac3	btrfs: use rb_find_add() in insert_root_entry() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	afaa9f8235	btrfs: use rb_find() in lookup_root_entry() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	3f60f4374a	btrfs: use rb_find_add() in insert_block_entry() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	4044a7ed3b	btrfs: use rb_find() in lookup_block_entry() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	b017a92bd9	btrfs: use rb_find_add() in ulist_rbtree_insert() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	c4f38e7ca5	btrfs: use rb_find() in ulist_rbtree_search() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	9734685854	btrfs: use rb_find() in __btrfs_lookup_delayed_item() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	7a91e01875	btrfs: use rb_find_add() in btrfs_insert_inode_defrag() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Dan Johnson	06c3437f74	btrfs: fix comment in reserved space warning mkfs.btrfs up to v4.14 actually can leave a chunk inside the reserved space when invoked with `-m single`, fixed by 997f9977c24397eb6980bb9 ("mkfs: Prevent temporary system chunk to use space in reserved 1M range") released with v4.15. Signed-off-by: Dan Johnson <ComputerDruid@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Daniel Vacek	d8f6cb2b28	btrfs: relocation: simplify unused logic related to LINK_LOWER btrfs_backref_link_edge() is always called with the LINK_LOWER argument. We can simplify it and remove the LINK_LOWER and LINK_UPPER macros completely. The last call with LINK_UPPER was removed with commit `0097422c0d` ("btrfs: remove clone_backref_node() from relocation"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Filipe Manana	593062f67b	btrfs: unfold transaction abort at btrfs_insert_one_raid_extent() We have a common error path where we abort the transaction, but like this in case we get a transaction abort stack trace we don't know exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:24 +02:00
Filipe Manana	35bb03e57a	btrfs: unfold transaction abort at __btrfs_update_delayed_inode() We have a common error path where we abort the transaction, but like this in case we get a transaction abort stack trace we don't know exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:24 +02:00
Filipe Manana	33e8f24b52	btrfs: abort transaction on unexpected eb generation at btrfs_copy_root() If we find an unexpected generation for the extent buffer we are cloning at btrfs_copy_root(), we just WARN_ON() and don't error out and abort the transaction, meaning we allow to persist metadata with an unexpected generation. Instead of warning only, abort the transaction and return -EUCLEAN. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:16 +02:00
Filipe Manana	273bbb5b48	btrfs: unfold transaction abort at btrfs_copy_root() Instead of having a common btrfs_abort_transaction() call for when any of the two btrfs_inc_ref() calls fail, move the btrfs_abort_transaction() to happen immediately after each one of the calls, so that when analyzing a stack trace with a transaction abort we know which call failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:35 +02:00
David Sterba	b63c8c1ede	btrfs: move transaction aborts to the error site in add_block_group_free_space() Transaction aborts should be done next to the place the error happens, which was not done in add_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
David Sterba	0b10f3dd13	btrfs: move transaction aborts to the error site in remove_block_group_free_space() Transaction aborts should be done next to the place the error happens, which was not done in remove_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	81bfd9d547	btrfs: simplify error detection flow during log replay We have this fuzzy logic at btrfs_recover_log_trees() where we don't abort the transaction and exit immediately after each function call that returned an error, and instead have if-then-else logic or check if the previous function call returned success before calling the next function. Make the flow more straightforward by immediately aborting the transaction and exiting after each function call failure. This also allows to avoid two consecutive if statements that test the same conditions: if (!ret && wc.stage == LOG_WALK_REPLAY_ALL) { (...) } Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	6466084df6	btrfs: remove redundant path release when replaying a log tree There's no need to call btrfs_release_path() before calling btrfs_init_root_free_objectid() as we have released the path already at the top of the loop and the previous call to fixup_inode_link_counts() also releases the path. So remove it to simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	2a5898c4aa	btrfs: abort transaction during log replay if walk_log_tree() failed If we failed walking a log tree during replay, we have a missing transaction abort to prevent committing a transaction where we didn't fully replay all the changes from a log tree and therefore can leave the respective subvolume tree in some inconsistent state. So add the missing transaction abort. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:28 +02:00
Filipe Manana	8f1e1b263d	btrfs: unfold transaction aborts when replaying log trees We have a single line doing a transaction abort in case either we got an error from btrfs_get_fs_root() different from -ENOENT or we got an error from btrfs_pin_extent_for_log_replay(), making it hard to figure out which function call failed when looking at a transaction abort massages and stack trace in dmesg. Change this to have an explicit transaction abort for each one of the two cases. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:44:11 +02:00
Johannes Thumshirn	2a946bf6d6	btrfs: make btrfs_should_periodic_reclaim() static btrfs_should_periodic_reclaim() is not used outside of space-info.c so make it static and remove the prototype from space-info.h. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:44:11 +02:00
Johannes Thumshirn	55f7c65b2f	btrfs: zoned: use filesystem size not disk size for reclaim decision When deciding if a zoned filesystem is reaching the threshold to reclaim data block groups, look at the size of the filesystem not to potentially total available size of all drives in the filesystem. Especially if a filesystem was created with mkfs' -b option, constraining it to only a portion of the block device, the numbers won't match and potentially garbage collection is kicking in too late. Fixes: `3687fcb075` ("btrfs: zoned: make auto-reclaim less aggressive") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:42:24 +02:00
Filipe Manana	f2de2b9ffd	btrfs: unfold transaction abort at clone_copy_inline_extent() We have a common error path where we abort the transaction, but like this in case we get a transaction abort stack trace we don't know exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:06 +02:00
Filipe Manana	5ff6050fcd	btrfs: remove pointless 'out' label from clone_finish_inode_update() The label is only used once and we can instead return directly where it's used, besides the fact that all we do under the label is to return the value of 'ret'. So get rid of the label and return directly. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:06 +02:00
Filipe Manana	5cf0e668ea	btrfs: unfold transaction abort at walk_up_proc() Instead of having a common btrfs_abort_transaction() call for when any of the two btrfs_dec_ref() calls fail, move the btrfs_abort_transaction() to happen immediately after each one of the calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Filipe Manana	227aa55fa2	btrfs: unfold transaction abort at __btrfs_inc_extent_ref() Instead of having a common btrfs_abort_transaction() call for when either insert_tree_block_ref() failed or when insert_extent_data_ref() failed, move the btrfs_abort_transaction() to happen immediately after each one of those calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Filipe Manana	3f757b56f1	btrfs: unfold transaction aborts at btrfs_create_new_inode() Instead of having a common btrfs_abort_transaction() call for when either btrfs_orphan_add() failed or when btrfs_add_link() failed, move the btrfs_abort_transaction() to happen immediately after each one of those calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Caleb Sander Mateos	9aad72b4e3	btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd btrfs is the only user of struct io_uring_cmd_data and its op_data field. Switch its ->uring_cmd() implementations to store the struct btrfs_uring_encoded_data * in the struct io_btrfs_cmd, overlayed with io_uring_cmd's pdu field. This avoids having to touch another cache line to access the struct btrfs_uring_encoded_data *, and allows op_data and struct io_uring_cmd_data to be removed. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-18 12:34:56 -06:00
Christian Brauner	ca115d7e75	tree-wide: s/struct fileattr/struct file_kattr/g Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-04 16:14:39 +02:00
Linus Torvalds	4c06e63b92	for-6.16-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhmwFQACgkQxWXV+ddt WDsVMA/+NuSth71V0AfiDnFyqjgDMqIlZL2+dqBiTYHXQQHKbqiUlKvYkWICCT6T 1YgDV+95XJYy4TDBoA49Ndd/l+CiDcMLbOYeneIfbJy13ts84jVANPkl4n03gPkF ktibCw15h0MENVctTCPc71dX2X0cV9WPf4iDmoxUZiukDA376akGTArZKwH4tVVg 4qVpzUtDdNOf848D+8DZKGd+ot/RWgEdLkFCZES27BMg/OFemxBK1MU6K8VjxiKF VoaSVJRDXuug8oVBAGNl86XpiSgd4gHyoNNA5b4mhdSWMSBMxUAaILsONT9pNQZA CFyHA1Jp2gLOIzQIzeXwWgXaAOQDtco8YWYaXhf0v0mySs89tweXjOibfj2mU9pS wPaJyeD+nyRDMwPa4VWEws64D3vXX6aKwiThUENuDmxBvrRXjrkGYH9tf0LNzDDe OKv/vOCfeyutxbjKhP+qElMhdh73BZnJ4UCxxYRRDq2v1Mg+k06swl+6uL6xenme a2KLJlwEoG6LAlkpZzV66ZEaIHDyGBZNdVYtuA/G3dDtmlt0aLXDdp1eq7NivS1j aV7cd0JMX89lAUtqKT932ZOw8RoDrUPPjsnXzCaZJ69mMVyEkxyCV+iYHTTJPDga W5Vg8Tq3d1gwxMebZHvyI6wwUhmGA0wUFG2eohYY/tcSrrUlrHQ= =Ke0p -----END PGP SIGNATURE----- Merge tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - tree-log fixes: - fixes of log tracking of directories and subvolumes - fix iteration and error handling of inode references during log replay - fix free space tree rebuild (reported by syzbot) * tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use btrfs_record_snapshot_destroy() during rmdir btrfs: propagate last_unlink_trans earlier when doing a rmdir btrfs: record new subvolume in parent dir earlier to avoid dir logging races btrfs: fix inode lookup error handling during log replay btrfs: fix iteration of extrefs during log replay btrfs: fix missing error handling when searching for inode refs during log replay btrfs: fix failure to rebuild free space tree using multiple transactions	2025-07-03 13:29:56 -07:00
Eric Biggers	2c7528d36e	btrfs: stop parsing crc32c driver name To determine whether the crc32c implementation is "fast", use crc32_optimizations() instead of parsing the crypto_shash driver name. This keeps the code working as intended after the driver name is changed by the next commit. Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250613183753.31864-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-06-30 09:31:56 -07:00
Filipe Manana	157501b046	btrfs: use btrfs_record_snapshot_destroy() during rmdir We are setting the parent directory's last_unlink_trans directly which may result in a concurrent task starting to log the directory not see the update and therefore can log the directory after we removed a child directory which had a snapshot within instead of falling back to a transaction commit. Replaying such a log tree would result in a mount failure since we can't currently delete snapshots (and subvolumes) during log replay. This is the type of failure described in commit `1ec9a1ae1e` ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). Fix this by using btrfs_record_snapshot_destroy() which updates the last_unlink_trans field while holding the inode's log_mutex lock. Fixes: `44f714dae5` ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:58:12 +02:00
Filipe Manana	c466e33e72	btrfs: propagate last_unlink_trans earlier when doing a rmdir In case the removed directory had a snapshot that was deleted, we are propagating its inode's last_unlink_trans to the parent directory after we removed the entry from the parent directory. This leaves a small race window where someone can log the parent directory after we removed the entry and before we updated last_unlink_trans, and as a result if we ever try to replay such a log tree, we will fail since we will attempt to remove a snapshot during log replay, which is currently not possible and results in the log replay (and mount) to fail. This is the type of failure described in commit `1ec9a1ae1e` ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). So fix this by propagating the last_unlink_trans to the parent directory before we remove the entry from it. Fixes: `44f714dae5` ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:47 +02:00
Filipe Manana	bf5bcf9a6f	btrfs: record new subvolume in parent dir earlier to avoid dir logging races Instead of recording that a new subvolume was created in a directory after we add the entry do the directory, record it before adding the entry. This is to avoid races where after creating the entry and before recording the new subvolume in the directory (the call to btrfs_record_new_subvolume()), another task logs the directory, so we end up with a log tree where we logged a directory that has an entry pointing to a root that was not yet committed, resulting in an invalid entry if the log is persisted and replayed later due to a power failure or crash. Also state this requirement in the function comment for btrfs_record_new_subvolume(), similar to what we do for the btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy(). Fixes: `45c4102f0d` ("btrfs: avoid transaction commit on any fsync after subvolume creation") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:24 +02:00
Filipe Manana	5f61b96159	btrfs: fix inode lookup error handling during log replay When replaying log trees we use read_one_inode() to get an inode, which is just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for btrfs_iget(). But read_one_inode() always returns NULL for any error that btrfs_iget_logging() / btrfs_iget() may return and this is a problem because: 1) In many callers of read_one_inode() we convert the NULL into -EIO, which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT for example, besides -EIO and other errors. So during log replay we may end up reporting a false -EIO, which is confusing since we may not have had any IO error at all; 2) When replaying directory deletes, at replay_dir_deletes(), we assume the NULL returned from read_one_inode() means that the inode doesn't exist and then proceed as if no error had happened. This is wrong because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an actual error and the target inode may exist in the target subvolume root - this may later result in the log replay code failing at a later stage (if we are "lucky") or succeed but leaving some inconsistency in the filesystem. So fix this by not ignoring errors from btrfs_iget_logging() and as a consequence remove the read_one_inode() wrapper and just use btrfs_iget_logging() directly. Also since btrfs_iget_logging() is supposed to be called only against subvolume roots, just like read_one_inode() which had a comment about it, add an assertion to btrfs_iget_logging() to check that the target root corresponds to a subvolume root. Fixes: `5d4f98a28c` ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:06 +02:00
Filipe Manana	54a7081ed1	btrfs: fix iteration of extrefs during log replay At __inode_add_ref() when processing extrefs, if we jump into the next label we have an undefined value of victim_name.len, since we haven't initialized it before we did the goto. This results in an invalid memory access in the next iteration of the loop since victim_name.len was not initialized to the length of the name of the current extref. Fix this by initializing victim_name.len with the current extref's name length. Fixes: `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:55 +02:00
Filipe Manana	6561a40cec	btrfs: fix missing error handling when searching for inode refs during log replay During log replay, at __add_inode_ref(), when we are searching for inode ref keys we totally ignore if btrfs_search_slot() returns an error. This may make a log replay succeed when there was an actual error and leave some metadata inconsistency in a subvolume tree. Fix this by checking if an error was returned from btrfs_search_slot() and if so, return it to the caller. Fixes: `e02119d5a7` ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:35 +02:00
Filipe Manana	1e6ed33cab	btrfs: fix failure to rebuild free space tree using multiple transactions If we are rebuilding a free space tree, while modifying the free space tree we may need to allocate a new metadata block group. If we end up using multiple transactions for the rebuild, when we call btrfs_end_transaction() we enter btrfs_create_pending_block_groups() which calls add_block_group_free_space() to add items to the free space tree for the block group. Then later during the free space tree rebuild, at btrfs_rebuild_free_space_tree(), we may find such new block groups and call populate_free_space_tree() for them, which fails with -EEXIST because there are already items in the free space tree. Then we abort the transaction with -EEXIST at btrfs_rebuild_free_space_tree(). Notice that we say "may find" the new block groups because a new block group may be inserted in the block groups rbtree, which is being iterated by the rebuild process, before or after the current node where the rebuild process is currently at. Syzbot recently reported such case which produces a trace like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 Modules linked in: CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 sp : ffff80009c4f7740 x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000 x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003 x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8 x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276 x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001 x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000 x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00 x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001 Call trace: btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P) btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 irq event stamp: 330 hardirqs last enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline] hardirqs last enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130 hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511 softirqs last enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32 softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19 ---[ end trace 0000000000000000 ]--- Fix this by flagging new block groups which had their free space tree entries already added and then skip them in the rebuild process. Also, since the rebuild may be triggered when doing a remount, make sure that when we clear an existing free space tree that we clear such flag from every existing block group, otherwise we would skip those block groups during the rebuild. Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/ Fixes: `882af9f13e` ("btrfs: handle free space tree rebuild in multiple transactions") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:15 +02:00
Linus Torvalds	5ca7fe213b	for-6.16-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhZQ9MACgkQxWXV+ddt WDsyvw/+K5N4zbig9D5QL5SdsQwMe/ZUk1KF0LLu6H3hFetdICeM/Z4K46EBh40X c9Sxb13gLnIAm8DR/IFTTlOZVrrbJ3CTazZuJbncCpaZchH863aYb/1KboxjJnpW KqOen20KdUh8HdevrJFhkFc7rOjp7KupfIHsbWqIxaWYPf8ORvUyK55lKxQz0HES E5tFXLNr6z/8Ws5pc2HnRLgnRcCHuRUNJUb1PEaTfPKxoFvTwjda6cDsYnXOJEO9 NOnh6lluurqja+3FUEFig2f292/CbKGtByYUDgfhHO21P//IHSDhlouvwipzI/kh 6WUoH1K+DWCxxNbIVFFbUYLxrDGu7R7/aWFHH2q0dNjqQeiQBbUnbn4WIjAAwDWf k9cmE+WgVqwQI+vpfG3eENUafG5MpcQQo2wKrxG0whWaC2fiA6QtI+3DfKyMj4XJ JI1jUhfCwHrqzoGQ4XBE3UYENqQw9RICNC+Z3UfZx+5sQMWcb+ac5qIGygvCfU8N Gtfx4ladZshpQUSuRneiLozxdxLyXX3LzCt2Ls1s5fPPikZft/+2QRu5rzSbb/Cp 50TDSn/pE1N/TEMVZaP5M2PxquBVDOZ4TFSsSm3IvceqFInm0UerAGaJ7+T2eZhM 3XHhIp6xTecHfwukvGqs+XSxB9PMLfF5M0gc+9PR+3oxzFRpowI= =XLWR -----END PGP SIGNATURE----- Merge tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes: - fix invalid inode pointer dereferences during log replay - fix a race between renames and directory logging - fix shutting down delayed iput worker - fix device byte accounting when dropping chunk - in zoned mode, fix offset calculations for DUP profile when conventional and sequential zones are used together Regression fixes: - fix possible double unlock of extent buffer tree (xarray conversion) - in zoned mode, fix extent buffer refcount when writing out extents (xarray conversion) Error handling fixes and updates: - handle unexpected extent type when replaying log - check and warn if there are remaining delayed inodes when putting a root - fix assertion when building free space tree - handle csum tree error with mount option 'rescue=ibadroot' Other: - error message updates: add prefix to all scrub related messages, include other information in messages" * tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix alloc_offset calculation for partly conventional block groups btrfs: handle csum tree error with rescue=ibadroots correctly btrfs: fix race between async reclaim worker and close_ctree() btrfs: fix assertion when building free space tree btrfs: don't silently ignore unexpected extent type when replaying log btrfs: fix invalid inode pointer dereferences during log replay btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb btrfs: update superblock's device bytes_used when dropping chunk btrfs: fix a race between renames and directory logging btrfs: scrub: add prefix for the error messages btrfs: warn if leaking delayed_nodes in btrfs_put_root() btrfs: fix delayed ref refcount leak in debug assertion btrfs: include root in error message when unlinking inode btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails	2025-06-23 11:16:38 -07:00
Johannes Thumshirn	c0d90a79e8	btrfs: zoned: fix alloc_offset calculation for partly conventional block groups When one of two zones composing a DUP block group is a conventional zone, we have the zone_info[i]->alloc_offset = WP_CONVENTIONAL. That will, of course, not match the write pointer of the other zone, and fails that block group. This commit solves that issue by properly recovering the emulated write pointer from the last allocated extent. The offset for the SINGLE, DUP, and RAID1 are straight-forward: it is same as the end of last allocated extent. The RAID0 and RAID10 are a bit tricky that we need to do the math of striping. This is the kernel equivalent of Naohiro's user-space commit: "btrfs-progs: zoned: fix alloc_offset calculation for partly conventional block groups". Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:21:15 +02:00
Qu Wenruo	547e836661	btrfs: handle csum tree error with rescue=ibadroots correctly [BUG] There is syzbot based reproducer that can crash the kernel, with the following call trace: (With some debug output added) DEBUG: rescue=ibadroots parsed BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010) BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8 BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm BTRFS info (device loop0): using free-space-tree BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0 DEBUG: read tree root path failed for tree csum, ret=-5 BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0 BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0 process 'repro' launched './file2' with NULL argv: empty string added DEBUG: no csum root, idatacsums=0 ibadroots=134217728 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f] CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G OE 6.15.0-custom+ #249 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs] Call Trace: <TASK> btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs] btrfs_submit_bbio+0x43e/0x1a80 [btrfs] submit_one_bio+0xde/0x160 [btrfs] btrfs_readahead+0x498/0x6a0 [btrfs] read_pages+0x1c3/0xb20 page_cache_ra_order+0x4b5/0xc20 filemap_get_pages+0x2d3/0x19e0 filemap_read+0x314/0xde0 __kernel_read+0x35b/0x900 bprm_execve+0x62e/0x1140 do_execveat_common.isra.0+0x3fc/0x520 __x64_sys_execveat+0xdc/0x130 do_syscall_64+0x54/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ---[ end trace 0000000000000000 ]--- [CAUSE] Firstly the fs has a corrupted csum tree root, thus to mount the fs we have to go "ro,rescue=ibadroots" mount option. Normally with that mount option, a bad csum tree root should set BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will ignore csum search. But in this particular case, we have the following call trace that caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS: load_global_roots_objectid(): ret = btrfs_search_slot(); /* Succeeded / btrfs_item_key_to_cpu() found = true; / We found the root item for csum tree. / root = read_tree_root_path(); if (IS_ERR(root)) { if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) / * Since we have rescue=ibadroots mount option, * @ret is still 0. / break; if (!found \|\| ret) { / @found is true, @ret is 0, error handling for csum * tree is skipped. */ } This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if the csum tree is corrupted, which results unexpected later csum lookup. [FIX] If read_tree_root_path() failed, always populate @ret to the error number. As at the end of the function, we need @ret to determine if we need to do the extra error handling for csum tree. Fixes: `abed4aaae4` ("btrfs: track the csum, extent, and free space trees in a rb tree") Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Longxing Li <coregee2000@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:21:06 +02:00
Filipe Manana	a26bf338cd	btrfs: fix race between async reclaim worker and close_ctree() Syzbot reported an assertion failure due to an attempt to add a delayed iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state: WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Modules linked in: CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: btrfs-endio-write btrfs_work_helper RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Code: 4e ad 5d (...) RSP: 0018:ffffc9000213f780 EFLAGS: 00010293 RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00 RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000 RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001 R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635 btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312 btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> This can happen due to a race with the async reclaim worker like this: 1) The async metadata reclaim worker enters shrink_delalloc(), which calls btrfs_start_delalloc_roots() with an nr_pages argument that has a value less than LONG_MAX, and that in turn enters start_delalloc_inodes(), which sets the local variable 'full_flush' to false because wbc->nr_to_write is less than LONG_MAX; 2) There it finds inode X in a root's delalloc list, grabs a reference for inode X (with igrab()), and triggers writeback for it with filemap_fdatawrite_wbc(), which creates an ordered extent for inode X; 3) The unmount sequence starts from another task, we enter close_ctree() and we flush the workqueue fs_info->endio_write_workers, which waits for the ordered extent for inode X to complete and when dropping the last reference of the ordered extent, with btrfs_put_ordered_extent(), when we call btrfs_add_delayed_iput() we don't add the inode to the list of delayed iputs because it has a refcount of 2, so we decrement it to 1 and return; 4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state; 5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now calls btrfs_add_delayed_iput() for inode X and there we trigger an assertion failure since the fs_info state has the flag BTRFS_FS_STATE_NO_DELAYED_IPUT set. Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for the async reclaim workers to finish, after we call cancel_work_sync() for them at close_ctree(), and by running delayed iputs after wait for the reclaim workers to finish and before setting the bit. This race was recently introduced by commit `19e60b2a95` ("btrfs: add extra warning if delayed iput is added when it's not allowed"). Without the new validation at btrfs_add_delayed_iput(), this described scenario was safe because close_ctree() later calls btrfs_commit_super(). That will run any final delayed iputs added by reclaim workers in the window between the btrfs_run_delayed_iputs() and the the reclaim workers being shut down. Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u Fixes: `19e60b2a95` ("btrfs: add extra warning if delayed iput is added when it's not allowed") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:57 +02:00
Filipe Manana	1961d20f6f	btrfs: fix assertion when building free space tree When building the free space tree with the block group tree feature enabled, we can hit an assertion failure like this: BTRFS info (device loop0 state M): rebuilding free space tree assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1102! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 sp : ffff8000a4ce7600 x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8 x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001 x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160 x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0 x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00 x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0 x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e Call trace: populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P) btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337 btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: f0047182 91178042 528089c3 9771d47b (d4210000) ---[ end trace 0000000000000000 ]--- This happens because we are processing an empty block group, which has no extents allocated from it, there are no items for this block group, including the block group item since block group items are stored in a dedicated tree when using the block group tree feature. It also means this is the block group with the highest start offset, so there are no higher keys in the extent root, hence btrfs_search_slot_for_read() returns 1 (no higher key found). Fix this by asserting 'ret' is 0 only if the block group tree feature is not enabled, in which case we should find a block group item for the block group since it's stored in the extent root and block group item keys are greater than extent item keys (the value for BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively). In case 'ret' is 1, we just need to add a record to the free space tree which spans the whole block group, and we can achieve this by making 'ret == 0' as the while loop's condition. Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:54 +02:00
Filipe Manana	16edae52f6	btrfs: don't silently ignore unexpected extent type when replaying log If there's an unexpected (invalid) extent type, we just silently ignore it. This means a corruption or some bug somewhere, so instead return -EUCLEAN to the caller, making log replay fail, and print an error message with relevant information. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:47 +02:00
Filipe Manana	2dcf838cf5	btrfs: fix invalid inode pointer dereferences during log replay In a few places where we call read_one_inode(), if we get a NULL pointer we end up jumping into an error path, or fallthrough in case of __add_inode_ref(), where we then do something like this: iput(&inode->vfs_inode); which results in an invalid inode pointer that triggers an invalid memory access, resulting in a crash. Fix this by making sure we don't do such dereferences. Fixes: `b4c50cbb01` ("btrfs: return a btrfs_inode from read_one_inode()") CC: stable@vger.kernel.org # 6.15+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:42 +02:00
Filipe Manana	e5b5596011	btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb If we break out of the loop because an extent buffer doesn't have the bit EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once before we tested for the bit and break out of the loop, and once again after the loop. Fix this by testing the bit and exiting before unlocking the xarray. The time spent testing the bit is negligible and it's not worth trying to do that outside the critical section delimited by the xarray lock due to the code complexity required to avoid it (like using a local boolean variable to track whether the xarray is locked or not). The xarray unlock only needs to be done before calling release_extent_buffer(), as that needs to lock the xarray (through xa_cmpxchg_irq()) and does a more significant amount of work. Fixes: `19d7f65f03` ("btrfs: convert the buffer_radix to an xarray") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:33 +02:00
Mark Harmstone	ae4477f937	btrfs: update superblock's device bytes_used when dropping chunk Each superblock contains a copy of the device item for that device. In a transaction which drops a chunk but doesn't create any new ones, we were correctly updating the device item in the chunk tree but not copying over the new bytes_used value to the superblock. This can be seen by doing the following: # dd if=/dev/zero of=test bs=4096 count=2621440 # mkfs.btrfs test # mount test /root/temp # cd /root/temp # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done # sync # rm * # sync # btrfs balance start -dusage=0 . # sync # cd # umount /root/temp # btrfs check test For btrfs-check to detect this, you will also need my patch at https://github.com/kdave/btrfs-progs/pull/991. Change btrfs_remove_dev_extents() so that it adds the devices to the fs_info->post_commit_list if they're not there already. This causes btrfs_commit_device_sizes() to be called, which updates the bytes_used value in the superblock. Fixes: `bbbf7243d6` ("btrfs: combine device update operations during transaction commit") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:22 +02:00
Filipe Manana	3ca864de85	btrfs: fix a race between renames and directory logging We have a race between a rename and directory inode logging that if it happens and we crash/power fail before the rename completes, the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained following a step by step analysis of an interleaving of steps that lead into this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hardlinks, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that as a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directoty A's inode to the current transaction id, that is, we don't cal btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set at in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. Bu when logging direct A, we don't have a dir index item for inode Y anymore, neither the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't called yet btrfs_add_link() to do that. Note that logging directory A doesn't fallback to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/bar"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: `259c4b96d7` ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:19:58 +02:00
Anand Jain	65d5112b4d	btrfs: scrub: add prefix for the error messages Add a "scrub: " prefix to all messages logged by scrub so that it's easy to filter them from dmesg for analysis. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:19:06 +02:00
Leo Martins	186b9dc3c3	btrfs: warn if leaking delayed_nodes in btrfs_put_root() Add a warning for leaked delayed_nodes when putting a root. We currently do this for inodes, but not delayed_nodes. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:39 +02:00
Leo Martins	dd276214e4	btrfs: fix delayed ref refcount leak in debug assertion If the delayed_root is not empty we are increasing the number of references to a delayed_node without decreasing it, causing a leak. Fix by decrementing the delayed_node reference count. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:35 +02:00
Filipe Manana	c769be2d3d	btrfs: include root in error message when unlinking inode To help debugging include the root number in the error message, and since this is a critical error that implies a metadata inconsistency and results in a transaction abort change the log message level from "info" to "critical", which is a much better fit. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:30 +02:00
Lorenzo Stoakes	2e3b37a7e4	fs: replace mmap hook with .mmap_prepare for simple mappings Since commit `c84bf6dd2b` ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). This callback is invoked in the mmap() logic far earlier, so error handling can be performed more safely without complicated and bug-prone state unwinding required should an error arise. This hook also avoids passing a pointer to a not-yet-correctly-established VMA avoiding any issues with referencing this data structure. It rather provides a pointer to the new struct vm_area_desc descriptor type which contains all required state and allows easy setting of required parameters without any consideration needing to be paid to locking or reference counts. Note that nested filesystems like overlayfs are compatible with an .mmap_prepare() callback since commit `bb666b7c27` ("mm: add mmap_prepare() compatibility layer for nested file systems"). In this patch we apply this change to file systems with relatively simple mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2, orangefs, nilfs2, romfs, ramfs and aio. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-19 13:56:59 +02:00
Al Viro	05fb0e6664	new helper: set_default_d_op() ... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-10 22:21:16 -04:00
Linus Torvalds	a56baa2253	for-6.16-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmg1pKUACgkQxWXV+ddt WDt5JA/9ExCLAICZX9qXoq0GN9O921BpH86XtV7tBfmniFVFqII2xzk0K5KxMFGC E/CQVE5rh8n3i6abvm/nuSKcDl9532Lnqb98WPDpLx5o3ClaSArx7smgJKAGcmQ2 bE88UfK1TQFMUxyTxCKFrk/Q6iVWD+GgWatIyBLuiLn3E2aPkewwXMU64BdFUXyq 9ZLkyy75Lw41aQhAuGSVZGJTL81hg3mM5zO945+0vYFTS1o5ST7+mqlqPzOnWaxO 5v1ecsJy27uVqdIfGZoOt1kzJ1nWaWT+4frfHCVotusGtFEezc++tCghdEjqokCO ThjRxiwNJmuwN3yrSKadjbFLSiHWDhHCIv9rExJfqSQZqUV7URxKFTFDBBC/yCZW Y70TOzqdji1BYsNt7VYhGeuNsKuZHC8+zPiOLMxI6XprRoSvjGuknuMPTVcmy567 hC8Mbyk8RlQmTsk5rziuXiesGNkqxpPoNIV8smj1Qur+GpvWzU0kmWa20AFM8iOQ e58amI90LmM/C2Bl8I+I3Hh7pRgYFEhsbvGB6HwdhaEXEn+bEdDN0G4DGsOUtOtZ 4Se0ckmho5esOVlmw/TtDa81TTg+jMzrzKCs9/vl3iY7QuzK8VPc438wSy95D4z0 gkHuRO5tni9E5bl2rg/rE7JeNgBef1sgzjjv+hfc2ZxsO3d1IcY= =zeWb -----END PGP SIGNATURE----- Merge tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fixup to the xarray conversion sent in the main 6.16 batch. It was not included because it would cause rebase/refresh of like 80 patches, right before sending the early pull request last week. It's fixing a bug when zoned mode is enabled on btrfs so it's not affecting most people" * tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails	2025-05-28 11:59:25 -07:00
Josef Bacik	b83825a8f5	btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails In the zoned mode there's a bug in the extent buffer tree conversion to xarray. The reference for eb is dropped and code continues but the references get dropped by releasing the batch. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/linux-btrfs/202505191521.435b97ac-lkp@intel.com/ Fixes: `19d7f65f03` ("btrfs: convert the buffer_radix to an xarray") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-27 13:39:15 +02:00
Josef Bacik	4db7384ce5	btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails In the zoned mode there's a bug in the extent buffer tree conversion to xarray. The reference for eb is dropped and code continues but the references get dropped by releasing the batch. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Fixes: `19d7f65f03` ("btrfs: convert the buffer_radix to an xarray") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-27 13:26:28 +02:00
Linus Torvalds	5e82ed5ca4	for-6.16-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgtuJgACgkQxWXV+ddt WDt79g//YndozUasOP0raqNVvod4wYvmG/CX1yHOkFQpfRQSVG4av0KlTWnupXKG oEQvFbZ639tmXbBYlKlK8Ts8fy1dpj+2iG4ValukA4L7xkY8ML5DrGQfKYbPEm2i Ab9lp4qnZZutYVH2/5UGQqkEUA3/YIiOZ0hsZWir//zbkTCL9cuHwl2FUYbmFlHi Hxkd30QC0kZuxINdMxXGauF4JkFJFyiNnmI5dMjj07xMMWk1cv8vunoZ3LVjAlbW gX16+4rUmtJl33HbYqofee4Dcovvcuvt/fEM1LX0rGbKXOnKA2dQPoMQsjMAV82B mjhma5T709MgVHQiDdJduh86seaul4Cuv/E/OqoDj7Kfkoew/YquHEfU4TB4bvCX KmONEyJFd9QDq5CUyvfow7HENja6QbU31Fw6akrbfpsVcla0MKAUWPi+Vqpqf+pe qIWNcovorD2g/EVJV6y+w0K+kXTarPtXXmVnJnJPYtOkBWpARI3Y8wVxDCKX8Nfo 7Kpi/h9K87+d9opjjEajydNONDL9GQa4AY4u/oeiwcSuJHvCt/rsKKwHZRyycRiI q+nGwsNcmY/ih/EVUzLgYomGG08H9nOcKvZOQkfHpOTI1EgvILAeV9SpGMex7du1 PiPqVtv9Z60dKy6OValh7ttMpt7LszAK4Dk7XiyHrN1Q3sYDyrs= =bDOD -----END PGP SIGNATURE----- Merge tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Apart from numerous cleanups, there are some performance improvements and one minor mount option update. There's one more radix-tree conversion (one remaining), and continued work towards enabling large folios (almost finished). Performance: - extent buffer conversion to xarray gains throughput and runtime improvements on metadata heavy operations doing writeback (sample test shows +50% throughput, -33% runtime) - extent io tree cleanups lead to performance improvements by avoiding unnecessary searches or repeated searches - more efficient extent unpinning when committing transaction (estimated run time improvement 3-5%) User visible changes: - remove standalone mount option 'nologreplay', deprecated in 5.9, replacement is 'rescue=nologreplay' - in scrub, update reporting, add back device stats message after detected errors (accidentally removed during recent refactoring) Core: - convert extent buffer radix tree to xarray - in subpage mode, move block perfect compression out of experimental build - in zoned mode, introduce sub block groups to allow managing special block groups, like the one for relocation or tree-log, to handle some corner cases of ENOSPC - in scrub, simplify bitmaps for block tracking status - continued preparations for large folios: - remove assertions for folio order 0 - add support where missing: compression, buffered write, defrag, hole punching, subpage, send - fix fsync of files with no hard links not persisting deletion - reject tree blocks which are not nodesize aligned, a precaution from 4.9 times - move transaction abort calls closer to the error sites - remove usage of some struct bio_vec internals - simplifications in extent map - extent IO cleanups and optimizations - error handling improvements - enhanced ASSERT() macro with optional format strings - cleanups: - remove unused code - naming unifications, dropped __, added prefix - merge similar functions - use common helpers for various data structures" * tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits) btrfs: move misplaced comment of btrfs_path::keep_locks btrfs: remove standalone "nologreplay" mount option btrfs: use a single variable to track return value at btrfs_page_mkwrite() btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write btrfs: simplify early error checking in btrfs_page_mkwrite() btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite() btrfs: fix wrong start offset for delalloc space release during mmap write btrfs: fix harmless race getting delayed ref head count when running delayed refs btrfs: log error codes during failures when writing super blocks btrfs: simplify error return logic when getting folio at prepare_one_folio() btrfs: return real error from __filemap_get_folio() calls btrfs: remove superfluous return value check at btrfs_dio_iomap_begin() btrfs: fix invalid data space release when truncating block in NOCOW mode btrfs: update Kconfig option descriptions btrfs: update list of features built under experimental config btrfs: send: remove btrfs_debug() calls btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() btrfs: fold error checks when allocating ordered extent and update comments btrfs: check we grabbed inode reference when allocating an ordered extent ...	2025-05-26 12:24:43 -07:00
Linus Torvalds	6f59de9bc0	for-6.16/block-20250523 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb Y6NDJw5+kQ== =1PNK -----END PGP SIGNATURE----- Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...	2025-05-26 11:39:36 -07:00
Linus Torvalds	6d5b940e1e	vfs-6.16-rc1.async.dir -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBN6wAKCRCRxhvAZXjc ok32AQD9DTiSCAoVg+7s+gSBuLTi8drPTN++mCaxdTqRh5WpRAD9GVyrGQT0s6LH eo9bm8d1TAYjilEWM0c0K0TxyQ7KcAA= =IW7H -----END PGP SIGNATURE----- Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles do* have a vfsmount but don't use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions	2025-05-26 08:02:43 -07:00
Sun YangKai	eeb133a634	btrfs: move misplaced comment of btrfs_path::keep_locks Commit `925baeddc5` ("Btrfs: Start btree concurrency work.") added the comment for the field keep_locks. This got moved later but without the comment, so move it to the right place and fix the comment style. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-17 21:15:08 +02:00
Qu Wenruo	8af94e772e	btrfs: remove standalone "nologreplay" mount option Standalone "nologreplay" mount option has been marked deprecated since commit `74ef00185e` ("btrfs: introduce "rescue=" mount option"), which dates back to v5.9 (2020). Furthermore there is no other filesystem with the same named mount option, so this one is btrfs specific and we will not hit the same problem when removing "norecovery" mount option. So let's remove the standalone "nologreplay" mount option. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-16 19:16:22 +02:00
Filipe Manana	1ce06d45d9	btrfs: use a single variable to track return value at btrfs_page_mkwrite() We have two variables to track return values, ret and ret2, with types vm_fault_t (an unsigned int type) and int, which makes it a bit confusing and harder to keep track. So use a single variable, of type int, and under the 'out' label return vmf_error(ret) in case ret contains an error, otherwise return VM_FAULT_NOPAGE. This is equivalent to what we had before and it's simpler. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:44 +02:00
Filipe Manana	d8cddf2a1d	btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write If the call to btrfs_set_extent_delalloc() fails we are always returning VM_FAULT_SIGBUS, which is odd since the error means "bad access" and the most likely cause for btrfs_set_extent_delalloc() is -ENOMEM, which should be translated to VM_FAULT_OOM. Instead of returning VM_FAULT_SIGBUS return vmf_error(ret2), which gives us a more appropriate return value, and we use that everywhere else too. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:37 +02:00
Filipe Manana	a08625f825	btrfs: simplify early error checking in btrfs_page_mkwrite() We have this entangled error checks early at btrfs_page_mkwrite(): 1) Try to reserve delalloc space by calling btrfs_delalloc_reserve_space() and storing the return value in the ret2 variable; 2) If the reservation succeed, call file_update_time() and store the return value in ret2 and also set the local variable 'reserved' to true (1); 3) Then do an error check on ret2 to see if any of the previous calls failed and if so, jump either to the 'out' label or to the 'out_noreserve' label, depending on whether 'reserved' is true or not. This is unnecessarily complex. Instead change this to a simpler and more straightforward approach: 1) Call btrfs_delalloc_reserve_space(), if that returns an error jump to the 'out_noreserve' label; 2) The call file_update_time() and if that returns an error jump to the 'out' label. Like this there's less nested if statements, no need to use a local variable to track if space was reserved and if statements are used only to check errors. Also move the call to extent_changeset_free() out of the 'out_noreserve' label and under the 'out' label since the changeset is allocated only if the call to reserve delalloc space succeeded. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:23 +02:00
Filipe Manana	bf1c74ccba	btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite() In the last call to btrfs_delalloc_release_space() where the value of the variable 'ret' is never zero, we pass the expression 'ret != 0' as the value for the argument 'qgroup_free', which always evaluates to true. Make this less confusing and more clear by explicitly passing true instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:15 +02:00
Filipe Manana	17a85f5204	btrfs: fix wrong start offset for delalloc space release during mmap write If we're doing a mmap write against a folio that has i_size somewhere in the middle and we have multiple sectors in the folio, we may have to release excess space previously reserved, for the range going from the rounded up (to sector size) i_size to the folio's end offset. We are calculating the right amount to release and passing it to btrfs_delalloc_release_space(), but we are passing the wrong start offset of that range - we're passing the folio's start offset instead of the end offset, plus 1, of the range for which we keep the reservation. This may result in releasing more space then we should and eventually trigger an underflow of the data space_info's bytes_may_use counter. So fix this by passing the start offset as 'end + 1' instead of 'page_start' to btrfs_delalloc_release_space(). Fixes: `d0b7da88f6` ("Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:07 +02:00
Filipe Manana	7dbfa4266c	btrfs: fix harmless race getting delayed ref head count when running delayed refs When running delayed references we are reading the number of ready delayed ref heads without taking any lock which can make KCSAN report a race since we can have concurrent tasks updating that number, such as for example when freeing a tree block which will end up decrementing that counter or when adding a new delayed ref while COWing a tree block which will increment that counter. This is a harmless race since running one more or one less delayed ref head doesn't result in any problem, in the critical section of a transaction commit we always run any remaining delayed refs and at that point no one can create more. So fix this harmless race by annotating the read with data_race(). Reported-by: cen zhang <zzzccc427@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAFRLqsUCLMz0hY-GaPj1Z=fhkgRHjxVXHZ8kz0PvkFN0b=8L2Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	4469e95fe5	btrfs: log error codes during failures when writing super blocks When writing super blocks, at write_dev_supers(), we log an error message when we get some error but we don't show which error we got and we have that information. So enhance the error messages with the error codes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	0f2bc22150	btrfs: simplify error return logic when getting folio at prepare_one_folio() There's no need to have special logic to return -EAGAIN in case the call to __filemap_get_folio() fails, because when FGP_NOWAIT is passed to __filemap_get_folio() it returns ERR_PTR(-EAGAIN) if it needs to do something that would imply blocking. The reason we have this logic is from the days before we migrated to the folio interface, when we called pagecache_get_page() which would return NULL instead of an error pointer. So remove this special casing and always return the error that the call to __filemap_get_folio() returned. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	443e4d0e1c	btrfs: return real error from __filemap_get_folio() calls We have a few places that always assume a -ENOMEM error happened in case a call to __filemap_get_folio() returns an error, which is just too much of an assumption and even if it would be the case at some point in time, it's not future proof and there's nothing in the documentation that guarantees that only ERR_PTR(-ENOMEM) can be returned with the flags we are passing to it. So use the exact error returned by __filemap_get_folio() instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	ca84913d49	btrfs: remove superfluous return value check at btrfs_dio_iomap_begin() In the if statement that checks the return value from btrfs_check_data_free_space(), there's no point to check if 'ret' is not zero in the else branch, since the main if branch checked that it's zero, so in the else branch it necessarily has a non-zero value. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	d3914d6030	btrfs: fix invalid data space release when truncating block in NOCOW mode If when truncating a block we fail to reserve data space and then we proceed anyway because we can do a NOCOW write, if we later get an error when trying to get the folio from the inode's mapping, we end up releasing data space that we haven't reserved, screwing up the bytes_may_use counter from the data space_info, eventually resulting in an underflow when all other reservations done by other tasks are released, if any, or right away if there are no other reservations at the moment. This is because when we get an error when trying to grab the block's folio we call btrfs_delalloc_release_space(), which releases metadata (which we have reserved) and data (which we haven't reserved). Fix this by calling btrfs_delalloc_release_space() only if we did reserve data space, that is, if we aren't falling back to NOCOW, meaning the local variable @only_release_metadata has a false value, otherwise release only metadata by calling btrfs_delalloc_release_metadata(). Fixes: `6d4572a9d7` ("btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
David Sterba	c16b984cdb	btrfs: update Kconfig option descriptions Expand what the options do and if they are OK to be enabled. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
David Sterba	5f9b394e32	btrfs: update list of features built under experimental config The list is out of date, the extent shrinker got fixed in 6.13. Add new entries: the COW fixup warning in 6.15, rund robin policies in 6.14. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
David Sterba	585e944a31	btrfs: send: remove btrfs_debug() calls There are debugging prints for each emitted send command and other related actions. This does not seem right as the number of commands can be high and dumping that to the system log will likely hit some rate limiting. This should be done by trace points that are more lightweight and can keep up with high frequency. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	9f6fa5b344	btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() We are using an integer for the 'delalloc' argument but all we need is a boolean, so switch the type to 'bool' and rename the parameter to 'is_delalloc' to better match the fact that it's a boolean. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	ba4ec9a5a0	btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() We are using an integer for the 'delalloc' argument but all we need is a boolean, so switch the type to 'bool' and rename the parameter to 'is_delalloc' to better match the fact that it's a boolean. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	87417e0cbb	btrfs: fold error checks when allocating ordered extent and update comments Instead of having an error check and return on each branch of the if statement, move the error check to happen after that if branch, reducing source code and object code sizes. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1840174 163742 16136 2020052 1ed2d4 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1840138 163742 16136 2020016 1ed2b0 fs/btrfs/btrfs.ko While at it and moving the comments, update the comments to be more clear about how qgroup reserved space is released and the intricacies of how it's managed for COW writes. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	08c649a563	btrfs: check we grabbed inode reference when allocating an ordered extent When allocating an ordered extent we call igrab() to get a reference on the inode and attach it to the ordered extent. For an ordered extent we always must have an inode reference since we during its life cycle we need to access the inode for several things like for example: * Inserting the ordered extent right after allocating it, when calling insert_ordered_extent() - we need to lock the inode's ordered_tree_lock; * In the bio submission path we need to add checksums to the ordered extent and we end up at btrfs_add_ordered_sum(), where again we need to grab the inode from the ordered extent to lock the inode's ordered_tree_lock; * When finishing an ordered extent, at btrfs_finish_ordered_extent(), we need again to access its inode in order to lock the inode's ordered_tree_lock; * Etc etc etc. Everywhere we deal with an ordered extent we always expect its inode to be not NULL, the only exception being btrfs_put_ordered_extent() where we check if it's NULL before calling btrfs_add_delayed_iput(), even though we have already assumed it's not NULL when calling the tracepoint trace_btrfs_ordered_extent_put() since the tracepoint dereferences the inode to extract its number and root without ever checking it's NULL. The igrab() call can return NULL if the inode is about to be freed or is being freed (its state has I_FREEING or I_WILL_FREE set), and that's why there's such check at btrfs_put_ordered_extent(). The igrab() and NULL check were introduced in commit `5fd0204355` ("Btrfs: finish ordered extents in their own thread") but even back then we always needed and assumed igrab() returned a non-NULL pointer, since for example when removing an ordered extent, at btrfs_remove_ordered_extent(), we assumed the inode pointer was not NULL in order to access the inode's ordered extent tree. In fact whenever we allocate an ordered extent we are holding an inode reference and the inode is not being freed or going to be freed (which happens in the final iput), and since we depend on the inode for the life cycle of the ordered extent, just make ordered extent allocation to fail in case igrab() returns NULL and trigger a warning, to make it clear it's not expected. This allows to remove the confusing NULL inode check at btrfs_put_ordered_extent(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	1f2889f559	btrfs: fix qgroup reservation leak on failure to allocate ordered extent If we fail to allocate an ordered extent for a COW write we end up leaking a qgroup data reservation since we called btrfs_qgroup_release_data() but we didn't call btrfs_qgroup_free_refroot() (which would happen when running the respective data delayed ref created by ordered extent completion or when finishing the ordered extent in case an error happened). So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an ordered extent for a COW write. Fixes: `7dbeaad0af` ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Qu Wenruo	4ad57e1e22	btrfs: scrub: reduce memory usage of struct scrub_sector_verification That structure records needed info for block verification (either data checksum pointer, or expected tree block generation). But there is also a boolean to tell if this block belongs to a metadata or not, as the data checksum pointer and expected tree block generation is already a union, we need a dedicated bit to tell if this block is a metadata or not. However such layout means we're wasting 63 bits for x86_64, which is a huge memory waste. Thanks to the recent bitmap aggregation, we can easily move this single-bit-per-block member to a new sub-bitmap. And since we already have six 16 bits long bitmaps, adding another bitmap won't even increase any memory usage for x86_64, as we need two 64 bits long anyway. This will reduce the following memory usages: - sizeof(struct scrub_sector_verification) From 16 bytes to 8 bytes on x86_64. - scrub_stripe::sectors From 16 * 16 to 16 * 8 bytes. - Per-device scrub_ctx memory usage From 128 * (16 * 16) to 128 * (16 * 8), which saves 16KiB memory. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Qu Wenruo	4e2945f73b	btrfs: handle aligned EOF truncation correctly for subpage cases [BUG] For the following fsx -e 1 run, the btrfs still fails the run on 64K page size with 4K fs block size: READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk OFFSET GOOD BAD RANGE 0x26b3a 0x0000 0x15b4 0x0 operation# (mod 256) for the bad data may be 21 [...] LOG DUMP (28 total operations): 1( 1 mod 256): SKIPPED (no operation) 2( 2 mod 256): SKIPPED (no operation) 3( 3 mod 256): SKIPPED (no operation) 4( 4 mod 256): SKIPPED (no operation) 5( 5 mod 256): WRITE 0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE 6( 6 mod 256): ZERO 0x1b1a8 thru 0x20bd4 (0x5a2d bytes) 7( 7 mod 256): FALLOC 0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR 8( 8 mod 256): WRITE 0x741d thru 0x13522 (0xc106 bytes) 9( 9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes) 10( 10 mod 256): FALLOC 0xb719 thru 0xb994 (0x27b bytes) INTERIOR 11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77 12( 12 mod 256): ZERO 0x1615e thru 0x1770e (0x15b1 bytes) 13( 13 mod 256): SKIPPED (no operation) 14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff 15( 15 mod 256): SKIPPED (no operation) 16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff 17( 17 mod 256): ZERO 0x14adc thru 0x1b78a (0x6caf bytes) 18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ****WWWW 19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff 20( 20 mod 256): FALLOC 0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF 21( 21 mod 256): FALLOC 0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING 22( 22 mod 256): WRITE 0x920 thru 0x7e51 (0x7532 bytes) 23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161 24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5 25( 25 mod 256): SKIPPED (no operation) 26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes) 27( 27 mod 256): SKIPPED (no operation) 28( 28 mod 256): READ 0x26b3a thru 0x36633 (0xfafa bytes) RRRR** [CAUSE] The involved operations are: fallocating to largest ever: 0x40000 21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes) 21 falloc from 0x351c5 to 0x40000 (0xae3b bytes) 28 read 0x26b3a thru 0x36633 (0xfafa bytes) At operation #21 a pollute_eof is done, by memory mapped write into range [0x24000, 0x2ffff). At this stage, the inode size is 0x24000, which is block aligned. Then fallocate happens, and since it's expanding the inode, it will call btrfs_truncate_block() to truncate any unaligned range. But since the inode size is already block aligned, btrfs_truncate_block() does nothing and exits. However remember the folio at 0x20000 has some range polluted already, although it will not be written back to disk, it still affects the page cache, resulting the later operation #28 to read out the polluted value. [FIX] Instead of early exit from btrfs_truncate_block() if the range is already block aligned, do extra filio zeroing if the fs block size is smaller than the page size and we're truncating beyond EOF. This is to address exactly the above case where memory mapped write can still leave some garbage beyond EOF. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Qu Wenruo	8e4f21f2b1	btrfs: handle unaligned EOF truncation correctly for subpage cases [BUG] The following fsx sequence will fail on btrfs with 64K page size and 4K fs block size: #fsx -d -e 1 -N 4 $mnt/junk -S 36386 READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk OFFSET GOOD BAD RANGE 0xe9ba 0x0000 0x03ac 0x0 operation# (mod 256) for the bad data may be 3 ... LOG DUMP (4 total operations): 1( 1 mod 256): WRITE 0x6c62 thru 0x1147d (0xa81c bytes) HOLE *WWWW 2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 **WWWW 3( 3 mod 256): ZERO 0x1c7aa thru 0x28fe2 (0xc839 bytes) 4( 4 mod 256): MAPREAD 0xe9ba thru 0x1578e (0x6dd5 bytes) RRRR** [CAUSE] Only 2 operations are really involved in this case: 3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes) 3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes) 4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes) At operation 3, fsx pollutes beyond EOF, that is done by mmap() and write into that mmap() range beyond EOF. Such write will fill the range beyond EOF, but it will never reach disk as ranges beyond EOF will not be marked dirty nor uptodate. Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond our isize (which was 0x5448), we should zero out any range beyond EOF (0x5448). During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the unaligned head block. But that function only really zeroes out the block at [0x5000, 0x5fff], it doesn't bother any range other that that block, since those ranges will not be marked dirty nor written back. So the range [0x6000, 0xffff] is still polluted, and later mapread() will return the poisoned value. [FIX] Enhance btrfs_truncate_block() by: - Pass a @start/@end pair to indicate the full truncation range This is to handle the following truncation case: Page size is 64K, fs block size is 4K, truncate range is [6K, 60K] 0 32K 64K \| \|///////////////////////////////////\| \| 6K 60K The range is not aligned for its head block, so we need to call btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0. But with that information we only know to zero the range [6K, 8K), if we zero out the range [6K, 64K), the last block will also be zeroed, causing data loss. So here we need the full range we're truncating, so that we can avoid over-truncation. - Rename @from to @offset As now the parameter is only utilized to locate a block, it's not really carrying the old @from meaning well. - Remove @front parameter With the full truncate range passed in, we can determine if the @offset is at the head or tail block. - Skip truncation if @offset is not in the head nor tail blocks The call site in hole punch unconditionally call btrfs_truncate_block() without even checking the range is aligned or not. If the @offset is neither in the head nor in tail block, it means we can safely ignore it. - Skip truncate if the range inside the target block is already aligned - Make btrfs_truncate_block() zero all blocks beyond EOF Since we have the original range, we know exactly if we're doing truncation beyond EOF (the @end will be (u64)-1). If we're doing truncation beyond EOF, then enlarge the truncation range to the folio end, to address the possibly polluted ranges. Otherwise still keep the zero range inside the block, as we can have large data folios soon, always truncating every blocks inside the same folio can be costly for large folios. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Boris Burkov	3649833a58	btrfs: fix broken drop_caches on extent buffer folios The (correct) commit `e41c81d0d3` ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()") replaced the page_mapped(page) check with a refcount check. However, this refcount check does not work as expected with drop_caches for btrfs's metadata pages. Btrfs has a per-sb metadata inode with cached pages, and when not in active use by btrfs, they have a refcount of 3. One from the initial call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(), and one from folio_attach_private(). We would expect such pages to get dropped by drop_caches. However, drop_caches calls into mapping_evict_folio() via mapping_try_invalidate() which gets a reference on the folio with find_lock_entries(). As a result, these pages have a refcount of 4, and fail this check. For what it's worth, such pages do get reclaimed under memory pressure, so I would say that while this behavior is surprising, it is not really dangerously broken. When I asked the mm folks about the expected refcount in this case, I was told that the correct thing to do is to donate the refcount from the original allocation to the page cache after inserting it. Therefore, attempt to fix this by adding a put_folio() to the critical spot in alloc_extent_buffer() where we are sure that we have really allocated and attached new pages. We must also adjust folio_detach_private() to properly handle being the last reference to the folio and not do a use-after-free after folio_detach_private(). extent_buffers allocated by clone_extent_buffer() and alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership from allocation to insertion in the mapping does not apply to them. However, we can still folio_put() them safely once they are finished being allocated and have called folio_attach_private(). Finally, removing the generic put_folio() for the allocation from btrfs_detach_extent_buffer_folios() means we need to be careful to do the appropriate put_folio() in allocation failure paths in alloc_extent_buffer(), clone_extent_buffer() and alloc_dummy_extent_buffer(). Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/ Tested-by: Klara Modin <klarasmodin@gmail.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Filipe Manana	1886b77f5b	btrfs: use verbose assert at peek_discard_list() We now have a verbose variant of ASSERT() so that we can print the value of the block group's discard_index. So use it for better problem analysis in case the assertion is triggered. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Qu Wenruo	1b660424a6	btrfs: scrub: aggregate small bitmaps into a larger one Currently we have several small bitmaps inside scrub_stripe: - extent_sector_bitmap - error_bitmap - io_error_bitmap - csum_error_bitmap - meta_error_bitmap - meta_gen_error_bitmap All those bitmaps are at most 16 bits long, but unsigned long is either 32 or 64 (more common) bits. This means we're wasting 1/2 or 3/4 space for each bitmap. And we can have 128 scrub_stripe for each device, such wasted space adds up quickly. Instead of using a single unsigned long for each bitmap, aggregate them into a larger bitmap, just like what we're doing for subpage support. This reduces 24 bytes from each scrub_stripe structure on x86_64 systems. This will need a lot of macros converting direct bitmap/bit operations into our scrub_stripe specific helpers, but all those helpers are very small and can be inlined. So overall the overhead shouldn't be that huge, and we save quite some memory space. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Qu Wenruo	f2c19541e4	btrfs: scrub: fix a wrong error type when metadata bytenr mismatches When the bytenr doesn't match for a metadata tree block, we will report it as an csum error, which is incorrect and should be reported as a metadata error instead. Fixes: `a3ddbaebc7` ("btrfs: scrub: introduce a helper to verify one metadata block") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00

... 3 4 5 6 7 ...

14605 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)