At unlink_extrefs_not_in_log() we do an inode lookup of the directory but
we already have the directory inode accessible as a function argument, so
the lookup is redundant. Remove it and use the directory inode passed in as
an argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in passing the inode and parent inode object IDs to
__add_inode_ref() and its helpers because we can get them by calling
btrfs_ino() against the inode and the directory inode, and we pass both
inodes to __add_inode_ref() and its helpers. So remove the object ID
parameters to reduce the number of arguments passed and to make things
less confusing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While replaying log trees we need to do searches and updates to subvolume
trees and for that we use a path that we allocate in replay_one_buffer()
and then pass it as a parameter to other functions deeper in the log
replay call chain. Instead of passing it as parameter, add it to struct
walk_control since we pass a pointer to that structure for every log
replay function.
This reduces the number of arguments passed to the functions and it will
be needed and important for upcoming changes that improve error
reporting for log replay. Also name the new field in struct walk_control
'subvol_path': while that is longer to type, the naming makes it clear
it's used for subvolume tree operations, since many of the log replay
functions operate both on subvolume and log trees, and for the log tree
searches we have struct walk_control::log_leaf to also make it obvious
that one refers to a log tree extent buffer.
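A minimal sketch of the relevant fields, with the names taken from this
message and the rest of the layout assumed:

	struct btrfs_path;
	struct extent_buffer;

	struct walk_control {
		/* ... other members elided ... */

		/* Path for searches and updates in subvolume trees during replay. */
		struct btrfs_path *subvol_path;

		/* Extent buffer of the log tree leaf currently being processed. */
		struct extent_buffer *log_leaf;
	};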
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At overwrite_item() we have a redundant btrfs_release_path() just before
failing with -ENOMEM, as the caller who passed in the path will free it
and therefore also release any refcounts and locks on the extent buffers
of the path. So remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At replay_one_name() we have a redundant btrfs_release_path() just before
calling insert_one_name(), as some lines above we have already released
the path with another btrfs_release_path() call. So remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to allocate 'fixup_path' at replay_one_dir_item(), as the
path passed as an argument is unused by the time link_to_fixup_dir() is
called (replay_one_name() releases the path before it returns).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can avoid a path allocation in the btrfs_drop_extents() calls we have
at replay_one_extent() and replay_one_buffer() by passing the path we
already have in those contexts, as it's unused by the time they call
btrfs_drop_extents().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to allocate a path as our single caller already has a
path that we can use. So pass the caller's path and use it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A lot of the log replay functions get passed the current log leaf being
processed as well as the current slot and the key at that slot. Instead
of passing them as parameters, add them to struct walk_control so that
we reduce the number of parameters. This is also going to be needed for
further changes that improve error reporting during log replay.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this boolean 'inode_item' to tell if we are processing an inode
item key and we use it in a couple of places, while in another two places
we open code the check by comparing the key type against the inode item type.
Make this consistent and use the boolean everywhere. Also rename it from
'inode_item' to 'is_inode_item', which makes it more clear that it's a
boolean and not an instance of struct btrfs_inode_item, and make it const
too.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have the extent buffer's level in an argument, there's no need
to first ensure the extent buffer's data is loaded (by calling
btrfs_read_extent_buffer()) and then call btrfs_header_level() to check
the level. So use the level argument and do the check before calling
btrfs_read_extent_buffer().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have the extent buffer's level in an argument, there's no need
to call btrfs_header_level(). So use the level argument and make the code
shorter.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing the transaction and subvolume root as arguments to
overwrite_item(), pass the walk_control structure as we can grab them
from the structure. This reduces the number of arguments passed and it's
going to be needed by an incoming change that improves error reporting
for log replay.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing the transaction as an argument to drop_one_dir_item()
and its helpers (link_to_fixup_dir() and unlink_inode_for_log_replay()),
pass the walk_control structure as we can access the transaction from it
and the subvolume root. This is going to be needed by an incoming change
that improves error reporting for log replay and also reduces the number
of arguments passed to link_to_fixup_dir().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing the transaction and subvolume root and log tree as
arguments, pass the walk_control structure as we can grab all of those
from the structure. This reduces the number of arguments passed and it's
going to be needed by an incoming change that improves error reporting
for log replay.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing the transaction, subvolume root and log tree as
arguments to add_inode_ref() and its helpers (__add_inode_ref(),
unlink_refs_not_in_log(), unlink_extrefs_not_in_log() and
unlink_old_inode_refs()), pass the walk_control structure as we can
access all of those from the structure. This reduces the number of
arguments passed and it's going to be needed by an incoming change
that improves error reporting for log replay.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing the transaction and subvolume root as arguments to
replay_one_extent(), pass the walk_control structure as we can grab all
of those from the structure. This reduces the number of arguments passed
and it's going to be needed by an incoming change that improves error
reporting for log replay.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing the transaction and log tree as arguments to
check_item_in_log(), pass the walk_control structure as we can grab those
from the structure. This reduces the number of arguments passed and it's
going to be needed by an incoming change that improves error reporting for
log replay. Notice that a NULL log root argument to check_item_in_log()
makes it unconditionally delete a directory entry, so since the
walk_control always has a non-NULL log root, we add an extra boolean to
check_item_in_log() to tell it if it should unconditionally delete a
directory entry, preserving the behaviour and also making it a bit more
clear.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing the transaction, subvolume root and log tree as
arguments to replay_dir_deletes(), pass the walk_control structure as
we can grab all of those from the structure. This reduces the number of
arguments passed and it's going to be needed by an incoming change that
improves error reporting for log replay. This also requires changing
fixup_inode_link_counts() and fixup_inode_link_count() to take that
structure as an argument since fixup_inode_link_count() makes a call
to replay_dir_deletes().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In upcoming changes we need to pass struct walk_control as an argument to
replay_dir_deletes() and link_to_fixup_dir() so we need to move its
definition above the prototypes of those functions. So move it up right
below the enum that defines log replay stages and before any functions and
function prototypes are declared. Also fixup the comments while moving it
so that they comply with the preferred code style (capitalize the first
word in a sentence, end sentences with punctuation, make lines wider and
closer to the 80 character limit).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing the transaction, subvolume root and log tree as
arguments to replay_xattr_deletes(), pass the walk_control structure as
we can grab all of those from the structure. This reduces the number of
arguments passed and it's going to be needed by an incoming change that
improves error reporting for log replay.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have this odd behaviour:
1) At btrfs_replay_log() we drop the reference of the log root tree if
the call to btrfs_recover_log_trees() failed;
2) But if the call to btrfs_recover_log_trees() did not fail, we don't
drop the reference in btrfs_replay_log() - we expect that
btrfs_recover_log_trees() does it in case it returns success.
Let's simplify this and make btrfs_replay_log() always drop the reference
on the log root tree. Not only does this simplify the code, it's also what
makes sense since it's btrfs_replay_log() that grabbed the reference in the
first place.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in setting log_root_tree->log_root to NULL as this is
already NULL, we never assigned anything to it before, and it's meaningless
since a log root never has a value other than NULL in its ->log_root field,
which can be non-NULL only for non-log roots.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's unnecessary to pass a transaction parameter since struct walk_control
already has a member that points to the transaction, so we can make the
functions access the structure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of duplicating the dropping of a log tree in case we jump to the
'error' label, move the dropping under the 'error' label and get rid of the
unnecessary setting of the log root to NULL since we return immediately
after.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing an extra log root parameter for the log tree walk
functions and callbacks, add the log tree to struct walk_control and
make those functions and callbacks extract the log root from that
structure, reducing the number of parameters. This also simplifies
further upcoming changes to report log tree replay failures.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Everywhere we have a log root we name it as 'log' or 'log_root' except in
walk_down_log_tree() and walk_up_log_tree() where we name it as 'root',
which is not only inconsistent but also confusing, since we typically
use 'root' when naming variables that refer to a subvolume tree. So for
clarity and consistency rename the 'root' argument to 'log'.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Everywhere else we refer to a subvolume root we are replaying to simply
as 'root', so rename from 'replay_dest' to 'root' for consistency and
having a more meaningful and shorter name. While at it also update the
comment to be more detailed and comply with the preferred style (the first
word in a sentence is capitalized and the sentence ends with punctuation).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'free' and 'pin' members of struct walk_control, used during log replay
and when freeing a log tree, are defined as integers but in practice are
used as booleans. Change their type to bool and while at it update their
comments to be more detailed and comply with the preferred comment style
(first word in a sentence is capitalized, sentences end with punctuation
and the comment opening (/*) is on a line of its own).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside btrfs_fs_info we cache several bit shifts, like sectorsize_bits.
Apply the same to the max and min folio orders so that every time the
mapping order needs to be applied we can skip the calculation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we want to iterate all blocks inside a bio, we do something
like this:
	bio_for_each_segment_all(bvec, bio, iter_all) {
		for (off = 0; off < bvec->bv_len; off += sectorsize) {
			/* Iterate blocks using bv + off */
		}
	}
That's fine for now, but it will not handle future bs > ps, as
bio_for_each_segment_all() is a single-page iterator, it will always
return a bvec that's no larger than a page.
But for bs > ps cases, we need a full folio (which covers at least one
block) so that we can work on the block.
To address this problem and handle future bs > ps cases better:
- Introduce a helper btrfs_bio_for_each_block_all()
This helper will create a local bvec_iter, which has the size of the
target bio, grab the physical address of the current location, then
advance the iterator by the block size (see the sketch after this list).
- Use btrfs_bio_for_each_block_all() to replace existing call sites
Including:
* set_bio_pages_uptodate() in raid56
* verify_bio_data_sectors() in raid56
Both result in much easier to read code.
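Conceptually (this is a sketch of the iteration model described above, not
the helper's actual implementation), it boils down to walking the bio with
a local iterator and exposing each block as a physical address:

	struct bvec_iter iter = { .bi_size = bio->bi_iter.bi_size };

	while (iter.bi_size) {
		struct bio_vec bv = bio_iter_iovec(bio, iter);
		phys_addr_t paddr = bvec_phys(&bv);

		/* Work on the block starting at physical address paddr. */
		bio_advance_iter_single(bio, &iter, sectorsize);
	}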
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we want to iterate a bio in block unit, we do something
like this:
	while (iter->bi_size) {
		struct bio_vec bv = bio_iter_iovec(&bbio->bio, *iter);

		/* Do something using the bv */
		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
	}
That's fine for now, but it will not handle future bs > ps, as
bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not
exceed page size.
This means the code using that bv can only handle a block if bs <= ps.
To address this problem and handle future bs > ps cases better:
- Introduce a helper btrfs_bio_for_each_block()
Instead of a bio_vec, which has single-page and multi-page versions (and
the multi-page version has quite some limits), use my favorite way to
represent a block: phys_addr_t.
For bs <= ps cases, nothing is changed, except we pay a very small
overhead to convert the phys_addr_t to a folio, then use the proper
folio helpers to handle the possible highmem cases.
For bs > ps cases, all blocks will be backed by large folios, meaning
every folio will cover at least one block. And we still use the proper
folio helpers to handle highmem cases.
With phys_addr_t, we handle both large folios and highmem properly,
so there is no better single variable to represent a btrfs block than
phys_addr_t.
- Extract the data block csum calculation into a helper
The new helper, btrfs_calculate_block_csum() will be utilized by
btrfs_csum_one_bio().
- Use btrfs_bio_for_each_block() to replace existing call sites
Including:
* index_one_bio() from raid56.c
Very straightforward.
* btrfs_check_read_bio()
Also update repair_one_sector() to grab the folio using phys_addr_t,
and do extra checks to make sure the folio covers at least one
block.
We do not need to bother bv_len at all now.
* btrfs_csum_one_bio()
Now we can move the highmem handling into a dedicated helper,
calculate_block_csum(), and use the btrfs_bio_for_each_block() helper.
There is one exception in btrfs_decompress_buf2page(), which copies
decompressed data into the original bio and does not iterate using the
block size, thus we don't need to bother.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently for btrfs checksum verification, we do it in the following
pattern:
	kaddr = kmap_local_*();
	ret = btrfs_check_sector_csum(kaddr);
	kunmap_local(kaddr);
It's OK for now, but it's still not following the patterns of the helpers
inside linux/highmem.h, which never require a virtual memory address.
Those highmem helpers mostly accept a folio and some offset/length
inside the folio, and in the implementation they check if the folio
needs a partial kmap and do the handling.
Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:
- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
To follow the more common term "block" used in all other major
filesystems.
- Pass a physical address into btrfs_check_block_csum() and
btrfs_data_csum_ok()
The physical address is always available even for a highmem page, since
it's just (page frame number << PAGE_SHIFT) + offset within the page.
And with that physical address, we can grab the folio covering the
page, and do extra checks to ensure it covers at least one block.
This also allows us to do the kmap inside btrfs_check_block_csum().
This means all the extra HIGHMEM handling will be concentrated into
btrfs_check_block_csum(), and no callers will need to bother highmem
by themselves.
- Properly zero out the block on csum mismatch
Since btrfs_data_csum_ok() only gets a paddr, we cannot and should not
use memzero_bvec(), which only accepts a single-page bvec.
Instead use the paddr to grab the folio and call folio_zero_range(), as
sketched below.
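A rough sketch of what the paddr based zeroing amounts to (helper name and
exact shape are assumptions, not the actual btrfs code):

	static void zero_block_at_paddr(phys_addr_t paddr, u32 blocksize)
	{
		struct folio *folio = page_folio(pfn_to_page(PHYS_PFN(paddr)));
		size_t offset = offset_in_folio(folio, paddr);

		/* The folio backing the block must cover at least one full block. */
		ASSERT(offset + blocksize <= folio_size(folio));
		folio_zero_range(folio, offset, blocksize);
	}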
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if block size < page size, btrfs only supports one single
config, 4K.
This is mostly to reduce the test configurations, as 4K is going to be
the default block size for all architectures.
However all other major filesystems have no artificial limits on the
supported block size, and some are already supporting block size > page
size.
Since the btrfs subpage block support has been there for a long time,
it's time for us to enable support for all block sizes <= page size.
So here enable support for all block sizes, as long as they are no larger
than the page size, for experimental builds.
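Roughly, the policy described above for experimental builds amounts to the
following check (illustrative helper, not the exact btrfs code):

	static bool blocksize_supported(u32 blocksize)
	{
		/* Any power-of-two block size between the minimum and the page size. */
		return is_power_of_2(blocksize) &&
		       blocksize >= BTRFS_MIN_BLOCKSIZE &&
		       blocksize <= PAGE_SIZE;
	}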
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace max_t() followed by min_t() with a single clamp().
As was pointed out by David Laight in
https://lore.kernel.org/linux-btrfs/20250906122458.75dfc8f0@pumpkin/
the calculation may overflow u32 when the input value is too large, so
clamp_t() is not used. In practice the expected values are in range of
megabytes to gigabytes (throughput limit) so the bug would not happen.
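For illustration (variable names made up, this is not the actual btrfs call
site):

	/* Previously: limit = min_t(u64, max_t(u64, value, lower), upper); */
	static u64 clamp_limit(u64 value, u64 lower, u64 upper)
	{
		return clamp(value, lower, upper);
	}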
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
[ Use clamp() and add explanation. ]
Signed-off-by: David Sterba <dsterba@suse.com>
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using a text
spellchecker with a custom dictionary of acceptable terms.
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the compression workspace buffer size is always based on
PAGE_SIZE, but btrfs has supported subpage-sized block sizes for some time.
This means for a one-shot compression algorithm like LZO, we're wasting
quite some memory if the block size is smaller than the page size, as
LZO only works on one block (thus one-shot).
On 64K page sized systems with 4K block size, it means we only need at
most 8K of buffer space for LZO, but in reality we're allocating a 64K
buffer.
So to reduce the memory usage, change all workspace buffers to base their
size on the block size rather than the page size.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since all workspace managers are per-fs, there is no need nor way to
store them inside btrfs_compress_op::wsm anymore.
With that said, we can do the following modifications:
- Remove zstd_workspace_manager::ops
Zstd always grabs the global btrfs_compress_op[].
- Remove the btrfs_compress_op::wsm member
- Rename btrfs_compress_op to btrfs_compress_levels
This should make it more clear that the btrfs_compress_levels structures
are only there to indicate the levels of each compression algorithm.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since all workspaces are handled by the per-fs workspace managers, we
can safely remove the old per-module managers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several interfaces involved for each algorithm:
- alloc workspace
All algorithms allocate a workspace without the need for a workspace
manager.
So no change needs to be done.
- get workspace
This involves checking the workspace manager to find a free one and,
if none is found, allocating a new one.
For none and lzo, they share the same generic btrfs_get_workspace()
helper, so we only need to update that function to use the per-fs manager.
For zlib, it uses a wrapper around btrfs_get_workspace(), so no special
work is needed.
For zstd, update zstd_find_workspace() and zstd_get_workspace() to
utilize the per-fs manager.
- put workspace
For none/zlib/lzo they share the same btrfs_put_workspace(), update
that function to use the per-fs manager.
For zstd, it's zstd_put_workspace(), the same update.
- zstd specific timer
This is the timer to reclaim workspaces, change it to grab the per-fs
workspace manager instead.
Now all workspaces are managed by the per-fs manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves:
- Add (alloc|free)_workspace_manager helpers.
These are the helpers to alloc/free the workspace_manager structure.
The allocator will allocate a workspace_manager structure, initialize
it, and pre-allocate one workspace for it.
The freer will do the cleanup and set the manager pointer to NULL.
- Call alloc_workspace_manager() inside btrfs_alloc_compress_wsm()
- Call free_workspace_manager() inside btrfs_free_compress_wsm()
For none, zlib and lzo compression algorithms.
For now the generic per-fs workspace managers won't really have any effect,
and all compression is still going through the global workspace manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves:
- Add zstd_alloc_workspace_manager() and zstd_free_workspace_manager()
Those two functions will accept an fs_info pointer, and alloc/free
fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD] pointer.
- Add btrfs_alloc_compress_wsm() and btrfs_free_compress_wsm()
Those are helpers allocating the workspace managers for all
algorithms.
For now only zstd is supported, and the timing is a little unusual:
btrfs_alloc_compress_wsm() should only be called after the sectorsize
has been initialized.
Meanwhile btrfs_free_fs_info_compress() is called in
btrfs_free_fs_info().
- Move the definition of btrfs_compression_type to "fs.h"
The reason is that "compression.h" already includes "fs.h", thus we
cannot just include "compression.h" from "fs.h" to get the definition
of BTRFS_NR_COMPRESS_TYPES needed to define fs_info::compr_wsm[].
For now the per-fs zstd workspace manager won't really have any effect,
and all compression is still going through the global workspace manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Currently btrfs shares workspaces and their managers for all filesystems,
this is mostly fine as all those workspaces are using page size based
buffers, and btrfs only supports block size (bs) <= page size (ps).
This means even if bs < ps, we at most waste some buffer space in the
workspace, but everything will still work fine.
The problem here is that this limits our support for bs > ps cases.
A workspace may now need a larger buffer to handle bs > ps cases, but
since the pool has no way to distinguish different workspaces, a regular
workspace (which still uses a buffer size based on ps) can be passed to
a btrfs whose bs > ps.
In that case the buffer is not large enough, and will cause various
problems.
[ENHANCEMENT]
To prepare for the per-fs workspace migration, add an fs_info parameter
to all workspace related functions.
For now this new fs_info parameter is not yet utilized.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a very low chance that DEBUG_WARN() inside
btrfs_writepage_cow_fixup() can be triggered when
CONFIG_BTRFS_EXPERIMENTAL is enabled.
This only happens after run_delalloc_nocow() failed.
Unfortunately I haven't hit it for a while thus no real world dmesg for
now.
[CAUSE]
There is a race window where, after run_delalloc_nocow() failed, its error
handling can race with the writeback thread.
Before we hit run_delalloc_nocow(), there is an inode with the following
dirty pages: (4K page size, 4K block size, no large folio)
0         4K          8K          12K          16K
|/////////|///////////|///////////|////////////|

The inode also has the NODATACOW flag, and the above dirty range will go
through different extents during run_delalloc_range():

0         4K          8K          12K          16K
|  NOCOW  |    COW    |    COW    |   NOCOW    |
The race happens like this:
   writeback thread A             |        writeback thread B
----------------------------------+--------------------------------------
Writeback for folio 0             |
run_delalloc_nocow()              |
|- nocow_one_range()              |
|  For range [0, 4K), ret = 0     |
|                                 |
|- fallback_to_cow()              |
|  For range [4K, 8K), ret = 0    |
|  Folio 4K *UNLOCKED*            |
|                                 | Writeback for folio 4K
|- fallback_to_cow()              | extent_writepage()
|  For range [8K, 12K), failure   | |- writepage_delalloc()
|                                 | |
|- btrfs_cleanup_ordered_extents()| |
   |- btrfs_folio_clear_ordered() | |
   |  Folio 0 still locked, safe  | |
   |                              | |  Ordered extent already allocated.
   |                              | |  Nothing to do.
   |                              | |- extent_writepage_io()
   |                              | |  |- btrfs_writepage_cow_fixup()
   |- btrfs_folio_clear_ordered() | |     |
   |  Folio 4K held by thread B,  | |     |
   |  UNSAFE!                     | |     |- btrfs_test_ordered()
                                  | |     |  Cleared by thread A,
                                  | |     |
                                  | |     |- DEBUG_WARN();
This is only possible after run_delalloc_nocow() failure, as
cow_file_range() will keep all folios and io tree range locked, until
everything is finished or after error handling.
The root cause is that we allow fallback_to_cow() and nocow_one_range() to
unlock the folios after a successful run, so during error handling it's
no longer safe to use btrfs_cleanup_ordered_extents() as the folios are
already unlocked.
[FIX]
- Make fallback_to_cow() and nocow_one_range() keep folios locked after
a successful run
For fallback_to_cow() we can pass COW_FILE_RANGE_KEEP_LOCKED flag
into cow_file_range().
For nocow_one_range() we have to remove the PAGE_UNLOCK flag from
extent_clear_unlock_delalloc().
- Unlock folios if everything is fine in run_delalloc_nocow()
- Use extent_clear_unlock_delalloc() to handle range [@start,
@cur_offset) inside run_delalloc_nocow()
Since folios are still locked, we do not need
cleanup_dirty_folios() to do the cleanup.
extent_clear_unlock_delalloc() with "PAGE_START_WRITEBACK |
PAGE_END_WRITEBACK" will clear the dirty flags.
- Remove cleanup_dirty_folios()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we hit an error inside nocow_one_range(), we do not clear
the page dirty flag and let the caller handle it.
This is very different from fallback_to_cow(): when that function fails,
everything will be cleaned up by cow_file_range().
Enhance the situation by:
- Use a common error handling for nocow_one_range()
If we failed anything, use the same btrfs_cleanup_ordered_extents()
and extent_clear_unlock_delalloc().
btrfs_cleanup_ordered_extents() is safe even if we haven't created a new
ordered extent, in that case there should be no OE and that function
will do nothing.
The same applies to extent_clear_unlock_delalloc(), and since we're
passing PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK, it
will also clear folio dirty flag during error handling.
- Avoid touching the failed range of nocow_one_range()
As the failed range will be cleaned up and unlocked by that function.
Here we introduce a new variable @nocow_end to record the failed range,
so that we can skip it during the error handling of run_delalloc_nocow().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running emulated write error tests like generic/475, we can hit
error messages like this:
BTRFS error (device dm-12 state EA): run_delalloc_nocow failed, root=596 inode=264 start=1605632 len=73728: -5
BTRFS error (device dm-12 state EA): failed to run delalloc range, root=596 ino=264 folio=1605632 submit_bitmap=0-7 start=1605632 len=73728: -5
Which is normally buried by direct IO error messages.
However above error messages are not enough to determine which is the
real range that caused the error.
Considering we can have multiple different extents in one delalloc
range (e.g. some COW extents along with some NOCOW extents), just
outputting the error at the end of run_delalloc_nocow() is not enough.
To enhance the error messages:
- Remove the rate limit on the existing error messages
In the generic/475 example, most error messages are from direct IO,
not really from the delalloc range.
Considering how useful the delalloc range error messages are, we don't
want them to be rate limited.
- Add extra @cur_offset output for cow_file_range()
- Add extra variable output for run_delalloc_nocow()
This is especially important for run_delalloc_nocow(), as there
are extra error paths where we can hit an error without going into
nocow_one_range() nor fallback_to_cow().
- Add an error message for nocow_one_range()
That's the missing part.
For fallback_to_cow(), we have error message from cow_file_range()
already.
- Constify the @len and @end local variables for nocow_one_range()
This makes it much easier to make sure @len and @end are not modified
at runtime.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 79525b51ac ("io_uring: fix nvme's 32b cqes on mixed cq") split
out a separate io_uring_cmd_done32() helper for ->uring_cmd()
implementations that return 32-byte CQEs. The res2 value passed to
io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores
it when is_cqe32 is passed as false. So drop the parameter from
io_uring_cmd_done() to simplify the callers and clarify that it's not
possible to return an extra value beyond the 32-bit CQE result.
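The resulting prototypes look roughly like this (parameter types are
assumed from the description above, not quoted from the header):

	/* Regular 16-byte CQE completion: no room for an extra value. */
	void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
			       unsigned int issue_flags);

	/* 32-byte CQE completion keeps the extra res2 value. */
	void io_uring_cmd_done32(struct io_uring_cmd *cmd, ssize_t ret, u64 res2,
				 unsigned int issue_flags);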
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the error handling of run_delalloc_nocow() modifies
@cur_offset to handle different parts of the delalloc range.
However the error handling can always be split into 3 parts:
1) The range with ordered extents allocated (OE cleanup)
We have to cleanup the ordered extents and unlock the folios.
2) The range that has been cleaned up (skip)
For now it's only for the range of fallback_to_cow().
We should not touch it at all.
3) The range that is not yet touched (untouched)
We have to unlock the folios and clear any reserved space.
This 3-range split follows the same principle as cow_file_range(), however
the NOCOW/COW handling makes the above 3-range split much more complex:
a) Failure immediately after a successful OE allocation
   Thus no @cow_start nor @cow_end set.

   start            cur_offset                end
   |   OE cleanup   |        untouched        |

b) Failure after hitting a COW range but before calling
   fallback_to_cow()

   start        cow_start     cur_offset            end
   | OE Cleanup |             untouched             |

c) Failure to call fallback_to_cow()

   start        cow_start        cow_end           end
   | OE Cleanup |      skip      |    untouched    |
Instead of modifying @cur_offset, do proper range calculation for the
OE-cleanup and untouched ranges using the above 3 cases with proper range
charts.
This avoids updating @cur_offset, as it will be an extra output for debug
purposes later, and it makes the behavior easier to explain.
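Schematically, the range calculation boils down to something like this (the
real code tracks the 'not set' state with sentinel values; booleans stand
in for that here):

	static void nocow_error_ranges(u64 cur_offset, bool cow_started, u64 cow_start,
				       bool cow_failed, u64 cow_end,
				       u64 *oe_cleanup_end, u64 *untouched_start)
	{
		if (cow_failed) {		/* Case c) */
			*oe_cleanup_end = cow_start;
			*untouched_start = cow_end + 1;
		} else if (cow_started) {	/* Case b) */
			*oe_cleanup_end = cow_start;
			*untouched_start = cow_start;
		} else {			/* Case a) */
			*oe_cleanup_end = cur_offset;
			*untouched_start = cur_offset;
		}
	}

The OE-cleanup range is then [start, oe_cleanup_end) and the untouched
range is [untouched_start, end].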
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ref_tracker infrastructure aids debugging but is not enabled by
default as it has a performance impact. Add mount option 'ref_tracker'
so it can be selectively enabled on a filesystem. Currently it tracks
references of 'delayed inodes'.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are seeing soft lockups in kill_all_delayed_nodes due to a
delayed_node with a lingering reference count of 1. Printing at this
point will reveal the guilty stack trace. If the delayed_node has no
references there should be no output.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add ref_tracker infrastructure for struct btrfs_delayed_node.
It is a response to the largest btrfs related crash in our fleet. We're
seeing soft lockups in btrfs_kill_all_delayed_nodes() that seem to be a
result of delayed_nodes not being released properly.
A ref_tracker object is allocated on reference count increases and freed
on reference count decreases. The ref_tracker object stores a stack
trace of where it is allocated. The ref_tracker_dir object is embedded
in btrfs_delayed_node and keeps track of all current and some old/freed
ref_tracker objects. When a leak is detected we can print the stack
traces for all ref_trackers that have not yet been freed.
Here is a common example of taking a reference to a delayed_node and
freeing it with ref_tracker.
	struct btrfs_ref_tracker tracker;
	struct btrfs_delayed_node *node;

	node = btrfs_get_delayed_node(inode, &tracker);
	// use delayed_node...
	btrfs_release_delayed_node(node, &tracker);
There are two special cases where the delayed_node reference is "long
lived", meaning that the thread that takes the reference and the thread
that releases the reference are different. The 'inode_cache_tracker'
tracks the delayed_node stored in btrfs_inode. The 'node_list_tracker'
tracks the delayed_node stored in the btrfs_delayed_root
node_list/prepare_list. These trackers are embedded in the
btrfs_delayed_node.
btrfs_ref_tracker and btrfs_ref_tracker_dir are wrappers that either
compile to the corresponding ref_tracker structs or empty structs
depending on CONFIG_BTRFS_DEBUG. There are also btrfs wrappers for
the ref_tracker API.
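A sketch of the wrapper idea (the exact definitions here are assumed, not
copied from the patch):

	#include <linux/ref_tracker.h>

	#ifdef CONFIG_BTRFS_DEBUG
	struct btrfs_ref_tracker {
		struct ref_tracker *tracker;
	};
	struct btrfs_ref_tracker_dir {
		struct ref_tracker_dir dir;
	};
	#else
	/* Compile away to empty structs when debugging support is not built in. */
	struct btrfs_ref_tracker { };
	struct btrfs_ref_tracker_dir { };
	#endif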
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're almost done cleaning misused int/bool parameters. Convert a bunch
of them, found by manual grepping. Note that btrfs_sync_fs() needs an
int as it's mandated by the struct super_operations prototype.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove CONFIG_BTRFS_FS_REF_VERIFY Kconfig and add it as part of
CONFIG_BTRFS_DEBUG. This should not be impactful to the performance
of debug builds. The struct btrfs_ref takes an additional u64, btrfs_fs_info
takes an additional spinlock_t and rb_root. All of the ref_verify logic
is still protected by a mount option.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the standard error pointer macro to simplify the code.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we manually check the block size against 3 different values:
- 4K
- PAGE_SIZE
- MIN_BLOCKSIZE
Those 3 values can match or differ from each other. This makes it
pretty complex to output the supported block sizes.
Considering we're going to add block size > page size support soon, this
can make the supported block sizes sysfs attribute much harder to
implement.
To make it easier, factor out a helper, btrfs_supported_blocksize() to
do a simple check for the block size.
Then utilize it in the two locations:
- btrfs_validate_super()
This is very straightforward
- supported_sectorsizes_show()
Iterate through all valid block sizes, and only output supported ones.
This is to make future full range block sizes support much easier.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BEHAVIOR DIFFERENCE BETWEEN COMPRESSION ALGOS]
Currently the LZO compression algorithm checks whether we're making the
compressed data larger after compressing more than 2 blocks, but zlib
and zstd do the same check after compressing more than 8192 bytes.
This is not a big deal, but since we're already supporting larger block
size (e.g. 64K block size if page size is also 64K), this check is not
suitable for all block sizes.
For example, if our page and block size are both 16KiB, and after the
first block is compressed using zlib the resulting compressed data is
slightly larger than 16KiB, we will immediately abort the compression.
This makes the zstd and zlib compression algorithms behave slightly
differently from LZO, which only aborts after compressing two blocks.
[ENHANCEMENT]
To unify the behavior, only abort the compression after compressing at
least two blocks.
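In other words, the bail-out condition becomes block size based for every
algorithm, along the lines of (illustrative, names assumed):

	static bool should_abort_compression(u64 total_in, u64 total_compressed,
					     u32 blocksize)
	{
		/* Only give up once at least two blocks have been consumed. */
		return total_in >= 2 * blocksize && total_compressed >= total_in;
	}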
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the 3 supported compression algorithms, two of them (zstd and zlib)
are already grabbing the btrfs inode for error messages.
It's more common to pass btrfs_inode and grab the address space from it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The hint block group selection in the extent allocator is wrong in the
first place, as it can select the dedicated data relocation block group for
the normal data allocation.
Since we separated the normal data space_info and the data relocation
space_info, we can easily identify whether a block group is for data
relocation or not. Do not choose it for the normal data allocation.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If you run a workload with:
- a cgroup that does tons of parallel data reading, with a working set
much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
with no memory limit
(see full code listing at the bottom for a reproducer)
Then what quickly occurs is:
- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
memory pressure
The result of this is that we writeback the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.
As a result of this, we achieve an unpleasant priority inversion. We
have:
- a high degree of contention on the csum root node (and other upper
nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
as readers, but not scheduling promptly. i.e., task __state == 0, but
not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
(running delayed_refs, deleting csums for unreferenced data extents)
This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).
It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.
The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.
As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.
Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem is only taken for writing once, near the
end of the transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.
This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.
Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.
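The gist of the tracking can be sketched like this (the field names on
bio_ctrl and the btrfs_bio are illustrative assumptions, not the exact
ones from the patch):

	/* While adding an extent_map to the bio, remember the newest generation. */
	bio_ctrl->max_em_generation = max(bio_ctrl->max_em_generation,
					  em->generation);

	/*
	 * At submission time the csum lookup may use the commit root only if
	 * no extent_map in the bio comes from the running transaction.
	 */
	bbio->use_commit_root =
		bio_ctrl->max_em_generation < btrfs_get_fs_generation(fs_info);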
To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):
====== big-read.c ======
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#define BUF_SZ (128 * (1 << 10UL))
int read_once(int fd, size_t sz) {
	char buf[BUF_SZ];
	size_t rd = 0;
	int ret = 0;

	while (rd < sz) {
		ret = read(fd, buf, BUF_SZ);
		if (ret < 0) {
			if (errno == EINTR)
				continue;
			fprintf(stderr, "read failed: %d\n", errno);
			return -errno;
		} else if (ret == 0) {
			break;
		} else {
			rd += ret;
		}
	}
	return rd;
}

int read_loop(char *fname) {
	int fd;
	struct stat st;
	size_t sz = 0;
	int ret;

	while (1) {
		fd = open(fname, O_RDONLY);
		if (fd == -1) {
			perror("open");
			return 1;
		}
		if (!sz) {
			if (!fstat(fd, &st)) {
				sz = st.st_size;
			} else {
				perror("stat");
				return 1;
			}
		}
		ret = read_once(fd, sz);
		close(fd);
	}
}

int main(int argc, char *argv[]) {
	int fd;
	struct stat st;
	off_t sz;
	char *buf;
	int ret;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
		return 1;
	}
	return read_loop(argv[1]);
}
====== repro.sh ======
#!/usr/bin/env bash
SCRIPT=$(readlink -f "$0")
DIR=$(dirname "$SCRIPT")
dev=$1
mnt=$2
shift
shift
CG_ROOT=/sys/fs/cgroup
BAD_CG=$CG_ROOT/bad-nbr
GOOD_CG=$CG_ROOT/good-nbr
NR_BIGGOS=1
NR_LITTLE=10
NR_VICTIMS=32
NR_VILLAINS=512
START_SEC=$(date +%s)
_elapsed() {
echo "elapsed: $(($(date +%s) - $START_SEC))"
}
_stats() {
local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)
echo "================"
date
_elapsed
cat $sysfs/commit_stats
cat $BAD_CG/memory.pressure
}
_setup_cgs() {
echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
mkdir -p $GOOD_CG
mkdir -p $BAD_CG
echo max > $BAD_CG/memory.max
# memory.high much less than the working set will cause heavy reclaim
echo $((1 << 30)) > $BAD_CG/memory.high
# victims get a subset of villain CPUs
echo 0 > $GOOD_CG/cpuset.cpus
echo 0,1,2,3 > $BAD_CG/cpuset.cpus
}
_kill_cg() {
local cg=$1
local attempts=0
echo "kill cgroup $cg"
[ -f $cg/cgroup.procs ] || return
while true; do
attempts=$((attempts + 1))
echo 1 > $cg/cgroup.kill
sleep 1
procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
[ $procs -eq 0 ] && break
done
rmdir $cg
echo "killed cgroup $cg in $attempts attempts"
}
_biggo_vol() {
echo $mnt/biggo_vol.$1
}
_biggo_file() {
echo $(_biggo_vol $1)/biggo
}
_subvoled_biggos() {
total_sz=$((10 << 30))
per_sz=$((total_sz / $NR_VILLAINS))
dd_count=$((per_sz >> 20))
echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
for i in $(seq $NR_VILLAINS)
do
btrfs subvol create $(_biggo_vol $i) &>/dev/null
dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
done
echo "done creating subvols."
}
_setup() {
[ -f .done ] && rm .done
findmnt -n $dev && exit 1
if [ -f .re-mkfs ]; then
mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
else
echo "touch .re-mkfs to populate the test fs"
fi
mount -o noatime $dev $mnt || exit 3
[ -f .re-mkfs ] && _subvoled_biggos
_setup_cgs
}
_my_cleanup() {
echo "CLEANUP!"
_kill_cg $BAD_CG
_kill_cg $GOOD_CG
sleep 1
umount $mnt
}
_bad_exit() {
_err "Unexpected Exit! $?"
_stats
exit $?
}
trap _my_cleanup EXIT
trap _bad_exit INT TERM
_setup
# Use a lot of page cache reading the big file
_villain() {
local i=$1
echo $BASHPID > $BAD_CG/cgroup.procs
$DIR/big-read $(_biggo_file $i)
}
# Hit del_csum a lot by overwriting lots of small new files
_victim() {
echo $BASHPID > $GOOD_CG/cgroup.procs
i=0;
while (true)
do
local tmp=$mnt/tmp.$i
dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
i=$((i+1))
[ $i -eq $NR_LITTLE ] && i=0
done
}
_one_sync() {
echo "sync..."
before=$(date +%s)
sync
after=$(date +%s)
echo "sync done in $((after - before))s"
_stats
}
# sync in a loop
_sync() {
echo "start sync loop"
syncs=0
echo $BASHPID > $GOOD_CG/cgroup.procs
while true
do
[ -f .done ] && break
_one_sync
syncs=$((syncs + 1))
[ -f .done ] && break
sleep 10
done
if [ $syncs -eq 0 ]; then
echo "do at least one sync!"
_one_sync
fi
echo "sync loop done."
}
_sleep() {
local time=${1-60}
local now=$(date +%s)
local end=$((now + time))
while [ $now -lt $end ];
do
echo "SLEEP: $((end - now))s left. Sleep 10."
sleep 10
now=$(date +%s)
done
}
echo "start $NR_VILLAINS villains"
for i in $(seq $NR_VILLAINS)
do
_villain $i &
disown # get rid of annoying log on kill (done via cgroup anyway)
done
echo "start $NR_VICTIMS victims"
for i in $(seq $NR_VICTIMS)
do
_victim &
disown
done
_sync &
SYNC_PID=$!
_sleep $1
_elapsed
touch .done
wait $SYNC_PID
echo "OK"
exit 0
Without this patch, that reproducer:
- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes
sync done in 388s
================
Wed Jul 9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274
419 hits state R, D comms big-read
btrfs_tree_read_lock_nested
btrfs_read_lock_root_node
btrfs_search_slot
btrfs_lookup_csum
btrfs_lookup_bio_sums
btrfs_submit_bbio
1 hits state D comms btrfs-transacti
btrfs_tree_lock_nested
btrfs_lock_root_node
btrfs_search_slot
btrfs_del_csums
__btrfs_run_delayed_refs
btrfs_run_delayed_refs
With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (couple seconds) but never one
that is minutes long.
sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168
some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
shrink_folio_list
shrink_lruvec
shrink_node
do_try_to_free_pages
try_to_free_mem_cgroup_pages
reclaim_high
Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
In messages.h there's linux/types.h included more than once.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22939
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_zone_finish_endio_workfn() is directly calling
do_zone_finish() the only caller of btrfs_zone_finish_endio() is
btrfs_finish_one_ordered().
btrfs_finish_one_ordered() already has error handling in-place so
btrfs_zone_finish_endio() can return an error if the block group lookup
fails.
Also as btrfs_zone_finish_endio() already checks for zoned filesystems and
returns early, there's no need to do this in the caller.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs_zone_finish_endio_workfn() is calling btrfs_zone_finish_endio()
it already has a pointer to the block group. Furthermore
btrfs_zone_finish_endio() does additional checks if the block group can be
finished or not.
But in the context of btrfs_zone_finish_endio_workfn() only the actual
call to do_zone_finish() is of interest, as the skipping condition when
there is still room to allocate from the block group cannot be checked.
Directly call do_zone_finish() on the block group.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller of unaccount_log_buffer() and both this
function and the caller are short, so move its code into the caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of extracting again the disk_bytenr and disk_num_bytes values from
the file extent item to pass to btrfs_qgroup_trace_extent(), use the key
local variable 'ins' which already has those values, reducing the size of
the source code.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of having an if statement to check for regular and prealloc
extents first, process them in a block, and then follow with an else
statement to check for an inline extent, check for an inline extent first,
process it and jump to the 'update_inode' label. This allows us to avoid
having the code for processing regular and prealloc extents inside a
block, reducing the high indentation level by one and making the code
easier to read, avoiding some line splits too.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At replay_one_extent(), we can jump to the code that updates the file
extent range and updates the inode when processing a file extent item that
represents a hole and we don't have the NO_HOLES feature enabled. This
helps reduce the high indentation level by one in replay_one_extent() and
avoid splitting some lines to make the code easier to read.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the replay_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into error state if we don't have one. While
this reduces code it makes it harder to figure out where exactly an error
came from. So add the transaction aborts after every failure inside the
replay_one_buffer() callback and the functions it calls, making it as
fine grained as possible, so that it helps figuring out why failures
happen.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If read_alloc_one_name() fails we explicitly return -ENOMEM and currently
that is fine since it's the only error read_alloc_one_name() can return
for now. However this is fragile and not future proof, so return instead
what read_alloc_one_name() returned.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of repeatedly dereferencing the walk_control structure to extract
the transaction handle whenever it is needed, do it once by storing it in a
local variable and then use the variable everywhere. This reduces code verbosity
and eliminates the need for some split lines.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the process_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into error state if we don't have one. While
this reduces code it makes it harder to figure out where exactly an error
came from. So add the transaction aborts after every failure inside the
process_one_buffer() callback, so that it helps figuring out why failures
happen.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do several things while walking a log tree (for replaying and for
freeing a log tree) like reading extent buffers and cleaning them up,
but we don't immediately abort the transaction, or turn the fs into an
error state, when one of these things fails. Instead we do the transaction
abort or turn the fs into an error state in the caller of the entry point
function that walks a log tree - walk_log_tree() - which means we don't
get to know exactly where an error came from.
Improve on this by doing a transaction abort / turn fs into error state
after each such failure so that when it happens we have a better
understanding where the failure comes from. This deliberately leaves
the transaction abort / turn fs into error state in the callers of
walk_log_tree() so as to ensure we don't get into an inconsistent state in
case we forget to do it deeper in the call chain. It also deliberately does
not do it after errors from the calls to the callback defined in
struct walk_control::process_func(), as we will do that later in another
patch.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function cow_file_range() has two boolean parameters. Replace them
with a single @flags parameter, with two flags:
- COW_FILE_RANGE_NO_INLINE
- COW_FILE_RANGE_KEEP_LOCKED
And since we're here, also update the comments of cow_file_range() to
replace the old "page" usage with "folio".
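A hedged sketch of the new interface, assuming the flag names listed
above (the exact bit values and the enum vs. define choice are
illustrative):

  enum {
          COW_FILE_RANGE_NO_INLINE   = (1 << 0),
          COW_FILE_RANGE_KEEP_LOCKED = (1 << 1),
  };

  /* A caller that previously passed two booleans, e.g. (..., true, true),
   * now passes a combined value: */
  flags = COW_FILE_RANGE_NO_INLINE | COW_FILE_RANGE_KEEP_LOCKED;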
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull a few more btrfs fixes from David Sterba:
- in tree-checker, fix wrong size of check for inode ref item
- in ref-verify, handle combination of mount options that allow
partially damaged extent tree (reported by syzbot)
- additional validation of compression mount option to catch invalid
string as level
* tag 'for-6.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reject invalid compression level
btrfs: ref-verify: handle damaged extent root tree
btrfs: tree-checker: fix the incorrect inode ref size check
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This patch adds a new WQ_PERCPU flag to all the fs subsystem users to
explicitly request the use of the per-CPU behavior. Both flags coexist
for one release cycle to allow callers to transition their calls.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
All existing users have been updated accordingly.
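A hedged example of such an opt-in (the workqueue name and the companion
WQ_MEM_RECLAIM flag are made up for illustration):

  struct workqueue_struct *wq;

  wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM | WQ_PERCPU, 0);
  if (!wq)
          return -ENOMEM;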
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-4-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
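A hedged usage example ('my_work' is a made-up work item):

  /* Unbound by default: no locality constraint on where it runs. */
  queue_work(system_dfl_wq, &my_work);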
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Inspired by recent changes to compression level parsing in
6db1df415d ("btrfs: accept and ignore compression level for lzo")
it turns out that we do not do any extra validation of the compression
level input string, thus allowing things like "compress=lzo:invalid" to
be accepted without warnings.
Although we accept levels that are beyond the supported algorithm
ranges, accepting a completely invalid level specification is not correct.
Fix the too loose checks for compression level, by doing proper error
handling of kstrtoint(), so that we will reject not only too large
values (beyond int range) but also completely wrong levels like
"lzo:invalid".
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Inside check_inode_ref(), we need to make sure every structure,
including the btrfs_inode_extref header, is covered by the item. But
our code is incorrectly using "sizeof(iref)", where @iref is just a
pointer.
This means "sizeof(iref)" will always be "sizeof(void *)", which is much
smaller than "sizeof(struct btrfs_inode_extref)".
This will allow some bad inode extrefs to sneak in, defeating tree-checker.
[FIX]
Fix the typo by using "sizeof(*iref)", which is the same as
"sizeof(struct btrfs_inode_extref)", and gives the correct behavior we
want.
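The bug class is easy to demonstrate in isolation; the structure below is
a stand-in with the same field sizes, not the real btrfs_inode_extref
definition:

  #include <stdio.h>

  struct extref_example {
          unsigned long long parent_objectid;
          unsigned long long index;
          unsigned short name_len;
  } __attribute__((packed));

  int main(void)
  {
          struct extref_example *iref = NULL;

          printf("sizeof(iref)  = %zu\n", sizeof(iref));   /* pointer size, e.g. 8 */
          printf("sizeof(*iref) = %zu\n", sizeof(*iref));  /* structure size, 18 */
          return 0;
  }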
Fixes: 71bf92a9b8 ("btrfs: tree-checker: Add check for INODE_REF")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- in zoned mode, turn assertion to proper code when reserving space in
relocation block group
- fix search key of extended ref (hardlink) when replaying log
- fix initialization of file extent tree on filesystems without
no-holes feature
- add harmless data race annotation to block group comparator
* tag 'for-6.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: annotate block group access with data_race() when sorting for reclaim
btrfs: initialize inode::file_extent_tree after i_mode has been set
btrfs: zoned: fix incorrect ASSERT in btrfs_zoned_reserve_data_reloc_bg()
btrfs: fix invalid extref key setup when replaying dentry
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.
The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the prefix.
No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When sorting the block group list for reclaim we are using a block group's
used bytes counter without taking the block group's spinlock, so we can
race with a concurrent task updating it (at btrfs_update_block_group()),
which makes tools like KCSAN unhappy and report a race.
Since the sorting is not strictly needed from a functional perspective
and such races should rarely cause any ordering changes (only load/store
tearing could cause them), not to mention that after the sorting the
ordering may no longer be accurate due to concurrent allocations and
deallocations of extents in a block group, annotate the accesses to the
used counter with data_race() to silence KCSAN and similar tools.
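A hedged sketch of such an annotation in a list sort comparator (function
and field names are illustrative, not copied from the btrfs code):

  static int bg_used_cmp(void *unused, const struct list_head *a,
                         const struct list_head *b)
  {
          const struct btrfs_block_group *bg1, *bg2;

          bg1 = list_entry(a, struct btrfs_block_group, bg_list);
          bg2 = list_entry(b, struct btrfs_block_group, bg_list);

          /* Lockless reads of ->used, tolerated and annotated as benign. */
          return data_race(bg1->used) < data_race(bg2->used);
  }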
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_file_extent_tree() uses S_ISREG() to determine if the file is
a regular file. At the beginning of btrfs_read_locked_inode(), i_mode
hasn't been read from the inode item yet, so the file_extent_tree won't be
used at all on filesystems without the NO_HOLES feature.
Fix this by calling btrfs_init_file_extent_tree() after i_mode is
initialized in btrfs_read_locked_inode().
Fixes: 3d7db6e8bd ("btrfs: don't allocate file extent tree for non regular files")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: austinchang <austinchang@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When moving a block-group to the dedicated data relocation space-info in
btrfs_zoned_reserve_data_reloc_bg() it is asserted that the newly
created block group for data relocation does not contain any
zone_unusable bytes.
But on disks with zone_capacity < zone_size, the difference between
zone_size and zone_capacity is accounted as zone_unusable.
Instead of asserting that the block group does not contain any
zone_unusable bytes, remove them from the block group's total size.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs8-cS2E+-xQ-d2Bj6vMJZ+CwT_cbdWBTju4BV35LsvEYw@mail.gmail.com/
Fixes: daa0fde322 ("btrfs: zoned: fix data relocation block group reservation")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The offset for an extref item's key is not the object ID of the parent
dir, otherwise we would not need the extref item and would use plain ref
items. Instead the offset is the result of a hash computation that uses
the object ID of the parent dir and the name associated to the entry.
So fix this by setting the key offset at replay_one_name() to be the
result of calling btrfs_extref_hash().
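A hedged sketch of the corrected key setup (local variable names are
illustrative and the btrfs_extref_hash() argument order is from memory):

  key.objectid = btrfs_ino(inode);
  key.type = BTRFS_INODE_EXTREF_KEY;
  /* Not the parent directory's objectid, but a hash of it and the name. */
  key.offset = btrfs_extref_hash(btrfs_ino(dir), name.name, name.len);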
Fixes: 725af92a62 ("btrfs: Open-code name_in_log_ref in replay_one_name")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_buffers are global and shared so their pages should not belong to
any particular cgroup (currently they belong to whichever cgroup happens to
allocate the extent_buffer).
Btrfs tree operations should not arbitrarily block on cgroup reclaim or
have the shared extent_buffer pages on a cgroup's reclaim lists.
Link: https://lkml.kernel.org/r/2ee99832619a3fdfe80bf4dc9760278662d2d746.1755812945.git.boris@bur.io
Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: syzbot@syzkaller.appspotmail.com
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qu Wenruo <wqu@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At this point MIGRATEPAGE_SUCCESS is misnamed for all folio users,
and now that we remove MIGRATEPAGE_UNMAP, it's really the only "success"
return value that the code uses and expects.
Let's just get rid of MIGRATEPAGE_SUCCESS completely and just use "0"
for success.
Link: https://lkml.kernel.org/r/20250811143949.1117439-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com> [mm]
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> [jfs]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Byungchul Park <byungchul@sk.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'for-6.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix delayed inode tracking in xarray, eviction can race with
insertion and leave behind a disconnected inode
- on systems with large page (64K) and small block size (4K) fix
compression read that can return partially filled folio
- slightly relax compression option format for backward compatibility,
allow to specify level for LZO although there's only one
- fix simple quota accounting of compressed extents
- validate minimum device size in 'device add'
- update maintainers' entry
* tag 'for-6.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't allow adding block device of less than 1 MB
MAINTAINERS: update btrfs entry
btrfs: fix subvolume deletion lockup caused by inodes xarray race
btrfs: fix corruption reading compressed range when block size is smaller than page size
btrfs: accept and ignore compression level for lzo
btrfs: fix squota compressed stats leak
Commit 15ae0410c37a79 ("btrfs-progs: add error handling for
device_get_partition_size_fd_stat()") in btrfs-progs inadvertently
changed it so that if the BLKGETSIZE64 ioctl on a block device returned
a size of 0, this was no longer seen as an error condition.
Unfortunately this is how disconnected NBD devices behave, meaning that
with btrfs-progs 6.16 it's now possible to add a device you can't
remove:
# btrfs device add /dev/nbd0 /root/temp
# btrfs device remove /dev/nbd0 /root/temp
ERROR: error removing device '/dev/nbd0': Invalid argument
This check should always have been done kernel-side anyway, so add a
check in btrfs_init_new_device() that the new device doesn't have a size
less than BTRFS_DEVICE_RANGE_RESERVED (i.e. 1 MB).
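A hedged sketch of such a check (its exact placement inside
btrfs_init_new_device() and the returned error are illustrative):

  if (bdev_nr_bytes(bdev) < BTRFS_DEVICE_RANGE_RESERVED) {
          ret = -EINVAL;
          goto error;
  }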
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix a few races related to inode link count
- fix inode leak on failure to add link to inode
- move transaction aborts closer to where they happen
* tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid load/store tearing races when checking if an inode was logged
btrfs: fix race between setting last_dir_index_offset and inode logging
btrfs: fix race between logging inode and checking if it was logged before
btrfs: simplify error handling logic for btrfs_link()
btrfs: fix inode leak on failure to add link to inode
btrfs: abort transaction on failure to add link to inode
There is a race condition between inode eviction and inode caching that
can cause a live struct btrfs_inode to be missing from the root->inodes
xarray. Specifically, there is a window during evict() between the inode
being unhashed and deleted from the xarray. If btrfs_iget() is called
for the same inode in that window, it will be recreated and inserted
into the xarray, but then eviction will delete the new entry, leaving
nothing in the xarray:
Thread 1                                 Thread 2
---------------------------------------------------------------
evict()
  remove_inode_hash()
                                         btrfs_iget_path()
                                           btrfs_iget_locked()
                                           btrfs_read_locked_inode()
                                             btrfs_add_inode_to_root()
  destroy_inode()
    btrfs_destroy_inode()
      btrfs_del_inode_from_root()
        __xa_erase
In turn, this can cause issues for subvolume deletion. Specifically, if
an inode is in this lost state, and all other inodes are evicted, then
btrfs_del_inode_from_root() will call btrfs_add_dead_root() prematurely.
If the lost inode has a delayed_node attached to it, then when
btrfs_clean_one_deleted_snapshot() calls btrfs_kill_all_delayed_nodes(),
it will loop forever because the delayed_nodes xarray will never become
empty (unless memory pressure forces the inode out). We saw this
manifest as soft lockups in production.
Fix it by only deleting the xarray entry if it matches the given inode
(using __xa_cmpxchg()).
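A hedged, simplified sketch of the conditional removal (locking and
surrounding context omitted):

  /* Erase the slot only if it still points at this inode; an entry
   * re-inserted by a concurrent btrfs_iget() is left alone. */
  __xa_cmpxchg(&root->inodes, btrfs_ino(inode), inode, NULL, GFP_ATOMIC);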
Fixes: 310b2f5d5a ("btrfs: use an xarray to track open inodes in a root")
Cc: stable@vger.kernel.org # 6.11+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-authored-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount -o compress $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
echo "correct result:"
od -Ad -t x1 $mnt/base
xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
-c "reflink $mnt/base 0 32k 32k" \
-c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
echo "incorrect result:"
od -Ad -t x1 $mnt/new
umount $mnt
This shows the following result:
correct result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
incorrect result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
Notice the zero in the range [32K, 60K), which is incorrect.
[CAUSE]
With extra trace printk, it shows the following events during od:
(some unrelated info removed like CPU and context)
od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
The "r/i" is indicating the root and inode number. In our case the file
"new" is using ino 258 from fs tree (root 5).
Here notice the @prev_em_start pointer is NULL. This means the
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
These above 32K blocks will be read from the first half of the
compressed data extent.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
Note that there is no btrfs_submit_compressed_read() call here, which is
incorrect at this point.
Although both extent maps at 0 and 32K point to the same compressed
data, their offsets are different and thus they cannot be merged into the
same read.
So this means the compressed data read merge check is doing something
wrong.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
The function btrfs_submit_compressed_read() is only called at the end of
the folio read. The compressed bio will only have an extent map of range [0,
32K), but the original bio passed in is for the whole 64K folio.
This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (i.e. filled with zeros).
This incorrect compressed read merge leads to the above data corruption.
There were similar problems in the past; commit 808f80b467
("Btrfs: update fix for read corruption of compressed and shared
extents") did pretty much the same fix for readahead.
But that was back in 2015, when btrfs only supported the bs (block size)
== ps (page size) case.
This means btrfs_do_readpage() only needs to handle a folio which
contains exactly one block.
Only btrfs_readahead() can lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
With v5.15 kernel btrfs introduced bs < ps support. This breaks the above
assumption that a folio can only contain one block.
Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.
In theory, this can also happen for btrfs with large folios, but since
large folios are still experimental we don't need to bother with them, thus
only bs < ps support is affected for now.
[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression level is meaningless for lzo, but before commit
3f093ccb95 ("btrfs: harden parsing of compression mount options"),
it was silently ignored if passed.
After that commit, passing a level with lzo fails to mount:
BTRFS error: unrecognized compression value lzo:1
It seems reasonable for users to expect that lzo would permit a numeric
level option, as all the other algos do, even though the kernel's
implementation of LZO currently only supports a single level. Because it
has always worked to pass a level, it seems likely to me that users in
the real world are relying on doing so.
This patch restores the old behavior, giving "lzo:N" the same semantics
as all of the other compression algos.
To be clear, silly variants like "lzo:one", "lzo:the_first_option", or
"lzo:armageddon" also used to work. This isn't meant to suggest that
any possible mis-interpretation of mount options that once worked must
continue to work forever. This is an exceptional case where it makes
sense to preserve compatibility, both because the mis-interpretation is
reasonable, and because nothing tangible is sacrificed.
Finally update btrfs_show_options() to ignore the level of LZO, as it
is only the default level without any extra meaning.
Fixes: 3f093ccb95 ("btrfs: harden parsing of compression mount options")
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following workload on a squota enabled fs:
btrfs subvol create mnt/subvol
# ensure subvol extents get accounted
sync
btrfs qgroup create 1/1 mnt
btrfs qgroup assign mnt/subvol 1/1 mnt
btrfs qgroup delete mnt/subvol
# make the cleaner thread run
btrfs filesystem sync mnt
sleep 1
btrfs filesystem sync mnt
btrfs qgroup destroy 1/1 mnt
will fail with EBUSY. The reason is that 1/1 does the quick accounting
when we assign subvol to it, gaining its exclusive usage as excl and
excl_cmpr. But then when we delete subvol, the decrement happens via
record_squota_delta() which does not update excl_cmpr, as squotas does
not make any distinction between compressed and normal extents. Thus,
we increment excl_cmpr but never decrement it, and are unable to delete
1/1. The two possible fixes are to make squota always mirror excl and
excl_cmpr or to make the fast accounting separately track the plain and
cmpr numbers. The latter felt cleaner to me so that is what I opted for.
Fixes: 1e0e9d5771 ("btrfs: add helper for recording simple quota deltas")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
At inode_logged() we do a couple lockless checks for ->logged_trans, and
these are generally safe except the second one in case we get a load or
store tearing due to a concurrent call updating ->logged_trans (either at
btrfs_log_inode() or later at inode_logged()).
In the first case it's safe to compare to the current transaction ID since
once ->logged_trans is set to the current transaction, we never set it to
a lower value.
In the second case, where we check if it's greater than zero, we are prone
to load/store tearing races, since a concurrent task can update it to the
current transaction ID with store tearing, for example with two 32 bit
writes or four 16 bit writes instead of a single 64 bit write. In that case
the reading side at inode_logged() could see a positive value that does not
match the current transaction and then return a false negative.
Fix this by doing the second check while holding the inode's spinlock, add
some comments about it too. Also add the data_race() annotation to the
first check to avoid any reports from KCSAN (or similar tools) and comment
about it.
Fixes: 0f8ce49821 ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At inode_logged() if we find that the inode was not logged before we
update its ->last_dir_index_offset to (u64)-1 with the goal that the
next directory log operation will see the (u64)-1 and then figure out
it must check what was the index of the last logged dir index key and
update ->last_dir_index_offset to that key's offset (this is done in
update_last_dir_index_offset()).
This however opens a time window where a race can happen and lead to
directory logging skipping dir index keys that should be logged. The race
happens like this:
1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks
that the inode item was logged before, but before it sets the inode's
->last_dir_index_offset to (u64)-1...
2) Task B is at btrfs_log_inode() which calls inode_logged() early, and
that has set ->last_dir_index_offset to (u64)-1;
3) Task B then enters log_directory_changes() which calls
update_last_dir_index_offset(). There it sees ->last_dir_index_offset
is (u64)-1 and that the inode was logged before (ctx->logged_before is
true), and so it searches for the last logged dir index key in the log
tree and it finds that it has an offset (index) value of N, so it sets
->last_dir_index_offset to N, so that we can skip index keys that are
less than or equal to N (later at process_dir_items_leaf());
4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update
that task B just did;
5) Task B will now skip every index key when it enters
process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1.
Fix this by making inode_logged() not touch ->last_dir_index_offset and
initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and
then having update_last_dir_index_offset() treat a value of 0 as meaning
we must check the log tree and update with the index of the last logged
index key. This is fine since the minimum possible value for
->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1).
This also simplifies the management of ->last_dir_index_offset and now
all accesses to it are done under the inode's log_mutex.
Fixes: 0f8ce49821 ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a race between checking if an inode was logged before and logging
an inode that can cause us to mark an inode as not logged just after it
was logged by a concurrent task:
1) We have inode X which was not logged before, neither in the current
transaction nor in a past transaction since the inode was loaded into
memory, so its ->logged_trans value is 0;
2) We are at transaction N;
3) Task A calls inode_logged() against inode X, sees that ->logged_trans
is 0 and there is a log tree and so it proceeds to search in the log
tree for an inode item for inode X. It doesn't see any, but before
it sets ->logged_trans to N - 1...
4) Task B calls btrfs_log_inode() against inode X, logs the inode and
sets ->logged_trans to N;
5) Task A now sets ->logged_trans to N - 1;
6) At this point anyone calling inode_logged() gets 0 (inode not logged)
since ->logged_trans is greater than 0 and less than N, but our inode
was really logged. As a consequence operations like rename, unlink and
link that happen afterwards in the current transaction end up not
updating the log when they should.
Fix this by ensuring that inode_logged(), when the inode item is not found
in the log tree, only updates ->logged_trans if, after taking the inode's
lock (the spinlock struct btrfs_inode::lock), the ->logged_trans value is
still zero, since the inode lock is what protects setting ->logged_trans
at btrfs_log_inode().
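A hedged, simplified sketch of that rule (not the exact inode_logged()
code):

  bool logged = false;

  spin_lock(&inode->lock);
  if (inode->logged_trans == 0)
          inode->logged_trans = trans->transid - 1;
  else if (inode->logged_trans == trans->transid)
          logged = true;   /* a concurrent btrfs_log_inode() won the race */
  spin_unlock(&inode->lock);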
Fixes: 0f8ce49821 ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of incrementing the inode's link count and refcount early, before
adding the link, updating the inode and deleting the orphan item, do it
after all those steps have succeeded, right before calling
d_instantiate(). This makes
the error handling logic simpler by avoiding the need for the 'drop_inode'
variable to signal if we need to undo the link count increment and the
inode refcount increase under the 'fail' label.
This also reduces the level of indentation by one, making the code easier
to read.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to update the inode or delete the orphan item, we leak the
inode, since we bumped its refcount with the ihold() call to account for
the d_instantiate() call, which never happens when those steps fail. Fix
this by setting 'drop_inode' to true in that case.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to update the inode or delete the orphan item, we must abort
the transaction to prevent persisting an inconsistent state. For example
if we fail to update the inode item, we have the inconsistency of having
a persisted inode item with a link count of N but we have N + 1 inode ref
items and N + 1 directory entries pointing to our inode in case the
transaction gets committed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>