Commit Graph

1532 Commits (master)

Author SHA1 Message Date
Linus Torvalds 7696286034 for-6.19-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmko/DUACgkQxWXV+ddt
 WDtyCw//UaFOTX/k72HgA1n2MWfegWbWyD+OGNbGosoZljKOrAe/mRnjXTyF9lyW
 8GDzGvJzF4Tkl5lyuGyiequlrO2F3veTpwHo94xnBTOYCeiTpMqTN/e/5SkasBpN
 4YlWq7OGYR4hwghRvZpaW7nsmVCKDLIlZVkH77x9Bmvx0NLO24EJlEZusQT4zYew
 ntC/i9x3DW0ZxYyfRhFIFvk9JUUdgXfxJ6dNexz0zi3dKUSUIR9hI0J9Nwl++1cF
 SgjAzbtO064htWoCvsKykgA6YGbJCZjw8XO8D2eJonkN24VbqSMaY44TPXmCMLVs
 ZXw871jV2E/urfWhRNdxv/kJdCFudPk0qXG5ZtfHO4UUwS/nZ+qAig+LHawgAOCJ
 9CgWy4zrfiYCqULRuqF1wzWu/z22++zIlZC552VAZd1RQ+JjqJY/aje4xhY5nUF4
 n1uVBReZaI9sH3jJOsMWpwLMptbhpH9RZp3QPgqZlUHo6GtPJJmNKfw8KgMAhZ7L
 wf7iy6v9yo+7VZ2ACwu2qJ+lZRxbZ0yvCnFatN3O5G1O0kkIrZFUM3MwdKtufZ0u
 LHWkGfoaq7zR6E6DhIaxIhiTTXMlOfLTikNKgBUO3NEdrRZwrDhr7K07S25jFxSx
 ZCNV6OdSCeziShPqT0ntcwecnJ41/kOcm13732NHF+QgzMK5LrI=
 =rO4x
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Features:

   - shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
      - set filesystem state as being shut down (also named going down
        in other filesystems), where all active operations return EIO
        and this cannot be changed until unmount
      - the filesystem attempts to finish pending operations, but error
        messages may still show up depending on where exactly the
        shutdown happened

   - scrub (and device replace) vs suspend/hibernate:
      - a running scrub will prevent suspend, which can be annoying as
        suspend is an immediate request and scrub is not critical
      - filesystem freezing before suspend was not sufficient as the
        problem was in process freezing
      - behaviour change: on suspend scrub and device replace are
        cancelled, where scrub can record the last state and continue
        from there; the device replace has to be restarted from the
        beginning

   - zone stats exported in sysfs, from the perspective of the
     filesystem this includes active, reclaimable, relocation etc zones

  Performance:

   - improvements when processing space reservation tickets by
     optimizing locking and shrinking critical sections, cumulative
     improvements in lockstat numbers show +15%

  Notable fixes:

   - use vmalloc fallback when allocating bios as high order allocations
     can happen with wide checksums (like sha256)

   - scrub will always track the last position of progress so it's not
     starting from zero after an error

  Core:

   - under experimental config, checksum calculations are offloaded to
     process context, which simplifies locking and allows removing the
     compression write worker kthread(s):
      - speed improvement in direct IO throughput with buffered IO
        fallback is +15% when not offloaded, but this is more related
        to internal crypto subsystem improvements
      - this will probably become the default in the future, removing
        the sysfs tunable

   - (experimental) block size > page size updates:
      - support more operations when not using large folios (encoded
        read/write and send)
      - raid56

   - more preparations for fscrypt support

  Other:

   - more conversions to auto-cleaned variables

   - parameter cleanups and removals

   - extended warning fixes

   - improved printing of structured values like keys

   - lots of other cleanups and refactoring"

* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
  btrfs: remove unnecessary inode key in btrfs_log_all_parents()
  btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
  btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
  btrfs: send: do not allocate memory for xattr data when checking it exists
  btrfs: send: add unlikely to all unexpected overflow checks
  btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
  btrfs: remove root argument from btrfs_del_dir_entries_in_log()
  btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
  btrfs: don't search back for dir inode item in INO_LOOKUP_USER
  btrfs: don't rewrite ret from inode_permission
  btrfs: add orig_logical to btrfs_bio for encryption
  btrfs: disable verity on encrypted inodes
  btrfs: disable various operations on encrypted inodes
  btrfs: remove redundant level reset in btrfs_del_items()
  btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
  btrfs: optimize balance_level() path reference handling
  btrfs: factor out root promotion logic into promote_child_to_root()
  btrfs: raid56: remove the "_step" infix
  btrfs: raid56: enable bs > ps support
  btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
  ...
2025-12-03 20:03:46 -08:00
Linus Torvalds f2e74ecfba vfs-6.19-rc1.folio
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 onGBAQDtqeO0jZzS7q9UxlJ84Wj/H9w+9INpO4jMxtWK4svhUAEAghG4qVxRvkE2
 Qh+wrpTPIC7OCQ78k8psDRmkj9cn8QA=
 =FCVN
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull folio updates from Christian Brauner:
 "Add a new folio_next_pos() helper function that returns the file
  position of the first byte after the current folio. This is a common
  operation in filesystems when needing to know the end of the current
  folio.

  The helper is lifted from btrfs which already had its own version, and
  is now used across multiple filesystems and subsystems:
   - btrfs
   - buffer
   - ext4
   - f2fs
   - gfs2
   - iomap
   - netfs
   - xfs
   - mm

  This fixes a long-standing bug in ocfs2 on 32-bit systems with files
  larger than 2GiB. Presumably this is not a common configuration, but
  the fix is backported anyway. The other filesystems did not have bugs,
  they were just mildly inefficient.

  This also introduces uoff_t as the unsigned version of loff_t. A recent
  commit inadvertently changed a comparison from being unsigned (on
  64-bit systems) to being signed (which it had always been on 32-bit
  systems), leading to sporadic fstests failures.

  Generally file sizes are restricted to being a signed integer, but in
  places where -1 is passed to indicate "up to the end of the file", it
  is convenient to have an unsigned type to ensure comparisons are
  always unsigned regardless of architecture"
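
As a rough sketch (an illustration, not necessarily the exact upstream
code, and assuming the usual folio_pos()/folio_size() accessors), the
new type and the helper look something like:

  /* unsigned counterpart of loff_t (sketch) */
  typedef u64 uoff_t;

  /* file position of the first byte after this folio */
  static inline loff_t folio_next_pos(struct folio *folio)
  {
          return folio_pos(folio) + folio_size(folio);
  }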

* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Add uoff_t
  mm: Use folio_next_pos()
  xfs: Use folio_next_pos()
  netfs: Use folio_next_pos()
  iomap: Use folio_next_pos()
  gfs2: Use folio_next_pos()
  f2fs: Use folio_next_pos()
  ext4: Use folio_next_pos()
  buffer: Use folio_next_pos()
  btrfs: Use folio_next_pos()
  filemap: Add folio_next_pos()
2025-12-01 10:26:38 -08:00
Linus Torvalds ebaeabfa5a vfs-6.19-rc1.writeback
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 or4UAP9FbpFsZd0DpsYnKuv7kFepl291PuR0x2dKmseJ/wcf8AEAzI8FR5wd/fey
 25ZNdExoUojAOj5wVn+jUep3u54jBws=
 =/toi
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull writeback updates from Christian Brauner:
 "Features:

   - Allow file systems to increase the minimum writeback chunk size.

     The relatively low minimal writeback size of 4MiB means that
     written back inodes on rotational media are switched a lot. Besides
     introducing additional seeks, this also can lead to extreme file
     fragmentation on zoned devices when a lot of files are cached
     relative to the available writeback bandwidth.

     This adds a superblock field that allows the file system to
     override the default size, and sets it to the zone size for zoned
     XFS (see the sketch below, after this list).

   - Add logging for slow writeback when it exceeds
     sysctl_hung_task_timeout_secs. This helps identify tasks waiting
     for a long time and pinpoint potential issues. Recording the
     starting jiffies is also useful when debugging a crashed vmcore.

   - Wake up waiting tasks when finishing the writeback of a chunk

  Cleanups:

   - filemap_* writeback interface cleanups.

     Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
     the original btrfs caller should be using better high level
     interfaces instead.

     This series removes all these low-level interfaces, switches btrfs
     to a more specific interface, and cleans up other too low-level
     interfaces. With this the writeback_control that is passed to the
     writeback code is only initialized in three places.

   - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
     filemap_fdatawrite_wbc

   - Add filemap_flush_nr helper for btrfs

   - Push struct writeback_control into start_delalloc_inodes in btrfs

   - Rename filemap_fdatawrite_range_kick to filemap_flush_range

   - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm

   - Make wbc_to_tag() inline and use it in fs"
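
For the minimum writeback chunk size feature above, the override
plausibly ends up looking like this (a sketch only; the
s_min_writeback_pages field name is taken from the shortlog below, the
surrounding function is assumed):

  static long writeback_chunk_size(struct super_block *sb,
                                   long default_pages)
  {
          long pages = default_pages;     /* MIN_WRITEBACK_PAGES, 4MiB */

          /* e.g. zoned XFS sets this to the zone size */
          if (sb->s_min_writeback_pages > pages)
                  pages = sb->s_min_writeback_pages;
          return pages;
  }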

* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Make wbc_to_tag() inline and use it in fs.
  xfs: set s_min_writeback_pages for zoned file systems
  writeback: allow the file system to override MIN_WRITEBACK_PAGES
  writeback: cleanup writeback_chunk_size
  mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
  mm: remove __filemap_fdatawrite_range
  mm: remove filemap_fdatawrite_wbc
  mm: remove __filemap_fdatawrite
  mm,btrfs: add a filemap_flush_nr helper
  btrfs: push struct writeback_control into start_delalloc_inodes
  btrfs: use the local tmp_inode variable in start_delalloc_inodes
  ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
  9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
  mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
  writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
  writeback: Wake up waiting tasks when finishing the writeback of a chunk.
2025-12-01 09:20:51 -08:00
Qu Wenruo 39bc80216a btrfs: relax btrfs_inode::ordered_tree_lock IRQ locking context
We used the IRQ version of the spinlock for ordered_tree_lock, as
btrfs_finish_ordered_extent() can be called from end_bbio_data_write(),
which used to run in IRQ context.

However, since we're moving all the btrfs_bio::end_io() calls into task
context, there is no longer any need to support IRQ context, thus we can
relax to regular spin_lock()/spin_unlock() for
btrfs_inode::ordered_tree_lock.
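
In code terms the conversion is simply (illustrative):

  /* before: btrfs_bio::end_io() could run in IRQ context */
  spin_lock_irq(&inode->ordered_tree_lock);
  /* ... tree manipulation ... */
  spin_unlock_irq(&inode->ordered_tree_lock);

  /* after: all btrfs_bio::end_io() calls run in task context */
  spin_lock(&inode->ordered_tree_lock);
  /* ... tree manipulation ... */
  spin_unlock(&inode->ordered_tree_lock);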

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:21 +01:00
Qu Wenruo 81cea6cd70 btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode
Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.

The idea is scrub doesn't want any automatic csum verification nor
read-repair, as everything will be handled by scrub itself.

However that behavior is really no different from the metadata inode's,
thus we can reuse btree_inode as btrfs_bio::inode for scrub.

The only exception is in btrfs_submit_chunk() where if a bbio is from
scrub or data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.

Now that btrfs_bio::inode is a mandatory parameter, we can extract
fs_info from that inode, and thus remove btrfs_bio::fs_info to save 8
bytes from the btrfs_bio structure.
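
The replacement accessor is presumably along these lines (a sketch; the
helper name is hypothetical, the inode->root->fs_info chain is standard
btrfs):

  static inline struct btrfs_fs_info *bbio_fs_info(const struct btrfs_bio *bbio)
  {
          /* inode is now mandatory, so this is always valid */
          return bbio->inode->root->fs_info;
  }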

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:40:16 +01:00
Filipe Manana 28fe58ce6a btrfs: add unlikely to unexpected error case in extent_writepages()
We don't expect to hit errors and log the error message, so add the
unlikely annotation to make that clear and to hint to the compiler that
it may reorganize the code to be more efficient.
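
For illustration (names assumed, not quoted from the patch):

  ret = extent_writepages(mapping, wbc);
  if (unlikely(ret < 0))          /* error path, essentially never taken */
          pr_err("writepages failed: %d\n", ret);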

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:08 +01:00
Filipe Manana 74ca34f79e btrfs: split assertion into two in extent_writepage_io()
If the assertion fails we don't get to know which of the two expressions
failed, nor the values used in each expression.

So split the assertion into two, one for each expression, so that if
either is triggered we see a line number reported in a stack trace that
points to the expression that failed. Also make the assertions use the
verbose mode to print the values involved in the computations.
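
Schematically (a sketch, assuming the verbose ASSERT() variant that
accepts printf-style arguments; cond_a/cond_b stand in for the real
expressions):

  /* before: no way to tell which half failed, nor the values */
  ASSERT(cond_a && cond_b);

  /* after: distinct line numbers plus the values involved */
  ASSERT(cond_a, "start=%llu len=%u", start, len);
  ASSERT(cond_b, "start=%llu len=%u", start, len);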

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:08 +01:00
Filipe Manana 46a2390859 btrfs: use variable for end offset in extent_writepage_io()
Instead of repeating the expression "start + len" multiple times, store it
in a variable and use it where needed.
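
I.e. (sketch):

  const u64 end = start + len;    /* replaces repeated "start + len" */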

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:08 +01:00
Filipe Manana 18de34daa7 btrfs: truncate ordered extent when skipping writeback past i_size
While running test case btrfs/192 from fstests with support for large
folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic
btrfs check failures reporting that csum items were missing. Looking into
the issue it turned out that btrfs check searches for csum items of a file
extent item with a range that spans beyond the i_size of a file and we
don't have any, because the kernel's writeback code skips submitting bios
for ranges beyond eof. It's not expected however to find a file extent item
that crosses the rounded up (by the sector size) i_size value, but there is
a short time window where we can end up with a transaction commit leaving
this small inconsistency between the i_size and the last file extent item.

Example btrfs check output when this happens:

  $ btrfs check /dev/sdc
  Opening filesystem to check...
  Checking filesystem on /dev/sdc
  UUID: 69642c61-5efb-4367-aa31-cdfd4067f713
  [1/8] checking log skipped (none written)
  [2/8] checking root items
  [3/8] checking extents
  [4/8] checking free space tree
  [5/8] checking fs roots
  root 5 inode 332 errors 1000, some csum missing
  ERROR: errors found in fs roots
  (...)

Looking at a tree dump of the fs tree (root 5) for inode 332 we have:

   $ btrfs inspect-internal dump-tree -t 5 /dev/sdc
   (...)
        item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160
                generation 17 transid 19 size 610969 nbytes 86016
                block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
                sequence 11 flags 0x0(none)
                atime 1759851068.391327881 (2025-10-07 16:31:08)
                ctime 1759851068.410098267 (2025-10-07 16:31:08)
                mtime 1759851068.410098267 (2025-10-07 16:31:08)
                otime 1759851068.391327881 (2025-10-07 16:31:08)
        item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13
                index 2 namelen 3 name: f1f
        item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53
                generation 19 type 1 (regular)
                extent data disk byte 21745664 nr 65536
                extent data offset 0 nr 65536 ram 65536
                extent compression 0 (none)
   (...)

We can see that the file extent item for file offset 589824 has a length of
64K and its number of bytes is 64K. Looking at the inode item we see that
its i_size is 610969 bytes which falls within the range of that file extent
item [589824, 655360[.

Looking into the csum tree:

  $ btrfs inspect-internal dump-tree /dev/sdc
  (...)
        item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200
                range start 21565440 end 21770240 length 204800
           item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8
                range start 1104576512 end 1104584704 length 8192
  (..)

We see that the csum item number 15 covers the first 24K of the file extent
item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664,
so we have:

   21770240 - 21745664 = 24K

We see that the next csum item (number 16) is completely outside the range,
so the remaining 40K of the extent doesn't have csum items in the tree.

If we round up the i_size to the sector size, we get:

   round_up(610969, 4096) = 614400

If we subtract from that the file offset for the extent item we get:

   614400 - 589824 = 24K

So the missing 40K corresponds to the end of the file extent item's range
minus the rounded up i_size:

   655360 - 614400 = 40K

Normally we don't expect a file extent item to span over the rounded up
i_size of an inode, since when truncating, doing hole punching and other
operations that trim a file extent item, the number of bytes is adjusted.

There is however a short time window where the kernel can end up,
temporarily, persisting an inode with an i_size that falls in the middle
of the last file extent item while the file extent item was not yet
trimmed (its number of bytes reduced so that it doesn't cross i_size
rounded up by the sector size).

The steps (in the kernel) that lead to such scenario are the following:

 1) We have inode I as an empty file, no allocated extents, i_size is 0;

 2) A buffered write is done for file range [589824, 655360[ (length of
    64K) and the i_size is updated to 655360. Note that we got a single
    large folio for the range (64K);

 3) A truncate operation starts that reduces the inode's i_size down to
    610969 bytes. The truncate sets the inode's new i_size at
    btrfs_setsize() by calling truncate_setsize() and before calling
    btrfs_truncate();

 4) At btrfs_truncate() we trigger writeback for the range starting at
    610304 (which is the new i_size rounded down to the sector size) and
    ending at (u64)-1;

 5) During the writeback, at extent_write_cache_pages(), we get from the
    call to filemap_get_folios_tag(), the 64K folio that starts at file
    offset 589824 since it contains the start offset of the writeback
    range (610304);

 6) At writepage_delalloc() we find the whole range of the folio is dirty
    and therefore we run delalloc for that 64K range ([589824, 655360[),
    reserving a 64K extent, creating an ordered extent, etc;

 7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[
    because the inode's i_size is 610969 bytes (rounded up by sector size
    is 614400). There, in the while loop we intentionally skip IO beyond
     i_size to avoid any unnecessary work and just call
    btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which
    has a 40K length);

 8) Once the IO finishes we finish the ordered extent by ending up at
    btrfs_finish_one_ordered(), join transaction N, insert a file extent
    item in the inode's subvolume tree for file offset 589824 with a number
    of bytes of 64K, and update the inode's delayed inode item or directly
    the inode item with a call to btrfs_update_inode_fallback(), which
    results in storing the new i_size of 610969 bytes;

 9) Transaction N is committed either by the transaction kthread or some
    other task committed it (in response to a sync or fsync for example).

    At this point we have inode I persisted with an i_size of 610969 bytes
    and file extent item that starts at file offset 589824 and has a number
    of bytes of 64K, ending at an offset of 655360 which is beyond the
    i_size rounded up to the sector size (614400).

    --> So after a crash or power failure here, the btrfs check program
        reports that error about missing checksum items for this inode,
        as it tries to look up checksums covering the whole range of the
        extent;

10) Only after transaction N is committed that at btrfs_truncate() the
    call to btrfs_start_transaction() starts a new transaction, N + 1,
    instead of joining transaction N. And it's with transaction N + 1 that
    it calls btrfs_truncate_inode_items() which updates the file extent
    item at file offset 589824 to reduce its number of bytes from 64K down
    to 24K, so that the file extent item's range ends at the i_size
    rounded up to the sector size (614400 bytes).

Fix this by truncating the ordered extent at extent_writepage_io() when we
skip writeback because the current offset in the folio is beyond i_size.
This ensures we don't ever persist a file extent item with a number of
bytes beyond the rounded up (by sector size) value of the i_size.
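
Conceptually, the skip-past-eof branch of extent_writepage_io() now does
something like this (a sketch; the truncation helper name is
hypothetical, only btrfs_mark_ordered_io_finished() is named in the log
above):

  if (cur >= i_size) {
          /* trim the OE so the file extent item inserted later cannot
           * extend past round_up(i_size, sectorsize) */
          btrfs_truncate_ordered_extent(inode, cur);      /* hypothetical */
          btrfs_mark_ordered_io_finished(inode, folio, cur, len, true);
          break;
  }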

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:08 +01:00
Qu Wenruo 4e700ac62a btrfs: remove unnecessary NULL fs_info check from find_lock_delalloc_range()
[STATIC CHECK REPORT]
Smatch is reporting that find_lock_delalloc_range() used to do a null
pointer check before accessing fs_info, but now we're accessing it for
sectorsize unconditionally.

[FALSE ALERT]
This is a false alert. The existing null pointer check was introduced in
commit f7b12a62f0 ("btrfs: replace BTRFS_MAX_EXTENT_SIZE with
fs_info->max_extent_size"), but way before that, commit 7c0260ee09
("btrfs: tests, require fs_info for root") already forced every
btrfs_root to have a valid fs_info pointer.

So there is no way that btrfs_root::fs_info is NULL.

[FIX]
Just remove the unnecessary NULL pointer check.
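
I.e. (illustration only; the surrounding code is assumed):

  /* before: a check that can never trigger */
  if (!inode->root->fs_info)
          return false;

  /* after: fs_info is always valid, use it unconditionally */
  const u32 sectorsize = inode->root->fs_info->sectorsize;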

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: f7b12a62f0 ("btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size")
Closes: https://lore.kernel.org/r/202509250925.4L4JQTtn-lkp@intel.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:37:36 +01:00
Matthew Wilcox (Oracle) 48f3784b17 btrfs: Use folio_next_pos()
btrfs defined its own variant of folio_next_pos() called folio_end().
This is an ambiguous name as 'end' might be exclusive or inclusive.
Switch to the new folio_next_pos().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-3-willy@infradead.org
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:37 +01:00
Qu Wenruo 2618849f31 btrfs: ensure no dirty metadata is written back for an fs with errors
[BUG]
During development of a minor feature (make sure all btrfs_bio::end_io()
is called in task context), I noticed a crash in generic/388, where
metadata writes triggered new works after btrfs_stop_all_workers().

It turns out that it can even happen without any code modification:
just using RAID5 for metadata, the same workload from generic/388 is
going to trigger the use-after-free.

[CAUSE]
If btrfs hits an error, the fs is marked as being in an error state and
no new transaction is allowed, thus metadata is in a frozen state.

But there are some metadata modifications before that error, and they are
still in the btree inode page cache.

Since there will be no real transaction commit, all those dirty folios
are just kept as is in the page cache, and they can not be invalidated
by invalidate_inode_pages2() call inside close_ctree(), because they are
dirty.

And finally after btrfs_stop_all_workers(), we call iput() on btree
inode, which triggers writeback of those dirty metadata.

And if the fs is using RAID56 metadata, this will trigger RMW and queue
new work items into rmw_workers, which is already stopped, causing a
warning from queue_work() and a use-after-free.

[FIX]
Add special handling to write_one_eb(): if the fs is already in
an error state, immediately mark the bbio as failed instead of really
submitting it.

Then during close_ctree(), iput() will just discard all those dirty
tree blocks without really writing them back, thus no more new jobs for
already stopped-and-freed workqueues.

The extra discard in write_one_eb() also acts as an extra safety net.
E.g. if the transaction abort is triggered by some extent/free space
tree corruption, then since the extent/free space tree is already
corrupted, some tree blocks may be allocated where they shouldn't be
(overwriting existing tree blocks). In that case writing them back will
further corrupt the fs.
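
The gist of the fix, as a sketch (the helpers shown are common btrfs
ones, but the exact placement and status code are assumptions):

  /* in write_one_eb() */
  if (unlikely(BTRFS_FS_ERROR(fs_info))) {
          /* fs already aborted: fail the bbio instead of submitting,
           * so no new work hits already stopped workqueues */
          btrfs_bio_end_io(bbio, errno_to_blk_status(-EROFS));
          return;
  }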

CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30 19:16:01 +01:00
Julian Sun 4952f35f05 fs: Make wbc_to_tag() inline and use it in fs.
The logic in wbc_to_tag() is widely used in file systems, so make this
function inline and use it in the file systems.

This patch has only passed compilation tests, but it should be fine.
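
The resulting inline helper is presumably along these lines (a sketch,
consistent with how writeback tags are normally chosen):

  static inline xa_mark_t wbc_to_tag(struct writeback_control *wbc)
  {
          if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                  return PAGECACHE_TAG_TOWRITE;
          return PAGECACHE_TAG_DIRTY;
  }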

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 23:33:48 +01:00
Boris Burkov 8ab2fa6969 btrfs: fix incorrect readahead expansion length
The intent of btrfs_readahead_expand() was to expand to the length of
the current compressed extent being read. However, "ram_bytes" is *not*
that, in the case where a single physical compressed extent is used for
multiple file extents.

Consider this case with a large compressed extent C and then later two
non-compressed extents N1 and N2 written over C, leaving C1 and C2
pointing to offset/len pairs of C:

[               C                 ]
[ N1 ][     C1     ][ N2 ][   C2  ]

In such a case, ram_bytes for both C1 and C2 is the full uncompressed
length of C. So starting readahead in C1 will expand the readahead past
the end of C1, past N2, and into C2. This will then expand readahead
again, to C2_start + ram_bytes, way past EOF. First of all, this is
totally undesirable: we don't want to read the whole file in arbitrary
chunks of the large underlying extent if it happens to exist. Secondly,
it results in zeroing the range past the end of C2 up to ram_bytes. This
is particularly unpleasant with fs-verity as it can zero and set
uptodate pages in the verity virtual space past EOF. This incorrect
readahead behavior can lead to verity verification errors, if we iterate
in a way that happens to do the wrong readahead.

Fix this by using em->len for readahead expansion, not em->ram_bytes,
resulting in the expected behavior of stopping readahead at the extent
boundary.
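
In other words (sketch, using the generic readahead_expand() for
illustration):

  /* before: could run far past the file extent boundary and EOF */
  readahead_expand(ractl, em->start, em->ram_bytes);

  /* after: stop at the extent map boundary */
  readahead_expand(ractl, em->start, em->len);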

Reported-by: Max Chernoff <git@maxchernoff.ca>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
Fixes: 9e9ff875e4 ("btrfs: use readahead_expand() on compressed extents")
CC: stable@vger.kernel.org # 6.17
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13 22:34:08 +02:00
David Sterba cc53bd2085 btrfs: add unlikely annotations to branches leading to EIO
The unlikely() annotation is a static prediction hint that the compiler
may use to reorder code out of the hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, and EIO
is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
Qu Wenruo c2ffb1ec1a btrfs: prepare compression folio alloc/free for bs > ps cases
This includes the following preparation for bs > ps cases:

- Always alloc/free the folio directly if bs > ps
  This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
  affecting all compression algorithms.

  For btrfs_free_compr_folio() it needs no parameter for now, as we can
  use the folio size to skip the caching part.

  For now the change is just passing a @fs_info into the function; all
  the folio size assumptions are still based on page size.

- Properly zero the last folio in compress_file_range()
  Since the compressed folios can be larger than a page, we need to
  properly zero the whole folio.

- Use correct folio size for btrfs_add_compressed_bio_folios()
  Instead of page size, use the correct folio size.

- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
  As we are not only using simple page sized folios anymore.

- Use correct folio size for btrfs_decompress()
  There is an ASSERT() making sure the decompressed range is no larger
  than a page, which will be triggered for bs > ps cases.

- Skip readahead for compressed pages
  Similar to subpage cases.

- Make btrfs_alloc_folio_array() accept a new @order parameter

- Add a helper to calculate the minimal folio size

All those changes should not affect the existing bs <= ps handling.
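
For example, the minimal folio size helper mentioned above is plausibly
something like (sketch, the name is assumed):

  /* smallest folio that can hold one fs block */
  static inline size_t btrfs_min_folio_size(const struct btrfs_fs_info *fs_info)
  {
          return max_t(size_t, PAGE_SIZE, fs_info->sectorsize);
  }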

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:24 +02:00
Qu Wenruo 7b26da4074 btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()
[BUG]
With my local branch to enable bs > ps support for btrfs, sometimes I
hit the following ASSERT() inside submit_one_sector():

	ASSERT(block_start != EXTENT_MAP_HOLE);

Please note that it's not yet possible to hit this ASSERT() in the wild,
as it requires btrfs bs > ps support, which is not even in the
development branch.

But on the other hand, there is also a very low chance to hit the above
ASSERT() with bs < ps cases, so this is an existing bug affecting not
only the incoming bs > ps support but also the existing bs < ps support.

[CAUSE]
Firstly, that ASSERT() means we're trying to submit a dirty block
without a real extent map or an ordered extent backing it.

Furthermore with extra debugging, the folio triggering such ASSERT() is
always larger than the fs block size in my bs > ps case.
(8K block size, 4K page size)

After some more debugging, the ASSERT() is triggered by the following
sequence:

 extent_writepage()
 |  We got a 32K folio (4 fs blocks) at file offset 0, and the fs block
 |  size is 8K, page size is 4K.
 |  And there is another 8K folio at file offset 32K, which is also
 |  dirty.
 |  So the filemap layout looks like the following:
 |
 |  "||" is the filio boundary in the filemap.
 |  "//| is the dirty range.
 |
 |  0        8K       16K        24K         32K       40K
 |  |////////|        |//////////////////////||////////|
 |
 |- writepage_delalloc()
 |  |- find_lock_delalloc_range() for [0, 8K)
 |  |  Now range [0, 8K) is properly locked.
 |  |
 |  |- find_lock_delalloc_range() for [16K, 40K)
 |  |  |- btrfs_find_delalloc_range() returned range [16K, 40K)
 |  |  |- lock_delalloc_folios() locked folio 0 successfully
 |  |  |
 |  |  |  The filemap range [32K, 40K) got dropped from filemap.
 |  |  |
 |  |  |- lock_delalloc_folios() failed with -EAGAIN on folio 32K
 |  |  |  As the folio at 32K is dropped.
 |  |  |
 |  |  |- loops = 1;
 |  |  |- max_bytes = PAGE_SIZE;
 |  |  |- goto again;
 |  |  |  This will re-do the lookup for dirty delalloc ranges.
 |  |  |
 |  |  |- btrfs_find_delalloc_range() called with @max_bytes == 4K
 |  |  |  This is smaller than block size, so
 |  |  |  btrfs_find_delalloc_range() is unable to return any range.
 |  |  \- return false;
 |  |
 |  \- Now only range [0, 8K) has an OE for it, but for dirty range
 |     [16K, 32K) it's dirty without an OE.
 |     This breaks the assumption that writepage_delalloc() will find
 |     and lock all dirty ranges inside the folio.
 |
 |- extent_writepage_io()
    |- submit_one_sector() for [0, 8K)
    |  Succeeded
    |
    |- submit_one_sector() for [16K, 24K)
       Triggering the ASSERT(), as there is no OE, and the original
       extent map is a hole.

Please note that this also exposes the same problem for bs < ps
support. E.g. with 64K page size and 4K block size.

If we fail to lock a folio and fall back into the "loops = 1;" branch,
we will re-do the search using 64K as max_bytes, which may fail again to
lock the next folio and exit early without handling all dirty blocks
inside the folio.

[FIX]
Instead of using the fixed size PAGE_SIZE as @max_bytes, use
@sectorsize, so that we are guaranteed to find and lock any remaining
blocks inside the folio.

And since we're here, add an extra ASSERT() before calling
btrfs_find_delalloc_range() to make sure @max_bytes is at least no
smaller than a block, to avoid false negatives.
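
I.e. the retry path becomes (sketch):

  /* before: one page, too small when blocksize > PAGE_SIZE */
  max_bytes = PAGE_SIZE;

  /* after: always at least one block */
  max_bytes = fs_info->sectorsize;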

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:24 +02:00
Qu Wenruo 2d83ed6c6c btrfs: return any hit error from extent_writepage_io()
Since the introduction of bs < ps support, extent_writepage_io() will
submit multiple blocks inside the folio.

But if we hit an error submitting one sector while the next sector can
still be submitted successfully, extent_writepage_io() will still
return 0.

This makes btrfs silently ignore the error without setting the error
flag for the filemap.

Fix it by recording the first error hit, and always return that value.
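
The record-first-error pattern, as a sketch (loop details assumed):

  int ret = 0;

  while (cur < end) {
          int ret2 = submit_one_sector(inode, folio, cur, ...);

          if (ret2 < 0 && !ret)
                  ret = ret2;     /* remember the first failure */
          cur += blocksize;       /* but keep submitting the rest */
  }
  return ret;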

Fixes: 8bf334beb3 ("btrfs: fix double accounting race when extent_writepage_io() failed")
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:23 +02:00
Filipe Manana 8f0534ec96 btrfs: mark extent buffer alignment checks as unlikely
We are not expecting to ever fail the extent buffer alignment checks, so
mark them as unlikely to allow the compiler to potentially generate more
optimized code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:23 +02:00
Filipe Manana 6a9e1d1a65 btrfs: store and use node size in local variable in check_eb_alignment()
Instead of dereferencing fs_info every time we need to access the node
size, store it in a local variable to make the code less verbose and
avoid a line split too.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:23 +02:00
David Sterba 17dc82dc1e btrfs: fix typos in comments and strings
Annual typo fixing pass. Strangely, codespell found only about 30% of
what is in this patch; the rest was done manually using a text
spellchecker with a custom dictionary of acceptable terms.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
David Sterba 67e78f983e btrfs: convert several int parameters to bool
We're almost done cleaning up misused int/bool parameters. Convert a
bunch of them, found by manual grepping. Note that btrfs_sync_fs() needs
an int as it's mandated by the struct super_operations prototype.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:32 +02:00
Boris Burkov f07b855c56 btrfs: try to search for data csums in commit root
If you run a workload with:

- a cgroup that does tons of parallel data reading, with a working set
  much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
  with no memory limit
(see full code listing at the bottom for a reproducer)

Then what quickly occurs is:

- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
  memory pressure

The result of this is that we write back the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.

As a result of this, we achieve an unpleasant priority inversion. We
have:

- a high degree of contention on the csum root node (and other upper
  nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
  (running delayed_refs, deleting csums for unreferenced data extents)

This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.

Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken write once, near the
end of transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.

Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.
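
Schematically (the field and helper names here are assumptions, not the
actual patch):

  /* while adding each extent map to the bio: track the newest one */
  bio_ctrl->last_em_generation = max(bio_ctrl->last_em_generation,
                                     em->generation);

  /* when setting up the btrfs_bio: old data may use the commit root */
  bbio->csum_search_commit_root =
          bio_ctrl->last_em_generation <= btrfs_get_last_trans_committed(fs_info);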

To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):

====== big-read.c ======
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <errno.h>

  #define BUF_SZ (128 * (1 << 10UL))

  int read_once(int fd, size_t sz) {
  	char buf[BUF_SZ];
  	size_t rd = 0;
  	int ret = 0;

  	while (rd < sz) {
  		ret = read(fd, buf, BUF_SZ);
  		if (ret < 0) {
  			if (errno == EINTR)
  				continue;
  			fprintf(stderr, "read failed: %d\n", errno);
  			return -errno;
  		} else if (ret == 0) {
  			break;
  		} else {
  			rd += ret;
  		}
  	}
  	return rd;
  }

  int read_loop(char *fname) {
  	int fd;
  	struct stat st;
  	size_t sz = 0;
  	int ret;

  	while (1) {
  		fd = open(fname, O_RDONLY);
  		if (fd == -1) {
  			perror("open");
  			return 1;
  		}
  		if (!sz) {
  			if (!fstat(fd, &st)) {
  				sz = st.st_size;
  			} else {
  				perror("stat");
  				return 1;
  			}
  		}

                  ret = read_once(fd, sz);
  		close(fd);
  	}
  }

  int main(int argc, char *argv[]) {
  	if (argc != 2) {
  		fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
  		return 1;
  	}

  	return read_loop(argv[1]);
  }

====== repro.sh ======
  #!/usr/bin/env bash

  SCRIPT=$(readlink -f "$0")
  DIR=$(dirname "$SCRIPT")

  dev=$1
  mnt=$2
  shift
  shift

  CG_ROOT=/sys/fs/cgroup
  BAD_CG=$CG_ROOT/bad-nbr
  GOOD_CG=$CG_ROOT/good-nbr
  NR_BIGGOS=1
  NR_LITTLE=10
  NR_VICTIMS=32
  NR_VILLAINS=512

  START_SEC=$(date +%s)

  _elapsed() {
  	echo "elapsed: $(($(date +%s) - $START_SEC))"
  }

  _stats() {
  	local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

  	echo "================"
  	date
  	_elapsed
  	cat $sysfs/commit_stats
  	cat $BAD_CG/memory.pressure
  }

  _setup_cgs() {
  	echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
  	mkdir -p $GOOD_CG
  	mkdir -p $BAD_CG
  	echo max > $BAD_CG/memory.max
  	# memory.high much less than the working set will cause heavy reclaim
  	echo $((1 << 30)) > $BAD_CG/memory.high

  	# victims get a subset of villain CPUs
  	echo 0 > $GOOD_CG/cpuset.cpus
  	echo 0,1,2,3 > $BAD_CG/cpuset.cpus
  }

  _kill_cg() {
  	local cg=$1
  	local attempts=0
  	echo "kill cgroup $cg"
  	[ -f $cg/cgroup.procs ] || return
  	while true; do
  		attempts=$((attempts + 1))
  		echo 1 > $cg/cgroup.kill
  		sleep 1
  		procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
  		[ $procs -eq 0 ] && break
  	done
  	rmdir $cg
  	echo "killed cgroup $cg in $attempts attempts"
  }

  _biggo_vol() {
  	echo $mnt/biggo_vol.$1
  }

  _biggo_file() {
  	echo $(_biggo_vol $1)/biggo
  }

  _subvoled_biggos() {
  	total_sz=$((10 << 30))
  	per_sz=$((total_sz / $NR_VILLAINS))
  	dd_count=$((per_sz >> 20))
  	echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
  	for i in $(seq $NR_VILLAINS)
  	do
  		btrfs subvol create $(_biggo_vol $i) &>/dev/null
  		dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
  	done
  	echo "done creating subvols."
  }

  _setup() {
  	[ -f .done ] && rm .done
  	findmnt -n $dev && exit 1
        if [ -f .re-mkfs ]; then
		mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
	else
		echo "touch .re-mkfs to populate the test fs"
	fi

  	mount -o noatime $dev $mnt || exit 3
  	[ -f .re-mkfs ] && _subvoled_biggos
  	_setup_cgs
  }

  _my_cleanup() {
  	echo "CLEANUP!"
  	_kill_cg $BAD_CG
  	_kill_cg $GOOD_CG
  	sleep 1
  	umount $mnt
  }

  _bad_exit() {
  	_err "Unexpected Exit! $?"
  	_stats
  	exit $?
  }

  trap _my_cleanup EXIT
  trap _bad_exit INT TERM

  _setup

  # Use a lot of page cache reading the big file
  _villain() {
  	local i=$1
  	echo $BASHPID > $BAD_CG/cgroup.procs
  	$DIR/big-read $(_biggo_file $i)
  }

  # Hit del_csum a lot by overwriting lots of small new files
  _victim() {
  	echo $BASHPID > $GOOD_CG/cgroup.procs
  	i=0;
  	while (true)
  	do
  		local tmp=$mnt/tmp.$i

  		dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
  		i=$((i+1))
  		[ $i -eq $NR_LITTLE ] && i=0
  	done
  }

  _one_sync() {
  	echo "sync..."
  	before=$(date +%s)
  	sync
  	after=$(date +%s)
  	echo "sync done in $((after - before))s"
  	_stats
  }

  # sync in a loop
  _sync() {
  	echo "start sync loop"
  	syncs=0
  	echo $BASHPID > $GOOD_CG/cgroup.procs
  	while true
  	do
  		[ -f .done ] && break
  		_one_sync
  		syncs=$((syncs + 1))
  		[ -f .done ] && break
  		sleep 10
  	done
  	if [ $syncs -eq 0 ]; then
  		echo "do at least one sync!"
  		_one_sync
  	fi
  	echo "sync loop done."
  }

  _sleep() {
  	local time=${1-60}
  	local now=$(date +%s)
  	local end=$((now + time))
  	while [ $now -lt $end ];
  	do
  		echo "SLEEP: $((end - now))s left. Sleep 10."
  		sleep 10
  		now=$(date +%s)
  	done
  }

  echo "start $NR_VILLAINS villains"
  for i in $(seq $NR_VILLAINS)
  do
  	_villain $i &
  	disown # get rid of annoying log on kill (done via cgroup anyway)
  done

  echo "start $NR_VICTIMS victims"
  for i in $(seq $NR_VICTIMS)
  do
  	_victim &
  	disown
  done

  _sync &
  SYNC_PID=$!

  _sleep $1
  _elapsed
  touch .done
  wait $SYNC_PID

  echo "OK"
  exit 0

Without this patch, that reproducer:

- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes

sync done in 388s
================
Wed Jul  9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274

419 hits state R, D comms big-read
                 btrfs_tree_read_lock_nested
                 btrfs_read_lock_root_node
                 btrfs_search_slot
                 btrfs_lookup_csum
                 btrfs_lookup_bio_sums
                 btrfs_submit_bbio

1 hits state D comms btrfs-transacti
                 btrfs_tree_lock_nested
                 btrfs_lock_root_node
                 btrfs_search_slot
                 btrfs_del_csums
                 __btrfs_run_delayed_refs
                 btrfs_run_delayed_refs

With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (a couple of seconds) but never
one that is minutes long.

sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168

Some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
                 shrink_folio_list
                 shrink_lruvec
                 shrink_node
                 do_try_to_free_pages
                 try_to_free_mem_cgroup_pages
                 reclaim_high

Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:31 +02:00
Qu Wenruo 9786531399 btrfs: fix corruption reading compressed range when block size is smaller than page size
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:

        mkfs.btrfs -f -s 4k $dev > /dev/null
        mount -o compress $dev $mnt
        xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
	echo "correct result:"
        od -Ad -t x1 $mnt/base
        xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
		  -c "reflink $mnt/base 0 32k 32k" \
		  -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
	echo "incorrect result:"
        od -Ad -t x1 $mnt/new
        umount $mnt

This shows the following result:

correct result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
incorrect result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536

Notice the zero in the range [32K, 60K), which is incorrect.

[CAUSE]
With extra trace printk, it shows the following events during od:
(some unrelated info removed like CPU and context)

 od-3457   btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000

The "r/i" is indicating the root and inode number. In our case the file
"new" is using ino 258 from fs tree (root 5).

Here notice the @prev_em_start pointer is NULL. This means the
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().

 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768

These above 32K blocks will be read from the first half of the
compressed data extent.

 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768

Note that here there is no btrfs_submit_compressed_read() call, which
is incorrect.
Although both extent maps at 0 and 32K point to the same compressed
data, their offsets are different and thus cannot be merged into the
same read.

So this means the compressed data read merge check is doing something
wrong.

 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
 od-3457   btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440

The function btrfs_submit_compressed_read() is only called at the end of
folio read. The compressed bio will only have an extent map of range [0,
32K), but the original bio passed in is for the whole 64K folio.

This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (aka, filled with zero).

This incorrect compressed read merge leads to the above data corruption.

Similar problems have happened in the past; commit 808f80b467
("Btrfs: update fix for read corruption of compressed and shared
extents") does pretty much the same fix for readahead.

But that was back in 2015, when btrfs still only supported bs (block
size) == ps (page size) cases.
This means btrfs_do_readpage() only needs to handle a folio which
contains exactly one block.

Only btrfs_readahead() can lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.

With the v5.15 kernel, btrfs introduced bs < ps support. This breaks
the above assumption that a folio can only contain one block.

Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.

In theory, this can also happen for btrfs with large folios, but since
large folio support is still experimental, we don't need to bother with
it, thus only bs < ps support is affected for now.

[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.
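
Schematically, tracking the start of the last extent map in bio_ctrl
lets both the readahead and the read_folio paths run the same check
(a sketch; details assumed):

  /* force a submit when crossing into a different compressed extent
   * map, so one compressed bio never spans two extent maps */
  if (extent_map_is_compressed(em) &&
      bio_ctrl->last_em_start != em->start)
          submit_one_bio(bio_ctrl);
  bio_ctrl->last_em_start = em->start;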

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-02 20:45:25 +02:00
Qu Wenruo 1f3d56db69 btrfs: clear TAG_TOWRITE from buffer tree when submitting a tree block
[POSSIBLE BUG]
After commit 5e121ae687 ("btrfs: use buffer xarray for extent buffer
writeback operations"), we have a dedicated xarray for extent buffers,
and a lot of tags are migrated to that buffer tree, like
PAGECACHE_TAG_TOWRITE/DIRTY/WRITEBACK.

This frees us from the limits of page flags, but there is a new
asymmetric behavior: we call buffer_tree_tag_for_writeback() to set
PAGECACHE_TAG_TOWRITE for the involved ranges, but no one ever
clears that tag.

Before that rework, we relied on the page cache tag which was cleared
when folio_start_writeback() was called.
Although this has its own problems (e.g. the first one calling
folio_start_writeback() will clear the tag for the whole page), it at
least cleared the tag.

But now our real tags are stored in the buffer tree, no one is really
clearing the PAGECACHE_TAG_TOWRITE tag now.

[FIX]
Thankfully this is not going to cause any real bug, just some
inefficiency when iterating the extent buffers.

If we hit an extent buffer which is not dirty but still has the
PAGECACHE_TAG_TOWRITE tag, lock_extent_buffer_for_io() will skip it, so
we won't write back the extent buffer again.

To properly fix the inefficiency, just clear the PAGECACHE_TAG_TOWRITE
inside lock_extent_buffer_for_io().

There is no error path between lock_extent_buffer_for_io() and
write_one_eb(), so we're safe to clear the tag there.
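
I.e. (sketch; the exact arguments of buffer_tree_clear_mark() are
assumed):

  /* in lock_extent_buffer_for_io(), once the eb is locked for IO */
  buffer_tree_clear_mark(eb, PAGECACHE_TAG_TOWRITE);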

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-13 14:08:45 +02:00
Qu Wenruo 05b3728626 btrfs: clear block dirty if btrfs_writepage_cow_fixup() failed
[BUG]
If btrfs_writepage_cow_fixup() failed (returning value -EUCLEAN),
the block will be kept dirty, but with its corresponding range finished
in the ordered extent.

Currently that error pattern is only possible for experimental builds,
which place an extra check to ensure we shouldn't hit a dirty block
without a corresponding ordered extent.

This means if later a writeback happens again, we can hit the following
problems:

- ASSERT(block_start != EXTENT_MAP_HOLE) in submit_one_sector()
  If the original extent map is a hole, then we can hit this case, as
  the new ordered extent failed, we will drop the new extent map and
  re-read one from the disk.

- DEBUG_WARN() in btrfs_writepage_cow_fixup()
  This is because we no longer have an ordered extent for those dirty
  blocks. The original for them is already finished with error.

[CAUSE]
The function btrfs_writepage_cow_fixup() is not following the regular
error handling of writeback. The common practice is to clear the folio
dirty, then start and finish the writeback for the block.

This is normally done by extent_clear_unlock_delalloc() with
PAGE_START_WRITEBACK | PAGE_END_WRITEBACK flags during
run_delalloc_range().

So if we keep those failed blocks dirty, they will stay in the page
cache and wait for the next writeback.

And since the original ordered extent is already finished and removed,
depending on the original extent map, we either hit the ASSERT() inside
submit_one_sector(), or hit the DEBUG_WARN() in
btrfs_writepage_cow_fixup() again (rather ironically).

[FIX]
Follow the regular error handling to clear the dirty flag for the block
range, start and finish writeback for that block range instead.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-13 14:08:44 +02:00
Qu Wenruo 4bcd3061e8 btrfs: clear block dirty if submit_one_sector() failed
[BUG]
If submit_one_sector() failed, the block will be kept dirty, but with
its corresponding range finished in the ordered extent.

This means if a writeback happens later again, we can hit the following
problems:

- ASSERT(block_start != EXTENT_MAP_HOLE) in submit_one_sector()
  If the original extent map is a hole, then we can hit this case, as
  the new ordered extent failed, we will drop the new extent map and
  re-read one from the disk.

- DEBUG_WARN() in btrfs_writepage_cow_fixup()
  This is because we no longer have an ordered extent for those dirty
  blocks. The original for them is already finished with error.

[CAUSE]
The function submit_one_sector() is not following the regular error
handling of writeback. The common practice is to clear the folio dirty,
then start and finish the writeback for the block.

This is normally done by extent_clear_unlock_delalloc() with
PAGE_START_WRITEBACK | PAGE_END_WRITEBACK flags during
run_delalloc_range().

So if we keep those failed blocks dirty, they will stay in the page
cache and wait for the next writeback.

And since the original ordered extent is already finished and removed,
depending on the original extent map, we either hit the ASSERT() inside
submit_one_sector(), or hit the DEBUG_WARN() in
btrfs_writepage_cow_fixup().

[FIX]
Follow the regular error handling to clear the dirty flag for the block,
start and finish writeback for that block instead.
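
The regular per-block error handling it switches to is roughly (a
sketch using btrfs's subpage folio helpers; exact arguments assumed):

  /* on failure, retire the block as if its writeback completed */
  btrfs_folio_clear_dirty(fs_info, folio, cur, blocksize);
  btrfs_folio_set_writeback(fs_info, folio, cur, blocksize);
  btrfs_folio_clear_writeback(fs_info, folio, cur, blocksize);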

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-13 14:08:44 +02:00
Leo Martins ad580dfa38 btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()
There is a potential deadlock that can happen in
try_release_subpage_extent_buffer() because the irq-safe spin lock of
the fs_info->buffer_tree xarray is acquired before the irq-unsafe
eb->refs_lock.

This leads to the potential race:
// T1 (random eb->refs user)                  // T2 (release folio)

spin_lock(&eb->refs_lock);
// interrupt
end_bbio_meta_write()
  btrfs_meta_folio_clear_writeback()
                                              btree_release_folio()
                                                folio_test_writeback() //false
                                                try_release_extent_buffer()
                                                  try_release_subpage_extent_buffer()
                                                    xa_lock_irq(&fs_info->buffer_tree)
                                                    spin_lock(&eb->refs_lock); // blocked; held by T1
  buffer_tree_clear_mark()
    xas_lock_irqsave() // blocked; held by T2

I believe that the spin lock can safely be replaced by an rcu_read_lock.
The xa_for_each loop does not need the spin lock as it's already
internally protected by the rcu_read_lock. The extent buffer is also
protected by the rcu_read_lock so it won't be freed before we take the
eb->refs_lock and check the ref count.

The rcu_read_lock is taken and released every iteration, just like the
spin lock, which means we're not protected against concurrent
insertions into the xarray. This is fine because we rely on
folio->private to detect if there are any ebs remaining in the folio.

There is already some precedent for this with find_extent_buffer_nolock(),
which loads an extent buffer from the xarray under only the rcu_read_lock.
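
A simplified sketch of the resulting loop shape (the actual patch, as
noted above, re-takes the RCU lock every iteration):

    /* RCU replaces xa_lock_irq(): it keeps the eb from being freed
     * before we take the irq-unsafe eb->refs_lock and check refs. */
    rcu_read_lock();
    xa_for_each(&fs_info->buffer_tree, index, eb) {
            spin_lock(&eb->refs_lock);
            /* ... check the ref count, release or skip the eb ... */
            spin_unlock(&eb->refs_lock);
    }
    rcu_read_unlock();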

lockdep warning:

            =====================================================
            WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
            6.16.0-0_fbk701_debug_rc0_123_g4c06e63b9203 #1 Tainted: G E    N
            -----------------------------------------------------
            kswapd0/66 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
            ffff000011ffd600 (&eb->refs_lock){+.+.}-{3:3}, at: try_release_extent_buffer+0x18c/0x560

and this task is already holding:
            ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
            which would create a new lock dependency:
             (&buffer_xa_class){-.-.}-{3:3} -> (&eb->refs_lock){+.+.}-{3:3}

but this new dependency connects a HARDIRQ-irq-safe lock:
             (&buffer_xa_class){-.-.}-{3:3}

... which became HARDIRQ-irq-safe at:
              lock_acquire+0x178/0x358
              _raw_spin_lock_irqsave+0x60/0x88
              buffer_tree_clear_mark+0xc4/0x160
              end_bbio_meta_write+0x238/0x398
              btrfs_bio_end_io+0x1f8/0x330
              btrfs_orig_write_end_io+0x1c4/0x2c0
              bio_endio+0x63c/0x678
              blk_update_request+0x1c4/0xa00
              blk_mq_end_request+0x54/0x88
              virtblk_request_done+0x124/0x1d0
              blk_mq_complete_request+0x84/0xa0
              virtblk_done+0x130/0x238
              vring_interrupt+0x130/0x288
              __handle_irq_event_percpu+0x1e8/0x708
              handle_irq_event+0x98/0x1b0
              handle_fasteoi_irq+0x264/0x7c0
              generic_handle_domain_irq+0xa4/0x108
              gic_handle_irq+0x7c/0x1a0
              do_interrupt_handler+0xe4/0x148
              el1_interrupt+0x30/0x50
              el1h_64_irq_handler+0x14/0x20
              el1h_64_irq+0x6c/0x70
              _raw_spin_unlock_irq+0x38/0x70
              __run_timer_base+0xdc/0x5e0
              run_timer_softirq+0xa0/0x138
              handle_softirqs.llvm.13542289750107964195+0x32c/0xbd0
              ____do_softirq.llvm.17674514681856217165+0x18/0x28
              call_on_irq_stack+0x24/0x30
              __irq_exit_rcu+0x164/0x430
              irq_exit_rcu+0x18/0x88
              el1_interrupt+0x34/0x50
              el1h_64_irq_handler+0x14/0x20
              el1h_64_irq+0x6c/0x70
              arch_local_irq_enable+0x4/0x8
              do_idle+0x1a0/0x3b8
              cpu_startup_entry+0x60/0x80
              rest_init+0x204/0x228
              start_kernel+0x394/0x3f0
              __primary_switched+0x8c/0x8958

to a HARDIRQ-irq-unsafe lock:
             (&eb->refs_lock){+.+.}-{3:3}

... which became HARDIRQ-irq-unsafe at:
            ...
              lock_acquire+0x178/0x358
              _raw_spin_lock+0x4c/0x68
              free_extent_buffer_stale+0x2c/0x170
              btrfs_read_sys_array+0x1b0/0x338
              open_ctree+0xeb0/0x1df8
              btrfs_get_tree+0xb60/0x1110
              vfs_get_tree+0x8c/0x250
              fc_mount+0x20/0x98
              btrfs_get_tree+0x4a4/0x1110
              vfs_get_tree+0x8c/0x250
              do_new_mount+0x1e0/0x6c0
              path_mount+0x4ec/0xa58
              __arm64_sys_mount+0x370/0x490
              invoke_syscall+0x6c/0x208
              el0_svc_common+0x14c/0x1b8
              do_el0_svc+0x4c/0x60
              el0_svc+0x4c/0x160
              el0t_64_sync_handler+0x70/0x100
              el0t_64_sync+0x168/0x170

other info that might help us debug this:
             Possible interrupt unsafe locking scenario:
                   CPU0                    CPU1
                   ----                    ----
              lock(&eb->refs_lock);
                                           local_irq_disable();
                                           lock(&buffer_xa_class);
                                           lock(&eb->refs_lock);
              <Interrupt>
                lock(&buffer_xa_class);

  *** DEADLOCK ***
            2 locks held by kswapd0/66:
             #0: ffff800085506e40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xe8/0xe50
             #1: ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560

Link: https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst#:~:text=Multi%2Dlock%20dependency%20rules%3A
Fixes: 19d7f65f03 ("btrfs: convert the buffer_radix to an xarray")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-07 17:07:15 +02:00
David Sterba e8d2e254dc btrfs: use clear_and_wake_up_bit() where open coded
There are two cases open coding the clear and wake up pattern; we can
use the helper instead.
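
For reference, the two forms are roughly equivalent to this (the flag
name is made up for illustration):

    /* open coded */
    clear_bit(BTRFS_FS_SOME_FLAG, &fs_info->flags);
    smp_mb__after_atomic();
    wake_up_bit(&fs_info->flags, BTRFS_FS_SOME_FLAG);

    /* with the helper */
    clear_and_wake_up_bit(BTRFS_FS_SOME_FLAG, &fs_info->flags);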

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:21 +02:00
Daniel Vacek f2cb97ee96 btrfs: index buffer_tree using node size
So far we've been deriving the buffer tree index using the sector size.
But each extent buffer covers multiple sectors. This makes the buffer
tree rather sparse.

For example the typical and quite common configuration uses a sector
size of 4KiB and a node size of 16KiB. In this case the buffer tree uses
at most 25% of its slots; in other words, at least 75% of the tree slots
are wasted, never used.

We can score significant memory savings on the required tree nodes by
indexing the tree using the node size instead. As a result far fewer
slots are wasted and the tree can now use up to 100% of its slots.

Note: This works even with unaligned tree blocks as we can still get
      unique index by doing eb->start >> nodesize_shift.
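
In code, the index derivation change is essentially this (the fs_info
field names are my assumption, not necessarily the patch's):

    /* before: one slot per sector, mostly unused with 16K nodes */
    index = eb->start >> fs_info->sectorsize_bits;
    /* after: one slot per node, dense */
    index = eb->start >> fs_info->nodesize_bits;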

Getting some stats from running a fio write test, there is a bit of
variance.  The values presented in the table below are medians from 5
test runs.  The numbers are:

  - # of allocated ebs in the tree
  - # of leaf tree nodes
  - highest index in the tree (radix tree width):

ebs / leaves / index |   bare for-next    |      with fix
---------------------+--------------------+-------------------
post mount           |   16 /  11 / 10e5c |   16 /  10 / 4240
post test            | 5810 / 891 / 11cfc | 4420 / 252 / 473a
post rm              |  574 / 300 / 10ef0 |  540 / 163 / 46e9

In this case (10GiB filesystem) the height of the tree is still 3 levels
but the 4x width reduction is clearly visible, as expected. And since
the tree is denser we can see a 54-72% reduction in leaf nodes. That's
very close to ideal with this test, meaning the tree gets really dense
with this kind of workload.

Also, the fio results show no performance change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:20 +02:00
Boris Burkov 9e9ff875e4 btrfs: use readahead_expand() on compressed extents
We recently received a report of poor performance doing sequential
buffered reads of a file with compressed extents. With bs=128k, a naive
sequential dd ran as fast on a compressed file as on an uncompressed one
(1.2GB/s on my reproducing system), while with bs<32k this performance
tanked to ~300MB/s.

i.e., slow:

  dd if=some-compressed-file of=/dev/null bs=4k count=X

vs fast:

  dd if=some-compressed-file of=/dev/null bs=128k count=Y

The cause of this slowness is the overhead of looking up extent_maps to
enable readahead pre-caching on compressed extents
(add_ra_bio_pages()), as well as some overhead in the generic VFS
readahead code that we hit more in the slow case. Notably, the main
difference between the two read sizes is that in the large request case
we call btrfs_readahead() relatively rarely, while with the smaller
requests we call it for every compressed extent. So the fast case stays
in the btrfs readahead loop:

    while ((folio = readahead_folio(rac)) != NULL)
	    btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);

where the slower one breaks out of that loop every time. This results in
calling add_ra_bio_pages() a lot, doing lots of extent_map lookups,
extent_map locking, etc.

This happens because although add_ra_bio_pages() does add the
appropriate un-compressed file pages to the cache, it does not
communicate back to the ractl in any way. To solve this, we should be
using readahead_expand() to signal to readahead to expand the readahead
window.

This change passes the readahead_control into the btrfs_bio_ctrl and in
the case of compressed reads sets the expansion to the size of the
extent_map we already looked up anyway. It skips the subpage case as
that one already doesn't do add_ra_bio_pages().
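
In essence the change amounts to something like this at the point where
a compressed extent map is found (a sketch with simplified names, not
the verbatim patch):

    /* Grow the readahead window to cover the whole compressed extent,
     * so the following folios are part of the same readahead batch. */
    if (bio_ctrl->ractl && extent_map_is_compressed(em))
            readahead_expand(bio_ctrl->ractl, em->start, em->len);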

With this change, whether we use bs=4k or bs=128k, btrfs expands the
readahead window up to the largest compressed extent we have seen so far
(in the trivial example: 128k) and the call stacks of the two modes look
identical. Notably, we barely call add_ra_bio_pages() at all, and the
performance becomes identical as well. So this change certainly "fixes"
this performance problem.

Of course, it does seem to raise a few questions:

1. Will this waste too much page cache with a too large ra window?
2. Will this somehow cause bugs prevented by the more thoughtful
   checking in add_ra_bio_pages?
3. Should we delete add_ra_bio_pages?

My stabs at some answers:

1. Hard to say. See attempts at generic performance testing below. Is
   there a "readahead_shrink" we should be using? Should we expand more
   slowly, by half the remaining em size each time?
2. I don't think so. Since the new behavior is indistinguishable from
   reading the file with a larger read size passed in, I don't see why
   one would be safe but not the other.
3. Probably! I tested that and it was fine in fstests, and it seems like
   the pages would get re-used just as well in the readahead case.
   However, it is possible some reads that use page cache but not
   btrfs_readahead() could suffer. I will investigate this further as a
   follow up.

I tested the performance implications of this change in 3 ways (using
compress-force=zstd:3 for compression):

Directly test the affected workload of small sequential reads on a
compressed file (improved from ~250MB/s to ~1.2GB/s)

==========for-next==========
  dd /mnt/lol/non-cmpr 4k
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.02983 s, 712 MB/s
  dd /mnt/lol/non-cmpr 128k
  32768+0 records in
  32768+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.92403 s, 725 MB/s
  dd /mnt/lol/cmpr 4k
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.8832 s, 240 MB/s
  dd /mnt/lol/cmpr 128k
  32768+0 records in
  32768+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.71001 s, 1.2 GB/s

==========ra-expand==========
  dd /mnt/lol/non-cmpr 4k
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.09001 s, 705 MB/s
  dd /mnt/lol/non-cmpr 128k
  32768+0 records in
  32768+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.07664 s, 707 MB/s
  dd /mnt/lol/cmpr 4k
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.79531 s, 1.1 GB/s
  dd /mnt/lol/cmpr 128k
  32768+0 records in
  32768+0 records out
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.69533 s, 1.2 GB/s

Built the linux kernel from clean (no change)

Ran fsperf. Mostly neutral results with some improvements and
regressions here and there.

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:00 +02:00
David Sterba ab5fcbb1ad btrfs: use pgoff_t for page index variables
Any conversion of offsets in the logical or the physical mapping space
of the pages is done by a shift and the target type should be pgoff_t
(type of struct page::index). Fix the locations where it's still
unsigned long.
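
For example (illustrative only):

    /* before */
    unsigned long index = start >> PAGE_SHIFT;
    /* after: page cache indices are pgoff_t */
    pgoff_t index = start >> PAGE_SHIFT;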

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:05 +02:00
David Sterba 44cac52341 btrfs: use our message helpers instead of pr_err/pr_warn/pr_info
Our message helpers accept a NULL fs_info in contexts that do not
provide one and still print the common header of the message. The use
of the pr_* helpers is only for special reasons, like module loading,
device scanning or multi-line output (print-tree).
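
For example (the message text is made up):

    /* no fs_info available in this context, pass NULL */
    btrfs_warn(NULL, "duplicate device path %s", path);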

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:04 +02:00
Filipe Manana 790b88c4dd btrfs: make extent_buffer_test_bit() return a boolean instead
All the callers want to determine if a bit is set, and all of them call
the function and do a double negation (!!) on its result to get a
boolean. So change it to return a boolean and simplify the callers.
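
That is, callers go from roughly:

    int bit = !!extent_buffer_test_bit(eb, kaddr, offset);

to simply:

    bool bit = extent_buffer_test_bit(eb, kaddr, offset);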

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
David Sterba 55cd57faa5 btrfs: use folio_end() where appropriate
Simplify the folio_pos() + folio_size() pattern by using the new helper.
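
That is (assuming the helper's obvious semantics):

    /* before */
    end = folio_pos(folio) + folio_size(folio);
    /* after */
    end = folio_end(folio);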

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
David Sterba f1f22dfbea btrfs: use btrfs_root_id() where not done yet
A few more remaining cases where we can use the helper.
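
The helper replaces the open-coded key access, e.g.:

    /* before */
    u64 id = root->root_key.objectid;
    /* after */
    u64 id = btrfs_root_id(root);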

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:31 +02:00
Filipe Manana b769777d92 btrfs: use refcount_t type for the extent buffer reference counter
Instead of using a bare atomic, use the refcount_t type, which despite
being a structure that contains only an atomic, has an API that checks
for underflows and other hazards. This doesn't change the size of the
extent_buffer structure.

This removes the need to do things like this:

    WARN_ON(atomic_read(&eb->refs) == 0);
    if (atomic_dec_and_test(&eb->refs)) {
        (...)
    }

And do just:

    if (refcount_dec_and_test(&eb->refs)) {
        (...)
    }

This works because refcount_dec_and_test() already triggers a warning
when we decrement a ref count that is already 0 (or below).

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana 2697b61597 btrfs: add comment for optimization in free_extent_buffer()
There's special atomic compare and exchange logic which serves to avoid
locking the extent buffer's refs_lock spinlock and therefore reduce lock
contention, so add a comment to make it more obvious.
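
Roughly, the fast path being documented looks like this (a simplified
sketch, not the exact code; the threshold matches the break conditions
shown in the commit below):

    int refs = atomic_read(&eb->refs);

    /* Fast path: while several references are held, drop one with a
     * bare cmpxchg and avoid taking eb->refs_lock at all. */
    while (refs > 3) {
            if (atomic_try_cmpxchg(&eb->refs, &refs, refs - 1))
                    return;
    }
    /* Possibly among the last references: fall back to refs_lock. */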

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana 71c086b30d btrfs: reorganize logic at free_extent_buffer() for better readability
It's hard to read the logic to break out of the while loop since it's a
very long expression consisting of a logical or of two composite
expressions, each composed of a logical and. Further, each one also
tests for the EXTENT_BUFFER_UNMAPPED bit, making it more verbose than
necessary.

So change from this:

    if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3)
        || (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
            refs == 1))
       break;

To this:

    if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
        if (refs == 1)
            break;
    } else if (refs <= 3) {
        break;
    }

At least on x86_64 using gcc 9.3.0, this doesn't change the object size.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Qu Wenruo 582cd4bad4 btrfs: rename btrfs_subpage structure
With the incoming large data folios support, the structure name
btrfs_subpage is no longer correct, as we can have multiple blocks
inside a large folio while the block size is still the page size.

So to follow the naming scheme of iomap, rename btrfs_subpage to
btrfs_folio_state, along with the involved enums.

There are still exported functions with "btrfs_subpage_" prefix, and I
believe for metadata the name "subpage" will stay forever as we will
never allocate a folio larger than nodesize anyway.

The full cleanup of the word "subpage" will happen in much smaller steps
in the future.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
Filipe Manana e5b5596011 btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb
If we break out of the loop because an extent buffer doesn't have the bit
EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once
before we tested for the bit and break out of the loop, and once again
after the loop.

Fix this by testing the bit and exiting before unlocking the xarray.
The time spent testing the bit is negligible and it's not worth trying
to do that outside the critical section delimited by the xarray lock due
to the code complexity required to avoid it (like using a local boolean
variable to track whether the xarray is locked or not). The xarray unlock
only needs to be done before calling release_extent_buffer(), as that
needs to lock the xarray (through xa_cmpxchg_irq()) and does a more
significant amount of work.

Fixes: 19d7f65f03 ("btrfs: convert the buffer_radix to an xarray")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:20:33 +02:00
Josef Bacik 4db7384ce5 btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
In the zoned mode there's a bug in the extent buffer tree conversion to
xarray. When btrfs_check_write_meta_pointer() fails, the reference for
the eb is dropped and the code continues, but the references are dropped
again when releasing the batch.

Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 19d7f65f03 ("btrfs: convert the buffer_radix to an xarray")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-27 13:26:28 +02:00
Boris Burkov 3649833a58 btrfs: fix broken drop_caches on extent buffer folios
The (correct) commit e41c81d0d3 ("mm/truncate: Replace page_mapped()
call in invalidate_inode_page()") replaced the page_mapped(page) check
with a refcount check. However, this refcount check does not work as
expected with drop_caches for btrfs's metadata pages.

Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3. One from the initial
call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(), and
one from folio_attach_private(). We would expect such pages to get dropped
by drop_caches. However, drop_caches calls into mapping_evict_folio() via
mapping_try_invalidate() which gets a reference on the folio with
find_lock_entries(). As a result, these pages have a refcount of 4, and
fail this check.

For what it's worth, such pages do get reclaimed under memory pressure,
so I would say that while this behavior is surprising, it is not really
dangerously broken.

When I asked the mm folks about the expected refcount in this case, I
was told that the correct thing to do is to donate the refcount from the
original allocation to the page cache after inserting it.
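
The resulting ownership flow is roughly this (error handling omitted,
names approximate):

    folio = filemap_alloc_folio(GFP_NOFS, 0);           /* ref 1: allocation */
    filemap_add_folio(mapping, folio, index, GFP_NOFS); /* ref 2: page cache */
    folio_attach_private(folio, eb);                    /* ref 3: private */
    folio_put(folio); /* donate the allocation ref to the page cache */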

Therefore, attempt to fix this by adding a folio_put() to the critical
spot in alloc_extent_buffer() where we are sure that we have really
allocated and attached new pages. We must also adjust
folio_detach_private() to properly handle being the last reference to
the folio and not do a use-after-free after folio_detach_private().

extent_buffers allocated by clone_extent_buffer() and
alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership
from allocation to insertion in the mapping does not apply to them.
However, we can still folio_put() them safely once they are fully
allocated and folio_attach_private() has been called.

Finally, removing the generic folio_put() for the allocation from
btrfs_detach_extent_buffer_folios() means we need to be careful to do
the appropriate folio_put() in the allocation failure paths in
alloc_extent_buffer(), clone_extent_buffer() and
alloc_dummy_extent_buffer().

Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
Tested-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:55 +02:00
Daniel Vacek 13ae88706a btrfs: get rid of goto in alloc_test_extent_buffer()
The `free_eb` label is used only once. Simplify by moving the code in
place.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Josef Bacik 5e121ae687 btrfs: use buffer xarray for extent buffer writeback operations
Currently we have an ugly back and forth with the btree writeback where
we find the folio, find the eb associated with that folio, and then
attempt the writeback.  This results in two different paths for subpage
ebs and >= page size ebs.

Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly.  This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.
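
Conceptually, the writeback loop becomes something like this sketch (the
real code batches lookups and handles contention):

    /* Walk only the ebs tagged dirty instead of going through folios. */
    xa_for_each_marked(&fs_info->buffer_tree, index, eb, PAGECACHE_TAG_DIRTY) {
            /* ... lock the eb, write it out, tag it writeback ... */
    }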

I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM.  I used
smallfiles100k, but reduced the files to 1k to make it run faster. The
results are as follows, with the statistically significant improvements
marked with *; there were no regressions.  fsperf was run with -n 10 for
both runs, so the baseline is the average of 10 runs and the test is the
average of 10 runs.

smallfiles100k results
      metric           baseline       current        stdev            diff
================================================================================
avg_commit_ms               68.58         58.44          3.35   -14.79% *
commits                    270.60        254.70         16.24    -5.88%
dev_read_iops                  48            48             0     0.00%
dev_read_kbytes              1044          1044             0     0.00%
dev_write_iops          866117.90     850028.10      14292.20    -1.86%
dev_write_kbytes      10939976.40   10605701.20     351330.32    -3.06%
elapsed                     49.30            33          1.64   -33.06% *
end_state_mount_ns    41251498.80   35773220.70    2531205.32   -13.28% *
end_state_umount_ns      1.90e+09      1.50e+09   14186226.85   -21.38% *
max_commit_ms                 139        111.60          9.72   -19.71% *
sys_cpu                      4.90          3.86          0.88   -21.29%
write_bw_bytes        42935768.20   64318451.10    1609415.05    49.80% *
write_clat_ns_mean      366431.69     243202.60      14161.98   -33.63% *
write_clat_ns_p50        49203.20         20992        264.40   -57.34% *
write_clat_ns_p99          827392     653721.60      65904.74   -20.99% *
write_io_kbytes           2035940       2035940             0     0.00%
write_iops               10482.37      15702.75        392.92    49.80% *
write_lat_ns_max         1.01e+08      90516129    3910102.06   -10.29% *
write_lat_ns_mean       366556.19     243308.48      14154.51   -33.62% *

As you can see we get about a 33% decrease in runtime and a 50%
throughput increase, which is pretty significant.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Josef Bacik 4bc0a3cb75 btrfs: set DIRTY and WRITEBACK tags on the buffer_tree
In preparation for changing how we do writeout of extent buffers, start
tagging the extent buffer xarray with DIRTY and WRITEBACK to make it
easier to find extent buffers that are in either state.
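
The tags map onto xarray marks, roughly:

    /* eb dirtied */
    xa_set_mark(&fs_info->buffer_tree, index, PAGECACHE_TAG_DIRTY);
    /* writeback starts: dirty -> writeback */
    xa_clear_mark(&fs_info->buffer_tree, index, PAGECACHE_TAG_DIRTY);
    xa_set_mark(&fs_info->buffer_tree, index, PAGECACHE_TAG_WRITEBACK);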

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Josef Bacik 19d7f65f03 btrfs: convert the buffer_radix to an xarray
In order to fully utilize xarray tagging to improve writeback we need to
convert the buffer_radix to a proper xarray.  This conversion is
relatively straightforward as the radix code uses the xarray underneath.
Using xarray directly allows for quite a lot less code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Qu Wenruo 73d6bcf41b btrfs: subpage: reject tree blocks which are not nodesize aligned
When btrfs subpage support (fs block size < page size) was introduced, a
subpage filesystem would only reject tree blocks which cross page
boundaries.

This used to be a compromise to simplify the tree block handling and
still allowing subpage cases to read some old converted filesystems
which did not have proper chunk alignment.

But in practice, suppose we have the following unaligned tree block on a
64K page sized system:

  0                           32K           44K             60K  64K
  |                                         |///////////////|    |

Although btrfs has no problem reading the tree block at [44K, 60K), if
the extent allocator is allocating another tree block, it may choose the
range [60K, 74K), as the extent allocator has no awareness of whether
it's a subpage metadata request or not.

Then we'd get -EINVAL from the following sequence:

 btrfs_alloc_tree_block()
 |- btrfs_reserve_extent()
 |  Which returned range [60K, 74K)
 |- btrfs_init_new_buffer()
    |- btrfs_find_create_tree_block()
       |- alloc_extent_buffer()
          |- check_eb_alignment()
	     Which returned -EINVAL, because the range crosses page
	     boundary.

This situation will not fix itself and mostly ends up marking the fs
read-only.

Thankfully we didn't really get such reports in the real world because:

- The original unaligned tree blocks are only caused by older
  btrfs-convert, from before the btrfs-convert rework done in v4.6,
  where a converted btrfs filesystem could have metadata block groups
  aligned to neither nodesize nor stripe size (64K).

  But after btrfs-progs v4.6, all chunks allocated are stripe (64K)
  aligned, thus no more such problems.

Considering how old the fix is (v4.6 was released almost 10 years ago)
and that subpage support for btrfs was introduced in v5.15, it should be
safe to reject those unaligned tree blocks.
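
The stricter check amounts to something like this sketch:

    /* Reject tree blocks not aligned to nodesize, not only the ones
     * crossing a page boundary. */
    if (!IS_ALIGNED(eb->start, fs_info->nodesize))
            return -EINVAL;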

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
Daniel Vacek 406698623a btrfs: move folio initialization to one place in attach_eb_folio_to_filemap()
This is just a trivial change. The code looks a bit more readable this way, IMO.

Move initialization of existing_folio to the beginning of the retry loop
so it's set to NULL at one place.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba 9e0a739a9e btrfs: convert ASSERT(0) with handled errors to DEBUG_WARN()
The use of ASSERT(0) is maybe useful for some cases but is more like a
notice for developers. Assertions can be compiled in independently, so
convert it to a debugging helper.

The difference is that it's just a warning and will not end up in BUG().
The converted cases are in connection with proper error handling.
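
An illustration of the conversion (the surrounding context is made up):

    if (ret < 0) {
            DEBUG_WARN();   /* was: ASSERT(0); */
            return ret;     /* the existing error handling is kept */
    }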

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:47 +02:00