When mounting a btrfs file system on a virtio-blk device that supports
native Zone Append, a WARN triggers in btrfs' space management code.
Looking further into btrfs' zoned statistics shows that the filesystem
expects the zones to be in use, but the write pointers are 0:
# cat /sys/fs/btrfs/8eabd2e7-3294-4f9e-9b58-7e64135c8bf4/zoned_stats
active block-groups: 4
reclaimable: 0
unused: 0
need reclaim: false
data relocation block-group: 1342177280
active zones:
start: 1073741824, wp: 0 used: 0, reserved: 0, unusable: 0
start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0
start: 1610612736, wp: 0 used: 16384, reserved: 0, unusable: 18446744073709535232
start: 1879048192, wp: 0 used: 131072, reserved: 0, unusable: 18446744073709420544
Looking at the blkzone report output for the zone in question
(1610612736), the write pointer on the device has moved, but the
filesystem did not see the change:
# blkzone report -c 1 -o 0x300000 /dev/vda
start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x000040 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
The zone write pointer is 0 because btrfs uses the cached version of
blkdev_report_zones() and, as virtio-blk supports native zone append,
blkdev_revalidate_zones() does not initialize the zone write plugs in
this case.
Not skipping the revalidation of sequential zones in the
blkdev_revalidate_zones() call chain fixes this issue.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Fixes: a6aa36e957 ("block: Remove zone write plugs when handling native zone append writes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit fe0418eb9b ("block: Prevent potential deadlocks in zone write
plug error recovery") added a WARN check in disk_put_zone_wplug() to
verify that when the last reference to a zone write plug is dropped,
this zone write plug does not have the BLK_ZONE_WPLUG_PLUGGED flag set,
that is, that it is not plugged.
However, the function disk_zone_wplug_abort(), which is called for zone
reset and zone finish operations, does not clear this flag after
emptying a zone write plug BIO list. This can cause the
disk_put_zone_wplug() warning to trigger if the user (erroneously, as
that is bad practice) issues zone reset or zone finish operations while
the target zone still has plugged BIOs.
Modify disk_zone_wplug_abort() to clear the BLK_ZONE_WPLUG_PLUGGED flag
once the BIO list has been emptied. And while at it, also add a lockdep
annotation to ensure that this function is called with the zone write
plug spinlock held.
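For illustration, the abort path could end up looking roughly like this
(a sketch only; the bio_list field and the blk_zone_wplug_bio_io_error()
helper names are assumed from the rest of this log):

static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
{
        struct bio *bio;

        /* The caller must hold the zone write plug spinlock. */
        lockdep_assert_held(&zwplug->lock);

        while ((bio = bio_list_pop(&zwplug->bio_list)))
                blk_zone_wplug_bio_io_error(zwplug, bio);

        /* The BIO list is now empty: the zone is no longer plugged. */
        zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
}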
Fixes: fe0418eb9b ("block: Prevent potential deadlocks in zone write plug error recovery")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 2284eec5053d ("block: introduce blkdev_get_zone_info()")
introduced the report_active field in struct blk_report_zones_args so
that open and closed zones can be reported with the condition
BLK_ZONE_COND_ACTIVE in the case of a cached zone report.
However, the args pointer to a struct blk_report_zones_args that is
passed to disk_report_zones() can be NULL, e.g. in the case of internal
report zones operations for device mapper zoned targets.
Fix disk_report_zones() to check that args is not NULL before updating
a zone condition for cached zone reports.
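For example, the condition update could be guarded roughly as follows
(sketch; the surrounding code of disk_report_zones() is assumed):

/* In disk_report_zones() (sketch): */
if (args && args->report_active &&
    (zone->cond == BLK_ZONE_COND_IMP_OPEN ||
     zone->cond == BLK_ZONE_COND_EXP_OPEN ||
     zone->cond == BLK_ZONE_COND_CLOSED))
        zone->cond = BLK_ZONE_COND_ACTIVE;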
Fixes: 2284eec5053d ("block: introduce blkdev_get_zone_info()")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For zoned block devices that do not need zone write plugs (e.g. most
device mapper devices that support zones), the disk hash table of zone
write plugs is NULL. For such devices, blk_zone_reset_all_bio_endio()
should not attempt to scan this hash table as that causes a NULL pointer
dereference.
Fix this by checking that the disk does have zone write plugs using the
atomic counter. This is equivalent to checking for a non-NULL hash table
but has the advantage of also speeding up the execution of
blk_zone_reset_all_bio_endio() for devices that do use zone write plugs
but do not have any plug in the hash table (e.g. a disk with only full
zones).
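A sketch of the check, using the nr_zone_wplugs atomic counter
introduced for native zone append handling:

static void blk_zone_reset_all_bio_endio(struct bio *bio)
{
        struct gendisk *disk = bio->bi_bdev->bd_disk;

        /* Nothing to do for disks that have no zone write plug at all,
         * including disks with a NULL zone write plug hash table. */
        if (!atomic_read(&disk->nr_zone_wplugs))
                return;

        /* ... scan the hash table and update the zone write plugs ... */
}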
Fixes: efae226c2e ("block: handle zone management operations completions")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the following code into the only caller of disk_zone_wplug_add_bio():
- The code for clearing the REQ_NOWAIT flag.
- The code that sets the BLK_ZONE_WPLUG_PLUGGED flag.
- The disk_zone_wplug_schedule_bio_work() call.
This patch moves all code that is related to REQ_NOWAIT or to bio
scheduling into a single function. Additionally, the 'schedule_bio_work'
variable is removed. No functionality has been changed.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Document that all callers hold this lock because the code in
disk_zone_wplug_schedule_bio_work() depends on this.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove a superfluous parenthesis that was introduced by commit fa8555630b
("blk-zoned: Improve the queue reference count strategy documentation").
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the function bdev_zone_start() as a more explicit (and clear)
replacement for ALIGN_DOWN() to get the start sector of a zone
containing a particular sector of a zoned block device.
Use this new helper in blkdev_get_zone_info() and
blkdev_report_zones_cached().
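A possible form of this helper (sketch; relies on zone sizes being a
power of two):

static inline sector_t bdev_zone_start(struct block_device *bdev,
                                       sector_t sector)
{
        /* Zone sizes are a power of two, so aligning down to the zone
         * size gives the start sector of the containing zone. */
        return ALIGN_DOWN(sector, bdev_zone_sectors(bdev));
}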
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The helper function blk_zone_wp_offset() is called from
disk_zone_wplug_sync_wp_offset(), and again called from
blk_revalidate_seq_zone() right after the call to
disk_zone_wplug_sync_wp_offset().
Change disk_zone_wplug_sync_wp_offset() to return the value obtained
with blk_zone_wp_offset() to avoid this double call, which simplifies
blk_revalidate_seq_zone() a little.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_zone_wp_offset() is always called with a struct blk_zone obtained
from the device, that is, it will never see the BLK_ZONE_COND_ACTIVE
condition. However, handling this condition makes this function more
robust and will also avoid issues when propagation of cached report
requests to underlying stacked devices is implemented. Add
BLK_ZONE_COND_ACTIVE as a new case in the blk_zone_wp_offset() switch.
Also while at it, change the handling of the full condition to return
UINT_MAX for the zone write pointer to reflect the fact that the write
pointer of a full zone is invalid.
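The resulting switch could look roughly like this (sketch):

static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
{
        switch (zone->cond) {
        case BLK_ZONE_COND_NOT_WP:
        case BLK_ZONE_COND_EMPTY:
                return 0;
        case BLK_ZONE_COND_IMP_OPEN:
        case BLK_ZONE_COND_EXP_OPEN:
        case BLK_ZONE_COND_CLOSED:
        case BLK_ZONE_COND_ACTIVE:
                return zone->wp - zone->start;
        case BLK_ZONE_COND_FULL:
        default:
                /* Full, offline and read-only zones have no valid
                 * write pointer. */
                return UINT_MAX;
        }
}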
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkdev_do_report_zones returns the number of reported zones, but
blkdev_get_zone_info returns 0 or an errno. Translate to the expected
return value in blkdev_report_zone_fallback.
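For instance (sketch; the fallback argument handling is assumed):

/* Inside blkdev_report_zone_fallback() (sketch): */
ret = blkdev_do_report_zones(bdev, sector, 1, &args);
if (ret < 0)
        return ret;
/* blkdev_do_report_zones() returns the number of zones reported: not
 * getting back the single requested zone is an error. */
if (!ret)
        return -EIO;
return 0;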
Fixes: b037d41762fd ("block: introduce blkdev_get_zone_info()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No zone write plugs are allocated when a zone is opened by issuing a
Zone Append operation to it. This makes the cached zone reporting
incorrectly report empty zones if the file system is unmounted and
report zones is called after that, e.g. by xfstests test cases using
the scratch device.
Fix this by recording if zone append was used on a device, and disable
cached reporting for the device until a ZONE_RESET_ALL happens that
guarantees all zones are empty.
We could probably do even better using a per-zone flag, but the
practical use cases for cached zone reporting after the initial mount
are rather limited, so let's keep things simple for now.
Fixes: 31f0656a4a ("block: introduce blkdev_report_zones_cached()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk->zones_cond is allocated for all zoned devices, but
disk_free_zone_resources skips it when the zone write plug hash is not
allocated, leaking the allocation for non-mq devices that don't emulate
zone append. This is reported by kmemleak-enabled xfstests for various
tests that use simple device mapper targets.
Fix this by moving all code that requires write plugs from
disk_free_zone_resources into disk_destroy_zone_wplugs_hash_table
and executing the rest of the code, including the disk->zones_cond
freeing, unconditionally.
Fixes: 6e945ffb65 ("block: use zone condition to determine conventional zones")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Modify queue_zone_wplug_show() to include the condition of a zone write
plug in the zone_wplugs debugfs attribute of a zoned block device.
To improve readability and ease of use, the zone condition name obtained
with blk_zone_cond_str() is printed rather than the raw condition value.
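The attribute line could then be emitted roughly as follows (sketch;
the zone write plug field names are assumed):

/* In queue_zone_wplug_show() (sketch): */
seq_printf(m, "zone_no %u, flags 0x%x, wp_offset %u, cond %s\n",
           zwplug->zone_no, zwplug->flags, zwplug->wp_offset,
           blk_zone_cond_str(zwplug->cond));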
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the output of the zone_wplugs debugfs attribute file more easily
readable by adding the names of the zone write plug fields to the
output.
No functional changes.
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the new BLKREPORTZONESV2 ioctl command to allow user
applications access to the fast zone report implemented by
blkdev_report_zones_cached(). This new ioctl is defined as number 142
and is documented in include/uapi/linux/fs.h.
Unlike the existing BLKREPORTZONES ioctl, this new ioctl uses the flags
field of struct blk_zone_report also as an input. If the user sets the
BLK_ZONE_REP_CACHED flag as an input, then blkdev_report_zones_cached()
is used to generate the zone report using cached zone information. If
this flag is not set, then BLKREPORTZONESV2 behaves in the same manner
as BLKREPORTZONES and the zone report is generated by accessing the
zoned device.
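For illustration, user space could request a cached zone report roughly
as follows (sketch; BLKREPORTZONESV2 and BLK_ZONE_REP_CACHED are the new
definitions described above, the rest is the existing zone report UAPI):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

static int report_cached(const char *dev, unsigned int nr_zones)
{
        struct blk_zone_report *rep;
        int fd, ret;

        rep = calloc(1, sizeof(*rep) + nr_zones * sizeof(struct blk_zone));
        if (!rep)
                return -1;

        fd = open(dev, O_RDONLY);
        if (fd < 0) {
                free(rep);
                return -1;
        }

        rep->sector = 0;                /* start of the device */
        rep->nr_zones = nr_zones;       /* maximum number of zones to report */
        rep->flags = BLK_ZONE_REP_CACHED;  /* input flag: use cached zone info */

        ret = ioctl(fd, BLKREPORTZONESV2, rep);
        if (!ret)
                printf("%u zones reported\n", rep->nr_zones);

        close(fd);
        free(rep);
        return ret;
}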
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the function blkdev_report_zones_cached() to provide a fast
zone report built using the blkdev_get_zone_info() function, which gets
zone information from a disk zones_cond array or zone write plugs.
For a large capacity SMR drive, such a fast zone report can complete in
a few milliseconds, compared to completion times of several seconds
when the zone report is obtained from the device.
The zone report is built in the same manner as with the regular
blkdev_report_zones() function, that is, the first zone reported is the
one containing the specified start sector and the report is limited to
the specified number of zones (nr_zones argument). The information for
each zone in the report is obtained using blkdev_get_zone_info().
For zoned devices that do not use zone write plug resources,
using blkdev_get_zone_info() is inefficient as the zone report would
be very slow, generated one zone at a time. To avoid this,
blkdev_report_zones_cached() falls back to calling
blkdev_do_report_zones() to execute a regular zone report. In this case,
the .report_active field of struct blk_report_zones_args is set to true
to report zone conditions using the BLK_ZONE_COND_ACTIVE condition in
place of the implicit open, explicit open and closed conditions.
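Conceptually, the report construction looks like the following loop
(sketch; cb and data are the user report callback and its private data):

/* Inside blkdev_report_zones_cached() (conceptual): */
sector = bdev_zone_start(bdev, sector);
for (i = 0; i < nr_zones; i++) {
        if (sector >= get_capacity(bdev->bd_disk))
                break;
        ret = blkdev_get_zone_info(bdev, sector, &zone);
        if (ret)
                return ret;
        ret = cb(&zone, i, data);
        if (ret)
                return ret;
        sector += bdev_zone_sectors(bdev);
}
return i;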
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the function blkdev_get_zone_info() to obtain the information
of a single zone from cached zone data, that is, either from the zone
write plug for the target zone if it exists, or from the disk zones_cond
array otherwise.
Since sequential zones that do not have a zone write plug are either
full, empty or in a bad state (read-only or offline), the zone write
pointer can be inferred from the zone condition cached in the disk
zones_cond array. For sequential zones that have a zone write plug, the
zone condition and zone write pointer are obtained from the condition
and write pointer offset managed with the zone write plug. This allows
obtaining the information for a zone much more quickly than having to
execute a report zones command on the device.
blkdev_get_zone_info() falls back to using a regular zone report if the
target zone is flagged as needing an update with the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag, or if the target device does not
use zone write plugs (i.e. a device mapper device). In this case, the
new function blkdev_report_zone_fallback() is used and the zone
condition is reported consistently with the cached report, that is, the
BLK_ZONE_COND_ACTIVE condition is used in place of the implicit open,
explicit open and closed conditions. This is achieved by adding the
.report_active field to struct blk_report_zones_args and by having
disk_report_zone() set the correct zone condition if .report_active is
true.
In preparation for using blkdev_get_zone_info() in upcoming file system
changes, also export this function as a GPL symbol.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for implementing cached zone reports, split the main part
of the code of blkdev_report_zones() into the helper function
blkdev_do_report_zones(), with this new helper taking as argument a
struct blk_report_zones_args pointer instead of a report callback
function and its private argument.
No functional changes.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function blk_revalidate_zone_cond() already caches the condition of
all zones of a zoned block device in the zones_cond array of a gendisk.
However, the zone conditions are updated only when the device is scanned
or revalidated.
Implement tracking of the runtime changes to zone conditions using
the new cond field in struct blk_zone_wplug. The size of this structure
remains 112 Bytes as the new field replaces the 4 Bytes padding at the
end of the structure.
Because zones that do not have a zone write plug can be in the empty,
implicit open, explicit open or full condition, the zones_cond array of
a disk is used to track the conditions of zones that do not have a zone
write plug. The condition of such a zone is updated in the disk
zones_cond array when a zone reset, reset all or finish operation is
executed, and also when a zone write plug is removed from the disk hash
table when the zone becomes full.
Since a device may automatically close an implicitly open zone when
writing to an empty or closed zone, if the total number of open zones
has reached the device limit, the BLK_ZONE_COND_IMP_OPEN and
BLK_ZONE_COND_CLOSED zone conditions cannot be precisely tracked. To
overcome this, the zone condition BLK_ZONE_COND_ACTIVE is introduced to
represent a zone that has the condition BLK_ZONE_COND_IMP_OPEN,
BLK_ZONE_COND_EXP_OPEN or BLK_ZONE_COND_CLOSED. This follows the
definition of an active zone as defined in the NVMe Zoned Namespace
specifications. As such, for a zoned device that has a limit on the
maximum number of open zones, we will never have more zones in the
BLK_ZONE_COND_ACTIVE condition than the device limit. This is compatible
with the SCSI ZBC and ATA ZAC specifications for SMR HDDs as these
devices do not have a limit on the number of active zones.
The function disk_zone_wplug_set_wp_offset() is modified to use the new
helper disk_zone_wplug_update_cond() to update a zone write plug
condition whenever a zone write plug write offset is updated on
submission or merging of write BIOs to a zone.
The functions blk_zone_reset_bio_endio(), blk_zone_reset_all_bio_endio()
and blk_zone_finish_bio_endio() are modified to update the condition of
the zones targeted by reset, reset_all and finish operations, either
through disk_zone_wplug_set_wp_offset() for zones that have a
zone write plug, or using the disk_zone_set_cond() helper to update the
zones_cond array of the disk for zones that do not have a zone write
plug.
When a zone write plug is removed from the disk hash table (when the
zone becomes empty or full), the condition of struct blk_zone_wplug is
used to update the disk zones_cond array. Conversely, when a zone write
plug is added to the disk hash table, the zones_cond array is used to
initialize the zone write plug condition.
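The condition update from a write pointer offset could be as simple as
(sketch; assumes the single per-disk zone capacity already used by zone
write plugging):

static void disk_zone_wplug_update_cond(struct gendisk *disk,
                                        struct blk_zone_wplug *zwplug)
{
        if (!zwplug->wp_offset)
                zwplug->cond = BLK_ZONE_COND_EMPTY;
        else if (zwplug->wp_offset >= disk->zone_capacity)
                zwplug->cond = BLK_ZONE_COND_FULL;
        else
                /* Implicitly/explicitly open or closed: active. */
                zwplug->cond = BLK_ZONE_COND_ACTIVE;
}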
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The conv_zones_bitmap field of struct gendisk is used to define a bitmap
to identify the conventional zones of a zoned block device. The bit for
a zone is set in this bitmap if the zone is a conventional one, that is,
if the zone type is BLK_ZONE_TYPE_CONVENTIONAL. For such a zone, this
always corresponds to the zone condition BLK_ZONE_COND_NOT_WP.
In other words, conv_zones_bitmap tracks a single condition of the
zones of a zoned block device.
In preparation for tracking more zone conditions, change
conv_zones_bitmap into an array of zone conditions, using 1 byte per
zone. This increases the memory usage from 1 bit per zone to 1 byte per
zone, that is, from 16 KiB to about 100 KiB for a 30 TB SMR HDD with 256
MiB zones. This is a trade-off to allow fast cached report zones later
on top of this change.
Rename the conv_zones_bitmap field of struct gendisk to zones_cond. Add
a blk_revalidate_zone_cond() function to initialize the zones_cond array
of a disk during device scan and to update it on device revalidation.
Move the allocation of the zones_cond array to
disk_revalidate_zone_resources(), making sure that this array is always
allocated, even for devices that do not need zone write plugs (zone
resources), to ensure that bdev_zone_is_seq() can be re-implemented to
use the zone condition array in place of the conv zones bitmap.
Finally, the function bdev_zone_is_seq() is rewritten to use a test on
the condition of the target zone.
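With the zones_cond array always allocated, the helper can become a
simple condition test (sketch):

static inline bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector)
{
        struct gendisk *disk = bdev->bd_disk;

        if (!bdev_is_zoned(bdev))
                return false;

        /* A zone is sequential if it is not a conventional zone. */
        return disk->zones_cond[disk_zone_no(disk, sector)] !=
                BLK_ZONE_COND_NOT_WP;
}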
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reorganize the fields of struct blk_zone_wplug to remove a hole after
the wp_offset field and avoid having the bio_work structure split
between 2 cache lines.
No functional changes.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit b76b840fd9 ("dm: Fix dm-zoned-reclaim zone write pointer
alignment") introduced an indirect call for the callback function of a
report zones executed with blkdev_report_zones(). This is necessary so
that the function disk_zone_wplug_sync_wp_offset() can be called to
refresh a zone write plug zone write pointer offset after a write error.
However, this solution makes the path that zone information takes
harder to follow.
Clean this up by introducing the new blk_report_zones_args structure to
define a zone report callback and its private data and introduce the
helper function disk_report_zone() which calls both
disk_zone_wplug_sync_wp_offset() and the zone report user callback
function for all zones of a zone report. This helper function must be
called by all block device drivers that implement the report zones
block operation in order to correctly report zone information.
All block device drivers supporting the report_zones block operation are
updated to use this new scheme.
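The new argument structure could look roughly like this (sketch; only
the fields discussed in this log are shown, names assumed):

struct blk_report_zones_args {
        struct gendisk  *disk;
        /* User report callback and its private data, called by
         * disk_report_zone() for each reported zone. */
        report_zones_cb user_cb;
        void            *user_data;
        /* Report open and closed zones as BLK_ZONE_COND_ACTIVE
         * (cached zone reports). */
        bool            report_active;
};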
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The variable capacity is used only in one place and so can be removed
and get_capacity(disk) used directly instead.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Modify disk_update_zone_resources() to freeze the device queue before
updating the number of zones, zone capacity and other zone related
resources. The locking order resulting from the call to
queue_limits_commit_update_frozen() is preserved, that is, the queue
limits lock is first taken by calling queue_limits_start_update() before
freezing the queue, and the queue is unfrozen after executing
queue_limits_commit_update(), which replaces the call to
queue_limits_commit_update_frozen().
This change ensures that there are no in-flight I/Os when the zone
resources are updated due to a zone revalidation. In case of error when
the limits are applied, directly call disk_free_zone_resources() from
disk_update_zone_resources() while the disk queue is still frozen to
avoid needing to freeze & unfreeze the queue again in
blk_revalidate_disk_zones(), thus simplifying that function code a
little.
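The resulting ordering in disk_update_zone_resources() follows the
usual pattern (sketch):

lim = queue_limits_start_update(q);

/* ... update the number of zones, zone capacity and zone resources ... */

blk_mq_freeze_queue(q);
ret = queue_limits_commit_update(q, &lim);
if (ret)
        /* Free the zone resources while the queue is still frozen. */
        disk_free_zone_resources(disk);
blk_mq_unfreeze_queue(q);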
Fixes: 0b83c86b44 ("block: Prevent potential deadlock in blk_revalidate_disk_zones()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The functions blk_zone_wplug_handle_reset_or_finish() and
blk_zone_wplug_handle_reset_all() both modify the zone write pointer
offset of zone write plugs that are the target of a reset, reset all or
finish zone management operation. However, these functions do this
modification before the BIO is executed. So if the zone operation fails,
the modified zone write pointer offsets become invalid.
Avoid this by modifying the zone write pointer offset of a zone write
plug that is the target of a zone management operation when the
operation completes. To do so, modify blk_zone_bio_endio() to call the
new function blk_zone_mgmt_bio_endio() which in turn calls the functions
blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
blk_zone_finish_bio_endio() depending on the operation of the completed
BIO, to modify a zone write plug write pointer offset accordingly.
These functions are called only if the BIO execution was successful.
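The completion dispatch could be structured roughly as follows (sketch):

static void blk_zone_mgmt_bio_endio(struct bio *bio)
{
        /* Only update write pointer offsets for successful operations. */
        if (bio->bi_status != BLK_STS_OK)
                return;

        switch (bio_op(bio)) {
        case REQ_OP_ZONE_RESET:
                blk_zone_reset_bio_endio(bio);
                break;
        case REQ_OP_ZONE_RESET_ALL:
                blk_zone_reset_all_bio_endio(bio);
                break;
        case REQ_OP_ZONE_FINISH:
                blk_zone_finish_bio_endio(bio);
                break;
        default:
                break;
        }
}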
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If preparing a write bio fails then blk_zone_wplug_bio_work() calls
bio_endio() with zwplug->lock held. If a device mapper driver is stacked
on top of the zoned block device then this results in nested locking of
zwplug->lock. The resulting lockdep complaint is a false positive
because this is nested locking and not recursive locking. Suppress this
false positive by calling blk_zone_wplug_bio_io_error() without holding
zwplug->lock. This is safe because no code in
blk_zone_wplug_bio_io_error() depends on zwplug->lock being held. This
patch suppresses the following lockdep complaint:
WARNING: possible recursive locking detected
--------------------------------------------
kworker/3:0H/46 is trying to acquire lock:
ffffff882968b830 (&zwplug->lock){-...}-{2:2}, at: blk_zone_write_plug_bio_endio+0x64/0x1f0
but task is already holding lock:
ffffff88315bc230 (&zwplug->lock){-...}-{2:2}, at: blk_zone_wplug_bio_work+0x8c/0x48c
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&zwplug->lock);
lock(&zwplug->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by kworker/3:0H/46:
#0: ffffff8809486758 ((wq_completion)sdd_zwplugs){+.+.}-{0:0}, at: process_one_work+0x1bc/0x65c
#1: ffffffc085de3d70 ((work_completion)(&zwplug->bio_work)){+.+.}-{0:0}, at: process_one_work+0x1e4/0x65c
#2: ffffff88315bc230 (&zwplug->lock){-...}-{2:2}, at: blk_zone_wplug_bio_work+0x8c/0x48c
stack backtrace:
CPU: 3 UID: 0 PID: 46 Comm: kworker/3:0H Tainted: G W OE 6.12.38-android16-5-maybe-dirty-4k #1 8b362b6f76e3645a58cd27d86982bce10d150025
Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: Spacecraft board based on MALIBU (DT)
Workqueue: sdd_zwplugs blk_zone_wplug_bio_work
Call trace:
dump_backtrace+0xfc/0x17c
show_stack+0x18/0x28
dump_stack_lvl+0x40/0xa0
dump_stack+0x18/0x24
print_deadlock_bug+0x38c/0x398
__lock_acquire+0x13e8/0x2e1c
lock_acquire+0x134/0x2b4
_raw_spin_lock_irqsave+0x5c/0x80
blk_zone_write_plug_bio_endio+0x64/0x1f0
bio_endio+0x9c/0x240
__dm_io_complete+0x214/0x260
clone_endio+0xe8/0x214
bio_endio+0x218/0x240
blk_zone_wplug_bio_work+0x204/0x48c
process_one_work+0x26c/0x65c
worker_thread+0x33c/0x498
kthread+0x110/0x134
ret_from_fork+0x10/0x20
Cc: stable@vger.kernel.org
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250825182720.1697203-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a tracepoint for blkdev_zone_mgmt to trace zone management commands
submitted by higher layers like file systems or user space.
An example output for this tracepoint is as follows:
mkfs.btrfs-203 [001] ..... 42.877493: blkdev_zone_mgmt: 8,0 ZRS 5242880 + 0
This example output shows a REQ_OP_ZONE_RESET operation submitted by
mkfs.btrfs.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-5-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a tracepoint in blk_zone_update_request_bio() to trace the bio sector
update on ZONE APPEND completions.
An example for this tracepoint is as follows:
<idle>-0 [001] d.h1. 381.746444: blk_zone_update_request_bio: 259,5 ZAS 131072 () 1048832 + 256 none,0,0 [swapper/1]
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-4-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_zone_update_request_bio() does two things. First it checks if the
request to be completed was written via ZONE APPEND and, if so, it then
updates the sector to the one that the data was written to.
This is small enough to be an inline function. But upcoming changes adding
a tracepoint don't work if the function is inlined.
Split the function into two: the first, blk_req_bio_is_zone_append(),
checks if the sector needs to be updated and can still be an inline
function. The second, blk_zone_append_update_request_bio(), does the
sector update.
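The inline check part could then be as small as (sketch):

static inline bool blk_req_bio_is_zone_append(struct request *rq,
                                              struct bio *bio)
{
        return req_op(rq) == REQ_OP_ZONE_APPEND ||
               bio_flagged(bio, BIO_EMULATES_ZONE_APPEND);
}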
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-3-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for fixing device mapper zone write handling, introduce
the inline helper function bio_needs_zone_write_plugging() to test if a
BIO requires handling through zone write plugging using the function
blk_zone_plug_bio(). This function returns true for any write
(op_is_write(bio) == true) operation directed at a zoned block device
using zone write plugging, that is, a block device with a disk that has
a zone write plug hash table.
This helper allows simplifying the check on entry to blk_zone_plug_bio()
and is used to protect calls to it for blk-mq devices and DM devices.
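A sketch of the helper (the zone write plug hash table field name is
assumed):

static inline bool bio_needs_zone_write_plugging(struct bio *bio)
{
        struct gendisk *disk = bio->bi_bdev->bd_disk;

        /* Only disks using zone write plugging have a plug hash table. */
        if (!disk->zone_wplugs_hash)
                return false;

        /* Only write operations need zone write plugging. */
        return op_is_write(bio_op(bio));
}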
Fixes: f211268ed1 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bios queued up in the zone write plug have already gone through all
preparation in the submit_bio path, including the freeze protection.
Submitting them through submit_bio_noacct_nocheck duplicates the work
and can cause deadlocks when freezing a queue with bios still pending
in zone write plugs.
Go straight to ->submit_bio or blk_mq_submit_bio to bypass the
superfluous extra freeze protection and checks.
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_zone_write_plug_bio_endio() is called for a regular write BIO
used to emulate a zone append operation, that is, a BIO flagged with
BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the
original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not
cleared. Clear it to fully return the BIO to its original definition.
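The completion path restore could thus become (sketch):

/* In blk_zone_write_plug_bio_endio() (sketch): */
if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
        /* Restore the original zone append operation code... */
        bio->bi_opf &= ~REQ_OP_MASK;
        bio->bi_opf |= REQ_OP_ZONE_APPEND;
        /* ...and drop the emulation flag as well. */
        bio_clear_flag(bio, BIO_EMULATES_ZONE_APPEND);
}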
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.
Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device. The read call can create an order-0 folio to
read the first 4096 bytes from the disk. But then udev is preempted.
Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device. The filesystem calls set_blocksize, which sets i_blksize to
8192 and the minimum folio order to 1.
Now udev resumes, still holding the order-0 folio it allocated. It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev. We then submit the bio with a
NULL block device and crash.
Therefore, truncate the page cache after flushing but before updating
i_blksize. However, that's not enough -- we also need to lock out file
IO and page faults during the update. Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.
I don't know if this is the correct fix, but xfs/259 found it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For devices that natively support zone append operations,
REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
and are immediately issued to the zoned device. This means that there is
no write pointer offset tracking done for these operations and that a
zone write plug is not necessary.
However, when receiving a zone append BIO, we may already have a zone
write plug for the target zone if that zone was previously partially
written using regular write operations. In such case, since the write
pointer offset of the zone write plug is not incremented by the amount
of sectors appended to the zone, 2 issues arise:
1) we risk leaving the plug in the disk hash table if the zone is fully
written using zone append or regular write operations, because the
write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations
will always be failed by blk_zone_wplug_prepare_bio() as the write
pointer alignment check will fail, even if the user correctly
accounted for the zone append operations and issued the regular
writes with a correct sector.
Avoid these issues by immediately removing the zone write plug of zones
that are the target of zone append operations when blk_zone_plug_bio()
is called. The new function blk_zone_wplug_handle_native_zone_append()
implements this for devices that natively support zone append. The
removal of the zone write plug using disk_remove_zone_wplug() requires
aborting all plugged regular writes using disk_zone_wplug_abort() as
otherwise the plugged write BIOs would never be executed (with the plug
removed, the completion path will never see the zone write plug again as
disk_get_zone_wplug() will return NULL). Rate-limited warnings are added
to blk_zone_wplug_handle_native_zone_append() and to
disk_zone_wplug_abort() to signal this.
Since blk_zone_wplug_handle_native_zone_append() is called in the hot
path for operations that will not be plugged, disk_get_zone_wplug() is
optimized under the assumption that a user issuing zone append
operations is not at the same time issuing regular writes and that there
are no hashed zone write plugs. The struct gendisk atomic counter
nr_zone_wplugs is added to check this, with this counter incremented in
disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().
To be consistent with this fix, we do not need to fill the zone write
plug hash table with zone write plugs for zones that are partially
written for a device that supports native zone append operations.
So modify blk_revalidate_seq_zone() to return early to avoid allocating
and inserting a zone write plug for partially written sequential zones
if the device natively supports zone append.
Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When block drivers or the core block code perform allocations with a
frozen queue, this could try to recurse into the block device to
reclaim memory and deadlock. Thus all allocations done by a process
that froze a queue need to be done without __GFP_IO and __GFP_FS.
Instead of tying to track all of them down, force a noio scope as
part of freezing the queue.
Note that nvme is a bit of a mess here due to the non-owner freezes,
and they will be addressed separately.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper that freezes the queue, updates the queue limits and
unfreezes the queue, and convert all open-coded versions of that to the
new helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reduce the indentation level of the code in queue_zone_wplugs_show() by
moving the body of the loop in that function into a new function.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241217210310.645966-5-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For the blk_queue_exit() calls, document where the corresponding code can
be found that increases q->q_usage_counter.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241217210310.645966-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Document which functions expect that their callers must hold a lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241217210310.645966-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only include those header files that are necessary.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241217210310.645966-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zone write plugging for handling writes to zones of a zoned block
device always executes a zone report whenever a write BIO to a zone
fails. The intent of this is to ensure that the tracking of a zone write
pointer is always correct to ensure that the alignment to a zone write
pointer of write BIOs can be checked on submission and that we can
always correctly emulate zone append operations using regular write
BIOs.
However, this error recovery scheme introduces a potential deadlock if a
device queue freeze is initiated while BIOs are still plugged in a zone
write plug and one of these write operation fails. In such case, the
disk zone write plug error recovery work is scheduled and executes a
report zone. This in turn can result in a request allocation in the
underlying driver to issue the report zones command to the device. But
with the device queue freeze already started, this allocation will
block, preventing the report zone execution and the continuation of the
processing of the plugged BIOs. As plugged BIOs hold a queue usage
reference, the queue freeze itself will never complete, resulting in a
deadlock.
Avoid this problem by completely removing from the zone write plugging
code the use of report zones operations after a failed write operation,
instead relying on the device user to either execute a report zones,
reset the zone, finish the zone, or give up writing to the device (which
is a fairly common pattern for file systems which degrade to read-only
after write failures). This is not an unreasonable requirement as all
well-behaved applications, file systems and device mapper targets
already use report zones to recover from write errors whenever possible,
by comparing the current position of a zone write pointer with what
they assume the position to be.
The changes to remove the automatic error recovery are as follows:
- Completely remove the error recovery work and its associated
resources (zone write plug list head, disk error list, and disk
zone_wplugs_work work struct). This also removes the functions
disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().
- Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into
BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write
plug whenever a write operation targeting the zone of the zone write
plug fails. This flag indicates that the zone write pointer offset is
not reliable and that it must be updated when the next report zone,
reset zone, finish zone or disk revalidation is executed.
- Modify blk_zone_write_plug_bio_endio() to set the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed
write BIO.
- Modify the function disk_zone_wplug_set_wp_offset() to clear this
new flag, thus implementing recovery of a correct write pointer
offset with the reset (all) zone and finish zone operations.
- Modify blkdev_report_zones() to always use the disk_report_zones_cb()
callback so that disk_zone_wplug_sync_wp_offset() can be called for
any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag.
This implements recovery of a correct write pointer offset for zone
write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within
the range of the report zones operation executed by the user.
- Modify blk_revalidate_seq_zone() to call
disk_zone_wplug_sync_wp_offset() for all sequential write required
zones when a zoned block device is revalidated, thus always resolving
any inconsistency between the write pointer offset of zone write
plugs and the actual write pointer position of sequential zones.
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The zone reclaim processing of the dm-zoned device mapper uses
blkdev_issue_zeroout() to align the write pointer of a zone being used
for reclaiming another zone, to write the valid data blocks from the
zone being reclaimed at the same position relative to the zone start in
the reclaim target zone.
The first call to blkdev_issue_zeroout() will try to use hardware
offload using a REQ_OP_WRITE_ZEROES operation if the device reports a
non-zero max_write_zeroes_sectors queue limit. If this operation fails
because of the lack of hardware support, blkdev_issue_zeroout() falls
back to using a regular write operation with the zero-page as buffer.
Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by
the block layer zone write plugging code which will execute a report
zones operation to ensure that the write pointer of the target zone of
the failed operation has not changed and to "rewind" the zone write
pointer offset of the target zone as it was advanced when the write zero
operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not
cause any issue and blkdev_issue_zeroout() works as expected.
However, since the automatic recovery of zone write pointers by the zone
write plugging code can potentially cause deadlocks with queue freeze
operations, a different recovery must be implemented in preparation for
the removal of zone write plugging report zones based recovery.
Do this by introducing the new function blk_zone_issue_zeroout(). This
function first calls blkdev_issue_zeroout() with the flag
BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution
which attempt to use the device hardware offload with the
REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone
operation is issued to restore the zone write pointer offset of the
target zone to the correct position and blkdev_issue_zeroout() is called
again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones
operation performing this recovery is implemented using the helper
function disk_zone_sync_wp_offset() which calls the gendisk report_zones
file operation with the callback disk_report_zones_cb(). This callback
updates the target write pointer offset of the target zone using the new
function disk_zone_wplug_sync_wp_offset().
dmz_reclaim_align_wp() is modified to change its call to
blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any
other change needed as the two functions are functionally equivalent.
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are currently no issuers of REQ_OP_ZONE_RESET and
REQ_OP_ZONE_FINISH operations that set REQ_NOWAIT. However, as we cannot
handle this flag correctly due to the potential request allocation
failure that may happen in blk_mq_submit_bio() after blk_zone_plug_bio()
has handled the zone write plug write pointer updates for the targeted
zones, modify blk_zone_wplug_handle_reset_or_finish() to warn if this
flag is set and ignore it.
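The check could simply be (sketch):

/* In blk_zone_wplug_handle_reset_or_finish() (sketch): */
if (WARN_ON_ONCE(bio->bi_opf & REQ_NOWAIT))
        bio->bi_opf &= ~REQ_NOWAIT;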
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For zoned block devices, a write BIO issued to a zone that has no
on-going writes will be prepared for execution and allowed to execute
immediately by blk_zone_wplug_handle_write() (called from
blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the
allocation of a request for its execution in blk_mq_submit_bio() may
fail after blk_zone_plug_bio() completed, marking the target zone of the
BIO as plugged. When this BIO is retried later on, it will be blocked as
the zone write plug of the target zone is in a plugged state without any
on-going write operation (completion of write operations trigger
unplugging of the next write BIOs for a zone). This leads to a BIO that
is stuck in a zone write plug and never completes, which results in
various issues such as hung tasks.
Avoid this problem by always executing REQ_NOWAIT write BIOs using the
BIO work of a zone write plug. This ensures that we never block the BIO
issuer and can thus safely ignore the REQ_NOWAIT flag when executing the
BIO from the zone write plug BIO work.
Since such BIO may be the first write BIO issued to a zone with no
on-going write, modify disk_zone_wplug_add_bio() to schedule the zone
write plug BIO work if the write plug is not already marked with the
BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary
as the completion of the on-going write for the zone will schedule the
execution of the next plugged BIOs.
blk_zone_wplug_handle_write() is also fixed to better handle zone write
plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO
using bio_wouldblock_error() instead of bio_io_error().
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function blk_revalidate_disk_zones() calls the function
disk_update_zone_resources() after freezing the device queue. In turn,
disk_update_zone_resources() calls queue_limits_start_update() which
takes a queue limits mutex lock, resulting in the ordering:
q->q_usage_counter check -> q->limits_lock. However, the usual ordering
is to always take a queue limit lock before freezing the queue to commit
the limits updates, e.g., the code pattern:
lim = queue_limits_start_update(q);
...
blk_mq_freeze_queue(q);
ret = queue_limits_commit_update(q, &lim);
blk_mq_unfreeze_queue(q);
Thus, blk_revalidate_disk_zones() introduces a potential circular
locking dependency deadlock that lockdep sometimes catches with the
splat:
[ 51.934109] ======================================================
[ 51.935916] WARNING: possible circular locking dependency detected
[ 51.937561] 6.12.0+ #2107 Not tainted
[ 51.938648] ------------------------------------------------------
[ 51.940351] kworker/u16:4/157 is trying to acquire lock:
[ 51.941805] ffff9fff0aa0bea8 (&q->limits_lock){+.+.}-{4:4}, at: disk_update_zone_resources+0x86/0x170
[ 51.944314]
but task is already holding lock:
[ 51.945688] ffff9fff0aa0b890 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: blk_revalidate_disk_zones+0x15f/0x340
[ 51.948527]
which lock already depends on the new lock.
[ 51.951296]
the existing dependency chain (in reverse order) is:
[ 51.953708]
-> #1 (&q->q_usage_counter(queue)#3){++++}-{0:0}:
[ 51.956131] blk_queue_enter+0x1c9/0x1e0
[ 51.957290] blk_mq_alloc_request+0x187/0x2a0
[ 51.958365] scsi_execute_cmd+0x78/0x490 [scsi_mod]
[ 51.959514] read_capacity_16+0x111/0x410 [sd_mod]
[ 51.960693] sd_revalidate_disk.isra.0+0x872/0x3240 [sd_mod]
[ 51.962004] sd_probe+0x2d7/0x520 [sd_mod]
[ 51.962993] really_probe+0xd5/0x330
[ 51.963898] __driver_probe_device+0x78/0x110
[ 51.964925] driver_probe_device+0x1f/0xa0
[ 51.965916] __driver_attach_async_helper+0x60/0xe0
[ 51.967017] async_run_entry_fn+0x2e/0x140
[ 51.968004] process_one_work+0x21f/0x5a0
[ 51.968987] worker_thread+0x1dc/0x3c0
[ 51.969868] kthread+0xe0/0x110
[ 51.970377] ret_from_fork+0x31/0x50
[ 51.970983] ret_from_fork_asm+0x11/0x20
[ 51.971587]
-> #0 (&q->limits_lock){+.+.}-{4:4}:
[ 51.972479] __lock_acquire+0x1337/0x2130
[ 51.973133] lock_acquire+0xc5/0x2d0
[ 51.973691] __mutex_lock+0xda/0xcf0
[ 51.974300] disk_update_zone_resources+0x86/0x170
[ 51.975032] blk_revalidate_disk_zones+0x16c/0x340
[ 51.975740] sd_zbc_revalidate_zones+0x73/0x160 [sd_mod]
[ 51.976524] sd_revalidate_disk.isra.0+0x465/0x3240 [sd_mod]
[ 51.977824] sd_probe+0x2d7/0x520 [sd_mod]
[ 51.978917] really_probe+0xd5/0x330
[ 51.979915] __driver_probe_device+0x78/0x110
[ 51.981047] driver_probe_device+0x1f/0xa0
[ 51.982143] __driver_attach_async_helper+0x60/0xe0
[ 51.983282] async_run_entry_fn+0x2e/0x140
[ 51.984319] process_one_work+0x21f/0x5a0
[ 51.985873] worker_thread+0x1dc/0x3c0
[ 51.987289] kthread+0xe0/0x110
[ 51.988546] ret_from_fork+0x31/0x50
[ 51.989926] ret_from_fork_asm+0x11/0x20
[ 51.991376]
other info that might help us debug this:
[ 51.994127] Possible unsafe locking scenario:
[ 51.995651] CPU0 CPU1
[ 51.996694] ---- ----
[ 51.997716] lock(&q->q_usage_counter(queue)#3);
[ 51.998817] lock(&q->limits_lock);
[ 52.000043] lock(&q->q_usage_counter(queue)#3);
[ 52.001638] lock(&q->limits_lock);
[ 52.002485]
*** DEADLOCK ***
Prevent this issue by moving the calls to blk_mq_freeze_queue() and
blk_mq_unfreeze_queue() around the call to queue_limits_commit_update()
in disk_update_zone_resources(). In case of revalidation failure, the
call to disk_free_zone_resources() in blk_revalidate_disk_zones()
is still done with the queue frozen as before.
Fixes: 843283e96e ("block: Fake max open zones limit when there is no limit")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241126104705.183996-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Turn the private disk_zone_is_conv() function in blk-zoned.c into a
public and documented bdev_zone_is_seq() helper with the inverse
polarity of the original function, also adding a check for non-zoned
devices so that all file systems can use the helper, even with a regular
block device.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that a disk revalidation changing the conventional zones bitmap
of a disk does not cause invalid memory references when using the
disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap
pointer.
disk_zone_is_conv() is modified to operate under the RCU read lock and
the function disk_set_conv_zones_bitmap() is added to update a disk
conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
zone_wplugs_lock spinlock held.
disk_free_zone_resources() is modified to call
disk_update_zone_resources() with a NULL bitmap pointer to free the disk
conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in
disk_update_zone_resources() to set the new (revalidated) bitmap and
free the old one.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>