fuse update for 6.16

-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaD2Y1wAKCRDh3BK/laaZ
 PFSHAP4q1+mOlQfZJPH/PFDwa+F0QW/uc3szXatS0888nxui/gEAsIeyyJlf+Mr8
 /1JPXxCqcapRFw9xsS0zioiK54Elfww=
 =2KxA
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Remove tmp page copying in writeback path (Joanne).

   This removes ~300 lines and with that a lot of complexity related to
   avoiding reclaim related deadlock. The old mechanism is replaced with
   a mapping flag that tells the MM not to block reclaim waiting for
   writeback to complete. The MM parts have been reviewed/acked by
   respective maintainers.

 - Convert more code to handle large folios (Joanne). This still just
   adds the code to deal with large folios and does not enable them yet.

 - Allow invalidating all cached lookups atomically (Luis Henriques).
   This feature is useful for CernVMFS, which currently does this
   iteratively (a usage sketch follows this list).

 - Align write prefaulting in fuse with generic one (Dave Hansen)

 - Fix race causing invalid data to be cached when setting attributes on
   different nodes of a distributed fs (Guang Yuan Wu)

 - Update documentation for passthrough (Chen Linxuan)

 - Add fdinfo about the device number associated with an opened
   /dev/fuse instance (Chen Linxuan)

 - Increase readdir buffer size (Miklos). This depends on a patch to VFS
   readdir code that was already merged through Christian's tree.

 - Optimize io-uring request expiration (Joanne)

 - Misc cleanups
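
The atomic lookup invalidation above is driven by the new
FUSE_NOTIFY_INC_EPOCH notification. Like other unsolicited notifications,
it is a fuse_out_header written to the /dev/fuse descriptor with a zero
'unique' field and the notification code in 'error'. A minimal sketch
(raw session fd assumed, helper name made up, and the code defined
locally in case the installed uapi header predates 7.44):

    #include <string.h>
    #include <unistd.h>
    #include <linux/fuse.h>

    #ifndef FUSE_NOTIFY_INC_EPOCH
    #define FUSE_NOTIFY_INC_EPOCH 8        /* new in FUSE 7.44 */
    #endif

    /* Ask the kernel to bump the connection epoch, invalidating every
     * dentry instantiated before this point. */
    static int notify_inc_epoch(int fuse_fd)
    {
            struct fuse_out_header oh;

            memset(&oh, 0, sizeof(oh));
            oh.len = sizeof(oh);
            oh.unique = 0;                     /* unsolicited notification */
            oh.error = FUSE_NOTIFY_INC_EPOCH;  /* notification code */

            return write(fuse_fd, &oh, sizeof(oh)) == (ssize_t)sizeof(oh) ? 0 : -1;
    }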

* tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
  fuse: increase readdir buffer size
  readdir: supply dir_context.count as readdir buffer size hint
  fuse: don't allow signals to interrupt getdents copying
  fuse: support large folios for writeback
  fuse: support large folios for readahead
  fuse: support large folios for queued writes
  fuse: support large folios for stores
  fuse: support large folios for symlinks
  fuse: support large folios for folio reads
  fuse: support large folios for writethrough writes
  fuse: refactor fuse_fill_write_pages()
  fuse: support large folios for retrieves
  fuse: support copying large folios
  fs: fuse: add dev id to /dev/fuse fdinfo
  docs: filesystems: add fuse-passthrough.rst
  MAINTAINERS: update filter of FUSE documentation
  fuse: fix race between concurrent setattrs from multiple nodes
  fuse: remove tmp folio for writebacks and internal rb tree
  mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings
  fuse: optimize over-io-uring request expiration check
  ...
Linus Torvalds 2025-06-02 15:31:05 -07:00
commit 2619a6d413
14 changed files with 466 additions and 501 deletions

Documentation/filesystems/fuse-passthrough.rst

@ -0,0 +1,133 @@
.. SPDX-License-Identifier: GPL-2.0
================
FUSE Passthrough
================
Introduction
============
FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the
performance of FUSE filesystems for I/O operations. Typically, FUSE operations
involve communication between the kernel and a userspace FUSE daemon, which can
incur overhead. Passthrough allows certain operations on a FUSE file to bypass
the userspace daemon and be executed directly by the kernel on an underlying
"backing file".
This is achieved by the FUSE daemon registering a file descriptor (pointing to
the backing file on a lower filesystem) with the FUSE kernel module. The kernel
then receives an identifier (``backing_id``) for this registered backing file.
When a FUSE file is subsequently opened, the FUSE daemon can, in its response to
the ``OPEN`` request, include this ``backing_id`` and set the
``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific
operations.
Currently, passthrough is supported for operations like ``read(2)``/``write(2)``
(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``.
Enabling Passthrough
====================
To use FUSE passthrough:
1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH``
enabled.
2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the
``FUSE_PASSTHROUGH`` capability and specify its desired
``max_stack_depth``.
3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl
on its connection file descriptor (e.g., ``/dev/fuse``) to register a
backing file descriptor and obtain a ``backing_id``.
4. When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon
replies with the ``FOPEN_PASSTHROUGH`` flag set in
``fuse_open_out::open_flags`` and provides the corresponding ``backing_id``
in ``fuse_open_out::backing_id``.
5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with
the ``backing_id`` to release the kernel's reference to the backing file
when it's no longer needed for passthrough setups.
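
A minimal daemon-side sketch of steps 3-5, assuming a raw ``/dev/fuse``
session descriptor ``fuse_fd``; the helper name is made up, error handling is
reduced to the essentials, and the types and constants come from
``include/uapi/linux/fuse.h``:

.. code-block:: c

   #include <fcntl.h>
   #include <unistd.h>
   #include <sys/ioctl.h>
   #include <linux/fuse.h>

   /* Register one backing file (step 3) and fill in the OPEN reply so the
    * file is served in passthrough mode (step 4).  Returns the backing_id,
    * which the daemon keeps around for step 5. */
   static int open_with_passthrough(int fuse_fd, const char *lower_path,
                                    struct fuse_open_out *open_out)
   {
           struct fuse_backing_map map = { .fd = open(lower_path, O_RDWR) };
           int backing_id;

           if (map.fd < 0)
                   return -1;

           backing_id = ioctl(fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, &map);
           close(map.fd);          /* the kernel now holds its own reference */
           if (backing_id < 0)
                   return -1;

           open_out->open_flags |= FOPEN_PASSTHROUGH;
           open_out->backing_id = backing_id;
           return backing_id;
   }

When the backing file is no longer needed for passthrough opens, the daemon
releases the kernel's reference with
``ioctl(fuse_fd, FUSE_DEV_IOC_BACKING_CLOSE, &backing_id)`` (step 5).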
Privilege Requirements
======================
Setting up passthrough functionality currently requires the FUSE daemon to
possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several
security and resource management considerations that are actively being
discussed and worked on. The primary reasons for this restriction are detailed
below.
Resource Accounting and Visibility
----------------------------------
The core mechanism for passthrough involves the FUSE daemon opening a file
descriptor to a backing file and registering it with the FUSE kernel module via
the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id``
associated with a kernel-internal ``struct fuse_backing`` object, which holds a
reference to the backing ``struct file``.
A significant concern arises because the FUSE daemon can close its own file
descriptor to the backing file after registration. The kernel, however, will
still hold a reference to the ``struct file`` via the ``struct fuse_backing``
object as long as it's associated with a ``backing_id`` (or subsequently, with
an open FUSE file in passthrough mode).
This behavior leads to two main issues for unprivileged FUSE daemons:
1. **Invisibility to lsof and other inspection tools**: Once the FUSE
daemon closes its file descriptor, the open backing file held by the kernel
becomes "hidden." Standard tools like ``lsof``, which typically inspect
process file descriptor tables, would not be able to identify that this
file is still open by the system on behalf of the FUSE filesystem. This
makes it difficult for system administrators to track resource usage or
debug issues related to open files (e.g., preventing unmounts).
2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to
resource limits, including the maximum number of open file descriptors
(``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files
and then close its own FDs, it could potentially cause the kernel to hold
an unlimited number of open ``struct file`` references without these being
accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a
denial-of-service (DoS) by exhausting system-wide file resources.
The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues,
restricting this powerful capability to trusted processes.
**NOTE**: ``io_uring`` solves a similar issue by exposing its "fixed files",
which are visible via ``fdinfo`` and accounted under the registering user's
``RLIMIT_NOFILE``.
Filesystem Stacking and Shutdown Loops
--------------------------------------
Another concern relates to the potential for creating complex and problematic
filesystem stacking scenarios if unprivileged users could set up passthrough.
A FUSE passthrough filesystem might use a backing file that resides:
* On the *same* FUSE filesystem.
* On another filesystem (like OverlayFS) which itself might have an upper or
lower layer that is a FUSE filesystem.
These configurations could create dependency loops, particularly during
filesystem shutdown or unmount sequences, leading to deadlocks or system
instability. This is conceptually similar to the risks associated with the
``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``.
To mitigate this, FUSE passthrough already incorporates checks based on
filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``).
For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate
the ``max_stack_depth`` it supports. When a backing file is registered via
``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's
filesystem stack depth is within the allowed limit.
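
For illustration, the ``max_stack_depth`` negotiation mentioned above (step 2
of Enabling Passthrough) boils down to a few fields in the daemon's
``FUSE_INIT`` reply. This is only a sketch using the names from
``include/uapi/linux/fuse.h``; ``FUSE_PASSTHROUGH`` is bit 37, so it travels
in ``flags2`` and needs ``FUSE_INIT_EXT``, and a real daemon would only echo
flags that the kernel offered in the ``FUSE_INIT`` request:

.. code-block:: c

   #include <stdint.h>
   #include <linux/fuse.h>

   /* The parts of a FUSE_INIT reply that enable passthrough. */
   static void init_reply_enable_passthrough(struct fuse_init_out *out)
   {
           out->flags  |= FUSE_INIT_EXT;                       /* flags2 is valid */
           out->flags2 |= (uint32_t)(FUSE_PASSTHROUGH >> 32);  /* bit 37 -> flags2 */

           /*
            * Highest filesystem stacking depth accepted for backing files;
            * 1 means backing files must live on a non-stacked filesystem.
            */
           out->max_stack_depth = 1;
   }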
The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security,
ensuring that only privileged users can create these potentially complex
stacking arrangements.
General Security Posture
------------------------
As a general principle for new kernel features that allow userspace to instruct
the kernel to perform direct operations on its behalf based on user-provided
file descriptors, starting with a higher privilege requirement (like
``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows
the feature to be used and tested while further security implications are
evaluated and addressed.

Documentation/filesystems/index.rst

@ -99,6 +99,7 @@ Documentation for filesystem implementations.
fuse
fuse-io
fuse-io-uring
fuse-passthrough
inotify
isofs
nilfs2

MAINTAINERS

@ -9846,7 +9846,7 @@ L: linux-fsdevel@vger.kernel.org
S: Maintained
W: https://github.com/libfuse/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
F: Documentation/filesystems/fuse.rst
F: Documentation/filesystems/fuse*
F: fs/fuse/
F: include/uapi/linux/fuse.h

fs/fuse/dev.c

@ -23,6 +23,7 @@
#include <linux/swap.h>
#include <linux/splice.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
#define CREATE_TRACE_POINTS
#include "fuse_trace.h"
@ -45,7 +46,7 @@ bool fuse_request_expired(struct fuse_conn *fc, struct list_head *list)
return time_is_before_jiffies(req->create_time + fc->timeout.req_timeout);
}
bool fuse_fpq_processing_expired(struct fuse_conn *fc, struct list_head *processing)
static bool fuse_fpq_processing_expired(struct fuse_conn *fc, struct list_head *processing)
{
int i;
@ -816,7 +817,7 @@ static int unlock_request(struct fuse_req *req)
return err;
}
void fuse_copy_init(struct fuse_copy_state *cs, int write,
void fuse_copy_init(struct fuse_copy_state *cs, bool write,
struct iov_iter *iter)
{
memset(cs, 0, sizeof(*cs));
@ -955,10 +956,10 @@ static int fuse_check_folio(struct folio *folio)
* folio that was originally in @pagep will lose a reference and the new
* folio returned in @pagep will carry a reference.
*/
static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
static int fuse_try_move_folio(struct fuse_copy_state *cs, struct folio **foliop)
{
int err;
struct folio *oldfolio = page_folio(*pagep);
struct folio *oldfolio = *foliop;
struct folio *newfolio;
struct pipe_buffer *buf = cs->pipebufs;
@ -979,7 +980,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
cs->pipebufs++;
cs->nr_segs--;
if (cs->len != PAGE_SIZE)
if (cs->len != folio_size(oldfolio))
goto out_fallback;
if (!pipe_buf_try_steal(cs->pipe, buf))
@ -1025,7 +1026,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
if (test_bit(FR_ABORTED, &cs->req->flags))
err = -ENOENT;
else
*pagep = &newfolio->page;
*foliop = newfolio;
spin_unlock(&cs->req->waitq.lock);
if (err) {
@ -1058,8 +1059,8 @@ out_fallback:
goto out_put_old;
}
static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
unsigned offset, unsigned count)
static int fuse_ref_folio(struct fuse_copy_state *cs, struct folio *folio,
unsigned offset, unsigned count)
{
struct pipe_buffer *buf;
int err;
@ -1067,17 +1068,17 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
if (cs->nr_segs >= cs->pipe->max_usage)
return -EIO;
get_page(page);
folio_get(folio);
err = unlock_request(cs->req);
if (err) {
put_page(page);
folio_put(folio);
return err;
}
fuse_copy_finish(cs);
buf = cs->pipebufs;
buf->page = page;
buf->page = &folio->page;
buf->offset = offset;
buf->len = count;
@ -1089,20 +1090,24 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
}
/*
* Copy a page in the request to/from the userspace buffer. Must be
* Copy a folio in the request to/from the userspace buffer. Must be
* done atomically
*/
static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
unsigned offset, unsigned count, int zeroing)
static int fuse_copy_folio(struct fuse_copy_state *cs, struct folio **foliop,
unsigned offset, unsigned count, int zeroing)
{
int err;
struct page *page = *pagep;
struct folio *folio = *foliop;
size_t size;
if (page && zeroing && count < PAGE_SIZE)
clear_highpage(page);
if (folio) {
size = folio_size(folio);
if (zeroing && count < size)
folio_zero_range(folio, 0, size);
}
while (count) {
if (cs->write && cs->pipebufs && page) {
if (cs->write && cs->pipebufs && folio) {
/*
* Can't control lifetime of pipe buffers, so always
* copy user pages.
@ -1112,12 +1117,12 @@ static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
if (err)
return err;
} else {
return fuse_ref_page(cs, page, offset, count);
return fuse_ref_folio(cs, folio, offset, count);
}
} else if (!cs->len) {
if (cs->move_pages && page &&
offset == 0 && count == PAGE_SIZE) {
err = fuse_try_move_page(cs, pagep);
if (cs->move_folios && folio &&
offset == 0 && count == size) {
err = fuse_try_move_folio(cs, foliop);
if (err <= 0)
return err;
} else {
@ -1126,22 +1131,30 @@ static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
return err;
}
}
if (page) {
void *mapaddr = kmap_local_page(page);
void *buf = mapaddr + offset;
offset += fuse_copy_do(cs, &buf, &count);
if (folio) {
void *mapaddr = kmap_local_folio(folio, offset);
void *buf = mapaddr;
unsigned int copy = count;
unsigned int bytes_copied;
if (folio_test_highmem(folio) && count > PAGE_SIZE - offset_in_page(offset))
copy = PAGE_SIZE - offset_in_page(offset);
bytes_copied = fuse_copy_do(cs, &buf, &copy);
kunmap_local(mapaddr);
offset += bytes_copied;
count -= bytes_copied;
} else
offset += fuse_copy_do(cs, NULL, &count);
}
if (page && !cs->write)
flush_dcache_page(page);
if (folio && !cs->write)
flush_dcache_folio(folio);
return 0;
}
/* Copy pages in the request to/from userspace buffer */
static int fuse_copy_pages(struct fuse_copy_state *cs, unsigned nbytes,
int zeroing)
/* Copy folios in the request to/from userspace buffer */
static int fuse_copy_folios(struct fuse_copy_state *cs, unsigned nbytes,
int zeroing)
{
unsigned i;
struct fuse_req *req = cs->req;
@ -1151,23 +1164,12 @@ static int fuse_copy_pages(struct fuse_copy_state *cs, unsigned nbytes,
int err;
unsigned int offset = ap->descs[i].offset;
unsigned int count = min(nbytes, ap->descs[i].length);
struct page *orig, *pagep;
orig = pagep = &ap->folios[i]->page;
err = fuse_copy_page(cs, &pagep, offset, count, zeroing);
err = fuse_copy_folio(cs, &ap->folios[i], offset, count, zeroing);
if (err)
return err;
nbytes -= count;
/*
* fuse_copy_page may have moved a page from a pipe instead of
* copying into our given page, so update the folios if it was
* replaced.
*/
if (pagep != orig)
ap->folios[i] = page_folio(pagep);
}
return 0;
}
@ -1197,7 +1199,7 @@ int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
for (i = 0; !err && i < numargs; i++) {
struct fuse_arg *arg = &args[i];
if (i == numargs - 1 && argpages)
err = fuse_copy_pages(cs, arg->size, zeroing);
err = fuse_copy_folios(cs, arg->size, zeroing);
else
err = fuse_copy_one(cs, arg->value, arg->size);
}
@ -1538,7 +1540,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, struct iov_iter *to)
if (!user_backed_iter(to))
return -EINVAL;
fuse_copy_init(&cs, 1, to);
fuse_copy_init(&cs, true, to);
return fuse_dev_do_read(fud, file, &cs, iov_iter_count(to));
}
@ -1561,7 +1563,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
if (!bufs)
return -ENOMEM;
fuse_copy_init(&cs, 1, NULL);
fuse_copy_init(&cs, true, NULL);
cs.pipebufs = bufs;
cs.pipe = pipe;
ret = fuse_dev_do_read(fud, in, &cs, len);
@ -1786,20 +1788,23 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
num = outarg.size;
while (num) {
struct folio *folio;
struct page *page;
unsigned int this_num;
unsigned int folio_offset;
unsigned int nr_bytes;
unsigned int nr_pages;
folio = filemap_grab_folio(mapping, index);
err = PTR_ERR(folio);
if (IS_ERR(folio))
goto out_iput;
page = &folio->page;
this_num = min_t(unsigned, num, folio_size(folio) - offset);
err = fuse_copy_page(cs, &page, offset, this_num, 0);
folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset;
nr_bytes = min_t(unsigned, num, folio_size(folio) - folio_offset);
nr_pages = (offset + nr_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
err = fuse_copy_folio(cs, &folio, folio_offset, nr_bytes, 0);
if (!folio_test_uptodate(folio) && !err && offset == 0 &&
(this_num == folio_size(folio) || file_size == end)) {
folio_zero_segment(folio, this_num, folio_size(folio));
(nr_bytes == folio_size(folio) || file_size == end)) {
folio_zero_segment(folio, nr_bytes, folio_size(folio));
folio_mark_uptodate(folio);
}
folio_unlock(folio);
@ -1808,9 +1813,9 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
if (err)
goto out_iput;
num -= this_num;
num -= nr_bytes;
offset = 0;
index++;
index += nr_pages;
}
err = 0;
@ -1849,7 +1854,7 @@ static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode,
unsigned int num;
unsigned int offset;
size_t total_len = 0;
unsigned int num_pages, cur_pages = 0;
unsigned int num_pages;
struct fuse_conn *fc = fm->fc;
struct fuse_retrieve_args *ra;
size_t args_size = sizeof(*ra);
@ -1867,6 +1872,7 @@ static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode,
num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
num_pages = min(num_pages, fc->max_pages);
num = min(num, num_pages << PAGE_SHIFT);
args_size += num_pages * (sizeof(ap->folios[0]) + sizeof(ap->descs[0]));
@ -1887,25 +1893,29 @@ static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode,
index = outarg->offset >> PAGE_SHIFT;
while (num && cur_pages < num_pages) {
while (num) {
struct folio *folio;
unsigned int this_num;
unsigned int folio_offset;
unsigned int nr_bytes;
unsigned int nr_pages;
folio = filemap_get_folio(mapping, index);
if (IS_ERR(folio))
break;
this_num = min_t(unsigned, num, PAGE_SIZE - offset);
folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset;
nr_bytes = min(folio_size(folio) - folio_offset, num);
nr_pages = (offset + nr_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
ap->folios[ap->num_folios] = folio;
ap->descs[ap->num_folios].offset = offset;
ap->descs[ap->num_folios].length = this_num;
ap->descs[ap->num_folios].offset = folio_offset;
ap->descs[ap->num_folios].length = nr_bytes;
ap->num_folios++;
cur_pages++;
offset = 0;
num -= this_num;
total_len += this_num;
index++;
num -= nr_bytes;
total_len += nr_bytes;
index += nr_pages;
}
ra->inarg.offset = outarg->offset;
ra->inarg.size = total_len;
@ -2021,11 +2031,24 @@ static int fuse_notify_resend(struct fuse_conn *fc)
return 0;
}
/*
* Increments the fuse connection epoch. This will result in dentries from
* previous epochs being invalidated.
*
* XXX optimization: add call to shrink_dcache_sb()?
*/
static int fuse_notify_inc_epoch(struct fuse_conn *fc)
{
atomic_inc(&fc->epoch);
return 0;
}
static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
unsigned int size, struct fuse_copy_state *cs)
{
/* Don't try to move pages (yet) */
cs->move_pages = 0;
/* Don't try to move folios (yet) */
cs->move_folios = false;
switch (code) {
case FUSE_NOTIFY_POLL:
@ -2049,6 +2072,9 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
case FUSE_NOTIFY_RESEND:
return fuse_notify_resend(fc);
case FUSE_NOTIFY_INC_EPOCH:
return fuse_notify_inc_epoch(fc);
default:
fuse_copy_finish(cs);
return -EINVAL;
@ -2173,7 +2199,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
spin_unlock(&fpq->lock);
cs->req = req;
if (!req->args->page_replace)
cs->move_pages = 0;
cs->move_folios = false;
if (oh.error)
err = nbytes != sizeof(oh) ? -EINVAL : 0;
@ -2211,7 +2237,7 @@ static ssize_t fuse_dev_write(struct kiocb *iocb, struct iov_iter *from)
if (!user_backed_iter(from))
return -EINVAL;
fuse_copy_init(&cs, 0, from);
fuse_copy_init(&cs, false, from);
return fuse_dev_do_write(fud, &cs, iov_iter_count(from));
}
@ -2285,13 +2311,13 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
}
pipe_unlock(pipe);
fuse_copy_init(&cs, 0, NULL);
fuse_copy_init(&cs, false, NULL);
cs.pipebufs = bufs;
cs.nr_segs = nbuf;
cs.pipe = pipe;
if (flags & SPLICE_F_MOVE)
cs.move_pages = 1;
cs.move_folios = true;
ret = fuse_dev_do_write(fud, &cs, len);
@ -2602,6 +2628,17 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
}
}
#ifdef CONFIG_PROC_FS
static void fuse_dev_show_fdinfo(struct seq_file *seq, struct file *file)
{
struct fuse_dev *fud = fuse_get_dev(file);
if (!fud)
return;
seq_printf(seq, "fuse_connection:\t%u\n", fud->fc->dev);
}
#endif
const struct file_operations fuse_dev_operations = {
.owner = THIS_MODULE,
.open = fuse_dev_open,
@ -2617,6 +2654,9 @@ const struct file_operations fuse_dev_operations = {
#ifdef CONFIG_FUSE_IO_URING
.uring_cmd = fuse_uring_cmd,
#endif
#ifdef CONFIG_PROC_FS
.show_fdinfo = fuse_dev_show_fdinfo,
#endif
};
EXPORT_SYMBOL_GPL(fuse_dev_operations);

fs/fuse/dev_uring.c

@ -140,6 +140,21 @@ void fuse_uring_abort_end_requests(struct fuse_ring *ring)
}
}
static bool ent_list_request_expired(struct fuse_conn *fc, struct list_head *list)
{
struct fuse_ring_ent *ent;
struct fuse_req *req;
ent = list_first_entry_or_null(list, struct fuse_ring_ent, list);
if (!ent)
return false;
req = ent->fuse_req;
return time_is_before_jiffies(req->create_time +
fc->timeout.req_timeout);
}
bool fuse_uring_request_expired(struct fuse_conn *fc)
{
struct fuse_ring *ring = fc->ring;
@ -157,7 +172,8 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
spin_lock(&queue->lock);
if (fuse_request_expired(fc, &queue->fuse_req_queue) ||
fuse_request_expired(fc, &queue->fuse_req_bg_queue) ||
fuse_fpq_processing_expired(fc, queue->fpq.processing)) {
ent_list_request_expired(fc, &queue->ent_w_req_queue) ||
ent_list_request_expired(fc, &queue->ent_in_userspace)) {
spin_unlock(&queue->lock);
return true;
}
@ -494,7 +510,7 @@ static void fuse_uring_cancel(struct io_uring_cmd *cmd,
spin_lock(&queue->lock);
if (ent->state == FRRS_AVAILABLE) {
ent->state = FRRS_USERSPACE;
list_move(&ent->list, &queue->ent_in_userspace);
list_move_tail(&ent->list, &queue->ent_in_userspace);
need_cmd_done = true;
ent->cmd = NULL;
}
@ -577,8 +593,8 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
if (err)
return err;
fuse_copy_init(&cs, 0, &iter);
cs.is_uring = 1;
fuse_copy_init(&cs, false, &iter);
cs.is_uring = true;
cs.req = req;
return fuse_copy_out_args(&cs, args, ring_in_out.payload_sz);
@ -607,8 +623,8 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
return err;
}
fuse_copy_init(&cs, 1, &iter);
cs.is_uring = 1;
fuse_copy_init(&cs, true, &iter);
cs.is_uring = true;
cs.req = req;
if (num_args > 0) {
@ -714,7 +730,7 @@ static int fuse_uring_send_next_to_ring(struct fuse_ring_ent *ent,
cmd = ent->cmd;
ent->cmd = NULL;
ent->state = FRRS_USERSPACE;
list_move(&ent->list, &queue->ent_in_userspace);
list_move_tail(&ent->list, &queue->ent_in_userspace);
spin_unlock(&queue->lock);
io_uring_cmd_done(cmd, 0, 0, issue_flags);
@ -764,7 +780,7 @@ static void fuse_uring_add_req_to_ring_ent(struct fuse_ring_ent *ent,
clear_bit(FR_PENDING, &req->flags);
ent->fuse_req = req;
ent->state = FRRS_FUSE_REQ;
list_move(&ent->list, &queue->ent_w_req_queue);
list_move_tail(&ent->list, &queue->ent_w_req_queue);
fuse_uring_add_to_pq(ent, req);
}
@ -1180,7 +1196,7 @@ static void fuse_uring_send(struct fuse_ring_ent *ent, struct io_uring_cmd *cmd,
spin_lock(&queue->lock);
ent->state = FRRS_USERSPACE;
list_move(&ent->list, &queue->ent_in_userspace);
list_move_tail(&ent->list, &queue->ent_in_userspace);
ent->cmd = NULL;
spin_unlock(&queue->lock);

fs/fuse/dir.c

@ -200,9 +200,14 @@ static int fuse_dentry_revalidate(struct inode *dir, const struct qstr *name,
{
struct inode *inode;
struct fuse_mount *fm;
struct fuse_conn *fc;
struct fuse_inode *fi;
int ret;
fc = get_fuse_conn_super(dir->i_sb);
if (entry->d_time < atomic_read(&fc->epoch))
goto invalid;
inode = d_inode_rcu(entry);
if (inode && fuse_is_bad(inode))
goto invalid;
@ -412,16 +417,20 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
unsigned int flags)
{
int err;
struct fuse_entry_out outarg;
struct fuse_conn *fc;
struct inode *inode;
struct dentry *newent;
int err, epoch;
bool outarg_valid = true;
bool locked;
if (fuse_is_bad(dir))
return ERR_PTR(-EIO);
fc = get_fuse_conn_super(dir->i_sb);
epoch = atomic_read(&fc->epoch);
locked = fuse_lock_inode(dir);
err = fuse_lookup_name(dir->i_sb, get_node_id(dir), &entry->d_name,
&outarg, &inode);
@ -443,6 +452,7 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
goto out_err;
entry = newent ? newent : entry;
entry->d_time = epoch;
if (outarg_valid)
fuse_change_entry_timeout(entry, &outarg);
else
@ -616,7 +626,6 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
struct dentry *entry, struct file *file,
unsigned int flags, umode_t mode, u32 opcode)
{
int err;
struct inode *inode;
struct fuse_mount *fm = get_fuse_mount(dir);
FUSE_ARGS(args);
@ -626,11 +635,13 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
struct fuse_entry_out outentry;
struct fuse_inode *fi;
struct fuse_file *ff;
int epoch, err;
bool trunc = flags & O_TRUNC;
/* Userspace expects S_IFREG in create mode */
BUG_ON((mode & S_IFMT) != S_IFREG);
epoch = atomic_read(&fm->fc->epoch);
forget = fuse_alloc_forget();
err = -ENOMEM;
if (!forget)
@ -699,6 +710,7 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
}
kfree(forget);
d_instantiate(entry, inode);
entry->d_time = epoch;
fuse_change_entry_timeout(entry, &outentry);
fuse_dir_changed(dir);
err = generic_file_open(inode, file);
@ -785,12 +797,14 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
struct fuse_entry_out outarg;
struct inode *inode;
struct dentry *d;
int err;
struct fuse_forget_link *forget;
int epoch, err;
if (fuse_is_bad(dir))
return ERR_PTR(-EIO);
epoch = atomic_read(&fm->fc->epoch);
forget = fuse_alloc_forget();
if (!forget)
return ERR_PTR(-ENOMEM);
@ -832,10 +846,13 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
if (IS_ERR(d))
return d;
if (d)
if (d) {
d->d_time = epoch;
fuse_change_entry_timeout(d, &outarg);
else
} else {
entry->d_time = epoch;
fuse_change_entry_timeout(entry, &outarg);
}
fuse_dir_changed(dir);
return d;
@ -1609,10 +1626,10 @@ static int fuse_permission(struct mnt_idmap *idmap,
return err;
}
static int fuse_readlink_page(struct inode *inode, struct folio *folio)
static int fuse_readlink_folio(struct inode *inode, struct folio *folio)
{
struct fuse_mount *fm = get_fuse_mount(inode);
struct fuse_folio_desc desc = { .length = PAGE_SIZE - 1 };
struct fuse_folio_desc desc = { .length = folio_size(folio) - 1 };
struct fuse_args_pages ap = {
.num_folios = 1,
.folios = &folio,
@ -1667,7 +1684,7 @@ static const char *fuse_get_link(struct dentry *dentry, struct inode *inode,
if (!folio)
goto out_err;
err = fuse_readlink_page(inode, folio);
err = fuse_readlink_folio(inode, folio);
if (err) {
folio_put(folio);
goto out_err;
@ -1943,6 +1960,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
int err;
bool trust_local_cmtime = is_wb;
bool fault_blocked = false;
u64 attr_version;
if (!fc->default_permissions)
attr->ia_valid |= ATTR_FORCE;
@ -2027,6 +2045,8 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
if (fc->handle_killpriv_v2 && !capable(CAP_FSETID))
inarg.valid |= FATTR_KILL_SUIDGID;
}
attr_version = fuse_get_attr_version(fm->fc);
fuse_setattr_fill(fc, &args, inode, &inarg, &outarg);
err = fuse_simple_request(fm, &args);
if (err) {
@ -2052,6 +2072,14 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
/* FIXME: clear I_DIRTY_SYNC? */
}
if (fi->attr_version > attr_version) {
/*
* Apply attributes, for example for fsnotify_change(), but set
* attribute timeout to zero.
*/
outarg.attr_valid = outarg.attr_valid_nsec = 0;
}
fuse_change_attributes_common(inode, &outarg.attr, NULL,
ATTR_TIMEOUT(&outarg),
fuse_get_cache_mask(inode), 0);
@ -2257,7 +2285,7 @@ void fuse_init_dir(struct inode *inode)
static int fuse_symlink_read_folio(struct file *null, struct folio *folio)
{
int err = fuse_readlink_page(folio->mapping->host, folio);
int err = fuse_readlink_folio(folio->mapping->host, folio);
if (!err)
folio_mark_uptodate(folio);

fs/fuse/file.c

@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
struct fuse_writepage_args {
struct fuse_io_args ia;
struct rb_node writepages_entry;
struct list_head queue_entry;
struct fuse_writepage_args *next;
struct inode *inode;
struct fuse_sync_bucket *bucket;
};
static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
pgoff_t idx_from, pgoff_t idx_to)
{
struct rb_node *n;
n = fi->writepages.rb_node;
while (n) {
struct fuse_writepage_args *wpa;
pgoff_t curr_index;
wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
WARN_ON(get_fuse_inode(wpa->inode) != fi);
curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
if (idx_from >= curr_index + wpa->ia.ap.num_folios)
n = n->rb_right;
else if (idx_to < curr_index)
n = n->rb_left;
else
return wpa;
}
return NULL;
}
/*
* Check if any page in a range is under writeback
*/
static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
pgoff_t idx_to)
{
struct fuse_inode *fi = get_fuse_inode(inode);
bool found;
if (RB_EMPTY_ROOT(&fi->writepages))
return false;
spin_lock(&fi->lock);
found = fuse_find_writeback(fi, idx_from, idx_to);
spin_unlock(&fi->lock);
return found;
}
static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
{
return fuse_range_is_writeback(inode, index, index);
}
/*
* Wait for page writeback to be completed.
*
* Since fuse doesn't rely on the VM writeback tracking, this has to
* use some other means.
*/
static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
{
struct fuse_inode *fi = get_fuse_inode(inode);
wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
}
static inline bool fuse_folio_is_writeback(struct inode *inode,
struct folio *folio)
{
pgoff_t last = folio_next_index(folio) - 1;
return fuse_range_is_writeback(inode, folio->index, last);
}
static void fuse_wait_on_folio_writeback(struct inode *inode,
struct folio *folio)
{
struct fuse_inode *fi = get_fuse_inode(inode);
wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio));
}
/*
* Wait for all pending writepages on the inode to finish.
*
@ -532,10 +454,6 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (err)
return err;
inode_lock(inode);
fuse_sync_writes(inode);
inode_unlock(inode);
err = filemap_check_errors(file->f_mapping);
if (err)
return err;
@ -875,7 +793,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
struct inode *inode = folio->mapping->host;
struct fuse_mount *fm = get_fuse_mount(inode);
loff_t pos = folio_pos(folio);
struct fuse_folio_desc desc = { .length = PAGE_SIZE };
struct fuse_folio_desc desc = { .length = folio_size(folio) };
struct fuse_io_args ia = {
.ap.args.page_zeroing = true,
.ap.args.out_pages = true,
@ -886,13 +804,6 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
ssize_t res;
u64 attr_ver;
/*
* With the temporary pages that are used to complete writeback, we can
* have writeback that extends beyond the lifetime of the folio. So
* make sure we read a properly synced folio.
*/
fuse_wait_on_folio_writeback(inode, folio);
attr_ver = fuse_get_attr_version(fm->fc);
/* Don't overflow end offset */
@ -965,14 +876,13 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
fuse_io_free(ia);
}
static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file,
unsigned int count)
{
struct fuse_file *ff = file->private_data;
struct fuse_mount *fm = ff->fm;
struct fuse_args_pages *ap = &ia->ap;
loff_t pos = folio_pos(ap->folios[0]);
/* Currently, all folios in FUSE are one page */
size_t count = ap->num_folios << PAGE_SHIFT;
ssize_t res;
int err;
@ -1005,17 +915,13 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
static void fuse_readahead(struct readahead_control *rac)
{
struct inode *inode = rac->mapping->host;
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_conn *fc = get_fuse_conn(inode);
unsigned int max_pages, nr_pages;
pgoff_t first = readahead_index(rac);
pgoff_t last = first + readahead_count(rac) - 1;
struct folio *folio = NULL;
if (fuse_is_bad(inode))
return;
wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
max_pages = min_t(unsigned int, fc->max_pages,
fc->max_read / PAGE_SIZE);
@ -1033,8 +939,8 @@ static void fuse_readahead(struct readahead_control *rac)
while (nr_pages) {
struct fuse_io_args *ia;
struct fuse_args_pages *ap;
struct folio *folio;
unsigned cur_pages = min(max_pages, nr_pages);
unsigned int pages = 0;
if (fc->num_background >= fc->congestion_threshold &&
rac->ra->async_size >= readahead_count(rac))
@ -1046,10 +952,12 @@ static void fuse_readahead(struct readahead_control *rac)
ia = fuse_io_alloc(NULL, cur_pages);
if (!ia)
return;
break;
ap = &ia->ap;
while (ap->num_folios < cur_pages) {
while (pages < cur_pages) {
unsigned int folio_pages;
/*
* This returns a folio with a ref held on it.
* The ref needs to be held until the request is
@ -1057,13 +965,31 @@ static void fuse_readahead(struct readahead_control *rac)
* fuse_try_move_page()) drops the ref after it's
* replaced in the page cache.
*/
folio = __readahead_folio(rac);
if (!folio)
folio = __readahead_folio(rac);
folio_pages = folio_nr_pages(folio);
if (folio_pages > cur_pages - pages) {
/*
* Large folios belonging to fuse will never
* have more pages than max_pages.
*/
WARN_ON(!pages);
break;
}
ap->folios[ap->num_folios] = folio;
ap->descs[ap->num_folios].length = folio_size(folio);
ap->num_folios++;
pages += folio_pages;
folio = NULL;
}
fuse_send_readpages(ia, rac->file);
nr_pages -= cur_pages;
fuse_send_readpages(ia, rac->file, pages << PAGE_SHIFT);
nr_pages -= pages;
}
if (folio) {
folio_end_read(folio, false);
folio_put(folio);
}
}
@ -1181,7 +1107,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
int err;
for (i = 0; i < ap->num_folios; i++)
fuse_wait_on_folio_writeback(inode, ap->folios[i]);
folio_wait_writeback(ap->folios[i]);
fuse_write_args_fill(ia, ff, pos, count);
ia->write.in.flags = fuse_write_flags(iocb);
@ -1226,27 +1152,24 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
struct fuse_args_pages *ap = &ia->ap;
struct fuse_conn *fc = get_fuse_conn(mapping->host);
unsigned offset = pos & (PAGE_SIZE - 1);
unsigned int nr_pages = 0;
size_t count = 0;
int err;
unsigned int num;
int err = 0;
num = min(iov_iter_count(ii), fc->max_write);
num = min(num, max_pages << PAGE_SHIFT);
ap->args.in_pages = true;
ap->descs[0].offset = offset;
do {
while (num) {
size_t tmp;
struct folio *folio;
pgoff_t index = pos >> PAGE_SHIFT;
size_t bytes = min_t(size_t, PAGE_SIZE - offset,
iov_iter_count(ii));
bytes = min_t(size_t, bytes, fc->max_write - count);
unsigned int bytes;
unsigned int folio_offset;
again:
err = -EFAULT;
if (fault_in_iov_iter_readable(ii, bytes))
break;
folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
mapping_gfp_mask(mapping));
if (IS_ERR(folio)) {
@ -1257,29 +1180,42 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
if (mapping_writably_mapped(mapping))
flush_dcache_folio(folio);
tmp = copy_folio_from_iter_atomic(folio, offset, bytes, ii);
folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset;
bytes = min(folio_size(folio) - folio_offset, num);
tmp = copy_folio_from_iter_atomic(folio, folio_offset, bytes, ii);
flush_dcache_folio(folio);
if (!tmp) {
folio_unlock(folio);
folio_put(folio);
/*
* Ensure forward progress by faulting in
* while not holding the folio lock:
*/
if (fault_in_iov_iter_readable(ii, bytes)) {
err = -EFAULT;
break;
}
goto again;
}
err = 0;
ap->folios[ap->num_folios] = folio;
ap->descs[ap->num_folios].offset = folio_offset;
ap->descs[ap->num_folios].length = tmp;
ap->num_folios++;
nr_pages++;
count += tmp;
pos += tmp;
num -= tmp;
offset += tmp;
if (offset == PAGE_SIZE)
if (offset == folio_size(folio))
offset = 0;
/* If we copied full page, mark it uptodate */
if (tmp == PAGE_SIZE)
/* If we copied full folio, mark it uptodate */
if (tmp == folio_size(folio))
folio_mark_uptodate(folio);
if (folio_test_uptodate(folio)) {
@ -1288,10 +1224,9 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
ia->write.folio_locked = true;
break;
}
if (!fc->big_writes)
if (!fc->big_writes || offset != 0)
break;
} while (iov_iter_count(ii) && count < fc->max_write &&
nr_pages < max_pages && offset == 0);
}
return count > 0 ? count : err;
}
@ -1638,7 +1573,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
return res;
}
}
if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
if (!write)
inode_lock(inode);
fuse_sync_writes(inode);
@ -1835,38 +1770,34 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
static void fuse_writepage_free(struct fuse_writepage_args *wpa)
{
struct fuse_args_pages *ap = &wpa->ia.ap;
int i;
if (wpa->bucket)
fuse_sync_bucket_dec(wpa->bucket);
for (i = 0; i < ap->num_folios; i++)
folio_put(ap->folios[i]);
fuse_file_put(wpa->ia.ff, false);
kfree(ap->folios);
kfree(wpa);
}
static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio)
{
struct backing_dev_info *bdi = inode_to_bdi(inode);
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
node_stat_sub_folio(folio, NR_WRITEBACK_TEMP);
wb_writeout_inc(&bdi->wb);
}
static void fuse_writepage_finish(struct fuse_writepage_args *wpa)
{
struct fuse_args_pages *ap = &wpa->ia.ap;
struct inode *inode = wpa->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
struct backing_dev_info *bdi = inode_to_bdi(inode);
int i;
for (i = 0; i < ap->num_folios; i++)
fuse_writepage_finish_stat(inode, ap->folios[i]);
for (i = 0; i < ap->num_folios; i++) {
/*
* Benchmarks showed that ending writeback within the
* scope of the fi->lock alleviates xarray lock
* contention and noticeably improves performance.
*/
folio_end_writeback(ap->folios[i]);
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
wb_writeout_inc(&bdi->wb);
}
wake_up(&fi->page_waitq);
}
@ -1877,13 +1808,15 @@ static void fuse_send_writepage(struct fuse_mount *fm,
__releases(fi->lock)
__acquires(fi->lock)
{
struct fuse_writepage_args *aux, *next;
struct fuse_inode *fi = get_fuse_inode(wpa->inode);
struct fuse_args_pages *ap = &wpa->ia.ap;
struct fuse_write_in *inarg = &wpa->ia.write.in;
struct fuse_args *args = &wpa->ia.ap.args;
/* Currently, all folios in FUSE are one page */
__u64 data_size = wpa->ia.ap.num_folios * PAGE_SIZE;
int err;
struct fuse_args *args = &ap->args;
__u64 data_size = 0;
int err, i;
for (i = 0; i < ap->num_folios; i++)
data_size += ap->descs[i].length;
fi->writectr++;
if (inarg->offset + data_size <= size) {
@ -1914,19 +1847,8 @@ __acquires(fi->lock)
out_free:
fi->writectr--;
rb_erase(&wpa->writepages_entry, &fi->writepages);
fuse_writepage_finish(wpa);
spin_unlock(&fi->lock);
/* After rb_erase() aux request list is private */
for (aux = wpa->next; aux; aux = next) {
next = aux->next;
aux->next = NULL;
fuse_writepage_finish_stat(aux->inode,
aux->ia.ap.folios[0]);
fuse_writepage_free(aux);
}
fuse_writepage_free(wpa);
spin_lock(&fi->lock);
}
@ -1954,43 +1876,6 @@ __acquires(fi->lock)
}
}
static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
struct fuse_writepage_args *wpa)
{
pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
pgoff_t idx_to = idx_from + wpa->ia.ap.num_folios - 1;
struct rb_node **p = &root->rb_node;
struct rb_node *parent = NULL;
WARN_ON(!wpa->ia.ap.num_folios);
while (*p) {
struct fuse_writepage_args *curr;
pgoff_t curr_index;
parent = *p;
curr = rb_entry(parent, struct fuse_writepage_args,
writepages_entry);
WARN_ON(curr->inode != wpa->inode);
curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
if (idx_from >= curr_index + curr->ia.ap.num_folios)
p = &(*p)->rb_right;
else if (idx_to < curr_index)
p = &(*p)->rb_left;
else
return curr;
}
rb_link_node(&wpa->writepages_entry, parent, p);
rb_insert_color(&wpa->writepages_entry, root);
return NULL;
}
static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
{
WARN_ON(fuse_insert_writeback(root, wpa));
}
static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
int error)
{
@ -2010,41 +1895,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
if (!fc->writeback_cache)
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
spin_lock(&fi->lock);
rb_erase(&wpa->writepages_entry, &fi->writepages);
while (wpa->next) {
struct fuse_mount *fm = get_fuse_mount(inode);
struct fuse_write_in *inarg = &wpa->ia.write.in;
struct fuse_writepage_args *next = wpa->next;
wpa->next = next->next;
next->next = NULL;
tree_insert(&fi->writepages, next);
/*
* Skip fuse_flush_writepages() to make it easy to crop requests
* based on primary request size.
*
* 1st case (trivial): there are no concurrent activities using
* fuse_set/release_nowrite. Then we're on safe side because
* fuse_flush_writepages() would call fuse_send_writepage()
* anyway.
*
* 2nd case: someone called fuse_set_nowrite and it is waiting
* now for completion of all in-flight requests. This happens
* rarely and no more than once per page, so this should be
* okay.
*
* 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
* of fuse_set_nowrite..fuse_release_nowrite section. The fact
* that fuse_set_nowrite returned implies that all in-flight
* requests were completed along with all of their secondary
* requests. Further primary requests are blocked by negative
* writectr. Hence there cannot be any in-flight requests and
* no invocations of fuse_writepage_end() while we're in
* fuse_set_nowrite..fuse_release_nowrite section.
*/
fuse_send_writepage(fm, next, inarg->offset + inarg->size);
}
fi->writectr--;
fuse_writepage_finish(wpa);
spin_unlock(&fi->lock);
@ -2131,19 +1981,16 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
}
static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
struct folio *tmp_folio, uint32_t folio_index)
uint32_t folio_index)
{
struct inode *inode = folio->mapping->host;
struct fuse_args_pages *ap = &wpa->ia.ap;
folio_copy(tmp_folio, folio);
ap->folios[folio_index] = tmp_folio;
ap->folios[folio_index] = folio;
ap->descs[folio_index].offset = 0;
ap->descs[folio_index].length = PAGE_SIZE;
ap->descs[folio_index].length = folio_size(folio);
inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP);
}
static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio,
@ -2178,18 +2025,12 @@ static int fuse_writepage_locked(struct folio *folio)
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_writepage_args *wpa;
struct fuse_args_pages *ap;
struct folio *tmp_folio;
struct fuse_file *ff;
int error = -ENOMEM;
int error = -EIO;
tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
if (!tmp_folio)
goto err;
error = -EIO;
ff = fuse_write_file_get(fi);
if (!ff)
goto err_nofile;
goto err;
wpa = fuse_writepage_args_setup(folio, ff);
error = -ENOMEM;
@ -2200,22 +2041,17 @@ static int fuse_writepage_locked(struct folio *folio)
ap->num_folios = 1;
folio_start_writeback(folio);
fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0);
fuse_writepage_args_page_fill(wpa, folio, 0);
spin_lock(&fi->lock);
tree_insert(&fi->writepages, wpa);
list_add_tail(&wpa->queue_entry, &fi->queued_writes);
fuse_flush_writepages(inode);
spin_unlock(&fi->lock);
folio_end_writeback(folio);
return 0;
err_writepage_args:
fuse_file_put(ff, false);
err_nofile:
folio_put(tmp_folio);
err:
mapping_set_error(folio->mapping, error);
return error;
@ -2225,8 +2061,8 @@ struct fuse_fill_wb_data {
struct fuse_writepage_args *wpa;
struct fuse_file *ff;
struct inode *inode;
struct folio **orig_folios;
unsigned int max_folios;
unsigned int nr_pages;
};
static bool fuse_pages_realloc(struct fuse_fill_wb_data *data)
@ -2260,69 +2096,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
struct fuse_writepage_args *wpa = data->wpa;
struct inode *inode = data->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
int num_folios = wpa->ia.ap.num_folios;
int i;
spin_lock(&fi->lock);
list_add_tail(&wpa->queue_entry, &fi->queued_writes);
fuse_flush_writepages(inode);
spin_unlock(&fi->lock);
for (i = 0; i < num_folios; i++)
folio_end_writeback(data->orig_folios[i]);
}
/*
* Check under fi->lock if the page is under writeback, and insert it onto the
* rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's
* one already added for a page at this offset. If there's none, then insert
* this new request onto the auxiliary list, otherwise reuse the existing one by
* swapping the new temp page with the old one.
*/
static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa,
struct folio *folio)
{
struct fuse_inode *fi = get_fuse_inode(new_wpa->inode);
struct fuse_writepage_args *tmp;
struct fuse_writepage_args *old_wpa;
struct fuse_args_pages *new_ap = &new_wpa->ia.ap;
WARN_ON(new_ap->num_folios != 0);
new_ap->num_folios = 1;
spin_lock(&fi->lock);
old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa);
if (!old_wpa) {
spin_unlock(&fi->lock);
return true;
}
for (tmp = old_wpa->next; tmp; tmp = tmp->next) {
pgoff_t curr_index;
WARN_ON(tmp->inode != new_wpa->inode);
curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT;
if (curr_index == folio->index) {
WARN_ON(tmp->ia.ap.num_folios != 1);
swap(tmp->ia.ap.folios[0], new_ap->folios[0]);
break;
}
}
if (!tmp) {
new_wpa->next = old_wpa->next;
old_wpa->next = new_wpa;
}
spin_unlock(&fi->lock);
if (tmp) {
fuse_writepage_finish_stat(new_wpa->inode,
folio);
fuse_writepage_free(new_wpa);
}
return false;
}
static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
@ -2331,25 +2109,16 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
{
WARN_ON(!ap->num_folios);
/*
* Being under writeback is unlikely but possible. For example direct
* read to an mmaped fuse file will set the page dirty twice; once when
* the pages are faulted with get_user_pages(), and then after the read
* completed.
*/
if (fuse_folio_is_writeback(data->inode, folio))
return true;
/* Reached max pages */
if (ap->num_folios == fc->max_pages)
if (data->nr_pages + folio_nr_pages(folio) > fc->max_pages)
return true;
/* Reached max write bytes */
if ((ap->num_folios + 1) * PAGE_SIZE > fc->max_write)
if ((data->nr_pages * PAGE_SIZE) + folio_size(folio) > fc->max_write)
return true;
/* Discontinuity */
if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio->index)
if (folio_next_index(ap->folios[ap->num_folios - 1]) != folio->index)
return true;
/* Need to grow the pages array? If so, did the expansion fail? */
@ -2368,7 +2137,6 @@ static int fuse_writepages_fill(struct folio *folio,
struct inode *inode = data->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_conn *fc = get_fuse_conn(inode);
struct folio *tmp_folio;
int err;
if (!data->ff) {
@ -2381,56 +2149,27 @@ static int fuse_writepages_fill(struct folio *folio,
if (wpa && fuse_writepage_need_send(fc, folio, ap, data)) {
fuse_writepages_send(data);
data->wpa = NULL;
data->nr_pages = 0;
}
err = -ENOMEM;
tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
if (!tmp_folio)
goto out_unlock;
/*
* The page must not be redirtied until the writeout is completed
* (i.e. userspace has sent a reply to the write request). Otherwise
* there could be more than one temporary page instance for each real
* page.
*
* This is ensured by holding the page lock in page_mkwrite() while
* checking fuse_page_is_writeback(). We already hold the page lock
* since clear_page_dirty_for_io() and keep it held until we add the
* request to the fi->writepages list and increment ap->num_folios.
* After this fuse_page_is_writeback() will indicate that the page is
* under writeback, so we can release the page lock.
*/
if (data->wpa == NULL) {
err = -ENOMEM;
wpa = fuse_writepage_args_setup(folio, data->ff);
if (!wpa) {
folio_put(tmp_folio);
if (!wpa)
goto out_unlock;
}
fuse_file_get(wpa->ia.ff);
data->max_folios = 1;
ap = &wpa->ia.ap;
}
folio_start_writeback(folio);
fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_folios);
data->orig_folios[ap->num_folios] = folio;
fuse_writepage_args_page_fill(wpa, folio, ap->num_folios);
data->nr_pages += folio_nr_pages(folio);
err = 0;
if (data->wpa) {
/*
* Protected by fi->lock against concurrent access by
* fuse_page_is_writeback().
*/
spin_lock(&fi->lock);
ap->num_folios++;
spin_unlock(&fi->lock);
} else if (fuse_writepage_add(wpa, folio)) {
ap->num_folios++;
if (!data->wpa)
data->wpa = wpa;
} else {
folio_end_writeback(folio);
}
out_unlock:
folio_unlock(folio);
@ -2456,13 +2195,7 @@ static int fuse_writepages(struct address_space *mapping,
data.inode = inode;
data.wpa = NULL;
data.ff = NULL;
err = -ENOMEM;
data.orig_folios = kcalloc(fc->max_pages,
sizeof(struct folio *),
GFP_NOFS);
if (!data.orig_folios)
goto out;
data.nr_pages = 0;
err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
if (data.wpa) {
@ -2472,7 +2205,6 @@ static int fuse_writepages(struct address_space *mapping,
if (data.ff)
fuse_file_put(data.ff, false);
kfree(data.orig_folios);
out:
return err;
}
@ -2497,8 +2229,6 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
if (IS_ERR(folio))
goto error;
fuse_wait_on_page_writeback(mapping->host, folio->index);
if (folio_test_uptodate(folio) || len >= folio_size(folio))
goto success;
/*
@ -2561,13 +2291,9 @@ static int fuse_launder_folio(struct folio *folio)
{
int err = 0;
if (folio_clear_dirty_for_io(folio)) {
struct inode *inode = folio->mapping->host;
/* Serialize with pending writeback for the same page */
fuse_wait_on_page_writeback(inode, folio->index);
err = fuse_writepage_locked(folio);
if (!err)
fuse_wait_on_page_writeback(inode, folio->index);
folio_wait_writeback(folio);
}
return err;
}
@ -2611,7 +2337,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
return VM_FAULT_NOPAGE;
}
fuse_wait_on_folio_writeback(inode, folio);
folio_wait_writeback(folio);
return VM_FAULT_LOCKED;
}
@ -3429,9 +3155,12 @@ static const struct address_space_operations fuse_file_aops = {
void fuse_init_file_inode(struct inode *inode, unsigned int flags)
{
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_conn *fc = get_fuse_conn(inode);
inode->i_fop = &fuse_file_operations;
inode->i_data.a_ops = &fuse_file_aops;
if (fc->writeback_cache)
mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
INIT_LIST_HEAD(&fi->write_files);
INIT_LIST_HEAD(&fi->queued_writes);
@ -3439,7 +3168,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
fi->iocachectr = 0;
init_waitqueue_head(&fi->page_waitq);
init_waitqueue_head(&fi->direct_io_waitq);
fi->writepages = RB_ROOT;
if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse_dax_inode_init(inode, flags);

fs/fuse/fuse_dev_i.h

@ -20,7 +20,6 @@ struct fuse_iqueue;
struct fuse_forget_link;
struct fuse_copy_state {
int write;
struct fuse_req *req;
struct iov_iter *iter;
struct pipe_buffer *pipebufs;
@ -30,8 +29,9 @@ struct fuse_copy_state {
struct page *pg;
unsigned int len;
unsigned int offset;
unsigned int move_pages:1;
unsigned int is_uring:1;
bool write:1;
bool move_folios:1;
bool is_uring:1;
struct {
unsigned int copied_sz; /* copied size into the user buffer */
} ring;
@ -51,7 +51,7 @@ struct fuse_req *fuse_request_find(struct fuse_pqueue *fpq, u64 unique);
void fuse_dev_end_requests(struct list_head *head);
void fuse_copy_init(struct fuse_copy_state *cs, int write,
void fuse_copy_init(struct fuse_copy_state *cs, bool write,
struct iov_iter *iter);
int fuse_copy_args(struct fuse_copy_state *cs, unsigned int numargs,
unsigned int argpages, struct fuse_arg *args,
@ -64,7 +64,6 @@ void fuse_dev_queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req);
bool fuse_remove_pending_req(struct fuse_req *req, spinlock_t *lock);
bool fuse_request_expired(struct fuse_conn *fc, struct list_head *list);
bool fuse_fpq_processing_expired(struct fuse_conn *fc, struct list_head *processing);
#endif

fs/fuse/fuse_i.h

@ -74,8 +74,8 @@ extern struct list_head fuse_conn_list;
extern struct mutex fuse_mutex;
/** Module parameters */
extern unsigned max_user_bgreq;
extern unsigned max_user_congthresh;
extern unsigned int max_user_bgreq;
extern unsigned int max_user_congthresh;
/* One forget request */
struct fuse_forget_link {
@ -161,9 +161,6 @@ struct fuse_inode {
/* waitq for direct-io completion */
wait_queue_head_t direct_io_waitq;
/* List of writepage requestst (pending or sent) */
struct rb_root writepages;
};
/* readdir cache (directory only) */
@ -636,6 +633,9 @@ struct fuse_conn {
/** Number of fuse_dev's */
atomic_t dev_count;
/** Current epoch for up-to-date dentries */
atomic_t epoch;
struct rcu_head rcu;
/** The user id for this mount */

fs/fuse/inode.c

@ -41,7 +41,7 @@ unsigned int fuse_max_pages_limit = 256;
unsigned int fuse_default_req_timeout;
unsigned int fuse_max_req_timeout;
unsigned max_user_bgreq;
unsigned int max_user_bgreq;
module_param_call(max_user_bgreq, set_global_limit, param_get_uint,
&max_user_bgreq, 0644);
__MODULE_PARM_TYPE(max_user_bgreq, "uint");
@ -49,7 +49,7 @@ MODULE_PARM_DESC(max_user_bgreq,
"Global limit for the maximum number of backgrounded requests an "
"unprivileged user can set");
unsigned max_user_congthresh;
unsigned int max_user_congthresh;
module_param_call(max_user_congthresh, set_global_limit, param_get_uint,
&max_user_congthresh, 0644);
__MODULE_PARM_TYPE(max_user_congthresh, "uint");
@ -962,6 +962,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
init_rwsem(&fc->killsb);
refcount_set(&fc->count, 1);
atomic_set(&fc->dev_count, 1);
atomic_set(&fc->epoch, 1);
init_waitqueue_head(&fc->blocked_waitq);
fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
INIT_LIST_HEAD(&fc->bg_queue);
@ -1036,7 +1037,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc)
}
EXPORT_SYMBOL_GPL(fuse_conn_get);
static struct inode *fuse_get_root_inode(struct super_block *sb, unsigned mode)
static struct inode *fuse_get_root_inode(struct super_block *sb, unsigned int mode)
{
struct fuse_attr attr;
memset(&attr, 0, sizeof(attr));
@ -1211,7 +1212,7 @@ static const struct super_operations fuse_super_operations = {
.show_options = fuse_show_options,
};
static void sanitize_global_limit(unsigned *limit)
static void sanitize_global_limit(unsigned int *limit)
{
/*
* The default maximum number of async requests is calculated to consume
@ -1232,7 +1233,7 @@ static int set_global_limit(const char *val, const struct kernel_param *kp)
if (rv)
return rv;
sanitize_global_limit((unsigned *)kp->arg);
sanitize_global_limit((unsigned int *)kp->arg);
return 0;
}

fs/fuse/readdir.c

@ -161,6 +161,7 @@ static int fuse_direntplus_link(struct file *file,
struct fuse_conn *fc;
struct inode *inode;
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
int epoch;
if (!o->nodeid) {
/*
@ -190,6 +191,7 @@ static int fuse_direntplus_link(struct file *file,
return -EIO;
fc = get_fuse_conn(dir);
epoch = atomic_read(&fc->epoch);
name.hash = full_name_hash(parent, name.name, name.len);
dentry = d_lookup(parent, &name);
@ -256,6 +258,7 @@ retry:
}
if (fc->readdirplus_auto)
set_bit(FUSE_I_INIT_RDPLUS, &get_fuse_inode(inode)->state);
dentry->d_time = epoch;
fuse_change_entry_timeout(dentry, o);
dput(dentry);
@ -332,35 +335,32 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
{
int plus;
ssize_t res;
struct folio *folio;
struct inode *inode = file_inode(file);
struct fuse_mount *fm = get_fuse_mount(inode);
struct fuse_conn *fc = fm->fc;
struct fuse_io_args ia = {};
struct fuse_args_pages *ap = &ia.ap;
struct fuse_folio_desc desc = { .length = PAGE_SIZE };
struct fuse_args *args = &ia.ap.args;
void *buf;
size_t bufsize = clamp((unsigned int) ctx->count, PAGE_SIZE, fc->max_pages << PAGE_SHIFT);
u64 attr_version = 0, evict_ctr = 0;
bool locked;
folio = folio_alloc(GFP_KERNEL, 0);
if (!folio)
buf = kvmalloc(bufsize, GFP_KERNEL);
if (!buf)
return -ENOMEM;
args->out_args[0].value = buf;
plus = fuse_use_readdirplus(inode, ctx);
ap->args.out_pages = true;
ap->num_folios = 1;
ap->folios = &folio;
ap->descs = &desc;
if (plus) {
attr_version = fuse_get_attr_version(fm->fc);
evict_ctr = fuse_get_evict_ctr(fm->fc);
fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
FUSE_READDIRPLUS);
fuse_read_args_fill(&ia, file, ctx->pos, bufsize, FUSE_READDIRPLUS);
} else {
fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
FUSE_READDIR);
fuse_read_args_fill(&ia, file, ctx->pos, bufsize, FUSE_READDIR);
}
locked = fuse_lock_inode(inode);
res = fuse_simple_request(fm, &ap->args);
res = fuse_simple_request(fm, args);
fuse_unlock_inode(inode, locked);
if (res >= 0) {
if (!res) {
@ -369,16 +369,14 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
if (ff->open_flags & FOPEN_CACHE_DIR)
fuse_readdir_cache_end(file, ctx->pos);
} else if (plus) {
res = parse_dirplusfile(folio_address(folio), res,
file, ctx, attr_version,
res = parse_dirplusfile(buf, res, file, ctx, attr_version,
evict_ctr);
} else {
res = parse_dirfile(folio_address(folio), res, file,
ctx);
res = parse_dirfile(buf, res, file, ctx);
}
}
folio_put(folio);
kvfree(buf);
fuse_invalidate_atime(inode);
return res;
}

include/linux/pagemap.h

@ -210,6 +210,7 @@ enum mapping_flags {
AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
folio contents */
AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */
AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
/* Bits 16-25 are used for FOLIO_ORDER */
AS_FOLIO_ORDER_BITS = 5,
AS_FOLIO_ORDER_MIN = 16,
@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
return test_bit(AS_INACCESSIBLE, &mapping->flags);
}
static inline void mapping_set_writeback_may_deadlock_on_reclaim(struct address_space *mapping)
{
set_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
}
static inline bool mapping_writeback_may_deadlock_on_reclaim(struct address_space *mapping)
{
return test_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
}
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return mapping->gfp_mask;

include/uapi/linux/fuse.h

@ -232,6 +232,9 @@
*
* 7.43
* - add FUSE_REQUEST_TIMEOUT
*
* 7.44
* - add FUSE_NOTIFY_INC_EPOCH
*/
#ifndef _LINUX_FUSE_H
@ -267,7 +270,7 @@
#define FUSE_KERNEL_VERSION 7
/** Minor version number of this interface */
#define FUSE_KERNEL_MINOR_VERSION 43
#define FUSE_KERNEL_MINOR_VERSION 44
/** The node ID of the root inode */
#define FUSE_ROOT_ID 1
@ -671,6 +674,7 @@ enum fuse_notify_code {
FUSE_NOTIFY_RETRIEVE = 5,
FUSE_NOTIFY_DELETE = 6,
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_INC_EPOCH = 8,
FUSE_NOTIFY_CODE_MAX,
};

mm/vmscan.c

@ -1197,8 +1197,10 @@ retry:
* 2) Global or new memcg reclaim encounters a folio that is
* not marked for immediate reclaim, or the caller does not
* have __GFP_FS (or __GFP_IO if it's simply going to swap,
* not to fs). In this case mark the folio for immediate
* reclaim and continue scanning.
* not to fs), or the folio belongs to a mapping where
* waiting on writeback during reclaim may lead to a deadlock.
* In this case mark the folio for immediate reclaim and
* continue scanning.
*
* Require may_enter_fs() because we would wait on fs, which
* may not have submitted I/O yet. And the loop driver might
@ -1223,6 +1225,8 @@ retry:
* takes to write them to disk.
*/
if (folio_test_writeback(folio)) {
mapping = folio_mapping(folio);
/* Case 1 above */
if (current_is_kswapd() &&
folio_test_reclaim(folio) &&
@ -1233,7 +1237,9 @@ retry:
/* Case 2 above */
} else if (writeback_throttling_sane(sc) ||
!folio_test_reclaim(folio) ||
!may_enter_fs(folio, sc->gfp_mask)) {
!may_enter_fs(folio, sc->gfp_mask) ||
(mapping &&
mapping_writeback_may_deadlock_on_reclaim(mapping))) {
/*
* This is slightly racy -
* folio_end_writeback() might have