mirror-linux/drivers
Leon Romanovsky 5d74781ebc vfio/pci: Add dma-buf export support for MMIO regions
Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.

The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.

Currently VFIO can take MMIO regions from the device's BAR and map
them into a PFNMAP VMA with special PTEs. This mapping type ensures
the memory cannot be used with things like pin_user_pages(), hmm, and
so on. In practice only the user process CPU and KVM can safely make
use of these VMAs. When VFIO shuts down, these VMAs are cleaned by
unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
unbind.

However, VFIO type 1 has an insecure behavior where it uses
follow_pfnmap_*() to fish an MMIO PFN out of a VMA and program it back
into the IOMMU. This has a long history of enabling P2P DMA inside
VMs, but has serious lifetime problems by allowing a UAF of the MMIO
after the VFIO driver has been unbound.

Introduce DMABUF as a new safe way to export an FD-based handle for the
MMIO regions. This can be consumed by existing DMABUF importers like
RDMA or DRM without opening a UAF. A following series will add an
importer to iommufd to obsolete the type 1 code and allow safe
UAF-free MMIO P2P in VM cases.

DMABUF has a built-in synchronous invalidation mechanism called
move_notify. VFIO keeps track of all drivers importing its MMIO and
can invoke a synchronous invalidation callback to tell the importing
drivers to DMA unmap and forget about the MMIO pfns. This process is
called revoke. This synchronous invalidation fully prevents any
lifecycle problems. VFIO will do this before unbinding its driver
ensuring there is no UAF of the MMIO beyond the driver lifecycle.
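
The revoke flow described above can be pictured with a small userspace
model. This is only a sketch under stated assumptions: struct exporter,
struct importer and every function name below are hypothetical and are
not the kernel's dma-buf or VFIO API. The point it illustrates is the
ordering: new mappings are blocked first, then every tracked importer is
invalidated synchronously, so nothing is still mapped when revoke returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IMPORTERS 8

/* Hypothetical importer: holds a DMA mapping of the exporter's MMIO. */
struct importer {
	bool mapped;                               /* true while a mapping exists */
	void (*move_notify)(struct importer *imp); /* synchronous invalidation hook */
};

/* Hypothetical exporter: tracks every importer of its MMIO region. */
struct exporter {
	bool revoked;
	size_t nr_importers;
	struct importer *importers[MAX_IMPORTERS];
};

/* Default callback: DMA unmap and forget about the MMIO PFNs. */
static void importer_forget_mapping(struct importer *imp)
{
	imp->mapped = false;
}

/* Register an importer; returns 0 on success, -1 if the table is full. */
static int exporter_attach(struct exporter *exp, struct importer *imp)
{
	if (exp->nr_importers == MAX_IMPORTERS)
		return -1;
	imp->mapped = false;
	imp->move_notify = importer_forget_mapping;
	exp->importers[exp->nr_importers++] = imp;
	return 0;
}

/* Mapping attempts fail while the region is revoked. */
static int importer_map(const struct exporter *exp, struct importer *imp)
{
	if (exp->revoked)
		return -1;
	imp->mapped = true;
	return 0;
}

/* Revoke: block new mappings, then synchronously invalidate existing ones. */
static void exporter_revoke(struct exporter *exp)
{
	exp->revoked = true;
	for (size_t i = 0; i < exp->nr_importers; i++)
		exp->importers[i]->move_notify(exp->importers[i]);
	/* Invariant on return: no importer still holds a mapping. */
	for (size_t i = 0; i < exp->nr_importers; i++)
		assert(!exp->importers[i]->mapped);
}

/* Lift the revoke, e.g. after a reset completes; importers must map again. */
static void exporter_restore(struct exporter *exp)
{
	exp->revoked = false;
}
```

In this model an unbind would call exporter_revoke() and never restore,
while a temporary block would pair exporter_revoke() with
exporter_restore(); in both cases an importer has to map again before it
can touch the MMIO.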

Further, VFIO has additional behavior to block access to the MMIO
during things like Function Level Reset. This is because some poor
platforms may experience an MCE-type crash when touching MMIO of a PCI
device that is undergoing a reset. Today this is done by using
unmap_mapping_range() on the VMAs. Extend that into the DMABUF world
and temporarily revoke the MMIO from the DMABUF importers during FLR
as well. This will more robustly prevent an errant P2P from possibly
upsetting the platform.

A DMABUF FD is a preferred handle for MMIO compared to using something
like a pgmap because:
 - VFIO is supported, including its P2P feature, on archs that don't
   support pgmap
 - PCI devices have all sorts of BAR sizes, including ones smaller
   than a section so a pgmap cannot always be created
 - It is undesirable to waste a lot of memory for struct pages,
   especially for a case like a GPU with ~100GB of BAR size
 - We want a synchronous revoke semantic to support FLR with light
   hardware requirements
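
To put a rough number on the struct page cost mentioned in the list
above, here is a back-of-the-envelope helper. The 4 KiB page size and
64-byte struct page are assumptions typical of common 64-bit
configurations, not figures taken from this commit, and "~100GB" is
read as 100 GiB.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, not taken from the commit. */
#define ASSUMED_PAGE_SIZE        4096ULL /* bytes per page */
#define ASSUMED_STRUCT_PAGE_SIZE 64ULL   /* bytes per struct page */

/* Bytes of struct page metadata needed to cover a BAR of bar_bytes. */
static uint64_t struct_page_overhead(uint64_t bar_bytes)
{
	/* Guard the round-up below against 64-bit overflow. */
	assert(bar_bytes <= UINT64_MAX - (ASSUMED_PAGE_SIZE - 1));

	/* Round up so a partial trailing page still needs a struct page. */
	uint64_t pages = (bar_bytes + ASSUMED_PAGE_SIZE - 1) / ASSUMED_PAGE_SIZE;

	return pages * ASSUMED_STRUCT_PAGE_SIZE;
}
```

Under those assumptions, struct_page_overhead(100ULL << 30) returns
1677721600 bytes, i.e. about 1.6 GiB of metadata for a single 100 GiB
BAR.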

Use the P2P subsystem to help generate the DMA mapping. This is a
significant upgrade over the abuse of dma_map_resource() that has
historically been used by DMABUF exporters. Experience with an OOT
version of this patch shows that real systems do need this. This
approach deals with all the P2P scenarios:
 - Non-zero PCI bus_offset
 - ACS flags routing traffic to the IOMMU
 - ACS flags that bypass the IOMMU - though vfio noiommu is required
   to hit this.
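
The scenarios listed above can be modelled in a few lines of userspace
C. This is an illustrative sketch only: the enum, the structs and
p2p_map() are hypothetical names rather than the kernel's P2PDMA
interface, and bus_offset is assumed here to mean "PCI bus address minus
CPU physical address". It shows why a single raw physical address is not
enough: a direct route needs the bus offset applied, while a route that
ACS sends to the IOMMU needs a real IOMMU mapping instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical route classification for one exporter/importer pair. */
enum p2p_route {
	P2P_ROUTE_BUS_ADDR,         /* traffic bypasses the IOMMU */
	P2P_ROUTE_THRU_HOST_BRIDGE  /* ACS routes traffic to the IOMMU */
};

struct mmio_region {
	uint64_t cpu_phys;   /* CPU physical address of the BAR */
	int64_t  bus_offset; /* assumed: PCI bus address minus cpu_phys */
};

struct p2p_mapping {
	bool     needs_iommu_map; /* true: caller must map through the IOMMU */
	uint64_t dma_addr;        /* valid only when needs_iommu_map is false */
};

/* Decide how the byte at offset off inside region r reaches the importer. */
static struct p2p_mapping p2p_map(const struct mmio_region *r, uint64_t off,
				  enum p2p_route route)
{
	struct p2p_mapping m = { .needs_iommu_map = false, .dma_addr = 0 };

	if (route == P2P_ROUTE_THRU_HOST_BRIDGE) {
		/* The address the device uses comes from the IOMMU mapping. */
		m.needs_iommu_map = true;
		return m;
	}

	assert(route == P2P_ROUTE_BUS_ADDR);
	/* Unsigned wrap-around makes a negative offset subtract correctly. */
	m.dma_addr = r->cpu_phys + off + (uint64_t)r->bus_offset;
	return m;
}
```

With cpu_phys = 0x100000000 and bus_offset = -0x80000000, a direct
mapping of offset 0x1000 yields bus address 0x80001000; handing the
device the raw physical address 0x100001000 would target the wrong
location.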

There will be further work to formalize the revoke semantic in
DMABUF. For now this acts like a move_notify dynamic exporter where
importer fault handling will get a failure when they attempt to map.
This means that only fully restartable fault capable importers can
import the VFIO DMABUFs. A future revoke semantic should open this up
to more HW as the HW only needs to invalidate, not handle restartable
faults.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-10-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 21:12:19 -07:00
..
accel accel/qaic: Synchronize access to DBC request queue head & tail pointer 2025-10-14 08:56:31 -06:00
accessibility
acpi Merge branches 'acpi-button', 'acpi-video' and 'acpi-fan' 2025-10-30 20:40:49 +01:00
amba
android binder: remove "invalid inc weak" check 2025-10-22 08:04:15 +02:00
ata ata: libata-core: relax checks in ata_read_log_directory() 2025-10-13 09:12:36 +02:00
atm
auxdisplay
base regmap: Fixes for v6.18 2025-11-01 10:45:39 -07:00
bcma bcma: don't register devices disabled in OF 2025-10-20 13:54:15 +02:00
block block-6.18-20251031 2025-10-31 12:57:19 -07:00
bluetooth Bluetooth: fix corruption in h4_recv_buf() after cleanup 2025-10-24 10:31:24 -04:00
bus Char/Misc/IIO/Binder changes for 6.18-rc1 2025-10-04 16:26:32 -07:00
cache
cdrom
cdx Char/Misc/IIO/Binder changes for 6.18-rc1 2025-10-04 16:26:32 -07:00
char tpm_crb: Add idle support for the Arm FF-A start method 2025-10-18 14:33:22 +03:00
clk There's a bunch of patches here across drivers/clk/ to migrate drivers to use 2025-10-07 09:28:37 -07:00
clocksource hyperv-next for v6.18 2025-10-07 08:40:15 -07:00
comedi comedi: fix divide-by-zero in comedi_buf_munge() 2025-10-22 08:03:52 +02:00
connector
counter
cpufreq cpufreq/amd-pstate: Fix a regression leading to EPP 0 after hibernate 2025-10-15 08:21:16 -05:00
cpuidle cpuidle: governors: menu: Select polling state in some more cases 2025-10-27 14:41:27 +01:00
crypto crypto: aspeed - fix double free caused by devm 2025-10-23 12:53:23 +08:00
cxl cxl/trace: Subtract to find an hpa_alias0 in cxl_poison events 2025-10-14 14:48:14 -07:00
dax fs: rename generic_delete_inode() and generic_drop_inode() 2025-09-15 16:09:42 +02:00
dca
devfreq PM / devfreq: rockchip-dfi: switch to FIELD_PREP_WM16 macro 2025-10-15 10:39:54 -04:00
dibs dibs: Check correct variable in dibs_init() 2025-09-26 15:10:59 -07:00
dio
dma dmaengine updates for v6.18 2025-10-06 10:37:06 -07:00
dma-buf dma-buf: provide phys_vec to scatter-gather mapping routine 2025-11-20 12:02:19 -07:00
dpll dpll: zl3073x: Fix output pin registration 2025-10-28 18:54:48 -07:00
edac - Add support for new AMD family 0x1a models to amd64_edac 2025-09-30 11:41:03 -07:00
eisa
extcon
firewire firewire: init_ohci1394_dma: add missing function parameter documentation 2025-10-25 08:29:56 +09:00
firmware Arm SCMI fixes for v6.18 2025-10-23 22:30:01 +02:00
fpga
fsi
fwctl pds_fwctl: Replace kzalloc + copy_from_user with memdup_user in pdsfc_fw_rpc 2025-09-22 10:33:10 -03:00
gnss
gpio gpio: ljca: Fix duplicated IRQ mapping 2025-10-23 14:30:11 +02:00
gpu Driver Changes: 2025-10-31 19:11:16 +01:00
greybus
hid hid-for-linus-2025101701 2025-10-18 08:18:18 -10:00
hsi
hte
hv Drivers: hv: Make CONFIG_HYPERV bool 2025-10-01 00:00:45 +00:00
hwmon hwmon: (sht3x) Fix error handling 2025-10-19 18:56:14 -07:00
hwspinlock
hwtracing Char/Misc/IIO/Binder changes for 6.18-rc1 2025-10-04 16:26:32 -07:00
i2c i2c: usbio: Add ACPI device-id for MTL-CVF devices 2025-10-14 13:54:43 +02:00
i3c i3c: fix big-endian FIFO transfers 2025-09-29 00:17:22 +02:00
idle
iio IIO: New device support, features and cleanup for 6.18 2025-09-23 14:15:25 +02:00
infiniband RDMA v6.18 merge window pull request 2025-10-03 18:35:22 -07:00
input Input updates for v6.18-rc0 2025-10-08 09:44:38 -07:00
interconnect
iommu PCI/P2PDMA: Simplify bus address mapping API 2025-11-20 12:01:41 -07:00
ipack
irqchip irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume 2025-10-07 10:23:22 +02:00
isdn
leds leds: led-class: Add Device Tree support to led_get() 2025-09-16 16:49:28 +01:00
macintosh
mailbox qcom: add Glymur CPUCP mailbox binding 2025-10-08 11:44:21 -07:00
mcb
md dm docs: fix typos 2025-10-03 18:48:02 -07:00
media USB/Thunderbolt changes for 6.18-rc1 2025-10-04 16:07:08 -07:00
memory
memstick Summary of significant series in this pull request: 2025-10-02 18:18:33 -07:00
message
mfd mfd: ls2kbmc: check for devm_mfd_add_devices() failure 2025-10-03 10:38:23 -05:00
misc Char/Misc driver fixes for 6.18-rc3 2025-10-26 10:33:46 -07:00
mmc rpmb: move rpmb_frame struct and constants to common header 2025-10-13 13:18:03 +02:00
most most: usb: hdm_probe: Fix calling put_device() before device initialization 2025-10-22 08:04:43 +02:00
mtd MTD core: 2025-10-04 15:50:37 -07:00
mux
net net: stmmac: est: Fix GCL bounds checks 2025-10-29 18:49:24 -07:00
nfc
ntb NTB: epf: Add Renesas rcar support 2025-09-22 09:35:21 -04:00
nubus
nvdimm libnvdimm for 6.18 2025-10-06 11:17:18 -07:00
nvme nvme-pci: use blk_map_iter for p2p metadata 2025-10-22 19:46:25 -07:00
nvmem nvmem: rcar-efuse: add missing MODULE_DEVICE_TABLE 2025-10-22 08:02:38 +02:00
of of/irq: Export of_msi_xlate() for module usage 2025-10-24 07:44:09 -05:00
opp
parisc
parport
pci PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function 2025-11-20 12:02:00 -07:00
pcmcia
peci
perf arm64 fixes for -rc1 2025-10-07 08:59:25 -07:00
phy phy-for-6.18 2025-10-06 10:34:22 -07:00
pinctrl pci-v6.18-changes 2025-10-06 10:41:03 -07:00
platform platform/x86: alienware-wmi-wmax: Add AWCC support to Dell G15 5530 2025-10-15 11:22:35 +03:00
pmdomain soc: driver updates for 6.18 2025-10-01 17:32:51 -07:00
pnp
power power supply and reset changes for the 6.18 series 2025-10-01 13:02:59 -07:00
powercap
pps
ps3
ptp ptp: ocp: Fix typo using index 1 instead of i in SMA initialization loop 2025-10-22 19:18:39 -07:00
pwm gpio updates for v6.18-rc1 2025-10-01 11:34:12 -07:00
rapidio
ras RAS: Export log_non_standard_event() to drivers 2025-09-15 16:20:29 +02:00
regulator regulator: bd718x7: Fix voltages scaled by resistor divider 2025-10-30 11:30:23 +00:00
remoteproc remoteproc updates for v6.18 2025-10-04 15:45:17 -07:00
reset soc: driver updates for 6.18 2025-10-01 17:32:51 -07:00
rpmsg rpmsg: qcom_smd: Fix fallback to qcom,ipc parse 2025-09-20 21:29:48 -05:00
rtc RTC for 6.18 2025-10-11 11:56:47 -07:00
s390 more s390 updates for 6.18 merge window 2025-10-09 10:51:43 -07:00
sbus
scsi scsi: core: Fix the unit attention counter implementation 2025-10-21 21:09:36 -04:00
sh
siox
slimbus
soc - switch longson32 platform to DT and use MIPS_GENERIC framework 2025-10-05 10:09:55 -07:00
soundwire soundwire updates for 6.18 2025-10-06 10:32:22 -07:00
spi spi: intel: Add support for Oak Stream SPI serial flash 2025-10-29 12:53:45 +00:00
spmi
ssb
staging staging: gpib: Fix device reference leak in fmh_gpib driver 2025-10-13 10:55:03 +02:00
target SCSI misc on 20251011 2025-10-11 11:49:00 -07:00
tc
tee TEE QTEE fixes for v6.18 2025-10-17 15:26:52 +02:00
thermal thermal: renesas: Fix RZ/G3E fall-out 2025-10-02 10:41:58 +02:00
thunderbolt thunderbolt: Fix use-after-free in tb_dp_dprx_work 2025-09-23 17:16:38 +02:00
tty serial: 8250_mtk: Enable baud clock and manage in runtime PM 2025-10-22 12:13:54 +02:00
ufs scsi: ufs: core: Declare tx_lanes witout initialization 2025-10-21 21:02:46 -04:00
uio hyperv-next for v6.18 2025-10-07 08:40:15 -07:00
usb USB serial device ids for 6.18-rc3 2025-10-24 13:52:58 +02:00
vdpa vduse: Use fixed 4KB bounce pages for non-4KB page size 2025-10-01 07:24:55 -04:00
vfio vfio/pci: Add dma-buf export support for MMIO regions 2025-11-20 21:12:19 -07:00
vhost vdpa: support virtio_map 2025-10-01 07:24:43 -04:00
video fbdev: atyfb: Check if pll_ops->init_pll failed 2025-10-28 22:59:19 +01:00
virt arm64 updates for 6.18 2025-09-29 18:48:39 -07:00
virtio virtio,vhost: fixes, cleanups 2025-10-04 08:48:16 -07:00
w1
watchdog linux-watchdog 6.18-rc1 tag 2025-10-06 11:00:30 -07:00
xen dma-mapping updates for Linux 6.18: 2025-10-03 17:41:12 -07:00
zorro zorro: Remove extra whitespace in macro definitions 2025-09-15 14:30:17 +02:00
Kconfig
Makefile hyperv-next for v6.18 2025-10-07 08:40:15 -07:00