mirror-linux/mm
Paolo Bonzini 6c370dc653 Merge branch 'kvm-guestmemfd' into HEAD
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgly
or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it.  Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences between memfd files (and every other memory subystem)
is that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory.  In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.

The immediate and driving use case for guest_memfd are Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption.  In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must be prevent host userspace from accessing
guest memory irrespective of hardware behavior.

Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory.  As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allow guest protection (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mappings sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs.  But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at-large.

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.

Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
  the same memory attributes introduced here
- SNP and TDX support

There are two non-KVM patches buried in the middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14 08:31:31 -05:00
..
damon As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
kasan Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
kfence LoongArch changes for v6.6 2023-09-08 12:16:52 -07:00
kmsan mm: kmsan: panic on failure to allocate early boot metadata 2023-10-25 16:47:10 -07:00
Kconfig mm: restrict the pcp batch scale factor to avoid too long latency 2023-10-25 16:47:10 -07:00
Kconfig.debug mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-05-29 16:14:28 +01:00
Makefile mm: vmscan: move shrinker-related code into a separate file 2023-10-04 10:32:23 -07:00
backing-dev.c writeback: remove redundant checks for root memcg 2023-08-21 13:37:48 -07:00
balloon_compaction.c
bootmem_info.c bootmem: use kmemleak_free_part_phys in put_page_bootmem 2023-10-25 16:47:13 -07:00
cma.c mm/cma: use nth_page() in place of direct struct page manipulation 2023-10-04 10:32:29 -07:00
cma.h
cma_debug.c
cma_sysfs.c
compaction.c Merge branch 'kvm-guestmemfd' into HEAD 2023-11-14 08:31:31 -05:00
debug.c mm: update validate_mm() to use vma iterator 2023-06-09 16:25:31 -07:00
debug_page_alloc.c mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm: fix multiple typos in multiple files 2023-10-25 16:47:14 -07:00
dmapool.c dmapool: create/destroy cleanup 2023-06-09 16:25:17 -07:00
dmapool_test.c
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
fadvise.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
fail_page_alloc.c mm: page_alloc: split out FAIL_PAGE_ALLOC 2023-06-09 16:25:23 -07:00
failslab.c
filemap.c mm: drop the assumption that VM_SHARED always implies writable 2023-10-18 14:34:19 -07:00
folio-compat.c filemap: Add fgf_t typedef 2023-07-24 18:04:30 -04:00
gup.c mm/gup: make failure to pin an error if FOLL_NOWAIT not specified 2023-10-18 14:34:15 -07:00
gup_test.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
gup_test.h
highmem.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hmm.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
huge_memory.c mm: huge_memory: use folio_xchg_last_cpupid() in __split_huge_page_tail() 2023-10-25 16:47:12 -07:00
hugetlb.c mempolicy: mmap_lock is not needed while migrating folios 2023-10-25 16:47:16 -07:00
hugetlb_cgroup.c mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER 2023-10-18 14:34:17 -07:00
hugetlb_vmemmap.c hugetlb_vmemmap: use folio argument for hugetlb_vmemmap_* functions 2023-10-25 16:47:08 -07:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: fix reference to nonexistent file 2023-10-25 16:47:14 -07:00
hwpoison-inject.c
init-mm.c mm: move dummy_vm_ops out of a header 2023-08-21 13:37:46 -07:00
internal.h mm: add page_rmappable_folio() wrapper 2023-10-25 16:47:16 -07:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed 2023-08-18 10:12:36 -07:00
khugepaged.c As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
kmemleak.c mm/kmemleak: move the initialisation of object to __link_object 2023-10-25 16:47:13 -07:00
ksm.c mm/ksm: add pages_skipped metric 2023-10-16 15:44:39 -07:00
list_lru.c
maccess.c
madvise.c mm: drop the assumption that VM_SHARED always implies writable 2023-10-18 14:34:19 -07:00
mapping_dirty_helpers.c mm: fix clean_record_shared_mapping_range kernel-doc 2023-08-24 16:20:30 -07:00
memblock.c memblock: report failures when memblock_can_resize is not set 2023-11-08 09:40:13 -08:00
memcontrol.c mm: fix multiple typos in multiple files 2023-10-25 16:47:14 -07:00
memfd.c memfd: drop warning for missing exec-related flags 2023-10-04 10:32:22 -07:00
memory-failure.c mm: convert DAX lock/unlock page to lock/unlock folio 2023-10-04 10:32:20 -07:00
memory-tiers.c dax, kmem: calculate abstract distance with general interface 2023-10-16 15:44:39 -07:00
memory.c mm: use folio_xchg_last_cpupid() in wp_page_reuse() 2023-10-25 16:47:13 -07:00
memory_hotplug.c mm: memory_hotplug: drop memoryless node from fallback lists 2023-10-25 16:47:14 -07:00
mempolicy.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
mempool.c
memremap.c
memtest.c mm: memtest: convert to memtest_report_meminfo() 2023-08-21 13:37:47 -07:00
migrate.c Merge branch 'kvm-guestmemfd' into HEAD 2023-11-14 08:31:31 -05:00
migrate_device.c Add x86 shadow stack support 2023-08-31 12:20:12 -07:00
mincore.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
mlock.c mm: mlock: avoid folio_within_range() on KSM pages 2023-10-25 16:47:14 -07:00
mm_init.c mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO 2023-10-04 10:32:30 -07:00
mm_slot.h
mmap.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
mmap_lock.c
mmu_gather.c mm: fix kernel-doc warning from tlb_flush_rmaps() 2023-08-24 16:20:30 -07:00
mmu_notifier.c mmu_notifiers: rename invalidate_range notifier 2023-08-18 10:12:41 -07:00
mmzone.c mm: remove page_cpupid_xchg_last() 2023-10-25 16:47:13 -07:00
mprotect.c mm: mprotect: use a folio in change_pte_range() 2023-10-25 16:47:12 -07:00
mremap.c mm: abstract VMA merge and extend into vma_merge_extend() helper 2023-10-18 14:34:18 -07:00
msync.c
nommu.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
oom_kill.c mm/oom_killer: simplify OOM killer info dump helper 2023-10-25 16:47:10 -07:00
page-writeback.c mm: use folio_xor_flags_has_waiters() in folio_end_writeback() 2023-10-18 14:34:17 -07:00
page_alloc.c mm: add page_rmappable_folio() wrapper 2023-10-25 16:47:16 -07:00
page_counter.c
page_ext.c mm/page_ext: move functions around for minor cleanups to page_ext 2023-08-18 10:12:31 -07:00
page_idle.c
page_io.c mm: memcg: add THP swap out info for anonymous reclaim 2023-10-04 10:32:27 -07:00
page_isolation.c mm/hugetlb: get rid of page_hstate() 2023-08-18 10:12:39 -07:00
page_owner.c mm/page_owner: remove free_ts from page_owner output 2023-10-18 14:34:19 -07:00
page_poison.c mm/page_poison: remove unused page_ext.h from page_poison 2023-08-21 13:37:30 -07:00
page_reporting.c
page_reporting.h
page_table_check.c mm: convert page_table_check_pte_set() to page_table_check_ptes_set() 2023-08-24 16:20:18 -07:00
page_vma_mapped.c mm: correct stale comment of function check_pte 2023-08-18 10:12:13 -07:00
pagewalk.c mm/pagewalk: fix bootstopping regression from extra pte_unmap() 2023-09-02 08:39:21 -07:00
percpu-internal.h percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing 2023-06-19 16:19:29 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
pgalloc-track.h
pgtable-generic.c mm/pgtable: notes on pte_offset_map[_lock]() 2023-08-18 10:12:25 -07:00
process_vm_access.c mm/gup: remove unused vmas parameter from pin_user_pages_remote() 2023-06-09 16:25:25 -07:00
ptdump.c mm: ptdump should use ptep_get_lockless() 2023-06-19 16:19:24 -07:00
readahead.c vfs: fix readahead(2) on block devices 2023-10-19 11:02:49 +02:00
rmap.c mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap() 2023-10-18 14:34:14 -07:00
rodata_test.c
secretmem.c mm/secretmem: use a folio in secretmem_fault() 2023-08-21 13:38:02 -07:00
shmem.c As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
shmem_quota.c shmem: Add default quota limit mount options 2023-08-09 09:15:40 +02:00
show_mem.c mm: refactor si_mem_available() 2023-10-04 10:32:19 -07:00
shrinker.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shrinker_debug.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shuffle.c
shuffle.h
slab.c Randomized slab caches for kmalloc() 2023-07-18 10:07:47 +02:00
slab.h mm: kmem: scoped objcg protection 2023-10-25 16:47:11 -07:00
slab_common.c RCU pull request for v6.7 2023-10-30 18:01:41 -10:00
slub.c mm/slub: refactor calculate_order() and calc_slab_order() 2023-10-02 11:55:47 +02:00
sparse-vmemmap.c mm/vmemmap: allow architectures to override how vmemmap optimization works 2023-08-18 10:12:53 -07:00
sparse.c mm/sparse: remove redundant judgments from macro for_each_present_section_nr 2023-08-18 10:12:14 -07:00
swap.c mm: remove references to pagevec 2023-06-23 16:59:30 -07:00
swap.h mempolicy: alloc_pages_mpol() for NUMA policy without vma 2023-10-25 16:47:16 -07:00
swap_cgroup.c
swap_slots.c
swap_state.c mempolicy: alloc_pages_mpol() for NUMA policy without vma 2023-10-25 16:47:16 -07:00
swapfile.c mm/swap: Convert to use bdev_open_by_dev() 2023-10-28 13:29:19 +02:00
truncate.c - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") 2023-08-29 14:25:26 -07:00
usercopy.c
userfaultfd.c Add x86 shadow stack support 2023-08-31 12:20:12 -07:00
util.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
vmalloc.c mm/vmalloc: fix the unchecked dereference warning in vread_iter() 2023-11-01 12:38:35 -07:00
vmpressure.c net-memcg: Fix scope of sockmem pressure indicators 2023-08-16 12:21:32 +01:00
vmscan.c mm: multi-gen LRU: reuse some legacy trace events 2023-10-18 14:34:14 -07:00
vmstat.c mm: tune PCP high automatically 2023-10-25 16:47:10 -07:00
workingset.c mm: workingset: dynamically allocate the mm-shadow shrinker 2023-10-04 10:32:24 -07:00
z3fold.c mm/z3fold: remove obsolete comment for struct z3fold_pool 2023-08-21 13:37:51 -07:00
zbud.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zpool.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zsmalloc.c zsmalloc: use copy_page for full page copy 2023-10-18 14:34:16 -07:00
zswap.c zswap: export compression failure stats 2023-11-01 12:38:35 -07:00