1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

115 commits

Author SHA1 Message Date
Linus Torvalds
d65e1a0f30 - Store AP Query Configuration Information in a static buffer
- Rework the AP initialization and add missing cleanups to the error path
 
 - Swap IRQ and AP bus/device registration to avoid race conditions
 
 - Export prot_virt_guest symbol
 
 - Introduce AP configuration changes notifier interface to facilitate
   modularization of the AP bus
 
 - Add CONFIG_AP kernel configuration option to allow modularization of
   the AP bus
 
 - Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description and
   dependency and rename it to CONFIG_AP_DEBUG
 
 - Convert sprintf() and snprintf() to sysfs_emit() in CIO code
 
 - Adjust indentation of RELOCS command build step
 
 - Make crypto performance counters upward compatible
 
 - Convert make_page_secure() and gmap_make_secure() to use folio
 
 - Rework channel-utilization-block (CUB) handling in preparation of
   introducing additional CUBs
 
 - Use attribute groups to simplify registration, removal and extension
   of measurement-related channel-path sysfs attributes
 
 - Add a per-channel-path binary "ext_measurement" sysfs attribute that
   provides access to extended channel-path measurement data
 
 - Export measurement data for all channel-measurement-groups (CMG), not
   only for a specific ones. This enables support of new CMG data formats
   in userspace without the need for kernel changes
 
 - Add a per-channel-path sysfs attribute "speed_bps" that provides the
   operating speed in bits per second or 0 if the operating speed is not
   available
 
 - The CIO tracepoint subchannel-type field "st" is incorrectly set to
   the value of subchannel-enabled SCHIB "ena" field. Fix that
 
 - Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS
 
 - Consider the maximum physical address available to a DCSS segment
   (512GB) when memory layout is set up
 
 - Simplify the virtual memory layout setup by reducing the size of
   identity mapping vs vmemmap overlap
 
 - Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory.
   This will allow to place the kernel image next to kernel modules
 
 - Move everyting KASLR related from <asm/setup.h> to <asm/page.h>
 
 - Put virtual memory layout information into a structure to improve
   code generation
 
 - Currently __kaslr_offset is the kernel offset in both physical and
   virtual memory spaces. Uncouple these offsets to allow uncoupling
   of the addresses spaces
 
 - Currently the identity mapping base address is implicit and is always
   set to zero. Make it explicit by putting into __identity_base persistent
   boot variable and use it in proper context
 
 - Introduce .amode31 section start and end macros AMODE31_START and
   AMODE31_END
 
 - Introduce OS_INFO entries that do not reference any data in memory,
   but rather provide only values
 
 - Store virtual memory layout in OS_INFO. It is read out by makedumpfile,
   crash and other tools
 
 - Store virtual memory layout in VMCORE_INFO. It is read out by crash and
   other tools when /proc/kcore device is used
 
 - Create additional PT_LOAD ELF program header that covers kernel image
   only, so that vmcore tools could locate kernel text and data when virtual
   and physical memory spaces are uncoupled
 
 - Uncouple physical and virtual address spaces
 
 - Map kernel at fixed location when KASLR mode is disabled. The location is
   defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration value.
 
 - Rework deployment of kernel image for both compressed and uncompressed
   variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel configuration
   value
 
 - Move .vmlinux.relocs section in front of the compressed kernel.
   The interim section rescue step is avoided as result
 
 - Correct modules thunk offset calculation when branch target is more
   than 2GB away
 
 - Kernel modules contain their own set of expoline thunks. Now that the
   kernel modules area is less than 4GB away from kernel expoline thunks,
   make modules use kernel expolines. Also make EXPOLINE_EXTERN the default
   if the compiler supports it
 
 - userfaultfd can insert shared zeropages into processes running VMs,
   but that is not allowed for s390. Fallback to allocating a fresh
   zeroed anonymous folio and insert that instead
 
 - Re-enable shared zeropages for non-PV and non-skeys KVM guests
 
 - Rename hex2bitmap() to ap_hex2bitmap() and export it for external use
 
 - Add ap_config sysfs attribute to provide the means for setting or
   displaying adapters, domains and control domains assigned to a vfio-ap
   mediated device in a single operation
 
 - Make vfio_ap_mdev_link_queue() ignore duplicate link requests
 
 - Add write support to ap_config sysfs attribute to allow atomic update
   a vfio-ap mediated device state
 
 - Document ap_config sysfs attribute
 
 - Function os_info_old_init() is expected to be called only from a regular
   kdump kernel. Enable it to be called from a stand-alone dump kernel
 
 - Address gcc -Warray-bounds warning and fix array size in struct os_info
 
 - s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks
 
 - Use unwinder instead of __builtin_return_address() with ftrace to
   prevent returning of undefined values
 
 - Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
   kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is disabled
 
 - Compile kernel with -fPIC and link with -no-pie to allow kpatch feature
   always succeed and drop the whole CONFIG_PIE_BUILD option-enabled code
 
 - Add missing virt_to_phys() converter for VSIE facility and crypto
   control blocks
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZjkp5xccYWdvcmRlZXZA
 bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8D99AQCEby+KHssuZe9m0NvvikWREYBC
 myqob4EmdU3KdTEbNAEAt2OB7mzSQc90yjawI+Je7vwVyh3uc2Nb4Qg05yO6owI=
 =eOYN
 -----END PGP SIGNATURE-----

Merge tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Alexander Gordeev:

 - Store AP Query Configuration Information in a static buffer

 - Rework the AP initialization and add missing cleanups to the error
   path

 - Swap IRQ and AP bus/device registration to avoid race conditions

 - Export prot_virt_guest symbol

 - Introduce AP configuration changes notifier interface to facilitate
   modularization of the AP bus

 - Add CONFIG_AP kernel configuration option to allow modularization of
   the AP bus

 - Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description
   and dependency and rename it to CONFIG_AP_DEBUG

 - Convert sprintf() and snprintf() to sysfs_emit() in CIO code

 - Adjust indentation of RELOCS command build step

 - Make crypto performance counters upward compatible

 - Convert make_page_secure() and gmap_make_secure() to use folio

 - Rework channel-utilization-block (CUB) handling in preparation of
   introducing additional CUBs

 - Use attribute groups to simplify registration, removal and extension
   of measurement-related channel-path sysfs attributes

 - Add a per-channel-path binary "ext_measurement" sysfs attribute that
   provides access to extended channel-path measurement data

 - Export measurement data for all channel-measurement-groups (CMG), not
   only for a specific ones. This enables support of new CMG data
   formats in userspace without the need for kernel changes

 - Add a per-channel-path sysfs attribute "speed_bps" that provides the
   operating speed in bits per second or 0 if the operating speed is not
   available

 - The CIO tracepoint subchannel-type field "st" is incorrectly set to
   the value of subchannel-enabled SCHIB "ena" field. Fix that

 - Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS

 - Consider the maximum physical address available to a DCSS segment
   (512GB) when memory layout is set up

 - Simplify the virtual memory layout setup by reducing the size of
   identity mapping vs vmemmap overlap

 - Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory.
   This will allow to place the kernel image next to kernel modules

 - Move everyting KASLR related from <asm/setup.h> to <asm/page.h>

 - Put virtual memory layout information into a structure to improve
   code generation

 - Currently __kaslr_offset is the kernel offset in both physical and
   virtual memory spaces. Uncouple these offsets to allow uncoupling of
   the addresses spaces

 - Currently the identity mapping base address is implicit and is always
   set to zero. Make it explicit by putting into __identity_base
   persistent boot variable and use it in proper context

 - Introduce .amode31 section start and end macros AMODE31_START and
   AMODE31_END

 - Introduce OS_INFO entries that do not reference any data in memory,
   but rather provide only values

 - Store virtual memory layout in OS_INFO. It is read out by
   makedumpfile, crash and other tools

 - Store virtual memory layout in VMCORE_INFO. It is read out by crash
   and other tools when /proc/kcore device is used

 - Create additional PT_LOAD ELF program header that covers kernel image
   only, so that vmcore tools could locate kernel text and data when
   virtual and physical memory spaces are uncoupled

 - Uncouple physical and virtual address spaces

 - Map kernel at fixed location when KASLR mode is disabled. The
   location is defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration
   value.

 - Rework deployment of kernel image for both compressed and
   uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel
   configuration value

 - Move .vmlinux.relocs section in front of the compressed kernel. The
   interim section rescue step is avoided as result

 - Correct modules thunk offset calculation when branch target is more
   than 2GB away

 - Kernel modules contain their own set of expoline thunks. Now that the
   kernel modules area is less than 4GB away from kernel expoline
   thunks, make modules use kernel expolines. Also make EXPOLINE_EXTERN
   the default if the compiler supports it

 - userfaultfd can insert shared zeropages into processes running VMs,
   but that is not allowed for s390. Fallback to allocating a fresh
   zeroed anonymous folio and insert that instead

 - Re-enable shared zeropages for non-PV and non-skeys KVM guests

 - Rename hex2bitmap() to ap_hex2bitmap() and export it for external use

 - Add ap_config sysfs attribute to provide the means for setting or
   displaying adapters, domains and control domains assigned to a
   vfio-ap mediated device in a single operation

 - Make vfio_ap_mdev_link_queue() ignore duplicate link requests

 - Add write support to ap_config sysfs attribute to allow atomic update
   a vfio-ap mediated device state

 - Document ap_config sysfs attribute

 - Function os_info_old_init() is expected to be called only from a
   regular kdump kernel. Enable it to be called from a stand-alone dump
   kernel

 - Address gcc -Warray-bounds warning and fix array size in struct
   os_info

 - s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks

 - Use unwinder instead of __builtin_return_address() with ftrace to
   prevent returning of undefined values

 - Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
   kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is
   disabled

 - Compile kernel with -fPIC and link with -no-pie to allow kpatch
   feature always succeed and drop the whole CONFIG_PIE_BUILD
   option-enabled code

 - Add missing virt_to_phys() converter for VSIE facility and crypto
   control blocks

* tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
  Revert "s390: Relocate vmlinux ELF data to virtual address space"
  KVM: s390: vsie: Use virt_to_phys for crypto control block
  s390: Relocate vmlinux ELF data to virtual address space
  s390: Compile kernel with -fPIC and link with -no-pie
  s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD
  s390/ftrace: Use unwinder instead of __builtin_return_address()
  s390/pci: Drop unneeded reference to CONFIG_DMI
  s390/os_info: Fix array size in struct os_info
  s390/os_info: Initialize old os_info in standalone dump kernel
  docs: Update s390 vfio-ap doc for ap_config sysfs attribute
  s390/vfio-ap: Add write support to sysfs attr ap_config
  s390/vfio-ap: Ignore duplicate link requests in vfio_ap_mdev_link_queue
  s390/vfio-ap: Add sysfs attr, ap_config, to export mdev state
  s390/ap: Externalize AP bus specific bitmap reading function
  s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
  mm/userfaultfd: Do not place zeropages when zeropages are disallowed
  s390/expoline: Make modules use kernel expolines
  s390/nospec: Correct modules thunk offset calculation
  s390/boot: Do not rescue .vmlinux.relocs section
  s390/boot: Rework deployment of the kernel image
  ...
2024-05-13 08:33:52 -07:00
Claudio Imbrenda
843c328068 s390/mm: Fix storage key clearing for guest huge pages
The function __storage_key_init_range() expects the end address to be
the first byte outside the range to be initialized. I.e. end - start
should be the size of the area to be initialized.

The current code works because __storage_key_init_range() will still loop
over every page in the range, but it is slower than using sske_frame().

Fixes: 964c2c05c9 ("s390/mm: Clear huge page storage keys on enable_skey")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20240416114220.28489-2-imbrenda@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-22 12:47:47 +02:00
David Hildenbrand
06201e00ee s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
commit fa41ba0d08 ("s390/mm: avoid empty zero pages for KVM guests to
avoid postcopy hangs") introduced an undesired side effect when combined
with memory ballooning and VM migration: memory part of the inflated
memory balloon will consume memory.

Assuming we have a 100GiB VM and inflated the balloon to 40GiB. Our VM
will consume ~60GiB of memory. If we now trigger a VM migration,
hypervisors like QEMU will read all VM memory. As s390x does not support
the shared zeropage, we'll end up allocating for all previously-inflated
memory part of the memory balloon: 50 GiB. So we might easily
(unexpectedly) crash the VM on the migration source.

Even worse, hypervisors like QEMU optimize for zeropage migration to not
consume memory on the migration destination: when migrating a
"page full of zeroes", on the migration destination they check whether the
target memory is already zero (by reading the destination memory) and avoid
writing to the memory to not allocate memory: however, s390x will also
allocate memory here, implying that also on the migration destination, we
will end up allocating all previously-inflated memory part of the memory
balloon.

This is especially bad if actual memory overcommit was not desired, when
memory ballooning is used for dynamic VM memory resizing, setting aside
some memory during boot that can be added later on demand. Alternatives
like virtio-mem that would avoid this issue are not yet available on
s390x.

There could be ways to optimize some cases in user space: before reading
memory in an anonymous private mapping on the migration source, check via
/proc/self/pagemap if anything is already populated. Similarly check on
the migration destination before reading. While that would avoid
populating tables full of shared zeropages on all architectures, it's
harder to get right and performant, and requires user space changes.

Further, with posctopy live migration we must place a page, so there,
"avoid touching memory to avoid allocating memory" is not really
possible. (Note that a previously we would have falsely inserted
shared zeropages into processes using UFFDIO_ZEROPAGE where
mm_forbids_zeropage() would have actually forbidden it)

PV is currently incompatible with memory ballooning, and in the common
case, KVM guests don't make use of storage keys. Instead of zapping
zeropages when enabling storage keys / PV, that turned out to be
problematic in the past, let's do exactly the same we do with KSM pages:
trigger unsharing faults to replace the shared zeropages by proper
anonymous folios.

What about added latency when enabling storage kes? Having a lot of
zeropages in applicable environments (PV, legacy guests, unittests) is
unexpected. Further, KSM could today already unshare the zeropages
and unmerging KSM pages when enabling storage kets would unshare the
KSM-placed zeropages in the same way, resulting in the same latency.

[ agordeev: Fixed sparse and checkpatch complaints and error handling ]

Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes: fa41ba0d08 ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240411161441.910170-3-david@redhat.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-18 15:02:53 +02:00
Peter Xu
0a845e0f63 mm/treewide: replace pud_large() with pud_leaf()
pud_large() is always defined as pud_leaf().  Merge their usages.  Chose
pud_leaf() because pud_leaf() is a global API, while pud_large() is not.

Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:19 -08:00
Peter Xu
2f709f7bfd mm/treewide: replace pmd_large() with pmd_leaf()
pmd_large() is always defined as pmd_leaf().  Merge their usages.  Chose
pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not.

Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:19 -08:00
Qi Zheng
57b77b75ca s390: supplement for ptdesc conversion
After commit 6326c26c15 ("s390: convert various pgalloc functions to use
ptdescs"), there are still some positions that use page->{lru, index}
instead of ptdesc->{pt_list, pt_index}.  In order to make the use of
ptdesc->{pt_list, pt_index} clearer, it would be better to convert them as
well.

[zhengqi.arch@bytedance.com: fix build failure]
  Link: https://lkml.kernel.org/r/20240305072154.26168-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/04beaf3255056ffe131a5ea595736066c1e84756.1709541697.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:18 -08:00
Christian Borntraeger
fe752331d4 KVM: s390: vsie: fix race during shadow creation
Right now it is possible to see gmap->private being zero in
kvm_s390_vsie_gmap_notifier resulting in a crash.  This is due to the
fact that we add gmap->private == kvm after creation:

static int acquire_gmap_shadow(struct kvm_vcpu *vcpu,
                               struct vsie_page *vsie_page)
{
[...]
        gmap = gmap_shadow(vcpu->arch.gmap, asce, edat);
        if (IS_ERR(gmap))
                return PTR_ERR(gmap);
        gmap->private = vcpu->kvm;

Let children inherit the private field of the parent.

Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: a3508fbe9d ("KVM: s390: vsie: initial support for nested virtualization")
Cc: <stable@vger.kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20231220125317.4258-1-borntraeger@linux.ibm.com
2023-12-21 11:40:18 +01:00
Heiko Carstens
65d37f163a s390/cmma: move arch_set_page_dat() to header file
In order to be usable for early boot code move the simple
arch_set_page_dat() function to header file, and add its counter-part
arch_set_page_nodat(). Also change the parameters, and the function name
slightly.

This is required since there aren't any struct pages available in early
boot code, and renaming of functions is done to make sure that all users
are converted to the new API.

Instead of a pointer to a struct page a virtual address is passed, and
instead of an order the number of pages for which the page state needs be
set.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-11-05 22:34:58 +01:00
Heiko Carstens
1954da4a2b s390/mm: add missing arch_set_page_dat() call to gmap allocations
If the cmma no-dat feature is available all pages that are not used for
dynamic address translation are marked as "no-dat" with the ESSA
instruction. This information is visible to the hypervisor, so that the
hypervisor can optimize purging of guest TLB entries. This also means that
pages which are used for dynamic address translation must not be marked as
"no-dat", since the hypervisor may then incorrectly not purge guest TLB
entries.

Region, segment, and page tables allocated within the gmap code are
incorrectly marked as "no-dat", since an explicit call to
arch_set_page_dat() is missing, which would remove the "no-dat" mark.

In order to fix this add a new gmap_alloc_crst() function which should
be used to allocate region and segment tables, and which also calls
arch_set_page_dat().

Also add the arch_set_page_dat() call to page_table_alloc_pgste().

Cc: <stable@vger.kernel.org>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-10-25 15:08:11 +02:00
Suren Baghdasaryan
49b0638502 mm: enable page walking API to lock vmas during the walk
walk_page_range() and friends often operate under write-locked mmap_lock. 
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas.  Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.

The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks.  With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them.  The change ensures vmas are properly locked during such
walks.

A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration.  Without this change a concurrent page
can be faulted into the area and be left out of migration.

Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:07:20 -07:00
Claudio Imbrenda
c2fceb59bb KVM: s390: pv: fix index value of replaced ASCE
The index field of the struct page corresponding to a guest ASCE should
be 0. When replacing the ASCE in s390_replace_asce(), the index of the
new ASCE should also be set to 0.

Having the wrong index might lead to the wrong addresses being passed
around when notifying pte invalidations, and eventually to validity
intercepts (VM crash) if the prefix gets unmapped and the notifier gets
called with the wrong address.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Fixes: faa2f72cb3 ("KVM: s390: pv: leak the topmost page table when destroy fails")
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20230705111937.33472-3-imbrenda@linux.ibm.com>
2023-07-18 11:21:51 +02:00
Linus Torvalds
a452483508 s390 updates for 6.5 merge window part 2
- Fix virtual vs physical address confusion in vmem_add_range()
   and vmem_remove_range() functions.
 
 - Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h>
   throughout s390 code.
 
 - Make all PSW related defines also available for assembler files.
   Remove PSW_DEFAULT_KEY define from uapi for that.
 
 - When adding an undefined symbol the build still succeeds, but
   userspace crashes trying to execute VDSO, because the symbol
   is not resolved. Add undefined symbols check to prevent that.
 
 - Use kvmalloc_array() instead of kzalloc() for allocaton of 256k
   memory when executing s390 crypto adapter IOCTL.
 
 - Add -fPIE flag to prevent decompressor misaligned symbol build
   error with clang.
 
 - Use .balign instead of .align everywhere. This is a no-op for s390,
   but with this there no mix in using .align and .balign anymore.
 
 - Filter out -mno-pic-data-is-text-relative flag when compiling
   kernel to prevent VDSO build error.
 
 - Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS
   mask directly.
 
 - Do not retry administrative requests to some s390 crypto cards,
   since the firmware assumes replay attacks.
 
 - Remove most of the debug code, which is build in when kernel config
   option CONFIG_ZCRYPT_DEBUG is enabled.
 
 - Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch
   off the multiple devices support for the s390 zcrypt device driver.
 
 - With the conversion to generic entry machine checks are accounted
   to the current context instead of irq time. As result, the STCKF
   instruction at the beginning of the machine check handler and the
   lowcore member are no longer required, therefore remove it.
 
 - Fix various typos found with codespell.
 
 - Minor cleanups to CPU-measurement Counter and Sampling Facilities code.
 
 - Revert patch that removes VMEM_MAX_PHYS macro, since it causes
   a regression.
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZKakSBccYWdvcmRlZXZA
 bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8HeIAQCg9RX3/olsZhCqRNLZ/O+6FXAF
 29ohi2JmVqxJBKkmwgEA/QXCjoTOp41pQJ1FD39HnI8DeYpJFRnYYE5D3acibAw=
 =2Ykk
 -----END PGP SIGNATURE-----

Merge tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Alexander Gordeev:

 - Fix virtual vs physical address confusion in vmem_add_range() and
   vmem_remove_range() functions

 - Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h>
   throughout s390 code

 - Make all PSW related defines also available for assembler files.
   Remove PSW_DEFAULT_KEY define from uapi for that

 - When adding an undefined symbol the build still succeeds, but
   userspace crashes trying to execute VDSO, because the symbol is not
   resolved. Add undefined symbols check to prevent that

 - Use kvmalloc_array() instead of kzalloc() for allocaton of 256k
   memory when executing s390 crypto adapter IOCTL

 - Add -fPIE flag to prevent decompressor misaligned symbol build error
   with clang

 - Use .balign instead of .align everywhere. This is a no-op for s390,
   but with this there no mix in using .align and .balign anymore

 - Filter out -mno-pic-data-is-text-relative flag when compiling kernel
   to prevent VDSO build error

 - Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS
   mask directly

 - Do not retry administrative requests to some s390 crypto cards, since
   the firmware assumes replay attacks

 - Remove most of the debug code, which is build in when kernel config
   option CONFIG_ZCRYPT_DEBUG is enabled

 - Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch
   off the multiple devices support for the s390 zcrypt device driver

 - With the conversion to generic entry machine checks are accounted to
   the current context instead of irq time. As result, the STCKF
   instruction at the beginning of the machine check handler and the
   lowcore member are no longer required, therefore remove it

 - Fix various typos found with codespell

 - Minor cleanups to CPU-measurement Counter and Sampling Facilities
   code

 - Revert patch that removes VMEM_MAX_PHYS macro, since it causes a
   regression

* tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (25 commits)
  Revert "s390/mm: get rid of VMEM_MAX_PHYS macro"
  s390/cpum_sf: remove check on CPU being online
  s390/cpum_sf: handle casts consistently
  s390/cpum_sf: remove unnecessary debug statement
  s390/cpum_sf: remove parameter in call to pr_err
  s390/cpum_sf: simplify function setup_pmu_cpu
  s390/cpum_cf: remove unneeded debug statements
  s390/entry: remove mcck clock
  s390: fix various typos
  s390/zcrypt: remove ZCRYPT_MULTIDEVNODES kernel config option
  s390/zcrypt: do not retry administrative requests
  s390/zcrypt: cleanup some debug code
  s390/entry: rework entering DAT-on mode on CPU restart
  s390/mm: fence off VM macros from asm and linker
  s390: include linux/io.h instead of asm/io.h
  s390/ptrace: make all psw related defines also available for asm
  s390/ptrace: remove PSW_DEFAULT_KEY from uapi
  s390/vdso: filter out mno-pic-data-is-text-relative cflag
  s390: consistently use .balign instead of .align
  s390/decompressor: fix misaligned symbol build error
  ...
2023-07-06 13:18:30 -07:00
Heiko Carstens
cada938a01 s390: fix various typos
Fix various typos found with codespell.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-07-03 11:19:42 +02:00
Hugh Dickins
b2f58941ad s390: gmap use pte_unmap_unlock() not spin_unlock()
pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
and use pte_unmap_unlock() instead of direct spin_unlock() (even though
ptep ends up unused inside the macro).

Link: https://lkml.kernel.org/r/78873af-e1ec-4f9-47ac-483940ac6daa@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:09 -07:00
Hugh Dickins
5c7f3bf04a s390: allow pte_offset_map_lock() to fail
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Add comment on mm's contract with s390 above __zap_zero_pages(),
and fix old comment there: must be called after THP was disabled.

Link: https://lkml.kernel.org/r/3ff29363-336a-9733-12a1-5c31a45c8aeb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:09 -07:00
Linus Torvalds
493804a689 RISC-V:
- ONE_REG interface to enable/disable SBI extensions
 - Zbb extension for Guest/VM
 - AIA CSR virtualization
 
 x86:
 - Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU can
   result in the root being freed even though the root is completely valid
   and can be reused as-is (with a TLB flush).
 
 s390:
 - A couple bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRU198UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOeiwf+OdIcGxOlt9D/qNys/U+2kkhZK6WU
 1Jrwn+S5onea2LGyBucHk3V6O1rWck8aOqfRAmFBQc5s/gu8EySW5Ic17kqgqLwc
 MSr2c70qw3/6U2LoTeVHx21V4Jgf9iPOZu1LAXf26F0jSCAoWlACO9OQd0rJoDmA
 cpUtHSAFKxuPA/8Ix+SH4ddzKhCYEs4QejMLmQIgBWyaGShdi8dvaMoKyoceHN2w
 ntrhSaz+BbCCZKaFwtNCHVaigyioqmwTHjsmMFjzEN3N1ly1XZfN2E7vvT4j3v7l
 o55eMpJnu7+TVs5LEMvj25nuxgts6UzJnoALJAO2cDsHW5F3FvjO9HGjTg==
 =NHEm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:
 "This includes the 6.4 changes for RISC-V, and a few bugfix patches for
  other architectures. For x86, this closes a longstanding performance
  issue in the newer and (usually) more scalable page table management
  code.

  RISC-V:
   - ONE_REG interface to enable/disable SBI extensions
   - Zbb extension for Guest/VM
   - AIA CSR virtualization

  x86:
   - Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU
     can result in the root being freed even though the root is
     completely valid and can be reused as-is (with a TLB flush).

  s390:
   - A couple of bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: s390: fix race in gmap_make_secure()
  KVM: s390: pv: fix asynchronous teardown for small VMs
  KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
  RISC-V: KVM: Virtualize per-HART AIA CSRs
  RISC-V: KVM: Use bitmap for irqs_pending and irqs_pending_mask
  RISC-V: KVM: Add ONE_REG interface for AIA CSRs
  RISC-V: KVM: Implement subtype for CSR ONE_REG interface
  RISC-V: KVM: Initial skeletal support for AIA
  RISC-V: KVM: Drop the _MASK suffix from hgatp.VMID mask defines
  RISC-V: Detect AIA CSRs from ISA string
  RISC-V: Add AIA related CSR defines
  RISC-V: KVM: Allow Zbb extension for Guest/VM
  RISC-V: KVM: Add ONE_REG interface to enable/disable SBI extensions
  RISC-V: KVM: Alphabetize selects
  KVM: RISC-V: Retry fault if vma_lookup() results become invalid
2023-05-05 12:17:01 -07:00
Paolo Bonzini
7a8016d956 For 6.4
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEoWuZBM6M3lCBSfTnuARItAMU6BMFAmRT4cIACgkQuARItAMU
 6BPZ6xAA0T0AhPUuc4JyKH4p2ZFaH61+1YpGHlxxCyIF1gm4JMHS0i1CRii+o2Zh
 k3w76MafGau9ADXwrjJAh+C+WihwwgTtSyN2/amOn70K75iK7GGbT125c6VX6A1I
 6L6lFEO/3JURrhQE4TthgH6HUVyXBC4XobQqajWtxGgcq/QRkjtgdbLPPv9HuH23
 NgXZ85+lpPrak9U0Zzu2ez1O9VIlABWbTE6B6DgcyQshwqBZDOSFRTxxJ883XtqH
 7sRMmVJBkiF3HRqvDAmxI0eUOrR8YNtl1PN441iadmvfqG4IDbPtuJ0UtiyNagXT
 g7UYGvv7Qj2z4y2QhQSgLiIC0Zrl/T6Vw/7oXt0CVybhEMuAAeshWzHIl5aR5aKt
 7sY2ijUX290zwuACft9ZjDspUWyOPkvqym5UrldTj1ZaYV9w7iHNezFbXTnXOBh9
 vEiet5p3SCeQNymFKXy2R+8YJ6IIsH3Mxf6SYo2CQFNNjOzfUHwKMix/uArFdXF5
 CQD6jM9NXTiPIjwpXIJjyutJ6USWZhsMCneHtItzfWTa1YnBSZTwIWU1fB05Snk/
 BcDVfJ9Pq142N983SNXaBYIGzxpYPUHYZQmKvAzbeLHcaxrgLJM9/x8zcXIxc8Wa
 sEFoo5W8DCIJRqjk7dsLyLEvPJdoBZidg6sd0MwHRGX0kDRzHEo=
 =PPnV
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-next-6.4-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

For 6.4
2023-05-05 06:15:09 -04:00
Claudio Imbrenda
292a7d6fca KVM: s390: pv: fix asynchronous teardown for small VMs
On machines without the Destroy Secure Configuration Fast UVC, the
topmost level of page tables is set aside and freed asynchronously
as last step of the asynchronous teardown.

Each gmap has a host_to_guest radix tree mapping host (userspace)
addresses (with 1M granularity) to gmap segment table entries (pmds).

If a guest is smaller than 2GB, the topmost level of page tables is the
segment table (i.e. there are only 2 levels). Replacing it means that
the pointers in the host_to_guest mapping would become stale and cause
all kinds of nasty issues.

This patch fixes the issue by disallowing asynchronous teardown for
guests with only 2 levels of page tables. Userspace should (and already
does) try using the normal destroy if the asynchronous one fails.

Update s390_replace_asce so it refuses to replace segment type ASCEs.
This is still needed in case the normal destroy VM fails.

Fixes: fb491d5500 ("KVM: s390: pv: asynchronous destroy for reboot")
Reviewed-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-Id: <20230421085036.52511-2-imbrenda@linux.ibm.com>
2023-05-04 18:24:41 +02:00
David Hildenbrand
2c281f54f5 mm/ksm: move disabling KSM from s390/gmap code to KSM code
Let's factor out actual disabling of KSM.  The existing "mm->def_flags &=
~VM_MERGEABLE;" was essentially a NOP and can be dropped, because
def_flags should never include VM_MERGEABLE.  Note that we don't currently
prevent re-enabling KSM.

This should now be faster in case KSM was never enabled, because we only
conditionally iterate all VMAs.  Further, it certainly looks cleaner.

Link: https://lkml.kernel.org/r/20230422210156.33630-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Stefan Roesch <shr@devkernel.io>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-02 17:21:50 -07:00
Stefan Roesch
d7597f59d1 mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.

So far KSM can only be enabled by calling madvise for memory regions.  To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

Use case 1:
  The madvise call is not available in the programming language.  An
  example for this are programs with forked workloads using a garbage
  collected language without pointers.  In such a language madvise cannot
  be made available.

  In addition the addresses of objects get moved around as they are
  garbage collected.  KSM sharing needs to be enabled "from the outside"
  for these type of workloads.

Use case 2:
  The same interpreter can also be used for workloads where KSM brings
  no benefit or even has overhead.  We'd like to be able to enable KSM on
  a workload by workload basis.

Use case 3:
  With the madvise call sharing opportunities are only enabled for the
  current process: it is a workload-local decision.  A considerable number
  of sharing opportunities may exist across multiple workloads or jobs (if
  they are part of the same security domain).  Only a higler level entity
  like a job scheduler or container can know for certain if its running
  one or more instances of a job.  That job scheduler however doesn't have
  the necessary internal workload knowledge to make targeted madvise
  calls.

Security concerns:

  In previous discussions security concerns have been brought up.  The
  problem is that an individual workload does not have the knowledge about
  what else is running on a machine.  Therefore it has to be very
  conservative in what memory areas can be shared or not.  However, if the
  system is dedicated to running multiple jobs within the same security
  domain, its the job scheduler that has the knowledge that sharing can be
  safely enabled and is even desirable.

Performance:

  Experiments with using UKSM have shown a capacity increase of around 20%.

  Here are the metrics from an instagram workload (taken from a machine
  with 64GB main memory):

   full_scans: 445
   general_profit: 20158298048
   max_page_sharing: 256
   merge_across_nodes: 1
   pages_shared: 129547
   pages_sharing: 5119146
   pages_to_scan: 4000
   pages_unshared: 1760924
   pages_volatile: 10761341
   run: 1
   sleep_millisecs: 20
   stable_node_chains: 167
   stable_node_chains_prune_millisecs: 2000
   stable_node_dups: 2751
   use_zero_pages: 0
   zero_pages_sharing: 0

After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.


Detailed changes:

1. New options for prctl system command
   This patch series adds two new options to the prctl system call. 
   The first one allows to enable KSM at the process level and the second
   one to query the setting.

The setting will be inherited by child processes.

With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate
   over all the VMA's and enable KSM for the eligible VMA's.

   When forking a process that has KSM enabled, the setting will be
   inherited by the new child process.

3. Add general_profit metric
   The general_profit metric of KSM is specified in the documentation,
   but not calculated.  This adds the general profit metric to
   /sys/kernel/debug/mm/ksm.

4. Add more metrics to ksm_stat
   This adds the process profit metric to /proc/<pid>/ksm_stat.

5. Add more tests to ksm_tests and ksm_functional_tests
   This adds an option to specify the merge type to the ksm_tests. 
   This allows to test madvise and prctl KSM.

   It also adds a two new tests to ksm_functional_tests: one to test
   the new prctl options and the other one is a fork test to verify that
   the KSM process setting is inherited by client processes.


This patch (of 3):

So far KSM can only be enabled by calling madvise for memory regions.  To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

1. New options for prctl system command

   This patch series adds two new options to the prctl system call.
   The first one allows to enable KSM at the process level and the second
   one to query the setting.

   The setting will be inherited by child processes.

   With the above setting, KSM can be enabled for the seed process of a
   cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing

   When KSM is enabled at the process level, the KSM code will iterate
   over all the VMA's and enable KSM for the eligible VMA's.

   When forking a process that has KSM enabled, the setting will be
   inherited by the new child process.

  1) Introduce new MMF_VM_MERGE_ANY flag

     This introduces the new flag MMF_VM_MERGE_ANY flag.  When this flag
     is set, kernel samepage merging (ksm) gets enabled for all vma's of a
     process.

  2) Setting VM_MERGEABLE on VMA creation

     When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
     VM_MERGEABLE flag will be set for this VMA.

  3) support disabling of ksm for a process

     This adds the ability to disable ksm for a process if ksm has been
     enabled for the process with prctl.

  4) add new prctl option to get and set ksm for a process

     This adds two new options to the prctl system call
     - enable ksm for all vmas of a process (if the vmas support it).
     - query if ksm has been enabled for a process.

3. Disabling MMF_VM_MERGE_ANY for storage keys in s390

   In the s390 architecture when storage keys are used, the
   MMF_VM_MERGE_ANY will be disabled.

Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:03 -07:00
Suren Baghdasaryan
ff126c0ece mm: replace vma->vm_flags indirect modification in ksm_madvise
Replace indirect modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

Link: https://lkml.kernel.org/r/20230126193752.297968-6-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00
Suren Baghdasaryan
1c71222e5f mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00
Mike Kravetz
e9adcfecf5 mm: remove zap_page_range and create zap_vma_pages
zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas.  While working on [1], it was
discovered that all callers of zap_page_range pass a range entirely within
a single vma.  In addition, the mmu notification call within zap_page
range does not correctly handle ranges that span multiple vmas.  When
crossing a vma boundary, a new mmu_notifier_range_init/end call pair with
the new vma should be made.

Instead of fixing zap_page_range, do the following:
- Create a new routine zap_vma_pages() that will remove all pages within
  the passed vma.  Most users of zap_page_range pass the entire vma and
  can use this new routine.
- For callers of zap_page_range not passing the entire vma, instead call
  zap_page_range_single().
- Remove zap_page_range.

[1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>	[s390]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:55 -08:00
Linus Torvalds
8fa590bf34 ARM64:
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
   option to keep the good old dirty log around for pages that are
   dirtied by something other than a vcpu.
 
 * Switch to the relaxed parallel fault handling, using RCU to delay
   page table reclaim and giving better performance under load.
 
 * Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
   which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a9:
   "Fix a number of issues with MTE, such as races on the tags being
   initialised vs the PG_mte_tagged flag as well as the lack of support
   for VM_SHARED when KVM is involved.  Patches from Catalin Marinas and
   Peter Collingbourne").
 
 * Merge the pKVM shadow vcpu state tracking that allows the hypervisor
   to have its own view of a vcpu, keeping that state private.
 
 * Add support for the PMUv3p5 architecture revision, bringing support
   for 64bit counters on systems that support it, and fix the
   no-quite-compliant CHAIN-ed counter support for the machines that
   actually exist out there.
 
 * Fix a handful of minor issues around 52bit VA/PA support (64kB pages
   only) as a prefix of the oncoming support for 4kB and 16kB pages.
 
 * Pick a small set of documentation and spelling fixes, because no
   good merge window would be complete without those.
 
 s390:
 
 * Second batch of the lazy destroy patches
 
 * First batch of KVM changes for kernel virtual != physical address support
 
 * Removal of a unused function
 
 x86:
 
 * Allow compiling out SMM support
 
 * Cleanup and documentation of SMM state save area format
 
 * Preserve interrupt shadow in SMM state save area
 
 * Respond to generic signals during slow page faults
 
 * Fixes and optimizations for the non-executable huge page errata fix.
 
 * Reprogram all performance counters on PMU filter change
 
 * Cleanups to Hyper-V emulation and tests
 
 * Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
   running on top of a L1 Hyper-V hypervisor)
 
 * Advertise several new Intel features
 
 * x86 Xen-for-KVM:
 
 ** Allow the Xen runstate information to cross a page boundary
 
 ** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
 
 ** Add support for 32-bit guests in SCHEDOP_poll
 
 * Notable x86 fixes and cleanups:
 
 ** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
 
 ** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
    years back when eliminating unnecessary barriers when switching between
    vmcs01 and vmcs02.
 
 ** Clean up vmread_error_trampoline() to make it more obvious that params
    must be passed on the stack, even for x86-64.
 
 ** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
    of the current guest CPUID.
 
 ** Fudge around a race with TSC refinement that results in KVM incorrectly
    thinking a guest needs TSC scaling when running on a CPU with a
    constant TSC, but no hardware-enumerated TSC frequency.
 
 ** Advertise (on AMD) that the SMM_CTL MSR is not supported
 
 ** Remove unnecessary exports
 
 Generic:
 
 * Support for responding to signals during page faults; introduces
   new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
 
 Selftests:
 
 * Fix an inverted check in the access tracking perf test, and restore
   support for asserting that there aren't too many idle pages when
   running on bare metal.
 
 * Fix build errors that occur in certain setups (unsure exactly what is
   unique about the problematic setup) due to glibc overriding
   static_assert() to a variant that requires a custom message.
 
 * Introduce actual atomics for clear/set_bit() in selftests
 
 * Add support for pinning vCPUs in dirty_log_perf_test.
 
 * Rename the so called "perf_util" framework to "memstress".
 
 * Add a lightweight psuedo RNG for guest use, and use it to randomize
   the access pattern and write vs. read percentage in the memstress tests.
 
 * Add a common ucall implementation; code dedup and pre-work for running
   SEV (and beyond) guests in selftests.
 
 * Provide a common constructor and arch hook, which will eventually be
   used by x86 to automatically select the right hypercall (AMD vs. Intel).
 
 * A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
   breakpoints, stage-2 faults and access tracking.
 
 * x86-specific selftest changes:
 
 ** Clean up x86's page table management.
 
 ** Clean up and enhance the "smaller maxphyaddr" test, and add a related
    test to cover generic emulation failure.
 
 ** Clean up the nEPT support checks.
 
 ** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
 
 ** Fix an ordering issue in the AMX test introduced by recent conversions
    to use kvm_cpu_has(), and harden the code to guard against similar bugs
    in the future.  Anything that tiggers caching of KVM's supported CPUID,
    kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
    the caching occurs before the test opts in via prctl().
 
 Documentation:
 
 * Remove deleted ioctls from documentation
 
 * Clean up the docs for the x86 MSR filter.
 
 * Various fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
 KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
 mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
 yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
 Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
 MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
 =QAYX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM64:

   - Enable the per-vcpu dirty-ring tracking mechanism, together with an
     option to keep the good old dirty log around for pages that are
     dirtied by something other than a vcpu.

   - Switch to the relaxed parallel fault handling, using RCU to delay
     page table reclaim and giving better performance under load.

   - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
     option, which multi-process VMMs such as crosvm rely on (see merge
     commit 382b5b87a9: "Fix a number of issues with MTE, such as
     races on the tags being initialised vs the PG_mte_tagged flag as
     well as the lack of support for VM_SHARED when KVM is involved.
     Patches from Catalin Marinas and Peter Collingbourne").

   - Merge the pKVM shadow vcpu state tracking that allows the
     hypervisor to have its own view of a vcpu, keeping that state
     private.

   - Add support for the PMUv3p5 architecture revision, bringing support
     for 64bit counters on systems that support it, and fix the
     no-quite-compliant CHAIN-ed counter support for the machines that
     actually exist out there.

   - Fix a handful of minor issues around 52bit VA/PA support (64kB
     pages only) as a prefix of the oncoming support for 4kB and 16kB
     pages.

   - Pick a small set of documentation and spelling fixes, because no
     good merge window would be complete without those.

  s390:

   - Second batch of the lazy destroy patches

   - First batch of KVM changes for kernel virtual != physical address
     support

   - Removal of a unused function

  x86:

   - Allow compiling out SMM support

   - Cleanup and documentation of SMM state save area format

   - Preserve interrupt shadow in SMM state save area

   - Respond to generic signals during slow page faults

   - Fixes and optimizations for the non-executable huge page errata
     fix.

   - Reprogram all performance counters on PMU filter change

   - Cleanups to Hyper-V emulation and tests

   - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
     guest running on top of a L1 Hyper-V hypervisor)

   - Advertise several new Intel features

   - x86 Xen-for-KVM:

      - Allow the Xen runstate information to cross a page boundary

      - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

      - Add support for 32-bit guests in SCHEDOP_poll

   - Notable x86 fixes and cleanups:

      - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

      - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
        a few years back when eliminating unnecessary barriers when
        switching between vmcs01 and vmcs02.

      - Clean up vmread_error_trampoline() to make it more obvious that
        params must be passed on the stack, even for x86-64.

      - Let userspace set all supported bits in MSR_IA32_FEAT_CTL
        irrespective of the current guest CPUID.

      - Fudge around a race with TSC refinement that results in KVM
        incorrectly thinking a guest needs TSC scaling when running on a
        CPU with a constant TSC, but no hardware-enumerated TSC
        frequency.

      - Advertise (on AMD) that the SMM_CTL MSR is not supported

      - Remove unnecessary exports

  Generic:

   - Support for responding to signals during page faults; introduces
     new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks

  Selftests:

   - Fix an inverted check in the access tracking perf test, and restore
     support for asserting that there aren't too many idle pages when
     running on bare metal.

   - Fix build errors that occur in certain setups (unsure exactly what
     is unique about the problematic setup) due to glibc overriding
     static_assert() to a variant that requires a custom message.

   - Introduce actual atomics for clear/set_bit() in selftests

   - Add support for pinning vCPUs in dirty_log_perf_test.

   - Rename the so called "perf_util" framework to "memstress".

   - Add a lightweight psuedo RNG for guest use, and use it to randomize
     the access pattern and write vs. read percentage in the memstress
     tests.

   - Add a common ucall implementation; code dedup and pre-work for
     running SEV (and beyond) guests in selftests.

   - Provide a common constructor and arch hook, which will eventually
     be used by x86 to automatically select the right hypercall (AMD vs.
     Intel).

   - A bunch of added/enabled/fixed selftests for ARM64, covering
     memslots, breakpoints, stage-2 faults and access tracking.

   - x86-specific selftest changes:

      - Clean up x86's page table management.

      - Clean up and enhance the "smaller maxphyaddr" test, and add a
        related test to cover generic emulation failure.

      - Clean up the nEPT support checks.

      - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.

      - Fix an ordering issue in the AMX test introduced by recent
        conversions to use kvm_cpu_has(), and harden the code to guard
        against similar bugs in the future. Anything that tiggers
        caching of KVM's supported CPUID, kvm_cpu_has() in this case,
        effectively hides opt-in XSAVE features if the caching occurs
        before the test opts in via prctl().

  Documentation:

   - Remove deleted ioctls from documentation

   - Clean up the docs for the x86 MSR filter.

   - Various fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
  KVM: x86: Add proper ReST tables for userspace MSR exits/flags
  KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
  KVM: arm64: selftests: Align VA space allocator with TTBR0
  KVM: arm64: Fix benign bug with incorrect use of VA_BITS
  KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
  KVM: x86: Advertise that the SMM_CTL MSR is not supported
  KVM: x86: remove unnecessary exports
  KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
  tools: KVM: selftests: Convert clear/set_bit() to actual atomics
  tools: Drop "atomic_" prefix from atomic test_and_set_bit()
  tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
  KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
  perf tools: Use dedicated non-atomic clear/set bit helpers
  tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
  KVM: arm64: selftests: Enable single-step without a "full" ucall()
  KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
  KVM: Remove stale comment about KVM_REQ_UNHALT
  KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
  KVM: Reference to kvm_userspace_memory_region in doc and comments
  KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
  ...
2022-12-15 11:12:21 -08:00
Anshuman Khandual
7e25de77bc s390/mm: use pmd_pgtable_page() helper in __gmap_segment_gaddr()
In __gmap_segment_gaddr() pmd level page table page is being extracted
from the pmd pointer, similar to pmd_pgtable_page() implementation.  This
reduces some redundancy by directly using pmd_pgtable_page() instead,
though first making it available.

Link: https://lkml.kernel.org/r/20221125034502.1559986-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:59:07 -08:00
Nico Boehr
079f0c21ef s390/mm: gmap: sort out physical vs virtual pointers usage
Fix virtual vs physical address confusion (which currently are the same).

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20221020143159.294605-2-nrb@linux.ibm.com
Message-Id: <20221020143159.294605-2-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-10-26 14:27:41 +02:00
Matthew Wilcox (Oracle)
e7b6b990e5 s390: remove vma linked list walks
Use the VMA iterator instead.

Link: https://lkml.kernel.org/r/20220906194824.2110408-35-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:19 -07:00
Claudio Imbrenda
6f73517d0a KVM: s390: pv: refactor s390_reset_acc
Refactor s390_reset_acc so that it can be reused in upcoming patches.

We don't want to hold all the locks used in a walk_page_range for too
long, and the destroy page UVC does take some time to complete.
Therefore we quickly gather the pages to destroy, and then destroy them
without holding all the locks.

The new refactored function optionally allows to return early without
completing if a fatal signal is pending (and return and appropriate
error code). Two wrappers are provided to call the new function.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Link: https://lore.kernel.org/r/20220628135619.32410-5-imbrenda@linux.ibm.com
Message-Id: <20220628135619.32410-5-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-07-13 14:42:11 +00:00
Claudio Imbrenda
faa2f72cb3 KVM: s390: pv: leak the topmost page table when destroy fails
Each secure guest must have a unique ASCE (address space control
element); we must avoid that new guests use the same page for their
ASCE, to avoid errors.

Since the ASCE mostly consists of the address of the topmost page table
(plus some flags), we must not return that memory to the pool unless
the ASCE is no longer in use.

Only a successful Destroy Secure Configuration UVC will make the ASCE
reusable again.

If the Destroy Configuration UVC fails, the ASCE cannot be reused for a
secure guest (either for the ASCE or for other memory areas). To avoid
a collision, it must not be used again. This is a permanent error and
the page becomes in practice unusable, so we set it aside and leak it.
On failure we already leak other memory that belongs to the ultravisor
(i.e. the variable and base storage for a guest) and not leaking the
topmost page table was an oversight.

This error (and thus the leakage) should not happen unless the hardware
is broken or KVM has some unknown serious bug.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: 29b40f105e ("KVM: s390: protvirt: Add initial vm and cpu lifecycle handling")
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20220628135619.32410-2-imbrenda@linux.ibm.com
Message-Id: <20220628135619.32410-2-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-07-13 14:42:11 +00:00
Linus Torvalds
4ab6cfc4ad more s390 updates for 5.19 merge window
- Add Eric Farman as maintainer for s390 virtio drivers.
 
 - Improve machine check handling, and avoid incorrectly injecting a machine
   check into a kvm guest.
 
 - Add cond_resched() call to gmap page table walker in order to avoid
   possible huge latencies. Also use non-quiesing sske instruction to speed
   up storage key handling.
 
 - Add __GFP_NORETRY to KEXEC_CONTROL_MEMORY_GFP so s390 behaves similar like
   common code.
 
 - Get sie control block address from correct stack slot in perf event
   code. This fixes potential random memory accesses.
 
 - Change uaccess code so that the exception handler sets the result of
   get_user() and __get_kernel_nofault() to zero in case of a fault. Until
   now this was done via input parameters for inline assemblies. Doing it
   via fault handling is what most or even all other architectures are
   doing.
 
 - Couple of other small cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmKZ2IMACgkQIg7DeRsp
 bsIXgg/+JeKOFIfwWWZuk2/2drITjQxWL6OwAV/vHVL9bYVGGRgFHTfF7Qslr2Hb
 j/kh55hajjMt00bVS6P52I7cC+xB6PcxPK4jF+u2FXcL187QnWZg7VD7qkFO3E7Q
 G5jiWub0eNfd3ijytzSO1yLwv3Rh6GEIOi7lRkk88Fe2B9l0AaXbvnr7rrIoG1SS
 TCPUoCCNKEH+xPmujdN5B6CDK2ldukcPZHtAJ9Qxu6DWAWIxh+hHr/c1zW9/7kDj
 Vogc3gcgApeXTsMZu38c2tFiv6wxvg37cMa0EW+l5zkyeFn+a1CLSxA/qPi0N1UY
 pFcxrlWXWshr7vKn16lJoCe1nBVyfb9ohToLKQtnWB7RZ86lP79tVjOpiQ4qi+jl
 54yx/rdtycEDaC0w4ab5kfZPc9/EeaY4ppaOLjh4+r/Ve3x+EMcAZ9Q2Ei3ltRBy
 75kswnRvHDWMbtS8rNecdk29QNRvLpnzpGLWQ4wGCV7V9FzTtJwO5mdjuibhSoI8
 9dZw4/HGMeA2P1pAE/d1YqKbdgqlNzDyU1dYpk6PVBI0I9mUu36eZXQnjrBN++ki
 8TWQYkBv6vnUHJmkfEK7B/thYwljzlQJSje4ebj0aPWzBJ9An+U5ozoANlAu97MR
 /EVZM67snrT6aSOjhF6SNjWqMGBOlMVecF+GCsVrFEeah+qrYq4=
 =BOi4
 -----END PGP SIGNATURE-----

Merge tag 's390-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Heiko Carstens:
 "Just a couple of small improvements, bug fixes and cleanups:

   - Add Eric Farman as maintainer for s390 virtio drivers.

   - Improve machine check handling, and avoid incorrectly injecting a
     machine check into a kvm guest.

   - Add cond_resched() call to gmap page table walker in order to avoid
     possible huge latencies. Also use non-quiesing sske instruction to
     speed up storage key handling.

   - Add __GFP_NORETRY to KEXEC_CONTROL_MEMORY_GFP so s390 behaves
     similar like common code.

   - Get sie control block address from correct stack slot in perf event
     code. This fixes potential random memory accesses.

   - Change uaccess code so that the exception handler sets the result
     of get_user() and __get_kernel_nofault() to zero in case of a
     fault. Until now this was done via input parameters for inline
     assemblies. Doing it via fault handling is what most or even all
     other architectures are doing.

   - Couple of other small cleanups and fixes"

* tag 's390-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/stack: add union to reflect kvm stack slot usages
  s390/stack: merge empty stack frame slots
  s390/uaccess: whitespace cleanup
  s390/uaccess: use __noreturn instead of __attribute__((noreturn))
  s390/uaccess: use exception handler to zero result on get_user() failure
  s390/uaccess: use symbolic names for inline assembler operands
  s390/mcck: isolate SIE instruction when setting CIF_MCCK_GUEST flag
  s390/mm: use non-quiescing sske for KVM switch to keyed guest
  s390/gmap: voluntarily schedule during key setting
  MAINTAINERS: Update s390 virtio-ccw
  s390/kexec: add __GFP_NORETRY to KEXEC_CONTROL_MEMORY_GFP
  s390/Kconfig.debug: fix indentation
  s390/Kconfig: fix indentation
  s390/perf: obtain sie_block from the right address
  s390: generate register offsets into pt_regs automatically
  s390: simplify early program check handler
  s390/crypto: fix scatterwalk_unmap() callers in AES-GCM
2022-06-03 13:57:50 -07:00
Christian Borntraeger
6d5946274d s390/gmap: voluntarily schedule during key setting
With large and many guest with storage keys it is possible to create
large latencies or stalls during initial key setting:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu:   18-....: (2099 ticks this GP) idle=54e/1/0x4000000000000002 softirq=35598716/35598716 fqs=998
       (t=2100 jiffies g=155867385 q=20879)
Task dump for CPU 18:
CPU 1/KVM       R  running task        0 1030947 256019 0x06000004
Call Trace:
sched_show_task
rcu_dump_cpu_stacks
rcu_sched_clock_irq
update_process_times
tick_sched_handle
tick_sched_timer
__hrtimer_run_queues
hrtimer_interrupt
do_IRQ
ext_int_handler
ptep_zap_key

The mmap lock is held during the page walking but since this is a
semaphore scheduling is still possible. Same for the kvm srcu.
To minimize overhead do this on every segment table entry or large page.

Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20220530092706.11637-2-borntraeger@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2022-06-01 12:03:16 +02:00
Christian Borntraeger
a06afe8383 KVM: s390: vsie/gmap: reduce gmap_rmap overhead
there are cases that trigger a 2nd shadow event for the same
vmaddr/raddr combination. (prefix changes, reboots, some known races)
This will increase memory usages and it will result in long latencies
when cleaning up, e.g. on shutdown. To avoid cases with a list that has
hundreds of identical raddrs we check existing entries at insert time.
As this measurably reduces the list length this will be faster than
traversing the list at shutdown time.

In the long run several places will be optimized to create less entries
and a shrinker might be necessary.

Fixes: 4be130a084 ("s390/mm: add shadow gmap support")
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20220429151526.1560-1-borntraeger@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2022-05-03 15:15:44 +02:00
Vasily Gorbik
731efc9613 s390: convert ".insn" encoding to instruction names
With z10 as minimum supported machine generation many ".insn" encodings
could be now converted to instruction names. There are couple of exceptions
- stfle is used from the als code built for z900 and cannot be converted
- few ".insn" directives encode unsupported instruction formats

The generated code is identical before/after this change.

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2022-03-10 15:58:17 +01:00
Heiko Carstens
e1fc74ff23 s390/mm,gmap: don't use pte_val()/pXd_val() as lvalue
Convert pgtable code so pte_val()/pXd_val() aren't used as lvalue
anymore. This allows in later step to convert pte_val()/pXd_val() to
functions, which in turn makes it impossible to use these macros to
modify page table entries like they have been used before.

Therefore a construct like this:

        pte_val(*pte) = __pa(addr) | prot;

which would directly write into a page table, isn't possible anymore
with the last step of this series.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2022-03-01 21:05:10 +01:00
Heiko Carstens
b8e3b37900 s390/mm: use set_pXd()/set_pte() helper functions everywhere
Use the new set_pXd()/set_pte() helper functions at all places where
page table entries are modified.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2022-03-01 21:05:10 +01:00
Claudio Imbrenda
380d97bd02 KVM: s390: pv: properly handle page flags for protected guests
Introduce variants of the convert and destroy page functions that also
clear the PG_arch_1 bit used to mark them as secure pages.

The PG_arch_1 flag is always allowed to overindicate; using the new
functions introduced here allows to reduce the extent of overindication
and thus improve performance.

These new functions can only be called on pages for which a reference
is already being held.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: https://lore.kernel.org/r/20210920132502.36111-7-imbrenda@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2021-10-27 07:55:53 +02:00
David Hildenbrand
b159f94c86 s390/gmap: don't unconditionally call pte_unmap_unlock() in __gmap_zap()
... otherwise we will try unlocking a spinlock that was never locked via a
garbage pointer.

At the time we reach this code path, we usually successfully looked up
a PGSTE already; however, evil user space could have manipulated the VMA
layout in the meantime and triggered removal of the page table.

Fixes: 1e133ab296 ("s390/mm: split arch/s390/mm/pgtable.c")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20210909162248.14969-3-david@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2021-10-25 09:20:38 +02:00
David Hildenbrand
2d8fb8f391 s390/gmap: validate VMA in __gmap_zap()
We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f260 ("mm: mmap: zap pages
with read mmap_sem in munmap"). The pure prescence in our guest_to_host
radix tree does not imply that there is a VMA.

Further, we should not allocate page tables (via get_locked_pte()) outside
of VMA boundaries: if evil user space decides to map hugetlbfs to these
ranges, bad things will happen because we suddenly have PTE or PMD page
tables where we shouldn't have them.

Similarly, we have to check if we suddenly find a hugetlbfs VMA, before
calling get_locked_pte().

Note that gmap_discard() is different:
zap_page_range()->unmap_single_vma() makes sure to stay within VMA
boundaries.

Fixes: b31288fa83 ("s390/kvm: support collaborative memory management")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20210909162248.14969-2-david@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2021-10-25 09:20:38 +02:00
Heiko Carstens
2e8275285a s390/mm: fix kernel doc comments
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2021-09-07 13:38:41 +02:00
Linus Torvalds
6a447b0e31 ARM:
* PSCI relay at EL2 when "protected KVM" is enabled
 * New exception injection code
 * Simplification of AArch32 system register handling
 * Fix PMU accesses when no PMU is enabled
 * Expose CSV3 on non-Meltdown hosts
 * Cache hierarchy discovery fixes
 * PV steal-time cleanups
 * Allow function pointers at EL2
 * Various host EL2 entry cleanups
 * Simplification of the EL2 vector allocation
 
 s390:
 * memcg accouting for s390 specific parts of kvm and gmap
 * selftest for diag318
 * new kvm_stat for when async_pf falls back to sync
 
 x86:
 * Tracepoints for the new pagetable code from 5.10
 * Catch VFIO and KVM irqfd events before userspace
 * Reporting dirty pages to userspace with a ring buffer
 * SEV-ES host support
 * Nested VMX support for wait-for-SIPI activity state
 * New feature flag (AVX512 FP16)
 * New system ioctl to report Hyper-V-compatible paravirtualization features
 
 Generic:
 * Selftest improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/bdL4UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNgQQgAnTH6rhXa++Zd5F0EM2NwXwz3iEGb
 lOq1DZSGjs6Eekjn8AnrWbmVQr+CBCuGU9MrxpSSzNDK/awryo3NwepOWAZw9eqk
 BBCVwGBbJQx5YrdgkGC0pDq2sNzcpW/VVB3vFsmOxd9eHblnuKSIxEsCCXTtyqIt
 XrLpQ1UhvI4yu102fDNhuFw2EfpzXm+K0Lc0x6idSkdM/p7SyeOxiv8hD4aMr6+G
 bGUQuMl4edKZFOWFigzr8NovQAvDHZGrwfihu2cLRYKLhV97QuWVmafv/yYfXcz2
 drr+wQCDNzDOXyANnssmviazrhOX0QmTAhbIXGGX/kTxYKcfPi83ZLoI3A==
 =ISud
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Much x86 work was pushed out to 5.12, but ARM more than made up for it.

  ARM:
   - PSCI relay at EL2 when "protected KVM" is enabled
   - New exception injection code
   - Simplification of AArch32 system register handling
   - Fix PMU accesses when no PMU is enabled
   - Expose CSV3 on non-Meltdown hosts
   - Cache hierarchy discovery fixes
   - PV steal-time cleanups
   - Allow function pointers at EL2
   - Various host EL2 entry cleanups
   - Simplification of the EL2 vector allocation

  s390:
   - memcg accouting for s390 specific parts of kvm and gmap
   - selftest for diag318
   - new kvm_stat for when async_pf falls back to sync

  x86:
   - Tracepoints for the new pagetable code from 5.10
   - Catch VFIO and KVM irqfd events before userspace
   - Reporting dirty pages to userspace with a ring buffer
   - SEV-ES host support
   - Nested VMX support for wait-for-SIPI activity state
   - New feature flag (AVX512 FP16)
   - New system ioctl to report Hyper-V-compatible paravirtualization features

  Generic:
   - Selftest improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
  KVM: SVM: fix 32-bit compilation
  KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
  KVM: SVM: Provide support to launch and run an SEV-ES guest
  KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
  KVM: SVM: Provide support for SEV-ES vCPU loading
  KVM: SVM: Provide support for SEV-ES vCPU creation/loading
  KVM: SVM: Update ASID allocation to support SEV-ES guests
  KVM: SVM: Set the encryption mask for the SVM host save area
  KVM: SVM: Add NMI support for an SEV-ES guest
  KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
  KVM: SVM: Do not report support for SMM for an SEV-ES guest
  KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
  KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
  KVM: SVM: Add support for EFER write traps for an SEV-ES guest
  KVM: SVM: Support string IO operations for an SEV-ES guest
  KVM: SVM: Support MMIO for an SEV-ES guest
  KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
  KVM: SVM: Create trace events for VMGEXIT processing
  ...
2020-12-20 10:44:05 -08:00
Christian Borntraeger
0cd2a787cf s390/gmap: make gmap memcg aware
gmap allocations can be attributed to a process.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
2020-12-10 13:36:05 +01:00
Janosch Frank
1ed576a20c KVM: s390: pv: Mark mm as protected after the set secure parameters and improve cleanup
We can only have protected guest pages after a successful set secure
parameters call as only then the UV allows imports and unpacks.

By moving the test we can now also check for it in s390_reset_acc()
and do an early return if it is 0.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Fixes: 29b40f105e ("KVM: s390: protvirt: Add initial vm and cpu lifecycle handling")
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2020-11-11 09:31:48 +01:00
Janosch Frank
1a80b54d1c s390/uv: add destroy page call
We don't need to export pages if we destroy the VM configuration
afterwards anyway. Instead we can destroy the page which will zero it
and then make it accessible to the host.

Destroying is about twice as fast as the export.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/kvm/20200907124700.10374-2-frankja@linux.ibm.com/
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-09-14 11:38:35 +02:00
Linus Torvalds
990f227371 - Allow s390 debug feature to handle finally more than 256 CPU numbers, instead
of truncating the most significant bits.
 
 - Improve THP splitting required by qemu processes by making use of
   walk_page_vma() instead of calling follow_page() for every single page
   within each vma.
 
 - Add missing ZCRYPT dependency to VFIO_AP to fix potential compile problems.
 
 - Remove not required select CLOCKSOURCE_VALIDATE_LAST_CYCLE again.
 
 - Set node distance to LOCAL_DISTANCE instead of 0, since e.g. libnuma
   translates a node distance of 0 to "no NUMA support available".
 
 - Couple of other minor fixes and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAl81bRwACgkQIg7DeRsp
 bsIT3g/7BSIfbI852VTn6hWw+LfdoBLVqXDFcg9xQhUqccbPcG8LG/BYlmuvIn1Z
 X9XujP76xAX7LKnt9x0Z9znwpgGIj0oURmcBxzGBTk+yLBf7YXCOLu9w4MV1AD57
 WRoGG7pPFZy5rMDM29frqBNQBWc7g+/wGoPVqiffzqgj2DcS4Ka8TSTTboMIH7th
 mLK5xaU504FdDGwTgNv6Q/Pt3+YBZ6P/1cIuifzYnfjJ2CsjSBLjS9IFP6/w/c1Q
 VQ2TfajuRbAgkkY01o15tKrlqwPzzWYhnh9ix1LkL0WT6VNql+fcwN665HdsSDFu
 Ctu6Sk3BZmhN1GONK4gIx5didRxBi7neAfUMenSasPT8uMGJcR4EMoNUr70sXNA/
 B1naRayfCH0dE3SVZflAeFJRSh1LnikY14uj65Gg/Wtla5N+90zuHEGDttIBNzrr
 FDzc+39GWN1wJEl7FzSIm+YRC88C/So/BNUQhmlVYNY0sbqyAszUGBOAa+5VS2WQ
 tkWzWPjJ6QA3e3t/dpnc0dvlCKLTJG3o7bdtilb71QNybGiAniP8+79hCNa5nPT3
 iLRhmLX27cQF54IUw1whVA5mkfzHSzwAlB+/laPi8MWFQXTRLTD1XzhmjrIyR57Y
 Q6wM+lxC4qreDz7yGX+eQk7OEs8j8IKTF14HZxkp/DEL09Vbmbs=
 =E0hf
 -----END PGP SIGNATURE-----

Merge tag 's390-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Heiko Carstens:

 - Allow s390 debug feature to handle finally more than 256 CPU numbers,
   instead of truncating the most significant bits.

 - Improve THP splitting required by qemu processes by making use of
   walk_page_vma() instead of calling follow_page() for every single
   page within each vma.

 - Add missing ZCRYPT dependency to VFIO_AP to fix potential compile
   problems.

 - Remove not required select CLOCKSOURCE_VALIDATE_LAST_CYCLE again.

 - Set node distance to LOCAL_DISTANCE instead of 0, since e.g. libnuma
   translates a node distance of 0 to "no NUMA support available".

 - Couple of other minor fixes and improvements.

* tag 's390-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/numa: move code to arch/s390/kernel
  s390/time: remove select CLOCKSOURCE_VALIDATE_LAST_CYCLE again
  s390/debug: debug feature version 3
  s390/Kconfig: add missing ZCRYPT dependency to VFIO_AP
  s390/numa: set node distance to LOCAL_DISTANCE
  s390/pkey: remove redundant variable initialization
  s390/test_unwind: fix possible memleak in test_unwind()
  s390/gmap: improve THP splitting
  s390/atomic: circumvent gcc 10 build regression
2020-08-13 12:38:32 -07:00
Peter Xu
64019a2e46 mm/gup: remove task_struct pointer for all gup code
After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more.  Remove that parameter in the whole gup
stack.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:04 -07:00
Gerald Schaefer
ba925fa350 s390/gmap: improve THP splitting
During s390_enable_sie(), we need to take care of splitting all qemu user
process THP mappings. This is currently done with follow_page(FOLL_SPLIT),
by simply iterating over all vma ranges, with PAGE_SIZE increment.

This logic is sub-optimal and can result in a lot of unnecessary overhead,
especially when using qemu and ASAN with large shadow map. Ilya reported
significant system slow-down with one CPU busy for a long time and overall
unresponsiveness.

Fix this by using walk_page_vma() and directly calling split_huge_pmd()
only for present pmds, which greatly reduces overhead.

Cc: <stable@vger.kernel.org> # v5.4+
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2020-08-11 18:16:13 +02:00
Michel Lespinasse
c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Mike Rapoport
65fddcfca8 mm: reorder includes after introduction of linux/pgtable.h
The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include
of the latter in the middle of asm includes.  Fix this up with the aid of
the below script and manual adjustments here and there.

	import sys
	import re

	if len(sys.argv) is not 3:
	    print "USAGE: %s <file> <header>" % (sys.argv[0])
	    sys.exit(1)

	hdr_to_move="#include <linux/%s>" % sys.argv[2]
	moved = False
	in_hdrs = False

	with open(sys.argv[1], "r") as f:
	    lines = f.readlines()
	    for _line in lines:
		line = _line.rstrip('
')
		if line == hdr_to_move:
		    continue
		if line.startswith("#include <linux/"):
		    in_hdrs = True
		elif not moved and in_hdrs:
		    moved = True
		    print hdr_to_move
		print line

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Mike Rapoport
ca5999fde0 mm: introduce include/linux/pgtable.h
The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00