1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

17622 commits

Author SHA1 Message Date
Yazen Ghannam
91f75eb481 x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumeration
AMD systems currently lay out MCA bank types such that the type of bank
number "i" is either the same across all CPUs or is Reserved/Read-as-Zero.

For example:

  Bank # | CPUx | CPUy
    0      LS     LS
    1      RAZ    UMC
    2      CS     CS
    3      SMU    RAZ

Future AMD systems will lay out MCA bank types such that the type of
bank number "i" may be different across CPUs.

For example:

  Bank # | CPUx | CPUy
    0      LS     LS
    1      RAZ    UMC
    2      CS     NBIO
    3      SMU    RAZ

Change the structures that cache MCA bank types to be per-CPU and update
smca_get_bank_type() to handle this change.

Move some SMCA-specific structures to amd.c from mce.h, since they no
longer need to be global.

Break out the "count" for bank types from struct smca_hwid, since this
should provide a per-CPU count rather than a system-wide count.

Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The
values in this array should not change at runtime.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-3-yazen.ghannam@amd.com
2021-12-22 17:22:09 +01:00
Yazen Ghannam
5176a93ab2 x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types
Add HWID and McaType values for new SMCA bank types, and add their error
descriptions to edac_mce_amd.

The "PHY" bank types all have the same error descriptions, and the NBIF
and SHUB bank types have the same error descriptions. So reuse the same
arrays where appropriate.

  [ bp: Remove useless comments over hwid types. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-2-yazen.ghannam@amd.com
2021-12-22 17:19:18 +01:00
Borislav Petkov
b64dfcde1c x86/mm: Prevent early boot triple-faults with instrumentation
Commit in Fixes added a global TLB flush on the early boot path, after
the kernel switches off of the trampoline page table.

Compiler profiling options enabled with GCOV_PROFILE add additional
measurement code on clang which needs to be initialized prior to
use. The global flush in x86_64_start_kernel() happens before those
initializations can happen, leading to accessing invalid memory.
GCOV_PROFILE builds with gcc are still ok so this is clang-specific.

The second issue this fixes is with KASAN: for a similar reason,
kasan_early_init() needs to have happened before KASAN-instrumented
functions are called.

Therefore, reorder the flush to happen after the KASAN early init
and prevent the compilers from adding profiling instrumentation to
native_write_cr4().

Fixes: f154f29085 ("x86/mm/64: Flush global TLB on boot and AP bringup")
Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Carel Si <beibei.si@intel.com>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
Link: https://lore.kernel.org/r/20211209144141.GC25654@xsang-OptiPlex-9020
2021-12-22 11:51:20 +01:00
Al Viro
2610ed63ea um, x86: bury crypto_tfm_ctx_offset
unused since 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-21 21:31:35 +01:00
Tianyu Lan
062a5c4260 hyper-v: Enable swiotlb bounce buffer for Isolation VM
hyperv Isolation VM requires bounce buffer support to copy
data from/to encrypted memory and so enable swiotlb force
mode to use swiotlb bounce buffer for DMA transaction.

In Isolation VM with AMD SEV, the bounce buffer needs to be
accessed via extra address space which is above shared_gpa_boundary
(E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
The access physical address will be original physical address +
shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
spec is called virtual top of memory(vTOM). Memory addresses below
vTOM are automatically treated as private while memory above
vTOM is treated as shared.

Swiotlb bounce buffer code calls set_memory_decrypted()
to mark bounce buffer visible to host and map it in extra
address space via memremap. Populate the shared_gpa_boundary
(vTOM) via swiotlb_unencrypted_base variable.

The map function memremap() can't work in the early place
(e.g ms_hyperv_init_platform()) and so call swiotlb_update_mem_
attributes() in the hyperv_init().

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20211213071407.314309-4-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-12-20 18:01:09 +00:00
Tianyu Lan
c789b90a69 x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()
Hyper-V provides Isolation VM for confidential computing support and
guest memory is encrypted in it. Places checking cc_platform_has()
with GUEST_MEM_ENCRYPT attr should return "True" in Isolation VM.

Hyper-V Isolation VMs need to adjust the SWIOTLB size just like SEV
guests. Add a hyperv_cc_platform_has() variant which enables that.

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20211213071407.314309-3-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-12-20 18:01:09 +00:00
Borislav Petkov
1acd85feba x86/mce: Check regs before accessing it
Commit in Fixes accesses pt_regs before checking whether it is NULL or
not. Make sure the NULL pointer check happens first.

Fixes: 0a5b288e85 ("x86/mce: Prevent severity computation from being instrumented")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20211217102029.GA29708@kili
2021-12-20 11:41:02 +01:00
Dave Airlie
eacef9fd61 Merge tag 'drm-intel-next-2021-12-14' of ssh://git.freedesktop.org/git/drm/drm-intel into drm-next
drm/i915 feature pull #2 for v5.17:

Features and functionality:
- Add eDP privacy screen support (Hans)
- Add Raptor Lake S (RPL-S) support (Anusha)
- Add CD clock squashing support (Mika)
- Properly support ADL-P without force probe (Clint)
- Enable pipe color support (10 bit gamma) for display 13 platforms (Uma)
- Update ADL-P DMC firmware to v2.14 (Madhumitha)

Refactoring and cleanups:
- More FBC refactoring preparing for multiple FBC instances (Ville)
- Plane register cleanups (Ville)
- Header refactoring and include cleanups (Jani)
- Crtc helper and vblank wait function cleanups (Jani, Ville)
- Move pipe/transcoder/abox masks under intel_device_info.display (Ville)

Fixes:
- Add a delay to let eDP source OUI write take effect (Lyude)
- Use div32 version of MPLLB word clock for UHBR on SNPS PHY (Jani)
- Fix DMC firmware loader overflow check (Harshit Mogalapalli)
- Fully disable FBC on FIFO underruns (Ville)
- Disable FBC with double wide pipe as mutually exclusive (Ville)
- DG2 workarounds (Matt)
- Non-x86 build fixes (Siva)
- Fix HDR plane max width for NV12 (Vidya)
- Disable IRQ for selftest timestamp calculation (Anshuman)
- ADL-P VBT DDC pin mapping fix (Tejas)

Merges:
- Backmerge drm-next for privacy screen plumbing (Jani)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87ee6f5h9u.fsf@intel.com
2021-12-17 15:23:49 +10:00
Thomas Gleixner
b3f8236411 x86/apic/msi: Use PCI device MSI property
instead of fiddling with MSI descriptors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211210221813.372357371@linutronix.de
2021-12-16 22:16:37 +01:00
Mateusz Jończyk
0dd8d6cb9e rtc: Check return value from mc146818_get_time()
There are 4 users of mc146818_get_time() and none of them was checking
the return value from this function. Change this.

Print the appropriate warnings in callers of mc146818_get_time() instead
of in the function mc146818_get_time() itself, in order not to add
strings to rtc-mc146818-lib.c, which is kind of a library.

The callers of alpha_rtc_read_time() and cmos_read_time() may use the
contents of (struct rtc_time *) even when the functions return a failure
code. Therefore, set the contents of (struct rtc_time *) to 0x00,
which looks more sensible then 0xff and aligns with the (possibly
stale?) comment in cmos_read_time:

	/*
	 * If pm_trace abused the RTC for storage, set the timespec to 0,
	 * which tells the caller that this RTC value is unusable.
	 */

For consistency, do this in mc146818_get_time().

Note: hpet_rtc_interrupt() may call mc146818_get_time() many times a
second. It is very unlikely, though, that the RTC suddenly stops
working and mc146818_get_time() would consistently fail.

Only compile-tested on alpha.

Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: linux-alpha@vger.kernel.org
Cc: x86@kernel.org
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20211210200131.153887-4-mat.jonczyk@o2.pl
2021-12-16 21:50:06 +01:00
Mike Rapoport
2f5b3514c3 x86/boot: Move EFI range reservation after cmdline parsing
The memory reservation in arch/x86/platform/efi/efi.c depends on at
least two command line parameters. Put it back later in the boot process
and move efi_memblock_x86_reserve_range() out of early_memory_reserve().

An attempt to fix this was done in

  8d48bf8206 ("x86/boot: Pull up cmdline preparation and early param parsing")

but that caused other troubles so it got reverted.

The bug this is addressing is:

Dan reports that Anjaneya Chagam can no longer use the efi=nosoftreserve
kernel command line parameter to suppress "soft reservation" behavior.

This is due to the fact that the following call-chain happens at boot:

  early_reserve_memory
  |-> efi_memblock_x86_reserve_range
      |-> efi_fake_memmap_early

which does

        if (!efi_soft_reserve_enabled())
                return;

and that would have set EFI_MEM_NO_SOFT_RESERVE after having parsed
"nosoftreserve".

However, parse_early_param() gets called *after* it, leading to the boot
cmdline not being taken into account.

See also https://lore.kernel.org/r/e8dd8993c38702ee6dd73b3c11f158617e665607.camel@intel.com

  [ bp: Turn into a proper patch. ]

Signed-off-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211213112757.2612-4-bp@alien8.de
2021-12-15 14:07:54 +01:00
Borislav Petkov
fbe6183998 Revert "x86/boot: Pull up cmdline preparation and early param parsing"
This reverts commit 8d48bf8206.

It turned out to be a bad idea as it broke supplying mem= cmdline
parameters due to parse_memopt() requiring preparatory work like setting
up the e820 table in e820__memory_setup() in order to be able to exclude
the range specified by mem=.

Pulling that up would've broken Xen PV again, see threads at

  https://lkml.kernel.org/r/20210920120421.29276-1-jgross@suse.com

due to xen_memory_setup() needing the first reservations in
early_reserve_memory() - kernel and initrd - to have happened already.

This could be fixed again by having Xen do those reservations itself...

Long story short, revert this and do a simpler fix in a later patch.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211213112757.2612-3-bp@alien8.de
2021-12-15 11:38:57 +01:00
Borislav Petkov
58e138d624 Revert "x86/boot: Mark prepare_command_line() __init"
This reverts commit c0f2077baa.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211213112757.2612-2-bp@alien8.de
2021-12-15 11:14:28 +01:00
Thomas Gleixner
09eb3ad55f Merge branch 'irq/urgent' into irq/msi
to pick up the PCI/MSI-x fixes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2021-12-14 13:30:34 +01:00
Eric W. Biederman
0e25498f8c exit: Add and use make_task_dead.
There are two big uses of do_exit.  The first is it's design use to be
the guts of the exit(2) system call.  The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.

Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure.  In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.

Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.

As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-12-13 12:04:45 -06:00
Borislav Petkov
e3d72e8eee x86/mce: Mark mce_start() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x4ae: call to __const_udelay() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-13-bp@alien8.de
2021-12-13 14:14:05 +01:00
Borislav Petkov
edb3d07e24 x86/mce: Mark mce_timed_out() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x482: call to mce_timed_out() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-12-bp@alien8.de
2021-12-13 14:13:54 +01:00
Borislav Petkov
75581a203e x86/mce: Move the tainting outside of the noinstr region
add_taint() is yet another external facility which the #MC handler
calls. Move that tainting call into the instrumentation-allowed part of
the handler.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x617: call to add_taint() leaves .noinstr.text section

While at it, allow instrumentation around the mce_log() call.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x690: call to mce_log() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-11-bp@alien8.de
2021-12-13 14:13:35 +01:00
Borislav Petkov
db6c996d6c x86/mce: Mark mce_read_aux() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-10-bp@alien8.de
2021-12-13 14:13:23 +01:00
Borislav Petkov
b4813539d3 x86/mce: Mark mce_end() noinstr
It is called by the #MC handler which is noinstr.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-9-bp@alien8.de
2021-12-13 14:13:12 +01:00
Borislav Petkov
3c7ce80a81 x86/mce: Mark mce_panic() noinstr
And allow instrumentation inside it because it does calls to other
facilities which will not be tagged noinstr.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-8-bp@alien8.de
2021-12-13 14:13:01 +01:00
Borislav Petkov
0a5b288e85 x86/mce: Prevent severity computation from being instrumented
Mark all the MCE severity computation logic noinstr and allow
instrumentation when it "calls out".

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xc5d: call to mce_severity() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-7-bp@alien8.de
2021-12-13 14:12:48 +01:00
Borislav Petkov
4fbce464db x86/mce: Allow instrumentation during task work queueing
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-6-bp@alien8.de
2021-12-13 14:12:35 +01:00
Borislav Petkov
487d654db3 x86/mce: Remove noinstr annotation from mce_setup()
Instead, sandwitch around the call which is done in noinstr context and
mark the caller - mce_gather_info() - as noinstr.

Also, document what the whole instrumentation strategy with #MC is going
to be in the future and where it all is supposed to be going to.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-5-bp@alien8.de
2021-12-13 14:12:21 +01:00
Borislav Petkov
88f66a4235 x86/mce: Use mce_rdmsrl() in severity checking code
MCA has its own special MSR accessors. Use them.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-4-bp@alien8.de
2021-12-13 14:12:08 +01:00
Borislav Petkov
ad669ec16a x86/mce: Remove function-local cpus variables
Use num_online_cpus() directly.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-3-bp@alien8.de
2021-12-13 14:11:53 +01:00
Borislav Petkov
cd5e0d1fc9 x86/mce: Do not use memset to clear the banks bitmaps
The bitmap is a single unsigned long so no need for the function call.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-2-bp@alien8.de
2021-12-13 14:11:22 +01:00
Peter Zijlstra
e5eefda5aa x86: Remove .fixup section
No moar users, kill it dead.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101326.201590122@infradead.org
2021-12-11 09:09:50 +01:00
Peter Zijlstra
5ce8e39f55 x86/sgx: Remove .fixup usage
Create EX_TYPE_FAULT_SGX which does as EX_TYPE_FAULT does, except adds
this extra bit that SGX really fancies having.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.961246679@infradead.org
2021-12-11 09:09:49 +01:00
Peter Zijlstra
1c3b9091d0 x86/fpu: Remove .fixup usage
Employ EX_TYPE_EFAULT_REG to store '-EFAULT' into the %[err] register
on exception. All the callers only ever test for 0, so the change
from -1 to -EFAULT is immaterial.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.604494664@infradead.org
2021-12-11 09:09:48 +01:00
Peter Zijlstra
1614b2b11f arch: Make ARCH_STACKWALK independent of STACKTRACE
Make arch_stack_walk() available for ARCH_STACKWALK architectures
without it being entangled in STACKTRACE.

Link: https://lore.kernel.org/lkml/20211022152104.356586621@infradead.org/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Mark: rebase, drop unnecessary arm change]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20211129142849.3056714-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-12-10 14:06:03 +00:00
Colin Ian King
df0114f1f8 x86/resctrl: Remove redundant assignment to variable chunks
The variable chunks is being shifted right and re-assinged the shifted
value which is then returned. Since chunks is not being read afterwards
the assignment is redundant and the >>= operator can be replaced with a
shift >> operator instead.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lkml.kernel.org/r/20211207223735.35173-1-colin.i.king@gmail.com
2021-12-09 09:57:16 -08:00
Jarkko Sakkinen
50468e4313 x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node
== Problem ==

The amount of SGX memory on a system is determined by the BIOS and it
varies wildly between systems.  It can be as small as dozens of MB's
and as large as many GB's on servers.  Just like how applications need
to know how much regular RAM is available, enclave builders need to
know how much SGX memory an enclave can consume.

== Solution ==

Introduce a new sysfs file:

	/sys/devices/system/node/nodeX/x86/sgx_total_bytes

to enumerate the amount of SGX memory available in each NUMA node.
This serves the same function for SGX as /proc/meminfo or
/sys/devices/system/node/nodeX/meminfo does for normal RAM.

'sgx_total_bytes' is needed today to help drive the SGX selftests.
SGX-specific swap code is exercised by creating overcommitted enclaves
which are larger than the physical SGX memory on the system.  They
currently use a CPUID-based approach which can diverge from the actual
amount of SGX memory available.  'sgx_total_bytes' ensures that the
selftests can work efficiently and do not attempt stupid things like
creating a 100,000 MB enclave on a system with 128 MB of SGX memory.

== Implementation Details ==

Introduce CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an
arch specific attribute group, and add an attribute for the amount of
SGX memory in bytes to each NUMA node:

== ABI Design Discussion ==

As opposed to the per-node ABI, a single, global ABI was considered.
However, this would prevent enclaves from being able to size
themselves so that they fit on a single NUMA node.  Essentially, a
single value would rule out NUMA optimizations for enclaves.

Create a new "x86/" directory inside each "nodeX/" sysfs directory.
'sgx_total_bytes' is expected to be the first of at least a few
sgx-specific files to be placed in the new directory.  Just scanning
/proc/meminfo, these are the no-brainers that we have for RAM, but we
need for SGX:

	MemTotal:       xxxx kB // sgx_total_bytes (implemented here)
	MemFree:        yyyy kB // sgx_free_bytes
	SwapTotal:      zzzz kB // sgx_swapped_bytes

So, at *least* three.  I think we will eventually end up needing
something more along the lines of a dozen.  A new directory (as
opposed to being in the nodeX/ "root") directory avoids cluttering the
root with several "sgx_*" files.

Place the new file in a new "nodeX/x86/" directory because SGX is
highly x86-specific.  It is very unlikely that any other architecture
(or even non-Intel x86 vendor) will ever implement SGX.  Using "sgx/"
as opposed to "x86/" was also considered.  But, there is a real chance
this can get used for other arch-specific purposes.

[ dhansen: rewrite changelog ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org
2021-12-09 07:02:22 -08:00
Peter Zijlstra
e463a09af2 x86: Add straight-line-speculation mitigation
Make use of an upcoming GCC feature to mitigate
straight-line-speculation for x86:

  https://gcc.gnu.org/g:53a643f8568067d7700a9f2facc8ba39974973d3
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102952
  https://bugs.llvm.org/show_bug.cgi?id=52323

It's built tested on x86_64-allyesconfig using GCC-12 and GCC-11.

Maintenance overhead of this should be fairly low due to objtool
validation.

Size overhead of all these additional int3 instructions comes to:

     text	   data	    bss	    dec	    hex	filename
  22267751	6933356	2011368	31212475	1dc43bb	defconfig-build/vmlinux
  22804126	6933356	1470696	31208178	1dc32f2	defconfig-build/vmlinux.sls

Or roughly 2.4% additional text.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134908.140103474@infradead.org
2021-12-09 13:32:25 +01:00
Thomas Gleixner
ae72f31567 PCI/MSI: Make arch_restore_msi_irqs() less horrible.
Make arch_restore_msi_irqs() return a boolean which indicates whether the
core code should restore the MSI message or not. Get rid of the indirection
in x86.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>	# PCI
Link: https://lore.kernel.org/r/20211206210224.485668098@linutronix.de
2021-12-09 11:52:21 +01:00
Thomas Gleixner
e58f2259b9 genirq/msi, treewide: Use a named struct for PCI/MSI attributes
The unnamed struct sucks and is in the way of further cleanups. Stick the
PCI related MSI data into a real data structure and cleanup all users.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20211206210224.374863119@linutronix.de
2021-12-09 11:52:21 +01:00
Peter Zijlstra
26c44b776d x86/alternative: Relax text_poke_bp() constraint
Currently, text_poke_bp() is very strict to only allow patching a
single instruction; however with straight-line-speculation it will be
required to patch: ret; int3, which is two instructions.

As such, relax the constraints a little to allow int3 padding for all
instructions that do not imply the execution of the next instruction,
ie: RET, JMP.d8 and JMP.d32.

While there, rename the text_poke_loc::rel32 field to ::disp.

Note: this fills up the text_poke_loc structure which is now a round
  16 bytes big.

  [ bp: Put comments ontop instead of on the side. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134908.082342723@infradead.org
2021-12-09 11:04:50 +01:00
Peter Zijlstra
cabdc3a847 sched,x86: Don't use cluster topology for x86 hybrid CPUs
For x86 hybrid CPUs like Alder Lake, the order of CPU selection should
be based strictly on CPU priority.  Don't include cluster topology for
hybrid CPUs to avoid interference with such CPU selection order.

On Alder Lake, the Atom CPU cluster has more capacity (4 Atom CPUs) vs
Big core cluster (2 hyperthread CPUs). This could potentially bias CPU
selection towards Atom over Big Core, when Big core CPU has higher
priority.

Fixes: 66558b730f ("sched: Add cluster scheduler level for x86")
Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20211204091402.GM16608@worktop.programming.kicks-ass.net
2021-12-08 22:15:37 +01:00
Anusha Srivatsa
52407c220c drm/i915/rpl-s: Add PCI IDS for Raptor Lake S
Raptor Lake S(RPL-S) is a version 12
Display, Media and Render. For all i915
purposes it is the same as Alder Lake S (ADL-S).

Introduce RPL-S as a subplatform
of ADL-S. This patch adds PCI ids for RPL-S.

BSpec: 53655
Cc: x86@kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Signed-off-by: Anusha Srivatsa <anusha.srivatsa@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # arch/x86
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211203063545.2254380-2-anusha.srivatsa@intel.com
2021-12-08 13:02:54 -08:00
Peter Zijlstra
b17c2baa30 x86: Prepare inline-asm for straight-line-speculation
Replace all ret/retq instructions with ASM_RET in preparation of
making it more than a single instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.964635458@infradead.org
2021-12-08 19:23:12 +01:00
Kuppuswamy Sathyanarayanan
8260b9820f x86/sev: Use CC_ATTR attribute to generalize string I/O unroll
INS/OUTS are not supported in TDX guests and cause #UD. Kernel has to
avoid them when running in TDX guest. To support existing usage, string
I/O operations are unrolled using IN/OUT instructions.

AMD SEV platform implements this support by adding unroll
logic in ins#bwl()/outs#bwl() macros with SEV-specific checks.
Since TDX VM guests will also need similar support, use
CC_ATTR_GUEST_UNROLL_STRING_IO and generic cc_platform_has() API to
implement it.

String I/O helpers were the last users of sev_key_active() interface and
sev_enable_key static key. Remove them.

 [ bp: Move comment too and do not delete it. ]

Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20211206135505.75045-2-kirill.shutemov@linux.intel.com
2021-12-08 16:49:42 +01:00
David Woodhouse
74d9555580 PM: hibernate: Allow ACPI hardware signature to be honoured
Theoretically, when the hardware signature in FACS changes, the OS
is supposed to gracefully decline to attempt to resume from S4:

 "If the signature has changed, OSPM will not restore the system
  context and can boot from scratch"

In practice, Windows doesn't do this and many laptop vendors do allow
the signature to change especially when docking/undocking, so it would
be a bad idea to simply comply with the specification by default in the
general case.

However, there are use cases where we do want the compliant behaviour
and we know it's safe. Specifically, when resuming virtual machines where
we know the hypervisor has changed sufficiently that resume will fail.
We really want to be able to *tell* the guest kernel not to try, so it
boots cleanly and doesn't just crash. This patch provides a way to opt
in to the spec-compliant behaviour on the command line.

A follow-up patch may do this automatically for certain "known good"
machines based on a DMI match, or perhaps just for all hypervisor
guests since there's no good reason a hypervisor would change the
hardware_signature that it exposes to guests *unless* it wants them
to obey the ACPI specification.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-12-08 16:06:10 +01:00
Peter Zijlstra
f94909ceb1 x86: Prepare asm files for straight-line-speculation
Replace all ret/retq instructions with RET in preparation of making
RET a macro. Since AS is case insensitive it's a big no-op without
RET defined.

  find arch/x86/ -name \*.S | while read file
  do
	sed -i 's/\<ret[q]*\>/RET/' $file
  done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.905503893@infradead.org
2021-12-08 12:25:37 +01:00
Smita Koralahalli
1e56279a49 x86/mce/inject: Set the valid bit in MCA_STATUS before error injection
MCA handlers check the valid bit in each status register
(MCA_STATUS[Val]) and continue processing the error only if the valid
bit is set.

Set the valid bit unconditionally in the corresponding MCA_STATUS
register and correct any Val=0 injections made by the user as such
errors will get ignored and such injections will be largely pointless.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211104215846.254012-3-Smita.KoralahalliChannabasappa@amd.com
2021-12-08 12:01:01 +01:00
Smita Koralahalli
e48d008bd1 x86/mce/inject: Check if a bank is populated before injecting
The MCA_IPID register uniquely identifies a bank's type on Scalable MCA
(SMCA) systems. When an MCA bank is not populated, the MCA_IPID register
will read as zero and writes to it will be ignored.

On a hw-type error injection (injection which writes the actual MCA
registers in an attempt to cause a real MCE) check the value of this
register before trying to inject the error.

Do not impose any limitations on a sw injection and allow the user to
test out all the decoding paths without relying on the available hardware,
as its purpose is to just test the code.

 [ bp: Heavily massage. ]

Link: https://lkml.kernel.org/r/20211019233641.140275-2-Smita.KoralahalliChannabasappa@amd.com
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211104215846.254012-2-Smita.KoralahalliChannabasappa@amd.com
2021-12-08 12:00:56 +01:00
Joerg Roedel
71d5049b05 x86/mm: Flush global TLB when switching to trampoline page-table
Move the switching code into a function so that it can be re-used and
add a global TLB flush. This makes sure that usage of memory which is
not mapped in the trampoline page-table is reliably caught.

Also move the clearing of CR4.PCIDE before the CR3 switch because the
cr4_clear_bits() function will access data not mapped into the
trampoline page-table.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211202153226.22946-4-joro@8bytes.org
2021-12-06 09:54:10 +01:00
Joerg Roedel
f154f29085 x86/mm/64: Flush global TLB on boot and AP bringup
The AP bringup code uses the trampoline_pgd page-table which
establishes global mappings in the user range of the address space.
Flush the global TLB entries after the indentity mappings are removed so
no stale entries remain in the TLB.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211202153226.22946-3-joro@8bytes.org
2021-12-06 09:38:48 +01:00
Michael Sterritt
1d5379d047 x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword
Properly type the operands being passed to __put_user()/__get_user().
Otherwise, these routines truncate data for dependent instructions
(e.g., INSW) and only read/write one byte.

This has been tested by sending a string with REP OUTSW to a port and
then reading it back in with REP INSW on the same port.

Previous behavior was to only send and receive the first char of the
size. For example, word operations for "abcd" would only read/write
"ac". With change, the full string is now written and read back.

Fixes: f980f9c31a (x86/sev-es: Compile early handler code into kernel image)
Signed-off-by: Michael Sterritt <sterritt@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20211119232757.176201-1-sterritt@google.com
2021-12-03 18:09:30 +01:00
Feng Tang
b50db7095f x86/tsc: Disable clocksource watchdog for TSC on qualified platorms
There are cases that the TSC clocksource is wrongly judged as unstable by
the clocksource watchdog mechanism which tries to validate the TSC against
HPET, PM_TIMER or jiffies. While there is hardly a general reliable way to
check the validity of a watchdog, Thomas Gleixner proposed [1]:

"I'm inclined to lift that requirement when the CPU has:

    1) X86_FEATURE_CONSTANT_TSC
    2) X86_FEATURE_NONSTOP_TSC
    3) X86_FEATURE_NONSTOP_TSC_S3
    4) X86_FEATURE_TSC_ADJUST
    5) At max. 4 sockets

 After two decades of horrors we're finally at a point where TSC seems
 to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST
 was really key as we can now detect even small modifications reliably
 and the important point is that we can cure them as well (not pretty
 but better than all other options)."

As feature #3 X86_FEATURE_NONSTOP_TSC_S3 only exists on several generations
of Atom processorz, and is always coupled with X86_FEATURE_CONSTANT_TSC
and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive
to use maximal 2 sockets.

The check is done inside tsc_init() before registering 'tsc-early' and
'tsc' clocksources, as there were cases that both of them had been
wrongly judged as unreliable.

For more background of tsc/watchdog, there is a good summary in [2]

[tglx} Update vs. jiffies:

  On systems where the only remaining clocksource aside of TSC is jiffies
  there is no way to make this work because that creates a circular
  dependency. Jiffies accuracy depends on not missing a periodic timer
  interrupt, which is not guaranteed. That could be detected by TSC, but as
  TSC is not trusted this cannot be compensated. The consequence is a
  circulus vitiosus which results in shutting down TSC and falling back to
  the jiffies clocksource which is even more unreliable.

[1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/
[2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/

[ tglx: Refine comment and amend changelog ]

Fixes: 6e3cd95234 ("x86/hpet: Use another crystalball to evaluate HPET usability")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com
2021-12-02 00:40:36 +01:00
Feng Tang
c7719e7934 x86/tsc: Add a timer to make sure TSC_adjust is always checked
The TSC_ADJUST register is checked every time a CPU enters idle state, but
Thomas Gleixner mentioned there is still a caveat that a system won't enter
idle [1], either because it's too busy or configured purposely to not enter
idle.

Setup a periodic timer (every 10 minutes) to make sure the check is
happening on a regular base.

[1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/

Fixes: 6e3cd95234 ("x86/hpet: Use another crystalball to evaluate HPET usability")
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com
2021-12-02 00:40:35 +01:00