Resolve conflicts with ongoing lockdep work that fixed the NMI entry code.
Conflicts:
arch/x86/entry/common.c
arch/x86/include/asm/idtentry.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add Sapphire Rapids and Alder Lake processors to CPU list to enumerate
and enable the split lock feature.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/1595634320-79689-1-git-send-email-fenghua.yu@intel.com
Replace the x86 variant with the generic version. Provide the relevant
architecture specific helper functions and defines.
Use a temporary define for idtentry_exit_user which will be cleaned up
seperately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.494648601@linutronix.de
The noinstr qualifier is to be specified before the return type in the
same way inline is used.
These 2 cases were missed by previous patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200723161405.852613-1-ira.weiny@intel.com
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8UzA4eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGQ7cH/3v+Gv+SmHJCvaT2
CSu0+7okVnYbY3UTb3hykk7/aOqb6284KjxR03r0CWFzsEsZVhC5pvvruASSiMQg
Pi04sLqv6CsGLHd1n+pl4AUYEaxq6k4KS3uU3HHSWxrahDDApQoRUx2F8lpOxyj8
RiwnoO60IMPA7IFJqzcZuFqsgdxqiiYvnzT461KX8Mrw6fyMXeR2KAj2NwMX8dZN
At21Sf8+LSoh6q2HnugfiUd/jR10XbfxIIx2lXgIinb15GXgWydEQVrDJ7cUV7ix
Jd0S+dtOtp+lWtFHDoyjjqqsMV7+G8i/rFNZoxSkyZqsUTaKzaR6JD3moSyoYZgG
0+eXO4A=
=9EpR
-----END PGP SIGNATURE-----
Merge v5.8-rc6 into drm-next
I've got a silent conflict + two trees based on fixes to merge.
Fixes a silent merge with amdgpu
Signed-off-by: Dave Airlie <airlied@redhat.com>
Commit 711419e504 ("irqdomain: Add the missing assignment of
domain->fwnode for named fwnode") unintentionally caused a dangling pointer
page fault issue on firmware nodes that were freed after IRQ domain
allocation. Commit e3beca48a4 fixed that dangling pointer issue by only
freeing the firmware node after an IRQ domain allocation failure. That fix
no longer frees the firmware node immediately, but leaves the firmware node
allocated after the domain is removed.
The firmware node must be kept around through irq_domain_remove, but should be
freed it afterwards.
Add the missing free operations after domain removal where where appropriate.
Fixes: e3beca48a4 ("irqdomain/treewide: Keep firmware node unconditionally allocated")
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # drivers/pci
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1595363169-7157-1-git-send-email-jonathan.derrick@intel.com
show_trace_log_lvl() provides x86 platform-specific way to unwind
backtrace with a given log level. Unfortunately, registers dump(s) are
not printed with the same log level - instead, KERN_DEFAULT is always
used.
Arista's switches uses quite common setup with rsyslog, where only
urgent messages goes to console (console_log_level=KERN_ERR), everything
else goes into /var/log/ as the console baud-rate often is indecently
slow (9600 bps).
Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where registers dump shouldn't appear with
higher log level).
After all preparations are done, provide log_lvl parameter for
show_regs_if_on_stack() and wire up to actual log level used as
an argument for show_trace_log_lvl().
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://lkml.kernel.org/r/20200629144847.492794-4-dima@arista.com
show_trace_log_lvl() provides x86 platform-specific way to unwind
backtrace with a given log level. Unfortunately, registers dump(s) are
not printed with the same log level - instead, KERN_DEFAULT is always
used.
Arista's switches uses quite common setup with rsyslog, where only
urgent messages goes to console (console_log_level=KERN_ERR), everything
else goes into /var/log/ as the console baud-rate often is indecently
slow (9600 bps).
Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where registers dump shouldn't appear with
higher log level).
Add log_lvl parameter to __show_regs().
Keep the used log level intact to separate visible change.
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://lkml.kernel.org/r/20200629144847.492794-3-dima@arista.com
show_trace_log_lvl() provides x86 platform-specific way to unwind
backtrace with a given log level. Unfortunately, registers dump(s) are
not printed with the same log level - instead, KERN_DEFAULT is always
used.
Arista's switches uses quite common setup with rsyslog, where only
urgent messages goes to console (console_log_level=KERN_ERR), everything
else goes into /var/log/ as the console baud-rate often is indecently
slow (9600 bps).
Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where registers dump shouldn't appear with
higher log level).
Add log_lvl parameter to show_iret_regs() as a preparation to add it
to __show_regs() and show_regs_if_on_stack().
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://lkml.kernel.org/r/20200629144847.492794-2-dima@arista.com
H.J. reported that post 5.7 a segfault of a user space task does not longer
dump the Code bytes when /proc/sys/debug/exception-trace is enabled. It
prints 'Code: Bad RIP value.' instead.
This was broken by a recent change which made probe_kernel_read() reject
non-kernel addresses.
Update show_opcodes() so it retrieves user space opcodes via
copy_from_user_nmi().
Fixes: 98a23609b1 ("maccess: always use strict semantics for probe_kernel_read")
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87h7tz306w.fsf@nanos.tec.linutronix.de
If a user task's stack is empty, or if it only has user regs, ORC
reports it as a reliable empty stack. But arch_stack_walk_reliable()
incorrectly treats it as unreliable.
That happens because the only success path for user tasks is inside the
loop, which only iterates on non-empty stacks. Generally, a user task
must end in a user regs frame, but an empty stack is an exception to
that rule.
Thanks to commit 71c9582528 ("x86/unwind/orc: Fix error handling in
__unwind_start()"), unwind_start() now sets state->error appropriately.
So now for both ORC and FP unwinders, unwind_done() and !unwind_error()
always means the end of the stack was successfully reached. So the
success path for kthreads is no longer needed -- it can also be used for
empty user tasks.
Reported-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Link: https://lkml.kernel.org/r/f136a4e5f019219cbc4f4da33b30c2f44fa65b84.1594994374.git.jpoimboe@redhat.com
The ORC unwinder fails to unwind newly forked tasks which haven't yet
run on the CPU. It correctly reads the 'ret_from_fork' instruction
pointer from the stack, but it incorrectly interprets that value as a
call stack address rather than a "signal" one, so the address gets
incorrectly decremented in the call to orc_find(), resulting in bad ORC
data.
Fix it by forcing 'ret_from_fork' frames to be signal frames.
Reported-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Link: https://lkml.kernel.org/r/f91a8778dde8aae7f71884b5df2b16d552040441.1594994374.git.jpoimboe@redhat.com
On x86-32 the idt_table with 256 entries needs only 2048 bytes. It is
page-aligned, but the end of the .bss..page_aligned section is not
guaranteed to be page-aligned.
As a result, objects from other .bss sections may end up on the same 4k
page as the idt_table, and will accidentially get mapped read-only during
boot, causing unexpected page-faults when the kernel writes to them.
This could be worked around by making the objects in the page aligned
sections page sized, but that's wrong.
Explicit sections which store only page aligned objects have an implicit
guarantee that the object is alone in the page in which it is placed. That
works for all objects except the last one. That's inconsistent.
Enforcing page sized objects for these sections would wreckage memory
sanitizers, because the object becomes artificially larger than it should
be and out of bound access becomes legit.
Align the end of the .bss..page_aligned and .data..page_aligned section on
page-size so all objects places in these sections are guaranteed to have
their own page.
[ tglx: Amended changelog ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200721093448.10417-1-joro@8bytes.org
To allow the kernel not to play games with set_fs to call exec
implement kernel_execve. The function kernel_execve takes pointers
into kernel memory and copies the values pointed to onto the new
userspace stack.
The calls with arguments from kernel space of do_execve are replaced
with calls to kernel_execve.
The calls do_execve and do_execveat are made static as there are now
no callers outside of exec.
The comments that mention do_execve are updated to refer to
kernel_execve or execve depending on the circumstances. In addition
to correcting the comments, this makes it easy to grep for do_execve
and verify it is not used.
Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This fixes a regression encountered while running the
gdb.base/corefile.exp test in GDB's test suite.
In my testing, the typo prevented the sw_reserved field of struct
fxregs_state from being output to the kernel XSAVES area. Thus the
correct mask corresponding to XCR0 was not present in the core file for
GDB to interrogate, resulting in the following behavior:
[kev@f32-1 gdb]$ ./gdb -q testsuite/outputs/gdb.base/corefile/corefile testsuite/outputs/gdb.base/corefile/corefile.core
Reading symbols from testsuite/outputs/gdb.base/corefile/corefile...
[New LWP 232880]
warning: Unexpected size of section `.reg-xstate/232880' in core file.
With the typo fixed, the test works again as expected.
Signed-off-by: Kevin Buettner <kevinb@redhat.com>
Fixes: 9e46365459 ("copy_xstate_to_kernel(): don't leave parts of destination uninitialized")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Airlie <airlied@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix the I/O bitmap invalidation on XEN PV, which was overlooked in the
recent ioperm/iopl rework. This caused the TSS and XEN's I/O bitmap to
get out of sync.
- Use the proper vectors for HYPERV.
- Make disabling of stack protector for the entry code work with GCC
builds which enable stack protector by default. Removing the option is
not sufficient, it needs an explicit -fno-stack-protector to shut it
off.
- Mark check_user_regs() noinstr as it is called from noinstr code. The
missing annotation causes it to be placed in the text section which
makes it instrumentable.
- Add the missing interrupt disable in exc_alignment_check()
- Fixup a XEN_PV build dependency in the 32bit entry code
- A few fixes to make the Clang integrated assembler happy
- Move EFI stub build to the right place for out of tree builds
- Make prepare_exit_to_usermode() static. It's not longer called from ASM
code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8UR+MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQCUD/4/9W5FFvdZvQPwmXsHPaVnW9hUsXxG
0tjc34xqDcgEl1U3khu+6jj+oHx+JM+4wGP/V49Wqx6xkrJ33/a8uYErAgI7+Pmp
s3T2gXMWkgJtYFlDQdAMbeuuM2cOFZJw4BxxvTMth5EixQvk1EkX6QyBjLaSGo8y
78sWtZ6Oh5Ql9ua/9TOilewLsCsQSFIFn0o/hawwwPUMrwGvD29scha0XHom+AO7
uwejfU8klq2HJJaLaaiUaiNBkFz0TNGJtY+3mQpw8BPjCuuBQhYygrS0X4uQzo01
4XJzhDnOVbAYWqi0/T+mAEmuJ9NBZJwYiYrwBYCkZgELwJKLzhzO2GOgP9xEsFY4
VUNgqHFhKrQp10k2k4L/A5tmr+0GntiCQhdZi+/gty6oO/t3ni57pRcAhA9qBNOb
8ZqumBwgaaAIqcmdtoyXAIveWOHnzwKEg6wmIGFbyCwHjeLJKJG7KhpXIpEuX+j2
DC7EfYvRB+jllAk1CBypBvzD0DHfMZ0myPxCcZiW2wHTVAlkpY7hiIyPHqocjE9L
OjOQ7FS6E2/p24lYVcLUFWcESxGFvQjjxwXk7htjpGUIZsQOhz/LOW+CIPCsfbqE
HoEsHmNltksYYV9FDfevXRp5sbxpx3wQSLOgqNqiOpy4cTCG8boalUqHQ0OsN8Oa
EgU067yF77ymRg==
=QAeH
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull x86 fixes from Thomas Gleixner:
"A pile of fixes for x86:
- Fix the I/O bitmap invalidation on XEN PV, which was overlooked in
the recent ioperm/iopl rework. This caused the TSS and XEN's I/O
bitmap to get out of sync.
- Use the proper vectors for HYPERV.
- Make disabling of stack protector for the entry code work with GCC
builds which enable stack protector by default. Removing the option
is not sufficient, it needs an explicit -fno-stack-protector to
shut it off.
- Mark check_user_regs() noinstr as it is called from noinstr code.
The missing annotation causes it to be placed in the text section
which makes it instrumentable.
- Add the missing interrupt disable in exc_alignment_check()
- Fixup a XEN_PV build dependency in the 32bit entry code
- A few fixes to make the Clang integrated assembler happy
- Move EFI stub build to the right place for out of tree builds
- Make prepare_exit_to_usermode() static. It's not longer called from
ASM code"
* tag 'x86-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Don't add the EFI stub to targets
x86/entry: Actually disable stack protector
x86/ioperm: Fix io bitmap invalidation on Xen PV
x86: math-emu: Fix up 'cmp' insn for clang ias
x86/entry: Fix vectors to IDTENTRY_SYSVEC for CONFIG_HYPERV
x86/entry: Add compatibility with IAS
x86/entry/common: Make prepare_exit_to_usermode() static
x86/entry: Mark check_user_regs() noinstr
x86/traps: Disable interrupts in exc_aligment_check()
x86/entry/32: Fix XEN_PV build dependency
tss_invalidate_io_bitmap() wasn't wired up properly through the pvop
machinery, so the TSS and Xen's io bitmap would get out of sync
whenever disabling a valid io bitmap.
Add a new pvop for tss_invalidate_io_bitmap() to fix it.
This is XSA-329.
Fixes: 22fe5b0439 ("x86/ioperm: Move TSS bitmap update to exit to user work")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/d53075590e1f91c19f8af705059d3ff99424c020.1595030016.git.luto@kernel.org
Setting interrupt affinity on inactive interrupts is inconsistent when
hierarchical irq domains are enabled. The core code should just store the
affinity and not call into the irq chip driver for inactive interrupts
because the chip drivers may not be in a state to handle such requests.
X86 has a hacky workaround for that but all other irq chips have not which
causes problems e.g. on GIC V3 ITS.
Instead of adding more ugly hacks all over the place, solve the problem in
the core code. If the affinity is set on an inactive interrupt then:
- Store it in the irq descriptors affinity mask
- Update the effective affinity to reflect that so user space has
a consistent view
- Don't call into the irq chip driver
This is the core equivalent of the X86 workaround and works correctly
because the affinity setting is established in the irq chip when the
interrupt is activated later on.
Note, that this is only effective when hierarchical irq domains are enabled
by the architecture. Doing it unconditionally would break legacy irq chip
implementations.
For hierarchial irq domains this works correctly as none of the drivers can
have a dependency on affinity setting in inactive state by design.
Remove the X86 workaround as it is not longer required.
Fixes: 02edee152d ("x86/apic/vector: Ignore set_affinity call for inactive interrupts")
Reported-by: Ali Saidi <alisaidi@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ali Saidi <alisaidi@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200529015501.15771-1-alisaidi@amazon.com
Link: https://lkml.kernel.org/r/877dv2rv25.fsf@nanos.tec.linutronix.de
In removing UV1 support, efi_have_uv1_memmap is no longer used.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lkml.kernel.org/r/20200713212955.786177105@hpe.com
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
Quite some non OF/ACPI users of irqdomains allocate firmware nodes of type
IRQCHIP_FWNODE_NAMED or IRQCHIP_FWNODE_NAMED_ID and free them right after
creating the irqdomain. The only purpose of these FW nodes is to convey
name information. When this was introduced the core code did not store the
pointer to the node in the irqdomain. A recent change stored the firmware
node pointer in irqdomain for other reasons and missed to notice that the
usage sites which do the alloc_fwnode/create_domain/free_fwnode sequence
are broken by this. Storing a dangling pointer is dangerous itself, but in
case that the domain is destroyed later on this leads to a double free.
Remove the freeing of the firmware node after creating the irqdomain from
all affected call sites to cure this.
Fixes: 711419e504 ("irqdomain: Add the missing assignment of domain->fwnode for named fwnode")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/873661qakd.fsf@nanos.tec.linutronix.de
While the nmi_enter() users did
trace_hardirqs_{off_prepare,on_finish}() there was no matching
lockdep_hardirqs_*() calls to complete the picture.
Introduce idtentry_{enter,exit}_nmi() to enable proper IRQ state
tracking across the NMIs.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200623083721.216740948@infradead.org
There are cases where a guest tries to switch spinlocks to bare metal
behavior (e.g. by setting "xen_nopvspin" on XEN platform and
"hv_nopvspin" on HYPER_V).
That feature is missed on KVM, add a new parameter "nopvspin" to disable
PV spinlocks for KVM guest.
The new 'nopvspin' parameter will also replace Xen and Hyper-V specific
parameters in future patches.
Define variable nopvsin as global because it will be used in future
patches as above.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
pr_*() is preferred than printk(KERN_* ...), after change all the print
in arch/x86/kernel/kvm.c will have "kvm-guest: xxx" style.
No functional change.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 34226b6b70.
Commit 8990cac6e5 ("x86/jump_label: Initialize static branching
early") adds jump_label_init() call in setup_arch() to make static
keys initialized early, so we could use the original simpler code
again.
The similar change for XEN is in commit 090d54bcbc ("Revert
"x86/paravirt: Set up the virt_spin_lock_key after static keys get
initialized"")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the LBR call stack mode, LBR information is used to reconstruct a
call stack. To get the complete call stack, perf has to save/restore
all LBR registers during a context switch. Due to a large number of the
LBR registers, this process causes a high CPU overhead. To reduce the
CPU overhead during a context switch, use the XSAVES/XRSTORS
instructions.
Every XSAVE area must follow a canonical format: the legacy region, an
XSAVE header and the extended region. Although the LBR information is
only kept in the extended region, a space for the legacy region and
XSAVE header is still required. Add a new dedicated structure for LBR
XSAVES support.
Before enabling XSAVES support, the size of the LBR state has to be
sanity checked, because:
- the size of the software structure is calculated from the max number
of the LBR depth, which is enumerated by the CPUID leaf for Arch LBR.
The size of the LBR state is enumerated by the CPUID leaf for XSAVE
support of Arch LBR. If the values from the two CPUID leaves are not
consistent, it may trigger a buffer overflow. For example, a hypervisor
may unconsciously set inconsistent values for the two emulated CPUID.
- unlike other state components, the size of an LBR state depends on the
max number of LBRs, which may vary from generation to generation.
Expose the function xfeature_size() for the sanity check.
The LBR XSAVES support will be disabled if the size of the LBR state
enumerated by CPUID doesn't match with the size of the software
structure.
The XSAVE instruction requires 64-byte alignment for state buffers. A
new macro is added to reflect the alignment requirement. A 64-byte
aligned kmem_cache is created for architecture LBR.
Currently, the structure for each state component is maintained in
fpu/types.h. The structure for the new LBR state component should be
maintained in the same place. Move structure lbr_entry to fpu/types.h as
well for broader sharing.
Add dedicated lbr_save/lbr_restore functions for LBR XSAVES support,
which invokes the corresponding xstate helpers to XSAVES/XRSTORS LBR
information at the context switch when the call stack mode is enabled.
Since the XSAVES/XRSTORS instructions will be eventually invoked, the
dedicated functions is named with '_xsaves'/'_xrstors' postfix.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-23-git-send-email-kan.liang@linux.intel.com
The perf subsystem will only need to save/restore the LBR state.
However, the existing helpers save all supported supervisor states to a
kernel buffer, which will be unnecessary. Two helpers are introduced to
only save/restore requested dynamic supervisor states. The supervisor
features in XFEATURE_MASK_SUPERVISOR_SUPPORTED and
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED mask cannot be saved/restored using
these helpers.
The helpers will be used in the following patch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-22-git-send-email-kan.liang@linux.intel.com
Last Branch Records (LBR) registers are used to log taken branches and
other control flows. In perf with call stack mode, LBR information is
used to reconstruct a call stack. To get the complete call stack, perf
has to save/restore all LBR registers during a context switch. Due to
the large number of the LBR registers, e.g., the current platform has
96 LBR registers, this process causes a high CPU overhead. To reduce
the CPU overhead during a context switch, an LBR state component that
contains all the LBR related registers is introduced in hardware. All
LBR registers can be saved/restored together using one XSAVES/XRSTORS
instruction.
However, the kernel should not save/restore the LBR state component at
each context switch, like other state components, because of the
following unique features of LBR:
- The LBR state component only contains valuable information when LBR
is enabled in the perf subsystem, but for most of the time, LBR is
disabled.
- The size of the LBR state component is huge. For the current
platform, it's 808 bytes.
If the kernel saves/restores the LBR state at each context switch, for
most of the time, it is just a waste of space and cycles.
To efficiently support the LBR state component, it is desired to have:
- only context-switch the LBR when the LBR feature is enabled in perf.
- only allocate an LBR-specific XSAVE buffer on demand.
(Besides the LBR state, a legacy region and an XSAVE header have to be
included in the buffer as well. There is a total of (808+576) byte
overhead for the LBR-specific XSAVE buffer. The overhead only happens
when the perf is actively using LBRs. There is still a space-saving,
on average, when it replaces the constant 808 bytes of overhead for
every task, all the time on the systems that support architectural
LBR.)
- be able to use XSAVES/XRSTORS for accessing LBR at run time.
However, the IA32_XSS should not be adjusted at run time.
(The XCR0 | IA32_XSS are used to determine the requested-feature
bitmap (RFBM) of XSAVES.)
A solution, called dynamic supervisor feature, is introduced to address
this issue, which
- does not allocate a buffer in each task->fpu;
- does not save/restore a state component at each context switch;
- sets the bit corresponding to the dynamic supervisor feature in
IA32_XSS at boot time, and avoids setting it at run time.
- dynamically allocates a specific buffer for a state component
on demand, e.g. only allocates LBR-specific XSAVE buffer when LBR is
enabled in perf. (Note: The buffer has to include the LBR state
component, a legacy region and a XSAVE header space.)
(Implemented in a later patch)
- saves/restores a state component on demand, e.g. manually invokes
the XSAVES/XRSTORS instruction to save/restore the LBR state
to/from the buffer when perf is active and a call stack is required.
(Implemented in a later patch)
A new mask XFEATURE_MASK_DYNAMIC and a helper xfeatures_mask_dynamic()
are introduced to indicate the dynamic supervisor feature. For the
systems which support the Architecture LBR, LBR is the only dynamic
supervisor feature for now. For the previous systems, there is no
dynamic supervisor feature available.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-21-git-send-email-kan.liang@linux.intel.com
When saving xstate to a kernel/user XSAVE area with the XSAVE family of
instructions, the current code applies the 'full' instruction mask (-1),
which tries to XSAVE all possible features. This method relies on
hardware to trim 'all possible' down to what is enabled in the
hardware. The code works well for now. However, there will be a
problem, if some features are enabled in hardware, but are not suitable
to be saved into all kernel XSAVE buffers, like task->fpu, due to
performance consideration.
One such example is the Last Branch Records (LBR) state. The LBR state
only contains valuable information when LBR is explicitly enabled by
the perf subsystem, and the size of an LBR state is large (808 bytes
for now). To avoid both CPU overhead and space overhead at each context
switch, the LBR state should not be saved into task->fpu like other
state components. It should be saved/restored on demand when LBR is
enabled in the perf subsystem. Current copy_xregs_to_* will trigger a
buffer overflow for such cases.
Three sites use the '-1' instruction mask which must be updated.
Two are saving/restoring the xstate to/from a kernel-allocated XSAVE
buffer and can use 'xfeatures_mask_all', which will save/restore all of
the features present in a normal task FPU buffer.
The last one saves the register state directly to a user buffer. It
could
also use 'xfeatures_mask_all'. Just as it was with the '-1' argument,
any supervisor states in the mask will be filtered out by the hardware
and not saved to the buffer. But, to be more explicit about what is
expected to be saved, use xfeatures_mask_user() for the instruction
mask.
KVM includes the header file fpu/internal.h. To avoid 'undefined
xfeatures_mask_all' compiling issue, move copy_fpregs_to_fpstate() to
fpu/core.c and export it, because:
- The xfeatures_mask_all is indirectly used via copy_fpregs_to_fpstate()
by KVM. The function which is directly used by other modules should be
exported.
- The copy_fpregs_to_fpstate() is a function, while xfeatures_mask_all
is a variable for the "internal" FPU state. It's safer to export a
function than a variable, which may be implicitly changed by others.
- The copy_fpregs_to_fpstate() is a big function with many checks. The
removal of the inline keyword should not impact the performance.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-20-git-send-email-kan.liang@linux.intel.com
Some Makefiles already pass -fno-stack-protector unconditionally.
For example, arch/arm64/kernel/vdso/Makefile, arch/x86/xen/Makefile.
No problem report so far about hard-coding this option. So, we can
assume all supported compilers know -fno-stack-protector.
GCC 4.8 and Clang support this option (https://godbolt.org/z/_HDGzN)
Get rid of cc-option from -fno-stack-protector.
Remove CONFIG_CC_HAS_STACKPROTECTOR_NONE, which is always 'y'.
Note:
arch/mips/vdso/Makefile adds -fno-stack-protector twice, first
unconditionally, and second conditionally. I removed the second one.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
They were originally called _cond_rcu because they were special versions
with conditional RCU handling. Now they're the standard entry and exit
path, so the _cond_rcu part is just confusing. Drop it.
Also change the signature to make them more extensible and more foolproof.
No functional change -- it's pure refactoring.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/247fc67685263e0b673e1d7f808182d28ff80359.1593795633.git.luto@kernel.org
Using a mutex for "print this warning only once" is so overdesigned as
to be actively offensive to my sensitive stomach.
Just use "pr_info_once()" that already does this, although in a
(harmlessly) racy manner that can in theory cause the message to be
printed twice if more than one CPU races on that "is this the first
time" test.
[ If somebody really cares about that harmless data race (which sounds
very unlikely indeed), that person can trivially fix printk_once() by
using a simple atomic access, preferably with an optimistic non-atomic
test first before even bothering to treat the pointless "make sure it
is _really_ just once" case.
A mutex is most definitely never the right primitive to use for
something like this. ]
Yes, this is a small and meaningless detail in a code path that hardly
matters. But let's keep some code quality standards here, and not
accept outrageously bad code.
Link: https://lore.kernel.org/lkml/CAHk-=wgV9toS7GU3KmNpj8hCS9SeF+A0voHS8F275_mgLhL4Lw@mail.gmail.com/
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Reset MXCSR in kernel_fpu_begin() to prevent using a stale user space
value.
- Prevent writing MSR_TEST_CTRL on CPUs which are not explicitly
whitelisted for split lock detection. Some CPUs which do not support
it crash even when the MSR is written to 0 which is the default value.
- Fix the XEN PV fallout of the entry code rework
- Fix the 32bit fallout of the entry code rework
- Add more selftests to ensure that these entry problems don't come back.
- Disable 16 bit segments on XEN PV. It's not supported because XEN PV
does not implement ESPFIX64
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8B9JoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoV8LEAC6QJPDvqYUl4r0rNIRG+S6D99lQOse
1smxvgXX4UaRz5Tgz6kvYUcucqmmnTfvnO8cg82LASeFw1xfVPPAtl3GZjoClwhv
0NJkKYcMm5QUOSVjJmjkcbAld//FyRfxHuJ8HMEtrbvkys2qWBmLzMaUNhFDNhcc
73UMmyuyL4kef9v/iAeR5WXG5+b+j9lZDiC1lTWuEKs10d1EdTwt2O/wtSRRPpMn
kL1qGTJAL+iRyRe7weLOkC2KZ9+Gq2NtyJQutkthZtGe5+pLT3AT6AlWxeg1HU8q
pxaQP25oe8/8naIoOmwiuwAP2qmm5eHedzXoN0h7i2XmofYOJaWeF95K7oDro8Nj
2deCx1bk0wr/RUxbYlfUacs8S+wmMWe7+BPnHXZphkSq5Vx+oXIw6mJOqmNb7Yiv
7ld1QwSD5dyWCEk1af16XKsFvSIRiGh8FypfTiTxyk+z7HIWBNXlu8OWHn1A7Sra
iaolCZfXtTJzm4w5+VVT2FX3s7jJrmMM4iSLtM2ISo2k+1HMlTbgLE6/yGjQ3ZaY
U298W7Pm8CwBRgzyKBvZVfncm0U/B0FNo/8C0jsJKPIOdpoLhs+u7sjpyaNC+toz
GE0skoWZxMhga4xPF84ua/l1VGncVUN1d5/dmnXz8xdyxFlktUtkt2iPE4G0rt3S
Xgh2uLHOgST6Kw==
=lI9c
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A series of fixes for x86:
- Reset MXCSR in kernel_fpu_begin() to prevent using a stale user
space value.
- Prevent writing MSR_TEST_CTRL on CPUs which are not explicitly
whitelisted for split lock detection. Some CPUs which do not
support it crash even when the MSR is written to 0 which is the
default value.
- Fix the XEN PV fallout of the entry code rework
- Fix the 32bit fallout of the entry code rework
- Add more selftests to ensure that these entry problems don't come
back.
- Disable 16 bit segments on XEN PV. It's not supported because XEN
PV does not implement ESPFIX64"
* tag 'x86-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ldt: Disable 16-bit segments on Xen PV
x86/entry/32: Fix #MC and #DB wiring on x86_32
x86/entry/xen: Route #DB correctly on Xen PV
x86/entry, selftests: Further improve user entry sanity checks
x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
selftests/x86: Consolidate and fix get/set_eflags() helpers
selftests/x86/syscall_nt: Clear weird flags after each test
selftests/x86/syscall_nt: Add more flag combinations
x86/entry/64/compat: Fix Xen PV SYSENTER frame setup
x86/entry: Move SYSENTER's regs->sp and regs->flags fixups into C
x86/entry: Assert that syscalls are on the right stack
x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelisted
x86/fpu: Reset MXCSR to default in kernel_fpu_begin()
Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
back simply copy_thread(). It's a simpler name, and doesn't imply that only
tls is copied here. This finishes an outstanding chunk of internal process
creation work since we've added clone3().
Cc: linux-arch@vger.kernel.org
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>A
Acked-by: Stafford Horne <shorne@gmail.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>A
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Xen PV doesn't implement ESPFIX64, so they don't work right. Disable
them. Also print a warning the first time anyone tries to use a
16-bit segment on a Xen PV guest that would otherwise allow it
to help people diagnose this change in behavior.
This gets us closer to having all x86 selftests pass on Xen PV.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/92b2975459dfe5929ecf34c3896ad920bd9e3f2d.1593795633.git.luto@kernel.org
DEFINE_IDTENTRY_MCE and DEFINE_IDTENTRY_DEBUG were wired up as non-RAW
on x86_32, but the code expected them to be RAW.
Get rid of all the macro indirection for them on 32-bit and just use
DECLARE_IDTENTRY_RAW and DEFINE_IDTENTRY_RAW directly.
Also add a warning to make sure that we only hit the _kernel paths
in kernel mode.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/9e90a7ee8e72fd757db6d92e1e5ff16339c1ecf9.1593795633.git.luto@kernel.org
On Xen PV, #DB doesn't use IST. It still needs to be correctly routed
depending on whether it came from user or kernel mode.
Get rid of DECLARE/DEFINE_IDTENTRY_XEN -- it was too hard to follow the
logic. Instead, route #DB and NMI through DECLARE/DEFINE_IDTENTRY_RAW on
Xen, and do the right thing for #DB. Also add more warnings to the
exc_debug* handlers to make this type of failure more obvious.
This fixes various forms of corruption that happen when usermode
triggers #DB on Xen PV.
Fixes: 4c0dcd8350 ("x86/entry: Implement user mode C entry points for #DB and #MCE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/4163e733cce0b41658e252c6c6b3464f33fdff17.1593795633.git.luto@kernel.org