Move smp_num_siblings and cpu_llc_id to cpu/common.c so that they're
always present as symbols and not only in the CONFIG_SMP case. Then,
other code using them doesn't need ugly ifdeffery anymore. Get rid of
some ifdeffery.
Signed-off-by: Borislav Petkov <bpetkov@suse.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1524864877-111962-2-git-send-email-suravee.suthikulpanit@amd.com
Each of the strings that we want to put into the buf[MAX_FLAG_OPT_SIZE]
in flags_read() is two characters long. But the sprintf() adds
a trailing newline and will add a terminating NUL byte. So
MAX_FLAG_OPT_SIZE needs to be 4.
sprintf() calls vsnprintf() and *that* does return:
" * The return value is the number of characters which would
* be generated for the given input, excluding the trailing
* '\0', as per ISO C99."
Note the "excluding".
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180427163707.ktaiysvbk3yhk4wm@agluck-desk
Unless explicitly opted out of, anything running under seccomp will have
SSB mitigations enabled. Choosing the "prctl" mode will disable this.
[ tglx: Adjusted it to the new arch_seccomp_spec_mitigate() mechanism ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The migitation control is simpler to implement in architecture code as it
avoids the extra function call to check the mode. Aside of that having an
explicit seccomp enabled mode in the architecture mitigations would require
even more workarounds.
Move it into architecture code and provide a weak function in the seccomp
code. Remove the 'which' argument as this allows the architecture to decide
which mitigations are relevant for seccomp.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For certain use cases it is desired to enforce mitigations so they cannot
be undone afterwards. That's important for loader stubs which want to
prevent a child from disabling the mitigation again. Will also be used for
seccomp(). The extra state preserving of the prctl state for SSB is a
preparatory step for EBPF dymanic speculation control.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There's no reason for these to be changed after boot.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
current.
This is needed both for /proc/$pid/status queries and for seccomp (since
thread-syncing can trigger seccomp in non-current threads).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add prctl based control for Speculative Store Bypass mitigation and make it
the default mitigation for Intel and AMD.
Andi Kleen provided the following rationale (slightly redacted):
There are multiple levels of impact of Speculative Store Bypass:
1) JITed sandbox.
It cannot invoke system calls, but can do PRIME+PROBE and may have call
interfaces to other code
2) Native code process.
No protection inside the process at this level.
3) Kernel.
4) Between processes.
The prctl tries to protect against case (1) doing attacks.
If the untrusted code can do random system calls then control is already
lost in a much worse way. So there needs to be system call protection in
some way (using a JIT not allowing them or seccomp). Or rather if the
process can subvert its environment somehow to do the prctl it can already
execute arbitrary code, which is much worse than SSB.
To put it differently, the point of the prctl is to not allow JITed code
to read data it shouldn't read from its JITed sandbox. If it already has
escaped its sandbox then it can already read everything it wants in its
address space, and do much worse.
The ability to control Speculative Store Bypass allows to enable the
protection selectively without affecting overall system performance.
Based on an initial patch from Tim Chen. Completely rewritten.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The Speculative Store Bypass vulnerability can be mitigated with the
Reduced Data Speculation (RDS) feature. To allow finer grained control of
this eventually expensive mitigation a per task mitigation control is
required.
Add a new TIF_RDS flag and put it into the group of TIF flags which are
evaluated for mismatch in switch_to(). If these bits differ in the previous
and the next task, then the slow path function __switch_to_xtra() is
invoked. Implement the TIF_RDS dependent mitigation control in the slow
path.
If the prctl for controlling Speculative Store Bypass is disabled or no
task uses the prctl then there is no overhead in the switch_to() fast
path.
Update the KVM related speculation control functions to take TID_RDS into
account as well.
Based on a patch from Tim Chen. Completely rewritten.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Having everything in nospec-branch.h creates a hell of dependencies when
adding the prctl based switching mechanism. Move everything which is not
required in nospec-branch.h to spec-ctrl.h and fix up the includes in the
relevant files.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
AMD does not need the Speculative Store Bypass mitigation to be enabled.
The parameters for this are already available and can be done via MSR
C001_1020. Each family uses a different bit in that MSR for this.
[ tglx: Expose the bit mask via a variable and move the actual MSR fiddling
into the bugs code as that's the right thing to do and also required
to prepare for dynamic enable/disable ]
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Intel and AMD SPEC_CTRL (0x48) MSR semantics may differ in the
future (or in fact use different MSRs for the same functionality).
As such a run-time mechanism is required to whitelist the appropriate MSR
values.
[ tglx: Made the variable __ro_after_init ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Intel CPUs expose methods to:
- Detect whether RDS capability is available via CPUID.7.0.EDX[31],
- The SPEC_CTRL MSR(0x48), bit 2 set to enable RDS.
- MSR_IA32_ARCH_CAPABILITIES, Bit(4) no need to enable RRS.
With that in mind if spec_store_bypass_disable=[auto,on] is selected set at
boot-time the SPEC_CTRL MSR to enable RDS if the platform requires it.
Note that this does not fix the KVM case where the SPEC_CTRL is exposed to
guests which can muck with it, see patch titled :
KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS.
And for the firmware (IBRS to be set), see patch titled:
x86/spectre_v2: Read SPEC_CTRL MSR during boot and re-use reserved bits
[ tglx: Distangled it from the intel implementation and kept the call order ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Contemporary high performance processors use a common industry-wide
optimization known as "Speculative Store Bypass" in which loads from
addresses to which a recent store has occurred may (speculatively) see an
older value. Intel refers to this feature as "Memory Disambiguation" which
is part of their "Smart Memory Access" capability.
Memory Disambiguation can expose a cache side-channel attack against such
speculatively read values. An attacker can create exploit code that allows
them to read memory outside of a sandbox environment (for example,
malicious JavaScript in a web page), or to perform more complex attacks
against code running within the same privilege level, e.g. via the stack.
As a first step to mitigate against such attacks, provide two boot command
line control knobs:
nospec_store_bypass_disable
spec_store_bypass_disable=[off,auto,on]
By default affected x86 processors will power on with Speculative
Store Bypass enabled. Hence the provided kernel parameters are written
from the point of view of whether to enable a mitigation or not.
The parameters are as follows:
- auto - Kernel detects whether your CPU model contains an implementation
of Speculative Store Bypass and picks the most appropriate
mitigation.
- on - disable Speculative Store Bypass
- off - enable Speculative Store Bypass
[ tglx: Reordered the checks so that the whole evaluation is not done
when the CPU does not support RDS ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Add the sysfs file for the new vulerability. It does not do much except
show the words 'Vulnerable' for recent x86 cores.
Intel cores prior to family 6 are known not to be vulnerable, and so are
some Atoms and some Xeon Phi.
It assumes that older Cyrix, Centaur, etc. cores are immune.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
A guest may modify the SPEC_CTRL MSR from the value used by the
kernel. Since the kernel doesn't use IBRS, this means a value of zero is
what is needed in the host.
But the 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to
the other bits as reserved so the kernel should respect the boot time
SPEC_CTRL value and use that.
This allows to deal with future extensions to the SPEC_CTRL interface if
any at all.
Note: This uses wrmsrl() instead of native_wrmsl(). I does not make any
difference as paravirt will over-write the callq *0xfff.. with the wrmsrl
assembler code.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
The 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to all
the other bits as reserved. The Intel SDM glossary defines reserved as
implementation specific - aka unknown.
As such at bootup this must be taken it into account and proper masking for
the bits in use applied.
A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199511
[ tglx: Made x86_spec_ctrl_base __ro_after_init ]
Suggested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Those SysFS functions have a similar preamble, as such make common
code to handle them.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Combine the various logic which goes through all those
x86_cpu_id matching structures in one function.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
The recent commt which addresses the x86_phys_bits corruption with
encrypted memory on CPUID reload after a microcode update lost the reload
of CPUID_8000_0008_EBX as well.
As a consequence IBRS and IBRS_FW are not longer detected
Restore the behaviour by bringing the reload of CPUID_8000_0008_EBX
back. This restore has a twist due to the convoluted way the cpuid analysis
works:
CPUID_8000_0008_EBX is used by AMD to enumerate IBRB, IBRS, STIBP. On Intel
EBX is not used. But the speculation control code sets the AMD bits when
running on Intel depending on the Intel specific speculation control
bits. This was done to use the same bits for alternatives.
The change which moved the 8000_0008 evaluation out of get_cpu_cap() broke
this nasty scheme due to ordering. So that on Intel the store to
CPUID_8000_0008_EBX clears the IBRB, IBRS, STIBP bits which had been set
before by software.
So the actual CPUID_8000_0008_EBX needs to go back to the place where it
was and the phys/virt address space calculation cannot touch it.
In hindsight this should have used completely synthetic bits for IBRB,
IBRS, STIBP instead of reusing the AMD bits, but that's for 4.18.
/me needs to find time to cleanup that steaming pile of ...
Fixes: d94a155c59 ("x86/cpu: Prevent cpuinfo_x86::x86_phys_bits adjustment corruption")
Reported-by: Jörg Otte <jrg.otte@gmail.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jörg Otte <jrg.otte@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kirill.shutemov@linux.intel.com
Cc: Borislav Petkov <bp@alien8.de
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1805021043510.1668@nanos.tec.linutronix.de
mark_tsc_unstable() also needs to affect tsc_early, Now that
clocksource_mark_unstable() can be used on a clocksource irrespective of
its registration state, use it on both tsc_early and tsc.
This does however require cs->list to be initialized empty, otherwise it
cannot tell the registation state before registation.
Fixes: aa83c45762 ("x86/tsc: Introduce early tsc clocksource")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Diego Viola <diego.viola@gmail.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180430100344.533326547@infradead.org
Don't leave the tsc-early clocksource registered if it errors out
early.
This was reported by Diego, who on his Core2 era machine got TSC
invalidated while it was running with tsc-early (due to C-states).
This results in keeping tsc-early with very bad effects.
Reported-and-Tested-by: Diego Viola <diego.viola@gmail.com>
Fixes: aa83c45762 ("x86/tsc: Introduce early tsc clocksource")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180430100344.350507853@infradead.org
Xen PV domains cannot shut down and start a crash kernel. Instead,
the crashing kernel makes a SCHEDOP_shutdown hypercall with the
reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in
arch/x86/xen/enlighten_pv.c.
A crash kernel reservation is merely a waste of RAM in this case. It
may also confuse users of kexec_load(2) and/or kexec_file_load(2).
When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH,
respectively, these syscalls return success, which is technically
correct, but the crash kexec image will never be actually used.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: xen-devel@lists.xenproject.org
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jean Delvare <jdelvare@suse.de>
Link: https://lkml.kernel.org/r/20180425120835.23cef60c@ezekiel.suse.cz
From Skylake onwards, the platform controller hub (Sunrisepoint PCH) does
not support legacy DMA operations to IO ports 81h-83h, 87h, 89h-8Bh, 8Fh.
Currently this driver registers as syscore ops and its resume function is
called on every resume from S3. On Skylake and Kabylake, this causes a
resume delay of around 100ms due to port IO operations, which is a problem.
This change allows to load the driver only when the platform bios
explicitly supports such devices or has a cut-off date earlier than 2017
due to the following reasons:
- The platforms released before year 2017 have support for the 8237.
(except Sunrisepoint PCH e.g. Skylake)
- Some of the BIOS that were released for platforms (Skylake, Kabylake)
during 2016-17 are buggy. These BIOS do not set/unset the
ACPI_FADT_LEGACY_DEVICES field in FADT table properly based on the
presence or absence of the DMA device.
Very recently, open source system firmware like coreboot started unsetting
ACPI_FADT_LEGACY_DEVICES field in FADT table if the 8237 DMA device is not
present on the PCH.
Please refer to chapter 21 of 6th Generation Intel® Core™ Processor
Platform Controller Hub Family: BIOS Specification.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>
Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: rjw@rjwysocki.net
Cc: hpa@zytor.com
Cc: Alan Cox <alan@linux.intel.com>
Link: https://lkml.kernel.org/r/1522336015-22994-1-git-send-email-anshuman.gupta@intel.com
Make kernel print the correct number of TLB entries on Intel Xeon Phi 7210
(and others)
Before:
[ 0.320005] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
After:
[ 0.320005] Last level dTLB entries: 4KB 256, 2MB 128, 4MB 128, 1GB 16
The entries do exist in the official Intel SMD but the type column there is
incorrect (states "Cache" where it should read "TLB"), but the entries for
the values 0x6B, 0x6C and 0x6D are correctly described as 'Data TLB'.
Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20180423161425.24366-1-jacekt@dugeo.com
The whole reasoning behind the amount of opcode bytes dumped and prologue
length isn't very clear so write down some of the reasons for why it is
done the way it is.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-10-bp@alien8.de
Save the regs set when __die() is onvoked for the first time and print it
in oops_end().
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-9-bp@alien8.de
... which shows the Instruction Pointer along with the insn bytes around
it. Use it whenever rIP is printed. Drop the rIP < PAGE_OFFSET check since
probe_kernel_read() can handle any address properly.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-8-bp@alien8.de
Will be used in the next patch.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-6-bp@alien8.de
The code used to iterate byte-by-byte over the bytes around RIP and that
is expensive: disabling pagefaults around it, copy_from_user, etc...
Make it read the whole buffer of OPCODE_BUFSIZE size in one go. Use a
statically allocated 64 bytes buffer so that concurrent show_opcodes()
do not interleave in the output even though in the majority of the cases
it's serialized via die_lock. Except the #PF path which doesn't...
Also, do the PAGE_OFFSET check outside of the function because latter
will be reused in other context.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-5-bp@alien8.de
No functionality change, carve it out into a separate function for later
changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-4-bp@alien8.de
The only user outside of arch/ is not a module since
86cd47334b ("ACPI, APEI, GHES, Prevent GHES to be built as module")
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-3-bp@alien8.de
This was added by
86c4183742 ("[PATCH] i386: add option to show more code in oops reports")
long time ago but experience shows that 64 instruction bytes are plenty
when deciphering an oops. So get rid of it.
Removing it will simplify further enhancements to the opcodes dumping
machinery coming in the following patches.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180417161124.5294-2-bp@alien8.de
Recent AMD systems support using MWAIT for C1 state. However, MWAIT will
not allow deeper cstates than C1 on current systems.
play_dead() expects to use the deepest state available. The deepest state
available on AMD systems is reached through SystemIO or HALT. If MWAIT is
available, it is preferred over the other methods, so the CPU never reaches
the deepest possible state.
Don't try to use MWAIT to play_dead() on AMD systems. Instead, use CPUIDLE
to enter the deepest state advertised by firmware. If CPUIDLE is not
available then fallback to HALT.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Link: https://lkml.kernel.org/r/20180403140228.58540-1-Yazen.Ghannam@amd.com
Both powerpc and alpha have cases where they wronly set si_code to 0
in combination with SIGTRAP and don't mean SI_USER.
About half the time this is because the architecture can not report
accurately what kind of trap exception triggered the trap exception.
The other half the time it looks like no one has bothered to
figure out an appropriate si_code.
For the cases where the architecture does not have enough information
or is too lazy to figure out exactly what kind of trap exception
it is define TRAP_UNK.
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Call clear_siginfo to ensure every stack allocated siginfo is properly
initialized before being passed to the signal sending functions.
Note: It is not safe to depend on C initializers to initialize struct
siginfo on the stack because C is allowed to skip holes when
initializing a structure.
The initialization of struct siginfo in tracehook_report_syscall_exit
was moved from the helper user_single_step_siginfo into
tracehook_report_syscall_exit itself, to make it clear that the local
variable siginfo gets fully initialized.
In a few cases the scope of struct siginfo has been reduced to make it
clear that siginfo siginfo is not used on other paths in the function
in which it is declared.
Instances of using memset to initialize siginfo have been replaced
with calls clear_siginfo for clarity.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Vitezslav reported a case where the
"Timeout during microcode update!"
panic would hit. After a deeper look, it turned out that his .config had
CONFIG_HOTPLUG_CPU disabled which practically made save_mc_for_early() a
no-op.
When that happened, the discovered microcode patch wasn't saved into the
cache and the late loading path wouldn't find any.
This, then, lead to early exit from __reload_late() and thus CPUs waiting
until the timeout is reached, leading to the panic.
In hindsight, that function should have been written so it does not return
before the post-synchronization. Oh well, I know better now...
Fixes: bb8c13d61a ("x86/microcode: Fix CPU synchronization routine")
Reported-by: Vitezslav Samel <vitezslav@samel.cz>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vitezslav Samel <vitezslav@samel.cz>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20180418081140.GA2439@pc11.op.pod.cz
Link: https://lkml.kernel.org/r/20180421081930.15741-2-bp@alien8.de
save_mc_for_early() was a no-op on !CONFIG_HOTPLUG_CPU but the
generic_load_microcode() path saves the microcode patches it has found into
the cache of patches which is used for late loading too. Regardless of
whether CPU hotplug is used or not.
Make the saving unconditional so that late loading can find the proper
patch.
Reported-by: Vitezslav Samel <vitezslav@samel.cz>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vitezslav Samel <vitezslav@samel.cz>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20180418081140.GA2439@pc11.op.pod.cz
Link: https://lkml.kernel.org/r/20180421081930.15741-1-bp@alien8.de
GPL2.0 is not a valid SPDX identiier. Replace it with GPL-2.0.
Fixes: 4a362601ba ("x86/jailhouse: Add infrastructure for running in non-root cell")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Link: https://lkml.kernel.org/r/20180422220832.815346488@linutronix.de
Pull x86 fixes from Thomas Gleixner:
"A small set of fixes for x86:
- Prevent X2APIC ID 0xFFFFFFFF from being treated as valid, which
causes the possible CPU count to be wrong.
- Prevent 32bit truncation in calc_hpet_ref() which causes the TSC
calibration to fail
- Fix the page table setup for temporary text mappings in the resume
code which causes resume failures
- Make the page table dump code handle HIGHPTE correctly instead of
oopsing
- Support for topologies where NUMA nodes share an LLC to prevent a
invalid topology warning and further malfunction on such systems.
- Remove the now unused pci-nommu code
- Remove stale function declarations"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/power/64: Fix page-table setup for temporary text mapping
x86/mm: Prevent kernel Oops in PTDUMP code with HIGHPTE=y
x86,sched: Allow topologies where NUMA nodes share an LLC
x86/processor: Remove two unused function declarations
x86/acpi: Prevent X2APIC id 0xffffffff from being accounted
x86/tsc: Prevent 32bit truncation in calc_hpet_ref()
x86: Remove pci-nommu.c
Chun-Yi reported a kernel warning message below:
WARNING: CPU: 0 PID: 0 at ../mm/early_ioremap.c:182 early_iounmap+0x4f/0x12c()
early_iounmap(ffffffffff200180, 00000118) [0] size not consistent 00000120
The problem is x86 kexec_file_load adds extra alignment to the efi
memmap: in bzImage64_load():
efi_map_sz = efi_get_runtime_map_size();
efi_map_sz = ALIGN(efi_map_sz, 16);
And __efi_memmap_init maps with the size including the alignment bytes
but efi_memmap_unmap use nr_maps * desc_size which does not include the
extra bytes.
The alignment in kexec code is only needed for the kexec buffer internal
use Actually kexec should pass exact size of the efi memmap to 2nd
kernel.
Link: http://lkml.kernel.org/r/20180417083600.GA1972@dhcp-128-65.nay.redhat.com
Signed-off-by: Dave Young <dyoung@redhat.com>
Reported-by: joeyli <jlee@suse.com>
Tested-by: Randy Wright <rwright@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the first set of system call entry point changes to enable 32-bit
architectures to have variants on both 32-bit and 64-bit time_t. Typically
these system calls take a 'struct timespec' argument, but that structure
is defined in user space by the C library and its layout will change.
The kernel already supports handling the 32-bit time_t on 64-bit
architectures through the CONFIG_COMPAT mechanism. As there are a total
of 51 system calls suffering from this problem, reusing that mechanism
on 32-bit architectures.
We already have patches for most of the remaining system calls, but this
set contains most of the complexity and is best tested. There was one
last-minute regression that prevented it from going into 4.17, but that
is fixed now.
More details from Deepa's patch series description:
Big picture is as per the lwn article:
https://lwn.net/Articles/643234/ [2]
The series is directed at converting posix clock syscalls:
clock_gettime, clock_settime, clock_getres and clock_nanosleep
to use a new data structure __kernel_timespec at syscall boundaries.
__kernel_timespec maintains 64 bit time_t across all execution modes.
vdso will be handled as part of each architecture when they enable
support for 64 bit time_t.
The compat syscalls are repurposed to provide backward compatibility
by using them as native syscalls as well for 32 bit architectures.
They will continue to use timespec at syscall boundaries.
CONFIG_64_BIT_TIME controls whether the syscalls use __kernel_timespec
or timespec at syscall boundaries.
The series does the following:
1. Enable compat syscalls on 32 bit architectures.
2. Add a new __kernel_timespec type to be used as the data structure
for all the new syscalls.
3. Add new config CONFIG_64BIT_TIME(intead of the CONFIG_COMPAT_TIME in
[1] and [2] to switch to new definition of __kernel_timespec. It is
the same as struct timespec otherwise.
4. Add new CONFIG_32BIT_TIME to conditionally compile compat syscalls.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJa2IAgAAoJEGCrR//JCVInWDMP/2n44rfblcBVZSt+WPOBXIxD
nXkCrFqUQzhK/7ccQhd9Ij/Zjl+eed+nSe98fyfq23//eg18s9FCHqFYLlTTkJRt
iXvxCdjiKTO527VZcHy4gIQaovytbzLSn9PMKgaaOTh8bFiPi/JLHHw2IcC7Hg4X
oLxg+6XNBAN63JXgjzWF1mwmRyCOyN5JIUCIIQPySfRuQekPAd0EbgW8hvWvZJl/
L42VSszP5gPoSF1u+JKVtpNlDXB9POhoBSpVn+Kh19TJAYH9yxOOPxJ3RRvWGSS+
thMkNHlwJpyF3e5xgc24FgozW1lyKzMWSaUcYxLr0JNuehDX2oJCdpDkDQTXWPL2
IFIX7w/5wwVlC152wkAcwR/OdfrwhNiU9Ed6sgXZscm9MRN8Qdn1DjQ+xU79zalM
feeTdYST8L0MiLOafkQOJWbZzALibUQ+wnFWYGd66O5CMZLDcNU8oE3LbwODi8Gb
91LcFxCmdJMC+O3tRVONpZknG6+qyjXvNmaosgTE8KiHeOY7+FgCRRnVz5yYPKty
PHIajRP82+tf5b6tCZRkbQZJMWVA9AzCTS51DOXXrYK3LDF6X8wbQXPguVVZFbiN
mmXLHDEVjKC3SHhY/Y8FDkUfy+1dWA1Wd121T/84+UfTchLRJ2S9Yrye/0EvU4gj
Szb79+vKmtgK+R+Dn4Cu
=8Bch
-----END PGP SIGNATURE-----
Merge tag 'y2038-timekeeping' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground into timers/core
Pull y2038 timekeeping syscall changes from Arnd Bergmann:
This is the first set of system call entry point changes to enable 32-bit
architectures to have variants on both 32-bit and 64-bit time_t. Typically
these system calls take a 'struct timespec' argument, but that structure
is defined in user space by the C library and its layout will change.
The kernel already supports handling the 32-bit time_t on 64-bit
architectures through the CONFIG_COMPAT mechanism. As there are a total
of 51 system calls suffering from this problem, reusing that mechanism
on 32-bit architectures.
We already have patches for most of the remaining system calls, but this
set contains most of the complexity and is best tested. There was one
last-minute regression that prevented it from going into 4.17, but that
is fixed now.
More details from Deepa's patch series description:
Big picture is as per the lwn article:
https://lwn.net/Articles/643234/ [2]
The series is directed at converting posix clock syscalls:
clock_gettime, clock_settime, clock_getres and clock_nanosleep
to use a new data structure __kernel_timespec at syscall boundaries.
__kernel_timespec maintains 64 bit time_t across all execution modes.
vdso will be handled as part of each architecture when they enable
support for 64 bit time_t.
The compat syscalls are repurposed to provide backward compatibility
by using them as native syscalls as well for 32 bit architectures.
They will continue to use timespec at syscall boundaries.
CONFIG_64_BIT_TIME controls whether the syscalls use __kernel_timespec
or timespec at syscall boundaries.
The series does the following:
1. Enable compat syscalls on 32 bit architectures.
2. Add a new __kernel_timespec type to be used as the data structure
for all the new syscalls.
3. Add new config CONFIG_64BIT_TIME(intead of the CONFIG_COMPAT_TIME in
[1] and [2] to switch to new definition of __kernel_timespec. It is
the same as struct timespec otherwise.
4. Add new CONFIG_32BIT_TIME to conditionally compile compat syscalls.
Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
into two "slices", each containing half the cores, half the LLC, and one
memory controller and each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature.
CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This means
that the portion of the LLC *available* to a CPU depends on the data being
accessed:
Remote socket: entire package LLC is shared
Local socket->local slice: data goes into local slice LLC
Local socket->remote slice: data goes into remote-slice LLC. Slightly
higher latency than local slice LLC.
The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.
The CPU describes its cache hierarchy with the CPUID instruction. One of
the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.
But, the CPUID for the SNC configuration discussed above enumerates the LLC
as being shared by the entire package. This is not 100% precise because the
entire cache is not usable by all accesses. But, it *is* the way the
hardware enumerates itself, and this is not likely to change.
The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As noted
above, this is not true for local socket accesses. This patch does not
correct the sysfs info. It is the same, pre and post patch.
The current code emits the following warning:
sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
The warning is coming from the topology_sane() check in smpboot.c because
the topology is not matching the expectations of the model for obvious
reasons.
To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
the "coregroup" sched_domain_topology_level and use NUMA information from
the SRAT alone.
This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package level,
which are also NUMA boundaries.
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: brice.goglin@gmail.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com
RongQing reported that there are some X2APIC id 0xffffffff in his machine's
ACPI MADT table, which makes the number of possible CPU inaccurate.
The reason is that the ACPI X2APIC parser has no sanity check for APIC ID
0xffffffff, which is an invalid id in all APIC types. See "Intel® 64
Architecture x2APIC Specification", Chapter 2.4.1.
Add a sanity check to acpi_parse_x2apic() which ignores the invalid id.
Reported-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180412014052.25186-1-douly.fnst@cn.fujitsu.com
The TSC calibration code uses HPET as reference. The conversion normalizes
the delta of two HPET timestamps:
hpetref = ((tshpet1 - tshpet2) * HPET_PERIOD) / 1e6
and then divides the normalized delta of the corresponding TSC timestamps
by the result to calulate the TSC frequency.
tscfreq = ((tstsc1 - tstsc2 ) * 1e6) / hpetref
This uses do_div() which takes an u32 as the divisor, which worked so far
because the HPET frequency was low enough that 'hpetref' never exceeded
32bit.
On Skylake machines the HPET frequency increased so 'hpetref' can exceed
32bit. do_div() truncates the divisor, which causes the calibration to
fail.
Use div64_u64() to avoid the problem.
[ tglx: Fixes whitespace mangled patch and rewrote changelog ]
Signed-off-by: Xiaoming Gao <newtongao@tencent.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: peterz@infradead.org
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/38894564-4fc9-b8ec-353f-de702839e44e@gmail.com