This patch uses 'Avoid_Non_Temporal_Memset' flag to access
the non-temporal memset implementation for hygon processors.
Test Results:
hygon1 arch
x86_memset_non_temporal_threshold = 8MB
size new performance time / old performance time
1MB 0.994
4MB 0.996
8MB 0.670
16MB 0.343
32MB 0.355
hygon2 arch
x86_memset_non_temporal_threshold = 8MB
size new performance time / old performance time
1MB 1
4MB 1
8MB 1.312
16MB 0.822
32MB 0.830
hygon3 arch
x86_memset_non_temporal_threshold = 8MB
size new performance time / old performance time
1MB 1
4MB 0.990
8MB 0.737
16MB 0.390
32MB 0.401
For hygon arch with this patch, non-temporal stores can improve
performance by 20% - 65%.
Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add a new architecture type arch_kind_hygon to spilt Hygon branch
from AMD. This is to facilitate the Hygon processors to make settings
that are suitable for its own characteristics.
Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The goal of this flag is to allow targets which don't prefer/have ERMS
to still access the non-temporal memset implementation.
There are 4 cases for tuning memset:
1) `Avoid_STOSB && Avoid_Non_Temporal_Memset`
- Memset with temporal stores
2) `Avoid_STOSB && !Avoid_Non_Temporal_Memset`
- Memset with temporal/non-temporal stores. Non-temporal path
goes through `rep stosb` path. We accomplish this by setting
`x86_rep_stosb_threshold` to
`x86_memset_non_temporal_threshold`.
3) `!Avoid_STOSB && Avoid_Non_Temporal_Memset`
- Memset with temporal stores/`rep stosb`
3) `!Avoid_STOSB && !Avoid_Non_Temporal_Memset`
- Memset with temporal stores/`rep stosb`/non-temporal stores.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This is just a refactor and there should be no behavioral change from
this commit.
The goal is to make `Avoid_Non_Temporal_Memset` a more universal knob
for controlling whether we use non-temporal memset rather than having
extra logic based on vendor.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The original commit enabling non-temporal memset on Skylake Server had
erroneous benchmarks (actually done on ICX).
Further benchmarks indicate non-temporal stores may in fact by a
regression on Skylake Server.
This commit may be over-cautious in some cases, but should avoid any
regressions for 2.40.
Tested using qemu on all x86_64 cpu arch supported by both qemu +
GLIBC.
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Current 'non_temporal_threshold' set to 'non_temporal_threshold_lowbound'
on Zhaoxin processors without ERMS. The default
'non_temporal_threshold_lowbound' is too small for the KH-40000 and KX-7000
Zhaoxin processors, this patch updates the value to
'shared / cachesize_non_temporal_divisor'.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Fix code formatting under the Zhaoxin branch and add comments for
different Zhaoxin models.
Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
the AVX_Fast_Unaligned_Load.
Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
use sse2_unaligned version of memset,strcpy and strcat.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
_dl_tlsdesc_dynamic preserves RDI, RSI and RBX before realigning stack.
After realigning stack, it saves RCX, RDX, R8, R9, R10 and R11. Define
TLSDESC_CALL_REGISTER_SAVE_AREA to allocate space for RDI, RSI and RBX
to avoid clobbering saved RDI, RSI and RBX values on stack by xsave to
STATE_SAVE_OFFSET(%rsp).
+==================+<- stack frame start aligned at 8 or 16 bytes
| |<- RDI saved in the red zone
| |<- RSI saved in the red zone
| |<- RBX saved in the red zone
| |<- paddings for stack realignment of 64 bytes
|------------------|<- xsave buffer end aligned at 64 bytes
| |<-
| |<-
| |<-
|------------------|<- xsave buffer start at STATE_SAVE_OFFSET(%rsp)
| |<- 8-byte padding for 64-byte alignment
| |<- 8-byte padding for 64-byte alignment
| |<- R11
| |<- R10
| |<- R9
| |<- R8
| |<- RDX
| |<- RCX
+==================+<- RSP aligned at 64 bytes
Define TLSDESC_CALL_REGISTER_SAVE_AREA, the total register save area size
for all integer registers by adding 24 to STATE_SAVE_OFFSET since RDI, RSI
and RBX are saved onto stack without adjusting stack pointer first, using
the red-zone. This fixes BZ #31501.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Replace minimum ISA check ifdef conditional with if. Since
MINIMUM_X86_ISA_LEVEL and AVX_X86_ISA_LEVEL are compile time constants,
compiler will perform constant folding optimization, getting same
results.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
_dl_tlsdesc_dynamic should also preserve AMX registers which are
caller-saved. Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID
to x86-64 TLSDESC_CALL_STATE_SAVE_MASK. Compute the AMX state size
and save it in xsave_state_full_size which is only used by
_dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec. This fixes
the AMX part of BZ #31372. Tested on AMX processor.
AMX test is enabled only for compilers with the fix for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114098
GCC 14 and GCC 11/12/13 branches have the bug fix.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
When glibc is built with ISA level 3 or above enabled, SSE resolvers
aren't available and glibc fails to build:
ld: .../elf/librtld.os: in function `init_cpu_features':
.../elf/../sysdeps/x86/cpu-features.c:1200:(.text+0x1445f): undefined reference to `_dl_runtime_resolve_fxsave'
ld: .../elf/librtld.os: relocation R_X86_64_PC32 against undefined hidden symbol `_dl_runtime_resolve_fxsave' can not be used when making a shared object
/usr/local/bin/ld: final link failed: bad value
For ISA level 3 or above, don't use _dl_runtime_resolve_fxsave nor
_dl_tlsdesc_dynamic_fxsave.
This fixes BZ #31429.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Compiler generates the following instruction sequence for GNU2 dynamic
TLS access:
leaq tls_var@TLSDESC(%rip), %rax
call *tls_var@TLSCALL(%rax)
or
leal tls_var@TLSDESC(%ebx), %eax
call *tls_var@TLSCALL(%eax)
CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS and RAX/EAX, are unchanged after CALL. When
_dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
path. __tls_get_addr is a normal function which doesn't preserve any
caller-saved registers. _dl_tlsdesc_dynamic saved and restored integer
caller-saved registers, but didn't preserve any other caller-saved
registers. Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
XSAVE and XSAVEC to save and restore all caller-saved registers. This
fixes BZ #31372.
Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
to optimize elf_machine_runtime_setup.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Systemd execution environment configuration may prohibit changing a memory
mapping to become executable:
MemoryDenyWriteExecute=
Takes a boolean argument. If set, attempts to create memory mappings
that are writable and executable at the same time, or to change existing
memory mappings to become executable, or mapping shared memory segments
as executable, are prohibited.
When it is set, systemd service stops working if PLT rewrite is enabled.
Check if mprotect works before rewriting PLT. This fixes BZ #31230.
This also works with SELinux when deny_execmem is on.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
1. Remove _dl_runtime_resolve_shstk and _dl_runtime_profile_shstk.
2. Move CET offsets from x86 cpu-features-offsets.sym to x86-64
features-offsets.sym.
3. Rename x86 cet-control.h to x86-64 feature-control.h since it is only
for x86-64 and also used for PLT rewrite.
4. Add x86-64 ldsodefs.h to include feature-control.h.
5. Change TUNABLE_CALLBACK (set_plt_rewrite) to x86-64 only.
6. Move x86 dl-procruntime.c to x86-64.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add ELF_DYNAMIC_AFTER_RELOC to allow target specific processing after
relocation.
For x86-64, add
#define DT_X86_64_PLT (DT_LOPROC + 0)
#define DT_X86_64_PLTSZ (DT_LOPROC + 1)
#define DT_X86_64_PLTENT (DT_LOPROC + 3)
1. DT_X86_64_PLT: The address of the procedure linkage table.
2. DT_X86_64_PLTSZ: The total size, in bytes, of the procedure linkage
table.
3. DT_X86_64_PLTENT: The size, in bytes, of a procedure linkage table
entry.
With the r_addend field of the R_X86_64_JUMP_SLOT relocation set to the
memory offset of the indirect branch instruction.
Define ELF_DYNAMIC_AFTER_RELOC for x86-64 to rewrite the PLT section
with direct branch after relocation when the lazy binding is disabled.
PLT rewrite is disabled by default since SELinux may disallow modifying
code pages and ld.so can't detect it in all cases. Use
$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=1
to enable PLT rewrite with 32-bit direct jump at run-time or
$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=2
to enable PLT rewrite with 32-bit direct jump and on APX processors with
64-bit absolute jump at run-time.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Not all CET enabled applications and libraries have been properly tested
in CET enabled environments. Some CET enabled applications or libraries
will crash or misbehave when CET is enabled. Don't set CET active by
default so that all applications and libraries will run normally regardless
of whether CET is active or not. Shadow stack can be enabled by
$ export GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK
at run-time if shadow stack can be enabled by kernel.
NB: This commit can be reverted if it is OK to enable CET by default for
all applications and libraries.
Previously, CET was enabled by kernel before passing control to user
space and the startup code must disable CET if applications or shared
libraries aren't CET enabled. Since the current kernel only supports
shadow stack and won't enable shadow stack before passing control to
user space, we need to enable shadow stack during startup if the
application and all shared library are shadow stack enabled. There
is no need to disable shadow stack at startup. Shadow stack can only
be enabled in a function which will never return. Otherwise, shadow
stack will underflow at the function return.
1. GL(dl_x86_feature_1) is set to the CET features which are supported
by the processor and are not disabled by the tunable. Only non-zero
features in GL(dl_x86_feature_1) should be enabled. After enabling
shadow stack with ARCH_SHSTK_ENABLE, ARCH_SHSTK_STATUS is used to check
if shadow stack is really enabled.
2. Use ARCH_SHSTK_ENABLE in RTLD_START in dynamic executable. It is
safe since RTLD_START never returns.
3. Call arch_prctl (ARCH_SHSTK_ENABLE) from ARCH_SETUP_TLS in static
executable. Since the start function using ARCH_SETUP_TLS never returns,
it is safe to enable shadow stack in ARCH_SETUP_TLS.
Sync with Linux kernel 6.6 shadow stack interface. Since only x86-64 is
supported, i386 shadow stack codes are unchanged and CET shouldn't be
enabled for i386.
1. When the shadow stack base in TCB is unset, the default shadow stack
is in use. Use the current shadow stack pointer as the marker for the
default shadow stack. It is used to identify if the current shadow stack
is the same as the target shadow stack when switching ucontexts. If yes,
INCSSP will be used to unwind shadow stack. Otherwise, shadow stack
restore token will be used.
2. Allocate shadow stack with the map_shadow_stack syscall. Since there
is no function to explicitly release ucontext, there is no place to
release shadow stack allocated by map_shadow_stack in ucontext functions.
Such shadow stacks will be leaked.
3. Rename arch_prctl CET commands to ARCH_SHSTK_XXX.
4. Rewrite the CET control functions with the current kernel shadow stack
interface.
Since CET is no longer enabled by kernel, a separate patch will enable
shadow stack during startup.
This commit add support for the new AVX10 cpu features:
https://cdrdv2-public.intel.com/784267/355989-intel-avx10-spec.pdf
We add checks for:
- `AVX10`: Check if AVX10 is present.
- `AVX10_{X,Y,Z}MM`: Check if a given vec class has AVX10 support.
`make check` passes and cpuid output was checked against GNR/DMR on an
emulator.
Different systems prefer a different divisors.
From benchmarks[1] so far the following divisors have been found:
ICX : 2
SKX : 2
BWD : 8
For Intel, we are generalizing that BWD and older prefers 8 as a
divisor, and SKL and newer prefers 2. This number can be further tuned
as benchmarks are run.
[1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
Reviewed-by: DJ Delorie <dj@redhat.com>
This patch should have no affect on existing functionality.
The current code, which has a single switch for model detection and
setting prefered features, is difficult to follow/extend. The cases
use magic numbers and many microarchitectures are missing. This makes
it difficult to reason about what is implemented so far and/or
how/where to add support for new features.
This patch splits the model detection and preference setting stages so
that CPU preferences can be set based on a complete list of available
microarchitectures, rather than based on model magic numbers.
Reviewed-by: DJ Delorie <dj@redhat.com>
Linux kernel uses AT_HWCAP2 to indicate if FSGSBASE instructions are
enabled. If the HWCAP2_FSGSBASE bit in AT_HWCAP2 is set, FSGSBASE
instructions can be used in user space. Define dl_check_hwcap2 to set
the FSGSBASE feature to active on Linux when the HWCAP2_FSGSBASE bit is
set.
Add a test to verify that FSGSBASE is active on current kernels.
NB: This test will fail if the kernel doesn't set the HWCAP2_FSGSBASE
bit in AT_HWCAP2 while fsgsbase shows up in /proc/cpuinfo.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
And make always supported. The configure option was added on glibc 2.25
and some features require it (such as hwcap mask, huge pages support, and
lock elisition tuning). It also simplifies the build permutations.
Changes from v1:
* Remove glibc.rtld.dynamic_sort changes, it is orthogonal and needs
more discussion.
* Cleanup more code.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Crossing 2GB boundaries with indirect calls and jumps can use more
branch prediction resources on Intel Golden Cove CPU (see the
"Misprediction for Branches >2GB" section in Intel 64 and IA-32
Architectures Optimization Reference Manual.) There is visible
performance improvement on workloads with many PLT calls when executable
and shared libraries are mmapped below 2GB. Add the Prefer_MAP_32BIT_EXEC
bit so that mmap will try to map executable or denywrite pages in shared
libraries with MAP_32BIT first.
NB: Prefer_MAP_32BIT_EXEC reduces bits available for address space
layout randomization (ASLR), which is always disabled for SUID programs
and can only be enabled by the tunable, glibc.cpu.prefer_map_32bit_exec,
or the environment variable, LD_PREFER_MAP_32BIT_EXEC. This works only
between shared libraries or between shared libraries and executables with
addresses below 2GB. PIEs are usually loaded at a random address above
4GB by the kernel.
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
commit 3ec5d83d2a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Sat Jan 25 14:19:40 2020 -0800
x86-64: Avoid rep movsb with short distance [BZ #27130]
introduced some regressions on Intel processors without Fast Short REP
MOV (FSRM). Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
short distance only on Intel processors with FSRM. bench-memmove-large
on Skylake server shows that cycles of __memmove_evex_unaligned_erms
improves for the following data size:
before after Improvement
length=4127, align1=3, align2=0: 479.38 349.25 27%
length=4223, align1=9, align2=5: 405.62 333.25 18%
length=8223, align1=3, align2=0: 786.12 496.38 37%
length=8319, align1=9, align2=5: 727.50 501.38 31%
length=16415, align1=3, align2=0: 1436.88 840.00 41%
length=16511, align1=9, align2=5: 1375.50 836.38 39%
length=32799, align1=3, align2=0: 2890.00 1860.12 36%
length=32895, align1=9, align2=5: 2891.38 1931.88 33%
1. Install <bits/platform/x86.h> for <sys/platform/x86.h> which includes
<bits/platform/x86.h>.
2. Rename HAS_CPU_FEATURE to CPU_FEATURE_PRESENT which checks if the
processor has the feature.
3. Rename CPU_FEATURE_USABLE to CPU_FEATURE_ACTIVE which checks if the
feature is active. There may be other preconditions, like sufficient
stack space or further setup for AMX, which must be satisfied before the
feature can be used.
This fixes BZ #27958.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
From
https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html
* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
which is set to indicate to updated software that the loaded microcode is
forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
be set by default after microcode update.
* Workloads that were benefited from Intel TSX might experience a change
in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
Elision (HLE) and RTM bits to indicate to software that Intel TSX is
disabled.
1. Add RTM_ALWAYS_ABORT to CPUID features.
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set. This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcde update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.
This fixes BZ #28033.
IBT and SHSTK usable bits are copied from CPUID feature bits and later
cleared if kernel doesn't support CET. Copy IBT and SHSTK usable only
if CET is enabled so that they aren't set on CET capable processors
with non-CET enabled glibc.
1. Set Prefer_No_VZEROUPPER if RTM is usable to avoid RTM abort triggered
by VZEROUPPER inside a transactionally executing RTM region.
2. Since to compare 2 32-byte strings, 256-bit EVEX strcmp requires 2
loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp requires 1 load, 2 VPCMPEQs,
1 VPMINU and 1 VPMOVMSKB, AVX2 strcmp is faster than EVEX strcmp. Add
Prefer_AVX2_STRCMP to prefer AVX2 strcmp family functions.
1. Support GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVE.
2. Disable all features which depend on XSAVE:
a. If OSXSAVE is disabled by glibc tunables. Or
b. If both XSAVE and XSAVEC aren't usable.
1. Add CPUID_INDEX_14_ECX_0 for CPUID leaf 0x14 to detect PTWRITE feature
in EBX of CPUID leaf 0x14 with ECX == 0.
2. Add PTWRITE detection to CPU feature tests.
3. Add 2 static CPU feature tests.