Commit graph

119 commits

Paul Eggert
2642002380 Update copyright dates with scripts/update-copyrights 2025-01-01 11:22:09 -08:00
Feifei Wang
ca90758b2a x86: Enable non-temporal memset for Hygon processors
This patch uses the 'Avoid_Non_Temporal_Memset' flag to enable
the non-temporal memset implementation for Hygon processors.

Test Results:

hygon1 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance time / old performance time
1MB                           0.994
4MB                           0.996
8MB                           0.670
16MB                          0.343
32MB                          0.355

hygon2 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance time / old performance time
1MB                           1
4MB                           1
8MB                           1.312
16MB                          0.822
32MB                          0.830

hygon3 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance time / old performance time
1MB                           1
4MB                           0.990
8MB                           0.737
16MB                          0.390
32MB                          0.401

With this patch, non-temporal stores can improve performance on
Hygon architectures by 20% - 65%.

Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-08-26 10:01:58 -07:00
Feifei Wang
6b08116b2d x86: Add new architecture type for Hygon processors
Add a new architecture type, arch_kind_hygon, to split the Hygon
branch from AMD.  This allows Hygon processors to use settings
suited to their own characteristics.
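
A minimal sketch of the kind of vendor check this split rests on, using
__get_cpuid from GCC's <cpuid.h> (the actual logic lives in glibc's
cpu-features.c and differs in structure; is_hygon is an illustrative
name): Hygon processors report the CPUID leaf-0 vendor string
"HygonGenuine" rather than AMD's "AuthenticAMD".

    #include <cpuid.h>
    #include <stdio.h>

    /* Sketch: the vendor string is returned across EBX/EDX/ECX as
       "Hygo" "nGen" "uine"; compare the packed little-endian dwords.  */
    static int
    is_hygon (void)
    {
      unsigned int eax, ebx, ecx, edx;
      if (!__get_cpuid (0, &eax, &ebx, &ecx, &edx))
        return 0;
      return ebx == 0x6f677948 && edx == 0x6e65476e && ecx == 0x656e6975;
    }

    int
    main (void)
    {
      puts (is_hygon () ? "arch_kind_hygon" : "not Hygon");
      return 0;
    }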

Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-08-26 10:01:58 -07:00
Noah Goldstein
f446d90fe6 x86: Add Avoid_STOSB tunable to allow NT memset without ERMS
The goal of this flag is to allow targets which don't prefer/have ERMS
to still access the non-temporal memset implementation.

There are 4 cases for tuning memset (a sketch of the case 2 mechanism
follows this list):
    1) `Avoid_STOSB && Avoid_Non_Temporal_Memset`
        - Memset with temporal stores
    2) `Avoid_STOSB && !Avoid_Non_Temporal_Memset`
        - Memset with temporal/non-temporal stores. Non-temporal path
          goes through `rep stosb` path. We accomplish this by setting
          `x86_rep_stosb_threshold` to
          `x86_memset_non_temporal_threshold`.
    3) `!Avoid_STOSB && Avoid_Non_Temporal_Memset`
        - Memset with temporal stores/`rep stosb`
    4) `!Avoid_STOSB && !Avoid_Non_Temporal_Memset`
        - Memset with temporal stores/`rep stosb`/non-temporal stores.
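
A minimal sketch of the case 2 mechanism, with stand-in globals for the
two thresholds (names mirror the commit text, not the actual code in
glibc's cache-info setup):

    #include <stddef.h>

    /* Stand-in values: a typical `rep stosb` threshold and the 8MB
       non-temporal threshold seen in the Hygon results above.  */
    static size_t x86_rep_stosb_threshold = 2048;
    static size_t x86_memset_non_temporal_threshold = 8UL << 20;

    /* Case 2: Avoid_STOSB set, Avoid_Non_Temporal_Memset clear.
       Per the commit text, raising the `rep stosb` threshold to the
       non-temporal threshold routes the non-temporal path through
       the `rep stosb` branch for sizes at or above it.  */
    static void
    apply_avoid_stosb (int avoid_stosb, int avoid_nt_memset)
    {
      if (avoid_stosb && !avoid_nt_memset)
        x86_rep_stosb_threshold = x86_memset_non_temporal_threshold;
    }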
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-08-15 08:19:15 -07:00
Noah Goldstein
b93dddfaf4 x86: Use Avoid_Non_Temporal_Memset to control non-temporal path
This is just a refactor and there should be no behavioral change from
this commit.

The goal is to make `Avoid_Non_Temporal_Memset` a more universal knob
for controlling whether we use non-temporal memset rather than having
extra logic based on vendor.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-08-15 08:19:15 -07:00
Florian Weimer
0df48472ff x86: Add missing switch/case fall-through markers to init_cpu_features
The commits introducing these fall-throughs intended them to
happen.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-08-02 18:08:14 +02:00
Noah Goldstein
5bcf6265f2 x86: Disable non-temporal memset on Skylake Server
The original commit enabling non-temporal memset on Skylake Server had
erroneous benchmarks (actually done on ICX).

Further benchmarks indicate non-temporal stores may in fact be a
regression on Skylake Server.

This commit may be over-cautious in some cases, but should avoid any
regressions for 2.40.

Tested using qemu on all x86_64 CPU archs supported by both qemu
and glibc.

Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-07-16 17:20:18 +08:00
MayShao-oc
9dc645cb56 x86: Set default non_temporal_threshold for Zhaoxin processors
Currently, 'non_temporal_threshold' is set to
'non_temporal_threshold_lowbound' on Zhaoxin processors without ERMS.
The default 'non_temporal_threshold_lowbound' is too small for the
KH-40000 and KX-7000 Zhaoxin processors, so this patch updates the
value to 'shared / cachesize_non_temporal_divisor'.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-06-30 06:26:43 -07:00
MayShao-oc
44d757eb9f x86: Set preferred CPU features on the KH-40000 and KX-7000 Zhaoxin processors
Fix code formatting under the Zhaoxin branch and add comments for
different Zhaoxin models.

Unaligned AVX loads are slower on KH-40000 and KX-7000, so disable
AVX_Fast_Unaligned_Load.

Enable the Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
use the sse2_unaligned versions of memset, strcpy and strcat.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-06-30 06:26:43 -07:00
H.J. Lu
717ebfa85c x86-64: Allocate state buffer space for RDI, RSI and RBX
_dl_tlsdesc_dynamic preserves RDI, RSI and RBX before realigning stack.
After realigning stack, it saves RCX, RDX, R8, R9, R10 and R11.  Define
TLSDESC_CALL_REGISTER_SAVE_AREA to allocate space for RDI, RSI and RBX
to avoid clobbering saved RDI, RSI and RBX values on stack by xsave to
STATE_SAVE_OFFSET(%rsp).

   +==================+<- stack frame start aligned at 8 or 16 bytes
   |                  |<- RDI saved in the red zone
   |                  |<- RSI saved in the red zone
   |                  |<- RBX saved in the red zone
   |                  |<- paddings for stack realignment of 64 bytes
   |------------------|<- xsave buffer end aligned at 64 bytes
   |                  |<-
   |                  |<-
   |                  |<-
   |------------------|<- xsave buffer start at STATE_SAVE_OFFSET(%rsp)
   |                  |<- 8-byte padding for 64-byte alignment
   |                  |<- 8-byte padding for 64-byte alignment
   |                  |<- R11
   |                  |<- R10
   |                  |<- R9
   |                  |<- R8
   |                  |<- RDX
   |                  |<- RCX
   +==================+<- RSP aligned at 64 bytes

Define TLSDESC_CALL_REGISTER_SAVE_AREA, the total register save area
size for all integer registers, by adding 24 to STATE_SAVE_OFFSET,
since RDI, RSI and RBX are saved onto the stack without adjusting the
stack pointer first, using the red zone.  This fixes BZ #31501.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
2024-03-18 19:45:13 -07:00
Sunil K Pandey
b6e3898194 x86-64: Simplify minimum ISA check ifdef conditional with if
Replace the minimum ISA check ifdef conditional with if.  Since
MINIMUM_X86_ISA_LEVEL and AVX_X86_ISA_LEVEL are compile-time constants,
the compiler will perform constant folding and produce the same
results.
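
An illustrative before/after of the change, with stand-in macro values
and a stub function so the fragment compiles (a sketch, not the actual
glibc code):

    #define MINIMUM_X86_ISA_LEVEL 3   /* stand-in values */
    #define AVX_X86_ISA_LEVEL 3

    static void use_avx_path (void) { }

    static void
    dispatch (void)
    {
    #if MINIMUM_X86_ISA_LEVEL >= AVX_X86_ISA_LEVEL
      use_avx_path ();          /* before: preprocessor conditional */
    #endif

      if (MINIMUM_X86_ISA_LEVEL >= AVX_X86_ISA_LEVEL)
        use_avx_path ();        /* after: folded at compile time */
    }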

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-03-03 15:47:53 -08:00
H.J. Lu
9b7091415a x86-64: Update _dl_tlsdesc_dynamic to preserve AMX registers
_dl_tlsdesc_dynamic should also preserve AMX registers which are
caller-saved.  Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID
to x86-64 TLSDESC_CALL_STATE_SAVE_MASK.  Compute the AMX state size
and save it in xsave_state_full_size which is only used by
_dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec.  This fixes
the AMX part of BZ #31372.  Tested on AMX processor.

AMX test is enabled only for compilers with the fix for

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114098

GCC 14 and GCC 11/12/13 branches have the bug fix.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
2024-02-29 04:30:01 -08:00
H.J. Lu
befe2d3c4d x86-64: Don't use SSE resolvers for ISA level 3 or above
When glibc is built with ISA level 3 or above enabled, SSE resolvers
aren't available and glibc fails to build:

ld: .../elf/librtld.os: in function `init_cpu_features':
.../elf/../sysdeps/x86/cpu-features.c:1200:(.text+0x1445f): undefined reference to `_dl_runtime_resolve_fxsave'
ld: .../elf/librtld.os: relocation R_X86_64_PC32 against undefined hidden symbol `_dl_runtime_resolve_fxsave' can not be used when making a shared object
/usr/local/bin/ld: final link failed: bad value

For ISA level 3 or above, don't use _dl_runtime_resolve_fxsave or
_dl_tlsdesc_dynamic_fxsave.

This fixes BZ #31429.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-02-28 11:49:30 -08:00
H.J. Lu
0aac205a81 x86: Update _dl_tlsdesc_dynamic to preserve caller-saved registers
The compiler generates the following instruction sequence for GNU2
dynamic TLS access:

	leaq	tls_var@TLSDESC(%rip), %rax
	call	*tls_var@TLSCALL(%rax)

or

	leal	tls_var@TLSDESC(%ebx), %eax
	call	*tls_var@TLSCALL(%eax)

The CALL instruction is transparent to the compiler, which assumes all
registers, except for EFLAGS and RAX/EAX, are unchanged after CALL.  When
_dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
path.  __tls_get_addr is a normal function which doesn't preserve any
caller-saved registers.  _dl_tlsdesc_dynamic saved and restored integer
caller-saved registers, but didn't preserve any other caller-saved
registers.  Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
XSAVE and XSAVEC to save and restore all caller-saved registers.  This
fixes BZ #31372.
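
For context, a minimal source-level example that produces the sequence
above when compiled with GCC's -mtls-dialect=gnu2 (variable and
function names are illustrative):

    /* Compile with: gcc -fPIC -shared -mtls-dialect=gnu2 tls.c
       The access below emits the leaq/call *@TLSCALL pair shown
       above and may reach _dl_tlsdesc_dynamic on the slow path.  */
    __thread int tls_var;

    int
    get_tls_var (void)
    {
      return tls_var;
    }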

Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
to optimize elf_machine_runtime_setup.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-02-28 09:02:56 -08:00
H.J. Lu
457bd9cf2e x86-64: Check if mprotect works before rewriting PLT
Systemd execution environment configuration may prohibit changing a memory
mapping to become executable:

MemoryDenyWriteExecute=
Takes a boolean argument. If set, attempts to create memory mappings
that are writable and executable at the same time, or to change existing
memory mappings to become executable, or mapping shared memory segments
as executable, are prohibited.

When it is set, systemd services stop working if PLT rewrite is enabled.
Check if mprotect works before rewriting PLT.  This fixes BZ #31230.
This also works with SELinux when deny_execmem is on.
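
A hedged sketch of such a probe (the check actually performed by ld.so
differs in detail; can_make_executable is an illustrative name):

    #include <sys/mman.h>
    #include <unistd.h>

    /* Sketch: map a scratch page read/write, then test whether
       mprotect may flip it to executable.  Under systemd's
       MemoryDenyWriteExecute= or SELinux deny_execmem this fails,
       and PLT rewrite must stay disabled.  */
    static int
    can_make_executable (void)
    {
      long ps = sysconf (_SC_PAGESIZE);
      void *p = mmap (NULL, ps, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED)
        return 0;
      int ok = mprotect (p, ps, PROT_READ | PROT_EXEC) == 0;
      munmap (p, ps);
      return ok;
    }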
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2024-01-15 06:59:23 -08:00
H.J. Lu
874214db62 i386: Remove CET support bits
1. Remove _dl_runtime_resolve_shstk and _dl_runtime_profile_shstk.
2. Move CET offsets from x86 cpu-features-offsets.sym to x86-64
features-offsets.sym.
3. Rename x86 cet-control.h to x86-64 feature-control.h since it is only
for x86-64 and also used for PLT rewrite.
4. Add x86-64 ldsodefs.h to include feature-control.h.
5. Change TUNABLE_CALLBACK (set_plt_rewrite) to x86-64 only.
6. Move x86 dl-procruntime.c to x86-64.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-01-10 05:20:20 -08:00
H.J. Lu
848746e88e elf: Add ELF_DYNAMIC_AFTER_RELOC to rewrite PLT
Add ELF_DYNAMIC_AFTER_RELOC to allow target specific processing after
relocation.

For x86-64, add

 #define DT_X86_64_PLT     (DT_LOPROC + 0)
 #define DT_X86_64_PLTSZ   (DT_LOPROC + 1)
 #define DT_X86_64_PLTENT  (DT_LOPROC + 3)

1. DT_X86_64_PLT: The address of the procedure linkage table.
2. DT_X86_64_PLTSZ: The total size, in bytes, of the procedure linkage
table.
3. DT_X86_64_PLTENT: The size, in bytes, of a procedure linkage table
entry.

The r_addend field of the R_X86_64_JUMP_SLOT relocation is set to the
memory offset of the indirect branch instruction.

Define ELF_DYNAMIC_AFTER_RELOC for x86-64 to rewrite the PLT section
with direct branch after relocation when the lazy binding is disabled.

PLT rewrite is disabled by default since SELinux may disallow modifying
code pages and ld.so can't detect it in all cases.  Use

$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=1

to enable PLT rewrite with 32-bit direct jump at run-time or

$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=2

to enable PLT rewrite with 32-bit direct jump and on APX processors with
64-bit absolute jump at run-time.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-01-05 05:49:49 -08:00
Paul Eggert
dff8da6b3e Update copyright dates with scripts/update-copyrights 2024-01-01 10:53:40 -08:00
H.J. Lu
55d63e7312 x86/cet: Don't set CET active by default
Not all CET enabled applications and libraries have been properly tested
in CET enabled environments.  Some CET enabled applications or libraries
will crash or misbehave when CET is enabled.  Don't set CET active by
default so that all applications and libraries will run normally regardless
of whether CET is active or not.  Shadow stack can be enabled by

$ export GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK

at run-time if shadow stack can be enabled by kernel.

NB: This commit can be reverted if it is OK to enable CET by default for
all applications and libraries.
2024-01-01 05:22:48 -08:00
H.J. Lu
541641a3de x86/cet: Enable shadow stack during startup
Previously, CET was enabled by kernel before passing control to user
space and the startup code must disable CET if applications or shared
libraries aren't CET enabled.  Since the current kernel only supports
shadow stack and won't enable shadow stack before passing control to
user space, we need to enable shadow stack during startup if the
application and all shared libraries are shadow stack enabled.  There
is no need to disable shadow stack at startup.  Shadow stack can only
be enabled in a function which will never return.  Otherwise, shadow
stack will underflow at the function return.

1. GL(dl_x86_feature_1) is set to the CET features which are supported
by the processor and are not disabled by the tunable.  Only non-zero
features in GL(dl_x86_feature_1) should be enabled.  After enabling
shadow stack with ARCH_SHSTK_ENABLE, ARCH_SHSTK_STATUS is used to check
if shadow stack is really enabled.
2. Use ARCH_SHSTK_ENABLE in RTLD_START in dynamic executable.  It is
safe since RTLD_START never returns.
3. Call arch_prctl (ARCH_SHSTK_ENABLE) from ARCH_SETUP_TLS in static
executable.  Since the start function using ARCH_SETUP_TLS never returns,
it is safe to enable shadow stack in ARCH_SETUP_TLS.
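
A hedged sketch of the arch_prctl sequence described above, with
constants believed to match the Linux 6.6 uapi headers (illustrative
only; outside ld.so this would have to run in a function that never
returns):

    #include <sys/syscall.h>
    #include <unistd.h>

    #define ARCH_SHSTK_ENABLE  0x5001   /* assumed Linux 6.6 uapi values */
    #define ARCH_SHSTK_STATUS  0x5005
    #define ARCH_SHSTK_SHSTK   (1UL << 0)

    /* Sketch: enable shadow stack, then use ARCH_SHSTK_STATUS to
       check whether it is really enabled, as the commit describes.  */
    static int
    enable_shstk (void)
    {
      unsigned long status = 0;
      if (syscall (SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK) != 0)
        return 0;
      if (syscall (SYS_arch_prctl, ARCH_SHSTK_STATUS, &status) != 0)
        return 0;
      return (status & ARCH_SHSTK_SHSTK) != 0;
    }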
2024-01-01 05:22:48 -08:00
H.J. Lu
edb5e0c8f9 x86/cet: Sync with Linux kernel 6.6 shadow stack interface
Sync with Linux kernel 6.6 shadow stack interface.  Since only x86-64 is
supported, i386 shadow stack codes are unchanged and CET shouldn't be
enabled for i386.

1. When the shadow stack base in TCB is unset, the default shadow stack
is in use.  Use the current shadow stack pointer as the marker for the
default shadow stack. It is used to identify if the current shadow stack
is the same as the target shadow stack when switching ucontexts.  If yes,
INCSSP will be used to unwind shadow stack.  Otherwise, shadow stack
restore token will be used.
2. Allocate shadow stack with the map_shadow_stack syscall.  Since there
is no function to explicitly release ucontext, there is no place to
release shadow stack allocated by map_shadow_stack in ucontext functions.
Such shadow stacks will be leaked.
3. Rename arch_prctl CET commands to ARCH_SHSTK_XXX.
4. Rewrite the CET control functions with the current kernel shadow stack
interface.

Since CET is no longer enabled by kernel, a separate patch will enable
shadow stack during startup.
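
A hedged sketch of the allocation in item 2, with the x86-64 syscall
number and token flag believed to match the Linux 6.6 uapi
(alloc_shadow_stack is an illustrative name):

    #include <stddef.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef __NR_map_shadow_stack
    # define __NR_map_shadow_stack 453           /* assumed x86-64 number */
    #endif
    #define SHADOW_STACK_SET_TOKEN (1UL << 0)    /* write restore token */

    /* Sketch: allocate a shadow stack for a ucontext.  As noted in
       the commit, nothing ever releases it, so it is leaked.  */
    static void *
    alloc_shadow_stack (size_t size)
    {
      long addr = syscall (__NR_map_shadow_stack, 0UL, size,
                           SHADOW_STACK_SET_TOKEN);
      return addr == -1 ? NULL : (void *) addr;
    }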
2024-01-01 05:22:48 -08:00
Noah Goldstein
d90b43a4ed x86: Add support for AVX10 preset and vec size in cpu-features
This commit adds support for the new AVX10 cpu features:
https://cdrdv2-public.intel.com/784267/355989-intel-avx10-spec.pdf

We add checks for:
    - `AVX10`: Check if AVX10 is present.
    - `AVX10_{X,Y,Z}MM`: Check if a given vec class has AVX10 support.

`make check` passes and cpuid output was checked against GNR/DMR on an
emulator.
2023-09-29 14:18:42 -05:00
H.J. Lu
1547d6a64f <sys/platform/x86.h>: Add APX support
Add support for Intel Advanced Performance Extensions:

https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

to <sys/platform/x86.h>.
2023-07-27 08:42:32 -07:00
Paul Pluzhnikov
4290aed051 Fix misspellings -- BZ 25337 2023-06-19 21:58:33 +00:00
Noah Goldstein
180897c161 x86: Make the divisor in setting non_temporal_threshold cpu specific
Different systems prefer different divisors.

From benchmarks[1] so far the following divisors have been found:
    ICX     : 2
    SKX     : 2
    BWD     : 8

For Intel, we are generalizing that BWD and older prefer 8 as a
divisor, and SKL and newer prefer 2.  This number can be further tuned
as benchmarks are run.

[1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
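
The threshold itself is just the shared cache size divided by the
per-CPU divisor; a small worked sketch (cache sizes here are assumed
example values, and nt_threshold is an illustrative name):

    /* Sketch: non_temporal_threshold = shared / divisor.
       E.g. a 32MB shared cache with divisor 2 (SKX-like) -> 16MB;
       the same cache with divisor 8 (BWD-like) -> 4MB.  */
    static unsigned long
    nt_threshold (unsigned long shared_cache_bytes, unsigned long divisor)
    {
      return shared_cache_bytes / divisor;
    }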
Reviewed-by: DJ Delorie <dj@redhat.com>
2023-06-12 11:33:39 -05:00
Noah Goldstein
f193ea20ed x86: Refactor Intel init_cpu_features
This patch should have no effect on existing functionality.

The current code, which has a single switch for model detection and
setting preferred features, is difficult to follow/extend. The cases
use magic numbers and many microarchitectures are missing. This makes
it difficult to reason about what is implemented so far and/or
how/where to add support for new features.

This patch splits the model detection and preference setting stages so
that CPU preferences can be set based on a complete list of available
microarchitectures, rather than based on model magic numbers.
Reviewed-by: DJ Delorie <dj@redhat.com>
2023-06-12 11:33:39 -05:00
Paul Pluzhnikov
65cc53fe7c Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
H.J. Lu
81a3cc956e <sys/platform/x86.h>: Add PREFETCHI support
Add PREFETCHI support to <sys/platform/x86.h>.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2023-04-05 14:46:10 -07:00
H.J. Lu
b05521c916 <sys/platform/x86.h>: Add AMX-COMPLEX support
Add AMX-COMPLEX support to <sys/platform/x86.h>.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2023-04-05 14:46:10 -07:00
H.J. Lu
609b7b2d3c <sys/platform/x86.h>: Add AVX-NE-CONVERT support
Add AVX-NE-CONVERT support to <sys/platform/x86.h>.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2023-04-05 14:46:10 -07:00
H.J. Lu
4c120c88a6 <sys/platform/x86.h>: Add AVX-VNNI-INT8 support
Add AVX-VNNI-INT8 support to <sys/platform/x86.h>.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2023-04-05 14:46:10 -07:00
H.J. Lu
96037c697d <sys/platform/x86.h>: Add AVX-IFMA support
Add AVX-IFMA support to <sys/platform/x86.h>.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2023-04-05 14:46:10 -07:00
H.J. Lu
8b4cc05eab <sys/platform/x86.h>: Add AMX-FP16 support
Add AMX-FP16 support to <sys/platform/x86.h>.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2023-04-05 14:46:10 -07:00
H.J. Lu
2f02d0d8e1 <sys/platform/x86.h>: Add CMPCCXADD support
Add CMPCCXADD support to <sys/platform/x86.h>.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2023-04-05 14:46:10 -07:00
H.J. Lu
231bf916ce <sys/platform/x86.h>: Add RAO-INT support
Add RAO-INT support to <sys/platform/x86.h>.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2023-04-05 14:46:10 -07:00
H.J. Lu
743113d42e x86: Set FSGSBASE to active if enabled by kernel
Linux kernel uses AT_HWCAP2 to indicate if FSGSBASE instructions are
enabled.  If the HWCAP2_FSGSBASE bit in AT_HWCAP2 is set, FSGSBASE
instructions can be used in user space.  Define dl_check_hwcap2 to set
the FSGSBASE feature to active on Linux when the HWCAP2_FSGSBASE bit is
set.

Add a test to verify that FSGSBASE is active on current kernels.
NB: This test will fail if the kernel doesn't set the HWCAP2_FSGSBASE
bit in AT_HWCAP2 while fsgsbase shows up in /proc/cpuinfo.
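
The user-space side of the check reduces to one getauxval call; a
minimal sketch (the HWCAP2_FSGSBASE bit value is assumed from the
Linux uapi):

    #include <sys/auxv.h>
    #include <stdio.h>

    #ifndef HWCAP2_FSGSBASE
    # define HWCAP2_FSGSBASE (1 << 1)   /* assumed Linux uapi bit */
    #endif

    int
    main (void)
    {
      /* The kernel sets this AT_HWCAP2 bit when FSGSBASE instructions
         may be used from user space.  */
      puts (getauxval (AT_HWCAP2) & HWCAP2_FSGSBASE
            ? "FSGSBASE active" : "FSGSBASE not enabled by kernel");
      return 0;
    }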
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2023-04-03 11:36:48 -07:00
Adhemerval Zanella Netto
33237fe83d Remove --enable-tunables configure option
And make tunables always supported.  The configure option was added in
glibc 2.25 and some features require it (such as the hwcap mask, huge
pages support, and lock elision tuning).  Removing it also simplifies
the build permutations.

Changes from v1:
 * Remove glibc.rtld.dynamic_sort changes, it is orthogonal and needs
   more discussion.
 * Cleanup more code.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2023-03-29 14:33:06 -03:00
H.J. Lu
317f1c0a8a x86-64: Add glibc.cpu.prefer_map_32bit_exec [BZ #28656]
Crossing 2GB boundaries with indirect calls and jumps can use more
branch prediction resources on Intel Golden Cove CPU (see the
"Misprediction for Branches >2GB" section in Intel 64 and IA-32
Architectures Optimization Reference Manual.)  There is visible
performance improvement on workloads with many PLT calls when executables
and shared libraries are mmapped below 2GB.  Add the Prefer_MAP_32BIT_EXEC
bit so that mmap will try to map executable or denywrite pages in shared
libraries with MAP_32BIT first.

NB: Prefer_MAP_32BIT_EXEC reduces bits available for address space
layout randomization (ASLR), which is always disabled for SUID programs
and can only be enabled by the tunable, glibc.cpu.prefer_map_32bit_exec,
or the environment variable, LD_PREFER_MAP_32BIT_EXEC.  This works only
between shared libraries or between shared libraries and executables with
addresses below 2GB.  PIEs are usually loaded at a random address above
4GB by the kernel.
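
The underlying mechanism corresponds to passing MAP_32BIT to mmap; a
minimal stand-alone illustration (not the ld.so code path):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int
    main (void)
    {
      /* MAP_32BIT (x86-64) requests an address below 2GB, keeping
         indirect branches within the cheaper prediction range the
         commit describes.  */
      void *p = mmap (NULL, 4096, PROT_READ | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
      if (p != MAP_FAILED)
        printf ("mapped at %p\n", p);
      return 0;
    }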
2023-02-22 18:28:37 -08:00
Joseph Myers
6d7e8eda9b Update copyright dates with scripts/update-copyrights 2023-01-06 21:14:39 +00:00
H.J. Lu
1e000d3d33 x86: Black list more Intel CPUs for TSX [BZ #27398]
Disable TSX and enable RTM_ALWAYS_ABORT for Intel CPUs listed in:

https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html

This fixes BZ #27398.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-01-18 14:20:09 -08:00
Paul Eggert
581c785bf3 Update copyright dates with scripts/update-copyrights
I used these shell commands:

../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted of lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.

I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah.  I don't
know why I run into these diagnostics whereas others evidently do not.

remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
2022-01-01 11:40:24 -08:00
H.J. Lu
ceeffe968c x86: Don't set Prefer_No_AVX512 for processors with AVX512 and AVX-VNNI
Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI since
they won't lower CPU frequency when ZMM load and store instructions are
used.
2021-12-06 07:14:12 -08:00
H.J. Lu
14dbbf46a0 x86-64: Remove Prefer_AVX2_STRCMP
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp.  When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL.  EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
2021-11-01 07:53:04 -07:00
H.J. Lu
91cc803d27 x86-64: Add Avoid_Short_Distance_REP_MOVSB
commit 3ec5d83d2a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sat Jan 25 14:19:40 2020 -0800

    x86-64: Avoid rep movsb with short distance [BZ #27130]

introduced some regressions on Intel processors without Fast Short REP
MOV (FSRM).  Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
short distance only on Intel processors with FSRM.  bench-memmove-large
on Skylake server shows that cycles of __memmove_evex_unaligned_erms
improve for the following data sizes:

                                  before    after    Improvement
length=4127, align1=3, align2=0:  479.38    349.25      27%
length=4223, align1=9, align2=5:  405.62    333.25      18%
length=8223, align1=3, align2=0:  786.12    496.38      37%
length=8319, align1=9, align2=5:  727.50    501.38      31%
length=16415, align1=3, align2=0: 1436.88   840.00      41%
length=16511, align1=9, align2=5: 1375.50   836.38      39%
length=32799, align1=3, align2=0: 2890.00   1860.12     36%
length=32895, align1=9, align2=5: 2891.38   1931.88     33%
2021-07-28 13:23:57 -07:00
H.J. Lu
7c124e3714 x86: Install <bits/platform/x86.h> [BZ #27958]
1. Install <bits/platform/x86.h> for <sys/platform/x86.h> which includes
<bits/platform/x86.h>.
2. Rename HAS_CPU_FEATURE to CPU_FEATURE_PRESENT which checks if the
processor has the feature.
3. Rename CPU_FEATURE_USABLE to CPU_FEATURE_ACTIVE which checks if the
feature is active.  There may be other preconditions, like sufficient
stack space or further setup for AMX, which must be satisfied before the
feature can be used.

This fixes BZ #27958.
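
Usage of the two renamed macros from the installed header:

    #include <sys/platform/x86.h>
    #include <stdio.h>

    int
    main (void)
    {
      /* PRESENT: the processor has the feature.  ACTIVE: the
         remaining preconditions are also satisfied, so the feature
         may actually be used.  */
      if (CPU_FEATURE_PRESENT (AVX2))
        puts ("AVX2 present");
      if (CPU_FEATURE_ACTIVE (AVX2))
        puts ("AVX2 active");
      return 0;
    }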

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2021-07-23 05:12:51 -07:00
H.J. Lu
ea8e465a6b x86: Check RTM_ALWAYS_ABORT for RTM [BZ #28033]
From

https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html

* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
  transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
  which is set to indicate to updated software that the loaded microcode is
  forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
  for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
  be set by default after microcode update.
* Workloads that benefited from Intel TSX might experience a change
  in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
  TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
  Elision (HLE) and RTM bits to indicate to software that Intel TSX is
  disabled.

1. Add RTM_ALWAYS_ABORT to CPUID features.
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set.  This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcode update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.

This fixes BZ #28033.
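
A minimal sketch of the item 2 check using GCC's <cpuid.h> (the glibc
internals differ):

    #include <cpuid.h>
    #include <stdio.h>

    int
    main (void)
    {
      /* CPUID.07H.0H: RTM is EBX[11], RTM_ALWAYS_ABORT is EDX[11].  */
      unsigned int eax, ebx, ecx, edx;
      if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
        {
          int rtm = (ebx >> 11) & 1;
          int always_abort = (edx >> 11) & 1;
          /* Mirror the commit: RTM is usable only if the microcode
             isn't forcing every transaction to abort.  */
          printf ("RTM usable: %d\n", rtm && !always_abort);
        }
      return 0;
    }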
2021-07-01 10:47:35 -07:00
H.J. Lu
ea26ff0322 x86: Copy IBT and SHSTK usable only if CET is enabled
IBT and SHSTK usable bits are copied from CPUID feature bits and later
cleared if the kernel doesn't support CET.  Copy the IBT and SHSTK usable
bits only if CET is enabled so that they aren't set on CET-capable
processors running non-CET-enabled glibc.
2021-06-23 17:35:47 -07:00
H.J. Lu
1da50d4bda x86: Set Prefer_No_VZEROUPPER and add Prefer_AVX2_STRCMP
1. Set Prefer_No_VZEROUPPER if RTM is usable to avoid RTM abort triggered
by VZEROUPPER inside a transactionally executing RTM region.
2. Since comparing 2 32-byte strings with 256-bit EVEX strcmp requires 2
loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp requires 1 load, 2 VPCMPEQs,
1 VPMINU and 1 VPMOVMSKB, AVX2 strcmp is faster than EVEX strcmp.  Add
Prefer_AVX2_STRCMP to prefer the AVX2 strcmp family functions.
2021-03-29 07:40:17 -07:00
H.J. Lu
27f7463675 x86: Properly disable XSAVE related features [BZ #27605]
1. Support GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVE.
2. Disable all features which depend on XSAVE:
   a. If OSXSAVE is disabled by glibc tunables, or
   b. If neither XSAVE nor XSAVEC is usable.
2021-03-29 06:04:17 -07:00
H.J. Lu
5ab25c8875 x86: Add PTWRITE feature detection [BZ #27346]
1. Add CPUID_INDEX_14_ECX_0 for CPUID leaf 0x14 to detect PTWRITE feature
in EBX of CPUID leaf 0x14 with ECX == 0.
2. Add PTWRITE detection to CPU feature tests.
3. Add 2 static CPU feature tests.
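
A minimal sketch of the leaf 0x14 probe using GCC's <cpuid.h> (the bit
position is assumed from Intel's documentation; the glibc plumbing
differs):

    #include <cpuid.h>
    #include <stdio.h>

    int
    main (void)
    {
      /* CPUID leaf 0x14 (Intel PT), subleaf 0: PTWRITE is EBX[4].
         __get_cpuid_count also verifies the leaf is supported.  */
      unsigned int eax, ebx, ecx, edx;
      if (__get_cpuid_count (0x14, 0, &eax, &ebx, &ecx, &edx)
          && ((ebx >> 4) & 1))
        puts ("PTWRITE supported");
      return 0;
    }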
2021-02-07 08:01:14 -08:00