Borislav Petkov 0db7058e8e x86/clear_user: Make it faster
Based on a patch by Mark Hemment <markhemm@googlemail.com> and
incorporating very sane suggestions from Linus.

The point here is to have the default case - FSRM, which is supposed
to cover the majority of x86 hardware out there, if not now then soon -
inlined directly into the instruction stream so that no function call
overhead takes place.

Drop the early clobbers from the @size and @addr operands as they are
not needed anymore now that the alternatives are single instructions.
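
Not the exact hunk, but for illustration the inlined path ends up
looking roughly like the sketch below; the !FSRM fallback bodies are
elided and clear_user_original() stands in for the existing
out-of-line variant:

  /*
   * Illustrative sketch: FSRM machines execute a single, inlined
   * "rep stosb"; everything else gets patched at boot to call the
   * out-of-line clear_user_original() fallback instead.
   */
  static __always_inline __must_check unsigned long
  __clear_user(void __user *addr, unsigned long size)
  {
          might_fault();
          stac();

          /* No early clobbers: %rcx and %rdi are plain in/out operands. */
          asm volatile(
                  "1:\n\t"
                  ALTERNATIVE("call clear_user_original", "rep stosb", X86_FEATURE_FSRM)
                  "2:\n"
                  _ASM_EXTABLE_UA(1b, 2b)
                  : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
                  : "a" (0));
          clac();

          /* Bytes *not* cleared remain in %rcx, i.e. @size. */
          return size;
  }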

The benchmarks I ran show only very small improvements, and a page
fault (PF) benchmark even shows weird things like slowdowns with
higher core counts.

So for a ~6 min run of the git test suite, the function gets called
just under 700K times, all from padzero():

  <...>-2536    [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
  <...>-2536    [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
  <...>-2537    [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
  <...>-2537    [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
   ...

which is around 1% of the total time and consistent with the
benchmark numbers.
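
(Those numbers came from ad-hoc instrumentation; a minimal sketch of
how they can be gathered - assuming a trace_printk() wrapped around
the clear_user() call in fs/binfmt_elf.c's padzero(), which is not
part of this change - would be:)

  /*
   * Measurement sketch, not part of this change: time the
   * clear_user() call in padzero() with serialized TSC reads and
   * dump the result into the trace buffer.
   */
  static int padzero(unsigned long elf_bss)
  {
          unsigned long nbyte = ELF_PAGEOFFSET(elf_bss);

          if (nbyte) {
                  u64 start, end;

                  nbyte = ELF_MIN_ALIGN - nbyte;

                  start = rdtsc_ordered();
                  if (clear_user((void __user *)elf_bss, nbyte))
                          return -EFAULT;
                  end = rdtsc_ordered();

                  trace_printk("to: 0x%lx, size: %lu, cycles: %llu\n",
                               elf_bss, nbyte, end - start);
          }

          return 0;
  }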

So Mel gave me the idea to simply measure how fast the function becomes.
I.e.:

  start = rdtsc_ordered();
  ret = __clear_user(to, n);
  end = rdtsc_ordered();

Computing the arithmetic mean of all the samples collected during the
test suite run then shows some improvement:

  clear_user_original:
  Amean: 9219.71 (Sum: 6340154910, samples: 687674)

  fsrm:
  Amean: 8030.63 (Sum: 5522277720, samples: 687652)

That's on Zen3 - roughly a 13% drop in the mean cycle count.

The situation looks a lot more confusing on Intel:

Icelake:

  clear_user_original:
  Amean: 19679.4 (Sum: 13652560764, samples: 693750)
  Amean: 19743.7 (Sum: 13693470604, samples: 693562)

(I ran it twice just to be sure.)

  ERMS:
  Amean: 20374.3 (Sum: 13910601024, samples: 682752)
  Amean: 20453.7 (Sum: 14186223606, samples: 693576)

  FSRM:
  Amean: 20458.2 (Sum: 13918381386, samples: 680331)

The original microbenchmark that people were complaining about:

  for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
  32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
  37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
  62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
  60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
  53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
  55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
  55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
  54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
  50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
  58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s

Now the same thing with smaller buffers:

  for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s

is also not conclusive because it all depends on the buffer sizes,
their alignment, and on when the microcode detects that cachelines can
be aggregated properly and copied in bigger chunks.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
2022-08-18 12:36:42 +02:00