Utilize the bpf_modify_return_test_tp() kfunc as a fast way to trigger
tp/raw_tp/fmodret programs from another BPF program, which gives us
batched benchmarks comparable to the (batched) kprobe/fentry
benchmarks.
We don't switch the kprobe/fentry batched benchmarks to this kfunc, so
that the bench tool stays usable on older kernels as well.
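For illustration, the kfunc-based driver could look roughly like the
sketch below (hedged: the kfunc signature, section, and names are
assumed from the description above and may differ from the actual
sources):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* kernel-provided test kfunc: fires a tracepoint and can be hooked
   * with fmod_ret
   */
  extern int bpf_modify_return_test_tp(int nonzero_ret) __ksym __weak;

  const volatile int batch_iters = 100;

  SEC("raw_tp")
  int trigger_driver_kfunc(void *ctx)
  {
          int i;

          for (i = 0; i < batch_iters; i++)
                  /* each call runs attached tp/raw_tp/fmodret programs
                   * once, with no syscall per iteration
                   */
                  (void)bpf_modify_return_test_tp(0);

          return 0;
  }

  char _license[] SEC("license") = "GPL";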
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of front-loading all possible benchmarking BPF programs for
trigger benchmarks, explicitly specify which BPF program each benchmark
uses and load only that one.
This makes it easier to support older kernels, where some program types
might be impossible to load (e.g., those that rely on a newly added
kfunc).
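A minimal sketch of the setup-side change, assuming a trigger_bench
skeleton and an illustrative program name (the real code may differ):

  struct trigger_bench *skel;
  struct bpf_program *prog;

  skel = trigger_bench__open();
  if (!skel)
          exit(1);

  /* don't load any benchmark program by default ... */
  bpf_object__for_each_program(prog, skel->obj)
          bpf_program__set_autoload(prog, false);

  /* ... and re-enable only the one this benchmark actually uses */
  bpf_program__set_autoload(skel->progs.bench_trigger_fentry, true);

  if (trigger_bench__load(skel))
          exit(1);

This way, a benchmark that needs a program type unsupported by the
running kernel is the only one that fails, instead of breaking loading
for every trigger benchmark.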
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove "legacy" benchmarks triggered by syscalls in favor of newly added
in-kernel/batched benchmarks. Drop -batched suffix now as well.
Next patch will restore "feature parity" by adding back
tp/raw_tp/fmodret benchmarks based on in-kernel kfunc approach.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Existing kprobe/fentry triggering benchmarks have a 1-to-1 mapping
between a syscall execution and a BPF program run. While we use the
fast getpgid() syscall, syscall overhead can still be non-trivial.
This patch adds a set of kprobe/fentry benchmarks that amortizes the
syscall cost over many BPF program runs. We do this by using the
BPF_PROG_TEST_RUN command to trigger a "driver" raw_tp program, which
runs a tight parameterized loop calling a cheap BPF helper
(bpf_get_numa_node_id()), to which the kprobe/fentry programs under
benchmark are attached.
This way a single bpf() syscall causes N executions of the BPF program
being benchmarked. N defaults to 100, but can be adjusted with the
--trig-batch-iters CLI argument.
For comparison, we also implement a new baseline program that, instead
of triggering another BPF program, just does N atomic per-CPU counter
increments, establishing the upper limit for all other program types
within this batched benchmarking setup.
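Roughly, the driver and baseline programs look like the sketch below
(hedged: program and variable names are illustrative and the actual
selftest sources may differ):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  const volatile int batch_iters = 100;  /* set via --trig-batch-iters */
  long hits = 0;  /* simplified; real benchmarks use per-CPU counters */

  SEC("raw_tp")
  int trigger_driver(void *ctx)
  {
          int i;

          for (i = 0; i < batch_iters; i++)
                  /* kprobe/fentry programs under test attach to the
                   * kernel function backing this helper, so each
                   * iteration runs them once without an extra syscall
                   */
                  (void)bpf_get_numa_node_id();

          return 0;
  }

  SEC("raw_tp")
  int trigger_count(void *ctx)
  {
          int i;

          /* baseline: same loop shape, but only a counter bump */
          for (i = 0; i < batch_iters; i++)
                  __sync_add_and_fetch(&hits, 1);

          return 0;
  }

  char _license[] SEC("license") = "GPL";

The user-space producer then issues one bpf(BPF_PROG_TEST_RUN) call per
batch (e.g. via bpf_prog_test_run_opts()) to run the driver.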
Taking the final set of benchmarks added in this patch set (including
tp/raw_tp/fmodret, added in a later patch), and keeping the "legacy"
syscall-driven benchmarks for now, we can capture all triggering
benchmarks in one place for comparison, before we remove the legacy
ones (and rename xxx-batched into just xxx).
$ benchs/run_bench_trigger.sh
usermode-count : 79.500 ± 0.024M/s
kernel-count : 49.949 ± 0.081M/s
syscall-count : 9.009 ± 0.007M/s
fentry-batch : 31.002 ± 0.015M/s
fexit-batch : 20.372 ± 0.028M/s
fmodret-batch : 21.651 ± 0.659M/s
rawtp-batch : 36.775 ± 0.264M/s
tp-batch : 19.411 ± 0.248M/s
kprobe-batch : 12.949 ± 0.220M/s
kprobe-multi-batch : 15.400 ± 0.007M/s
kretprobe-batch : 5.559 ± 0.011M/s
kretprobe-multi-batch: 5.861 ± 0.003M/s
fentry-legacy : 8.329 ± 0.004M/s
fexit-legacy : 6.239 ± 0.003M/s
fmodret-legacy : 6.595 ± 0.001M/s
rawtp-legacy : 8.305 ± 0.004M/s
tp-legacy : 6.382 ± 0.001M/s
kprobe-legacy : 5.528 ± 0.003M/s
kprobe-multi-legacy : 5.864 ± 0.022M/s
kretprobe-legacy : 3.081 ± 0.001M/s
kretprobe-multi-legacy: 3.193 ± 0.001M/s
Note how the xxx-batch variants are measured with significantly higher
throughput, even though the in-kernel overhead is exactly the same. As
such, results can be compared only between benchmarks of the same kind
(syscall-driven vs batched):
fentry-legacy : 8.329 ± 0.004M/s
fentry-batch : 31.002 ± 0.015M/s
kprobe-multi-legacy : 5.864 ± 0.022M/s
kprobe-multi-batch : 15.400 ± 0.007M/s
Note also that syscall-count sets the theoretical limit for
syscall-triggered benchmarks, while kernel-count sets a similar limit
for the batch variants. usermode-count is the idealized and
unachievable case of user-space counting without doing any syscalls,
and is mostly a measure of CPU speed for such a trivial benchmark.
As mentioned above, tp/raw_tp/fmodret require a kernel-side kfunc to
produce a similar benchmark, which we address in a separate patch.
Note that run_bench_trigger.sh allows overriding the list of benchmarks
to run, which is very useful for performance work.
Cc: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Rename uprobe-base to the more precise usermode-count (it will match
the other baseline-like benchmarks, kernel-count and syscall-count).
Also use the BENCH_TRIG_USERMODE() macro to define all usermode-based
triggering benchmarks, which include usermode-count and the
uprobe/uretprobe benchmarks.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some distros seem to enable -fcf-protection=branch by default, which
places an endbr64 instruction as the first instruction of the uprobe
trigger functions and breaks our setup.
Mark them with the nocf_check attribute to prevent that.
Also ignore the unknown-attribute warning in gcc for the bench objects,
because nocf_check can be used only when -fcf-protection=branch is
enabled; otherwise we'd get a warning and break the compilation.
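For illustration, the change amounts to something like this (hedged
sketch; the macro and function names are illustrative):

  /* keep the expected instruction (nop/push/ret) first by suppressing
   * the endbr64 that -fcf-protection=branch would otherwise insert
   */
  #define __nocf_check __attribute__((nocf_check))

  __nocf_check __attribute__((weak)) void uprobe_target_nop(void)
  {
          asm volatile ("nop");
  }

On the Makefile side this pairs with suppressing the attribute warning
for the bench objects (e.g. -Wno-attributes with gcc), since the
attribute is only recognized when -fcf-protection=branch is in effect.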
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240322134936.1075395-1-jolsa@kernel.org
With glibc 2.28, selftests compilation fails for benchs/bench_trigger.c:
benchs/bench_trigger.c: In function ‘inc_counter’:
benchs/bench_trigger.c:25:23: error: implicit declaration of function ‘gettid’; did you mean ‘getgid’? [-Werror=implicit-function-declaration]
25 | tid = gettid();
| ^~~~~~
| getgid
cc1: all warnings being treated as errors
It appears support for the gettid() wrapper varies across glibc
versions (it was only added in glibc 2.30), so it is safer to use
syscall(SYS_gettid) instead.
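A sketch of the intended replacement (illustrative helper name):

  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* glibc only gained a gettid() wrapper in 2.30, so invoke the
   * syscall directly for portability
   */
  static inline pid_t sys_gettid(void)
  {
          return syscall(SYS_gettid);
  }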
Fixes: 520fad2e32 ("selftests/bpf: scale benchmark counting by using per-CPU counters")
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240322095728.95671-1-alan.maguire@oracle.com
When benchmarking with multiple threads (-pN, where N>1), we start
contending on the single atomic counter that both the BPF trigger
benchmarks and the user-space "baseline" tests (trig-base and
trig-uprobe-base) use. As such, we start bottlenecking on something
completely irrelevant to the benchmark at hand.
Scale counting up by using per-CPU counters on the BPF side. On the
user-space side we do the next best thing: hash the thread ID to
approximate per-CPU behavior. It seems to work quite well in practice.
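On the BPF side the resulting pattern is roughly the following sketch
(hedged: the array size, padding, and names are illustrative):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define MAX_CPUS 1024  /* power-of-two upper bound on CPU count */

  struct counter {
          long value;
  } __attribute__((aligned(128)));  /* pad to avoid false sharing */

  struct counter hits[MAX_CPUS];

  static __always_inline void inc_counter(void)
  {
          int cpu = bpf_get_smp_processor_id();

          __sync_add_and_fetch(&hits[cpu & (MAX_CPUS - 1)].value, 1);
  }

On the user-space side the same layout is reused, but the slot index is
derived by hashing the (cached) thread ID instead of reading the CPU
number.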
To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8,
16, and 32 threads:
- trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
- trig-base (getpgid() syscall, atomic counter in user-space);
- trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU
counter on BPF side).
Command used:
for b in uprobe-base base fentry; do \
for p in 1 2 4 8 16 32; do \
printf "%-11s %2d: %s\n" $b $p \
"$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
done; \
done
Before these changes, aggregate throughput across all threads doesn't
scale well with the number of threads; it actually falls sharply for
uprobe-base due to very high contention:
uprobe-base 1: 138.998 ± 0.650M/s
uprobe-base 2: 70.526 ± 1.147M/s
uprobe-base 4: 63.114 ± 0.302M/s
uprobe-base 8: 54.177 ± 0.138M/s
uprobe-base 16: 45.439 ± 0.057M/s
uprobe-base 32: 37.163 ± 0.242M/s
base 1: 16.940 ± 0.182M/s
base 2: 19.231 ± 0.105M/s
base 4: 21.479 ± 0.038M/s
base 8: 23.030 ± 0.037M/s
base 16: 22.034 ± 0.004M/s
base 32: 18.152 ± 0.013M/s
fentry 1: 14.794 ± 0.054M/s
fentry 2: 17.341 ± 0.055M/s
fentry 4: 23.792 ± 0.024M/s
fentry 8: 21.557 ± 0.047M/s
fentry 16: 21.121 ± 0.004M/s
fentry 32: 17.067 ± 0.023M/s
After these changes, we see almost perfect linear scaling, as expected.
The sub-linear scaling when going from 8 to 16 threads is interesting
and consistent on my test machine, but I haven't investigated what
causes this peculiar slowdown (it shows up across all benchmarks and
could be due to hyperthreading effects, but I'm not sure).
uprobe-base 1: 139.980 ± 0.648M/s
uprobe-base 2: 270.244 ± 0.379M/s
uprobe-base 4: 532.044 ± 1.519M/s
uprobe-base 8: 1004.571 ± 3.174M/s
uprobe-base 16: 1720.098 ± 0.744M/s
uprobe-base 32: 3506.659 ± 8.549M/s
base 1: 16.869 ± 0.071M/s
base 2: 33.007 ± 0.092M/s
base 4: 64.670 ± 0.203M/s
base 8: 121.969 ± 0.210M/s
base 16: 207.832 ± 0.112M/s
base 32: 424.227 ± 1.477M/s
fentry 1: 14.777 ± 0.087M/s
fentry 2: 28.575 ± 0.146M/s
fentry 4: 56.234 ± 0.176M/s
fentry 8: 106.095 ± 0.385M/s
fentry 16: 181.440 ± 0.032M/s
fentry 32: 369.131 ± 0.693M/s
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240315213329.1161589-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add kprobe-multi triggering benchmarks. They are useful now for
benchmarking the new fprobe implementation and might be useful later as
well.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240311211023.590321-1-jolsa@kernel.org
We already have kprobe and fentry benchmarks. Let's add kretprobe and
fexit ones for completeness.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240309005124.3004446-1-andrii@kernel.org
Settle on three "flavors" of uprobe/uretprobe, installed on different
kinds of instruction: nop, push, and ret. All three exercise different
internal code paths for emulating or single-stepping instructions, so
they are interesting to compare and benchmark separately.
To ensure a `push rbp` first instruction, we make uprobe_target_push()
a non-leaf function by calling a (global __weak) noop function and
returning a value afterwards (otherwise the compiler would just perform
tail call optimization).
Also, we need to make sure the compiler isn't skipping frame pointer
generation, so add `-fno-omit-frame-pointer` to the Makefile.
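For illustration, the nop and push targets could look roughly like this
(hedged sketch; names are illustrative and exact codegen depends on the
compiler and flags):

  __attribute__((weak)) void opaque_noop_fn(void)
  {
  }

  __attribute__((weak)) void uprobe_target_nop(void)
  {
          asm volatile ("nop");  /* uprobe lands on the nop */
  }

  __attribute__((weak)) int uprobe_target_push(void)
  {
          /* calling a non-inlinable __weak function and returning a
           * value afterwards prevents tail-call optimization, so with
           * frame pointers enabled the first instruction is `push %rbp`
           */
          opaque_noop_fn();
          return 1;
  }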
Just to give an idea of where we currently stand in terms of relative
performance of different uprobe/uretprobe cases vs a cheap syscall
(getpgid()) baseline, here are results from my local machine:
$ benchs/run_bench_uprobes.sh
base : 1.561 ± 0.020M/s
uprobe-nop : 0.947 ± 0.007M/s
uprobe-push : 0.951 ± 0.004M/s
uprobe-ret : 0.443 ± 0.007M/s
uretprobe-nop : 0.471 ± 0.013M/s
uretprobe-push : 0.483 ± 0.004M/s
uretprobe-ret : 0.306 ± 0.007M/s
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240301214551.1686095-1-andrii@kernel.org
Considering that only bench_ringbufs.c supports a consumer, set the
default value of consumer_cnt to 0. Accordingly, update the validity
check of consumer_cnt, remove the unused consumer_thread code snippets,
and set consumer_cnt to 1 in run_bench_ringbufs.sh.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230613080921.1623219-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix how selftests determine the relative offset of a function that is
uprobed. Previously, there was an assumption that the uprobed function
is always in the first executable region, which is not always the case
(libbpf CI hits such a case now), so the get_base_addr() approach in
isolation doesn't work anymore. Instead, teach get_uprobe_offset() to
find the correct memory mapping and calculate the uprobe offset from it.
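Conceptually, the fixed lookup walks /proc/self/maps to find the
executable mapping containing the target address and derives the file
offset from it, roughly like this hedged sketch (the real helper may
differ in details):

  #include <errno.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/types.h>

  static ssize_t get_uprobe_offset(const void *addr)
  {
          size_t start, end, base;
          char perms[8];
          bool found = false;
          FILE *f;

          f = fopen("/proc/self/maps", "r");
          if (!f)
                  return -errno;

          /* find the executable mapping that contains addr */
          while (fscanf(f, "%zx-%zx %7s %zx %*[^\n]\n",
                        &start, &end, perms, &base) == 4) {
                  if (perms[2] == 'x' && (uintptr_t)addr >= start &&
                      (uintptr_t)addr < end) {
                          found = true;
                          break;
                  }
          }
          fclose(f);
          if (!found)
                  return -ESRCH;

          /* uprobe offset = file offset of the instruction in the binary */
          return (uintptr_t)addr - start + base;
  }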
While at it, I merged the two implementations of the
get_uprobe_offset() helper, moving the powerpc64-specific logic inside
(an extra {} block was needed to avoid an unused variable error for
insn).
Also ensure that uprobed functions are never inlined, but are still
static (and thus local to each selftest), by using a no-op asm volatile
block internally. I didn't want to keep them global __weak, because
some tests use the uprobe's ref counter offset (to test USDT-like
logic), which is not compatible with a non-refcounted uprobe. So it's
nicer to have each test's uprobe target local to the file and
guaranteed not to be inlined or skipped by the compiler (which can
happen with static functions, especially when compiling selftests with
-O2).
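For illustration, the pattern for each test-local target is simply
(function name illustrative):

  /* static and local to the test, yet opaque enough that the compiler
   * can neither inline nor eliminate it
   */
  static void trigger_func(void)
  {
          asm volatile ("");
  }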
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220126193058.3390292-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix checkpatch error: "ERROR: Bad function definition - void foo()
should probably be void foo(void)". Most replacements are done by
the following command:
sed -i 's#\([a-z]\)()$#\1(void)#g' testing/selftests/bpf/benchs/*.c
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211210141652.877186-3-houtao1@huawei.com
Add a benchmark to measure the overhead of uprobes and uretprobes.
Also add a baseline (no uprobe attached) benchmark.
On my dev machine, the baseline benchmark can trigger about 130M
user_target() invocations per second. When a uprobe is attached, this
falls to just 700K/s. With a uretprobe, we get down to 520K/s:
$ sudo ./bench trig-uprobe-base -a
Summary: hits 131.289 ± 2.872M/s
# UPROBE
$ sudo ./bench -a trig-uprobe-without-nop
Summary: hits 0.729 ± 0.007M/s
$ sudo ./bench -a trig-uprobe-with-nop
Summary: hits 1.798 ± 0.017M/s
# URETPROBE
$ sudo ./bench -a trig-uretprobe-without-nop
Summary: hits 0.508 ± 0.012M/s
$ sudo ./bench -a trig-uretprobe-with-nop
Summary: hits 0.883 ± 0.008M/s
So there is almost a 2.5x performance difference between probing a nop
vs a non-nop instruction for an entry uprobe, and a 1.7x difference for
a uretprobe.
This means (taking overhead as roughly 1/throughput) that the overhead
is around 1.4 microseconds for a non-nop uprobe and 2 microseconds for
a non-nop uretprobe.
For the nop variants, uprobe and uretprobe overhead is down to 0.556
and 1.13 microseconds, respectively.
For comparison, just doing a very low-overhead syscall (with no BPF
programs attached anywhere) gives:
$ sudo ./bench trig-base -a
Summary: hits 4.830 ± 0.036M/s
So even nop-based uprobes are about 2.67x slower than the pure
user/kernel mode switch of a plain syscall.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211116013041.4072571-1-andrii@kernel.org
Turn on libbpf 1.0 mode. Fix all the explicit IS_ERR checks that would
now be broken because libbpf returns NULL on error (and sets errno).
Fix ASSERT_OK_PTR and ASSERT_ERR_PTR to work for both the old and new
modes and use them throughout selftests. This is trivial to do by using
the libbpf_get_error() API, which all libbpf users are supposed to use
instead of IS_ERR checks.
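The per-call pattern change is roughly the following (hedged sketch;
variable names are illustrative):

  struct bpf_link *link = bpf_program__attach(prog);

  /* old style: assumes libbpf encodes errors as ERR_PTR values */
  if (IS_ERR(link))
          goto cleanup;

  /* new style: works both before and after libbpf 1.0 strict mode,
   * where errors are reported as NULL + errno
   */
  if (libbpf_get_error(link)) {
          link = NULL;
          goto cleanup;
  }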
A bunch of checks also did explicit -1 comparisons for various
fd-returning APIs. Such checks are replaced with >= 0 or < 0 checks.
There were also a few misuses of bpf_object__find_map_by_name() in
test_maps. Those are fixed in this patch as well.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210525035935.1461796-3-andrii@kernel.org
It is sometimes desirable to be able to trigger a BPF program from
user space with minimal overhead. sys_enter would seem like a good
candidate, yet in a lot of cases there will be a lot of noise from
syscalls issued by other processes on the system. So while searching
for a low-overhead alternative, I stumbled upon the getpgid() syscall,
which seems specific enough not to suffer from accidental invocations
by other apps.
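For illustration, the producer side of such a benchmark boils down to a
tight syscall loop (hedged sketch; names are illustrative):

  #include <sys/syscall.h>
  #include <unistd.h>

  /* each getpgid() invocation fires whichever probe (tp, raw_tp,
   * kprobe, fentry or fmod_ret) is attached for the benchmark being
   * measured
   */
  static void *producer_thread(void *arg)
  {
          for (;;)
                  (void)syscall(__NR_getpgid);

          return NULL;
  }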
This set of benchmarks compares tp, raw_tp with filtering by syscall
ID, kprobe, fentry, and fmod_ret returning an error (so that the
syscall is not actually executed), to determine the lowest-overhead
way. Here are the results on my machine (using the
benchs/run_bench_trigger.sh script):
base : 9.200 ± 0.319M/s
tp : 6.690 ± 0.125M/s
rawtp : 8.571 ± 0.214M/s
kprobe : 6.431 ± 0.048M/s
fentry : 8.955 ± 0.241M/s
fmodret : 8.903 ± 0.135M/s
So it seems like fmodret doesn't give much benefit for such a
lightweight syscall. Raw tracepoint is pretty decent despite the
additional filtering logic, but it will be called for every other
syscall in the system, which rules it out. Fentry, though, seems to add
the least amount of overhead and achieves 97.3% of the performance of
the baseline no-BPF-attached syscall.
Using getpgid() seems preferable to the set_task_comm() approach from
test_overhead, as its baseline performance is about 2.35x higher.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200512192445.2351848-5-andriin@fb.com