linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Barry Song	2917406c35	sched/fair: Document the slow path and fast path in select_task_rq_fair Everyone I know, including myself, took a long time to figure out that a typical wakeup always takes the fast path and never the slow path, which is used only for WF_FORK and WF_EXEC. Vincent reminded me once in a Linaro meeting and made me understand that the slow path won't happen for WF_TTWU. But my other friends repeatedly wasted a lot of time testing this path, as I did, before I reminded them. So the code obviously needs some documentation. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211016111109.5559-1-21cnbao@gmail.com	2021-12-07 15:14:10 +01:00
Christoph Hellwig	aea7e2a86a	dma-direct: factor the swiotlb code out of __dma_direct_alloc_pages Add a new helper to deal with the swiotlb case. This keeps the code nicely bundled and removes the unnecessary call to dma_direct_optimal_gfp_mask for the swiotlb case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2021-12-07 12:50:10 +01:00
Christoph Hellwig	f5d3939a59	dma-direct: drop two CONFIG_DMA_RESTRICTED_POOL conditionals swiotlb_alloc and swiotlb_free are properly stubbed out if CONFIG_DMA_RESTRICTED_POOL is not set, so skip the extra checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2021-12-07 12:50:10 +01:00
Christoph Hellwig	78bc72787a	dma-direct: warn if there is no pool for force unencrypted allocations Instead of blindly running into a blocking operation for a non-blocking gfp, return NULL and spew an error. Note that Kconfig prevents this for all currently relevant platforms, and this is just a debug check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2021-12-07 12:50:10 +01:00
Christoph Hellwig	955f58f740	dma-direct: fail allocations that can't be made coherent If the architecture can't remap or set an address uncached, there is no way to fulfill a request for a coherent allocation. Return NULL in that case. Note that this case currently does not happen, so this is a theoretical fixup and/or a preparation for eventually supporting platforms that can't support coherent allocations with the generic code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2021-12-07 12:50:06 +01:00
Christoph Hellwig	a86d10942d	dma-direct: refactor the !coherent checks in dma_direct_alloc Add a big central !dev_is_dma_coherent(dev) block to deal with as many of the uncached allocation schemes as possible and document the schemes a bit better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2021-12-07 12:49:57 +01:00
Christoph Hellwig	d541ae55d5	dma-direct: factor out a helper for DMA_ATTR_NO_KERNEL_MAPPING allocations Split the code for DMA_ATTR_NO_KERNEL_MAPPING allocations into a separate helper to make dma_direct_alloc a little more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Acked-by: David Rientjes <rientjes@google.com>	2021-12-07 12:49:50 +01:00
Christoph Hellwig	f3c962226d	dma-direct: clean up the remapping checks in dma_direct_alloc Add two local variables to track if we want to remap the returned address using vmap or call dma_set_uncached and use that to simplify the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2021-12-07 12:48:09 +01:00
Christoph Hellwig	a90cf30437	dma-direct: always leak memory that can't be re-encrypted We must never let unencrypted memory go back into the general page pool. So if we fail to set it back to encrypted when freeing DMA memory, leak the memory instead and warn the user. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2021-12-07 12:47:38 +01:00
Christoph Hellwig	5570449b68	dma-direct: don't call dma_set_decrypted for remapped allocations Remapped allocations handle the encrypted bit through the pgprot passed to vmap, so there is no need to call dma_set_decrypted. Note that this case is currently entirely theoretical as no valid kernel configuration supports remapped allocations and memory encryption currently. Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-12-07 12:47:06 +01:00
Christoph Hellwig	4d0564785b	dma-direct: factor out dma_set_{de,en}crypted helpers Factor out helpers to make dealing with memory encryption a little less cumbersome. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2021-12-07 12:47:05 +01:00
Alexei Starovoitov	29f2e5bd94	bpf: Silence purge_cand_cache build warning. When CONFIG_DEBUG_INFO_BTF_MODULES is not set the following warning can be seen: kernel/bpf/btf.c:6588:13: warning: 'purge_cand_cache' defined but not used [-Wunused-function] Fix it. Fixes: `1e89106da2` ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211207014839.6976-1-alexei.starovoitov@gmail.com	2021-12-06 18:24:34 -08:00
Uladzislau Rezki (Sony)	a6ed2aee54	tracing: Switch to kvfree_rcu() API Instead of invoking synchronize_rcu() to free a pointer after a grace period, we can directly make use of the new API that does the same but in a more efficient way. Link: https://lkml.kernel.org/r/20211124110308.2053-10-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 17:53:50 -05:00
Qiujun Huang	1d83c3a20b	tracing: Fix synth_event_add_val() kernel-doc comment It's named field here. Link: https://lkml.kernel.org/r/20210516022410.64271-1-hqjagain@gmail.com Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 17:53:50 -05:00
Steven Rostedt (VMware)	b7d5eb267f	tracing/uprobes: Use trace_event_buffer_reserve() helper To be consistent with kprobes and eprobes, use trace_event_buffer_reserve() and trace_event_buffer_commit(). This will ensure that any updates to trace events will also be implemented on uprobe events. Link: https://lkml.kernel.org/r/20211206162440.69fbf96c@gandalf.local.home Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 17:53:23 -05:00
Steven Rostedt (VMware)	5e6cd84e2f	tracing/kprobes: Do not open code event reserve logic As kprobe events use trace_event_buffer_commit() to commit the event to the ftrace ring buffer, for consistency, it should use trace_event_buffer_reserve() to allocate it, as the two functions are related. Link: https://lkml.kernel.org/r/20211130024319.257430762@goodmis.org Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 15:37:22 -05:00
Steven Rostedt (VMware)	3e8b1a29a0	tracing: Have eprobes use filtering logic of trace events The eprobes open code the reserving of the event on the ring buffer for ftrace instead of using the ftrace event wrappers, which means that it doesn't get affected by the filters, breaking the filtering logic on user space. Link: https://lkml.kernel.org/r/20211130024319.068451680@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 15:37:22 -05:00
Steven Rostedt (VMware)	6c536d76cf	tracing: Disable preemption when using the filter buffer In case trace_event_buffer_lock_reserve() is called with preemption enabled, the algorithm that defines the usage of the per cpu filter buffer may fail if the task schedules to another CPU after determining which buffer it will use. Disable preemption when using the filter buffer. And because that same buffer must be used throughout the call, keep preemption disabled until the filter buffer is released. This will also keep the semantics between the use case of when the filter buffer is used, and when the ring buffer itself is used, as that case also disables preemption until the ring buffer is released. Link: https://lkml.kernel.org/r/20211130024318.880190623@goodmis.org [ Fixed warning of assignment in if statement Reported-by: kernel test robot <lkp@intel.com> ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 15:37:22 -05:00
Steven Rostedt (VMware)	e07a1d5762	tracing: Use __this_cpu_read() in trace_event_buffer_lock_reserve() The value read by this_cpu_read() is used later and its use is expected to stay on the same CPU as being read. But this_cpu_read() does not warn if it is called without preemption disabled, whereas __this_cpu_read() will check if preemption is disabled on CONFIG_DEBUG_PREEMPT. Currently all callers have preemption disabled, but there may be new callers in the future that may not. Link: https://lkml.kernel.org/r/20211130024318.698165354@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 15:37:22 -05:00
Masami Hiramatsu	55de2c0b56	tracing: Add '__rel_loc' using trace event macros Add '__rel_loc' using trace event macros. These macros are usually not used in the kernel, except for testing purpose. This also add "rel_" variant of macros for dynamic_array string, and bitmask. Link: https://lkml.kernel.org/r/163757342119.510314.816029622439099016.stgit@devnote2 Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 15:37:21 -05:00
Masami Hiramatsu	05770dd0ad	tracing: Support __rel_loc relative dynamic data location attribute Add '__rel_loc', a new dynamic data location attribute which encodes the data location as an offset from the end of the field itself. The '__data_loc' is used for encoding the dynamic data location on the trace event record. But '__data_loc' is not useful if the writer doesn't know the event header (e.g. user event), because it records the dynamic data offset from the entry of the record, not the field itself. This new '__rel_loc' attribute encodes the data location relative to the byte that follows the field. For example, when there is a record like below (the number in the parentheses is the size of fields) |header(N)|common(M)|fields(K)|__data_loc(4)|fields(L)|data(G)| In this case, '__data_loc' field will be __data_loc = (G << 16) | (N+M+K+4+L) If '__rel_loc' is used, this will be |header(N)|common(M)|fields(K)|__rel_loc(4)|fields(L)|data(G)| where __rel_loc = (G << 16) | (L) This case shows L bytes after the '__rel_loc' attribute field; if there are no fields after the __rel_loc field, L must be 0. This is relatively easy (and no need to consider the kernel header change) when the event data fields are composed by a user who doesn't know the header and common fields. Link: https://lkml.kernel.org/r/163757341258.510314.4214431827833229956.stgit@devnote2 Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 15:37:21 -05:00
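The two encodings described in the entry above can be sketched in plain C. This is a minimal userspace illustration, not kernel code: the helper names (pack_loc, encode_data_loc, encode_rel_loc, loc_len, loc_offset) are hypothetical, and the only thing taken from the commit message is the layout, with the data length G in the upper 16 bits and the offset in the lower 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: length in the upper 16 bits, offset in the lower 16. */
static inline uint32_t pack_loc(uint32_t len, uint32_t offset)
{
	/* Both halves must fit into 16 bits for the packing to be lossless. */
	assert(len <= 0xFFFF && offset <= 0xFFFF);
	return (len << 16) | offset;
}

/*
 * __data_loc: the offset is counted from the start of the record, so the
 * writer has to know header(N), common(M), fields(K), the 4-byte field
 * itself, and fields(L).
 */
static inline uint32_t encode_data_loc(uint32_t n, uint32_t m, uint32_t k,
				       uint32_t l, uint32_t g)
{
	return pack_loc(g, n + m + k + 4 + l);
}

/*
 * __rel_loc: the offset is counted from the byte that follows the 4-byte
 * field, so only fields(L) matters. L is 0 when the data follows directly.
 */
static inline uint32_t encode_rel_loc(uint32_t l, uint32_t g)
{
	return pack_loc(g, l);
}

/* Unpack the two halves again. */
static inline uint32_t loc_len(uint32_t loc)
{
	return loc >> 16;
}

static inline uint32_t loc_offset(uint32_t loc)
{
	return loc & 0xFFFF;
}
```

With made-up sizes N=8, M=8, K=12, L=4, G=16, encode_data_loc() stores offset 36 while encode_rel_loc() stores offset 4. The second value does not change if the header or common fields grow, which is the property the commit message relies on for user events.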
Colin Ian King	f2b20c6627	tracing: Fix spelling mistake "aritmethic" -> "arithmetic" There is a spelling mistake in the tracing mini-HOWTO text. Fix it. Link: https://lkml.kernel.org/r/20211108201513.42876-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2021-12-06 15:37:21 -05:00
Kajol Jain	db52f57211	bpf: Remove config check to enable bpf support for branch records Branch data available to BPF programs can be very useful to get stack traces out of userspace application. Commit `fff7b64355` ("bpf: Add bpf_read_branch_records() helper") added BPF support to capture branch records in x86. Enable this feature also for other architectures as well by removing checks specific to x86. If an architecture doesn't support branch records, bpf_read_branch_records() still has appropriate checks and it will return an -EINVAL in that scenario. Based on UAPI helper doc in include/uapi/linux/bpf.h, unsupported architectures should return -ENOENT in such case. Hence, update the appropriate check to return -ENOENT instead. Selftest 'perf_branches' result on power9 machine which has the branch stacks support: - Before this patch: [command]# ./test_progs -t perf_branches #88/1 perf_branches/perf_branches_hw:FAIL #88/2 perf_branches/perf_branches_no_hw:OK #88 perf_branches:FAIL Summary: 0/1 PASSED, 0 SKIPPED, 1 FAILED - After this patch: [command]# ./test_progs -t perf_branches #88/1 perf_branches/perf_branches_hw:OK #88/2 perf_branches/perf_branches_no_hw:OK #88 perf_branches:OK Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED Selftest 'perf_branches' result on power9 machine which doesn't have branch stack report: - After this patch: [command]# ./test_progs -t perf_branches #88/1 perf_branches/perf_branches_hw:SKIP #88/2 perf_branches/perf_branches_no_hw:OK #88 perf_branches:OK Summary: 1/1 PASSED, 1 SKIPPED, 0 FAILED Fixes: `fff7b64355` ("bpf: Add bpf_read_branch_records() helper") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211206073315.77432-1-kjain@linux.ibm.com	2021-12-06 15:21:15 +01:00
Petr Mladek	5e8ba485b2	printk/console: Clean up boot console handling in register_console() The variable @bcon has two meanings. It is used several times for iterating the list of registered consoles. In the meantime, it holds the information whether a boot console is first in @console_drivers list. The information about the 1st console driver used to be important for the decision whether to install the new console by default or not. It allowed to re-evaluate the variable @need_default_console when a real console with tty binding has been unregistered in the meantime. The decision about the default console is no longer affected by @bcon variable. The current code checks whether the first driver is real and has tty binding directly. The information about the first console is still used for two more decisions: 1. It prevents duplicate output on non-boot consoles with CON_CONSDEV flag set. 2. Early/boot consoles are unregistered when a real console with CON_CONSDEV is registered and @keep_bootcon is not set. The behavior in the real life is far from obvious. @bcon is set according to the first console in the @console_drivers list. But the first position in the list is special: 1. Consoles with CON_CONSDEV flag are put at the beginning of the list. It is either the preferred console or any console with tty binding registered by default. 2. Another console might become the first in the list when the first console in the list is unregistered. It might happen either explicitly or automatically when boot consoles are unregistered. There is one more important rule: + Boot consoles can't be registered when any real console is already registered. It is a puzzle. The main complication is the dependency on the first position in the list and the complicated rules around it. Let's try to make it easier: 1. Add variable @bootcon_enabled and set it by iterating all registered consoles. The variable has obvious meaning and more predictable behavior. 
Any speed optimization and other tricks are not worth it. 2. Use a generic name for the variable that is used to iterate the list on registered console drivers. Behavior change: No, maybe surprisingly, there is _no_ behavior change! Let's provide the proof by contradiction. Both operations, duplicate output prevention and boot consoles removal, are done only when the newly added console has CON_CONSDEV flag set. The behavior would change when the new @bootcon_enabled has different value than the original @bcon. In other words, the behavior would change when the following conditions are true: + a console with CON_CONSDEV flag is added + a real (non-boot) console is the first in the list + a boot console is later in the list Now, a real console might be first in the list only when: + It was the first registered console. In this case, there can't be any boot console because any later ones were rejected. + It was put at the first position because it had CON_CONSDEV flag set. It was either the preferred console or it was a console with tty binding registered by default. We are interested only in real consoles here. And real console with tty binding fulfills conditions of the default console. Now, there is always only one console that is either preferred or fulfills conditions of the default console. It can't be already in the list and being registered at the same time. As a result, the above three conditions could never be "true" at the same time. Therefore the behavior can't change. Final dilemma: OK, the new code has the same behavior. But is the change in the right direction? What if the handling of @console_drivers is updated in the future? OK, let's look at it from another angle: 1. The ordering of @console_drivers list is important only in console_device() function. The first console driver with tty binding gets associated with /dev/console. 2. CON_CONSDEV flag is shown in /proc/consoles. And it should be set for the driver that is returned by console_device(). 
3. A boot console is removed and the duplicated output is prevented when the real console with CON_CONSDEV flag is registered. Now, in the ideal world: + The driver associated with /dev/console should be either a console preferred via the command line, device tree, or SPCR. Or it should be the first real console with tty binding registered by default. + The code should match the related boot and real console drivers. It should unregister only the obsolete boot driver. And the duplicated output should be prevented only on the related real driver. It is clear that it is not guaranteed by the current code. Instead, the current code looks like a maze of heuristics that try to achieve the above. It is result of adding several features over last few decades. For example, a possibility to register more consoles, unregister consoles, boot consoles, consoles without tty binding, device tree, SPCR, braille consoles. Anyway, there is no reason why the decision, about removing boot consoles and preventing duplicated output, should depend on the first console in the list. The current code does the decisions primary by CON_CONSDEV flag that is used for the preferred console. It looks like a good compromise. And the change seems to be in the right direction. Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20211122132649.12737-6-pmladek@suse.com	2021-12-06 14:07:57 +01:00
Petr Mladek	4f54693925	printk/console: Remove need_default_console variable The variable @need_default_console is used to decide whether a newly registered console should get enabled by default. The logic is complicated. It can be modified in a register_console() call. But it is always re-evaluated in the next call by the following condition: if (need_default_console || bcon || !console_drivers) need_default_console = preferred_console < 0; In short, the value is updated when either of the condition is valid: + the value is still, or again, "true" + boot/early console is still the first in @console_driver list + @console_driver list is empty The value is updated according to @preferred_console. In particular, it is set to "false" when a @preferred_console was set by __add_preferred_console(). This happens when a non-braille console was added via the command line, device tree, or SPCR. It is far from clear what this all means together. Let's look at @need_default_console from another angle: 1. The value is "true" by default. It means that it is always set according to @preferred_console during the first register_console() call. In other words, the first register_console() call will register the console by default only when no non-braille console was defined via the command line, device tree, or SPCR. 2. The value will always stay "false" when @preferred_console is set. In other words, try_enable_default_console() will never get called when a non-braille console is explicitly required. 4. The value might be set to "false" in try_enable_default_console() when a console with tty binding (driver) gets enabled. In this case CON_CONSDEV is set as well. It causes that the console will be inserted as first into the list @console_driver. It might be either real or boot/early console. 5. The value will be set _back_ to "true" in the next register_console() call when: + The console added by the previous register_console() had been a boot/early one. 
+ The last console has been unregistered in the meantime and a boot/early console became first in @console_drivers list again. Or the list became empty. By other words, the value will stay "false" only when the last registered console was real, had tty binding, and was not removed in the mean time. The main logic looks clear: + Consoles are enabled by default only when no one is preferred via the command line, device tree, or SPCR. + By default, any console is enabled until a real console with tty binding gets registered. The behavior when the real console with tty binding is later removed is a bit unclear: + By default, any new console is registered again only when there is no console or the first console in the list is a boot one. The question is why the code is suddenly happy when a real console without tty binding is the first in the list. It looks like an overlook and bug. Conclusion: The state of @preferred_console and the first console in @console_driver list should be enough to decide whether we need to enable the given console by default. The rules are simple. New consoles are _not_ enabled by default when either of the following conditions is true: + @preferred_console is set. It means that a non-braille console is explicitly configured via the command line, device tree, or SPCR. + A real console with tty binding is registered. Such a console will have CON_CONSDEV flag set and will always be the first in @console_drivers list. Note: The new code does not use @bcon variable. The meaning of the variable is far from clear. The direct check of the first console in the list makes it more clear that only real console fulfills requirements of the default console. Behavior change: As already discussed above. There was one situation where the original code worked a strange way. 
Let's have: + console A: real console without tty binding + console B: real console with tty binding and do: register_console(A); /* 1st step */ register_console(B); /* 2nd step */ unregister_console(B); /* 3rd step */ register_console(B); /* 4th step */ The original code will not register the console B in the 4th step. @need_default_console is set to "false" in 2nd step. The real console with tty binding (driver) is then removed in the 3rd step. But @need_default_console will stay "false" in the 4th step because there is no boot/early console and @registered_consoles list is not empty. The new code will register the console B in the 4th step because it checks whether the first console has tty binding (->driver). This behavior change should be acceptable: 1. The scenario requires manual intervention (console removal). The system should boot with the same consoles as before. 2. Console B is registered again probably because the user wants to use it. The most likely scenario is that the related module is reloaded. 3. It makes the behavior more consistent and predictable. Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20211122132649.12737-5-pmladek@suse.com	2021-12-06 14:07:57 +01:00
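The four-step scenario in the entry above can be replayed with a small userspace model. This is only a sketch under stated assumptions: every name here (toy_console, toy_state, toy_register, step4_registers_b, and so on) is hypothetical, no console is preferred via the command line, device tree, or SPCR, and list insertion is simplified so that a console with tty binding goes to the front and any other console goes to the back. The old rule follows the condition quoted in the commit message; the new rule looks only at the first console in the list.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CONSOLES 4

/* Hypothetical stand-in for a console: only the two properties that matter. */
struct toy_console {
	bool has_tty;	/* models the tty binding (->driver in the message) */
	bool is_boot;	/* models a boot/early console */
};

struct toy_state {
	const struct toy_console *list[MAX_CONSOLES];	/* list[0] is first */
	size_t count;
	bool need_default_console;	/* used by the old rule only */
	int preferred_console;		/* < 0: nothing explicitly requested */
};

/* Simplified ordering: tty-bound consoles go first, others go last. */
static inline void toy_insert(struct toy_state *s, const struct toy_console *con)
{
	if (s->count >= MAX_CONSOLES)
		return;
	if (con->has_tty) {
		for (size_t i = s->count; i > 0; i--)
			s->list[i] = s->list[i - 1];
		s->list[0] = con;
	} else {
		s->list[s->count] = con;
	}
	s->count++;
}

static inline void toy_unregister(struct toy_state *s, const struct toy_console *con)
{
	for (size_t i = 0; i < s->count; i++) {
		if (s->list[i] != con)
			continue;
		for (size_t j = i; j + 1 < s->count; j++)
			s->list[j] = s->list[j + 1];
		s->count--;
		return;
	}
}

/* Returns true if the console got enabled by default and was added. */
static inline bool toy_register(struct toy_state *s, const struct toy_console *con,
				bool use_new_rule)
{
	bool enable;

	if (use_new_rule) {
		/* New rule: decide from @preferred_console and the first console. */
		enable = s->preferred_console < 0 &&
			 (s->count == 0 || !s->list[0]->has_tty || s->list[0]->is_boot);
	} else {
		/* Old rule: the condition quoted in the commit message. */
		bool bcon = s->count > 0 && s->list[0]->is_boot;

		if (s->need_default_console || bcon || s->count == 0)
			s->need_default_console = s->preferred_console < 0;
		enable = s->need_default_console;
		if (enable && con->has_tty)
			s->need_default_console = false;
	}
	if (enable)
		toy_insert(s, con);
	return enable;
}

/* Replays the four steps and reports whether B is registered in step 4. */
static inline bool step4_registers_b(bool use_new_rule)
{
	static const struct toy_console a = { .has_tty = false, .is_boot = false };
	static const struct toy_console b = { .has_tty = true, .is_boot = false };
	struct toy_state s = { .count = 0, .need_default_console = true,
			       .preferred_console = -1 };

	toy_register(&s, &a, use_new_rule);		/* 1st step */
	toy_register(&s, &b, use_new_rule);		/* 2nd step */
	toy_unregister(&s, &b);				/* 3rd step */
	return toy_register(&s, &b, use_new_rule);	/* 4th step */
}
```

In this model step4_registers_b(false) reports that B is rejected and step4_registers_b(true) reports that B is registered, matching the behavior change described in the commit message. The real register_console() does considerably more (flags, preferred-console matching, locking), so the model should be read as an aid for following the argument only.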
Petr Mladek	f873efe841	printk/console: Remove unnecessary need_default_console manipulation There is no need to clear @need_default_console when a console preferred by the command line, device tree, or SPCR, gets enabled. The code is called only when some non-braille console matched a console in @console_cmdline array. It means that a non-braille console was added in __add_preferred_console() and the variable preferred_console is set to a number >= 0. As a result, @need_default_console is always set to "false" in the magic condition: if (need_default_console || bcon || !console_drivers) need_default_console = preferred_console < 0; This is one small step in removing the above magic condition that is hard to follow. The patch removes one superfluous assignment and should not change the functionality. Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20211122132649.12737-4-pmladek@suse.com	2021-12-06 14:07:57 +01:00
Petr Mladek	a6953370d2	printk/console: Rename has_preferred_console to need_default_console The logic around the variable @has_preferred_console made my head spin many times. Part of the problem is the ambiguous name. There is the variable @preferred_console. It points to the last non-braille console in @console_cmdline array. This array contains consoles preferred via the command line, device tree, or SPCR. Then there is the variable @has_preferred_console. It is set to "true" when @preferred_console is enabled or when a console with tty binding gets enabled by default. It might get reset back by the magic condition: if (!has_preferred_console || bcon || !console_drivers) has_preferred_console = preferred_console >= 0; It is a puzzle. Dumb explanation is that it gets re-evaluated when: + it was not set before (see above when it gets set) + there is still an early console enabled (bcon) + there is no console enabled (!console_drivers) This is still a puzzle. It gets more clear when we see where the value is checked. The only meaning of the variable is to decide whether we should try to enable the new console by default. Rename the variable according to the single situation where the value is checked. The rename requires an inverted logic. Otherwise, it is a simple search & replace. It does not change the functionality. Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/20211122132649.12737-3-pmladek@suse.com	2021-12-06 14:07:57 +01:00
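The claim that the rename only inverts the logic can be checked exhaustively in userspace. The sketch below is an illustration under assumptions, not the kernel code: the function names are hypothetical, and @bcon and @console_drivers, which are pointers in the kernel, are modeled as booleans meaning "non-NULL". The old condition is the one quoted in this entry; the renamed condition is the one quoted in the later need_default_console entries of this log.

```c
#include <stdbool.h>

/* Old form: re-evaluate @has_preferred_console. */
static inline bool update_has_preferred(bool has_preferred_console, bool bcon,
					bool console_drivers, int preferred_console)
{
	if (!has_preferred_console || bcon || !console_drivers)
		has_preferred_console = preferred_console >= 0;
	return has_preferred_console;
}

/* Renamed form: @need_default_console carries the inverted meaning. */
static inline bool update_need_default(bool need_default_console, bool bcon,
				       bool console_drivers, int preferred_console)
{
	if (need_default_console || bcon || !console_drivers)
		need_default_console = preferred_console < 0;
	return need_default_console;
}

/*
 * Walks every combination of inputs and checks that the new variable is
 * always the negation of the old one. preferred_console only matters as
 * "negative" or "non-negative", so -1 and 0 cover both cases.
 */
static inline bool forms_are_equivalent(void)
{
	for (int has = 0; has <= 1; has++)
		for (int bcon = 0; bcon <= 1; bcon++)
			for (int drivers = 0; drivers <= 1; drivers++)
				for (int pref = -1; pref <= 0; pref++) {
					bool old_has = update_has_preferred(has, bcon, drivers, pref);
					bool new_need = update_need_default(!has, bcon, drivers, pref);

					if (new_need != !old_has)
						return false;
				}
	return true;
}
```

Because forms_are_equivalent() covers all sixteen input combinations, a true result supports the statement that the rename does not change the functionality, at least for this simplified model of the condition.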
Petr Mladek	ed758b30d5	printk/console: Split out code that enables default console Put the code enabling a console by default into a separate function called try_enable_default_console(). Rename try_enable_new_console() to try_enable_preferred_console() to make the purpose of the different variants more clear. It is a code refactoring without any functional change. Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/20211122132649.12737-2-pmladek@suse.com	2021-12-06 14:07:57 +01:00
Linus Torvalds	7587a4a5a4	Merge tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Prevent a tick storm when a dedicated timekeeper CPU in nohz_full mode runs for prolonged periods with interrupts disabled and ends up programming the next tick in the past, leading to that storm * tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/nohz: Last resort update jiffies on nohz_full IRQ entry	2021-12-05 08:58:52 -08:00
Linus Torvalds	1d213767dc	Merge tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Properly init uclamp_flags of a runqueue, on first enqueuing - Fix preempt= callback return values - Correct utime/stime resource usage reporting on nohz_full to return the proper times instead of shorter ones * tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/uclamp: Fix rq->uclamp_max not set on first enqueue preempt/dynamic: Fix setup_preempt_mode() return value sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full	2021-12-05 08:53:31 -08:00
Hou Tao	866de40744	bpf: Disallow BPF_LOG_KERNEL log level for bpf(BPF_BTF_LOAD) BPF_LOG_KERNEL is only used internally, so disallow bpf_btf_load() to set log level as BPF_LOG_KERNEL. The same checking has already been done in bpf_check(), so factor out a helper to check the validity of log attributes and use it in both places. Fixes: `8580ac9404` ("bpf: Process in-kernel BTF") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20211203053001.740945-1-houtao1@huawei.com	2021-12-04 10:10:24 -08:00
Kefeng Wang	c0bed69daf	locking: Make owner_on_cpu() into <linux/sched.h> Move owner_on_cpu() from kernel/locking/rwsem.c into include/linux/sched.h under CONFIG_SMP, then use it in the mutex/rwsem/rtmutex code to simplify it. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211203075935.136808-2-wangkefeng.wang@huawei.com	2021-12-04 10:56:25 +01:00
Thomas Gleixner	0c1d7a2c2d	lockdep: Remove softirq accounting on PREEMPT_RT. There is not really a softirq context on PREEMPT_RT. Softirqs on PREEMPT_RT are always invoked within the context of a threaded interrupt handler or within ksoftirqd. The "in-softirq" context is preemptible and is protected by a per-CPU lock to ensure mutual exclusion. There is no difference on PREEMPT_RT between spin_lock_irq() and spin_lock() because the former does not disable interrupts. Therefore if a lock is used in_softirq() and locked once with spin_lock_irq() then lockdep will report this with "inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage". Teach lockdep that we don't really do softirqs on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211129174654.668506-6-bigeasy@linutronix.de	2021-12-04 10:56:23 +01:00
Sebastian Andrzej Siewior	a364202192	locking/rtmutex: Add rt_mutex_lock_nest_lock() and rt_mutex_lock_killable(). The locking selftest for ww-mutex expects to operate directly on the base-mutex which becomes a rtmutex on PREEMPT_RT. Add a rtmutex based implementation of mutex_lock_nest_lock() and mutex_lock_killable() named rt_mutex_lock_nest_lock() and rt_mutex_lock_killable(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211129174654.668506-5-bigeasy@linutronix.de	2021-12-04 10:56:23 +01:00
Peter Zijlstra	02ea9fc96f	locking/rtmutex: Squash self-deadlock check for ww_rt_mutex. Similar to the issues in commits: `6467822b8c` ("locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes") `a055fcc132` ("locking/rtmutex: Return success on deadlock for ww_mutex waiters") ww_rt_mutex_lock() should not return EDEADLK without first going through the __ww_mutex logic to set the required state. In fact, the chain-walk can deal with the spurious cycles (per the above commits) this check warns about and is trying to avoid. Therefore ignore this test for ww_rt_mutex and simply let things fall in place. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211129174654.668506-4-bigeasy@linutronix.de	2021-12-04 10:56:23 +01:00
Sebastian Andrzej Siewior	e08f343be0	locking: Remove rt_rwlock_is_contended(). rt_rwlock_is_contended() has no users. It makes no sense to use it as rwlock_is_contended() because it is a sleeping lock on RT and preemption is possible. It reports always != 0 if used by a writer and even if there is a waiter then the lock might not be handed over if the current owner has the highest priority. Remove rt_rwlock_is_contended(). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211129174654.668506-3-bigeasy@linutronix.de	2021-12-04 10:56:23 +01:00
Sebastian Andrzej Siewior	9d0df37797	sched: Trigger warning if ->migration_disabled counter underflows. If migrate_enable() is called more often than its counterpart, migrate_disable(), the imbalance goes undetected and rq::nr_pinned underflows, too. Add a warning if migrate_enable() is attempted without a matching migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211129174654.668506-2-bigeasy@linutronix.de	2021-12-04 10:56:22 +01:00
Vincent Donnefort	014ba44e81	sched/fair: Fix per-CPU kthread and wakee stacking for asym CPU capacity select_idle_sibling() has a special case for tasks woken up by a per-CPU kthread where the selected CPU is the previous one. For asymmetric CPU capacity systems, the assumption was that the wakee couldn't have a bigger utilization during task placement than it used to have during the last activation. That was not considering uclamp.min which can completely change between two task activations and as a consequence mandates the fitness criterion asym_fits_capacity(), even for the exit path described above. Fixes: `b4c9c9f156` ("sched/fair: Prefer prev cpu in asymmetric wakeup path") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211129173115.4006346-1-vincent.donnefort@arm.com	2021-12-04 10:56:21 +01:00
Vincent Donnefort	8b4e74ccb5	sched/fair: Fix detection of per-CPU kthreads waking a task select_idle_sibling() has a special case for tasks woken up by a per-CPU kthread, where the selected CPU is the previous one. However, the current condition for this exit path is incomplete. A task can wake up from an interrupt context (e.g. hrtimer), while a per-CPU kthread is running. Such a scenario would spuriously trigger the special case described above. Also, a recent change made the idle task like a regular per-CPU kthread, hence making that situation more likely to happen (is_per_cpu_kthread(swapper) being true now). Checking for task context makes sure select_idle_sibling() will not interpret a wake up from any other context as a wake up by a per-CPU kthread. Fixes: `52262ee567` ("sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211201143450.479472-1-vincent.donnefort@arm.com	2021-12-04 10:56:20 +01:00
Qais Yousef	315c4f8848	sched/uclamp: Fix rq->uclamp_max not set on first enqueue Commit `d81ae8aac8` ("sched/uclamp: Fix initialization of struct uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to match the woken up task's uclamp_max when the rq is idle. The code was relying on rq->uclamp_max initialized to zero, so on first enqueue static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p, enum uclamp_id clamp_id) { ... if (uc_se->value > READ_ONCE(uc_rq->value)) WRITE_ONCE(uc_rq->value, uc_se->value); } was actually resetting it. But since commit `d81ae8aac8` changed the default to 1024, this no longer works. And since rq->uclamp_flags is also initialized to 0, neither above code path nor uclamp_idle_reset() update the rq->uclamp_max on first wake up from idle. This is only visible from first wake up(s) until the first dequeue to idle after enabling the static key. And it only matters if the uclamp_max of this task is < 1024 since only then its uclamp_max will be effectively ignored. Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to ensure uclamp_idle_reset() is called which then will update the rq uclamp_max value as expected. Fixes: `d81ae8aac8` ("sched/uclamp: Fix initialization of struct uclamp_rq") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com	2021-12-04 10:56:18 +01:00
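The first-enqueue behaviour described in the uclamp commit above can be replayed with a small user-space model. This is only a sketch under stated assumptions: `struct rq_model`, `enqueue_model()` and `MODEL_UCLAMP_FLAG_IDLE` are hypothetical stand-ins rather than the kernel's definitions, and the idle reset is folded into the enqueue helper for brevity.

```c
/* Illustrative value only; the kernel defines its own UCLAMP_FLAG_IDLE. */
#define MODEL_UCLAMP_FLAG_IDLE 0x01u

/* Hypothetical, stripped-down stand-in for the rq's UCLAMP_MAX state. */
struct rq_model {
	unsigned int uclamp_max;
	unsigned int uclamp_flags;
};

/* Enqueue one task and return the rq's resulting uclamp_max. */
static unsigned int enqueue_model(struct rq_model *rq, unsigned int task_max)
{
	/* Models uclamp_idle_reset(): an idle rq adopts the task's value. */
	if (rq->uclamp_flags & MODEL_UCLAMP_FLAG_IDLE) {
		rq->uclamp_flags &= ~MODEL_UCLAMP_FLAG_IDLE;
		rq->uclamp_max = task_max;
	}
	/* Models the max-aggregation quoted in the commit message. */
	if (task_max > rq->uclamp_max)
		rq->uclamp_max = task_max;
	return rq->uclamp_max;
}
```

With an rq initialised to `{ 0, 0 }` a task with uclamp_max 512 sets the rq value through the max-aggregation alone. With `{ 1024, 0 }` the same task is ignored and the rq stays at 1024, which is the reported bug. With `{ 1024, MODEL_UCLAMP_FLAG_IDLE }` the idle reset runs first and the rq ends up at 512.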
Andrew Halaney	9ed20bafc8	preempt/dynamic: Fix setup_preempt_mode() return value __setup() callbacks expect 1 for success and 0 for failure. Correct the usage here to reflect that. Fixes: `826bfeb37b` ("preempt/dynamic: Support dynamic preempt with preempt= boot option") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com	2021-12-04 10:56:18 +01:00
Eric W. Biederman	9d3f401c52	Merge SA_IMMUTABLE-fixes-for-v5.16-rc2 I completed the first batch of signal changes for v5.17 against v5.16-rc1 before the SA_IMMUTABLE fixes were completed. Which leaves me with two lines of development that I want on my signal development branch both rooted at v5.16-rc1. Especially as I am hoping to reach the point of being able to remove SA_IMMUTABLE. Linus merged my SA_IMMUTABLE fixes as: `7af959b5d5` ("Merge branch 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace") To avoid rebasing the development changes that are currently complete I am merging the work I sent upstream to Linus to make my life simpler. The SA_IMMUTABLE changes as they are described in Linus's merge commit. Pull exit-vs-signal handling fixes from Eric Biederman: "This is a small set of changes where debuggers were no longer able to intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit cleanups. This is essentially the change you suggested with all of the i's dotted and the t's crossed so that ptrace can intercept all of the cases it has been able to intercept in the past, and all of the cases that made it to exit without giving ptrace a chance still don't give ptrace a chance" * 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal: Replace force_fatal_sig with force_exit_sig when in doubt signal: Don't always set SA_IMMUTABLE for forced signals Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2021-12-03 15:36:59 -06:00
Alexei Starovoitov	78c1f8d063	libbpf: Reduce bpf_core_apply_relo_insn() stack usage. Reduce bpf_core_apply_relo_insn() stack usage and bump BPF_CORE_SPEC_MAX_LEN limit back to 64. Fixes: `29db4bea1d` ("bpf: Prepare relo_core.c for kernel duty.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211203182836.16646-1-alexei.starovoitov@gmail.com	2021-12-03 13:21:59 -08:00
Maxim Mikityanskiy	2fa7d94afc	bpf: Fix the off-by-two error in range markings The first commit cited below attempts to fix the off-by-one error that appeared in some comparisons with an open range. Due to this error, arithmetically equivalent pieces of code could get different verdicts from the verifier, for example (pseudocode): // 1. Passes the verifier: if (data + 8 > data_end) return early read *(u64 *)data, i.e. [data; data+7] // 2. Rejected by the verifier (should still pass): if (data + 7 >= data_end) return early read *(u64 *)data, i.e. [data; data+7] The attempted fix, however, shifts the range by one in a wrong direction, so the bug not only remains, but also such a piece of code starts failing in the verifier: // 3. Rejected by the verifier, but the check is stricter than in #1. if (data + 8 >= data_end) return early read *(u64 *)data, i.e. [data; data+7] The change performed by that fix converted an off-by-one bug into off-by-two. The second commit cited below added the BPF selftests written to ensure that code chunks like #3 are rejected, however, they should be accepted. This commit fixes the off-by-two error by adjusting new_range in the right direction and fixes the tests by changing the range into the one that should actually fail. Fixes: `fb2a311a31` ("bpf: fix off by one for range markings with L{T, E} patterns") Fixes: `b37242c773` ("bpf: add test cases to bpf selftests to cover all access tests") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211130181607.593149-1-maximmi@nvidia.com	2021-12-03 21:44:42 +01:00
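The arithmetic behind the three guards in the commit above can be checked exhaustively over small offsets. The sketch below is a plain user-space model, not verifier code: pointers are replaced by integer offsets, and `guard1()`..`guard3()`, `read_in_bounds()` and `count_violations()` are hypothetical helper names introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Each guard returns true when the program would "return early". */
static bool guard1(size_t data, size_t data_end) { return data + 8 > data_end; }
static bool guard2(size_t data, size_t data_end) { return data + 7 >= data_end; }
static bool guard3(size_t data, size_t data_end) { return data + 8 >= data_end; }

/* True when the 8-byte read [data; data+7] stays below data_end. */
static bool read_in_bounds(size_t data, size_t data_end)
{
	return data + 7 < data_end;
}

/* Count offset pairs that contradict the claims in the commit message. */
static int count_violations(size_t limit)
{
	int bad = 0;

	for (size_t data = 0; data < limit; data++) {
		for (size_t end = 0; end < limit; end++) {
			/* Guards #1 and #2 must always agree. */
			if (guard1(data, end) != guard2(data, end))
				bad++;
			/* Getting past guard #1 must make the read safe. */
			if (!guard1(data, end) && !read_in_bounds(data, end))
				bad++;
			/* Guard #3 is stricter: it bails whenever #1 does. */
			if (guard1(data, end) && !guard3(data, end))
				bad++;
		}
	}
	return bad;
}
```

`count_violations(64)` finds no contradictions, so any program that gets past guard #3 also gets past guard #1 and performs a safe read. The pair `data = 0, data_end = 8` is a case where guard #3 returns early although the read would have been in bounds, which is what makes it the stricter check.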
Frederic Weisbecker	45c753f5f2	workqueue: Fix unbind_workers() VS wq_worker_sleeping() race At CPU-hotplug time, unbind_workers() may preempt a worker while it is going to sleep. In that case the following scenario can happen: unbind_workers() wq_worker_sleeping() -------------- ------------------- if (worker->flags & WORKER_NOT_RUNNING) return; //PREEMPTED by unbind_workers worker->flags \|= WORKER_UNBOUND; [...] atomic_set(&pool->nr_running, 0); //resume to worker atomic_dec_and_test(&pool->nr_running); After unbind_worker() resets pool->nr_running, the value is expected to remain 0 until the pool ever gets rebound in case cpu_up() is called on the target CPU in the future. But here the race leaves pool->nr_running with a value of -1, triggering the following warning when the worker goes idle: WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0 Modules linked in: CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Workqueue: 0x0 (rcu_par_gp) RIP: 0010:worker_enter_idle+0x95/0xc0 Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0 RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086 RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140 RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080 R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20 R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140 FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> worker_thread+0x89/0x3d0 ? process_one_work+0x400/0x400 kthread+0x162/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Also, due to this incorrect "nr_running == -1", all sorts of hazards can happen, starting with queued work items being ignored because no workers are awakened at insert_work() time. Fix this by checking the worker flags again while pool->lock is held. Fixes: `b945efcdd0` ("sched: Remove pointless preemption disable in sched_submit_work()") Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2021-12-02 13:00:59 -10:00
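The interleaving in the wq_worker_sleeping() commit above can be replayed deterministically in user space. This is a single-threaded sketch, not kernel code: `struct pool_model`, `struct worker_model`, `unbind_model()`, `replay_sleeping()` and the flag value are hypothetical, and a single bit stands in for the kernel's WORKER_NOT_RUNNING mask.

```c
#include <stdbool.h>

/* Illustrative flag value; the kernel's WORKER_* flags are defined elsewhere. */
#define MODEL_WORKER_UNBOUND 0x1u

struct pool_model   { int nr_running; };
struct worker_model { unsigned int flags; struct pool_model *pool; };

/* What unbind_workers() does to one worker in the scenario above. */
static void unbind_model(struct worker_model *w)
{
	w->flags |= MODEL_WORKER_UNBOUND;
	w->pool->nr_running = 0;
}

/*
 * Replay wq_worker_sleeping() with unbind_workers() preempting it right
 * after the first flags check. With recheck set, the flags are tested
 * again before the decrement, mirroring the re-check under pool->lock.
 * Returns the final value of nr_running.
 */
static int replay_sleeping(bool recheck)
{
	struct pool_model pool = { .nr_running = 1 };
	struct worker_model w = { .flags = 0, .pool = &pool };

	/* First check: the worker is still counted as running. */
	if (w.flags & MODEL_WORKER_UNBOUND)
		return pool.nr_running;

	/* The preemption by unbind_workers() happens here. */
	unbind_model(&w);

	/* Fixed variant: notice the new flag and leave the counter alone. */
	if (recheck && (w.flags & MODEL_WORKER_UNBOUND))
		return pool.nr_running;

	/* Racy variant: decrement a counter that was just reset to 0. */
	pool.nr_running--;
	return pool.nr_running;
}
```

`replay_sleeping(false)` ends with nr_running at -1, the value visible as `RAX: 00000000ffffffff` in the warning, while `replay_sleeping(true)` leaves it at 0.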
Frederic Weisbecker	07edfece8b	workqueue: Fix unbind_workers() VS wq_worker_running() race At CPU-hotplug time, unbind_worker() may preempt a worker while it is waking up. In that case the following scenario can happen: unbind_workers() wq_worker_running() -------------- ------------------- if (!(worker->flags & WORKER_NOT_RUNNING)) //PREEMPTED by unbind_workers worker->flags \|= WORKER_UNBOUND; [...] atomic_set(&pool->nr_running, 0); //resume to worker atomic_inc(&worker->pool->nr_running); After unbind_worker() resets pool->nr_running, the value is expected to remain 0 until the pool ever gets rebound in case cpu_up() is called on the target CPU in the future. But here the race leaves pool->nr_running with a value of 1, triggering the following warning when the worker goes idle: WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0 Modules linked in: CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Workqueue: 0x0 (rcu_par_gp) RIP: 0010:worker_enter_idle+0x95/0xc0 Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0 RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086 RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140 RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080 R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20 R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140 FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 
DR7: 0000000000000400 Call Trace: <TASK> worker_thread+0x89/0x3d0 ? process_one_work+0x400/0x400 kthread+0x162/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Also, due to this incorrect "nr_running == 1", further queued work may end up not being served, because no worker is awakened at work insert time. This causes rcutorture writer stalls, for example. Fix this by disabling preemption in the right place in wq_worker_running(). It's worth noting that if the worker migrates and runs concurrently with unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update due to set_cpus_allowed_ptr() acquiring/releasing rq->lock. Fixes: `6d25be5782` ("sched/core, workqueues: Distangle worker accounting from rq lock") Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2021-12-02 12:59:58 -10:00
Kumar Kartikeya Dwivedi	b12f031043	bpf: Fix bpf_check_mod_kfunc_call for built-in modules When module registering its set is built-in, THIS_MODULE will be NULL, hence we cannot return early in case owner is NULL. Fixes: `14f267d95f` ("bpf: btf: Introduce helpers for dynamic BTF set registration") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211122144742.477787-3-memxor@gmail.com	2021-12-02 13:39:46 -08:00
Kumar Kartikeya Dwivedi	d9847eb8be	bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL Vinicius Costa Gomes reported [0] that build fails when CONFIG_DEBUG_INFO_BTF is enabled and CONFIG_BPF_SYSCALL is disabled. This leads to btf.c not being compiled, and then no symbol being present in vmlinux for the declarations in btf.h. Since BTF is not useful without enabling BPF subsystem, disallow this combination. However, theoretically disabling both now could still fail, as the symbol for kfunc_btf_id_list variables is not available. This isn't a problem as the compiler usually optimizes the whole register/unregister call, but at lower optimization levels it can fail the build in linking stage. Fix that by adding dummy variables so that modules taking address of them still work, but the whole thing is a noop. [0]: https://lore.kernel.org/bpf/20211110205418.332403-1-vinicius.gomes@intel.com Fixes: `14f267d95f` ("bpf: btf: Introduce helpers for dynamic BTF set registration") Reported-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211122144742.477787-2-memxor@gmail.com	2021-12-02 13:39:46 -08:00
Jakub Kicinski	fc993be36f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-02 11:44:56 -08:00
Alexei Starovoitov	1e89106da2	bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn(). Given BPF program's BTF root type name perform the following steps: . search in vmlinux candidate cache. . if (present in cache and candidate list >= 1) return candidate list. . do a linear search through kernel BTFs for possible candidates. . regardless of number of candidates found populate vmlinux cache. . if (candidate list >= 1) return candidate list. . search in module candidate cache. . if (present in cache) return candidate list (even if list is empty). . do a linear search through BTFs of all kernel modules collecting candidates from all of them. . regardless of number of candidates found populate module cache. . return candidate list. Then wire the result into bpf_core_apply_relo_insn(). When BPF program is trying to CO-RE relocate a type that doesn't exist in either vmlinux BTF or in modules BTFs these steps will perform 2 cache lookups when cache is hit. Note the cache doesn't prevent the abuse by the program that might have lots of relocations that cannot be resolved. Hence cond_resched(). CO-RE in the kernel requires CAP_BPF, since BTF loading requires it. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211201181040.23337-9-alexei.starovoitov@gmail.com	2021-12-02 11:18:35 -08:00
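The lookup order listed in the commit above can be sketched as a two-level cache in front of two linear searches. Everything named here (`struct cand_cache`, `struct type_db`, `find_cands_model()` and the helpers) is hypothetical and only models candidate counts, not real BTF data. One assumption is made explicit: a vmlinux cache entry with an empty list skips straight to the module cache, which is what the "2 cache lookups when cache is hit" remark implies.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 8

/* Hypothetical cache: remembers how many candidates a name produced. */
struct cand_cache {
	const char *name[CACHE_SLOTS];
	int ncands[CACHE_SLOTS];
	int used;
};

/* Hypothetical stand-in for a set of BTFs; scans counts linear searches. */
struct type_db {
	const char *const *names;
	size_t n;
	int scans;
};

/* Return the cached candidate count, or -1 if the name is not cached. */
static int cache_lookup(const struct cand_cache *c, const char *name)
{
	for (int i = 0; i < c->used; i++)
		if (strcmp(c->name[i], name) == 0)
			return c->ncands[i];
	return -1;
}

/* Populate the cache regardless of how many candidates were found. */
static void cache_store(struct cand_cache *c, const char *name, int ncands)
{
	if (c->used < CACHE_SLOTS) {
		c->name[c->used] = name;
		c->ncands[c->used] = ncands;
		c->used++;
	}
}

static int linear_search(struct type_db *db, const char *name)
{
	int found = 0;

	db->scans++;
	for (size_t i = 0; i < db->n; i++)
		if (strcmp(db->names[i], name) == 0)
			found++;
	return found;
}

/* Returns the number of candidates found for name. */
static int find_cands_model(struct cand_cache *vmlinux_cache, struct type_db *vmlinux,
			    struct cand_cache *module_cache, struct type_db *modules,
			    const char *name)
{
	int n = cache_lookup(vmlinux_cache, name);

	if (n >= 1)
		return n;               /* cached, non-empty list */
	if (n < 0) {                    /* not cached: search kernel BTFs */
		n = linear_search(vmlinux, name);
		cache_store(vmlinux_cache, name, n);
		if (n >= 1)
			return n;
	}
	n = cache_lookup(module_cache, name);
	if (n >= 0)
		return n;               /* cached, even if the list is empty */
	n = linear_search(modules, name);
	cache_store(module_cache, name, n);
	return n;
}
```

Looking up a name that exists nowhere costs one linear search of each database the first time and only the two cache lookups afterwards. The model leaves out the cond_resched() mentioned in the commit message, since it has no user-space equivalent here.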


38217 commits