linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Sebastian Andrzej Siewior	71def423fe	cpu/hotplug: Provide cpuhp_setup/remove_state[_nocalls]_cpuslocked() Some call sites of cpuhp_setup/remove_state[_nocalls]() are within a cpus_read locked region. cpuhp_setup/remove_state[_nocalls]() call cpus_read_lock() as well, which is possible in the current implementation but prevents converting the hotplug locking to a percpu rwsem. Provide locked versions of the interfaces to avoid nested calls to cpus_read_lock(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170524081547.239600868@linutronix.de	2017-05-26 10:10:35 +02:00
Thomas Gleixner	8f553c498e	cpu/hotplug: Provide cpus_read\|write_[un]lock() The counting 'rwsem' hackery of get\|put_online_cpus() is going to be replaced by percpu rwsem. Rename the functions to make it clear that it's locking and not some refcount style interface. These new functions will be used for the preparatory patches which make the code ready for the percpu rwsem conversion. Rename all instances in the cpu hotplug code while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170524081547.080397752@linutronix.de	2017-05-26 10:10:34 +02:00
Babu Moger	9ab6055f95	kernel/locking: Fix compile error with qrwlock.c Saw these compile errors on SPARC when queued rwlock feature is enabled. CC kernel/locking/qrwlock.o kernel/locking/qrwlock.c: In function ‘queued_read_lock_slowpath’: kernel/locking/qrwlock.c:89: error: implicit declaration of function ‘arch_spin_lock’ kernel/locking/qrwlock.c:102: error: implicit declaration of function ‘arch_spin_unlock’ make[4]: *** [kernel/locking/qrwlock.o] Error 1 Include spinlock.h in qrwlock.c to fix it. Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-25 12:06:50 -07:00
Daniel Borkmann	a316338cb7	bpf: fix wrong exposure of map_flags into fdinfo for lpm trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via attr->map_flags, since it does not support preallocation yet. We check the flag, but we never copy the flag into trie->map.map_flags, which is later on exposed into fdinfo and used by loaders such as iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test whether a pinned map has the same spec as the one from the BPF obj file and if not, bails out, which is currently the case for lpm since it exposes always 0 as flags. Also copy over flags in array_map_alloc() and stack_map_alloc(). They always have to be 0 right now, but we should make sure to not miss to copy them over at a later point in time when we add actual flags for them to use. Fixes: `b95a5c4db0` ("bpf: add a longest prefix match trie map implementation") Reported-by: Jarno Rajahalme <jarno@covalent.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-25 13:44:28 -04:00
Daniel Borkmann	a9789ef9af	bpf: properly reset caller saved regs after helper call and ld_abs/ind Currently, after performing helper calls, we clear all caller saved registers, that is r0 - r5 and fill r0 depending on struct bpf_func_proto specification. The way we reset these regs can affect pruning decisions in later paths, since we only reset register's imm to 0 and type to NOT_INIT. However, we leave out clearing of other variables such as id, min_value, max_value, etc, which can later on lead to pruning mismatches due to stale data. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-25 13:44:27 -04:00
Daniel Borkmann	1ad2f5838d	bpf: fix incorrect pruning decision when alignment must be tracked Currently, when we enforce alignment tracking on direct packet access, the verifier lets the following program pass despite doing a packet write with unaligned access: 0: (61) r2 = (u32 )(r1 +76) 1: (61) r3 = (u32 )(r1 +80) 2: (61) r7 = (u32 )(r1 +8) 3: (bf) r0 = r2 4: (07) r0 += 14 5: (25) if r7 > 0x1 goto pc+4 R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp 6: (2d) if r0 > r3 goto pc+1 R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14) R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp 7: (63) (u32 )(r0 -4) = r0 8: (b7) r0 = 0 9: (95) exit from 6 to 8: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp 8: (b7) r0 = 0 9: (95) exit from 5 to 10: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=2 R10=fp 10: (07) r0 += 1 11: (05) goto pc-6 6: safe <----- here, wrongly found safe processed 15 insns However, if we enforce a pruning mismatch by adding state into r8 which is then being mismatched in states_equal(), we find that for the otherwise same program, the verifier detects a misaligned packet access when actually walking that path: 0: (61) r2 = (u32 )(r1 +76) 1: (61) r3 = (u32 )(r1 +80) 2: (61) r7 = (u32 )(r1 +8) 3: (b7) r8 = 1 4: (bf) r0 = r2 5: (07) r0 += 14 6: (25) if r7 > 0x1 goto pc+4 R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=0,max_value=1 R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp 7: (2d) if r0 > r3 goto pc+1 R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14) R3=pkt_end R7=inv,min_value=0,max_value=1 R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp 8: (63) (u32 )(r0 -4) = r0 9: (b7) r0 = 0 10: (95) exit from 7 to 9: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=0,max_value=1 R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp 9: (b7) r0 = 0 10: (95) exit from 6 to 11: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end R7=inv,min_value=2 R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp 11: (07) r0 += 1 12: (b7) r8 = 0 13: (05) goto pc-7 <----- mismatch due to r8 7: (2d) if r0 > r3 goto pc+1 R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15) R3=pkt_end R7=inv,min_value=2 R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp 8: (63) (u32 )(r0 -4) = r0 misaligned packet access off 2+15+-4 size 4 The reason why we fail to see it in states_equal() is that the third test in compare_ptrs_to_packet() ... if (old->off <= cur->off && old->off >= old->range && cur->off >= cur->range) return true; ... will let the above pass. The situation we run into is that old->off <= cur->off (14 <= 15), meaning that prior walked paths went with smaller offset, which was later used in the packet access after successful packet range check and found to be safe already. For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14 as in above program to it, results in R0=pkt(id=0,off=14,r=0) before the packet range test. Now, testing this against R3=pkt_end with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14) for the case when we're within bounds. A write into the packet at offset (u32 )(r0 -4), that is, 2 + 14 -4, is valid and aligned (2 is for NET_IP_ALIGN). After processing this with all fall-through paths, we later on check paths from branches. When the above skb->mark test is true, then we jump near the end of the program, perform r0 += 1, and jump back to the 'if r0 > r3 goto out' test we've visited earlier already. This time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune that part because this time we'll have a larger safe packet range, and we already found that with off=14 all further insn were already safe, so it's safe as well with a larger off. However, the problem is that the subsequent write into the packet with 2 + 15 -4 is then unaligned, and not caught by the alignment tracking. Note that min_align, aux_off, and aux_off_align were all 0 in this example. Since we cannot tell at this time what kind of packet access was performed in the prior walk and what minimal requirements it has (we might do so in the future, but that requires more complexity), fix it to disable this pruning case for strict alignment for now, and let the verifier do check such paths instead. With that applied, the test cases pass and reject the program due to misalignment. Fixes: `d117441674` ("bpf: Track alignment of register values in the verifier.") Reference: http://patchwork.ozlabs.org/patch/761909/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-25 13:44:27 -04:00
Peter Rajnoha	f36776fafb	kobject: support passing in variables for synthetic uevents This patch makes it possible to pass additional arguments in addition to uevent action name when writing /sys/.../uevent attribute. These additional arguments are then inserted into generated synthetic uevent as additional environment variables. Before, we were not able to pass any additional uevent environment variables for synthetic uevents. This made it hard to identify such uevents properly in userspace to make proper distinction between genuine uevents originating from kernel and synthetic uevents triggered from userspace. Also, it was not possible to pass any additional information which would make it possible to optimize and change the way the synthetic uevents are processed back in userspace based on the originating environment of the triggering action in userspace. With the extra additional variables, we are able to pass through this extra information needed and also it makes it possible to synchronize with such synthetic uevents as they can be clearly identified back in userspace. The format for writing the uevent attribute is following: ACTION [UUID [KEY=VALUE ...] There's no change in how "ACTION" is recognized - it stays the same ("add", "change", "remove"). The "ACTION" is the only argument required to generate synthetic uevent, the rest of arguments, that this patch adds support for, are optional. The "UUID" is considered as transaction identifier so it's possible to use the same UUID value for one or more synthetic uevents in which case we logically group these uevents together for any userspace listeners. The "UUID" is expected to be in "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" format where "x" is a hex digit. The value appears in uevent as "SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" environment variable. The "KEY=VALUE" pairs can contain alphanumeric characters only. It's possible to define zero or more more pairs - each pair is then delimited by a space character " ". Each pair appears in synthetic uevents as "SYNTH_ARG_KEY=VALUE" environment variable. That means the KEY name gains "SYNTH_ARG_" prefix to avoid possible collisions with existing variables. To pass the "KEY=VALUE" pairs, it's also required to pass in the "UUID" part for the synthetic uevent first. If "UUID" is not passed in, the generated synthetic uevent gains "SYNTH_UUID=0" environment variable automatically so it's possible to identify this situation in userspace when reading generated uevent and so we can still make a difference between genuine and synthetic uevents. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-05-25 18:30:51 +02:00
Tejun Heo	41c25707d2	cpuset: consider dying css as offline In most cases, a cgroup controller don't care about the liftimes of cgroups. For the controller, a css becomes online when ->css_online() is called on it and offline when ->css_offline() is called. However, cpuset is special in that the user interface it exposes cares whether certain cgroups exist or not. Combined with the RCU delay between cgroup removal and css offlining, this can lead to user visible behavior oddities where operations which should succeed after cgroup removals fail for some time period. The effects of cgroup removals are delayed when seen from userland. This patch adds css_is_dying() which tests whether offline is pending and updates is_cpuset_online() so that the function returns false also while offline is pending. This gets rid of the userland visible delays. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com> Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>	2017-05-24 12:43:30 -04:00
Linus Torvalds	2426125ab4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ptrace fix from Eric Biederman: "This fixes a brown paper bag bug. When I fixed the ptrace interaction with user namespaces I added a new field ptracer_cred in struct_task and I failed to properly initialize it on fork. This dangling pointer wound up breaking runing setuid applications run from the enlightenment window manager. As this is the worst sort of bug. A regression breaking user space for no good reason let's get this fixed" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ptrace: Properly initialize ptracer_cred on fork	2017-05-24 08:28:59 -07:00
Peter Zijlstra	45aea32167	sched/clock: Fix early boot preempt assumption in __set_sched_clock_stable() The more strict early boot preemption warnings found that __set_sched_clock_stable() was incorrectly assuming we'd still be running on a single CPU: BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1 caller is debug_smp_processor_id+0x1c/0x1e CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2-00108-g1c3c5ea #1 Call Trace: dump_stack+0x110/0x192 check_preemption_disabled+0x10c/0x128 ? set_debug_rodata+0x25/0x25 debug_smp_processor_id+0x1c/0x1e sched_clock_init_late+0x27/0x87 [...] Fix it by disabling IRQs. Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: lkp@01.org Cc: tipbuild@zytor.com Link: http://lkml.kernel.org/r/20170524065202.v25vyu7pvba5mhpd@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-24 09:10:00 +02:00
Thomas Gleixner	43fe8b8eb8	posix-timers: Make signal printks conditional A recent commit added extra printks for CPU/RT limits. This can result in excessive spam in dmesg. Make the printks conditional on print_fatal_signals. Fixes: `e7ea7c9806` ("rlimits: Print more information when CPU/RT limits are exceeded") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arun Raghavan <arun@arunraghavan.net>	2017-05-23 23:39:57 +02:00
Richard Guy Briggs	4b3e4ed6b0	audit: unswing cap_* fields in PATH records The cap_* fields swing in and out of PATH records. If no capabilities are set, the cap_* fields are completely missing and when one of the cap_fi or cap_fp values is empty, that field is omitted. Original: type=PATH msg=audit(04/20/2017 12:17:11.222:193) : item=1 name=/lib64/ld-linux-x86-64.so.2 inode=787694 dev=08:03 mode=file,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:ld_so_t:s0 nametype=NORMAL type=PATH msg=audit(04/20/2017 12:17:11.222:193) : item=0 name=/home/sleep inode=1319469 dev=08:03 mode=file,suid,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:bin_t:s0 nametype=NORMAL cap_fp=sys_admin cap_fe=1 cap_fver=2 Normalize the PATH record by always printing all 4 cap_* fields. Fixed: type=PATH msg=audit(04/20/2017 13:01:31.679:201) : item=1 name=/lib64/ld-linux-x86-64.so.2 inode=787694 dev=08:03 mode=file,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:ld_so_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 type=PATH msg=audit(04/20/2017 13:01:31.679:201) : item=0 name=/home/sleep inode=1319469 dev=08:03 mode=file,suid,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:bin_t:s0 nametype=NORMAL cap_fp=sys_admin cap_fi=none cap_fe=1 cap_fver=2 See: https://github.com/linux-audit/audit-kernel/issues/42 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2017-05-23 16:50:02 -04:00
Eric W. Biederman	c70d9d809f	ptrace: Properly initialize ptracer_cred on fork When I introduced ptracer_cred I failed to consider the weirdness of fork where the task_struct copies the old value by default. This winds up leaving ptracer_cred set even when a process forks and the child process does not wind up being ptraced. Because ptracer_cred is not set on non-ptraced processes whose parents were ptraced this has broken the ability of the enlightenment window manager to start setuid children. Fix this by properly initializing ptracer_cred in ptrace_init_task This must be done with a little bit of care to preserve the current value of ptracer_cred when ptrace carries through fork. Re-reading the ptracer_cred from the ptracing process at this point is inconsistent with how PT_PTRACE_CAP has been maintained all of these years. Tested-by: Takashi Iwai <tiwai@suse.de> Fixes: `64b875f7ac` ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2017-05-23 07:40:44 -05:00
Thomas Gleixner	1c3c5eab17	sched/core: Enable might_sleep() and smp_processor_id() checks early might_sleep() and smp_processor_id() checks are enabled after the boot process is done. That hides bugs in the SMP bringup and driver initialization code. Enable it right when the scheduler starts working, i.e. when init task and kthreadd have been created and right before the idle task enables preemption. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170516184736.272225698@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:38 +02:00
Thomas Gleixner	ff48cd26fc	printk: Adjust system_state checks To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in boot_delay_msec() to handle the extra states. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170516184736.027534895@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:37 +02:00
Thomas Gleixner	0594729c24	extable: Adjust system_state checks To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in core_kernel_text() to handle the extra states, i.e. to cover init text up to the point where the system switches to state RUNNING. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170516184735.949992741@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:37 +02:00
Thomas Gleixner	b4def42724	async: Adjust system_state checks To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in async_run_entry_fn() and async_synchronize_cookie_domain() to handle the extra states. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170516184735.865155020@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:37 +02:00
Vlastimil Babka	8655d54977	sched/numa: Use down_read_trylock() for the mmap_sem A customer has reported a soft-lockup when running an intensive memory stress test, where the trace on multiple CPU's looks like this: RIP: 0010:[<ffffffff810c53fe>] [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190 ... Call Trace: [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa [<ffffffff811bc331>] change_protection_range+0x3b1/0x930 [<ffffffff811d4be8>] change_prot_numa+0x18/0x30 [<ffffffff810adefe>] task_numa_work+0x1fe/0x310 [<ffffffff81098322>] task_work_run+0x72/0x90 Further investigation showed that the lock contention here is pmd_lock(). The task_numa_work() function makes sure that only one thread is let to perform the work in a single scan period (via cmpxchg), but if there's a thread with mmap_sem locked for writing for several periods, multiple threads in task_numa_work() can build up a convoy waiting for mmap_sem for read and then all get unblocked at once. This patch changes the down_read() to the trylock version, which prevents the build up. For a workload experiencing mmap_sem contention, it's probably better to postpone the NUMA balancing work anyway. This seems to have fixed the soft lockups involving pmd_lock(), which is in line with the convoy theory. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:34 +02:00
Dave Kleikamp	c249f255aa	sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer() With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially takes each CPU's rq->lock. On a large, busy system, the cumulative time it takes to acquire each lock can be excessive, even triggering a watchdog timeout. If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does nothing while holding the lock, so don't bother taking it at all. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:34 +02:00
Steven Rostedt (VMware)	896bbb2522	sched/core: Allow __sched_setscheduler() in interrupts when PI is not used When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI is not a common occurrence, lockdep will likely never trigger if sched_setscheduler was called from interrupt context. A BUG_ON() was added to trigger if __sched_setscheduler() was ever called from interrupt context because there was a possibility to take the wait_lock. Today the wait_lock is irq safe, but the path to taking it in sched_setscheduler() is the same as the path to taking it from normal context. The wait_lock is taken with raw_spin_lock_irq() and released with raw_spin_unlock_irq() which will indiscriminately enable interrupts, which would be bad in interrupt context. The problem is that normalize_rt_tasks, which is called by triggering the sysrq nice-all-RT-tasks was changed to call __sched_setscheduler(), and this is done from interrupt context! Now __sched_setscheduler() takes a "pi" parameter that is used to know if the priority inheritance should be called or not. As the BUG_ON() only cares about calling the PI code, it should only bug if called from interrupt context with the "pi" parameter set to true. Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@osdl.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `dbc7f069b9` ("sched: Use replace normalize_task() with __sched_setscheduler()") Link: http://lkml.kernel.org/r/20170308124654.10e598f2@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:34 +02:00
Byungchul Park	a776b968e5	sched/deadline: Remove unnecessary condition in push_dl_task() pick_next_pushable_dl_task(rq) has BUG_ON(rq->cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494551159-22367-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:33 +02:00
Byungchul Park	de16b91eff	sched/rt: Remove unnecessary condition in push_rt_task() pick_next_pushable_task(rq) has BUG_ON(rq_cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494551143-22219-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:33 +02:00
Byungchul Park	73215849df	sched/core: Use the new llist_for_each_entry_safe() primitive Now that we've added llist_for_each_entry_safe(), use it to simplify an open coded version of it in sched_ttwu_pending(). Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494549584-11730-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:33 +02:00
Peter Zijlstra	6c8557bdb2	smp, cpumask: Use non-atomic cpumask_{set,clear}_cpu() The cpumasks in smp_call_function_many() are private and not subject to concurrency, atomic bitops are pointless and expensive. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:32 +02:00
Aaron Lu	3fc5b3b6a8	smp: Avoid sending needless IPI in smp_call_function_many() Inter-Processor-Interrupt(IPI) is needed when a page is unmapped and the process' mm_cpumask() shows the process has ever run on other CPUs. page migration, page reclaim all need IPIs. The number of IPI needed to send to different CPUs is especially large for multi-threaded workload since mm_cpumask() is per process. For smp_call_function_many(), whenever a CPU queues a CSD to a target CPU, it will send an IPI to let the target CPU to handle the work. This isn't necessary - we need only send IPI when queueing a CSD to an empty call_single_queue. The reason: flush_smp_call_function_queue() that is called upon a CPU receiving an IPI will empty the queue and then handle all of the CSDs there. So if the target CPU's call_single_queue is not empty, we know that: i. An IPI for the target CPU has already been sent by 'previous queuers'; ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet. Thus, it's safe for us to just queue our CSD there without sending an addtional IPI. And for the 'previous queuers', we can limit it to the first queuer. To demonstrate the effect of this patch, a multi-thread workload that spawns 80 threads to equally consume 100G memory is used. This is tested on a 2 node broadwell-EP which has 44cores/88threads and 32G memory. So after 32G memory is used up, page reclaiming starts to happen a lot. With this patch, IPI number dropped 88% and throughput increased about 15% for the above workload. Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20170519075331.GE2084@aaronlu.sh.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 10:01:32 +02:00
Ingo Molnar	386b554888	Merge branch 'linus' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 09:50:35 +02:00
Dan Carpenter	36cc2b9222	perf/core: Fix error handling in perf_event_alloc() We don't set an error code here which means that perf_event_alloc() returns ERR_PTR(0) (in other words NULL). The callers are not expecting that and would Oops. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: `375637bc52` ("perf/core: Introduce address range filtering") Link: http://lkml.kernel.org/r/20170522090418.hvs6icgpdo53wkn5@mwanda Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 09:50:05 +02:00
Dan Carpenter	85c617abc7	perf/core: Remove some dead code perf_init_event() can't return NULL. If it did, the error handling is incomplete and we would crash. I have removed this confusing dead code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20170522090348.5g7yyld5en3yeky4@mwanda Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-23 09:50:05 +02:00
Linus Torvalds	801099bed0	Power management updates for v4.12-rc3 - Fix RTC wakeup from suspend-to-idle broken by the recent rework of ACPI wakeup handling (Rafael Wysocki). - Update intel_pstate driver documentation to reflect the current code and explain how it works in more detail (Rafael Wysocki). - Fix an issue related to CPU idleness detection on systems with shared cpufreq policies in the schedutil governor (Juri Lelli). - Fix a possible build issue in the dbx500 cpufreq driver (Arnd Bergmann). - Fix a function in the power capping framework core to return an error code instead of 0 when there's an error (Dan Carpenter). - Clean up variable definition in the hibernation core (Pushkar Jambhlekar). -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJZIzszAAoJEILEb/54YlRxMG0P/R4VpPMB1l+wxQRmCMwzOupC GJ1jTa2mQQpPy57QPjaCDlUPxSaZA97S4MO0eMn4Or6LX3rG7kTUoe1WaYvRhWNk Ul2UfoLdVeFJwvQrzOZKB2xnEGA/nD2jlsD/9zYzy9FxMPjiG0F//RZvhZJVChpg wycz9Rw1T2x+1URAD5wkS4xLWzQEv5NqH6mc/KAoP/ntxe+7ahs5SnWmF9MLpHj7 jXM9651BUSYp3QzHCHFObvsVZfbZz7isFIADmwsxzTy7vTPb1oIyo7EQ5QMcsivS LlJjrYy9JN0alwND0mistVlAmFVvvldckjR8zHSEiFt8IeMccrFw0inGir2ngghY 53kMnJ/QoL1A/C539MHoAmfnpqB0QUd56QjXngungC47YpVHi5DaSXU7rln2xy/C 7o7gbHUKUbStSvDLjRcQ915HANOuXkJk84BMIGUSlT3K/MvGAMKUNxZV7KOOngpb WR4G2lxjYTIHKB+YP5AmG2kMF4GlbGnIQts5Ryd5FijIH3/MYJ4W2Kas+GvbnoBb 7NtDjyBJgjxleTv3fV89Pod+dKdFzrTRl+mr6bsn/WCiMjUHoXcTnOHh3OO/fJ8F AW/dywk9+Hx5DyjY04EJyklflfne97T7/NjJ99Zjzh/EC+uePeM+dMd+o66PpYG5 +FJgyPc5ZaX1f2thAgv+ =2sNW -----END PGP SIGNATURE----- Merge tag 'pm-4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix RTC wakeup from suspend-to-idle broken recently, fix CPU idleness detection condition in the schedutil cpufreq governor, fix a cpufreq driver build failure, fix an error code path in the power capping framework, clean up the hibernate core and update the intel_pstate documentation. Specifics: - Fix RTC wakeup from suspend-to-idle broken by the recent rework of ACPI wakeup handling (Rafael Wysocki). - Update intel_pstate driver documentation to reflect the current code and explain how it works in more detail (Rafael Wysocki). - Fix an issue related to CPU idleness detection on systems with shared cpufreq policies in the schedutil governor (Juri Lelli). - Fix a possible build issue in the dbx500 cpufreq driver (Arnd Bergmann). - Fix a function in the power capping framework core to return an error code instead of 0 when there's an error (Dan Carpenter). - Clean up variable definition in the hibernation core (Pushkar Jambhlekar)" * tag 'pm-4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: dbx500: add a Kconfig symbol PM / hibernate: Declare variables as static PowerCap: Fix an error code in powercap_register_zone() RTC: rtc-cmos: Fix wakeup from suspend-to-idle PM / wakeup: Fix up wakeup_source_report_event() cpufreq: intel_pstate: Document the current behavior and user interface cpufreq: schedutil: use now as reference when aggregating shared policy requests	2017-05-22 19:24:32 -07:00
Marc Zyngier	a97b852b4d	genirq/msi: Populate the domain name if provided by the irqchip In order to ease debug, let's populate the domain name upfront, before any MSI gets requested. This allows the domain to appear in the irq_domain_mapping, and the user to easily find the expected data. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/20170512115538.10767-4-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2017-05-22 22:29:45 +02:00
Marc Zyngier	2370c00dc7	irqdomain: Let irq_domain_mapping display ACPI fwnode attributes If the system is using ACPI, there is no of_node to display. But ACPI can use a struct irqchip_fwid as a domain identifier, and it can be used to display the name contained in that structure. The output on such a system will look like this: pMSI 0 0 0 irqchip@00000000e1180000 MSI 37 0 0 irqchip@00000000e1180000 GICv2m 37 0 0 irqchip@00000000e1180000 GICv2 448 448 0 irqchip@ffff000008003000 Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/20170512115538.10767-3-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2017-05-22 22:29:44 +02:00
Marc Zyngier	fe17a42e70	irqdomain: Let irq_domain_mapping display hierarchical domains Hierarchical domains seem to be hard to grasp, and a number of aspiring kernel hackers find them utterly discombobulating. In order to ease their pain, let's make them appear in /sys/kernel/debug/irq_domain_mapping, such as the following: 96 0x81808 MSI 0x (null) RADIX MSI 96+ 0x00063 GICv2m 0xffff8003ee116980 RADIX GICv2m 96+ 0x00063 GICv2 0xffff00000916bfd8 LINEAR GICv2 [output compressed to fit in a commit log] This shows that IRQ96 is implemented by a stack of three domains, the + sign indicating the stacking. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/20170512115538.10767-2-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2017-05-22 22:29:44 +02:00
Vegard Nossum	4d6501dce0	kthread: Fix use-after-free if kthread fork fails If a kthread forks (e.g. usermodehelper since commit `1da5c46fa9`) but fails in copy_process() between calling dup_task_struct() and setting p->set_child_tid, then the value of p->set_child_tid will be inherited from the parent and get prematurely freed by free_kthread_struct(). kthread() - worker_thread() - process_one_work() \| - call_usermodehelper_exec_work() \| - kernel_thread() \| - _do_fork() \| - copy_process() \| - dup_task_struct() \| - arch_dup_task_struct() \| - tsk->set_child_tid = current->set_child_tid // implied \| - ... \| - goto bad_fork_* \| - ... \| - free_task(tsk) \| - free_kthread_struct(tsk) \| - kfree(tsk->set_child_tid) - ... - schedule() - __schedule() - wq_worker_sleeping() - kthread_data(task)->flags // UAF The problem started showing up with commit `1da5c46fa9` since it reused ->set_child_tid for the kthread worker data. A better long-term solution might be to get rid of the ->set_child_tid abuse. The comment in set_kthread_struct() also looks slightly wrong. Debugged-by: Jamie Iles <jamie.iles@oracle.com> Fixes: `1da5c46fa9` ("kthread: Make struct kthread kmalloc'ed") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jamie Iles <jamie.iles@oracle.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2017-05-22 22:21:16 +02:00
Michael Hernandez	6f9a22bc57	PCI/MSI: Ignore affinity if pre/post vector count is more than min_vecs min_vecs is the minimum amount of vectors needed to operate in MSI-X mode which may just include the vectors that don't need affinity. Disabling affinity settings causes the qla2xxx driver scsi_add_host() to fail when blk_mq is enabled as the blk_mq_pci_map_queues() expects affinity masks on each vector. Fixes: `dfef358bd1` ("PCI/MSI: Don't apply affinity if there aren't enough vectors left") Signed-off-by: Michael Hernandez <michael.hernandez@cavium.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org # v4.10+	2017-05-22 15:06:05 -05:00
Peter Zijlstra	04dc1b2fff	futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock() Markus reported that the glibc/nptl/tst-robustpi8 test was failing after commit: `cfafcd117d` ("futex: Rework futex_lock_pi() to use rt_mutex__proxy_lock()") The following trace shows the problem: ld-linux-x86-64-2161 [019] .... 410.760971: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_LOCK_PI ld-linux-x86-64-2161 [019] ...1 410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0 ld-linux-x86-64-2165 [011] .... 410.760978: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_UNLOCK_PI ld-linux-x86-64-2165 [011] d..1 410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0 ld-linux-x86-64-2165 [011] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000 ld-linux-x86-64-2161 [019] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161 which then returns with -ETIMEDOUT. That wrecks the lock state, because now the owner isn't aware it acquired the lock and removes the pending robust list entry. If 2161 is killed, the robust list will not clear out this futex and the subsequent acquire on this futex will then (correctly) result in -ESRCH which is unexpected by glibc, triggers an internal assertion and dies. Task 2161 Task 2165 rt_mutex_wait_proxy_lock() timeout(); / T2161 is still queued in the waiter list / return -ETIMEDOUT; futex_unlock_pi() spin_lock(hb->lock); rtmutex_unlock() remove_rtmutex_waiter(T2161); mark_lock_available(); / Make the next waiter owner of the user space side */ futex_uval = 2161; spin_unlock(hb->lock); spin_lock(hb->lock); rt_mutex_cleanup_proxy_lock() if (rtmutex_owner() !== current) ... return FAIL; .... return -ETIMEOUT; This means that rt_mutex_cleanup_proxy_lock() needs to call try_to_take_rt_mutex() so it can take over the rtmutex correctly which was assigned by the waker. If the rtmutex is owned by some other task then this call is harmless and just confirmes that the waiter is not able to acquire it. While there, fix what looks like a merge error which resulted in rt_mutex_cleanup_proxy_lock() having two calls to fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any. Both should have one, since both potentially touch the waiter list. Fixes: `38d589f2fd` ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()") Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Bug-Spotted-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2017-05-22 21:57:18 +02:00
Linus Torvalds	86ca984cef	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Mostly netfilter bug fixes in here, but we have some bits elsewhere as well. 1) Don't do SNAT replies for non-NATed connections in IPVS, from Julian Anastasov. 2) Don't delete conntrack helpers while they are still in use, from Liping Zhang. 3) Fix zero padding in xtables's xt_data_to_user(), from Willem de Bruijn. 4) Add proper RCU protection to nf_tables_dump_set() because we cannot guarantee that we hold the NFNL_SUBSYS_NFTABLES lock. From Liping Zhang. 5) Initialize rcv_mss in tcp_disconnect(), from Wei Wang. 6) smsc95xx devices can't handle IPV6 checksums fully, so don't advertise support for offloading them. From Nisar Sayed. 7) Fix out-of-bounds access in __ip6_append_data(), from Eric Dumazet. 8) Make atl2_probe() propagate the error code properly on failures, from Alexey Khoroshilov. 9) arp_target[] in bond_check_params() is used uninitialized. This got changes from a global static to a local variable, which is how this mistake happened. Fix from Jarod Wilson. 10) Fix fallout from unnecessary NULL check removal in cls_matchall, from Jiri Pirko. This is definitely brown paper bag territory..." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits) net: sched: cls_matchall: fix null pointer dereference vsock: use new wait API for vsock_stream_sendmsg() bonding: fix randomly populated arp target array net: Make IP alignment calulations clearer. bonding: fix accounting of active ports in 3ad net: atheros: atl2: don't return zero on failure path in atl2_probe() ipv6: fix out of bound writes in __ip6_append_data() bridge: start hello_timer when enabling KERNEL_STP in br_stp_start smsc95xx: Support only IPv4 TCP/UDP csum offload arp: always override existing neigh entries with gratuitous ARP arp: postpone addr_type calculation to as late as possible arp: decompose is_garp logic into a separate function arp: fixed error in a comment tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0 netfilter: xtables: fix build failure from COMPAT_XT_ALIGN outside CONFIG_COMPAT ebtables: arpreply: Add the standard target sanity check netfilter: nf_tables: revisit chain/object refcounting from elements netfilter: nf_tables: missing sanitization in data from userspace netfilter: nf_tables: can't assume lock is acquired when dumping set elems netfilter: synproxy: fix conntrackd interaction ...	2017-05-22 12:42:02 -07:00
Rafael J. Wysocki	bb47e96417	Merge branches 'pm-sleep' and 'powercap' * pm-sleep: PM / hibernate: Declare variables as static RTC: rtc-cmos: Fix wakeup from suspend-to-idle PM / wakeup: Fix up wakeup_source_report_event() * powercap: PowerCap: Fix an error code in powercap_register_zone()	2017-05-22 20:32:05 +02:00
Rafael J. Wysocki	079c1812a2	Merge branches 'intel_pstate', 'pm-cpufreq' and 'pm-cpufreq-sched' * intel_pstate: cpufreq: intel_pstate: Document the current behavior and user interface * pm-cpufreq: cpufreq: dbx500: add a Kconfig symbol * pm-cpufreq-sched: cpufreq: schedutil: use now as reference when aggregating shared policy requests	2017-05-22 20:28:22 +02:00
David S. Miller	e4eda884db	net: Make IP alignment calulations clearer. The assignmnet: ip_align = strict ? 2 : NET_IP_ALIGN; in compare_pkt_ptr_alignment() trips up Coverity because we can only get to this code when strict is true, therefore ip_align will always be 2 regardless of NET_IP_ALIGN's value. So just assign directly to '2' and explain the situation in the comment above. Reported-by: "Gustavo A. R. Silva" <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-22 12:27:07 -04:00
Linus Torvalds	970c305aa8	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single scheduler fix: Prevent idle task from ever being preempted. That makes sure that synchronize_rcu_tasks() which is ignoring idle task does not pretend that no task is stuck in preempted state. If that happens and idle was preempted on a ftrace trampoline the machine crashes due to inconsistent state" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Call __schedule() from do_idle() without enabling preemption	2017-05-21 11:52:00 -07:00
Linus Torvalds	e7a3d62749	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of small fixes for the irq subsystem: - Cure a data ordering problem with chained interrupts - Three small fixlets for the mbigen irq chip" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix chained interrupt data ordering irqchip/mbigen: Fix the clear register offset calculation irqchip/mbigen: Fix potential NULL dereferencing irqchip/mbigen: Fix memory mapping code	2017-05-21 11:45:26 -07:00
Al Viro	92ebce5ac5	osf_wait4: switch to kernel_wait4() ... and sanitize copying rusage to userland Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:16:26 -04:00
Al Viro	4c48abe91b	waitid(): switch copyout of siginfo to unsafe_put_user() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:16:18 -04:00
Al Viro	76d9871e11	wait_task_zombie: consolidate info logics Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:14:59 -04:00
Al Viro	bb380ec33a	kill wait_noreap_copyout() folds into callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:13:53 -04:00
Al Viro	e61a250229	lift getrusage() from wait_noreap_copyout() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:13:17 -04:00
Al Viro	67d7ddded3	waitid(2): leave copyout of siginfo to syscall itself have kernel_waitid() collect the information needed for siginfo into a small structure (waitid_info) passed to it; deal with copyout in sys_waitid()/compat_sys_waitid(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:13:07 -04:00
Al Viro	359566faef	kernel_wait4()/kernel_waitid(): delay copying status to userland Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:11:07 -04:00
Al Viro	ce72a16fa7	wait4(2)/waitid(2): separate copying rusage to userland New helpers: kernel_waitid() and kernel_wait4(). sys_waitid(), sys_wait4() and their compat variants switched to those. Copying struct rusage to userland is left to syscall itself. For compat_sys_wait4() that eliminates the use of set_fs() completely. For compat_sys_waitid() it's still needed (for siginfo handling); that will change shortly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:11:00 -04:00
Al Viro	7e95a22590	move compat wait4 and waitid next to native variants Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-05-21 13:10:51 -04:00

... 8 9 10 11 12 ...

25229 commits