1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

18790 commits

Author SHA1 Message Date
Fabian Frederick
b4426b49c6 rcu: Make rcu node arrays static const char * const
Those two arrays are being passed to lockdep_init_map(), which expects
const char *, and are stored in lockdep_map the same way.

Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:14:34 -07:00
Paul E. McKenney
c41247e1d4 signal: Explain local_irq_save() call
The explicit local_irq_save() in __lock_task_sighand() is needed to avoid
a potential deadlock condition, as noted in a841796f11 (signal:
align __lock_task_sighand() irq disabling and RCU).  However, someone
reading the code might be forgiven for concluding that this separate
local_irq_save() was completely unnecessary.  This commit therefore adds
a comment referencing the shiny new block comment on rcu_read_unlock().

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:14:33 -07:00
Steven Rostedt (Red Hat)
ca65ef1ab6 Merge branch 'trace/ftrace/urgent' into trace/ftrace/core
Needed 099ed15167 "tracing: Remove ftrace_stop/start() from
 reading the trace file" for the removal of ftrace_start/stop().
2014-07-09 11:02:34 -04:00
Tejun Heo
7b9a6ba56e cgroup: clean up sane_behavior handling
After the previous patch to remove sane_behavior support from
non-default hierarchies, CGRP_ROOT_SANE_BEHAVIOR is used only to
indicate the default hierarchy while parsing mount options.  This
patch makes the following cleanups around it.

* Don't show it in the mount option.  Eventually the default hierarchy
  will be assigned a different filesystem type.

* As sane_behavior is no longer effective on non-default hierarchies
  and the default hierarchy doesn't accept any mount options,
  parse_cgroupfs_options() can consider sane_behavior mount option as
  indicating the default hierarchy and fail if any other options are
  specified with it.  While at it, remove one of the double blank
  lines in the function.

* cgroup_mount() can now simply test CGRP_ROOT_SANE_BEHAVIOR to tell
  whether to mount the default hierarchy or not.

* As CGROUP_ROOT_SANE_BEHAVIOR's only role now is indicating whether
  to select the default hierarchy or not during mount, it doesn't need
  to be set in the default hierarchy itself.  cgroup_init_early()
  updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-07-09 10:08:08 -04:00
Tejun Heo
aa6ec29bee cgroup: remove sane_behavior support on non-default hierarchies
sane_behavior has been used as a development vehicle for the default
unified hierarchy.  Now that the default hierarchy is in place, the
flag became redundant and confusing as its usage is allowed on all
hierarchies.  There are gonna be either the default hierarchy or
legacy ones.  Let's make that clear by removing sane_behavior support
on non-default hierarchies.

This patch replaces cgroup_sane_behavior() with cgroup_on_dfl().  The
comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
cgroup_on_dfl() with sane_behavior specific part dropped.

On the default and legacy hierarchies w/o sane_behavior, this
shouldn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
2014-07-09 10:08:08 -04:00
Tejun Heo
c1d5d42efd cgroup: make interface file "cgroup.sane_behavior" legacy-only
"cgroup.sane_behavior" is added to help distinguishing whether
sane_behavior is in effect or not.  We now have the default hierarchy
where the flag is always in effect and are planning to remove
supporting sane behavior on the legacy hierarchies making this file on
the default hierarchy rather pointless.  Let's make it legacy only and
thus always zero.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-07-09 10:08:08 -04:00
Tejun Heo
7450e90bbb cgroup: remove CGRP_ROOT_OPTION_MASK
cgroup_root->flags only contains CGRP_ROOT_* flags and there's no
reason to mask the flags.  Remove CGRP_ROOT_OPTION_MASK.

This doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-07-09 10:08:07 -04:00
Sandeep Tripathy
30fe688402 cpuidle: move idle traces to cpuidle_enter_state()
idle_exit event is the first event after a core exits
idle state. So this should be traced before local irq
is ebabled. Likewise idle_entry is the last event before
a core enters idle state. This will ease visualising the
cpu idle state from kernel traces.

Signed-off-by: Sandeep Tripathy <sandeep.tripathy@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[rjw: Subject, rebase]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-07-09 15:45:23 +02:00
Tejun Heo
af0ba6789c cgroup: implement cgroup_subsys->depends_on
Currently, the blkio subsystem attributes all of writeback IOs to the
root.  One of the issues is that there's no way to tell who originated
a writeback IO from block layer.  Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages.  The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.

blkio piggybacking on memory is an implementation detail which
preferably should be handled automatically without requiring explicit
userland action.  To achieve that, this patch implements
cgroup_subsys->depends_on which contains the mask of subsystems which
should be enabled together when the subsystem is enabled.

The previous patches already implemented the support for enabled but
invisible subsystems and cgroup_subsys->depends_on can be easily
implemented by updating cgroup_refresh_child_subsys_mask() so that it
calculates cgroup->child_subsys_mask considering
cgroup_subsys->depends_on of the explicitly enabled subsystems.

Documentation/cgroups/unified-hierarchy.txt is updated to explain that
subsystems may not become immediately available after being unused
from userland and that dependency could be a factor in it.  As
subsystems may already keep residual references, this doesn't
significantly change how subsystem rebinding can be used.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2014-07-08 18:02:57 -04:00
Tejun Heo
b4536f0cab cgroup: implement cgroup_subsys->css_reset()
cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

The previous patches added support for explicitly and implicitly
enabled subsystems and showing/hiding their interface files.  An
explicitly enabled subsystem may become implicitly enabled if it's
turned off through "cgroup.subtree_control" but there are subsystems
depending on it.  In such cases, the subsystem, as it's turned off
when seen from userland, shouldn't enforce any resource control.
Also, the subsystem may be explicitly turned on later again and its
interface files should be as close to the intial state as possible.

This patch adds cgroup_subsys->css_reset() which is invoked when a css
is hidden.  The callback should disable resource control and reset the
state to the vanilla state.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2014-07-08 18:02:57 -04:00
Tejun Heo
f63070d350 cgroup: make interface files visible iff enabled on cgroup->subtree_control
cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

The preceding patch distinguished cgroup->subtree_control and
->child_subsys_mask where the former is the subsystems explicitly
configured by the userland and the latter is all enabled subsystems
currently is equal to the former but will include subsystems
implicitly enabled through dependency.

Subsystems which are enabled due to dependency shouldn't be visible to
userland.  This patch updates cgroup_subtree_control_write() and
create_css() such that interface files are not created for implicitly
enabled subsytems.

* @visible paramter is added to create_css().  Interface files are
  created only when true.

* If an already implicitly enabled subsystem is turned on through
  "cgroup.subtree_control", the existing css should be used.  css
  draining is skipped.

* cgroup_subtree_control_write() computes the new target
  cgroup->child_subsys_mask and create/kill or show/hide csses
  accordingly.

As the two subsystem masks are still kept identical, this patch
doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2014-07-08 18:02:57 -04:00
Tejun Heo
667c249171 cgroup: introduce cgroup->subtree_control
cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

Previously, cgroup->child_subsys_mask directly reflected
"cgroup.subtree_control" and the enabled subsystems in the child
cgroups.  This patch adds cgroup->subtree_control which
"cgroup.subtree_control" operates on.  cgroup->child_subsys_mask is
now calculated from cgroup->subtree_control by
cgroup_refresh_child_subsys_mask(), which sets it identical to
cgroup->subtree_control for now.

This will allow using cgroup->child_subsys_mask for all the enabled
subsystems including the implicit ones and ->subtree_control for
tracking the explicitly requested ones.  This patch keeps the two
masks identical and doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2014-07-08 18:02:56 -04:00
Tejun Heo
c29adf24e0 cgroup: reorganize cgroup_subtree_control_write()
Make the following two reorganizations to
cgroup_subtree_control_write().  These are to prepare for future
changes and shouldn't cause any functional difference.

* Move availability above css offlining wait.

* Move cgrp->child_subsys_mask update above new css creation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2014-07-08 18:02:56 -04:00
John Stultz
16927776ae alarmtimer: Fix bug where relative alarm timers were treated as absolute
Sharvil noticed with the posix timer_settime interface, using the
CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
tried to specify a relative time timer, it would incorrectly be
treated as absolute regardless of the state of the flags argument.

This patch corrects this, properly checking the absolute/relative flag,
as well as adds further error checking that no invalid flag bits are set.

Reported-by: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Cc: stable <stable@vger.kernel.org> #3.0+
Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-07-08 10:49:36 +02:00
Greg Kroah-Hartman
c1a567d31b Merge 3.16-rc4 into staging-next
We want the staging tree fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-07 17:59:07 -07:00
Paul E. McKenney
b58cc46c5f rcu: Don't offload callbacks unless specifically requested
Enabling NO_HZ_FULL currently has the side effect of enabling callback
offloading on all CPUs.  This results in lots of additional rcuo kthreads,
and can also increase context switching and wakeups, even in cases where
callback offloading is neither needed nor particularly desirable.  This
commit therefore enables callback offloading on a given CPU only if
specifically requested at build time or boot time, or if that CPU has
been specifically designated (again, either at build time or boot time)
as a nohz_full CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-07 15:13:44 -07:00
Paul E. McKenney
fbce7497ee rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-07 15:13:44 -07:00
Kees Cook
6945915e7f torture: Avoid format string leak to thead name
Since the torture-test thread creation interface does not include
format string arguments, this commit makes sure the name can never be
accidentally processed as a format string.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-07-07 10:12:56 -07:00
Yasuaki Ishimatsu
5a6024f160 workqueue: zero cpumask of wq_numa_possible_cpumask on init
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.

  BUG: unable to handle kernel paging request at 0000000000001d08
  IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
  PGD 0
  Oops: 0000 [#1] SMP
  ...
  Call Trace:
   [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
   [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
   [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
   [<ffffffff811926f1>] new_slab+0x91/0x300
   [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
   [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
   [<ffffffff810a3c78>] ? load_balance+0x218/0x890
   [<ffffffff8101a679>] ? sched_clock+0x9/0x10
   [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
   [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
   [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
   [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
   [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
   [<ffffffff8105d0ec>] do_fork+0xbc/0x360
   [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
   [<ffffffff81086652>] kthreadd+0x2c2/0x300
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
   [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60

In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.

When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:

3592         /* if cpumask is contained inside a NUMA node, we belong to that node */
3593         if (wq_numa_enabled) {
3594                 for_each_node(node) {
3595                         if (cpumask_subset(pool->attrs->cpumask,
3596                                            wq_numa_possible_cpumask[node])) {
3597                                 pool->node = node;
3598                                 break;
3599                         }
3600                 }
3601         }

But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.

By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809a ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
2014-07-07 09:56:48 -04:00
Linus Torvalds
549f11c9f0 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak
  which I introduced in the last round of cleanups :("

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Fix memory leak when calling irq_free_hwirqs()
  irqchip: spear_shirq: Fix interrupt offset
  irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive
  irqchip: armada-370-xp: Mask all interrupts during initialization.
2014-07-05 16:56:14 -07:00
Keith Busch
8844aad89e genirq: Fix memory leak when calling irq_free_hwirqs()
irq_free_hwirqs() always calls irq_free_descs() with a cnt == 0
which makes it a no-op since the interrupt count to free is
decremented in itself.

Fixes: 7b6ef12625

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1404167084-8070-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-07-05 21:42:08 +02:00
Jason Low
72d5305dcb locking/mutexes: Optimize mutex trylock slowpath
The mutex_trylock() function calls into __mutex_trylock_fastpath() when
trying to obtain the mutex. On 32 bit x86, in the !__HAVE_ARCH_CMPXCHG
case, __mutex_trylock_fastpath() calls directly into __mutex_trylock_slowpath()
regardless of whether or not the mutex is locked.

In __mutex_trylock_slowpath(), we then acquire the wait_lock spinlock, xchg()
lock->count with -1, then set lock->count back to 0 if there are no waiters,
and return true if the prev lock count was 1.

However, if the mutex is already locked, then there isn't much point
in attempting all of the above expensive operations. In this patch, we only
attempt the above trylock operations if the mutex is unlocked.

Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: akpm@linux-foundation.org
Cc: tim.c.chen@linux.intel.com
Cc: paulmck@linux.vnet.ibm.com
Cc: rostedt@goodmis.org
Cc: Waiman.Long@hp.com
Cc: scott.norton@hp.com
Cc: aswin@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402511843-4721-5-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:25:42 +02:00
Jason Low
0d968dd8c6 locking/mutexes: Try to acquire mutex only if it is unlocked
Upon entering the slowpath in __mutex_lock_common(), we try once more to
acquire the mutex. We only try to acquire if (lock->count >= 0). However,
what we actually want here is to try to acquire if the mutex is unlocked
(lock->count == 1).

This patch changes it so that we only try-acquire the mutex upon entering
the slowpath if it is unlocked, rather than if the lock count is non-negative.
This helps further reduce unnecessary atomic xchg() operations.

Furthermore, this patch uses !mutex_is_locked(lock) to do the initial
checks for if the lock is free rather than directly calling atomic_read()
on the lock->count, in order to improve readability.

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: akpm@linux-foundation.org
Cc: tim.c.chen@linux.intel.com
Cc: paulmck@linux.vnet.ibm.com
Cc: rostedt@goodmis.org
Cc: davidlohr@hp.com
Cc: scott.norton@hp.com
Cc: aswin@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402511843-4721-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:25:42 +02:00
Jason Low
1e820c9608 locking/mutexes: Delete the MUTEX_SHOW_NO_WAITER macro
MUTEX_SHOW_NO_WAITER() is a macro which checks for if there are
"no waiters" on a mutex by checking if the lock count is non-negative.
Based on feedback from the discussion in the earlier version of this
patchset, the macro is not very readable.

Furthermore, checking lock->count isn't always the correct way to
determine if there are "no waiters" on a mutex. For example, a negative
count on a mutex really only means that there "potentially" are
waiters. Likewise, there can be waiters on the mutex even if the count is
non-negative. Thus, "MUTEX_SHOW_NO_WAITER" doesn't always do what the name
of the macro suggests.

So this patch deletes the MUTEX_SHOW_NO_WAITERS() macro, directly
use atomic_read() instead of the macro, and adds comments which
elaborate on how the extra atomic_read() checks can help reduce
unnecessary xchg() operations.

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: akpm@linux-foundation.org
Cc: tim.c.chen@linux.intel.com
Cc: paulmck@linux.vnet.ibm.com
Cc: rostedt@goodmis.org
Cc: davidlohr@hp.com
Cc: scott.norton@hp.com
Cc: aswin@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402511843-4721-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:25:41 +02:00
Jason Low
0c3c0f0d6e locking/mutexes: Correct documentation on mutex optimistic spinning
The mutex optimistic spinning documentation states that we spin for
acquisition when we find that there are no pending waiters. However,
in actuality, whether or not there are waiters for the mutex doesn't
determine if we will spin for it.

This patch removes that statement and also adds a comment which
mentions that we spin for the mutex while we don't need to reschedule.

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: akpm@linux-foundation.org
Cc: tim.c.chen@linux.intel.com
Cc: paulmck@linux.vnet.ibm.com
Cc: rostedt@goodmis.org
Cc: Waiman.Long@hp.com
Cc: scott.norton@hp.com
Cc: aswin@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402511843-4721-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:25:41 +02:00
Jiri Olsa
985c8dcbe1 perf: Make perf_event_init_context() function static
Leftover from '8dc85d5 perf: Multiple task contexts'.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1403598026-2310-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:21:52 +02:00
Kirill Tkhai
b728ca0602 sched: Rework check_for_tasks()
1) Iterate thru all of threads in the system.
   Check for all threads, not only for group leaders.

2) Check for p->on_rq instead of p->state and cputime.
   Preempted task in !TASK_RUNNING state  OR just
   created task may be queued, that we want to be
   reported too.

3) Use read_lock() instead of write_lock().
   This function does not change any structures, and
   read_lock() is enough.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Konstantin Khorenko <khorenko@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Turner <pjt@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Todd E Brandt <todd.e.brandt@linux.intel.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684395.3462.44.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:45 +02:00
Kirill Tkhai
99b625670f sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
Make rt_rq available for pick_next_task(). Otherwise, their tasks
stay prisoned long time till dead cpu becomes alive again.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Konstantin Khorenko <khorenko@parallels.com>
CC: Ben Segall <bsegall@google.com>
CC: Paul Turner <pjt@google.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684388.3462.43.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:44 +02:00
Kirill Tkhai
0e59bdaea7 sched/fair: Disable runtime_enabled on dying rq
We kill rq->rd on the CPU_DOWN_PREPARE stage:

	cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
	-> cpu_attach_domain -> rq_attach_root -> set_rq_offline

This unthrottles all throttled cfs_rqs.

But the cpu is still able to call schedule() till

	take_cpu_down->__cpu_disable()

is called from stop_machine.

This case the tasks from just unthrottled cfs_rqs are pickable
in a standard scheduler way, and they are picked by dying cpu.
The cfs_rqs becomes throttled again, and migrate_tasks()
in migration_call skips their tasks (one more unthrottle
in migrate_tasks()->CPU_DYING does not happen, because rq->rd
is already NULL).

Patch sets runtime_enabled to zero. This guarantees, the runtime
is not accounted, and the cfs_rqs won't exceed given
cfs_rq->runtime_remaining = 1, and tasks will be pickable
in migrate_tasks(). runtime_enabled is recalculated again
when rq becomes online again.

Ben Segall also noticed, we always enable runtime in
tg_set_cfs_bandwidth(). Actually, we should do that for online
cpus only. To prevent races with unthrottle_offline_cfs_rqs()
we take get_online_cpus() lock.

Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Konstantin Khorenko <khorenko@parallels.com>
CC: Paul Turner <pjt@google.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:42 +02:00
Rik van Riel
a22b4b0123 sched/numa: Change scan period code to match intent
Reading through the scan period code and comment, it appears the
intent was to slow down NUMA scanning when a majority of accesses
are on the local node, specifically a local:remote ratio of 3:1.

However, the code actually tests local / (local + remote), and
the actual cut-off point was around 30% local accesses, well before
a task has actually converged on a node.

Changing the threshold to 7 means scanning slows down when a task
has around 70% of its accesses local, which appears to match the
intent of the code more closely.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-8-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:40 +02:00
Rik van Riel
db015daedb sched/numa: Rework best node setting in task_numa_migrate()
Fix up the best node setting in task_numa_migrate() to deal with a task
in a pseudo-interleaved NUMA group, which is already running in the
best location.

Set the task's preferred nid to the current nid, so task migration is
not retried at a high rate.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-7-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:39 +02:00
Rik van Riel
0132c3e177 sched/numa: Examine a task move when examining a task swap
Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4
node system results in 160 runnable threads on a system with 80
CPU threads.

Once a process has nearly converged, with 39 threads on one node
and 1 thread on another node, the remaining thread will be unable
to migrate to its preferred node through a task swap.

However, a simple task move would make the workload converge,
witout causing an imbalance.

Test for this unlikely occurrence, and attempt a task move to
the preferred nid when it happens.

 # Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000"

 ###
 # 160 tasks will execute (on 4 nodes, 80 CPUs):
 #         -1x     0MB global  shared mem operations
 #         -1x  1000MB process shared mem operations
 #         -1x     0MB thread  local  mem operations
 ###

 ###
 #
 #    0.0%  [0.2 mins]  0/0   1/1  36/2   0/0  [36/3 ] l:  0-0   (  0) {0-2}
 #    0.0%  [0.3 mins] 43/3  37/2  39/2  41/3  [ 6/10] l:  0-1   (  1) {1-2}
 #    0.0%  [0.4 mins] 42/3  38/2  40/2  40/2  [ 4/9 ] l:  1-2   (  1) [50.0%] {1-2}
 #    0.0%  [0.6 mins] 41/3  39/2  40/2  40/2  [ 2/9 ] l:  2-4   (  2) [50.0%] {1-2}
 #    0.0%  [0.7 mins] 40/2  40/2  40/2  40/2  [ 0/8 ] l:  3-5   (  2) [40.0%] (  41.8s converged)

Without this patch, this same perf bench numa mem run had to
rely on the scheduler load balancer to first balance out the
load (moving a random task), before a task swap could complete
the NUMA convergence.

The load balancer does not normally take action unless the load

difference exceeds 25%. Convergence times of over half an hour
have been observed without this patch.

With this patch, the NUMA balancing code will simply migrate the
task, if that does not cause an imbalance.

Also skip examining a CPU in detail if the improvement on that CPU
is no more than the best we already have.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ggthh0rnh0yua6o5o3p6cr1o@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:38 +02:00
Rik van Riel
1c5d3eb375 sched/numa: Simplify task_numa_compare()
When a task is part of a numa_group, the comparison should always use
the group weight, in order to make workloads converge.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:37 +02:00
Rik van Riel
6dc1a672ab sched/numa: Use effective_load() to balance NUMA loads
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
on a CPU is determined by the group the task is in. The active groups
on the source and destination CPU can be different, resulting in a
different load contribution by the same task at its source and at its
destination. As a result, the load needs to be calculated separately
for each CPU, instead of estimated once with task_h_load().

Getting this calculation right allows some workloads to converge,
where previously the last thread could get stuck on another node,
without being able to migrate to its final destination.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:35 +02:00
Rik van Riel
28a2174519 sched/numa: Move power adjustment into load_too_imbalanced()
Currently the NUMA code scales the load on each node with the
amount of CPU power available on that node, but it does not
apply any adjustment to the load of the task that is being
moved over.

On systems with SMT/HT, this results in a task being weighed
much more heavily than a CPU core, and a task move that would
even out the load between nodes being disallowed.

The correct thing is to apply the power correction to the
numbers after we have first applied the move of the tasks'
loads to them.

This also allows us to do the power correction with a multiplication,
rather than a division.

Also drop two function arguments for load_too_unbalanced, since it
takes various factors from env already.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:34 +02:00
Rik van Riel
f0b8a4afd6 sched/numa: Use group's max nid as task's preferred nid
From task_numa_placement, always try to consolidate the tasks
in a group on the group's top nid.

In case this task is part of a group that is interleaved over
multiple nodes, task_numa_migrate will set the task's preferred
nid to the best node it could find for the task, so this patch
will cause at most one run through task_numa_migrate.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:33 +02:00
Tim Chen
4486edd12b sched/fair: Implement fast idling of CPUs when the system is partially loaded
When a system is lightly loaded (i.e. no more than 1 job per cpu),
attempt to pull job to a cpu before putting it to idle is unnecessary and
can be skipped.  This patch adds an indicator so the scheduler can know
when there's no more than 1 active job is on any CPU in the system to
skip needless job pulls.

On a 4 socket machine with a request/response kind of workload from
clients, we saw about 0.13 msec delay when we go through a full load
balance to try pull job from all the other cpus.  While 0.1 msec was
spent on processing the request and generating a response, the 0.13 msec
load balance overhead was actually more than the actual work being done.
This overhead can be skipped much of the time for lightly loaded systems.

With this patch, we tested with a netperf request/response workload that
has the server busy with half the cpus in a 4 socket system.  We found
the patch eliminated 75% of the load balance attempts before idling a cpu.

The overhead of setting/clearing the indicator is low as we already gather
the necessary info while we call add_nr_running() and update_sd_lb_stats.()
We switch to full load balance load immediately if any cpu got more than
one job on its run queue in add_nr_running.  We'll clear the indicator
to avoid load balance when we detect no cpu's have more than one job
when we scan the work queues in update_sg_lb_stats().  We are aggressive
in turning on the load balance and opportunistic in skipping the load
balance.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jason Low <jason.low2@hp.com>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:32 +02:00
Viresh Kumar
89abb5ad10 sched/idle: Drop !! while calculating 'broadcast'
We don't need 'broadcast' to be set to 'zero or one', but to 'zero or non-zero'
and so the extra operation to convert it to 'zero or one' can be skipped.

Also change type of 'broadcast' to unsigned int, i.e. type of
drv->states[*].flags.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/0dfbe2976aa108c53e08d3477ea90f6360c1f54c.1403584026.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:31 +02:00
Mike Galbraith
4036ac1567 sched: Fix clock_gettime(CLOCK_[PROCESS/THREAD]_CPUTIME_ID) monotonicity
If a task has been dequeued, it has been accounted.  Do not project
cycles that may or may not ever be accounted to a dequeued task, as
that may make clock_gettime() both inaccurate and non-monotonic.

Protect update_rq_clock() from slight TSC skew while at it.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: kosaki.motohiro@jp.fujitsu.com
Cc: pjt@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403588980.29711.11.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:30 +02:00
Ben Segall
c06f04c704 sched: Fix potential near-infinite distribute_cfs_runtime() loop
distribute_cfs_runtime() intentionally only hands out enough runtime to
bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take
the runtime they need only once they actually get to run. However, if
they get to run sufficiently quickly, the period timer is still in
distribute_cfs_runtime() and no runtime is available, causing them to
throttle. Then distribute has to handle them again, and this can go on
until distribute has handed out all of the runtime 1ns at a time, which
takes far too long.

Instead allow access to the same runtime that distribute is handing out,
accepting that corner cases with very low quota may be able to spend the
entire cfs_b->runtime during distribute_cfs_runtime, meaning that the
runtime directly handed out by distribute_cfs_runtime was over quota. In
addition, if a cfs_rq does manage to throttle like this, make sure the
existing distribute_cfs_runtime no longer loops over it again.

Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:29 +02:00
Viresh Kumar
541b82644d sched/core: Fix formatting issues in sched_can_stop_tick()
sched_can_stop_tick() is using 7 spaces instead of 8 spaces or a 'tab' at the
beginning of few lines. Which doesn't align well with the Coding Guidelines.

Also remove local variable 'rq' as it is used at only one place and we can
directly use this_rq() instead.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: fweisbec@gmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/afb781733e4a9ffbced5eb9fd25cc0aa5c6ffd7a.1403596966.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:28 +02:00
Peter Zijlstra
a77353e5eb irq_work: Remove BUG_ON in irq_work_run()
Because of a collision with 8d056c48e4 ("CPU hotplug, smp: flush any
pending IPI callbacks before CPU offline"), which ends up calling
hotplug_cfd()->flush_smp_call_function_queue()->irq_work_run(), which
is not from IRQ context.

And since that already calls irq_work_run() from the hotplug path,
remove our entire hotplug handling.

Reported-by: Stephen Warren <swarren@wwwdotorg.org>
Tested-by: Stephen Warren <swarren@wwwdotorg.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-busatzs2gvz4v62258agipuf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:26 +02:00
Ingo Molnar
51da9830d7 Merge branch 'timers/nohz' into sched/core
Merge these two, because upcoming patches will touch both areas.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:06:10 +02:00
Linus Torvalds
ef34c6ce49 Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code where
running:
 
    # perf probe -x /lib/libc.so.6 syscall
    # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
    # perf record -e probe_libc:syscall whatever
 
 kills the uprobe. Along the way he found some other minor bugs and clean ups
 that he fixed up making it a total of 4 patches.
 
 Doing unrelated work, I found that the reading of the ftrace trace
 file disables all function tracer callbacks. This was fine when ftrace
 was the only user, but now that it's used by perf and kprobes, this
 is a bug where reading trace can disable kprobes and perf. A very unexpected
 side effect and should be fixed.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTtMugAAoJEKQekfcNnQGuRs8H/2HhNAy0F1pAFYT5tH2o0z/Z
 z6NFn83FUUoesg/Bd+1Dk7VekIZIdo1JxQc67/Y0D0oylPPr31gmVvk2llFPdJV5
 xuWiUOuMbDq4Eh3Een8yaOsNsGbcX0lgw9qJEyqAvhJMi5G4dyt3r/g+vFThAyqm
 O8Uv74GBGmUmmGMyZuW2r2f2vSEANSXTLbzFqj54fV7FNms1B0MpZ/2AiRcEwCzi
 9yMainwrO1VPVSrSJFkW8g4sNl5X1M4tiIT8wGN75YePJVK3FAH8LmUDfpau4+ae
 /QGdjNkRDWpZ3mMJPbz3sAT7USnMlSZ5w80Rf/CZuZAe2Ncycvh9AhqcFV0+fj8=
 =3h4s
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code
  where running:

    # perf probe -x /lib/libc.so.6 syscall
    # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
    # perf record -e probe_libc:syscall whatever

  kills the uprobe.  Along the way he found some other minor bugs and
  clean ups that he fixed up making it a total of 4 patches.

  Doing unrelated work, I found that the reading of the ftrace trace
  file disables all function tracer callbacks.  This was fine when
  ftrace was the only user, but now that it's used by perf and kprobes,
  this is a bug where reading trace can disable kprobes and perf.  A
  very unexpected side effect and should be fixed"

* tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Remove ftrace_stop/start() from reading the trace file
  tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
  tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
  uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
  tracing/uprobes: Revert "Support mix of ftrace and perf"
2014-07-03 18:37:25 -07:00
Andrew Morton
d18bbc215f kernel/printk/printk.c: revert "printk: enable interrupts before calling console_trylock_for_printk()"
Revert commit 939f04bec1 ("printk: enable interrupts before calling
console_trylock_for_printk()").

Andreas reported:

: None of the post 3.15 kernel boot for me. They all hang at the GRUB
: screen telling me it loaded and started the kernel, but the kernel
: itself stops before it prints anything (or even replaces the GRUB
: background graphics).

939f04bec1 is modest latency reduction.  Revert it until we understand
the reason for these failures.

Reported-by: Andreas Bombe <aeb@debian.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03 09:21:54 -07:00
Jarod Wilson
002c77a48b crypto: fips - only panic on bad/missing crypto mod signatures
Per further discussion with NIST, the requirements for FIPS state that
we only need to panic the system on failed kernel module signature checks
for crypto subsystem modules. This moves the fips-mode-only module
signature check out of the generic module loading code, into the crypto
subsystem, at points where we can catch both algorithm module loads and
mode module loads. At the same time, make CONFIG_CRYPTO_FIPS dependent on
CONFIG_MODULE_SIG, as this is entirely necessary for FIPS mode.

v2: remove extraneous blank line, perform checks in static inline
function, drop no longer necessary fips.h include.

CC: "David S. Miller" <davem@davemloft.net>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Stephan Mueller <stephan.mueller@atsec.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-07-03 21:38:32 +08:00
Jiri Olsa
1f9a7268c6 perf: Do not allow optimized switch for non-cloned events
The context check in perf_event_context_sched_out allows
non-cloned context to be part of the optimized schedule
out switch.

This could move non-cloned context into another workload
child. Once this child exits, the context is closed and
leaves all original (parent) events in closed state.

Any other new cloned event will have closed state and not
measure anything. And probably causing other odd bugs.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1403598026-2310-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-02 08:35:56 +02:00
Lai Jiangshan
85327af61d workqueue: stronger test in process_one_work()
When POOL_DISASSOCIATED is cleared, the running worker's local CPU should
be the same as pool->cpu without any exception even during cpu-hotplug.

This patch changes "(proposition_A && proposition_B && proposition_C)"
to "(proposition_B && proposition_C)", so if the old compound
proposition is true, the new one must be true too. so this won't hide
any possible bug which can be hit by old test.

tj: Minor description update and dropped the obvious comment.

CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
CC: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-01 17:40:14 -04:00
Lai Jiangshan
3de5e88485 workqueue: clear POOL_DISASSOCIATED in rebind_workers()
a9ab775bca ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") moved pool locking into rebind_workers() but left
"pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().

There is nothing necessarily wrong with it, but there is no benefit
either.  Let's move it into rebind_workers() and achieve the following
benefits:

  1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers()
     as expected.

  2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
     running workers of the pool are on the local CPU (pool->cpu).

tj: Minor description update.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-01 17:40:14 -04:00
Tejun Heo
76bb5ab8f6 cpuset: break kernfs active protection in cpuset_write_resmask()
Writing to either "cpuset.cpus" or "cpuset.mems" file flushes
cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up
migrating tasks off a cpuset after new resources are added to it.

As cpuset_hotplug_work calls into cgroup core via
cgroup_transfer_tasks(), this flushing adds the dependency to cgroup
core locking from cpuset_write_resmak().  This used to be okay because
cgroup interface files were protected by a different mutex; however,
8353da1f91 ("cgroup: remove cgroup_tree_mutex") simplified the
cgroup core locking and this dependency became a deadlock hazard -
cgroup file removal performed under cgroup core lock tries to drain
on-going file operation which is trying to flush cpuset_hotplug_work
blocked on the same cgroup core lock.

The locking simplification was done because kernfs added an a lot
easier way to deal with circular dependencies involving kernfs active
protection.  Let's use the same strategy in cpuset and break active
protection in cpuset_write_resmask().  While it isn't the prettiest,
this is a very rare, likely unique, situation which also goes away on
the unified hierarchy.

The commands to trigger the deadlock warning without the patch and the
lockdep output follow.

 localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
 localhost:/ # mkdir /cpuset/tmp
 localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
 localhost:/ # echo 0 > cpuset/tmp/cpuset.mems
 localhost:/ # echo $$ > /cpuset/tmp/tasks
 localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.16.0-rc1-0.1-default+ #7 Not tainted
  -------------------------------------------------------
  kworker/1:0/32649 is trying to acquire lock:
   (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150

  but task is already holding lock:
   (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (cpuset_hotplug_work){+.+...}:
  ...
  -> #1 (s_active#175){++++.+}:
  ...
  -> #0 (cgroup_mutex){+.+.+.}:
  ...

  other info that might help us debug this:

  Chain exists of:
    cgroup_mutex --> s_active#175 --> cpuset_hotplug_work

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(cpuset_hotplug_work);
				 lock(s_active#175);
				 lock(cpuset_hotplug_work);
    lock(cgroup_mutex);

   *** DEADLOCK ***

  2 locks held by kworker/1:0/32649:
   #0:  ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
   #1:  (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520

  stack backtrace:
  CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
 ...
  Call Trace:
   [<ffffffff815a5f78>] dump_stack+0x72/0x8a
   [<ffffffff810c263f>] print_circular_bug+0x10f/0x120
   [<ffffffff810c481e>] check_prev_add+0x43e/0x4b0
   [<ffffffff810c4ee6>] validate_chain+0x656/0x7c0
   [<ffffffff810c53d2>] __lock_acquire+0x382/0x660
   [<ffffffff810c57a9>] lock_acquire+0xf9/0x170
   [<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380
   [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
   [<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0
   [<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180
   [<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630
   [<ffffffff810854d4>] process_one_work+0x254/0x520
   [<ffffffff810875dd>] worker_thread+0x13d/0x3d0
   [<ffffffff8108e0c8>] kthread+0xf8/0x100
   [<ffffffff815acaec>] ret_from_fork+0x7c/0xb0

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Li Zefan <lizefan@huawei.com>
Tested-by: Li Zefan <lizefan@huawei.com>
2014-07-01 16:42:28 -04:00