Impact: Implementation change to remove cpumask_t from stack.
Actually change smp_call_function_mask() to smp_call_function_many().
We avoid cpumasks on the stack in this version.
(S390 has its own version, but that's going away apparently).
We have to do some dancing to figure out if 0 or 1 other cpus are in
the mask supplied and the online mask without allocating a tmp
cpumask. It's still fairly cheap.
We allocate the cpumask at the end of the call_function_data
structure: if allocation fails we fallback to smp_call_function_single
rather than using the baroque quiescing code (which needs a cpumask on
stack).
(Thanks to Hiroshi Shimamoto for spotting several bugs in previous versions!)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: npiggin@suse.de
Cc: axboe@kernel.dk
They're only for use in boot/cpu hotplug code anyway, and this avoids
the use of deprecated cpu_*_map.
Stephen Rothwell points out that gcc 4.2.4 (on powerpc at least)
didn't like the cast away of const anyway:
include/linux/cpumask.h: In function 'set_cpu_possible':
include/linux/cpumask.h:1052: warning: passing argument 2 of 'cpumask_set_cpu' discards qualifiers from pointer target type
So this kills two birds with one stone.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup
This implements the obsolescent cpu_online_map in terms of
cpu_online_mask, rather than the other way around. Same for the other
maps.
The documentation comments are also updated to refer to _mask rather
than _map.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
A panic was discovered by Chirag Jog where a BUG_ON sanity check
in the new "pushable_task" logic would trigger a panic under
certain circumstances:
http://lkml.org/lkml/2008/9/25/189
Gilles Carry discovered that the root cause was attributed to the
pushable_tasks list getting corrupted in the push_rt_task logic.
This was the result of a dropped rq lock in double_lock_balance
allowing a task in the process of being pushed to potentially migrate
away, and thus corrupt the pushable_tasks() list.
I traced back the problem as introduced by the pushable_tasks patch
that went in recently. There is a "retry" path in push_rt_task()
that actually had a compound conditional to decide whether to
retry or exit. I missed the meaning behind the rationale for the
virtual "if(!task) goto out;" portion of the compound statement and
thus did not handle it properly. The new pushable_tasks logic
actually creates three distinct conditions:
1) an untouched and unpushable task should be dequeued
2) a migrated task where more pushable tasks remain should be retried
3) a migrated task where no more pushable tasks exist should exit
The original logic mushed (1) and (3) together, resulting in the
system dequeuing a migrated task (against an unlocked foreign run-queue
nonetheless).
To fix this, we get rid of the notion of "paranoid" and we support the
three unique conditions properly. The paranoid feature is no longer
relevant with the new pushable logic (since pushable naturally limits
the loop) anyway, so lets just remove it.
Reported-By: Chirag Jog <chirag@linux.vnet.ibm.com>
Found-by: Gilles Carry <gilles.carry@bull.net>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
The RT scheduler employs a "push/pull" design to actively balance tasks
within the system (on a per disjoint cpuset basis). When a task is
awoken, it is immediately determined if there are any lower priority
cpus which should be preempted. This is opposed to the way normal
SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
operation to occur before spreading out load.
When a particular RQ has more than 1 active RT task, it is said to
be in an "overloaded" state. Once this occurs, the system enters
the active balancing mode, where it will try to push the task away,
or persuade a different cpu to pull it over. The system will stay
in this state until the system falls back below the <= 1 queued RT
task per RQ.
However, the current implementation suffers from a limitation in the
push logic. Once overloaded, all tasks (other than current) on the
RQ are analyzed on every push operation, even if it was previously
unpushable (due to affinity, etc). Whats more, the operation stops
at the first task that is unpushable and will not look at items
lower in the queue. This causes two problems:
1) We can have the same tasks analyzed over and over again during each
push, which extends out the fast path in the scheduler for no
gain. Consider a RQ that has dozens of tasks that are bound to a
core. Each one of those tasks will be encountered and skipped
for each push operation while they are queued.
2) There may be lower-priority tasks under the unpushable task that
could have been successfully pushed, but will never be considered
until either the unpushable task is cleared, or a pull operation
succeeds. The net result is a potential latency source for mid
priority tasks.
This patch aims to rectify these two conditions by introducing a new
priority sorted list: "pushable_tasks". A task is added to the list
each time a task is activated or preempted. It is removed from the
list any time it is deactivated, made current, or fails to push.
This works because a task only needs to be attempted to push once.
After an initial failure to push, the other cpus will eventually try to
pull the task when the conditions are proper. This also solves the
problem that we don't completely analyze all tasks due to encountering
an unpushable tasks. Now every task will have a push attempted (when
appropriate).
This reduces latency both by shorting the critical section of the
rq->lock for certain workloads, and by making sure the algorithm
considers all eligible tasks in the system.
[ rostedt: added a couple more BUG_ONs ]
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
We currently run class->post_schedule() outside of the rq->lock, which
means that we need to test for the need to post_schedule outside of
the lock to avoid a forced reacquistion. This is currently not a problem
as we only look at rq->rt.overloaded. However, we want to enhance this
going forward to look at more state to reduce the need to post_schedule to
a bare minimum set. Therefore, we introduce a new member-func called
needs_post_schedule() which tests for the post_schedule condtion without
actually performing the work. Therefore it is safe to call this
function before the rq->lock is released, because we are guaranteed not
to drop the lock at an intermediate point (such as what post_schedule()
may do).
We will use this later in the series
[ rostedt: removed paranoid BUG_ON ]
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
double_lock balance() currently favors logically lower cpus since they
often do not have to release their own lock to acquire a second lock.
The result is that logically higher cpus can get starved when there is
a lot of pressure on the RQs. This can result in higher latencies on
higher cpu-ids.
This patch makes the algorithm more fair by forcing all paths to have
to release both locks before acquiring them again. Since callsites to
double_lock_balance already consider it a potential preemption/reschedule
point, they have the proper logic to recheck for atomicity violations.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
git-id c4acb2c066 attempted to limit
newidle critical section length by stopping after at least one task
was moved. Further investigation has shown that there are other
paths nested further inside the algorithm which still remain that allow
long latencies to occur with newidle balancing. This patch applies
the same technique inside balance_tasks() to limit the duration of
this optional balancing operation.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Nick Piggin <npiggin@suse.de>
There is no sense in wasting time trying to push a task away that
cannot move anywhere else. We gain no benefit from trying to push
other tasks at this point, so if the task being woken up is non
migratable, just skip the whole operation. This reduces overhead
in the wakeup path for certain tasks.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
We currently take the rq->lock for every cpu in an overload state during
pull_rt_tasks(). However, we now have enough information via the
highest_prio.[curr|next] fields to determine if there is any tasks of
interest to warrant the overhead of the rq->lock, before we actually take
it. So we use this information to reduce lock contention during the
pull for the case where the source-rq doesnt have tasks that preempt
the current task.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
highest_prio.curr is actually a more accurate way to keep track of
the pull_rt_task() threshold since it is always up to date, even
if the "next" task migrates during double_lock. Therefore, stop
looking at the "next" task object and simply use the highest_prio.curr.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
We will use this later in the series to reduce the amount of rq-lock
contention during a pull operation
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Impact: build fix
On !CONFIG_CONTEXT_SWITCH_TRACER trace_find_cmdline() is not defined:
kernel/trace/trace_output.c: In function 'trace_ctxwake_print':
kernel/trace/trace_output.c:499: error: implicit declaration of function 'trace_find_cmdline'
kernel/trace/trace_output.c:499: warning: assignment makes pointer from integer without a cast
Move it to the generic section in trace.h.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend the tracing API
The goal of this patch is to normalize and make more easy the
implementation of statistical (histogram) tracing.
It implements a trace_stat file into the /debugfs/tracing directory where
one can print a one-shot output of statistics/histogram entries.
A tracer has to provide two basic iterator callbacks:
stat_start() => the first entry
stat_next(prev, idx) => the next one.
Note that it is adapted for arrays or hash tables or lists.... since it
provides a pointer to the previous entry and the current index of the
iterator.
These two callbacks are called to get a snapshot of the statistics at each
opening of the trace_stat file because. The values are so updated between
two "cat trace_stat". And the tracer is free to lock its datas during the
iteration to keep consistent values.
Since it is almost always interesting to sort statisticals values to
address the problems by priority, this infrastructure provides a "sorting"
of the stat entries too if desired. A tracer has just to provide a
stat_cmp callback to compare two entries and the stat tracing
infrastructure will build a sorted list of the given entries.
A last callback, called stat_headers, can be implemented by a tracer to
output headers on its trace.
If one of these callbacks is changed on runtime, it just have to signal it
to the stat tracing API by calling the init_tracer_stat() helper.
Changes in V2:
- Fix a memory leak if the user opens multiple times the trace_stat file
without closing it. Now we always free our list before rebuilding it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: rework trace.c to use new event register API
Almost every ftrace event has to implement its output display in
trace.c through a different function. Some events did not handle
all the formats (trace, latency-trace, raw, hex, binary), and
this method does not scale well.
This patch converts the format functions to use the event API to
find the event and and print its format. Currently, we have
a print function for trace, latency_trace, raw, hex and binary.
A trace_nop_print is available if the event wants to avoid output
on a particular format.
Perhaps other tracers could use this in the future (like mmiotrace and
function_graph).
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: simplify/generalize/refactor trace.c
The trace.c file is becoming more difficult to maintain due to the
growing number of events. There is several formats that an event may
be printed. This patch sets up the infrastructure of an event hash to
allow for events to register how they should be printed.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, remove obsolete code
Now that the ring buffer used by ftrace allows for variable length
entries, we do not need the 'cont' feature of the buffer. This code
makes other parts of ftrace more complex and by removing this it
simplifies the ftrace code.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix theoretical NULL dereference
The generic irq layer doesn't know whether irq_chip has ack routine on some
architectures or not. Upon that, before calling chip->ack, we should check
that it's not NULL.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
GCC has a bug with __weak alias functions: if the functions are in
the same compilation unit as their call site, GCC can decide to
inline them - and thus rob the linker of the opportunity to override
the weak alias with the real thing.
So move all the IRQ handling related __weak symbols to kernel/irq/chip.c.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The mm->ioctx_list is currently protected by a reader-writer lock,
so we always grab that lock on the read side for doing ioctx
lookups. As the workload is extremely reader biased, turn this into
an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
the rwlock and use a spinlock for providing update side exclusion.
There's usually only 1 entry on this list, so it doesn't make sense
to look into fancier data structures.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When taking recursive faults in do_exit, if the io_context is not null,
exit_io_context() is being called. But it might decrement the refcount
more than once. It is better to leave this task alone.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Impact: fix boot crash if the kernel is built with certain GCC versions
GCC has a bug with __weak alias functions: if the functions are in
the same compilation unit as their call site, GCC can decide to
inline them - and thus rob the linker of the opportunity to override
the weak alias with the real thing.
This can lead to the boot crash reported by Kamalesh Babulal:
ACPI: Core revision 20080926
Setting APIC routing to flat
BUG: unable to handle kernel NULL pointer dereference at
0000000000000000
IP: [<ffffffff8021f9a8>] add_pin_to_irq_cpu+0x14/0x74
PGD 0
Oops: 0000 [#1] SMP
[...]
So move the arch_init_chip_data() function from handle.c to manage.c.
Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next: (25 commits)
allow stripping of generated symbols under CONFIG_KALLSYMS_ALL
kbuild: strip generated symbols from *.ko
kbuild: simplify use of genksyms
kernel-doc: check for extra kernel-doc notations
kbuild: add headerdep used to detect inclusion cycles in header files
kbuild: fix string equality testing in tags.sh
kbuild: fix make tags/cscope
kbuild: fix make incompatibility
kbuild: remove TAR_IGNORE
setlocalversion: add git-svn support
setlocalversion: print correct subversion revision
scripts: improve the decodecode script
scripts/package: allow custom options to rpm
genksyms: allow to ignore symbol checksum changes
genksyms: track symbol checksum changes
tags and cscope support really belongs in a shell script
kconfig: fix options to check-lxdialog.sh
kbuild: gen_init_cpio expands shell variables in file names
remove bashisms from scripts/extract-ikconfig
kbuild: teach mkmakfile to be silent
...
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
sched, trace: update trace_sched_wakeup()
tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
Revert "x86: disable X86_PTRACE_BTS"
ring-buffer: prevent false positive warning
ring-buffer: fix dangling commit race
ftrace: enable format arguments checking
x86, bts: memory accounting
x86, bts: add fork and exit handling
ftrace: introduce tracing_reset_online_cpus() helper
tracing: fix warnings in kernel/trace/trace_sched_switch.c
tracing: fix warning in kernel/trace/trace.c
tracing/ring-buffer: remove unused ring_buffer size
trace: fix task state printout
ftrace: add not to regex on filtering functions
trace: better use of stack_trace_enabled for boot up code
trace: add a way to enable or disable the stack tracer
x86: entry_64 - introduce FTRACE_ frame macro v2
tracing/ftrace: add the printk-msg-only option
tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
x86, bts: correctly report invalid bts records
...
Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
being already partly merged by the SH merge.
Impact: fix hang
Suresh report his two sockets system only works with SPARSE_IRQ enable
it turns out we miss the setting desc->irq
so provide early_irq_init() even !SPARSE_IRQ to set desc->irq
Reported-by: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add lockdep annotation to legacy IRQ descs
Warnings resulting out of this were not seen in practice, but it's prudent
to initialize the legacy descriptors to the lock class as well, symmetric
to how we do it with other descriptors.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix panic on null pointer with sparseirq
Some GCC versions seem to inline the weak global function,
when that function is empty.
Work it around, by making the functions return a (dummy) integer.
Signed-off-by: Yinghai <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix panic possible panic
Some versions of GCC inline the weak global function if it's empty.
Add a barrier() to work it around.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: avoid timer IRQ hanging slow systems
While using the function graph tracer on a virtualized system, the
hrtimer_interrupt can hang the system on an infinite loop.
This can be caused in several situations:
- the hardware is very slow and HZ is set too high
- something intrusive is slowing the system down (tracing under emulation)
... and the next clock events to program are always before the current time.
This patch implements a reasonable compromise: if such a situation is
detected, we share the CPUs time in 1/4 to process the hrtimer interrupts.
This is enough to let the system running without serious starvation.
It has been successfully tested under VirtualBox with 1000 HZ and 100 HZ
with function graph tracer launched. On both cases, the clock events were
increased until about 25 ms periodic ticks, which means 40 HZ.
So we change a hard to debug hang into a warning message and a system that
still manages to limp along.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
init_one_irq_desc() does not initialize the desc->lock properly -
you cannot init a lock by memcpying some other lock on it.
This happens to work right now (because irq_desc_init is never in use),
but it's a dangerous construct nevertheless, so fix it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reduce printk noise
There were a couple of leftover KERN_DEBUG debugging printks, remove
them. Also clarify an error message.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cpu_coregroup_map returned a cpumask_t: it's going away.
(Note, the sched part of this patch won't apply meaningfully to the
sched tree, but I'm posting it to show the goal).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Impact: clean up
We already have a weak copy of this function in init/main.c
Signed-off-by: Yinghai <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: tracer output improvement
Ending newlines are appended automatically on comments by the function
graph tracer because the newline needs to be placed after the "*/"
comment characters.
So if the user puts an ending newline, we want to strip it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
all for_each_irq_desc() usage point have !desc check.
then its check can move into for_each_irq_desc() macro.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
before CONFIG_SPARSE_IRQ age, for_each_irq_desc() sat in irqnr.h and
could be called from generic code.
CONFIG_SPARSE_IRQ breaks this assumption, but SPARSE_IRQ version
for_each_irq_desc() also can move into irqnr.h easily.
Also, this patch unifies CONFIG_SPARSE_IRQ and !CONFIG_SPARSE_IRQ
for_each_irq_desc().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
<linux/irq.h> can be removed and should be, because:
- hrtimer doesn't use any irq feature.
- <linux/irq.h> shouldn't be include from generic code.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>