Fixes the following compile error:
kernel/fork.c:1049: error: 'struct task_struct' has no member named 'last_switch_count'
kernel/fork.c:1050: error: 'struct task_struct' has no member named 'last_switch_timestamp'
Signed-off-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide a shared interrupt debug facility under CONFIG_DEBUG_SHIRQ:
it uses the existing irqpoll facilities to iterate through all
registered interrupt handlers and call those which can handle shared
IRQ lines.
This can be handy for suspend/resume debugging: if we call this function
early during resume we can trigger crashes in those drivers which have
incorrect assumptions about when exactly their ISRs will be called
during suspend/resume.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar wrote:
> here's a new build failure with tip/sched/rt:
>
> LD .tmp_vmlinux1
> kernel/built-in.o: In function `set_curr_task_rt':
> sched.c:(.text+0x3675): undefined reference to `plist_del'
> kernel/built-in.o: In function `pick_next_task_rt':
> sched.c:(.text+0x37ce): undefined reference to `plist_del'
> kernel/built-in.o: In function `enqueue_pushable_task':
> sched.c:(.text+0x381c): undefined reference to `plist_del'
Eliminate the plist library kconfig and make it available
unconditionally.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Decoupling allows:
* hung tasks check to happen at very low priority
* hung tasks check and softlockup to be enabled/disabled independently
at compile and/or run-time
* individual panic settings to be enabled disabled independently
at compile and/or run-time
* softlockup threshold to be reduced without increasing hung tasks
poll frequency (hung task check is expensive relative to softlock watchdog)
* hung task check to be zero over-head when disabled at run-time
Signed-off-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow the set_mode() clockevent callback to decide and fill in delta
details such as shift, mult, max_delta_ns and min_delta_ns.
With this change the clockevent can be registered without delta details
which allows us to keep the parent clock disabled until the clockevent
gets setup using set_mode().
Letting set_mode() fill in or update delta details allows us to save
power by disabling the parent clock while the clockevent is unused.
This may however make the parent clock rate change, so next time the
clockevent gets enabled we need let set_mode() to update the detla
details accordingly. Doing it at registration time is not enough.
Furthermore, the delta details seem unused in the case of periodic-only
clockevent drivers, so this change also allows registration of such
drivers without the delta details filled in.
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: trace max latencies on start of latency tracing
This patch sets the max latency to zero whenever one of the
irq variant tracers or the wakeup tracer is set to current tracer.
Most developers expect to see output when starting up a latency
tracer. But since the max_latency is already set to max, and
it takes a latency greater than max_latency to be recorded, there
is no trace. This is not the expected behavior and has even confused
myself.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
After reorganizing the functions in trace.c and trace_function.c,
they no longer need to be in global context. This patch makes the
functions and one variable into static.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: less likely to interleave function and stack traces
This patch does replaces the separate stack trace on function with
a record function and stack trace together. This will switch between
the function only recording to a function and stack recording.
Also some whitespace fix ups as well.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After adding the printf format checking for trace_seq_printf, several
warnings now show up. This patch cleans them up.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andrew Morton suggested adding a printf checker to trace_seq_printf
since there are a number of users that have improper format arguments.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up of trace.c
The function tracer functions were put in trace.c because it needed
to share static variables that were in trace.c. Since then, those
variables have become global for various reasons. This patch moves
the function tracer functions into trace_function.c where they belong.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new feature to stack trace any function
Chris Mason asked about being able to pick and choose a function
and get a stack trace from it. This feature enables his request.
# echo io_schedule > /debug/tracing/set_ftrace_filter
# echo function > /debug/tracing/current_tracer
# echo func_stack_trace > /debug/tracing/trace_options
Produces the following in /debug/tracing/trace:
kjournald-702 [001] 135.673060: io_schedule <-sync_buffer
kjournald-702 [002] 135.673671:
<= sync_buffer
<= __wait_on_bit
<= out_of_line_wait_on_bit
<= __wait_on_buffer
<= sync_dirty_buffer
<= journal_commit_transaction
<= kjournald
Note, be careful about turning this on without filtering the functions.
You may find that you have a 10 second lag between typing and seeing
what you typed. This is why the stack trace for the function tracer
does not use the same stack_trace flag as the other tracers use.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bug for handling partial line
trace_seq_printf(), seq_print_userip_objs(), ... return
0 -- partial line was written
other(>0) -- success
duplicate output is also removed in trace_print_raw().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix __request_region() parameter kernel-doc notation and parameter name:
Warning(linux-2.6.28-git10//kernel/resource.c:627): No description found for parameter 'flags'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike's change: 0a582440f "sched: fix sched_slice())" broke group
scheduling by forgetting to reload cfs_rq on each loop.
This patch fixes aim7 regression and specjbb2005 regression becomes
less than 1.5% on 8-core stokley.
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jayson King <dev@jaysonking.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Often the cause of kernel unaligned access warnings is not
obvious from just the ip displayed in the warning. This adds
the option via proc to dump the stack in addition to the warning.
The default is off (just display the 1 line warning). To enable
the stack to be shown: echo 1 > /proc/sys/kernel/unaligned-dump-stack
Signed-off-by: Doug Chapman <doug.chapman@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Impact: fix SCHED_IDLE latency problems
OK, so we have 1 running task A (which is obviously curr and the tree is
equally obviously empty).
'A' nicely chugs along, doing its thing, carrying min_vruntime along as it
goes.
Then some whacko speed freak SCHED_IDLE task gets inserted due to SMP
balancing, which is very likely far right, in that case
update_curr
update_min_vruntime
cfs_rq->rb_leftmost := true (the crazy task sitting in a tree)
vruntime = se->vruntime
and voila, min_vruntime is waaay right of where it ought to be.
OK, so why did I write it like that to begin with...
Aah, yes.
Say we've just dequeued current
schedule
deactivate_task(prev)
dequeue_entity
update_min_vruntime
Then we'll set
vruntime = cfs_rq->min_vruntime;
we find !cfs_rq->curr, but do find someone in the tree. Then we _must_
do vruntime = se->vruntime, because
vruntime = min_vruntime(vruntime := cfs_rq->min_vruntime, se->vruntime)
will not advance vruntime, and cause lags the other way around (which we
fixed with that initial patch: 1af5f730fc
(sched: more accurate min_vruntime accounting).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Mike Galbraith <efault@gmx.de>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Stronger SCHED_IDLE isolation:
- no SCHED_IDLE buddies
- never let SCHED_IDLE preempt on wakeup
- always preempt SCHED_IDLE on wakeup
- limit SLEEPER fairness for SCHED_IDLE.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Increase the SCHED_IDLE weight from 2 to 3, this gives much more stable
vruntime numbers.
time advanced in 100ms:
weight=2
64765.988352
67012.881408
88501.412352
weight=3
35496.181411
34130.971298
35497.411573
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make rt-limit tunables work again
Mark Glines reported:
> I've got an issue on x86-64 where I can't configure the system to allow
> RT tasks for a non-root user.
>
> In 2.6.26.5, I was able to do the following to set things up nicely:
> echo 450000 >/sys/kernel/uids/0/cpu_rt_runtime
> echo 450000 >/sys/kernel/uids/1000/cpu_rt_runtime
>
> Seems like every value I try to echo into the /sys files returns EINVAL.
For UID grouping we initialize the root group with infinite bandwidth
which by default is actually more than the global limit, therefore the
bandwidth check always fails.
Because the root group is a phantom group (for UID grouping) we cannot
runtime adjust it, therefore we let it reflect the global bandwidth
settings.
Reported-by: Mark Glines <mark@glines.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
kernel/time/tick-sched.c:137:6: warning: symbol 'tick_nohz_update_jiffies' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prefer tasks that wake other tasks to preempt quickly. This improves
performance because more work is available sooner.
The workload that prompted this patch was a kernel build over NFS4 (for some
curious and not understood reason we had to revert commit:
18de973530 to make any progress at all)
Without this patch a make -j8 bzImage (of x86-64 defconfig) would take
3m30-ish, with this patch we're down to 2m50-ish.
psql-sysbench/mysql-sysbench show a slight improvement in peak performance as
well, tbench and vmark seemed to not care.
It is possible to improve upon the build time (to 2m20-ish) but that seriously
destroys other benchmarks (just shows that there's more room for tinkering).
Much thanks to Mike who put in a lot of effort to benchmark things and proved
a worthy opponent with a competing patch.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a new avg_wakeup statistic.
avg_wakeup is a measure of how frequently a task wakes up other tasks, it
represents the average time between wakeups, with a limit of avg_runtime
for when it doesn't wake up anybody.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We should unlock all_stat_sessions_mutex before returning failure.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix a crash while kernel image restore
When the function graph tracer is running and while suspend to disk, some racy
and dangerous things happen against this tracer.
The current task will save its registers including the stack pointer which
contains the return address hooked by the tracer. But the current task will
continue to enter other functions after that to save the memory, and then
it will store other return addresses, and finally loose the old depth which
matches the return address saved in the old stack (during the registers saving).
So on image restore, the code will return to wrong addresses.
And there are other things: on restore, the task will have it's "current"
pointer overwritten during registers restoring....switching from one task to
another... That would be insane to try to trace function graphs at these
stages.
This patch makes the function graph tracer listening on power events, making
it's tracing disabled for the current task (the one that performs the
hibernation work) while suspend/resume to disk, making the tracing safe
during hibernation.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: limit ftrace dump output
Currently ftrace_dump only calls ftrace_kill that is a fast way
to prevent the function tracer functions from being called (just sets
a flag and clears the function to call, nothing else). It is better
to also turn off any recording to the ring buffers as well.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reset struct buffer_page.write when interrupt storm
if struct buffer_page.write is not reset, any succedent committing
will corrupted ring_buffer:
static inline void
rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
{
......
cpu_buffer->commit_page->commit =
cpu_buffer->commit_page->write;
......
}
when "if (RB_WARN_ON(cpu_buffer, next_page == reader_page))", ring_buffer
is disabled, but some reserved buffers may haven't been committed.
we need reset struct buffer_page.write.
when "if (unlikely(next_page == cpu_buffer->commit_page))", ring_buffer
is still available, we should not corrupt it.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix to print out ftrace_dump when expected
I was debugging a hard race condition to only find out that
after I hit the race, my log level was not at level to show
KERN_INFO. The time it took to trigger the race was wasted because
I did not capture the trace.
Since ftrace_dump is only called from kernel oops (and only when
it is set in the kernel command line to do so), or when a
developer adds it to their own local tree, the log level of
the print should be at KERN_EMERG to make sure the print appears.
ftrace_dump is not called by a normal user setup, and will not
add extra unwanted print out to the console. There is no reason
it should be at KERN_INFO.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (44 commits)
[CVE-2009-0029] s390 specific system call wrappers
[CVE-2009-0029] System call wrappers part 33
[CVE-2009-0029] System call wrappers part 32
[CVE-2009-0029] System call wrappers part 31
[CVE-2009-0029] System call wrappers part 30
[CVE-2009-0029] System call wrappers part 29
[CVE-2009-0029] System call wrappers part 28
[CVE-2009-0029] System call wrappers part 27
[CVE-2009-0029] System call wrappers part 26
[CVE-2009-0029] System call wrappers part 25
[CVE-2009-0029] System call wrappers part 24
[CVE-2009-0029] System call wrappers part 23
[CVE-2009-0029] System call wrappers part 22
[CVE-2009-0029] System call wrappers part 21
[CVE-2009-0029] System call wrappers part 20
[CVE-2009-0029] System call wrappers part 19
[CVE-2009-0029] System call wrappers part 18
[CVE-2009-0029] System call wrappers part 17
[CVE-2009-0029] System call wrappers part 16
[CVE-2009-0029] System call wrappers part 15
...
This reverts commit ad7a953c52.
And commit: ("allow stripping of generated symbols under CONFIG_KALLSYMS_ALL")
9bb482476c
These stripping patches has caused a set of issues:
1) People have reported compatibility issues with binutils due to
lack of support for `--strip-unneeded-symbols' with objcopy 2.15.92.0.2
Reported by: Wenji
2) ccache and distcc no longer works as expeced
Reported by: Ted, Roland, + others
3) The installed modules increased a lot in size
Reported by: Ted, Davej + others
Reported-by: Wenji Huang <wenji.huang@oracle.com>
Reported-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Spin more agressively. This is less fair but also markedly faster.
The numbers:
* dbench 50 (higher is better):
spin 1282MB/s
v10 548MB/s
v10 no wait 1868MB/s
* 4k creates (numbers in files/second higher is better):
spin avg 200.60 median 193.20 std 19.71 high 305.93 low 186.82
v10 avg 180.94 median 175.28 std 13.91 high 229.31 low 168.73
v10 no wait avg 232.18 median 222.38 std 22.91 high 314.66 low 209.12
* File stats (numbers in seconds, lower is better):
spin 2.27s
v10 5.1s
v10 no wait 1.6s
( The source changes are smaller than they look, I just moved the
need_resched checks in __mutex_lock_common after the cmpxchg. )
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the sparc build - we were including `up.o' on SMP builds, when
CONFIG_USE_GENERIC_SMP_HELPERS=n.
Tested-by: Robert Reif <reif@earthlink.net>
Fixed-by: Robert Reif <reif@earthlink.net>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.
This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.
Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:
# ./test-mutex-shm V 16 10 | grep "^avg ops"
avg ops/sec: 296604
# ./test-mutex-shm V 16 10 | grep "^avg ops"
avg ops/sec: 85870
The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.
Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.
Furthermore, to ease the testing of the performance impact of this new code,
there is means to disable this behaviour runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:
# echo NO_OWNER_SPIN > /debug/sched_features
This command re-enables spinning again (this is also the default):
# echo OWNER_SPIN > /debug/sched_features
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The problem is that dropping the spinlock right before schedule is a voluntary
preemption point and can cause a schedule, right after which we schedule again.
Fix this inefficiency by keeping preemption disabled until we schedule, do this
by explicity disabling preemption and providing a schedule() variant that
assumes preemption is already disabled.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove a local variable by combining an assingment and test in one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo found a build error in the scheduler when RT_GROUP_SCHED was
enabled, but SMP was not. This patch rearranges the code such
that it is a little more streamlined and compiles under all permutations
of SMP, UP and RT_GROUP_SCHED. It was boot tested on my 4-way x86_64
and it still passes preempt-test.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>