linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Eric Paris	637d32dc72	Capabilities: BUG when an invalid capability is requested If an invalid (large) capability is requested the capabilities system may panic as it is dereferencing an array of fixed (short) length. Its possible (and actually often happens) that the capability system accidentally stumbled into a valid memory region but it also regularly happens that it hits invalid memory and BUGs. If such an operation does get past cap_capable then the selinux system is sure to have problems as it already does a (simple) validity check and BUG. This is known to happen by the broken and buggy firegl driver. This patch cleanly checks all capable calls and BUG if a call is for an invalid capability. This will likely break the firegl driver for some situations, but it is the right thing to do. Garbage into a security system gets you killed/bugged Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>	2008-11-11 22:01:24 +11:00
Peter Zijlstra	2002c69595	sched: release buddies on yield Clear buddies on yield, so that the buddy rules don't schedule them despite them being placed right-most. This fixed a performance regression with yield-happy binary JVMs. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Lin Ming <ming.m.lin@intel.com>	2008-11-11 11:57:22 +01:00
Eric Paris	e68b75a027	When the capset syscall is used it is not possible for audit to record the actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets. example output if you audit capset syscalls would be: type=SYSCALL msg=audit(1225743140.465:76): arch=c000003e syscall=126 success=yes exit=0 a0=17f2014 a1=17f201c a2=80000000 a3=7fff2ab7f060 items=0 ppid=2160 pid=2223 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=1 comm="setcap" exe="/usr/sbin/setcap" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null) type=UNKNOWN[1322] msg=audit(1225743140.465:76): pid=0 cap_pi=ffffffffffffffff cap_pp=ffffffffffffffff cap_pe=ffffffffffffffff Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2008-11-11 21:48:22 +11:00
Eric Paris	3fc689e96c	Any time fcaps or a setuid app under SECURE_NOROOT is used to result in a non-zero pE we will crate a new audit record which contains the entire set of known information about the executable in question, fP, fI, fE, fversion and includes the process's pE, pI, pP. Before and after the bprm capability are applied. This record type will only be emitted from execve syscalls. an example of making ping use fcaps instead of setuid: setcap "cat_net_raw+pe" /bin/ping type=SYSCALL msg=audit(1225742021.015:236): arch=c000003e syscall=59 success=yes exit=0 a0=1457f30 a1=14606b0 a2=1463940 a3=321b770a70 items=2 ppid=2929 pid=2963 auid=0 uid=500 gid=500 euid=500 suid=500 fsuid=500 egid=500 sgid=500 fsgid=500 tty=pts0 ses=3 comm="ping" exe="/bin/ping" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null) type=UNKNOWN[1321] msg=audit(1225742021.015:236): fver=2 fp=0000000000002000 fi=0000000000000000 fe=1 old_pp=0000000000000000 old_pi=0000000000000000 old_pe=0000000000000000 new_pp=0000000000002000 new_pi=0000000000000000 new_pe=0000000000002000 type=EXECVE msg=audit(1225742021.015:236): argc=2 a0="ping" a1="127.0.0.1" type=CWD msg=audit(1225742021.015:236): cwd="/home/test" type=PATH msg=audit(1225742021.015:236): item=0 name="/bin/ping" inode=49256 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ping_exec_t:s0 cap_fp=0000000000002000 cap_fe=1 cap_fver=2 type=PATH msg=audit(1225742021.015:236): item=1 name=(null) inode=507915 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0 Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2008-11-11 21:48:18 +11:00
Eric Paris	851f7ff56d	This patch will print cap_permitted and cap_inheritable data in the PATH records of any file that has file capabilities set. Files which do not have fcaps set will not have different PATH records. An example audit record if you run: setcap "cap_net_admin+pie" /bin/bash /bin/bash type=SYSCALL msg=audit(1225741937.363:230): arch=c000003e syscall=59 success=yes exit=0 a0=2119230 a1=210da30 a2=20ee290 a3=8 items=2 ppid=2149 pid=2923 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="ping" exe="/bin/ping" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null) type=EXECVE msg=audit(1225741937.363:230): argc=2 a0="ping" a1="www.google.com" type=CWD msg=audit(1225741937.363:230): cwd="/root" type=PATH msg=audit(1225741937.363:230): item=0 name="/bin/ping" inode=49256 dev=fd:00 mode=0104755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ping_exec_t:s0 cap_fp=0000000000002000 cap_fi=0000000000002000 cap_fe=1 cap_fver=2 type=PATH msg=audit(1225741937.363:230): item=1 name=(null) inode=507915 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0 Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2008-11-11 21:48:14 +11:00
Bharata B Rao	ff9b48c359	sched: include group statistics in /proc/sched_debug Impact: extend /proc/sched_debug info Since the statistics of a group entity isn't exported directly from the kernel, it becomes difficult to obtain some of the group statistics. For example, the current method to obtain exec time of a group entity is not always accurate. One has to read the exec times of all the tasks(/proc/<pid>/sched) in the group and add them. This method fails (or becomes difficult) if we want to collect stats of a group over a duration where tasks get created and terminated. This patch makes it easier to obtain group stats by directly including them in /proc/sched_debug. Stats like group exec time would help user programs (like LTP) to accurately measure the group fairness. An example output of group stats from /proc/sched_debug: cfs_rq[3]:/3/a/1 .exec_clock : 89.598007 .MIN_vruntime : 0.000001 .min_vruntime : 256300.970506 .max_vruntime : 0.000001 .spread : 0.000000 .spread0 : -25373.372248 .nr_running : 0 .load : 0 .yld_exp_empty : 0 .yld_act_empty : 0 .yld_both_empty : 0 .yld_count : 4474 .sched_switch : 0 .sched_count : 40507 .sched_goidle : 12686 .ttwu_count : 15114 .ttwu_local : 11950 .bkl_count : 67 .nr_spread_over : 0 .shares : 0 .se->exec_start : 113676.727170 .se->vruntime : 1592.612714 .se->sum_exec_runtime : 89.598007 .se->wait_start : 0.000000 .se->sleep_start : 0.000000 .se->block_start : 0.000000 .se->sleep_max : 0.000000 .se->block_max : 0.000000 .se->exec_max : 1.000282 .se->slice_max : 1.999750 .se->wait_max : 54.981093 .se->wait_sum : 217.610521 .se->wait_count : 50 .se->load.weight : 2 Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Acked-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-11 11:44:18 +01:00
Gautham R Shenoy	5d5254f0d3	timers: handle HRTIMER_CB_IRQSAFE_UNLOCKED correctly from softirq context Impact: fix incorrect locking triggered during hotplug-intense stress-tests While migrating the the CB_IRQSAFE_UNLOCKED timers during a cpu-offline, we queue them on the cb_pending list, so that they won't go stale. Thus, when the callbacks of the timers run from the softirq context, they could run into potential deadlocks, since these callbacks assume that they're running with irq's disabled, thereby annoying lockdep! Fix this by emulating hardirq context while running these callbacks from the hrtimer softirq. ================================= [ INFO: inconsistent lock state ] 2.6.27 #2 -------------------------------- inconsistent {in-hardirq-W} -> {hardirq-on-W} usage. ksoftirqd/0/4 [HC0[0]:SC1[1]:HE1:SE0] takes: (&rq->lock){++..}, at: [<c011db84>] sched_rt_period_timer+0x9e/0x1fc {in-hardirq-W} state was registered at: [<c014103c>] __lock_acquire+0x549/0x121e [<c0107890>] native_sched_clock+0x88/0x99 [<c013aa12>] clocksource_get_next+0x39/0x3f [<c0139abc>] update_wall_time+0x616/0x7df [<c0141d6b>] lock_acquire+0x5a/0x74 [<c0121724>] scheduler_tick+0x3a/0x18d [<c047ed45>] _spin_lock+0x1c/0x45 [<c0121724>] scheduler_tick+0x3a/0x18d [<c0121724>] scheduler_tick+0x3a/0x18d [<c012c436>] update_process_times+0x3a/0x44 [<c013c044>] tick_periodic+0x63/0x6d [<c013c062>] tick_handle_periodic+0x14/0x5e [<c010568c>] timer_interrupt+0x44/0x4a [<c0150c9f>] handle_IRQ_event+0x13/0x3d [<c0151c14>] handle_level_irq+0x79/0xbd [<c0105634>] do_IRQ+0x69/0x7d [<c01041e4>] common_interrupt+0x28/0x30 [<c047007b>] aac_probe_one+0x1a3/0x3f3 [<c047ec2d>] _spin_unlock_irqrestore+0x36/0x39 [<c01512b4>] setup_irq+0x1be/0x1f9 [<c065d70b>] start_kernel+0x259/0x2c5 [<ffffffff>] 0xffffffff irq event stamp: 50102 hardirqs last enabled at (50102): [<c047ebf4>] _spin_unlock_irq+0x20/0x23 hardirqs last disabled at (50101): [<c047edc2>] _spin_lock_irq+0xa/0x4b softirqs last enabled at (50088): [<c0128ba6>] do_softirq+0x37/0x4d softirqs last disabled at (50099): [<c0128ba6>] do_softirq+0x37/0x4d other info that might help us debug this: no locks held by ksoftirqd/0/4. stack backtrace: Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.27 #2 [<c013f6cb>] print_usage_bug+0x13e/0x147 [<c013fef5>] mark_lock+0x493/0x797 [<c01410b1>] __lock_acquire+0x5be/0x121e [<c0141d6b>] lock_acquire+0x5a/0x74 [<c011db84>] sched_rt_period_timer+0x9e/0x1fc [<c047ed45>] _spin_lock+0x1c/0x45 [<c011db84>] sched_rt_period_timer+0x9e/0x1fc [<c011db84>] sched_rt_period_timer+0x9e/0x1fc [<c01210fd>] finish_task_switch+0x41/0xbd [<c0107890>] native_sched_clock+0x88/0x99 [<c011dae6>] sched_rt_period_timer+0x0/0x1fc [<c0136dda>] run_hrtimer_pending+0x54/0xe5 [<c011dae6>] sched_rt_period_timer+0x0/0x1fc [<c0128afb>] __do_softirq+0x7b/0xef [<c0128ba6>] do_softirq+0x37/0x4d [<c0128c12>] ksoftirqd+0x56/0xc5 [<c0128bbc>] ksoftirqd+0x0/0xc5 [<c0134649>] kthread+0x38/0x5d [<c0134611>] kthread+0x0/0x5d [<c0104477>] kernel_thread_helper+0x7/0x10 ======================= Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-11 10:46:42 +01:00
Frederic Weisbecker	15e6cb3673	tracing: add a tracer to catch execution time of kernel functions Impact: add new tracing plugin which can trace full (entry+exit) function calls This tracer uses the low level function return ftrace plugin to measure the execution time of the kernel functions. The first field is the caller of the function, the second is the measured function, and the last one is the execution time in nanoseconds. - v3: - HAVE_FUNCTION_RET_TRACER have been added. Each arch that support ftrace return should enable it. - ftrace_return_stub becomes ftrace_stub. - CONFIG_FUNCTION_RET_TRACER depends now on CONFIG_FUNCTION_TRACER - Return traces printing can be used for other tracers on trace.c - Adapt to the new tracing API (no more ctrl_update callback) - Correct the check of "disabled" during insertion. - Minor changes... Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-11 10:29:12 +01:00
Frederic Weisbecker	caf4b323b0	tracing, x86: add low level support for ftrace return tracing Impact: add infrastructure for function-return tracing Add low level support for ftrace return tracing. This plug-in stores return addresses on the thread_info structure of the current task. The index of the current return address is initialized when the task is the first one (init) and when a process forks (the child). It is not needed when a task does a sys_execve because after this syscall, it still needs to return on the kernel functions it called. Note that the code of return_to_handler has been suggested by Steven Rostedt as almost all of the ideas of improvements in this V3. For purpose of security, arch/x86/kernel/process_32.c is not traced because __switch_to() changes the current task during its execution. That could cause inconsistency in the stored return address of this function even if I didn't have any crash after testing with tracing on this function enabled. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-11 10:29:11 +01:00
Steven Rostedt	f536aafc5a	ring-buffer: replace most bug ons with warn on and disable buffer This patch replaces most of the BUG_ONs in the ring_buffer code with RB_WARN_ON variants. It adds some more variants as needed for the replacement. This lets the buffer die nicely and still warn the user. One BUG_ON remains in the code, and that is because it detects a bad pointer passed in by the calling function, and not a bug by the ring buffer code itself. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-11 09:40:34 +01:00
Steven Rostedt	5aa1ba6a6c	ftrace: prevent ftrace_special from recursion Impact: stop ftrace_special from recursion The ftrace_special is used to help debug areas of the kernel. Because of this, if it is put in certain locations, the fact that it allows recursion can become a problem if the kernel developer using does not realize that. This patch changes ftrace_special to not allow recursion into itself to make it more robust. It also changes from preempt disable interrupts disable to prevent any loss of trace entries. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-11 09:40:29 +01:00
Ingo Molnar	e0cb4ebcd9	Merge branch 'tracing/urgent' into tracing/ftrace Conflicts: kernel/trace/trace.c	2008-11-11 09:40:18 +01:00
Ingo Molnar	45b86a96f1	Merge branch 'devel' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent	2008-11-11 09:16:20 +01:00
Ingo Molnar	ae1e9130bf	sched: rename SCHED_NO_NO_OMIT_FRAME_POINTER => SCHED_OMIT_FRAME_POINTER Impact: cleanup, change .config option name We had this ugly config name for a long time for hysteric raisons. Rename it to a saner name. We still cannot get rid of it completely, until /proc/<pid>/stack usage replaces WCHAN usage for good. We'll be able to do that in the v2.6.29/v2.6.30 timeframe. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-11 08:59:20 +01:00
Oleg Nesterov	ad474caca3	fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock Impact: fix hang/crash on ia64 under high load This is ugly, but the simplest patch by far. Unlike other similar routines, account_group_exec_runtime() could be called "implicitly" from within scheduler after exit_notify(). This means we can race with the parent doing release_task(), we can't just check ->signal != NULL. Change __exit_signal() to do spin_unlock_wait(&task_rq(tsk)->lock) before __cleanup_signal() to make sure ->signal can't be freed under task_rq(tsk)->lock. Note that task_rq_unlock_wait() doesn't care about the case when tsk changes cpu/rq under us, this should be OK. Thanks to Ingo who nacked my previous buggy patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reported-by: Doug Chapman <doug.chapman@hp.com>	2008-11-11 08:01:43 +01:00
Steven Rostedt	4143c5cb36	ring-buffer: prevent infinite looping on time stamping Impact: removal of unnecessary looping The lockless part of the ring buffer allows for reentry into the code from interrupts. A timestamp is taken, a test is preformed and if it detects that an interrupt occurred that did tracing, it tries again. The problem arises if the timestamp code itself causes a trace. The detection will detect this and loop again. The difference between this and an interrupt doing tracing, is that this will fail every time, and cause an infinite loop. Currently, we test if the loop happens 1000 times, and if so, it will produce a warning and disable the ring buffer. The problem with this approach is that it makes it difficult to perform some types of tracing (tracing the timestamp code itself). Each trace entry has a delta timestamp from the previous entry. If a trace entry is reserved but and interrupt occurs and traces before the previous entry is commited, the delta timestamp for that entry will be zero. This actually makes sense in terms of tracing, because the interrupt entry happened before the preempted entry was commited, so one may consider the two happening at the same time. The order is still preserved in the buffer. With this idea, instead of trying to get a new timestamp if an interrupt made it in between the timestamp and the test, the entry could simply make the delta zero and continue. This will prevent interrupts or tracers in the timer code from causing the above loop. Signed-off-by: Steven Rostedt <srostedt@redhat.com>	2008-11-10 21:47:37 -05:00
Steven Rostedt	bf5e6519b8	ftrace: disable tracing on resize Impact: fix for bug on resize This patch addresses the bug found here: http://bugzilla.kernel.org/show_bug.cgi?id=11996 When ftrace converted to the new unified trace buffer, the resizing of the buffer was not protected as much as it was originally. If tracing is performed while the resize occurs, then the buffer can be corrupted. This patch disables all ftrace buffer modifications before a resize takes place. Signed-off-by: Steven Rostedt <srostedt@redhat.com>	2008-11-10 21:47:35 -05:00
Thomas Gleixner	ae99286b4f	nohz: disable tick_nohz_kick_tick() for now Impact: nohz powersavings and wakeup regression commit `fb02fbc14d` (NOHZ: restart tick device from irq_enter()) causes a serious wakeup regression. While the patch is correct it does not take into account that spurious wakeups happen on x86. A fix for this issue is available, but we just revert to the .27 behaviour and let long running softirqs screw themself. Disable it for now. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-11-10 22:39:27 +01:00
Thomas Gleixner	ee5f80a993	irq: call __irq_enter() before calling the tick_idle_check Impact: avoid spurious ksoftirqd wakeups The tick idle check which is called from irq_enter() was run before the call to __irq_enter() which did not set the in_interrupt() bits in preempt_count. That way the raise of a softirq woke up softirqd for nothing as the softirq was handled on return from interrupt. Call __irq_enter() before calling into the tick idle check code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-10 22:36:39 +01:00
Peter Zijlstra	5ac5c4d604	sched: clean up debug info Impact: clean up and fix debug info printout While looking over the sched_debug code I noticed that we printed the rq schedstats for every cfs_rq, ammend this. Also change nr_spead_over into an int, and fix a little buglet in min_vruntime printing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-10 10:51:51 +01:00
Ingo Molnar	f131e2436d	irq: fix typo Impact: build fix fix build failure on UP. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-09 22:26:45 +01:00
Thomas Gleixner	612e3684c1	genirq: fix the affinity setting in setup_irq The affinity setting in setup irq is called before the NO_BALANCING flag is checked and might therefore override affinity settings from the calling code with the default setting. Move the NO_BALANCING flag check before the call to the affinity setting. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-09 22:23:54 +01:00
Thomas Gleixner	f6d87f4bd2	genirq: keep affinities set from userspace across free/request_irq() Impact: preserve user-modified affinities on interrupts Kumar Galak noticed that commit `1840475676` (genirq: Expose default irq affinity mask (take 3)) overrides an already set affinity setting across a free / request_irq(). Happens e.g. with ifdown/ifup of a network device. Change the logic to mark the affinities as set and keep them intact. This also fixes the unlocked access to irq_desc in irq_select_affinity() when called from irq_affinity_proc_write() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-09 22:23:49 +01:00
Linus Torvalds	cb56d98e2a	Merge branch 'cpus4096' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'cpus4096' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: cpumask: introduce new API, without changing anything, v3 cpumask: new API, v2 cpumask: introduce new API, without changing anything	2008-11-09 12:20:56 -08:00
Steven Rostedt	a309720c87	ftrace: display start of CPU buffer in trace output Impact: change in trace output Because the trace buffers are per cpu ring buffers, the start of the trace can be confusing. If one CPU is very active at the end of the trace, its history will not go as far back as the other CPU traces. This means that output for a particular CPU may not appear for the first part of a trace. To help annotate what is happening, and to prevent any more confusion, this patch adds a line that annotates the start of a CPU buffer output. For example: automount-3495 [001] 184.596443: dnotify_parent <-vfs_write [...] automount-3495 [001] 184.596449: dput <-path_put automount-3496 [002] 184.596450: down_read_trylock <-do_page_fault [...] sshd-3497 [001] 184.597069: up_read <-do_page_fault <idle>-0 [000] 184.597074: __exit_idle <-exit_idle [...] automount-3496 [002] 184.597257: filemap_fault <-__do_fault <idle>-0 [003] 184.597261: exit_idle <-smp_apic_timer_interrupt Note, parsers of a trace output should always ignore any lines that start with a '#'. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:54 +01:00
Steven Rostedt	769c48eb25	ftrace: force pass of preemptoff selftest Impact: preemptoff not tested in selftest Due to the BKL not being preemptable anymore, the selftest of the preemptoff code can not be tested. It requires that it is called with preemption enabled, but since the BKL is held, that is no longer the case. This patch simply skips those tests if it detects that the context is not preemptable. The following will now show up in the tests: Testing tracer preemptoff: can not test ... force PASSED Testing tracer preemptirqsoff: can not test ... force PASSED When the BKL is removed, or it becomes preemptable once again, then the tests will be performed. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:49 +01:00
Steven Rostedt	c76f06945b	ftrace: remove trace array ctrl Impact: remove obsolete variable in trace_array structure With the new start / stop method of ftrace, the ctrl variable in the trace_array structure is now obsolete. Remove it. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:39 +01:00
Steven Rostedt	bbf5b1a0ce	ftrace: remove ctrl_update method Impact: Remove the ctrl_update tracer method With the new quick start/stop method of tracing, the ctrl_update method is out of date. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:34 +01:00
Steven Rostedt	49833fc232	ftrace: enable trace_printk by default Impact: have the ftrace_printk enabled on startup It is confusing to have to "echo trace_printk > /debug/tracing/iter_ctrl" after adding ftrace_printk in the kernel. Currently the trace_printk is set to off by default. ftrace_printk should only be in open kernel code when used for debugging, and thus it should be enabled by default. It may also be used to record data within a tracer, but those ftrace_printks should be within wrappers that are either enabled by trace_points or have a variable protecting the code path from being entered when the tracer is disabled. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:29 +01:00
Steven Rostedt	4519317020	ftrace: irqsoff tracer incorrect reset Impact: fix to irqsoff tracer output In converting to the new start / stop ftrace handling, the irqsoff tracer start called the irqsoff reset function. irqsoff tracer is not the same as the other traces, and it resets the buffers while searching for the longest latency. The reset that the irqsoff stop method calls disables the function tracing. That means that, by starting the tracer, the function tracer is disabled incorrectly. This patch simply removes the call to reset which keeps the function tracing enabled. Reset is not needed for the irqsoff stop method. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:24 +01:00
Steven Rostedt	e168e0516e	ftrace: fix sched_switch API Impact: fix for sched_switch that broke dynamic ftrace startup The commit: tracing/fastboot: use sched switch tracer from boot tracer broke the API of the sched_switch trace. The use of the tracing_start/stop_cmdline record is for only recording the cmdline, NOT recording the schedule switches themselves. Seeing that the boot tracer broke the API to do something that it wanted, this patch adds a new interface for the API while puting back the original interface of the old API. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:18 +01:00
Steven Rostedt	75f5c47da3	ftrace: fix boot trace sched startup Impact: boot tracer startup modified The boot tracer calls into some of the schedule tracing private functions that should not be exported. This patch cleans it up, and makes way for further changes in the ftrace infrastructure. This patch adds a api to assign a tracer array to the schedule context switch tracer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:09 +01:00
Steven Rostedt	0183fb1c94	ftrace: fix set_ftrace_filter Impact: fix of output of set_ftrace_filter Commit ftrace: do not show freed records in available_filter_functions Removed a bit too much from the set_ftrace_filter code, where we now see all functions in the set_ftrace_filter file even when we set a filter. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-08 09:51:02 +01:00
Ingo Molnar	a6b0786f7f	Merge branches 'tracing/ftrace', 'tracing/fastboot', 'tracing/nmisafe' and 'tracing/urgent' into tracing/core	2008-11-08 09:34:35 +01:00
Li Zefan	6d21cd6251	sched: clean up SCHED_CPUMASK_ALLOC Impact: cleanup The #if/#endif is ugly. Change SCHED_CPUMASK_ALLOC and SCHED_CPUMASK_FREE to static inline functions. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-07 10:30:35 +01:00
Ingo Molnar	258594a138	Merge branch 'sched/urgent' into sched/core	2008-11-07 10:29:58 +01:00
Li Zefan	ca3273f964	sched: fix memory leak in a failure path Impact: fix rare memory leak in the sched-domains manual reconfiguration code In the failure path, rd is not attached to a sched domain, so it causes a leak. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-07 08:29:58 +01:00
Li Zefan	f29c9b1ccb	sched: fix a bug in sched domain degenerate Impact: re-add incorrectly eliminated sched domain layers (1) on i386 with SCHED_SMT and SCHED_MC enabled # mount -t cgroup -o cpuset xxx /mnt # echo 0 > /mnt/cpuset.sched_load_balance # mkdir /mnt/0 # echo 0 > /mnt/0/cpuset.cpus # dmesg CPU0 attaching sched-domain: domain 0: span 0 level CPU groups: 0 (2) on i386 with SCHED_MC enabled but SCHED_SMT disabled # same with (1) # dmesg CPU0 attaching NULL sched-domain. The bug is that some sched domains may be skipped unintentionally when degenerating (optimizing) sched domains. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-07 08:29:57 +01:00
Niv Sardi	dcd7b4e5c0	Merge branch 'master' of git://oss.sgi.com:8090/xfs/linux-2.6	2008-11-07 15:07:12 +11:00
Linus Torvalds	e252f4db18	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: Block: use round_jiffies_up() Add round_jiffies_up and related routines block: fix __blkdev_get() for removable devices generic-ipi: fix the smp_mb() placement blk: move blk_delete_timer call in end_that_request_last block: add timer on blkdev_dequeue_request() not elv_next_request() bio: define __BIOVEC_PHYS_MERGEABLE block: remove unused ll_new_mergeable()	2008-11-06 15:53:47 -08:00
Linus Torvalds	067ab19923	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: re-tune balancing sched: fix buddies for group scheduling sched: backward looking buddy sched: fix fair preempt check sched: cleanup fair task selection	2008-11-06 15:45:40 -08:00
Li Zefan	24eb089950	cgroups: fix invalid cgrp->dentry before cgroup has been completely removed This fixes an oops when reading /proc/sched_debug. A cgroup won't be removed completely until finishing cgroup_diput(), so we shouldn't invalidate cgrp->dentry in cgroup_rmdir(). Otherwise, when a group is being removed while cgroup_path() gets called, we may trigger NULL dereference BUG. The bug can be reproduced: # cat test.sh #!/bin/sh mount -t cgroup -o cpu xxx /mnt for (( ; ; )) { mkdir /mnt/sub rmdir /mnt/sub } # ./test.sh & # cat /proc/sched_debug BUG: unable to handle kernel NULL pointer dereference at 00000038 IP: [<c045a47f>] cgroup_path+0x39/0x90 ... Call Trace: [<c0420344>] ? print_cfs_rq+0x6e/0x75d [<c0421160>] ? sched_debug_show+0x72d/0xc1e ... Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> [2.6.26.x, 2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-11-06 15:41:19 -08:00
Sripathi Kodi	cf7f8690e8	sched, lockdep: inline double_unlock_balance() We have a test case which measures the variation in the amount of time needed to perform a fixed amount of work on the preempt_rt kernel. We started seeing deterioration in it's performance recently. The test should never take more than 10 microseconds, but we started 5-10% failure rate. Using elimination method, we traced the problem to commit `1b12bbc747` (lockdep: re-annotate scheduler runqueues). When LOCKDEP is disabled, this patch only adds an additional function call to double_unlock_balance(). Hence I inlined double_unlock_balance() and the problem went away. Here is a patch to make this change. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-06 22:12:09 +01:00
Rusty Russell	2d3854a37e	cpumask: introduce new API, without changing anything Impact: introduce new APIs We want to deprecate cpumasks on the stack, as we are headed for gynormous numbers of CPUs. Eventually, we want to head towards an undefined 'struct cpumask' so they can never be declared on stack. 1) New cpumask functions which take pointers instead of copies. (cpus_* -> cpumask_*) 2) Several new helpers to reduce requirements for temporary cpumasks (cpumask_first_and, cpumask_next_and, cpumask_any_and) 3) Helpers for declaring cpumasks on or offstack for large NR_CPUS (cpumask_var_t, alloc_cpumask_var and free_cpumask_var) 4) 'struct cpumask' for explicitness and to mark new-style code. 5) Make iterator functions stop at nr_cpu_ids (a runtime constant), not NR_CPUS for time efficiency and for smaller dynamic allocations in future. 6) cpumask_copy() so we can allocate less than a full cpumask eventually (for alloc_cpumask_var), and so we can eliminate the 'struct cpumask' definition eventually. 7) work_on_cpu() helper for doing task on a CPU, rather than saving old cpumask for current thread and manipulating it. 8) smp_call_function_many() which is smp_call_function_mask() except taking a cpumask pointer. Note that this patch simply introduces the new functions and leaves the obsolescent ones in place. This is to simplify the transition patches. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-06 09:05:33 +01:00
Alan Stern	9c133c469d	Add round_jiffies_up and related routines This patch (as1158b) adds round_jiffies_up() and friends. These routines work like the analogous round_jiffies() functions, except that they will never round down. The new routines will be useful for timeouts where we don't care exactly when the timer expires, provided it doesn't expire too soon. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-11-06 08:42:48 +01:00
Suresh Siddha	561920a0d2	generic-ipi: fix the smp_mb() placement smp_mb() is needed (to make the memory operations visible globally) before sending the ipi on the sender and the receiver (on Alpha atleast) needs smp_read_barrier_depends() in the handler before reading the call_single_queue list in a lock-free fashion. On x86, x2apic mode register accesses for sending IPI's don't have serializing semantics. So the need for smp_mb() before sending the IPI becomes more critical in x2apic mode. Remove the unnecessary smp_mb() in csd_flag_wait(), as the presence of that smp_mb() doesn't mean anything on the sender, when the ipi receiver is not doing any thing special (like memory fence) after clearing the CSD_FLAG_WAIT. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-11-06 08:41:56 +01:00
Steven Rostedt	3e03fb7f1d	ring-buffer: convert to raw spinlocks Impact: no lockdep debugging of ring buffer The problem with running lockdep on the ring buffer is that the ring buffer is the core infrastructure of ftrace. What happens is that the tracer will start tracing the lockdep code while lockdep is testing the ring buffers locks. This can cause lockdep to fail due to testing cases that have not fully finished their locking transition. This patch converts the spin locks used by the ring buffer back into raw spin locks which lockdep does not check. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-06 07:51:09 +01:00
Steven Rostedt	9036990d46	ftrace: restructure tracing start/stop infrastructure Impact: change where tracing is started up and stopped Currently, when a new tracer is selected via echo'ing a tracer name into the current_tracer file, the startup is only done if tracing_enabled is set to one. If tracing_enabled is changed to zero (by echo'ing 0 into the tracing_enabled file) a full shutdown is performed. The full startup and shutdown of a tracer can be expensive and the user can lose out traces when echo'ing in 0 to the tracing_enabled file, because the process takes too long. There can also be places that the user would like to start and stop the tracer several times and doing the full startup and shutdown of a tracer might be too expensive. This patch performs the full startup and shutdown when a tracer is selected. It also adds a way to do a quick start or stop of a tracer. The quick version is just a flag that prevents the tracing from taking place, but the overhead of the code is still there. For example, the startup of a tracer may enable tracepoints, or enable the function tracer. The stop and start will just set a flag to have the tracer ignore the calls when the tracepoint or function trace is called. The overhead of the tracer may still be present when the tracer is stopped, but no tracing will occur. Setting the tracer to the 'nop' tracer (or any other tracer) will perform the shutdown of the tracer which will disable the tracepoint or disable the function tracer. The tracing_enabled file will simply start or stop tracing. This change is all internal. The end result for the user should be the same as before. If tracing_enabled is not set, no trace will happen. If tracing_enabled is set, then the trace will happen. The tracing_enabled variable is static between tracers. Enabling tracing_enabled and going to another tracer will keep tracing_enabled enabled. Same is true with disabling tracing_enabled. This patch will now provide a fast start/stop method to the users for enabling or disabling tracing. Note: There were two methods to the struct tracer that were never used: The methods start and stop. These were to be used as a hook to the reading of the trace output, but ended up not being necessary. These two methods are now used to enable the start and stop of each tracer, in case the tracer needs to do more than just not write into the buffer. For example, the irqsoff tracer must stop recording max latencies when tracing is stopped. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-06 07:51:03 +01:00
Steven Rostedt	0f04870148	ftrace: soft tracing stop and start Impact: add way to quickly start stop tracing from the kernel This patch adds a soft stop and start to the trace. This simply disables function tracing via the ftrace_disabled flag, and disables the trace buffers to prevent recording. The tracing code may still be executed, but the trace will not be recorded. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-06 07:50:57 +01:00
Steven Rostedt	60a7ecf426	ftrace: add quick function trace stop Impact: quick start and stop of function tracer This patch adds a way to disable the function tracer quickly without the need to run kstop_machine. It adds a new variable called function_trace_stop which will stop the calls to functions from mcount when set. This is just an on/off switch and does not handle recursion like preempt_disable(). It's main purpose is to help other tracers/debuggers start and stop tracing fuctions without the need to call kstop_machine. The config option HAVE_FUNCTION_TRACE_MCOUNT_TEST is added for archs that implement the testing of the function_trace_stop in the mcount arch dependent code. Otherwise, the test is done in the C code. x86 is the only arch at the moment that supports this. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-06 07:50:51 +01:00

... 62 63 64 65 66 ...

8411 commits