linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Arnaldo Carvalho de Melo	c71a896154	blktrace: add ftrace plugin Impact: New way of using the blktrace infrastructure This drops the requirement of userspace utilities to use the blktrace facility. Configuration is done through sysfs, adding a "trace" directory to the partition directory where blktrace can be enabled for the associated request_queue. The same filters present in the IOCTL interface are present as sysfs device attributes. The /sys/block/sdX/sdXN/trace/enable file allows tracing without any filters. The other files in this directory: pid, act_mask, start_lba and end_lba can be used with the same meaning as with the IOCTL interface. Using the sysfs interface will only set up the request_queue->blk_trace fields, tracing will only take place when the "blk" tracer is selected via the ftrace interface, as in the following example: To see the trace, one can use the /d/tracing/trace file or the /d/tracing/trace_pipe file, with semantics defined in the ftrace documentation in Documentation/ftrace.txt. [root@f10-1 ~]# cat /t/trace kjournald-305 [000] 3046.491224: 8,1 A WBS 6367 + 8 <- (8,1) 6304 kjournald-305 [000] 3046.491227: 8,1 Q R 6367 + 8 [kjournald] kjournald-305 [000] 3046.491236: 8,1 G RB 6367 + 8 [kjournald] kjournald-305 [000] 3046.491239: 8,1 P NS [kjournald] kjournald-305 [000] 3046.491242: 8,1 I RBS 6367 + 8 [kjournald] kjournald-305 [000] 3046.491251: 8,1 D WB 6367 + 8 [kjournald] kjournald-305 [000] 3046.491610: 8,1 U WS [kjournald] 1 <idle>-0 [000] 3046.511914: 8,1 C RS 6367 + 8 [6367] [root@f10-1 ~]# The default line context (prefix) format is the one described in the ftrace documentation, with the blktrace specific bits using its existing format, described in blkparse(8). 
If one wants to have the classic blktrace formatting, this is possible by using: [root@f10-1 ~]# echo blk_classic > /t/trace_options [root@f10-1 ~]# cat /t/trace 8,1 0 3046.491224 305 A WBS 6367 + 8 <- (8,1) 6304 8,1 0 3046.491227 305 Q R 6367 + 8 [kjournald] 8,1 0 3046.491236 305 G RB 6367 + 8 [kjournald] 8,1 0 3046.491239 305 P NS [kjournald] 8,1 0 3046.491242 305 I RBS 6367 + 8 [kjournald] 8,1 0 3046.491251 305 D WB 6367 + 8 [kjournald] 8,1 0 3046.491610 305 U WS [kjournald] 1 8,1 0 3046.511914 0 C RS 6367 + 8 [6367] [root@f10-1 ~]# Using the ftrace standard format allows more flexibility, such as the ability of asking for backtraces via trace_options: [root@f10-1 ~]# echo noblk_classic > /t/trace_options [root@f10-1 ~]# echo stacktrace > /t/trace_options [root@f10-1 ~]# cat /t/trace kjournald-305 [000] 3318.826779: 8,1 A WBS 6375 + 8 <- (8,1) 6312 kjournald-305 [000] 3318.826782: <= submit_bio <= submit_bh <= sync_dirty_buffer <= journal_commit_transaction <= kjournald <= kthread <= child_rip kjournald-305 [000] 3318.826836: 8,1 Q R 6375 + 8 [kjournald] kjournald-305 [000] 3318.826837: <= generic_make_request <= submit_bio <= submit_bh <= sync_dirty_buffer <= journal_commit_transaction <= kjournald <= kthread Please read the ftrace documentation to use additional, standardized tracing filters such as /d/tracing/trace_cpumask, etc. See also /d/tracing/trace_mark to add comments in the trace stream, that is equivalent to the /d/block/sdaN/msg interface. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-26 14:40:53 +01:00
Arnaldo Carvalho de Melo	9011262a37	ftrace: add ftrace_vprintk Impact: new helper function Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-26 14:40:53 +01:00
Randy Dunlap	cc2f6d90e9	kmemtrace: fix printk format warnings Fix kmemtrace printk warnings: kernel/trace/kmemtrace.c:142: warning: format '%4ld' expects type 'long int', but argument 3 has type 'size_t' kernel/trace/kmemtrace.c:147: warning: format '%4ld' expects type 'long int', but argument 3 has type 'size_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-26 14:03:51 +01:00
Ingo Molnar	5ce1b1ed27	Merge branches 'tracing/ftrace' and 'tracing/function-graph-tracer' into tracing/core	2009-01-26 14:01:52 +01:00
Thomas Gleixner	6626bff245	hrtimer: prevent negative expiry value after clock_was_set() Impact: prevent false positive WARN_ON() in clockevents_program_event() clock_was_set() changes the base->offset of CLOCK_REALTIME and enforces the reprogramming of the clockevent device to expire timers which are based on CLOCK_REALTIME. If the clock change is large enough then the subtraction of the timer expiry value and base->offset can become negative which triggers the warning in clockevents_program_event(). Check the subtraction result and set a negative value to 0. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-01-25 11:54:57 +01:00
Thomas Gleixner	5b74f9e0e0	Merge branch 'linus' into timers/hrtimers	2009-01-25 11:54:33 +01:00
Frederic Weisbecker	9005f3ebeb	tracing/function-graph-tracer: various fixes and features This patch brings various bugfixes: - Drop the first irrelevant task switch at the very beginning of a trace. - Drop the OVERHEAD word from the headers, the DURATION word is sufficient and will not overlap other columns. - Make the headers fit their respective columns well whatever the selected options. Ie, default options: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # \| \| \| \| \| \| \| 1) 0.646 us \| } 1) \| mem_cgroup_del_lru_list() { 1) 0.624 us \| lookup_page_cgroup(); 1) 1.970 us \| } echo funcgraph-proc > trace_options # tracer: function_graph # # CPU TASK/PID DURATION FUNCTION CALLS # \| \| \| \| \| \| \| \| \| 0) bash-2937 \| 0.895 us \| } 0) bash-2937 \| 0.888 us \| __rcu_read_unlock(); 0) bash-2937 \| 0.864 us \| conv_uni_to_pc(); 0) bash-2937 \| 1.015 us \| __rcu_read_lock(); echo nofuncgraph-cpu > trace_options echo nofuncgraph-proc > trace_options # tracer: function_graph # # DURATION FUNCTION CALLS # \| \| \| \| \| \| 3.752 us \| native_pud_val(); 0.616 us \| native_pud_val(); 0.624 us \| native_pmd_val(); About features, one can now disable the duration (this will hide the overhead too, for convenience and because one doesn't need the overhead without the duration): echo nofuncgraph-duration > trace_options # tracer: function_graph # # FUNCTION CALLS # \| \| \| \| cap_vm_enough_memory() { __vm_enough_memory() { vm_acct_memory(); } } } And finally, an option to print the absolute time: //Restart from default options echo funcgraph-abstime > trace_options # tracer: function_graph # # TIME CPU DURATION FUNCTION CALLS # \| \| \| \| \| \| \| \| 261.339774 \| 1) + 42.823 us \| } 261.339775 \| 1) 1.045 us \| _spin_lock_irq(); 261.339777 \| 1) 0.940 us \| _spin_lock_irqsave(); 261.339778 \| 1) 0.752 us \| _spin_unlock_irqrestore(); 261.339780 \| 1) 0.857 us \| _spin_unlock_irq(); 261.339782 \| 1) \| flush_to_ldisc() { 261.339783 \| 1) \| 
tty_ldisc_ref() { 261.339783 \| 1) \| tty_ldisc_try() { 261.339784 \| 1) 1.075 us \| _spin_lock_irqsave(); 261.339786 \| 1) 0.842 us \| _spin_unlock_irqrestore(); 261.339788 \| 1) 4.211 us \| } 261.339788 \| 1) 5.662 us \| } The format is seconds.usecs. I guess no one needs the nanosec precision here, the main goal is to have an overview about the general timings of events, and to see the place when the trace switches from one cpu to another. ie: 274.874760 \| 1) 0.676 us \| _spin_unlock(); 274.874762 \| 1) 0.609 us \| native_load_sp0(); 274.874763 \| 1) 0.602 us \| native_load_tls(); 274.878739 \| 0) 0.722 us \| } 274.878740 \| 0) 0.714 us \| native_pmd_val(); 274.878741 \| 0) 0.730 us \| native_pmd_val(); Here there is a 4000 usecs difference when we switch the cpu. Changes in V2: - Completely fix the first pointless task switch. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-23 11:18:08 +01:00
Steven Rostedt	7e49fcce1b	trace, lockdep: manual preempt count adding for local_bh_disable Impact: fix to preempt trace triggering lockdep check_flag failure In local_bh_disable, the use of add_preempt_count causes the preempt tracer to start recording the time preemption is off. But because it already modified the preempt_count to show softirqs disabled, and before it called the lockdep code to handle this, it causes a state that lockdep can not handle. The preempt tracer will reset the ring buffer on start of a trace, and the ring buffer reset code does a spin_lock_irqsave. This calls into lockdep and lockdep will fail when it detects the invalid state of having softirqs disabled but the internal current->softirqs_enabled is still set. The fix is to manually add the SOFTIRQ_OFFSET to preempt count and call the preempt tracer code outside the lockdep critical area. Thanks to Peter Zijlstra for suggesting this solution. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-23 11:10:57 +01:00
Steven Rostedt	b06a830183	trace: fix logic to start/stop counting The logic in the tracing_start/stop code prevents the WARN_ON from ever detecting if a start/stop pair was mismatched. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-23 11:10:45 +01:00
Steven Rostedt	94523e818f	trace: remove internal irqsoff disabling for trace output Impact: cleanup of duplicate features The trace output disables the ring buffer and prevents tracing from occurring. The code in irqsoff to do the same thing is no longer needed. This patch removes it. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-23 11:10:36 +01:00
Ingo Molnar	bfe2a3c3b5	Merge branch 'core/percpu' into perfcounters/core Conflicts: arch/x86/include/asm/hardirq_32.h arch/x86/include/asm/hardirq_64.h Semantic merge: arch/x86/include/asm/hardirq.h [ added apic_perf_irqs field. ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-23 10:20:15 +01:00
Steven Rostedt	91a8d07d82	ring-buffer: reset timestamps when ring buffer is reset Impact: fix bad times of recent resets The ring buffer needs to reset its timestamps when the buffer is reset, otherwise the timestamps are stale and might be used to calculate times in the buffer causing funny timestamps to appear. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-22 10:31:58 +01:00
Steven Rostedt	69507c0653	ring-buffer: reset timestamps when ring buffer is reset Impact: fix bad times of recent resets The ring buffer needs to reset its timestamps when the buffer is reset, otherwise the timestamps are stale and might be used to calculate times in the buffer causing funny timestamps to appear. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-22 10:27:54 +01:00
Steven Rostedt	f8ec1062f5	wakeup-tracer: show scheduling data in output Impact: better data for wakeup tracer This patch adds the wakeup and schedule calls that are used by the scheduler tracer to make the wakeup tracer more readable. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-22 10:27:39 +01:00
Steven Rostedt	3244351c31	trace: separate out rt tasks from wakeup tracer Impact: add option to trace all tasks or just RT tasks The current wakeup tracer only traces RT task wakeups. This is fine for those interested in wake up timings of RT tasks, but it is useless for those that are interested in the causes of long wakeups for non RT tasks. This patch creates a "wakeup_rt" to implement the tracing of just RT tasks (as the current "wakeup" does). And makes "wakeup" now trace all tasks as an average developer would expect. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-22 10:27:22 +01:00
Steven Rostedt	97b17efe45	ring-buffer: do not swap if recording is disabled If ring buffer recording has been disabled, do not let swapping of ring buffers occur. Simply return -EAGAIN. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-22 10:27:16 +01:00
Steven Rostedt	5bc4564b22	trace: do not disable wake up tracer on output of trace Impact: fix to erased trace output To try not to have the outputting of a trace interfere with the wakeup tracer, it would disable tracing while the output was printing. But if a trace had started when it was disabled, it can show a partial trace. To try to solve this, on closing of the tracer, it would clear the trace buffer. The latency tracers (wakeup and irqsoff) have two buffers. One for recording and one for holding the max trace that is printed. The clearing of the trace above should only affect the recording buffer. But for some reason it would move the erased trace to the print buffer. Probably due to a race with the closing of the trace and the saving of the max trace. The above is all pretty useless, and if the user does not want the printing of the trace to be traced itself, then the user can manually disable tracing. This patch removes all the code that tries to keep the output of the tracer from modifying the trace. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-22 10:26:50 +01:00
Thomas Gleixner	6552ebae25	Merge branch 'core/debugobjects' into core/urgent	2009-01-22 10:03:02 +01:00
Ingo Molnar	77835492ed	Merge commit 'v2.6.29-rc2' into perfcounters/core Conflicts: include/linux/syscalls.h	2009-01-21 16:37:27 +01:00
Steven Rostedt	1092307d58	trace: set max latency variable to zero on default Impact: trace max latencies on start of latency tracing This patch sets the max latency to zero whenever one of the irq variant tracers or the wakeup tracer is set to current tracer. Most developers expect to see output when starting up a latency tracer. But since the max_latency is already set to max, and it takes a latency greater than max_latency to be recorded, there is no trace. This is not the expected behavior and has even confused me. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-21 15:21:30 +01:00
Steven Rostedt	a442e5e0a2	trace: stop all recording to ring buffer on ftrace_dump Impact: limit ftrace dump output Currently ftrace_dump only calls ftrace_kill that is a fast way to prevent the function tracer functions from being called (just sets a flag and clears the function to call, nothing else). It is better to also turn off any recording to the ring buffers as well. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-21 15:21:30 +01:00
Steven Rostedt	faf6861ebd	trace: print ftrace_dump at KERN_EMERG log level Impact: fix to print out ftrace_dump when expected I was debugging a hard race condition to only find out that after I hit the race, my log level was not at level to show KERN_INFO. The time it took to trigger the race was wasted because I did not capture the trace. Since ftrace_dump is only called from kernel oops (and only when it is set in the kernel command line to do so), or when a developer adds it to their own local tree, the log level of the print should be at KERN_EMERG to make sure the print appears. ftrace_dump is not called by a normal user setup, and will not add extra unwanted print out to the console. There is no reason it should be at KERN_INFO. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-21 15:21:30 +01:00
Lai Jiangshan	551b4048b3	ring_buffer: reset write when reserve buffer fail Impact: reset struct buffer_page.write when interrupt storm if struct buffer_page.write is not reset, any subsequent commit will corrupt the ring_buffer: static inline void rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer) { ...... cpu_buffer->commit_page->commit = cpu_buffer->commit_page->write; ...... } when "if (RB_WARN_ON(cpu_buffer, next_page == reader_page))", ring_buffer is disabled, but some reserved buffers may not have been committed. we need to reset struct buffer_page.write. when "if (unlikely(next_page == cpu_buffer->commit_page))", ring_buffer is still available, we should not corrupt it. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-21 15:21:30 +01:00
Frederic Weisbecker	00f57f545a	tracing/function-graph-tracer: fix a regression while suspend to disk Impact: fix a crash while kernel image restore When the function graph tracer is running and while suspend to disk, some racy and dangerous things happen against this tracer. The current task will save its registers including the stack pointer which contains the return address hooked by the tracer. But the current task will continue to enter other functions after that to save the memory, and then it will store other return addresses, and finally lose the old depth which matches the return address saved in the old stack (during the registers saving). So on image restore, the code will return to wrong addresses. And there are other things: on restore, the task will have its "current" pointer overwritten during registers restoring....switching from one task to another... That would be insane to try to trace function graphs at these stages. This patch makes the function graph tracer listen to power events, making its tracing disabled for the current task (the one that performs the hibernation work) while suspend/resume to disk, making the tracing safe during hibernation. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-21 15:21:30 +01:00
Paul Mundt	0609697eab	dma-coherent: Restore dma_alloc_from_coherent() large alloc fall back policy. When doing large allocations (larger than the per-device coherent area) the generic memory allocators are silently fallen back on regardless of consideration for the per-device constraints. In the DMA_MEMORY_EXCLUSIVE case falling back on generic memory is not an option, as it tends not to be addressable by the DMA hardware in question. This issue showed up with the 8139too breakage on the Dreamcast, where non-addressable buffers were silently allocated due to the size mismatch calculation -- while it should have simply errored out upon being unable to satisfy the allocation with the given device constraints. This restores fall back behaviour to what it was before the oversized request change caused multiple regressions. Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2009-01-21 18:51:53 +09:00
Adrian McMenamin	cdf57cab27	dma-coherent: per-device coherent area is in pages, not bytes. Commit `58c6d3dfe4` ("dma-coherent: catch oversized requests to dma_alloc_from_coherent()") attempted to add a sanity check to bail out on allocations larger than the coherent area. Unfortunately when this was implemented, the fact the coherent area is tracked in pages rather than bytes was overlooked, which subsequently broke every single dma_alloc_from_coherent() user, forcing the allocation silently through generic memory instead. Signed-off-by: Adrian McMenamin <adrian@mcmen.demon.co.uk> Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2009-01-21 18:47:38 +09:00
Ingo Molnar	198030782c	Merge branch 'x86/mm' into core/percpu Conflicts: arch/x86/mm/fault.c	2009-01-21 10:39:51 +01:00
Ingo Molnar	3eb3963fd1	Merge branch 'cpus4096' into core/percpu Conflicts: arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c arch/x86/kernel/tlb_32.c Merge it here because both the cpumask changes and the ongoing percpu work is touching the TLB code. The percpu changes take precedence, as they eliminate tlb_32.c altogether. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-21 10:14:17 +01:00
Steven Rostedt	082605de5f	ring-buffer: fix alignment problem Impact: fix to allow some archs to use the ring buffer Commits in the ring buffer are checked by pointer arithmetic. If the calculation is incorrect, then the commits will never take place and the buffer will simply fill up and report an error. Each page in the ring buffer has a small header: struct buffer_data_page { u64 time_stamp; local_t commit; unsigned char data[]; }; Unfortunately, some of the calculations used sizeof(struct buffer_data_page) to know the size of the header. But this is incorrect on some archs, where sizeof(struct buffer_data_page) does not equal offsetof(struct buffer_data_page, data), and on those archs, the commits are never processed. This patch replaces the sizeof with offsetof. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-20 13:09:06 +01:00
Lai Jiangshan	3690b5e6fd	trace_workqueue: use percpu data for workqueue stat Impact: use percpu data instead of a global structure Use: static DEFINE_PER_CPU(struct workqueue_global_stats, all_workqueue_stat); instead of allocating a global structure. percpu data also works well on NUMA. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-20 13:06:59 +01:00
Markus Metzger	11edda0628	x86, ftrace, hw-branch-tracer: change trace format Change the hw-branch-tracer format to be more readable. Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-20 13:04:16 +01:00
Markus Metzger	e23b8ad834	x86, ftrace, hw-branch-tracer: reset trace buffer on close Reset the ftrace buffer on close. Since we use cyclic buffers, the trace is not contiguous, anyway. Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-20 13:03:55 +01:00
Markus Metzger	b1818748b0	x86, ftrace, hw-branch-tracer: dump trace on oops Dump the branch trace on an oops (based on ftrace_dump_on_oops). Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-20 13:03:48 +01:00
Markus Metzger	5c5317de14	x86, ftrace, hw-branch-tracer: support hotplug cpus Support hotplug cpus. Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-20 13:03:38 +01:00
Rusty Russell	8ccad40df8	work_on_cpu: Use our own workqueue. Impact: remove potential clashes with generic kevent workqueue Annoyingly, some places we want to use work_on_cpu are already in workqueues. As per Ingo's suggestion, we create a different workqueue for work_on_cpu. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-19 22:36:07 +01:00
Rusty Russell	31ad908120	work_on_cpu: don't try to get_online_cpus() in work_on_cpu. Impact: remove potential circular lock dependency with cpu hotplug lock This has caused more problems than it solved, with a pile of cpu hotplug locking issues. Followup patches will get_online_cpus() in callers that need it, but if they don't do it they're no worse than before when they were using set_cpus_allowed without locking. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-19 22:36:02 +01:00
Miao Xie	f90d4118ba	cpuset: fix possible deadlock in async_rebuild_sched_domains Lockdep reported some possible circular locking info when we tested cpuset on NUMA/fake NUMA box. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.29-rc1-00224-ga652504 #111 ------------------------------------------------------- bash/2968 is trying to acquire lock: (events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8 but task is already holding lock: (cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29 which lock already depends on the new lock. ...... ------------------------------------------------------- Steps to reproduce: # mkdir /dev/cpuset # mount -t cpuset xxx /dev/cpuset # mkdir /dev/cpuset/0 # echo 0 > /dev/cpuset/0/cpus # echo 0 > /dev/cpuset/0/mems # echo 1 > /dev/cpuset/0/memory_migrate # cat /dev/zero > /dev/null & # echo $! > /dev/cpuset/0/tasks This is because async_rebuild_sched_domains has the following lock sequence: run_workqueue(async_rebuild_sched_domains) -> do_rebuild_sched_domains -> cgroup_lock But, attaching tasks when memory_migrate is set has following: cgroup_lock_live_group(cgroup_tasks_write) -> do_migrate_pages -> flush_work This patch fixes it by using a separate workqueue thread. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-19 02:44:00 +01:00
Peter Zijlstra	1d4a7f1c4f	hrtimers: fix inconsistent lock state on resume in hres_timers_resume Andrey Borzenkov reported this lockdep assert: > [17854.688347] ================================= > [17854.688347] [ INFO: inconsistent lock state ] > [17854.688347] 2.6.29-rc2-1avb #1 > [17854.688347] --------------------------------- > [17854.688347] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage. > [17854.688347] pm-suspend/18240 [HC0[0]:SC0[0]:HE1:SE1] takes: > [17854.688347] (&cpu_base->lock){++..}, at: [<c0136fcc>] retrigger_next_event+0x5c/0xa0 > [17854.688347] {in-hardirq-W} state was registered at: > [17854.688347] [<c01443cd>] __lock_acquire+0x79d/0x1930 > [17854.688347] [<c01455bc>] lock_acquire+0x5c/0x80 > [17854.688347] [<c03092e5>] _spin_lock+0x35/0x70 > [17854.688347] [<c0136e61>] hrtimer_run_queues+0x31/0x140 > [17854.688347] [<c0128d98>] run_local_timers+0x8/0x20 > [17854.688347] [<c0128dd3>] update_process_times+0x23/0x60 > [17854.688347] [<c013e274>] tick_periodic+0x24/0x80 > [17854.688347] [<c013e2e2>] tick_handle_periodic+0x12/0x70 > [17854.688347] [<c0104e24>] timer_interrupt+0x14/0x20 > [17854.688347] [<c01607b9>] handle_IRQ_event+0x29/0x60 > [17854.688347] [<c0161c59>] handle_level_irq+0x69/0xe0 > [17854.688347] [<ffffffff>] 0xffffffff > [17854.688347] irq event stamp: 55771 > [17854.688347] hardirqs last enabled at (55771): [<c0309125>] _spin_unlock_irqrestore+0x35/0x60 > [17854.688347] hardirqs last disabled at (55770): [<c0309419>] _spin_lock_irqsave+0x19/0x80 > [17854.688347] softirqs last enabled at (54836): [<c0124f54>] __do_softirq+0xc4/0x110 > [17854.688347] softirqs last disabled at (54831): [<c01049ae>] do_softirq+0x8e/0xe0 > [17854.688347] > [17854.688347] other info that might help us debug this: > [17854.688347] 3 locks held by pm-suspend/18240: > [17854.688347] #0: (&buffer->mutex){--..}, at: [<c01dd4c5>] sysfs_write_file+0x25/0x100 > [17854.688347] #1: (pm_mutex){--..}, at: [<c015056f>] enter_state+0x4f/0x140 > [17854.688347] 
#2: (dpm_list_mtx){--..}, at: [<c027880f>] device_pm_lock+0xf/0x20 > [17854.688347] > [17854.688347] stack backtrace: > [17854.688347] Pid: 18240, comm: pm-suspend Not tainted 2.6.29-rc2-1avb #1 > [17854.688347] Call Trace: > [17854.688347] [<c0306248>] ? printk+0x18/0x20 > [17854.688347] [<c0141fac>] print_usage_bug+0x16c/0x1d0 > [17854.688347] [<c0142bcf>] mark_lock+0x8bf/0xc90 > [17854.688347] [<c0106b8f>] ? pit_next_event+0x2f/0x40 > [17854.688347] [<c01441b0>] __lock_acquire+0x580/0x1930 > [17854.688347] [<c030916d>] ? _spin_unlock+0x1d/0x20 > [17854.688347] [<c0106b8f>] ? pit_next_event+0x2f/0x40 > [17854.688347] [<c013dd38>] ? clockevents_program_event+0x98/0x160 > [17854.688347] [<c0142fe8>] ? mark_held_locks+0x48/0x90 > [17854.688347] [<c0309125>] ? _spin_unlock_irqrestore+0x35/0x60 > [17854.688347] [<c0143229>] ? trace_hardirqs_on_caller+0x139/0x190 > [17854.688347] [<c014328b>] ? trace_hardirqs_on+0xb/0x10 > [17854.688347] [<c01455bc>] lock_acquire+0x5c/0x80 > [17854.688347] [<c0136fcc>] ? retrigger_next_event+0x5c/0xa0 > [17854.688347] [<c03092e5>] _spin_lock+0x35/0x70 > [17854.688347] [<c0136fcc>] ? retrigger_next_event+0x5c/0xa0 > [17854.688347] [<c0136fcc>] retrigger_next_event+0x5c/0xa0 > [17854.688347] [<c013711a>] hres_timers_resume+0xa/0x10 > [17854.688347] [<c013aa8e>] timekeeping_resume+0xee/0x150 > [17854.688347] [<c0273384>] __sysdev_resume+0x14/0x50 > [17854.688347] [<c0273407>] sysdev_resume+0x47/0x80 > [17854.688347] [<c02791ab>] device_power_up+0xb/0x20 > [17854.688347] [<c015043f>] suspend_devices_and_enter+0xcf/0x150 > [17854.688347] [<c0150c2f>] ? freeze_processes+0x3f/0x90 > [17854.688347] [<c0150614>] enter_state+0xf4/0x140 > [17854.688347] [<c01506dd>] state_store+0x7d/0xc0 > [17854.688347] [<c0150660>] ? state_store+0x0/0xc0 > [17854.688347] [<c0202da4>] kobj_attr_store+0x24/0x30 > [17854.688347] [<c01dd53c>] sysfs_write_file+0x9c/0x100 > [17854.688347] [<c019916c>] vfs_write+0x9c/0x160 > [17854.688347] [<c0103494>] ? 
restore_nocheck_notrace+0x0/0xe > [17854.688347] [<c01dd4a0>] ? sysfs_write_file+0x0/0x100 > [17854.688347] [<c01992ed>] sys_write+0x3d/0x70 > [17854.688347] [<c0103371>] sysenter_do_call+0x12/0x31 Andrey's analysis: > timekeeping_resume() is called via class ->resume > method; and according to comments in sysdev_resume() and > device_power_up(), they are called with interrupts disabled. > > Looking at suspend_enter, irqs are disabled at this point. > > So it actually looks like something (may be some driver) > unconditionally enabled irqs in resume path. Add a debug check to test this theory. If it triggers then it triggers because the resume code calls it with irqs enabled, which is a no-no not just for timekeeping_resume(), but also bad for a number of other resume handlers. Reported-by: Andrey Borzenkov <arvidjaar@mail.ru> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-18 21:31:37 +01:00
Jiri Slaby	b786c6a98e	relay: fix lock imbalance in relay_late_setup_files One fail path in relay_late_setup_files() omits mutex_unlock(&relay_channels_mutex); Add it. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-18 20:29:35 +01:00
Ingo Molnar	4092762aeb	Merge branch 'tracing/ftrace'; commit 'v2.6.29-rc2' into tracing/core	2009-01-18 20:15:05 +01:00
Mandeep Singh Baines	603a148f43	softlockup: fix potential race in hung_task when resetting timeout Impact: fix potential false panic A potential race exists if sysctl_hung_task_timeout_secs is reset to 0 while inside check_hung_uniterruptible_tasks(). If check_task() is entered, a comparison with 0 will result in a false hung_task being detected. If sysctl_hung_task_panic is set, the system will panic. Signed-off-by: Mandeep Singh Baines <msb@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-18 19:20:17 +01:00
Ingo Molnar	b2b062b816	Merge branch 'core/percpu' into stackprotector Conflicts: arch/x86/include/asm/pda.h arch/x86/include/asm/system.h Also, moved include/asm-x86/stackprotector.h to arch/x86/include/asm. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-18 18:37:14 +01:00
Ingo Molnar	af37501c79	Merge branch 'core/percpu' into perfcounters/core Conflicts: arch/x86/include/asm/pda.h We merge tip/core/percpu into tip/perfcounters/core because of a semantic and contextual conflict: the former eliminates the PDA, while the latter extends it with apic_perf_irqs field. Resolve the conflict by moving the new field to the irq_cpustat structure on 64-bit too. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-18 18:15:49 +01:00
Paul Mackerras	d859e29fe3	perf_counter: Add counter enable/disable ioctls Impact: New perf_counter features This primarily adds a way for perf_counter users to enable and disable counters and groups. Enabling or disabling a counter or group also enables or disables all of the child counters that have been cloned from it to monitor children of the task monitored by the top-level counter. The userspace interface to enable/disable counters is via ioctl on the counter file descriptor. Along the way this extends the code that handles child counters to handle child counter groups properly. A group with multiple counters will be cloned to child tasks if and only if the group leader has the hw_event.inherit bit set - if it is set the whole group is cloned as a group in the child task. In order to be able to enable or disable all child counters of a given top-level counter, we need a way to find them all. Hence I have added a child_list field to struct perf_counter, which is the head of the list of children for a top-level counter, or the link in that list for a child counter. That list is protected by the perf_counter.mutex field. This also adds a mutex to the perf_counter_context struct. Previously the list of counters was protected just by the lock field in the context, which meant that perf_counter_init_task had to take that lock and then take whatever lock/mutex protects the top-level counter's child_list. But the counter enable/disable functions need to take that lock in order to traverse the list, then for each counter take the lock in that counter's context in order to change the counter's state safely, which will lead to a deadlock. To solve this, we now have both a mutex and a spinlock in the context, and taking either is sufficient to ensure the list of counters can't change - you have to take both before changing the list. 
Now perf_counter_init_task takes the mutex instead of the lock (which incidentally means that inherit_counter can use GFP_KERNEL instead of GFP_ATOMIC) and thus avoids the possible deadlock. Similarly the new enable/disable functions can take the mutex while traversing the list of child counters without incurring a possible deadlock when the counter manipulation code locks the context for a child counter. We also had a misfeature that the first counter added to a context would possibly not go on until the next sched-in, because we were using ctx->nr_active to detect if the context was running on a CPU. But nr_active is the number of active counters, and if that was zero (because the context didn't have any counters yet) it would look like the context wasn't running on a CPU and so the retry code in __perf_install_in_context wouldn't retry. So this adds an 'is_active' field that is set when the context is on a CPU, even if it has no counters. The is_active field is only used for task contexts, not for per-cpu contexts. If we enable a subsidiary counter in a group that is active on a CPU, and the arch code can't enable the counter, then we have to pull the whole group off the CPU. We do this with group_sched_out, which gets moved up in the file so it comes before all its callers. This also adds similar logic to __perf_install_in_context so that the "all on, or none" invariant of groups is preserved when adding a new counter to a group. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-17 18:10:22 +11:00
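The locking rule in the perf_counter commit above (holding either the mutex or the spinlock is enough to keep the counter list stable, and both must be held to change it) can be illustrated with a small user-space sketch. This is an analogy under stated assumptions, not the kernel code: every `demo_*` name is hypothetical, a POSIX mutex stands in for the kernel mutex, and a toy C11 `atomic_flag` spinlock stands in for the kernel spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
struct demo_counter {
    struct demo_counter *next;
    int id;
};

struct demo_context {
    pthread_mutex_t mutex;      /* may be held across blocking work */
    atomic_flag lock;           /* toy spinlock for short critical sections */
    struct demo_counter *head;  /* list guarded by the two locks together */
};

static void demo_spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                       /* busy-wait until the flag is released */
}

static void demo_spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

int demo_context_init(struct demo_context *ctx)
{
    ctx->head = NULL;
    atomic_flag_clear(&ctx->lock);
    return pthread_mutex_init(&ctx->mutex, NULL);
}

/* Writer: takes BOTH locks, mutex first, so that a reader holding
 * either one of them never sees the list change underneath it. */
void demo_context_add(struct demo_context *ctx, struct demo_counter *c)
{
    assert(c != NULL);
    pthread_mutex_lock(&ctx->mutex);
    demo_spin_lock(&ctx->lock);
    c->next = ctx->head;
    ctx->head = c;
    demo_spin_unlock(&ctx->lock);
    pthread_mutex_unlock(&ctx->mutex);
}

/* Reader that is allowed to block: the mutex alone is sufficient. */
size_t demo_count_under_mutex(struct demo_context *ctx)
{
    size_t n = 0;
    pthread_mutex_lock(&ctx->mutex);
    for (struct demo_counter *c = ctx->head; c != NULL; c = c->next)
        n++;
    pthread_mutex_unlock(&ctx->mutex);
    return n;
}

/* Reader that must not block: the spinlock alone is sufficient. */
size_t demo_count_under_spinlock(struct demo_context *ctx)
{
    size_t n = 0;
    demo_spin_lock(&ctx->lock);
    for (struct demo_counter *c = ctx->head; c != NULL; c = c->next)
        n++;
    demo_spin_unlock(&ctx->lock);
    return n;
}
```

In this sketch the writer always acquires the mutex before the spinlock. The fixed order matters because a sleeping lock cannot safely be acquired while a spinlock is held. A caller would initialise a context with `demo_context_init()`, add counters with `demo_context_add()`, and then traverse with whichever reader suits its context.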
Rusty Russell	e1d9ec6246	work_on_cpu: Use our own workqueue. Impact: remove potential clashes with generic kevent workqueue Annoyingly, some places we want to use work_on_cpu are already in workqueues. As per Ingo's suggestion, we create a different workqueue for work_on_cpu. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com>	2009-01-16 15:31:15 -08:00
Rusty Russell	68564a4697	work_on_cpu: don't try to get_online_cpus() in work_on_cpu. Impact: remove potential circular lock dependency with cpu hotplug lock This has caused more problems than it solved, with a pile of cpu hotplug locking issues. Followup patches will get_online_cpus() in callers that need it, but if they don't do it they're no worse than before when they were using set_cpus_allowed without locking. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com>	2009-01-16 15:31:15 -08:00
Rafael J. Wysocki	091d71e023	PM: Fix compilation warning in kernel/power/main.c Reorder the code in kernel/power/main.c to fix compilation warning triggered by unsetting CONFIG_SUSPEND. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Len Brown <len.brown@intel.com>	2009-01-16 18:13:41 -05:00
Len Brown	88d998c264	Merge branch 'misc' into release	2009-01-16 14:45:34 -05:00
Masami Hiramatsu	5a4ccaf37f	kprobes: check CONFIG_FREEZER instead of CONFIG_PM Check CONFIG_FREEZER instead of CONFIG_PM because kprobe booster depends on freeze_processes() and thaw_processes() when CONFIG_PREEMPT=y. This fixes a linkage error which occurs when CONFIG_PREEMPT=y, CONFIG_PM=y and CONFIG_FREEZER=n. Reported-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Len Brown <len.brown@intel.com>	2009-01-16 14:32:17 -05:00
Rafael J. Wysocki	33f1d7ecc6	PM: Fix freezer compilation if PM_SLEEP is unset Freezer fails to compile with the following configuration settings: CONFIG_CGROUPS=y CONFIG_CGROUP_FREEZER=y CONFIG_MODULES=y CONFIG_FREEZER=y CONFIG_PM=y CONFIG_PM_SLEEP=n Fix this by making process.o compilation depend on CONFIG_FREEZER. Reported-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Len Brown <len.brown@intel.com>	2009-01-16 14:32:17 -05:00


8411 commits