uids in namespaces other than init don't get a sysfs entry.
For those in the init namespace, while we're waiting to remove
the sysfs entry for the uid the uid is still hashed, and
alloc_uid() may re-grab that uid without getting a new
reference to the user_ns, which we've already put in free_user
before scheduling remove_user_sysfs_dir().
Reported-and-tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the c/p state "power" tracer to use tracepoints. Avoids a
function call when the tracer is disabled.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
While reviewing the manpages, I noticed I'd missed some clock vs timer sites.
Make sure that all timer functions call cpu_timer_sample_group() and not
cpu_clock_sample_group(). This ensures that we enable the process wide timer
in time, and therefore pay the O(n) thread group cost from the syscall.
Not doing it here, will result in the first jiffy tick after setting the timer
doing this, resulting in a very expensive tick (but only once) and a delay in
actually starting the timer.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jaswinder Singh Rajput reported that commit 23a185ca8a caused the
context switch and migration software counters to report zero always.
With that commit, the software counters only count events that occur
between sched-in and sched-out for a task. This is necessary for the
counter enable/disable prctls and ioctls to work. However, the
context switch and migration counts are incremented after sched-out
for one task and before sched-in for the next. Since the increment
doesn't occur while a task is scheduled in (as far as the software
counters are concerned) it doesn't count towards any counter.
Thus the context switch and migration counters need to count events
that occur at any time, provided the counter is enabled, not just
those that occur while the task is scheduled in (from the perf_counter
subsystem's point of view). The problem though is that the software
counter code can't tell the difference between being enabled and being
scheduled in, and between being disabled and being scheduled out,
since we use the one pair of enable/disable entry points for both.
That is, the high-level disable operation simply arranges for the
counter to not be scheduled in any more, and the high-level enable
operation arranges for it to be scheduled in again.
One way to solve this would be to have sched_in/out operations in the
hw_perf_counter_ops struct as well as enable/disable. However, this
takes a simpler approach: it adds a 'prev_state' field to the
perf_counter struct that allows a counter's enable method to know
whether the counter was previously disabled or just inactive
(scheduled out), and therefore whether the enable method is being
called as a result of a high-level enable or a schedule-in operation.
This then allows the context switch, migration and page fault counters
to reset their hw.prev_count value in their enable functions only if
they are called as a result of a high-level enable operation.
Although page faults would normally only occur while the counter is
scheduled in, this changes the page fault counter code too in case
there are ever circumstances where page faults get counted against a
task while its counters are not scheduled in.
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
request_irq() calls into proc code via __setup_irq() which is not safe
in an atomic context, so request_irq() can itself use the more
reliable GFP_KERNEL allocation for the action descriptor.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
While reviewing the ring buffer code, I thougth I saw a bug with
if (!__raw_spin_trylock(&cpu_buffer->lock))
goto out_unlock;
But I forgot that we use a variable "lock_taken" that is set if
the spinlock is taken, and only unlock it if that variable is set.
To avoid further confusion from other reviewers, this patch
renames the label out_unlock with out_reset, which is the more
appropriate name.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
CONFIG_SOFTLOCKUP=y || CONFIG_DETECT_HUNG_TASKS=y is now the only user
of the 'one' constant in kernel/sysctl.c. Move it to the softlockup
block of constants.
This fixes a GCC warning.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I enabled all cgroup subsystems when compiling kernel, and then:
# mount -t cgroup -o net_cls xxx /mnt
# mkdir /mnt/0
This showed up immediately:
BUG: MAX_LOCKDEP_SUBCLASSES too low!
turning off the locking correctness validator.
It's caused by the cgroup hierarchy lock:
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
if (ss->root == root)
mutex_lock_nested(&ss->hierarchy_mutex, i);
}
Now we have 9 cgroup subsystems, and the above 'i' for net_cls is 8, but
MAX_LOCKDEP_SUBCLASSES is 8.
This patch uses different lockdep keys for different subsystems.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to pass an unsigned long as the minimum, because it gets casted
to an unsigned long in the sysctl handler. If we pass an int, we'll
access four more bytes on 64bit arches, resulting in a random minimum
value.
[rientjes@google.com: fix type of `old_bytes']
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Catalin noticed that (38d47c1b70: futex: rely on get_user_pages() for
shared futexes) caused an mm_struct leak.
Some tracing with the function graph tracer quickly pointed out that
futex_wait() has exit paths with unbalanced reference counts.
This regression was discovered by kmemleak.
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: fix TIMER_ABSTIME for process wide cpu timers
timers: split process wide cpu clocks/timers, fix
x86: clean up hpet timer reinit
timers: split process wide cpu clocks/timers, remove spurious warning
timers: split process wide cpu clocks/timers
signal: re-add dead task accumulation stats.
x86: fix hpet timer reinit for x86_64
sched: fix nohz load balancer on cpu offline
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ptrace, x86: fix the usage of ptrace_fork()
i8327: fix outb() parameter order
x86: fix math_emu register frame access
x86: math_emu info cleanup
x86: include correct %gs in a.out core dump
x86, vmi: put a missing paravirt_release_pmd in pgd_dtor
x86: find nr_irqs_gsi with mp_ioapic_routing
x86: add clflush before monitor for Intel 7400 series
x86: disable intel_iommu support by default
x86: don't apply __supported_pte_mask to non-present ptes
x86: fix grammar in user-visible BIOS warning
x86/Kconfig.cpu: make Kconfig help readable in the console
x86, 64-bit: print DMI info in the oops trace
Intel reported a 10% regression (mysql+sysbench) on a 16-way machine
with these patches:
1596e29: sched: symmetric sync vs avg_overlap
d942fb6: sched: fix sync wakeups
Revert them.
Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Bisected-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Only free child_counter if it has a parent; if it doesn't, then it
has a file pointing to it and we'll free it in perf_release.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The POSIX timer interface allows for absolute time expiry values through the
TIMER_ABSTIME flag, therefore we have to synchronize the timer to the clock
every time we start it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To decrease the chance of a missed enable, always enable the timer when we
sample it, we'll always disable it when we find that there are no active timers
in the jiffy tick.
This fixes a flood of warnings reported by Mike Galbraith.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the missing pair tracing_{start,stop}_record_cmdline() to record well
the cmdline associated with pid.
Changes in v2:
- fix a build error, the sched_switch tracer is needed to record the
cmdline.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we check if a task has been switched out since the last scan, we might
have a race condition on the following scenario:
- the task is freshly created and scheduled
- it puts its state to TASK_UNINTERRUPTIBLE and is not yet switched out
- check_hung_task() scans this task and will report a false positive because
t->nvcsw + t->nivcsw == t->last_switch_count == 0
Add a check for such cases.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I noticed by pure accident we have ptrace_fork() and friends. This was
added by "x86, bts: add fork and exit handling", commit
bf53de907d.
I can't test this, ds_request_bts() returns -EOPNOTSUPP, but I strongly
believe this needs the fix. I think something like this program
int main(void)
{
int pid = fork();
if (!pid) {
ptrace(PTRACE_TRACEME, 0, NULL, NULL);
kill(getpid(), SIGSTOP);
fork();
} else {
struct ptrace_bts_config bts = {
.flags = PTRACE_BTS_O_ALLOC,
.size = 4 * 4096,
};
wait(NULL);
ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACEFORK);
ptrace(PTRACE_BTS_CONFIG, pid, &bts, sizeof(bts));
ptrace(PTRACE_CONT, pid, NULL, NULL);
sleep(1);
}
return 0;
}
should crash the kernel.
If the task is traced by its natural parent ptrace_reparented() returns 0
but we should clear ->btsxxx anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andrew had some suggestions for the latencytop file; this patch takes care
of most of these:
* Add documentation
* Turn account_scheduler_latency into an inline function
* Don't report negative values to userspace
* Make the file operations struct const
* Fix a few checkpatch.pl warnings
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix these sparse warnings:
kernel/trace/ring_buffer.c:70:37: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:84:39: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:96:43: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2500:40: warning: incorrect type in argument 3 (different signedness)
kernel/trace/ring_buffer.c:2505:44: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2507:46: warning: incorrect type in argument 2 (different signedness)
kernel/trace/trace.c:2130:40: warning: incorrect type in argument 3 (different signedness)
kernel/trace/trace.c:2280:40: warning: incorrect type in argument 3 (different signedness)
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make global variables and a global function static
The function '__trace_userstack' does not seem to have a caller, so it
is commented out.
Fix this sparse warnings:
kernel/trace/trace.c:82:5: warning: symbol 'tracing_disabled' was not declared. Should it be static?
kernel/trace/trace.c:600:10: warning: symbol 'trace_record_cmdline_disabled' was not declared. Should it be static?
kernel/trace/trace.c:957:6: warning: symbol '__trace_userstack' was not declared. Should it be static?
kernel/trace/trace.c:1694:5: warning: symbol 'tracing_release' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new perf_counter feature
This extends the perf_counter_hw_event struct with bits that specify
that events in user, kernel and/or hypervisor mode should not be
counted (i.e. should be excluded), and adds code to program the PMU
mode selection bits accordingly on x86 and powerpc.
For software counters, we don't currently have the infrastructure to
distinguish which mode an event occurs in, so we currently fail the
counter initialization if the setting of the hw_event.exclude_* bits
would require us to distinguish. Context switches and CPU migrations
are currently considered to occur in kernel mode.
On x86, this changes the previous policy that only root can count
kernel events. Now non-root users can count kernel events or exclude
them. Non-root users still can't use NMI events, though. On x86 we
don't appear to have any way to control whether hypervisor events are
counted or not, so hw_event.exclude_hv is ignored.
On powerpc, the selection of whether to count events in user, kernel
and/or hypervisor mode is PMU-wide, not per-counter, so this adds a
check that the hw_event.exclude_* settings are the same as other events
on the PMU. Counters being added to a group have to have the same
settings as the other hardware counters in the group. Counters and
groups can only be enabled in hw_perf_group_sched_in or power_perf_enable
if they have the same settings as any other counters already on the
PMU. If we are not running on a hypervisor, the exclude_hv setting
is ignored (by forcing it to 0) since we can't ever get any
hypervisor events.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch is to make the function return early on failure, and give
correct return value on success.
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The C99 specification states in section 6.11.5:
The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: change API and init bpage when copy
ring_buffer_read_page()/rb_remove_entries() may be called for
a partially consumed page.
Add a parameter for rb_remove_entries() and make it update
cpu_buffer->entries correctly for partially consumed pages.
ring_buffer_read_page() now returns the offset to the next event.
Init the bpage's time_stamp when return value is 0.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: Fix bug
I found several very very curious line.
It's so curious that it may be brought by typing mistake.
When (cpu_buffer->reader_page == cpu_buffer->commit_page):
1) We haven't copied it for bpage is changed:
bpage = cpu_buffer->reader_page->page;
memcpy(bpage->data, cpu_buffer->reader_page->page->data + read ... )
2) We need update cpu_buffer->reader_page->read, but
"cpu_buffer->reader_page += read;" is not right.
[
This bug was a typo. The commit->reader_page is a page pointer
and not an index into the page. The line should have been
commit->reader_page->read += read. The other changes
by Lai are nice clean ups to the code. - SDR
]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix broken /proc/profile on UP machines
Commit c309b917ca "cpumask: convert
kernel/profile.c" broke profiling. prof_cpu_mask was previously
initialized to CPU_MASK_ALL, but left uninitialized in that commit.
We need to copy cpu_possible_mask (cpu_online_mask is not enough).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: no default -fno-stack-protector if stackp is enabled, cleanup
Stackprotector make rules had the following problems.
* cc support test and warning are scattered across makefile and
kernel/panic.c.
* -fno-stack-protector was always added regardless of configuration.
Update such that cc support test and warning are contained in makefile
and -fno-stack-protector is added iff stackp is turned off. While at
it, prepare for 32bit support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>