In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
the vma type - we know we're working with a stack. So, we can call
directly into __mlock_vma_pages_range(), and remove the last
make_pages_present() call site.
Note that we don't use mm_populate() here, so we can't release the
mmap_sem while allocating new stack pages. This is deemed acceptable,
because the stack vmas grow by a bounded number of pages at a time, and
these are anon pages so we don't have to read from disk to populate
them.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the MAP_POPULATE handling has been moved to mmap_region() call
sites, the only remaining use of the flags argument is to pass the
MAP_NORESERVE flag. This can be just as easily handled by
do_mmap_pgoff(), so do that and remove the mmap_region() flags
parameter.
[akpm@linux-foundation.org: remove double parens]
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.
This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was using during mlock() /
mlockall().
Signed-off-by: Michel Lespinasse <walken@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
amount of pages is scanned from the LRU lists on each iteration, to make
progress.
Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages
that are not there.
Empty LRU lists are quite common with memory cgroups in NUMA
environments because there exists a set of LRU lists for each zone for
each memory cgroup, while the memory of a single cgroup is expected to
stay on just one node. The number of expected empty LRU lists is thus
memcgs * (nodes - 1) * lru types
Each attempt to reclaim from an empty LRU list does expensive size
comparisons between lists, acquires the zone's lru lock etc. Avoid
that.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Benjamin Herrenschmidt:
"So from the depth of frozen Minnesota, here's the powerpc pull request
for 3.9. It has a few interesting highlights, in addition to the
usual bunch of bug fixes, minor updates, embedded device tree updates
and new boards:
- Hand tuned asm implementation of SHA1 (by Paulus & Michael
Ellerman)
- Support for Doorbell interrupts on Power8 (kind of fast
thread-thread IPIs) by Ian Munsie
- Long overdue cleanup of the way we handle relocation of our open
firmware trampoline (prom_init.c) on 64-bit by Anton Blanchard
- Support for saving/restoring & context switching the PPR (Processor
Priority Register) on server processors that support it. This
allows the kernel to preserve thread priorities established by
userspace. By Haren Myneni.
- DAWR (new watchpoint facility) support on Power8 by Michael Neuling
- Ability to change the DSCR (Data Stream Control Register) which
controls cache prefetching on a running process via ptrace by
Alexey Kardashevskiy
- Support for context switching the TAR register on Power8 (new
branch target register meant to be used by some new specific
userspace perf event interrupt facility which is yet to be enabled)
by Ian Munsie.
- Improve preservation of the CFAR register (which captures the
origin of a branch) on various exception conditions by Paulus.
- Move the Bestcomm DMA driver from arch powerpc to drivers/dma where
it belongs by Philippe De Muyter
- Support for Transactional Memory on Power8 by Michael Neuling
(based on original work by Matt Evans). For those curious about
the feature, the patch contains a pretty good description."
(See commit db8ff90702: "powerpc: Documentation for transactional
memory on powerpc" for the mentioned description added to the file
Documentation/powerpc/transactional_memory.txt)
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (140 commits)
powerpc/kexec: Disable hard IRQ before kexec
powerpc/85xx: l2sram - Add compatible string for BSC9131 platform
powerpc/85xx: bsc9131 - Correct typo in SDHC device node
powerpc/e500/qemu-e500: enable coreint
powerpc/mpic: allow coreint to be determined by MPIC version
powerpc/fsl_pci: Store the pci ctlr device ptr in the pci ctlr struct
powerpc/85xx: Board support for ppa8548
powerpc/fsl: remove extraneous DIU platform functions
arch/powerpc/platforms/85xx/p1022_ds.c: adjust duplicate test
powerpc: Documentation for transactional memory on powerpc
powerpc: Add transactional memory to pseries and ppc64 defconfigs
powerpc: Add config option for transactional memory
powerpc: Add transactional memory to POWER8 cpu features
powerpc: Add new transactional memory state to the signal context
powerpc: Hook in new transactional memory code
powerpc: Routines for FP/VSX/VMX unavailable during a transaction
powerpc: Add transactional memory unavaliable execption handler
powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes
powerpc: Add FP/VSX and VMX register load functions for transactional memory
powerpc: Add helper functions for transactional memory context switching
...
Trivial, but WRITE SAME is an SBC command so it seems strange for a
related function (defined in target_core_sbc.c) to be in the spc_
namespace.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
The last caller was removed >2 years ago in commit 7b2a69ba7.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull core locking changes from Ingo Molnar:
"The biggest change is the rwsem lock-steal improvements, both to the
assembly optimized and the spinlock based variants.
The other notable change is the clean up of the seqlock implementation
to be based on the seqcount infrastructure.
The rest is assorted smaller debuggability, cleanup and continued -rt
locking changes."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rwsem-spinlock: Implement writer lock-stealing for better scalability
futex: Revert "futex: Mark get_robust_list as deprecated"
generic: Use raw local irq variant for generic cmpxchg
lockdep: Selftest: convert spinlock to raw spinlock
seqlock: Use seqcount infrastructure
seqlock: Remove unused functions
ntp: Make ntp_lock raw
intel_idle: Convert i7300_idle_lock to raw_spinlock
locking: Various static lock initializer fixes
lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
rwsem: Implement writer lock-stealing for better scalability
lockdep: Silence warning if CONFIG_LOCKDEP isn't set
watchdog: Use local_clock for get_timestamp()
lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
locking/stat: Fix a typo
randconfig build reports the following error which is caused by
CONFIG_PM_OPP being unset.
CC arch/arm/mach-imx/mach-imx6q.o
arch/arm/mach-imx/mach-imx6q.c: In function ‘imx6q_opp_init’:
arch/arm/mach-imx/mach-imx6q.c:248:2: error: implicit declaration of function ‘of_init_opp_table’ [-Werror=implicit-function-declaration]
Fix the error by giving a more correct condition for empty
of_init_opp_table() implementation.
Reported-by: Rob Herring <robherring2@gmail.com>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Acked-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Dave Jones reported a lockdep splat occurring in IP defrag code.
commit 6d7b857d54 (net: use lib/percpu_counter API for
fragmentation mem accounting) added a possible deadlock.
Because percpu_counter_sum_positive() needs to acquire
a lock that can be used from softirq, we need to disable BH
in sum_frag_mem_limit()
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we handle icmp errors in each transport protocol's err_handler,
for icmp protocols, that is ping_err. Since this handler only care
of those icmp errors triggered by echo request, errors triggered
by echo reply(which sent by kernel) are sliently ignored.
So wrap ping_err() with icmp_err() to deal with those icmp errors.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Originally submitted by Clark Williams as part of a cleanup,
but happens also to fix an ia64 build problem:
arch/ia64/kernel/init_task.c:38: error: 'RR_TIMESLICE' undeclared here (not in a function)
Signed-off-by: Clark Williams <clark.williams@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This fixes an ia64 build bug reported by Tony Luck.
Reported-by: Tony Luck <tony.luck@gmail.com>
Signed-off-by: Clark Williams <clark.williams@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1361373550-4011-2-git-send-email-clark.williams@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Several map-related functions look like a serie of ifs, checking
widths of map. Those functions do not have any handling for default
case. Instead of fiddling with uninitialized_var in those functions,
let's just add a (correct) BUG() to the default case on those maps. This
will also allow us to catch potential errors in maps setup in future.
Signed-off-by: Dmitry Eremin-Solenikov <dmitry_eremin@mentor.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Add a generic helper to fill in an HDMI AVI infoframe with data
extracted from a DRM display mode.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
The drm_file and drm_clip_rect structures are used throughout the file
but they are never declared nor pulled in through an include. Add
forward declarations to make them available.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
The same function had already been merged with a different name. Remove
the duplicate one but reuse some of its kerneldoc fragments for the
existing implementation.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Pull x86 mm changes from Peter Anvin:
"This is a huge set of several partly interrelated (and concurrently
developed) changes, which is why the branch history is messier than
one would like.
The *really* big items are two humonguous patchsets mostly developed
by Yinghai Lu at my request, which completely revamps the way we
create initial page tables. In particular, rather than estimating how
much memory we will need for page tables and then build them into that
memory -- a calculation that has shown to be incredibly fragile -- we
now build them (on 64 bits) with the aid of a "pseudo-linear mode" --
a #PF handler which creates temporary page tables on demand.
This has several advantages:
1. It makes it much easier to support things that need access to data
very early (a followon patchset uses this to load microcode way
early in the kernel startup).
2. It allows the kernel and all the kernel data objects to be invoked
from above the 4 GB limit. This allows kdump to work on very large
systems.
3. It greatly reduces the difference between Xen and native (Xen's
equivalent of the #PF handler are the temporary page tables created
by the domain builder), eliminating a bunch of fragile hooks.
The patch series also gets us a bit closer to W^X.
Additional work in this pull is the 64-bit get_user() work which you
were also involved with, and a bunch of cleanups/speedups to
__phys_addr()/__pa()."
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits)
x86, mm: Move reserving low memory later in initialization
x86, doc: Clarify the use of asm("%edx") in uaccess.h
x86, mm: Redesign get_user with a __builtin_choose_expr hack
x86: Be consistent with data size in getuser.S
x86, mm: Use a bitfield to mask nuisance get_user() warnings
x86/kvm: Fix compile warning in kvm_register_steal_time()
x86-32: Add support for 64bit get_user()
x86-32, mm: Remove reference to alloc_remap()
x86-32, mm: Remove reference to resume_map_numa_kva()
x86-32, mm: Rip out x86_32 NUMA remapping code
x86/numa: Use __pa_nodebug() instead
x86: Don't panic if can not alloc buffer for swiotlb
mm: Add alloc_bootmem_low_pages_nopanic()
x86, 64bit, mm: hibernate use generic mapping_init
x86, 64bit, mm: Mark data/bss/brk to nx
x86: Merge early kernel reserve for 32bit and 64bit
x86: Add Crash kernel low reservation
x86, kdump: Remove crashkernel range find limit for 64bit
memblock: Add memblock_mem_size()
x86, boot: Not need to check setup_header version for setup_data
...
Pull s390 update from Martin Schwidefsky:
"The most prominent change in this patch set is the software dirty bit
patch for s390. It removes __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY and
the page_test_and_clear_dirty primitive which makes the common memory
management code a bit less obscure.
Heiko fixed most of the PCI related fallout, more often than not
missing GENERIC_HARDIRQS dependencies. Notable is one of the 3270
patches which adds an export to tty_io to be able to resize a tty.
The rest is the usual bunch of cleanups and bug fixes."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
s390/module: Add missing R_390_NONE relocation type
drivers/gpio: add missing GENERIC_HARDIRQ dependency
drivers/input: add couple of missing GENERIC_HARDIRQS dependencies
s390/cleanup: rename SPP to LPP
s390/mm: implement software dirty bits
s390/mm: Fix crst upgrade of mmap with MAP_FIXED
s390/linker skript: discard exit.data at runtime
drivers/media: add missing GENERIC_HARDIRQS dependency
s390/bpf,jit: add vlan tag support
drivers/net,AT91RM9200: add missing GENERIC_HARDIRQS dependency
iucv: fix kernel panic at reboot
s390/Kconfig: sort list of arch selected config options
phylib: remove !S390 dependeny from Kconfig
uio: remove !S390 dependency from Kconfig
dasd: fix sysfs cleanup in dasd_generic_remove
s390/pci: fix hotplug module init
s390/pci: cleanup clp page allocation
s390/pci: cleanup clp inline assembly
s390/perf: cpum_cf: fallback to software sampling events
s390/mm: provide PAGE_SHARED define
...
Pull HID subsystem updates from Jiri Kosina:
"HID subsystem and drivers update. Highlights:
- new support of a group of Win7/Win8 multitouch devices, from
Benjamin Tissoires
- fix for compat interface brokenness in uhid, from Dmitry Torokhov
- conversion of drivers to use hid_driver helper, by H Hartley
Sweeten
- HID over I2C transport received ACPI enumeration support, written
by Mika Westerberg
- there is an ongoing effort to make HID sensor hubs independent of
USB transport. The first self-contained part of this work is
provided here, done by Mika Westerberg
- a few smaller fixes here and there, support for a couple new
devices added"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (43 commits)
HID: Correct Logitech order in hid-ids.h
HID: LG4FF: Remove unnecessary deadzone code
HID: LG: Prevent the Logitech Gaming Wheels deadzone
HID: LG: Fix detection of Logitech Speed Force Wireless (WiiWheel)
HID: LG: Add support for Logitech Momo Force (Red) Wheel
HID: hidraw: print message when succesfully initialized
HID: logitech: split accel, brake for Driving Force wheel
HID: logitech: add report descriptor for Driving Force wheel
HID: add ThingM blink(1) USB RGB LED support
HID: uhid: make creating devices work on 64/32 systems
HID: wiimote: fix nunchuck button parser
HID: blacklist Velleman data acquisition boards
HID: sensor-hub: don't limit the driver only to USB bus
HID: sensor-hub: get rid of unused sensor_hub_grabbed_usages[] table
HID: extend autodetect to handle I2C sensors as well
HID: ntrig: use input_configured() callback to set the name
HID: multitouch: do not use pointers towards hid-core
HID: add missing GENERIC_HARDIRQ dependency
HID: multitouch: make MT_CLS_ALWAYS_TRUE the new default class
HID: multitouch: fix protocol for Elo panels
...
Pull trivial tree from Jiri Kosina:
"Assorted tiny fixes queued in trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits)
DocBook: update EXPORT_SYMBOL entry to point at export.h
Documentation: update top level 00-INDEX file with new additions
ARM: at91/ide: remove unsused at91-ide Kconfig entry
percpu_counter.h: comment code for better readability
x86, efi: fix comment typo in head_32.S
IB: cxgb3: delay freeing mem untill entirely done with it
net: mvneta: remove unneeded version.h include
time: x86: report_lost_ticks doesn't exist any more
pcmcia: avoid static analysis complaint about use-after-free
fs/jfs: Fix typo in comment : 'how may' -> 'how many'
of: add missing documentation for of_platform_populate()
btrfs: remove unnecessary cur_trans set before goto loop in join_transaction
sound: soc: Fix typo in sound/codecs
treewide: Fix typo in various drivers
btrfs: fix comment typos
Update ibmvscsi module name in Kconfig.
powerpc: fix typo (utilties -> utilities)
of: fix spelling mistake in comment
h8300: Fix home page URL in h8300/README
xtensa: Fix home page URL in Kconfig
...
Merge misc patches from Andrew Morton:
- Florian has vanished so I appear to have become fbdev maintainer
again :(
- Joel and Mark are distracted to welcome to the new OCFS2 maintainer
- The backlight queue
- Small core kernel changes
- lib/ updates
- The rtc queue
- Various random bits
* akpm: (164 commits)
rtc: rtc-davinci: use devm_*() functions
rtc: rtc-max8997: use devm_request_threaded_irq()
rtc: rtc-max8907: use devm_request_threaded_irq()
rtc: rtc-da9052: use devm_request_threaded_irq()
rtc: rtc-wm831x: use devm_request_threaded_irq()
rtc: rtc-tps80031: use devm_request_threaded_irq()
rtc: rtc-lp8788: use devm_request_threaded_irq()
rtc: rtc-coh901331: use devm_clk_get()
rtc: rtc-vt8500: use devm_*() functions
rtc: rtc-tps6586x: use devm_request_threaded_irq()
rtc: rtc-imxdi: use devm_clk_get()
rtc: rtc-cmos: use dev_warn()/dev_dbg() instead of printk()/pr_debug()
rtc: rtc-pcf8583: use dev_warn() instead of printk()
rtc: rtc-sun4v: use pr_warn() instead of printk()
rtc: rtc-vr41xx: use dev_info() instead of printk()
rtc: rtc-rs5c313: use pr_err() instead of printk()
rtc: rtc-at91rm9200: use dev_dbg()/dev_err() instead of printk()/pr_debug()
rtc: rtc-rs5c372: use dev_dbg()/dev_warn() instead of printk()/pr_debug()
rtc: rtc-ds2404: use dev_err() instead of printk()
rtc: rtc-efi: use dev_err()/dev_warn()/pr_err() instead of printk()
...
LP8557 is one of LP855x family device, but it has different register map
and initialization process. To support this device, device specific
configuration is done through the lp855x_device_config structure.
Few register definitions are fixed for better readability.
BRIGHTNESS_CTRL -> LP855X_BRIGHTNESS_CTRL
DEVICE_CTRL -> LP855X_DEVICE_CTRL
EEPROM_START -> LP855X_EEPROM_START
EEPROM_END -> LP855X_EEPROM_END
EPROM_START -> LP8556_EPROM_START
EPROM_END -> LP8556_EPROM_END
And LP8557 register definitions are added. New register function,
lp855x_update_bit() is added.
Signed-off-by: Milo(Woogyom) Kim <milo.kim@ti.com>
Acked-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After I came across a help text for SUNGEM mentioning a broken sun.com
URL, I felt like fixing those up, as they are now pointing to oracle.com
URLs.
Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm testing swapout workload in a two-socket Xeon machine. The workload
has 10 threads, each thread sequentially accesses separate memory
region. TLB flush overhead is very big in the workload. For each page,
page reclaim need move it from active lru list and then unmap it. Both
need a TLB flush. And this is a multthread workload, TLB flush happens
in 10 CPUs. In X86, TLB flush uses generic smp_call)function. So this
workload stress smp_call_function_many heavily.
Without patch, perf shows:
+ 24.49% [k] generic_smp_call_function_interrupt
- 21.72% [k] _raw_spin_lock
- _raw_spin_lock
+ 79.80% __page_check_address
+ 6.42% generic_smp_call_function_interrupt
+ 3.31% get_swap_page
+ 2.37% free_pcppages_bulk
+ 1.75% handle_pte_fault
+ 1.54% put_super
+ 1.41% grab_super_passive
+ 1.36% __swap_duplicate
+ 0.68% blk_flush_plug_list
+ 0.62% swap_info_get
+ 6.55% [k] flush_tlb_func
+ 6.46% [k] smp_call_function_many
+ 5.09% [k] call_function_interrupt
+ 4.75% [k] default_send_IPI_mask_sequence_phys
+ 2.18% [k] find_next_bit
swapout throughput is around 1300M/s.
With the patch, perf shows:
- 27.23% [k] _raw_spin_lock
- _raw_spin_lock
+ 80.53% __page_check_address
+ 8.39% generic_smp_call_function_single_interrupt
+ 2.44% get_swap_page
+ 1.76% free_pcppages_bulk
+ 1.40% handle_pte_fault
+ 1.15% __swap_duplicate
+ 1.05% put_super
+ 0.98% grab_super_passive
+ 0.86% blk_flush_plug_list
+ 0.57% swap_info_get
+ 8.25% [k] default_send_IPI_mask_sequence_phys
+ 7.55% [k] call_function_interrupt
+ 7.47% [k] smp_call_function_many
+ 7.25% [k] flush_tlb_func
+ 3.81% [k] _raw_spin_lock_irqsave
+ 3.78% [k] generic_smp_call_function_single_interrupt
swapout throughput is around 1400M/s. So there is around a 7%
improvement, and total cpu utilization doesn't change.
Without the patch, cfd_data is shared by all CPUs.
generic_smp_call_function_interrupt does read/write cfd_data several times
which will create a lot of cache ping-pong. With the patch, the data
becomes per-cpu. The ping-pong is avoided. And from the perf data, this
doesn't make call_single_queue lock contend.
Next step is to remove generic_smp_call_function_interrupt() from arch
code.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This provides a band-aid to provide stable page writes on jbd without
needing to backport the fixed locking and page writeback bit handling
schemes of jbd2. The band-aid works by using bounce buffers to snapshot
page contents instead of waiting.
For those wondering about the ext3 bandage -- fixing the jbd locking
(which was done as part of ext4dev years ago) is a lot of surgery, and
setting PG_writeback on data pages when we actually hold the page lock
dropped ext3 performance by nearly an order of magnitude. If we're
going to migrate iscsi and raid to use stable page writes, the
complaints about high latency will likely return. We might as well
centralize their page snapshotting thing to one place.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait. Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function. This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own
thing, so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.
Here's the result of using dbench to test latency on ext2:
3.8.0-rc3:
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 109347 0.028 59.817
ReadX 347180 0.004 3.391
Flush 15514 29.828 287.283
Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
3.8.0-rc3 + patches:
WriteX 105556 0.029 4.273
ReadX 335004 0.005 4.112
Flush 14982 30.540 298.634
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, the maximum write latency drops considerably with this
patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset ("stable page writes, part 2") makes some key
modifications to the original 'stable page writes' patchset. First, it
provides creators (devices and filesystems) of a backing_dev_info a flag
that declares whether or not it is necessary to ensure that page
contents cannot change during writeout. It is no longer assumed that
this is true of all devices (which was never true anyway). Second, the
flag is used to relaxed the wait_on_page_writeback calls so that wait
only occurs if the device needs it. Third, it fixes up the remaining
disk-backed filesystems to use this improved conditional-wait logic to
provide stable page writes on those filesystems.
It is hoped that (for people not using checksumming devices, anyway)
this patchset will give back unnecessary performance decreases since the
original stable page write patchset went into 3.0. Sorry about not
fixing it sooner.
Complaints were registered by several people about the long write
latencies introduced by the original stable page write patchset.
Generally speaking, the kernel ought to allocate as little extra memory
as possible to facilitate writeout, but for people who simply cannot
wait, a second page stability strategy is (re)introduced: snapshotting
page contents. The waiting behavior is still the default strategy; to
enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
set. This flag is used to bandaid^Henable stable page writeback on
ext3[1], and is not used anywhere else.
Given that there are already a few storage devices and network FSes that
have rolled their own page stability wait/page snapshot code, it would
be nice to move towards consolidating all of these. It seems possible
that iscsi and raid5 may wish to use the new stable page write support
to enable zero-copy writeout.
Thank you to Jan Kara for helping fix a couple more filesystems.
Per Andrew Morton's request, here are the result of using dbench to measure
latencies on ext2:
3.8.0-rc3:
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 109347 0.028 59.817
ReadX 347180 0.004 3.391
Flush 15514 29.828 287.283
Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
3.8.0-rc3 + patches:
WriteX 105556 0.029 4.273
ReadX 335004 0.005 4.112
Flush 14982 30.540 298.634
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, for ext2 the maximum write latency decreases from ~60ms
on a laptop hard disk to ~4ms. I'm not sure why the flush latencies
increase, though I suspect that being able to dirty pages faster gives
the flusher more work to do.
On ext4, the average write latency decreases as well as all the maximum
latencies:
3.8.0-rc3:
WriteX 85624 0.152 33.078
ReadX 272090 0.010 61.210
Flush 12129 36.219 168.260
Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms
3.8.0-rc3 + patches:
WriteX 86082 0.141 30.928
ReadX 273358 0.010 36.124
Flush 12214 34.800 165.689
Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms
XFS seems to exhibit similar latency improvements as ext2:
3.8.0-rc3:
WriteX 125739 0.028 104.343
ReadX 399070 0.005 4.115
Flush 17851 25.004 131.390
Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms
3.8.0-rc3 + patches:
WriteX 123529 0.028 6.299
ReadX 392434 0.005 4.287
Flush 17549 25.120 188.687
Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms
...and btrfs, just to round things out, also shows some latency
decreases:
3.8.0-rc3:
WriteX 67122 0.083 82.355
ReadX 212719 0.005 2.828
Flush 9547 47.561 147.418
Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms
3.8.0-rc3 + patches:
WriteX 64898 0.101 71.631
ReadX 206673 0.005 7.123
Flush 9190 47.963 219.034
Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own
thing, so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own wait code, or they don't block at all. The blocking
behavior is back to what it was before 3.0 if you don't have a disk
requiring stable page writes.
This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same
results as -rc3.
[1] The alternative fixes to ext3 include fixing the locking order and
page bit handling like we did for ext4 (but then why not just use
ext4?), or setting PG_writeback so early that ext3 becomes extremely
slow. I tried that, but the number of write()s I could initiate dropped
by nearly an order of magnitude. That was a bit much even for the
author of the stable page series! :)
This patch:
Creates a per-backing-device flag that tracks whether or not pages must
be held immutable during writeout. Eventually it will be used to waive
wait_for_page_writeback() if nothing requires stable pages.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I recently made the mistake of writing:
foo = lockdep_dereference_protected(..., lockdep_assert_held(...));
which is clearly bogus. If lockdep is disabled in the config this would
cause a compile failure, if it is enabled then it compiles and causes a
puzzling warning about dereferencing without the correct protection.
Wrap the macro in "do { ... } while (0)" to also fail compile for this
when lockdep is enabled.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The correct value for VIDCON1_VSTATUS_FRONTPORCH is 3, not 0.
Signed-off-by: Tomasz Figa <t.figa@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the bit definitions for CSC EQ709 and EQ601. These definitons are
used to control the CSC parameter such as equation 709 and equation 601.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <t.figa@samsung.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove unnecessary brackets and the duplicated VIDTCON2 definition.
Also, header comment is modified, because EXYNOS series is supported and
<mach/regs-fb.h> is not available.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <t.figa@samsung.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
S3C_FB_MAX_WIN is already defined in 'plat-samsung/include/plat/fb.h'.
So, this definition in 'include/video/samsung_fimd.h' should be removed to
avoid the duplication.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <t.figa@samsung.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
devm_* APIs are device managed and make exit and cleanup code simpler.
While at it also remove some unused labels and fix an error path.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Acked-by: Donghwa Lee <dh09.lee@samsung.com>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add mmp display subsystem to support Marvell MMP display controllers.
This subsystem contains 4 parts:
--fb folder
--core.c
--hw folder
--panel folder
1. fb folder contains implementation of fb. fb get path and overlay
from common interface and operates on these structures.
2. core.c provides common interface for a hardware abstraction. Major
parts of this interface are:
a) Path: path is a output device connected to a panel or HDMI TV. Main
operations of the path is set/get timing/output color. fb operates
output device through path structure.
b) Ovly: Ovly is a buffer shown on the path.
Ovly describes frame buffer and its source/destination size, offset,
input color, buffer address, z-order, and so on. Each fb device maps
to one overlay.
3. hw folder contains implementation of hardware operations defined by
core.c. It registers paths for fb use.
4. panel folder contains implementation of panels. It's connected to
path. Panel drivers would also regiester panels and linked to path
when probe.
Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Cc: Guoqing Li <ligq@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce compiletime_assert to compiler.h, which moves the details of
how to break a build and emit an error message for a specific compiler
to the headers where these details should be. Following in the
tradition of the POSIX assert macro, compiletime_assert creates a
build-time error when the supplied condition is *false*.
Next, we add BUILD_BUG_ON_MSG to bug.h which simply wraps
compiletime_assert, inverting the logic, so that it fails when the
condition is *true*, consistent with the language "build bug on." This
macro allows you to specify the error message you want emitted when the
supplied condition is true.
Finally, we remove all other code from bug.h that mucks with these
details (BUILD_BUG & BUILD_BUG_ON), and have them all call
BUILD_BUG_ON_MSG. This not only reduces source code bloat, but also
prevents the possibility of code being changed for one macro and not for
the other (which was previously the case for BUILD_BUG and
BUILD_BUG_ON).
Since __compiletime_error_fallback is now only used in compiler.h, I'm
considering it a private macro and removing the double negation that's
now extraneous.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to the introduction of __attribute__((error("msg"))) in gcc 4.3,
creating compile-time errors required a little trickery.
BUILD_BUG{,_ON} uses this attribute when available to generate
compile-time errors, but also uses the negative-sized array trick for
older compilers, resulting in two error messages in some cases. The
reason it's "some" cases is that as of gcc 4.4, the negative-sized array
will not create an error in some situations, like inline functions.
This patch replaces the negative-sized array code with the new
__compiletime_error_fallback() macro which expands to the same thing
unless the the error attribute is available, in which case it expands to
do{}while(0), resulting in exactly one compile-time error on all
versions of gcc.
Note that we are not changing the negative-sized array code for the
unoptimized version of BUILD_BUG_ON, since it has the potential to catch
problems that would be disabled in later versions of gcc were
__compiletime_error_fallback used. The reason is that that an
unoptimized build can't always remove calls to an error-attributed
function call (like we are using) that should effectively become dead
code if it were optimized. However, using a negative-sized array with a
similar value will not result in an false-positive (error). The only
caveat being that it will also fail to catch valid conditions, which we
should be expecting in an unoptimized build anyway.
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Negative sized arrays wont create a compile-time error in some cases
starting with gcc 4.4 (e.g., inlined functions), but gcc 4.3 introduced
the error function attribute that will.
This patch modifies BUILD_BUG_ON to behave like BUILD_BUG already does,
using the error function attribute so that you don't have to build the
entire kernel to discover that you have a problem, and then enjoy trying
to track it down from a link-time error.
Also, we are only including asm/bug.h and then expecting that
linux/compiler.h will eventually be included to define __linktime_error
(used in BUILD_BUG_ON). This patch includes it directly for clarity and
to avoid the possibility of changes in <arch>/*/include/asm/bug.h being
changed or not including linux/compiler.h for some reason.
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When calling BUILD_BUG_ON in an optimized build using gcc 4.3 and later,
the condition will be evaulated twice, possibily with side-effects. This
patch eliminates that error.
[akpm@linux-foundation.org: tweak code layout]
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When __CHECKER__ is defined, we disable all of the BUILD_BUG.* macros.
However, both BUILD_BUG_ON_NOT_POWER_OF_2 and BUILD_BUG_ON was evaluating
to nothing in this case, and we want (0) since this is a function-like
macro that will be followed by a semicolon.
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__linktime_error() does the same thing as __compiletime_error() and is
only used in bug.h. Since the macro defines a function attribute that
will cause a failure at compile-time (not link-time), it makes more sense
to keep __compiletime_error(), which is also neatly mated with
__compiletime_warning().
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using GCC_VERSION reduces complexity, is easier to read and is GCC's
recommended mechanism for doing version checks. (Just don't ask me why
they didn't define it in the first place.) This also makes it easy to
merge compiler-gcc{,3,4}.h should somebody want to.
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>