linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Jim Ramsay	9d0eb0ab43	dm: add switch target dm-switch is a new target that maps IO to underlying block devices efficiently when there is a large number of fixed-sized address regions but there is no simple pattern to allow for a compact mapping representation such as dm-stripe. Though we have developed this target for a specific storage device, Dell EqualLogic, we have made an effort to keep it as general purpose as possible in the hope that others may benefit. Originally developed by Jim Ramsay. Simplified by Mikulas Patocka. Signed-off-by: Jim Ramsay <jim_ramsay@dell.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:19 +01:00
Mikulas Patocka	2a7faeb176	dm: optimize reorder structure This reorder actually improves performance by 20% (from 39.1s to 32.8s) on x86-64 quad core Opteron. I have no explanation for this; possibly it makes some other entries better cache-aligned. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:18 +01:00
Mikulas Patocka	83d5e5b0af	dm: optimize use SRCU and RCU This patch removes "io_lock" and "map_lock" in struct mapped_device and "holders" in struct dm_table and replaces these mechanisms with sleepable-rcu. Previously, the code would call "dm_get_live_table" and "dm_table_put" to get and release table. Now, the code is changed to call "dm_get_live_table" and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and dm_put_live_table unlocks it. dm_get_live_table_fast/dm_put_live_table_fast can be used instead of dm_get_live_table/dm_put_live_table. These *_fast functions use non-sleepable RCU, so the caller must not block between them. If the code changes active or inactive dm table, it must call dm_sync_table before destroying the old table. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:18 +01:00
Mikulas Patocka	2480945cd4	dm bufio: submit writes outside lock This patch changes dm-bufio so that it submits write I/Os outside of the lock. If the number of submitted buffers is greater than the number of requests on the target queue, submit_bio blocks. We want to block outside of the lock to improve latency of other threads that may need the lock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:18 +01:00
Mikulas Patocka	43aeaa2957	dm cache: fix arm link errors with inline Use __always_inline to avoid a link failure with gcc 4.6 on ARM. gcc 4.7 is OK. It creates a function block_div.part.8, it references __udivdi3 and __umoddi3 and it is never called. The references to __udivdi3 and __umoddi3 cause a link failure. Reported-by: Rob Herring <robherring2@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:17 +01:00
Mikulas Patocka	553d8fe029	dm verity: use __ffs and __fls This patch changes ffs() to __ffs() and fls() to __fls() which don't add one to the result. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:17 +01:00
Alasdair G Kergon	75e3a0f55b	dm flakey: correct ctr alloc failure mesg Remove the reference to the "linear" target from the error message issued when allocation fails in the flakey target. Cc: Robin Dong <sanbai@taobao.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:17 +01:00
Mikulas Patocka	5d8be84397	dm verity: remove pointless comparison Remove num < 0 test in verity_ctr because num is unsigned. (Found by Coverity.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:17 +01:00
Mikulas Patocka	220cd058d9	dm: use __GFP_HIGHMEM in __vmalloc Use __GFP_HIGHMEM in __vmalloc. Pages allocated with __vmalloc can be allocated in high memory that is not directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc does. This patch reduces memory pressure slightly because pages can be allocated in the high zone. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:16 +01:00
Mikulas Patocka	b1bf2de072	dm verity: fix inability to use a few specific device sizes Fix a boundary condition that caused failure for certain device sizes. The problem is reported at http://code.google.com/p/cryptsetup/issues/detail?id=160 For certain device sizes the number of hashes at a specific level was calculated incorrectly. It happens for example for a device with data and metadata block size 4096 that has 16385 blocks and algorithm sha256. The user can test if he is affected by this bug by running the "veritysetup verify" command and also by activating the dm-verity kernel driver and reading the whole block device. If it passes without an error, then the user is not affected. The condition for the bug is: Split the total number of data blocks (data_block_bits) into bit strings, each string has hash_per_block_bits bits. hash_per_block_bits is rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you can say that you convert data_block_bits to 2^hash_per_block_bits base. If there is some zero bit string below the most significant bit string and at least one bit below this zero bit string is set, then the bug happens. The same bug exists in the userspace veritysetup tool, so you must use fixed veritysetup too if you want to use devices that are affected by this boundary condition. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # 3.4+ Cc: Milan Broz <gmazyland@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:16 +01:00
Mikulas Patocka	1c0e883e86	dm ioctl: set noio flag to avoid __vmalloc deadlock Set noio flag while calling __vmalloc() because it doesn't fully respect gfp flags to avoid a possible deadlock (see commit `502624bdad`). This should be backported to stable kernels 3.8 and newer. The kernel 3.8 doesn't have memalloc_noio_save(), so we should set and restore process flag PF_MEMALLOC instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:15 +01:00
Hannes Reinecke	6c182cd88d	dm mpath: fix ioctl deadlock when no paths When multipath needs to retry an ioctl the reference to the current live table needs to be dropped. Otherwise a deadlock occurs when all paths are down: - dm_blk_ioctl takes a reference to the current table and spins in multipath_ioctl(). - A new table is being loaded, but upon resume the process hangs in dm_table_destroy() waiting for references to drop to zero. With this patch the reference to the old table is dropped prior to retry, thereby avoiding the deadlock. Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:15 +01:00
Linus Torvalds	80cc38b163	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) treewide: relase -> release Documentation/cgroups/memory.txt: fix stat file documentation sysctl/net.txt: delete reference to obsolete 2.4.x kernel spinlock_api_smp.h: fix preprocessor comments treewide: Fix typo in printk doc: device tree: clarify stuff in usage-model.txt. open firmware: "/aliasas" -> "/aliases" md: bcache: Fixed a typo with the word 'arithmetic' irq/generic-chip: fix a few kernel-doc entries frv: Convert use of typedef ctl_table to struct ctl_table sgi: xpc: Convert use of typedef ctl_table to struct ctl_table doc: clk: Fix incorrect wording Documentation/arm/IXP4xx fix a typo Documentation/networking/ieee802154 fix a typo Documentation/DocBook/media/v4l fix a typo Documentation/video4linux/si476x.txt fix a typo Documentation/virtual/kvm/api.txt fix a typo Documentation/early-userspace/README fix a typo Documentation/video4linux/soc-camera.txt fix a typo lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment ...	2013-07-04 11:40:58 -07:00
NeilBrown	1376512065	md/raid10: fix bug which causes all RAID10 reshapes to move no data. The recent commit: commit `7e83ccbecd` md/raid10: Allow skipping recovery when clean arrays are assembled causes raid10 to skip a recovery in certain cases where it is safe to do so. Unfortunately it also causes a reshape to be skipped, which is never safe. The result is that an attempt to reshape a RAID10 will appear to complete instantly, but no data will have been moved so the array will now contain garbage. (If nothing is written, you can recover by simply performing the reverse reshape, which will also complete instantly). Bug was introduced in 3.10, so this is suitable for 3.10-stable. Cc: stable@vger.kernel.org (3.10) Cc: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-04 16:42:57 +10:00
NeilBrown	fdcfbbb653	md/raid5: allow 5-device RAID6 to be reshaped to 4-device. There is a bug in 'check_reshape' in raid5.c. It checks that the new minimum number of devices is large enough (which is good), but it does so also after the reshape has started (bad). This is bad because - the calculation is now wrong as mddev->raid_disks has changed already, and - it is pointless because it is now too late to stop. So only perform that test when reshape has not been committed to. Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-04 16:42:52 +10:00
NeilBrown	78eaa0d4cb	md/raid10: fix two bugs affecting RAID10 reshape. 1/ If a RAID10 is being reshaped to fewer devices and is stopped while this is ongoing, then when the array is reassembled the 'mirrors' array will be allocated too small. This will lead to an access error or memory corruption. 2/ A sanity test performed when a reshaping RAID10 array is restarted is slightly incorrect. Due to the first bug, this is suitable for any -stable kernel since 3.5 where this code was introduced. Cc: stable@vger.kernel.org (v3.5+) Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-03 09:43:28 +10:00
Kent Overstreet	8e51e414a3	bcache: Use standard utility code Some of bcache's utility code has made it into the rest of the kernel, so drop the bcache versions. Bcache used to have a workaround for allocating from a bio set under generic_make_request() (if you allocated more than once, the bios you already allocated would get stuck on current->bio_list when you submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT when allocating bios under generic_make_request() so that allocation could fail and it could retry from workqueue. But bio_alloc_bioset() has a workaround now, so we can drop this hack and the associated error handling. Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-07-01 14:43:53 -07:00
Kent Overstreet	f3059a5461	bcache: Delete fuzz tester This code has rotted and it hasn't been used in ages anyways. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-07-01 14:43:48 -07:00
Kent Overstreet	36c9ea9837	bcache: Document shrinker reserve better Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2013-07-01 14:42:48 -07:00
Kent Overstreet	e49c7c374e	bcache: FUA fixes Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node writes have... weird ordering requirements. Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-07-01 14:42:47 -07:00
Gabriel de Perthuis	ab9e14002e	bcache: Send label uevents Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 21:58:06 -07:00
Gabriel de Perthuis	a25c32bede	bcache: Send a uevent with a cached device's UUID Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>	2013-06-26 21:58:05 -07:00
Kent Overstreet	72c270612b	bcache: Write out full stripes Now that we're tracking dirty data per stripe, we can add two optimizations for raid5/6: * If a stripe is already dirty, force writes to that stripe to writeback mode - to help build up full stripes of dirty data * When flushing dirty data, preferentially write out full stripes first if there are any. Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 21:58:04 -07:00
Kent Overstreet	279afbad4e	bcache: Track dirty data by stripe To make background writeback aware of raid5/6 stripes, we first need to track the amount of dirty data within each stripe - we do this by breaking up the existing sectors_dirty into per stripe atomic_ts Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 21:57:23 -07:00
Kent Overstreet	444fc0b6b1	bcache: Initialize sectors_dirty when attaching Previously, dirty_data wouldn't get initialized until the first garbage collection... which was a bit of a problem for background writeback (as the PD controller keys off of it) and also confusing for users. This is also prep work for making background writeback aware of raid5/6 stripes. Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:09:16 -07:00
Kent Overstreet	6ded34d1a5	bcache: Improve lazy sorting The old lazy sorting code was kind of hacky - rewrite in a way that mathematically makes more sense; the idea is that the size of the sets of keys in a btree node should increase by a more or less fixed ratio from smallest to biggest. Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:09:16 -07:00
Kent Overstreet	85b1492ee1	bcache: Rip out pkey()/pbtree() Old gcc doesn't like the struct hack, and it is kind of ugly. So finish off the work to convert pr_debug() statements to tracepoints, and delete pkey()/pbtree(). Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:09:15 -07:00
Kent Overstreet	c37511b863	bcache: Fix/revamp tracepoints The tracepoints were reworked to be more sensible, and fixed a null pointer deref in one of the tracepoints. Converted some of the pr_debug()s to tracepoints - this is partly a performance optimization; it used to be that with DEBUG or CONFIG_DYNAMIC_DEBUG pr_debug() was an empty macro; but at some point it was changed to an empty inline function. Some of the pr_debug() statements had rather expensive function calls as part of the arguments, so this code was getting run unnecessarily even on non debug kernels - in some fast paths, too. Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:09:15 -07:00
Kent Overstreet	5794351146	bcache: Refactor btree io The most significant change is that btree reads are now done synchronously, instead of asynchronously and doing the post read stuff from a workqueue. This was originally done because we can't block on IO under generic_make_request(). But - we already have a mechanism to punt cache lookups to workqueue if needed, so if we just use that we don't have to deal with the complexity of doing things asynchronously. The main benefit is this makes the locking situation saner; we can hold our write lock on the btree node until we're finished reading it, and we don't need that btree_node_read_done() flag anymore. Also, for writes, btree_write() was broken out into btree_node_write() and btree_leaf_dirty() - the old code with the boolean argument was dumb and confusing. The prio_blocked mechanism was improved a bit too, now the only counter is in struct btree_write, we don't mess with transferring a count from struct btree anymore. This required changing garbage collection to block prios at the start and unblock when it finishes, which is cleaner than what it was doing anyways (the old code had mostly the same effect, but was doing it in a convoluted way) And the btree iter btree_node_read_done() uses was converted to a real mempool. Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:09:14 -07:00
Kent Overstreet	119ba0f828	bcache: Convert allocator thread to kthread Using a workqueue when we just want a single thread is a bit silly. Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:09:13 -07:00
Gabriel de Perthuis	a9dd53adbb	bcache: Warn when a device is already registered. Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:08:52 -07:00
Kent Overstreet	bbc77aa7fb	bcache: fix a spurious gcc complaint, use scnprintf An old version of gcc was complaining about using a const int as the size of a stack allocated array. Which should be fine - but using ARRAY_SIZE() is better, anyways. Also, refactor the code to use scnprintf(). Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:06:33 -07:00
Kumar Amit Mehta	5c694129c8	md: bcache: io.c: fix a potential NULL pointer dereference bio_alloc_bioset returns NULL on failure. This fix adds a missing check for potential NULL pointer dereferencing. Signed-off-by: Kumar Amit Mehta <gmate.amit@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com>	2013-06-26 17:06:19 -07:00
Jonathan Brassow	c4a3955145	MD: Remember the last sync operation that was performed This patch adds a field to the mddev structure to track the last sync operation that was performed. This is especially useful when it comes to what is recorded in mismatch_cnt in sysfs. If the last operation was "data-check", then it reports the number of discrepancies found by the user-initiated check. If it was a "repair" operation, then it is reporting the number of discrepancies repaired. etc. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-26 12:38:24 +10:00
NeilBrown	eea136d69f	md: fix buglet in RAID5 -> RAID0 conversion. RAID5 uses a 'per-array' value for the 'size' of each device. RAID0 uses a 'per-device' value - it can be different for each device. When converting a RAID5 to a RAID0 we must ensure that the per-device size of each device matches the per-array size for the RAID5, else the array will change size. If the metadata cannot record a changed per-device size (as is the case with v0.90 metadata) the array could get bigger on restart. This does not cause data corruption, so it is not a big issue and is mainly yet another reason not to use 0.90. Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-26 12:38:19 +10:00
Phil Viana	48a73025cb	md: bcache: Fixed a typo with the word 'arithmetic' The word 'arithmetic' was typed as 'arithmatic' Signed-off-by: Phil Viana <phillip.l.viana@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-06-18 13:41:16 +02:00
NeilBrown	725d6e579f	md/raid10: check In_sync flag in 'enough()'. It isn't really enough to check that the rdev is present, we need to also be sure that the device is still In_sync. Doing this requires using rcu_dereference to access the rdev, and holding the rcu_read_lock() to ensure the rdev doesn't disappear while we look at it. Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:27 +10:00
NeilBrown	635f6416a2	md/raid10: locking changes for 'enough()'. As 'enough' accesses conf->prev and conf->geo, which can change spontaneously, it should guard against changes. This can be done with device_lock as start_reshape holds device_lock while updating 'geo' and end_reshape holds it while updating 'prev'. So 'error' needs to hold 'device_lock'. On the other hand, raid10_end_read_request knows which of the two it really wants to access, and as it is an active request on that one, the value cannot change underneath it. So change _enough to take a flag rather than a pointer, pass the appropriate flag from raid10_end_read_request(), and remove the locking. All other calls to 'enough' are made with reconfig_mutex held, so neither 'prev' nor 'geo' can change. Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:27 +10:00
Jingoo Han	b29bebd66d	md: replace strict_strto() with kstrto() The usage of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:26 +10:00
Hannes Reinecke	90f5f7ad4f	md: Wait for md_check_recovery before attempting device removal. When a device has failed, it needs to be removed from the personality module before it can be removed from the array as a whole. The first step is performed by md_check_recovery() which is called from the raid management thread. So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery to have run. This increases the chance that the ioctl will succeed. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Neil Brown <nfbrown@suse.de>	2013-06-14 08:10:26 +10:00
NeilBrown	3f6bbd3ffd	dm-raid: silence compiler warning on rebuilds_per_group. This doesn't really need to be initialised, but it doesn't hurt, silences the compiler, and as it is a counter it makes sense for it to start at zero. Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:26 +10:00
Jonathan Brassow	a4dc163a55	DM RAID: Fix raid_resume not reviving failed devices in all cases When a device fails in a RAID array, it is marked as Faulty. Later, md_check_recovery is called which (through the call chain) calls 'hot_remove_disk' in order to have the personalities remove the device from use in the array. Sometimes, it is possible for the array to be suspended before the personalities get their chance to perform 'hot_remove_disk'. This is normally not an issue. If the array is deactivated, then the failed device will be noticed when the array is reinstantiated. If the array is resumed and the disk is still missing, md_check_recovery will be called upon resume and 'hot_remove_disk' will be called at that time. However, (for dm-raid) if the device has been restored, a resume on the array would cause it to attempt to revive the device by calling 'hot_add_disk'. If 'hot_remove_disk' had not been called, a situation is then created where the device is thought to concurrently be the replacement and the device to be replaced. Thus, the device is first sync'ed with the rest of the array (because it is the replacement device) and then marked Faulty and removed from the array (because it is also the device being replaced). The solution is to check and see if the device had properly been removed before the array was suspended. This is done by seeing whether the device's 'raid_disk' field is -1 - a condition that implies that 'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1) -> hot_remove_disk' has been called. If 'raid_disk' is not -1, then 'hot_remove_disk' must be called to complete the removal of the previously faulty device before it can be revived via 'hot_add_disk'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:25 +10:00
Jonathan Brassow	f381e71b04	DM RAID: Break-up untidy function Clean-up excessive indentation by moving some code in raid_resume() into its own function. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:25 +10:00
Jonathan Brassow	9092c02d94	DM RAID: Add ability to restore transiently failed devices on resume This patch adds code to the resume function to check over the devices in the RAID array. If any are found to be marked as failed and their superblocks can be read, an attempt is made to reintegrate them into the array. This allows the user to refresh the array with a simple suspend and resume of the array - rather than having to load a completely new table, allocate and initialize all the structures and throw away the old instantiation. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:24 +10:00
Linus Torvalds	82ea4be61f	Merge tag 'md-3.10-fixes' of git://neil.brown.name/md Pull md bugfixes from Neil Brown: "A few bugfixes for md Some tagged for -stable" * tag 'md-3.10-fixes' of git://neil.brown.name/md: md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place md/raid1,raid10: use freeze_array in place of raise_barrier in various places. md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. md: md_stop_writes() should always freeze recovery.	2013-06-13 10:13:29 -07:00
H. Peter Anvin	5026d7a9b2	md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drivers behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7 when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-13 14:49:54 +10:00
NeilBrown	e2d5992522	md/raid1,raid10: use freeze_array in place of raise_barrier in various places. Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The latter has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock. Using freeze_array should not. As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore. The deadlock was made particularly noticeable by commits `050b66152f` (raid10) and `6b740b8d79` (raid1) which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-13 13:40:48 +10:00
Alex Lyakas	3056e3aec8	md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. Without that fix, the following scenario could happen: - RAID1 with drives A and B; drive B was freshly-added and is rebuilding - Drive A fails - WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate. - r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1 - raid_end_bio_io() calls call_bio_endio() - As a result, in call_bio_endio(): if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) clear_bit(BIO_UPTODATE, &bio->bi_flags); this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer. So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. [neilb - applied identical change to raid10 as well] This bug can result in lost data, so it is suitable for any -stable kernel. Cc: stable@vger.kernel.org Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-13 13:20:03 +10:00
NeilBrown	6b6204ee92	md: md_stop_writes() should always freeze recovery. __md_stop_writes() will currently sometimes freeze recovery. So any caller must be ready for that to happen, and indeed they are. However if __md_stop_writes() doesn't freeze recovery, then a recovery could start before mddev_suspend() is called, which could be awkward. This can particularly cause problems for dm-raid. So change __md_stop_writes() to always freeze recovery. This is safe and more predictable. Reported-by: Brassow Jonathan <jbrassow@redhat.com> Tested-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-13 13:18:15 +10:00
Linus Torvalds	b2cc9c19e4	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "Outside of bcache (which really isn't super big), these are all few-liners. There are a few important fixes in here: - Fix blk pm sleeping when holding the queue lock - A small collection of bcache fixes that have been done and tested since bcache was included in this merge window. - A fix for a raid5 regression introduced with the bio changes. - Two important fixes for mtip32xx, fixing an oops and potential data corruption (or hang) due to wrong bio iteration on stacked devices." * 'for-linus' of git://git.kernel.dk/linux-block: scatterlist: sg_set_buf() argument must be in linear mapping raid5: Initialize bi_vcnt pktcdvd: silence static checker warning block: remove refs to XD disks from documentation blkpm: avoid sleep when holding queue lock mtip32xx: Correctly handle bio->bi_idx != 0 conditions mtip32xx: Fix NULL pointer dereference during module unload bcache: Fix error handling in init code bcache: clarify free/available/unused space bcache: drop "select CLOSURES" bcache: Fix incompatible pointer type warning	2013-06-12 16:42:39 -07:00
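
The boundary condition described in the dm verity commit b1bf2de072 above can be checked for a given device size with a short userspace sketch. This is only an illustration of the condition as the commit message words it, not the kernel's or veritysetup's code; the function names `ilog2_floor` and `verity_size_hits_boundary` are hypothetical, and the digest size is passed in bytes (32 for sha256).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* rounddown(log2(n)) for n > 0. Hypothetical helper, not a kernel API. */
static unsigned int ilog2_floor(uint64_t n)
{
	unsigned int bits = 0;

	assert(n > 0);
	while (n >>= 1)
		bits++;
	return bits;
}

/*
 * Returns true if the number of data blocks matches the condition quoted
 * in the commit message: written in base 2^hash_per_block_bits, some
 * bit string below the most significant one is zero, and at least one
 * bit below that zero string is set.
 */
static bool verity_size_hits_boundary(uint64_t data_blocks,
				      uint32_t metadata_block_size,
				      uint32_t hash_digest_size)
{
	unsigned int hash_per_block_bits;
	uint64_t mask;
	uint64_t low_bits_seen = 0;

	/* At least two hashes per block, so each bit string has >= 1 bit. */
	assert(hash_digest_size > 0);
	assert(metadata_block_size / hash_digest_size >= 2);

	hash_per_block_bits = ilog2_floor(metadata_block_size / hash_digest_size);
	mask = ((uint64_t)1 << hash_per_block_bits) - 1;

	/*
	 * Walk the bit strings from least significant upward and stop
	 * before the most significant one (nothing remains above it).
	 */
	while (data_blocks >> hash_per_block_bits) {
		uint64_t digit = data_blocks & mask;

		if (digit == 0 && low_bits_seen)
			return true;
		low_bits_seen |= digit;
		data_blocks >>= hash_per_block_bits;
	}
	return false;
}
```

With the commit's own example (block size 4096, sha256, 16385 blocks) the bit strings are 1, 0, 1 in base 2^7, so `verity_size_hits_boundary(16385, 4096, 32)` reports the size as affected, while 16384 blocks (bit strings 1, 0, 0) is not.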


2853 commits