1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

42026 commits

Author SHA1 Message Date
Jaegeuk Kim
f7409d0fae f2fs: fix wrong pointer access during try_to_free_nids
If we release the lock in list_for_each_entry_safe, we can lose the tmp
pointer by alloc_nid.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-24 09:37:42 -07:00
Jaegeuk Kim
80c545055d f2fs: use __GFP_NOFAIL to avoid infinite loop
__GFP_NOFAIL can avoid retrying the whole path of kmem_cache_alloc and
bio_alloc.
And, it also fixes the use cases of GFP_ATOMIC correctly.

Suggested-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-24 09:37:21 -07:00
Vincent Bernat
b5ac1fb271 9p: fix return code of read() when count is 0
When reading 0 bytes from an empty file on a 9P filesystem, the return
code of read() was not 0 as expected due to an unitialized err variable.

Tested with this simple program:

    #include <assert.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, const char **argv)
    {
        assert(argc == 2);
        char buffer[256];
        int fd = open(argv[1], O_RDONLY|O_NOCTTY);
        assert(fd >= 0);
        assert(read(fd, buffer, 0) == 0);
        return 0;
    }

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2015-08-23 14:21:36 -05:00
Fabian Frederick
cb9af41859 9p: remove unused option Opt_trans
Commit 8a0dc95fd9
("9p: transport API reorganization")
removed Opt_trans in tokens not in enum.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2015-08-23 14:21:16 -05:00
Chao Yu
dac2ddefe6 f2fs: lookup neighbor extent nodes for merging later
In __lookup_extent_tree_ret we will not try to find neighbor nodes if
we find the target node, in this condition, we will lost the chance to
merge the new mapping with exist extent node later.

So our extent cache of inode will be fragmented after overwrite exist
file, we can see the number of extent node increases intensively in
following test case:

dd if=/dev/zero of=/mnt/f2fs/4m bs=4K count=1024

Extent Cache:
  - Hit Count: L1-1:0 L1-2:0 L2:0
  - Hit Ratio: 0% (0 / 3072)
  - Inner Struct Count: tree: 1, node: 1

dd if=/dev/zero of=/mnt/f2fs/4m bs=4K count=1024 conv=notrunc

Extent Cache:
  - Hit Count: L1-1:2048 L1-2:0 L2:0
  - Hit Ratio: 33% (2048 / 6144)
  - Inner Struct Count: tree: 1, node: 961

This patch fixes to lookup neighbors of target node for further
merging.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:45:18 -07:00
Chao Yu
ef05e22199 f2fs: split __insert_extent_tree_ret for readability
This patch splits __insert_extent_tree_ret into __try_merge_extent_node &
__insert_extent_tree for code readability.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:45:17 -07:00
Chao Yu
a6f7834594 f2fs: kill dead code in __insert_extent_tree
After commit 0f825ee6e8 ("f2fs: add new interfaces for extent tree"),
f2fs_init_extent_tree becomes the only caller of __insert_extent_tree, and
in f2fs_init_extent_tree, we will only insert extent node in an empty tree,
so __try_{back,front}_merge in __insert_extent_tree will never be called.

This patch removes these dead codes, besides, rename __insert_extent_tree
to __init_extent_tree for readability.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:45:17 -07:00
Chao Yu
029e13cc32 f2fs: adjust showing of extent cache stat
This patch alters to replace total hit stat with rbtree hit stat,
and then adjust showing of extent cache stat:

Hit Count:
L1-1: for largest node hit count;
L1-2: for last cached node hit count;
L2: for extent node hit after lookuping in rbtree.

Hit Ratio:
ratio (hit count / total lookup count)

Inner Struct Count:
tree count, node count.

Before:
Extent Hit Ratio: 0 / 2

Extent Tree Count: 3

Extent Node Count: 2

Patched:
Exten Cacache:
  - Hit Count: L1-1:4871 L1-2:2074 L2:208
  - Hit Ratio: 1% (7153 / 550751)
  - Inner Struct Count: tree: 26560, node: 11824

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:45:16 -07:00
Chao Yu
91c481fff9 f2fs: add largest/cached stat in extent cache
This patch adds to stat the hit count of largest/cached node for showing
in debugfs.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:45:15 -07:00
Chao Yu
e2b4e2bc88 f2fs: fix incorrect mapping for bmap
The test step is like below:
1. touch file
2. truncate -s $((1024*1024)) file
3. fallocate -o 0 -l $((1024*1024)) file
4. fibmap.f2fs file

Our result of fibmap.f2fs showed below is not correct:

file_pos   start_blk     end_blk        blks
       0    -937166132    -937166132           1
    4096    -937166132    -937166132           1
    8192    -937166132    -937166132           1
   12288    -937166132    -937166132           1
   16384    -937166132    -937166132           1
   20480    -937166132    -937166132           1
...
 1040384    -937166132    -937166132           1
 1044480    -937166132    -937166132           1

This is because f2fs_map_blocks will return with no error when meeting
a hole or preallocated block, the caller __get_data_block will map the
uninitialized variable value to bh->b_blocknr.

Unfortunately generic_block_bmap will neither check the return value of
get_data() nor check mapping info of buffer_head, result in returning
the random block address.

After fixing the issue, our result shows correctly:

file_pos   start_blk     end_blk        blks
       0           0           0         256

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:45:14 -07:00
Fan Li
f8b703da2c f2fs: fix to update cached_en of extent tree properly
In f2fs_lookup_extent_tree, et->cached_en was read and updated with only
read lock held,
it could cause __lookup_extent_tree within return entirely wrong
extent_node, if other
thread update et->cached_en just before __lookup_extent_tree return.

However, there are two things about this patch that need to be noticed:
1. It does no good to arrange the order of concurrent read/write, the result
would still
be random in such case.
2. It's built on this assumption: the mix up of reads and writes on a single
pointer would
not make the pointer partially wrong at any time. Please let me know if I'm
wrong, thx.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:45:06 -07:00
Junesung Lee
217940d4f0 f2fs: fix typo
Fix typo.

Signed-off-by: Junesung Lee <junesoung412@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:43:32 -07:00
Chris Mason
3a9508b022 btrfs: fix compile when block cgroups are not enabled
bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
This adds an ifdef around them.  It's not perfect, but our
use of bi_ioc is being removed in the 4.3 merge window.

The bi_css usage really should go into bio_clone, but I want to make
sure that doesn't introduce problems for other bio_clone use cases.

Signed-off-by: Chris Mason <clm@fb.com>
2015-08-21 10:08:13 -07:00
Eric W. Biederman
397d425dc2 vfs: Test for and handle paths that are unreachable from their mnt_root
In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.

Like division by zero .. from an unconnected path can not be given
a useful semantic as there is no predicting at which path component
the code will realize it is unconnected.  We certainly can not match
the current behavior as the current behavior is a security hole.

Therefore when encounting .. when following an unconnected path
return -ENOENT.

- Add a function path_connected to verify path->dentry is reachable
  from path->mnt.mnt_root.  AKA to validate that rename did not do
  something nasty to the bind mount.

  To avoid races path_connected must be called after following a path
  component to it's next path component.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-21 03:20:10 -04:00
Eric W. Biederman
a03e283bf5 dcache: Reduce the scope of i_lock in d_splice_alias
i_lock is only needed until __d_find_any_alias calls dget on the alias
dentry.  After that the reference to new ensures that dentry_kill and
d_delete will not remove the inode from the dentry, and remove the
dentry from the inode->d_entry list.

The inode i_lock came to be held over the the __d_move calls in
d_splice_alias through a series of introduction of locks with
increasing smaller scope.  First it was the dcache_lock, then
it was the dcache_inode_lock, and finally inode->i_lock.

Furthermore inode->i_lock is not held over any other calls
to d_move or __d_move so it can not provide any meaningful
rename protection.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-21 02:34:37 -04:00
Eric W. Biederman
cde93be45a dcache: Handle escaped paths in prepend_path
A rename can result in a dentry that by walking up d_parent
will never reach it's mnt_root.  For lack of a better term
I call this an escaped path.

prepend_path is called by four different functions __d_path,
d_absolute_path, d_path, and getcwd.

__d_path only wants to see paths are connected to the root it passes
in.  So __d_path needs prepend_path to return an error.

d_absolute_path similarly wants to see paths that are connected to
some root.  Escaped paths are not connected to any mnt_root so
d_absolute_path needs prepend_path to return an error greater
than 1.  So escaped paths will be treated like paths on lazily
unmounted mounts.

getcwd needs to prepend "(unreachable)" so getcwd also needs
prepend_path to return an error.

d_path is the interesting hold out.  d_path just wants to print
something, and does not care about the weird cases.  Which raises
the question what should be printed?

Given that <escaped_path>/<anything> should result in -ENOENT I
believe it is desirable for escaped paths to be printed as empty
paths.  As there are not really any meaninful path components when
considered from the perspective of a mount tree.

So tweak prepend_path to return an empty path with an new error
code of 3 when it encounters an escaped path.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-21 02:34:36 -04:00
Al Viro
061f98e959 Merge branch 'superblock-scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-next
Conflicts:
	include/linux/fs.h
2015-08-21 02:31:20 -04:00
Trond Myklebust
dd52128afd NFSv4.1/pnfs Ensure flexfiles reports all connection related errors
Make sure that we also handle RPC level connection and protocol
negotiation errors.

Reported-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-20 18:33:42 -05:00
Trond Myklebust
e76d28dd9c NFSv4.1/pnfs: Ensure the flexfiles layoutstats timers are consistent
We want to ensure that the stopwatches for the busy timer and the
aggregate timer are consistent. This means that they need to use
the same start/stop times.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-20 13:43:53 -05:00
Peng Tao
046be74da8 NFS41: fix list splice type
We want to move commiting pages to pages list instead.
Otherwise it causes pnfs small writes crash like:

[34560.037692] BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
[34560.038557] IP: [<ffffffffa05423d6>] nfs_init_commit+0x26/0x130 [nfs]
[34560.039400] PGD 69f5a067 PUD 69f59067 PMD 0
[34560.040207] Oops: 0000 [#1] SMP
[34560.041014] Modules linked in: nfsv3(OE) nfs_layout_flexfiles(OE) nfsv4(OE) nfs(OE) fscache(E) rpcsec_gss_krb5(E) xt_addrtype(E) xt_conntrack(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) iptable_filter(E) ip_tables(E) x_tables(E) nf_nat(E) nf_conntrack(E) bridge(E) stp(E) llc(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) ppdev(E) vmw_balloon(E) coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) glue_helper(E) lrw(E) gf128mul(E) ablk_helper(E) cryptd(E) psmouse(E) serio_raw(E) vmw_vmci(E) i2c_piix4(E) shpchp(E) parport_pc(E) parport(E) mac_hid(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) xfs(E) libcrc32c(E) hid_generic(E) usbhid(E) hid(E) e1000(E) mptspi(E)
[34560.045106]  mptscsih(E) mptbase(E) vmwgfx(E) drm_kms_helper(E) ttm(E) drm(E) autofs4(E) [last unloaded: fscache]
[34560.045897] CPU: 0 PID: 130543 Comm: bash Tainted: G           OE   4.2.0-rc5-dp-00057-gf993a93 #11
[34560.046699] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[34560.047525] task: ffff880031b0a980 ti: ffff880045fec000 task.ti: ffff880045fec000
[34560.048264] RIP: 0010:[<ffffffffa05423d6>]  [<ffffffffa05423d6>] nfs_init_commit+0x26/0x130 [nfs]
[34560.049000] RSP: 0018:ffff880045fefc18  EFLAGS: 00010246
[34560.049717] RAX: 0000000000000000 RBX: ffff8800208fbc80 RCX: ffff880045fefd50
[34560.050396] RDX: ffff880031c19ec0 RSI: ffff880045fefc88 RDI: ffff8800208fbc80
[34560.051041] RBP: ffff880045fefc28 R08: ffff8800208fbe68 R09: ffff880045fefc88
[34560.051666] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880045fefc78
[34560.052247] R13: ffff880045fefc88 R14: ffff880045fefa90 R15: ffff880045fefd50
[34560.052825] FS:  00007fa02d58c740(0000) GS:ffff88006d600000(0000) knlGS:0000000000000000
[34560.053410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34560.053992] CR2: 0000000000000068 CR3: 000000003b37a000 CR4: 00000000001406f0
[34560.054615] Stack:
[34560.055200]  ffff8800208fbc80 ffff8800208fbc80 ffff880045fefcc8 ffffffffa05c1a5b
[34560.055800]  ffff880045fefcc8 ffff880045fefd50 0000000045fefcb8 ffff880045fefd40
[34560.056418]  ffff8800420608e0 ffffffffa04f3910 0000000100000001 ffff880045fefd50
[34560.057013] Call Trace:
[34560.057672]  [<ffffffffa05c1a5b>] pnfs_generic_commit_pagelist+0x1cb/0x300 [nfsv4]
[34560.058277]  [<ffffffffa04f3910>] ? ff_layout_commit_pagelist+0x20/0x20 [nfs_layout_flexfiles]
[34560.058907]  [<ffffffffa04f3905>] ff_layout_commit_pagelist+0x15/0x20 [nfs_layout_flexfiles]
[34560.059557]  [<ffffffffa0543fc1>] nfs_generic_commit_list+0xb1/0xf0 [nfs]
[34560.060214]  [<ffffffffa0543e47>] ? nfs_scan_commit+0x37/0xa0 [nfs]
[34560.060825]  [<ffffffffa0544081>] nfs_commit_inode+0x81/0x150 [nfs]
[34560.061432]  [<ffffffffa05443ae>] nfs_wb_all+0x1ae/0x400 [nfs]
[34560.062035]  [<ffffffffa05380ad>] nfs_getattr+0x33d/0x510 [nfs]
[34560.062630]  [<ffffffff8122499c>] vfs_getattr_nosec+0x2c/0x40
[34560.063223]  [<ffffffff81224a66>] vfs_getattr+0x26/0x30
[34560.063818]  [<ffffffff81224b35>] vfs_fstatat+0x65/0xa0
[34560.064413]  [<ffffffff81224f3f>] SYSC_newstat+0x1f/0x40
[34560.065016]  [<ffffffff8102b176>] ? do_audit_syscall_entry+0x66/0x70
[34560.065626]  [<ffffffff8102c773>] ? syscall_trace_enter_phase1+0x113/0x170
[34560.066245]  [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[34560.066868]  [<ffffffff812251ae>] SyS_newstat+0xe/0x10
[34560.067533]  [<ffffffff817a5df2>] entry_SYSCALL_64_fastpath+0x16/0x7a
[34560.068173] Code: 0f 1f 44 00 00 0f 1f 44 00 00 55 4c 8d 87 e8 01 00 00 48 89 e5 53 48 89 fb 48 83 ec 08 4c 8b 0e 49 8b 41 18 4c 39 ce 48 8b 40 40 <4c> 8b 50 68 74 24 48 8b 87 e8 01 00 00 48 8b 7e 08 4d 89 41 08
[34560.069609] RIP  [<ffffffffa05423d6>] nfs_init_commit+0x26/0x130 [nfs]
[34560.070295]  RSP <ffff880045fefc18>
[34560.071008] CR2: 0000000000000068
[34560.073207] ---[ end trace f85f873260977406 ]---

[fixes 27571297a7e(pNFS: Tighten up locking around DS commit buckets)]
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-20 13:43:53 -05:00
Ross Zwisler
e2e05394e4 pmem, dax: have direct_access use __pmem annotation
Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer.  This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-20 14:07:24 -04:00
Ross Zwisler
2765cfbb34 dax: update I/O path to do proper PMEM flushing
Update the DAX I/O path so that all operations that store data (I/O
writes, zeroing blocks, punching holes, etc.) properly synchronize the
stores to media using the PMEM API.  This ensures that the data DAX is
writing is durable on media before the operation completes.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-20 14:07:24 -04:00
Jaegeuk Kim
24928634f8 f2fs: check the node block address of newly allocated nid
This patch adds a routine which checks the block address of newly allocated nid.
If an nid has already allocated by other thread due to subtle data races, it
will result in filesystem corruption.
So, it needs to check whether its block address was already allocated or not
in prior to nid allocation as the last chance.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:14 -07:00
Jaegeuk Kim
a21c20f0c8 f2fs: go out for insert_inode_locked failure
We should not call unlock_new_inode when insert_inode_locked failed.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:13 -07:00
Jaegeuk Kim
5ee5293c32 f2fs: retry gc if one section is not successfully reclaimed
If FG_GC failed to reclaim one section, let's retry with another section
from the start, since we can get anoterh good candidate.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:12 -07:00
Jaegeuk Kim
2286c0205d f2fs: fix to cover lock_op for update_inode_page
Previously, update_inode_page is not called under f2fs_lock_op.
Instead we should call with f2fs_write_inode.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:11 -07:00
Jaegeuk Kim
2683446646 f2fs: reuse nids more aggressively
If we can reuse nids as many as possible, we can mitigate producing obsolete
node pages in the page cache.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:11 -07:00
Jaegeuk Kim
26d5859974 f2fs: avoid garbage collecting already moved node blocks
If node blocks were already moved, we don't need to move them again.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:10 -07:00
Jaegeuk Kim
740432f835 f2fs: handle failed bio allocation
As the below comment of bio_alloc_bioset, f2fs can allocate multiple bios at the
same time. So, we can't guarantee that bio is allocated all the time.

"
 *   When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
 *   able to allocate a bio. This is due to the mempool guarantees. To make this
 *   work, callers must never allocate more than 1 bio at a time from this pool.
 *   Callers that need to allocate more than 1 bio must always submit the
 *   previously allocated bio for IO before attempting to allocate a new one.
 *   Failure to do so can cause deadlocks under memory pressure.
"

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:09 -07:00
Jaegeuk Kim
a6db67f06f f2fs: increase the number of max hard links
This patch increases the number of maximum hard links for one file.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:08 -07:00
Jaegeuk Kim
798c1b16d1 f2fs: skip checkpoint if there is no dirty and prefree segments
We should avoid needless checkpoints when there is no dirty and prefree segment.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:07 -07:00
Chao Yu
31696580bf f2fs: shrink free_nids entries
This patch introduces __count_free_nids/try_to_free_nids and registers
them in slab shrinker for shrinking under memory pressure.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:06 -07:00
Chao Yu
206e61be29 f2fs: avoid clear valid page
In f2fs_delete_entry, if last dirent is remove from the dentry page,
we will try to punch that page since it has no valid date in it.

But truncate_hole which is used for punching could fail because of
no memory or IO error, if that happened, we'd better skip clearing
this valid dentry page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:06 -07:00
Jaegeuk Kim
315df8398e f2fs: do not write any node pages related to orphan inodes
We should not write node pages when deleting orphan inodes.
In order to do that, we can eaisly set POR_DOING flag earlier before entering
orphan inode routine.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 08:59:42 -07:00
Christopher Oo
5fb4e288a0 cifs: Fix use-after-free on mid_q_entry
With CIFS_DEBUG_2 enabled, additional debug information is tracked inside each
mid_q_entry struct, however cifs_save_when_sent may use the mid_q_entry after it
has been freed from the appropriate callback if the transport layer has very low
latency. Holding the srv_mutex fixes this use-after-free, as cifs_save_when_sent
is called while the srv_mutex is held while the request is sent.

Signed-off-by: Christopher Oo <t-chriso@microsoft.com>
2015-08-20 10:19:25 -05:00
Steve French
0a6d0b6412 Update cifs version number
Update modinfo cifs.ko version number to 2.07

Signed-off-by: Steve French <steve.french@primarydata.com>
2015-08-20 10:19:25 -05:00
Steve French
0de1f4c6f6 Add way to query server fs info for smb3
The server exports information about the share and underlying
device under an SMB3 export, including its attributes and
capabilities, which is stored by cifs.ko when first connecting
to the share.

Add ioctl to cifs.ko to allow user space smb3 helper utilities
(in cifs-utils) to display this (e.g. via smb3util).

This information is also useful for debugging and for
resolving configuration errors.

Signed-off-by: Steve French <steve.french@primarydata.com>
2015-08-20 10:19:25 -05:00
Jan Kara
9181f8bf5a udf: Don't modify filesystem for read-only mounts
When read-write mount of a filesystem is requested but we find out we
can mount the filesystem only in read-only mode, we still modify
LVID in udf_close_lvid(). That is both unnecessary and contrary to
expectation that when we fall back to read-only mount we don't modify
the filesystem.

Make sure we call udf_close_lvid() only if we called udf_open_lvid() so
that filesystem gets modified only if we verified we are allowed to
write to it.

Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jan Kara <jack@suse.com>
2015-08-20 14:58:35 +02:00
Trond Myklebust
2a606188c5 NFSv4: Enable delegated opens even when reboot recovery is pending
Unlike the previous attempt, this takes into account the fact that
we may be calling it from the recovery thread itself. Detect this
by looking at what kind of open we're doing, and checking the state
of the NFS_DELEGATION_NEED_RECLAIM if it turns out we're doing a
reboot reclaim-type open.

Cc: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-19 23:01:54 -05:00
Trond Myklebust
c740624989 pNFS: Fix an unused variable warning in pnfs_roc_get_barrier
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-19 23:01:53 -05:00
Dave Chinner
aa493382cb Merge branch 'xfs-misc-fixes-for-4.3-2' into for-next 2015-08-20 09:28:45 +10:00
Dave Chinner
3403ccc0c9 xfs: inode lockdep annotations broke non-lockdep build
Fix CONFIG_LOCKDEP=n build, because asserts I put in to ensure we
aren't overrunning lockdep subclasses in commit 0952c81 ("xfs:
clean up inode lockdep annotations") use a define that doesn't
exist when CONFIG_LOCKDEP=n

Only check the subclass limits when lockdep is actually enabled.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-20 09:27:49 +10:00
Filipe Manana
b84b8390d6 Btrfs: fix file read corruption after extent cloning and fsync
If we partially clone one extent of a file into a lower offset of the
file, fsync the file, power fail and then mount the fs to trigger log
replay, we can get multiple checksum items in the csum tree that overlap
each other and result in checksum lookup failures later. Those failures
can make file data read requests assume a checksum value of 0, but they
will not return an error (-EIO for example) to userspace exactly because
the expected checksum value 0 is a special value that makes the read bio
endio callback return success and set all the bytes of the corresponding
page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
From a userspace perspective this is equivalent to file corruption
because we are not returning what was written to the file.

Details about how this can happen, and why, are included inline in the
following reproducer test case for fstests and the comment added to
tree-log.c.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with a single 100K extent starting at file
  # offset 800K. We fsync the file here to make the fsync log tree gets
  # a single csum item that covers the whole 100K extent, which causes
  # the second fsync, done after the cloning operation below, to not
  # leave in the log tree two csum items covering two sub-ranges
  # ([0, 20K[ and [20K, 100K[)) of our extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                  -c "fsync"                     \
                   $SCRATCH_MNT/foo | _filter_xfs_io

  # Now clone part of our extent into file offset 400K. This adds a file
  # extent item to our inode's metadata that points to the 100K extent
  # we created before, using a data offset of 20K and a data length of
  # 20K, so that it refers to the sub-range [20K, 40K[ of our original
  # extent.
  $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
      -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # Now fsync our file to make sure the extent cloning is durably
  # persisted. This fsync will not add a second csum item to the log
  # tree containing the checksums for the blocks in the sub-range
  # [20K, 40K[ of our extent, because there was already a csum item in
  # the log tree covering the whole extent, added by the first fsync
  # we did before.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and ummount to simulate a crash/power
  # failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file
  # contents.
  # The fsync log replay first processes the file extent item
  # corresponding to the file offset 400K (the one which refers to the
  # [20K, 40K[ sub-range of our 100K extent) and then processes the file
  # extent item for file offset 800K. It used to happen that when
  # processing the later, it erroneously left in the csum tree 2 csum
  # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
  # 1 for the whole range of our extent. This introduced a problem where
  # subsequent lookups for the checksums of blocks within the range
  # [40K, 100K[ of our extent would not find anything because lookups in
  # the csum tree ended up looking only at the smaller csum item, the
  # one covering the subrange [20K, 40K[. This made read requests assume
  # an expected checksum with a value of 0 for those blocks, which caused
  # checksum verification failure when the read operations finished.
  # However those checksum failure did not result in read requests
  # returning an error to user space (like -EIO for e.g.) because the
  # expected checksum value had the special value 0, and in that case
  # btrfs set all bytes of the corresponding pages with the value 0x01
  # and produce the following warning in dmesg/syslog:
  #
  #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
  #   1322675045 expected csum 0"
  #
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  # Must match the same digest he had after cloning the extent and
  # before the power failure happened.
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:27:46 -07:00
Filipe Manana
1f9b8c8fbc Btrfs: check if previous transaction aborted to avoid fs corruption
While we are committing a transaction, it's possible the previous one is
still finishing its commit and therefore we wait for it to finish first.
However we were not checking if that previous transaction ended up getting
aborted after we waited for it to commit, so we ended up committing the
current transaction which can lead to fs corruption because the new
superblock can point to trees that have had one or more nodes/leafs that
were never durably persisted.
The following sequence diagram exemplifies how this is possible:

          CPU 0                                                        CPU 1

  transaction N starts

  (...)

  btrfs_commit_transaction(N)

    cur_trans->state = TRANS_STATE_COMMIT_START;
    (...)
    cur_trans->state = TRANS_STATE_COMMIT_DOING;
    (...)

    cur_trans->state = TRANS_STATE_UNBLOCKED;
    root->fs_info->running_transaction = NULL;

                                                              btrfs_start_transaction()
                                                                 --> starts transaction N + 1

    btrfs_write_and_wait_transaction(trans, root);
      --> starts writing all new or COWed ebs created
          at transaction N

                                                              creates some new ebs, COWs some
                                                              existing ebs but doesn't COW or
                                                              deletes eb X

                                                              btrfs_commit_transaction(N + 1)
                                                                (...)
                                                                cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                (...)
                                                                wait_for_commit(root, prev_trans);
                                                                  --> prev_trans == transaction N

    btrfs_write_and_wait_transaction() continues
    writing ebs
       --> fails writing eb X, we abort transaction N
           and set bit BTRFS_FS_STATE_ERROR on
           fs_info->fs_state, so no new transactions
           can start after setting that bit

       cleanup_transaction()
         btrfs_cleanup_one_transaction()
           wakes up task at CPU 1

                                                                continues, doesn't abort because
                                                                cur_trans->aborted (transaction N + 1)
                                                                is zero, and no checks for bit
                                                                BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                are made

                                                                btrfs_write_and_wait_transaction(trans, root);
                                                                  --> succeeds, no errors during writeback

                                                                write_ctree_super(trans, root, 0);
                                                                  --> succeeds
                                                                  --> we have now a superblock that points us
                                                                      to some root that uses eb X, which was
                                                                      never written to disk

In this scenario future attempts to read eb X from disk results in an
error message like "parent transid verify failed on X wanted Y found Z".

So fix this by aborting the current transaction if after waiting for the
previous transaction we verify that it was aborted.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:27:31 -07:00
Michal Hocko
277fb5fc17 btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
transaction but this allocation context is rather weak wrt. reclaim
capabilities. The page allocator currently tries hard to not fail these
allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
still fail if the _current_ process is the OOM killer victim. Moreover
there is an attempt to move away from the default no-fail behavior and
allow these allocation to fail more eagerly. This would lead to:

[   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

which is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Michal Hocko
d1b5c5671d btrfs: Prevent from early transaction abort
Btrfs relies on GFP_NOFS allocation when committing the transaction but
this allocation context is rather weak wrt. reclaim capabilities. The
page allocator currently tries hard to not fail these allocations if
they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
currently but there is an attempt to move away from the default no-fail
behavior and allow these allocation to fail more eagerly. And this would
lead to a pre-mature transaction abort as follows:

[   55.328093] Call Trace:
[   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[   55.384280] ------------[ cut here ]------------
[   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[   55.384337] Call Trace:
[   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[   55.384608] ---[ end trace c29799da1d4dd621 ]---
[   55.437323] BTRFS info (device hdb1): forced readonly
[   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by being explicit about the no-fail behavior of this allocation
path and use __GFP_NOFAIL.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Zhaolei
60d53eb310 btrfs: Remove unused arguments in tree-log.c
Following arguments are not used in tree-log.c:
 insert_one_name(): path, type
 wait_log_commit(): trans
 wait_for_writer(): trans

This patch remove them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Zhaolei
34eb2a5249 btrfs: Remove useless condition in start_log_trans()
Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
for start_log_trans():
 fs/btrfs/tree-log.c:178 start_log_trans()
 warn: we tested 'root->log_root' before and it was 'false'

 fs/btrfs/tree-log.c
 147          if (root->log_root) {
 We test "root->log_root" here.
 ...

Reason:
 Condition of:
 fs/btrfs/tree-log.c:178: if (!root->log_root) {
 is not necessary after commit: 7237f1833

 It caused a smatch warning, and no functionally error.

Fix:
 Deleting above condition will make smatch shut up,
 but a better way is to do cleanup for start_log_trans()
 to remove duplicated code and make code more readable.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:24:49 -07:00
Peng Tao
69f230d907 NFS41/flexfiles: update inode after write finishes
Otherwise we break fstest case tests/read_write/mctime.t

Does files layout need the same fix as well?

Cc: stable@vger.kernel.org # v4.0+
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-19 13:51:16 -05:00
Peng Tao
e755d638e9 NFS41: make sure sending LAYOUTRETURN before close if marked so
If layout is marked by NFS_LAYOUT_RETURN_BEFORE_CLOSE, we should always
send LAYOUTRETURN before close, and we don't need to do ROC drain if we
do send LAYOUTRETURN.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-19 10:29:25 -05:00