1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

60722 commits

Author SHA1 Message Date
Dave Chinner
5e96fa8d2b xfs: factor iclog state processing out of xlog_state_do_callback()
The iclog IO completion state processing is somewhat complex, and
because it's inside two nested loops it is highly indented and very
hard to read. Factor it out, flatten the logic flow and clean up the
comments so that it much easier to see what the code is doing both
in processing the individual iclogs and in the over
xlog_state_do_callback() operation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-05 21:36:12 -07:00
Dave Chinner
6546818c85 xfs: factor callbacks out of xlog_state_do_callback()
Simplify the code flow by lifting the iclog callback work out of
the main iclog iteration loop. This isolates the log juggling and
callbacks from the iclog state change logic in the loop.

Note that the loopdidcallbacks variable is not actually tracking
whether callbacks are actually run - it is tracking whether the
icloglock was dropped during the loop and so determines if we
completed the entire iclog scan loop atomically. Hence we know for
certain there are either no more ordered completions to run or
that the next completion will run the remaining ordered iclog
completions. Hence rename that variable appropriately for it's
function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-05 21:36:12 -07:00
Dave Chinner
6769aa2a4f xfs: factor debug code out of xlog_state_do_callback()
Start making this function readable by lifting the debug code into
a conditional function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-05 21:36:12 -07:00
Dave Chinner
8ab39f11d9 xfs: prevent CIL push holdoff in log recovery
generic/530 on a machine with enough ram and a non-preemptible
kernel can run the AGI processing phase of log recovery enitrely out
of cache. This means it never blocks on locks, never waits for IO
and runs entirely through the unlinked lists until it either
completes or blocks and hangs because it has run out of log space.

It runs out of log space because the background CIL push is
scheduled but never runs. queue_work() queues the CIL work on the
current CPU that is busy, and the workqueue code will not run it on
any other CPU. Hence if the unlinked list processing never yields
the CPU voluntarily, the push work is delayed indefinitely. This
results in the CIL aggregating changes until all the log space is
consumed.

When the log recoveyr processing evenutally blocks, the CIL flushes
but because the last iclog isn't submitted for IO because it isn't
full, the CIL flush never completes and nothing ever moves the log
head forwards, or indeed inserts anything into the tail of the log,
and hence nothing is able to get the log moving again and recovery
hangs.

There are several problems here, but the two obvious ones from
the trace are that:
	a) log recovery does not yield the CPU for over 4 seconds,
	b) binding CIL pushes to a single CPU is a really bad idea.

This patch addresses just these two aspects of the problem, and are
suitable for backporting to work around any issues in older kernels.
The more fundamental problem of preventing the CIL from consuming
more than 50% of the log without committing will take more invasive
and complex work, so will be done as followup work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-05 21:36:12 -07:00
Rik van Riel
cdea5459ce xfs: fix missed wakeup on l_flush_wait
The code in xlog_wait uses the spinlock to make adding the task to
the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
with respect to the waker.

Doing the wakeup after releasing the spinlock opens up the following
race condition:

Task 1					task 2
add task to wait queue
					wake up task
set task state to UNINTERRUPTIBLE

This issue was found through code inspection as a result of kworkers
being observed stuck in UNINTERRUPTIBLE state with an empty
wait queue. It is rare and largely unreproducable.

Simply moving the spin_unlock to after the wake_up_all results
in the waker not being able to see a task on the waitqueue before
it has set its state to UNINTERRUPTIBLE.

This bug dates back to the conversion of this code to generic
waitqueue infrastructure from a counting semaphore back in 2008
which didn't place the wakeups consistently w.r.t. to the relevant
spin locks.

[dchinner: Also fix a similar issue in the shutdown path on
xc_commit_wait. Update commit log with more details of the issue.]

Fixes: d748c62367 ("[XFS] Convert l_flushsema to a sv_t")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-05 21:36:12 -07:00
Dave Chinner
7c107afb87 xfs: push the AIL in xlog_grant_head_wake
In the situation where the log is full and the CIL has not recently
flushed, the AIL push threshold is throttled back to the where the
last write of the head of the log was completed. This is stored in
log->l_last_sync_lsn. Hence if the CIL holds > 25% of the log space
pinned by flushes and/or aggregation in progress, we can get the
situation where the head of the log lags a long way behind the
reservation grant head.

When this happens, the AIL push target is trimmed back from where
the reservation grant head wants to push the log tail to, back to
where the head of the log currently is. This means the push target
doesn't reach far enough into the log to actually move the tail
before the transaction reservation goes to sleep.

When the CIL push completes, it moves the log head forward such that
the AIL push target can now be moved, but that has no mechanism for
puhsing the log tail. Further, if the next tail movement of the log
is not large enough wake the waiter (i.e. still not enough space for
it to have a reservation granted), we don't wake anything up, and
hence we do not update the AIL push target to take into account the
head of the log moving and allowing the push target to be moved
forwards.

To avoid this particular condition, if we fail to wake the first
waiter on the grant head because we don't have enough space,
push on the AIL again. This will pick up any movement of the log
head and allow the push target to move forward due to completion of
CIL pushing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-05 21:36:12 -07:00
Austin Kim
eb2e99943c xfs: Use WARN_ON_ONCE for bailout mount-operation
If the CONFIG_BUG is enabled, BUG is executed and then system is crashed.
However, the bailout for mount is no longer proceeding.

Using WARN_ON_ONCE rather than BUG can prevent this situation.

Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-05 21:36:12 -07:00
Al Viro
df02450217 make ramfs_fill_super() static
all users should just call ramfs_mount()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:34:27 -04:00
David Howells
5a2be1288b vfs: Convert squashfs to use the new mount API
Convert the squashfs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Phillip Lougher <phillip@squashfs.org.uk>
cc: squashfs-devel@lists.sourceforge.net
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:34:26 -04:00
David Howells
ec10a24f10 vfs: Convert jffs2 to use the new mount API
Convert the jffs2 filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Woodhouse <dwmw2@infradead.org>
cc: linux-mtd@lists.infradead.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:34:25 -04:00
David Howells
74f78fc5ef vfs: Convert cramfs to use the new mount API
Convert the cramfs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Nicolas Pitre <nico@fluxnic.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
cc: linux-mtd@lists.infradead.org
cc: linux-block@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:34:24 -04:00
David Howells
b941759985 vfs: Convert romfs to use the new mount API
Convert the romfs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-mtd@lists.infradead.org
cc: linux-block@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:34:24 -04:00
David Howells
43ce4c1fea vfs: Add a single-or-reconfig keying to vfs_get_super()
Add an additional keying mode to vfs_get_super() to indicate that only a
single superblock should exist in the system, and that, if it does, further
mounts should invoke reconfiguration upon it.

This allows mount_single() to be replaced.

[Fix by Eric Biggers folded in]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:34:23 -04:00
David Howells
fe62c3a4e1 vfs: Create fs_context-aware mount_bdev() replacement
Create a function, get_tree_bdev(), that is fs_context-aware and a
->get_tree() counterpart of mount_bdev().

It caches the block device pointer in the fs_context struct so that this
information can be passed into sget_fc()'s test and set functions.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:34:22 -04:00
Al Viro
533770cc0a new helper: get_tree_keyed()
For vfs_get_keyed_super users.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:34:22 -04:00
Eric Biggers
1dd9bc08cf vfs: set fs_context::user_ns for reconfigure
fs_context::user_ns is used by fuse_parse_param(), even during remount,
so it needs to be set to the existing value for reconfigure.

Reproducer:

	#include <fcntl.h>
	#include <sys/mount.h>

	int main()
	{
		char opts[128];
		int fd = open("/dev/fuse", O_RDWR);

		sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
		mkdir("mnt", 0777);
		mount("foo",  "mnt", "fuse.foo", 0, opts);
		mount("foo", "mnt", "fuse.foo", MS_REMOUNT, opts);
	}

Crash:
	BUG: kernel NULL pointer dereference, address: 0000000000000000
	#PF: supervisor read access in kernel mode
	#PF: error_code(0x0000) - not-present page
	PGD 0 P4D 0
	Oops: 0000 [#1] SMP
	CPU: 0 PID: 129 Comm: syz_make_kuid Not tainted 5.3.0-rc5-next-20190821 #3
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
	RIP: 0010:map_id_range_down+0xb/0xc0 kernel/user_namespace.c:291
	[...]
	Call Trace:
	 map_id_down kernel/user_namespace.c:312 [inline]
	 make_kuid+0xe/0x10 kernel/user_namespace.c:389
	 fuse_parse_param+0x116/0x210 fs/fuse/inode.c:523
	 vfs_parse_fs_param+0xdb/0x1b0 fs/fs_context.c:145
	 vfs_parse_fs_string+0x6a/0xa0 fs/fs_context.c:188
	 generic_parse_monolithic+0x85/0xc0 fs/fs_context.c:228
	 parse_monolithic_mount_data+0x1b/0x20 fs/fs_context.c:708
	 do_remount fs/namespace.c:2525 [inline]
	 do_mount+0x39a/0xa60 fs/namespace.c:3107
	 ksys_mount+0x7d/0xd0 fs/namespace.c:3325
	 __do_sys_mount fs/namespace.c:3339 [inline]
	 __se_sys_mount fs/namespace.c:3336 [inline]
	 __x64_sys_mount+0x20/0x30 fs/namespace.c:3336
	 do_syscall_64+0x4a/0x1a0 arch/x86/entry/common.c:290
	 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: syzbot+7d6a57304857423318a5@syzkaller.appspotmail.com
Fixes: 408cbe695350 ("vfs: Convert fuse to use the new mount API")
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05 14:33:45 -04:00
Gao Xiang
618f40ea02 erofs: use read_cache_page_gfp for erofs_get_meta_page
As Christoph said [1], "I'd much prefer to just use
read_cache_page_gfp, and live with the fact that this
allocates bufferheads behind you for now.  I'll try to
speed up my attempts to get rid of the buffer heads on
the block device mapping instead. "

This simplifies the code a lot and a minor thing is
"no REQ_META (e.g. for blktrace) on metadata at all..."

[1] https://lore.kernel.org/r/20190903153704.GA2201@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-26-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:09 +02:00
Gao Xiang
4231138fe0 erofs: always use iget5_locked
As Christoph said [1] [2], "Just use the slightly
more complicated 32-bit version everywhere so that
you have a single actually tested code path.
And then remove this helper. "

[1] https://lore.kernel.org/r/20190829102426.GE20598@infradead.org/
[2] https://lore.kernel.org/r/20190902125320.GA16726@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-25-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:09 +02:00
Gao Xiang
fe7c242357 erofs: use read_mapping_page instead of sb_bread
As Christoph said [1], "This seems to be your only direct
use of buffer heads, which while not deprecated are a bit
of an ugly step child.  So if you can easily avoid creating
a buffer_head dependency in a new filesystem I think you
should avoid it. "

[1] https://lore.kernel.org/r/20190902125109.GA9826@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-24-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:09 +02:00
Gao Xiang
4f761fa253 erofs: rename errln/infoln/debugln to erofs_{err, info, dbg}
Add prefix "erofs_" to these functions and print
sb->s_id as a prefix to erofs_{err, info} so that
the user knows which file system is affected.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-23-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:09 +02:00
Gao Xiang
84947eb603 erofs: save one level of indentation
As Christoph said [1], ".. and save one
level of indentation."

[1] https://lore.kernel.org/r/20190829102426.GE20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-22-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:09 +02:00
Gao Xiang
73d03931be erofs: kill use_vmap module parameter
As Christoph said [1],
"vm_map_ram is supposed to generally behave better.  So if
it doesn't please report that that to the arch maintainer
and linux-mm so that they can look into the issue.  Having
user make choices of deep down kernel internals is just
a horrible interface.

Please talk to maintainers of other bits of the kernel
if you see issues and / or need enhancements. "

Let's redo the previous conclusion and kill the vmap
approach.

[1] https://lore.kernel.org/r/20190830165533.GA10909@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-21-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:09 +02:00
Gao Xiang
e2c71e74b2 erofs: kill all erofs specific fault injection
As Christoph suggested [1], "Please just use plain kmalloc
everywhere and let the normal kernel error injection code
take care of injeting any errors."

[1] https://lore.kernel.org/r/20190829102426.GE20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-20-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:08 +02:00
Gao Xiang
99634bf388 erofs: add "erofs_" prefix for common and short functions
Add erofs_ prefix to free_inode, alloc_inode, ...

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-19-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:08 +02:00
Gao Xiang
94e4e153b1 erofs: kill __submit_bio()
As Christoph pointed out [1], "
Why is there __submit_bio which really just obsfucates
what is going on?  Also why is __submit_bio using
bio_set_op_attrs instead of opencode it as the comment
right next to it asks you to? "

Let's use submit_bio directly instead.

[1] https://lore.kernel.org/r/20190830162812.GA10694@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-18-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:08 +02:00
Gao Xiang
e655b5b3a2 erofs: kill prio and nofail of erofs_get_meta_page()
As Christoph pointed out [1],
"Why is there __erofs_get_meta_page with the two weird
booleans instead of a single erofs_get_meta_page that
gets and gfp_t for additional flags and an unsigned int
for additional bio op flags."

And since all callers can handle errors, let's kill
prio and nofail and erofs_get_inline_page() now.

[1] https://lore.kernel.org/r/20190830162812.GA10694@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-17-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:08 +02:00
Gao Xiang
a5c0b7802c erofs: localize erofs_grab_bio()
As Christoph pointed out [1], "erofs_grab_bio tries to
handle a bio_alloc failure, except that the function will
not actually fail due the mempool backing it."

Sorry about useless code, fix it now and
localize erofs_grab_bio [2].

[1] https://lore.kernel.org/r/20190830162812.GA10694@infradead.org/
[2] https://lore.kernel.org/r/20190902122016.GL15931@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-16-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:08 +02:00
Gao Xiang
688a5f2ed4 erofs: kill verbose debug info in erofs_fill_super
As Christoph said [1], "That is some very verbose
debug info.  We usually don't add that and let
people trace the function instead. "

[1] https://lore.kernel.org/r/20190829101545.GC20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-15-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:08 +02:00
Gao Xiang
0259f20948 erofs: use dsb instead of layout for ondisk super_block
As Christoph pointed out [1], "Why is the variable name
for the on-disk subperblock layout? We usually still
calls this something with sb in the name, e.g. dsb.
for disksuper block. " Let's fix it.

[1] https://lore.kernel.org/r/20190829101545.GC20598@infradead.org/
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-14-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:08 +02:00
Gao Xiang
a2c75c8143 erofs: better erofs symlink stuffs
Fix as Christoph suggested [1] [2], "remove is_inode_fast_symlink
and just opencode it in the few places using it"

and
"Please just set the ops directly instead of obsfucating that in
a single caller, single line inline function.  And please set it
instead of the normal symlink iops in the same place where you
also set those."

[1] https://lore.kernel.org/r/20190830163910.GB29603@infradead.org/
[2] https://lore.kernel.org/r/20190829102426.GE20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-13-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
2d78c209b9 erofs: update comments in inode.c
As Christoph suggested [1], update them all.

[1] https://lore.kernel.org/r/20190829102426.GE20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-12-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
ea559e7b84 erofs: update erofs_fs.h comments
As Christoph said [1] [2], update it now.

[1] https://lore.kernel.org/r/20190902124521.GA22153@infradead.org/
[2] https://lore.kernel.org/r/20190902120548.GB15931@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-11-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
a5876e24f1 erofs: use erofs_inode naming
As Christoph suggested [1], "Why is this called vnode instead
of inode?  That seems like a rather odd naming for a Linux
file system."

[1] https://lore.kernel.org/r/20190829101545.GC20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-10-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
1c2dfbf9c2 erofs: kill erofs_{init,exit}_inode_cache
As Christoph said [1] "having this function seems
entirely pointless", let's kill those.

filesystem                              function name
ext2,f2fs,ext4,isofs,squashfs,cifs,...  init_inodecache

In addition, add a necessary "rcu_barrier()" on exit_fs();

[1] https://lore.kernel.org/r/20190829101545.GC20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-9-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
8a76568225 erofs: better naming for erofs inode related stuffs
updates inode naming
 - kill is_inode_layout_compression [1]
 - kill magic underscores [2] [3]
 - better naming for datamode & data_mapping_mode [3]
 - better naming erofs_inode_{compact, extended} [4]

[1] https://lore.kernel.org/r/20190829102426.GE20598@infradead.org/
[2] https://lore.kernel.org/r/20190829102426.GE20598@infradead.org/
[3] https://lore.kernel.org/r/20190902122627.GN15931@infradead.org/
[4] https://lore.kernel.org/r/20190902125438.GA17750@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-8-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
426a930891 erofs: use feature_incompat rather than requirements
As Christoph said [1], "This is only cosmetic, why
not stick to feature_compat and feature_incompat?"

In my thought, requirements means "incompatible"
instead of "feature" though.

[1] https://lore.kernel.org/r/20190902125109.GA9826@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-7-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
c39747f770 erofs: update erofs_inode_is_data_compressed helper
As Christoph said, "This looks like a really obsfucated
way to write:
	return datamode == EROFS_INODE_FLAT_COMPRESSION ||
		datamode == EROFS_INODE_FLAT_COMPRESSION_LEGACY; "

Although I had my own consideration, it's the right way for now.

[1] https://lore.kernel.org/r/20190829095954.GB20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-6-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
ed34aa4a8a erofs: kill __packed for on-disk structures
As Christoph suggested "Please don't add __packed" [1],
remove all __packed except struct erofs_dirent here.

Note that all on-disk fields except struct erofs_dirent
(12 bytes with a 8-byte nid) in EROFS are naturally aligned.

[1] https://lore.kernel.org/r/20190829095954.GB20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-5-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
b6796abd3c erofs: some macros are much more readable as a function
As Christoph suggested [1], these macros are much
more readable as a function.

[1] https://lore.kernel.org/r/20190829095954.GB20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-4-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:07 +02:00
Gao Xiang
60a49ba8fe erofs: on-disk format should have explicitly assigned numbers
As Christoph suggested [1], on-disk format should have
explicitly assigned numbers.

[1] https://lore.kernel.org/r/20190829095954.GB20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-3-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:06 +02:00
Gao Xiang
4b66eb51d2 erofs: remove all the byte offset comments
As Christoph suggested [1], "Please remove all the byte offset comments.
that is something that can easily be checked with gdb or pahole."

[1] https://lore.kernel.org/r/20190829095954.GB20598@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Link: https://lore.kernel.org/r/20190904020912.63925-2-gaoxiang25@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-05 20:10:06 +02:00
Deepa Dinamani
cba465b4f9 ext4: Reduce ext4 timestamp warnings
When ext4 file systems were created intentionally with 128 byte inodes,
the rate-limited warning of eventual possible timestamp overflow are
still emitted rather frequently.  Remove the warning for now.

Discussion for whether any warning is needed,
and where it should be emitted, can be found at
https://lore.kernel.org/lkml/1567523922.5576.57.camel@lca.pw/.
I can post a separate follow-up patch after the conclusion.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-09-04 22:54:53 +02:00
Al Viro
b0841eefd9 configfs: provide exclusion between IO and removals
Make sure that attribute methods are not called after the item
has been removed from the tree.  To do so, we
	* at the point of no return in removals, grab ->frag_sem
exclusive and mark the fragment dead.
	* call the methods of attributes with ->frag_sem taken
shared and only after having verified that the fragment is still
alive.

	The main benefit is for method instances - they are
guaranteed that the objects they are accessing *and* all ancestors
are still there.  Another win is that we don't need to bother
with extra refcount on config_item when opening a file -
the item will be alive for as long as it stays in the tree, and
we won't touch it/attributes/any associated data after it's
been removed from the tree.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 22:33:51 +02:00
Bob Peterson
ad26967b9a gfs2: Use async glocks for rename
Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
reverse the roles of which directories are "old" and which are "new" for
the purposes of rename. This can cause deadlocks where two nodes end up
waiting for each other.

There can be several layers of directory dependencies across many nodes.

This patch fixes the problem by acquiring all gfs2_rename's inode glocks
asychronously and waiting for all glocks to be acquired.  That way all
inodes are locked regardless of the order.

The timeout value for multiple asynchronous glocks is calculated to be
the total of the individual wait times for each glock times two.

Since gfs2_exchange is very similar to gfs2_rename, both functions are
patched in the same way.

A new async glock wait queue, sd_async_glock_wait, keeps a list of
waiters for these events. If gfs2's holder_wake function detects an
async holder, it wakes up any waiters for the event. The waiter only
tests whether any of its requests are still pending.

Since the glocks are sent to dlm asychronously, the wait function needs
to check to see which glocks, if any, were granted.

If a glock is granted by dlm (and therefore held), its minimum hold time
is checked and adjusted as necessary, as other glock grants do.

If the event times out, all glocks held thus far must be dequeued to
resolve any existing deadlocks.  Then, if there are any outstanding
locking requests, we need to loop around and wait for dlm to respond to
those requests too.  After we release all requests, we return -ESTALE to
the caller (vfs rename) which loops around and retries the request.

    Node1           Node2
    ---------       ---------
1.  Enqueue A       Enqueue B
2.  Enqueue B       Enqueue A
3.  A granted
6.                  B granted
7.  Wait for B
8.                  Wait for A
9.                  A times out (since Node 1 holds A)
10.                 Dequeue B (since it was granted)
11.                 Wait for all requests from DLM
12. B Granted (since Node2 released it in step 10)
13. Rename
14. Dequeue A
15.                 DLM Grants A
16.                 Dequeue A (due to the timeout and since we
                    no longer have B held for our task).
17. Dequeue B
18.                 Return -ESTALE to vfs
19.                 VFS retries the operation, goto step 1.

This release-all-locks / acquire-all-locks may slow rename / exchange
down as both nodes struggle in the same way and do the same thing.
However, this will only happen when there is contention for the same
inodes, which ought to be rare.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-09-04 20:22:17 +02:00
Andreas Gruenbacher
01123cf17c gfs2: create function gfs2_glock_update_hold_time
This patch moves the code that updates glock minimum hold
time to a separate function. This will be called by a future
patch.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2019-09-04 20:22:17 +02:00
Bob Peterson
bc74aaefdd gfs2: separate holder for rgrps in gfs2_rename
Before this patch, gfs2_rename added a holder for the rgrp glock to
its array of holders, ghs. There's nothing wrong with that, but this
patch separates it into a separate holder. This is done to ensure
it's always locked last as per the proper glock lock ordering,
and also to pave the way for a future patch in which we will
lock the non-rgrp glocks asynchronously.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-09-04 20:22:17 +02:00
Markus Elfring
bccaef9073 gfs2: Delete an unnecessary check before brelse()
The brelse() function tests whether its argument is NULL and then
returns immediately.  Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

[The same applies to brelse() in gfs2_dir_no_add (which Coccinelle
apparently missed), so fix that as well.]

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-09-04 20:22:17 +02:00
Andreas Gruenbacher
45eb05042d gfs2: Minor PAGE_SIZE arithmetic cleanups
Replace divisions by PAGE_SIZE with shifts by PAGE_SHIFT and similar.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-09-04 20:22:06 +02:00
Markus Elfring
4eb09e1112 fs-udf: Delete an unnecessary check before brelse()
The brelse() function tests whether its argument is NULL
and then returns immediately.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Link: https://lore.kernel.org/r/a254c1d1-0109-ab51-c67a-edc5c1c4b4cd@web.de
Signed-off-by: Jan Kara <jack@suse.cz>
2019-09-04 18:19:43 +02:00
Markus Elfring
18c2433cb8 ext2: Delete an unnecessary check before brelse()
The brelse() function tests whether its argument is NULL
and then returns immediately.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Link: https://lore.kernel.org/r/51dea296-2207-ebc0-bac3-13f3e5c3b235@web.de
Signed-off-by: Jan Kara <jack@suse.cz>
2019-09-04 18:19:43 +02:00