Shorten all the metadata repair xfs_repair_* symbols to xrep_.
Whitespace damage will be fixed by a subsequent patch. There are no
functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Shorten all the metadata checking xfs_scrub_ prefixes to xchk_. After
this, the only xfs_scrub* symbols are the ones that pertain to both
scrub and repair. Whitespace damage will be fixed in a subsequent
patch. There are no functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Less trivial cleanups of the error argument to xfs_btree_del_cursor;
these require some minor code refactoring.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The error argument to xfs_btree_del_cursor already understands the
"nonzero for error" semantics, so remove pointless error testing in the
callers and pass it directly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The following assertion was seen on generic/051:
XFS: Assertion failed: tp->t_firstblock == NULLFSBLOCK, file: fs/xfs/libxfs5
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 20757 Comm: fsstress Not tainted 4.18.0-rc4+ #3969
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/4
RIP: 0010:assfail+0x23/0x30
Code: c3 66 0f 1f 44 00 00 48 89 f1 41 89 d0 48 c7 c6 88 e0 8c 82 48 89 fa
RSP: 0018:ffff88012dc43c08 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88012dc43ca0 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828480eb
RBP: ffff88012aa92758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: f000000000000000 R12: 0000000000000000
R13: ffff88012dc43d48 R14: ffff88013092e7e8 R15: 0000000000000014
FS: 00007f8d689b8e80(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8d689c7000 CR3: 000000012ba6a000 CR4: 00000000000006e0
Call Trace:
xfs_defer_init+0xff/0x160
xfs_reflink_remap_extent+0x31b/0xa00
xfs_reflink_remap_blocks+0xec/0x4a0
xfs_reflink_remap_range+0x3a1/0x650
xfs_file_dedupe_range+0x39/0x50
vfs_dedupe_file_range+0x218/0x260
do_vfs_ioctl+0x262/0x6a0
? __se_sys_newfstat+0x3c/0x60
ksys_ioctl+0x35/0x60
__x64_sys_ioctl+0x11/0x20
do_syscall_64+0x4b/0x190
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The root cause of the assertion failure is that xfs_defer_finish doesn't
roll the transaction after processing all the deferred items. Therefore
it returns a dirty transaction to the caller, which leaves the caller at
risk of exceeding the transaction reservation if it logs more items.
Brian Foster's patchset to move the defer_ops firstblock into the
transaction requires t_firstblock == NULLFSBLOCK upon defer_ops
initialization, which is how this was noticed at all.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Check the leaf attribute freemap when we're verifying the block.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pull vfs fixes from Al Viro:
"Fix several places that screw up cleanups after failures halfway
through opening a file (one open-coding filp_clone_open() and getting
it wrong, two misusing alloc_file()). That part is -stable fodder from
the 'work.open' branch.
And Christoph's regression fix for uapi breakage in aio series;
include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
definition of sigset_t, the reason for doing so in the first place had
been bogus - there's no need to expose struct __aio_sigset in
aio_abi.h at all"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: don't expose __aio_sigset in uapi
ocxlflash_getfile(): fix double-iput() on alloc_file() failures
cxl_getfile(): fix double-iput() on alloc_file() failures
drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
uuid_le_to_bin() is deprecated API and take into consideration that variable,
to where we store parsed data, is type of guid_t we switch to guid_parse()
for sake of consistency.
While here, add error checking to it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180720014726.24031-10-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltSwkIACgkQxWXV+ddt
WDtMBQ//UXMjHaXvFmC0SM6NczuYQR51hLYtJFIKig93XK5goVpUBTNbxO7LX/Tn
4zKoKyVhkW1884V6mRiC+G23QLbo0BQZA7DExfyJ3jylQdjZMBm+K+r19OtGQf5v
CII7oUwni03KIiXIqiFAL5dLWebVQpG5EKJbh8GLZsmg6xNcyVaUqZ/fHXajbZiv
ldEBtHBKIv7WWTJmylMBKMWnRz+jqU91fXPahoU6R5qivODrLt1o/PMuSjVNhaxe
iDldHfdOaiQmLHB/1kOGyv492oW5mSSVNDE8LjEDZ61tDNlAcUyuKUWIRBxDEDtD
6D7rlVQXJ/N7sJ6+UYmJKsRpHL+NOkyzSZ0QEU/sm1Xpm8gkhHuuofRPrVCtd3l1
ZSbwvlrdyjigVEBfM3IbToQ/K6Rc1ZGId20OAs9PCQbb+mj9IxPIncZ7pI1c4hlh
pPEjcYsp14JbCTjctFalcqTiFY5tHRQsx+GUFnDyOcdL7Mi+CoH+0Jy61Vgz9GQE
7s934cfEC0ot/f66kAL/PZzxUfC7TePqaa+sDfS5BIkJ4M6lPMxS5De5R4Z0+Nzr
DXgQAlgXmxfRjpOYMTH9D0EDdSeJaNmVHgk7hFbiYk/KX3oyd4NmgI9Cfao8rQJv
2yd8wF2httfSJKD4b/Hv9r6Ho/Bw9PK59BvWOKYhSj6IGl32utw=
=f7eB
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix of a corruption regarding fsync and clone, under some very
specific conditions explained in the patch.
The fix is marked for stable 3.16+ so I'd like to get it merged now
given the impact"
* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix file data corruption after cloning a range and fsync
Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.
The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.
We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.
Right now those functions are literally just wrappers around the
kmem_cache_*() calls. This is a purely mechanical conversion:
# new vma:
kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
# copy old vma
kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
# free vma
kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In parse_options(), if match_strdup() failed, parse_options() leaves
opts->iocharset in unexpected state (i.e. still pointing the freed
string). And this can be the cause of double free.
To fix, this initialize opts->iocharset always when freeing.
Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This passes the information we already have at the call sight into
do_send_sig_info. Ultimately allowing for better handling of signals
sent to a group of processes during fork.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When f_setown is called a pid and a pid type are stored. Replace the use
of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
group. Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
is only for a thread.
Update the users of __f_setown to use PIDTYPE_TGID instead of
PIDTYPE_PID.
For now the code continues to capture task_pid (when task_tgid would
really be appropriate), and iterate on PIDTYPE_PID (even when type ==
PIDTYPE_TGID) out of an abundance of caution to preserve existing
behavior.
Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID
for tgid lookup also be used to avoid taking the tasklist lock.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Everywhere except in the pid array we distinguish between a tasks pid and
a tasks tgid (thread group id). Even in the enumeration we want that
distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array. Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.
The net size increase is just an extra pointer added to struct pid and
an extra pair of pointers of an hlist_node added to task_struct.
The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to it's fullest. The long term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
To access these fields the code always has to go to group leader so
going to signal struct is no loss and is actually a fundamental simplification.
This saves a little bit of memory by only allocating the pid pointer array
once instead of once for every thread, and even better this removes a
few potential races caused by the fact that group_leader can be changed
by de_thread, while signal_struct can not.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.
__task_pid_nr_ns has been updated to take advantage of this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Normally kobjects and their sysfs representation belong to global root,
however it is not necessarily the case for objects in separate namespaces.
For example, objects in separate network namespace logically belong to the
container's root and not global root.
This change lays groundwork for allowing network namespace objects
ownership to be transferred to container's root user by defining
get_ownership() callback in ktype structure and using it in sysfs code to
retrieve desired uid/gid when creating sysfs objects for given kobject.
Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.
When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.
Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
All the bits are in patches before this. So it is time to enable the
metadata only copy up feature.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_copy_up() by default will only do metadata only copy up (if enabled).
That means when ovl_real_ioctl() calls ovl_real_file(), it will still get
the lower file (as ovl_real_file() opens data file and not metacopy). And
that means "chattr +i" will end up modifying lower inode.
There seem to be two ways to solve this.
A. Open metacopy file in ovl_real_ioctl() and do operations on that
B. Force full copy up when FS_IOC_SETFLAGS is called.
I am resorting to option B for now as it feels little safer option. If
there are performance issues due to this, we can revisit it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
truncate should copy up full file (and not do metacopy only), otherwise it
will be broken. For example, use truncate to increase size of a file so
that any read beyong existing size will return null bytes. If we don't
copy up full file, then we end up opening lower file and read from it only
reads upto the old size (and not new size after truncate). Hence to avoid
such situations, copy up data as well when file size changes.
So far it was being done by d_real(O_WRONLY) call in truncate() path. Now
that patch has been reverted. So force full copy up in ovl_setattr() if
size of file is changing.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now we seem to check redirect only if upperdentry is found. But it
is possible that there is no upperdentry but later we found an index.
We need to check redirect on index as well and set it in
ovl_inode->redirect. Otherwise link code can assume that dentry does not
have redirect and place a new one which breaks things. In my testing
overlay/033 test started failing in xfstests. Following are the details.
For example do following.
$ mkdir lower upper work merged
- Make lower dir with 4 links.
$ echo "foo" > lower/l0.txt
$ ln lower/l0.txt lower/l1.txt
$ ln lower/l0.txt lower/l2.txt
$ ln lower/l0.txt lower/l3.txt
- Mount with index on and metacopy on.
$ mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,\
index=on,metacopy=on none merged
- Link lower
$ ln merged/l0.txt merged/l4.txt
(This will metadata copy up of l0.txt and put an absolute redirect
/l0.txt)
$ echo 2 > /proc/sys/vm/drop/caches
$ ls merged/l1.txt
(Now l1.txt will be looked up. There is no upper dentry but there is
lower dentry and index will be found. We don't check for redirect on
index, hence ovl_inode->redirect will be NULL.)
- Link Upper
$ ln merged/l4.txt merged/l5.txt
(Lookup of l4.txt will use inode from l1.txt lookup which is still in
cache. It has ovl_inode->redirect NULL, hence link will put a new
redirect and replace /l0.txt with /l4.txt
- Drop caches.
echo 2 > /proc/sys/vm/drop_caches
- List l1.txt and it returns -ESTALE
$ ls merged/l0.txt
(It returns stale because, we found a metacopy of l0.txt in upper and it
has redirect l4.txt but there is no file named l4.txt in lower layer.
So lower data copy is not found and -ESTALE is returned.)
So problem here is that we did not process redirect on index. Check
redirect on index as well and then problem is fixed.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When we create a hardlink to a metacopy upper file, first the redirect on
that inode. Path based lookup will not work with newly created link and
redirect will solve that issue.
Also use absolute redirect as two hardlinks could be in different
directores and relative redirect will not work.
I have not put any additional locking around setting redirects while
introducing redirects for non-dir files. For now it feels like existing
locking is sufficient. If that's not the case, we will have add more
locking. Following is my rationale about why do I think current locking
seems ok.
Basic problem for non-dir files is that more than on dentry could be
pointing to same inode and in theory only relying on dentry based locks
(d->d_lock) did not seem sufficient.
We set redirect upon rename and upon link creation. In both the paths for
non-dir file, VFS locks both source and target inodes (->i_rwsem). That
means vfs rename and link operations on same source and target can't he
happening in parallel (Even if there are multiple dentries pointing to same
inode). So that probably means that at a time on an inode, only one call
of ovl_set_redirect() could be working and we don't need additional locking
in ovl_set_redirect().
ovl_inode->redirect is initialized only when inode is created new. That
means it should not race with any other path and setting
ovl_inode->redirect should be fine.
Reading of ovl_inode->redirect happens in ovl_get_redirect() path. And
this called only in ovl_set_redirect(). And ovl_set_redirect() already
seemed to be protected using ->i_rwsem. That means ovl_set_redirect() and
ovl_get_redirect() on source/target inode should not make progress in
parallel and is mutually exclusive. Hence no additional locking required.
Now, only case where ovl_set_redirect() and ovl_get_redirect() could race
seems to be case of absolute redirects where ovl_get_redirect() has to
travel up the tree. In that case we already take d->d_lock and that should
be sufficient as directories will not have multiple dentries pointing to
same inode.
So given VFS locking and current usage of redirect, current locking around
redirect seems to be ok for non-dir as well. Once we have the logic to
remove redirect when metacopy file gets copied up, then we probably will
need additional locking.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Set redirect on metacopy files upon rename. This will help find data
dentry in lower dirs.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If a dentry has copy up origin, we set flag OVL_PATH_ORIGIN. So far this
decision was easy that we had to check only for oe->numlower and if it is
non-zero, we knew there is copy up origin. (For non-dir we installed
origin dentry in lowerstack[0]).
But we don't create ORGIN xattr for broken hardlinks (index=off). And with
metacopy feature it is possible that we will install lowerstack[0] but
ORIGIN xattr is not there. It is data dentry of upper metacopy dentry
which has been found using regular name based lookup or using REDIRECT. So
with addition of this new case, just presence of oe->numlower is not
sufficient to guarantee that ORIGIN xattr is present.
So to differentiate between two cases, look at OVL_CONST_INO flag. If this
flag is set and upperdentry is there, that means it can be marked as type
ORIGIN. OVL_CONST_INO is not set if lower hardlink is broken or will be
broken over copy up.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add an ovl_inode flag OVL_CONST_INO. This flag signifies if inode number
will remain constant over copy up or not. This flag does not get updated
over copy up and remains unmodifed after setting once.
Next patch in the series will make use of this flag. It will basically
figure out if dentry is of type ORIGIN or not. And this can be derived by
this flag.
ORIGIN = (upperdentry && ovl_test_flag(OVL_CONST_INO, inode)).
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now OVL_PATH_MERGE is used only for merged directories. But
conceptually, a metacopy dentry (backed by a lower data dentry) is a merged
entity as well.
So mark metacopy dentries as OVL_PATH_MERGE and ovl_rename() makes use of
this property later to set redirect on a metacopy file.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now we rely on path based lookup for data origin of metacopy upper.
This will work only if upper has not been renamed. We solved this problem
already for merged directories using redirect. Use same logic for metacopy
files.
This patch just goes on to check redirects for metacopy files.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move some directory related code in else block. This is pure code
reorganization and no functionality change.
Next patch enables redirect processing on metacopy files and needs this
change. By keeping non-functional changes in a separate patch, next patch
looks much smaller and cleaner.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Metacopy dentry/inode is internal to overlay and is never exposed outside
of it. Exception is metacopy upper file used for fsync(). Modify d_real()
to look for dentries/inode which have data, but also allow matching upper
inode without data for the fsync case.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_open() should open file which contains data and not open metacopy
inode. With the introduction of metacopy inodes, with current
implementaion we will end up opening metacopy inode as well.
But there can be certain circumstances like ovl_fsync() where we want to
allow opening a metacopy inode instead.
Hence, change ovl_open_realfile() and and add extra parameter which
specifies whether to allow opening metacopy inode or not. If this
parameter is false, we look for data inode and open that.
This should allow covering both the cases.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add an helper to retrieve real data inode associated with overlay inode.
This helper will ignore all metacopy inodes and will return only the real
inode which has data.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now ovl_inode stores inode pointer for lower inode. This helps with
quickly getting lower inode given overlay inode (ovl_inode_lower()).
Now with metadata only copy-up, we can have metacopy inode in middle layer
as well and inode containing data can be different from ->lower. I need to
be able to open the real file in ovl_open_realfile() and for that I need to
quickly find the lower data inode.
Hence store lower data inode also in ovl_inode. Also provide an helper
ovl_inode_lowerdata() to access this field.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If an inode has been copied up metadata only, then we need to query the
number of blocks from lower and fill up the stat->st_blocks.
We need to be careful about races where we are doing stat on one cpu and
data copy up is taking place on other cpu. We want to return
stat->st_blocks either from lower or stable upper and not something in
between. Hence, ovl_has_upperdata() is called first to figure out whether
block reporting will take place from lower or upper.
We now support metacopy dentries in middle layer. That means number of
blocks reporting needs to come from lowest data dentry and this could be
different from lower dentry. Hence we end up making a separate
vfs_getxattr() call for metacopy dentries to get number of blocks.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now we have the notion of data dentry and metacopy dentry.
ovl_dentry_lower() will return uppermost lower dentry, but it could be
either data or metacopy dentry. Now we support metacopy dentries in lower
layers so it is possible that lowerstack[0] is metacopy dentry while
lowerstack[1] is actual data dentry.
So add an helper which returns lowest most dentry which is supposed to be
data dentry.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
So far lower could not be a meta inode. So whenever it was time to copy up
data of a meta inode, we could copy it up from top most lower dentry.
But now lower itself can be a metacopy inode. That means data copy up
needs to take place from a data inode in metacopy inode chain. Find lower
data inode in the chain and use that for data copy up.
Introduced a helper called ovl_path_lowerdata() to find the lower data
inode chain.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch modifies ovl_lookup() and friends to lookup metacopy dentries.
It also allows for presence of metacopy dentries in lower layer.
During lookup, check for presence of OVL_XATTR_METACOPY and if not present,
set OVL_UPPERDATA bit in flags.
We don't support metacopy feature with nfs_export. So in nfs_export code,
we set OVL_UPPERDATA flag set unconditionally if upper inode exists.
Do not follow metacopy origin if we find a metacopy only inode and metacopy
feature is not enabled for that mount. Like redirect, this can have
security implications where an attacker could hand craft upper and try to
gain access to file on lower which it should not have to begin with.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now we use goto out_nomem which assumes error code is -ENOMEM. But
there are other errors returned like -ESTALE as well. So instead of
out_nomem, use out_err which will do ERR_PTR(err). That way one can put
error code in err and jump to out_err.
This just code reorganization and no change of functionality.
I am about to add more code and this organization helps laying more code
and error paths on top of it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now we will have the capability to have upper inodes which might be only
metadata copy up and data is still on lower inode. So add a new xattr
OVL_XATTR_METACOPY to distinguish between two cases.
Presence of OVL_XATTR_METACOPY reflects that file has been copied up
metadata only and and data will be copied up later from lower origin. So
this xattr is set when a metadata copy takes place and cleared when data
copy takes place.
We also use a bit in ovl_inode->flags to cache OVL_UPPERDATA which reflects
whether ovl inode has data or not (as opposed to metadata only copy up).
If a file is copied up metadata only and later when same file is opened for
WRITE, then data copy up takes place. We copy up data, remove METACOPY
xattr and then set the UPPERDATA flag in ovl_inode->flags. While all these
operations happen with oi->lock held, read side of oi->flags can be
lockless. That is another thread on another cpu can check if UPPERDATA
flag is set or not.
So this gives us an ordering requirement w.r.t UPPERDATA flag. That is, if
another cpu sees UPPERDATA flag set, then it should be guaranteed that
effects of data copy up and remove xattr operations are also visible.
For example.
CPU1 CPU2
ovl_open() acquire(oi->lock)
ovl_open_maybe_copy_up() ovl_copy_up_data()
open_open_need_copy_up() vfs_removexattr()
ovl_already_copied_up()
ovl_dentry_needs_data_copy_up() ovl_set_flag(OVL_UPPERDATA)
ovl_test_flag(OVL_UPPERDATA) release(oi->lock)
Say CPU2 is copying up data and in the end sets UPPERDATA flag. But if
CPU1 perceives the effects of setting UPPERDATA flag but not the effects of
preceding operations (ex. upper that is not fully copied up), it will be a
problem.
Hence this patch introduces smp_wmb() on setting UPPERDATA flag operation
and smp_rmb() on UPPERDATA flag test operation.
May be some other lock or barrier is already covering it. But I am not sure
what that is and is it obvious enough that we will not break it in future.
So hence trying to be safe here and introducing barriers explicitly for
UPPERDATA flag/bit.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There are couple of places where we need to know if file is already copied
up (in lockless manner). Right now its open coded and there are only two
conditions to check. Soon this patch series will introduce another
condition to check and Amir wants to introduce one more. So introduce a
helper instead to check this so that code is easier to read.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If it makes sense to copy up only metadata during copy up, do it. This is
done for regular files which are not opened for WRITE.
Right now ->metacopy is set to 0 always. Last patch in the series will
remove the hard coded statement and enable metacopy feature.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Just a little re-ordering of code. This helps with next patch where after
copying up metadata, we skip data copying step, if needed.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
By default metadata only copy up is disabled. Provide a mount option so
that users can choose one way or other.
Also provide a kernel config and module option to enable/disable metacopy
feature.
metacopy feature requires redirect_dir=on when upper is present.
Otherwise, it requires redirect_dir=follow atleast.
As of now, metacopy does not work with nfs_export=on. So if both
metacopy=on and nfs_export=on then nfs_export is disabled.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now two copy up helpers are in inode.c. Amir suggested it might be
better to move these to copy_up.c.
There will one more related function which will come in later patch.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_inode->redirect is an inode property and should be initialized in
ovl_get_inode() only when we are adding a new inode to cache. If inode is
already in cache, it is already initialized and we should not be touching
ovl_inode->redirect field.
As of now this is not a problem as redirects are used only for directories
which don't share inode. But soon I want to use redirects for regular
files also and there it can become an issue.
Hence, move ->redirect initialization in ovl_get_inode().
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>