1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

2420 commits

Author SHA1 Message Date
Xiubo Li
fac47b43c7 netfs: do not unlock and put the folio twice
check_write_begin() will unlock and put the folio when return
non-zero.  So we should avoid unlocking and putting it twice in
netfs layer.

Change the way ->check_write_begin() works in the following two ways:

 (1) Pass it a pointer to the folio pointer, allowing it to unlock and put
     the folio prior to doing the stuff it wants to do, provided it clears
     the folio pointer.

 (2) Change the return values such that 0 with folio pointer set means
     continue, 0 with folio pointer cleared means re-get and all error
     codes indicating an error (no special treatment for -EAGAIN).

[ bagasdotme: use Sphinx code text syntax for *foliop pointer ]

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56423
Link: https://lore.kernel.org/r/cf169f43-8ee7-8697-25da-0204d1b4343e@redhat.com
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-07-14 10:10:12 +02:00
Darrick J. Wong
dd81dc0559 xfs: improve CIL scalability
This series aims to improve the scalability of XFS transaction
 commits on large CPU count machines. My 32p machine hits contention
 limits in xlog_cil_commit() at about 700,000 transaction commits a
 section. It hits this at 16 thread workloads, and 32 thread
 workloads go no faster and just burn CPU on the CIL spinlocks.
 
 This patchset gets rid of spinlocks and global serialisation points
 in the xlog_cil_commit() path. It does this by moving to a
 combination of per-cpu counters, unordered per-cpu lists and
 post-ordered per-cpu lists.
 
 This results in transaction commit rates exceeding 1.4 million
 commits/s under unlink certain workloads, and while the log lock
 contention is largely gone there is still significant lock
 contention in the VFS (dentry cache, inode cache and security layers)
 at >600,000 transactions/s that still limit scalability.
 
 The changes to the CIL accounting and behaviour, combined with the
 structural changes to xlog_write() in prior patchsets make the
 per-cpu restructuring possible and sane. This allows us to move to
 precalculated reservation requirements that allow for reservation
 stealing to be accounted across multiple CPUs accurately.
 
 That is, instead of trying to account for continuation log opheaders
 on a "growth" basis, we pre-calculate how many iclogs we'll need to
 write out a maximally sized CIL checkpoint and steal that reserveD
 that space one commit at a time until the CIL has a full
 reservation. If we ever run a commit when we are already at the hard
 limit (because post-throttling) we simply take an extra reservation
 from each commit that is run when over the limit. Hence we don't
 need to do space usage math in the fast path and so never need to
 sum the per-cpu counters in this fast path.
 
 Similarly, per-cpu lists have the problem of ordering - we can't
 remove an item from a per-cpu list if we want to move it forward in
 the CIL. We solve this problem by using an atomic counter to give
 every commit a sequence number that is copied into the log items in
 that transaction. Hence relogging items just overwrites the sequence
 number in the log item, and does not move it in the per-cpu lists.
 Once we reaggregate the per-cpu lists back into a single list in the
 CIL push work, we can run it through list-sort() and reorder it back
 into a globally ordered list. This costs a bit of CPU time, but now
 that the CIL can run multiple works and pipelines properly, this is
 not a limiting factor for performance. It does increase fsync
 latency when the CIL is full, but workloads issuing large numbers of
 fsync()s or sync transactions end up with very small CILs and so the
 latency impact or sorting is not measurable for such workloads.
 
 OVerall, this pushes the transaction commit bottleneck out to the
 lockless reservation grant head updates. These atomic updates don't
 start to be a limiting fact until > 1.5 million transactions/s are
 being run, at which point the accounting functions start to show up
 in profiles as the highest CPU users. Still, this series doubles
 transaction throughput without increasing CPU usage before we get
 to that cacheline contention breakdown point...
 `
 Signed-off-by: Dave Chinner <dchinner@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmLHai8UHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h3JZQ//bb9HyBiBkeuK9MvqH40hOfazfGXD
 8+pdP9r22qWp9LHhjz/EtH4Wy1sYe6a99mtPxqlsT3DqSl8GiolA1VFn+T3Sadu4
 nqmB/ppzMLE0LLzKoVrb3/Zw+mEaz5Is3WLpr86CpK5gNW6gBHCj4B68lWiBtvjs
 OW5fTm0E44BnNORh/AdSUkJxxEB2OQhVk5omY/Op8vO5frviG5yqYakAeoQ3vFpS
 UKadwlGjei91c63g9se360Re+DXTBhzbgXz0oNV4YbgWba2O9lnut5zqlcJMvVAU
 YgGBxttT0OqCdSNp0vtwOG8UFeUqfWSY+AFwfDkNycltLASvU53efqC94kQHouoh
 9++2VrPwPg0KOcQsvQo5WViQqWrr0+KlsaiTRO/TE0XCGFx4xQKEuhZ6QAnHiiVU
 en34SMqY51qa5D3LSbs6F278rEZNcLQguiH6Urxe5KRmkJDfoxtsWQ/DpV8itbnk
 raCUFlhW8GIBrRvizB7Na+hDWj1/HGQRIEs+xlfqPcFDV9bkECE/IpbD04+JDbil
 wsDoy2IO15oG/rX05/bkXAY7fFuhWbnVAbKrqvl+50w8Oo5w0+X3ZHlqhiLqCzVr
 e/TL5lc+9Ciq4uG8TCwal4HoktYLwqez4qxz396YpE4LN1ax2ICFgR9HyY4GLqmU
 0H1qSxZmOkeueCU=
 =vLZn
 -----END PGP SIGNATURE-----

Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA

xfs: improve CIL scalability

This series aims to improve the scalability of XFS transaction
commits on large CPU count machines. My 32p machine hits contention
limits in xlog_cil_commit() at about 700,000 transaction commits a
section. It hits this at 16 thread workloads, and 32 thread
workloads go no faster and just burn CPU on the CIL spinlocks.

This patchset gets rid of spinlocks and global serialisation points
in the xlog_cil_commit() path. It does this by moving to a
combination of per-cpu counters, unordered per-cpu lists and
post-ordered per-cpu lists.

This results in transaction commit rates exceeding 1.4 million
commits/s under unlink certain workloads, and while the log lock
contention is largely gone there is still significant lock
contention in the VFS (dentry cache, inode cache and security layers)
at >600,000 transactions/s that still limit scalability.

The changes to the CIL accounting and behaviour, combined with the
structural changes to xlog_write() in prior patchsets make the
per-cpu restructuring possible and sane. This allows us to move to
precalculated reservation requirements that allow for reservation
stealing to be accounted across multiple CPUs accurately.

That is, instead of trying to account for continuation log opheaders
on a "growth" basis, we pre-calculate how many iclogs we'll need to
write out a maximally sized CIL checkpoint and steal that reserveD
that space one commit at a time until the CIL has a full
reservation. If we ever run a commit when we are already at the hard
limit (because post-throttling) we simply take an extra reservation
from each commit that is run when over the limit. Hence we don't
need to do space usage math in the fast path and so never need to
sum the per-cpu counters in this fast path.

Similarly, per-cpu lists have the problem of ordering - we can't
remove an item from a per-cpu list if we want to move it forward in
the CIL. We solve this problem by using an atomic counter to give
every commit a sequence number that is copied into the log items in
that transaction. Hence relogging items just overwrites the sequence
number in the log item, and does not move it in the per-cpu lists.
Once we reaggregate the per-cpu lists back into a single list in the
CIL push work, we can run it through list-sort() and reorder it back
into a globally ordered list. This costs a bit of CPU time, but now
that the CIL can run multiple works and pipelines properly, this is
not a limiting factor for performance. It does increase fsync
latency when the CIL is full, but workloads issuing large numbers of
fsync()s or sync transactions end up with very small CILs and so the
latency impact or sorting is not measurable for such workloads.

OVerall, this pushes the transaction commit bottleneck out to the
lockless reservation grant head updates. These atomic updates don't
start to be a limiting fact until > 1.5 million transactions/s are
being run, at which point the accounting functions start to show up
in profiles as the highest CPU users. Still, this series doubles
transaction throughput without increasing CPU usage before we get
to that cacheline contention breakdown point...
`
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: expanding delayed logging design with background material
  xfs: xlog_sync() manually adjusts grant head space
  xfs: avoid cil push lock if possible
  xfs: move CIL ordering to the logvec chain
  xfs: convert log vector chain to use list heads
  xfs: convert CIL to unordered per cpu lists
  xfs: Add order IDs to log items in CIL
  xfs: convert CIL busy extents to per-cpu
  xfs: track CIL ticket reservation in percpu structure
  xfs: implement percpu cil space used calculation
  xfs: introduce per-cpu CIL tracking structure
  xfs: rework per-iclog header CIL reservation
  xfs: lift init CIL reservation out of xc_cil_lock
  xfs: use the CIL space used counter for emptiness checks
2022-07-09 10:55:21 -07:00
Christian Brauner
4a47c6385b ovl: turn of SB_POSIXACL with idmapped layers temporarily
This cycle we added support for mounting overlayfs on top of idmapped
mounts.  Recently I've started looking into potential corner cases when
trying to add additional tests and I noticed that reporting for POSIX ACLs
is currently wrong when using idmapped layers with overlayfs mounted on top
of it.

I have sent out an patch that fixes this and makes POSIX ACLs work
correctly but the patch is a bit bigger and we're already at -rc5 so I
recommend we simply don't raise SB_POSIXACL when idmapped layers are
used. Then we can fix the VFS part described below for the next merge
window so we can have good exposure in -next.

I'm going to give a rather detailed explanation to both the origin of the
problem and mention the solution so people know what's going on.

Let's assume the user creates the following directory layout and they have
a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you
would expect files on your host system to be owned. For example, ~/.bashrc
for your regular user would be owned by 1000:1000 and /root/.bashrc would
be owned by 0:0. IOW, this is just regular boring filesystem tree on an
ext4 or xfs filesystem.

The user chooses to set POSIX ACLs using the setfacl binary granting the
user with uid 4 read, write, and execute permissions for their .bashrc
file:

        setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

Now they to expose the whole rootfs to a container using an idmapped
mount. So they first create:

        mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
        mkdir -pv /vol/contpool/ctrover/{over,work}
        chown 10000000:10000000 /vol/contpool/ctrover/{over,work}

The user now creates an idmapped mount for the rootfs:

        mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                      /var/lib/lxc/c2/rootfs \
                                      /vol/contpool/lowermap

This for example makes it so that
/var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid
1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.

Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.

       mount -t overlay overlay                      \
             -o lowerdir=/vol/contpool/lowermap,     \
                upperdir=/vol/contpool/overmap/over, \
                workdir=/vol/contpool/overmap/work   \
             /vol/contpool/merge

The user can do this in two ways:

(1) Mount overlayfs in the initial user namespace and expose it to the
    container.

(2) Mount overlayfs on top of the idmapped mounts inside of the container's
    user namespace.

Let's assume the user chooses the (1) option and mounts overlayfs on the
host and then changes into a container which uses the idmapping
0:10000000:65536 which is the same used for the two idmapped mounts.

Now the user tries to retrieve the POSIX ACLs using the getfacl command

        getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc

and to their surprise they see:

        # file: vol/contpool/merge/home/ubuntu/.bashrc
        # owner: 1000
        # group: 1000
        user::rw-
        user:4294967295:rwx
        group::r--
        mask::rwx
        other::r--

indicating the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect
the callchain in this example:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  -> __vfs_getxattr()
                  |     -> handler->get == ovl_posix_acl_xattr_get()
                  |        -> ovl_xattr_get()
                  |           -> vfs_getxattr()
                  |              -> __vfs_getxattr()
                  |                 -> handler->get() /* lower filesystem callback */
                  |> posix_acl_fix_xattr_to_user()
                     {
                              4 = make_kuid(&init_user_ns, 4);
                              4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
                              /* FAILURE */
                             -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                     }

If the user chooses to use option (2) and mounts overlayfs on top of
idmapped mounts inside the container things don't look that much better:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  -> __vfs_getxattr()
                  |     -> handler->get == ovl_posix_acl_xattr_get()
                  |        -> ovl_xattr_get()
                  |           -> vfs_getxattr()
                  |              -> __vfs_getxattr()
                  |                 -> handler->get() /* lower filesystem callback */
                  |> posix_acl_fix_xattr_to_user()
                     {
                              4 = make_kuid(&init_user_ns, 4);
                              4 = mapped_kuid_fs(&init_user_ns, 4);
                              /* FAILURE */
                             -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                     }

As is easily seen the problem arises because the idmapping of the lower
mount isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus
cannot possible take the idmapping of the lower layers into account.

This problem is similar for fscaps but there the translation happens as
part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
callchain:

        setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

The expected outcome here is that we'll receive the cap_net_raw capability
as we are able to map the uid associated with the fscap to 0 within our
container.  IOW, we want to see 0 as the result of the idmapping
translations.

If the user chooses option (1) we get the following callchain for fscaps:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                   -> vfs_getxattr()
                      -> xattr_getsecurity()
                         -> security_inode_getsecurity()                                       ________________________________
                            -> cap_inode_getsecurity()                                         |                              |
                               {                                                               V                              |
                                        10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                        10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                               /* Expected result is 0 and thus that we own the fscap. */                     |
                                               0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                               }                                                                                              |
                               -> vfs_getxattr_alloc()                                                                        |
                                  -> handler->get == ovl_other_xattr_get()                                                    |
                                     -> vfs_getxattr()                                                                        |
                                        -> xattr_getsecurity()                                                                |
                                           -> security_inode_getsecurity()                                                    |
                                              -> cap_inode_getsecurity()                                                      |
                                                 {                                                                            |
                                                                0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                         10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                         10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                         |____________________________________________________________________|
                                                 }
                                                 -> vfs_getxattr_alloc()
                                                    -> handler->get == /* lower filesystem callback */

And if the user chooses option (2) we get:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                   -> vfs_getxattr()
                      -> xattr_getsecurity()
                         -> security_inode_getsecurity()                                                _______________________________
                            -> cap_inode_getsecurity()                                                  |                             |
                               {                                                                        V                             |
                                       10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                       10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                               /* Expected result is 0 and thus that we own the fscap. */                             |
                                              0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                               }                                                                                                      |
                               -> vfs_getxattr_alloc()                                                                                |
                                  -> handler->get == ovl_other_xattr_get()                                                            |
                                    |-> vfs_getxattr()                                                                                |
                                        -> xattr_getsecurity()                                                                        |
                                           -> security_inode_getsecurity()                                                            |
                                              -> cap_inode_getsecurity()                                                              |
                                                 {                                                                                    |
                                                                 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                          10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                 |____________________________________________________________________|
                                                 }
                                                 -> vfs_getxattr_alloc()
                                                    -> handler->get == /* lower filesystem callback */

We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.

For POSIX ACLs we need to do something similar. However, in contrast to
fscaps we cannot apply the fix directly to the kernel internal posix acl
data structure as this would alter the cached values and would also require
a rework of how we currently deal with POSIX ACLs in general which almost
never take the filesystem idmapping into account (the noteable exception
being FUSE but even there the implementation is special) and instead
retrieve the raw values based on the initial idmapping.

The correct values are then generated right before returning to
userspace. The fix for this is to move taking the mount's idmapping into
account directly in vfs_getxattr() instead of having it be part of
posix_acl_fix_xattr_to_user().

To this end we simply move the idmapped mount translation into a separate
step performed in vfs_{g,s}etxattr() instead of in
posix_acl_fix_xattr_{from,to}_user().

To see how this fixes things let's go back to the original example. Assume
the user chose option (1) and mounted overlayfs on top of idmapped mounts
on the host:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  |> __vfs_getxattr()
                  |  |  -> handler->get == ovl_posix_acl_xattr_get()
                  |  |     -> ovl_xattr_get()
                  |  |        -> vfs_getxattr()
                  |  |           |> __vfs_getxattr()
                  |  |           |  -> handler->get() /* lower filesystem callback */
                  |  |           |> posix_acl_getxattr_idmapped_mnt()
                  |  |              {
                  |  |                              4 = make_kuid(&init_user_ns, 4);
                  |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                  |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                  |  |                       |_______________________
                  |  |              }                               |
                  |  |                                              |
                  |  |> posix_acl_getxattr_idmapped_mnt()           |
                  |     {                                           |
                  |                                                 V
                  |             10000004 = make_kuid(&init_user_ns, 10000004);
                  |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                  |             10000004 = from_kuid(&init_user_ns, 10000004);
                  |     }       |_________________________________________________
                  |                                                              |
                  |                                                              |
                  |> posix_acl_fix_xattr_to_user()                               |
                     {                                                           V
                                 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                        /* SUCCESS */
                                        4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                     }

And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  |> __vfs_getxattr()
                  |  |  -> handler->get == ovl_posix_acl_xattr_get()
                  |  |     -> ovl_xattr_get()
                  |  |        -> vfs_getxattr()
                  |  |           |> __vfs_getxattr()
                  |  |           |  -> handler->get() /* lower filesystem callback */
                  |  |           |> posix_acl_getxattr_idmapped_mnt()
                  |  |              {
                  |  |                              4 = make_kuid(&init_user_ns, 4);
                  |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                  |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                  |  |                       |_______________________
                  |  |              }                               |
                  |  |                                              |
                  |  |> posix_acl_getxattr_idmapped_mnt()           |
                  |     {                                           V
                  |             10000004 = make_kuid(&init_user_ns, 10000004);
                  |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                  |             10000004 = from_kuid(0(&init_user_ns, 10000004);
                  |             |_________________________________________________
                  |     }                                                        |
                  |                                                              |
                  |> posix_acl_fix_xattr_to_user()                               |
                     {                                                           V
                                 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                        /* SUCCESS */
                                        4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
                     }

The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:

        ovl_permission()
        -> generic_permission()
           -> acl_permission_check()
              -> check_acl()
                 -> get_acl()
                    -> inode->i_op->get_acl() == ovl_get_acl()
                        > get_acl() /* on the underlying filesystem)
                          ->inode->i_op->get_acl() == /*lower filesystem callback */
                 -> posix_acl_permission()

passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the
idmapping of the underlying mount into account as this would mean altering
the cached values for the lower filesystem. The simple solution is to have
ovl_get_acl() simply duplicate the ACLs, update the values according to the
idmapped mount and return it to acl_permission_check() so it can be used in
posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
released right after.

Link: https://github.com/brauner/mount-idmapped/issues/9
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Fixes: bc70682a49 ("ovl: support idmapped layers")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-08 15:48:31 +02:00
Dave Chinner
51a117edff xfs: expanding delayed logging design with background material
I wrote up a description of how transactions, space reservations and
relogging work together in response to a question for background
material on the delayed logging design. Add this to the existing
document for ease of future reference.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:56:09 +10:00
Vincent Whitchurch
3093484301 mm/smaps: add Pss_Dirty
Pss is the sum of the sizes of clean and dirty private pages, and the
proportional sizes of clean and dirty shared pages:

 Private = Private_Dirty + Private_Clean
 Shared_Proportional = Shared_Dirty_Proportional + Shared_Clean_Proportional
 Pss = Private + Shared_Proportional

The Shared*Proportional fields are not present in smaps, so it is not
always possible to determine how much of the Pss is from dirty pages and
how much is from clean pages.  This information can be useful for
measuring memory usage for the purpose of optimisation, since clean pages
can usually be discarded by the kernel immediately while dirty pages
cannot.

The smaps routines in the kernel already have access to this data, so add
a Pss_Dirty to show it to userspace.  Pss_Clean is not added since it can
be calculated from Pss and Pss_Dirty.

Link: https://lkml.kernel.org/r/20220620081251.2928103-1-vincent.whitchurch@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 18:08:49 -07:00
Matthew Wilcox (Oracle)
90c02eb9a7 docs: Improve ->read_folio documentation
Add information on the use of 'file', whether ->read_folio should be
synchronous, and steer new callers towards calling read_mapping_folio()
instead of calling ->read_folio directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-29 08:51:06 -04:00
Chao Liu
d218bee86a docs: filesystems: f2fs: fix description about compress ioctl
Since commit c61404153e ("f2fs: introduce FI_COMPRESS_RELEASED
instead of using IMMUTABLE bit"), we no longer use the IMMUTABLE
bit to prevent writing data for compression. Let's correct the
corresponding documentation.

BTW, this patch fixes some alignment issues in the compress
metadata layout.

Signed-off-by: Chao Liu <liuchao@coolpad.com>
Link: https://lore.kernel.org/r/20220613020800.3379482-1-chaoliu719@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-06-27 16:24:39 -06:00
Mike Rapoport
ee65728e10 docs: rename Documentation/vm to Documentation/mm
so it will be consistent with code mm directory and with
Documentation/admin-guide/mm and won't be confused with virtual machines.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Jonathan Corbet <corbet@lwn.net>
Acked-by: Wu XiangCheng <bobwxc@email.cn>
2022-06-27 12:52:53 -07:00
Linus Torvalds
82708bb1eb for-5.19-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmK4dV4ACgkQxWXV+ddt
 WDs4uQ/7B0XqPK05NJntJfwnuIoT/yOreKf47wt/6DyFV3CDMFte/qzaZwthwu6P
 F0GMpSYAlVszLlML5elvF9VXymlV+e+QROtbD6QCNLNW1IwHA7ZiF5fV/a1Rj930
 XSuaDyVFPAK7892RR6yMQ20IeMBuvqiAhXWEzaIJ2tIcAHn+fP+VkY8Nc0aZj3iC
 mI+ep4n93karDxmnHVGUxJTxAe0l/uNopx+fYBWQDj7HuoMLo0Cu+rAdv0gRIxi2
 RWUBkR4e4PBwV1OFScwNCsljjt6bHdUHrtdB3fo5Hzu9cO5hHdL7NEsKB1K2w7rV
 bgNuNqfj6Y4xUBchAfQO5CCJ9ISci5KoJ4RBpk6EprZR3QN40kN8GPlhi2519K7w
 F3d8jolDDHlkqxIsqoe47MYOcSepNEadVNsiYKb0rM6doilfxyXiu6dtTFMrC8Vy
 K2HDCdTyuIgw+TnwqT1puaUwxiIL8DFJf1CVyjwGuQ4UgaIEkHXKIsCssyyJ76Jh
 QkWX1aeRldbfkVArJWHQWqDQopx9pFBz1gjlws0YjAsU5YijOOXva464P9Rxg+Gq
 4pRlgnO48joQam9bRirP2Z6yhqa4O6jkzKDOXSYduAUYD7IMfpsYnz09wKS95jj+
 QCrR7VmKnpQdsXg5a/mqyacfIH30ph002VywRxPiFM89Syd25yo=
 =rUrf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - zoned relocation fixes:
      - fix critical section end for extent writeback, this could lead
        to out of order write
      - prevent writing to previous data relocation block group if space
        gets low

 - reflink fixes:
      - fix race between reflinking and ordered extent completion
      - proper error handling when block reserve migration fails
      - add missing inode iversion/mtime/ctime updates on each iteration
        when replacing extents

 - fix deadlock when running fsync/fiemap/commit at the same time

 - fix false-positive KCSAN report regarding pid tracking for read locks
   and data race

 - minor documentation update and link to new site

* tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Documentation: update btrfs list of features and link to readthedocs.io
  btrfs: fix deadlock with fsync+fiemap+transaction commit
  btrfs: don't set lock_owner when locking extent buffer for reading
  btrfs: zoned: fix critical section of relocation inode writeback
  btrfs: zoned: prevent allocation from previous data relocation BG
  btrfs: do not BUG_ON() on failure to migrate space when replacing extents
  btrfs: add missing inode updates on each iteration when replacing extents
  btrfs: fix race between reflinking and ordered extent completion
2022-06-26 10:11:36 -07:00
Deming Wang
df6725651f docs: Remove duplicate word
Delete duplicate words of "the".

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Link: https://lore.kernel.org/r/20220624014605.2007-1-wangdeming@inspur.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-06-24 13:10:31 -06:00
David Sterba
037e127452 Documentation: update btrfs list of features and link to readthedocs.io
The btrfs documentation in kernel is only meant as a starting point, so
update the list of features and add link to btrfs.readthedocs.io page
that is most up-to-date. The wiki is still used but information is
migrated from there.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21 14:47:19 +02:00
Wang Jianjian
3103084afc ext4, doc: remove unnecessary escaping
Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com>
Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 11:03:17 -04:00
David Howells
40a8110120 netfs: Rename the netfs_io_request cleanup op and give it an op pointer
The netfs_io_request cleanup op is now always in a position to be given a
pointer to a netfs_io_request struct, so this can be passed in instead of
the mapping and private data arguments (both of which are included in the
struct).

So rename the ->cleanup op to ->free_request (to match ->init_request) and
pass in the I/O pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
2022-06-10 20:55:21 +01:00
Linus Torvalds
e81fb4198e netfs: Further cleanups after struct netfs_inode wrapper introduced
Change the signature of netfs helper functions to take a struct netfs_inode
pointer rather than a struct inode pointer where appropriate, thereby
relieving the need for the network filesystem to convert its internal inode
format down to the VFS inode only for netfslib to bounce it back up.  For
type safety, it's better not to do that (and it's less typing too).

Give netfs_write_begin() an extra argument to pass in a pointer to the
netfs_inode struct rather than deriving it internally from the file
pointer.  Note that the ->write_begin() and ->write_end() ops are intended
to be replaced in the future by netfslib code that manages this without the
need to call in twice for each page.

netfs_readpage() and similar are intended to be pointed at directly by the
address_space_operations table, so must stick to the signature dictated by
the function pointers there.

Changes
=======
- Updated the kerneldoc comments and documentation [DH].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=wgkwKyNmNdKpQkqZ6DnmUL-x9hp0YBnUGjaPFEAdxDTbw@mail.gmail.com/
2022-06-10 20:55:21 +01:00
Nathan Huckleberry
6b2a51ff03 fscrypt: Add HCTR2 support for filename encryption
HCTR2 is a tweakable, length-preserving encryption mode that is intended
for use on CPUs with dedicated crypto instructions.  HCTR2 has the
property that a bitflip in the plaintext changes the entire ciphertext.
This property fixes a known weakness with filename encryption: when two
filenames in the same directory share a prefix of >= 16 bytes, with
AES-CTS-CBC their encrypted filenames share a common substring, leaking
information.  HCTR2 does not have this problem.

More information on HCTR2 can be found here: "Length-preserving
encryption with HCTR2": https://eprint.iacr.org/2021/1441.pdf

Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-06-10 16:40:18 +08:00
David Howells
874c8ca1e6 netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context
While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled.  This was causing the
following complaint[1] from gcc v12:

  In file included from include/linux/string.h:253,
                   from include/linux/ceph/ceph_debug.h:7,
                   from fs/ceph/inode.c:2:
  In function 'fortify_memset_chk',
      inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
      inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
  include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
    242 |                         __write_overflow_field(p_size_field, size);
        |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode).  The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

  perl -p -i -e 's/vfs_inode/netfs.inode/'g \
        `git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
 - Fix a couple of missed name changes due to a disabled cifs option.
 - Rename nfs_i_context to nfs_inode
 - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
   structs.

[ This also undoes commit 507160f46c ("netfs: gcc-12: temporarily
  disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c8 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-09 13:55:00 -07:00
Linus Torvalds
8171acb8bc Changes since last update:
- Leave compressed inodes unsupported in fscache mode for now;
 
  - Avoid crash when using tracepoint cachefiles_prep_read;
 
  - Fix `backmost' behavior due to a recent cleanup;
 
  - Update documentation for better description of recent new features;
 
  - Several decompression cleanups w/o logical change.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYpeFXxEceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBC9eAQC8YSePEG+YCGbmOCGadSuBsgU+OXzKGpCV
 KxPyy3SmPQEAyNCDk11HoaYDRywS8TbMPntlyRfXvtEGSxbRe+5d1Qc=
 =4RnO
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull more erofs updates from Gao Xiang:
 "This is a follow-up to the main updates, including some fixes of
  fscache mode related to compressed inodes and a cachefiles tracepoint.
  There is also a patch to fix an unexpected decompression strategy
  change due to a cleanup in the past. All the fixes are quite small.

  Apart from these, documentation is also updated for a better
  description of recent new features.

  In addition, this has some trivial cleanups without actual code logic
  changes, so I could have a more recent codebase to work on folios and
  avoiding the PG_error page flag for the next cycle.

  Summary:

   - Leave compressed inodes unsupported in fscache mode for now

   - Avoid crash when using tracepoint cachefiles_prep_read

   - Fix `backmost' behavior due to a recent cleanup

   - Update documentation for better description of recent new features

   - Several decompression cleanups w/o logical change"

* tag 'erofs-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix 'backmost' member of z_erofs_decompress_frontend
  erofs: simplify z_erofs_pcluster_readmore()
  erofs: get rid of label `restart_now'
  erofs: get rid of `struct z_erofs_collection'
  erofs: update documentation
  erofs: fix crash when enable tracepoint cachefiles_prep_read
  erofs: leave compressed inodes unsupported in fscache mode for now
2022-06-01 11:54:29 -07:00
Linus Torvalds
700170bf6b NFS Client Updates for Linux 5.18
- New Features:
   - Add support for 'dacl' and 'sacl' attributes
 
 - Bugfixes and Cleanups:
   - Fixes for reporting mapping errors
   - Fixes for memory allocation errors
   - Improve warning message when locks are lost
   - Update documentation for the nfs4_unique_id parameter
   - Add an explanation of NFSv4 client identifiers
   - Ensure the i_size attribute is written to the fscache storage
   - Fix freeing uninitialized nfs4_labels
   - Better handling when xprtrdma bc_serv is NULL
   - Marke qualified async operations as MOVEABLE tasks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmKWhFQACgkQ18tUv7Cl
 QOszjRAAllmtKLbzOkwQwcT3e5ljh9NEJ8NL+Nv1SXjozpFY+1fuXc0ivT4rniU6
 68ZHz+faK2UtLytwOO94M0jo2RCAlYS5rfnts89CpdfP3bqmGPAj0Ytw/c/vg+Qf
 4eQbAzz++T35DgU7cdeKKZKg9Wtwbq7g0kYv1W8QCiCbxakSjnc/V9Ll5XhS/CAC
 1WqKD90TRKUkX0Y1NNsNdXB1dJn/6QAq9B6JTjan+2Rhn7/NCTU8p98mEZGcVD7r
 cPHyXTqkPF4IH7lgjEMIRf6eXEzDDZNIs98QLdHJ2Gk0LxW7p7IL7VW8TKzYunvl
 coA1bZfYhUZBUJ+eDrrKZ5hHMSn/+eNR5iiIcfqtyU8o3J0NXAXGlLh/iJSGsxIH
 PjyjWSfpCgoZVPc4dG3lxR9Iu7UZeAuuB2ZoiNakUkd+UNKK5U5PpaPnYT6adaIp
 TegivZclCmgyLQiAdPRifDzhaL5J2pp6kVb5iMY6oX+ObyclW/UcqzKMqIKSt3R8
 6+JAmZ6633ojS4r3xFsw/dlEUWuuVq7kYwXK209LqiBn5vvjWNa/WgH4MaSfnJe9
 rlw+fs8Aky0w59IhzRJMMVCJ/Q2EYDKmtQLQgYVw80RBFiFgBpMW0wDqMGiddTcu
 1IZ2c5+t1GxfASpyu8miexQjRJW6A2MTp0gfHGiHarxdCpaAycA=
 =0ccI
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Add support for 'dacl' and 'sacl' attributes

  Bugfixes and Cleanups:
   - Fixes for reporting mapping errors
   - Fixes for memory allocation errors
   - Improve warning message when locks are lost
   - Update documentation for the nfs4_unique_id parameter
   - Add an explanation of NFSv4 client identifiers
   - Ensure the i_size attribute is written to the fscache storage
   - Fix freeing uninitialized nfs4_labels
   - Better handling when xprtrdma bc_serv is NULL
   - Mark qualified async operations as MOVEABLE tasks"

* tag 'nfs-for-5.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.1 mark qualified async operations as MOVEABLE tasks
  xprtrdma: treat all calls not a bcall when bc_serv is NULL
  NFSv4: Fix free of uninitialized nfs4_label on referral lookup.
  NFS: Pass i_size to fscache_unuse_cookie() when a file is released
  Documentation: Add an explanation of NFSv4 client identifiers
  NFS: update documentation for the nfs4_unique_id parameter
  NFS: Improve warning message when locks are lost.
  NFSv4.1: Enable access to the NFSv4.1 'dacl' and 'sacl' attributes
  NFSv4: Add encoders/decoders for the NFSv4.1 dacl and sacl attributes
  NFSv4: Specify the type of ACL to cache
  NFSv4: Don't hold the layoutget locks across multiple RPC calls
  pNFS/files: Fall back to I/O through the MDS on non-fatal layout errors
  NFS: Further fixes to the writeback error handling
  NFSv4/pNFS: Do not fail I/O when we fail to allocate the pNFS layout
  NFS: Memory allocation failures are not server fatal errors
  NFS: Don't report errors from nfs_pageio_complete() more than once
  NFS: Do not report flush errors in nfs_write_end()
  NFS: Don't report ENOSPC write errors twice
  NFS: fsync() should report filesystem errors over EINTR/ERESTARTSYS
  NFS: Do not report EINTR/ERESTARTSYS as mapping errors
2022-05-31 16:58:24 -07:00
Gao Xiang
6e95d0a018 erofs: update documentation
- refine the filesystem overview for better description of recent
   new features like FSDAX and Fscache;

 - add the new `fsid' mount option;

 - fix some typos.

Link: https://lore.kernel.org/r/20220527070133.77962-1-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-29 15:39:04 +08:00
Linus Torvalds
6d29d7fe4f NFSD 5.19 Release Notes
We introduce "courteous server" in this release. Previously NFSD
 would purge open and lock state for an unresponsive client after
 one lease period (typically 90 seconds). Now, after one lease
 period, another client can open and lock those files and the
 unresponsive client's lease is purged; otherwise if the unrespon-
 sive client's open and lock state is uncontended, the server retains
 that open and lock state for up to 24 hours, allowing the client's
 workload to resume after a lengthy network partition.
 
 A longstanding issue with NFSv4 file creation is also addressed.
 Previously a file creation can fail internally, returning an error
 to the client, but leave the newly created file in place as an
 artifact. The file creation code path has been reorganized so that
 internal failures and race conditions are less likely to result in
 an unwanted file creation.
 
 A fault injector has been added to help exercise paths that are run
 during kernel metadata cache invalidation. These caches contain
 information maintained by user space about exported filesystems.
 Many of our test workloads do not trigger cache invalidation.
 
 There is one patch that is needed to support PREEMPT_RT and a fix
 for an ancient "sleep while spin-locked" splat that seems to have
 become easier to hit since v5.18-rc3.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmKPliAACgkQM2qzM29m
 f5dB3BAAorPa2L8xu5P1Ge1oTNogNSOVRkLPDzEkfEwK07ZM2qvz78eMZGkMziJ/
 strorvBWl3SWBlVtTePgNpJUjgYQ75MRRwaX7Qh2WuHeRKm1JlZm0/NId3+zKgbh
 N40QI20jdswWcNDuhidxVFFWurd09GlcM4z1cu8gZLbfthkiUOjZoPiLkXeNcvhk
 7wC9GiueWxHefYQQDAKh1nQS/L0GG1EkzJdJo7WUVAldZ9qVY9LpmJVMRqrBBbta
 XrFYfpeY1zFFDY4Qolyz5PUJSeQuDj9PctlhoZ6B1hp56PD/6yaqVhYXiPxtlALj
 tITtktfiekULZkgfvfvyzssCv+wkbYiaEBZcSSCauR7dkGOmBmajO+cf7vpsERgE
 fbCU8DWGk78SMeehdCrO+26cV37VP+8c2t2Txq/rG5Eq4ZoCi++Hj5poRboFLqb+
 oom+0Ee0LfcAKXkxH5gWTPTblHo49GzGitPZtRzTgZ9uFnVwvEaJ4+t0ij0J8JpL
 HuVtWrg5/REhqpEvOSwF0sRmkYWLTu7KdueGn/iZ8xUi7GHEue01NsVkClohKJcR
 WOjWrbNCNF/LJaG88MX0z5u7IO7s9bOHphd7PJ92vR+4YsehW3uRhk+rNi2ZBqQz
 hzULfu8BiaicV9fdB/hDcMmKQD6U6due2AVVPtxTf5XY+CHQNRY=
 =phE1
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "We introduce 'courteous server' in this release. Previously NFSD would
  purge open and lock state for an unresponsive client after one lease
  period (typically 90 seconds). Now, after one lease period, another
  client can open and lock those files and the unresponsive client's
  lease is purged; otherwise if the unresponsive client's open and lock
  state is uncontended, the server retains that open and lock state for
  up to 24 hours, allowing the client's workload to resume after a
  lengthy network partition.

  A longstanding issue with NFSv4 file creation is also addressed.
  Previously a file creation can fail internally, returning an error to
  the client, but leave the newly created file in place as an artifact.
  The file creation code path has been reorganized so that internal
  failures and race conditions are less likely to result in an unwanted
  file creation.

  A fault injector has been added to help exercise paths that are run
  during kernel metadata cache invalidation. These caches contain
  information maintained by user space about exported filesystems. Many
  of our test workloads do not trigger cache invalidation.

  There is one patch that is needed to support PREEMPT_RT and a fix for
  an ancient 'sleep while spin-locked' splat that seems to have become
  easier to hit since v5.18-rc3"

* tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (36 commits)
  NFSD: nfsd_file_put() can sleep
  NFSD: Add documenting comment for nfsd4_release_lockowner()
  NFSD: Modernize nfsd4_release_lockowner()
  NFSD: Fix possible sleep during nfsd4_release_lockowner()
  nfsd: destroy percpu stats counters after reply cache shutdown
  nfsd: Fix null-ptr-deref in nfsd_fill_super()
  nfsd: Unregister the cld notifier when laundry_wq create failed
  SUNRPC: Use RMW bitops in single-threaded hot paths
  NFSD: Clean up the show_nf_flags() macro
  NFSD: Trace filecache opens
  NFSD: Move documenting comment for nfsd4_process_open2()
  NFSD: Fix whitespace
  NFSD: Remove dprintk call sites from tail of nfsd4_open()
  NFSD: Instantiate a struct file when creating a regular NFSv4 file
  NFSD: Clean up nfsd_open_verified()
  NFSD: Remove do_nfsd_create()
  NFSD: Refactor NFSv4 OPEN(CREATE)
  NFSD: Refactor NFSv3 CREATE
  NFSD: Refactor nfsd_create_setattr()
  NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create()
  ...
2022-05-26 20:52:24 -07:00
Linus Torvalds
98931dd95f Yang Shi has improved the behaviour of khugepaged collapsing of readonly
file-backed transparent hugepages.
 
 Johannes Weiner has arranged for zswap memory use to be tracked and
 managed on a per-cgroup basis.
 
 Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
 enablement of the recent huge page vmemmap optimization feature.
 
 Baolin Wang contributes a series to fix some issues around hugetlb
 pagetable invalidation.
 
 Zhenwei Pi has fixed some interactions between hwpoisoned pages and
 virtualization.
 
 Tong Tiangen has enabled the use of the presently x86-only
 page_table_check debugging feature on arm64 and riscv.
 
 David Vernet has done some fixup work on the memcg selftests.
 
 Peter Xu has taught userfaultfd to handle write protection faults against
 shmem- and hugetlbfs-backed files.
 
 More DAMON development from SeongJae Park - adding online tuning of the
 feature and support for monitoring of fixed virtual address ranges.  Also
 easier discovery of which monitoring operations are available.
 
 Nadav Amit has done some optimization of TLB flushing during mprotect().
 
 Neil Brown continues to labor away at improving our swap-over-NFS support.
 
 David Hildenbrand has some fixes to anon page COWing versus
 get_user_pages().
 
 Peng Liu fixed some errors in the core hugetlb code.
 
 Joao Martins has reduced the amount of memory consumed by device-dax's
 compound devmaps.
 
 Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
 
 Muchun Song has found and fixed some errors in the TLB flushing of
 transparent hugepages.
 
 Roman Gushchin has done more work on the memcg selftests.
 
 And, of course, many smaller fixes and cleanups.  Notably, the customary
 million cleanup serieses from Miaohe Lin.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
 jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
 PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
 =nFe6
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.

   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.

   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.

   - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.

   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.

   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.

   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.

   - David Vernet has done some fixup work on the memcg selftests.

   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.

   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.

   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().

   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.

   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().

   - Peng Liu fixed some errors in the core hugetlb code.

   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.

   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.

   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.

   - Roman Gushchin has done more work on the memcg selftests.

  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"

* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ...
2022-05-26 12:32:41 -07:00
Linus Torvalds
fdaf9a5840 Page cache changes for 5.19
- Appoint myself page cache maintainer
 
  - Fix how scsicam uses the page cache
 
  - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
 
  - Remove the AOP flags entirely
 
  - Remove pagecache_write_begin() and pagecache_write_end()
 
  - Documentation updates
 
  - Convert several address_space operations to use folios:
    - is_dirty_writeback
    - readpage becomes read_folio
    - releasepage becomes release_folio
    - freepage becomes free_folio
 
  - Change filler_t to require a struct file pointer be the first argument
    like ->read_folio
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp
 gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O
 7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj
 xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d
 nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26
 bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX
 WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A==
 =djLv
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache

Pull page cache updates from Matthew Wilcox:

 - Appoint myself page cache maintainer

 - Fix how scsicam uses the page cache

 - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS

 - Remove the AOP flags entirely

 - Remove pagecache_write_begin() and pagecache_write_end()

 - Documentation updates

 - Convert several address_space operations to use folios:
     - is_dirty_writeback
     - readpage becomes read_folio
     - releasepage becomes release_folio
     - freepage becomes free_folio

 - Change filler_t to require a struct file pointer be the first
   argument like ->read_folio

* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
  nilfs2: Fix some kernel-doc comments
  Appoint myself page cache maintainer
  fs: Remove aops->freepage
  secretmem: Convert to free_folio
  nfs: Convert to free_folio
  orangefs: Convert to free_folio
  fs: Add free_folio address space operation
  fs: Convert drop_buffers() to use a folio
  fs: Change try_to_free_buffers() to take a folio
  jbd2: Convert release_buffer_page() to use a folio
  jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
  reiserfs: Convert release_buffer_page() to use a folio
  fs: Remove last vestiges of releasepage
  ubifs: Convert to release_folio
  reiserfs: Convert to release_folio
  orangefs: Convert to release_folio
  ocfs2: Convert to release_folio
  nilfs2: Remove comment about releasepage
  nfs: Convert to release_folio
  jfs: Convert to release_folio
  ...
2022-05-24 19:55:07 -07:00
Linus Torvalds
65965d9530 Changes since last update:
- Add erofs on-demand load support over fscache;
 
  - Support NFS export for erofs;
 
  - Support idmapped mounts for erofs;
 
  - Don't prompt for risk any more when using big pcluster;
 
  - Fix buffer copy overflow of ztailpacking feature;
 
  - Several minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYojqfREceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBJ/vAP0XBbClZjsHhiSI/Gkp3UTcQHjR+uDIb2QR
 FhAui79F+QEAqCHoKF/F6YFkJdWtH0t6rBeNt6NL0UNU9hw3riF3IwY=
 =bcu7
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs (and fscache) updates from Gao Xiang:
 "After working on it on the mailing list for more than half a year, we
  finally form 'erofs over fscache' feature into shape. Hopefully it
  could bring more possibility to the communities.

  The story mainly started from a new project what we called "RAFS v6" [1]
  for Nydus image service almost a year ago, which enhances EROFS to be
  a new form of one bootstrap (which includes metadata representing the
  whole fs tree) + several data-deduplicated content addressable blobs
  (actually treated as multiple devices). Each blob can represent one
  container image layer but not quite exactly since all new data can be
  fully existed in the previous blobs so no need to introduce another
  new blob.

  It is actually not a new idea (at least on my side it's much like a
  simpilied casync [2] for now) and has many benefits over per-file
  blobs or some other exist ways since typically each RAFS v6 image only
  has dozens of device blobs instead of thousands of per-file blobs.
  It's easy to be signed with user keys as a golden image, transfered
  untouchedly with minimal overhead over the network, kept in some type
  of storage conveniently, and run with (optional) runtime verification
  but without involving too many irrelevant features crossing the system
  beyond EROFS itself. At least it's our final goal and we're keeping
  working on it. There was also a good summary of this approach from the
  casync author [3].

  Regardless further optimizations, this work is almost done in the
  previous Linux release cycles. In this round, we'd like to introduce
  on-demand load for EROFS with the fscache/cachefiles infrastructure,
  considering the following advantages:

   - Introduce new file-based backend to EROFS. Although each image only
     contains dozens of blobs but in densely-deployed runC host for
     example, there could still be massive blobs on a machine, which is
     messy if each blob is treated as a device. In contrast, fscache and
     cachefiles are really great interfaces for us to make them work.

   - Introduce on-demand load to fscache and EROFS. Previously, fscache
     is mainly used to caching network-likewise filesystems, now it can
     support on-demand downloading for local fses too with the exact
     localfs on-disk format. It has many advantages which we're been
     described in the latest patchset cover letter [4]. In addition to
     that, most importantly, the cached data is still stored in the
     original local fs on-disk format so that it's still the one signed
     with private keys but only could be partially available. Users can
     fully trust it during running. Later, users can also back up
     cachefiles easily to another machine.

   - More reliable on-demand approach in principle. After data is all
     available locally, user daemon can be no longer online in some use
     cases, which helps daemon crash recovery (filesystems can still in
     service) and hot-upgrade (user daemon can be upgraded more
     frequently due to new features or protocols introduced.)

   - Other format can also be converted to EROFS filesystem format over
     the internet on the fly with the new on-demand load feature and
     mounted. That is entirely possible with on-demand load feature as
     long as such archive format metadata can be fetched in advance like
     stargz.

  In addition, although currently our target user is Nydus image service [5],
  but laterly, it can be used for other use cases like on-demand system
  booting, etc. As for the fscache on-demand load feature itself,
  strictly it can be used for other local fses too. Laterly we could
  promote most code to the iomap infrastructure and also enhance it in
  the read-write way if other local fses are interested.

  Thanks David Howells for taking so much time and patience on this
  these months, many thanks with great respect here again! Thanks Jeffle
  for working on this feature and Xin Yin from Bytedance for
  asynchronous I/O implementation as well as Zichen Tian, Jia Zhu, and
  Yan Song for testing, much appeciated. We're also exploring more
  possibly over fscache cache management over FSDAX for secure
  containers and working on more improvements and useful features for
  fscache, cachefiles, and on-demand load.

  In addition to "erofs over fscache", NFS export and idmapped mount are
  also completed in this cycle for container use cases as well.

  Summary:

   - Add erofs on-demand load support over fscache

   - Support NFS export for erofs

   - Support idmapped mounts for erofs

   - Don't prompt for risk any more when using big pcluster

   - Fix buffer copy overflow of ztailpacking feature

   - Several minor cleanups"

[1] https://lore.kernel.org/r/20210730194625.93856-1-hsiangkao@linux.alibaba.com
[2] https://github.com/systemd/casync
[3] http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
[4] https://lore.kernel.org/r/20220509074028.74954-1-jefflexu@linux.alibaba.com
[5] https://github.com/dragonflyoss/image-service

* tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (29 commits)
  erofs: scan devices from device table
  erofs: change to use asynchronous io for fscache readpage/readahead
  erofs: add 'fsid' mount option
  erofs: implement fscache-based data readahead
  erofs: implement fscache-based data read for inline layout
  erofs: implement fscache-based data read for non-inline layout
  erofs: implement fscache-based metadata read
  erofs: register fscache context for extra data blobs
  erofs: register fscache context for primary data blob
  erofs: add erofs_fscache_read_folios() helper
  erofs: add anonymous inode caching metadata for data blobs
  erofs: add fscache context helper functions
  erofs: register fscache volume
  erofs: add fscache mode check helper
  erofs: make erofs_map_blocks() generally available
  cachefiles: document on-demand read mode
  cachefiles: add tracepoints for on-demand read mode
  cachefiles: enable on-demand read mode
  cachefiles: implement on-demand read
  cachefiles: notify the user daemon when withdrawing cookie
  ...
2022-05-24 18:42:04 -07:00
Linus Torvalds
f30fabe78a fs.idmapped.v5.19
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYotC2wAKCRCRxhvAZXjc
 omivAQD7hDdmZdhGaWgHJKGMofPJ+j62F7QPyoc1UPEkr0sMvAEA1EehhXkw4E8L
 6aFsXKs+Bb77TfdZI5EI7cUw1fAWUwE=
 =wlyp
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs idmapping updates from Christian Brauner:
 "This contains two minor updates:

   - An update to the idmapping documentation by Rodrigo making it
     easier to understand that we first introduce several use-cases that
     fail without idmapped mounts simply to explain how they can be
     handled with idmapped mounts.

   - When changing a mount's idmapping we now hold writers to make it
     more robust.

     This is similar to turning a mount ro with the difference that in
     contrast to turning a mount ro changing the idmapping can only ever
     be done once while a mount can transition between ro and rw as much
     as it wants.

     The vfs layer itself takes care to retrieve the idmapping of a
     mount once ensuring that the idmapping used for vfs permission
     checking is identical to the idmapping passed down to the
     filesystem. All filesystems with FS_ALLOW_IDMAP raised take the
     same precautions as the vfs in code-paths that are outside of
     direct control of the vfs such as ioctl()s.

     However, holding writers makes this more robust and predictable for
     both the kernel and userspace.

     This is a minor user-visible change. But it is extremely unlikely
     to matter. The caller must've created a detached mount via
     OPEN_TREE_CLONE and then handed that O_PATH fd to another process
     or thread which then must've gotten a writable fd for that mount
     and started creating files in there while the caller is still
     changing mount properties. While not impossible it will be an
     extremely rare corner-case and should in general be considered a
     bug in the application. Consider making a mount MOUNT_ATTR_NOEXEC
     or MOUNT_ATTR_NODEV while allowing someone else to perform lookups
     or exec'ing in parallel by handing them a copy of the
     OPEN_TREE_CLONE fd or another fd beneath that mount.

     I've pinged all major users of idmapped mounts pointing out this
     change and none of them have active writers on a mount while still
     changing mount properties. It would've been strange if they did.

  The rest and majority of the work will be coming through the overlayfs
  tree this cycle. In addition to overlayfs this cycle should also see
  support for idmapped mounts on erofs as I've acked a patch to this
  effect a little while ago"

* tag 'fs.idmapped.v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: hold writers when changing mount's idmapping
  docs: Add small intro to idmap examples
2022-05-24 18:19:06 -07:00
Linus Torvalds
0350785b0a integrity-v5.19
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCYo0tOhQcem9oYXJAbGlu
 dXguaWJtLmNvbQAKCRDLwZzRsCrn5QJfAP47Ym9vacLc1m8/MUaRA/QjbJ/8t3TX
 h/4McK8kiRudxgD/RiPHII6gJ8q+qpBrYWJZ4ZZaHE8v0oA1viuZfbuN2wc=
 =KQYi
 -----END PGP SIGNATURE-----

Merge tag 'integrity-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull IMA updates from Mimi Zohar:
 "New is IMA support for including fs-verity file digests and signatures
  in the IMA measurement list as well as verifying the fs-verity file
  digest based signatures, both based on policy.

  In addition, are two bug fixes:

   - avoid reading UEFI variables, which cause a page fault, on Apple
     Macs with T2 chips.

   - remove the original "ima" template Kconfig option to address a boot
     command line ordering issue.

  The rest is a mixture of code/documentation cleanup"

* tag 'integrity-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  integrity: Fix sparse warnings in keyring_handler
  evm: Clean up some variables
  evm: Return INTEGRITY_PASS for enum integrity_status value '0'
  efi: Do not import certificates from UEFI Secure Boot for T2 Macs
  fsverity: update the documentation
  ima: support fs-verity file digest based version 3 signatures
  ima: permit fsverity's file digests in the IMA measurement list
  ima: define a new template field named 'd-ngv2' and templates
  fs-verity: define a function to return the integrity protected file digest
  ima: use IMA default hash algorithm for integrity violations
  ima: fix 'd-ng' comments and documentation
  ima: remove the IMA_TEMPLATE Kconfig option
  ima: remove redundant initialization of pointer 'file'.
2022-05-24 13:50:39 -07:00
Linus Torvalds
140e40e39a zonefs changes for 5.19-rc1
This set of patches improve zonefs open sequential file accounting and
 adds accounting for active sequential files to allow the user to handle
 the maximum number of active zones of an NVMe ZNS drive. sysfs
 attributes for both open and active sequential files are also added to
 facilitate access to this information from applications without
 resorting to inspecting the block device limits.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYosTQQAKCRDdoc3SxdoY
 dqUWAQDGKoSkyRAPJAmuQXYOuOJTLu0b8DSfvyPopFLfKXpPHAEAg995JNTLUs0G
 R3m7lH6GK+OSBWhZ/Z5HOND3QS9BhgM=
 =hvqx
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs updates from Damien Le Moal:
 "This improves zonefs open sequential file accounting and adds
  accounting for active sequential files to allow the user to handle the
  maximum number of active zones of an NVMe ZNS drive.

  sysfs attributes for both open and active sequential files are also
  added to facilitate access to this information from applications
  without resorting to inspecting the block device limits"

* tag 'zonefs-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  documentation: zonefs: Document sysfs attributes
  documentation: zonefs: Cleanup the mount options section
  zonefs: Add active seq file accounting
  zonefs: Export open zone resource information through sysfs
  zonefs: Always do seq file write open accounting
  zonefs: Rename super block information fields
  zonefs: Fix management of open zones
  zonefs: Clear inode information flags on inode creation
2022-05-23 14:36:45 -07:00
Johannes Weiner
f6498b776d mm: zswap: add basic meminfo and vmstat coverage
Currently it requires poking at debugfs to figure out the size and
population of the zswap cache on a host.  There are no counters for reads
and writes against the cache.  As a result, it's difficult to understand
zswap behavior on production systems.

Print zswap memory consumption and how many pages are zswapped out in
/proc/meminfo.  Count zswapouts and zswapins in /proc/vmstat.

Link: https://lkml.kernel.org/r/20220510152847.230957-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:53 -07:00
Johannes Weiner
39799b6409 Documentation: filesystems: proc: update meminfo section
Patch series "zswap: accounting & cgroup control", v2.

Zswap can consume nearly a quarter of RAM in the default configuration,
yet it's neither listed in /proc/meminfo, nor is it accounted and
manageable on a per-cgroup basis.

This makes reasoning about the memory situation on a host in general
rather difficult.  On shared/cgrouped hosts, the consequences are worse. 
First, workloads can escape memory containment and cause resource priority
inversions: a lo-pri group can fill the global zswap pool and force a
hi-pri group out to disk.  Second, not all workloads benefit from zswap
equally.  Some even suffer when memory contents compress poorly, and are
better off going to disk swap directly.  On a host with mixed workloads,
it's currently not possible to enable zswap for one workload but not for
the other.

This series implements the missing global accounting as well as cgroup
tracking & control for zswap backing memory:

- Patch 1 refreshes the very out-of-date meminfo documentation in
  Documentation/filesystems/proc.rst.

- Patches 2-4 clean up related and adjacent options in Kconfig. Not
  actual dependencies, just things I noticed during development.

- Patch 5 adds meminfo and vmstat coverage for zswap consumption and
  activity.

- Patch 6 implements per-cgroup tracking & control of zswap memory.


This patch (of 6):

Add new entries.  Minor corrections and cleanups.

[hannes@cmpxchg.org: fix htmldocs warnings]
  Link: https://lkml.kernel.org/r/Ynve8dg4zJyhH2gW@cmpxchg.org
[hannes@cmpxchg.org: change `Unevictable' wording, per David]
  Link: https://lkml.kernel.org/r/YnwFraZlVWQoCjz3@cmpxchg.org
Link: https://lkml.kernel.org/r/20220510152847.230957-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20220510152847.230957-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:52 -07:00
Dai Ngo
2443da2259 fs/lock: add 2 callbacks to lock_manager_operations to resolve conflict
Add 2 new callbacks, lm_lock_expirable and lm_expire_lock, to
lock_manager_operations to allow the lock manager to take appropriate
action to resolve the lock conflict if possible.

A new field, lm_mod_owner, is also added to lock_manager_operations.
The lm_mod_owner is used by the fs/lock code to make sure the lock
manager module such as nfsd, is not freed while lock conflict is being
resolved.

lm_lock_expirable checks and returns true to indicate that the lock
conflict can be resolved else return false. This callback must be
called with the flc_lock held so it can not block.

lm_expire_lock is called to resolve the lock conflict if the returned
value from lm_lock_expirable is true. This callback is called without
the flc_lock held since it's allowed to block. Upon returning from
this callback, the lock conflict should be resolved and the caller is
expected to restart the conflict check from the beginnning of the list.

Lock manager, such as NFSv4 courteous server, uses this callback to
resolve conflict by destroying lock owner, or the NFSv4 courtesy client
(client that has expired but allowed to maintains its states) that owns
the lock.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-19 12:25:39 -04:00
Chuck Lever
a28faaddb2 Documentation: Add an explanation of NFSv4 client identifiers
To enable NFSv4 to work correctly, NFSv4 client identifiers have
to be globally unique and persistent over client reboots. We
believe that in many cases, a good default identifier can be
chosen and set when a client system is imaged.

Because there are many different ways a system can be imaged,
provide an explanation of how NFSv4 client identifiers and
principals can be set by install scripts and imaging tools.

Additional cases, such as NFSv4 clients running in containers, also
need unique and persistent identifiers. The Linux NFS community
sets forth this explanation to aid those who create and manage
container environments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 15:33:21 -04:00
Jeffle Xu
99302ebd3a cachefiles: document on-demand read mode
Document new user interface introduced by on-demand read mode.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-9-jefflexu@linux.alibaba.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:18 +08:00
Paul Gortmaker
e24ccaaf7e block: remove last remaining traces of IDE documentation
The last traces of the IDE driver went away in commit b7fb14d3ac
("ide: remove the legacy ide driver") but it left behind some traces
of old documentation.

As luck would have it Randy and I would submit similar changes within
a week of each other to address this.  As Randy's commit is in the doc
tree already - this delta is just the stuff my removal contained that
was not in Randy's IDE doc removal.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: https://lore.kernel.org/all/20220427165917.GE12977@windriver.com
[phil@philpotter.co.uk: removed diffs already added by others]
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/r/20220515205833.944139-5-phil@philpotter.co.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-15 18:31:28 -06:00
Mimi Zohar
02ee2316b9 fsverity: update the documentation
Update the fsverity documentation related to IMA signature support.

Acked-by: Stefan Berger <stefanb@linux.ibm.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2022-05-12 18:38:27 -04:00
Matthew Wilcox (Oracle)
d2329aa0c7 fs: Add free_folio address space operation
Include documentation and convert the callers to use ->free_folio as
well as ->freepage.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09 23:12:52 -04:00
Matthew Wilcox (Oracle)
fa29000b6b fs: Add aops->release_folio
This replaces aops->releasepage.  Update the documentation, and call it
if it exists.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-09 23:12:30 -04:00
NeilBrown
cba738f649 doc: update documentation for swap_activate and swap_rw
This documentation for ->swap_activate() has been out-of-date for a long
time.  This patch updates it to match recent changes, and adds
documentation for the associated ->swap_rw()

Link: https://lkml.kernel.org/r/164859778126.29473.6778751233552859461.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09 18:20:48 -07:00
Matthew Wilcox (Oracle)
08830c8bc6 fs: Add read_folio documentation
Convert all the ->readpage documentation to ->read_folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09 16:21:44 -04:00
Rodrigo Campos
ccbd0c9919
docs: Add small intro to idmap examples
When reading the documentation, I didn't understand why this list
examples of things that fail without using the mount idmap feature.
It seems pretty pointless and I doubted if I was missing something,
until I finished the examples, the next section and saw the examples
revisited.  After that, it all made sense.

Let's add one small sentence before, so the reader knows where this is
going and why examples that don't might seem relevant are used.

Link: https://lore.kernel.org/r/20220429135748.481301-1-rodrigo@sdfg.com.ar
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Rodrigo Campos <rodrigo@sdfg.com.ar>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-05-09 11:20:43 +02:00
Matthew Wilcox (Oracle)
520f301c54 fs: Convert is_dirty_writeback() to take a folio
Pass a folio instead of a page to aops->is_dirty_writeback().
Convert both implementations and the caller.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:45:56 -04:00
Matthew Wilcox (Oracle)
9d6b0cd757 fs: Remove flags parameter from aops->write_begin
There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Matthew Wilcox (Oracle)
de2a931150 fs: Remove aop_flags parameter from netfs_write_begin()
There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Matthew Wilcox (Oracle)
56f5746c41 namei: Merge page_symlink() and __page_symlink()
There are no callers of __page_symlink() left, so we can remove that
entry point.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
2022-05-08 14:28:18 -04:00
Damien Le Moal
31a644b3c2 documentation: zonefs: Document sysfs attributes
Document the max_wro_seq_files, nr_wro_seq_files, max_active_seq_files
and nr_active_seq_files sysfs attributes.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
2022-04-26 14:53:06 +09:00
Linus Torvalds
d615b5416f f2fs-fix-5.18
This includes major bug fixes introduced in 5.18-rc1 and 5.17+.
 
 - Remove obsolete whint_mode (5.18-rc1)
 - Fix IO split issue caused by op_flags change in f2fs (5.18-rc1)
 - Fix a wrong condition check to detect IO failure loop (5.18-rc1)
 - Fix wrong data truncation during roll-forward (5.17+)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmJmzzYACgkQQBSofoJI
 UNKuLw/+M3J/WCNH8Zuysgt3eJ6K9e2jtT/xfj+p/s52lileP4ljtnWJ7x8YaViX
 sMbzLkDL4tPip2SuNEQifW2EmMCJ6BCCWV5+SIK5/rEiwRYuhFygVqQIHn2S8g5O
 gWixusQ/NYaUN6e3Bi/7WjUhWmeV4oyhv6WB+sS3f0zaNs9RGoe/K3yNGeLLWm+/
 0Pz6P2LAARs/N/iSw03KWv+1BzMJxbLC1w38boPbysv5oT59+gtxKGVSxie1cce8
 F5mMt1Z7JplPUKhbjrqYo9LzNzAFIgIt3P6mbclE7ASKi9UYOtiT3nvikj6lygbM
 i9FHIcP6bqtjU7GZ5vVbYDW43pGZN+6Hlz7Fu1I3ix4Z2eyFWc/W1Fl6OmjvjVpj
 t/iafwvvdqm1NChLkJx3EXquDDuhxvKhbuuaTwLpuNt+56OvFJ0e91kZhWwbB3dY
 7y80N+VgB0MvFStWeZD85lMvSYfXmv5dnjCu6+nAxRzlsx+JVs8STN6+KHSiVeoY
 LbtyR1sViO0UGNVZAd8XLs8CIScfxatx059ui0wW+Bh2JOy2p5RW0vUKxFzm9/ZJ
 OtRT2W0fdyutYpfwERxny706cV3wOOOP20a/2NC4HgUVWLYNO6hnLeDK7+WlOMqP
 XHWKpkRzwLtZQKVyFNDtFHoYgQKCltQNNp6t/qJnSGC1Q1Cchfo=
 =nBQL
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fixes from Jaegeuk Kim:
 "This includes major bug fixes introduced in 5.18-rc1 and 5.17+:

   - Remove obsolete whint_mode (5.18-rc1)

   - Fix IO split issue caused by op_flags change in f2fs (5.18-rc1)

   - Fix a wrong condition check to detect IO failure loop (5.18-rc1)

   - Fix wrong data truncation during roll-forward (5.17+)"

* tag 'f2fs-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: should not truncate blocks during roll-forward recovery
  f2fs: fix wrong condition check when failing metapage read
  f2fs: keep io_flags to avoid IO split due to different op_flags in two fio holders
  f2fs: remove obsolete whint_mode
2022-04-25 10:53:56 -07:00
Linus Torvalds
c00c5e1d15 Fix some syzbot-detected bugs, as well as other bugs found by I/O
injection testing.  Change ext4's fallocate to update consistently
 drop set[ug]id bits when an fallocate operation might possibly change
 the user-visible contents of a file.  Also, improve handling of
 potentially invalid values in the the s_overhead_cluster superblock
 field to avoid ext4 returning a negative number of free blocks.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmJinf8ACgkQ8vlZVpUN
 gaOHgQf+MKgUZgteYogLzoP3mF1kSycOGawk4wZ3QHOLz7AvsV2p9J8BWihbS/EK
 dBydfXbTMvCUrjWmpqb5dHECRzxdfxOJ0SPJtibc8DZaJc9ImNFmgSp9kyJ3uRaN
 cPGO6Lz2RXpdumVMPPLwzUJdVyrLi0K6I1NYSocxKgribePzd+xil8S9zRZj8Bpe
 RaeH0EytcRj2CI5qs5mI/mOPBAMsZeczd3HInI3gyCgP2I4ZOfsADne3APx57mcI
 IGKf77nvIwMHeKel3MGYfFPitEs5cZpHUhHplCMtgFsO8H0IR93tqnlaCvTM7VAZ
 Slamgl7pfcXFcLZP+pm0QL/82ub7iw==
 =FIds
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix some syzbot-detected bugs, as well as other bugs found by I/O
  injection testing.

  Change ext4's fallocate to consistently drop set[ug]id bits when an
  fallocate operation might possibly change the user-visible contents of
  a file.

  Also, improve handling of potentially invalid values in the the
  s_overhead_cluster superblock field to avoid ext4 returning a negative
  number of free blocks"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix a potential race while discarding reserved buffers after an abort
  ext4: update the cached overhead value in the superblock
  ext4: force overhead calculation if the s_overhead_cluster makes no sense
  ext4: fix overhead calculation to account for the reserved gdt blocks
  ext4, doc: fix incorrect h_reserved size
  ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
  ext4: fix use-after-free in ext4_search_dir
  ext4: fix bug_on in start_this_handle during umount filesystem
  ext4: fix symlink file size not match to file content
  ext4: fix fallocate to use file_modified to update permissions consistently
2022-04-22 18:18:27 -07:00
Damien Le Moal
ae43038866 documentation: zonefs: Cleanup the mount options section
Use subsections to separate the descriptions of the "error=" and
"explicit-open" mount sections.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21 08:38:01 +09:00
Jaegeuk Kim
930e260763 f2fs: remove obsolete whint_mode
This patch removes obsolete whint_mode.

Fixes: 41d36a9f3e ("fs: remove kiocb.ki_hint")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-20 11:16:43 -07:00
wangjianjian (C)
7102ffe4c1 ext4, doc: fix incorrect h_reserved size
According to document and code, ext4_xattr_header's size is 32 bytes, so
h_reserved size should be 3.

Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com>
Link: https://lore.kernel.org/r/92fcc3a6-7d77-8c09-4126-377fcb4c46a5@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-04-12 22:24:46 -04:00
Yue Hu
2c547f2998 fscache: Remove the cookie parameter from fscache_clear_page_bits()
The cookie is not used at all, remove it and update the usage in io.c
and afs/write.c (which is the only user outside of fscache currently)
at the same time.

[DH: Amended the documentation also]

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://listman.redhat.com/archives/linux-cachefs/2022-April/006659.html
2022-04-08 23:54:37 +01:00
Yue Hu
5d3d5b9645 docs: filesystems: caching/backend-api.rst: fix an object withdrawn API
There's no fscache_are_objects_withdrawn() helper at all to test if
cookie withdrawal is completed currently. The cache backend is using
fscache_wait_for_objects() to wait all objects to be withdrawn.

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://listman.redhat.com/archives/linux-cachefs/2022-April/006705.html # v1
2022-04-08 23:54:37 +01:00