1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

629 commits

Author SHA1 Message Date
Trond Myklebust
66f9dac907 Revert "nfs: don't reuse partially completed requests in nfs_lock_and_join_requests"
This reverts commit b571cfcb9d.

This patch appears to assume that if one request is complete, then the
others will complete too before unlocking. That is not a valid
assumption, since other requests could hit a non-fatal error or a short
write that would cause them not to complete.

Reported-by: Igor Raits <igor@gooddata.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219508
Fixes: b571cfcb9d ("nfs: don't reuse partially completed requests in nfs_lock_and_join_requests")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-11-18 09:42:03 -05:00
Trond Myklebust
8f52caf9d2 Revert "fs: nfs: fix missing refcnt by replacing folio_set_private by folio_attach_private"
This reverts commit 03e02b9417.

As was pointed out during code review, there is no need to use
folio_attach_private()/folio_detach_private() when a refcount to the
folio is already carried by the struct nfs_page.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-11-09 21:38:39 -05:00
Trond Myklebust
fa88a7d6ae nfs: enable localio for non-pNFS IO
Try a local open of the file being written to, and if it succeeds,
then use localio to issue IO.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Weston Andros Adamson
70ba381e1a nfs: add LOCALIO support
Add client support for bypassing NFS for localhost reads, writes, and
commits. This is only useful when the client and the server are
running on the same host.

nfs_local_probe() is stubbed out, later commits will enable client and
server handshake via a Linux-only LOCALIO auxiliary RPC protocol.

This has dynamic binding with the nfsd module (via nfs_localio module
which is part of nfs_common). LOCALIO will only work if nfsd is
already loaded.

The "localio_enabled" nfs kernel module parameter can be used to
disable and enable the ability to use LOCALIO support.

CONFIG_NFS_LOCALIO enables NFS client support for LOCALIO.

Lastly, LOCALIO uses an nfsd_file to initiate all IO. To make proper
use of nfsd_file (and nfsd's filecache) its lifetime (duration before
nfsd_file_put is called) must extend until after commit, read and
write operations. So rather than immediately drop the nfsd_file
reference in nfs_local_open_fh(), that doesn't happen until
nfs_local_pgio_release() for read/write and not until
nfs_local_release_commit_data() for commit. The same applies to the
reference held on nfsd's nn->nfsd_serv. Both objects' lifetimes and
associated references are managed through calls to
nfs_to->nfsd_file_put_local().

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de> # nfs_open_local_fh
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Mike Snitzer
df24c483e2 nfs: pass struct nfsd_file to nfs_init_pgio and nfs_init_commit
The nfsd_file will be passed, in future commits, by callers
that enable LOCALIO support (for both regular NFS and pNFS IO).

[Derived from patch authored by Weston Andros Adamson, but switched
 from passing struct file to struct nfsd_file]

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:30 -04:00
Dan Aloni
dfb07e990a nfs: add 'noalignwrite' option for lock-less 'lost writes' prevention
There are some applications that write to predefined non-overlapping
file offsets from multiple clients and therefore don't need to rely on
file locking. However, if these applications want non-aligned offsets
and sizes they need to either use locks or risk data corruption, as the
NFS client defaults to extending writes to whole pages.

This commit adds a new mount option `noalignwrite`, which allows to turn
that off and avoid the need of locking, as long as these applications
don't overlap on offsets.

Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:13 -04:00
Zhaoyang Huang
03e02b9417 fs: nfs: fix missing refcnt by replacing folio_set_private by folio_attach_private
This patch is inspired by a code review of fs codes which aims at
folio's extra refcnt that could introduce unwanted behavious when
judging refcnt, such as[1].That is, the folio passed to
mapping_evict_folio carries the refcnts from find_lock_entries,
page_cache, corresponding to PTEs and folio's private if has. However,
current code doesn't take the refcnt for folio's private which could
have mapping_evict_folio miss the one to only PTE and lead to
call filemap_release_folio wrongly.

[1]
long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
{
...
//current code will misjudge here if there is one pte on the folio which
is be deemed as the one as folio's private
        if (folio_ref_count(folio) >
                        folio_nr_pages(folio) + folio_has_private(folio) + 1)
                return 0;
        if (!filemap_release_folio(folio, 0))
                return 0;

        return remove_mapping(mapping, folio);
}

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23 15:03:13 -04:00
Linus Torvalds
fbc90c042c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN
walkers") is known to cause a performance regression
   (https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
   Yu has a fix which I'll send along later via the hotfixes branch.
 
 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.
 
 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that.  This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches.  My bad.
 
 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"
 
 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability of
   cgroup writeback"
 
 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache index".
 
 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of the
   zeropage in MAP_SHARED mappings.  I don't see any runtime effects here -
   more a cleanup/understandability/maintainablity thing.
 
 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
   higher addresses, for aarch64.  The (poorly named) series is
   "Restructure va_high_addr_switch".
 
 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".
 
 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
   series "Enhance soft hwpoison handling and injection".
 
 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything.  Some landed in this pull.
 
 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
   simplified migration's use of hardware-offload memory copying.
 
 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".
 
 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code.  This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.
 
 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.
 
 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP.  By default this
   is a no-op - users must opt in vis sysfs controls.  Dramatic
   improvements in pagefault latency are realized.
 
 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".
 
 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".
 
 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".
 
 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".
 
 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances.  A 10x speedup is seen in a microbenchmark.
 
   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.
 
 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".
 
 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.
 
 - Is anyone reading this stuff?  If so, email me!
 
 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.
 
 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".
 
 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.
 
 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".
 
 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE".  It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.
 
 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.
 
 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large folio
   userspace copying.
 
 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers.  From SeongJae Park.
 
 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.
 
 - David Hildenbrand sends along more cleanups, this time against the
   migration code.  The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".
 
 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code.  He addresses this in the series "mm: Fix various
   readahead quirks".
 
 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's self
   testing code.
 
 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code.  The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this.  The series is marked cc:stable.
 
 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.
 
 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion.  The series (which also makes the memcg-v1 code
   Kconfigurable) are
 
   "mm: memcg: separate legacy cgroup v1 code and put under config
   option" and
   "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
 
 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.
 
 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of excessive
   correctable memory errors.  In order to permit userspace to monitor and
   handle this situation.
 
 - Kefeng Wang's series "mm: migrate: support poison recover from migrate
   folio" teaches the kernel to appropriately handle migration from
   poisoned source folios rather than simply panicing.
 
 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.
 
 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory utilization.
 
 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
   refcount increments.  So these paes can first be moved aside if they
   reside in the movable zone or a CMA block.
 
 - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
   for much faster reading of vma information.  The series is "query VMAs
   from /proc/<pid>/maps".
 
 - In the series "mm: introduce per-order mTHP split counters" Lance Yang
   improves the kernel's presentation of developer information related to
   multisize THP splitting.
 
 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)".  This permits
   userspace to use all available huge page sizes.
 
 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and not
   very useful feature from slab fault injection.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
 joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
 xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
 =z0Lf
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.

 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that. This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches. My
   bad.

 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"

 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability
   of cgroup writeback"

 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache
   index".

 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of
   the zeropage in MAP_SHARED mappings. I don't see any runtime effects
   here - more a cleanup/understandability/maintainablity thing.

 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
   of higher addresses, for aarch64. The (poorly named) series is
   "Restructure va_high_addr_switch".

 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".

 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
   the series "Enhance soft hwpoison handling and injection".

 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything. Some landed in this pull.

 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
   has simplified migration's use of hardware-offload memory copying.

 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".

 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code. This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.

 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.

 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP. By default this
   is a no-op - users must opt in vis sysfs controls. Dramatic
   improvements in pagefault latency are realized.

 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".

 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".

 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".

 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".

 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances. A 10x speedup is seen in a microbenchmark.

   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.

 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".

 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.

 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.

 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".

 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.

 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".

 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.

 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.

 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large
   folio userspace copying.

 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers. From SeongJae Park.

 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.

 - David Hildenbrand sends along more cleanups, this time against the
   migration code. The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".

 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code. He addresses this in the series "mm: Fix various
   readahead quirks".

 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's
   self testing code.

 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code. The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this. The series is marked cc:stable.

 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.

 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion. The series (which also makes the memcg-v1 code
   Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
   under config option" and "mm: memcg: put cgroup v1-specific memcg
   data under CONFIG_MEMCG_V1"

 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.

 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of
   excessive correctable memory errors. In order to permit userspace to
   monitor and handle this situation.

 - Kefeng Wang's series "mm: migrate: support poison recover from
   migrate folio" teaches the kernel to appropriately handle migration
   from poisoned source folios rather than simply panicing.

 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.

 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory
   utilization.

 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than
   bare refcount increments. So these paes can first be moved aside if
   they reside in the movable zone or a CMA block.

 - Andrii Nakryiko has added a binary ioctl()-based API to
   /proc/pid/maps for much faster reading of vma information. The series
   is "query VMAs from /proc/<pid>/maps".

 - In the series "mm: introduce per-order mTHP split counters" Lance
   Yang improves the kernel's presentation of developer information
   related to multisize THP splitting.

 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)". This permits
   userspace to use all available huge page sizes.

 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and
   not very useful feature from slab fault injection.

* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
  mm/mglru: fix ineffective protection calculation
  mm/zswap: fix a white space issue
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix possible recursive locking detected warning
  mm/gup: clear the LRU flag of a page before adding to LRU batch
  mm/numa_balancing: teach mpol_to_str about the balancing mode
  mm: memcg1: convert charge move flags to unsigned long long
  alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
  lib: reuse page_ext_data() to obtain codetag_ref
  lib: add missing newline character in the warning message
  mm/mglru: fix overshooting shrinker memory
  mm/mglru: fix div-by-zero in vmpressure_calc_level()
  mm/kmemleak: replace strncpy() with strscpy()
  mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
  mm: ignore data-race in __swap_writepage
  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
  mm: shmem: rename mTHP shmem counters
  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
  mm/migrate: putback split folios when numa hint migration fails
  ...
2024-07-21 17:15:46 -07:00
Linus Torvalds
4f40c636b2 NFS Client Updates for Linux 6.11
New Features:
   * Add support for large folios
   * Implement rpcrdma generic device removal notification
   * Add client support for attribute delegations
   * Use a LAYOUTRETURN during reboot recovery to report layoutstats and errors
   * Improve throughput for random buffered writes
   * Add NVMe support to pnfs/blocklayout
 
 Bugfixes:
   * Fix rpcrdma_reqs_reset()
   * Avoid soft lockups when using UDP
   * Fix an nfs/blocklayout premature PR key unregestration
   * Another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server
   * Do not extend writes to the entire folio
   * Pass explicit offset and count values to tracepoints
   * Fix a race to wake up sleeping SUNRPC sync tasks
   * Fix gss_status tracepoint output
 
 Cleanups:
   * Add missing MODULE_DESCRIPTION() macros
   * Add blocklayout / SCSI layout tracepoints
   * Remove asm-generic headers from xprtrdma verbs.c
   * Remove unused 'struct mnt_fhstatus'
   * Other delegation related cleanups
   * Other folio related cleanups
   * Other pNFS related cleanups
   * Other xprtrdma cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmaZgr0ACgkQ18tUv7Cl
 QOv8FxAAnUyYG7Kdbv+5Ko/SFv0imxCb5DQh2XC/hSHNrlKBlDnqe2PANXR9XocL
 mS0Wry5tZf/T+o+QoKv0HQUdWFlnqKzwclggrekf/lkioU1feWsLe2RzDl1iUh0V
 6fwcCyWXW1mYX2CtCaDe+/ZFcoZOMD+bItNHt/RdDScSnS9Jd8GSyocsVKsqaBx6
 3wub0FJ4UBgYNoX2T3YyK2JwvO9GLaKIQRJV74rjgPJKjcjhptbcb5MKBmOZrF95
 UCcpl4CwvD9RTsSEp0B98UbAFFpk8Nw1tmHF3GmyG/nsrJomDuLKFvbsiq23eHUf
 XeULZIbjMEzU56vjoTglZA4s7JYx17D0vzdPGUqU4mLN3LPm5LtGLBg2uQoPw/xW
 50euLU+ol36mfnQlBsuM/tAXgtoAcT63aNeNRNp8aOL47xA+PC6kWTBK9OaR5+x6
 w+d22Dpy+riMk1TRaAVt0ANcENKELsWRFvxkuWCpQhVoQ1h8LigQJzeggEEK7Sa6
 5u9H6wCTee2wz746uwA43koj1utuyrLq/5S+qEtCY1pbP3U0A+Gh0Xh00OXiYuzL
 TgRdksmiAL8cA51WjSrq6HhGLOUJAYLfbdKaVhW+fULxUVwzWhFFaFbbdiq/e4OR
 0pfqls8UZWICE51GeTfalEidpKZgV/LxU3QOuVoalWBULyj/TeI=
 =avTW
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Add support for large folios
   - Implement rpcrdma generic device removal notification
   - Add client support for attribute delegations
   - Use a LAYOUTRETURN during reboot recovery to report layoutstats
     and errors
   - Improve throughput for random buffered writes
   - Add NVMe support to pnfs/blocklayout

  Bugfixes:
   - Fix rpcrdma_reqs_reset()
   - Avoid soft lockups when using UDP
   - Fix an nfs/blocklayout premature PR key unregestration
   - Another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server
   - Do not extend writes to the entire folio
   - Pass explicit offset and count values to tracepoints
   - Fix a race to wake up sleeping SUNRPC sync tasks
   - Fix gss_status tracepoint output

  Cleanups:
   - Add missing MODULE_DESCRIPTION() macros
   - Add blocklayout / SCSI layout tracepoints
   - Remove asm-generic headers from xprtrdma verbs.c
   - Remove unused 'struct mnt_fhstatus'
   - Other delegation related cleanups
   - Other folio related cleanups
   - Other pNFS related cleanups
   - Other xprtrdma cleanups"

* tag 'nfs-for-6.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits)
  SUNRPC: Fixup gss_status tracepoint error output
  SUNRPC: Fix a race to wake a sync task
  nfs: split nfs_read_folio
  nfs: pass explicit offset/count to trace events
  nfs: do not extend writes to the entire folio
  nfs/blocklayout: add support for NVMe
  nfs: remove nfs_page_length
  nfs: remove the unused max_deviceinfo_size field from struct pnfs_layoutdriver_type
  nfs: don't reuse partially completed requests in nfs_lock_and_join_requests
  nfs: move nfs_wait_on_request to write.c
  nfs: fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requests
  nfs: fold nfs_folio_find_and_lock_request into nfs_lock_and_join_requests
  nfs: simplify nfs_folio_find_and_lock_request
  nfs: remove nfs_folio_private_request
  nfs: remove dead code for the old swap over NFS implementation
  NFSv4.1 another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server
  nfs: Block on write congestion
  nfs: Properly initialize server->writeback
  nfs: Drop pointless check from nfs_commit_release_pages()
  nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg
  ...
2024-07-18 17:17:30 -07:00
Christoph Hellwig
fada32ed6d nfs: pass explicit offset/count to trace events
nfs_folio_length is unsafe to use without having the folio locked and a
check for a NULL ->f_mapping that protects against truncations and can
lead to kernel crashes.  E.g. when running xfstests generic/065 with
all nfs trace points enabled.

Follow the model of the XFS trace points and pass in an explіcit offset
and length.  This has the additional benefit that these values can
be more accurate as some of the users touch partial folio ranges.

Fixes: eb5654b3b8 ("NFS: Enable tracing of nfs_invalidate_folio() and nfs_launder_folio()")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-17 13:15:35 -04:00
Christoph Hellwig
39c910a430 nfs: do not extend writes to the entire folio
nfs_update_folio has code to extend a write to the entire page under
certain conditions.  With the support for large folios this now
suddenly extents to the variable sized and potentially much larger folio.
Add code to limit the extension to the page boundaries of the start and
end of the write, which matches the historic expecation and the code
comments.

Fixes: b73fe2dd6cd5 ("nfs: add support for large folios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-12 11:36:08 -04:00
Christoph Hellwig
b571cfcb9d nfs: don't reuse partially completed requests in nfs_lock_and_join_requests
When NFS requests are split into sub-requests, nfs_inode_remove_request
calls nfs_page_group_sync_on_bit to set PG_REMOVE on this sub-request and
only completes the head requests once PG_REMOVE is set on all requests.
This means that when nfs_lock_and_join_requests sees a PG_REMOVE bit, I/O
on the request is in progress and has partially completed.   If such a
request is returned to nfs_try_to_update_request, it could be extended
with the newly dirtied region and I/O for the combined range will be
re-scheduled, leading to extra I/O.

Change the logic to instead restart the search for a request when any
PG_REMOVE bit is set, as the completion handler will remove the request
as soon as it can take the page group lock.  This not only avoid
extending the I/O but also does the right thing for the callers that
want to cancel or flush the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:51 -04:00
Christoph Hellwig
f1b7c7552c nfs: move nfs_wait_on_request to write.c
nfs_wait_on_request is now only used in write.c.  Move it there
and mark it static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:51 -04:00
Christoph Hellwig
25edbcac6e nfs: fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requests
Fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requests to
prepare for future changes to this code, and move the helpers to write.c
as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:51 -04:00
Christoph Hellwig
c3f2235782 nfs: fold nfs_folio_find_and_lock_request into nfs_lock_and_join_requests
Fold nfs_folio_find_and_lock_request into the only caller to prepare
for changes to this code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:51 -04:00
Christoph Hellwig
9eb7c484db nfs: simplify nfs_folio_find_and_lock_request
nfs_folio_find_and_lock_request and the nfs_page_group_lock_head helper
called by it spend quite some effort to deal with head vs subrequests.
But given that only the head request can be stashed in the folio private
data, non of that is required.

Fold the locking logic from nfs_page_group_lock_head into
nfs_folio_find_and_lock_request and simplify the result based on the
invariant that we always find the head request in the folio private data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:51 -04:00
Christoph Hellwig
02e61ec1e2 nfs: remove nfs_folio_private_request
nfs_folio_private_request is a trivial wrapper around, which itself has
fallen out of favor and has been replaced with plain ->private
dereferences in recent folio conversions.  Do the same for nfs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:51 -04:00
Christoph Hellwig
7e8e78a0ba nfs: remove dead code for the old swap over NFS implementation
Remove the code testing folio_test_swapcache either explicitly or
implicitly in pagemap.h headers, as is now handled using the direct I/O
path and not the buffered I/O path that these helpers are located in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:51 -04:00
Jan Kara
2f1f31042e nfs: Block on write congestion
Commit 6df25e5853 ("nfs: remove reliance on bdi congestion")
introduced NFS-private solution for limiting number of writes
outstanding against a particular server. Unlike previous bdi congestion
this algorithm actually works and limits number of outstanding writeback
pages to nfs_congestion_kb which scales with amount of client's memory
and is capped at 256 MB. As a result some workloads such as random
buffered writes over NFS got slower (from ~170 MB/s to ~126 MB/s). The
fio command to reproduce is:

fio --direct=0 --ioengine=sync --thread --invalidate=1 --group_reporting=1
  --runtime=300 --fallocate=posix --ramp_time=10 --new_group --rw=randwrite
  --size=64256m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1

This happens because the client sends ~256 MB worth of dirty pages to
the server and any further background writeback request is ignored until
the number of writeback pages gets below the threshold of 192 MB. By the
time this happens and clients decides to trigger another round of
writeback, the server often has no pages to write and the disk is idle.

To fix this problem and make the client react faster to eased congestion
of the server by blocking waiting for congestion to resolve instead of
aborting writeback. This improves the random 4k buffered write
throughput to 184 MB/s.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:27 -04:00
Jan Kara
37d4159dd2 nfs: Drop pointless check from nfs_commit_release_pages()
nfss->writeback is updated only when we are ending page writeback and at
that moment we also clear nfss->write_congested. So there's no point in
rechecking congestion state in nfs_commit_release_pages(). Drop the
pointless check.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:27 -04:00
Trond Myklebust
e12912d941 NFSv4: Add support for delegated atime and mtime attributes
Ensure that we update the mtime and atime correctly when we read
or write data to the file and when we truncate. Let the server manage
ctime on other attribute updates.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:25 -04:00
Trond Myklebust
4201916f2a NFSv4: Add a flags argument to the 'have_delegation' callback
This argument will be used to allow the caller to specify whether or not
they need to know that this is an attribute delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08 13:47:25 -04:00
Kairui Song
237d29075c nfs: drop usage of folio_file_pos
folio_file_pos is only needed for mixed usage of page cache and swap
cache, for pure page cache usage, the caller can just use folio_pos
instead.

After commit e1209d3a7a ("mm: introduce ->swap_rw and use it for reads
from SWP_FS_OPS swap-space"), swap cache should never be exposed to nfs.

So remove the usage of folio_file_pos in following NFS functions / helpers:

- nfs_vm_page_mkwrite

  It's only used by nfs_file_vm_ops.page_mkwrite

- trace event helper: nfs_folio_event
- trace event helper: nfs_folio_event_done

  These two are used through DEFINE_NFS_FOLIO_EVENT and
  DEFINE_NFS_FOLIO_EVENT_DONE, which defined following events:

  - trace_nfs_aop_readpage{_done}: only called by nfs_read_folio
  - trace_nfs_writeback_folio: only called by nfs_wb_folio
  - trace_nfs_invalidate_folio: only called by nfs_invalidate_folio
  - trace_nfs_launder_folio_done: only called by nfs_launder_folio

  None of them could possibly be used on swap cache folio,
  nfs_read_folio only called by:
  .write_begin -> nfs_read_folio
  .read_folio

  nfs_wb_folio only called by nfs mapping:
  .release_folio -> nfs_wb_folio
  .launder_folio -> nfs_wb_folio
  .write_begin -> nfs_read_folio -> nfs_wb_folio
  .read_folio -> nfs_wb_folio
  .write_end -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio
  .page_mkwrite -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio
  .write_begin -> nfs_flush_incompatible -> nfs_wb_folio
  .page_mkwrite -> nfs_vm_page_mkwrite -> nfs_flush_incompatible -> nfs_wb_folio

  nfs_invalidate_folio is only called by .invalidate_folio.
  nfs_launder_folio is only called by .launder_folio

- nfs_grow_file
- nfs_update_folio

  nfs_grow_file is only called by nfs_update_folio, and all
  possible callers of them are:

  .write_end -> nfs_update_folio
  .page_mkwrite -> nfs_update_folio

- nfs_wb_folio_cancel

  .invalidate_folio -> nfs_wb_folio_cancel

Also, seeing from the swap side, swap_rw is now the only interface calling
into fs, the offset info is always in iocb.ki_pos now.

So we can remove all these folio_file_pos call safely.

Link: https://lkml.kernel.org/r/20240521175854.96038-8-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:29:55 -07:00
Matthew Wilcox (Oracle)
8f3ab6e4be
nfs: Remove calls to folio_set_error
Common code doesn't test the error flag, so we don't need to set it in
nfs.  We can use folio_end_read() to combine the setting (or not)
of the uptodate flag and clearing the lock flag.

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240530202110.2653630-10-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-31 12:31:41 +02:00
David Howells
2e9d7e4b98 mm: Remove the PG_fscache alias for PG_private_2
Remove the PG_fscache alias for PG_private_2 and use the latter directly.
Use of this flag for marking pages undergoing writing to the cache should
be considered deprecated and the folios should be marked dirty instead and
the write done in ->writepages().

Note that PG_private_2 itself should be considered deprecated and up for
future removal by the MM folks too.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: Bharath SM <bharathsm@microsoft.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-04-29 15:01:42 +01:00
Linus Torvalds
c1f10ac840 NFS client updates for Linux 6.9
Highlights include:
 
 Bugfixes:
  - Fix for an Oops in the NFSv4.2 listxattr handler
  - Correct an incorrect buffer size in listxattr
  - Fix for an Oops in the pNFS flexfiles layout
  - Fix a refcount leak in NFS O_DIRECT writes
  - Fix missing locking in NFS O_DIRECT
  - Avoid an infinite loop in pnfs_update_layout
  - Fix an overflow in the RPC waitqueue queue length counter
  - Ensure that pNFS I/O is also protected by TLS when xprtsec
    is specified by the mount options
  - Fix a leaked folio lock in the netfs read code
  - Fix a potential deadlock in fscache
  - Allow setting the fscache uniquifier in NFSv4
  - Fix an off by one in root_nfs_cat()
  - Fix another off by one in rpc_sockaddr2uaddr()
  - nfs4_do_open() can incorrectly trigger state recovery.
  - Various fixes for connection shutdown
 
 Features and cleanups:
  - Ensure that containers only see their own RPC and NFS stats
  - Enable nconnect for RDMA
  - Remove dead code from nfs_writepage_locked()
  - Various tracepoint additions to track EXCHANGE_ID, GETDEVICEINFO, and
    mount options.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmX14K0ACgkQZwvnipYK
 APLCeg/7Bdah7158TdNxSQAHPo3jzDqZmc933eZC0H8C9whNlu6XIa9fyT6ZrsQr
 qkQ/ztSwsB6yp6vLPSnVdDh5KsndwrInTB874H8y6+8x+KwwuhSQ7Uy8epg5wrO0
 kgiaRYSH7HB7EgUdNY14fHNXkA/DMLHz1F1aw2NVGCYmVCMg7kGV4wYCOH6bI2Ea
 Wu8amZce6D1AbktbdSZcEz2ricR3lGXjCUPMnzRCaSpUmdd2t7d/rsnjTeKU1gb4
 p9zLlOZs9Xe2vMT0ZQI8SEI+Scze82LBy7ykSKyhOjOt4AurVpzQFAvK+3dFZoIq
 lzIHJwabBGNui26CR1k90ZqERLkkk+24i3ccT28HwhTqe5eM/qDCKOVQmuP0F1F8
 QYsnIM+NnmPZveSGAMdOQwlGFQTyJbT5Na1blHTW2R2rjqBzgvfn8fR0vV4L5P7B
 0J8ShmZKVkvb7mtJJhaaI4LF41ciCF8+I5zwpnYQi0tsX370XPNNFbzS3BmPUVFL
 k0uEMVfNy69PkaH4DJWQT9GoE3qiAamkO+EdAlPad6b8QMdJJZxXOmaUzL8YsCHV
 sX5ugsih/Hf5/+QFBCbHEy7G3oeeHsT80yO8nvGT+yy94bv4F+WcM/tviyRbKrls
 t5audBDNRfrAeUlqAQkXfFmAyqP2CGNr29oL62cXL2muFG7d7ys=
 =5n+X
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Bugfixes:
   - Fix for an Oops in the NFSv4.2 listxattr handler
   - Correct an incorrect buffer size in listxattr
   - Fix for an Oops in the pNFS flexfiles layout
   - Fix a refcount leak in NFS O_DIRECT writes
   - Fix missing locking in NFS O_DIRECT
   - Avoid an infinite loop in pnfs_update_layout
   - Fix an overflow in the RPC waitqueue queue length counter
   - Ensure that pNFS I/O is also protected by TLS when xprtsec is
     specified by the mount options
   - Fix a leaked folio lock in the netfs read code
   - Fix a potential deadlock in fscache
   - Allow setting the fscache uniquifier in NFSv4
   - Fix an off by one in root_nfs_cat()
   - Fix another off by one in rpc_sockaddr2uaddr()
   - nfs4_do_open() can incorrectly trigger state recovery
   - Various fixes for connection shutdown

  Features and cleanups:
   - Ensure that containers only see their own RPC and NFS stats
   - Enable nconnect for RDMA
   - Remove dead code from nfs_writepage_locked()
   - Various tracepoint additions to track EXCHANGE_ID, GETDEVICEINFO,
     and mount options"

* tag 'nfs-for-6.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (29 commits)
  nfs: fix panic when nfs4_ff_layout_prepare_ds() fails
  NFS: trace the uniquifier of fscache
  NFS: Read unlock folio on nfs_page_create_from_folio() error
  NFS: remove unused variable nfs_rpcstat
  nfs: fix UAF in direct writes
  nfs: properly protect nfs_direct_req fields
  NFS: enable nconnect for RDMA
  NFSv4: nfs4_do_open() is incorrectly triggering state recovery
  NFS: avoid infinite loop in pnfs_update_layout.
  NFS: remove sync_mode test from nfs_writepage_locked()
  NFSv4.1/pnfs: fix NFS with TLS in pnfs
  NFS: Fix an off by one in root_nfs_cat()
  nfs: make the rpc_stat per net namespace
  nfs: expose /proc/net/sunrpc/nfs in net namespaces
  sunrpc: add a struct rpc_stats arg to rpc_create_args
  nfs: remove unused NFS_CALL macro
  NFSv4.1: add tracepoint to trunked nfs4_exchange_id calls
  NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interrupt
  SUNRPC: increase size of rpc_wait_queue.qlen from unsigned short to unsigned int
  nfs: fix regression in handling of fsc= option in NFSv4
  ...
2024-03-16 11:44:00 -07:00
Josef Bacik
17f46b803d nfs: fix UAF in direct writes
In production we have been hitting the following warning consistently

------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0
Workqueue: nfsiod nfs_direct_write_schedule_work [nfs]
RIP: 0010:refcount_warn_saturate+0x9c/0xe0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x9f/0x130
 ? refcount_warn_saturate+0x9c/0xe0
 ? report_bug+0xcc/0x150
 ? handle_bug+0x3d/0x70
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? refcount_warn_saturate+0x9c/0xe0
 nfs_direct_write_schedule_work+0x237/0x250 [nfs]
 process_one_work+0x12f/0x4a0
 worker_thread+0x14e/0x3b0
 ? ZSTD_getCParams_internal+0x220/0x220
 kthread+0xdc/0x120
 ? __btf_name_valid+0xa0/0xa0
 ret_from_fork+0x1f/0x30

This is because we're completing the nfs_direct_request twice in a row.

The source of this is when we have our commit requests to submit, we
process them and send them off, and then in the completion path for the
commit requests we have

if (nfs_commit_end(cinfo.mds))
	nfs_direct_write_complete(dreq);

However since we're submitting asynchronous requests we sometimes have
one that completes before we submit the next one, so we end up calling
complete on the nfs_direct_request twice.

The only other place we use nfs_generic_commit_list() is in
__nfs_commit_inode, which wraps this call in a

nfs_commit_begin();
nfs_commit_end();

Which is a common pattern for this style of completion handling, one
that is also repeated in the direct code with get_dreq()/put_dreq()
calls around where we process events as well as in the completion paths.

Fix this by using the same pattern for the commit requests.

Before with my 200 node rocksdb stress running this warning would pop
every 10ish minutes.  With my patch the stress test has been running for
several hours without popping.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09 09:14:51 -05:00
NeilBrown
0b81371d3c NFS: remove sync_mode test from nfs_writepage_locked()
nfs_writepage_locked() is only called from nfs_wb_folio() (since Commit
12fc0a9631 ("nfs: Remove writepage")) so ->sync_mode is always
WB_SYNC_ALL.

This means the test for WB_SYNC_NONE is dead code and can be removed.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09 09:14:51 -05:00
Jeff Layton
dd1fac6ae6
nfs: adapt to breakup of struct file_lock
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-41-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:43 +01:00
Jeff Layton
a69ce85ec9
filelock: split common fields into struct file_lock_core
In a future patch, we're going to split file leases into their own
structure. Since a lot of the underlying machinery uses the same fields
move those into a new file_lock_core, and embed that inside struct
file_lock.

For now, add some macros to ensure that we can continue to build while
the conversion is in progress.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:38 +01:00
Jeff Layton
d7c9616be0
nfs: convert to using new filelock helpers
Convert to using the new file locking helper functions. Also, in later
patches we're going to introduce some temporary macros with names that
clash with the variable name in nfs4_proc_unlck. Rename it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-11-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:36 +01:00
Linus Torvalds
587217f970 NFS Client Updates for Linux 6.8
New Features:
   * Always ask for type with READDIR
   * Remove nfs_writepage()
 
 Bugfixes:
   * Fix a suspicious RCU usage warning
   * Fix a blocklayoutdriver reference leak
   * Fix the block driver's calculation of layoutget size
   * Fix handling NFS4ERR_RETURNCONFLICT
   * Fix _xprt_switch_find_current_entry()
   * Fix v4.1 backchannel request timeouts
   * Don't add zero-length pnfs block devices
   * Use the parent cred in nfs_access_login_time()
 
 Cleanups:
   * A few improvements when dealing with referring calls from the server
   * Clean up various unused variables, struct fields, and function calls
   * Various tracepoint improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmWfDp8ACgkQ18tUv7Cl
 QOsJ+Q/8DgrVmP3jwoM9Fu7JI/RnTQr9svk7zyrlyrQd3ywYqu6A1SC7lphcrzhy
 qxH55ykUuVgCB4kFqWPsU5yilJ8UzPncTOUObiBxN3pCU885Wckm4PJ9PNXtF9ct
 hc7+RpSTby/hYxiJABGVLgUADJ30rYBe6Y+KspSf+S1HvmgY1jbMPhEbVGpP2QBt
 zSF5pmnecZ748LGzSwSeW29WUZhvRPBL5B204EB4aq9SmPAhnAclnE7uhErQ1u8e
 Z6RVwSXv2j1FcM79F5xc/gAByCQhObGuMceFd0sAnx87RUttHi1fteVboz2gZxHB
 rawZQ9p9K9c7ayCu8disxKWTxNYAztvXDOs+Dnij+c3/2EpAmEUD53AXnXAz025b
 IbSnh6ggLlxoKLv1Lrwrli2d/Qi4TYTm2RSW/dY416pIhoO3aC6fv1a5tUnou9RX
 3XpiiFeNoTixWswmS23AMT7BrJTWXY/+NX7AxFZUyPyJ8y9F2Ug8BCam8uAvTluf
 80Dx0pB+7DRF19/ZkH0mUFU+2/+mlK/Ub0p+izSJhkhPSH5TwUTA7hvX6xb7yFtS
 OY4aTVD0rpTbSOvHOEI+F4tWBnw8onTobYMfRcuwNKYJCvuEh4mLLpn44QEJwW9M
 3nHIzdE75Nz3deO+gg6Jo5JuiMwqvh7AEGsxIA64FnAi/xRCDi0=
 =9hVw
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull nfs client updates from Anna Schumaker:
 "New Features:
   - Always ask for type with READDIR
   - Remove nfs_writepage()

  Bugfixes:
   - Fix a suspicious RCU usage warning
   - Fix a blocklayoutdriver reference leak
   - Fix the block driver's calculation of layoutget size
   - Fix handling NFS4ERR_RETURNCONFLICT
   - Fix _xprt_switch_find_current_entry()
   - Fix v4.1 backchannel request timeouts
   - Don't add zero-length pnfs block devices
   - Use the parent cred in nfs_access_login_time()

  Cleanups:
   - A few improvements when dealing with referring calls from the
     server
   - Clean up various unused variables, struct fields, and function
     calls
   - Various tracepoint improvements"

* tag 'nfs-for-6.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (21 commits)
  NFSv4.1: Use the nfs_client's rpc timeouts for backchannel
  SUNRPC: Fixup v4.1 backchannel request timeouts
  rpc_pipefs: Replace one label in bl_resolve_deviceid()
  nfs: Remove writepage
  NFS: drop unused nfs_direct_req bytes_left
  pNFS: Fix the pnfs block driver's calculation of layoutget size
  nfs: print fileid in lookup tracepoints
  nfs: rename the nfs_async_rename_done tracepoint
  nfs: add new tracepoint at nfs4 revalidate entry point
  SUNRPC: fix _xprt_switch_find_current_entry logic
  NFSv4.1/pnfs: Ensure we handle the error NFS4ERR_RETURNCONFLICT
  NFSv4.1: if referring calls are complete, trust the stateid argument
  NFSv4: Track the number of referring calls in struct cb_process_state
  NFS: Use parent's objective cred in nfs_access_login_time()
  NFSv4: Always ask for type with READDIR
  pnfs/blocklayout: Don't add zero-length pnfs_block_dev
  blocklayoutdriver: Fix reference leak of pnfs_device_node
  SUNRPC: Fix a suspicious RCU usage warning
  SUNRPC: Create a helper function for accessing the rpc_clnt's xprt_switch
  SUNRPC: Remove unused function rpc_clnt_xprt_switch_put()
  ...
2024-01-10 16:13:57 -08:00
Matthew Wilcox (Oracle)
12fc0a9631 nfs: Remove writepage
NFS already has writepages and migrate_folio, so it does not need to
implement writepage.  The writepage operation is deprecated as it leads
to worse performance under high memory pressure due to folios being
written out in LRU order rather than sequentially within a file.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04 10:47:56 -05:00
Matthew Wilcox (Oracle)
600f111ef5 fs: Rename mapping private members
It is hard to find where mapping->private_lock, mapping->private_list and
mapping->private_data are used, due to private_XXX being a relatively
common name for variables and structure members in the kernel.  To fit
with other members of struct address_space, rename them all to have an
i_ prefix.  Tested with an allmodconfig build.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-21 11:57:10 +01:00
Trond Myklebust
6e7434abcd NFSv4/pnfs: Allow layoutget to return EAGAIN for softerr mounts
If we're using the 'softerr' mount option, we may want to allow
layoutget to return EAGAIN to allow knfsd server threads to return a
JUKEBOX/DELAY error to the client instead of busy waiting.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:56 -04:00
Scott Mayhew
6a6d4644ce NFS: Fix potential oops in nfs_inode_remove_request()
Once a folio's private data has been cleared, it's possible for another
process to clear the folio->mapping (e.g. via invalidate_complete_folio2
or evict_mapping_folio), so it wouldn't be safe to call
nfs_page_to_inode() after that.

Fixes: 0c493b5cf1 ("NFS: Convert buffered writes to use folios")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-10-11 11:09:56 -04:00
Jeff Layton
dd1b202632 nfs: decrement nrequests counter before releasing the req
I hit this panic in testing:

[ 6235.500016] run fstests generic/464 at 2023-09-18 22:51:24
[ 6288.410761] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 6288.412174] #PF: supervisor read access in kernel mode
[ 6288.413160] #PF: error_code(0x0000) - not-present page
[ 6288.413992] PGD 0 P4D 0
[ 6288.414603] Oops: 0000 [#1] PREEMPT SMP PTI
[ 6288.415419] CPU: 0 PID: 340798 Comm: kworker/u18:8 Not tainted 6.6.0-rc1-gdcf620ceebac #95
[ 6288.416538] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
[ 6288.417701] Workqueue: nfsiod rpc_async_release [sunrpc]
[ 6288.418676] RIP: 0010:nfs_inode_remove_request+0xc8/0x150 [nfs]
[ 6288.419836] Code: ff ff 48 8b 43 38 48 8b 7b 10 a8 04 74 5b 48 85 ff 74 56 48 8b 07 a9 00 00 08 00 74 58 48 8b 07 f6 c4 10 74 50 e8 c8 44 b3 d5 <48> 8b 00 f0 48 ff 88 30 ff ff ff 5b 5d 41 5c c3 cc cc cc cc 48 8b
[ 6288.422389] RSP: 0018:ffffbd618353bda8 EFLAGS: 00010246
[ 6288.423234] RAX: 0000000000000000 RBX: ffff9a29f9a25280 RCX: 0000000000000000
[ 6288.424351] RDX: ffff9a29f9a252b4 RSI: 000000000000000b RDI: ffffef41448e3840
[ 6288.425345] RBP: ffffef41448e3840 R08: 0000000000000038 R09: ffffffffffffffff
[ 6288.426334] R10: 0000000000033f80 R11: ffff9a2a7fffa000 R12: ffff9a29093f98c4
[ 6288.427353] R13: 0000000000000000 R14: ffff9a29230f62e0 R15: ffff9a29230f62d0
[ 6288.428358] FS:  0000000000000000(0000) GS:ffff9a2a77c00000(0000) knlGS:0000000000000000
[ 6288.429513] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6288.430427] CR2: 0000000000000000 CR3: 0000000264748002 CR4: 0000000000770ef0
[ 6288.431553] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6288.432715] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6288.433698] PKRU: 55555554
[ 6288.434196] Call Trace:
[ 6288.434667]  <TASK>
[ 6288.435132]  ? __die+0x1f/0x70
[ 6288.435723]  ? page_fault_oops+0x159/0x450
[ 6288.436389]  ? try_to_wake_up+0x98/0x5d0
[ 6288.437044]  ? do_user_addr_fault+0x65/0x660
[ 6288.437728]  ? exc_page_fault+0x7a/0x180
[ 6288.438368]  ? asm_exc_page_fault+0x22/0x30
[ 6288.439137]  ? nfs_inode_remove_request+0xc8/0x150 [nfs]
[ 6288.440112]  ? nfs_inode_remove_request+0xa0/0x150 [nfs]
[ 6288.440924]  nfs_commit_release_pages+0x16e/0x340 [nfs]
[ 6288.441700]  ? __pfx_call_transmit+0x10/0x10 [sunrpc]
[ 6288.442475]  ? _raw_spin_lock_irqsave+0x23/0x50
[ 6288.443161]  nfs_commit_release+0x15/0x40 [nfs]
[ 6288.443926]  rpc_free_task+0x36/0x60 [sunrpc]
[ 6288.444741]  rpc_async_release+0x29/0x40 [sunrpc]
[ 6288.445509]  process_one_work+0x171/0x340
[ 6288.446135]  worker_thread+0x277/0x3a0
[ 6288.446724]  ? __pfx_worker_thread+0x10/0x10
[ 6288.447376]  kthread+0xf0/0x120
[ 6288.447903]  ? __pfx_kthread+0x10/0x10
[ 6288.448500]  ret_from_fork+0x2d/0x50
[ 6288.449078]  ? __pfx_kthread+0x10/0x10
[ 6288.449665]  ret_from_fork_asm+0x1b/0x30
[ 6288.450283]  </TASK>
[ 6288.450688] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nls_iso8859_1 nls_cp437 vfat fat 9p netfs ext4 kvm_intel crc16 mbcache jbd2 joydev kvm xfs irqbypass virtio_net pcspkr net_failover psmouse failover 9pnet_virtio cirrus drm_shmem_helper virtio_balloon drm_kms_helper button evdev drm loop dm_mod zram zsmalloc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic virtio_blk nvme aesni_intel crypto_simd cryptd nvme_core t10_pi i6300esb crc64_rocksoft_generic crc64_rocksoft crc64 virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring serio_raw btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq autofs4
[ 6288.460211] CR2: 0000000000000000
[ 6288.460787] ---[ end trace 0000000000000000 ]---
[ 6288.461571] RIP: 0010:nfs_inode_remove_request+0xc8/0x150 [nfs]
[ 6288.462500] Code: ff ff 48 8b 43 38 48 8b 7b 10 a8 04 74 5b 48 85 ff 74 56 48 8b 07 a9 00 00 08 00 74 58 48 8b 07 f6 c4 10 74 50 e8 c8 44 b3 d5 <48> 8b 00 f0 48 ff 88 30 ff ff ff 5b 5d 41 5c c3 cc cc cc cc 48 8b
[ 6288.465136] RSP: 0018:ffffbd618353bda8 EFLAGS: 00010246
[ 6288.465963] RAX: 0000000000000000 RBX: ffff9a29f9a25280 RCX: 0000000000000000
[ 6288.467035] RDX: ffff9a29f9a252b4 RSI: 000000000000000b RDI: ffffef41448e3840
[ 6288.468093] RBP: ffffef41448e3840 R08: 0000000000000038 R09: ffffffffffffffff
[ 6288.469121] R10: 0000000000033f80 R11: ffff9a2a7fffa000 R12: ffff9a29093f98c4
[ 6288.470109] R13: 0000000000000000 R14: ffff9a29230f62e0 R15: ffff9a29230f62d0
[ 6288.471106] FS:  0000000000000000(0000) GS:ffff9a2a77c00000(0000) knlGS:0000000000000000
[ 6288.472216] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6288.473059] CR2: 0000000000000000 CR3: 0000000264748002 CR4: 0000000000770ef0
[ 6288.474096] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6288.475097] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6288.476148] PKRU: 55555554
[ 6288.476665] note: kworker/u18:8[340798] exited with irqs disabled

Once we've released "req", it's not safe to dereference it anymore.
Decrement the nrequests counter before dropping the reference.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-28 12:22:25 -04:00
Trond Myklebust
b193a78ddb NFS: Use the correct commit info in nfs_join_page_group()
Ensure that nfs_clear_request_commit() updates the correct counters when
it removes them from the commit list.

Fixes: ed5d588fe4 ("NFS: Try to join page groups before an O_DIRECT retransmission")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Linus Torvalds
d8ca6dbb8d NFS Client Updates for Linux 6.3
New Features:
   * Convert the read and write paths to use folios
 
 Bugfixes and Cleanups:
   * Fix tracepoint state manager flag printing
   * Fix disabling swap files
   * Fix NFSv4 client identifier sysfs path in the documentation
   * Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED
   * Treat GETDEVICEINFO errors as a layout failure
   * Replace kmap_atomic() calls with kmap_local_page()
   * Constify sunrpc sysfs kobj_type structures
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmP2lSQACgkQ18tUv7Cl
 QOuaPw//TNLuEMiiMtM6EXFPc2p2tNMfM+EeMjYKZkS5WnvszVOZRzQfXz4dvvif
 R2BIHT+C6AWkPOxnbuhIk86acy/wRuAPD9OMyp+3PNa4oEqjyIk5QaTvgYeeRBNe
 H7haAqCiEBx/g8F9DladWLAwpilTSPpupdmiXwTS7Q7wB6NkKqc1eAJ+BTuRYAsm
 19p2nGTzenWrut4HMOOmuAVrA2OcWtzgYzdJlY19lWRmzfCvX1gh3i2JxGutmnl3
 tduviVBdu5QGnTKHJiP853kzs3VwEHIFSO+5z137JWKTm3IgPxW/u3zf+ijmw22t
 vjbtFAhb4+u5juchvVKyIX4lClPSKZincQ6dn9soOm7qJcxYWNm40DzxAF21GUAq
 d4tp+zBoe9pfKxZrylp6YxIchLQYhU+dL0hhqbKzfOAFOg28k1JsJubl/wRvWKVe
 LGbvFOrUYo8arPOKfxBWMf4HCouu/vGq/qng+U/Ppjw3h8gD32aniyP3qN9Fm1JA
 rFAMKTnD4I3r3x1E4HX3icUMwil36zBr9cPglQBYD5DW54ConEW8e6hvQmjTdQHP
 EgN9+PSYzYXuKR/Fij7BYOymUD9grT1+5hVaOPzlM4SY3jMy2novcQGoGiLVOd0t
 PGQ5047qbESFD7VkW4GBAYftrSMLiU+l5KL3aN1JRtprxu6NaYA=
 =39ue
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Convert the read and write paths to use folios

  Bugfixes and Cleanups:

   - Fix tracepoint state manager flag printing

   - Fix disabling swap files

   - Fix NFSv4 client identifier sysfs path in the documentation

   - Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED

   - Treat GETDEVICEINFO errors as a layout failure

   - Replace kmap_atomic() calls with kmap_local_page()

   - Constify sunrpc sysfs kobj_type structures"

* tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (25 commits)
  fs/nfs: Replace kmap_atomic() with kmap_local_page() in dir.c
  pNFS/filelayout: treat GETDEVICEINFO errors as layout failure
  Documentation: Fix sysfs path for the NFSv4 client identifier
  nfs42: do not fail with EIO if ssc returns NFS4ERR_OFFLOAD_DENIED
  NFS: fix disabling of swap
  SUNRPC: make kobj_type structures constant
  nfs4trace: fix state manager flag printing
  NFS: Remove unnecessary check in nfs_read_folio()
  NFS: Improve tracing of nfs_wb_folio()
  NFS: Enable tracing of nfs_invalidate_folio() and nfs_launder_folio()
  NFS: fix up nfs_release_folio() to try to release the page
  NFS: Clean up O_DIRECT request allocation
  NFS: Fix up nfs_vm_page_mkwrite() for folios
  NFS: Convert nfs_write_begin/end to use folios
  NFS: Remove unused function nfs_wb_page()
  NFS: Convert buffered writes to use folios
  NFS: Convert the function nfs_wb_page() to use folios
  NFS: Convert buffered reads to use folios
  NFS: Add a helper nfs_wb_folio()
  NFS: Convert the remaining pagelist helper functions to support folios
  ...
2023-02-22 14:47:20 -08:00
Trond Myklebust
256093fec1 NFS: Improve tracing of nfs_wb_folio()
Include info about which folio is being traced.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:33 -05:00
Trond Myklebust
4cbf76948c NFS: Remove unused function nfs_wb_page()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:32 -05:00
Trond Myklebust
0c493b5cf1 NFS: Convert buffered writes to use folios
Mostly mechanical conversion of struct page and functions into struct
folio equivalents.
The lack of support for folios in write_cache_pages(), means we still
only support order 0 folio allocations. However the rest of the
writeback code should now be ready for order n > 0.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:32 -05:00
Trond Myklebust
5241060e8b NFS: Convert the function nfs_wb_page() to use folios
Convert to use the folio functions, but pass the struct page to
nfs_writepage_locked() for now.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:32 -05:00
Trond Myklebust
4b27232a6e NFS: Add a helper nfs_wb_folio()
...and use it in nfs_launder_folio().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:32 -05:00
Trond Myklebust
6dd85e83f3 NFS: Add a helper to convert a struct nfs_page into an inode
Replace all the open coded calls to page_file_mapping(req->wb_page)->host.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-02-14 14:22:32 -05:00
Matthew Wilcox (Oracle)
d585bdbeb7 fs: convert writepage_t callback to pass a folio
Patch series "Convert writepage_t to use a folio".

More folioisation.  I split out the mpage work from everything else
because it completely dominated the patch, but some implementations I just
converted outright.


This patch (of 2):

We always write back an entire folio, but that's currently passed as the
head page.  Convert all filesystems that use write_cache_pages() to expect
a folio instead of a page.

Link: https://lkml.kernel.org/r/20230126201255.1681189-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230126201255.1681189-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:34 -08:00
Jeff Layton
5970e15dbc filelock: move file locking definitions to separate header file
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.

Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-01-11 06:52:32 -05:00
Jeff Layton
17b985def2 nfs: use locks_inode_context helper
nfs currently doesn't access i_flctx safely. This requires a
smp_load_acquire, as the pointer is set via cmpxchg (a release
operation).

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2022-11-30 05:08:10 -05:00
Linus Torvalds
6504d82f44 NFS client bugfixes for Linux 6.0
Highlights include:
 
 Bugfixes
 - Fix SUNRPC call completion races with call_decode() that trigger a
   WARN_ON()
 - NFSv4.0 cannot support open-by-filehandle and NFS re-export
 - Revert "SUNRPC: Remove unreachable error condition" to allow handling
   of error conditions
 - Update suid/sgid mode bits after ALLOCATE and DEALLOCATE
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmMbd3UACgkQZwvnipYK
 APKG2A//RGBrs41UNY8FwEFZ6Q22OU7LzmpC5a1g97d35xHaYhJ7rGNydv3h7r0R
 T4/BzK3jdgH7jSr8mw/8lpCbh0mkSUY6kwu27qk6UVFel9RkKkxgy5VwPl7C8SzF
 u/zimyX3ftVd9h73ZKnKncx6d/88jlKccRvj1a6iAc+bj9jkeA85Gkl8tOXxEvtf
 BSwBpUKcmLtGpof5EPkt+eX9NKph2t1NhV/33CIRErWZpsAMEjhXvNTdca9biX2L
 fpVUyHrBceBqklO+dGIR3l6FeeZWkn/ceMZ043jzKsiufb9WKuy4X2pYHfijVA0W
 ZlMYUUtdrbju762GAfWjK0tiJG6ZieZs8SWQFmg3PjdNQBBHwlxRIwL8L+wOHLa0
 WzFzZy44IeW/7yAFdRp5YeZU9ZisMied3s4w1dWERKtgc8CtQzQTQ9XDTv9/33GC
 OIMl/Wf1Po8CaNLDb7T3FYST0IJQSwuHBSg/1B8+MezhsAhNKhYEy9BdRp2yeVI7
 +rau13ZBLCIaN3IjM0WF+6aqxmJh+euK9Wqs5DF6lsWzbVZQmTYYYv8U5J3lmJY4
 AN/laj+sUQ3gSfmR5tA9QGk4rPhPqKxIGu+kcRyaIaA6FRhJbj+45IkvHtudQvU+
 64/CoesuKDTjvi1UJk1oS/3TQHhogJgOXuY1orxBEBDE1MoCTHk=
 =KudX
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Fix SUNRPC call completion races with call_decode() that trigger a
   WARN_ON()

 - NFSv4.0 cannot support open-by-filehandle and NFS re-export

 - Revert "SUNRPC: Remove unreachable error condition" to allow handling
   of error conditions

 - Update suid/sgid mode bits after ALLOCATE and DEALLOCATE

* tag 'nfs-for-5.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  Revert "SUNRPC: Remove unreachable error condition"
  NFSv4.2: Update mode bits after ALLOCATE and DEALLOCATE
  NFSv4: Turn off open-by-filehandle and NFS re-export for NFSv4.0
  SUNRPC: Fix call completion races with call_decode()
2022-09-12 17:53:46 -04:00