1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

18425 commits

Author SHA1 Message Date
Miaohe Lin
a204e6d626 mm/slub: remove unneeded return value of slab_pad_check
The return value of slab_pad_check is never used. So we can make it return
void now.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220419120352.37825-1-linmiaohe@huawei.com
2022-04-20 19:27:14 +02:00
Song Liu
559089e0a9 vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP
Huge page backed vmalloc memory could benefit performance in many cases.
However, some users of vmalloc may not be ready to handle huge pages for
various reasons: hardware constraints, potential pages split, etc.
VM_NO_HUGE_VMAP was introduced to allow vmalloc users to opt-out huge
pages.  However, it is not easy to track down all the users that require
the opt-out, as the allocation are passed different stacks and may cause
issues in different layers.

To address this issue, replace VM_NO_HUGE_VMAP with an opt-in flag,
VM_ALLOW_HUGE_VMAP, so that users that benefit from huge pages could ask
specificially.

Also, remove vmalloc_no_huge() and add opt-in helper vmalloc_huge().

Fixes: fac54e2bfb ("x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP")
Link: https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/"
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-19 12:08:57 -07:00
Christoph Hellwig
44abff2c0b block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD
Secure erase is a very different operation from discard in that it is
a data integrity operation vs hint.  Fully split the limits and helper
infrastructure to make the separation more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2]
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig
70200574cc block: remove QUEUE_FLAG_DISCARD
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.

The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig
36d254893a block: add a bdev_stable_writes helper
Add a helper to check the stable writes flag based on the block_device
instead of having to poke into the block layer internal request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig
10f0d2a517 block: add a bdev_nonrot helper
Add a helper to check the nonrot flag based on the block_device instead
of having to poke into the block layer internal request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig
9964e67455 mm: use bdev_is_zoned in claim_swapfile
Use the bdev based helper instead of poking into the queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:58 -06:00
Patrick Wang
23c2d497de mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
The kmemleak_*_phys() apis do not check the address for lowmem's min
boundary, while the caller may pass an address below lowmem, which will
trigger an oops:

  # echo scan > /sys/kernel/debug/kmemleak
  Unable to handle kernel paging request at virtual address ff5fffffffe00000
  Oops [#1]
  Modules linked in:
  CPU: 2 PID: 134 Comm: bash Not tainted 5.18.0-rc1-next-20220407 #33
  Hardware name: riscv-virtio,qemu (DT)
  epc : scan_block+0x74/0x15c
   ra : scan_block+0x72/0x15c
  epc : ffffffff801e5806 ra : ffffffff801e5804 sp : ff200000104abc30
   gp : ffffffff815cd4e8 tp : ff60000004cfa340 t0 : 0000000000000200
   t1 : 00aaaaaac23954cc t2 : 00000000000003ff s0 : ff200000104abc90
   s1 : ffffffff81b0ff28 a0 : 0000000000000000 a1 : ff5fffffffe01000
   a2 : ffffffff81b0ff28 a3 : 0000000000000002 a4 : 0000000000000001
   a5 : 0000000000000000 a6 : ff200000104abd7c a7 : 0000000000000005
   s2 : ff5fffffffe00ff9 s3 : ffffffff815cd998 s4 : ffffffff815d0e90
   s5 : ffffffff81b0ff28 s6 : 0000000000000020 s7 : ffffffff815d0eb0
   s8 : ffffffffffffffff s9 : ff5fffffffe00000 s10: ff5fffffffe01000
   s11: 0000000000000022 t3 : 00ffffffaa17db4c t4 : 000000000000000f
   t5 : 0000000000000001 t6 : 0000000000000000
  status: 0000000000000100 badaddr: ff5fffffffe00000 cause: 000000000000000d
    scan_gray_list+0x12e/0x1a6
    kmemleak_scan+0x2aa/0x57e
    kmemleak_write+0x32a/0x40c
    full_proxy_write+0x56/0x82
    vfs_write+0xa6/0x2a6
    ksys_write+0x6c/0xe2
    sys_write+0x22/0x2a
    ret_from_syscall+0x0/0x2

The callers may not quite know the actual address they pass(e.g. from
devicetree).  So the kmemleak_*_phys() apis should guarantee the address
they finally use is in lowmem range, so check the address for lowmem's
min boundary.

Link: https://lkml.kernel.org/r/20220413122925.33856-1-patrick.wang.shcn@gmail.com
Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:56 -07:00
Omar Sandoval
c12cd77cb0 mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
Commit 3ee48b6af4 ("mm, x86: Saving vmcore with non-lazy freeing of
vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
purge the vmap areas instead of doing it lazily.

Commit 690467c81b ("mm/vmalloc: Move draining areas out of caller
context") moved the purging from the vunmap() caller to a worker thread.
Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
(possibly forever).  For example, consider the following scenario:

 1. Thread reads from /proc/vmcore. This eventually calls
    __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
    vmap_lazy_nr to lazy_max_pages() + 1.

 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
    pages (one page plus the guard page) to the purge list and
    vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
    drain_vmap_work is scheduled.

 3. Thread returns from the kernel and is scheduled out.

 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
    frees the 2 pages on the purge list. vmap_lazy_nr is now
    lazy_max_pages() + 1.

 5. This is still over the threshold, so it tries to purge areas again,
    but doesn't find anything.

 6. Repeat 5.

If the system is running with only one CPU (which is typicial for kdump)
and preemption is disabled, then this will never make forward progress:
there aren't any more pages to purge, so it hangs.  If there is more
than one CPU or preemption is enabled, then the worker thread will spin
forever in the background.  (Note that if there were already pages to be
purged at the time that set_iounmap_nonlazy() was called, this bug is
avoided.)

This can be reproduced with anything that reads from /proc/vmcore
multiple times.  E.g., vmcore-dmesg /proc/vmcore.

It turns out that improvements to vmap() over the years have obsoleted
the need for this "optimization".  I benchmarked `dd if=/proc/vmcore
of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
5.18-rc1 with set_iounmap_nonlazy() removed entirely:

    |5.17  |5.18+fix|5.18+removal
  4k|40.86s|  40.09s|      26.73s
  1M|24.47s|  23.98s|      21.84s

The removal was the fastest (by a wide margin with 4k reads).  This
patch removes set_iounmap_nonlazy().

Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
Fixes: 690467c81b  ("mm/vmalloc: Move draining areas out of caller context")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:56 -07:00
Mike Kravetz
5a317412ef hugetlb: do not demote poisoned hugetlb pages
It is possible for poisoned hugetlb pages to reside on the free lists.
The huge page allocation routines which dequeue entries from the free
lists make a point of avoiding poisoned pages.  There is no such check
and avoidance in the demote code path.

If a hugetlb page on the is on a free list, poison will only be set in
the head page rather then the page with the actual error.  If such a
page is demoted, then the poison flag may follow the wrong page.  A page
without error could have poison set, and a page with poison could not
have the flag set.

Check for poison before attempting to demote a hugetlb page.  Also,
return -EBUSY to the caller if only poisoned pages are on the free list.

Link: https://lkml.kernel.org/r/20220307215707.50916-1-mike.kravetz@oracle.com
Fixes: 8531fc6f52 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:55 -07:00
Charan Teja Kalla
31ca72fa75 mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
The below warning is reported when CONFIG_COMPACTION=n:

   mm/compaction.c:56:27: warning: 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' defined but not used [-Wunused-const-variable=]
      56 | static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
         |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix it by moving 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' under
CONFIG_COMPACTION defconfig.

Also since this is just a 'static const int' type, use #define for it.

Link: https://lkml.kernel.org/r/1647608518-20924-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:55 -07:00
Minchan Kim
e914d8f003 mm: fix unexpected zeroed page mapping with zram swap
Two processes under CLONE_VM cloning, user process can be corrupted by
seeing zeroed page unexpectedly.

      CPU A                        CPU B

  do_swap_page                do_swap_page
  SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
  swap_readpage valid data
    swap_slot_free_notify
      delete zram entry
                              swap_readpage zeroed(invalid) data
                              pte_lock
                              map the *zero data* to userspace
                              pte_unlock
  pte_lock
  if (!pte_same)
    goto out_nomap;
  pte_unlock
  return and next refault will
  read zeroed data

The swap_slot_free_notify is bogus for CLONE_VM case since it doesn't
increase the refcount of swap slot at copy_mm so it couldn't catch up
whether it's safe or not to discard data from backing device.  In the
case, only the lock it could rely on to synchronize swap slot freeing is
page table lock.  Thus, this patch gets rid of the swap_slot_free_notify
function.  With this patch, CPU A will see correct data.

      CPU A                        CPU B

  do_swap_page                do_swap_page
  SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
                              swap_readpage original data
                              pte_lock
                              map the original data
                              swap_free
                                swap_range_free
                                  bd_disk->fops->swap_slot_free_notify
  swap_readpage read zeroed data
                              pte_unlock
  pte_lock
  if (!pte_same)
    goto out_nomap;
  pte_unlock
  return
  on next refault will see mapped data by CPU B

The concern of the patch would increase memory consumption since it
could keep wasted memory with compressed form in zram as well as
uncompressed form in address space.  However, most of cases of zram uses
no readahead and do_swap_page is followed by swap_free so it will free
the compressed form from in zram quickly.

Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
Fixes: 0bcac06f27 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Tested-by: Ivan Babrou <ivan@cloudflare.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:55 -07:00
Juergen Gross
e553f62f10 mm, page_alloc: fix build_zonerefs_node()
Since commit 6aa303defb ("mm, vmscan: only allocate and reclaim from
zones with pages managed by the buddy allocator") only zones with free
memory are included in a built zonelist.  This is problematic when e.g.
all memory of a zone has been ballooned out when zonelists are being
rebuilt.

The decision whether to rebuild the zonelists when onlining new memory
is done based on populated_zone() returning 0 for the zone the memory
will be added to.  The new zone is added to the zonelists only, if it
has free memory pages (managed_zone() returns a non-zero value) after
the memory has been onlined.  This implies, that onlining memory will
always free the added pages to the allocator immediately, but this is
not true in all cases: when e.g. running as a Xen guest the onlined new
memory will be added only to the ballooned memory list, it will be freed
only when the guest is being ballooned up afterwards.

Another problem with using managed_zone() for the decision whether a
zone is being added to the zonelists is, that a zone with all memory
used will in fact be removed from all zonelists in case the zonelists
happen to be rebuilt.

Use populated_zone() when building a zonelist as it has been done before
that commit.

There was a report that QubesOS (based on Xen) is hitting this problem.
Xen has switched to use the zone device functionality in kernel 5.9 and
QubesOS wants to use memory hotplugging for guests in order to be able
to start a guest with minimal memory and expand it as needed.  This was
the report leading to the patch.

Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
Fixes: 6aa303defb ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:55 -07:00
Marco Elver
2dfe63e61c mm, kfence: support kmem_dump_obj() for KFENCE objects
Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
producing garbage data due to the object not actually being maintained
by SLAB or SLUB.

Fix this by implementing __kfence_obj_info() that copies relevant
information to struct kmem_obj_info when the object was allocated by
KFENCE; this is called by a common kmem_obj_info(), which also calls the
slab/slub/slob specific variant now called __kmem_obj_info().

For completeness, kmem_dump_obj() now displays if the object was
allocated by KFENCE.

Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
Fixes: b89fb5ef0c ("mm, kfence: insert KFENCE hooks for SLUB")
Fixes: d3fb45f370 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>	[slab]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:55 -07:00
Vincenzo Frascino
b1add418d4 kasan: fix hw tags enablement when KUNIT tests are disabled
Kasan enables hw tags via kasan_enable_tagging() which based on the mode
passed via kernel command line selects the correct hw backend.
kasan_enable_tagging() is meant to be invoked indirectly via the cpu
features framework of the architectures that support these backends.
Currently the invocation of this function is guarded by
CONFIG_KASAN_KUNIT_TEST which allows the enablement of the correct backend
only when KUNIT tests are enabled in the kernel.

This inconsistency was introduced in commit:

  ed6d74446c ("kasan: test: support async (again) and asymm modes for HW_TAGS")

... and prevents to enable MTE on arm64 when KUNIT tests for kasan hw_tags are
disabled.

Fix the issue making sure that the CONFIG_KASAN_KUNIT_TEST guard does not
prevent the correct invocation of kasan_enable_tagging().

Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com
Fixes: ed6d74446c ("kasan: test: support async (again) and asymm modes for HW_TAGS")
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:55 -07:00
Axel Rasmussen
f9b141f936 mm/secretmem: fix panic when growing a memfd_secret
When one tries to grow an existing memfd_secret with ftruncate, one gets
a panic [1].  For example, doing the following reliably induces the
panic:

    fd = memfd_secret();

    ftruncate(fd, 10);
    ptr = mmap(NULL, 10, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(ptr, "123456789");

    munmap(ptr, 10);
    ftruncate(fd, 20);

The basic reason for this is, when we grow with ftruncate, we call down
into simple_setattr, and then truncate_inode_pages_range, and eventually
we try to zero part of the memory.  The normal truncation code does this
via the direct map (i.e., it calls page_address() and hands that to
memset()).

For memfd_secret though, we specifically don't map our pages via the
direct map (i.e.  we call set_direct_map_invalid_noflush() on every
fault).  So the address returned by page_address() isn't useful, and
when we try to memset() with it we panic.

This patch avoids the panic by implementing a custom setattr for
memfd_secret, which detects resizes specifically (setting the size for
the first time works just fine, since there are no existing pages to try
to zero), and rejects them with EINVAL.

One could argue growing should be supported, but I think that will
require a significantly more lengthy change.  So, I propose a minimal
fix for the benefit of stable kernels, and then perhaps to extend
memfd_secret to support growing in a separate patch.

[1]:

  BUG: unable to handle page fault for address: ffffa0a889277028
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD afa01067 P4D afa01067 PUD 83f909067 PMD 83f8bf067 PTE 800ffffef6d88060
  Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 0 PID: 281 Comm: repro Not tainted 5.17.0-dbg-DEV #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:memset_erms+0x9/0x10
  Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
  RSP: 0018:ffffb932c09afbf0 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffffda63c4249dc0 RCX: 0000000000000fd8
  RDX: 0000000000000fd8 RSI: 0000000000000000 RDI: ffffa0a889277028
  RBP: ffffb932c09afc00 R08: 0000000000001000 R09: ffffa0a889277028
  R10: 0000000000020023 R11: 0000000000000000 R12: ffffda63c4249dc0
  R13: ffffa0a890d70d98 R14: 0000000000000028 R15: 0000000000000fd8
  FS:  00007f7294899580(0000) GS:ffffa0af9bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffa0a889277028 CR3: 0000000107ef6006 CR4: 0000000000370ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? zero_user_segments+0x82/0x190
   truncate_inode_partial_folio+0xd4/0x2a0
   truncate_inode_pages_range+0x380/0x830
   truncate_setsize+0x63/0x80
   simple_setattr+0x37/0x60
   notify_change+0x3d8/0x4d0
   do_sys_ftruncate+0x162/0x1d0
   __x64_sys_ftruncate+0x1c/0x20
   do_syscall_64+0x44/0xa0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  Modules linked in: xhci_pci xhci_hcd virtio_net net_failover failover virtio_blk virtio_balloon uhci_hcd ohci_pci ohci_hcd evdev ehci_pci ehci_hcd 9pnet_virtio 9p netfs 9pnet
  CR2: ffffa0a889277028

[lkp@intel.com: secretmem_iops can be static]
  Signed-off-by: kernel test robot <lkp@intel.com>
[axelrasmussen@google.com: return EINVAL]

Link: https://lkml.kernel.org/r/20220324210909.1843814-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220412193023.279320-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:54 -07:00
Hugh Dickins
1bdec44b1e tmpfs: fix regressions from wider use of ZERO_PAGE
Chuck Lever reported fsx-based xfstests generic 075 091 112 127 failing
when 5.18-rc1 NFS server exports tmpfs: bisected to recent tmpfs change.

Whilst nfsd_splice_action() does contain some questionable handling of
repeated pages, and Chuck was able to work around there, history from
Mark Hemment makes clear that there might be similar dangers elsewhere:
it was not a good idea for me to pass ZERO_PAGE down to unknown actors.

Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
instead of copy_page_to_iter().

We would use iov_iter_zero() throughout, but the x86 clear_user() is not
nearly so well optimized as copy to user (dd of 1T sparse tmpfs file
takes 57 seconds rather than 44 seconds).

And now pagecache_init() does not need to SetPageUptodate(ZERO_PAGE(0)):
which had caused boot failure on arm noMMU STM32F7 and STM32H7 boards

Link: https://lkml.kernel.org/r/9a978571-8648-e830-5735-1f4748ce2e30@google.com
Fixes: 56a8c8eb1e ("tmpfs: do not allocate pages on read")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Patrice CHOTARD <patrice.chotard@foss.st.com>
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Tested-by: Chuck Lever III <chuck.lever@oracle.com>
Cc: Mark Hemment <markhemm@googlemail.com>
Cc: Patrice CHOTARD <patrice.chotard@foss.st.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:54 -07:00
Matthew Wilcox (Oracle)
1109a5d907 usercopy: Remove HARDENED_USERCOPY_PAGESPAN
There isn't enough information to make this a useful check any more;
the useful parts of it were moved in earlier patches, so remove this
set of checks now.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220110231530.665970-5-willy@infradead.org
2022-04-13 12:15:52 -07:00
Matthew Wilcox (Oracle)
ab502103ae mm/usercopy: Detect large folio overruns
Move the compound page overrun detection out of
CONFIG_HARDENED_USERCOPY_PAGESPAN and convert it to use folios so it's
enabled for more people.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220110231530.665970-4-willy@infradead.org
2022-04-13 12:15:51 -07:00
Matthew Wilcox (Oracle)
0aef499f31 mm/usercopy: Detect vmalloc overruns
If you have a vmalloc() allocation, or an address from calling vmap(),
you cannot overrun the vm_area which describes it, regardless of the
size of the underlying allocation.  This probably doesn't do much for
security because vmalloc comes with guard pages these days, but it
prevents usercopy aborts when copying to a vmap() of smaller pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220110231530.665970-3-willy@infradead.org
2022-04-13 12:15:51 -07:00
Matthew Wilcox (Oracle)
4e140f59d2 mm/usercopy: Check kmap addresses properly
If you are copying to an address in the kmap region, you may not copy
across a page boundary, no matter what the size of the underlying
allocation.  You can't kmap() a slab page because slab pages always
come from low memory.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220110231530.665970-2-willy@infradead.org
2022-04-13 12:15:50 -07:00
Ohhoon Kwon
33647783de mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache()
There are four types of kmalloc_caches: KMALLOC_NORMAL, KMALLOC_CGROUP,
KMALLOC_RECLAIM, and KMALLOC_DMA. While the first three types are
created using new_kmalloc_cache(), KMALLOC_DMA caches are created in a
separate logic. Let KMALLOC_DMA caches be also created using
new_kmalloc_cache(), to enhance readability.

Historically, there were only KMALLOC_NORMAL caches and KMALLOC_DMA
caches in the first place, and they were initialized in two separate
logics. However, when KMALLOC_RECLAIM was introduced in v4.20 via
commit 1291523f2c ("mm, slab/slub: introduce kmalloc-reclaimable
caches") and KMALLOC_CGROUP was introduced in v5.14 via
commit 494c1dfe85 ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches"), their creations were merged with KMALLOC_NORMAL's only.
KMALLOC_DMA creation logic should be merged with them, too.

By merging KMALLOC_DMA initialization with other types, the following
two changes might occur:
1. The order dma-kmalloc-<n> caches added in slab_cache list may be
sorted by size. i.e. the order they appear in /proc/slabinfo may change
as well.
2. slab_state will be set to UP after KMALLOC_DMA is created.
In case of slub, freelist randomization is dependent on slab_state>=UP,
and therefore KMALLOC_DMA cache's freelist will not be randomized in
creation, but will be deferred to init_freelist_randomization().

Co-developed-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: Ohhoon Kwon <ohkwon1043@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220410162511.656541-1-ohkwon1043@gmail.com
2022-04-13 09:07:06 +02:00
JaeSang Yoo
6b6efe2394 mm/slub: remove meaningless node check in ___slab_alloc()
node_match() with node=NUMA_NO_NODE always returns 1.
Duplicate check by goto statement is meaningless. Remove it.

Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220409144239.2649257-1-jsyoo5b@gmail.com
2022-04-13 09:05:31 +02:00
Jiyoup Kim
27c08f751c mm/slub: remove duplicate flag in allocate_slab()
In allocate_slab(), __GFP_NOFAIL flag is removed twice when trying
higher-order allocation. Remove it.

Signed-off-by: Jiyoup Kim <lakroforce@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220409150538.1264-1-lakroforce@gmail.com
2022-04-13 09:05:31 +02:00
JaeSang Yoo
c0f81a94d4 mm/slub: remove unused parameter in setup_object*()
setup_object_debug() and setup_object() has unused parameter, "struct
slab *slab". Remove it.

By the commit 3ec0974210 ("SLUB: Simplify debug code"),
setup_object_debug() were introduced to refactor previous code blocks
in the setup_object(). Previous code used SlabDebug() to init_object()
and init_tracking(). As the SlabDebug() takes "struct page *page" as
argument, the setup_object_debug() checks flag of "struct kmem_cache *s"
which doesn't require "struct page *page".
As the struct page were changed into struct slab by commit bb192ed9aa
("mm/slub: Convert most struct page to struct slab by spatch"), but it's
still unused parameter.

Suggested-by: Ohhoon Kwon <ohkwon1043@gmail.com>
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220411072534.3372768-1-jsyoo5b@gmail.com
2022-04-13 09:05:23 +02:00
Linus Torvalds
50c94de67c - Allow the compiler to optimize away unused percpu accesses and change
the local_lock_* macros back to inline functions
 
 - A couple of fixes to static call insn patching
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJStZ4ACgkQEsHwGGHe
 VUpUpA/8DHOMUQa7rM8z49ZWBV01HNVCLECTeeKshQBLyJfWc84MNOfdPbpgEGvY
 XE/eIZDnTMB5UKD0bfRqD+AQ0fXjl3NiLnJrdDZJqEQAiP/wGBswKNXMire8xPT8
 9MfaOKYWYPl0LY2uZBWVLcdC+lVe4kRGfhqAcl4LRx0ZSvMzgjcFy34NeXY8LlXD
 kFQJEzHa97CTROje54mtmXEt7Y5bxjxWwVTSyfEt0hJPGo1bJtJP6FaY01Muj+Xu
 h/OGNx3KLOYf9MqQC31caAwKgtUOptm8bTpvG3onaHg29qJgz2umKwONyOjYrUUn
 2PE3NREfMuKI38nf88pX+lOCs6/I1uVIjJPvAVJijIcuI1ZBXrfm26IP0lZ3LqG1
 h/9Y5gChiZPn1j90VnF4UCJUm4u3bYEAHqKIQgUdpcpUqX0NlxbDiXoYxJWfHnmB
 PBJ0PE7Vdo4MPK0n3BGVrzXAFeOyHsohAsKFijT8afRCMAOF/ebmVs/tI5NygFrK
 11e/U13/78iKkazZSxWew8vU3yXA39W5Rym7aPnhR2lWxvN+xQOjNTgZTxF9hUcZ
 6AcsaYJgHR7nD8SM7Y9+cwHWOWaDEdZMg9XSkgvyd1p0tHb4u+Ve/SQK7sA3j9q7
 ZmZyFSE1X3K+M1i+75rUSVmIEVM5cpfhodN89iRje/JIZ1KyRT8=
 =hSOc
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Borislav Petkov:

 - Allow the compiler to optimize away unused percpu accesses and change
   the local_lock_* macros back to inline functions

 - A couple of fixes to static call insn patching

* tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "mm/page_alloc: mark pagesets as __maybe_unused"
  Revert "locking/local_lock: Make the empty local_lock_*() function a macro."
  x86/percpu: Remove volatile from arch_raw_cpu_ptr().
  static_call: Remove __DEFINE_STATIC_CALL macro
  static_call: Properly initialise DEFINE_STATIC_CALL_RET0()
  static_call: Don't make __static_call_return0 static
  x86,static_call: Fix __static_call_return0 for i386
2022-04-10 06:56:46 -10:00
Linus Torvalds
911b2b9516 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "9 patches.

  Subsystems affected by this patch series: mm (migration, highmem,
  sparsemem, mremap, mempolicy, and memcg), lz4, mailmap, and
  MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  MAINTAINERS: add Tom as clang reviewer
  mm/list_lru.c: revert "mm/list_lru: optimize memcg_reparent_list_lru_node()"
  mailmap: update Vasily Averin's email address
  mm/mempolicy: fix mpol_new leak in shared_policy_replace
  mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0)
  mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
  lz4: fix LZ4_decompress_safe_partial read out of bound
  highmem: fix checks in __kmap_local_sched_{in,out}
  mm: migrate: use thp_order instead of HPAGE_PMD_ORDER for new page allocation.
2022-04-08 14:31:41 -10:00
Andrew Morton
b33e104447 mm/list_lru.c: revert "mm/list_lru: optimize memcg_reparent_list_lru_node()"
Commit 405cc51fc1 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
has subtle races which are proving ugly to fix.  Revert the original
optimization.  If quantitative testing indicates that we have a
significant problem here then other implementations can be looked at.

Fixes: 405cc51fc1 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-08 14:20:36 -10:00
Miaohe Lin
4ad099559b mm/mempolicy: fix mpol_new leak in shared_policy_replace
If mpol_new is allocated but not used in restart loop, mpol_new will be
freed via mpol_put before returning to the caller.  But refcnt is not
initialized yet, so mpol_put could not do the right things and might
leak the unused mpol_new.  This would happen if mempolicy was updated on
the shared shmem file while the sp->lock has been dropped during the
memory allocation.

This issue could be triggered easily with the below code snippet if
there are many processes doing the below work at the same time:

  shmid = shmget((key_t)5566, 1024 * PAGE_SIZE, 0666|IPC_CREAT);
  shm = shmat(shmid, 0, 0);
  loop many times {
    mbind(shm, 1024 * PAGE_SIZE, MPOL_LOCAL, mask, maxnode, 0);
    mbind(shm + 128 * PAGE_SIZE, 128 * PAGE_SIZE, MPOL_DEFAULT, mask,
          maxnode, 0);
  }

Link: https://lkml.kernel.org/r/20220329111416.27954-1-linmiaohe@huawei.com
Fixes: 42288fe366 ("mm: mempolicy: Convert shared_policy mutex to spinlock")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>	[3.8]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-08 14:20:36 -10:00
Paolo Bonzini
01e67e04c2 mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0)
If an mremap() syscall with old_size=0 ends up in move_page_tables(), it
will call invalidate_range_start()/invalidate_range_end() unnecessarily,
i.e.  with an empty range.

This causes a WARN in KVM's mmu_notifier.  In the past, empty ranges
have been diagnosed to be off-by-one bugs, hence the WARNing.  Given the
low (so far) number of unique reports, the benefits of detecting more
buggy callers seem to outweigh the cost of having to fix cases such as
this one, where userspace is doing something silly.  In this particular
case, an early return from move_page_tables() is enough to fix the
issue.

Link: https://lkml.kernel.org/r/20220329173155.172439-1-pbonzini@redhat.com
Reported-by: syzbot+6bde52d89cfdf9f61425@syzkaller.appspotmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-08 14:20:36 -10:00
Max Filippov
66f133ceab highmem: fix checks in __kmap_local_sched_{in,out}
When CONFIG_DEBUG_KMAP_LOCAL is enabled __kmap_local_sched_{in,out} check
that even slots in the tsk->kmap_ctrl.pteval are unmapped.  The slots are
initialized with 0 value, but the check is done with pte_none.  0 pte
however does not necessarily mean that pte_none will return true.  e.g.
on xtensa it returns false, resulting in the following runtime warnings:

 WARNING: CPU: 0 PID: 101 at mm/highmem.c:627 __kmap_local_sched_out+0x51/0x108
 CPU: 0 PID: 101 Comm: touch Not tainted 5.17.0-rc7-00010-gd3a1cdde80d2-dirty #13
 Call Trace:
   dump_stack+0xc/0x40
   __warn+0x8f/0x174
   warn_slowpath_fmt+0x48/0xac
   __kmap_local_sched_out+0x51/0x108
   __schedule+0x71a/0x9c4
   preempt_schedule_irq+0xa0/0xe0
   common_exception_return+0x5c/0x93
   do_wp_page+0x30e/0x330
   handle_mm_fault+0xa70/0xc3c
   do_page_fault+0x1d8/0x3c4
   common_exception+0x7f/0x7f

 WARNING: CPU: 0 PID: 101 at mm/highmem.c:664 __kmap_local_sched_in+0x50/0xe0
 CPU: 0 PID: 101 Comm: touch Tainted: G        W         5.17.0-rc7-00010-gd3a1cdde80d2-dirty #13
 Call Trace:
   dump_stack+0xc/0x40
   __warn+0x8f/0x174
   warn_slowpath_fmt+0x48/0xac
   __kmap_local_sched_in+0x50/0xe0
   finish_task_switch$isra$0+0x1ce/0x2f8
   __schedule+0x86e/0x9c4
   preempt_schedule_irq+0xa0/0xe0
   common_exception_return+0x5c/0x93
   do_wp_page+0x30e/0x330
   handle_mm_fault+0xa70/0xc3c
   do_page_fault+0x1d8/0x3c4
   common_exception+0x7f/0x7f

Fix it by replacing !pte_none(pteval) with pte_val(pteval) != 0.

Link: https://lkml.kernel.org/r/20220403235159.3498065-1-jcmvbkbc@gmail.com
Fixes: 5fbda3ecd1 ("sched: highmem: Store local kmaps in task struct")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-08 14:20:36 -10:00
Zi Yan
a04cd1600b mm: migrate: use thp_order instead of HPAGE_PMD_ORDER for new page allocation.
Fix a VM_BUG_ON_FOLIO(folio_nr_pages(old) != nr_pages) crash.

With folios support, it is possible to have other than HPAGE_PMD_ORDER
THPs, in the form of folios, in the system.  Use thp_order() to correctly
determine the source page order during migration.

Link: https://lkml.kernel.org/r/20220404165325.1883267-1-zi.yan@sent.com
Link: https://lore.kernel.org/linux-mm/20220404132908.GA785673@u2004/
Fixes: d68eccad37 ("mm/filemap: Allow large folios to be added to the page cache")
Reported-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-08 14:20:36 -10:00
zhenwei pi
98ea02597b mm/rmap: Fix handling of hugetlbfs pages in page_vma_mapped_walk
page_mapped_in_vma() sets nr_pages to 1, which is usually correct as we
only want to know about the precise page and not about other pages in
the folio.  However, hugetlbfs does want to know about the entire hpage,
and using nr_pages to get the size of the hpage is wrong.  We could
change page_mapped_in_vma() to special-case hugetlbfs pages, but it's
better to ignore nr_pages in page_vma_mapped_walk() and get the size
from the VMA instead.

Fixes: 2aff7a4755 ("mm: Convert page_vma_mapped_walk to work on PFNs")
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[edit commit message, use hstate directly]
2022-04-07 10:11:20 -04:00
Matthew Wilcox (Oracle)
ec4858e07e mm/mempolicy: Use vma_alloc_folio() in new_page()
Simplify new_page() by unifying the THP and base page cases, and
handle orders other than 0 and HPAGE_PMD_ORDER correctly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-04-07 09:43:41 -04:00
Matthew Wilcox (Oracle)
f584b68005 mm: Add vma_alloc_folio()
This wrapper around alloc_pages_vma() calls prep_transhuge_page(),
removing the obligation from the caller.  This is in the same spirit
as __folio_alloc().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-04-07 09:43:41 -04:00
Matthew Wilcox (Oracle)
c185e494ae mm/migrate: Use a folio in migrate_misplaced_transhuge_page()
Unify alloc_misplaced_dst_page() and alloc_misplaced_dst_page_thp().
Removes an assumption that compound pages are HPAGE_PMD_ORDER.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-04-07 09:43:41 -04:00
Matthew Wilcox (Oracle)
ffe06786b5 mm/migrate: Use a folio in alloc_migration_target()
This removes an assumption that a large folio is HPAGE_PMD_ORDER
as well as letting us remove the call to prep_transhuge_page()
and a few hidden calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-04-07 09:43:41 -04:00
Matthew Wilcox (Oracle)
83a8441f8d mm/huge_memory: Avoid calling pmd_page() on a non-leaf PMD
Calling try_to_unmap() with TTU_SPLIT_HUGE_PMD and a folio that's not
mapped by a PMD causes oopses on arm64 because we now call page_folio()
on an invalid page.  pmd_page() returns a valid page for non-leaf PMDs on
some architectures, so this bug escaped testing before now.  Fix this bug
by delaying the call to pmd_page() until after we know the PMD is a leaf.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215804
Fixes: af28a988b3 ("mm/huge_memory: Convert __split_huge_pmd() to take a folio")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Zorro Lang <zlang@redhat.com>
2022-04-07 09:43:41 -04:00
Yixuan Cao
a8f23dd166 mm/slab.c: fix comments
While reading the source code,
I noticed some language errors in the comments, so I fixed them.

Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220407080958.3667-1-caoyixuan2019@email.szu.edu.cn
2022-04-07 11:44:47 +02:00
zhanglianjie
aa779e5102 mm: move page-writeback sysctls to their own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we just
care about the core logic.

So move the page-writeback sysctls to its own file.

[akpm@linux-foundation.org: coding-style cleanups]

akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warnings]
Link: https://lkml.kernel.org/r/20220129012955.26594-1-zhanglianjie@uniontech.com
Signed-off-by: zhanglianjie <zhanglianjie@uniontech.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:44 -07:00
sujiaxun
43fe219aa5 mm: move oom_kill sysctls to their own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we just
care about the core logic.

So move the oom_kill sysctls to their own file, mm/oom_kill.c

[sfr@canb.auug.org.au: null-terminate the array]
  Link: https://lkml.kernel.org/r/20220216193202.28838626@canb.auug.org.au

Link: https://lkml.kernel.org/r/20220215093203.31032-1-sujiaxun@uniontech.com
Signed-off-by: sujiaxun <sujiaxun@uniontech.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:44 -07:00
Oliver Glitta
553c0369b3 mm/slub: sort debugfs output by frequency of stack traces
Sort the output of debugfs alloc_traces and free_traces by the frequency
of allocation/freeing stack traces. Most frequently used stack traces
will be printed first, e.g. for easier memory leak debugging.

Signed-off-by: Oliver Glitta <glittao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
2022-04-06 11:09:32 +02:00
Oliver Glitta
8ea9fb921b mm/slub: distinguish and print stack traces in debugfs files
Aggregate objects in slub cache by unique stack trace in addition to
caller address when producing contents of debugfs files alloc_traces and
free_traces in debugfs. Also add the stack traces to the debugfs output.
This makes it much more useful to e.g. debug memory leaks.

Signed-off-by: Oliver Glitta <glittao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
2022-04-06 11:03:43 +02:00
Oliver Glitta
5cf909c553 mm/slub: use stackdepot to save stack trace in objects
Many stack traces are similar so there are many similar arrays.
Stackdepot saves each unique stack only once.

Replace field addrs in struct track with depot_stack_handle_t handle.  Use
stackdepot to save stack trace.

The benefits are smaller memory overhead and possibility to aggregate
per-cache statistics in the following patch using the stackdepot handle
instead of matching stacks manually.

[ vbabka@suse.cz: rebase to 5.17-rc1 and adjust accordingly ]

This was initially merged as commit 788691464c and reverted by commit
ae14c63a9f due to several issues, that should now be fixed.
The problem of unconditional memory overhead by stackdepot has been
addressed by commit 2dba5eb1c7 ("lib/stackdepot: allow optional init
and stack_table allocation by kvmalloc()"), so the dependency on
stackdepot will result in extra memory usage only when a slab cache
tracking is actually enabled, and not for all CONFIG_SLUB_DEBUG builds.
The build failures on some architectures were also addressed, and the
reported issue with xfs/433 test did not reproduce on 5.17-rc1 with this
patch.

Signed-off-by: Oliver Glitta <glittao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
2022-04-06 11:03:32 +02:00
Vlastimil Babka
0cd1a02901 mm/slub: move struct track init out of set_track()
set_track() either zeroes out the struct track or fills it, depending on
the addr parameter. This is unnecessary as there's only one place that
calls it for the initialization - init_tracking(). We can simply do the
zeroing there, with a single memset() that covers both TRACK_ALLOC and
TRACK_FREE as they are adjacent.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
2022-04-06 10:57:38 +02:00
Vlastimil Babka
a5f1783be2 lib/stackdepot: allow requesting early initialization dynamically
In a later patch we want to add stackdepot support for object owner
tracking in slub caches, which is enabled by slub_debug boot parameter.
This creates a bootstrap problem as some caches are created early in
boot when slab_is_available() is false and thus stack_depot_init()
tries to use memblock. But, as reported by Hyeonggon Yoo [1] we are
already beyond memblock_free_all(). Ideally memblock allocation should
fail, yet it succeeds, but later the system crashes, which is a
separately handled issue.

To resolve this boostrap issue in a robust way, this patch adds another
way to request stack_depot_early_init(), which happens at a well-defined
point of time. In addition to build-time CONFIG_STACKDEPOT_ALWAYS_INIT,
code that's e.g. processing boot parameters (which happens early enough)
can call a new function stack_depot_want_early_init(), which sets a flag
that stack_depot_early_init() will check.

In this patch we also convert page_owner to this approach. While it
doesn't have the bootstrap issue as slub, it's also a functionality
enabled by a boot param and can thus request stack_depot_early_init()
with memblock allocation instead of later initialization with
kvmalloc().

As suggested by Mike, make stack_depot_early_init() only attempt
memblock allocation and stack_depot_init() only attempt kvmalloc().
Also change the latter to kvcalloc(). In both cases we can lose the
explicit array zeroing, which the allocations do already.

As suggested by Marco, provide empty implementations of the init
functions for !CONFIG_STACKDEPOT builds to simplify the callers.

[1] https://lore.kernel.org/all/YhnUcqyeMgCrWZbd@ip-172-31-19-208.ap-northeast-1.compute.internal/

Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Suggested-by: Marco Elver <elver@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
2022-04-06 10:55:50 +02:00
Hyeonggon Yoo
a285909f47 mm/slub, kunit: Make slub_kunit unaffected by user specified flags
slub_kunit does not expect other debugging flags to be set when running
tests. When SLAB_RED_ZONE flag is set globally, test fails because the
flag affects number of errors reported.

To make slub_kunit unaffected by user specified debugging flags,
introduce SLAB_NO_USER_FLAGS to ignore them. With this flag, only flags
specified in the code are used and others are ignored.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/Yk0sY9yoJhFEXWOg@hyeyoo
2022-04-06 10:11:48 +02:00
Miaohe Lin
1e703d0548 mm/slab: remove some unused functions
alternate_node_alloc and ____cache_alloc_node are always called when
CONFIG_NUMA. So we can remove the unused !CONFIG_NUMA variant. Also
forward declaration for alternate_node_alloc is unnecessary. Remove
it too.

[ vbabka@suse.cz: move ____cache_alloc_node() declaration closer to
  its callers ]

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220322091421.25285-1-linmiaohe@huawei.com
2022-04-05 11:17:25 +02:00
Sebastian Andrzej Siewior
273ba85b5e Revert "mm/page_alloc: mark pagesets as __maybe_unused"
The local_lock() is now using a proper static inline function which is
enough for llvm to accept that the variable is used.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220328145810.86783-4-bigeasy@linutronix.de
2022-04-05 09:59:39 +02:00
Linus Torvalds
cda4351252 Filesystem/VFS changes for 5.18, part two
- Remove ->readpages infrastructure
  - Remove AOP_FLAG_CONT_EXPAND
  - Move read_descriptor_t to networking code
  - Pass the iocb to generic_perform_write
  - Minor updates to iomap, btrfs, ext4, f2fs, ntfs
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmJHSY8ACgkQDpNsjXcp
 gj59lgf/UJsVQjF+emdQAHa9AkFtZAb7TNv5QKLHp935c/OXREvHaQ956FyVhrc1
 n3pH3VRLFjXFQ3QZpWtArMQbIPr77I9KNs75zX0i+mutP5ieYcQVJKsGPIamiJAf
 eNTBoVaTxCVcTL43xCvnflvAeumoKzwdxGj6Hkgln8wuQ9B9p8923nBZpy94ajqp
 6b6E1rtrJlpEioqar2vCNpdJhEeN/jN93BwIynQMt1snPrBWQRYqv5pL3puUh7Gx
 UgJkCC6XvsUsXXOCu7n22RUKnDGiUW7QN99fmrztwnmiQY4hYmK2AoVMG16riDb+
 WmxIXbhaTo5qJT0rlQipi5d61TSuTA==
 =gwgb
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18d' of git://git.infradead.org/users/willy/pagecache

Pull more filesystem folio updates from Matthew Wilcox:
 "A mixture of odd changes that didn't quite make it into the original
  pull and fixes for things that did. Also the readpages changes had to
  wait for the NFS tree to be pulled first.

   - Remove ->readpages infrastructure

   - Remove AOP_FLAG_CONT_EXPAND

   - Move read_descriptor_t to networking code

   - Pass the iocb to generic_perform_write

   - Minor updates to iomap, btrfs, ext4, f2fs, ntfs"

* tag 'folio-5.18d' of git://git.infradead.org/users/willy/pagecache:
  btrfs: Remove a use of PAGE_SIZE in btrfs_invalidate_folio()
  ntfs: Correct mark_ntfs_record_dirty() folio conversion
  f2fs: Get the superblock from the mapping instead of the page
  f2fs: Correct f2fs_dirty_data_folio() conversion
  ext4: Correct ext4_journalled_dirty_folio() conversion
  filemap: Remove AOP_FLAG_CONT_EXPAND
  fs: Pass an iocb to generic_perform_write()
  fs, net: Move read_descriptor_t to net.h
  fs: Remove read_actor_t
  iomap: Simplify is_partially_uptodate a little
  readahead: Update comments
  mm: remove the skip_page argument to read_pages
  mm: remove the pages argument to read_pages
  fs: Remove ->readpages address space operation
  readahead: Remove read_cache_pages()
2022-04-01 13:50:50 -07:00