Make sure nothing goes wrong with the string counters or the bprm's
belief about the stack pointer. Add checks and matching self-tests.
Take special care for !CONFIG_MMU, since argmin is not exposed there.
For 32-bit validation, 32-bit UML was used:
$ tools/testing/kunit/kunit.py run \
--make_options CROSS_COMPILE=i686-linux-gnu- \
--make_options SUBARCH=i386 \
exec
For !MMU validation, m68k was used:
$ tools/testing/kunit/kunit.py run \
--arch m68k --make_options CROSS_COMPILE=m68k-linux-gnu- \
exec
Link: https://lore.kernel.org/r/20240520021615.741800-2-keescook@chromium.org
Link: https://lore.kernel.org/r/20240621205046.4001362-2-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
When argmin was added in commit 655c16a8ce ("exec: separate
MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for
validating stack limits on CONFIG_MMU[1]. All checking for reaching the
limit (argmin) is wrapped in CONFIG_MMU ifdef checks, though setting
argmin was not. That argmin is only supposed to be used under CONFIG_MMU
was rediscovered recently[2], and I don't want to trip over this again.
Move argmin's declaration into the existing CONFIG_MMU area, and add
helper functions so the MMU tests can be consolidated.
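A minimal sketch of what such a consolidation could look like (the helper
name and exact layout here are hypothetical, not necessarily the actual
patch):

    #ifdef CONFIG_MMU
            /* argmin is only meaningful when stack limits are enforced. */
            unsigned long argmin;   /* rlimit marker for copy_strings() */
    #endif

    /* Hypothetical helper so callers need no #ifdef of their own. */
    static inline bool bprm_hit_stack_limit(struct linux_binprm *bprm)
    {
    #ifdef CONFIG_MMU
            return bprm->p < bprm->argmin;
    #else
            return false;   /* no stack limit checking without an MMU */
    #endif
    }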
Link: https://lore.kernel.org/all/20181126122307.GA1660@redhat.com [1]
Link: https://lore.kernel.org/all/202406211253.7037F69@keescook/ [2]
Link: https://lore.kernel.org/r/20240621205046.4001362-1-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Merge tag '6.10-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
"Small fix, also for stable"
* tag '6.10-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix setting SecurityFlags to true
If you try to set /proc/fs/cifs/SecurityFlags to 1, it will set them to
CIFSSEC_MUST_NTLMV2, which is no longer relevant (the less secure flags
like lanman have been removed from cifs.ko), is missing some flags (like
those for signing and encryption), and can even cause mount to fail, so
change this case to set the flags to Kerberos instead.
Also change the description of the SecurityFlags to remove mention
of flags which are no longer supported.
Cc: stable@vger.kernel.org
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
f7ce5eb2cb ("bnxt_en: Fix crash in bnxt_get_max_rss_ctx_ring()")
20c8ad72eb ("eth: bnxt: use the RSS context XArray instead of the local list")
Adjacent changes:
net/ethtool/ioctl.c
503757c809 ("net: ethtool: Fix RSS setting")
eac9122f0c ("net: ethtool: record custom RSS contexts in the XArray")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Smatch complains that:
fs/erofs/zdata.c:1047 z_erofs_scan_folio()
error: uninitialized symbol 'err'.
The issue is that if we hit the (!(map->m_flags & EROFS_MAP_MAPPED))
condition, then "err" isn't set. It's inside a loop, so we would have to
hit that condition on every iteration. Initialize "err" to zero to
solve this.
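The shape of the fix (a minimal sketch; the surrounding loop in
z_erofs_scan_folio() is omitted):

    int err = 0;    /* initialize: the !EROFS_MAP_MAPPED branch can be
                     * taken on every loop iteration without setting err */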
Fixes: 5b9654efb6 ("erofs: teach z_erofs_scan_folios() to handle multi-page folios")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/f78ab50e-ed6d-4275-8dd4-a4159fa565a2@stanley.mountain
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
'struct kobj_type' is not modified in this driver. It is only used with
kobject_init_and_add() which takes a "const struct kobj_type *" parameter.
Constifying this structure moves some data to a read-only section, which
increases overall security.
On x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
22403 4184 24 26611 67f3 fs/nilfs2/sysfs.o
After:
=====
text data bss dec hex filename
22723 3928 24 26675 6833 fs/nilfs2/sysfs.o
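A sketch of the pattern (identifiers are illustrative, not the exact
nilfs2 symbols):

    /* const: the ops table can now be placed in a read-only section. */
    static const struct kobj_type nilfs_example_ktype = {
            .sysfs_ops      = &nilfs_example_sysfs_ops,
            .default_groups = nilfs_example_groups,
            .release        = nilfs_example_release,
    };

    /* kobject_init_and_add() already takes a const struct kobj_type *. */
    err = kobject_init_and_add(kobj, &nilfs_example_ktype, parent, "example");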
Link: https://lkml.kernel.org/r/20240708143242.3296-1-konishi.ryusuke@gmail.com
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a folio is isolated, the PG_lru bit is cleared, so the PG_lru check
in stable_page_flags() will miss such isolated folios. Use
folio_test_large_rmappable() instead to also include isolated folios.
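A hedged sketch of the resulting check in stable_page_flags() (simplified;
the exact surrounding conditions are omitted):

    /* PG_lru is cleared while a folio is isolated, so testing it would
     * miss isolated large folios; large-rmappable covers them too. */
    if (folio_test_large(folio) && folio_test_large_rmappable(folio))
            u |= 1 << KPF_THP;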
Since the pagecache supports large folios and the introduction of mTHP,
the semantics of KPF_THP have been expanded: it now indicates more than
just PMD-sized THPs. Update the related documentation to clearly state
that KPF_THP covers THPs of multiple orders.
[ran.xiaokai@zte.com.cn: directly use is_zero_folio(), per David]
Link: https://lkml.kernel.org/r/20240708062601.165215-1-ranxiaokai627@163.com
Link: https://lkml.kernel.org/r/20240705104343.112680-1-ranxiaokai627@163.com
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Svetly Todorov <svetly.todorov@memverge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A number of allocation helper functions were converted into macros to
account them at the call sites. Add a comment for each converted
allocation helper explaining why it has to be a macro and why we typecast
the return value wherever required. The patch also moves
acpi_os_acquire_object() closer to other allocation helpers to group them
together under the same comment. The patch has no functional changes.
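For illustration, the kind of comment-plus-macro this adds, using
acpi_os_acquire_object() since the patch mentions it (the exact comment
wording is hedged):

    /*
     * This has to be a macro, not an inline function, so that the
     * allocation is accounted at the caller of acpi_os_acquire_object()
     * rather than at a single common helper.
     */
    #define acpi_os_acquire_object(cache) \
            kmem_cache_zalloc(cache, GFP_ATOMIC)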
Link: https://lkml.kernel.org/r/20240703174225.3891393-1-surenb@google.com
Fixes: 2c321f3f70 ("mm: change inlined allocation helpers to account at the call site")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On powerpc 8xx, huge_ptep_get() will need to know whether the given ptep
is a PTE entry or a PMD entry. This cannot be determined from the entry
itself, because there is no easy way to tell from its content.
So huge_ptep_get() needs to know either the size of the page or to get
the pmd.
In order to be consistent with huge_ptep_get_and_clear(), give mm and
address to huge_ptep_get().
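The resulting prototype change, sketched:

    /* Before: the entry alone cannot tell a PTE from a PMD on 8xx. */
    pte_t huge_ptep_get(pte_t *ptep);

    /* After: mm and address let the architecture derive the page size. */
    pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep);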
Link: https://lkml.kernel.org/r/cc00c70dd384298796a4e1b25d6c4eb306d3af85.1719928057.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The need to get ELF build ID reliably is an important aspect when dealing
with profiling and stack trace symbolization, and /proc/<pid>/maps textual
representation doesn't help with this.
To get the backing file's ELF build ID, an application has to first
resolve the VMA, then use its start/end address range to follow a special
/proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this is
necessary because the backing file might have been removed from the disk
or already replaced with another binary at the same file path).
Such an approach, beyond adding the complexity of a bunch of extra work,
has extra security implications. Because the application opens the
underlying ELF file and needs read access to its entire contents (as far
as the kernel is concerned), the kernel puts additional capable() checks
on following the /proc/<pid>/map_files/<start>-<end> symlink. And that
makes sense in general.
But in the case of build ID, the profiler/symbolizer doesn't need the
contents of the ELF file, per se. It's only the build ID that is of
interest, and the ELF build ID itself doesn't provide any sensitive
information.
So this patch adds a way to request the backing file's ELF build ID along
with the rest of the VMA information in the same API. The user controls
whether this piece of information is requested by setting the
build_id_size field either to zero or to the non-zero maximum size of a
buffer they provide through the build_id_addr field (which encodes a user
pointer as a __u64 field).
This is a completely optional piece of information, and so has no
performance implications for use cases that don't care about build ID,
while improving performance and simplifying the setup for those
applications that do need it.
The kernel already implements build ID fetching, which is used from the
BPF subsystem. We reuse this code here, but plan follow-up changes to
make it work better under the more relaxed assumption (compared to what
the existing code assumes) of being called from user process context, in
which page faults are allowed. The BPF-specific implementation currently
bails out if the necessary part of the ELF file is not paged in, all due
to extra BPF-specific restrictions (like the need to fetch build ID in
restrictive contexts such as NMI handlers).
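A hedged userspace sketch of requesting the build ID together with the
rest of the VMA data (field names follow the proposed UAPI; maps_fd is
assumed to be an open /proc/<pid>/maps FD, addr a captured address; error
handling trimmed):

    struct procmap_query q = {};
    char build_id[20];                  /* fits a SHA-1 GNU build ID */

    q.size          = sizeof(q);
    q.query_addr    = addr;
    q.build_id_size = sizeof(build_id);           /* zero = don't fetch */
    q.build_id_addr = (__u64)(uintptr_t)build_id; /* user pointer as __u64 */

    if (ioctl(maps_fd, PROCMAP_QUERY, &q) == 0)
            /* q.build_id_size now holds the actual build ID length */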
[andrii@kernel.org: fix integer to pointer cast warning in do_procmap_query()]
Link: https://lkml.kernel.org/r/20240701174805.1897344-1-andrii@kernel.org
Link: https://lkml.kernel.org/r/20240627170900.1672542-4-andrii@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
/proc/<pid>/maps file is extremely useful in practice for various tasks
involving figuring out process memory layout, what files are backing any
given memory range, etc. One important class of applications that
absolutely rely on this are profilers/stack symbolizers (perf tool being
one of them). Patterns of use differ, but they generally would fall into
two categories.
In on-demand pattern, a profiler/symbolizer would normally capture stack
trace containing absolute memory addresses of some functions, and would
then use /proc/<pid>/maps file to find corresponding backing ELF files
(normally, only executable VMAs are of interest), file offsets within
them, and then continue from there to get yet more information (ELF
symbols, DWARF information) to get human-readable symbolic information.
This pattern is used by Meta's fleet-wide profiler, as one example.
In preprocessing pattern, application doesn't know the set of addresses of
interest, so it has to fetch all relevant VMAs (again, probably only
executable ones), store or cache them, then proceed with profiling and
stack trace capture. Once done, it would do symbolization based on stored
VMA information. This can happen at a much later point in time. This
pattern is used by the perf tool, as an example.
In either case, there are both performance and correctness requirements
involved. This address-to-VMA information translation has to be done as
efficiently as possible, but also not miss any VMA (especially in the case
of loading/unloading shared libraries). In practice, correctness can't be
guaranteed (due to process dying before VMA data can be captured, or
shared library being unloaded, etc), but any effort to maximize the chance
of finding the VMA is appreciated.
Unfortunately, for all the /proc/<pid>/maps file universality and
usefulness, it doesn't fit the above use cases 100%.
First, its main purpose is to emit all VMAs sequentially, but in practice
captured addresses would fall into only a smaller subset of all the
process' VMAs, mainly containing executable text. Yet, a library would
need to parse most or all of the contents to find the needed VMAs, as
there is no way to skip VMAs that are of no use. A linear pass can be
made relatively efficient, but it's definitely overhead that could be
avoided if there were a way to do more targeted querying of the relevant
VMA information.
Second, it's a text-based interface, which makes its programmatic use
from applications and libraries more cumbersome and inefficient due to
the need to handle text parsing to get the necessary pieces of
information. The overhead is actually paid both by the kernel, formatting
originally binary VMA data into text, and then by the user space
application, parsing it back into binary data for further use.
For the on-demand pattern of usage, described above, another problem when
writing a generic stack trace symbolization library is an unfortunate
performance-vs-correctness tradeoff that needs to be made. The library
has to decide either to cache the parsed contents of /proc/<pid>/maps
(after initial processing) to service future requests (if the application
asks to symbolize another set of addresses for the same process, captured
at some later time, which is typical for periodic/continuous profiling
cases), or to re-open and re-parse the file on every request. In the
former case, more memory is used for the cache and there is a risk of
getting stale data if the application loads or unloads shared libraries,
or otherwise changes its set of VMAs somehow, e.g., through additional
mmap() calls. In the latter case, it's the performance hit that comes
from re-opening the file and re-parsing its contents all over again.
This patch aims to solve this problem by providing a new API built on top
of /proc/<pid>/maps. It's meant to address both the non-selectiveness and
the text nature of /proc/<pid>/maps, by giving the user more control over
what sort of VMA(s) need to be queried; being a binary interface, it
eliminates the overhead of text formatting (on the kernel side) and
parsing (on the user space side).
It's also designed to be extensible and forward/backward compatible by
including required struct size field, which user has to provide. We use
established copy_struct_from_user() approach to handle extensibility.
User has a choice to pick either getting VMA that covers provided address
or -ENOENT if none is found (exact, least surprising, case). Or, with an
extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA), they can get either
VMA that covers the address (if there is one), or the closest next VMA
(i.e., VMA with the smallest vm_start > addr). The latter allows more
efficient use, but, given it could be a surprising behavior, requires an
explicit opt-in.
There is another query flag that is useful for some use cases.
PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return
file-backed VMAs. Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
makes it possible to efficiently iterate only file-backed VMAs of the
process, which is what profilers/symbolizers are normally interested in.
All the above querying flags can be combined with an (also optional) set
of desired VMA permission flags. This allows, for example, iterating only
the executable subset of VMAs, which is what the preprocessing pattern
used by the perf tool would benefit from, as the assumption is that
captured stack traces would have addresses of executable code. This saves
time by efficiently skipping non-executable VMAs altogether.
All these querying flags (modifiers) are orthogonal and can be combined in
a semantically meaningful and natural way.
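A hedged sketch of combining these flags to walk only the file-backed,
executable VMAs of a process (flag and field names follow the proposed
UAPI):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* PROCMAP_QUERY, struct procmap_query */

    int maps_fd = open("/proc/self/maps", O_RDONLY);
    __u64 addr = 0;

    for (;;) {
            struct procmap_query q = {};

            q.size        = sizeof(q);
            q.query_addr  = addr;
            q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
                            PROCMAP_QUERY_FILE_BACKED_VMA |
                            PROCMAP_QUERY_VMA_EXECUTABLE;
            if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
                    break;  /* ENOENT: no more matching VMAs */
            /* q.vma_start/q.vma_end describe the match; resume after it. */
            addr = q.vma_end;
    }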
Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes
sense given it's querying the same set of VMA data. It's also beneficial
because permission checks for /proc/<pid>/maps are performed once at open
time, and the actual reads of the text contents of /proc/<pid>/maps are
done without further permission checks. We piggyback on this pattern with
the ioctl()-based API as well, as that's a desired property, both for
performance reasons and for security and flexibility reasons.
Allowing an application to open an FD for /proc/self/maps without any
extra capabilities, and then passing it to some sort of profiling agent
through a Unix-domain socket, lets such a profiling agent avoid some of
the capabilities that are otherwise expected when opening the
/proc/<pid>/maps file for *another* process. This is a desirable property
for some more restricted setups.
This new ioctl-based implementation doesn't interfere with seq_file-based
implementation of /proc/<pid>/maps textual interface, and so could be used
together or independently without paying any price for that.
Note also, that fetching VMA name (e.g., backing file path, or special
hard-coded or user-provided names) is optional just like build ID. If
user sets vma_name_size to zero, kernel code won't attempt to retrieve it,
saving resources.
Earlier versions of this patch set were adding per-VMA locking, which is
why we have a code structure that is ready for abstracting mmap_lock vs
vm_lock differences (query_vma_setup(), query_vma_teardown(), and
query_vma_find_by_addr()), but given anon_vma_name() is not yet compatible
with per-VMA locking, initial implementation sticks to using only
mmap_lock for now. It will be easy to add per-VMA locking back once all
the pieces are ready later on, which is why we keep the existing code
structure with setup/teardown/query helper functions.
[andrii@kernel.org: improve PROCMAP_QUERY's compat mode handling]
Link: https://lkml.kernel.org/r/20240701174805.1897344-2-andrii@kernel.org
Link: https://lkml.kernel.org/r/20240627170900.1672542-3-andrii@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "ioctl()-based API to query VMAs from /proc/<pid>/maps", v6.
Implement a binary ioctl()-based interface to the /proc/<pid>/maps file
to allow applications to query VMA information more efficiently than
reading *all* VMAs nonselectively through the text-based interface of the
/proc/<pid>/maps file.
Patch #2 goes into a lot of details and background on some common patterns
of using /proc/<pid>/maps in the area of performance profiling and
subsequent symbolization of captured stack traces. As mentioned in that
patch, patterns of VMA querying can differ depending on specific use case,
but can generally be grouped into two main categories: the need to query a
small subset of VMAs covering a given batch of addresses, or
reading/storing/caching all (typically, executable) VMAs upfront for later
processing.
The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by
the former pattern of usage. Earlier revisions had a patch adding a tool
that faithfully reproduces an efficient VMA matching pass of a symbolizer,
collecting a subset of covering VMAs for a given set of addresses as
efficiently as possible. This tool served both as a testing ground, as
well as a benchmarking tool. It implements everything both for currently
existing text-based /proc/<pid>/maps interface, as well as for newly-added
PROCMAP_QUERY ioctl(). This revision dropped the tool from the patch set
and, once the API lands upstream, this tool might be added separately on
Github as an example.
Based on discussion on earlier revisions of this patch set, it turned out
that this ioctl() API is competitive with highly-optimized text-based
pre-processing pattern that perf tool is using. Based on perf discussion,
this revision adds more flexibility in specifying a subset of VMAs that
are of interest. Now it's possible to specify desired permissions of VMAs
(e.g., request only executable ones) and/or restrict to only a subset of
VMAs that have file backing. This further improves the efficiency when
using this new API thanks to more selective (executable VMAs only)
querying.
In addition to a custom benchmarking tool, and experimental perf
integration (available at [0]), Daniel Mueller has since also implemented
an experimental integration into blazesym (see [1]), a library used for
stack trace symbolization by our server fleet-wide profiler and another
on-device profiler agent that runs on weaker ARM devices. The latter
ARM-based device profiler is especially sensitive to performance, and so
we benchmarked and compared text-based /proc/<pid>/maps solution to the
equivalent one using PROCMAP_QUERY ioctl().
Results are very encouraging, giving us 5x improvement for end-to-end
so-called "address normalization" pass, which is the part of the
symbolization process that happens locally on ARM device, before being
sent out for further heavier-weight processing on more powerful remote
server. Note that this is not an artificial microbenchmark. It's a full
end-to-end API call being measured with real-world data on real-world
device.
TEXT-BASED
==========
Benchmarking main/normalize_process_no_build_ids_uncached_maps
main/normalize_process_no_build_ids_uncached_maps
time: [49.777 µs 49.982 µs 50.250 µs]
IOCTL-BASED
===========
Benchmarking main/normalize_process_no_build_ids_uncached_maps
main/normalize_process_no_build_ids_uncached_maps
time: [10.328 µs 10.391 µs 10.457 µs]
change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02)
Performance has improved.
You can see above that we see the drop from 50µs down to 10µs for
exactly the same amount of work, with the same data and target process.
With the aforementioned custom tool, we see about ~40x improvement (it
might vary a bit, depending on a specific captured set of addresses). And
even for perf-based benchmark it's on par or slightly ahead when using
permission-based filtering (fetching only executable VMAs).
Earlier revisions attempted to use per-VMA locking, if kernel was compiled
with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not
yet compatible with per-VMA locking and assumes mmap_lock to be taken,
which makes the use of per-VMA locking for this API premature. It was
agreed ([2]) to continue for now with just mmap_lock, but the code
structure is such that it should be easy to add per-VMA locking support
once all the pieces are ready.
One thing that did not change was basing this new API as an ioctl()
command on /proc/<pid>/maps file. An ioctl-based API on top of pidfd was
considered, but has its own downsides. Implementing ioctl() directly on
pidfd will cause access permission checks on every single ioctl(), which
leads to performance concerns and potential spam of capable() audit
messages. It also prevents a nice pattern, possible with
/proc/<pid>/maps, in which an application opens a /proc/self/maps FD
(requiring no additional capabilities) and passes this FD to a profiling
agent for querying. To achieve a similar pattern, a new file would have
to be created from the pidfd just for VMA querying, which is considered
inferior to just querying a /proc/<pid>/maps FD as proposed in the
current approach. These aspects were discussed in the hallway track at
the recent LSF/MM/BPF 2024, and sticking to a procfs ioctl() was the
final agreement we arrived at.
[0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
[1] https://github.com/libbpf/blazesym/pull/675
[2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/
This patch (of 6):
Extract generic logic to fetch relevant pieces of data that describe a
VMA name. This could be just some string (either a special constant or
user-provided), or a string with some formatted wrapping text (e.g.,
"[anon_shmem:<something>]"), or, commonly, a file path. The seq_file-based
logic has different methods to handle all three cases, but they are
currently mixed in with extracting the underlying sources of data.
This patch splits this into data fetching and data formatting, so that
data fetching can be reused later on.
There should be no functional changes.
Link: https://lkml.kernel.org/r/20240627170900.1672542-1-andrii@kernel.org
Link: https://lkml.kernel.org/r/20240627170900.1672542-2-andrii@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a check for the return value of crypto_shash_tfm_digest() and return
the error if it fails, so the failure is not silently ignored.
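The shape of the fix (a sketch; the surrounding UBIFS context is
abbreviated):

    err = crypto_shash_tfm_digest(c->hash_tfm, node, len, hash);
    if (err)
            return err;     /* previously the return value was ignored */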
Fixes: 817aa09484 ("ubifs: support offline signed images")
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
UBIFS always makes sure that the data length won't go beyond the inode
size by writing the inode before writing the page (see ubifs_writepage()).
After commit c35acef383 ("ubifs: Convert ubifs_writepage to use a folio"),
the rule is broken in one case: given a file with size 3, then a 4096-byte
write from offset 0, the following process will make the inode size
smaller than the file data length after a powercut and recovery:
P1 P2
ubifs_writepage
len = folio_size(folio) // 4096
if (folio_pos(folio) + len <= i_size) // condition 1: 0 + 4096 <= 4096
//(i_size is updated as 4096 in ubifs_write_end)
if (folio_pos(folio) >= synced_i_size) // condition 2: 0 >= 3, false
write_inode // Skipped, because condition 2 is false
do_writepage(folio, len) // write one page
do_commit // data node won't be replayed in next mounting
>> Powercut <<
So, the inode size (4096) is not written to disk, and we will get the
following error messages in the next mounting (chk_fs = 1):
check_leaf [ubifs]: data node at LEB 14:2048 is not within inode size 3
dbg_walk_index [ubifs]: leaf checking function returned error -22, for
leaf at LEB 14:2048
Fix it by restoring condition 2 to the original comparison (compare the
page index of synced_i_size with the current page index).
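A hedged sketch of the restored condition 2:

    /* Before (broken): byte offsets, 0 >= 3 is false, inode not synced. */
    if (folio_pos(folio) >= synced_i_size)

    /* After (sketch): compare page indices, as the pre-folio code did. */
    if (folio->index >= (synced_i_size >> PAGE_SHIFT))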
Fixes: c35acef383 ("ubifs: Convert ubifs_writepage to use a folio")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218934
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
make C=1 reports the following kernel-doc warnings:
fs/ubifs/compress.c:103: warning: Function parameter or struct member 'c' not described in 'ubifs_compress'
fs/ubifs/compress.c:155: warning: Function parameter or struct member 'c' not described in 'ubifs_decompress'
fs/ubifs/find.c:353: warning: Excess function parameter 'data' description in 'scan_for_free_cb'
fs/ubifs/find.c:353: warning: Function parameter or struct member 'arg' not described in 'scan_for_free_cb'
fs/ubifs/find.c:594: warning: Excess function parameter 'data' description in 'scan_for_idx_cb'
fs/ubifs/find.c:594: warning: Function parameter or struct member 'arg' not described in 'scan_for_idx_cb'
fs/ubifs/find.c:786: warning: Excess function parameter 'data' description in 'scan_dirty_idx_cb'
fs/ubifs/find.c:786: warning: Function parameter or struct member 'arg' not described in 'scan_dirty_idx_cb'
fs/ubifs/find.c:86: warning: Excess function parameter 'data' description in 'scan_for_dirty_cb'
fs/ubifs/find.c:86: warning: Function parameter or struct member 'arg' not described in 'scan_for_dirty_cb'
fs/ubifs/journal.c:369: warning: expecting prototype for wake_up_reservation(). Prototype was for add_or_start_queue() instead
fs/ubifs/lprops.c:1018: warning: Excess function parameter 'lst' description in 'scan_check_cb'
fs/ubifs/lprops.c:1018: warning: Function parameter or struct member 'arg' not described in 'scan_check_cb'
fs/ubifs/lpt.c:1938: warning: Function parameter or struct member 'ptr' not described in 'lpt_scan_node'
fs/ubifs/replay.c:60: warning: Function parameter or struct member 'hash' not described in 'replay_entry'
Fix them.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The UBIFS_DFS_DIR_LEN macro, which defines the maximum length of the UBIFS
debugfs directory name, has an incorrect formula and misleading comments.
The current formula is (3 + 1 + 2*2 + 1), which assumes that both UBI device
number and volume ID are limited to 2 characters. However, UBI device number
ranges from 0 to 31 (2 characters), and volume ID ranges from 0 to 127 (up
to 3 characters).
Although the current code works due to the cancellation of mathematical
errors (9 + 1 = 10, which matches the correct UBIFS_DFS_DIR_LEN value), it
can lead to confusion and potential issues in the future.
This patch aims to improve the code clarity and maintainability by making
the following changes:
1. Corrects the UBIFS_DFS_DIR_LEN macro definition to (3 + 1 + 2 + 3 + 1),
accommodating the maximum lengths of both UBI device number and volume ID,
plus the separators and null terminator.
2. Updates the snprintf calls to use UBIFS_DFS_DIR_LEN instead of
UBIFS_DFS_DIR_LEN + 1, removing the unnecessary +1.
3. Modifies the error checks to compare against UBIFS_DFS_DIR_LEN using >=
instead of >, aligning with the corrected macro definition.
4. Removes the redundant +1 in the dfs_dir_name array definitions in ubi.h
and debug.h.
While these changes do not affect the runtime behavior, they make the code
more readable, maintainable, and less prone to future errors.
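The corrected definition, annotated per the description above:

    /* "ubi" (3) + separator (1) + UBI device number (up to 2 chars) +
     * volume ID (up to 3 chars) + trailing NUL (1) */
    #define UBIFS_DFS_DIR_LEN (3 + 1 + 2 + 3 + 1)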
v2->v3:
- Removes the duplicated UBIFS_DFS_DIR_LEN and UBIFS_DFS_DIR_NAME macro
definitions in ubifs.h, as they are already defined in debug.h.
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
When selinux/encryption is enabled, the xattr entry node is added into
the TNC before the host inode when creating a new file. So it is possible
to find an xattr entry without its host inode in the TNC. The orphan
debug check is called by ubifs_orphan_end_commit(); at that time the
commit semaphore is already unlocked, so new creations won't be blocked.
Fixes: d7f0b70d30 ("UBIFS: Add security.* XATTR support for the UBIFS")
Fixes: d475a50745 ("ubifs: Add skeleton for fscrypto")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
For selinux or encryption scenarios, UBIFS could become inconsistent
while creating new files in the powercut case. Encryption/selinux related
xattrs are created before the file dentry, which makes the creation
process non-atomic. Details are shown below:
Encryption case:
ubifs_create
ubifs_new_inode
fscrypt_set_context
ubifs_xattr_set
create_xattr
ubifs_jnl_update // Disk: xentry xinode inode(LAST_OF_NODE_GROUP)
>> power cut <<
ubifs_jnl_update // Disk: dentry inode parent_inode(LAST_OF_NODE_GROUP)
Selinux case:
ubifs_create
ubifs_new_inode
ubifs_init_security
security_inode_init_security
ubifs_xattr_set
create_xattr
ubifs_jnl_update // Disk: xentry xinode inode(LAST_OF_NODE_GROUP)
>> power cut <<
ubifs_jnl_update // Disk: dentry inode parent_inode(LAST_OF_NODE_GROUP)
The above process will make chk_fs fail in the next mounting:
UBIFS error (ubi0:0 pid 7995): dbg_check_filesystem [ubifs]: inode 66
nlink is 1, but calculated nlink is 0
Fix it by allocating an orphan inode for each non-xattr file creation,
then removing it from the orphan list in the journal writing process,
which ensures that the xattr and the dentry take effect atomically when a
powercut happens.
Fixes: d7f0b70d30 ("UBIFS: Add security.* XATTR support for the UBIFS")
Fixes: d475a50745 ("ubifs: Add skeleton for fscrypto")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218309
Suggested-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
There is a potential space leak problem when a powercut happens while
linking a tmpfile: the inode node (with nlink=0) and its data nodes can
be found in the TNC (on flash), but there are no dentries related to the
inode, so the file is invisible yet consumes free space. The detailed
process is shown below:
ubifs_tmpfile
ubifs_jnl_update // Add bud A into log area
ubifs_add_orphan // Add inode into orphan list
P1 P2
ubifs_link
ubifs_delete_orphan // Delete inode from orphan list, then inode won't
// be written into orphan area, there is no chance
// to delete inode by replaying orphan.
commit // bud A won't be replayed in next mounting
>> powercut <<
ubifs_jnl_update // Link inode to dentry
The root cause is that the orphan entry deletion and the journal write
(for the link) are interrupted by a commit, which makes the two
operations non-atomic. Fix it by doing ubifs_delete_orphan() under the
protection of c->commit_sem within ubifs_jnl_update(). This is also a
preparation for creating all new files through an orphan inode.
v1 is https://lore.kernel.org/linux-mtd/20200701093227.674945-1-chengzhihao1@huawei.com/
Fixes: 32fe905c17 ("ubifs: Fix O_TMPFILE corner case in ubifs_link()")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=208405
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The host inode and its xattr are written to disk after initializing
security when creating a symlink or dev, then the host inode and its
dentry are written again in ubifs_jnl_update().
There is no need to write the inode data in the security initialization
pass; just move the ui->data initialization to after security has been
initialized.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
A tmpfile can be added into the orphan list twice: the first time at
creation, the second time at removal after it has been linked. The
orphan entry is added twice for a tmpfile if the following sequence
occurs:
ubifs_tmpfile
ubifs_jnl_update
ubifs_add_orphan // first time to add orphan entry
P1 P2
ubifs_link do_commit
ubifs_orphan_start_commit
orphan->cmt = 1
ubifs_delete_orphan
orphan_delete
if (orph->cmt)
orph->del = 1; // orphan entry is not deleted from tree
return
ubifs_unlink
ubifs_jnl_update
ubifs_add_orphan
orphan_add // found old orphan entry, second time to add orphan entry
ubifs_err(c, "orphaned twice")
return -EINVAL // unlink failed!
ubifs_orphan_end_commit
erase_deleted // delete old orphan entry
rb_erase(&orphan->rb, &c->orph_tree)
Fix it by removing the orphan entry from the orphan tree in advance,
rather than removing it from the orphan tree during the commit process.
Fixes: 32fe905c17 ("ubifs: Fix O_TMPFILE corner case in ubifs_link()")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218672
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
UBIFS does a commit at the end of the mounting process (rw mode); dead
orphans (added by insert_dead_orphan while replaying orphans) are deleted
by ubifs_orphan_end_commit(). The only reason why dead orphans are added
into the orphan list is that old orphans may be lost when a powercut
happens in ubifs_orphan_end_commit():
ubifs_orphan_end_commit // TNC(updated by orphans) is not written yet
if (c->cmt_orphans != 0)
commit_orphans
consolidate // traverse orphan list
write_orph_nodes // rewrite all orphans by ubifs_leb_change
// If dead orphans are not in list, they will be lost when powercut
// happens, then TNC won't be updated by old orphans in next mounting.
Luckily, the condition 'c->cmt_orphans != 0' will never be true during
the mounting process: no new orphans can be added into the orphan list
before mounting returns, yet a commit is done at the end of mounting.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
This reverts commit 6379b44cdc. Commit 1e022216dc ("ubifs:
ubifs_symlink: Fix memleak of inode->i_link in error path") was applied a
second time as commit 6379b44cdc (same subject), which mistakenly changed
ubifs_mknod() (it won't become a real problem). Just revert it.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Now the entire inode, together with its xattrs, is removed while
replaying orphan nodes. There is no need to add xattr inodes into the
orphan area. This is based on the fact that xattr entries won't be
cleared from disk before their xattr inodes are deleted; in other words,
the current logic makes sure that xattr inodes are deleted in all cases,
even if UBIFS does not record the xattr inode in the orphan area.
Let's look for possible paths that could clear xattr entries from
disk but leave the xattr inode in the TNC:
1. unlink/tmpfile -> ubifs_jnl_update: inode(nlink=0) is written
into bud LEB and added into orphan list, then:
a. powercut: ubifs_tnc_remove_ino(xattr entry/inode can be found
from TNC and being deleted) is invoked in replaying journal.
b. commit + powercut: inode is written into orphan area, and
ubifs_tnc_remove_ino is invoked in replaying orphan nodes.
c. evicting + powercut: xattr inode(nlink=0) is written on disk,
xattr is removed from TNC, gc could clear xattr entries from
disk. ubifs_tnc_remove_ino will apply to the inode and xattr inode
in replaying the journal, so lost xattr entries have no
influence.
d. evicting + commit + powercut: xattr inode/entry are removed from
index tree(on disk) by ubifs_jnl_write_inode, xattr inode is
cleared from orphan area by ubifs_jnl_write_inode + commit.
e. commit + evicting + powercut: inode is written into orphan area,
then equivalent to c.
2. remove xattr -> ubifs_jnl_delete_xattr: xattr entry(inum=0) and
xattr inode(nlink=0) is written into bud LEB, xattr entry/inode are
removed from TNC, then:
a. powercut: gc could clear xattr entries from disk, which won't
affect deleting xattr entry from TNC. ubifs_tnc_remove_ino will
apply on xattr inode in replaying journal, ubifs_tnc_remove_nm
will apply on xattr entry in replaying journal.
b. commit + powercut: xattr entry/inode are removed from index tree
(on disk).
Tracking xattr inodes in the orphan list was introduced by commit
988bec4131 ("ubifs: orphan: Handle xattrs like files"); it aims to fix
the similar problem described in commit 7959cf3a75 ("ubifs: journal:
Handle xattrs like files"). Actually, the problem only exists in the
journal case, not the orphan case. So we can remove the orphan tracking
for xattr inodes.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
When a powercut happens after deleting a file, the xattr inode could
exist alone in the TNC while its xattr entry cannot be found in the TNC.
The file inode and the xattr inode are added into the orphan list after
deleting the file; the file inode's nlink is 0 but the xattr inode's
nlink is not 0 (PS: a zero-nlink xattr inode is written on disk in the
evicting process by ubifs_jnl_write_inode). So, the following process
could happen:
1. touch file
2. setxattr(file)
3. unlink file
// inode(nlink=0), xattr inode(nlink=1) are added into orphan list
4. commit
// write inode inum and xattr inum into orphan area
5. powercut
6. mount
do_kill_orphans
// inode(nlink=0) is deleted from TNC by ubifs_tnc_remove_range,
// xattr entry is deleted too.
// xattr inode(nlink=1) is not deleted from TNC
Finally, we could see the following error while debugging UBIFS:
UBIFS error (ubi0:0 pid 1093): dbg_check_filesystem [ubifs]: inode 66
nlink is 1, but calculated nlink is 0
UBIFS (ubi0:0): dump of the inode 66 sitting in LEB 12:2128
node_type 0 (inode node)
group_type 1 (in node group)
len 197
key (66, inode)
size 37
nlink 1
flags 0x20
xattr_cnt 0
xattr_size 0
xattr_names 0
data len 37
Fix it by removing the entire inode with its xattrs while replaying
orphans: just replace ubifs_tnc_remove_range() with ubifs_tnc_remove_ino().
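The core of the fix in the orphan replay path, sketched:

    /* Remove the inode together with all of its xattr entries and xattr
     * inodes, instead of only the inode's own key range. */
    err = ubifs_tnc_remove_ino(c, inum);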
Fixes: ee1438ce5d ("ubifs: Check link count of inodes when killing orphans.")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218661
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Merge tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fix a regression in extent map shrinker behaviour.
In the past weeks we got reports from users that there are huge
latency spikes or freezes. This was bisected to newly added shrinker
of extent maps (it was added to fix a build up of the structures in
memory).
I'm assuming that the freezes would happen to many users after release
so I'd like to get it merged now so it's in 6.10. Although the diff
size is not small the changes are relatively straightforward, the
reporters verified the fixes and we did testing on our side.
The fixes:
- adjust behaviour under memory pressure and check lock or scheduling
conditions, bail out if needed
- synchronize tracking of the scanning progress so inode ranges are
not skipped or work duplicated
- do a delayed iput when scanning a root so evicting an inode does
not slow things down in case of lots of dirty data, also fix
lockdep warning, a deadlock could happen when writing the dirty
data would need to start a transaction"
* tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid races when tracking progress for extent map shrinking
btrfs: stop extent map shrinker if reschedule is needed
btrfs: use delayed iput during extent map shrinking
"data" actually refers to a file_lease and not a file_lock. Both structs
have their file_lock_core as the first field though, so this bug should
be harmless without struct randomization in play.
Reported-by: Florian Evers <florian-evers@gmx.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219008
Fixes: 05580bbfc6 ("nfsd: adapt to breakup of struct file_lock")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Florian Evers <florian-evers@gmx.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfs_update_folio() has code to extend a write to the entire page under
certain conditions. With the support for large folios this now
suddenly extends to the variable-sized and potentially much larger folio.
Add code to limit the extension to the page boundaries of the start and
end of the write, which matches the historic expectation and the code
comments.
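A heavily hedged sketch of the clamping; the variable names here are
illustrative only, not the actual nfs_update_folio() code:

    /* Clamp the write-extension range to the pages containing the write,
     * not the whole (possibly much larger) folio. */
    ext_start = max(ext_start, round_down(offset, PAGE_SIZE));
    ext_end   = min(ext_end, round_up(offset + count, PAGE_SIZE));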
Fixes: b73fe2dd6cd5 ("nfs: add support for large folios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Look for the udev-generated persistent device name for NVMe devices
in addition to the SCSI ones and the Red Hat-specific device mapper
name.
This is the client side implementation of RFC 9561 "Using the Parallel
NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe)
Storage Devices".
Note that the udev rules for NVMe are a bit of a mess: udev will only
create a link for the UUID if the NVMe namespace has one, and not for the
NGUID. As the current RFCs don't support UUID-based identification, this
means the layout can't be used on such namespaces out of the box. A
small tweak to the udev rules can work around this; as the real fix I
will submit a draft to the IETF NFSv4 working group to support UUID-based
identifiers for SCSI and NVMe.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Merge tag 'bcachefs-2024-07-12' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs fixes from Kent Overstreet:
- revert the SLAB_ACCOUNT patch, something crazy is going on in memcg
and someone forgot to test
- minor fixes: missing rcu_read_lock(), scheduling while atomic (in an
emergency shutdown path)
- two lockdep fixes; these could have gone earlier, but were left to
bake awhile
* tag 'bcachefs-2024-07-12' of https://evilpiepirate.org/git/bcachefs:
bcachefs: bch2_gc_btree() should not use btree_root_lock
bcachefs: Set PF_MEMALLOC_NOFS when trans->locked
bcachefs; Use trans_unlock_long() when waiting on allocator
Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT"
bcachefs: fix scheduling while atomic in break_cycle()
bcachefs: Fix RCU splat
btree_root_lock is for the root keys in btree_root, not the pointers to
the nodes themselves; this fixes a lock ordering issue between
btree_root_lock and btree node locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This reverts commit 86d81ec5f5.
This wasn't tested with memcg enabled; it immediately hits a null ptr
deref in list_lru_add().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cross-merge networking fixes after downstream PR.
Conflicts:
net/sched/act_ct.c
26488172b0 ("net/sched: Fix UAF when resolving a clash")
3abbd7ed8b ("act_ct: prepare for stolen verdict coming from conntrack and nat engine")
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"cachefiles:
- Export an existing and add a new cachefile helper to be used in
filesystems to fix reference count bugs
- Use the newly added fscache_try_get_volume() helper to get a
reference count on an fscache_volume to handle volumes that are
about to be removed cleanly
- After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN
wait for all ongoing cookie lookups to complete and for the object
count to reach zero
- Propagate errors from vfs_getxattr() to avoid an infinite loop in
cachefiles_check_volume_xattr() because it keeps seeing ESTALE
- Don't send new requests when an object is dropped by raising
CACHEFILES_ONDEMAND_OBJSTATE_DROPPING
- Cancel all requests for an object that is about to be dropped
- Wait for the ondemand_object_worker to finish before dropping a
cachefiles object to prevent use-after-free
- Use cyclic allocation for message ids to better handle id recycling
- Add missing lock protection when iterating through the xarray when
polling
netfs:
- Use standard logging helpers for debug logging
VFS:
- Fix potential use-after-free in file locks during
trace_posix_lock_inode(). The tracepoint could fire while another
task raced it and freed the lock that was requested to be traced
- Only increment the nr_dentry_negative counter for dentries that are
present on the superblock LRU. Currently, DCACHE_LRU_LIST list is
used to detect this case. However, the flag is also raised in
combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru
is used. So checking only DCACHE_LRU_LIST will lead to wrong
nr_dentry_negative count. Fix the check to not count dentries that
are on a shrink related list
Misc:
- hfsplus: fix an uninitialized value issue in copy_name
- minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even
though we switched it to kmap_local_page() a while ago"
* tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
minixfs: Fix minixfs_rename with HIGHMEM
hfsplus: fix uninit-value in copy_name
vfs: don't mod negative dentry count when on shrinker list
filelock: fix potential use-after-free in posix_lock_inode
cachefiles: add missing lock protection when polling
cachefiles: cyclic allocation of msg_id to avoid reuse
cachefiles: wait for ondemand_object_worker to finish when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: stop sending new request when dropping object
cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()
netfs: Switch debug logging to pr_debug()
We store the progress (root and inode numbers) of the extent map shrinker
in fs_info without any synchronization, but we can have multiple tasks
calling into the shrinker during memory allocations, for example when
there is enough memory pressure.
This can result in a task A reading fs_info->extent_map_shrinker_last_ino
after another task B updates it, and task A reading
fs_info->extent_map_shrinker_last_root before task B updates it, making
task A see an odd state that isn't necessarily harmful but may make it
skip certain inode ranges or do more work than necessary by going over
the same inodes again. These unprotected accesses would also trigger
warnings from tools like KCSAN.
So add a lock to protect access to these progress fields.
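A hedged sketch of the pattern (the lock name is hypothetical):

    /* Reads and updates of the progress cursor now happen under a lock. */
    spin_lock(&fs_info->extent_map_shrinker_lock);
    ino  = fs_info->extent_map_shrinker_last_ino + 1;
    root = fs_info->extent_map_shrinker_last_root;
    spin_unlock(&fs_info->extent_map_shrinker_lock);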
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent map shrinker can be called in a variety of contexts where we
are under memory pressure, and one of them is when a task is trying to
allocate memory. For this reason the shrinker is typically called with a
value of struct shrink_control::nr_to_scan that is much smaller than what
we return in the nr_cached_objects callback of struct super_operations
(fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does
not take a long time and cause high latencies. However we can still take
a lot of time in the shrinker even for a limited amount of nr_to_scan:
1) When traversing the red black tree that tracks open inodes in a root,
as for example with millions of open inodes we get a deep tree which
takes time searching for an inode;
2) Iterating over the extent map tree, which is a red black tree, of an
inode when doing the rb_next() calls and when removing an extent map
from the tree, since often that requires rebalancing the red black
tree;
3) When trying to write lock an inode's extent map tree we may wait for a
significant amount of time, because there's either another task about
to do IO and searching for an extent map in the tree or inserting an
extent map in the tree, and we can have thousands or even millions of
extent maps for an inode. Furthermore, there can be concurrent calls
to the shrinker so the lock might be busy simply because there is
already another task shrinking extent maps for the same inode;
4) We often reschedule if we need to, which further increases latency.
So improve on this by stopping the extent map shrinking code whenever we
need to reschedule and make it skip an inode if we can't immediately lock
its extent map tree.
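A hedged sketch of both behaviours inside the shrinking loop:

    /* Stop shrinking as soon as a reschedule is needed... */
    if (need_resched())
            break;

    /* ...and skip an inode whose extent map tree is busy rather than
     * blocking on it (the tree lock is an rwlock). */
    if (!write_trylock(&em_tree->lock))
            continue;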
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the deprecated[1] use of a 1-element array in
struct slink_front with a modern flexible array.
No binary differences are present after this conversion.
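The shape of such a conversion (field layout abbreviated and illustrative,
not the exact AFFS definition):

    struct slink_front {
            __be32 primary_type;
            __be32 key;
            __be32 spare[3];
            __be32 checksum;
            u8     symname[];   /* was: u8 symname[1]; */
    };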
Link: https://github.com/KSPP/linux/issues/79 [1]
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the deprecated[1] use of a 1-element array in
struct affs_data_head with a modern flexible array.
No binary differences are present after this conversion.
Link: https://github.com/KSPP/linux/issues/79 [1]
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
AFFS uses struct affs_head's "table" array as a flexible array. Switch
this to a proper flexible array[1]. There are no sizeof() uses; struct
affs_head is only ever used via direct casts. No binary output
differences were found after this change.
Link: https://github.com/KSPP/linux/issues/79 [1]
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we failed to link a free space entry because there's already a
conflicting entry for the same offset, we free the free space entry but
we don't free the associated bitmap that we had just allocated before.
Fix that by freeing the bitmap before freeing the entry.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we had a BUG_ON() inside extent_range_clear_dirty_for_io(), as
we expected all involved folios to be still locked, thus no folio should be
missing.
However for extent_range_clear_dirty_for_io() itself, we can skip the
missing folio and handle the remaining ones, and return an error if
there is anything wrong.
Remove the BUG_ON() and let the caller handle the error.
In the caller we do not have a quick way to clean up on error, but all
the compression routines would handle a missing folio as an error and
properly error out, so we only need an ASSERT() for developers,
while for non-debug builds the compression routines handle the
error correctly.
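A hedged sketch of the new behaviour, propagating an error instead of
crashing:

    folio = filemap_get_folio(mapping, index);
    if (IS_ERR(folio)) {
            /* Missing folio: remember the error and keep processing the
             * remaining range (this used to be a BUG_ON()). */
            ret = PTR_ERR(folio);
            continue;
    }
    folio_clear_dirty_for_io(folio);
    folio_put(folio);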
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is only used inside inode.c by compress_file_range(),
so move it to inode.c and unexport it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add more verbose and specific messages to all main error points in
compression code for all algorithms. Currently there's no way to know
which inode is affected or where in the data errors happened.
The messages follow a common format:
- what happened
- error code if relevant
- root and inode
- additional data like offsets or lengths
There's no helper for the messages as they differ in some details and
it would be cumbersome to generalize them into a single function. As all
the errors are of the "almost never happens" kind, the unlikely()
annotations are used, since compression is a hot path.
Signed-off-by: David Sterba <dsterba@suse.com>
KCSAN complains about a data race when accessing the last_trans field of a
root:
[ 199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
[ 199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
[ 199.555210] btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
[ 199.555999] start_transaction+0x154/0xcd8 [btrfs]
[ 199.556780] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.557559] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.558339] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.559123] touch_atime+0x16c/0x1e0
[ 199.559151] pipe_read+0x6a8/0x7d0
[ 199.559179] vfs_read+0x466/0x498
[ 199.559204] ksys_read+0x108/0x150
[ 199.559230] __s390x_sys_read+0x68/0x88
[ 199.559257] do_syscall+0x1c6/0x210
[ 199.559286] __do_syscall+0xc8/0xf0
[ 199.559318] system_call+0x70/0x98
[ 199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
[ 199.559464] record_root_in_trans+0x196/0x228 [btrfs]
[ 199.560236] btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
[ 199.561097] start_transaction+0x154/0xcd8 [btrfs]
[ 199.561927] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.562700] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.563493] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.564277] file_update_time+0xb8/0xf0
[ 199.564301] pipe_write+0x8ac/0xab8
[ 199.564326] vfs_write+0x33c/0x588
[ 199.564349] ksys_write+0x108/0x150
[ 199.564372] __s390x_sys_write+0x68/0x88
[ 199.564397] do_syscall+0x1c6/0x210
[ 199.564424] __do_syscall+0xc8/0xf0
[ 199.564452] system_call+0x70/0x98
This is because we update and read last_trans concurrently without any
type of synchronization. This should be generally harmless; in the
worst case it can make us do extra locking (btrfs_record_root_in_trans()),
trigger some warnings at ctree.c, or do extra work during relocation -
this would probably only happen in case of load or store tearing.
So fix this by always reading and updating the field using READ_ONCE()
and WRITE_ONCE(), this silences KCSAN and prevents load and store tearing.
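Sketch of the accessors this implies (helper names hedged):

    static inline u64 btrfs_get_root_last_trans(const struct btrfs_root *root)
    {
            return READ_ONCE(root->last_trans);
    }

    static inline void btrfs_set_root_last_trans(struct btrfs_root *root,
                                                 u64 transid)
    {
            WRITE_ONCE(root->last_trans, transid);
    }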
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array(), and in that case extra_gfp is only ever assigned
__GFP_NOFAIL.
Rename the @extra_gfp parameter to @nofail to indicate that.
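The resulting signature, sketched:

    /* was: ... struct folio **folio_array, gfp_t extra_gfp); */
    int btrfs_alloc_folio_array(unsigned int nr_folios,
                                struct folio **folio_array, bool nofail);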
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>