the UTF-8 output code was written assuming an invariant that iconv's
decoders only emit valid Unicode Scalar Values which wctomb can encode
successfully, thereby always returning a value between 1 and 4.
if this invariant is not satisfied, wctomb returns (size_t)-1, and the
subsequent adjustments to the output buffer pointer and remaining
output byte count overflow, moving the output position backwards,
potentially past the beginning of the buffer, without storing any
bytes.
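a minimal sketch of the kind of guard this implies, using hypothetical names (emit_utf8, out, outb) rather than musl's actual iconv internals:
    #include <errno.h>
    #include <limits.h>
    #include <stdlib.h>
    #include <string.h>
    /* sketch only: never adjust the output position from an unchecked
       wctomb result */
    static size_t emit_utf8(wchar_t c, char **out, size_t *outb)
    {
        char tmp[MB_LEN_MAX];
        int k = wctomb(tmp, c);
        if (k < 0) {                 /* not an encodable scalar value */
            errno = EILSEQ;
            return (size_t)-1;
        }
        if ((size_t)k > *outb) {     /* no room left in the output buffer */
            errno = E2BIG;
            return (size_t)-1;
        }
        memcpy(*out, tmp, k);
        *out += k;
        *outb -= k;
        return 0;
    }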
the man page for this nonstandardized function has historically
documented it as scanning for a substring; however, this is
functionally incorrect (matches the substring "atime" in the "noatime"
option, for example) and differs from other existing implementations.
with the change made here, it should match glibc and other
implementations, only matching whole options delimited by commas or
separated from a value by an equals sign.
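for illustration, whole-option matching under these rules looks roughly like the following sketch (the function name and exact structure are not musl's):
    #include <string.h>
    /* match opt only when it is a whole option: preceded by the start of the
       string or a comma, and followed by end of string, a comma, or '=' */
    static char *hasopt_sketch(const char *opts, const char *opt)
    {
        size_t l = strlen(opt);
        for (const char *p = opts; (p = strstr(p, opt)); p++) {
            if ((p == opts || p[-1] == ',') &&
                (p[l] == 0 || p[l] == ',' || p[l] == '='))
                return (char *)p;
        }
        return 0;
    }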
as a result of incorrect bounds checking on the lead byte being
decoded, certain invalid inputs which should produce an encoding
error, such as "\xc8\x41", instead produced out-of-bounds loads from
the ksc table.
in a worst case, the loaded value may not be a valid unicode scalar
value, in which case, if the output encoding was UTF-8, wctomb would
return (size_t)-1, causing an overflow in the output pointer and
remaining buffer size which could clobber memory outside of the output
buffer.
bug report was submitted in private by Nick Wellnhofer on account of
potential security implications.
out-of-range second bytes were not handled, leading to wrong character
output rather than a reported encoding error.
fix based on bug report by Nick Wellnhofer, submitted in private in
case the issue turned out to have security implications.
Calling __tls_get_addr with brasl is not valid since it's a global symbol; doing
so results in an R_390_PC32DBL relocation error from lld. We could fix this by
marking __tls_get_addr hidden since it is not part of the s390x ABI, or by using
a different instruction. However, given its simplicity, it makes more sense to
just manually inline it into __tls_get_offset for performance.
The patch has been tested by applying to Zig's bundled musl copy and running the
full Zig test suite under qemu-s390x.
Some weird linkers may emit PT_LOAD segments with memsz = 0. The ELF
specification does not forbid this, but such a segment with a non-zero
p_vaddr will result in reclaiming an invalid memory address.
This patch skips such segments during reclaiming for better
compatibility.
we have the cpuset macros call calloc/free/memset/memcmp directly so
that they don't depend on any further ABI surface. this is not
namespace-clean, but only affects the _GNU_SOURCE feature profile,
which is not intended to be namespace-clean. nonetheless, reports come
up now and then of things which are gratuitously broken, usually when
an application has wrapped malloc with macros.
this patch parenthesizes the function names so that function-like
macros will not be expanded, and removes the unused declaration of
memcpy. this is not a complete solution, but it should improve things
for affected applications, particularly ones which are not even trying
to use the cpuset interfaces which got them just because g++ always
defines _GNU_SOURCE.
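for illustration (the wrapper macro below is hypothetical, not anything in musl), a parenthesized name is not followed directly by '(' and so is not subject to function-like macro expansion:
    #include <stdlib.h>
    /* an application's hypothetical wrapper macro: */
    #define malloc(n) debug_malloc(n, __FILE__, __LINE__)
    /* (malloc)(n) bypasses the macro and calls the real malloc */
    static void *get_buf(size_t n) { return (malloc)(n); }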
the kernel mq_attr structure has 8 64-bit longs instead of 8 32-bit
longs.
it's not clear that this is the nicest way to implement the fix, but
the concept (translation) is right, and the details can be changed
later if desired.
previously, we left any changes made by the application to the timer
thread's signal mask active when resetting the thread state for reuse.
not only did this violate the intended invariant that timer threads
start with all signals blocked; it also allowed application code to
execute in a thread that, formally, did not exist. and further, if the
internal SIGTIMER signal became unblocked, it could also lead to
missed timer expiration events.
commit 6ae2568bc2 introduced a fatal
signal condition if the internal timer signal used for SIGEV_THREAD
timers is unblocked. this can happen whenever the application alters
the signal mask with SIG_SETMASK, since sigset_t objects never include
the bits used for implementation-internal signals.
this patch effectively reverts the breakage by adding back a no-op
signal handler.
overruns will not be accounted if the timer signal becomes unblocked,
but POSIX does not specify them except for SIGEV_SIGNAL timers anyway.
The LLVM assembler reportedly assembles the form using the j mnemonic
incorrectly (see issue 107460). The jr form is canonical and avoids
this problem, so use it instead.
When the pattern was changed from matching any whitespace to just
matching spaces and tabs, a newline started being appended to the
value of the matched field, if that field was a string. For example,
in a 4-field line, the mnt_opts field would have a newline on the end.
This happened because a newline is not a space or a tab, and so was
matched as part of the value before the end of the string was reached.
\n should therefore be added as a character that terminates a value.
This shouldn't interfere with the intention of the change to space and
tab only, as that change was trying to make sure that other whitespace,
like carriage returns, which should have been part of parsed values, was
included in them.
Fixes: f314e133
The instruction encoding that would be "br %r0" is not actually a
branch to r0, but instead a nop/memory-barrier. gcc 14 has been found
to choose r0 for the "r"(pc) constraint, breaking CRTJMP.
This patch adjusts the inline assembly constraints and marks "pc" as
address ("a"), which disallows usage of r0.
commit 8cca79a72c added use of SYS_pause
to exit() without accounting for newer archs omitting the syscall.
use the newly-added __sys_pause abstraction instead, which uses
SYS_ppoll when SYS_pause is missing.
newer archs lack the syscall. the pause() function accounted for this
with its own #ifdef, but that didn't allow use of the syscall directly
elsewhere, so move the logic to macros in src/internal/syscall.h where
it can be shared.
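the shared macro looks roughly like the sketch below; the exact definitions in src/internal/syscall.h (including a cancellable variant) may differ:
    #ifdef SYS_pause
    #define __sys_pause() __syscall(SYS_pause)
    #else
    #define __sys_pause() __syscall(SYS_ppoll, 0, 0, 0, 0)
    #endif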
commit b817541f1c introduced statx with
a fallback using fstatat, but failed to fill in stx_rdev_major/minor
and stx_attributes[_mask]. the rdev omission has been addressed
separately. rather than explicitly zeroing the attributes and their
mask, pre-fill the entire structure with zeros. this will also cover
the padding adjacent to stx_mode, in case it's ever used in the
future.
explicit zeroing of stx_btime is removed since, with this change, it
will already be pre-zeroed. as an aside, zeroing it was not strictly
necessary, since STATX_BASIC_STATS does not include STATX_BTIME and
thus does not indicate any validity for it.
The current implementation of the statx function fails to set the
values of stx->stx_rdev_major and stx->stx_rdev_minor if the statx
syscall fails with ENOSYS and thus the statx function has to fall back
on fstatat-based emulation.
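Schematically, the missing part of the fallback amounts to the following sketch (the local names are assumptions, not the actual musl source):
    #define _GNU_SOURCE
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    /* propagate device numbers from the fstatat result into struct statx */
    static void fill_rdev(struct statx *stx, const struct stat *st)
    {
        stx->stx_rdev_major = major(st->st_rdev);
        stx->stx_rdev_minor = minor(st->st_rdev);
    }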
the value placed in the aux vector AT_MINSIGSTKSZ by the kernel is
purely the signal frame size, and does not include any execution space
for the signal handler. this is contrary to the POSIX definition of
MINSIGSTKSZ to be a value that can actually execute at least some
minimal signal handler, and contrary to the historical definitions of
MINSIGSTKSZ which had at least 1k of headroom.
commit 996b6154b2 added support for
querying the dynamic limit but did not enforce it in sigaltstack. the
kernel also does not seem to reliably enforce it, or at least does not
necessarily enforce the same limit exposed to userspace, so it needs
to be enforced here.
internally, printf always works with the maximal-size supported
integer and floating point formats. however, the space needed to
format a floating point number is proportional to the mantissa and
exponent ranges. on archs where long double is larger than double,
knowing that the actual value fit in double allows us to use a much
smaller buffer, roughly 1/16 the size.
as a bonus, making the working buffer a VLA whose dimension depends on
the format specifier prevents the compiler from lifting the stack
adjustment to the top of printf_core. this makes it so printf calls
without floating point arguments do not waste even the smaller amount
of stack space needed for double, making it much more practical to use
printf in tightly stack-constrained environments.
linux puts hung-up ttys in a state where ioctls produce EIO, and may
do the same for other types of devices in error or shutdown states.
such an error clearly does not mean the device is not a tty, but it
also can't reliably establish that the device is a tty, so the only
safe thing to do seems to be reporting the error. programs that don't
check errno will conclude that the device is not a tty, which is no
different from what happens now, but at least they gain the option to
differentiate between the cases.
commit c84971995b introduced the errno
collapsing behavior, but prior to that, errno was not set at all by
isatty.
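a hedged usage sketch of what this enables for callers (the helper name is made up):
    #include <errno.h>
    #include <unistd.h>
    /* 1: tty, 0: not a tty, -1: could not be determined (e.g. EIO on a
       hung-up tty) */
    static int tty_status(int fd)
    {
        errno = 0;
        if (isatty(fd)) return 1;
        return (errno == 0 || errno == ENOTTY) ? 0 : -1;
    }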
this is purely a readability change, not a functional one. all of the
integer format cases use a common tail for handling precision logic
after the string representation of the number has been generated. the
code as I originally wrote it was overly clever in the aim of making a
point that the flow could be done without goto, and jumped over
intervening cases by wrapping them in if (0) { }, with the case labels
for each inside the conditional block scope.
this has been a perpetual source of complaints about the readability
and comprehensibility of the file, so I am now changing it to
explicitly jump to the tail logic with goto statements.
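for readers unfamiliar with the construct, an illustrative (non-musl) reduction of the old pattern:
    /* the old style reached a shared tail by nesting later cases inside
       if (0) { }, so control "fell into" the tail without a goto */
    int old_style(int t)
    {
        int n = 0;
        switch (t) {
        case 'd':
            n = 10;
            if (0) {
        case 'o':
            n = 8;
            }
            n += 100;   /* shared tail, e.g. precision handling */
            break;
        }
        return n;
    }
the change replaces this structure with ordinary case bodies that end in an explicit goto to a labeled tail.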
this is the same as cp850, but with the euro symbol replacing the
lowercase dotless i at 0xd5. it is significant because it's used by
thermal receipt printers.
the comment does not match the required or actual behavior when x<0
and y is not an integer. while it could be corrected, the role of
comments here is to tell about characteristics unique to the
implementation, not to restate the requirements of the standard, so
just removing it seems best.
while not the only error codes presently omitted, these two are
particularly likely to be encountered in the wild.
EUCLEAN is used by linux filesystem and device drivers to report
filesystem structure corruption or data corruption.
ENAVAIL is used by some linux drivers to indicate non-availability of
a resource.
both names are new inventions to correspond to how they are actually
used, as the original kernel strings ("Structure needs cleaning" and
"No XENIX semaphores available") are not remotely meaningful or
reasonable.
the file-level crt_arch.h asm fragments generally make direct
(non-PLT) calls from _start to _start_c, which is only valid when
there is a local, non-interposable definition for _start_c. generally,
the linker is expected to know that local definitions in a main
executable (as opposed to shared library) output are non-interposable,
making this work, but historically there have been linker bugs in this
area, and microblaze is reportedly still broken, flagging the
relocation for the call as a textrel.
the equivalent _dlstart_c, called from the same crt_arch.h asm
fragments, has always used hidden visibility without problem, and
semantically it should be hidden, so make it hidden. this ensures the
direct call is always valid regardless of whether the linker properly
special-cases main executable output.
if sem_post is interrupted between clearing the waiters bit from the
semaphore value and performing the futex wait operation, subsequent
calls to sem_post will not perform a wake operation unless a new
waiter has arrived.
usually, this is at most a minor nuisance, since the original wake
operation will eventually happen. however, it's possible that the wake
is delayed indefinitely if interrupted by a signal handler, or that
the address the wake needs to be performed on is no longer mapped if
the semaphore was a process-shared one that has since been unmapped
but has a waiter on a different mapping of the same semaphore. this
can happen when another thread using the same mapping "steals the
post" atomically before actually becoming a second waiter, deduces
from success that it was the last user of the semaphore mapping, then
re-posts and unmaps the semaphore mapping. this scenario was described
in a report by Markus Wichmann.
instead of checking only the waiters bit, also check the waiter count
that was sampled before the atomic post operation, and perform the
wake if it's nonzero. this will not produce any additional wakes under
non-race conditions, since the waiters bit only becomes zero when
targeting a single waiter for wake. checking both was already the
behavior prior to commit 159d1f6c02.
our pthread barrier implementation reportedly has bugs that could
lead to malfunction or crash in timer_create. while this has not been
reviewed to confirm, there have been past reports of pthread barrier
bugs, and it seems likely that something is actually wrong.
pthread barriers are an obscure primitive, and timer_create is the
only place we are using them internally at present. even if they were
working correctly, this means we are imposing linking of otherwise
likely-dead code whenever timer_create is used.
a pair of semaphores functions identically to a 2-waiter barrier
except for destruction order properties. since the parent is
responsible for the lifetime of the argument structure (including the
semaphores), the last operation on them in the timer thread must be
posting to the parent.
previously, global dtors, which are executed after all atexit handlers
have been called rather than being implemented as an atexit handler
themselves, would deadlock if they called atexit.
it was intentional to disallow adding more atexit handlers past the
last point where they would be executed, since a successful return
from atexit imposes a contract that the handler will be executed, but
this was only considered in the context of calls to atexit from other
threads, not calls from the dtors.
to fix this, release the lock after the exit handlers loop completes,
but set a flag first so that we can make all future calls to
atexit return a failure code.
per the C and POSIX standards, calling exit "more than once",
including via return from main, produces undefined behavior. this
language predates threads, and at the time it was written, could only
have applied to recursive calls to exit via atexit handlers. C++
likewise makes calls to exit from global dtors undefined. nonetheless,
by the present specification as written, concurrent calls to exit by
multiple threads also have undefined behavior.
originally, our implementation of exit did have locking to handle
concurrent calls safely, but that was changed in commit
2e55da9118 based on it being undefined.
from a standpoint of both hardening and quality of implementation,
that change seems to have been a mistake.
this change adds back locking, but with awareness of the lock owner so
that recursive calls to exit can be trapped rather than deadlocking.
this also opens up the possibility of allowing recursive calls to
succeed, if future consensus ends up being in favor of that.
prior to this change, exit already behaved partly as if protected by a
lock as long as atexit was linked, but multiple threads calling exit
could concurrently "pop off" atexit handlers and execute them in
parallel with one another rather than serialized in the reverse order
of registration. this was a likely unnoticed but potentially very
dangerous manifestation of the undefined behavior. if on the other
hand atexit was not linked, multiple threads calling exit concurrently
could each run their own instance of global dtors, if any, likely
producing double-free situations.
now, if multiple threads call exit concurrently, all but the first
will permanently block (in SYS_pause) until the process terminates,
and all atexit handlers, global dtors, and stdio flushing/position
consistency will be handled in the thread that arrived first. this is
really the only reasonable way to define concurrent calls to exit. it
is not recommended usage, but may become so in the future if there is
consensus/standardization, as there is a push from the rust language
community (and potentially other languages interoperating with the C
runtime) to make concurrent calls to the language's exit interfaces
safe even when multiple languages are involved in a program, and this
is only possible by having the locking in the underlying C exit.
commit 895736d49b made these changes
along with fixing a real bug in LOG_MAKEPRI. based on further
information, they do not seem to be well-motivated or in line with
policy.
the result of LOG_FAC is not a meaningful facility value if we shift
it down like before, but apparently the way it is used by applications
is as an index into an array of facility names. moreover, all
historical systems which define it do so with the shift. as it is a
nonstandard interface, there is no justification for providing a macro
by the same name that is incompatible with historical practice.
the value of LOG_FACMASK likewise is 0x3f8 on all historical systems
checked. while only 5 bits are used for existing facility codes, the
convention seems to be that all 7 bits belong to the facility field
and theoretically could be used to expand to having more facilities.
that seems unlikely to happen, but there is no reason to make a
gratuitously incompatible change here.
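for reference, the historical definitions (as found in glibc's <sys/syslog.h>) that this change matches:
    #define LOG_FACMASK 0x3f8                       /* mask to extract facility part */
    #define LOG_FAC(p)  (((p) & LOG_FACMASK) >> 3)  /* facility of pri, usable as an array index */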
Per RFC 5952, ties for longest sequence of zero fields must be broken
by choosing the earliest, but the implementation put the leading
sequence of zeros at a disadvantage. That's because for example when
compressing "0:0:0:10:0:0:0:10" the strspn(buf+i, ":0") call returns 6
for the first sequence and 7 for the second one – the second sequence
has the benefit of a leading colon.
Changing the condition to require beating the leading sequence by not
one but two characters resolves the issue.
while commit 53ac44ff4c fixed the temp
buffer being undersized, the use of a temp buffer to begin with was a
mistake. instead, compare the requested symbol name in-place and use
the already-null-terminated copy of the name without "64" present in
lfs64_list[] to look up the real symbol.
add two ioctls to get and set struct epoll_params to allow users to
control epoll based busy polling of network sockets.
added to uapi in commit 18e2bf0edf4dd88d9656ec92395aa47392e85b61 (Linux
kernel 6.9 and newer).
this interface does not have a lot of historical consensus on how it
handles the contents of the /etc/shells file in regard to whitespace
and comments, but the commonality between all checked is that they
ignore lines that are blank or that begin with '#', so that is the
behavior we adopt.
these are nonstandard and unnecessary for using the associated
functionality, but resulted in applications that used them
malfunctioning.
patch based on proposed fix by erny hombre.
This syscall is available since Linux 3.15 and also implemented in
glibc from version 2.28. It is commonly used in filesystem or security
contexts.
Constants RENAME_NOREPLACE, RENAME_EXCHANGE, RENAME_WHITEOUT are
guarded by _GNU_SOURCE as with glibc.
commit 1b0d48517f wrongly copied the
getdents return type of int rather than matching the ssize_t used by
posix_getdents. this was overlooked in testing on 32-bit archs but
obviously broke 64-bit archs.
without explicit alignment directives, whether they end up at the
necessary alignment depends on linker/linking conditions. initially
reported as mold issue 1255.
this interface was added as the outcome of Austin Group tracker issue
697. no error is specified for unsupported flags, which is probably an
oversight. for now, EOPNOTSUPP is used so as not to overload EINVAL.
the bits file is retained, but as a single generic version, to allow
for the unlikely future possibility of letting a new arch define
something differently.
previously, only a few archs defined it here. this change makes the
presence consistent across all archs, and reduces the amount of header
duplication (and potential for future inconsistency) between archs.
this change is purely to document that they are the same in
preparation to remove the arch-specific headers for these archs and
replace them with a generic version that matches riscv32 and can be
shared by these and all future archs.
commit f47a8cdd25 introduced an
alternate mechanism for access to runtime page size for compatibility
with early stages of dynamic linking, but because pthread_impl.h
indirectly includes libc.h, the condition #ifndef PAGE_SIZE was never
satisfied.
rather than depend on order of inclusion, use the (baseline POSIX)
macro PAGESIZE, not the (XSI) macro PAGE_SIZE, to determine whether
page size is dynamic. our internal libc.h only provides a dynamic
definition for PAGE_SIZE, not for PAGESIZE.
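a simplified sketch of the resulting dispatch (not the literal header contents; libc.page_size is musl's internal dynamic value):
    #include <limits.h>
    #ifdef PAGESIZE
    #define PAGE_SIZE PAGESIZE        /* arch has a constant page size */
    #else
    #define PAGE_SIZE libc.page_size  /* dynamic, provided by internal libc.h */
    #endif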
the %s conversion is added as the outcome of Austin Group tracker
issue 169 and its unspecified behavior is clarified as the outcome of
issue 1727.
the %F, %g, %G, %u, %V, %z, and %Z conversions are added as the
outcome of Austin Group tracker issue 879 for alignment with strftime
and the behaviors of %u, %z, and %Z are defined as the outcome of
issue 1727.
at this time, the conversions with unspecified effects on struct tm
are all left as parse-only no-ops. this may be changed at a later
time, particularly for %s, if there is reasonable cross-implementation
consensus outside the standards process on what the behavior should
be.
once the remaining value is less than 10, the modulo operation to
produce the final digit and division to prepare for next loop
iteration can be dropped. this may be a meaningful performance
distinction when formatting low-magnitude numbers in bulk, and should
never hurt.
based on patch by Viktor Reznov.
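a sketch of the shape of the optimization (the real function operates on uintmax_t and differs in detail):
    /* writes decimal digits of x ending just before s (one past the end of
       the caller's buffer) and returns a pointer to the first digit */
    static char *fmt_u_sketch(unsigned long long x, char *s)
    {
        for (; x >= 10; x /= 10) *--s = '0' + x % 10;
        *--s = '0' + x;   /* final digit: no % or / needed once x < 10 */
        return s;
    }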
historically linux limited the number of supplementary groups a
process could be in to 32, but this limit was raised to 65536 in linux
2.6.4. proposals to support the new limit, change NGROUPS_MAX, or make
it dynamic have been stalled due to the impact it would have on
initgroups where the groups array exists in automatic storage.
the changes here decouple initgroups from the value of NGROUPS_MAX and
allow it to fall back to allocating a buffer in the case where
getgrouplist indicates the user has more supplementary groups than
could be reported in the buffer. getgrouplist already involves
allocation, so this does not pull in any new link dependency.
likewise, getgrouplist is already using the public malloc (vs internal
libc one), so initgroups does the same. if this turns out not to be
the best choice, both can be changed together later.
the initial buffer size is left at 32, but now as the literal value,
so that any potential future change to NGROUPS_MAX will not affect
initgroups.
commit cfa0a54c08 attempted to fix
rounding on archs where long double is not 80-bit (where LDBL_MANT_DIG
is not zero mod four), but failed to address the edge case where
rounding was skipped because LDBL_MANT_DIG/4 rounded down in the
comparison against the requested precision.
the rounding logic based on hex digit count is difficult to understand
and not well-motivated, so rather than try to fix it, replace it with
an explicit calculation in terms of number of bits to be kept, without
any truncating division operations. based on patch by Peter Ammon, but
with scalbn to apply the rounding exponent since the value will not
generally fit in any integer type. scalbn is used instead of scalbnl
to avoid pulling in the latter unnecessarily, since the value is an
exact power of two whose exponent range is bounded by LDBL_MANT_DIG, a
small integer.
The principal expressions defining acosh and acos are such that
acosh(z) = ±i acos(z)
where the + is only true on the Im(z)>0 half of the complex plane
(and partly on Im(z)==0 depending on number representation).
fix the comment without expanding on the details.
POSIX requires pwrite to honor the explicit file offset where the
write should take place even if the file was opened as O_APPEND.
however, linux historically defined the pwrite syscall family as
honoring O_APPEND. this cannot be changed on the kernel side due to
stability policy, but the addition of the pwritev2 syscall with a
flags argument opened the door to fixing it, and linux commit
73fa7547c70b32cc69685f79be31135797734eb6 adds the RWF_NOAPPEND flag
that lets us request a write honoring the file offset argument.
this patch changes the pwrite function to first attempt using the
pwritev2 syscall with RWF_NOAPPEND, falling back to using the old
pwrite syscall only after checking that O_APPEND is not set for the
open file. if O_APPEND is set, the operation fails with EOPNOTSUPP,
reflecting that the kernel does not support the correct behavior. this
is an extended error case needed to avoid the wrong behavior that
happened before (writing the data at the wrong location), and is
aligned with the spirit of the POSIX requirement that "An attempt to
perform a pwrite() on a file that is incapable of seeking shall result
in an error."
since the pwritev2 syscall interprets the offset of -1 as a request to
write at the current file offset, it is mapped to a different negative
value that will produce the expected error.
pwritev, though not governed by POSIX at this time, is adjusted to
match pwrite in honoring the offset.
added in linux kernel commit 73fa7547c70b32cc69685f79be31135797734eb6.
this is added now as a prerequisite for fixing pwrite/pwritev behavior
for O_APPEND files.
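a hedged illustration of the kernel interface involved, written against the public pwritev2 wrapper rather than musl's internal syscall plumbing (the RWF_NOAPPEND fallback definition is an assumption for older headers):
    #define _GNU_SOURCE
    #include <sys/uio.h>
    #ifndef RWF_NOAPPEND
    #define RWF_NOAPPEND 0x00000020   /* from the linux uapi headers */
    #endif
    /* write at the given offset even if fd was opened with O_APPEND;
       requires linux 6.9 or newer */
    static ssize_t write_at(int fd, const void *buf, size_t len, off_t off)
    {
        struct iovec iov = { (void *)buf, len };
        return pwritev2(fd, &iov, 1, off, RWF_NOAPPEND);
    }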
the jis0208 table we use is only 84x94 in size, but the shift_jis
encoding supports a 94x94 grid. attempts to convert sequences outside
of the supported zone resulted in out-of-bounds table reads,
misinterpreting adjacent rodata as part of the character table and
thereby converting these sequences to unexpected characters.
this is not needed, but may act as a hint to the compiler, and also
serves to suppress unused function warnings if enabled (on by default
since commit 86ac0f7947).
this is how it's defined in the cp936 document referenced by the IANA
charset registry as defining GBK, and of the mappings defined there,
was the only one missing.
it is not accepted for GB18030, as GB18030 is a UTF and has its own
unique mapping for the euro symbol.
- add mount_setattr from linux v5.12
- add epoll_pwait2 from linux v5.11
- add process_madvise from linux v5.10
- add __NR_faccessat2 from linux v5.8
- add pidfd_getfd and openat2 syscall numbers from linux v5.6
- add clone3 syscall number from linux v5.3
- add process_mrelease from linux v5.15
- add futex_waitv from linux v5.16
- add set_mempolicy_home_node from linux v5.17
- add cachestat from linux v6.4
- add __NR_fchmodat2 from linux v6.6
despite riscv32 being natively time64, the IPC_TIME64 bit (0x100) is
set in IPC_STAT and derived command macros, differentiating their
values from the raw command values used to interface with the kernel.
this reflects that the kernel ipc structure types are not natively
time64, but have broken-down hi/lo fields that cannot be used in-place
and require translation, and that the userspace struct types differ
from the kernel types (relevant to things like strace).
These are mostly copied from riscv64. _Addr and _Reg had to become int
to match compiler-controlled parts of the ABI (result type of sizeof,
etc.). There is no kernel stat struct; the userspace stat matches
glibc in the sizes and offsets of all fields (including glibc's
__dev_t __pad1). The jump buffer is 12 words larger to account for 12
saved double-precision floats; additionally it should be 64-bit
aligned to save doubles.
The syscall list was significantly revised by deleting all time32 and
pre-statx syscalls, and renaming several syscalls that have different
names depending on __BITS_PER_LONG, notably mmap2 and _llseek.
futex was added as an alias to futex_time64 since it is widely used by
software which does not pass time arguments.
__res_send returns the full answer length even if it didn't fit the
buffer, but __dns_parse expects the length of the filled part of the
buffer.
This is analogous to commit 77327ed064,
which fixed the only other __dns_parse call site.
A child process created by posix_spawn reports errors to its parent via
a pipe, retrying infinitely on any write error to prevent falsely
reporting success. If the (original) parent dies before write is
attempted, there is nobody to report to, but the child will remain
stuck in the write loop forever if SIGPIPE is blocked or ignored.
Fix this by not retrying write if it fails with EPIPE.
user_regs_struct and user_fp_struct were missing from the initial
commit of the port.
the union type for elf_fpreg_t and the new value of ELF_NFPREG are
made consistent with glibc.
originally, compilers did not provide these macros and we had to
provide them ourselves. this meant we were redefining them, which was
technically invalid unless the token sequence of the original
definition matched exactly.
the original patch proposed by Jules Maselbas to fix this made the
definitions conditional on them not already being defined; however I
suggested using #undef to avoid any possibly-wrong definitions already
in place and ensure that the macros are defined to 1. the version adopted as
commit 8b70486807 made this change.
unfortunately, gcc is loud about not liking #undef of any __STDC_*
macro name, and while warnings are suppressed in the system include
path, there is apparently no way to suppress this warning if the
system include dir has also been provided via -I.
while normally we don't go out of our way to satisfy warnings over
style in the public headers, in this case, it seems to be a matter of
disagreement over contract of which part of "the implementation" is
entitled to define or undefine macros belonging to the implementation,
and it's quite reasonable to conclude that the compiler may reject
attempts to undefine them.
this commit reverts to the originally-submitted version of the patch
making the definitions conditional.
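the conditional form being restored looks roughly like this (the exact header and surrounding context in musl may differ):
    #ifndef __STDC_UTF_16__
    #define __STDC_UTF_16__ 1
    #endif
    #ifndef __STDC_UTF_32__
    #define __STDC_UTF_32__ 1
    #endif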
this code dates back to the original commit of the sh port, with no
real clue as to how the bug was introduced. it looks like it was
written to assume the return address was pushed to the stack like on
x86, rather than arriving in the pr special register.
commit 0dc4824479 worked around the lack of a flags argument in the
fchmodat syscall.
linux 6.6 introduced a new syscall, SYS_fchmodat2, fixing this
deficiency. use it if any flags are passed, and fall back to the old
strategy on ENOSYS. continue using the old syscall when there are no
flags. this is the exact same strategy used when SYS_faccessat2 was used
to implement faccessat with flags.
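a hedged sketch of the dispatch, written against the public syscall() wrapper rather than the actual musl fchmodat.c:
    #include <errno.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #ifndef SYS_fchmodat2
    #define SYS_fchmodat2 452   /* "usually 452" per the kernel wiring commit */
    #endif
    static int fchmodat_sketch(int fd, const char *path, mode_t mode, int flag)
    {
        if (!flag) return syscall(SYS_fchmodat, fd, path, mode);
        long r = syscall(SYS_fchmodat2, fd, path, mode, flag);
        if (r == 0 || errno != ENOSYS) return (int)r;
        /* ENOSYS: fall back to the pre-fchmodat2 workaround here (omitted) */
        return -1;
    }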
the linux fchmodat syscall lacks a flag argument that is necessary to
implement the posix api, see
linux commit 09da082b07bbae1c11d9560c8502800039aebcea
fs: Add fchmodat2()
linux commit 78252deb023cf0879256fcfbafe37022c390762b
arch: Register fchmodat2, usually as syscall 452
see
linux commit cf264e1329fb0307e044f7675849f9f38b44c11a
cachestat: implement cachestat syscall
linux commit 946e697c69ffeeefdd84dad90eac307284df46be
cachestat: wire up cachestat for other architectures
see
linux commit c6018b4b254971863bd0ad36bb5e7d0fa0f0ddb0
mm/mempolicy: add set_mempolicy_home_node syscall
linux commit 21b084fdf2a49ca1634e8e360e9ab6f9ff0dee11
mm/mempolicy: wire up syscall set_mempolicy_home_node
see
linux commit 039c0ec9bb77446d7ada7f55f90af9299b28ca49
futex,x86: Wire up sys_futex_waitv()
linux commit ea7c45fde5aa3e761aaddb7902a31a95cb120e7b
futex,arm: Wire up sys_futex_waitv()
linux commit b3ff2881ba18b852f79f5476d7631940071f1adb
MIPS: syscalls: Wire up futex_waitv syscall
linux commit 6c122360cf2f4c5a856fcbd79b4485b7baec942a
s390: wire up sys_futex_waitv system call
linux commit a0eb2da92b715d0c97b96b09979689ea09faefe6
futex: Wireup futex_waitv syscall
see
linux commit 884a7e5964e06ed93c7771c0d7cf19c09a8946f1
mm: introduce process_mrelease system call
linux commit dce49103962840dd61423d7627748d6c558d58c5
mm: wire up syscall process_mrelease
see
linux commit 7bb7f2ac24a028b20fca466b9633847b289b156a
arch, mm: wire up memfd_secret system call where relevant
linux commit 1507f51255c9ff07d75909a84e7c0d7f3c4b2f49
mm: introduce memfd_secret system call to create "secret" memory areas
linux commit b633896314c0f78f2b4eb7b19a530d68f2a35445
tools headers UAPI: Sync s390 syscall table file that wires up the
memfd_secret syscall
this commit should make no codegen change for existing archs, but is a
prerequisite for new archs including riscv32. the wait4 emulation
backend provides both cancellable and non-cancellable variants because
waitpid is required to be a cancellation point, but all of our other
uses are not, and most of them cannot be.
based on patch by Stefan O'Rear.
commit f47a5d400b overlooked that
strtoul was responsible for setting p to a const-laundered copy of the
format string pointer f, even in the case where there was no number to
parse. by making the call conditional on isdigit, that copy was lost.
the logic here is a mess and should be cleaned up, but for now, this
seems to be the least invasive change that undoes the breakage.
commit f247462b08 incorrectly hid ppoll
in the presence of _GNU_SOURCE due to an oversight that defining
_BSD_SOURCE does not implicitly define _GNU_SOURCE. at present,
headers still have to explicitly check for each feature profile level;
this may be changed at some point in the future via features.h, but
has not been changed yet.
depending on contents of the LC_TIME locale, log messages could be
malformatted (especially if the ABMON strings contain non-alphabetic
characters) or the subsequent code could invoke undefined behavior,
via passing a timebuf[] with unspecified contents to snprintf, if
the translated ABMON string did not fit in the 16-byte timebuf.
this does not appear to be a security-relevant bug, as locale loading
functionality is intentionally not available to set*id programs -- the
MUSL_LOCPATH environment variable is ignored when libc.secure is true,
and custom locales are not loadable without it.
Undefine any previous __STDC_UTF_{16,32}__ macros before defining
them to prevent any warnings about redefining macros.
This happens as a result of some compiler versions defining the macros
themselves.
Linux and most systems do not have symlink permissions, but some
systems, including MacOS, do, and creation of the symlink with umask
set to 0777 makes the symlink inaccessible on such systems.
clear umask when making a symlink so that the behavior is uniform.
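sketched in C for illustration (the actual change may live in the build/install tooling rather than library code):
    #include <sys/stat.h>
    #include <unistd.h>
    /* clear the umask around symlink creation so systems that honor symlink
       permissions do not produce an inaccessible link */
    static int make_symlink(const char *target, const char *name)
    {
        mode_t old = umask(0);
        int r = symlink(target, name);
        umask(old);
        return r;
    }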
having these constants be static was unnecessary, so just remove the
static.
this error should have been caught by compilers, but recent versions
of both gcc and clang accept these as "other forms of constant
expressions" which the C standard allows.
Previously, __riscv_flush_icache would not work correctly as
__vdso_flush_icache had a wrong symbol version. Fix this by correcting
the symbol version.
Fixes: 0a48860c27 ("add riscv64 architecture support")
Note: Some relocation types were only used by binutils and
accidentally exposed to previous versions of psABI. One of the values
has been reused by GOT32_PCREL.
the ppoll function has been accepted as a future part of the standard
as the outcome of Austin Group tracker issue 1263. at some point it
should be exposed unconditionally, but for now, expose it in the
default feature profile.
the ppoll function has been accepted as a future part of the standard
as the outcome of Austin Group tracker issue 1263. move the source
file to reflect this.
this was a POSIX requirement that was always in conflict with ISO C,
which specified a well-defined behavior for snprintf and swprintf so
long as the actual number of bytes/characters produced did not exceed
INT_MAX.
I originally raised this conflict for snprintf with the Austin Group
as tracker issue 761, which was never resolved. it was later reported
again as issue 1219, and as a result the conflicting requirement has
been removed.
the corresponding issue with swprintf does not seem to have been
addressed, but as the same reasoning applies to it, I am removing the
limitation on n for swprintf as well.
strtoul will consume leading whitespace or sign characters, which are
not valid in this context, thereby accepting invalid field specifiers.
so, avoid calling it unless there is a number to parse as the width.
this matters because the kernel-provided mtab only escapes tabs,
spaces, newlines, and backslashes. it leaves carriage returns, form
feeds, and vertical tabs literal.
As entries in mtab are delimited by spaces, whitespace characters
are escaped as octal sequences. When reading them out, we have to
unescape these sequences to get the proper string.
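A self-contained sketch of the unescaping step (the function name and exact handling are illustrative, not the actual getmntent code):
    /* decode "\040"-style octal escapes in place */
    static void unescape(char *s)
    {
        char *d = s;
        while (*s) {
            if (s[0] == '\\' && s[1] >= '0' && s[1] <= '3'
                && s[2] >= '0' && s[2] <= '7' && s[3] >= '0' && s[3] <= '7') {
                *d++ = (char)(((s[1]-'0') << 6) | ((s[2]-'0') << 3) | (s[3]-'0'));
                s += 4;
            } else {
                *d++ = *s++;
            }
        }
        *d = 0;
    }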
presently this only affects 32-bit arm. despite correctly reversing
the function pointer and argument fields based on the
TLSDESC_BACKWARDS macro, we did not read the addend from the
swapped-order argument field, so nonzero addends were lost, producing
wrong runtime addresses for TLS objects needing an addend.
based on report and patch by Rui Ueyama.
this is contrary to the spec as written, which requires %lc to behave
as if it were %ls on a 2-wchar_t buffer containing the argument and
zero. however, apparently no other implementations conform to the spec
as written, and in response to Austin Group issue #1647, WG14 chose to
align with existing practice and have %lc produce output for this case.
The name resolution would abort when getting more than 63 records per
request, due to what seems to be a left-over from the original code.
This check was non-breaking but spurious prior to TCP fallback
support, since any 512-byte packet with more than 63 records was
necessarily malformed. But now, it wrongly rejects valid results.
Reported by Daniel Stefanik in Alpine Linux aports issue 15320.
AT_NO_AUTOMOUNT is implied for stat/lstat/fstatat syscalls since Linux
3.1 (commit b6c8069d3577481390b3f24a8434ad72a3235594). However, this
is not the case for statx syscall, which defaults to automounting, so
this flag must be passed explicitly when statx is used to implement
stat-like functions.
This change affects only arches which use 32-bit seconds in struct kstat,
as well as out-of-tree/future ports to arches which lack SYS_fstatat.
C11 6.11.5p1:
> The placement of a storage-class specifier other than at the
> beginning of the declaration specifiers in a declaration is an
> obsolescent feature.
gcc also warns about this.
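the construct in question, for illustration:
    const static int obsolescent_form = 1;  /* storage-class specifier not first */
    static const int preferred_form = 1;    /* equivalent, and not obsolescent */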
If __synccall() fails to capture all threads because tkill fails for
some reason other than EAGAIN, then the callback given will never be
executed, so nothing will ever overwrite the initial value. So that is
the value that will be returned from the function. The previous setting
of 1 is not a valid value for setuid() et al. to return.
I chose -EAGAIN since I don't know the reason the synccall failed ahead
of time, but EAGAIN is a specified error code for a possibly temporary
failure in setuid().
The code intends for the sem_post() in line 97 (now 98) to only unblock
target threads waiting on line 29. But after the first thread is
released, the next sem_post() might also unblock a thread waiting on
line 36. That would cause the thread to return to the execution of user
code before all threads are done, leading to user code being executed in
a mixed-credentials environment.
What's more, if this happens more than once, then the mass release on
line 110 (now line 111) will cause multiple threads to execute the
callback at the same time, and the callbacks are currently not written
to cope with that situation.
Adding another semaphore allows the caller to say explicitly which
threads it wants to release.
previously, the relative load address was used as the address at which
to find the ELF headers. this only works if two conditions are met:
ldso is linked to start at a virtual address of 0, and the linker is
cooperative and includes the main ELF headers in a loadable segment.
while in practice these are always met, modern linkers provide a
__ehdr_start symbol pointing to the ELF headers, and can in principle
use the reference to this symbol as an indication that they need to be
mapped in a segment. this also should make it possible to link for a
different starting virtual address, if that's ever desirable.
commit 37bb3cce45 suppressed the
declaration for C++, where it is wrongly interpreted as declaring the
function as taking no arguments. with C23 removing non-prototype
declarations, that problem is now also relevant to C.
the non-prototype declaration for basename originates with commit
06aec8d715, where it was designed to
avoid conflicts with programs which declare basename with the GNU
signature taking const char *. that change was probably misguided, as
it represents not only misaligned expectations with the caller, but
also undefined behavior (calling a function that's been declared with
the wrong type).
we could opt to fix the declaration, but since glibc, with the
gratuitously incompatible GNU-basename function, seems to be the only
implementation that declares it in string.h, it seems better to just
remove the declaration. this provides some warning if applications are
being built expecting the GNU behavior but not getting it. if we
declared it here, it would only produce a warning if the caller also
declares it themselves (rare) or if the caller attempts to pass a
const-qualified pointer.
These were overlooked when DT_RELR was added in commit
d32dadd60e, potentially breaking
software that treats presence of the DT_RELR macro as implying they
exist.
when the result count was zero, glob was ignoring a possible
GLOB_ABORTED error code and returning GLOB_NOMATCH. whether this
happened could be nondeterministic and dependent on the order of
dirent enumeration, in cases where multiple matches were present and
only some produced errors.
caught by Tor's test_util_glob.
This is the only missing part in struct statvfs. The LSB calls
[f]statfs() deprecated, and its weird types are definitely
off-putting. However, its use is required to get f_type.
Instead, allocate one of the six spares to f_type, copied directly
from struct statfs. This then becomes a small extension to the
standard interface on Linux, instead of two different interfaces, one
of which is quite odd due to being an ABI type, and there no longer is
any reason to use statfs().
The underlying kernel type is a mess, but all architectures agree on u32
(or more) for the ABI, and all filesystem magicks are 32-bit integers.
Since commit 6567db65f4 (prior to
1.0.0), the spare slots have been zero-filled, so on all versions that
may reasonably be encountered in the wild, applications can rely on
a nonzero f_type as indication that the new field has been filled in.
powl used >= LDBL_MAX as infinity check, but LDBL_MAX is finite, so
this can cause wrong results e.g. powl(LDBL_MAX, 0.5) returned inf
or powl(2, LDBL_MAX) returned inf without raising overflow.
huge y values (close to LDBL_MAX) could cause intermediate results to
overflow (computing y * log2(x) with more than long double precision)
and e.g. powl(0.5, 0x1p16380L) or powl(10, 0x1p16380L) returned nan.
this is fixed by handling huge y early since that always overflows or
underflows.
reported by Paul Zimmermann against exp10l (which uses powl).
acosh(x) is nan for x < 1, but x < 0 cases were not handled specially
and acoshl gave wrong results for some -0x1p32 < x < -2 values, e.g.:
acoshl(-0x1p20) returned -inf,
acoshl(-0x1.4p20) returned -0x1.db365758403aa9acp+0L,
fixed by checking the sign bit and handling it specially.
reported by Paul Zimmermann.
the __dns_parse code used by the stub resolver traditionally included
code to reject label pointers to offsets past a 512 byte limit,
despite never processing the label contents, only stepping over them.
when commit 51d4669fb9 added support for
tcp fallback, this limit was overlooked, and as a result, it was at
least theoretically possible for some valid large answers to be
rejected on account of these offsets.
since the limit was never serving any useful purpose, just remove it.
in the event of chained CNAMEs, the answer to a query will contain the
entire CNAME chain, not just one CNAME record. previously, the answer
buffer size had been chosen to admit a maximal-length CNAME, but only
one. a moderate-length chain could fill the available 768 bytes
leaving no room for an actual address answering the query.
while the DNS RFCs do not specify any limit on the length of a CNAME
chain, or any reasonable behavior if the chain exceeds the entire 64k
possible message size, actual recursive servers have to impose a
limit, and as such, for all practical purposes, chains longer than this
limit are not usable. it turns out BIND has a hard-coded limit of 16,
and Unbound has a default limit of 11.
assuming the recursive server makes use of "compression" (pointers),
each maximal-length CNAME record takes at most 268 bytes, and thus any
chain up to length 16 fits in at most 4288 bytes.
this patch increases the answer buffer size to preserve the original
intent of having 512 bytes available for address answers, plus space
needed for a maximal CNAME chain, for a total of 4800 bytes. the
resulting size of 9600 bytes for two queries (A+AAAA) is still well
within what is reasonable to place in automatic storage.
the extra terms 3 and LDBL_MANT_DIG/4 are remnants of a proto-musl
implementation of printf where the sign/prefix and floating point
conversions were performed naively into this buffer. having them there
obscures the actual intended buffer size (sufficient to hold between 2
and 3 octal digits per byte, rounded up to 3 for simplicity) and
interferes with upcoming work to add C2x binary formats which would
otherwise be stuck having to explain a similar fix to buffer size as
part of an unrelated change.
%c takes an argument of type int, not char, and %lc/%C takes an
argument of type wint_t (unsigned), not int.
for most cases, this makes no practical difference, but since wide
printf variants convert narrow %c format specifiers via btowc,
interpreting the promoted-to-int unsigned char value passed in as a
(signed, on most archs) char causes 255 to get collapsed to EOF and
interpreted as such by btowc.
this is only relevant in the byte-based C locale, so prior to commit
f22a9edaf8, there was no observable
distinction in behavior. for UTF-8, all bytes which might be negative
when interpreted as char are encoding errors when used with %c/btowc.
the clone() function has been effectively unusable since it was added,
due to producing a child process with inconsistent state. in
particular, the child process's thread structure still contains the
tid, thread list pointers, thread count, and robust list for the
parent. this will cause malfunction in interfaces that attempt to use
the tid or thread list, some of which are specified to be
async-signal-safe.
this patch attempts to make clone() consistent in a _Fork-like sense.
as in _Fork, when the parent process is multi-threaded, the child
process inherits an async-signal context where it cannot call
AS-unsafe functions, but its context is now intended to be safe for
calling AS-safe functions. making clone fork-like would also be a
future option, if it turns out that this is what makes sense to
applications, but it's not done at this time because the changes would
be more invasive.
in the case where the CLONE_VM flag is used, clone is only vfork-like,
not _Fork-like. in particular, the child will see itself as having the
parent's tid, and cannot safely call any libc functions but one of the
exec family or _exit.
handling of flags and variadic arguments is also changed so that
arguments are only consumed with flags that indicate their presence,
and so that flags which produce an inconsistent state are disallowed
(reported as EINVAL). in particular, all libc functions carry a
contract that they are only callable with ABI requirements met, which
includes having a valid thread pointer to a thread structure that's
unique within the process, and whose contents are opaque and only able
to be setup internally by the implementation. the only way for an
application to use flags that violate these requirements without
executing any libc code is to perform the syscall from
application-provided asm.
apparently Linux clears the registered exit futex address on fork.
this means that, if after forking the child process becomes
multithreaded and the original thread exits, the thread list will
never be unlocked, and future attempts to use the thread list will
deadlock.
re-register the exit futex address after _Fork in the child to ensure
that it's preserved.
mbrtowc truncates n to unsigned int when storing its copy.
If n > UINT_MAX and the locale is not POSIX, the function will
return a wrong value greater than UINT_MAX on the success path.
aside from the documented differences, which are the contents of this
patch, GCC's -Os also has hard-coded unwanted behaviors which are
impossible to override, like refusing to strength-reduce division by a
constant to multiplication, presumably because the div saves a couple
bytes of code. for this reason, getting rid of -Os and switching to an
equivalent default optimization profile based on -O2 has been a
long-term goal.
as follow-ups, it may make sense to evaluate which of these variations
from -O2 actually do anything useful, and eliminate the ones which are
not helpful or which throw away performance for insignificant size
savings. but for now, I've replicated -Os as closely as possible to
provide a baseline for such evaluation.
The nl_type and nl_arg arrays defined in vfwprintf may be accessed
with an index up to and including NL_ARGMAX, but they are only of size
NL_ARGMAX, meaning they may be written to or read from 1 element too
far.
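Schematically, with a simplified stand-in for the internal argument union (the real declarations are local to vfwprintf):
    #include <limits.h>
    union arg { long long i; long double f; void *p; };  /* simplified stand-in */
    /* positional arguments are indexed 1..NL_ARGMAX, so the arrays need
       NL_ARGMAX+1 elements; the originals had only NL_ARGMAX */
    static int nl_type[NL_ARGMAX+1];
    static union arg nl_arg[NL_ARGMAX+1];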
Resource usage data is filled by the kernel only when wait4 returns
a pid, i.e. a positive value.
Commit 5850546e96 introduced this bug,
possibly because of copy-pasting from getrusage.
For time64 support, musl normally defines SYS_foo to the time32 variant
of that syscall on arches that have it, and to the time64 variant
otherwise, so that "SYS_foo == SYS_foo_time64" implies that the arch is
time64-only. However, SYS_semtimedop is an odd case: some arches define
only SYS_semtimedop_time64, yet they are not time64-only, because the
time32 variant is provided via SYS_ipc instead. For such arches,
defining SYS_semtimedop to SYS_semtimedop_time64 would break the
implication above, so commit 4bbd7baea7
doesn't do this. Commit eb2e298cdc
attempts to detect time64-only arches by checking that both
SYS_semtimedop and SYS_ipc are undefined, but this doesn't work for
x32, because it's a time64-only arch that does define SYS_semtimedop.
As a result, 32-bit timeouts trigger the fallback path that passes
a 32-bit timespec to the kernel while it expects a 64-bit one, so
the effective tv_sec is formed by interpreting 32-bit tv_sec and
tv_nsec as a single long long, and the effective tv_nsec is whatever
is located in the next 64 bits of the stack.
Fix this by expanding the time64-only check to include arches where
SYS_semtimedop is the time64 variant of the syscall.
When an option that requires an argument is the last character of
argv[argc-1], getopt computes argv[argc] + optpos. While optpos
is always zero in this case, adding it to null pointer is still
undefined.
If lstat/stat fails with EACCES, st is left uninitialized, but its
st_dev/st_ino fields are then used in several places:
* for FTW_MOUNT check (in practice typically results in a false
positive and an early return)
* for copying to the new struct history (though the struct is not used
afterwards since we don't recurse in this case)
* for cycle detection check (could theoretically result in a false
positive and an early return)
To avoid adding FTW_NS checks to all these places, fix this by
zero-initializing st_dev/st_ino (which can never match an existing
dentry due to zero inode being reserved in Linux), and check for FTW_NS
only when handling FTW_MOUNT since we need two valid dentries there.
commit 246f1c8114 inadvertently
introduced the local variable p as static by declaring it together
with lfs64_list. the function is only reachable under lock, and is not
called reentrantly, so this is not a functional bug, but it is
confusing and inefficient. fix by separating the declarations.
The received length field in the message may be greater than the
size of the 'answer' buffer in which the message resides. Currently,
ABUF_SIZE is 768. And if we get a larger 'alens[i]', it will result
in an out-of-bounds reading in __dns_parse().
To fix this, limit the length to the size of the received buffer.
the buffer-flush function did not account for mbtowc returning 0
rather than 1 when converting the nul character. this prevented
advancing past it, instead repeatedly converting it into the output
wide character string until the max output length was exhausted.
commit d42269d7c8 appropriated the
stream error flag temporarily to let the printf family of functions
suppress further output attempts after encountering a write error.
since the wide printf code relies on (narrow) vfprintf to print
padding and numeric conversions, a hack was put in vfprintf not to
clear the initial error status unless the stream is narrow oriented.
this was okay, because calling vfprintf on a wide-oriented stream
(outside of internal use by the implementation) produces undefined
behavior. however, it was highly non-obvious to anyone reading the
wide printf code, where the calls to fprintf without first checking
for error status appeared erroneous.
this patch removes all direct use of fprintf from the wide printf
core, except in the numeric conversions case where it was already
checked before starting processing of the directive that the error
status is not set. the other calls, which were performing padding, are
replaced by a new pad() helper function, which performs the check and
abstracts out the mechanism of writing the padding.
direct use of the error flag is also replaced by ferror, which is
defined as a macro in stdio_impl.h, expanding directly to the flag
check with no call or locking overhead.
unlike with wide printf variants, encoding errors are not a vector by
which this bug is reachable, and the out() helper function already
ensured that no further output could be written after an output error,
transient or otherwise. however, the %n specifier could still be
processed after an error, yielding a side effect that wrongly implied
output had succeeded.
due to buffering effects, it's still possible for %n to show output as
having "succeeded", but for it never to appear on the underlying file
due to an error at flush time. this change, however, ensures that
processing of %n does not conflict with any error which has already
been seen.
this fixes a broader bug for which a special case was reported by
Bruno Haible, in the form of %n getting processed (and reporting the
number of wide characters which would have been written, but weren't)
after an encoding error (EILSEQ). in addition to the %n case, some but
not all of the format specifiers continued to attempt output after an
error. in particular, %c, %lc, and %s all used fputwc directly without
any check for error status.
as long as the error condition was permanent rather than transient,
these write attempts had no visible side effects, but in theory it
could be visible, for example with EAGAIN/EWOULDBLOCK or ENOSPC, if
the condition precluding output came to an end. this could produce
output with missing non-final data, rather than just truncated output,
albeit with the function still returning -1 as expected to report an
error.
to fix this, a check is added to stop processing of any new directive
(including %n) if the stream is already in error state, and direct use
of fputwc is replaced with calls to the out() helper function, which
checks for error status.
note that fprintf is also used directly without checking error status,
but due to how commit d42269d7c8
previously attempted to solve the issue of output after error, the
call to fprintf does not attempt to write anything when the
wide-oriented stream is already in error state. this is non-obvious,
and is quite a hack, so it should be changed, but I've left it alone
for now to make the bug fix commit itself as non-invasive as possible.
this function was overlooked during the time64 transition, probably as
a result of not having any time-related types in its application-side
interface. however, for archs that lack the traditional poll syscall
and have only ppoll, it used timespec as part of its interface with
the kernel: the millisecond timeout was converted to a timespec to
pass to SYS_ppoll. this is a type/ABI mismatch on 32-bit archs with
legacy time32 syscalls.
only one supported arch, or1k, is affected. all of the others either
have SYS_poll, or are 64-bit.
rather than using timespec, define a type locally to match what the
kernel expects. the condition (SYS_ppoll_time64 == SYS_ppoll),
comparable to conditions used elsewhere in timespec-handling code,
evaluates true for "natively time64" 32-bit archs including x32,
future riscv32, and all future 32-bit archs (via definitions in
internal syscall.h). otherwise, the arch is either 64-bit or has
syscalls that take the legacy type, and in either case "long" is
correct.
this fix is based on bug report and proposal by Alexey Izbyshev but
with a different approach to the changes to minimize the contextual
knowledge needed for a reader to understand the source file.
If the (normalized) timeout passed to select exceeds INT_MAX seconds on
an arch with SYS_pselect6_time64 and the kernel is too old to support
time64 syscalls, the timeout is implicitly converted to (32-bit) long on
the fallback path, losing its upper 32 bits and potentially becoming a
small positive value, violating the intended semantics, or even
a negative value, causing the fallback syscall failure. Fix this by
saturating the timeout at INT_MAX as done in other time64 fallback
cases.
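The shape of the fix (a sketch; s holding the normalized timeout in seconds):

    if (s > INT_MAX) s = INT_MAX;  /* saturate before the time32 fallback
                                      narrows the value to a 32-bit long */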
this is the best-effort fallback path for kernels that can't actually
support the dup3 functionality. it was setting the FD_CLOEXEC flag on the
target fd (new) even if the dup2 operation failed. normally that
shouldn't happen under correct usage, but it's possible if the source
fd is not open or intentionally invalid (e.g. -1).
our dup3 code wrongly skipped directly to making the SYS_dup2 syscall
whenever the O_CLOEXEC bit of flags was not set. this is incorrect if
any new flags are ever added, as it would silently ignore them rather
than failing with an error.
archs which lack SYS_dup2 were unaffected.
adjust the logic so that SYS_dup3 is attempted whenever flags is
nonzero, and explicitly fail with EINVAL if SYS_dup3 is unavailable
and there are any unknown flags.
kernels using the fallback have an inherent close-on-exec race
condition and as such support for them is only best-effort anyway.
however, ignoring potential new flags is still very bad behavior.
instead, fail with EINVAL.
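taken together with the FD_CLOEXEC fix above, the fallback path ends up
looking roughly like this (a sketch for archs that have SYS_dup2; error
handling abbreviated):

    int r;
    if (flags) {
        while ((r = __syscall(SYS_dup3, old, new, flags)) == -EBUSY);
        if (r != -ENOSYS) return __syscall_ret(r);
        if (flags & ~O_CLOEXEC) return __syscall_ret(-EINVAL);
    }
    while ((r = __syscall(SYS_dup2, old, new)) == -EBUSY);
    if (r >= 0 && (flags & O_CLOEXEC))
        __syscall(SYS_fcntl, new, F_SETFD, FD_CLOEXEC);
    return __syscall_ret(r);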
If the buffer passed to getservbyport_r is just enough to store two
pointers after aligning it, getnameinfo is called with buflen == 0
(which means that service name is not needed) and trivially succeeds.
Then, strtol is called on the address just past the buffer end, and
if it doesn't happen to find the port number there, getservbyport_r
spuriously succeeds and returns the same bad address to the caller.
Fix this by ensuring that buflen is at least 1 when passed to
getnameinfo.
getifaddrs computes &ctx->first->ifa even if ctx->first is NULL. While
this shouldn't be possible on the success path because the loopback
interface is hardcoded into the kernel, this is still possible on the
error path (for example, if __rtnetlink_enumerate couldn't create a
socket due to exceeding the fd limit).
accept4 emulation via accept ignores unknown flags, so it can spuriously
succeed instead of failing (or succeed without doing the action implied
by an unknown flag if it's added in a future kernel). Worse, unknown
flags trigger the fallback code even on modern kernels if the real
accept4 syscall returns EINVAL, because this is indistinguishable from
socketcall returning EINVAL due to lack of accept4 support.
Fix this by always failing with EINVAL if unknown flags are present and
the syscall is missing or failed with EINVAL.
This is completely analogous to commit 633183b5d1.
Similar code called from __lookup_name is not affected because it checks
that the line contains the host name surrounded by blanks.
When IPv6 nameservers are present, __res_msend_rc attempts to disable
IPV6_V6ONLY socket option to ensure that it can communicate with IPv4
nameservers (if they are present too) via IPv4-mapped IPv6 addresses.
However, this option can't be disabled on bound sockets, so setsockopt
always fails.
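The fix is to clear the option while the socket is still unbound (a sketch):

    int v6only = 0;
    /* must precede bind(); IPV6_V6ONLY cannot be changed on a bound socket */
    setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof v6only);
    bind(fd, (struct sockaddr *)&sa, sizeof sa);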
A zero returned from recvmsg is currently treated as if some data were
received, so if a DNS server closes its TCP socket before sending the
full answer, __res_msend_rc will spin until the timeout elapses because
POLLIN event will be reported on each poll. Fix this by treating an
early EOF as an error.
DNS parsing callbacks pass the response buffer end instead of the actual
response end to dn_expand, so a malformed DNS response can use message
compression to make dn_expand jump past the response end and attempt to
parse uninitialized parts of that buffer, which might succeed and return
garbage.
There are several issues with range checks in this function:
* The question section parsing loop can read up to two out-of-bounds
bytes before doing the range check and bailing out.
* The answer section parsing loop, in addition to the same issue as
above, uses the wrong length in the range check that doesn't prevent
OOB reads when computing len later.
* The len range check before calling the callback is off by 10. Also,
p+len can overflow in a (probably theoretical) case when p is within
2^16 from UINTPTR_MAX.
Because __dns_parse is used only with stack-allocated buffers, such
small overreads can't result in a segfault. The first two also don't
affect the function result, but the last one may result in getaddrinfo
incorrectly succeeding and returning up to 10 bytes past the
response buffer as a part of the IP address, and in (canon) name
returned by getaddrinfo/getnameinfo being affected by memory past the
response buffer (because dn_expand might interpret it as a pointer).
Before this commit, DNS timeouts always used CLOCK_REALTIME, which
could produce spurious timeouts or delays if wall time changed for
whatever reason.
Now we try CLOCK_MONOTONIC and only fall back to CLOCK_REALTIME when
it is unavailable.
As a result of using simple subtraction to implement the return values
for wcscmp and wcsncmp, integer overflow can occur (producing
undefined behavior, and in practice, a wrong comparison result). This
does not occur for meaningful character values (21-bit range) but the
functions are specified to work on arbitrary wchar_t arrays.
This patch replaces the subtraction with a little bit of code that
orders the characters correctly, returning -1 if the character from
the first string is smaller than the one from the second, 0 if they
are equal and 1 if the character from the first string is larger than
the one from the second.
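In code, the replacement amounts to (a sketch; l and r being the current
positions in the two strings):

    /* no subtraction, so no possibility of signed overflow */
    return *l < *r ? -1 : *l > *r;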
A signed int shift overflowed when computing a constant mask, use hex
literal instead. This is unlikely to cause actual issues unless the
code was compiled with ubsan or similar instrumentation specifically
to catch this. The stripped libc.so is unchanged on x86_64.
Reported by q66 on irc.
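For illustration (not the literal code):

    #define MASK_BAD  (1 << 31)   /* shifts into the sign bit: undefined behavior */
    #define MASK_GOOD 0x80000000  /* same bit pattern, unsigned, well defined */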
When a dot is encountered, the loop counter is incremented before
exiting the loop, but the corresponding ip array element is left
uninitialized, so the subsequent memmove (if "::" was seen) and the
loop copying ip to the output buffer will operate on an uninitialized
uint16_t.
The uninitialized data never directly influences the control flow and
is overwritten on successful return by the second half of the parsed
IPv4 address. But it's better to fix this to avoid unexpected
transformations by a sufficiently smart compiler and reports from
UB-detection tools.
The kernel defines a limit on the number of fds that can be passed
through an SCM_RIGHTS ancillary message as SCM_MAX_FD. The value was
255 before kernel 2.6.38 (after that it is 253), and an SCM_RIGHTS
ancillary message with 255 fds requires 1040 bytes, slightly more than
the current 1024 byte internal buffer in sendmsg. 1024 is an arbitrary
size, so increase it to match the arbitrary size limit in the
kernel. This fixes tests that are verifying they support up to
SCM_MAX_FD fds.
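For reference, the required space can be computed with CMSG_SPACE (here
assuming the common 16-byte struct cmsghdr of 64-bit archs):

    #include <sys/socket.h>
    /* 255 fds * 4 bytes = 1020 bytes of payload; aligning and adding the
       header gives 1040, which did not fit in the old 1024-byte buffer */
    _Static_assert(CMSG_SPACE(255*sizeof(int)) > 1024, "needs larger buffer");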
until the mq notification event arrives, it is mandatory that signals
be blocked. otherwise, a signal can be received, and its handler
executed, in a thread which does not yet exist on the abstract
machine.
after the point of the event arriving, having signals blocked is not a
conformance requirement but a QoI requirement. while the application
can unblock any signals it wants unblocked in the event handler
thread, if they did not start out blocked, it could not block them
without a race window where they are momentarily unblocked, and this
would preclude controlled delivery or other forms of acceptance
(sigwait, etc.) anywhere in the application.
in the error path where the mq_notify syscall fails, the initiating
thread may have closed the socket before the worker thread calls recv
on it. even in the absence of such a race, if the recv call failed,
e.g. due to seccomp policy blocking it, the worker thread could
proceed to close, producing a double-close condition.
this can all be simplified by moving the mq_notify syscall into the
new thread, so that the error case does not require pthread_cancel.
now, the initiating thread only needs to read back the error status
after waiting for the worker thread to consume its arguments.
disabling cancellation around the pthread_join call seems to be the
safest and logically simplest fix. i believe it would also be possible
to just perform the unmap directly here after __tl_sync, removing the
dependency on pthread_join, but such an approach duplicately encodes a
lot more implementation assumptions.
the logic to check hwcap for SPE register file inadvertently clobbered
the val argument before use. switch to a different work register so
this doesn't happen.
we wrongly defined a dummy SA_RESTORER flag on these archs, despite
the kernel interface not actually having such a feature. on archs
which lack SA_RESTORER, the kernel sigaction structure also lacks the
restorer function pointer member, which means the signal mask appears
at a different offset. the kernel was thereby interpreting the bits of
the code address as part of the signal set to be masked while handling
the signal.
this patch removes the erroneous SA_RESTORER definitions from archs
which do not have it, makes access to the member conditional on
whether SA_RESTORER is defined for the arch, and removes the
now-unused asm for the affected archs.
because there are reportedly versions of qemu-user which also use the
wrong ABI here, the old ksigaction struct size is preserved with an
unused member at the end. this is harmless and mitigates the risk of
such a bug turning into a buffer overflow onto the sigaction
function's stack.
the result of the 0xffff mask with the exit status could have bit 15
set, in which case multiplying by 0x10001 overflows 32-bit signed int.
making the multiply unsigned avoids the overflow. it also changes the
sign extension behavior of the subsequent >> operation, but the
affected bits are all unwanted anyway and all discarded by the cast to
short.
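concretely, the change is just making the constant unsigned (a sketch,
outside its real context):

    /* before: signed multiply, overflows when bit 15 of the masked value is set */
    w = (status & 0xffff) * 0x10001;
    /* after: unsigned multiply, well defined; the unwanted high bits are
       discarded by the eventual cast to short anyway */
    w = (status & 0xffff) * 0x10001U;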
mips has its own mechanisms for DT_DEBUG because it makes _DYNAMIC
read-only, and the original mechanism, DT_MIPS_RLD_MAP, was
PIE-incompatible. DT_MIPS_RLD_MAP_REL was added to remedy this, but we
never implemented support for it. add it now using the same idioms for
mips-specific ldso logic.
memmem has been adopted for the next issue of POSIX (outcome of
tracker item 1061). since mem* is in the reserved namespace for
string.h it's already fully conforming to expose it by default, so
just do so.
while no lock is held here making it a lock-order issue, replacement
malloc is likely to want to use pthread_atfork, possibly making the
call to malloc infinitely recursive.
even if not, there is no reason to prefer an application-provided
malloc here.
printf_core() runs twice, and during its first run, nl_arg is
uninitialized and must not be read. It gets initialized at the end of
the first run. Conversely, nl_type does not need to be set during the
second run, as its useful life has ended at that point, since the only
time it is read is during that exact same initialization. Therefore we
can simply alternate the assignments.
p and w do still need to get values assigned to them, since at least one
line in the same if-statement depends on that, but they can be dummy
values. arg does not need to be assigned, since in the first run, we
encounter a continue statement before using the argument.
because the has-waiters state in the semaphore value futex word is
only representable when the value is zero (the special value -1
represents "0 with potential new waiters"), it's lost if intervening
operations make the semaphore value positive again. this creates an
ABA issue in sem_post, whereby the post uses a stale waiters count
rather than re-evaluating it, skipping the futex wake if the stale
count was zero.
the fix here is based on a proposal by Alexey Izbyshev, with minor
changes to eliminate costly new spurious wake syscalls.
the basic idea is to replace the special value -1 with a sticky
waiters bit (repurposing the sign bit) preserved under both wait and
post. any post that takes place with the waiters bit set will perform
a futex wake.
to be useful, the waiters bit needs to be removable, and to remove it
safely, we perform a broadcast wake instead of a normal single-task
wake whenever removing the bit. this lets any un-accounted-for waiters
wake and re-add the waiters bit if they still need it.
there are multiple possible choices for when to perform this
broadcast, but the optimal choice seems to be doing it whenever the
observed waiters count is less than two (semantically, this means
exactly one, but we might see a stale count of zero). in this case,
the expected number of threads to be woken is one, with exactly the
same cost as a non-broadcast wake.
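a simplified sketch of the post side under this scheme (illustrative only:
val and waiters stand for the semaphore's futex word and waiter count,
a_cas/__wake for musl's internal atomic and futex-wake helpers, priv for
the private-futex flag):

    int old, new, waiters = sem->waiters;
    do {
        old = sem->val;                       /* sign bit = sticky waiters flag */
        new = old + 1;
        if (waiters <= 1) new &= 0x7fffffff;  /* removing the bit forces a broadcast */
    } while (a_cas(&sem->val, old, new) != old);
    if (old < 0)                              /* waiters bit was set: must wake */
        __wake(&sem->val, waiters > 1 ? 1 : -1, priv);  /* -1 means wake all */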
when PAGE_SIZE is not constant, internal/libc.h defines it to expand
to libc.page_size. however, kernel_mapped_dso, reachable from stage 2
of the dynamic linker bootstrap (__dls2), needs PAGE_SIZE to interpret
the relro range. at this point the libc object is both uninitialized
and invalid to access according to our model for bootstrapping, which
does not assume any external-linkage objects are accessible until
stages 2b/3. in practice it likely worked because hidden visibility
tends to behave like internal linkage, but this is not a property that
the dynamic linker was designed to rely upon.
this bug likely manifested as relro malfunction on archs with variable
page size, due to incorrect mask when aligning the relro bounds to
page boundaries.
while there are certainly more direct ways to fix the known problem
point here, a maximally future-proof way is to just bypass the libc.h
PAGE_SIZE definition in the dynamic linker and instead have dynlink.c
define its own internal-linkage object for variable page size. then,
if anything else in stage 2 ever ends up referencing PAGE_SIZE, it
will just automatically work right.
this is analogous to skip_relative logic in do_relocs -- because
relative relocations for the dynamic linker itself were already
performed at entry (stage 1), they must not be applied again.
the rule that longest digit sequence not beginning with a zero is
greater only applies when both sequences being compared are
non-degenerate. this is spelled out explicitly in the man page, which
may be deemed authoritative for this nonstandard function: "If one or
both of these is empty, then return what strcmp(3) would have
returned..."
we were wrongly treating any sequence of digits not beginning with a
zero as greater than a non-digit in the other string.
if async cancellation is enabled and acted upon, the stack pointer is
not necessarily pointing to a __syscall_cp_asm stack frame. the
contents of the stack being wrong don't really matter, but if the
stack pointer is not suitably aligned, the procedure call ABI is
violated when calling back into C code via __cancel, and pthread_exit,
cancellation cleanup handlers, TSD destructors, etc. may malfunction
or crash.
for the async cancel case, just call __cancel directly like we did
prior to commit 102f6a01e2. restore the
signal mask prior to doing this since the cancellation handler runs
with all signals blocked.
commit f081d5336a fixed
gethostbyname[2]_r to treat negative results as a non-error, leaving
gethostbyname[2] wrongly returning a pointer to the unfilled result
buffer rather than a null pointer. since, as documented with commit
fe82bb9b92, the caller of
gethostby{name[2],addr}_r can always rely on the result pointer being
set, use that consistently rather than trying to duplicate logic about
whether we have a result or not in gethostby{name[2],addr}.
the only functional change here should be that MAXADDRS is only
checked for RRs that provide address results, so that a CNAME which
appears after an excessive number of address RRs does not get ignored.
I'm not aware of any servers that order the RRs this way, and it may
even be forbidden to do so, but I prefer having the callback logic not
be order dependent.
other than that, the motivation for this change is that the A and AAAA
cases were mostly duplicate code that could be combined as a single
code path.
returning -1 rather than 0 from the parse function causes __dns_parse
to bail out and return an error. presently, name_from_dns does not
check the return value anyway, so this does not matter, but if it ever
started treating this as an error, lookups with large numbers of
addresses would break. this is a consequence of adding TCP support and
extending the buffer size used in name_from_dns.
reportedly there is nameserver software with question-rewriting
"functionality" which gives A answers when AAAA is queried. since we
made no effort to validate that the answer RR type actually
corresponds to the question asked, it was possible (depending on
flags, etc.) for these answers to leak through, which the caller might
not be prepared for. indeed, our implementation of gethostbyname2_r
makes an assumption that the resulting addresses are in the family
requested, and will misinterpret the results if they are not.
commit 45ca5d3fcb already noted in
fixing CVE-2017-15650 that this could happen, but did nothing to
validate that the RR type of the answer matches the question; it just
enforced the limit on number of results to preclude overflow.
presently, name_from_dns ignores the return value of __dns_parse, so
it doesn't really matter whether we return 0 (ignoring the RR) or -1
(parse-ending error) upon encountering the mismatched RR. if that ever
changes, though, ignoring irrelevant answer RRs sounds like the
semantically correct thing to do, so for now let's return 0 from the
callback when this happens.
commit 167390f055 seems to have
overlooked the presence of a lock here, probably because it was one of
the exceptions not using LOCK() but a rwlock.
as such, it can't be added to the generic table of locks to take, so
add an explicit atfork function for the pthread keys table. the order
it is called does not particularly matter since nothing else in libc
but pthread_exit interacts with keys.
performing n-- is not a safe operation for arbitrary signed input n.
only perform the decrement in the code path where the initial n is
greater than 1, and adjust the condition in the n<=1 code path to
compensate for it not having been decremented.
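in outline, the change looks like this (a sketch in the shape of a
string-reading function taking a signed int n):

    if (n <= 1) {
        if (n < 1) return 0;   /* was: if (n) return 0; after an unconditional n-- */
        *s = 0;
        return s;
    }
    n--;   /* only reached when n > 1, so the decrement cannot overflow */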
the aio operations that lead to calling __aio_get_queue with the
possibility to expand the fd map are not AS-safe, but if they are
interrupted by a signal handler, the signal handler may call close,
which is required to be AS-safe. due to __aio_get_queue taking the
write lock without blocking signals, such a call to close from a
signal handler could deadlock.
change __aio_get_queue to block signals if it needs to obtain a write
lock, and restore when finished.
aio_suspend waits on a dummy futex in the corner case when the array of
requests contains NULL pointers only. But the value of this futex was
left uninitialized, so if it happens to be non-zero, aio_suspend
degrades to spinning instead of blocking.
as reported by Alexey Izbyshev, there is a lock order inversion
deadlock between the malloc lock and aio maplock at MT-fork time:
_Fork attempts to take the aio maplock while fork already has the
malloc lock, but a concurrent aio operation holding the maplock may
attempt to allocate memory.
move the __aio_atfork calls in the parent from _Fork to fork, and
reorder the lock to be taken before most other locks, since nothing else depends
on aio(*). this leaves us with the possibility that the child will not
be able to obtain the read lock, if _Fork is used directly and happens
concurrent with an aio operation. however, in that case, the child
context is an async signal context that cannot call any further aio
functions, so all we need is to ensure that close does not attempt to
perform any aio cancellation. this can be achieved just by nulling out
the map pointer.
(*) even if other functions call close, they will only need a read
lock, not a write lock, and read locks being recursive ensures they
can obtain it. moreover, the number of read references held is bounded
by something like twice the number of live threads, meaning that the
read lock count cannot saturate.
as reported by Alexey Izbyshev, when the second-to-last thread exits
causing a return to single-threaded (no locks needed) state, it
creates a situation where the last remaining thread may obtain the
killlock that's already held by the exiting thread. this means it may
erroneously use the tid of the exiting thread, and may corrupt the
lock state due to double-unlock.
commit 8d81ba8c0b, which (re)introduced
the switch back to single-threaded state, documents the intent that
the first lock after switching back should provide the necessary
synchronization. this is correct, but only works if the switch back is
made after there is no further need for synchronization with locks
(other than the thread list lock, which can't be bypassed) held by the
exiting thread.
in order to hit the bug, the remaining thread must first take a
different lock, causing it to perform an actual lock one last time,
consume the need_locks==-1 state, and transition to need_locks==0.
after that, the next attempt to lock the exiting thread's killlock
will bypass locking.
fix this by reordering the unlocking of killlock at thread exit time,
along with changes to the state protected by it, to occur earlier,
before the switch to single-threaded state. there are really no
constraints on where it's done, except that it occur after there is no
longer any possibility of application code executing in the exiting
thread, so do it as early as possible.
ever since commit 8f11e6127f introduced
the thread list lock, this has been wrong. initially, it was wrong via
calling free from the context with the thread list lock held. commit
aa5a9d15e0 deferred the unsafe free but
added a lock, which was also unsafe. in particular, it could deadlock
if code holding freebuf_queue_lock was interrupted by a signal handler
that takes the thread list lock.
commit 4d5aa20a94 observed that there
was a lock here but failed to notice that it's invalid.
there is no easy solution to this problem with locks; any attempt at
solving it while still using locks would require the lock to be an
AS-safe one (blocking signals on each access to the dlerror buffer
list to check if there's deferred free work to be done) which would be
excessively costly, and there are also lock order considerations with
respect to how the lock would be handled at fork.
instead, just use an atomic list.
unlike most projects that use -fno-strict-aliasing, we aim to have all
sources respect the C language rules for effective type that make
type-based alias analysis optimizations possible. unfortunately, it
turns out that there are deep, and likely very difficult to fix, flaws
in the TBAA performed by GCC and likely other compilers, whereby this
kind of optimization can transform code that follows the rules
strictly in ways that will make it malfunction. see for example GCC
bugs 107107 and 107115, the latter of which also affects clang.
there are not presently any known instances of breakage due to wrong
type-based aliasing optimizations in our codebase. nonetheless, since
the transformations are unsound and could introduce breakage,
configure CFLAGS to build with -fno-strict-aliasing.
some casual analysis of the effects on codegen suggest that this is
unlikely to affect performance except possibly in the regex engine. in
general, we should probably prefer making better use of the restrict
keyword over relying on types to imply non-aliasing for optimization
purposes; doing so should be able to get back any performance that was
lost and more, should it turn out to matter (unlikely).
the entire intent of using madvise/MADV_FREE on freed slots is to
improve system performance by avoiding evicting cache of useful data,
or swapping useless data to disk, by marking any whole pages in the
freed slot as discardable by the kernel. in particular, unlike
unmapping the memory or replacing it with a PROT_NONE region, use of
MADV_FREE does not make any difference to memory accounting for commit
charge purposes, and so does not increase the memory available to
other processes in a non-overcommitted environment.
however, various measurements have shown that inordinate amounts of
time are spent performing madvise syscalls in processes which
frequently allocate and free medium sized objects in the size range
roughly between PAGESIZE and MMAP_THRESHOLD, to the point that the net
effect is almost surely significant performance degradation. so, turn
it off.
the code, which has some nontrivial logic for efficiently determining
whether there is a whole-page range to apply madvise to, is left in
place so that it can easily be re-enabled if desired, or later tuned
to only apply to certain sizes or to use additional heuristics.
these badly pollute the namespace with macros whenever _GNU_SOURCE is
defined, which is always the case with g++, and especially tends to
interfere with C++ constructs.
as our implementation of these was macro-only, their removal cannot
affect any existing binaries. at the source level, portable software
should be prepared for them not to exist.
for now, they are left in place with explicit _LARGEFILE64_SOURCE.
this provides an easy temporary path for integrators/distributions to
get packages building again right away if they break while working on
a proper, upstreamable fix. the intent is that this be a very
short-term measure and that the macros be removed entirely in the next
release cycle.
originally the namespace-infringing "large file support" interfaces
were included as part of glibc-ABI-compat, with the intent that they
not be used for linking, since our off_t is and always has been
unconditionally 64-bit and since we usually do not aim to support
nonstandard interfaces when there is an equivalent standard interface.
unfortunately, having the symbols present and available for linking
caused configure scripts to detect them and attempt to use them
without declarations, producing all the expected ill effects that
entails.
as a result, commit 2dd8d5e1b8 was made
to prevent this, using macros to redirect the LFS64 names to the
standard names, conditional on _GNU_SOURCE or _LARGEFILE64_SOURCE.
however, this has turned out to be a source of further problems,
especially since g++ defines _GNU_SOURCE by default. in particular,
the presence of these names as macros breaks a lot of valid code.
this commit removes all the LFS64 symbols and replaces them with a
mechanism in the dynamic linker symbol lookup failure path to retry
with the spurious "64" removed from the symbol name. in the future,
if/when the rest of glibc-ABI-compat is moved out of libc, this can be
removed.
we already attempt to preclude this case by having res_send use a
sufficiently large temporary buffer even if the caller did not provide
one as large as or larger than the udp dns max of 512 bytes. however,
it's possible that the caller passed a custom-crafted query packet
using EDNS0, e.g. to get detailed DNSSEC results, with a larger udp
size allowance.
I have also seen claims that there are some broken nameservers in the
wild that do not honor the dns udp limit of 512 and send large answers
without the TC bit set, when the query was not using EDNS.
we generally don't aim to support broken nameservers, but in this case
both problems, if the latter is even real, have a common solution:
using recvmsg instead of recvfrom so we can examine the MSG_TRUNC
flag.
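roughly, the receive path becomes (a sketch; names illustrative):

    struct iovec iov = { .iov_base = abuf, .iov_len = asize };
    struct msghdr mh = { .msg_name = &sa, .msg_namelen = sizeof sa,
                         .msg_iov = &iov, .msg_iovlen = 1 };
    ssize_t rlen = recvmsg(fd, &mh, 0);
    if (rlen >= 0 && (mh.msg_flags & MSG_TRUNC)) {
        /* datagram exceeded the buffer: treat the answer as truncated so
           the caller can fall back to tcp */
    }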
the size of 512 is not sufficient to get at least one address in the
worst case where the name is at or near max length and resolves to a
CNAME at or near max length. prior to tcp fallback, there was nothing
we could do about this case anyway, but now it's fixable.
the new limit 768 is chosen so as to admit roughly the number of
addresses with a worst-case CNAME as could fit for a worst-case name
that's not a CNAME in the old 512-byte limit. outside of this
worst-case, the number of addresses that might be obtained is
increased.
MAXADDRS (48) was originally chosen as an upper bound on the combined
number of A and AAAA records that could fit in 512-byte packets (31
and 17, respectively). it is not increased at this time.
so as to prevent a situation where the A records consume almost all of
these slots (at 768 bytes, a "best-case" name can fit almost 47 A
records), the order of parsing is swapped to process AAAA first. this
ensures roughly half of the slots are available to each address
family.
tcp fallback was originally deemed unwanted and unnecessary, since we
aim to return a bounded-size result from getaddrinfo anyway and
normally plenty of address records fit in the 512-byte udp dns limit.
however, this turned out to have several problems:
- some recursive nameservers truncate by omitting all the answers,
rather than sending as many as can fit.
- a pathological worst-case CNAME for a worst-case name can fill the
entire 512-byte space with just the two names, leaving no room for
any addresses.
- the res_* family of interfaces allow querying of non-address records
such as TLSA (DANE), TXT, etc. which can be very large. for many of
these, it's critical that the caller see the whole RRset. also,
res_send/res_query are specified to return the complete, untruncated
length so that the caller can retry with an appropriately-sized
buffer. determining this is not possible without tcp.
so, it's time to add tcp fallback.
the fallback strategy implemented here uses one tcp socket per
question (1 or 2 questions), initiated via tcp fastopen when possible.
the connection is made to the nameserver that issued the truncated
answer. right now, fallback happens unconditionally when truncation is
seen. this can, and may later be, relaxed for queries made by the
getaddrinfo system, since it will only use a bounded number of results
anyway.
retry is not attempted again after failure over tcp. the logic could
easily be adapted to do that, but it's of questionable value, since
the tcp stack automatically handles retransmission and the successful
answer with TC=1 over udp strongly suggests that the nameserver has
the full answer ready to give. further retry is likely just "take
longer to fail".
for extremely small buffer sizes, the DNS query core in __res_msend
may malfunction completely, being unable to get even the headers to
determine the response code. but there is also a problem for
reasonable sizes under 512 bytes: __res_msend is unable to determine
if the udp answer was truncated at the recv layer, in which case it
may be incomplete, and res_send is then unable to honor its contract
to return the length of the full, non-truncated answer.
at present, res_send does not honor that contract anyway when the full
answer would exceed 512 bytes, since there is no tcp fallback, but
this change at least makes it consistent in a context where this is
the only "full answer" to be had.
this was apparently omitted long ago out of a lack of understanding of
its importance and the fact that POSIX doesn't specify it. despite not
being officially standardized, however, it turns out that at least
AIX, glibc, NetBSD, OpenBSD, QNX, and Solaris document and support it.
in certain usage cases, such as implementing a DNS gateway on top of
the stub resolver interfaces, it's necessary to distinguish the case
where a name does not exist (NxDomain) from one where it exists but has
no addresses (or other records) of the requested type (NODATA). in
fact, even the legacy gethostbyname API had this distinction, which we
were previously unable to support correctly because the backend lacked
it.
apart from fixing an important functionality gap, adding this
distinction helps clarify to users how search domain fallback works
(falling back in cases corresponding to EAI_NONAME, not in ones
corresponding to EAI_NODATA), a topic that has been a source of
ongoing confusion and frustration.
as a result of this change, EAI_NONAME is no longer a valid universal
error code for getaddrinfo in the case where AI_ADDRCONFIG has
suppressed use of all address families. in order to return an accurate
result in this case, getaddrinfo is modified to still perform at least
one lookup. this will almost surely fail (with a network error, since
there is no v4 or v6 network to query DNS over) unless a result comes
from the hosts file or from ip literal parsing, but in case it does
succeed, the result is replaced by EAI_NODATA.
glibc has a related error code, EAI_ADDRFAMILY, that could be used for
the AI_ADDRCONFIG case and certain NODATA cases, but distinguishing
them properly in full generality seems to require additional DNS
queries that are otherwise not useful. on glibc, it is only used for
ip literals with mismatching family, not for DNS or hosts file results
where the name has addresses only in the opposite family. since this
seems misleading and inconsistent, and since EAI_NODATA already covers
the semantic case where the "name" exists but doesn't have any
addresses in the requested family, we do not adopt EAI_ADDRFAMILY at
this time. this could be changed at some point if desired, but the
logic for getting all the corner cases with AI_ADDRCONFIG right is
slightly nontrivial.
EAI_MEMORY is not possible (but would not provide errno if it were)
and EAI_FAIL does not provide errno. treat the latter as EBADMSG to
match how it's handled in gethostbyname2_r (it indicates erroneous or
failure response from the nameserver).
EAI_MEMORY is not possible because the resolver backend does not
allocate. if it did, it would be necessary for us to explicitly return
ENOMEM as the error, since errno is not guaranteed to reflect the
error cause except in the case of EAI_SYSTEM, so the existing code was
not correct anyway.
these functions are horribly underspecified, inconsistent between
historical systems, and should never have been included. however, the
signatures we have match the glibc ones, and the glibc behavior is to
treat NxDomain and NODATA results as a success condition, not an
ENOENT error.
this distinction only affects search, but allows search to continue
when concatenating one of the search domains onto the requested name
produces a result that's not valid. this can happen when the
concatenation is too long, or one of the search list entries is
itself not valid.
as a consequence of this change, having "." in the search domains list
will now be ignored/skipped rather than making the lookup abort with
no results (due to producing a concatenation ending in ".."). this
behavior could be changed later if needed.
the main loop already errors out on zero-length labels within the
name, but terminates before having a chance to check for an erroneous
final zero-length label, instead producing a malformed query packet
with a '.' byte instead of the terminating zero.
rather than poke at the loop logic, simply detect this condition early
and error out without doing anything.
this also fixes behavior of getaddrinfo when "." appears in the search
domain list, which produces a name ending in ".." after concatenation,
at least in the sense of no longer emitting malformed packets on the
network. however, due to other issues, the lookup will still fail.
After commit 5b74eed3b3 the timer thread
doesn't check whether timer_create() actually created the timer,
proceeding to wait for a signal that might never arrive. We can't fix
this by simply checking for a negative timer_id after
pthread_barrier_wait() because we have no way to distinguish a timer
creation failure and a request to delete a timer with INT_MAX id if it
happens to arrive quickly (a variation of this bug existed before
5b74eed3b3, where the timer would be
leaked in this case). So (ab)use the cancel field of pthread_t instead.
commit 4486c579cb disabled vdso
clock_gettime on arm due to a Linux kernel bug that was not understood
at the time, whereby the vdso function silently produced
catastrophically wrong results on some systems.
since then, the bug was tracked down to the way the arm kernel
disabled use of vdso clock_gettime on kernels where the necessary
timer was not available or was disabled. it simply patched out the
symbols, but it only did this for the legacy time32 functions, and
left the time64 function in place but non-operational. kernel commit
4405bdf3c57ec28d606bdf5325f1167505bfdcd4 (first present in 5.8)
provided the fix.
if this were a bug that impacted all users of the broken kernel
versions, we could probably ignore it and assume it had been patched
or replaced. however, it's very possible that these kernels appear in
the wild in devices running time32 userspace (glibc, musl 1.1.x, or
some other environment) where they appear to work fine, but where our
new binaries would fail catastrophically if we used the time64 vdso
function.
since the kernel has not (yet?) given us a way to probe for the
working time64 vdso function semantically, we work around the problem
by refusing to use the time64 one unless the time32 one is also
present. this will revert to not using vdso at all if the time32 one
is ever removed, but at least that's safe against wrong results and is
just a missed optimization.
commit d32dadd60e added DT_RELR
processing for programs and shared libraries processed by the dynamic
linker, but left them unsupported in the dynamic linker itself and in
static pie binaries, which self-relocate via code in dlstart.c.
add the equivalent processing to this code path so that there are not
arbitrary restrictions on where the new packed relative relocation
form can be used.
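for reference, decoding the packed format amounts to something like this
(a sketch; the in-tree code is written to the constraints of the early-boot
environment):

    #include <stdint.h>
    #include <stddef.h>

    /* decode a DT_RELR table: even entries are addresses, odd entries are
       bitmaps covering the following (8*sizeof(size_t) - 1) words */
    static void apply_relr(uintptr_t base, const size_t *relr, size_t relrsz)
    {
        size_t *where = 0;
        for (size_t i = 0; i < relrsz/sizeof(size_t); i++) {
            if (!(relr[i] & 1)) {
                where = (size_t *)(base + relr[i]);
                *where++ += base;
            } else {
                size_t *p = where;
                for (size_t bits = relr[i] >> 1; bits; bits >>= 1, p++)
                    if (bits & 1) *p += base;
                where += 8*sizeof(size_t) - 1;
            }
        }
    }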
open_wmemstream's write method was written assuming no buffering,
since it sets the FILE up with buf_len of zero in order to avoid
issues with position/seeking. however, as a consequence of commit
bd57e2b43a, a FILE being written to by
the printf core has a temporary local buffer for the duration of the
operation if it was unbuffered to begin with. since this was
disregarded by the wide memstream's write method, output produced
through this code path, particularly numeric fields, was missing from
the output wchar buffer.
copy the equivalent logic for using the buffered data from the
byte-oriented open_memstream.
if resolv.conf lists no nameservers at all, the default of 127.0.0.1
is used. however, another "no nameservers" case arises where the
system has ipv6 support disabled/configured-out and resolv.conf only
contains v6 nameservers. this caused the resolver to repeat socket
operations that will necessarily fail (sending to one or more
wrong-family addresses) while waiting for a timeout.
it would be contrary to configured intent to query 127.0.0.1 in this
case, but the current behavior is not conducive to diagnosing the
configuration problem. instead, fail immediately with EAI_SYSTEM and
errno==EAFNOSUPPORT so that the configuration error is reportable.
use the legacy constant values if the kernel does not provide
AT_MINSIGSTKSZ (__getauxval will return 0 in this case) and as a
safety check if something is wrong and the provided value is less than
the legacy constant.
sysconf(_SC_SIGSTKSZ) returns SIGSTKSZ adjusted for the difference
between the legacy constant MINSIGSTKSZ and the runtime value, so that
the working space the application has on top of the minimum remains
invariant under changes to the minimum.
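in other words, roughly (a sketch; __minsigstksz stands for whatever
internal variable caches the AT_MINSIGSTKSZ-derived value, floored at the
legacy MINSIGSTKSZ):

    case _SC_MINSIGSTKSZ:
        return __minsigstksz;
    case _SC_SIGSTKSZ:
        return SIGSTKSZ + __minsigstksz - MINSIGSTKSZ;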
as a result of ISA extensions exploding register file sizes on some
archs, using a constant for minimum signal stack size no longer seems
viably future-proof. add sysconf keys allowing the kernel to provide a
machine-dependent minimum applications can query to ensure they
allocate sufficient space for stacks. the key names and indices align
with the same functionality in glibc.
see commit d5a5045382 for previous
action on this subject.
ultimately, the macros MINSIGSTKSZ and SIGSTKSZ probably need to be
deprecated, but that is standards-amendment work outside the scope of
a single implementation.
apparently this code path was never tested, as it's not usual to have
v6 nameservers listed on a system without v6 networking support. but
it was always intended to work.
when reverting to binding a v4 address, also revert the family in the
sockaddr structure and the socklen for it. otherwise bind will just
fail due to mismatched family/sockaddr size.
This is a part of the interface contract defined in the Linux man
page (official for a Linux-specific interface) and asserted by test
cases in the Linux Test Project (LTP).
a request for this behavior has been open for a long time. the
motivation is that application code, particularly under some language
runtimes designed around very-low-footprint coroutine type constructs,
may be operating with extremely small stack sizes unsuitable for
receiving signals, using a separate signal stack for any signals it
might handle.
progress on this was blocked at one point trying to determine whether
the implementation is actually entitled to clobber the alt stack, but
the phrasing "available to the implementation" in the POSIX spec for
sigaltstack seems to make it clear that the application cannot rely on
the contents of this memory to be preserved in the absence of signal
delivery (on the abstract machine, excluding implementation-internal
signals) and that we can therefore use it for delivery of signals that
"don't exist" on the abstract machine.
no change is made for SIGTIMER since it is always blocked when used,
and accepted via sigwaitinfo rather than execution of the signal
handler.
breaking out of the switch-case when l==-1 means the conditional below
will necessarily be true (-1 >= buf_size, a size_t variable) and the
function will return 0. it is, however, somewhat unclear that that's
what's happening. simply returning there is simpler.
this is a requirement of the C language (orientation) and POSIX
(encoding rule) that was somehow overlooked.
we rely on the fact that the buffer pointers have been reset by
fflush, so that any future stdio operations on the stream will go
through the same code paths they would on a newly-opened file without
an orientation set, thereby setting the orientation as they should.
the way RELR is applied is not a meaningful operation for FDPIC (there
is no single "base" address). it seems unlikely RELR would ever be
added for FDPIC, but if it ever is, the behavior and possibly data
format will need to be different, so guard against calling the
non-FDPIC code.
the syscall used to probe availability of the clock fails with EINVAL
when the requested pid does not exist, but clock_getcpuclockid is
specified to use ESRCH for this purpose.
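the shape of the fix (a sketch; pid_cpu_clockid is a stand-in for however
the per-process CPU clock id is encoded):

    clockid_t id = pid_cpu_clockid(pid);
    struct timespec ts;
    int r = __syscall(SYS_clock_getres, id, &ts);
    if (r == -EINVAL) r = -ESRCH;   /* probe failure for a nonexistent pid */
    if (r) return -r;
    *clk = id;
    return 0;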
The generic vfork implementation uses clone(SIGCHLD) which has fork
semantics.
Implement vfork as clone(SIGCHLD|CLONE_VM|CLONE_VFORK, 0) instead which
has vfork semantics. (stack == 0 means sp is unchanged in the child.)
Some users rely on vfork semantics when memory overcommit is disabled
or when the vfork child runs code that synchronizes with the parent
process (non-conforming).
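The generic path then becomes, roughly (a sketch; as with any C-level
vfork, the child must immediately _exit or exec):

    pid_t vfork(void)
    {
        /* stack argument 0: the child keeps the parent's sp; CLONE_VFORK
           suspends the parent until the child execs or exits */
        return syscall(SYS_clone, SIGCHLD|CLONE_VM|CLONE_VFORK, 0);
    }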
this code attempts to use the value of errno from failure of socket or
connect to infer availability of the requested address family (v4 or
v6). however, in the case where connect failed, there is an
intervening call to close between connect and the use of errno. close
is not required to preserve errno on success, and in fact the
__aio_close code, which is called whenever aio is linked and thus
always called in dynamic-linked programs, unconditionally clobbers
errno. as a result, getaddrinfo fails with EAI_SYSTEM and errno=ENOENT
rather than correctly determining that the address family was
unavailable.
this fix is based on report/patch by Jussi Nieminen, but simplified
slightly to avoid breaking the case where socket, not connect, failed.
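the fix amounts to carrying errno across the close (a sketch; names
illustrative):

    r = connect(s, addr, addrlen);
    saved_errno = errno;
    close(s);                  /* may clobber errno, e.g. via __aio_close */
    if (r < 0) errno = saved_errno;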
while the error handling function should not be reached in stage 2
(assuming ldso itself was linked correctly), this was not statically
determinate from the compiler's perspective, and in theory a compiler
performing LTO could lift the TLS references (errno and other things)
out of the printf-family functions called in a stage where TLS is not
yet initialized.
instead, perform the call via a static-storage, internal-linkage
function pointer which will be set to a no-op function until the stage
where the real error handling function should be reachable.
inspired by commit 63c67053a3.
if LTO is enabled, gcc hoists the call to ___errno_location outside the
loop even though the access to errno is gated behind head != &ldso
because ___errno_location is marked __attribute__((const)). this causes
the program to crash because TLS is not yet initialized when called from
__dls2. this is also possible if LTO is not enabled; even though gcc 11
doesn't do it, it is still wrong to use errno here.
since the start and end are already aligned, we can simply call
__syscall instead of using global errno.
Fixes: e13a2b8953 ("implement PT_GNU_RELRO support")
this is not an issue that was actually hit, but I noticed it during
previous changes to __randname: if the resolution of tv_nsec is too
low, the space of temp file names obtainable by a thread could
plausibly be exhausted. mixing in tv_sec avoids this.
the __randname function is used by various temp file creation
interfaces as a backend to produce a name to attempt using. it does
not have to produce results that are safe against guessing, and only
aims to avoid unintentional collisions.
mixing the address of an object on the stack in a reversible manner
leaked ASLR information, potentially allowing an attacker who can
observe the temp files created and their creation timestamps to narrow
down the possible ASLR state of the process that created them. there
is no actual value in mixing these addresses in; it was just
obfuscation. so don't do it.
instead, mix the tid, just to avoid collisions if multiple
processes/threads stampede to create temp files at the same moment.
even without this measure, they should not collide unless the clock
source is very low resolution, but it's a cheap improvement.
if/when we have a guaranteed-available userspace csprng, it could be
used here instead. even though there is no need for cryptographic
entropy here, it would avoid having to reason about clock resolution
and such to determine whether the behavior is nice.
assuming a reasonable realtime clock, res_mkquery is highly unlikely
to generate the same query id twice in a row, but it's possible with a
very low-resolution system clock or under extreme delay of forward
progress. when it happens, res_msend fails to wait for both answers,
and instead stops listening after getting two answers to the same
query (A or AAAA).
to avoid this, increment one byte of the second query's id if it
matches the first query's. don't bother checking if the second byte is
also equal, since it doesn't matter; we just need to ensure that at
least one byte is distinct.
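the check is as small as (a sketch; qbuf standing for the two assembled
query packets, whose first two bytes are the id):

    if (nq == 2 && !memcmp(qbuf[0], qbuf[1], 2))
        qbuf[1][1]++;   /* one differing byte suffices */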
commit 05973dc3bb made it so that lines
longer than INT_MAX can in theory be read, but did not use a suitable
type for the positions determined by sscanf. we could change to using
size_t, but since the signature for getmntent_r does not admit lines
longer than INT_MAX, it does not make sense to support them in the
legacy thread-unsafe form either -- the principle here is that there
should not be an incentive to use the unsafe function to get added
functionality.
According to fstab(5), the last two fields are optional, but this
wasn't accepted. After this change, only the first field is required,
which matches glibc's behaviour.
Using sscanf as before, it would have been impossible to differentiate
between 0 fields and 4 fields, because sscanf would have returned 0 in
both cases due to the use of assignment suppression and %n for the
string fields (which is important to avoid copying any strings). So
instead, before calling sscanf, initialize every string to the empty
string, and then we can check which strings are empty afterwards to
know how many fields were matched.
this avoids the need for implementation-internal callers to depend on
the nonstandard AT_EMPTY_PATH extension to use __fstatat and isolates
knowledge of that extension to the implementation of __fstat.
this function is used to implement some baseline ISO C interfaces, so
it cannot call any of the stat functions by their public names. use
the namespace-safe __fstatat instead.
instead, use the fstatat/stat functions, so that the logic for which
syscalls are present and usable is all in fstatat.
this results in a slight increase in cost for old kernels on 32-bit
archs: now statx will be attempted first rather than just using the
legacy time32 syscalls, despite us not caring about timestamps.
however, it's not even clear that the legacy syscalls *should* succeed
if the timestamps are out of range; arguably they should fail with
EOVERFLOW. as such, paying a small cost here on old kernels seems
well-motivated.
with this change, fchmodat itself is no longer blocking ports to new
archs that lack the legacy syscalls.
this change serves two purposes:
1. it eliminates one of the few remaining uses of the kernel stat
structure which will not be present in future archs, avoiding the need
for growing ifdef logic here.
2. it potentially makes the operations less expensive when the
candidate exists as a non-symlink by avoiding the need to read the
inode (assuming the directory tables suffice to distinguish symlinks).
this uses the idiom I discovered while rewriting realpath for commit
29ff7599a4 of being able to use the
readlink operation as an inexpensive probe for file existence that
doesn't follow symlinks.
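the idiom relies on readlink's result/errno distinguishing the cases
without following the candidate (a sketch):

    char lnk[1];
    if (readlink(name, lnk, sizeof lnk) < 0 && errno == ENOENT) {
        /* nothing exists at name */
    } else {
        /* something exists: a symlink (readlink succeeded, possibly
           truncated) or a non-symlink (typically EINVAL) */
    }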
_CS_POSIX_V7_THREADS_CFLAGS and _CS_POSIX_V7_THREADS_LDFLAGS have been
missing for a long time, which is a conformance defect. we were
waiting on glibc to add them or at least agree on the numeric values
they will have so as to keep the numbering aligned. it looks like they
will be added to glibc with these numbers, and in any case, this list
does not have any significant churn that would result in the numbers
getting taken.
the change to support passing null was rejected in the past on the
grounds that GNU gettext documented it as undefined, on an assumption
that only glibc accepted it and that the standalone GNU gettext did
not. but it turned out that both explicitly accept it.
in light of this, since some software assumes null can be passed
safely, allow it.
newlocale and freelocale use __libc_malloc and __libc_free, but
duplocale used malloc. If malloc was replaced, this resulted in
invalid free using the wrong allocator when passing the result of
duplocale to freelocale.
Instead, use libc-internal malloc for duplocale.
This bug was introduced by commit
1e4204d522.
sys/reg.h already had it right as 32, to which it was explicitly
changed when commit 664cd34192 derived
x32 from x86_64. but the copy exposed in sys/user.h was missed.
see
linux commit 90f093fa8ea48e5d991332cee160b761423d55c1
rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request
the struct type got __ prefix to follow existing practice.
see
linux commit 321827477360934dc040e9d3c626bf1de6c3ab3c
icmp: don't send out ICMP messages with a source address of 0.0.0.0
"RFC7600 reserves a dummy address to be used as a source for ICMP
messages (192.0.0.8/32), so let's teach the kernel to substitute that
address as a last resort if the regular source address selection procedure
fails."
see
linux commit a49f4f81cb48925e8d7cbd9e59068f516e984144
arch: Wire up Landlock syscalls
linux commit 17ae69aba89dbfa2139b7f8024b757ab3cc42f59
Merge tag 'landlock_v34' of ... jmorris/linux-security
Landlock provides for unprivileged application sandboxing. The goal of
Landlock is to enable to restrict ambient rights (e.g. global filesystem
access) for a set of processes. Landlock is inspired by seccomp-bpf but
instead of filtering syscalls and their raw arguments, a Landlock rule
can restrict the use of kernel objects like file hierarchies, according
to the kernel semantic.
see
linux commit 7eeba1706eba6def15f6cb2fc7b3c3b9a2651edc
tcp: Add receive timestamp support for receive zerocopy.
linux commit 3c5a2fd042d0bfac71a2dfb99515723d318df47b
tcp: Sanitize CMSG flags and reserved args in tcp_zerocopy_receive.
TCP_NLA_EDT was new in v5.9, see
linux commit 48040793fa6003d211f021c6ad273477bcd90d91
tcp: add earliest departure time to SCM_TIMESTAMPING_OPT_STATS
TCP_NLA_TTL is new in v5.12, see
linux commit e7ed11ee945438b737e2ae2370e35591e16ec371
tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS
PTRACE_OLDSETOPTIONS is old, but it was missing, PTRACE_SYSEMU and
PTRACE_SYSEMU_SINGLESTEP are new, see
linux commit 56e62a73702836017564eaacd5212e4d0fa1c01d
s390: convert to generic entry
new syscall to change the properties of a mount or a mount tree using
file descriptors which the new mount api is based on, see
linux commit 2a1867219c7b27f928e2545782b86daaf9ad50bd
fs: add mount_setattr()
see
linux commit a54f0dfda754c5cecc89a14dab68a3edc1e497b5
signal: define the SA_UNSUPPORTED bit in sa_flags
linux commit 6ac05e832a9e96f9b1c42a8917cdd317d7b6c8fa
signal: define the SA_EXPOSE_TAGBITS bit in sa_flags
Note: SA_ is in the posix reserved namespace so these linux specific flags
can be exposed when compiling for posix.
unlike other si_code defines, SYS_ is not in the posix reserved namespace
which is likely the reason why SYS_SECCOMP was previously missing (was new
in linux v3.5).
see
linux commit 18fb76ed53865c1b5d5f0157b1b825704590beb5
net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
linux commit 94ab9eb9b234ddf23af04a4bc7e8db68e67b8778
net-zerocopy: Defer vm zap unless actually needed.
see
linux commit 1446e1df9eb183fdf81c3f0715402f1d7595d4cb
kernel: Implement selective syscall userspace redirection
linux commit 36a6c843fd0d8e02506681577e96dabd203dd8e8
entry: Use different define for selector variable in SUD
redirect syscalls to a userspace handler via SIGSYS, except for a specific
range of code. can be toggled via a memory write to a selector variable.
mainly for wine.
see
linux commit b0a0c2615f6f199a656ed8549d7dce625d77aa77
epoll: wire up syscall epoll_pwait2
linux commit 58169a52ebc9a733aeb5bea857bc5daa71a301bb
epoll: add syscall epoll_pwait2
epoll_wait with struct timespec timeout instead of int. no time32 variant.
This reduces entropy of the canary from 64-bit to 56-bit in exchange
for mitigating non-terminated C string overflows by setting the second
byte of the canary to nul, so that off-by-one write overflow with a
nul byte can still be detected.
Idea from GrapheneOS bionic commit 7024d880b51f03a796ff8832f1298f2f1531fd7b
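The adjustment itself is one store on the freshly initialized canary (a
sketch, not necessarily the exact code):

    /* zero the second byte so that an overflowing nul-terminated string
       one byte past the buffer still changes the canary value */
    ((unsigned char *)&__stack_chk_guard)[1] = 0;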
gcc-12 with -frounding-mode will do inexact constant conversions at
runtime according to the runtime rounding mode.
in the math library we want constants to be rounding mode independent
so this patch fixes cases where new runtime conversions happen with
gcc-12.
fortunately this only affects two minor cases, the fix uses global
initializers where rounding mode does not apply.
after the patch the same amount of conversions happen with gcc-12 as
with gcc-11.
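an illustration of the class of change (not the actual code sites):

    /* before: the inexact conversion of the double constant to float is
       performed at run time under gcc-12 -frounding-math, so the result
       depends on the current rounding mode */
    float pio2f_old(void) { return 1.57079632679489661923; }

    /* after: a global initializer is converted at translation time with
       default rounding, independent of the runtime rounding mode */
    static const float pio2_f = 1.57079632679489661923;
    float pio2f_new(void) { return pio2_f; }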
commit a90d9da1d1 made fgetws look for
changes to errno by fgetwc to detect encoding errors, since ISO C did
not allow the implementation to set the stream's error flag in this
case, and the fgetwc interface did not admit any other way to detect
the error. however, the possibility of fgetwc setting errno to EILSEQ
in the success path was overlooked, and in fact this can happen if the
buffer ends with a partial character, causing mbtowc to be called with
only part of the character available.
since that change was made, the C standard was amended to specify that
fgetwc set the stream error flag on encoding errors, and commit
511d70738b made it do so. thus, there is
no longer any need for fgetws to poke at errno to handle encoding
errors.
this commit reverts commit a90d9da1d1
and thereby fixes the problem.
this bug goes back to commit 1cc81f5cb0
where zoneinfo file support was first added. in scan_trans, which
searches for the appropriate local time/dst rule in effect at a given
time, times prior to the second transition time caused the -1 slot of
the index to be read to determine the previous rule in effect. this
memory was always valid (part of another zoneinfo table in the mapped
file) but the byte value read was then used to index another table,
possibly going outside the bounds of the mmap. most of the time, the
result was limited to misinterpretation of the rule in effect at that
time (pre-1900s), but it could produce a crash if adjacent memory was
not readable.
the root cause of the problem, however, was that the logic for this
code path was all wrong. as documented in the comment, times before
the first transition should be treated as using the lowest-numbered
non-dst rule, or rule 0 if no non-dst rules exist. if the argument is
in units of local time, however, the rule prior to the first
transition is needed to determine if it falls before or after it, and
that's where the -1 index was wrongly used.
instead, use the documented logic to find out what rule would be in
effect before the first transition, and apply it as the offset if the
argument was given in local time.
the new code has not been heavily tested, but no longer performs
potentially out-of-bounds accesses, and successfully handles the 1883
transition from local mean time to central standard time in the test
case the error was reported for.
these are specified to use the sign of the imaginary part of the input
as the sign of zero in the result, but wrongly copied the sign of the
real part.
this is a POSIX requirement. we previously relied on the underlying fd
(or other backend) seek operation to produce the error, but since
linux lseek now supports other seek modes (SEEK_DATA and SEEK_HOLE)
which do not interact well with stdio buffering, this is insufficient.
instead, explicitly check whence before performing any operations.
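a minimal sketch of the added check, assuming the usual whence macros:

    #include <errno.h>
    #include <stdio.h>

    /* reject unsupported whence values up front instead of relying on the
     * underlying seek operation to produce the error */
    static int check_whence(int whence)
    {
        if (whence != SEEK_SET && whence != SEEK_CUR && whence != SEEK_END) {
            errno = EINVAL;
            return -1;
        }
        return 0;
    }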
these are linux specific constants. glibc exposes them behind
_GNU_SOURCE, but, since SEEK_* is reserved for the implementation, we
can simply define them. furthermore, since they can't be used with
fseek() and other functions that deal with FILE, we don't add them to
stdio.h.
these characters combine onto a base character (initial) and therefore
need to have width 0. the original binary-search implementation of
wcwidth handled them correctly, but a regression was introduced in
commit 1b0ce9af6d by generating the new
tables from unicode without noticing that the classification logic in
use (unicode character category Mn/Me/Cf) was insufficient to catch
these characters.
strtod_l, strtof_l, and strtold_l originally existed only as
glibc-ABI-compat symbols. as noted in the commit which added them,
17a60f9d32, making them aliases for the
non-_l functions was a hack and not appropriate if they ever became
public API.
unfortunately, commit 35eb1a1a9b did
make them public without undoing the hack. fix that now by moving
the _l functions to their own file as wrappers that just throw away
the locale_t argument.
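the wrappers amount to something like the following sketch (declarations
simplified; the separate-file split is the point of the change):

    #include <stdlib.h>
    #include <locale.h>

    /* the _l variants simply discard the locale argument */
    float strtof_l(const char *restrict s, char **restrict p, locale_t l)
    {
        return strtof(s, p);
    }

    double strtod_l(const char *restrict s, char **restrict p, locale_t l)
    {
        return strtod(s, p);
    }

    long double strtold_l(const char *restrict s, char **restrict p, locale_t l)
    {
        return strtold(s, p);
    }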
commit 7be59733d7 introduced the
hwcap-based branches to support the SPE FPU, but wrongly coded them as
bitwise tests on the computed address of __hwcap, not a value loaded
from that address. replace the add with indexed load to fix it.
the snd_pcm_mmap_control struct used with SNDRV_PCM_IOCTL_SYNC_PTR was
mistakenly defined in the kernel uapi with "before u32" padding both
before and after the first u32 member. our conversion between the
modern struct and the legacy time32 struct was written without
awareness of that mistake, and assumed the time64 version of the
struct was the intended form with padding to match the layout on
64-bit archs. as a result, the struct was not converted correctly when
running on old kernels, with audio glitches as the likely result.
this was discovered thanks to a related bug in the kernel, whereby
32-bit userspace running on a 64-bit kernel also suffered from the
types mismatching. the mistaken layout is now the ABI and can't be
changed -- or at least making a new ioctl to change it would just
result in a worse situation.
our conversion here is changed to treat the snd_pcm_mmap_control
substruct as two separate substructs at locations dependent on
endianness (since the displacement depends on endianness), using the
existing conversion framework.
we make qsort a wrapper around qsort_r by providing a wrapper_cmp function
that uses the extra argument as a function pointer. should be optimized to a tail
call on most architectures, as long as it's built with
-fomit-frame-pointer, so the performance impact should be minimal.
to keep the git history clean, for now qsort_r is implemented in qsort.c
and qsort is implemented in qsort_nr.c. qsort.c also received a few
trivial cleanups, including replacing (*cmp)() calls with cmp().
qsort_nr.c contains only wrapper_cmp and qsort as a qsort_r wrapper
itself.
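a sketch of the arrangement described above (simplified; internal entry
point names may differ):

    #define _GNU_SOURCE
    #include <stdlib.h>

    /* adapt a two-argument comparison function to the three-argument
     * qsort_r form by smuggling it through the context pointer */
    static int wrapper_cmp(const void *a, const void *b, void *ctx)
    {
        return ((int (*)(const void *, const void *))ctx)(a, b);
    }

    void qsort(void *base, size_t n, size_t sz,
               int (*cmp)(const void *, const void *))
    {
        /* ideally compiled as a tail call into qsort_r */
        qsort_r(base, n, sz, wrapper_cmp, (void *)cmp);
    }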
When the soft-float ABI for PowerPC was added in commit
5a92dd95c7, with Freescale cpus using
the alternative SPE FPU as the main use case, it was noted that we
could probably support hard float on them, but that it would involve
determining some difficult ABI constraints. This commit is the
completion of that work.
The Power-Arch-32 ABI supplement defines the ABI profiles, and indeed
ATR-SPE is built on ATR-SOFT-FLOAT. But setjmp/longjmp compatibility
are problematic for the same reason they're problematic on ARM, where
optional float-related parts of the register file are "call-saved if
present". This requires testing __hwcap, which is now done.
In keeping with the existing powerpc-sf subarch definition, which did
not have fenv, the fenv macros are not defined for SPE and the SPEFSCR
control register is left (and assumed to start in) the default mode.
both passing a null pointer to memcpy with length 0, and adding 0 to a
null pointer, are undefined. in some sense this is 'benign' UB, but
having it precludes use of tooling that strictly traps on UB. there
may be better ways to fix it, but conditioning the operations which
are intended to be no-ops in the k==0 case on k being nonzero is a
simple and safe solution.
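a minimal sketch of the pattern (hypothetical helper):

    #include <string.h>

    /* dst/src may be null when k == 0, so guard the operations that are
     * meant to be no-ops in that case rather than relying on
     * memcpy(NULL,...,0) and NULL+0 being harmless */
    static void append(unsigned char **dst, const unsigned char *src, size_t k)
    {
        if (k) {
            memcpy(*dst, src, k);
            *dst += k;
        }
    }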
commit 6d99ad91e8 introduced this
regression as part of a larger change, based on an incorrect
assumption that rdhwr being part of the mips r2 ISA level meant that
the TLS register, known in the mips documentation as UserLocal, was
unconditionally present on chips providing this ISA level and would
not need trap-and-emulate. this turns out to be false.
based on research by Stanislav Kljuhhin and Abilio Marques, who
reported the problem as a performance regression on certain routers
using OpenWRT vs older uclibc-based versions, it turns out the mips
manuals document the UserLocal register as a feature that might or
might not be implemented or enabled, reflected by a cpu capability bit
in the CONFIG3 register, and that Linux checks for this and has to
explicitly enable it on models that have it.
thus, it's indeed possible that r2+ chips can lack the feature,
bringing us back to the situation where Linux only has a fast
trap-and-emulate path for the case where the destination register is
$3. so, always read the thread pointer through $3. this may incur a
gratuitous move to the desired final register on chips where it's not
needed, but it really doesn't matter.
len is unsigned and can never be smaller than 0. though unlikely, an
error in read() would have led to an out-of-bounds write to name.
Reported-by: Michael Forney <mforney@mforney.org>
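a sketch of the pattern (hypothetical names): the result of read must land
in a signed type and be checked before it is used as a length.

    #include <unistd.h>

    /* with an unsigned length, a -1 return from read would wrap to a huge
     * value and the nul terminator would be written far out of bounds */
    static int read_name(int fd, char *name, size_t size)
    {
        ssize_t n = read(fd, name, size - 1);
        if (n < 0) return -1;
        name[n] = 0;
        return 0;
    }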
due to historical reasons, the mips signal set has 128 bits rather
than 64 like on every other arch. this was special-cased correctly, at
least for 32-bit mips, at one time, but was inadvertently broken in
commit 7c440977db, and seems never to
have been right on mips64/n32.
as a consequence of this bug, applications making use of high realtime
signal numbers on mips may have been able to execute application code
in contexts where doing so was unsafe.
the kernel structure has padding of the shm_segsz member up to 64
bits, as well as 2 unused longs at the end. somehow that was
overlooked when the powerpc port was added, and it has been broken
ever since; applications compiled with the wrong definition do not
correctly see the shm_segsz, shm_cpid, and shm_lpid members.
fixing the definition just by adding the missing padding would break
the ABI size of the structure as well as the position of the time64
shm_atime and shm_dtime members we added at the end. instead, just
move one of the unused padding members from the original end (before
time64) of the structure to the position of the missing padding. this
preserves size and preserves correct behavior of any compiled code
that was already working. programs affected by the wrong definition
need to be recompiled with the correct one.
previously, the contents of the TZ variable were considered a
candidate for a file/path name only if they began with a colon or
contained a slash before any comma. the latter was very sloppy logic
to avoid treating any valid POSIX TZ string as a file name, but it
also triggered on values that are not valid POSIX TZ strings,
including 3-letter timezone names without any offset.
instead, only treat the TZ variable as POSIX form if it begins with a
nonzero standard time name followed by +, -, or a digit.
also, special case GMT and UTC to always be treated as POSIX form
(with implicit zero offset) so that a stray file by the same name
cannot break software that depends on setting TZ=GMT or TZ=UTC.
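a sketch of the new heuristic (hypothetical helper, not the actual parser):

    #include <string.h>

    /* TZ is treated as a POSIX TZ string only if it is GMT/UTC or begins
     * with a nonempty alphabetic standard time name followed by '+', '-',
     * or a digit; anything else is a candidate zoneinfo file name */
    static int tz_is_posix_form(const char *s)
    {
        size_t i;
        if (!strcmp(s, "GMT") || !strcmp(s, "UTC")) return 1;
        for (i = 0; (unsigned)(((unsigned char)s[i] | 32) - 'a') < 26; i++);
        if (!i) return 0;
        return s[i] == '+' || s[i] == '-' || (unsigned)(s[i] - '0') < 10;
    }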
POSIX places an obscure requirement on popen which is like a limited
version of close-on-exec:
"The popen() function shall ensure that any streams from previous
popen() calls that remain open in the parent process are closed in
the new child process."
if the POSIX-future 'e' mode flag is passed, producing a pipe FILE
with FD_CLOEXEC on the underlying pipe, this requirement is
automatically satisfied. however, for applications which use multiple
concurrent popen pipes but don't request close-on-exec, fd leaks from
earlier popen calls to later ones could produce deadlock situations
where processes are waiting for a pipe EOF that will never happen.
to fix this, iterate through all open FILEs and add close actions for
those obtained from popen. this requires holding a lock on the open
file list across the posix_spawn call so that additional popen FILEs
are not created after the list is traversed. note that it's still
possible for another popen call to start and create its pipe while the
lock is held, but such pipes are created with O_CLOEXEC and only drop
close-on-exec status (when 'e' flag is omitted) under control of the
lock.
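a sketch of the traversal (the internal FILE list layout is simplified to a
hypothetical struct):

    #include <spawn.h>

    /* hypothetical, simplified view of the internal open-FILE list */
    struct ofl_file { int fd; int pipe_pid; struct ofl_file *next; };

    /* with the list lock held, register a close action for every FILE
     * that came from a previous popen call */
    static int add_popen_closes(posix_spawn_file_actions_t *fa, struct ofl_file *head)
    {
        for (struct ofl_file *f = head; f; f = f->next)
            if (f->pipe_pid && posix_spawn_file_actions_addclose(fa, f->fd))
                return -1;
        return 0;
    }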
the newly allocated FILE * has not yet leaked to the application and
is only visible to stdio internals until popen returns. since we do
not change any fields of the structure observed by libc internals,
only the pipe_pid member, locking is not necessary.
__tls_get_addr should not be called with an invalid TLS module id of
0. in practice it probably "works", returning the DTV length as if it
were a pointer, and the callback should probably not inspect
dlpi_tls_data in this case, but it's likely that some real-world
callbacks use a check on dlpi_tls_data being non-null, rather than on
dlpi_tls_modid being nonzero, to conclude that the module has TLS.
With mallocng, calling posix_memalign() or aligned_alloc() will
SIGSEGV if the internal malloc() call returns NULL. This does not
occur with oldmalloc, which explicitly checks for allocation failure.
this is a Linux-specific function and not covered by POSIX's
requirements for which interfaces are cancellation points, but glibc
makes it one and existing software relies on it being one.
at some point, a review should be done of similar functions that
should also be made cancellation points.
dl_iterate_phdr was wrongly reporting the address of the DSO's PT_TLS
image rather than the calling thread's instance of the TLS. the man
page, which is essentially normative for a nonstandard function of
this sort, clearly specifies the latter. it does not clarify where
exactly within/relative-to the image the pointer should point, but the
reasonable thing to do is match the ABI's DTP offset, and this seems
to be what other implementations do.
popen was special-casing the possibility (only possible when the
parent closed stdin and/or stdout) that the child's end of the pipe
was already on the final desired fd number, in which case there was no
way to get rid of its close-on-exec flag in the child. commit
6fc6ca1a32 made this unnecessary by
implementing the POSIX-future requirement that dup2 file actions with
equal source and destination fd values remove the close-on-exec flag.
this makes it possible to perform actions on file actions objects with
a libc-internal lock held without creating lock order relationships
that are silently imposed on an application-provided malloc.
reportedly the GNU linker can emit such segments, causing spurious
failure to load due to mmap with a length of zero producing EINVAL.
no action is required for such a load map (it's effectively a nop in
the program headers table) so just treat it as always successful.
since 4.1, gcc has had the __returns_twice__ attribute and has
required functions which return twice to carry it; however it has always
applied it automatically to known setjmp-like function names. clang
however does not do this reliably, at least not with -ffreestanding
and possibly under other conditions, resulting in silent emission of
wrong code.
since the symbol name setjmp is in no way special (setjmp is specified
as a macro that could expand to use any implementation-specific symbol
name or names), a compiler is justified not to do anything special
without further hints, and it's reasonable to do what we can to
provide such hints.
gcc 4.0.x and earlier do not recognize the attribute, so make use of
it conditional on __GNUC__ macros. clang and other gcc-like compilers
report (and have always reported) a later "GNUC" version so the
preprocessor conditional should function as desired for them too.
undefine the internal macro after use so that nothing abuses it as a
public feature.
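a sketch of the header arrangement (the helper macro name is illustrative;
jmp_buf is defined earlier in the real header):

    #if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
    #define __setjmp_attr __attribute__((__returns_twice__))
    #else
    #define __setjmp_attr
    #endif

    int setjmp(jmp_buf) __setjmp_attr;

    #undef __setjmp_attr  /* keep the helper out of the public namespace */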
add synchronous and asynchronous tag check failure codes, see
linux commit 74f1082487feb90bbf880af14beb8e29c3030c9f
arm64: mte: Add specific SIGSEGV codes
these are for the aarch64 MTE (memory tagging extension), see
linux commit 1c101da8b971a36695319dce7a24711dc567a0dd
arm64: mte: Allow user control of the tag check mode via prctl()
linux commit af5ce95282dc99d08a27a407a02c763dde1c5558
arm64: mte: Allow user control of the generated random tags via prctl()
path resolution does not follow symlinks on nosymfollow mounts (but
readlink still does), see
linux commit dab741e0e02bd3c4f5e2e97be74b39df2523fc6e
Add a "nosymfollow" mount option.
can cause rseq restart on another cpu to synchronize with global
memory access from rseq critical sections, see
linux commit 2a36ab717e8fe678d98f81c14a0b124712719840
rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
mainly added to linux to allow a central process management service in
android to give MADV_COLD|PAGEOUT hints for other processes, see
linux commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
mm/madvise: introduce process_madvise() syscall: an external memory
hinting API
the historical function was specified to return an empty string in the
caller-provided buffer, not a null pointer, to indicate error when the
argument is non-null. only when the argument is null should it return
a null pointer on error.
getpwuid_r can return 0 but without a result in the case where there
was no error but no record exists. in that case cuserid was treating
it as success and copying junk out of pw.pw_name to the output buffer.
this function was removed from the standard in 2001 but appeared in
SUSv2 with an obligation to support calls with a null pointer
argument, using a static buffer.
the threshold was wrong, so expm1f overflowed to inf a bit too early.
on most targets a uint32_t compare is faster than a float compare, so
use that.
this also fixes sinhf incorrectly returning nan for some values where
the internal expm1f overflowed.
on some negative inputs (e.g. -0x1.1e6ae8p+5) acoshf failed to return
nan. ensure that negative inputs result in nan without introducing new
branches. this was tried before in
commit 101e601285
math: fix acoshf on negative values
but that fix was wrong. there are 3 formulas used:
log1p(x-1 + sqrt((x-1)*(x-1)+2*(x-1)))
log(2*x - 1/(x+sqrt(x*x-1)))
log(x) + 0.693147180559945309417232121458176568
the first fails on large negative inputs (may compute log1p(0) or
log1p(inf)), the second one fails on some mid range or large negative
inputs (may compute log(large) or log(inf)) and the last one fails on
-0 (returns -inf).
as an outcome of Austin Group issue #385, future versions of the
standard will require free not to alter the value of errno. save and
restore it individually around the calls to madvise and munmap so that
the cost is not imposed on calls to free that do not result in any
syscall.
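a minimal sketch of the pattern on one such path:

    #include <errno.h>
    #include <sys/mman.h>

    /* save and restore errno around the syscall so free never alters it;
     * paths that make no syscall are left untouched and pay no cost */
    static void free_unmap(void *base, size_t len)
    {
        int e = errno;
        munmap(base, len);
        errno = e;
    }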
commit 7586360bad removed the unused
arguments from the definition of __libc_start_main, making it
incompatible with the declaration at the point of call, which still
passed 6 arguments. calls with mismatched function type have undefined
behavior, breaking LTO and any other tooling that checks for function
signature mismatch.
removing the extra arguments from the point of call (crt1) is not an
option for fixing this, since that would be a change in ABI surface
between application and libc.
adding back the extra arguments requires some care. on archs that pass
arguments on the stack or that reserve argument spill space for the
callee on the stack, it imposes an ABI requirement on the caller to
provide such space. the modern crt1.c entry point provides such space,
but originally there was arch-specific asm for the call to
__libc_start_main. the last of this asm was removed in commit
6fef8cafbd, and manual review of the
code removed and its prior history was performed to check that all
archs/variants passed the legacy init/fini/ldso_fini arguments.
these functions are specified to fail with EBADF on negative fd
arguments. apart from close, they are also specified to fail if the
value exceeds OPEN_MAX, but as written it is not clear that this
imposes any requirement when OPEN_MAX is not defined, and it's
undesirable to impose a dynamic limit (via setrlimit) here since the
limit at the time of posix_spawn may be different from the limit at
the time of setting up the file actions. this may require revisiting
later.
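a sketch of the added validation (one affected function shown; the body of
the action handling is elided):

    #include <errno.h>
    #include <spawn.h>

    int posix_spawn_file_actions_addclose(posix_spawn_file_actions_t *fa, int fd)
    {
        if (fd < 0) return EBADF;
        /* ... append the close action as before ... */
        return 0;
    }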
commit 2412638bb3 got the size of struct
v4l2_event wrong and failed to account for the fact that the old
struct might be either 120 bytes with time misaligned mod 8, or 128
bytes with time aligned mod 8, due to the contained union having
64-bit members whose alignment is arch-dependent.
rather than adding new logic to handle the differences, use an actual
stripped-down version of the structure in question to derive the ioctl
number, size, and offsets.
commit 2412638bb3 got the size of struct
v4l2_buffer wrong and omitted the tv_usec member slot from the offset
list, so the ioctl numbers never matched and fallback code path was
never taken. this caused the affected ioctls to fail with ENOTTY on
kernels not new enough to have the native time64 ioctls.
this is necessary for MT-fork correctness now that the code runs under
locale lock. it would not be hard to avoid, but __get_locale is
already using libc-internal malloc anyway. this can be reconsidered
during locale overhaul later if needed.
in general, pthread_once is not compatible with MT-fork constraints
(commit 167390f055). here it actually no
longer matters, because it's now called with a lock held, but since
the lock is held it's pointless to use pthread_once.
this allows the lock to be shared with setlocale, eliminates repeated
per-category lock/unlock in newlocale, and will allow the use of
pthread_once in newlocale to be dropped (to be done separately).
the intent here is just to scan at least l bytes forward for the end
of the haystack and at least some decent minimum to avoid doing it
over and over if the needle is short, with no need to be precise. the
comment erroneously stated this as an estimate for MIN when it's
actually an estimate for MAX.
pthread_once is not compatible with MT-fork constraints (commit
167390f055) and is not needed here
anyway; we already have a lock suitable for initialization.
while changing this, fix a corner case where AT_MINSIGSTKSZ gives a
value that's more than MINSIGSTKSZ but by a margin of less than
2048, thereby causing the size to be reduced. it shouldn't matter but
the intent was to be the larger of a 2048-byte margin over the legacy
fixed minimum stack requirement or a 512-byte margin over the minimum
the kernel reports at runtime.
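a sketch of the intended computation, assuming AT_MINSIGSTKSZ is available
via getauxval (it reads as 0 when absent):

    #include <signal.h>
    #include <sys/auxv.h>

    #ifndef AT_MINSIGSTKSZ
    #define AT_MINSIGSTKSZ 51
    #endif

    /* the effective minimum is the larger of a 2048-byte margin over the
     * legacy fixed MINSIGSTKSZ and a 512-byte margin over the minimum the
     * kernel reports at runtime */
    static long effective_sigstksz(void)
    {
        long legacy = MINSIGSTKSZ + 2048;
        long dynamic = (long)getauxval(AT_MINSIGSTKSZ) + 512;
        return legacy > dynamic ? legacy : dynamic;
    }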
this change should have been made when priority inheritance mutex
support was added. if priority protection is also added at some point
the implementation will need to change and will probably no longer be
a simple bit shuffling.
both __clone and __syscall_cp_asm failed to restore the original value
of r6 after using it as a syscall argument register. the extent of
breakage is not known, and in some cases may be mitigated by the only
callers being internal to libc; if they used r6 but no longer needed
its value after the call, they may not have noticed the problem.
however at least posix_spawn (which uses __clone) was observed
returning to the application with the wrong value in r6, leading to
crash.
since the call frame ABI already provides a place to spill registers,
fixing this is just a matter of using it. in __clone, we also
spuriously restore r6 in the child, since the parent branch directly
returns to the caller. this takes the value from an uninitialized slot
of the child's stack, but is harmless since there is no caller to
return to in the child.
float_t should represent the type that is used to evaluate float
expressions internally. On s390x, float_t is currently set to double.
In contrast, the isa supports single-precision float operations and
compilers by default evaluate float in single precision, which
violates the C standard (sections 5.2.4.2.2 and 7.12 in C11/C17, to be
precise). With -fexcess-precision=standard, gcc evaluates float in
double precision, which aligns with the standard yet at the cost of
added conversion instructions.
gcc-11 will drop the special case to retrofit double precision
behavior for -fexcess-precision=standard so that __FLT_EVAL_METHOD__
will be 0 on s390x in any scenario.
To improve standards compliance and compatibility with future compiler
direction, this patch changes the definition of float_t to be derived
from the compiler's __FLT_EVAL_METHOD__.
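A sketch of the resulting definition (types only; the surrounding type
machinery is omitted):

    /* follow the compiler's declared evaluation method instead of
     * hard-coding double on s390x */
    #if defined(__FLT_EVAL_METHOD__) && __FLT_EVAL_METHOD__ == 1
    typedef double float_t;
    #else
    typedef float float_t;
    #endif
    typedef double double_t;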
reallocarray is an extension introduced by OpenBSD, which introduces
calloc overflow checking to realloc.
glibc 2.28 introduced support for this function behind _GNU_SOURCE,
while glibc 2.29 allows its usage in _DEFAULT_SOURCE.
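the whole function is small enough to sketch in full, following the
published interface contract:

    #include <errno.h>
    #include <stdlib.h>

    /* fail with ENOMEM when m*n would overflow size_t, otherwise defer to
     * realloc */
    void *reallocarray(void *ptr, size_t m, size_t n)
    {
        if (n && m > (size_t)-1 / n) {
            errno = ENOMEM;
            return 0;
        }
        return realloc(ptr, m * n);
    }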
inability to use realpath in chroot/container without procfs access
and at early boot prior to mount of /proc has been an ongoing issue,
and it turns out realpath was one of the last remaining interfaces
that needed procfs for its core functionality. during investigation
while reimplementing, it was determined that there were also serious
problems with the procfs-based implementation. most seriously it was
unsafe on pre-O_PATH kernels, and unlike other places where O_PATH was
used, the unsafety was hard or impossible to fix because O_NOFOLLOW
can't be used (since the whole purpose was to follow symlinks).
the new implementation is a direct one, performing readlink on each
path component to resolve it. an explicit stack, as opposed to
recursion, is used to represent the remaining components to be
processed. the stack starts out holding just the input string, and
reading a link pushes the link contents onto the stack.
unlike many other implementations, this one does not call getcwd
initially for relative pathnames. instead it accumulates initial ..
components to be applied to the working directory if the result is
still a relative path. this avoids calling getcwd (which may fail) at
all when symlink traversal will eventually yield an absolute path. it
also doesn't use any form of stat operation; instead it arranges for
readlink to tell it when a non-directory is used in a context where a
directory is needed. this minimizes the number of syscalls needed,
avoids accessing inodes when the directory table suffices, and reduces
the amount of code pulled in for static linking.
calling lutimes with tv=0 is valid if the application wants to set the
timestamps to the current time. this commit makes it so the timespec
struct is populated with values from tv only if tv != 0 and calls
utimensat with times=0 if tv == 0.
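a sketch of the resulting logic:

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <sys/time.h>

    /* populate the timespec array only when tv is non-null; a null times
     * argument to utimensat means "use the current time" */
    int lutimes(const char *path, const struct timeval tv[2])
    {
        struct timespec times[2];
        if (tv) {
            times[0].tv_sec  = tv[0].tv_sec;
            times[0].tv_nsec = tv[0].tv_usec * 1000;
            times[1].tv_sec  = tv[1].tv_sec;
            times[1].tv_nsec = tv[1].tv_usec * 1000;
        }
        return utimensat(AT_FDCWD, path, tv ? times : 0, AT_SYMLINK_NOFOLLOW);
    }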
Update fanotify.h, see
linux commit 929943b38daf817f2e6d303ea04401651fc3bc05
fanotify: add support for FAN_REPORT_NAME
linux commit 83b7a59896dd24015a34b7f00027f0ff3747972f
fanotify: add basic support for FAN_REPORT_DIR_FID
linux commit 08b95c338e0c5a96e47f4ca314ea1e7580ecb5d7
fanotify: remove event FAN_DIR_MODIFY
FAN_DIR_MODIFY, which was new in v5.7, is now removed from the linux
uapi, but it is kept in musl so we don't break api; linux cannot reuse
the value anyway.
see
linux commit 9b4feb630e8e9801603f3cab3a36369e3c1cf88d
arch: wire-up close_range()
linux commit 278a5fbaed89dacd04e9d052f4594ffd0e0585de
open: add close_range()
linux fails with EINVAL when a zero buffer size is passed to the
syscall. this is non-conforming because POSIX already defines EINVAL
with a significantly different meaning: the target is not a symlink.
since the request is semantically valid, patch it up by using a dummy
buffer of length one, and truncating the return value to zero if it
succeeds.
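a sketch of the workaround (shown as a wrapper rather than the literal
syscall path):

    #include <unistd.h>

    /* substitute a one-byte dummy buffer for a zero-length request, then
     * report a length of zero on success */
    ssize_t readlink_zero_ok(const char *path, char *buf, size_t bufsize)
    {
        char dummy[1];
        if (!bufsize) {
            buf = dummy;
            bufsize = 1;
        }
        ssize_t r = readlink(path, buf, bufsize);
        if (buf == dummy && r > 0) r = 0;
        return r;
    }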
the v1 zoneinfo format with 32-bit time is deprecated. previously, the
v2 parsing code was only used if an exact match for '2' was found in
the version field of the header. this was already incorrect for v3
files (trivial differences from v2 that arguably didn't merit a new
version number anyway) but also failed to be future-proof.
since commit 3814333964, the condition
sizeof(time_t) > 4 is always true, so there is no functional change
being made here. but semantically, the 64-bit tables should always be
preferred now, because upstream zic (zoneinfo compiler) has quietly
switched to emitting empty 32-bit tables by default, and the resulting
backwards-incompatible zoneinfo files will be encountered in the wild.
commit d26e0774a5 moved the detach state
transition at exit before the thread list lock was taken. this
inadvertently allowed pthread_join to race to take the thread list
lock first, and proceed with unmapping of the exiting thread's memory.
we could fix this by just reverting the offending commit and instead
performing __vm_wait unconditionally before taking the thread list
lock, but that may be costly. instead, bring back the old DT_EXITING
vs DT_EXITED state distinction that was removed in commit
8f11e6127f, and don't transition to
DT_EXITED (a value of 0, which is what pthread_join waits for) until
after the lock has been taken.
the original wcsnrtombs implementation, which has been largely
untouched since 0.5.0, attempted to build input-length-limiting
conversion on top of wcsrtombs, which only limits output length. as
best I recall, this choice was made out of a mix of disdain over
having yet another variant function to implement (added in POSIX 2008;
not standard C) and preference not to switch things around and
implement the wcsrtombs in terms of the more general new function,
probably over namespace issues. the strategy employed was to impose
output limits that would ensure the input limit wasn't exceeded, then
finish up the tail character-at-a-time. unfortunately, none of that
worked correctly.
first, the logic in the wcsrtombs loop was wrong in that it could
easily get stuck making no forward progress, by imposing an output
limit too small to convert even one character.
the character-at-a-time loop that followed was even worse. it made no
effort to ensure that the converted multibyte character would fit in
the remaining output space, only that there was a nonzero amount of
output space remaining. it also employed an incorrect interpretation
of wcrtomb's interface contract for converting the null character,
thereby failing to act on end of input, and remaining space accounting
was subject to unsigned wrap-around. together these errors allow
unbounded overflow of the destination buffer, controlled by input
length limit and input wchar_t string contents.
given the extent to which this function was broken, it's plausible
that most applications that would have been rendered exploitable were
sufficiently broken not to be usable in the first place. however, it's
also plausible that common (especially ASCII-only) inputs succeeded in
the wcsrtombs loop, which mostly worked, while leaving the wildly
erroneous code in the second loop exposed to particular non-ASCII
inputs.
CVE-2020-28928 has been assigned for this issue.
after a non-normal-type process-shared mutex is unlocked, it's
immediately available to another thread to lock, unlock, and destroy,
but the first unlocking thread may still have a pointer to it in its
robust_list pending slot. this means, on async process termination,
the kernel may attempt to access and modify the memory that used to
contain the mutex -- memory that may have been reused for some other
purpose after the mutex was destroyed.
setting up for this kind of race to occur is difficult to begin with,
requiring dynamic use of shared memory maps, and actually hitting the
race is very difficult even with a suitable setup. so this is mostly a
theoretical fix, but in any case the cost is very low.
the __vm_wait operation can delay forward progress arbitrarily long if
a thread holding the lock is interrupted by a signal. in a worst case
this can deadlock. any critical section holding the thread list lock
must respect lock ordering contracts and must not take any lock which
is not AS-safe.
to fix, move the determination of thread joinable/detached state to
take place before the killlock and thread list lock are taken. this
requires reverting the atomic state transition if we determine that
the exiting thread is the last thread and must call exit, but that's
easy to do since it's a single-threaded context with application
signals blocked.
as the outcome of Austin Group tracker issue #62, future editions of
POSIX have dropped the requirement that fork be AS-safe. this allows
but does not require implementations to synchronize fork with internal
locks and give forked children of multithreaded parents a partly or
fully unrestricted execution environment where they can continue to
use the standard library (per POSIX, they can only portably use
AS-safe functions).
up until recently, taking this allowance did not seem desirable.
however, commit 8ed2bd8bfc exposed the
extent to which applications and libraries are depending on the
ability to use malloc and other non-AS-safe interfaces in MT-forked
children, by converting latent very-low-probability catastrophic state
corruption into predictable deadlock. dealing with the fallout has
been a huge burden for users/distros.
while it looks like most of the non-portable usage in applications
could be fixed given sufficient effort, at least some of it seems to
occur in language runtimes which are exposing the ability to run
unrestricted code in the child as part of the contract with the
programmer. any attempt at fixing such contracts is not just a
technical problem but a social one, and is probably not tractable.
this patch extends the fork function to take locks for all libc
singletons in the parent, and release or reset those locks in the
child, so that when the underlying fork operation takes place, the
state protected by these locks is consistent and ready for the child
to use. locking is skipped in the case where the parent is
single-threaded so as not to interfere with legacy AS-safety property
of fork in single-threaded programs. lock order is mostly arbitrary,
but the malloc locks (including bump allocator in case it's used) must
be taken after the locks on any subsystems that might use malloc, and
non-AS-safe locks cannot be taken while the thread list lock is held,
imposing a requirement that it be taken last.
this change lifts undocumented restrictions on calls by replacement
mallocs to libc functions that might take these locks, and sets the
stage for lifting restrictions on the child execution environment
after multithreaded fork.
care is taken to #define macros to replace all four functions (malloc,
calloc, realloc, free) even if not all of them will be used, using an
undefined symbol name for the ones intended not to be used so that any
inadvertent future use will be caught at compile time rather than
directed to the wrong implementation.
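a sketch of the pattern, for a file that only needs malloc and calloc; the
"undefined" names are deliberate placeholders with no definition anywhere,
so any stray use fails to build:

    #define malloc __libc_malloc
    #define calloc __libc_calloc
    #define realloc undefined_realloc
    #define free undefined_free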
allowing the application to replace malloc (since commit
c9f415d7ea) has brought multiple
headaches where it's used from various critical sections in libc
components. for example:
- the thread-local message buffers allocated for dlerror can't be
freed at thread exit time because application code would then run in
the context of a non-existent thread. this was handled in commit
aa5a9d15e0 by queuing them for free
later.
- the dynamic linker has to be careful not to pass memory allocated at
early startup time (necessarily using its own malloc) to realloc or
free after redoing relocations with the application and all
libraries present. bugs in this area were fixed several times, at
least in commits 0c5c8f5da6 and
2f1f51ae7b and possibly others.
- by calling the allocator from contexts where libc-internal locks are
held, we impose undocumented requirements on alternate malloc
implementations not to call into any libc function that might
attempt to take these locks; if they do, deadlock results.
- work to make fork of a multithreaded parent give the child an
unrestricted execution environment is blocked by lock order issues
as long as the application-provided allocator can be called with
libc-internal locks held.
these problems are all fixed by giving libc internals access to the
original, non-replaced allocator, for use where needed. it can't be
used everywhere, as some interfaces like str[n]dup, open_[w]memstream,
getline/getdelim, etc. are required to provide the caller with memory
obtained as if by (the public) malloc. and there are a number of libc
interfaces that are "pure library" code, not part of some internal
singleton, and where using the application's choice of malloc
implementation is preferable -- things like glob, regex, etc.
one might expect there to be significant cost to static-linked
programs, pulling in two malloc implementations, one of them
mostly-unused, if malloc is replaced. however, in almost all of the
places where malloc is used internally, care has been taken already
not to pull in realloc/free (i.e. to link with just the bump
allocator). this size optimization carries over automatically.
the newly-exposed internal allocator functions are obtained by
renaming the actual definitions, then adding new wrappers around them
with the public names. technically __libc_realloc and __libc_free
could be aliases rather than needing a layer of wrapper, but this
would almost surely break certain instrumentation (valgrind) and the
size and performance difference is negligible. __libc_calloc needs to
be handled specially since calloc is designed to work with either the
internal or the replaced malloc.
as a bonus, this change also eliminates the longstanding ugly
dependency of the static bump allocator on order of object files in
libc.a, by making it so there's only one definition of the malloc
function and having it in the same source file as the bump allocator.
the only place stdio was used here was for reading the ldso path file,
taking advantage of getdelim to automatically allocate and resize the
buffer. the motivation for use here was that, with shared libraries,
stdio is already available anyway and free to use. this has long been
a nuisance to users because getdelim's use of realloc here triggered a
valgrind bug, but removing it doesn't really fix that; on some archs
even calling the valgrind-interposed malloc at this point will crash.
the actual motivation for this change is moving towards getting rid of
use of application-provided malloc in parts of libc where it would be
called with libc-internal locks held, leading to the possibility of
deadlock if the malloc implementation doesn't follow unwritten rules
about which libc functions are safe for it to call. since getdelim is
required to produce a pointer as if by malloc (i.e. that can be passed
to realloc or free), it necessarily must use the public malloc.
instead of performing a realloc loop as the path file is read, first
query its size with fstat and allocate only once. this produces
slightly different truncation behavior when racing with writes to a
file, but neither behavior is or could be made safe anyway; on a live
system, ldso path files should be replaced by atomic rename only. the
change should also reduce memory waste.
thread-local buffers allocated for dlerror need to be queued for free
at a later time when the owning thread exits, since malloc may be
replaced by application code and the exiting context is not valid to
call application code from. the code to process queue of pending
frees, introduced in commit aa5a9d15e0,
gratuitously held the lock for the entire duration of queue
processing, updating the global queue pointer after each free, despite
there being no logical requirement that all frees finish before
another thread can access the queue.
instead, immediately claim the whole queue for freeing and release the
lock, then walk the list and perform frees without the lock held. the
change is unlikely to make any meaningful difference to performance,
but it eliminates one point where the allocator is called under an
internal lock. since the allocator may be application-provided, such
calls are undesirable because they allow application code to impede
forward progress of libc functions in other threads arbitrarily long,
and to induce deadlock if it calls a libc function that requires the
same lock.
the change also eliminates a lock ordering consideration that's an
impediment to upcoming work with multithreaded fork.
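a sketch of the resulting shape (simplified locking; not the exact code):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t freebuf_lock = PTHREAD_MUTEX_INITIALIZER;
    static void **freebuf_queue;

    /* detach the whole queue while holding the lock, then walk it and
     * free each buffer with the lock released */
    static void process_free_queue(void)
    {
        pthread_mutex_lock(&freebuf_lock);
        void **q = freebuf_queue;
        freebuf_queue = 0;
        pthread_mutex_unlock(&freebuf_lock);

        while (q) {
            void **next = (void **)*q; /* next pointer lives in the first slot */
            free(q);
            q = next;
        }
    }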
the ABI type for the vector registers in fpregset_t, struct
fpsimd_context, and struct user_fpsimd_struct is __uint128_t, which
was presumably originally not used because it's a nonstandard type,
but its existence is mandated by the aarch64 psABI. use of the wrong
type here broke software using these structures, and encouraged
incorrect fixes with casts rather than reinterpretation of
representation.
the reasoning in commit 2d0bbe6c78 was
not entirely correct. while it's true that setting the waiters flag
ensures that the next unlock will perform a wake, it's possible that
the wake is consumed by a mutex waiter that has no relationship with
the condvar wait queue being processed, which then takes the mutex.
when that thread subsequently unlocks, it sees no waiters, and leaves
the rest of the condvar queue stuck.
bring back the waiter count adjustment, but skip it for PI mutexes,
for which a successful lock-after-waiting always sets the waiters bit.
if future changes are made to bring this same waiters-bit contract to
all lock types, this can be reverted.
sem_open is required to return the same sem_t pointer for all
references to the same named semaphore when it's opened more than once
in the same process. thus we keep a table of all the mapped semaphores
and their reference counts. the code path for sem_close checked the
reference count, but then proceeded to unmap the semaphore regardless
of whether the count had reached zero.
add an immediate unlock-and-return for the nonzero refcnt case so the
property of performing the munmap syscall after releasing the lock can
be preserved.
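a sketch of the corrected flow (the table bookkeeping is simplified and
assumes sem was previously returned by sem_open):

    #include <pthread.h>
    #include <semaphore.h>
    #include <sys/mman.h>

    static pthread_mutex_t semtab_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct { sem_t *sem; int refcnt; } semtab[256];

    int sem_close(sem_t *sem)
    {
        int i;
        pthread_mutex_lock(&semtab_lock);
        for (i = 0; i < 256 && semtab[i].sem != sem; i++);
        if (i == 256 || --semtab[i].refcnt) {
            /* not found, or still referenced: just drop the reference */
            pthread_mutex_unlock(&semtab_lock);
            return 0;
        }
        semtab[i].sem = 0;
        pthread_mutex_unlock(&semtab_lock);
        munmap(sem, sizeof *sem); /* syscall only after the lock is released */
        return 0;
    }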
resource limits have been process-wide since linux 2.6.10, and the
prlimit syscall was added in 2.6.36, so prlimit can be assumed to set
the resource limits correctly for the whole process.
commit bd153422f2 reintroduced the bug
fixed in c21051e90c by refactoring the
__syscall_ret into _Fork where it once again runs before the atfork
handlers are called. since _Fork is a public interface that sets
errno, this can't be fixed the way it was fixed last time without
making new internal interfaces. instead, just save errno, and restore
it only on error to ensure that a value of 0 is never restored.
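a sketch of the ordering (illustrative; the real fork does considerably
more between these steps):

    #include <errno.h>
    #include <unistd.h>

    pid_t _Fork(void);

    pid_t fork_sketch(void)
    {
        pid_t ret = _Fork();
        int saved = errno;          /* capture before handlers can clobber it */
        /* ... atfork handlers, lock cleanup, etc. run here ... */
        if (ret < 0) errno = saved; /* restore only on error, never a stray 0 */
        return ret;
    }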
pthread_cond_wait arranged for requeued waiters to wake when the mutex
is unlocked by temporarily adjusting the mutex's waiter count. commit
54ca677983 broke this when introducing
PI mutexes by repurposing the waiter count field of the mutex
structure. since then, for PI mutexes, the waiter count adjustment was
misinterpreted by the mutex locking code as indicating that the mutex
is in a non-recoverable state.
it would be possible to special-case PI mutexes here, but instead just
drop all adjustment of the waiters count, and instead use the lock
word waiters bit for all mutex types. since the mutex is either held
by the caller or in unrecoverable state at the time the bit is set, it
will necessarily still be set at the time of any subsequent valid
unlock operation, and this will produce the desired effect of waking
the next waiter.
if waiter counts are entirely dropped at some point in the future this
code should still work without modification.
commit 25ea9f712c introduced a deadlock
to the posix_spawn child whereby, if abort was called in the parent
and ended up taking the abort lock to terminate the process, the
__libc_sigaction calls in the child would wait forever to obtain a
lock that would not be released. this could be fixed by having abort
set the abort lock as the exit futex address, but it's cleaner to just
remove the SIGABRT special handling from the internal __libc_sigaction
and lift it to the public sigaction function.
nothing but the posix_spawn child calls __libc_sigaction on SIGABRT,
and since commit b7bc966522 the abort
lock is held at the time of __clone, which precludes the child
inheriting a kernel-level signal disposition inconsistent with the
disposition on the abstract machine. this means it's fine to inspect
and modify the disposition in the child without a lock.
Merge changes from Solar Designer's crypt_blowfish v1.3. This makes
crypt_blowfish fully compatible with OpenBSD's bcrypt by adding
support for the $2b$ prefix (which behaves the same as
crypt_blowfish's $2y$).
this makes the code slightly smaller and eliminates timer_create from
relevance to possible future changes to multithreaded fork.
the barrier of a_store isn't technically needed here, but a_store is
used anyway for internal consistency of the memory model.
this was leftover from when the actual SIGEV_THREAD timer logic was in
the signal handler. commit 5b74eed3b3
replaced that with use of sigwaitinfo, with the actual signal left
blocked, so the no-op signal handler was no longer serving any
purpose.
the signal disposition reset to SIG_DFL is still needed, however, in
case we inherited SIG_IGN from a foreign-libc process.
assert is not specified to flush open stdio streams, and doing so can
block indefinitely waiting for a lock already held or an output
operation to a file that can't accept more output until an
unsatisfiable condition is met.
commit 500c6886c6 broke this by fixing
the behavior of fread to conform to the C standard; getgrouplist was
assuming the old behavior, that a request to read 1 member of length 0
would return 1, not 0.
this change prevents the child created concurrently with abort from
seeing the SIGABRT disposition change from SIG_IGN to SIG_DFL (other
changes are not visible anyway) and prevents leaking the write end of
the child pipe to children created by fork in another thread, which
may block return of posix_spawn indefinitely if the forked child does
not exit or exec.
along with other changes, this suggests that __abort_lock should
perhaps eventually be renamed to reflect that it's becoming a broader
lock on related "process lifetime" state.
the existing abort locking logic in sigaction only accounted for
attempts to change the disposition, not attempts to observe the change
made by abort.
unfortunately the change is still observable in at least one other
place: inheritance of signal dispositions across exec and posix_spawn.
fixing these is a separate task and it's not even clear whether a
complete fix is possible.
the _Fork interface is defined for future issue of POSIX as the
outcome of Austin Group issue 62, which drops the AS-safety
requirement for fork, and provides an AS-safe replacement that does
not run the registered atfork handlers.
commit 188759bbee documented the intent
to allow recursive dlopen based on tracking ctor_visitor, but used a
kernel tid rather than the pthread_t to identify the caller. as a
result, it would not behave as intended under fork by a ctor, where
the child tid would not match.
queue_ctors should not be called with the init_fini_lock held, since
it may longjmp out on allocation failure. this introduces a minor
TOCTOU race with p->constructed, but one already exists further down
anyway, and by design it's okay to run through the queue more than
once. the only reason we bother to check p->constructed at all
is to avoid spurious failure of dlopen when the library is already
fully loaded and constructed.
this makes the code slightly smaller and eliminates these functions
from relevance to possible future changes to multithreaded fork.
the barrier of a_store isn't technically needed here, but a_store is
used anyway for internal consistency of the memory model.
if the multithreaded parent forked while another thread was calling
sigaction for SIGABRT or calling abort, the child could inherit a lock
state in which future calls to abort will deadlock, or in which the
disposition for SIGABRT has already been reset to SIG_DFL. this is
nonconforming since abort is AS-safe and permitted to be called
concurrently with fork or in the MT-forked child.
the dummy definition of __abort_lock in sigaction.c was performing
exactly the role that should have been achieved by putting the lock in
its own source file.
while we're moving it, give it a proper declaration.
previously, if a file descriptor had aio operations pending in the
parent before fork, attempting to close it in the child would attempt
to cancel a thread belonging to the parent. this could deadlock, fail,
or crash the whole process if the cancellation signal handler was not
yet installed in the parent. in addition, further use of aio from the
child could malfunction or deadlock.
POSIX specifies that async io operations are not inherited by the
child on fork, so clear the entire aio fd map in the child, and take
the aio map lock (with signals blocked) across the fork so that the
lock is kept in a consistent state.
taking the deprecated/dropped vfork spec strictly, doing pretty much
anything but execve in the child is wrong and undefined. however,
these are commonly needed operations to setup the child state before
exec, and historical implementations tolerated them.
for single-threaded parents, these operations already worked as
expected in the vforked child. however, due to the need for __synccall
to synchronize id/resource limit changes among all threads, calling
these functions in the vforked child of a multithreaded parent caused
a misdirected broadcast signaling of all threads in the parent. these
signals could kill the parent entirely if the synccall signal handler
had never been installed in the parent, or could be ignored if it had,
or could signal/kill one or more utterly wrong processes if the parent
already terminated (due to vfork semantics, only possible via fatal
signal) and the parent tids were recycled. in any case, the expected
number of semaphore posts would never happen, so the child would
permanently hang (with all signals blocked) waiting for them.
to mitigate this, and also make the normal usage case work as
intended, treat the condition where the caller's actual tid does not
match the tid in its thread structure as single-threaded, and bypass
the entire synccall broadcast operation.
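a sketch of the check (condition only; the surrounding broadcast machinery
is omitted):

    #include <sys/syscall.h>
    #include <unistd.h>

    /* if the kernel tid differs from the tid recorded in the thread
     * structure, we are a vforked child sharing the parent's memory;
     * treat the process as single-threaded and skip the broadcast */
    static int need_broadcast(int recorded_tid, int have_other_threads)
    {
        if (!have_other_threads) return 0;
        if (syscall(SYS_gettid) != recorded_tid) return 0;
        return 1;
    }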
commit 0a05eace16 implemented AT_EACCESS
for faccessat with a horrible hack, creating a child process to
switch uid/gid and perform the access probe without making potentially
irreversible changes to the caller's credentials. this was due to the
syscall lacking a flags argument.
linux 5.8 introduced a new syscall, SYS_faccessat2, fixing this
deficiency. use it if any flags are passed, and fall back to the old
strategy on ENOSYS. continue using the old syscall when there are no
flags.
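a sketch of the resulting logic (error handling simplified; the syscall
number guard is only for older headers):

    #include <errno.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SYS_faccessat2
    #define SYS_faccessat2 439
    #endif

    int faccessat_sketch(int fd, const char *path, int amode, int flag)
    {
        if (flag) {
            long ret = syscall(SYS_faccessat2, fd, path, amode, flag);
            if (!(ret == -1 && errno == ENOSYS)) return (int)ret;
            /* ENOSYS: fall back to the old child-process AT_EACCESS emulation */
        }
        return (int)syscall(SYS_faccessat, fd, path, amode);
    }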
Ethernet protocol number for media redundancy protocol, see
linux commit 4714d13791f831d253852c8b5d657270becb8b2a
bridge: uapi: mrp: Add mrp attributes.
the linux faccessat syscall lacks a flag argument that is necessary
to implement the posix api, see
linux commit c8ffd8bcdd28296a198f237cc595148a8d4adfbe
vfs: add faccessat2 syscall
add TCP_NLA_BYTES_NOTSENT and new tcp_zerocopy_receive fields, see
linux commit c8856c051454909e5059df4e81c77b9c366c5515
tcp-zerocopy: Return inq along with tcp receive zerocopy.
linux commit 33946518d493cdf10aedb4a483f1aa41948a3dab
tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy.
linux commit e08ab0b377a1489760533424437c5f4be7f484a4
tcp: add bytes not sent to SCM_TIMESTAMPING_OPT_STATS
it remaps anon mappings without unmapping the original. chromeos plans
to use it with userfaultfd, see:
linux commit e346b3813067d4b17383f975f197a9aa28a3b077
mm/mremap: add MREMAP_DONTUNMAP to mremap()
see
linux commit 9e2ba2c34f1922ca1e0c7d31b30ace5842c2e7d1
fanotify: send FAN_DIR_MODIFY event flavor with dir inode and name
linux commit 44d705b0370b1d581f46ff23e5d33e8b5ff8ec58
fanotify: report name info for FAN_DIR_MODIFY event
added in
linux commit 1a50ec0b3b2e9a83f1b1245ea37a853aac2f741c
arm64: Implement archrandom.h for ARMv8.5-RNG
linux commit d4209d8b717311d114b5d47ba7f8249fd44e97c2
arm64: cpufeature: Export matrix and other features to userspace
these were missed before, added in
linux commit 1201937491822b61641c1878ebcd16a93aed4540
arm64: Expose ARMv8.5 CondM capability to userspace
linux commit ca9503fc9e9812aa6258e55d44edb03eb30fc46f
arm64: Expose FRINT capabilities to userspace
reuses a bit from CSIGNAL so it can only be used with unshare
and clone3, added in
linux commit 769071ac9f20b6a447410c7eaa55d1a5233ef40c
ns: Introduce Time Namespace
needed for storage drivers with userspace component that may
run in the IO path, see
linux commit 8d19f1c8e1937baf74e1962aae9f90fa3aeab463
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
The reason for the TCP_ naming in udp.h is not known; fortunately udp.h
is not specified by posix, so there are no strict namespace rules. added in
linux commit e27cca96cd68fa2c6814c90f9a1cfd36bb68c593
xfrm: add espintcp (RFC 8229)
TCP_NLA_TIMEOUT_REHASH queries timeout-triggered rehash attempts,
tcpm_ifindex limits the scope of TCP_MD5SIG* sockopt to a device.
see
linux commit 32efcc06d2a15fa87585614d12d6c2308cc2d3f3
tcp: export count for rehash attempts
linux commit 6b102db50cdde3ba2f78631ed21222edf3a5fb51
net: Add device index to tcp_md5sig
add IPPROTO_ETHERNET and IPPROTO_MPTCP, see
linux commit 2677625387056136e256c743e3285b4fe3da87bb
seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
linux commit faf391c3826cd29feae02078ca2022d2f912f7cc
tcp: Define IPPROTO_MPTCP
also added clone3 on sh and m68k; on sh it's still missing (not
yet wired up), but the number is reserved so it's safe to add.
see
linux commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179
open: introduce openat2(2) syscall
linux commit 9a2cef09c801de54feecd912303ace5c27237f12
arch: wire up pidfd_getfd syscall
linux commit 8649c322f75c96e7ced2fec201e123b2b073bf09
pid: Implement pidfd_getfd syscall
linux commit e8bb2a2a1d51511e6b3f7e08125d52ec73c11139
m68k: Wire up clone3() syscall
the fcntl file locking command macro values in the existing generic
bits/fcntl.h were the "64" variants, requiring 64-bit archs that use
the "plain" variants to have their own bits/fcntl.h, even if they
otherwise use the common definitions for everything.
since commit 7cc79d10af exposed
__LONG_MAX to all bits headers, we can now make the generic one common
between 32- and 64-bit archs.
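a sketch of what the shared header can now express (values are the usual
linux ones; archs that differ keep their own bits header):

    #if __LONG_MAX == 0x7fffffffL
    #define F_GETLK  12   /* the "64" variants on 32-bit archs */
    #define F_SETLK  13
    #define F_SETLKW 14
    #else
    #define F_GETLK  5    /* plain variants on 64-bit archs */
    #define F_SETLK  6
    #define F_SETLKW 7
    #endif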
prior to commit 685e40bb09, x86_64 was
correctly passing O_LARGEFILE to SYS_open; it was removed (defined to
0 in the public header, and changed to use the public definition) as
part of that change, probably out of a mistaken belief that it's not
needed.
however, on a mixed system with 32-bit and 64-bit binaries, it's
important that all files be opened with O_LARGEFILE, even if the
opening process is 64-bit, in case a descriptor is passed to a 32-bit
process. otherwise, attempts to access past 2GB in the 32-bit process
could produce EOVERFLOW.
most 64-bit archs added later got this right already, except for
mips64. x32 was also affected. these are now fixed.
this code is only needed for pre-2.6 kernels, which are not actually
supported anyway, and was never tested. the fallback path using
SYS_modify_ldt failed to clear the upper bits of %eax (all ones due to
SYS_set_thread_area's return value being an error) before modifying
%al to attempt a new syscall.
prior to commit e68c51ac46, h_errno was
actually an external data object not a macro. bring back the symbol,
and use it as the storage for the main thread's h_errno.
technically this still doesn't provide full compatibility if the
application was multithreaded, but at the time there were no res_*
functions (and they did not set h_errno anyway), so any use of h_errno
would have been via thread-unsafe functions. thus a solution that just
fixes single-threaded applications seems acceptable.
putting the (simple) definition in alltypes.h seems like the best
solution here. making sys/ioctl.h implicitly include termios.h is
probably excess namespace pollution.
now that -Wall is not used and we control which warnings are enabled,
it makes sense to have the wanted ones on by default. hopefully this
will also discourage manually adding -Wall to CFLAGS and making
incorrect changes or bug reports based on the compiler's output.
-Wall varies too much by compiler and version. rather than trying to
track all the unwanted style warnings that need to be subtracted, just
enable wanted warnings.
also, move -Wno-pointer-to-int-cast outside --enable-warnings
conditional so that it always applies, since it's turning off a
nuisance warning that's on-by-default with most compilers.
these four warning options were overlooked previously, likely because
they're not part of GCC's -Wall. they all detect constraint violations
(invalid C at the source level) and should always be on in -Werror
form.
dtv_copy, canary2, and canary_at_end existed solely to match multiple
ABI and asm-accessed layouts simultaneously. now that pthread_arch.h
can be included before struct __pthread is defined, the struct layout
can depend on macros defined by pthread_arch.h.
the adjustment made is entirely a function of TLS_ABOVE_TP and
TP_OFFSET. aside from avoiding repetition of the TP_OFFSET value and
arithmetic, this change makes pthread_arch.h independent of the
definition of struct __pthread from pthread_impl.h. this in turn will
allow inclusion of pthread_arch.h to be moved to the top of
pthread_impl.h so that it can influence the definition of the
structure.
previously, arch files were very inconsistent about the type used for
the thread pointer. this change unifies the new __get_tp interface to
always use uintptr_t, which is the most correct when performing
arithmetic that may involve addresses outside the actual pointed-to
object (due to TP_OFFSET).
while it's not clearly documented anywhere, this is the historical
behavior which some applications expect. applications which need to
see the response packet in these cases, for example to distinguish
between nonexistence in a secure vs insecure zone, must already use
res_mkquery with res_send in order to be portable, since most if not
all other implementations of res_query don't provide it.
the framework to do this always existed but it was deemed unnecessary
because the only [ex-]standard functions using h_errno were not
thread-safe anyway. however, some of the nonstandard res_* functions
are also supposed to set h_errno to indicate the cause of error, and
were unable to do so because it was not thread-safe. this change is a
prerequisite for fixing them.
these have been adopted for future issue of POSIX as the outcome of
Austin Group issue 1151, and are simply functions performing the roles
of the historical ioctls. since struct winsize is being standardized
along with them, its definition is moved to the appropriate header.
there is some chance this will break source files that expect struct
winsize to be defined by sys/ioctl.h without including termios.h. if
this happens, further changes will be needed to have sys/ioctl.h
expose it too.
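a minimal sketch of the new functions, assuming they are thin wrappers
over the historical ioctls (TIOCGWINSZ/TIOCSWINSZ):

#include <termios.h>
#include <sys/ioctl.h>

int tcgetwinsize(int fd, struct winsize *ws)
{
	return ioctl(fd, TIOCGWINSZ, ws);
}

int tcsetwinsize(int fd, const struct winsize *ws)
{
	return ioctl(fd, TIOCSWINSZ, ws);
}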
this is a prerequisite for addition of other interfaces that use
kernel tids, including futex and SIGEV_THREAD_ID.
there is some ambiguity as to whether the semantic return type should
be int or pid_t. either way, futex API imposes a contract that the
values fit in int (excluding some upper reserved bits). glibc used
pid_t, so in the interest of not having gratuitous mismatch (the
underlying types are the same anyway), pid_t is used here as well.
while conceptually this is a syscall, the copy stored in the thread
structure is always valid in all contexts where it's valid to call
libc functions, so it's used to avoid the syscall.
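a minimal sketch under the assumption stated above, reading the cached
tid instead of making a syscall (internal header and accessor names
are assumed from the tree):

#define _GNU_SOURCE
#include <unistd.h>
#include "pthread_impl.h"

pid_t gettid(void)
{
	return __pthread_self()->tid;
}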
longjmp should set the return value of setjmp, but 64-bit
registers were used for the 0 check while the type is int.
use the code that gcc generates for return val ? val : 1;
longjmp 'val' argument is an int, but the assembly is referencing 64-bit
registers as if the argument was a long, or the caller was responsible
for extending the argument. Though the psABI is not clear on this, the
interpretation in GCC is that high bits may be arbitrary and the callee
is responsible for sign/zero-extending the value as needed (likewise for
return values: callers must anticipate that high bits may be garbage).
Therefore testing %rax is a functional bug: setjmp would wrongly return
zero if longjmp was called with val==0, but high bits of %rsi happened
to be non-zero.
Rewrite the prologue to refer to 32-bit registers. In passing, change
'test' to use %rsi, as there's no advantage to using %rax and the new
form is cheaper on processors that do not perform move elimination.
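for illustration, one way the 32-bit-register prologue could look (a
sketch of the semantics, not necessarily the exact committed
sequence):

	mov %esi,%eax       /* low 32 bits of val become the return value */
	test %esi,%esi      /* test the int value, not a 64-bit register */
	jnz 1f
	mov $1,%eax         /* setjmp must not appear to return 0 */
1: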
a number of users performing seccomp filtering have requested use of
the new individual syscall numbers for socket syscalls, rather than
the legacy multiplexed socketcall, since the latter has the arguments
all in memory where they can't participate in filter decisions.
previously, some archs used the multiplexed socketcall if it was
historically all that was available, while other archs used the
separate syscalls. the intent was that the latter set only include
archs that have "always" had separate socket syscalls, at least going
back to linux 2.6.0. however, at least powerpc, powerpc64, and sh were
wrongly included in this set, and thus socket operations completely
failed on old kernels for these archs.
with the changes made here, the separate syscalls are always
preferred, but fallback code is compiled for archs that also define
SYS_socketcall. two such archs, mips (plain o32) and microblaze,
define SYS_socketcall despite never having needed it, so it's now
undefined by their versions of syscall_arch.h to prevent inclusion of
useless fallback code.
some archs, where the separate syscalls were only added after the
addition of SYS_accept4, lack SYS_accept. because socket calls are
always made with zeros in the unused argument positions, it suffices
to just use SYS_accept4 to provide a definition of SYS_accept, and
this is done to make happy the macro machinery that concatenates the
socket call name onto __SC_ and SYS_.
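a hedged sketch of the resulting glue (macro names per the description
above, not the verbatim committed header): the separate syscall is
always tried first, and the multiplexed fallback is only compiled
where SYS_socketcall exists:

#include <errno.h>
#include <sys/syscall.h>

#ifndef SYS_accept
#define SYS_accept SYS_accept4   /* unused argument slots are always zero */
#endif

#ifdef SYS_socketcall
#define __socketcall(nm, a, b, c, d, e, f) ({ \
	long _r = __syscall(SYS_##nm, a, b, c, d, e, f); \
	if (_r == -ENOSYS) _r = __syscall(SYS_socketcall, __SC_##nm, \
		((long [6]){ (long)(a), (long)(b), (long)(c), \
		             (long)(d), (long)(e), (long)(f) })); \
	_r; })
#else
#define __socketcall(nm, a, b, c, d, e, f) \
	__syscall(SYS_##nm, a, b, c, d, e, f)
#endif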
same approach as in sqrt.
sqrtl was broken on aarch64, riscv64 and s390x targets because
of missing quad precision support and on m68k-sf because of
missing ld80 sqrtl.
this implementation is written for quad precision and then
edited to make it work for both m68k and x86 style ld80 formats
too, but it is not expected to be optimal for them.
note: using fp instructions for the initial estimate when such
instructions are available (e.g. double prec sqrt or rsqrt) is
avoided because of fenv correctness.
same method as in sqrt, this was tested on all inputs against
an sqrtf instruction. (the only difference found was that x86
sqrtf does not signal the x86 specific input-denormal exception
on negative subnormal inputs while the software sqrtf does,
this is fine as it was designed for ieee754 exceptions only.)
there is a known faster method:
"Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation"
that computes sqrtf directly via pipelined polynomial evaluation
which allows more parallelism, but the design does not generalize
easily to higher precisions.
approximate 1/sqrt(x) and sqrt(x) with goldschmidt iterations.
this is known to be a fast method for computing sqrt, but it is
tricky to get right, so added detailed comments.
use a lookup table for the initial estimate; this adds 256 bytes of
rodata, but it can be shared between sqrt, sqrtf and sqrtl.
this saves one iteration compared to a linear estimate.
this is for soft float targets, but it supports fenv by using a
floating-point operation to get the final result. the result
is correctly rounded in all rounding modes. if fenv support is
turned off then the nearest rounded result is computed and
inexact exception is not signaled.
assumes fast 32-bit integer arithmetic and a 32-to-64-bit multiply.
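for reference, the recurrence being used, written out in plain double
precision (illustrative only; the committed code works in integer
arithmetic with the table-driven initial estimate and a final fp
operation for correct rounding):

/* y0 is a rough estimate of 1/sqrt(x) */
static double sqrt_goldschmidt(double x, double y0)
{
	double g = x * y0;    /* converges toward sqrt(x) */
	double h = 0.5 * y0;  /* converges toward 1/(2*sqrt(x)) */
	for (int i = 0; i < 3; i++) {
		double r = 0.5 - g * h;
		g += g * r;
		h += h * r;
	}
	return g;
}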
prior to this change, the canonical name came from the first hosts
file line matching the requested family, so the canonical name for a
given hostname could differ depending on whether it was requested with
AF_UNSPEC or a particular family (AF_INET or AF_INET6). now, the
canonical name is deterministically the first one to appear with the
requested name as an alias.
the existing code clobbered the canonical name already discovered
every time another matching line was found, which will necessarily be
the case when a hostname has both IPv4 and v6 definitions.
patch by Wolf.
this is actually a functional fix at present, since the C sqrtl does
not support ld80 and just wraps double sqrt. once that's fixed it will
just be an optimization.
the previous commit addressing async-signal-safety issues around
pthread_kill did not fully fix pthread_cancel, which is also required
(albeit rather irrationally) to be async-cancel-safe.
without blocking implementation-internal signals, it's possible that,
when async cancellation is enabled, a cancel signal sent by another
thread interrupts pthread_kill while the killlock for a targeted
thread is held. as a result, the calling thread will terminate due to
cancellation without ever unlocking the targeted thread's killlock,
and thus the targeted thread will be unable to exit.
pthread_kill is required to be AS-safe. that requirement can't be met
if the target thread's killlock can be taken in contexts where
application-installed signal handlers can run.
block signals around use of this lock in all pthread_* functions which
target a tid, and reorder blocking/unblocking of signals in
pthread_exit so that they're blocked whenever the killlock is held.
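the pattern, roughly (a sketch using internal helper names assumed
from the tree; the entry above refines exactly which signals must be
blocked):

#include <errno.h>
#include <signal.h>
#include "pthread_impl.h"

int pthread_kill(pthread_t t, int sig)
{
	int r;
	sigset_t set;
	__block_app_sigs(&set);    /* no handler can run while the lock is held */
	LOCK(t->killlock);
	r = t->tid ? -__syscall(SYS_tkill, t->tid, sig) : ESRCH;
	UNLOCK(t->killlock);
	__restore_sigs(&set);
	return r;
}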
this broke mallocng size_to_class on archs without a native
implementation of a_clz_32. the incorrect logic seems to have been
something i derived from a related but distinct log2-type operation.
with the change made here, it passes an exhaustive test.
as this function is new and presently only used by mallocng, no other
functionality was affected.
the intent here is to keep oldmalloc as an option, at least for the
short term, in case any users are negatively impacted in some way by
mallocng and need to fallback until their issues are resolved.
the files added come from the mallocng development repo, commit
2ed58817cca5bc055974e5a0e43c280d106e696b. they comprise a new malloc
implementation, developed over the past 9 months, to replace the old
allocator (since dubbed "oldmalloc") with one that retains low code
size and minimal baseline memory overhead while avoiding fundamental
flaws in oldmalloc and making significant enhancements. these include
highly controlled fragmentation, fine-grained ability to return memory
to the system when freed, and strong hardening against dynamic memory
usage errors by the caller.
internally, mallocng derives most of these properties from tightly
structuring memory, creating space for allocations as uniform-sized
slots within individually mmapped (and individually freeable)
allocation groups. smaller-than-pagesize groups are created within
slots of larger ones. minimal group size is very small, and larger
sizes (in geometric progression) only come into play when usage is
high.
all data necessary for maintaining consistency of the allocator state
is tracked in out-of-band metadata, reachable via a validated path
from minimal in-band metadata. all pointers passed (to free, etc.) are
validated before any stores to memory take place. early reuse of freed
slots is avoided via approximate LRU order of freed slots. further
hardening against use-after-free and double-free, even in the case
where the freed slot has been reused, is made by cycling the offset
within the slot at which the allocation is placed; this is possible
whenever the slot size is larger than the requested allocation.
this includes both an implementation of reclaimed-gap donation from
ldso and a version of mallocng's glue.h with namespace-safe linkage to
underlying syscalls, integration with AT_RANDOM initialization, and
internal locking that's optimized out when the process is
single-threaded.
these are based on the ARM optimized-routines repository v20.05
(ef907c7a799a), with macro dependencies flattened out and memmove code
removed from memcpy. this change is somewhat unfortunate since having
the branch for memmove support in the large n case of memcpy is the
performance-optimal and size-optimal way to do both, but it makes
memcpy alone (static-linked) about 40% larger and suggests a policy
that use of memcpy as memmove is supported.
tabs used for alignment have also been replaced with spaces.
the child is single-threaded, but may still need to synchronize with
last changes made to memory by another thread in the parent, so set
need_locks to -1 whereby the next lock-taker will drop to 0 and
prevent further barriers/locking.
otherwise, shrink in-place. as explained in the description of commit
3e16313f8f, the split here is valid
without holding split_merge_lock because all chunks involved are in
the in-use state.
commit 3e16313f8f introduced this bug by
making the copy case reachable with n (new size) smaller than n0
(original size). this was left as the only way of shrinking an
allocation because it reduces fragmentation if a free chunk of the
appropriate size is available. when that's not the case, another
approach may be better, but any such improvement would be independent
of fixing this bug.
access always computes result with real ids not effective ones, so it
is not a valid means of determining whether the directory is readable.
instead, attempt to open it before reporting whether it's readable,
and then use fdopendir rather than opendir to open and read the
entries.
effort is made here to keep fd_limit behavior the same as before even
if it was not correct.
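the approach, as a sketch (illustrative only; the real code also has
to preserve the fd_limit accounting mentioned above):

#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

static DIR *open_dir_if_readable(const char *path)
{
	/* probe readability by actually opening, then hand the fd to
	   fdopendir rather than reopening by name */
	int fd = open(path, O_RDONLY|O_DIRECTORY|O_CLOEXEC);
	if (fd < 0) return 0;
	DIR *d = fdopendir(fd);
	if (!d) close(fd);
	return d;
}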
some archs already have a_clz_32, used to provide a_ctz_32, but it
hasn't been mandatory because it's not used anywhere yet. mallocng
will need it, however, so add it now. it should probably be optimized
better, but doesn't seem to make a difference at present.
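a portable fallback might look like the following (a sketch; the
generic version in the tree may differ and, as noted, could be
optimized better):

#include <stdint.h>

static inline int a_clz_32(uint32_t x)
{
	int n = 0;
	if (!x) return 32;
	if (!(x & 0xffff0000u)) { n += 16; x <<= 16; }
	if (!(x & 0xff000000u)) { n += 8;  x <<= 8; }
	if (!(x & 0xf0000000u)) { n += 4;  x <<= 4; }
	if (!(x & 0xc0000000u)) { n += 2;  x <<= 2; }
	if (!(x & 0x80000000u)) n += 1;
	return n;
}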
if both malloc and aligned_alloc have been replaced but the internal
aligned_alloc still gets called, the replacement is a wrapper of some
sort. it's not clear if this usage should be officially supported, but
it's at least a plausibly interesting debugging usage, and easy to do.
it should not be relied upon unless it's documented as supported at
some later time.
a new weak predicate function replacable by the malloc implementation,
__malloc_allzerop, is introduced. by default it's always false; the
default version will be used when static linking if the bump allocator
was used (in which case performance doesn't matter) or if malloc was
replaced by the application. only if the real internal malloc is
linked (always the case with dynamic linking) does the real version
get used.
if malloc was replaced dynamically, as indicated by __malloc_replaced,
the predicate function is ignored and conditional-memset is always
performed.
abstractly, calloc is completely malloc-implementation-independent;
it's malloc followed by memset, or as we do it, a "conditional memset"
that avoids touching fresh zero pages.
previously, calloc was kept separate for the bump allocator, which can
always skip memset, and the version of calloc provided with the full
malloc conditionally skipped the clearing for large direct-mmapped
allocations. the latter is a moderately attractive optimization, and
can be added back if needed. however, further consideration to make it
correct under malloc replacement would be needed.
commit b4b1e10364 documented the
contract for malloc replacement as allowing omission of calloc, and
indeed that worked for dynamic linking, but for static linking it was
possible to get the non-clearing definition from the bump allocator;
if not for that, it would have been a link error trying to pull in
malloc.o.
the conditional-clearing code for the new common calloc is taken from
mal0_clear in oldmalloc, but drops the need to access actual page size
and just uses a fixed value of 4096. this avoids potentially needing
access to global data for the sake of an optimization that at best
marginally helps archs with offensively-large page sizes.
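the idea of the conditional clearing, sketched with the fixed
4096-byte granularity mentioned above (simplified; the actual code is
word-based and scans more carefully):

#include <stddef.h>
#include <string.h>

static void conditional_clear(unsigned char *p, size_t n)
{
	for (size_t i = 0; i < n; i += 4096) {
		size_t len = n - i < 4096 ? n - i : 4096;
		size_t j;
		/* only store into blocks that contain nonzero bytes, so
		   fresh zero pages are never written to */
		for (j = 0; j < len && !p[i+j]; j++);
		if (j < len) memset(p+i, 0, len);
	}
}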
this sets the stage for replacement, and makes it practical to keep
oldmalloc around as a build option for a while if that ends up being
useful.
only the files which are actually part of the implementation are
moved. memalign and posix_memalign are entirely generic. in theory
calloc could be pulled out too, but it's useful to have it tied to the
implementation so as to optimize out unnecessary memset when
implementation details make it possible to know the memory is already
clear.
this change eliminates the internal __memalign function and makes the
memalign and posix_memalign functions completely independent of the
malloc implementation, written portably in terms of aligned_alloc.
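for example, posix_memalign can be written portably in terms of
aligned_alloc along these lines (a sketch consistent with the
description above):

#include <stdlib.h>
#include <errno.h>

int posix_memalign(void **res, size_t align, size_t len)
{
	if (align < sizeof(void *)) return EINVAL;
	void *mem = aligned_alloc(align, len);
	if (!mem) return errno;
	*res = mem;
	return 0;
}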
this was an unfinished draft document present since the initial
check-in, that was never intended to ship in its current form. remove
it as part of reorganizing for replacement of the allocator.
this affects the bump allocator used when static linking in programs
that don't need allocation metadata due to not using realloc, free,
etc.
commit e3bc22f1ef refactored the bump
allocator to share code with __expand_heap, used by malloc, for the
purpose of fixing the case (mainly nommu) where brk doesn't work.
however, the geometric growth behavior of __expand_heap is not
actually well-suited to the bump allocator, and can produce
significant excessive memory usage. in particular, by repeatedly
requesting just over the remaining free space in the current
mmap-allocated area, the total mapped memory will be roughly double
the nominal usage. and since the main user of the no-brk mmap fallback
in the bump allocator is nommu, this excessive usage is not just
virtual address space but physical memory.
in addition, even on systems with brk, having a unified size request
to __expand_heap without knowing whether the brk or mmap backend would
get used made it so the brk could be expanded twice as far as needed.
for example, with malloc(n) and n-1 bytes available before the current
brk, the brk would be expanded by n bytes rounded up to page size,
when expansion by just one page would have sufficed.
the new implementation computes request size separately for the cases
where brk expansion is being attempted vs using mmap, and also
performs individual mmap of large allocations without moving to a new
bump area and throwing away the rest of the old one. this greatly
reduces the need for geometric area size growth and limits the extent
to which free space at the end of one bump area might be unusable for
future allocations.
as a bonus, the resulting code size is somewhat smaller than the
combined old version plus __expand_heap.
this has been a longstanding issue reported many times over the years,
with it becoming increasingly clear that it could be hit in practice.
under concurrent malloc and free from multiple threads, it's possible
to hit usage patterns where unbounded amounts of new memory are
obtained via brk/mmap despite the total nominal usage being small and
bounded.
the underlying cause is that, as a fundamental consequence of keeping
locking as fine-grained as possible, the state where free has unbinned
an already-free chunk to merge it with a newly-freed one, but has not
yet re-binned the combined chunk, is exposed to other threads. this is
bad even with small chunks, and leads to suboptimal use of memory, but
where it really blows up is where the already-freed chunk in question
is the large free region "at the top of the heap". in this situation,
other threads momentarily see a state of having almost no free memory,
and conclude that they need to obtain more.
as far as I can tell there is no fix for this that does not harm
performance. the fix made here forces all split/merge of free chunks
to take place under a single lock, which also takes the place of the
old free_lock, being held at least momentarily at the time of free to
determine whether there are neighboring free chunks that need merging.
as a consequence, the pretrim, alloc_fwd, and alloc_rev operations no
longer make sense and are deleted. simplified merging now takes place
inline in free (__bin_chunk) and realloc.
as commented in the source, holding the split_merge_lock precludes any
chunk transition from in-use to free state. for the most part, it also
precludes change to chunk header sizes. however, __memalign may still
modify the sizes of an in-use chunk to split it into two in-use
chunks. arguably this should require holding the split_merge_lock, but
that would necessitate refactoring to expose it externally, which is a
mess. and it turns out not to be necessary, at least assuming the
existing sloppy memory model malloc has been using, because if free
(__bin_chunk) or realloc sees any unsynchronized change to the size,
it will also see the in-use bit being set, and thereby can't do
anything with the neighboring chunk that changed size.
coding style warnings enabled by default in clang have long been a
source of spurious questions/bug-reports. since clang provides a -w
that behaves differently from gcc's, and that lets us enable any
warnings we may actually want after turning them all off to start with
a clean slate, use it at configure time if clang is detected.
the design used here relies on the barrier provided by the first lock
operation after the process returns to single-threaded state to
synchronize with actions by the last thread that exited. by storing
the intent to change modes in the same object used to detect whether
locking is needed, it's possible to avoid an extra (possibly costly)
memory load after the lock is taken.
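roughly, the resulting lock fast path (a sketch; field and helper
names are the internal ones assumed here): a value of 1 means locking
is needed, -1 means locking is needed once more to provide the
barrier, after which it drops to 0:

static void lock(volatile int *lk)
{
	int need_locks = libc.need_locks;
	if (need_locks) {
		while (a_cas(lk, 0, 1)) __wait(lk, 0, 1, 1);
		/* first lock taken after the last other thread exited */
		if (need_locks < 0) libc.need_locks = 0;
	}
}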
after all but the last thread exits, the next thread to observe
libc.threads_minus_1==0 and conclude that it can skip locking fails to
synchronize with any changes to memory that were made by the
last-exiting thread. this can produce data races.
on some archs, at least x86, memory synchronization is unlikely to be
a problem; however, with the inline locks in malloc, skipping the lock
also eliminated the compiler barrier, and caused code that needed to
re-check chunk in-use bits after obtaining the lock to reuse a stale
value, possibly from before the process became single-threaded. this
in turn produced corruption of the heap state.
some uses of libc.threads_minus_1 remain, especially for allocation of
new TLS in the dynamic linker; otherwise, it could be removed
entirely. it's made non-volatile to reflect that the remaining
accesses are only made under lock on the thread list.
instead of libc.threads_minus_1, libc.threaded is now used for
skipping locks. the difference is that libc.threaded is permanently
true once an additional thread has been created. this will produce
some performance regression in processes that are mostly
single-threaded but occasionally creating threads. in the future it
may be possible to bring back the full lock-skipping, but more care
needs to be taken to produce a safe design.
since the backend for LOCK() skips locking if single-threaded, it's
unsafe to make the process appear single-threaded before the last use
of lock.
this fixes potential unsynchronized access to a linked list via
__dl_thread_cleanup.
signal 7 is SIGEMT on the Linux mips* ABIs according to the man pages
and the kernel. it's not clear where the wrong name came from, but it
dates back to the original mips commit.
the internal __res_msend returns 0 on timeout without having obtained
any conclusive answer, but in this case has not filled in meaningful
anslen. res_send wrongly treated that as success, but returned a zero
answer length. any reasonable caller would eventually end up treating
that as an error when attempting to parse/validate it, but it should
just be reported as an error.
alternatively we could return the last-received inconclusive answer
(typically servfail), but doing so would require internal changes in
__res_msend. this may be considered later.
the old logic here likely dates back, at least in inspiration, to
before it was recognized that transient errors must not be allowed to
reflect the contents of successful results and must be reported to the
application.
here, the dns backend for getaddrinfo, when performing a paired query
for v4 and v6 addresses, accepted results for one address family even
if the other timed out. (the __res_msend backend does not propagate
error rcodes back to the caller, but continues to retry until timeout,
so other error conditions were not actually possible.)
this patch moves the checks to take place before answer parsing, and
performs them for each answer rather than only the answer to the first
query. if nxdomain is seen it's assumed to apply to both queries since
that's how dns semantics work.
the AD (authenticated data) bit in outgoing dns queries is defined by
rfc3655 to request that the nameserver report (via the same bit in the
response) whether the result is authenticated by DNSSEC. while all
results returned by a DNSSEC conforming nameserver will be either
authenticated or cryptographically proven to lack DNSSEC protection,
for some applications it's necessary to be able to distinguish these
two cases. in particular, conforming and compatible handling of DANE
(TLSA) records requires enforcing them only in signed zones.
when the AD bit was first defined for queries, there were reports of
compatibility problems with broken firewalls and nameservers dropping
queries with it set. these problems are probably a thing of the past,
and broken nameservers are already unsupported. however, since there
is no use in the AD bit with the netdb.h interfaces, explicitly clear
it in the queries they make. this ensures that, even with broken
setups, the standard functions will work, and at most the res_*
functions break.
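concretely, the AD flag is bit 0x20 of the fourth byte of the dns
header, so clearing it in an already-built query is a one-byte
operation (sketch):

static void clear_ad_bit(unsigned char *query)
{
	query[3] &= ~0x20;   /* RA/Z/AD/CD and the rcode live in this byte */
}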
unsigned char promotes to int, which can overflow when shifted left by
24 bits or more. this has been reported multiple times but then
forgotten. it's expected to be benign UB, but can trap when built with
explicit overflow catching (ubsan or similar). fix it now.
note that promotion to uint32_t is safe and portable even outside of
the assumptions usually made in musl, since either uint32_t has rank
at least unsigned int, so that no further default promotions happen,
or int is wide enough that the shift can't overflow. this is a
desirable property to have in case someone wants to reuse the code
elsewhere.
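for illustration, the shape of the fix (hypothetical helper; the point
is casting through uint32_t before the shift):

#include <stdint.h>

static uint32_t get_be32(const unsigned char *p)
{
	/* p[0]<<24 alone promotes to int and can overflow; the cast
	   makes the shift happen in uint32_t */
	return (uint32_t)p[0]<<24 | (uint32_t)p[1]<<16 | p[2]<<8 | p[3];
}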
it's been reported that the vdso clock_gettime64 function on (32-bit)
arm is broken, producing erratic results that grow at a rate far
greater than one reported second per actual elapsed second. the vdso
function seems to have been added sometime between linux 5.4 and 5.6,
so if there's ever been a working version, it was only present for a
very short window.
it's not clear what the eventual upstream kernel solution will be, but
something needs to be done on the libc side so as not to be producing
binaries that seem to work on older/existing/lts kernels (which lack
the function and thus lack the bug) but will break fantastically when
moving to newer kernels.
hopefully vdso support will be added back soon, but with a new symbol
name or version from the kernel to allow continued rejection of broken
ones.
analogous to commit b287cd745c but for
the custom FILE stream type the wcstol and wcstod family use. __toread
could be used here as well, but there's a simple direct fix to make
the buffer pointers initially valid for subtraction, so just do that
to avoid pulling in stdio exit code in programs that don't use stdio.
the sh version of fesetround or'd the new rounding mode onto the
control register without clearing the old rounding mode bits, making
changes sticky. this was the root cause of multiple test failures.
apparently this function was intended at some point to be used by the
strto* family as well, and thus was put in its own file; however, as
far as I can tell, it's only ever been used by vsscanf. move it to the
same file to reduce the number of source files and external symbols.
this idea came up when I thought we might need to zero the UNGET
portion of buf as well, but it seems like a useful improvement even
though that turned out not to be necessary.
shgetc sets up to be able to perform an "unget" operation without the
caller having to remember and pass back the character value, and for
this purpose used a conditional store idiom:
if (f->rpos[-1] != c) f->rpos[-1] = c
to make it safe to use with non-writable buffers (setup by the
sh_fromstring macro or __string_read with sscanf).
however, validity of this depends on the buffer space at rpos[-1]
being initialized, which is not the case under some conditions
(including at least unbuffered files and fmemopen ones).
whenever data was read "through the buffer", the desired character
value is already in place and does not need to be written. thus,
rather than testing for the absence of the value, we can test for
rpos<=buf, indicating that the last character read could not have come
from the buffer, and thereby that we have a "real" buffer (possibly of
zero length) with writable pushback (UNGET bytes) below it.
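in other words, a sketch of the revised idiom (not necessarily the
verbatim change):
if (f->rpos <= f->buf) f->rpos[-1] = c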
as reported/analyzed by Pascal Cuoq, the shlim and shcnt
macros/functions are called by the scanf core (vfscanf) with f->rpos
potentially null (if the FILE is not yet activated for reading at the
time of the call). in this case, they compute differences between a
null pointer (f->rpos) and a non-null one (f->buf), resulting in
undefined behavior.
it's unlikely that any observably wrong behavior occurred in practice,
at least without LTO, due to limits on what's visible to the compiler
from translation unit boundaries, but this has not been checked.
fix is simply ensuring that the FILE is activated for read mode before
entering the main scanf loop, and erroring out early if it can't be.
TZ containing a timezone name with >TZNAME_MAX characters currently
breaks musl's timezone parsing. getname() stops after TZNAME_MAX
characters. getoff() will consume no characters (because the next
character is not a digit) and incorrectly return 0. Then, because
there are remaining alphabetic characters, __daylight == 1, and
dst_off == -3600.
getname() must consume the entire timezone name, even if it will not
fit in d/__tzname, so when it returns, s points to the offset digits.
Commit d9bdfd164 ("fix memccpy to not access buffer past given size")
correctly added a check for 'n' nonzero, but made the pre-existing test
'*s==c' redundant: n!=0 implies *s==c. Remove the unnecessary check.
Reported by Alexey Izbyshev.
Linux defines MAP_SYNC on powerpc and powerpc64 as of commit
22fcea6f85f2 ("mm: move MAP_SYNC to asm-generic/mman-common.h"),
so we can stop undefining it on those architectures.
kernel commit 4693916846269d633a3664586650dbfac2c5562f (first included
in release v4.14) silently fixed a bug whereby the reserved space
(which was later used for high bits of time) in IPC_STAT structures
was left untouched rather than zeroed. this means that a caller that
wants to read the high bits needs to pre-zero the memory.
since it's not clear that these operations are permitted to modify the
destination buffer on failure, use a temp buffer and copy back to the
caller's buffer on success.
on all mips variants, Linux did (and maybe still does) have some
syscall return paths that wrongly return both the error flag in r7 and
a negated error code in r2. in particular this happened for at least
some causes of ENOSYS.
add an extra check to only negate the error code if it's positive to
begin with.
bug report and concept for patch by Andreas Dröscher.
commit 4221f154ff added the r7
constraint apparently out of a misunderstanding of the breakage it was
addressing, and did so because the asm was in a shared macro used by
all the __syscallN inline functions. now "+r" is used in the output
section for the forms 4-argument and up, so having it in input is
redundant, and the forms with 0-3 arguments don't need it as an input
at all.
the r2 constraint is kept because without it most gcc versions (seems
to be all prior to 9.x) fail to honor the output register binding for
r2. this seems to be a variant of gcc bug #87733.
both the r7 and r2 input constraints look useless, but the r2 one was
a quiet workaround for gcc bug 87733, which affects all modern
versions prior to 9.x, so it's kept and documented.
exactly revert commit 604f8d3d8b which
was wrong; it caused a major regression on Linux versions prior to
2.6.36. old kernels did not properly preserve r2 across syscall
restart, and instead restarted with the instruction right before
syscall, imposing a contract that the previous instruction must load
r2 from an immediate or a register (or memory) not clobbered by the
syscall.
effectively revert commit ddc7c4f936
which was wrong; it caused a major regression on Linux versions prior
to 2.6.36. old kernels did not properly preserve r2 across syscall
restart, and instead restarted with the instruction right before
syscall, imposing a contract that the previous instruction must load
r2 from an immediate or a register (or memory) not clobbered by the
syscall.
since other changes were made since, including removal of the struct
stat conversion that was replaced by separate struct kstat, this is
not a direct revert, only a functional one.
the "0"(r2) input constraint added back seems useless/erroneous, but
without it most gcc versions (seems to be all prior to 9.x) fail to
honor the output register binding for r2. this seems to be a variant
of gcc bug #87733. further changes should be made later if a better
workaround is found, but this one has been working since 2012. it
seems this issue was encountered but misidentified then, when it
inspired commit 4221f154ff.
this is added for POSIX-future as the outcome of Austin Group issue
599. since it's in the reserved namespace for pthread.h, there are no
namespace considerations for adding it early.
commit 59324c8b09 added __socketcall
analogous to __syscall, returning the negated error rather than
setting errno. use it to simplify the fallback path of socket(),
avoiding extern calls and access to errno.
make __socketcall analogous to __syscall, error-returning
this reverts commit 4ee039f354, which
added the helper as a hack to make vdprintf usable before relocation,
contingent on strong assumptions about the arch and tooling, back when
the dynamic linker did not have a real staged model for
self-relocation. since commit f3ddd17380
this has been unnecessary and the function was just wasting size and
execution time.
The final rounding operation should be done with the correct sign
otherwise huge results may incorrectly get rounded to or away from
infinity in upward or downward rounding modes.
This affected sinh and sinhf which set the sign on the result after
a potentially overflowing mul. There may be other non-nearest rounding
issues, but this was a known long standing issue with large ulp error
(depending on how ulp is defined near infinity).
The fix should have no effect on sinh and sinhf performance but may
have a tiny effect on cosh and coshf.
Handle when after reduction |y| > pi/4+tiny. This happens in directed
rounding modes because the fast round to int code does not give the
nearest integer. In such cases the reduction may not be symmetric
between x and -x so e.g. cos(x)==cos(-x) may not hold (but polynomial
evaluation is not symmetric either with directed rounding so fixing
that would require more changes with bigger performance impact).
The fix only adds two predictable branches in nearest rounding mode,
simple ubenchmark does not show relevant performance regression in
nearest rounding mode.
The code could be improved: e.g reducing the medium size threshold
such that two step reduction is enough instead of three, and the
single precision case can avoid the issue by doing the round to int
differently, but this fix was kept minimal.