the UTF-8 output code was written assuming an invariant that iconv's
decoders only emit valid Unicode Scalar Values which wctomb can encode
successfully, thereby always returning a value between 1 and 4.
if this invariant is not satisfied, wctomb returns (size_t)-1, and the
subsequent adjustments to the output buffer pointer and remaining
output byte count overflow, moving the output position backwards,
potentially past the beginning of the buffer, without storing any
bytes.
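the failure mode above can be illustrated with a small sketch. it is
not the iconv code: encode_utf8 and put_utf8 are hypothetical names,
and the encoder is hand-written so the example does not depend on
wctomb or the locale. the point is only that the return value is
checked before the output pointer and remaining byte count are
adjusted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoder: returns the byte count (1 to 4), or (size_t)-1
 * if c is not a Unicode Scalar Value (surrogate or above 0x10ffff). */
static size_t encode_utf8(unsigned char *s, uint32_t c)
{
	if (c < 0x80) {
		s[0] = c;
		return 1;
	} else if (c < 0x800) {
		s[0] = 0xc0 | (c >> 6);
		s[1] = 0x80 | (c & 0x3f);
		return 2;
	} else if (c - 0xd800u < 0x800 || c > 0x10ffff) {
		return (size_t)-1;
	} else if (c < 0x10000) {
		s[0] = 0xe0 | (c >> 12);
		s[1] = 0x80 | ((c >> 6) & 0x3f);
		s[2] = 0x80 | (c & 0x3f);
		return 3;
	}
	s[0] = 0xf0 | (c >> 18);
	s[1] = 0x80 | ((c >> 12) & 0x3f);
	s[2] = 0x80 | ((c >> 6) & 0x3f);
	s[3] = 0x80 | (c & 0x3f);
	return 4;
}

/* Hypothetical output step: 0 on success, -1 on encoding error, -2 if
 * the output buffer is too small. On failure *out and *outb are left
 * untouched, so the position can never move backwards. */
static int put_utf8(unsigned char **out, size_t *outb, uint32_t c)
{
	unsigned char tmp[4];
	size_t k;
	assert(out && *out && outb);
	k = encode_utf8(tmp, c);
	if (k == (size_t)-1) return -1;
	if (k > *outb) return -2;
	memcpy(*out, tmp, k);
	*out += k;
	*outb -= k;
	return 0;
}
```

without the first check, adding (size_t)-1 to the pointer and
subtracting it from the count is exactly the wraparound described
above.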
the man page for this nonstandardized function has historically
documented it as scanning for a substring; however, this is
functionally incorrect (matches the substring "atime" in the "noatime"
option, for example) and differs from other existing implementations.
with the change made here, it should match glibc and other
implementations, only matching whole options delimited by commas or
separated from a value by an equals sign.
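a minimal sketch of whole-option matching follows. find_mnt_opt is a
hypothetical name that takes the option string directly rather than a
struct mntent, and it is meant as an illustration of the rule, not as
the code in the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to opt within the
 * comma-separated list opts, or NULL. A match must start at the
 * beginning of an option and be followed by end of string, a comma,
 * or an equals sign. */
static const char *find_mnt_opt(const char *opts, const char *opt)
{
	size_t l;
	const char *p = opts;
	assert(opts && opt);
	l = strlen(opt);
	for (;;) {
		if (!strncmp(p, opt, l) && (!p[l] || p[l] == ',' || p[l] == '='))
			return p;
		p = strchr(p, ',');
		if (!p) return NULL;
		p++;
	}
}
```

with this rule, looking up "atime" in "rw,noatime" finds nothing,
while looking up "size" in "rw,size=10" matches.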
as a result of incorrect bounds checking on the lead byte being
decoded, certain invalid inputs which should produce an encoding
error, such as "\xc8\x41", instead produced out-of-bounds loads from
the ksc table.
in a worst case, the loaded value may not be a valid unicode scalar
value, in which case, if the output encoding was UTF-8, wctomb would
return (size_t)-1, causing an overflow in the output pointer and
remaining buffer size which could clobber memory outside of the output
buffer.
bug report was submitted in private by Nick Wellnhofer on account of
potential security implications.
out-of-range second bytes were not handled, leading to wrong character
output rather than a reported encoding error.
fix based on bug report by Nick Wellnhofer, submitted in private in
case the issue turned out to have security implications.
Calling __tls_get_addr with brasl is not valid since it's a global symbol; doing
so results in an R_390_PC32DBL relocation error from lld. We could fix this by
marking __tls_get_addr hidden since it is not part of the s390x ABI, or by using
a different instruction. However, given its simplicity, it makes more sense to
just manually inline it into __tls_get_offset for performance.
The patch has been tested by applying to Zig's bundled musl copy and running the
full Zig test suite under qemu-s390x.
Some weird linkers may emit PT_LOAD segments with memsz = 0. The ELF
specification does not forbid this, but such a segment with a nonzero
p_vaddr results in an attempt to reclaim an invalid memory address.
This patch skips such segments during reclaiming for better
compatibility.
we have the cpuset macros call calloc/free/memset/memcmp directly so
that they don't depend on any further ABI surface. this is not
namespace-clean, but only affects the _GNU_SOURCE feature profile,
which is not intended to be namespace-clean. nonetheless, reports come
up now and then of things which are gratuitously broken, usually when
an application has wrapped malloc with macros.
this patch parenthesizes the function names so that function-like
macros will not be expanded, and removes the unused declaration of
memcpy. this is not a complete solution, but it should improve things
for affected applications, particularly ones which are not even trying
to use the cpuset interfaces which got them just because g++ always
defines _GNU_SOURCE.
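the effect of the parentheses can be seen in isolation with the sketch
below. counting_calloc and the two alloc_* helpers are hypothetical
and stand in for an application that wraps an allocator with a
function-like macro.

```c
#include <assert.h>
#include <stdlib.h>

static int wrapper_calls;

/* Hypothetical application wrapper. It is defined before the macro,
 * and parenthesizes the name anyway, so it reaches the real calloc. */
static void *counting_calloc(size_t n, size_t size)
{
	wrapper_calls++;
	return (calloc)(n, size);
}

/* The kind of application macro that breaks unparenthesized uses. */
#define calloc(n, size) counting_calloc(n, size)

/* Macro is expanded: goes through the wrapper. */
static void *alloc_plain(size_t n)
{
	assert(n > 0);
	return calloc(n, 1);
}

/* Name is not followed by '(' so the function-like macro is not
 * expanded: calls the real function directly. */
static void *alloc_paren(size_t n)
{
	assert(n > 0);
	return (calloc)(n, 1);
}
```

a function-like macro is only expanded when its name is immediately
followed by an opening parenthesis, which is why wrapping the name
alone is enough.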
the kernel mq_attr structure has 8 64-bit longs instead of 8 32-bit
longs.
it's not clear that this is the nicest way to implement the fix, but
the concept (translation) is right, and the details can be changed
later if desired.
previously, we left any changes made by the application to the timer
thread's signal mask active when resetting the thread state for reuse.
not only did this violate the intended invariant that timer threads
start with all signals blocked; it also allowed application code to
execute in a thread that, formally, did not exist. and further, if the
internal SIGTIMER signal became unblocked, it could also lead to
missed timer expiration events.
commit 6ae2568bc2 introduced a fatal
signal condition if the internal timer signal used for SIGEV_THREAD
timers is unblocked. this can happen whenever the application alters
the signal mask with SIG_SETMASK, since sigset_t objects never include
the bits used for implementation-internal signals.
this patch effectively reverts the breakage by adding back a no-op
signal handler.
overruns will not be accounted if the timer signal becomes unblocked,
but POSIX does not specify them except for SIGEV_SIGNAL timers anyway.
The LLVM assembler reportedly assembles the form using the j mnemonic
incorrectly (see issue 107460). The jr form is canonical and avoids
this problem, so use it instead.
When the pattern was changed from matching any whitespace to just
matching spaces and tabs, a newline started being appended to the
value of the matched field, if that field was a string. For example,
in a 4-field line, the mnt_opts field would have a newline on the end.
This happened because a newline is not a space or a tab, and so was
matched as part of the value before the end of the string was reached.
\n should therefore be added as a character that terminates a value.
This shouldn't interfere with the intention of the change to space and
tab only, which was to make sure that other whitespace, such as carriage
returns, stays part of parsed values where it belongs.
Fixes: f314e133
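The rule can be sketched with strcspn standing in for the actual
scanning pattern. field_len is a hypothetical helper used only for
illustration; it is not how the parser itself is written.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: length of the value starting at s. Space, tab,
 * and newline end the value; other whitespace such as a carriage
 * return is treated as part of it. */
static size_t field_len(const char *s)
{
	assert(s);
	return strcspn(s, " \t\n");
}
```

For a last field such as "rw,relatime\n" this yields 11, so the
trailing newline is no longer included in the value.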
The instruction encoding that would be "br %r0" is not actually a
branch to r0, but instead a nop/memory-barrier. gcc 14 has been found
to choose r0 for the "r"(pc) constraint, breaking CRTJMP.
This patch adjusts the inline assembly constraints and marks "pc" as
address ("a"), which disallows usage of r0.
commit 8cca79a72c added use of SYS_pause
to exit() without accounting for newer archs omitting the syscall.
use the newly-added __sys_pause abstraction instead, which uses
SYS_ppoll when SYS_pause is missing.
newer archs lack the syscall. the pause() function accounted for this
with its own #ifdef, but that didn't allow use of the syscall directly
elsewhere, so move the logic to macros in src/internal/syscall.h where
it can be shared.
commit b817541f1c introduced statx with
a fallback using fstatat, but failed to fill in stx_rdev_major/minor
and stx_attributes[_mask]. the rdev omission has been addressed
separately. rather than explicitly zeroing the attributes and their
mask, pre-fill the entire structure with zeros. this will also cover
the padding adjacent to stx_mode, in case it's ever used in the
future.
explicit zeroing of stx_btime is removed since, with this change, it
will already be pre-zeroed. as an aside, zeroing it was not strictly
necessary, since STATX_BASIC_STATS does not include STATX_BTIME and
thus does not indicate any validity for it.
The current implementation of the statx function fails to set the
values of stx->stx_rdev_major and stx->stx_rdev_minor if the statx
syscall fails with ENOSYS and thus the statx function has to fall back
on fstatat-based emulation.
the value placed in the aux vector AT_MINSIGSTKSZ by the kernel is
purely the signal frame size, and does not include any execution space
for the signal handler. this is contrary to the POSIX definition of
MINSIGSTKSZ to be a value that can actually execute at least some
minimal signal handler, and contrary to the historical definitions of
MINSIGSTKSZ which had at least 1k of headroom.
commit 996b6154b2 added support for
querying the dynamic limit but did not enforce it in sigaltstack. the
kernel also does not seem to reliably enforce it, or at least does not
necessarily enforce the same limit exposed to userspace, so it needs
to be enforced here.
internally, printf always works with the maximal-size supported
integer and floating point formats. however, the space needed to
format a floating point number is proportional to the mantissa and
exponent ranges. on archs where long double is larger than double,
knowing that the actual value fits in double allows us to use a much
smaller buffer, roughly 1/16 the size.
as a bonus, making the working buffer a VLA whose dimension depends on
the format specifier prevents the compiler from lifting the stack
adjustment to the top of printf_core. this makes it so printf calls
without floating point arguments do not waste even the smaller amount
of stack space needed for double, making it much more practical to use
printf in tightly stack-constrained environments.
linux puts hung-up ttys in a state where ioctls produce EIO, and may
do the same for other types of devices in error or shutdown states.
such an error clearly does not mean the device is not a tty, but it
also can't reliably establish that the device is a tty, so the only
safe thing to do seems to be reporting the error. programs that don't
check errno will conclude that the device is not a tty, which is no
different from what happens now, but at least they gain the option to
differentiate between the cases.
commit c84971995b introduced the errno
collapsing behavior, but prior to that, errno was not set at all by
isatty.
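a caller that wants to make use of the distinction might look like the
sketch below. classify_tty is a hypothetical wrapper, and it assumes
the libc in use sets errno to ENOTTY for non-tty descriptors and
leaves other error codes such as EIO or EBADF intact.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper: 1 = tty, 0 = definitely not a tty,
 * -1 = could not be determined (errno holds the reason). */
static int classify_tty(int fd)
{
	errno = 0;
	if (isatty(fd)) return 1;
	return errno == ENOTTY ? 0 : -1;
}
```

programs that only test the return value of isatty see no change in
behavior; only callers that inspect errno, as above, can tell the
cases apart.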
this is purely a readability change, not a functional one. all of the
integer format cases use a common tail for handling precision logic
after the string representation of the number has been generated. the
code as I originally wrote it was overly clever in the aim of making a
point that the flow could be done without goto, and jumped over
intervening cases by wrapping them in if (0) { }, with the case labels
for each inside the conditional block scope.
this has been a perpetual source of complaints about the readability
and comprehensibility of the file, so I am now changing it to
explicitly jump to the tail logic with goto statements.
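the shape of the new control flow, reduced to a toy, is sketched
below. fmt_int is hypothetical and far simpler than the real code; it
only shows several conversions jumping over an unrelated case to a
shared precision tail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature: digits are generated right-to-left ending
 * just before end, then the shared tail pads with zeros up to prec
 * digits. The caller must leave enough room before end. Returns a
 * pointer to the first character produced. */
static char *fmt_int(char *end, unsigned long x, int conv, int prec)
{
	char *z = end, *a = end;
	assert(end && prec >= 0);
	switch (conv) {
	case 'x':
		for (; x; x >>= 4) *--a = "0123456789abcdef"[x & 15];
		goto int_tail;
	case 'o':
		for (; x; x >>= 3) *--a = '0' + (x & 7);
		goto int_tail;
	case 'c':
		/* Intervening case with no precision handling. */
		*--a = (char)x;
		break;
	case 'u':
		for (; x; x /= 10) *--a = '0' + x % 10;
	int_tail:
		while (z - a < prec) *--a = '0';
		break;
	}
	return a;
}
```

the earlier style would have wrapped the 'c' case in if (0) { } so
that the cases before it could fall through to the tail.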
this is the same as cp850, but with the euro symbol replacing the
lowercase dotless i at 0xd5. it is significant because it's used by
thermal receipt printers.
the comment does not match the required or actual behavior when x<0
and y is not an integer. while it could be corrected, the role of
comments here is to tell about characteristics unique to the
implementation, not to restate the requirements of the standard, so
just removing it seems best.
while not the only error codes presently omitted, these two are
particularly likely to be encountered in the wild.
EUCLEAN is used by linux filesystem and device drivers to report
filesystem structure corruption or data corruption.
ENAVAIL is used by some linux drivers to indicate non-availability of
a resource.
both names are new inventions to correspond to how they are actually
used, as the original kernel strings ("Structure needs cleaning" and
"No XENIX semaphores available") are not remotely meaningful or
reasonable.
the file-level crt_arch.h asm fragments generally make direct
(non-PLT) calls from _start to _start_c, which is only valid when
there is a local, non-interposable definition for _start_c. generally,
the linker is expected to know that local definitions in a main
executable (as opposed to shared library) output are non-interposable,
making this work, but historically there have been linker bugs in this
area, and microblaze is reportedly still broken, flagging the
relocation for the call as a textrel.
the equivalent _dlstart_c, called from the same crt_arch.h asm
fragments, has always used hidden visibility without problem, and
semantically it should be hidden, so make it hidden. this ensures the
direct call is always valid regardless of whether the linker properly
special-cases main executable output.
if sem_post is interrupted between clearing the waiters bit from the
semaphore value and performing the futex wake operation, subsequent
calls to sem_post will not perform a wake operation unless a new
waiter has arrived.
usually, this is at most a minor nuisance, since the original wake
operation will eventually happen. however, it's possible that the wake
is delayed indefinitely if interrupted by a signal handler, or that
the address the wake needs to be performed on is no longer mapped if
the semaphore was a process-shared one that has since been unmapped
but has a waiter on a different mapping of the same semaphore. this
can happen when another thread using the same mapping "steals the
post" atomically before actually becoming a second waiter, deduces
from success that it was the last user of the semaphore mapping, then
re-posts and unmaps the semaphore mapping. this scenario was described
in a report by Markus Wichmann.
instead of checking only the waiters bit, also check the waiter count
that was sampled before the atomic post operation, and perform the
wake if it's nonzero. this will not produce any additional wakes under
non-race conditions, since the waiters bit only becomes zero when
targeting a single waiter for wake. checking both was already the
behavior prior to commit 159d1f6c02.
our pthread barrier implementation reportedly has bugs that could
lead to malfunction or crash in timer_create. while this has not been
reviewed to confirm, there have been past reports of pthread barrier
bugs, and it seems likely that something is actually wrong.
pthread barriers are an obscure primitive, and timer_create is the
only place we are using them internally at present. even if they were
working correctly, this means we are imposing linking of otherwise
likely-dead code whenever timer_create is used.
a pair of semaphores functions identically to a 2-waiter barrier
except for destruction order properties. since the parent is
responsible for the argument structure (including semaphores)
lifetimes, the last operation on them in the timer thread must be
posting to the parent.
previously, global dtors, which are executed after all atexit handlers
have been called rather than being implemented as an atexit handler
themselves, would deadlock if they called atexit.
it was intentional to disallow adding more atexit handlers past the
last point where they would be executed, since a successful return
from atexit imposes a contract that the handler will be executed, but
this was only considered in the context of calls to atexit from other
threads, not calls from the dtors.
to fix this, release the lock after the exit handlers loop completes,
but set a flag first so that we can make all future calls to
atexit return a failure code.
per the C and POSIX standards, calling exit "more than once",
including via return from main, produces undefined behavior. this
language predates threads, and at the time it was written, could only
have applied to recursive calls to exit via atexit handlers. C++
likewise makes calls to exit from global dtors undefined. nonetheless,
by the present specification as written, concurrent calls to exit by
multiple threads also have undefined behavior.
originally, our implementation of exit did have locking to handle
concurrent calls safely, but that was changed in commit
2e55da9118 based on it being undefined.
from a standpoint of both hardening and quality of implementation,
that change seems to have been a mistake.
this change adds back locking, but with awareness of the lock owner so
that recursive calls to exit can be trapped rather than deadlocking.
this also opens up the possibility of allowing recursive calls to
succeed, if future consensus ends up being in favor of that.
prior to this change, exit already behaved partly as if protected by a
lock as long as atexit was linked, but multiple threads calling exit
could concurrently "pop off" atexit handlers and execute them in
parallel with one another rather than serialized in the reverse order
of registration. this was a likely unnoticed but potentially very
dangerous manifestation of the undefined behavior. if on the other
hand atexit was not linked, multiple threads calling exit concurrently
could each run their own instance of global dtors, if any, likely
producing double-free situations.
now, if multiple threads call exit concurrently, all but the first
will permanently block (in SYS_pause) until the process terminates,
and all atexit handlers, global dtors, and stdio flushing/position
consistency will be handled in the thread that arrived first. this is
really the only reasonable way to define concurrent calls to exit. it
is not recommended usage, but may become so in the future if there is
consensus/standardization, as there is a push from the rust language
community (and potentially other languages interoperating with the C
runtime) to make concurrent calls to the language's exit interfaces
safe even when multiple languages are involved in a program, and this
is only possible by having the locking in the underlying C exit.
commit 895736d49b made these changes
along with fixing a real bug in LOG_MAKEPRI. based on further
information, they do not seem to be well-motivated or in line with
policy.
the result of LOG_FAC is not a meaningful facility value if we shift
it down like before, but apparently the way it is used by applications
is as an index into an array of facility names. moreover, all
historical systems which define it do so with the shift. as it is a
nonstandard interface, there is no justification for providing a macro
by the same name that is incompatible with historical practice.
the value of LOG_FACMASK likewise is 0x3f8 on all historical systems
checked. while only 5 bits are used for existing facility codes, the
convention seems to be that all 7 bits belong to the facility field
and theoretically could be used to expand to having more facilities.
that seems unlikely to happen, but there is no reason to make a
gratuitously incompatible change here.
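the array-index usage mentioned above can be sketched as follows. the
EX_ macros and facility_name are hypothetical names chosen to avoid
clashing with <syslog.h>; the mask is the 0x3f8 value given above, the
shift of 3 is the historical definition, and the table is deliberately
truncated to the first four facilities.

```c
#include <assert.h>
#include <stddef.h>

/* Example-local copies of the historical definitions. */
#define EX_LOG_FACMASK 0x3f8
#define EX_LOG_FAC(p)  (((p) & EX_LOG_FACMASK) >> 3)

/* Truncated table for illustration: facility codes 0 through 3. */
static const char *const facility_names[] = {
	"kern", "user", "mail", "daemon"
};

/* Hypothetical lookup: map a priority value to a facility name. */
static const char *facility_name(int pri)
{
	size_t i;
	assert(pri >= 0);
	i = EX_LOG_FAC(pri);
	if (i < sizeof facility_names / sizeof *facility_names)
		return facility_names[i];
	return "unknown";
}
```

without the shift, the result would be a multiple of 8 and useless as
an index into such a table.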
Per RFC 5952, ties for longest sequence of zero fields must be broken
by choosing the earliest, but the implementation put the leading
sequence of zeros at a disadvantage. That's because for example when
compressing "0:0:0:10:0:0:0:10" the strspn(buf+i, ":0") call returns 6
for the first sequence and 7 for the second one – the second sequence
has the benefit of a leading colon.
Changing the condition to require beating the leading sequence by not
one but two characters resolves the issue.
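The tie-breaking rule itself can be sketched independently of the
string-based approach. The helper below is hypothetical: it works on
the eight 16-bit groups rather than on the formatted buffer, so it
sidesteps the leading-colon asymmetry instead of compensating for it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store the start index of the longest run of
 * zero groups in *start and return its length (0 if there is none).
 * A later run must be strictly longer to win, so ties go to the
 * earliest run. */
static int longest_zero_run(const uint16_t g[8], int *start)
{
	int best = 0, best_len = 0;
	assert(g && start);
	for (int i = 0; i < 8; ) {
		int j = i;
		while (j < 8 && g[j] == 0) j++;
		if (j - i > best_len) {
			best = i;
			best_len = j - i;
		}
		i = j > i ? j : i + 1;
	}
	*start = best;
	return best_len;
}
```

For 0:0:0:10:0:0:0:10 both runs have length 3, and the helper reports
the one starting at group 0.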
while commit 53ac44ff4c fixed the temp
buffer being undersized, the use of a temp buffer to begin with was a
mistake. instead, compare the requested symbol name in-place and use
the already-null-terminated copy of the name without "64" present in
lfs64_list[] to look up the real symbol.
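a rough sketch of the in-place comparison is given below. base_names
is a made-up three-entry stand-in for lfs64_list[], and strip_lfs64 is
a hypothetical name; the real lookup may differ in detail.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the real list: NUL-separated base names,
 * terminated by an empty string. */
static const char base_names[] = "fopen\0lseek\0stat\0";

/* If name is a listed base name with "64" appended, return the
 * already NUL-terminated base name from the list; otherwise NULL.
 * No temporary buffer is needed. */
static const char *strip_lfs64(const char *name)
{
	size_t l;
	assert(name);
	l = strlen(name);
	if (l < 3 || name[l-2] != '6' || name[l-1] != '4') return NULL;
	l -= 2;
	for (const char *p = base_names; *p; p += strlen(p) + 1)
		if (!strncmp(name, p, l) && !p[l]) return p;
	return NULL;
}
```

the returned pointer can be passed straight to the symbol lookup since
the list entry is itself a complete string.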
add two ioctls to get and set struct epoll_params to allow users to
control epoll based busy polling of network sockets.
added to uapi in commit 18e2bf0edf4dd88d9656ec92395aa47392e85b61 (Linux
kernel 6.9 and newer).
this interface does not have a lot of historical consensus on how it
handles the contents of the /etc/shells file in regard to whitespace
and comments, but the commonality between all checked is that they
ignore lines that are blank or that begin with '#', so that is the
behavior we adopt.
these are nonstandard and unnecessary for using the associated
functionality, but resulted in applications that used them
malfunctioning.
patch based on proposed fix by erny hombre.
This syscall has been available since Linux 3.15 and has also been
implemented in glibc since version 2.28. It is commonly used in
filesystem or security contexts.
Constants RENAME_NOREPLACE, RENAME_EXCHANGE, RENAME_WHITEOUT are
guarded by _GNU_SOURCE as with glibc.
commit 1b0d48517f wrongly copied the
getdents return type of int rather than matching the ssize_t used by
posix_getdents. this was overlooked in testing on 32-bit archs but
obviously broke 64-bit archs.
without explicit alignment directives, whether they end up at the
necessary alignment depends on linker/linking conditions. initially
reported as mold issue 1255.
this interface was added as the outcome of Austin Group tracker issue
697. no error is specified for unsupported flags, which is probably an
oversight. for now, EOPNOTSUPP is used so as not to overload EINVAL.
the bits file is retained, but as a single generic version, to allow
for the unlikely future possibility of letting a new arch define
something differently.
previously, only a few archs defined it here. this change makes the
presence consistent across all archs, and reduces the amount of header
duplication (and potential for future inconsistency) between archs.
this change is purely to document that they are the same in
preparation to remove the arch-specific headers for these archs and
replace them with a generic version that matches riscv32 and can be
shared by these and all future archs.
commit f47a8cdd25 introduced an
alternate mechanism for access to runtime page size for compatibility
with early stages of dynamic linking, but because pthread_impl.h
indirectly includes libc.h, the condition #ifndef PAGE_SIZE was never
satisfied.
rather than depend on order of inclusion, use the (baseline POSIX)
macro PAGESIZE, not the (XSI) macro PAGE_SIZE, to determine whether
page size is dynamic. our internal libc.h only provides a dynamic
definition for PAGE_SIZE, not for PAGESIZE.
the %s conversion is added as the outcome of Austin Group tracker
issue 169 and its unspecified behavior is clarified as the outcome of
issue 1727.
the %F, %g, %G, %u, %V, %z, and %Z conversions are added as the
outcome of Austin Group tracker issue 879 for alignment with strftime
and the behaviors of %u, %z, and %Z are defined as the outcome of
issue 1727.
at this time, the conversions with unspecified effects on struct tm
are all left as parse-only no-ops. this may be changed at a later
time, particularly for %s, if there is reasonable cross-implementation
consensus outside the standards process on what the behavior should
be.
once the remaining value is less than 10, the modulo operation to
produce the final digit and division to prepare for next loop
iteration can be dropped. this may be a meaningful performance
distinction when formatting low-magnitude numbers in bulk, and should
never hurt.
based on patch by Viktor Reznov.
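the idea can be sketched as below. fmt_dec is a hypothetical helper
that always emits at least one digit, which may not match how the real
formatting routine treats zero.

```c
#include <assert.h>

/* Hypothetical helper: write the decimal digits of x ending just
 * before s and return a pointer to the first digit. The caller must
 * leave enough room before s. */
static char *fmt_dec(unsigned long x, char *s)
{
	assert(s);
	for (; x >= 10; x /= 10) *--s = '0' + x % 10;
	*--s = '0' + x;  /* final digit: no modulo, no division */
	return s;
}
```

for a single-digit input the loop body never runs, so no division or
modulo is performed at all.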
historically linux limited the number of supplementary groups a
process could be in to 32, but this limit was raised to 65536 in linux
2.6.4. proposals to support the new limit, change NGROUPS_MAX, or make
it dynamic have been stalled due to the impact it would have on
initgroups where the groups array exists in automatic storage.
the changes here decouple initgroups from the value of NGROUPS_MAX and
allow it to fall back to allocating a buffer in the case where
getgrouplist indicates the user has more supplementary groups than
could be reported in the buffer. getgrouplist already involves
allocation, so this does not pull in any new link dependency.
likewise, getgrouplist is already using the public malloc (vs internal
libc one), so initgroups does the same. if this turns out not to be
the best choice, both can be changed together later.
the initial buffer size is left at 32, but now as the literal value,
so that any potential future change to NGROUPS_MAX will not affect
initgroups.
commit cfa0a54c08 attempted to fix
rounding on archs where long double is not 80-bit (where LDBL_MANT_DIG
is not zero mod four), but failed to address the edge case where
rounding was skipped because LDBL_MANT_DIG/4 rounded down in the
comparison against the requested precision.
the rounding logic based on hex digit count is difficult to understand
and not well-motivated, so rather than try to fix it, replace it with
an explicit calculation in terms of number of bits to be kept, without
any truncating division operations. based on patch by Peter Ammon, but
with scalbn to apply the rounding exponent since the value will not
generally fit in any integer type. scalbn is used instead of scalbnl
to avoid pulling in the latter unnecessarily, since the value is an
exact power of two whose exponent range is bounded by LDBL_MANT_DIG, a
small integer.
The principal expressions defining acosh and acos are such that
acosh(z) = ±i acos(z)
where the + is only true on the Im(z)>0 half of the complex plane
(and partly on Im(z)==0 depending on number representation).
fix the comment without expanding on the details.
POSIX requires pwrite to honor the explicit file offset where the
write should take place even if the file was opened as O_APPEND.
however, linux historically defined the pwrite syscall family as
honoring O_APPEND. this cannot be changed on the kernel side due to
stability policy, but the addition of the pwritev2 syscall with a
flags argument opened the door to fixing it, and linux commit
73fa7547c70b32cc69685f79be31135797734eb6 adds the RWF_NOAPPEND flag
that lets us request a write honoring the file offset argument.
this patch changes the pwrite function to first attempt using the
pwritev2 syscall with RWF_NOAPPEND, falling back to using the old
pwrite syscall only after checking that O_APPEND is not set for the
open file. if O_APPEND is set, the operation fails with EOPNOTSUPP,
reflecting that the kernel does not support the correct behavior. this
is an extended error case needed to avoid the wrong behavior that
happened before (writing the data at the wrong location), and is
aligned with the spirit of the POSIX requirement that "An attempt to
perform a pwrite() on a file that is incapable of seeking shall result
in an error."
since the pwritev2 syscall interprets the offset of -1 as a request to
write at the current file offset, it is mapped to a different negative
value that will produce the expected error.
pwritev, though not governed by POSIX at this time, is adjusted to
match pwrite in honoring the offset.
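the fallback half of this logic can be sketched in isolation.
pwrite_fallback is a hypothetical name, and the sketch assumes the
attempt with pwritev2 and RWF_NOAPPEND has already been made and
failed as unsupported; that first attempt and the -1 offset remapping
are not shown.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sketch of the fallback path only. */
static ssize_t pwrite_fallback(int fd, const void *buf, size_t n, off_t off)
{
	int fl;
	assert(buf || n == 0);
	fl = fcntl(fd, F_GETFL);
	if (fl < 0) return -1;
	if (fl & O_APPEND) {
		/* The old syscall would ignore off and append instead,
		 * so refuse rather than write at the wrong location. */
		errno = EOPNOTSUPP;
		return -1;
	}
	return pwrite(fd, buf, n, off);
}
```

the check costs one extra fcntl call, but only on kernels where the
new flag is unavailable.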
added in linux kernel commit 73fa7547c70b32cc69685f79be31135797734eb6.
this is added now as a prerequisite for fixing pwrite/pwritev behavior
for O_APPEND files.
the jis0208 table we use is only 84x94 in size, but the shift_jis
encoding supports a 94x94 grid. attempts to convert sequences outside
of the supported zone resulted in out-of-bounds table reads,
misinterpreting adjacent rodata as part of the character table and
thereby converting these sequences to unexpected characters.
this is not needed, but may act as a hint to the compiler, and also
serves to suppress unused function warnings if enabled (on by default
since commit 86ac0f7947).
this is how it's defined in the cp936 document referenced by the IANA
charset registry as defining GBK, and of the mappings defined there,
was the only one missing.
it is not accepted for GB18030, as GB18030 is a UTF and has its own
unique mapping for the euro symbol.
- add mount_setattr from linux v5.12
- add epoll_pwait2 from linux v5.11
- add process_madvise from linux v5.10
- add __NR_faccessat2 from linux v5.8
- add pidfd_getfd and openat2 syscall numbers from linux v5.6
- add clone3 syscall number from linux v5.3
- add process_mrelease from linux v5.15
- add futex_waitv from linux v5.16
- add set_mempolicy_home_node from linux v5.17
- add cachestat from linux v6.4
- add __NR_fchmodat2 from linux v6.6
despite riscv32 being natively time64, the IPC_TIME64 bit (0x100) is
set in IPC_STAT and derived command macros, differentiating their
values from the raw command values used to interface with the kernel.
this reflects that the kernel ipc structure types are not natively
time64, but have broken-down hi/lo fields that cannot be used in-place
and require translation, and that the userspace struct types differ
from the kernel types (relevant to things like strace).
These are mostly copied from riscv64. _Addr and _Reg had to become int
to match compiler-controlled parts of the ABI (result type of sizeof,
etc.). There is no kernel stat struct; the userspace stat matches
glibc in the sizes and offsets of all fields (including glibc's
__dev_t __pad1). The jump buffer is 12 words larger to account for 12
saved double-precision floats; additionally it should be 64-bit
aligned to save doubles.
The syscall list was significantly revised by deleting all time32 and
pre-statx syscalls, and renaming several syscalls that have different
names depending on __BITS_PER_LONG, notably mmap2 and _llseek.
futex was added as an alias to futex_time64 since it is widely used by
software which does not pass time arguments.
__res_send returns the full answer length even if it didn't fit the
buffer, but __dns_parse expects the length of the filled part of the
buffer.
This is analogous to commit 77327ed064,
which fixed the only other __dns_parse call site.
A child process created by posix_spawn reports errors to its parent via
a pipe, retrying infinitely on any write error to prevent falsely
reporting success. If the (original) parent dies before write is
attempted, there is nobody to report to, but the child will remain
stuck in the write loop forever if SIGPIPE is blocked or ignored.
Fix this by not retrying write if it fails with EPIPE.
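A minimal sketch of such a loop follows. report_error and
demo_closed_reader are hypothetical names, and the sketch uses the
public write function rather than whatever raw syscall interface the
real child code uses.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: report err through the pipe fd, retrying on
 * failure unless the read end is gone. Returns 0 if the error was
 * reported, -1 if there is nobody left to report to. */
static int report_error(int fd, int err)
{
	assert(err != 0);
	while (write(fd, &err, sizeof err) < 0)
		if (errno == EPIPE) return -1;
	return 0;
}

/* Demonstration: with SIGPIPE ignored and the read end closed, the
 * loop terminates instead of spinning forever. Returns the result of
 * report_error, or 1 if the pipe could not be created. */
static int demo_closed_reader(void)
{
	int p[2], r;
	if (pipe(p)) return 1;
	signal(SIGPIPE, SIG_IGN);
	close(p[0]);  /* the "parent" is gone */
	r = report_error(p[1], ENOENT);
	close(p[1]);
	return r;
}
```

Other write errors are still retried, so a transient failure cannot
be mistaken for success by the parent.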
user_regs_struct and user_fp_struct were missing from the initial
commit of the port.
the union type for elf_fpreg_t and the new value of ELF_NFPREG are
made consistent with glibc.
originally, compilers did not provide these macros and we had to
provide them ourselves. this meant we were redefining them, which was
technically invalid unless the token sequence of the original
definition matched exactly.
the original patch proposed by Jules Maselbas to fix this made the
definitions conditional on them not already being defined; however I
suggested using #undef to avoid any possibly-wrong definitions already
in place and ensure that the definitions are 1. the version adopted as
commit 8b70486807 made this change.
unfortunately, gcc is loud about not liking #undef of any __STDC_*
macro name, and while warnings are suppressed in the system include
path, there is apparently no way to suppress this warning if the
system include dir has also been provided via -I.
while normally we don't go out of our way to satisfy warnings over
style in the public headers, in this case, it seems to be a matter of
disagreement over contract of which part of "the implementation" is
entitled to define or undefine macros belonging to the implementation,
and it's quite reasonable to conclude that the compiler may reject
attempts to undefine them.
this commit reverts to the originally-submitted version of the patch
making the definitions conditional.
this code dates back to the original commit of the sh port, with no
real clue as to how the bug was introduced. it looks like it was
written to assume the return address was pushed to the stack like on
x86, rather than arriving in the pr special register.
commit 0dc4824479 worked around the lack of a flags argument in the
fchmodat syscall.
linux 6.6 introduced a new syscall, SYS_fchmodat2, fixing this
deficiency. use it if any flags are passed, and fall back to the old
strategy on ENOSYS. continue using the old syscall when there are no
flags. this is the same strategy that was used when SYS_faccessat2 was
adopted to implement faccessat with flags.
the linux fchmodat syscall lacks a flag argument that is necessary to
implement the posix api, see
linux commit 09da082b07bbae1c11d9560c8502800039aebcea
fs: Add fchmodat2()
linux commit 78252deb023cf0879256fcfbafe37022c390762b
arch: Register fchmodat2, usually as syscall 452
see
linux commit cf264e1329fb0307e044f7675849f9f38b44c11a
cachestat: implement cachestat syscall
linux commit 946e697c69ffeeefdd84dad90eac307284df46be
cachestat: wire up cachestat for other architectures
see
linux commit c6018b4b254971863bd0ad36bb5e7d0fa0f0ddb0
mm/mempolicy: add set_mempolicy_home_node syscall
linux commit 21b084fdf2a49ca1634e8e360e9ab6f9ff0dee11
mm/mempolicy: wire up syscall set_mempolicy_home_node
see
linux commit 039c0ec9bb77446d7ada7f55f90af9299b28ca49
futex,x86: Wire up sys_futex_waitv()
linux commit ea7c45fde5aa3e761aaddb7902a31a95cb120e7b
futex,arm: Wire up sys_futex_waitv()
linux commit b3ff2881ba18b852f79f5476d7631940071f1adb
MIPS: syscalls: Wire up futex_waitv syscall
linux commit 6c122360cf2f4c5a856fcbd79b4485b7baec942a
s390: wire up sys_futex_waitv system call
linux commit a0eb2da92b715d0c97b96b09979689ea09faefe6
futex: Wireup futex_waitv syscall
see
linux commit 884a7e5964e06ed93c7771c0d7cf19c09a8946f1
mm: introduce process_mrelease system call
linux commit dce49103962840dd61423d7627748d6c558d58c5
mm: wire up syscall process_mrelease
see
linux commit 7bb7f2ac24a028b20fca466b9633847b289b156a
arch, mm: wire up memfd_secret system call where relevant
linux commit 1507f51255c9ff07d75909a84e7c0d7f3c4b2f49
mm: introduce memfd_secret system call to create "secret" memory areas
linux commit b633896314c0f78f2b4eb7b19a530d68f2a35445
tools headers UAPI: Sync s390 syscall table file that wires up the
memfd_secret syscall
this commit should make no codegen change for existing archs, but is a
prerequisite for new archs including riscv32. the wait4 emulation
backend provides both cancellable and non-cancellable variants because
waitpid is required to be a cancellation point, but all of our other
uses are not, and most of them cannot be.
based on patch by Stefan O'Rear.
commit f47a5d400b overlooked that
strtoul was responsible for setting p to a const-laundered copy of the
format string pointer f, even in the case where there was no number to
parse. by making the call conditional on isdigit, that copy was lost.
the logic here is a mess and should be cleaned up, but for now, this
seems to be the least invasive change that undoes the breakage.
commit f247462b08 incorrectly hid ppoll
in the presence of _GNU_SOURCE due to an oversight that defining
_BSD_SOURCE does not implicitly define _GNU_SOURCE. at present,
headers still have to explicitly check for each feature profile level;
this may be changed at some point in the future via features.h, but
has not been changed yet.
depending on contents of the LC_TIME locale, log messages could be
malformatted (especially if the ABMON strings contain non-alphabetic
characters) or the subsequent code could invoke undefined behavior,
via passing a timebuf[] with unspecified contents to snprintf, if
the translated ABMON string did not fit in the 16-byte timebuf.
this does not appear to be a security-relevant bug, as locale loading
functionality is intentionally not available to set*id programs -- the
MUSL_LOCPATH environment variable is ignored when libc.secure is true,
and custom locales are not loadable without it.
Undefine any previous __STDC_UTF_{16,32}__ macros before defining
them to prevent any warnings about redefining macros.
This happens as a result of some compiler versions defining the macros
themselves.
Linux and most systems do not have symlink permissions, but some
systems, including MacOS, do, and creation of the symlink with umask
set to 0777 makes the symlink inaccessible on such systems.
clear umask when making a symlink so that the behavior is uniform.
having these constants be static was unnecessary, so just remove the
static.
this error should have been caught by compilers, but recent versions
of both gcc and clang accept these as "other forms of constant
expressions" which the C standard allows.
Previously, __riscv_flush_icache would not work correctly as
__vdso_flush_icache had a wrong symbol version. Fix this by correcting
the symbol version.
Fixes: 0a48860c27 ("add riscv64 architecture support")
Note: Some relocation types were only used by binutils and
accidentally exposed to previous versions of psABI. One of the values
has been reused by GOT32_PCREL.
the ppoll function has been accepted as a future part of the standard
as the outcome of Austin Group tracker issue 1263. at some point it
should be exposed unconditionally, but for now, expose it in the
default feature profile.
the ppoll function has been accepted as a future part of the standard
as the outcome of Austin Group tracker issue 1263. move the source
file to reflect this.
this was a POSIX requirement that was always in conflict with ISO C,
which specified a well-defined behavior for snprintf and swprintf so
long as the actual number of bytes/characters produced did not exceed
INT_MAX.
I originally raised this conflict for snprintf with the Austin Group
as tracker issue 761, which was never resolved. it was later reported
again as issue 1219, and as a result the conflicting requirement has
been removed.
the corresponding issue with swprintf does not seem to have been
addressed, but as the same reasoning applies to it, I am removing the
limitation on n for swprintf as well.
strtoul will consume leading whitespace or sign characters, which are
not valid in this context, thereby accepting invalid field specifiers.
so, avoid calling it unless there is a number to parse as the width.
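a minimal sketch of the guard, assuming a hypothetical helper name
(width_or_none) and a return value of -1 for "no width present". it is
not the actual parser, only the isdigit-before-strtoul pattern.

```c
#include <ctype.h>
#include <stdlib.h>

/* returns the width found at s, or -1 if s does not start with a digit.
 * the isdigit guard keeps strtoul from skipping whitespace or a sign. */
static long width_or_none(const char *s)
{
	if (!isdigit((unsigned char)*s)) return -1;
	return (long)strtoul(s, 0, 10);
}
```

for comparison, an unguarded strtoul(" +12", 0, 10) yields 12, which is
how an invalid specifier could be accepted.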
this matters because the kernel-provided mtab only escapes tabs,
spaces, newlines, and backslashes. it leaves carriage returns, form
feeds, and vertical tabs literal.
As entries in mtab are delimited by spaces, whitespace characters
are escaped as octal sequences. When reading them out, we have to
unescape these sequences to get the proper string.
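The decoding step could look roughly like the sketch below. The function
name is hypothetical, and the sketch assumes every escape is a backslash
followed by exactly three octal digits; it is not the code from the
patch itself.

```c
/* Decode \ooo octal escapes in place and return s.
 * Anything that is not a complete three-digit escape is copied as-is.
 * Note that \000 would decode to a terminator and truncate the string. */
static char *unescape_octal(char *s)
{
	char *src = s, *dst = s;
	while (*src) {
		if (src[0] == '\\' &&
		    src[1] >= '0' && src[1] <= '3' &&
		    src[2] >= '0' && src[2] <= '7' &&
		    src[3] >= '0' && src[3] <= '7') {
			/* first digit is limited to 0..3 so the value fits a byte */
			*dst++ = (char)((src[1]-'0')*64 + (src[2]-'0')*8 + (src[3]-'0'));
			src += 4;
		} else {
			*dst++ = *src++;
		}
	}
	*dst = 0;
	return s;
}
```

For example, "/mnt/my\040disk" becomes "/mnt/my disk", and "\134"
becomes a single backslash.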
presently this only affects 32-bit arm. despite correctly reversing
the function pointer and argument fields based on the
TLSDESC_BACKWARDS macro, we did not read the addend from the
swapped-order argument field, so nonzero addends were lost, producing
wrong runtime addresses for TLS objects needing an addend.
based on report and patch by Rui Ueyama.
this is contrary to the spec as written, which requires %lc to behave
as if it were %ls on a 2-wchar_t buffer containing the argument and
zero. however, apparently no other implementations conform to the spec
as written, and in response to Austin Group issue #1647, WG14 chose to
align with existing practice and have %lc produce output for this case.
The name resolution would abort when getting more than 63 records per
request, due to what seems to be a left-over from the original code.
This check was non-breaking but spurious prior to TCP fallback
support, since any 512-byte packet with more than 63 records was
necessarily malformed. But now, it wrongly rejects valid results.
Reported by Daniel Stefanik in Alpine Linux aports issue 15320.
AT_NO_AUTOMOUNT is implied for stat/lstat/fstatat syscalls since Linux
3.1 (commit b6c8069d3577481390b3f24a8434ad72a3235594). However, this
is not the case for statx syscall, which defaults to automounting, so
this flag must be passed explicitly when statx is used to implement
stat-like functions.
This change affects only arches which use 32-bit seconds in struct kstat,
as well as out-of-tree/future ports to arches which lack SYS_fstatat.
C11 6.11.5p1:
> The placement of a storage-class specifier other than at the
> beginning of the declaration specifiers in a declaration is an
> obsolescent feature.
gcc also warns about this.
If __synccall() fails to capture all threads because tkill fails for
some reason other than EAGAIN, then the callback given will never be
executed, so nothing will ever overwrite the initial value. So that is
the value that will be returned from the function. The previous setting
of 1 is not a valid value for setuid() et al. to return.
I chose -EAGAIN since I don't know the reason the synccall failed ahead
of time, but EAGAIN is a specified error code for a possibly temporary
failure in setuid().
The code intends for the sem_post() in line 97 (now 98) to only unblock
target threads waiting on line 29. But after the first thread is
released, the next sem_post() might also unblock a thread waiting on
line 36. That would cause the thread to return to the execution of user
code before all threads are done, leading to user code being executed in
a mixed-credentials environment.
What's more, if this happens more than once, then the mass release on
line 110 (now line 111) will cause multiple threads to execute the
callback at the same time, and the callbacks are currently not written
to cope with that situation.
Adding another semaphore allows the caller to say explicitly which
threads it wants to release.
previously, the relative load address was used as the address at which
to find the ELF headers. this only works if two conditions are met:
ldso is linked to start at a virtual address of 0, and the linker is
cooperative and includes the main ELF headers in a loadable segment.
while in practice these are always met, modern linkers provide a
__ehdr_start symbol pointing to the ELF headers, and can in principle
use the reference to this symbol as an indication that they need to be
mapped in a segment. this also should make it possible to link for a
different starting virtual address, if that's ever desirable.
commit 37bb3cce45 suppressed the
declaration for C++, where it is wrongly interpreted as declaring the
function as taking no arguments. with C23 removing non-prototype
declarations, that problem is now also relevant to C.
the non-prototype declaration for basename originates with commit
06aec8d715, where it was designed to
avoid conflicts with programs which declare basename with the GNU
signature taking const char *. that change was probably misguided, as
it represents not only misaligned expectations with the caller, but
also undefined behavior (calling a function that's been declared with
the wrong type).
we could opt to fix the declaration, but since glibc, with the
gratuitously incompatible GNU-basename function, seems to be the only
implementation that declares it in string.h, it seems better to just
remove the declaration. this provides some warning if applications are
being built expecting the GNU behavior but not getting it. if we
declared it here, it would only produce a warning if the caller also
declares it themselves (rare) or if the caller attempts to pass a
const-qualified pointer.
These were overlooked when DT_RELR was added in commit
d32dadd60e, potentially breaking
software that treats presence of the DT_RELR macro as implying they
exist.
when the result count was zero, glob was ignoring a possible
GLOB_ABORTED error code and returning GLOB_NOMATCH. whether this
happened could be nondeterministic and dependent on the order of
dirent enumeration, in cases where multiple matches were present and
only some produced errors.
caught by Tor's test_util_glob.
This is the only missing part in struct statvfs. The LSB calls
[f]statfs() deprecated, and its weird types are definitely
off-putting. However, its use is required to get f_type.
Instead, allocate one of the six spares to f_type, copied directly
from struct statfs. This then becomes a small extension to the
standard interface on Linux, instead of two different interfaces, one
of which is quite odd due to being an ABI type, and there no longer is
any reason to use statfs().
The underlying kernel type is a mess, but all architectures agree on u32
(or more) for the ABI, and all filesystem magicks are 32-bit integers.
Since commit 6567db65f4 (prior to
1.0.0), the spare slots have been zero-filled, so on all versions that
may reasonably be encountered in the wild, applications can rely on
a nonzero f_type as indication that the new field has been filled in.
powl used >= LDBL_MAX as infinity check, but LDBL_MAX is finite, so
this can cause wrong results e.g. powl(LDBL_MAX, 0.5) returned inf
or powl(2, LDBL_MAX) returned inf without raising overflow.
huge y values (close to LDBL_MAX) could cause intermediate results to
overflow (computing y * log2(x) with more than long double precision)
and e.g. powl(0.5, 0x1p16380L) or powl(10, 0x1p16380L) returned nan.
this is fixed by handling huge y early since that always overflows or
underflows.
reported by Paul Zimmermann against expl10 (which uses powl).
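the difference between the two kinds of check can be seen in isolation
in the sketch below. both helper names are made up for this example, and
the code only models the comparison, not powl itself.

```c
#include <float.h>
#include <math.h>

/* old-style check: also true for the finite value LDBL_MAX */
static int looks_infinite_old(long double x)
{
	return x >= LDBL_MAX || x <= -LDBL_MAX;
}

/* true only for actual infinities */
static int is_infinite(long double x)
{
	return isinf(x) != 0;
}
```

looks_infinite_old(LDBL_MAX) is true even though LDBL_MAX is finite,
while is_infinite(LDBL_MAX) is false.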
acosh(x) is nan for x < 1, but x < 0 cases were not handled specially
and acoshl gave wrong result for some -0x1p32 < x < -2 values, e.g.:
acoshl(-0x1p20) returned -inf,
acoshl(-0x1.4p20) returned -0x1.db365758403aa9acp+0L,
fixed by checking the sign bit and handling it specially.
reported by Paul Zimmermann.
the __dns_parse code used by the stub resolver traditionally included
code to reject label pointers to offsets past a 512 byte limit,
despite never processing the label contents, only stepping over them.
when commit 51d4669fb9 added support for
tcp fallback, this limit was overlooked, and as a result, it was at
least theoretically possible for some valid large answers to be
rejected on account of these offsets.
since the limit was never serving any useful purpose, just remove it.
in the event of chained CNAMEs, the answer to a query will contain the
entire CNAME chain, not just one CNAME record. previously, the answer
buffer size had been chosen to admit a maximal-length CNAME, but only
one. a moderate-length chain could fill the available 768 bytes
leaving no room for an actual address answering the query.
while the DNS RFCs do not specify any limit on the length of a CNAME
chain, or any reasonable behavior if the chain exceeds the entire 64k
possible message size, actual recursive servers have to impose a
limit, and as such, for all practical purposes, chains longer than this
limit are not usable. it turns out BIND has a hard-coded limit of 16,
and Unbound has a default limit of 11.
assuming the recursive server makes use of "compression" (pointers),
each maximal-length CNAME record takes at most 268 bytes, and thus any
chain up to length 16 fits in at most 4288 bytes.
this patch increases the answer buffer size to preserve the original
intent of having 512 bytes available for address answers, plus space
needed for a maximal CNAME chain, for a total of 4800 bytes. the
resulting size of 9600 bytes for two queries (A+AAAA) is still well
within what is reasonable to place in automatic storage.
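the arithmetic above can be written out as compile-time constants. the
enumerator names are invented for this sketch and do not correspond to
identifiers in the source.

```c
enum {
	ADDR_ANSWER_SPACE = 512,  /* room kept for address records */
	MAX_CNAME_RECORD  = 268,  /* maximal CNAME record, with compression */
	MAX_CNAME_CHAIN   = 16,   /* the larger of the two server limits cited */
	CHAIN_SPACE  = MAX_CNAME_CHAIN * MAX_CNAME_RECORD,
	ANSWER_BUF   = ADDR_ANSWER_SPACE + CHAIN_SPACE,
	BOTH_QUERIES = 2 * ANSWER_BUF  /* A + AAAA */
};

_Static_assert(CHAIN_SPACE == 4288, "16 records of 268 bytes");
_Static_assert(ANSWER_BUF == 4800, "chain space plus 512 bytes");
_Static_assert(BOTH_QUERIES == 9600, "two answer buffers");
```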
the extra terms 3 and LDBL_MANT_DIG/4 are remnants of a proto-musl
implementation of printf where the sign/prefix and floating point
conversions were performed naively into this buffer. having them there
obscures the actual intended buffer size (sufficient to hold between 2
and 3 octal digits per byte, rounded up to 3 for simplicity) and
interferes with upcoming work to add C2x binary formats which would
otherwise be stuck having to explain a similar fix to buffer size as
part of an unrelated change.
%c takes an argument of type int, not char, and %lc/%C takes an
argument of type wint_t (unsigned), not int.
for most cases, this makes no practical difference, but since wide
printf variants convert narrow %c format specifiers via btowc,
interpreting the promoted-to-int unsigned char value passed in as a
(signed, on most archs) char causes 255 to get collapsed to EOF and
interpreted as such by btowc.
this is only relevant in the byte-based C locale, so prior to commit
f22a9edaf8, there was no observable
distinction in behavior. for UTF-8, all bytes which might be negative
when interpreted as char are encoding errors when used with %c/btowc.
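the collapse of 255 to EOF can be modeled as follows. the helper names
are hypothetical, and the sketch assumes that EOF is -1 and that
converting 255 to signed char yields -1, as is the case on common
implementations. the two helpers only give different results for 255
where that byte is a valid character in the C locale.

```c
#include <wchar.h>

/* models reading the promoted %c argument back as a signed char:
 * 255 becomes -1, which btowc treats as EOF */
static wint_t widen_as_signed_char(int arg)
{
	return btowc((signed char)arg);
}

/* models passing the int argument through unchanged */
static wint_t widen_as_int(int arg)
{
	return btowc(arg);
}
```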
the clone() function has been effectively unusable since it was added,
due to producing a child process with inconsistent state. in
particular, the child process's thread structure still contains the
tid, thread list pointers, thread count, and robust list for the
parent. this will cause malfunction in interfaces that attempt to use
the tid or thread list, some of which are specified to be
async-signal-safe.
this patch attempts to make clone() consistent in a _Fork-like sense.
as in _Fork, when the parent process is multi-threaded, the child
process inherits an async-signal context where it cannot call
AS-unsafe functions, but its context is now intended to be safe for
calling AS-safe functions. making clone fork-like would also be a
future option, if it turns out that this is what makes sense to
applications, but it's not done at this time because the changes would
be more invasive.
in the case where the CLONE_VM flag is used, clone is only vfork-like,
not _Fork-like. in particular, the child will see itself as having the
parent's tid, and cannot safely call any libc functions but one of the
exec family or _exit.
handling of flags and variadic arguments is also changed so that
arguments are only consumed with flags that indicate their presence,
and so that flags which produce an inconsistent state are disallowed
(reported as EINVAL). in particular, all libc functions carry a
contract that they are only callable with ABI requirements met, which
includes having a valid thread pointer to a thread structure that's
unique within the process, and whose contents are opaque and only able
to be setup internally by the implementation. the only way for an
application to use flags that violate these requirements without
executing any libc code is to perform the syscall from
application-provided asm.
apparently Linux clears the registered exit futex address on fork.
this means that, if after forking the child process becomes
multithreaded and the original thread exits, the thread list will
never be unlocked, and future attempts to use the thread list will
deadlock.
re-register the exit futex address after _Fork in the child to ensure
that it's preserved.
mbrtowc truncates n to unsigned int when storing its copy.
If n > UINT_MAX and the locale is not POSIX, the function will
return a wrong value greater than UINT_MAX on the success path.
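The effect of the truncation can be reproduced with plain arithmetic.
The sketch below is a model with made-up helper names, assuming the
return value is computed as the saved copy of n minus the bytes still
remaining; it is not the mbrtowc source.

```c
#include <limits.h>
#include <stddef.h>

/* copy of n kept in an unsigned: wrong once n exceeds UINT_MAX */
static size_t consumed_truncated(size_t n, size_t used)
{
	const unsigned N = (unsigned)n;
	n -= used;
	return N - n;
}

/* copy of n kept in a size_t: always yields the bytes consumed */
static size_t consumed_exact(size_t n, size_t used)
{
	const size_t N = n;
	n -= used;
	return N - n;
}
```

Where size_t is wider than unsigned, consumed_truncated with
n = UINT_MAX + 1 and two bytes used returns a value far above UINT_MAX,
while consumed_exact returns 2.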
aside from the documented differences, which are the contents of this
patch, GCC's -Os also has hard-coded unwanted behaviors which are
impossible to override, like refusing to strength-reduce division by a
constant to multiplication, presumably because the div saves a couple
bytes of code. for this reason, getting rid of -Os and switching to an
equivalent default optimization profile based on -O2 has been a
long-term goal.
as follow-ups, it may make sense to evaluate which of these variations
from -O2 actually do anything useful, and eliminate the ones which are
not helpful or which throw away performance for insignificant size
savings. but for now, I've replicated -Os as closely as possible to
provide a baseline for such evaluation.
The nl_type and nl_arg arrays defined in vfwprintf may be accessed
with an index up to and including NL_ARGMAX, but they are only of size
NL_ARGMAX, meaning they may be written to or read from 1 element too
far.
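A minimal sketch of the sizing rule, using a hypothetical ARG_LIMIT in
place of NL_ARGMAX and a made-up helper; the real arrays live inside
vfwprintf.

```c
enum { ARG_LIMIT = 9 };  /* hypothetical stand-in for NL_ARGMAX */

/* one extra slot so that index ARG_LIMIT itself is in bounds */
static int arg_types[ARG_LIMIT + 1];

/* stores type at position pos, rejecting positions past the limit */
static int record_arg_type(unsigned pos, int type)
{
	if (pos > ARG_LIMIT) return -1;
	arg_types[pos] = type;
	return 0;
}
```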
Resource usage data is filled by the kernel only when wait4 returns
a pid, i.e. a positive value.
Commit 5850546e96 introduced this bug,
possibly because of copy-pasting from getrusage.
For time64 support, musl normally defines SYS_foo to the time32 variant
of that syscall on arches that have it, and to the time64 variant
otherwise, so that "SYS_foo == SYS_foo_time64" implies that the arch is
time64-only. However, SYS_semtimedop is an odd case: some arches define
only SYS_semtimedop_time64, yet they are not time64-only, because the
time32 variant is provided via SYS_ipc instead. For such arches,
defining SYS_semtimedop to SYS_semtimedop_time64 would break the
implication above, so commit 4bbd7baea7
doesn't do this. Commit eb2e298cdc
attempts to detect time64-only arches by checking that both
SYS_semtimedop and SYS_ipc are undefined, but this doesn't work for
x32, because it's a time64-only arch that does define SYS_semtimedop.
As a result, 32-bit timeouts trigger the fallback path that passes
a 32-bit timespec to the kernel while it expects a 64-bit one, so
the effective tv_sec is formed by interpreting 32-bit tv_sec and
tv_nsec as a single long long, and the effective tv_nsec is whatever
is located in the next 64 bits of the stack.
Fix this by expanding the time64-only check to include arches where
SYS_semtimedop is the time64 variant of the syscall.
When an option that requires an argument is the last character of
argv[argc-1], getopt computes argv[argc] + optpos. While optpos
is always zero in this case, adding it to a null pointer is still
undefined.
If lstat/stat fails with EACCES, st is left uninitialized, but its
st_dev/st_ino fields are then used in several places:
* for FTW_MOUNT check (in practice typically results in a false
positive and an early return)
* for copying to the new struct history (though the struct is not used
afterwards since we don't recurse in this case)
* for cycle detection check (could theoretically result in a false
positive and an early return)
To avoid adding FTW_NS checks to all these places, fix this by
zero-initializing st_dev/st_ino (which can never match an existing
dentry due to zero inode being reserved in Linux), and check for FTW_NS
only when handling FTW_MOUNT since we need two valid dentries there.
commit 246f1c8114 inadvertently
introduced the local variable p as static by declaring it together
with lfs64_list. the function is only reachable under lock, and is not
called reentrantly, so this is not a functional bug, but it is
confusing and inefficient. fix by separating the declarations.
The received length field in the message may be greater than the
size of the 'answer' buffer in which the message resides. Currently,
ABUF_SIZE is 768. And if we get a larger 'alens[i]', it will result
in an out-of-bounds reading in __dns_parse().
To fix this, limit the length to the size of the received buffer.
the buffer-flush function did not account for mbtowc returning 0
rather than 1 when converting the nul character. this prevented
advancing past it, instead repeatedly converting it into the output
wide character string until the max output length was exhausted.
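the handling of the zero return can be sketched as below. the function
name and signature are assumptions for illustration, and the loop is a
simplified stand-in for the buffer-flush function rather than a copy of
it.

```c
#include <stdlib.h>
#include <wchar.h>

/* converts up to inlen bytes of in to wide characters in out,
 * returning the number of wide characters stored */
static size_t widen(wchar_t *out, size_t outmax, const char *in, size_t inlen)
{
	size_t n = 0;
	mbtowc(NULL, NULL, 0);  /* reset any shift state */
	while (inlen && n < outmax) {
		int l = mbtowc(&out[n], in, inlen);
		if (l < 0) break;    /* encoding error: stop */
		if (l == 0) l = 1;   /* nul: returns 0, but one byte was consumed */
		in += l;
		inlen -= (size_t)l;
		n++;
	}
	return n;
}
```

without the l == 0 adjustment, the loop would never advance past an
embedded nul and would keep storing it until outmax is reached.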
commit d42269d7c8 appropriated the
stream error flag temporarily to let the printf family of functions
suppress further output attempts after encountering a write error.
since the wide printf code relies on (narrow) vfprintf to print
padding and numeric conversions, a hack was put in vfprintf not to
clear the initial error status unless the stream is narrow oriented.
this was okay, because calling vfprintf on a wide-oriented stream
(outside of internal use by the implementation) produces undefined
behavior. however, it was highly non-obvious to anyone reading the
wide printf code, where the calls to fprintf without first checking
for error status appeared erroneous.
this patch removes all direct use of fprintf from the wide printf
core, except in the numeric conversions case where it was already
checked before starting processing of the directive that the error
status is not set. the other calls, which were performing padding, are
replaced by a new pad() helper function, which performs the check and
abstracts out the mechanism of writing the padding.
direct use of the error flag is also replaced by ferror, which is
defined as a macro in stdio_impl.h, expanding directly to the flag
check with no call or locking overhead.
unlike with wide printf variants, encoding errors are not a vector by
which this bug is reachable, and the out() helper function already
ensured that no further output could be written after an output error,
transient or otherwise. however, the %n specifier could still be
processed after an error, yielding a side effect that wrongly implied
output had succeeded.
due to buffering effects, it's still possible for %n to show output as
having "succeeded", but for it never to appear on the underlying file
due to an error at flush time. this change, however, ensures that
processing of %n does not conflict with any error which has already
been seen.
this fixes a broader bug for which a special case was reported by
Bruno Haible, in the form of %n getting processed (and reporting the
number of wide characters which would have been written, but weren't)
after an encoding error (EILSEQ). in addition to the %n case, some but
not all of the format specifiers continued to attempt output after an
error. in particular, %c, %lc, and %s all used fputwc directly without
any check for error status.
as long as the error condition was permanent rather than transient,
these write attempts had no visible side effects, but in theory it
could be visible, for example with EAGAIN/EWOULDBLOCK or ENOSPC, if
the condition precluding output came to an end. this could produce
output with missing non-final data, rather than just truncated output,
albeit with the function still returning -1 as expected to report an
error.
to fix this, a check is added to stop processing of any new directive
(including %n) if the stream is already in error state, and direct use
of fputwc is replaced with calls to the out() helper function, which
checks for error status.
note that fprintf is also used directly without checking error status,
but due to how commit d42269d7c8
previously attempted to solve the issue of output after error, the
call to fprintf does not attempt to write anything when the
wide-oriented stream is already in error state. this is non-obvious,
and is quite a hack, so it should be changed, but I've left it alone
for now to make the bug fix commit itself as non-invasive as possible.
this function was overlooked during the time64 transition, probably as
a result of not having any time-related types in its application-side
interface. however, for archs that lack the traditional poll syscall
and have only ppoll, it used timespec as part of its interface with
the kernel: the millisecond timeout was converted to a timespec to
pass to SYS_ppoll. this is a type/ABI mismatch on 32-bit archs with
legacy time32 syscalls.
only one supported arch, or1k, is affected. all of the others either
have SYS_poll, or are 64-bit.
rather than using timespec, define a type locally to match what the
kernel expects. the condition (SYS_ppoll_time64 == SYS_ppoll),
comparable to conditions used elsewhere in timespec-handling code,
evaluates true for "natively time64" 32-bit archs including x32,
future riscv32, and all future 32-bit archs (via definitions in
internal syscall.h). otherwise, the arch is either 64-bit or has
syscalls that take the legacy type, and in either case "long" is
correct.
this fix is based on bug report and proposal by Alexey Izbyshev but
with a different approach to the changes to minimize the contextual
knowledge needed for a reader to understand the source file.
If the (normalized) timeout passed to select exceeds INT_MAX seconds on
an arch with SYS_pselect6_time64 and the kernel is too old to support
time64 syscalls, the timeout is implicitly converted to (32-bit) long on
the fallback path, losing its upper 32 bits and potentially becoming a
small positive value, violating the intended semantics, or even
a negative value, causing the fallback syscall to fail. Fix this by
saturating the timeout at INT_MAX as done in other time64 fallback
cases.
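The difference between truncating and saturating can be sketched as
follows. Both helper names are hypothetical; the sketch only models the
narrowing of the seconds value, not the select code.

```c
#include <limits.h>
#include <stdint.h>

/* what an implicit conversion to a 32-bit long keeps: the low 32 bits */
static uint32_t low_32_bits(long long s)
{
	return (uint32_t)s;
}

/* saturate before narrowing so that large timeouts stay large */
static long saturate_seconds(long long s)
{
	return s > INT_MAX ? INT_MAX : (long)s;
}
```

A timeout of 0x100000001 seconds keeps only the value 1 when truncated,
but becomes INT_MAX when saturated.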
this is the best-effort fallback path for kernels that can't actually
support the dup3 functionality. it was setting FD_CLOEXEC flag on the
target fd (new) even if the dup2 operation failed. normally that
shouldn't happen under correct usage, but it's possible if the source
fd is not open or intentionally invalid (e.g. -1).
our dup3 code wrongly skipped directly to making the SYS_dup2 syscall
whenever the O_CLOEXEC bit of flags was not set. this is incorrect if
any new flags are ever added, as it would silently ignore them rather
than failing with an error.
archs which lack SYS_dup2 were unaffected.
adjust the logic so that SYS_dup3 is attempted whenever flags is
nonzero, and explicitly fail with EINVAL if SYS_dup3 is unavailable
and there are any unknown flags.
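the adjusted decision logic might be modeled like this. the enum and
both helpers are invented for the sketch, which leaves out the syscalls
themselves and only shows which call is attempted first and when the
fallback must refuse.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>

enum first_call { FIRST_DUP2, FIRST_DUP3 };

/* any nonzero flags value must reach SYS_dup3 so the kernel can vet it */
static enum first_call pick_first_call(int flags)
{
	return flags ? FIRST_DUP3 : FIRST_DUP2;
}

/* after SYS_dup3 turns out to be unavailable: emulate only known flags */
static int fallback_status(int flags)
{
	return (flags & ~O_CLOEXEC) ? -EINVAL : 0;
}
```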
kernels using the fallback have an inherent close-on-exec race
condition and as such support for them is only best-effort anyway.
however, ignoring potential new flags is still very bad behavior.
instead, fail with EINVAL.
If the buffer passed to getservbyport_r is just enough to store two
pointers after aligning it, getnameinfo is called with buflen == 0
(which means that service name is not needed) and trivially succeeds.
Then, strtol is called on the address just past the buffer end, and
if it doesn't happen to find the port number there, getservbyport_r
spuriously succeeds and returns the same bad address to the caller.
Fix this by ensuring that buflen is at least 1 when passed to
getnameinfo.
getifaddrs computes &ctx->first->ifa even if ctx->first is NULL. While
this shouldn't be possible on the success path because the loopback
interface is hardcoded into the kernel, this is still possible on the
error path (for example, if __rtnetlink_enumerate couldn't create a
socket due to exceeding the fd limit).
accept4 emulation via accept ignores unknown flags, so it can spuriously
succeed instead of failing (or succeed without doing the action implied
by an unknown flag if it's added in a future kernel). Worse, unknown
flags trigger the fallback code even on modern kernels if the real
accept4 syscall returns EINVAL, because this is indistinguishable from
socketcall returning EINVAL due to lack of accept4 support.
Fix this by always failing with EINVAL if unknown flags are present and
the syscall is missing or failed with EINVAL.
This is completely analogous to commit 633183b5d1.
Similar code called from __lookup_name is not affected because it checks
that the line contains the host name surrounded by blanks.
When IPv6 nameservers are present, __res_msend_rc attempts to disable
IPV6_V6ONLY socket option to ensure that it can communicate with IPv4
nameservers (if they are present too) via IPv4-mapped IPv6 addresses.
However, this option can't be disabled on bound sockets, so setsockopt
always fails.
A zero returned from recvmsg is currently treated as if some data were
received, so if a DNS server closes its TCP socket before sending the
full answer, __res_msend_rc will spin until the timeout elapses because
POLLIN event will be reported on each poll. Fix this by treating an
early EOF as an error.
DNS parsing callbacks pass the response buffer end instead of the actual
response end to dn_expand, so a malformed DNS response can use message
compression to make dn_expand jump past the response end and attempt to
parse uninitialized parts of that buffer, which might succeed and return
garbage.
There are several issues with range checks in this function:
* The question section parsing loop can read up to two out-of-bounds
bytes before doing the range check and bailing out.
* The answer section parsing loop, in addition to the same issue as
above, uses the wrong length in the range check that doesn't prevent
OOB reads when computing len later.
* The len range check before calling the callback is off by 10. Also,
p+len can overflow in a (probably theoretical) case when p is within
2^16 from UINTPTR_MAX.
Because __dns_parse is used only with stack-allocated buffers, such
small overreads can't result in a segfault. The first two also don't
affect the function result, but the last one may result in getaddrinfo
incorrectly succeeding and returning up to 10 bytes past the
response buffer as a part of the IP address, and in (canon) name
returned by getaddrinfo/getnameinfo being affected by memory past the
response buffer (because dn_expand might interpret it as a pointer).
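One way to write such range checks, shown here as a sketch rather than
the actual __dns_parse code, is to compare lengths against the bytes
remaining instead of forming p+len. The helper name and the assumption
that p already points at the 10-byte fixed part of a record (type,
class, TTL, data length) are made for this example.

```c
#include <stddef.h>

enum { RR_FIXED = 10 };  /* type(2) + class(2) + ttl(4) + rdlength(2) */

/* p points at the fixed part of a resource record, end at the end of
 * the received message. Returns the data length if both the fixed part
 * and the data fit in the message, or -1 otherwise. */
static long rr_data_len(const unsigned char *p, const unsigned char *end)
{
	size_t avail = (size_t)(end - p);
	if (avail < (size_t)RR_FIXED) return -1;        /* check before reading */
	size_t len = p[8] * 256u + p[9];
	if (len > avail - (size_t)RR_FIXED) return -1;  /* no p+len, no overflow */
	return (long)len;
}
```

Because the comparison is done on sizes, it cannot wrap around even if p
lies close to the top of the address space.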
Before this commit, DNS timeouts always used CLOCK_REALTIME, which
could produce spurious timeouts or delays if wall time changed for
whatever reason.
Now we try CLOCK_MONOTONIC and only fall back to CLOCK_REALTIME when
it is unavailable.
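A minimal sketch of the fallback, assuming a hypothetical helper named
now_ms that returns a millisecond timestamp; the real helper in the
resolver may differ in name and details.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* millisecond timestamp, preferring the monotonic clock */
static unsigned long now_ms(void)
{
	struct timespec ts;
	if (clock_gettime(CLOCK_MONOTONIC, &ts) < 0)
		clock_gettime(CLOCK_REALTIME, &ts);
	return (unsigned long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}
```

Timeouts are then computed from differences between two now_ms values,
which are unaffected by wall-clock changes when the monotonic clock is
in use.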
As a result of using simple subtraction to implement the return values
for wcscmp and wcsncmp, integer overflow can occur (producing
undefined behavior, and in practice, a wrong comparison result). This
does not occur for meaningful character values (21-bit range) but the
functions are specified to work on arbitrary wchar_t arrays.
This patch replaces the subtraction with a little bit of code that
orders the characters correctly, returning -1 if the character from
the first string is smaller than the one from the second, 0 if they
are equal and 1 if the character from the first string is larger than
the one from the second.
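A comparison of this kind could be sketched as follows. The function
name is a placeholder, and the code is an illustration of the
overflow-free ordering rather than a verbatim copy of the patch.

```c
#include <wchar.h>

static int wcscmp_sketch(const wchar_t *l, const wchar_t *r)
{
	/* advance while the strings agree and neither has ended */
	for (; *l == *r && *l && *r; l++, r++);
	/* order the first differing pair without subtracting */
	return *l < *r ? -1 : *l > *r;
}
```

Unlike subtraction, the comparisons stay well-defined even for inputs
such as WCHAR_MAX against WCHAR_MIN.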
A signed int shift overflowed when computing a constant mask; use a hex
literal instead. This is unlikely to cause actual issues unless the
code was compiled with ubsan or similar instrumentation specifically
to catch this. The stripped libc.so is unchanged on x86_64.
Reported by q66 on irc.
When a dot is encountered, the loop counter is incremented before
exiting the loop, but the corresponding ip array element is left
uninitialized, so the subsequent memmove (if "::" was seen) and the
loop copying ip to the output buffer will operate on an uninitialized
uint16_t.
The uninitialized data never directly influences the control flow and
is overwritten on successful return by the second half of the parsed
IPv4 address. But it's better to fix this to avoid unexpected
transformations by a sufficiently smart compiler and reports from
UB-detection tools.
The kernel defines a limit on the number of fds that can be passed
through an SCM_RIGHTS ancillary message as SCM_MAX_FD. The value was
255 before kernel 2.6.38 (after that it is 253), and an SCM_RIGHTS
ancillary message with 255 fds requires 1040 bytes, slightly more than
the current 1024 byte internal buffer in sendmsg. 1024 is an arbitrary
size, so increase it to match the arbitrary size limit in the
kernel. This fixes tests that are verifying they support up to
SCM_MAX_FD fds.
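
The size arithmetic can be checked with the standard CMSG_SPACE macro.
This is only a sketch; scm_rights_space is a hypothetical helper, and
the exact result depends on the target's cmsghdr size and alignment
(on a 64-bit Linux target it should come out to the 1040 bytes
mentioned above).

```c
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: bytes of control buffer needed for a single
 * SCM_RIGHTS message carrying nfds descriptors, including the
 * cmsghdr and any alignment padding. */
static size_t scm_rights_space(size_t nfds)
{
	return CMSG_SPACE(nfds * sizeof(int));
}
```

scm_rights_space(255) exceeds 1024, which is why a 1024-byte internal
buffer cannot hold the largest message the older kernels accept.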
until the mq notification event arrives, it is mandatory that signals
be blocked. otherwise, a signal can be received, and its handler
executed, in a thread which does not yet exist on the abstract
machine.
after the point of the event arriving, having signals blocked is not a
conformance requirement but a QoI requirement. while the application
can unblock any signals it wants unblocked in the event handler
thread, if they did not start out blocked, it could not block them
without a race window where they are momentarily unblocked, and this
would preclude controlled delivery or other forms of acceptance
(sigwait, etc.) anywhere in the application.
in the error path where the mq_notify syscall fails, the initiating
thread may have closed the socket before the worker thread calls recv
on it. even in the absence of such a race, if the recv call failed,
e.g. due to seccomp policy blocking it, the worker thread could
proceed to close, producing a double-close condition.
this can all be simplified by moving the mq_notify syscall into the
new thread, so that the error case does not require pthread_cancel.
now, the initiating thread only needs to read back the error status
after waiting for the worker thread to consume its arguments.
disabling cancellation around the pthread_join call seems to be the
safest and logically simplest fix. i believe it would also be possible
to just perform the unmap directly here after __tl_sync, removing the
dependency on pthread_join, but such an approach duplicately encodes a
lot more implementation assumptions.
the logic to check hwcap for SPE register file inadvertently clobbered
the val argument before use. switch to a different work register so
this doesn't happen.
we wrongly defined a dummy SA_RESTORER flag on these archs, despite
the kernel interface not actually having such a feature. on archs
which lack SA_RESTORER, the kernel sigaction structure also lacks the
restorer function pointer member, which means the signal mask appears
at a different offset. the kernel was thereby interpreting the bits of
the code address as part of the signal set to be masked while handling
the signal.
this patch removes the erroneous SA_RESTORER definitions from archs
which do not have it, makes access to the member conditional on
whether SA_RESTORER is defined for the arch, and removes the
now-unused asm for the affected archs.
because there are reportedly versions of qemu-user which also use the
wrong ABI here, the old ksigaction struct size is preserved with an
unused member at the end. this is harmless and mitigates the risk of
such a bug turning into a buffer overflow onto the sigaction
function's stack.
the result of the 0xffff mask with the exit status could have bit 15
set, in which case multiplying by 0x10001 overflows 32-bit signed int.
making the multiply unsigned avoids the overflow. it also changes the
sign extension behavior of the subsequent >> operation, but the
affected bits are all unwanted anyway and all discarded by the cast to
short.
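
a sketch of an expression with the shape described above. the macro
actually changed is not shown in this log, so the function below is a
hypothetical reconstruction: it tests for a low byte of 0x7f with a
nonzero second byte.

```c
/* hypothetical reconstruction, not the actual macro. */
static int low7f_high_nonzero(int status)
{
	/* 0x10001U makes the multiply unsigned, so a masked value
	 * with bit 15 set cannot overflow a 32-bit signed int. the
	 * shift and cast leave (low byte << 8 | second byte) in a
	 * short; conversion of out-of-range values to short is
	 * implementation-defined and assumed here to wrap. */
	return (short)(((status & 0xffff) * 0x10001U) >> 8) > 0x7f00;
}
```

bits above bit 15 of the shifted product differ between the signed
and unsigned versions, but the cast to short discards them either way.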
mips has its own mechanisms for DT_DEBUG because it makes _DYNAMIC
read-only, and the original mechanism, DT_MIPS_RLD_MAP, was
PIE-incompatible. DT_MIPS_RLD_MAP_REL was added to remedy this, but we
never implemented support for it. add it now using the same idioms for
mips-specific ldso logic.
memmem has been adopted for the next issue of POSIX (outcome of
tracker item 1061). since mem* is in the reserved namespace for
string.h it's already fully conforming to expose it by default, so
just do so.
while no lock is held here making it a lock-order issue, replacement
malloc is likely to want to use pthread_atfork, possibly making the
call to malloc infinitely recursive.
even if not, there is no reason to prefer an application-provided
malloc here.
printf_core() runs twice, and during its first run, nl_arg is
uninitialized and must not be read. It gets initialized at the end of
the first run. Conversely, nl_type does not need to be set during the
second run, as its useful life has ended at that point, since the only
time it is read is during that exact same initialization. Therefore we
can simply alternate the assignments.
p and w do still need to get values assigned to them, since at least one
line in the same if-statement depends on that, but they can be dummy
values. arg does not need to be assigned, since in the first run, we
encounter a continue statement before using the argument.
because the has-waiters state in the semaphore value futex word is
only representable when the value is zero (the special value -1
represents "0 with potential new waiters"), it's lost if intervening
operations make the semaphore value positive again. this creates an
ABA issue in sem_post, whereby the post uses a stale waiters count
rather than re-evaluating it, skipping the futex wake if the stale
count was zero.
the fix here is based on a proposal by Alexey Izbyshev, with minor
changes to eliminate costly new spurious wake syscalls.
the basic idea is to replace the special value -1 with a sticky
waiters bit (repurposing the sign bit) preserved under both wait and
post. any post that takes place with the waiters bit set will perform
a futex wake.
to be useful, the waiters bit needs to be removable, and to remove it
safely, we perform a broadcast wake instead of a normal single-task
wake whenever removing the bit. this lets any un-accounted-for waiters
wake and re-add the waiters bit if they still need it.
there are multiple possible choices for when to perform this
broadcast, but the optimal choice seems to be doing it whenever the
observed waiters count is less than two (semantically, this means
exactly one, but we might see a stale count of zero). in this case,
the expected number of threads to be woken is one, with exactly the
same cost as a non-broadcast wake.
when PAGE_SIZE is not constant, internal/libc.h defines it to expand
to libc.page_size. however, kernel_mapped_dso, reachable from stage 2
of the dynamic linker bootstrap (__dls2), needs PAGE_SIZE to interpret
the relro range. at this point the libc object is both uninitialized
and invalid to access according to our model for bootstrapping, which
does not assume any external-linkage objects are accessible until
stages 2b/3. in practice it likely worked because hidden visibility
tends to behave like internal linkage, but this is not a property that
the dynamic linker was designed to rely upon.
this bug likely manifested as relro malfunction on archs with variable
page size, due to incorrect mask when aligning the relro bounds to
page boundaries.
while there are certainly more direct ways to fix the known problem
point here, a maximally future-proof way is to just bypass the libc.h
PAGE_SIZE definition in the dynamic linker and instead have dynlink.c
define its own internal-linkage object for variable page size. then,
if anything else in stage 2 ever ends up referencing PAGE_SIZE, it
will just automatically work right.
this is analogous to skip_relative logic in do_relocs -- because
relative relocations for the dynamic linker itself were already
performed at entry (stage 1), they must not be applied again.
the rule that the longest digit sequence not beginning with a zero is
greater only applies when both sequences being compared are
non-degenerate. this is spelled out explicitly in the man page, which
may be deemed authoritative for this nonstandard function: "If one or
both of these is empty, then return what strcmp(3) would have
returned..."
we were wrongly treating any sequence of digits not beginning with a
zero as greater than a non-digit in the other string.
if async cancellation is enabled and acted upon, the stack pointer is
not necessarily pointing to a __syscall_cp_asm stack frame. the
contents of the stack being wrong don't really matter, but if the
stack pointer is not suitably aligned, the procedure call ABI is
violated when calling back into C code via __cancel, and pthread_exit,
cancellation cleanup handlers, TSD destructors, etc. may malfunction
or crash.
for the async cancel case, just call __cancel directly like we did
prior to commit 102f6a01e2. restore the
signal mask prior to doing this since the cancellation handler runs
with all signals blocked.
commit f081d5336a fixed
gethostbyname[2]_r to treat negative results as a non-error, leaving
gethostbyname[2] wrongly returning a pointer to the unfilled result
buffer rather than a null pointer. since, as documented with commit
fe82bb9b92, the caller of
gethostby{name[2],addr}_r can always rely on the result pointer being
set, use that consistently rather than trying to duplicate logic about
whether we have a result or not in gethostby{name[2],addr}.
the only functional change here should be that MAXADDRS is only
checked for RRs that provide address results, so that a CNAME which
appears after an excessive number of address RRs does not get ignored.
I'm not aware of any servers that order the RRs this way, and it may
even be forbidden to do so, but I prefer having the callback logic not
be order dependent.
other than that, the motivation for this change is that the A and AAAA
cases were mostly duplicate code that could be combined as a single
code path.
returning -1 rather than 0 from the parse function causes __dns_parse
to bail out and return an error. presently, name_from_dns does not
check the return value anyway, so this does not matter, but if it ever
started treating this as an error, lookups with large numbers of
addresses would break. this is a consequence of adding TCP support and
extending the buffer size used in name_from_dns.
reportedly there is nameserver software with question-rewriting
"functionality" which gives A answers when AAAA is queried. since we
made no effort to validate that the answer RR type actually
corresponds to the question asked, it was possible (depending on
flags, etc.) for these answers to leak through, which the caller might
not be prepared for. indeed, our implementation of gethostbyname2_r
makes an assumption that the resulting addresses are in the family
requested, and will misinterpret the results if they don't.
commit 45ca5d3fcb already noted in
fixing CVE-2017-15650 that this could happen, but did nothing to
validate that the RR type of the answer matches the question; it just
enforced the limit on number of results to preclude overflow.
presently, name_from_dns ignores the return value of __dns_parse, so
it doesn't really matter whether we return 0 (ignoring the RR) or -1
(parse-ending error) upon encountering the mismatched RR. if that ever
changes, though, ignoring irrelevant answer RRs sounds like the
semantically correct thing to do, so for now let's return 0 from the
callback when this happens.
commit 167390f055 seems to have
overlooked the presence of a lock here, probably because it was one of
the exceptions not using LOCK() but a rwlock.
as such, it can't be added to the generic table of locks to take, so
add an explicit atfork function for the pthread keys table. the order
it is called does not particularly matter since nothing else in libc
but pthread_exit interacts with keys.
performing n-- is not a safe operation for arbitrary signed input n.
only perform the decrement in the code path where the initial n is
greater than 1, and adjust the condition in the n<=1 code path to
compensate for it not having been decremented.
the aio operations that lead to calling __aio_get_queue with the
possibility to expand the fd map are not AS-safe, but if they are
interrupted by a signal handler, the signal handler may call close,
which is required to be AS-safe. due to __aio_get_queue taking the
write lock without blocking signals, such a call to close from a
signal handler could deadlock.
change __aio_get_queue to block signals if it needs to obtain a write
lock, and restore when finished.
aio_suspend waits on a dummy futex in the corner case when the array of
requests contains NULL pointers only. But the value of this futex was
left uninitialized, so if it happens to be non-zero, aio_suspend
degrades to spinning instead of blocking.
as reported by Alexey Izbyshev, there is a lock order inversion
deadlock between the malloc lock and aio maplock at MT-fork time:
_Fork attempts to take the aio maplock while fork already has the
malloc lock, but a concurrent aio operation holding the maplock may
attempt to allocate memory.
move the __aio_atfork calls in the parent from _Fork to fork, and
reorder the lock before most other locks, since nothing else depends
on aio(*). this leaves us with the possibility that the child will not
be able to obtain the read lock, if _Fork is used directly and happens
concurrent with an aio operation. however, in that case, the child
context is an async signal context that cannot call any further aio
functions, so all we need is to ensure that close does not attempt to
perform any aio cancellation. this can be achieved just by nulling out
the map pointer.
(*) even if other functions call close, they will only need a read
lock, not a write lock, and read locks being recursive ensures they
can obtain it. moreover, the number of read references held is bounded
by something like twice the number of live threads, meaning that the
read lock count cannot saturate.
as reported by Alexey Izbyshev, when the second-to-last thread exits
causing a return to single-threaded (no locks needed) state, it
creates a situation where the last remaining thread may obtain the
killlock that's already held by the exiting thread. this means it may
erroneously use the tid of the exiting thread, and may corrupt the
lock state due to double-unlock.
commit 8d81ba8c0b, which (re)introduced
the switch back to single-threaded state, documents the intent that
the first lock after switching back should provide the necessary
synchronization. this is correct, but only works if the switch back is
made after there is no further need for synchronization with locks
(other than the thread list lock, which can't be bypassed) held by the
exiting thread.
in order to hit the bug, the remaining thread must first take a
different lock, causing it to perform an actual lock one last time,
consume the need_locks==-1 state, and transition to need_locks==0.
after that, the next attempt to lock the exiting thread's killlock
will bypass locking.
fix this by reordering the unlocking of killlock at thread exit time,
along with changes to the state protected by it, to occur earlier,
before the switch to single-threaded state. there are really no
constraints on where it's done, except that it occur after there is no
longer any possibility of application code executing in the exiting
thread, so do it as early as possible.
ever since commit 8f11e6127f introduced
the thread list lock, this has been wrong. initially, it was wrong via
calling free from the context with the thread list lock held. commit
aa5a9d15e0 deferred the unsafe free but
added a lock, which was also unsafe. in particular, it could deadlock
if code holding freebuf_queue_lock was interrupted by a signal handler
that takes the thread list lock.
commit 4d5aa20a94 observed that there
was a lock here but failed to notice that it's invalid.
there is no easy solution to this problem with locks; any attempt at
solving it while still using locks would require the lock to be an
AS-safe one (blocking signals on each access to the dlerror buffer
list to check if there's deferred free work to be done) which would be
excessively costly, and there are also lock order considerations with
respect to how the lock would be handled at fork.
instead, just use an atomic list.
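
a minimal sketch of such a list using C11 atomics. the names
(deferred, defer_push, defer_take_all) are hypothetical and the real
code uses the library's own atomic primitives; this only illustrates
the lock-free push and take-everything pattern.

```c
#include <stdatomic.h>
#include <stddef.h>

struct deferred { struct deferred *next; };

/* head of the list of buffers waiting to be freed. */
static _Atomic(struct deferred *) deferred_head;

/* push one node without taking any lock. */
static void defer_push(struct deferred *n)
{
	struct deferred *head = atomic_load(&deferred_head);
	do n->next = head;
	while (!atomic_compare_exchange_weak(&deferred_head, &head, n));
}

/* detach the whole list at once; the caller then owns every node
 * and can free them from a context where free is safe. */
static struct deferred *defer_take_all(void)
{
	return atomic_exchange(&deferred_head, (struct deferred *)0);
}
```

because consumers only ever take the entire list, never a single
node, the usual ABA hazard of lock-free stacks with pop does not
arise in this sketch.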
unlike most projects that use -fno-strict-aliasing, we aim to have all
sources respect the C language rules for effective type that make
type-based alias analysis optimizations possible. unfortunately, it
turns out that there are deep, and likely very difficult to fix, flaws
in the TBAA performed by GCC and likely other compilers, whereby this
kind of optimization can transform code that follows the rules
strictly in ways that will make it malfunction. see for example GCC
bugs 107107 and 107115, the latter of which also affects clang.
there are not presently any known instances of breakage due to wrong
type-based aliasing optimizations in our codebase. nonetheless, since
the transformations are unsound and could introduce breakage,
configure CFLAGS to build with -fno-strict-aliasing.
some casual analysis of the effects on codegen suggests that this is
unlikely to affect performance except possibly in the regex engine. in
general, we should probably prefer making better use of the restrict
keyword over relying on types to imply non-aliasing for optimization
purposes; doing so should be able to get back any performance that was
lost and more, should it turn out to matter (unlikely).
the entire intent of using madvise/MADV_FREE on freed slots is to
improve system performance by avoiding evicting cache of useful data,
or swapping useless data to disk, by marking any whole pages in the
freed slot as discardable by the kernel. in particular, unlike
unmapping the memory or replacing it with a PROT_NONE region, use of
MADV_FREE does not make any difference to memory accounting for commit
charge purposes, and so does not increase the memory available to
other processes in a non-overcommitted environment.
however, various measurements have shown that inordinate amounts of
time are spent performing madvise syscalls in processes which
frequently allocate and free medium sized objects in the size range
roughly between PAGESIZE and MMAP_THRESHOLD, to the point that the net
effect is almost surely significant performance degradation. so, turn
it off.
the code, which has some nontrivial logic for efficiently determining
whether there is a whole-page range to apply madvise to, is left in
place so that it can easily be re-enabled if desired, or later tuned
to only apply to certain sizes or to use additional heuristics.
these badly pollute the namespace with macros whenever _GNU_SOURCE is
defined, which is always the case with g++, and especially tends to
interfere with C++ constructs.
as our implementation of these was macro-only, their removal cannot
affect any existing binaries. at the source level, portable software
should be prepared for them not to exist.
for now, they are left in place with explicit _LARGEFILE64_SOURCE.
this provides an easy temporary path for integrators/distributions to
get packages building again right away if they break while working on
a proper, upstreamable fix. the intent is that this be a very
short-term measure and that the macros be removed entirely in the next
release cycle.
originally the namespace-infringing "large file support" interfaces
were included as part of glibc-ABI-compat, with the intent that they
not be used for linking, since our off_t is and always has been
unconditionally 64-bit and since we usually do not aim to support
nonstandard interfaces when there is an equivalent standard interface.
unfortunately, having the symbols present and available for linking
caused configure scripts to detect them and attempt to use them
without declarations, producing all the expected ill effects that
entails.
as a result, commit 2dd8d5e1b8 was made
to prevent this, using macros to redirect the LFS64 names to the
standard names, conditional on _GNU_SOURCE or _LARGEFILE64_SOURCE.
however, this has turned out to be a source of further problems,
especially since g++ defines _GNU_SOURCE by default. in particular,
the presence of these names as macros breaks a lot of valid code.
this commit removes all the LFS64 symbols and replaces them with a
mechanism in the dynamic linker symbol lookup failure path to retry
with the spurious "64" removed from the symbol name. in the future,
if/when the rest of glibc-ABI-compat is moved out of libc, this can be
removed.
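
a simplified sketch of the name rewriting involved, assuming only a
trailing "64" needs handling. the helper strip_lfs64_sketch is
hypothetical; the real lookup path is not shown in this log and may
restrict which names are eligible or handle other forms.

```c
#include <stddef.h>
#include <string.h>

/* hypothetical helper: if name ends in "64", copy it without that
 * suffix into out and return 1; otherwise return 0. */
static int strip_lfs64_sketch(const char *name, char *out, size_t outsz)
{
	size_t l = strlen(name);
	/* need at least one character before the suffix, and room
	 * for the shortened name plus its terminator. */
	if (l < 3 || l - 2 >= outsz) return 0;
	if (name[l-2] != '6' || name[l-1] != '4') return 0;
	memcpy(out, name, l - 2);
	out[l-2] = 0;
	return 1;
}
```

a failed lookup of "fopen64" would then be retried as "fopen", while
names without the suffix fail as before.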
we already attempt to preclude this case by having res_send use a
sufficiently large temporary buffer even if the caller did not provide
one as large as or larger than the udp dns max of 512 bytes. however,
it's possible that the caller passed a custom-crafted query packet
using EDNS0, e.g. to get detailed DNSSEC results, with a larger udp
size allowance.
I have also seen claims that there are some broken nameservers in the
wild that do not honor the dns udp limit of 512 and send large answers
without the TC bit set, when the query was not using EDNS.
we generally don't aim to support broken nameservers, but in this case
both problems, if the latter is even real, have a common solution:
using recvmsg instead of recvfrom so we can examine the MSG_TRUNC
flag.
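
a sketch of detecting truncation this way. recv_check_trunc is a
hypothetical helper, not the resolver's actual code; it shows only
that recvmsg reports MSG_TRUNC in msg_flags when a datagram did not
fit the supplied buffer.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* hypothetical helper: receive one datagram into buf and report
 * through *truncated whether it was cut short to fit. */
static ssize_t recv_check_trunc(int fd, void *buf, size_t len,
	int *truncated)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr mh = { .msg_iov = &iov, .msg_iovlen = 1 };
	ssize_t r = recvmsg(fd, &mh, 0);
	if (r >= 0) *truncated = !!(mh.msg_flags & MSG_TRUNC);
	return r;
}
```

recvfrom has no way to return this flag, which is why the switch to
recvmsg is needed.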
the size of 512 is not sufficient to get at least one address in the
worst case where the name is at or near max length and resolves to a
CNAME at or near max length. prior to tcp fallback, there was nothing
we could do about this case anyway, but now it's fixable.
the new limit 768 is chosen so as to admit roughly the number of
addresses with a worst-case CNAME as could fit for a worst-case name
that's not a CNAME in the old 512-byte limit. outside of this
worst-case, the number of addresses that might be obtained is
increased.
MAXADDRS (48) was originally chosen as an upper bound on the combined
number of A and AAAA records that could fit in 512-byte packets (31
and 17, respectively). it is not increased at this time.
so as to prevent a situation where the A records consume almost all of
these slots (at 768 bytes, a "best-case" name can fit almost 47 A
records), the order of parsing is swapped to process AAAA first. this
ensures roughly half of the slots are available to each address
family.
tcp fallback was originally deemed unwanted and unnecessary, since we
aim to return a bounded-size result from getaddrinfo anyway and
normally plenty of address records fit in the 512-byte udp dns limit.
however, this turned out to have several problems:
- some recursive nameservers truncate by omitting all the answers,
rather than sending as many as can fit.
- a pathological worst-case CNAME for a worst-case name can fill the
entire 512-byte space with just the two names, leaving no room for
any addresses.
- the res_* family of interfaces allow querying of non-address records
such as TLSA (DANE), TXT, etc. which can be very large. for many of
these, it's critical that the caller see the whole RRset. also,
res_send/res_query are specified to return the complete, untruncated
length so that the caller can retry with an appropriately-sized
buffer. determining this is not possible without tcp.
so, it's time to add tcp fallback.
the fallback strategy implemented here uses one tcp socket per
question (1 or 2 questions), initiated via tcp fastopen when possible.
the connection is made to the nameserver that issued the truncated
answer. right now, fallback happens unconditionally when truncation is
seen. this can, and may later be, relaxed for queries made by the
getaddrinfo system, since it will only use a bounded number of results
anyway.
retry is not attempted again after failure over tcp. the logic could
easily be adapted to do that, but it's of questionable value, since
the tcp stack automatically handles retransmission and the success
answer with TC=1 over udp strongly suggests that the nameserver has
the full answer ready to give. further retry is likely just "take
longer to fail".
for extremely small buffer sizes, the DNS query core in __res_msend
may malfunction completely, being unable to get even the headers to
determine the response code. but there is also a problem for
reasonable sizes under 512 bytes: __res_msend is unable to determine
if the udp answer was truncated at the recv layer, in which case it
may be incomplete, and res_send is then unable to honor its contract
to return the length of the full, non-truncated answer.
at present, res_send does not honor that contract anyway when the full
answer would exceed 512 bytes, since there is no tcp fallback, but
this change at least makes it consistent in a context where this is
the only "full answer" to be had.
this was apparently omitted long ago out of a lack of understanding of
its importance and the fact that POSIX doesn't specify it. despite not
being officially standardized, however, it turns out that at least
AIX, glibc, NetBSD, OpenBSD, QNX, and Solaris document and support it.
in certain usage cases, such as implementing a DNS gateway on top of
the stub resolver interfaces, it's necessary to distinguish the case
where a name does not exist (NxDomain) from one where it exists but has
no addresses (or other records) of the requested type (NODATA). in
fact, even the legacy gethostbyname API had this distinction, which we
were previously unable to support correctly because the backend lacked
it.
apart from fixing an important functionality gap, adding this
distinction helps clarify to users how search domain fallback works
(falling back in cases corresponding to EAI_NONAME, not in ones
corresponding to EAI_NODATA), a topic that has been a source of
ongoing confusion and frustration.
as a result of this change, EAI_NONAME is no longer a valid universal
error code for getaddrinfo in the case where AI_ADDRCONFIG has
suppressed use of all address families. in order to return an accurate
result in this case, getaddrinfo is modified to still perform at least
one lookup. this will almost surely fail (with a network error, since
there is no v4 or v6 network to query DNS over) unless a result comes
from the hosts file or from ip literal parsing, but in case it does
succeed, the result is replaced by EAI_NODATA.
glibc has a related error code, EAI_ADDRFAMILY, that could be used for
the AI_ADDRCONFIG case and certain NODATA cases, but distinguishing
them properly in full generality seems to require additional DNS
queries that are otherwise not useful. on glibc, it is only used for
ip literals with mismatching family, not for DNS or hosts file results
where the name has addresses only in the opposite family. since this
seems misleading and inconsistent, and since EAI_NODATA already covers
the semantic case where the "name" exists but doesn't have any
addresses in the requested family, we do not adopt EAI_ADDRFAMILY at
this time. this could be changed at some point if desired, but the
logic for getting all the corner cases with AI_ADDRCONFIG right is
slightly nontrivial.
EAI_MEMORY is not possible (but would not provide errno if it were)
and EAI_FAIL does not provide errno. treat the latter as EBADMSG to
match how it's handled in gethostbyname2_r (it indicates erroneous or
failure response from the nameserver).
EAI_MEMORY is not possible because the resolver backend does not
allocate. if it did, it would be necessary for us to explicitly return
ENOMEM as the error, since errno is not guaranteed to reflect the
error cause except in the case of EAI_SYSTEM, so the existing code was
not correct anyway.
these functions are horribly underspecified, inconsistent between
historical systems, and should never have been included. however, the
signatures we have match the glibc ones, and the glibc behavior is to
treat NxDomain and NODATA results as a success condition, not an
ENOENT error.
this distinction only affects search, but allows search to continue
when concatenating one of the search domains onto the requested name
produces a result that's not valid. this can happen when the
concatenation is too long, or one of the search list entries is
itself not valid.
as a consequence of this change, having "." in the search domains list
will now be ignored/skipped rather than making the lookup abort with
no results (due to producing a concatenation ending in ".."). this
behavior could be changed later if needed.
the main loop already errors out on zero-length labels within the
name, but terminates before having a chance to check for an erroneous
final zero-length label, instead producing a malformed query packet
with a '.' byte instead of the terminating zero.
rather than poke at the loop logic, simply detect this condition early
and error out without doing anything.
this also fixes behavior of getaddrinfo when "." appears in the search
domain list, which produces a name ending in ".." after concatenation,
at least in the sense of no longer emitting malformed packets on the
network. however, due to other issues, the lookup will still fail.
After commit 5b74eed3b3 the timer thread
doesn't check whether timer_create() actually created the timer,
proceeding to wait for a signal that might never arrive. We can't fix
this by simply checking for a negative timer_id after
pthread_barrier_wait() because we have no way to distinguish a timer
creation failure and a request to delete a timer with INT_MAX id if it
happens to arrive quickly (a variation of this bug existed before
5b74eed3b3, where the timer would be
leaked in this case). So (ab)use cancel field of pthread_t instead.
commit 4486c579cb disabled vdso
clock_gettime on arm due to a Linux kernel bug that was not understood
at the time, whereby the vdso function silently produced
catastrophically wrong results on some systems.
since then, the bug was tracked down to the way the arm kernel
disabled use of vdso clock_gettime on kernels where the necessary
timer was not available or was disabled. it simply patched out the
symbols, but it only did this for the legacy time32 functions, and
left the time64 function in place but non-operational. kernel commit
4405bdf3c57ec28d606bdf5325f1167505bfdcd4 (first present in 5.8)
provided the fix.
if this were a bug that impacted all users of the broken kernel
versions, we could probably ignore it and assume it had been patched
or replaced. however, it's very possible that these kernels appear in
the wild in devices running time32 userspace (glibc, musl 1.1.x, or
some other environment) where they appear to work fine, but where our
new binaries would fail catastrophically if we used the time64 vdso
function.
since the kernel has not (yet?) given us a way to probe for the
working time64 vdso function semantically, we work around the problem
by refusing to use the time64 one unless the time32 one is also
present. this will revert to not using vdso at all if the time32 one
is ever removed, but at least that's safe against wrong results and is
just a missed optimization.
commit d32dadd60e added DT_RELR
processing for programs and shared libraries processed by the dynamic
linker, but left them unsupported in the dynamic linker itself and in
static pie binaries, which self-relocate via code in dlstart.c.
add the equivalent processing to this code path so that there are not
arbitrary restrictions on where the new packed relative relocation
form can be used.
open_wmemstream's write method was written assuming no buffering,
since it sets the FILE up with buf_len of zero in order to avoid
issues with position/seeking. however, as a consequence of commit
bd57e2b43a, a FILE being written to by
the printf core has a temporary local buffer for the duration of the
operation if it was unbuffered to begin with. since this was
disregarded by the wide memstream's write method, output produced
through this code path, particularly numeric fields, was missing from
the output wchar buffer.
copy the equivalent logic for using the buffered data from the
byte-oriented open_memstream.
if resolv.conf lists no nameservers at all, the default of 127.0.0.1
is used. however, another "no nameservers" case arises where the
system has ipv6 support disabled/configured-out and resolv.conf only
contains v6 nameservers. this caused the resolver to repeat socket
operations that will necessarily fail (sending to one or more
wrong-family addresses) while waiting for a timeout.
it would be contrary to configured intent to query 127.0.0.1 in this
case, but the current behavior is not conducive to diagnosing the
configuration problem. instead, fail immediately with EAI_SYSTEM and
errno==EAFNOSUPPORT so that the configuration error is reportable.
use the legacy constant values if the kernel does not provide
AT_MINSIGSTKSZ (__getauxval will return 0 in this case) and as a
safety check if something is wrong and the provided value is less than
the legacy constant.
sysconf(_SC_SIGSTKSZ) returns SIGSTKSZ adjusted for the difference
between the legacy constant MINSIGSTKSZ and the runtime value, so that
the working space the application has on top of the minimum remains
invariant under changes to the minimum.
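
as a rough illustration of the two rules above (fall back to the legacy
constant, and keep the application's headroom over the minimum
invariant), a minimal sketch follows. the function names and the
LEGACY_* values are hypothetical placeholders, not the real per-arch
constants or the actual implementation.

```c
#include <stddef.h>

/* hypothetical stand-ins for an arch's legacy MINSIGSTKSZ/SIGSTKSZ */
#define LEGACY_MINSIGSTKSZ 2048
#define LEGACY_SIGSTKSZ    8192

/* auxval is the value reported for AT_MINSIGSTKSZ, or 0 if the kernel
 * does not provide it; never go below the legacy constant. */
static size_t runtime_minsigstksz(unsigned long auxval)
{
	if (auxval < LEGACY_MINSIGSTKSZ) return LEGACY_MINSIGSTKSZ;
	return auxval;
}

/* SIGSTKSZ grows by exactly as much as the minimum grew, so the space
 * on top of the minimum stays the same. */
static size_t runtime_sigstksz(unsigned long auxval)
{
	return LEGACY_SIGSTKSZ
		+ (runtime_minsigstksz(auxval) - LEGACY_MINSIGSTKSZ);
}
```

with these placeholder values, a reported minimum of 4096 yields a
SIGSTKSZ of 10240, and a missing or too-small value yields 8192.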
as a result of ISA extensions exploding register file sizes on some
archs, using a constant for minimum signal stack size no longer seems
viably future-proof. add sysconf keys allowing the kernel to provide a
machine-dependent minimum applications can query to ensure they
allocate sufficient space for stacks. the key names and indices align
with the same functionality in glibc.
see commit d5a5045382 for previous
action on this subject.
ultimately, the macros MINSIGSTKSZ and SIGSTKSZ probably need to be
deprecated, but that is standards-amendment work outside the scope of
a single implementation.
apparently this code path was never tested, as it's not usual to have
v6 nameservers listed on a system without v6 networking support. but
it was always intended to work.
when reverting to binding a v4 address, also revert the family in the
sockaddr structure and the socklen for it. otherwise bind will just
fail due to mismatched family/sockaddr size.
fix dns resolver fallback when v6 nameservers are listed by
This is a part of the interface contract defined in the Linux man
page (official for a Linux-specific interface) and asserted by test
cases in the Linux Test Project (LTP).
a request for this behavior has been open for a long time. the
motivation is that application code, particularly under some language
runtimes designed around very-low-footprint coroutine type constructs,
may be operating with extremely small stack sizes unsuitable for
receiving signals, using a separate signal stack for any signals it
might handle.
progress on this was blocked at one point trying to determine whether
the implementation is actually entitled to clobber the alt stack, but
the phrasing "available to the implementation" in the POSIX spec for
sigaltstack seems to make it clear that the application cannot rely on
the contents of this memory to be preserved in the absence of signal
delivery (on the abstract machine, excluding implementation-internal
signals) and that we can therefore use it for delivery of signals that
"don't exist" on the abstract machine.
no change is made for SIGTIMER since it is always blocked when used,
and accepted via sigwaitinfo rather than execution of the signal
handler.
breaking out of the switch-case when l==-1 means the conditional below
will necessarily be true (-1 >= buf_size, a size_t variable) and the
function will return 0. it is, however, somewhat unclear that that's
what's happening. simply returning there is simpler.
this is a requirement of the C language (orientation) and POSIX
(encoding rule) that was somehow overlooked.
we rely on the fact that the buffer pointers have been reset by
fflush, so that any future stdio operations on the stream will go
through the same code paths they would on a newly-opened file without
an orientation set, thereby setting the orientation as they should.
the way RELR is applied is not a meaningful operation for FDPIC (there
is no single "base" address). it seems unlikely RELR would ever be
added for FDPIC, but if it ever is, the behavior and possibly data
format will need to be different, so guard against calling the
non-FDPIC code.
the syscall used to probe availability of the clock fails with EINVAL
when the requested pid does not exist, but clock_getcpuclockid is
specified to use ESRCH for this purpose.
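
the error translation described above can be sketched as a small pure
function. the name and the raw "0 or negative errno" calling convention
are assumptions made for illustration only.

```c
#include <errno.h>

/* ret is a raw kernel-style result: 0 on success, -errno on failure.
 * returns the error number the caller should report (0 for success). */
static int probe_result_to_error(int ret)
{
	/* the probe reports a nonexistent pid as EINVAL; the interface
	 * is specified to report it as ESRCH */
	if (ret == -EINVAL) ret = -ESRCH;
	return -ret;
}
```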
The generic vfork implementation uses clone(SIGCHLD) which has fork
semantics.
Implement vfork as clone(SIGCHLD|CLONE_VM|CLONE_VFORK, 0) instead which
has vfork semantics. (stack == 0 means sp is unchanged in the child.)
Some users rely on vfork semantics when memory overcommit is disabled
or when the vfork child runs code that synchronizes with the parent
process (non-conforming).
this code attempts to use the value of errno from failure of socket or
connect to infer availability of the requested address family (v4 or
v6). however, in the case where connect failed, there is an
intervening call to close between connect and the use of errno. close
is not required to preserve errno on success, and in fact the
__aio_close code, which is called whenever aio is linked and thus
always called in dynamic-linked programs, unconditionally clobbers
errno. as a result, getaddrinfo fails with EAI_SYSTEM and errno=ENOENT
rather than correctly determining that the address family was
unavailable.
this fix is based on report/patch by Jussi Nieminen, but simplified
slightly to avoid breaking the case where socket, not connect, failed.
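
one generic way to keep an intervening close from disturbing errno is
to save and restore it around the call. this is only a sketch of the
idea with a hypothetical helper name; it is not claimed to be the exact
form the fix took.

```c
#include <errno.h>
#include <unistd.h>

/* close fd while leaving errno exactly as the caller had it, so a
 * later test of errno still sees the value set by socket or connect */
static void close_keep_errno(int fd)
{
	int saved = errno;
	close(fd);
	errno = saved;
}
```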
while the error handling function should not be reached in stage 2
(assuming ldso itself was linked correctly), this was not statically
determinate from the compiler's perspective, and in theory a compiler
performing LTO could lift the TLS references (errno and other things)
out of the printf-family functions called in a stage where TLS is not
yet initialized.
instead, perform the call via a static-storage, internal-linkage
function pointer which will be set to a no-op function until the stage
where the real error handling function should be reachable.
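
a minimal sketch of the function pointer arrangement described above.
all names here (report_error, error_noop, error_impl,
enter_late_stage) and the reports counter are hypothetical and exist
only to make the idea concrete.

```c
#include <stdarg.h>
#include <stdio.h>

static int reports;	/* counts real reports, for illustration only */

/* early stages: do nothing, and in particular touch no TLS */
static void error_noop(const char *fmt, ...)
{
	(void)fmt;
}

/* later stages: the real handler, free to use printf-family code */
static void error_impl(const char *fmt, ...)
{
	va_list ap;
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	reports++;
}

/* static storage, internal linkage; starts out as the no-op */
static void (*report_error)(const char *, ...) = error_noop;

/* called once the stage is reached where errors are reportable */
static void enter_late_stage(void)
{
	report_error = error_impl;
}
```

since callers only ever go through the pointer, the compiler cannot
assume which function is reached and has nothing to lift out of it.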
inspired by commit 63c67053a3.
if LTO is enabled, gcc hoists the call to ___errno_location outside the
loop even though the access to errno is gated behind head != &ldso
because ___errno_location is marked __attribute__((const)). this causes
the program to crash because TLS is not yet initialized when called from
__dls2. this is also possible if LTO is not enabled; even though gcc 11
doesn't do it, it is still wrong to use errno here.
since the start and end are already aligned, we can simply call
__syscall instead of using global errno.
Fixes: e13a2b8953 ("implement PT_GNU_RELRO support")
this is not an issue that was actually hit, but I noticed it during
previous changes to __randname: if the resolution of tv_nsec is too
low, the space of temp file names obtainable by a thread could
plausibly be exhausted. mixing in tv_sec avoids this.
the __randname function is used by various temp file creation
interfaces as a backend to produce a name to attempt using. it does
not have to produce results that are safe against guessing, and only
aims to avoid unintentional collisions.
mixing the address of an object on the stack in a reversible manner
leaked ASLR information, potentially allowing an attacker who can
observe the temp files created and their creation timestamps to narrow
down the possible ASLR state of the process that created them. there
is no actual value in mixing these addresses in; it was just
obfuscation. so don't do it.
instead, mix the tid, just to avoid collisions if multiple
processes/threads stampede to create temp files at the same moment.
even without this measure, they should not collide unless the clock
source is very low resolution, but it's a cheap improvement.
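
to make the mixing concrete, here is a sketch that takes the clock
fields and tid as plain arguments. the multiplier, the way the values
are combined, and the letter mapping are assumptions chosen for
illustration; only the inputs (tv_sec, tv_nsec, tid) come from the
description above.

```c
/* fills out[0..5] with letters derived from the clock and the tid.
 * no stack address is involved, so nothing about ASLR can leak. */
static void sketch_randname(char *out, unsigned long sec,
                            unsigned long nsec, unsigned long tid)
{
	unsigned long r = sec + nsec + tid * 65537UL;
	for (int i = 0; i < 6; i++, r >>= 5) {
		/* low 4 bits pick a letter, bit 4 picks upper/lower case */
		out[i] = 'A' + (r & 15) + (r & 16) * 2;
	}
}
```

two threads calling this at the same instant get different names
because their tids differ, which is the stampede case the change is
aimed at.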
if/when we have a guaranteed-available userspace csprng, it could be
used here instead. even though there is no need for cryptographic
entropy here, it would avoid having to reason about clock resolution
and such to determine whether the behavior is nice.
assuming a reasonable realtime clock, res_mkquery is highly unlikely
to generate the same query id twice in a row, but it's possible with a
very low-resolution system clock or under extreme delay of forward
progress. when it happens, res_msend fails to wait for both answers,
and instead stops listening after getting two answers to the same
query (A or AAAA).
to avoid this, increment one byte of the second query's id if it
matches the first query's. don't bother checking if the second byte is
also equal, since it doesn't matter; we just need to ensure that at
least one byte is distinct.
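
a sketch of the id adjustment, assuming (as in the dns wire format)
that the 16-bit query id occupies the first two bytes of each packet.
the helper name is hypothetical.

```c
/* q0 and q1 point at the start of two query packets. if the first id
 * byte matches, bump it in the second query; the other id byte is not
 * inspected because one differing byte is enough. */
static void make_ids_distinct(const unsigned char *q0, unsigned char *q1)
{
	if (q1[0] == q0[0]) q1[0]++;
}
```

the increment wraps from 0xff to 0, which is still distinct from the
first query's byte.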
commit 05973dc3bb made it so that lines
longer than INT_MAX can in theory be read, but did not use a suitable
type for the positions determined by sscanf. we could change to using
size_t, but since the signature for getmntent_r does not admit lines
longer than INT_MAX, it does not make sense to support them in the
legacy thread-unsafe form either -- the principle here is that there
should not be an incentive to use the unsafe function to get added
functionality.
According to fstab(5), the last two fields are optional, but this
wasn't accepted. After this change, only the first field is required,
which matches glibc's behaviour.
Using sscanf as before, it would have been impossible to differentiate
between 0 fields and 4 fields, because sscanf would have returned 0 in
both cases due to the use of assignment suppression and %n for the
string fields (which is important to avoid copying any strings). So
instead, before calling sscanf, initialize every string to the empty
string, and then we can check which strings are empty afterwards to
know how many fields were matched.
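
the counting trick can be sketched as follows. this version assumes
the "empty string" initialization is done by pointing every offset at
the line's terminating nul, and it only counts the four string fields;
the function name and the exact format string are illustrative, not
quoted from the implementation.

```c
#include <stdio.h>
#include <string.h>

/* returns how many of the four string fields are present in line */
static int count_string_fields(const char *line)
{
	int n[8], freq = 0, passno = 0, count = 0;
	int len = (int)strlen(line);

	/* every field initially starts and ends at the terminator,
	 * i.e. is the empty string */
	for (int i = 0; i < 8; i++) n[i] = len;

	/* suppressed %*[...] conversions copy nothing; %n records where
	 * each field starts and ends. the return value is useless here
	 * since suppressed conversions and %n are not counted. */
	sscanf(line,
		" %n%*[^ \t]%n %n%*[^ \t]%n %n%*[^ \t]%n %n%*[^ \t]%n %d %d",
		n, n+1, n+2, n+3, n+4, n+5, n+6, n+7, &freq, &passno);

	/* a field was matched iff its start is before the terminator */
	for (int i = 0; i < 4; i++)
		if (n[2*i] < len) count++;
	return count;
}
```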
this avoids the need for implementation-internal callers to depend on
the nonstandard AT_EMPTY_PATH extension to use __fstatat and isolates
knowledge of that extension to the implementation of __fstat.
this function is used to implement some baseline ISO C interfaces, so
it cannot call any of the stat functions by their public names. use
the namespace-safe __fstatat instead.
instead, use the fstatat/stat functions, so that the logic for which
syscalls are present and usable is all in fstatat.
this results in a slight increase in cost for old kernels on 32-bit
archs: now statx will be attempted first rather than just using the
legacy time32 syscalls, despite us not caring about timestamps.
however, it's not even clear that the legacy syscalls *should* succeed
if the timestamps are out of range; arguably they should fail with
EOVERFLOW. as such, paying a small cost here on old kernels seems
well-motivated.
with this change, fchmodat itself is no longer blocking ports to new
archs that lack the legacy syscalls.
this change serves two purposes:
1. it eliminates one of the few remaining uses of the kernel stat
structure which will not be present in future archs, avoiding the need
for growing ifdef logic here.
2. it potentially makes the operations less expensive when the
candidate exists as a non-symlink by avoiding the need to read the
inode (assuming the directory tables suffice to distinguish symlinks).
this uses the idiom I discovered while rewriting realpath for commit
29ff7599a4 of being able to use the
readlink operation as an inexpensive probe for file existence that
doesn't follow symlinks.
_CS_POSIX_V7_THREADS_CFLAGS and _CS_POSIX_V7_THREADS_LDFLAGS have been
missing for a long time, which is a conformance defect. we were
waiting on glibc to add them or at least agree on the numeric values
they will have so as to keep the numbering aligned. it looks like they
will be added to glibc with these numbers, and in any case, this list
does not have any significant churn that would result in the numbers
getting taken.
the change to support passing null was rejected in the past on the
grounds that GNU gettext documented it as undefined, on an assumption
that only glibc accepted it and that the standalone GNU gettext did
not. but it turned out that both explicitly accept it.
in light of this, since some software assumes null can be passed
safely, allow it.
newlocale and freelocale use __libc_malloc and __libc_free, but
duplocale used malloc. If malloc was replaced, this resulted in
invalid free using the wrong allocator when passing the result of
duplocale to freelocale.
Instead, use libc-internal malloc for duplocale.
This bug was introduced by commit
1e4204d522.
sys/reg.h already had it right as 32, to which it was explicitly
changed when commit 664cd34192 derived
x32 from x86_64. but the copy exposed in sys/user.h was missed.
see
linux commit 90f093fa8ea48e5d991332cee160b761423d55c1
rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request
the struct type got __ prefix to follow existing practice.
see
linux commit 321827477360934dc040e9d3c626bf1de6c3ab3c
icmp: don't send out ICMP messages with a source address of 0.0.0.0
"RFC7600 reserves a dummy address to be used as a source for ICMP
messages (192.0.0.8/32), so let's teach the kernel to substitute that
address as a last resort if the regular source address selection procedure
fails."
see
linux commit a49f4f81cb48925e8d7cbd9e59068f516e984144
arch: Wire up Landlock syscalls
linux commit 17ae69aba89dbfa2139b7f8024b757ab3cc42f59
Merge tag 'landlock_v34' of ... jmorris/linux-security
Landlock provides for unprivileged application sandboxing. The goal of
Landlock is to enable to restrict ambient rights (e.g. global filesystem
access) for a set of processes. Landlock is inspired by seccomp-bpf but
instead of filtering syscalls and their raw arguments, a Landlock rule
can restrict the use of kernel objects like file hierarchies, according
to the kernel semantic.
see
linux commit 7eeba1706eba6def15f6cb2fc7b3c3b9a2651edc
tcp: Add receive timestamp support for receive zerocopy.
linux commit 3c5a2fd042d0bfac71a2dfb99515723d318df47b
tcp: Sanitize CMSG flags and reserved args in tcp_zerocopy_receive.
TCP_NLA_EDT was new in v5.9, see
linux commit 48040793fa6003d211f021c6ad273477bcd90d91
tcp: add earliest departure time to SCM_TIMESTAMPING_OPT_STATS
TCP_NLA_TTL is new in v5.12, see
linux commit e7ed11ee945438b737e2ae2370e35591e16ec371
tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS
PTRACE_OLDSETOPTIONS is old, but it was missing, PTRACE_SYSEMU and
PTRACE_SYSEMU_SINGLESTEP are new, see
linux commit 56e62a73702836017564eaacd5212e4d0fa1c01d
s390: convert to generic entry
new syscall to change the properties of a mount or a mount tree using
file descriptors which the new mount api is based on, see
linux commit 2a1867219c7b27f928e2545782b86daaf9ad50bd
fs: add mount_setattr()
see
linux commit a54f0dfda754c5cecc89a14dab68a3edc1e497b5
signal: define the SA_UNSUPPORTED bit in sa_flags
linux commit 6ac05e832a9e96f9b1c42a8917cdd317d7b6c8fa
signal: define the SA_EXPOSE_TAGBITS bit in sa_flags
Note: SA_ is in the posix reserved namespace so these linux specific flags
can be exposed when compiling for posix.
unlike other si_code defines, SYS_ is not in the posix reserved namespace
which is likely the reason why SYS_SECCOMP was previously missing (was new
in linux v3.5).
see
linux commit 18fb76ed53865c1b5d5f0157b1b825704590beb5
net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
linux commit 94ab9eb9b234ddf23af04a4bc7e8db68e67b8778
net-zerocopy: Defer vm zap unless actually needed.
see
linux commit 1446e1df9eb183fdf81c3f0715402f1d7595d4cb
kernel: Implement selective syscall userspace redirection
linux commit 36a6c843fd0d8e02506681577e96dabd203dd8e8
entry: Use different define for selector variable in SUD
redirect syscalls to a userspace handler via SIGSYS, except for a specific
range of code. can be toggled via a memory write to a selector variable.
mainly for wine.
see
linux commit b0a0c2615f6f199a656ed8549d7dce625d77aa77
epoll: wire up syscall epoll_pwait2
linux commit 58169a52ebc9a733aeb5bea857bc5daa71a301bb
epoll: add syscall epoll_pwait2
epoll_wait with struct timespec timeout instead of int. no time32 variant.
This reduces entropy of the canary from 64-bit to 56-bit in exchange
for mitigating non-terminated C string overflows by setting the second
byte of the canary to nul, so that off-by-one write overflow with a
nul byte can still be detected.
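
a sketch of the canary adjustment, operating on the in-memory bytes of
the value so that "second byte" means the byte at index 1 regardless of
endianness. the helper name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* force the byte at index 1 of the canary's representation to nul;
 * the remaining bytes keep whatever entropy they had */
static uintptr_t harden_canary(uintptr_t canary)
{
	unsigned char b[sizeof canary];
	memcpy(b, &canary, sizeof b);
	b[1] = 0;
	memcpy(&canary, b, sizeof b);
	return canary;
}
```

a C string copy that runs into the canary stops at the embedded nul,
while the byte at index 0 is still random, so a lone nul written one
byte past a buffer is still likely to change the canary.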
Idea from GrapheneOS bionic commit 7024d880b51f03a796ff8832f1298f2f1531fd7b
gcc-12 with -frounding-math will do inexact constant conversions at
runtime according to the runtime rounding mode.
in the math library we want constants to be rounding mode independent
so this patch fixes cases where new runtime conversions happen with
gcc-12.
fortunately this only affects two minor cases, the fix uses global
initializers where rounding mode does not apply.
after the patch the same number of conversions happen with gcc-12 as
with gcc-11.
commit a90d9da1d1 made fgetws look for
changes to errno by fgetwc to detect encoding errors, since ISO C did
not allow the implementation to set the stream's error flag in this
case, and the fgetwc interface did not admit any other way to detect
the error. however, the possibility of fgetwc setting errno to EILSEQ
in the success path was overlooked, and in fact this can happen if the
buffer ends with a partial character, causing mbtowc to be called with
only part of the character available.
since that change was made, the C standard was amended to specify that
fgetwc set the stream error flag on encoding errors, and commit
511d70738b made it do so. thus, there is
no longer any need for fgetws to poke at errno to handle encoding
errors.
this commit reverts commit a90d9da1d1
and thereby fixes the problem.
this bug goes back to commit 1cc81f5cb0
where zoneinfo file support was first added. in scan_trans, which
searches for the appropriate local time/dst rule in effect at a given
time, times prior to the second transition time caused the -1 slot of
the index to be read to determine the previous rule in effect. this
memory was always valid (part of another zoneinfo table in the mapped
file) but the byte value read was then used to index another table,
possibly going outside the bounds of the mmap. most of the time, the
result was limited to misinterpretation of the rule in effect at that
time (pre-1900s), but it could produce a crash if adjacent memory was
not readable.
the root cause of the problem, however, was that the logic for this
code path was all wrong. as documented in the comment, times before
the first transition should be treated as using the lowest-numbered
non-dst rule, or rule 0 if no non-dst rules exist. if the argument is
in units of local time, however, the rule prior to the first
transition is needed to determine if it falls before or after it, and
that's where the -1 index was wrongly used.
instead, use the documented logic to find out what rule would be in
effect before the first transition, and apply it as the offset if the
argument was given in local time.
the new code has not been heavily tested, but no longer performs
potentially out-of-bounds accesses, and successfully handles the 1883
transition from local mean time to central standard time in the test
case the error was reported for.
these are specified to use the sign of the imaginary part of the input
as the sign of zero in the result, but wrongly copied the sign of the
real part.
this is a POSIX requirement. we previously relied on the underlying fd
(or other backend) seek operation to produce the error, but since
linux lseek now supports other seek modes (SEEK_DATA and SEEK_HOLE)
which do not interact well with stdio buffering, this is insufficient.
instead, explicitly check whence before performing any operations.
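
the up-front check amounts to something like the following sketch; the
helper name is hypothetical and the real check lives inside the seek
code path rather than in a separate function.

```c
#include <errno.h>
#include <stdio.h>

/* returns 0 if whence is one of the three modes stdio supports,
 * otherwise sets errno to EINVAL and returns -1 before any buffer or
 * fd state has been touched */
static int check_whence(int whence)
{
	if (whence != SEEK_SET && whence != SEEK_CUR && whence != SEEK_END) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}
```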
these are linux specific constants. glibc exposes them behind
_GNU_SOURCE, but, since SEEK_* is reserved for the implementation, we
can simply define them. furthermore, since they can't be used with
fseek() and other functions that deal with FILE, we don't add them to
stdio.h.
these characters combine onto a base character (initial) and therefore
need to have width 0. the original binary-search implementation of
wcwidth handled them correctly, but a regression was introduced in
commit 1b0ce9af6d by generating the new
tables from unicode without noticing that the classification logic in
use (unicode character category Mn/Me/Cf) was insufficient to catch
these characters.
strtod_l, strtof_l, and strtold_l originally existed only as
glibc-ABI-compat symbols. as noted in the commit which added them,
17a60f9d32, making them aliases for the
non-_l functions was a hack and not appropriate if they ever became
public API.
unfortunately, commit 35eb1a1a9b did
make them public without undoing the hack. fix that now by moving the
the _l functions to their own file as wrappers that just throw away
the locale_t argument.
commit 7be59733d7 introduced the
hwcap-based branches to support the SPE FPU, but wrongly coded them as
bitwise tests on the computed address of __hwcap, not a value loaded
from that address. replace the add with indexed load to fix it.
the snd_pcm_mmap_control struct used with SNDRV_PCM_IOCTL_SYNC_PTR was
mistakenly defined in the kernel uapi with "before u32" padding both
before and after the first u32 member. our conversion between the
modern struct and the legacy time32 struct was written without
awareness of that mistake, and assumed the time64 version of the
struct was the intended form with padding to match the layout on
64-bit archs. as a result, the struct was not converted correctly when
running on old kernels, with audio glitches as the likely result.
this was discovered thanks to a related bug in the kernel, whereby
32-bit userspace running on a 64-bit kernel also suffered from the
types mismatching. the mistaken layout is now the ABI and can't be
changed -- or at least making a new ioctl to change it would just
result in a worse situation.
our conversion here is changed to treat the snd_pcm_mmap_control
substruct as two separate substructs at locations dependent on
endianness (since the displacement depends on endianness), using the
existing conversion framework.
we make qsort a wrapper by providing a wrapper_cmp function that uses
the extra argument as a function pointer. should be optimized to a tail
call on most architectures, as long as it's built with
-fomit-frame-pointer, so the performance impact should be minimal.
to keep the git history clean, for now qsort_r is implemented in qsort.c
and qsort is implemented in qsort_nr.c. qsort.c also received a few
trivial cleanups, including replacing (*cmp)() calls with cmp().
qsort_nr.c contains only wrapper_cmp and qsort as a qsort_r wrapper
itself.
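
the wrapper arrangement can be sketched like this. sketch_qsort_r is a
deliberately naive insertion sort standing in for the real sorting
code, and all sketch_* names plus cmp_int are hypothetical; only
wrapper_cmp and the "extra argument is the function pointer" idea come
from the description above.

```c
#include <stddef.h>

typedef int (*cmpfun)(const void *, const void *);
typedef int (*cmpfun_r)(const void *, const void *, void *);

/* stand-in for the context-taking sort (simple insertion sort) */
static void sketch_qsort_r(void *base, size_t n, size_t width,
                           cmpfun_r cmp, void *arg)
{
	unsigned char *b = base;
	for (size_t i = 1; i < n; i++)
		for (size_t j = i; j > 0 &&
		     cmp(b + (j-1)*width, b + j*width, arg) > 0; j--)
			for (size_t k = 0; k < width; k++) {
				unsigned char t = b[(j-1)*width + k];
				b[(j-1)*width + k] = b[j*width + k];
				b[j*width + k] = t;
			}
}

/* the extra argument is the caller's plain comparison function.
 * converting between function pointers and void * is not strict
 * ISO C, but POSIX-style environments support it. */
static int wrapper_cmp(const void *a, const void *b, void *arg)
{
	return ((cmpfun)arg)(a, b);
}

static void sketch_qsort(void *base, size_t n, size_t width, cmpfun cmp)
{
	sketch_qsort_r(base, n, width, wrapper_cmp, (void *)cmp);
}

/* example comparator for ints */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;
	return (x > y) - (x < y);
}
```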
When the soft-float ABI for PowerPC was added in commit
5a92dd95c7, with Freescale cpus using
the alternative SPE FPU as the main use case, it was noted that we
could probably support hard float on them, but that it would involve
determining some difficult ABI constraints. This commit is the
completion of that work.
The Power-Arch-32 ABI supplement defines the ABI profiles, and indeed
ATR-SPE is built on ATR-SOFT-FLOAT. But setjmp/longjmp compatibility
is problematic for the same reason it's problematic on ARM, where
optional float-related parts of the register file are "call-saved if
present". This requires testing __hwcap, which is now done.
In keeping with the existing powerpc-sf subarch definition, which did
not have fenv, the fenv macros are not defined for SPE and the SPEFSCR
control register is left in (and assumed to start in) the default mode.
both passing a null pointer to memcpy with length 0, and adding 0 to a
null pointer, are undefined. in some sense this is 'benign' UB, but
having it precludes use of tooling that strictly traps on UB. there
may be better ways to fix it, but conditioning the operations which
are intended to be no-ops in the k==0 case on k being nonzero is a
simple and safe solution.
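
a sketch of the guard, with a hypothetical helper that copies k bytes
and returns the advanced destination pointer. when k is zero neither
memcpy nor the pointer addition is evaluated, so null pointers are
never passed to either.

```c
#include <string.h>

/* copy k bytes from src to dest and return dest + k; a no-op that
 * performs no memcpy call and no pointer arithmetic when k == 0 */
static char *copy_advance(char *dest, const char *src, size_t k)
{
	if (k) {
		memcpy(dest, src, k);
		dest += k;
	}
	return dest;
}
```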
commit 6d99ad91e8 introduced this
regression as part of a larger change, based on an incorrect
assumption that rdhwr being part of the mips r2 ISA level meant that
the TLS register, known in the mips documentation as UserLocal, was
unconditionally present on chips providing this ISA level and would
not need trap-and-emulate. this turns out to be false.
based on research by Stanislav Kljuhhin and Abilio Marques, who
reported the problem as a performance regression on certain routers
using OpenWRT vs older uclibc-based versions, it turns out the mips
manuals document the UserLocal register as a feature that might or
might not be implemented or enabled, reflected by a cpu capability bit
in the CONFIG3 register, and that Linux checks for this and has to
explicitly enable it on models that have it.
thus, it's indeed possible that r2+ chips can lack the feature,
bringing us back to the situation where Linux only has a fast
trap-and-emulate path for the case where the destination register is
$3. so, always read the thread pointer through $3. this may incur a
gratuitous move to the desired final register on chips where it's not
needed, but it really doesn't matter.
len is unsigned and can never be smaller than 0. though unlikely, an
error in read() would have led to an out of bounds write to name.
Reported-by: Michael Forney <mforney@mforney.org>
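
the class of bug is easy to reproduce: storing the result of read() in
an unsigned type makes a "< 0" test dead code, and the -1 then becomes
a huge index. a sketch of the corrected pattern, using a hypothetical
read_name helper and assuming size is at least 1:

```c
#include <sys/types.h>
#include <unistd.h>

/* read up to size-1 bytes into name and nul-terminate it.
 * returns 0 on success, -1 if read() failed. */
static int read_name(int fd, char *name, size_t size)
{
	/* signed type, so the error return can actually be seen */
	ssize_t len = read(fd, name, size - 1);
	if (len < 0) return -1;
	name[len] = 0;
	return 0;
}
```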
due to historical reasons, the mips signal set has 128 bits rather
than 64 like on every other arch. this was special-cased correctly, at
least for 32-bit mips, at one time, but was inadvertently broken in
commit 7c440977db, and seems never to
have been right on mips64/n32.
as a consequence of this bug, applications making use of high realtime
signal numbers on mips may have been able to execute application code
in contexts where doing so was unsafe.
the kernel structure has padding of the shm_segsz member up to 64
bits, as well as 2 unused longs at the end. somehow that was
overlooked when the powerpc port was added, and it has been broken
ever since; applications compiled with the wrong definition do not
correctly see the shm_segsz, shm_cpid, and shm_lpid members.
fixing the definition just by adding the missing padding would break
the ABI size of the structure as well as the position of the time64
shm_atime and shm_dtime members we added at the end. instead, just
move one of the unused padding members from the original end (before
time64) of the structure to the position of the missing padding. this
preserves size and preserves correct behavior of any compiled code
that was already working. programs affected by the wrong definition
need to be recompiled with the correct one.
previously, the contents of the TZ variable were considered a
candidate for a file/path name only if they began with a colon or
contained a slash before any comma. the latter was very sloppy logic
to avoid treating any valid POSIX TZ string as a file name, but it
also triggered on values that are not valid POSIX TZ strings,
including 3-letter timezone names without any offset.
instead, only treat the TZ variable as POSIX form if it begins with a
nonzero standard time name followed by +, -, or a digit.
also, special case GMT and UTC to always be treated as POSIX form
(with implicit zero offset) so that a stray file by the same name
cannot break software that depends on setting TZ=GMT or TZ=UTC.
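
a simplified sketch of the new classification. it only handles
alphabetic names (the quoted <...> name form is left out), and it
special-cases GMT and UTC by matching the whole string, which is an
assumption made here for brevity. the function name is hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* returns 1 if s should be parsed as a POSIX TZ string, 0 if it
 * should be treated as a candidate file name */
static int looks_like_posix_tz(const char *s)
{
	size_t n = 0;

	if (*s == ':') return 0;

	/* nonempty standard time name ... */
	while (isalpha((unsigned char)s[n])) n++;
	if (!n) return 0;

	/* ... followed by the start of an offset */
	if (s[n] == '+' || s[n] == '-' || isdigit((unsigned char)s[n]))
		return 1;

	/* GMT and UTC are POSIX form with implicit zero offset */
	if (!strcmp(s, "GMT") || !strcmp(s, "UTC")) return 1;

	return 0;
}
```

under these rules "EST5EDT" is POSIX form, while a bare "EST" or
"America/New_York" is looked up as a file.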
POSIX places an obscure requirement on popen which is like a limited
version of close-on-exec:
"The popen() function shall ensure that any streams from previous
popen() calls that remain open in the parent process are closed in
the new child process."
if the POSIX-future 'e' mode flag is passed, producing a pipe FILE
with FD_CLOEXEC on the underlying pipe, this requirement is
automatically satisfied. however, for applications which use multiple
concurrent popen pipes but don't request close-on-exec, fd leaks from
earlier popen calls to later ones could produce deadlock situations
where processes are waiting for a pipe EOF that will never happen.
to fix this, iterate through all open FILEs and add close actions for
those obtained from popen. this requires holding a lock on the open
file list across the posix_spawn call so that additional popen FILEs
are not created after the list is traversed. note that it's still
possible for another popen call to start and create its pipe while the
lock is held, but such pipes are created with O_CLOEXEC and only drop
close-on-exec status (when 'e' flag is omitted) under control of the
lock.
the newly allocated FILE * has not yet leaked to the application and
is only visible to stdio internals until popen returns. since we do
not change any fields of the structure observed by libc internals,
only the pipe_pid member, locking is not necessary.
__tls_get_addr should not be called with an invalid TLS module id of
0. in practice it probably "works", returning the DTV length as if it
were a pointer, and the callback should probably not inspect
dlpi_tls_data in this case, but it's likely that some real-world
callbacks use a check on dlpi_tls_data being non-null, rather than on
dlpi_tls_modid being nonzero, to conclude that the module has TLS.
With mallocng, calling posix_memalign() or aligned_alloc() will
SIGSEGV if the internal malloc() call returns NULL. This does not
occur with oldmalloc, which explicitly checks for allocation failure.
this is a Linux-specific function and not covered by POSIX's
requirements for which interfaces are cancellation points, but glibc
makes it one and existing software relies on it being one.
at some point a review for similar functions that should be made
cancellation points should be done.
dl_iterate_phdr was wrongly reporting the address of the DSO's PT_TLS
image rather than the calling thread's instance of the TLS. the man
page, which is essentially normative for a nonstandard function of
this sort, clearly specifies the latter. it does not clarify where
exactly within/relative-to the image the pointer should point, but the
reasonable thing to do is match the ABI's DTP offset, and this seems
to be what other implementations do.
popen was special-casing the possibility (only possible when the
parent closed stdin and/or stdout) that the child's end of the pipe
was already on the final desired fd number, in which case there was no
way to get rid of its close-on-exec flag in the child. commit
6fc6ca1a32 made this unnecessary by
implementing the POSIX-future requirement that dup2 file actions with
equal source and destination fd values remove the close-on-exec flag.
this makes it possible to perform actions on file actions objects with
a libc-internal lock held without creating lock order relationships
that are silently imposed on an application-provided malloc.
reportedly the GNU linker can emit such segments, causing spurious
failure to load due to mmap with a length of zero producing EINVAL.
no action is required for such a load map (it's effectively a nop in
the program headers table) so just treat it as always successful.
since 4.1, gcc has had the __returns_twice__ attribute and has
required functions which return twice to carry it; however, it has always
applied it automatically to known setjmp-like function names. clang
however does not do this reliably, at least not with -ffreestanding
and possibly under other conditions, resulting in silent emission of
wrong code.
since the symbol name setjmp is in no way special (setjmp is specified
as a macro that could expand to use any implementation-specific symbol
name or names), a compiler is justified not to do anything special
without further hints, and it's reasonable to do what we can to
provide such hints.
gcc 4.0.x and earlier do not recognize the attribute, so make use
conditional on __GNUC__ macros. clang and other gcc-like compilers
report (and have always reported) a later "GNUC" version so the
preprocessor conditional should function as desired for them too.
undefine the internal macro after use so that nothing abuses it as a
public feature.
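
the shape of such a conditional can be sketched as below. the macro
name SKETCH_RETURNS_TWICE and the function sketch_checkpoint are
hypothetical; a real setjmp is written in assembly, and the trivial
body here exists only so the fragment compiles and links on its own.

```c
/* use the attribute only where the compiler is known to accept it:
 * gcc 4.1 and later, plus compilers reporting a later "GNUC" version */
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
#define SKETCH_RETURNS_TWICE __attribute__((__returns_twice__))
#else
#define SKETCH_RETURNS_TWICE
#endif

/* setjmp-like declaration carrying the hint */
int sketch_checkpoint(void) SKETCH_RETURNS_TWICE;

/* drop the helper macro so nothing else can come to depend on it */
#undef SKETCH_RETURNS_TWICE

/* placeholder body, for illustration only */
int sketch_checkpoint(void)
{
	return 0;
}
```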
add synchronous and asynchronous tag check failure codes, see
linux commit 74f1082487feb90bbf880af14beb8e29c3030c9f
arm64: mte: Add specific SIGSEGV codes
these are for the aarch64 MTE (memory tagging extension), see
linux commit 1c101da8b971a36695319dce7a24711dc567a0dd
arm64: mte: Allow user control of the tag check mode via prctl()
linux commit af5ce95282dc99d08a27a407a02c763dde1c5558
arm64: mte: Allow user control of the generated random tags via prctl()
path resolution does not follow symlinks on nosymfollow mounts (but
readlink still does), see
linux commit dab741e0e02bd3c4f5e2e97be74b39df2523fc6e
Add a "nosymfollow" mount option.
can cause rseq restart on another cpu to synchronize with global
memory access from rseq critical sections, see
linux commit 2a36ab717e8fe678d98f81c14a0b124712719840
rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
mainly added to linux to allow a central process management service in
android to give MADV_COLD|PAGEOUT hints for other processes, see
linux commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
mm/madvise: introduce process_madvise() syscall: an external memory
hinting API
the historical function was specified to return an empty string in the
caller-provided buffer, not a null pointer, to indicate error when the
argument is non-null. only when the argument is null should it return
a null pointer on error.
getpwuid_r can return 0 but without a result in the case where there
was no error but no record exists. in that case cuserid was treating
it as success and copying junk out of pw.pw_name to the output buffer.
this function was removed from the standard in 2001 but appeared in
SUSv2 with an obligation to support calls with a null pointer
argument, using a static buffer.
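the contract described above can be sketched roughly as follows. this
is an illustration, not necessarily the code in the tree: the function
name, the buffer-size constant and the scratch buffer size are
assumptions made for the example.

```c
#include <pwd.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_L_CUSERID 20 /* assumed size, for illustration only */

char *sketch_cuserid(char *buf)
{
	static char usridbuf[SKETCH_L_CUSERID];
	struct passwd pw, *ppw = 0;
	char scratch[2048];

	/* non-null argument: an empty string indicates error */
	if (buf) *buf = 0;
	getpwuid_r(geteuid(), &pw, scratch, sizeof scratch, &ppw);
	/* a return of 0 with no record leaves ppw null; treat as error */
	if (!ppw) return buf;
	size_t len = strnlen(pw.pw_name, SKETCH_L_CUSERID);
	if (len == SKETCH_L_CUSERID) return buf;
	/* null argument: fall back to the static buffer */
	if (!buf) buf = usridbuf;
	memcpy(buf, pw.pw_name, len + 1);
	return buf;
}
```

with a non-null argument the return value is always the caller's
buffer; with a null argument it is a null pointer on error.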
the threshold was wrong, so expm1f overflowed to inf slightly too
early. on most targets a uint32_t compare is faster than a float
compare, so use that for the threshold check.
this also fixes sinhf incorrectly returning nan for some values where
the internal expm1f overflowed.
on some negative inputs (e.g. -0x1.1e6ae8p+5) acoshf failed to return
nan. ensure that negative inputs result in nan without introducing new
branches. this was tried before in
commit 101e601285
math: fix acoshf on negative values
but that fix was wrong. there are 3 formulas used:
log1p(x-1 + sqrt((x-1)*(x-1)+2*(x-1)))
log(2*x - 1/(x+sqrt(x*x-1)))
log(x) + 0.693147180559945309417232121458176568
the first fails on large negative inputs (may compute log1p(0) or
log1p(inf)), the second one fails on some mid range or large negative
inputs (may compute log(large) or log(inf)) and the last one fails on
-0 (returns -inf).
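one way to route every negative input to a formula that yields nan for
it, without adding a branch, is to pick the formula from the raw bit
pattern. the sketch below only selects a formula number (1, 2 or 3, in
the order listed above) and computes nothing; the cutoffs 2 and 0x1p12
and the function name are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* returns which of the three formulas would be used for x */
int sketch_acoshf_formula(float x)
{
	uint32_t i;
	memcpy(&i, &x, sizeof i);      /* raw bits of x */
	uint32_t a = i & 0x7fffffff;   /* bits of |x| */

	/* |x| < 2: first formula; it yields nan for x in (-2,1),
	 * including -0 */
	if (a < 0x3f800000 + (1 << 23)) return 1;
	/* unsigned compare on the raw bits: any negative x has the sign
	 * bit set and therefore never takes this path */
	if (i < 0x3f800000 + (12 << 23)) return 2;
	/* x >= 0x1p12, x <= -2, or nan: log(x) is nan for negative x */
	return 3;
}
```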
as an outcome of Austin Group issue #385, future versions of the
standard will require free not to alter the value of errno. save and
restore it individually around the calls to madvise and munmap so that
the cost is not imposed on calls to free that do not result in any
syscall.
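the save/restore pattern described above, sketched on a hypothetical
helper (the function name is an assumption; only the munmap case is
shown):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* hypothetical helper on the slow path of free: only the path that
 * actually makes a syscall pays for saving and restoring errno */
void sketch_release_pages(void *p, size_t len)
{
	int e = errno;
	munmap(p, len);   /* may fail and set errno */
	errno = e;
}
```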
commit 7586360bad removed the unused
arguments from the definition of __libc_start_main, making it
incompatible with the declaration at the point of call, which still
passed 6 arguments. calls with mismatched function type have undefined
behavior, breaking LTO and any other tooling that checks for function
signature mismatch.
removing the extra arguments from the point of call (crt1) is not an
option for fixing this, since that would be a change in ABI surface
between application and libc.
adding back the extra arguments requires some care. on archs that pass
arguments on the stack or that reserve argument spill space for the
callee on the stack, it imposes an ABI requirement on the caller to
provide such space. the modern crt1.c entry point provides such space,
but originally there was arch-specific asm for the call to
__libc_start_main. the last of this asm was removed in commit
6fef8cafbd, and manual review of the
code removed and its prior history was performed to check that all
archs/variants passed the legacy init/fini/ldso_fini arguments.
these functions are specified to fail with EBADF on negative fd
arguments. apart from close, they are also specified to fail if the
value exceeds OPEN_MAX, but as written it is not clear that this
imposes any requirement when OPEN_MAX is not defined, and it's
undesirable to impose a dynamic limit (via setrlimit) here since the
limit at the time of posix_spawn may be different from the limit at
the time of setting up the file actions. this may require revisiting
later.
commit 2412638bb3 got the size of struct
v4l2_event wrong and failed to account for the fact that the old
struct might be either 120 bytes with time misaligned mod 8, or 128
bytes with time aligned mod 8, due to the contained union having
64-bit members whose alignment is arch-dependent.
rather than adding new logic to handle the differences, use an actual
stripped-down version of the structure in question to derive the ioctl
number, size, and offsets.
commit 2412638bb3 got the size of struct
v4l2_buffer wrong and omitted the tv_usec member slot from the offset
list, so the ioctl numbers never matched and the fallback code path was
never taken. this caused the affected ioctls to fail with ENOTTY on
kernels not new enough to have the native time64 ioctls.
this is necessary for MT-fork correctness now that the code runs under
locale lock. it would not be hard to avoid, but __get_locale is
already using libc-internal malloc anyway. this can be reconsidered
during locale overhaul later if needed.
in general, pthread_once is not compatible with MT-fork constraints
(commit 167390f055). here it actually no
longer matters, because it's now called with a lock held, but since
the lock is held it's pointless to use pthread_once.
this allows the lock to be shared with setlocale, eliminates repeated
per-category lock/unlock in newlocale, and will allow the use of
pthread_once in newlocale to be dropped (to be done separately).
the intent here is just to scan at least l bytes forward for the end
of the haystack and at least some decent minimum to avoid doing it
over and over if the needle is short, with no need to be precise. the
comment erroneously stated this as an estimate for MIN when it's
actually an estimate for MAX.
pthread_once is not compatible with MT-fork constraints (commit
167390f055) and is not needed here
anyway; we already have a lock suitable for initialization.
while changing this, fix a corner case where AT_MINSIGSTKSZ gives a
value that's more than MINSIGSTKSZ but by a margin of less than
2048, thereby causing the size to be reduced. it shouldn't matter but
the intent was to be the larger of a 2048-byte margin over the legacy
fixed minimum stack requirement or a 512-byte margin over the minimum
the kernel reports at runtime.
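the intended computation, sketched as a pure function. the function
and parameter names are hypothetical; legacy_min stands for
MINSIGSTKSZ and kernel_min for the AT_MINSIGSTKSZ value.

```c
#include <stddef.h>

size_t sketch_stack_size(size_t legacy_min, size_t kernel_min)
{
	size_t a = legacy_min + 2048;  /* margin over the fixed minimum */
	size_t b = kernel_min + 512;   /* margin over the runtime minimum */
	return a > b ? a : b;          /* never smaller than either */
}
```

with legacy_min 2048 and kernel_min 3000 (the corner case above), the
result is 4096 rather than the reduced 3512.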
this change should have been made when priority inheritance mutex
support was added. if priority protection is also added at some point
the implementation will need to change and will probably no longer be
a simple bit shuffling.
both __clone and __syscall_cp_asm failed to restore the original value
of r6 after using it as a syscall argument register. the extent of
breakage is not known, and in some cases may be mitigated by the only
callers being internal to libc; if they used r6 but no longer needed
its value after the call, they may not have noticed the problem.
however at least posix_spawn (which uses __clone) was observed
returning to the application with the wrong value in r6, leading to
crash.
since the call frame ABI already provides a place to spill registers,
fixing this is just a matter of using it. in __clone, we also
spuriously restore r6 in the child, since the parent branch directly
returns to the caller. this takes the value from an uninitialized slot
of the child's stack, but is harmless since there is no caller to
return to in the child.
float_t should represent the type that is used to evaluate float
expressions internally. On s390x, float_t is currently set to double.
In contrast, the isa supports single-precision float operations and
compilers by default evaluate float in single precision, which
violates the C standard (sections 5.2.4.2.2 and 7.12 in C11/C17, to be
precise). With -fexcess-precision=standard, gcc evaluates float in
double precision, which aligns with the standard yet at the cost of
added conversion instructions.
gcc-11 will drop the special case to retrofit double precision
behavior for -fexcess-precision=standard so that __FLT_EVAL_METHOD__
will be 0 on s390x in any scenario.
To improve standards compliance and compatibility with future compiler
direction, this patch changes the definition of float_t to be derived
from the compiler's __FLT_EVAL_METHOD__.
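A minimal sketch of such a derivation, assuming only the evaluation
methods 0 and 1 matter on this target. The typedef name is
hypothetical, and the real header may be structured differently.

```c
/* pick the evaluation type from what the compiler reports */
#if defined(__FLT_EVAL_METHOD__) && __FLT_EVAL_METHOD__ == 1
typedef double sketch_float_t;  /* float evaluated in double precision */
#else
typedef float sketch_float_t;   /* float evaluated in its own precision */
#endif
```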
reallocarray is an extension introduced by OpenBSD, which introduces
calloc overflow checking to realloc.
glibc 2.28 introduced support for this function behind _GNU_SOURCE,
while glibc 2.29 allows its usage in _DEFAULT_SOURCE.
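the overflow check amounts to refusing any request whose element count
times element size does not fit in size_t. a minimal sketch under a
hypothetical name, not necessarily the code added here:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

void *sketch_reallocarray(void *ptr, size_t m, size_t n)
{
	/* m*n would wrap around: fail the same way calloc does */
	if (n && m > SIZE_MAX / n) {
		errno = ENOMEM;
		return 0;
	}
	return realloc(ptr, m * n);
}
```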
inability to use realpath in chroot/container without procfs access
and at early boot prior to mount of /proc has been an ongoing issue,
and it turns out realpath was one of the last remaining interfaces
that needed procfs for its core functionality. during investigation
while reimplementing, it was determined that there were also serious
problems with the procfs-based implementation. most seriously it was
unsafe on pre-O_PATH kernels, and unlike other places where O_PATH was
used, the unsafety was hard or impossible to fix because O_NOFOLLOW
can't be used (since the whole purpose was to follow symlinks).
the new implementation is a direct one, performing readlink on each
path component to resolve it. an explicit stack, as opposed to
recursion, is used to represent the remaining components to be
processed. the stack starts out holding just the input string, and
reading a link pushes the link contents onto the stack.
unlike many other implementations, this one does not call getcwd
initially for relative pathnames. instead it accumulates initial ..
components to be applied to the working directory if the result is
still a relative path. this avoids calling getcwd (which may fail) at
all when symlink traversal will eventually yield an absolute path. it
also doesn't use any form of stat operation; instead it arranges for
readlink to tell it when a non-directory is used in a context where a
directory is needed. this minimizes the number of syscalls needed,
avoids accessing inodes when the directory table suffices, and reduces
the amount of code pulled in for static linking.
calling lutimes with tv=0 is valid if the application wants to set the
timestamps to the current time. this commit makes it so the timespec
struct is populated with values from tv only if tv != 0 and calls
utimensat with times=0 if tv == 0.
Update fanotify.h, see
linux commit 929943b38daf817f2e6d303ea04401651fc3bc05
fanotify: add support for FAN_REPORT_NAME
linux commit 83b7a59896dd24015a34b7f00027f0ff3747972f
fanotify: add basic support for FAN_REPORT_DIR_FID
linux commit 08b95c338e0c5a96e47f4ca314ea1e7580ecb5d7
fanotify: remove event FAN_DIR_MODIFY
FAN_DIR_MODIFY, which was new in v5.7, has been removed from the linux
uapi but is kept in musl so that we don't break api; linux cannot
reuse the value anyway.
see
linux commit 9b4feb630e8e9801603f3cab3a36369e3c1cf88d
arch: wire-up close_range()
linux commit 278a5fbaed89dacd04e9d052f4594ffd0e0585de
open: add close_range()
linux fails with EINVAL when a zero buffer size is passed to the
syscall. this is non-conforming because POSIX already defines EINVAL
with a significantly different meaning: the target is not a symlink.
since the request is semantically valid, patch it up by using a dummy
buffer of length one, and truncating the return value to zero if it
succeeds.
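the workaround can be sketched as a wrapper around the system readlink.
the wrapper name is hypothetical and the real code issues the syscall
directly.

```c
#include <sys/types.h>
#include <unistd.h>

ssize_t sketch_readlink(const char *restrict path, char *restrict buf,
	size_t bufsize)
{
	char dummy[1];
	if (!bufsize) {
		/* substitute a one-byte buffer for the zero-size request */
		buf = dummy;
		bufsize = 1;
	}
	ssize_t r = readlink(path, buf, bufsize);
	/* success on the dummy buffer means "0 bytes stored" */
	if (buf == dummy && r > 0) r = 0;
	return r;
}
```

errors such as EINVAL for a target that is not a symlink still pass
through unchanged.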
the v1 zoneinfo format with 32-bit time is deprecated. previously, the
v2 parsing code was only used if an exact match for '2' was found in
the version field of the header. this was already incorrect for v3
files (trivial differences from v2 that arguably didn't merit a new
version number anyway) but also failed to be future-proof.
since commit 3814333964, the condition
sizeof(time_t) > 4 is always true, so there is no functional change
being made here. but semantically, the 64-bit tables should always be
preferred now, because upstream zic (zoneinfo compiler) has quietly
switched to emitting empty 32-bit tables by default, and the resulting
backwards-incompatible zoneinfo files will be encountered in the wild.
commit d26e0774a5 moved the detach state
transition at exit before the thread list lock was taken. this
inadvertently allowed pthread_join to race to take the thread list
lock first, and proceed with unmapping of the exiting thread's memory.
we could fix this by just reverting the offending commit and instead
performing __vm_wait unconditionally before taking the thread list
lock, but that may be costly. instead, bring back the old DT_EXITING
vs DT_EXITED state distinction that was removed in commit
8f11e6127f, and don't transition to
DT_EXITED (a value of 0, which is what pthread_join waits for) until
after the lock has been taken.
the original wcsnrtombs implementation, which has been largely
untouched since 0.5.0, attempted to build input-length-limiting
conversion on top of wcsrtombs, which only limits output length. as
best I recall, this choice was made out of a mix of disdain over
having yet another variant function to implement (added in POSIX 2008;
not standard C) and preference not to switch things around and
implement wcsrtombs in terms of the more general new function,
probably over namespace issues. the strategy employed was to impose
output limits that would ensure the input limit wasn't exceeded, then
finish up the tail character-at-a-time. unfortunately, none of that
worked correctly.
first, the logic in the wcsrtombs loop was wrong in that it could
easily get stuck making no forward progress, by imposing an output
limit too small to convert even one character.
the character-at-a-time loop that followed was even worse. it made no
effort to ensure that the converted multibyte character would fit in
the remaining output space, only that there was a nonzero amount of
output space remaining. it also employed an incorrect interpretation
of wcrtomb's interface contract for converting the null character,
thereby failing to act on end of input, and remaining space accounting
was subject to unsigned wrap-around. together these errors allow
unbounded overflow of the destination buffer, controlled by input
length limit and input wchar_t string contents.
given the extent to which this function was broken, it's plausible
that most applications that would have been rendered exploitable were
sufficiently broken not to be usable in the first place. however, it's
also plausible that common (especially ASCII-only) inputs succeeded in
the wcsrtombs loop, which mostly worked, while leaving the wildly
erroneous code in the second loop exposed to particular non-ASCII
inputs.
CVE-2020-28928 has been assigned for this issue.
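for illustration, a direct character-at-a-time conversion that avoids
each of the errors listed above might look like the following sketch.
it is written under a hypothetical name and is not claimed to be the
replacement code.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

size_t sketch_wcsnrtombs(char *restrict dst, const wchar_t **restrict wcs,
	size_t wn, size_t n, mbstate_t *restrict st)
{
	const wchar_t *ws = *wcs;
	size_t cnt = 0;
	if (!dst) n = 0;
	while (ws && wn) {
		char tmp[MB_LEN_MAX];
		/* convert into a temporary whenever the remaining space
		 * might be too small for the next character */
		size_t l = wcrtomb(n < MB_LEN_MAX ? tmp : dst, *ws, st);
		if (l == (size_t)-1) {
			cnt = (size_t)-1;
			break;
		}
		if (dst) {
			if (n < MB_LEN_MAX) {
				if (l > n) break;   /* does not fit: stop here */
				memcpy(dst, tmp, l);
			}
			dst += l;
			n -= l;                 /* l <= n, so no wrap-around */
		}
		if (!*ws) {                 /* null character: end of input */
			ws = 0;
			break;
		}
		ws++;
		wn--;
		cnt += l;
	}
	if (dst) *wcs = ws;
	return cnt;
}
```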
after a non-normal-type process-shared mutex is unlocked, it's
immediately available to another thread to lock, unlock, and destroy,
but the first unlocking thread may still have a pointer to it in its
robust_list pending slot. this means, on async process termination,
the kernel may attempt to access and modify the memory that used to
contain the mutex -- memory that may have been reused for some other
purpose after the mutex was destroyed.
setting up for this kind of race to occur is difficult to begin with,
requiring dynamic use of shared memory maps, and actually hitting the
race is very difficult even with a suitable setup. so this is mostly a
theoretical fix, but in any case the cost is very low.
the __vm_wait operation can delay forward progress arbitrarily long if
a thread holding the lock is interrupted by a signal. in a worst case
this can deadlock. any critical section holding the thread list lock
must respect lock ordering contracts and must not take any lock which
is not AS-safe.
to fix, move the determination of thread joinable/detached state to
take place before the killlock and thread list lock are taken. this
requires reverting the atomic state transition if we determine that
the exiting thread is the last thread and must call exit, but that's
easy to do since it's a single-threaded context with application
signals blocked.
as the outcome of Austin Group tracker issue #62, future editions of
POSIX have dropped the requirement that fork be AS-safe. this allows
but does not require implementations to synchronize fork with internal
locks and give forked children of multithreaded parents a partly or
fully unrestricted execution environment where they can continue to
use the standard library (per POSIX, they can only portably use
AS-safe functions).
up until recently, taking this allowance did not seem desirable.
however, commit 8ed2bd8bfc exposed the
extent to which applications and libraries are depending on the
ability to use malloc and other non-AS-safe interfaces in MT-forked
children, by converting latent very-low-probability catastrophic state
corruption into predictable deadlock. dealing with the fallout has
been a huge burden for users/distros.
while it looks like most of the non-portable usage in applications
could be fixed given sufficient effort, at least some of it seems to
occur in language runtimes which are exposing the ability to run
unrestricted code in the child as part of the contract with the
programmer. any attempt at fixing such contracts is not just a
technical problem but a social one, and is probably not tractable.
this patch extends the fork function to take locks for all libc
singletons in the parent, and release or reset those locks in the
child, so that when the underlying fork operation takes place, the
state protected by these locks is consistent and ready for the child
to use. locking is skipped in the case where the parent is
single-threaded so as not to interfere with the legacy AS-safety property
of fork in single-threaded programs. lock order is mostly arbitrary,
but the malloc locks (including bump allocator in case it's used) must
be taken after the locks on any subsystems that might use malloc, and
non-AS-safe locks cannot be taken while the thread list lock is held,
imposing a requirement that it be taken last.
this change lifts undocumented restrictions on calls by replacement
mallocs to libc functions that might take these locks, and sets the
stage for lifting restrictions on the child execution environment
after multithreaded fork.
care is taken to #define macros to replace all four functions (malloc,
calloc, realloc, free) even if not all of them will be used, using an
undefined symbol name for the ones intended not to be used so that any
inadvertent future use will be caught at compile time rather than
directed to the wrong implementation.
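a sketch of the macro scheme for a source file that is meant to use
only malloc. all names beginning with sketch_ are hypothetical, and
the internal allocator is stubbed out by forwarding to the real malloc
so that the example is self-contained.

```c
#include <stdlib.h>

/* hypothetical internal allocator entry point */
void *sketch_libc_malloc(size_t n)
{
	return malloc(n);
}

/* from here on, redirect all four names. the three that this file is
 * not meant to use map to names that are never declared or defined,
 * so an accidental use fails to build */
#define malloc  sketch_libc_malloc
#define calloc  sketch_undef_calloc
#define realloc sketch_undef_realloc
#define free    sketch_undef_free

/* uses only malloc, which now resolves to sketch_libc_malloc */
char *sketch_alloc_byte(char c)
{
	char *p = malloc(1);
	if (p) *p = c;
	return p;
}
```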
allowing the application to replace malloc (since commit
c9f415d7ea) has brought multiple
headaches where it's used from various critical sections in libc
components. for example:
- the thread-local message buffers allocated for dlerror can't be
freed at thread exit time because application code would then run in
the context of a nonexistent thread. this was handled in commit
aa5a9d15e0 by queuing them for free
later.
- the dynamic linker has to be careful not to pass memory allocated at
early startup time (necessarily using its own malloc) to realloc or
free after redoing relocations with the application and all
libraries present. bugs in this area were fixed several times, at
least in commits 0c5c8f5da6 and
2f1f51ae7b and possibly others.
- by calling the allocator from contexts where libc-internal locks are
held, we impose undocumented requirements on alternate malloc
implementations not to call into any libc function that might
attempt to take these locks; if they do, deadlock results.
- work to make fork of a multithreaded parent give the child an
unrestricted execution environment is blocked by lock order issues
as long as the application-provided allocator can be called with
libc-internal locks held.
these problems are all fixed by giving libc internals access to the
original, non-replaced allocator, for use where needed. it can't be
used everywhere, as some interfaces like str[n]dup, open_[w]memstream,
getline/getdelim, etc. are required to provide the caller memory
obtained as if by (the public) malloc. and there are a number of libc
interfaces that are "pure library" code, not part of some internal
singleton, and where using the application's choice of malloc
implementation is preferable -- things like glob, regex, etc.
one might expect there to be significant cost to static-linked
programs, pulling in two malloc implementations, one of them
mostly-unused, if malloc is replaced. however, in almost all of the
places where malloc is used internally, care has been taken already
not to pull in realloc/free (i.e. to link with just the bump
allocator). this size optimization carries over automatically.
the newly-exposed internal allocator functions are obtained by
renaming the actual definitions, then adding new wrappers around them
with the public names. technically __libc_realloc and __libc_free
could be aliases rather than needing a layer of wrapper, but this
would almost surely break certain instrumentation (valgrind) and the
size and performance difference is negligible. __libc_calloc needs to
be handled specially since calloc is designed to work with either the
internal or the replaced malloc.
as a bonus, this change also eliminates the longstanding ugly
dependency of the static bump allocator on order of object files in
libc.a, by making it so there's only one definition of the malloc
function and having it in the same source file as the bump allocator.
the only place stdio was used here was for reading the ldso path file,
taking advantage of getdelim to automatically allocate and resize the
buffer. the motivation for use here was that, with shared libraries,
stdio is already available anyway and free to use. this has long been
a nuisance to users because getdelim's use of realloc here triggered a
valgrind bug, but removing it doesn't really fix that; on some archs
even calling the valgrind-interposed malloc at this point will crash.
the actual motivation for this change is moving towards getting rid of
use of application-provided malloc in parts of libc where it would be
called with libc-internal locks held, leading to the possibility of
deadlock if the malloc implementation doesn't follow unwritten rules
about which libc functions are safe for it to call. since getdelim is
required to produce a pointer as if by malloc (i.e. that can be passed
to realloc or free), it necessarily must use the public malloc.
instead of performing a realloc loop as the path file is read, first
query its size with fstat and allocate only once. this produces
slightly different truncation behavior when racing with writes to a
file, but neither behavior is or could be made safe anyway; on a live
system, ldso path files should be replaced by atomic rename only. the
change should also reduce memory waste.
thread-local buffers allocated for dlerror need to be queued for free
at a later time when the owning thread exits, since malloc may be
replaced by application code and the exiting context is not valid to
call application code from. the code to process the queue of pending
frees, introduced in commit aa5a9d15e0,
gratuitously held the lock for the entire duration of queue
processing, updating the global queue pointer after each free, despite
there being no logical requirement that all frees finish before
another thread can access the queue.
instead, immediately claim the whole queue for freeing and release the
lock, then walk the list and perform frees without the lock held. the
change is unlikely to make any meaningful difference to performance,
but it eliminates one point where the allocator is called under an
internal lock. since the allocator may be application-provided, such
calls are undesirable because they allow application code to impede
forward progress of libc functions in other threads arbitrarily long,
and to induce deadlock if it calls a libc function that requires the
same lock.
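the claim-then-free pattern can be sketched as follows. a pthread
mutex stands in for the internal lock, and all names are hypothetical.
each queued buffer is assumed to be large enough to hold a link
pointer in its first word.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;
static void *sketch_queue;   /* head of the list of pending frees */

/* push a buffer onto the queue, reusing its first word as the link */
void sketch_queue_free(void *buf)
{
	pthread_mutex_lock(&sketch_lock);
	*(void **)buf = sketch_queue;
	sketch_queue = buf;
	pthread_mutex_unlock(&sketch_lock);
}

/* frees everything queued so far; returns the number of buffers freed */
int sketch_process_queue(void)
{
	/* claim the whole queue, then drop the lock immediately */
	pthread_mutex_lock(&sketch_lock);
	void **p = sketch_queue;
	sketch_queue = 0;
	pthread_mutex_unlock(&sketch_lock);

	/* free is called with no lock held */
	int n = 0;
	while (p) {
		void **next = *p;
		free(p);
		p = next;
		n++;
	}
	return n;
}
```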
the change also eliminates a lock ordering consideration that's an
impediment to upcoming work with multithreaded fork.
the ABI type for the vector registers in fpregset_t, struct
fpsimd_context, and struct user_fpsimd_struct is __uint128_t, which
was presumably originally not used because it's a nonstandard type,
but its existence is mandated by the aarch64 psABI. use of the wrong
type here broke software using these structures, and encouraged
incorrect fixes with casts rather than reinterpretation of
representation.
the reasoning in commit 2d0bbe6c78 was
not entirely correct. while it's true that setting the waiters flag
ensures that the next unlock will perform a wake, it's possible that
the wake is consumed by a mutex waiter that has no relationship with
the condvar wait queue being processed, which then takes the mutex.
when that thread subsequently unlocks, it sees no waiters, and leaves
the rest of the condvar queue stuck.
bring back the waiter count adjustment, but skip it for PI mutexes,
for which a successful lock-after-waiting always sets the waiters bit.
if future changes are made to bring this same waiters-bit contract to
all lock types, this can be reverted.
sem_open is required to return the same sem_t pointer for all
references to the same named semaphore when it's opened more than once
in the same process. thus we keep a table of all the mapped semaphores
and their reference counts. the code path for sem_close checked the
reference count, but then proceeded to unmap the semaphore regardless
of whether the count had reached zero.
add an immediate unlock-and-return for the nonzero refcnt case so the
property of performing the munmap syscall after releasing the lock can
be preserved.
resource limits have been process-wide since linux 2.6.10, and the
prlimit syscall was added in 2.6.36, so prlimit can be assumed to set
the resource limits correctly for the whole process.
commit bd153422f2 reintroduced the bug
fixed in c21051e90c by refactoring the
__syscall_ret into _Fork where it once again runs before the atfork
handlers are called. since _Fork is a public interface that sets
errno, this can't be fixed the way it was fixed last time without
making new internal interfaces. instead, just save errno, and restore
it only on error to ensure that a value of 0 is never restored.
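the save-and-conditionally-restore idea, sketched generically. the
operation and the handlers are hypothetical stand-ins passed as
function pointers, so that the example does not actually fork.

```c
#include <errno.h>

/* hypothetical stand-ins for the operation and for handlers that run
 * after it and may clobber errno */
int sketch_failing_op(void) { errno = EAGAIN; return -1; }
int sketch_ok_op(void) { return 0; }
void sketch_handlers(void) { errno = EINVAL; }

int sketch_call_with_handlers(int (*op)(void), void (*after)(void))
{
	int ret = op();
	int errno_save = errno;   /* only meaningful if op failed */
	after();
	/* restore only on error; on success the saved value could be 0,
	 * which must never be stored back into errno */
	if (ret < 0) errno = errno_save;
	return ret;
}
```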
pthread_cond_wait arranged for requeued waiters to wake when the mutex
is unlocked by temporarily adjusting the mutex's waiter count. commit
54ca677983 broke this when introducing
PI mutexes by repurposing the waiter count field of the mutex
structure. since then, for PI mutexes, the waiter count adjustment was
misinterpreted by the mutex locking code as indicating that the mutex
is in a non-recoverable state.
it would be possible to special-case PI mutexes here, but instead just
drop all adjustment of the waiters count, and instead use the lock
word waiters bit for all mutex types. since the mutex is either held
by the caller or in an unrecoverable state at the time the bit is set, it
will necessarily still be set at the time of any subsequent valid
unlock operation, and this will produce the desired effect of waking
the next waiter.
if waiter counts are entirely dropped at some point in the future this
code should still work without modification.
commit 25ea9f712c introduced a deadlock
to the posix_spawn child whereby, if abort was called in the parent
and ended up taking the abort lock to terminate the process, the
__libc_sigaction calls in the child would wait forever to obtain a
lock that would not be released. this could be fixed by having abort
set the abort lock as the exit futex address, but it's cleaner to just
remove the SIGABRT special handling from the internal __libc_sigaction
and lift it to the public sigaction function.
nothing but the posix_spawn child calls __libc_sigaction on SIGABRT,
and since commit b7bc966522 the abort
lock is held at the time of __clone, which precludes the child
inheriting a kernel-level signal disposition inconsistent with the
disposition on the abstract machine. this means it's fine to inspect
and modify the disposition in the child without a lock.
Merge changes from Solar Designer's crypt_blowfish v1.3. This makes
crypt_blowfish fully compatible with OpenBSD's bcrypt by adding
support for the $2b$ prefix (which behaves the same as
crypt_blowfish's $2y$).
this makes the code slightly smaller and eliminates timer_create from
relevance to possible future changes to multithreaded fork.
the barrier of a_store isn't technically needed here, but a_store is
used anyway for internal consistency of the memory model.
this was leftover from when the actual SIGEV_THREAD timer logic was in
the signal handler. commit 5b74eed3b3
replaced that with use of sigwaitinfo, with the actual signal left
blocked, so the no-op signal handler was no longer serving any
purpose.
the signal disposition reset to SIG_DFL is still needed, however, in
case we inherited SIG_IGN from a foreign-libc process.
assert is not specified to flush open stdio streams, and doing so can
block indefinitely waiting for a lock already held or an output
operation to a file that can't accept more output until an
unsatisfiable condition is met.
commit 500c6886c6 broke this by fixing
the behavior of fread to conform to the C standard; getgrouplist was
assuming the old behavior, that a request to read 1 member of length 0
would return 1, not 0.
this change prevents the child created concurrently with abort from
seeing the SIGABRT disposition change from SIG_IGN to SIG_DFL (other
changes are not visible anyway) and prevents leaking the write end of
the child pipe to children created by fork in another thread, which
may block return of posix_spawn indefinitely if the forked child does
not exit or exec.
along with other changes, this suggests that __abort_lock should
perhaps eventually be renamed to reflect that it's becoming a broader
lock on related "process lifetime" state.
the existing abort locking logic in sigaction only accounted for
attempts to change the disposition, not attempts to observe the change
made by abort.
unfortunately the change is still observable in at least one other
place: inheritance of signal dispositions across exec and posix_spawn.
fixing these is a separate task and it's not even clear whether a
complete fix is possible.
the _Fork interface is defined for a future issue of POSIX as the
outcome of Austin Group issue 62, which drops the AS-safety
requirement for fork, and provides an AS-safe replacement that does
not run the registered atfork handlers.
commit 188759bbee documented the intent
to allow recursive dlopen based on tracking ctor_visitor, but used a
kernel tid rather than the pthread_t to identify the caller. as a
result, it would not behave as intended under fork by a ctor, where
the child tid would not match.
queue_ctors should not be called with the init_fini_lock held, since
it may longjmp out on allocation failure. this introduces a minor
TOCTOU race with p->constructed, but one already exists further down
anyway, and by design it's okay to run through the queue more than
once. the only reason we bother to check p->constructed at all
is to avoid spurious failure of dlopen when the library is already
fully loaded and constructed.
this makes the code slightly smaller and eliminates these functions
from relevance to possible future changes to multithreaded fork.
the barrier of a_store isn't technically needed here, but a_store is
used anyway for internal consistency of the memory model.
if the multithreaded parent forked while another thread was calling
sigaction for SIGABRT or calling abort, the child could inherit a lock
state in which future calls to abort will deadlock, or in which the
disposition for SIGABRT has already been reset to SIG_DFL. this is
nonconforming since abort is AS-safe and permitted to be called
concurrently with fork or in the MT-forked child.
the dummy definition of __abort_lock in sigaction.c was performing
exactly the same role that putting the lock in its own source file
could and should have been used to achieve.
while we're moving it, give it a proper declaration.
previously, if a file descriptor had aio operations pending in the
parent before fork, attempting to close it in the child would attempt
to cancel a thread belonging to the parent. this could deadlock, fail,
or crash the whole process if the cancellation signal handler was not
yet installed in the parent. in addition, further use of aio from the
child could malfunction or deadlock.
POSIX specifies that async io operations are not inherited by the
child on fork, so clear the entire aio fd map in the child, and take
the aio map lock (with signals blocked) across the fork so that the
lock is kept in a consistent state.
taking the deprecated/dropped vfork spec strictly, doing pretty much
anything but execve in the child is wrong and undefined. however,
these are commonly needed operations to set up the child state before
exec, and historical implementations tolerated them.
for single-threaded parents, these operations already worked as
expected in the vforked child. however, due to the need for __synccall
to synchronize id/resource limit changes among all threads, calling
these functions in the vforked child of a multithreaded parent caused
a misdirected broadcast signaling of all threads in the parent. these
signals could kill the parent entirely if the synccall signal handler
had never been installed in the parent, or could be ignored if it had,
or could signal/kill one or more utterly wrong processes if the parent
already terminated (due to vfork semantics, only possible via fatal
signal) and the parent tids were recycled. in any case, the expected
number of semaphore posts would never happen, so the child would
permanently hang (with all signals blocked) waiting for them.
to mitigate this, and also make the normal usage case work as
intended, treat the condition where the caller's actual tid does not
match the tid in its thread structure as single-threaded, and bypass
the entire synccall broadcast operation.
commit 0a05eace16 implemented AT_EACCESS
for faccessat with a horrible hack, creating a child process to switch
uid/gid and perform the access probe without making potentially
irreversible changes to the caller's credentials. this was due to the
syscall lacking a flags argument.
linux 5.8 introduced a new syscall, SYS_faccessat2, fixing this
deficiency. use it if any flags are passed, and fall back to the old
strategy on ENOSYS. continue using the old syscall when there are no
flags.
Ethernet protocol number for media redundancy protocol, see
linux commit 4714d13791f831d253852c8b5d657270becb8b2a
bridge: uapi: mrp: Add mrp attributes.
the linux faccessat syscall lacks a flag argument that is necessary
to implement the posix api, see
linux commit c8ffd8bcdd28296a198f237cc595148a8d4adfbe
vfs: add faccessat2 syscall
add TCP_NLA_BYTES_NOTSENT and new tcp_zerocopy_receive fields, see
linux commit c8856c051454909e5059df4e81c77b9c366c5515
tcp-zerocopy: Return inq along with tcp receive zerocopy.
linux commit 33946518d493cdf10aedb4a483f1aa41948a3dab
tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy.
linux commit e08ab0b377a1489760533424437c5f4be7f484a4
tcp: add bytes not sent to SCM_TIMESTAMPING_OPT_STATS
it remaps anon mappings without unmapping the original. chromeos plans
to use it with userfaultfd, see:
linux commit e346b3813067d4b17383f975f197a9aa28a3b077
mm/mremap: add MREMAP_DONTUNMAP to mremap()
see
linux commit 9e2ba2c34f1922ca1e0c7d31b30ace5842c2e7d1
fanotify: send FAN_DIR_MODIFY event flavor with dir inode and name
linux commit 44d705b0370b1d581f46ff23e5d33e8b5ff8ec58
fanotify: report name info for FAN_DIR_MODIFY event
added in
linux commit 1a50ec0b3b2e9a83f1b1245ea37a853aac2f741c
arm64: Implement archrandom.h for ARMv8.5-RNG
linux commit d4209d8b717311d114b5d47ba7f8249fd44e97c2
arm64: cpufeature: Export matrix and other features to userspace
these were missed before, added in
linux commit 1201937491822b61641c1878ebcd16a93aed4540
arm64: Expose ARMv8.5 CondM capability to userspace
linux commit ca9503fc9e9812aa6258e55d44edb03eb30fc46f
arm64: Expose FRINT capabilities to userspace
reuses a bit from CSIGNAL so it can only be used with unshare
and clone3, added in
linux commit 769071ac9f20b6a447410c7eaa55d1a5233ef40c
ns: Introduce Time Namespace
needed for storage drivers with userspace component that may
run in the IO path, see
linux commit 8d19f1c8e1937baf74e1962aae9f90fa3aeab463
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
The use of TCP_ in udp.h is not known, fortunately udp.h is not
specified by posix so there are no strict namespace rules, added in
linux commit e27cca96cd68fa2c6814c90f9a1cfd36bb68c593
xfrm: add espintcp (RFC 8229)
TCP_NLA_TIMEOUT_REHASH queries timeout-triggered rehash attempts,
tcpm_ifindex limits the scope of TCP_MD5SIG* sockopt to a device.
see
linux commit 32efcc06d2a15fa87585614d12d6c2308cc2d3f3
tcp: export count for rehash attempts
linux commit 6b102db50cdde3ba2f78631ed21222edf3a5fb51
net: Add device index to tcp_md5sig
add IPPROTO_ETHERNET and IPPROTO_MPTCP, see
linux commit 2677625387056136e256c743e3285b4fe3da87bb
seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
linux commit faf391c3826cd29feae02078ca2022d2f912f7cc
tcp: Define IPPROTO_MPTCP
also added clone3 on sh and m68k. on sh it's still missing (not yet
wired up), but the number is reserved so it's safe to add.
see
linux commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179
open: introduce openat2(2) syscall
linux commit 9a2cef09c801de54feecd912303ace5c27237f12
arch: wire up pidfd_getfd syscall
linux commit 8649c322f75c96e7ced2fec201e123b2b073bf09
pid: Implement pidfd_getfd syscall
linux commit e8bb2a2a1d51511e6b3f7e08125d52ec73c11139
m68k: Wire up clone3() syscall
the fcntl file locking command macro values in the existing generic
bits/fcntl.h were the "64" variants, requiring 64-bit archs that use
the "plain" variants to have their own bits/fcntl.h, even if they
otherwise use the common definitions for everything.
since commit 7cc79d10af exposed
__LONG_MAX to all bits headers, we can now make the generic one common
between 32- and 64-bit archs.
prior to commit 685e40bb09, x86_64 was
correctly passing O_LARGEFILE to SYS_open; it was removed (defined to
0 in the public header, and changed to use the public definition) as
part of that change, probably out of a mistaken belief that it's not
needed.
however, on a mixed system with 32-bit and 64-bit binaries, it's
important that all files be opened with O_LARGEFILE, even if the
opening process is 64-bit, in case a descriptor is passed to a 32-bit
process. otherwise, attempts to access past 2GB in the 32-bit process
could produce EOVERFLOW.
most 64-bit archs added later got this right already, except for
mips64. x32 was also affected. these are now fixed.
this code is only needed for pre-2.6 kernels, which are not actually
supported anyway, and was never tested. the fallback path using
SYS_modify_ldt failed to clear the upper bits of %eax (all ones due to
SYS_set_thread_area's return value being an error) before modifying
%al to attempt a new syscall.
prior to commit e68c51ac46, h_errno was
actually an external data object not a macro. bring back the symbol,
and use it as the storage for the main thread's h_errno.
technically this still doesn't provide full compatibility if the
application was multithreaded, but at the time there were no res_*
functions (and they did not set h_errno anyway), so any use of h_errno
would have been via thread-unsafe functions. thus a solution that just
fixes single-threaded applications seems acceptable.
putting the (simple) definition in alltypes.h seems like the best
solution here. making sys/ioctl.h implicitly include termios.h is
probably excess namespace pollution.
now that -Wall is not used and we control which warnings are enabled,
it makes sense to have the wanted ones on by default. hopefully this
will also discourage manually adding -Wall to CFLAGS and making
incorrect changes or bug reports based on the compiler's output.
-Wall varies too much by compiler and version. rather than trying to
track all the unwanted style warnings that need to be subtracted, just
enable wanted warnings.
also, move -Wno-pointer-to-int-cast outside --enable-warnings
conditional so that it always applies, since it's turning off a
nuisance warning that's on-by-default with most compilers.
these four warning options were overlooked previously, likely because
they're not part of GCC's -Wall. they all detect constraint violations
(invalid C at the source level) and should always be on in -Werror
form.
dtv_copy, canary2, and canary_at_end existed solely to match multiple
ABI and asm-accessed layouts simultaneously. now that pthread_arch.h
can be included before struct __pthread is defined, the struct layout
can depend on macros defined by pthread_arch.h.
the adjustment made is entirely a function of TLS_ABOVE_TP and
TP_OFFSET. aside from avoiding repetition of the TP_OFFSET value and
arithmetic, this change makes pthread_arch.h independent of the
definition of struct __pthread from pthread_impl.h. this in turn will
allow inclusion of pthread_arch.h to be moved to the top of
pthread_impl.h so that it can influence the definition of the
structure.
previously, arch files were very inconsistent about the type used for
the thread pointer. this change unifies the new __get_tp interface to
always use uintptr_t, which is the most correct when performing
arithmetic that may involve addresses outside the actual pointed-to
object (due to TP_OFFSET).
while it's not clearly documented anywhere, this is the historical
behavior which some applications expect. applications which need to
see the response packet in these cases, for example to distinguish
between nonexistence in a secure vs insecure zone, must already use
res_mkquery with res_send in order to be portable, since most if not
all other implementations of res_query don't provide it.
the framework to do this always existed but it was deemed unnecessary
because the only [ex-]standard functions using h_errno were not
thread-safe anyway. however, some of the nonstandard res_* functions
are also supposed to set h_errno to indicate the cause of error, and
were unable to do so because it was not thread-safe. this change is a
prerequisite for fixing them.
these have been adopted for a future issue of POSIX as the outcome of
Austin Group issue 1151, and are simply functions performing the roles
of the historical ioctls. since struct winsize is being standardized
along with them, its definition is moved to the appropriate header.
there is some chance this will break source files that expect struct
winsize to be defined by sys/ioctl.h without including termios.h. if
this happens, further changes will be needed to have sys/ioctl.h
expose it too.
this is a prerequisite for addition of other interfaces that use
kernel tids, including futex and SIGEV_THREAD_ID.
there is some ambiguity as to whether the semantic return type should
be int or pid_t. either way, futex API imposes a contract that the
values fit in int (excluding some upper reserved bits). glibc used
pid_t, so in the interest of not having gratuitous mismatch (the
underlying types are the same anyway), pid_t is used here as well.
while conceptually this is a syscall, the copy stored in the thread
structure is always valid in all contexts where it's valid to call
libc functions, so it's used to avoid the syscall.
longjmp should set the return value of setjmp, but 64bit
registers were used for the 0 check while the type is int.
use the code that gcc generates for return val ? val : 1;
longjmp 'val' argument is an int, but the assembly is referencing 64-bit
registers as if the argument was a long, or the caller was responsible
for extending the argument. Though the psABI is not clear on this, the
interpretation in GCC is that high bits may be arbitrary and the callee
is responsible for sign/zero-extending the value as needed (likewise for
return values: callers must anticipate that high bits may be garbage).
Therefore testing %rax is a functional bug: setjmp would wrongly return
zero if longjmp was called with val==0, but high bits of %rsi happened
to be non-zero.
Rewrite the prologue to refer to 32-bit registers. In passing, change
'test' to use %rsi, as there's no advantage to using %rax and the new
form is cheaper on processors that do not perform move elimination.
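the difference between testing the 64-bit register and testing only the
low 32 bits can be modeled in C. this is a sketch, not the assembly
itself: buggy_result and fixed_result are hypothetical names, and the
uint64_t argument stands in for the contents of %rsi on entry.

```c
#include <stdint.h>

/* models the old prologue: the zero test looks at all 64 bits, so
 * garbage in the high half hides a val of 0 */
static int buggy_result(uint64_t rsi)
{
	uint64_t rax = rsi;
	if (rax == 0) rax = 1;
	return (int)(uint32_t)rax;   /* caller only sees the low 32 bits */
}

/* models the new prologue: only the low 32 bits (the int argument)
 * take part in the test, equivalent to val ? val : 1 */
static int fixed_result(uint64_t rsi)
{
	uint32_t eax = (uint32_t)rsi;
	if (eax == 0) eax = 1;
	return (int)eax;
}
```

with val==0 and a nonzero high half, the first form makes setjmp appear
to return 0, while the second returns 1 as required.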
a number of users performing seccomp filtering have requested use of
the new individual syscall numbers for socket syscalls, rather than
the legacy multiplexed socketcall, since the latter has the arguments
all in memory where they can't participate in filter decisions.
previously, some archs used the multiplexed socketcall if it was
historically all that was available, while other archs used the
separate syscalls. the intent was that the latter set only include
archs that have "always" had separate socket syscalls, at least going
back to linux 2.6.0. however, at least powerpc, powerpc64, and sh were
wrongly included in this set, and thus socket operations completely
failed on old kernels for these archs.
with the changes made here, the separate syscalls are always
preferred, but fallback code is compiled for archs that also define
SYS_socketcall. two such archs, mips (plain o32) and microblaze,
define SYS_socketcall despite never having needed it, so it's now
undefined by their versions of syscall_arch.h to prevent inclusion of
useless fallback code.
some archs, where the separate syscalls were only added after the
addition of SYS_accept4, lack SYS_accept. because socket calls are
always made with zeros in the unused argument positions, it suffices
to just use SYS_accept4 to provide a definition of SYS_accept, and
this is done to satisfy the macro machinery that concatenates the
socket call name onto __SC_ and SYS_.
same approach as in sqrt.
sqrtl was broken on aarch64, riscv64 and s390x targets because
of missing quad precision support and on m68k-sf because of
missing ld80 sqrtl.
this implementation is written for quad precision and then
edited to make it work for both m68k and x86 style ld80 formats
too, but it is not expected to be optimal for them.
note: using fp instructions for the initial estimate when such
instructions are available (e.g. double prec sqrt or rsqrt) is
avoided because of fenv correctness.
same method as in sqrt, this was tested on all inputs against
an sqrtf instruction. (the only difference found was that x86
sqrtf does not signal the x86 specific input-denormal exception
on negative subnormal inputs while the software sqrtf does,
this is fine as it was designed for ieee754 exceptions only.)
there is a known faster method:
"Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation"
that computes sqrtf directly via pipelined polynomial evaluation
which allows more parallelism, but the design does not generalize
easily to higher precisions.
approximate 1/sqrt(x) and sqrt(x) with goldschmidt iterations.
this is known to be a fast method for computing sqrt, but it is
tricky to get right, so added detailed comments.
use a lookup table for the initial estimate, this adds 256bytes
rodata but it can be shared between sqrt, sqrtf and sqrtl.
this saves one iteration compared to a linear estimate.
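for illustration, the coupled goldschmidt recurrence can be sketched in
plain double arithmetic. this is not the committed code, which works in
integer arithmetic with a table-based estimate; the function name, the
linear initial estimate, the fixed iteration count, and the restriction
to 1 <= x < 4 are all assumptions of the sketch.

```c
/* approximates sqrt(x) for 1 <= x < 4 by refining s ~ sqrt(x) and
 * h ~ 1/(2*sqrt(x)) together; each step roughly squares the error */
static double goldschmidt_sqrt_sketch(double x)
{
	double r = (7.0 - x) / 6.0;  /* linear estimate of 1/sqrt(x) */
	double s = x * r;            /* estimate of sqrt(x) */
	double h = 0.5 * r;          /* estimate of 1/(2*sqrt(x)) */
	for (int i = 0; i < 6; i++) {
		double e = 0.5 - s * h;  /* residual, tends to 0 */
		s += s * e;
		h += h * e;
	}
	return s;
}
```

the linear estimate is off by up to about 19% on this interval, which is
why the sketch needs several iterations; a better initial estimate is
what allows fewer.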
this is for soft float targets, but it supports fenv by using a
floating-point operation to get the final result. the result
is correctly rounded in all rounding modes. if fenv support is
turned off then the nearest rounded result is computed and
inexact exception is not signaled.
assumes fast 32bit integer arithmetic and 32 to 64bit mul.
prior to this change, the canonical name came from the first hosts
file line matching the requested family, so the canonical name for a
given hostname could differ depending on whether it was requested with
AF_UNSPEC or a particular family (AF_INET or AF_INET6). now, the
canonical name is deterministically the first one to appear with the
requested name as an alias.
the existing code clobbered the canonical name already discovered
every time another matching line was found, which will necessarily be
the case when a hostname has both IPv4 and v6 definitions.
patch by Wolf.
this is actually a functional fix at present, since the C sqrtl does
not support ld80 and just wraps double sqrt. once that's fixed it will
just be an optimization.
the previous commit addressing async-signal-safety issues around
pthread_kill did not fully fix pthread_cancel, which is also required
(albeit rather irrationally) to be async-cancel-safe.
without blocking implementation-internal signals, it's possible that,
when async cancellation is enabled, a cancel signal sent by another
thread interrupts pthread_kill while the killlock for a targeted
thread is held. as a result, the calling thread will terminate due to
cancellation without ever unlocking the targeted thread's killlock,
and thus the targeted thread will be unable to exit.
pthread_kill is required to be AS-safe. that requirement can't be met
if the target thread's killlock can be taken in contexts where
application-installed signal handlers can run.
block signals around use of this lock in all pthread_* functions which
target a tid, and reorder blocking/unblocking of signals in
pthread_exit so that they're blocked whenever the killlock is held.
this broke mallocng size_to_class on archs without a native
implementation of a_clz_32. the incorrect logic seems to have been
something i derived from a related but distinct log2-type operation.
with the change made here, it passes an exhaustive test.
as this function is new and presently only used by mallocng, no other
functionality was affected.
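one way to write a portable count-leading-zeros, shown only as a sketch
and not as the code in the tree, is a binary search over the high bits.
clz32_sketch is a hypothetical name, and like the real function it is
assumed to be called only with nonzero x.

```c
#include <stdint.h>

/* number of leading zero bits in a nonzero 32-bit value */
static int clz32_sketch(uint32_t x)
{
	int n = 0;
	if (!(x & 0xffff0000u)) { n += 16; x <<= 16; }
	if (!(x & 0xff000000u)) { n += 8;  x <<= 8;  }
	if (!(x & 0xf0000000u)) { n += 4;  x <<= 4;  }
	if (!(x & 0xc0000000u)) { n += 2;  x <<= 2;  }
	if (!(x & 0x80000000u)) { n += 1; }
	return n;
}
```

a function this small is cheap to check exhaustively, or at least at
every power of two and its neighbors.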
the intent here is to keep oldmalloc as an option, at least for the
short term, in case any users are negatively impacted in some way by
mallocng and need to fall back until their issues are resolved.
the files added come from the mallocng development repo, commit
2ed58817cca5bc055974e5a0e43c280d106e696b. they comprise a new malloc
implementation, developed over the past 9 months, to replace the old
allocator (since dubbed "oldmalloc") with one that retains low code
size and minimal baseline memory overhead while avoiding fundamental
flaws in oldmalloc and making significant enhancements. these include
highly controlled fragmentation, fine-grained ability to return memory
to the system when freed, and strong hardening against dynamic memory
usage errors by the caller.
internally, mallocng derives most of these properties from tightly
structuring memory, creating space for allocations as uniform-sized
slots within individually mmapped (and individually freeable)
allocation groups. smaller-than-pagesize groups are created within
slots of larger ones. minimal group size is very small, and larger
sizes (in geometric progression) only come into play when usage is
high.
all data necessary for maintaining consistency of the allocator state
is tracked in out-of-band metadata, reachable via a validated path
from minimal in-band metadata. all pointers passed (to free, etc.) are
validated before any stores to memory take place. early reuse of freed
slots is avoided via approximate LRU order of freed slots. further
hardening against use-after-free and double-free, even in the case
where the freed slot has been reused, is made by cycling the offset
within the slot at which the allocation is placed; this is possible
whenever the slot size is larger than the requested allocation.
this includes both an implementation of reclaimed-gap donation from
ldso and a version of mallocng's glue.h with namespace-safe linkage to
underlying syscalls, integration with AT_RANDOM initialization, and
internal locking that's optimized out when the process is
single-threaded.
these are based on the ARM optimized-routines repository v20.05
(ef907c7a799a), with macro dependencies flattened out and memmove code
removed from memcpy. this change is somewhat unfortunate since having
the branch for memmove support in the large n case of memcpy is the
performance-optimal and size-optimal way to do both, but it makes
memcpy alone (static-linked) about 40% larger and suggests a policy
that use of memcpy as memmove is supported.
tabs used for alignment have also been replaced with spaces.
the child is single-threaded, but may still need to synchronize with
last changes made to memory by another thread in the parent, so set
need_locks to -1 whereby the next lock-taker will drop to 0 and
prevent further barriers/locking.
otherwise, shrink in-place. as explained in the description of commit
3e16313f8f, the split here is valid
without holding split_merge_lock because all chunks involved are in
the in-use state.
commit 3e16313f8f introduced this bug by
making the copy case reachable with n (new size) smaller than n0
(original size). this was left as the only way of shrinking an
allocation because it reduces fragmentation if a free chunk of the
appropriate size is available. when that's not the case, another
approach may be better, but any such improvement would be independent
of fixing this bug.
access always computes result with real ids not effective ones, so it
is not a valid means of determining whether the directory is readable.
instead, attempt to open it before reporting whether it's readable,
and then use fdopendir rather than opendir to open and read the
entries.
effort is made here to keep fd_limit behavior the same as before even
if it was not correct.
some archs already have a_clz_32, used to provide a_ctz_32, but it
hasn't been mandatory because it's not used anywhere yet. mallocng
will need it, however, so add it now. it should probably be optimized
better, but doesn't seem to make a difference at present.
if both malloc and aligned_alloc have been replaced but the internal
aligned_alloc still gets called, the replacement is a wrapper of some
sort. it's not clear if this usage should be officially supported, but
it's at least a plausibly interesting debugging usage, and easy to do.
it should not be relied upon unless it's documented as supported at
some later time.
a new weak predicate function replaceable by the malloc implementation,
__malloc_allzerop, is introduced. by default it's always false; the
default version will be used when static linking if the bump allocator
was used (in which case performance doesn't matter) or if malloc was
replaced by the application. only if the real internal malloc is
linked (always the case with dynamic linking) does the real version
get used.
if malloc was replaced dynamically, as indicated by __malloc_replaced,
the predicate function is ignored and conditional-memset is always
performed.
abstractly, calloc is completely malloc-implementation-independent;
it's malloc followed by memset, or as we do it, a "conditional memset"
that avoids touching fresh zero pages.
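the idea of a conditional memset can be sketched as follows. this is a
simplified illustration, not the committed code: conditional_clear and
SKETCH_PAGE are hypothetical names, and the blocks here are counted from
the start of the buffer rather than aligned to real page boundaries.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE 4096

/* returns 1 if all n bytes at p are already zero */
static int block_is_zero(const unsigned char *p, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (p[i]) return 0;
	return 1;
}

/* zeroes n bytes at p, but only stores to blocks that contain a
 * nonzero byte, so untouched zero pages stay untouched.
 * returns the number of blocks that had to be written. */
static size_t conditional_clear(unsigned char *p, size_t n)
{
	size_t written = 0;
	while (n) {
		size_t len = n < SKETCH_PAGE ? n : SKETCH_PAGE;
		if (!block_is_zero(p, len)) {
			memset(p, 0, len);
			written++;
		}
		p += len;
		n -= len;
	}
	return written;
}
```

reading a fresh zero page is much cheaper than writing it, since a
store forces the kernel to back the page with private memory.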
previously, calloc was kept separate for the bump allocator, which can
always skip memset, and the version of calloc provided with the full
malloc conditionally skipped the clearing for large direct-mmapped
allocations. the latter is a moderately attractive optimization, and
can be added back if needed. however, further consideration to make it
correct under malloc replacement would be needed.
commit b4b1e10364 documented the
contract for malloc replacement as allowing omission of calloc, and
indeed that worked for dynamic linking, but for static linking it was
possible to get the non-clearing definition from the bump allocator;
if not for that, it would have been a link error trying to pull in
malloc.o.
the conditional-clearing code for the new common calloc is taken from
mal0_clear in oldmalloc, but drops the need to access actual page size
and just uses a fixed value of 4096. this avoids potentially needing
access to global data for the sake of an optimization that at best
marginally helps archs with offensively-large page sizes.
this sets the stage for replacement, and makes it practical to keep
oldmalloc around as a build option for a while if that ends up being
useful.
only the files which are actually part of the implementation are
moved. memalign and posix_memalign are entirely generic. in theory
calloc could be pulled out too, but it's useful to have it tied to the
implementation so as to optimize out unnecessary memset when
implementation details make it possible to know the memory is already
clear.
this change eliminates the internal __memalign function and makes the
memalign and posix_memalign functions completely independent of the
malloc implementation, written portably in terms of aligned_alloc.
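a minimal sketch of what "written portably in terms of aligned_alloc"
can look like. the _sketch names are hypothetical and chosen to avoid
clashing with the real functions; the EINVAL check reflects the POSIX
requirement that the alignment be at least sizeof(void *).

```c
#include <stdlib.h>
#include <errno.h>

/* memalign has no error contract beyond returning a null pointer */
static void *memalign_sketch(size_t align, size_t len)
{
	return aligned_alloc(align, len);
}

/* posix_memalign returns an error code instead of setting errno */
static int posix_memalign_sketch(void **res, size_t align, size_t len)
{
	if (align < sizeof(void *)) return EINVAL;
	void *mem = aligned_alloc(align, len);
	if (!mem) return errno;
	*res = mem;
	return 0;
}
```

neither wrapper needs to know anything about chunk headers or other
allocator internals.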
this was an unfinished draft document present since the initial
check-in, that was never intended to ship in its current form. remove
it as part of reorganizing for replacement of the allocator.
this affects the bump allocator used when static linking in programs
that don't need allocation metadata due to not using realloc, free,
etc.
commit e3bc22f1ef refactored the bump
allocator to share code with __expand_heap, used by malloc, for the
purpose of fixing the case (mainly nommu) where brk doesn't work.
however, the geometric growth behavior of __expand_heap is not
actually well-suited to the bump allocator, and can produce
significant excessive memory usage. in particular, by repeatedly
requesting just over the remaining free space in the current
mmap-allocated area, the total mapped memory will be roughly double
the nominal usage. and since the main user of the no-brk mmap fallback
in the bump allocator is nommu, this excessive usage is not just
virtual address space but physical memory.
in addition, even on systems with brk, having a unified size request
to __expand_heap without knowing whether the brk or mmap backend would
get used made it so the brk could be expanded twice as far as needed.
for example, with malloc(n) and n-1 bytes available before the current
brk, the brk would be expanded by n bytes rounded up to page size,
when expansion by just one page would have sufficed.
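the arithmetic in that example can be checked with a small model. this
is only an illustration of the two sizing policies, not the allocator
code: the function names and the fixed 4096-byte page are assumptions,
and alignment of the returned pointer is ignored.

```c
#include <stddef.h>

#define SKETCH_PAGE 4096

static size_t round_up_page(size_t x)
{
	return (x + SKETCH_PAGE - 1) & ~(size_t)(SKETCH_PAGE - 1);
}

/* old policy: one request size, computed without knowing how much
 * room is left before the current brk */
static size_t growth_unified(size_t n, size_t avail)
{
	(void)avail;
	return round_up_page(n);
}

/* brk-aware policy: only the shortfall needs to be added */
static size_t growth_brk_aware(size_t n, size_t avail)
{
	return n > avail ? round_up_page(n - avail) : 0;
}
```

for n = 5000 with 4999 bytes available, the unified request grows the
brk by two pages where one would do.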
the new implementation computes request size separately for the cases
where brk expansion is being attempted vs using mmap, and also
performs individual mmap of large allocations without moving to a new
bump area and throwing away the rest of the old one. this greatly
reduces the need for geometric area size growth and limits the extent
to which free space at the end of one bump area might be unusable for
future allocations.
as a bonus, the resulting code size is somewhat smaller than the
combined old version plus __expand_heap.
this has been a longstanding issue reported many times over the years,
with it becoming increasingly clear that it could be hit in practice.
under concurrent malloc and free from multiple threads, it's possible
to hit usage patterns where unbounded amounts of new memory are
obtained via brk/mmap despite the total nominal usage being small and
bounded.
the underlying cause is that, as a fundamental consequence of keeping
locking as fine-grained as possible, the state where free has unbinned
an already-free chunk to merge it with a newly-freed one, but has not
yet re-binned the combined chunk, is exposed to other threads. this is
bad even with small chunks, and leads to suboptimal use of memory, but
where it really blows up is where the already-freed chunk in question
is the large free region "at the top of the heap". in this situation,
other threads momentarily see a state of having almost no free memory,
and conclude that they need to obtain more.
as far as I can tell there is no fix for this that does not harm
performance. the fix made here forces all split/merge of free chunks
to take place under a single lock, which also takes the place of the
old free_lock, being held at least momentarily at the time of free to
determine whether there are neighboring free chunks that need merging.
as a consequence, the pretrim, alloc_fwd, and alloc_rev operations no
longer make sense and are deleted. simplified merging now takes place
inline in free (__bin_chunk) and realloc.
as commented in the source, holding the split_merge_lock precludes any
chunk transition from in-use to free state. for the most part, it also
precludes change to chunk header sizes. however, __memalign may still
modify the sizes of an in-use chunk to split it into two in-use
chunks. arguably this should require holding the split_merge_lock, but
that would necessitate refactoring to expose it externally, which is a
mess. and it turns out not to be necessary, at least assuming the
existing sloppy memory model malloc has been using, because if free
(__bin_chunk) or realloc sees any unsynchronized change to the size,
it will also see the in-use bit being set, and thereby can't do
anything with the neighboring chunk that changed size.
coding style warnings enabled by default in clang have long been a
source of spurious questions/bug-reports. since clang provides a -w
that behaves differently from gcc's, and that lets us enable any
warnings we may actually want after turning them all off to start with
a clean slate, use it at configure time if clang is detected.
the design used here relies on the barrier provided by the first lock
operation after the process returns to single-threaded state to
synchronize with actions by the last thread that exited. by storing
the intent to change modes in the same object used to detect whether
locking is needed, it's possible to avoid an extra (possibly costly)
memory load after the lock is taken.
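a rough sketch of this design, using C11 atomics and a spinlock in
place of the real futex-based lock. need_locks_sketch, lock_sketch and
unlock_sketch are hypothetical names; the three states (1 for
multithreaded, -1 for just returned to single-threaded, 0 for no
locking needed) are an assumption about how the single object encodes
the mode.

```c
#include <stdatomic.h>

static atomic_int need_locks_sketch;              /* 1, -1, or 0 */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;  /* stand-in lock */

/* returns 1 if the lock was actually taken, 0 if it was skipped */
static int lock_sketch(void)
{
	int nl = atomic_load_explicit(&need_locks_sketch,
	                              memory_order_relaxed);
	if (!nl) return 0;   /* single-threaded: skip the lock entirely */

	/* acquiring the lock is the barrier that synchronizes with the
	 * last thread that exited */
	while (atomic_flag_test_and_set_explicit(&lock_flag,
	                                         memory_order_acquire))
		;

	/* the value loaded before locking already says whether to
	 * change modes, so no second load is needed here */
	if (nl < 0)
		atomic_store_explicit(&need_locks_sketch, 0,
		                      memory_order_relaxed);
	return 1;
}

static void unlock_sketch(int taken)
{
	if (taken)
		atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}
```

the first lock after the transition pays for one real acquire, and
every later call returns early without touching the lock.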
after all but the last thread exits, the next thread to observe
libc.threads_minus_1==0 and conclude that it can skip locking fails to
synchronize with any changes to memory that were made by the
last-exiting thread. this can produce data races.
on some archs, at least x86, memory synchronization is unlikely to be
a problem; however, with the inline locks in malloc, skipping the lock
also eliminated the compiler barrier, and caused code that needed to
re-check chunk in-use bits after obtaining the lock to reuse a stale
value, possibly from before the process became single-threaded. this
in turn produced corruption of the heap state.
some uses of libc.threads_minus_1 remain, especially for allocation of
new TLS in the dynamic linker; otherwise, it could be removed
entirely. it's made non-volatile to reflect that the remaining
accesses are only made under lock on the thread list.
instead of libc.threads_minus_1, libc.threaded is now used for
skipping locks. the difference is that libc.threaded is permanently
true once an additional thread has been created. this will produce
some performance regression in processes that are mostly
single-threaded but occasionally creating threads. in the future it
may be possible to bring back the full lock-skipping, but more care
needs to be taken to produce a safe design.
since the backend for LOCK() skips locking if single-threaded, it's
unsafe to make the process appear single-threaded before the last use
of lock.
this fixes potential unsynchronized access to a linked list via
__dl_thread_cleanup.
signal 7 is SIGEMT on Linux mips* ABI according to the man pages and
kernel. it's not clear where the wrong name came from but it dates
back to original mips commit.
the internal __res_msend returns 0 on timeout without having obtained
any conclusive answer, but in this case has not filled in meaningful
anslen. res_send wrongly treated that as success, but returned a zero
answer length. any reasonable caller would eventually end up treating
that as an error when attempting to parse/validate it, but it should
just be reported as an error.
alternatively we could return the last-received inconclusive answer
(typically servfail), but doing so would require internal changes in
__res_msend. this may be considered later.
the old logic here likely dates back, at least in inspiration, to
before it was recognized that transient errors must not be allowed to
reflect the contents of successful results and must be reported to the
application.
here, the dns backend for getaddrinfo, when performing a paired query
for v4 and v6 addresses, accepted results for one address family even
if the other timed out. (the __res_msend backend does not propagate
error rcodes back to the caller, but continues to retry until timeout,
so other error conditions were not actually possible.)
this patch moves the checks to take place before answer parsing, and
performs them for each answer rather than only the answer to the first
query. if nxdomain is seen it's assumed to apply to both queries since
that's how dns semantics work.
the AD (authenticated data) bit in outgoing dns queries is defined by
rfc3655 to request that the nameserver report (via the same bit in the
response) whether the result is authenticated by DNSSEC. while all
results returned by a DNSSEC conforming nameserver will be either
authenticated or cryptographically proven to lack DNSSEC protection,
for some applications it's necessary to be able to distinguish these
two cases. in particular, conforming and compatible handling of DANE
(TLSA) records requires enforcing them only in signed zones.
when the AD bit was first defined for queries, there were reports of
compatibility problems with broken firewalls and nameservers dropping
queries with it set. these problems are probably a thing of the past,
and broken nameservers are already unsupported. however, since there
is no use in the AD bit with the netdb.h interfaces, explicitly clear
it in the queries they make. this ensures that, even with broken
setups, the standard functions will work, and at most the res_*
functions break.
unsigned char promotes to int, which can overflow when shifted left by
24 bits or more. this has been reported multiple times but then
forgotten. it's expected to be benign UB, but can trap when built with
explicit overflow catching (ubsan or similar). fix it now.
note that promotion to uint32_t is safe and portable even outside of
the assumptions usually made in musl, since either uint32_t has rank
at least unsigned int, so that no further default promotions happen,
or int is wide enough that the shift can't overflow. this is a
desirable property to have in case someone wants to reuse the code
elsewhere.
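as a minimal sketch of the pattern (load_be32 is a hypothetical helper name, not taken from the patched code), each byte is converted to uint32_t before it is shifted, so the shift by 24 is never evaluated in signed int:

```c
#include <stdint.h>

/* Read a 32-bit big-endian value from four bytes.
 * Converting each byte to uint32_t first keeps the left shifts out of
 * signed int, where shifting a byte >= 0x80 by 24 would overflow. */
static uint32_t load_be32(const unsigned char *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
	     | (uint32_t)p[2] << 8  | (uint32_t)p[3];
}
```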
it's been reported that the vdso clock_gettime64 function on (32-bit)
arm is broken, producing erratic results that grow at a rate far
greater than one reported second per actual elapsed second. the vdso
function seems to have been added sometime between linux 5.4 and 5.6,
so if there's ever been a working version, it was only present for a
very short window.
it's not clear what the eventual upstream kernel solution will be, but
something needs to be done on the libc side so as not to be producing
binaries that seem to work on older/existing/lts kernels (which lack
the function and thus lack the bug) but will break fantastically when
moving to newer kernels.
hopefully vdso support will be added back soon, but with a new symbol
name or version from the kernel to allow continued rejection of broken
ones.
analogous to commit b287cd745c but for
the custom FILE stream type the wcstol and wcstod family use. __toread
could be used here as well, but there's a simple direct fix to make
the buffer pointers initially valid for subtraction, so just do that
to avoid pulling in stdio exit code in programs that don't use stdio.
the sh version of fesetround or'd the new rounding mode onto the
control register without clearing the old rounding mode bits, making
changes sticky. this was the root cause of multiple test failures.
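the difference between or-ing and replacing the mode bits can be sketched as follows. the register layout (rounding mode in the low two bits) and the names RM_MASK and set_rounding_bits are assumptions made for illustration, not the actual sh code:

```c
#include <stdint.h>

/* Assumed layout for this sketch: rounding mode in the low two bits. */
#define RM_MASK 0x3u

/* Return the control word with its rounding-mode field replaced by mode.
 * The old field is cleared first; a bare "ctrl | mode" would leave any
 * previously set mode bits in place. */
static uint32_t set_rounding_bits(uint32_t ctrl, uint32_t mode)
{
	return (ctrl & ~RM_MASK) | (mode & RM_MASK);
}
```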
apparently this function was intended at some point to be used by
strto* family as well, and thus was put in its own file; however, as
far as I can tell, it's only ever been used by vsscanf. move it to the
same file to reduce the number of source files and external symbols.
this idea came up when I thought we might need to zero the UNGET
portion of buf as well, but it seems like a useful improvement even
when that turned out not to be necessary.
shgetc sets up to be able to perform an "unget" operation without the
caller having to remember and pass back the character value, and for
this purpose used a conditional store idiom:
	if (f->rpos[-1] != c) f->rpos[-1] = c
to make it safe to use with non-writable buffers (set up by the
sh_fromstring macro or __string_read with sscanf).
however, validity of this depends on the buffer space at rpos[-1]
being initialized, which is not the case under some conditions
(including at least unbuffered files and fmemopen ones).
whenever data was read "through the buffer", the desired character
value is already in place and does not need to be written. thus,
rather than testing for the absence of the value, we can test for
rpos<=buf, indicating that the last character read could not have come
from the buffer, and thereby that we have a "real" buffer (possibly of
zero length) with writable pushback (UNGET bytes) below it.
as reported/analyzed by Pascal Cuoq, the shlim and shcnt
macros/functions are called by the scanf core (vfscanf) with f->rpos
potentially null (if the FILE is not yet activated for reading at the
time of the call). in this case, they compute differences between a
null pointer (f->rpos) and a non-null one (f->buf), resulting in
undefined behavior.
it's unlikely that any observably wrong behavior occurred in practice,
at least without LTO, due to limits on what's visible to the compiler
from translation unit boundaries, but this has not been checked.
fix is simply ensuring that the FILE is activated for read mode before
entering the main scanf loop, and erroring out early if it can't be.
TZ containing a timezone name with >TZNAME_MAX characters currently
breaks musl's timezone parsing. getname() stops after TZNAME_MAX
characters. getoff() will consume no characters (because the next
character is not a digit) and incorrectly return 0. Then, because
there are remaining alphabetic characters, __daylight == 1, and
dst_off == -3600.
getname() must consume the entire timezone name, even if it will not
fit in d/__tzname, so when it returns, s points to the offset digits.
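A minimal sketch of the intended behavior, handling only the unquoted alphabetic form of the name. NAME_MAX_SKETCH stands in for TZNAME_MAX and getname_sketch is a hypothetical name; this is not the actual getname() code:

```c
#include <ctype.h>
#include <stddef.h>

#define NAME_MAX_SKETCH 6 /* stand-in for TZNAME_MAX */

/* Store at most NAME_MAX_SKETCH alphabetic characters of the name at *p
 * in d (which must hold NAME_MAX_SKETCH+1 bytes), but advance *p past
 * the whole name so the caller finds the offset digits next. */
static void getname_sketch(char *d, const char **p)
{
	size_t i = 0;
	while (isalpha((unsigned char)**p)) {
		if (i < NAME_MAX_SKETCH) d[i++] = **p; /* excess characters are dropped */
		++*p;                                  /* but still consumed */
	}
	d[i] = 0;
}
```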
Commit d9bdfd164 ("fix memccpy to not access buffer past given size")
correctly added a check for 'n' nonzero, but made the pre-existing test
'*s==c' redundant: n!=0 implies *s==c. Remove the unnecessary check.
Reported by Alexey Izbyshev.
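A sketch of the resulting logic, under the assumption that the copy loop has the shape shown here (memccpy_sketch is a hypothetical name): the loop stops either when n reaches zero or when the byte just copied equals c, so a nonzero n after the loop is enough to tell that c was found.

```c
#include <stddef.h>

static void *memccpy_sketch(void *restrict dest, const void *restrict src,
                            int c, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	c = (unsigned char)c;
	/* copy bytes until n runs out or the byte just copied equals c */
	for (; n && (*d = *s) != c; n--, s++, d++);
	/* n != 0 here means the loop stopped on *s == c */
	if (n) return d + 1;
	return 0;
}
```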
Linux defines MAP_SYNC on powerpc and powerpc64 as of commit
22fcea6f85f2 ("mm: move MAP_SYNC to asm-generic/mman-common.h"),
so we can stop undefining it on those architectures.
kernel commit 4693916846269d633a3664586650dbfac2c5562f (first included
in release v4.14) silently fixed a bug whereby the reserved space
(which was later used for high bits of time) in IPC_STAT structures
was left untouched rather than zeroed. this means that a caller that
wants to read the high bits needs to pre-zero the memory.
since it's not clear that these operations are permitted to modify the
destination buffer on failure, use a temp buffer and copy back to the
caller's buffer on success.
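the temp-buffer pattern can be sketched like this. struct info_sketch, the stat_fn callback type and the two demo backends are all hypothetical stand-ins for the real structures and syscall; only the pre-zero and copy-on-success steps are the point:

```c
#include <string.h>

struct info_sketch { long lo; long hi; }; /* hypothetical result structure */

/* Hypothetical backend: returns 0 on success, -1 on failure. Like the old
 * kernels described above, it never writes the 'hi' field. */
typedef int (*stat_fn)(struct info_sketch *out);

static int ok_backend(struct info_sketch *out) { out->lo = 7; return 0; }
static int failing_backend(struct info_sketch *out) { out->lo = 99; return -1; }

static int stat_via_temp(stat_fn fn, struct info_sketch *dest)
{
	struct info_sketch tmp;
	memset(&tmp, 0, sizeof tmp); /* pre-zero so untouched fields read as 0 */
	int r = fn(&tmp);
	if (r == 0) *dest = tmp;     /* caller's buffer changes only on success */
	return r;
}
```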
on all mips variants, Linux did (and maybe still does) have some
syscall return paths that wrongly return both the error flag in r7 and
a negated error code in r2. in particular this happened for at least
some causes of ENOSYS.
add an extra check to only negate the error code if it's positive to
begin with.
bug report and concept for patch by Andreas Dröscher.
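expressed in plain C rather than inline asm, the check might look like the sketch below. syscall_ret_sketch is a hypothetical name, and r2/r7 are passed as ordinary parameters purely for illustration:

```c
/* r2: value register after the syscall.
 * r7: nonzero if the kernel flagged an error. */
static long syscall_ret_sketch(long r2, long r7)
{
	/* normal error path: flag set, positive error code in r2 -> negate.
	 * buggy kernel path: flag set but r2 already negated -> leave as is. */
	if (r7 && r2 > 0) return -r2;
	return r2;
}
```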
commit 4221f154ff added the r7
constraint apparently out of a misunderstanding of the breakage it was
addressing, and did so because the asm was in a shared macro used by
all the __syscallN inline functions. now "+r" is used in the output
section for the forms 4-argument and up, so having it in input is
redundant, and the forms with 0-3 arguments don't need it as an input
at all.
the r2 constraint is kept because without it most gcc versions (seems
to be all prior to 9.x) fail to honor the output register binding for
r2. this seems to be a variant of gcc bug #87733.
both the r7 and r2 input constraints look useless, but the r2 one was
a quiet workaround for gcc bug 87733, which affects all modern
versions prior to 9.x, so it's kept and documented.
exactly revert commit 604f8d3d8b which
was wrong; it caused a major regression on Linux versions prior to
2.6.36. old kernels did not properly preserve r2 across syscall
restart, and instead restarted with the instruction right before
syscall, imposing a contract that the previous instruction must load
r2 from an immediate or a register (or memory) not clobbered by the
syscall.
effectively revert commit ddc7c4f936
which was wrong; it caused a major regression on Linux versions prior
to 2.6.36. old kernels did not properly preserve r2 across syscall
restart, and instead restarted with the instruction right before
syscall, imposing a contract that the previous instruction must load
r2 from an immediate or a register (or memory) not clobbered by the
syscall.
since other changes were made since, including removal of the struct
stat conversion that was replaced by separate struct kstat, this is
not a direct revert, only a functional one.
the "0"(r2) input constraint added back seems useless/erroneous, but
without it most gcc versions (seems to be all prior to 9.x) fail to
honor the output register binding for r2. this seems to be a variant
of gcc bug #87733. further changes should be made later if a better
workaround is found, but this one has been working since 2012. it
seems this issue was encountered but misidentified then, when it
inspired commit 4221f154ff.
this is added for POSIX-future as the outcome of Austin Group issue
599. since it's in the reserved namespace for pthread.h, there are no
namespace considerations for adding it early.
commit 59324c8b09 added __socketcall
analogous to __syscall, returning the negated error rather than
setting errno. use it to simplify the fallback path of socket(),
avoiding extern calls and access to errno.
Author: Rich Felker <dalias@aerifal.cx>
Date: Tue Jul 30 17:51:16 2019 -0400
make __socketcall analogous to __syscall, error-returning
this reverts commit 4ee039f354, which
added the helper as a hack to make vdprintf usable before relocation,
contingent on strong assumptions about the arch and tooling, back when
the dynamic linker did not have a real staged model for
self-relocation. since commit f3ddd17380
this has been unnecessary and the function was just wasting size and
execution time.
The final rounding operation should be done with the correct sign;
otherwise huge results may incorrectly get rounded to or away from
infinity in upward or downward rounding modes.
This affected sinh and sinhf which set the sign on the result after
a potentially overflowing mul. There may be other non-nearest rounding
issues, but this was a known long standing issue with large ulp error
(depending on how ulp is defined near infinity).
The fix should have no effect on sinh and sinhf performance but may
have a tiny effect on cosh and coshf.
Handle when after reduction |y| > pi/4+tiny. This happens in directed
rounding modes because the fast round to int code does not give the
nearest integer. In such cases the reduction may not be symmetric
between x and -x so e.g. cos(x)==cos(-x) may not hold (but polynomial
evaluation is not symmetric either with directed rounding so fixing
that would require more changes with bigger performance impact).
The fix only adds two predictable branches in nearest rounding mode;
a simple microbenchmark does not show a relevant performance
regression in nearest rounding mode.
The code could be improved: e.g. reducing the medium size threshold
such that two step reduction is enough instead of three, and the
single precision case can avoid the issue by doing the round to int
differently, but this fix was kept minimal.
because struct stat is no longer assumed to correspond to the
structure used by the stat-family syscalls, it's not valid to make any
of these syscalls directly using a buffer of type struct stat.
commit 9493892021 moved all logic around
this change for stat-family functions into fstatat.c, making the
others wrappers for it. but a few other direct uses of the syscall
were overlooked. the ones in tmpnam/tempnam are harmless since the
syscalls are just used to test for file existence. however, the uses
in fchmodat and __map_file depend on getting accurate file properties,
and these functions may actually have been broken on one or more mips
variants due to removal of conversion hacks from syscall_arch.h.
as a low-risk fix, simply use struct kstat in place of struct stat in
the affected places.
these did not truncate excess precision in the return value. fixing
them looks like considerable work, and the current C code seems to
outperform them significantly anyway.
long double functions are left in place because they are not subject
to excess precision issues and probably better than the C code.
for functions implemented in C, this is a requirement of C11 (F.6);
strictly speaking that text does not apply to standard library
functions, but it seems to be intended to apply to them, and C2x is
expected to make it a requirement.
failure to drop excess precision is particularly bad for inverse trig
functions, where a value with excess precision can be outside the
range of the function (entire range, or range for a particular
subdomain), breaking reasonable invariants a caller may expect.
this extends commit 5a105f19b5, removing
timer[fd]_settime and timer[fd]_gettime. the timerfd ones are likely
to have been used in software that started using them before it could
rely on libc exposing functions.
under _GNU_SOURCE for namespace cleanliness, analogous to other archs.
the original placement in sys/reg.h seems not to have been motivated;
such a header isn't even present on other implementations.
some nontrivial number of applications have historically performed
direct syscalls for these operations rather than using the public
functions. such usage is invalid now that time_t is 64-bit and these
syscalls no longer match the types they are used with, and it was
already harmful before (by suppressing use of vdso).
since syscall() has no type safety, incorrect usage of these syscalls
can't be caught at compile-time. so, without manually inspecting or
running additional tools to check sources, the risk of such errors
slipping through is high.
this patch renames the syscalls on 32-bit archs to clock_gettime32 and
gettimeofday_time32, so that applications using the original names
will fail to build without being fixed.
note that there are a number of other syscalls that may also be unsafe
to use directly after the time64 switchover, but (1) these are the
main two that seem to be in widespread use, and (2) most of the others
continue to have valid usage with a null timeval/timespec argument, as
the argument is an optional timeout or similar.
_POSIX_VDISABLE is only visible if unistd.h has already been included,
so conditional use of it here makes no sense. the value is always 0
anyway; it does not vary.
This patch adds an explicit cast to the int arguments passed to the
inline asm used in the RISC-V's implementation of `a_cas`, to ensure
that they are properly sign extended to 64 bits. They aren't
automatically sign extended by Clang, and GCC technically also doesn't
guarantee that they will be sign extended.
For Thumb2 compatibility, replace two instances of a single
instruction "orr with a variable shift" with the two instruction
equivalent. Neither of the replacements is in a performance critical
loop.
the bug fixed in commit b82cd6c78d was
mostly masked on arm because __hwcap was zero at the point of the call
from the dynamic linker to __set_thread_area, causing the access to
libc.auxv to be skipped and kuser_helper versions of TLS access and
atomics to be used instead of the armv6 or v7 versions. however, on
kernels with kuser_helper removed for hardening it would crash.
since __set_thread_area potentially uses __hwcap, it must be
initialized before the function is called. move the AT_HWCAP lookup
from stage 3 to stage 2b.
This enables alternative compilers, which may not define __GNUC__,
to implement alloca, which is still fairly widely used.
This is similar to how stdarg.h already works in musl; compilers must
implement __builtin_va_arg, there is no fallback definition.
this change was discussed on the mailing list thread for the linux
uapi v5.3 patches, and submitted as a v2 patch, but overlooked when I
applied the patches much later.
revert commit f291c09ec9 and apply the
v2 as submitted; the net change is just padding.
notes by Szabolcs Nagy follow:
compared to the linux uapi (and glibc) a padding is used instead of
aligned attribute for keeping the layout the same across targets, this
means the alignment of the struct may be different on some targets
(e.g. m68k where uint64_t is 2 byte aligned) but that should not affect
syscalls and this way the abi does not depend on nonstandard extensions.
at least gcc 9 broke execution of DT_INIT/DT_FINI for fdpic archs
(presently only sh) by recognizing that the stores to the
compound-literal function descriptor constructed to call them were
dead stores. there's no way to make a "may_alias function", so instead
launder the descriptor through an asm-statement barrier. in practice
just making the compound literal volatile seemed to have worked too,
but this should be less of a hack and more accurately convey the
semantics of what transformations are not valid.
commit 1c84c99913 moved the call to
__init_tp above the initialization of libc.auxv, inadvertently
breaking archs where __set_thread_area examines auxv for the sake of
determining the TLS/atomic model needed at runtime. this broke armv6
and sh2.
the syscall numbers were reserved in v5.3 but not wired up on mips, see
linux commit 0671c5b84e9e0a6d42d22da9b5d093787ac1c5f3
MIPS: Wire up clone3 syscall
mips application specific isa extensions were previously not exported
in hwcaps so userspace could not apply optimized code at runtime.
linux commit 38dffe1e4dde1d3174fdce09d67370412843ebb5
MIPS: elf_hwcap: Export userspace ASEs
allows waiting on a pidfd; in the future it might allow retrieving the
exit status by a non-parent process, see
linux commit 3695eae5fee0605f316fbaad0b9e3de791d7dfaf
pidfd: add P_PIDFD to waitid()
tcpi_rcv_ooopack for tracking connection quality:
linux commit f9af2dbbfe01def62765a58af7fbc488351893c3
tcp: Add TCP_INFO counter for packets received out-of-order
tcpi_snd_wnd peer window size for diagnosing tcp performance problems:
linux commit 8f7baad7f03543451af27f5380fc816b008aa1f2
tcp: Add snd_wnd to TCP_INFO
per thread prctl commands to relax the syscall abi such that top bits
of user pointers are ignored in the kernel. this allows the use of
those bits by hwasan or by mte to color pointers and memory on aarch64:
linux commit 63f0c60379650d82250f22e4cf4137ef3dc4f43d
arm64: Introduce prctl() options to control the tagged user addresses ABI
These were mainly introduced so android can optimize the memory usage
of unused apps.
MADV_COLD hints that the memory range is currently not needed (unlike
with MADV_FREE the content is not garbage, it needs to be swapped):
linux commit 9c276cc65a58faf98be8e56962745ec99ab87636
mm: introduce MADV_COLD
MADV_PAGEOUT hints that the memory range is not needed for a long time
so it can be reclaimed immediately independently of memory pressure
(unlike with MADV_DONTNEED the content is not garbage):
linux commit 1a4e58cce84ee88129d5d49c064bd2852b481357
mm: introduce MADV_PAGEOUT
the syscall number is reserved on all targets, but it is not wired up
on all targets, see
linux commit 8f6ccf6159aed1f04c6d179f61f6fb2691261e84
Merge tag 'clone3-v5.3' of ... brauner/linux
linux commit 8f3220a806545442f6f26195bc491520f5276e7c
arch: wire-up clone3() syscall
linux commit 7f192e3cd316ba58c88dfa26796cf77789dd9872
fork: add clone3
see
linux commit 7615d9e1780e26e0178c93c55b73309a5dc093d7
arch: wire-up pidfd_open()
linux commit 32fcb426ec001cb6d5a4a195091a8486ea77e2df
pid: add pidfd_open()
ptrace API to get details of the syscall the tracee is blocked in, see
linux commit 201766a20e30f982ccfe36bebfad9602c3ff574a
ptrace: add PTRACE_GET_SYSCALL_INFO request
the align attribute was used to keep the layout the same across targets
e.g. on m68k uint32_t is 2 byte aligned, this helps with compat ptrace.
adding this condition makes the entire convert_ioctl_struct function
and compat_map table statically unreachable, and thereby optimized out
by dead code elimination, on archs where they are not needed.
VIDIOC_OMAP3ISP_STAT_REQ is a device-specific command for the omap3isp
video device. the command number is in a device-private range and
therefore could theoretically be used by other devices too in the
future, but problematic clashes should not be able to arise without
intentional misuse.
This ensures that the musl definition of 'struct iphdr' does not conflict
with the Linux kernel UAPI definition of it.
Some software, e.g. net-tools, will not compile against 5.4 kernel headers
without this patch and the corresponding Linux kernel patch.
since time64 switchover has changed the size and layout of the struct
anyway, take the opportunity to fix it up so that it can be shared
between 32- and 64-bit ABIs on the same system as long as byte order
matches.
the ut_type member is explicitly padded to make up for m68k having
only 2-byte alignment; explicit padding has no effect on other archs.
ut_session is changed from long to int, with endian-matched padding.
this affects 64-bit archs as well, but brings the type into alignment
with glibc's x86_64 struct, so it should not break software, and does
not break on-disk format. the semantic type is int (pid-like) anyway.
the padding produces correct alignment for the ut_tv member on 32-bit
archs that don't naturally align it, so that ABI matches 64-bit.
this type is presently not used anywhere in the ABI between libc and
libc consumers; it's only used between pairs of consumers if a
third-party utmp library using the system utmpx.h is in use.
the elf_prstatus structure is used in core dumps, and the timeval
structures in it are longs matching the elf class, *not* the kernel
"old timeval" for the arch. this means using timeval here for x32 was
always wrong, despite kernel uapi headers and glibc also exposing it
this way, and of course it's wrong for any arch with 64-bit time_t.
rather than just changing the type on affected archs, use a tagless
struct containing long tv_sec and tv_usec members in place of the
timevals. this intentionally breaks use of them as timevals (e.g.
assignment, passing address, etc.) on 64-bit archs as well so that any
usage unsafe for 32-bit archs is caught even in software that only
gets tested on 64-bit archs. from what I could gather, there is not
any software using these members anyway. the only reason they need to
be fixed to begin with is that the only members which are commonly
used, the saved registers, follow the time members and have the wrong
offset if the time members are sized incorrectly.
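a reduced sketch of the idea follows. the struct name, the two time members shown and the first_register stand-in are illustrative only, not the full elf_prstatus definition. because the time members have an anonymous struct type, they can't be assigned to or from a struct timeval, while their size still tracks long:

```c
/* Reduced illustration: two time members followed by a stand-in for the
 * saved registers whose offset depends on the time members' size. */
struct prstatus_times_sketch {
	struct { long tv_sec, tv_usec; } pr_utime;
	struct { long tv_sec, tv_usec; } pr_stime;
	long first_register;
};
```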
commit ae388becb5 accidentally
introduced #define SYSCALL_NO_TLS 1 in mmap.c, which was probably a
stale change left around from unrelated syscall timing measurements.
reverse it.
this commit covers all remaining ioctls I'm aware of that use
time_t-derived types in their interfaces. it may still be incomplete,
and has undergone only minimal testing for a few commands used in
audio playback.
the SNDRV_PCM_IOCTL_SYNC_PTR command is special-cased because, rather
than the whole structure expanding, it has two substructures each
padded to 64 bytes that expand within their own 64-byte reserved zone.
as long as it's the only one of its type, it doesn't really make sense
to make a general framework for it, but the existing table framework
is still used for the substructures in the special-case. one of the
substructures, snd_pcm_mmap_status, has a snd_pcm_uframes_t member
which is not a timestamp but is expanded just like one, to match the
64-bit-arch version of the structure. this is handled just like a
timestamp at offset 8, and is the motivation for the conversions table
holding offsets of individual values to be expanded rather than
timespec/timeval type pairs.
for some of the types, the size to which they expand is dependent on
whether the arch's ABI aligns 8-byte types on 8-byte boundaries.
new_req entries in the table need to reflect this size to get the
right ioctl request number that will match what callers pass, but we
don't have access to the actual structure type definitions here and
duplicating them would be cumbersome. instead, the new_misaligned
macro introduced here constructs an artificial object whose size is
the result of expanding a misaligned timespec/timeval to 64-bit and
imposing the arch's alignment on the result, which can be passed to
the _IO{R,W,WR} macros.
record offsets of individual slots that expand from 32- to 64-bit,
rather than timespec/timeval pairs. this flexibility will be needed
for some ioctls. reduce size of types in table. adjust representation
of offsets to include a count rather than needing -1 padding so that
the table is less ugly and doesn't need large diffs if we increase max
number of slots.
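a rough sketch of what a count-plus-offsets entry can drive. struct conv_sketch and expand_slots are hypothetical and much simpler than the real table and conversion code: the sketch ignores alignment of the widened slots and only converts in one direction.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: how many 32-bit slots widen to 64-bit, their
 * byte offsets in the old layout (ascending), and the old total size. */
struct conv_sketch {
	unsigned char count;
	unsigned char old_offsets[4];
	unsigned char old_size;
};

/* Expand an old-layout buffer into a new-layout buffer, widening each
 * listed slot with sign extension. Returns the size of the new layout.
 * No alignment padding is inserted (a simplification for this sketch). */
static size_t expand_slots(const struct conv_sketch *c,
                           const unsigned char *old, unsigned char *new_)
{
	size_t o = 0, n = 0;
	for (int i = 0; i < c->count; i++) {
		size_t len = c->old_offsets[i] - o;
		memcpy(new_ + n, old + o, len); /* bytes before the slot: verbatim */
		o += len; n += len;
		int32_t v32;
		memcpy(&v32, old + o, sizeof v32);
		int64_t v64 = v32;              /* widen the slot */
		memcpy(new_ + n, &v64, sizeof v64);
		o += sizeof v32; n += sizeof v64;
	}
	memcpy(new_ + n, old + o, c->old_size - o); /* trailing bytes: verbatim */
	return n + (c->old_size - o);
}
```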
with the current set of supported ioctls, this conversion is hardly an
improvement, but it sets the stage for being able to do alsa, v4l2,
ppp, and other ioctls with timespec/timeval-derived types. without
this capability, a lot of functionality users depend on would stop
working with the time64 switchover.
commit b60fdf133c broke the
SIOCGSTAMP[NS] ioctl fallbacks introduced in commit
2e554617e5, as well as use of these
ioctls, by creating a situation where bits/ioctl.h could be included
without __LONG_MAX being visible.
always try the time64 syscall first since we can use its success to
conclude that no conversion is needed (any setsockopt for the
timestamp options would have succeeded without need for fallbacks).
otherwise, we have to remember the original controllen for each
msghdr, requiring O(vlen) space, so vlen must be bounded. linux clamps
it to IOV_MAX for sendmmsg only (not recvmmsg), but doing the same for
recvmmsg is not unreasonable, especially since the limitation will
only apply to old kernels.
we could optimize to avoid trying SYS_recvmmsg_time64 first if all
msghdrs have controllen zero, or support unlimited vlen by looping and
emulating the timeout logic, but I'm not inclined to do complex and
error-prone optimizations on a function that has so many underlying
problems it should really never be used.
the definitions of SO_TIMESTAMP* changed on 32-bit archs in commit
3814333964 to the new versions that
provide 64-bit versions of timeval/timespec structure in control
message payload. socket options, being state attached to the socket
rather than function calls, are not trivial to implement as fallbacks
on ENOSYS, and support for them was initially omitted on the
assumption that the ioctl-based polling alternatives (SIOCGSTAMP*)
could be used instead by applications if setsockopt fails.
unfortunately, it turns out that SO_TIMESTAMP is sufficiently old and
widely supported that a number of applications assume it's available
and treat errors as fatal.
this patch introduces emulation of SO_TIMESTAMP[NS] on pre-time64
kernels by falling back to setting the "_OLD" (time32) versions of the
options if the time64 ones are not recognized, and performing
translation of the SCM_TIMESTAMP[NS] control messages in recvmsg.
since recvmsg does not know whether its caller is legacy time32 code
or time64, it performs translation for any SCM_TIMESTAMP[NS]_OLD
control messages it sees, leaving the original time32 timestamp as-is
(it can't be rewritten in-place anyway, and memmove would be mildly
expensive) and appending the converted time64 control message at the
end of the buffer. legacy time32 callers will see the converted one as
a spurious control message of unknown type; time64 callers running on
pre-time64 kernels will see the original one as a spurious control
message of unknown type. a time64 caller running on a kernel with
native time64 support will only see the time64 version of the control
message.
emulation of SO_TIMESTAMPING is not included at this time since (1)
applications which use it seem to be prepared for the possibility that
it's not present or working, and (2) it can also be used in sendmsg
control messages, in a manner that looks complex to emulate
completely, and costly even when running on a time64-supporting
kernel.
corresponding changes in recvmmsg are not made at this time; they will
be done separately.
linux/input.h and perhaps others use this macro to determine whether
the userspace time_t is 64-bit when potentially defining types in
terms of time_t and derived structures. the name __USE_TIME_BITS64 is
unfortunate; it really should have been in the __UAPI namespace. but
this is what was chosen back in v4.16 when first preparing input.h for
time64 userspace, presumably based on expectations about what the
glibc-internal features.h macro for time64 would be, and changing it
now would just put a new minimum version requirement on kernel
headers.
the __USE_TIME_BITS64 macro is not intended as a public interface. it
is purely an internal contract between libc and Linux uapi headers.
this interface permits a null pointer for where to store the old
itimerval being replaced. an early version of the time32 compat shim
code had corresponding bugs for lots of functions; apparently
setitimer was overlooked when fixing them.
commit 4d3a162d00 overlooked that the
mips64 reloc.h depended on endian.h not only for setting the ABI ldso
name to match the byte order, but also for use of the byte swapping
macros. they are needed to override R_TYPE, R_SYM, and R_INFO, to
compensate for a mips "quirk" of always using big endian order for
symbol references in relocations.
part of that commit cannot be reverted because the original code was
wrong: it's invalid to define _GNU_SOURCE or any feature test macro
in reloc.h, or anywhere except at the top of a source file. however,
thanks to commit 316730cdc7, the feature
test macro is no longer needed to access the endian-swapping macros,
so simply bringing back the #include directive suffices.
commit de90f38e3b omitted $(srcdir) from
the makefile include pathname it added. since the include directive
was prefixed with - to make it optional (for archs that don't use it),
the failure to find arch/$(ARCH)/arch.mak was silent.
in commit 22daaea39f, the
__dlsym_redir_time64 function providing the backend for __dlsym_time64
was defined only in the dynamic linker, and thus was undefined when
static linking a program referencing dlsym. use the same stub_dlsym
definition that provides __dlsym (the non-redirecting backend) for
static linked programs to provide it, conditional on _REDIR_TIME64.
now that all 32-bit archs have 64-bit time_t (and suseconds_t), the
arch-provided _Int64 macro (long or long long, as appropriate) can be
used to define them, and arch-specific definitions are no longer
needed.
now that all 32-bit archs have 64-bit time types, the values for the
time-related ioctls can be shared. the mechanism for this is an
arch/generic version of the bits header. archs which don't use the
generic header still need to duplicate the definitions.
x32, which does not use the new time64 values of the macros, already
has its own overrides, so this commit does not affect it.
now that all 32-bit archs have 64-bit time types, the values for the
time-related socket option macros can be treated as universal for
32-bit archs. the sys/socket.h mechanism for this predates
arch/generic and is instead in the top-level header.
x32, which does not use the new time64 values of the macros, already
has its own overrides, so this commit does not affect it.
this commit preserves ABI fully for existing interface boundaries
between libc and libc consumers (applications or libraries), by
retaining existing symbol names for the legacy 32-bit interfaces and
redirecting sources compiled against the new headers to alternate
symbol names. this does not necessarily, however, preserve the
pairwise ABI of libc consumers with one another; where they use
time_t-derived types in their interfaces with one another, it may be
necessary to synchronize updates with each other.
the intent is that ABI resulting from this commit already be stable
and permanent, but it will not be officially so until a release is
made. changes to some header-defined types that do not play any role
in the ABI between libc and its consumers may still be subject to
change.
mechanically, the changes made by this commit for each 32-bit arch are
as follows:
- _REDIR_TIME64 is defined to activate the symbol redirections in
public headers
- COMPAT_SRC_DIRS is defined in arch.mak to activate build of ABI
compat shims to serve as definitions for the original symbol names
- time_t and suseconds_t definitions are changed to long long (64-bit)
- IPC_STAT definition is changed to add the IPC_TIME64 bit (0x100),
triggering conversion of semid_ds, shmid_ds, and msqid_ds split
low/high time bits into new time_t members
- structs semid_ds, shmid_ds, msqid_ds, and stat are modified to add
new 64-bit time_t/timespec members at the end, maintaining existing
layout of other members.
- socket options (SO_*) and ioctl (sockios) command macros are
redefined to use the kernel's "_NEW" values.
in addition, on archs where vdso clock_gettime is used, the
VDSO_CGT_SYM macro definition in syscall_arch.h is changed to use a
new time64 vdso function if available, and a new VDSO_CGT32_SYM macro
is added for use as fallback on kernels lacking time64.
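as a rough illustration of the IPC_STAT item above, the sketch below
shows how the extra command bit and the split low/high words could fit
together. the DEMO_ names and join_time are hypothetical and are not
the identifiers used in libc; the 0x100 bit comes from the description
above, and 2 is the traditional Linux value of IPC_STAT.

```c
#include <stdint.h>

/* hypothetical names, for illustration only */
#define DEMO_IPC_STAT_OLD 2      /* traditional Linux IPC_STAT command */
#define DEMO_IPC_TIME64   0x100  /* bit requesting the time64 conversion */
#define DEMO_IPC_STAT     (DEMO_IPC_STAT_OLD | DEMO_IPC_TIME64)

/* combine the 32-bit low and high halves into one 64-bit time value */
static long long join_time(uint32_t lo, uint32_t hi)
{
	return (long long)(((uint64_t)hi << 32) | lo);
}
```

with these definitions the command value is 0x102, and
join_time(5, 1) yields 0x100000005.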
these definitions are copied from generic bits/ioctl.h, so that x32
keeps the "_OLD" versions (which are already time64 on x32) when
32-bit archs switch to 64-bit time_t.
these definitions are merely copied from the top-level sys/socket.h,
so there is no functional change at this time. however, the top-level
definitions will change to use the time64 "_NEW" versions on 32-bit
archs when time_t is switched over to 64-bit. this commit ensures that
change will be suppressed on x32.
these structures can now be defined generically in terms of endianness
and long size. previously, the 32-bit archs all shared a common
definition from the generic bits header, and each 64-bit arch had to
repeat the 64-bit version, with endian conditionals if the arch had
variants of each endianness.
I would prefer getting rid of the preprocessor conditionals for
padding and instead using unnamed bitfield members, like commit
9b2921bea1 did for struct timespec.
however, at present sendmsg, recvmsg, and recvmmsg need access to the
padding members by name to zero them. this could perhaps be cleaned up
in the future.
being that it contains pointers and (from the kernel perspective,
which is wrong) size_t members, x32 uses the 32-bit version of the
structure, not a half-32-bit, half-64-bit layout like we had here. the
x86_64 definition was inadvertently copied when x32 was first added.
unlike errors in the opposite direction (missing padding), this error
did not cause easily detected breakage, because the layout of the commonly
used initial subset of members still matched. breakage could only be
observed in the presence of control messages or flags.
SO_RCVTIMEO and SO_SNDTIMEO already were, but only in aggregate with
SO_DEBUG and all of the other low/traditional options that varied per
arch. SO_TIMESTAMP* are newly overridable. the two groups have to be
done separately since mips64 and powerpc64 will override the former
but not the latter.
at some point this should be cleaned up to use bits headers more
idiomatically.
the immediate use case for this is to let 32-bit archs moving to
64-bit time_t via symbol redirection pull in wrapper shims that
provide the old symbol names. in the future it may be used for other
types of compatibility-only source files that are not relevant to all
archs.
if symbols are being redirected to provide the new time64 ABI, dlsym
must perform matching redirections; otherwise, it would poke a hole in
the magic and return pointers to functions that are not safe to call
from a caller using time64 types.
rather than duplicating a table of redirections, use the time64
symbols present in libc's symbol table to derive the decision for
whether a particular symbol needs to be redirected.
these files provide the symbols for the traditional 32-bit time_t ABI
on existing 32-bit archs by wrapping the real, internal versions of
the corresponding functions, which always work with 64-bit time_t.
they are written to be as agnostic as possible to the implementation
details of the real functions, so that they can be written once and
mostly forgotten, but they are aware of details of the old (and
sometimes new) ABI, which is okay since ABI is fixed and cannot
change.
a new compat tree is added, separate from src, which the Makefile does
not see or use now, but which archs will be able to add to the build
process. we could also consider moving other things that are compat
shims here, like functions which are purely for glibc-ABI-compat, with
the goal of making it optional or just cleaning up the main src tree
to make the distinction between actual implementation/API files and
ABI-compat shims clear.
here _REDIR_TIME64 is used as an indication that there's an old ABI,
and thereby the old time32 timespec fields of struct stat.
keeping struct stat compatible and providing both versions of the
timespec fields is done so that ftw/nftw does not need painful compat
shims, and (more importantly) so that similar interfaces between pairs
of libc consumers (applications/libraries) will be less likely to
break when one has been rebuilt for time64 but the other has not.
these functions cannot provide the glibc lfs64-ABI-compatible symbols
when time_t differs from what it was in that ABI. instead, the aliases
need to be provided by the time32 compat shims or through some other
mechanism.
the time_t members in struct sched_param are just reserved space to
preserve size and alignment. when time_t changes to 64-bit on 32-bit
archs, this structure should not change.
make definition conditional on _REDIR_TIME64 to match the size of the
old time_t, which can be assumed to be long if _REDIR_TIME64 is
defined.
a _REDIR_TIME64 macro is introduced, which the arch's alltypes.h is
expected to define, to control redirection of symbol names for
interfaces that involve time_t and derived types. this ensures that
object files will only be linked to libc interfaces matching the ABI
whose headers they were compiled against.
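a minimal sketch of the kind of redirection described above, assuming
a compiler that supports the GNU C asm-label extension. all names here
(demo_clock, demo_impl_time64, the symbol demo_clock_time64, and
DEMO_REDIR_TIME64) are hypothetical stand-ins, not the real header
contents.

```c
/* "libc side" (hypothetical): the time64 implementation is exported
 * under a new symbol name */
int demo_impl_time64(void) __asm__("demo_clock_time64");
int demo_impl_time64(void) { return 64; }

/* "header side" (hypothetical): when the redirection macro is on, the
 * source-level name demo_clock binds to the new symbol rather than to
 * a legacy symbol named demo_clock */
#define DEMO_REDIR_TIME64 1
#if DEMO_REDIR_TIME64
int demo_clock(void) __asm__("demo_clock_time64");
#else
int demo_clock(void);
#endif
```

a caller compiled against this "header" writes demo_clock() but links
against demo_clock_time64, so it cannot accidentally reach a function
built for the other ABI.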
along with time32 compat shims, which will be introduced separately,
the redirection also makes it possible for a single libc (static or
shared) to be used with object files produced with either the old
(32-bit time_t) headers or the new ones after 64-bit time_t switchover
takes place. mixing of such object files (or shared libraries) in the
same program will also be possible, but must be done with care; ABI
between libc and a consumer of the libc interfaces is guaranteed to
match by the symbol name redirection, but pairwise ABI between
consumers of libc that define interfaces between each other in terms
of time_t is not guaranteed to match.
this change adds a dependency on an additional "GNU C" feature to the
public headers for existing 32-bit archs, which is generally
undesirable; however, the feature is one which glibc has depended on
for a long time, and thus which any viable alternative compiler is
going to need to provide. 64-bit archs are not affected, nor will
future 32-bit archs be, regardless of whether they are "new" on the
kernel side (e.g. riscv32) or just newly-added (e.g. a new sparc or
xtensa port). the same applies to newly-added ABIs for existing
machine-level archs.
the existing implementation of case mappings was very small (typically
around 1.5k), but unmaintainable, requiring manual addition of new
case mappings with each new edition of Unicode. often, it turned out
that newly-added case mappings were not easily representable in the
existing tightly-constrained table structures, requiring new hacks to
be invented and delaying support for new characters.
the new implementation added here follows the pattern used for
character class membership, with a two-level table allowing Unicode
blocks for which no data is needed to be elided. however, rather than
single-bit data, each character maps to one of up to 6 case-mapping
rules available to its block, where 6 is floor(cbrt(256)) and allows 3
characters to be represented per byte (vs 8 with bit tables). blocks
that would need more than 6 rules designate one as an exception and
let lookup pass into a binary search of exceptional cases for the
block.
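to make the packing concrete: 6*6*6 = 216 fits in a byte, so three
rule indices in the range 0..5 can be stored as base-6 digits. the
sketch below uses plain division; pack3 and rule_at are hypothetical
helpers, and this is not claimed to be the decoding arithmetic the
actual implementation uses.

```c
/* pack three rule indices (each 0..5) into one byte as base-6 digits */
static unsigned char pack3(unsigned r0, unsigned r1, unsigned r2)
{
	return (unsigned char)(r0 + 6*r1 + 36*r2);
}

/* extract the rule index at position pos (0..2) from a packed byte */
static unsigned rule_at(unsigned char packed, unsigned pos)
{
	static const unsigned char place[3] = { 1, 6, 36 };
	return packed / place[pos] % 6;
}
```

the largest packed value is pack3(5, 5, 5) = 215, which is why 6 rules
is the most that allows 3 characters per byte.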
the number 6 was chosen empirically; many blocks would be ok with 4
rules (uncased, lower, upper, possible exceptions), some even just
with 2, but the latter are rare and fitting 4 characters per byte
rather than 3 does not save significant space. moreover, somewhat
surprisingly, there are sufficiently many blocks where even 4 rules
don't suffice without a lot of exceptions (blocks where some case
pairs are laced, others offset) that originally I was looking at
supporting variable-width tables, with 1-, 2-, or 3-bit entries,
thereby allowing blocks with 8 rules. as implemented in my
experiments, that version was significantly larger and involved more
memory accesses/cache lines.
improvements in size at the expense of some performance might be
possible by utilizing iswalpha data or merging the table of case
mapping identity with alphabetic identity. these were explored
somewhat when the code was first written, and might be worth
revisiting in the future.
somehow this seems to have been overlooked. add it now so that
subsequent overhaul of case mapping implementation will not introduce
a functional change at the same time.
for time64 support on 32-bit archs, the kernel interfaces use a
timespec layout padded to match the representation of a pair of 64-bit
values, which requires endian-specific padding.
use of an ordinary, non-bitfield, named member for the padding is
undesirable because, on big endian archs, it would alter the
interpretation of traditional (non-designated) initializers of the
form {s,ns}, initializing the padding instead of the tv_nsec member.
unnamed bitfield members solve this problem by not taking part in
initialization, and were the expected solution when the kernel
interfaces were designed. however, they also have further benefits
which we take advantage of here:
positioning of the padding could be controlled by having a
preprocessor conditional with separate definitions of struct timespec
for little and big endian, but whether padding should appear at all is
a function of whether time_t is larger than long. this condition is
not something the preprocessor can determine unless we were to define
a new macro specifically for that purpose.
by using unnamed bitfield members instead of ordinary named members,
we can arrange for the size of the padding to collapse to zero when it
should not be present, just by using sizeof(time_t) and sizeof(long)
in the bitfield width expression, which can be any integer constant
expression.
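the technique can be sketched as follows. the demo_ types are
hypothetical; demo_long is fixed at 32 bits so that the padding is
visible even on a 64-bit host, and the endianness test relies on the
compiler-predefined __BYTE_ORDER__ macro, which is an assumption about
the toolchain.

```c
#include <stdint.h>

typedef int64_t demo_time_t;
typedef int32_t demo_long;   /* stands in for a 32-bit long */

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define DEMO_BIG_ENDIAN 1
#else
#define DEMO_BIG_ENDIAN 0
#endif

struct demo_timespec {
	demo_time_t tv_sec;
	/* padding before tv_nsec on big endian; width 0 otherwise */
	int :8*(sizeof(demo_time_t)-sizeof(demo_long))*DEMO_BIG_ENDIAN;
	demo_long tv_nsec;
	/* padding after tv_nsec on little endian; width 0 otherwise */
	int :8*(sizeof(demo_time_t)-sizeof(demo_long))*!DEMO_BIG_ENDIAN;
};

/* positional initializer: unnamed bitfields are skipped, so the
 * second value always lands in tv_nsec */
static struct demo_timespec demo_make(demo_time_t s, demo_long ns)
{
	struct demo_timespec ts = { s, ns };
	return ts;
}
```

if demo_long were the same size as demo_time_t, both width expressions
would evaluate to 0 and no padding would be inserted.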
commit 2b4fd6f75b added time64 for this
function, but did so with a hidden assumption that the new time64
version of struct timex will be layout-compatible with the old one.
however, there is little benefit to doing it that way, and the cost is
permanent special-casing of 32-bit archs with 64-bit time_t in the
public interface definitions.
instead, do a full translation of the structure going in and out. this
commit is actually a revision to an earlier uncommitted version of the
code.
presently the kernel does not actually define time64 versions of these
syscalls, and they're not really needed except to represent extreme
cpu time usage. however, x32's versions of the syscalls already behave
as time64 ones, meaning the functions were broken on x32 if the caller
used any part of the rusage result other than ru_utime and ru_stime.
commit 7e81711431 made it possible to
fix this by treating x32's syscalls as time64 versions.
in the non-time64-syscall case, make the syscall with the rusage
destination pointer adjusted so that all members but the timevals line
up between the libc and kernel structures. on 64-bit archs, or present
32-bit archs with 32-bit time_t, the timevals will line up too and no
further work is needed. for future 32-bit archs with 64-bit time_t,
the timevals are copied into place, contingent on time_t being larger
than long.
this is analogous to commit 40aa18d55a.
so far, there are not any actual time64 versions of the rusage
syscalls (getrusage and wait4), and there might never be. however, the
existing x32 ones behave the way time64 versions would if they
existed: using 64-bit slots in place of all longs.
presently, wait4 and getrusage are broken on x32, storing the timevals
correctly but messing up everything else due to the long/kernel-long
mismatch. this would be a huge buffer overflow if not for the 16
reserved slots we left long ago, which suffice to prevent 14
double-sized longs from overflowing into unrelated memory. this commit
will make it possible to fix them.
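the arithmetic behind that claim, as a small sketch. the counts 14 and
16 come from the text above; the 4-byte userspace long and 8-byte
kernel slot are the x32 sizes. the names are hypothetical.

```c
enum {
	DEMO_LONG_MEMBERS   = 14, /* long members following the timevals */
	DEMO_RESERVED_SLOTS = 16, /* reserved longs following those */
	DEMO_USER_LONG      = 4,  /* sizeof(long) in x32 userspace */
	DEMO_KERNEL_LONG    = 8   /* size of each slot the kernel writes */
};

/* bytes the kernel writes after the two timevals */
static int bytes_written(void)
{
	return DEMO_LONG_MEMBERS * DEMO_KERNEL_LONG;
}

/* bytes the libc structure provides after the two timevals */
static int bytes_available(void)
{
	return (DEMO_LONG_MEMBERS + DEMO_RESERVED_SLOTS) * DEMO_USER_LONG;
}
```

112 bytes are written into 120 bytes of space, so the write stays
inside the structure even though the member values end up misplaced.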
this is to match the kernel and glibc interfaces. here, struct pt_regs
is an incomplete type, but that's harmless, and if it's completed by
inclusion of another header then members of the struct pointed to by
the regs member can be accessed directly without going through a cast
or intermediate pointer object.
the userspace ucontext API has this as an array rather than a
structure.
commit 3c59a86895 fixed the
corresponding mistake for vrregset_t, namely that the original
powerpc64 port used a mix of types from 32-bit powerpc and powerpc64
rather than matching the 64-bit types.
aside from the special value EOF, ungetc is specified to accept and
convert values outside the range of unsigned char. conversion takes
place automatically as part of assignment when storing into the
buffer, but the return value is also required to be the resulting
converted value, and this requirement was not satisfied.
simplified from patch by Wang Jianjian.
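a sketch of the required return value, written as a standalone helper
rather than as the actual ungetc code. demo_ungetc_result is a
hypothetical name.

```c
#include <stdio.h>

/* the value ungetc is specified to return when pushing back c:
 * EOF is passed through, anything else is converted to unsigned char */
static int demo_ungetc_result(int c)
{
	if (c == EOF) return EOF;
	return (unsigned char)c;
}
```

for example, an argument of 0x141 must yield 0x41, not 0x141.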
based on patch by Dan Gohman, who caught this via compiler warnings.
analysis by Szabolcs Nagy determined that it's a bug, whereby errno
can be set incorrectly for values where the coercion from long double
to double causes rounding. it seems likely that floating point status
flags may be set incorrectly as a result too.
at the same time, clean up use of preprocessor concatenation involving
LDBL_MANT_DIG, which spuriously depends on it being a single unadorned
decimal integer literal, and instead use the equivalent formulation
2/LDBL_EPSILON. an equivalent change on the printf side was made in
commit bff6095d91.
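the equivalence rests on LDBL_EPSILON being 2^(1-LDBL_MANT_DIG) for a
radix-2 format, so 2/LDBL_EPSILON is exactly 2^LDBL_MANT_DIG. the
sketch below checks this without token pasting; demo_pow2 is a
hypothetical helper that avoids depending on libm.

```c
#include <float.h>

/* compute 2^n by repeated doubling (exact for the range used here) */
static long double demo_pow2(int n)
{
	long double p = 1;
	while (n-- > 0) p *= 2;
	return p;
}
```

unlike a pasted literal, the expression 2/LDBL_EPSILON keeps working
if LDBL_MANT_DIG is defined as a parenthesized or computed expression.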
policy has long been that these definitions are purely a function of
whether long/pointer is 32- or 64-bit, and that they are not allowed
to vary per-arch. move the definition to the shared alltypes.h.in
fragment, using integer constant expressions in terms of sizeof to
vary the array dimensions appropriately. I'm not sure whether this is
more or less ugly than using preprocessor conditionals and two sets of
definitions here, but either way is a lot less ugly than repeating the
same thing for every arch.
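a sketch of what sizeof-based array dimensions can look like.
demo_opaque_t is a hypothetical type and the element counts are
illustrative only; the example also assumes a 4-byte int.

```c
/* one definition serves both 32-bit and 64-bit long: the conditional
 * expressions are integer constant expressions, so they are valid
 * array dimensions */
typedef struct {
	union {
		int  i[sizeof(long) == 8 ? 14 : 9];
		long l[sizeof(long) == 8 ? 7 : 9];
	} u;
} demo_opaque_t;
```

the union member of type long gives the structure the alignment of
long without a separate per-arch definition.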
LLONG_MAX is uniform for all archs we support and plenty of header and
code level logic assumes it is, so it does not make sense for the
limits.h bits mechanism to pretend it's variable.
LONG_BIT can be defined in terms of LONG_MAX; there's no reason to put
it in bits.
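one way to derive it, sketched with a hypothetical DEMO_LONG_BIT macro
and assuming long is either 32 or 64 bits wide:

```c
#include <limits.h>

/* LONG_MAX is usable in #if, so the width can be chosen there */
#if LONG_MAX == 0x7fffffffL
#define DEMO_LONG_BIT 32
#else
#define DEMO_LONG_BIT 64
#endif
```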
by moving LONG_MAX definition to __LONG_MAX in alltypes.h and moving
LLONG_MAX out of bits, there are now no plain-C limits that are
defined in the bits header, so the bits header only needs to be
included in the POSIX or extended profiles. this allows the feature
test macro logic to be removed from the bits header, facilitating a
long-term goal of getting such logic out of bits.
having __LONG_MAX in alltypes.h will allow further generalization of
headers.
archs without a constant PAGESIZE no longer need bits/limits.h at all.
the resolution of Austin Group issue #162 adds endian.h as a standard
header for future versions of the standard, making it no longer
acceptable for some of the functionality to be hidden behind
_BSD_SOURCE or _GNU_SOURCE. the definitions of the [lb]etoh{16,32,64}
function-like macros are kept conditional since they are alternate
names which the standard did not adopt.
building on commit 97d35a552e,
__BYTE_ORDER is now available wherever alltypes.h is included. since
reloc.h is only used from src/internal/dynlink.h, it can be assumed
that __BYTE_ORDER is exposed. reloc.h is not permitted to be included
in other contexts, and generally, like most arch headers, lacks
inclusion guards that would allow such usage. the mips64 version
mistakenly included such guards; they are removed for consistency.
building on commit 97d35a552e,
__BYTE_ORDER is now available wherever alltypes.h is included.
endian.h should not be used since, in the future, it will expose
identifiers that are not in the reserved namespace for the headers
which were previously using it.
this change is motivated by the intersection of several factors.
presently, despite being a nonstandard header, endian.h is exposing
the unprefixed byte order macros and functions only if _BSD_SOURCE or
_GNU_SOURCE is defined. this is to accommodate use of endian.h from
other headers, including bits headers, which need to define structure
layout in terms of endianness. with time64 switch-over, even more
headers will need to do this.
at the same time, the resolution of Austin Group issue 162 makes
endian.h a standard header for POSIX-future, requiring that it expose
the unprefixed macros and the functions even in standards-conforming
profiles. changes to meet this new requirement would break existing
internal usage of endian.h by causing it to violate namespace where
it's used.
instead, have the arch's alltypes.h define __BYTE_ORDER, either as a
fixed constant or depending on the right arch-specific predefined
macros for determining endianness. explicit literals 1234 and 4321 are
used instead of __LITTLE_ENDIAN and __BIG_ENDIAN so that there's no
danger of getting the wrong result if a macro is undefined and
implicitly evaluates to 0 at the preprocessor level.
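the hazard being avoided can be reproduced with a few lines of
preprocessor code. all DEMO_ macros are hypothetical; DEMO_BIG and
DEMO_LITTLE are deliberately never defined, as would be the case if
the header providing them had not been included.

```c
#define DEMO_ORDER_SYM DEMO_BIG /* symbolic definition, names undefined */
#define DEMO_ORDER_LIT 4321     /* literal definition */

/* both sides expand to undefined identifiers, which #if treats as 0,
 * so a big endian target is silently taken for little endian */
#if DEMO_ORDER_SYM == DEMO_LITTLE
#define DEMO_SYM_LOOKS_LITTLE 1
#else
#define DEMO_SYM_LOOKS_LITTLE 0
#endif

/* with literals on both sides nothing can collapse to 0 */
#if DEMO_ORDER_LIT == 1234
#define DEMO_LIT_LOOKS_LITTLE 1
#else
#define DEMO_LIT_LOOKS_LITTLE 0
#endif
```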
the powerpc (32-bit) bits/endian.h being removed had logic for varying
endianness, but our powerpc arch has never supported that and has
always been big-endian-only. this logic is not carried over to the new
__BYTE_ORDER definition in alltypes.h.
now that commit f7f1079796 removed the
legacy i386 conditional definition, va_list is in no way
arch-specific, and has no reason to be in the future. move it to the
shared part of alltypes.h.in
commit ffaaa6d230 removed the
corresponding stdarg.h support for compilers without va_list builtins,
but failed to remove the alternate type definition, leaving incorrect
va_list definitions in place with compilers that don't define __GNUC__
with a value >= 3.
SQRT.fmt exists on MIPS II+ (float), MIPS III+ (double).
ABS.fmt exists on MIPS I+ but only cores with ABS2008 flag in FCSR
implement the required behaviour.
Both sqrt and sqrtf shifted the signed exponent as signed int to adjust
the bit representation of the result. There are signed right shifts too
in the code but those are implementation defined and are expected to
compile to arithmetic shift on supported compilers and targets.
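Left-shifting a negative signed int is undefined behaviour in C. A
usual remedy, assumed here for illustration, is to convert to an
unsigned type before shifting. demo_shift_exp is a hypothetical helper
and the shift count 23 is simply the float exponent position.

```c
#include <stdint.h>

/* move a possibly negative exponent adjustment into the exponent
 * field of a 32-bit float representation; the conversion to uint32_t
 * is well defined (modulo 2^32), and so is the unsigned shift */
static uint32_t demo_shift_exp(int e)
{
	return (uint32_t)e << 23;
}
```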
mbsrtowcs contains "vectorized" loops to quickly step over bytes
without the high bit set; these have undefined behavior by virtue of
aliasing uint32_t over top of char data for the accesses.
commit 4d0a82170a fixed the
corresponding usage in string functions by using the may_alias
attribute conditional on __GNUC__ and disabled the vectorized code in
its absence. do the same for mbsrtowcs.
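a simplified sketch of the pattern, not the mbsrtowcs code itself:
the word-at-a-time loop is compiled only when the may_alias attribute
is available, and a bytewise loop handles everything else.
demo_ascii_prefix and demo_w32 are hypothetical names, and unlike the
real function this version takes an explicit length.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __GNUC__
/* accesses through this type are allowed to alias char data */
typedef uint32_t __attribute__((__may_alias__)) demo_w32;
#define DEMO_FAST_PATH 1
#else
#define DEMO_FAST_PATH 0
#endif

/* return the number of leading bytes in s[0..n) without the high bit */
size_t demo_ascii_prefix(const char *s, size_t n)
{
	size_t i = 0;
#if DEMO_FAST_PATH
	if ((uintptr_t)s % 4 == 0) {
		/* step over 4 bytes at a time while none has the high bit */
		while (n - i >= 4 && !(*(const demo_w32 *)(s + i) & 0x80808080u))
			i += 4;
	}
#endif
	/* bytewise tail, and the only path without the attribute */
	while (i < n && !((unsigned char)s[i] & 0x80)) i++;
	return i;
}
```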
Several math functions are now from the ARM optimized-routines repo
licensed under standard MIT terms and copyrighted by Arm Limited,
so mention this in the COPYRIGHT too.
commit ab3eb89a8b removed it as part of
correcting the mcontext_t definition, but there is still code using
struct sigcontext and expecting the member names present in it, most
notably libgcc_eh. almost all such usage is incorrect, but bring back
struct sigcontext at least for now so as not to introduce regressions.
in order for sys/procfs.h (provided by sys/user.h) to be useful, it
needs to match the API its consumers (gdb, etc.) expect, including the
member names established by glibc.
this partly reverts commit 29e8737f81,
which partly reverted d493206de7,
eliminating struct user_fpregs_struct which seems to have had no
precedent and using union __riscv_mc_fp_state for elf_fpregset_t. this
requires indirect inclusion of signal.h to make union
__riscv_mc_fp_state visible, but being that these are nonstandard
"junk" headers with no official restrictions on what they can pull in,
that's no big deal.
split off and expanded from patch by Khem Raj.
the top-level mcontext_t member names were namespace-violating in
standards profiles before, and nested-level member names (some of them
single-letter) were egregiously bad namespace impositions even in
non-strict profiles. moreover, they mismatched those used in the
public API first defined in glibc, breaking any code making use of
them.
unlike most archs, the public API used in glibc for riscv mcontext_t
members was designed to be namespace-safe, so we can and should expose
the members regardless of feature test macros. only the typedefs for
greg_t, gregset_t, and fpregset_t need to be protected behind FTMs.
the struct tags for mcontext_t and ucontext_t are also changed. for
mcontext_t this is necessary to make the common definition across
profiles namespace-safe. for ucontext_t, it's just a matter of
matching the tag from the glibc-defined API.
these changes are split off and expanded from a patch by Khem Raj.
Some declarations of __tls_get_new were left in the code, even
though the definition got removed in
commit 9d44b6460a
install dynamic tls synchronously at dlopen, streamline access
this can make the build fail with
ld: lib/libc.so: hidden symbol `__tls_get_new' isn't defined
when libc.so is linked without --gc-sections, because a .hidden
declaration in asm code creates a reference even if the symbol
is not actually used.
lrint in (LONG_MAX, 1/DBL_EPSILON) and in (-1/DBL_EPSILON, LONG_MIN)
is not trivial: rounding to int may be inexact, but the conversion to
int may overflow and then the inexact flag must not be raised. (the
overflow threshold is rounding mode dependent).
this matters on 32bit targets (without single instruction lrint or
rint), so the common case (when there is no overflow) is optimized by
inlining the lrint logic, otherwise the old code is kept as a fallback.
on my laptop an i486 lrint call is asm:10ns, old c:30ns, new c:21ns
on a smaller arm core: old c:71ns, new c:34ns
on a bigger arm core: old c:27ns, new c:19ns
analogous to commit ddc7c4f936 for
mips64 and n32, remove the hack to load the syscall number into $2 via
asm, and use a constraint to let the compiler load it instead.
now, only $4, $5, and $6 are potential input-only registers. $2 is
always input and output, and $7 is both when it's an argument,
otherwise output-only. previously, $7 was treated as an input (with a
"1" constraint matching its output position) even when it was not an
input, which was arguably undefined behavior (asm input from
indeterminate value). this is corrected.
as before, $8, $9, and $10 are conditionally input-output registers
for 5-, 6-, and 7-argument syscalls. their role in input is carrying
in the values that will be stored on the stack for arguments 5-7.
their role in output is carrying back whatever the kernel has
clobbered them with, so that the compiler cannot assume they still
contain the input values.
mips32 has two fpu register file variants: FR=0 with 32 32-bit
registers, where pairs of neighboring even/odd registers are used to
represent doubles, and FR=1 with 32 64-bit registers, each of which
can store a single or double.
up through r5 (our "mips" arch), the supported ABI uses FR=0, but
modern compilers generate "fpxx" model code that can safely operate
with either model. r6, which is an incompatible but similar ISA, drops
FR=0 and only provides the FR=1 model. as such, setjmp and longjmp,
which depended on being able to save and restore call-saved doubles by
storing and loading their 32-bit halves, were completely broken in the
presence of floating point code on mips r6.
to fix this, use the s.d and l.d mnemonics to store and load fpu
registers. these expand to the existing swc1 and lwc1 instructions for
pairs of 32-bit fpu registers on mips1, but on mips2 and later they
translate directly to the 64-bit sdc1 and ldc1.
with FR=0, sdc1 and ldc1 behave just like the pairs of swc1 and lwc1
instructions they replace, storing or loading the even/odd pair of fpu
registers that can be treated as separate single-precision floats or
as a unit representing a double. but with FR=1, they store/load
individual 64-bit registers. this yields the ABI-correct behavior on
mips r6, and should make linking of pre-r6 (plain "mips") code with
"fp64" model code workable, although this is and will likely remain
unsupported usage.
in addition to the mips r6 problem this change fixes, reportedly
clang's internal assembler refuses to assemble swc1 and lwc1
instructions for odd register indices when building for "fpxx" model
(the default). this caused setjmp and longjmp not to build. by using
the s.d and l.d forms, this problem is avoided too.
as a bonus, code size is reduced everywhere but mips1.
mips r6 (an incompatible isa from traditional mips) removes the hi and
lo registers used for mul/div results. older gcc versions accepted
them in the clobber list for asm, but their presence is incorrect and
breaks on later versions.
in the process of fixing this, the clobber list for 32-bit mips
syscalls has been deduplicated via a macro like on mips64 and n32.
armv8 removed the coprocessor instructions other than cp14, so
on an armv8 system the related hwcaps should never be set.
new llvm complains about the use of coprocessor instructions in
armv8-a mode (even though they are never executed at runtime),
so ifdef them out when musl is built for armv8.
in the timer thread start function, self->timer_id was accessed
without synchronization; the timer thread could fail to see the store
from the calling thread, resulting in timer_delete failing to delete
the correct kernel-level timer.
this fix is based on a patch by changdiankang, but with the load moved
to after receiving the timer_delete signal rather than just after the
start barrier, so as not to retain the possibility of data race with
timer_delete.
The operand specifiers in a_cas and a_cas_p for riscv64 were incorrect:
there's a backwards branch in the routine, so despite tmp being written
at the end of the assembly fragment it cannot be allocated in one of the
input registers because the input values may be needed for another trip
around the loop.
For code that follows the guaranteed forward progress requirements, the
backwards branch is rarely taken: SiFive's hardware only fails a store
conditional on exceptional cases (i.e., instruction cache misses inside
the loop), and until recently a bug in QEMU allowed back-to-back
store conditionals to succeed. The bug has been fixed in the latest
QEMU release, but it turns out that the fix caused this latent bug in
musl to manifest.
commit 8a544ee3a2 introduced a
dependency of the failure path for explicit scheduling at thread
creation on __clone's handling of the start function returning, which
should result in SYS_exit.
as noted in commit 05870abeaa, the arm
version of __clone was broken in this case. in the past, the mips
version was also broken; it was fixed in commit
8b2b61e000.
since this code path is pretty much entirely untested (previously only
reachable in applications that call the public clone() and return from
the start function) and consists of fragile per-arch asm, don't assume
it works, at least not until it's been thoroughly tested. instead make
the SYS_exit syscall from the start function's failure path.
we don't actually support building asm source files as thumb1, but
it's possible that the condition __ARM_ARCH>=5 would be false on old
compilers that did not define __ARM_ARCH at all. avoiding that would
require enumerating all of the possible __ARM_ARCH_*__ macros for
testing.
as noted in commit 05870abeaa, mov lr,pc
is not valid for saving a return address when in thumb mode. since
this code is a hot path (dynamic TLS access), don't do the out-of-line
bl->bx chaining to save the return value; instead, use the fact that
this file is preprocessed asm to add the missing thumb bit with an add
in place of the mov.
the change here does not affect builds for ISA levels new enough to
have a thread pointer read instruction, or for armv5 and later as long
as the compiler properly defines __ARM_ARCH, or for any build as arm
(not thumb) code. it's likely that it makes no difference whatsoever
to any present-day practical build environments, but nonetheless now
it's safe.
as an alternative, we could just assume __thumb__ implies availability
of blx since we don't support building asm source files as thumb1. I
didn't do that in order to avoid having a wrong assumption here if
that ever changes.
as noted in commit 05870abeaa, mov lr,pc
is not a valid method for saving the return address in code that might
be built as thumb.
this one is unlikely to matter, since any ISA level that has thumb2
should also have native implementations of atomics that don't involve
kuser_helper, and the affected code is only used on very old kernels
to begin with.
mov lr,pc is not a valid way to save the return address in thumb mode
since it omits the thumb bit. use a chain of bl and bx to emulate blx.
this could be avoided by converting to a .S file with preprocessor
conditions to use blx if available, but the time cost here is
dominated by the syscall anyway.
while making this change, also remove the remnants of support for
pre-bx ISA levels. commit 9f290a49bf
removed the hack from the parent code paths, but left the unnecessary
code in the child. keeping it would require rewriting two code paths
rather than one, and is useless for reasons described in that commit.
AT_HWCAP2 flags, see
linux commit 671db581815faf17cbedd7fcbc48823a247d90b1
arm64: Expose DC CVADP to userspace
linux commit 06a916feca2b262ab0c1a2aeb68882f4b1108a07
arm64: Expose SVE2 features for userspace
new mount api syscalls were added, same numbers on all targets, see
linux commit a07b20004793d8926f78d63eb5980559f7813404
vfs: syscall: Add open_tree(2) to reference or clone a mount
linux commit 2db154b3ea8e14b04fee23e3fdfd5e9d17fbc6ae
vfs: syscall: Add move_mount(2) to move mounts around
linux commit 24dcb3d90a1f67fe08c68a004af37df059d74005
vfs: syscall: Add fsopen() to prepare for superblock creation
linux commit ecdab150fddb42fe6a739335257949220033b782
vfs: syscall: Add fsconfig() for configuring and managing a context
linux commit 93766fbd2696c2c4453dd8e1070977e9cd4e6b6d
vfs: syscall: Add fsmount() to create a mount for a superblock
linux commit cf3cba4a429be43e5527a3f78859b1bfd9ebc5fb
vfs: syscall: Add fspick() to select a superblock for reconfiguration
linux commit 9c8ad7a2ff0bfe58f019ec0abc1fb965114dde7d
uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]
linux commit d8076bdb56af5e5918376cd1573a6b0007fc1a89
uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]
apply open_tree with OPEN_TREE_CLONE call to the entire subtree, see
linux commit a07b20004793d8926f78d63eb5980559f7813404
vfs: syscall: Add open_tree(2) to reference or clone a mount
see
linux commit a528d35e8bfcc521d7cb70aaf03e1bd296c8493f
statx: Add a system call to make enhanced file info available
these are linux specific and not reserved names for fcntl.h so they
are under _BSD_SOURCE|_GNU_SOURCE.
ethertype for fake VLAN header for DSA, see
linux commit bf5bc3ce8a8f32a0d45b6820ede8f9fc3e9c23df
ether: Add dedicated Ethertype for pseudo-802.1Q DSA tagging
historically, a number of 32-bit archs used long rather than int for
wchar_t, for no good reason. GCC still uses the historical types, but
clang replaced them all with int, and it seems PCC uses int too.
mismatching the compiler's type for wchar_t is not an option due to
wide string literals.
note that the mismatch does not affect C++ ABI since wchar_t is its
own builtin type/keyword in C++, distinct from both int and long, not
a typedef.
i386 already worked around this by honoring __WCHAR_TYPE__ if defined
by the compiler, and only using the official legacy ABI type if not.
add the same to the other affected archs.
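a minimal sketch of the selection logic described above, not the actual
bits header: the compiler's __WCHAR_TYPE__ is honored when defined, and
a legacy type is used otherwise. the name demo_wchar_t and the choice
of long as the fallback are assumptions made for illustration only.

```c
#include <stddef.h>

/* honor the compiler's own idea of wchar_t when it tells us */
#ifdef __WCHAR_TYPE__
typedef __WCHAR_TYPE__ demo_wchar_t;
#else
/* assumed legacy ABI type, for this illustration only */
typedef long demo_wchar_t;
#endif

/* wide character constants have type wchar_t in C, so their size
 * shows which type the compiler really uses for wide literals */
int demo_wchar_matches_compiler(void)
{
	return sizeof(demo_wchar_t) == sizeof L'x';
}
```

with a compiler that defines __WCHAR_TYPE__, the check holds by
construction; the fallback branch is the only place a mismatch could
arise.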
it might make sense at some point to switch to using int as the
default if __WCHAR_TYPE__ is not defined, if the expectation is that
new compilers will treat int as the correct choice, but it's unlikely
that the case where __WCHAR_TYPE__ is undefined will ever be used
anyway. I actually wanted to move the definition of wchar_t to the
top-level shared alltypes.h.in, using __WCHAR_TYPE__ and falling back
to int if not defined, but that can't be done without assuming all
compilers define __WCHAR_TYPE__ thanks to some pathological archs
where the ABI has wchar_t as an unsigned type.
previously, when pthread_create failed due to inability to set
explicit scheduling according to the requested attributes, the nascent
thread was detached and made responsible for its own cleanup via the
standard pthread_exit code path. this left it consuming resources
potentially well after pthread_create returned, in a way that the
application could not see or mitigate, and unnecessarily exposed its
existence to the rest of the implementation via the global thread
list.
instead, attempt explicit scheduling early and reuse the failure path
for __clone failure if it fails. the nascent thread's exit futex is
not needed for unlocking the thread list, since the thread calling
pthread_create holds the thread list lock the whole time, so it can be
repurposed to ensure the thread has finished exiting. no pthread_exit
is needed, and freeing the stack, if needed, can happen just as it
would if __clone failed.
if setting scheduling properties succeeds, the new thread may end up
with lower priority than the caller, and may be unable to continue
running due to another intermediate-priority thread. this produces a
priority inversion situation for the thread calling pthread_create,
since it cannot return until the new thread reports success.
originally, the parent was responsible for setting the new thread's
priority; commits b8742f3260 and
40bae2d32f changed it as part of
trimming down the pthread structure. since then, commit
04335d9260 partly reversed the changes,
but did not switch responsibilities back. do that now.
commit 8f11e6127f wrongly documented
that all changes to libc.threads_minus_1 were guarded by the thread
list lock, but the decrement for failed SYS_clone took place after the
thread list lock was released.
commit 030e526392 added optreset, a BSD
extension to getopt duplicating the functionality (also an extension)
of setting optind to 0, but failed to provide a public declaration for
it. according to the BSD documentation and headers, the application is
not supposed to need to provide its own declaration.
these are presently extensions, thus named with _np to match glibc and
other implementations that provide them; however they are likely to be
standardized in the future without the _np suffix as a result of
Austin Group issue 1208. if so, both names will be kept as aliases.
due to historical accident/sloppiness in glibc, the powerpc,
powerpc64, and sh versions of struct user, defined by sys/user.h, used
struct pt_regs from the kernel asm/ptrace.h for their regs member.
this made it impossible to define the type in an API-compatible manner
without either including asm/ptrace.h like glibc does (contrary to our
policy of not depending on kernel headers), or clashing with
asm/ptrace.h's definition of struct pt_regs if both headers are
included (which is almost always the case in software using
sys/user.h).
for a long time I viewed this problem as having no reasonable fix. I
even explored the possibility of having the powerpc[64] and sh
versions of user.h just include the kernel header (breaking with
policy), but that looked like it might introduce new clashes with
sys/ptrace.h. and it would also bring in a lot of additional cruft
that makes no sense for sys/user.h to expose. glibc goes out of its
way to suppress some of that with #undef, possibly leading to
different problems. this is a rabbit-hole that should be explored no
further.
as it turns out, however, nothing actually uses struct user
sufficiently to care about the type of the regs member; most software
including sys/user.h does not even use struct user at all. so, the
problem can be fixed just by doing away with the insistence on strict
glibc API compatibility for the struct tag of the regs member.
rather than renaming the tag, which might lead to the new name
entering use as API, simply use an untagged structure inside struct
user with the same members/layout as struct pt_regs.
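the idea can be sketched as follows. all names and members here are
made up for illustration (the real register layouts are arch-specific
and not reproduced); the point is only that an untagged inner struct
can mirror a tagged one without introducing a clashing struct tag.

```c
#include <stddef.h>

/* stand-in for the kernel's tagged struct; members are hypothetical */
struct demo_pt_regs {
	unsigned long gpr[32];
	unsigned long nip;
};

/* userspace struct: same members/layout for regs, but no struct tag
 * that could clash with (or be mistaken for) the kernel's type */
struct demo_user {
	struct {
		unsigned long gpr[32];
		unsigned long nip;
	} regs;
	unsigned long u_tsize;
};

/* returns 1 if the untagged member mirrors the tagged struct */
int demo_layout_matches(void)
{
	return sizeof(((struct demo_user *)0)->regs)
	       == sizeof(struct demo_pt_regs)
	    && offsetof(struct demo_user, regs.nip)
	       == offsetof(struct demo_pt_regs, nip);
}
```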
for sh, struct pt_dspregs is just removed entirely since it was not
used.
these members are associated with an unsupported option group. with
time_t changing size on 32-bit archs, all interfaces taking struct
sched_param arguments would need redirection and compat shims in order
to be able to continue offering these members, for no benefit. just
convert them to reserved space instead.
commit ffab43602b broke this by moving
relocations after not only the allocation of storage for the main
thread's static TLS, but after the copying of the TLS image. thus,
relocation results were not reflected in the main thread's copy. this
could be fixed by calling __reset_tls after relocations, but instead
split the allocation and installation before/after relocations so that
there's not a redundant copy.
due to commit 71af530987, updating of
static_tls_cnt needs to be kept with allocation of static TLS, before
relocations, rather than after installation.
Using common code path for all symbol lookups fixes three dlsym issues:
- st_shndx of STT_TLS symbols was not checked and thus an undefined
tls symbol reference could be incorrectly treated as a definition
(the sysv hash lookup returns undefined symbols, gnu does not, so should
be rare in practice).
- symbol binding was not checked so a hidden symbol may be returned
(in principle STB_LOCAL symbols may appear in the dynamic symbol table
for hidden symbols, but linkers most likely don't produce it).
- mips specific behaviour was not applied (ARCH_SYM_REJECT_UND) so
undefined symbols may be returned on mips.
always_inline is used to avoid relocation performance regression, the
code generation for find_sym should not be affected.
commit 7a9669e977 added use of the
symbol reference as the definition, in place of performing a lookup,
for STT_SECTION symbol references that were first found used in FDPIC.
such references may happen in certain other cases, such as
local-dynamic TLS and with relocation types that require a symbol but
that are being used for non-symbolic purposes, like the powerpc
unaligned address relocations.
in all such cases I'm aware of, the symbol referenced is a section
symbol (STT_SECTION); however, the important semantic property is not
its being a section, but rather its binding local (STB_LOCAL). check
the latter instead of the former for greater generality and semantic
correctness.
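a sketch of the two predicates, assuming the standard <elf.h> macros and
64-bit symbol entries; the function names are hypothetical and this is
not the dynamic linker's actual code.

```c
#include <elf.h>

/* old condition: the referenced symbol is a section symbol */
int demo_is_section_sym(const Elf64_Sym *sym)
{
	return ELF64_ST_TYPE(sym->st_info) == STT_SECTION;
}

/* new condition: the referenced symbol has local binding, so the
 * reference itself can serve as the definition without a lookup */
int demo_ref_is_own_def(const Elf64_Sym *sym)
{
	return ELF64_ST_BIND(sym->st_info) == STB_LOCAL;
}
```

a local symbol that is not a section symbol satisfies the second
predicate but not the first, which is the extra generality gained.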
R_PPC_UADDR32 (R_PPC64_UADDR64) has the same meaning as R_PPC_ADDR32
(R_PPC64_ADDR64), except that its address need not be aligned. For
powerpc64, BFD ld(1) will automatically convert between ADDR<->UADDR
relocations when the address is/isn't at its native alignment. This
will happen if, for example, there is a pointer in a packed struct.
gold and lld do not currently generate R_PPC64_UADDR64, but pass
through misaligned R_PPC64_ADDR64 relocations from object files,
possibly relaxing them to misaligned R_PPC64_RELATIVE. In both cases
(relaxed or not) this violates the PSABI, which defines the relevant
field type as "a 64-bit field occupying 8 bytes, the alignment of
which is 8 bytes unless otherwise specified."
All three linkers violate the PSABI on 32-bit powerpc, where the only
difference is that the field is 32 bits wide, aligned to 4 bytes.
Currently musl fails to load executables linked by BFD ld containing
R_PPC64_UADDR64, with the error "unsupported relocation type 43".
This change provides compatibility with BFD ld on powerpc64, and any
static linker on either architecture that starts following the PSABI
more closely.
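one portable way to apply a relocation whose target may be misaligned
is a byte-wise copy. the sketch below is an assumption about how that
could look, with hypothetical function names, and is not necessarily
how the change itself is written.

```c
#include <stdint.h>
#include <string.h>

/* store a 64-bit address at a location with no alignment guarantee,
 * e.g. a pointer member of a packed struct */
void demo_store_addr_unaligned(unsigned char *where, uint64_t value)
{
	memcpy(where, &value, sizeof value);
}

/* read it back the same way */
uint64_t demo_load_addr_unaligned(const unsigned char *where)
{
	uint64_t value;
	memcpy(&value, where, sizeof value);
	return value;
}
```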
as a result of commit ffab43602b,
static_tls_cnt is now valid during relocations at program startup, so
it's no longer necessary to condition the check against static_tls_cnt
on this being a runtime (dlopen) relocation.
this is analogous to commit 2f1f51ae7b,
and should have been caught at the same time since it was right next
to the code moved in that commit. between final stage 3 reloc_all and
the jump to the main program's entry point, it is not valid to call
any functions which may be interposed by the application; doing so
results in execution of application code before ctors have run, and on
fdpic archs, before the main program's fdpic self-fixups have taken
place, which will produce runaway wrong execution.
somewhat analogous to commit d0b547dfb5,
but here the omission of the null timeout check was in the time64
syscall code path. this code is not yet used except on x32.
these accept the netbsd/openbsd message catalog file format,
consisting of a sorted list of set headers and a sorted list of
message headers for each set, admitting trivial binary search for
lookups.
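to illustrate why sorted headers admit trivial binary search, here is a
sketch over a simplified in-memory table. struct demo_msg and the
function names are hypothetical; the on-disk catalog format is not
reproduced here.

```c
#include <stdlib.h>
#include <string.h>

/* simplified message header: id plus text, sorted by id */
struct demo_msg {
	int id;
	const char *text;
};

static int demo_cmp(const void *key, const void *elem)
{
	int k = *(const int *)key;
	int id = ((const struct demo_msg *)elem)->id;
	return (k > id) - (k < id);
}

/* log-time lookup; returns dflt if the id is not present */
const char *demo_lookup(const struct demo_msg *msgs, size_t n,
                        int id, const char *dflt)
{
	const struct demo_msg *m =
		bsearch(&id, msgs, n, sizeof *msgs, demo_cmp);
	return m ? m->text : dflt;
}
```

a real lookup would presumably search the set headers first and then
the message headers of the matching set, using the same technique
twice.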
the gnu format was not chosen because it's unusably bad. it does not
admit efficient (log time or better) lookups; rather, it requires
linear search or hash table lookups, and the hash function is awful:
it's literally set_id*msg_id.
commit 722a1ae335 inadvertently passed a
copy of {s,us} to the syscall even if the timeout argument tv was
null, thereby causing immediate timeout (polling) in place of
unlimited timeout. only archs using SYS_select were affected.
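the distinction that was lost can be sketched like this; the helper name
and the long[2] buffer are assumptions for illustration, not the actual
select code.

```c
#include <stddef.h>
#include <sys/time.h>

/* produce the timeout argument to hand to the kernel: a null tv must
 * stay null (wait forever); only a non-null tv is copied */
long *demo_kernel_timeout(const struct timeval *tv, long buf[2])
{
	if (!tv) return NULL;
	buf[0] = tv->tv_sec;
	buf[1] = tv->tv_usec;
	return buf;
}
```

passing buf unconditionally, with zeros in it, is what turns an
unlimited wait into an immediate poll.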
when the pattern ended with one or more literal path components, or
when the GLOB_MARK flag was passed to request that glob flag directory
results and the type obtained by readdir was unknown or inconclusive
(symlink), the stat function was called to evaluate existence and/or
determine type. however, stat fails with ENOENT for broken symlinks,
and this caused the match to be omitted from the results.
instead, use stat only for the unknown/inconclusive cases with
GLOB_MARK, and otherwise, or if stat fails, use lstat if existence
still needs to be determined. this minimizes the number of costly syscalls,
performing both only in the case where GLOB_MARK is in use and there
is a final literal path component which is a broken symlink.
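a simplified sketch of that decision, with a hypothetical helper name
and a want_mark flag standing in for "GLOB_MARK with unknown or
inconclusive type"; the real glob code has more state than this.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>

/* returns 1 if path exists (a broken symlink counts as existing).
 * *is_dir is set only when want_mark is nonzero and the path resolves
 * to a directory. */
int demo_exists(const char *path, int want_mark, int *is_dir)
{
	struct stat st;
	*is_dir = 0;
	/* stat only when the type is needed for marking directories */
	if (want_mark && stat(path, &st) == 0) {
		*is_dir = S_ISDIR(st.st_mode) != 0;
		return 1;
	}
	/* otherwise, or if stat failed, lstat decides existence */
	return lstat(path, &st) == 0;
}
```

both syscalls are made only when want_mark is set and stat fails, which
is the broken-symlink case described above.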
based on/simplified from patch by James Y Knight.
the contents conflicted with asm/ptrace.h. glibc does not provide
anything in user.h for riscv, so software cannot be depending on it.
simplified from patch submitted by Baruch Siach.
Rename user registers struct definitions to avoid conflict with the
asm/ptrace.h kernel header that defines the same structs. Use the
__riscv_mc prefix as glibc does.
The only reason we needed to preserve the link register was because we
were using a branch-link instruction to branch to __cp_cancel.
Replacing this with a branch means we can avoid the save/restore as
the link register is no longer modified.
otherwise alarm will break on 32-bit archs when time_t is changed to
64-bit. a second itimerval object is introduced for retrieving the old
value, since the setitimer function has restrict-qualified arguments.
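a sketch of alarm expressed via setitimer with two separate itimerval
objects, as described; demo_alarm is a hypothetical name and the
rounding of leftover microseconds up to a full second is an assumption
of this sketch.

```c
#define _XOPEN_SOURCE 700
#include <sys/time.h>

unsigned demo_alarm(unsigned seconds)
{
	/* new value and old value live in distinct objects, since the
	 * setitimer arguments are restrict-qualified */
	struct itimerval it = { .it_value.tv_sec = seconds };
	struct itimerval old = { 0 };
	setitimer(ITIMER_REAL, &it, &old);
	/* report remaining time, rounding a partial second up */
	return old.it_value.tv_sec + !!old.it_value.tv_usec;
}
```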
commit 31c5fb80b9 introduced underflow
code paths for the i386 math asm, along with checks on the fpu status
word to skip the underflow-generation instructions if the underflow
flag was already raised. unfortunately, at least one such path, in
log1p, returned with 2 items on the x87 stack rather than just 1 item
for the return value. this is a violation of the ABI's calling
convention, and could cause subsequent floating point code to produce
NANs due to x87 stack overflow. if floating point results are used in
flow control, this can lead to runaway wrong code execution.
rather than reviewing each "underflow already raised" code path for
correctness, remove them all. they're likely slower than just
performing the underflow code unconditionally, and significantly more
complex.
all of this code should be ripped out and replaced by C source files
with inline asm. doing so would preclude this kind of error by having
the compiler perform all x87 stack register allocation and stack
manipulation, and would produce comparable or better code. however
such a change is a much larger project.
commit f3f96f2daa added these for the
rest of the archs, but the patch it corresponded to missed riscv64
since riscv64 was not yet upstream at the time. this caused commit
dfc81828f7 to break riscv64 build, due
to a wrong assumption that SYS_statx was unconditionally defined.
this fixes a major upcoming performance regression introduced by
commit 72f50245d0, whereby 32-bit archs
would lose vdso clock_gettime after switching to 64-bit time_t, unless
the kernel supports time64 and provides a time64 version of the vdso
function. this would incur not just one but two syscalls: first, the
failed time64 syscall, then the fallback time32 one.
overflow of the 32-bit result is detected and triggers a revert to
syscalls. normally, on a system that's not Y2038-ready, this would
still overflow, but if the process has been migrated to a
time64-capable kernel or if the kernel has been hot-patched to add
time64 syscalls, it may conceivably work.
otherwise, 32-bit archs that could otherwise share the generic
bits/ipc.h would need to duplicate the struct ipc_perm definition,
obscuring the fact that it's the same. sysvipc is not widely used and
these headers are not commonly included, so there is no performance
gain to be had by limiting the number of indirectly included files
here.
files with the existing time32 definition of IPC_STAT are added to all
current 32-bit archs now, so that when it's changed the change will
show up as a change rather than addition of a new file where it's less
obvious that the value is changing vs the generic one that was used
before.
per policy, define the feature test macro to get declarations for the
pthread_tryjoin_np and pthread_timedjoin_np functions. in the past
this has been only for checking; with 32-bit archs getting 64-bit
time_t it will also be necessary for symbols to get redirected
correctly.
to make use of {sem,shm,msg}ctl IPC_STAT functionality to provide
64-bit time_t on 32-bit archs, IPC_STAT and related macros must be
defined with bit 8 (0x100) set. allow archs to define IPC_STAT in
bits/ipc.h, and define the other macros in terms of it so that they
all get the same value of the time64 bit.
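how the macros might relate is sketched below with DEMO_-prefixed names.
the base command numbers (2, 18, 13, 11) are the usual Linux values;
the exact form of the definitions is an assumption, not a copy of the
headers.

```c
/* what a 32-bit arch's bits/ipc.h might define once time64 is on */
#define DEMO_IPC_STAT (2 | 0x100)

/* the others inherit the time64 bit from IPC_STAT */
#define DEMO_IPC_TIME64 (DEMO_IPC_STAT & 0x100)
#define DEMO_SEM_STAT (18 | DEMO_IPC_TIME64)
#define DEMO_SHM_STAT (13 | DEMO_IPC_TIME64)
#define DEMO_MSG_STAT (11 | DEMO_IPC_TIME64)

/* libc side: detect the bit, then strip it to get the base command */
int demo_wants_time64(int cmd) { return (cmd & 0x100) != 0; }
int demo_base_cmd(int cmd) { return cmd & ~0x100; }
```

an arch that leaves IPC_STAT at 2 would get DEMO_IPC_TIME64 equal to 0,
and every derived macro would keep its old value.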
the time64 syscall has to be used if time_t is 64-bit, since there's
no way of knowing before making a syscall whether the result will fit
in 32 bits, and the 32-bit syscalls do not report overflow as an
error.
on 64-bit archs, there is no change to the code after preprocessing.
on current 32-bit archs, the result is now read from the kernel
through a long[2] array, then copied into the timespec, to remove the
assumption that time_t is the same as long.
vdso clock_gettime is still used in place of a syscall if available.
32-bit archs with 64-bit time_t must use the time64 version of the
vdso function; if it's not available, performance will significantly
suffer. support for both vdso functions could be added, but would
break the ability to move a long-lived process from a pre-time64
kernel to one that can outlast Y2038 with checkpoint/resume, at least
without added hacks to identify that the 32-bit function is no longer
usable and stop using it (e.g. by seeing negative tv_sec). this
possibility may be explored in future work on the function.
the 64-bit/time64 version of the syscall is not API-compatible with
the userspace timex structure definition; fields specified as long
have type long long. so when using the time64 syscall, we have to
convert the entire structure. this was always the case for x32 as
well, but went unnoticed, meaning that clock_adjtime just passed junk
to the kernel on x32. it should be fixed now.
for the fallback case, we avoid encoding any assumptions about the new
location of the time member or naming of the legacy slots by accessing
them through a union of the kernel type and the new userspace type.
the only assumption is that the non-time members live at the same
offsets as in the (non-time64, long-based) kernel timex struct. this
property saves us from having to convert the whole thing, and avoids a
lot of additional work in compat shims.
the new code is statically unreachable for now except on x32, where it
fixes major brokenness. it is permanently unreachable on 64-bit.
without this, the SIOCGSTAMP and SIOCGSTAMPNS ioctl commands, for
obtaining timestamps, would stop working on pre-5.1 kernels after
time_t is switched to 64-bit and their values are changed to the new
time64 versions.
new code is written such that it's statically unreachable on 64-bit
archs, and on existing 32-bit archs until the macro values are changed
to activate 64-bit time_t.
without this, the SO_RCVTIMEO and SO_SNDTIMEO socket options would
stop working on pre-5.1 kernels after time_t is switched to 64-bit and
their values are changed to the new time64 versions.
new code is written such that it's statically unreachable on 64-bit
archs, and on existing 32-bit archs until the macro values are changed
to activate 64-bit time_t.
the __socketcall and __socketcall_cp macros are remnants from a really
old version of the syscall-mechanism infrastructure, and don't follow
the pattern that the "__" version of the macro returns the raw negated
error number rather than setting errno and returning -1.
for time64 purposes, some socket syscalls will need to operate on the
error value rather than returning immediately, so fix this up so they
can use it.
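the convention in question, sketched with a hypothetical helper: the
raw ("__") form hands back the negated error number, and a separate
step converts it to the errno-and-minus-one form. the 4096 threshold
reflects the usual Linux convention that return values in [-4095, -1]
are errors, and is stated here as an assumption.

```c
#include <errno.h>

/* convert a raw syscall result (negated errno on failure) into the
 * public convention: set errno and return -1 */
long demo_syscall_ret(unsigned long r)
{
	if (r > -4096UL) {
		errno = -r;
		return -1;
	}
	return r;
}
```

code that needs to inspect the error before deciding what to do (for
example, to retry with a fallback syscall) works on the raw value and
calls the conversion only at the end.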
being "ctl" functions that take command numbers, these will be handled
like ioctl/sockopt/etc., using new command numbers for the time64
variants with an "IPC_TIME64" bit added to their values. to obtain
such a reserved bit, we reuse the IPC_64 bit, 0x100, which served only
as part of the libc-to-kernel interface, not as a public interface of
the libc functions.
using new command numbers avoids the need for compat shims (in ABIs
doing time64 through symbol redirection and compat shims) and, by
virtue of having a fixed time64 bit for all commands, we can ensure
that libc can perform the appropriate translations, even if the
application is using new commands from a newer version of the libc
headers than the libc available at runtime.
for the vast majority of 32-bit archs, the kernel {sem,shm,msq}id64_ds
definitions left padding space intended for expanding their time_t
fields to 64 bits in-place, and it would have been really nice to be
able to do time64 support that way. however the padding was almost
always in little-endian order (except on powerpc, and for msqid_ds
only on mips, where it matched the arch's byte order), and more
importantly, the alignment was overlooked. in semid_ds and msqid_ds,
the time_t members were not suitably aligned to be expanded to 64-bit,
due to the ipc_perm header consisting of 9 32-bit words -- except on
powerpc where ipc_perm contains an extra padding word. in shmid_ds,
the time_t members were suitably aligned, except that mips
(accidentally?) omitted the padding for them altogether.
as a result, we're stuck with adding new time_t fields on the end of
the structures, and assembling the 32-bit lo/hi parts (or 16-bit hi
parts, for mips shmid_ds, which lacked sufficient reserved space for
full 32-bit hi parts) to fill them in.
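assembling the parts can be sketched as below. the function names are
hypothetical, and treating the 16-bit hi part as zero-extended is an
assumption of this sketch.

```c
#include <stdint.h>

/* combine 32-bit lo and hi words into a 64-bit time value */
int64_t demo_join32(uint32_t lo, uint32_t hi)
{
	return (int64_t)(lo | (uint64_t)hi << 32);
}

/* variant for a structure with room for only 16 hi bits */
int64_t demo_join16(uint32_t lo, uint16_t hi)
{
	return (int64_t)(lo | (uint64_t)hi << 32);
}
```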
all of the functional changes here are conditional on the IPC_TIME64
macro having a nonzero definition, which will only happen when
IPC_STAT is redefined for 32-bit archs, and on time_t being larger
than long, so for now the new code is all dead code.
due to the variadic signature, semctl needs to be made aware of any
new commands that take arguments. this was overlooked when commit
af55070eae added SEM_STAT_ANY.
these differ from generic only in using endian-matched padding with a
short __ipc_perm_seq field in place of the int field in generic. this
is not a documented public interface anyway, and the original intent
was to use int here. some ports just inadvertently slipped in the
kernel short+padding form.
previously these differed from generic because they needed their own
definitions of IPC_64. now that it's no longer in the public header,
they're identical.
the definition of the IPC_64 macro controls the interface between libc
and the kernel through syscalls; it's not a public API. the meaning is
rather obscure. long ago, Linux's sysvipc *id_ds structures used
16-bit uids/gids and wrong types for a few other fields. this was in
the libc5 era, before glibc. the IPC_64 flag (64 is a misnomer; it's
more like 32) tells the kernel to use the modern[-ish] versions of the
structures.
the definition of IPC_64 has nothing to do with whether the arch is
32- or 64-bit. rather, due to either historical accident or
intentional obnoxiousness, the kernel only accepts and masks off the
0x100 IPC_64 flag conditional on CONFIG_ARCH_WANT_IPC_PARSE_VERSION,
i.e. for archs that want to provide, or that accidentally provided,
both. for archs which don't define this option, no masking is
performed and commands with the 0x100 bit set will fail as invalid. so
ultimately, the definition is just a matter of matching an arbitrary
switch defined per-arch in the kernel.
major changes are made alongside adding time64 syscall support to
account for issues found during research. select historically accepts
non-normalized (tv_usec not restricted to less than 1000000) timeouts,
and the kernel normalizes them, but the normalization code is buggy
and subject to integer overflows. since normalization is needed anyway
when using SYS_pselect6 or SYS_pselect6_time64 as the backend, simply
do it up-front to eliminate both code path complexity and the
possibility of kernel bugs.
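up-front normalization might look roughly like this. the function name,
the use of long long for seconds, and clamping to the maximum on
overflow are assumptions of this sketch rather than a description of
the committed code.

```c
#include <errno.h>
#include <limits.h>

/* normalize (s, us) with us possibly >= 1000000 into seconds plus
 * nanoseconds; returns 0 on success or -EINVAL for negative input */
int demo_normalize(long long s, long long us,
                   long long *out_s, long *out_ns)
{
	const long long max_time = LLONG_MAX;
	if (s < 0 || us < 0) return -EINVAL;
	if (us / 1000000 > max_time - s) {
		/* carrying into seconds would overflow: clamp */
		*out_s = max_time;
		*out_ns = 999999999;
	} else {
		*out_s = s + us / 1000000;
		*out_ns = (long)(us % 1000000) * 1000;
	}
	return 0;
}
```

the overflow test is written as a subtraction from the maximum so that
the check itself cannot overflow.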
as a side effect, select no longer updates the caller's timeout
timeval with the remaining time. previously, archs that used
SYS_select updated it and archs that used SYS_pselect6 didn't. this
change may turn out to be controversial and may need revisiting, but
in any case the old behavior was not strictly conforming.
POSIX allows modification of the timeout "upon successful completion",
but the Linux syscall modifies it upon unsuccessful completion (EINTR)
as well (and presumably each time the syscall stops and restarts
before it's known whether completion will be successful). it's
possible that this language does not reflect the actual intent of the
standard, since other historical implementations probably behaved like
Linux, but that should be clarified if there's a desire to bring the
old behavior back. regardless, programs that are depending on this are
not correct and are already broken on some archs we support.
the time64 syscall is used only if the timeout does not fit in 32
bits. after preprocessing, the code is unchanged on 64-bit archs. for
32-bit archs, the timeout now goes through an intermediate copy,
meaning that the caller does not get back the updated timeout. this is
based on my reading of the documentation, which does not document the
updating as a contract you can rely on, and mentions that the whole
recvmmsg timeout mechanism is buggy and unlikely to be useful. if it
turns out that there's interest in making the remaining time
officially available to callers, such functionality could be added
back later.
these functions have no new time64 syscall, so the existence of a
time64 syscall cannot be used as the condition for the new code.
instead, assume the syscall takes timevals as longs, which is true
everywhere but x32, and interface with the kernel through long[4]
objects.
rather than adding new hacks to special-case x32 here, just add
x32-specific source files since a trivial syscall wrapper suffices
there.
the new code paths added in this commit are statically unreachable on
all current archs, but will become reachable when 32-bit archs get
64-bit time_t.
this layout is more common already than the old generic, and should
become even more common in the future with new archs added and with
64-bit time_t on 32-bit archs.
some of these were not exact duplicates, but had gratuitously
different naming for padding, or omitted the endian checks because the
arch is fixed-endian.
this layout is slightly less common than the old generic one, but only
because x86_64 and x32 wrongly (according to comments in the kernel
headers) copied the i386 padding. for future archs, and with 64-bit
time_t on 32-bit archs, the new layout here will become the most
common, and it makes sense to treat it as the generic.
various padding fields in the generic bits/sem.h were defined in terms
of time_t as a cheap hack standing in for "kernel long", to allow x32
to use the generic version of the file. this was a really bad idea, as
it ended up getting copied into lots of arch-specific versions of the
bits file, and is a blocker to changing time_t to 64-bit on 32-bit
archs.
this commit adds an x32-specific version of the header, and changes
padding type back from time_t to long (currently the same type on all
archs but x32) in the generic header and all the others the hack got
copied into.
this layout is more common already than the old generic, and should
become even more common in the future with new archs added and with
64-bit time_t on 32-bit archs.
the duplicate arch-specific copies are not removed yet in this commit,
so as to assist git tooling in copy/rename tracking.
there are more archs sharing the generic 64-bit version of the struct,
which is uniform and much more reasonable, than sharing the current
"generic" one, and depending on how time64 sysvipc is done for 32-bit
archs, even more may be sharing the "64-bit version" in the future.
so, duplicate the current generic to all archs using it (arm, i386,
m68k, microblaze, or1k) so that the generic can be changed freely.
this is recorded as its own commit mainly as a hint to git tooling, to
assist in copy/move tracking.
as with clock_getres, the time64 syscall for this is not necessary or
useful, this time since scheduling timeslices are not on the order of 68
years. if there's a 32-bit syscall, use it and expand the result into
timespec; otherwise there is only one syscall and it does the right
thing to store to timespec directly.
on 64-bit archs, there is no change to the code after preprocessing.
the time64 syscall for this is not necessary or useful, since clock
resolution is generally better than 68-year granularity. if there's a
32-bit syscall, use it and expand the result into timespec; otherwise
there is only one syscall and it does the right thing to store to
timespec directly.
on 64-bit archs, there is no change to the code after preprocessing.
the time64 syscall has to be used if time_t is 64-bit, since there's
no way of knowing before making a syscall whether the result will fit
in 32 bits, and the 32-bit syscalls do not report overflow as an
error.
on 64-bit archs, there is no change to the code after preprocessing.
on current 32-bit archs, the result is now read from the kernel
through a long[4] array, then copied into the timespec, to remove the
assumption that time_t is the same as long.
the x32 syscall interfaces treat timespec's tv_nsec member as 64-bit
despite the API type being long and long being 32-bit in the ABI. this
is no problem for syscalls that store timespecs to userspace as
results, but caused uninitialized padding to be misinterpreted as the
high bits in syscalls that take timespecs as input.
since the beginning of the port, we've dealt with this situation with
hacks in syscall_arch.h, and injected between __syscall_cp_c and
__syscall_cp_asm, to special-case the syscall numbers that involve
timespecs as inputs and copy them to a form suitable to pass to the
kernel.
commit 40aa18d55a set the stage for
removal of these hacks by letting us treat the "normal" x32 syscalls
dealing with timespec as if they're x32's "time64" syscalls,
effectively making x32 a "time64-only 32-bit arch" like riscv32 will
be when it's added. since then, all users of syscalls that x32's
syscall_arch.h had hacks for have been updated to use time64 syscalls,
so the hacks can be removed.
there are still at least a few other timespec-related syscalls broken
on x32, which were overlooked when the x32 hacks were done or added
later. these include at least recvmmsg, adjtimex/clock_adjtime, and
timerfd_settime, and they will be fixed independently later on.
time64 syscall is used only if it's the only one defined for the arch,
or if either of the requested times does not fit in 32 bits. care is
taken to normalize the inputs to account for UTIME_NOW or UTIME_OMIT
in tv_nsec, in which case tv_sec should be ignored. this is needed not
only to avoid spurious time64 syscalls that might waste time failing
with ENOSYS, but also to accurately decide whether fallback is
possible.
if the requested time cannot be represented, the function fails with
ENOTSUP, defined in general as "The implementation does not support
the requested feature or value". neither the time64 syscall, nor this
error, can happen on current 32-bit archs where time_t is a 32-bit
type, and both are statically unreachable.
on 64-bit archs, there are only superficial changes to the
SYS_futimesat fallback path, which has been modified to pass long[4]
instead of struct timeval[2] to the kernel, making it suitable for use
on 32-bit archs even once time_t is changed to 64-bit. for 32-bit
archs, the call to SYS_utimensat has also been changed to copy the
timespecs through an array of long[4] rather than passing the
timespec[2] in-place.
time64 syscall is used only if it's the only one defined for the arch,
or if the requested time does not fit in 32 bits. on current 32-bit
archs where time_t is a 32-bit type, this makes it statically
unreachable.
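the "fits in 32 bits" decision that recurs throughout this series can be
sketched as follows. the names are hypothetical and the flag
only_time64_exists stands in for what would really be a preprocessor
condition on which syscall numbers the arch defines.

```c
/* 1 if x is representable as a signed 32-bit value: adding 2^31 maps
 * the range [-2^31, 2^31-1] onto [0, 2^32-1], which has no high bits */
int demo_fits32(long long x)
{
	return !(((unsigned long long)x + 0x80000000ULL) >> 32);
}

/* use the time64 syscall only when there is no alternative or when
 * the value cannot be passed through the 32-bit one */
int demo_need_time64(long long sec, int only_time64_exists)
{
	return only_time64_exists || !demo_fits32(sec);
}
```

when time_t is itself a 32-bit type, the fits-in-32-bits test is always
true, which is why the time64 path is statically unreachable on such
archs unless it is the only syscall available.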
if the time64 syscall is needed because the requested time does not
fit in 32 bits, we define this as an error ENOTSUP, for "The
implementation does not support the requested feature or value".
on 64-bit archs, there is no change to the code after preprocessing.
on current 32-bit archs, the time is moved through an intermediate
copy to remove the assumption that time_t is a 32-bit type.
time64 syscall is used only if it's the only one defined for the arch,
if either component of the itimerspec does not fit in 32 bits, or if
time_t is 64-bit and the caller requested the old value, in which case
there's a possibility that the old value might not fit in 32 bits. on
current 32-bit archs where time_t is a 32-bit type, this makes it
statically unreachable.
on 64-bit archs, there is no change to the code after preprocessing.
on current 32-bit archs, the time is moved through an intermediate
copy to remove the assumption that time_t is a 32-bit type.
time64 syscall is used only if it's the only one defined for the arch,
or if the requested timeout length does not fit in 32 bits. on current
32-bit archs where time_t is a 32-bit type, this makes it statically
unreachable.
on 64-bit archs, there are only superficial changes to the code after
preprocessing. both before and after these changes, these functions
copied their timeout arguments to avoid letting the kernel clobber the
caller's copies. now, the copying also serves to change the type from
userspace timespec to a pair of longs, which makes a difference only
in the 32-bit fallback case, not on 64-bit.
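a rough sketch of copying a timeout into a pair of longs for the legacy syscall form. the helper name timeout_to_longs is hypothetical, and saturating out-of-range seconds is an assumption made for this illustration rather than something stated above:

```c
#include <limits.h>
#include <time.h>

/* Hypothetical helper: copy a userspace timespec into the long[2]
 * layout taken by the non-time64 syscall form. Saturating the seconds
 * is an assumption of this sketch; when long is as wide as time_t the
 * clamping never triggers and this is a plain copy. */
static void timeout_to_longs(const struct timespec *ts, long out[2])
{
	long long s = ts->tv_sec;
	if (s > LONG_MAX) s = LONG_MAX;
	if (s < LONG_MIN) s = LONG_MIN;
	out[0] = (long)s;
	out[1] = ts->tv_nsec;
}
```

because the kernel only ever sees the copy, it also cannot clobber the caller's timespec.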
thanks to the original factorization using the __timedwait function,
there are no FUTEX_WAIT calls anywhere else, giving us a single point
of change to make nearly all the timed thread primitives time64-ready.
the one exception is the FUTEX_LOCK_PI command for PI mutex timedlock.
I haven't tried to make these two points share code, since they have
different fallbacks (no non-private fallback needed for PI since PI
was added later) and FUTEX_LOCK_PI isn't a cancellation point (thus
allowing the whole code path to inline into pthread_mutex_timedlock).
as for other changes in this series, the time64 syscall is used only
if it's the only one defined for the arch, or if the requested timeout
does not fit in 32 bits. on current 32-bit archs where time_t is a
32-bit type, this makes it statically unreachable.
on 64-bit archs, there are only superficial changes to the code after
preprocessing. on current 32-bit archs, the time is passed via an
intermediate copy to remove the assumption that time_t is a 32-bit
type.
time64 syscall is used only if it's the only one defined for the arch,
or if the requested timeout does not fit in 32 bits. on current 32-bit
archs where time_t is a 32-bit type, this makes it statically
unreachable.
on 64-bit archs, there is no change to the code after preprocessing.
on current 32-bit archs, the time is passed via an intermediate copy
to remove the assumption that time_t is a 32-bit type.
to avoid duplicating SYS_ipc/SYS_semtimedop choice logic, the code for
32-bit archs "falls through" after updating the timeout argument ts to
point to a [compound literal] array of longs. in preparation for
"time64-only" 32-bit archs, an extra case is added for neither SYS_ipc
nor the non-time64 SYS_semtimedop existing; the ENOSYS failure path
here should never be reachable, and is added just in case a compiler
can't see that it's not reachable, to avoid spurious static analysis
complaints.
time64 syscall is used only if it's the only one defined for the arch,
or if the requested timeout length does not fit in 32 bits. on current
32-bit archs where time_t is a 32-bit type, this makes it statically
unreachable.
on 64-bit archs, there are only superficial changes to the code after
preprocessing. on current 32-bit archs, the timeout is passed via an
intermediate copy to remove the assumption that time_t is a 32-bit
type.
time64 syscall is used only if it's the only one defined for the arch,
or if the requested absolute timeout does not fit in 32 bits. on
current 32-bit archs where time_t is a 32-bit type, this makes it
statically unreachable.
on 64-bit archs, there is no change to the code after preprocessing.
on current 32-bit archs, the timeout is passed via an intermediate
copy to remove the assumption that time_t is a 32-bit type.
time64 syscall is used only if it's the only one defined for the arch,
or if the requested time does not fit in 32 bits. on current 32-bit
archs where time_t is a 32-bit type, this makes it statically
unreachable.
on 64-bit archs, there is no change to the code after preprocessing.
on current 32-bit archs, the time is moved through an intermediate
copy to remove the assumption that time_t is a 32-bit type.
this is yet another place where special handling of time syscalls can
and should be avoided by implementing legacy functions in terms of
their modern replacements. in theory a fallback to SYS_settimeofday
could be added to clock_settime, but SYS_clock_settime has been
available since Linux 2.6.0 or earlier, i.e. all the way back to the
minimum supported version.
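as an illustration of expressing a legacy timeval interface in terms of its timespec replacement, the conversion step might look like the following. timeval_to_timespec is a hypothetical helper for this sketch, and rejecting an out-of-range tv_usec with EINVAL is an assumption:

```c
#include <errno.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: convert a legacy struct timeval into the
 * struct timespec taken by clock_settime. Returns 0 on success or
 * EINVAL if the microseconds field is out of range (an assumption
 * of this sketch). */
static int timeval_to_timespec(const struct timeval *tv, struct timespec *ts)
{
	if (tv->tv_usec < 0 || tv->tv_usec >= 1000000) return EINVAL;
	ts->tv_sec = tv->tv_sec;
	ts->tv_nsec = tv->tv_usec * 1000L;
	return 0;
}
```

the converted value can then be handed to clock_settime(CLOCK_REALTIME, ...), so that only the modern function needs to know about syscall variants.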
this commit has no effect whatsoever right now, but is in preparation
for a future riscv32 port and other future 32-bit archs that will be
"time64-only" from the start on the kernel side.
together with the previous x32 changes, this commit ensures that
syscall call points that don't care about time (passing null timeouts,
etc.) can continue to do so without having to special-case time64-only
archs, and allows code using the time64 syscalls to uniformly test for
the need to fall back with SYS_foo != SYS_foo_time64, rather than
needing to check defined(SYS_foo) && SYS_foo != SYS_foo_time64.
x32 is odd in that it's the only ILP32 arch/ABI we have where time_t
is 64-bit rather than (32-bit) long, and this has always been
problematic in that it results in struct timespec having unused
padding space, since tv_nsec has type long, which the kernel insists
be zero- or sign-extended (due to negative tv_nsec being invalid, it
doesn't matter which) to match the x86_64 type.
up until now, we've had really ugly hacks in x32/syscall_arch.h to patch
up the timespecs passed to the kernel. but the same requirement to
zero- or sign-extend tv_nsec also applies to all the new time64
syscalls on true 32-bit archs. so let's take advantage of this to
clean things up.
this patch defines all of the time64 syscalls for x32 as aliases for
the existing syscalls by the same name. this establishes the following
invariants:
- if the time64 form is defined, it takes time arguments as 64-bit
objects, and tv_nsec inputs must be zero-/sign-extended to 64-bit.
- if the time64 form is not defined, or if the time64 form is defined
and is not equal to the "plain" form, the plain form takes time
arguments as longs.
this will avoid the need for protocols for archs to define appropriate
types for each family of syscalls, and for the reader of the code to
have to be aware of such type definitions.
in some sense it might be simpler if the plain syscall form were
undefined for x32, so that it would always take longs if defined.
however, a number of these syscalls are used in contexts with a null
time argument, or (e.g. futex) for commands that don't involve time at
all, and having to introduce time64-specific logic to all those call
points does not make sense. thus, while the "plain" forms are kept now
just because they're needed until the affected code is converted over,
they'll also almost surely be kept in the future as well.
kernel support for x32 was added long after the utimensat syscall was
already available, so having a fallback is just wasted code size.
also, for changes related to time64 support on 32-bit archs, I want to
be able to assume the old futimesat syscall always works with longs,
which is true except for x32. by ensuring that it's not used on x32,
the needed invariant is established.
previously the fallback wrongly failed with EINVAL rather than ENOSYS
when UTIME_NOW was used with one component but not both. commit
dd5f50da6f introduced this behavior when
initially adding the fallback support.
instead, detect the case where both are UTIME_NOW early and replace
with a null times pointer; this may improve performance slightly (less
copy from user), and removes the complex logic from the fallback case.
it also makes things slightly simpler for adding time64 code paths.
for namespace-safety with thrd_sleep, this requires an alias, which is
also added. this eliminates all but one direct call point for
nanosleep syscalls, and arranges that 64-bit time_t conversion logic
will only need to exist in one file rather than three.
as a bonus, clock_nanosleep with CLOCK_REALTIME and empty flags is now
implemented as SYS_nanosleep, thereby working on older kernels that
may lack POSIX clocks functionality.
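a sketch of the layering, written against the public interfaces rather than the internal ones. sleep_via_clock_nanosleep is a hypothetical stand-in for nanosleep; the point it shows is that clock_nanosleep returns an error number directly, so the wrapper has to translate that into the errno convention:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical stand-in for nanosleep, expressed in terms of
 * clock_nanosleep with CLOCK_REALTIME and empty flags. */
static int sleep_via_clock_nanosleep(const struct timespec *req,
                                     struct timespec *rem)
{
	int r = clock_nanosleep(CLOCK_REALTIME, 0, req, rem);
	if (r) {
		errno = r;
		return -1;
	}
	return 0;
}
```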
commit 01ae3fc6d4 modified fstatat to
translate the kernel's struct stat ("kstat") into the libc struct stat.
To do this, it created a local kstat object, and copied its contents
into the user-provided object.
However, the commit neglected to update the fstat compatibility path and
its fallbacks. They continued to pass the user-supplied object to the
kernel, later overwriting it with the uninitialized memory in the local
temporary.
this sets the stage for having the conversion logic for 64-bit time_t
all in one file, and as a bonus makes clock_adjtime for CLOCK_REALTIME
work even on kernels too old to have the clock_adjtime syscall.
this commit adds a new backend for fstatat (and thereby the whole stat
family) using the SYS_statx syscall, but conditions the new code on
the kernel stat structure's time fields being smaller than time_t. in
principle that should make it all dead code at present, but mips64 has
a broken stat structure with 32-bit time fields despite having 64-bit
time_t elsewhere, so on mips64 it is a functional change that makes
post-Y2038 filesystem timestamps accessible.
whenever the 32-bit archs end up getting 64-bit time_t, regardless of
how that happens, the changes in this commit will automatically take
effect for them too.
AT_FDCWD is not a valid file descriptor, so POSIX requires fstat to
fail with EBADF. if passed to fstatat, the call would spuriously
succeed and return results for the working directory.
now that we have a kstat structure decoupled from the public struct
stat, we can just use the broken kernel structures directly and let
the code in fstatat do the translation.
presently, all archs/ABIs have struct stat matching the kernel
stat[64] type, except mips/mipsn32/mips64 which do conversion hacks in
syscall_arch.h to work around bugs in the kernel type. this patch
completely decouples them and adds a translation step to the success
path of fstatat. at present, this is just a gratuitous copying, but it
opens up multiple possibilities for future support for 64-bit time_t
on 32-bit archs and for cleaned-up/unified ABIs.
for clarity, the mips hacks are not yet removed in this commit, so the
mips kstat structs still correspond to the output of the hacks in
their syscall_arch.h files, not the raw kernel type. a subsequent
commit will fix this.
equivalent logic for fstat+O_PATH fallback and direct use of
stat/lstat syscalls where appropriate is kept, now in the fstatat
function. this change both improves functionality (now, fstatat forms
equivalent to fstat/lstat/stat will work even on kernels too old to
have the at functions) and localizes direct interfacing with the
kernel stat structure to one file.
these were overlooked during review. bits headers are not allowed to
pull in additional headers (note: that rule is currently broken in
other places but just for endian.h). string.h has no place here
anyway, and including bits/alltypes.h without defining macros to
request types from it is a nop.
the "A" constraint is simply for an address expression that's a single
register, but it's not yet supported by clang, and has no advantage
here over just using a register operand for the address. the latter is
actually preferable in the a_cas_p case because it avoids aliasing an
lvalue onto the memory.
most egregious problem was the lack of memory clobber and lack of
volatile asm; this made the atomics memory barriers but not compiler
barriers. use of "+r" rather than "=r" for a clobbered temp was also
wrong, since the initial value is indeterminate.
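to illustrate the compiler-barrier half of the problem in a portable way, here is a sketch using GNU C inline asm with no instructions at all. compiler_barrier and publish are hypothetical names; the real atomics additionally contain the arch's instructions:

```c
/* Hypothetical helper: "volatile" keeps the compiler from deleting or
 * moving the asm statement, and the "memory" clobber tells it that
 * memory may have been read or written, so it must not cache values
 * in registers or reorder memory accesses across this point. */
static inline void compiler_barrier(void)
{
	__asm__ __volatile__("" : : : "memory");
}

/* Hypothetical usage: the store to *data may not be moved past the
 * barrier by the compiler, so it is emitted before the store to
 * *ready. */
static void publish(int *data, int *ready, int value)
{
	*data = value;
	compiler_barrier();
	*ready = 1;
}
```

an asm statement lacking these two properties can still execute a hardware barrier instruction while the compiler freely reorders the surrounding accesses around it.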
having "+r"(a0) is redundant with "0"(a0) in syscalls with at least 1
arg, which is arguably a constraint violation (clang treats it as
such), and an invalid input with indeterminate value in the 0-arg
case. use the "=r"(a0) form instead.
mips n32 has 32-bit long, and generally uses long syscall arguments
and return values, but provides only SYS_lseek, not SYS_llseek. we
have some framework (syscall_arg_t, added for x32) to make syscall
arguments 64-bit in such a setting, but it's not clear whether this
could match the sign-extension semantics needed for 32-bit args to all
the other syscalls, and we don't have any existing mechanism to allow
the return value of syscalls to be something other than long.
instead, just provide a custom mipsn32 version of the lseek function
doing its own syscall asm with 64-bit arguments. as a result of commit
03919b26ed, stdio will also get the new
code, fixing fseeko/ftello too.
ever since inline syscalls were added for (o32) mips in commit
328810d325, the asm has nonsensically
loaded the syscall number, rather than taking $2 as an input
constraint to let the compiler load it. commit
cfc09b1ecf improved on this somewhat by
allowing a constant syscall number to propagate into an immediate, but
missed that the whole operation made no sense.
now, only $4, $5, $6, $8, and $9 are potential input-only registers.
$2 is always input and output, and $7 is both when it's an argument,
otherwise output-only. previously, $7 was treated as an input (with a
"1" constraint matching its output position) even when it was not an
input, which was arguably undefined behavior (asm input from
indeterminate value). this is corrected.
this patch is not purely non-functional changes, since before, $8 and
$9 were wrongly in the clobberlist for syscalls with fewer than 5 or 6
arguments. of course it's impossible for syscalls to have different
clobbers depending on their number of arguments. the clobberlist for
the recently-added 5- and 6-argument forms was correct, and for the 0-
to 4-argument forms was erroneously copied from the mips o32 ABI where
the additional arguments had to be passed on the stack.
in making this change, I reviewed the kernel sources, and $8 and $9
are always saved for 64-bit kernels since they're part of the syscall
argument list for n32 and n64 ABIs.
this probably saves a few bytes, avoids duplicating the clunky
lseek/_llseek syscall convention in two places, and sets the stage for
fixing broken seeks on x32 and mipsn32.
these additions were made by scanning git log since the last major
update in commit 1366b3c5e6.
as before my aim was adding everyone with either substantial code
contributions or a pattern of ongoing simple patch submission; any
omissions are unintentional.
a fully thumb1 build is not supported because some asm files are
incompatible with thumb1, but apparently it works to compile the C
code as thumb1.
commit 06fbefd100 caused this regression
by introducing use of the clz instruction, which is not supported in
arm mode prior to v5, and not supported in thumb prior to thumb2
(v6t2). commit 1b9406b03c fixed the
issue only for arm mode pre-v5 but left thumb1 broken.
In the public header, __errno_location is declared with the "const"
attribute, conditional on __GNUC__. Ensure that its internal alias has
the same attributes.
Maintainer's note: This change also fixes a regression in quality of
code generation -- multiple references to errno in a single function
started generating multiple calls again -- introduced by commit
e13063aad7.
Commit 3517d74a5e changed the token in
sys/ioctl.h from 0x01 to 1, so bits/termios.h no longer matches. Revert
the bits/termios.h change to keep the headers in sync.
This reverts commit 9eda4dc69c.
The old/new parameters to pthread_sigmask, sigprocmask, and setitimer
are marked restrict, so passing the same address to both is
prohibited. Modify callers of these functions to use a separate object
for each argument.
as reported by Tavian Barnes, a dup2 file action for the internal pipe
fd used by posix_spawn could cause it to remain open after execve and
allow the child to write an artificial error into it, confusing the
parent. POSIX allows internal use of file descriptors by the
implementation, with undefined behavior for poking at them, so this is
not a conformance problem, but it seems preferable to diagnose and
prevent the error when we can do so easily.
catch attempts to apply a dup2 action to the internal pipe fd and
emulate EBADF for it instead.
commit c8b49b2fbc introduced code that
checked bestsym to determine whether a matching symbol was found, but
bestsym is uninitialized if not. instead use best, consistent with use
in the rest of the function.
simplified from bug report and patch by Cheng Liu.
this was apparently copied from x86_64; it's not part of the kernel
API for riscv64. this change eliminates the need for a
riscv64-specific bits header and lets it use the generic one.
syscall numbers are now synced up across targets (starting from 403 the
numbers are the same on all targets other than an arch specific offset)
IPC syscalls sem*, shm*, msg* got added where they were missing (except
for semop: only semtimedop got added), the new semctl, shmctl, msgctl
imply IPC_64, see
linux commit 0d6040d4681735dfc47565de288525de405a5c99
arch: add split IPC system calls where needed
new 64bit time_t syscall variants got added on 32bit targets, see
linux commit 48166e6ea47d23984f0b481ca199250e1ce0730a
y2038: add 64-bit time_t syscalls to all 32-bit architectures
new async io syscalls got added, see
linux commit 2b188cc1bb857a9d4701ae59aa7768b5124e262e
Add io_uring IO interface
linux commit edafccee56ff31678a091ddb7219aba9b28bc3cb
io_uring: add support for pre-mapped user IO buffers
a new syscall got added that uses the fd of /proc/<pid> as a stable
handle for processes: allows sending signals without pid reuse issues,
intended to eventually replace rt_sigqueueinfo, kill, tgkill and
rt_tgsigqueueinfo, see
linux commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad
signal: add pidfd_send_signal() syscall
on some targets (arm, m68k, s390x, sh) some previously missing syscall
numbers got added as well.
Linux v5.1 introduced ipc syscalls on targets where previously only
SYS_ipc was available, change the logic such that the ipc code keeps
using SYS_ipc which works backward compatibly on older kernels.
This changes behaviour on microblaze which had both mechanisms, now
SYS_ipc will be used instead of separate syscalls.
to request or change pointer auth keys for criu via ptrace, new in
linux commit d0a060be573bfbf8753a15dca35497db5e968bb0
arm64: add ptrace regsets for ptrauth key management
RFC 4286: "The IPv4 multicast address for All-Snoopers is 224.0.0.106."
from
linux commit 4effd28c1245303dce7fd290c501ac2c11052114
bridge: join all-snoopers multicast address
SO_BINDTOIFINDEX behaves similarly to SO_BINDTODEVICE, but takes a
network interface index as argument, rather than the network
interface name. see
linux commit f5dd3d0c9638a9d9a02b5964c4ad636f06cf7e2c
net: introduce SO_BINDTOIFINDEX sockopt
restricts router alert packets received by the socket to the
socket's namespace only. see
linux commit 9036b2fe092a107856edd1a3bad48b83f2b45000
net: ipv6: add socket option IPV6_ROUTER_ALERT_ISOLATE
allows specifying that the speculative store bypass disable bit should
be cleared on exec. see
linux commit 71368af9027f18fe5d1c6f372cfdff7e4bde8b48
x86/speculation: Add PR_SPEC_DISABLE_NOEXEC
needed for android so it can migrate from its ashmem to memfd.
allows making the memfd readonly for future users while keeping
a writable mmap of it. see
linux commit ab3948f58ff841e51feb845720624665ef5b7ef3
mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd
includes changes from linux v5.1
linux commit 235328d1fa4251c6dcb32351219bb553a58838d2
fanotify: add support for create/attrib/move/delete events
linux commit 5e469c830fdb5a1ebaa69b375b87f583326fd296
fanotify: copy event fid info to user
linux commit e9e0c8903009477b630e37a8b6364b26a00720da
fanotify: encode file identifier for FAN_REPORT_FID
as well as earlier changes that were missed.
sys/statfs.h is included for fsid_t.
synccall may be called by AS-safe functions such as setuid/setgid after
fork. although fork() resets libc.threads_minus_one, causing synccall to
take the single-threaded path, synccall still takes the thread list
lock. This lock may be held by another thread if for example fork()
races with pthread_create(). After fork(), the value of the lock is
meaningless, so clear it.
maintainer's note: commit 8f11e6127f and
e4235d7067 introduced this regression.
the state protected by this lock is the linked list, which is entirely
replaced in the child path of fork (next=prev=self), so resetting it
is semantically sound.
the linux syscall treats this argument as having type int, so passing
extremely long buffer sizes would be misinterpreted by the kernel.
since "short reads" are always acceptable, just cap it down.
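the capping amounts to something like the following; cap_to_int_max is a hypothetical helper name:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: limit a buffer length to what a kernel
 * interface taking int can represent. */
static size_t cap_to_int_max(size_t len)
{
	return len > INT_MAX ? INT_MAX : len;
}
```

a caller that asked for more simply sees a short read and can call again.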
patch based on report and suggested change by Florian Weimer.
after commit a48ccc159a removed the use
of _Noreturn on the stage3_func type (which only worked due to it
being defined to the "GNU C" attribute in C99 mode), GCC could no
longer assume that the ends of __dls2 and __dls2b are unreachable, and
produced a warning that a function marked _Noreturn returns.
also, since commit 4390383b32, the
_Noreturn declaration for __libc_start_main in crt1/rcrt1 has been not
only inconsistent with the definition, but wrong. formally,
__libc_start_main does return, via a (hopefully) tail call to a helper
function after the barrier. incorrect usage of _Noreturn in the
declaration was probably formal UB.
the _Noreturn specifiers were not useful in any of these places, so
remove them all. now, the only remaining usage of _Noreturn is in
public interfaces where _Noreturn is part of their contract.
previously, POSIX erroneously required this to fail with EINVAL
despite the traditional glibc implementation, on which the POSIX
interface was based, allowing it. the resolution of Austin Group issue
818 removes the requirement to fail.
this reverts commit f552c792c7, which
exposed the sysmacros.h macros (device major/minor calculations) for
BSD and GNU profiles to mimic an unintentional glibc behavior some
code depended on. glibc has deprecated and since removed them as the
resolution to bug #19239, so it makes no sense for us to keep this
behavior. affected code should all have been fixed by now, and if it's
not yet fixed it needs to be for use with modern glibc anyway.
Author: Alex Suykov <alex.suykov@gmail.com>
Author: Aric Belsito <lluixhi@gmail.com>
Author: Drew DeVault <sir@cmpwn.com>
Author: Michael Clark <mjc@sifive.com>
Author: Michael Forney <mforney@mforney.org>
Author: Stefan O'Rear <sorear2@gmail.com>
This port has involved the work of many people over several years. I
have tried to ensure that everyone with substantial contributions has
been credited above; if any omissions are found they will be noted
later in an update to the authors/contributors list in the COPYRIGHT
file.
The version committed here comes from the riscv/riscv-musl repo's
commit 3fe7e2c75df78eef42dcdc352a55757729f451e2, with minor changes by
me for issues found during final review:
- a_ll/a_sc atomics are removed (according to the ISA spec, lr/sc
are not safe to use in separate inline asm fragments)
- a_cas[_p] is fixed to be a memory barrier
- the call from the _start assembly into the C part of crt1/ldso is
changed to allow for the possibility that the linker does not place
them near each other.
- DTP_OFFSET is defined correctly so that local-dynamic TLS works
- reloc.h LDSO_ARCH logic is simplified and made explicit.
- unused, non-functional crti/n asm files are removed.
- an empty .sdata section is added to crt1 so that the
__global_pointer reference is resolvable.
- indentation style errors in some asm files are fixed.
with the glibc generation counter model for reusing dynamic tls slots
after dlclose, it's really not possible to get away with fewer than 4
working registers. for us however it's always been possible, but
tricky, and only became apparent after the switch to installing new
dynamic tls at dlopen time. by merging the negated thread pointer into
the addend early, the register holding the thread pointer can
immediately be reused, bringing the working register count down to
three. this allows saving/restoring via a single stp/ldp pair, since
the return register x0 does not need to be saved.
net reduction of 3 instructions, 2 of which were push/pop.
between v2 and v3 of the powerpc64 port patch, the change was made
from a 32x4 array of 32-bit unsigned ints for vrregs[] to a 32-element
array of __int128. this mismatches the API applications working with
mcontext_t expect from glibc, and seems to have been motivated by a
misinterpretation of a comment on how aarch64 did things as a
suggestion to do the same on powerpc64.
the mistaken layout seems to have been adapted from 32-bit powerpc,
where vscr and vrsave are packed into the same 128-bit slot in a way
that looks like it relies on non-overlapping-ness of the value bits in
big endian.
the powerpc64 port accounted for the fact that the 64-bit ABI puts
each in its own 128-bit slot, but ordered them incorrectly (matching
the bit order used on the 32-bit ABI), and failed to account for vscr
being padded according to endianness so that it can be accessed via
vector moves.
in addition to ABI layout, our definition used different logical
member layout/naming from glibc, where vscr is a structure to
facilitate access as a 32-bit word or a 128-bit vector. the
inconsistency here was unintentional, so fix it.
currently the bfd linker does not seem to create tls segments where
p_vaddr%p_align != 0, but this is valid in ELF and then the runtime
computed tls offset must satisfy
offset%p_align == (base+p_vaddr)%p_align
and in case of local exec tls (main executable) the smallest such
offset must be used (otherwise it is incompatible with the offset
computed by the static linker). the !TLS_ABOVE_TP case is handled
correctly (the offset is negative then in the formula).
the ldso code for TLS_ABOVE_TP is changed so the static tls offset
of each module satisfies the formula.
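the smallest offset satisfying the formula can be computed as in this sketch. tls_offset_for is a hypothetical helper, and it assumes that p_align is a power of two and that base is itself a multiple of p_align, so that only p_vaddr contributes to the residue:

```c
#include <stddef.h>

/* Hypothetical helper: return the smallest offset >= start such that
 * offset % p_align == p_vaddr % p_align. Unsigned wraparound in the
 * subtraction is harmless because only the low bits are kept. */
static size_t tls_offset_for(size_t start, size_t p_vaddr, size_t p_align)
{
	return start + ((p_vaddr - start) & (p_align - 1));
}
```

for example, with start 10, p_vaddr 0x1008 and p_align 16, the required residue is 8 and the result is 24.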
tls_offset should always point to the end of the allocated static tls
area, but this was not handled correctly on "tls variant 1" targets
in the dynamic linker:
after application tls was allocated, tls_offset was aligned up,
potentially wasting tls space. (alignment may be needed at the
beginning of the tls area, not at the end, but that will be fixed
separately as it is unlikely to affect real binaries.)
when static tls was allocated for a shared library, tls_offset was
only updated with the size of the tls segment which does not include
alignment gaps, which can easily happen if the tls size update for
one library leaves tls_offset misaligned for the next one. this can
cause oob access in __copy_tls or arbitrary breakage at tls access.
(the issue was observed on aarch64 with rust binaries)
commit 648c3b4e18 omitted this change,
which is needed to be able to use uid/gid values greater than INT_MAX
with these interfaces. it fixes alpine linux bug #10460.
we have to avoid using ebx unconditionally in asm constraints for
i386, because gcc 3 and 4 and possibly other simplistic compilers
(pcc?) implement PIC via making ebx a fixed-use register, and disallow
its use for anything else. rather than hard-coding knowledge of which
compilers work (at least gcc 5+ and clang), perform a configure test;
this should give us the good codegen on any new compilers we don't yet
know about.
swapping ebx and edx is kept for 1- and 2-arg syscalls because it
avoids having any spills/stack-frame at all in small functions. for
6-arg, if ebx is directly usable, the complex shuffling introduced in
commit c8798ef974 can be avoided, and
ebp can be loaded the same way ebx is in 5-arg syscalls for compilers
that don't support direct use of ebx.
commit 22e5bbd0de inlined the i386
syscall mechanism, but wrongly assumed memory operands to the 5- and
6-argument syscall asm would be esp-based. however, nothing in the
constraints prevented them from being ebx- or ebp-based, and in those
cases, ebx and ebp could be clobbered before use of the memory operand
was complete. in the 6-argument case, this prevented restoration of
the original register values before the end of the asm block, breaking
the asm contract since ebx and ebp are not marked as clobbered. (they
can't be, because lots of compilers don't accept these registers in
constraints or clobbers if PIC or frame pointer is enabled).
doing this right is complicated by the fact that, after a single push,
no operands which might be memory operands are usable. if they are
esp-based, the value of esp has changed, rendering them invalid.
introduce some new dances to load the registers. for the 5-arg case,
push the operand that may be a memory operand first, and after that,
it doesn't matter if the operand is invalid, since we'll just use the
newly pushed value. for the 6-arg case, we need to put both operands
in memory to begin with, like the old non-inline code prior to commit
22e5bbd0de accepted, so that there's
only one potentially memory-based operand to the asm. this can then be
saved with a single push, and after that the values can be read off
into the registers they're needed in.
there's some size overhead, but still a lot less execution overhead
than the old out-of-line code. doing it better depends on a modern
compiler that lets you use ebx and ebp in asm constraints without
restriction. the failure modes on compilers where this doesn't work
are inconsistent and dangerous (on at least some gcc versions 4.x and
earlier, wrong codegen!), so this is a delicate matter. it can be
addressed later if needed.
this is a requirement in POSIX that's omitted, and seemed potentially
non-conforming, in the C standard. as such it was omitted here.
however, as part of Austin Group issue #1170, the discrepancy was
raised with WG14 and determined to be unintended; future versions of
the C standard will require the error indicator to be set, as POSIX
does.
commit 788d5e24ca exposed the breakage
at build time by removing support for 7-argument syscalls; however,
the external __syscall function provided for mips before did not pass
a 7th argument from the stack, so the behavior was just silently
broken.
commit 788d5e24ca noted that we could
add this if needed, and in fact it is needed, but not for one of the
archs documented as having a 7th syscall arg register. rather, it's
needed for mips (o32), where all but the first 4 arguments are passed
on the stack, and the stack can accommodate a 7th.
this has been wrong since the beginning of the microblaze port: the
syscall ABI for microblaze does not align 64-bit arguments on even
register boundaries. commit 788d5e24ca
exposed the problem by introducing references to a nonexistent
__syscall7. the ABI is not documented well anywhere, but I was able to
confirm against both strace source and glibc source that microblaze is
not using the alignment.
per the syscall(2) man page, posix_fadvise, ftruncate, pread, pwrite,
readahead, sync_file_range, and truncate were all affected and either
did not work at all, or only worked by chance, e.g. when the affected
argument slots were all zero.
analogous to commit efda534b21 for
powerpc. commit 587f5a53bc moved the
definition of SO_PEERSEC to bits/socket.h for archs where the SO_*
macros differ.
commit b50d315fd2 introduced
fp_force_eval implemented by default with a dead store to a volatile
variable. unfortunately this introduces warnings with -Wunused-variable and
breaks the ability to use -Werror with the default warning options set
by configure when warnings are enabled.
we could just call fp_barrier instead, but that results in a spurious
load after the store due to volatile semantics.
the fix committed here avoids the load. it will still produce warnings
without -Wno-unused-but-set-variable, but that's part of our default
warning profile, and there are already other locations in the source
where an unused variable warning will occur without it.
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
The underflow exception is signaled if the result is in the subnormal
range even if the result is exact.
code size change: +3421 bytes.
benchmark on x86_64 before, after, speedup:
-Os:
pow rthruput: 102.96 ns/call 33.38 ns/call 3.08x
pow latency: 144.37 ns/call 54.75 ns/call 2.64x
-O3:
pow rthruput: 98.91 ns/call 32.79 ns/call 3.02x
pow latency: 138.74 ns/call 53.78 ns/call 2.58x
from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
POWF_SCALE != 1.0 case only matters if TOINT_INTRINSICS is set, which
is currently not supported for any target.
SNaN is not supported, it would require an issignalingf
implementation.
code size change: -816 bytes.
benchmark on x86_64 before, after, speedup:
-Os:
  powf rthruput:  95.14 ns/call  20.04 ns/call  4.75x
   powf latency: 137.00 ns/call  34.98 ns/call  3.92x
-O3:
  powf rthruput:  92.48 ns/call  13.67 ns/call  6.77x
   powf latency: 131.11 ns/call  35.15 ns/call  3.73x
Musl currently aims to support non-nearest rounding mode and does not
support SNaNs. These macros allow marking relevant code paths in case
these decisions are changed later (they also help document the
corner cases involved).
These don't have an effect with -Os, so they are not useful with default
settings other than documenting the expectation.
With --enable-optimize=internal,malloc,string,math the libc.so code size
increases by 18K on x86_64 and performance varies in -2% .. +10%.
These are supposed to be used in tail call positions when handling
special cases in new code. (fp exceptions may be raised "naturally"
by the common code path if special casing is more effort.)
This implements the error handling apis used in
https://github.com/ARM-software/optimized-routines
without errno setting.
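A minimal sketch of what such helpers might look like, assuming the usual trick of producing the exception through ordinary arithmetic. The names are hypothetical and errno is deliberately not touched, matching the description above.

```c
#include <stdint.h>

/* keeps the operand from being constant folded, so that the division
 * below is evaluated at run time */
static inline double opaque_sketch(double x)
{
	volatile double y = x;
	return y;
}

/* hypothetical: returns +inf or -inf; the division by zero is what
 * raises the divide-by-zero exception */
static double divzero_sketch(uint32_t sign)
{
	return opaque_sketch(sign ? -1.0 : 1.0) / 0.0;
}

/* hypothetical: returns NaN; for a non-NaN x the 0/0 (or inf-inf)
 * is what raises the invalid exception */
static double invalid_sketch(double x)
{
	return (x - x) / 0.0;
}
```

A caller would typically end a special-case branch with something like `return divzero_sketch(1);`, so that the helper call sits in tail position.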
Previously type casts or assignments were used for handling excess
precision, which assumed standard C99 semantics, but since it's a
rarely needed obscure detail, it's better to use explicit helper
functions to document where we rely on this. It also helps if the
code is used outside of the libc in non-C99 compilation mode: with the
default excess precision handling of gcc, explicit inline asm barriers
are needed for narrowing on FLT_EVAL_METHOD!=0 targets.
I plan to use this in new code with the existing style that uses
double_t and float_t as much as possible.
One ugliness is that it is required for almost every return statement
since that does not drop excess precision (the standard changed this
in C11 annex F, but that does not help in non-standard compilation
modes or with old compilers).
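A minimal sketch of such helpers, under hypothetical names. This version relies on standard C99 semantics, where an assignment drops excess precision; as noted above, a non-C99 compilation mode would need an explicit barrier instead.

```c
#include <math.h> /* only for the float_t type */

/* narrowing by assignment: under C99 rules the stored value has
 * exactly the precision of the declared type */
static inline float eval_as_float_sketch(float x)
{
	float y = x;
	return y;
}

static inline double eval_as_double_sketch(double x)
{
	double y = x;
	return y;
}

/* usage: intermediate work is done in float_t, and the helper marks
 * the point where excess precision must be rounded away */
static float scale_sketch(float x)
{
	float_t t = x * 1.5f;
	return eval_as_float_sketch(t);
}
```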
C99 has ways to support fenv access, but compilers don't implement it
and assume nearest rounding mode and no fp status flag access. (gcc has
-frounding-math and then it does not assume nearest rounding mode, but
it still assumes the compiled code itself does not change the mode.
Even if the C99 mechanism was implemented it is not ideal: it requires
all code in the library to be compiled with FENV_ACCESS "on" to make it
usable in non-nearest rounding mode, but that limits optimizations more
than necessary.)
The math functions should give reasonable results in all rounding modes
(but the quality may be degraded in non-nearest rounding modes) and the
fp status flag settings should follow the spec, so fenv side-effects are
important and code transformations that break them should be prevented.
Unfortunately compilers don't give any help with this; the best we can
do is to add fp barriers to the code using volatile local variables
(they create a stack frame and undesirable memory accesses to it) or
inline asm (gcc specific, requires target specific fp reg constraints,
often creates unnecessary reg moves and multiple barriers are needed to
express that an operation has side-effects) or extern call (only useful
in tail-call position to avoid stack-frame creation and does not work
with lto).
We assume that in a math function if an operation depends on the input
and the output depends on it, then the operation will be evaluated at
runtime when the function is called, producing all the expected fenv
side-effects (this is not true in case of lto and in case the operation
is evaluated with excess precision that is not rounded away). So fp
barriers are needed (1) to prevent the move of an operation within a
function (in case it may be moved from an unevaluated code path into an
evaluated one or if it may be moved across a fenv access), (2) force the
evaluation of an operation for its side-effect when it has no input
dependency (may be constant folded) or (3) when its output is unused. I
believe that fp_barrier and fp_force_eval can take care of these and they
should not be needed in hot code paths.
Nothing is left from the original fdlibm header nor from the bsd
modifications to it other than some internal api declarations.
Comments are dropped that may be copyrightable content.
This makes it easier to build musl math code with a compiler that
does not support complex types (tcc) and in general more sensible
factorization of the internal headers.
FP_FAST_FMA can be defined if "the fma function generally executes about
as fast as, or faster than, a multiply and an add of double operands",
which can only be true if the fma call is inlined as an instruction.
gcc sets __FP_FAST_FMA if __builtin_fma is inlined as an instruction,
but that does not mean an fma call will be inlined (e.g. it is defined
with -fno-builtin-fma); other compilers (clang) don't even have such a
macro, but this is the closest we can get.
(even if the libc fma implementation is a single instruction, the extern
call overhead is already too big when the macro is used to decide between
x*y+z and fma(x,y,z) so it cannot be based on libc only, defining the
macro unconditionally on targets which have fma in the base isa is also
incorrect: the compiler might not inline fma anyway.)
this solution works with gcc unless fma inlining is explicitly turned off.
POSIX: "[If] either O_TTY_INIT is set in oflag or O_TTY_INIT has the
value zero, open() shall set any non-standard termios structure
terminal parameters to a state that provides conforming behavior."
The Linux kernel tty drivers always perform initialisation on their
devices to set known good termios values during the open(2) call. This
means that setting O_TTY_INIT to zero is conforming.
the weak version of __syscall_cp_c was using a tail call to __syscall
to avoid duplicating the 6-argument syscall code inline in small
static-linked programs, but now that __syscall no longer exists, the
inline expansion is no longer duplication.
the syscall.h machinery supported up to 7 syscall arguments, only via
an external __syscall function, but we presently have no syscall call
points that actually make use of that many, and the kernel only
defines 7-argument calling conventions for arm, powerpc (32-bit), and
sh. if it turns out we need them in the future, they can easily be
added.
n32 and n64 ABIs add new argument registers vs o32, so that passing on
the stack is not necessary, so it's not clear why the 5- and
6-argument versions were special-cased to begin with; it seems to have
been pattern-copying from arch/mips (o32).
i've treated the new argument registers like the first 4 in terms of
clobber status (non-clobbered). hopefully this is correct.
the OABI passes these on the stack, using the convention that their
position on the stack is as if the first four arguments (in registers)
also had stack slots. originally this was deemed too awkward to do
inline, falling back to external __syscall, but it's not that bad and
now that external __syscall is being removed, it's necessary.
the inline syscall code is copied directly from powerpc64. the extent
of register clobber specifiers may be excessive on both; if that turns
out to be the case it can be fixed later.
it was never demonstrated to me that this workaround was needed, and
it seems likely that, if there ever was any clang version for which it
was needed, it's old enough to be unusably buggy in other ways. if it
turns out some compilers actually can't do the register allocation
right, we'll need to replace this with inline shuffling code, since
the external __syscall dependency is being removed.
this is the first part of a series of patches intended to make
__syscall fully self-contained in the object file produced using
syscall.h, which will make it possible for crt1 code to perform
syscalls.
the (confusingly named) i386 __vsyscall mechanism, which this commit
removes, was introduced before the presence of a valid thread pointer
was mandatory; back then the thread pointer was set up lazily only if
threads were used. the intent was to be able to perform syscalls using
the kernel's fast entry point in the VDSO, which can use the sysenter
(Intel) or syscall (AMD) instruction instead of int $128, but without
inlining an access to the __syscall global at the point of each
syscall, which would incur a significant size cost from PIC setup
everywhere. the mechanism also shuffled registers/calling convention
around to avoid spills of call-saved registers, and to avoid
allocating ebx or ebp via asm constraints, since there are plenty of
broken-but-supported compiler versions which are incapable of
allocating ebx with -fPIC or ebp with -fno-omit-frame-pointer.
the new mechanism preserves the properties of avoiding spills and
avoiding allocation of ebx/ebp in constraints, but does it inline,
using some fairly simple register shuffling, and uses a field of the
thread structure rather than global data for the vdso-provided syscall
code address.
for now, the external __syscall function is refactored not to use the
old __vsyscall so it can be kept, but the intent is to remove it too.
this is a workaround to avoid a crashing regression on qemu-user when
dynamic TLS is installed at dlopen time. the sigaction syscall should
not be able to fail, but it does fail for implementation-internal
signals under qemu user-level emulation if the host libc qemu is
running under reserves the same signals for implementation-internal
use, since qemu makes no provision to redirect/emulate them. after
sigaction fails, the subsequent tkill would terminate the process
abnormally as the default action.
no provision to account for membarrier failing is made in the dynamic
linker code that installs new TLS. at the formal level, the missing
barrier in this case is incorrect, and perhaps we should fail the
dlopen operation, but in practice all the archs we support (and
probably all real-world archs except alpha, which isn't yet supported)
should give the right behavior with no barrier at all as a consequence
of consume-order properties.
in the long term, this workaround should be supplemented or replaced
by something better -- a different fallback approach to ensuring
memory consistency, or dynamic allocation of implementation-internal
signals. the latter is appealing in that it would allow cancellation
to work under qemu-user too, and would even allow many levels of
nested emulation.
This parameter was incorrectly declared to be a pointer to a function
accepting zero parameters. The intent of makecontext is that it is
possible to pass integer parameters to the function, so this should
have been a pointer to a function accepting an unspecified set of
parameters.
Mark atanhi, atanlo, and aT in atanl.c as static, as they're not
intended to be part of the public API.
These are already static in the LDBL_MANT_DIG == 64 code, so this
patch is just making the LDBL_MANT_DIG == 113 code do the same thing.
The result is the same but takes less code.
Note that __execvpe calls getenv which calls __strchrnul so even
using static output the size of the executable won't grow.
commit 54ca677983 inadvertently
introduced bitwise and where logical and was intended. since the
right-hand operand is always 0 or -1 whenever the left-hand operand is
nonzero, the behavior happened to be equivalent.
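a small illustration of why the two operators agree under that condition, using hypothetical function names. with a right-hand operand of -1 every bit is set, so the bitwise result is nonzero exactly when the left-hand operand is; with 0 both forms are false.

```c
/* what was written: bitwise and, then tested for truth */
static int test_bitwise(int lhs, int rhs)
{
	return (lhs & rhs) != 0;
}

/* what was intended: logical and */
static int test_logical(int lhs, int rhs)
{
	return lhs && rhs;
}
```

for other right-hand values the two can differ, for example lhs 1 and rhs 2, which is why the logical form is the right one to keep.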
priority inheritance is a feature to mitigate priority inversion
situations, where execution of a medium-priority thread can
unboundedly block forward progress of a high-priority thread when a
lock it needs is held by a low-priority thread.
the natural way to do priority inheritance would be with a simple
futex flag to donate the calling thread's priority to a target thread
while it waits on the futex. unfortunately, linux does not offer such
an interface, but instead insists on implementing the whole locking
protocol in kernelspace with special futex commands that exist solely
for the purpose of doing PI mutexes. this would require the entire
"trylock" logic to be duplicated in the timedlock code path for PI
mutexes, since, once the previous lock holder releases the lock and
the futex call returns, the lock is already held by the caller.
obviously such code duplication is undesirable.
instead, I've made the PI timedlock success path set the mutex lock
count to -1, which can be thought of as "not yet complete", since a
lock count of 0 is "locked, with no recursive references". a simple
branch in a non-hot path of pthread_mutex_trylock can then see and act
on this state, skipping past the code that would check and take the
lock to the same code path that runs after the lock is obtained for a
non-PI mutex.
because we're forced to let the kernel perform the actual lock and
unlock operations whenever the mutex is contended, we have to patch
things up when it does the wrong thing:
1. the lock operation is not aware of whether the mutex is
error-checking, so it will always fail with EDEADLK rather than
deadlocking.
2. the lock operation is not aware of whether the mutex is robust, so
it will successfully obtain mutexes in the owner-died state even if
they're non-robust, whereas this operation should deadlock.
3. the unlock operation always sets the lock value to zero, whereas
for robust mutexes, we want to set it to a special value indicating
that the mutex obtained after its owner died was unlocked without
marking it consistent, so that future operations all fail with
ENOTRECOVERABLE.
the first of these is easy to solve, just by performing a futex wait
on a dummy futex address to simulate deadlock or ETIMEDOUT as
appropriate. but problems 2 and 3 interact in a nasty way. to solve
problem 2, we need to back out the spurious success. but if waiters
are present -- which we can't just ignore, because even if we don't
want to wake them, the calling thread is incorrectly inheriting their
priorities -- this requires using the kernel's unlock operation, which
will zero the lock value, thereby losing the "owner died with lock
held" state.
to solve these problems, we overload the mutex's waiters field, which
is unused for PI mutexes since they don't call the normal futex wait
functions, as an indicator that the PI mutex is permanently
non-lockable. originally I wanted to use the count field, but there is
one code path that needs to access this flag without synchronization:
trylock's CAS failure path needs to be able to decide whether to fail
with EBUSY or ENOTRECOVERABLE. the waiters field is already treated as
a relaxed-order atomic in our memory model, so this works out nicely.
there was no point in masking off the pshared bit when first loading
the type, since every subsequent access involves a mask anyway. not
masking it may avoid a subsequent load to check the pshared flag, and
it's just simpler.
commit 84d061d5a3 wrongly moved the
access to the global next_key outside of the scope of the lock. the
error manifested as spurious failure to find an available key slot
under concurrent calls to pthread_key_create, since the stopping
condition could be met after only a small number of slots were
examined.
commit d6c855caa8 caused this
"regression", though the behavior was undefined before, overlooking
that f->shend=0 was being used as a sentinel for "EOF" status (actual
EOF or hitting the scanf field width) of the stream helper (shgetc)
functions.
obviously the shgetc macro could be adjusted to check for a null
pointer in addition to the != comparison, but it's the hot path, and
adding extra code/branches to it begins to defeat the purpose.
so instead of setting shend to a null pointer to block further reads,
which no longer works, set it to the current position (rpos). this
makes the shgetc macro work with no change, but it breaks shunget,
which can no longer look at the value of shend to determine whether to
back up. Szabolcs Nagy suggested a solution which I'm using here:
setting shlim to a negative value is inexpensive to test at shunget
time, and automatically re-trips the cnt>=shlim stop condition in
__shgetc no matter what the original limit was.
commit 2de29bc994 left behind one
reference to pthread_mutex_trylock. fixing this also improves code
generation due to the namespace-safe version being hidden.
HWCAP_SB - speculation barrier instruction available added in linux
commit bd4fb6d270bc423a9a4098108784f7f9254c4e6d
HWCAP_PACA, HWCAP_PACG - pointer authentication instructions available
(address and generic) added in linux commit
7503197562567b57ec14feb3a9d5400ebc56812f
aarch64 pointer authentication code related prctl that allows
reinitializing the key for the thread, added in linux commit
ba830885656414101b2f8ca88786524d4bb5e8c1
NT_MIPS_MSA for ptrace access to mips simd arch reg set, added in linux
commit 3cd640832894b85b5929d5bda74505452c800421
NT_ARM_PAC_MASK for ptrace access to pointer auth code mask, added in
commit ec6e822d1a22d0eef1d1fa260dff751dba9a4258
C-SKY support was added to binutils 2.32 in commit
b8891f8d622a31306062065813fc278d8a94fe21
the elf.h change was added to glibc 2.29 in commit
4975f0c3d0131fdf697be0b1631c265e5fd39088
NT_MIPS_FP_MODE is new in linux commit
1ae22a0e35636efceab83728ba30b013df761592
NT_MIPS_DSP is new in linux commit
44109c60176ae73924a42a6bef64ef151aba9095
new fields for RFC 4898 tcp stats in linux
tcpi_bytes_sent added in commit ba113c3aa79a7f941ac162d05a3620bdc985c58d
tcpi_bytes_retrans added in commit fb31c9b9f6c85b1bad569ecedbde78d9e37cd87b
tcpi_dsack_dups added in commit 7e10b6554ff2ce7f86d5d3eec3af5db8db482caa
tcpi_reord_seen added in commit 7ec65372ca534217b53fd208500cf7aac223a383
The new fields change the size of a public struct and are thus an ABI break,
but this is how the getsockopt TCP_INFO api is designed: the tcp_info
type must only be used with a length parameter in extern interfaces.
inotify_add_watch flag to prevent modifying existing watch descriptors;
when used on an already watched inode it fails with EEXIST.
added in linux commit 4d97f7d53da7dc830dbf416a3d2a6778d267ae68
The original logic considered each byte until it either found a 0
value or a value >= 192. This means if a string segment contained any
byte >= 192 it was interpreted as a compressed segment marker even
if it wasn't in a position where it should be interpreted as such.
The fix is to adjust dn_skipname to increment by each segment's size
rather than look at each character. This avoids misinterpreting
string segment characters by not considering those bytes.
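A minimal sketch of the length-stepping approach, under a hypothetical name and not necessarily identical to the committed code. Only bytes in length position are examined, so a label byte that happens to be >= 192 is skipped along with the rest of its segment.

```c
#include <stddef.h>

/* hypothetical sketch: returns the encoded length of the name that
 * starts at s, or -1 if it would run past end */
static int skip_name_sketch(const unsigned char *s, const unsigned char *end)
{
	const unsigned char *p = s;
	while (p < end) {
		/* a zero length byte terminates the name */
		if (*p == 0) return (int)(p - s) + 1;
		/* a compression marker occupies two bytes and ends the name */
		if (*p >= 192) {
			if (p + 1 < end) return (int)(p - s) + 2;
			break;
		}
		/* the segment must fit in the remaining input */
		if (end - p < *p + 1) break;
		/* step over the length byte and the segment it describes */
		p += *p + 1;
	}
	return -1;
}
```

For the input bytes 3, 0xc8, 'b', 'c', 0 this steps from the length byte straight to the terminator and reports 5, whereas a byte-by-byte scan would have stopped at 0xc8.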
On s390x, POSIX_FADV_DONTNEED and POSIX_FADV_NOREUSE have different
values than on all other architectures that Linux supports.
Handle this difference by wrapping their definitions in
include/fcntl.h in #ifdef, so that arch/s390x/bits/fcntl.h can
override them.
as noted in Austin Group issue #1236, the XSI shading for TSVTX is
misplaced in the html version of the standard; it was only supposed to
be on the description text. the intent was that the definition always
be visible, which is reflected in the pdf version of the standard.
this reverts commits d93c0740d8 and
729fef0a93.
C11 removed the requirement that FILE be a complete type, which was
deemed erroneous, as part of the changes introduced by N1439 regarding
completeness of types (see footnote 6 for specific mention of FILE).
however the current version of POSIX is still based on C99 and
incorporates the old requirement that FILE be a complete type.
expose an arbitrary, useless complete type definition because the
actual object used to represent FILE streams cannot be public/ABI.
thanks to commit 13d1afa46f, we now have
a framework for suppressing the public complete-type definition of FILE
when stdio.h is included internally, so that a different internal
definition can be provided. this is perfectly well-defined, since the
same struct tag can refer to different types in different translation
units. it would be a problem if the implementation were accessing the
application's FILE objects or vice versa, but either would be
undefined behavior.
this affected the error path where dlopen successfully found and
loaded the requested dso and all its dependencies, but failed to
resolve one or more relocations, causing the operation to fail after
storage for the ctor queue was allocated.
commit 188759bbee wrongly put the free
for the ctor_queue array in the error path inside a loop over each
loaded dso that needed to be backed-out, rather than just doing it
once. in addition, the exit path also observed the ctor_queue pointer
still being nonzero, and would attempt to call ctors on the backed-out
dsos unless the double-free crashed the process first.
historically, and likely accidentally, sigaltstack was specified to
fail with EINVAL if any flag bit other than SS_DISABLE was set. the
resolution of Austin Group issue 1187 fixes this so that the
requirement is only to fail for SS_ONSTACK (which cannot be set) or
"invalid" flags.
Linux fails on the kernel side for invalid flags, but historically
accepts SS_ONSTACK as a no-op, so it needs to be rejected in userspace
still.
with this change, the Linux-specific SS_AUTODISARM, provided since
commit 9680e1d03a but unusable due to
rejection at runtime, is now usable.
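a minimal sketch of the userspace check this implies, with hypothetical names. the flag constants are local stand-ins (assumed to match the Linux values) rather than the ones from signal.h, and everything other than SS_ONSTACK is left for the kernel to accept or reject.

```c
#include <errno.h>

/* assumption: local stand-ins for SS_ONSTACK, SS_DISABLE and
 * SS_AUTODISARM as defined on Linux */
#define SKETCH_SS_ONSTACK    1u
#define SKETCH_SS_DISABLE    2u
#define SKETCH_SS_AUTODISARM (1u << 31)

/* returns 0 if the flags may be handed to the kernel, else EINVAL */
static int check_ss_flags_sketch(unsigned flags)
{
	/* the kernel historically accepts this as a no-op, so it has
	 * to be rejected here */
	if (flags & SKETCH_SS_ONSTACK) return EINVAL;
	/* any other flag, valid or not, is the kernel's decision */
	return 0;
}
```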
together with the previous two commits, this completes restoration of
the property that dynamic-linked apps with no external deps and no tls
have no failure paths before entry.
neither has nor can have any dependencies, but since commit
4035556907, gratuitous zero-length deps
arrays were being allocated for them. use a dummy array instead.
traditionally, we've provided a guarantee that dynamic-linked
applications with no external dependencies (nothing but libc) and no
thread-local storage have no failure paths before the entry point.
normally, thanks to reclaim_gaps, such a malloc will not require a
syscall anyway, but if segment alignment is unlucky, it might. use a
builtin array for this common special case.
in the case where malloc is being replaced, it's not valid to call
malloc between final relocations and main app's crt1 entry point; on
fdpic archs the main app's entry point will not yet have performed the
self-fixups necessary to call its code.
to fix, reorder queue_ctors before final relocations. an alternative
solution would be doing the allocation from __libc_start_init, after
the entry point but before any ctors run. this is less desirable,
since it would leave a call to malloc that might be provided by the
application happening at startup when doing so can be easily avoided.
previously, going way back, there was simply no synchronization here.
a call to exit concurrent with ctor execution from dlopen could cause
a dtor to execute concurrently with its corresponding ctor, or could
cause dtors for newly-constructed libraries to be skipped.
introduce a shutting_down state that blocks further ctor execution,
producing the quiescence the dtor execution loop needs to ensure any
kind of consistency, and that blocks further calls to dlopen so that a
call into dlopen from a dtor cannot deadlock.
better approaches to some of this may be possible, but the changes
here at least make things safe.
previously, shared library constructors at program start and dlopen
time were executed in reverse load order. some libraries, however,
rely on a depth-first dependency order, which most other dynamic
linker implementations provide. this is a much more reasonable, less
arbitrary order, and it turns out to have much better properties with
regard to how slow-running ctors affect multi-threaded programs, and
how recursive dlopen behaves.
this commit builds on previous work tracking direct dependencies of
each dso (commit 4035556907), and
performs a topological sort on the dependency graph at load time while
the main ldso lock is held and before success is committed, producing
a queue of constructors needed by the newly-loaded dso (or main
application). in the case of circular dependencies, the dependency
chain is simply broken at points where it becomes circular.
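a minimal sketch of a depth-first ordering with cycle breaking, as described above. struct lib and queue_ctors_sketch are hypothetical, and the recursion is only for brevity; this is not a claim about how the committed code walks the graph.

```c
#include <stddef.h>

#define MAX_DEPS 4

/* hypothetical stand-in for a loaded library */
struct lib {
	const char *name;
	struct lib *deps[MAX_DEPS + 1]; /* null-terminated direct deps */
	int mark;                       /* nonzero once visited */
};

/* appends p's dependencies and then p itself to queue, returning the
 * new number of entries */
static size_t queue_ctors_sketch(struct lib *p, struct lib **queue, size_t n)
{
	/* already queued, or reached again through a cycle: stop here,
	 * which is what breaks a circular chain */
	if (p->mark) return n;
	p->mark = 1;
	for (size_t i = 0; p->deps[i]; i++)
		n = queue_ctors_sketch(p->deps[i], queue, n);
	/* every dependency is queued before the library that needs it */
	queue[n++] = p;
	return n;
}
```

for a library a that depends on b and c, where b also depends on c, the resulting queue is c, b, a.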
when the ctor queue is run, the init_fini_lock is held only for
iteration purposes; it's released during execution of each ctor, so
that arbitrarily-long-running application code no longer runs with a
lock held in the caller. this prevents a dlopen with slow ctors in one
thread from arbitrarily delaying other threads that call dlopen.
fully-independent ctors can run concurrently; when multiple threads
call dlopen with a shared dependency, one will end up executing the
ctor while the other waits on a condvar for it to finish.
another corner case improved by these changes is recursive dlopen
(call from a ctor). previously, recursive calls to dlopen could cause
a ctor for a library to be executed before the ctor for its
dependency, even when there was no relation between the calling
library and the library it was loading, simply due to the naive
reverse-load-order traversal. now, we can guarantee that recursive
dlopen in non-circular-dependency usage preserves the desired ctor
execution order properties, and that even in circular usage, at worst
the libraries whose ctors call dlopen will fail to have completed
construction when ctors that depend on them run.
init_fini_lock is changed to a normal, non-recursive mutex, since it
is no longer held while calling back into application code.
this makes calling dlsym on the main app more consistent with the
global symbol table (load order), and is a prerequisite for
dependency-order ctor execution to work correctly with LD_PRELOAD.
commit 4035556907 introduced runtime
realloc of an array that may have been allocated before symbols were
resolved outside of libc, which is invalid if the allocator has been
replaced. track this condition and manually copy if needed.
dlsym with an explicit handle is specified to use "dependency order",
a breadth-first search rooted at the argument. this has always been
implemented by iterating a flattened dependency list built at dlopen
time. however, the logic for building this list was completely wrong
except in trivial cases; it simply used the list of libraries loaded
since a given library, and their direct dependencies, as that
library's dependencies, which could result in misordering, wrongful
omission of deep dependencies from the search, and wrongful inclusion
of unrelated libraries in the search.
further, libraries did not have any recorded list of resolved
dependencies until they were explicitly dlopened, meaning that
DT_NEEDED entries had to be resolved again whenever a library
participated as a dependency of more than one dlopened library.
with this overhaul, the resolved direct dependency list of each
library is always recorded when it is first loaded, and can be
extended to a full flattened breadth-first search list if dlopen is
called on the library. the extension is performed using the direct
dependency list as a queue and appending copies of the direct
dependency list of each dependency in the queue, excluding duplicates,
until the end of the queue is reached. the direct deps remain
available for future use as the initial subarray of the full deps
array.
first-load logic in dlopen is updated to match these changes, and
clarified.
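a minimal sketch of the queue-based extension described above, with hypothetical types and names. the caller supplies an array that already holds the direct dependencies and has room for the full list; handling of a dependency cycle back to the root is left out for brevity.

```c
#include <stddef.h>

/* hypothetical stand-in for a loaded library */
struct lib {
	const char *name;
	struct lib **direct; /* null-terminated direct dependencies */
};

static int in_queue(struct lib **q, size_t n, struct lib *p)
{
	for (size_t i = 0; i < n; i++)
		if (q[i] == p) return 1;
	return 0;
}

/* q holds n direct dependencies on entry and has room for cap
 * entries. returns the length of the flattened breadth-first list. */
static size_t extend_bfs_sketch(struct lib **q, size_t n, size_t cap)
{
	/* n grows while i advances, so the array acts as its own queue */
	for (size_t i = 0; i < n; i++)
		for (struct lib **d = q[i]->direct; *d; d++)
			if (!in_queue(q, n, *d) && n < cap)
				q[n++] = *d;
	return n;
}
```

because new entries are only ever appended, the direct dependencies stay in place as the initial part of the array.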
code introduced in commit 9d44b6460a
wrongly attempted to read past the end of the currently-installed dtv
to determine if a dso provides new, not-already-installed tls. this
logic was probably left over from an earlier draft of the code that
wrongly installed the new dtv before populating it.
it would work if we instead queried the new, not-yet-installed dtv,
but instead, replace the incorrect check with a simple range check
against old_cnt. this also catches modules that have no tls at all
with a single condition.
code introduced in commit 9d44b6460a
wrongly assumed the dso list tail was the right place to find new dtv
storage. however, this is only true if the last-loaded dependency has
tls. the correct place to get it is the dso corresponding to the tls
module list tail. introduce a container_of macro to get it, and use
it.
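a minimal sketch of the container_of idiom, assuming the tls module record is embedded in the dso structure. the struct layouts and the accessor name are hypothetical; the real structures have many more fields.

```c
#include <stddef.h>

/* recover a pointer to the enclosing object from a pointer to one
 * of its members */
#define container_of(p, t, m) ((t *)((char *)(p) - offsetof(t, m)))

/* hypothetical shapes for illustration only */
struct tls_module_sketch {
	struct tls_module_sketch *next;
	size_t size;
};

struct dso_sketch {
	const char *name;
	struct tls_module_sketch tls; /* embedded list node */
	void *new_dtv;
};

/* maps the tail of the tls module list back to the dso owning it */
static struct dso_sketch *dso_of_tls_tail(struct tls_module_sketch *tail)
{
	return container_of(tail, struct dso_sketch, tls);
}
```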
ultimately, dynamic tls allocation should be refactored so that this
is not an issue. there is no reason to be allocating new dtv space at
each load_library; instead it could happen after all new libraries
have been loaded but before they are committed. such changes may be
made later, but this commit fixes the present regression.
the motivation for this change is twofold. first, it gets the fallback
logic out of the dynamic linker, improving code readability and
organization. second, it provides application code that wants to use
the membarrier syscall, which depends on preregistration of intent
before the process becomes multithreaded unless unbounded latency is
acceptable, with a symbol that, when linked, ensures that this
registration happens.
this is a prerequisite for factoring the membarrier fallback code into
a function that can be called from a context with the thread list
already locked or independently.
commit 9d44b6460a inadvertently
contained leftover logic from a previous approach to the fallback
signaling loop. it had no adverse effect, since j was always nonzero
if the loop body was reachable, but it makes no sense to be there with
the current approach to avoid signaling self.
addressing &out[k].sa was arguably undefined, despite &out[k] being
defined for the slot one past the end of an array, since the member access
.sa is intervening between the [] operator and the & operator.
the backindex stored by getaddrinfo to allow freeaddrinfo to perform
partial-free wrongly used the address result index, rather than the
output slot index, and thus was only valid when they were equal
(nservs==1).
patch based on report with proposed fix by Markus Wichmann.
previously, dynamic loading of new libraries with thread-local storage
allocated the storage needed for all existing threads at load-time,
precluding late failure that can't be handled, but left installation
in existing threads to take place lazily on first access. this imposed
an additional memory access and branch on every dynamic tls access,
and imposed a requirement, which was not actually met, that the
dynamic tlsdesc asm functions preserve all call-clobbered registers
before calling C code to install new dynamic tls on first access.
the x86[_64] versions of this code wrongly omitted saving and
restoring of fpu/vector registers, assuming the compiler would not
generate anything using them in the called C code. the arm and aarch64
versions saved known existing registers, but failed to be future-proof
against expansion of the register file.
now that we track live threads in a list, it's possible to install the
new dynamic tls for each thread at dlopen time. for the most part,
synchronization is not needed, because if a thread has not
synchronized with completion of the dlopen, there is no way it can
meaningfully request access to a slot past the end of the old dtv,
which remains valid for accessing slots which already existed.
however, it is necessary to ensure that, if a thread sees its new dtv
pointer, it sees correct pointers in each of the slots that existed
prior to the dlopen. my understanding is that, on most real-world
coherency architectures including all the ones we presently support, a
built-in consume order guarantees this; however, don't rely on that.
instead, the SYS_membarrier syscall is used to ensure that all threads
see the stores to the slots of their new dtv prior to the installation
of the new dtv. if it is not supported, the same is implemented in
userspace via signals, using the same mechanism as __synccall.
the __tls_get_addr function, variants, and dynamic tlsdesc asm
functions are all updated to remove the fallback paths for claiming
new dynamic tls, and are now all branch-free.
access to clear the entry in each thread's tsd array for the key being
deleted was not synchronized with __pthread_tsd_run_dtors. I probably
made this error out of a mistaken belief that the thread list lock was
held during the latter, which of course is not possible since it
executes application code in a still-live-thread context.
while we're at it, expand the interval during which signals are
blocked to cover taking the write lock on key_lock, so that a signal
at an inopportune time doesn't block forward progress of readers.
commit 84d061d5a3 inadvertently
introduced namespace violations by using the pthread-namespace rwlock
functions in pthread_key_create, which is in turn used for C11 tss.
fix that and possible future uses of rwlocks elsewhere.
with the availability of the thread list, there is no need to mark tsd
key slots dirty and clean them up only when a free slot can't be
found. instead, directly iterate threads and clear any value
associated with the key being deleted.
no synchronization is necessary for the clearing, since there is no
way the slot can be accessed without having synchronized with the
creation of a new key occupying the same slot, which is already
sequenced after and synchronized with the deletion of the old key.
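a minimal sketch of the "iterate threads and clear the slot" idea follows. the struct layout, the circular-list shape, and all sk_* names are assumptions made for illustration, not the library's actual definitions, and the caller is assumed to hold whatever lock protects the list.

```c
#include <stddef.h>

#define SK_KEYS_MAX 8  /* hypothetical; stands in for PTHREAD_KEYS_MAX */

/* hypothetical thread descriptor: one node of a circular linked list */
struct sk_thread {
	struct sk_thread *next, *prev;
	void *tsd[SK_KEYS_MAX];
};

/* clear slot k in every thread reachable from self, including self.
 * no dirty marking is needed: the slot is simply empty afterwards. */
static void sk_clear_key(struct sk_thread *self, unsigned k)
{
	struct sk_thread *td = self;
	do td->tsd[k] = NULL;
	while ((td = td->next) != self);
}
```

with a list of two threads that both hold a value in slot 3, a single call to sk_clear_key(self, 3) leaves slot 3 null in both and does not touch other slots.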
the __synccall mechanism provides stop-the-world synchronous execution
of a callback in all threads of the process. it is used to implement
multi-threaded setuid/setgid operations, since Linux lacks them at the
kernel level, and for some other less-critical purposes.
this change eliminates dependency on /proc/self/task to determine the
set of live threads, which in addition to being an unwanted dependency
and a potential point of resource-exhaustion failure, turned out to be
inaccurate. test cases provided by Alexey Izbyshev showed that it
could fail to reflect newly created threads. due to how the
presignaling phase worked, this usually yielded a deadlock if hit, but
in the worst case it could also result in threads being silently
missed (allowed to continue running without executing the callback).
the hard problem here is unlinking threads from a list when they exit
without creating a window of inconsistency where the kernel task for a
thread still exists and is still executing instructions in userspace,
but is not reflected in the list. the magic solution here is getting
rid of per-thread exit futex addresses (set_tid_address), and instead
using the exit futex to unlock the global thread list.
since pthread_join can no longer see the thread enter a detach_state
of EXITED (which depended on the exit futex address pointing to the
detach_state), it must now observe the unlocking of the thread list
lock before it can unmap the joined thread and return. it doesn't
actually have to take the lock. for this, a __tl_sync primitive is
offered, with a signature that will allow it to be enhanced for quick
return even under contention on the lock, if needed. for now, the
exiting thread always performs a futex wake on its detach_state. a
future change could optimize this out except when there is already a
joiner waiting.
initial/dynamic variants of detached state no longer need to be
tracked separately, since the futex address is always set to the
global list lock, not a thread-local address that could become invalid
on detached thread exit. all detached threads, however, must perform a
second sigprocmask syscall to block implementation-internal signals,
since locking the thread list with them already blocked is not
permissible.
the arch-independent C version of __unmapself no longer needs to take
a lock or set up its own futex address to release the lock, since it
must necessarily be called with the thread list lock already held,
guaranteeing exclusive access to the temporary stack.
changes to libc.threads_minus_1 no longer need to be atomic, since
they are guarded by the thread list lock. it is largely vestigial at
this point, and can be replaced with a cheaper boolean indicating
whether the process is multithreaded at some point in the future.
whether signals need to be blocked at thread start, and whether
unblocking is necessary in the entry point function, has historically
depended on intricacies of the cancellation design and on whether
there are scheduling operations to perform on the new thread before
its successful creation can be committed. future changes to track an
AS-safe list of live threads will require signals to be blocked
whenever changes are made to the list, so ...
prior to commits b8742f3260 and
40bae2d32f, a signal mask for the entry
function to restore was part of the pthread structure. it was removed
to trim down the size of the structure, which both saved a small
amount of stack space and improved code generation on archs where
small immediate displacements are less costly than arbitrary ones, by
limiting the range of offsets between the base of the thread
structure, its members, and the thread pointer. these commits moved
the saved mask to a special structure used only when special
scheduling was needed, in which case the pthread_create caller and new
thread had to synchronize with each other and could use this memory to
pass a mask.
this commit partially reverts the above two commits, but instead of
putting the mask back in the pthread structure, it moves all "start
argument" members out of the pthread structure, trimming it down
further, and puts them in a separate structure passed on the new
thread's stack. the code path for explicit scheduling of the new
thread is also changed to synchronize with the calling thread in such
a way as to avoid spurious futex wakes.
this eliminates some ugly hacks that were repurposing the start
function and start argument fields in the pthread structure for timer
use, and the need to longjmp out of a signal handler.
__dl_thread_cleanup is called from the context of an exiting thread
that is not in a consistent state valid for calling application code.
since commit c9f415d7ea, it's possible
(and supported usage) for the allocator to have been replaced by the
application, so __dl_thread_cleanup can no longer call free. instead,
reuse the message buffer as a linked-list pointer, and queue it to be
freed the next time any dynamic linker error message is generated.
the way gets was implemented in terms of fgets, it used the location
of the null termination to determine where to find and remove the
newline, if any. an embedded null byte prevented this from working.
this also fixes a one-byte buffer overflow, whereby when gets read an
N-byte line (not counting newline), it would store two null
terminators for a total of N+2 bytes. it's unlikely that anyone would
care that a function whose use is pretty much inherently a buffer
overflow writes too much, but it could break the only possible correct
uses of this function, in conjunction with input of known format from
a trusted/same-privilege-domain source, where the buffer length may
have been selected to exactly match a line length contract.
there seems to be no correct way to implement gets in terms of a
single call to fgets or scanf, and using multiple calls would require
explicit locking, so we might as well just write the logic out
explicitly character-at-a-time. this isn't fast, but nobody cares if a
catastrophically unsafe function that's so bad it was removed from the
C language is fast.
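the character-at-a-time logic can be sketched roughly as below. sk_gets is a hypothetical helper that takes an explicit FILE so it can be exercised without stdin; locking is omitted, and like gets itself it performs no bounds checking, so it is shown only to illustrate the termination and newline handling.

```c
#include <stdio.h>

/* hypothetical stand-in for gets, reading from f instead of stdin.
 * unbounded by design, which is exactly why gets is unsafe. */
static char *sk_gets(char *s, FILE *f)
{
	size_t i = 0;
	int c;
	/* copy bytes, including embedded null bytes, up to newline or EOF */
	while ((c = getc(f)) != EOF && c != '\n')
		s[i++] = c;
	s[i] = 0;  /* exactly one terminator: N+1 bytes for an N-byte line */
	/* fail on read error, or on EOF before any byte was read */
	if (c != '\n' && (!feof(f) || !i))
		return NULL;
	return s;
}
```

because the end of the line is found by watching for the newline directly rather than by searching for a terminator afterwards, an embedded null byte in the input does not confuse it.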
in order to implement ENOTRECOVERABLE, the implementation has
traditionally used a bit of the mutex type field to indicate that it's
recovered after EOWNERDEAD and will go into ENOTRECOVERABLE state if
pthread_mutex_consistent is not called before unlocking. while it's
only the thread that holds the lock that needs access to this
information (except possibly for the sake of pthread_mutex_consistent
choosing between EINVAL and EPERM for erroneous calls), the change to
the type field is formally a data race with all other threads that
perform any operation on the mutex. no individual bits race, and no
write races are possible, so things are "okay" in some sense, but it's
still not good.
this patch moves the recovery/consistency state to the mutex
owner/lock field which is rightfully mutable. bit 30, the same bit the
kernel uses with a zero owner to indicate that the previous owner died
holding the lock, is now used with a nonzero owner to indicate that
the mutex is held but has not yet been marked consistent. note that
the kernel ABI also reserves bit 29 not to appear in any tid, so the
sentinel value we use for ENOTRECOVERABLE, 0x7fffffff, does not clash
with any tid plus bit 30.
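the encoding described above can be illustrated with a small classifier. the macro and function names are hypothetical, the waiters bit (bit 31) is ignored for simplicity, and the assumption that tids never have bit 29 set is taken from the text.

```c
#include <stdint.h>

#define SK_BIT29    0x20000000u  /* assumed never set in any tid */
#define SK_BIT30    0x40000000u  /* "owner died" / "not yet consistent" */
#define SK_NOTRECOV 0x7fffffffu  /* sentinel for ENOTRECOVERABLE */

enum sk_state {
	SK_FREE, SK_FREE_OWNER_DIED, SK_HELD,
	SK_HELD_INCONSISTENT, SK_UNRECOVERABLE
};

/* owner-field value for "held by tid, not yet marked consistent" */
static uint32_t sk_held_inconsistent(uint32_t tid)
{
	return tid | SK_BIT30;
}

/* interpret an owner-field value (waiters bit not modeled) */
static enum sk_state sk_classify(uint32_t owner)
{
	if (owner == SK_NOTRECOV) return SK_UNRECOVERABLE;
	if (owner == 0) return SK_FREE;
	if (owner == SK_BIT30) return SK_FREE_OWNER_DIED;
	return (owner & SK_BIT30) ? SK_HELD_INCONSISTENT : SK_HELD;
}
```

since any tid has bit 29 clear, tid|bit30 also has bit 29 clear, while the sentinel has it set, so the two can never compare equal.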
fdopendir is specified to fail with EBADF if the file descriptor
passed is not open for reading. while O_PATH is an extension and
arguably exempt from this requirement, it's used, albeit incompletely,
to implement O_SEARCH, and fdopendir should fail when passed an
O_SEARCH file descriptor.
the new check is performed after fstat so that we don't have to
consider the possibility that the fd is invalid.
an alternate solution would be attempting to pre-fill the buffer using
getdents, which would fail with EBADF for us, but that seems more
complex and error-prone and involves either code duplication or
refactoring, so the simple fix with an additional inexpensive syscall
is what I've made for now.
Some packages call gettext to format a message to be sent to perror.
If the currently set user locale points to a non-existent .mo file,
open via __map_file in dcngettext will set errno to ENOENT.
Maintainer's notes: Non-modification of errno is a documented part of
the interface contract for the GNU version of this function and likely
other versions. The issue being fixed here seems to be a regression
from commit 1b52863e24, which enabled
setting of errno from __map_file.
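The usual way to honor a "does not modify errno" contract is to save errno on entry and restore it before returning. The sketch below shows that pattern only; sk_lookup and its catalog handling are hypothetical and do not reflect how the real function locates or parses .mo files.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical lookup: tries to open a catalog and, in this sketch,
 * always falls back to returning msgid unchanged. */
static const char *sk_lookup(const char *catalog_path, const char *msgid)
{
	int saved = errno;                  /* remember the caller's errno */
	FILE *f = fopen(catalog_path, "rb"); /* may set errno, e.g. ENOENT */
	if (f) fclose(f);                   /* a real version would search it */
	errno = saved;                      /* caller sees no change */
	return msgid;
}
```

With this in place, a caller that does perror(sk_lookup(path, "message")) after a failed operation still reports the original error rather than ENOENT from the missing catalog.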
commit 84d061d5a3 attempted to do this
already, but omitted from pthread_key_create.c the weak definition of
__pthread_key_delete_synccall, so that the definition provided by
pthread_key_delete.c was always pulled in.
based on patch by Markus Wichmann, but with a weak alias rather than
weak reference for consistency/policy about dependence on tooling
features.
fallback to /etc/shadow should happen only when the entry is not found
in the TCB shadow. otherwise transient errors or permission errors can
cause inconsistent results.
this reverts commit c0ed5a201b, which
was based on a mistaken reading of POSIX due to inconsistency between
the description (which requires return upon interruption by a signal)
and the errors list (which wrongly lists EINTR as "may fail").
since the previously-introduced behavior was a workaround for an old
kernel bug to ensure safety of correct programs that were not hardened
against the bug, an effort has been made to preserve it for programs
which do not use interrupting signal handlers. the stage for this was
set in commit a63c0104e4, which makes
the futex __timedwait backend suppress EINTR if it's seen when no
interrupting signal handlers have been installed.
based loosely on a patch submitted by Orivej Desh, but with
unnecessary additional changes removed.
the resolution of Austin Group issue #1132 changes the requirement to
fail so that it only applies when the set argument (new mask) is
non-null. this change was made for consistency with the description,
which specified "if set is a null pointer, the value of the argument
how is not significant".
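the revised requirement amounts to validating how only when a new mask is supplied. a sketch of that check, using a hypothetical helper name and returning an error number rather than performing the actual mask change:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>

/* hypothetical argument check: how is only significant if set is
 * non-null, per the resolution described above. */
static int sk_check_how(int how, const sigset_t *set)
{
	if (set && how != SIG_BLOCK && how != SIG_UNBLOCK && how != SIG_SETMASK)
		return EINVAL;
	return 0;
}
```

a query-only call such as sk_check_how(12345, NULL) therefore succeeds, while the same bogus how value with a non-null set yields EINVAL.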
prior to linux 2.6.22, futex wait could fail with EINTR even for
non-interrupting (SA_RESTART) signals. this was no problem provided
the caller simply restarted the wait, but sem_[timed]wait is required
by POSIX to return when interrupted by a signal. commit
a113434cd6 introduced this behavior, and
commit c0ed5a201b reverted it based on a
mistaken belief that it was not required. this belief stems from a bug
in the specification: the description requires the function to return
when interrupted, but the errors section marks EINTR as a "may fail"
condition rather than a "shall fail" one.
since there does seem to be significant value in the change made in
commit c0ed5a201b, making it so that
programs that call sem_wait without checking for EINTR don't silently
make forward progress without obtaining the semaphore or treat it as a
fatal error and abort, add a behind-the-scenes mechanism in the
__timedwait backend to suppress EINTR in programs that have never
installed interrupting signal handlers, and have sigaction track and
report this state. this way the semaphore code is not cluttered by
workarounds and can be updated (to be done in next commit) to reflect
the high-level logic for conforming behavior.
these changes are based loosely on a patch by Markus Wichmann, with
the main changes being atomic update to flag object and moving the
workaround from sem_timedwait to the __timedwait futex backend.
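the mechanism can be sketched as a process-wide flag that is set the first time an interrupting handler is installed and consulted when a wait reports EINTR. all sk_* names are hypothetical, the flag is a plain volatile int rather than an atomically updated object, and treating a suppressed EINTR as a zero result (so the caller rechecks and waits again) is an assumption of this sketch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>

static volatile int sk_eintr_valid;  /* hypothetical flag */

/* example handler used only to demonstrate installation */
static void sk_noop_handler(int sig) { (void)sig; }

/* what a sigaction wrapper could record before calling the kernel */
static void sk_note_sigaction(const struct sigaction *sa)
{
	if (sa->sa_handler != SIG_DFL && sa->sa_handler != SIG_IGN
	    && !(sa->sa_flags & SA_RESTART))
		sk_eintr_valid = 1;
}

/* what a wait backend could do with the raw result of a futex wait */
static int sk_filter_wait_result(int r)
{
	if (r == EINTR && !sk_eintr_valid) return 0;  /* spurious: retry */
	return r;
}
```

programs that only ever install SA_RESTART handlers, or none at all, never see EINTR from the wait, while a program that installs an interrupting handler gets the conforming behavior from then on.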
it's not clear whether this is required, but it seems arguable that it
should happen. for example aio_suspend is supposed to return
immediately if any of the operations has "completed", which includes
ending with an error status asynchronously and might also be
interpreted to include doing so synchronously.
the map structures in particular are permanent once created, and thus
a large number of aio function calls with invalid file descriptors
could exhaust memory, whereas, assuming normal resource limits, only a
very small number of entries ever need to be allocated. check validity
of the fd before allocating anything new, so that allocation of large
amounts of memory is only possible when resource limits have been
increased and a large number of files are actually open.
this change also improves error reporting for bad file descriptors to
happen at the time the aio submission call is made, as opposed to
asynchronously.
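one inexpensive way to test whether a descriptor is open before allocating anything for it is a descriptor-flags query. the helper below is a sketch with a hypothetical name; whether the real code uses this exact query is not stated in the text.

```c
#include <fcntl.h>

/* returns nonzero if fd refers to an open file descriptor.
 * F_GETFD fails with EBADF for a descriptor that is not open. */
static int sk_fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1;
}
```

a submission path could call this first and fail immediately with EBADF, so that no permanent per-fd structure is ever created for a descriptor number that was never open.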
since commit c9f415d7ea, it has been
possible that the allocator is application-provided code, which cannot
necessarily run safely on io thread stacks, and which should not be
able to see the existence of io threads, since they are an
implementation detail.
instead of having the io thread request and possibly allocate its
queue (and the map structures leading to it), make the submitting
thread responsible for this, and pass the queue pointer into the io
thread via its args structure. this eliminates the only early error
case in io threads, making it no longer necessary to pass an error
status back to the submitting thread via the args structure.
aio threads not using SIGEV_THREAD notification are created with small
stacks and no guard page, which is possible since they only run the
code for the requested io operation, not any application code. the
motivation is not creating a lot of VMAs. however, the io thread needs
to be able to receive a cancellation signal in case aio_cancel
(implemented via pthread_cancel) is called. this requires sufficient
stack space for a signal frame, which PTHREAD_STACK_MIN does not
necessarily include.
in principle MINSIGSTKSZ from signal.h should give us sufficient space
for a signal frame, but the value is incorrect on some existing archs
due to kernel addition of new vector register support without
consideration for impact on ABI. some powerpc models exceed
MINSIGSTKSZ by about 0.5k, and x86[_64] with AVX-512 can exceed it by
up to about 1.5k. so use MINSIGSTKSZ+2048 to allow for the discrepancy
plus some working space.
unfortunately, it's possible that signal frame sizes could continue to
grow, and some archs (aarch64) explicitly specify that they may.
passing of a runtime value for MINSIGSTKSZ via AT_MINSIGSTKSZ in the
aux vector was added to aarch64 linux, and presumably other archs will
use this mechanism to report if they further increase the signal frame
size. when AT_MINSIGSTKSZ is present, assume it's correct, so that we
only need a small amount of working space in addition to it; in this
case just add 512.
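the sizing rule can be written as a small pure function. the names are hypothetical; the first argument stands for the AT_MINSIGSTKSZ value from the aux vector (zero when absent, as getauxval would report), and the second for the compile-time MINSIGSTKSZ. taking the larger of the two candidates is an assumption of this sketch; the text only specifies the two margins.

```c
#include <stddef.h>

/* choose a stack size for an io thread that must be able to take a
 * cancellation signal.
 *   at_min:     runtime AT_MINSIGSTKSZ value, 0 if not provided
 *   static_min: compile-time MINSIGSTKSZ */
static size_t sk_io_stack_size(size_t at_min, size_t static_min)
{
	size_t from_static = static_min + 2048;  /* covers known discrepancies */
	size_t from_aux = at_min ? at_min + 512 : 0;  /* trusted, small margin */
	return from_aux > from_static ? from_aux : from_static;
}
```

for example, with no aux vector entry and a static minimum of 2048 the result is 4096, while a reported runtime minimum of 8192 gives 8704.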
new in linux commit 76b7f670730e87974f71df9f6129811e2769666e
in struct signalfd_siginfo the pad member is changed to __pad to keep
the namespace clean; it's not part of the public api.
add UDP_NO_CHECK6_* to restrict zero UDP6 checksums, new in linux commit
1c19448c9ba6545b80ded18488a64a7f3d8e6998 (pre-v4.18 change, was missed)
add UDP_SEGMENT to support generic segmentation offload for udp datagrams,
bec1f6f697362c5bc635dacd7ac8499d0a10a4e7 (new in v4.18)
add packet delivery info to tcp_info,
new in linux commit feb5f2ec646483fb66f9ad7218b1aad2a93a2a5c
add TCP_ZEROCOPY_RECEIVE socket option for zerocopy receive,
new in linux commit 05255b823a6173525587f29c4e8f1ca33fd7677d
add TCP_INQ socket option and TCP_CM_INQ cmsg to get in-queue bytes in cmsg
upon read, new in linux commit b75eba76d3d72e2374fac999926dafef2997edd2
add TCP_REPAIR_* to fix repair socket window probe patch,
new in linux commit 31048d7aedf31bf0f69c54a662944632f29d82f2
commit b9410061e2 inadvertently omitted
optopt from the "dynamic list", causing it to be split into separate
objects that don't share their value if the main program contains a
copy relocation for it (for non-PIE executables that access it, and
some PIE ones, depending on arch and toolchain versions/options).
first, the condition (mem && k < p) is redundant, because mem being
nonzero implies the needle is periodic with period exactly p, in which
case any byte that appears in the needle must appear in the last p
bytes of the needle, bounding the shift (k) by p.
second, the whole point of replacing the shift k by mem (=l-p) is to
prevent shifting by less than mem when discarding the memory on shift,
in which case linear time could not be guaranteed. but as written, the
check also replaced shifts greater than mem by mem, reducing the
benefit of the shift. there is no possible benefit to this reduction of
the shift; since mem is being cleared, the full shift is valid and
more optimal. so only replace the shift by mem when it would be less
than mem.
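the difference between the two shift rules can be shown in isolation. both functions are hypothetical reductions of the logic described above (k is the candidate shift, mem the remembered prefix length, zero when the needle is not periodic), not the actual search loop.

```c
#include <stddef.h>

/* old behavior as described: whenever memory is in use, the shift is
 * replaced by mem, even if k was larger. */
static size_t sk_shift_old(size_t k, size_t mem)
{
	return mem ? mem : k;
}

/* new behavior: only raise the shift to mem when it would be smaller. */
static size_t sk_shift_new(size_t k, size_t mem)
{
	return k < mem ? mem : k;
}
```

with k = 5 and mem = 3 the old rule shifts by 3 and the new one by the full 5; with k = 2 and mem = 3 both shift by 3, which preserves the linear-time guarantee.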
commit ddc947eda3 fixed the
corresponding bug for exit which was introduced when commit
0b80a7b040 added support for
caller-provided buffers, making it possible for stderr to be a
buffered stream.
fflush(NULL) and __stdio_exit lock individual FILEs while holding the
open file list lock to walk the list. since fclose first locked the
FILE to be closed, then the ofl lock, it could deadlock with these
functions.
also, because fclose removed the FILE to be closed from the open file
list before flushing and closing it, a concurrent fclose or exit could
complete successfully before fclose flushed the FILE it was closing,
resulting in data loss.
reorder the body of fclose to first flush and close the file, then
remove it from the open file list only after unlocking it. this
creates a window where consumers of the open file list can see dead
FILE objects, but in the absence of undefined behavior on the part of
the application, such objects will be in an inactive-buffer state and
processing them will have no side effects.
__unlist_locked_file is also moved so that it's performed only for
non-permanent files. this change is not necessary, but preserves
consistency (and thereby provides safety/hardening) in the case where
an application uses one of the standard streams after closing it while
holding an explicit lock on it. such usage is of course undefined
behavior.
Use "+r" in the asm instead of implementing a non-transparent copy by
applying "0" constraint to the source value. Introduce a typedef for
the function type to avoid spelling it out twice.
commit aeeac9ca54 introduced fail-safe
invariants that creating a locale_t object for the C locale or C.UTF-8
locale will always succeed. extend the guarantee to also cover the
following:
- newlocale(LC_ALL_MASK, "", 0)
- newlocale(LC_ALL_MASK-LC_CTYPE_MASK, "C", 0)
provided that the LANG/LC_* environment variables have not been
changed by the program. these usages are idiomatic for getting the
default locale, and for getting a locale that behaves as the C locale
except for honoring the default locale's character encoding.
unify the code paths for allocated and non-allocated locale objects,
always using a tmp object. this is necessary to avoid clobbering the
base locale object too soon if we allow for the possibility that
looking up an explicitly requested locale name may fail, and makes the
code simpler and cleaner anyway.
eliminate the complex and fragile logic for checking whether one of
the non-allocated locale objects can be used for the result, and
instead just memcmp against each of them.
commit 63c188ec42 missed making this
change when switching from atomics to locking for modification of the
global locale, leaving access to locale structures unnecessarily
burdened with the restrictions of volatile.
the volatile qualification was originally added in commit
56fbaa3bbe.
introduce a new LOC_MAP_FAILED sentinel for errors, since null
pointers for a category's locale map indicate the C locale. at this
time, __get_locale does not fail, so there should be no functional
change by this commit.
the choice of signed char for lbf was a theoretically space-saving
hack that was not helping, and was unwantedly expensive. while
comparing bytes against a byte-sized member sounds easy, the trick
here was that the byte to be compared was unsigned while the lbf
member was signed, making it possible to set lbf negative to disable
line buffering. however, this imposed a requirement to promote both
operands, zero-extending one and sign-extending the other, in order to
compare them.
to fix this, repurpose the waiters count slot (unused since commit
c21f750727). while we're at it, switch
mode (orientation) from signed char to int as well. this makes no
semantic difference (its only possible values are -1, 0, and 1) but it
might help on archs where byte access is awkward.
to check whether flush due to line buffering is needed, the int-type
character argument must be truncated to unsigned char for comparison.
if the original value is subsequently passed to __overflow, it must be
preserved, adding to register pressure. since it doesn't matter,
truncate all uses so the original value is no longer live.
the internal putc_unlocked macro was wrongly returning a meaningless
boolean result rather than the written character or EOF.
bug was found by reading (very surprising) asm.
check whether the lock is free before loading the calling thread's
tid. if so, just use a dummy tid value that cannot compare equal to
any actual thread id (because it's one bit wider). this also avoids
the need to save the tid and pass it to locking_getc or locking_putc,
reducing register pressure.
this change might slightly hurt the case where the caller already
holds the lock, but it does not affect the single-threaded case, and
may significantly improve the multi-threaded case, especially on archs
where loading the thread pointer is disproportionately expensive like
early mips and arm ISA levels. but even on i386 it helps, at least on
some machines; I measured roughly a 10-15% improvement.
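the "claim a free lock without loading the tid" step could look roughly like the following, written with C11 atomics for portability. the names and the specific dummy value are assumptions: the sketch supposes tids fit in 29 bits and that bit 30 is a waiters flag, so 0x3fffffff is one bit wider than any tid and cannot match one.

```c
#include <stdatomic.h>

#define SK_MAYBE_WAITERS 0x40000000          /* assumed flag bit */
#define SK_DUMMY_TID (SK_MAYBE_WAITERS - 1)  /* 30 bits set; not a real tid */

/* try to take a lock that is currently free (value 0) without knowing
 * the caller's tid. returns nonzero on success. */
static int sk_trylock_free(atomic_int *lock)
{
	int expected = 0;
	return atomic_compare_exchange_strong(lock, &expected, SK_DUMMY_TID);
}
```

a recursive-lock check of the form "lock value equals my tid" then fails for a lock taken this way, which is harmless here because the short-lived lock around a single getc or putc is never re-entered by its holder.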
this is not needed for correctness, but doesn't hurt, and in some
cases the compiler may pessimize the call assuming the callee might be
variadic when it lacks a prototype.
commit 4390383b32 inadvertently used "r"
instead of "0" for the input constraint, which only happened to work
for the configuration I tested it on because it usually makes sense
for the compiler to choose the same input and output register.
by ABI, the public stdin/out/err macros use extern pointer objects,
and this is necessary to avoid copy relocations that would be
expensive and make the size of the FILE structure part of the ABI.
however, internally it makes sense to access the underlying FILE
objects directly. this avoids both an indirection through the GOT to
find the address of the stdin/out/err pointer objects (which can't be
computed PC-relative because they may have been moved to the main
program by copy relocations) and an indirection through the resulting
pointer object.
in most places this is just a minor optimization, but in the case of
getchar and putchar (and the unlocked versions thereof), ipa constant
propagation makes all accesses to members of stdin/out PC-relative or
GOT-relative, possibly reducing register pressure as well.
with these changes, in a program that has not created any threads
besides the main thread and that has not called f[try]lockfile, getc
performs indistinguishably from getc_unlocked. this was measured on
several i386 and x86_64 models, and should hold on other archs too
simply by the properties of the code generation.
the case where the caller already holds the lock (via flockfile) is
improved significantly as well (40-60% reduction in time on machines
tested) and the case where locking is needed is improved somewhat
(roughly 10%).
the key technique used here is forcing the non-hot path out-of-line
and enabling it to be a tail call. a static noinline function
(conditional on __GNUC__) is used rather than the extern hiddens used
elsewhere for this purpose, so that the compiler can choose
non-default calling conventions, making it possible to tail-call to a
callee that takes more arguments than the caller on archs where
arguments are passed on the stack or must have space reserved on the
stack for spilling them. the tid could just be reloaded via the thread
pointer in locking_getc, but that would be ridiculously expensive on
some archs where thread pointer load requires a trap or syscall.
on multiple occasions I've started to flatten/inline the code in
__init_libc, only to rediscover the reason it was not inlined: GCC
fails to deallocate its stack (and now, with the changes in commit
4390383b32, fails to produce a tail call
to the stage 2 function; see PR #87639) before calling main if it was
inlined.
document this with a comment and use an explicit noinline attribute if
__GNUC__ is defined so that even with CFLAGS that heavily favor
inlining it won't get inlined.
this is the analog of commit 1c84c99913
for static linking. unlike with dynamic linking, we don't have
symbolic lookup to use as a barrier. use a dummy (target-agnostic)
degenerate inline asm fragment instead. this technique has precedent
in commit 05ac345f89 where it's used for
explicit_bzero. if it proves problematic in any way, loading the
address of the stage 2 function from a pointer object whose address
leaks to kernelspace during thread pointer init could be used as an
even stronger barrier.
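a degenerate asm fragment of the kind described can be sketched as follows. the stage functions are hypothetical stand-ins; the point is only that passing the function pointer through an empty asm with a "+r" operand and a memory clobber makes the compiler treat the pointer value as unknown and prevents it from moving memory accesses across that point.

```c
typedef int sk_stage2_fn(int);

/* hypothetical second stage; in the real code this would rely on state
 * initialized by the first stage. */
static int sk_stage2(int x) { return x + 1; }

static int sk_stage1(int x)
{
	sk_stage2_fn *stage2 = sk_stage2;
#ifdef __GNUC__
	/* empty asm: emits no instructions, but the compiler must assume
	 * stage2 and memory may have changed. */
	__asm__ volatile ("" : "+r"(stage2) : : "memory");
#endif
	return stage2(x);
}
```

the fragment contains no target instructions, so it is usable on any arch supported by a GNU C compatible compiler; without __GNUC__ the sketch degrades to a plain indirect call with no barrier.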
this will allow the compiler to cache and reuse the result, meaning we
no longer have to take care not to load it more than once for the sake
of archs where the load may be expensive.
depends on commit 1c84c99913 for
correctness, since otherwise the compiler could hoist loads during
stage 3 of dynamic linking before the initial thread-pointer setup.
revert commit a603a75a72.
as a result of commit 1c84c99913 this is
now safe, assuming an interpretation of the somewhat-underspecified
attribute((const)) consistent with real-world usage.
commit a603a75a72 removed attribute
const from __errno_location and pthread_self, and the same reasoning
forced arch definitions of __pthread_self to use volatile asm,
significantly impacting code generation and imposing manual caching of
pointers where the impact might be noticeable.
reorder the thread pointer setup and place it across a strong barrier
(symbolic function lookup) so that there is no assumed ordering
between the initialization and the accesses to the thread pointer in
stage 3.
fma is only available on recent x86_64 cpus and it is much faster than
a software fma, so this should be done with a runtime check. however,
that requires more changes; this patch just adds the code so it can be
tested when musl is compiled with -mfma or -mfma4.
vfma is available in the vfpv4 fpu and above, the ACLE standard feature
test for double precision hardware fma support is
__ARM_FEATURE_FMA && __ARM_FP&8
we need further checks to work around clang bugs (fixed in clang >=7.0)
&& !__SOFTFP__
because __ARM_FP is defined even with -mfloat-abi=soft
&& !BROKEN_VFP_ASM
to disable the single precision code when inline asm handling is broken.
For runtime selection the HWCAP_ARM_VFPv4 hwcap flag can be used, but
that requires further work.
previously (before and after rewrite), spurious escaping of path
separators as \/ was not treated the same as /, but rather got split
as an unpaired \ at the end of the fnmatch pattern and an unescaped /,
resulting in a mismatch/error.
for the case of \/ as part of the maximal literal prefix, remove the
explicit rejection of it and move the handling of / below escape
processing.
for the case of \/ after a proper glob pattern, it's hard to parse the
pattern, so don't. instead cheat and count repetitions of \ prior to
the already-found / character. if there is an odd number, the last is
escaping the /, so back up the split position by one. now the
char clobbered by null termination is variable, so save it and restore
as needed.
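the backslash-counting trick can be sketched as a helper that, given the pattern and the index of the already-found slash, returns the index at which to split. the function name is hypothetical and the saving and restoring of the clobbered character is left out.

```c
#include <stddef.h>

/* pat[slash] is assumed to be '/'. count the run of backslashes
 * immediately before it; an odd count means the last one escapes the
 * slash, so the split point moves back by one. */
static size_t sk_split_pos(const char *pat, size_t slash)
{
	size_t n = 0;
	while (n < slash && pat[slash - 1 - n] == '\\')
		n++;
	return (n & 1) ? slash - 1 : slash;
}
```

for the pattern a*\/b the slash is at index 3 with one preceding backslash, so the split moves to index 2; for a*\\/b (an escaped backslash followed by a plain slash) the count is even and the split stays at the slash.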
this code has been long overdue for a rewrite, but the immediate cause
that necessitated it was total failure to see past unreadable path
components. for example, A/B/* would fail to match anything, even
though it should succeed, when both A and A/B are searchable but only
A/B is readable. this problem was both caught in conformance testing
and impacted users.
the old glob implementation insisted on searching the listing of each
path component for a match, even if the next component was a literal.
it also used considerable stack space, up to length of the pattern,
per recursion level, and relied on an artificial bound of the pattern
length by PATH_MAX, which was incorrect because a pattern can be much
longer than PATH_MAX while having matches shorter (for example, with
necessarily long bracket expressions, or with redundancy).
in the new implementation, each level of recursion starts by consuming
the maximal literal (possibly escaped-literal) path prefix remaining
in the pattern, and only opening a directory to read when there is a
proper glob pattern in the next path component. it then recurses into
each matching entry. the top-level glob function provides automatic
storage (up to PATH_MAX) for construction of candidate/result strings,
and allocates a duplicate of the pattern that can be modified in-place
with temporary null-termination to pass to fnmatch. this allocation is
not a big deal since glob already has to perform allocation, and has
to link free to clean up if it experiences an allocation failure or
other error after some results have already been allocated.
care is taken to use the d_type field from iterated dirents when
possible; stat is called only when there are literal path components
past the last proper-glob component, or when needed to disambiguate
symlinks for the purpose of GLOB_MARK.
one peculiarity with the new implementation is the manner in which the
error handling callback will be called. if attempting to match */B/C/D
where a directory A exists that is inaccessible, the error reported
will be a stat error for A/B/C/D rather than (previous and wrong
implementation) an opendir error for A, or (likely on other
implementations) a stat error for A/B. such behavior does not seem to
be non-conforming, but if it turns out to be undesirable for any
reason, backtracking could be done on error to report the first
component producing it.
also, redundant slashes are no longer normalized, but preserved as
they appear in the pattern; this is probably more correct, and falls
out naturally from the algorithm used. since trailing slashes (which
force all matches to be directories) are preserved as well, the
behavior of GLOB_MARK has been adjusted not to append an additional
slash to results that already end in slash.
commit 6ba5517a46 modified
__tls_get_addr to offset the address by +DTP_OFFSET (0x8000 on
powerpc, mips, etc.) and adjusted the result of DTPREL relocations by
-DTP_OFFSET to compensate, but missed changing the argument setup for
calls to __tls_get_addr from dlsym.
as explained in commit 6ba5517a46, some
archs use an offset (typically -0x8000) with their DTPOFF relocations,
which __tls_get_addr needs to invert. on affected archs, which lack
direct support for large immediates, this can cost multiple extra
instructions in the hot path. instead, incorporate the DTP_OFFSET into
the DTV entries. this means they are no longer valid pointers, so
store them as an array of uintptr_t rather than void *; this also
makes it easier to access slot 0 as a valid slot count.
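the idea can be sketched as follows. dtv_set, tls_addr and
tls_mod_off_t are hypothetical names for illustration, and the
DTP_OFFSET value is the one the text gives for powerpc, mips, etc.;
this is not the actual libc code.

```c
#include <stddef.h>
#include <stdint.h>

#define DTP_OFFSET 0x8000

/* hypothetical stand-in for the __tls_get_addr argument: v[0] is the
 * module id, v[1] the DTPOFF value, which the relocation has already
 * biased by -DTP_OFFSET. */
typedef size_t tls_mod_off_t[2];

/* when a module's TLS block is installed, fold the offset into the
 * slot. the stored value is no longer a valid pointer, hence uintptr_t. */
static void dtv_set(uintptr_t *dtv, size_t mod, void *block)
{
	dtv[mod] = (uintptr_t)block + DTP_OFFSET;
}

/* hot path: a single add, with no DTP_OFFSET adjustment needed */
static void *tls_addr(const uintptr_t *dtv, tls_mod_off_t v)
{
	return (void *)(dtv[v[0]] + v[1]);
}
```

the two biases cancel in modular arithmetic, so the lookup yields the
address inside the block without the extra instructions.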
commit e75b16cf93 left behind cruft in
two places, __reset_tls and __tls_get_new, from back when it was
possible to have uninitialized gap slots indicated by a null pointer
in the DTV. since the concept of null pointer is no longer meaningful
with an offset applied, remove this cruft.
presently there are no archs with both TLSDESC and nonzero DTP_OFFSET,
but the dynamic TLSDESC relocation code is also updated to apply an
inverted offset to its offset field, so that the offset DTV would not
impose a runtime cost in TLSDESC resolver functions.
when invoking the assembler, arm gcc does not always pass the right
flags to enable use of vfp instruction mnemonics. for C code it
produces, it emits the .fpu directive, but this does not help when
building asm source files, which tlsdesc needs to be. to fix, use an
explicit directive here.
commit 0beb9dfbec introduced this
regression. it has not appeared in any release.
the specification for freeaddrinfo allows it to be used to free
"arbitrary sublists" of the list returned by getaddrinfo. it's not
clearly stated how such sublists come into existence, but the
interpretation seems to be that the application can edit the ai_next
pointers to cut off a portion of the list and then free it.
actual freeing of individual list slots is contrary to the design of
our getaddrinfo implementation, which has no failure paths after
making a single allocation, so that light callers can avoid linking
realloc/free. freeing individual slots is also incompatible with
sharing the string for ai_canonname, which the current implementation
does despite no requirement that it be present except on the first
result. so, rather than actually freeing individual slots, provide a
way to find the start of the allocated array, and reference-count it,
freeing the memory all at once after the last slot has been freed.
since the language in the spec is "arbitrary sublists", no provision
for handling other constructs like multiple lists glued together,
circular links, etc. is made. presumably passing such a construct to
freeaddrinfo produces undefined behavior.
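a minimal sketch of the reference-counting scheme, under simplifying
assumptions: struct item stands in for struct addrinfo, all names are
hypothetical, and the locking a real implementation would need around
the count is omitted.

```c
#include <stddef.h>
#include <stdlib.h>

struct item { struct item *next; };

struct slot {
	struct item it;
	short idx;   /* position of this slot in the array */
	short ref;   /* count of live slots; meaningful in slot 0 only */
};

/* single allocation holding the whole result list */
static struct slot *alloc_list(int n)
{
	struct slot *b = calloc(n, sizeof *b);
	if (!b) return 0;
	for (int i = 0; i < n; i++) {
		b[i].idx = i;
		b[i].it.next = i+1 < n ? &b[i+1].it : 0;
	}
	b[0].ref = n;
	return b;
}

/* "free" a sublist; returns 1 once the whole array has been released */
static int free_sublist(struct item *p)
{
	int cnt;
	for (cnt = 1; p->next; cnt++, p = p->next);
	struct slot *b = (struct slot *)((char *)p - offsetof(struct slot, it));
	b -= b->idx;                      /* back up to the array start */
	if ((b->ref -= cnt)) return 0;    /* other slots are still live */
	free(b);
	return 1;
}
```

cutting the list by editing a next pointer and freeing each piece
separately releases the memory only when the last piece is freed.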
the indirect function call is a significant portion of the code path
for the dynamic case, and most users are probably building for ISA
levels where it can be omitted.
we could drop at least one register save/restore (lr) with this
change, and possibly another (ip) with some clever shuffling, but it's
not clear whether there's a way to do it that's not more expensive, or
whether avoiding the save/restore would have any practical effect, so
in the interest of avoiding complexity it's omitted for now.
unlike other asm where the baseline ISA is used, these functions are
hot paths and use ISA-level specializations.
call-clobbered vfp registers are saved before calling __tls_get_new,
since there is no guarantee it won't use them. while setjmp/longjmp
have to use hwcap to decide whether the fpu is in use, since
application code could be using vfp registers even if libc was
compiled as pure softfloat, __tls_get_new is part of libc and can be
assumed not to have access to vfp registers if tlsdesc.S does not.
thus it suffices just to check the predefined preprocessor macros. the
check for __ARM_PCS_VFP is redundant; !__SOFTFP__ must always be true
if the target ISA level includes fpu instructions/registers.
use the GNU C may_alias attribute if available, and fall back to naive
byte-by-byte loops if __GNUC__ is not defined.
this patch has been written to minimize changes so that history
remains reviewable; it does not attempt to bring the affected code
into a more consistent or elegant form.
the comparison must take place in the address space model as an
integer type, since comparing pointers that are not pointing into the
same array is undefined.
the subsequent d<s comparison however is valid, because it's only
reached in the case where the source and dest overlap, in which case
they are necessarily pointing to parts of the same array.
to make the comparison, use an unsigned range check for dist(s,d)>=n,
algebraically !(-n<s-d<n). subtracting n yields !(-2*n<s-d-n<0), which
mapped into unsigned modular arithmetic is !(-2*n<s-d-n) or rather
-2*n>=s-d-n.
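the range check derived above can be sketched as a standalone
predicate. disjoint is a hypothetical name, and the sketch assumes a
flat address space where uintptr_t and size_t have the same width.

```c
#include <stddef.h>
#include <stdint.h>

/* nonzero if the n-byte regions at d and s are at least n bytes apart.
 * the arithmetic is done on integer representations, so no pointers
 * into different arrays are ever compared. */
static int disjoint(const void *d, const void *s, size_t n)
{
	return (uintptr_t)s - (uintptr_t)d - n <= -2*n;
}
```

when the predicate holds, a plain forward copy is safe; otherwise the
regions belong to the same array and d<s can be evaluated validly.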
Rewrote the AVL tree implementation:
- It is now non-recursive with fixed stack usage (large enough for
worst case tree height). twalk and tdestroy are still recursive as
that's smaller/simpler.
- Moved unrelated interfaces into separate translation units.
- The node structure is changed to use indexed children instead of
left/right pointers; this simplifies the balancing logic.
- Using void * pointers instead of struct node * in various places,
because this better fits the api (node address is passed in a void**
argument, so it is tempting to incorrectly cast it to struct node **).
- As a further performance improvement the rebalancing now stops
when it is not needed (subtree height is unchanged). Otherwise
the behaviour should be the same as before (checked over generated
random inputs that the resulting tree shape is equivalent).
- Removed the old copyright notice (including prng related one: it
should be licensed under the same terms as the rest of the project).
.text size of pic tsearch + tfind + tdelete + twalk:
     x86_64  i386  aarch64   arm  mips  powerpc  ppc64le   sh4  m68k  s390x
old     941   899     1220  1068  1852     1400     1600  1008  1008   1488
new     857   881     1040   976  1564     1192     1360   736   820   1408
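to illustrate how indexed children simplify balancing, here is a
minimal sketch of a rotation that serves both directions. the struct
layout and function names are assumptions for illustration, not the
code from this commit.

```c
#include <stddef.h>

struct node {
	const void *key;
	struct node *a[2];  /* a[0] = left child, a[1] = right child */
	int h;              /* height of the subtree rooted here */
};

static int height(struct node *n) { return n ? n->h : 0; }

static void update(struct node *n)
{
	int h0 = height(n->a[0]), h1 = height(n->a[1]);
	n->h = (h0 > h1 ? h0 : h1) + 1;
}

/* rotate x down toward side dir; its child on the other side becomes
 * the new subtree root. one function covers both rotations. */
static struct node *rot(struct node *x, int dir)
{
	struct node *y = x->a[!dir];
	x->a[!dir] = y->a[dir];
	y->a[dir] = x;
	update(x);
	update(y);
	return y;
}
```

with left/right members, the same logic would need two mirrored copies
of the function.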
despite not being documented to do so in the standard or Linux
documentation, attempts to udp connect to 127.0.0.1 or ::1 generate
EADDRNOTAVAIL when the loopback device is not configured and there is
no default route for IPv6. this caused getaddrinfo with AI_ADDRCONFIG
to fail with EAI_SYSTEM and EADDRNOTAVAIL on some no-IPv6
configurations, rather than the intended behavior of detecting IPv6 as
unsupported and producing IPv4-only results.
previously, only EAFNOSUPPORT was treated as unavailability of the
address family being probed. instead, treat all errors related to
inability to get an address or route as conclusive that the family
being probed is unsupported, and only fail with EAI_SYSTEM on other
errors.
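a sketch of such a classification follows. the helper name, the enum,
and the exact set of errno values are assumptions chosen to match the
description "inability to get an address or route"; the set used by
the actual patch may differ.

```c
#include <errno.h>

enum probe_result { PROBE_UNSUPPORTED, PROBE_FATAL };

/* hypothetical helper: classify the errno left by a failed probe
 * connect. PROBE_UNSUPPORTED means the family is silently dropped from
 * the results; PROBE_FATAL corresponds to failing with EAI_SYSTEM. */
static enum probe_result classify_probe_error(int err)
{
	switch (err) {
	case EADDRNOTAVAIL:
	case EAFNOSUPPORT:
	case EHOSTUNREACH:
	case ENETDOWN:
	case ENETUNREACH:
		return PROBE_UNSUPPORTED;
	default:
		return PROBE_FATAL;
	}
}
```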
further improvements may be desirable, such as reporting EAI_AGAIN
instead of EAI_SYSTEM for errors which are expected to be transient,
but this patch should suffice to fix the serious regression.
the clang internal assembler does not accept assembler options passed
via the usual -Wa mechanism, but it does accept -mimplicit-it directly
as an option to the compiler driver.
this facilitates building software that assumes a large default stack
size without any patching to call pthread_setattr_default_np or
pthread_attr_setstacksize at each thread creation site, using just
LDFLAGS.
normally the PT_GNU_STACK header is used only to reflect whether
executable stack is desired, but with GNU ld at least, passing
-Wl,-z,stack-size=N will set a size on the program header. with this
patch, that size will be incorporated into the default stack size
(subject to increase-only rule and DEFAULT_STACK_MAX limit).
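the increase-only rule and the cap can be sketched as below.
apply_stack_hint is a hypothetical helper, and the 8MB value used for
DEFAULT_STACK_MAX is an assumption for illustration.

```c
#include <stddef.h>

#define DEFAULT_STACK_MAX (8u<<20)  /* assumed cap, in bytes */

/* hypothetical helper: fold a PT_GNU_STACK size request into the
 * current default. the default only ever grows, and never past the
 * cap. */
static size_t apply_stack_hint(size_t cur, size_t hint)
{
	if (hint > DEFAULT_STACK_MAX) hint = DEFAULT_STACK_MAX;
	return hint > cur ? hint : cur;
}
```

a program header with no size set (0) or a size below the current
default leaves the default unchanged.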
both static and dynamic linking honor the program header. for dynamic
linking, all libraries loaded at program start, including preloaded
ones, are considered. dlopened libraries are not considered, for
several reasons. extra logic would be needed to defer processing until
the load of the new library is committed, synchronization would be
needed since other threads may be running concurrently, and the
effectiveness would be limited since the larger size would not apply to
threads that already existed at the time of dlopen. programs that will
dlopen code expecting a large stack need to declare the requirement
themselves, or pthread_setattr_default_np can be used.
stack size default is increased from 80k to 128k. this coincides with
Linux's hard-coded default stack for the main thread (128k is
initially committed; growth beyond that up to ulimit is contingent on
additional allocation succeeding) and GNU ld's default PT_GNU_STACK
size for FDPIC, at least on sh.
guard size default is increased from 4k to 8k to reduce the risk of
guard page jumping on overflow, since use of just over 4k of stack is
common (PATH_MAX buffers, etc.).
limit to 8MB/1MB, respectively. since the defaults cannot be reduced
once increased, excessively large settings would lead to an
unrecoverably broken state. this change is in preparation to allow
defaults to be increased via program headers at the linker level.
creation of threads that really need larger sizes needs to be done
with an explicit attribute.
per POSIX, deletion of a key for which some threads still have values
stored is permitted, and newly created keys must initially hold the
null value in all threads. these properties were not met by our
implementation; if a key was deleted with values left and a new key
was created in the same slot, the old values were still visible.
moreover, due to lack of any synchronization in pthread_key_delete,
there was a TOCTOU race whereby a concurrent pthread_exit could
attempt to call a null destructor pointer for the newly orphaned
value.
this commit introduces a solution based on __synccall, stopping the
world to zero out the values for deleted keys, but only does so lazily
when all key slots have been exhausted. pthread_key_delete is split
off into a separate translation unit so that static-linked programs
which only create keys but never delete them will not pull in the
__synccall machinery.
a global rwlock is added to synchronize creation and deletion of keys
with dtor execution. since the dtor execution loop now has to release
and retake the lock around its call to each dtor, checks are made not
to call the nodtor dummy function for keys which lack a dtor.
The condition occurs when
- thread #1 is holding the lock
- thread #2 is waiting for it on __futexwait
- thread #1 is about to release the lock and performs a_swap
- thread #3 enters the __lockfile function and manages to grab the lock
before thread #1 calls __wake, resetting the MAYBE_WAITERS flag
- thread #1 calls __wake
- thread #2 wakes up but goes again to __futexwait as the lock is
held by thread #3
- thread #3 releases the lock but does not call __wake as the
MAYBE_WAITERS flag is not set
This condition results in thread #2 not being woken up. This patch fixes
the problem by making the woken up thread ensure that the flag is
properly set before going to sleep again.
Maintainer's note: This fixes a regression introduced in commit
c21f750727.
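The flag handling can be modeled with C11 atomics as in the sketch
below. This is an illustration of the idea only, not the patch itself:
the function names and the MAYBE_WAITERS value are assumptions, and
the futex wait and wake calls are left out.

```c
#include <stdatomic.h>

#define MAYBE_WAITERS 0x40000000

/* compare-and-swap that returns the value seen in *p */
static int cas(atomic_int *p, int t, int s)
{
	atomic_compare_exchange_strong(p, &t, s);
	return t;
}

/* One acquire attempt by a thread that has already slept. Taking the
 * lock with MAYBE_WAITERS set, or setting the flag on the current
 * owner's lock word before sleeping again, ensures that the next
 * release issues a wake. Returns 1 if the lock was acquired. */
static int retry_lock(atomic_int *lock, int tid)
{
	int owner = cas(lock, 0, tid | MAYBE_WAITERS);
	if (!owner) return 1;
	if (!(owner & MAYBE_WAITERS))
		cas(lock, owner, owner | MAYBE_WAITERS);
	return 0;   /* caller would go back to its futex wait here */
}

/* Release the lock; nonzero means a futex wake is needed. */
static int release_lock(atomic_int *lock)
{
	return atomic_exchange(lock, 0) & MAYBE_WAITERS;
}
```

In the scenario above, thread #2 would set the flag on the lock word
owned by thread #3 before sleeping, so thread #3's release reports
that a wake is needed.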
commit b114190b29 introduced spurious
realloc of the output buffer in cases where the result would exactly
fit in the caller-provided buffer. this is contrary to a strict
reading of the spec, which only allows realloc when the provided
buffer is "of insufficient size".
revert the adjustment of the realloc threshold, and instead push the
byte read by getc_unlocked (for which the adjustment was made) back
into the stdio buffer if it does not fit in the output buffer, to be
read in the next loop iteration.
in order not to leave a pushed-back byte in the stdio buffer if
realloc fails (which would violate the invariant that logical FILE
position and underlying open file description offset match for
unbuffered FILEs), the OOM code path must be changed. it would suffice
to move just one byte in this case, but from a QoI perspective, in the
event of ENOMEM the entire output buffer (up to the allocated length
reported via *n) should contain bytes read from the FILE stream.
otherwise the caller has no way to distinguish truncated data from
uninitialized buffer space.
the SIZE_MAX/2 check is removed since the sum of disjoint object sizes
is assumed not to be able to overflow, leaving just one OOM code path.
morally, for null pointers a and b, a-b, a<b, and a>b should all be
defined as 0; however, C does not define any of them.
the stdio implementation makes heavy use of such pointer comparison
and subtraction for buffer logic, and also uses null pos/base/end
pointers to indicate that the FILE is not in the corresponding (read
or write) mode ready for accesses through the buffer.
all of the comparisons are fixed trivially by using != in place of the
relational operators, since the opposite relation (e.g. pos>end) is
logically impossible. the subtractions have been reviewed to check
that they are conditional on the stream being in the appropriate
reading- or writing-through-buffer mode, with checks added where needed.
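both kinds of fix can be sketched on a cut-down stream structure.
struct rbuf and the two helpers are hypothetical and stand in for the
real FILE fields.

```c
#include <stddef.h>

/* hypothetical cut-down stream: both pointers are null when the stream
 * is not in read mode */
struct rbuf { const unsigned char *rpos, *rend; };

/* bytes available in the read buffer. the subtraction is evaluated
 * only when the stream is in read mode. */
static size_t avail(const struct rbuf *f)
{
	return f->rend ? (size_t)(f->rend - f->rpos) : 0;
}

/* next buffered byte, or -1 if the buffer is empty. rpos > rend cannot
 * happen, so != expresses the same test as < while staying defined
 * for two null pointers. */
static int next_byte(struct rbuf *f)
{
	return f->rpos != f->rend ? *f->rpos++ : -1;
}
```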
in fgets and getdelim, the checks added should improve performance for
unbuffered streams by avoiding a do-nothing call to memchr, and should
be negligible for buffered streams.
on some archs, linux support for futex operations (including
robust_list processing) that depend on kernelspace CAS is conditional
on a runtime check. as of linux 4.18, this check fails unconditionally
on nommu archs that perform it, and spurious failure on powerpc64 was
observed but not explained. it's also possible that futex support is
omitted entirely, or that the kernel is older than 2.6.17. for most
futex ops, ENOSYS does not yield hard breakage; userspace will just
spin at 100% cpu load. but for robust mutexes, correct behavior
depends on the kernel functionality.
use the get_robust_list syscall to probe for support at the first call
to pthread_mutexattr_setrobust, and block creation of robust mutexes
with a reportable error if they can't be supported.
in order to produce FILE objects to pass to the intscan/floatscan
backends without any (prohibitively costly) extra buffering layer, the
strto* functions set the FILE's rend (read end) buffer pointer to an
invalid value at the end of the address space, or SIZE_MAX/2 past the
beginning of the string. this led to undefined behavior comparing and
subtracting the end pointer with the buffer position pointer (rpos).
the comparison issue is easily eliminated by using != instead of <.
however the subtractions require nontrivial changes:
previously, f->shcnt stored the count that would have been read if
consuming the whole buffer, which required an end pointer for the
buffer. the purpose for this was that it allowed reading it and adding
rpos-rend at any time to get the actual count so far, and required no
adjustment at the time of __shgetc (actual function call) since the
call would only happen when reaching the end of the buffer.
to get rid of the dependency on rend, instead offset shcnt by buf-rpos
(start of buffer) at the time of last __shlim/__shgetc call. this
makes for slightly more work in __shgetc the function, but for the
inline macro it's still just as easy to compute the current count.
since the scan helper interfaces used here are a big hack, comments
are added to document their contracts and what's going on with their
implementations.
POSIX allows ttyname(_r) and isatty to return EBADF if the file
descriptor passed is invalid.
maintainer's note: these are optional ("may fail") errors, but it's
non-conforming for ttyname_r to return ENOTTY when it failed for a
different reason.
this significantly improves codegen in functions that need to access
errno but otherwise have no need for a GOT pointer.
we could probably improve it much more by including an inline version
of the &errno accessor function, but that depends on having the
definitions of struct __pthread and __pthread_self(), which at present
would expose a lot more than is appropriate. moving them to a small
tls.h later might make this more reasonable.
commit 5ce3737931 removed the inclusion
of libc.h from this file as spurious, but it's needed to get PAGE_SIZE
on archs where PAGE_SIZE is not a constant defined by limits.h.
there is no good reason to wait to find and process the plural rules
for a translated message file until a gettext form requesting plural
rule processing is used. it just imposes additional synchronization,
here in the form of clunky use of atomics.
it looks like there may also have been a race condition where nplurals
could be seen without plural_rule being seen, possibly leading to null
pointer dereference. if so, this commit fixes it.
in our memory model, all atomics are supposed to be full barriers;
stores are not release-only. this is important because store is used
as an unlock operation in places where it needs to acquire the waiter
count to determine if a futex wake is needed. at least in the
malloc-internal locks, but possibly elsewhere, soft deadlocks from
missing futex wake (breakable by poking the threads to restart the
syscall, e.g. by attaching a tracer) were reported to occur.
once the malloc lock is replaced with Jens Gustedt's new lock
implementation (see commit 47d0bcd476),
malloc will not be affected by the issue, but it's not clear that
other uses won't be. reducing the strength of the ordering properties
required from a_store would require a thorough analysis of how it's
used.
to fix the problem, I'm removing the powerpc[64]-specific a_store
definition; now, the top-level atomic.h will implement a_store using
a_barrier on both sides of the store.
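expressed with C11 atomics rather than the libc-internal primitives,
a store with a barrier on each side looks roughly like this. barrier
and store_full are hypothetical stand-ins for a_barrier and a_store,
and unlock_needs_wake only models the unlock pattern described above.

```c
#include <stdatomic.h>

/* full barrier; stands in for the arch-provided a_barrier */
static void barrier(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* store with a barrier on each side, so that it is ordered against
 * both earlier and later accesses instead of being release-only */
static void store_full(atomic_int *p, int v)
{
	barrier();
	atomic_store_explicit(p, v, memory_order_relaxed);
	barrier();
}

/* unlock pattern: clear the lock, then read the waiter count to decide
 * whether a futex wake is needed. the trailing barrier keeps the load
 * from being ordered before the store. */
static int unlock_needs_wake(atomic_int *lock, atomic_int *waiters)
{
	store_full(lock, 0);
	return atomic_load_explicit(waiters, memory_order_relaxed) != 0;
}
```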
it's not clear to me yet whether there might be issues with the other
atomics. it's possible that a_post_llsc needs to be replaced with a
full barrier to guarantee the formal semantics we want, but either way
I think the difference is unlikely to impact the way we use them.
as originally published, the C99 syntax only allowed static index
parameter declarators when a gratuitous parameter name was included.
gcc 3, which some projects use for bootstrapping, is a supported C99
compiler, but does not have the fix to the standard incorporated, so
edit the affected declaration to conform to the earlier buggy C99
syntax.
other compilers don't need this option, but gcc 3 and perhaps others
accept it despite not understanding it, then print warnings about it
at build time.
omitting it when not needed will also help shorten the command lines.
since commit dc2f368e56 this has been
disabled by default, but was left available in case users unhappy with
the resulting size or performance regressions wanted to try to make it
work. now that we make widespread use of hidden visibility for
internal interfaces, this no longer makes sense. if any costly calls
remain they can be fixed with hidden aliases.
this further reduces the number of source files which need to include
libc.h and thereby be potentially exposed to libc global state and
internals.
this will also facilitate further improvements like adding an inline
fast-path, if we want to do so later.
pthread_atfork.c does not actually include pthread_impl.h and has no
reason to, so it wasn't getting the declaration. move it to libc.h
which is already included by both fork.c and pthread_atfork.c. this
makes more sense anyway since the function has little to do with
pthreads aside from the name.
the LFS64 macro was not self-documenting and barely saved any
characters. simply use weak_alias directly so that it's clear what's
being done, and doesn't depend on a header to provide a strange macro.
libc.h was intended to be a header for access to global libc state and
related interfaces, but ended up included all over the place because
it was the way to get the weak_alias macro. most of the inclusions
removed here are places where weak_alias was needed. a few were
recently introduced for hidden. some go all the way back to when
libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented)
cancellation points had to include it.
remaining spurious users are mostly callers of the LOCK/UNLOCK macros
and files that use the LFS64 macro to define the awful *64 aliases.
in a few places, new inclusion of libc.h is added because several
internal headers no longer implicitly include libc.h.
declarations for __lockfile and __unlockfile are moved from libc.h to
stdio_impl.h so that the latter does not need libc.h. putting them in
libc.h made no sense at all, since the macros in stdio_impl.h are
needed to use them correctly anyway.
not all prefixed symbols can be made hidden. some are part of
ABI-compat (e.g. __nl_langinfo_l) and others are ABI as a consequence
of the way copy relocations for weak aliases work in ELF shared
libraries. most, however, can be made hidden.
with this commit, there should be no remaining unintentionally visible
symbols exported from libc.so.
this was added so that posix_spawn and possibly other functionality
could be implemented in terms of vfork, but that turned out to be
unsafe. any such usage needs __clone with proper handling of stack
lifetime.
the direct syscall or various thin and mostly-inline wrappers around
it are used instead internally. at some point a public futex function
should be added, but it's not yet clear what the signature should be,
and in the mean time this file is not useful.
this is a special case that does not need a declaration, because it's
not even a libc-internal interface between translation units. instead
it's a poor hack around compilers' inability to shrink-wrap critical
code paths. after vis.h was disabled, it became more of a
pessimization on many archs due to the extra layer of machinery to
support a call through the PLT, but now it should be efficient again.
the __-prefixed filename does not make sense when the only purpose of
this file is implementing a public function that's not used as a
backend for implementing the standard dirent functions.
these were overlooked in the declarations overhaul work because they
are not properly declared, and the current framework even allows their
declared types to vary by arch. at some point this should be cleaned
up, but I'm not sure what the right way would be.
this makes significant differences to codegen on archs with an
expensive PLT-calling ABI; on i386 and gcc 7.3 for example, the sin
and sinf functions no longer touch call-saved registers or the stack
except for pushing outgoing arguments. performance is likely improved
too, but no measurements were taken.
commits leading up to this one have moved the vast majority of
libc-internal interface declarations to appropriate internal headers,
allowing them to be type-checked and setting the stage to limit their
visibility. the ones that have not yet been moved are mostly
namespace-protected aliases for standard/public interfaces, which
exist to facilitate implementing plain C functions in terms of POSIX
functionality, or C or POSIX functionality in terms of extensions that
are not standardized. some don't quite fit this description, but are
"internally public" interfaces between subsystems of libc.
rather than create a number of newly-named headers to declare these
functions, and having to add explicit include directives for them to
every source file where they're needed, I have introduced a method of
wrapping the corresponding public headers.
parallel to the public headers in $(srcdir)/include, we now have
wrappers in $(srcdir)/src/include that come earlier in the include
path order. they include the public header they're wrapping, then add
declarations for namespace-protected versions of the same interfaces
and any "internally public" interfaces for the subsystem they
correspond to.
along these lines, the wrapper for features.h is now responsible for
the definition of the hidden, weak, and weak_alias macros. this means
source files will no longer need to include any special headers to
access these features.
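for readers unfamiliar with these macros, definitions along the
following lines are typical for GNU C on ELF targets. the exact
definitions in the wrapper header are not shown in this message, so
treat these as an assumption; impl_answer and answer are hypothetical
names.

```c
#define hidden __attribute__((__visibility__("hidden")))
#define weak __attribute__((__weak__))
#define weak_alias(old, new) \
	extern __typeof(old) new __attribute__((__weak__, __alias__(#old)))

/* hypothetical internal function, not exported from a shared library */
hidden int impl_answer(void)
{
	return 42;
}

/* public name provided as a weak alias for the internal function */
weak_alias(impl_answer, answer);
```

callers inside the library can reference the hidden name directly and
avoid the PLT, while the weak public name remains available to
applications.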
over time, it is my expectation that the scope of what is "internally
public" will expand, reducing the number of source files which need to
include *_impl.h and related headers down to those which are actually
implementing the corresponding subsystems, not just using them.
it's not ideal, but the function is essentially an extended stdio
function specialized to getopt's needs. the only reason it exists is
avoiding pulling printf code into every program using getopt.
the public flockfile interface is significantly heavier because it has
to handle the possibility of caller returning or thread exiting while
holding the lock.
the malloc-implementation-private header is the only right place for
this, because, being in the reserved namespace, __memalign is not
interposable and thus not valid to use anywhere else. anything outside
of the malloc implementation must call an appropriate-namespace public
function (aligned_alloc or posix_memalign).
previously, a common __posix_spawnx backend was used that accepted an
additional argument for the execve variant to call in the child. this
moderately bloated up the posix_spawn function, shuffling arguments
between stack and/or registers to call a 7-argument function from a
6-argument one.
instead, tuck the exec function pointer in an unused part of the
(large) posix_spawnattr_t structure, and have posix_spawnp duplicate
the attributes and fill in a pointer to __execvpe. the net code size
change is minimal, but the weight is shifted to the "heavier" function
which already pulls in more dependencies.
as a bonus, we get rid of an external symbol (__posix_spawnx) that had
no really good place for a declaration because it shouldn't have
existed to begin with.
this is not a public interface, and does not even necessarily match
the syscall on all archs that have a syscall by that name.
on archs where it's implemented in C, no action on the source file is
needed; the hidden declaration in pthread_arch.h suffices.
unlike the other res/dn functions, this one is tied to struct
resolvconf which is not a public interface, so put it in the private
header for its subsystem.
despite looking like undefined behavior, the affected code is correct
both before and after this patch. the pairs mtx_t and pthread_mutex_t,
and cnd_t and pthread_cond_t, are not mutually compatible within a
single translation unit (because they are distinct untagged aggregate
instances), but they are compatible with an object of either type from
another translation unit (6.2.7 ¶1), and therefore a given translation
unit can choose which one it wants to use.
in the interest of being able to move declarations out of source files
to headers that facilitate checking, use the pthread type names in
declaring the namespace-safe versions of the pthread functions and
cast the argument pointer types when calling them.
eliminate gratuitous glue function for reporting the version, which
was probably leftover from the old dynamic linker design which lacked
a clear barrier for when/how it could access global data. put the
declaration for the data object that replaces it in libc.h where it
can be type checked.
logically these belong to the intersection of the stdio and pthread
subsystems, and either place the declarations could go (stdio_impl.h
or pthread_impl.h) requires a forward declaration for one of the
argument types.
these exist for the sake of defining the corresponding weak public
aliases (for C11 and POSIX namespace conformance reasons). they are
not referenced by anything else in libc, so make them static.
get rid of a gratuitous translation unit and call frame between
asctime_r and the actual implementation of the function. this is the
way gmtime_r and localtime_r are already done.
syscall.h was chosen as the header to declare it, since its intended
usage is alongside syscalls as a fallback for operations the direct
syscall does not support.
policy is that all public functions which have a public declaration
should be defined in a context where that public declaration is
visible, to avoid preventable type mismatches.
an audit performed using GCC's -Wmissing-declarations turned up the
violations corrected here. in some cases the public header had not
been included; in others, a feature test macro needed to make the
declaration visible had been omitted.
in the case of gethostent and getnetent, the omission seems to have
been intentional, as a hack to admit a single stub definition for both
functions. this kind of hack is no longer acceptable; it's UB and
would not fly with LTO or advanced toolchains. the hack is undone to
make exposure of the declarations possible.
this cleans up what had become widespread direct inline use of "GNU C"
style attributes directly in the source, and lowers the barrier to
increased use of hidden visibility, which will be useful to recovering
some of the efficiency lost when the protected visibility hack was
dropped in commit dc2f368e56, especially
on archs where the PLT ABI is costly.
commit 4f35eb7591 introduced this bug.
it is not present in any released versions. inadvertent use of the &
operator on an array into which we're indexing produced arithmetic on
the wrong-type pointer, with undefined behavior.
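the class of mistake can be illustrated with a small example. the
array and function names are hypothetical, since the affected code is
not quoted here.

```c
#include <stddef.h>

/* bytes advanced by adding 1 to a pointer to the first element */
static size_t step_elem(void)
{
	int a[4];
	return (size_t)((char *)(a + 1) - (char *)a);
}

/* bytes advanced by adding 1 to a pointer to the whole array, which is
 * what applying & to the array produces */
static size_t step_array(void)
{
	int a[4];
	return (size_t)((char *)(&a + 1) - (char *)a);
}
```

with &a, any index above 1 already steps past the end of the object,
which is where the undefined behavior comes from.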
this code in sigaction was the only place where sizeof was being
applied to the kernel sigaction's mask member to get the size argument
to pass to the kernel. everywhere else, _NSIG/8 is used for this
purpose.
Linux makes this surprisingly difficult, but it can be done. the trick
here is using the fact that we control the implementation of sigaction
to prevent changing the disposition of SIGABRT to anything but SIG_DFL
after abort has tried and failed to terminate the process simply by
calling raise(SIGABRT).
these functions are specified to write to stderr but not set its
orientation, presumably so that they can be used in programs operating
stderr in wide mode. also, they are not allowed to clobber errno on
success. save and restore to meet the requirement.
psiginfo is reduced to a thin wrapper around psignal, since it
already behaved the same. if we want to add more detailed siginfo
printing at some point this will need refactoring.
if no output is produced, no underlying fwrite will ever be called,
but byte-oriented printf functions are still required to set the
orientation of the stream to byte-oriented. call __towrite explicitly
if the FILE is not already in write mode.
commit b5a8b28915 set up the write buffer
bound pointers for the temporary buffer manually to fix a buffer
overflow issue, but in doing so, caused vfprintf on unbuffered files
never to call __towrite, thereby failing to set the stream orientation
to byte-oriented, failing to clear any prior read mode, and failing to
produce an error when the stream is not writable.
revert the inline setup of the bounds pointers and instead zero them,
so that the underlying fwrite code will call __towrite to set them up.
commit 0b80a7b040 added the ability to
set application-provided stdio FILE buffers, adding the possibility
that stderr might be buffered at exit time, but __stdio_exit did not
have code to flush it.
this regression was not present in any release.
if __cp_cancel was reached via __syscall_cp, r12 will necessarily
still contain a GOT pointer (for libc.so or for the static-linked main
program) valid for entering __cancel. however, in the case of async
cancellation, r12 may contain any scratch value; it's not necessarily
even a valid GOT pointer for the code that was interrupted.
unlike in commit 0ec49dab67 where the
corresponding issue was fixed for powerpc64, there is fundamentally no
way for fdpic code to recompute its GOT pointer. so a new mechanism is
introduced for cancel_handler to write a GOT register value into the
interrupted context on archs where it is needed.
entering the local entry point for __cancel from __cp_cancel is valid
if __cp_cancel was reached from __syscall_cp, since both are in libc
and share the same TOC pointer, but it is not valid if __cp_cancel was
reached when cancel_handler rewrote the program counter for
asynchronous cancellation of code outside libc.
to ensure __cancel is entered with a valid TOC pointer, recompute the
correct value in a PC-relative manner before jumping.
- REALTIME_SIGNALS is supposed to be version-valued
- DELAYTIMER_MAX was wrongly using the min allowed max
- unavailable compilation environments wrongly used 0 instead of -1
the value 0x7f00 (as if by _exit(127)) is specified only for the case
where the child is created but then fails to exec the shell, since
traditional fork+exec implementations do not admit reporting an error
via errno in this case without additional machinery. it's unclear
whether an implementation not subject to this failure mode needs to
emulate it; one could read the standard as requiring that. if so,
additional code will need to be added to map posix_spawn errors into
the form system is expected to return. but for now, returning -1 to
indicate an error is significantly better behavior than always
reporting failures as if the shell failed to exec after fork.
fundamentally there is no good reason these functions need to set an
orientation (morally it should be possible to write a wchar_t[] memory
stream using byte functions, or a char[] memory stream using wide
functions), but it's a part of the specification that they do. aside
from being able to inspect the orientation with fwide, failure to set
the orientation in open_wmemstream is observable if the locale changes
between open_wmemstream and the first operation on the stream; this is
because the encoding rule (locale) for the stream is required to be
bound at the time the stream becomes wide-oriented.
for open_wmemstream, call fwide to avoid duplicating the logic for
binding the encoding rule. for open_memstream it suffices just to set
the mode field in the FILE struct.
the w+ mode is specified to "truncate the buffer contents". like most
of fmemopen, exactly what this means is underspecified. mode w and w+
of course implicitly 'truncate' the buffer if a write from the initial
position is flushed, so in order for this part of the text about w+
not to be spurious, it should be interpreted as requiring something
else, and the obvious reasonable interpretation is that the truncation
is immediately visible if you attempt to read from the stream or the
buffer before writing/flushing.
this interpretation agrees with reported conformance test failures.
this is a POSIX requirement.
also remove the gratuitous locking shenanigans and simply access f->fd
under control of the lock. there is no advantage to not doing so, and
it made the correctness non-obvious at best.
__aeabi_read_tp used to call c code, but that was incorrect as the
arm runtime abi specifies special pcs for this function: it is only
allowed to clobber r0, ip, lr and cpsr.
maintainer's note: the old code explicitly saved and restored all
general-purpose registers which are call-clobbered in the normal
calling convention, so it's unlikely that any real-world compilers
produced code that could break. however theoretically they could have
chosen to use floating point registers, in which case the caller's
values of those registers would be clobbered.
commit 201995f382 introduced a hack
utilizing the signedness of character constants at the preprocessor
level to avoid depending on the gcc-specific __CHAR_UNSIGNED__ predef.
while this trick works on gcc and presumably other compilers being
used, it's not clear that the behavior it depends on is actually
conforming. C11 6.4.4.4 ¶10 defines character constants as having type
int, and 6.10.1 ¶4 defines preprocessor #if arithmetic to take place
in intmax_t or uintmax_t, depending on the signedness of the integer
operand types, and it is specified that "this includes interpreting
character constants".
if character literals had type char and just promoted to int, it would
be clear that when char is unsigned they should behave as uintmax_t at
the preprocessor level. however, as written the text of the standard
seems to require that character constants always behave as intmax_t,
corresponding to int, at the preprocessor level.
since there is a good deal of ambiguity about the correct behavior and
a risk that compilers will disagree or that an interpretation may
mandate a change in the behavior, do not rely on it for defining
CHAR_MIN and CHAR_MAX correctly. instead, use the signedness of the
value (as opposed to the type) of '\xff', which will be positive if
and only if plain char is unsigned. this behavior is clearly
specified, and the specific case '\xff' is even used in an example,
under 6.4.4.4 of the standard.
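as a minimal sketch of the value-based test described above (the
SKETCH_ names are hypothetical, chosen to avoid clashing with the real
limits.h macros, and an 8-bit char is assumed):

```c
#include <assert.h>
#include <limits.h>

/* '\xff' has a positive value if and only if plain char is unsigned */
#if '\xff' > 0
#define SKETCH_CHAR_MIN 0
#define SKETCH_CHAR_MAX 255
#else
#define SKETCH_CHAR_MIN (-128)
#define SKETCH_CHAR_MAX 127
#endif

/* returns 1 if the sketch agrees with the compiler's own limits */
static int sketch_char_limits_ok(void)
{
	assert(CHAR_BIT == 8);   /* assumption stated above */
	return SKETCH_CHAR_MIN == CHAR_MIN && SKETCH_CHAR_MAX == CHAR_MAX;
}
```

only the sign of the value is inspected, so the result does not depend
on which of intmax_t or uintmax_t the preprocessor evaluates in.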
with async cancellation enabled, pthread_cancel(pthread_self())
deadlocked due to pthread_kill holding killlock which is needed by
pthread_exit.
this could be solved by making pthread_kill block signals around the
critical section, at least when the target thread is itself, but the
issue only arises for cancellation, and otherwise would just be
imposing unnecessary cost.
instead just have pthread_cancel explicitly check for async
self-cancellation and call pthread_exit(PTHREAD_CANCELED) directly
rather than going through the signal machinery.
This manifests itself in mktime if tm_isdst = 1 and the current TZ= is
a POSIX timezone specification. mktime would see that tm_isdst was set
to 0 by __secs_to_zone, and subtract 'oppoff' (dst_off) - gmtoff from
the resultant time. This meant that mktime returned a time that was
exactly double the GMT offset of the desired timezone when tm_isdst
was 1.
commit 610c5a8524 changed the thread
pointer setup so tp points at the end of the pthread struct on arm,
but failed to update __aeabi_read_tp so it was off by 8.
this broke tls access in code that is compiled with -mtp=soft, which
is the default when target arch is pre armv6k or thumb1.
maintainer's note: no release versions are affected.
this is an obsolete error code from RFS, an obsolete predecessor of
NFS. POSIX documents it only as "Reserved", but maintains the
requirement that it be defined. as long as it is defined, it needs a
string for strerror to produce; the one chosen matches glibc and
documentation from other language runtimes I could find.
the code to perform rounding to the desired precision wrongly assumed
the long double mantissa was an integral number of nibbles (hex
digits) in length. this is true for 80-bit extended precision (64-bit
mantissa) but not for double (53) or quad (113).
scale the rounding value by 1<<(LDBL_MANT_DIG%4) to compensate.
the text of the specification for getopt's handling of options that
require an argument, which requires updating optarg and optind, does
not exclude the error case where the end of the argument list has been
reached. in that case, it is expected that optarg be assigned
argv[argc] (normally null) and optind be incremented by 2, resulting
in a value of argc+1.
commit 98c9af5001 wrongly claimed they
do not need to be valid for such usage, but the last sentence of C11
7.1.4 ¶1 imposes a broad requirement that all macros specified as
integer constant expressions also need to be valid for #if.
simply write the value out explicitly. there is no value here in
pretending that the width of int will vary.
POSIX requires the symlink function to fail with ENAMETOOLONG if the
link contents to be written exceed SYMLINK_MAX in length, but neither
Linux nor our syscall wrapper code enforce this. the value 255 for
SYMLINK_MAX is not meaningful and does not seem to have been motivated
by anything except perhaps a wrong assumption that a definition was
mandatory. it has been present (though moving through bits to
top-level limits.h) since the beginning of the project history.
[f]pathconf is entitled to return -1 as the limit for conf names for
which there is no hard limit, with the usual POSIX note that an
indefinite limit does not imply an infinite limit. in principle we
should perhaps report a limit for filesystems that impose one, but such
functionality is not currently present for any of the pathconf limits,
and adding it is beyond the scope of fixing the incorrect limit.
Call SYS_exit on return from fn in __clone. This is the expected
behavior of this function. Without this the child task will crash on
return from fn, since it will return to nowhere.
due to moved code, commit b8742f3260
inadvertently used the return value of __clone, rather than the return
value of SYS_sched_setscheduler in the new thread, to check whether it
needed to report failure. since a successful __clone returns the tid
of the new thread, which is never zero, this caused pthread_create
always to return with an invalid error number in the code path for
PTHREAD_EXPLICIT_SCHED.
this regression was not present in any releases.
the sign character produced came from the sign of tm_gmtoff/3600 as an
integer division, which is zero for negative offsets smaller in
magnitude than 3600. instead of printing the hours and minutes as
separate fields, print them as a single value of the form
hours*100+minutes, which naturally has the correct sign.
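the single-value approach can be sketched as follows. the helper name
and signature are hypothetical; only the hours*100+minutes expression
comes from the description above.

```c
#include <assert.h>
#include <stdio.h>

/* format a UTC offset given in seconds as +hhmm or -hhmm.
 * buf must have room for at least 6 bytes. */
static int sketch_format_utc_offset(long gmtoff, char *buf, size_t n)
{
	assert(n >= 6);
	/* one signed value: the sign is right even when the hours part
	 * is zero, since the minutes part carries it */
	return snprintf(buf, n, "%+.4ld", gmtoff/3600*100 + gmtoff%3600/60);
}
```

with an offset of -1800 seconds the hours term is 0 and the minutes
term is -30, so the printed value keeps its minus sign.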
the specfile for the wrapper was written assuming output is pie only
if -pie appears on the command line. recent (and older patched)
versions of gcc can be configured to produce pie output by default,
and when used with such a toolchain, the wrapper linked the wrong
startfiles (crt*) containing pic-incompatible code.
rather than trying to figure out gcc's default, simply always use the
pic-compatible start files.
this fixes a major gap in the intended functionality of
pthread_setattr_default_np. if application/library code creating a
thread does not pass a null attribute pointer to pthread_create, but
sets up an attribute object to change other properties while leaving
the stack alone, the created thread will get a stack with size
DEFAULT_STACK_SIZE. this makes pthread_setattr_default_np useless for
working around stack overflow issues in such applications, and leaves
a major risk of regression if previously-working code switches from
using a null attribute pointer to an attribute object.
this change aligns the behavior more closely with the glibc
pthread_setattr_default_np functionality too, albeit via a different
mechanism. glibc encodes "default" specially in the attribute object
and reads the actual default at thread creation time. with this
commit, we now copy the current default into the attribute object at
pthread_attr_init time, so that applications that query the properties
of the attribute object will see the right values.
maintainer's note: the key observation here is that the compared
element is the first slot of the second ceil(half) of the array, and
thus can be removed for further comparison when it does not match, so
that we descend into the second ceil(half)-1 rather than ceil(half)
elements. this change ensures that nel strictly decreases with each
iteration, so that the case of != but nel==1 does not need to be
special-cased anymore.
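a sketch of a binary search loop with this property, using the
hypothetical names sketch_bsearch and sketch_cmp_int; it illustrates
the observation above and is not necessarily the exact code committed:

```c
#include <assert.h>
#include <stddef.h>

static void *sketch_bsearch(const void *key, const void *base, size_t nel,
	size_t width, int (*cmp)(const void *, const void *))
{
	assert(width > 0);
	while (nel > 0) {
		/* first slot of the second ceil(half) of the array */
		const char *try = (const char *)base + width*(nel/2);
		int sign = cmp(key, try);
		if (sign < 0) {
			/* keep the first floor(half) elements */
			nel /= 2;
		} else if (sign > 0) {
			/* drop the compared slot too: ceil(half)-1 remain */
			base = try + width;
			nel -= nel/2+1;
		} else {
			return (void *)try;
		}
	}
	return 0;
}

/* example comparator for int elements */
static int sketch_cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;
	return (x > y) - (x < y);
}
```

both non-matching branches shrink nel by at least one, so the loop
terminates without a separate nel==1 case.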
maintainer's note: while musl does not use the linux kernel headers,
it does provide these three sys/* headers which do nothing but include
the corresponding linux/* headers, since the sys/* versions are the
ones documented for application use (and they arguably provide
interfaces that are not linux-specific but common to other unices).
these headers should probably not be provided by libc (rather by a
separate package), but as long as they are, use the bits header
framework as an aid to out-of-tree ports of musl for non-linux systems
that want to implement them in some other way.
maintainer's note: at some point, probably long before linux separated
the uapi headers, it was the case, or at least I believed it was the
case, that linux/types.h was unsafe to include from userspace. thus,
the inclusion guard macro _LINUX_TYPES_H was defined in sys/kd.h to
prevent linux/kd.h from including linux/types.h (which it spuriously
includes but does not use). as far as I can tell, whatever problem
this was meant to solve does not seem to have been present for a long
time, and the hack was not done correctly anyway, so removing it is
the right thing to do.
commit 32482f61da reduced the number of
int members before the dirent buf from 4 to 3, thereby misaligning it
mod sizeof(off_t), producing invalid accesses on any arch where
alignof(off_t)==sizeof(off_t).
rather than re-adding wasted padding, reorder the struct to meet the
requirement and add a comment and static assertion to prevent this
from getting broken again.
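a sketch of the reordering plus static assertion idea. the struct is a
hypothetical stand-in; member names and the buffer size are
illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

struct sketch_dirstream {
	off_t tell;   /* widest member first, so what follows stays aligned */
	int fd;
	int buf_pos;
	int buf_end;
	volatile int lock[1];
	/* any change to this struct must preserve the property:
	 * offsetof(struct sketch_dirstream, buf) % sizeof(off_t) == 0 */
	char buf[2048];
};

/* fails at compile time if a later edit misaligns the buffer */
static_assert(offsetof(struct sketch_dirstream, buf) % sizeof(off_t) == 0,
	"dirent buffer must be aligned mod sizeof(off_t)");
```

with off_t first and four int-sized members after it, the buffer
offset is a multiple of sizeof(off_t) for both 4-byte and 8-byte
off_t.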
sys/ptrace.h is target specific, use bits/ptrace.h to add target
specific macro definitions.
these macros are kept in the generic sys/ptrace.h even though some
targets don't support them:
PTRACE_GETREGS
PTRACE_SETREGS
PTRACE_GETFPREGS
PTRACE_SETFPREGS
PTRACE_GETFPXREGS
PTRACE_SETFPXREGS
so no macro definition got removed in this patch on any target. only
s390x has a numerically conflicting macro definition (PTRACE_SINGLEBLOCK).
the PT_ aliases follow glibc headers. otherwise the definitions come
from linux uapi headers, except those that are skipped in glibc and
either have no real kernel support (s390x PTRACE_*_AREA), need special
type definitions (mips PTRACE_*_WATCH_*), or are only relevant for
linux 2.4 compatibility (PTRACE_OLDSETOPTIONS).
new in linux v3.1 commit 3544d72a0e10d0aa1c1bd59ed77a53a59cdc12f7
changed in linux v3.4 commit 5cdf389aee90109e2e3d88085dea4dd5508a3be7
A tracer receives this event in the waitpid status of a process
attached with PTRACE_SEIZE.
including uchar.h in c++ code is only well defined in c++11 onwards
where char16_t and char32_t type definitions must be hidden since they
are keywords. however, some c++ code compiled for older c++ standards
includes uchar.h too and needs the typedefs; this fix makes such
code work.
previously, this operation succeeded, and the relocation results
worked for access from new threads created after dlopen, but produced
invalid accesses (and possibly clobbered other memory) from threads
that already existed.
the way the check is written, it still permits dlopen of libraries
containing initial-exec references to static TLS (TLS in the main
program or in a dynamic library loaded at startup).
tls_id is one-based, whereas [static_]tls_cnt is a count, so
comparison for checking that a given tls_id is dynamic rather than
static needs to use strict inequality.
this flag is notoriously under-/mis-specified, and in the past it was
implemented as a nop, essentially considering the absence of a
loopback interface with 127.0.0.1 and ::1 addresses an unsupported
configuration. however, common real-world container environments omit
IPv6 support (even for the network-namespaced loopback interface), and
some kernels omit IPv6 support entirely. future systems on the other
hand might omit IPv4 entirely.
treat these as supported configurations and suppress results of the
unconfigured/unsupported address families when AI_ADDRCONFIG is
requested. use routability of the loopback address to make the
determination; unlike other implementations, we do not exclude
loopback from the "an address is configured" condition, since there is
no basis in the specification for such exclusion. obtaining a result
with AI_ADDRCONFIG does not imply routability of the result, and
applications must still be able to cope with unroutable results even
if they pass AI_ADDRCONFIG.
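a rough sketch of the loopback routability probe, under the assumption
that a UDP connect (which sends no packets) is used to ask the kernel
for a route. sketch_family_configured is a hypothetical name, and real
code would need to distinguish the possible errno values rather than
treating every failure as "not configured".

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* 1 if the loopback address of the family is routable, else 0 */
static int sketch_family_configured(int family)
{
	struct sockaddr_in lo4 = { .sin_family = AF_INET };
	struct sockaddr_in6 lo6 = { .sin6_family = AF_INET6 };
	const struct sockaddr *sa;
	socklen_t sl;

	lo4.sin_port = htons(65535);
	lo4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	lo6.sin6_port = htons(65535);
	lo6.sin6_addr = in6addr_loopback;

	if (family == AF_INET) {
		sa = (const struct sockaddr *)&lo4;
		sl = sizeof lo4;
	} else if (family == AF_INET6) {
		sa = (const struct sockaddr *)&lo6;
		sl = sizeof lo6;
	} else {
		return 0;
	}

	int s = socket(family, SOCK_DGRAM, IPPROTO_UDP);
	if (s < 0) return 0;   /* family not supported at all */
	int r = connect(s, sa, sl);
	close(s);
	return r == 0;
}
```

a caller honoring AI_ADDRCONFIG would then drop results of any family
for which this returns 0.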
commit 0b80a7b040, which added non-stub
setvbuf, applied the UNGET pushback adjustment to the size of the
buffer passed in, but inadvertently omitted offsetting the start by
the same amount, thereby allowing unget to clobber up to 8 bytes
before the start of the buffer. this bug was introduced in the present
release cycle; no releases are affected.
to produce sorted results roughly corresponding to RFC 3484/6724,
__lookup_name computes routability and choice of source address via
dummy UDP connect operations (which do not produce any packets). since
at the logical level, the properties fed into the sort key are
computed on ipv6 addresses, the code was written to use the v4mapped
ipv6 form of ipv4 addresses and share a common code path for them all.
however, on kernels where ipv6 support has been completely omitted,
this causes ipv4 to appear equally unroutable as ipv6, thereby putting
unreachable ipv6 addresses before ipv4 addresses in the results.
instead, use only ipv4 sockets to compute routability for ipv4
addresses. some gratuitous conversion back and forth is left so that
the logic is not affected by these changes. it may be possible to
simplify the ipv4 case considerably, thereby reducing code size and
complexity.
since slack space at the beginning and/or end of writable load maps is
donated to malloc, the application could obtain valid pointers in
these ranges which dladdr would erroneously identify as part of the
shared object whose mapping they came from.
instead of checking the queried address against the mapping base and
length, check it against the load segments from the program headers,
and only match the dso if it lies within the bounds of one of them.
as a shortcut, if the address does match the range of the mapping but
not any of the load segments, we know it cannot match any other dso
and can immediately return failure.
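the segment-based check, including the shortcut, might look roughly
like this. the structs are hypothetical simplifications of the program
header and dso records.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_seg { int is_load; uintptr_t vaddr; size_t memsz; };
struct sketch_dso {
	uintptr_t base;      /* load bias added to each vaddr */
	uintptr_t map;       /* start of the whole reserved mapping */
	size_t map_len;
	const struct sketch_seg *segs;
	size_t nsegs;
};

/*  1: addr lies inside a load segment of this dso
 *  0: addr is outside the mapping; another dso may still match
 * -1: addr is in the mapping but in no load segment (for example
 *     slack donated to malloc); no other dso can match, so fail */
static int sketch_addr_in_dso(const struct sketch_dso *dso, uintptr_t addr)
{
	assert(dso && dso->segs);
	/* unsigned wraparound makes this reject addr < map as well */
	if (addr - dso->map >= dso->map_len) return 0;
	for (size_t i = 0; i < dso->nsegs; i++) {
		const struct sketch_seg *s = &dso->segs[i];
		if (s->is_load && addr - dso->base - s->vaddr < s->memsz)
			return 1;
	}
	return -1;
}
```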
the early-exit condition for the symbol match loop on exact matches
caused dladdr to produce the first match for an exact match, but the
last match for an inexact match. in the interest of consistency,
require a strictly-closer match to replace an already-found one.
commit 8b8fb7f037 added logic to prevent
matching a symbol with no recorded size (closest-match) when there is
an intervening symbol whose size was recorded, but it only worked when
the intervening symbol was encountered later in the search.
instead of rejecting symbols where addr falls outside their recorded
size during the closest-match search, accept them to find the true
closest-match, then reject such a result only once the search has
finished.
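a sketch of the search order described above: find the closest symbol
first, requiring a strictly closer address to replace a candidate, and
apply the size check only afterwards. the struct and function names
are hypothetical, and a size of 0 stands for "no size recorded".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_sym { uintptr_t addr; size_t size; const char *name; };

static const struct sketch_sym *sketch_closest_sym(
	const struct sketch_sym *syms, size_t n, uintptr_t addr)
{
	const struct sketch_sym *best = 0;
	assert(syms || !n);
	for (size_t i = 0; i < n; i++) {
		if (syms[i].addr > addr) continue;
		/* strictly closer only, so the first match wins on ties */
		if (best && syms[i].addr <= best->addr) continue;
		best = &syms[i];
		if (best->addr == addr) break;   /* exact; cannot improve */
	}
	/* reject only once the search is done: a sized symbol that is
	 * closest but does not cover addr is an intervening symbol and
	 * also hides any farther symbol with no recorded size */
	if (best && best->size && addr - best->addr >= best->size)
		return 0;
	return best;
}
```

because rejection happens after the loop, the outcome no longer
depends on the order in which symbols are encountered.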
based on patch by Axel Siebenborn, with fixes discussed on the mailing
list after submission and rebased around the UB fix in commit
e829695fcc.
avoid spurious symbol matches by dladdr beyond symbol size. for
symbols with a size recorded, only match if the queried address lies
within the address range determined by the symbol address and size.
for symbols with no size recorded, the old closest-match behavior is
kept, as long as there is no intervening symbol with a recorded size.
the case where no symbol is matched, but the address does lie within
the memory range of a shared object, is specified as success. fix the
return value and produce a valid (with null dli_sname and dli_saddr)
Dl_info structure.
maintainer's note: past sentiment was that, despite being imperfect
and unable to force clearing of all possible copies of sensitive data
(e.g. in registers, register spills, signal contexts left on the
stack, etc.) this function would be added if major implementations
agreed on it, which has happened -- several BSDs and glibc all include
it.
maintainer's note: this change is for conformance with RFC 5952,
4.2.2, which explicitly forbids use of :: to shorten a single 16-bit 0
field when producing the canonical text representation for an IPv6
address. fixes a test failure reported by Philip Homburg, who also
submitted a patch, but this fix is simpler and should produce smaller
code.
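the rule can be illustrated with a helper that picks which run of zero
fields "::" may replace. this is a hypothetical sketch, not the code
from the fix; it only accepts runs of two or more fields and prefers
the first of equally long runs, as RFC 5952 4.2.3 requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* returns the start index of the run to shorten and stores its
 * length in *len, or returns -1 (with *len set to 0) if there is
 * no run of at least two zero fields */
static int sketch_zero_run(const uint16_t f[8], int *len)
{
	int best = -1, bestlen = 1;   /* a run must beat length 1 */
	assert(f && len);
	for (int i = 0; i < 8; i++) {
		int j = i;
		while (j < 8 && f[j] == 0) j++;
		if (j - i > bestlen) {
			best = i;
			bestlen = j - i;
		}
		if (j > i) i = j - 1;   /* continue after this run */
	}
	*len = best < 0 ? 0 : bestlen;
	return best;
}
```

for 2001:db8:0:1:1:1:1:1 this returns -1, so the single zero field is
printed as 0 rather than being replaced by "::".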
if a final dot was included in the queried host name to anchor it to
the dns root/suppress search domains, and the result was not a CNAME,
the returned canonical name included the final dot. this was not
consistent with other implementations, confused some applications, and
does not seem desirable.
POSIX specifies returning a pointer to, or to a copy of, the input
nodename, when the canonical name is not available, but does not
attempt to specify what constitutes "not available". in the case of
search, we already have an implementation-defined "availability" of a
canonical name as the fully-qualified name resulting from search, so
defining it similarly in the no-search case seems reasonable in
addition to being consistent with other implementations.
as a bonus, fix the case where more than one trailing dot is included,
since otherwise the changes made here would wrongly cause lookups with
two trailing dots to succeed. previously this case resulted in
malformed dns queries and produced EAI_AGAIN after a timeout. now it
fails immediately with EAI_NONAME.
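a minimal sketch of the trailing-dot handling, with a hypothetical
helper name. real host name validation checks much more than this; the
-1 return stands for the case the caller would report as EAI_NONAME.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* length of name with one final dot stripped, or -1 if the name is
 * empty or ends in more than one dot */
static long sketch_canon_len(const char *name)
{
	assert(name);
	size_t l = strlen(name);
	if (l && name[l-1] == '.') l--;          /* anchor dot: drop it */
	if (!l || name[l-1] == '.') return -1;   /* "", "." or "a.." */
	return (long)l;
}
```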
commit 587f5a53bc moved the definition
of SO_PEERSEC to bits/socket.h for archs where the SO_* macros differ
from their standard values, but failed to add copies of the generic
definition for powerpc and powerpc64.
writable load segments can have size-in-memory larger than their size
in the ELF file, representing bss or equivalent. the initial partial
page has to be zero-filled, and additional anonymous pages have to be
mapped such that accesses don't fail with SIGBUS.
map_library skips redundant MAP_FIXED mapping of the initial
(lowest-address) segment when processing LOAD segments since it was
already mapped when reserving the virtual address range, but in doing
so, inadvertently also skipped the code to fill/map bss. typical
executable and library files have two or more LOAD segments, and the
first one is text/rodata (non-writable) and thus has no bss, but it is
syntactically valid for an ELF program/library to put its writable
segment first, or to have only one segment (everything writable). the
binutils bfd-based linker has been observed to create such programs in
the presence of unusual sections or linker scripts.
fix by moving only the mmap_fixed operation under the conditional
rather than skipping the remainder of the loop body. add a check to
avoid bss processing in the case where the segment is not writable;
this should not happen, but if it does, the change would be a crashing
regression without this check.
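the bss work needed for one writable segment can be sketched as a pure
calculation. all names are hypothetical, the page size is assumed to
be a power of two, and the actual loader performs the memset and mmap
rather than returning a plan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_bss_plan {
	uintptr_t zero_start; size_t zero_len;   /* zero-fill in partial page */
	uintptr_t anon_start; size_t anon_len;   /* anonymous pages to map */
};

static struct sketch_bss_plan sketch_plan_bss(uintptr_t seg_start,
	size_t filesz, size_t memsz, size_t page)
{
	struct sketch_bss_plan p = {0};
	assert(page && !(page & (page-1)) && memsz >= filesz);
	if (memsz == filesz) return p;   /* no bss in this segment */

	uintptr_t brk = seg_start + filesz;   /* end of file-backed data */
	uintptr_t pgbrk = (brk + page-1) & -(uintptr_t)page;
	uintptr_t end = seg_start + memsz;

	p.zero_start = brk;
	p.zero_len = (end < pgbrk ? end : pgbrk) - brk;
	if (end > pgbrk) {
		p.anon_start = pgbrk;
		p.anon_len = end - pgbrk;
	}
	return p;
}
```

the point of the fix is that this plan must be carried out for the
first segment too, even though its MAP_FIXED mapping is skipped.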
mlock2 syscall was added in linux v4.4 and glibc has api for it.
It falls back to mlock in case of flags==0, so that case works
even on older kernels.
MLOCK_ONFAULT is moved under _GNU_SOURCE following glibc.
the mode member of struct ipc_perm is specified by POSIX to have type
mode_t, which is uniformly defined as unsigned int. however, Linux
defines it with type __kernel_mode_t, and defines __kernel_mode_t as
unsigned short on some archs. since there is a subsequent padding
field, treating it as a 32-bit unsigned int works on little endian
archs, but the order is backwards on big endian archs with the
erroneous definition.
since multiple archs are affected, remedy the situation with fixup
code in the affected functions (shmctl, semctl, and msgctl) rather
than repeating the same shims in syscall_arch.h for every affected
arch.
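to illustrate why the order is backwards on big endian, the following
sketch simulates the kernel's 16-bit mode plus 16-bit padding as four
bytes and reads them back as one 32-bit value. the shift-by-16 fixups
are an assumption about what such fixup code would do, and every name
is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* kernel side, simulated: 16-bit mode, then 16 bits of padding,
 * laid out in big endian byte order */
static void sketch_kernel_store_mode(unsigned char b[4], uint16_t mode)
{
	b[0] = mode >> 8;
	b[1] = mode & 0xff;
	b[2] = b[3] = 0;
}

/* userspace side: the same 4 bytes read as a big endian 32-bit int */
static uint32_t sketch_load_be32(const unsigned char b[4])
{
	return (uint32_t)b[0]<<24 | (uint32_t)b[1]<<16
		| (uint32_t)b[2]<<8 | b[3];
}

/* after a stat-style read the mode sits in the high half */
static uint32_t sketch_fix_mode_after_stat(uint32_t raw)
{
	return raw >> 16;
}

/* before a set-style write it must be moved back there */
static uint32_t sketch_fix_mode_before_set(uint32_t mode)
{
	assert(mode <= 0xffff);
	return mode * 0x10000U;
}
```

on little endian the low half of the 32-bit value overlays the 16-bit
field, which is why no fixup is needed there.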
PR_{SET,GET}_SPECULATION_CTRL controls speculation related vulnerability
mitigations, new in commits
b617cfc858161140d69cc0b5cc211996b557a1c7
356e4bfff2c5489e016fdb925adbf12a1e3950ee
new and missing netlink attributes types for SCM_TIMESTAMPING_OPT_STATS,
new ones were added in commits
7156d194a0772f733865267e7207e0b08f81b02b
be631892948060f44b1ceee3132be1266932071e
87ecc95d81d951b0984f2eb9c5c118cb68d0dce8
introduced to stat ipc objects without permission checks since the
info is available in /proc/sysvipc anyway, new in linux commits
23c8cec8cf679b10997a512abb1e86f0cedc42ba
a280d6dc77eb6002f269d58cd47c7c7e69b617b6
c21a6970ae727839a2f300cd8dd957de0d0238c3
to map at a fixed address without unmapping underlying mappings
(fails with EEXIST unlike MAP_FIXED), new in linux commits
4ed28639519c7bad5f518e70b3284c6e0763e650 and
a4ff8e8620d3f4f50ac4b41e8067b7d395056843.
add pkey_mprotect, pkey_alloc, pkey_free syscall numbers,
new in linux commits 3350eb2ea127978319ced883523d828046af4045
and 9499ec1b5e82321829e1c1510bcc37edc20b6f38
to get seccomp state for checkpoint restore.
added in linux commit 26500475ac1b499d8636ff281311d633909f5d20
struct tag follows the glibc api and ptrace_peeksiginfo_args
got changed too accordingly.
added to uapi in commit 65aaf87b3aa2d049c6b9fd85221858a895df3393
used since commit a9a08845e9acbd224e4ee466f5c1275ed50054e8,
which renamed POLL* to EPOLL* in the kernel.
three ABIs are supported: the default with 68881 80-bit fpu format and
results returned in floating point registers, softfloat-only with the
same format, and coldfire fpu with IEEE single/double only. only the
first is tested at all, and only under qemu which has fpu emulation
bugs.
basic functionality smoke tests have been performed for the most
common arch-specific breakage via libc-test and qemu user-level
emulation. some sysvipc failures remain, but are shared with other big
endian archs and will be fixed separately.
since x86 and m68k are the only archs with 80-bit long double and each
has mandatory endianness, select the variant via endianness.
differences are minor: apparently just byte order and representation
of infinities. the m68k format is not well-documented anywhere I could
find, so if other differences are found they may require additional
changes later.
In TLS variant I the TLS is above TP (or above a fixed offset from TP)
but on some targets there is a reserved gap above TP before TLS starts.
This matters for the local-exec tls access model when the offsets of
TLS variables from the TP are hard coded by the linker into the
executable, so the libc must compute these offsets the same way as the
linker. The tls offset of the main module has to be
alignup(GAP_ABOVE_TP, main_tls_align).
If there is no TLS in the main module then the gap can be ignored
since musl does not use it and the tls access models of shared
libraries are not affected.
The previous setup only worked if (tls_align & -GAP_ABOVE_TP) == 0
(i.e. TLS did not require large alignment) because the gap was
treated as a fixed offset from TP. Now the TP points at the end
of the pthread struct (which is aligned) and there is a gap above
it (which may also need alignment).
The fix required changing TP_ADJ and __pthread_self on affected
targets (aarch64, arm and sh) and in the tlsdesc asm the offset to
access the dtv changed too.
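the offset rule alignup(GAP_ABOVE_TP, main_tls_align) can be sketched
as follows. the function names are hypothetical, and the gap is passed
as a parameter here since its value is target-specific.

```c
#include <assert.h>
#include <stddef.h>

/* round x up to a multiple of align, which must be a power of two */
static size_t sketch_alignup(size_t x, size_t align)
{
	assert(align && !(align & (align-1)));
	return (x + align-1) & -align;
}

/* offset of the main module's TLS from TP on a variant I target */
static size_t sketch_main_tls_offset(size_t gap_above_tp,
	size_t main_tls_align)
{
	return sketch_alignup(gap_above_tp, main_tls_align);
}
```

for a 16-byte gap, an alignment of 8 gives offset 16 (the gap itself),
while an alignment of 64 pushes the offset out to 64; treating the gap
as a fixed offset gets the second case wrong.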
since this iconv implementation's output is stateless, it's necessary
to know before writing anything to the output buffer whether the
conversion of the current input character will fit.
previously we used a hard-coded table of the output size needed for
each supported output encoding, but failed to update the table when
adding support for conversion to jis-based encodings and again when
adding separate encoding identifiers for implicit-endianness utf-16/32
and ucs-2/4 variants, resulting in out-of-bound table reads and
incorrect size checks. no buffer overflow was possible, but the
affected characters could be converted incorrectly, and iconv could
potentially produce an incorrect return value as a result.
remove the hard-coded table, and instead perform the recursive iconv
conversion to a temporary buffer, measuring the output size and
transferring it to the actual output buffer only if the whole
converted result fits.
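the transfer step can be sketched like this, assuming the recursive
conversion has already filled a temporary buffer with n bytes.
sketch_emit_if_fits is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* copy a converted character to the real output only if all n bytes
 * fit. on failure *out and *outleft are left untouched, so nothing
 * partial is written and the caller can report E2BIG.
 * returns 0 on success, -1 if the result does not fit. */
static int sketch_emit_if_fits(char **out, size_t *outleft,
	const char *tmp, size_t n)
{
	assert(out && *out && outleft && tmp);
	if (n > *outleft) return -1;
	memcpy(*out, tmp, n);
	*out += n;
	*outleft -= n;
	return 0;
}
```

measuring the real output this way removes the need to keep a size
table in sync with the list of supported encodings.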
this case is handled with a recursive call to iconv using a
specially-constructed conversion descriptor. the constant 0 was used
as the offset for utf-8, since utf-8 appears first in the charmaps
table, but the offset used needs to point into the charmap entry, past
the name/aliases at the beginning, to the byte identifying the
encoding. as a result of this error, junk was produced.
instead, call find_charmap so we don't have to hard-code a nontrivial
offset. with this change, the code has been tested and found to work
in the case of converting the affected hkscs characters to utf-8.
maintainer's notes:
commit 95c6044e2a split UTF-32 and
UTF-32BE but neglected to add a case for the former as a destination
encoding, resulting in it wrongly being handled by the default case.
the intent was that the value of the macro be chosen to encode "big
endian" in the low bits, so that no code would be needed, but this was
botched; instead, handle it the way UCS2 is handled.
maintainer's notes:
commit a223dbd27a added the reverse
conversions to JIS-based encodings, but omitted the check for remaining
buffer space in the case where the next character to be written was
single-byte, allowing conversion to continue past the end of the
destination buffer.
the wrapper start function that performs scheduling operations is
unreachable if pthread_attr_setinheritsched is never called, so move
it there rather than the pthread_create source file, saving some code
size for static-linked programs.
eliminate the awkward startlock mechanism and corresponding fields of
the pthread structure that were only used at startup.
instead of having pthread_create perform the scheduling operations and
having the new thread wait for them to be completed, start the new
thread with a wrapper start function that performs its own scheduling,
sending the result code back via a futex. this way the new thread can
use storage from the calling thread's stack rather than permanent
fields in the pthread structure.
over time the pthread structure has accumulated a lot of cruft taking
up size. this commit removes unused fields and packs booleans and
other small data more efficiently. changes which would also require
changing code are not included at this time.
non-volatile booleans are packed as unsigned char bitfield members.
the canceldisable and cancelasync fields need volatile qualification
due to how they're accessed from the cancellation signal handler and
cancellable syscalls called from signal handlers. since volatile
bitfield semantics are not clearly defined, discrete char objects are
used instead.
the pid field is completely removed; it has been unused since commit
83dc6eb087.
the tid field's type is changed to int because its use is as a value
in futexes, which are defined as plain int. it has no conceptual
relationship to pid_t. also, its position is not ABI.
startlock is reduced to a length-1 array. the second element was
presumably intended as a waiter count, but it was never used and made
no sense, since there is at most one waiter.
previously, some accesses to the detached state (from pthread_join and
pthread_getattr_np) were unsynchronized; they were harmless in
programs with well-defined behavior, but ugly. other accesses (in
pthread_exit and pthread_detach) were synchronized by a poorly named
"exitlock", with an ad-hoc trylock operation on it open-coded in
pthread_detach, whose only purpose was establishing protocol for which
thread is responsible for deallocation of detached-thread resources.
instead, use an atomic detach_state and unify it with the futex used
to wait for thread exit. this eliminates 2 members from the pthread
structure, gets rid of the hackish lock usage, and makes rigorous the
trap added in commit 80bf595255 for
catching attempts to join detached threads. it should also make
attempts to detach an already-detached thread reliably trap.
if the last thread exited via pthread_exit, the logic that marked it
dead did not account for the possibility of it targeting itself via
atexit handlers. for example, an atexit handler calling
pthread_kill(pthread_self(), SIGKILL) would return success
(previously, ESRCH) rather than causing termination via the signal.
move the release of killlock after the determination is made whether
the exiting thread is the last thread. in the case where it's not,
move the release all the way to the end of the function. this way we
can clear the tid rather than spending storage on a dedicated
dead-flag. clearing the tid is also preferable in that it hardens
against inadvertent use of the value after the thread has terminated
but before it is joined.
posix documents in the rationale and future directions for
pthread_kill that, since the lifetime of the thread id for a joinable
thread lasts until it is joined, ESRCH is not a correct error for
pthread_kill to produce when the target thread has exited but not yet
been joined, and that conforming applications cannot attempt to detect
this state. future versions of the standard may explicitly require
that ESRCH not be returned for this case.
the tid field in the pthread structure is not volatile, and really
shouldn't be, so as not to limit the compiler's ability to reorder,
merge, or split loads in code paths that may be relevant to
performance (like controlling lock ownership).
however, use of objects which are not volatile or atomic with futex
wait is inherently broken, since the compiler is free to transform a
single load into multiple loads, thereby using a different value for
the controlling expression of the loop and the value passed to the
futex syscall, leading the syscall to block instead of returning.
reportedly glibc's pthread_join was actually affected by an equivalent
issue on s390.
add a separate, dedicated join_futex object for pthread_join to use.
the static const zero set ended up getting put in bss instead of
rodata, wasting writable memory, and the call to memcmp was
size-inefficient. generally for nonstandard extension functions we try
to avoid poking at any internals directly, but the way the zero set
was set up was arguably already doing so.
to support the GNU extension of allocating a buffer for getcwd's
result when a null pointer is passed without incurring a link
dependency on free, we use a PATH_MAX-sized buffer on the stack and
only duplicate it to allocated storage after the operation succeeds.
unfortunately this imposed excessive stack usage on all callers,
including those not making use of the GNU extension.
instead, use a VLA to make stack allocation conditional.
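a minimal sketch of the conditional stack allocation, assuming a
hypothetical wrapper named sketch_getcwd that is built on the public
getcwd function rather than on the raw syscall. the name and the
malloc+memcpy duplication step are illustrative only:

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* hypothetical wrapper, not the libc implementation */
char *sketch_getcwd(char *buf, size_t size)
{
	/* the VLA is PATH_MAX bytes only when the caller passed a null
	 * pointer; otherwise it collapses to a single byte */
	char tmp[buf ? 1 : PATH_MAX];
	if (!buf) {
		buf = tmp;
		size = sizeof tmp;
	}
	if (!getcwd(buf, size)) return 0;
	if (buf != tmp) return buf;
	/* duplicate to allocated storage only after success */
	size_t len = strlen(tmp) + 1;
	char *copy = malloc(len);
	if (copy) memcpy(copy, tmp, len);
	return copy;
}
```

callers that supply their own buffer pay for one byte of extra stack
instead of PATH_MAX.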
in thumb mode, r7 is the ABI frame pointer register, and unless frame
pointer is disabled, gcc insists on treating it as a fixed register,
refusing to spill it to satisfy constraints. unfortunately, r7 is also
used in the syscall ABI for passing the syscall number.
up until now we just treated this as a requirement to disable frame
pointer when generating code as thumb, but it turns out gcc forcibly
enables frame pointer, and the fixed register constraint that goes
with it, for functions which contain VLAs. this produces an
unacceptable arch-specific constraint that (non-arm-specific) source
files making syscalls cannot use VLAs.
as a workaround, avoid r7 register constraints when producing thumb
code and instead save/restore r7 in a temp register as part of the asm
block. at some point we may want/need to support armv6-m/thumb1, so
the asm has been tweaked to be thumb1-compatible while also
near-optimal for thumb2: it allows the temp and/or syscall number to
be in high registers (necessary since r0-r5 may all be used for
syscall args) and in thumb2 mode allows the syscall number to be an
8-bit immediate.
for getopt_long, partial (prefix) matches of long options always begin
with "--" and thus can never be ambiguous with a short option. for
getopt_long_only, though, a single-character option can match both a
short option and a prefix of a long option. in this case, we
wrongly interpreted it as a prefix for the long option.
introduce a new pass, only in long-only mode, to check the prefix
match against short options before accepting it. the only reason
there's a slightly nontrivial loop being introduced rather than strchr
is that our getopt already supports multibyte short options, and
getopt_long_only should handle them consistently. a temp buffer and
strstr could have been used, but the code to set it up would be just
as large as what's introduced here and it would unnecessarily pull in
relatively large code for strstr.
commit 618b18c78e removed the previous
detection and hardening since it was incorrect. commit
72141795d4 already handled all that
remained for hardening the static-linked case. in the dynamic-linked
case, have the dynamic linker check whether malloc was replaced and
make that information available.
with these changes, the properties documented in commit
c9f415d7ea are restored: if calloc is
not provided, it will behave as malloc+memset, and any of the
memalign-family functions not provided will fail with ENOMEM.
this change serves multiple purposes:
1. it ensures that static linking of memalign-family functions will
pull in the system malloc implementation, thereby causing link errors
if an attempt is made to link the system memalign functions with a
replacement malloc (incomplete allocator replacement).
2. it eliminates calls to free that are unpaired with allocations,
which are confusing when setting breakpoints or tracing execution.
as a bonus, making __bin_chunk external may discourage aggressive and
unnecessary inlining of it.
the generated code should be mostly unchanged, except for explicit use
of C_INUSE in place of copying the low bits from existing chunk
headers/footers.
these changes also remove mild UB due to dubious arithmetic on
pointers into imaginary size_t[] arrays.
commit c9f415d7ea included checks to
make calloc fall back to memset if used with a replaced malloc that
didn't also replace calloc, and to make the memalign family fail if free
has been replaced. however, the checks gave false positives for
replacement whenever malloc or free resolved to a PLT entry in the
main program.
for now, disable the checks so as not to leave libc in a broken state.
this means that the properties documented in the above commit are no
longer satisfied; failure to replace calloc and the memalign family
along with malloc is unsafe if they are ever called.
the calloc checks were correct but useless for static linking. in both
cases (simple or full malloc), calloc and malloc are in a source file
together, so replacement of one but not the other would give linking
errors. the memalign-family check was useful for static linking, but
broken for dynamic as described above, and can be replaced with a
better link-time check.
__ARM_ARCH_6ZK__ is a gcc specific historical typo which may not be
defined by other compilers.
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02237.html
To avoid unexpected results when building for ARMv6KZ with clang, the
correct form of the macro (ie 6KZ) needs to be tested. The incorrect
form of the macro (ie 6ZK) still needs to be tested for compatibility
with pre-2015 versions of gcc.
Provide an ARM specific a_ctz_32 helper function for architecture
versions for which it can be implemented efficiently via the "rbit"
instruction (ie all Thumb-2 capable versions of ARM v6 and above).
Update atomic.h to provide a_ctz_l in all cases (atomic_arch.h should
now only provide a_ctz_32 and/or a_ctz_64).
The generic version of a_ctz_32 now takes advantage of a_clz_32 if
available and the generic a_ctz_64 now makes use of a_ctz_32.
replacement is subject to conditions on the replacement functions.
they may only call functions which are async-signal-safe, as specified
either by POSIX or as an implementation-defined extension. if any
allocator functions are replaced, at least malloc, realloc, and free
must be provided. if calloc is not provided, it will behave as
malloc+memset. any of the memalign-family functions not provided will
fail with ENOMEM.
in order to implement the above properties, calloc and __memalign
check that they are using their own malloc or free, respectively.
choice to check malloc or free is based on considerations of
supporting __simple_malloc. in order to make this work, calloc is
split into separate versions for __simple_malloc and full malloc;
commit ba819787ee already did most of
the split anyway, and completing it saves an extra call frame.
previously, use of -Bsymbolic-functions made dynamic interposition
impossible. now, we are using an explicit dynamic-list, so add
allocator functions to the list. most are not referenced anyway, but
all are added for completeness.
instead of using a waiters count, add a bit to the lock field
indicating that the lock may have waiters. threads which obtain the
lock after contending for it will perform a potentially-spurious wake
when they release the lock.
the existing laddr function for fdpic cannot translate ELF virtual
addresses outside of the LOAD segments to runtime addresses because
the fdpic loadmap only covers the logically-mapped part. however the
whole point of reclaim_gaps is to recover the slack space up to the
page boundaries, so it needs to work with such addresses.
add a new laddr_pg function that accepts any address in the page range
for the LOAD segment by expanding the loadmap records out to page
boundaries. only use the new version for reclaim_gaps, so as not to
impact performance of other address lookups.
also, only use laddr_pg for the start address of a gap; the end
address lies one byte beyond the end, potentially in a different page
where it would get mapped differently. instead of mapping end, apply
the length (end-start) to the mapped value of start.
we have always bound symbols at libc.so link time rather than runtime
to minimize startup-time relocations and overhead of calls through the
PLT, and possibly also to preclude interposition that would not work
correctly anyway if allowed. historically, binding at link-time was
also necessary for the dynamic linker to work, but the dynamic linker
bootstrap overhaul in commit f3ddd17380
made it unnecessary.
our use of -Bsymbolic-functions, rather than -Bsymbolic, was chosen
because the latter is incompatible with public global data; it makes
it incompatible with copy relocations in the main program. however,
not all global data needs to be public. by using --dynamic-list
instead with an explicit list, we can reduce the number of symbolic
relocations left for runtime.
this change will also allow us to permit interposition of specific
functions (e.g. the allocator) if/when we want to, by adding them to
the dynamic list.
the Linux SYS_nice syscall is unusable because it does not return the
newly set priority. always use SYS_setpriority. also avoid overflows
in addition of inc by handling large inc values directly without
examining the old nice value.
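a minimal sketch of the overflow-free computation, written as a pure
helper so it can be checked in isolation. the names sketch_nice_target
and SKETCH_NZERO are hypothetical; a real implementation would obtain
old from getpriority(PRIO_PROCESS, 0) and pass the result to
setpriority:

```c
#include <limits.h>

#define SKETCH_NZERO 20 /* assumed value of NZERO on Linux */

/* compute the priority to request, given the increment and the
 * current nice value */
int sketch_nice_target(int inc, int old)
{
	int prio = inc;
	/* only add the old value when it can affect the clamped result;
	 * for larger |inc| the sum could overflow and is not needed */
	if (inc > -2*SKETCH_NZERO && inc < 2*SKETCH_NZERO)
		prio += old;
	if (prio > SKETCH_NZERO-1) prio = SKETCH_NZERO-1;
	if (prio < -SKETCH_NZERO) prio = -SKETCH_NZERO;
	return prio;
}
```

for example, an increment of INT_MAX clamps to the maximum nice value
without ever being added to the old value.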
Implementation of __malloc0 in malloc.c takes care to preserve zero
pages by overwriting only non-zero data. However, malloc must have
already modified auxiliary heap data just before and beyond the
allocated region, so we know that edge pages need not be preserved.
For allocations smaller than one page, pass them immediately to memset.
Otherwise, use memset to handle partial pages at the head and tail of
the allocation, and scan complete pages in the interior. Optimize the
scanning loop by processing 16 bytes per iteration and handling the rest
of the page via memset as soon as a non-zero byte is found.
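The strategy can be sketched as follows. This is a simplified
forward-scanning illustration, not the code in malloc.c; the name
sketch_clear and the fixed SKETCH_PAGE value of 4096 are assumptions
made for the example:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE 4096u /* assumed page size */

void sketch_clear(unsigned char *p, size_t n)
{
	/* small allocations go straight to memset */
	if (n < SKETCH_PAGE) {
		memset(p, 0, n);
		return;
	}
	/* partial page at the head, up to the first page boundary */
	size_t head = (SKETCH_PAGE - ((uintptr_t)p & (SKETCH_PAGE-1)))
		& (SKETCH_PAGE-1);
	memset(p, 0, head);
	p += head;
	n -= head;
	/* complete interior pages: scan 16 bytes at a time and only
	 * write once a non-zero byte has been seen */
	while (n >= SKETCH_PAGE) {
		size_t i;
		for (i = 0; i < SKETCH_PAGE; i += 16) {
			uint64_t w[2];
			memcpy(w, p + i, sizeof w);
			if (w[0] | w[1]) break;
		}
		if (i < SKETCH_PAGE) memset(p + i, 0, SKETCH_PAGE - i);
		p += SKETCH_PAGE;
		n -= SKETCH_PAGE;
	}
	/* partial page at the tail */
	memset(p, 0, n);
}
```

An interior page that is already all zeros is only read, never
written, so it is not dirtied.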
the catan implementation from OpenBSD includes a FIXME-annotated
"overflow" branch that produces a meaningless and incorrect
large-magnitude result. it was reachable via three paths,
corresponding to gotos removed by this commit, in order:
1. pure imaginary argument with imaginary component greater than 1 in
magnitude. this case does not seem at all exceptional and is
handled (at least with the quality currently expected from our
complex math functions) by the existing non-exceptional code path.
2. arguments on the unit circle, including the pure-real argument 1.0.
these are not exceptional except for ±i, which should produce
results with infinite imaginary component and which lead to
computation of atan2(±0,0) in the existing non-exceptional code
path. such calls to atan2() however are well-defined by POSIX.
3. the specific argument +i. this route should be unreachable due to
the above (2), but subtle rounding effects might have made it
possible in rare cases. continuing on the non-exceptional code path
in this case would lead to computing the (real) log of an infinite
argument, then producing a NAN when multiplying it by I.
for now, remove the exceptional code paths entirely. replace the
multiplication by I with construction of a complex number using the
CMPLX macro so that the NAN issue which path (3) guarded against cannot arise.
with these changes, catan should give reasonably correct results for
real arguments, and should no longer give completely-wrong results for
pure-imaginary arguments outside the interval (-i,+i).
the factor of -i noted in the comment at the top of casin.c was
omitted from the actual code, yielding a result rotated 90 degrees and
propagating into errors in other functions defined in terms of casin.
implement multiplication by -i as a rotation of the real and imaginary
parts of the result, rather than by actual multiplication, since the
latter cannot be optimized without knowledge that the operand is
finite. here, the rotation is the actual intent, anyway.
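a minimal sketch of the rotation, assuming a hypothetical helper named
sketch_mul_neg_i. it relies on the identity -i*(a+bi) = b-ai and on
the C11 CMPLX macro:

```c
#include <complex.h>

/* -i * (a + bi) = b - ai, built without any floating-point multiply */
double complex sketch_mul_neg_i(double complex w)
{
	return CMPLX(cimag(w), -creal(w));
}
```

because the parts are only swapped and negated, no products such as
infinity times zero are ever formed.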
Commit 8a6bd7307d added support for
padding specifier extensions to strftime, but did not modify wcsftime.
In the process, it added a parameter to __strftime_fmt_1 in strftime.c,
but failed to update the prototype in wcsftime.c. This was found by
compiling musl with LTO:
src/time/wcsftime.c:7:13: warning: type of '__strftime_fmt_1' does \
not match original declaration [-Wlto-type-mismatch]
Fix the prototype of __strftime_fmt_1 in wcsftime.c, and generate the
'pad' argument the same way as it is done in strftime.
it was reported by Erik Bosman that poll fails without setting revents
when the nfds argument exceeds the current value for RLIMIT_NOFILE,
causing the subsequent open calls to be bypassed. if the rlimit is
either 1 or 2, this leaves fd 0 and 1 potentially closed but openable
when the application code is reached.
based on a brief reading of the poll syscall documentation and code,
it may be possible for poll to fail under other attacker-controlled
conditions as well. if it turns out these are reasonable conditions
that may happen in the real world, we may have to go back and
implement fallbacks to probe each fd individually if poll fails, but
for now, keep things simple and treat all poll failures as fatal.
if double precision r=x*y+z is not a half way case between two single
precision floats or it is an exact result then fmaf returns (float)r.
however the exactness check was wrong when |x*y| < |z| and could cause
an incorrectly rounded result in nearest rounding mode when r is a half
way case.
fmaf(-0x1.26524ep-54, -0x1.cb7868p+11, 0x1.d10f5ep-29)
was incorrectly rounded up to 0x1.d117ap-29 instead of 0x1.d1179ep-29.
(exact result is 0x1.d1179efffffffecp-29, r is 0x1.d1179fp-29)
commit d93c0740d8 added use of feature
test macros without including features.h, causing a definition that
should be exposed in the default profile, TSVTX, to appear only when
_XOPEN_SOURCE or higher is explicitly defined.
previously, MEMOPS_SRCS failed to include arch-specific replacement
files for memcpy, etc., omitting CFLAGS_MEMOPS and thereby potentially
causing build failure if an arch provided C (rather than asm)
replacements for these files.
instead of trying to explicitly include all the files that might have
arch replacements, which is prone to human error, extract final names
to be used out of $(LIBC_OBJS), where the rules for arch replacements
have already been applied. do the same for NOSSP_OBJS, using CRT_OBJS
and LDSO_OBJS rather than repeating ourselves with $(wildcard...) and
explicit pathnames again.
standing alone, both the signed and int keywords identify the same
type, a (signed) int. however the C language has an exception where,
when the lone keyword int is used to declare a bitfield, it's
implementation-defined whether the bitfield is signed or unsigned. C11
footnote 125 extends this implementation-definedness to typedefs, and
DR#315 extends it to other integer types (for which support with
bitfields is implementation-defined).
while reasonable ABIs (all the ones we support) define bitfields as
signed by default, GCC and compatible compilers offer an option
-funsigned-bitfields to change the default. while any signed types
defined without explicit use of the signed keyword are affected, the
stdint.h types, especially intNN_t, have a natural use in bitfields.
ensure that bitfields defined with these types always have the correct
signedness regardless of compiler & flags used.
see also GCC PR 83294.
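a minimal sketch of the idea, using a hypothetical typedef
sketch_int32 in place of the real stdint.h types. the explicit signed
keyword in the typedef is what pins down the signedness of the
bitfield:

```c
/* hypothetical stand-in for a stdint.h-style typedef */
typedef signed int sketch_int32;

struct sketch_flags {
	sketch_int32 delta : 4;
};

/* returns 1 if the 4-bit field holds -1 as a negative value */
int sketch_field_is_signed(void)
{
	struct sketch_flags f;
	f.delta = -1;
	return f.delta < 0;
}
```

had the typedef been written as plain int, the result could depend on
the compiler and on options such as -funsigned-bitfields.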
the output delay features (NL*, CR*, TAB*, BS*, and VT*) are
XSI-shaded. VT* is in the V* namespace reservation but the rest need
to be suppressed in base POSIX namespace.
unfortunately this change introduces feature test macro checks into
another bits header. at some point these checks should be simplified
by having features.h handle the "FTM X implies Y" relationships.
this must have been taken from POSIX without realizing that it was
meaningless. the resolution to Austin Group issue #844 removed it from
the standard.
PAGESIZE is actually the version defined in POSIX base, with PAGE_SIZE
being in the XSI option. use PAGESIZE as the underlying definition to
facilitate making exposure of PAGE_SIZE conditional.
use of MB_CUR_MAX encoded a hidden dependency on the currently active
locale for the calling thread, whereas nl_langinfo_l is supposed to
report for the locale passed as an argument.
general policy is that all source files defining a public API or an
ABI mechanism referenced by a public header should include the public
header that declares the interface, so that the compiler or analysis
tools can check the consistency of the declarations. Alexander Monakov
pointed out a number of violations of this principle a few years back.
fix them now.
add a member of appropriate type to the fpos_t union so that accesses
are well-defined. use long long instead of off_t since off_t is not
always exposed in stdio.h and there's no namespace-clean alias for it.
access is still performed using pointer casts rather than by naming
the union member as a matter of style; to the extent possible, the
naming of fields in opaque types defined in the public headers is not
treated as an API contract with the implementation. access via the
pointer cast is valid as long as the union has a member of matching
type.
previously this macro used an odd if/else form instead of the more
idiomatic do/while(0), making it unsafe against omission of trailing
semicolon. the omission would make the following statement conditional
instead of producing an error.
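a small illustration of the difference between the two forms. the
macros below are hypothetical and are not the macro this change
touched:

```c
/* fragile if/else form, shown for contrast only: if the caller
 * forgets the semicolon, the next statement becomes the else branch */
#define SKETCH_CLAMP_OLD(x, max) if ((x) > (max)) (x) = (max); else

/* idiomatic form: expands to exactly one statement, and omitting
 * the trailing semicolon is a syntax error */
#define SKETCH_CLAMP(x, max) do { if ((x) > (max)) (x) = (max); } while (0)

int sketch_clamp_then_double(int v, int max)
{
	SKETCH_CLAMP(v, max);
	return 2 * v;
}
```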
formally, calling readv with a zero-length first iov component should
behave identically to calling read on just the second component, but
presence of a zero-length iov component has triggered bugs in some
kernels and performs significantly worse than a simple read on some
file types.
the stdio FILE read backend's return type is size_t, not ssize_t, and
all of the special (non-fd-backed) FILE types already return the
number of bytes read (zero) on error or eof. only __stdio_read leaked
a syscall error return into its return value.
fread had a workaround for this behavior going all the way back to the
original check-in. remove the workaround since it's no longer needed.
the ':' in optstring has special meaning as a flag applying to the
previous option character, or to getopt's error handling behavior when
it appears at the beginning. don't also accept a "-:" option based on
its presence.
based loosely on patch by Hauke Mehrtens; converted to wrap the public
API of the underlying getrandom function rather than direct syscalls,
so that if/when a fallback implementation of getrandom is added it
will automatically get picked up by getentropy too.
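a minimal sketch of such a wrapper, assuming a hypothetical name
sketch_getentropy and omitting details a libc would need, such as
disabling thread cancellation around the loop:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

int sketch_getentropy(void *buffer, size_t len)
{
	unsigned char *pos = buffer;
	/* requests larger than 256 bytes are rejected with EIO */
	if (len > 256) {
		errno = EIO;
		return -1;
	}
	while (len) {
		/* go through the public getrandom function, so any
		 * fallback it gains is inherited here */
		ssize_t ret = getrandom(pos, len, 0);
		if (ret < 0) {
			if (errno == EINTR) continue;
			return -1;
		}
		pos += ret;
		len -= ret;
	}
	return 0;
}
```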
NT_ARM_SVE and NT_S390_RI_CB are new in linux commits
43d4da2c45b2f5d62f8a79ff7c6f95089bb24656 and
262832bc5acda76fd8f901d39f4da1121d951222
the rest are older.
musl missed NT_PRFPREG because it followed the glibc api:
https://sourceware.org/bugzilla/show_bug.cgi?id=14890
PR_SVE_SET_VL and PR_SVE_GET_VL controls are new in linux commit
2d2123bc7c7f843aa9db87720de159a049839862
related PR_SVE_* macros were added in
7582e22038a266444eb87bc07c372592ad647439
HWCAP_SVE is new in linux commit 43994d824e8443263dc98b151e6326bf677be52e
HWCAP_SHA3, HWCAP_SM3, HWCAP_SM4, HWCAP_ASIMDDP and HWCAP_SHA512 are new in
f5e035f8694c3bdddc66ea46ecda965ee6853718
PPC_FEATURE2_HTM_NO_SUSPEND is new in linux commit
cba6ac4869e45cc93ac5497024d1d49576e82666
PPC_FEATURE2_DARN and PPC_FEATURE2_SCV were new in v4.12 in commit
a4700a26107241cc7b9ac8528b2c6714ff99983d
for synchronous page faults, new in linux commit
1c9725974074a047f6080eecc62c50a8e840d050 and
b6fb293f2497a9841d94f6b57bd2bb2cd222da43
note that only targets that use asm-generic/mman.h have this new
flag defined, so undef it on other targets (mips*, powerpc*).
use the same token to define TIOCSER_TEMT as is used in ioctl.h
so when both headers are included there are no redefinition warnings
during musl build.
*_HUGE_SHIFT, *_HUGE_2MB, *_HUGE_1GB are documented in the man page,
so add all of the *_HUGE_* macros from linux uapi.
if MAP_HUGETLB is set, top bits of the mmap flags encode the page size.
see the linux commit aafd4562dfee81a40ba21b5ea3cf5e06664bc7f6
if SHM_HUGETLB is set, top bits of the shmget flags encode the page size.
see the linux commit 4da243ac1cf6aeb30b7c555d56208982d66d6d33
*_HUGE_16GB is defined unsigned to avoid signed left shift ub.
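a sketch of the encoding, using hypothetical SKETCH_ names so as not
to clash with the real macros. the shift, mask, and log2 values are
believed to mirror the linux uapi MAP_HUGE_* layout, but should be
checked against the headers:

```c
#define SKETCH_HUGE_SHIFT 26
#define SKETCH_HUGE_MASK  0x3f
#define SKETCH_HUGE_2MB   (21 << SKETCH_HUGE_SHIFT)
#define SKETCH_HUGE_1GB   (30 << SKETCH_HUGE_SHIFT)
/* 34 << 26 does not fit in a 32-bit signed int, hence the U suffix */
#define SKETCH_HUGE_16GB  (34U << SKETCH_HUGE_SHIFT)

/* recover log2 of the page size from the top bits of the flags */
unsigned sketch_huge_log2(unsigned flags)
{
	return (flags >> SKETCH_HUGE_SHIFT) & SKETCH_HUGE_MASK;
}
```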
new ethertypes in linux v4.14:
ETH_P_ERSPAN new in 84e54fe0a5eaed696dee4019c396f8396f5a908b
ETH_P_IFE new in 2804fd3af6ba5ae5737705b27146455eabe2e2f8
ETH_P_NSH new in 155e6f649757c902901e599c268f8b575ddac1f8
ETH_P_MAP new in 7373ae7e8f0bf2c0718422481da986db5058b005
MSG_ZEROCOPY socket send flag avoids copy in the kernel
new in linux commit 52267790ef52d7513879238ca9fac22c1733e0e3
SO_ZEROCOPY socket option enables MSG_ZEROCOPY if available
new in linux commit 76851d1212c11365362525e1e2c0a18c97478e6b
add AF_SMC and PF_SMC for the IBM shared memory communication protocol.
new in linux commit ac7138746e14137a451f8539614cdd349153e0c0
(linux socket.h is not in uapi so this update was missed earlier)
these additions were made by scanning git log since the last major
update in commit 790580b2fc. in addition
to git-level commit authorship, "patch by" text in the commit message
was also scanned. this idiom was used in the past for patches that
underwent substantial edits when merging or where the author did not
provide a commit message. going forward, my intent is to use commit
authorship consistently for attribution.
as before my aim was adding everyone with either substantial code
contributions or a pattern of ongoing simple patch submission; any
omissions are unintentional.
Maintainer's note: at one point, -lcompiler_rt apparently worked, and
may still work and be preferable if one has manually installed the
library in a public lib directory. but with current versions of clang,
the full pathname to the library file is needed. the original patch
removed the -lcompiler_rt check; I have left it in place in case there
are users depending on it, and since, when it does work, it's
preferable so as not to code a dependency on the specific compiler
version and paths in config.mak.
this is more extensible if we need to consider additional errors, and
more efficient as long as the compiler does not know it can cache the
result of __errno_location (a surprisingly complex issue detailed in
commit a603a75a72).
It's better to make execvp continue PATH search on ENOTDIR rather than
issuing an error. Bogus entries should not render the rest of PATH invalid.
Maintainer's note: POSIX seems to require the search to continue like
this as part of XBD 8.3 Other Environment Variables. Only errors that
conclusively determine non-existence are candidates for continuing;
otherwise for consistency we have to report the error.
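A minimal sketch of the error classification, assuming a hypothetical
helper named sketch_search_continues. Treating EACCES as a remembered,
non-fatal error is an assumption of this sketch and is not stated
above:

```c
#include <errno.h>

/* returns 1 if the PATH search should move on to the next entry after
 * an exec attempt failed with err, 0 if the error must be reported.
 * *seen_eacces is set so the caller can report EACCES at the end. */
int sketch_search_continues(int err, int *seen_eacces)
{
	switch (err) {
	case EACCES:
		*seen_eacces = 1;
		/* fall through */
	case ENOENT:
	case ENOTDIR:
		return 1;
	default:
		return 0;
	}
}
```

Using a switch keeps it easy to add further errors later.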
when a null buffer pointer is passed to fmemopen, requesting it
allocate its own memory buffer, extremely large size arguments near
SIZE_MAX could overflow and result in underallocation. this results
from omission of the size of the cookie structure in the overflow
check but inclusion of it in the calloc call.
instead of accounting for individual small contributions to the total
allocation size needed, simply reject sizes larger than PTRDIFF_MAX,
which will necessarily fail anyway. then adding arbitrary fixed-size
structures is safe without matching up the expressions in the
comparison and the allocation.
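a minimal sketch of the check, with a hypothetical struct
sketch_cookie standing in for the real cookie structure and a
hypothetical helper sketch_alloc:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* hypothetical fixed-size header standing in for the cookie structure */
struct sketch_cookie {
	size_t pos, len, size;
	unsigned char *buf;
};

/* allocate a header plus size bytes of payload in one block */
void *sketch_alloc(size_t size)
{
	if (size > PTRDIFF_MAX) {
		errno = ENOMEM;
		return 0;
	}
	/* cannot wrap: size <= PTRDIFF_MAX and the header is small */
	return calloc(1, sizeof(struct sketch_cookie) + size);
}
```

the comparison no longer has to mention the header size at all, so the
two expressions cannot drift apart.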
Currently getcwd(3) can succeed without returning an absolute path
because the underlying getcwd syscall, starting with linux commit
v2.6.36-rc1~96^2~2, may succeed without returning an absolute path.
This is a conformance issue because "The getcwd() function shall
place an absolute pathname of the current working directory
in the array pointed to by buf, and return buf".
Fix this by checking the path returned by syscall and failing with
ENOENT if the path is not absolute. The error code is chosen for
consistency with the case when the current directory is unlinked.
A similar issue was fixed in glibc recently; see
https://sourceware.org/bugzilla/show_bug.cgi?id=22679
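A minimal sketch of the added check, assuming a hypothetical helper
named sketch_check_absolute that is applied to the path the syscall
returned:

```c
#include <errno.h>

/* returns 0 if path is absolute, otherwise -1 with errno = ENOENT */
int sketch_check_absolute(const char *path)
{
	if (path[0] != '/') {
		errno = ENOENT;
		return -1;
	}
	return 0;
}
```

Any result that does not start with a slash, for example one carrying
a kernel-supplied prefix in front of the path, is rejected.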
in theory non-absolute origins can only arise when either the main
program is invoked by running ldso as a command (inherently non-suid)
or when dlopen was called with a relative pathname containing at least
one slash. such usage would be inherently insecure in an suid program
anyway, so the old behavior here does not seem to have been insecure.
harden against it anyway.
the rpath fixup code assumed any module's name field would contain at
least one slash, an invariant which is usually met but not in the case
of a main executable loaded from the current working directory by
running ldd or ldso as a command. it would be possible to make this
invariant always hold, but it has a higher runtime allocation cost and
does not seem useful elsewhere, so just patch things up in fixup_rpath
instead.
it's unclear from the specification whether the word "consumes" in
"consumes more than four bytes to represent a year" refers just to
significant places or includes leading zeros due to field width
padding. however the examples in the rationale indicate that the
latter was the intent. in particular, the year 270 is shown being
formatted by %+5Y as +0270 rather than 00270.
previously '+' prefixing was implemented just by comparing the year
against 10000. instead, count the number of significant digits and
padding bytes to be added, and use the total to determine whether to
apply the '+' prefix.
based on testing by Dennis Wölfing.
the code to strip initial sign and leading zeros inadvertently
stripped all the zeros and the subsequent '-' separating the month.
instead, only strip sign characters from the very first position, and
only strip zeros when they are followed by another digit.
based on testing by Dennis Wölfing.
in the original submission of the patch that became commit
7c709f2d4f, and in subsequent reading of
it by others, it was not clear that the new member had to be inserted
before canary_at_end, or that inserting it at that location was safe.
add comments to document.
Do not retry waitpid if the child was terminated by a signal. Do not
examine status: since we are not passing any flags, we will not receive
stop or continue notifications.
commit f9fb20b42d switched from using a
pipe for the result to conveying it via the child process exit status.
Alexander Monakov pointed out that the latter could fail if the
application is not expecting faccessat to produce a child and performs
a wait operation with __WCLONE or __WALL, and that it is not clear
whether it's guaranteed to work when SIGCHLD's disposition has been
set to SIG_IGN.
in addition, that commit introduced a bug that caused EACCES to be
produced instead of EBUSY due to an exit path that was overlooked when
the error channel was changed, and introduced a spurious retry loop
around the wait operation.
the Linux and FreeBSD man pages for dladdr document dli_fbase as the
"base address" of the library/module found. normally (e.g. AT_BASE)
the term "base" is used to denote the base address relative to which
p_vaddr addresses are interpreted; however in the case of dladdr's
Dl_info structure, existing implementations define it as the lowest
address of the mapping, which makes sense in the context of
determining which module's memory range the input address falls
within.
since this is a nonstandard interface provided to mimic one provided
by other implementations, adjust it to match their behavior.
Consider the first equals sign found in the option to be the delimiter
between it and its argument, even if it matches an equals sign in the
option name. This avoids consuming the equals sign, which would prevent
finding the argument. Instead, it forces a partial match of the part of
the option name before the equals sign.
Maintainer's note: GNU getopt_long does not explicitly document this
behavior, but it can be seen as a consequence of how partial matches
are specified, and at least GNU (bfd) ld is known to make use of it.
If we find a partial option name match, we need to keep looking for
ambiguous/conflicting options. However, we need to remember the position
in the candidate argument to find its option-argument later, if there is
one. This fixes e.g. option "foobar" being given as "--fooba=baz".
commit 78897b0dc0 wrongly simplified
Dmitry Levin's original submitted patch fixing alt-form octal with the
zero flag and field width present, omitting the special case where the
value is zero. as a result, printf("%#o",0) wrongly prints "00" rather
than "0".
the logic prior to this commit was actually better, in that it was
aligned with how the alt-form flag (#) for printf is specified ("it
shall increase the precision"). at the time there was no good way to
avoid the zero flag issue with the old logic, but commit
167dfe9672 added tracking of whether an
explicit precision was provided.
revert commit 78897b0dc0 and switch to
using the explicit precision indicator for suppressing the zero flag.
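a small check of the expected behavior, using a hypothetical helper
sketch_alt_octal around snprintf. with a conforming printf, the value
zero formats as a single "0" and the value 8 as "010":

```c
#include <stdio.h>

/* format v in alternate-form octal into buf; returns snprintf's result */
int sketch_alt_octal(char *buf, size_t n, unsigned v)
{
	return snprintf(buf, n, "%#o", v);
}
```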
In some places there has been a direct usage of the functions. Use the
macros consistently everywhere, such that it might be easier later on to
capture the fast path directly inside the macro and only have the call
overhead on the slow path.
A variant of this new lock algorithm has been presented at SAC'16, see
https://hal.inria.fr/hal-01304108. A full version of that paper is
available at https://hal.inria.fr/hal-01236734.
The main motivation of this is to improve on the safety of the basic lock
implementation in musl. This is achieved by squeezing a lock flag and a
congestion count (= threads inside the critical section) into a single
int. Thereby an unlock operation does exactly one memory
transfer (a_fetch_add) and never touches the value again, but still
detects if a waiter has to be woken up.
This is a fix of a use-after-free bug in pthread_detach that had
temporarily been patched. Therefore this patch also reverts
c1e27367a9
This is also the only place where internal knowledge of the lock
algorithm is used.
The main price for the improved safety is a little bit larger code.
Under high congestion, the scheduling behavior will be different
compared to the previous algorithm. In that case, a successful
put-to-sleep may appear out of order compared to the arrival in the
critical section.
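The single-int encoding can be sketched with C11 atomics as follows.
This is an illustration only: the sketch_ names are hypothetical, the
sign bit is assumed to serve as the lock flag, and the futex wait and
wake calls of a real implementation are left out:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* lock bit (sign bit) set plus a congestion count of one */
#define SKETCH_LOCKED_ONE (INT_MIN + 1)

/* uncontended acquire: 0 -> locked with count 1 */
bool sketch_trylock(atomic_int *l)
{
	int expected = 0;
	return atomic_compare_exchange_strong(l, &expected, SKETCH_LOCKED_ONE);
}

/* a contending thread announces itself by bumping the count */
void sketch_enter(atomic_int *l)
{
	atomic_fetch_add(l, 1);
}

/* release with a single fetch-and-add that clears the lock bit and
 * removes the owner from the count; returns true if some other thread
 * was counted and therefore needs a wake */
bool sketch_unlock(atomic_int *l)
{
	int old = atomic_fetch_add(l, -SKETCH_LOCKED_ONE);
	return old != SKETCH_LOCKED_ONE;
}
```

The unlock path never reads the lock word again after the
fetch-and-add, which is the property that matters when the lock's
storage may be freed right after release.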
With Linux kernel 4.16 it will be possible to guard more parts of the
Linux header files from a libc. Make use of this in musl to guard all
the structures and other definitions from the Linux header files which
are also defined by the header files provided by musl. This will make
it possible to compile source files which include both the libc
headers and the kernel userspace headers.
This extends the definitions done in commit 04983f2272 ("make
netinet/in.h suppress clashing definitions from kernel headers").
previously, the charset names without endianness specified were always
interpreted as big endian. unicode specifies that UTF-16 and UTF-32
have BOM-determined endianness if BOM is present, and are otherwise
big endian. since commit 5b546faa67
added support for stateful encodings, it is now possible to implement
BOM support via the conversion descriptor state.
for conversions to these charsets, the output is always big endian and
does not have a BOM.
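a minimal sketch of BOM detection for UTF-16 input, with hypothetical
names. a real decoder would record the result in the conversion
descriptor state rather than return it:

```c
#include <stddef.h>

enum sketch_endian { SKETCH_BE, SKETCH_LE };

/* inspect the first two bytes of a UTF-16 stream; sets *skip to the
 * number of BOM bytes to consume before decoding */
enum sketch_endian sketch_utf16_endian(const unsigned char *s, size_t n,
	size_t *skip)
{
	*skip = 0;
	if (n >= 2) {
		if (s[0] == 0xFE && s[1] == 0xFF) {
			*skip = 2;
			return SKETCH_BE;
		}
		if (s[0] == 0xFF && s[1] == 0xFE) {
			*skip = 2;
			return SKETCH_LE;
		}
	}
	/* no BOM: default to big endian */
	return SKETCH_BE;
}
```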
the mapping tables and code are not automatically generated; they were
produced by comparing the output of towupper/towlower against the
mappings in the UCD, ignoring characters that were previously excluded
from case mappings or from alphabetic status (micro sign and circled
letters), and adding table entries or code for everything else
missing.
based very loosely on a patch by Reini Urban.
the new version of the code used to generate these tables forces a
newline every 256 entries, whereas at the time these files were
originally generated and committed, it only wrapped them at 80
columns. the new behavior ensures that localized changes to the
tables, if they are ever needed, will produce localized diffs.
commit d060edf6c5 made the corresponding
changes to the iconv tables.
notes by maintainer:
commit 2f853dd6b9 added these rules
because the new system for handling arch-provided replacement files
introduced for out-of-tree builds did not apply to the crt tree.
commit 63bcda4d8f later adapted the
makefile logic so that the crt and ldso trees go through the same
replacement logic as everything else, but failed to remove the
explicit rules that assumed the arch would always provide asm
replacements.
in addition to cleaning things up, removing these spurious rules
allows crti/crtn asm to be omitted by an arch (thereby using the empty
C files instead) if they are not needed.
notes by maintainer:
both C and POSIX use the term UTC to specify related functionality,
despite POSIX defining it as something more like UT1 or historical
(pre-UTC) GMT without leap seconds. neither specifies the associated
string for %Z. old choice of "GMT" violated principle of least
surprise for users and some applications/tests. use "UTC" instead.
aside from theoretical arbitrary results due to UB, this could
practically cause unbounded overflow of static array if hit, but
hitting it depends on having more than 32 calls to at_quick_exit and
having them sufficiently often.
notes added by maintainer:
the '-' specifier allows default padding to be suppressed, and '_'
allows padding with spaces instead of the default (zeros).
these extensions seem to be included in several other implementations
including FreeBSD and derivatives, and Solaris. while portable
software should not depend on them, time format strings are often
exposed to the user for configurable time display. reportedly some
python programs also use and depend on them.
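a small sketch of how the two flags behave, assuming a libc whose strftime supports these extensions. format_day is a hypothetical helper written for this example only.

```c
#include <stddef.h>
#include <time.h>

/* Format the day of the month of 2018-03-05 with the given format.
 * "%-d" and "%_d" are extensions; this assumes the libc supports them. */
static size_t format_day(char *buf, size_t n, const char *fmt)
{
	struct tm tm = {0};
	tm.tm_year = 2018 - 1900;
	tm.tm_mon = 2;      /* March */
	tm.tm_mday = 5;
	return strftime(buf, n, fmt, &tm);
}
```

with "%d" the day is zero-padded, with "%-d" the padding is suppressed, and with "%_d" the padding character is a space.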
notes added by maintainer:
this function is a GNU extension. it was chosen over the similar BSD
function funopen because the latter depends on fpos_t being an
arithmetic type as part of its public API, conflicting with our
definition of fpos_t and with the intent that it be an opaque type. it
was accepted for inclusion because, despite not being widely used, it
is usually very difficult to extricate software using it from the
dependency on it.
calling pattern for the read and write callbacks is not likely to
match glibc or other implementations, but should work with any
reasonable callbacks. in particular the read function is never called
without at least one byte being needed to satisfy its caller, so that
spurious blocking is not introduced.
contracts for what callbacks called from inside libc/stdio can do are
always complicated, and at some point still need to be specified
explicitly. at the very least, the callbacks must return or block
indefinitely (they cannot perform nonlocal exits) and they should not
make calls to stdio using their own FILE as an argument.
previously, fgetwc left all but the first byte of an illegal sequence
unread (available for subsequent calls) when reading out of the FILE
buffer, but dropped all bytes contributing to the error when falling
back to reading a byte at a time. neither behavior was ideal. in the
buffered case, each malformed character produced one error per byte,
rather than one per character. in the unbuffered case, consuming the
last byte that caused the transition from "incomplete" to "invalid"
state potentially dropped (and produced additional spurious encoding
errors for) the next valid character.
to handle both cases uniformly without duplicate code, revise the
buffered case to only cover situations where a complete and valid
character is present in the buffer, and fall back to byte-at-a-time
for all other cases. this allows using mbtowc (stateless) instead of
mbrtowc, which may slightly improve performance too.
when an encoding error has been hit in the byte-at-a-time case, leave
the final byte that produced the error unread (via ungetc) except in
the case of single-byte errors (for UTF-8, bytes c0, c1, f5-ff, and
continuation bytes with no lead byte). single-byte errors are fully
consumed so as not to leave the caller in an infinite loop repeating
the same error.
none of these changes are distinguished from a conformance standpoint,
since the file position is unspecified after encoding errors. they are
intended merely as QoI/consistency improvements.
fgetwc does not set the stream's error indicator on encoding errors,
making ferror insufficient to distinguish between error and eof
conditions. feof is also insufficient, since it will return true if
the file ended with a partial character encoding error.
whether fgetwc should be setting the error indicator itself is a
question with conflicting answers. the POSIX text for the function
states it as a requirement, but the ISO C text seems to require that
it not. this may be revisited in the future based on the outcome of
Austin Group issue #1170.
these encodings are still commonly used in messaging protocols and
such. the reverse mapping is implemented as a binary search of a list
of the jis 0208 characters in unicode order; the existing forward
table is used to perform the comparison in the search.
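the idea of searching a list kept in unicode order, with the forward table supplying the comparison key, can be sketched as below. the tables fwd and rev are toy data made up for this example, not the real jis 0208 tables, and reverse_lookup is a hypothetical name.

```c
#include <stddef.h>

/* Toy forward table: index -> Unicode codepoint (illustrative values,
 * not the real jis 0208 table). */
static const unsigned short fwd[] = { 0x3042, 0x3001, 0x30a2, 0x3000 };

/* Indices into fwd[], sorted by the codepoint they map to. */
static const unsigned char rev[] = { 3, 1, 0, 2 };

/* Return the index in fwd[] that maps to codepoint c, or -1. */
static int reverse_lookup(unsigned c)
{
	size_t lo = 0, hi = sizeof rev / sizeof *rev;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		unsigned d = fwd[rev[mid]];   /* compare via the forward table */
		if (d == c) return rev[mid];
		if (d < c) lo = mid + 1;
		else hi = mid;
	}
	return -1;
}
```

only the sorted index list is new data; the codepoints themselves are never stored a second time.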
previously, 8-bit codepages could only remap the high 128 bytes; the
low range was assumed/forced to agree with ascii. interpretation of
codepage table headers has been changed so that it's possible to
represent mappings for up to 256 slots (fewer if the initial portion
of the map is elided because it coincides with unicode codepoints).
this requires consuming a bit more of the 10-bit space of characters
that can be represented in 8-bit codepages, but there's still plenty
left. the size of the legacy_chars table is actually reduced now by
eliding the first 256 entries and considering them to map implicitly
via the identity map.
before these changes, there seem to have been minor bugs/omissions in
codepage table generation, so it's likely that some actual bug fixes
are silently included in this commit. round-trip testing of a few
codepages was performed on the new version of the code, but no
differential testing against the old version was done.
commit c49d3c8ada added logic to detect
attempts to load libc.so via another name and instead redirect to the
existing libc, rather than loading two and producing dangerously
inconsistent state. however, the check for and unmapping of the
duplicate libc happened after reclaim_gaps was already called,
donating the slack space around the writable segment to malloc.
subsequent unmapping of the library then invalidated malloc's free
lists.
fix the issue by moving the call to reclaim_gaps out of map_library
into load_library, after the duplicate libc check but before the first
call to calloc, so that the gaps can still be used to satisfy the
allocation of struct dso. this change also eliminates the need for an
ugly hack (temporarily setting runtime=1) to avoid reclaim_gaps when
loading the main program via map_library, which happens when ldso is
invoked as a command.
only programs/libraries erroneously containing a DT_NEEDED reference
to libc.so via an absolute pathname or symlink were affected by this
issue.
the new version of the code used to generate these tables forces a
newline every 256 entries, whereas at the time these files were
originally generated and committed, it only wrapped them at 80
columns. the new behavior ensures that localized changes to the
tables, if they are ever needed, will produce localized diffs. other
tables including hkscs were already committed in the new format.
binary comparison of the generated object files was performed to
confirm that no spurious changes slipped in.
If the syscall fails, errno must be set correctly for the caller.
There's no guarantee that the handlers registered with pthread_atfork
won't clobber errno, so we need to ensure it gets set after they are
called.
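A minimal sketch of the ordering concern, assuming the raw operation reports failure as a negative error code. All names here (failing_op, noisy_handler, call_with_handlers) are hypothetical stand-ins and not the actual libc internals.

```c
#include <errno.h>

/* Stand-in for a raw operation that fails with a negative error code. */
static int failing_op(void) { return -EAGAIN; }

/* Stand-in for a registered handler that happens to clobber errno. */
static void noisy_handler(void) { errno = 0; }

/* Run the operation, then the handler, and only then set errno, so
 * that nothing the handler does can overwrite the reported error. */
static int call_with_handlers(int (*op)(void), void (*handler)(void))
{
	int ret = op();
	handler();
	if (ret < 0) {
		errno = -ret;
		return -1;
	}
	return ret;
}
```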
this implementation aims to match the baseline defined by rfc1468 (the
original mime charset definition) plus the halfwidth katakana
extension included in the whatwg definition of the charset. rejection
of si/so controls and newlines in doublebyte state are not currently
enforced. the jis x 0201 mode is currently interpreted as having the
yen sign and overline character in place of backslash and tilde; ascii
mode has the standard ascii characters in those slots.
assuming pointers obtained from malloc have some nonzero alignment,
repurpose the low bit of iconv_t as an indicator that the descriptor
is a stateless value representing the source and destination character
encodings.
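one possible way to pack such a value is sketched below. the bit layout and the function names are assumptions made for this example (encoding ids below 0x8000), not necessarily the layout used in the source.

```c
#include <stdint.h>

/* Hypothetical packing: destination id in bits 16 and up, source id
 * in bits 1..15, low bit set to mark "stateless, not a pointer". */
static void *combine_to_from(unsigned to, unsigned from)
{
	return (void *)(uintptr_t)(to << 16 | from << 1 | 1);
}

/* Pointers from malloc are aligned, so their low bit is zero. */
static int is_stateless(void *cd)
{
	return (uintptr_t)cd & 1;
}

static unsigned extract_from(void *cd)
{
	return (uintptr_t)cd >> 1 & 0x7fff;
}

static unsigned extract_to(void *cd)
{
	return (uintptr_t)cd >> 16 & 0xffff;
}
```

a descriptor for which is_stateless returns 0 would instead be treated as a pointer to allocated state.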
the special case where mbrtowc returns 0 but consumed 1 byte of input
does not need to be considered, because the short-circuit for low
bytes already covered that case.
short-circuiting low bytes before the switch precluded support for
character encodings that don't coincide with ascii in this range. this
limitation affected iso-2022 encodings, which use the esc byte to
introduce a shift sequence, and things like ebcdic.
this is in preparation to support stateful conversion descriptors,
which are necessarily allocated and thus must be freed in iconv_close.
putting it in a separate TU will avoid pulling in free if iconv_close
is not referenced.
this change is made to avoid having assumptions about the encoding
spread out across the file, and to facilitate future change to a form
that can accommodate allocated, stateful descriptors when needed.
this commit should not produce any functional changes; with the
compiler tested the only change to code generation was minor
reordering of local variables on stack.
If AI_NUMERICSERV is specified and a numeric service was not provided,
POSIX mandates getaddrinfo return EAI_NONAME. EAI_SERVICE is only for
services that cannot be used on the specified socket type.
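A short sketch of the behavior from the caller's side. lookup_numeric_service is a hypothetical helper; it uses a numeric loopback address so that no name resolution is involved.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Look up a service on the loopback address, requiring the service
 * to be a numeric port string. Returns the getaddrinfo error code. */
static int lookup_numeric_service(const char *service)
{
	struct addrinfo hints, *res = 0;
	memset(&hints, 0, sizeof hints);
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;
	int r = getaddrinfo("127.0.0.1", service, &hints, &res);
	if (!r) freeaddrinfo(res);
	return r;
}
```

With this change a call such as lookup_numeric_service("http") is expected to yield EAI_NONAME, while "80" succeeds.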
commit a6054e3c94 changed this function
not to take an argument, but the weak definition used by timer_create
was not updated to match.
reported by Pascal Cuoq.
s390 can use the generic ioctls definitions other than FIOQSIZE (like arm).
this fixes some missing ioctls and two incorrect ones:
TIOCTTYGSTRUCT and TIOCM_MODEM_BITS seem to be defined on frv
target only in linux.
fcntl commands for getting/setting write lifetime hints were
added in linux commit c75b1d9421f80f4143e389d2d50ddfc8a28c8c35.
they are exposed under _GNU_SOURCE || _BSD_SOURCE, since the RWH_*
lifetime hints are not in the POSIX reserved namespace.
hwcap bits for armv8.3 extensions, added in linux commits
c8c3798d2369e4285da44b244638eafe446a8f8a
cb567e79fa504575cb97fb2f866d2040ed1c92e7
c651aae5a7732287c1c9bc974ece4ed798780544
SO_MEMINFO added in linux commit a2d133b1d465016d0d97560b11f54ba0ace56d3e
SO_INCOMING_NAPI_ID added in 6d4339028b350efbf87c61e6d9e113e5373545c9
SO_COOKIE added in 5daab9db7b65df87da26fd8cfa695fb9546a1ddb
min max mtu size definitions mostly for drivers.
new in linux commits a52ad514fdf3b8a57ca4322c92d2d8d5c6182485 and
d894be57ca92c8a8819ab544d550809e8731137b
for tcp timestamp control messages, new in linux commit
1c885808e45601b2b6f68b30ac1d999e10b6f606
and export time measurements via tcp_info, added in linux commit
efd90174167530c67a54273fd5d8369c87f9bd32
the resolution to Austin Group issue #411 defined new semantics for
the posix_spawn dup2 file action in the (previously useless) case
where src and dest fd are equal. future issues will require the dup2
file action to remove the close-on-exec flag. without this change,
passing fds to a child with posix_spawn while avoiding fd-leak races
in a multithreaded parent required a complex dance with temporary fds.
based on patch by Petr Skocik. changes were made to preserve the
80-column formatting of the function and to remove code that became
unreachable as a result of the new functionality.
commit 06fbefd100 (first included in
release 1.1.17) introduced this regression.
patch by Adrian Bunk. it fixes the regression in all cases, but
spuriously prevents use of the clz instruction on very old compiler
versions that don't define __ARM_ARCH. this may be fixed in a more
general way at some point in the future. it also omits thumb1 logic
since building as thumb1 code is currently not supported.
commit 8c4be3e220 was written to
preclude the GLOB_PERIOD extension from matching these directory
entries, but also precluded literal matches.
adjust the check that excludes . and .. to check whether the
GLOB_PERIOD flag is in effect, so that it cannot alter behavior in
cases governed by the standard, and also don't exclude . or .. in any
case where normal glob behavior (fnmatch's FNM_PERIOD flag) would have
included one or both of them (patterns such as ".*").
it's still not clear whether this is the preferred behavior for
GLOB_PERIOD, but at least it's clear that it can no longer break
applications which are not relying on quirks of a nonstandard feature.
execvpe stack-allocates a buffer used to hold the full path
(combination of a PATH entry and the program name)
while searching through $PATH, so at least
NAME_MAX+PATH_MAX is needed.
The stack size can be made conditionally smaller
(the current 1024 appears appropriate)
should this larger size be burdensome in those situations.
MAXADDRS was chosen not to need enforcement, but the logic used to
compute it assumes the answers received match the RR types of the
queries. specifically, it assumes that only one reply contains A
record answers. if the replies to both the A and the AAAA query have
their answer sections filled with A records, MAXADDRS can be exceeded
and clobber the stack of the calling function.
this bug was found and reported by Felix Wilhelm.
the rightmost '/' character is not necessarily the delimiter before
the basename; it could be a spurious trailing character on the
directory name.
this change does not introduce any normalization of pathnames or
stripping of trailing slashes, contrary to at least glibc and perhaps
other implementations; it just prevents their presence from breaking
things. whether further changes should be made is an open question
that may depend on conformance and/or application compatibility
considerations.
based loosely on patch by Joakim Sindholt.
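a sketch of locating the basename while tolerating trailing slashes, without modifying or normalizing the pathname. base_offset is a hypothetical helper written for illustration and is not the code from the patch.

```c
#include <string.h>

/* Return the offset of the basename within path, ignoring any
 * trailing '/' characters rather than treating the last one as the
 * delimiter. The path itself is not modified. */
static size_t base_offset(const char *path)
{
	size_t j = strlen(path);
	/* Skip trailing slashes, keeping at least one character. */
	while (j > 1 && path[j-1] == '/') j--;
	/* Back up to just after the previous slash, if any. */
	while (j > 0 && path[j-1] != '/') j--;
	return j;
}
```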
calling __unlock on t->exitlock is not valid because __unlock reads
the waiters count after making the atomic store that could allow
pthread_exit to continue and unmap the thread's stack and the object t
points to. for now, inline the __unlock logic with an unconditional
futex wake operation so that the waiters count is not needed.
once __lock/__unlock have been made safe for self-synchronized
destruction, we could switch back to using them.
the freebsd fma code failed to raise underflow exception in some
cases in nearest rounding mode (affects fmal too) e.g.
fma(-0x1p-1000, 0x1.000001p-74, 0x1p-1022)
and the inexact exception may be raised spuriously since the fenv
is not saved/restored around the exact multiplication algorithm
(affects x86 fma too).
another issue is that the underflow behaviour when the rounded result
is the minimal normal number is target dependent, ieee754 allows two
ways to raise underflow for inexact results: raise if the result before
rounding is in the subnormal range (e.g. aarch64, arm, powerpc) or if
the result after rounding with infinite exponent range is in the
subnormal range (e.g. x86, mips, sh).
to avoid all these issues the algorithm was rewritten with mostly integer
arithmetic; floating-point arithmetic is only used to get correct rounding
and raise exceptions according to the behaviour of the target without
any fenv.h dependency. it also unifies x86 and non-x86 fma.
fmaf is not affected; fmal needs to be fixed too.
this algorithm depends on a_clz_64 and it required a few spurious
instructions to make sure underflow exception is raised in a particular
corner case. (normally FORCE_EVAL(tiny*tiny) would be used for this,
but on i386 gcc is broken if the expression is constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57245
and there is no easy portable fix for the macro.)
this is for consistency with the way it's done in the dynamic
linker, avoiding a deprecated C feature (non-prototype function
types), and improving code generation. GCC unnecessarily uses the
variadic calling convention (e.g. clearing rax on x86_64) when making
a call where the argument types are not known for compatibility with
wrong code which calls variadic functions this way. (C on the other
hand is clear that such calls have undefined behavior.)
this is a subtle issue with how the assembler/linker work. for the adr
pseudo-instruction used to find __hwcap, the assembler in thumb mode
generates a 16-bit thumb add instruction which can only represent
word-aligned addresses, despite not knowing the alignment of the
label. if the setjmp function is assigned a non-multiple-of-4 address
at link time, the load then loads from the wrong address (the last
instruction rather than the data containing the offset) and ends up
reading nonsense instead of the value of __hwcap. this in turn causes
the checks for floating-point/vector register sets (e.g. IWMMX) to
evaluate incorrectly, crashing when setjmp/longjmp try to save/restore
those registers.
fix based on bug report by Felix Hädicke.
under some conditions, the mmap syscall wrongly fails with EPERM
instead of ENOMEM when memory is exhausted; this is probably the
result of the kernel trying to fit the allocation somewhere that
crosses into the kernel range or below mmap_min_addr. in any case it's
a conformance bug, so work around it. for now, only handle the case of
anonymous mappings with no requested address; in other cases EPERM may
be a legitimate error.
this indirectly fixes the possibility of malloc failing with the wrong
errno value.
GLOB_PERIOD is a gnu extension, and GNU glob does not seem to honor it
except in the last path component. it's not clear whether this is a bug
or intentional, but it seems reasonable that it should exclude the
special entries . and .. when walking.
changes based on report and analysis by Julien Ramseier.
some applications use getservbyport to find port numbers that are not
assigned to a service; if getservbyport always succeeds with a numeric
string as the result, they fail to find any available ports.
POSIX doesn't seem to mandate the behavior one way or another. it
specifies an abstract service database, which an implementation could
define to include numeric port strings, but it makes more sense to
align behavior with traditional implementations.
based on patch by A. Wilcox. the original patch only changed
getservbyport[_r]. to maintain a consistent view of the "service
database", I have also modified getservbyname[_r] to exclude numeric
port strings.
if the parent thread was able to set the new thread's priority before
it reached the check for 'startlock', the new thread failed to restore
its signal mask and thus ran with all signals blocked.
concept for patch by Sergei, who reported the issue; unnecessary
changes were removed and comments added since the whole 'startlock'
thing is non-idiomatic and confusing. eventually it should be replaced
with use of idiomatic synchronization primitives.
most of the found naming differences don't matter to musl, because
internally it unifies the syscall names that vary across targets,
but for external code the names should match the kernel uapi.
aarch64:
__NR_fstatat is called __NR_newfstatat in linux.
__NR_or1k_atomic got mistakenly copied from or1k.
arm:
__NR_arm_sync_file_range is an alias for __NR_sync_file_range2
__NR_fadvise64_64 is called __NR_arm_fadvise64_64 in linux,
the old non-arm name is kept too, it should not cause issues.
(powerpc has similar nonstandard fadvise and it uses the
normal name.)
i386:
__NR_madvise1 was removed from linux in commit
303395ac3bf3e2cb488435537d416bc840438fcb 2011-11-11
microblaze:
__NR_fadvise, __NR_fstatat, __NR_pread, __NR_pwrite
had different names in linux.
mips:
__NR_fadvise, __NR_fstatat, __NR_pread, __NR_pwrite, __NR_select
had different names in linux.
mipsn32:
__NR_fstatat is called __NR_newfstatat in linux.
or1k:
__NR__llseek is called __NR_llseek in linux.
the old name is kept too because that's the name musl uses
internally.
powerpc:
__NR_{get,set}res{gid,uid}32 was never present in powerpc linux.
__NR_timerfd was briefly defined in linux but then got renamed.
This aligns clearenv with the Linux man page by setting 'environ'
rather than '*environ' to NULL, and stops it from leaking entries
allocated by the libc.
Rewrite environment access functions to slim down code, fix bugs and
avoid invoking undefined behavior.
* avoid using int-typed iterators where size_t would be correct;
* use strncmp instead of memcmp consistently;
* tighten prologues by invoking __strchrnul;
* handle NULL environ.
putenv:
* handle "=value" input via unsetenv too (will return -1/EINVAL);
* rewrite and simplify __putenv; fix the leak caused by failure to
deallocate entry added by preceding setenv when called from putenv.
setenv:
* move management of libc-allocated entries to this translation unit,
and use no-op weak symbols in putenv/unsetenv;
unsetenv:
* rewrite; this fixes UB caused by testing a free'd pointer against
NULL on entry to subsequent loops.
Not changed:
Failure to extend the allocation tracking array (previously __env_map, now
env_alloced) is ignored rather than reported to the caller as -1/ENOMEM;
the worst-case consequence is leaking this allocation when it
is removed or replaced in a subsequent environment access.
Initially UB in unsetenv was reported by Alexander Cherepanov.
Using a weak alias to avoid pulling in malloc via unsetenv was
suggested by Rich Felker.
the DFA table controlling accepted ranges for the f4 prefix used an
incorrect upper bound of 0xa0 where it should have been 0x90, allowing
such sequences to be accepted and decoded as non-Unicode-scalar values
0x110000 through 0x11ffff.
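the corrected bound can be illustrated with a standalone decoder for this one prefix. decode_f4 is a hypothetical function for this example; the actual decoder is table-driven.

```c
/* Decode a 4-byte UTF-8 sequence with lead byte 0xf4. Returns the
 * codepoint, or -1 if the sequence is invalid. The second byte must
 * lie in 0x80..0x8f; anything from 0x90 up would exceed 0x10ffff. */
static long decode_f4(const unsigned char *s)
{
	if (s[0] != 0xf4) return -1;
	if (s[1] < 0x80 || s[1] >= 0x90) return -1;
	if ((s[2] & 0xc0) != 0x80 || (s[3] & 0xc0) != 0x80) return -1;
	return (long)(s[0] & 7) << 18 | (long)(s[1] & 0x3f) << 12
	     | (long)(s[2] & 0x3f) << 6 | (long)(s[3] & 0x3f);
}
```

with an upper bound of 0xa0 instead of 0x90, the sequence f4 90 80 80 would have decoded to 0x110000.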
the value computed as an output limit that bounds the amount of input
consumed below the input limit was incorrectly being used as the
actual amount of input consumed. instead, compute the actual amount of
input consumed as a difference of pointers before and after the
conversion.
patch by Mikhail Kremnyov.
Glibc renamed the linux uapi HWCAP_* macros to HWCAP_ARM_*
so have both variants in case some code depends on it.
(The HWCAP2_ macros are not defined in glibc currently so those
only have the linux uapi variant.)
counts leading zero bits of a 64bit int, undefined on zero input.
(has nothing to do with atomics, added to atomic.h so target specific
helper functions are together.)
there is a logarithmic generic implementation and another in terms of
a 32bit a_clz_32 on targets where that's available.
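both approaches can be sketched in portable C as follows. the names clz_32 and clz_64 are chosen for this example and the code is an illustration, not the implementation added to atomic.h.

```c
#include <stdint.h>

/* Logarithmic count of leading zero bits; undefined for x == 0. */
static int clz_32(uint32_t x)
{
	int n = 0;
	if (!(x >> 16)) { n += 16; x <<= 16; }
	if (!(x >> 24)) { n += 8; x <<= 8; }
	if (!(x >> 28)) { n += 4; x <<= 4; }
	if (!(x >> 30)) { n += 2; x <<= 2; }
	if (!(x >> 31)) n += 1;
	return n;
}

/* 64-bit version in terms of the 32-bit one; undefined for x == 0. */
static int clz_64(uint64_t x)
{
	uint32_t hi = x >> 32;
	return hi ? clz_32(hi) : 32 + clz_32((uint32_t)x);
}
```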
It is possible for argv[0] to be a null pointer, but the __progname
variable is used to implement functions in src/legacy/err.c that do not
expect it to be null. It is also available to the user via the
program_invocation_name alias as a GNU extension, and the implementation
in Glibc initializes it to a pointer to empty string rather than NULL.
Since argv[0] is usually non-null and it's preferable to keep those
variables in BSS, implement the fallbacks in __init_libc, which also
allows an intermediate fallback to AT_EXECFN.
it is defined in linux asm/sockios.h since commit
ae40eb1ef30ab4120bd3c8b7e3da99ee53d27a23 (linux v2.6.22)
but was missing from musl by accident.
in musl the sockios macros are exposed in sys/ioctl.h together
with other ioctl requests instead of in sys/socket.h because of
namespace rules. (glibc has them in sys/socket.h under _GNU_SOURCE.)
Due to a missing ":" in an asm() statement, the "memory" clobber is
considered by gcc as an input operand and not a clobber, which causes a
build failure.
passing to pthread_join the id of a thread which is not joinable
results in undefined behavior.
in principle the check to trap does not necessarily work if
pthread_detach was called after thread creation, since no effort is
made here to synchronize access to t->detached, but the check is
well-defined and harmless for callers which did not invoke UB, and
likely to help catch erroneous code that would otherwise mysteriously
hang.
patch by William Pitcock.
The TOC pointer is constant within a single dso, but needs to be saved
and restored around cross-dso calls. The PLT stub saves it to the
caller's stack frame, and the linker adds code to the caller to restore
it.
With a local call, as within a single dso or with static linking, this
doesn't happen and the TOC pointer is always in r2. Therefore,
setjmp/longjmp need to save/restore the TOC pointer from/to different
locations depending on whether the call to setjmp was a local or non-local
call.
It is always safe for longjmp to restore to both r2 and the caller's stack.
If the call to setjmp was local, then only r2 matters and the stack location
will be ignored, but is required by the ABI to be reserved for the TOC
pointer. If the call was non-local, then only the stack location matters,
and whatever is restored into r2 will be clobbered anyway when the caller
reloads r2 from the stack.
A little extra care is required for sigsetjmp, because it uses setjmp
internally. After the second return from this setjmp call, r2 will contain
the caller's TOC pointer instead of libc's TOC pointer. We need to save
and restore the correct libc pointer before we can tail call to
__sigsetjmp_tail.
neither current compilers nor linkers treat protected visibility the
way I expected, as having fixed source-level semantics rather than
being dependent on target-specific ABI details, and change seems
unlikely. while the use here does not actually depend on the specific
semantics, at least some versions of some linkers, especially lld,
refuse to allow linking to a libc.so where the symbols have protected
visibility. this cannot be detected at configure-time because linking
libc.so itself works fine, and because even if we could test linking
an application against libc.so successfully, we could not justifiably
assume that the same linker used to link libc.so would also be used
later to link applications.
disable the vis.h hack by default at the configure level, but add an
explicit "auto" option to request the old configure-time detection
rather than just removing it. this leaves it easy to evaluate whether
it actually resulted in significant size or performance benefits while
ensuring that out-of-the-box builds are not unlinkable for some users.
fortunately, preliminary evaluation suggests that at least x86_64,
arm, and aarch64 don't suffer at all from the change, and impact on
other archs is low. if low is not low enough, it should not be hard to
analyze where the significant PLT call ABI costs are present and
mitigate them without the hack.
since setlocale(cat, NULL) is required to return the setting for the
global locale, there is no standard mechanism to obtain the name of
the currently active thread-local locale set by uselocale. this makes
it impossible for application/library software to load appropriate
translations, etc. unless using the gettext implementation provided by
libc, which has privileged access to libc internals.
to fill this gap, glibc introduced the _NL_LOCALE_NAME macro which can
be used with nl_langinfo to obtain the name. GNU gettext/gnulib code
already use this functionality on glibc, and can easily be adapted to
make use of it on non-glibc systems if it's available; for other
systems they poke at locale implementation internals, which we want to
avoid. this patch provides a compatible interface to the one glibc
introduced.
The switch statement has no 'default:' case and the function ends
immediately following the switch, so the extra comparison did not
communicate any extra information to the compiler.
The flag 1<<7 is used in several places for different purposes that are
not always easy to distinguish. Mark those usages that correspond to the
flag that is used by the kernel for futexes.
previously, the pathname used to load the program was always used as
argv[0]. the default remains the same, but a new --argv0 option can be
used to provide a different value.
a null pointer for a library's deps list was ambiguous: it could
indicate either no dependencies or that the dependency list had not
yet been populated. inability to distinguish could lead to spurious
work when dlopen is called multiple times on a library with no deps,
and due to related bugs, could actually cause other libraries to
falsely appear as dependencies, translating into false positives for
dlsym.
avoid the problem by always initializing the deps pointer, pointing to
an empty list if there are no deps. rather than wasting memory and
introducing another failure path by allocating an empty list per
library, simply share a global dummy list.
further fixes will be needed for related bugs, and much of this code
may end up being replaced.
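the shared-dummy-list idea can be sketched as below. the struct layout and helper names are simplified assumptions for this example and do not mirror the dynamic linker's real struct dso.

```c
#include <stddef.h>

struct dso {
	struct dso **deps;   /* null-terminated; NULL means "not yet populated" */
};

/* One shared empty list for every library without dependencies. */
static struct dso *no_deps[1];

/* Record a dependency list; an empty one points at the shared dummy. */
static void set_deps(struct dso *p, struct dso **deps, size_t n)
{
	p->deps = n ? deps : no_deps;
}

static int deps_populated(const struct dso *p)
{
	return p->deps != 0;
}

static size_t deps_count(const struct dso *p)
{
	size_t n = 0;
	while (p->deps[n]) n++;
	return n;
}
```

a null deps pointer now has exactly one meaning, so a repeated dlopen on a library without dependencies can skip the work.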
while the official elfv2 abi for "powerpc64le" sets power8 as the
baseline isa, we use it for both little and big endian powerpc64
targets and need to maintain compatibility with pre-power8 models. the
instructions for sqrt, fabs, and fma are in the baseline isa; support
for the rest is conditional via predefined isa-level macros.
patch by David Edelsohn.
these were introduced in z196 and not available in the baseline (z900)
ISA level. use __HTM__ as an alternate indicator for ISA level, since
gcc did not define __ARCH__ until 7.x.
patch by David Edelsohn.
in arm rtabi these __aeabi_* functions have special abi (they are
only allowed to clobber r0,r1,r2,r3,ip,lr,cpsr), so they cannot
be simple wrappers around normal string functions (which may
clobber other registers), the safest solution is to write them in
asm, a minimalistic implementation works because these are not
supposed to be emitted by compilers or used in general.
commit 97bd6b09db refactored the table
lookup into a function and introduced an error in index computation.
the error caused garbage to be read from the table if the given charmap
had a non-zero number of elided entries.
POSIX requires ctime_r return a null pointer on failure, which can
occur if the input time_t value is not representable in broken down
form.
based on patch by Alexander Monakov.
these functions return an error code, and are not explicitly
documented to set errno, but they are nonstandard and the historical
implementations do set errno as well, and some applications expect
this behavior. do likewise for compatibility.
patch by Rudolph Pereira.
ctime passes the result from localtime directly to asctime. But in case
of error, localtime returns 0. This causes an error (NULL pointer
dereference) in asctime.
based on patch by Omer Anson.
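The guard amounts to checking the result of localtime before handing it on. checked_ctime is a hypothetical name used here to avoid clashing with the real ctime.

```c
#include <time.h>

/* Like ctime, but returns a null pointer instead of dereferencing
 * one when the time value cannot be broken down. */
static char *checked_ctime(const time_t *t)
{
	struct tm *tm = localtime(t);
	if (!tm) return 0;
	return asctime(tm);
}
```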
mremap seems to always fail on nommu, and on some non-Linux
implementations of the Linux syscall API, it at least fails to
increase allocation size, and may fail to move (i.e. defragment) the
existing mapping when shrinking it too. instead of failing realloc or
leaving an over-sized allocation that may waste a large amount of
memory, fall back to malloc-memcpy-free if mremap fails.
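the malloc-memcpy-free path can be sketched as a standalone helper. resize_by_copy is a hypothetical name, and old_size is assumed to be the usable size of the old allocation, which a real allocator would know from its own bookkeeping.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical fallback path: move an allocation by hand when it
 * cannot be resized in place. On failure the old block is untouched. */
static void *resize_by_copy(void *old, size_t old_size, size_t new_size)
{
	void *new = malloc(new_size);
	if (!new) return 0;
	memcpy(new, old, old_size < new_size ? old_size : new_size);
	free(old);
	return new;
}
```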
POSIX defines getdate error #5 as:
"An I/O error is encountered while reading the template file."
POSIX defines getdate error #7 as:
"There is no line in the template that matches the input."
This change correctly disambiguates between the two error conditions.
the check to prevent matching empty string wrongly blocked matching
of "/" due to checking emptiness after stripping leading slashes
rather than checking the full original argument string.
simplified from patch by Julien Ramseier.
when using the sh4a opcodes, the assembler tags the resulting object
file as requiring sh4a. the linker then refuses to (static) link it
with object files marked as requiring j2, since there is no isa level
that includes both sh4a and j2 instructions.
at one point, clang reportedly failed to support the asm register
constraints needed for inline syscalls. versions of clang that old
have much bigger problems that preclude using them to compile musl
libc.
mips64 requires 'struct stat' conversion due to incorrect 32-bit
fields where time_t should be in the kernel version of the structure.
syscall_arch.h already performed the correct translation for stat,
fstat, and lstat syscalls, but omitted special handling for fstatat.
The flags argument was missing, causing uninitialized data to be passed
to fchownat(2). The correct value of flags should match the fallback for
chown(3).
there was missing reverse-conversion logic for the case, handled
specially in the character set tables, where a byte represents a
unicode codepoint with the same value.
this patch adds code to handle the case, and refactors the two-level
10-bit table lookup for legacy character sets into a function to avoid
repeating it yet another time as part of the fix.
per POSIX, EINVAL is not a mandatory error, only an optional one. but
reporting unsupported flags allows an application to fall back
gracefully when a requested feature is not supported. this is not
helpful now, but it may be in the future if additional flags are
added.
had this checking been present before, applications would have been
able to check for the newly-added POSIX_SPAWN_SETSID feature (added in
commit bb439bb171) at runtime.
the bit is reserved anyway for ABI-compat reasons; this documents it
and makes it so we can have posix_spawnattr_setflags check for flag
validity without hard-coding an anonymous bit value.
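a sketch of this kind of validity check, using made-up flag names and values (the real constants live in spawn.h and are not reproduced here):

```c
#include <errno.h>

/* hypothetical flag bits for illustration only */
#define DEMO_FLAG_A        1
#define DEMO_FLAG_B        2
#define DEMO_FLAG_RESERVED 4   /* named so the mask needs no anonymous constant */

/* store flags only if every set bit is a known flag */
int demo_setflags(short *stored, short flags)
{
	const unsigned all_flags =
		DEMO_FLAG_A | DEMO_FLAG_B | DEMO_FLAG_RESERVED;
	if (flags & ~all_flags) return EINVAL;
	*stored = flags;
	return 0;
}
```

an application can then probe for a newer flag at runtime by trying to set it and falling back if EINVAL comes back.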
This structure was missed when creating the s390x port.
This is based on the report and patch from William Pitcock, but with a
modified structure definition to more closely match the kernel's
definition.
the code being removed was written to optimize for size assuming the
compiler cannot collapse code paths for different types with the same
underlying representation. modern compilers sometimes succeed in
making this optimization themselves, but either way it's a small size
difference and not worth the source-level complexity or the UB
involved in this hack.
some incorrect use of va_arg still remains, particularly use of void *
where the actual argument has a different pointer type. fixing this
requires some actual code additions, rather than just removing cruft,
so I'm leaving it to be done later as a separate commit.
commit 0a950dcf15 added checking that
the pathname a tty device was opened with actually matches the device,
which can fail to hold when a container inherits a tty from outside
the container. the error code added at the time was ENOENT; however,
discussions between affected applications and glibc developers
resulted in glibc adopting ENODEV as the error for this condition, and
this has now been documented in the man pages project as well. adopt
the same error code for consistency.
patch by Christian Brauner.
commit d6cb08bcac moved the code and
introduced an incorrect string offset for the new parsing, probably
due to a copy-and-paste error.
patch by Stefan Sedich.
in nearest rounding mode scalbn could introduce double rounding error
when an intermediate value and the final result were both in the
subnormal range e.g.
scalbn(0x1.7ffffffffffffp-1, -1073)
returned 0x1p-1073 instead of 0x1p-1074, because the intermediate
computation got rounded to 0x1.8p-1023.
with the fix an intermediate value can only be in the subnormal range
if the final result is 0 which is correct even after double rounding.
(there still can be two roundings so signals may be raised twice, but
that's only observable with trapping exceptions which is not supported.)
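the double rounding can be reproduced outside of scalbn with two plain multiplications. this is only an illustration of the effect, assuming round-to-nearest and IEEE double arithmetic; it is not the code that was changed.

```c
/* scale by 2^-1073 in two multiplications; the first product lands in
 * the subnormal range and is rounded there, the second rounds again */
double scale_in_two_steps(double x)
{
	/* volatile keeps the intermediate rounding to double */
	volatile double y = x * 0x1p-1022;
	return y * 0x1p-51;
}
```

for x = 0x1.7ffffffffffffp-1 the intermediate is rounded to 0x1.8p-1023 and the final product to 0x1p-1073, whereas a single correctly rounded scaling yields 0x1p-1074.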
normally 32-bit archs use the mmap2 syscall and are limited to an
offset of 2^32 pages. however some 32-bit archs (mainly ILP32-on-64
ones like x32) have 64-bit syscall argument slots and thus can accept
the full range. don't artificially limit them.
analogous to commit 5bf7eba213, use of
AT_PHDR/PT_PHDR does not actually work to find the program base, and
the method with _DYNAMIC vs PT_DYNAMIC must be used as an alternative.
patch by Shiz, along with testing to confirm that this fixes unwinding
in static PIE.
due to testing buf[i].family==AF_INET before checking i==cnt, it was
possible to read past the end of the array, or past the valid part. in
practice, without active bounds/indeterminate-value checking by the
compiler, the worst that happened was failure to return early and
optimize out the sorting that's unneeded for v4-only results.
returning on i==cnt-1 rather than i==cnt would be an alternate fix,
but the approach this patch takes is more idiomatic and less
error-prone.
patch by Timo Teräs.
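the idiomatic ordering tests the index before reading the element. a sketch with a hypothetical reduced entry type (the real result structure has more fields):

```c
#include <stddef.h>

struct demo_addr { int family; };   /* hypothetical stand-in for result entries */
#define DEMO_AF_INET 2              /* illustrative value */

/* returns 1 if all cnt entries are v4; the bound is checked first, so
 * buf[cnt] is never read */
int all_v4(const struct demo_addr *buf, size_t cnt)
{
	size_t i;
	for (i = 0; i < cnt && buf[i].family == DEMO_AF_INET; i++);
	return i == cnt;
}
```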
this should increase performance and reduce code size on aarch64.
the compiled code was checked against using __builtin_* instead
of inline asm with gcc-6.2.0.
lrint is two instructions.
c with inline asm is used because it is safer than a pure asm
implementation; this prevents ll{rint,round} from being an alias
of l{rint,round} (because the types don't match) and depends
on gcc style inline asm support.
ceil, floor, round, trunc can either raise inexact on finite
non-integer inputs or not raise any exceptions. the new
implementation does not raise exceptions while the generic
c code does.
on aarch64, the underflow exception is signaled before rounding
(ieee 754 allows both before and after rounding, but it must be
consistent), the generic fma c code signals it after rounding
so using a single instruction fixes a slight conformance issue too.
With REG_NEWLINE, POSIX says:
"A <newline> in string shall not be matched by a period outside
a bracket expression or by any form of a non-matching list"
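the quoted requirement can be checked with the standard regcomp/regexec interface. the helper name below is an assumption; the expected results are what a conforming implementation should produce.

```c
#include <regex.h>
#include <stddef.h>

/* returns 1 on match, 0 on no match, -1 if the pattern fails to compile */
int re_matches(const char *pattern, const char *string, int cflags)
{
	regex_t re;
	int r;
	if (regcomp(&re, pattern, cflags | REG_NOSUB)) return -1;
	r = regexec(&re, string, 0, NULL, 0);
	regfree(&re);
	return r == 0;
}
```

with REG_NEWLINE, neither "a.b" nor "a[^x]b" should match the string "a\nb"; without the flag, "a.b" does.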
the old limit was one byte too short to support locale names of the
form xx_XX.UTF-8@modifier where modifier is more than 3 bytes, a form
which various real-world locale names take. the problem could be
avoided by omitting the useless ".UTF-8" part, but users may need to
have it present when operating on mixed-libc systems or when it will
be carried over (e.g. across ssh) to other systems.
the new limit is chosen sufficient for existing/reasonable locale
names while still keeping the size of setlocale's static buffer small.
also add locale_impl.h to the Makefile's list of headers which force
rebuild of source files, to prevent dangerously inconsistent object
files from getting used after this change.
often translations will be named only by language, whereas locale
names may also include a territory code, modifier, and codeset
portion. previously, only translations exactly matching the locale
name were loaded. this was a major usability issue, requiring
workarounds like symlinks or tweaking of the locale name.
with these changes, gettext now searches for translations by first
removing the codeset portion of the locale name, then trying the
remainder in full, with modifier (@mod) removed, with territory code
(_XX) removed, and with both removed.
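the search order described above can be sketched as a candidate generator. this is a hypothetical helper written for illustration (fixed 64-byte buffers, no path construction), not the actual lookup code.

```c
#include <stdio.h>
#include <string.h>

/* hypothetical helper: writes up to 4 candidate names, most specific
 * first, and returns how many were written (0 if name is too long) */
int locale_candidates(const char *name, char out[4][64])
{
	char lang[64], terr[64] = "", mod[64] = "";
	size_t n;
	int cnt = 0;

	if (strlen(name) >= 64) return 0;

	n = strcspn(name, "_.@");          /* language */
	memcpy(lang, name, n);
	lang[n] = 0;
	name += n;
	if (*name == '_') {                /* territory, kept with its '_' */
		n = strcspn(name, ".@");
		memcpy(terr, name, n);
		terr[n] = 0;
		name += n;
	}
	if (*name == '.')                  /* codeset is dropped */
		name += strcspn(name, "@");
	if (*name == '@')                  /* modifier, kept with its '@' */
		strcpy(mod, name);

	snprintf(out[cnt++], 64, "%s%s%s", lang, terr, mod);
	if (*mod) snprintf(out[cnt++], 64, "%s%s", lang, terr);
	if (*terr) snprintf(out[cnt++], 64, "%s%s", lang, mod);
	if (*mod && *terr) snprintf(out[cnt++], 64, "%s", lang);
	return cnt;
}
```

for "de_DE.UTF-8@euro" this yields "de_DE@euro", "de_DE", "de@euro", "de", in that order.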
part of the reason gettext lacked support for searching fallbacks
before is that the candidate pathname for a translation file was
constructed on each call and used as the key to look up an
already-mapped translation file. this was very costly/inefficient. we
now use the tuple of textdomain binding pointer, locale map pointer,
and integer category id as the key for looking up a translation file
mapping.
based on patch by He X.
when called for LC_ALL, setlocale has to return a string representing
the state of all locale categories. the simplest way to do this was to
always return a delimited list of values for each category, but that's
not friendly in the fairly common case where all categories have the
same setting. He X proposed a patch to check for this case and return
a single name; this patch is a simplified approach to do the same.
commit 4ff234f6cb erroneously changed
the condition for running certain code at dlopen time to check whether
the library was already relocated rather than whether it already had
its deps[] table filled. this was out of concern over whether the code
under the conditional would be idempotent/safe to call on
already-loaded libraries. however, I missed a consideration in the
opposite direction: if a library was loaded at program startup rather
than dlopen, its deps[] table was not yet allocated/filled, and
load_deps needs to be called at dlopen time in order for dlsym to be
able to perform dependency-order symbol lookups.
in order to avoid wasteful allocation of lazy-binding relocation
tables for libraries which were already loaded and relocated at
startup, the check for !p->relocated is not deleted entirely, but
moved to apply only to allocation of these tables.
the time of day at which daylight time switches over is specified in
local time in the dst state prior to the transition. the code for
handling this wrongly assumed it needed to switch whether dst or
standard offset is applied to the transition time when the dst end
date is before the dst start date (southern hemisphere summer), but in
fact the end transition time should always be adjusted for dst, and
the start transition time should always be adjusted for standard time.
Including sys/procfs.h fails with unknown type name 'fpreg_t' in
bits/user.h. fpreg_t in bits/signal.h and elf_fpreg_t in bits/user.h
are practically the same.
per_struct is never used, and it even conflicts with the kernel header
asm/ptrace.h
this change was suggested based on testing done by Timo Teräs almost
two years ago; the branch (and probably call prep overhead) in the
inner loop was found to contribute noticeably to total symbol lookup
time. this change will make lookup slightly slower if libraries were
built with only the traditional "sysv" ELF hash table, but based on
how much slower lookup tends to be without the gnu hash table, it
seems reasonable to assume that (1) users building without gnu hash
don't care about dynamic linking performance, and (2) the extra time
spent computing the gnu hash is likely to be dominated by the slowness
of the sysv hash table lookup anyway.
partly following freebsd rev 279491
https://svnweb.freebsd.org/base?view=revision&revision=279491
(musl had some of the fixes before freebsd).
the change should not matter much for j0f, y0f, but it improves
j1f and y1f in [2.5,~3.75] (that is [0x40200000,~0x40700000]).
near roots (e.g. around 3.8317 for j1f) there are still large
ulp errors.
dropped code that tried to raise inexact.
such loading is unsafe, and can happen when programs use their own
logic to locate a .so file then pass the absolute pathname to dlopen,
or if an absolute pathname ends up in DT_NEEDED headers. multiple
loads with only the base name were already precluded, provided libc
was named appropriately, by special-casing standard library names.
one function symbol (in the reserved namespace, but public, since it's
part of the crt1 entry point ABI) and one data symbol are checked.
this way we avoid likely false positives, particularly from libraries
interposing and wrapping functions. there is no hard requirement to
avoid breaking such usage, since trying to run a hook before libc is
even initialized is not a supported usage case, but it's friendlier
not to break things.
the fix in commit c3edc06d1e for
CVE-2016-8859 used gotos to exit on overflow conditions, but the code
in that error path assumed the buffer pointer was valid or null. thus,
the conditions which previously led to under-allocation and buffer
overflow could instead lead to an invalid pointer being passed to
free.
traditional lazy relocation with call-time plt resolver is
intentionally not implemented, as it is a huge bug surface and demands
significant amounts of arch-specific code and requires ongoing
maintenance to ensure compatibility with applications which make use
of new additions to the arch's register file in passing function
arguments.
some applications, however, depend on the ability to dlopen modules
which have unsatisfied symbol references at the time they are loaded,
either avoiding use of the affected interfaces or manually loading
another module to provide the missing definition via their own module
dependency tracking outside the ELF data structures. while such usage
is non-conforming, failure to support it has been a significant
obstacle for users/distributions trying to support affected software,
particularly the X.org server.
instead of resolving lazy relocations at call time, this patch saves
unresolved GOT/PLT relocations for deferral and retries them after
each subsequent dlopen until they are resolved. since dlopen is the
only time at which the effective global symbol table can change, this
behavior is not observably different from traditional lazy binding,
and the required code is minimal.
when loading libraries with dlopen, the caller can request that the
library's symbols become part of the global symbol table, or that they
only be used for resolving relocations in the loaded library and its
dependencies. in the latter case, a subsequent dlopen of the same
library can upgrade it to global status.
previously, if a library was upgraded from local to global mode, its
symbols entered the symbol lookup search order at the point where the
library was originally loaded. this means that a new call to dlopen
could change the value of a symbol that already had a visible
definition, an inconsistency which applications could observe.
POSIX is unclear whether this should happen or whether it's permitted
to happen, but the resolution of Austin Group issue #982 made it
formally unspecified.
with this patch, a library whose mode is upgraded from local to global
enters the symbol lookup order at the point where it was made global,
so that symbol resolution before and after the upgrade are consistent.
in order to implement this change, the per-dso global flag is replaced
with a separate set of linked-list pointers for participation in the
global symbol table. this permits the order of dso objects for symbol
resolution to differ from the order used for iteration of all loaded
libraries. it also improves performance of find_sym by avoiding a
branch per iteration and, especially in the case where many
non-global libraries have been loaded, by allowing the loop to
skip over them entirely. logic for temporarily adding non-global
libraries to the symbol table for relocation purposes is also mildly
simplified.
A weak symbol definition is not special during dynamic linking, so
don't let a strong definition in a later module override it.
(glibc dynamic linker allows overriding weak definitions if
LD_DYNAMIC_WEAK is set, musl does not.)
STB_GNU_UNIQUE means that the symbol is global, even if it is in a
module that's loaded with RTLD_LOCAL, and all references resolve to
the same definition. This semantics is only relevant for c++ plugin
systems and even there it's often not what the user wants (so it can
be turned off in g++ by -fno-gnu-unique when the c++ shared lib is
compiled). In musl just treat it like STB_GLOBAL.
the 32-bit pc-relative address for stage 2 of dynamic linker entry was
wrongly loaded with a zero-extending load instead of sign-extending
load, resulting in an invalid jump if the offset happened to be
negative, which depends on the linker's ordering of text sections.
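the difference between the two loads, expressed in C as an illustration (the actual bug was in assembly; the function names are hypothetical):

```c
#include <stdint.h>

/* wrong: a negative 32-bit offset becomes a large positive displacement */
uint64_t target_zero_extended(uint64_t pc, uint32_t raw)
{
	return pc + (uint64_t)raw;
}

/* right: reinterpret the 32 bits as signed before widening
 * (the uint32_t to int32_t conversion wraps on common compilers) */
uint64_t target_sign_extended(uint64_t pc, uint32_t raw)
{
	return pc + (uint64_t)(int64_t)(int32_t)raw;
}
```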
this is not a conformance issue as posix does not specify the
argument order, but the order is specified for bsearch and some
systems document the order for lsearch consistently (openbsd).
since there were two independent reports of this issue it's better
to use the more widely expected argument order.
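the order matters whenever the key and the array elements have different types. the sketch below uses bsearch, where ISO C specifies the key-first order, to show the convention; the table and names are made up.

```c
#include <stdlib.h>
#include <string.h>

struct demo_opt { const char *name; int value; };

/* first argument is the search key (a plain string), second is an
 * array element; swapping them would misread both */
static int cmp_key_elem(const void *key, const void *elem)
{
	return strcmp(key, ((const struct demo_opt *)elem)->name);
}

/* returns the value for name, or -1 if it is not in the table */
int demo_lookup(const char *name)
{
	static const struct demo_opt table[] = {
		{ "alpha", 1 }, { "beta", 2 }, { "gamma", 3 },
	};
	const struct demo_opt *o = bsearch(name, table,
		sizeof table / sizeof *table, sizeof *table, cmp_key_elem);
	return o ? o->value : -1;
}
```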
the ABI for arm was silently changed at some point to allow page sizes
other than 4k; traditional binaries built with only 4k-aligned offsets
between load segments cannot run on such systems, but newer binutils
versions use 64k offset alignment.
while larger page size is undesirable for various reasons, users have
encountered hardware and/or kernels that lock the page size to a
larger value, so follow the new ABI and allow it to vary.
binutils commit bada43421274615d0d5f629a61a60b7daa71bc15 tightened
immediate fixup handling in gas in such a way that the final .arch of
an object file must be compatible with the fixups used when the
instruction was assembled; this in turn broke assembling of atomics.s,
at least in thumb mode.
it's not clear whether this should be considered a bug in gas, but
.object_arch is preferable anyway for our purpose here of controlling
the ISA level tag on the object file being produced, and it's the
intended directive for use in object files with runtime code
selection. research by Szabolcs Nagy confirmed that .object_arch is
supported in all relevant versions of binutils and clang's integrated
assembler.
patch by Reiner Herrmann.
the plural_rule field of allocated msgcat structures was assumed to be
initially-null but was never initialized. for future-proofing, the
nplurals field which was left uninitialized should also be cleared.
likewise, in the binding structure, the active field could be used
uninitialized by a technicality: the a_store which stores the initial
value of 0 may be implemented as a cas operation, which reads the old
value.
rather than fixing these issues individually, just use calloc for both
allocations. this does result in wasteful clearing of name buffers (up
to NAME_MAX+PATH_MAX) before filling them, but since the size is
bounded and the time is dominated by filesystem operations, it really
doesn't matter; simplicity and future-proofing have more value here.
modified from patch submitted by He X.
this loop was only supposed to deactivate other bindings for the same
text domain name, but due to copy-and-paste error, deactivated all
other bindings.
patch by He X.
commit 78a8ef47c4 inadvertently removed
the SA_RESTART flag from the sigaction for the internal signal handler
used by __synccall for broadcasting. as a result, programs which did
not use interrupting signals but which used set*id() in a
multithreaded context could wrongly observe EINTR errors they're not
prepared to handle.
commit d56460c939 introduced this
regression as part of splitting the tls module list out of the dso
list. the new code added to dlopen's failure path to undo the changes
adding the partially-loaded libraries reset the tls_tail pointer
correctly, but did not clear its link to the next list entry. thus, at
least until the next successful dlopen, the list was not terminated
but ended with an invalid next pointer, which __copy_tls attempted to
follow when a new thread was created.
patch by Mikael Vidstedt.
ISO C and POSIX only specify behavior for base arguments of 0 and
2-36; POSIX mandates an EINVAL error for unsupported bases. it's not
clear that there's a requirement for implementations not to "support"
additional bases as an extension, but "base 1" did not work in any
meaningful way anyway, so it should be considered unsupported and thus
an error.
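a small probe for the resulting behavior, assuming an implementation that reports EINVAL for unsupported bases as POSIX mandates (the helper name is hypothetical):

```c
#include <errno.h>
#include <stdlib.h>

/* returns 1 if strtol reports EINVAL for the given base */
int base_is_rejected(int base)
{
	errno = 0;
	(void)strtol("1", NULL, base);
	return errno == EINVAL;
}
```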
getopt is only specified to modify optopt on error, and some software
apparently infers an error from optopt!=0.
getopt_long is changed analogously. the resulting behavior differs
slightly from the behavior of the GNU implementation of getopt_long,
which keeps an internal shadow copy of optopt and copies it to the
public one on return, but since the GNU implementation also exhibits
this shadow-copy behavior for plain getopt where it is non-conforming,
I think this can reasonably be considered a bug rather than an
intentional behavior that merits mimicking.
when _GNU_SOURCE is defined, which is always the case when compiling
c++ with gcc, these macros for the indices in gregset_t are
exposed and likely to clash with applications. by using enum constants
rather than macros defined with integer literals, we can make the
clash slightly less likely to break software. the macros are still
defined in case anything checks for them with #ifdef, but they're
defined to expand to themselves so that non-file-scope (e.g.
namespaced) identifiers by the same names still work.
for the sake of avoiding mistakes, the changes were generated with sed
via the command:
sed -i -e 's/#define *\(REG_[A-Z_0-9]\{1,\}\) *\([0-9]\{1,\}\)'\
'/enum { \1 = \2 };\n#define \1 \1/' \
arch/i386/bits/signal.h arch/x86_64/bits/signal.h arch/x32/bits/signal.h
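the effect of the generated pattern can be seen with a made-up name (DEMO_REG is hypothetical; the real headers use the REG_* names):

```c
/* what the sed command produces for each register index */
enum { DEMO_REG = 3 };
#define DEMO_REG DEMO_REG

#ifndef DEMO_REG
#error "feature checks with #ifdef would no longer see the name"
#endif

/* the macro expands to itself, so a block-scope identifier of the
 * same name still compiles; "#define DEMO_REG 3" would turn the
 * declaration below into "int 3 = 42;" */
int demo_block_scope_value(void)
{
	int DEMO_REG = 42;
	return DEMO_REG;
}

/* at file scope the name still denotes the enum constant */
int demo_file_scope_value(void)
{
	return DEMO_REG;
}
```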
commit 0dc99ac413 added input length
checking to avoid unsafe VLA allocation, but put it in the wrong
place, before the glob_t structure was zeroed out. while POSIX isn't
clear on whether it's permitted to call globfree after glob failed
with GLOB_NOSPACE, making it safe is clearly better than letting
uninitialized pointers get passed to free in non-conforming callers.
while we're fixing this, change strlen check to the idiomatic strnlen
version to avoid unbounded input scanning before returning an error.
commit 583ea83541 fixed the case where
tm_year is negative but the resulting year (offset by 1900) was still
positive, which is always the case for time_t values that fit in 32
bits, but not for arbitrary inputs.
based on an earlier patch by Julien Ramseier which was overlooked at
the time the previous fix was applied.
the static-linked version of __init_tls needs to locate the TLS
initialization image via the ELF program headers, which requires
determining the base address at which the program was loaded. the
existing code attempted to do this by comparing the actual address of
the program headers (obtained via auxv) with the virtual address for
the PT_PHDR record in the program headers. however, the linker seems
to produce a PT_PHDR record only when a program interpreter (dynamic
linker) is used. thus the computation failed and used the default base
address of 0, leading to a crash when trying to access the TLS image
at the wrong address.
the dynamic linker entry point and static-PIE rcrt1.o startup code
compute the base address instead by taking the difference between the
run-time address of _DYNAMIC and the virtual address in the PT_DYNAMIC
record. this patch copies the approach they use, but with a weak
symbolic reference to _DYNAMIC instead of obtaining the address from
the crt_arch.h asm. this works because relocations have already been
performed at the time __init_tls is called.
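the base computation amounts to a subtraction once the PT_DYNAMIC record is found. a sketch with a reduced, hypothetical program header type (the real code walks the ELF program headers and takes the address of _DYNAMIC itself):

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PT_DYNAMIC 2   /* same value as PT_DYNAMIC in <elf.h> */

/* hypothetical, reduced program header */
struct demo_phdr { uint32_t type; uintptr_t vaddr; };

/* load base = run-time address of _DYNAMIC minus the link-time
 * virtual address in PT_DYNAMIC; returns 0 if there is no such entry */
uintptr_t demo_load_base(const struct demo_phdr *ph, size_t n,
                         uintptr_t dynamic_addr)
{
	for (size_t i = 0; i < n; i++)
		if (ph[i].type == DEMO_PT_DYNAMIC)
			return dynamic_addr - ph[i].vaddr;
	return 0;
}
```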
all assembly is now thumb2-compatible. on existing targets this is at
best a size optimization, but it will also facilitate porting to
thumb2-isa-only arm variants.
three problems are addressed:
- use of pc arithmetic, which was difficult if not impossible to make
correct in thumb mode on all models, so that relative rather than
absolute pointers to the backends could be used. this was designed
back when there was no coherent model for the early stages of the
dynamic linker before relocations, and is no longer necessary.
- assumption that data (the relative pointers to the backends) can be
accessed at a constant displacement from the code. this will not be
possible on future fdpic subarchs (for cortex-m), so move
responsibility for loading the backend code address to the caller.
- hard-coded arm opcodes using the .word directive. instead, use the
.arch directive to work around the assembler's refusal to assemble
instructions not available (or in some cases, available but just
considered deprecated) in the target isa level. the obscure v6t2
arch is used for v6 code so as to (1) allow generation of thumb2
output if -mthumb is active, and (2) avoid warnings/errors for mcr
barriers that clang would produce if we just set arch to v7-a.
in addition, the __aeabi_read_tp function is moved out of the inner
workings and implemented as an asm wrapper around a C function, so
that asm code does not need to read global data. the asm wrapper
serves to satisfy the ABI calling convention requirements for this
function.
float conversion is slow and big on soft-float targets.
The lookup table increases code size a bit on most hard float targets
(and adds 60 bytes of rodata); performance can be a bit slower because of
position independent data access and cpu internal state dependence
(cache, extra branches), but the overall effect should be minimal
(common, small size allocations should be unaffected).
In BRE, ^ is an anchor at the beginning of an expression, optionally
it may be an anchor at the beginning of a subexpression and must be
treated as a literal otherwise.
Previously musl treated ^ in subexpressions as literal, but at least
glibc and gnu sed treat it as an anchor and that's the more useful
behaviour: it can always be escaped to get back the literal meaning.
Same for $ at the end of a subexpression.
Portable BRE should not rely on this, but there are sed commands in
build scripts which do.
This changes the meaning of the BREs:
\(^a\)
\(a\|^b\)
\(a$\)
\(a$\|b\)
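The changed meaning can be observed through regcomp with no flags (BRE). The helper name is an assumption, and the results noted below hold only on implementations that treat ^ in a subexpression as an anchor, since POSIX leaves that optional.

```c
#include <regex.h>
#include <stddef.h>

/* returns 1 if the BRE matches anywhere in string, 0 if not,
 * -1 if it fails to compile */
int bre_matches(const char *bre, const char *string)
{
	regex_t re;
	int r;
	if (regcomp(&re, bre, REG_NOSUB)) return -1;
	r = regexec(&re, string, 0, NULL, 0);
	regfree(&re);
	return r == 0;
}
```

With the anchor interpretation, \(^a\) matches "ab" but not "x^a"; with the old literal interpretation the results were the other way around.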
POSIX specifies the result to have signed 32-bit range. on 32-bit
archs, the implicit conversion to long achieved the desired range
already, but when long is 64-bit, a cast is needed.
patch by Ed Schouten.
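a sketch of why the cast matters when long is 64-bit. the 48-bit input and the function name are assumptions for illustration; the affected function is not named above.

```c
#include <stdint.h>

/* take the top 32 bits of a 48-bit value and return them in signed
 * 32-bit range whether long is 32 or 64 bits wide; the conversion to
 * int32_t wraps on common compilers (implementation-defined in C) */
long top32_signed(uint64_t x48)
{
	return (int32_t)(x48 >> 16);
}
```

without the cast, an input with the top bit set would come back as a large positive value on 64-bit archs instead of a negative one.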
the bz instruction that was wrongly used only admits a small immediate
displacement and cannot be used with external symbols; apparently the
linker fails to diagnose the overflow.
this has been slated for removal for a long time. there is
fundamentally no way to implement stdarg without compiler assistance;
any attempt to do so has serious undefined behavior; its working
depends not just (as a common misconception goes) on ABI, but also on
assumptions about compiler code generation internal to a translation
unit, which is not subject to external ABI constraints.
gdb can only backtrace/unwind across signal handlers if it recognizes
the sa_restorer trampoline. for x86_64, gdb first attempts to
determine the symbol name for the function in which the program
counter resides and match it against "__restore_rt". if no name can be
found (e.g. in the case of a stripped binary), the exact instruction
sequence is matched instead.
when matching the function name, however, gdb's unwind code wrongly
considers the interval [sym,sym+size] rather than [sym,sym+size).
thus, if __restore_rt begins immediately after another function, gdb
wrongly identifies pc as lying within the previous adjacent function.
this patch adds a nop before __restore_rt to preclude that
possibility. it also removes the symbol name __restore and replaces it
with a macro since the stability of whether gdb identifies the
function as __restore_rt or __restore is not clear.
for the no-symbols case, the instruction sequence is changed to use
%rax rather than %eax to match what gdb expects.
based on patch by Szabolcs Nagy, with extended description and
corresponding x32 changes added.
On s390x, the kernel provides AT_SYSINFO_EHDR, but sets it to zero, if the
program being run does not have a program interpreter. This causes
problems when running the dynamic linker directly.
alpha and s390x gratuitously use 64-bit entries (wasting 2x space and
cache utilization) despite the values always being 32-bit.
based on patch by Bobby Bingham, with changes suggested by Alexander
Monakov to use the public Elf_Symndx type from link.h (and make it
properly variable by arch) rather than adding new internal
infrastructure for handling the type.
commit 31fb174dd2 used
DEFAULT_GUARD_SIZE from pthread_impl.h in a static initializer,
breaking build on archs where its definition, PAGE_SIZE, is not a
constant. instead, just define DEFAULT_GUARD_SIZE as 4096, the minimal
page size on any arch we support. pthread_create rounds up to whole
pages anyway, so defining it to 1 would also work, but a moderately
meaningful value is nicer to programs that use
pthread_attr_getguardsize on default-initialized attribute objects.
based on patch by Timo Teräs:
While generally this is a bad API, it is the only existing API to
affect c++ (std::thread) and c11 (thrd_create) thread stack size.
This patch allows applications only to increase stack and guard
page sizes.
commit 33ce920857 broke pthread_create
in the case where a null attribute pointer is passed; rather than
using the default sizes, sizes of 0 (plus the remainder of one page
after TLS/TCB use) were used.
the linux kernel uapi headers provide their own definitions of the
structures from netinet/in.h, resulting in errors when a program
includes both the standard libc header and one or more of the
networking-related kernel headers that pull in the kernel definitions.
as before, we do not attempt to support the case where kernel headers
are included before the libc ones, since the kernel definitions may
have subtly incorrect types, namespace violations, etc. however, we
can easily support the inclusion of the kernel headers after the libc
ones, since the kernel headers provide a public interface for
suppressing their definitions. this patch adds the necessary macro
definitions for such suppression.
previously, the pthread_attr_t object was always initialized all-zero,
and stack/guard size were represented as differences versus their
defaults. this required lots of confusing offset arithmetic everywhere
they were used. instead, have pthread_attr_init fill in the default
values, and work with absolute sizes everywhere.
the swprintf write callback never reset its buffer pointers, so after
its 256-byte buffer filled up, it would keep repeating those bytes
over and over in the output until the destination buffer filled up. it
also failed to set the error indicator for the stream on EILSEQ,
potentially allowing output to continue after the error.
the overflow check for years+100 did not account for the extra
year computed from the remaining months. instead, perform this
check after obtaining the final number of years.
If a DT_NEEDED entry was the prefix of a reserved library name
(up to the first dot) then it was incorrectly treated as a libc
reserved name.
e.g. libp.so dependency was not loaded as it matched libpthread
reserved name.
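A sketch of the corrected comparison: the component between "lib" and the first dot must equal a reserved name in full. The list and function name here are illustrative assumptions, not the dynamic linker's actual table.

```c
#include <string.h>

/* illustrative list; the real set of reserved names may differ */
static const char *const demo_reserved[] = { "c", "pthread", "rt", "m", "dl" };

/* returns 1 only if the part between "lib" and the first dot equals a
 * reserved name in full, not merely a prefix of one */
int demo_is_reserved(const char *name)
{
	size_t n, i;
	if (strncmp(name, "lib", 3)) return 0;
	name += 3;
	n = strcspn(name, ".");
	for (i = 0; i < sizeof demo_reserved / sizeof *demo_reserved; i++)
		if (strlen(demo_reserved[i]) == n &&
		    !memcmp(name, demo_reserved[i], n))
			return 1;
	return 0;
}
```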
the value 19991006 for __RES implies availability of res_ninit and
related functions that take a resolver state argument; these are not
supported since our resolver is stateless. instead claim support for
just the older API by defining __RES to 19960801.
based on patch by Dmitrij D. Czarkoff.
the old snprintf design set up the FILE buffer pointers to point
directly into the destination buffer; if n was actually larger than
the buffer size, the pointer arithmetic to compute the buffer end
pointer was undefined. this affected sprintf, which is implemented in
terms of snprintf, as well as some unusual but valid direct uses of
snprintf.
instead, set up the FILE as unbuffered and have its write function
memcpy to the destination. the printf core sets up its own temporary
buffer for unbuffered streams.
ETH_P_HSR (IEC 62439-3 HSRv1) added in
linux 4.7 commit ee1c27977284907d40f7f72c2d078d709f15811f
ETH_P_TSN (IEEE 1722) added in
linux 4.3 commit 1ab1e895492d8084dfc1c854efacde219e56b8c1
this constant breaks the ascending order to match the kernel header
ETH_P_XDSA (Multiplexed DSA protocol) added in
linux 3.18 commit 3e8a72d1dae374cf6fc1dba97cec663585845ff9
the _CS_V6_ENV and _CS_V7_ENV constants are required to be available for use
with confstr. glibc defines these constants with values 1148 and 1149,
respectively.
the only missing (and required) confstr constants are
_CS_POSIX_V7_THREADS_CFLAGS and _CS_POSIX_V7_THREADS_LDFLAGS which remain
unavailable in glibc.
commit 6ffdc4579f set lnz in the code
path for non-zero digits after a huge string of zeros, but the
assignment of dc to lnz truncates if the value of dc does not fit in
int; this is possible for some pathologically long inputs, either via
strings on 64-bit systems or via scanf-family functions.
instead, simply set lnz to match the point at which we add the
artificial trailing 1 bit to simulate nonzero digits after a huge
run of zeros.
the mid-sized integer optimization relies on lnz set up properly
to mark the last non-zero decimal digit, but this was not done
if the non-zero digit lay outside the KMAX digits of the base
10^9 number representation.
so if the fractional part was a very long list of zeros (>2048*9 on
x86) followed by non-zero digits then the integer optimization could
kick in, discarding the tiny non-zero fraction, which can mean a wrong
result in non-nearest rounding modes.
strtof, strtod and strtold were all affected.
in certain cases excessive trailing zeros could cause incorrect
rounding from long double to double or float in decfloat.
e.g. in strtof("9444733528689243848704.000000", 0) the argument
is 0x1.000001p+73, exactly halfway between two representable floats;
this incorrectly got rounded to 0x1.000002p+73 instead of 0x1p+73,
but with fewer trailing zeros the rounding was fine.
the fix makes sure that the z index always points one past the last
non-zero digit in the base 10^9 representation, this way trailing
zeros don't affect the rounding logic.
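the halfway case above can be checked with a small sketch. the helper
name halfway_rounds_to_even is hypothetical; it only wraps the standard
strtof call and assumes the default round-to-nearest mode.

```c
#include <assert.h>
#include <stdlib.h>

/* 2^73 + 2^49 lies exactly halfway between the adjacent floats 0x1p+73
 * and 0x1.000002p+73, so round-to-nearest-even must pick 0x1p+73,
 * regardless of how many trailing zeros follow the decimal point. */
static int halfway_rounds_to_even(const char *s)
{
	return strtof(s, 0) == 0x1p+73f;
}
```

both "9444733528689243848704" and "9444733528689243848704.000000"
should make the helper return nonzero on a conforming implementation.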
in nearest rounding mode exact halfway cases were not following the
round to even rule if the rounding happened at a base 1000000000 digit
boundary of the internal representation and the previous digit was odd.
e.g. printf("%.0f", 1.5) printed 1 instead of 2.
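a minimal sketch of the expected round-to-even behavior for "%.0f",
assuming the default rounding mode. fmt0_is is a hypothetical helper,
not part of any libc.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* returns nonzero if formatting x with "%.0f" yields the string expect */
static int fmt0_is(double x, const char *expect)
{
	char buf[32];
	snprintf(buf, sizeof buf, "%.0f", x);
	return strcmp(buf, expect) == 0;
}
```

exact halfway values go to the even neighbor, so 0.5 and 2.5 round
down while 1.5 and 3.5 round up.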
posix requires that EINVAL be returned if the first parameter specifies
the cpu-time clock of the calling thread (CLOCK_THREAD_CPUTIME_ID).
linux returns ENOTSUP instead so we handle this.
j is int32_t and thus j<<31 is undefined if j==1, so j is changed to
uint32_t locally as a quick fix, the generated code is not affected.
(this is a strict conformance fix, future c standard may allow 1<<31,
see DR 463. the bug was inherited from freebsd fdlibm, the proper fix
is to use uint32_t for all bit hacks, but that requires more intrusive
changes.)
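the idea of the quick fix can be sketched as follows. the function
name low_bit_to_sign is hypothetical and the surrounding fdlibm code
is not reproduced; only the conversion before the shift is shown.

```c
#include <assert.h>
#include <stdint.h>

/* move the low bit of j into the sign-bit position.
 * converting to uint32_t first is well-defined, and an unsigned shift
 * simply discards high bits instead of overflowing. */
static uint32_t low_bit_to_sign(int32_t j)
{
	uint32_t uj = (uint32_t)j;
	return uj << 31;
}
```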
reported by Daniel Sabogal
aarch64, arm, mips, mips64, mipsn32, powerpc, powerpc64 and sh have
cpu feature bits defined in linux for AT_HWCAP auxv entry, so expose
those in sys/auxv.h
it seems the mips hwcaps were never exposed to userspace by either
linux or glibc, but that's most likely an oversight.
overlayfs may have fairly long lines so we use getline to allocate a
buffer dynamically. The buffer will be allocated on first use, expand as
needed, but will never be freed.
Downstream bug: http://bugs.alpinelinux.org/issues/5703
Signed-off-by: Natanael Copa <ncopa@alpinelinux.org>
this patch fixes a large number of missed internal signed-overflow
checks and errors in determining when the return value (output length)
would exceed INT_MAX, which should result in EOVERFLOW. some of the
issues fixed were reported by Alexander Cherepanov; others were found
in subsequent review of the code.
aside from the signed overflows being undefined behavior, the
following specific bugs were found to exist in practice:
- overflows computing length of floating point formats with huge
explicit precisions, integer formats with prefix characters and huge
explicit precisions, or string arguments or format strings longer
than INT_MAX, resulted in wrong return value and wrong %n results.
- literal width and precision values outside the range of int were
misinterpreted, yielding wrong behavior in at least one well-defined
case: string formats with precision greater than INT_MAX were
sometimes truncated.
- in cases where EOVERFLOW is produced, incorrect values could be
written for %n specifiers past the point of exceeding INT_MAX.
in addition to fixing these bugs, we now stop producing output
immediately when output length would exceed INT_MAX, rather than
continuing and returning an error only at the end.
if the requested precision is close to INT_MAX, adding
LDBL_MANT_DIG/3+8 overflows. in practice the resulting undefined
behavior manifests as a large negative result, which is then used to
compute the new end pointer (z) with a wildly out-of-bounds value
(more overflow, more undefined behavior). the end result is at least
incorrect output and character count (return value); worse things do
not seem to happen, but detailed analysis has not been done.
this patch fixes the overflow by performing the intermediate
computation as unsigned; after division by 9, the final result
necessarily fits in int.
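a sketch of the unsigned intermediate computation, assuming p is a
non-negative precision. slots_for_precision is a hypothetical name and
the exact expression used in the printf core may differ.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* number of base-10^9 slots to reserve for precision p (p >= 0).
 * the sum is formed in unsigned arithmetic, where INT_MAX plus a small
 * constant cannot overflow; after dividing by 9 the value fits in int. */
static int slots_for_precision(int p)
{
	unsigned n = (unsigned)p + LDBL_MANT_DIG/3u + 8u;
	return (int)(n / 9u);
}
```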
we inherited from TRE regexec code that's utterly wrong with respect
to the integer types it's using. while it doesn't appear that
compilers are producing unsafe output, signed integer overflows seem
to happen, and regexec fails to find matches past offset INT_MAX.
this patch fixes the type of all variables/fields used to store
offsets in the string from int to regoff_t. after the changes, basic
testing showed that regexec can now find matches past 2GB (INT_MAX)
and past 4GB on x86_64, and code generation is unchanged on i386.
most of the possible overflows were already ruled out in practice by
regcomp having already succeeded performing larger allocations.
however at least the num_states*num_tags multiplication can clearly
overflow in practice. for safety, check them all, and use the proper
type, size_t, rather than int.
also improve comments, use calloc in place of malloc+memset, and
remove bogus casts.
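the kind of check described above might look like this sketch. the
names table_fits and alloc_tag_table are hypothetical, and the element
type int is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* nonzero if num_states*num_tags*sizeof(int) is representable in size_t */
static int table_fits(size_t num_states, size_t num_tags)
{
	return !num_tags || num_states <= SIZE_MAX / sizeof(int) / num_tags;
}

/* allocate a zeroed num_states x num_tags table, or return a null
 * pointer if the size computation would overflow */
static int *alloc_tag_table(size_t num_states, size_t num_tags)
{
	if (!table_fits(num_states, num_tags))
		return 0;
	return calloc(num_states * num_tags, sizeof(int));
}
```

calloc checks its own nmemb*size product, but the product of the two
counts has to be validated before it is passed in.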
this is a clone of the fix to the gethostby*_r functions in
commit fe82bb9b92. the man pages
document that the getservby*_r functions set this pointer to
NULL if there was an error or if no record was found.
since cpu sets can be dynamically allocated and have variable size,
accessing their contents via ->__bits is not valid; performing pointer
arithmetic outside the range of the size of the declared __bits array
results in undefined behavior. instead, only use cpu_set_t for
fixed-size cpu set objects (instantiated by the caller) and as an
abstract pointer type for dynamically allocated ones. perform all
accesses simply by casting the abstract pointer type cpu_set_t * back
to unsigned long *.
previously, fflush_unlocked was an alias for an internal backend that
was called by fflush, either for its argument or in a loop for each
file if a null pointer was passed. since the logic for the latter was
in the main fflush function, fflush_unlocked crashed when passed a
null pointer, rather than flushing all open files. since
fflush_unlocked is not a standard function and has no specification,
it's not clear whether it should be expected to accept null pointers
like fflush does, but a reasonable argument could be made that it
should.
this patch eliminates the helper function, simplifying fflush, and
makes fflush_unlocked an alias for fflush, which is valid because the
two functions agree in their behavior in all cases where their
behavior is defined (the unlocked version has undefined behavior if
another thread could hold locks).
commit b91cdbe2bc, in fixing another
issue, changed the logic for how alt-form octal adds the leading zero
to adjust the precision rather than using a prefix character. this
wrongly suppressed the zero flag by mimicking an explicit precision
given by the format string. switch back to using a prefix character.
based on bug report and patch by Dmitry V. Levin, but simplified.
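the interaction between the alt-form and zero flags can be checked
with a sketch like the following. alt_octal_is is a hypothetical
helper; the format string "%#05o" is just one example of the affected
combination.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* returns nonzero if printf("%#05o", v) yields the string expect.
 * the '#' flag forces a leading zero and the '0' flag must still pad
 * the field to its full width with zeros. */
static int alt_octal_is(unsigned v, const char *expect)
{
	char buf[32];
	snprintf(buf, sizeof buf, "%#05o", v);
	return strcmp(buf, expect) == 0;
}
```

with the zero flag honored, 8 formats as "00010" rather than being
space-padded.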
this reverts commit 2c1f8fd5da. without
the _Noreturn attribute, the compiler cannot use asserts to perform
reachability/range analysis. this leads to missed optimizations and
spurious warnings.
the original backtrace problem that prompted the removal of _Noreturn
was not clearly documented at the time, but it seems to happen only
when libc was built without -g, which also breaks many other
backtracing cases.
there was a copy paste error that could cause large ulp errors
in atan2l, atanl, asinl and acosl on aarch64, mips64 and mipsn32.
(the implementation is from freebsd fdlibm, but the tail end
of the polynomial was wrong. 128 bit long double functions
are not yet tested so this went undetected.)
linux containers use separate mount namespace so the /proc
symlink might not point to the right device if the fd was
opened in the parent namespace, in this case return ENOENT.
despite sh not generally using register-pair alignment for 64-bit
syscall arguments, there are arch-specific versions of the syscall
entry points for pread and pwrite which include a dummy argument for
alignment before the 64-bit offset argument.
this code was already under #if 0, but could be confusing if a reader
didn't notice that, and it's almost surely full of bugs and/or
inconsistencies with the current code that uses the gethostbyname2_r
backend.
modern compilers (for gcc, versions 4.8 and later) automatically
pre-include <stdc-predef.h> to obtain the values of certain predefined
macros specified by ISO C but which reflect properties of the library
implementation, not just the compiler. provide values indicating that
wchar_t is Unicode-encoded and that Annex F (IEEE floating point) is
supported unless the compiler indicates otherwise.
based on patch by Masanori Ogino.
these changes still do not yield a fully-conforming abort, but they
fix two known issues:
- per POSIX, termination via SIGKILL is not "abnormal", but both ISO C
and POSIX require abort to yield abnormal termination.
- raising SIGKILL fails to do anything to pid 1 in some containers.
now, the trapping instruction produced by a_crash() is expected to
produce abnormal termination, without the risk of invoking a signal
handler since SIGILL and SIGSEGV are blocked, and _Exit, which
contains an infinite loop analogous to the one being removed from
abort itself, is used as a last resort.
this implementation still fails to produce an exit status as if the
process terminated via SIGABRT in cases where SIGABRT is blocked or
ignored, but fixing that is not easy; the obvious pseudo-solutions all
have subtle race conditions where a concurrent fork or exec can expose
incorrect signal state.
it was changed to EM_OR1K in 200d15479c
as that was meant to be the official name, but glibc and the latest
gabi spec still use the EM_OPENRISC name:
http://www.sco.com/developers/gabi/latest/ch4.eheader.html
binutils defines both macros so we should do the same for backward
compatibility.
placing the opening brace on the same line as the struct keyword/tag
is the style I prefer and seems to be the prevailing practice in more
recent additions.
these changes were generated by the command:
find include/ arch/*/bits -name '*.h' \
-exec sed -i '/^struct [^;{]*$/{N;s/\n/ /;}' {} +
and subsequently checked by hand to ensure that the regex did not pick
up any false positives.
same changes as in the generic header.
and BOTHER and IBSHIFT were removed (present in linux uapi but not
in glibc) and TIOCSER_TEMT was added (present in glibc).
add EXTA, EXTB, CIBAUD, CMSPAR, XCASE macros and hide them as well as
CBAUD, ECHOCTL, ECHOPRT, ECHOKE, FLUSHO, PENDIN in standard mode.
the new macros are both in glibc termios.h and in linux asm/termbits.h,
the latter also contains IBSHIFT and BOTHER, those were not added.
these are not standard macros, but some of them are in the reserved
namespace so could be exposed, the ones which are not reserved are
CIBAUD, CMSPAR and XCASE (which was removed in issue 6), the rest
got hidden to be consistent with glibc.
mips and powerpc use their own asm/ioctls.h, not the asm-generic/ioctls.h
and they lack termiox macros that are available on other targets.
see kernel commit 1d65b4a088de407e99714fdc27862449db04fb5c
the (unused) speed fields were omitted when these ports were first
added (within this release cycle, so not present in any release yet)
in accordance with how glibc defines the structure on mips archs.
however their omission does not match existing musl practice/intent.
glibc provides its own, mostly-unified termios structure definition
and performs translation in userspace to match the kernel structure
for the arch, but has gratuitous differences on a few archs like mips,
presumably as a result of historical mistakes. some other libcs use
the kernel definitions directly. musl essentially does that, by
matching the kernel layout in the part of the structure the kernel
will read/write, but leaves additional space at the end for
extensibility. these are nominally the (nonstandard) speed fields and
(on most archs) extra c_cc elements, but since they are not used they
could be repurposed if there's ever a need.
commit 6d38c9cf80 provided an
arm-specific version of posix_fadvise to address the alternate
argument order the kernel expects on arm, but neglected to address
that powerpc (32-bit) has the same issue. instead of having arch
variant files in duplicate, simply put the alternate version in the
top-level file under the control of a macro defined in syscall_arch.h.
when commit 0b6eb2dfb2 added the
parentheses around __syscall to invoke the function directly, there
was no __syscall7 in the syscall macro infrastructure, so this hack
was needed. commit 9a3bbce447 fixed that
but failed to remove the hack.
the kernel ABI value for RUSAGE_CHILDREN is -1, not 1. the latter is
actually interpreted as RUSAGE_THREAD, to obtain values for just the
calling thread and not the whole process.
Linux's documentation (robust-futex-ABI.txt) claims that, when a
process dies with a futex on the robust list, bit 30 (0x40000000) is
set to indicate the status. however, what actually happens is that
bits 0-30 are replaced with the value 0x40000000, i.e. bits 0-29
(containing the old owner tid) are cleared at the same time bit 30 is
set.
our userspace-side code for robust mutexes was written based on that
documentation, assuming that kernel would never produce a futex value
of 0x40000000, since the low (owner) bits would always be non-zero.
commit d338b506e3 introduced this
assumption explicitly while fixing another bug in how non-recoverable
status for robust mutexes was tracked. presumably the tests conducted
at that time only checked non-process-shared robust mutexes, which are
handled in pthread_exit (which implemented the documented kernel
protocol, not the actual one) rather than by the kernel.
change pthread_exit robust list processing to match the kernel
behavior, clearing bits 0-29 while setting bit 30, and use the value
0x7fffffff instead of 0x40000000 to encode non-recoverable status. the
choice of value here is arbitrary; any value with at least one of bits
0-29 set should work just as well.
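the bit manipulation described above can be sketched as follows. the
macro names mirror those in linux/futex.h; owner_died_value and
has_owner_bits are hypothetical helpers, and preserving bit 31 (the
waiters flag) follows from the statement that only bits 0-30 are
replaced.

```c
#include <assert.h>
#include <stdint.h>

#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* futex value left behind when the owner dies: bits 0-29 (owner tid)
 * are cleared at the same time bit 30 is set; bit 31 is kept */
static uint32_t owner_died_value(uint32_t old)
{
	return (old & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
}

/* nonzero if any of the owner bits 0-29 are set */
static int has_owner_bits(uint32_t v)
{
	return (v & FUTEX_TID_MASK) != 0;
}
```

the owner-died value has no owner bits set, so a non-recoverable marker
such as 0x7fffffff, which does, cannot be confused with it.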
despite clarifications made to the COPYRIGHT file in commit
f0a6139933, there continues to be
confusion about whether the permissions granted actually apply to all
files. I am the sole author of these files and clearly intend, and
have always intended, for the grant of permission to apply to them.
compilers are free not to copy, or in some cases to clobber, padding
bytes in a structure. while it's an aliasing violation, and thus
undefined behavior, to copy or manipulate other sockaddr types using
sockaddr_storage, it seems likely that traditional code attempts to do
so, and the original intent of the sockaddr_storage structure was
probably to allow such usage.
in the interest of avoiding silent and potentially dangerous breakage,
ensure that there are no actual padding bytes in sockaddr_storage by
moving and adjusting the size of the __ss_padding member so that it
fits exactly.
this change also removes a silent assumption that the alignment of
long is equal to its size.
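a sketch of a layout with no hidden padding, under the assumption of a
128-byte structure and a 2-byte family field. the type and member names
here are stand-ins, not the actual header definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short sketch_family_t;   /* stand-in for sa_family_t */

struct sketch_storage {
	sketch_family_t family;
	/* sized so the next member starts exactly where this one ends */
	char padding[128 - sizeof(long) - sizeof(sketch_family_t)];
	unsigned long align;   /* gives the whole struct the alignment of long */
};

_Static_assert(sizeof(struct sketch_storage) == 128, "unexpected size");
_Static_assert(offsetof(struct sketch_storage, align) ==
	sizeof(sketch_family_t) + sizeof(((struct sketch_storage *)0)->padding),
	"hidden padding before align");
```

placing the long-typed member last at offset 128-sizeof(long) keeps it
aligned without relying on the alignment of long being equal to its
size.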
kernel connection multiplexor macros AF_KCM, PF_KCM, SOL_KCM were
added in linux commit ab7ac4eb9832e32a09f4e8042705484d2fb0aad3
MSG_BATCH sendmsg flag for performance optimization was added
in linux commit f092276d85b82504e8a07498f4e9e0c51f06745c
SOL_* macros are now synced with linux socket.h which is not a uapi
header and glibc did not have the macros either, but that has changed
http://sourceware.org/ml/libc-alpha/2016-05/msg00322.html
new fields and associated linux commit:
tcpi_notsent_bytes, tcpi_min_rtt cd9b266095f422267bddbec88f9098b48ea548fc
tcpi_data_segs_in, tcpi_data_segs_out a44d6eacdaf56f74fad699af7f4925a5f5ac0e7f
new socket option so application can give advice about routing
path quality of connected udp sockets, added in linux commit
a87cb3e48ee86d29868d3f59cfb9ce1a8fa63314
the syscalls take an additional flag argument, they were added in commit
f17d8b35452cab31a70d224964cd583fb2845449 and a RWF_HIPRI priority hint
flag was added to linux/fs.h in 97be7ebe53915af504fb491fb99f064c7cf3cb09.
the syscall is not allocated for microblaze and sh yet.
the difference of pointers is a signed type ptrdiff_t; if it is only
32-bit, left-shifting it by 30 bits produces undefined behavior. cast
the difference to an appropriate unsigned type, uint32_t, before
shifting to avoid this.
the a64l function is specified to return a signed 32-bit result in
type long. as noted in the bug report by Ed Schouten, converting
implicitly from uint32_t only produces the desired result when long is
a 32-bit type. since the computation has to be done in unsigned
arithmetic to avoid overflow, simply cast the result to int32_t.
further, POSIX leaves the behavior on invalid input unspecified but
not undefined, so we should not take the difference between the
potentially-null result of strchr and the base pointer without first
checking the result. the simplest behavior is just returning the
partial conversion already performed in this case, so do that.
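the three points above (unsigned shift, int32_t result, null check) can
be sketched together. a64l_sketch is a hypothetical name and this is an
illustration of the described approach, not necessarily the committed
code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const char digits[] =
	"./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

static long a64l_sketch(const char *s)
{
	uint32_t x = 0;
	for (int e = 0; e < 36 && *s; e += 6, s++) {
		const char *d = strchr(digits, *s);
		if (!d) break;                     /* invalid digit: keep partial result */
		x |= (uint32_t)(d - digits) << e;  /* shift in uint32_t, not ptrdiff_t */
	}
	return (int32_t)x;                     /* signed 32-bit value, widened to long */
}
```

the cast to int32_t makes a full 32-bit pattern such as "zzzzz1" come
out as -1 whether long is 32 or 64 bits wide.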
previously, the only way the stopping condition could be met with
correct lengths in the headers invoked undefined behavior, adding
sizeof(struct cmsghdr) beyond the end of the cmsg buffer.
instead, compute and compare sizes rather than pointers.
the num_submatches field of some ast nodes was not initialized in
tre_add_tag_{left,right}, but was accessed later.
this was a benign bug since the uninitialized values were never used
(these values are created during tre_add_tags and copied around during
tre_expand_ast where they are also used in computations, but nothing
in the final tnfa depends on them).
The --build flag is listed in two case statement entries in configure,
which causes the second entry to be ignored. This patch removes it
from the first entry.
Signed-off-by: Michael LeMay <michael.lemay@intel.com>
previously if you called getprotobyname("egp") you would get
NULL because \008 is invalid octal and so the protocol id was
interpreted as 0 and name as "8egp".
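the escape-sequence pitfall can be shown in isolation. the array names
below are hypothetical, and "\010" is just one valid way to spell the
value 8; the table layout of one id byte followed by the name is taken
from the description above.

```c
#include <assert.h>

/* an octal escape consumes at most three octal digits, and '8' is not
 * one, so "\008" is the byte 0 followed by the character '8' */
static const char broken[] = "\008egp";

/* "\010" is a single byte with value 8 */
static const char fixed[] = "\010egp";
```

in the broken form the id byte is 0 and the name reads as "8egp"; in
the fixed form the id byte is 8 and the name is "egp".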
the standard configure interface, which our configure script tries to
implement, identifies cross compiling (build != host) and searches for
the properly-prefixed cross tools. our script was not doing that,
forcing users to explicitly provide either CC or a CROSS_COMPILE tool
prefix, and the more common choice, just providing CC, was incomplete
because the Makefile would still invoke the native ar and ranlib
programs. this happened to work when building on ELF-based systems
with GNU binutils, but could easily fail when cross-compiling from
dissimilar systems.
like before, and like the standard configure behavior, an explicit CC
or CROSS_COMPILE variable on the command line or in the environment
overrides the automatic prefixing.
these changes are the outcome of a long mailing list thread that took
place in March 2016, "musl licensing". among other minor issues,
prospective users were not confident that the whole-project MIT
license would grant permission for files to which the COPYRIGHT file
expressed a belief that copyright does not apply, if it turned out that
these files were actually subject to copyright.
in accordance with the original intent of applying a permissive
license to the project, which was that license issues not be an
obstacle to use, the text which was causing confusion is removed. no
new claims of copyright are made, but new text is added to clarify
that the grant of permissions applies to all files, and an explicit
grant of permission to use public headers and crt files without
attribution has been made.
this patch was reviewed and approved by all substantial contributors
to the affected files: Bobby Bingham, John Spencer (rofl0r), Nicholas
J. Kain, Rich Felker, Richard Pennington, Stefan Kristiansson, and
Szabolcs Nagy.
commit 7e816a6487 (version 1.1.11
release cycle) moved the code that performs wchar_t to multibyte
conversion across code that used the resulting length in bytes,
thereby breaking the unget buffer space check in ungetwc and
clobbering up to three bytes below the start of the buffer.
for allocated FILEs (all read-enabled FILEs except stdin), the
underflow clobbers at most the FILE-specific locale pointer. no stores
are performed through this pointer, but subsequent loads may result in
a crash or mismatching encoding rule (UTF-8 multibyte vs byte-based).
for stdin, the buffer lies in .bss and the underflow may clobber
another object. in practice, for libc.so the adjacent object seems to
be stderr's buffer, which is completely unused, but this could vary
with linking options, or when static linking.
applications which do not attempt to use more than one character of
ungetwc pushback, or which do not use ungetwc, are not affected.
per the powerpc psabi, offset 4 of the stack at call time belongs to
the callee and is used for spilling lr (return address). in addition,
offset 0 on the stack must contain a pointer to the previous stack
frame, or a null pointer for the initial stack frame of a thread.
__clone failed to set up any stack frame on the new thread's stack,
thereby allowing the start function it called to clobber offset 4 of
the new thread's struct __pthread, which contains the dtv pointer.
add code to set up a proper stack frame and align the stack pointer to
a multiple of 16 (also an abi requirement) if it was not already
aligned.
mips32r6 and mips64r6 are actually new isas at both the asm source and
opcode levels (pre-r6 code cannot run on r6) and thus need to be
treated as a new subarch. the following changes are made, some of
which yield code generation improvements for non-r6 targets too:
- add subarch logic in configure script and reloc.h files for dynamic
linker name.
- suppress use of .set mips2 asm directives (used to allow mips2
atomic instructions on baseline mips1 builds; the kernel has to
emulate them on mips1) except when actually needed. they cause wrong
instruction encodings on r6, and pessimize inlining on at least some
compilers.
- only hard-code sync instruction encoding on mips1.
- use "ZC" constraint instead of "m" constraint for llsc memory
operands on r6, where the ll/sc instructions no longer accept full
16-bit offsets.
- only hard-code rdhwr instruction encoding with .word on targets
(pre-r2) where it may need trap-and-emulate by the kernel.
otherwise, just use the instruction mnemonic, and allow an arbitrary
destination register to be used.
the two/three/four byte memmem specializations are not prepared to
handle haystacks shorter than the needle; they unconditionally read at
least up to the needle length and subtract from the haystack length.
if the haystack is shorter, the remaining haystack length underflows
and produces an unbounded search which will eventually either crash or
find a spurious match.
the top-level memmem function attempted to avoid this case already by
checking for haystack shorter than needle, but it failed to re-check
after using memchr to remove the maximal prefix not containing the
first byte of the needle.
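the missing re-check can be sketched as below. memmem_sketch is a
hypothetical name, and a plain memcmp loop stands in for the
two/three/four byte specializations and the general search, which are
not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static void *memmem_sketch(const void *h0, size_t k, const void *n0, size_t l)
{
	const unsigned char *h = h0, *n = n0;

	if (!l) return (void *)h;      /* empty needle matches immediately */
	if (k < l) return 0;           /* haystack shorter than needle */

	/* skip the maximal prefix not containing the needle's first byte */
	h = memchr(h0, *n, k);
	if (!h || l == 1) return (void *)h;
	k -= h - (const unsigned char *)h0;

	/* the re-check: the remaining haystack may now be too short */
	if (k < l) return 0;

	for (; k >= l; h++, k--)
		if (!memcmp(h, n, l)) return (void *)h;
	return 0;
}
```

searching "xxab" for "abc" exercises the second length check: after
memchr only two bytes remain for a three-byte needle.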
commits e24984efd5 and
16b55298dc inadvertently disabled the
a_spin implementations for i386, x86_64, and x32 by defining a macro
named a_pause instead of a_spin. this should not have caused any
functional regression, but it inhibited cpu relaxation while spinning
for locks.
bug reported by George Kulakowski.
the comparison f->wpos > f->buf has undefined behavior when f->wpos is
a null pointer, despite the intuition (and actual compiler behavior,
for all known compilers) being that NULL > ptr is false for all valid
pointers ptr.
the purpose of the comparison is to determine if the write buffer is
non-empty, and the idiom used elsewhere for that is comparison against
f->wbase, which is either a null pointer when not writing, or equal to
f->buf when writing. in the former case, both f->wpos and f->wbase are
null; in the latter they are both non-null and point into the same
array.
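a sketch of the idiom, using a hypothetical reduced structure with only
the three pointers discussed above. an equality comparison is used
here since it is defined even when both pointers are null.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_file {
	unsigned char *buf;     /* start of the buffer */
	unsigned char *wbase;   /* null when not writing, else equal to buf */
	unsigned char *wpos;    /* null when not writing, else next write position */
};

/* nonzero if the write buffer is non-empty */
static int write_pending(const struct sketch_file *f)
{
	return f->wpos != f->wbase;
}

/* stream not in write mode: both pointers are null */
static int demo_not_writing(void)
{
	struct sketch_file f = { 0 };
	return write_pending(&f);
}

/* stream in write mode with one buffered byte */
static int demo_after_write(void)
{
	unsigned char storage[16];
	struct sketch_file f = { storage, storage, storage };
	*f.wpos++ = 'x';
	return write_pending(&f);
}
```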
allows the os to free the marked pages lazily on memory pressure.
expected to increase malloc performance.
new in linux commit 854e9ed09dedf0c19ac8640e91bcc74bc3f9e5c9
new flag for exclusive wakeup mode when an event source fd is attached
to multiple epoll fds but they should not all receive the events.
new in linux commit df0108c5da561c66c333bb46bfe3c1fc65905898
new socket options for setting classic or extended BPF program
for sockets in a SO_REUSEPORT group. added in linux commit
538950a1b7527a0a52ccd9337e3fcd304f027f13
it was introduced for offloading copying between regular files
in linux commit 29732938a6289a15e907da234d6692a2ead71855
(microblaze and sh do not yet have the syscall number.)
currently five targets use the same mman.h constants and the rest
share most constants too, so move them to sys/mman.h before the
bits/mman.h include where the differences can be corrected by
redefinition of the macros.
this fixes two minor bugs: POSIX_MADV_DONTNEED was wrong on most
targets (it should be the same as MADV_DONTNEED), and sh defined
the x86-only MAP_32BIT mmap flag.
the idiom fprintf(f, "%.*s", n, "") was wrongly used in vfwprintf as a
means of producing n spaces; instead it produces no output. the
correct form is fprintf(f, "%*s", n, ""), using width instead of
precision, since for %s the latter is a maximum rather than a minimum.
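the difference between the two forms is easy to observe through the
return value of snprintf. the helper names below are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* width: pads the empty string out to at least n characters */
static int pad_len_width(int n)
{
	char buf[64];
	return snprintf(buf, sizeof buf, "%*s", n, "");
}

/* precision: writes at most n characters of the empty string, i.e. none */
static int pad_len_precision(int n)
{
	char buf[64];
	return snprintf(buf, sizeof buf, "%.*s", n, "");
}
```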
these changes should not affect generated code, but they reflect that
the underlying objects operated on by a_cas_p are supposed to have
type volatile void *, not volatile long. in theory a compiler could
treat the effective type mismatch in the "m" memory operands as
undefined behavior.
apparently clang does not accept matching-register input and output
constraints that differ in size (32-bit vs 64-bit).
based on patch by Jaydeep Patil.
the SPE ABI may be compatible with soft-float, but actually making it
work requires some additional work, so for now it's best to make sure
broken builds don't happen.
Some PowerPC CPUs (e.g. Freescale MPC85xx) have a completely different
instruction set for floating point operations (SPE).
Executing regular PowerPC floating point instructions results in
"Illegal instruction" errors.
Make it possible to run these devices in soft-float mode.
This is the minimal fix for __putenv leaving a pointer to freed heap
storage in __env_map array, which could later on lead to errors such
as double-free.
this change is made in preparation for adding the mips64 port, which
needs a 64-bit (and mips64-specific) form of the R_INFO macro, but
it's a better abstraction anyway.
based on part of the mips64 port patch by Mahesh Bodapati and Jaydeep
Patil of Imagination Technologies.
This is a workaround to treat * as literal * at the start of a BRE.
Ideally ^ would be treated as an anchor at the start of any BRE
subexpression and similarly $ would be an anchor at the end of any
subexpression. This is not required by the standard and hard to do
with the current code, but it's the existing practice. If it is
changed, * should be treated as literal after such anchor as well.
commit 7eaa76fc2e made * invalid at
the start of a BRE subexpression, but it should be accepted as
literal * there according to the standard.
This patch does not fix subexpressions starting with ^*.
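A small sketch for checking how an implementation treats a leading *
in a BRE. The helper bre_matches is hypothetical; it uses only the
standard regcomp, regexec and regfree interfaces.

```c
#include <assert.h>
#include <regex.h>

/* returns 1 on match, 0 on no match, -1 if the pattern is rejected */
static int bre_matches(const char *pattern, const char *text)
{
	regex_t re;
	if (regcomp(&re, pattern, REG_NOSUB))   /* no REG_EXTENDED: compile as BRE */
		return -1;
	int r = regexec(&re, text, 0, 0, 0);
	regfree(&re);
	return r == 0;
}
```

When the leading * is taken literally, the pattern "*a" matches the
text "*a" but not "a". The same helper can be pointed at "\\(*a\\)" to
probe the subexpression case this commit addresses.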
name_from_hosts failed to account for the possibility of an address
family error from name_from_numeric, wrongly counting such a return as
success and using the uninitialized address data as part of the
results passed up to the caller.
non-matching address family entries cannot simply be ignored or
results would be inconsistent with respect to whether AF_UNSPEC or a
specific address family is queried. instead, record that a
non-matching entry was seen, and fail the lookup with EAI_NONAME if no
matching-family entries are found.
at present this is done only for consistency, since this file defines
its own a_cas_p rather than using the new generic one from atomic.h
added in commit 225f6a6b5b. these
definitions may however be useful if we ever need to add other
pointer-sized atomic operations.
this follows the principle of having the source tree layout define
build semantics. it also makes it possible for crt/$(ARCH) to define
additional installable files, which may be needed for midipix and
other future targets.
the nt32 and nt64 archs will be provided by the midipix project for
building musl on top of its posix-like syscall layer for windows. at
present the needed arch files are in a separate repository, but having
the tuple matching in the upstream configure script should make it
possible to overlay the arch files without needing any further
patching.
commit e4355bd6be moved the math asm
from external source files to inline asm, but unfortunately, all
current releases of clang use the wrong inline asm constraint codes
for float and double ("w" and "P" instead of "t" and "w",
respectively). this patch adds detection for the bug in configure,
and, for now, just disables the affected asm on broken clang versions.
in order to take advantage of the fpu in -mfloat-abi=softfp mode, the
__VFP_FP__ macro (presence of vfp fpu) was checked instead of checking for
__ARM_PCS_VFP (hardfloat EABI variant). however, the latter macro is
the one that's actually specified by the ABI documents rather than
being compiler-specific, and should also be checked in case __VFP_FP__
is not defined on some compilers or some configurations.
these additions were made based on scanning commit authors since the
last update, at the time of the 1.1.7 release, and adding everyone
with either substantial code contributions or a pattern of ongoing
simple patch submission.
the dynamic linker was found to hang when used as the PT_INTERP, but
not when invoked as a command. the mechanism of this failure was not
determined, but the cause is clear:
commit 5552ce5200 removed the SHARED
macro, but arch/sh/crt_arch.h is still using it to choose the right
form of the crt/ldso entry point code. moving the forced definition
from rcrt1.c to dlstart.c restores the old behavior. eventually the
logic should be changed to fully remove the SHARED macro or at least
rename it to something more reasonable.
commit 80fbaac4cd broke all soft-float
archs, where gcc defines __GCC_IEC_559==0 because rounding modes and
exception flags are not supported. for now, just check for
__FAST_MATH__ as an indication of broken float. this won't detect all
possible misconfigurations but it probably catches the most common
one.
commit 2f853dd6b9 moved the error
handling for $(ARCH) not being set such that it applied to all
targets, including clean and distclean. previously these targets
worked even in an unconfigured tree. to restore the old behavior, make
most of the makefile body conditional on $(ARCH) being set/non-empty
and produce the error via a fake "all" target in the conditional
branch for the case where $(ARCH) is empty.
prior to commit 2f853dd6b9 which
overhauled the makefile for out-of-tree builds, crt/*.c files were
replaceable by crt/$(ARCH)/*.s, and top-level ldso/ did not exist (its
files were under src/ldso). since then, crti.o and crtn.o have been
hard-coded as arch-specific, but none of the other files in crt/ or
ldso/ were replaceable at all.
in preparation for easy integration with midipix, which has a port of
musl to windows, it needs to be possible to override the ELF-specific
code in these files. making the same arch-replacements system work
throughout the whole source tree also improves consistency and removes
the need for some file-specific rules (crti.o and crtn.o) in the
makefile.
the reference implementation clamps rounds to [1000,999999999]. we
further limited rounds to at most 9999999 as a defense against extreme
run times, but wrongly clamped instead of treating out-of-bounds
values as an error, thereby producing implementation-specific hash
results. fixing this should not break anything since values of rounds
this high are not useful anyway.
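a sketch of the distinction between clamping and rejecting. the names
are hypothetical, and keeping the reference implementation's clamp at
the lower bound while rejecting only values above the local limit is an
assumption about how the two limits interact.

```c
#include <assert.h>

#define SKETCH_ROUNDS_MIN 1000ul
#define SKETCH_ROUNDS_MAX 9999999ul   /* local limit, below the reference maximum */

/* returns the rounds count to use, or -1 if the request is rejected */
static long effective_rounds(unsigned long requested)
{
	if (requested < SKETCH_ROUNDS_MIN)
		return (long)SKETCH_ROUNDS_MIN;   /* same result as the reference clamp */
	if (requested > SKETCH_ROUNDS_MAX)
		return -1;                        /* error instead of a nonstandard hash */
	return (long)requested;
}
```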
like fputs (see commit 10a17dfbad), the
message printing code for getopt assumed that fwrite only returns 0 on
failure, but it can also happen on success if the total length to be
written is zero. programs with zero-length argv[0] were affected.
commit 500c6886c6 introduced this
problem in getopt by fixing the fwrite behavior to conform to the
requirements of ISO C. previously the wrong expectations of the getopt
code were met by the fwrite implementation.
internally, the idiom of passing nmemb=1 to fwrite and interpreting
the return value of fwrite (which is necessarily 0 or 1) as
failure/success is fairly widely used. this is not correct, however,
when the size argument is unknown and may be zero, since C requires
fwrite to return 0 in that special case. previously fwrite always
returned nmemb on success, but this was changed for conformance with
ISO C by commit 500c6886c6.
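one way to avoid the pitfall is to pass the length as nmemb with size 1
and compare the result against the length. put_bytes is a hypothetical
helper illustrating that approach; it is not claimed to be the form
used in the tree.

```c
#include <assert.h>
#include <stdio.h>

/* write l bytes from s; return 0 on success, -1 on failure.
 * with size 1 and nmemb l, a successful zero-length write returns 0,
 * which equals l and is therefore not mistaken for an error. */
static int put_bytes(const char *s, size_t l, FILE *f)
{
	return fwrite(s, 1, l, f) == l ? 0 : -1;
}
```

by contrast, fwrite(s, 0, 1, f) returns 0 even though nothing failed,
which is what breaks the nmemb=1 idiom when the size may be zero.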
some software simply uses static_assert if the macro is defined, and
this breaks if the compiler does not recognize the _Static_assert
keyword used to define it.
commit 378f8cb522 added these functions
(as stubs) but left them without declarations. this broke some
autoconf based software that detected linkability of the symbols but
didn't check for a declaration.
when the size argument was zero but nmemb was nonzero, these functions
were returning nmemb, despite no data having been written.
conceptually this is not wrong, but the standard requires a return
value of zero in this case.
as specified, the int argument providing the character to write is
converted to type unsigned char. for the actual write to buffer,
conversion happened implicitly via the assignment operator; however,
the logic to check whether the argument was a newline used the
original int value. thus usage such as putchar('\n'+0x100) failed to
produce a flush.
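a small sketch of the corrected comparison. is_newline_byte is a
hypothetical name used only for illustration:

```c
/* nonzero if c, after the conversion to unsigned char that the stdio
 * output functions apply to their int argument, is a newline. */
static int is_newline_byte(int c)
{
	return (unsigned char)c == '\n';
}
```

comparing the original int value instead would miss arguments such as
'\n'+0x100, whose low byte is a newline but whose int value is not.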
when a write error occurred while flushing output due to a newline,
fwrite falsely reported all bytes up to and including the newline as
successfully written. in general, due to buffering such "spurious
success" returns are acceptable for stdio; however for line-buffered
mode it was subtly wrong. errors were still visible via ferror() or as
a short-write return if there was more data past the newline that
should have been written, but since the contract for line-buffered
mode is that everything up through the newline be written out
immediately, a discrepancy was observable in the actual file contents.
the workaround was for a bug that botched .gpword references to local
labels, applying a nonsensical random offset of -0x4000 to them.
this reverses commit 5e396fb996 and
removes a similar hack that was added to syscall_cp.s in the later
commit 756c8af858. it turns out one
additional instance of the same idiom, the GETFUNCSYM macro in
arch/mips/reloc.h, was still affected by the assembler bug and does
not admit an easy workaround without making assumptions about how the
macro is used. the previous workarounds made static linking work but
left the early-stage dynamic linker broken and thus had limited
usefulness.
instead, affected users (using binutils versions older than 2.20) will
need to fix the bug on the binutils side; the trivial patch is commit
453f5985b13e35161984bf1bf657bbab11515aa4 in the binutils-gdb
repository.
the old __cp_cancel code path loaded the address of __cancel from the
GOT using the $gp register, which happened to be set to point to the
correct GOT by the calling C function, but there is no ABI requirement
that this happen. instead, go the roundabout way and compute the
address of __cancel via pc-relative and gp-relative addressing
starting with a fake return address generated by a bal instruction,
which is the same trick crt1 uses to bootstrap.
add aarch64 and or1k archs, upgrade sh from experimental, and note
that sh now supports the FDPIC ABI.
the old advice on compiler versions was outdated and more specific
than made sense. presence of compiler bugs varies a lot by arch, so
it's hard to make any good recommendations beyond "recent". if we want
to document specific known-good/bad compiler versions, a much larger
section in the documentation than what's appropriate for the INSTALL
file would be needed.
the linux man page specifies malloc_usable_size(0) to return 0 and
these are the semantics other implementations (e.g. jemalloc) follow.
reported by Alexander Monakov.
the 10k element stack limit is increased to 1000k, otherwise tnfa
creation fails for reasonably sized patterns: a single literal char can
add 7 elements to this stack, so regcomp of a 1500 char long pattern
(with only literal chars) fails with REG_ESPACE. (the new limit allows
patterns of somewhat under 150k chars; this arbitrary limit allows most
command line regex usage.)
ideally there would be no upper bound: regcomp dynamically reallocates
this buffer, every reallocation checks for allocation failure and at
the end this stack is freed so there is no reason for special bound.
however that may have unwanted effect on regcomp and regexec runtime
so this is a conservative change.
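a rough way to exercise the case described above: at up to 7 stack
elements per literal, 1500 literals can need 10500 elements, which is
over the old 10k limit. compile_literal_pattern is a hypothetical test
helper, and the use of REG_EXTENDED|REG_NOSUB is an arbitrary choice
for the sketch:

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* compiles a pattern consisting of n literal 'a' characters and
 * returns regcomp's result: 0 on success, REG_ESPACE if the pattern
 * buffer or the compiler's internal storage cannot be obtained. */
static int compile_literal_pattern(size_t n)
{
	char *pat = malloc(n + 1);
	regex_t re;
	int r;

	if (!pat) return REG_ESPACE;
	memset(pat, 'a', n);
	pat[n] = 0;
	r = regcomp(&re, pat, REG_EXTENDED | REG_NOSUB);
	if (!r) regfree(&re);   /* only a successfully compiled regex is freed */
	free(pat);
	return r;
}
```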
"Q" input constraint was used for the written object, instead of "=Q"
output constraint. this should not cause problems because "memory"
is on the clobber list, but "=Q" better documents the intent and is
more consistent with the actual asm code.
this changes the generated code, because different registers are used,
but other than the register names nothing should change.
previous work overhauling the dynamic linker made it so that linking
libc with -Bsymbolic-functions was no longer mandatory, but the
configure logic that forced --disable-shared when ld failed to accept
the option was left in place.
this commit removes the hard-coded -Bsymbolic-functions from the
Makefile and changes the configure test to one that simply adds it to
the auto-detected LDFLAGS on success.
The standard does not define semantics for \| in BRE, but some code
depends on it meaning alternation. Empty alternative expression is
allowed to be consistent with ERE.
Based on a patch by Rob Landley.
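A small sketch of how a caller might check the extension. bre_match is
a hypothetical helper; since the standard leaves \| undefined in BRE,
the results shown in the usage note apply only to implementations that
treat it as alternation:

```c
#include <regex.h>

/* Returns 1 if the basic regular expression pat matches s, 0 if it
 * does not, and -1 if pat fails to compile. */
static int bre_match(const char *pat, const char *s)
{
	regex_t re;
	int r;

	if (regcomp(&re, pat, REG_NOSUB)) return -1;   /* no REG_EXTENDED: BRE */
	r = regexec(&re, s, 0, 0, 0);
	regfree(&re);
	return r == 0;
}
```

With \| as alternation, bre_match("x\\(a\\|b\\)y", "xby") is a match
and bre_match("x\\(a\\|b\\)y", "xcy") is not.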
Previously repetitions were accepted after empty expressions like
in (*|?)|{2}, but in BRE the handling of * and \{\} were not
consistent: they were accepted as literals in some cases and
repetitions in others.
It is better to treat repetitions after an empty expression as an
error (this is allowed by the standard, and glibc mostly does the
same). This is hard to do consistently with the current logic so
the new rule is:
Reject repetitions after empty expressions, except after assertions
^*, $? and empty groups ()+ and never treat them as literals.
Empty alternation (|a) is undefined by the standard, but it can be
useful so that should be accepted.
this file's .data section was not aligned, and just happened to get
the correct alignment with past builds. it's likely that the move of
atomic.s from arch/arm/src to src/thread/arm caused the change in
alignment, which broke the atomic and thread-pointer access fragments
on actual armv5 hardware.
commit d56460c939 introduced this bug by
setting up the tls module chain incorrectly when the main app has tls.
the singly-linked list head pointer was set up correctly, but the tail
pointer was not, so the first attempt to append to the list (for a
shared library with tls) would treat the list as empty and effectively
remove the main app from the list. this left all tls module id
numbers off-by-one.
this bug did not appear in any released versions.
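the failure mode can be reproduced with a reduced list. struct chain
and chain_append are hypothetical names for this sketch; the real
structures carry more fields:

```c
#include <stddef.h>

struct tls_module { struct tls_module *next; };   /* reduced for the sketch */
struct chain { struct tls_module *head, *tail; };

/* appends m; a null tail is taken to mean the list is empty. */
static void chain_append(struct chain *c, struct tls_module *m)
{
	m->next = NULL;
	if (!c->tail) c->head = m;    /* empty list: m becomes the head */
	else c->tail->next = m;
	c->tail = m;
}
```

if the main app's module is stored in head while tail is left null,
the first chain_append replaces head and the main app drops out of the
list, which is the off-by-one described above. setting both pointers
(or adding the main app through the same append path) avoids it.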
search is only performed if the search or domain keyword is used in
resolv.conf and the queried name has fewer than ndots dots. there is
no default domain and names with >=ndots dots are never subjected to
search; failure in the root scope is final.
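the decision rule can be sketched as below. count_dots and wants_search
are hypothetical helpers, and have_search stands for "a search or
domain keyword was seen in resolv.conf":

```c
#include <stddef.h>

static size_t count_dots(const char *name)
{
	size_t n = 0;
	for (; *name; name++) if (*name == '.') n++;
	return n;
}

/* nonzero if the name should be tried against the search domains. */
static int wants_search(const char *name, size_t ndots, int have_search)
{
	return have_search && count_dots(name) < ndots;
}
```

with ndots set to 1, "host" is searched while "host.example" goes
straight to the root scope, where failure is final.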
the (non-POSIX) res_search API presently does not honor search. this
may be added at some point in the future if needed.
resolv.conf is now parsed twice, at two different layers of the code
involved. this will be fixed in a subsequent patch.
rcode of 3 (NxDomain) was treated as a hard EAI_NONAME failure, but it
should instead return 0 (no results) so the caller can continue
searching. this will be important for adding search domain support.
the top-level caller will automatically return EAI_NONAME if there are
zero results at the end.
also, the case where rcode is 0 (success) but there are no results was
not handled. this happens when the domain exists but there are no A or
AAAA records for it. in this case a hard EAI_NONAME should be imposed
to inhibit further search, since the name was defined and just does
not have any address associated with it. previously a misleading hard
failure of EAI_FAIL was reported.
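a sketch of the resulting classification. dns_result is a hypothetical
name; mapping every other rcode to EAI_FAIL is an assumption made for
brevity, and mixing result counts with error codes in one return value
assumes EAI_* values are negative, as they are on linux:

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>

/* returns the number of results, 0 to let the caller keep searching,
 * or a (negative) EAI_* code for a hard failure. */
static int dns_result(int rcode, int nresults)
{
	if (rcode == 3) return 0;          /* NxDomain: no results, not fatal */
	if (rcode == 0)                    /* name exists */
		return nresults ? nresults : EAI_NONAME;   /* no A/AAAA: stop here */
	return EAI_FAIL;                   /* assumption: all other rcodes */
}
```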
this change is made in preparation for adding search domains, for
which higher-level code will need to parse resolv.conf. simply parsing
it twice for each lookup would be one reasonable option, but the
existing parser code was buggy anyway, which suggested to me that it's
a bad idea to have two variants of this code in two different places.
the old code in res_msend potentially misinterpreted overly long lines
in resolv.conf, and stopped parsing after it found 3 nameservers, even
if there were relevant options left to be parsed later in the file.
all bits headers that were identical for a number of 'clean' archs are
moved to the new arch/generic tree. in addition, a few headers that
differed only cosmetically from the new generic version are removed.
additional deduplication may be possible in mman.h and in several
headers (limits.h, posix.h, stdint.h) that mostly depend on whether
the arch is 32- or 64-bit, but they are left alone for now because
greater gains are likely possible with more invasive changes to header
logic, which is beyond the scope of this commit.
this sets the stage for the first phase of the bits deduplication.
bits headers which are identical for "most" archs will be moved to
arch/generic/bits.
vdso support is available on mips starting with kernel 4.4, see kernel
commit a7f4df4e21 "MIPS: VDSO: Add implementations of gettimeofday()
and clock_gettime()" for details.
In Linux kernel 4.4.0 the mips code returns -ENOSYS in case it can not
handle the vdso call and assumes the libc will call the original
syscall in this case. Handle this case in musl. Currently Linux kernel
4.4.0 handles the following types: CLOCK_REALTIME_COARSE,
CLOCK_MONOTONIC_COARSE, CLOCK_REALTIME and CLOCK_MONOTONIC.
these changes are motivated by a functionally similar patch by Hauke
Mehrtens to address the needs of the new mips vdso clock_gettime,
which wrongly fails with ENOSYS rather than falling back to making a
syscall for clock ids it cannot handle from userspace. in the process
of preparing to handle that case, it was noticed that the old
clock_gettime use of the vdso was actually wrong with respect to error
handling -- the tail call to the vdso function failed to set errno and
instead returned an error code.
since tail calls to vdso are no longer possible and since the plain
syscall code is now needed as a fallback path anyway, it does not make
sense to use a function pointer to call the plain syscall code path.
instead, it's inlined at the end of the main clock_gettime function.
the new code also avoids the need to test for initialization of the
vdso function pointer by statically initializing it to a self-init
function, and eliminates redundant loads from the volatile pointer
object.
finally, the use of a_cas_p on an object of type other than void *,
which is not permitted aliasing, is replaced by using an object with
the correct type and casting the value.
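the overall control flow can be sketched as follows. this is a
single-threaded illustration only: fake_vdso, cgt_init and
sketch_clock_gettime are hypothetical, the pointer is updated without
the atomics the real code needs, the libc clock_gettime stands in for
the plain syscall path, and the exact set of vdso errors that trigger
the fallback is an assumption:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

typedef int (*cgt_fn)(clockid_t, struct timespec *);

/* stand-in for a vdso entry point: handles only CLOCK_REALTIME and
 * returns a negated error code (without touching errno) otherwise. */
static int fake_vdso(clockid_t clk, struct timespec *ts)
{
	if (clk != CLOCK_REALTIME) return -ENOSYS;
	ts->tv_sec = 42;
	ts->tv_nsec = 0;
	return 0;
}

static int cgt_init(clockid_t clk, struct timespec *ts);

/* statically initialized to the self-init function, so callers never
 * need a separate "is it initialized yet" test. */
static cgt_fn vdso_func = cgt_init;

static int cgt_init(clockid_t clk, struct timespec *ts)
{
	vdso_func = fake_vdso;   /* real code would look up the vdso symbol */
	return fake_vdso(clk, ts);
}

int sketch_clock_gettime(clockid_t clk, struct timespec *ts)
{
	cgt_fn f = vdso_func;    /* load the pointer once */
	if (f) {
		int r = f(clk, ts);
		if (!r) return 0;
		if (r != -ENOSYS) { errno = -r; return -1; }
		/* -ENOSYS: the vdso cannot handle this clock id */
	}
	return clock_gettime(clk, ts);   /* fallback path, inlined at the end */
}
```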
si_errno and si_code are swapped in mips siginfo_t compared to other
archs and some si_code values are different. This fix is required
for POSIX timers to work.
based on patch by Dmitry Ivanov.
only have code above the bits/signal.h include that is necessary.
(some types are used for the ucontext struct and mips has to
override a few macro definitions)
this way mips bits/signal.h will be able to affect siginfo_t.
they lock faulted pages into memory (useful when a small part of a
large mapped file needs efficient access), new in linux v4.4, commit
b0f205c2a3082dd9081f9a94e50658c5fa906ff1
MLOCK_* is not in the POSIX reserved namespace for sys/mman.h
this is mlock with a flags argument, new in linux commit
a8ca5d0ecbdde5cc3d7accacbd69968b0c98764e
as usual microblaze and sh don't have allocated syscall number yet.
allows a ptracer process to disable/enable seccomp filters of the
traced process, useful for checkpoint/restore, new in v4.3 commit
13c4a90119d28cfcb6b5bdd820c233b86c2b0237
new in linux v4.3 added for aarch64, arm, i386, mips, or1k, powerpc,
x32 and x86_64.
membarrier is a system wide memory barrier, moves most of the
synchronization cost to one side, new in kernel commit
5b25b13ab08f616efd566347d809b4ece54570d1
userfaultfd is useful for qemu and is new in kernel commit
8d2afd96c20316d112e04d935d9e09150e988397
switch_endian is powerpc only for switching endianness, new in commit
529d235a0e190ded1d21ccc80a73e625ebcad09b
new in linux v4.3 commit 9dea5dc921b5f4045a18c63eb92e84dc274d17eb
direct calls instead of socketcall allow better seccomp filtering.
musl continues to use socketcalls internally on i386. (older kernels
would need a fallback mechanism if the direct calls were used.)
only use SYS_socketcall if SYSCALL_USE_SOCKETCALL is defined
internally, otherwise use direct syscalls.
this commit does not change the current behaviour, it is
preparation for adding direct syscall numbers for i386.
these were not covered by the parent-level rules with the new build
system. in the old build system, the equivalent files were often in
arch/$(ARCH)/src and likewise lacked the suppression. this could lead
to early crashing (before thread pointer init) when libc itself was
built with stack protector enabled.
now that .lo and .o files differ only by whether -fPIC is passed (and
no longer at the source level based on the SHARED macro), it's
possible to use the same object files for both static and shared libc
when the compiler would produce PIC for the static files anyway. this
happens if the user has included -fPIC in their CFLAGS or if the
compiler has been configured to produce PIE by default.
we use the .lo files for both, and still append -fPIC to the CFLAGS,
rather than using the .o files so that libc.so does not break
catastrophically if the user later removes -fPIC from CFLAGS in
config.mak or on the make command line. this also ensures that we get
full -fPIC in case -fpic, -fPIE, or some other lesser-PIC option was
passed in CFLAGS.
this eliminates the last need for the SHARED macro to control how
files in the src tree are compiled. the same code is used for both
libc.a and libc.so, with additional code for the dynamic linker (from
the new ldso tree) being added to libc.so but not libc.a. separate .o
and .lo object files still exist for the src tree, but the only
difference is that the .lo files are built as PIC.
in the future, if/when we add dlopen support for static-linked
programs, much of the code in dynlink.c may be moved back into the src
tree, but properly factored into separate source files. in that case,
the code in the ldso tree will be reduced to just the dynamic linker
entry point, self-relocation, and loading of libraries needed by the
main application.
the function name is still __-prefixed because it requires an asm
wrapper to pass the caller's address in order for RTLD_NEXT to work.
since this was the last function in dynlink.c still used for static
linking, now the whole file is conditional on SHARED being defined.
the ultimate goal of this change is to get all code used in libc.a out
of dynlink.c, so that the dynamic linker code can be moved to its own
tree and object files in the src tree can all be shared between libc.a
and libc.so.
contrary to commit 89e149d275, big
endian arm does need the instruction bytes in big endian order. rather
than trying to use a special encoding that works as arm or thumb,
simply encode the simplest/canonical undefined instructions dependent
on whether __thumb__ is defined.
the .byte directive encodes a guaranteed-undefined instruction, the
same one Linux fills the kuser helper page with when it's disabled.
the udf mnemonic and the .insn directive are not supported by old
binutils versions, and larger-than-byte integer directives would
produce the wrong output on big-endian.
IP_BIND_ADDRESS_NO_PORT is a SOL_IP socket option to delay src port
allocation until connect in case src ip is set with bind(port=0).
new in linux v4.2, commit 90c337da1524863838658078ec34241f45d8394d
IPPROTO_MPLS protocol number for mpls over ip.
new in linux v4.2, commit 730fc4371333636a00fed32c587fc1e85c5367e2
TCP_CC_INFO is a new socket option to get congestion control info without
netlink (union tcp_cc_info is in linux/inet_diag.h kernel header).
linux commit 6e9250f59ef9efb932c84850cd221f22c2a03c4a
TCP_SAVE_SYN, TCP_SAVED_SYN socket options are for saving and getting the
SYN headers of passive connections in a server application.
linux commit cd8ae85299d54155702a56811b2e035e63064d3d
Add new tcpi_* fields to struct tcp_info implementing RFC4898 counters.
linux commit 2efd055c53c06b7e89c167c98069bab9afce7e59
a_ll/a_sc inline asm used 64bit register operands (%0) instead of 32bit
ones (%w0); this at least broke a_and_64 (which always cleared the top
32 bits, leaking memory in malloc).
aarch64 provides ll/sc variants with acquire/release memory order,
freeing us from the need to have full barriers both before and after
the ll/sc operation. previously they were not used because the a_cas
can fail without performing a_sc, in which case half of the barrier
would be omitted. instead, define a custom version of a_cas for
aarch64 which uses a_barrier explicitly when aborting the cas
operation. aside from cas, other operations built on top of ll/sc are
not affected since they never abort but rather loop until they
succeed.
a split ll/sc version of the pointer-sized a_cas_p is also introduced
using the same technique.
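the shape of such a cas can be sketched in C. the a_ll, a_sc and
a_barrier bodies here are plain, non-atomic stand-ins so the example
runs anywhere; the real versions are aarch64 inline asm, and
__sync_synchronize is a gcc/clang builtin used only as a placeholder
barrier:

```c
/* stand-ins, NOT atomic: for illustrating control flow only */
static int a_ll(volatile int *p) { return *p; }
static int a_sc(volatile int *p, int v) { *p = v; return 1; }
static void a_barrier(void) { __sync_synchronize(); }

/* compare-and-swap built from ll/sc; returns the value observed. */
static int a_cas(volatile int *p, int t, int s)
{
	int old;
	do {
		old = a_ll(p);
		if (old != t) {
			/* aborting without a_sc would skip the release half
			 * of the barrier, so supply a full one explicitly */
			a_barrier();
			break;
		}
	} while (!a_sc(p, s));
	return old;
}
```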
patch by Szabolcs Nagy.
commit f3ddd17380, the dynamic linker
bootstrap overhaul, silently disabled the definition of __fpscr_values
in this file since libc.so's copy of __fpscr_values now comes from
crt_arch.h, the same place the public definition in the main program's
crt1.o ultimately comes from. remove this file which is no longer in
use.
previously powerpc had a_cas defined in terms of its native ll/sc
style operations, but all other atomics were defined in terms of
a_cas. instead define a_ll and a_sc so the compiler can generate
optimized versions of all the atomic ops and perform better inlining
of a_cas.
extracting the result of the sc (stwcx.) instruction is rather awkward
because it's natively stored in a condition flag, which is not
representable in inline asm. but even with this limitation the new
code still seems significantly better.
this commit mostly makes consistent things like spacing, function
ordering in atomic_arch.h, argument names, use of volatile, etc.
a_ctz_l was also removed from x86_64 since atomic.h provides it
automatically using a_ctz_64.
this commit mostly makes consistent things like spacing, function
ordering in atomic_arch.h, argument names, use of volatile, etc. the
fake 64-bit and/or atomics are also removed because the shared
atomic.h does a better job of implementing them; it avoids making two
atomic memory accesses when only one 32-bit half needs to be touched.
no major overhaul is needed or possible because x86 actually has
native versions of all the usual atomic operations, rather than using
ll/sc or needing cas loops.
this is possible with the new build system that allows src/*/$(ARCH)/*
files which do not shadow a file in the parent directory, and yields a
more logical organization. eventually it will be possible to remove
arch/*/src from the build system.
switch to ll/sc model so that new atomic.h can provide optimized
versions of all the atomic primitives without needing an ll/sc loop
written in asm for each one.
all isa levels which use ldrex/strex now use the inline ll/sc model
even if the type of barrier to use is not known until runtime (v6).
the cas model is only used for arm v5 and earlier, and it has been
optimized to make the call via inline asm with custom constraints
rather than as a C function call.
sh needs runtime-selected atomic backends since there are a number of
supported models that use non-forwards-compatible (non-smp-compatible)
atomic mechanisms. previously, the code paths for this were highly
inefficient since they involved C function calls with multiple
branches in the callee and heavy spills in the caller. the new code
calls the runtime-selected asm fragment from inline asm with
extremely minimal clobbers, rather than using a function call.
for the sh4a case where the atomic mechanism is known and there is no
forward-compatibility issue, the movli.l and movco.l instructions are
provided as a_ll and a_sc, allowing the new shared atomic.h to
generate efficient inline versions of all the basic atomic operations
without needing a cas loop.
rather than having each arch provide its own atomic.h, there is a new
shared atomic.h in src/internal which pulls arch-specific definitions
from arch/$(ARCH)/atomic_arch.h. the latter can be extremely minimal,
defining only a_cas or new ll/sc type primitives which the shared
atomic.h will use to construct everything else.
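a sketch of the layering, assuming an arch that provides only a_cas.
the a_cas body uses the gcc/clang builtin __sync_val_compare_and_swap
as a stand-in for arch-specific asm, and the self-referential #define
is one possible way for the shared header to detect what the arch
already supplied:

```c
/* --- what a minimal, hypothetical atomic_arch.h might contain --- */
#define a_cas a_cas
static inline int a_cas(volatile int *p, int t, int s)
{
	return __sync_val_compare_and_swap(p, t, s);
}

/* --- what the shared atomic.h could then derive from it --- */
#ifndef a_fetch_add
#define a_fetch_add a_fetch_add
static inline int a_fetch_add(volatile int *p, int v)
{
	int old;
	do old = *p;
	while (a_cas(p, old, (unsigned)old + v) != old);   /* retry on interference */
	return old;
}
#endif
```

an arch with a faster native fetch-and-add would define a_fetch_add in
its own header, and the generic fallback would then be skipped.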
this commit avoids making heavy changes to the individual archs'
atomic implementations. definitions which are identical or
near-identical to what the new shared atomic.h would produce have been
removed, but otherwise the changes made are just hooking up the
arch-specific files to the new infrastructure. major changes to take
advantage of the new system will come in subsequent commits.
commit 2f853dd6b9 failed to change the
test for -include vis.h support to use $srcdir, so vis.h was always
disabled by configure for out-of-tree builds.
the lib dir is automatically created if needed by the out-of-tree
build logic, and now that all generated files are in obj and lib,
deleting them is much simpler. using "rm -rf" is also more thorough,
as it picks up object files that were left around from source files
that no longer exist or which are no longer to be used because an
arch-specific replacement file was added or removed.
as of commit af21a82ccc, .sub files are
no longer in use. removing the makefile machinery to handle them not
only cleans up and simplifies the makefile, but also significantly
reduces make's startup time.
commit 2f853dd6b9 failed to replicate
the old makefile logic that caused arch/arm/src/arm/atomics.s to be
built. since this was the only .s file under arch/*/src, rather than
trying to reproduce the old logic, I'm just moving it up a level and
adjusting the glob pattern in the makefile to catch it. eventually
arch/*/src will probably be removed in favor of moving all these files
to appropriate src/*/$(ARCH) locations.
the __SOFTFP__ macro which was wrongly being used does not reflect the
ABI (arm vs armhf) but just the availability of floating point
instructions/registers, so -mfloat-abi=softfp was wrongly being
treated as armhf. __ARM_PCS_VFP is the correct predefined macro to
check for the armhf EABI variant. this macro usage was corrected for
the build process in commit 4918c2bb20
but reloc.h was apparently overlooked at the time.
this makes it possible to inline them with LTO, and is the simplest
approach to eliminating the use of .sub files.
this also makes VFP sqrt available for use with the standard EABI
(plain arm rather than armhf subarch) when libc is built with
-mfloat-abi=softfp. the same could have been done for fabs, but when
the argument and return value are in integer registers, moving to VFP
registers and back is almost certainly more costly than a simple
integer operation.
this depends on commit 9f5eb77992, which
made it possible to use a .c file for arch-specific replacements, and on
commit 2f853dd6b9, the out-of-tree build
support, which made it so that src/*/$(ARCH)/* 'replacement' files get
used even if they don't match the base name of a .c file in the parent
directory.
this allows the rules for .o and .lo files to be identical, with -fPIC
and -DSHARED added for .lo files via target-specific variable append.
this is arguably cleaner now and will allow more cleanup and removal
of redundant rule bodies after other prerequisite changes are made.
previously, replacement files provided in $(ARCH) dirs under src/ had
to be .s files. in order to replace a file with C source, an empty .s
file was needed there to suppress the original file, and a separate .c
file was needed in arch/$(ARCH)/src/.
support for .S is new and is aimed at short-term use eliminating .sub
files. asm source files are still expected not to make any heavy
preprocessor use, just simple conditionals on subarch. eventually most
affected files may be replaced with C source files with minimal inline
asm instead of asm source files.
Programs such as iptables depend on these constants, which can also
be found defined in other libcs.
Since only TCP_* is reserved as part of tcp.h's namespace, we hide
them behind _BSD_SOURCE (and therefore _DEFAULT_SOURCE) to expose
them by default, but keep it standard conforming.
this change adds support for building musl outside of the source
tree. the implementation is similar to autotools where running
configure in a different directory creates config.mak in the current
working directory and symlinks the makefile, which contains the
logic for creating all necessary directories and resolving paths
relative to the source directory.
to support both in-tree and out-of-tree builds with implicit make
rules, all object files are now placed into a separate directory.
apparently the .gpword directive does not work reliably with local
text labels; values produced were offset by 64k from the correct
value, resulting in incorrect computation of the got pointer at
runtime. instead, use an external label so that the assembler does not
munge the relocation; the linker will then get it right.
commit 6fef8cafbd exposed this issue by
removing the old, non-PIE-compatible handwritten crt1.s, which was not
affected. presumably mips PIE executables (using Scrt1.o produced from
crt_arch.h) were already affected at the time.
at least gcc 4.7 claims c++11 support but does not accept the alignas
keyword, causing breakage when stddef.h is included in c++11 mode.
instead, prefer using __attribute__((__aligned__)) on any compiler
with GNU extensions, and only use the alignas keyword as a fallback
for other C++ compilers.
C code should not be affected by this patch.
previously, getdelim was allocating twice the space needed every time
it expanded its buffer to implement exponential buffer growth (in
order to avoid quadratic run time). however, this doubling was
performed even when the final buffer length needed was already known,
which is the common case that occurs whenever the delimiter is in the
FILE's buffer.
this patch makes two changes to remedy the situation:
1. over-allocation is no longer performed if the delimiter has already
been found when realloc is needed.
2. growth factor is reduced from 2x to 1.5x to reduce the relative
excess allocation in cases where the delimiter is not initially in the
buffer, including unbuffered streams.
in theory these changes could lead to quadratic time if the same
buffer is reused to process a sequence of lines successively
increasing in length, but once this length exceeds the stdio buffer
size, the delimiter will not be found in the buffer right away and
exponential growth will still kick in.
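the two changes amount to a sizing rule along these lines.
next_alloc_size is a hypothetical helper and the SIZE_MAX/4 overflow
guard is an assumption of this sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* needed: bytes required right now; delim_found: nonzero if the
 * delimiter was already located in the stream buffer. */
static size_t next_alloc_size(size_t needed, int delim_found)
{
	size_t m = needed;
	/* final length known: allocate exactly; otherwise grow by 1.5x */
	if (!delim_found && m < SIZE_MAX / 4) m += m / 2;
	return m;
}
```

for a 100-byte requirement this yields 100 bytes when the delimiter is
already in view and 150 bytes when it is not.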
getdelim was updating *n, the caller's stored buffer size, before
calling realloc. if getdelim then failed due to realloc failure, the
caller would see in *n a value larger than the actual size of the
allocated block, and use of that value is unsafe. in particular,
passing it again to getdelim is unsafe.
now, temporary storage is used for the desired new size, and *n is not
written until realloc succeeds.
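the ordering can be sketched as below. grow_buffer is a hypothetical
helper and assumes want is nonzero:

```c
#include <stdlib.h>

/* grows *buf to want bytes; returns 0 on success, -1 on failure. */
static int grow_buffer(char **buf, size_t *n, size_t want)
{
	char *tmp = realloc(*buf, want);
	if (!tmp) return -1;   /* *buf and *n still describe the old block */
	*buf = tmp;
	*n = want;             /* published only after realloc succeeded */
	return 0;
}
```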
this error case was overlooked in the old range checking logic. new
check is moved out of __libc_sigaction to the public wrapper in order
to unify the error path and reduce code size.
commit 8a8fdf6398 was intended to remove
all such usage, but these arch-specific files were overlooked, leading
to inconsistent declarations and definitions.
the tsearch data structure is an avl tree, but it did not implement
the deletion operation correctly so the tree could become unbalanced.
reported by Ed Schouten.
With point-to-point interfaces, the IFA_ADDRESS netlink attribute
contains the peer address while an extra attribute IFA_LOCAL carries
the actual local interface address.
Both the glibc and uclibc implementations of getifaddrs() handle this
case by moving the ifa_addr contents to the broadcast/remote address
union and overwriting ifa_addr upon receipt of an IFA_LOCAL attribute.
This patch adds the same special treatment logic of IFA_LOCAL to
musl's implementation of getifaddrs() in order to align its behaviour
with that of uclibc and glibc.
Signed-off-by: Jo-Philipp Wich <jow@openwrt.org>
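A reduced sketch of the attribute handling described above. struct
ifentry, handle_attr and the ATTR_* constants are hypothetical
stand-ins for the real netlink types, addresses are shown as plain
32-bit values, and IFA_ADDRESS is assumed to arrive before IFA_LOCAL:

```c
#include <stdint.h>

enum { ATTR_ADDRESS, ATTR_LOCAL };   /* stand-ins for IFA_ADDRESS, IFA_LOCAL */

struct ifentry {
	uint32_t addr;      /* plays the role of ifa_addr */
	uint32_t dstaddr;   /* plays the role of the broadcast/remote union */
	int have_addr, have_dstaddr;
};

static void handle_attr(struct ifentry *e, int type, uint32_t value)
{
	switch (type) {
	case ATTR_ADDRESS:
		e->addr = value;
		e->have_addr = 1;
		break;
	case ATTR_LOCAL:
		if (e->have_addr) {
			/* point-to-point: what we stored was the peer */
			e->dstaddr = e->addr;
			e->have_dstaddr = 1;
		}
		e->addr = value;   /* the actual local address */
		e->have_addr = 1;
		break;
	}
}
```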
if two or more threads accessed tls in a dso that was loaded after
the threads were created, then __tls_get_new could do out-of-bound
memory access (leading to segfault).
a byte count was accidentally used instead of an element count when
the new dtv pointer was computed. (dso->new_dtv is (void**).)
it is rare that the same dso provides dtv for several threads,
the crash was not observed in practice, but possible to trigger.
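the arithmetic at issue can be sketched as follows. dtv_for is a
hypothetical helper, and the layout (one dtv of elems pointers per
thread, stored back to back) is an assumption made for illustration:

```c
#include <stddef.h>

/* returns the start of the dtv reserved for the given thread index. */
static void **dtv_for(void **block, size_t thread_index, size_t elems)
{
	/* arithmetic on void ** already scales by sizeof(void *), so the
	 * stride must be an element count; multiplying by a byte count
	 * here would overshoot by a factor of sizeof(void *). */
	return block + thread_index * elems;
}
```

with index 0 both computations agree, which is consistent with the bug
only showing up once a second thread needs a dtv from the same dso.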
a conforming compiler for an arch with excess precision floating point
(FLT_EVAL_METHOD!=0; presently i386 is the only such arch supported)
computes all intermediate results in the types float_t and double_t
rather than the nominal type of the expression. some incorrect
compilers, however, only keep excess precision in registers, and
convert down to the nominal type when spilling intermediate results to
memory, yielding unpredictable results that depend on the compiler's
choices of what/when to spill. in particular, this happens on old gcc
versions with -ffloat-store, which we need in order to work around
bugs where the compiler wrongly keeps explicitly-dropped excess
precision.
by explicitly converting to double_t where expressions are expected to
be evaluated in double_t precision, we can avoid depending on the
compiler to get types correct when spilling; the nominal and
intermediate precision now match. this commit should not change the
code generated by correct compilers, or by old ones on non-i386 archs
where double_t is defined as double.
this fixes a serious bug in argument reduction observed on i386 with
gcc 4.2: for values of x outside the unit circle, sin(x) was producing
results outside the interval [-1,1]. changes made in commit
0ce946cf80 were likely responsible for
breaking compatibility with this and other old gcc versions.
patch by Szabolcs Nagy.
commit ad1cd43a86 eliminated
preprocessor-level omission of references to the init/fini array
symbols from object files going into libc.so. the references are weak,
and the intent was that the linker would resolve them to zero in
libc.so, but instead it leaves undefined references that could be
satisfied at runtime. normally these references would be harmless,
since the code using them does not even get executed, but some older
binutils versions produce a linking error: when linking a program
against libc.so, ld first tries to use the hidden init/fini array
symbols produced by the linker script to satisfy the references in
libc.so, then produces an error because the definitions are hidden.
ideally ld would have already provided definitions of these symbols
when linking libc.so, but the linker script for -shared omits them.
to avoid this situation, the dynamic linker now provides its own dummy
definitions of the init/fini array symbols for libc.so. since they are
hidden, everything binds at ld time and no references remain in the
dynamic symbol table. with modern binutils and --gc-sections, both
the dummy empty array objects and the code referencing them get
dropped at link time, anyway.
the _init and _fini symbols are also switched back to using weak
definitions rather than weak references since the latter behave
somewhat problematically in general, and the weak definition approach
was known to work well.
the nommu kernel shares memory when it can anyway for private
read-only maps, but semantically the map should be private. this can
make a difference when debugging breakpoints are to be used, in which
case the kernel may need to ensure that the mapping is not shared.
the new behavior matches how the kernel FDPIC loader maps the main
program and/or program interpreter (dynamic linker) binary.
this both allows removal of some of the main remaining uses of the
SHARED macro and clears one obstacle to static-linked dlopen support,
which may be added at some point in the future.
specialized single-TLS-module versions of __copy_tls and __reset_tls
are removed and replaced with code adapted from their dynamic-linked
versions, capable of operating on a whole chain of TLS modules, and
use of the dynamic linker's DSO chain (which contains large struct dso
objects) by these functions is replaced with a new chain of struct
tls_module objects containing only the information needed for
implementing TLS. this may also yield some performance benefit
initializing TLS for a new thread when a large number of modules
without TLS have been loaded, since there is no need to walk
structures for modules without TLS.
use weak definitions that the dynamic linker can override instead of
preprocessor conditionals on SHARED so that the same libc start and
exit code can be used for both static and dynamic linking.
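a minimal sketch of the weak-definition pattern described above, assuming a
compiler that supports the gcc-style weak attribute. the names demo_init_hook
and demo_start are hypothetical and are not musl's actual symbols.

```c
#include <assert.h>

/* hypothetical names for illustration; not musl's actual symbols */
static int hook_calls;

/* weak default: used when nothing else defines demo_init_hook.
 * a strong definition in another object file linked into the same
 * program would take precedence at link time. */
__attribute__((weak)) void demo_init_hook(void)
{
	hook_calls++;
}

/* shared startup code: calls the hook unconditionally, with no
 * preprocessor conditional on how the program is linked. */
int demo_start(void)
{
	demo_init_hook();
	return hook_calls;
}
```

when only the weak default is linked in, each call to demo_start runs it; a
strong definition of demo_init_hook elsewhere would replace it without any
change to demo_start.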
this was only a tiny optimization, and static-linked binaries should
not be calling __tls_get_addr anyway since the linker is supposed to
perform relaxation, resulting in use of the local-exec TLS model.
this is the first and simplest stage of removal of the SHARED macro,
which will eventually allow libc.a and libc.so to be produced from the
same object files.
the original motivation for these #ifdefs which are now being removed
was to allow building a static-only libc using a compiler that does
not support visibility. however, SHARED was the wrong condition to
test for this anyway; various assembly-language sources refer to
hidden symbols and declare them with the .hidden directive, making it
wrong to define the referenced symbols as non-hidden. if there is a
need in the future to build libc using compilers that lack visibility,
support could be moved to the build system or perhaps the __PIC__
macro could be checked instead of SHARED.
on linux/nommu, non-writable private mappings of files may actually
use memory shared with other processes or the fs cache. the old nommu
loader code (used when mmap with MAP_FIXED fails) simply wrote over
top of the original file mapping, possibly clobbering this shared
memory. no such breakage was observed in practice, but it should have
been possible.
the new code starts by mapping anonymous writable memory on archs that
might support nommu, then maps load segments over top of it, falling
back to read if MAP_FIXED fails. we use an anonymous map rather than a
writable file map to avoid reading more data from disk than needed.
since pages cannot be loaded lazily on fault, in case of large
data/bss, mapping the full file may read a lot of data that will
subsequently be thrown away when processing additional LOAD segments.
as a result, we cannot skip the first LOAD segment when operating in
this mode.
these changes affect only non-FDPIC nommu support.
these files are all accepted as legacy arm syntax when producing arm
code, but legacy syntax cannot be used for producing thumb2 with
access to the full ISA. even after switching to UAL, some asm source
files contain instructions which are not valid in thumb mode, so these
will need to be addressed separately.
the idea of the three-instruction sequence being removed was to be
able to return to thumb code when used on armv4t+ from a thumb caller,
but also to be able to run on armv4 without the bx instruction
available (in which case the low bit of lr would always be 0).
however, without compiler support for generating such a sequence from
C code, which does not exist and which there is unlikely to be
interest in implementing, there is little point in having it in the
asm, and it would likely be easier to add pre-armv4t support via
enhanced linker handling of R_ARM_V4BX than at the compiler level.
removing this code simplifies adding support for building libc in
thumb2-only form (for cortex-m).
the code to save/restore vfp registers needs to build even when the
configured target does not have fpu; this is because code using vfp
fpu (but with the standard soft-float EABI) may call a libc built for
soft-float only, and the EABI considers these registers call-saved
when they exist. thus, extra directives are used to force the
assembler to allow vfp instructions and to avoid marking the resulting
object files as requiring vfp.
moving away from using hard-coded opcode words is necessary in order
to eventually support producing thumb2-only output for cortex-m.
conditional execution of these instructions based on hwcap flags was
already implemented. when building for arm (non-thumb) output, the
only currently-supported configuration, this commit does not change
the code emitted.
this function is used only as a weak definition for malloc, for static
linking in programs which do not call realloc or free. since it had
external linkage and was thereby exported in libc.so's dynamic symbol
table, --gc-sections was unable to drop it. this was merely an
oversight; there's no reason for it to be external, so make it static.
this allows the linker to drop certain weak definitions that are
only used as dummies for static linking. they could be eliminated for
shared library builds using the preprocessor instead, but we are
trying to transition to using the same object files for shared and
static libc, so a link-time solution is preferable.
based on patch by Denys Vlasenko. sorting sections and common data
symbols by alignment acts as an approximation for optimal packing,
which the linker does not actually support.
based on patch by Denys Vlasenko. the original intent for using these
options was to enable linking optimizations. these are immediately
available for static linking applications to libc.a, and will also be
used for linking libc.so in a subsequent commit.
in addition to the original motives, this change works around a whole
class of toolchain bugs where the compiler generates relative address
expressions using a weak symbol and the assembler "optimizes out" the
relocation which should result by using the weak definition. (see gas
pr 18561 and gcc pr 66609, 68178, etc. for examples.) by having
different functions and data objects in their own sections, all
relative address expressions are cross-section and thus cannot be
resolved to constants until link time. this allows us to retain
support for affected compiler/assembler versions without invasive
and fragile source-level workarounds.
this assumption is borderline-unsafe to begin with, and fails badly
with -ffunction-sections since the linker can move the callee
arbitrarily far away when it lies in a different section.
this way, overriding these variables on the make command line (or just
re-passing the originally-passed values when invoking make) won't
suppress use of the flags added by configure.
since mremap with the MREMAP_FIXED flag is an operation that unmaps
existing mappings, it needs to use the vm lock mechanism to ensure
that any in-progress synchronization operations using vm identities
from before the call have finished.
also, the variadic argument was erroneously being read even if the
MREMAP_FIXED flag was not passed. in practice this didn't break
anything, but it's UB and in theory LTO could turn it into a hard
error.
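the variadic-argument part of this fix can be sketched as below: the optional
argument is read only when the flag says it was passed. DEMO_FIXED and
demo_optional_addr are hypothetical stand-ins, not the real mremap code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

#define DEMO_FIXED 2  /* hypothetical flag standing in for MREMAP_FIXED */

/* returns the optional address argument if and only if the flag says
 * one was passed; otherwise the variadic list is never touched. */
void *demo_optional_addr(int flags, ...)
{
	void *new_addr = NULL;
	if (flags & DEMO_FIXED) {
		va_list ap;
		va_start(ap, flags);
		new_addr = va_arg(ap, void *);
		va_end(ap);
	}
	return new_addr;
}
```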
previously, only archs that needed to do stack cleanup defined a
__cp_cancel label for acting on cancellation in their syscall asm, and
a default definition was provided by a weak alias to __cancel, the C
function. this resulted in wrong codegen for arm on gcc versions
affected by pr 68178 and possibly similar issues (like pr 66609) on
other archs, and also created an inconsistency where the __cp_begin
and __cp_end labels were treated as const data but __cp_cancel was
treated as a function. this in turn caused incorrect code generation
on archs where function pointers point to function descriptors rather
than code (for now, only sh/fdpic).
using the actual mcontext_t definition rather than an overlaid pointer
array both improves correctness/readability and eliminates some ugly
hacks for archs with 64-bit registers but 32-bit program counter.
also fix UB due to comparison of pointers not in a common array
object.
when a library being loaded has bss (i.e. data segment with
p_memsz>p_filesz), this region needs to be zeroed with a combination
of memset and/or mmap. the regular ELF loader always did this but the
FDPIC code path omitted it, leading to objects in bss having
uninitialized/junk contents.
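a minimal sketch of the zeroing step, covering only the memset part (the mmap
part for whole pages is left out). demo_zero_bss is a hypothetical helper, and
seg is assumed to point at a writable mapping of at least memsz bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* zero the part of a loaded segment that lies beyond the file-backed
 * bytes, i.e. the range [filesz, memsz). */
void demo_zero_bss(unsigned char *seg, size_t filesz, size_t memsz)
{
	if (memsz > filesz)
		memset(seg + filesz, 0, memsz - filesz);
}
```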
getnameinfo() compares the size of the given struct sockaddr with
sizeof(struct sockaddr_in) and sizeof(struct sockaddr_in6) depending on
the net family. When you add a sockaddr of size sizeof(struct
sockaddr_storage) this function will fail because the size of the
sockaddr is too big. Change the check so that it fails only if the size is
too small and succeeds when it is too big, for example when someone
calls this function with a struct sockaddr_storage and its size.
This fixes a problem with IoTivity 1.0.0 and musl.
glibc and bionic fail only if it is smaller; net/freebsd
implemented the != check.
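A minimal sketch of the relaxed size check, assuming only AF_INET and
AF_INET6 need handling. demo_sockaddr_len_ok is a hypothetical helper, not the
actual getnameinfo code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns 1 if sl is large enough for the address family, 0 if it is
 * too small or the family is not handled in this sketch. */
int demo_sockaddr_len_ok(int family, socklen_t sl)
{
	switch (family) {
	case AF_INET:
		return sl >= sizeof(struct sockaddr_in);
	case AF_INET6:
		return sl >= sizeof(struct sockaddr_in6);
	default:
		return 0;
	}
}
```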
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
previously, transient failures like fd exhaustion or other
resource-related errors were treated the same as non-existence of
these files, leading to fallbacks or false-negative results. in
particular:
- failure to open hosts resulted in fallback to dns, possibly yielding
EAI_NONAME for a hostname that should be defined locally, or an
unwanted result from dns that the hosts file was intended to
replace.
- failure to open services resulted in EAI_SERVICE.
- failure to open resolv.conf resulted in querying localhost rather
than the configured nameservers.
now, only permanent errors trigger the fallback behaviors above; all
other errors are reportable to the caller as EAI_SYSTEM.
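the distinction can be sketched as a classification of errno after a failed
open. the particular set of errno values treated as permanent here is an
assumption made for illustration, and demo_classify_open_error is a
hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <netdb.h>

/* returns 0 if the open failure means "file is not there, use the
 * fallback", or EAI_SYSTEM if it should be reported to the caller.
 * the list of permanent errors is an assumption of this sketch. */
int demo_classify_open_error(int err)
{
	switch (err) {
	case ENOENT:
	case ENOTDIR:
	case EACCES:
		return 0;
	default:
		return EAI_SYSTEM;
	}
}
```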
the buffer enlargement logic here accounted for the terminating null
byte, but not for the possibility of hitting the delimiter in the
buffer-refill code path that uses getc_unlocked, in which case two
additional bytes (the delimiter and the null termination) are written
without another chance to enlarge the buffer.
this patch and the corresponding bug report are by Felix Janda.
the option to suppress executable stack tagging was placed in CFLAGS,
which is treated as optional and overridable by the build system. if a
user replaces CFLAGS after configure has run, it could get lost,
resulting in a libc.so that's flagged as needing executable stack,
which would cause the kernel to map the initial stack as executable.
move -Wa,--noexecstack to CFLAGS_C99FSE, the make variable used for
mandatory compiler options.
these per-target CFLAGS adjustments are mandatory additions to the
command line for building the affected targets, not part of the
user-provided CFLAGS for tuning. my intent was always that the
variable append operations would take place after user settings, but
when a variable is set on the command line, it overrides all
definitions in the makefile, including target-specific ones.
based on patch by Szabolcs Nagy.
Some armhf gcc toolchains (built with --with-float=hard but without
--with-fpu=vfp*) do not pass -mfpu=vfp to the assembler and then
binutils rejects the UAL mnemonics for VFP unless there is an .fpu vfp
directive in the asm source.
POSIX requires pthread_join to synchronize memory on success. The
futex wait inside __timedwait_cp cannot handle this because it's not
called in all cases. Also, in the case of a spurious wake, tid can
become zero between the wake and when the joining thread checks it.
when determining which module an address belongs to, all function
descriptor ranges must be checked first, in case the allocated memory
falls inside another module's memory range.
dladdr itself must also check addresses against function descriptors
before doing a best-match search against the symbol table. even when
doing the latter (e.g. for code addresses obtained from mcontext_t),
also check whether the best-match was a function, and if so, replace
the result with a function descriptor address, which is the nominal
"base address" of the function and which the caller needs if it
intends to subsequently call the matching function.
since commits 2907afb8db and
6fc30c2493, __dls2 is no longer called
via symbol lookup, but instead uses relative addressing that needs to
be resolved at link time. on some linker versions, and/or if
-Bsymbolic-functions is not used, the linker may leave behind a
dynamic relocation, which is not suitable for bootstrapping the
dynamic linker, if the reference to __dls2 is marked hidden but the
definition is not actually hidden. correcting the definition to use
hidden visibility fixes the problem.
the static-PIE entry point rcrt1 was likewise affected and is also
fixed by this patch.
we need access to all instructions in order for runtime selection of
atomic model to work correctly. without this patch, some versions of
gcc instruct gas to reject instructions outside the target isa level.
other archs use asm for the thread pointer load, so making that asm
volatile is sufficient to inform the compiler that it has a "side
effect" (crashing or giving the wrong result if the thread pointer was
not yet initialized) that prevents reordering. however, powerpc and
or1k have dedicated general purpose registers for the thread pointer
and did not need to use any asm to access it; instead, "local register
variables with a specified register" were used. however, there is no
specification for ordering constraints on this type of usage, and
presumably use of the thread pointer could be reordered across its
initialization.
to impose an ordering, I have added empty volatile asm blocks that
produce the "local register variable with a specified register" as
an output constraint.
this builds on commits a603a75a72 and
0ba35d69c0 to ensure that a compiler
cannot conclude that it's valid to reorder the asm to a point before
the thread pointer is set up, or to treat the inline function as if it
were declared with attribute((const)).
other archs already use volatile asm for thread pointer loading.
strftime results are unspecified in this case, but should not invoke
undefined behaviour.
tm_wday, tm_yday, tm_mon and tm_year fields were used in signed int
arithmetic that could overflow.
based on patch by Szabolcs Nagy.
since commit c5e34dabbb, crt1.c has
provided a "mostly-C" implementation of the crt1 start file that
avoids the need for arch-specific symbol referencing, PIC/PIE-specific
code variants, etc. but for archs that had existing hand-written
versions, the new code was initially unused, and later only used as
the dynamic linker entry point. this commit switches all archs to
using the new code.
the code being removed was a recurring source of subtle errors, and
was still broken at least on arm, where it failed to properly align
the stack pointer before calling into C code.
as found and reported by Brian Mastenbrook, the expressions
400*qc_cycles and years+100 in __secs_to_tm were both subject to
integer overflow for extreme values of the input t.
this patch by Szabolcs Nagy fixes the code by switching to larger
types, and matches the original intent I had in mind when writing this
code.
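a minimal sketch of the larger-types approach for the years+100 case, assuming
the year count is carried in long long and narrowed to int only after a range
check. demo_years_to_tm_year is a hypothetical helper, not the actual
__secs_to_tm code.

```c
#include <assert.h>
#include <limits.h>

/* convert a year count relative to 2000 into a tm_year-style value
 * (years since 1900). the range is checked in long long before the
 * result is narrowed, so no int arithmetic can overflow.
 * returns 0 on success, -1 if the result does not fit in int. */
int demo_years_to_tm_year(long long years, int *tm_year)
{
	if (years > INT_MAX - 100LL || years < INT_MIN - 100LL)
		return -1;
	*tm_year = (int)(years + 100);
	return 0;
}
```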
the specification for these functions requires that the buffer/size
exposed to the caller be valid after any successful call to fflush or
fclose on the stream. the implementation's approach is to update them
only at flush time, but that misses the case where fflush or fclose is
called without any writes having taken place, in which case the write
flushing callback will not be called.
to fix both the observable bug and the desired invariant, set up empty
buffers at open time and fail the open operation if no memory is
available.
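the case described above can be exercised with open_memstream as in the sketch
below, which closes the stream without writing anything and then inspects the
buffer and size. demo_memstream_no_writes is a hypothetical test helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* opens a memory stream and closes it without writing anything.
 * returns 0 if the buffer and size reported afterwards are usable
 * (non-null buffer holding an empty string, size 0), -1 otherwise. */
int demo_memstream_no_writes(void)
{
	char *buf = NULL;
	size_t size = 12345;  /* sentinel, expected to be overwritten */
	FILE *f = open_memstream(&buf, &size);
	if (!f) return -1;
	if (fclose(f) != 0) {
		free(buf);
		return -1;
	}
	int ok = buf != NULL && size == 0 && buf[0] == '\0';
	free(buf);
	return ok ? 0 : -1;
}
```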
fdiv and fmul instructions were wrongly matched by the rules for
integer div and mul instructions, leading to incorrect conclusions
about register values being clobbered.
There is a lot which could be common between i386 and x86_64, but none
of it will be useful for any other arch. These should be useful for
all archs, however.
commit 844212d94f, which did not make it
into any releases, changed nl_langinfo(CODESET) to always return
"UTF-8", even in the byte-based C locale. this was problematic because
application software was found to use the string match for "UTF-8" to
activate its own UTF-8 processing. this both undermines the byte-based
functionality of the C locale, and if mixed with calls to the
standard multibyte functions, which happened in practice, could result
in severe mis-handling of input.
the motive for the previous change was that, to avoid widespread
compatibility problems, the string returned by nl_langinfo(CODESET)
needs to be accepted by iconv and by third-party character conversion
code. thus, the only remaining choice is "ASCII". this choice
accurately represents the intent that high bytes do not have
individual meaning in the C locale, but it does mean that iconv, when
passed nl_langinfo(CODESET) in the C locale, will produce errors in
cases where mbrtowc would have succeeded. for reference, glibc behaves
similarly in this regard, so I don't think it will be a problem.
some newer binutils versions print scary warnings about protected data
because most gcc versions fail to produce the right address
references/relocations for such data that might be subject to copy
relocations. originally vis.h explicitly assigned default visibility
to all public data symbols to avoid this issue, but commit
b8dda24fe1 removed this treatment for
stdin/out/err to work around a gcc 3.x bug, and since they don't
actually need it (because taking their addresses is not valid C).
instead, a check for the gcc 3.x bug is added to the configure check
for vis.h preinclude support; this feature will simply be disabled
when using a buggy version of gcc.
previously, __lookup_ipliteral only checked its argument against the
requested address family, so IPv4 literals passed through to
__lookup_name if the caller asked for only IPv6 results, and likewise
for IPv6 literals when the caller asked for only IPv4. this resulted
in spurious DNS lookups that reportedly even succeeded with some
nameservers.
now, __lookup_ipliteral attempts to parse its argument as both IPv4
and IPv6, and returns an error (to stop further search) rather than 0
(no results yet) if the form of the argument mismatches the requested
address family.
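a minimal sketch of the new decision logic, using inet_pton as a stand-in
parser (an assumption; the real code may accept more literal forms). the
result codes and demo_check_literal are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* hypothetical result codes for this sketch */
enum { DEMO_NOT_LITERAL = 0, DEMO_MATCH = 1, DEMO_MISMATCH = -1 };

/* try the name as both an IPv4 and an IPv6 literal. a literal of the
 * wrong family is an error that stops the search, not "no result". */
int demo_check_literal(const char *name, int family)
{
	struct in_addr a4;
	struct in6_addr a6;
	if (inet_pton(AF_INET, name, &a4) == 1)
		return family == AF_INET6 ? DEMO_MISMATCH : DEMO_MATCH;
	if (inet_pton(AF_INET6, name, &a6) == 1)
		return family == AF_INET ? DEMO_MISMATCH : DEMO_MATCH;
	return DEMO_NOT_LITERAL;
}
```

only DEMO_NOT_LITERAL would allow the caller to continue on to hosts file or
DNS lookups.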
based on patch by Julien Ramseier.
The value of *size is not relevant in case of failure, but it's
better not to copy garbage from the stack into it.
(The compiler cannot see through the syscall, so optimization
was not affected by the unspecified value).
the restorer function pointer provided in the kernel sigaction
structure is interpreted by the kernel as a raw code address, not a
function descriptor.
this commit moves the declarations of the __restore and __restore_rt
symbols to ksigaction.h so that arch versions of the file can override
them, and introduces a version for sh which declares them as objects
rather than functions.
an alternate solution would have been defining SA_RESTORER to 0 so
that the functions are not used, but this both requires executable
stack (since the sh kernel does not have a vdso page with permanent
restorer functions) and crashes on qemu user-level emulation.
look up the dso an address falls in based on the loadmap and not just a
base/length. fix the main app's fake loadmap used when loaded by a
non-fdpic-aware loader so that it does not cover the whole memory
space.
function descriptor addresses are also matched for future use by
dladdr, but reverse lookups of function descriptors via dladdr have
not been implemented yet. some revisions may be needed in the future
once reclaim_gaps supports fdpic, so that function descriptors
allocated in reclaimed heap space do not get detected as belonging to
the module whose gaps they were allocated in.
previously these resolved to the code address rather than the address
of the function descriptor.
the conditions for accepting or rejecting symbols are quite
inconsistent between the different points in the dynamic linker code
where such decisions are made. this commit attempts to be at least as
correct as anything already there, but does not improve consistency.
it has been tested to correctly avoid symbols that are merely
references to functions defined in other modules, at least in simple
usage, but at some point all symbol lookup logic should be reviewed
and refactored/unified.
the entry point code supports being loaded by a loader which is not
fdpic-aware (in practice, either kernel with mmu or qemu without fdpic
support). this mostly just works, but signal handling will wrongly use
a function descriptor address as a code address if the personality is
not adjusted to fdpic.
ideally this code could be placed with sigaction so that it's not
needed except if/when a signal handler is installed. however,
personality is incorrectly maintained per-thread by the kernel, rather
than per-process, so it's necessary to correct the personality before
any threads are started. also, in order to skip the personality
syscall when an fdpic-aware loader is used, we need to be able to
detect how the program was loaded, and this information is only
readily available at the entry point.
this change is needed to be compatible with fdpic, where some of the
main application's relocations may be performed as part of the crt1
entry point. if we call init functions before passing control, these
relocations will not yet have been performed, and the init code will
potentially make use of invalid pointers.
conceptually, no code provided by the application or third-party
libraries should run before the application entry point. the
difference is not observable to programs using the crt1 we provide,
but it could come into play if custom entry point code is used, so
it's better to be doing this right anyway.
previously, the normal ELF library loading code was used even for
fdpic, so only the kernel-loaded dynamic linker and main app could
benefit from separate placement of segments and shared text.
this is always an error and usually results from failure to find/link
the compiler runtime library, but it could also result from
implementation errors in libc, using functions that don't (yet) exist.
either way the resulting libc.so will crash mysteriously at runtime.
the crash happens too early to produce a meaningful error, so these
crashes are very confusing to users and waste a lot of debugging time.
this commit should ensure that they do not happen.
the __fdpic_fixup code is not needed for ET_DYN executables, which
instead use relocations, so we can omit it from the dynamic linker and
static-pie entry point and save some code size.
the C implementation of __unmapself used for potentially-nommu sh
assumed CRTJMP takes a function descriptor rather than a code address;
however, the actual dynamic linker needs a code address, and so commit
7a9669e977 changed the definition of the
macro in reloc.h. this commit puts the old macro back in a place where
it only affects __unmapself.
this is an ugly workaround and should be cleaned up at some point, but
at least it's well isolated.
at this point not all functionality is complete. the dynamic linker
itself, and main app if it is also loaded by the kernel, take
advantage of fdpic and do not need constant displacement between
segments, but additional libraries loaded by the dynamic linker follow
normal ELF semantics for mapping still. this fully works, but does not
admit shared text on nommu.
in terms of actual functional correctness, dlsym's results are
presently incorrect for function symbols, RTLD_NEXT fails to identify
the caller correctly, and dladdr fails almost entirely.
with the dynamic linker entry point working, support for static pie is
automatically included, but linking the main application as ET_DYN
(pie) probably does not make sense for fdpic anyway. ET_EXEC is
equally relocatable but more efficient at representing relocations.
the fdpic code will need to count symbols, and it may be useful
elsewhere in the future too. counting is trivial as long as sysv hash
is present, but for gnu-hash-only libraries it's complex.
the behavior of the count is changed slightly: we now include symbols
that are not accessible by the gnu hash table in the count. this may
make dladdr slightly slower. if this is a problem, dladdr can subtract
out the part that should not be accessible. unlike in the old code,
subtracting this out is easy even in the fast path where sysv hash is
available too.
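a sketch of the two counting paths, written against the commonly documented
sysv and gnu hash table layouts rather than copied from the dynamic linker.
demo_count_syms is a hypothetical name, and gnu hash bloom filter words are
assumed to have the size of size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* count symbol table entries from whichever hash table is present.
 * hashtab: sysv hash table or NULL. ghashtab: gnu hash table. */
size_t demo_count_syms(const uint32_t *hashtab, const uint32_t *ghashtab)
{
	/* sysv: word 1 is nchain, which equals the symbol count */
	if (hashtab) return hashtab[1];

	uint32_t nbuckets = ghashtab[0], symoffset = ghashtab[1];
	/* skip the 4-word header and the bloom filter words */
	const uint32_t *buckets =
		ghashtab + 4 + ghashtab[2] * (sizeof(size_t) / 4);
	size_t nsym = 0;

	/* find the highest symbol index that starts a chain */
	for (uint32_t i = 0; i < nbuckets; i++)
		if (buckets[i] > nsym) nsym = buckets[i];

	/* walk that last chain until the end-of-chain bit is set */
	if (nsym) {
		const uint32_t *hashval =
			buckets + nbuckets + (nsym - symoffset);
		do nsym++;
		while (!(*hashval++ & 1));
	}
	return nsym;
}
```

the gnu path counts symbols below symoffset as well, which corresponds to the
behavior change described above.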
these are in do_relocs. the first one was omitted in commit
301335a80b because it slightly changes
code (using dso->base rather than cached local var base) and would
have prevented easy verification. the other was an oversight.
for ordinary ELF with fixed segment displacements, load address
computation is simply adding the base load address. but for FDPIC,
each segment has its own load address, and virtual addresses need to
be adjusted according to the segment they fall in. abstracting this
computation is the first step to making the dynamic linker ready for
FDPIC.
for this first commit, a macro is used rather than a function in order
to facilitate correctness checking. I have verified that the generated
code does not change on my i386 build.
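the abstraction can be sketched as a single translation function. the struct
below is a simplified, hypothetical loadmap and demo_laddr is not the actual
macro; with no segments it reduces to the ordinary base-plus-offset case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical, simplified loadmap: one entry per loaded segment */
struct demo_seg {
	uintptr_t addr;    /* where the segment was actually mapped */
	uintptr_t p_vaddr; /* link-time virtual address of the segment */
	size_t p_memsz;    /* size of the segment in memory */
};

/* translate a link-time virtual address to a runtime address.
 * with nsegs == 0 this models ordinary ELF: a single base is added.
 * otherwise the segment containing v supplies the displacement.
 * returns 0 if no segment contains v. */
uintptr_t demo_laddr(uintptr_t base, const struct demo_seg *segs,
                     size_t nsegs, uintptr_t v)
{
	if (nsegs == 0) return base + v;
	for (size_t i = 0; i < nsegs; i++) {
		/* unsigned wraparound also rejects v below p_vaddr */
		if (v - segs[i].p_vaddr < segs[i].p_memsz)
			return v - segs[i].p_vaddr + segs[i].addr;
	}
	return 0;
}
```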
this new generic version of the stage-2 function lookup should work
for any arch where static data is accessible via got-relative or
pc-relative addressing, using approximately the technique described in
the log message for commit 2907afb8db.
since all the mips-like archs that need got slots to access static
data have already transitioned to the new stage chaining scheme, the
old dynamic symbol lookup code is now removed.
aarch64, arm, and sh have not yet transitioned; with this commit, they
are now using the new generic code.
previously, the call into stage 2 was made by looking up the symbol
name "__dls2" (which was chosen short to be easy to look up) from the
dynamic symbol table. this was no problem for the dynamic linker,
since it always exports all its symbols. in the case of the static pie
entry point, however, the dynamic symbol table does not contain the
necessary symbol unless -rdynamic/-E was used when linking. this
linking requirement is a major obstacle both to practical use of
static-pie as a nommu binary format (since it greatly enlarges the
file) and to upstream toolchain support for static-pie (adding -E to
default linking specs is not reasonable).
this patch replaces the runtime symbolic lookup with a link-time
lookup via an inline asm fragment, which reloc.h is responsible for
providing. in this initial commit, the asm is provided only for i386,
and the old lookup code is left in place as a fallback for archs that
have not yet transitioned.
modifying crt_arch.h to pass the stage-2 function pointer as an
argument was considered as an alternative, but such an approach would
not be compatible with fdpic, where it's impossible to compute
function pointers without already having performed relocations. it was
also deemed desirable to keep crt_arch.h as simple/minimal as
possible.
in principle, archs with pc-relative or got-relative addressing of
static variables could instead load the stage-2 function pointer from
a static volatile object. that does not work for fdpic, and is not
safe against reordering on mips-like archs that use got slots even for
static functions, but it's valid on i386 and many others, and could
provide a reasonable default implementation in the future.
this attribute was applied to pthread_self and the functions providing
the locations for errno and h_errno as an optimization; however, it is
subtly incorrect. as specified, it means the return value will always
be the same, which is not true; it varies per-thread.
this attribute also implies that the function does not depend on any
state, and that calls to it can safely be reordered across any other
code. however such reordering is unsafe for these functions: they
break when reordered before initialization of the thread pointer. such
breakage was actually observed when compiled by libfirm/cparser.
to some extent the reordering problem could be solved with strong
compiler barriers between the stages of early startup code, but the
specified meaning of attribute((const)) is sufficiently strong that
a compiler would theoretically be justified inserting gratuitous calls
to attribute((const)) functions at random locations (e.g. to
save the value in static storage for later use).
this reverts commit cbf35978a9.
their absence completely breaks format string warnings in programs
with gettext message translations: -Wformat gives no results, and
-Wformat-nonliteral produces spurious warnings.
with gcc, the problem manifests only in standards-conforming profiles;
otherwise gcc sets these attributes by default for the gettext family.
with clang, the problem always manifests; clang has no such defaults.
with this commit it should be possible to produce a working
static-linked fdpic libc and application binaries for sh.
the changes in reloc.h are largely unused at this point since dynamic
linking is not supported, but the CRTJMP macro is used in one place
outside of dynamic linking, in __unmapself.
this version of the entry point is only suitable for static linking in
ET_EXEC form. neither dynamic linking nor pie is supported yet. at
some point in the future the fdpic and non-fdpic versions of this code
may be unified but for now it's easiest to work with them separately.
clone calls back to a function pointer provided by the caller, which
will actually be a pointer to a function descriptor on fdpic. the
obvious solution is to have a separate version of clone for fdpic, but
I have taken a simpler approach to go around the problem. instead of
calling the pointed-to function from asm, a direct call is made to an
internal C function which then calls the pointed-to function. this
lets the C compiler generate the appropriate calling convention for an
indirect call with no need for ABI-specific assembly.
for fdpic support it is essential that the got pointer be saved at a
known, ABI-dictated offset from the frame pointer, since there is no
way to recover it once it's lost.
this error was only found by reading the code, but it seems to have
been causing gcc to produce wrong code in malloc: the same register
was used for the output and the high word of the input. in principle
this could have caused an infinite loop searching for an available
bin, but in practice most x86 models seem to implement the "undefined"
result of the bsf instruction as "unchanged".
originally, the comment in this code was correct and it would likely
work if the compiler generated a tail call to setjmp. however, commit
583e55122e redesigned sigsetjmp and
siglongjmp such that the old C implementation (which was not intended
to be used) is not even conceptually correct. remove it in the
interest of avoiding confusion when porting to new archs.
this restores the original behavior prior to the addition of the
byte-based C locale and fixes what is effectively a regression in
musl's property of always providing working UTF-8 support.
commit 1507ebf837 introduced the codeset
name "UTF-8-CODE-UNITS" for the byte-based C locale to represent that
the semantic content is UTF-8 but that it is being processed as code
units (bytes) rather than whole multibyte characters. however, many
programs assume that the codeset name is usable with iconv and/or
comes from a set of standard/widely-used names known to the
application. such programs are likely to produce warnings or errors,
run with reduced functionality, or mangle character data when run
explicitly in the C locale.
the standard places basically no requirements for the string returned
by nl_langinfo(CODESET) and how it interacts with other interfaces, so
returning "UTF-8" is permissible. moreover, it seems like the right
thing to do, since the identity of the character encoding as "UTF-8"
is independent of whether it is being processed as bytes or characters
by the standard library functions.
this fixes a bug reported by Nuno Gonçalves. previously, calling
fclose on stdin or stdout resulted in deadlock at exit time, since
__stdio_exit attempts to lock these streams to flush/seek them, and
has no easy way of knowing that they were closed.
conceptually, leaving a FILE stream locked on fclose is valid since,
in the abstract machine, it ceases to exist. but to satisfy the
implementation-internal assumption in __stdio_exit that it can access
these streams unconditionally, we need to unlock them.
it's also necessary that fclose leaves permanent streams in a state
where __stdio_exit will not attempt any further operations on them.
fortunately, the call to fflush already yields this property.
these functions are part of the ARM EABI, meaning compilers may
generate references to them. known versions of gcc do not use them,
but llvm does. they are not provided by libgcc, and the de facto
standard seems to be that libc provides them.
this functionality is affected by GNU make bug #30653, "intermediate
files incorrectly pruned in parallel builds". on affected versions of
make, parallel builds attempt to compile source files before
alltypes.h is generated.
as noted with commit a91ebdcfac, which
added the use of .SECONDARY, suppression of removal of "intermediate"
files does not seem to be needed at present. if it is needed in the
future, it should be achievable by explicitly mentioning their names
as targets or prerequisites.
at one point, GNU make was removing crt/*.o after producing the copies
in lib/ due to an arcane misfeature for handling "intermediate" files.
the circumstances that caused this are no longer present in our
makefile, but the previous workaround using .PRECIOUS was wrong and
could result in corrupt/partial files being left behind during an
interrupted build. using .SECONDARY is the correct, documented fix
that will prevent deletion of "intermediate" files from ever
resurfacing.
Some functions implemented in asm need to use EBP for purposes other
than acting as a frame pointer. (Notably, it is used for the 6th
argument to syscalls with 6 arguments.) Without frame pointers, GDB
can only show backtraces if it gets CFI information from a
.debug_frame or .eh_frame ELF section.
Rather than littering our asm with ugly .cfi directives, use an awk
script to insert them in the right places during the build process, so
GDB can keep track of where the current stack frame is relative to the
stack pointer. This means GDB can produce beautiful stack traces at
any given point when single-stepping through asm functions.
Additionally, when registers are saved on the stack and later
overwritten, emit .cfi directives so GDB will know where they were
saved relative to the stack pointer. This way, when you look back up
the stack from within an asm function, you can still reliably print
the values of local variables in the caller.
If this awk script were to understand every possible wild and crazy
contortion that an asm programmer can do with the stack and registers,
and always emit the exact .cfi directives needed for GDB to know what
the register values were in the preceding stack frame, it would
necessarily be as complex as a full x86 emulator. That way lies
madness.
Hence, we assume that the stack pointer will _only_ ever be adjusted
using push/pop or else add/sub with a constant. We do not attempt to
detect every possible way that a register value could be saved for
later use, just the simple and common ways.
Thanks to Szabolcs Nagy for suggesting numerous improvements to this
code.
getsubopt incorrectly returns the delimiting = in the value string;
this patch fixes it by increasing the pointer position by one.
Signed-off-by: Steven Barth <cyrus@openwrt.org>
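as an illustration of the off-by-one described above, here is a minimal
sketch of splitting a "name=value" suboption. match_subopt is a
hypothetical helper written for this note, not musl's getsubopt; the
only point is that the value starts one byte past the '='.

```c
#include <stddef.h>
#include <string.h>

/* hypothetical helper, not musl's getsubopt: if opt is exactly "name"
 * or has the form "name=value", return 1 and point *value at the text
 * after the '=' (or set it to NULL when there is no value). */
static int match_subopt(char *opt, const char *name, char **value)
{
    size_t l = strlen(name);

    *value = NULL;
    if (strncmp(opt, name, l) != 0) return 0;
    if (opt[l] == '=') {
        /* skip the delimiter itself; returning opt+l would
         * leave the '=' at the front of the value */
        *value = opt + l + 1;
        return 1;
    }
    return opt[l] == '\0';
}
```

with the input "size=10" and the name "size", the value pointer refers
to "10" rather than "=10".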
commit 3c43c0761e fixed missing
synchronization in the atomic store operation for i386 and x86_64, but
opted to use mfence for the barrier on x86_64 where it's always
available. however, in practice mfence is significantly slower than
the barrier approach used on i386 (a nop-like lock orl operation).
this commit changes x86_64 (and x32) to use the faster barrier.
tm_gmtoff is a nonstandard field, but on historical systems which have
this field, it stores the offset of the local time zone from GMT or
UTC. this is the opposite of the POSIX extern long timezone object and
the offsets used in POSIX-form TZ strings, which represent the offset
from local time to UTC. previously we were storing these negated
offsets in tm_gmtoff too.
programs which only used this field indirectly via strftime were not
affected since strftime performed the negation for presentation.
however, some programs and libraries access tm_gmtoff directly and
were obtaining negated time zone offsets.
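the sign relationship between the two conventions can be sketched as
follows. the function name is made up for illustration; it only
restates the arithmetic described above.

```c
/* POSIX "timezone"-style offset: seconds to add to local time to get
 * UTC (positive west of Greenwich).
 * tm_gmtoff-style offset: seconds to add to UTC to get local time
 * (positive east of Greenwich).
 * illustrative helper, not part of musl. */
static long posix_offset_to_gmtoff(long posix_offset)
{
    return -posix_offset;
}
```

for example, a zone five hours west of Greenwich has a POSIX-style
offset of 18000 and should report a tm_gmtoff of -18000.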
On 32-bit systems long long arguments are passed in a special way
to some syscalls; this accidentally got copied to the AArch64 port.
The following interfaces were broken: fallocate, fanotify, ftruncate,
posix_fadvise, posix_fallocate, pread, pwrite, readahead,
sync_file_range, truncate.
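To make the difference concrete, the following sketch shows the kind of
splitting a 32-bit arch needs for a 64-bit argument. The helper names
are hypothetical and this is not musl's syscall macro machinery; on a
64-bit arch such as AArch64 the value fits in a single argument slot
and no splitting should take place.

```c
#include <stdint.h>

/* illustrative only: break a 64-bit value into the two 32-bit words
 * that would occupy separate argument slots on a 32-bit arch */
static void split_ll(int64_t v, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)((uint64_t)v & 0xffffffffu);
    *hi = (uint32_t)((uint64_t)v >> 32);
}

/* inverse operation, as the receiving side would reassemble it */
static int64_t join_ll(uint32_t lo, uint32_t hi)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```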
tempnam uses an uninitialized buffer which is filled using memcpy and
__randname. It is therefore necessary to explicitly null-terminate it.
based on patch by Felix Janda.
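a rough sketch of the pattern, assuming a caller-provided buffer with
indeterminate contents. build_template is a hypothetical stand-in, not
the actual tempnam code, and the "XXXXXX" suffix stands in for the
bytes a randomizing helper would later overwrite.

```c
#include <stddef.h>
#include <string.h>

/* hypothetical sketch, not musl's tempnam: assemble "dir/pfx_XXXXXX"
 * in buf, which must have room for strlen(dir)+strlen(pfx)+9 bytes. */
static void build_template(char *buf, const char *dir, const char *pfx)
{
    size_t dl = strlen(dir), pl = strlen(pfx);

    memcpy(buf, dir, dl);
    buf[dl] = '/';
    memcpy(buf + dl + 1, pfx, pl);
    buf[dl + 1 + pl] = '_';
    memcpy(buf + dl + 1 + pl + 1, "XXXXXX", 6);
    /* none of the copies above wrote a terminator, so add one */
    buf[dl + 1 + pl + 1 + 6] = '\0';
}
```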
during calls to free, any free chunks adjacent to the chunk being
freed are momentarily held in allocated state for the purpose of
merging, possibly leaving little or no available free memory for other
threads to allocate. under this condition, other threads will attempt
to expand the heap rather than waiting to use memory that will soon be
available. the race window where this happens is normally very small,
but became huge when free chooses to use madvise to release unused
physical memory, causing unbounded heap size growth.
this patch drastically shrinks the race window for unwanted heap
expansion by performing madvise with the bin lock held and marking the
bin non-empty in the binmask before making the expensive madvise
syscall. testing by Timo Teräs has shown this approach to be a
suitable mitigation.
more invasive changes to the synchronization between malloc and free
would be needed to completely eliminate the problem. it's not clear
whether such changes would improve or worsen typical-case performance,
or whether this would be a worthwhile direction to take malloc
development.
despite being strongly ordered, the x86 memory model does not preclude
reordering of loads across earlier stores. while a plain store
suffices as a release barrier, we actually need a full barrier, since
users of a_store subsequently load a waiter count to determine whether
to issue a futex wait, and using a stale count will result in soft
(fail-to-wake) deadlocks. these deadlocks were observed in malloc and
are possible with stdio locks and other libc-internal locking.
on i386, an atomic operation on the caller's stack is used as the
barrier rather than performing the store itself using xchg; this
avoids the need to read the cache line on which the store is being
performed. mfence is used on x86_64 where it's always available, and
could be used on i386 with the appropriate cpu model checks if it's
shown to perform better.
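the ordering requirement can be sketched with C11 atomics instead of
the arch-specific asm that musl actually uses. the function names below
are illustrative assumptions. a complete protocol would also need a
matching barrier on the waiter side, between incrementing the waiter
count and re-checking the lock word.

```c
#include <stdatomic.h>

/* sketch: a store that subsequent loads cannot be reordered before.
 * written with C11 atomics; not musl's a_store. */
static void store_with_full_barrier(atomic_int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_release);
    /* full fence: orders the store above against later loads,
     * such as the load of a waiter count */
    atomic_thread_fence(memory_order_seq_cst);
}

/* unlock pattern: release the lock word, then decide whether a
 * wake is needed based on the waiter count */
static int unlock_needs_wake(atomic_int *lock, atomic_int *waiters)
{
    store_with_full_barrier(lock, 0);
    return atomic_load_explicit(waiters, memory_order_relaxed) != 0;
}
```

without the fence, the load of the waiter count could be satisfied
before the store to the lock word becomes visible, which is the stale
count scenario described above.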
The old code accepted atexit handlers after exit, but did not run them
reliably. C11 seems to explicitly allow atexit to fail (and report
such failure) in this case, but this situation can easily come up in
C++ if a destructor has a local static object with a destructor so it
should be handled.
Note that the memory usage can grow linearly with the overall number
of registered atexit handlers instead of with the worst case list
length. (This only matters if atexit handlers keep registering atexit
handlers, which should not happen in practice.)
Commit message/rationale based on text by Szabolcs Nagy.
when traditional syslogd implementations are restarted, the old server
socket ceases to exist and a new unix socket with the same pathname is
created. when this happens, the default destination address associated
with the client socket via connect is no longer valid, and attempts to
send produce errors. this happens despite the socket being datagram
type, and is in contrast to the behavior that would be seen with an IP
datagram (UDP) socket.
in order to avoid a situation where the application is unable to send
further syslog messages without calling closelog, this patch makes
syslog attempt to reconnect the socket when send returns an error
indicating a lost connection.
additionally, initial failure to connect the socket no longer results
in the socket being closed. this ensures that an application which
calls openlog to reserve the socket file descriptor will not run into
a situation where transient connection failure (e.g. due to syslogd
restart) prevents fd reservation. however, applications which may be
unable to connect the socket later (e.g. due to chroot, restricted
permissions, seccomp, etc.) will still fail to log if the syslog
socket cannot be connected at openlog time or if it has to be
reconnected later.
being nonstandard, the closest thing to a specification for this
function is its man page, which documents it as returning int. it can
fail with EBADF if the file descriptor passed is invalid.
due to a reversed pointer difference computation, ns_skiprr always
returned a negative value, which functions using it would interpret as
an error.
patch by Yu Lu.
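the class of bug is easy to reproduce in miniature. skip_records below
is a hypothetical function written for this note, not the real
ns_skiprr; it only shows which way the pointer difference has to go.

```c
/* hypothetical record skipper, not the real ns_skiprr: advance past
 * count fixed-size records starting at ptr and return the number of
 * bytes skipped, or -1 if they do not fit before eom. */
static int skip_records(const unsigned char *ptr, const unsigned char *eom,
                        int count, int recsize)
{
    const unsigned char *p = ptr;

    while (count-- > 0) {
        if (eom - p < recsize) return -1;
        p += recsize;
    }
    /* end minus start is non-negative; the reversed form,
     * ptr - p, would be negative and look like an error */
    return (int)(p - ptr);
}
```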
musl-clang allows the user to compile musl-powered programs using their
existing clang install, without the need of a special cross compiler.
it achieves this by wrapping around both the system clang install and the
linker and passing them special flags to re-target musl at runtime.
it only affects invocations done through the special musl-clang wrapper
script, so that the user setup remains fully intact otherwise.
the clang wrapper consists of the compiler frontend wrapper script,
musl-clang, and the linker wrapper script, ld.musl-clang.
musl-clang makes sure clang invokes ld.musl-clang to link objects; neither
script needs to be in PATH for the wrapper to work.
the old test was broken in that it would never fail on toolchains built
without dynamic linking support, leading to the wrapper script possibly being
installed on compilers that do not support it. in addition, the new test is
portable across compilers: the old test only worked on GCC.
the new test works by testing whether the toolchain libc defines __GLIBC__:
most non-musl Linux libc's do define this for compatibility even when they
are not glibc, so this is a safe bet to check for musl. in addition, the
compiler runtime would need to have a somewhat glibc-compatible ABI in the
first place, so any non-glibc compatible libc's compiler runtime might not
work. it is safer to disable these cases by default and have the user enable
the wrappers manually there using --enable-wrapper if they are certain it works.
this overhauls part of the build system in order to support multiple
toolchain wrapper scripts, as opposed to solely the musl-gcc wrapper as
before. it thereby replaces --enable-gcc-wrapper with --enable-wrapper=...,
which has the options 'auto' (the default, detect whether to use wrappers),
'all' (build and install all wrappers), 'no' (don't build any) and finally
the options named after the individual compiler scripts (currently only
'gcc' is available) to build and install only that wrapper.
the old --enable-gcc-wrapper is removed from --help, but still available.
it also modifies the wrappers to use the C compiler specified to the build
system as 'inner' compiler, when applicable. as wrapper detection works by
probing this compiler, it may not work with any other.
this improves compatibility with the behavior of other systems and
with some applications which set an empty TZ var to disable use of
local time by mktime, etc.
The callers need to check the value of the pointer anyway, so make
them pass the pointer to gnu_lookup instead of reloading it there.
Reorder gnu_lookup arguments so that always-used ones are listed
first. GCC can choose a calling convention with arguments in registers
(e.g. up to 3 arguments in eax, ecx, edx on x86), but cannot reorder
the arguments for static functions.
Introduce gnu_lookup_filtered and use it to speed up symbol lookups in
find_sym (do_dlsym is left as is, based on an expectation that
frequently dlsym queries will use a dlopen handle rather than
RTLD_NEXT or RTLD_DEFAULT, and will not need to look at more than one
DSO).
the TLS ABI spec for mips, powerpc, and some other (presently
unsupported) RISC archs has the return value of __tls_get_addr offset
by +0x8000 and the result of DTPOFF relocations offset by -0x8000. I
had previously assumed this part of the ABI was actually just an
implementation detail, since the adjustments cancel out. however, when
the local dynamic model is used for accessing TLS that's known to be
in the same DSO, either of the following may happen:
1. the -0x8000 offset may already be applied to the argument structure
passed to __tls_get_addr at ld time, without any opportunity for
runtime relocations.
2. __tls_get_addr may be used with a zero offset argument to obtain a
base address for the module's TLS, to which the caller then applies
immediate offsets for individual objects accessed using the local
dynamic model. since the immediate offsets have the -0x8000 adjustment
applied to them, the base address they use needs to include the
+0x8000 offset.
it would be possible, but more complex, to store the pointers in the
dtv[] array with the +0x8000 offset pre-applied, to avoid the runtime
cost of adding 0x8000 on each call to __tls_get_addr. this change
could be made later if measurements show that it would help.
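the arithmetic in the two cases above can be checked with a small
sketch. the structure layout, the dtv contents, and the function name
are assumptions made for illustration; this is not the actual
__tls_get_addr.

```c
#include <stddef.h>
#include <stdint.h>

#define DTP_OFFSET 0x8000  /* bias used by the ABIs discussed above */

/* argument structure: module id and (possibly pre-biased) offset */
struct tls_index { size_t module; size_t offset; };

/* sketch only: dtv[m] is assumed to hold the unbiased base address
 * of module m's TLS block, so the bias is added on every call */
static uintptr_t tls_get_addr_sketch(const uintptr_t *dtv,
                                     const struct tls_index *ti)
{
    return dtv[ti->module] + ti->offset + DTP_OFFSET;
}
```

in case 1, an offset that already carries the -0x8000 adjustment comes
out at the object's true address. in case 2, a zero offset yields
base+0x8000, and adding an immediate of (object offset - 0x8000) lands
on the same address.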
previously, loading of additional libraries beyond libc/ldso did not
work on nommu kernels, nor did loading programs via invocation of the
dynamic linker as a command.
this interface is non-standardized and is a GNU invention, and as
such, our implementation should match the behavior of the GNU
function. one peculiarity the old implementation got wrong was the
handling of all-zero digit sequences: they are supposed to compare
greater than digit sequences of which they are a proper prefix, as in
009 < 00.
in addition, high bytes were treated with char signedness rather than
as unsigned. this was wrong regardless of what the GNU function does
since the resulting order relation varied by arch.
the new strverscmp implementation makes explicit the cases where the
order differs from what strcmp would produce, of which there are only
two.
commit ba819787ee introduced this
regression. since the __malloc0 weak alias was not properly provided
by __simple_malloc, use of calloc forced the full malloc to be linked.
previously, calloc's implementation encoded assumptions about the
implementation of malloc, accessing a size_t word just prior to the
allocated memory to determine if it was obtained by mmap to optimize
out the zero-filling. when __simple_malloc is used (static linking a
program with no realloc/free), it doesn't matter if the result of this
check is wrong, since all allocations are zero-initialized anyway. but
the access could be invalid if it crosses a page boundary or if the
pointer is not sufficiently aligned, which can happen for very small
allocations.
this patch fixes the issue by moving the zero-fill logic into malloc.c
with the full malloc, as a new function named __malloc0, which is
provided by a weak alias to __simple_malloc (which always gives
zero-filled memory) when the full malloc is not in use.
this symbol is needed only on archs where the PLT call ABI is klunky,
and only for position-independent code compiled with stack protector.
thus references usually only appear in shared libraries or PIE
executables, but they can also appear when linking statically if some
of the object files being linked were built as PIC/PIE.
normally libssp_nonshared.a from the compiler toolchain should provide
__stack_chk_fail_local, but reportedly it appears prior to -lc in the
link order, thus failing to satisfy references from libc itself (which
arise only if libc.a was built as PIC/PIE with stack protector
enabled).
linux kernel commit 46e12c07b3b9603c60fc1d421ff18618241cb081 caused
the mips syscall mechanism to fail with EFAULT when the userspace
stack pointer is invalid, breaking __unmapself used for detached
thread exit. the workaround is to set $sp to a known-valid, readable
address, and the simplest one to obtain is the address of the current
function, which is available (per o32 calling convention) in $25.
nominally the low bits of the trap number on sh are the number of
syscall arguments, but they have never been used by the kernel, and
some code making syscalls does not even know the number of arguments
and needs to pass an arbitrary high number anyway.
sh3/sh4 traditionally used the trap range 16-31 for syscalls, but part
of this range overlapped with hardware exceptions/interrupts on sh2
hardware, so an incompatible range 32-47 was chosen for sh2.
using trap number 31 everywhere, since it's in the existing sh3/sh4
range and does not conflict with sh2 hardware, is a proposed
unification of the kernel syscall convention that will allow binaries
to be shared between sh2 and sh3/sh4. if this is not accepted into the
kernel, we can refit the sh2 target with runtime selection mechanisms
for the trap number, but doing so would be invasive and would entail
non-trivial overhead.
due to the way the interrupt and syscall trap mechanism works,
userspace on sh2 must never set the stack pointer to an invalid value.
thus, the approach used on most archs, where __unmapself executes with
no stack for the interval between SYS_munmap and SYS_exit, is not
viable on sh2.
in order not to pessimize sh3/sh4, the sh asm version of __unmapself
is not removed. instead it's renamed and redirected through code that
calls either the generic (safe) __unmapself or the sh3/sh4 asm,
depending on compile-time and run-time conditions.
the sh2 target is being considered an ISA subset of sh3/sh4, in the
sense that binaries built for sh2 are intended to be usable on later
cpu models/kernels with mmu support. so rather than hard-coding
sh2-specific atomics, the runtime atomic selection mechanism that was
already in place has been extended to add sh2 atomics.
at this time, the sh2 atomics are not SMP-compatible; since the ISA
lacks actual atomic operations, the new code instead masks interrupts
for the duration of the atomic operation, producing an atomic result
on single-core. this is only possible because the kernel/hardware does
not impose protections against userspace doing so. additional changes
will be needed to support future SMP systems.
care has been taken to avoid producing significant additional code
size in the case where it's known at compile-time that the target is
not sh2 and does not need sh2-specific code.
functions which open in-memory FILE stream variants all shared a tail
with __fdopen, adding the FILE structure to stdio's open file list.
replacing this common tail with a function call reduces code size and
duplication of logic. the list is also partially encapsulated now.
function signatures were chosen to facilitate tail call optimization
and reduce the need for additional accessor functions.
with these changes, static linked programs that do not use stdio no
longer have an open file list at all.
this patch activates the new byte-based C locale (high bytes treated
as abstract code unit "characters" rather than decoded as multibyte
characters) by making the value of MB_CUR_MAX depend on the active
locale. for the C locale, the LC_CTYPE category pointer is null,
yielding a value of 1. all other locales yield a value of 4.
this patch adjusts libc components which use the multibyte functions
internally, and which depend on them operating in a particular
encoding, to make the appropriate locale changes before calling them
and restore the calling thread's locale afterwards. activating the
byte-based C locale without these changes would cause regressions in
stdio and iconv.
in the case of iconv, the current implementation was simply using the
multibyte functions as UTF-8 conversions. setting a multibyte UTF-8
locale for the duration of the iconv operation allows the code to
continue working.
in the case of stdio, POSIX requires that FILE streams have an
encoding rule bound at the time of setting wide orientation. as long
as all locales, including the C locale, used the same encoding,
treating high bytes as UTF-8, there was no need to store an encoding
rule as part of the stream's state.
a new locale field in the FILE structure points to the locale that
should be made active during fgetwc/fputwc/ungetwc on the stream. it
cannot point to the locale active at the time the stream becomes
oriented, because this locale could be mutable (the global locale) or
could be destroyed (locale_t objects produced by newlocale) before the
stream is closed. instead, a pointer to the static C or C.UTF-8 locale
object added in commit aeeac9ca54
is used. this is valid since categories other than LC_CTYPE will not
affect these functions.
this patch makes the functions which work directly on multibyte
characters treat the high bytes as individual abstract code units
rather than as multibyte sequences when MB_CUR_MAX is 1. since
MB_CUR_MAX is presently defined as a constant 4, all of the new code
added is dead code, and optimizing compilers' code generation should
not be affected at all. a future commit will activate the new code.
as abstract code units, bytes 0x80 to 0xff are represented by wchar_t
values 0xdf80 to 0xdfff, at the end of the surrogates range. this
ensures that they will never be misinterpreted as Unicode characters,
and that all wctype functions return false for these "characters"
without needing locale-specific logic. a high range outside of Unicode
such as 0x7fffff80 to 0x7fffffff was also considered, but since C11's
char16_t also needs to be able to represent conversions of these
bytes, the surrogate range was the natural choice.
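the mapping is simple enough to state in a few lines. the helper names
here are illustrative and are not claimed to match musl's internal
macros.

```c
#include <wchar.h>

/* sketch of the byte -> abstract code unit mapping described above */
static wchar_t byte_to_codeunit(unsigned char b)
{
    /* ASCII maps to itself; 0x80..0xff map to 0xdf80..0xdfff */
    return b < 0x80 ? (wchar_t)b : (wchar_t)(0xdf00 | b);
}

/* true only for values in the range 0xdf80..0xdfff */
static int is_codeunit(wchar_t wc)
{
    return (unsigned long)wc - 0xdf80UL < 0x80;
}
```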
btowc is required to interpret its argument by conversion to unsigned
char, unless the argument is equal to EOF. since the conversion
produces a non-character value anyway, we can just unconditionally
convert, for now.
this extends the brk/stack collision protection added to full malloc
in commit 276904c2f6 to also protect the
__simple_malloc function used in static-linked programs that don't
reference the free function.
it also extends support for using mmap when brk fails, which full
malloc got in commit 5446303328, to
__simple_malloc.
since __simple_malloc may expand the heap by arbitrarily large
increments, the stack collision detection is enhanced to detect
interval overlap rather than just proximity of a single address to the
stack. code size is increased a bit, but this is partly offset by the
sharing of code between the two malloc implementations, which due to
linking semantics, both get linked in a program that needs the full
malloc with realloc/free support.
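the difference between a proximity check and an interval check can be
sketched as below. the 8 MB guard size and all names are assumptions
for illustration, not a copy of the code in malloc.

```c
#include <stdint.h>

#define STACK_GUARD (8u << 20)  /* assumed guard size of 8 MB */

/* sketch: does growing the heap from old_end to new_end enter the
 * STACK_GUARD bytes below stack_addr? */
static int overlaps_stack(uintptr_t old_end, uintptr_t new_end,
                          uintptr_t stack_addr)
{
    uintptr_t hi = stack_addr;
    uintptr_t lo = hi > STACK_GUARD ? hi - STACK_GUARD : 0;

    /* the intervals [old_end,new_end) and [lo,hi) intersect */
    return new_end > lo && old_end < hi;
}
```

a check that only looked at how close new_end is to the stack would
miss a large increment that jumps from below the guard region to above
it; the interval form catches that case.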
commit 5816592389 added these optional
cancellation points on the basis that cancellable stdio could be
useful, to unblock threads stuck on stdio operations that will never
complete. however, the only way to ensure that cancellation can
achieve this is to violate the rules for side effects when
cancellation is acted upon, discarding knowledge of any partial data
transfer already completed. our implementation exhibited this behavior
and was thus non-conforming.
in addition to improving correctness, removing these cancellation
points moderately reduces code size, and should significantly improve
performance on i386, where sysenter/syscall instructions can be used
instead of "int $128" for non-cancellable syscalls.
the old idiom, f->mode |= f->mode+1, was adapted from the idiom for
setting byte orientation, f->mode |= f->mode-1, but the adaptation was
incorrect. unless the stream was already set byte-oriented, this code
incremented f->mode each time it was executed, which would eventually
lead to overflow. it could be fixed by changing it to f->mode |= 1,
but upcoming changes will require slightly more work at the time of
wide orientation, so it makes sense to just call fwide. as an
optimization in the single-character functions, fwide is only called
if the stream is not already wide-oriented.
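the behavior of the three expressions can be seen with a plain int
standing in for f->mode (negative for byte orientation, positive for
wide orientation, zero for unset). the function names are made up for
this sketch.

```c
/* byte-orientation idiom: 0 becomes -1; -1 and 1 stay as they are */
static int set_byte_orientation(int mode)
{
    return mode | (mode - 1);
}

/* the incorrect adaptation: 0 -> 1 -> 3 -> 7 ..., growing on every
 * call unless the stream is already byte-oriented (-1) */
static int broken_set_wide(int mode)
{
    return mode | (mode + 1);
}

/* the simple fix mentioned above: 0 -> 1, stable afterwards */
static int fixed_set_wide(int mode)
{
    return mode | 1;
}
```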
this can be used to put off writing an asm version of __unmapself for
new archs, or as a permanent solution on archs where it's not
practical or even possible to run momentarily with no stack.
the concept here is simple: the caller takes a lock on a global shared
stack and uses it to make the munmap and exit syscalls. the only trick
is unlocking, which must be done after the thread exits, and this is
achieved by using the set_tid_address syscall to have the kernel zero
and futex-wake the lock word as part of the exit syscall.
the linux/nommu fdpic ELF loader sets up the brk range to overlap
entirely with the main thread's stack (but growing from opposite
ends), so that the resulting failure mode for malloc is not to return
a null pointer but to start returning pointers to memory that overlaps
with the caller's stack. needless to say this is extremely dangerous and
makes brk unusable.
since it's non-trivial to detect execution environments that might be
affected by this kernel bug, and since the severity of the bug makes
any sort of detection that might yield false-negatives unsafe, we
instead check the proximity of the brk to the stack pointer each time
the brk is to be expanded. both the main thread's stack (where the
real known risk lies) and the calling thread's stack are checked. an
arbitrary gap distance of 8 MB is imposed, chosen to be larger than
linux default main-thread stack reservation sizes and larger than any
reasonable stack configuration on nommu.
the effectiveness of this patch relies on an assumption that the amount
by which the brk is being grown is smaller than the gap limit, which
is always true for malloc's use of brk. reliance on this assumption is
why the check is being done in malloc-specific code and not in __brk.
for several pwd/grp functions, the only way the caller can distinguish
between a successful negative result ("no such user/group") and an
internal error is by clearing errno before the call and checking errno
afterwards. the nscd backend support code correctly simulated a
not-found response on systems where such a backend is not running, but
failed to restore errno.
this commit also fixed an outdated/incorrect comment.
the arm atomics/TLS runtime selection code is called from
__set_thread_area and depends on having libc.auxv and __hwcap
available. commit 71f099cb7d moved the
first call to __set_thread_area to the top of dynamic linking stage 3,
before this data is made available, causing the runtime detection code
to always see __hwcap as zero and thereby select the atomics/TLS
implementations based on kuser helper.
upcoming work on superh will use similar runtime detection.
ideally this early-init code should be cleanly refactored and shared
between the dynamic linker and static-linked startup.
unless/until the byte-based C locale is implemented, defining
MB_CUR_MAX to 1 in the C locale is wrong. no internal code currently
uses the MB_CUR_MAX macro, but having it defined inconsistently is
error-prone. applications get the value from stdlib.h and were
unaffected.
aside from being invalid, the early check only optimized the error
case, and likely pessimized the common case by separating the
two branches on isascii(c) at opposite ends of the function.
commit f3ddd17380 inadvertently removed
the early check for "none" type relocations, causing the address
dso->base+0 to be dereferenced to obtain an addend. shared libraries
(including libc.so) and PIE executables were unaffected, since their
base addresses are the actual address of their mappings and are
readable. non-PIE main executables, however, have a base address of 0
because their load addresses are absolute and not offset at load time.
in practice none-type relocations do not arise with toolchains that
are in use except on mips, and on mips it's moderately rare for a
non-PIE executable to have a relocation table, since the mips-specific
got processing serves in its place for most purposes.
these functions were written to handle clearing eof status, but failed
to account for the __toread function's handling of eof. with this
patch applied, __toread still returns EOF when the file is in eof
status, so that read operations will fail, but it also sets up valid
buffer pointers for read mode, which are set to the end of the buffer
rather than the beginning in order to make the whole buffer available
to ungetc/ungetwc.
minor changes to __uflow were needed since it's now possible to have
non-zero buffer pointers while in eof status. as made, these changes
remove a 'fast path' bypassing the function call to __toread, which
could be reintroduced with slightly different logic, but since
ordinary files have a syscall in f->read, optimizing the code path
does not seem worthwhile.
the __stdio_read function is also updated not to zero the read buffer
pointers on eof/error. while not necessary for correctness, this
change avoids the overhead of calling __toread in ungetc after
reaching eof, and it also reduces code size and increases consistency
with the fmemopen read operation which does not zero the pointers.
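from the application side, the behavior at stake is that a pushback
after end-of-file still works. the following sketch exercises it using
only standard and POSIX interfaces; the function name is hypothetical
and the in-memory stream is just a convenient way to reach eof status.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* read a one-byte in-memory stream to EOF, push a byte back, and read
 * again. returns the byte obtained after the pushback, or EOF if any
 * step fails. */
static int unget_after_eof(void)
{
    char data[] = "a";
    FILE *f = fmemopen(data, 1, "r");
    int c;

    if (!f) return EOF;
    while (getc(f) != EOF)
        ;                       /* consume input until eof status */
    if (ungetc('x', f) == EOF) {
        fclose(f);
        return EOF;
    }
    c = getc(f);                /* should return the pushed-back byte */
    fclose(f);
    return c;
}
```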
some compilers (such as clang) accept unknown options without error,
but then print warnings on each invocation, cluttering the build
output and burying meaningful warnings. this patch makes configure's
tryflag and tryldflag functions use additional options to turn the
unknown-option warnings into errors, if available, but only at check
time. these options are not output in config.mak to avoid the risk of
spurious build breakage; if they work, they will have already done
their job at configure time.
this frees applications which need to make temporary use of the C
locale (via uselocale) from the possibility that newlocale might fail.
the C.UTF-8 locale is also provided as a static locale. presently they
behave the same, but this may change in the future.
previously, LC_MESSAGES was treated specially as the only category
which could be set to a locale name without a definition file, in
order to facilitate gettext message translations when no libc locale
was available. LC_NUMERIC was completely un-settable, and LC_CTYPE
stored a flag intended to be used for a possible future byte-based C
locale, instead of storing a __locale_map pointer like the other
categories use.
this patch changes all categories to be represented by pointers to
__locale_map structures, and allows locale names without definition
files to be treated as valid locales with trivial definition when used
in any category. outwardly visible functional changes should be minor,
limited mainly to the strings read back from setlocale and the way
gettext handles translations in categories other than LC_MESSAGES.
various internal refactoring has also been performed, and improvements
in const correctness have been made.
this is part of a general program of removing direct use of atomics
where they are not necessary to meet correctness or performance needs,
but in this case it's also an optimization. only the global locale
needs synchronization; allocated locales referenced with locale_t
handles are immutable during their lifetimes, and using atomics to
initialize them increases their cost of setup.
static-linked PIE files need startup code to relocate themselves, much
like the dynamic linker does. rcrt1.c reuses the code in dlstart.c,
stage 1 of the dynamic linker, which in turn reuses crt_arch.h, to
achieve static PIE with no new code. only relative relocations are
supported.
existing toolchains that don't yet support static PIE directly can be
repurposed by passing "-shared -Wl,-Bstatic -Wl,-Bsymbolic" instead of
"-static -pie" and substituting rcrt1.o in place of crt1.o.
all libraries being linked must be built as PIC/PIE; TEXTRELs are not
supported at this time.
commit de2b67f8d4 attempted to avoid
having vis.h affect crt files, but the Makefile variable used,
CRT_LIBS, refers to the final output copies in the lib directory, not
the copies in the crt build directory, and thus the -DCRT was not
applied.
while unlikely to be noticed, this regression probably broke
production of PIE executables whose main functions are not in the
executable but rather a shared library.
commit f3ddd17380 introduced early
relocations and subsequent reprocessing as part of the dynamic linker
bootstrap overhaul, to allow use of arbitrary libc functions before
the main application and libraries are loaded, but only reprocessed
GOT/PLT relocation types.
commit c093e2e820 added reprocessing of
non-GOT/PLT relocations to fix an actual regression that was observed
on powerpc, but only for RELA format tables with out-of-line addends.
REL table (inline addends at the relocation address) reprocessing is
trickier because the first relocation pass clobbers the addends.
this patch extends symbolic relocation reprocessing for libc/ldso to
support all relocation types, whether REL or RELA format tables are
used. it is believed not to alter behavior on any existing archs for
the current dynamic linker and libc code. the motivations for this
change are consistency and future-proofing. it ensures that behavior
does not differ depending on whether REL or RELA tables are used,
which could lead to undetected arch-specific bugs. it also ensures
that, if in the future code depending on additional relocation types
is added to libc.so, either at the source level or as part of the
compiler runtime that gets pulled in (for example, soft-float with TLS
for fenv), the new code will work properly.
the implementation concept is simple: stage 2 of the dynamic linker
counts the number of symbolic relocations in the libc/ldso REL table
and allocates a VLA to save their addends into; stage 3 then uses the
saved addends in place of the inline ones which were clobbered. for
stack safety, a hard limit (currently 4k) is imposed on the number of
such addends; this should be a couple of orders of magnitude larger than
the actual need. this number is not a runtime variable that could
break fail-safety; it is constant for a given libc.so build.
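the save-and-reuse idea can be sketched as follows. this is a minimal
model, not the dynamic linker's code: the type codes, struct rel, the
function names, and the use of array indices as "offsets" are all
hypothetical, and the relocation formula is reduced to sym+addend.

```c
#include <stddef.h>

/* hypothetical relocation type codes, for illustration only */
enum { REL_RELATIVE, REL_SYMBOLIC };

/* hypothetical REL entry: offset is an index into the image array */
struct rel { size_t offset; int type; };

/* count the symbolic entries so the caller can size the save area */
size_t count_symbolic(const struct rel *r, size_t n)
{
	size_t i, cnt = 0;
	for (i = 0; i < n; i++)
		if (r[i].type == REL_SYMBOLIC) cnt++;
	return cnt;
}

/* first pass: read the inline addend from the slot being relocated,
 * save it, then overwrite the slot with the relocated value */
void first_pass(size_t *mem, const struct rel *r, size_t n,
	size_t sym, size_t *saved)
{
	size_t i, k = 0;
	for (i = 0; i < n; i++) {
		if (r[i].type != REL_SYMBOLIC) continue;
		saved[k] = mem[r[i].offset];
		mem[r[i].offset] = sym + saved[k++];
	}
}

/* later pass: the inline addend is gone, so use the saved copy */
void reprocess(size_t *mem, const struct rel *r, size_t n,
	size_t sym, const size_t *saved)
{
	size_t i, k = 0;
	for (i = 0; i < n; i++) {
		if (r[i].type != REL_SYMBOLIC) continue;
		mem[r[i].offset] = sym + saved[k++];
	}
}
```

without the saved array, the second pass would add the new symbol
value to the already-relocated contents of the slot rather than to
the original addend.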
this move eliminates a duplicate "by-hand" symbol lookup loop from the
stage-1 code and replaces it with a call to find_sym, which can be
used once we're in stage 2. it reduces the size of the stage 1 code,
which is helpful because stage 1 will become the crt start file for
static-PIE executables, and it will allow stage 3 to access stage 2's
automatic storage, which will be important in an upcoming commit.
the outer-loop approach made sense when we were also processing
DT_JMPREL, which might be in REL or RELA form, to avoid major code
duplication. commit 09db855b35 removed
processing of DT_JMPREL, and in the remaining two tables, the format
(REL or RELA) is known by the name of the table. simply writing two
versions of the loop results in smaller and simpler code.
the DT_JMPREL relocation table necessarily consists entirely of
JMP_SLOT (REL_PLT in internal nomenclature) relocations, which are
symbolic; they cannot be resolved in stage 1, so there is no point in
processing them.
the instruction used to align the stack, "and $sp, $sp, -8", does not
actually exist; it's expanded to 2 instructions using the 'at'
(assembler temporary) register, and thus cannot be used in a branch
delay slot. since alignment mod 16 commutes with subtracting 8, simply
swapping these two operations fixes the problem.
crt1.o was not affected because it's still being generated from a
dedicated asm source file. dlstart.lo was not affected because the
stack pointer it receives is already aligned by the kernel. but
Scrt1.o was affected in cases where the dynamic linker gave it a
misaligned stack pointer.
i386 and x86_64 versions already had the .text directive; other archs
did not. normally, top-level (file scope) __asm__ starts in the .text
section anyway, but problems were reported with some versions of
clang, and it seems preferable to set it explicitly anyway, at least
for the sake of consistency between archs.
while not a requirement, it's common convention in other iconv
implementations to accept "CHAR" as an alias for nl_langinfo(CODESET),
meaning the encoding used for char[] strings in the current locale,
and also "" as an alternate form. supporting this is not costly and
improves compatibility.
conceptually, and on other archs, these functions take a pointer to
int, but in the i386, x86_64, and x32 versions of atomic.h, they took
a pointer to void instead.
If we're building for sh4a, the compiler is already free to use
instructions only available on sh4a, so we can do the same and inline the
llsc atomics. If we're building for an older processor, we still do the
same runtime atomics selection as before.
this fixes a regression on powerpc that was introduced in commit
f3ddd17380. global data accesses on
powerpc seem to be using a translation-unit-local GOT filled via
R_PPC_ADDR32 relocations rather than R_PPC_GLOB_DAT. being a non-GOT
relocation type, these were not reprocessed after adding the main
application and its libraries to the chain, causing libc code not to
see copy relocations in the main program, and therefore to use the
pre-copy-relocation addresses for global data objects (like environ).
the motivation for the dynamic linker only reprocessing GOT/PLT
relocation types in stage 3 is that these types always have a zero
addend, making them safe to process again even if the storage for the
addend has been clobbered. other relocation types which can be used
for address constants in initialized data objects may have non-zero
addends which will be clobbered during the first pass of relocation
processing if they're stored inline (REL form) rather than out-of-line
(RELA form).
powerpc generally uses only RELA, so this patch is sufficient to fix
the regression in practice, but is not fully general, and would not
suffice if an alternate toolchain generated REL for powerpc.
if setlocale has not been called, the current locale's messages_name
may be a null pointer. the code path where it's assumed to be non-null
was only reachable if bindtextdomain had already been called, which is
normally not done in programs which do not call setlocale, so the
omitted check went unnoticed.
patch from Void Linux, with description rewritten.
the code being removed used atomics to track whether any threads might
be using a locale other than the current global locale, and whether
any threads might have abstract 8-bit (non-UTF-8) LC_CTYPE active, a
feature which was never committed (still pending). the motivations
were to support early execution prior to setup of the thread pointer,
to partially support systems (ancient kernels) where thread pointer
setup is not possible, and to avoid high performance cost on archs
where accessing the thread pointer may be very slow.
since commit 19a1fe670a, the thread
pointer is always available, so these hacks are no longer needed.
removing them greatly simplifies the affected code.
commit f630df09b1 added logic to handle
the case where __set_thread_area is called more than once by reusing
the GDT slot already in the %gs register, and only setting up a new
GDT slot when %gs is zero. this created a hidden assumption that %gs
is zero when a new process image starts, which is true in practice on
Linux, but does not seem to be documented ABI, and fails to hold under
qemu app-level emulation.
while it would in theory be possible to zero %gs in the entry point
code, this code is shared between static and dynamic binaries, and
dynamic binaries must not clobber the value of %gs already set up by
the dynamic linker.
the alternative solution implemented in this commit simply uses global
data to store the GDT index that's selected. __set_thread_area should
only be called in the initial thread anyway (subsequent threads get
their thread pointer set up by __clone), but even if it were called by
another thread, it would simply read and write back the same GDT index
that was already assigned to the initial thread, and thus (in the x86
memory model) there is no data race.
compilers targeting armv7 may be configured to produce thumb2 code
instead of arm code by default, and in the future we may wish to
support targets where only the thumb instruction set is available.
the instructions this patch omits in thumb mode are needed only for
non-thumb versions of armv4 or earlier, which are not supported by any
current compilers/toolchains and thus rather pointless to have. at
some point these compatibility return sequences may be removed from
all asm source files, and in that case it would make sense to remove
them here too and remove the ifdef.
compilers targeting armv7 may be configured to produce thumb2 code
instead of arm code by default, and in the future we may wish to
support targets where only the thumb instruction set is available.
the changes made here avoid operating directly on the sp register,
which is not possible in thumb code, and address an issue with the way
the address of _DYNAMIC is computed.
previously, the relative address of _DYNAMIC was stored with an
additional offset of -8 versus the pc-relative add instruction, since
on arm the pc register evaluates to ".+8". in thumb code, it instead
evaluates to ".+4". both are two (normal-size) instructions beyond "."
in the current execution mode, so the numbered label 2 used in the
relative address expression is simply moved two instructions ahead to
be compatible with both instruction sets.
i386, x86_64, x32, and powerpc all use TLS for stack protector canary
values in the default stack protector ABI, but the location only
matched the ABI on i386 and x86_64. on x32, the expected location for
the canary contained the tid, thus producing spurious mismatches
(resulting in process termination) upon fork. on powerpc, the expected
location contained the stdio_locks list head, so returning from a
function after calling flockfile produced spurious mismatches. in both
cases, the random canary was not present, and a predictable value was
used instead, making the stack protector hardening much less effective
than it should be.
in the current fix, the thread structure has been expanded to have
canary fields at all three possible locations, and archs that use a
non-default location must define a macro in pthread_arch.h to choose
which location is used. for most archs (which lack TLS canary ABI) the
choice does not matter.
the 64-bit push reads not only the 32-bit return address but also the
first 32 signal mask bits. if any were nonzero, the return address
obtained will be invalid.
at some point storage of the return address should probably be moved
to follow the saved mask so that there's plenty of room and the same code
can be used on x32 and regular x86_64, but for now I want a fix that
does not risk breaking x86_64, and this simple re-zeroing works.
the lifetime of compound literals is the block in which they appear.
the temporary struct __timespec_kernel objects created as compound
literals no longer existed at the time their addresses were passed to
the kernel.
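a minimal sketch of the lifetime issue and one way around it. the
names struct kernel_ts, consume, and call_with_timeout are
hypothetical stand-ins, not the actual code touched by this commit;
consume plays the role of the syscall that reads the structure.

```c
#include <time.h>

/* hypothetical kernel-side layout, for illustration only */
struct kernel_ts { long long sec; long long nsec; };

/* stand-in for a syscall that dereferences its argument */
long long consume(const struct kernel_ts *ts)
{
	return ts ? ts->sec : -1;
}

long long call_with_timeout(const struct timespec *in)
{
	/* broken pattern, shown only as a comment:
	 *
	 *     const struct kernel_ts *p = 0;
	 *     if (in) {
	 *         p = &(struct kernel_ts){ in->tv_sec, in->tv_nsec };
	 *     }                   // the literal's lifetime ends here
	 *     return consume(p);  // p is dangling
	 */

	/* fix: a named object in the function's outermost block
	 * outlives the call that uses its address */
	struct kernel_ts tmp;
	const struct kernel_ts *p = 0;
	if (in) {
		tmp.sec = in->tv_sec;
		tmp.nsec = in->tv_nsec;
		p = &tmp;
	}
	return consume(p);
}
```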
due to an incorrect return statement in this error case, the
previously blocked cancellation state was not restored and no result
was stored. this could lead to invalid (read) accesses in the caller
resulting in crashes or nonsensical result data in the event of memory
exhaustion.
while the sh port is still experimental and subject to ABI
instability, this is not actually an application/libc boundary ABI
change. it only affects third-party APIs where jmp_buf is used in a
shared structure at the ABI boundary, because nothing anywhere near
the end of the jmp_buf object (which includes the oversized sigset_t)
is accessed by libc.
both glibc and uclibc have 15-slot jmp_buf for sh. presumably the
smaller version was used in musl because the slots for fpu status
register and thread pointer register (gbr) were incorrect and must not
be restored by longjmp, but the size should have been preserved, as
it's generally treated as a libc-agnostic ABI property for the arch,
and having extra slots free in case we ever need them for something is
useful anyway.
previously it was using the same name as the default ABI with hard
float (floating point args and return value in registers).
the test __SH_FPU_ANY__ || __SH4__ matches what's used in the
configure script already, and seems correct under casual review
against gcc's config/sh.h, but may need tweaks. the logic for
predefined macros for sh, and what they all mean, is very complex.
eventually this should be documented in comments here.
configure already rejects "half-hard" configurations on sh where
double=float since these do not conform to Annex F and are not
suitable for musl, so these do not need to be considered here.
both static and dynamic linked versions of the __copy_tls function
have a hidden assumption that the alignment of the beginning or end of
the memory passed is suitable for storing an array of pointers for the
dtv. pthread_create satisfies this requirement except when
libc.tls_size is misaligned, which cannot happen with dynamic linking
due to the way update_tls_size computes the total size, but could happen
with static linking and odd-sized TLS.
commit dab441aea2, which made thread
pointer init mandatory for all programs, rendered this store obsolete
by removing the early-return path for static programs with no TLS.
this slightly reduces the code size cost of TLS/thread-pointer for
static linking since __init_tp can be inlined into its only caller and
removed. this is analogous to the handling of __init_libc in
__libc_start_main, where the function only has external linkage when
it needs to be called from the dynamic linker.
the implicit-operand form of fucomip is rejected by binutils 2.19 and
perhaps other versions still in use. writing both operands explicitly
fixes the issue. there is no change to the resulting output.
commit a732e80d33 was the source of this
regression.
use CAS instead of swap since it's lighter for most archs, and keep
EBUSY in the lock value so that the old value obtained by CAS can be
used directly as the return value for pthread_spin_trylock.
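a sketch of the idea using C11 atomics rather than musl's internal
a_cas; the type spin_t and the function names are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>

typedef atomic_int spin_t;	/* 0 = unlocked, EBUSY = locked */

int spin_trylock(spin_t *s)
{
	int old = 0;
	/* on success old stays 0; on failure it receives the current
	 * lock value, which is EBUSY. either way old is the result. */
	atomic_compare_exchange_strong(s, &old, EBUSY);
	return old;
}

void spin_unlock(spin_t *s)
{
	atomic_store(s, 0);
}
```

storing EBUSY rather than 1 means no translation step is needed
between the observed lock value and the error code.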
the motivation for this change is that the extra declaration (with or
without visibility) using "struct _IO_FILE" instead of "FILE" seems to
trigger a bug in gcc 3.x where it considers the types mismatched.
however, this change also results in slightly better code and it is
valid because (1) these three objects are constant, and (2) applying
the & operator to any of them is invalid C, since they are not even
specified to be objects. thus it does not matter if the application
and libc see different addresses for them, as long as the (initial,
unchanging) value is seen the same by both.
these were hacks to work around toolchains that could not properly
optimize PIC accesses based on visibility and would generate GOT
lookups even for hidden data, which broke the old dynamic linker.
since commit f3ddd17380 it no longer
matters; the dynamic linker does not assume accessibility of this data
until stage 3.
pcc does not search for -include relative to the working directory
unless -I. is used. rather than adding -I., which could be problematic
if there's extra junk in the top-level directory, switch back to the
old method (reverting commit 60ed988fd6)
of using -include vis.h and relying on -I./src/internal being present
on the command line (which the Makefile guarantees). to fix the
breakage that was present in trycppif checks with the old method,
$CFLAGS_AUTO is removed from the command line passed to trycppif; this
is valid since $CFLAGS_AUTO should not contain options that alter
compiler semantics or ABI, only optimizations, warnings, etc.
when the non-stub duplocale code was added as part of the locale
framework in commit 0bc03091bb, the old
code to memcpy the old locale object to the new one was left behind.
the conditional for the memcpy no longer makes sense, because the
conditions are now always-true when it's reached, and the memcpy is
wrong because it clobbers the new->messages_name pointer set up just
above.
since the messages_name and ctype_utf8 members have already been
copied, all that remains is the cat[] array. these pointers are
volatile, so using memcpy to copy them is formally wrong; use a for
loop instead.
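the element-wise copy looks roughly like this. struct loc, NCAT, and
copy_cats are hypothetical names for illustration; only the pattern
(assignment through volatile lvalues instead of memcpy) is the point.

```c
#include <stddef.h>

#define NCAT 4	/* hypothetical category count */

struct loc { const void *volatile cat[NCAT]; };

void copy_cats(struct loc *dst, const struct loc *src)
{
	size_t i;
	/* each element is read and written through a volatile lvalue,
	 * which memcpy would not do */
	for (i = 0; i < NCAT; i++)
		dst->cat[i] = src->cat[i];
}
```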
Some build environments pass -march and -mtune as part of CC, therefore
update configure to check both CC and CFLAGS before making the decision
to fall back to generic -march and -mtune options for x86.
Signed-off-by: Andre McCurdy <armccurdy@gmail.com>
the first switch already returns in the F_SETLKW code path so it need
not be handled in the second switch. moreover the code in the second
switch is wrong for the F_SETLKW command: it's not cancellable.
the leak was found by static analysis (reported by Alexander Monakov),
not tested/observed, but seems to have occurred both when failing due
to O_EXCL, and in a race condition with O_CREAT but not O_EXCL where a
semaphore by the same name was created concurrently.
the allocating path which can fail is for dynamic TLS, which can only
occur at runtime, and the check for runtime was already made in the
outer conditional.
commit 637dd2d383 introduced the checks
for RTLD_DEFAULT and RTLD_NEXT here, claiming they fixed a regression,
but the above conditional block clearly already covered these cases,
and removing the checks produces no difference in the generated code.
the jmp instruction requires a 64-bit register, so cast the desired PC
address up to uint64_t, going through uintptr_t to ensure that it's
zero-extended rather than possibly sign-extended.
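the difference between the two widenings can be seen with plain
integers. the function names are hypothetical, and a uint32_t is used
to model a 32-bit address since the host running this may have 64-bit
pointers; the conversion to int32_t in widen_sign is
implementation-defined but wraps on common compilers.

```c
#include <stdint.h>

/* unsigned source type: upper 32 bits become zero */
uint64_t widen_zero(uint32_t addr)
{
	return (uint64_t)addr;
}

/* signed source type: upper 32 bits copy the sign bit */
uint64_t widen_sign(uint32_t addr)
{
	return (uint64_t)(int32_t)addr;
}

/* the pattern described above: pointer -> uintptr_t -> uint64_t */
uint64_t pc_for_jmp(void *p)
{
	return (uint64_t)(uintptr_t)p;
}
```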
commit de2b67f8d4 introduced a
regression by adding a -include option to CFLAGS_AUTO which did not
work without additional -I options. this broke subsequent trycppif
tests and caused x86_64 to be misdetected as x32, among other issues.
simply using the full relative pathname to vis.h rather than -I is the
cleanest way to fix the problem.
this is implemented via the build system and does not affect source
files. the idea is to use protected or hidden visibility to prevent
the compiler from pessimizing function calls within a shared (or
position-independent static) libc in the form of overhead setting up
for a call through the PLT. the ld-time symbol binding via the
-Bsymbolic-functions option already optimized out the PLT itself, but
not the code in the caller needed to support a call through the PLT.
on some archs this overhead can be substantial; on others it's
trivial.
these are perfectly fine with ld-time symbol binding, but otherwise
result in textrels. they cannot be replaced with @PLT jump targets
because the PLT thunks require a GOT register to be set up, so use a
hidden alias instead.
these are perfectly fine with ld-time symbol binding, but if the calls
go through a PLT thunk, they are invalid because the caller does not
set up a GOT register. use a hidden alias to bypass the issue.
none of these are actual textrels because of ld-time binding performed
by -Bsymbolic-functions, but I'm changing them with the goal of making
ld-time binding purely an optimization rather than relying on it for
semantic purposes.
in the case of memmove's call to memcpy, making it explicit that the
memmove asm is assuming the forward-copying behavior of the memcpy asm
is desirable anyway; in case memcpy is ever changed, the semantic
mismatch would be apparent while editing memcpy.s.
this fixes truncation of error messages containing long pathnames or
symbol names.
the dlerror state was previously required by POSIX to be global. the
resolution of bug 97 relaxed the requirements to allow thread-safe
implementations of dlerror with thread-local state and message buffer.
these functions are never called directly; only their addresses are
used, so PLT indirections should never happen unless a broken
application tries to redefine them, but it's still best to make them
hidden.
the casts of the argument to unsigned int suppressed diagnosis of
errors like passing a pointer instead of a character. putting the
actual function call in an unreachable branch restores any diagnostics
that would be present if the macros didn't exist and functions were
used.
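a sketch of the technique with a hypothetical my_isdigit, so as not
to collide with the real ctype.h names. the call in the dead branch
is never evaluated, but the compiler still type-checks its argument
against the function's int parameter.

```c
/* the real function; defined before the macro so the definition
 * itself is not macro-expanded */
int my_isdigit(int c)
{
	return (unsigned)c - '0' < 10;
}

/* inside its own expansion the name my_isdigit is not expanded again,
 * so the first branch is a call to the function above */
#define my_isdigit(c) (0 ? my_isdigit(c) : ((unsigned)(c) - '0') < 10)
```

with this form, my_isdigit("5") draws the same pointer-to-integer
diagnostic the plain function call would, while valid calls still
compile down to the inline comparison.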
the braf instruction's destination register is an offset from the
address of the braf instruction plus 4 (or equivalently, the address
of the next instruction after the delay slot). the code for dlsym was
incorrectly computing the offset to pass using the address of the
delay slot itself. in other places, a label was placed after the delay
slot, but I find this confusing. putting the label on the branch
instruction itself, and manually adding 4, makes it more clear which
branch the offset in the constant pool goes with.
the conventional way to implement sigsetjmp is to save the signal mask
then tail-call to setjmp; siglongjmp then restores the signal mask and
calls longjmp. the problem with this approach is that a signal already
pending, or arriving between unmasking of signals and restoration of
the saved stack pointer, will have its signal handler run on the stack
that was active before siglongjmp was called. this can lead to
unbounded stack usage when siglongjmp is used to leave a signal
handler.
in the new design, sigsetjmp saves its own return address inside the
extended part of the sigjmp_buf (outside the __jmp_buf part used by
setjmp) then calls setjmp to save a jmp_buf inside its own execution.
it then tail-calls to __sigsetjmp_tail, which uses the return value of
setjmp to determine whether to save the current signal mask or restore
a previously-saved mask.
as an added bonus, this design makes it so that siglongjmp and longjmp
are identical. this is useful because the __longjmp_chk function we
need to add for ABI-compatibility assumes siglongjmp and longjmp are
the same, but for different reasons -- it was designed assuming either
can access a flag just past the __jmp_buf indicating whether the
signal mask was saved, and act on that flag. however, early versions
of musl did not have space past the __jmp_buf for the non-sigjmp_buf
version of jmp_buf, so our setjmp cannot store such a flag without
risking clobbering memory on (very) old binaries.
previously, the dynamic tlsdesc lookup functions and the i386
special-ABI ___tls_get_addr (3 underscores) function called
__tls_get_addr when the slot they wanted was not already set up;
__tls_get_addr would then in turn also see that it's not set up and
call __tls_get_new.
calling __tls_get_new directly is both more efficient and avoids the
issue of calling a non-hidden (public API/ABI) function from asm.
for the special i386 function, a weak reference to __tls_get_new is
used since this function is not defined when static linking (the code
path that needs it is unreachable in static-linked programs).
applying the attribute to a weak_alias macro was a hack. instead use a
separate declaration to apply the visibility, and consolidate
declarations together to avoid having visibility mess all over the
file.
in a few places, non-hidden symbols were referenced from asm in ways
that assumed ld-time binding. while there is no semantic reason these
symbols need to be hidden, fixing the references without making them
hidden was going to be ugly, and hidden reduces some bloat anyway.
in the asm files, .global/.hidden directives have been moved to the
top to unclutter the actual code.
at the point of call it was declared hidden, but the definition was
not hidden. for some toolchains this inconsistency produced textrels
without ld-time binding.
the zero initialization is redundant since decode_vec does its own
clearing, and it increases the risk that buggy compilers will generate
calls to memset. as long as symbols are bound at ld time, such a call
will not break anything, but it may be desirable to turn off ld-time
binding in the future.
this was already essentially possible as a result of the previous
commits changing the dynamic linker/thread pointer bootstrap process.
this commit mainly adds build system infrastructure:
- configure no longer attempts to disable stack protector. instead it
simply determines how so the makefile can disable stack protector for
a few translation units used during early startup.
- stack protector is also disabled for memcpy and memset since compilers
(incorrectly) generate calls to them on some archs to implement
struct initialization and assignment, and such calls may creep into
early initialization.
- no explicit attempt to enable stack protector is made by configure at
this time; any stack protector option supported by the compiler can be
passed to configure in CFLAGS, and if the compiler uses stack
protector by default, this default is respected.
since 1.1.0, musl has nominally required a thread pointer to be set up.
most of the remaining code that was checking for its availability was
doing so for the sake of being usable by the dynamic linker. as of
commit 71f099cb7d, this is no longer
necessary; the thread pointer is now valid before any libc code
(outside of dynamic linker bootstrap functions) runs.
this commit essentially concludes "phase 3" of the "transition path
for removing lazy init of thread pointer" project that began during
the 1.1.0 release cycle.
this allows the dynamic linker itself to run with a valid thread
pointer, which is a prerequisite for stack protector on archs where
the ssp canary is stored in TLS. it will also allow us to remove some
remaining runtime checks for whether the thread pointer is valid.
as long as the application and its libraries do not require additional
size or alignment, this early thread pointer will be kept and reused
at runtime. otherwise, a new static TLS block is allocated after
library loading has finished and the thread pointer is switched over.
previously, the layout of the static TLS block was perturbed by the
size of the dtv; dtv size increasing from 0 to 1 perturbed both TLS
arch types, and the TLS-above-TP type's layout was perturbed by the
specific number of dtv slots (libraries with TLS). this behavior made
it virtually impossible to set up a tentative thread pointer address
before loading libraries and keep it unchanged as long as the
libraries' TLS size/alignment requirements fit.
the new code fixes the location of the dtv and pthread structure at
opposite ends of the static TLS block so that they will not move
unless size or alignment changes.
previously a new GDT slot was requested, even if one had already been
obtained by a previous call. instead extract the old slot number from
GS and reuse it if it was already set. the formula (GS-3)/8 for the
slot number automatically yields -1 (request for new slot) if GS is
zero (unset).
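the arithmetic can be checked in C, with one caveat: (0-3)/8 is -1
only if the division rounds toward negative infinity, as an
arithmetic right shift by 3 does. C's / operator truncates toward
zero and would give 0. that the real code uses a shift is an
assumption here, and gdt_slot_from_gs is a hypothetical name.

```c
/* floor((gs - 3) / 8), written out so that negative values round
 * down instead of toward zero */
int gdt_slot_from_gs(int gs)
{
	int n = gs - 3;
	return n >= 0 ? n / 8 : -((-n + 7) / 8);
}
```

a selector for GDT entry i with requested privilege level 3 has the
value i*8+3, so the formula maps it back to i.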
this overhaul further reduces the amount of arch-specific code needed
by the dynamic linker and removes a number of assumptions, including:
- that symbolic function references inside libc are bound at link time
via the linker option -Bsymbolic-functions.
- that libc functions used by the dynamic linker do not require
access to data symbols.
- that static/internal function calls and data accesses can be made
without performing any relocations, or that arch-specific startup
code handled any such relocations needed.
removing these assumptions paves the way for allowing libc.so itself
to be built with stack protector (among other things), and is achieved
by a three-stage bootstrap process:
1. relative relocations are processed with a flat function.
2. symbolic relocations are processed with no external calls/data.
3. main program and dependency libs are processed with a
fully-functional libc/ldso.
reduction in arch-specific code is achieved through the following:
- crt_arch.h, used for generating crt1.o, now provides the entry point
for the dynamic linker too.
- asm is no longer responsible for skipping the beginning of argv[]
when ldso is invoked as a command.
- the functionality previously provided by __reloc_self for heavily
GOT-dependent RISC archs is now the arch-agnostic stage-1.
- arch-specific relocation type codes are mapped directly as macros
rather than via an inline translation function/switch statement.
this global lock allows certain unlock-type primitives to exclude
mmap/munmap operations which could change the identity of virtual
addresses while references to them still exist.
the original design mistakenly assumed mmap/munmap would conversely
need to exclude the same operations which exclude mmap/munmap, so the
vmlock was implemented as a sort of 'symmetric recursive rwlock'. this
turned out to be unnecessary.
commit 25d12fc0fc already shortened the
interval during which mmap/munmap held their side of the lock, but
left the inappropriate lock design and some inefficiency.
the new design uses a separate function, __vm_wait, which does not
hold any lock itself and only waits for lock users which were already
present when it was called to release the lock. this is sufficient
because of the way operations that need to be excluded are sequenced:
the "unlock-type" operations using the vmlock need only block
mmap/munmap operations that are precipitated by (and thus sequenced
after) the atomic-unlock they perform while holding the vmlock.
this allows for a spectacular lack of synchronization in the __vm_wait
function itself.
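a rough model of the asymmetric design using C11 atomics and a
yielding spin loop in place of futex waits. the names vm_users,
vm_lock, vm_unlock, vm_wait, and vm_busy are hypothetical; this is a
sketch of the shape of the design, not the implementation.

```c
#include <sched.h>
#include <stdatomic.h>

/* number of "unlock-type" operations currently in progress */
static atomic_int vm_users;

void vm_lock(void)
{
	atomic_fetch_add(&vm_users, 1);
}

void vm_unlock(void)
{
	atomic_fetch_sub(&vm_users, 1);
}

/* called on the mmap/munmap side: takes no lock of its own and just
 * waits until no lock users remain */
void vm_wait(void)
{
	while (atomic_load(&vm_users) != 0)
		sched_yield();
}

int vm_busy(void)
{
	return atomic_load(&vm_users) != 0;
}
```

the waiting side never blocks the locking side, which is what lets
the lock users stay cheap.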
as a result of commit 12e1e32468, kernel
processing of the robust list is only needed for process-shared
mutexes. previously the first attempt to lock any owner-tracked mutex
resulted in robust list initialization and a set_robust_list syscall.
this is no longer necessary, and since the kernel's record of the
robust list must now be cleared at thread exit time for detached
threads, optimizing it out is more worthwhile than before too.
the robust list head lies in the thread structure, which is unmapped
before exit for detached threads. this leaves the kernel unable to
process the exiting thread's robust list, and with a dangling pointer
which may happen to point to new unrelated data at the time the kernel
processes it.
userspace processing of the robust list was already needed for
non-pshared robust mutexes in order to perform private futex wakes
rather than the shared ones the kernel would do, but it was
conditional on linking pthread_mutexattr_setrobust and did not bother
processing the pshared mutexes in the list, which requires additional
logic for the robust list pending slot in case pthread_exit is
interrupted by asynchronous process termination.
the new robust list processing code is linked unconditionally (inlined
in pthread_exit), handles both private and shared mutexes, and also
removes the kernel's reference to the robust list before unmapping and
exit if the exiting thread is detached.
depending on the compiler's interpretation of __asm__ register names
for register class objects, it may be possible for the return value in
r2 to be clobbered by the function call to __stat_fix. I have not
observed any such breakage in normal builds and suspect it only
happens with -O0 or other unusual build options, but since there's an
ambiguity as to the semantics of this feature, it's best to use an
explicit temporary to avoid the issue.
based on reporting and patch by Eugene.
when dlopen fails, all partially-loaded libraries need to be unmapped
and freed. any of these libraries using an rpath with $ORIGIN
expansion may have an allocated string for the expanded rpath;
previously, this string was not freed when freeing the library data
structures.
this change hardens the dynamic linker against the possibility of
loading the wrong library due to inability to expand $ORIGIN in rpath.
hard failures such as excessively long paths or absence of /proc (when
resolving /proc/self/exe for the main executable's origin) do not stop
the path search, but memory allocation failures and any other
potentially transient failures do.
to implement this change, the meaning of the return value of the
fixup_rpath function is changed. returning zero no longer indicates
that the dso's rpath string pointer is non-null; instead, the caller
needs to check. a return value of -1 indicates a failure that should
stop further path search.
the C standard specifies that setjmp is a macro, but longjmp is a
normal function. a macro version of it would be permitted (albeit
useless) for C (not C++), but would have to be a function-like macro,
not an object-like one.
transient errors during the path search should not allow the search to
continue and possibly open the wrong file. this patch eliminates most
conditions where that could happen, but there is still a possibility
that $ORIGIN-based rpath processing will have an allocation failure,
causing the search to skip such a path. fixing this is left as a
separate task.
a small bug where overly-long path components caused an infinite loop
rather than being skipped/ignored is also fixed.
while it's the same for all presently supported archs, it differs at
least on sparc, and conceptually it's no less arch-specific than the
other O_* macros. O_SEARCH and O_EXEC are still defined in terms of
O_PATH in the main fcntl.h.
POSIX requires the sem_nsems member to have type unsigned short. we
have to work around the incorrect kernel type using matching
endian-specific padding.
The shm_info struct is a gnu extension and some of its members do
not have shm* prefix. This is worked around in sys/shm.h by macros,
but aarch64 didn't use those.
Internally regcomp needs to copy some iteration nodes before
translating the AST into TNFA representation.
Literal nodes were not copied correctly: the class type and list
of negated class types were not copied so classes were ignored
(in the non-negated case an ignored char class caused the literal
to match everything).
This affects iterations where the upper bound is finite and larger
than one, or the lower bound is larger than one. For example, the EREs
[[:digit:]]{2}
[^[:space:]ab]{1,4}
were treated as
.{2}
[^ab]{1,4}
The fix is done with minimal source modification to copy the
necessary fields, but the AST preparation and node handling
code of tre will need to be cleaned up for clarity.
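A minimal sketch of how the two EREs above can be checked through the
public regcomp/regexec interface. The helper name ere_matches is made up
for this illustration, and the patterns are anchored with ^ and $ so that
a class being silently ignored shows up as a wrong match result.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` matches the ERE `pattern`, 0 if it does not,
 * and -1 if the pattern fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the bug described above, "^[[:digit:]]{2}$" would have accepted
"ab", and "^[^[:space:]ab]{1,4}$" would have accepted "x y".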
The valid BRE backref tokens are \1 .. \9, and 0 is not a special
character either so \0 is undefined by the standard.
Such undefined escaped characters are treated as literal characters
currently, following existing practice, so \0 is the same as 0.
commit 559de8f5f0 redefined FLT_ROUNDS
to use an external function that can report the actual current
rounding mode, rather than always reporting round-to-nearest. however,
float.h did not include 'extern "C"' wrapping for C++, so C++ programs
using FLT_ROUNDS ended up with an unresolved reference to a
name-mangled C++ function __flt_rounds.
one stop condition for parsing abbreviated ipv6 addresses was missed,
allowing the internal ip[] buffer to overflow. this patch adds the
missing stop condition and masks the array index so that, in case
there are any remaining stop conditions missing, overflowing the
buffer is not possible.
one of the features of ERE is that it's actually a regular language
and does not admit expressions which cannot be matched in linear time.
introduction of \n backref support into regcomp's ERE parsing was
unintentional.
the regex parser handles the (undefined) case of an unexpected byte
following a backslash as a literal. however, instead of correctly
decoding a character, it was treating the byte value itself as a
character. this was not only semantically unjustified, but turned out
to be dangerous on archs where plain char is signed: bytes in the
range 252-255 alias the internal codes -4 through -1 used for special
types of literal nodes in the AST.
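the aliasing can be illustrated in isolation. this is only a sketch of the
arithmetic, not the regex parser's code; the function names are
hypothetical, and the conversion to signed char is assumed to wrap in the
usual two's complement way (implementation-defined in C, but what the
affected archs do).

```c
#include <assert.h>

/* The value a byte takes when held in a plain char on an arch where
 * plain char is signed (assumed two's complement wraparound). */
int as_signed_char(unsigned byte)
{
    return (signed char)(unsigned char)byte;
}

/* Going through unsigned char keeps the value in 0..255, so it can
 * never collide with small negative internal codes. */
int as_unsigned_value(char c)
{
    return (unsigned char)c;
}
```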
the previous values (2k min and 8k default) were too small for some
archs. aarch64 reserves 4k in the signal context for future extensions
and requires about 4.5k total, and powerpc reportedly uses over 2k.
the new minimums are chosen to fit the saved context and also allow a
minimal signal handler to run.
since the default (SIGSTKSZ) has always been 6k larger than the
minimum, it is also increased to maintain the 6k usable by the signal
handler. this happens to be able to store one pathname buffer and
should be sufficient for calling any function in libc that doesn't
involve conversion between floating point and decimal representations.
x86 (both 32-bit and 64-bit variants) may also need a larger minimum
(around 2.5k) in the future to support avx-512, but the values on
these archs are left alone for now pending further analysis.
the value for PTHREAD_STACK_MIN is not increased to match MINSIGSTKSZ
at this time. this is so as not to preclude applications from using
extremely small thread stacks when they know they will not be handling
signals. unfortunately cancellation and multi-threaded set*id() use
signals as an implementation detail and therefore require a stack
large enough for a signal context, so applications which use extremely
small thread stacks may still need to avoid using these features.
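for reference, a sketch of how an application typically consumes these
constants: it allocates SIGSTKSZ bytes and registers them with
sigaltstack. the function name install_alt_stack is hypothetical, and the
allocation is intentionally never freed because the stack must stay valid
while it is registered.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

/* Registers a heap-allocated alternate signal stack of the default
 * size. Returns 0 on success, -1 on failure. */
int install_alt_stack(void)
{
    stack_t ss;
    ss.ss_size = SIGSTKSZ;          /* default size: minimum plus headroom */
    ss.ss_sp = malloc(ss.ss_size);
    if (!ss.ss_sp)
        return -1;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

handlers that should run on this stack still need SA_ONSTACK in their
sigaction flags.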
previously the implementation-internal signal used for multithreaded
set*id operations was left unblocked during handling of the
cancellation signal. however, on some archs, signal contexts are huge
(up to 5k) and the possibility of nested signal handlers drastically
increases the minimum stack requirement. since the cancellation signal
handler will do its job and return in bounded time before possibly
passing execution to application code, there is no need to allow other
signals to interrupt it.
previously, a sentinel value of (FILE *)-1 was used to inform the
caller of __nscd_query that nscd is not in use. aside from being an
ugly hack, this resulted in duplicate code paths for two logically
equivalent cases: no nscd, and "not found" result from nscd.
now, __nscd_query simply skips closing the socket and returns a valid
FILE pointer when nscd is not in use, and produces a fake "not found"
response header. the caller is then responsible for closing the socket
just like it would do if it had gotten a real "not found" response.
This completes the alternate backend support that was previously added
to the getpw* and getgr* functions. Unlike those, though, it
unconditionally queries nscd. Any groups from nscd that aren't in the
/etc/group file are added to the returned list, and any that are
present in the file are ignored. The purpose of this behavior is to
provide a view of the group database consistent with what is observed
by the getgr* functions. If group memberships reported by nscd were
honored when the corresponding group already has a definition in the
/etc/group file, the user's getgrouplist-based membership in the
group would conflict with their non-membership in the reported
gr_mem[] for the group.
The changes made also make getgrouplist thread-safe and eliminate its
clobbering of the global getgrent state.
The unwind code in libgcc uses this type for unwinding across signal
handlers. On aarch64 the kernel may place a sequence of structs on the
signal stack on top of the ucontext to provide additional information.
The unwinder only needs the header, but this commit adds all the types the kernel
currently defines for this mechanism because they are part of the uapi.
previously, commit e7b9887e8b aligned
the sizes with the glibc ABI. subsequent discussion during the merge
of the aarch64 port reached a conclusion that we should reject larger
arch-specific sizes, which have significant cost and no benefit, and
stick with the existing common 32-bit sizes for all 32-bit/ILP32 archs
and the x86_64 sizes for 64-bit archs.
one peculiarity of this change is that x32 pthread_attr_t is now
larger in musl than in the glibc x32 ABI, making it unsafe to call
pthread_attr_init from x32 code that was compiled against glibc. with
all the ABI issues of x32, it's not clear that ABI compatibility will
ever work, but if it's needed, pthread_attr_init and related functions
could be modified not to write to the last slot of the object.
this is not a regression versus previous releases, since on previous
releases the x32 pthread type sizes were all severely oversized
already (due to incorrectly using the x86_64 LP64 definitions).
moreover, x32 is still considered experimental and not ABI-stable.
This adds complete aarch64 target support including bigendian subarch.
Some of the long double math functions are known to be broken; otherwise
interfaces should be fully functional, but at this point consider this
port experimental.
Initial work on this port was done by Sireesh Tripurari and Kevin Bortis.
This is in preparation for the aarch64 port only to have the long
double math symbols available on ld128 platforms. The implementations
should be fixed up later once we have proper tests for these functions.
Added bigendian handling for ld128 bit manipulations too.
There are two main abi variants for thread local storage layout:
(1) TLS is above the thread pointer at a fixed offset and the pthread
struct is below that. So the end of the struct is at known offset.
(2) the thread pointer points to the pthread struct and TLS starts
below it. So the start of the struct is at known (zero) offset.
Assembly code for the dynamic TLSDESC callback needs to access the
dynamic thread vector (dtv) pointer which is currently at the front
of the pthread struct. So in case of (1) the asm code needs to hard
code the offset from the end of the struct which can easily break if
the struct changes.
This commit adds a copy of the dtv at the end of the struct. New members
must not be added after dtv_copy, only before it. The size of the struct
is increased a bit, but there is opportunity for size optimizations.
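The layout constraint can be sketched with a reduced struct. The struct
below is hypothetical and far smaller than the real pthread struct; it
only shows why keeping dtv_copy as the last member gives a fixed offset
from the end of the struct no matter what is inserted before it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the pthread struct. */
struct pthread_sketch {
    void **dtv;        /* at the front: fixed offset from the start */
    void *self;
    long tid;
    /* new members go here, before dtv_copy */
    void **dtv_copy;   /* must stay last: fixed offset from the end */
};

/* Distance from dtv_copy to the end of the struct. In layout (1) this
 * is the constant the asm would need to hard-code. */
size_t dtv_copy_offset_from_end(void)
{
    return sizeof(struct pthread_sketch)
         - offsetof(struct pthread_sketch, dtv_copy);
}
```

As long as no member needs alignment stricter than a pointer, the result
is sizeof(void *) and does not change when members are added in the
middle.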
due to a logic error in the use of masked cancellation mode,
pthread_cond_wait did not honor PTHREAD_CANCEL_DISABLE but instead
failed with ECANCELED when cancellation was pending.
a conservative estimate of 4*sizeof(size_t) was used as the minimum
alignment for thread-local storage, despite the only requirements
being alignment suitable for struct pthread and void* (which struct
pthread already contains). additional alignment required by the
application or libraries is encoded in their headers and is already
applied.
over-alignment prevented the builtin_tls array from ever being used in
dynamic-linked programs on 64-bit archs, thereby requiring allocation
at startup even in programs with no TLS of their own.
normally time.h would provide a definition for this struct, but
depending on the feature test macros in use, it may not be exposed,
leading to warnings when it's used in the function prototypes.
these macros have the same distinct definition on blackfin, frv, m68k,
mips, sparc and xtensa kernels. POLLMSG and POLLRDHUP additionally
differ on sparc.
the previous definitions were copied from x86_64. not only did they
fail to match the ABI sizes; they also wrongly encoded an assumption
that long/pointer types are twice as large as int.
this re-check idiom seems to have been copied from the alloc_fwd and
alloc_rev functions, which guess a bin based on non-synchronized
memory access to adjacent chunk headers then need to confirm, after
locking the bin, that the chunk is actually in the bin they locked.
the check being removed, however, was being performed on a chunk
obtained from the already-locked bin. there is no race to account for
here; the check could only fail in the event of corrupt free lists,
and even then it would not catch them but simply continue running.
since the bin_index function is mildly expensive, it seems preferable
to remove the check rather than trying to convert it into a useful
consistency check. casual testing shows a 1-5% reduction in run time.
the malloc init code provided its own version of pthread_once type
logic, including the exact same bug that was fixed in pthread_once in
commit 0d0c2f4034.
since this code is called adjacent to expand_heap, which takes a lock,
there is no reason to have pthread_once-type initialization. simply
moving the init code into the interval where expand_heap already holds
its lock on the brk achieves the same result with much less
synchronization logic, and allows the buggy code to be eliminated
rather than just fixed.
the memory model we use internally for atomics permits plain loads of
values which may be subject to concurrent modification without
requiring that a special load function be used. since a compiler is
free to make transformations that alter the number of loads or the way
in which loads are performed, the compiler is theoretically free to
break this usage. the most obvious concern is with atomic cas
constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be
transformed to a_cas(p,*p,f(*p)); where the latter is intended to show
multiple loads of *p whose resulting values might fail to be equal;
this would break the atomicity of the whole operation. but even more
fundamental breakage is possible.
with the changes being made now, objects that may be modified by
atomics are modeled as volatile, and the atomic operations performed
on them by other threads are modeled as asynchronous stores by
hardware which happens to be acting on the request of another thread.
such modeling of course does not itself address memory synchronization
between cores/cpus, but that aspect was already handled. this all
seems less than ideal, but it's the best we can do without mandating a
C11 compiler and using the C11 model for atomics.
in the case of pthread_once_t, the ABI type of the underlying object
is not volatile-qualified. so we are assuming that accessing the
object through a volatile-qualified lvalue via casts yields volatile
access semantics. the language of the C standard is somewhat unclear
on this matter, but this is an assumption the linux kernel also makes,
and seems to be the correct interpretation of the standard.
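a minimal sketch of the pattern discussed above, assuming a GCC/Clang
builtin as a stand-in for the internal a_cas (the real a_cas is
arch-specific). the names cas_sketch, atomic_apply and add_one are made
up for the example. the object is not declared volatile, as with
pthread_once_t, and is accessed through a volatile-qualified lvalue so
that each loop iteration performs exactly one load.

```c
#include <assert.h>

/* Stand-in for a_cas: returns the value *p held before the operation. */
static int cas_sketch(volatile int *p, int expected, int desired)
{
    return __sync_val_compare_and_swap(p, expected, desired);
}

/* Atomically replaces *obj with f(*obj) and returns the old value. */
int atomic_apply(int *obj, int (*f)(int))
{
    volatile int *p = (volatile int *)obj;  /* volatile access via cast */
    int old;
    do {
        old = *p;  /* single load; the compiler may not re-read *p */
    } while (cas_sketch(p, old, f(old)) != old);
    return old;
}

int add_one(int x)
{
    return x + 1;
}
```

because old is loaded once and reused for both the comparison value and
the argument to f, the transformation to a_cas(p,*p,f(*p)) described
above is ruled out.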
like close, pthread_join is a resource-deallocation function which is
also a cancellation point. the intent of masked cancellation mode is
to exempt such functions from failure with ECANCELED.
previously, the __timedwait function was optionally a cancellation
point depending on whether it was passed a pointer to a cleanup
function and context to register. as of now, only one caller actually
used such a cleanup function (and it may face removal soon); most
callers either passed a null pointer to disable cancellation or a
dummy cleanup function.
now, __timedwait is never a cancellation point, and __timedwait_cp is
the cancellable version. this makes the intent of the calling code
more obvious and avoids ugly dummy functions and long argument lists.
as part of abstracting the futex wait, this function suppresses all
futex error values which callers should not see using a whitelist
approach. when the masked cancellation mode was added, the new
ECANCELED error was not whitelisted. this omission caused the new
pthread_cond_wait code using masked cancellation to exhibit a spurious
wake (rather than acting on cancellation) when the request arrived
after blocking on the cond var.
on most cpu models, "rep stosq" has high overhead that makes it
undesirable for small memset sizes. the new code extends the
minimal-branch fast path for short memsets from size 15 up to size
126, and shrink-wraps this code path. in addition, "rep stosq" is
sensitive to misalignment. the cost varies with size and with cpu
model, but it has been observed performing 1.5 times slower when the
destination address is not aligned mod 16. the new code thus ensures
alignment mod 16, but also preserves any existing additional
alignment, in case there are cpu models where it is beneficial.
this version is based in part on changes proposed by Denys Vlasenko.
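the overall strategy can be sketched in C. this is not the asm being
committed, only an illustration under simplifying assumptions: the short
path is a plain byte loop, the bulk path uses 8-byte copies in place of
"rep stosq", and preservation of additional existing alignment is not
modeled. the names bytes_to_align16 and memset_sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes that must be stored before p reaches a 16-byte boundary. */
size_t bytes_to_align16(const void *p)
{
    return (size_t)(-(uintptr_t)p & 15);
}

void *memset_sketch(void *dest, int c, size_t n)
{
    unsigned char *d = dest;
    unsigned char b = (unsigned char)c;

    if (n <= 126) {                 /* short sizes skip the bulk path */
        for (size_t i = 0; i < n; i++)
            d[i] = b;
        return dest;
    }

    size_t head = bytes_to_align16(d);
    for (size_t i = 0; i < head; i++)
        d[i] = b;
    d += head;                      /* now aligned mod 16 */
    n -= head;

    uint64_t pattern = 0x0101010101010101ULL * b;
    for (; n >= 8; d += 8, n -= 8)  /* stands in for rep stosq */
        memcpy(d, &pattern, 8);
    while (n--)                     /* remaining tail bytes */
        *d++ = b;
    return dest;
}
```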
on most cpu models, "rep stosl" has high overhead that makes it
undesirable for small memset sizes. the new code extends the
minimal-branch fast path for short memsets from size 15 up to size 62,
and shrink-wraps this code path. in addition, "rep stosl" is very
sensitive to misalignment. the cost varies with size and with cpu
model, but it has been observed performing 1.5 to 4 times slower when
the destination address is not aligned mod 16. the new code thus
ensures alignment mod 16, but also preserves any existing additional
alignment, in case there are cpu models where it is beneficial.
this version is based in part on changes to the x86_64 memset asm
proposed by Denys Vlasenko.
the equivalent checks for newly opened stdio output streams, used to
determine buffering mode, are also fixed.
on most archs, the TCGETS ioctl command shares a value with
SNDCTL_TMR_TIMEBASE, part of the OSS sound API which was apparently
used with certain MIDI and timer devices. for file descriptors
referring to such a device, TCGETS will not fail with ENOTTY as
expected; it may produce a different error, or may succeed, and if it
succeeds it changes the mode of the device. while it's unlikely that
such devices are in use, this is in principle very harmful behavior
for an operation which is supposed to do nothing but query whether the
fd refers to a tty.
TIOCGWINSZ, used to query logical window size for a terminal, was
chosen as an alternate ioctl to perform the isatty check. it does not
share a value with any other ioctl commands, and it succeeds on any
tty device.
this change also cleans up strace output to be less ugly and
misleading.
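a sketch of the check described above, using only the public ioctl
interface. the function name isatty_sketch is hypothetical, and the errno
handling (keep EBADF, otherwise report ENOTTY) is an assumption of this
example rather than something stated in the description.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns 1 if fd refers to a tty, 0 otherwise. The window size that
 * the ioctl fills in is discarded; only success or failure matters. */
int isatty_sketch(int fd)
{
    struct winsize wsz;
    if (ioctl(fd, TIOCGWINSZ, &wsz) == 0)
        return 1;
    if (errno != EBADF)
        errno = ENOTTY;
    return 0;
}
```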
due to accidental use of = instead of ==, the error code was always
set to zero in the signaled wake case for non-shared cv waits.
suppressing ETIMEDOUT (the only possible wait error) is harmless and
actually permitted in this case, but suppressing mutex errors could
give the caller false information about the state of the mutex.
commit 8741ffe625 introduced this
regression and commit d9da1fb8c5
preserved it when reorganizing the code.
when we fail to find the entry in the commonly accepted files, we
query a server over a Unix domain socket on /var/run/nscd/socket.
the protocol used here is compatible with glibc's nscd protocol on
most systems (all that use 32-bit numbers for all the protocol fields,
which appears to be everything but Alpha).
errno was treated as the error status when the return value of getline
was negative, but this condition can simply indicate EOF and is not
necessarily an error.
the spurious errors caused by this bug masked the bug which was fixed
in commit fc5a96c9c8.
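the distinction can be sketched as follows. the helper count_lines is
hypothetical; it shows one way to tell EOF apart from a real error after
getline returns a negative value, namely consulting ferror rather than
assuming errno is meaningful.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Counts the lines readable from f. Returns the count, or -1 if the
 * stream hit a read error rather than plain end-of-file. */
long count_lines(FILE *f)
{
    char *line = NULL;
    size_t cap = 0;
    long n = 0;

    while (getline(&line, &cap, f) >= 0)
        n++;
    free(line);

    /* a negative return covers both EOF and errors; ferror tells
     * them apart */
    return ferror(f) ? -1 : n;
}
```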
the wrong condition was used in determining the presence of a result
that needs space/copying for the _r functions. a zero return value
does not necessarily mean success; it can also be a non-error negative
result: no such user/group.
it's possible that signaling a waiter races with cancellation of that
same waiter. previously, cancellation was acted upon, causing the
signal to be consumed with no waiter returning. by using the new
masked cancellation state, it's possible to refuse to act on the
cancellation request and instead leave it pending.
to ease review and understanding of the changes made, this commit
leaves the unwait function, which was previously the cancellation
cleanup handler, in place. additional simplifications could be made by
removing it.
this is a new extension which is presently intended only for
experimental and internal libc use. interface and behavior details may
change subject to feedback and experience from using it internally.
the basic concept for the new PTHREAD_CANCEL_MASKED state is that the
first cancellation point to observe the cancellation request fails
with an errno value of ECANCELED rather than acting on cancellation,
allowing the caller to process the status and choose whether/how to
act upon it.
commit 82dc1e2e78 addressed the
resolution of Austin Group issue 529, which requires close to leave
the fd open when failing with EINTR, by returning the newly defined
error code EINPROGRESS. this turns out to be a bad idea, though, since
legacy applications not aware of the new specification are likely to
interpret any error from close except EINTR as a hard failure.
a_store is only valid for int, but ssize_t may be defined as long or
another type. since there is no valid way for another thread to access
the return value without first checking the error/completion status of
the aiocb anyway, an atomic store is not necessary.
previously, aio operations were not tracked by file descriptor; each
operation was completely independent. this resulted in non-conforming
behavior for non-seekable/append-mode writes (which are required to be
ordered) and made it impossible to implement aio_cancel, which in turn
made closing file descriptors with outstanding aio operations unsafe.
the new implementation is significantly heavier (roughly twice the
size, and seems to be slightly slower) and presently aims mainly at
correctness, not performance.
most of the public interfaces have been moved into a single file,
aio.c, because there is little benefit to be had from splitting them.
whenever any aio functions are used, aio_cancel and the internal
queue lifetime management and fd-to-queue mapping code must be linked,
and these functions make up the bulk of the code size.
the close function's interaction with aio is implemented with weak
alias magic, to avoid pulling in heavy aio cancellation code in
programs that don't use aio, and the expensive cancellation path
(which includes signal blocking) is optimized out when there are no
active aio queues.
the character sequence '$((' was incorrectly interpreted as the
opening of arithmetic even within single-quoted contexts, thereby
suppressing the checks for bad characters after the closing quote.
presently bad character checking is only performed when the WRDE_NOCMD
is used; this patch only corrects checking in that case.
The code does a potentially misaligned 8-byte store to fill the tail
of the buffer. Then it fills the initial part of the buffer
which is a multiple of 8 bytes.
Therefore, if size is divisible by 8, we were storing last word twice.
This patch decrements byte count before dividing it by 8,
making one less store in "size is divisible by 8" case,
and not changing anything in all other cases.
All at the cost of replacing one MOV insn with LEA insn.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
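The store counts can be worked out with a small model. This is only an
arithmetic sketch, assuming n >= 8 on this code path and counting one
8-byte tail store plus the bulk 8-byte stores; the function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* 8-byte stores issued before the patch: tail store plus n/8 bulk. */
size_t stores_before(size_t n)
{
    return 1 + n / 8;
}

/* After the patch the byte count is decremented before dividing. */
size_t stores_after(size_t n)
{
    return 1 + (n - 1) / 8;
}

/* The bulk stores cover [0, 8*((n-1)/8)) and the tail store covers
 * [n-8, n); together they must leave no gap. */
int fully_covered(size_t n)
{
    return 8 * ((n - 1) / 8) + 8 >= n;
}
```

For n = 16 this gives 3 stores before and 2 after; for n = 17 both give
3, matching the claim that only the "divisible by 8" case changes.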
"and $0xff,%esi" is a six-byte insn (81 e6 ff 00 00 00), can use
4-byte "movzbl %sil,%esi" (40 0f b6 f6) instead.
64-bit imul is slow, move it as far up as possible so that the result
(rax) has more time to be ready by the time we start using it
in mem stores.
There is no need to shuffle registers in preparation to "rep movs"
if we are not going to take that code path. Thus, patch moves
"jump if len < 16" instructions up, and changes alternate code path
to use rdx and rdi instead of rcx and r8.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
this syscall allows fexecve to be implemented without /proc. it is new
in linux v3.19, added in commit 51f39a1f0cea1cacf8c787f652f26dfee9611874
(sh and microblaze do not have allocated syscall numbers yet)
added a x32 fix as well: the io_setup and io_submit syscalls are no
longer common with x86_64, so use the x32 specific numbers.
these socket options are new in linux v3.19, introduced in commit
2c8c56e15df3d4c2af3d656e44feb18789f75837 and commit
89aa075832b0da4402acebd698d0411dcc82d03e
with SO_INCOMING_CPU, the cpu on which a socket is managed inside the
kernel can be queried, so that polling of a large number of sockets can
be optimized accordingly.
SO_ATTACH_BPF lets eBPF programs (created by the bpf syscall) be
attached to sockets.
just defining the necessary constants:
LD_B1B_MAX is 2^113 - 1 in base 10^9
KMAX is 2048 so the x array can hold up to 18432 decimal digits
(the worst case is converting 2^-16495 = 5^16495 * 10^-16495 to
binary, it requires the processing of int(log10(5)*16495)+1 = 11530
decimal digits after discarding the leading zeros, the conversion
requires some headroom in x, but KMAX is more than enough for that)
However this code is not optimal on archs with IEEE binary128
long double because the arithmetic is software emulated (on
all such platforms as far as i know) which means big and slow
strtod.
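the digit counts quoted above can be reproduced with a few lines of
arithmetic. this is a sketch only: the macro and function names are made
up, 9 decimal digits per array element follows from the base 10^9
representation, and log10(5) is written as a literal to avoid depending
on libm.

```c
#include <assert.h>

#define KMAX_SKETCH 2048      /* number of base-10^9 elements in x */
#define DIGITS_PER_ELEM 9     /* base 10^9 holds 9 decimal digits */

/* Decimal digits the x array can hold. */
int x_capacity_digits(void)
{
    return KMAX_SKETCH * DIGITS_PER_ELEM;
}

/* Significant digits of 5^16495, the worst case named above:
 * int(log10(5)*16495)+1. */
int worst_case_digits(void)
{
    const double log10_5 = 0.69897000433601886;
    return (int)(log10_5 * 16495) + 1;
}
```

the capacity of 18432 digits leaves roughly 6900 digits of headroom over
the 11530 digit worst case.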
all socket types are accepted at this point, but that may be changed
at a later time if the behavior is not meaningful for other types. as
before, omitting type (a value of 0) gives both UDP and TCP results,
and SOCK_DGRAM or SOCK_STREAM restricts to UDP or TCP, respectively.
for other socket types, the service name argument is required to be a
null pointer, and the protocol number provided by the caller is used.
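a sketch of how the socket type in the hints restricts the results,
using numeric host and service strings so that no name lookup is needed.
the helper name count_results is hypothetical, and the address and port
are arbitrary example values.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolves 127.0.0.1 port 80 with the given socket type in the hints.
 * Returns the number of results, or -1 on lookup failure or if a
 * result carries a socket type other than the one requested. */
int count_results(int socktype)
{
    struct addrinfo hints, *res, *p;
    int n = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    hints.ai_socktype = socktype;   /* 0 means no restriction */
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

    if (getaddrinfo("127.0.0.1", "80", &hints, &res) != 0)
        return -1;
    for (p = res; p; p = p->ai_next) {
        if (socktype && p->ai_socktype != socktype) {
            n = -1;
            break;
        }
        n++;
    }
    freeaddrinfo(res);
    return n;
}
```

with a type of 0, at least two entries (UDP and TCP) are expected; with
SOCK_STREAM or SOCK_DGRAM, only entries of that type.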
x86_64 syscall.h defined some musl internal syscall names and made
them public. These defines were already moved to src/internal/syscall.h
(except for SYS_fadvise which is added now) so the cruft in x86_64
syscall.h is not needed.
in the case where a non-symlink file was replaced by a symlink during
the fchmodat operation with AT_SYMLINK_NOFOLLOW, mode change on the
new symlink target was successfully suppressed, but the error was not
reported. instead, fchmodat simply returned 0.
the specification for execvp itself is unclear as to whether
encountering a file that cannot be executed due to EACCES during the
PATH search is a mandatory error condition; however, XBD 8.3's
specification of the PATH environment variable clarifies that the
search continues until a file with "appropriate execution permissions"
is found.
since it seems undesirable/erroneous to report ENOENT rather than
EACCES when an early path element has a non-executable file and all
later path elements lack any file by the requested name, the new code
stores a flag indicating that EACCES was seen and sets errno back to
EACCES in this case.
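the errno selection can be modeled without actually executing anything.
in this sketch, attempt_errno[i] is the error produced by trying the i-th
PATH element; the function name is hypothetical, and treating ENOTDIR
like ENOENT and stopping on any other error are assumptions of the model,
not statements from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Returns the errno value the PATH search should report when every
 * attempt failed with the given per-element errors. */
int path_search_errno(const int *attempt_errno, size_t n)
{
    int seen_eacces = 0;
    int last = ENOENT;

    for (size_t i = 0; i < n; i++) {
        int e = attempt_errno[i];
        last = e;
        if (e == EACCES)
            seen_eacces = 1;        /* remember it, keep searching */
        else if (e != ENOENT && e != ENOTDIR)
            return e;               /* assumed: other errors stop the search */
    }
    /* prefer EACCES over a later ENOENT */
    return seen_eacces ? EACCES : last;
}
```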
in practice this was probably a non-issue, because the necessary
barrier almost certainly exists in kernel space -- implementing signal
delivery without such a barrier seems impossible -- but for the sake
of correctness, it should be done here too.
in principle, without a barrier, it is possible that the thread to be
cancelled does not see the store of its cancellation flag performed by
another thread. this affects both the case where the signal arrives
before entering the critical program counter range from __cp_begin to
__cp_end (in which case both the signal handler and the inline check
fail to see the value which was already stored) and the case where the
signal arrives during the critical range (in which case the signal
handler should be responsible for cancellation, but when it does not
see the cancellation flag, it assumes the signal is spurious and
refuses to act on it).
in the fix, the barrier is placed only in the signal handler, not in
the inline check at the beginning of the critical program counter
range. if the signal handler runs before the critical range is
entered, it will of course take no action, but its barrier will ensure
that the inline check subsequently sees the store. if on the other
hand the inline check runs first, it may miss seeing the store, but
the subsequent signal handler in the critical range will act upon the
cancellation request. this strategy avoids adding a memory barrier in
the common, non-cancellation code path.
the definitions are generic for all kernel archs. exposure of these
macros now only occurs on the same feature test as for the function
accepting them, which is believed to be more correct.
this typo did not result in an erroneous setjmp with at least binutils
2.22 but fix it for clarity and compatibility with potentially stricter
sh assemblers.
based on patch by Vadim Ushakov. in general overriding LC_ALL rather
than specific categories (here, LC_MESSAGES) is undesirable, but
LC_ALL is easier and in this case there is nothing else that depends
on the locale in this invocation of the compiler.
when using /etc/shadow (rather than tcb) as its backend, getspnam_r
matched any username starting with the caller-provided string rather
than requiring an exact match. in practice this seems to have affected
only systems where one valid username is a prefix for another valid
username, and where the longer username appears first in the shadow
file.
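the difference between the two matching rules can be sketched as below.
the function names and the sample shadow lines are made up for the
illustration; the actual getspnam_r code is not reproduced here.

```c
#include <assert.h>
#include <string.h>

/* Buggy rule: accepts any entry whose name merely starts with `name`. */
int prefix_only_match(const char *line, const char *name)
{
    return strncmp(line, name, strlen(name)) == 0;
}

/* Exact rule: the name must be followed directly by the ':' field
 * separator. line[l] is safe to read because the first l bytes of
 * line matched name and so are non-null. */
int exact_match(const char *line, const char *name)
{
    size_t l = strlen(name);
    return strncmp(line, name, l) == 0 && line[l] == ':';
}
```

a lookup for "bob" against a file that lists "bobby" first returns the
wrong entry under the first rule and skips it under the second.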
as a result of commit e8e4e56a8c,
the later code path for setting optarg to a null pointer is no longer
necessary, and removing it eliminates an indentation level and arguably
makes the code more readable.
the standard getopt does not touch optarg unless processing an option
with an argument. however, programs using the GNU getopt API, which we
attempt to provide in getopt_long, expect optarg to be a null pointer
after processing an option without an argument.
before argument permutation support was added, such programs typically
detected its absence and used their own replacement getopt_long,
masking the discrepancy in behavior.
multi-threaded set*id and setrlimit use the internal __synccall
function to work around the kernel's wrongful treatment of these
process properties as thread-local. the old implementation of
__synccall failed to be AS-safe, despite POSIX requiring setuid and
setgid to be AS-safe, and was not rigorous in assuring that all
threads were caught. in a worst case, threads late in the process of
exiting could retain permissions after setuid reported success, in
which case attacks to regain dropped permissions may have been
possible under the right conditions.
the new implementation of __synccall depends on the presence of
/proc/self/task and will fail if it can't be opened, but is able to
determine that it has caught all threads, and does not use any locks
except its own. it thereby achieves AS-safety simply by blocking
signals to preclude re-entry in the same thread.
with this commit, all known conformance and safety issues in set*id
functions should be fixed.
per POSIX, the EINTR condition is an optional error for these
functions, not a mandatory one. since old kernels (pre-2.6.22) failed
to honor SA_RESTART for the futex syscall, it's dangerous to trust
EINTR from the kernel. thankfully POSIX offers an easy way out.
in the current version of __synccall, the callback is always run, so
failure to handle this case did not matter. however, the upcoming
overhaul of __synccall will have failure cases, in which case the
callback does not run and errno is already set. the changes being
committed now are in preparation for that.
the code being removed was introduced to work around "partial failure"
of multi-threaded set*id() operations, where some threads would
succeed in changing their ids but an RLIMIT_NPROC setting would
prevent the rest from succeeding, leaving the process in an
inconsistent and dangerous state. however, the workaround code did not
handle important usage cases like swapping real and effective uids
then restoring their original values, and the wrongful kernel
enforcement of RLIMIT_NPROC at setuid time was removed in Linux 3.1,
making the workaround obsolete.
since the partial failure still is dangerous on old kernels, and could
in principle happen on post-fix kernels as well if set*id() syscalls
fail for another spurious reason such as resource-related failures,
new code is added to detect and forcibly kill the process if/when such
a situation arises. future documentation releases should be updated to
reflect that setting RLIMIT_NPROC to RLIM_INFINITY is necessary to
avoid this forced-kill on old kernels. ideally, at some point the
kernel will get proper multi-threaded set*id() syscalls capable of
performing their actions atomically, and all of the userspace code to
emulate them can be treated as a fallback for outdated kernels.
opening /dev/tty then using ttyname_r on it does not produce a
canonical terminal name; it simply yields "/dev/tty".
it would be possible to make ctermid determine the actual controlling
terminal device via field 7 of /proc/self/stat, but doing so would
introduce a buffer overflow into applications built with L_ctermid==9,
which glibc defines, adversely affecting the quality of ABI compat.
commit b72cd07f17 added support for
this feature in getopt, but it was later broken in the case where
getopt_long is used as a side effect of the changes made in commit
91184c4f16, which prevented the
underlying getopt call from seeing the leading '-' or '+' character in
optstring.
this commit changes the logic in the getopt_long core to check for a
leading colon, possibly after the leading '-' or '+', without
depending on the latter having been skipped by the caller. a minor
incorrectness in the return value for one error condition in
getopt_long is also fixed when opterr has been set to zero but
optstring has no leading ':'.
based on patch by Dima Krasner, with minor improvements for code size.
connect can fail if there is no listening syslogd, in which case a
useless socket was kept open, preventing subsequent syslog calls from
attempting to connect again.
PR_SET_MM_MAP was introduced as a subcommand for PR_SET_MM in
linux v3.18 commit f606b77f1a9e362451aca8f81d8f36a3a112139e
the associated struct type is replicated in sys/prctl.h using
libc types.
example usage:
struct prctl_mm_map *p;
...
prctl(PR_SET_MM, PR_SET_MM_MAP, p, sizeof *p);
the kernel side supported struct size may be queried with
the PR_SET_MM_MAP_SIZE subcommand.
these syscalls are new in linux v3.18. bpf is present on all
supported archs except sh; kexec_file_load is only allocated for
x86_64 and x32 so far.
bpf was added in linux commit 99c55f7d47c0dc6fc64729f37bf435abf43f4c60
kexec_file_load syscall number was allocated in commit
f0895685c7fd8c938c91a9d8a6f7c11f22df58d2
per the rules for hexadecimal integer constants, the previous
definitions were correctly treated as having unsigned type except
possibly when used in preprocessor conditionals, where all arithmetic
takes place as intmax_t or uintmax_t. the explicit 'u' suffix ensures
that they are treated as unsigned in all contexts.
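as an illustration (a sketch assuming 32-bit int and a wider intmax_t; the
macro and function names are hypothetical and are not the definitions
changed by this commit), the same comparison can evaluate differently in a
preprocessor conditional and in an ordinary expression unless the 'u'
suffix is present:

```c
/* hypothetical constants for illustration; assumes 32-bit int */
#define MASK_PLAIN  0x80000000
#define MASK_SUFFIX 0x80000000u

/* in #if, MASK_PLAIN fits in intmax_t, so it is evaluated as signed */
#if MASK_PLAIN > -1
#define PLAIN_GT_IN_PP 1
#else
#define PLAIN_GT_IN_PP 0
#endif

/* with the suffix, the arithmetic uses uintmax_t and -1 wraps around */
#if MASK_SUFFIX > -1
#define SUFFIX_GT_IN_PP 1
#else
#define SUFFIX_GT_IN_PP 0
#endif

/* in ordinary expressions both constants have type unsigned int */
int plain_gt_in_expr(void)  { return MASK_PLAIN > -1; }
int suffix_gt_in_expr(void) { return MASK_SUFFIX > -1; }
```

only the unsuffixed constant disagrees with itself between the two
contexts; a compiler may warn about the sign change in the conditional.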
based on discussion with and patches by Felix Janda. these changes
started as an effort to factor forkpty in terms of login_tty, which
returns an error and skips fd reassignment and closing if setting the
controlling terminal failed. the previous forkpty code was unable to
handle errors in the child, and did not attempt to; it just silently
ignored them. but this would have been unacceptable when switching to
using login_tty, since the child would start with the wrong stdin,
stdout, and stderr and thereby clobber the parent's files.
the new code uses the same technique as the posix_spawn implementation
to convey any possible error in the child to the parent so that the
parent can report failure to the caller. it is also safe against
thread cancellation and against signal delivery in the child prior to
the determination of success.
being a nonstandard function, this isn't strictly necessary, but it's
inexpensive and avoids unpleasant surprises. eventually I would like
all functions in libc to be safe against cancellation, either ignoring
it or acting on it cleanly.
not only is this semantically more correct; it also reduces code size
slightly by eliminating the need for the compiler to assume the
possibility of aliasing.
if writing the error message fails, POSIX requires that ferror(stderr)
be set. and as a function that operates on a stdio stream, getopt is
required to lock the stream it uses, stderr.
fwrite calls are used instead of fprintf since there is a demand from
some users not to pull in heavy stdio machinery via getopt. this
mimics the original code using write.
this shaves off a useless syscall for getting the caller's pid and
brings raise into alignment with other functions which were adapted to
use tkill rather than tgkill.
commit 83dc6eb087 documents the
rationale for this change, and in particular why the tgkill syscall is
useless for its designed purpose of avoiding races.
formally, it seems a sign is only required when the '+' modifier
appears in the format specifier, in which case either '+' or '-' must
be present in the output. but the specification is written such that
an optional negative sign is part of the output format anyway, and the
simplest approach to fixing the problem is removing the code that was
suppressing the sign.
it's unclear whether compilers which provide pure imaginary types
might produce a pure imaginary expression for 1.0fi. using 0.0f+1.0fi
ensures that the result is explicitly complex and makes this obvious
to human readers too.
based on patches by Jens Gustedt. these macros need to be usable in
static initializers, and the old definitions were not.
there is no portable way to provide correct definitions for these
macros unless the compiler supports pure imaginary types. a portable
definition is provided for this case even though there are presently
no compilers that can use it. gcc and compatible compilers provide a
builtin function that can be used, but clang fails to support this and
instead requires a construct which is a constraint violation and which
is only a constant expression as a clang-specific extension.
since these macros are a namespace violation in pre-C11 profiles, and
since no known pre-C11 compilers provide any way to define them
correctly anyway, the definitions have been made conditional on C11.
this avoids assuming the presence of C11 macro definitions in the
public complex.h, which need changes potentially incompatible with the
way these macros are being used internally.
based on patch by Timo Teräs, with some corrections to bounds checking
code and other minor changes.
while they are borderline scope creep, the functions added are fairly
small and are roughly the minimum code needed to use the results of
the res_query API without re-implementing error-prone DNS packet
parsing, and they are used in practice by some kerberos related
software and possibly other things. at this time there is no intent to
implement further nameser.h API functions.
previously, write errors neither stopped further output attempts nor
caused the function to return an error to the caller. this could
result in silent loss of output, possibly in the middle of output in
the event of a non-permanent error.
the simplest solution is temporarily clearing the error flag for the
target stream, then suppressing further output when the error flag is
set and checking/restoring it at the end of the operation to determine
the correct return value.
since the wide version of the code internally calls the narrow fprintf
to perform some of its underlying operations, initial clearing of the
error flag is suppressed when performing a narrow vfprintf on a
wide-oriented stream. this is not a problem since the behavior of
narrow operations on wide-oriented streams is undefined.
if argv permutation is used, the option terminator "--" should be
moved before any skipped non-option arguments rather than being left
in the argv tail where the caller will see and interpret it.
in the case where an initial '+' was passed in optstring (a
getopt_long feature to suppress argv permutation), getopt would fail
to see a possible subsequent ':', resulting in incorrect handling of
missing arguments.
C++ programmers typically expect something like "::function(x,y)" to work
and may be surprised to find that "(::function)(x,y)" is actually required
due to the headers declaring a macro version of some standard functions.
We already omit function-like macros for C++ in most cases where there is
a real function available. This commit extends this to the remaining
function-like macros which have a real function version.
the write function is a cancellation point and accesses thread-local
state belonging to the calling thread in the parent process. since
cancellation is blocked for the duration of posix_spawn, this is
probably safe, but it's fragile and unnecessary. making the syscall
directly is just as easy and clearly safe.
the resolution of austin group issue #370 removes the requirement that
posix_spawn fail when the close file action is performed on an
already-closed fd. since there are no other meaningful errors for
close, just ignoring the return value completely is the simplest fix.
the previous hard-coded offsets of +1 and +2 contained a hidden
assumption that the option character matched was single-byte, despite
this implementation of getopt attempting to support multibyte option
characters. this patch reworks the matching logic to leave the final
index pointing just past the matched character so that fixed offsets
can be used to check for ':'.
the sched_getaffinity syscall only fills a cpu set up to the set size
used/supported by the kernel. the rest is left untouched and userspace
is responsible for zero-filling it based on the return value of the
syscall.
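a minimal sketch of the zero-filling, assuming a Linux target and using the
generic syscall() wrapper rather than libc-internal syscall macros. the
function name is hypothetical:

```c
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* query the affinity mask and zero whatever the kernel did not fill.
 * on success the raw syscall returns the number of bytes it wrote. */
int getaffinity_zero_filled(pid_t pid, size_t size, void *mask)
{
	long ret = syscall(SYS_sched_getaffinity, pid, size, mask);
	if (ret < 0) return -1;
	if ((size_t)ret < size)
		memset((char *)mask + ret, 0, size - ret);
	return 0;
}
```

size should be a multiple of sizeof(unsigned long) and large enough for the
kernel's cpu mask, otherwise the syscall fails with EINVAL.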
this is a GNU extension, activated by including '-' as the first
character of the options string, whereby non-option arguments are
processed as if they were arguments to an option character '\1' rather
than ending option processing.
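a short usage sketch of the extension. the function name and the option
string "-a" are chosen for illustration only, and the code assumes getopt
has not yet been used in the process:

```c
#include <unistd.h>

/* count non-option arguments, which getopt reports in order as the
 * option character '\1' when optstring starts with '-' */
int count_inorder_nonoptions(int argc, char **argv)
{
	int c, n = 0;
	while ((c = getopt(argc, argv, "-a")) != -1) {
		if (c == 1) n++; /* optarg points at the non-option argument */
	}
	return n;
}
```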
the new DT_RUNPATH semantics for search order are always used, and
since binutils had always set both DT_RPATH and DT_RUNPATH when the
latter was used, processing only DT_RPATH worked fine. however, recent
binutils has stopped generating DT_RPATH when DT_RUNPATH is used,
which broke support for this feature completely.
this file had been a mess that went unnoticed ever since it was
imported. some lines used spaces for indentation while others used tabs,
and tabs were used for alignment.
commit 27828f7e9a fixed compatibility
with clang's internal assembler, but broke compatibility with gas and
the traditional arm asm syntax by switching to the arm "unified
assembler language" (UAL). recent versions of gas also support UAL,
but require the .syntax directive to be used to switch to it. clang on
the other hand defaults to UAL. and old versions of gas (still
relevant) don't support UAL at all.
for the conditional ldm/stm instructions, "ia" is default and can just
be omitted, resulting in a mnemonic that's compatible with both
traditional and UAL syntax. but for byte/halfword loads and stores,
there seems to be no mnemonic compatible with both, and thus .word is
used to produce the desired opcode explicitly. the .inst directive is
not used because it is not compatible with older assemblers.
except powerpc, which still lacks inline syscalls simply because
nobody has written the code, these are all fallbacks used to work
around a clang bug that probably does not exist in versions of clang
that can compile musl. however, it's useful to have the generic
non-inline code anyway, as it eases the task of porting to new archs:
writing inline syscall code is now optional. this approach could also
help support compilers which don't understand inline asm or lack
support for the needed register constraints.
mips could not be unified because it has special fixup code for broken
layout of the kernel's struct stat.
the register constraints in the non-clang case were tested to work on
clang back to 3.2, and earlier versions of clang have known bugs that
preclude building musl.
there may be other reasons to prefer not to use inline syscalls, but
if so the function-call-based implementations should be added back in
a unified way for all archs.
calls to __aeabi_read_tp may be generated by the compiler to access
TLS on pre-v6 targets. previously, this function was hard-coded to
call the kuser helper, which would crash on kernels with kuser helper
removed.
to fix the problem most efficiently, the definition of __aeabi_read_tp
is moved so that it's an alias for the new __a_gettp. however, on v7+
targets, code to initialize the runtime choice of thread-pointer
loading code is not even compiled, meaning that defining
__aeabi_read_tp would have caused an immediate crash due to using the
default implementation of __a_gettp with a HCF instruction.
fortunately there is an elegant solution which reduces overall code
size: putting the native thread-pointer loading instruction in the
default code path for __a_gettp, so that separate default/native code
paths are not needed. this function should never be called before
__set_thread_area anyway, and if it is called early on pre-v6
hardware, the old behavior (crashing) is maintained.
ideally __aeabi_read_tp would not be called at all on v7+ targets
anyway -- in fact, prior to the overhaul, the same problem existed,
but it was never caught by users building for v7+ with kuser disabled.
however, it's possible for calls to __aeabi_read_tp to end up in a v7+
binary if some of the object files were built for pre-v7 targets, e.g.
in the case of static libraries that were built separately, so this
case needs to be handled.
previously, builds for pre-armv6 targets hard-coded use of the "kuser
helper" system for atomics and thread-pointer access, resulting in
binaries that fail to run (crash) on systems where this functionality
has been disabled (as a security/hardening measure) in the kernel.
additionally, builds for armv6 hard-coded an outdated/deprecated
memory barrier instruction which may require emulation (extremely
slow) on future models.
this overhaul replaces the behavior for all pre-armv7 builds (both of
the above cases) to perform runtime detection of the appropriate
mechanisms for barrier, atomic compare-and-swap, and thread pointer
access. detection is based on information provided by the kernel in
auxv: presence of the HWCAP_TLS bit for AT_HWCAP and the architecture
version encoded in AT_PLATFORM. direct use of the instructions is
preferred when possible, since probing for the existence of the kuser
helper page would be difficult and would incur runtime cost.
for builds targeting armv7 or later, the runtime detection code is not
compiled at all, and much more efficient versions of the non-cas
atomic operations are provided by using ldrex/strex directly rather
than wrapping cas.
Processing an option character with optional argument fails if the
option is last on the command line. This happens because the
if (optind >= argc) check runs before testing for an optional
argument.
The C standard is explicit on this:
7.28.1 ... If ps is a null pointer, each function uses its own internal
mbstate_t object instead, which is initialized at program startup to
the initial conversion state;
and these functions are also not supposed to implicitly use the state of
the wchar.h functions:
7.29.6.3 ... The implementation behaves as if no library function calls
these functions with a null pointer for ps.
Previously this resulted in two bugs.
- The functions c16rtomb and mbrtoc16 would crash when called with ps
set to null.
- The function mbrtoc32 used the private state of mbrtowc, which it
is not allowed to do.
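A small sketch of calls that pass a null ps, which the fixed functions are
expected to handle using their own internal state. The helper names are
hypothetical:

```c
#include <stddef.h>
#include <uchar.h>

/* decode the first character of s with ps == NULL; returns the decoded
 * value, 0 for a null character, or (char32_t)-1 on failure */
char32_t decode_first_char32(const char *s, size_t n)
{
	char32_t c;
	size_t r = mbrtoc32(&c, s, n, NULL);
	if (r == (size_t)-1 || r == (size_t)-2) return (char32_t)-1;
	return r ? c : 0;
}

/* encode one UTF-16 code unit with ps == NULL; out needs room for
 * MB_CUR_MAX bytes */
size_t encode_char16(char *out, char16_t c)
{
	return c16rtomb(out, c, NULL);
}
```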
in this case there are two conflicting rules in play: that an explicit
precision of zero with the value zero produces no output, and that the
'#' modifier for octal increases the precision sufficiently to yield a
leading zero. ISO C (7.19.6.1 paragraph 6 in C99+TC3) includes a
parenthetical remark to clarify that the precision-increasing behavior
takes precedence, but the corresponding text in POSIX off of which I
based the implementation is missing this remark.
this issue was covered in WG14 DR#151.
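a small sketch of the two cases, using snprintf with the value zero and an
explicit precision of zero. the helper name is hypothetical:

```c
#include <stdio.h>

/* format the value zero as octal with precision zero, with or without
 * the '#' flag; returns the number of characters produced */
int format_zero_octal(char *buf, size_t size, int alt)
{
	return snprintf(buf, size, alt ? "%#.0o" : "%.0o", 0u);
}
```

without '#' the output is empty; with '#' the ISO C rule requires a single
"0".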
fnstsw does not wait for pending unmasked x87 floating-point exceptions
and it is the same as fstsw when all exceptions are masked which is the
only environment libc supports.
Some early x86_64 cpus (released before 2006) did not support sahf/lahf
instructions so they should be avoided (intel manual says they are only
supported if CPUID.80000001H:ECX.LAHF-SAHF[bit 0] = 1).
The workaround simplifies exp2l and expm1l because fucomip can be
used instead of the fucomp;fnstsw;sahf sequence copied from i386.
In fmodl and remainderl sahf is replaced by a simple bit test.
the kernel syscall interface for or1k does not expect 64-bit arguments
to be aligned to "even" register boundaries. this incorrect alignment
broke truncate/ftruncate as well as a few less-common syscalls.
the idiomatic rounding of x is
n = x + toint - toint;
where toint is either 1/EPSILON (x is non-negative) or 1.5/EPSILON
(x may be negative and nearest rounding mode is assumed) and EPSILON is
according to the evaluation precision (the type of toint is not very
important, because single precision float can represent the 1/EPSILON of
ieee binary128).
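a minimal sketch of the idiom for double arguments, assuming the default
round-to-nearest mode and |x| well below 1/EPSILON. the function name is
hypothetical:

```c
#include <float.h>
#include <math.h>

/* EPSILON of the evaluation format */
#if FLT_EVAL_METHOD == 2
#define EPS LDBL_EPSILON
#else
#define EPS DBL_EPSILON
#endif

/* 1.5/EPS also works for negative x in round-to-nearest mode */
static const double_t toint = 1.5/EPS;

double round_to_integer(double x)
{
	/* at the magnitude of toint the spacing between representable
	 * values is 1, so the addition rounds x to an integer */
	double_t n = (double_t)x + toint - toint;
	return n;
}
```

halfway cases follow the current rounding mode, so in the default mode 2.5
rounds to 2.0 (ties to even).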
in case of FLT_EVAL_METHOD!=0 this avoids a useless store to double or
float precision, and the long double code became cleaner with
1/LDBL_EPSILON instead of ifdefs for toint.
__rem_pio2f and __rem_pio2 functions slightly changed semantics:
on i386 a double-rounding is avoided so close to half-way cases may
get evaluated differently eg. as sin(pi/4-eps) instead of cos(pi/4+eps)
The old code used the rounding idiom incorrectly:
y = (double)(x + 0x1p52) - 0x1p52;
the cast is useless if FLT_EVAL_METHOD==0 and causes a second rounding
if FLT_EVAL_METHOD==2 which can give incorrect result in nearest rounding
mode, so the correct idiom is to add/sub a power-of-2 according to the
characteristics of double_t.
This did not cause an actual bug because only i386 is affected, where rint
is implemented in asm.
Other rounding functions use a similar idiom, but they give correct
results because they only rely on getting a neighboring integer result
and the rounding direction is fixed up separately independently of the
current rounding mode. However they should be fixed to use the idiom
correctly too.
this change is a workaround for the inability of current compilers to
perform "shrink wrapping" optimizations. in casual testing, it roughly
doubled the performance of pthread_once when called on an
already-finished once control object.
these functions need to be fast when the init routine has already run,
since they may be called very often from code which depends on global
initialization having taken place. as such, a fast path bypassing
atomic cas on the once control object was used to avoid heavy memory
contention. however, on archs with weakly ordered memory, the fast
path failed to ensure that the caller actually observes the side
effects of the init routine.
preliminary performance testing showed that simply removing the fast
path was not practical; a performance drop of roughly 85x was observed
with 20 threads hammering the same once control on a 24-core machine.
so the new explicit barrier operation from atomic.h is used to retain
the fast path while ensuring memory visibility.
performance may be reduced on some archs where the barrier actually
makes a difference, but the previous behavior was unsafe and incorrect
on these archs. future improvements to the implementation of a_barrier
should reduce the impact.
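the visibility requirement can be sketched with C11 atomics. this is not the
code from the commit, which uses a_barrier from the internal atomic.h and
blocks waiters rather than spinning; the names and the state encoding below
are assumptions made for illustration:

```c
#include <stdatomic.h>

/* control values: 0 = init not started, 1 = init running, 2 = init done */
int once_sketch(atomic_int *control, void (*init)(void))
{
	/* fast path: the acquire load orders the caller's later reads after
	 * the init routine's writes published by the release store below */
	if (atomic_load_explicit(control, memory_order_acquire) == 2)
		return 0;

	int expected = 0;
	if (atomic_compare_exchange_strong(control, &expected, 1)) {
		init();
		atomic_store_explicit(control, 2, memory_order_release);
	} else {
		/* another thread is running init; a real implementation
		 * would block here instead of spinning */
		while (atomic_load_explicit(control, memory_order_acquire) != 2) {
		}
	}
	return 0;
}

/* demo init routine that counts how often it runs */
static int init_calls;
static void count_init(void) { init_calls++; }
```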
previously, the hours were considered as a signed quantity while
minutes and seconds were always treated as positive offsets. however,
semantically the '-' sign should negate the whole hh:mm:ss offset.
this bug only affected timezones east of GMT with non-whole-hours
offsets, such as those used in India and Nepal.
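a sketch of the corrected interpretation, using a hypothetical standalone
parser for [+|-]hh[:mm[:ss]] with no range checking (not the actual
timezone code):

```c
/* parse an offset into seconds; the sign applies to the whole hh:mm:ss
 * value rather than to the hours field alone */
int parse_offset(const char *s)
{
	int neg = 0, field, total = 0, scale = 3600;
	if (*s == '-') { neg = 1; s++; }
	else if (*s == '+') s++;
	for (;;) {
		for (field = 0; *s >= '0' && *s <= '9'; s++)
			field = 10*field + (*s - '0');
		total += field * scale;
		if (*s != ':' || scale == 1) break;
		s++;
		scale /= 60;
	}
	return neg ? -total : total;
}
```

for "-5:30" this gives -19800 seconds, whereas negating only the hours
would give -18000 + 1800 = -16200.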
new in linux v3.17 commit 40e041a2c858b3caefc757e26cb85bfceae5062b
sealing allows some operations to be blocked on a file which makes
file access safer when fds are shared between processes (only
supported for shared mem fds currently)
flags:
F_SEAL_SEAL prevents further sealing
F_SEAL_SHRINK prevents file from shrinking
F_SEAL_GROW prevents file from growing
F_SEAL_WRITE prevents writes
fcntl commands:
F_GET_SEALS get the current seal flags
F_ADD_SEALS add new seal flags
these syscalls are new in linux v3.17 and present on all supported
archs except sh.
seccomp was added in commit 48dc92b9fc3926844257316e75ba11eb5c742b2c
it has operation, flags and pointer arguments (if flags==0 then it is
the same as prctl(PR_SET_SECCOMP,...)), the uapi header for flag
definitions is linux/seccomp.h
getrandom was added in commit c6e9d6f38894798696f23c8084ca7edbf16ee895
it provides an entropy source when open("/dev/urandom",..) would fail,
the uapi header for flags is linux/random.h
memfd_create was added in commit 9183df25fe7b194563db3fec6dc3202a5855839c
it allows anon mmap to have an fd, that can be shared, sealed and needs no
mount point, the uapi header for flags is linux/memfd.h
previously the external definitions of these functions were omitted on
archs where long double is the same as double, since the code paths in
the math.h macros which would call them are unreachable. however, even
if they are unreachable, the definitions are still mandatory. omitting
them is invalid C, and in the case of a non-optimizing compiler, will
result in a link error.
per the text accepted for inclusion in POSIX, behavior is unspecified
when any of the access mode bits are set. since it's impossible to
consistently report this usage error (O_RDONLY could not be detected
since its value happens to be zero), the most consistent way to handle
them is just to ignore them.
previously, if a caller erroneously passed O_WRONLY, the resulting
access mode would be O_WRONLY|O_RDWR, which has the value 3, and this
resulted in a file descriptor which rejects both read and write
attempts when it is subsequently used.
this function is specified to leave the last byte with "unspecified
disposition" when the length is odd, so for the most part correct
programs should not be calling swab with odd lengths. however, doing
so is permitted, and should not write past the end of the destination
buffer.
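a minimal sketch of a loop with this property, under a hypothetical name so
that it does not clash with the libc function. the trailing byte of an
odd-length input is simply left alone:

```c
#include <sys/types.h>

/* swap adjacent byte pairs from src into dst; only complete pairs are
 * processed, so nothing is written at or beyond dst + (n & ~1) */
void swab_sketch(const void *restrict src, void *restrict dst, ssize_t n)
{
	const char *s = src;
	char *d = dst;
	for (; n > 1; n -= 2) {
		d[0] = s[1];
		d[1] = s[0];
		d += 2;
		s += 2;
	}
}
```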
patch by Jens Gustedt. this fixes a bug reported by Nadav Har'El. the
underlying issue was that a left-shift by 16 bits after promotion of
unsigned short to int caused integer overflow. while some compilers
define this overflow case as "shifting into the sign bit", doing so
doesn't help; the sign bit then gets extended through the upper bits
in subsequent arithmetic as unsigned long long. this patch imposes a
promotion to unsigned prior to the shift, so that the result is
well-defined and matches the specified behavior.
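the promotion can be sketched as follows, assuming 32-bit int and 16-bit
unsigned short. the function name is hypothetical:

```c
#include <stdint.h>

/* combine three 16-bit words, least significant first, into one value.
 * adding 0U or 0ULL converts each operand to an unsigned type before
 * the shift, so no signed overflow can occur */
uint64_t combine48(const unsigned short w[3])
{
	return w[0] | ((w[1] + 0U) << 16) | ((w[2] + 0ULL) << 32);
}
```

with a plain w[1] << 16, a word such as 0xffff would be shifted as a signed
int and the overflow would be undefined.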
commit 5345c9b884 added a linked list to
track the FILE streams currently locked (via flockfile) by a thread.
due to a failure to fully link newly added members, removal from the
list could leave behind references which could later result in writes
to already-freed memory and possibly other memory corruption.
implicit stdio locking was unaffected; the list is only used in
conjunction with explicit flockfile locking.
this bug was not present in any releases; it was introduced and fixed
during the same release cycle.
patch by Timo Teräs, who discovered and tracked down the bug.
incorrect behavior occurred only in cases where the input overflows
unsigned long long, not just the (possibly lower) range limit for the
result type. in this case, processing of the '-' sign character was
not suppressed, and the function returned a value of 1 despite setting
errno to ERANGE.
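a sketch of the expected behavior, using strtoull as a representative of
the affected functions (an assumption, since they are not named here). the
helper name is hypothetical:

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* returns 1 if parsing s overflowed and the result was clamped to
 * ULLONG_MAX with errno set to ERANGE, 0 otherwise */
int overflow_is_clamped(const char *s)
{
	errno = 0;
	unsigned long long v = strtoull(s, 0, 10);
	return errno == ERANGE && v == ULLONG_MAX;
}
```

an in-range negative input such as "-1" is still negated as usual and does
not set ERANGE.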
The new code is a bit simpler and the generated code is about 1KB
smaller (on i386). The basic design was kept, including internal
interfaces; TNFA generation was not touched.
The old tre parser had various issues:
[^aa-z]
negated overlapping ranges in a bracket expression were handled
incorrectly (eg [^aa-z] was handled as [^a] instead of [^a-z])
a{,2}
missing lower bound in a counted repetition should be an error,
but it was accepted with broken semantics: a{,2} was treated as
a{0,3}, the new parser rejects it
a{999,}
large min count was not rejected (a{5000,} failed with REG_ESPACE
due to reaching a stack limit), the new parser enforces the
RE_DUP_MAX limit
\xff
regcomp used to accept a pattern with illegal sequences in it
(treated them as empty expression so p\xffq matched pq) the new
parser rejects such patterns with REG_BADPAT or REG_ERANGE
[^b-fD-H] with REG_ICASE
old parser turned this into [^b-fB-F] because of the negated
overlapping range issue (see above), the new parser treats it
as [^b-hB-H], POSIX seems to require [^d-fD-F], but practical
implementations do case-folding first and negate the character
set later instead of the other way around. (Supporting the posix
way efficiently would require significant changes so it was left
as is, it is unclear if any application actually expects the
posix behaviour, this issue is raised on the austingroup tracker:
http://austingroupbugs.net/view.php?id=872 ).
another case-insensitive matching issue is that unicode case
folding rules can group more than two characters together while
towupper and towlower can only work for a pair of upper and
lower case characters, this is a limitation of POSIX so it is
not fixed.
invalid bracket and brace expressions may return different error
codes now (REG_ERANGE instead of REG_EBRACK or REG_BADBR instead
of REG_EBRACE) otherwise the new parser should be compatible with
the old one.
regcomp should be able to handle arbitrary pattern input if the
pattern length is limited, the only exception is the use of large
repetition counts (eg. (a{255}){255}) which require an exponential amount
of memory and there is no easy workaround.
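The first issue above (negated overlapping ranges) can be checked through
the public regcomp/regexec interface. This is a sketch with a hypothetical
helper; the anchors in the usage are added only to force a whole-string
match:

```c
#include <regex.h>

/* returns 1 if s matches the POSIX extended regex, 0 if it does not,
 * and -1 if the pattern fails to compile */
int ere_matches(const char *pattern, const char *s)
{
	regex_t re;
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB)) return -1;
	int r = regexec(&re, s, 0, 0, 0);
	regfree(&re);
	return r == 0;
}
```

With the new parser, ere_matches("^[^aa-z]$", "b") should report no match,
since [^aa-z] now means [^a-z] rather than [^a].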
the C11 _Alignas keyword is not present in C++, and despite it being
in the reserved namespace and thus reasonable to support even in
non-C11 modes, compilers seem to fail to support it.
as a result of commit ab8f6a6e42, this
definition is now equivalent to the actual "default profile" which
appears immediately below in features.h, and which defines both
_BSD_SOURCE and _XOPEN_SOURCE.
the intent of providing a _DEFAULT_SOURCE, which glibc also now
provides, is to give applications a way to "get back" the default
feature profile when it was lost either by compiler flags that inhibit
it (such as -std=c99) or by library-provided predefined macros (such
as -D_POSIX_C_SOURCE=200809L) which may inhibit exposure of features
that were otherwise visible by default and which the application may
need. without _DEFAULT_SOURCE, the application had to encode knowledge of
a particular libc's defaults, and such knowledge was fragile and
subject to bitrot.
eventually the names _GNU_SOURCE and _BSD_SOURCE should be phased out
in favor of the more-descriptive and more-accurate _ALL_SOURCE and
_DEFAULT_SOURCE, leaving the old names as aliases but using the new
ones internally. however this is a more invasive change that would
require extensive regression testing, so it is deferred.
the vast majority of these failures seem to have been oversights at
the time _BSD_SOURCE was added, or perhaps shortly afterward. the one
which may have had some reason behind it is omission of setpgrp from
the _BSD_SOURCE feature profile, since the standard setpgrp interface
conflicts with a legacy (pre-POSIX) BSD interface by the same name.
however, such omission is not aligned with our general policy in this
area (for example, handling of similar _GNU_SOURCE cases) and should
not be preserved.
__polevll, __p1evll and exp10l were provided on archs where long double
is the same as double. The first two were completely unused and exp10l
can be a wrapper around exp10.
open file description locks are inherited across fork and only auto
dropped after the last fd of the file description is closed, they can be
used to synchronize between threads that open separate file descriptions
for the same file.
new in linux 3.15 commit 0d3f7a2dd2f5cf9642982515e020c1aee2cf7af6
based on patch by Jens Gustedt.
the main difficulty here is handling the difference between start
function signatures and thread return types for C11 threads versus
POSIX threads. pointers to void are assumed to be able to represent
faithfully all values of int. the function pointer for the thread
start function is cast to an incorrect type for passing through
pthread_create, but is cast back to its correct type before calling so
that the behavior of the call is well-defined.
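the cast round trip can be sketched without creating any threads. all type
and function names below are hypothetical stand-ins for the internal thread
structure and start routine:

```c
#include <stdint.h>

typedef int (*c11_start_fn)(void *);      /* C11 start function type */
typedef void *(*posix_start_fn)(void *);  /* POSIX start function type */

struct start_info { posix_start_fn start; void *arg; };

/* store the C11 start function under the POSIX function pointer type */
void set_c11_start(struct start_info *si, c11_start_fn f, void *arg)
{
	si->start = (posix_start_fn)f;
	si->arg = arg;
}

/* cast back to the correct type before calling, then carry the int
 * result in a pointer to void */
void *run_c11_start(struct start_info *si)
{
	c11_start_fn f = (c11_start_fn)si->start;
	return (void *)(uintptr_t)f(si->arg);
}

/* demo start function */
static int demo_start(void *arg) { return *(int *)arg + 1; }
```

the function is never called through the wrong type, which is what keeps the
behavior defined.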
changes to the existing threads implementation were kept minimal to
reduce the risk of regressions, and duplication of code that carries
implementation-specific assumptions was avoided for ease and safety of
future maintenance.
These all have POSIX equivalents, but aside from tss_get, they all
have minor changes to the signature or return value and thus need to
exist as separate functions.
based on patch by Jens Gustedt.
mtx_t and cnd_t are defined in such a way that they are formally
"compatible types" with pthread_mutex_t and pthread_cond_t,
respectively, when accessed from a different translation unit. this
makes it possible to implement the C11 functions using the pthread
functions (which will dereference them with the pthread types) without
having to use the same types, which would necessitate either namespace
violations (exposing pthread type names in threads.h) or incompatible
changes to the C++ name mangling ABI for the pthread types.
for the rest of the types, things are much simpler; using identical
types is possible without any namespace considerations.
The intent of this is to avoid name space pollution of the C threads
implementation.
This has two sides to it. First we have to provide symbols that wouldn't
pollute the name space for the C threads implementation. Second we have
to clean up some internal uses of POSIX functions such that they don't
implicitly drag in such symbols.
there is no blksize64_t (blksize_t is always long) but there are
fsblkcnt64_t and fsfilcnt64_t types in sys/stat.h and sys/types.h.
and glob.h missed glob64_t.
versionsort64, aio*64 and lio*64 symbols were missing, they are
only needed for glibc ABI compatibility, on the source level
dirent.h and aio.h already redirect them.
this error resulted in an out-of-bounds read, as opposed to a reported
error, when calling the function with an argument one greater than the
max valid index.
if the loop stopped due to reaching the end of the string, the
subsequent increment could possibly move the position one past the end
of the buffer. no further writes happen, the reads cannot fault anyway
unless the stack completely lacks any zero bytes, and reading junk
should not yield an incorrect result from the function either.
nonetheless the code was wrong and needs to be fixed.
the condition was probably intended to be !*p rather than !p, but
neither is needed here. the subsequent code naturally handles the case
where it's already at end of string.
U+00DF ('ß') has had an uppercase form (U+1E9E) available since
Unicode 5.1, but Unicode lacks the case mappings for it due to
stability policy. when I added support for the new character in commit
1a63a9fc30, I omitted the mapping in the
lowercase-to-uppercase direction. this choice was not based on any
actual information, only assumptions.
this commit adds bidirectional case mappings between U+00DF and
U+1E9E, and removes the special-case hack that allowed U+00DF to be
identified as lowercase despite lacking a mapping. aside from strong
evidence that this is the "right" behavior for real-world usage of
these characters, several factors informed this decision:
- the other "potentially correct" mapping, to "SS", is not
representable in the C case-mapping system anyway.
- leaving one letter in lowercase form when transforming a string to
uppercase is obviously wrong.
- having a character which is nominally lowercase but which is fixed
under case mapping violates reasonable invariants.
per POSIX these functions are both cancellation points, so they must
act on any cancellation request which is pending prior to the call.
previously, only the code path where actual waiting took place could
act on cancellation.
the outer getnameinfo function already has a properly-sized temporary
buffer for storing the reverse dns (ptr) result. there is no reason
for the callback to use a secondary buffer and copy it on success, and
doing so potentially expanded the impact of the dn_expand bug that was
fixed in commit 49d2c8c6bc.
this change reduces the code size by a small amount, and also reduces
the run-time stack space requirements by about 256 bytes.
previously, fgets, fputs, fread, and fwrite completely omitted locking
and access to the FILE object when their arguments yielded a zero
length read or write operation independent of the FILE state. this
optimization was invalid; it wrongly skipped marking the stream as
byte-oriented (a C conformance bug) and exposed observably missing
synchronization (a POSIX conformance bug) where one of these functions
could wrongly complete despite another thread provably holding the
lock.
the C standard requires that "the contents of the array remain
unchanged" in this case.
this patch also changes the behavior on read errors, but in that case
"the array contents are indeterminate", so the application cannot
inspect them anyway.
Empty name was rejected in dn_expand since commit
56b57f37a4
which is a regression as reported by Natanael Copa.
Furthermore if an offset pointer in a compressed name
pointed to a terminating 0 byte (instead of a label)
the returned name was not null terminated.
this function is needed for some important practical applications of
ABI compatibility, and may be useful for supporting some non-portable
software at the source level too.
I was hesitant to add a function which imposes any constraints on
malloc internals; however, it turns out that any malloc implementation
which has realloc must already have an efficient way to determine the
size of existing allocations, so no additional constraint is imposed.
for now, some internal malloc definitions are duplicated in the new
source file. if/when malloc is refactored to put them in a shared
internal header file, these could be removed.
since malloc_usable_size is conventionally declared in malloc.h, the
empty stub version of this file was no longer suitable. it's updated
to provide the standard allocator functions, nonstandard ones (even if
stdlib.h would not expose them based on the feature test macros in
effect), and any malloc-extension functions provided (currently, only
malloc_usable_size).
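as a rough illustration of what the declaration in malloc.h makes possible, the sketch below queries the usable size of a fresh allocation. the helper name usable_size_of_new_block is hypothetical and exists only for this example; malloc_usable_size itself is the function discussed above.

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>

/* Hypothetical helper: allocate `request` bytes, report how many bytes the
 * allocator says are usable in that block, then release it.
 * Returns 0 if the allocation fails. */
size_t usable_size_of_new_block(size_t request)
{
	assert(request > 0); /* a zero-size request may legitimately return NULL */
	void *p = malloc(request);
	if (!p) return 0;
	size_t usable = malloc_usable_size(p);
	free(p);
	return usable;
}
```

the reported size is expected to be at least the requested size, and may be larger because of allocator rounding.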
if there is already a waiter for a lock, spinning on the lock is
essentially an attempt to steal it from whichever waiter would obtain
it via any priority rules in place, and is therefore undesirable. in
the current implementation, there is always an inherent race window at
unlock during which a newly-arriving thread may steal the lock from
the existing waiters, but we should aim to keep this window minimal
rather than enlarging it.
empirically, this increases the maximum rate of wait/post operations
between two threads by 20-150 times on machines I tested, including
x86 and arm. conceptually, it makes sense to do some spinning because
semaphores are intended to be usable as a notification mechanism
between threads, not just as locks, and low-latency notification is a
valuable property to have.
the previous spin limit of 10000 was utterly unreasonable.
empirically, it could consume up to 200000 cycles, whereas a failed
futex wait (EAGAIN) typically takes 1000 cycles or less, and even a
true wait/wake round seems much less expensive.
the new counts (100 for general wait, 200 in barrier) were simply
chosen to be in the range of what's reasonable without having adverse
effects on casual micro-benchmark tests I have been running. they may
still be too high, from a standpoint of not wasting cpu cycles, but at
least they're a lot better than before. rigorous testing across
different archs and cpu models should be performed at some point to
determine whether further adjustments should be made.
conceptually, a_spin needs to be at least a compiler barrier, so the
compiler will not optimize out loops (and the load on each iteration)
while spinning. it should also be a memory barrier, or the spinning
thread might keep spinning without noticing stores from other threads,
thus delaying for longer than it should.
ideally, an optimal a_spin implementation that avoids unnecessary
cache/memory contention should be chosen for each arch, but for now,
the easiest thing is to perform a useless a_cas on the calling
thread's stack.
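a minimal sketch of the idea, written with C11 atomics rather than the arch-specific a_cas primitive, might look like the following. the names spin_hint and spin_until_set are hypothetical, and whether a given compiler keeps the compare-and-swap on a local object is an assumption of this sketch, not a guarantee.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for a_spin: a compare-and-swap on an object in the caller's own
 * stack frame, whose result is deliberately ignored. */
void spin_hint(void)
{
	atomic_int dummy = 0;
	int expected = 0;
	atomic_compare_exchange_strong(&dummy, &expected, 0);
}

/* Spin at most max_spins times waiting for *flag to become nonzero.
 * Returns 1 if it was observed set, 0 if the caller should fall back to a
 * real (futex) wait. */
int spin_until_set(atomic_int *flag, int max_spins)
{
	assert(flag);
	while (max_spins-- > 0) {
		if (atomic_load(flag)) return 1; /* reloaded on every iteration */
		spin_hint();
	}
	return atomic_load(flag) != 0;
}
```

with a bounded count such as the 100 mentioned above, the loop gives up quickly when the flag is not set, rather than burning cycles.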
this is analogous to commit fffc5cda10
which fixed the corresponding issue for mutexes.
the robust list can't be used here because the locks do not share a
common layout with mutexes. at some point it may make sense to simply
incorporate a mutex object into the FILE structure and use it, but
that would be a much more invasive change, and it doesn't mesh well
with the current design that uses a simpler code path for internal
locking and pulls in the recursive-mutex-like code when the flockfile
API is used explicitly.
the subsequent code in pthread_create and the code which copies TLS
initialization images to the new thread's TLS space assume that the
memory provided to them is zero-initialized, which is true when it's
obtained by pthread_create using mmap. however, when the caller
provides a stack using pthread_attr_setstack, pthread_create cannot
make any assumptions about the contents. simply zero-filling the
relevant memory in this case is the simplest and safest fix.
unfortunately this needs to be able to vary by arch, because of a huge
mess GCC made: the GCC definition, which became the ABI, depends on
quirks in GCC's definition of __alignof__, which does not match the
formal alignment of the type.
GCC's __alignof__ unexpectedly exposes an implementation detail,
its "preferred alignment" for the type, rather than the formal/ABI
alignment of the type, which it only actually uses in structures. on
most archs the two values are the same, but on some (at least i386)
the preferred alignment is greater than the ABI alignment.
I considered using _Alignas(8) unconditionally, but on at least one
arch (or1k), the alignment of max_align_t with GCC's definition is
only 4 (even the "preferred alignment" for these types is only 4).
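the difference between the two notions of alignment can be observed with a small probe. this is only a sketch: struct probe and both function names are hypothetical, and __alignof__ is the GCC extension discussed above (also accepted by clang).

```c
#include <assert.h>
#include <stddef.h>

/* The offset of `ll` after a single char equals the alignment the compiler
 * actually uses for long long inside structures. */
struct probe { char c; long long ll; };

size_t layout_align_ll(void)
{
	size_t off = offsetof(struct probe, ll);
	assert(off != 0 && (off & (off - 1)) == 0); /* alignments are powers of two */
	return off;
}

/* Value reported by the GCC extension; may be the "preferred" alignment. */
size_t reported_align_ll(void)
{
	return __alignof__(long long);
}
```

on archs where the two agree the functions return the same value; on an arch like i386 the reported value would be expected to be the larger one.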
the main idea of the changes made is to have waiters wait directly on
the "barrier" lock that was used to prevent them from making forward
progress too early rather than first waiting on the atomic state value
and then attempting to lock the barrier.
in addition, adjustments to the mutex waiter count are optimized.
previously, each waking waiter decremented the count (unless it was
the first) then immediately incremented it again for the next waiter
(unless it was the last). this was a roundabout way of achieving the
equivalent of incrementing it once for the first waiter and
decrementing it once for the last.
previously, wake order could be unpredictable: if a waiter happened to
leave its futex wait on the state early, e.g. due to EAGAIN while
restarting after a signal handler, it could acquire the mutex out of
turn. handling this required ugly O(n) list walking in the unwait
function and accounting to remove waiters that already woke from the
list.
with the new changes, the "barrier" locks in each waiter node are only
unlocked in turn. in addition to simplifying the code, this seems to
improve performance slightly, probably by reducing the number of
accesses threads make to each other's stacks.
as an additional benefit, unrecoverable mutex re-locking errors
(mainly ENOTRECOVERABLE for robust mutexes) no longer need to be
handled with deadlock; they can be reported to the caller, since the
unlocking sequence makes it unnecessary to rely on the mutex to
synchronize access to the waiter list.
the immediate issue that was reported by Jens Gustedt and needed to be
fixed was corruption of the cv/mutex waiter states when switching to
using a new mutex with the cv after all waiters were unblocked but
before they finished returning from the wait function.
self-synchronized destruction was also handled poorly and may have had
race conditions. and the use of sequence numbers for waking waiters
admitted a theoretical missed-wakeup if the sequence number wrapped
through the full 32-bit space.
the new implementation is largely documented in the comments in the
source. the basic principle is to use linked lists initially attached
to the cv object, but detachable on signal/broadcast, made up of nodes
residing in automatic storage (stack) on the threads that are waiting.
this eliminates the need for waiters to access the cv object after
they are signaled, and allows us to limit wakeup to one waiter at a
time during broadcasts even when futex requeue cannot be used.
performance is also greatly improved, roughly doubling in some tests.
basically nothing is changed in the process-shared cond var case,
where this implementation does not work, since processes do not have
access to one another's local storage.
when the kernel is responsible for waking waiters on a robust mutex
whose owner died, it does not have a waiters count available and must
rely entirely on the waiter bit of the lock value.
normally, this bit is only set by newly arriving waiters, so it will
be clear if no new waiters arrived after the current owner obtained
the lock, even if there are other waiters present. leaving it clear is
desirable because it allows timed-lock operations to remove themselves
as waiters and avoid causing unnecessary futex wake syscalls. however,
for process-shared robust mutexes, we need to set the bit whenever
there are existing waiters so that the kernel will know to wake them.
for non-process-shared robust mutexes, the wake happens in userspace
and can look at the waiters count, so the bit does not need to be set
in the non-process-shared case.
when manipulating the robust list, the order of stores matters,
because the code may be asynchronously interrupted by a fatal signal
and the kernel will then access the robust list in what is essentially
an async-signal context.
previously, aliasing considerations made it seem unlikely that a
compiler could reorder the stores, but proving that they could not be
reordered incorrectly would have been extremely difficult. instead
I've opted to make all the pointers used as part of the robust list,
including those in the robust list head and in the individual mutexes,
volatile.
in addition, the format of the robust list has been changed to point
back to the head at the end, rather than ending with a null pointer.
this is to match the documented kernel robust list ABI. the null
pointer, which was previously used, only worked because faults during
access terminate the robust list processing.
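the list shape described here can be sketched as follows. this is a simplified illustration, not the kernel's or musl's structure layout: struct rnode and the rlist_* functions are hypothetical names, and the real robust list head carries additional fields.

```c
#include <assert.h>
#include <stddef.h>

/* Link pointers are volatile so the compiler may not reorder or elide the
 * stores that build the list. */
struct rnode { struct rnode *volatile next; };

/* An empty list points back at its own head instead of holding NULL. */
void rlist_init(struct rnode *head)
{
	assert(head);
	head->next = head;
}

/* Insert at the front. The new node is fully linked before the head is
 * updated, so a walker interrupting between the two stores always sees a
 * list that terminates at the head. */
void rlist_push(struct rnode *head, struct rnode *n)
{
	assert(head && n);
	n->next = head->next;
	head->next = n;
}

/* Walk until the pointer returns to the head. */
size_t rlist_count(struct rnode *head)
{
	size_t count = 0;
	for (struct rnode *p = head->next; p != head; p = p->next) count++;
	return count;
}
```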
a robust mutex should not enter the unrecoverable status until it's
unlocked without marking it consistent. previously, flag 8 in the type
was used as an indication of unrecoverable, but only honored after
successful locking; this resulted in a race window where the
unrecoverable mutex could appear to a second thread as locked/busy
again while the first thread was in the process of observing it as
unrecoverable.
now, flag 8 is used to mean that the mutex is in the process of being
recovered, but not yet marked consistent. the flag only takes effect
in pthread_mutex_unlock, where it causes the value 0x40000000 (owner
dead flag, with old owner tid 0, an otherwise impossible state) to be
stored in the lock. subsequent lock attempts will interpret this state
as unrecoverable.
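one way to picture the resulting lock states is a small classifier over the lock word. this is a sketch only: the enum and function names are hypothetical, the owner-dead value 0x40000000 comes from the text above, and the waiter bit and tid mask are assumed to follow the Linux futex conventions (0x80000000 and 0x3fffffff).

```c
#include <assert.h>
#include <stdint.h>

#define OWNER_DIED 0x40000000u /* owner-dead flag, as described above */
#define WAITERS    0x80000000u /* assumed: futex waiter bit */
#define TID_MASK   0x3fffffffu /* assumed: bits holding the owner tid */

static_assert((OWNER_DIED & TID_MASK) == 0, "flag must not overlap tid bits");

enum lock_state { LS_UNLOCKED, LS_LOCKED, LS_OWNER_DIED, LS_UNRECOVERABLE };

/* Classify a lock word. Owner-dead with tid 0 cannot arise from a real
 * owner, so it is free to mean "unrecoverable". */
enum lock_state classify_lock_word(uint32_t v)
{
	uint32_t tid = v & TID_MASK;
	if (v & OWNER_DIED)
		return tid ? LS_OWNER_DIED : LS_UNRECOVERABLE;
	return tid ? LS_LOCKED : LS_UNLOCKED;
}
```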
per the resolution of Austin Group issue 755, the POSIX requirement
that ownership be enforced for recursive and error-checking mutexes
does not allow a random new thread to acquire ownership of an orphaned
mutex just because it happened to be assigned the same tid as the
original owner that exited with the mutex locked.
one possible fix for this issue would be to prevent the kernel thread
from terminating when it exited with mutexes held, permanently reserving
the tid against reuse. however, this does not solve the problem for
process-shared mutexes where lifetime cannot be controlled, so it was
not used.
the alternate approach I've taken is to reuse the robust mutex system
for non-robust recursive and error-checking mutexes. when a thread
exits, the kernel (or the new userspace robust-list code added in
commit b092f1c5fa) will set the
owner-died bit for these orphaned mutexes, but since the mutex-type is
not robust, pthread_mutex_trylock will not allow a new owner to
acquire them. instead, they remain in a state of being permanently
locked, as desired.
the whole point of this locking is to prevent munmap, or mmap with
MAP_FIXED, from deallocating virtual addresses, or changing the
backing a given virtual address refers to, during certain race windows
involving self-synchronized unmapping or destruction of pthread
synchronization objects. there is no need for exclusion in the other
direction, so it suffices to take the lock momentarily and release it
before making the syscall, rather than holding it across the syscall.
the kernel always uses non-private wake when walking the robust list
when a thread or process exits, so it's not able to wake waiters
listening with the private futex flag. this problem is solved by doing
the equivalent in userspace as the last step of pthread_exit.
care is taken to remove mutexes from the robust list before unlocking
them so that the kernel will not attempt to access them again,
possibly after another thread locks them. this removal code can treat
the list as singly-linked, since no further code which would add or
remove items is able to run at this point. moreover, the pending
pointer is not needed since the mutexes being unlocked are all
process-local; in the case of asynchronous process termination, they
all cease to exist.
since a process-local robust mutex cannot come into existence without
a call to pthread_mutexattr_setrobust in the same process, the code
for userspace robust list processing is put in that source file, and
a weak alias to a dummy function is used to avoid pulling in this
bloat as part of pthread_exit in static-linked programs.
private-futex uses the virtual address of the futex int directly as
the hash key rather than requiring the kernel to resolve the address
to an underlying backing for the mapping in which it lies. for certain
usage patterns it improves performance significantly.
in many places, the code using futex __wake and __wait operations was
already passing a correct fixed zero or nonzero flag for the priv
argument, so no change was needed at the site of the call, only in the
__wake and __wait functions themselves. in other places, especially
where the process-shared attribute for a synchronization object was
not previously tracked, additional new code is needed. for mutexes,
the only place to store the flag is in the type field, so additional
bit masking logic is needed for accessing the type.
for non-process-shared condition variable broadcasts, the futex
requeue operation is unable to requeue from a private futex to a
process-shared one in the mutex structure, so requeue is simply
disabled in this case by waking all waiters.
for robust mutexes, the kernel always performs a non-private wake when
the owner dies. in order not to introduce a behavioral regression in
non-process-shared robust mutexes (when the owning thread dies), they
are simply forced to be treated as process-shared for now, giving
correct behavior at the expense of performance. this can be fixed by
adding explicit code to pthread_exit to do the right thing for
non-shared robust mutexes in userspace rather than relying on the
kernel to do it, and will be fixed in this way later.
since not all supported kernels have private futex support, the new
code detects EINVAL from the futex syscall and falls back to making
the call without the private flag. no attempt to cache the result is
made; caching it and using the cached value efficiently is somewhat
difficult, and not worth the complexity when the benefits would be
seen only on ancient kernels which have numerous other limitations and
bugs anyway.
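the uncached fallback can be sketched in terms of the public syscall() wrapper. this is not the internal __wake code: wake_with_fallback is a hypothetical name, the error code tested is the one named in the text above, and a kernel that reports a different code for an unknown futex command is not covered here. FUTEX_WAKE and FUTEX_PRIVATE_FLAG are the constants from the linux futex header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wake up to cnt waiters on addr. With priv nonzero the private flag is
 * tried first; if the kernel rejects it, the call is repeated without the
 * flag. Nothing is cached, so the retry happens on every call on such
 * kernels. Returns the number of waiters woken, or -1 with errno set. */
long wake_with_fallback(volatile int *addr, int cnt, int priv)
{
	assert(addr);
	int flag = priv ? FUTEX_PRIVATE_FLAG : 0;
	long r = syscall(SYS_futex, addr, FUTEX_WAKE | flag, cnt);
	if (r < 0 && flag && errno == EINVAL)
		r = syscall(SYS_futex, addr, FUTEX_WAKE, cnt);
	return r;
}
```

waking an address nobody is waiting on simply reports zero waiters woken.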
the code which loads locale files was already rejecting locale names
containing slashes. however, LC_MESSAGES records a locale name even if
libc does not have a matching locale file, so that gettext or
application code can use the recorded locale name for message
translations to languages that libc does not support. this recorded
name was not being checked for slashes, meaning that such code could
potentially be tricked into directory traversal.
in addition, since the value of a locale category is sometimes used as
a pathname component by callers, the improved code rejects any value
beginning with a dot. this prevents traversal to the parent directory
via "..", use of the top-level locale directory via ".", and also
avoids "hidden" directories as a side effect.
finally, overly long locale names are now rejected (treated as an
unrecognized name and thus as an alias for C.UTF-8) rather than being
truncated.
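the three checks described above amount to something like the following sketch. the function name and the length limit are hypothetical values chosen for illustration; the actual limit used by the code is not stated here.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NAME_MAX 23 /* hypothetical limit for this example */

/* Returns 1 if name is acceptable as a recorded locale name, 0 if it must
 * be treated as unrecognized. */
int locale_name_ok(const char *name)
{
	assert(name);
	if (name[0] == '.') return 0;                    /* ".", "..", hidden dirs */
	if (strchr(name, '/')) return 0;                 /* no path separators */
	if (strlen(name) > SKETCH_NAME_MAX) return 0;    /* reject, never truncate */
	return 1;
}
```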
using an operator precedence parser the code size
became smaller and it is only slower by about 10%
size of old vs new pleval.o on different archs:
(with inlined isspace added to pleval.c for now)
old:
   text    data     bss     dec     hex filename
    828       0       0     828     33c pl.i386.o
   1152       0       0    1152     480 pl.arm.o
   1704       0       0    1704     6a8 pl.mips.o
   1328       0       0    1328     530 pl.ppc.o
    992       0       0     992     3e0 pl.x64.o
new:
   text    data     bss     dec     hex filename
    693       0       0     693     2b5 pl.i386.o
    972       0       0     972     3cc pl.arm.o
   1276       0       0    1276     4fc pl.mips.o
   1087       0       0    1087     43f pl.ppc.o
    846       0       0     846     34e pl.x64.o
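for readers unfamiliar with the technique, here is a small precedence-climbing evaluator, one common form of operator precedence parsing. it is a sketch and not the pleval.c code: all names are hypothetical, and it handles only a reduced operator set (< > + - * / %), the variable n, decimal literals and parentheses.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Binding strength of each supported binary operator; 0 = not an operator. */
static int prec(char op)
{
	switch (op) {
	case '<': case '>': return 1;
	case '+': case '-': return 2;
	case '*': case '/': case '%': return 3;
	default: return 0;
	}
}

static unsigned long apply(char op, unsigned long a, unsigned long b)
{
	switch (op) {
	case '<': return a < b;
	case '>': return a > b;
	case '+': return a + b;
	case '-': return a - b;
	case '*': return a * b;
	case '/': return b ? a / b : 0; /* division by zero yields 0 in this sketch */
	default:  return b ? a % b : 0; /* '%' */
	}
}

static const char *skip_space(const char *s)
{
	while (isspace((unsigned char)*s)) s++;
	return s;
}

static const char *parse_expr(const char *s, unsigned long n, int minprec,
                              unsigned long *out);

/* Operand: the variable n, a parenthesized expression, or a decimal literal. */
static const char *parse_operand(const char *s, unsigned long n, unsigned long *out)
{
	s = skip_space(s);
	if (*s == 'n') { *out = n; return s + 1; }
	if (*s == '(') {
		s = parse_expr(s + 1, n, 1, out);
		if (!s) return 0;
		s = skip_space(s);
		return *s == ')' ? s + 1 : 0;
	}
	if (isdigit((unsigned char)*s)) {
		char *end;
		*out = strtoul(s, &end, 10);
		return end;
	}
	return 0; /* syntax error */
}

/* Consume operators binding at least as tightly as minprec. Recursing with
 * p + 1 for the right-hand side makes every operator left-associative. */
static const char *parse_expr(const char *s, unsigned long n, int minprec,
                              unsigned long *out)
{
	unsigned long lhs, rhs;
	s = parse_operand(s, n, &lhs);
	while (s) {
		s = skip_space(s);
		char op = *s;
		int p = prec(op);
		if (!p || p < minprec) break;
		s = parse_expr(s + 1, n, p + 1, &rhs);
		if (s) lhs = apply(op, lhs, rhs);
	}
	if (s) *out = lhs;
	return s;
}

/* Returns 0 and stores the value on success, -1 on a syntax error. */
int eval_plural_sketch(const char *expr, unsigned long n, unsigned long *out)
{
	assert(expr && out);
	const char *end = parse_expr(expr, n, 1, out);
	return end && *skip_space(end) == '\0' ? 0 : -1;
}
```

a single recursive function driven by a precedence table replaces one function per grammar level, which is where the size reduction of this approach typically comes from.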
the previous implementations had several deficiencies, the most severe
of which was the inability to report unconfigured interfaces or
interfaces without ipv4 addresses. among the options discussed for
fixing this, using netlink turned out to be the one with the least
cost and most additional advantages. other improvements include:
if_nameindex now avoids duplicates in the list it produces, but still
includes legacy-style interface aliases if any are in use.
getifaddrs now reports hardware addresses and includes the scope_id
for link-local ipv6 addresses in the resulting address.
this commit changes the names to match the kernel names, exposing
under the normal names the "old" versions which work with a smaller
termios structure compatible with the userspace structure, and
renaming the "new" versions with "2" on the end like the kernel has.
this fixes spurious warnings "Unsupported ioctl: cmd=0x802c542a" from
qemu-sh4 and should be more correct anyway, since our userspace
termios structure does not have meaningful information in the part
which the kernel would be interpreting as speeds with the new ioctl.
while the __mo_lookup backend can verify that the translated message
ends with a null terminator, it has no way to know nplurals and thus
no way to verify that sufficiently many null terminators are present
in the string to satisfy all plural forms. the code in dcngettext was
already attempting to avoid reading past the end of the mo file
mapping, but failed to do so because the strlen call itself could
over-read. using strnlen instead allows us to avoid the problem.
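the bounded walk over plural forms can be sketched like this. nth_plural is a hypothetical name, and avail stands for the number of bytes remaining in the mapping from the start of the first form; how the real code derives that bound is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/* Return the k-th (0-based) null-terminated string in a sequence stored at
 * s, never examining more than avail bytes. Returns NULL if the sequence
 * ends, or a terminator is missing, before the k-th string is complete. */
const char *nth_plural(const char *s, size_t avail, unsigned k)
{
	assert(s);
	while (k--) {
		size_t len = strnlen(s, avail);
		if (len >= avail) return 0; /* no terminator within range */
		s += len + 1;
		avail -= len + 1;
	}
	/* the selected form must itself be terminated within range */
	if (strnlen(s, avail) >= avail) return 0;
	return s;
}
```

unlike strlen, strnlen stops at the bound, so a translation with too few plural forms yields a failure instead of a read past the end.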
rather than just checking that the start of the string lies within the
mapping, also check that the nominal length remains within the
mapping, and that the null terminator is present at the nominal
length. this ensures that the caller, using the result as a C string,
will not read past the end of the mapping.
the nominal length is never exposed to the caller, but it's useful
internally to find where the null terminator should be without having
to resort to linear search via strnlen/memchr.
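the combined check can be sketched as a single helper. mo_string_at and its parameters are hypothetical names for this example; off and len stand for the offset and nominal length read from the mo file.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to the string at offset off in a mapping of size bytes,
 * or NULL unless the start, the nominal length, and the terminator at the
 * nominal length all lie within the mapping. */
const char *mo_string_at(const char *map, size_t size, size_t off, size_t len)
{
	assert(map);
	if (off >= size) return 0;         /* start outside the mapping */
	if (len >= size - off) return 0;   /* no room for len bytes plus terminator */
	if (map[off + len] != 0) return 0; /* terminator missing at nominal length */
	return map + off;
}
```

because the terminator position is known from the nominal length, one byte comparison replaces a linear scan.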
the a_cas_l, a_swap_l, a_swap_p, and a_store_l operations were
probably used a long time ago when only i386 and x86_64 were
supported. as other archs were added, support for them was
inconsistent, and they are obviously not in use at present. having
them around potentially confuses readers working on new ports, and the
type-punning hacks and inconsistent use of types in their definitions
is not a style I wish to perpetuate in the source tree, so removing
them seems appropriate.
while other usage I've seen only has the synco instruction after the
atomic operation, I cannot find any documentation indicating that this
is correct. certainly all stores before the atomic need to have been
synchronized before the atomic operation takes place.
this commit replaces the stub implementations with working message
translation functions. translation units are factored so as to prevent
pulling in the legacy, non-library-safe functions which use a global
textdomain in modern code which is using the versions with an explicit
domain argument. bind_textdomain_codeset is also placed in its own
file since it should not be needed by most programs.
this implementation is still missing some features: the LANGUAGE
environment variable (for multiple fallback languages) is not honored,
and non-default plural-form rules are not supported. these issues will
be addressed in a later commit.
one notable difference from the GNU implementation is that there is no
default path for loading translation files. in principle one could be
added, but since the documented correct usage is to call the
bindtextdomain function, a default path is probably unnecessary.
for LC_MESSAGES, translation of strerror and similar literal message
functions is supported. for messages in other places (particularly the
dynamic linker) that use format strings, translation is not yet
supported. in order to make it possible and safe, such messages will
need to be refactored to separate the textual content from the format.
for LC_TIME, the day and month names and strftime-style format strings
provided by nl_langinfo are supported for translation. however there
may be limitations, as some of the original C-locale nl_langinfo
strings are non-unique and thus perhaps non-suitable as keys.
overall, the locale support activated by this commit should not be
seen as complete and polished but as a basis for beginning to test
locale functionality and implement locales.
the core is based on a binary search; hash table is not used. both
native and reverse-endian mo files are supported. all offsets read
from the mapped mo file are checked against the mapping size to
prevent the possibility of reads outside the mapping.
this commit has no observable effects since there are not yet any
callers to the message translation code.
there is still no code which actually uses the loaded locale files, so
the main observable effect of this commit is that calls to setlocale
store and give back the names of the selected locales for the
remaining categories (LC_TIME, LC_COLLATE, LC_MONETARY) if a locale
file by the requested name could be loaded.
ETH_P_80221 is ethertype for IEEE Std 802.21 - Media Independent Handover Protocol
introduced in linux 3.15 commit b62faf3cdc875a1ac5a10696cf6ea0b12bab1596
ETH_P_LOOPBACK is the correct packet type for loopback in IEEE 802.3*
introduced in linux 3.15 commit 61ccbb684421d374fdcd7cf5d6b024b06f03ce4e
some defines were shuffled to be in ascending order and match the kernel header
due to what was essentially a copy and paste error, the changes made
in commit f61be1f875 caused syscalls
with 5 or 6 arguments (and syscalls with 2, 3, or 4 arguments when
compiled with clang compatibility) to negate the returned error code a
second time, breaking errno reporting.
the mips version of this structure on the kernel side wrongly has
32-bit type rather than 64-bit type. fortunately there is adjacent
padding to bring it up to 64 bits, and on little-endian, this allows
us to treat the adjacent kernel st_dev and st_pad0[0] as a single
64-bit dev_t. however, on big endian, such treatment results in the
upper and lower 32-bit parts of the dev_t value being swapped. for the
purpose of just comparing st_dev values this did not break anything,
but it precluded actually processing the device numbers as major/minor
values.
since the broken kernel behavior that needs to be worked around is
isolated to one arch, I put the workarounds in syscall_arch.h rather
than adding a stat fixup path in the common code. on little endian
mips, the added code optimizes out completely.
the changes necessary were incompatible with the way the __asm_syscall
macro was factored so I just removed it and flattened the individual
__syscallN functions. this arguably makes the code easier to read and
understand, anyway.
this function provides a way for third-party library code to use the
same logic that's used internally in libc for suppressing untrusted
input/state (e.g. the environment) when the application is running
with privileges elevated by the setuid or setgid bit or some other
mechanism. its semantics are intended to match the openbsd function by
the same name.
there was some question as to whether this function is necessary:
getauxval(AT_SECURE) was proposed as an alternative. however, this has
several drawbacks. the most obvious is that it asks programmers to be
aware of an implementation detail of ELF-based systems (the aux
vector) rather than simply the semantic predicate to be checked. and
trying to write a safe, reliable version of issetugid in terms of
getauxval is difficult. for example, early versions of the glibc
getauxval did not report ENOENT, which could lead to false negatives
if AT_SECURE was not present in the aux vector (this could probably
only happen when running on non-linux kernels under linux emulation,
since glibc does not support linux versions old enough to lack
AT_SECURE). as for musl, getauxval has always properly reported
errors, but prior to commit 7bece9c209,
the musl implementation did not emulate AT_SECURE if missing, which
would result in a false positive. since musl actually does partially
support kernels that lack AT_SECURE, this was problematic.
the intent is that library authors will use issetugid if its
availability is detected at build time, and only fall back to the
unreliable alternatives on systems that lack it.
patch by Brent Cook. commit message/rationale by Rich Felker.
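to make the difficulty concrete, a cautious getauxval-based check might be sketched as below. both function names are hypothetical; the sketch treats a missing AT_SECURE entry as possibly privileged, which is the conservative reading of an indeterminate result.

```c
#include <assert.h>
#include <errno.h>
#include <sys/auxv.h>

/* Pure decision step: value is what getauxval(AT_SECURE) returned, err is
 * errno afterwards (0 if it was left untouched). */
int interpret_at_secure(unsigned long value, int err)
{
	if (value == 0 && err == ENOENT) return 1; /* indeterminate: assume the worst */
	return value != 0;
}

/* Returns nonzero if the process may be running with elevated privileges. */
int maybe_privileged(void)
{
	errno = 0;
	unsigned long value = getauxval(AT_SECURE);
	return interpret_at_secure(value, errno);
}
```

a caller still has to know about the aux vector and about the ENOENT convention, which is the burden a dedicated issetugid function removes.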
at the very least, a compiler barrier is required no matter what, and
that was missing. current or1k implementations have strong ordering,
but this is not guaranteed as part of the ISA, so some sort of
synchronizing operation is necessary.
in principle we should use l.msync, but due to misinterpretation of
the spec, it was wrongly treated as an optional instruction and is not
supported by some implementations. if future kernels trap it and treat
it as a nop (rather than illegal instruction) when the
hardware/emulator does not support it, we could consider using it.
in the absence of l.msync support, the l.lwa/l.swa instructions, which
are specified to have a built-in l.msync, need to be used. the easiest
way to use them to implement atomic store is to perform an atomic swap
and throw away the result. using compare-and-swap would be lighter,
and would probably be sufficient for all actual usage cases, but
checking this is difficult and error-prone:
with store implemented in terms of swap, it's guaranteed that, when
another atomic operation is performed at the same time as the store,
either the result of the store followed by the other operation, or
just the store (clobbering the other operation's result) is seen. if
store were implemented in terms of cas, there are cases where this
invariant would fail to hold, and we would need detailed rules for the
situations in which the store operation is well-defined.
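expressed with C11 atomics instead of l.lwa/l.swa asm, store-as-swap looks like the sketch below. store_via_swap is a hypothetical name and this is an illustration of the idea, not the or1k implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomic store implemented as an atomic exchange whose old value is
 * discarded. The exchange cannot fail, so the new value always lands. */
void store_via_swap(atomic_int *p, int v)
{
	assert(p);
	(void)atomic_exchange(p, v);
}
```

a compare-and-swap based store would instead need a retry loop or a rule for what happens when the comparison fails, which is the complication described above.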
as far as I can tell, microblaze is strongly ordered, but this does
not seem to be well-documented and the assumption may need revisiting.
even with strong ordering, however, a volatile C assignment is not
sufficient to implement atomic store, since it does not preclude
reordering by the compiler with respect to non-volatile stores and
loads.
simply flanking a C store with empty volatile asm blocks with memory
clobbers would achieve the desired result, but is likely to result in
worse code generation, since the address and value for the store may
need to be spilled. actually writing the store in asm, so that there's
only one asm block, should give optimal code generation while
satisfying the requirement for having a compiler barrier.
previously I had wrongly assumed the ll/sc instructions also provided
memory synchronization; apparently they do not. this commit adds sync
instructions before and after each atomic operation and changes the
atomic store to simply use sync before and after a plain store, rather
than a useless compare-and-swap.
despite lacking the semantic content that the asm accesses the
pointed-to object rather than just using its address as a value, the
mips asm was not actually broken. the asm blocks were declared
volatile, meaning that the compiler must treat them as having unknown
side effects.
however changing the asm to use memory constraints is desirable not
just from a semantic correctness and consistency standpoint, but also
produces better code. the compiler is able to use base/offset
addressing expressions for the atomic object's address rather than
having to load the address into a single register. this improves
access to global locks in static libc, and access to non-zero-offset
atomic fields in synchronization primitives, etc.
due to a mistake in my testing procedure, the changes in the previous
commit were not correctly tested and wrongly assumed to be valid. the
lwarx and stwcx. instructions do not accept general ppc memory address
expressions and thus the argument associated with the memory
constraint cannot be used directly.
instead, the memory constraint can be left as an argument that the asm
does not actually use, and the address can be provided in a separate
register constraint.
the register constraint for the address to be accessed did not convey
that the asm can access the pointed-to object. as far as the compiler
could tell, the result of the asm was just a pure function of the
address and the values passed in, and thus the asm could be hoisted
out of loops or omitted entirely if the result was not used.
this code path is used only on archs without the plain, non-at
syscalls, and only when the fstat syscall fails with EBADF on a valid
file descriptor. this in turn can happen only for O_PATH file
descriptors, and may not happen at all on the newer kernels needed for
supporting such archs.
with the flags argument omitted, spurious fstat failures may happen
when the argument register happens to have the AT_SYMLINK_NOFOLLOW bit
set.
the erroneous definition was missed because it works with qemu
user-level emulation, which also has the wrong definition. the actual
kernel uses the asm-generic definition.
With the exception of a fenv implementation, the port is fully featured.
The port has been tested in or1ksim, the golden reference functional
simulator for OpenRISC 1000.
It passes all libc-test tests (except the math tests that
require a fenv implementation).
The port assumes an or1k implementation that has support for
atomic instructions (l.lwa/l.swa).
Although it passes all the libc-test tests, the port is still
in an experimental state, and has as yet experienced very little
'real-world' use.
this could happen on 2.4-series linux kernels that predate AT_SECURE
and possibly on other kernels that are emulating the linux syscall API
but not providing AT_SECURE in the aux vector at startup.
in principle applications should be checking errno anyway, but this
does not really work. to be secure, the caller would have to treat
ENOENT (indeterminate result) as possibly-suid and thereby disable
functionality in the typical non-suid usage case. and since glibc only
runs on kernels that provide AT_SECURE, applications written to the
glibc getauxval API might simply assume it succeeds.
this was originally added as a cheap but portable way to quell
warnings about reaching the end of a function that does not return,
but since _Exit is marked _Noreturn, it's not needed. removing it
makes the call to _Exit into a tail call and shaves off a few bytes of
code from minimal static programs.
previously we detected this bug in configure and issued advice for a
workaround, but this turned out not to work. since then gcc 4.9.0 has
appeared in several distributions, and now 4.9.1 has been released
without a fix despite this being a wrong code generation bug which is
supposed to be a release-blocker, per gcc policy.
since the scope of the bug seems to affect only data objects (rather
than functions) whose definitions are overridable, and there are only
a very small number of these in musl, I am just changing them from
const to volatile for the time being. simply removing the const would
be sufficient to make gcc 4.9.1 work (the non-const case was
inadvertently fixed as part of another change in gcc), and this would
also be sufficient with 4.9.0 if we forced -O0 on the affected files
or on the whole build. however it's cleaner to just remove all the
broken compiler detection and use volatile, which will ensure that
they are never constant-folded. the quality of a non-broken compiler's
output should not be affected except for the fact that these objects
are no longer const and thus possibly add a few bytes to data/bss.
this change can be reconsidered and possibly reverted at some point in
the future when the broken gcc versions are no longer relevant.
the purpose of this logic is to avoid linking __stdio_exit unless any
stdio reads (which might require repositioning the file offset at exit
time) or writes (which might require flushing at exit time) could have
been performed.
previously, exit called two wrapper functions for __stdio_exit named
__flush_on_exit and __seek_on_exit. both of these functions actually
performed both tasks (seek and flushing) by calling the underlying
__stdio_exit. in order to avoid doing this twice, an overridable data
object __towrite_used was used to cause __seek_on_exit to act as a nop
when __towrite was linked.
now, exit only makes one call, directly to __stdio_exit. this is
satisfiable by a weak dummy definition in exit.c, but the real
definition is pulled in by either __toread.c or __towrite.c through
their referencing a symbol which is defined only in __stdio_exit.c.
this was previously a no-op, somewhat intentionally, because I failed
to understand that it only has an effect when sending to the logging
facility fails and thus is not the nuisance that it would be if it
always sent output to the console.
this behavior is no longer valid in general, and was never necessary.
if the LOG_PERROR option is set, output to stderr could still succeed.
also, when the LOG_CONS option is added, it will need syslog to
proceed even if opening the log socket fails.
this is a nonstandard feature, but easy and inexpensive to add. since
the corresponding macro has always been defined in our syslog.h, it
makes sense to actually support it. applications may reasonably be
using the presence of the macro to assume that the feature is
supported.
the behavior of omitting the 'header' part of the log message does not
seem to be well-documented, but matches other implementations (at
least glibc) which have this option.
based on a patch by Clément Vasseur, but simplified using %n.
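as a rough illustration of the %n technique mentioned above, the
following sketch formats a whole log line once and uses %n to record
where the header ends, so the same buffer can be reused for output
that omits the header. the function name, the parameters and the
header layout are assumptions made for this example; this is not the
actual syslog code.

```c
#include <stdio.h>
#include <stddef.h>

/* hypothetical helper: formats "<pri>timestamp tag: msg" into buf and
 * returns a pointer to the part after the "<pri>timestamp " header,
 * or NULL on error or truncation. */
static const char *format_log_line(char *buf, size_t size, int pri,
                                   const char *timestamp,
                                   const char *tag, const char *msg)
{
    int hlen = 0;
    /* %n stores the number of bytes produced so far, which at that
     * point in the format is exactly the header length. */
    int len = snprintf(buf, size, "<%d>%s %n%s: %s",
                       pri, timestamp, &hlen, tag, msg);
    if (len < 0 || (size_t)len >= size)
        return NULL;
    return buf + hlen;
}
```

a caller could send the full buffer to the log socket and write only
the returned tail to stderr.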
previously passing an empty string for name resulted in failure, as
expected, but only after spurious syscalls, and it produced confusing
errno values (and thus dlerror strings).
in addition to dlopen calls, this issue affected use of LD_PRELOAD
with trailing whitespace or colon characters.
r24 was wrongly being saved at a misaligned offset of 30 rather than
the correct offset of 40 in the jmp_buf. the exact effects of this
error have not been studied, but it's clear that the value of r24 was
lost across setjmp/longjmp and the saved values of r21 and/or r22 may
also have been corrupted.
if the order of object files in the static archive libc.a was not
respected by the linker, the old logic could wrongly cause POSIX
symbols outside of the ISO C namespace to be pulled into pure C
programs. this should not happen with well-behaved linkers, but
relying on the link order was a bad idea anyway.
files are renamed to better reflect their contents now that they don't
need names to control their order as members in the archive file.
1. failure to output a newline after the password is read
2. fd leaks via missing FD_CLOEXEC
3. fd leaks via failure-to-close when any of the standard streams are
closed at the time of the call
4. wrongful fallback to use of stdin when opening /dev/tty fails
5. wrongful use of stderr rather than /dev/tty for prompt
6. failure to report error reading password
the main motivation for this change is to remove the assumption that
the tid of the main thread is also the pid of the process. (the value
returned by the set_tid_address syscall was used to fill both fields
despite it semantically being the tid.) this is historically and
presently true on linux and unlikely to change, but it conceivably
could be false on other systems that otherwise reproduce the linux
syscall api/abi.
only a few parts of the code were actually still using the cached pid.
in a couple places (aio and synccall) it was a minor optimization to
avoid a syscall. caching could be reintroduced, but lazily as part of
the public getpid function rather than at program startup, if it's
deemed important for performance later. in other places (cancellation
and pthread_kill) the pid was completely unnecessary; the tkill
syscall can be used instead of tgkill. this is actually a rather
subtle issue, since tgkill is supposedly a solution to race conditions
that can affect use of tkill. however, as documented in the commit
message for commit 7779dbd266, tgkill
does not actually solve this race; it just limits it to happening
within one process rather than between processes. we use a lock that
avoids the race in pthread_kill, and the use in the cancellation
signal handler is self-targeted and thus not subject to tid reuse
races, so both are safe regardless of which syscall (tgkill or tkill)
is used.
the main practical purposes of this commit are to remove a huge amount
of clutter from the src/locale directory, to cut down on the length of
the $(AR) and $(LD) command lines, and to reduce the amount of space
wasted by object file headers in the static libc.a. build time may
also be reduced, though this has not been measured.
as an additional justification, if there ever were a need for the
behavior of these functions to vary by locale, it would be necessary
for the non-_l versions to call the _l versions, so that linking the
former without the latter would not be possible anyway.
this commit adds non-stub implementations of setlocale, duplocale,
newlocale, and uselocale, along with the data structures and minimal
code needed for representing the active locale on a per-thread basis
and optimizing the common case where thread-local locale settings are
not in use.
at this point, the data structures only contain what is necessary to
represent LC_CTYPE (a single flag) and LC_MESSAGES (a name for use in
finding message translation files). representation for the other
categories will be added later; the expectation is that a single
pointer will suffice for each.
for LC_CTYPE, the strings "C" and "POSIX" are treated as special; any
other string is accepted and treated as "C.UTF-8". for other
categories, any string is accepted after being truncated to a maximum
supported length (currently 15 bytes). for LC_MESSAGES, the name is
kept regardless of whether libc itself can use such a message
translation locale, since applications using catgets or gettext should
be able to use message locales libc is not aware of. for other
categories, names which are not successfully loaded as locales (which,
at present, means all names) are treated as aliases for "C". setlocale
never fails.
locale settings are not yet used anywhere, so this commit should have
no visible effects except for the contents of the string returned by
setlocale.
in some cases, these functions internally call a byte-based input or
output function before calling getwc/putwc, so they cannot rely on the
latter to set the orientation.
when the orientation of the stream was already set, fwide was
incorrectly returning its argument (the requested orientation) rather
than the actual orientation of the stream.
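the expected behavior can be checked with a small sketch like the one
below. the helper name is made up for this example; it only relies on
the standard fwide interface and assumes it is handed a stream whose
orientation has not been set yet.

```c
#include <stdio.h>
#include <wchar.h>

/* hypothetical check: sets byte orientation on a fresh stream, then
 * requests wide orientation and returns what fwide reports. a correct
 * fwide reports the actual (byte) orientation, i.e. a negative value,
 * not the positive value that was requested. */
static int conflicting_fwide(FILE *f)
{
    if (fwide(f, -1) >= 0)
        return 0; /* stream was not fresh; result is not meaningful */
    return fwide(f, 1);
}
```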
these functions were setting wc to point to wchar_t aliasing itself as
a "cheap" way to support null wc arguments. doing so was anything but
cheap, since even without the aliasing violation, it would limit the
compiler's ability to optimize.
making wc point to a dummy object is equally easy and does not suffer
from the above problems.
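the dummy-object approach looks roughly like the following. this is a
toy, ASCII-only stand-in for an mbtowc-style function written for this
example, not the code changed by the commit.

```c
#include <stddef.h>
#include <wchar.h>

/* toy decoder (hypothetical): accepts only ASCII bytes. */
static int ascii_mbtowc(wchar_t *wc, const char *s, size_t n)
{
    wchar_t dummy;

    if (!s) return 0;          /* stateless encoding */
    if (!n) return -1;
    if (!wc) wc = &dummy;      /* null wc: store into a throwaway object */
    if ((unsigned char)*s >= 0x80) return -1;
    *wc = (unsigned char)*s;
    return *s != 0;
}
```

because the store always goes through a pointer to a real wchar_t
object, no aliasing violation is involved and the rest of the
function does not need a separate null check.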
this issue caused the address of functions in shared libraries to
resolve to their PLT thunks in the main program rather than their
correct addresses. it was observed causing crashes, though the
mechanism of the crash was not thoroughly investigated. since the
issue is very subtle, it calls for some explanation:
on all well-behaved archs, GOT entries that belong to the PLT use a
special relocation type, typically called JMP_SLOT, so that the
dynamic linker can avoid having the jump destinations for the PLT
resolve to PLT thunks themselves (they also provide a definition for
the symbol, which must be used whenever the address of the function is
taken so that all DSOs see the same address).
however, the traditional mips PIC ABI lacked such a JMP_SLOT
relocation type, presumably because, due to the way PIC works, the
address of the PLT thunk was never needed and could always be ignored.
prior to commit adf94c1966, the mips
version of reloc.h contained a hack that caused all symbol lookups to
be treated like JMP_SLOT, inhibiting undefined symbols from ever being
used to resolve symbolic relocations. this hack goes all the way back
to commit babf820180, when the mips
dynamic linker was first made usable.
during the recent refactoring to eliminate arch-specific relocation
processing (commit adf94c1966), this
hack was overlooked and no equivalent functionality was provided in
the new code.
fixing the problem is not as simple as adding back an equivalent hack,
since there is now also a "non-PIC ABI" that can be used for the main
executable, which actually does use a PLT. the closest thing to
official documentation I could find for this ABI is nonpic.txt,
attached to Message-ID: 20080701202236.GA1534@caradoc.them.org, which
can be found in the gcc mailing list archives and elsewhere. per this
document, undefined symbols corresponding to PLT thunks have the
STO_MIPS_PLT bit set in the symbol's st_other field. thus, I have
added an arch-specific rule for mips, applied at the find_sym level
rather than the relocation level, to reject undefined symbols with the
STO_MIPS_PLT bit clear.
the previous hack of treating all mips relocations as JMP_SLOT-like,
rather than rejecting the unwanted symbols in find_sym, probably also
caused dlsym to wrongly return PLT thunks in place of the correct
address of a function under at least some conditions. this should now
be fixed, at least for global-scope symbol lookups.
due to a mistake when refactoring the error printing for the dynamic
linker (commit 7c73cacd09), all messages
were suppressed and replaced by blank lines.
my old, out-of-tree release script that performed a clone rather than
using git archive checked the VERSION file to make sure that it
matched before doing a release. I believe there should be a way to do
the same with git commands (without resorting to checking out the
desired tag) but I have not yet found a way, so care should be taken
when using these targets that the correctness of the VERSION file is
not overlooked.
it should be noted that the "real" __sysv_signal, which we do not
implement, is semantically different from signal. references to
__sysv_signal arise in code built against glibc under certain
combinations of feature test macros, and are almost surely
unintentional since the legacy sysv signal behavior has fundamental
race conditions that cannot be worked around and which make it
impossible to use safely.
these are mostly intended for use with dynamic linking (although they
can also be used statically with object files compiled against glibc
headers), so having them broken down into separate source files to
optimize for static linking is unlikely to be worth the cost of having
more files in the source tree (which contributes to libc.a overhead,
compile time, link time, ar/linker command line size exhaustion, and
so on).
this issue affected the prioritynames and facilitynames arrays which
are only provided when requested (usually by syslogd implementations)
and which are presently defined as compound literals. the aliasing
violation seems to have been introduced as a workaround for bad
behavior by gcc's -Wwrite-strings option, but it caused compilers to
completely optimize out the contents of prioritynames and
facilitynames since, under many usage cases, the aliasing rules prove
that the contents are never accessed.
this behavior turned out to be counter-intuitive to users and in any
case it's unnecessary. optimization can be disabled explicitly using
the --disable-optimize option, or both can be achieved without any
enable/disable options by passing CFLAGS="-O0 -g".
according to the documentation in the man pages, the GNU extension
functions gethostbyaddr_r, gethostbyname_r and gethostbyname2_r are
guaranteed to set the result pointer to NULL in case of error or no
result.
the main motivation for this change is to aid in debugging. since the
main program's entry point is also named _start, it was difficult to
set breakpoints or quickly identify which _start execution stopped in.
these are not pure syscall wrappers because they have to work around
kernel API bugs on 64-bit archs. the workarounds could probably be
made somewhat more efficient, but at the cost of more complexity. this
may be revisited later.
such separation serves multiple purposes:
- by having the common path for __tls_get_addr alone in its own
function with a tail call to the slow case, code generation is
greatly improved.
- by having __tls_get_addr in its own file, it can be replaced on a
per-arch basis as needed, for optimization or ABI-specific purposes.
- by removing __tls_get_addr from __init_tls.c, a few bytes of code
are shaved off of static binaries (which are unlikely to use this
function unless the linker messed up).
previously, accesses to dynamic TLS had to check two conditions before
being able to use a dtv slot: (1) that the module index was within the
bounds of the current dtv size, and (2) that the dynamic tls for the
requested module index was already installed in the dtv.
this commit changes the installation strategy so that, whenever an
attempt is made to access dynamic TLS that's not yet installed in the
dtv, the dynamic TLS for all lower-index modules is also installed.
thus it provides a new invariant: if a given module index is within
the bounds of the current dtv size, we automatically know that its TLS
is installed and directly available. the requirement that the second
condition (above) be checked is eliminated.
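the effect of the new invariant can be shown with a toy model. all
names and sizes below are invented for this example, and the model
folds "dtv size" and "number of installed modules" into one counter;
the real data structures differ.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_MODULES  4   /* hypothetical number of TLS modules */
#define TOY_TLS_SIZE 16  /* hypothetical per-module TLS size */

struct toy_thread {
    size_t installed;          /* modules 0..installed-1 are installed */
    void *dtv[TOY_MODULES];
};

/* slow path: install TLS for the requested module and for every
 * lower-index module that is not installed yet. */
static void *toy_tls_slow(struct toy_thread *t, size_t mod, size_t off)
{
    if (mod >= TOY_MODULES) return NULL;
    while (t->installed <= mod) {
        void *p = calloc(1, TOY_TLS_SIZE);
        if (!p) return NULL;
        t->dtv[t->installed++] = p;
    }
    return (char *)t->dtv[mod] + off;
}

/* fast path: a single bounds check is enough, since being in bounds
 * implies the slot is already populated. */
static void *toy_tls_get(struct toy_thread *t, size_t mod, size_t off)
{
    if (mod < t->installed)
        return (char *)t->dtv[mod] + off;
    return toy_tls_slow(t, mod, off);
}
```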
this code is non-functional without further changes to link up the
arch-specific reloc types for tlsdesc and add asm implementations of
__tlsdesc_static and __tlsdesc_dynamic.
the logic for this loop was copied from null-terminated-string logic
in strstr without properly adapting it to work with explicit lengths.
presumably this error could result in false negatives (wrongly
comparing past the end of the needle/haystack), false positives
(stopping comparison early when the needle contains null bytes), and
crashes (from runaway reads past the end of mapped memory).
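for reference, a search that works purely on explicit lengths never
inspects byte values to decide where to stop. the sketch below is a
naive quadratic search written for this example, not the two-way
algorithm used in the real function; the function name is made up.

```c
#include <stddef.h>
#include <string.h>

/* naive explicit-length substring search (illustration only). */
static void *naive_memmem(const void *h0, size_t hlen,
                          const void *n0, size_t nlen)
{
    const unsigned char *h = h0;

    if (!nlen) return (void *)h;
    if (hlen < nlen) return NULL;
    /* only lengths bound the loop; null bytes are ordinary data */
    for (size_t i = 0; i <= hlen - nlen; i++)
        if (!memcmp(h + i, n0, nlen))
            return (void *)(h + i);
    return NULL;
}
```

a needle such as "\0ce" (three bytes, starting with a null byte) is a
useful test case, since logic copied from strstr would stop comparing
at the first byte.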
this was one of the main instances of ugly code duplication: all archs
use basically the same types of relocations, but roughly equivalent
logic was duplicated for each arch to account for the different naming
and numbering of relocation types and variation in whether REL or RELA
records are used.
as an added bonus, both REL and RELA are now supported on all archs,
regardless of which is used by the standard toolchain.
processing of R_PPC_TPREL32 was ignoring the addend provided by the
RELA-style relocation and instead using the inline value as the
addend. this presumably broke dynamic-linked access to initial TLS in
cases where the addend was nonzero.
the following issues are fixed:
- R_SH_REL32 was adding the load address of the module being relocated
to the result. this seems to have been a mistake in the original
port, since it does not match other dynamic linker implementations
and since adding a difference between two addresses (the symbol
value and the relocation address) to a load address does not make
sense.
- R_SH_TLS_DTPMOD32 was wrongly accepting an inline addend (i.e. using
+= rather than = on *reloc_addr) which makes no sense; addition is
not an operation that's defined on module ids.
- R_SH_TLS_DTPOFF32 and R_SH_TLS_TPOFF32 were wrongly using inline
addends rather than the RELA-provided addends.
in addition, handling of R_SH_GLOB_DAT, R_SH_JMP_SLOT, and R_SH_DIR32
is merged so that all three honor the addend. the first two should not need it
for correct usage generated by toolchains, but other dynamic linkers
allow addends here, and it simplifies the code anyway.
these issues were spotted while reviewing the code for the purpose of
refactoring this part of the dynamic linker. no testing was performed.
the immediate motivation is supporting TLSDESC relocations which
require allocation and thus may fail (unless we pre-allocate), but
this mechanism should also be used for throwing an error on
unsupported or invalid relocation types, and perhaps in certain cases,
for reporting when a relocation is not satisfiable.
this extension is not incompatible with the standard behavior of the
function, not expensive, and avoids requiring a replacement getopt
with full GNU extensions for a few important apps including busybox's
sed with the -i option.
previously, a warning was issued in this case no matter what, even if
--disable-shared was used. now, the default for --enable-shared is
changed from "yes" to "auto", and the warning is issued by default,
but becomes an error if --enable-shared is used, and the test is
suppressed completely if --disable-shared is used.
the motivation for the errno_ptr field in the thread structure, which
this commit removes, was to allow the main thread's errno to keep its
address when lazy thread pointer initialization was used. &errno was
evaluated prior to setting up the thread pointer and stored in
errno_ptr for the main thread; subsequently created threads would have
errno_ptr pointing to their own errno_val in the thread structure.
since lazy initialization was removed, there is no need for this extra
level of indirection; __errno_location can simply return the address
of the thread's errno_val directly. this does cause &errno to change,
but the change happens before entry to application code, and thus is
not observable.
prior to version 1.1.0, the difference between pthread_self (the
public function) and __pthread_self (the internal macro or inline
function) was that the former would lazily initialize the thread
pointer if it was not already initialized, whereas the latter would
crash in this case. since lazy initialization is no longer supported,
use of pthread_self no longer makes sense; it simply generates larger,
slower code.
such kernels cannot support threads, but the thread pointer is also
important for other purposes, most notably stack protector. without a
valid thread pointer, all code compiled with stack protector will
crash. the same applies to any use of thread-local storage by
applications or libraries.
the concept of this patch is to fall back to using the modify_ldt
syscall, which has been around since linux 1.0, to set up the gs
segment register. since the kernel does not have a way to
automatically assign ldt entries, use of slot zero is hard-coded. if
this fallback path is used, __set_thread_area returns a positive value
(rather than the usual zero for success, or negative for error)
indicating to the caller that the thread pointer was successfully set,
but only for the main thread, and that thread creation will not work
properly. the code in __init_tp has been changed accordingly to record
this result for later use by pthread_create.
the results of a dns query, whether it's performed as part of one of
the standard name-resolving functions or directly by res_send, should
be a function of the query, not of the particular nameserver that
responds to it. thus, all responses which indicate a failure or
refusal by the nameserver, as opposed to a positive or negative result
for the query, should be ignored.
the strategy used is to re-issue the query immediately (but with a
limit on the number of retries, in case the server is really broken)
when a response code of 2 (server failure, typically transient) is
seen, and otherwise take no action on bad responses (which generally
indicate a misconfigured nameserver or one which the client does not
have permission to use), allowing the normal retry interval to apply
and of course accepting responses from other nameservers queried in
parallel.
empirically this matches the traditional resolver behavior for
nameservers that respond with a code of 2 in the case where there is
just a single nameserver configured. the behavior diverges when
multiple nameservers are available, since musl is querying them in
parallel. in this case we are mildly more aggressive at retrying.
the way this is implemented, it also allows explicit setting of
TZ=/etc/localtime even for suid programs. this is not a problem
because /etc/localtime is a trusted path, much like the trusted
zoneinfo search path.
reading the variadic mode argument is only valid when the O_CREAT flag
is present. this probably does not matter, but is needed for formal
correctness, and could affect LTO or other full-program analysis.
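a wrapper that follows this rule might look like the sketch below.
the wrapper name is hypothetical, and the code assumes mode_t is not
narrower than int (true on linux), so that reading it with va_arg is
valid.

```c
#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

/* hypothetical open wrapper: the variadic mode argument is read only
 * when O_CREAT is set, since otherwise the caller need not pass it. */
static int sketch_open(const char *path, int flags, ...)
{
    mode_t mode = 0;

    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    return open(path, flags, mode);
}
```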
since there is no easy way to detect whether open honored or ignored
the O_CLOEXEC flag, the optimal solution to providing a fallback is
simply to make the fcntl syscall to set the close-on-exec flag
immediately after open returns.
the fcntl function is heavy, so make the syscall directly instead.
also, avoid the code size and runtime overhead of querying the old
flags, since it's reasonable to assume nothing will be set on a
newly-created socket. this code is only used on old kernels which lack
proper atomic close-on-exec support, so future changes that might
invalidate such an assumption do not need to be considered.
the input name is validated; the other parameters are assumed to be
valid (the list of already compressed names is not checked for
infinite reference loops or out-of-bound offsets).
names are handled case-sensitively for now.
a trailing . should be accepted in domain name strings by convention
(RFC 1034). host name lookup accepts "." but rejects the empty string
""; the res_* interfaces also accept an empty name, following existing
practice.
this condition could only happen due to malloc failure.
the fdopen operation is also moved to take place after the unlink to
minimize the window during which a link to the file exists in the
directory table.
Due to an error introduced in commit fcc522c923,
checking of the remaining output buffer space was not performed correctly,
allowing malformed input to write past the end of the buffer.
In addition, the loop detection logic failed to account for the possibility
of infinite loops with no output, which would hang the function.
The output size is now limited more strictly so only names with valid length
are accepted.
this also affects the legacy gethostbyaddr family, which uses
getnameinfo as its backend.
some other minor changes associated with the refactoring of source
files are also made; in particular, the resolv.conf parser now uses
the same code that's used elsewhere to handle ip literals, so as a
side effect it can now accept a scope id for nameserver addresses with
link-local scope.
the service and protocol functions are defined also in other files,
and the protocol ones are actually non-nops elsewhere, so the weak
definitions in ent.c could have prevented the strong definitions from
getting pulled in and used in some static programs.
the old implementation preallocated a buffer in order to try to avoid
calling vsnprintf more than once. not only did this potentially lead
to memory fragmentation from trimming with realloc; it also pulled in
realloc/free, which otherwise might not be needed in a static linked
program.
for all address types, a scope_id specified as a decimal value is
accepted. for addresses with link-local scope, a string containing the
interface name is also accepted.
some changes are made to error handling to avoid unwanted fallbacks in
the case where the scope_id is invalid: if an earlier name lookup
backend fails with an error rather than simply "0 results", this
failure now suppresses any later attempts with other backends.
in getnameinfo, a light "itoa" type function is added for generating
decimal scope_id results, and decimal port strings for services are
also generated using this function now so as not to pull in the
dependency on snprintf.
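a helper of this kind can be very small. the sketch below is one
possible shape, with a made-up name; it assumes the caller provides a
buffer of at least 3*sizeof(int) bytes, which is enough for the
decimal digits of any unsigned value plus the terminator.

```c
/* hypothetical helper: writes the decimal form of x at the end of the
 * caller's buffer and returns a pointer to the first digit. */
static char *sketch_itoa(char *buf, unsigned x)
{
    char *p = buf + 3 * sizeof(int);

    *--p = 0;
    do {
        *--p = '0' + x % 10;
        x /= 10;
    } while (x);
    return p;
}
```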
in netdb.h, a definition for the NI_NUMERICSCOPE flag is added. this
is required by POSIX (it was previously missing) and needed to allow
callers to suppress interface-name lookups.
previously, all failures to obtain at least one address were treated
as nonexistent names (EAI_NONAME). this failed to account for the
possibility of transient failures (no response at all, or a response
with rcode of 2, server failure) or permanent failures that do not
indicate the nonexistence of the requested name. only an rcode of 3
should be treated as an indication of nonexistence.
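the distinction can be sketched as a small classification function.
the function name and the choice of EAI_AGAIN and EAI_FAIL for the
transient and permanent cases are assumptions made for this example;
only the treatment of rcode 3 as EAI_NONAME is taken from the text
above.

```c
#include <netdb.h>

enum { RCODE_NOERROR = 0, RCODE_SERVFAIL = 2, RCODE_NXDOMAIN = 3 };

/* hypothetical classifier for a lookup that produced no addresses.
 * got_reply is nonzero if any response packet was received. */
static int classify_lookup_failure(int got_reply, int rcode)
{
    if (!got_reply)
        return EAI_AGAIN;       /* no response at all: transient */
    switch (rcode) {
    case RCODE_SERVFAIL:
        return EAI_AGAIN;       /* server failure: transient */
    case RCODE_NXDOMAIN:
        return EAI_NONAME;      /* the name does not exist */
    case RCODE_NOERROR:
        return 0;               /* empty answers are out of scope here */
    default:
        return EAI_FAIL;        /* permanent, but not nonexistence */
    }
}
```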
since the buffer passed always has an actual size of 512 bytes, the
maximum possible response packet size, no out-of-bounds access was
possible; however, reading past the end of the valid portion of the
packet could cause the parser to attempt to process junk as answer
content.
when wcsrtombs stopped due to hitting zero remaining space in the
output buffer, it was wrongly clearing the position pointer as if it
had completed the conversion successfully.
this commit rearranges the code somewhat to make a clear separation
between the cases of ending due to running out of output buffer space,
and ending due to reaching the end of input or an illegal sequence in
the input. the new branches have been arranged with the hope of
optimizing more common cases, too.
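the two ways of ending can be told apart by the caller through the
position pointer, as in the following sketch. the helper name and its
return convention are invented for this example, and the expected
results assume the default "C" locale with ASCII-only input.

```c
#include <string.h>
#include <wchar.h>

/* hypothetical helper: converts ws into at most n bytes at dst.
 * returns -1 if the whole string including its terminator was
 * converted (position pointer cleared), -2 on an encoding error, and
 * otherwise the number of wide characters consumed before running out
 * of output space. */
static long consumed_when_full(const wchar_t *ws, char *dst, size_t n)
{
    const wchar_t *src = ws;
    mbstate_t st;

    memset(&st, 0, sizeof st);
    if (wcsrtombs(dst, &src, n, &st) == (size_t)-1)
        return -2;
    if (!src)
        return -1;
    return (long)(src - ws);
}
```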
the old resolver code used a function __ipparse which contained the
logic for inet_addr and inet_aton, which is needed in getaddrinfo.
this was phased out in the resolver overhaul in favor of directly
using inet_aton and inet_pton as appropriate.
this commit cleans up some stuff that was left behind.
this is the third phase of the "resolver overhaul" project.
this commit removes all of the old dns code, and switches the
__lookup_name backend (used by getaddrinfo, etc.) and the getnameinfo
function to use the newly implemented __res_mkquery and __res_msend
interfaces. for parsing the results, a new callback-based __dns_parse
function, based on __dns_get_rr from the old dns code, is used.
this is the second phase of the "resolver overhaul" project.
the key additions in this commit are the __res_msend and __res_mkquery
functions, which have been factored so as to provide a backend for
both the legacy res_* functions and the standard getaddrinfo and
getnameinfo functions. the latter however are still using the old
backend code; there is code duplication which still needs to be
removed, and this will be the next phase of the resolver overhaul.
__res_msend is derived from the old __dns_doqueries function, but
generalized to send arbitrary caller-provided packets in parallel
rather than producing the parallel queries itself. this allows it to
be used (completely trivially) as a backend for res_send. the
factored-out query generation code, with slightly more generality, is
now part of __res_mkquery.
this bug was introduced in the recent resolver overhaul commits. it
likely had visible symptoms. these were probably limited to wrongly
accepting truncated versions of over-long names (vs rejecting them),
as opposed to stack-based overflows or anything more severe, but no
extensive checks were made. there have been no releases where this bug
was present.
now that host and service lookup have been separated in the backend,
there's no need for service lookup functions to pull in the host
lookup code. moreover, dynamic allocation is no longer needed, so this
function should now be async-signal-safe. it's also significantly
smaller.
one change in getservbyname is also made: knowing that getservbyname_r
needs only two character pointers in the caller-provided buffer, some
wasted bss can be avoided.
these changes reduce the size of the function somewhat and remove many
of its dependencies, including free. in principle it should now be
async-signal-safe, but this has not been verified in detail.
minor changes to error handling are also made.
this is the first phase of the "resolver overhaul" project.
conceptually, the results of getaddrinfo are a direct product of a
list of address results and a list of service results. the new code
makes this explicit by computing these lists separately and combining
the results. this adds support for services that have both tcp and udp
versions, where the caller has not specified which it wants, and
eliminates a number of duplicate code paths which were all producing
the final output addrinfo structures, but in subtly different ways,
making it difficult to implement any of the features which were
missing.
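the direct-product structure can be pictured with a toy model such as
the one below. the struct and function names are invented for this
example and are much simpler than the real addrinfo structures; the
nesting order of the loops is also an assumption.

```c
#include <stddef.h>

struct addr_result { int family; const char *addr; };
struct serv_result { int socktype; unsigned short port; };
struct combined    { struct addr_result addr; struct serv_result serv; };

/* hypothetical combiner: out must have room for na*ns entries.
 * every address result is paired with every service result. */
static size_t combine_results(struct combined *out,
                              const struct addr_result *a, size_t na,
                              const struct serv_result *s, size_t ns)
{
    size_t k = 0;

    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < ns; j++) {
            out[k].addr = a[i];
            out[k].serv = s[j];
            k++;
        }
    }
    return k;
}
```

with two addresses and a service that exists for both tcp and udp,
this yields four combined results from a single code path.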
in addition to the above benefits, the refactoring allows for legacy
functions like gethostbyname to be implemented without using the
getaddrinfo function itself. such changes to the legacy functions have
not yet been made, however.
further improvements include matching of service alias names from
/etc/services (previously only the primary name was supported),
returning multiple results from /etc/hosts (previously only the first
matching line was honored), and support for the AI_V4MAPPED and AI_ALL
flags.
features which remain unimplemented are IDN translations (encoding
non-ASCII hostnames for DNS lookup) and the AI_ADDRCONFIG flag.
at this point, the DNS-based name resolving code is still based on the
old interfaces in __dns.c, albeit somewhat simpler in its use of them.
there may be some dead code which could already be removed, but
changes to this layer will be a later phase of the resolver overhaul.
from linux/in.h and linux/in6.h uapi headers the following
missing socket options were added:
IP_NODEFRAG - used with customized ipv4 headers
IPV6_RECVPATHMTU - for ipv6 path mtu
IPV6_PATHMTU - for ipv6 path mtu
IPV6_DONTFRAG - for ipv6 path mtu
IPV6_ADDR_PREFERENCES - RFC5014 Source Address Selection
IPV6_MINHOPCOUNT - RFC5082 Generalized TTL Security Mechanism
IPV6_ORIGDSTADDR - used by tproxy
IPV6_RECVORIGDSTADDR - used by tproxy
IPV6_TRANSPARENT - used by tproxy
IPV6_UNICAST_IF - ipv6 version of IP_UNICAST_IF
and socket option values:
IP_PMTUDISC_OMIT - value for IP_MTU_DISCOVER option, new in linux 3.14
IPV6_PMTUDISC_OMIT - same for IPV6_MTU_DISCOVER
IPV6_PMTUDISC_INTERFACE - ipv6 version of IP_PMTUDISC_INTERFACE
IPV6_PREFER_* - flags for IPV6_ADDR_PREFERENCES
not added: ipv6 flow info and flow label related definitions.
(it's unclear whether libc should define these, and a namespace-polluting
type name is involved, so they are not provided for now)
linux 3.14 introduced sched_getattr and sched_setattr syscalls in
commit d50dde5a10f305253cbc3855307f608f8a3c5f73
and the related SCHED_DEADLINE scheduling policy in
commit aab03e05e8f7e26f51dee792beddcb5cca9215a5
but struct sched_attr "extended scheduling parameters data structure"
is not yet exported to userspace (necessary for using the syscalls)
so related uapi definitions are not added yet.
On 32 bit mips the kernel uses -1UL/2 to mark RLIM_INFINITY (and
this is the definition in the userspace api), but since it is in
the middle of the valid range of limits and limits are often
compared with relational operators, various kernel side logic is
broken if limits larger than -1UL/2 are used. So we truncate the
limits to -1UL/2 in get/setrlimit and prlimit.
Even if the kernel side logic consistently treated -1UL/2 as greater
than any other limit value, there wouldn't be any clean workaround
that allowed using large limits:
* using -1UL/2 as RLIM_INFINITY in userspace would mean different
infinity value for get/setrlimit and prlimit (where infinity is always
-1ULL) and userspace logic could break easily (just like the kernel
is broken now) and more special case code would be needed for mips.
* translating the kernel side value -1UL/2 to -1ULL in userspace would
mean that a limit of -1UL/2 cannot be set (e.g. -1UL/2+1 would have
to be passed to the kernel instead).
using the existence of SYS_stat64 as the condition for remapping other
related syscalls is no longer valid, since new archs that omit the old
syscalls will not have SYS_stat or SYS_stat64, but still potentially
need SYS_fstat and others remapped. it would probably be possible to
get by with just one or two extra conditionals, but just breaking them
all down into separate conditions is robust and not significantly
heavier for the preprocessor.
somehow the remapping of this syscall to the 64-bit version was
overlooked. the issue was found, and patch provided, by Stefan
Kristiansson. presumably the reason this bug was not caught earlier is
that the syscall takes a pointer to off_t rather than a value, so on
little-endian systems, everything appears to work as long as the
offset value fits in the low 31 bits. on big-endian systems, though,
sendfile was presumably completely non-functional.
such archs are expected to omit definitions of the SYS_* macros for
syscalls their kernels lack from arch/$ARCH/bits/syscall.h. the
preprocessor is then able to select an appropriate implementation
for affected functions. two basic strategies are used on a
case-by-case basis:
where the old syscalls correspond to deprecated library-level
functions, the deprecated functions have been converted to wrappers
for the modern function, and the modern function has fallback code
(omitted at the preprocessor level on new archs) to make use of the
old syscalls if the new syscall fails with ENOSYS. this also improves
functionality on older kernels and eliminates the incentive to program
with deprecated library-level functions for the sake of compatibility
with older kernels.
in other situations where the old syscalls correspond to library-level
functions which are not deprecated but merely lack some new features,
such as the *at functions, the old syscalls are still used on archs
which support them. this may change at some point in the future if or
when fallback code is added to the new functions to make them usable
(possibly with reduced functionality) on old kernels.
calling exit more than once invokes undefined behavior. in some cases
it's desirable to detect undefined behavior and diagnose it via a
predictable crash, but the code here was silently covering up an
uncommon case (exit from more than one thread) and turning a much more
common case (recursive calls to exit) into a permanent hang.
these all now use the shared __randname function internally, rather
than duplicating logic for producing a random name. incorrect usage of
the access syscall (which works with real uid/gid, not effective) has
been removed, along with unnecessary heavy dependencies like snprintf.
this was messed up during a recent commit when the socketcall macros
were moved to the common internal/syscall.h, and the following commit
expanded the problem by adding more new content outside the guard.
this only matters on x32 (and perhaps future 32-on-64 abis for other
archs); otherwise the type is long anyway. the cast through uintptr_t
prevents nonsensical "sign extension" of pointers, and follows the
principle that uintptr_t is the canonical integer type to which
pointer conversion is safe.
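a small illustration of why the intermediate type matters, assuming an x32-like model where pointers and long are 32-bit but syscall arguments are 64-bit. int32_t and uint32_t stand in for long and uintptr_t so the effect is visible on any host; the function names are hypothetical.

```c
#include <stdint.h>

/* widening through a signed 32-bit type: addresses with the high bit
 * set get sign-extended (the narrowing conversion itself is
 * implementation-defined; common compilers wrap) */
static int64_t widen_via_long(uint32_t addr)
{
	return (int64_t)(int32_t)addr;
}

/* widening through an unsigned 32-bit type: upper bits are zero */
static int64_t widen_via_uintptr(uint32_t addr)
{
	return (int64_t)(uint32_t)addr;
}
```

for an address such as 0x80001000, the first form yields 0xffffffff80001000 and the second 0x0000000080001000; only the second is a usable pointer value for the kernel.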
open is handled specially because it is used from so many places, in
so many variants (2 or 3 arguments, setting errno or not, and
cancellable or not). trying to do it as a function would not only
increase bloat, but would also risk subtle breakage.
this is the first step towards supporting "new" archs where linux
lacks "old" syscalls.
the main motivation for this change is that, with the previous
definition, it was arguably illegal, in standard C, to initialize both
si_value and si_pid/si_uid with designated initializers, due to the
rule that only one member of a union can have an initializer. whether
or not this affected real-world application code, it affected some
internal code, and clang was producing warnings (and possibly
generating incorrect code).
the new definition uses a more complex hierarchy of structs and unions
to avoid the need to initialize more than one member of a single union
in usage cases that make sense. further work would be needed to
eliminate even the ones with no practical applications.
at the same time, some fixes are made to the exposed names for
nonstandard fields, to match what software using them expects.
%C, %U, %W, and %y handling were completely missing; %C wrongly
fell-through to unrelated cases, and the rest returned failure. for
now, they all parse numbers in the proper forms and range-check the
values, but they do not store the value anywhere.
it's not clear to me whether, as "derived" fields, %U and %W should
produce any result. they certainly cannot produce a result unless the
year and weekday are also converted, but in this case it might be
desirable for them to do so. clarification is needed on the intended
behavior of strptime in cases like this.
%C and %y have well-defined behavior as long as they are used together
(and %y is defined by itself but may change in the future).
implementing them (including their correct interaction) is left as a
later change to be made.
finally, strptime now rejects unknown/invalid format characters
instead of ignoring them.
some of these may have been from ancient (pre-SUSv2) POSIX versions;
more likely, they were from POSIX drafts or glibc interpretations of
what ancient versions of POSIX should have added (instead they made
the described functionality mandatory and/or dropped it completely).
others are purely glibc-isms, many of them ill-thought-out, like
providing ways to lookup the min/max values of types at runtime
(despite the impossibility of them changing at runtime and the
impossibility of representing ULONG_MAX in a return value of type
long).
since our sysconf implementation does not support or return meaningful
values for any of these, it's harmful to have the macros around;
applications' build scripts may detect and attempt to use them, only
to get -1/EINVAL as a result.
if removing them does break some applications, and it's determined
that the usage was reasonable, some of these could be added back on an
as-needed basis, but they should return actual meaningful values, not
junk like they were returning before.
based on patch by Timo Teräs. previously, the value zero was used as a
literal zero, meaning that all invalid sysconf "names", which should
result in sysconf returning -1, had to be explicitly listed. (in
addition, it was not possible for sysconf to set errno to EINVAL, as
there was no distinction between -1 as an error and -1 as a valid
result.)
now, the value 0 is used for invalid/undefined slots in the table and
a new switch table entry is used for returning literal zeros.
in addition, an off-by-one error in checking against the table size is
fixed.
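a rough sketch of the table convention described above. the option names, the JT_ZERO marker and the lookup function are all hypothetical and much simpler than the real sysconf table; the point is only that a zero slot means "invalid", a dedicated marker means "literal zero", and the bounds check uses >= against the table size.

```c
#include <errno.h>

/* hypothetical option numbers, for illustration only */
enum { OPT_FOO, OPT_BAR, OPT_UNUSED, OPT_BAZ, OPT_COUNT };

#define JT_ZERO (-2) /* marker meaning "return a literal zero" */

static const short values[OPT_COUNT] = {
	[OPT_FOO] = 32,      /* plain constant result */
	[OPT_BAR] = JT_ZERO, /* valid name whose result is 0 */
	/* OPT_UNUSED is left as 0: invalid/undefined slot */
	[OPT_BAZ] = -1,      /* valid name, no limit; errno untouched */
};

static long lookup(int name)
{
	/* ">=" rather than ">" avoids the off-by-one on the table size */
	if (name < 0 || name >= OPT_COUNT || !values[name]) {
		errno = EINVAL;
		return -1;
	}
	switch (values[name]) {
	case JT_ZERO:
		return 0;
	default:
		return values[name];
	}
}
```

a caller can now distinguish -1 with errno set to EINVAL (bad name) from -1 with errno unchanged (valid name, indeterminate value).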
this is gcc bug #61144. the broken compiler is detected, but the user
must manually work around it. this is partly to avoid complex logic
for adding workaround CFLAGS and attempting to recheck with them, and
partly for the sake of letting the user know the compiler is broken
(since the workaround will result in less-efficient code production).
some refactoring was also needed to move the check for gcc outside of
the check for whether to build the compiler wrapper.
perhaps some additional legacy DOS-era codepages would also be useful
to have, but these are the ones for which there has been demand. the
size of the diff is due to the fact that legacychars.h is updated in
such a way that new characters are inserted into the table in unicode
codepoint order; thus other mappings in codepages.h have changed to
reflect the new table indices of their characters.
without this, broken choices of CC/CPPFLAGS/CFLAGS don't show up until
late in the configure process where they are confusingly reported as a
different failure such as incorrect long double type.
armv7/thumb2 provides a way to do atomics in thumb mode, but for armv6
we need a call to arm mode.
this commit is based on a patch by Stephen Thomas which fixed the
armv7 cases but not the armv6 ones.
all of this should be revisited if/when runtime selection of thread
pointer access and atomics are added.
As far as gcc3 knows, sh4 is the only processor version that can have an
FPU, so it indicates the FPU's presence by defining __SH4__. This is not
defined if there is no FPU, even if the processor really is an SH4.
Starting with gcc4, there is support for the sh2a processor, which has an
FPU but is not an SH4. gcc4 therefore additionally defines __SH_FPU_ANY__
when there is an FPU, but still doesn't define __SH4__ for an FPU-less sh4.
Therefore, to support all gcc versions, we must look at both preprocessor
symbols.
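A minimal sketch of such a check. HAVE_SH_FPU is a hypothetical macro name used only for illustration; the two tested symbols are the ones named above.

```c
/* gcc3 signals an FPU only via __SH4__; gcc4 and later also define
 * __SH_FPU_ANY__ (covering sh2a), so both must be tested */
#if defined(__SH4__) || defined(__SH_FPU_ANY__)
#define HAVE_SH_FPU 1
#else
#define HAVE_SH_FPU 0
#endif
```

On any non-sh target neither symbol is defined, so HAVE_SH_FPU evaluates to 0.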
previously, setting TZ to the pathname of a file which was not a valid
zoneinfo file would usually cause programs using local time zone based
operations to crash. the new code checks the file size and magic at
the beginning of the file, which seems sufficient to prevent
accidental misconfiguration from causing crashes. attempting to make
fully-robust validation would be futile unless we wanted to drop use
of mmap (shared zoneinfo) and instead read it into a local buffer,
since such validation would be subject to race conditions with
modification of the file.
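a minimal sketch of this kind of sanity check, assuming the file has already been mapped. the function name is hypothetical; the 44-byte minimum is the size of the fixed zoneinfo header (4-byte "TZif" magic, 1 version byte, 15 reserved bytes, six 4-byte counts).

```c
#include <stddef.h>
#include <string.h>

/* returns nonzero if the mapping is at least large enough to hold a
 * zoneinfo header and begins with the expected magic */
static int looks_like_zoneinfo(const unsigned char *map, size_t size)
{
	return size >= 44 && !memcmp(map, "TZif", 4);
}
```

this only guards against accidental misconfiguration, as noted above; it makes no attempt to validate the tables that follow the header.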
being static allows it to be inlined in __libc_start_main; inlining
should take place at all levels since the function is called exactly
once. this further reduces mandatory startup code size for static
binaries.
there is no reason (and seemingly there never was any) for
__init_security to be its own function. it's linked unconditionally
so it can just be placed inline in __init_libc.
moving the call to __init_ssp from __init_security to __init_libc
makes __init_security a leaf function, which allows the compiler to
make it smaller. __init_libc is already non-leaf, and the additional
call makes no difference to the amount of register spillage.
in addition, it really made no sense for the call to __init_ssp to be
buried inside __init_security rather than parallel with other init
functions.
since the form TZ=name is reserved for POSIX-form time zone strings,
TZ=:name needs to be used when the zoneinfo filename is in the
top-level zoneinfo directory and therefore does not contain a slash.
previously the leading colon was merely dropped, making it impossible
to access such zones without a full absolute pathname.
changes based on patch by Timo Teräs.
in cases where the memorized match range from the right factor
exceeded the length of the left factor, it was wrongly treated as a
mismatch rather than a match.
issue reported by Yves Bastide.
so far the options are --library-path and --preload which override the
corresponding environment variables, and --list which forces the
behavior of ldd even if the invocation name is not ldd. both the
two-arg form and the one-arg form using an equals sign are supported.
based loosely on a patch proposed by Rune.
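a sketch of how the two accepted forms of an option with a value could be parsed. opt_value is a hypothetical helper, not the dynamic linker's actual code.

```c
#include <string.h>

/* if argv[*i] is the option "name", return its value, taken either
 * from "name=value" or from the following argument (advancing *i).
 * returns a null pointer if argv[*i] is some other argument or the
 * value is missing. */
static const char *opt_value(char **argv, int *i, const char *name)
{
	size_t n = strlen(name);
	const char *a = argv[*i];

	if (strncmp(a, name, n)) return 0;
	if (a[n] == '=') return a + n + 1;           /* one-arg form */
	if (a[n] == 0 && argv[*i + 1]) return argv[++*i]; /* two-arg form */
	return 0;
}
```

both "--library-path /lib" and "--library-path=/lib" yield "/lib"; an argument that merely starts with the option name, such as "--library-pathX", is rejected.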
the vdso symbol lookup code is based on the original 2011 patch by
Nicholas J. Kain, with some streamlining, pointer arithmetic fixes,
and one symbol version matching fix.
on the consumer side (clock_gettime), per-arch macros for the
particular symbol name and version to lookup are added in
syscall_arch.h, and no vdso code is pulled in on archs which do not
define these macros. at this time, vdso is enabled only on x86_64.
the vdso support at the dynamic linker level is no longer useful to
libc, but is left in place for the sake of debuggers (which may need
the vdso in the link map to find its functions) and possibly use with
dlsym.
at the end of successful pthread_once, there was a race window during
which another thread calling pthread_once would momentarily change the
state back from 2 (finished) to 1 (in-progress). in this case, the
status was immediately changed back, but with no wake call, meaning
that waiters which arrived during this short window could block
forever. there are two possible fixes. one would be adding the wake to
the code path where it was missing. but it's better just to avoid
reverting the status at all, by using compare-and-swap instead of
swap.
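a simplified sketch of the compare-and-swap approach using C11 atomics, with states 0 (not started), 1 (in-progress) and 2 (finished). the names are hypothetical, and waiting is reduced to a retry loop where a real implementation would block on a futex and issue a wake after storing 2.

```c
#include <stdatomic.h>

static int init_calls; /* demo counter, only for illustration */
static void count_init(void) { init_calls++; }

static int once_sketch(atomic_int *control, void (*init)(void))
{
	for (;;) {
		int expected = 0;
		/* only the 0 -> 1 transition is attempted, so a finished
		 * state (2) is never reverted to in-progress (1) */
		if (atomic_compare_exchange_strong(control, &expected, 1)) {
			init();
			atomic_store(control, 2);
			/* a real implementation would wake waiters here */
			return 0;
		}
		if (expected == 2) return 0;
		/* expected == 1: another thread is running init; retry
		 * (a real implementation would sleep instead of spinning) */
	}
}
```

with an unconditional swap, a late caller would briefly overwrite 2 with 1; with compare-and-swap the late caller just observes 2 and returns.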
The mips arch is special in that it uses different RLIMIT_
numbers than other archs, so allow bits/resource.h to override
the default RLIMIT_ numbers (empty on all archs except mips).
Reported by orc.
it will be needed to implement some things in sysconf, and the syscall
can't easily be used directly because the x32 syscall uses the wrong
structure layout. the l (uncreative, for "linux") prefix is used since
the symbol name __sysinfo is already taken for AT_SYSINFO from the aux
vector.
the way the x32 override of this function works is also changed to be
simpler and avoid the useless jump instruction.
the syscall is deprecated (replaced by prlimit64) and does not work
correctly on x32. this change mildly increases size, but is likely
needed anyway for newer archs that might omit deprecated syscalls.
the previous handling of cases that could not fit in the 16-bit table
or which required non-constant results was extremely ugly and could
not scale. the new code remaps these keys into a contiguous range
that's efficient for a switch statement.
aside from potentially offering better performance, this change is
needed since the old coprocessor-based approach to barriers is
deprecated in arm v7, and some compilers/assemblers issue errors when
using the deprecated instruction for v7 targets.
the use of visibility at all is purely an optimization to avoid the
need for the caller to load the GOT register or similar to prepare for
a call via the PLT. there is no reason for these symbols to be
externally visible, so hidden works just as well as protected, and
using protected visibility is undesirable due to toolchain bugs and
the lack of testing it receives.
in particular, GCC's microblaze target is known to generate symbolic
relocations in the GOT for functions with protected visibility. this
in turn results in a dynamic linker which crashes under any nontrivial
usage that requires making a syscall before symbolic relocations are
processed.
modfl and sincosl were passing long double* instead of double*
to the wrapped double precision functions (on archs where long
double and double have the same size).
This is fixed now by using temporaries (this is not optimized
to a single branch so the generated code is a bit bigger).
Found by Morten Welinder.
to optimize the search, memchr is used to find the first occurrence of
the first character of the needle in the haystack before switching to
a search for the full needle. however, the number of characters
skipped by this first step was not subtracted from the haystack
length, causing memmem to search past the end of the haystack.
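a sketch of the corrected prefilter step. memmem_sketch is a hypothetical name, and the full-needle search here is a naive loop used for brevity, not the algorithm the real function uses.

```c
#include <string.h>

static void *memmem_sketch(const void *h0, size_t k,
                           const void *n0, size_t l)
{
	const unsigned char *h = h0, *n = n0;

	if (!l) return (void *)h;   /* empty needle matches at start */
	if (k < l) return 0;

	/* skip ahead to the first occurrence of the needle's first byte */
	const unsigned char *start = memchr(h, *n, k);
	if (!start) return 0;

	/* the skipped bytes must come off the remaining length too;
	 * omitting this lets the search run past the haystack's end */
	k -= (size_t)(start - h);
	h = start;

	for (; k >= l; h++, k--)
		if (!memcmp(h, n, l)) return (void *)h;
	return 0;
}
```

for example, searching the first 4 bytes of "xxabc" for "abc" must fail: after skipping to the 'a', only 2 bytes remain.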
the subsequent rounding code assumes the end pointer (z) accurately
reflects the end of significance in the decimal expansion, but for
certain large integers, spurious trailing zero slots were left behind
when applying the binary exponent.
issue reported by Morten Welinder; the analysis of the cause was
performed by nsz, who also proposed this change.
the "m" constraint could give a memory reference with an offset that's
not compatible with ldrex/strex, so the arm-specific "Q" constraint is
needed instead.
this is perhaps not the optimal implementation; a_cas still compiles
to nested loops due to the different interface contracts of the kuser
helper cas function (whose contract this patch implements) and the
a_cas function (whose contract mimics the x86 cmpxchg). fixing this
may be possible, but it's more complicated and thus deferred until a
later time.
aside from improving performance and code size, this patch also
provides a means of producing binaries which can run on hardened
kernels where the kuser helpers have been disabled. however, at
present this requires producing binaries for armv6k or later, which
will not run on older cpus. a real solution to the problem of kernels
that omit the kuser helpers would be runtime detection, so that
universal binaries which run on all arm cpu models can also be
compatible with all kernel hardening profiles. robust detection
however is a much harder problem, and will be addressed at a later
time.
in a sense this implementation is incomplete since it doesn't provide
the HWCAP_* macros for use with AT_HWCAP, which is perhaps the most
important intended usage case for getauxval. they will be added at a
later time.
the code to strip trailing zeros was only looking in the last slot for
up to 9 zeros, assuming that the rounding code had already removed
fully-zero slots from the end. however, this ignored cases where the
rounding code did not run at all, which occur when the value being
printed is exactly representable in the requested precision.
the simplest solution is to move the code that strips trailing zero
slots to run unconditionally, immediately after rounding, rather than
as the last step of rounding.
in cases where rounding caused a carry, the slot into which the carry
was taking place was unconditionally treated as valid, despite the
possibility that it could be a new slot prior to the beginning of the
existing non-rounded number. in theory this could lead to unbounded
runaway carry, but in order for that to happen, the whole
uninitialized buffer would need to have been pre-filled with 32-bit
integer values greater than or equal to 999999999.
patch based on proposed fix by Morten Welinder, who also discovered
and reported the bug.
the function itself was static, but the weak alias provided an
externally visible reference and thus prevented the dead code from
being omitted from the output. so this change actually reduces bloat
in mandatory static-linked code.
There are two changes here, both of which make sense to be done in a
single patch:
- Remove hash from struct elem and compute it at runtime wherever
necessary.
- Eliminate struct elem and use ENTRY directly.
As a result we cut down on the memory usage as each element in the
hash table now contains only an ENTRY not an ENTRY + size_t for the
hash. The downside is that the hash needs to be computed at runtime.
the size and alignment of struct hsearch_data are matched to the glibc
definition for binary compatibility. the members of the structure do
not match, which should not be a problem as long as applications
correctly treat the structure as opaque.
unlike the glibc implementation, this version of hcreate_r does not
require the caller to zero-fill the structure before use.
this issue mainly affects PIE binaries and execution of programs via
direct invocation of the dynamic linker binary: depending on kernel
behavior, in these cases the initial brk may be placed at a location
where it cannot be extended, due to conflicting adjacent maps.
when brk fails, mmap is used instead to expand the heap. in order to
avoid expensive bookkeeping for managing fragmentation by merging
these new heap regions, the minimum size for new heap regions
increases exponentially in the number of regions. this limits the
number of regions, and thereby the number of fixed fragmentation
points, to a quantity which is logarithmic with respect to the size of
virtual address space and thus negligible. the exponential growth is
tuned so as to avoid expanding the heap by more than approximately 50%
of its current total size.
the kernel entry point for syscalls on microblaze nominally saves and
restores all registers, and testing on qemu always worked since qemu
behaves this way too. however, the real kernel treats r3:r4 as a
potential 64-bit return value from the syscall function, and copies
both over top of the saved registers before returning to userspace.
thus, we need to treat r4 as always-clobbered.
now that thread pointer is initialized always, ssp canary
initialization can be done unconditionally. this simplifies
the ldso as it does not try to detect ssp usage, and the
init function itself as it is always called exactly once.
this also merges ssp init path for shared and static linking.
record phentsize in struct dso, so the phdrs can be easily
enumerated via it. simplify all functions enumerating phdrs
to require only struct dso. also merge find_map_range and
find_dso to kernel_mapped_dso function that does both tasks
during single phdr enumeration.
this is the first step in an overhaul aimed at greatly simplifying and
optimizing everything dealing with thread-local state.
previously, the thread pointer was initialized lazily on first access,
or at program startup if stack protector was in use, or at certain
random places where inconsistent state could be reached if it were not
initialized early. while believed to be fully correct, the logic was
fragile and non-obvious.
in the first phase of the thread pointer overhaul, support is retained
(and in some cases improved) for systems/situations where loading the
thread pointer fails, e.g. old kernels.
some notes on specific changes:
- the confusing use of libc.main_thread as an indicator that the
thread pointer is initialized is eliminated in favor of an explicit
has_thread_pointer predicate.
- sigaction no longer needs to ensure that the thread pointer is
initialized before installing a signal handler (this was needed to
prevent a situation where the signal handler caused the thread
pointer to be initialized and the subsequent sigreturn cleared it
again) but it still needs to ensure that implementation-internal
thread-related signals are not blocked.
- pthread tsd initialization for the main thread is deferred in a new
manner to minimize bloat in the static-linked __init_tp code.
- pthread_setcancelstate no longer needs special handling for the
situation before the thread pointer is initialized. it simply fails
on systems that cannot support a thread pointer, which are
non-conforming anyway.
- pthread_cleanup_push/pop now check for missing thread pointer and
nop themselves out in this case, so stdio no longer needs to avoid
the cancellable path when the thread pointer is not available.
a number of cases remain where certain interfaces may crash if the
system does not support a thread pointer. at this point, these should
be limited to pthread interfaces, and the number of such cases should
be fewer than before.
the external mmap function is heavy because it has to handle error
reporting that the kernel cannot do, and has to do some locking for
arcane race-condition-avoidance purposes. for allocating initial TLS,
we do not need any of that; the raw syscall suffices.
on i386, this change shaves off 13% of the size of .text for the empty
program.
in general, we aim to always include the header that's declaring a
function before defining it so that the compiler can check that
prototypes match.
additionally, the internal syscall.h declares __syscall_ret with a
visibility attribute to improve code generation for shared libc (to
prevent gratuitous GOT-register loads). this declaration should be
visible at the point where __syscall_ret is defined, too, or the
inconsistency could theoretically lead to problems at link-time.
in addition to the dbm functions (which we don't intend to implement
anyway), fmtmsg is still missing too. rather than adding exceptions I
think it's best just to avoid making the claim.
the text covering an ill-advised procedure for 'bootstrapping' a new
musl-based system in-place is removed. new information on targets and
compilers is added. formatting improved. the remaining text is
adjusted to cover both usage with musl-gcc on a non-musl-based system
and upgrading a musl-based system or toolchain.
the excess space was unused and unintentional. this change does not
affect the ABI between applications and libc. while it does
theoretically affect linkage between third-party translation units
using jmp_buf as part of a structure, we've already changed jmp_buf at
least once on all archs, and problems were never observed, likely
because such usage would be very unusual. in any case it's best to get
things right now rather than making changes sometime during the 1.0.x
series or later.
on x32, this change allows programs which use syscall() with pointers
or 64-bit values as arguments to work correctly, i.e. without
truncation or incorrect sign extension. on all other supported archs,
syscall_arg_t is defined as long, so this change is a no-op.
the previous pattern required "x32" to be used as the second field of
the gcc tuple, which is usually reserved for vendor use and not
appropriate as an ABI specifier. with this change, putting "x32" at
the end of the tuple, the way ABI specifiers are normally done, is
also permitted.
the incorrect error codes also made their way into errno when
__ptsname_r was called by plain ptsname, which reports errors via
errno rather than a return value.
Applications ended up with copy relocations for this array, which
resulted in libc's references to this array pointing to the
application's copy. The dynamic linker, however, can require this array
before the application is relocated, and therefore before the
application's copy of this array is initialized. This resulted in
garbage being loaded into FPSCR before executing main, which violated
the ABI.
We fix this by putting the array in crt1 and making the libc copy
private. This prevents libc's reference to the array from pointing to
an uninitialized copy in the application.
it's UB to fetch variadic args when none are passed, and this caused
real crashes on ppc due to its calling convention, which defines that
for variadic functions aggregate types be passed as pointers.
the assignment caused that pointer to get dereferenced, resulting in
a crash.
The mips statfs struct layout is different than on other archs, so the
statfs, fstatfs, statvfs and fstatvfs APIs were broken on mips.
Now the ordering is fixed and the types are kept consistent with other archs.
these have been wrong for a long time and were never detected or
corrected. powerpc needs some gratuitous extra padding/reserved slots
in ipc_perm, big-endian ordering for the padding of time_t slots that
was intended by the kernel folks to allow a transition to 64-bit
time_t, and some minor gratuitous reordering of struct members.
the definition was found to be incorrect at least for powerpc, and
fixing this cleanly requires making the definition arch-specific. this
will allow cleaning up the definition for other archs to make it more
specific, and reversing some of the ugliness (time_t hacks) introduced
with the x32 port.
this first commit simply copies the existing definition to each arch
without any changes. this is intentional, to make it easier to review
changes made on a per-arch basis.
the printf floating point formatting code contains an optimization to
avoid computing digits that will be thrown away by rounding at the
specified (or default) precision. while it was correctly retaining all
places up to the last decimal place to be printed, it was not
retaining enough precision to see the next nonzero decimal place in
all cases. this could cause incorrect rounding down in round-to-even
(default) rounding mode, for example, when printing 0.5+DBL_EPSILON
with "%.0f".
in the fix, LDBL_MANT_DIG/3 is a lazy (non-sharp) upper bound on the
number of zeros between any two nonzero decimal digits.
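a small way to exercise the case mentioned above. format_rounded is a hypothetical wrapper around the standard snprintf.

```c
#include <float.h>
#include <stdio.h>

/* format x with zero decimal places, as in the example above */
static int format_rounded(char *buf, size_t n, double x)
{
	return snprintf(buf, n, "%.0f", x);
}
```

0.5+DBL_EPSILON is exactly representable and strictly greater than one half, so a correct implementation must produce "1"; exactly 0.5 is a tie and produces "0" under the default round-to-even mode.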
empirically the overflow was an off-by-one, and it did not seem to be
overwriting meaningful data. rather than simply increasing the buffer
size by one, however, I have attempted to make the size obviously
correct in terms of bounds on the number of iterations for the loops
that fill the buffer. this still results in no more than a negligible
size increase of the buffer on the stack (6-7 32-bit slots) and is a
"safer" fix unless/until somebody wants to do the proof that a smaller
buffer would suffice.
this was problematic because several archs don't define __WORDSIZE. we
could add it, but I would rather phase this macro out in the long
term. in our version of the headers, UINTPTR_MAX is available here, so
just use it instead.
neither is correct; different commands take different argument types,
and some take no arguments at all. I have a much larger overhaul of
fcntl prepared to address this, but it's not appropriate to commit
during freeze.
the immediate problem being addressed affects forward-compatibility on
x32: if new commands are added and they take pointers, but the
libc-level fcntl function is not aware of them, using long would
sign-extend the pointer to 64 bits and give the kernel an invalid
pointer. on the kernel side, the argument to fcntl is always treated
as unsigned long, so no harm is done by treating possibly-signed
integer arguments as unsigned. for every command that takes an integer
argument except for F_SETOWN, large integer arguments and negative
arguments are handled identically anyway. in the case of F_SETOWN, the
kernel is responsible for converting the argument which it received as
unsigned long to int, so the sign of negative arguments is recovered.
the other problem that will be addressed later is that the type passed
to va_arg does not match the type in the caller of fcntl. an advanced
compiler doing cross-translation-unit analysis could potentially see
this mismatch and issue warnings or otherwise make trouble.
on i386, this patch was confirmed not to alter the code generated by
gcc 4.7.3. in principle the generated code should not be affected on
any arch except x32.
the kernel uses long longs in the struct, but the documentation
says they're long. so we need to fixup the mismatch between the
userspace and kernelspace structs.
since the struct offers a mem_unit member, we can avoid truncation
by adjusting that value.
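a sketch of the idea, under the assumption that the kernel reports 64-bit counts while the userspace fields are 32-bit. the struct and function names are hypothetical and only two of the fields are shown; this is not the actual conversion code.

```c
#include <stdint.h>

struct kinfo { uint64_t totalram, freeram; uint32_t mem_unit; };
struct uinfo { uint32_t totalram, freeram; uint32_t mem_unit; };

/* scale the counts down until they fit in 32 bits and scale mem_unit
 * up by the same factor, so count * mem_unit is preserved up to the
 * low bits that are shifted out. extreme inputs could still overflow
 * the scaled mem_unit; that is not handled here. */
static void convert(struct uinfo *u, const struct kinfo *k)
{
	uint64_t all = k->totalram | k->freeram;
	int shift = 0;

	while ((all >> shift) > UINT32_MAX) shift++;

	u->totalram = (uint32_t)(k->totalram >> shift);
	u->freeram = (uint32_t)(k->freeram >> shift);
	u->mem_unit = (k->mem_unit ? k->mem_unit : 1) << shift;
}
```

callers that multiply each field by mem_unit, as the interface expects, see the same totals as before rather than truncated ones.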
if we ever encounter other targets where error codes don't fit in the
8-bit range, the table should probably just be bumped to 16-bit, but
for now I don't want to increase the table size on all archs just
because of a bug in the mips abi.
most notably, it was failing to match sh4-*, etc., but in general the
explicit matching of hyphens for some archs was problematic because it
failed to accept simply the musl-style arch name (without a gcc-style
tuple) as an input. the original motivation of matching hyphens was to
prevent incorrectly identifying a 64-bit arch as the corresponding
32-bit arch (e.g. mips* matching mips64) but this is easily fixed by
simply checking (and for now, rejecting as unsupported) the relevant
64-bit archs.
linux, gcc, etc. all use "sh" as the name for the superh arch. there
was already some inconsistency internally in musl: the dynamic linker
was searching for "ld-musl-sh.path" as its path file despite its own
name being "ld-musl-superh.so.1". there was some sentiment in both
directions as to how to resolve the inconsistency, but overall "sh"
was favored.
per POSIX, ENOENT is reserved for invalid stream position; it is an
optional error and would only happen if the application performs
invalid seeks on the underlying file descriptor. however, linux's
getdents syscall also returns ENOENT if the directory was removed
between the time it was opened and the time of the read. we need to
catch this case and remap it to simple end-of-file condition (null
pointer return value like an error, but no change to errno). this
issue reportedly affects GNU make in certain corner cases.
rather than backing up and restoring errno, I've just changed the
syscall to be made in a way that doesn't affect errno (via an inline
syscall rather than a call to the __getdents function). the latter
still exists for the purpose of providing the public getdents alias
which sets errno.
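a sketch of the resulting decision logic, assuming the raw syscall returns a byte count on success, 0 at end of directory, and a negated error code on failure. classify_getdents is a hypothetical helper.

```c
#include <errno.h>

/* returns 1 if entries were read, 0 for end-of-directory (errno left
 * untouched), -1 for a real error (errno set) */
static int classify_getdents(long ret)
{
	if (ret > 0) return 1;
	if (ret < 0 && ret != -ENOENT) {
		errno = (int)-ret;
		return -1;
	}
	/* ret == 0, or -ENOENT because the directory was removed after
	 * it was opened: both are reported as plain end-of-file */
	return 0;
}
```

because the raw return value is inspected directly, there is no need to save and restore errno around the call.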
Userspace emulated floating-point (gcc -msoft-float) is not compatible
with the default mips abi (assumes an FPU or in-kernel emulation of it).
Soft and hard float abi should not be mixed; __mips_soft_float is checked
in musl's configure script and there is no runtime check. The -sf subarch
does not save/restore floating-point registers in setjmp/longjmp and only
provides dummy fenv implementation.
the reordering of headers caused some risc archs to not see
the __syscall declaration anymore.
this caused build errors on mips with any compiler,
and on arm and microblaze with clang.
we now declare it locally just like the powerpc port does.
it's legal to call the __syscall functions with more arguments than
necessary, and the __syscall_cp cancel dummy impl. does just that.
thus we must insert the switch for all possible syscall numbers
into all of the syscallN inline functions.
- the nanosleep fixup "fixed" the second timespec* argument erroneously.
- the futex fixup was missing the check for FUTEX_WAIT.
- general cleanup using a macro.
x32 is the internal arch name, but glibc uses x86_64-x32.
there doesn't exist a specific triple for x32 in gcc and binutils.
you're supposed to build your compiler for x86_64 and configure
it with multilib support for "mx32".
however it turns out that using a triple of x86_64-x32 makes
gcc and binutils pick up the right arch (they detect it as x86_64)
and allows us to have a unique triple for cross-compiler toolchains.
some 32-on-64 archs require that the actual syscall args be long long.
in that case syscall_arch.h can define syscall_arg_t to whatever it needs
and syscall.h picks it up.
all other archs just use long as usual.
this allows syscall_arch.h to define the macro __scc if special
casting is needed, as is the case for x32, where the actual syscall
arguments are 64bit, but, in case of pointers, would get sign-extended
and thus become invalid.
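
a minimal sketch of the sign-extension hazard described above, using
fixed-width integers to stand in for 32-bit pointers and 64-bit syscall
arguments. widen_signed and widen_unsigned are hypothetical names, and
this is not musl's actual __scc definition.

```c
#include <stdint.h>

/* widening through a signed 32-bit type sign-extends any address with
 * the top bit set (the unsigned-to-signed conversion is
 * implementation-defined, but wraps on the usual compilers) */
long long widen_signed(uint32_t ptr_bits)
{
    return (long long)(int32_t)ptr_bits;
}

/* widening through an unsigned type zero-extends and keeps the address */
long long widen_unsigned(uint32_t ptr_bits)
{
    return (long long)ptr_bits;
}
```

for an address such as 0x80001000, the signed path yields a negative
64-bit value, which a 64-bit kernel would not treat as the same address.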
the other atomic FD_CLOEXEC interfaces (dup3, pipe2, socket) already
had such emulation in place. the justification for doing the emulation
here is the same as for the other functions: it allows applications to
simply use accept4 rather than having to have their own fallback code
for ENOSYS/EINVAL (which one you get is arch-specific!) and there is
no reasonable way an application could benefit from knowing the
operation is emulated/non-atomic since there is no workaround at the
application level for non-atomicity (that is the whole reason these
interfaces were added).
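
a rough sketch of what such an emulation might look like, assuming a
plain accept followed by fcntl. accept4_fallback is a hypothetical name
and this is not musl's code; it is non-atomic by nature, since another
thread could fork and exec between the two calls.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

/* hypothetical fallback for kernels lacking accept4 */
int accept4_fallback(int fd, struct sockaddr *addr, socklen_t *len, int flags)
{
    /* reject flags this sketch does not know how to emulate */
    if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK)) {
        errno = EINVAL;
        return -1;
    }
    int ret = accept(fd, addr, len);
    if (ret < 0) return ret;
    /* apply the requested flags after the fact; this is the racy part */
    if (flags & SOCK_CLOEXEC)
        fcntl(ret, F_SETFD, FD_CLOEXEC);
    if (flags & SOCK_NONBLOCK)
        fcntl(ret, F_SETFL, O_NONBLOCK);
    return ret;
}
```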
this was unlikely to lead to any crash or dangerous behavior, but
caused adjacent string constants to be treated as part of the
protocols table, possibly returning nonsensical results for unknown
protocol names/numbers or when getprotoent was called in a loop to
enumerate all protocols.
gcc -Wsign-compare warns about expanded macros that were defined in
standard headers (before gcc 4.8) which can make builds fail that
use -Werror. changed macros: WIFSIGNALED, __CPU_op_S
The architecture-specific assembly versions of clone did not set errno on
failure, which is inconsistent with glibc. __clone still returns the error
via its return value, and clone is now a wrapper that sets errno as needed.
The public clone has also been moved to src/linux, as it's not directly
related to the pthreads API.
__clone is called by pthread_create, which does not report errors via
errno. Though not strictly necessary, it's nice to avoid clobbering errno
here.
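
the split between a raw function and an errno-setting wrapper can be
sketched as follows. raw_op and wrapped_op are hypothetical stand-ins
(not __clone and clone themselves), and the -4095..-1 error range is the
usual Linux raw-syscall convention, assumed here.

```c
#include <errno.h>

/* hypothetical raw operation: reports failure as a negative errno code
 * in its return value and never touches errno */
long raw_op(long arg)
{
    return arg < 0 ? -EINVAL : arg * 2;
}

/* public-style wrapper: converts the raw result to the -1/errno convention */
long wrapped_op(long arg)
{
    long r = raw_op(arg);
    if ((unsigned long)r > -4096UL) {
        errno = -r;
        return -1;
    }
    return r;
}
```

internal callers that report errors some other way can use the raw
function and leave errno alone.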
the default fenv was not set up properly; in particular, the
tag word that indicates the contents of the x87 registers was
set to 0 (used) instead of 0xffff (empty).
this could cause random crashes after setting the default fenv,
because it corrupted the fpu stack, after which any float computation
gave a NaN result, breaking the program logic (usually after a
float to integer conversion).
this saves a syscall in the case where the underlying open already
took place with O_APPEND, which is common because fopen with append
modes sets O_APPEND at the time of open before passing the file
descriptor to __fdopen.
when there is unflushed output, ftello (and ftell) compute the logical
stream position as the underlying file descriptor's offset plus an
adjustment for the amount of buffered data. however, this can give the
wrong result for append-mode streams where the unflushed writes should
adjust the logical position to be at the end of the file, as if a seek
to end-of-file takes place before the write.
the solution turns out to be a simple trick: when ftello (indirectly)
calls lseek to determine the current file offset, use SEEK_END instead
of SEEK_CUR if the stream is append-mode and there's unwritten
buffered data.
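
the trick can be sketched as a small helper. logical_offset is a
hypothetical function operating on a bare file descriptor, with the
append flag and pending byte count passed in explicitly; it is not
musl's ftello.

```c
#include <sys/types.h>
#include <unistd.h>

/* fd: underlying descriptor; append: nonzero for append-mode streams;
 * pending: number of buffered bytes not yet written */
off_t logical_offset(int fd, int append, size_t pending)
{
    /* unwritten data on an append-mode stream will land at end-of-file,
     * so measure from there instead of from the current offset */
    int whence = (append && pending) ? SEEK_END : SEEK_CUR;
    off_t base = lseek(fd, 0, whence);
    if (base < 0) return base;
    return base + (off_t)pending;
}
```

note that the SEEK_END query moves the descriptor's offset as a side
effect, which is harmless for a descriptor opened with O_APPEND since
every write goes to the end anyway.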
the ISO C rules regarding switching between reading and writing for a
stream opened in an update mode, along with the POSIX rules regarding
switching "active handles", conveniently leave undefined the
hypothetical usage cases where this fix might lead to observably
incorrect offsets.
the bug being fixed was discovered via the test case for glibc issue
the affected part of the header is responsible for providing both GNU
and BSD versions of the udphdr structure. previously, the
namespace-polluting GNU names were always used for the actual struct
members, and the BSD names, which are named in a manner resembling a
sane namespace, were always macros defined to expand to the GNU names.
now, unless _GNU_SOURCE is defined, the BSD names are used as the
actual structure members, and the macros and GNU names only come into
play when the application requests them.
there are two versions of this structure: the BSD version and the GNU
version. previously only the GNU version was supported. the only way
to support both simultaneously is with an anonymous union, which was a
nonstandard extension prior to C11, so some effort is made to avoid
breakage with compilers which do not support anonymous unions.
this commit is based on a patch by Timo Teräs, but with some changes.
in particular, the GNU version of the structure is not exposed unless
_GNU_SOURCE is defined; this both avoids namespace pollution and
dependency on anonymous unions in the default feature profile.
these are poorly designed (illogical argument order) and even poorly
implemented (brace issues) on glibc, but unfortunately some software
is using them. we could consider removing them again in the future at
some point if they're documented as deprecated, but for now the
simplest thing to do is just to provide them under _GNU_SOURCE.
some applications expect it to be defined, despite the standard making
it impossible for it to ever be returned as a value distinct from
NO_DATA. since these macros are outside the scope of the current
standards, no special effort is made to hide NO_ADDRESS under
conditions where the others are exposed.
STB_WEAK is only a weak reference for undefined symbols (those with a
section of SHN_UNDEF). otherwise, it's a weak definition. normally
this distinction would not matter, since a relocation referencing a
symbol that also provides a definition (not SHN_UNDEF) will always
succeed in finding the referenced symbol itself. however, in the case
of copy relocations, the referenced symbol itself is ignored in order
to search for another symbol to copy from, and thus it's possible that
no definition is found. in this case, if the symbol being resolved
happened to be a weak definition, it was misinterpreted as a weak
reference, suppressing the error path and causing a crash when the
copy relocation was performed with a null source pointer passed to
memcpy.
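
the distinction can be written as a predicate over a symbol table entry.
is_weak_reference is a hypothetical helper using the standard <elf.h>
definitions; the dynamic linker's actual code is structured differently.

```c
#include <elf.h>

/* a weak symbol is a weak *reference* only when it is undefined;
 * a weak symbol with a section index is a weak *definition* */
int is_weak_reference(const Elf64_Sym *sym)
{
    return ELF64_ST_BIND(sym->st_info) == STB_WEAK
        && sym->st_shndx == SHN_UNDEF;
}
```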
there are almost certainly still situations in which invalid
combinations of symbol and relocation types can cause the dynamic
linker to crash (this is pretty much inevitable), but the intent is
that crashes not be possible for symbol/relocation tables produced by
a valid linker.
setstate could use the results of previous initstate or setstate
calls (they return the old state buffer), but the documentation
requires that an initialized state buffer should be possible to
use in setstate immediately, which means that initstate should
save the generator parameters in it.
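
a small check of the documented requirement, as a sketch: a buffer set
up by initstate is handed to setstate without ever having been returned
by an earlier call. the function name and buffer sizes are arbitrary
choices for illustration.

```c
#include <stdlib.h>

/* returns 1 if a buffer prepared by initstate can be passed straight to
 * setstate and reproduces the sequence for its seed */
int setstate_after_initstate_works(void)
{
    /* unsigned arrays keep the 64-byte buffers suitably aligned */
    static unsigned a[16], b[16], c[16];

    initstate(1, (char *)a, sizeof a);  /* a is initialized and current */
    initstate(2, (char *)b, sizeof b);  /* switch away without using a */
    setstate((char *)a);                /* a must carry its own parameters */
    long from_a = random();

    initstate(1, (char *)c, sizeof c);  /* fresh generator, same seed */
    long from_c = random();

    return from_a == from_c;
}
```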
I also removed the copyright notice since it is present in the
copyright file.
install.sh was wrongly waiting until after atomically replacing the
old file to set the correct permissions on the new file. in the case
of the dynamic linker, this would cause a dynamic-linked chmod command
not to run (due to missing executable permissions on the dynamic
linker) and thus leave the system in an unusable state.
even if chmod is static-linked, the old behavior had a race window
where dynamic-linked programs could fail to run.
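
the corrected ordering, sketched in C rather than shell: permissions are
set while the new file still has its temporary name, and only then is it
renamed over the target. install_file and its parameters are
hypothetical and do not mirror install.sh line for line.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int install_file(const char *tmp, const char *dest,
                 const void *data, size_t len, mode_t mode)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) return -1;
    /* write contents and fix permissions before the file becomes visible */
    if (write(fd, data, len) != (ssize_t)len || fchmod(fd, mode) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    /* rename is the atomic step; dest never exists with wrong permissions */
    if (close(fd) < 0 || rename(tmp, dest) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```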
the operand size is unnecessary, since the assembler knows it from the
destination register size. removing the suffix makes it so the same
code should work for x32.
in fixing this, I've changed the logic from ugly #if/#else blocks
inside the struct shm_info definition to a fixed struct definition and
optional macros to rename the elements. this will be helpful if we
need to move shm_info to a bits header in the future, as it will keep
the feature test logic out of bits.
the fix should be complete on archs that use the generic definitions
(i386, arm, x86_64, microblaze), but mips and powerpc have not been
checked thoroughly and may need more fixes.
the imr_, imsf_, ip6_, ip6m_, ipi_, ipi6_, SCM_, and SOL_ prefixes are
not in the reserved namespace for this header. thus the constants and
structures using them need to be protected under appropriate feature
test macros.
this also affects some headers which are permitted to include
netinet/in.h, particularly netdb.h and arpa/inet.h.
the SOL_ macros are moved to sys/socket.h where they are in the
reserved namespace (SO*). they are still accessible via netinet/in.h
since it includes sys/socket.h implicitly (which is permitted).
the SCM_SRCRT macro is simply removed, since the definition used for
it, IPV6_RXSRCRT is not defined anywhere. it could be re-added, this
time in sys/socket.h, if the appropriate value can be determined;
however, given that the erroneous definition was not caught, it is
unlikely that any software actually attempts to use SCM_SRCRT.
per POSIX, the variadic argument has type union semun, which may
contain a pointer or int; the type read depends on the command being
issued. this allows the userspace part of the implementation to be
type-correct without requiring special-casing for different commands.
the kernel always expects to receive the argument interpreted as
unsigned long (or equivalently, a pointer), and does its own handling
of extracting the int portion from the representation, as needed.
this change fixes two possible issues: most immediately, reading the
argument as a (signed) long and passing it to the syscall would
perform incorrect sign-extension of pointers on the upcoming x32
target. the other possible issue is that some archs may use different
(user-space) argument-passing convention for unions, preventing va_arg
from correctly obtaining the argument when the type long (or even
unsigned long or void *) is passed to it.
really, fcntl should be changed to use the correct type corresponding
to cmd when calling va_arg, and to carry the correct type through
until making the syscall. however, this greatly increases binary size
and does not seem to offer any benefits except formal correctness, so
I'm holding off on that change for now.
the minimal changes made in this patch are in preparation for addition
of the x32 port, where the syscall macros need to know whether their
arguments are pointers or integers in order to properly pass them to
the 64-bit kernel.
it's unclear what the historical signature for this function was, but
semantically, the argument should be a pointer to const, and this is
what glibc uses. correct programs should not be using this function
anyway, so it's unlikely to matter.
this change is consistent with the corresponding glibc functions and
is semantically const-correct. the incorrect argument types without
const seem to have been taken from erroneous man pages.
this functionality has essentially always been deprecated in linux,
and was never supported by musl. the presence of the header was
reported to cause some software to attempt to use the nonexistent
function, so removing the header is the cleanest solution.
this was wrong since the original commit adding inotify, and I don't
see any explanation for it. not even the man pages have it wrong. it
was most likely a copy-and-paste error.
this practice came from very early, before internal/syscall.h defined
macros that could accept pointer arguments directly and handle them
correctly. aside from being ugly and unnecessary, it looks like it
will be problematic when we add support for 32-bit ABIs on archs where
registers (and syscall arguments) are 64-bit, e.g. x32 and mips n32.
this agrees with implementation practice on glibc and BSD systems, and
is the const-correct way to do things; it eliminates warnings from
passing pointers to const. the prototype without const came from
seemingly erroneous man pages.
the header is included only as a guard to check that the declaration
and definition match, so the typo didn't cause any breakage aside
from omitting this check.
the reasons are the same as for sbrk. unlike sbrk, there is no safe
usage because brk does not return any useful information, so it should
just fail unconditionally.
use of sbrk is never safe; it conflicts with malloc, and malloc may be
used internally by the implementation basically anywhere. prior to
this change, applications attempting to use sbrk to do their own heap
management simply caused untrackable memory corruption; now, they will
fail with ENOMEM allowing the errors to be fixed.
sbrk(0) is still permitted as a way to get the current brk; some
misguided applications use this as a measurement of their memory
usage or for other related purposes, and such usage is harmless.
eventually sbrk may be re-added if/when malloc is changed to avoid
using the brk by using mmap for all allocations.
ssi_ptr is really 64-bit in kernel, so fix that. assuming sizeof(void*)
for it also caused incorrect padding for 32-bits, as the following
64-bits are aligned to 64-bits (and the padding was not taken into
account), so fix the padding as well. add addr_lsb field while there.
the workaround/fallback code for supporting O_PATH file descriptors
when the kernel lacks support for performing these operations on them
caused EBADF to get replaced by ENOENT (due to missing entry in
/proc/self/fd). this is unlikely to affect real-world code (calls that
might yield EBADF are generally unsafe, especially in library code)
but it was breaking some test cases.
the fix I've applied is something of a tradeoff: it adds one syscall
to these operations on kernels where the workaround is needed. the
alternative would be to catch ENOENT from the /proc lookup and
translate it to EBADF, but I want to avoid doing that in the interest
of not touching/depending on /proc at all in these functions as long
as the kernel correctly supports the operations. this is following the
general principle of isolating hacks to code paths that are taken on
broken systems, and keeping the code for correct systems completely
hack-free.
the ABI allows the callee to clobber stack slots that correspond to
arguments passed in registers, so the caller must adjust the stack
pointer to reserve space appropriately. prior to this fix, the argv
array was possibly clobbered by dynamic linker code before passing
control to the main program.
our getcwd already (as an extension) supports allocation of a buffer
when the buffer argument is a null pointer, so there's no need to
duplicate the allocation logic in this wrapper function. duplicating
it is actually harmful in that it doubles the stack usage from
PATH_MAX to 2*PATH_MAX.
the wildcard function in GNU make includes dangling symlinks; if any
exist under the .git directory, they would get added as dependencies,
causing make to exit with an error due to lacking a rule to build the
missing file.
as far as I can tell, git operations which should force version.h to
be rebuilt must all touch the mtime of the top-level .git directory.
historically these functions appeared in BSD 4.3 without prototypes,
then in the bind project prototypes were added to resolv.h, but those
were incompatible with the definitions of the implementation.
the bind resolv.h became the de facto api most systems use now, but the
old internal definitions found their way into the linux manuals and thus
into musl.
previously this flag was defined and accepted as a no-op, possibly
breaking some software that uses it. given the choice to remove the
definition and possibly break applications that were already working,
or simply implement the feature, the latter turned out to be easy
enough to make the decision easy.
in the case where the FNM_PATHNAME flag is also set, this
implementation is clean and essentially optimal. otherwise, it's an
inefficient "brute force" implementation. at some point, when cleaning
up and refactoring this code, I may add a more direct code path for
handling FNM_LEADING_DIR in the non-FNM_PATHNAME case, but at this
point my main interest is avoiding introducing new bugs in the code
that implements the standard fnmatch features specified by POSIX.
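
for reference, a sketch of the behavior applications expect from the
flag: the pattern may match either the whole string or an initial run of
it that is followed by a slash. matches_leading_dir is a hypothetical
wrapper; the _GNU_SOURCE guard is there because some libcs only expose
FNM_LEADING_DIR as an extension.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fnmatch.h>

/* 1 if pat matches path, or matches a leading part of path that ends
 * just before a '/' */
int matches_leading_dir(const char *pat, const char *path)
{
    return fnmatch(pat, path, FNM_LEADING_DIR) == 0;
}
```

so "foo" matches "foo/bar" with the flag set, but still does not match
"foobar".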
this is still experimental and subject to change. for git checkouts,
an attempt is made to record the exact revision to aid in bug reports
and debugging. no version information is recorded in the static libc.a
or binaries it's linked into.
the FNM_PATHNAME logic for advancing by /-delimited components was
incorrect when the / character was escaped (i.e. \/), and a final \ at
the end of pattern was not handled correctly.
a '/' in the pattern could be incorrectly matched against the
terminating null byte in the string, causing an arbitrarily long
sequence of out-of-bounds accesses in fnmatch("/","",FNM_PATHNAME).
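
a regression check for this case could look like the sketch below;
slash_matches_empty is a hypothetical name. the expected answer is a
plain non-match.

```c
#include <fnmatch.h>

/* returns 1 if the pattern "/" is (wrongly) reported to match "" */
int slash_matches_empty(void)
{
    return fnmatch("/", "", FNM_PATHNAME) == 0;
}
```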
a v6 socket will only be used if there is at least one v6 nameserver
address. if the kernel lacks v6 support, the code will fall back to
using a v4 socket and requests to v6 servers will silently fail. when
using a v6 socket, v4 addresses are converted to v4-mapped form and
setsockopt is used to ensure that the v6 socket can accept both v4 and
v6 traffic (this is on-by-default on Linux but the default is
configurable in /proc and so it needs to be set explicitly on the
socket level). this scheme avoids increasing resource usage during
lookups and allows the existing network io loop to be used without
modification.
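
the two pieces of this scheme can be sketched as follows. both helper
names are hypothetical, and the resolver's actual code is organized
differently.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* build ::ffff:a.b.c.d from an IPv4 address in network byte order */
struct in6_addr map_v4_to_v6(struct in_addr v4)
{
    struct in6_addr v6;
    memset(&v6, 0, sizeof v6);
    v6.s6_addr[10] = 0xff;
    v6.s6_addr[11] = 0xff;
    memcpy(v6.s6_addr + 12, &v4.s_addr, 4);
    return v6;
}

/* request dual-stack behavior explicitly instead of trusting the
 * system-wide default */
int allow_v4_on_v6_socket(int fd)
{
    int off = 0;
    return setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof off);
}
```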
previously, nameservers whose address family did not match the address
family of the first-listed nameserver were simply ignored. prior to
recent __ipparse fixes, they were not ignored but erroneously parsed.
the old value of 20 was reported by Laurent Bercot as being
insufficient for a reasonable real-world usage case. the actual problem
was the internal buffer used by ttyname(), but the implementation of
ttyname uses TTY_NAME_MAX, and for consistency it's best to increase
both. the new value is aligned with glibc.
subsequent code assumes the address family requested is either
unspecified or one of IPv4/IPv6, and could malfunction if this
constraint is not met, so other address families should be explicitly
rejected.
on archs with excess precision, the floating point constant 1e40f may
be evaluated such that it does not actually produce an infinity.
1e5000f is sufficiently large to produce an infinity for all supported
floating point formats. note that this definition of INFINITY is only
used for old or non-GNUC compilers anyway; despite being a portable,
conforming definition, it leads to erroneous warnings on many
compilers and thus using the builtin is preferred.
these functions were spuriously failing in the case where the buffer
size was exactly the number of bytes/characters to be written,
including null termination. since these functions do not have defined
error conditions other than buffer size, a reasonable application may
fail to check the return value when the format string and buffer size
are known to be valid; such an application could then attempt to use a
non-terminated buffer.
in addition to fixing the bug, I have changed the error handling
behavior so that these functions always null-terminate the output
except in the case where the buffer size is zero, and so that they
always write as many characters as possible before failing, rather
than dropping whole fields that do not fit. this actually simplifies
the logic somewhat anyway.
unfortunately this eliminates the ability of the compiler to diagnose
some dangerous/incorrect usage, but POSIX requires (as an extension to
the C language, i.e. CX shaded) that NULL have type void *. plain C
allows it to be defined as any null pointer constant.
the definition 0L is preserved for C++ rather than reverting to plain
0 to avoid dangerous behavior in non-conforming programs which use
NULL as a variadic sentinel. (it's impossible to use (void *)0 for C++
since C++ lacks the proper implicit pointer conversions, and other
popular alternatives like the GCC __null extension seem non-conforming
to the standard's requirements.)
- remove the HAVE_EFFICIENT_IRINT case: fn is an exact integer, so
it can be converted to int32_t a bit more efficiently than with a
cast (the rounding mode change can be avoided), but musl does not
support this case on any arch.
- __rem_pio2: use double_t where possible
- __rem_pio2f: use fewer assignments to avoid stores on i386
- use unsigned int bit manipulation (and union instead of macros)
- use hexfloat literals instead of named constants
loop condition was incorrect and confusing and caused an infinite loop
when (broken) applications reaped the pid from a signal handler or
another thread before wordexp's call to waitpid could do so.
when WRDE_NOSPACE is returned, the we_wordv and we_wordc members must
be valid, because the interface contract allows them to return partial
results.
in the case of zero results (due either to resource exhaustion or a
zero-word input) the we_wordv array still should contain a terminating
null pointer and the initial we_offs null pointers. this is impossible
on resource exhaustion, so a correct application must presumably check
for a null pointer in we_wordv; POSIX however seems to ignore the
issue. the previous code may have crashed under this situation.
avoid using exit status to determine if a shell error occurred, since
broken programs may install SIGCHLD handlers which reap all zombies,
including ones that don't belong to them. using clone and __WCLONE
does not seem to work for avoiding this problem since exec resets the
exit signal to SIGCHLD.
instead, the new code uses a dummy word at the beginning of the
shell's output, which is ignored, to determine whether the command was
executed successfully. this also fixes a corner case where a word
string containing zero words was interpreted as a single zero-length
word rather than no words at all. POSIX does not seem to require this
case to be supported anyway, though.
in addition, the new code uses the correct retry idiom for waitpid to
ensure that spurious STOP/CONT signals in the child and/or EINTR in
the parent do not prevent successful wait for the child, and blocks
signals in the child.
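
a sketch of a waitpid retry loop of the kind described. wait_for_exit is
a hypothetical helper, not the code in wordexp; it also terminates
cleanly if some other part of the program has already reaped the child.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* wait for one specific child, retrying when interrupted.
 * returns 0 and fills *status once the child has terminated,
 * or -1 on any error other than EINTR */
int wait_for_exit(pid_t pid, int *status)
{
    for (;;) {
        pid_t r = waitpid(pid, status, 0);
        if (r == pid) return 0;
        /* e.g. ECHILD when a handler or another thread reaped it first */
        if (r < 0 && errno != EINTR) return -1;
    }
}
```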
* simplify sin_pi(x) (don't care about inexact here, the result is
inexact anyway, and x is not so small to underflow)
* in lgammal add the previously removed special case for x==1 and
x==2 (to fix the sign of zero in downward rounding mode)
* only define lgammal on supported long double platforms
* change tgamma so the generated code is a bit smaller
previously these macros wrongly had type double rather than long
double. I see no way an application could detect the error in C99, but
C11's _Generic can trivially detect it.
at the same time, even though these archs do not have excess
precision, the number of decimal places used to represent these
constants has been increased to 21 to be consistent with the decimal
representations used for the DBL_* macros.
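
a sketch of how C11 code could observe the type of such a macro. the
text above does not name the macros, so LDBL_MAX, LDBL_MIN and
LDBL_EPSILON are used here as assumed examples; IS_LONG_DOUBLE is a
hypothetical helper.

```c
#include <float.h>

/* evaluates to 1 only when the expression has type long double */
#define IS_LONG_DOUBLE(x) _Generic((x), long double: 1, default: 0)

int ldbl_macros_have_right_type(void)
{
    return IS_LONG_DOUBLE(LDBL_MAX)
        && IS_LONG_DOUBLE(LDBL_MIN)
        && IS_LONG_DOUBLE(LDBL_EPSILON);
}
```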
this is enough to produce the correct value even if the constant is
interpreted as 80-bit extended precision, which matters on archs with
excess precision (FLT_EVAL_METHOD==2) under at least some
interpretations of the C standard. the shorter representations, while
correct if converted to the nominal precision at translation time,
could produce an incorrect value at extended precision, yielding
results such as (double)DBL_MAX != DBL_MAX.
this should not matter since the reality is that either all the sysv
sem syscalls are individual syscalls, or all of them are multiplexed
on the SYS_ipc syscall (depending on arch). but best to be consistent
anyway.
siginfo_t is not available from signal.h when the strict ISO C feature
profile (e.g. passing -std=c99 to gcc without defining any other
feature test macros) is used, but the type is needed to declare
waitid. using sys/wait.h (or any POSIX headers) in strict ISO C mode
is an application bug, but in the interest of compatibility, it's best
to avoid producing gratuitous errors. the simplest fix I could find is
suppressing the declaration of waitid (and also signal.h inclusion,
since it's not needed for anything else) in this case, while still
exposing everything else in sys/wait.h
it's not clear why I originally wrote O_NOFOLLOW into this; I suspect
the reason was with an aim of making the function more general for
mapping partially or fully untrusted files provided by the user.
however, the timezone code already precludes use of absolute or
relative pathnames in suid/sgid programs, and disallows .. in
pathnames which are relative to one of the system timezone locations,
so there is no threat of opening a symlink which is not trusted by
the appropriate user. since some users may wish to put symbolic links in
the zoneinfo directories to alias timezones, it seems preferable to
allow this.
the rest of the code is not prepared to handle an empty TZ string, so
falling back to __gmt ("GMT"), just as if TZ had been blank or unset,
is the preferable action.
if sizeof(time_t) == 8, this code path was missing the correct
offset into the zoneinfo file, using the header magic to do
offset calculations.
the 6 32bit fields to be read start at offset 20.
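
a sketch of reading those fields from an in-memory copy of the header.
the function names are hypothetical; the layout assumed is the standard
zoneinfo one, with 20 bytes of magic, version and reserved space ahead
of six big-endian 32-bit counts.

```c
#include <stddef.h>
#include <stdint.h>

/* decode one big-endian 32-bit value */
static uint32_t be32(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
         | (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* read the six count fields, which start at offset 20 and not at the
 * header magic at offset 0 */
void read_tzif_counts(const unsigned char *hdr, uint32_t counts[6])
{
    for (size_t i = 0; i < 6; i++)
        counts[i] = be32(hdr + 20 + 4 * i);
}
```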
inet_aton returns a boolean success value, whereas __ipparse returns 0
on success and -1 on failure. also change the conditional in inet_addr
to be consistent with other uses of __ipparse where only negative
values are treated as failure.
now that we're waiting for the exit status of the child process, the
result can be conveyed in the exit status rather than via a pipe.
since the error value might not fit in 7 bits, a table is used to
translate possible meaningful error values to small integers.
I mistakenly assumed that clone without a signal produced processes
that would not become zombies; however, waitpid with __WCLONE is
required to release their pids.
while using "l" unconditionally gave the right behavior due to
matching sizes/representations, it was technically UB and produced
compiler warnings with format string checking.
i386 fenv code checks __hwcap for sse support, but in fesetround the sse
code was unconditionally jumped over after the test so the sse rounding
mode was never set.
The log, log2 and log10 functions share a lot of code and to a lesser
extent log1p too. A small part of the code was kept separately in
__log1p.h, but since it did not capture much of the common code and
it was inlined anyway, it did not solve the issue properly. Now the
log functions have significant code duplication, which may be resolved
later; until then they need to be modified together.
logl, log10l, log2l, log1pl:
* Fix the sign when the return value should be -inf.
* Remove the volatile hack from log10l (seems unnecessary)
log1p, log1pf:
* Change the handling of small inputs: only |x|<2^-53 is special
(then it is enough to return x with the usual subnormal handling)
this fixes the sign of log1p(0) in downward rounding.
* Do not handle the k==0 case specially (other than skipping the
elaborate argument reduction)
* Do not handle 1+x close to power-of-two specially (this code was
used rarely, did not give much speed up and the precision wasn't
better than the general)
* Fix the correction term formula (c=1-(u-x) was used incorrectly
when x<1 but (double)(x+1)==2, this was not a critical issue)
* Use the exact same method for calculating log(1+f) as in log
(except in log1p the c correction term is added to the result).
log, logf, log10, log10f, log2, log2f:
* Use double_t and float_t consistently.
* Now the first part of log10 and log2 is identical to log (until the
return statement; hopefully this makes maintenance easier).
* Most special case formulas were removed (close to power-of-two and
k==0 cases), they increase the code size without providing precision
or performance benefits (and obfuscate the code).
Only x==1 is handled specially so in downward rounding mode the
sign of zero is correct (the general formula happens to give -0).
* For x==0 instead of -1/0.0 or -two54/0.0, return -1/(x*x) to force
raising the exception at runtime.
* Arg reduction code is changed (slightly simplified)
* The thresholds for arg reduction to [sqrt(2)/2,sqrt(2)] are now
consistently the [0x3fe6a09e00000000,0x3ff6a09dffffffff] and the
[0x3f3504f3,0x3fb504f2] intervals for double and float reductions
respectively (the exact threshold values are not critical)
* Remove the obsolete comment for the FLT_EVAL_METHOD!=0 case in log2f
(The same code is used for all eval methods now, on i386 slightly
simpler code could be used, but we have asm there anyway)
all:
* Fix signed int arithmetic (using unsigned for bit manipulation)
* Fix various comments
despite being marked legacy, this was specified by SUSv3 as part of
the XSI option; only the most recent version of the standard dropped
it. reportedly there's actual code using it.
* parse IPv4 dotted-decimal correctly (without strtoul, no leading zeros)
* disallow single leading ':' in IPv6 address
* allow at most 4 hex digits in IPv6 address (according to RFC 2373)
* have enough hex fields in IPv4 mapped IPv6 address
* disallow leading zeros in IPv4 mapped IPv6 address
* allow at most 4 parts
* bounds check the parts correctly
* disallow leading whitespace and sign
* check the address family before falling back to IPv6
despite being practically deprecated, these functions are still part
of the standard and thus cannot reside in a file that also contains
namespace pollution. this reverts some of the changes made in commit
e40f48a421.
fcntl.h: AT_* is not a reserved namespace so extensions cannot be
exposed by default.
langinfo.h: YESSTR and NOSTR were removed from the standard.
limits.h: NL_NMAX was removed from the standard.
signal.h: the conditional for NSIG was wrongly checking _XOPEN_SOURCE
rather than _BSD_SOURCE. this was purely a mistake; it doesn't even
match the commit message from the commit that added it.
this fixes an issue reported by Daniel Thau whereby faccessat with the
AT_EACCESS flag did not work in cases where the process is running
suid or sgid but without root privileges. per POSIX, when the process
does not have "appropriate privileges", setuid changes the euid, not
the real uid, and the target uid must be equal to the current real or
saved uid; if this condition is not met, EPERM results. this caused
the faccessat child process to fail.
using the setreuid syscall rather than setuid works. POSIX leaves it
unspecified whether setreuid can set the real user id to the effective
user id on processes without "appropriate privileges", but Linux
allows this; if it's not allowed, there would be no way for this
function to work.
based on patch by Michael Forney. at the same time, I've changed the
if branch to be more clear, avoiding the comma operator.
the underlying issue is that Linux always returns ERANGE when size is
too short, even when it's zero, rather than returning EINVAL for the
special case of zero as required by POSIX.
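
one way to paper over the difference, as a sketch: check for the
zero-size case before asking the kernel. checked_getcwd is a
hypothetical wrapper and not necessarily how the fix was made.

```c
#include <errno.h>
#include <unistd.h>

/* report EINVAL for a zero-size buffer, as POSIX requires, instead of
 * the ERANGE the kernel would give */
char *checked_getcwd(char *buf, size_t size)
{
    if (buf && size == 0) {
        errno = EINVAL;
        return 0;
    }
    return getcwd(buf, size);
}
```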
there is no reason to check the return value for setting errno, since
brk never returns errors, only the new value of the brk (which may be
the same as the old, or otherwise differ from the requested brk, on
failure).
it may be beneficial to eventually just eliminate this file and make
the syscalls inline in malloc.c.
the va_arg call for the argv[]-terminating null pointer was missing,
so this pointer was being wrongly used as the environment pointer.
issue reported by Timo Teräs. proposed patch slightly modified to
simplify the resulting code.
bug report and patch by Michael Forney. the terminating null pointer
at the end of the gr_mem array was overwriting the beginning of the
string data, causing the gr_name member to always be a zero-length
string.
issue reported by Michael Forney:
"If wn becomes 0 after processing a chunk of 4, mbsrtowcs currently
continues on, wrapping wn around to -1, causing the rest of the string
to be processed.
This resulted in buffer overruns if there was only space in ws for wn
wide characters."
the original patch submitted added an additional check for !wn after
the loop; to avoid extra branching, I instead just changed the wn>=4
check to wn>=5 to ensure that at least one slot remains after the
word-at-a-time loop runs. this should not slow down the tail
processing on real-world usage, since an extra slot that can't be
processed in the word-at-a-time loop is needed for the null
termination anyway.
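a minimal sketch of the loop-bound idea, not the actual mbsrtowcs code. the helper name ascii_to_wide is hypothetical and it assumes ASCII-only input; it only illustrates why testing wn>=5 in the four-at-a-time loop keeps wn from reaching zero before the tail loop runs.

```c
#include <stddef.h>
#include <wchar.h>

/* hypothetical helper, not musl's mbsrtowcs: widens the bytes of s
 * (assumed ASCII) into ws, storing at most wn wide characters. the
 * terminator is stored only if a slot is left for it. returns the
 * number of non-null wide characters stored. */
size_t ascii_to_wide(wchar_t *ws, const char *s, size_t wn)
{
	size_t n = 0;

	/* fast path, four bytes per pass. testing wn>=5 instead of
	 * wn>=4 leaves at least one slot for the tail loop, so wn is
	 * never 0 when this loop exits after consuming a chunk. */
	while (wn >= 5 && s[0] && s[1] && s[2] && s[3]) {
		for (int i = 0; i < 4; i++) ws[i] = (unsigned char)s[i];
		ws += 4; s += 4; wn -= 4; n += 4;
	}

	/* tail, one byte at a time, bounded by the remaining slots. */
	while (wn) {
		*ws = (unsigned char)*s;
		if (!*s) break;
		ws++; s++; wn--; n++;
	}
	return n;
}
```

with an 8-character input and wn equal to 4, the fast path is skipped and exactly four slots are written; with wn equal to 9, the terminator lands in the ninth slot.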
This is a change in ISO C11 annex F (F.10.11p1): comparison macros
can't round their arguments to their semantic type when the evaluation
format has wider range and precision (i.e. they must be consistent with
the builtin relational operators).
atomic store was lacking a barrier, which was fine for legacy arm with
no real smp and kernel-emulated cas, but unsuitable for more modern
systems. the kernel provides another "kuser" function, at 0xffff0fa0,
which could be used for the barrier, but using that would drop support
for kernels 2.6.12 through 2.6.14 unless an extra conditional were
added to check for barrier availability. just using the barrier in the
kernel cas is easier, and, based on my reading of the assembly code in
the kernel, does not appear to be significantly slower.
at the same time, other atomic operations are adapted to call the
kernel cas function directly rather than using a_cas; due to small
differences in their interface contracts, this makes the generated
code much simpler.
if a multithreaded program became non-multithreaded (i.e. all other
threads exited) while one thread held an internal lock, the remaining
thread would fail to release the lock. if the program then became
multithreaded again at a later time, any further attempts to obtain
the lock would deadlock permanently.
the underlying cause is that the value of libc.threads_minus_1 at
unlock time might not match the value at lock time. one solution would
be returning a flag to the caller indicating whether the lock was
taken and needs to be unlocked, but there is a simpler solution: using
the lock itself as such a flag.
note that this flag is not needed anyway for correctness; if the lock
is not held, the unlock code is harmless. however, the memory
synchronization properties associated with a_store are costly on some
archs, so it's best to avoid executing the unlock code when it is
unnecessary.
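a rough sketch of using the lock word itself as the "was it taken" flag, written with C11 atomics rather than libc's internal primitives. the names sketch_threads_minus_1, sketch_lock and sketch_unlock are hypothetical stand-ins, and the spin loop replaces whatever waiting mechanism a real lock would use.

```c
#include <stdatomic.h>

/* hypothetical stand-in for libc.threads_minus_1: nonzero while
 * threads other than the caller exist. */
int sketch_threads_minus_1;

void sketch_lock(atomic_int *l)
{
	/* single-threaded: skip locking, leaving *l at 0. */
	if (!sketch_threads_minus_1) return;
	/* spin until the previous value was 0; a real lock would wait. */
	while (atomic_exchange(l, 1)) { }
}

void sketch_unlock(atomic_int *l)
{
	/* the lock word records whether the lock was taken, so the
	 * thread count at unlock time no longer matters. the check
	 * also skips the costly store when nothing is held. */
	if (atomic_load(l)) atomic_store(l, 0);
}
```

if the thread count drops to zero between sketch_lock and sketch_unlock, the lock is still released, because the decision is made from the lock word and not from the counter.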
this was resulting in crashes in posix_spawn on mips, and would have
affected applications calling clone too. since the prototype for
__clone has it as a variadic function, it may not assume that 16($sp)
is writable for use in making the syscall. instead, it needs to
allocate additional stack space, and then adjust the stack pointer
back in both of the code paths for the parent process/thread.
These constants are not specified by POSIX, but they are in the reserved
namespace; glibc and bsd systems seem to provide them as well.
(Note that POSIX specifies -NZERO and NZERO-1 to be the limits, but
PRIO_MAX equals NZERO)
the changes were verified using various sources:
linux: include/uapi/linux/elf.h
binutils: include/elf/common.h
glibc: elf/elf.h
sysv gabi: http://www.sco.com/developers/gabi/latest/contents.html
sun linker docs: http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf
and platform specific docs
- fixed:
EF_MIPS_* E_MIPS_* e_flags: fixed according to glibc and binutils
- added:
ELFOSABI_GNU for EI_OSABI entry: glibc, binutils and sysv gabi
EM_* e_machine values: updated according to linux and glibc
PN_XNUM e_phnum value: from glibc and linux, see oracle docs
NT_* note types: updated according to linux and glibc
DF_1_* flags for DT_FLAGS_1 entry: following glibc and oracle docs
AT_HWCAP2 auxv entry for more hwcap bits according to linux and glibc
R_386_SIZE32 relocation according to glibc and binutils
EF_ARM_ABI_FLOAT_* e_flags: added following glibc and binutils
R_AARCH64_* relocs: added following glibc and aarch64 elf specs
R_ARM_* relocs: according to glibc, binutils and arm elf specs
R_X86_64_* relocs: added missing relocs following glibc
- removed:
HWCAP_SPARC_* flags were moved to arch specific header in glibc
R_ARM_SWI24 reloc is marked as obsolete in glibc, not present in binutils
not specified in arm elf spec, R_ARM_TLS_DESC reused its number
see http://www.codesourcery.com/publications/RFC-TLSDESC-ARM.txt
- glibc changes not pulled in:
ELFOSABI_ARM_AEABI (bare-metal system, binutils and glibc disagree about the name)
R_68K_* relocs for unsupported platform
R_SPARC_* ditto
EF_SH* ditto (e_flags)
EF_S390* ditto (e_flags)
R_390* ditto
R_MN10300* ditto
R_TILE* ditto
CLONE_PARENT is not necessary (CLONE_THREAD provides all the useful
parts of it) and Linux treats CLONE_PARENT as an error in certain
situations, without noticing that it would be a no-op due to
CLONE_THREAD. this error case prevents, for example, use of a
multi-threaded init process and certain usages with containers.
the removed ARPHRD_IEEE802154_PHY was only present in the kernel api
in v2.6.31 (by accident), but it got into the glibc headers (in 2009)
and remained there since this header was not updated since then.
PAGE_SIZE was hardcoded to 4096, which is historically what most
systems use, but on several archs it is a kernel config parameter;
user space can only know it at execution time from the aux vector.
PAGE_SIZE and PAGESIZE are not defined on archs where page size is
a runtime parameter; applications should use sysconf(_SC_PAGE_SIZE)
to query it. Internally libc code defines PAGE_SIZE to libc.page_size,
which is set to aux[AT_PAGESZ] in __init_libc and early in __dynlink
as well. (Note that libc.page_size can be accessed without GOT, ie.
before relocations are done)
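for application code, a minimal sketch of the runtime query looks like the following. the wrapper name runtime_page_size is hypothetical; only sysconf and _SC_PAGE_SIZE come from the text above.

```c
#include <unistd.h>

/* hypothetical wrapper: returns the page size in bytes as reported
 * at run time, or 0 if sysconf cannot determine it. */
size_t runtime_page_size(void)
{
	long ps = sysconf(_SC_PAGE_SIZE);
	return ps > 0 ? (size_t)ps : 0;
}
```

code that previously sized buffers with the PAGE_SIZE macro can call such a wrapper once and cache the result.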
Some fpathconf settings are hardcoded to 4096; these should actually
be queried from the filesystem using statfs.
unlike other archs, the mips version of clone was not doing anything
to align the stack pointer. this seems to have been the cause for some
SIGBUS crashes that were observed in posix_spawn.
msg.h was wrong for big-endian (wrong endianness padding).
shm.h was just plain wrong (mips is not supposed to have padding).
both changes were tested using libc-test on qemu-system-mips.
the underlying problem was not incorrect sign extension (fixed in the
previous commit to this file by nsz) but that code that treats "long"
as 32-bit was copied blindly from i386 to x86_64.
now lrintl is identical to llrintl on x86_64, as it should be.
if fopen fails for a reason other than ENOENT, we must assume the
intent is that the path file be used. failure may be due to
misconfiguration or intentional resource-exhaustion attack (against
suid programs), in which case falling back to loading libraries from
an unintended path could be dangerous.
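the policy can be sketched as follows. open_path_file and its use_default out-parameter are hypothetical names for illustration, not the dynamic linker's actual code; the only point is that ENOENT alone permits the fallback.

```c
#include <errno.h>
#include <stdio.h>

/* hypothetical helper: tries to open the path file. *use_default is
 * set to 1 only when the file does not exist; any other failure
 * leaves it 0 so the caller can refuse to fall back. */
FILE *open_path_file(const char *name, int *use_default)
{
	FILE *f = fopen(name, "r");
	*use_default = !f && errno == ENOENT;
	return f;
}
```

a caller would use the built-in default search path only when the function returns a null pointer with *use_default set, and would treat a null pointer with *use_default clear as "the file exists but is unusable".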
gcc did not always drop excess precision according to c99 at assignments
before version 4.5 even if -std=c99 was requested which caused badly
broken mathematical functions on i386 when FLT_EVAL_METHOD!=0
but STRICT_ASSIGN was not used consistently and it is worked around for
old compilers with -ffloat-store so it is no longer needed
the new convention is to get the compiler to respect c99 semantics and when
excess precision is not harmful use float_t or double_t or to specialize
code using FLT_EVAL_METHOD
apparently gnulib requires invalid long double representations
to be handled correctly in printf so we classify them according
to how the fpu treats them: bad inf is nan, bad nan is nan,
bad normal is nan and bad subnormal/zero is minimal normal
in atanh exception handling was left to the called log functions,
but the argument to those functions could underflow or overflow.
use double_t and float_t to avoid some useless stores on x86
acosh(x) is invalid for x<1, acoshf tried to be clever using
signed comparisons to handle all x<2 the same way, but the
formula was wrong on large negative values.
there were two problems:
* omitted underflow on subnormal results: exp2l(-16383.5) was calculated
as sqrt(2)*2^-16384, the last bits of sqrt(2) are zero so the down scaling
does not underflow even though the result is in subnormal range
* spurious underflow for subnormal inputs: exp2l(0x1p-16400) was evaluated
as f2xm1(x)+1 and f2xm1 raised underflow (because of the inexact subnormal result)
the first issue is fixed by raising underflow manually if x is in
(-32768,-16382] and not integer (x-0x1p63+0x1p63 != x)
the second issue is fixed by treating x in (-0x1p64,0x1p64) specially
for these fixes the special case handling was completely rewritten
* use float_t and double_t
* cleanup subnormal handling
* bithacks according to the new convention (ldshape for long double
and explicit unions for float and double)
* don't care about inexact flag
* use double_t and float_t (faster, smaller, more precise on x86)
* exp: underflow when result is zero or subnormal and not -inf
* exp2: underflow when result is zero or subnormal and not exact
* expm1: underflow when result is zero or subnormal
* expl: don't underflow on -inf
* exp2: fix incorrect comment
* expm1: simplify special case handling and overflow properly
* expm1: cleanup final scaling and fix negative left shift ub (twopk)
ld128 support was added to internal kernel functions (__cosl, __sinl,
__tanl, __rem_pio2l) from freebsd (not tested, but should be a good
start for when ld128 arch arrives)
__rem_pio2l had some code cleanup, the freebsd ld128 code seems to
gather the results of a large reduction with precision loss (fixed
the bug but a todo comment was added for later investigation)
the old copyright was removed from the non-kernel wrapper functions
(cosl, sinl, sincosl, tanl) since these are trivial and the interesting
parts and comments had been already rewritten.
method: if there is a large difference between the scale of x and y
then the larger magnitude dominates, otherwise reduce x,y so the
argument of sqrt (x*x+y*y) does not overflow or underflow and calculate
the argument precisely using exact multiplication. If the argument
has less error than 1/sqrt(2) ~ 0.7 ulp, then the result has less error
than 1 ulp in nearest rounding mode.
the original fdlibm method was the same, except it used bit hacks
instead of dekker-veltkamp algorithm, which is problematic for long
double where different representations are supported. (the new hypot
and hypotl code should be smaller and faster on 32bit cpu archs with
fast fpu.) the new code behaves differently in non-nearest rounding,
but the error should still be less than 2ulps.
ld80 and ld128 are supported
* results are exact
* modfl follows truncl (raises inexact flag spuriously now)
* modf and modff only had cosmetic cleanup
* remainder is just a wrapper around remquo now
* using iterative shift+subtract for remquo and fmod
* ld80 and ld128 are supported as well
* faster, smaller, cleaner implementation than the bit hacks of fdlibm
* use arithmetics like y=(double)(x+0x1p52)-0x1p52, which is an integer
neighbor of x in all rounding modes (0<=x<0x1p52) and only use bithacks
when that's faster and smaller (for float it usually is)
* the code assumes standard excess precision handling for casts
* long double code supports both ld80 and ld128
* nearbyint is not changed (it is a wrapper around rint)
* consistent code style
* explicit union instead of typedef for double and float bit access
* turn FENV_ACCESS ON to make 0/0.0f raise invalid flag
* (untested) ld128 version of ilogbl (used by logbl which has ld128 support)
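the arithmetic trick from the list above, y=(double)(x+0x1p52)-0x1p52, can be tried in isolation. this is a sketch with a hypothetical function name, valid only for 0<=x<0x1p52, and it assumes the compiler honors the cast for dropping excess precision.

```c
/* hypothetical helper: rounds x to an integer neighbor using the
 * current rounding mode. adding 0x1p52 pushes the fraction bits out
 * of the double significand; subtracting it restores the magnitude.
 * only valid for 0 <= x < 0x1p52. */
double round_small_nonneg(double x)
{
	return (double)(x + 0x1p52) - 0x1p52;
}
```

in the default round-to-nearest mode ties go to even, so 2.5 becomes 2 and 3.5 becomes 4.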
new ldshape union, ld128 support is kept, code that used the old
ldshape union was rewritten (IEEEl2bits union of freebsd libm is
not touched yet)
ld80 __fpclassifyl no longer tries to handle invalid representation
this protects against deadlock from spurious signals (e.g. sent by
another process) arriving after the controlling thread releases the
other threads from the sync operation.
the head pointer was not being reset between calls to synccall, so any
use of this interface more than once would build the linked list
incorrectly, keeping the (now invalid) list nodes from the previous
call.
invalid format strings invoke undefined behavior, so this is not a
conformance issue, but it's nicer for scanf to report the error safely
instead of calling free on a potentially-uninitialized pointer or a
pointer to memory belonging to the caller.
rather than allocating a PATH_MAX-sized buffer when the caller does
not provide an output buffer, work first with a PATH_MAX-sized temp
buffer with automatic storage, and either copy it to the caller's
buffer or strdup it on success. this not only avoids massive memory
waste, but also avoids pulling in free (and thus the full malloc
implementation) unnecessarily in static programs.
this avoids failure if the file is not readable and avoids odd
behavior for device nodes, etc. on old kernels that lack O_PATH, the
old behavior (O_RDONLY) will naturally happen as the fallback.
commit 07827d1a82 seems to have
introduced this issue. sigqueue is called from the synccall core, at
which time, even implementation-internal signals are blocked. however,
pthread_sigmask removes the implementation-internal signals from the
old mask before returning, so that a process which began life with
them blocked will not be able to save a signal mask that has them
blocked, possibly causing them to become re-blocked later. however,
this was causing sigqueue to unblock the implementation-internal
signals during synccall, leading to deadlock.
the BSD and GNU versions of this structure differ, so exposing it in
the default _BSD_SOURCE profile is possibly problematic. both versions
could be simultaneously supported with anonymous unions if needed in
the future, but for now, just omitting it except under _GNU_SOURCE
should be safe.
I originally added this warning option based on a misunderstanding of
how it works. it does not warn whenever the destination of the cast
has stricter alignment; it only warns in cases where misaligned
dereference could lead to a fault. thus, it's essentially a no-op for
i386, which had me wrongly believing the code was clean for this
warning level. on other archs, numerous diagnostic messages are
produced, and all of them are false-positives, so it's better just not
to use it.
unlike the old C memcpy, this version handles word-at-a-time reads and
writes even for misaligned copies. it does not require that the cpu
support misaligned accesses; instead, it performs bit shifts to
realign the bytes for the destination.
essentially, this is the C version of the ARM assembly language
memcpy. the ideas are all the same, and it should perform well on any
arch with a decent number of general-purpose registers that has a
barrel shift operation. since the barrel shifter is an optional cpu
feature on microblaze, it may be desirable to provide an alternate asm
implementation on microblaze, but otherwise the C code provides a
competitive implementation for "generic risc-y" cpu archs that should
alleviate the urgent need for arch-specific memcpy asm.
while the incorporation of this requirement from C99 into C++11 was
likely an accident, some software expects it to be defined, and it
doesn't hurt. if the requirement is removed, then presumably
__bool_true_false_are_defined would just be in the implementation
namespace and thus defining it would still be legal.
this version of memset is optimized both for small and large values of
n, and makes no misaligned writes, so it is usable (and near-optimal)
on all archs. it is capable of filling up to 52 or 56 bytes without
entering a loop and with at most 7 branches, all of which can be fully
predicted if memset is called multiple times with the same size.
it also uses the attribute extension to inform the compiler that it is
violating the aliasing rules, unlike the previous code which simply
assumed it was safe to violate the aliasing rules since translation
unit boundaries hide the violations from the compiler. for non-GNUC
compilers, 100% portable fallback code in the form of a naive loop is
provided. I intend to eventually apply this approach to all of the
string/memory functions which are doing word-at-a-time accesses.
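a much-simplified sketch of the aliasing-attribute approach follows. it is not the optimized memset described above (it loops and has no small-size branch tree), and the names fill_sketch and u32_alias are hypothetical; it only shows aligned word stores through a may_alias type with a naive byte loop as the portable fallback.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __GNUC__
/* may_alias tells the compiler that stores through this type may
 * overlap objects of any other type. */
typedef uint32_t __attribute__((__may_alias__)) u32_alias;
#endif

void *fill_sketch(void *dest, int c, size_t n)
{
	unsigned char *s = dest;

#ifdef __GNUC__
	/* byte stores until s is 4-byte aligned. */
	for (; n && ((uintptr_t)s & 3); n--) *s++ = (unsigned char)c;

	/* aligned word stores; c32 is the byte replicated four times. */
	uint32_t c32 = (uint32_t)-1 / 255 * (unsigned char)c;
	for (; n >= 4; n -= 4, s += 4) *(u32_alias *)s = c32;
#endif

	/* tail bytes, or the whole buffer on non-GNUC compilers. */
	for (; n; n--) *s++ = (unsigned char)c;
	return dest;
}
```

no store is ever misaligned, since the word loop starts only after the head loop has aligned the pointer.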
this will be needed for upcoming commits to the string/mem functions
to correct their unannounced use of aliasing violations for
word-at-a-time search, fill, and copy operations.
this is a nonstandard extension but will be required in the next
version of POSIX, and it's widely used/useful in shell scripts
utilizing the date utility.
this may need further revision in the future, since POSIX is rather
unclear on the requirements, and is designed around the assumption of
POSIX TZ specifiers which are not sufficiently powerful to represent
real-world timezones (this is why zoneinfo support was added).
the basic issue is that strftime gets the string and numeric offset
for the timezone from the extra fields in struct tm, which are
initialized when calling localtime/gmtime/etc. however, a conforming
application might have created its own struct tm without initializing
these fields, in which case using __tm_zone (a pointer) could crash.
other zoneinfo-based implementations simply check for a null pointer,
but otherwise can still crash if the field contains junk.
simply ignoring __tm_zone and using tzname[] would "work" but would
give incorrect results in time zones with more complex rules. I feel
like this would lower the quality of implementation.
instead, simply validate __tm_zone: unless it points to one of the
zone name strings managed by the timezone system, assume it's invalid.
this commit also fixes several other minor bugs with formatting:
tm_isdst being negative is required to suppress printing of the zone
formats, and %z was using the wrong format specifiers since the type
of val was changed, resulting in bogus output.
the empty TZ string was matching equal to the initial value of the
cached TZ name, thus causing do_tzset never to run and never to
initialize the time zone data.
1. an occurrence of ${ORIGIN} before $ORIGIN would be ignored due to
the strstr logic. (note that rpath contains multiple :-delimited paths
to be searched.)
2. data read by readlink was not null-terminated.
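for the second point, readlink stores raw bytes and never appends a terminator, so the caller has to do it. a minimal sketch, with the hypothetical helper name readlink_str and a conservative policy of treating a completely filled buffer as truncation:

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* hypothetical helper: reads the target of a symlink into buf as a
 * C string. returns its length, or -1 on error or if the result may
 * have been truncated. */
ssize_t readlink_str(const char *path, char *buf, size_t size)
{
	if (!size) return -1;
	/* reserve one byte for the terminator readlink does not write. */
	ssize_t l = readlink(path, buf, size - 1);
	if (l < 0 || (size_t)l >= size - 1) return -1;
	buf[l] = 0;
	return l;
}
```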
fallback to argv[0] as before. unlike argv[0], AT_EXECFN was a valid
(but possibly relative) pathname for the new program image at the time
the execve syscall was made.
as a special case, ignore AT_EXECFN if it begins with "/proc/", in
order not to give bogus (and possibly harmful) results when fexecve
was used.
previously, rpath was only honored for direct dependencies. in other
words, if A depends on B and B depends on C, only B's rpath (if any),
not A's rpath, was being searched for C. this limitation made
rpath-based deployment difficult in the presence of multiple levels of
library dependency.
at present, $ORIGIN processing in rpath is still unsupported.
at present, since POSIX requires %F to behave as %+4Y-%m-%d and ISO C
requires %F to behave as %Y-%m-%d, the default behavior for %Y has
been changed to match %+4Y. this seems to be the only way to conform
to the requirements of both standards, and it does not affect years
prior to the year 10000. depending on the outcome of interpretations
from the standards bodies, this may be adjusted at some point.
use a long long value so that even with offsets, values cannot
overflow. instead of using different format strings for different
numeric formats, simply use a per-format width and %0*lld for all of
them.
this width specifier is not for use with strftime field widths; that
will be a separate step in the caller.
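a small sketch of the single-format idea: one long long value, one per-format width, and %0*lld for everything. fmt_number is a hypothetical name, not the function in strftime.c.

```c
#include <stdio.h>

/* hypothetical helper: formats val zero-padded to at least width
 * digits. long long leaves headroom for offsets such as adding 1900
 * to a large tm_year. returns what snprintf returns. */
int fmt_number(char *buf, size_t size, long long val, int width)
{
	return snprintf(buf, size, "%0*lld", width, val);
}
```

a two-digit field such as a month would pass width 2, while a year would pass width 4; values wider than the requested width are printed in full.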
make __strftime_fmt_1 return a string (possibly in the caller-provided
temp buffer) rather than writing into the output buffer. this approach
makes more sense when padding to a minimum field width might be
required, and it's also closer to what wcsftime wants.
one place where semicolon (non-portable) was still used in place of
separate -e options (copied over from an old version of this code),
and use of a literal slash in the bracket expression for the final
command, despite slash being used as the delimiter for the s command.
fesetround.c is a wrapper to do the arch independent argument
check (on archs where rounding mode is not stored in 2 bits
__fesetround still has to check its arguments)
on powerpc fe*except functions do not accept the extra invalid
flags of its fpscr register
the useless FENV_ACCESS pragma was removed from feupdateenv
the x87 exception summary (ES) and stack fault (SF) flags may be
spuriously cleared by feclearexcept using the fnclex instruction,
but these flags are not observable through libc hence maintaining
their state is not critical.
the sse and x87 rounding modes should be always the same,
the visible exception flags are the bitwise or of the two
fenv states (so it's enough to query the rounding mode or
raise exceptions on one fenv)
the historical (non-standardized) install command is really
inappropriate for installing binaries/libraries on a system that
utilizes memory-mapped executable files. rather than replacing an
existing file atomically, it overwrites the existing file. this can
cause running programs to see a partially-modified version of the
file, resulting in unpredictable behavior, or SIGBUS. a MAP_COPY mode
for mmap would get around this problem, but Linux lacks MAP_COPY.
the shell script added with this commit works around the problem by
writing temporary files and moving them into place. unlike the
historical install utility, it also supports a -l option for installing
a symbolic link atomically, via the same method.
with these changes, the character set implemented as "big5" in musl is
a pure superset of cp950, the canonical "big5", and agrees with the
normative parts of Unicode. this means it has minor differences from
both hkscs and big5-2003:
- the range A2CC-A2CE maps to CJK ideographs rather than numerals,
contrary to changes made in big5-2003.
- C6CD maps to a CJK ideograph rather than its corresponding Kangxi
radical character, contrary to changes made in hkscs.
- F9FE maps to U+2593 rather than U+FFED.
of these differences, none but the last are visually distinct, and the
last is a character used purely for text-based graphics, not to convey
linguistic content.
should there be future demand for strict conformance to big5-2003 or
hkscs mappings, the present charset aliases can be replaced with
distinct variants.
reportedly there are other non-standard big5 extensions in common use
in Taiwan and perhaps elsewhere, which could also be added as layers
on top of the existing big5 support.
there may be additional characters which should be added to the hkscs
table: the whatwg standard for big5 defines what appears to be a
superset of hkscs.
ln -sf is non-atomic; it unlinks the destination first. instead, make
a temporary link and rename it into place.
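the same temporary-link-and-rename pattern, sketched in C rather than in makefile or shell syntax. atomic_symlink is a hypothetical helper, and the caller is assumed to pick a tmppath in the same directory as linkpath so that rename stays within one filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* hypothetical helper: makes linkpath a symlink to target, replacing
 * any existing entry in one step. tmppath must not exist and must be
 * on the same filesystem as linkpath. returns 0 or -1. */
int atomic_symlink(const char *target, const char *linkpath,
                   const char *tmppath)
{
	if (symlink(target, tmppath) < 0) return -1;
	/* rename replaces linkpath atomically; there is no moment at
	 * which the name is missing. */
	if (rename(tmppath, linkpath) < 0) {
		unlink(tmppath);
		return -1;
	}
	return 0;
}
```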
this commit also fixes some of the dependency tracking behavior for
the link. depending on the directory it's to be installed in is not
reasonable; it causes a new link to be attempted if the library
directory has been modified, but does not attempt to make a new link
just because libc has been updated. instead, depend on the target to
be linked to. this will ensure that, if prefix has changed but
syslibdir has not, the link will be updated to point to the new
prefix.
it turns out that __SOFTFP__ does not indicate the ABI in use but
rather that fpu instructions are not to be used at all. this is
specified in ARM's documentation so I'm unclear on how I previously
got the wrong idea. unfortunately, this resulted in the 0.9.12 release
producing a dynamic linker with the wrong name. fortunately, there do
not yet seem to be any public toolchain builds using the wrong name.
the __ARM_PCS_VFP macro does not seem to be official from ARM, and in
fact it was missing from the very earliest gcc versions (around 4.5.x)
that added -mfloat-abi=hard. it would be possible on such versions to
perform some ugly linker-based tests instead in hopes that the linker
will reject ABI-mismatching object files, if there is demand for
supporting such versions. I would probably prefer to document which
versions are broken and warn users to manually add -D__ARM_PCS_VFP if
using such a version.
there's definitely an argument to be made that the fenv macros should
be exposed even in -mfloat-abi=softfp mode. for now, I have chosen not
to expose them in this case, since the math library will not
necessarily have the capability to raise exceptions (it depends on the
CFLAGS used to compile it), and since exceptions are officially
excluded from the ARM EABI, which the plain "arm" arch aims to
follow.
without these, calls may be resolved incorrectly if the calling code
has been compiled to thumb instead of arm. it's not clear to me at
this point whether crt_arch.h is even working if crt1.c is built as
thumb; this needs testing. but the _init and _fini issues were known
to cause crashes in static-linked apps when libc was built as thumb,
and this commit should fix that issue.
if FLT_EVAL_METHOD!=0 check if (double)(1/x) is subnormal and not a
power of 2 (if 1/x is power of 2 then either it is exact or the
long double to double rounding already raised inexact and underflow)
* remove volatile hacks
* don't care about inexact flag for now (removed all the +-tiny)
* fix atanl to raise underflow properly
* remove signed int arithmetics
* use pi/2 instead of pi_o_2 (gcc generates the same code, which is not
correct, but it does not matter: we mainly care about nearest rounding)
underflow is raised by an inexact subnormal float store,
since subnormal operations are slow, check the underflow
flag and skip the store if it's already raised
for these functions f(x)=x for small inputs, because f(0)=0 and
f'(0)=1, but for subnormal values they should raise the underflow
flag (required by annex F), if they are approximated by a polynomial
around 0 then spurious underflow should be avoided (not required by
annex F)
all these functions should raise inexact flag for small x if x!=0,
but it's not required by the standard and it does not seem a worthy
goal, so support for it is removed in some cases.
raising underflow:
- x*x may not raise underflow for subnormal x if FLT_EVAL_METHOD!=0
- x*x may raise spurious underflow for normal x if FLT_EVAL_METHOD==0
- in case of double subnormal x, store x as float
- in case of float subnormal x, store x*x as float
there are two possible points where the length is evaluated: either
the first 'compression' jump, or the null terminator if no jumps have
taken place yet. the previous code only measured the length of the
first component.
the duplicate code in dn_expand and its incorrect return values are
both results of the history of the code: the version in __dns.c was
originally written with no awareness of the legacy resolver API, and
was later copy-and-paste duplicated to provide the legacy API.
this commit is the first of a series that will restructure the
internal dns code to share as much code as possible with the legacy
resolver API functions.
I have also removed the loop detection logic, since the output buffer
length limit naturally prevents loops. in order to avoid long runtime
when encountering a loop if the caller provided a ridiculously long
buffer, the caller-provided length is clamped at the maximum dns name
length.
the approach of this implementation was heavily investigated prior to
adopting it. attempts to obtain similar performance with pure C code
were capping out at about 75% of the performance of the asm, with
considerably larger code size, and were fragile in that the compiler
would sometimes compile part of memcpy into a call to itself.
therefore, just using the asm seems to be the best option.
this commit is the first to make use of the new subarch-specific asm
framework. the new armel directory is the location for arm asm that
should not be used for all arm subarchs, only the default one. armhf
is the name of the little-endian hardfloat-ABI subarch, which can use
the exact same asm. in both cases, the build system finds the asm by
following a memcpy.sub file.
the other two subarchs, armeb and armebhf, would need a big-endian
variant of this code. it would not be hard to adapt the code to big
endian, but I will hold off on doing so until there is demand for it.
instead of subarchs getting their own .s files which are used directly
by the makefile to replace the .c file, they now must provide a .sub
file whose contents are a pathname, relative to the location of the
.sub file, which will substitute for the .c file. essentially these
files are acting as symbolic links, but implemented as text files.
these aliases were originally intended to be for ABI compatibility
only, but their presence caused regressions in broken gnulib-based
software whose configure scripts detect the existence of these
functions then use them without declarations, resulting in bogus
return values.
the default subarch is the one whose full name is just the base arch
name, with no suffixes. normally, either the asm in the default
subarch is suitable for all subarch variants, or separate asm is
mandatory for each variant. however, in the case of asm which is
purely for optimization purposes, it's possible to have asm that only
works (or only performs well) on the default subarch, and not on any of
the other variants. thus, I have added a mechanism to give a name to
the default variant, for example "armel" for the default,
little-endian arm. further such default-subarch names can be added in
the future as needed.
a mips signal mask contains 128 bits, enough for signals 1 through
128. however, the exit status obtained from the wait-family functions
only has room for values up to 127. reportedly signal 128 was causing
kernelspace bugs, so it was removed from the kernel recently; even
without that issue, however, it was impossible to support it correctly
in userspace.
at the same time, the bug was masked on musl by SIGRTMAX incorrectly
yielding 64 on mips, rather than the "correct" value of 128. now that
the _NSIG issue is fixed, SIGRTMAX can be fixed at the same time,
exposing the full range of signals for application use.
note that the (nonstandardized) libc _NSIG value is actually one
greater than the max signal number, and also one greater than the
kernel headers' idea of _NSIG. this is the reason for the discrepancy
with the recent kernel changes. since reducing _NSIG by one brought it
down from 129 to 128, rather than from 128 to 127, _NSIG/8, used
widely in the musl sources, is unchanged.
mips has signal numbers up to 127 (formerly, up to 128, but the last
one never worked right and caused kernel panic when used), so 127 in
the "signal number" field of the wait status is insufficient for
determining that the process was stopped. in addition, a nonzero value
in the upper bits must be present, indicating the signal number which
caused the process to be stopped.
details on this issue can be seen in the email with message id
CAAG0J9-d4BfEhbQovFqUAJ3QoOuXScrpsY1y95PrEPxA5DWedQ@mail.gmail.com on
the linux-mips mailing list, archived at:
http://www.linux-mips.org/archives/linux-mips/2013-06/msg00552.html
and in the associated thread about fixing the mips kernel bug.
commit 4a96b948687166da26a6c327e6c6733ad2336c5c fixed the
corresponding issue in uClibc, but introduced a multiple-evaluation
issue for the WIFSTOPPED macro.
for the most part, none of these issues affected pure musl systems,
since musl has up until now (incorrectly) defined SIGRTMAX as 64 on
all archs, even mips. however, interpreting status of non-musl
programs on mips may have caused problems. with this change, the full
range of signal numbers can be made available on mips.
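the stopped-status test described above can be sketched as a function, which sidesteps multiple evaluation entirely. this is an illustration of the logic under the usual Linux wait status layout (low byte 0x7f for stopped, stop signal in the next byte), not the macro actually placed in the headers; status_is_stopped is a hypothetical name.

```c
/* hypothetical function form of the check: the low byte must be
 * 0x7f and the stop-signal byte above it must be nonzero. a status
 * of plain 0x7f means "killed by signal 127" instead. as a function
 * it evaluates its argument exactly once. */
int status_is_stopped(int status)
{
	return (status & 0xff) == 0x7f && ((status >> 8) & 0xff) != 0;
}
```

a process stopped by signal 19 has status 0x137f and is reported as stopped, while a process killed by signal 127 has status 0x7f and is not.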
this first commit just includes the CPU_* and sched_* interfaces, not
the pthread_* interfaces, which may be added later. simple
sanity-check testing has been done for the basic interfaces, but most
of the macros have not yet been tested.
the idea here is to avoid advertising signals that don't exist and to
make these functions safe to call (e.g. from within other parts of the
implementation) on fake sigset_t objects which do not have the HURD
padding.
the trick here is that sigaction can track for us which signals have
ever had a signal handler set for them, and only those signals need to
be considered for reset. this tracking mask may have false positives,
since it is impossible to remove bits from it without race conditions.
false negatives are not possible since the mask is updated with atomic
operations prior to making the sigaction syscall.
implementation-internal signals are set to SIG_IGN rather than SIG_DFL
so that a signal raised in the parent (e.g. calling pthread_cancel on
the thread executing posix_spawn) does not have any chance to make it
to the child, where it would cause spurious termination by signal.
this change reduces the minimum/typical number of syscalls in the
child from around 70 to 4 (including execve). this should greatly
improve the performance of posix_spawn and other interfaces which use
it (popen and system).
to facilitate these changes, sigismember is also changed to return 0
rather than -1 for invalid signals, and to return the actual status of
implementation-internal signals. POSIX allows but does not require an
error on invalid signal numbers, and in fact returning an error tends
to confuse applications which wrongly assume the return value of
sigismember is boolean.
the child process's stack may be of insufficient size to support a
signal frame, and there is no reason these signal handlers should run
in the child anyway.
there are several reasons for this. some of them are related to race
conditions that arise since fork is required to be async-signal-safe:
if fork or pthread_create is called from a signal handler after the
fork syscall has returned but before the subsequent userspace code has
finished, inconsistent state could result. also, there seem to be
kernel and/or strace bugs related to arrival of signals during fork,
at least on some versions, and simply blocking signals eliminates the
possibility of such bugs.
this commit does not add versioning support; it merely fixes incorrect
lookups of symbols in libraries that contain versioned symbols.
previously, the version information was completely ignored, and
empirically this seems to have resulted in the oldest version being
chosen, but I am uncertain if that behavior was even reliable.
the new behavior being introduced is to completely ignore symbols
which are marked "hidden" (this seems to be the confusing nomenclature
for non-current-version) when versioning is present. this should solve
all problems related to libraries with symbol versioning as long as
all binaries involved are up-to-date (compatible with the
latest-version symbols), and it's the needed behavior for dlsym under
all circumstances.
at this point, it is just the common base charset equivalent to
Windows CP 950, with no further extensions. HKSCS and possibly other
supersets will be added later. other aliases may need to be added too.
the (obsolete) standard allows either 0 or 1 for the decimal point
location in this case, but since the number of zero digits returned in
the output string (in this implementation) is one more than the number
of digits the caller requested, it makes sense for the decimal point
to be logically "after" the first digit. in a sense, this change goes
with the previous commit which fixed the value of the decimal point
location for non-zero inputs.
these functions are obsolete and have no modern standard. the text in
SUSv2 is highly ambiguous, specifying that "negative means to the left
of the returned digits", which suggested to me that 0 would mean to
the right of the first digit. however, this does not agree with
historic practice, and the Linux man pages are more clear, specifying
that a negative value means "that the decimal point is to the left of
the start of the string" (in which case, 0 would mean the start of the
string, in accordance with historic practice).
like for other character sets, stateful iso-2022 form is not supported
yet but everything else should work. all charset aliases are treated
the same, as Windows codepage 949, because reportedly the EUC-KR
charset name is in widespread (mis?)usage in email and on the web for
data which actually uses the extended characters outside the standard
93x94 grid. this could easily be changed if desired.
the principle of this converter for handling the giant bulk of rare
Hangul syllables outside of the standard KS X 1001 93x94 grid is the
same as the GB18030 converter's treatment of non-explicitly-coded
Unicode codepoints: sequences in the extension range are mapped to an
integer index N, and the converter explicitly computes the Nth Hangul
syllable not explicitly encoded in the character map. empirically,
this requires at most 7 passes over the grid. this approach reduces
the table size required for Korean legacy encodings from roughly 44k
to 17k and should have minimal performance impact on real-world text
conversions since the "slow" characters are rare. where it does have
impact, the cost is merely a large constant time factor.
unblocking it in the pthread_once init function is not sufficient,
since multiple threads, some of them with the signal blocked, could
already exist before this is called; timers started from such threads
would be non-functional.
this is needed for reused threads in the SIGEV_THREAD timer
notification system, and could be reused elsewhere in the future if
needed, though it should be refactored for such use.
for static linking, __init_tls.c is simply modified to export the TLS
info in a structure with external linkage, rather than using statics.
this perhaps makes the code more clear, since the statics were poorly
named for statics. the new __reset_tls.c is only linked if it is used.
for dynamic linking, the code is in dynlink.c. sharing code with
__copy_tls is not practical since __reset_tls must also re-zero
thread-local bss.
1. the thread result field was reused for storing a kernel timer id,
but would be overwritten if the application code exited or cancelled
the thread.
2. low pointer values were used as the indicator that the timer id is
a kernel timer id rather than a thread id. this is not portable, as
mmap may return low pointers under some conditions. instead, use the fact
that pointers must be aligned and kernel timer ids must be
non-negative to map pointers into the negative integer space.
3. signals were not blocked until after the timer thread started, so a
race condition could allow a signal handler to run in the timer thread
when it's not supposed to exist. this is mainly problematic if the
calling thread was the only thread where the signal was unblocked and
the signal handler assumes it runs in that thread.
this is another case of the kernel syscall failing to support flags
where it needs to, leading to horrible workarounds in userspace. this
time the workaround requires changing uid/gid, and that's not safe to
do in the current process. in the worst case, kernel resource limits
might prevent recovering the original values, and then there would be
no way to safely return. so, use the safe but horribly inefficient
alternative: forking. clone is used instead of fork to suppress
signals from the child.
fortunately this worst-case code is only needed when effective and
real ids mismatch, which mainly happens in suid programs.
it turns out Linux is buggy for faccessat, just like fchmodat: the
kernel does not actually take a flags argument. so we're going to have
to emulate it there.
patch by nsz. the actual object the caller has storing the tree root
has type void *, so accessing it as struct node * is not valid.
instead, simply access the value, move it to a temporary of the
appropriate type and work from there, then move the result back.
check in configure to be polite (failing early if we're going to fail)
and in vfprintf.c since that is the point at which a mismatching type
would be extremely dangerous.
on newer kernels, fchdir and fstat work anyway. this same fix should
be applied to any other syscalls that are similarly affected.
with this change, the current definitions of O_SEARCH and O_EXEC as
O_PATH are mostly conforming to POSIX requirements. the main remaining
issue is that O_NOFOLLOW has different semantics.
I intend to add more Linux workarounds that depend on using these
pathnames, and some of them will be in "syscall" functions that, from
an anti-bloat standpoint, should not depend on the whole snprintf
framework.
previously, the AT_SYMLINK_NOFOLLOW flag was ignored, giving
dangerously incorrect behavior -- the target of the symlink had its
modes changed to the modes (usually 0777) intended for the symlink.
this issue was amplified by the fact that musl provides lchmod, as a
wrapper for fchmodat, which some archival programs take as a sign that
symlink modes are supported and thus attempt to use.
emulating AT_SYMLINK_NOFOLLOW was a difficult problem, and I
originally believed it could not be solved, at least not without
depending on kernels newer than 3.5.x or so where O_PATH works halfway
well. however, it turns out that accessing O_PATH file descriptors via
their pseudo-symlink entries in /proc/self/fd works much better than
trying to use the fd directly, and works even on older kernels.
moreover, the kernel has permanently pegged these references to the
inode obtained by the O_PATH open, so there should not be race
conditions with the file being moved, deleted, replaced, etc.
this is the modern way, and the only way that makes any sense. glibc
has this complicated mechanism with RPATH and RUNPATH that controls
whether RPATH is processed before or after LD_LIBRARY_PATH, presumably
to support legacy binaries, but there is no compelling reason to
support this, and better behavior is obtained by just fixing the
search order.
previously, errno could be meaningless when the caller wrote it to the
dlerror string or stderr. try to make it meaningful. also, fix
incorrect check for over-long program headers and instead actually
support them by allocating memory if needed.
the access function cannot be used to check for existence, because it
operates using real uid/gid rather than effective to determine
accessibility; this matters for the non-final path components.
instead, use stat. failure of stat is success if only the final
component is missing (ENOENT) and otherwise is failure.
the concept of both versions is the same; they differ only in details.
for long runs, they use "rep movsl" or "rep movsq", and for small
runs, they use a trick, writing from both ends towards the middle,
that reduces the number of branches needed. in addition, if memset is
called multiple times with the same length, all branches will be
predicted; there are no loops.
for larger runs, there are likely faster approaches than "rep", at
least on some cpu models. for 32-bit, it's unlikely that there is any
faster approach that does not require non-baseline instructions; doing
anything fancier would require inspecting cpu capabilities. for
64-bit, there may very well be faster versions that work on all
models; further optimization could be explored in the future.
with these changes, memset is anywhere between 50% faster and 6 times
faster, depending on the cpu model and the length and alignment of the
destination buffer.
the original motivation for this patch was that qemu (and possibly
other syscall emulators) nop out madvise, resulting in an infinite
loop. however, there is another benefit to this change: madvise may
actually undo an explicit madvise the application intended for its
stack, whereas the mremap operation is a true nop. the logic here is
that mremap must fail if it cannot resize the mapping in-place, and
the caller knows that it cannot resize in-place because it knows the
next page of virtual memory is already occupied.
one of the arguments to memcmp may be shorter than the length l-3, and
memcmp is under no obligation not to access past the first byte that
differs. instead use strncmp which conveys the correct semantics. the
performance difference is negligible here and since the code is only
used for shared libc, both functions are already linked anyway.
the dev/inode for the main app and the dynamic linker ("interpreter")
are not available, so the subsequent checks don't work. in general we
don't want to make exact string matches to existing libraries prevent
loading new ones, since this breaks loading upgraded modules in
module-loading systems. so instead, special-case it.
the motivation for this fix is that calling dlopen on the names
returned by dl_iterate_phdr or walking the link map (obtained by
dlinfo) seem to be the only methods available to an application to
actually get a list of open dso handles.
reject elf files which are not ET_EXEC/ET_DYN type as bad exec format,
and reject ET_EXEC files when they cannot be loaded at the correct
address, since they are not relocatable at runtime. the main practical
benefit of this is to make dlopen of the main program fail rather than
producing an unsafe-to-use handle.
it's not clear to me why the linker even outputs these headers if they
are null, but apparently it does so. with the default startfiles, they
will never be null anyway, but this patch allows eliminating crti,
crtn, crtbegin, and crtend (leaving only crt1) if the toolchain is
using init_array/fini_array (or for a C-only, no-ctor environment).
in signal() it is needed since __sigaction uses restrict in parameters
and sharing the buffer is technically an aliasing error. do the same
for the syscall, as at least qemu-user does not handle it properly.
LC_GLOBAL_LOCALE refers to the global locale, controlled by setlocale,
not the thread-local locale in effect which these functions should be
using. neither LC_GLOBAL_LOCALE nor 0 as an argument to the *_l
functions has behavior defined by the standard, but 0 is a more
logical choice for requesting the callee to lookup the current locale.
in the future I may move the current locale lookup to the caller (the
non-_l-suffixed wrapper).
at this point, all of the locale logic is dummied out, so no harm was
done, but it should at least avoid misleading usage.
also add a warning to the existing sys/poll.h. the warning is absent
from sys/dir.h because it is actually providing a slightly different
API to the program, and thus just replacing the #include directive is
not a valid fix to programs using this one.
entry point was wrong for PIE. e_entry was being treated as an
absolute value, whereas it's actually relative to the load address
(which is zero for non-PIE).
phdr pointer was wrong for non-PIE. e_phoff was being treated as
load-address-relative, whereas it's actually a file offset in the ELF
file. in any case, map_library was already computing it correctly, and
the incorrect code in __dynlink was overwriting it with junk.
the only immediate effect of this commit is enabling PIE support on
some archs that did not previously have any Scrt1.s, since the
existing asm files for crt1 override this C code. so some of the
crt_arch.h files committed are only there for the sake of documenting
what their archs "would do" if they used the new C-based crt1.
the expectation is that new archs should use this new system rather
than using heavy asm for crt1. aside from being easier and less
error-prone, it also ensures that PIE support is available immediately
(since Scrt1.o is generated from the same C source, using -fPIC)
rather than having to be added as an afterthought in the porting
process.
based on a patch by orc. POSIX actually fails to specify the format of
the ntop conversion; presumably, any output that will correctly
round-trip back via the (well-specified) pton operation is acceptable.
the new behavior is much more convenient than the old, however.
this patch also affects getnameinfo, which is implemented in terms of
inet_ntop and which is the preferred interface for performing this
conversion.
I've also removed some inexplicable cruft (filling the buffer with 'x'
before doing anything) whose origin I was unable to track down.
it's not clear that -O3 helps them, and gcc seems to have floating
point optimization bugs that introduce additional failures when -O3 is
used on some of these files.
apparently the original kernel commit's i386 version of siginfo.h
defined this field as unsigned int, but the asm-generic file always
had void *. unsigned int is obviously not a suitable type for an
address, in a non-arch-specific file, and glibc also has void * here,
so I think void * is the right type for it.
also fix redundant type specifiers.
linux commit 8d36eb01da5d371feffa280e501377b5c450f5a5 (2013-05-29)
added PF_IB for InfiniBand
linux commit d021c344051af91f42c5ba9fdedc176740cbd238 (2013-02-06)
added PF_VSOCK for VMware sockets
linux commit a0727e8ce513fe6890416da960181ceb10fbfae6 (2012-04-12)
added siginfo fields for SIGSYS (seccomp uses it)
linux commit ad5fa913991e9e0f122b021e882b0d50051fbdbc (2009-09-16)
added siginfo field and si_code values for SIGBUS (hwpoison signal)
this is a cheat since the _l versions take an extra argument, but
since these functions are only here for ABI purposes, it doesn't
really matter as long as the ABI matches. if the non-__-prefixed
versions are eventually made public, they should probably be real
functions rather than hacks like this.
unlike the strftime commit, this one is purely an ABI compatibility
issue. the previous version of the code would have worked just as well
with LC_TIME once LC_TIME support is added.
based on a patch by orc, with indexing and flow control cleaned up a
little bit. this code is all going to be replaced at some point in the
near future.
these are needed for some C++ library binaries including most builds
of libstdc++. I'm not entirely clear on the rationale. this patch does
not implement any special semantics for them, but as far as I can
tell, no special treatment is needed in correctly-linked programs;
this binding seems to exist only for catching incorrectly-linked
programs.
this is necessary to meet the C++ ABI target. alternatives were
considered to avoid the size increase for non-sig jmp_buf objects, but
they seemed to have worse properties. moreover, the relative size
increase is only extreme on x86[_64]; one way of interpreting this is
that, if the size increase from this patch makes jmp_buf use too much
memory, then the program was already using too much memory when built
for non-x86 archs.
this bug was caught by the new footer-corruption check in realloc and
free.
if the block returned by malloc was already aligned to the desired
alignment, memalign's logic to split off the misaligned head was
incorrect; rather than writing to a point inside the allocated block,
it was overwriting the footer of the previous block on the heap with
the value 1 (length 0 plus an in-use flag).
fortunately, the impact of this bug was fairly low. (this is probably
why it was not caught sooner.) due to the way the heap works, malloc
will never return a block whose previous block is free. (doing so would
be harmful because it would increase fragmentation with no benefit.)
the footer is actually not needed for in-use blocks, except that its
in-use bit needs to remain set so that it does not get merged with
free blocks, so there was no harm in it being set to 1 instead of the
correct value.
however, there is one case where this bug could have had an impact: in
multi-threaded programs, if another thread freed the previous block
after memalign's call to malloc returned, but before memalign
overwrote the previous block's footer, the resulting block in the free
list could be left in a corrupt state. I have not analyzed the impact
of this bad state and whether it could lead to more serious
malfunction.
the motivation for this patch is that the vast majority of libc is
code that does not benefit at all from optimizations, but that certain
components like string/memory operations can be major performance
bottlenecks.
at the same time, the old -falign-*=1 options are removed, since they
were only beneficial for avoiding bloat when global -O3 was used, and
in that case, they may have prevented some of the performance gains.
to be the most useful, this patch will need further tuning. in
particular, research is needed to determine which components should be
built with -O3 by default, and it may be desirable to remove the
hard-coded -O3 and instead allow more customization of the
optimization level used for selected modules.
this patch is something of a compromise for a compatibility
regression discovered after the header refactoring: libtiff uses
_Int64 for its own use. this is absolutely wrong, invalid C, and
should not be supported, but it's also frustrating for users when code
that used to work suddenly breaks.
rather than leave the breakage in place or change musl internals to
accommodate broken software, I've found a change that makes the
problem go away and improves musl. by undefining these macros at the
end of alltypes.h, the temptation to use them in other headers is
removed. (for example, I almost used _Int64 in sys/types.h to define
u_int64_t rather than adding it back to alltypes.h.) by confining use
of these macros to alltypes.h, we keep it easy to go back and change
the implementation of alltypes later, if needed.
during the header refactoring, I had moved u_int64_t out of alltypes
under the assumption that we could just use long long everywhere.
however, it seems some broken applications make inconsistent mixed use
of u_int64_t and uint64_t, resulting in build errors when the
underlying type differs.
rather than moving nlink_t back to the arch-specific file, I've added
a macro _Reg defined to the canonical type for register-size values on
the arch. this is not the same as _Addr for (not-yet-supported)
32-on-64 pseudo-archs like x32 and mips n32, so a new macro was
needed.
for regoff_t, it's impossible to match on 64-bit archs because glibc
defined the type in a non-conforming way. however this change makes
the type match on 32-bit archs.
since the old, poorly-thought-out musl approach to init/fini arrays on
ARM (when it was the only arch that needed them) was to put the code
in crti/crtn and have the legacy _init/_fini code run the arrays,
adding proper init/fini array support caused the arrays to get
processed twice on ARM. I'm not sure skipping legacy init/fini
processing is the best solution to the problem, but it works, and it
shouldn't break anything since the legacy init/fini system was never
used for ARM EABI.
aside from the obvious C++ ABI purpose for this change, it also brings
musl into alignment with the compiler's idea of the definition of
wint_t (used in -Wformat), and makes the situation less awkward on ARM,
where wchar_t is unsigned.
internal code using wint_t and WEOF was checked against this change,
and while a few cases of storing WEOF into wchar_t were found, they
all seem to operate properly with the natural conversion from unsigned
to signed.
the arch-specific bits/alltypes.h.sh has been replaced with a generic
alltypes.h.in and minimal arch-specific bits/alltypes.h.in.
this commit is intended to have no functional changes except:
- exposing additional symbols that POSIX allows but does not require
- changing the C++ name mangling for some types
- fixing the signedness of blksize_t on powerpc (POSIX requires signed)
- fixing the limit macros for sig_atomic_t on x86_64
- making dev_t an unsigned type (ABI matching goal, and more logical)
in addition, some types that were wrongly defined with long on 32-bit
archs were changed to int, and vice versa; this change is
non-functional except for the possibility of making pointer types
mismatch, and only affects programs that were using them incorrectly,
and only at build-time, not runtime.
the following changes were made in the interest of moving
non-arch-specific types out of the alltypes system and into the
headers they're associated with, and also will tend to improve
application compatibility:
- netdb.h now includes netinet/in.h (for socklen_t and uint32_t)
- netinet/in.h now includes sys/socket.h and inttypes.h
- sys/resource.h now includes sys/time.h (for struct timeval)
- sys/wait.h now includes signal.h (for siginfo_t)
- langinfo.h now includes nl_types.h (for nl_item)
for the types in stdint.h:
- types which are of no interest to other headers were moved out of
the alltypes system.
- fast types for 8- and 64-bit are hard-coded (at least for now); only
the 16- and 32-bit ones have reason to vary by arch.
and the following types have been changed for C++ ABI purposes:
- mbstate_t now has a struct tag, __mbstate_t
- FILE's struct tag has been changed to _IO_FILE
- DIR's struct tag has been changed to __dirstream
- locale_t's struct tag has been changed to __locale_struct
- pthread_t is defined as unsigned long in C++ mode only
- fpos_t now has a struct tag, _G_fpos64_t
- fsid_t's struct tag has been changed to __fsid_t
- idtype_t has been made an enum type (also required by POSIX)
- nl_catd has been changed from long to void *
- siginfo_t's struct tag has been removed
- sigset_t has been given a struct tag, __sigset_t
- stack_t has been given a struct tag, sigaltstack
- suseconds_t has been changed to long on 32-bit archs
- [u]intptr_t have been changed from long to int rank on 32-bit archs
- dev_t has been made unsigned
summary of tests that have been performed against these changes:
- nsz's libc-test (diff -u before and after)
- C++ ABI check symbol dump (diff -u before, after, glibc)
- grepped for __NEED, made sure types needed are still in alltypes
- built gcc 3.4.6
these functions were mistakenly assumed to be needed to match glibc
ABI, but glibc has them as part of the non-shared part of libc that's
always statically linked into the main program. moreover, the only
place they are referenced from is glibc's crt1.o.
modern (4.7.x and later) gcc uses init/fini arrays, rather than the
legacy _init/_fini function pasting and crtbegin/crtend ctors/dtors
system, on most or all archs. some archs had already switched a long
time ago. without following this change, global ctors/dtors will cease
to work under musl when building with new gcc versions.
the most surprising part of this patch is that it actually reduces the
size of the init code, for both static and shared libc. this is
achieved by (1) unifying the handling of the main program and shared
libraries in the dynamic linker, and (2) eliminating the
glibc-inspired rube goldberg machine for passing around init and fini
function pointers. to clarify, some background:
the function signature for __libc_start_main was based on glibc, as
part of the original goal of being able to run some glibc-linked
binaries. it worked by having the crt1 code, which is linked into
every application, static or dynamic, obtain and pass pointers to the
init and fini functions, which __libc_start_main is then responsible
for using and recording for later use, as necessary. however, in
neither the static-linked nor dynamic-linked case do we actually need
crt1.o's help. with dynamic linking, all the pointers are available in
the _DYNAMIC block. with static linking, it's safe to simply access
the _init/_fini and __init_array_start, etc. symbols directly.
obviously changing the __libc_start_main function signature in an
incompatible way would break both old musl-linked programs and
glibc-linked programs, so let's not do that. instead, the function can
just ignore the information it doesn't need. new archs need not even
provide the useless args in their versions of crt1.o. existing archs
should continue to provide it as long as there is an interest in
having newly-linked applications be able to run on old versions of
musl; at some point in the future, this support can be removed.
for conversion specifiers, alloc is always set when the specifier is
parsed. however, if scanf stops due to mismatching literal text,
either an uninitialized (if no conversions have been performed yet) or
stale (from the previous conversion) value of the flag will be used,
possibly causing an invalid pointer to be passed to free when the
function returns.
the sizes in the header and footer for a chunk should always match. if
they don't, the program has definitely invoked undefined behavior, and
the most likely cause is a simple overflow, either of a buffer in the
block being freed or the one just below it.
crashing here should not only improve security of buggy programs, but
also aid in debugging, since the crash happens in a context where you
have a pointer to the likely-overflowed buffer.
while there's no POSIX namespace provision for UIO_* in uio.h, this
exact macro name is reserved in XBD 2.2.2. apparently some
glibc-centric software expects it to exist, so let's provide it.
the main aim of this patch is to ensure that if not all fields are
filled in, they contain zeros, so as not to confuse applications.
reportedly some older kernels, including commonly used openvz kernels,
lack the f_flags field, resulting in applications reading random junk
as the mount flags; the common symptom seems to be wrongly considering
the filesystem to be mounted read-only and refusing to operate. glibc
has some amazingly ugly fallback code to get the mount flags for old
kernels, but having them really is not that important anyway; what
matters most is not presenting incorrect flags to the application.
I have also aimed to fill in some fields of statvfs that were
previously missing, and added code to explicitly zero the reserved
space at the end of the structure, which will make things easier in
the future if this space someday needs to be used.
this change is both to fix one of the remaining type (and thus C++
ABI) mismatches with glibc/LSB and to allow use of the full range of
uid and gid values, if so desired.
passwd/group access functions were not prepared to deal with unsigned
values, so they too have been fixed with this commit.
prior to this change, using a non-default syslibdir was impractical on
systems where the ordinary library paths contain musl-incompatible
library files. the file containing search paths was always taken from
/etc, which would either correspond to a system-wide musl
installation, or fail to exist at all, resulting in searching of the
default library path.
the new search strategy is safe even for suid programs because the
pathname used comes from the PT_INTERP header of the program being
run, rather than any external input.
as part of this change, I have also begun differentiating the names of
arch variants that differ by endianness or floating point calling
convention. the corresponding changes in the build system and gcc
wrapper script (to use an alternate dynamic linker name) for these
configurations have not yet been made.
POSIX is not clear on whether it includes the termination, but ISO C
requires that it does. the whole concept of this macro is rather
useless, but it's better to be correct anyway.
this is both a minor scheduling optimization and a workaround for a
difficult-to-fix bug in qemu app-level emulation.
from the scheduling standpoint, it makes no sense to schedule the
parent thread again until the child has exec'd or exited, since the
parent will immediately block again waiting for it.
on the qemu side, as regular application code running on an underlying
libc, qemu cannot make arbitrary clone syscalls itself without
confusing the underlying implementation. instead, it breaks them down
into either fork-like or pthread_create-like cases. it was treating
the code in posix_spawn as pthread_create-like, due to CLONE_VM, which
caused horribly wrong behavior: CLONE_FILES broke the synchronization
mechanism, CLONE_SIGHAND broke the parent's signals, and CLONE_THREAD
caused the child's exec to end the parent -- if it hadn't already
crashed. however, qemu special-cases CLONE_VFORK and emulates that
with fork, even when CLONE_VM is also specified. this also gives
incorrect semantics for code that really needs the memory sharing, but
posix_spawn does not make use of the vm sharing except to avoid
momentary double commit charge.
programs using posix_spawn (including via popen) should now work
correctly under qemu app-level emulation.
for 0-argument syscalls (1 argument to the macro, the syscall number),
the __SYSCALL_NARGS_X macro's ... argument was not satisfied. newer
compilers seem to care about this.
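as a rough illustration of the problem (the macro names below are hypothetical stand-ins, not the actual definitions), an argument-counting macro of this shape needs a trailing comma in the wrapper so that the ... parameter always receives an argument, even an empty one:

```c
/* hypothetical stand-ins for the counting macros; handles 1 to 8 arguments */
#define NARGS_X(a,b,c,d,e,f,g,h,n,...) n

/* the trailing comma guarantees that "..." is always given an argument,
 * even when only the syscall number is passed */
#define NARGS(...) NARGS_X(__VA_ARGS__,7,6,5,4,3,2,1,0,)

/* count of arguments following the first one (the syscall number) */
int nargs_number_only(void) { return NARGS(39); }
int nargs_three_extra(void) { return NARGS(1, 2, 3, 4); }
```

without the trailing comma, the one-argument case leaves nothing for the variadic parameter, which strict compilers diagnose.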
this commit has two major user-visible parts: zoneinfo-format time
zones are now supported, and overflow handling is intended to be
complete in the sense that all functions return a correct result if
and only if the result fits in the destination type, and otherwise
return an error. also, some noticeable bugs in the way DST detection
and normalization worked have been fixed, and performance may be
better than before, but it has not been tested.
apparently this was never noticed before because the linker normally
optimizes dynamic TLS models to non-dynamic ones when static linking,
thus eliminating the calls to __tls_get_addr which crash when the dtv
is missing. however, some libsupc++ code on ARM was calling
__tls_get_addr when static linked and crashing. the reason is unclear
to me, but with this issue fixed it should work now anyway.
map_library was saving pointers to an automatic-storage buffer rather
than pointers into the mapping. this should be a fairly simple fix,
but the patch here is slightly complicated by two issues:
1. supporting gratuitously obfuscated ELF files where the program
headers are not right at the beginning of the file.
2. cleaning up the map_library function so that data isn't clobbered
by the time we need it.
there are still several more that are misleading, but SIGFPE (integer
division error misdescribed as floating point) and SIGCHLD
(possibly non-exit status change events described as exiting) were the
worst offenders.
also clean up, optimize, and simplify the code, removing branches by
simply pre-setting the result string to an empty string, which will be
preserved if other operations fail.
the main use for this macro seems to be knowing the correct allocation
granularity for dynamic-sized fd_set objects. such usage is
non-conforming and results in undefined behavior, but it is widespread
in applications.
there are two motivations for this change. one is to avoid
gratuitously depending on a C11 symbol for implementing a POSIX
function. the other pertains to the documented semantics. C11 does not
define any behavior for aligned_alloc when the length argument is not
a multiple of the alignment argument. posix_memalign on the other hand
places no requirements on the length argument. using __memalign as the
implementation of both, rather than trying to implement one in terms
of the other when their documented contracts differ, eliminates this
confusion.
C11 has no requirement that the alignment be a multiple of
sizeof(void*), and in fact seems to require any "valid alignment
supported by the implementation" to work. since the alignment of char
is 1 and thus a valid alignment, an alignment argument of 1 should be
accepted.
a search of debian codesearch and a grep over the pkgsrc
directory tree have shown that these macros are all either unused,
or defined by the programs themselves in case they need them.
these would not be expensive to actually implement, but reading
/etc/ethers does not sound like a particularly useful feature, so for
now I'm leaving them as stubs.
previously, determination of the list of header files for installation
depended on the include/bits symlink (to the arch-specific files)
already having been created. in other words, running "make install"
immediately after configure without first running "make" caused the
bits headers not to be installed.
the solution I have applied is to pull the list of headers directly
from arch/$(ARCH)/bits rather than include/bits, and likewise to
install directly from arch/$(ARCH)/bits rather than via the symlink.
at this point, the only purpose served by keeping the symlink around
is that it enables use of the in-tree headers and libs directly via -I
and -L, which can be useful when testing against a new version of the
library before installing it. on the other hand, removing the bits
symlink would be beneficial if we ever want to support building
multiple archs in the same source tree.
in theory this should not be an issue, since major() should only be
applied to type dev_t, which is 64-bit. however, it appears some
applications are not using dev_t but a smaller integer type (which
works on Linux because the kernel's dev_t is really only 32-bit). to
avoid the undefined behavior, do it as two shifts.
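a minimal sketch of the two-shift idea follows. the macro name is hypothetical, and the bit layout (high major bits above bit 32, low major bits at bits 8 to 19) is an assumption based on the glibc-style 64-bit dev_t encoding:

```c
#include <stdint.h>

/* hypothetical name. shifting by 31 and then by 1 is well-defined even when
 * x has a 32-bit type, whereas a single shift by 32 would be undefined. */
#define SKETCH_MAJOR(x) \
	((unsigned)( (((x)>>31>>1)&0xfffff000) | (((x)>>8)&0x00000fff) ))

/* wrappers showing both a full-width and a too-narrow argument type */
unsigned major_of_64(uint64_t d) { return SKETCH_MAJOR(d); }
unsigned major_of_32(uint32_t d) { return SKETCH_MAJOR(d); }
```

with a 32-bit argument the high part simply evaluates to zero, so only the low 12 bits of the major number are recovered.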
this change is needed to correctly handle the case where a constructor
creates a new thread which calls dlopen. previously, the lock was not
held in this case. the reason for the complex logic to avoid locking
whenever possible is that, since the mutex is recursive, it will need
to inspect the thread pointer to get the current thread's tid, and
this requires initializing the thread pointer. we do not want
non-multi-threaded programs to attempt to access the thread pointer
unnecessarily; doing so could make them crash on ancient kernels that
don't support threads but which may otherwise be capable of running
the program.
rather than returning an error, we have to increase the size argument
so high that the kernel will have no choice but to fail. this is
because POSIX only permits the EINVAL error for size errors when a new
shared memory segment would be created; if it already exists, the size
argument must be ignored. unfortunately Linux is non-conforming in
this regard, but I want to keep the code correct in userspace anyway
so that if/when Linux is fixed, the behavior applications see will be
conforming.
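the size adjustment described above can be sketched roughly as follows. the helper name is hypothetical, and the PTRDIFF_MAX threshold is an assumption about what counts as an invalid size:

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical helper: returns the size value to hand to the kernel */
size_t shm_size_for_kernel(size_t size)
{
	/* rather than failing in userspace, push oversized requests up to a
	 * value the kernel cannot satisfy when it has to create a segment;
	 * if the segment already exists, the size is supposed to be ignored */
	if (size > PTRDIFF_MAX) size = SIZE_MAX;
	return size;
}
```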
rejecting invalid values for n is fine even in the case where a new
sem will not be created, since the kernel does its range checks on n
even in this case as well.
by default, the kernel will bound the limit well below USHRT_MAX
anyway, but it's presumably possible that an administrator could
override this limit and break things.
this type is not really intended to be used; it's just there to allow
implementations to choose the type for the shm_nattch member of
struct shmid_ds, presumably since historical implementations disagreed
on the type. in any case, it needs to be there, so now it is.
in the process, I refactored the week-number code so it can be used by
the week-based-year formats to determine year adjustments at the
boundary values. this also improves indentation/code readability.
output for plain week numbers (%U and %W) has been sanity-checked, and
output for the week-based-year week numbers (%V) has been checked
extensively against known-good data for the full non-negative range of
32-bit time_t.
year numbers for week-based years (%g and %G) are not yet implemented.
the pathnames prefixed with /dev/null/ are guaranteed never to be
valid. the previous use of /dev/null alone was mildly dangerous in
that bad software might attempt to unlink the name when it found a
non-regular file there and create a new file.
internally, other parts of the library assume sizes don't overflow
ssize_t and/or ptrdiff_t, and the way this assumption is made valid is
by preventing creation of such large objects. malloc already does so,
but the check was missing from mmap.
this is also a quality of implementation issue: even if the
implementation internally could handle such objects, applications
could inadvertently invoke undefined behavior by subtracting pointers
within an object. it is very difficult to guard against this in
applications, so a good implementation should simply ensure that it
does not happen.
previously, the path string was being used despite being invalid. with
this change, an empty path file or an error reading the path file is
treated as an empty path. this is preferable to falling back to a
default path, so that attacks which prevent reading of the path file
cannot result in loading incorrect and possibly dangerous (outdated or
mismatching ABI) libraries.
the code to strip the final newline has also been removed; now that
newline is accepted as a delimiter, it's harmless to leave it in
place.
despite declaring functions that take arguments of type va_list, these
headers are not permitted by the C standard to expose the definition
of va_list, so an alias for the type must be used. the name
__isoc_va_list was chosen to convey that the purpose of this alternate
name is for ISO C conformance, and to avoid the multitude of names
which gcc mangles with its hideous "fixincludes" monstrosity, leading
to serious header breakage if these "fixes" are run.
apparently the original commit was never tested properly, since
getline was only ever reading one line. the intent was to read the
entire file, so use getdelim with the null byte as delimiter as a
cheap way to read a whole file into memory.
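a minimal sketch of the getdelim trick, assuming the file contains no embedded null bytes (the helper name is hypothetical):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* hypothetical helper: reads everything up to the first null byte or EOF.
 * returns a malloc'd, null-terminated buffer that the caller must free,
 * or NULL on error or if nothing could be read. */
char *read_whole_stream(FILE *f, ssize_t *len)
{
	char *buf = NULL;
	size_t cap = 0;
	/* with 0 as the delimiter, newlines no longer end the read */
	ssize_t l = getdelim(&buf, &cap, 0, f);
	if (l < 0) {
		free(buf);
		return NULL;
	}
	if (len) *len = l;
	return buf;
}
```

unlike getline, which stops at the first newline, this returns all lines of a text file in one buffer.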
also move all legacy inet_* functions into a single file to avoid
wasting object file and compile time overhead on them.
the added functions are legacy interfaces for working with classful
ipv4 network addresses. they have no modern usefulness whatsoever, but
some programs unconditionally use them anyway, and they're tiny.
based on patch by Strake with minor stylistic changes, and combined
into a single file. this patch remained open for a long time due to
some question as to whether ether_aton would be better implemented in
terms of sscanf, and it's time something was committed, so here it is.
the shgetc api, used internally in scanf and int/float scanning code
to handle field width limiting and pushback, was designed assuming
that pushback could be achieved via a simple decrement on the file
buffer pointer. this only worked by chance for regular FILE streams,
due to the linux readv bug workaround in __stdio_read which moves the
last requested byte through the buffer rather than directly back to
the caller. for unbuffered streams and streams not using __stdio_read
but some other underlying read function, the first character read
could be completely lost, and replaced by whatever junk happened to be
in the unget buffer.
to fix this, simply have shgetc, when it performs an underlying read
operation on the stream, store the character read at the -1 offset
from the read buffer pointer. this is valid even for unbuffered
streams, as they have an unget buffer located just below the start of
the zero-length buffer. the check to avoid storing the character when
it is already there is to handle the possibility of read-only buffers.
no application-exposed FILE types are allowed to use read-only
buffers, but sscanf and strto* may use them internally when calling
functions which use the shgetc api.
issue found and patch provided by Jens Gustedt. after the atomic store
to the error code field of the aiocb, the application is permitted to
free or reuse the storage, so further access is invalid. instead, use
the local copy that was already made.
due to the interface requirement of having the full state contained in
a single object of type unsigned int, it is difficult to provide a
reasonable-quality implementation; most good PRNGs are immediately
ruled out because they need larger state. the old rand_r gave very
poor output (very short period) in its lower bits; normally, it's
desirable to throw away the low bits (as in rand()) when using a LCG,
but this is not possible since the state is only 32 bits and we need
31 bits of output.
glibc's rand_r uses the same LCG as musl's, but runs it for 3
iterations and only takes 10-11 bits from each iteration to construct
the output value. this partially fixes the period issue, but
introduces bias: not all outputs have the same frequency, and many do
not appear at all. with such a low period, the bias is likely to be
observable.
I tried many approaches to "fix" rand_r, and the simplest I found
which made it pass the "dieharder" tests was applying this
transformation to the output. the "temper" function is taken from
mersenne twister, where it seems to have been chosen for some rigorous
properties; here, the only formal property I'm using is that it's
one-to-one and thus avoids introducing bias.
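the approach can be sketched as below. the function name is hypothetical, the LCG constants are the classic 1103515245/12345 pair (an assumption here, not quoted from the patch), and unsigned is assumed to be 32 bits wide:

```c
/* tempering transform as used in mersenne twister; each step is an
 * invertible xor-shift, so the whole function is one-to-one */
static unsigned temper(unsigned x)
{
	x ^= x>>11;
	x ^= x<<7 & 0x9D2C5680;
	x ^= x<<15 & 0xEFC60000;
	x ^= x>>18;
	return x;
}

/* hypothetical name: rand_r-style generator with 32 bits of state */
int sketch_rand_r(unsigned *seed)
{
	*seed = *seed * 1103515245 + 12345;   /* advance the LCG state */
	return temper(*seed)/2;               /* 31 bits of tempered output */
}
```

because the tempering is a bijection on the state, every 32-bit value appears exactly once per period before the final halving, so no bias is introduced.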
should further deficiencies in rand_r be reported, the obvious "best"
solution is applying a 32-bit cryptographic block cipher in CTR mode.
I identified several possible ciphers that could be used directly or
adapted, but as they would be a lot slower and larger, I do not see a
justification for using them unless the current rand_r proves
deficient for some real-world use.
arguably CLOCK_MONOTONIC should be redirected to CLOCK_BOOTTIME with a
fallback for old kernels that don't support it, since Linux's
CLOCK_BOOTTIME semantics seem to match the spirit of the POSIX
requirements for CLOCK_MONOTONIC better than Linux's version of
CLOCK_MONOTONIC does. however, this is a change that would require
further discussion and research, so for now, I'm simply making them
all available.
originally it was right on 32-bit archs and wrong on 64-bit, but after
recent changes it was wrong everywhere. with this commit, it's now
right everywhere.
apparently these features have been in Linux for a while now, so it
makes sense to support them. the bit twiddling seems utterly illogical
and wasteful, especially the negation, but that's how the kernel folks
chose to encode pids/tids into the clock id.
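for reference, my understanding of the encoding is sketched below. the helper names are hypothetical, and the constants 2 (scheduler-based cpu clock) and 4 (per-thread flag) are assumptions about the kernel's layout rather than something stated in this commit:

```c
/* hypothetical helpers producing dynamic cpu-time clock ids.
 * (-id-1) equals the bitwise complement of id, and multiplying by 8U
 * performs the 3-bit left shift in unsigned arithmetic, which avoids
 * shifting a negative value. */
int cpu_clock_for_pid(int pid)
{
	return (int)((-pid-1)*8U + 2);   /* process-wide cpu clock */
}

int cpu_clock_for_tid(int tid)
{
	return (int)((-tid-1)*8U + 6);   /* 2 plus the per-thread flag 4 */
}
```

the resulting ids are negative, which keeps them disjoint from the small non-negative ids of the fixed clocks.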
some applications rely on the low bits of rand() being of reasonably
good quality, so it is now fixed by using the top bits of a 64-bit LCG.
this is simple, has small state and passes statistical tests.
D.E. Knuth attributes the multiplier to C.E. Haynes in TAOCP Vol2 3.3.4
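a minimal sketch of such a generator follows. the names are hypothetical, the seeding is simplified, and the multiplier is the 64-bit one from Knuth's table (assumed to be the one referred to above):

```c
#include <stdint.h>

static uint64_t lcg_state;

/* hypothetical names; simplified seeding for illustration only */
void sketch_srand(unsigned s)
{
	lcg_state = s;
}

int sketch_rand(void)
{
	/* 64-bit LCG step; arithmetic wraps modulo 2^64 */
	lcg_state = 6364136223846793005ULL*lcg_state + 1;
	/* keep only the top 31 bits, discarding the weak low bits */
	return lcg_state >> 33;
}
```

the low bits of a power-of-two-modulus LCG have short periods, which is why only the high bits are exposed to the caller.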
the concept here is that %s and %c are essentially special-cases of
%[, with some minimal additional special-casing.
aside from simplifying the code and reducing the number of complex
code-paths that would need changing to make optimizations later, the
main purpose of this change is to simplify addition of the 'm'
modifier which causes scanf to allocate storage for the string being
read.
failure to do so was causing crashes on x86_64 when ctors used SSE,
which was first observed when ctors called variadic functions due to
the SSE prologue code inserted into every variadic function.
previously we were using an unsigned type on 32-bit systems so that
subtraction would be well-defined when it wrapped, but since wrapping
is non-conforming anyway (when clock() overflows, it has to return -1)
the only use of unsigned would be to buy a little bit more time before
overflow. this does not seem worth having the type vary per-arch
(which leads to more arch-specific bugs) or disagree with the ABI musl
(mostly) follows.
per Austin Group interpretation for issue #686, which cites the
requirements of ISO C, clock() cannot wrap. if the result is not
representable, it must return (clock_t)-1. in addition, the old code
was performing wrapping via signed overflow and thus invoking
undefined behavior.
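the overflow check can be sketched as follows. the helper is hypothetical and takes the limit as a parameter so that the 32-bit case can be exercised anywhere; in practice the limit would be the maximum of clock_t:

```c
/* hypothetical helper: converts a cpu-time reading (sec >= 0, nsec in
 * [0, 999999999]) to microsecond ticks, returning -1 when the result
 * would not fit in max. both tests are done without overflowing. */
long long ticks_or_fail(long long sec, long nsec, long long max)
{
	if (sec > max/1000000 || nsec/1000 > max - 1000000*sec)
		return -1;
	return sec*1000000 + nsec/1000;
}
```

with max set to 2147483647, the last representable reading is 2147 seconds and 483647 microseconds, roughly 36 minutes of cpu time.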
since it seems impossible to accurately check for overflow with the
old times()-based fallback code, I have simply dropped the fallback
code for now, thus always returning -1 on ancient systems. if there's
a demand for making it work and somebody comes up with a way, it could
be reinstated, but the clock() function is essentially useless on
32-bit systems anyway (it overflows in less than an hour).
it should be noted that I used LONG_MAX rather than ULONG_MAX, despite
32-bit archs using an unsigned type for clock_t. this discrepancy with
the glibc/LSB type definitions will be fixed now; since wrapping of
clock_t is no longer supported, there's no use in it being unsigned.
The underflow exception is not raised correctly in some
corner cases (see previous fma commit); added comments
with examples for fmaf, fmal and non-x86 fma.
In fmaf store the result before returning so it has the
correct precision when FLT_EVAL_METHOD!=0
1) in downward rounding fma(1,1,-1) should be -0 but it was 0 with
gcc, the code was correct but gcc does not support FENV_ACCESS ON
so it used common subexpression elimination where it shouldn't have.
now volatile memory access is used as a barrier after fesetround.
2) in directed rounding modes there is no double rounding issue
so the complicated adjustments done for nearest rounding mode are
not needed. the only exception to this rule is raising the underflow
flag: assume "small" is an exactly representable subnormal value in
double precision and "verysmall" is a much smaller value so that
(long double)(small plus verysmall) == small
then
(double)(small plus verysmall)
raises underflow because the result is an inexact subnormal, but
(double)(long double)(small plus verysmall)
does not because small is not a subnormal in long double precision
and it is exact in double precision.
now this problem is fixed by checking inexact using fenv when the
result is subnormal
* use unsigned arithmetic
* use unsigned to store arg reduction quotient (so n&3 is understood)
* remove z=0.0 variables, use literal 0
* raise underflow and inexact exceptions properly when x is small
* fix spurious underflow in tanl
patch by Strake. previously it was not feasible to duplicate this
functionality of the functions these were modeled on, since argv[0]
was not saved at program startup, but now that it's available it's
easy to use.
* use unsigned arithmetic on the representation
* store arg reduction quotient in unsigned (so n%2 would work like n&1)
* use different convention to pass the arg reduction bit to __tan
(this argument used to be 1 for even and -1 for odd reduction
which meant obscure bithacks, the new n&1 is cleaner)
* raise inexact and underflow flags correctly for small x
(tanl(x) may still raise spurious underflow for small but normal x)
(this exception raising code increases codesize a bit, similar fixes
are needed in many other places, it may be worth investigating at some
point if the inexact and underflow flags are worth raising correctly
as this is not strictly required by the standard)
* tanf manual reduction optimization is kept for now
* tanl code path is cleaned up to follow similar logic to tan and tanf
there was some question as to how many decimal places to use, since
one decimal place is always sufficient to identify the smallest
denormal uniquely. for now, I'm following the example in the C
standard which is consistent with the other min/max macros we already
had in place.
somehow I missed this when removing the corresponding
__STDC_LIMIT_MACROS and __STDC_CONSTANT_MACROS nonsense from stdint.h.
these were all attempts by the C committee to guess what the C++
committee would want, and the guesses turned out to be wrong.
support for these was recently added to sysmacros.h. note that the
syscall argument is a long, despite dev_t being 64-bit, so on 32-bit
archs the high bits will be lost. it appears the high bits are just
glibc silliness and not part of the kernel api, anyway, but it's nice
that we have them there for future expansion if needed.
When FLT_EVAL_METHOD!=0 (only i386 with x87 fp) the excess
precision of an expression must be removed in an assignment.
(gcc needs -fexcess-precision=standard or -std=c99 for this)
This is done by extra load/store instructions which add code
bloat when a lot of temporaries are used and it makes the result
less precise in many cases.
Using double_t and float_t avoids these issues on i386 and
it makes no difference on other archs.
For now only a few functions are modified where the excess
precision is clearly beneficial (mostly polynomial evaluations
with temporaries).
object size differences on i386, gcc-4.8:
            old    new
__cosdf.o   123     95
__cos.o     199    169
__sindf.o   131     95
__sin.o     225    203
__tandf.o   207    151
__tan.o     605    499
erff.o     1470   1416
erf.o      1703   1649
j0f.o      1779   1745
j0.o       2308   2274
j1f.o      1602   1568
j1.o       2286   2252
tgamma.o   1431   1424
math/*.o  64164  63635
__FLOAT_BITS and __DOUBLE_BITS macros used union compound literals,
now they are changed into static inline functions. A good C compiler
generates the same code for both and the latter is C++ conformant.
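A rough sketch of the function form is shown below; the name is hypothetical and it assumes float and unsigned are both 32 bits wide:

```c
/* hypothetical name. the old macro style relied on a union compound
 * literal, which is not valid C++; a named local union works in both. */
static inline unsigned sketch_float_bits(float f)
{
	union { float f; unsigned i; } u;
	u.f = f;       /* store as float */
	return u.i;    /* reinterpret the same bytes as an integer */
}
```

For example, 1.0f has the IEEE single-precision representation 0x3f800000.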
since CLOCKS_PER_SEC is 1000000 (required by XSI) and the times
syscall reports values in 1/100 second units (Linux), the correct
scaling factor is 10000, not 100. note that only ancient kernels which
lack clock_gettime are affected.
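the arithmetic can be written out as a small sketch (the names are hypothetical; the two rates are the ones given above):

```c
#define SKETCH_CLOCKS_PER_SEC 1000000L   /* clock() units per second (XSI) */
#define SKETCH_TIMES_PER_SEC  100L       /* times() units per second (Linux) */

/* hypothetical helper: scales a times()-style tick count to clock() units */
long long times_to_clock(long long ticks)
{
	/* 1000000/100 == 10000 clock units per tick */
	return ticks * (SKETCH_CLOCKS_PER_SEC / SKETCH_TIMES_PER_SEC);
}
```

one second of cpu time is 100 ticks, which must map to 1000000 clock units; a factor of 100 would have reported only 10000.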
all return values are valid, and on 32-bit systems, values that look
like errors can and will occur. since the only actual error this
function could return is EFAULT, and it is only returnable when the
application has invoked undefined behavior, simply ignore the
possibility that the return value is actually an error code.
there are several reasons for this change. one is getting rid of the
repetition of the syscall signature all over the place. another is
sharing the constant masks without costly GOT accesses in PIC.
the main motivation, however, is accurately representing whether we
want to block signals that might be handled by the application, or all
signals.
they have already blocked signals before decrementing the thread
count, so the code being removed is unreachable in the case where the
thread is no longer counted.
now that blocking signals prevents any application code from running
while the last thread is exiting, the cas logic is no longer needed to
prevent decrementing below zero.
the thread count (1+libc.threads_minus_1) must always be greater than
or equal to the number of threads which could have application code
running, even in an async-signal-safe sense. there is at least one
dangerous race condition if this invariant fails to hold: dlopen could
allocate too little TLS for existing threads, and a signal handler
running in the exiting thread could claim the allocated TLS for itself
(via __tls_get_addr), leaving too little for the other threads it was
allocated for and thereby causing out-of-bounds access.
there may be other situations where it's dangerous for the thread
count to be too low, particularly in the case where only one thread
should be left, in which case locking may be omitted. however, all
such code paths seem to arise from undefined behavior, since
async-signal-unsafe functions are not permitted to be called from a
signal handler that interrupts pthread_exit (which is itself
async-signal-unsafe).
this change may also simplify logic in __synccall and improve the
chances of making __synccall async-signal-safe.
for the duration of the vm-sharing clone used by posix_spawn, all
signals are blocked in the parent process, including
implementation-internal signals. since __synccall cannot do anything
until successfully signaling all threads, the fact that signals are
blocked automatically yields the necessary safety.
aside from debloating and general simplification, part of the
motivation for removing the explicit lock is to simplify the
synchronization logic of __synccall in hopes that it can be made
async-signal-safe, which is needed to make setuid and setgid, which
depend on __synccall, conform to the standard. whether this will be
possible remains to be seen.
C++11, the first C++ with stdint.h, requires the previously protected
macros to be exposed unconditionally by stdint.h. apparently these
checks were an early attempt by the C committee to guess what the C++
committee would want, and they guessed wrong.
this allows /etc/ld-musl-$(ARCH).path to contain one path per line,
which is much more convenient for users than the :-delimited format,
which was a source of repeated and unnecessary confusion. for
simplicity, \n is also accepted in environment variables, though it
should probably not be used there.
at the same time, issues with overly long paths invoking UB or getting
truncated have been fixed. such issues should not have arisen with the
environment (which is size-limited) but could have been generated by a
path file larger than 2**31 bytes in length.
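splitting on either delimiter can be sketched as below. the helper and its callback interface are hypothetical, and skipping empty entries is a choice made for this sketch only:

```c
#include <string.h>

/* hypothetical helper: calls cb (if non-null) once per non-empty entry of
 * a search path whose entries are separated by ':' or '\n', and returns
 * the number of entries seen */
int for_each_path_entry(const char *s, void (*cb)(const char *, size_t))
{
	int n = 0;
	while (*s) {
		size_t l = strcspn(s, ":\n");   /* length up to next delimiter */
		if (l) {
			if (cb) cb(s, l);
			n++;
		}
		s += l;
		if (*s) s++;                    /* step over the delimiter */
	}
	return n;
}
```

using size_t for the entry length, rather than int, is what keeps very long inputs from being truncated.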
the getifaddrs interface seems to have been invented by glibc, and
they expose socket.h, so for us not to do so is just gratuitous
incompatibility with the interface we're mimicking.
the standard is clear that the old behavior is conforming: "In this
case, [EILSEQ] shall be stored in errno and the conversion state is
undefined."
however, the specification of mbrtowc has one peculiarity when the
source argument is a null pointer: in this case, it's required to
behave as mbrtowc(NULL, "", 1, ps). no motivation is provided for this
requirement, but the natural one that comes to mind is that the intent
is to reset the mbstate_t object. for stateful encodings, such
behavior is actually specified: "If the corresponding wide character
is the null wide character, the resulting state described shall be the
initial conversion state." but in the case of UTF-8 where the
mbstate_t object contains a partially-decoded character rather than a
shift state, a subsequent '\0' byte indicates that the previous
partial character is incomplete and thus an illegal sequence.
naturally, applications using their own mbstate_t object should clear
it themselves after an error, but the standard presently provides no
way to clear the builtin mbstate_t object used when the ps argument is
a null pointer. I suspect this issue may be addressed in the future by
specifying that a null source argument resets the state, as this seems
to have been the intent all along.
for what it's worth, this change also slightly reduces code size.
the interface contract for mbtowc admits a much faster implementation
than mbrtowc can achieve; wrapping mbrtowc with an extra call frame
only made the situation worse.
since the regex implementation uses mbtowc already, this change should
improve regex performance too. it may be possible to improve
performance in other places internally by switching from mbrtowc to
mbtowc.
this simple change, in my measurements, makes about a 7% performance
improvement. at first glance this change would seem like a
compiler-specific hack, since the modified code is not even used.
however, I suspect the reason is that I'm eliminating a second path
into the main body of the code, allowing the compiler more flexibility
to optimize the normal (hot) path into the main body. so even if it
weren't for the measurable (and quite notable) difference in
performance, I think the change makes sense.
SA and SB are used as the lowest and highest valid starter bytes, but
the value of SB was one-past the last valid starter. this caused
access past the end of the state table when the illegal byte '\xf5'
was encountered in a starter position. the error did not show up in
full-character decoding tests, since the bogus state read from just
past the table was unlikely to admit any continuation bytes as valid,
but would have shown up had we tested feeding '\xf5' to the
byte-at-a-time decoding in mbrtowc: it would cause the function to
wrongly return -2 rather than -1.
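the corrected range check amounts to something like the following sketch. the names are hypothetical; 0xc2 and 0xf4 are the lowest and highest lead bytes that can begin a valid multibyte UTF-8 sequence:

```c
#define SKETCH_SA 0xc2u   /* lowest valid starter byte */
#define SKETCH_SB 0xf4u   /* highest valid starter byte, inclusive */

/* hypothetical helper: nonzero if b can start a multibyte sequence.
 * the unsigned subtraction wraps for b < SA, so one comparison covers
 * both ends of the range. */
int is_utf8_starter(unsigned char b)
{
	return b - SKETCH_SA <= SKETCH_SB - SKETCH_SA;
}
```

with SB mistakenly set to 0xf5, the byte '\xf5' would pass this test and index one entry past a table sized for 0xc2 through 0xf4.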
I may eventually go back and remove all references to SA and SB,
replacing them with the values; this would make the code more
transparent, I think. the original motivation for using macros was to
allow misguided users of the code to redefine them for the purpose of
enlarging the set of accepted sequences past the end of Unicode...
also include fallback code for broken kernels that don't support the
flags. as usual, the fallback has a race condition that can leak file
descriptors.
this is a bit ugly, and the motivation for supporting it is
questionable. however the main factors were:
1. it will be useful to have this for certain internal purposes
anyway -- things like syslog.
2. applications can just save argv[0] in main, but it's hard to fix
non-portable library code that's depending on being able to get the
invocation name without the main application's help.
GNU used several extensions that were incompatible with C99 and POSIX,
so they used alternate names for the standard functions.
The result is that we need these to run standards-conformant programs
that were linked with glibc.
supports ipv4 and ipv6, but not the "extended" usage where
usage statistics and other info are assigned to ifa_data members
of duplicate entries with AF_PACKET family.
the preprocessor can reliably determine the signedness of wchar_t.
L'\0' is used for 0 in the expressions so that, if the underlying type
of wchar_t is long rather than int, the promoted type of the
expression will match the type of wchar_t.
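a sketch of the technique follows. the macro names are hypothetical, the limit values assume a 32-bit wchar_t, and it relies on the preprocessor evaluating wide character constants with the signedness of wchar_t (as gcc does):

```c
#include <stddef.h>

/* if wchar_t is unsigned, L'\0'-1 wraps to a large positive value */
#if L'\0'-1 > 0
#define SKETCH_WCHAR_IS_UNSIGNED 1
#define SKETCH_WCHAR_MAX (0xffffffffu+L'\0')
#define SKETCH_WCHAR_MIN (0+L'\0')
#else
#define SKETCH_WCHAR_IS_UNSIGNED 0
#define SKETCH_WCHAR_MAX (0x7fffffff+L'\0')
#define SKETCH_WCHAR_MIN (-1-0x7fffffff+L'\0')
#endif

/* runtime cross-check of the same property */
int wchar_is_unsigned_runtime(void)
{
	return (wchar_t)-1 > 0;
}
```

adding L'\0' to each constant pulls the promoted type of wchar_t into the expression, so the macro has a matching type even where wchar_t is long.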
since shadow does not yet support enumeration (getspent), the
corresponding FILE-based get and put versions are also stubbed out for
now. this is partly out of laziness and partly because it's not clear
how they should work in the presence of TCB shadow files. the stubs
should make it possible to compile some software that expects them to
exist, but such software still may not work properly.
negative values of wchar_t need to be treated in the non-ASCII case so
that they can properly generate EILSEQ rather than getting truncated
to 8-bit values and stored in the output.
these changes fix at least two bugs:
- misaligned access to the input as uint32_t for vectorized ASCII test
- incorrect src pointer after stopping on EILSEQ
in addition, the text of the standard makes it unclear whether the
mbstate_t object is to be modified when the destination pointer is
null; previously it was cleared either way; now, it's only cleared
when the destination is non-null. this change may need revisiting, but
it should not affect most applications, since calling mbsrtowcs with
non-zero state can only happen when the head of the string was already
processed with mbrtowc.
finally, these changes shave about 20% size off the function and seem
to improve performance by 1-5%.
this type was removed back in 5243e5f160,
because it was removed from the XSI specs.
however some apps use it.
since it's in the POSIX reserved namespace, we can expose it
unconditionally.
the issue at hand is that many syscalls require as an argument the
kernel-ABI size of sigset_t, intended to allow the kernel to switch to
a larger sigset_t in the future. previously, each arch was defining
this size in syscall_arch.h, which was redundant with the definition
of _NSIG in bits/signal.h. as it's used in some not-quite-portable
application code as well, _NSIG is much more likely to be recognized
and understood immediately by someone reading the code, and it's also
shorter and less cluttered.
note that _NSIG is actually 65/129, not 64/128, but the division takes
care of throwing away the off-by-one part.
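the arithmetic works out as in this small sketch (the macro names are hypothetical stand-ins for the two _NSIG values mentioned above):

```c
/* hypothetical stand-ins: 65 on most archs, 129 where 128 signals exist */
#define SKETCH_NSIG      65
#define SKETCH_NSIG_WIDE 129

/* kernel-ABI sigset size in bytes; integer division truncates, so the
 * extra "plus one" in _NSIG disappears */
enum {
	SIGSET_BYTES      = SKETCH_NSIG/8,       /* 65/8  == 8  */
	SIGSET_BYTES_WIDE = SKETCH_NSIG_WIDE/8   /* 129/8 == 16 */
};
```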
I'm not entirely happy with the amount of ugliness here, but since
F_DUPFD_CLOEXEC is used elsewhere in code that's expected to work on
old kernels (popen), it seems necessary. reportedly even some modern
kernels went back and broke F_DUPFD_CLOEXEC (making it behave like
plain F_DUPFD), so it might be necessary to add some additional fixup
code later to deal with that issue too.
SYS_pipe is not usable directly in general, since mips has a very
broken calling convention for the pipe syscall. instead, just call the
function, so that the mips-specific ugliness is isolated in
mips/pipe.s and not copied elsewhere.
1. as reported by William Haddon, the value returned by snprintf was
wrongly used as a length passed to sendto, despite it possibly
exceeding the buffer length. this could lead to invalid reads and
leaking additional data to syslog.
2. openlog was storing a pointer to the ident string passed by the
caller, rather than copying it. this bug is shared with (and even
documented in) other implementations like glibc, but such behavior
does not seem to meet the requirements of the standard.
3. extremely long ident provided to openlog, or corrupt ident due to
the above issue, could possibly have resulted in buffer overflows.
despite having the potential for smashing the stack, I believe the
impact is low since ident points to a short string literal in typical
application usage (and per the above bug, other usages will break
horribly on other implementations).
4. when used with LOG_NDELAY, openlog was not connecting the
newly-opened socket; sendto was being used instead. this defeated the
main purpose of LOG_NDELAY: preparing for chroot.
5. the default facility was not being used at all, so all messages
without an explicit facility passed to syslog were getting logged at
the kernel facility.
6. setlogmask was not thread-safe; no synchronization was performed
updating the mask. the fix uses atomics rather than locking to avoid
introducing a lock in the fast path for messages whose priority is not
in the mask.
7. in some code paths, the syslog lock was being unlocked twice; this
could result in releasing a lock that was actually held by a different
thread.
some additional enhancements to syslog such as a default identifier
based on argv[0] or similar may still be desired; at this time, only
the above-listed bugs have been fixed.
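as an illustration of item 1 only, here is a sketch of clamping the snprintf result before using it as a length. the helper name format_clamped and the "ident: message" layout are assumptions for the example, not the actual syslog code:

```c
#include <stdio.h>
#include <stddef.h>

/* format into buf and return the number of bytes actually stored there
 * (excluding the terminating null). snprintf returns the length the
 * full output would have had, which can exceed the buffer size. */
static size_t format_clamped(char *buf, size_t size,
                             const char *ident, const char *msg)
{
	int l = snprintf(buf, size, "%s: %s", ident, msg);
	if (l < 0 || size == 0) return 0;
	if ((size_t)l >= size) l = (int)(size - 1);   /* output was truncated */
	return (size_t)l;
}
```

only the clamped value is safe to pass on as the length of the buffer contents.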
it serves no purpose (binaries linked against musl as -lc/libc.so
automatically get the right DT_NEEDED value of libc.so) and causes
ldconfig to misbehave (making a symlink to ld-musl named libc.so in
/lib). ldconfig is not used on pure musl systems, but if ld-musl is
installed on a system where it's not the primary libc, this will
pollute the system /lib with a symlink to musl named libc.so, which
should NOT exist and could cause problems linking native apps. also,
the existence of the soname caused spurious warnings from ldconfig
when /lib and /usr/lib were the same physical directory.
this fix is far from ideal and breaks the rule of not using
arch-specific #ifdefs, but for now we just need a solution to the
existing breakage.
the underlying problem is that the kernel folks made a very stupid
decision to make misalignment of this struct part of the kernel
API/ABI for x86_64, in order to avoid writing a few extra lines of
code to handle both 32- and 64-bit userspace on 64-bit kernels. I had
just added the packed attribute unconditionally thinking it was
harmless on 32-bit archs, but non-x86 32-bit archs have 8-byte
alignment on 64-bit types.
wctype_t was incorrectly "int" rather than "long" on x86_64. not only
is this an ABI incompatibility; it's also a major design flaw if we
ever wanted wctype_t to be implemented as a pointer, which would be
necessary if locales support custom character classes, since int is
too small to store a converted pointer. this commit fixes wctype_t to
be unsigned long on all archs, matching the LSB ABI; this change does
not matter for C code, but for C++ it affects mangling.
the same issue applied to wctrans_t. glibc/LSB defines this type as
const __int32_t *, but since no such definition is visible, I've just
expanded the definition, int, everywhere.
it would be nice if these types (which don't vary by arch) could be in
wctype.h, but the OB XSI requirement in POSIX that wchar.h expose some
types and functions from wctype.h precludes doing so. glibc works
around this with some hideous hacks, but trying to duplicate that
would go against the intent of musl's headers.
lenl-lenr is not a valid expression for a signed int return value from
strverscmp, since after implicit conversion from size_t to int this
difference could have the wrong sign or might even be zero. using the
difference for char values works since they're bounded well within the
range of differences representable by int, but it does not work for
size_t values.
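one way to compare two lengths without subtracting them is sketched below. the helper name cmp_len is hypothetical, and this is not necessarily the form the fix itself took:

```c
#include <stddef.h>

/* three-way comparison of two lengths without subtracting them.
 * (int)(a - b) can come out with the wrong sign, or even as zero for
 * unequal inputs, once the size_t difference is converted to int. */
static int cmp_len(size_t a, size_t b)
{
	return (a > b) - (a < b);
}
```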
1. wrong return value and missing errno for negative suffix len
2. failure to catch suffix len > strlen
3. remove unwanted clearing of input string in invalid case
based on patch contributed by Anthony G. Basile (blueness)
some issues remain with the filename generation algorithm and other
small bugs, but this patch has been sitting around long enough that I
feel it's best to get it committed and then work out any remaining
issues.
patch by Jens Gustedt.
previously, the intended policy was to use __environ in code that must
conform to the ISO C namespace requirements, and environ elsewhere.
this policy was not followed in practice anyway, making things
confusing. on top of that, Jens reported that certain combinations of
link-time optimization options were breaking with the inconsistent
references; this seems to be a compiler or linker bug, but having it
go away is a nice side effect of the changes made here.
based on patch by Isaac Dunham, moved to its own file to avoid
increasing bss on static linked programs not using this nonstandard
function but using the standard getgrent function, and vice versa.
this definitely has the potential to be a bikeshed topic, so some
justification is in order. most of the changes made fit into one of
the following categories:
1. alignment with text in posix, xsh 2.3
2. eliminating overly-specific text for shared error codes
3. making the message match more closely with the macro name
4. removing extraneous words
in particular, the EAGAIN/EWOULDBLOCK text is updated to match the
description of EAGAIN (which covers both uses) rather than saying the
operation would block, and ENOTSUP/EOPNOTSUPP is updated not to
mention sockets.
the distinction between ENFILE/EMFILE has also been clarified; ENFILE
is aligned with the posix text, and EMFILE, which lacks concise posix
text matching any historic message, is updated to emphasize that the
exhausted resource is not open files/open file descriptions, but
rather the integer 'address space' of file descriptors.
some messages may be further tweaked based on feedback.
arm eabi requires this symbol for static C++ dtors.
usually it is provided by libstdc++, but when a C++ program
doesn't use the std lib (free-standing), the libc has to provide
it.
this was encountered while building transmission, which
depends on such a C++ library (libutp).
this function is nearly identical to __cxa_atexit, but it has the
order of arguments swapped for "performance reasons".
see page 25 of
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf
there are other aeabi specific C++ support functions missing, but
it is not clear yet that GCC makes use of them so we omit them for
the moment.
read should never return anything but 0 or sizeof ec here, but if it
does, we want to treat any other return as "success". then the caller
will get back the pid and is responsible for waiting on it when it
immediately exits.
the proposed change was described in detail previously on
the mailing list. in short, vfork is unsafe because:
1. the compiler could make optimizations that cause the child to
clobber the parent's local vars.
2. strace is buggy and allows the vforking parent to run before the
child execs when run under strace.
the new design uses a close-on-exec pipe instead of vfork semantics to
synchronize the parent and child so that the parent does not return
before the child has finished using its arguments (and now, also its
stack). this also allows reporting exec failures to the caller instead
of giving the caller a child that mysteriously exits with status 127
on exec error.
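the synchronization idea can be sketched roughly as follows. this is an illustration only: it uses plain fork rather than whatever mechanism the real code uses to create the child, the helper name spawn_reporting is hypothetical, and EINTR handling is omitted:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* hypothetical helper: start path with argv. returns 0 and stores the
 * child pid on success, or returns the errno from a failed exec. */
static int spawn_reporting(pid_t *res, const char *path, char *const argv[])
{
	int p[2], ec = 0;
	pid_t pid;

	if (pipe(p)) return errno;
	/* close-on-exec: a successful exec closes the write end */
	fcntl(p[0], F_SETFD, FD_CLOEXEC);
	fcntl(p[1], F_SETFD, FD_CLOEXEC);

	pid = fork();
	if (pid < 0) {
		ec = errno;
		close(p[0]);
		close(p[1]);
		return ec;
	}
	if (!pid) {
		close(p[0]);
		execv(path, argv);
		/* exec failed: report errno to the parent, then exit */
		ec = errno;
		if (write(p[1], &ec, sizeof ec) < 0) { /* nothing more to do */ }
		_exit(127);
	}

	close(p[1]);
	/* blocks until the child either execs (EOF, read returns 0)
	 * or reports an error (read returns sizeof ec) */
	if (read(p[0], &ec, sizeof ec) != sizeof ec) ec = 0;
	else waitpid(pid, 0, 0);
	close(p[0]);

	if (!ec) *res = pid;
	return ec;
}
```

with this shape, a nonexistent program yields ENOENT in the parent rather than a child that exits with status 127.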
basic testing has been performed on both the success and failure code
paths. further testing should be done.
also, don't waste code/time on F_GETFL since pipes always have blank
flags initially (at least on old kernels, which are all this fallback
code matters for).
this change shaves ~1k off libc.so bss size, and also avoids hard
errors in the case where the static buffer was not large enough to
hold the result.
this whole framework is really ugly and should probably be replaced or at
least heavily overhauled when some changes/factorizations are made to
getaddrinfo internals in the future.
this bug seems to have been introduced when the map_library signature
was changed to return the mapping in a temp dso structure instead of
into separate variables.
the main goal of these changes is to address the case where an
application provides a stack of size N, but TLS has size M that's a
significant portion of the size N (or even larger than N), thus giving
the application less stack space than it expected or no stack at all!
the new strategy pthread_create now uses is to only put TLS on the
application-provided stack if TLS is smaller than 1/8 of the stack
size or 2k, whichever is smaller. this ensures that the application
always has "close enough" to what it requested, and the threshold is
chosen heuristically to make sure "sane" amounts of TLS still end up
in the application-provided stack.
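the threshold can be written down as a small predicate. the function name tls_fits_on_stack is hypothetical, and the sketch only restates the rule described above rather than reproducing the pthread_create code:

```c
#include <stddef.h>

/* hypothetical predicate for the heuristic described above: TLS goes on
 * the caller-provided stack only if it is smaller than both 1/8 of the
 * stack size and 2k. */
static int tls_fits_on_stack(size_t tls_size, size_t stack_size)
{
	size_t limit = stack_size / 8;
	if (limit > 2048) limit = 2048;
	return tls_size < limit;
}
```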
if TLS does not fit the above criteria, pthread_create uses mmap to
obtain space for TLS, but still uses the application-provided stack
for actual call frame stack. this is to avoid wasting memory, and for
the sake of supporting ugly hacks like garbage collection based on
assumptions that the implementation will use the provided stack range.
in order for the above heuristics to ever succeed, the amount of TLS
space wasted on POSIX TSD (pthread_key_create based) needed to be
reduced. otherwise, these changes would preclude any use of
pthread_create without mmap, which would have serious memory usage and
performance costs for applications trying to create huge numbers of
threads using pre-allocated stack space. the new value of
PTHREAD_KEYS_MAX is the minimum allowed by POSIX, 128. this should
still be plenty more than real-world applications need, especially now
that C11/gcc-style TLS is supported in musl, and most apps and
libraries choose to use that instead of POSIX TSD when available.
at the same time, PTHREAD_STACK_MIN has been decreased. it was
originally set to PAGE_SIZE back when there was no support for TLS or
application-provided stacks, and requests smaller than a whole page
did not make sense. now, there are two good reasons to support
requests smaller than a page: (1) applications could provide
pre-allocated stacks smaller than a page, and (2) with smaller stack
sizes, stack+TLS+TSD can all fit in one page, making it possible for
applications which need huge numbers of threads with minimal stack
needs to allocate exactly one page per thread. the new value of
PTHREAD_STACK_MIN, 2k, is aligned with the minimum size for
sigaltstack.
this should generate faster and smaller code, especially with inline
syscalls. the conditional with cnt is ugly, but thankfully cnt is
always a constant anyway so it gets evaluated at compile time. it may
be preferable to make separate __wake and __wakeall macros without a
count argument.
priv flag is not used yet; private futex support still needs to be
done at some point in the future.
it's not clear to me at the moment whether the code that was removed
(and which is now being re-added) is needed, but it's far from being a
no-op, and i don't want to risk breaking regex in this release.
alternatively, we could define it in sys/socket.h since SO* is
reserved there, and tcp.h includes sys/socket.h in extensions mode.
note that SOL_TCP is simply wrong and it's only here for compatibility
with broken applications. the correct argument to pass for setting TCP
socket options is IPPROTO_TCP, which of course has the same value as
SOL_TCP but works everywhere.
this is a trivial no-op, because dlclose never deletes libraries. thus
we might as well have it in the header in case some application wants
it, since we're already providing it anyway.
based on patch by Pierre Carrier <pierre@gcarrier.fr> that just added
the flag constant, but with minimal additional code so that it
actually works as documented. this is a nonstandard option but some
major software (reportedly, Firefox) uses it and it was easy to add
anyway.
the historical mess of having different definitions for C and C++
comes from the historical C definition as (void *)0 and the fact that
(void *)0 can't be used in C++ because it does not convert to other
pointer types implicitly. however, using plain 0 in C++ exposed bugs
in C++ programs that call variadic functions with NULL as an argument
and (wrongly; this is UB) expect it to arrive as a null pointer. on
64-bit machines, the high bits end up containing junk. glibc dodges
the issue by using a GCC extension __null to define NULL; this is
observably non-conforming because a conforming application could
observe the definition of NULL via stringizing and see that it is
neither an integer constant expression with value zero nor such an
expression cast to void.
switching to 0L eliminates the issue and provides compatibility with
broken applications, since on all musl targets, long and pointers have
the same size, representation, and argument-passing convention. we
could maintain separate C and C++ definitions of NULL (i.e. just use
0L on C++ and use (void *)0 on C) but after careful analysis, it seems
extremely difficult for a C program to even determine whether NULL has
integer or pointer type, much less depend in subtle, unintentional
ways, on whether it does. C89 seems to have no way to make the
distinction. on C99, the fact that (int)(void *)0 is not an integer
constant expression, along with subtle VLA/sizeof semantics, can be
used to make the distinction, but many compilers are non-conforming
and give the wrong result to this test anyway. on C11, _Generic can
trivially make the distinction, but it seems unlikely that code
targeting C11 would be so backwards in caring which definition of
NULL an implementation uses.
as such, the simplest path of using the same definition for NULL in
both C and C++ was chosen. the #undef directive was also removed so
that the compiler can catch and give a warning or error on
redefinition if buggy programs have defined their own versions of
NULL prior to inclusion of standard headers.
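for reference, the portable way to pass a null pointer sentinel to a variadic function is to cast it explicitly, as in the sketch below. count_args is a hypothetical function written for this example. the 0L definition only makes the uncast call happen to work on targets where long and pointers are passed identically:

```c
#include <stdarg.h>

/* count string arguments up to a terminating null pointer.
 * callers must pass the sentinel as a pointer, e.g. (const char *)0;
 * passing a bare integer 0 here is undefined behaviour. */
static int count_args(const char *first, ...)
{
	va_list ap;
	const char *s;
	int n = 0;

	va_start(ap, first);
	for (s = first; s; s = va_arg(ap, const char *)) n++;
	va_end(ap);
	return n;
}
```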
the original FreeSec code accessed keybuf as both uint32* and uint8*
(incorrectly). this was fixed with a union, but it seems the
uint32* access is no longer needed, so the code can be simplified
the internal sha2 hash sum functions had incorrect array size
in the prototype for the message digest argument, fixed by
using pointer so it is not misleading
added various MS_*, MNT_*, UMOUNT_* flags following the linux
headers, with one exception: MS_NOUSER is defined as (1U<<31)
instead of (1<<31) which invokes undefined behaviour
the S_* flags were removed following glibc
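a short sketch of the MS_NOUSER point, using a hypothetical flag name so as not to clash with the real header:

```c
/* hypothetical flag name; mirrors the MS_NOUSER definition above.
 * with 32-bit int, 1<<31 shifts a one into the sign bit, which is
 * undefined behaviour, while 1U<<31 is a well-defined unsigned value. */
#define EXAMPLE_FLAG_HIGH (1U<<31)

static int has_high_flag(unsigned long flags)
{
	return (flags & EXAMPLE_FLAG_HIGH) != 0;
}
```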
based on linux headers add the missing MCAST_* options
under _GNU_SOURCE as they are not in the reserved namespace
(this api was originally specified by RFC 3678)
this is wasteful and useless from a standpoint of sane programs, but
it is required by the standard, and the current requirements were
upheld with the closure of Austin Group issue #639:
http://austingroupbugs.net/view.php?id=639
the common part of erf and erfc was put in a separate function, which
saved some space, and the new code uses unsigned arithmetic
erfcf had a bug: for some inputs in [7.95,8] the result had
more than 60ulp error: in expf(-z*z - 0.5625f) the argument
must be exact but not enough lowbits of z were zeroed,
-SET_FLOAT_WORD(z, ix&0xfffff000);
+SET_FLOAT_WORD(z, ix&0xffffe000);
fixed the issue
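the effect of the new mask can be checked in isolation. the sketch below assumes IEEE 754 single precision and uses a hypothetical helper, truncate_low_bits, in place of the SET_FLOAT_WORD macro:

```c
#include <stdint.h>
#include <string.h>

/* clear the low 13 bits of a float's representation, as the fixed mask
 * 0xffffe000 does, leaving 11 significant bits. for z just below 8,
 * z*z + 0.5625 then needs at most 23 bits and is exact in float. */
static float truncate_low_bits(float x)
{
	uint32_t ix;
	memcpy(&ix, &x, sizeof ix);
	ix &= 0xffffe000;
	memcpy(&x, &ix, sizeof x);
	return x;
}
```

with the old mask 0xfffff000, z keeps 12 significant bits, so for z in this range z*z + 0.5625 can need 25 bits, which no longer fits in a float's 24-bit significand.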
the anonymous struct typedef with array notation breaks with
GCC in C++ mode:
error: non-local function 'static<anonymous struct>
(& boost::signal_handler::jump_buffer())[1]' uses anonymous type
this is a known GCC issue, as search results for that error msg
suggest.
since this is hard to work around in the calling C++ code, a
fix in musl is preferable.
some programs (procps, babl) expect it, and it doesn't seem to
cause any harm to just add it.
it's small and straightforward.
since math.h also defines MAXFLOAT, we undef it in both places,
before defining it.
these flags are needed in order to be able to handle lwp id's
which the kernel returns after clone() calls for new threads
via ptrace(PTRACE_GETEVENTMSG).
fortunately, they're the same for all archs and in the reserved
namespace.
for _Noreturn functions, gcc generates code that trashes the
stack frame, and so it makes it impossible to inspect the causes
of an assert error in gdb.
abort() is not affected (i have not yet investigated why).
both jn and yn functions had integer overflow issues for large
and small n
to handle these issues nm1 (== |n|-1) is used instead of n and -n
in the code and some loops are changed to make sure the iteration
counter does not overflow
(another solution could be to use larger integer type or even double
but that has more size and runtime cost, on x87 loading int64_t or
even uint32_t into an fpu register is more than two times slower than
loading int32_t, and using double for n slows down iteration logic)
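a sketch of how |n|-1 can be computed without overflow. the function name abs_minus_one is hypothetical and the n == 0 case is assumed to be handled separately:

```c
#include <limits.h>

/* returns |n|-1 for n != 0 without ever negating n directly.
 * -n overflows for n == INT_MIN, but -(n+1) does not. */
static int abs_minus_one(int n)
{
	return n < 0 ? -(n + 1) : n - 1;
}
```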
yn(-1,0) now returns inf
posix2008 specifies that on overflow and at +-0 all y0,y1,yn functions
return -inf, this is not consistent with math when n<0 odd integer in yn
(eg. when x->0, yn(-1,x)->inf, but historically yn(-1,0) seems to be
special cased and returned -inf)
some threshold values in jnf and ynf that seem to have been
incorrectly copy-pasted from the double version were fixed
a common code path in j1 and y1 was factored out so the resulting
object code is a bit smaller
unsigned int arithmetic is used for bit manipulation
j1(-inf) now returns 0 instead of -0
an incorrect threshold in the common code of j1f and y1f got fixed
(this caused spurious overflow and underflow exceptions)
the else branches in the pone and pzero functions are fixed
(so code analyzers don't warn about uninitialized values)
a common code path in j0 and y0 was factored out so the resulting
object code is smaller
unsigned int arithmetic is used for bit manipulation
the logic of j0 got a bit simplified (x < 1 case was handled
separately with a bit higher precision than now, but there are large
errors in other domains anyway so that branch has been removed)
some threshold values were adjusted in j0f and y0f
the old definitions were wrong on some archs. actually, EPOLL_NONBLOCK
probably should not even be defined; it is not accepted by the kernel
and it's not clear to me whether it has any use at all, even if it did
work. this issue should be revisited at some point, but I'm leaving it
in place for now in case some applications reference it.
libc is the macro, __libc is the internal symbol, but under some
configurations on old/broken compilers, the symbol might not actually
exist and the libc macro might instead use __libc_loc() to obtain
access to the object.
the previous logic was assuming the kernel would give EINVAL when
passed an invalid address, but instead with MAP_FIXED it was giving
EPERM, as it considered this an attempt to map over kernel memory.
instead of trying to get the kernel to do the right thing, the new
code just handles the error in userspace.
I have also cleaned up the code to use a single mask to check for
invalid low bits and unsupported high bits, so it's simpler and more
clearly correct. the old code was actually wrong for sizeof(long)
smaller than sizeof(off_t) but not equal to 4; now it should be
correct for all possibilities.
for 64-bit systems, the low-bits test is new and extraneous (the
kernel should catch the error anyway when the mmap2 syscall is not
used), but it's cheap anyway. if this is an issue, the OFF_MASK
definition could be tweaked to omit the low bits when SYS_mmap2 is not
defined.
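the single-mask idea can be sketched as below. this is not the actual OFF_MASK definition: the names, the 4096-byte unit, and the long_bits parameter (used here only so that both the 32-bit and 64-bit cases can be shown on one machine) are assumptions made for the example:

```c
#include <stdint.h>

#define EXAMPLE_UNIT_SHIFT 12                      /* assumed 4096-byte unit */
#define EXAMPLE_UNIT (1ULL << EXAMPLE_UNIT_SHIFT)

/* build a mask of offset bits that must all be zero: the low bits below
 * the unit, plus any high bits that would not fit in a long once the
 * offset has been divided by the unit. */
static uint64_t off_mask(unsigned long_bits)
{
	uint64_t high = 0;
	if (long_bits + EXAMPLE_UNIT_SHIFT < 64)
		high = ~0ULL << (long_bits + EXAMPLE_UNIT_SHIFT);
	return high | (EXAMPLE_UNIT - 1);
}

/* an offset is acceptable only if none of the masked bits are set */
static int offset_ok(uint64_t off, unsigned long_bits)
{
	return !(off & off_mask(long_bits));
}
```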
__IS_FP is a portable integer constant expression now
(uses that unsigned long long is larger than float)
the result casting logic should work now on all compilers
supporting typeof
* return type logic is simplified a bit and fixed (see below)
* return type of conj and cproj were wrong on int arguments
* added comments about the pending issues
(usually we don't have comments in public headers but this is
not the biggest issue with tgmath.h)
casting the result to the right type cannot be done in c99
(c11 _Generic can solve this but that is not widely supported),
so the typeof extension of gcc is used, together with the fact that
the ?: operator has special semantics when one of the operands is
a null pointer constant
the standard is very strict about the definition of null
pointer constants so typeof with ?: is still not enough so
compiler specific workaround is used for now
on gcc '!1.0' is a null pointer constant so we can use the old
__IS_FP logic (even though it's non-standard)
on clang (and on gcc as well) 'sizeof(void)-1' is a null
pointer constant so we can use
!(sizeof(*(0?(int*)0:(void*)__IS_FP(x)))-1)
(this is non-standard as well), the old logic is used by
default and this new one on clang
previously 0x1p-1000 and 0x1p1000 were used for raising the inexact
exception like x+tiny (when x is big) or x+huge (when x is small)
the rationale is that these float consts are large enough
(0x1p-120 + 1 raises inexact even on ld128 which has 113 mant bits)
and float consts may be smaller or easier to load on some platforms
(on i386 this reduced the object file size by 4bytes in some cases)
this is not a full rewrite just fixes to the special case logic:
+-0 and non-integer x<INT_MIN inputs incorrectly raised invalid
exception and for +-0 the return value was wrong
so integer test and odd/even test for negative inputs are changed
and a useless overflow test was removed
comments are kept in the double version of the function
compared to fdlibm/freebsd we partition the domain into one
more part and select different threshold points:
now the [log(5/3)/2,log(3)/2] and [log(3)/2,inf] domains
should have <1.5ulp error
(so only the last bit may be wrong, assuming good exp, expm1)
(note that log(3)/2 and log(5/3)/2 are the points where tanh
changes resolution: tanh(log(3)/2)=0.5, tanh(log(5/3)/2)=0.25)
for some x < log(5/3)/2 (~=0.2554) the error can be >1.5ulp
but it should be <2ulp
(the freebsd code had some >2ulp errors in [0.255,1])
even with the extra logic the new code produces smaller
object files
changed the algorithm: large input is not special cased
(when exp(-x) is small compared to exp(x))
and the threshold values are reevaluated
(fdlibm code had a log(2)/2 cutoff for which i could not find
justification, log(2) seems to be a better threshold and this
was verified empirically)
the new code is simpler, makes smaller binaries and should be
faster for common cases
the old comments were removed as they are no longer true for the
new algorithm and the fdlibm copyright was dropped as well
because there is no common code or idea with the original anymore
except for trivial ones.
with naive exp2l(x*log2e) the last 12 bits of the result were incorrect
for x with large absolute value
hi + lo = x*log2e is now calculated to 128 bits precision and then
expl(x) = exp2l(hi) + exp2l(hi) * f2xm1(lo)
this gives <1.5ulp measured error everywhere in nearest rounding mode
in tgmath.h the return values are cast to the appropriate
floating-point type (if the compiler supports gcc __typeof__),
this is wrong in case of ilogb, lrint, llrint, lround, llround
which do not need such cast
uses the lanczos approximation method with the usual tweaks.
same parameters were selected as in boost and python.
(avoids some extra work and special casing found in boost
so the precision is not that good: measured error is <5ulp for
positive x and <10ulp for negative)
an alternative lgamma_r implementation is also given in the same
file which is simpler and smaller than the current one, but less
precise so it's ifdefed out for now.
modifications:
* avoid unsigned->signed conversions
* removed various volatile hacks
* use FORCE_EVAL when evaluating only for side-effects
* factor out R() rational approximation instead of manual inline
* __invtrigl.h now only provides __invtrigl_R, __pio2_hi and __pio2_lo
* use 2*pio2_hi, 2*pio2_lo instead of pi_hi, pi_lo
otherwise the logic is not changed, long double versions will
need a revisit when a general long double cleanup happens
modifications:
* avoid unsigned->signed integer conversion
* do not handle special cases when they work correctly anyway
* more strict threshold values (0x1p26 instead of 0x1p28 etc)
* smaller code, cleaner branching logic
* same precision as the old code:
acosh(x) has up to 2ulp error in [1,1.125]
asinh(x) has up to 1.6ulp error in [0.125,0.5], [-0.5,-0.125]
atanh(x) has up to 1.7ulp error in [0.125,0.5], [-0.5,-0.125]
j0l,j1l,jnl,y0l,y1l,ynl are gnu extensions, bsd and posix do not
have them.
no one seems to use them and there is no plan to implement them any
time soon so we shouldn't declare them in math.h.
despite glibc using __key and __seq rather than key and seq, some
applications, notably busybox, assume the names are key and seq unless
glibc is being used. and the names key and seq are really the ones
that _should_ be exposed when not attempting to present a
standards-conforming namespace; apps should not be using names that
begin with double-underscore. thus, the optimal fix is to use key and
seq as the actual names of the members when in bsd/gnu source profile,
and define macros for __key and __seq that redirect to plain key and
seq.
traditionally, both BSD and GNU systems have it this way.
sys/syscall.h is purely syscall number macros. presently glibc exposes
the syscall declaration in unistd.h only with _GNU_SOURCE, but that
does not reflect historical practice.
a while back, gcc switched from using the old _init/_fini fragments
method for calling ctors and dtors on arm to the __init_array and
__fini_array method. unfortunately, on glibc this depends on ugly
hacks involving making libc.so a linker script and pulling parts of
libc into the main program binary. so I cheat a little bit, and just
write asm to iterate over the init/fini arrays from the _init/_fini
asm. the same approach could be used on any arch it's needed on, but
for now arm is the only one.
this change fixes an obscure issue with some nonstandard kernels,
where the initial brk syscall returns a pointer just past the end of
bss rather than the beginning of a new page. in that case, the dynamic
linker has already reclaimed the space between the end of bss and the
page end for use by malloc, and memory corruption (allocating the same
memory twice) will occur when malloc again claims it on the first call
to brk.
if there's evidence of any use for it, we can add it back later. as
far as I can tell, glibc has it only for internal use (and musl uses a
direct syscall in that case rather than a function call), not for
exposing it to applications.
in case of mmap-obtained chunks, end points past the end of the
mapping and reading it may fault. since the value is not needed until
after the conditional, move the access to prevent invalid reads.
they were accidentally exposed under just baseline POSIX, which is a
big namespace pollution issue. thankfully glibc only exposes them
under _GNU_SOURCE, not under any of its other options, so omitting
the pollution in the default _BSD_SOURCE profile does not hurt
application compatibility at all.
previously the names were exposed as key/seq with _GNU_SOURCE and
__ipc_perm_key/__ipc_perm_seq otherwise, whereas glibc always uses
__key and __seq for the names. thus, the old behavior never matched
glibc, and the new behavior always does, regardless of feature test
macros.
for now, i'm leaving the renaming here in sys/ipc.h where it's easy to
change globally for all archs, in case something turns out to be
wrong, but eventually the names could just be incorporated directly
into the bits headers for each arch and the renaming removed.
the issue is identical to the recent commit fixing the mips versions:
despite other implementations doing this, it conflicts with the
requirements of ISO C and it's a waste of time and code size.
previously, everything was going through an intermediate conversion to
long double, which caused the extern __fpclassifyl function to get
invoked, preventing virtually all optimizations of these operations.
with the new code, tests on constant float or double arguments compile
to a constant 0 or 1, and tests on non-constant expressions are
efficient. I may later add support for __builtin versions on compilers
that support them.
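the size-based dispatch can be sketched like this. the names float_bits, double_bits and example_isnan are hypothetical, IEEE 754 float and double are assumed, and long double is left out of the sketch:

```c
#include <stdint.h>
#include <string.h>

/* reinterpret a float as its 32-bit representation */
static inline uint32_t float_bits(float f)
{
	uint32_t u;
	memcpy(&u, &f, sizeof u);
	return u;
}

/* reinterpret a double as its 64-bit representation */
static inline uint64_t double_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof u);
	return u;
}

/* hypothetical macro: select the test by operand size instead of
 * converting everything to long double first. a value is a NaN when
 * its exponent is all ones and its mantissa is nonzero. */
#define example_isnan(x) ( \
	sizeof(x) == sizeof(float) \
		? (float_bits(x) & 0x7fffffffU) > 0x7f800000U \
		: (double_bits(x) & (UINT64_MAX >> 1)) > (0x7ffULL << 52) )
```

since sizeof(x) is a constant, the compiler can discard the branch that does not apply and reduce the test to a few integer operations.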
nothing in the standard requires or even allows the fenv state to be
restored by longjmp. restoring the exception flags is not such a big
deal since it's probably valid to clobber them completely, but
restoring the rounding mode yields an observable side effect not
sanctioned by ISO C. saving/restoring it also wastes a few cycles and
16 bytes of code.
as for historical behavior, reportedly SGI IRIX did save/restore fenv,
and this is where glibc and uClibc got the behavior from. a few other
systems save/restore it too (on archs other than mips), even though
this is apparently wrong. further details are documented here:
http://www-personal.umich.edu/~williams/archive/computation/setjmp-fpmode.html
as musl aims for standards conformance rather than coddling historical
programs expecting non-conforming behavior, and as it's unlikely that
any historical programs actually depend on the incorrect behavior
(such programs would break on other archs, anyway), I'm making the
change not to save/restore fenv on mips.
due to some historical oddity, these are considered libc headers
rather than kernel headers. the kernel used to provide them too, but
it seems modern kernels do not install them, so let's just do the
easiest thing and provide them. stripped-down versions provided by
John Spencer.
apparently recent gcc versions have intentionally broken the
traditional definition by treating it as a non-constant expression.
the traditional definition may also be problematic for c++ programs.
for some reason I have not been able to determine, gcc 3.2 rejects the
array notation. this seems to be a gcc bug, but since it's easy to
work around, let's do the workaround and avoid gratuitously requiring
newer compilers.
previously, a few BSD features were enabled only by _BSD_SOURCE, not
by _GNU_SOURCE. since _BSD_SOURCE is default in the absence of other
feature test macros, this made adding _GNU_SOURCE to a project not a
purely additive feature test macro; it actually caused some features
to be suppressed.
most of the changes made by this patch actually bring musl in closer
alignment with the glibc behavior for _GNU_SOURCE. the only exceptions
are the added visibility of functions like strlcpy which were BSD-only
due to being disliked/rejected by glibc maintainers. here, I feel the
consistency of having _GNU_SOURCE mean "everything", and especially
the property of it being purely additive, are more valuable than
hiding functions which glibc does not have.
most importantly, the format/scan macros for the [u]int_fast16_t and
[u]int_fast32_t types were defined incorrectly assuming these types
would match the native word/pointer size. this is incorrect on any
64-bit system; the "fast" types for 16- and 32-bit integers are simply
int.
another issue which was "only a warning" (despite being UB) is that
the choice of "l" versus "ll" was incorrect for 64-bit types on 64-bit
machines. while it would "work" to always use "long long" for 64-bit
types, we use "long" on 64-bit machines to match what glibc does and
what the ABI documents recommend. the macro definitions were probably
right in very old versions of musl, but became wrong when we aligned
more closely with the 'standard' ABI. checking UINTPTR_MAX is an easy
way to get the system wordsize without pulling in new headers.
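a sketch of the UINTPTR_MAX check, with hypothetical macro names and assuming an ILP32 or LP64 target where 64-bit types are "long" exactly when pointers are 64 bits wide:

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* hypothetical macro names. on 64-bit targets the 64-bit types are
 * "long", so the length modifier is "l"; elsewhere it is "ll".
 * UINTPTR_MAX gives the word size without any extra headers. */
#if UINTPTR_MAX == UINT64_MAX
#define EXAMPLE_PRI64 "l"
#else
#define EXAMPLE_PRI64 "ll"
#endif

#define EXAMPLE_PRId64 EXAMPLE_PRI64 "d"

/* format a 64-bit value using the macro above */
static int format_i64(char *buf, size_t n, int64_t v)
{
	return snprintf(buf, n, "%" EXAMPLE_PRId64, v);
}
```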
finally, the useless __PRIPTR macro to allow the underlying type of
[u]intptr_t to vary has been removed. we are using "long" on all
targets, and thankfully this matches what glibc does, so I do not
envision ever needing to change it. thus, the "l" has just been
incorporated directly in the strings.
previously, shared library constructors were being called before
important internal things like the environment (extern char **environ)
and hwcap flags (needed for sjlj to work right with float on arm) were
initialized in __libc_start_main. rather than trying to have the
dynamic linker make sure this stuff all gets initialized right, I've
opted to just defer calling shared library constructors until after
the main program's entry point is reached. this also fixes the order
of ctors to be the exact reverse of dtors, which is a desirable
property and possibly even mandated by some languages.
the main practical effect of this change is that shared libraries
calling getenv from ctors will no longer fail.
the missing check did not affect the default profile, since it has
both _XOPEN_SOURCE and _BSD_SOURCE defined, but it did break programs
which explicitly define _BSD_SOURCE, causing it to be the only feature
test macro present.
these structures are purely for use by trace/debug tools and tools
working with core files. the definition of fpregset_t, which was
previously here, has been removed because it was wrong; fpregset_t
should be the type used in mcontext_t, not the type used in
ptrace/core stuff.
aside from microblaze, these should be roughly correct for all archs
now. some misc junk macros and typedefs are missing, which should
probably be added for max compatibility with trace/debug tools.
it should now really match the kernel. some of the removed padding
corresponded to the difference between user and kernel sigset_t. the
space at the end was redundant with the uc_mcontext member and seems
to have been added as a result of misunderstanding glibc's definition
versus the kernel's.
with these changes, the members/types of mcontext_t and related stuff
should closely match the glibc definitions. unlike glibc, however, the
definitions here avoid using typedefs as much as possible and work
directly with the underlying types, to minimize namespace pollution
from signal.h in the default (_BSD_SOURCE) profile.
this is a first step in improving compatibility with applications
which poke at context/register information -- mainly debuggers, trace
utilities, etc. additional definitions in ucontext.h and other headers
may be needed later.
if feature test macros are used to request a conforming namespace,
mcontext_t is replaced with an opaque structure of the equivalent size
and alignment; conforming programs cannot examine its contents anyway.
unlike the previous definition, NSIG/_NSIG is supposed to be one more
than the highest signal number. adding this will allow simplifying
libc-internal code that makes signal-related syscalls, which can be
done as a later step. some apps might use it too; while this usage is
questionable, it's at least not insane.
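a minimal sketch of the kind of application usage this enables, assuming
NSIG is one more than the highest signal number. the fallback value 65 and
the names seen and note_signal are assumptions for this example only.

```c
#include <signal.h>

#ifndef NSIG
#define NSIG 65   /* assumption, used only if the header hides NSIG */
#endif

/* one slot per signal number; index 0 is unused */
static unsigned long seen[NSIG];

/* count one occurrence of sig; returns -1 for numbers outside 1..NSIG-1 */
int note_signal(int sig)
{
	if (sig < 1 || sig >= NSIG) return -1;
	seen[sig]++;
	return 0;
}
```

because NSIG is "highest + 1", it works directly as an array size and as an
exclusive upper bound.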
also handle the non-GNUC case where alignment attribute is not available
by simply omitting it. this will not cause problems except for
inclusion of mcontext_t/ucontext_t in application-defined structures,
since the natural alignment of the uc_mcontext member relative to the
start of ucontext_t is already correct. and shame on whoever designed
this for making it impossible to satisfy the ABI requirements without
GNUC extensions.
it's essential to decrement the stack pointer before writing to new
stack space, rather than afterwards. otherwise there is a race
condition during which asynchronous code (signals) could clobber the
data being stored.
it may be possible to optimize the code further using stwu, but I
wanted to avoid making any changes to the actual stack layout in this
commit. further improvements can be made separately if desired.
apparently some other archs have sys/io.h and should not break just
because they don't have the x86 port io functions. provide a blank
bits/io.h everywhere for now.
based on proposal by Isaac Dunham. nonexistence of bits/io.h will
cause inclusion of sys/io.h to produce an error on archs that are not
supposed to have it. this is probably the desired behavior, but the
error message may be a bit unusual.
put some macros that do not differ between architectures in the
main header and remove from bits.
restructure mips header so it has the same structure as the others.
similar to exp.c cleanup: use scalbnf, don't return excess precision,
drop some optimizations.
exp.c was changed to be more consistent with expf.c code.
fortunately the memory corruption could not hurt anything, but it
prevented clearing the final newline and thus prevented the last path
element from working.
priority inheritance is not yet supported, and priority protection
probably will not be supported ever unless there's serious demand for
it (it's a fairly heavy-weight feature).
per-thread cpu clocks would be nice to have, but to my knowledge linux
is still not capable of supporting them. glibc fakes them by using the
_process_ cpu-time clock and subtracting the thread creation time,
which gives seriously incorrect semantics (worse than not supporting
the feature at all), so until there's a way to do it right, it will
remain as a stub that always fails.
overflow and underflow were incorrect when the result was not stored.
an optimization for the 0.5*ln2 < |x| < 1.5*ln2 domain was removed.
did various cleanups around static constants and made the comments
consistent with the code.
incomplete but at least partly working. requires all files to be
compiled in the new "secure" plt model, not the old one that put plt
code in the data segment. TLS is untested but may work. invoking the
dynamic linker explicitly to load a program does not yet handle argv
correctly.
although a number is reserved for it, this option is not implemented
on Linux and does not work. defining it causes some applications to
use it, and subsequently break due to its failure.
the volatile hack in STRICT_ASSIGN is only needed if
assignment is not respected and excess precision is kept.
gcc -fexcess-precision=standard and -ffloat-store both
respect assignment and musl uses these flags by default.
i kept the macro for now so the workaround may be used
for bad compilers in the future.
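for reference, the volatile workaround has roughly the following shape.
this is a sketch, not the macro from the tree: the name
STRICT_ASSIGN_SKETCH and the scale function are made up for illustration.

```c
/* force the value through a volatile object of the target type so that
 * any excess precision is dropped before it reaches lval */
#define STRICT_ASSIGN_SKETCH(type, lval, rval) do { \
	volatile type tmp_ = (rval);                \
	(lval) = tmp_;                              \
} while (0)

float scale(float x, float y)
{
	float r;
	/* the store to tmp_ rounds x*y to float even if the compiler
	 * evaluated the product in a wider format */
	STRICT_ASSIGN_SKETCH(float, r, x * y);
	return r;
}
```

with a compiler that respects assignment, the volatile store is redundant
and only costs a memory round trip.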
old code was correct only if the result was stored (without the
excess precision) or musl was compiled with -ffloat-store.
now we use STRICT_ASSIGN to work around the issue.
(see note 160 in c11 section 6.8.6.4)
old code was correct only if the result was stored (without the
excess precision) or musl was compiled with -ffloat-store.
(see note 160 in n1570.pdf section 6.8.6.4)
POSIX includes mostly-useless attribute-get functions for each
attribute-set function, presumably out of some object-oriented
dogmatism. the get functions are not useful with the simple idiomatic
usage of attributes. there are of course possible valid uses of them
(like writing wrappers for pthread init functions that perform special
actions on the presence of certain attributes), but considering how
tiny these functions are anyway, little is lost by putting them all in
one file, and some build-time cost and archive-file-size benefits are
achieved.
linux's sched_* syscalls actually implement the TPS (thread
scheduling) functionality, not the PS (process scheduling)
functionality which the sched_* functions are supposed to have.
omitting support for the PS option (and having the sched_* interfaces
fail with ENOSYS rather than omitting them, since some broken software
assumes they exist) seems to be the only conforming way to do this on
linux.
this function does not obey the normal calling convention; like a
syscall instruction, it's expected not to clobber any registers except
the return value. clobbering edx could break callers that were reusing
the value cached in edx after the syscall returns.
per interpretation for austin group issue #626, fflush(0) and exit()
must block waiting for a lock if another thread has locked a memory
stream with flockfile. this adds some otherwise-unnecessary
synchronization cost to use of memory streams, but there was already a
synchronization cost calling malloc anyway.
previously the stream was only added to the open file list in
single-threaded programs, so that upon subsequent call to
pthread_create, locking could be turned on for the stream.
this change was originally intended just to avoid repeated attempts to
open a nonexistent /etc/ld-musl-$(ARCH).path file, but I realized it
also prevents the default paths from being searched when such a path
file exists. despite the potential to break existing usage, I believe
the new behavior is the right behavior, and it's better to fix it
sooner rather than later. with the old behavior, it was impossible to
inhibit search of default paths which might contain musl-incompatible
libs (or even libs from a different cpu arch, on multi-arch machines).
previously, empty string was treated as "use default". this is
apparently not compatible with standard configure semantics where an
empty prefix puts everything under /. the new logic should be a lot
cleaner and not suffer from such issues.
this mirrors the stdio_impl.h cleanup. one header which is not
strictly needed, errno.h, is left in pthread_impl.h: since pthread
functions return their error codes rather than using errno, nearly
every single pthread function needs the errno constants.
in a few places, rather than bringing in string.h to use memset, the
memset was replaced by direct assignment. this seems to generate much
better code anyway, and makes many functions which were previously
non-leaf functions into leaf functions (possibly eliminating a great
deal of bloat on some platforms where non-leaf functions require ugly
prologue and/or epilogue).
this header evolved to facilitate the extremely lazy practice of
omitting explicit includes of the necessary headers in individual
stdio source files; not only was this sloppy, but it also increased
build time.
now, stdio_impl.h is only including the headers it needs for its own
use; any further headers needed by source files are included directly
where needed.
checking for EINVAL should be sufficient, but qemu user emulation
returns EPROTONOSUPPORT in some of the failure cases, and it seems
conceivable that other kernels doing linux-emulation could make the
same mistake. since DNS lookups and other important code might break
if the fallback does not get invoked, be extra careful and check for
either error.
note that it's important NOT to perform the fallback code on other
errors such as resource-exhaustion cases, since the fallback is not
atomic and will lead to file-descriptor leaks in multi-threaded
programs that use exec. the fallback code is only "safe" to run when
the initial failure is caused by the application's choice of
arguments, not the system state.
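a minimal sketch of the error check described above, written as a wrapper
around the public socket() call rather than as the libc-internal code. the
name socket_cloexec is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

int socket_cloexec(int domain, int type, int protocol)
{
	int fd = socket(domain, type | SOCK_CLOEXEC, protocol);
	if (fd >= 0) return fd;

	/* retry only when the failure is about the arguments themselves;
	 * resource exhaustion and similar errors are returned as-is */
	if (errno != EINVAL && errno != EPROTONOSUPPORT) return -1;

	fd = socket(domain, type, protocol);
	if (fd < 0) return -1;

	/* not atomic: a concurrent fork+exec in another thread can still
	 * inherit fd before this call completes */
	fcntl(fd, F_SETFD, FD_CLOEXEC);
	return fd;
}
```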
some of these were coming from stdio functions locking files without
unlocking them. I believe it's useful for this to throw a warning, so
I added a new macro that's self-documenting that the file will never
be unlocked to avoid the warning in the few places where it's wrong.
patches by Alex Caudill (npx). the dynamic-linked version is almost
identical to the final submitted patch; I just added a couple missing
lines for saving the phdr address when the dynamic linker is invoked
directly to run a program, and removed a couple to avoid introducing
another unnecessary type. the static-linked version is based on npx's
draft. it could use some improvements which are contingent on the
startup code saving some additional information for later use.
ideally, system would also be cancellable while running the external
command, but I cannot find any way to make that work without either
leaking zombie processes or introducing behavior that is far outside
what the standard specifies. glibc handles cancellation by killing the
child process with SIGKILL, but this could be unsafe in that it could
leave the data being manipulated by the command in an inconsistent
state.
for conformance, two functions should not have the same address. a
conforming program could use the addresses of getc and fgetc in ways
that assume they are distinct. normally i would just use a wrapper,
but these functions are so small and performance-critical that an
extra layer of function call could make the one that's a wrapper
nearly twice as slow, so I'm just duplicating the code instead.
-lpcc only works if -nostdlib is not passed, so it's useless. instead,
use -print-file-name to look up the full pathname for libpcc.a, and
check whether that succeeds before trying to link with the result.
also, silence pcc's junk printed on stdout during tests.
in old versions of pcc, the directory containing libpcc.a was not in
the library path, and other options like -print-file-name may have
been needed to locate it. however, -print-file-name itself seems to
have been added around the same time that the directory was added to
the search path, and moreover, I see no evidence that older versions
of pcc are capable of building a working musl shared library. thus, it
seems reasonable to just test whether -lpcc is accepted.
on x86 and some other archs, functions which make function calls which
might go through a PLT incur a significant overhead cost loading the
GOT register prior to making the call. this load is utterly useless in
musl, since all calls are bound at library-creation time using
-Bsymbolic-functions, but the compiler has no way of knowing this, and
attempts to set the default visibility to protected have failed due to
bugs in GCC and binutils.
this commit simply manually assigns hidden/protected visibility, as
appropriate, to a few internal-use-only functions which have many
callers, or which have callers that are hot paths like getc/putc. it
shaves about 5k off the i386 libc.so with -Os. many of the
improvements are in syscall wrappers, where the benefit is just size
and performance improvement is unmeasurable noise amid the syscall
overhead. however, stdio may be measurably faster.
if in the future there are toolchains that can do the same thing
globally without introducing linking bugs, it might be worth
considering removing these workarounds.
pcc wrongly passes any option beginning with -m to the linker, and
will break at link time if these options were added to CFLAGS. testing
linking lets us catch this at configure time and skip them.
these functions must behave as if they obtain the lock via flockfile
to satisfy POSIX requirements. since another thread can provably hold
the lock when they are called, they must wait to obtain the lock
before they can return, even if the correct return value could be
obtained without locking. in the case of fclose and freopen, failure
to do so could cause correct (albeit obscure) programs to crash or
otherwise misbehave; in the case of feof, ferror, and fwide, failure
to obtain the lock could sometimes return incorrect results. in any
case, having these functions proceed and return while another thread
held the lock was wrong.
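the required behavior can be pictured with the public locking interfaces.
this is only a sketch: feof_sketch is a made-up name, feof_unlocked is a
common extension rather than a standard function, and the real
implementation uses internal locking primitives.

```c
#include <stdio.h>

int feof_sketch(FILE *f)
{
	flockfile(f);               /* blocks while another thread holds f */
	int r = feof_unlocked(f);   /* read the flag only once we own the lock */
	funlockfile(f);
	return r;
}
```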
1. don't open /dev/null just as a basis to copy flags; use shared
__fmodeflags function to get the right file flags for the mode.
2. handle the (probably invalid, but whatever) case where the
original stream's file descriptor was closed; previously, the logic
re-closed it.
3. accept the "e" mode flag for close-on-exec; update dup3 to fall back
to using dup2 so we can simply call __dup3 instead of putting fallback
logic in freopen itself.
the behavior of putenv is left undefined if the argument does not
contain an equal sign, but traditional implementations behave this way
and gnulib replaces putenv if it doesn't do this.
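in terms of public interfaces, the traditional behavior amounts to roughly
the following. putenv_sketch is a hypothetical wrapper for illustration,
not the libc code.

```c
#include <stdlib.h>
#include <string.h>

int putenv_sketch(char *s)
{
	/* no '=' in the argument: treat it as a request to remove the name */
	if (!strchr(s, '=')) return unsetenv(s);
	return putenv(s);
}
```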
__release_ptc() is only valid in the parent; if it's performed in the
child, the lock will be unlocked early then double-unlocked later,
corrupting the lock state.
since we target systems without overcommit, special care should be
taken that system() and popen(), like posix_spawn(), do not fail in
processes whose commit charges are too high to allow ordinary forking.
this in turn requires special precautions to ensure that the parent
process's signal handlers do not end up running in the shared-memory
child, where they could corrupt the state of the parent process.
popen has also been updated to use pipe2, so it does not have a
fd-leak race in multi-threaded programs. since pipe2 is missing on
older kernels, (non-atomic) emulation has been added.
some silly bugs in the old code should be gone too.
despite documentation that makes it sound a lot different, the only
ABI-constraint difference between TLS variants II and I seems to be
that variant II stores the initial TLS segment immediately below the
thread pointer (i.e. the thread pointer points to the end of it) and
variant I stores the initial TLS segment above the thread pointer,
requiring the thread descriptor to be stored below. the actual value
stored in the thread pointer register also tends to have per-arch
random offsets applied to it for silly micro-optimization purposes.
with these changes applied, TLS should be basically working on all
supported archs except microblaze. I'm still working on getting the
necessary information and a working toolchain that can build TLS
binaries for microblaze, but in theory, static-linked programs with
TLS and dynamic-linked programs where only the main executable uses
TLS should already work on microblaze.
alignment constraints have not yet been heavily tested, so it's
possible that this code does not always align TLS segments correctly
on archs that need TLS variant I.
usage of vfork creates a situation where a process of lower privilege
may momentarily have write access to the memory of a process of higher
privilege.
consider the case of a multi-threaded suid program which is calling
posix_spawn in one thread while another thread drops the elevated
privileges then runs untrusted (relative to the elevated privilege)
code as the original invoking user. this untrusted code can then
potentially modify the data the child process will use before calling
exec, for example changing the pathname or arguments that will be
passed to exec.
note that if vfork is implemented as fork, the lock will not be held
until the child execs, but since memory is not shared it does not
matter.
with this change, pcc-built musl libc.so seems to work correctly. the
problem is that pcc generates GOT lookups for external-linkage symbols
even if they are hidden, rather than using GOT-relative addressing.
the entire reason we're using hidden visibility on the __libc object
is to make it accessible prior to relocations -- not to mention
inexpensive to access. unfortunately, the workaround makes it even
more expensive on pcc.
when the pcc issue is fixed, an appropriate version test should be
added so new pcc can use the much more efficient variant.
this is actually a rather subtle issue: do arrays decay to pointers
when used as inline asm args? gcc says yes, but currently pcc says no.
hopefully this discrepancy in pcc will be fixed, but since the
behavior is not clearly defined anywhere I can find, I'm using an
explicit operation to cause the decay to occur.
this makes it so the #undef libc and __libc name are no longer needed,
which were problematic because the "accessor function" mode for
accessing the libc struct could not be used, breaking build on any
compiler without (working) visibility.
this is necessary because posix_spawn calls sigaction after vfork, and
if the thread pointer is not already initialized, initializing it in
the child corrupts the parent process's state.
this doubles the performance of the fastest syscalls on the atom I
tested it on; improvement is reportedly much more dramatic on
worst-case cpus. cannot be used for cancellable syscalls.
the code in __libc_start_main is now responsible for parsing auxv,
rather than duplicating the parsing all over the place. this should
shave off a few cycles and some code size. __init_libc is left as an
external-linkage function despite the fact that it could be static, to
prevent it from being inlined and permanently wasting stack space when
main is called.
a few other minor changes are included, like eliminating per-thread
ssp canaries (they were likely broken when combined with certain
dlopen usages, and completely unnecessary) and some other unnecessary
checks. since this code gets linked into every program, it should be
as small and simple as possible.
at initial program load, all libraries must be loaded before the
thread pointer can be set up, since the TP-relative addresses of all
initial TLS objects must be constant.
this is needed to ensure async-cancel-safety, i.e. to make it safe to
access TLS objects when async cancellation is enabled. otherwise, if
cancellation were acted upon after the atomic fetch/add but before the
thread saved the obtained memory, another access to the same TLS in
the cancellation handler could end up performing the atomic fetch/add
again, consuming more memory than is actually available and
overflowing into other objects on the heap.
symbol value of 0 is not "undefined" for TLS; it's the address of the
first symbol in the TLS segment. however, non-definition TLS
references also have values of 0, so check the section.
hopefully the new logic is more clear, too.
compute offsets from the thread pointer statically when loading the
library, rather than repeating the logic on each thread creation. not
only is the latter less efficient at runtime; it also fails to provide
solid guarantees that the offsets will remain the same when the
initial alignment of memory is different. the new alignment handling
is both more rigorous and simpler.
the old code was also clobbering TLS bss with random image data in
some cases due to using tls_size (size of TLS segment) instead of
tls_len (length of the TLS data image).
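the distinction between the two sizes can be sketched as follows. the
function and parameter names are illustrative only: len stands for the
initialized image length and size for the whole segment.

```c
#include <stddef.h>
#include <string.h>

/* initialize one thread's copy of a TLS segment */
void tls_block_init(unsigned char *block, const unsigned char *image,
                    size_t len, size_t size)
{
	memcpy(block, image, len);            /* initialized data: copy the image */
	memset(block + len, 0, size - len);   /* TLS bss: must read as zero */
}
```

copying size bytes instead of len would read past the end of the image and
leave whatever follows it in the bss portion.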
some libraries call dlopen from their constructors, resulting in
recursive calls to dlopen. previously, this resulted in deadlock. I'm
now unlocking the dlopen lock before running constructors (this is
especially important since the lock also blocked pthread_create and
was being held while application code runs!) and using a separate
recursive mutex protecting the ctor/dtor state instead.
in order to prevent the same ctor from being called more than once, a
module is considered "constructed" just before the ctor runs.
also, switch from using atexit to register each dtor to using a single
atexit call to register the dynamic linker's dtor processing as just
one handler. this is necessary because atexit performs allocation and
may fail, but the library has already been loaded and cannot be
backed-out at the time dtor registration is performed. this change
also ensures that all dtors run after all atexit functions, rather
than in mixed order.
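the "constructed just before the ctor runs" rule can be sketched like this.
struct module, run_ctors and the demo_* names are hypothetical, and the
recursive mutex protecting the state is omitted to keep the example short.

```c
#include <stddef.h>

struct module {
	struct module *next;
	void (*ctor)(void);
	int constructed;
};

void run_ctors(struct module *head)
{
	for (struct module *m = head; m; m = m->next) {
		if (m->constructed) continue;
		m->constructed = 1;   /* mark first, so re-entry skips m */
		if (m->ctor) m->ctor();
	}
}

/* demo: a ctor that re-enters run_ctors, as a dlopen from a ctor would */
static struct module demo_mod;
static int demo_calls;

static void demo_ctor(void)
{
	demo_calls++;
	run_ctors(&demo_mod);
}

int demo(void)
{
	demo_mod.ctor = demo_ctor;
	run_ctors(&demo_mod);
	run_ctors(&demo_mod);
	return demo_calls;   /* number of times the ctor actually ran */
}
```

if the flag were set after the call instead, the nested run_ctors would
invoke the same ctor again and recurse without bound.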
unlike other implementations, this one reserves memory for new TLS in
all pre-existing threads at dlopen-time, and dlopen will fail with no
resources consumed and no new libraries loaded if memory is not
available. memory is not immediately distributed to running threads;
that would be too complex and too costly. instead, assurances are made
that threads needing the new TLS can obtain it in an async-signal-safe
way from a buffer belonging to the dynamic linker/new module (via
atomic fetch-and-add based allocator).
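a fetch-and-add allocator of this kind can be sketched with C11 atomics.
the buffer, its size, and the names reserve and reserve_take are
assumptions for the example; alignment handling is left out, and in the
design described above the reservation is sized so that requests cannot
fail.

```c
#include <stdatomic.h>
#include <stddef.h>

/* stands in for memory reserved at dlopen time */
static unsigned char reserve[1024];
static atomic_size_t reserve_used;

void *reserve_take(size_t n)
{
	/* a single atomic step claims [off, off+n); no lock is taken */
	size_t off = atomic_fetch_add(&reserve_used, n);
	if (off > sizeof reserve || n > sizeof reserve - off) return NULL;
	return reserve + off;
}
```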
I've re-appropriated the lock that was previously used for __synccall
(synchronizing set*id() syscalls between threads) as a general
pthread_create lock. it's a "backwards" rwlock where the "read"
operation is safe atomic modification of the live thread count, which
multiple threads can perform at the same time, and the "write"
operation is making sure the count does not increase during an
operation that depends on it remaining bounded (__synccall or dlopen).
in static-linked programs that don't use __synccall, this lock is a
no-op and has no cost.
currently, only i386 is tested. x86_64 and arm should probably work.
the necessary relocation types for mips and microblaze have not been
added because I don't understand how they're supposed to work, and I'm
not even sure if it's defined yet on microblaze. I may be able to
reverse engineer the requirements out of gcc/binutils output.
this was an optimization to save/recover a minimal amount of extra
memory for use by malloc, that's becoming increasingly costly to keep
around. freeing this data:
1. breaks debugging with gdb (it can't find library symbols)
2. breaks thread-local storage in shared libraries
it would be possible to disable freeing when TLS is used, but in
addition to the above breakages, tracking whether dlopen/dlsym is used
adds a cost to every symbol lookup, possibly making program startup
slower for large programs. combined with the complexity, it's not
worth it. we already save/recover plenty of memory in the dynamic
linker with reclaim_gaps.
this code will not work yet because the necessary relocations are not
supported, and cannot be supported without some internal changes to
how relocation processing works (coming soon).
the design for TLS in dynamic-linked programs is mostly complete too,
but I have not yet implemented it. cost is nonzero but still low for
programs which do not use TLS and/or do not use threads (a few hundred
bytes of new code, plus dependency on memcpy). i believe it can be
made smaller at some point by merging __init_tls and __init_security
into __libc_start_main and avoiding duplicate auxv-parsing code.
at the same time, I've also slightly changed the logic pthread_create
uses to allocate guard pages to ensure that guard pages are not
counted towards commit charge.
for some reason this option is undocumented. not sure when it was
added, so I'm using a configure test. gcc was already setting the mark
correctly for C files, but assembler source files would need ugly
.note boilerplate in every single file to achieve this without the
option to the assembler.
blame whoever thought it would be a good idea to make the stack
executable by default rather than doing it the other way around...
based on proposed patches by Daniel Cegiełka, with minor changes:
- use a weak symbol for optreset so it doesn't clash with namespace
- also reset optpos (position in multi-option arg like -lR)
- also make getopt_long support reset
this function was overly complicated and not even obviously correct.
avoid using openat/linkat just like in shm_open, and instead expand
pathname using code shared with shm_open. remove bogus (and dangerous,
with priorities) use of spinlocks.
this commit also heavily streamlines the code and ensures there are no
failure cases that can happen after a new semaphore has been created
in the filesystem, since that case is unreportable.
this feature will be in the next version of POSIX, and can be used
internally immediately. there are many internal uses of fopen where
close-on-exec is needed to fix bugs.
also update syslog to use SOCK_CLOEXEC rather than separate fcntl
step, to make it safe in multithreaded programs that run external
programs.
emulation is not atomic; it could be made atomic by holding a lock on
forking during the operation, but this seems like overkill. my goal is
not to achieve perfect behavior on old kernels (which have plenty of
other imperfect behavior already) but to avoid catastrophic breakage
in (1) syslog, which would give no output on old kernels with the
change to use SOCK_CLOEXEC, and (2) programs built on a new kernel
where configure scripts detected a working SOCK_CLOEXEC, which later
get run on older kernels (they may otherwise fail to work completely).
based on initial work by rdp, with heavy modifications. some features
including threads are untested because qemu app-level emulation seems
to be broken and I do not have a proper system image for testing.
when strchr fails, an important piece of information already
computed, the string length, is thrown away. have strchrnul (with
namespace protection) be the underlying function so this information
can be kept, and let strchr be a wrapper for it. this also allows
strcspn to be considerably faster in the case where the match set has
a single element that's not matched.
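the layering can be sketched with simple byte loops. the *_sketch names are
made up, and the real functions are more heavily optimized.

```c
#include <stddef.h>

/* returns a pointer to the first c in s, or to the terminator if absent */
char *strchrnul_sketch(const char *s, int c)
{
	unsigned char ch = (unsigned char)c;
	while (*s && (unsigned char)*s != ch) s++;
	return (char *)s;
}

/* strchr is a thin wrapper: check what strchrnul stopped on */
char *strchr_sketch(const char *s, int c)
{
	char *r = strchrnul_sketch(s, c);
	return (unsigned char)*r == (unsigned char)c ? r : NULL;
}

/* strcspn with a one-character reject set reduces to a pointer difference */
size_t strcspn1_sketch(const char *s, int c)
{
	return (size_t)(strchrnul_sketch(s, c) - s);
}
```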
testing with gcc 4.6.3 on x86, -Os, the old version does a duplicate
null byte check after the first loop. this is purely the compiler
being stupid, but the old code was also stupid and unintuitive in how
it expressed the check.
austin group interpretation for defect #529
(http://austingroupbugs.net/view.php?id=529) tightens the
requirements on close such that, if it returns with EINTR, the file
descriptor must not be closed. the linux kernel developers vehemently
disagree with this, and will not change it. we catch and remap EINTR
to EINPROGRESS, which the standard allows close() to return when the
operation was not finished but the file descriptor has been closed.
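expressed as a wrapper over the public close() and errno, the remapping
looks roughly like this. close_sketch is a hypothetical name; the actual
change operates on the raw syscall return value inside libc.

```c
#include <errno.h>
#include <unistd.h>

int close_sketch(int fd)
{
	int r = close(fd);
	/* on linux the descriptor is gone even when EINTR is reported, so
	 * report the error the standard reserves for that situation */
	if (r < 0 && errno == EINTR) errno = EINPROGRESS;
	return r;
}
```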
new behavior can be summarized as:
inputs that parse completely as a decimal number are treated as one,
and rejected only if the result is out of 16-bit range.
inputs that do not parse as a decimal number (where strtoul leaves
anything left over in the input) are searched in /etc/services.
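a minimal sketch of that decision. the function name and the two negative
return codes are invented for the example; the /etc/services lookup itself
is not shown.

```c
#include <stdlib.h>

#define PORT_OUT_OF_RANGE (-1)  /* decimal number, but not a 16-bit value */
#define PORT_NEEDS_LOOKUP (-2)  /* not a pure decimal number */

int parse_port(const char *s)
{
	char *end;
	unsigned long v = strtoul(s, &end, 10);

	/* nothing consumed, or something left over: fall back to lookup */
	if (end == s || *end) return PORT_NEEDS_LOOKUP;
	if (v > 65535) return PORT_OUT_OF_RANGE;
	return (int)v;
}
```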
this is useful when the underlying gcc is already a wrapper, which is
the case at least on some uclibc-based system images. it's also useful
for running an older/newer/nondefault version of gcc.
it was determined in discussion that these kinds of limits are not
sufficient to protect single-threaded servers against denial of
service attacks from maliciously large round counts. the time scales
simply vary too much; many users will want login passwords with round
counts on a scale that gives decisecond latency, while highly loaded
webservers will need millisecond latency or shorter.
still some limit is left in place; the idea is not to protect against
attacks, but to avoid the runtime of a single call to crypt being, for
all practical purposes, infinite, so that configuration errors can be
caught and fixed without bringing down whole systems. these limits are
very high, on the order of minute-long runtimes for modest systems.
if the same register is used for input and output, the compiler must be
told. otherwise it generates random junk code that clobbers the result. in
pure syscall-wrapper functions, nothing went wrong, but in more
complex functions where register allocation is non-trivial, things
broke badly.
with this patch, the malloc in libc.so built with -Os is nearly the
same speed as the one built with -O3. thus it solves the performance
regression that resulted from removing the forced -O3 when building
libc.so; now libc.so can be both small and fast.
I originally added -O3 for shared libraries to counteract very bad
behavior by GCC when building PIC code: it insists on reloading the
GOT register in static functions that need it, even if the address of
the function is never leaked from the translation unit and all local
callers of the function have already loaded the GOT register. this
measurably degrades performance in a few key areas like malloc. the
inlining done at -O3 avoids the issue, but that's really not a good
reason for overriding the user's choice of optimization level.
vfork is implemented as the fork syscall (with no atfork handlers run)
on archs where it is not available, so this change does not introduce
any change in behavior or regression for such archs.
I'm not 100% sure that Linux's O_PATH meets the POSIX requirements for
O_SEARCH, but it seems very close if not perfect. and old kernels
ignore it, so O_SEARCH will still work as desired as long as the
caller has read permissions to the directory.
by using the "ir" constraint (immediate or register) and the carefully
constructed instruction addu $2,$0,%2 which can take either an
immediate or a register for %2, the new inline asm admits maximal
optimization with no register spillage to the stack when the compiler
successfully performs constant propagation, but still works by
allocating a register when the syscall number cannot be recognized as
a constant. in the case of syscalls with 0-3 arguments it barely
matters, but for 4-argument syscalls, using an immediate for the
syscall number avoids creating a stack frame for the syscall wrapper
function.
all past and current kernel versions have done so, but there seems to
be no reason it's necessary and the sentiment from everyone I've asked
has been that we should not rely on it. instead, use r7 (an argument
register) which will necessarily be preserved upon syscall restart.
however this only works for 0-3 argument syscalls, and we have to
resort to the function call for 4-argument syscalls.
for the sake of simplicity, I've only used rep movsb rather than
breaking up the copy for using rep movsd/q. on all modern cpus, this
seems to be fine, but if there are performance problems, there might
be a need to go back and add support for rep movsd/q.
before restrict was added, memmove called memcpy for forward copies and
used a byte-at-a-time loop for reverse copies. this was changed to
avoid invoking UB now that memcpy has an undefined copying order,
making memmove considerably slower.
performance is still rather bad, so I'll be adding asm soon.
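as a rough illustration of the overlap-safe copying order involved, here is a minimal byte-at-a-time sketch. the name sketch_memmove is hypothetical and this is not musl's actual code; it only shows why the copy direction has to be chosen from the relative position of the two regions once memcpy can no longer be relied on for overlapping buffers.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical sketch, not musl's implementation: copy n bytes between
 * possibly overlapping regions without calling memcpy. */
void *sketch_memmove(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	if (d == s) return dest;
	if ((uintptr_t)d < (uintptr_t)s) {
		/* dest below src: copy forward so each source byte is read
		 * before it can be overwritten */
		for (; n; n--) *d++ = *s++;
	} else {
		/* dest above src: copy backward for the same reason */
		while (n) { n--; d[n] = s[n]; }
	}
	return dest;
}
```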
this should both fix the issue with ARM needing -lgcc_eh (although
that's really a bug in the libgcc build process that's causing
considerable bloat, which should be fixed) and make it easier to build
musl using clang/llvm in place of gcc. unfortunately I don't know a
good way to detect and support pcc's -lpcc since it's not in pcc's
default library search path...
no syscalls actually use that many arguments; the issue is that some
syscalls with 64-bit arguments have them ordered badly so that
breaking them into aligned 32-bit half-arguments wastes slots with
padding, and a 7th slot is needed for the last argument.
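the slot arithmetic can be modeled with a small counting function. this is only a sketch under the assumption that every 64-bit argument must start on an even 32-bit slot; the function name and the example layout are hypothetical and not taken from any particular syscall.

```c
/* hypothetical model: count 32-bit argument slots when every 64-bit
 * argument must start on an even slot. is64[i] is nonzero if argument
 * i is 64 bits wide. */
int sketch_arg_slots(const int *is64, int nargs)
{
	int slot = 0;
	int i;

	for (i = 0; i < nargs; i++) {
		if (is64[i]) {
			slot += slot & 1; /* pad up to an even slot */
			slot += 2;        /* low and high halves */
		} else {
			slot++;
		}
	}
	return slot;
}
```

for a layout such as (32-bit, 64-bit, 64-bit, 32-bit) this counts 7 slots, while reordering the same arguments as (64-bit, 64-bit, 32-bit, 32-bit) needs only 6.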
this drastically reduces the size of some functions which are purely
syscall wrappers.
disabled for clang due to known bugs satisfying register constraints.
this code was using $10 to save the syscall number, but $10 is not
necessarily preserved by the kernel across syscalls. only mattered for
syscalls that got interrupted by a signal and restarted. as far as i
can tell, $25 is preserved by the kernel across syscalls.
something is wrong with the logic for the argument layout, resulting
in compile errors on mips due to too many args to syscall... further
information on how it's supposed to work will be needed before it can
be reactivated.
now public syscall.h only exposes __NR_* and SYS_* constants and the
variadic syscall function. no macros or inline functions, no
__syscall_ret or other internal details, no 16-/32-bit legacy syscall
renaming, etc. this logic has all been moved to src/internal/syscall.h
with the arch-specific parts in arch/$(ARCH)/syscall_arch.h, and the
amount of arch-specific stuff has been reduced to a minimum.
changes still need to be reviewed/double-checked. minimal testing on
i386 and mips has already been performed.
this is equivalent to posix_fallocate except that it has an extra
mode/flags argument to control its behavior, and stores the error in
errno rather than returning an error code.
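the difference in error reporting between the two conventions can be sketched with a tiny adapter. the helper name is hypothetical; it only illustrates converting a posix_fallocate-style result (0 or an error number) into the errno style (0, or -1 with errno set).

```c
#include <errno.h>

/* hypothetical helper: turn a "returns the error number" result into
 * the "returns -1 and sets errno" convention. */
int sketch_to_errno_style(int err)
{
	if (!err) return 0;
	errno = err;
	return -1;
}
```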
the old behavior of exposing nothing except plain ISO C can be
obtained by defining __STRICT_ANSI__ or using a compiler option (such
as -std=c99) that predefines it. the new default featureset is POSIX
with XSI plus _BSD_SOURCE. any explicit feature test macros will
inhibit the default.
installation docs have also been updated to reflect this change.
clang does not presently support the "v" constraint we want to use to
get the result from $3, and trying to use register...__asm__("$3") to
do the same invokes serious compiler bugs. so for now, i'm working
around the issue with an extra temp register and putting $3 in the
clobber list instead of using it as output. when the bugs in clang are
fixed, this issue should be revisited to generate smaller/faster code
like what gcc gets.
previously, it was pretty much random which one of these trees a given
function appeared in. they have now been organized into:
src/linux: non-POSIX linux syscalls (possibly shared with other nixen)
src/legacy: various obsolete/legacy functions, mostly wrappers
src/misc: still mostly uncategorized; some misc POSIX, some nonstd
src/crypt: crypt hash functions
further cleanup will be done later.
so far, this is the only actual use of loff_t i've found. some
software, including glib, assumes loff_t must exist if splice exists;
this is a reasonable assumption since the official prototype for
splice uses loff_t, as it always works with 64-bit offsets regardless
of the selected libc off_t size. i'm using #define for now rather than
a typedef to make it easy to define in other headers if necessary
(like the LFS64 ugliness), but it may be necessary to add it to
alltypes.h eventually if other functions end up needing it.
note that POSIX does not specify these functions as _Noreturn, because
POSIX is aligned with C99, not the new C11 standard. when POSIX is
eventually updated to C11, it will almost surely give these functions
the _Noreturn attribute. for now, the actual _Noreturn keyword is not
used anyway when compiling with a c99 compiler, which is what POSIX
requires; the GCC __attribute__ is used instead if it's available,
however.
in a few places, I've added infinite for loops at the end of _Noreturn
functions to silence compiler warnings. presumably
__builtin_unreachable could achieve the same thing, but it would only
work on newer GCCs and would not be portable. the loops should have
near-zero code size cost anyway.
like the previous _Noreturn commit, this one is based on patches
contributed by philomath.
to deal with the fact that the public headers may be used with pre-c99
compilers, __restrict is used in place of restrict, and defined
appropriately for any supported compiler. we also avoid the form
[restrict] since older versions of gcc rejected it due to a bug in the
original c99 standard, and instead use the form *restrict.
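a minimal sketch of the approach, using a hypothetical macro name (SKETCH_RESTRICT) rather than the reserved identifier a real header would define, and assuming that gcc-compatible compilers accept __restrict outside c99 mode:

```c
/* hypothetical macro name; a real header would use a reserved
 * identifier instead */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#define SKETCH_RESTRICT restrict
#elif defined(__GNUC__)
#define SKETCH_RESTRICT __restrict
#else
#define SKETCH_RESTRICT
#endif

/* the pointer form (*restrict) is used rather than the array form
 * ([restrict]) */
char *sketch_copy(char *SKETCH_RESTRICT d, const char *SKETCH_RESTRICT s)
{
	char *ret = d;

	while ((*d++ = *s++));
	return ret;
}
```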
unlike the memmove commit, this one should be fine to leave in place.
wmemmove is not performance-critical, and even if it were, it's
already copying whole 32-bit words at a time instead of bytes.
this commit introduces a performance regression in many uses of
memmove, which will need to be addressed before the next release. i'm
making it as a temporary measure so that the restrict patch can be
committed without invoking undefined behavior when memmove calls
memcpy with overlapping regions.
while musl itself requires a c99 compiler, some applications insist on
being compiled with c89 compilers, and use of "inline" in the headers
was breaking them. much of this had been avoided already by just
skipping the inline keyword in pre-c99 compilers or modes, but this
new unified solution is cleaner and may/should result in better code
generation in the default gcc configuration.
these limits could definitely use review, but for now, i feel
consistency and erring on the side of preventing servers from getting
bogged down by excessively-slow user-provided settings (think
.htpasswd) are the best policy. blowfish should be updated to match.
based on versions sent to the list by nsz, with some simplification
and debloating. i'd still like to get them a bit smaller, or ideally
merge them into a single file with most of the code being shared, but
that can be done later.
if needed for debugging, it will be output in the .debug_frame section
instead, where it is not part of the loaded program and where the
strip command is free to strip it.
based on patches submitted by boris brezillon. this commit also fixes
the issue whereby the main application and libc don't have the address
ranges of their mappings stored, which was theoretically a problem for
RTLD_NEXT support in dlsym; it didn't actually matter because libc
never calls dlsym, and it seemed to be doing the right thing (by
chance) for symbols in the main program as well.
based on Gregor's patch sent to the list. includes:
- stdalign.h
- removing gets in C11 mode
- adding aligned_alloc and adjusting other functions to use it
- adding 'x' flag to fopen for exclusive mode
wrong hash was being passed; just a copy/paste error. did not affect
lookups in the global namespace; this is probably why it was not
caught in testing.
previously, this usage could lead to a crash if the thread pointer was
still uninitialized, and otherwise would just cause the canary to be
zero (less secure).
before, only the first library that failed to load or symbol that
failed to resolve was reported, and then the dynamic linker
immediately exited. when attempting to fix a library compatibility
issue, this is about the worst possible behavior. now we print all
errors as they occur and exit at the very end if errors were
encountered.
it's naturally aligned when entered with the kernel argv array, but if
ld.so has been invoked explicitly to run a program, the stack will not
be aligned due to having thrown away argv[0].
if new shared mappings of files/devices/shared memory can be made
between the time a robust mutex is unlocked and its subsequent removal
from the pending slot in the robustlist header, the kernel can
inadvertently corrupt data in the newly-mapped pages when the process
terminates. i am fixing the bug by using the same global vm lock
mechanism that was used to fix the race condition with unmapping
barriers after pthread_barrier_wait returns.
this affects at least the case of very long inputs, but may also
affect shorter inputs that become long due to growth while upscaling.
basically, the logic for the circular buffer indices of the initial
base-10^9 digit and the slot one past the final digit, and for
simplicity of the loop logic, assumes an invariant that they're not
equal. the upscale loop, which can increase the length of the
base-10^9 representation, attempted to preserve this invariant, but
was actually only ensuring that the end index did not loop around past
the start index, not that the two never become equal.
the main (only?) effect of this bug was that subsequent logic treats
the excessively long number as having no digits, leading to junk
results.
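the ambiguity behind this kind of bug can be seen in a generic masked ring buffer, sketched below. the names and the 128-slot size are illustrative assumptions, not the actual float scanning code: when the start and end indices are equal, a masked length computation cannot tell a completely full buffer from an empty one, so growth has to stop one slot short.

```c
#define SKETCH_RING_SIZE 128u /* assumed to be a power of two */
#define SKETCH_RING_MASK (SKETCH_RING_SIZE - 1)

/* number of occupied slots between start index a (inclusive) and end
 * index z (exclusive, one past the final digit) */
unsigned sketch_ring_len(unsigned a, unsigned z)
{
	return (z - a) & SKETCH_RING_MASK;
}

/* grow the end index by one slot, refusing to let it meet the start */
int sketch_ring_grow(unsigned a, unsigned *z)
{
	unsigned next = (*z + 1) & SKETCH_RING_MASK;

	if (next == a) return -1; /* would make start == end */
	*z = next;
	return 0;
}
```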
with this patch, setting _POSIX_SOURCE, or setting _POSIX_C_SOURCE or
_XOPEN_SOURCE to an old version, will bring back the interfaces that
were removed in POSIX 2008 - at least the ones i've covered so far,
which are gethostby*, usleep, and ualarm. if there are other functions
still in widespread use that were removed for which similar changes
would be beneficial, they can be added just like this.
this function never existed historically; since the float/double
functions it's based on are nonstandard and deprecated, there's really
no justification for its existence except that glibc has it. it can be
added back if there's ever really a need...
since this interface is rarely used, it's probably best to lean
towards keeping code size down anyway. one-character needles will
still be found immediately by the initial wcschr call anyway.
the strspn call was made for every format specifier and end-of-string,
even though the expected return value was 1-2 for normal usage.
replace with simple loop.
amusingly, this cuts more than 10% off the run time of printf("a"); on
the machine i tested it on.
sadly the same optimization is not possible for snprintf without
duplicating all the pseudo-FILE setup code, which is not worth it.
this is needed to match the underlying "ABI" standards. it's not
really an ABI issue since the binary representations are the same, but
having the wrong type can lead to errors when the type arising from a
difference-of-pointers expression does not match the defined type of
ptrdiff_t. most of the problems affect C++, not C.
there are still some discussions going on about tweaking the code, but
at least this brings us to the point of having something working in
the repository. hopefully the remaining major hashes (md5,sha) will
follow soon.
some minor changes to how hard-coded sets for thread-related purposes
are handled were also needed, since the old object sizes were not
necessarily sufficient. things have gotten a bit ugly in this area,
and i think a cleanup is in order at some point, but for now the goal
is just to get the code working on all supported archs including mips,
which was badly broken by linux rejecting syscalls with the wrong
sigset_t size.
unfortunately, a large portion of programs which call crypt are not
prepared for its failure and do not check that the return value is
non-null before using it. thus, always "succeeding" but giving an
unmatchable hash is reportedly a better behavior than failing on
error.
it was suggested that we could do this the same way as other
implementations and put the null-to-unmatchable translation in the
wrapper rather than the individual crypt modules like crypt_des, but
when i tried to do it, i found it was making the logic in __crypt_r
for keeping track of which hash type we're working with and whether it
succeeded or failed much more complex, and potentially error-prone.
the way i'm doing it now seems to have essentially zero cost, anyway.
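a minimal sketch of the "always succeed" idea, under the assumption that "*" and "x" are suitable unmatchable strings (the function name is hypothetical). the result must never be a valid hash and must differ from the setting, so that a caller comparing the returned string against the stored hash never sees a match.

```c
/* hypothetical helper: pick a string that cannot be a valid hash and
 * that differs from the setting the caller passed in. */
const char *sketch_unmatchable(const char *setting)
{
	return setting[0] == '*' ? "x" : "*";
}
```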
since .init and .fini are not .text, the toolchain does not seem to
align them for code by default. this yields random breakage depending
on the object sizes the linker is dealing with.
not heavily tested, but the basics are working. the basic concept is
that the dynamic linker entry point code invokes a pure-PIC (no global
accesses) C function in reloc.h to perform the early GOT relocations
needed to make the dynamic linker itself functional, then invokes
__dynlink like on other archs. since mips uses some ugly arch-specific
hacks to optimize relocating the GOT (rather than just using the
normal DT_REL[A] tables like on other archs), the dynamic linker has
been modified slightly to support calling arch-specific relocation
code in reloc.h.
most of the actual mips-specific behavior was developed by reading the
output of readelf on libc.so and simple executable files. i could not
find good reference information on which relocation types need to be
supported or their semantics, so it's possible that some legitimate
usage cases will not work yet.
this is mainly a development convenience but will also ensure users
building from latest git always get up-to-date arch-specific dynamic
linker code without having to "make clean".
changing the string printed for the dso name is not a regression; the
old code was simply using the wrong dso name (head rather than the dso
currently being relocated). this will be fixed in a later commit.
not heavily tested, but at least they don't seem to break anything on
soft float targets with or without coprocessors. they check the auxv
AT_HWCAP flags to determine which coprocessor, if any, is available.
since the correct declaration was not visible, and since the
representation of the types wchar_t and wint_t always match, a
compiler would have to go out of its way to make this bug manifest,
but better to fix it anyway.
it's expected that this will be needed/useful only in asm, so I've
given it its own symbol that can be addressed in pc-relative ways from
asm rather than adding a field in the __libc structure which would
require hard-coding the offset wherever it's used.
this seems counter-intuitive since sem_trywait is supposed to just try
once, not wait for the semaphore. however, the retry loop is not a
wait. instead, it's to handle the case where the value changes due to
a simultaneous post or wait from another thread while the semaphore
value remains positive. in such a case, it's absolutely wrong for
sem_trywait to fail with EAGAIN because the semaphore is not busy.
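the retry logic can be sketched with C11 atomics. this is an illustration with a hypothetical function name, not musl's internal code: the loop only repeats when the compare-and-swap loses a race while the value is still positive, and it never blocks.

```c
#include <errno.h>
#include <stdatomic.h>

/* hypothetical sketch using C11 atomics in place of libc-internal
 * atomic primitives */
int sketch_trywait(atomic_int *val)
{
	int v = atomic_load(val);

	while (v > 0) {
		/* on failure, v is reloaded with the current value and the
		 * loop retries as long as that value is still positive */
		if (atomic_compare_exchange_weak(val, &v, v - 1))
			return 0;
	}
	errno = EAGAIN;
	return -1;
}
```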
based on patches by orc and Isaac Dunham, with some fixes. sys/io.h
exists and contains prototypes for these functions regardless of
whether the target arch has them; this is a bit unorthodox but I don't
think it will break anything. the function definitions do not exist
unless the appropriate SYS_* syscall number macro is defined, which
should make sure configure scripts looking for these functions don't
find them on other systems.
presently, sys/io.h does not have the inb/outb/etc. port io
macros/functions. I'd be surprised if ioperm/iopl are useful without
them, so they probably need to be added at some point in appropriate
bits/io.h files...
also fix the alignment of jmp_buf to meet the abi. linux always
emulates fpu on mips if it's not present, so enabling this code
unconditionally is "safe" but may be slow. in the long term it may be
preferable to find a way to disable it on soft float builds.
the fields in the mcontext_t are long long (for no good reason) even
on 32-bit mips, so the offset of the instruction pointer (as a word)
varies depending on endianness.
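the endianness dependence can be shown with a generic helper that reads the low 32-bit word out of a 64-bit slot by word index. the helper name is hypothetical and it does not touch a real mcontext_t; it only illustrates why the word offset differs between little and big endian.

```c
#include <stdint.h>
#include <string.h>

/* hypothetical helper: fetch the low 32-bit word of a 64-bit slot the
 * way word-indexed access would have to. */
uint32_t sketch_low_word(const uint64_t *slot)
{
	const union { uint16_t u; unsigned char c[2]; } probe = { 1 };
	uint32_t words[2];

	memcpy(words, slot, sizeof words);
	/* little endian: the low word comes first; big endian: second */
	return words[probe.c[0] ? 0 : 1];
}
```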
the kernel wrongly expects the cmsg length field to be size_t instead
of socklen_t. in order to work around the issue, we have to impose a
length limit and copy to a local buffer. the length limit should be
more than sufficient for any real-world use; these headers are only
used for passing file descriptors and permissions between processes
over unix sockets.
these could have caused memory corruption due to invalid accesses to
the next field. all should be fixed now; I found the errors with fgrep
-r '__lock(&', which is bogus since the argument should be an array.
after the thread unmaps its own stack/thread structure, the kernel,
performing child tid clear and futex wake, could clobber a new mapping
made at the same location as the just-removed thread's tid field.
disable kernel clearing of child tid to prevent this.
the mips abi reserves stack space equal to the size of the in-register
args for the callee to save the args, if desired. this would cause the
beginning of the thread structure to be clobbered...
the old code worked in qemu app-level emulation, but not on real
kernels where the clone syscall does not copy the register values to
the new thread. save arguments on the new thread stack instead.
basically, this version of the code was obtained by starting with
rdp's work from his ellcc source tree, adapting it to musl's build
system and coding style, auditing the bits headers for discrepancies
with kernel definitions or glibc/LSB ABI or large file issues, fixing
up incompatibility with the old binutils from aboriginal linux, and
adding some new special cases to deal with the oddities of sigaction
and pipe syscall interfaces on mips.
at present, minimal test programs work, but some interfaces are broken
or missing. threaded programs probably will not link.
if libc.a is compiled PIC for use in static PIE code, this should not
cause the dynamic linker (which still does not support static-linked
main program) to be built into libc.a.
most importantly, the name for such libs was being set from an
uninitialized buffer. also, shortname always had an initial '/'
character, making it useless for looking up already-loaded libraries
by name, and thus causing repeated searches through the library path.
major changes now:
- shortname is the base name for library lookups with no explicit
pathname. it's initially clear for libraries loaded with an explicit
pathname (and for the main program), but will be set if the same
library (detected via inode match) is later found by a search.
- exact name match is never used to identify libraries loaded with an
explicit pathname. in this case, there's no explicit search, so we
can just stat the file and check for inode match.
previously this was being handled the same as a library-specific,
dependency-order lookup on the next library in the global chain, which
is likely to be utterly meaningless. instead the lookup needs to be in
the global namespace, but omitting the initial portion of the global
library chain up through the calling library.
this option is expensive and only used on old gcc's that lack
-fexcess-precision=standard, but it's not needed on non-i386 archs
where floating point does not have excess precision anyway.
if musl ever supports m68k, i think it will need to be special-cased
too. i'm not aware of any other archs with excess precision.
on arm, the location of the saved-signal-mask flag and mask were off
by one between sigsetjmp and siglongjmp, causing incorrect behavior
restoring the signal mask. this is because the siglongjmp code assumed
an extra slot was in the non-sig jmp_buf for the flag, but arm did not
have this. now, the extra slot is removed for all archs since it was
useless.
also, arm eabi requires jmp_buf to have 8-byte alignment. we achieve
that using long long as the type rather than with non-portable gcc
attribute tags.
no idea why gcc refuses to compile the C code to use a tail call, but
it's best to use asm anyway so we don't have to rely on the quality of
the compiler's optimizations for correct code.
the new version is largely the work of Solar Designer, with minor
changes for integration with musl. compared to the old code, text size
is reduced by about 7k, stack space usage by about 70k, and
performance is greatly improved by avoiding expensive calculation of
constant tables on each run.
this version also adds support for extended des-based password hashes,
which allow for unlimited key (password) length and configurable
iteration counts.
i've also published the interface for crypt_r in a new crypt.h header.
especially since this is not a standard interface, i did not feel
compelled to match the glibc abi for the crypt_data structure. the
glibc structure is way too big to allocate on the stack; in fact it's
so big that the first usage may cause the main thread to exceed its
pre-committed stack size of 128k and thus could cause the program to
crash even on systems with overcommit disabled. the only legitimate
use of crypt_data for crypt_r is to store the hash string to return,
so i've reserved 256 bytes, which should be more than sufficient
(longest known password hashes are ~60 characters, and beyond that is
possibly even exceeding some implementations' passwd file field size
limit).
lr must be saved because init/fini-section code from the compiler
clobbers it. this was not a problem when i tested without gcc's
crtbegin/crtend files present, but with them, musl on arm fails to
work (infinite loop in _init).
on old kernels, there's no way to detect errors; we must assume
negative syscall return values are pgrp ids. but if the F_GETOWN_EX
fcntl works, we can get a reliable answer.
The long double adjustment was wrong:
The usual check is
(mant_bits & 0x7ff) == 0x400
before doing a mant_bits++ or mant_bits-- adjustment since
this is the only case when rounding an inexact ld80 into
double can go wrong. (only in nearest rounding mode)
After such a check the ++ and -- is ok (the mantissa will end
in 0x401 or 0x3ff).
fma is a bit different (we need to add 3 numbers with correct
rounding: hi_xy + lo_xy + z so we should survive two roundings
at different places without precision loss)
The adjustment in fma only checks for zero low bits
(mant_bits & 0x3ff) == 0
this way the adjusted value is correct when rounded to
double or *less* precision.
(this is an important piece in the fma puzzle)
Unfortunately in this case the -- is not a correct adjustment
because mant_bits might underflow so further checks are needed
and this was the source of the bug.
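The two checks can be written as C predicates. This is a sketch with hypothetical function names, assuming mant_bits holds the 64-bit ld80 mantissa; the parentheses around the mask matter in C because == binds tighter than &.

```c
#include <stdint.h>

/* low 11 bits are exactly half of a double ulp: the case where rounding
 * an inexact ld80 to double in nearest mode can go wrong */
int sketch_is_halfway(uint64_t mant_bits)
{
	return (mant_bits & 0x7ff) == 0x400;
}

/* check used by the fma adjustment: low 10 bits all zero */
int sketch_low_bits_zero(uint64_t mant_bits)
{
	return (mant_bits & 0x3ff) == 0;
}
```

A mantissa such as 0x8000000000000000 passes the second check, and decrementing it borrows out of the top bit, which is presumably the kind of underflow that needs the further checks.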
unicode char data has both "W" and "F" wide types and the old table
only included the "W" ones. this omitted U+3000 (ideographic space)
and all the wide-ascii, etc.
at the point pclose might receive and act on cancellation, it has
already invalidated the FILE passed to it. thus, per musl's QOI
guarantees about cancellation and resource allocation/deallocation,
it's not a candidate for cancellation.
if it were required to be a cancellation point by posix, we would have
to switch the order of deallocation, but somehow still close the pipe
in order to trigger the child process to exit. i looked into doing
this, but the logic gets ugly, and i'm not sure the semantics are
conformant, so i'd rather just leave it alone unless there's a need to
change it.
close was the only cancellation point called from popen, but it left
popen with major resource leaks if any call to close got cancelled.
the easiest, cheapest fix is just to use a non-cancellable close
function.
if the buffer is too short, at least return a partial string. this is
helpful if the caller is lazy and does not check for failure. care is
taken to avoid writing anything if the buffer length is zero, and to
always null-terminate when the buffer length is non-zero.
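a minimal sketch of this truncation behavior follows. the function name is hypothetical, and returning ERANGE for a too-short buffer is an assumption made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* hypothetical helper: copy msg into buf, truncating if needed.
 * ERANGE as the failure code is an assumption of this sketch. */
int sketch_copy_msg(const char *msg, char *buf, size_t buflen)
{
	size_t l = strlen(msg);

	if (l >= buflen) {
		if (buflen) {
			/* partial string, always null-terminated */
			memcpy(buf, msg, buflen - 1);
			buf[buflen - 1] = 0;
		}
		/* with buflen == 0 nothing is written at all */
		return ERANGE;
	}
	memcpy(buf, msg, l + 1);
	return 0;
}
```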
this one could never cause any problems unless the compiler/machine
goes to extra trouble to break oob pointer arithmetic, but it's best
to fix it anyway.
dynamic-allocation of the structure is not valid; it can crash an
application if malloc fails. since localeconv is not specified to have
failure conditions, the object needs to have static storage duration.
need to review whether all the values are right or not still..
if we eventually have build options, it might be nice to make an
option to dummy this out again, in case anybody needs a system-wide
disable for disk/ssd-thrashing, etc. that some daemons do when
logging...
large precision values could cause out-of-bounds pointer arithmetic in
computing the precision cutoff (used to avoid expensive long-precision
arithmetic when the result will be discarded). per the C standard,
this is undefined behavior. one would expect that it works anyway, and
in fact it did in most real-world cases, but it was randomly
(depending on aslr) crashing in i386 binaries running on x86_64
kernels. this is because linux puts the userspace stack near 4GB
(instead of near 3GB) when the kernel is 64-bit, leading to the
out-of-bounds pointer arithmetic overflowing past the end of address
space and giving a very low pointer value, which then compared lower
than a pointer it should have been higher than.
the new code rearranges the arithmetic so that no overflow can occur.
while this bug could crash printf with memory corruption, it's
unlikely to have security impact in real-world applications since the
ability to provide an extremely large field precision value under
attacker-control is required to trigger the bug.
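the general shape of such a rearrangement can be sketched as follows. this is not the actual printf code; the helper name is hypothetical and it assumes p and end point into the same object with p <= end. comparing lengths instead of pointers means p+n is never formed, so nothing can go out of bounds or wrap around the address space.

```c
#include <stddef.h>

/* hypothetical helper: does a run of n bytes starting at p fit before
 * end? the naive form "p + n <= end" is undefined when p + n is out of
 * bounds; comparing against the remaining length is always safe. */
int sketch_fits(const char *p, const char *end, size_t n)
{
	return n <= (size_t)(end - p);
}
```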
for seekable files, posix imposes requirements on the offset of the
underlying open file description after a stream is closed. this was
correctly handled (as a side effect of the unconditional fflush call)
when streams were explicitly closed by fclose, but was not handled
correctly at program exit time, where fflush(0) was being used.
the weak symbol hackery is to pull in __stdio_exit if either of
__toread or __towrite is used, but avoid calling it twice so we don't
have to keep extra state. the new __stdio_exit is a streamlined fflush
variant that avoids performing any unnecessary operations and which
never unlocks the files or open file list, so we can be sure no other
threads write new data to a stream's buffer after it's already
flushed.
there is no need/use for a flush hook. the write function serves this
purpose already. i originally created the hook for implementing mem
streams based on a mistaken reading of posix, and later realized it
wasn't useful but never removed it until now.
the old behavior was to only consider a stream to be "reading" or
"writing" if it had buffered, unread/unwritten data. this reportedly
differs from the traditional behavior of these functions, which is
essentially to return true as much as possible without creating the
possibility that both __freading and __fwriting could return true.
gnulib expects __fwriting to return true as soon as a file is opened
write-only, and possibly expects other cases that depend on the
traditional behavior. and since these functions exist mostly for
gnulib (does anything else use them??), they should match the expected
behavior to avoid even more ugly hacks and workarounds...
this is required in case dtors use stdio.
also remove the old comments; one was cruft from when the code used to
be using function pointers and conditional calls, and has little
motivation now that we're using weak symbols. the other was just
complaining about having to support dtors even though the cost was
made essentially zero in the non-use case by the way it's done here.
these are not exposed publicly in any header, but the few programs
that use them (modutils/kmod, etc.) are declaring the functions
themselves rather than making the syscalls directly, and it doesn't
really hurt to have them (same as the capset junk).
based on patch by Emil Renner Berthing, with minor changes to dirent.h
for LFS64 and organization of declarations
this code should work unmodified once a real strverscmp is added, but
I've been hesitant to add it because the GNU strverscmp behavior is
harmful in a lot of cases (for instance if you have numeric filenames
in hex). at some point I plan on trying to design a variant of the
algorithm that behaves better on a mix of filename styles.
these were left in glibc for binary compatibility after the public
part of the interface was removed, and libcap kept using them (with
its own copy of the header files) rather than just making the syscalls
directly. might as well add them since they're so small...
i originally omitted these (optional, per POSIX) interfaces because i
considered them backwards implementation details. however, someone
later brought to my attention a fairly legitimate use case: allocating
thread stacks in memory that's setup for sharing and/or fast transfer
between CPU and GPU so that the thread can move data to a GPU directly
from automatic-storage buffers without having to go through additional
buffer copies.
perhaps there are other situations in which these interfaces are
useful too.
printf was not printing too many characters, but it was reading one
too many wchar_t elements from the input. this could lead to crashes
if running off the page, or spurious failure if the conversion of the
extra wchar_t resulted in EILSEQ.
this issue affects the last gpl2 version of binutils, which some
people are still using out of aversion to gpl3. musl requires
-Bsymbolic-functions because it's the only way to make a libc.so
that's able to operate prior to dynamic linking but that still behaves
correctly with respect to global vars that may be moved to the main
program via copy relocations.
it's possible that the user has provided a compiler that does not have
any libc to link to, so linking a main program is a bad idea. instead,
generate an empty shared library with no dependencies.
in theory we could support stack protector in the libc itself, and
users wanting to experiment with such usage could add
-fstack-protector to CFLAGS intentionally. but to avoid breakage in
the default case, override broken distro-patched gcc that forces stack
protector on.
some broken distro-provided toolchains have modified gcc to produce
only "gnu hash" dynamic hash table by default. as this is unsupported
by musl, that results in a non-working libc.so. we detect and switch
this on in configure rather than hard-coding it in the Makefile
because it's not supported by old binutils versions, but that might
not even be relevant since old binutils versions already fail from
-Bsymbolic-functions being missing. at some point I may review whether
this should just go in the Makefile...
the error will propagate up and be printed to the user at program
start time; at runtime, dlopen will just fail and leave a message for
dlerror.
previously, if mprotect failed, subsequent attempts to perform
relocations would crash the program. this was resulting in an
increasing number of false bug reports on grsec systems where rwx
permission is not possible in cases where users were wrongly
attempting to use non-PIC code in shared libraries. supporting that
usage is in theory possible, but the x86_64 toolchain does not even
support textrels, and the cost of keeping around the necessary
information to handle textrels without rwx permissions is
disproportionate to the benefit (which is essentially just supporting
broken library setups on grsec machines).
also, i unified the error-out code in map_library now that there are 3
places from which munmap might have to be called.
this is ugly and stupid, but now that the *64 symbol names exist, a
lot of broken GNU software detects them in configure, then either
breaks during build due to missing off64_t definition, or attempts to
compile without function declarations/prototypes. "fixing" it here is
easier than telling everyone to add yet another feature test macro to
their builds.
Per POSIX, "The abort() function shall cause abnormal process
termination to occur, unless the signal SIGABRT is being caught and
the signal handler does not return."
If SIGABRT is blocked or if a signal handler is installed and does
return, abort is still required to cause abnormal program termination.
We cannot use a_crash() to do this, since a SIGILL handler could also
be installed (and might even longjmp out of the abort, not expecting
to be invoked from within abort), nor can we rely on resetting the
signal handler and re-raising the signal (this has race conditions in
multi-threaded programs). On the other hand, SIGKILL is a perfectly
safe, unblockable way to obtain abnormal program termination, and it
requires no ugly loop-and-retry logic.
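a minimal sketch of the approach described above, not the actual musl source: raise SIGABRT first, and if that call returns (because the signal is blocked, ignored, or handled by a handler that returns), fall back to SIGKILL. the names sketch_abort, returning_handler and abort_termsig are hypothetical; abort_termsig forks a child only so that the outcome can be observed from the parent.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* hypothetical handler that returns; it must not prevent termination */
static void returning_handler(int sig) { (void)sig; }

/* sketch of an abort that handlers and signal masks cannot defeat */
void sketch_abort(void)
{
	raise(SIGABRT);  /* usual path: the default action kills the process */
	raise(SIGKILL);  /* reached only if SIGABRT was blocked or handled */
	for (;;);        /* not reached; SIGKILL cannot be caught or blocked */
}

/* run sketch_abort in a child that has a returning SIGABRT handler
 * installed; returns the signal that killed the child, or -1 on failure */
int abort_termsig(void)
{
	int status;
	pid_t pid = fork();
	if (pid < 0) return -1;
	if (pid == 0) {
		signal(SIGABRT, returning_handler);
		sketch_abort();
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0) return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

with the returning handler installed, the child is expected to die from SIGKILL rather than SIGABRT, which is still abnormal termination as far as the parent's wait status is concerned.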
for some nonsensical reason, glibc's headers use inline functions that
redirect some of the standard functions to ugly nonstandard names (and
likewise for some of their nonstandard functions).
I've been looking for data that would suggest a good default, and
since little has shown up, i'm doing this based on the limited data I
have. the value 80k is chosen to accommodate 64k of application data
(which happens to be the size of the buffer in git that made it crash
without a patch to call pthread_attr_setstacksize) plus the max stack
usage of most libc functions (with a few exceptions like crypt, which
will be fixed soon to avoid excessive stack usage, and [n]ftw, which
inherently uses a fair bit in recursive directory searching).
if further evidence emerges suggesting that the default should be
larger, I'll consider changing it again, but I'd like to avoid it
getting too large to avoid the issues of large commit charge and rapid
address space exhaustion on 32-bit machines.
this fix is necessary because a program could be started with some of
the implementation-reserved signals masked (e.g. due to exec having
been called from a signal handler, or from a non-musl program) and
then could obtain an invalid-to-use-later sigset_t as the old/saved
signal mask.
this action is now performed in pthread_self initialization; it must
be performed there in case the first call to pthread_create is from a
signal handler, in which case the old signal mask could be restored on
return from the signal.
this should be the last major fix needed to support running
glibc-linked conforming POSIX programs with musl in place of glibc, as
long as musl provides the features they need and they don't use
pthread cancellation (which is implemented as c++ exceptions in glibc,
and fundamentally incompatible with musl).
these will NOT be used when compiling with -D_LARGEFILE64_SOURCE on
musl; instead, they exist in the hopes of eventually being able to run
some glibc-linked apps with musl sitting in place of glibc.
also remove the (apparently incorrect) fcntl alias.
two actual issues: one is that __dynlink no longer wants/needs a GOT
pointer argument, so the code to generate that argument can be
removed. the other issue was that in the i386 code, argc/argv were
being loaded into registers that would be call-clobbered, then copied
to preserved registers, rather than just being loaded into the proper
call-preserved registers to begin with.
this cleanup is in preparation for adding new dynamic linker
functionality (ability to explicitly invoke the dynamic linker to run
a program).
unfortunately in dynamic-linked programs, these macros cause
pthread_self to be initialized, which costs a couple syscalls, and
(much worse) would necessarily fail, crash, and burn on ancient (2.4
and earlier) kernels where setting up a thread pointer does not work.
i'd like to do this in a more generic way that avoids all use of
cleanup push/pop before pthread_self has been successfully called and
avoids ugly if/else constructs like the one in this commit, but for
now, this will suffice.
if the process started with these signals blocked, cancellation could
fail or setxid could deadlock. there is no way to globally unblock
them after threads have been created. by unblocking them in the
pthread_self initialization for the main thread, we ensure that
they're unblocked before any other threads are created and also
outside of any signal handler context (sigaction initialized
pthread_self), which is important so that return from a signal handler
won't re-block them.
TRE has a broken assumption that wchar_t is signed, which is a sane
expectation, but not required by the standard, and false on ARM's ABI.
i leave tre_char_t as wchar_t for now, since a pointer to it is
directly passed to functions that need pointer to wchar_t. it does not
seem to break anything. and since the maximum unicode scalar value is
0x10ffff, just use that explicitly rather than using the max value of
any particular C type.
the bug was that cancellation requests which arrived while a
cancellation point was interrupted by a signal handler would not be
acted upon when the signal handler returns. this was because cp_sp was
never set; it's no longer needed or used.
instead, just always re-raise the signal when cancellation was not
acted upon. this wastes a tiny amount of time in the rare case where
it even matters, but it ensures correctness and simplifies the code.
the old code could be kept for cases where SYS_utime is available, but
it's not really worth the ifdef ugliness. and better to avoid
deprecated stuff just in case the kernel devs ever get crazy enough to
start removing it from archs where it was part of the ABI and breaking
static bins...
stale state information indicating that a thread was possibly blocked
at a cancellation point could get left behind if longjmp was used to
exit a signal handler that interrupted a cancellation point.
to fix the issue, we throw away the state information entirely and
simply compare the saved instruction pointer to a range of code
addresses in the __syscall_cp_asm function. all the ugly PIC work
(which becomes minimal anyway with this approach) is deferred to
cancellation time instead of happening at every syscall, which should
improve performance too.
this commit also fixes cancellation on arm, which was mildly broken
(race condition, not checking cancellation flag once inside the
cancellation point zone). apparently i forgot to implement that. the
new arm code is untested, but appears correct; i'll test and fix it
later if there are problems.
i originally made it the same size as the bloated GNU version, which
contains space for saved signal mask, but this makes some structures
containing jmp_buf become much larger for no benefit. we will never
use the signal mask field with plain setjmp; sigsetjmp serves that
purpose.
i made a best attempt, but the intended semantics of this function are
fundamentally contradictory. there is no consistent way to handle
ownership of locks when forking a multi-threaded process. the code
could have worked by accident for programs that only used normal
mutexes and nothing else (since they don't actually store or care
about their owner), but that's about it. broken-by-design interfaces
that aren't even in glibc (only solaris) don't belong in musl.
this is actually rather ugly, and would get even uglier if we ever
want to support further feature test macros. at some point i may
factor the bits headers into separate files for C base, POSIX base,
and nonstandard extensions (the only distinctions that seem to matter
now) and then the logic for which to include can go in the main header
rather than being duplicated for each arch. the downside of this is
that it would result in more files having to be opened during
compilation, so as long as the ugliness does not grow, i'm inclined to
leave it alone for now.
there is no reason to avoid multiple identical macro definitions; this
is perfectly legal C, and even with the maximal warning options
enabled, gcc does not issue any warning for it.
these are cruft from the original code which used an explicit string
length rather than null termination. i blindly converted all the
checks to null terminator checks, without noticing that in several
cases, the subsequent switch statement would automatically handle the
null byte correctly.
we do not bother making h_errno thread-local since the only interfaces
that use it are inherently non-thread-safe. but still use the
potentially-thread-local ABI to access it just to avoid lock-in.
this one is for program(s|ers) who haven't heard of uint16_t and
uint32_t (which are obviously the correct types for use in such
situations, as they're the argument/return types for ntohs/htons and
ntohl/htonl).
there's no sense in using a powerful lock in exit, because it will
never be unlocked. a thread that arrives at exit while exit is already
in progress just needs to hang forever. use the pause syscall for this
because it's cheap and easy and universally available.
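a rough sketch of the idea, using C11 atomics and the pause() library function in place of musl's internal primitives. the names exit_started and enter_exit_once are hypothetical.

```c
#include <stdatomic.h>
#include <unistd.h>

static atomic_int exit_started;

/* returns 1 to the first caller; any later caller never returns */
int enter_exit_once(void)
{
	/* the exchange yields 0 only for the first thread to arrive */
	while (atomic_exchange(&exit_started, 1))
		pause();  /* sleep until a signal arrives, then sleep again */
	return 1;
}
```

no unlock operation exists because the process terminates while the flag is still set, so a full mutex would add nothing.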
the non-prototype declaration of basename in string.h is an ugly
compromise to avoid breaking 2 types of broken software:
1. programs which assume basename is declared in string.h and thus
would suffer from dangerous pointer-truncation if an implicit
declaration were used.
2. programs which include string.h with _GNU_SOURCE defined but then
declare their own prototype for basename using the incorrect GNU
signature for the function (which would clash with a correct
prototype).
however, since C++ does not have non-prototype declarations and
interprets them as prototypes for a function with no arguments, we
must omit it when compiling C++ code. thankfully, all known broken
apps that suffer from the above issues are written in C, not C++.
1. * in BRE is not special at the beginning of the regex or a
subexpression. this broke ncurses' build scripts.
2. \\( in BRE is a literal \ followed by a literal (, not a literal \
followed by a subexpression opener.
3. the ^ in \\(^ in BRE is a literal ^ only at the beginning of the
entire BRE. POSIX allows treating it as an anchor at the beginning of
a subexpression, but TRE's code for checking if it was at the
beginning of a subexpression was wrong, and fixing it for the sake of
supporting a non-portable usage was too much trouble when just
removing this non-portable behavior was much easier.
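the first two rules can be checked against any POSIX regcomp with a small helper. this is only an illustrative sketch; bre_matches is a hypothetical name, and the third rule is left out because POSIX permits either behavior there.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* compile pat as a POSIX BRE (no REG_EXTENDED) and search for it in s;
 * returns 1 on match, 0 on no match, -1 if the pattern fails to compile */
int bre_matches(const char *pat, const char *s)
{
	regex_t re;
	int r;
	assert(pat && s);
	if (regcomp(&re, pat, REG_NOSUB)) return -1;
	r = regexec(&re, s, 0, NULL, 0);
	regfree(&re);
	return r == 0;
}
```

for example, the pattern "*a" should match the text "*a" but not the text "a", since the leading * is an ordinary character. note that the regex written \\( in rule 2 has to be spelled "\\\\(" as a C string literal.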
this patch also moved lots of the ugly logic for empty atom checking
out of the default/literal case and into new cases for the relevant
characters. this should make parsing faster and make the code smaller.
if nothing else it's a lot more readable/logical.
at some point i'd like to revisit and overhaul lots of this code...
apparently initializing a variable is not "using" it but assigning to
it is "using" it. i don't really like this fix, but it's better than
trying to make a bigger cleanup just before a release, and it should
work fine (tested against nsz's math tests).
this only works with gcc 4.6 and later, but it allows us to support
non-default endianness on archs like arm, mips, ppc, etc. that can do
both without having separate header sets for both variants, and it
saves one #include even on fixed-endianness archs like x86.
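a sketch of how a single header can derive the byte order from the compiler's predefined macros (__BYTE_ORDER__ and friends, available in gcc 4.6 and later). the SKETCH_* names and runtime_byte_order are hypothetical and are not the names used in musl's headers.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_LITTLE_ENDIAN 1234
#define SKETCH_BIG_ENDIAN    4321

/* pick the byte order at compile time from compiler-provided macros */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define SKETCH_BYTE_ORDER SKETCH_BIG_ENDIAN
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define SKETCH_BYTE_ORDER SKETCH_LITTLE_ENDIAN
#else
#error "compiler does not predefine __BYTE_ORDER__"
#endif

/* runtime probe, used only to cross-check the compile-time answer */
int runtime_byte_order(void)
{
	uint32_t v = 1;
	unsigned char first;
	memcpy(&first, &v, 1);  /* lowest-addressed byte of the value 1 */
	return first ? SKETCH_LITTLE_ENDIAN : SKETCH_BIG_ENDIAN;
}
```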
apparently some packages see stropts.h and want to be able to use
this. the implementation checks that the file descriptor is valid by
using fcntl/F_GETFD so it can report an error if not (as specified).
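a minimal sketch of such a check, under the hypothetical name sketch_isastream; it assumes that no descriptor is ever a STREAMS device, so a valid descriptor yields 0.

```c
#include <fcntl.h>

/* 0 for any valid descriptor (never a STREAMS device in this sketch),
 * -1 with errno set by fcntl (EBADF) for an invalid one */
int sketch_isastream(int fd)
{
	return fcntl(fd, F_GETFD) < 0 ? -1 : 0;
}
```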
two issues: (1) the type was wrong (unsigned instead of signed int),
and (2) the value of FP_ILOGBNAN should be INT_MIN rather than INT_MAX
to match the ABI. this is also much more useful since INT_MAX
corresponds to a valid input (infinity). the standard would allow us
to set FP_ILOGB0 to -INT_MAX instead of INT_MIN, which would give us
distinct values for ilogb(0) and ilogb(NAN), but the benefit seems way
too small to justify ignoring the ABI.
note that the macro is just a "portable" (to any twos complement
system where signed and unsigned int have the same width) way to write
INT_MIN without needing limits.h. it's valid to use this method since
these macros are not required to work in #if directives.
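for illustration, one way to write such a macro is shown below under the hypothetical name SKETCH_INT_MIN. limits.h is included here only to cross-check the value; the macro itself does not need it. the cast is what makes the expression unusable in #if directives, since the preprocessor does not evaluate casts.

```c
#include <assert.h>
#include <limits.h>  /* only for the cross-check below */

/* (unsigned)-1 >> 1 equals INT_MAX when int and unsigned have the same
 * width; on a twos complement system, -1 - INT_MAX is then INT_MIN */
#define SKETCH_INT_MIN (-1-(int)(((unsigned)-1)>>1))

/* the expression is an integer constant expression, so it can be
 * verified at compile time */
static_assert(SKETCH_INT_MIN == INT_MIN, "macro must equal INT_MIN");

int sketch_ilogb_nan_value(void)
{
	return SKETCH_INT_MIN;
}
```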
these changes are based on the following communication via email:
"I hereby grant that all of the code I have contributed to musl on or
before April 23, 2012 may be licensed under the terms of the following
MIT license:
Copyright (c) 2011-2012 Nicholas J. Kain
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."
this script is not based on autoconf; however it attempts to follow
the same interface contracts for ease of integration with build
systems. running it is also not required in order to use musl. manually written
config.mak files are still supported, as is building without any
config.mak at all as long as you are happy with the default options
and you supply at least ARCH on the command line to make.
this change is necessary or pthread_create will always fail on
security-hardened kernels. i considered first trying to make the stack
executable and simply retrying without execute permissions when the
first try fails, but (1) this would incur a serious performance
penalty on hardened systems, and (2) having the stack be executable is
just a bad idea from a security standpoint.
if there is real-world "GNU C" code that uses nested functions with
threads, and it can't be fixed, we'll have to consider other ways of
solving the problem, but for now this seems like the best fix.
these new rules should avoid spurious error messages when the
directory (usually /lib) and the dynamic linker symlink already exist,
and minimize the spam when they can't be created.
old: 2*atan2(sqrt(1-x),sqrt(1+x))
new: atan2(fabs(sqrt((1-x)*(1+x))),x)
improvements:
* all edge cases are fixed (sign of zero in downward rounding)
* a bit faster (here a single call is about 131ns vs 162ns)
* a bit more precise (at most 1ulp error on 1M uniform random
samples in [0,1), the old formula gave some 2ulp errors as well)
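both formulas follow from standard identities, with theta = acos(x) and -1 <= x <= 1. the short derivation below is added for reference only. the fabs in the new formula presumably guards against sqrt returning -0 when the product (1-x)*(1+x) evaluates to -0, which can happen under downward rounding.

```latex
\begin{aligned}
&\theta = \arccos x,\qquad \cos\theta = x,\qquad
  \sin\theta = \sqrt{1-x^2} = \sqrt{(1-x)(1+x)} \ge 0 \\
&\text{old:}\quad \tan\tfrac{\theta}{2}
  = \sqrt{\frac{1-\cos\theta}{1+\cos\theta}}
  = \frac{\sqrt{1-x}}{\sqrt{1+x}}
  \;\Longrightarrow\;
  \theta = 2\,\operatorname{atan2}\!\left(\sqrt{1-x},\ \sqrt{1+x}\right) \\
&\text{new:}\quad \theta = \operatorname{atan2}(\sin\theta,\ \cos\theta)
  = \operatorname{atan2}\!\left(\sqrt{(1-x)(1+x)},\ x\right)
\end{aligned}
```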
musl does not support legacy 32-bit-off_t whatsoever. off_t is always
64 bit, and correct programs that use off_t and the standard functions
will just work out of the box. (on glibc, they would require
-D_FILE_OFFSET_BITS=64 to work.) however, some programs instead define
_LARGEFILE64_SOURCE and use alternate versions of all the standard
types and functions with "64" appended to their names.
we do not want code to actually get linked against these functions
(it's ugly and inconsistent), so macros are used instead of prototypes
with weak aliases in the library itself. eventually the weak aliases
may be added at the library level for the sake of using code that was
originally built against glibc, but the macros will still be the
desired solution in the headers.
pthread structure has been adjusted to match the glibc/GCC abi for
where the canary is stored on i386 and x86_64. it will need variants
for other archs to provide the added security of the canary's entropy,
but even without that it still works as well as the old "minimal" ssp
support. eventually such changes will be made anyway, since they are
also needed for GCC/C11 thread-local storage support (not yet
implemented).
care is taken not to attempt initializing the thread pointer unless
the program actually uses SSP (by reference to __stack_chk_fail).
hopefully the annoyance of this will be minimal. these files all
define internal interfaces which can change at any time; if different
modules are using different versions of the interfaces, the library
will badly break. ideally we would scan and add the dependency only
for C files that actually reference the affected interfaces, but for
now, err on the side of caution and force a rebuild of everything if
any of them have changed.
this commit is in preparation for the upcoming ssp overhaul commit,
which will change internals of the pthread struct.
looks like nik copied these "extra arguments" from the i386 code.
they're not actually arguments there, just 1-byte instructions to
make sure the stack is aligned to 16 bytes after all the other
arguments are pushed. since each push is 8 bytes on x86_64, they
happened to have no effect here, but their presence is confusing and a
minor waste of space.
it does not work; after further consideration, a separate Scrt1.s for
pie really is essential. it would be nice if the unified approach
worked, but the linker fails to generate the correct PLT entries and
instead puts textrels in the main program, which don't work because
the kernel maps the text read-only.
new Scrt1.s will be committed soon in place of this.
these are POSIX 2008 (previously GNU extension) functions that are
rarely used. apparently they had never been tested before, since the
end-of-string logic was completely missing. mbsnrtowcs is used by
modern versions of bash for its glob implementation, and this bug
was causing tab completion to hang in an infinite loop.
these were at best of limited usefulness (for bootstrapping new
systems, mainly) and at worst caused real kernel headers to get
overwritten when upgrading libc.
in case they're needed by anyone, the exact same files are now
available in a new git repository:
git://git.etalabs.net/mini-lkh
the major change here is that CFLAGS is now a variable that can be
changed entirely under user control, without causing essential flags
to be lost. previously, "CFLAGS += ..." was valid in config.mak, but
using "CFLAGS = ..." in config.mak would have badly broken the build
process unless the user took care to copy the necessary flags out of
the main Makefile.
I have also added a distclean target that removes config.mak.
as far as I can tell, it's not useful and never was. I wrote it way
back under the assumption that non-weak symbols in the POSIX or
extension namespace could conflict with legitimate uses of the same
symbol name in the main program or other libraries, but that does not
seem to be the case.
this is a nonstandard function so it's not clear what conditions it
should satisfy. my intent is that it be fast and exact for positive
integral exponents when the result fits in the destination type, and
fast and correctly rounded for small negative integral exponents.
otherwise we aim for at most 1ulp error; it seems to differ from pow
by at most 1ulp and it's often 2-5 times faster than pow.
this caused misreading of certain floating point values that are exact
multiples of large powers of ten, unpredictable depending on prior
stack contents.
unlike the old one, this one's algorithm does not suffer from
potential stack overflow issues or pathologically bad performance on
certain patterns. instead of backtracking, it uses a matching
algorithm which I have not seen before (unsure whether I invented or
re-invented it) that runs in O(1) space and O(nm) time. it may be
possible to improve the time to O(n), but not without significantly
greater complexity.
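to make the complexity claim concrete, here is a sketch of a non-backtracking matcher with the same bounds. it is NOT the algorithm from this commit, whose details are not given here; it is the well-known "remember the last star and retry one position later" technique, restricted to the '*' and '?' wildcards with no bracket expressions or escapes. wild_match is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* returns 1 if str matches pat, else 0. uses two saved pointers and no
 * recursion (O(1) space); each retry advances the resume point by one
 * character, so the total work is bounded by O(n*m) */
int wild_match(const char *pat, const char *str)
{
	const char *star = NULL;    /* position of the most recent '*' */
	const char *resume = NULL;  /* where in str that '*' match ends */
	assert(pat && str);
	while (*str) {
		if (*pat == '*') {
			star = pat++;       /* first try: '*' matches nothing */
			resume = str;
		} else if (*pat == '?' || *pat == *str) {
			pat++;
			str++;
		} else if (star) {
			pat = star + 1;     /* let the last '*' absorb one more */
			str = ++resume;
		} else {
			return 0;
		}
	}
	while (*pat == '*') pat++;  /* trailing stars match the empty string */
	return !*pat;
}
```

only the most recent star ever needs to be revisited, because once a later star is reached, any longer match for an earlier star could equally be absorbed by the later one.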
an invalid bracket expression must be treated as if the opening
bracket were just a literal character. this is to fix a bug whereby
POSIX left the behavior of the "[" shell command undefined due to it
being an invalid bracket expression.
the code is written to pre-init the thread pointer in static linked
programs that pull in __stack_chk_fail or dynamic-linked programs that
lookup the symbol. no explicit canary is set; the canary will be
whatever happens to be in the thread structure at the offset gcc
hard-coded. this can be improved later.
i did some testing trying to switch malloc to use the new internal
lock with priority inheritance, and my malloc contention test got
20-100 times slower. if priority inheritance futexes are this slow,
it's simply too high a price to pay for avoiding priority inversion.
maybe we can consider them somewhere down the road once the kernel
folks get their act together on this (and preferably don't link it to
glibc's inefficient lock API)...
as such, i've switched __lock to use malloc's implementation of
lightweight locks, and updated all the users of the code to use an
array with a waiter count for their locks. this should give optimal
performance in the vast majority of cases, and it's simple.
malloc is still using its own internal copy of the lock code because
it seems to yield measurably better performance with -O3 when it's
inlined (20% or more difference in the contention stress test).
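a simplified sketch of a lock of this shape (a lock word plus a waiter count in a two-element array), written with C11 atomics and the raw Linux futex syscall. it is not the musl code: sketch_lock and sketch_unlock are hypothetical names, there is no spinning phase, and it assumes atomic_int has the same size and representation as int.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* l[0] is the lock word (0 free, 1 held); l[1] counts sleeping waiters */
void sketch_lock(atomic_int *l)
{
	while (atomic_exchange(&l[0], 1)) {
		atomic_fetch_add(&l[1], 1);
		/* the kernel puts us to sleep only if l[0] is still 1 */
		syscall(SYS_futex, (int *)&l[0], FUTEX_WAIT, 1, NULL, NULL, 0);
		atomic_fetch_sub(&l[1], 1);
	}
}

void sketch_unlock(atomic_int *l)
{
	atomic_store(&l[0], 0);
	/* an uncontended unlock is a plain store with no syscall */
	if (atomic_load(&l[1]))
		syscall(SYS_futex, (int *)&l[0], FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

the waiter count is what lets the common uncontended unlock skip the wake syscall entirely.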
this bug probably would have gone unnoticed since it's only used in
the fallback code for systems where priority-inheritance locking
fails. unfortunately this approach results in one spurious wake
syscall on the final unlock, when there are no waiters remaining. the
alternative (possibly better) would be to use broadcast wakes instead
of reflagging the waiter unconditionally, and let each waiter reflag
itself; this saves one syscall at the expense of invoking the
"thundering herd" effect (worse performance degradation) when there
are many waiters.
ideally we would be able to update all of our locks to use an array of
two ints rather than a single int, and use a separate counter system
like proper mutexes use; then we could avoid all spurious wake calls
without resorting to broadcasts. however, it's not clear to me that
priority inheritance futexes support this usage. the kernel sets the
waiters flag for them (just like we're doing now) and i can't tell if
it's safe to bypass the kernel when unlocking just because we know
(from private data, the waiter count) that there are no waiters. this
is something that could be explored in the future.
we use priority inheritance futexes if possible so that the library
cannot hit internal priority inversion deadlocks in the presence of
realtime priority scheduling (full support to be added later).
i tried to go with improving the old binary-search-based algorithm,
but between growth in the number of ranges, bad performance, and lack
of confidence in the binary search code's stability under changes in
the table, i decided it was worth the extra 1.8k to have something
clean and maintainable.
also note that, like the alpha and punct tables, there's definitely
room to optimize the nonspacing/wide tables by overlapping subtables.
this is not a high priority, but i've begun looking into how to do it,
and i suspect the table sizes can be roughly halved. if that turns out
to be true, the new, fast, table-based implementation will be roughly
the same size as if i had just extended the old binary search one.
also special-case ß (U+00DF) as lowercase even though it does not have
a mapping to uppercase. unicode added an uppercase version of this
character but does not map it, presumably because the uppercase
version is not actually used except for some obscure purpose...
alpha is defined as unicode property "Alphabetic" plus category Nd
minus ASCII digits minus 2 special-cased Thai punctuation marks
supposedly misclassified by Unicode as letters.
punct is defined as all of unicode except control, alphanumeric, and
space characters.
the tables were generated by a simple tool based on the code posted
previously to the mailing list. in the future, this and other code
used for maintaining locale/iconv/i18n data will be published either
in the main source repository or in a separate locale data generation
repository.
note that dlerror is specified to be non-thread-safe, so no locking is
performed on the error flag or message aside from the rwlock already
held by dlopen or dlsym. if 2 invocations of dlsym are generating
errors at the same time, they could clobber each other's results, but
the resulting string, albeit corrupt, will still be null-terminated.
any use of dlerror in such a situation could not be expected to give
meaningful results anyway.
the _concept_ of this wrapper has been tested extensively, but the
integration with the build/install system, and using a persistent
specfile rather than one generated at build-time, have not been
heavily tested and may need minor tweaks.
this approach should be a lot more robust (and easier to improve) than
writing a shell script that's responsible for trying to mimic gcc's
logic about whether it's compiling or linking, building shared libs or
executable files, etc. it's also lighter weight and should result in
mildly faster builds when using the wrapper.
care is taken that the setting of errno correctly reflects underflow
condition. scanning exact denormal values does not result in ERANGE,
nor does scanning values (such as the usual string definition of
FLT_MIN) which are actually less than the smallest normal number but
which round to a normal result.
only the decimal case is handled so far; hex floats require a separate
fix to come later.
in principle this should just be an optimization, but it happens to
also fix a nasty bug where values like 0.00000000001 were getting
caught by the early zero detection path and wrongly scanned as zero.
- add the rest of the junk traditionally in sys/param.h
- add prototypes for some nonstandard functions
- add _GNU_SOURCE to their source files so the compiler can check proto
this code worked in strtod, but not in scanf. more evidence that i
should design a better interface for discarding multiple tail
characters than just calling unget repeatedly...
at this point, strto* and all scanf family functions are using the new
unified integer and floating point parser/converter code.
the wide scanf is largely a wrapper for ordinary byte-based scanf;
since numbers can only contain ascii characters, only strings need to
be handled specially.
vfprintf temporarily swaps in a local buffer (for the duration of the
operation) when the target stream is unbuffered; this both simplifies
the implementation of functions like dprintf (they don't need their
own buffers) and eliminates the pathologically bad performance of
writing the formatted output with one or more write syscalls per
formatting field.
in cases like dprintf where we are dealing with a virgin FILE
structure, everything worked correctly. however for long-lived files
(like stderr), it's possible that the buffer bounds were already set
for the internal zero-size buffer. on the next write, __stdio_write
would pick up and use the new buffer provided by vfprintf, but the
bound (wend) field was still pointing at the internal zero-size
buffer's end. this in turn allowed unbounded writes to the temporary
buffer.
the l prefix is redundant/no-op with printf, since default promotions
always promote floats to double; however, it is valid, and printf was
wrongly rejecting it.
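a quick way to check this on any conforming printf, using the hypothetical helper l_prefix_is_noop:

```c
#include <stdio.h>
#include <string.h>

/* format x with "%f" and with "%lf"; returns 1 if both conversions
 * succeed and produce identical text, 0 otherwise */
int l_prefix_is_noop(double x)
{
	char a[64], b[64];
	int na = snprintf(a, sizeof a, "%f", x);
	int nb = snprintf(b, sizeof b, "%lf", x);
	return na > 0 && na == nb && strcmp(a, b) == 0;
}
```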
shunget cannot unget eof status, causing wcstol to leave endptr
pointing to the wrong place when scanning, for example, L"0x". cheap
fix is to make the read function provide an infinite stream of bogus
characters rather than eof. really this is something of a design flaw
in how the shgetc system is used for strto* and wcsto*; in the long
term, I believe multi-character unget should be scrapped and replaced
with a function that can subtract from the f->shcnt counter.
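the expected endptr behavior can be observed with a small helper (wcstol_consumed is a hypothetical name). for L"0x" in base 16, only the leading "0" forms a valid subject sequence, so the end pointer should be left at the 'x'.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* number of wide characters wcstol consumes from s in the given base */
size_t wcstol_consumed(const wchar_t *s, int base)
{
	wchar_t *end;
	assert(s);
	(void)wcstol(s, &end, base);
	return (size_t)(end - s);
}
```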
advantages over the old code:
- correct results for floating point (old code was bogus)
- wide/regular scanf separated so scanf does not pull in wide code
- well-defined behavior on integers that overflow dest type
- support for %[a-b] ranges with %[ (impl-defined but widely used)
- no intermediate conversion of fmt string to wide string
- cleaner, easier to share code with strto* functions
- better standards conformance for corner cases
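as an illustration of the %[a-b] range item, the sketch below relies on the implementation treating "a-c" inside %[ as a range. the C standard leaves that implementation-defined, so this is an assumption about the libc in use. scan_abc_prefix is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* copy the leading run of characters in 'a'..'c' from in to out
 * (out must hold at least 16 bytes); returns the run length */
int scan_abc_prefix(const char *in, char *out)
{
	assert(in && out);
	out[0] = 0;
	/* the field width 15 keeps the write within the 16-byte buffer */
	if (sscanf(in, "%15[a-c]", out) != 1) return 0;
	return (int)strlen(out);
}
```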
the old code remains in the source tree, as the wide versions of the
scanf-family functions are still using it. it will be removed when no
longer needed.
I'm not sure if it's legal for wordexp to modify this field, but this
is the only easy/straightforward fix, and applications should not
care. if it's an issue, i can work out a different (but more complex)
solution later.
this off-by-one error was causing values with just one digit past the
decimal point to be treated by the integer case. in many cases it
would yield the correct result, but if expressions are evaluated in
excess precision, double rounding may occur.
fcntl values 1024 and up are universal, arch-independent. later I'll
add some of the other linux-specific ones for notify, leases, pipe
size, etc. here too.
the "< 0" test was always false due to use of an unsigned type. this
resulted in infinite loops on 32-bit machines (adding -1U to a pointer
is the same as adding -1) and crashes on 64-bit machines (offsetting
the string pointer by 4gb-1b when an illegal sequence was hit).
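the general pattern behind this kind of bug, shown with a toy decoder rather than the real multibyte code (toy_decode_len and valid_prefix_len are hypothetical names): an error reported as (size_t)-1 has to be compared against that sentinel, because an unsigned value is never less than zero.

```c
#include <assert.h>
#include <stddef.h>

/* toy decoder: 1 for an ASCII byte, (size_t)-1 for anything else */
size_t toy_decode_len(unsigned char c)
{
	return c < 0x80 ? 1 : (size_t)-1;
}

/* count the bytes before the first invalid one or the terminator */
size_t valid_prefix_len(const char *s)
{
	size_t n = 0;
	assert(s);
	while (s[n]) {
		size_t l = toy_decode_len((unsigned char)s[n]);
		/* "l < 0" could never be true here since l is unsigned;
		 * without this check, n += l would add a huge offset */
		if (l == (size_t)-1) break;
		n += l;
	}
	return n;
}
```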
this is legal since sa_* is in the reserved namespace for signal.h,
per posix. note that the sa_restorer field is not used anywhere, so
programs that are trying to use it may still break, but at least
they'll compile. if it turns out such programs actually need to be
able to set their own sa_restorer to function properly, i'll add the
necessary code to sigaction.c later.
TRE wants to treat + and ? after a +, ?, or * as special; ? means
ungreedy and + is reserved for future use. however, this is
non-conformant. although redundant, these characters have
well-defined (no-op) meaning for POSIX ERE, and are actually _literal_
characters (which TRE is wrongly ignoring) in POSIX BRE mode.
the simplest fix is to remove the unneeded nonstandard
functionality. as a plus, this shaves off a small amount of bloat.
at -Os optimization level, gcc refuses to inline these functions even
though the inlined code would be roughly the same size as the function
call, and much faster. the easy solution is to make them into macros.
whenever the base was small enough that more than one digit could
still fit after UINTMAX_MAX/36-1 was reached, only the first would be
allowed; subsequent digits would trigger spurious overflow, making it
impossible to read the largest values in low bases.
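a sketch of an exact per-digit overflow test that lets the largest values be read in every base. this is an illustration, not the code from this commit; parse_umax and digit_value are hypothetical names, and the letter ranges are assumed to be contiguous as in ASCII.

```c
#include <assert.h>
#include <stdint.h>

/* value of c as a digit, or -1 if it is not alphanumeric */
static int digit_value(int c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'z') return c - 'a' + 10;
	if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
	return -1;
}

/* parse leading digits of s in the given base; on overflow, set
 * *overflow to 1 and return UINTMAX_MAX */
uintmax_t parse_umax(const char *s, int base, int *overflow)
{
	uintmax_t val = 0;
	assert(s && overflow && base >= 2 && base <= 36);
	*overflow = 0;
	for (; *s; s++) {
		int d = digit_value((unsigned char)*s);
		if (d < 0 || d >= base) break;
		/* val*base + d fits exactly when val <= (MAX - d) / base */
		if (val > (UINTMAX_MAX - (uintmax_t)d) / (uintmax_t)base) {
			*overflow = 1;
			return UINTMAX_MAX;
		}
		val = val * (uintmax_t)base + (uintmax_t)d;
	}
	return val;
}
```

because the bound is recomputed for the actual base and digit, no fixed base-36 threshold can cut a low-base number short.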
when upscaling, even the very last digit is needed in cases where the
input is exact; no digits can be discarded. but when downscaling, any
digits less significant than the mantissa bits are destined for the
great bitbucket; the only influence they can have is their presence
(being nonzero). thus, we simply throw them away early. the result is
nearly a 4x performance improvement for processing huge values.
the particular threshold LD_B1B_DIG+3 is not chosen sharply; it's
simply a "safe" distance past the significant bits. it would be nice
to replace it with a sharp bound, but i suspect performance will be
comparable (within a few percent) anyway.
now that this is the first operation, it can rely on the circular
buffer contents not being wrapped when it begins. we limit the number
of digits read slightly in the initial parsing loops too so that this
code does not have to consider the case where it might cause the
circular buffer to wrap; this is perfectly fine because KMAX is chosen
as a power of two for circular-buffer purposes and is much larger than
it otherwise needs to be, anyway.
these changes should not affect performance at all.
upscaling by even one step too much creates 3-29 extra iterations for
the next loop. this is still suboptimal since it always goes by 2^29
rather than using a smaller upscale factor when nearing the target,
but performance on common, small-magnitude, few-digit values has
already more than doubled with this change.
more optimizations on the way...
for example, "1000000000" was being read as "1" due to this loop
exiting early. it's necessary to actually update z and zero the
entries so that the subsequent rounding code does not get confused;
before i did that, spurious inexact exceptions were being raised.
note that there's no need for a precise cutoff, because exponents this
large will always result in overflow or underflow (it's impossible to
read enough digits to compensate for the exponent magnitude; even at a
few nanoseconds per digit it would take hundreds of years).
the immediate benefit is a significant debloating of the float parsing
code by moving the responsibility for keeping track of the number of
characters read to a different module.
by linking shgetc with the stdio buffer logic, counting logic is
deferred to buffer refill time, keeping the calls to shgetc fast and
light.
in the future, shgetc will also be useful for integrating the new
float code with scanf, which needs to not only count the characters
consumed, but also limit the number of characters read based on field
width specifiers.
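a minimal sketch of the idea, not the actual shgetc code: the struct and function names below are hypothetical, and a plain in-memory buffer stands in for the stdio buffer, so there is no refill step. the fast path is a single pointer comparison, the character count is derived from a pointer difference only when asked for, and a field-width limit is imposed by clamping the end pointer.

```c
#include <stddef.h>

/* hypothetical scanning state over an in-memory buffer */
struct scan {
    const unsigned char *buf;   /* start of the buffer */
    const unsigned char *rpos;  /* next byte to read */
    const unsigned char *shend; /* read limit: end of data or field width */
};

/* lim < 0 means "no limit" */
void scan_init(struct scan *s, const char *data, size_t len, long lim)
{
    s->buf = s->rpos = (const unsigned char *)data;
    if (lim >= 0 && (size_t)lim < len) len = (size_t)lim;
    s->shend = s->buf + len;
}

/* fast path: one comparison, no counter update */
int scan_getc(struct scan *s)
{
    return s->rpos < s->shend ? *s->rpos++ : -1;
}

/* the count is computed from pointers only when it is needed */
long scan_count(const struct scan *s)
{
    return (long)(s->rpos - s->buf);
}
```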
shgetc may also become a useful tool for simplifying the integer
parsing code.
this version is intended to be fully conformant to the ISO C, POSIX,
and IEEE standards for conversion of decimal/hex floating point
strings to float, double, and long double (ld64 or ld80 only at
present) values. in particular, all results are intended to be rounded
correctly according to the current rounding mode. further, this
implementation aims to set the floating point underflow, overflow, and
inexact flags to reflect the conversion performed.
a moderate amount of testing has been performed (by nsz and myself)
prior to integration of the code in musl, but it still may have bugs.
so far, only strto(d|ld|f) use the new code. scanf integration will be
done as a separate commit, and i will add implementations of the wide
character functions later.
gcc makes this mapping by default anyway, but it will be disabled by
-fno-builtin (and presumably by -std=c99 or similar). for the main
program the error will be reported by the linker, and the issue can
easily be fixed, but for dynamic-loaded so files, the error cannot be
detected until dlopen time, at which point it has become very obscure.
when the "r" (register) constraint is used to let gcc choose a
register, gcc will sometimes assign the same register that was used
for one of the other fixed-register operands, if it knows the values
are the same. one common case is multiple zero arguments to a syscall.
this horribly breaks the intended usage, which is swapping the GOT
pointer from ebx into the temp register and back to perform the
syscall.
presumably there is a way to fix this with advanced usage of register
constraints on the inline asm, but having bad memories about hellish
compatibility issues with different gcc versions, for the time being
i'm just going to hard-code specific registers to be used. this may
hurt the compiler's ability to optimize, but it will fix serious
miscompilation issues.
so far the only function i know was compiled incorrectly is
getrlimit.c, and naturally the bug only applies to shared (PIC)
builds, but it may be more extensive and may have gone undetected..
the buffer in getaddrinfo really only matters when /etc/hosts is huge,
but in that case, the huge number of syscalls resulting from a tiny
buffer would seriously impact the performance of every name lookup.
the buffer in __dns.c has also been enlarged a bit so that typical
resolv.conf files will fit fully in the buffer. there's no need to
make it so large as to dominate the syscall overhead for large files,
because resolv.conf should never be large.
special care is taken to avoid any inexact computations when either arg
is zero (in which case the exact absolute value of the other arg
should be returned) and to support the special condition that
hypot(±inf,nan) yields inf.
hypotl is not yet implemented since avoiding overflow is nontrivial.
the error status is required to be sticky after failure of dlopen or
dlsym until cleared by dlerror. applications and especially libraries
should never rely on this since it is not thread-safe and subject to
race conditions, but glib does anyway.
the old formula atan2(1,sqrt((1+x)/(1-x))) was faster but
could give nan result at x=1 when the rounding mode is
FE_DOWNWARD (so 1-1 == -0 and 2/-0 == -inf), the new formula
gives -0 at x=+-1 with downward rounding.
DECIMAL_DIG is not the same as LDBL_DIG
type_DIG is the maximum number of decimal digits that can survive a
round trip from decimal to type and back to decimal.
DECIMAL_DIG is the minimum number of decimal digits required in order
for any floating point type to survive the round trip to decimal and
back, and it is generally larger than LDBL_DIG. since the exact
formula is non-trivial, and defining it larger than necessary may be
legal but wasteful, just define the right value in bits/float.h.
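for reference, for a binary type with p mantissa bits the C standard gives type_DIG as floor((p-1)*log10(2)) and the round-trip digit count as ceil(1+p*log10(2)). the sketch below evaluates both with integer arithmetic, approximating log10(2) by 30103/100000. the function names are hypothetical, and the approximation is only claimed to hold for the common precisions 24, 53, 64 and 113.

```c
/* log10(2) ~= 0.30103, kept as a scaled integer to avoid libm */
#define LOG10_2_NUM 30103L
#define LOG10_2_DEN 100000L

/* digits that survive decimal -> type -> decimal: floor((p-1)*log10(2)) */
int type_dig(int p)
{
    return (int)((p - 1) * LOG10_2_NUM / LOG10_2_DEN);
}

/* digits needed for type -> decimal -> type: ceil(1 + p*log10(2)) */
int roundtrip_dig(int p)
{
    return 1 + (int)((p * LOG10_2_NUM + LOG10_2_DEN - 1) / LOG10_2_DEN);
}
```

for ld80 (p=64) this yields 18 for LDBL_DIG and 21 for DECIMAL_DIG, which shows the gap between the two.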
this has not been tested heavily, but it's known to at least assemble
and run in basic usage cases. it's nearly identical to the
corresponding i386 code, and thus expected to be just as correct or
just as incorrect.
the main practical results of this change are
1. the regex code is no longer subject to LGPL; it's now 2-clause BSD
2. most (all?) popular nonstandard regex extensions are supported
I hesitate to call this a "sync" since both the old and new code are
heavily modified. in one sense, the old code was "more severely"
modified, in that it was actively hostile to non-strictly-conforming
expressions. on the other hand, the new code has eliminated the
useless translation of the entire regex string to wchar_t prior to
compiling, and now only converts multibyte character literals as
needed.
in the future i may use this modified TRE as a basis for writing the
long-planned new regex engine that will avoid multibyte-to-wide
character conversion entirely by compiling multibyte bracket
expressions specific to UTF-8.
old code saved/restored the fenv (the new code is only as slow
as that when inexact is not set before the call, but some other
flag is set and the rounding is inexact, which is rare)
before:
bench_nearbyint_exact 5000000 N 261 ns/op
bench_nearbyint_inexact_set 5000000 N 262 ns/op
bench_nearbyint_inexact_unset 5000000 N 261 ns/op
after:
bench_nearbyint_exact 10000000 N 94.99 ns/op
bench_nearbyint_inexact_set 25000000 N 65.81 ns/op
bench_nearbyint_inexact_unset 10000000 N 94.97 ns/op
the fscale instruction is slow everywhere, probably because it
involves a costly and unnecessary integer truncation operation that
ends up being a no-op in common usages. instead, construct a floating
point scale value with integer arithmetic and simply multiply by it,
when possible.
for float and double, this is always possible by going to the
next-larger type. we use some cheap but effective saturating
arithmetic tricks to make sure even very large-magnitude exponents
fit. for long double, if the scaling exponent is too large to fit in
the exponent of a long double value, we simply fall back to the
expensive fscale method.
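a rough C rendering of the float case, as a sketch only (the change itself is in asm, and the names here are hypothetical). it assumes double is IEEE 754 binary64, and it uses plain clamping where the real code uses its own saturating tricks. the scale factor 2^n is assembled from its bit pattern, and the multiply is done in the next-larger type so that only the final conversion rounds.

```c
#include <stdint.h>
#include <string.h>

/* build the double 2^n from its bit pattern; n is clamped to the
 * normal exponent range of double, which is far wider than float's */
double pow2_as_double(int n)
{
    if (n > 1023) n = 1023;
    if (n < -1022) n = -1022;
    uint64_t bits = (uint64_t)(n + 1023) << 52; /* biased exponent field */
    double scale;
    memcpy(&scale, &bits, sizeof scale);
    return scale;
}

/* scale in double, then round once when converting back to float */
float scalbnf_sketch(float x, int n)
{
    return (float)(x * pow2_as_double(n));
}
```

clamping is harmless here because any float scaled by 2^1023 already overflows float, and any float scaled by 2^-1022 already underflows it.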
on atom cpu, these changes speed up scalbn by over 30%. (min rdtsc
timing dropped from 110 cycles to 70 cycles.)
exponents (base 2) near 16383 were broken due to (1) wrong cutoff, and
(2) inability to fit the necessary range of scalings into a long
double value.
as a solution, we fall back to using frndint/fscale for insanely large
exponents, and also have to special-case infinities here to avoid
inf-inf generating nan.
thankfully the costly code never runs in normal usage cases.
zero, one, two, half are replaced by const literals
The policy was to use the f suffix for float consts (1.0f),
but don't use suffix for long double consts (these consts
can be exactly represented as double).
Underflow exception is only raised when the result is
inexact, but fmod is always exact. x87 has a denormalization
exception, but that's nonstandard. And the superfluous *1.0
will be optimized away by any compiler that does not honor
signaling nans.
Some code assumed ldexp(x, 1) is faster than 2.0*x,
but ldexp is a wrapper around scalbn which uses
multiplications inside, so this optimization is
wrong.
This commit also fixes fmal which accidentally
used ldexp instead of ldexpl, losing precision.
There are various additional changes from the
work-in-progress const cleanups.
Some long double consts were stored in two doubles as a workaround
for x86_64 and i386 with the following comment:
/* Long double constants are slow on these arches, and broken on i386. */
This is most likely an old gcc bug related to the default x87 fpu
precision setting (it's double instead of double extended on BSD).
up to 30% faster exp2 by avoiding slow frndint and fscale functions.
expm1 also takes a much more direct path for small arguments (the
expected usage case).
unlike some implementations, these functions perform the equivalent of
gcc's -ffloat-store on the result before returning. this is necessary
to raise underflow/overflow/inexact exceptions, perform the correct
rounding with denormals, etc.
unlike trig functions, these are easy to do in asm because they do not
involve (arbitrary-precision) argument reduction. fpatan automatically
takes care of domain issues, and in asin and acos, fsqrt takes care of
them for us.
infinities were getting converted into nans. the new code simply tests
for infinity and replaces it with a large magnitude value of the same
sign.
also, the fcomi instruction is apparently not part of the i387
instruction set, so avoid using it.
these are functions that have direct fpu approaches to implementation
without problematic exception or rounding issues. x86_64 lacks
float/double versions because i'm unfamiliar with the necessary sse
code for performing these operations.
A faster workaround for spurious inexact exceptions
when the result cannot be represented. The old code
actually could be wrong, because gcc reordered the
integer conversion and the exception check.
Note that the new fesetround has slightly different semantics:
Storing the floating-point environment with fnstenv makes the
next fldenv (or fldcw) "non-signaling", so unmasked and pending
exceptions do not invoke the exception handler.
(These are rare since exceptions are handled immediately and by
default all exceptions are masked anyway. But if one manually
unmasks an exception in the control word then either sets the
corresponding exception flag in the status word or the execution
of an exception raising floating-point operation gets interrupted
then it may happen).
So the old implementation did not trap in some rare cases
where the new implementation traps.
However POSIX does not specify anything like the x87 exception
handling traps and the fnstenv/fldenv pair is significantly slower
than the fnstcw/fldcw pair (new code is about 5x faster here and
it's dominated by the function call overhead).
this is necessary to support archs where fenv is incomplete or
unavailable (presently arm). fma, fmal, and the lrint family should
work perfectly fine with this change; fmaf is slightly broken with
respect to rounding as it depends on non-default rounding modes to do
its work.
a double precision nan, when converted to extended (80-bit) precision,
will never end in 0x400, since the corresponding bits do not exist in
the original double precision value. thus there's no need to waste
time and code size on this check.
the fsqrt opcode is correctly rounded, but only in the fpu's selected
precision mode, which is 80-bit extended precision. to get a correctly
rounded double precision output, we check for the only corner cases
where two-step rounding could give different results than one-step
(extended-precision mantissa ending in 0x400) and adjust the mantissa
slightly in the opposite direction of the rounding which the fpu
already did (reported in the c1 flag of the fpu status word).
this should have near-zero cost in the non-corner cases and at worst
very low cost.
note that in order for sqrt() to get used when compiling with gcc, the
broken, non-conformant builtin sqrt must be disabled.
other cases with %x were probably broken too.
I would actually like to go ahead and replace this code in scanf with
calls to the new __intparse framework, but for now this calls for a
quick and unobtrusive fix without the risk of breaking other things.
thanks to the hard work of Szabolcs Nagy (nsz), identifying the best
(from correctness and license standpoint) implementations from freebsd
and openbsd and cleaning them up! musl should now fully support c99
float and long double math functions, and has near-complete complex
math support. tgmath should also work (fully on gcc-compatible
compilers, and mostly on any c99 compiler).
based largely on commit 0376d44a890fea261506f1fc63833e7a686dca19 from
nsz's libm git repo, with some additions (dummy versions of a few
missing long double complex functions, etc.) by me.
various cleanups still need to be made, including re-adding (if
they're correct) some asm functions that were dropped.
the previous version not only failed to work in c++, but also failed
to produce constant expressions, making the macros useless as
initializers for objects of static storage duration.
gcc 3.3 and later have builtins for these, which sadly seem to be the
most "portable" solution. the alternative definitions produce
exceptions (for NAN) and compiler warnings (for INFINITY) on newer
versions of gcc.
this is a popular extension some programs depend on, and by using a
temporary buffer and strdup rather than malloc prior to the syscall,
i've avoided the dependency on free and thus minimized the bloat cost
of supporting this feature.
this was discussed on the mailing list and no consensus on the
preferred solution was reached, so in anticipation of a release, i'm
just committing a minimally-invasive solution that avoids the problem
by ensuring that multi-threaded-capable programs will always have
initialized the thread pointer before any signal handler can run.
in the long term we may switch to initializing the thread pointer at
program start time whenever the program has the potential to access
any per-thread data.
GNU programs may expect the GNU version of basename, which has a
different prototype (argument is const-qualified) and prototype it
themselves too. of course if they're expecting the GNU behavior for
the function, they'll still run into problems, but at least this
eliminates some compile-time failures.
in gcc 3, the visibility attribute must be placed on both the
declaration and on the definition. if it's omitted from the
definition, the compiler fails to emit the ".hidden" directive in the
assembly, and the linker will either generate textrels (if supported,
such as on i386) or refuse to link (on targets where certain types of
textrels are forbidden or impossible without further assumptions about
memory layout, such as on x86_64).
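as an illustration (the function name is hypothetical, and gcc-compatible attribute syntax is assumed), the attribute is spelled out on both the declaration and the definition:

```c
/* declaration, as it might appear in an internal header */
__attribute__((__visibility__("hidden"))) int internal_helper(int);

/* definition: the attribute is repeated so that gcc 3 also emits
 * the .hidden directive for the symbol */
__attribute__((__visibility__("hidden"))) int internal_helper(int x)
{
    return x + 1;
}
```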
this patch also unifies the decision about when to use visibility into
libc.h and makes the visibility in the utf-8 state machine tables
based on libc.h rather than a duplicate test.
1. don't try to install (and thus build) shared libs when they were
disabled in config.mak
2. ensure that the path for the dynamic linker exists before
attempting to install it.
even if pthread_create/exit code is not linked, run flag needs to be
checked and cleanup function potentially run on pop. thus, move the
code to the module that's always linked when pthread_cleanup_push/pop
is used.
the old abi was intended to duplicate glibc's abi at the expense of
being ugly and slow, but it turns out glib was not even using that abi
except on non-gcc-compatible compilers (which it doesn't even support)
and was instead using an exceptions-in-c/unwind-based approach whose
abi we could not duplicate anyway without nasty dwarf2/unwind
integration.
the new abi is copied from a very old glibc abi, which seems to still
be supported/present in current glibc. it avoids all unwinding,
whether by sjlj or exceptions, and merely maintains a linked list of
cleanup functions to be called from the context of pthread_exit. i've
taken some care to ensure that longjmp out of a cleanup function should
work, even though it is not required to.
this change breaks abi compatibility with programs which were using
pthread cancellation, which is unfortunate, but that's why i'm making
the change now rather than later. considering that most pthread
features have not been usable until recently anyway, i don't see it as
a major issue at this point.
i'm not sure that it's "correct" for dlopen to block cancellation
when calling constructors for libraries it loads, but it sure seems
like the right thing. in any case, dlopen itself needs cancellation
blocked.
note that it still will have the standards-conformant behavior, not
the GNU behavior. but at least this prevents broken code from ending
up with truncated pointers due to implicit declarations...
per 7.18.4: Each invocation of one of these macros shall expand to an
integer constant expression suitable for use in #if preprocessing
directives. The type of the expression shall have the same type as
would an expression of the corresponding type converted according to
the integer promotions. The value of the expression shall be that of
the argument.
the key phrase is "converted according to the integer promotions".
thus there is no intent or allowance that the expression have
smaller-than-int types.
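a small check of both requirements, written as a sketch with hypothetical function names: the macros must be usable in #if, and the resulting expressions must have the promoted type, so on a conforming implementation their size is that of int.

```c
#include <stddef.h>
#include <stdint.h>

/* the constants must be usable in preprocessing directives */
#if INT8_C(1) != 1 || UINT16_C(65535) != 65535
#error "constant macros are not usable in #if"
#endif

/* the expressions have the promoted type, never a sub-int type */
size_t int8_c_size(void)   { return sizeof(INT8_C(1)); }
size_t uint8_c_size(void)  { return sizeof(UINT8_C(1)); }
size_t int16_c_size(void)  { return sizeof(INT16_C(1)); }
size_t uint16_c_size(void) { return sizeof(UINT16_C(1)); }
```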
this is mainly in hopes of supporting c++ (not yet possible for other
reasons) but will also help applications/libraries which use (and more
often, abuse) the gcc __attribute__((__constructor__)) feature in "C"
code.
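for context, this is the kind of usage in question. the sketch assumes a gcc-compatible compiler, and the names are hypothetical; the function is expected to have run by the time main is entered.

```c
static int initialized;

/* run by the startup code before main */
__attribute__((__constructor__))
static void init_before_main(void)
{
    initialized = 1;
}

/* reports whether the constructor has already run */
int was_initialized(void)
{
    return initialized;
}
```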
x86_64 and arm versions of the new startup asm are untested and may
have minor problems.
these don't work (or do anything at all) but at least make it possible
to static link programs that insist on "having" dynamic loading
support...as long as they don't actually need to use it.
adding real support for dlopen/dlsym with static linking is going to
be significantly more difficult...
it should be noted that only the actual underlying buffer flush and
fill operations are cancellable, not reads from or writes to the
buffer. this behavior is compatible with POSIX, which makes all
cancellation points in stdio optional, and it achieves the goal of
allowing cancellation of a thread that's "stuck" on IO (due to a
non-responsive socket/pipe peer, slow/stuck hardware, etc.) without
imposing any measurable performance cost.
it was previously attempting to link start files as part of shared
objects. this is definitely wrong and depending on the platform and
linker could range from just adding extraneous junk to introducing
textrels to making linking fail entirely.
even a single-threaded program can be cancellable, e.g. if it has called
pthread_cancel(pthread_self()). the correct predicate to check is not
whether multiple threads have been invoked, but whether pthread_self
has been invoked.
this fixes an issue using gold instead of gnu ld for linking. it also
should eliminate the need of the startup code to even load/pass the
got address to the dynamic linker.
based on patch submitted by sh4rm4 with minor cosmetic changes.
further cleanup will follow.
note that regardless of the name used, basename is always conformant.
it never takes on the bogus gnu behavior, unlike glibc where basename
is nonconformant when declared manually without including libgen.h.
CHUNK_SIZE macro was defined incorrectly and shaving off at least one
significant bit in the size of mmapped chunks, resulting in the test
for oldlen==newlen always failing and incurring a syscall. fortunately
i don't think this issue caused any other observable behavior; the
definition worked correctly for all non-mmapped chunks where its
correctness matters more, since their lengths are always multiples of
the alignment.
it's a keyword in c++ (wtf). i'm not sure this is the cleanest
solution; it might be better to avoid ever defining __NEED_wchar_t on
c++. but in any case, this works for now.
musl's dynamic linker does not support unloading dsos, so there's
nothing for this function to do. adding the symbol in case anything
depends on its presence..
the fcntl syscall can return a negative value when the command is
F_GETOWN, and this is not an error code but an actual value. thus we
must special-case it and avoid calling __syscall_ret to set errno.
this fix is better than the glibc fix (using F_GETOWN_EX) which only
works on newer kernels and is more complex.
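the special case can be sketched as follows. everything here is hypothetical: ret_to_errno stands in for __syscall_ret, the is_getown flag stands for the test cmd == F_GETOWN, and the error range of the top 4095 values is an assumption about the kernel convention.

```c
#include <errno.h>

/* hypothetical stand-in for the helper that maps a raw kernel return
 * value to the libc convention of -1 with errno set */
long ret_to_errno(unsigned long r)
{
    if (r > -4096UL) {
        errno = (int)-r;
        return -1;
    }
    return (long)r;
}

/* raw is the value returned by the kernel for the fcntl call */
long fcntl_result(int is_getown, long raw)
{
    /* a negative F_GETOWN result is a value, not an error code */
    if (is_getown) return raw;
    return ret_to_errno((unsigned long)raw);
}
```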
right now it's questionable whether this change is an improvement or
not, but if we later want to support priority inheritance mutexes, it
will be important to have the code paths unified like this to avoid
major code duplication.
this simplifies the code paths slightly, but perhaps what's nicer is
that it makes recursive mutexes fully reentrant, i.e. locking and
unlocking from a signal handler works even if the interrupted code was
in the middle of locking or unlocking.
a reader unlocking the lock need only wake one waiter (necessarily a
writer), but a writer unlocking the lock must wake all waiters
(necessarily readers). if it only wakes one, the remainder can remain
blocked indefinitely, or at least until the first reader unlocks (in
which case the whole lock becomes serialized and behaves as a mutex
rather than a read lock).
there is no need to send a wake when the lock count does not hit zero,
but when it does, all waiters must be woken (since all with the same
sign are eligible to obtain the lock).
eliminate the sequence number field and instead use the counter as the
futex. because of the way the lock is held, sequence numbers are
completely useless, and this frees up a field in the barrier structure
to be used as a waiter count for the count futex, which lets us avoid
some syscalls in the best case.
as of now, self-synchronized destruction and unmapping should be fully
safe. before any thread can return from the barrier, all threads in
the barrier have obtained the vm lock, and each holds a shared lock on
the barrier. the barrier memory is not inspected after the shared lock
count reaches 0, nor after the vm lock is released.
it was assuming the result of the condition it was supposed to be
checking for, i.e. that the thread ptr had already been initialized by
pthread_mutex_lock. use the slower call to be safe.
we're not required to check this except for error-checking mutexes,
but it doesn't hurt. the new test is actually simpler/lighter, and it
also eliminates the need to later check that pthread_mutex_unlock
succeeds.
when used with error-checking mutexes, pthread_cond_wait is required
to fail with EPERM if the mutex is not locked by the caller.
previously we relied on pthread_mutex_unlock to generate the error,
but this is not valid, since in the case of such invalid usage the
internal state of the cond variable has already been potentially
corrupted (due to access outside the control of the mutex). thus, we
have to check first.
this implementation is rather heavy-weight, but it's the first
solution i've found that's actually correct. all waiters actually wait
twice at the barrier so that they can synchronize exit, and they hold
a "vm lock" that prevents changes to virtual memory mappings (and
blocks pthread_barrier_destroy) until all waiters are finished
inspecting the barrier.
thus, it is safe for any thread to destroy and/or unmap the barrier's
memory as soon as pthread_barrier_wait returns, without further
synchronization.
issue reported by nsz, but it's actually not just pedantic. the
functions can take input of any arithmetic type, including floating
point, and the behavior needs to be as if the conversion implicit in
the function call took place.
lock out new waiters during the broadcast. otherwise the wait count
added to the mutex might be lower than the actual number of waiters
moved, and wakeups may be lost.
this issue could also be solved by temporarily setting the mutex
waiter count higher than any possible real count, then relying on the
kernel to tell us how many waiters were requeued, and updating the
counts afterwards. however the logic is more complex, and i don't
really trust the kernel. the solution here is also nice in that it
replaces some atomic cas loops with simple non-atomic ops under lock.
due to moving waiters from the cond var to the mutex in bcast, these
waiters upon wakeup would steal slots in the count from newer waiters
that had not yet been signaled, preventing the signal function from
taking any action.
to solve the problem, we simply use two separate waiter counts, so
that the original "total" waiters count is undisturbed by broadcast
and still available for signal.
the changes to syscall_ret are mostly no-ops in the generated code,
just cleanup of type issues and removal of some implementation-defined
behavior. the one exception is the change in the comparison value,
which is fixed so that 0xf...f000 (which in principle could be a valid
return value for mmap, although probably never in reality) is not
treated as an error return.
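the boundary can be illustrated like this. it is a sketch with a hypothetical function name, and it assumes that error returns occupy exactly the top 4095 values of the unsigned range.

```c
/* nonzero if the raw syscall return value r encodes an error.
 * the comparison is strict, so -4096UL (0xf...f000) is a valid result */
int is_error_return(unsigned long r)
{
    return r > -4096UL;
}
```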
testing revealed that the old implementation, while correct, was
giving way too many spurious wakeups due to races changing the value
of the condition futex. in a test program with 5 threads receiving
broadcast signals, the number of returns from pthread_cond_wait was
roughly 3 times what it should have been (2 spurious wakeups for every
legitimate wakeup). moreover, the magnitude of this effect seems to
grow with the number of threads.
the old implementation may also have had some nasty race conditions
with reuse of the cond var with a new mutex.
the new implementation is based on incrementing a sequence number with
each signal event. this sequence number has nothing to do with the
number of threads intended to be woken; it's only used to provide a
value for the futex wait to avoid deadlock. in theory there is a
danger of race conditions due to the value wrapping around after 2^32
signals. it would be nice to eliminate that, if there's a way.
testing showed no spurious wakeups (though they are of course
possible) with the new implementation, as well as slightly improved
performance.
using swap has a race condition: the waiters must be added to the
mutex waiter count *before* they are taken off the cond var waiter
count, or wake events can be lost.
this avoids the "stampede effect" where pthread_cond_broadcast would
result in all waiters waking up simultaneously, only to immediately
contend for the mutex and go back to sleep.
previously, a waiter could miss the 1->0 transition of block if
another thread set block to 1 again after the signal function set
block to 0. we now use the caller's thread id as a unique token to
store in block, which no other thread will ever write there. this
ensures that if block still contains the tid, no signal has occurred.
spurious wakeups will of course occur whenever there is a spurious
return from the futex wait and another thread has begun waiting on the
cond var. this should be a rare occurrence except perhaps in the
presence of interrupting signal handlers.
signal/bcast operations have been improved by noting that they need
not avoid inspecting the cond var's memory after changing the futex
value. because the standard allows spurious wakeups, there is no way
for an application to distinguish between a spurious wakeup just
before another thread called signal/bcast, and the deliberate wakeup
resulting from the signal/bcast call. thus the woken thread must
assume that the signalling thread may still be waiting to act on the
cond var, and therefore it cannot destroy/unmap the cond var.
casting to int would not be correct because high bits could be lost.
mapping the high bits down onto low bits would be costlier in the
common case where the result is just used in a conditional. changing
the type of the bit array elements to int would permute the order of
the bit array on 64-bit big endian systems, so that's not an option
either.
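one way to get an int result without losing high bits and without shifting them down is a double negation, sketched below. the macro names are hypothetical, and whether this matches the committed macro exactly is an assumption.

```c
#include <limits.h>

#define WORD_BITS (CHAR_BIT * sizeof(unsigned long))

/* hypothetical names; the bit array is an array of unsigned long */
#define BIT_SET(d, s) \
    ((s)[(d) / WORD_BITS] |= 1UL << ((d) % WORD_BITS))
/* !! collapses any nonzero word-sized value to the int 1 */
#define BIT_ISSET(d, s) \
    (!!((s)[(d) / WORD_BITS] & (1UL << ((d) % WORD_BITS))))

/* sets the top bit of the first word, the bit that a plain cast to
 * int could drop, and checks that only that bit reads back as set */
int top_bit_roundtrip(void)
{
    unsigned long bits[2] = { 0, 0 };
    unsigned d = (unsigned)(WORD_BITS - 1);
    BIT_SET(d, bits);
    return BIT_ISSET(d, bits) == 1
        && BIT_ISSET(d - 1, bits) == 0
        && BIT_ISSET(d + 1, bits) == 0;
}
```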
actually this is just to avoid gcc being stupid and refusing to inline
the function version, even when the size cost is essentially identical
whether it's inlined or not.
if the file descriptor resource limit has been increased past
FD_SETSIZE, this is actually a security issue; we could write past the
end of the fd_set object. using poll makes it a non-issue, and
simplifies the code at the same time.
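a sketch of the general pattern, not the code from this change (the function names are hypothetical). poll takes the descriptor by value in a struct, so there is no fixed-size bit set to overrun no matter how large the descriptor number is.

```c
#include <poll.h>
#include <unistd.h>

/* returns 1 if fd is readable, 0 on timeout, -1 on error or hangup */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int r = poll(&pfd, 1, timeout_ms);
    if (r <= 0) return r;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* small self-check on a pipe: not readable before the write,
 * readable after it */
int pipe_demo(void)
{
    int p[2];
    if (pipe(p)) return -1;
    int before = wait_readable(p[0], 0);
    int wrote = write(p[1], "x", 1) == 1;
    int after = wait_readable(p[0], 1000);
    close(p[0]);
    close(p[1]);
    return wrote && before == 0 && after == 1;
}
```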
also, use clock_gettime instead of gettimeofday, for reduced bloat
and better entropy.
for now this is just a tiny optimization, but later if we support
cancellation from __stdio_read and __stdio_write, it will be necessary
for the recursive lock count to be zero in order for these functions
to know they are responsible for unlocking the FILE on cancellation.
the arm syscall abi requires 64-bit arguments to be aligned on an even
register boundary. these new macros facilitate meeting the abi
requirement without imposing significant ugliness on the code.
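the shape of such macros might look like the following. this is a sketch only: the macro names are hypothetical, 32-bit argument words and little-endian word order are assumed, and the helper just records the argument words so the layout can be inspected.

```c
#include <stdint.h>

/* a 64-bit value that already starts at an even argument position:
 * expands to two argument words, low word first */
#define ARG_LL_EVEN(x) \
    (long)((uint64_t)(x) & 0xffffffffu), (long)((uint64_t)(x) >> 32)
/* a 64-bit value that would start at an odd position: insert one
 * padding word so the pair lands on an even boundary */
#define ARG_LL_ODD(x) 0L, ARG_LL_EVEN(x)

/* records the argument words for a call shaped like f(fd, off) and
 * returns how many words were produced */
int pack_fd_offset(long out[4], int fd, uint64_t off)
{
    long args[] = { fd, ARG_LL_ODD(off) };
    int n = (int)(sizeof args / sizeof args[0]);
    for (int i = 0; i < n; i++) out[i] = args[i];
    return n;
}
```

with fd in the first slot, the offset would otherwise begin in the second (odd) slot; the padding word moves it to slots three and four.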
this is a case of poorly written man pages not matching the actual
implementation, and why i hate implementing nonstandard interfaces
with no actual documentation of how they're intended to work.
this bug was introduced in a recent patch. the problem we're working
around is that broken GNU software wants to use "struct siginfo"
rather than "siginfo_t", but "siginfo" is not in the reserved
namespace and thus not legal for the standard header to use.
really wchar_t should never vary, but the ARM EABI defines it as an
unsigned 32-bit int instead of a signed one, and gcc follows this
nonsense. thus, to give a conformant environment, we have to follow
(otherwise L""[0] and L'\0' would be 0U rather than 0, but the
application would be unaware due to a mismatched definition for
WCHAR_MIN and WCHAR_MAX, and Bad Things could happen with respect to
signed/unsigned comparisons, promotions, etc.).
fortunately no rules are imposed by the C standard on the relationship
between wchar_t and wint_t, and WEOF has type wint_t, so we can still
make wint_t always-signed and use -1 for WEOF.
this port assumes eabi calling conventions, eabi linux syscall
convention, and presence of the kernel helpers at 0xffff0f?0 needed
for threads support. otherwise it makes very few assumptions, and the
code should work even on armv4 without thumb support, as well as on
systems with thumb interworking. the bits headers declare this a
little endian system, but as far as i can tell the code should work
equally well on big endian.
some small details are probably broken; so far, testing has been
limited to qemu/aboriginal linux.
several things are changed. first, i have removed the old __uniclone
function signature and replaced it with the "standard" linux
__clone/clone signature. this was necessary to expose clone to
applications anyway, and it makes it easier to port __clone to new
archs, since it's now testable independently of pthread_create.
secondly, i have removed all references to the ugly ldt descriptor
structure (i386 only) from the c code and pthread structure. in places
where it is needed, it is now created on the stack just when it's
needed, in assembly code. thus, the i386 __clone function takes the
desired thread pointer as its argument, rather than an ldt descriptor
pointer, just like on all other sane archs. this should not affect
applications since there is really no way an application can use clone
with threads/tls in a way that doesn't horribly conflict with and
clobber the underlying implementation's use. applications are expected
to use clone only for creating actual processes, possibly with new
namespace features and whatnot.
eventually we may have a working "generic" implementation for archs
that don't need anything special. in any case, the goal of having
stubs like this is to allow early testing of new ports before all the
details needed for threads have been filled in. more functions like
this will follow.
actually these are just weak aliases for the normal locking versions
right now, and they will probably stay that way since making them
lock-free without slowing down the normal versions would require
significant code duplication for no benefit.
programs that use this tend to horribly botch international text
support, so it's questionable whether we want to support it even in
the long term... for now, it's just a dummy that calls strcmp.
on spurious wakeups/returns from __timedwait, pthread_join would
"succeed" and unmap the thread's stack while it was still running. at
best this would lead to SIGSEGV when the thread resumed execution, but
in the worst case, the thread would later resume executing on top of
another new thread's stack mapped at the same address.
spent about 4 hours tracking this bug down, chasing rare
difficult-to-reproduce stack corruption in a stress test program.
still no idea *what* caused the spurious wakeups; i suspect it's a
kernel bug.
this seems to be the bug that prevented enabling of private futex
support. i'm going to hold off on switching to private futexes until
after the next release, and until i get a chance to audit all
wait/wake calls to make sure they're using the correct private
argument, but with this change it should be safe to enable private
futex support.
null termination is only added when current size grows.
in update modes, null termination is not added if it does not fit
(i.e. it is not allowed to clobber data).
these rules make very little sense, but that's how it goes..
read should not be allowed past "current size".
append mode should write at "current size", not buffer size.
null termination should not be written except when "current size" grows.
this is not strictly required by the standard, but without it, there
is a race condition where cancellation arriving just before async
cancellation is enabled might not be acted upon. it is impossible for
a conforming application to work around this race condition since
calling pthread_testcancel after setting async cancellation mode is
not allowed (pthread_testcancel is not specified to be
async-cancel-safe). thus the implementation should be responsible for
eliminating the race, from a quality-of-implementation standpoint.
the expression -off is not safe in case off is the most-negative
value. instead apply - to base which is known to be non-negative and
bounded within sanity.
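
a minimal sketch of the check described above. the helper name
seek_target_ok is hypothetical and long long stands in for off_t; the
real code may be shaped differently. base is assumed to be
non-negative, so both -base and LLONG_MAX-base are safe to compute,
whereas -off would overflow when off is the most-negative value:

```c
#include <limits.h>

/* hypothetical helper: returns 1 if base+off is a valid, non-negative
 * position, 0 otherwise. base must already be in [0, LLONG_MAX]. */
int seek_target_ok(long long base, long long off)
{
    /* negate base, never off: -LLONG_MIN is undefined behavior */
    if (off < -base) return 0;
    /* base+off would exceed the representable range */
    if (off > LLONG_MAX - base) return 0;
    return 1;
}
```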
not heavily tested, but it seems to be correct, including the odd
behavior that seeking is in terms of wide character count. this
precludes any simple buffering, so we just make the stream unbuffered.
gcc generates extremely bad code (7 byte immediate mov) for the old
null pointer write approach. it should be generating something like
"xor %eax,%eax ; mov %al,(%eax)". in any case, using a dedicated
crashing opcode accomplishes the same thing in one byte.
this behavior (opening fds 0-2 for a suid program) is explicitly
allowed (but not required) by POSIX to protect badly-written suid
programs from clobbering files they later open.
this commit does add some cost in startup code, but the availability
of auxv and the security flag will be useful elsewhere in the future.
in particular auxv is needed for static-linked vdso support, which is
still waiting to be committed (sorry nik!)
this does not change behavior, but the idea is to avoid letting other
code build up between these two points, whereby the environment
variables might get used before security is checked.
a valid mmapped block will have an even (actually aligned) "extra"
field, whereas a freed chunk on the heap will always have an in-use
neighbor.
this fixes a potential bug if mmap ever allocated memory below the
main program/brk (in which case it would be wrongly-detected as a
double-free by the old code) and allows the double-free check to work
for donated memory outside of the brk area (or, in the future,
secondary heap zones if support for their creation is added).
it previously was returning the pseudo-monotonic-realtime clock
returned by times() rather than process cputime. it also violated C
namespace by pulling in times().
we now use clock_gettime() if available because times() has
ridiculously bad resolution. still provide a fallback for ancient
kernels without clock_gettime.
this is a "nonstandard" function that was "rejected" by POSIX, but
nonetheless had its behavior documented in the POSIX rationale for
fork. it's present on solaris and possibly some other systems, and
duplicates the whole calling process, not just a single thread. glibc
does not have this function. it should not be used in programs
intending to be portable, but may be useful for testing,
checkpointing, etc. and it's an interesting (and quite small) example
of the usefulness of the __synccall framework originally written to
work around deficiencies in linux's setuid syscall.
fix up clone signature to match the actual behavior. the new
__synccall_wait function allows a __synccall callback to wait for other
threads to continue without returning, so that it can resume action
after the caller finishes. this interface could be made significantly
more general/powerful with minimal effort, but i'll wait to do that
until it's actually useful for something.
if a timer thread leaves signals unblocked, any future attempt by the
main thread to prevent the process from being terminated by blocking
signals will fail, since the signal can still be delivered to the
timer thread.
this works around pcc's lack of working support for weak references,
and in principle is nice because it gets us back to the stage where
the only weak symbol feature we use is weak aliases, nothing else.
having fewer dependencies on fancy linker features is a good thing.
the new absolute-time-based wait kernelside was hard to get right and
basically just code duplication. it could only improve "performance"
when waiting, and even then, the improvement was just a slight drop in
cpu usage during a wait.
actually, with vdso clock_gettime, the "old" way will be even faster
than the "new" way if the time has already expired, since it will not
invoke any syscalls. it can determine entirely in userspace that it
needs to return ETIMEDOUT.
normally we allow cancellation to be acted upon when a syscall fails
with EINTR, since there is no useful status to report to the caller in
this case, and the signal that caused the interruption was almost
surely the cancellation request, anyway.
however, unlike all other syscalls, close has actually performed its
resource-deallocation function whenever it returns, even when it
returned an error. if we allow cancellation at this point, the caller
has no way of informing the program that the file descriptor was
closed, and the program may later try to close the file descriptor
again, possibly closing a different, newly-opened file.
the workaround looks ugly (special-casing one syscall), but it's
actually the case that close is the one and only syscall (at least
among cancellation points) with this ugly property.
if gcc decided to move this across a conditional that checks validity
of the thread register, an invalid thread-register-based read could be
performed and raise sigsegv.
if saved, signal mask would not be restored unless some low signals
were masked. if not saved, signal mask could be wrongly restored to
uninitialized values. in either case, the wrong mask would be restored.
i believe this function was written for a very old version of the
jmp_buf structure which did not contain a final 0 field for
compatibility with siglongjmp, and never updated...
cleanup push and pop are also no-ops if pthread_exit is not reachable.
this can make a big difference for library code which needs to protect
itself against cancellation, but which is unlikely to actually be used
in programs with threads/cancellation.
previously, pthread_cleanup_push/pop were pulling in all of
pthread_create due to dependency on the __pthread_unwind_next
function. this was not needed, as cancellation cleanup handlers can
never be called unless pthread_exit or pthread_cancel is reachable.
like mutexes and semaphores, rwlocks suffered from a race condition
where the unlock operation could access the lock memory after another
thread successfully obtained the lock (and possibly destroyed or
unmapped the object). this has been fixed in the same way it was fixed
for other lock types.
in addition, the previous implementation favored writers over readers.
in the absence of other considerations, that is the best behavior for
rwlocks, and posix explicitly allows it. however posix also requires
read locks to be recursive. if writers are favored, any attempt to
obtain a read lock while a writer is waiting for the lock will fail,
causing "recursive" read locks to deadlock. this can be avoided by
keeping track of which threads already hold read locks, but doing so
requires unbounded memory usage, and there must be a fallback case
that favors readers in case memory allocation failed. and all of this
must be synchronized. the cost, complexity, and risk of errors in
getting it right is too great, so we simply favor readers.
tracking of the owner of write locks has been removed, as it was not
useful for anything. it could allow deadlock detection, but it's not
clear to me that returning EDEADLK (which a buggy program is likely to
ignore) is better than deadlocking; at least the latter behavior
prevents further data corruption. a correct program cannot invoke this
situation anyway.
the reader count and write lock state, as well as the "last minute"
waiter flag have all been combined into a single atomic lock. this
means all state transitions for the lock are atomic compare-and-swap
operations. this makes establishing correctness much easier and may
improve performance.
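
as a rough illustration of a single-word lock state that is only ever
changed by compare-and-swap, here is a sketch using C11 atomics. the
type and function names and the bit layout (low 31 bits as the reader
count, with all-ones meaning write-locked, and the top bit reserved as
a waiter flag) are assumptions made for this example, not details
taken from the commit:

```c
#include <errno.h>
#include <stdatomic.h>

#define SKETCH_COUNT_MASK 0x7fffffffu  /* reader count / write-lock field */
#define SKETCH_WRLOCKED   0x7fffffffu  /* field value meaning write-locked */

typedef struct { atomic_uint state; } sketch_rwlock;

/* try to take a read lock: one CAS moves the state from val to val+1 */
int sketch_tryrdlock(sketch_rwlock *rw)
{
    unsigned val = atomic_load(&rw->state);
    for (;;) {
        unsigned cnt = val & SKETCH_COUNT_MASK;
        if (cnt == SKETCH_WRLOCKED) return EBUSY;      /* writer holds it */
        if (cnt == SKETCH_WRLOCKED - 1) return EAGAIN; /* too many readers */
        /* on failure val is reloaded and the checks are repeated */
        if (atomic_compare_exchange_weak(&rw->state, &val, val + 1))
            return 0;
    }
}

/* try to take the write lock: only succeeds from the fully idle state */
int sketch_trywrlock(sketch_rwlock *rw)
{
    unsigned expected = 0;
    if (atomic_compare_exchange_strong(&rw->state, &expected, SKETCH_WRLOCKED))
        return 0;
    return EBUSY;
}
```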
finally, some code duplication has been cleaned up. more is called
for, especially the standard __timedwait idiom repeated in all locks.
futex returns EINVAL, not ENOSYS, when op is not supported.
unfortunately this looks just like EINVAL from other causes, and we
end up running the fallback code and getting EINVAL again. fortunately
this case should be rare since correct code should not generate EINVAL
anyway.
new features:
- FUTEX_WAIT_BITSET op will be used for timed waits if available. this
saves a call to clock_gettime.
- error checking for the timespec struct is now inside __timedwait so
it doesn't need to be duplicated everywhere. cond_timedwait still
needs to duplicate it to avoid unlocking the mutex, though.
- pushing and popping the cancellation handler is delegated to
__timedwait, and cancellable/non-cancellable waits are unified.
this change is needed to fix a race condition and ensure that it's
possible to unlock and destroy or unmap the mutex as soon as
pthread_mutex_lock succeeds. POSIX explicitly gives such an example in
the rationale and requires an implementation to allow such usage.
the race condition these changes address is described in glibc bug
report number 12674:
http://sourceware.org/bugzilla/show_bug.cgi?id=12674
up until now, musl has shared the bug, and i had not been able to
figure out how to eliminate it. in short, the problem is that it's not
valid for sem_post to inspect the waiters count after incrementing the
semaphore value, because another thread may have already successfully
returned from sem_wait, (rightly) deemed itself the only remaining
user of the semaphore, and chosen to destroy and free it (or unmap the
shared memory it's stored in). POSIX is not explicit in blessing this
usage, but it gives a very explicit analogous example with mutexes
(which, in musl and glibc, also suffer from the same race condition
bug) in the rationale for pthread_mutex_destroy.
the new semaphore implementation augments the waiter count with a
redundant waiter indication in the semaphore value itself,
representing the presence of "last minute" waiters that may have
arrived after sem_post read the waiter count. this allows sem_post to
read the waiter count prior to incrementing the semaphore value,
rather than after incrementing it, so as to avoid accessing the
semaphore memory whatsoever after the increment takes place.
a similar, but much simpler, fix should be possible for mutexes and
other locking primitives whose usage rules are stricter than
semaphores.
per POSIX and RFC 3493:
If the specified address family is AF_INET, AF_INET6, or AF_UNSPEC,
the service can be specified as a string specifying a decimal port
number.
021 is a valid decimal number, therefore, interpreting it as octal
seems to be non-conformant.
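
a sketch of decimal-only service parsing, assuming strtoul with an
explicit base of 10 (a base of 0 would treat a leading 0 as octal).
the helper name parse_decimal_port is hypothetical:

```c
#include <ctype.h>
#include <stdlib.h>

/* hypothetical helper: returns the port (0..65535) if s is a plain
 * decimal number, or -1 otherwise. */
int parse_decimal_port(const char *s)
{
    char *end;
    unsigned long port;

    /* reject empty strings, signs, and leading whitespace */
    if (!isdigit((unsigned char)*s)) return -1;
    port = strtoul(s, &end, 10);   /* base 10: "021" is 21, not 17 */
    if (*end || port > 65535) return -1;
    return (int)port;
}
```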
this race is fundamentally due to linux's bogus requirement that
userspace, rather than kernelspace, fill in the siginfo structure. an
intervening signal handler that calls fork could cause both the parent
and child process to send signals claiming to be from the parent,
which could in turn have harmful effects depending on what the
recipient does with the signal. we simply block all signals for the
interval between getuid and sigqueue syscalls (much like what raise()
does already) to prevent the race and make the getuid/sigqueue pair
atomic.
this will be a non-issue if linux is fixed to validate the siginfo
structure or fill it in from kernelspace.
setrlimit is supposed to be per-process, not per-thread, but again
linux gets it wrong. work around this in userspace. not only is it
needed for correctness; setxid also depends on the resource limits for
all threads being the same to avoid situations where temporarily
unlimiting the limit succeeds in some threads but fails in others.
previously, stdio used spinlocks, which would be unacceptable if we
ever add support for thread priorities, and which yielded
pathologically bad performance if an application attempted to use
flockfile on a key file as a major/primary locking mechanism.
i had held off on making this change for fear that it would hurt
performance in the non-threaded case, but actually support for
recursive locking had already inflicted that cost. by having the
internal locking functions store a flag indicating whether they need
to perform unlocking, rather than using the actual recursive lock
counter, i was able to combine the conditionals at unlock time,
eliminating any additional cost, and also avoid a nasty corner case
where a huge number of calls to ftrylockfile could cause deadlock
later at the point of internal locking.
this commit also fixes some issues with usage of pthread_self
conflicting with __attribute__((const)) which resulted in crashes with
some compiler versions/optimizations, mainly in flockfile prior to
pthread_create.
changing credentials in a multi-threaded program is extremely
difficult on linux because it requires synchronizing the change
between all threads, which have their own thread-local credentials on
the kernel side. this is further complicated by the fact that changing
the real uid can fail due to exceeding RLIMIT_NPROC, making it
possible that the syscall will succeed in some threads but fail in
others.
the old __rsyscall approach being replaced was robust in that it would
report failure if any one thread failed, but in this case, the program
would be left in an inconsistent state where individual threads might
have different uid. (this was not as bad as glibc, which would
sometimes even fail to report the failure entirely!)
the new approach being committed refuses to change real user id when
it cannot temporarily set the rlimit to infinity. this is completely
POSIX conformant since POSIX does not require an implementation to
allow real-user-id changes for non-privileged processes whatsoever.
still, setting the real uid can fail due to memory allocation in the
kernel, but this can only happen if there is not already a cached
object for the target user. thus, we forcibly serialize the syscall
attempts, and fail the entire operation on the first failure. this
*should* lead to an all-or-nothing success/failure result, but it's
still fragile and highly dependent on kernel developers not breaking
things worse than they're already broken.
ideally linux will eventually add a CLONE_USERCRED flag that would
give POSIX conformant credential changes without any hacks from
userspace, and all of this code would become redundant and could be
removed ~10 years down the line when everyone has abandoned the old
broken kernels. i'm not holding my breath...
thanks to mikachu
per POSIX:
The setenv() function shall fail if:
[EINVAL] The name argument is a null pointer, points to an empty
string, or points to a string containing an '=' character.
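
the three conditions quoted above can be sketched as a single test.
the helper name check_env_name is hypothetical; it only shows the
validation step, not the rest of setenv:

```c
#include <errno.h>
#include <string.h>

/* hypothetical helper: returns 0 if name is acceptable, or -1 with
 * errno set to EINVAL if it is null, empty, or contains '='. */
int check_env_name(const char *name)
{
    if (!name || !*name || strchr(name, '=')) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}
```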
instead of creating temp dso objects on the stack and moving them to
the heap if dlopen/dlsym are used, use static objects to begin with,
and just donate them to malloc if we no longer need them.
these changes also make it so clock_gettime(CLOCK_REALTIME, &ts) works
even on pre-2.6 kernels, emulated via the gettimeofday syscall. there
is no cost for the fallback check, as it falls under the error case
that already must be checked for storing the error code in errno, but
which would normally be hidden inside __syscall_ret.
we cannot report failure after forking, so the idea is to ensure prior
to fork that fd 0,1,2 exist. this will prevent dup2 from possibly
hitting a resource limit and failing in the child process. fcntl
rather than dup2 is used prior to forking to avoid race conditions.
fread was calling f->read without checking that the file was in
reading mode. this could:
1. crash, if f->read was a null pointer
2. cause unwanted blocking on a terminal already at eof
3. allow reading on a write-only file
1. my interpretation of subject sequence definition was wrong. adjust
parser to conform to the standard.
2. some code for handling tail overflow case was missing (forgot to
finish writing it).
3. typo (= instead of ==) caused ERANGE to wrongly behave like EINVAL.
stopping without letting the parser see a stop character prevented
getting a result. so treat all high chars as the null character and
pass them into the parser.
also eliminated ugly tmp var using compound literals.
this fixes a number of bugs in integer parsing due to lazy haphazard
wrapping, as well as some misinterpretations of the standard. the new
parser is able to work character-at-a-time or on whole strings, making
it easy to support the wide functions without unbounded space for
conversion. it will also be possible to update scanf to use the new
parser.
STREAMS are utterly useless as far as I can tell, but some software
was apparently broken by the presence of stropts.h but lack of macros
it's supposed to define...
this should not be necessary - the invalid bit patterns cannot be
created except through type punning. however, some broken gnu software
is passing them to printf and triggering dangerous stack-smashing, so
let's catch them anyway...
this is a really ugly and backwards function, but its presence will
prevent lots of broken gnulib software from trying to define its own
version of fpurge and thereby failing to build or worse.
per POSIX: The mprotect() function shall change the access protections
to be that specified by prot for those whole pages containing any part
of the address space of the process starting at address addr and
continuing for len bytes.
on the other hand, linux mprotect fails with EINVAL if the base
address and/or length is not page-aligned, so we have to align them
before making the syscall.
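
a sketch of the alignment step, assuming the page size is a power of
two. the helper name and the out-parameters are hypothetical; the
start is rounded down and the end is rounded up so that every page
touching [addr, addr+len) is covered:

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical helper: expand [addr, addr+len) to whole pages.
 * pagesize must be a power of two. */
void sketch_align_range(uintptr_t addr, size_t len, size_t pagesize,
                        uintptr_t *start, size_t *aligned_len)
{
    uintptr_t mask = -(uintptr_t)pagesize;            /* e.g. ~0xfff */
    uintptr_t s = addr & mask;                        /* round down */
    uintptr_t e = (addr + len + pagesize - 1) & mask; /* round up */
    *start = s;
    *aligned_len = e - s;
}
```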
this is mostly useless for shared libs (though it could help for
prelink-like purposes); the intended use case is for adding support
for calling the dynamic linker directly to run a program, as in:
./libc.so ./a.out foo
this usage is not yet supported.
basically we temporarily make the library and all its dependencies
part of the global namespace but only for the duration of performing
relocations, then return them to their former state.
some of the code is not yet used, and is in preparation for dlopen
which needs to be able to handle failure loading libraries without
terminating the program.
the use of this test will be much stricter than glibc and other
typical implementations; the environment will not be honored
whatsoever unless the program is confirmed non-suid/sgid by the aux
vector the kernel passed in. no fallback to slow syscall-based
checking is used if the kernel fails to provide the information; we
simply assume the worst (suid) in this case and refuse to honor
environment.
some notes:
- library search path is hard coded
- x86_64 code is untested and may not work
- dlopen/dlsym is not yet implemented
- relocations in read-only memory won't work
this seems to be necessary to make the linker accept the functions in
a shared library (perhaps to generate PLT entries?)
strictly speaking libc-internal asm should not need it. i might clean
that up later.
if thread id was reused by the kernel between the time pthread_kill
read it from the userspace pthread_t object and the time of the tgkill
syscall, a signal could be sent to the wrong thread. the tgkill
syscall was supposed to prevent this race (versus the old tkill
syscall) but it can't; it can only help in the case where the tid is
reused in a different process, but not when the tid is reused in the
same process.
the only solution i can see is an extra lock to prevent threads from
exiting while another thread is trying to pthread_kill them. it should
be very very cheap in the non-contended case.
at present the i386 code does not support sse floating point, which is
not part of the standard i386 abi. while it may be desirable to
support it later, doing so will reduce performance and require some
tricks to probe if sse support is present.
this first commit is i386-only, but it should be trivial to port the
asm to x86_64.
even if size_t was 32-bit already, the fact that the value was
unsigned and that gcc is too stupid to figure out it would be positive
as a signed quantity (due to the immediately-prior arithmetic and
conditionals) results in gcc compiling the integer-to-float conversion
as zero extension to 64 bits followed by an "fildll" (64 bit)
instruction rather than a simple "fildl" (32 bit) instruction on x86.
reportedly fildll is very slow on certain p4-class machines; even if
not, the new code is slightly smaller.
unfortunately traditional i386 practice was to use "long" rather than
"int" for wchar_t, despite the latter being much more natural and
logical. we followed this practice, but it seems some compilers (clang
and maybe certain gcc builds or others too..?) have switched to using
int, resulting in spurious pointer type mismatches when L"..." wide
strings are used. the best solution I could find is to use the
compiler's definition of wchar_t if it exists, and otherwise fall back
to the traditional definition.
there's no point in duplicating this approach on 64-bit archs, as
their only 32-bit type is int.
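
a sketch of that selection, assuming the compiler exposes its choice
through the predefined __WCHAR_TYPE__ macro (as gcc and clang do). the
typedef name sketch_wchar_t is hypothetical, used here so the fragment
does not collide with the real wchar_t:

```c
/* prefer the compiler's own idea of the wide character type */
#ifdef __WCHAR_TYPE__
typedef __WCHAR_TYPE__ sketch_wchar_t;
#else
/* traditional i386 definition */
typedef long sketch_wchar_t;
#endif
```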
basically there are 3 choices for how to implement this variable-size
string member:
1. C99 flexible array member: breaks using dirent.h with pre-C99 compiler.
2. old way: length-1 string: generates array bounds warnings in caller.
3. new way: length-NAME_MAX string. no problems, simplifies all code.
of course the usable part in the pointer returned by readdir might be
shorter than NAME_MAX+1 bytes, but that is allowed by the standard and
doesn't hurt anything.
this actually inadvertently disallows some valid patterns with
redundant / or * characters, but it's better than allowing unbounded
vla allocation.
eventually i'll write code to move the pattern to the stack and
eliminate redundancy to ensure that it fits in PATH_MAX at the
beginning of glob. this would also allow it to be modified in place
for passing to fnmatch rather than copied at each level of recursion.
there is a resource limit of 0 bits to store the concurrency level
requested. thus any positive level exceeds a resource limit, resulting
in EAGAIN. :-)
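
a sketch of what this amounts to, under a hypothetical function name.
the EINVAL case for negative levels is taken from the POSIX
description of pthread_setconcurrency, not from this commit:

```c
#include <errno.h>

/* hypothetical stand-in for pthread_setconcurrency */
int sketch_setconcurrency(int level)
{
    if (level < 0) return EINVAL;  /* invalid per POSIX */
    if (level > 0) return EAGAIN;  /* no storage for any positive level */
    return 0;                      /* level 0: nothing to store */
}
```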
the observed symptom was that the code was incorrectly rounding up
1.0625 to 1.063 despite the rounding mode being round-to-nearest with
ties broken by rounding to even last place. however, the code was just
not right in many respects, and i'm surprised it worked as well as it
did. this time i tested the values that end up in the variables round,
small, and the expression round+small, and all look good.
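
the symptom can be checked with a small test, sketched here with a
hypothetical wrapper name. 1.0625 is exactly representable in binary,
so with 3 digits requested it is a true tie, and an implementation
that rounds to nearest with ties to even should produce 1.062:

```c
#include <stdio.h>

/* hypothetical wrapper: format x with 3 digits after the point */
int sketch_fmt3(char *buf, size_t n, double x)
{
    return snprintf(buf, n, "%.3f", x);
}
```

by the same rule 1.1875, also an exact tie, should come out as 1.188,
since the even neighbor is above it.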
the new approach relies on the fact that the only ways to create
sigset_t objects without invoking UB are to use the sig*set()
functions, or from the masks returned by sigprocmask, sigaction, etc.
or in the ucontext_t argument to a signal handler. thus, as long as
sigfillset and sigaddset avoid adding the "protected" signals, there
is no way the application will ever obtain a sigset_t including these
bits, and thus no need to add the overhead of checking/clearing them
when sigprocmask or sigaction is called.
note that the old code actually *failed* to remove the bits from
sa_mask when sigaction was called.
the new implementations are also significantly smaller, simpler, and
faster due to ignoring the useless "GNU HURD signals" 65-1024, which
are not used and, if there's any sanity in the world, never will be
used.
these should be tweaked according to testing. offhand i know 1000 is
too low and 5000 is likely to be sufficiently high. consider trying to
add futexes to file locking, too...
the previous implementation had at least 2 problems:
1. the case where additional threads reached the barrier before the
first wave was finished leaving the barrier was untested and seemed
not to be working.
2. threads leaving the barrier continued to access memory within the
barrier object after other threads had successfully returned from
pthread_barrier_wait. this could lead to memory corruption or crashes
if the barrier object had automatic storage in one of the waiting
threads and went out of scope before all threads finished returning,
or if one thread unmapped the memory in which the barrier object
lived.
the new implementation avoids both problems by making the barrier
state essentially local to the first thread which enters the barrier
wait, and forces that thread to be the last to return.
the previous fix was incorrect, as it would prevent f->close(f) from
being called if fflush(f) failed. i believe this was the original
motivation for using | rather than ||. so now let's just use a second
statement to constrain the order of function calls, and go back to
using |.
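
a sketch of the resulting pattern, with a hypothetical struct and
helpers standing in for the real FILE internals. the flush result is
stored first, fixing the order, and the close result is then or'ed in
so that close always runs even when flush fails:

```c
/* hypothetical stand-in for a stream; not the real FILE layout */
struct sketch_file {
    int flush_result;  /* what the fake flush should return */
    int close_calls;   /* how many times close was invoked */
};

static int sketch_flush(struct sketch_file *f) { return f->flush_result; }
static int sketch_close(struct sketch_file *f) { f->close_calls++; return 0; }

int sketch_fclose(struct sketch_file *f)
{
    int r = sketch_flush(f);  /* first statement: flush happens first */
    r |= sketch_close(f);     /* | not ||: close is never skipped */
    return r;
}
```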
with this patch, musl compiles and mostly works with pcc 1.0.0. a few
tests are still failing and i'm uncertain whether they are due to
portability problems in musl, or bugs in pcc, but i suspect the
latter.
this slightly cuts down on the degree musl "fights with" gcc, but more
importantly, it fixes a critical bug when gcc inlines a variadic
function and optimizes out the variadic arguments due to noticing that
they were "not used" (by __builtin_va_arg).
we leave the old code in place if __GNUC__ >= 3 is false; it seems
like it might be necessary at least for tinycc support and perhaps if
anyone ever gets around to fixing gcc 2.95.3 enough to make it work..
Smoothsort is an adaptive variant of heapsort. This version was
written by Valentin Ochs (apo) specifically for inclusion in musl. I
worked with him to get it working in O(1) memory usage even with giant
array element widths, and to optimize it heavily for size and speed.
It's still roughly 4 times as large as the old heap sort
implementation, but roughly 20 times faster given an almost-sorted
array of 1M elements (20 being the base-2 log of 1M), i.e. it really
does reduce O(n log n) to O(n) in the mostly-sorted case. It's still
somewhat slower than glibc's Introsort for random input, but now
considerably faster than glibc when the input is already sorted, or
mostly sorted.
strictly speaking this and a few other ops should be factored into
asm.h or the file should just be renamed to asm.h, but whatever. clean
it up someday.
1. failed match of literal chars from the format string would always
return matching failure rather than input failure at eof, leading to
infinite loops in some programs.
2. unread of eof would wrongly adjust the character counts reported by
%n, yielding an off-by-one error.
some functions that should have been testing whether pthread_self()
had been called and initialized the thread pointer were instead
testing whether pthread_create() had been called and actually made the
program "threaded". while it's unlikely any mismatch would occur in
real-world programs, this could have introduced subtle bugs. now, we
store the address of the main thread's thread descriptor in the libc
structure and use its presence as a flag that the thread register is
initialized. note that after fork, the calling thread (not necessarily
the original main thread) is the new main thread.
the linux documentation for dup2 says it can fail with EBUSY due to a
race condition with open and dup in the kernel. shield applications
(and the rest of libc) from this nonsense by looping until it succeeds.
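
a sketch of such a retry loop, written against the public dup2
function under a hypothetical wrapper name; the libc itself would
presumably loop on the raw syscall instead:

```c
#include <errno.h>
#include <unistd.h>

/* hypothetical wrapper: retry dup2 for as long as it fails with EBUSY */
int dup2_retry(int oldfd, int newfd)
{
    int r;
    while ((r = dup2(oldfd, newfd)) < 0 && errno == EBUSY)
        ;  /* transient kernel race: just try again */
    return r;
}
```

any other error, such as EBADF, is still returned to the caller
unchanged.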
we already checked before making the syscall, but it's possible that a
signal handler interrupted the blocking syscall and disabled
cancellation, and that this is the cause of EINTR. in this case, the
old behavior was testably wrong.
like all other syscalls, close should return to the caller if and only
if it successfully performed its action. it is necessary that the
application be able to determine whether the close succeeded.
clean and simple, but fails when the caller does not have permissions
to open the file for reading or when /proc is not available. i may
replace this with a full implementation later, possibly leaving this
version as an optimization to use when it works.
don't waste time (and significant code size due to function call
overhead!) setting errno when the result of a syscall does not matter
or when it can't fail.
x86_64 was just plain wrong in the cancel-flag-already-set path, and
crashing.
the more subtle error was not clearing the saved stack pointer before
returning to c code. this could result in the signal handler
misidentifying c code as the pre-syscall part of the asm, and acting
on cancellation at the wrong time, and thus resource leak race
conditions.
also, now __cancel (in the c code) is responsible for clearing the
saved sp in the already-cancelled branch. this means we have to use
call rather than jmp to ensure the stack pointer in the c code will never
match what the asm saved.
the goal is to be able to use pthread_setcancelstate internally in
the implementation, whenever a function might want to use functions
which are cancellation points but avoid becoming a cancellation point
itself. i could have just used a separate internal function for
temporarily inhibiting cancellation, but the solution in this commit
is better because (1) it's one less implementation-specific detail in
functions that need to use it, and (2) application code can also get
the same benefit.
previously, pthread_setcancelstate depended on pthread_self, which
would pull in unwanted thread setup overhead for non-threaded
programs. now, it temporarily stores the state in the global libc
struct if threads have not been initialized, and later moves it if
needed. this way we can instead use __pthread_self, which has no
dependencies and assumes that the thread register is already valid.
this patch improves the correctness, simplicity, and size of
cancellation-related code. modulo any small errors, it should now be
completely conformant, safe, and resource-leak free.
the notion of entering and exiting cancellation-point context has been
completely eliminated and replaced with alternative syscall assembly
code for cancellable syscalls. the assembly is responsible for setting
up execution context information (stack pointer and address of the
syscall instruction) which the cancellation signal handler can use to
determine whether the interrupted code was in a cancellable state.
these changes eliminate race conditions in the previous generation of
cancellation handling code (whereby a cancellation request received
just prior to the syscall would not be processed, leaving the syscall
to block, potentially indefinitely), and remedy an issue where
non-cancellable syscalls made from signal handlers became cancellable
if the signal handler interrupted a cancellation point.
x86_64 asm is untested and may need a second try to get it right.
setting errno here is completely valid, but some programs, notably
busybox printf, assume that errno will not be set during output and
treat this as an error condition. in any case, skipping it slightly
reduces code size and saves time.
otherwise we cannot support an application's desire to use
asynchronous cancellation within the callback function. this change
also slightly debloats pthread_create.c.
we take advantage of the fact that unless self->cancelpt is 1,
cancellation cannot happen. so just increment it by 2 to temporarily
block cancellation. this drops pthread_create.o well under 1k.
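
a minimal sketch of the counter trick described above. the struct and
function names are hypothetical and only illustrate the arithmetic;
they are not the actual libc internals.

```c
/* hypothetical stand-in for the per-thread structure */
struct sketch_thread {
	int cancelpt;
};

/* cancellation can only be acted upon when the counter is exactly 1 */
static int sketch_can_cancel(const struct sketch_thread *self)
{
	return self->cancelpt == 1;
}

/* adding 2 maps 0->2 and 1->3, neither of which equals 1 */
static void sketch_inhibit(struct sketch_thread *self)
{
	self->cancelpt += 2;
}

/* subtracting 2 restores the previous value exactly */
static void sketch_resume(struct sketch_thread *self)
{
	self->cancelpt -= 2;
}
```

since the old value is recovered by subtraction, nothing needs to be
saved on the side while cancellation is inhibited.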
with datagram sockets, depending on fprintf not to flush the output
early was very fragile; the new version simply uses a small fixed-size
buffer. it could be updated to dynamically allocate large buffers if
needed, but i can't envision any admin being happy about finding
64kb-long lines in their syslog...
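
a rough sketch of the fixed-size buffer approach, assuming the message
is formatted with vsnprintf and then handed to a single send call. the
helper name is hypothetical, and the buffer size is whatever the
caller chooses (it must be nonzero).

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* format into a caller-supplied fixed buffer; returns the number of
 * bytes actually stored (excluding the terminating null), truncating
 * overlong messages instead of growing the buffer. */
static size_t sketch_format_msg(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0) return 0;
	return (size_t)n < size ? (size_t)n : size - 1;
}
```

the returned length can then be passed to one send() on the datagram
socket, so each log message goes out as exactly one datagram.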
some of these definitions were just plain wrong, others based on
outdated ancient "non-64" versions of the kernel interface.
as much as possible has now been moved out of bits/*
these changes break abi (the old abi for these functions was wrong),
but since they were not working anyway it can hardly matter.
it should be noted that flock does not mix well with standard fcntl
locking, but nonetheless some applications will attempt to use flock
instead of fcntl if both exist. options to configure or small patches
may be needed. debian maintainers have plenty of experience with this
unfortunate situation...
after fork, we have a new process and the pid is equal to the tid of
the new main thread. there is no need to make two separate syscalls to
obtain the same number.
we can do this without violating the namespace now that they are
macros/inline functions rather than extern functions. the motivation
is that gcc was generating giant, slow, horrible code for the old
functions, and now generates a single byte-swapping instruction.
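
for illustration, a byte swap written as a static inline function
might look like the following. the names are hypothetical, not the
ones used in the headers; the point is only that a simple
shift-and-mask expression is something a compiler can typically
recognize and reduce to a single instruction.

```c
#include <stdint.h>

/* reverse the byte order of a 32-bit value */
static inline uint32_t sketch_bswap32(uint32_t x)
{
	return x>>24 | (x>>8 & 0xff00) | (x<<8 & 0xff0000) | x<<24;
}

/* reverse the byte order of a 16-bit value */
static inline uint16_t sketch_bswap16(uint16_t x)
{
	return (uint16_t)(x<<8 | x>>8);
}
```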
the basic idea is that the only things in alltypes.h should be types
that either vary from system to system (in practice, not just in
theoretical la-la land - this is the implementation so we choose what
constraints we want to impose on ports) or which are needed by
multiple system headers.
1. saved errno was not being restored, illegally clearing errno to 0.
2. no need to save and restore errno around free; it will not touch
errno except perhaps when the program has already invoked UB...
actually FLT_ROUNDS needs to expand to a static inline function that
obtains the current rounding mode and returns it, but that will be
added later with fenv.h stuff.
according to posix, readv "shall be equivalent to read(), except..."
that it places the data into the buffers specified by the iov array.
however on linux, when reading from a terminal, each iov element
behaves almost like a separate read. this means that if the first iov
exactly satisfied the request (e.g. a length-one read of '\n') and the
second iov is nonzero length, the syscall will block again after
getting the blank line from the terminal until another line is read.
simply put, entering a single blank line becomes impossible.
the solution, fortunately, is simple. whenever the buffer size is
nonzero, reduce the length of the requested read by one byte and let
the last byte go through the buffer. this way, readv will already be
in the second (and last) iov, and won't re-block on the second iov.
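
a sketch of the iov setup this describes, assuming a two-element
array where the first element is the caller's destination and the
second is the stream buffer. the helper name and parameters are
hypothetical, and len is assumed to be at least 1.

```c
#include <stddef.h>
#include <sys/uio.h>

/* fill in the two iov elements for a buffered read of len bytes.
 * when a stream buffer exists, the caller's element is shortened by
 * one byte so the final byte arrives via the stream buffer. */
static void sketch_setup_iov(struct iovec iov[2], void *dst, size_t len,
                             void *buf, size_t bufsize)
{
	iov[0].iov_base = dst;
	iov[0].iov_len = len - !!bufsize;
	iov[1].iov_base = buf;
	iov[1].iov_len = bufsize;
}
```

with a length-one request and a nonzero stream buffer, the first
element has length zero, so the single '\n' lands in the stream buffer
and the read returns without waiting for another line.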
POSIX clearly specifies the types of msg_iovlen and msg_controllen, and
Linux ignores this and makes them both size_t instead. to work around
this we add padding (instead of just using the wrong types like glibc
does), but we also need to patch up the struct before passing it to
the kernel in case the caller did not zero-fill it.
if i could trust the kernel to just ignore the upper 32 bits, this
would not be necessary, but i don't think it will ignore them...
previously NULL was returned in ai_canonname, resulting in crashes in
some callers. this behavior was incorrect. note however that the new
behavior differs from glibc, which performs reverse dns lookups. POSIX
is very clear that a reverse DNS lookup must not be performed for
numeric addresses.
this is something of a tradeoff, as now set*id() functions, rather
than pthread_create, are what pull in the code overhead for dealing
with linux's refusal to implement proper POSIX thread-vs-process
semantics. my motivations are:
1. it's cleaner this way, especially cleaner to optimize out the
rsyscall locking overhead from pthread_create when it's not needed.
2. it's expected that only a tiny number of core system programs will
ever use set*id() functions, whereas many programs may want to use
threads, and making thread overhead tiny is an incentive for "light"
programs to try threads.
since timer_create is no longer allocating a structure for the timer_t
and simply using the kernel timer id, it was impossible to specify the
timer_t as the argument to the signal handler. the solution is to pass
the null sigevent pointer on to the kernel, rather than filling it in
userspace, so that the kernel does the right thing. however, that
precludes the clever timerid-versus-threadid encoding we were doing.
instead, just assume timerids are below 1M and thread pointers are
above 1M. (in perspective: timerids are sequentially allocated and
seem limited to 32k, and thread pointers are at roughly 3G.)
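
the distinction could be sketched as a simple threshold test. the
names below are hypothetical, and "1M" is assumed here to mean 2^20;
the original does not say which reading is intended.

```c
#include <stdint.h>

/* assumption: "1M" taken as 2^20 */
#define SKETCH_TIMERID_LIMIT 0x100000

/* nonzero if the timer_t value looks like a raw kernel timer id,
 * zero if it looks like a pointer to a thread structure */
static inline int sketch_is_kernel_timerid(void *t)
{
	return (uintptr_t)t < SKETCH_TIMERID_LIMIT;
}
```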
with these small changes, libc functions which need to call functions
which are cancellation points, but which themselves must not be
cancellation points, can use the CANCELPT_INHIBIT and CANCELPT_RESUME
macros to temporarily inhibit all cancellation.
note that unlike the originals, these do not print the program
name/argv[0] because we have not saved it anywhere. this could be
changed in __libc_start_main if desired.
this could actually cause rare crashes in the case where a short
string is located at the end of a page and the following page is not
readable, and in fact this was seen in gcc compiling certain files.