Age | Commit message | Author | Files | Lines |
|
On GitHub, @razvanphp & @hbernaciak both reported issues running the
APCu PHP module under Unit.
When using this module they were seeing errors like
'apcu_fetch(): Failed to acquire read lock'
However when running APCu under php-fpm, everything was fine.
The issue turned out to be due to our use of SYS_clone breaking the
pthreads(7) API used by APCu. Even if we had been using glibc's
clone(2) wrapper we would still have run into problems due to a known
issue there.
Essentially, when clone is used, glibc doesn't update its TID cache, so
the child ends up with the same cached TID as the parent. That TID is
used in various parts of pthreads(7), such as the locking primitives, so
when APCu grabbed a lock it ended up using the TID of the main unit
process (rather than that of the php application process that was
grabbing the lock).
As a result, when one of the application processes went to grab either a
read or write lock, the lock was actually attributed to the main unit
process. Once one process had acquired the write lock, any process that
then tried to acquire a read or write lock got EDEADLK from glibc, which
detected a deadlock because it thought the caller already held the write
lock when in fact it didn't.
It seems the right way to do this is via fork(2) and unshare(2). We
already use fork(2) on other platforms.
This requires a few tricks to keep the essence of the processes the same
as before when using clone:
1) We use the prctl(2) PR_SET_CHILD_SUBREAPER option (if it's
available, since Linux 3.4) to make the main unit process inherit
prototype processes after a double fork(2), rather than them being
reparented to 'init'.
This avoids needing to ^C twice to fully exit unit when running in
the foreground. It's probably also better if they maintain their
parent child relationship where possible.
2) We use a double fork(2) technique on the prototype processes to
ensure they themselves end up in a new PID namespace as PID 1 (when
CLONE_NEWPID is being used).
When using unshare(CLONE_NEWPID), the calling process is _not_
placed in the namespace (as discussed in pid_namespaces(7)). It
only sets things up so that subsequent children are placed in a PID
namespace.
Having the prototype processes as PID 1 in the new PID namespace is
probably a good thing and matches the behaviour of clone(2). Also,
some isolation tests break if the prototype process is not PID 1.
3) Due to the above double fork(2) the main unit process loses track
of the prototype process ID, which it needs to know.
To solve this, we employ a simple pipe(2) between the main unit and
prototype processes and pass the prototype grandchild PID from the
parent of the second fork(2) before exiting. This needs to be done
from the parent and not the grandchild, as the grandchild will see
itself having a PID of 1 while the main process needs its
externally visible PID.
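The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: spawn_prototype() and its ns_flags parameter are hypothetical names, error handling is reduced to the bare minimum, and the grandchild exits immediately where a real prototype would run.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start a "prototype" as a grandchild and return
 * its PID as seen by the caller, or -1 on error.  ns_flags is passed to
 * unshare(2), e.g. CLONE_NEWPID; 0 skips the unshare(2) call.
 */
static pid_t
spawn_prototype(int ns_flags)
{
    int    pfd[2];
    pid_t  child, proto = -1;

#ifdef PR_SET_CHILD_SUBREAPER
    /* Best effort: orphaned grandchildren are reparented to us. */
    (void) prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
#endif

    if (pipe(pfd) == -1) {
        return -1;
    }

    child = fork();
    if (child == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    if (child == 0) {
        /* First child: unshare, fork again, report the PID, exit. */
        close(pfd[0]);

        if (ns_flags != 0 && unshare(ns_flags) == -1) {
            _exit(1);
        }

        proto = fork();
        if (proto == -1) {
            _exit(1);
        }

        if (proto == 0) {
            /* Grandchild: PID 1 in the new namespace with CLONE_NEWPID. */
            close(pfd[1]);
            _exit(0);   /* A real prototype would keep running here. */
        }

        /* Only this process sees the externally visible PID. */
        if (write(pfd[1], &proto, sizeof(proto)) != sizeof(proto)) {
            _exit(1);
        }

        _exit(0);
    }

    /* Main process: read the grandchild PID, then reap the first child. */
    close(pfd[1]);

    if (read(pfd[0], &proto, sizeof(proto)) != sizeof(proto)) {
        proto = -1;
    }

    close(pfd[0]);
    (void) waitpid(child, NULL, 0);

    return proto;
}
```

Calling spawn_prototype(CLONE_NEWPID) needs the privileges unshare(2) requires; spawn_prototype(0) exercises just the double fork(2) and the pipe(2).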
Link: <https://www.php.net/manual/en/book.apcu.php>
Link: <https://sourceware.org/bugzilla/show_bug.cgi?id=21793>
Closes: <https://github.com/nginx/unit/issues/694>
Reviewed-by: Alejandro Colomar <alx@nginx.com>
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
|
|
We need to replace our use of clone/__NR_clone on Linux with
fork(2)/unshare(2) for enabling Linux namespaces(7), in order to keep
the pthreads(7) API working. So let's rename NXT_HAVE_CLONE to
NXT_HAVE_LINUX_NS, i.e. name it after the feature, not how it's
implemented; then if we change how we do namespaces again in the future,
we don't have to rename this.
Reviewed-by: Alejandro Colomar <alx@nginx.com>
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
|
|
This commit hooks into the cgroup infrastructure added in the previous
commit to create per-application cgroups.
It does this by adding each "prototype process" into its own cgroup;
each child process then inherits its parent's cgroup.
If we fail to create a cgroup we simply fail the process. This behaviour
may get enhanced in the future.
This won't actually do anything yet. Subsequent commits will hook this
up to the build and config systems.
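As a rough illustration of what placing a process into its own cgroup involves, here is a sketch that assumes a cgroup v2 hierarchy, where a process is moved by writing its PID to the cgroup.procs file of the target directory. cgroup_add_pid() is a hypothetical helper, not a function from this commit.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create the cgroup directory `dir` (assumed to be
 * a path below a cgroup v2 mount) and move `pid` into it.  Returns 0 on
 * success or -1 on error, in which case the caller fails the process.
 */
static int
cgroup_add_pid(const char *dir, pid_t pid)
{
    int   n;
    char  path[4096];
    FILE  *fp;

    if (mkdir(dir, 0755) == -1 && errno != EEXIST) {
        return -1;
    }

    n = snprintf(path, sizeof(path), "%s/cgroup.procs", dir);
    if (n < 0 || (size_t) n >= sizeof(path)) {
        return -1;
    }

    fp = fopen(path, "w");
    if (fp == NULL) {
        return -1;
    }

    n = fprintf(fp, "%ld\n", (long) pid);

    /* A rejected write only shows up once the buffer is flushed. */
    if (fclose(fp) != 0 || n < 0) {
        return -1;
    }

    return 0;
}
```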
Reviewed-by: Alejandro Colomar <alx@nginx.com>
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
|
|
Registering an isolated PID in the global PID hash is wrong
because it can be duplicated. Isolated processes are stored only
in the children list until the response for the WHOAMI message is
processed and the global PID is discovered.
To remove isolated siblings, a pointer to the children list is
introduced in the nxt_process_init_t struct.
This closes #633 issue on GitHub.
|
|
posix_spawn(3POSIX) was introduced by POSIX.1d
(IEEE Std 1003.1d-1999), and was later consolidated in
POSIX.1-2001, requiring it in all POSIX-compliant systems.
It's safe to assume it's always available, more than 20 years
after its standardization.
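For reference, a minimal use of the interface looks like this. run_and_wait() is a hypothetical helper for illustration only; it uses posix_spawnp(), the PATH-searching variant declared in the same header.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char  **environ;

/* Run argv[0] (looked up in PATH) and return its exit status, or -1. */
static int
run_and_wait(char *const argv[])
{
    int    status;
    pid_t  pid;

    /* posix_spawnp() returns an error number directly, not via errno. */
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0) {
        return -1;
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status)) {
        return -1;
    }

    return WEXITSTATUS(status);
}
```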
Link: <https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/spawn.h.html>
|
|
Splitting the process type connectivity matrix to 'keep ports' and 'send
ports'; the 'keep ports' matrix is used to clean up unnecessary ports after
forking a new process, and the 'send ports' matrix determines which process
types expect to get created process ports.
Unfortunately, the original single connectivity matrix no longer works because
of an application stop delay caused by prototypes. Existing applications
should not get the new router port at the moment.
|
|
|
|
This enables the reuse of process creation functions.
|
|
The socket is required for intercontextual communication in multithreaded apps.
|
|
|
|
Generic process-to-process shared memory exchange is no longer required. Here,
it is transformed into a router-to-application pattern. The collection of
outgoing shared memory segments is now a property of the application structure.
The applications connect to the router only, and the process only needs to group
the ports.
|
|
The application process needs to request the port from the router instead of the
latter pushing the port before sending a request to the application. This is
required to simplify the communication between the router and the application
and to prepare the router to use the application shared port and then the queue.
|
|
- Changed the port management callbacks to notifications, which e.g. avoids
the need to call the libunit function
- Added context and library instance reference counts for a safer resource
release
- Added the router main port initialization
|
|
|
|
|
|
The process abstraction has changed to:
setup(task, process)
start(task, process_data)
prefork(task, process, mp)
The prefork() occurs in the main process right before fork.
The file src/nxt_main_process.c is completely free of process
specific logic.
The creation of a process now supports a PROCESS_CREATED state. The
setup() function of each process can set its state to either
created or ready. If created, a MSG_PROCESS_CREATED is sent to the main
process, where external setup can be done (required for rootfs under
container).
The core processes (discovery, controller and router) don't need
external setup, so they all proceed to their start() function
straight away.
In the case of applications, the module is loaded at process setup()
time, and the module's init() function has become the start() of the
process.
The module API has changed to:
setup(task, process, conf)
start(task, data)
As a direct benefit of the PROCESS_CREATED message, the clone(2) of
processes using pid namespaces no longer needs to create a pipe
to make the child block until the parent sets up the uid/gid mappings,
nor does it need to receive the child pid.
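The created/ready flow described in this message can be sketched as a small state machine. Everything here is a simplification for illustration: the type and function names are hypothetical, the task, mp and conf arguments are left out, and sending MSG_PROCESS_CREATED is represented by a return value.

```c
typedef enum {
    PROC_STATE_INIT,
    PROC_STATE_CREATED,
    PROC_STATE_READY
} proc_state_t;

typedef struct proc_s  proc_t;

struct proc_s {
    proc_state_t  state;
    int           started;
    int           (*setup)(proc_t *p);  /* sets state to CREATED or READY */
    int           (*start)(proc_t *p);
};

/* Core-process style setup: no external setup needed. */
static int
setup_ready(proc_t *p)
{
    p->state = PROC_STATE_READY;
    return 0;
}

/* Application style setup: main must finish external setup first. */
static int
setup_created(proc_t *p)
{
    p->state = PROC_STATE_CREATED;
    return 0;
}

static int
start_mark(proc_t *p)
{
    p->started = 1;
    return 0;
}

/* Child side: returns 1 if waiting for main, 0 if started, -1 on error. */
static int
proc_child_run(proc_t *p)
{
    if (p->setup(p) != 0) {
        return -1;
    }

    if (p->state == PROC_STATE_CREATED) {
        /* Here the child would send MSG_PROCESS_CREATED to main. */
        return 1;
    }

    return p->start(p);
}

/* Called once main has completed the external setup. */
static int
proc_main_done(proc_t *p)
{
    p->state = PROC_STATE_READY;
    return p->start(p);
}
```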
|
|
An earlier attempt (ad6265786871) to resolve this condition on the
router's side added a new issue: the app could get a request before
acquiring a port.
|
|
Missing error log messages added.
|
|
The setuid/setgid syscalls require root capabilities, but if the kernel
supports unprivileged user namespaces then the child process has the full
set of capabilities in the new namespace, so we can allow setting "user"
and "group" in such cases (this is a common security use case).
Tests were added to ensure the user gets meaningful error messages for
uid/gid mapping misconfigurations.
|
|
This is required to avoid include cycles, as some nxt_clone_* functions
depend on the credential structures, but nxt_process depends on clone
structures.
|
|
Introduces the functions nxt_process_init_create() and
nxt_process_init_creds_set().
|
|
Now the nxt_user_groups_get() function uses getgrouplist(3) when available
(except MacOS, see below). For some platforms, getgrouplist() supports
a method of probing how many groups the user has, but the behavior is not
consistent. The method used here consists of optimistically trying to get up
to min(256, NGROUPS_MAX) groups; only if the ngroups returned exceeds the
original value do we make a second call. This method can block the main
process if LDAP/NIS+ is in use.
MacOS has getgrouplist(3) but it's buggy: it doesn't update ngroups if the
value passed is smaller than the number of groups the user has. Some
projects (like Go stdlib) call getgrouplist() in a loop, increasing ngroups
until it exceeds the number of groups the user belongs to, or fail when a
limit is reached. For performance reasons, this is to be avoided and MacOS is
handled in the fallback implementation.
The fallback implementation is the old Unit approach. It saves main's
user groups (getgroups(2)) and then calls initgroups(3) to load the
application's groups in main, then does a second getgroups(2) to store the
gids and restores main's groups at the end. Because of initgroups(3)'s call
to setgroups(2), this method requires root capabilities. In the case of OSX,
which has a small NGROUPS_MAX by default (16), it's not possible to restore
main's groups if the list is large; if so, this method falls back again: the
user_cred gids aren't stored, and the worker process calls initgroups()
itself and may block for some time if LDAP/NIS+ is in use.
|
|
- Introduced nxt_runtime_process_port_create().
- Moved nxt_process_use() into nxt_process.c from nxt_runtime.c.
- Renamed nxt_runtime_process_remove_pid() as nxt_runtime_process_remove().
- Some public functions transformed to static.
This closes #327 issue on GitHub.
|
|
Now it's possible to pass -DNXT_HAVE_CLONE=0 for debugging.
|
|
When using "credential: true", the new namespace starts with completely
empty uid and gid ranges. Any setuid/setgid/setgroups call using ids not
properly mapped with the uidmap and gidmap fields then returns EINVAL,
meaning the id is not valid inside the new namespace.
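Whether an id would be accepted can be checked against the configured ranges before any syscall is made. The sketch below assumes each map entry has a container start, a host start and a size, like a line of uid_map in user_namespaces(7); id_map_t and id_is_mapped() are hypothetical names.

```c
#include <stddef.h>

/* One mapping range, as in a uid_map/gid_map line. */
typedef struct {
    unsigned int  container;   /* first id inside the namespace */
    unsigned int  host;        /* first id outside the namespace */
    unsigned int  size;        /* number of consecutive ids */
} id_map_t;

/*
 * Return 1 if `id` (as seen inside the namespace) falls within one of
 * the n ranges, 0 otherwise.  An empty map therefore maps nothing.
 */
static int
id_is_mapped(const id_map_t *map, size_t n, unsigned int id)
{
    size_t  i;

    for (i = 0; i < n; i++) {
        if (id >= map[i].container && id - map[i].container < map[i].size) {
            return 1;
        }
    }

    return 0;
}
```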
|
|
|
|
The leak was introduced in 325b315e48c4.
This closes #322 issue on GitHub.
|
|
|
|
This closes #228 issue on GitHub.
|
|
|
|
|
|
The bug appeared in 5cc5002a788e, when the process type was converted
to a bitmask. This commit reverts the type back to a number.
This commit is related to #131 issue on GitHub.
|
|
|
|
|
|
|
|
CID 200496
CID 200494
CID 200490
CID 200489
CID 200483
CID 200482
CID 200472
CID 200465
|
|
- Main process should be connected to all other processes.
- Controller should be connected to Router.
- Router should be connected to Controller and all Workers.
- Workers should be connected to Router worker thread ports only.
This filtering helps to avoid unnecessary communication and various errors
during massive application workers stop / restart.
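The rules listed above amount to a small lookup table. The sketch below is illustrative only: the enum and function names are hypothetical, and it assumes the relation is symmetric, so every process type keeps a connection to the main process.

```c
typedef enum {
    PT_MAIN,
    PT_CONTROLLER,
    PT_ROUTER,
    PT_WORKER,
    PT_MAX
} ptype_t;

/* connected[a][b] != 0 means a process of type a keeps ports of type b. */
static const unsigned char  connected[PT_MAX][PT_MAX] = {
    /*                   main  ctrl  router  worker */
    [PT_MAIN]       = {  0,    1,    1,      1  },
    [PT_CONTROLLER] = {  1,    0,    1,      0  },
    [PT_ROUTER]     = {  1,    1,    0,      1  },
    [PT_WORKER]     = {  1,    0,    1,      0  },
};

/* Return 1 if type `a` should receive the ports of type `b`. */
static int
ports_connected(ptype_t a, ptype_t b)
{
    if (a >= PT_MAX || b >= PT_MAX) {
        return 0;
    }

    return connected[a][b];
}
```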
|
|
Two different router threads may send different requests to a single
application worker. In this case, shared memory fds from the worker
to the router will be sent over 2 different router ports. These fds
will be received and processed by different threads in any order.
This patch makes it possible to add incoming shared memory segments in
arbitrary order. Additionally, an array and memory pool are no longer
used to store segments because of the pool's single-threaded nature.
A custom array-like structure, nxt_port_mmaps_t, is introduced.
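One way such an array-like container can accept segments in any order is to index slots by segment id and grow on demand under a lock. This is a sketch under assumptions, not the layout of nxt_port_mmaps_t: the names, the heap allocation, the mutex and the MMAPS_MAX limit are all hypothetical.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define MMAPS_MAX  (1u << 20)   /* hypothetical upper bound on ids */

typedef struct {
    pthread_mutex_t  mutex;
    uint32_t         cap;
    void             **seg;     /* seg[id] is NULL until id arrives */
} mmaps_t;

static int
mmaps_init(mmaps_t *m)
{
    m->cap = 0;
    m->seg = NULL;

    return pthread_mutex_init(&m->mutex, NULL);
}

/* Store segment `seg` under `id`; ids may arrive in any order. */
static int
mmaps_add(mmaps_t *m, uint32_t id, void *seg)
{
    int       ret = -1;
    void      **tmp;
    uint32_t  i, ncap;

    if (id >= MMAPS_MAX) {
        return -1;
    }

    pthread_mutex_lock(&m->mutex);

    if (id >= m->cap) {
        ncap = (m->cap != 0) ? m->cap : 8;
        while (ncap <= id) {
            ncap *= 2;
        }

        tmp = realloc(m->seg, (size_t) ncap * sizeof(void *));
        if (tmp == NULL) {
            goto done;
        }

        /* Mark the new slots as not yet received. */
        for (i = m->cap; i < ncap; i++) {
            tmp[i] = NULL;
        }

        m->seg = tmp;
        m->cap = ncap;
    }

    m->seg[id] = seg;
    ret = 0;

done:

    pthread_mutex_unlock(&m->mutex);

    return ret;
}

/* Return the segment stored under `id`, or NULL if it has not arrived. */
static void *
mmaps_get(mmaps_t *m, uint32_t id)
{
    void  *p;

    pthread_mutex_lock(&m->mutex);
    p = (id < m->cap) ? m->seg[id] : NULL;
    pthread_mutex_unlock(&m->mutex);

    return p;
}
```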
|
|
This helps to decouple process removal from port memory pool cleanups.
|
|
A use counter helps to simplify the logic around freeing ports and
applications. A port 'post' function is introduced to simplify posting the
execution of a particular function to the original port engine's thread.
The write message queue is protected by a mutex, which makes the port write
operation thread safe.
|
|
The memory pool is not used by port_hash, and it was a mistake to pass it into
the 'add' and 'remove' functions. port_hash entries are allocated from the heap.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Used for connection mem pool cleanup, which can be used by buffers.
Used for port mem pool to safely destroy linked process.
|