Diffstat (limited to 'man7/user_namespaces.7')
-rw-r--r-- man7/user_namespaces.7 1469
1 files changed, 0 insertions, 1469 deletions
diff --git a/man7/user_namespaces.7 b/man7/user_namespaces.7
deleted file mode 100644
index ddeb669c3..000000000
--- a/man7/user_namespaces.7
+++ /dev/null
@@ -1,1469 +0,0 @@
-.\" Copyright (c) 2013, 2014 by Michael Kerrisk <mtk.manpages@gmail.com>
-.\" and Copyright (c) 2012, 2014 by Eric W. Biederman <ebiederm@xmission.com>
-.\"
-.\" SPDX-License-Identifier: Linux-man-pages-copyleft
-.\"
-.\"
-.TH user_namespaces 7 (date) "Linux man-pages (unreleased)"
-.SH NAME
-user_namespaces \- overview of Linux user namespaces
-.SH DESCRIPTION
-For an overview of namespaces, see
-.BR namespaces (7).
-.P
-User namespaces isolate security-related identifiers and attributes,
-in particular,
-user IDs and group IDs (see
-.BR credentials (7)),
-the root directory,
-keys (see
-.BR keyrings (7)),
-.\" FIXME: This page says very little about the interaction
-.\" of user namespaces and keys. Add something on this topic.
-and capabilities (see
-.BR capabilities (7)).
-A process's user and group IDs can be different
-inside and outside a user namespace.
-In particular,
-a process can have a normal unprivileged user ID outside a user namespace
-while at the same time having a user ID of 0 inside the namespace;
-in other words,
-the process has full privileges for operations inside the user namespace,
-but is unprivileged for operations outside the namespace.
-.\"
-.\" ============================================================
-.\"
-.SS Nested namespaces, namespace membership
-User namespaces can be nested;
-that is, each user namespace\[em]except the initial ("root")
-namespace\[em]has a parent user namespace,
-and can have zero or more child user namespaces.
-The parent user namespace is the user namespace
-of the process that creates the user namespace via a call to
-.BR unshare (2)
-or
-.BR clone (2)
-with the
-.B CLONE_NEWUSER
-flag.
-.P
-The kernel imposes (since Linux 3.11) a limit of 32 nested levels of
-.\" commit 8742f229b635bf1c1c84a3dfe5e47c814c20b5c8
-user namespaces.
-.\" FIXME Explain the rationale for this limit. (What is the rationale?)
-Calls to
-.BR unshare (2)
-or
-.BR clone (2)
-that would cause this limit to be exceeded fail with the error
-.BR EUSERS .
-.P
-Each process is a member of exactly one user namespace.
-A process created via
-.BR fork (2)
-or
-.BR clone (2)
-without the
-.B CLONE_NEWUSER
-flag is a member of the same user namespace as its parent.
-A single-threaded process can join another user namespace with
-.BR setns (2)
-if it has the
-.B CAP_SYS_ADMIN
-capability in that namespace;
-upon doing so, it gains a full set of capabilities in that namespace.
-.P
-A call to
-.BR clone (2)
-or
-.BR unshare (2)
-with the
-.B CLONE_NEWUSER
-flag makes the new child process (for
-.BR clone (2))
-or the caller (for
-.BR unshare (2))
-a member of the new user namespace created by the call.
-.P
-The
-.B NS_GET_PARENT
-.BR ioctl (2)
-operation can be used to discover the parental relationship
-between user namespaces; see
-.BR ioctl_ns (2).
-.P
-A task that changes one of its effective IDs
-will have its dumpability reset to the value in
-.IR /proc/sys/fs/suid_dumpable .
-This may affect the ownership of proc files of child processes
-and may thus cause the parent to lack the permissions
-to write to mapping files of child processes running in a new user namespace.
-In such cases making the parent process dumpable, using
-.B PR_SET_DUMPABLE
-in a call to
-.BR prctl (2),
-before creating a child process in a new user namespace
-may rectify this problem.
-See
-.BR prctl (2)
-and
-.BR proc (5)
-for details on how ownership is affected.
-.\"
-.\" ============================================================
-.\"
-.SS Capabilities
-The child process created by
-.BR clone (2)
-with the
-.B CLONE_NEWUSER
-flag starts out with a complete set
-of capabilities in the new user namespace.
-Likewise, a process that creates a new user namespace using
-.BR unshare (2)
-or joins an existing user namespace using
-.BR setns (2)
-gains a full set of capabilities in that namespace.
-On the other hand,
-that process has no capabilities in the parent (in the case of
-.BR clone (2))
-or previous (in the case of
-.BR unshare (2)
-and
-.BR setns (2))
-user namespace,
-even if the new namespace is created or joined by the root user
-(i.e., a process with user ID 0 in the root namespace).
-.P
-Note that a call to
-.BR execve (2)
-will cause a process's capabilities to be recalculated in the usual way (see
-.BR capabilities (7)).
-Consequently,
-unless the process has a user ID of 0 within the namespace,
-or the executable file has a nonempty inheritable capabilities mask,
-the process will lose all capabilities.
-See the discussion of user and group ID mappings, below.
-.P
-A call to
-.BR clone (2)
-or
-.BR unshare (2)
-using the
-.B CLONE_NEWUSER
-flag
-or a call to
-.BR setns (2)
-that moves the caller into another user namespace
-sets the "securebits" flags
-(see
-.BR capabilities (7))
-to their default values (all flags disabled) in the child (for
-.BR clone (2))
-or caller (for
-.BR unshare (2)
-or
-.BR setns (2)).
-Note that because the caller no longer has capabilities
-in its original user namespace after a call to
-.BR setns (2),
-it is not possible for a process to reset its "securebits" flags while
-retaining its user namespace membership by using a pair of
-.BR setns (2)
-calls to move to another user namespace and then return to
-its original user namespace.
-.P
-The rules for determining whether or not a process has a capability
-in a particular user namespace are as follows:
-.IP \[bu] 3
-A process has a capability inside a user namespace
-if it is a member of that namespace and
-it has the capability in its effective capability set.
-A process can gain capabilities in its effective capability
-set in various ways.
-For example, it may execute a set-user-ID program or an
-executable with associated file capabilities.
-In addition,
-a process may gain capabilities via the effect of
-.BR clone (2),
-.BR unshare (2),
-or
-.BR setns (2),
-as already described.
-.\" In the 3.8 sources, see security/commoncap.c::cap_capable():
-.IP \[bu]
-If a process has a capability in a user namespace,
-then it has that capability in all child (and further removed descendant)
-namespaces as well.
-.IP \[bu]
-.\" * The owner of the user namespace in the parent of the
-.\" * user namespace has all caps.
-When a user namespace is created, the kernel records the effective
-user ID of the creating process as being the "owner" of the namespace.
-.\" (and likewise associates the effective group ID of the creating process
-.\" with the namespace).
-A process that resides
-in the parent of the user namespace
-.\" See kernel commit 520d9eabce18edfef76a60b7b839d54facafe1f9 for a fix
-.\" on this point
-and whose effective user ID matches the owner of the namespace
-has all capabilities in the namespace.
-.\" This includes the case where the process executes a set-user-ID
-.\" program that confers the effective UID of the creator of the namespace.
-By virtue of the previous rule,
-this means that the process has all capabilities in all
-further removed descendant user namespaces as well.
-The
-.B NS_GET_OWNER_UID
-.BR ioctl (2)
-operation can be used to discover the user ID of the owner of the namespace;
-see
-.BR ioctl_ns (2).
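-.P
-The three rules above can be sketched as a toy model.
-This is only an illustration under stated assumptions:
-the class and function names are hypothetical,
-user IDs are compared as plain integers,
-and none of the other checks that the kernel performs are modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class UserNS:
    """Toy user namespace: a parent link plus the creator's effective UID."""
    parent: Optional["UserNS"] = None
    owner_euid: int = 0  # effective UID of the process that created it


@dataclass
class Process:
    """Toy process: its user namespace, effective UID, effective capabilities."""
    userns: UserNS
    euid: int
    effective: set = field(default_factory=set)


def has_capability(proc: Process, target: UserNS, cap: str) -> bool:
    """Apply the three rules from the text, walking up from `target`."""
    ns = target
    while ns is not None:
        # Rule 1: the process is a member of this namespace, so its
        # effective set decides. Rule 2 is the upward walk itself: a hit
        # on an ancestor also covers every descendant namespace.
        if ns is proc.userns:
            return cap in proc.effective
        # Rule 3: the process resides in the parent of this namespace
        # and its effective UID matches the namespace owner.
        if ns.parent is proc.userns and ns.owner_euid == proc.euid:
            return True
        ns = ns.parent
    return False
```

-.P
-In this model, a process in the initial namespace whose effective user ID
-created a child namespace has every capability in that child
-and in its descendants,
-while a process inside the child has none in the initial namespace.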
-.\"
-.\" ============================================================
-.\"
-.SS Effect of capabilities within a user namespace
-Having a capability inside a user namespace
-permits a process to perform operations (that require privilege)
-only on resources governed by that namespace.
-In other words, having a capability in a user namespace permits a process
-to perform privileged operations on resources that are governed by (nonuser)
-namespaces owned by (associated with) the user namespace
-(see the next subsection).
-.P
-On the other hand, there are many privileged operations that affect
-resources that are not associated with any namespace type,
-for example, changing the system (i.e., calendar) time (governed by
-.BR CAP_SYS_TIME ),
-loading a kernel module (governed by
-.BR CAP_SYS_MODULE ),
-and creating a device (governed by
-.BR CAP_MKNOD ).
-Only a process with privileges in the
-.I initial
-user namespace can perform such operations.
-.P
-Holding
-.B CAP_SYS_ADMIN
-within the user namespace that owns a process's mount namespace
-allows that process to create bind mounts
-and mount the following types of filesystems:
-.\" fs_flags = FS_USERNS_MOUNT in kernel sources
-.P
-.RS 4
-.PD 0
-.IP \[bu] 3
-.I /proc
-(since Linux 3.8)
-.IP \[bu]
-.I /sys
-(since Linux 3.8)
-.IP \[bu]
-.I devpts
-(since Linux 3.9)
-.IP \[bu]
-.BR tmpfs (5)
-(since Linux 3.9)
-.IP \[bu]
-.I ramfs
-(since Linux 3.9)
-.IP \[bu]
-.I mqueue
-(since Linux 3.9)
-.IP \[bu]
-.I bpf
-.\" commit b2197755b2633e164a439682fb05a9b5ea48f706
-(since Linux 4.4)
-.IP \[bu]
-.I overlayfs
-.\" commit 92dbc9dedccb9759c7f9f2f0ae6242396376988f
-.\" commit 4cb2c00c43b3fe88b32f29df4f76da1b92c33224
-(since Linux 5.11)
-.PD
-.RE
-.P
-Holding
-.B CAP_SYS_ADMIN
-within the user namespace that owns a process's cgroup namespace
-allows (since Linux 4.6)
-that process to mount the cgroup version 2 filesystem and
-cgroup version 1 named hierarchies
-(i.e., cgroup filesystems mounted with the
-.I \[dq]none,name=\[dq]
-option).
-.P
-Holding
-.B CAP_SYS_ADMIN
-within the user namespace that owns a process's PID namespace
-allows (since Linux 3.8)
-that process to mount
-.I /proc
-filesystems.
-.P
-Note, however, that mounting block-based filesystems can be done
-only by a process that holds
-.B CAP_SYS_ADMIN
-in the initial user namespace.
-.\"
-.\" ============================================================
-.\"
-.SS Interaction of user namespaces and other types of namespaces
-Starting in Linux 3.8, unprivileged processes can create user namespaces,
-and the other types of namespaces can be created with just the
-.B CAP_SYS_ADMIN
-capability in the caller's user namespace.
-.P
-When a nonuser namespace is created,
-it is owned by the user namespace in which the creating process
-was a member at the time of the creation of the namespace.
-Privileged operations on resources governed by the nonuser namespace
-require that the process has the necessary capabilities
-in the user namespace that owns the nonuser namespace.
-.P
-If
-.B CLONE_NEWUSER
-is specified along with other
-.B CLONE_NEW*
-flags in a single
-.BR clone (2)
-or
-.BR unshare (2)
-call, the user namespace is guaranteed to be created first,
-giving the child
-.RB ( clone (2))
-or caller
-.RB ( unshare (2))
-privileges over the remaining namespaces created by the call.
-Thus, it is possible for an unprivileged caller to specify this combination
-of flags.
-.P
-When a new namespace (other than a user namespace) is created via
-.BR clone (2)
-or
-.BR unshare (2),
-the kernel records the user namespace of the creating process as the owner of
-the new namespace.
-(This association can't be changed.)
-When a process in the new namespace subsequently performs
-privileged operations that operate on global
-resources isolated by the namespace,
-the permission checks are performed according to the process's capabilities
-in the user namespace that the kernel associated with the new namespace.
-For example, suppose that a process attempts to change the hostname
-.RB ( sethostname (2)),
-a resource governed by the UTS namespace.
-In this case,
-the kernel will determine which user namespace owns
-the process's UTS namespace, and check whether the process has the
-required capability
-.RB ( CAP_SYS_ADMIN )
-in that user namespace.
-.P
-The
-.B NS_GET_USERNS
-.BR ioctl (2)
-operation can be used to discover the user namespace
-that owns a nonuser namespace; see
-.BR ioctl_ns (2).
-.\"
-.\" ============================================================
-.\"
-.SS User and group ID mappings: uid_map and gid_map
-When a user namespace is created,
-it starts out without a mapping of user IDs (group IDs)
-to the parent user namespace.
-The
-.IR /proc/ pid /uid_map
-and
-.IR /proc/ pid /gid_map
-files (available since Linux 3.5)
-.\" commit 22d917d80e842829d0ca0a561967d728eb1d6303
-expose the mappings for user and group IDs
-inside the user namespace for the process
-.IR pid .
-These files can be read to view the mappings in a user namespace and
-written to (once) to define the mappings.
-.P
-The description in the following paragraphs explains the details for
-.IR uid_map ;
-.I gid_map
-is exactly the same,
-but each instance of "user ID" is replaced by "group ID".
-.P
-The
-.I uid_map
-file exposes the mapping of user IDs from the user namespace
-of the process
-.I pid
-to the user namespace of the process that opened
-.I uid_map
-(but see a qualification to this point below).
-In other words, processes that are in different user namespaces
-will potentially see different values when reading from a particular
-.I uid_map
-file, depending on the user ID mappings for the user namespaces
-of the reading processes.
-.P
-Each line in the
-.I uid_map
-file specifies a 1-to-1 mapping of a range of contiguous
-user IDs between two user namespaces.
-(When a user namespace is first created, this file is empty.)
-The specification in each line takes the form of
-three numbers delimited by white space.
-The first two numbers specify the starting user ID in
-each of the two user namespaces.
-The third number specifies the length of the mapped range.
-In detail, the fields are interpreted as follows:
-.IP (1) 5
-The start of the range of user IDs in
-the user namespace of the process
-.IR pid .
-.IP (2)
-The start of the range of user
-IDs to which the user IDs specified by field one map.
-How field two is interpreted depends on whether the process that opened
-.I uid_map
-and the process
-.I pid
-are in the same user namespace, as follows:
-.RS
-.IP (a) 5
-If the two processes are in different user namespaces:
-field two is the start of a range of
-user IDs in the user namespace of the process that opened
-.IR uid_map .
-.IP (b)
-If the two processes are in the same user namespace:
-field two is the start of the range of
-user IDs in the parent user namespace of the process
-.IR pid .
-This case enables the opener of
-.I uid_map
-(the common case here is opening
-.IR /proc/self/uid_map )
-to see the mapping of user IDs into the user namespace of the process
-that created this user namespace.
-.RE
-.IP (3)
-The length of the range of user IDs that is mapped between the two
-user namespaces.
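-.P
-As a minimal sketch of the three-field format described above,
-the following parser turns the contents of a map file into ranges.
-The names IdRange, parse_id_map, and read_id_map are hypothetical,
-chosen for this illustration only.

```python
from typing import List, NamedTuple


class IdRange(NamedTuple):
    inside_start: int   # field 1: first ID in the namespace of process pid
    outside_start: int  # field 2: first ID in the other namespace
    length: int         # field 3: number of consecutive IDs mapped


def parse_id_map(text: str) -> List[IdRange]:
    """Parse the contents of a uid_map or gid_map file into ranges."""
    ranges = []
    for line in text.splitlines():
        if not line.strip():
            continue  # a map that has not been written yet is empty
        inside, outside, length = (int(value) for value in line.split())
        ranges.append(IdRange(inside, outside, length))
    return ranges


def read_id_map(pid: str = "self", kind: str = "uid_map") -> List[IdRange]:
    """Read /proc/<pid>/<kind>; this needs a Linux /proc filesystem."""
    with open(f"/proc/{pid}/{kind}") as f:
        return parse_id_map(f.read())
```

-.P
-How field two should be interpreted still depends on the namespaces of the
-reader and of the process pid, as described in cases (a) and (b).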
-.P
-System calls that return user IDs (group IDs)\[em]for example,
-.BR getuid (2),
-.BR getgid (2),
-and the credential fields in the structure returned by
-.BR stat (2)\[em]return
-the user ID (group ID) mapped into the caller's user namespace.
-.P
-When a process accesses a file, its user and group IDs
-are mapped into the initial user namespace for the purpose of permission
-checking and assigning IDs when creating a file.
-When a process retrieves file user and group IDs via
-.BR stat (2),
-the IDs are mapped in the opposite direction,
-to produce values relative to the process user and group ID mappings.
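-.P
-The two directions of translation can be sketched as follows.
-This is an assumption-laden illustration:
-it handles a single level of mapping only
-(reaching the initial user namespace from a nested namespace would mean
-applying the map of each intermediate namespace in turn),
-and the helper names are chosen for this example.

```python
from typing import List, Optional, Tuple

IdMap = List[Tuple[int, int, int]]  # (inside_start, outside_start, length)


def map_id_down(id_map: IdMap, inside_id: int) -> Optional[int]:
    """Translate an ID used inside the namespace to the outer namespace."""
    for inside_start, outside_start, length in id_map:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None  # no mapping for this ID


def map_id_up(id_map: IdMap, outside_id: int) -> Optional[int]:
    """Translate an outer ID to the value seen inside the namespace."""
    for inside_start, outside_start, length in id_map:
        if outside_start <= outside_id < outside_start + length:
            return inside_start + (outside_id - outside_start)
    return None  # no mapping for this ID
```

-.P
-Permission checks correspond to the downward direction,
-and the values reported by stat(2) to the upward direction.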
-.P
-The initial user namespace has no parent namespace,
-but, for consistency, the kernel provides dummy user and group
-ID mapping files for this namespace.
-Looking at the
-.I uid_map
-file
-.RI ( gid_map
-is the same) from a shell in the initial namespace shows:
-.P
-.in +4n
-.EX
-$ \fBcat /proc/$$/uid_map\fP
- 0 0 4294967295
-.EE
-.in
-.P
-This mapping tells us
-that the range starting at user ID 0 in this namespace
-maps to a range starting at 0 in the (nonexistent) parent namespace,
-and the length of the range is the largest 32-bit unsigned integer.
-This leaves 4294967295 (the 32-bit signed \-1 value) unmapped.
-This is deliberate:
-.I (uid_t)\~\-1
-is used in several interfaces (e.g.,
-.BR setreuid (2))
-as a way to specify "no user ID".
-Leaving
-.I (uid_t)\~\-1
-unmapped and unusable guarantees that there will be no
-confusion when using these interfaces.
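-.P
-The arithmetic behind this can be checked directly.
-The variable names below are illustrative only.

```python
UID_T_BITS = 32
NO_ID = 2**UID_T_BITS - 1  # (uid_t) -1

# The single line of the initial namespace's uid_map shown above.
inside_start, outside_start, length = 0, 0, 4294967295

# A range of `length` IDs starting at 0 ends one short of (uid_t) -1.
last_mapped = inside_start + length - 1

print(last_mapped)          # 4294967294
print(NO_ID)                # 4294967295
print(last_mapped < NO_ID)  # True
```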
-.\"
-.\" ============================================================
-.\"
-.SS Defining user and group ID mappings: writing to uid_map and gid_map
-After the creation of a new user namespace, the
-.I uid_map
-file of
-.I one
-of the processes in the namespace may be written to
-.I once
-to define the mapping of user IDs in the new user namespace.
-An attempt to write more than once to a
-.I uid_map
-file in a user namespace fails with the error
-.BR EPERM .
-Similar rules apply for
-.I gid_map
-files.
-.P
-The lines written to
-.I uid_map
-.RI ( gid_map )
-must conform to the following validity rules:
-.IP \[bu] 3
-The three fields must be valid numbers,
-and the last field must be greater than 0.
-.IP \[bu]
-Lines are terminated by newline characters.
-.IP \[bu]
-There is a limit on the number of lines in the file.
-In Linux 4.14 and earlier, this limit was (arbitrarily)
-.\" 5*12-byte records could fit in a 64B cache line
-set at 5 lines.
-Since Linux 4.15,
-.\" commit 6397fac4915ab3002dc15aae751455da1a852f25
-the limit is 340 lines.
-In addition, the number of bytes written to
-the file must be less than the system page size,
-and the write must be performed at the start of the file (i.e.,
-.BR lseek (2)
-and
-.BR pwrite (2)
-can't be used to write to nonzero offsets in the file).
-.IP \[bu]
-The range of user IDs (group IDs)
-specified in each line cannot overlap with the ranges
-in any other lines.
-In the initial implementation (Linux 3.8), this requirement was
-satisfied by a simplistic implementation that imposed the further
-requirement that
-the values in both field 1 and field 2 of successive lines must be
-in ascending numerical order,
-which prevented some otherwise valid maps from being created.
-Linux 3.9 and later
-.\" commit 0bd14b4fd72afd5df41e9fd59f356740f22fceba
-fix this limitation, allowing any valid set of nonoverlapping maps.
-.IP \[bu]
-At least one line must be written to the file.
-.P
-Writes that violate the above rules fail with the error
-.BR EINVAL .
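-.P
-A user-space pre-check of these validity rules might look like the
-following sketch.
-It is not the kernel's implementation:
-the function name, the messages,
-and the default page size of 4096 bytes are assumptions made for this example
-(the real page size can be obtained with os.sysconf("SC_PAGE_SIZE")).

```python
from typing import List

MAX_LINES = 340  # limit since Linux 4.15; 5 in Linux 4.14 and earlier


def check_id_map(data: str, page_size: int = 4096,
                 max_lines: int = MAX_LINES) -> List[str]:
    """Return a list of rule violations; an empty list means none found."""
    problems = []
    if len(data.encode()) >= page_size:
        problems.append("data must be smaller than the page size")
    lines = data.splitlines()
    if not lines:
        problems.append("at least one line is required")
    if len(lines) > max_lines:
        problems.append(f"more than {max_lines} lines")
    ranges = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 3 or not all(f.isascii() and f.isdigit() for f in fields):
            problems.append(f"line {number}: expected three numbers")
            continue
        inside, outside, length = map(int, fields)
        if length <= 0:
            problems.append(f"line {number}: length must be greater than 0")
            continue
        ranges.append((number, inside, outside, length))
    # Ranges must not overlap on either side of the mapping.
    for i, (n1, in1, out1, len1) in enumerate(ranges):
        for n2, in2, out2, len2 in ranges[i + 1:]:
            if in1 < in2 + len2 and in2 < in1 + len1:
                problems.append(f"lines {n1} and {n2}: overlapping IDs inside the namespace")
            if out1 < out2 + len2 and out2 < out1 + len1:
                problems.append(f"lines {n1} and {n2}: overlapping IDs outside the namespace")
    return problems
```

-.P
-A check of this kind can give a more specific diagnostic than the single
-EINVAL error that a failed write reports.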
-.P
-In order for a process to write to the
-.IR /proc/ pid /uid_map
-.RI ( /proc/ pid /gid_map )
-file, all of the following permission requirements must be met:
-.IP \[bu] 3
-The writing process must have the
-.B CAP_SETUID
-.RB ( CAP_SETGID )
-capability in the user namespace of the process
-.IR pid .
-.IP \[bu]
-The writing process must either be in the user namespace of the process
-.I pid
-or be in the parent user namespace of the process
-.IR pid .
-.IP \[bu]
-The mapped user IDs (group IDs) must in turn have a mapping
-in the parent user namespace.
-.IP \[bu]
-If updating
-.IR /proc/ pid /uid_map
-to create a mapping that maps UID 0 in the parent namespace,
-then one of the following must be true:
-.RS
-.IP (a) 5
-if the writing process is in the parent user namespace,
-then it must have the
-.B CAP_SETFCAP
-capability in that user namespace; or
-.IP (b)
-if the writing process is in the child user namespace,
-then the process that created the user namespace must have had the
-.B CAP_SETFCAP
-capability when the namespace was created.
-.RE
-.IP
-This rule has been in place since
-.\" commit db2e718a47984b9d71ed890eb2ea36ecf150de18
-Linux 5.12.
-It eliminates an earlier security bug whereby
-a UID 0 process that lacks the
-.B CAP_SETFCAP
-capability,
-which is needed to create a binary with namespaced file capabilities
-(as described in
-.BR capabilities (7)),
-could nevertheless create such a binary,
-by the following steps:
-.RS
-.IP (1) 5
-Create a new user namespace with the identity mapping
-(i.e., UID 0 in the new user namespace maps to UID 0 in the parent namespace),
-so that UID 0 in both namespaces is equivalent to the same root user ID.
-.IP (2)
-Since the child process has the
-.B CAP_SETFCAP
-capability, it could create a binary with namespaced file capabilities
-that would then be effective in the parent user namespace
-(because the root user IDs are the same in the two namespaces).
-.RE
-.IP \[bu]
-One of the following two cases applies:
-.RS
-.IP (a) 5
-.I Either
-the writing process has the
-.B CAP_SETUID
-.RB ( CAP_SETGID )
-capability in the
-.I parent
-user namespace.
-.RS
-.IP \[bu] 3
-No further restrictions apply:
-the process can make mappings to arbitrary user IDs (group IDs)
-in the parent user namespace.
-.RE
-.IP (b)
-.I Or
-otherwise all of the following restrictions apply:
-.RS
-.IP \[bu] 3
-The data written to
-.I uid_map
-.RI ( gid_map )
-must consist of a single line that maps
-the writing process's effective user ID
-(group ID) in the parent user namespace to a user ID (group ID)
-in the user namespace.
-.IP \[bu]
-The writing process must have the same effective user ID as the process
-that created the user namespace.
-.IP \[bu]
-In the case of
-.IR gid_map ,
-use of the
-.BR setgroups (2)
-system call must first be denied by writing
-.RI \[dq] deny \[dq]
-to the
-.IR /proc/ pid /setgroups
-file (see below) before writing to
-.IR gid_map .
-.RE
-.RE
-.P
-Writes that violate the above rules fail with the error
-.BR EPERM .
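-.P
-For the unprivileged case (b), the order of the writes matters.
-The following sketch builds that sequence for a process that maps only its
-own effective IDs.
-The function names are hypothetical,
-the IDs are supplied by the caller rather than discovered,
-and apply_writes() is shown for completeness but would succeed only
-when all of the permission rules above are met.

```python
import os
from typing import List, Tuple


def unprivileged_map_writes(pid: int, outer_uid: int, outer_gid: int,
                            inner_uid: int = 0, inner_gid: int = 0
                            ) -> List[Tuple[str, str]]:
    """Return (path, data) pairs in the order they need to be written."""
    return [
        # setgroups(2) must be denied before gid_map can be written
        # without CAP_SETGID in the parent user namespace.
        (f"/proc/{pid}/setgroups", "deny"),
        # Each map is a single line that covers exactly one ID.
        (f"/proc/{pid}/uid_map", f"{inner_uid} {outer_uid} 1\n"),
        (f"/proc/{pid}/gid_map", f"{inner_gid} {outer_gid} 1\n"),
    ]


def apply_writes(writes: List[Tuple[str, str]]) -> None:
    """Perform each write as a single write(2) at the start of the file."""
    for path, data in writes:
        fd = os.open(path, os.O_WRONLY)
        try:
            os.write(fd, data.encode())
        finally:
            os.close(fd)
```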
-.\"
-.\" ============================================================
-.\"
-.SS Project ID mappings: projid_map
-Similarly to user and group ID mappings,
-it is possible to create project ID mappings for a user namespace.
-(Project IDs are used for disk quotas; see
-.BR setquota (8)
-and
-.BR quotactl (2).)
-.P
-Project ID mappings are defined by writing to the
-.IR /proc/ pid /projid_map
-file (present since
-.\" commit f76d207a66c3a53defea67e7d36c3eb1b7d6d61d
-Linux 3.7).
-.P
-The validity rules for writing to the
-.IR /proc/ pid /projid_map
-file are as for writing to the
-.I uid_map
-file; violation of these rules causes
-.BR write (2)
-to fail with the error
-.BR EINVAL .
-.P
-The permission rules for writing to the
-.IR /proc/ pid /projid_map
-file are as follows:
-.IP \[bu] 3
-The writing process must either be in the user namespace of the process
-.I pid
-or be in the parent user namespace of the process
-.IR pid .
-.IP \[bu]
-The mapped project IDs must in turn have a mapping
-in the parent user namespace.
-.P
-Violation of these rules causes
-.BR write (2)
-to fail with the error
-.BR EPERM .
-.\"
-.\" ============================================================
-.\"
-.SS Interaction with system calls that change process UIDs or GIDs
-In a user namespace where the
-.I uid_map
-file has not been written, the system calls that change user IDs will fail.
-Similarly, if the
-.I gid_map
-file has not been written, the system calls that change group IDs will fail.
-After the
-.I uid_map
-and
-.I gid_map
-files have been written, only the mapped values may be used in
-system calls that change user and group IDs.
-.P
-For user IDs, the relevant system calls include
-.BR setuid (2),
-.BR setfsuid (2),
-.BR setreuid (2),
-and
-.BR setresuid (2).
-For group IDs, the relevant system calls include
-.BR setgid (2),
-.BR setfsgid (2),
-.BR setregid (2),
-.BR setresgid (2),
-and
-.BR setgroups (2).
-.P
-Writing
-.RI \[dq] deny \[dq]
-to the
-.IR /proc/ pid /setgroups
-file before writing to
-.IR /proc/ pid /gid_map
-.\" Things changed in Linux 3.19
-.\" commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8
-.\" commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272
-.\" http://lwn.net/Articles/626665/
-will permanently disable
-.BR setgroups (2)
-in a user namespace and allow writing to
-.IR /proc/ pid /gid_map
-without having the
-.B CAP_SETGID
-capability in the parent user namespace.
-.\"
-.\" ============================================================
-.\"
-.SS The \fI/proc/\fPpid\fI/setgroups\fP file
-.\"
-.\" commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8
-.\" commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272
-.\" http://lwn.net/Articles/626665/
-.\" http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-8989
-.\"
-The
-.IR /proc/ pid /setgroups
-file displays the string
-.RI \[dq] allow \[dq]
-if processes in the user namespace that contains the process
-.I pid
-are permitted to employ the
-.BR setgroups (2)
-system call; it displays
-.RI \[dq] deny \[dq]
-if
-.BR setgroups (2)
-is not permitted in that user namespace.
-Note that regardless of the value in the
-.IR /proc/ pid /setgroups
-file (and regardless of the process's capabilities), calls to
-.BR setgroups (2)
-are also not permitted if
-.IR /proc/ pid /gid_map
-has not yet been set.
-.P
-A privileged process (one with the
-.B CAP_SYS_ADMIN
-capability in the namespace) may write either of the strings
-.RI \[dq] allow \[dq]
-or
-.RI \[dq] deny \[dq]
-to this file
-.I before
-writing a group ID mapping
-for this user namespace to the file
-.IR /proc/ pid /gid_map .
-Writing the string
-.RI \[dq] deny \[dq]
-prevents any process in the user namespace from employing
-.BR setgroups (2).
-.P
-The essence of the restrictions described in the preceding
-paragraph is that it is permitted to write to
-.IR /proc/ pid /setgroups
-only so long as calling
-.BR setgroups (2)
-is disallowed because
-.IR /proc/ pid /gid_map
-has not been set.
-This ensures that a process cannot transition from a state where
-.BR setgroups (2)
-is allowed to a state where
-.BR setgroups (2)
-is denied;
-a process can transition only from
-.BR setgroups (2)
-being disallowed to
-.BR setgroups (2)
-being allowed.
-.P
-The default value of this file in the initial user namespace is
-.RI \[dq] allow \[dq].
-.P
-Once
-.IR /proc/ pid /gid_map
-has been written to
-(which has the effect of enabling
-.BR setgroups (2)
-in the user namespace),
-it is no longer possible to disallow
-.BR setgroups (2)
-by writing
-.RI \[dq] deny \[dq]
-to
-.IR /proc/ pid /setgroups
-(the write fails with the error
-.BR EPERM ).
-.P
-A child user namespace inherits the
-.IR /proc/ pid /setgroups
-setting from its parent.
-.P
-If the
-.I setgroups
-file has the value
-.RI \[dq] deny \[dq],
-then the
-.BR setgroups (2)
-system call can't subsequently be reenabled (by writing
-.RI \[dq] allow \[dq]
-to the file) in this user namespace.
-(Attempts to do so fail with the error
-.BR EPERM .)
-This restriction also propagates down to all child user namespaces of
-this user namespace.
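-.P
-The transitions described in this subsection can be summarized as a small
-state machine.
-This is a toy model with a hypothetical class name;
-it covers only the cases stated above
-and ignores the capability checks on the writer.

```python
import errno


class UserNsSetgroups:
    """Toy model of the setgroups state of one user namespace."""

    def __init__(self, inherited: str = "allow") -> None:
        self.value = inherited  # what /proc/pid/setgroups would display
        self.gid_map_written = False

    def write_setgroups(self, value: str) -> None:
        if value not in ("allow", "deny"):
            raise ValueError("expected 'allow' or 'deny'")
        if value == "deny" and self.gid_map_written:
            raise PermissionError(errno.EPERM, "gid_map has already been written")
        if value == "allow" and self.value == "deny":
            raise PermissionError(errno.EPERM, "setgroups cannot be reenabled")
        self.value = value

    def write_gid_map(self) -> None:
        if self.gid_map_written:
            raise PermissionError(errno.EPERM, "gid_map can be written only once")
        self.gid_map_written = True

    def setgroups_permitted(self) -> bool:
        # setgroups(2) needs both a written gid_map and the value "allow".
        return self.gid_map_written and self.value == "allow"

    def child(self) -> "UserNsSetgroups":
        # A child user namespace inherits the setting from its parent.
        return UserNsSetgroups(inherited=self.value)
```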
-.P
-The
-.IR /proc/ pid /setgroups
-file was added in Linux 3.19,
-but was backported to many earlier stable kernel series,
-because it addresses a security issue.
-The issue concerned files with permissions such as "rwx\-\-\-rwx".
-Such files give fewer permissions to "group" than they do to "other".
-This means that dropping groups using
-.BR setgroups (2)
-might allow a process file access that it did not formerly have.
-Before the existence of user namespaces this was not a concern,
-since only a privileged process (one with the
-.B CAP_SETGID
-capability) could call
-.BR setgroups (2).
-However, with the introduction of user namespaces,
-it became possible for an unprivileged process to create
-a new namespace in which the user had all privileges.
-This then allowed formerly unprivileged
-users to drop groups and thus gain file access
-that they did not previously have.
-The
-.IR /proc/ pid /setgroups
-file was added to address this security issue,
-by denying any pathway for an unprivileged process to drop groups with
-.BR setgroups (2).
-.\"
-.\" /proc/PID/setgroups
-.\" [allow == setgroups() is allowed, "deny" == setgroups() is disallowed]
-.\" * Can write if have CAP_SYS_ADMIN in NS
-.\" * Must write BEFORE writing to /proc/PID/gid_map
-.\"
-.\" setgroups()
-.\" * Must already have written to gid_map
-.\" * /proc/PID/setgroups must be "allow"
-.\"
-.\" /proc/PID/gid_map -- writing
-.\" * Must already have written "deny" to /proc/PID/setgroups
-.\"
-.\" ============================================================
-.\"
-.SS Unmapped user and group IDs
-There are various places where an unmapped user ID (group ID)
-may be exposed to user space.
-For example, the first process in a new user namespace may call
-.BR getuid (2)
-before a user ID mapping has been defined for the namespace.
-In most such cases, an unmapped user ID is converted
-.\" from_kuid_munged(), from_kgid_munged()
-to the overflow user ID (group ID);
-the default value for the overflow user ID (group ID) is 65534.
-See the descriptions of
-.I /proc/sys/kernel/overflowuid
-and
-.I /proc/sys/kernel/overflowgid
-in
-.BR proc (5).
-.P
-The cases where unmapped IDs are mapped in this fashion include
-system calls that return user IDs
-.RB ( getuid (2),
-.BR getgid (2),
-and similar),
-credentials passed over a UNIX domain socket,
-.\" also SO_PEERCRED
-credentials returned by
-.BR stat (2),
-.BR waitid (2),
-and the System V IPC "ctl"
-.B IPC_STAT
-operations,
-credentials exposed by
-.IR /proc/ pid /status
-and the files in
-.IR /proc/sysvipc/* ,
-credentials returned via the
-.I si_uid
-field in the
-.I siginfo_t
-received with a signal (see
-.BR sigaction (2)),
-credentials written to the process accounting file (see
-.BR acct (5)),
-and credentials returned with POSIX message queue notifications (see
-.BR mq_notify (3)).
-.P
-There is one notable case where unmapped user and group IDs are
-.I not
-.\" from_kuid(), from_kgid()
-.\" Also F_GETOWNER_UIDS is an exception
-converted to the corresponding overflow ID value.
-When viewing a
-.I uid_map
-or
-.I gid_map
-file in which there is no mapping for the second field,
-that field is displayed as 4294967295 (\-1 as an unsigned integer).
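-.P
-The difference between the two behaviors can be sketched as follows.
-The helper names are hypothetical,
-and the map is a list of (inside start, outside start, length) tuples
-as in the uid_map format.

```python
from typing import List, Optional, Tuple

IdMap = List[Tuple[int, int, int]]  # (inside_start, outside_start, length)

DEFAULT_OVERFLOW_ID = 65534
NO_ID = 4294967295  # -1 as an unsigned 32-bit integer


def read_overflow_id(path: str = "/proc/sys/kernel/overflowuid") -> int:
    """Read the overflow ID, falling back to the documented default."""
    try:
        with open(path) as f:
            return int(f.read())
    except (OSError, ValueError):
        return DEFAULT_OVERFLOW_ID


def map_up(id_map: IdMap, outer_id: int) -> Optional[int]:
    """Translate an outer ID into the namespace, or None if unmapped."""
    for inside_start, outside_start, length in id_map:
        if outside_start <= outer_id < outside_start + length:
            return inside_start + (outer_id - outside_start)
    return None


def reported_id(id_map: IdMap, outer_id: int,
                overflow_id: int = DEFAULT_OVERFLOW_ID) -> int:
    """ID as reported by getuid(2), stat(2), and similar interfaces."""
    mapped = map_up(id_map, outer_id)
    return overflow_id if mapped is None else mapped


def displayed_map_field(id_map: IdMap, outer_id: int) -> int:
    """Field two as displayed in uid_map or gid_map: no overflow ID."""
    mapped = map_up(id_map, outer_id)
    return NO_ID if mapped is None else mapped
```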
-.\"
-.\" ============================================================
-.\"
-.SS Accessing files
-In order to determine permissions when an unprivileged process accesses a file,
-the process credentials (UID, GID) and the file credentials
-are in effect mapped back to what they would be in
-the initial user namespace and then compared to determine
-the permissions that the process has on the file.
-The same is also true of other objects that employ the credentials plus
-permissions mask accessibility model, such as System V IPC objects.
-.\"
-.\" ============================================================
-.\"
-.SS Operation of file-related capabilities
-Certain capabilities allow a process to bypass various
-kernel-enforced restrictions when performing operations on
-files owned by other users or groups.
-These capabilities are:
-.BR CAP_CHOWN ,
-.BR CAP_DAC_OVERRIDE ,
-.BR CAP_DAC_READ_SEARCH ,
-.BR CAP_FOWNER ,
-and
-.BR CAP_FSETID .
-.P
-Within a user namespace,
-these capabilities allow a process to bypass the rules
-if the process has the relevant capability over the file,
-meaning that:
-.IP \[bu] 3
-the process has the relevant effective capability in its user namespace; and
-.IP \[bu]
-the file's user ID and group ID both have valid mappings
-in the user namespace.
-.P
-The
-.B CAP_FOWNER
-capability is treated somewhat exceptionally:
-.\" These are the checks performed by the kernel function
-.\" inode_owner_or_capable(). There is one exception to the exception:
-.\" overriding the directory sticky permission bit requires that
-.\" the file has a valid mapping for both its UID and GID.
-it allows a process to bypass the corresponding rules so long as
-at least the file's user ID has a mapping in the user namespace
-(i.e., the file's group ID does not need to have a valid mapping).
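-.P
-A minimal sketch of this check, including the CAP_FOWNER exception,
-is shown below.
-The function names are hypothetical,
-the file IDs are assumed to be expressed in the namespace on the
-field-two side of the maps,
-and the sticky-bit case is not modeled.

```python
from typing import List, Set, Tuple

IdMap = List[Tuple[int, int, int]]  # (inside_start, outside_start, length)


def is_mapped(id_map: IdMap, outer_id: int) -> bool:
    """True if the outer ID has a mapping in the user namespace."""
    return any(outside_start <= outer_id < outside_start + length
               for _, outside_start, length in id_map)


def capable_over_file(effective_caps: Set[str], cap: str,
                      file_uid: int, file_gid: int,
                      uid_map: IdMap, gid_map: IdMap) -> bool:
    """Decide whether `cap` applies to a file with the given owner IDs."""
    if cap not in effective_caps:
        return False
    if not is_mapped(uid_map, file_uid):
        return False
    if cap == "CAP_FOWNER":
        return True  # the file's group ID need not be mapped
    return is_mapped(gid_map, file_gid)
```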
-.\"
-.\" ============================================================
-.\"
-.SS Set-user-ID and set-group-ID programs
-When a process inside a user namespace executes
-a set-user-ID (set-group-ID) program,
-the process's effective user (group) ID inside the namespace is changed
-to whatever value is mapped for the user (group) ID of the file.
-However, if either the user
-.I or
-the group ID of the file has no mapping inside the namespace,
-the set-user-ID (set-group-ID) bit is silently ignored:
-the new program is executed,
-but the process's effective user (group) ID is left unchanged.
-(This mirrors the semantics of executing a set-user-ID or set-group-ID
-program that resides on a filesystem that was mounted with the
-.B MS_NOSUID
-flag, as described in
-.BR mount (2).)
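-.P
-The effect on the effective user ID can be sketched as follows.
-This is a simplified model under stated assumptions, not the kernel's code:
-.I struct map_record
-is a hypothetical in-memory form of one
-.I uid_map
-or
-.I gid_map
-record,
-file IDs are given as seen from the parent namespace,
-and only a single level of nesting is considered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One record of a uid_map or gid_map file (hypothetical form). */
struct map_record {
    uint32_t inside;    /* First ID inside the namespace */
    uint32_t outside;   /* First ID outside the namespace */
    uint32_t len;       /* Number of consecutive IDs covered */
};

/* Translate an ID from outside the namespace to the ID seen inside.
   Returns false if the ID has no mapping. */
static bool
map_into_ns(const struct map_record *map, size_t n,
            uint32_t outside_id, uint32_t *inside_id)
{
    assert(inside_id != NULL);

    for (size_t i = 0; i < n; i++) {
        if (outside_id >= map[i].outside &&
            outside_id - map[i].outside < map[i].len) {
            *inside_id = map[i].inside + (outside_id - map[i].outside);
            return true;
        }
    }
    return false;
}

/* Effective UID, as seen inside the namespace, after executing a
   set-user-ID file owned by file_uid:file_gid (outside IDs). */
static uint32_t
euid_after_exec(const struct map_record *uid_map, size_t n_uid,
                const struct map_record *gid_map, size_t n_gid,
                uint32_t file_uid, uint32_t file_gid, uint32_t cur_euid)
{
    uint32_t new_euid, unused_gid;

    if (!map_into_ns(uid_map, n_uid, file_uid, &new_euid) ||
        !map_into_ns(gid_map, n_gid, file_gid, &unused_gid))
        return cur_euid;    /* Set-user-ID bit silently ignored */
    return new_euid;
}
```

-.P
-The set-group-ID case is symmetric:
-the same two lookups are made,
-and the mapped group ID becomes the effective group ID.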
-.\"
-.\" ============================================================
-.\"
-.SS Miscellaneous
-When a process's user and group IDs are passed over a UNIX domain socket
-to a process in a different user namespace (see the description of
-.B SCM_CREDENTIALS
-in
-.BR unix (7)),
-they are translated into the corresponding values as per the
-receiving process's user and group ID mappings.
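-.P
-The mechanics of passing credentials can be sketched as follows.
-The helper names
-.IR enable_passcred (),
-.IR send_creds (),
-and
-.IR recv_creds ()
-are hypothetical.
-When both ends of the socket are in the same user namespace,
-as in the usage note after the code,
-no translation is visible;
-to observe it, the receiver would have to run in a different user namespace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Ask the kernel to deliver sender credentials on 'fd'. */
static int
enable_passcred(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on));
}

/* Send one byte plus the caller's own credentials over 'fd'.
   Returns 0 on success, -1 on error. */
static int
send_creds(int fd)
{
    struct ucred cred = { .pid = getpid(), .uid = getuid(),
                          .gid = getgid() };
    char data = 'x';
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union {     /* Union ensures suitable alignment of the buffer */
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_CREDENTIALS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(struct ucred));
    memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));

    return sendmsg(fd, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one message on 'fd' and copy the sender's credentials,
   as seen from the caller's user namespace, into *out.
   Returns 0 on success, -1 if no credentials were received. */
static int
recv_creds(int fd, struct ucred *out)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(fd, &msg, 0) == -1)
        return -1;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET &&
            c->cmsg_type == SCM_CREDENTIALS) {
            memcpy(out, CMSG_DATA(c), sizeof(*out));
            return 0;
        }
    }
    return -1;
}
```

-.P
-A caller might create a socket pair with
-.BR socketpair (2),
-call
-.IR enable_passcred ()
-on the receiving end before anything is sent,
-and then call
-.IR send_creds ()
-and
-.IR recv_creds ()
-on the respective ends.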
-.\"
-.SH STANDARDS
-Linux.
-.\"
-.SH NOTES
-Over the years, many features have been added to the Linux kernel
-that are available only to privileged users
-because of their potential to confuse set-user-ID-root applications.
-In general, it is safe to allow the root user in a user namespace to
-use those features because a process in a user namespace cannot
-gain more privilege than the root user of that user namespace has.
-.\"
-.\" ============================================================
-.\"
-.SS Global root
-The term "global root" is sometimes used as a shorthand for
-user ID 0 in the initial user namespace.
-.\"
-.\" ============================================================
-.\"
-.SS Availability
-Use of user namespaces requires a kernel that is configured with the
-.B CONFIG_USER_NS
-option.
-User namespaces require support in a range of subsystems across
-the kernel.
-When a subsystem that lacks such support is configured into the kernel,
-user namespace support cannot be enabled.
-.P
-As of Linux 3.8, most relevant subsystems supported user namespaces,
-but a number of filesystems did not have the infrastructure needed
-to map user and group IDs between user namespaces.
-Linux 3.9 added the required infrastructure support for many of
-the remaining unsupported filesystems
-(Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA, NFS, and OCFS2).
-Linux 3.12 added support for the last of the unsupported major filesystems,
-.\" commit d6970d4b726cea6d7a9bc4120814f95c09571fc3
-XFS.
-.\"
-.SH EXAMPLES
-The program below is designed to allow experimenting with
-user namespaces, as well as other types of namespaces.
-It creates namespaces as specified by command-line options and then executes
-a command inside those namespaces.
-The comments and
-.IR usage ()
-function inside the program provide a full explanation of the program.
-The following shell session demonstrates its use.
-.P
-First, we look at the run-time environment:
-.P
-.in +4n
-.EX
-$ \fBuname \-rs\fP # Need Linux 3.8 or later
-Linux 3.8.0
-$ \fBid \-u\fP # Running as unprivileged user
-1000
-$ \fBid \-g\fP
-1000
-.EE
-.in
-.P
-Now start a new shell in new user
-.RI ( \-U ),
-mount
-.RI ( \-m ),
-and PID
-.RI ( \-p )
-namespaces, with user ID
-.RI ( \-M )
-and group ID
-.RI ( \-G )
-1000 mapped to 0 inside the user namespace:
-.P
-.in +4n
-.EX
-$ \fB./userns_child_exec \-p \-m \-U \-M \[aq]0 1000 1\[aq] \-G \[aq]0 1000 1\[aq] bash\fP
-.EE
-.in
-.P
-The shell has PID 1, because it is the first process in the new
-PID namespace:
-.P
-.in +4n
-.EX
-bash$ \fBecho $$\fP
-1
-.EE
-.in
-.P
-Mounting a new
-.I /proc
-filesystem and listing all of the processes visible
-in the new PID namespace shows that the shell can't see
-any processes outside the PID namespace:
-.P
-.in +4n
-.EX
-bash$ \fBmount \-t proc proc /proc\fP
-bash$ \fBps ax\fP
- PID TTY STAT TIME COMMAND
- 1 pts/3 S 0:00 bash
- 22 pts/3 R+ 0:00 ps ax
-.EE
-.in
-.P
-Inside the user namespace, the shell has user and group ID 0,
-and a full set of permitted and effective capabilities:
-.P
-.in +4n
-.EX
-bash$ \fBcat /proc/$$/status | egrep \[aq]\[ha][UG]id\[aq]\fP
-Uid: 0 0 0 0
-Gid: 0 0 0 0
-bash$ \fBcat /proc/$$/status | egrep \[aq]\[ha]Cap(Prm|Inh|Eff)\[aq]\fP
-CapInh: 0000000000000000
-CapPrm: 0000001fffffffff
-CapEff: 0000001fffffffff
-.EE
-.in
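-.P
-The capability masks shown above are hexadecimal bit masks
-in which bit
-.I n
-corresponds to capability number
-.IR n .
-The following sketch decodes such a mask.
-The helper names are hypothetical;
-the two capability numbers are those defined in
-.I <linux/capability.h>
-and are repeated here only to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_CAP_CHOWN      0     /* Value of CAP_CHOWN */
#define EX_CAP_SYS_ADMIN  21    /* Value of CAP_SYS_ADMIN */

/* Convert the hexadecimal text of a CapInh/CapPrm/CapEff field
   into a 64-bit mask. */
static uint64_t
parse_cap_mask(const char *hex)
{
    assert(hex != NULL);
    return strtoull(hex, NULL, 16);
}

/* Return true if capability number 'cap' is present in 'mask'. */
static bool
cap_is_set(uint64_t mask, int cap)
{
    return cap >= 0 && cap < 64 && ((mask >> cap) & 1) != 0;
}

/* Count the capabilities present in 'mask'. */
static int
cap_count(uint64_t mask)
{
    int n = 0;

    for (; mask != 0; mask &= mask - 1)     /* Clear lowest set bit */
        n++;
    return n;
}
```

-.P
-Applied to the value
-.I 0000001fffffffff
-from the session above,
-.IR cap_count ()
-yields 37 (bits 0 through 36 are set),
-which should correspond to every capability known to the kernel
-used in that session.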
-.SS Program source
-\&
-.EX
-/* userns_child_exec.c
-\&
- Licensed under GNU General Public License v2 or later
-\&
- Create a child process that executes a shell command in new
- namespace(s); allow UID and GID mappings to be specified when
- creating a user namespace.
-*/
-#define _GNU_SOURCE
-#include <err.h>
-#include <sched.h>
-#include <unistd.h>
-#include <stdint.h>
-#include <stdlib.h>
-#include <sys/wait.h>
-#include <signal.h>
-#include <fcntl.h>
-#include <stdio.h>
-#include <string.h>
-#include <limits.h>
-#include <errno.h>
-\&
-struct child_args {
- char **argv; /* Command to be executed by child, with args */
- int pipe_fd[2]; /* Pipe used to synchronize parent and child */
-};
-\&
-static int verbose;
-\&
-static void
-usage(char *pname)
-{
- fprintf(stderr, "Usage: %s [options] cmd [arg...]\en\en", pname);
- fprintf(stderr, "Create a child process that executes a shell "
- "command in a new user namespace,\en"
- "and possibly also other new namespace(s).\en\en");
- fprintf(stderr, "Options can be:\en\en");
-#define fpe(str) fprintf(stderr, " %s", str);
- fpe("\-i New IPC namespace\en");
- fpe("\-m New mount namespace\en");
- fpe("\-n New network namespace\en");
- fpe("\-p New PID namespace\en");
- fpe("\-u New UTS namespace\en");
- fpe("\-U New user namespace\en");
- fpe("\-M uid_map Specify UID map for user namespace\en");
- fpe("\-G gid_map Specify GID map for user namespace\en");
- fpe("\-z Map user\[aq]s UID and GID to 0 in user namespace\en");
- fpe(" (equivalent to: \-M \[aq]0 <uid> 1\[aq] \-G \[aq]0 <gid> 1\[aq])\en");
- fpe("\-v Display verbose messages\en");
- fpe("\en");
- fpe("If \-z, \-M, or \-G is specified, \-U is required.\en");
- fpe("It is not permitted to specify both \-z and either \-M or \-G.\en");
- fpe("\en");
- fpe("Map strings for \-M and \-G consist of records of the form:\en");
- fpe("\en");
- fpe(" ID\-inside\-ns ID\-outside\-ns len\en");
- fpe("\en");
- fpe("A map string can contain multiple records, separated"
- " by commas;\en");
- fpe("the commas are replaced by newlines before writing"
- " to map files.\en");
-\&
- exit(EXIT_FAILURE);
-}
-\&
-/* Update the mapping file \[aq]map_file\[aq], with the value provided in
- \[aq]mapping\[aq], a string that defines a UID or GID mapping. A UID or
- GID mapping consists of one or more newline\-delimited records
- of the form:
-\&
- ID\-inside\-ns ID\-outside\-ns length
-\&
- Requiring the user to supply a string that contains newlines is
- of course inconvenient for command\-line use. Thus, we permit the
- use of commas to delimit records in this string, and replace them
- with newlines before writing the string to the file. */
-\&
-static void
-update_map(char *mapping, char *map_file)
-{
- int fd;
- size_t map_len; /* Length of \[aq]mapping\[aq] */
-\&
- /* Replace commas in mapping string with newlines. */
-\&
- map_len = strlen(mapping);
- for (size_t j = 0; j < map_len; j++)
- if (mapping[j] == \[aq],\[aq])
- mapping[j] = \[aq]\en\[aq];
-\&
- fd = open(map_file, O_RDWR);
- if (fd == \-1) {
- fprintf(stderr, "ERROR: open %s: %s\en", map_file,
- strerror(errno));
- exit(EXIT_FAILURE);
- }
-\&
- if (write(fd, mapping, map_len) != (ssize_t) map_len) {
- fprintf(stderr, "ERROR: write %s: %s\en", map_file,
- strerror(errno));
- exit(EXIT_FAILURE);
- }
-\&
- close(fd);
-}
-\&
-/* Linux 3.19 made a change in the handling of setgroups(2) and
- the \[aq]gid_map\[aq] file to address a security issue. The issue
- allowed *unprivileged* users to employ user namespaces in
- order to drop groups. The upshot of the 3.19 changes is that
- in order to update the \[aq]gid_map\[aq] file, use of the setgroups()
- system call in this user namespace must first be disabled by
- writing "deny" to one of the /proc/PID/setgroups files for
- this namespace. That is the purpose of the following function. */
-\&
-static void
-proc_setgroups_write(pid_t child_pid, char *str)
-{
- char setgroups_path[PATH_MAX];
- int fd;
-\&
- snprintf(setgroups_path, PATH_MAX, "/proc/%jd/setgroups",
- (intmax_t) child_pid);
-\&
- fd = open(setgroups_path, O_RDWR);
- if (fd == \-1) {
-\&
- /* We may be on a system that doesn\[aq]t support
- /proc/PID/setgroups. In that case, the file won\[aq]t exist,
- and the system won\[aq]t impose the restrictions that Linux 3.19
- added. That\[aq]s fine: we don\[aq]t need to do anything in order
- to permit \[aq]gid_map\[aq] to be updated.
-\&
- However, if the error from open() was something other than
- the ENOENT error that is expected for that case, let the
- user know. */
-\&
- if (errno != ENOENT)
- fprintf(stderr, "ERROR: open %s: %s\en", setgroups_path,
- strerror(errno));
- return;
- }
-\&
- if (write(fd, str, strlen(str)) == \-1)
- fprintf(stderr, "ERROR: write %s: %s\en", setgroups_path,
- strerror(errno));
-\&
- close(fd);
-}
-\&
-static int /* Start function for cloned child */
-childFunc(void *arg)
-{
- struct child_args *args = arg;
- char ch;
-\&
- /* Wait until the parent has updated the UID and GID mappings.
- See the comment in main(). We wait for end of file on a
- pipe that will be closed by the parent process once it has
- updated the mappings. */
-\&
- close(args\->pipe_fd[1]); /* Close our descriptor for the write
- end of the pipe so that we see EOF
- when parent closes its descriptor. */
- if (read(args\->pipe_fd[0], &ch, 1) != 0) {
- fprintf(stderr,
- "Failure in child: read from pipe returned != 0\en");
- exit(EXIT_FAILURE);
- }
-\&
- close(args\->pipe_fd[0]);
-\&
- /* Execute a shell command. */
-\&
- printf("About to exec %s\en", args\->argv[0]);
- execvp(args\->argv[0], args\->argv);
- err(EXIT_FAILURE, "execvp");
-}
-\&
-#define STACK_SIZE (1024 * 1024)
-\&
-static char child_stack[STACK_SIZE]; /* Space for child\[aq]s stack */
-\&
-int
-main(int argc, char *argv[])
-{
- int flags, opt, map_zero;
- pid_t child_pid;
- struct child_args args;
- char *uid_map, *gid_map;
- const int MAP_BUF_SIZE = 100;
- char map_buf[MAP_BUF_SIZE];
- char map_path[PATH_MAX];
-\&
- /* Parse command\-line options. The initial \[aq]+\[aq] character in
- the final getopt() argument prevents GNU\-style permutation
- of command\-line options. That\[aq]s useful, since sometimes
- the \[aq]command\[aq] to be executed by this program itself
- has command\-line options. We don\[aq]t want getopt() to treat
- those as options to this program. */
-\&
- flags = 0;
- verbose = 0;
- gid_map = NULL;
- uid_map = NULL;
- map_zero = 0;
- while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != \-1) {
- switch (opt) {
- case \[aq]i\[aq]: flags |= CLONE_NEWIPC; break;
- case \[aq]m\[aq]: flags |= CLONE_NEWNS; break;
- case \[aq]n\[aq]: flags |= CLONE_NEWNET; break;
- case \[aq]p\[aq]: flags |= CLONE_NEWPID; break;
- case \[aq]u\[aq]: flags |= CLONE_NEWUTS; break;
- case \[aq]v\[aq]: verbose = 1; break;
- case \[aq]z\[aq]: map_zero = 1; break;
- case \[aq]M\[aq]: uid_map = optarg; break;
- case \[aq]G\[aq]: gid_map = optarg; break;
- case \[aq]U\[aq]: flags |= CLONE_NEWUSER; break;
- default: usage(argv[0]);
- }
- }
-\&
- /* \-M, \-G, or \-z without \-U is nonsensical,
- as is \-z combined with \-M or \-G. */
-\&
- if (((uid_map != NULL || gid_map != NULL || map_zero) &&
- !(flags & CLONE_NEWUSER)) ||
- (map_zero && (uid_map != NULL || gid_map != NULL)))
- usage(argv[0]);
-\&
- args.argv = &argv[optind];
-\&
- /* We use a pipe to synchronize the parent and child, in order to
- ensure that the parent sets the UID and GID maps before the child
- calls execve(). This ensures that the child maintains its
- capabilities during the execve() in the common case where we
- want to map the child\[aq]s effective user ID to 0 in the new user
- namespace. Without this synchronization, the child would lose
- its capabilities if it performed an execve() with nonzero
- user IDs (see the capabilities(7) man page for details of the
- transformation of a process\[aq]s capabilities during execve()). */
-\&
- if (pipe(args.pipe_fd) == \-1)
- err(EXIT_FAILURE, "pipe");
-\&
- /* Create the child in new namespace(s). */
-\&
- child_pid = clone(childFunc, child_stack + STACK_SIZE,
- flags | SIGCHLD, &args);
- if (child_pid == \-1)
- err(EXIT_FAILURE, "clone");
-\&
- /* Parent falls through to here. */
-\&
- if (verbose)
- printf("%s: PID of child created by clone() is %jd\en",
- argv[0], (intmax_t) child_pid);
-\&
- /* Update the UID and GID maps in the child. */
-\&
- if (uid_map != NULL || map_zero) {
- snprintf(map_path, PATH_MAX, "/proc/%jd/uid_map",
- (intmax_t) child_pid);
- if (map_zero) {
- snprintf(map_buf, MAP_BUF_SIZE, "0 %jd 1",
- (intmax_t) getuid());
- uid_map = map_buf;
- }
- update_map(uid_map, map_path);
- }
-\&
- if (gid_map != NULL || map_zero) {
- proc_setgroups_write(child_pid, "deny");
-\&
- snprintf(map_path, PATH_MAX, "/proc/%jd/gid_map",
- (intmax_t) child_pid);
- if (map_zero) {
- snprintf(map_buf, MAP_BUF_SIZE, "0 %jd 1",
- (intmax_t) getgid());
- gid_map = map_buf;
- }
- update_map(gid_map, map_path);
- }
-\&
- /* Close the write end of the pipe, to signal to the child that we
- have updated the UID and GID maps. */
-\&
- close(args.pipe_fd[1]);
-\&
- if (waitpid(child_pid, NULL, 0) == \-1) /* Wait for child */
- err(EXIT_FAILURE, "waitpid");
-\&
- if (verbose)
- printf("%s: terminating\en", argv[0]);
-\&
- exit(EXIT_SUCCESS);
-}
-.EE
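-.P
-The program above uses
-.BR clone (2)
-and has the parent write the child's maps.
-As an alternative technique, which is not part of the program above,
-a process can move itself into a new user namespace with
-.BR unshare (2)
-and write its own maps,
-which gives the equivalent of the
-.I \-z
-option.
-The following is a minimal sketch;
-.IR write_file ()
-and
-.IR enter_userns_as_root ()
-are hypothetical names,
-and the call can fail on systems where unprivileged creation of
-user namespaces is restricted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write 'str' to 'path'. Returns 0 on success, or -1 with errno set. */
static int
write_file(const char *path, const char *str)
{
    int fd, saved_errno;
    ssize_t n;
    size_t len = strlen(str);

    fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    n = write(fd, str, len);
    saved_errno = (n == -1) ? errno : EIO;  /* Short write: report EIO */
    close(fd);

    if (n != (ssize_t) len) {
        errno = saved_errno;
        return -1;
    }
    return 0;
}

/* Move the calling process into a new user namespace in which its
   effective UID and GID are mapped to 0.
   Returns 0 on success, or -1 with errno set. */
static int
enter_userns_as_root(void)
{
    char map[64];
    uid_t euid = geteuid();     /* Must be read before unshare() */
    gid_t egid = getegid();

    if (unshare(CLONE_NEWUSER) == -1)
        return -1;

    /* Since Linux 3.19, setgroups(2) must be denied before an
       unprivileged process may write gid_map. On older kernels the
       file does not exist, which is not treated as an error. */
    if (write_file("/proc/self/setgroups", "deny") == -1 &&
        errno != ENOENT)
        return -1;

    snprintf(map, sizeof(map), "0 %jd 1", (intmax_t) euid);
    if (write_file("/proc/self/uid_map", map) == -1)
        return -1;

    snprintf(map, sizeof(map), "0 %jd 1", (intmax_t) egid);
    if (write_file("/proc/self/gid_map", map) == -1)
        return -1;

    return 0;
}
```

-.P
-After a successful call,
-.BR geteuid (2)
-and
-.BR getegid (2)
-return 0 in the calling process.
-Because the maps are in place before any subsequent
-.BR execve (2),
-no pipe-based synchronization of the kind used in the program above
-is needed.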
-.SH SEE ALSO
-.BR newgidmap (1), \" From the shadow package
-.BR newuidmap (1), \" From the shadow package
-.BR clone (2),
-.BR ptrace (2),
-.BR setns (2),
-.BR unshare (2),
-.BR proc (5),
-.BR subgid (5), \" From the shadow package
-.BR subuid (5), \" From the shadow package
-.BR capabilities (7),
-.BR cgroup_namespaces (7),
-.BR credentials (7),
-.BR namespaces (7),
-.BR pid_namespaces (7)
-.P
-The kernel source file
-.IR Documentation/admin\-guide/namespaces/resource\-control.rst .