Capabilities and Flags

From Linux-VServer

Revision as of 18:28, 22 July 2016 by AlexanderS (Talk | contribs)


In computer science, a capability is a token used by a process to prove that it is allowed to perform an operation on an object. The Linux Capability System is based on "POSIX Capabilities", a somewhat different concept, designed to split the all-powerful root privilege into a set of distinct privileges.


The Capability/Flag System

POSIX Capabilities

A process has three capability sets, each stored as a bitmap: the inheritable (I), permitted (P), and effective (E) capabilities. Each capability is implemented as a bit in each of these bitmaps that is either set or unset.

When a process tries to do a privileged operation, the operating system will check the appropriate bit in the effective set of the process (instead of checking whether the effective uid of the process is 0 as is normally done).

For example, when a process tries to set the clock, the Linux kernel will check that the process has the CAP_SYS_TIME bit (which is currently bit 25) set in its effective set.
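The check described above amounts to a single bit test. A minimal sketch in Python, assuming the effective set is held as a plain integer bitmap; `has_cap` is an illustrative helper name, not a kernel function:

```python
CAP_SYS_TIME = 25  # bit number given in the text


def has_cap(effective: int, cap_bit: int) -> bool:
    """Return True if the capability bit is set in the effective bitmap."""
    return bool(effective & (1 << cap_bit))


# A process whose effective set contains only CAP_SYS_TIME.
effective = 1 << CAP_SYS_TIME
print(has_cap(effective, CAP_SYS_TIME))  # True
print(has_cap(0, CAP_SYS_TIME))          # False
```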

The permitted set of the process indicates the capabilities the process may use. A capability can be set in the permitted set without being set in the effective set, which means the process has temporarily disabled it. A process is allowed to set a bit in its effective set only if that bit is set in its permitted set. The distinction between effective and permitted exists so that processes can "bracket" operations that need privilege, enabling a capability only for the duration of the operation.
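Bracketing can be pictured as raising a bit in the effective set just before a privileged operation and lowering it again afterwards. The following is a toy model in Python, not how a real process manipulates its capabilities; the class `CapState` and its method names are assumptions made for illustration:

```python
from contextlib import contextmanager


class CapState:
    """Toy model of a process's permitted and effective bitmaps."""

    def __init__(self, permitted: int, effective: int = 0):
        self.permitted = permitted
        # The effective set can never exceed the permitted set.
        self.effective = effective & permitted

    def raise_cap(self, bit: int) -> None:
        mask = 1 << bit
        if not self.permitted & mask:
            raise PermissionError(f"capability bit {bit} is not permitted")
        self.effective |= mask

    def lower_cap(self, bit: int) -> None:
        self.effective &= ~(1 << bit)

    @contextmanager
    def bracket(self, bit: int):
        """Enable one capability only for the duration of a with-block."""
        was_set = bool(self.effective & (1 << bit))
        self.raise_cap(bit)
        try:
            yield self
        finally:
            if not was_set:
                self.lower_cap(bit)


state = CapState(permitted=1 << 25)   # only CAP_SYS_TIME is permitted
with state.bracket(25):
    print(hex(state.effective))       # 0x2000000
print(hex(state.effective))           # 0x0
```

Raising a bit that is not in the permitted set fails in this model, which mirrors the rule that the effective set is always a subset of the permitted set.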

The inheritable capabilities are the capabilities of the current process that should be inherited by a program executed by the current process. The permitted set of a process is masked against the inheritable set during exec(). Nothing special happens during fork() or clone(). Child processes and threads are given an exact copy of the capabilities of the parent process.

The implementation in Linux stopped at this point, whereas POSIX Capabilities also require capability sets on files, to replace the SUID flag (at least for executables).

Upper Bound for Capabilities

The current Linux Capability system does not implement the filesystem-related portions of POSIX Capabilities, which would make setuid and setgid executables secure. It is also much safer to have a secure upper bound for all processes within a context. For both reasons, an additional per-context capability mask has been added that limits all processes belonging to a context to this mask. The meaning of the individual caps (bits) of the capability bound mask is exactly the same as with the permitted capability set.
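The effect of the per-context bound can be sketched as a bitwise AND between what a process holds and what the context allows. This is a simplified model in Python; the function name and the example bound value are assumptions, and the bit numbers are taken from the system capabilities table on this page:

```python
CAP_SYS_ADMIN = 1 << 21
CAP_SYS_TIME = 1 << 25


def effective_in_context(effective: int, context_bound: int) -> int:
    """Capabilities a process can actually exercise inside the context."""
    return effective & context_bound


# A guest process holds both capabilities...
process_effective = CAP_SYS_ADMIN | CAP_SYS_TIME
# ...but the (hypothetical) context bound allows only CAP_SYS_TIME.
bound = CAP_SYS_TIME
usable = effective_in_context(process_effective, bound)
print(hex(usable))  # 0x2000000
```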

Context Capabilities

As the Linux capabilities have almost reached the maximum number that is possible without heavy modifications to the kernel, it was a natural step to add a context-specific capability system.

The Linux-VServer context capability set acts as a mechanism to fine tune existing Linux capabilities. It is not visible to the processes within a context, as they would not know how to modify or verify it.

In general there are two ways to use those capabilities:

  • Require one or more context capabilities to be set in addition to a given Linux capability, each one controlling a distinct part of the functionality. For example, CAP_NET_RAW could be split into RAW and PACKET sockets, so you could take away each of them separately by not providing the required context capability.
  • Consider the context capability sufficient for a specified functionality, even if the Linux capability says something different. For example, mount() requires CAP_SYS_ADMIN, which adds a dozen other things we do not want, so we define VXC_SECURE_MOUNT to allow mounts for certain contexts.
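The two usage patterns differ only in whether the context capability is combined with the Linux capability by AND or by OR. A minimal sketch in Python, assuming both sets are integer bitmaps; the function names are illustrative, and `HYPOTHETICAL_VXC_PACKET_BIT` is an invented bit that does not exist in Linux-VServer:

```python
CAP_NET_RAW_BIT = 13        # NET_RAW in the system capabilities table
CAP_SYS_ADMIN_BIT = 21      # SYS_ADMIN in the system capabilities table
VXC_SECURE_MOUNT_BIT = 16   # SECURE_MOUNT in the context capabilities table
HYPOTHETICAL_VXC_PACKET_BIT = 62  # made up for this example only


def _has(bitmap: int, bit: int) -> bool:
    return bool((bitmap >> bit) & 1)


def allowed_restrictive(linux_caps, linux_bit, ccaps, ccap_bit) -> bool:
    """Pattern 1: the context capability is required in addition."""
    return _has(linux_caps, linux_bit) and _has(ccaps, ccap_bit)


def allowed_permissive(linux_caps, linux_bit, ccaps, ccap_bit) -> bool:
    """Pattern 2: the context capability alone is sufficient."""
    return _has(linux_caps, linux_bit) or _has(ccaps, ccap_bit)


# Guest has the Linux capability for raw sockets but lacks the extra
# context capability, so PACKET sockets are denied.
print(allowed_restrictive(1 << CAP_NET_RAW_BIT, CAP_NET_RAW_BIT,
                          0, HYPOTHETICAL_VXC_PACKET_BIT))   # False

# Guest lacks CAP_SYS_ADMIN but has SECURE_MOUNT, so mounting is allowed.
print(allowed_permissive(0, CAP_SYS_ADMIN_BIT,
                         1 << VXC_SECURE_MOUNT_BIT, VXC_SECURE_MOUNT_BIT))  # True
```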

The difference between the context flags and the context capabilities is more an abstract logical separation than a functional one, because they are handled in a very similar way.

List of capabilities/flags

Below is a list of capabilities and flags used for contexts and processes within. The tables contain the following information:

Bit 
The bit number to enable the capability/flag
Mask 
The bit mask in hexadecimal notation (the value 1 shifted left by the bit number)
Name 
Human readable identifier used in userspace utilities
Tag 
Special capability/flag code to denote special behaviour, legacy usage and others (see below)
Description 
Description of capability/flag effects
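The Bit and Mask columns carry the same information: the mask is 1 shifted left by the bit number. A short Python sketch of the conversion follows; `format_mask` is an illustrative helper that imitates the notation used in the tables on this page, including the `<<32` and `<<48` form used for bits above 31:

```python
def bit_to_mask(bit: int) -> int:
    """Numeric mask for a bit number."""
    return 1 << bit


def format_mask(bit: int) -> str:
    """Render a mask the way the tables on this page write it."""
    if bit < 32:
        return f"0x{1 << bit:08x}"
    # Bits 32..47 are written relative to <<32, bits 48..63 relative to <<48.
    shift = (bit // 16) * 16
    return f"0x{1 << (bit - shift):04x}<<{shift}"


print(format_mask(25))  # 0x02000000
print(format_mask(36))  # 0x0010<<32
print(format_mask(52))  # 0x0010<<48
```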

Special capability/flags codes

The tag column may contain one or more of the following tags:

Tag Description
I Internal use only
L Only supported with legacy enabled
O One time capability/flag (once it's cleared, it can't be re-enabled again)
U Unsupported
X Slightly different meaning in legacy
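The one-time tag (O) can be read as a rule on flag updates: a one-time bit may be cleared, but a later update cannot set it again. A minimal Python model of that rule, assuming flags are held in an integer and that the initial value is chosen when the context is created; `apply_flag_update` and `ONE_TIME_MASK` are illustrative names, and this is a sketch of the described behaviour rather than the kernel's code:

```python
# Bits 32, 33 and 34 carry the O tag in the context flags table.
STATE_SETUP = 1 << 32
STATE_INIT = 1 << 33
STATE_ADMIN = 1 << 34
ONE_TIME_MASK = STATE_SETUP | STATE_INIT | STATE_ADMIN


def apply_flag_update(current: int, requested: int) -> int:
    """Return the new flag word, refusing to re-enable cleared one-time bits."""
    # One-time bits that are already clear must stay clear.
    locked_off = ONE_TIME_MASK & ~current
    return requested & ~locked_off


flags = STATE_SETUP | STATE_ADMIN
flags = apply_flag_update(flags, STATE_ADMIN)                 # clear STATE_SETUP
flags = apply_flag_update(flags, STATE_SETUP | STATE_ADMIN)   # try to set it again
print(flags == STATE_ADMIN)  # True: STATE_SETUP stayed cleared
```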

Context capabilities (ccaps)

The set of available context capabilities is specific to Linux-VServer and applied to all processes contained within a context. Below is a list of capabilities currently available in 2.1.1 and above.

Bit Mask Name Tag Description
0 0x00000001 SET_UTSNAME Allow setdomainname(2) and sethostname(2)
1 0x00000002 SET_RLIMIT Allow setrlimit(2)
2 0x00000004 FS_SECURITY Allow setxattr for security attributes
4 0x00000010 TIOCSTI Allow the tiocsti ioctl (fake input character)
8 0x00000100 RAW_ICMP L Allow usage of raw ICMP sockets
12 0x00001000 SYSLOG Allow syslog(2)
13 0x00002000 OOM_ADJUST Allow 'safe' oom adjustments
14 0x00004000 AUDIT_CONTROL Allow loginuid write (for auditing)
16 0x00010000 SECURE_MOUNT Allow secure mount(2)
17 0x00020000 SECURE_REMOUNT Allow secure remount
18 0x00040000 BINARY_MOUNT Allow binary/network mounts
20 0x00100000 QUOTA_CTL Allow quota ioctls
21 0x00200000 ADMIN_MAPPER Allow access to device mapper
22 0x00400000 ADMIN_CLOOP Allow access to loop devices
24 0x01000000 KTHREAD Allow creating kernel threads
25 0x02000000 NAMESPACE Allow namespace related operations
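As an illustration of how the table maps names to a single value, the sketch below combines a list of context capability names into one mask. It is written in Python with bit numbers copied from the table above; `ccaps_mask` is an illustrative helper and not part of any Linux-VServer tool:

```python
# Name -> bit number, copied from the context capabilities table.
CCAP_BITS = {
    "SET_UTSNAME": 0, "SET_RLIMIT": 1, "FS_SECURITY": 2, "TIOCSTI": 4,
    "RAW_ICMP": 8, "SYSLOG": 12, "OOM_ADJUST": 13, "AUDIT_CONTROL": 14,
    "SECURE_MOUNT": 16, "SECURE_REMOUNT": 17, "BINARY_MOUNT": 18,
    "QUOTA_CTL": 20, "ADMIN_MAPPER": 21, "ADMIN_CLOOP": 22,
    "KTHREAD": 24, "NAMESPACE": 25,
}


def ccaps_mask(names) -> int:
    """OR together the masks of the named context capabilities."""
    mask = 0
    for name in names:
        if name not in CCAP_BITS:
            raise ValueError(f"unknown context capability: {name}")
        mask |= 1 << CCAP_BITS[name]
    return mask


print(f"0x{ccaps_mask(['SET_UTSNAME', 'SECURE_MOUNT']):08x}")  # 0x00010001
```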

Context flags (cflags)

The set of available context flags is specific to Linux-VServer and applied to all processes contained within a context. Below is a list of flags available in 2.1.1 and above.

Bit Mask Name Tag Description
0 0x00000001 INFO_LOCK L Prohibit further context migration
1 0x00000002 INFO_SCHED L Account all processes as one
2 0x00000004 INFO_NPROC L Apply process limits to context
3 0x00000008 INFO_PRIVATE L Context cannot be entered
4 0x00000010 INFO_INIT X Show a fake init process
5 0x00000020 INFO_HIDE X Hide context information in task status
6 0x00000040 INFO_ULIMIT L Apply ulimits to context
7 0x00000080 INFO_NSPACE L Use private namespace
8 0x00000100 SCHED_HARD Enable hard scheduler
9 0x00000200 SCHED_PRIO Enable priority scheduler
10 0x00000400 SCHED_PAUSE Pause context (unschedule)
16 0x00010000 VIRT_MEM Virtualize memory information
17 0x00020000 VIRT_UPTIME Virtualize uptime information
18 0x00040000 VIRT_CPU Virtualize cpu usage information
19 0x00080000 VIRT_LOAD Virtualize load average information
20 0x00100000 VIRT_TIME Allow per guest time offsets
24 0x01000000 HIDE_MOUNT Hide entries in /proc/$pid/mounts
25 0x02000000 HIDE_NETIF L Hide foreign network interfaces
26 0x04000000 HIDE_VINFO Hide context information in task status
32 0x0001<<32 STATE_SETUP IO Enable setup state
33 0x0002<<32 STATE_INIT IO Enable init state
34 0x0004<<32 STATE_ADMIN O Enable admin state
36 0x0010<<32 SC_HELPER I Enable state change helper
37 0x0020<<32 REBOOT_KILL Kill all processes on reboot(2)
38 0x0040<<32 PERSISTENT Make context persistent
48 0x0001<<48 FORK_RSS Block fork when RSS limit is exceeded
49 0x0002<<48 PROLIFIC Allow context to create new contexts
52 0x0010<<48 IGNEG_NICE Ignore priority raise
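Because the context flags extend past bit 31, a flag word is a 64-bit value. The following Python sketch decodes such a value back into the names from the table above; `decode_cflags` is an illustrative helper, and any bit not listed in the table is reported separately rather than guessed at:

```python
# Bit number -> name, copied from the context flags table.
CFLAG_NAMES = {
    0: "INFO_LOCK", 1: "INFO_SCHED", 2: "INFO_NPROC", 3: "INFO_PRIVATE",
    4: "INFO_INIT", 5: "INFO_HIDE", 6: "INFO_ULIMIT", 7: "INFO_NSPACE",
    8: "SCHED_HARD", 9: "SCHED_PRIO", 10: "SCHED_PAUSE",
    16: "VIRT_MEM", 17: "VIRT_UPTIME", 18: "VIRT_CPU", 19: "VIRT_LOAD",
    20: "VIRT_TIME", 24: "HIDE_MOUNT", 25: "HIDE_NETIF", 26: "HIDE_VINFO",
    32: "STATE_SETUP", 33: "STATE_INIT", 34: "STATE_ADMIN",
    36: "SC_HELPER", 37: "REBOOT_KILL", 38: "PERSISTENT",
    48: "FORK_RSS", 49: "PROLIFIC", 52: "IGNEG_NICE",
}


def decode_cflags(value: int):
    """Split a 64-bit flag word into known names and a mask of unknown bits."""
    names, unknown = [], 0
    for bit in range(64):
        if (value >> bit) & 1:
            if bit in CFLAG_NAMES:
                names.append(CFLAG_NAMES[bit])
            else:
                unknown |= 1 << bit
    return names, unknown


value = 0x00000100 | 0x00010000 | (0x0040 << 32)
print(decode_cflags(value))  # (['SCHED_HARD', 'VIRT_MEM', 'PERSISTENT'], 0)
```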

Network capabilities (ncaps)

The set of available network capabilities is specific to Linux-VServer and applied to all processes contained within a network context. Below is a list of capabilities available since at least 2.3.0.34.

Bit Mask Name Tag Description
0 0x00000001 TUN_CREATE Allow the guest to open TUN devices (the kernel's TUN/TAP interface, network devices that link to userspace), i.e. tun_set_iff()
8 0x00000100 RAW_ICMP Allow usage of raw ICMP sockets. (Makes ping work)

Network context flags (nflags)

The set of available network context flags is specific to Linux-VServer and applied to all processes contained within a network context. Below is a list of flags available in 2.1.1 and above.

Bit Mask Name Tag Description
0 0x00000001 INFO_LOCK L Prohibit further context migration
3 0x00000008 INFO_PRIVATE Context cannot be entered
8 0x00000100 SINGLE_IP Enable special handling of network contexts with a single IP only
9 0x00000200 LBACK_REMAP Use loopback virtualisation (only works in 2.3.0.xx or greater)
10 0x00000400 LBACK_ALLOW If set, allow guests without LBACK_REMAP to connect to 127.0.0.0/8
25 0x02000000 HIDE_NETIF Hide foreign network interfaces (i.e. network interfaces not carrying an IP belonging to the guest)
26 0x04000000 HIDE_LBACK Hide the real loopback address from the guest (rewrites to 127.0.0.1 when queried) (only works in 2.3.0.xx or greater)
32 0x0001<<32 STATE_SETUP IO Enable setup state
34 0x0004<<32 STATE_ADMIN O Enable admin state
36 0x0010<<32 SC_HELPER I Enable state change helper
38 0x0040<<32 PERSISTENT Make network context persistent. Allows you to configure a network context to use a certain IP with specific flags, and quickly run processes at later points with long pauses between them, rather than re-creating it every time a process needs to be executed.

System capabilities (bcaps)

The set of available system capabilities is inherited from the Linux kernel and applied to all processes contained within a context. Below is a list of capabilities currently available in the vanilla kernel.

BIG FAT WARNING: Adding any system capability to your virtual server WILL reduce security. Do not change the default values unless you absolutely know what you are doing!

Bit Mask Name Description
0 0x00000001 CHOWN In a system with the [_POSIX_CHOWN_RESTRICTED] option defined, this overrides the restriction of changing file ownership and group ownership.
1 0x00000002 DAC_OVERRIDE Override all DAC access, including ACL execute access if [_POSIX_ACL] is defined, excluding DAC access covered by CAP_LINUX_IMMUTABLE.
2 0x00000004 DAC_READ_SEARCH Overrides all DAC restrictions regarding read and search on files and directories, including ACL restrictions if [_POSIX_ACL] is defined, excluding DAC access covered by CAP_LINUX_IMMUTABLE.
3 0x00000008 FOWNER Overrides all restrictions about allowed operations on files, where file owner ID must be equal to the user ID, except where CAP_FSETID is applicable. It doesn't override MAC and DAC restrictions.
4 0x00000010 FSETID Overrides the following restrictions: that the effective user ID shall match the file owner ID when setting the S_ISUID and S_ISGID bits on a file; that the effective group ID (or one of the supplementary group IDs) shall match the file owner ID when setting the S_ISGID bit on that file; and that the S_ISUID and S_ISGID bits are cleared on successful return from chown(2) (not implemented).
5 0x00000020 KILL Overrides the restriction that the real or effective user ID of a process sending a signal must match the real or effective user ID of the process receiving the signal.
6 0x00000040 SETGID
  • Allows setgid(2) manipulation
  • Allows setgroups(2)
  • Allows forged gids on socket credentials passing.
7 0x00000080 SETUID
  • Allows set*uid(2) manipulation (including fsuid).
  • Allows forged uids on socket credentials passing.
8 0x00000100 SETPCAP Transfer any capability in your permitted set to any pid, remove any capability in your permitted set from any pid
9 0x00000200 LINUX_IMMUTABLE Allow modification of S_IMMUTABLE and S_APPEND file attributes
10 0x00000400 NET_BIND_SERVICE
  • Allows binding to TCP/UDP sockets below 1024
  • Allows binding to ATM VCIs below 32
11 0x00000800 NET_BROADCAST Allow broadcasting, listen to multicast
12 0x00001000 NET_ADMIN
  • Allow interface configuration
  • Allow administration of IP firewall, masquerading and accounting
  • Allow setting debug option on sockets
  • Allow modification of routing tables
  • Allow setting arbitrary process / process group ownership on sockets
  • Allow binding to any address for transparent proxying
  • Allow setting TOS (type of service)
  • Allow setting promiscuous mode
  • Allow clearing driver statistics
  • Allow multicasting
  • Allow read/write of device-specific registers
  • Allow activation of ATM control sockets
13 0x00002000 NET_RAW
  • Allow use of RAW sockets
  • Allow use of PACKET sockets
14 0x00004000 IPC_LOCK
  • Allow locking of shared memory segments
  • Allow mlock and mlockall (which doesn't really have anything to do with IPC)
15 0x00008000 IPC_OWNER Override IPC ownership checks
16 0x00010000 SYS_MODULE
  • Insert and remove kernel modules - modify kernel without limit
  • Modify cap_bset
17 0x00020000 SYS_RAWIO
  • Allow ioperm/iopl access
  • Allow sending USB messages to any device via /proc/bus/usb
18 0x00040000 SYS_CHROOT Allow use of chroot()
19 0x00080000 SYS_PTRACE Allow ptrace() of any process
20 0x00100000 SYS_PACCT Allow configuration of process accounting
21 0x00200000 SYS_ADMIN
  • Allow configuration of the secure attention key
  • Allow administration of the random device
  • Allow examination and configuration of disk quotas
  • Allow configuring the kernel's syslog (printk behaviour)
  • Allow setting the domainname
  • Allow setting the hostname
  • Allow calling bdflush()
  • Allow mount() and umount(), setting up new smb connection
  • Allow some autofs root ioctls
  • Allow nfsservctl
  • Allow VM86_REQUEST_IRQ
  • Allow reading/writing PCI config on alpha
  • Allow irix_prctl on mips (setstacksize)
  • Allow flushing all cache on m68k (sys_cacheflush)
  • Allow removing semaphores (Used instead of CAP_CHOWN to "chown" IPC message queues, semaphores and shared memory)
  • Allow locking/unlocking of shared memory segment
  • Allow turning swap on/off
  • Allow forged pids on socket credentials passing
  • Allow setting readahead and flushing buffers on block devices
  • Allow setting geometry in floppy driver
  • Allow turning DMA on/off in xd driver
  • Allow administration of md devices (mostly the above, but some extra ioctls)
  • Allow tuning the ide driver
  • Allow access to the nvram device
  • Allow administration of apm_bios, serial and bttv (TV) device
  • Allow manufacturer commands in isdn CAPI support driver
  • Allow reading non-standardized portions of pci configuration space
  • Allow DDI debug ioctl on sbpcd driver
  • Allow setting up serial ports
  • Allow sending raw qic-117 commands
  • Allow enabling/disabling tagged queuing on SCSI controllers and sending arbitrary SCSI commands
  • Allow setting encryption key on loopback filesystem
  • Allow setting zone reclaim policy
22 0x00400000 SYS_BOOT Allow use of reboot()
23 0x00800000 SYS_NICE
  • Allow raising priority and setting priority on other (different UID) processes
  • Allow use of FIFO and round-robin (realtime) scheduling on own processes and setting the scheduling algorithm used by another process.
  • Allow setting cpu affinity on other processes
24 0x01000000 SYS_RESOURCE
  • Override resource limits. Set resource limits.
  • Override quota limits.
  • Override reserved space on ext2 filesystem
  • Modify data journaling mode on ext3 filesystem (uses journaling resources)
  • NOTE: ext2 honors fsuid when checking for resource overrides, so you can override using fsuid too
  • Override size restrictions on IPC message queues
  • Allow more than 64 Hz interrupts from the real-time clock
  • Override max number of consoles on console allocation
  • Override max number of keymaps
25 0x02000000 SYS_TIME
  • Allow manipulation of system clock
  • Allow irix_stime on mips
  • Allow setting the real-time clock
26 0x04000000 SYS_TTY_CONFIG
  • Allow configuration of tty devices
  • Allow vhangup() of tty
27 0x08000000 MKNOD Allow the privileged aspects of mknod()
28 0x10000000 LEASE Allow taking of leases on files
29 0x20000000 AUDIT_WRITE Allow writing records to the kernel audit log
30 0x40000000 AUDIT_CONTROL Allow configuration of the kernel audit subsystem
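On Linux, the capability sets of a process can be inspected through the hexadecimal CapInh, CapPrm, CapEff and CapBnd lines in /proc/<pid>/status. The Python sketch below parses such lines and names the set bits using the table above; the helper names and the sample text are illustrative, and bits above 30 are ignored because they are not covered by this page:

```python
# Index = bit number, names copied from the system capabilities table.
BCAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
]


def parse_cap_sets(status_text: str) -> dict:
    """Extract the Cap* bitmaps from the text of /proc/<pid>/status."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("CapInh", "CapPrm", "CapEff", "CapBnd"):
            sets[key] = int(value.strip(), 16)
    return sets


def cap_names(mask: int) -> list:
    """Names of the capabilities whose bits are set in the mask."""
    return [name for bit, name in enumerate(BCAP_NAMES) if (mask >> bit) & 1]


# Sample text made up for illustration; on a real system one could pass
# the contents of /proc/self/status instead.
sample = (
    "Name:\tdate\n"
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000000002000400\n"
    "CapEff:\t0000000002000400\n"
    "CapBnd:\t000000007fffffff\n"
)
sets = parse_cap_sets(sample)
print(cap_names(sets["CapEff"]))  # ['NET_BIND_SERVICE', 'SYS_TIME']
```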

Setting flags and capabilities

If you are using util-vserver, see util-vserver:Capabilities and Flags for how to set the flags and capabilities.

To edit flags without restarting the vservers, use vattribute and nattribute. See util-vserver:Cheatsheet.
