Paper
Abstract
A soft partitioning concept based on Security Contexts which permits the creation of many independent Virtual Private Servers (VPS) that run simultaneously on a single physical server at full speed, efficiently sharing hardware resources.
A VPS provides an operating environment almost identical to that of a conventional Linux server. All services, such as ssh, mail, Web and databases, can be started on such a VPS without modification (or, in special cases, with only minimal modification), just like on any real server.
Each virtual server has its own user account database and root password and is isolated from other virtual servers, except for the fact that they share the same hardware resources.
Introduction
Over the years, computers have become sufficiently powerful to use virtualization to create the illusion of many smaller virtual machines, each running a separate operating system instance.
There are several kinds of Virtual Machines (VMs) which provide similar features, but differ in the degree of abstraction and the methods used for virtualization.
Most of them accomplish what they do by emulating some real or fictional hardware, which in turn requires real resources from the Host (the machine running the VMs). This approach, used by most System Emulators (like QEMU, Bochs, ...), allows the emulator to run an arbitrary Guest Operating System, even for a different Architecture (CPU and Hardware). No modifications need to be made to the Guest OS because it isn't aware of the fact that it isn't running on real hardware.
Some System Emulators require small modifications or specialized drivers to be added to Host or Guest to improve performance and minimize the overhead required for the hardware emulation. Although this significantly improves efficiency, large amounts of resources are still wasted in caches and in mediation between Guest and Host (examples of this approach are UML and Xen).
But suppose you do not want to run many different Operating Systems simultaneously on a single box? Most applications running on a server do not require hardware access or kernel level code, and could easily share a machine with others, if they could be separated and secured...
The Concept
At a basic level, a Linux Server consists of three building blocks: Hardware, Kernel and Applications. The Hardware usually depends on the provider or system maintainer, and, while it has a big influence on the overall performance, it cannot be changed that easily, and will likely differ from one setup to another.
The main purpose of the Kernel is to build an abstraction layer on top of the hardware to allow processes (Applications) to work with and operate on resources (Data) without knowing the details of the underlying hardware. Ideally, those processes would be completely hardware agnostic, by being written in an interpreted language and therefore not requiring any hardware-specific knowledge.
Given that a system has enough resources to drive ten times the number of applications a single Linux server would usually require, why not put ten servers on that box, which will then share the available resources in an efficient manner?
Most server applications (e.g. httpd) assume that they are the only application providing a particular service, and usually also assume a certain filesystem layout and environment. This means that similar or identical services running on the same physical server, differing for example only in their addresses, have to be coordinated. This typically requires a great deal of administrative work, which can lead to reduced system stability and security.
The basic concept of the Linux-VServer solution is to separate the user-space environment into distinct units (sometimes called Virtual Private Servers) in such a way that each VPS looks and feels like a real server to the processes contained within.
Although different Linux Distributions use (sometimes heavily) patched kernels to provide special support for unusual hardware or extra functionality, most Linux Distributions are not tied to a special kernel.
Linux-VServer uses this fact to allow several distributions to be run simultaneously on a single, shared kernel, without direct access to the hardware, and to share the resources in a very efficient way.
Existing Infrastructure
Recent Linux kernels already provide many security features that Linux-VServer utilizes to do its work, in particular the Linux Capability System, Resource Limits, File Attributes and the Change Root Environment. The following sections give a short overview of each of these.
Linux Capability System
In computer science, a capability is a token used by a process to prove that it is allowed to perform an operation on an object. The Linux Capability System is based on "POSIX Capabilities", a somewhat different concept, designed to split up the all powerful root privilege into a set of distinct privileges.
POSIX Capabilities
A process has three sets of bitmaps called the inheritable(I), permitted(P), and effective(E) capabilities. Each capability is implemented as a bit in each of these bitmaps that is either set or unset.
When a process tries to do a privileged operation, the operating system will check the appropriate bit in the effective set of the process (instead of checking whether the effective uid of the process is 0 as is normally done).
For example, when a process tries to set the clock, the Linux kernel will check that the process has the CAP_SYS_TIME bit (which is currently bit 25) set in its effective set.
The permitted set of the process indicates the capabilities the process can use. The process can have capabilities set in the permitted set that are not in the effective set.
This indicates that the process has temporarily disabled this capability. A process is allowed to set a bit in its effective set only if it is available in the permitted set. The distinction between effective and permitted exists so that processes can "bracket" operations that need privilege.
The inheritable capabilities are the capabilities of the current process that should be inherited by a program executed by the current process. The permitted set of a process is masked against the inheritable set during exec(). Nothing special happens during fork() or clone(). Child processes and threads are given an exact copy of the capabilities of the parent process.
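The three-set model described above can be sketched in a few lines of Python. This is a simplified model of the text, not the kernel's implementation: the class and method names are invented for illustration, the exec() rule is reduced to the masking step mentioned above, and clamping the effective set to the new permitted set after exec() is an assumption made here to keep the model consistent.

```python
CAP_SYS_TIME = 25  # bit number quoted in the text


class CapState:
    """Hypothetical model of a process's inheritable/permitted/effective sets."""

    def __init__(self, inheritable=0, permitted=0, effective=0):
        self.inheritable = inheritable
        self.permitted = permitted
        self.effective = effective & permitted  # effective is a subset of permitted

    def capable(self, cap):
        # Privilege checks look only at the effective set.
        return bool((self.effective >> cap) & 1)

    def raise_cap(self, cap):
        # A bit may be set in the effective set only if it is permitted.
        if not (self.permitted >> cap) & 1:
            raise PermissionError(f"capability {cap} is not in the permitted set")
        self.effective |= 1 << cap

    def drop_cap(self, cap):
        # Temporarily disable a capability; it stays in the permitted set.
        self.effective &= ~(1 << cap)

    def after_exec(self):
        # Permitted is masked against inheritable (simplified, see lead-in).
        permitted = self.permitted & self.inheritable
        return CapState(self.inheritable, permitted, self.effective & permitted)

    def after_fork(self):
        # The child receives an exact copy.
        return CapState(self.inheritable, self.permitted, self.effective)
```

"Bracketing" a privileged operation then amounts to calling raise_cap(CAP_SYS_TIME), setting the clock, and calling drop_cap(CAP_SYS_TIME) again.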
The implementation in Linux stopped at this point, whereas POSIX Capabilities[U5] requires the addition of capability sets to files too, to replace the SUID flag (at least for executables). The pending Linux 2.6.24 kernel supports file POSIX capabilities. (See Serge Hallyn's git commit of Wed, 17 Oct 2007.) One userspace tool capable of changing file capabilities is the 2.X version of Andrew Morgan's libcap library and programs.
Capability Overview
The list of POSIX Capabilities used with Linux is long, and the 32 available bits are almost used up. While the detailed list of all capabilities can be found in /usr/include/linux/capability.h on most Linux systems, an overview of important capabilities is given here.
Capability | Description |
---|---|
[0] CAP_CHOWN | change file ownership and group |
[5] CAP_KILL | send a signal to a process with a different real or effective user ID |
[6] CAP_SETGID | permit setgid(2), setgroups(2), and forged gids on socket credentials passing |
[7] CAP_SETUID | permit set*uid(2), and forged uids on socket credentials passing |
[8] CAP_SETPCAP | transfer/remove any capability in permitted set to/from any pid |
[9] CAP_LINUX_IMMUTABLE | allow modification of S_IMMUTABLE and S_APPEND file attributes |
[11] CAP_NET_BROADCAST | permit broadcasting and listening to multicast |
[12] CAP_NET_ADMIN | permit interface configuration, IP firewall, masquerading, accounting, socket debugging, routing tables, bind to any address, enter promiscuous mode, multicasting, ... |
[13] CAP_NET_RAW | permit usage of RAW and PACKET sockets |
[16] CAP_SYS_MODULE | insert and remove kernel modules |
[18] CAP_SYS_CHROOT | permit chroot(2) |
[19] CAP_SYS_PTRACE | permit ptrace() of any process |
[21] CAP_SYS_ADMIN | this list would be too long; it basically allows everything else not covered by another capability |
[22] CAP_SYS_BOOT | permit reboot(2) |
[23] CAP_SYS_NICE | allow raising priority and setting priority on other processes, modify scheduling |
[24] CAP_SYS_RESOURCE | override resource limits, quota, reserved space on fs, ... |
[27] CAP_MKNOD | permit the privileged aspects of mknod(2) |
Resource Limits
Resources for each process can be limited by specifying a Resource Limit. Similar to the Linux Capabilities, there are two different limits, a Soft Limit and a Hard Limit.
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from zero up to the hard limit, and (irreversibly) lower its hard limit. A privileged process may make arbitrary changes to either limit value, as long as the soft limit does not exceed the hard limit.
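As a small illustration of the soft/hard distinction, the sketch below uses Python's standard resource module, which wraps getrlimit(2) and setrlimit(2). The helper name set_soft_limit is made up for this example; it changes only the soft limit and leaves the hard limit in place, which any process is allowed to do as long as the new value does not exceed the hard limit.

```python
import resource


def set_soft_limit(res, new_soft):
    """Change only the soft limit; the hard limit (the ceiling) is kept as is.

    resource.setrlimit raises ValueError if new_soft exceeds the hard limit.
    """
    _old_soft, hard = resource.getrlimit(res)
    resource.setrlimit(res, (new_soft, hard))
    return resource.getrlimit(res)


# Example: disable core dumps for this process by lowering the soft limit to 0.
soft, hard = set_soft_limit(resource.RLIMIT_CORE, 0)
print("RLIMIT_CORE soft:", soft, "hard:", hard)
```

Because only the soft limit was touched, the process can later raise it again up to the unchanged hard limit.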
Limit-able Resource Overview
The list of all defined resource limits can be found in /usr/include/asm/resource.h on most Linux systems; an overview of relevant resource limits is given here.
Resource Limit | Description |
---|---|
[0] RLIMIT_CPU | CPU time in seconds. The process is sent a SIGXCPU signal after reaching the soft limit, and SIGKILL on reaching the hard limit. |
[4] RLIMIT_CORE | Maximum size of core files generated. |
[5] RLIMIT_RSS | Number of pages the process's resident set can consume (the number of virtual pages resident in RAM). |
[6] RLIMIT_NPROC | The maximum number of processes that can be created for the real user ID of the calling process. |
[7] RLIMIT_NOFILE | Specifies a value one greater than the maximum file descriptor number that can be opened by this process. |
[8] RLIMIT_MEMLOCK | The maximum number of virtual memory pages that may be locked into RAM using mlock() and mlockall(). |
[9] RLIMIT_AS | The maximum number of virtual memory pages available to the process (address space limit). |
File Attributes
Originally, this feature was only available with ext2, but now all major filesystems implement a basic set of File Attributes that permit certain properties to be changed. Here again is a short overview of the possible attributes, and what they mean.
Flag | Macro Name | Meaning |
---|---|---|
s | SECRM | When a file with this attribute set is deleted, its blocks are zeroed and written back to the disk. |
u | UNRM | When a file with this attribute set is deleted, its contents are saved. |
c | COMPR | Files marked with this attribute are automatically compressed on write and uncompressed on read. (not implemented yet) |
i | IMMUTABLE | A file with this attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this file and no data can be written to the file. |
a | APPEND | Files with this attribute set can only be opened in append mode for writing. |
d | NODUMP | If this flag is set, the file is not candidate for backup with the dump utility. |
S | SYNC | Updates to the file contents are done synchronously. |
A | NOATIME | Prevents updating the atime record on files when they are accessed or modified. |
t | NOTAIL | A file with the t attribute will not have a partial block fragment at the end of the file merged with other files. |
D | DIRSYNC | Changes to a directory having this attribute set will be done synchronously. |
The first column in the table above gives the letter to pass on the command line to lsattr or chattr. The following screen dump illustrates their use:
max@pc1:~$ cd /tmp/
max@pc1:/tmp$ touch my_file
max@pc1:/tmp$ lsattr my_file
------------------ my_file
max@pc1:/tmp$ chattr +a my_file
chattr: Operation not permitted while setting flags on my_file
max@pc1:/tmp$ su
Password:
pc1:/tmp# chattr +a my_file && lsattr my_file
-----a------------ my_file
pc1:/tmp# exit
exit
max@pc1:/tmp$
As the example shows, setting the append attribute requires root permissions (the underlying filesystem was ext3). For more information, see man 1 chattr.
Information regarding file attributes can be found in the kernel source code. Each filesystem uses only a subset of all known attributes; which ones depends on the filesystem.
Only the attributes defined in the kernel source code for a particular filesystem (e.g. ext3, the Extended File System version 3) can be used on it; attributes that are not listed there are not defined. However, many of the attributes defined and understood by the kernel have no effect. Most filesystems define those flags in a filesystem-specific header file within the kernel source tree. They also define a so-called User Modifiable Mask (the flags the user can change with the ioctls).
The meaning of those flags partly depends on the node type (i.e. dir, inode, fifo, pipe, device), and it is not trivial to tell whether a filesystem makes use of a given user modifiable flag. Flags like immutable are easy to verify from user space, but how would one verify e.g. NOTAIL from user space? Usually only source code review will show whether a flag is implemented and used.
For example, unless this has changed, COMPR is defined and well understood by ext2/3, but there is no implementation, i.e. nothing is compressed.
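The attribute letters and macro names in the table above correspond to individual bits of a per-inode flag word. The following Python sketch decodes such a word. The bit values are the FS_*_FL constants as recalled from linux/fs.h and should be checked against the headers of the kernel in use; the function names are made up for this example, and obtaining the flag word itself (via the ioctls mentioned above) is deliberately left out.

```python
# (letter, macro name, bit value); values recalled from linux/fs.h (FS_*_FL),
# to be verified against the kernel headers actually in use.
ATTR_FLAGS = [
    ("s", "SECRM", 0x00000001),
    ("u", "UNRM", 0x00000002),
    ("c", "COMPR", 0x00000004),
    ("S", "SYNC", 0x00000008),
    ("i", "IMMUTABLE", 0x00000010),
    ("a", "APPEND", 0x00000020),
    ("d", "NODUMP", 0x00000040),
    ("A", "NOATIME", 0x00000080),
    ("t", "NOTAIL", 0x00008000),
    ("D", "DIRSYNC", 0x00010000),
]


def decode_attrs(flag_word):
    """Return the macro names of all attributes set in flag_word."""
    return [name for _letter, name, bit in ATTR_FLAGS if flag_word & bit]


def attr_letters(flag_word):
    """Return the chattr-style letters of all attributes set in flag_word."""
    return "".join(letter for letter, _name, bit in ATTR_FLAGS if flag_word & bit)
```

Note that a set bit only says the flag is stored on the inode; as explained above, it does not prove that the filesystem acts on it.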
The chroot(1) Command
chroot allows you to run a command with a different directory acting as the root directory. This means that all filesystem lookups are done with '/' referring to the substitute root directory and not to the original one.
While the Linux chroot implementation isn't very secure, it increases the isolation of processes with regards to the filesystem, and, if used properly, can create a filesystem "jail" for a single process or a restricted user, daemon or service.
Required Modifications
This chapter will describe the essential Kernel modifications to implement something like Linux-VServer.
Context Separation
The separation mentioned in the Concepts section requires some modifications to the kernel to allow for the notion of Contexts. The purpose of this "Context" is to hide all processes outside of its scope, and prohibit any unwanted interaction between a process inside the context and a process belonging to another context.
This separation requires the extension of some existing data structures in order for them to become aware of contexts and to differentiate between identical uids used in different virtual servers.
It also requires the definition of a default context that is used when the host system is booted, and to work around the issues resulting from some false assumptions made by some user-space tools (like pstree) that the init process has to exist and to be running under id '1'.
To simplify administration, the Host Context isn't treated any differently than any other context as far as process isolation is concerned. To allow for process overview, a special Spectator context has been defined to peek at all processes at once.
Network Separation
While the Context Separation is sufficient to isolate groups of processes, a different kind of separation, or rather a limitation, is required to confine processes to a subset of available network addresses.
Several issues have to be considered when doing so; for example, the fact that bindings to special addresses like IPADDR_ANY or the local host address have to be handled in a very special way.
Currently, Linux-VServer doesn't make use of virtual network devices (and maybe never will) to minimize the resulting overhead. Therefore socket binding and packet transmission have been adjusted.
The Chroot Barrier
One major problem of the chroot() mechanism used in Linux is that the root directory information is volatile: it is replaced on the next chroot() syscall.
One simple method to escape from a chroot-ed environment is as follows: First, create or open a file and retain the file-descriptor, then chroot into a subdirectory at equal or lower level with regards to the file. This causes the root to be moved down in the filesystem. Next, use fchdir() on the file-descriptor to escape from that new root. This will consequently escape from the old root as well, as this was lost in the last chroot() Syscall.
While early Linux-VServer versions tried to fix this by "funny" methods, recent versions use a special marking, known as the Chroot Barrier, on the parent directory of each VPS to prevent unauthorized modification and escape from confinement.
Upper Bound for Caps
Because the current Linux Capability system does not implement the filesystem related portions of POSIX Capabilities which would make setuid and setgid executables secure, and because it is much safer to have a secure upper bound for all processes within a context, an additional per-context capability mask has been added to limit all processes belonging to that context to this mask.
The meaning of the individual caps (bits) of the capability bound mask is exactly the same as with the permitted capability set.
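A minimal sketch of how such an upper bound behaves, using plain bitmasks in Python. The bit numbers are taken from the capability overview; the helper names mask and context_capable are hypothetical, and the real kernel check is more involved than a single AND.

```python
# Bit numbers from the capability overview.
CAP_CHOWN, CAP_KILL, CAP_SYS_MODULE = 0, 5, 16


def mask(*caps):
    """Build a capability bitmask from a list of bit numbers."""
    m = 0
    for cap in caps:
        m |= 1 << cap
    return m


def context_capable(effective, context_bound, cap):
    """A capability counts only if it is in both the process's effective
    set and the per-context bound mask."""
    return bool(((effective & context_bound) >> cap) & 1)
```

With a bound mask of mask(CAP_CHOWN, CAP_KILL), even a root process holding all 32 capability bits inside the context would pass the check for CAP_CHOWN but fail it for CAP_SYS_MODULE.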
Resource Isolation
Most resources are somewhat shared among the different contexts. Some require more additional isolation than others, either to avoid security issues or to allow for improved accounting.
Those resources are:
- shared memory, IPC
- user and process IDs
- file xid tagging
- Unix ptys
- sockets
Filesystem XID Tagging
Although it can be disabled completely, this modification is required for more robust filesystem level security and context isolation. It is also mandatory for Context Disk Limits and Per Context Quota Support on a shared partition.
The concept of adding a context id (xid) to each file to make the context ownership persistent sounds simple, but the actual implementation is non-trivial - mainly because adding this information either requires a change to the on disk representation of the filesystem or the application of some tricks.
One non-intrusive approach to avoid modification of the underlying filesystem is to use the upper bits of existing fields, like those for UID and GID to store the additional XID.
Once context information is available for each inode, it is a logical step to extend the access controls to check against context too. Currently all inode access restrictions have been extended to check for the context id, with special exceptions for the Host Context and the Spectator Context.
Untagged files belong to the Host Context and are silently treated as if they belong to the current context, which is required for Unification. If such a file is modified from inside a context, it silently migrates to the new one, changing its xid.
The following Tagging Methods are implemented:
Method | Description |
---|---|
UID32/GID32 or EXTERNAL | This format uses currently unused space within the disk inode to store the context information. As of now, this is only defined for ext2/ext3 but will be also defined for xfs, reiserfs, and jfs as soon as possible. Advantage: Full 32bit uid/gid values. |
UID32/GID16 | This format uses the upper half of the group id to store the context information. This is done transparently, except if the format is changed without prior file conversion. Advantage: works on all 32bit U/GID FSs. Drawback: GID is reduced to 16 bits. |
UID24/GID24 | This format uses the upper quarter of user and group id to store the context information, again transparently. This allows for about 16 million user and group ids, which should suffice for the majority of all applications. Advantage: works on all 32bit U/GID FSs. Drawback: UID and GID are reduced to 24 bits. |
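The bit-stealing formats can be illustrated with a short Python sketch that packs and unpacks a 16-bit xid. The function names are invented for this example. For UID24/GID24 the text only states that the upper quarter of each id is used; how the xid is split between the two fields (high byte in the uid, low byte in the gid) is an assumption made here for illustration.

```python
def pack_uid32_gid16(uid, gid, xid):
    """UID32/GID16: the xid occupies the upper 16 bits of the stored gid."""
    if not (0 <= uid < 1 << 32 and 0 <= gid < 1 << 16 and 0 <= xid < 1 << 16):
        raise ValueError("value out of range for UID32/GID16")
    return uid, (xid << 16) | gid


def unpack_uid32_gid16(disk_uid, disk_gid):
    """Return (uid, gid, xid) from the stored fields."""
    return disk_uid, disk_gid & 0xFFFF, disk_gid >> 16


def pack_uid24_gid24(uid, gid, xid):
    """UID24/GID24: the upper 8 bits of uid and gid together hold the xid.

    Assumption: high xid byte goes into the uid, low xid byte into the gid.
    """
    if not (0 <= uid < 1 << 24 and 0 <= gid < 1 << 24 and 0 <= xid < 1 << 16):
        raise ValueError("value out of range for UID24/GID24")
    return ((xid >> 8) << 24) | uid, ((xid & 0xFF) << 24) | gid


def unpack_uid24_gid24(disk_uid, disk_gid):
    """Return (uid, gid, xid) from the stored fields."""
    xid = ((disk_uid >> 24) << 8) | (disk_gid >> 24)
    return disk_uid & 0xFFFFFF, disk_gid & 0xFFFFFF, xid
```

The range checks show the drawbacks listed in the table directly: a gid of 70000 is rejected under UID32/GID16, while UID24/GID24 accepts it but limits both ids to 24 bits.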
Additional Modifications
In addition to the bare minimum, there are a number of modifications that are not mandatory, but have proven extremely useful over time.
Context Flags
It was very soon discovered that some features require a flag, a kind of switch to turn them on and off separately for each Linux-VServer, so a simple flag-word was added.
This flag-word supports quite a number of flags, a flag-word mask that tells which flags are available, and a special trigger mechanism providing one-time flags that are set on startup and can only be cleared once, usually causing a special action or event.
Here is a list of planned and mostly implemented Context Flags, available in the development branch of Linux-VServer:
Flag | Description |
---|---|
[0] VXF_INFO_LOCK | (legacy, obsoleted) |
[1] VXF_INFO_SCHED | schedule all processes in a context as if they were one. (legacy, obsoleted) |
[2] VXF_INFO_NPROC | limit the number of processes in a context to the initial NPROC value. (legacy, obsoleted) |
[3] VXF_INFO_PRIVATE | do not allow joining this context from outside. (legacy) |
[4] VXF_INFO_INIT | show the init process with pid '1' (legacy) |
[5] VXF_INFO_HIDE | (legacy, obsoleted) |
[6] VXF_INFO_ULIMIT | (legacy, obsoleted) |
[7] VXF_INFO_NSPACE | (legacy, obsoleted) |
[8] VXF_SCHED_HARD | activate the Hard CPU scheduling |
[9] VXF_SCHED_PRIO | use the context token bucket for calculating the process priorities |
[10] VXF_SCHED_PAUSE | put all processes in this context on the hold queue, not scheduling them any longer |
[16] VXF_VIRT_MEM | virtualize the memory information so that the VM and RSS limits are used for meminfo and friends |
[17] VXF_VIRT_UPTIME | virtualize the uptime, beginning with the time of context creation |
[18] VXF_VIRT_CPU | |
[24] VXF_HIDE_MOUNT | show empty proc/{pid}/mounts |
[25] VXF_HIDE_NETIF | hide network interfaces and addresses not permitted by the network context |
Context Capabilities
As the Linux Capabilities have almost reached the maximum number that is possible without heavy modifications to the kernel, it was a natural step to add a context-specific capability system.
The Linux-VServer context capability set acts as a mechanism to fine tune existing Linux capabilities. It is not visible to the processes within a context, as they would not know how to modify or verify it.
In general there are two ways to use those capabilities:
- Require one or a number of context capabilities to be set in addition to a given Linux capability, each one controlling a distinct part of the functionality. For example, CAP_NET_RAW could be split into RAW and PACKET sockets, so you could take away each of them separately by not providing the required context capability.
- Consider the context capability sufficient for a specified functionality, even if the Linux Capability says something different. For example, mount() requires CAP_SYS_ADMIN, which adds a dozen other things we do not want, so we define a CCAP_MOUNT to allow mounts for certain contexts.
The difference between the Context Flags and the Context Caps is more an abstract logical separation than a functional one, because they are handled very similarly.
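The two ways of using context capabilities can be sketched as two small predicates. This is an illustration only: CAP_NET_RAW and CAP_SYS_ADMIN use the bit numbers from the capability overview (where CAP_NET_RAW governs RAW and PACKET sockets), CCAP_MOUNT is the name used in the example above, and the bit positions of all CCAP_* constants as well as CCAP_RAW_SOCKET and CCAP_PACKET_SOCKET are hypothetical.

```python
# Linux capability bit numbers from the capability overview.
CAP_NET_RAW, CAP_SYS_ADMIN = 13, 21

# Hypothetical context capability bits, chosen only for this sketch.
CCAP_RAW_SOCKET, CCAP_PACKET_SOCKET, CCAP_MOUNT = 0, 1, 2


def has(bitmask, bit):
    return bool((bitmask >> bit) & 1)


def may_open_raw_socket(caps, ccaps):
    # First way: the context capability is required IN ADDITION to the
    # Linux capability, so each part can be taken away separately.
    return has(caps, CAP_NET_RAW) and has(ccaps, CCAP_RAW_SOCKET)


def may_open_packet_socket(caps, ccaps):
    return has(caps, CAP_NET_RAW) and has(ccaps, CCAP_PACKET_SOCKET)


def may_mount(caps, ccaps):
    # Second way: the context capability is SUFFICIENT on its own,
    # even if the Linux capability is missing.
    return has(caps, CAP_SYS_ADMIN) or has(ccaps, CCAP_MOUNT)
```

A context holding CAP_NET_RAW and only CCAP_RAW_SOCKET may open RAW but not PACKET sockets, while a context without CAP_SYS_ADMIN may still mount if it holds CCAP_MOUNT.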
Again, a list of the Context Capabilities and their purpose:
Context Capability | Description |
---|---|
[0] VXC_SET_UTSNAME | allow the context to change the host and domain name with the appropriate kernel Syscall |
[1] VXC_SET_RLIMIT | allow the context to modify the resource limits (within the vserver limits) |
[8] VXC_RAW_ICMP | allow raw icmp packets in a secure way (this makes ping work from inside) |
[16] VXC_SECURE_MOUNT | permit secure mounts, which at the moment means that the nodev mount option is added |
Context Accounting
Some properties of a context are useful to the admin, either for keeping an overview of the resources, to get a feeling for the capacity of the host, or for billing them in some way to a customer.
There are two different kinds of accountable properties, those having a current value which represents the state of the system (for example the speed of a vehicle), and those which monotonically increase over time (like the mileage).
Most of the state-type properties also qualify for limits, so they are handled specially. This is described in more detail in the following section.
Good candidates for Context Accounting are:
- Amount of CPU Time spent
- Number of Forks done
- Socket Messages by Type
- Network Packets Transmitted and Received
Context Limits
Most properties related to system resources, be it memory consumption, the number of processes or file handles, or the current network bandwidth, qualify for imposing limits on them.
To provide a general framework for all kinds of limits, Context Limits allow the configuration of three different values for each limit-able resource: the minimum, a soft limit and a hard limit (maximum).
At the time this is written, only the hard limits are supported and not all of them are actually enforced, but here is a list of current and planned Context Limits:
- process limits
- scheduler limits
- memory limits
- per-context disk limits
- per-context user/group quota
Additionally the context limit system keeps track of observed maxima and resource limit hits, to provide some feedback for the administrator.
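A minimal sketch of such a limit record in Python, assuming a hypothetical ContextLimit class. In line with the text, only the hard limit is enforced; the minimum and soft values are stored but unused, and the observed maximum and the number of limit hits are tracked as feedback for the administrator.

```python
from dataclasses import dataclass


@dataclass
class ContextLimit:
    """Hypothetical per-context limit for one resource."""
    minimum: int
    soft: int
    hard: int
    current: int = 0
    maximum_seen: int = 0  # highest usage observed so far
    hits: int = 0          # how often the hard limit was hit

    def charge(self, amount=1):
        """Try to account `amount` units; refuse if the hard limit would be exceeded."""
        if self.current + amount > self.hard:
            self.hits += 1
            return False
        self.current += amount
        self.maximum_seen = max(self.maximum_seen, self.current)
        return True

    def release(self, amount=1):
        """Give back `amount` units, never dropping below zero."""
        self.current = max(0, self.current - amount)
```

For example, a process limit with hard=4 would accept four charge() calls and refuse the fifth, recording one hit.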
Virtualization
One major difference between the Linux-VServer approach and Virtual Machines is that you do not have the virtualization part as a side-effect, so you have to do that by hand where it makes sense.
For example, a Virtual Machine does not need to think about uptime, because the OS running in it was naturally started at some point in the past and has no problem telling when it thinks it began running.
A context can also store the time when it was created, but that will differ from the system's uptime, so in addition there has to be some function that adjusts the values passed from kernel to user-space depending on the context the process belongs to.
This is what Linux-VServer calls Virtualization (actually it is more a matter of faking some values passed to and from the kernel to make the processes think that they are on a different machine).
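The uptime case can be reduced to a one-line adjustment, sketched here in Python. The function name and its parameters are hypothetical; the flag bit follows the VXF_VIRT_UPTIME entry (bit 17) in the Context Flags list, and both time values are assumed to be seconds since host boot.

```python
VXF_VIRT_UPTIME = 1 << 17  # bit number from the Context Flags list


def reported_uptime(host_uptime, context_created_at, context_flags):
    """Uptime as seen by a process in a context.

    host_uptime and context_created_at are both seconds since host boot.
    """
    if context_flags & VXF_VIRT_UPTIME:
        # Virtualized: count from the creation of the context.
        return host_uptime - context_created_at
    # Not virtualized: the context sees the host's uptime.
    return host_uptime
```

A context created 400 seconds after host boot would report 600 seconds on a host that has been up for 1000 seconds.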
Currently modified for the purpose of Virtualization are:
- System Uptime
- Host and Domain Name
- Machine Type and Kernel Version
- Context Memory Availability
- Context Disk Space
Improved Security
Proc-FS Security provides a mechanism to protect dynamic entries in the proc filesystem from being seen in every context. The system consists of three flags for each Proc-FS entry: Admin, Watch and Hide.
The Hide flag enables or disables the entire feature, so any combination with the Hide flag cleared will mean total visibility. The Admin and Watch flags determine where the hidden entry remains visible; so for example if Admin and Hide are set, the Host Context will be the only one able to see this specific entry.
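The flag combinations can be written down as a small visibility predicate. This sketch makes several assumptions that are not stated in the text: the bit values of the three flags, the context ids used for the Host and Spectator contexts, and the reading that the Watch flag keeps a hidden entry visible in the Spectator context.

```python
# Hypothetical bit values for the three Proc-FS entry flags.
ADMIN, WATCH, HIDE = 1, 2, 4

# Assumed context ids for the two special contexts.
HOST_XID, SPECTATOR_XID = 0, 1


def entry_visible(flags, xid):
    """Decide whether a Proc-FS entry is visible from context `xid`."""
    if not (flags & HIDE):
        return True  # Hide cleared: total visibility
    if (flags & ADMIN) and xid == HOST_XID:
        return True  # hidden, but still visible to the Host Context
    if (flags & WATCH) and xid == SPECTATOR_XID:
        return True  # assumption: Watch maps to the Spectator context
    return False
```

With HIDE | ADMIN, only the Host Context sees the entry; with HIDE alone, no context does.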
Kernel Helper
For some purposes, it makes sense to have a user-space tool act on behalf of the kernel when a process inside a context requests something usually available on a real server, but naturally not available inside a context.
The best, and currently only, example of this is the Reboot Helper, which handles the reboot() system call invoked from inside a context on behalf of the kernel. It is executed in Host-side user-space to take appropriate action: either reboot or just shut down (halt) the specified context.
While the helper is designed to be flexible and handle different things in a similar way there are no other users of this helper at the moment. It might be replaced by an event interface in near future.
Features and Bonus Material
Unification
Because one of the central objectives for Linux-VServer is to reduce the overall resource usage wherever possible, a truly great idea was born to share files between different contexts without interfering with the usual administrative tasks or reducing the level of security created by the isolation.
Files common to more than one context that are unlikely to change, like libraries or binaries, can be hard linked on a shared filesystem, thus reducing the amount of disk space, inode caches, and even memory mappings for shared libraries.
The only drawback is that without additional measures, a malicious context would be able to deliberately or accidentally destroy or modify such shared files, which in turn would harm the other contexts.
One step is to make the shared files immutable by using the Immutable File Attribute (and removing the Linux Capability required to modify this attribute). However an additional attribute is required to allow removal of such immutable shared files, to allow for updates of libraries or executables from inside a context.
Such hard linked, immutable but unlink-able files belonging to more than one context are called unified and the process of finding common files and preparing them in this way is called Unification.
The reason for doing this is reduced resource consumption, not simplified administration. While a typical Linux Server install will consume about 500MB of disk space, 10 unified servers will only need about 700MB and as a bonus use less memory for caching.
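The hard-linking step of Unification can be sketched in Python as follows. This is not the Linux-VServer tooling, only an illustration of the idea: the function name unify is made up, files are compared by content only (a real tool would also have to compare ownership, mode and timestamps), and setting the immutable and unlink-permitting attributes is omitted because it requires root privileges.

```python
import filecmp
import os


def unify(reference_root, other_root):
    """Replace files under other_root that are byte-identical to the file at
    the same relative path under reference_root with hard links.

    Both trees must live on the same filesystem. Returns the bytes saved.
    """
    saved = 0
    for dirpath, _dirs, files in os.walk(reference_root):
        rel = os.path.relpath(dirpath, reference_root)
        for name in files:
            ref = os.path.join(dirpath, name)
            other = os.path.join(other_root, rel, name)
            # Only regular, existing, not yet unified files are candidates.
            if os.path.islink(ref) or os.path.islink(other):
                continue
            if not os.path.isfile(other) or os.path.samefile(ref, other):
                continue
            if not filecmp.cmp(ref, other, shallow=False):
                continue
            # Link under a temporary name, then atomically replace the copy.
            tmp = other + ".unify-tmp"
            os.link(ref, tmp)
            os.replace(tmp, other)
            saved += os.path.getsize(ref)
    return saved
```

After unify("/vservers/vps1", "/vservers/vps2") (paths chosen for illustration), identical files in both trees share one inode, which is where the disk and cache savings mentioned above come from.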
Private Namespaces
A recent addition to the Linux-VServer branch was the introduction of Private Namespaces. This uses the already existing Virtual Filesystem Layer of the Linux kernel to create a separate view of the filesystem for the processes belonging to a context.
The major advantage over the shared namespace used by default is that any modifications to the namespace layout (like mounts) do not affect other contexts, not even the Host Context.
Obviously the drawback of that approach is that entering such a Private Namespace isn't as trivial as changing the root directory, but with proper kernel support this will completely replace the chroot() in the future.
The Linux-VServer Proc-FS
A structured, dynamically generated subtree of the well-known Proc-FS - actually two of them - has been created to allow for inspecting the different values of Security and Network Contexts.
/proc/virtual
    .../info
/proc/virtual/<pid>
    .../info
    .../status
    .../sched
    .../cvirt
    .../cacct
    .../limit
Token Bucket Extensions
While the basic idea of Linux-VServer is a peaceful coexistence of all contexts, sharing the common resources in a respectful way, it is sometimes useful to control the resource distribution for resource hungry processes.
The basic principle of a Token Bucket is not very new. It is given here as an example for the Hard CPU Limit. The same principle also applies to scheduler priorities, network bandwidth limitation and resource control in general.
The Hard CPU Limit uses this mechanism in the following way: consider a bucket of a certain size S which is filled with a specified amount of tokens R every interval T, until the bucket is "full" - excess tokens are spilled. At each timer tick, a running process consumes exactly one token from the bucket, unless the bucket is empty, in which case the process is put on a hold queue until the bucket has been refilled with a minimum M of tokens. The process is then rescheduled.
A major advantage of a Token Bucket is that a certain amount of tokens can be accumulated in times of quiescence, which later can be used to burst when resources are required.
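The Hard CPU Limit described above can be modelled with a small Python class. It is a sketch of the principle rather than the scheduler code: the class and attribute names are invented, and the order of operations within one tick (refill first, then release from hold, then consume) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    size: int            # S: bucket capacity
    fill_rate: int       # R: tokens added per refill
    interval: int        # T: timer ticks between refills
    min_tokens: int      # M: tokens needed to leave the hold queue
    tokens: int = 0
    on_hold: bool = False
    since_refill: int = 0

    def tick(self, wants_cpu):
        """Advance one timer tick; return True if the process ran."""
        self.since_refill += 1
        if self.since_refill >= self.interval:
            self.since_refill = 0
            # Excess tokens are spilled once the bucket is full.
            self.tokens = min(self.size, self.tokens + self.fill_rate)
        if self.on_hold and self.tokens >= self.min_tokens:
            self.on_hold = False  # refilled to the minimum: reschedule
        if not wants_cpu or self.on_hold:
            return False
        if self.tokens == 0:
            self.on_hold = True   # bucket empty: put on the hold queue
            return False
        self.tokens -= 1          # a running process consumes one token
        return True
```

With R=1 and T=4 a permanently busy context settles at roughly one quarter of the CPU, while an idle context accumulates up to S tokens that it can later spend in a burst.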
Where a per-process Token Bucket would allow for a CPU resource limitation of a single process, a Context Token Bucket allows control of the CPU usage of all confined processes.
Another approach, which is also implemented, is to use the current fill level of the bucket to adjust the process priority, thus reducing the priority of processes belonging to excessive contexts.
Context Disk Limits
This Feature requires the use of XID Tagged Files, and allows for independent Disk Limits for different contexts on a shared partition. The number of inodes and blocks for each filesystem is accounted, if an XID-Hash was added for the Context-Filesystem combo.
Those values, including current usage, maximum and reserved space, will be shown for filesystem queries, creating the illusion that the shared filesystem has a different usage and size, for each context.
Per-Context Quota
Similar to the Context Disk Limits, Per-Context Quota uses separate quota hashes for different Contexts on a shared filesystem. This is not required to allow for Linux-VServer quota on separate partitions.
The VRoot Proxy Device
Quota operations (ioctls) require some access to the block device, which for security reasons is not available inside a VPS.
Stealth
For some applications, for example the preparation of a honey-pot or an especially realistic imitation of a real server for educational purposes, it can make sense to make the context indistinguishable from a real server.
However, since other freely available alternatives like QEMU or UML are much better at this, and require much less effort, this is not a central issue in Linux-VServer development.
Linux-VServer Security
Now that we know what the Linux-VServer framework provides and how some features work, let's have a word on security, because you should not rely on the framework to be secure by definition. Instead, you should know exactly what you are doing.
Secure Capabilities
Currently the following Linux Capabilities are considered secure for VPS use. If others are added, it will probably open some security hole.
- CAP_CHOWN
- CAP_DAC_OVERRIDE
- CAP_DAC_READ_SEARCH
- CAP_FOWNER
- CAP_FSETID
- CAP_KILL
- CAP_SETGID
- CAP_SETUID
- CAP_NET_BIND_SERVICE
- CAP_SYS_CHROOT
- CAP_SYS_PTRACE
- CAP_SYS_BOOT
- CAP_SYS_TTY_CONFIG
- CAP_LEASE
CAP_NET_RAW, for example, is not considered secure, even though it is often used to allow the broken ping command to work. There are better alternatives, such as the userspace ping command poink[U7] or the VXC_RAW_ICMP Context Capability.
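The list above translates naturally into a bitmask that a configuration tool could check against. The following is only a sketch: the helper names are hypothetical, the 64-bit mask is an assumption made for illustration, and the capability numbers come from the standard `<linux/capability.h>` header.

```c
#include <stdint.h>
#include <linux/capability.h>

/* Bit for a single capability number (hypothetical helper). */
#define CAP_BIT(c) (UINT64_C(1) << (c))

/* The capabilities the text lists as secure for VPS use. */
static const uint64_t vps_secure_caps =
    CAP_BIT(CAP_CHOWN)            | CAP_BIT(CAP_DAC_OVERRIDE) |
    CAP_BIT(CAP_DAC_READ_SEARCH)  | CAP_BIT(CAP_FOWNER)       |
    CAP_BIT(CAP_FSETID)           | CAP_BIT(CAP_KILL)         |
    CAP_BIT(CAP_SETGID)           | CAP_BIT(CAP_SETUID)       |
    CAP_BIT(CAP_NET_BIND_SERVICE) | CAP_BIT(CAP_SYS_CHROOT)   |
    CAP_BIT(CAP_SYS_PTRACE)       | CAP_BIT(CAP_SYS_BOOT)     |
    CAP_BIT(CAP_SYS_TTY_CONFIG)   | CAP_BIT(CAP_LEASE);

/* Return the subset of `requested` that is NOT on the secure list.
 * A result of zero means every requested capability is on the list. */
static uint64_t vps_insecure_caps(uint64_t requested)
{
    return requested & ~vps_secure_caps;
}
```

A request containing CAP_NET_RAW would come back with that bit set, flagging it for review before the VPS is started.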
The Chroot Barrier
Ensuring that the Barrier flag is set on the parent directory of each VPS is vital if you do not want VPS root to escape from the confinement and walk your Host's root filesystem.
Secure Device Nodes
The /dev directory of a VPS should not contain more than the following devices and the one directory for the unix pts tree.
- c 1 7 full
- c 1 3 null
- c 5 2 ptmx
- c 1 8 random
- c 5 0 tty
- c 1 9 urandom
- c 1 5 zero
- d pts
Of course other device nodes like console, mem and kmem, even block and character devices can be added, but some expertise is required in order to ensure no security holes are opened.
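One way to audit a VPS /dev directory is to compare each entry against the list above. The sketch below encodes that list as a table. The struct, table and function names are hypothetical, and walking the directory itself (for example with readdir() and stat()) is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct dev_entry {
    char type;              /* 'c' = character device, 'd' = directory */
    unsigned major_nr;      /* device major number (unused for 'd') */
    unsigned minor_nr;      /* device minor number (unused for 'd') */
    const char *name;       /* name below /dev */
};

/* The entries the text considers safe inside a VPS. */
static const struct dev_entry vps_dev_whitelist[] = {
    { 'c', 1, 7, "full"    },
    { 'c', 1, 3, "null"    },
    { 'c', 5, 2, "ptmx"    },
    { 'c', 1, 8, "random"  },
    { 'c', 5, 0, "tty"     },
    { 'c', 1, 9, "urandom" },
    { 'c', 1, 5, "zero"    },
    { 'd', 0, 0, "pts"     },
};

/* Return true if the given /dev entry matches the whitelist exactly. */
static bool vps_dev_allowed(char type, unsigned major_nr,
                            unsigned minor_nr, const char *name)
{
    size_t n = sizeof(vps_dev_whitelist) / sizeof(vps_dev_whitelist[0]);

    for (size_t i = 0; i < n; i++) {
        const struct dev_entry *e = &vps_dev_whitelist[i];

        if (e->type != type || strcmp(e->name, name) != 0)
            continue;
        /* Directories carry no device numbers; compare them for nodes only. */
        if (type == 'd' ||
            (e->major_nr == major_nr && e->minor_nr == minor_nr))
            return true;
    }
    return false;
}
```

Matching on both the name and the device numbers matters: a node named null that actually carries the numbers of kmem (1, 2) would be rejected.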
Secure Proc-FS Entries
There has been no detailed evaluation of secure and unsecure entries in the proc filesystem, but there have been some incidents where unprotected (not protected via Linux Capabilities) writable proc entries caused mayhem.
For example, /proc/sysrq-trigger is something which should not be accessible inside a VPS without a very good reason.
Field of Application
The primary goal of this project is to create virtual servers sharing the same machine. A virtual server operates like a normal Linux server. It runs normal services such as telnet, mail servers, web servers, and SQL servers.
Administrative Separation
This allows a clever provider to sell something called a Virtual Private Server, which uses fewer resources than other virtualization techniques, which in turn allows more units on a single machine.
The list of providers doing so is relatively long, and so this is rightfully considered the main area of application.
Service Separation
Separating different or similar services which otherwise would interfere with each other, either because they are poorly designed or because they are simply incapable of peaceful coexistence for whatever reason, can be easily done with Linux-VServer.
But even on old-fashioned real server machines, putting extremely exposed or untrusted services (untrusted because they are unknown or proprietary) into some kind of jail can improve maintainability and security a lot.
Enhancing Security
While it can be interesting to run several virtual servers in one box, there is one concept potentially more generally useful. Imagine a physical server running a single virtual server. The goal is to isolate the main environment from any service and any network. You boot in the main environment, start very few services and then continue in the virtual server.
The services in the main environment would be:
- Unreachable from the network.
- Able to log messages from the virtual server in a secure way. The virtual server would be unable to change or erase the logs. Even a cracked virtual server would not be able to edit the log.
- Able to run intrusion detection facilities, potentially spying on the state of the virtual server without being accessible or noticed. For example, tripwire could run there and it would be impossible to circumvent its operation or trick it.
Another option is to put the firewall in a virtual server, and pull in the DMZ, containing each service in a separate VPS. On proper configuration, this setup can reduce the number of required machines drastically, without impacting performance.
Easy Maintenance
One key feature of a virtual server is the independence from the actual hardware. Most hardware issues are irrelevant for a virtual server installation.
The main server acts as a host and takes care of all the details. The virtual server is just a client and ignores all the details. As such, the client can be moved to another physical server with very few manipulations.
For example, to move the virtual server from one physical computer to another, it is sufficient to do the following:
- shut down the running server
- copy it over to the other machine
- copy the configuration
- start the virtual server on the new machine
No adjustments to user setup, password database or hardware configuration are required, as long as both machines are binary compatible.
Fail-over Scenarios
Pushing the limit a little further, replication technology could be used to keep an up-to-the-minute copy of the filesystem of a running Virtual Server. This would permit a very fast fail-over if the running server goes offline for whatever reason.
All the known methods to accomplish this, from network replication via rsync or drbd, through network devices or shared disk arrays, to distributed filesystems, can be used to reduce downtime and improve overall efficiency.
For Testing
Consider a software tool or package which should be built for several versions of a specific distribution (Mandrake 8.2, 9.0, 9.1, 9.2, 10.0) or even for different distributions.
This is easily solved with Linux-VServer. Given plenty of disk space, the different distributions can be installed and running side by side, simplifying the task of switching from one to another.
Of course this can be accomplished by chroot() alone, but with Linux-VServer it's a much more realistic simulation.
Performance and Stability
(work in progress)
Impact of Linux-VServer on the Host
seems to be 0% ...
Overhead inside a Context
seems to be less than 2% ...
Size of the Kernel Patch
Comparison of the different patches ...
patch | hunks | + | - |
---|---|---|---|
patch-2.4.24-vs1.00.diff | 178 | 1112 | 135 |
patch-2.4.24-vs1.20.diff | 216 | 2035 | 178 |
patch-2.4.24-vs1.26.diff | 225 | 2118 | 180 |
patch-2.4.25-vs1.27.diff | 252 | 2166 | 201 |
patch-2.4.26-vs1.28.diff | 254 | 2183 | 202 |
patch-2.6.6-vs1.9.0.diff | 494 | 5699 | 303 |
patch-2.6.6-vs1.9.1.diff | 497 | 5878 | 307 |
patch-2.6.7-vs1.9.2.diff | 618 | 6836 | 348 |
uml-patch-2.4.26-1.diff | 449 | 36885 | 48 |
Non Intel i386 Hardware
Linux-VServer was designed to be mostly architecture agnostic, therefore only a small part, the syscall definition itself, is architecture specific. Nevertheless some architectures have private copies of basically architecture independent code for whatever reason, and therefore small modifications are often required.
The following architectures are supported and some of them are even tested:
- alpha
- ia32 / ia64 / xbox
- x86_64 (AMD64)
- mips / mips64
- hppa / hppa64
- ppc / ppc64
- sparc / sparc64
- s390
- uml
Adding a new architecture is relatively simple although extensive testing is required to make sure that every feature is working as expected (and of course, the hardware ;).
Linux Kernel Intro
While almost all of the described features reside in the Linux Kernel, nifty Userspace Tools are required to activate and control the new functionality.
Those Userspace Tools in general communicate with the Linux Kernel via System Calls (or Syscalls for short). This chapter gives a short overview of how Linux Kernel and User Space are organized and how Syscalls, a simple method of communication between processes and kernel, work.
Kernel and User Space
In Linux and similar Operating Systems, User and Kernel Space are separated, and the address space is divided into two parts. Kernel space is where the kernel code resides, and user space is where the user programs live. Of course, a given user program can't write to kernel memory or to another program's memory area.
Unfortunately, a similar restriction applies to kernel code: it can't write to user space directly either. What does this mean? Well, when a given hardware driver wants to write data bytes to a program in user memory, it can't do it directly, but must use specific kernel functions instead. Also, when parameters are passed by address to a kernel function, the kernel function cannot read the parameters directly. It must use other kernel functions to read each byte of the parameters.
Of course, there are some helpers which do the transfer to and from user space.
    copy_to_user(void *to, const void *from, long n);
    copy_from_user(void *to, const void *from, long n);
get_user() and put_user() get or put the given byte, word, or long from or to user memory. They are macros, and they rely on the type of the argument to determine the number of bytes to transfer.
Linux Syscalls
Most libc calls rely on system calls, which are the simplest kernel functions a user program can call.
These system calls are implemented in the kernel itself or in loadable kernel modules, which are little chunks of dynamically link-able kernel code.
Linux system calls are implemented through a multiplexor which is entered via a software interrupt. On i386 Linux, this interrupt is int 0x80. When the 'int 0x80' instruction is executed, control is given to the kernel (or, more accurately, to the _system_call() function), and the actual demultiplexing process occurs.
How does _system_call() work?
First, all registers are saved and the content of the %eax register is checked against the global system calls table, which enumerates all system calls and their addresses.
This table can be accessed with the extern void *sys_call_table[] variable. A given number and memory address in this table corresponds to each system call.
System call numbers can be found in /usr/include/sys/syscall.h.
They are of the form SYS_systemcallname. If the system call is not implemented, the corresponding cell in the sys_call_table is 0, and an error is returned.
Otherwise, the system call actually exists and the corresponding entry in the table is the memory address of the system call code.
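From user space, the same path can be exercised without the usual libc wrapper function. The sketch below uses the generic syscall() function from glibc together with a SYS_ number from the header mentioned above. The function name raw_getpid is hypothetical, and the entry mechanism actually used (int 0x80 or something newer) is chosen by libc for the architecture at hand.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Ask the kernel for the process ID by system call number,
 * bypassing the getpid() wrapper that libc normally provides. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() should return the same value as getpid(), since both end up in the same entry of the system call table.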
Kernel Side Implementation
While this chapter is mainly of interest to kernel developers, it might be fun to take a small peek behind the curtain to get a glimpse of how everything really works.
The Syscall Command Switch
For a long time Linux-VServer used a few different Syscalls to accomplish different aspects of the work, but very soon the number of required commands grew large, and the Syscalls started to have magic values, selecting the desired behavior.
Not too long ago, a single syscall was reserved for Linux-VServer, and while the opinion on that might differ from developer to developer, it was generally considered a good decision not to have more than one syscall.
The advantage of different Syscalls would be simpler handling of the Syscalls on different architectures; however, this hasn't been a problem so far, as the data passed to and from the kernel has strongly typed fields conforming to the C99 types.
Regardless, the availability of one system call required the creation of a multiplexor, which decides, based on some selector, what specific command is to be executed, and then passes on the remaining arguments to that command, which does the actual work.
extern asmlinkage long sys_vserver(uint32_t cmd, uint32_t id, void __user *data)
The Linux-VServer syscall is passed three arguments regardless of what actual command is specified: a command (cmd), a number (id), and a user-space data-structure of yet unknown size.
To allow for some structure for debugging purposes and some kind of command versioning, the cmd is split into three parts: the lower 12 bits contain a version number, the next 4 bits are reserved, and the upper 16 bits are divided into an 8-bit command and a 6-bit category, again reserving 2 bits for the future.
There are 64 Categories with up to 256 commands in each category, allowing for 4096 revisions of each command, which is far more than will ever be required.
Here is an overview of the categories already defined, and their numerical value:
    Syscall Matrix V2.6

           |VERSION|CREATE |MODIFY |MIGRATE|CONTROL|EXPERIM| |SPECIAL|SPECIAL|
           |STATS  |DESTROY|ALTER  |CHANGE |LIMIT  |TEST   | |       |       |
           |INFO   |SETUP  |       |MOVE   |       |       | |       |       |
    -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
    SYSTEM |VERSION|VSETUP |VHOST  |       |       |       | |DEVICES|       |
    HOST   |     00|     01|     02|     03|     04|     05| |     06|     07|
    -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
    CPU    |       |VPROC  |PROCALT|PROCMIG|PROCTRL|       | |SCHED. |       |
    PROCESS|     08|     09|     10|     11|     12|     13| |     14|     15|
    -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
    MEMORY |       |       |       |       |       |       | |SWAP   |       |
           |     16|     17|     18|     19|     20|     21| |     22|     23|
    -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
    NETWORK|       |VNET   |NETALT |NETMIG |NETCTL |       | |SERIAL |       |
           |     24|     25|     26|     27|     28|     29| |     30|     31|
    -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
    DISK   |       |       |       |       |       |       | |INODE  |       |
    VFS    |     32|     33|     34|     35|     36|     37| |     38|     39|
    -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
    OTHER  |       |       |       |       |       |       | |VINFO  |       |
           |     40|     41|     42|     43|     44|     45| |     46|     47|
    =======+=======+=======+=======+=======+=======+=======+ +=======+=======+
    SPECIAL|       |       |       |       |FLAGS  |       | |       |       |
           |     48|     49|     50|     51|     52|     53| |     54|     55|
    -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
    SPECIAL|       |       |       |       |RLIMIT |SYSCALL| |       |COMPAT |
           |     56|     57|     58|     59|     60|TEST 61| |     62|     63|
    -------+-------+-------+-------+-------+-------+-------+ +-------+-------+
The definition of those Commands is simplified by some macros, so for example the commands to get and set the Context Flags are defined like this:
    #define VCMD_get_cflags VC_CMD(FLAGS, 1, 0)
    #define VCMD_set_cflags VC_CMD(FLAGS, 2, 0)

    extern int vc_get_cflags(uint32_t, void __user *);
    extern int vc_set_cflags(uint32_t, void __user *);
Note that the command itself is not passed to the actual command implementation, only the id and the pointer to user-space data.
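To make the command layout and the multiplexor more concrete, here is a userspace sketch. It is not the kernel code. The exact bit positions (category in bits 24 to 29, command in bits 16 to 23, version in bits 0 to 11) are an assumption consistent with the description above, the VC_CAT_ names and the decoding macros are hypothetical, the category numbers are read from the syscall matrix, and the handlers are stand-ins that only return a marker value.

```c
#include <stdint.h>

/* Category numbers as read from the syscall matrix. */
#define VC_CAT_SCHED 14
#define VC_CAT_FLAGS 52

/* Assumed layout: 6-bit category, 8-bit command, 12-bit version. */
#define VC_CMD(c, i, v) \
    ((((uint32_t)(VC_CAT_##c) & 0x3F) << 24) | \
     (((uint32_t)(i) & 0xFF) << 16) | \
     ((uint32_t)(v) & 0xFFF))

/* Hypothetical helpers to take a command word apart again. */
#define VC_CATEGORY(cmd) (((cmd) >> 24) & 0x3F)
#define VC_COMMAND(cmd)  (((cmd) >> 16) & 0xFF)
#define VC_VERSION(cmd)  ((cmd) & 0xFFF)

#define VCMD_get_cflags VC_CMD(FLAGS, 1, 0)
#define VCMD_set_cflags VC_CMD(FLAGS, 2, 0)

/* Stand-in handlers: note that they receive only id and data. */
static int vc_get_cflags(uint32_t id, void *data)
{
    (void)id; (void)data;
    return 1;
}

static int vc_set_cflags(uint32_t id, void *data)
{
    (void)id; (void)data;
    return 2;
}

/* Toy multiplexor: select the handler by cmd, pass on the rest. */
static long do_vserver(uint32_t cmd, uint32_t id, void *data)
{
    switch (cmd) {
    case VCMD_get_cflags:
        return vc_get_cflags(id, data);
    case VCMD_set_cflags:
        return vc_set_cflags(id, data);
    default:
        return -1;  /* unknown command (placeholder error value) */
    }
}
```

Under the assumed layout, VCMD_get_cflags evaluates to 0x34010000: category 52 (0x34) in the top byte, command 1 in the next byte, and version 0 in the low 12 bits.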
Utilized Data Structures
There are many different data structures used by different parts of the implementation; while only a few examples are given here, all utilized structures can be found in the source.
The Context Data Structure
The Context Data Structure consists of a few fields required to manage the contexts, and handle context destruction, as well as future hierarchical contexts.
Logically separated sections of that structure, like for the scheduler or the context limits are defined in separate structures, and incorporated into the main one.
    struct vx_info {
        struct list_head vx_list;       /* linked list of contexts */
        xid_t vx_id;                    /* context id */
        atomic_t vx_refcount;           /* refcount */
        struct vx_info *vx_parent;      /* parent context */

        struct namespace *vx_namespace; /* private namespace */
        struct fs_struct *vx_fs;        /* private namespace fs */
        uint64_t vx_flags;              /* context flags */
        uint64_t vx_bcaps;              /* bounding caps (system) */
        uint64_t vx_ccaps;              /* context caps (vserver) */

        pid_t vx_initpid;               /* PID of fake init process */

        struct _vx_limit limit;         /* vserver limits */
        struct _vx_sched sched;         /* vserver scheduler */
        struct _vx_cvirt cvirt;         /* virtual/bias stuff */
        struct _vx_cacct cacct;         /* context accounting */

        char vx_name[65];               /* vserver name */
    };
Here, as an example, is the Scheduler Substructure:
    struct _vx_sched {
        spinlock_t tokens_lock;         /* lock for this structure */

        int fill_rate;                  /* Fill rate: add X tokens ... */
        int interval;                   /* Divisor: ... each Y jiffies */
        atomic_t tokens;                /* current number of tokens */
        int tokens_min;                 /* Limit: minimum for unhold */
        int tokens_max;                 /* Limit: no more than N tokens */
        uint32_t jiffies;               /* bias: integral multiple of Y */

        uint64_t ticks;                 /* token tick events */
        cpumask_t cpus_allowed;         /* cpu mask for context */
    };
The main idea behind this separation is that each substructure belongs to a logically distinct part of the implementation which provides an init and cleanup function for this structure, thus simplifying maintainability and readability of those structures.
The Scheduler Command Data
As an example of the data structures used to control a specific part of the context from user-space, here is a scheduler command and the data structure used to set the properties:
    #define VCMD_set_sched VC_CMD(SCHED, 1, 2)

    struct vcmd_set_sched_v2 {
        int32_t fill_rate;              /* Fill rate: add X tokens ... */
        int32_t interval;               /* Divisor: ... each Y jiffies */
        int32_t tokens;                 /* current number of tokens */
        int32_t tokens_min;             /* Limit: minimum for unhold */
        int32_t tokens_max;             /* Limit: no more than N tokens */
        uint64_t cpu_mask;              /* Mask: allowed cpus */
    };
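As a rough guide to choosing values for this structure: if fill_rate tokens arrive every interval jiffies and a running process consumes one token per tick, the sustainable CPU share on a single CPU is about fill_rate / interval. The helper below is a hypothetical userspace sketch of that arithmetic and is not part of the Linux-VServer tools. The struct is repeated only to keep the example self-contained.

```c
#include <stdint.h>

/* Same field layout as the command data shown above. */
struct vcmd_set_sched_v2 {
    int32_t fill_rate;      /* Fill rate: add X tokens ... */
    int32_t interval;       /* Divisor: ... each Y jiffies */
    int32_t tokens;         /* current number of tokens */
    int32_t tokens_min;     /* Limit: minimum for unhold */
    int32_t tokens_max;     /* Limit: no more than N tokens */
    uint64_t cpu_mask;      /* Mask: allowed cpus */
};

/* Estimate the long-run CPU share (in percent, capped at 100) that
 * a permanently busy context can sustain with these settings. */
static int sched_share_percent(const struct vcmd_set_sched_v2 *s)
{
    int64_t pct;

    if (s->fill_rate <= 0 || s->interval <= 0)
        return 0;
    pct = (int64_t)s->fill_rate * 100 / s->interval;
    return pct > 100 ? 100 : (int)pct;
}
```

For instance, fill_rate = 1 and interval = 4 give an estimate of 25 percent, while tokens_max bounds how long a context that has been idle can burst above that rate.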
Example Accounting: Sockets
Basically, all the accounting and limit functions are defined as macros or inline functions capable of handling the different resources, hiding the underlying implementation wherever possible.
    #define vx_acc_sock(v,f,p,s) \
        __vx_acc_sock((v), (f), (p), (s), __FILE__, __LINE__)

    static inline void __vx_acc_sock(struct vx_info *vxi,
        int family, int pos, int size, char *file, int line)
    {
        if (vxi) {
            int type = vx_sock_type(family);

            atomic_inc(&vxi->cacct.sock[type][pos].count);
            atomic_add(size, &vxi->cacct.sock[type][pos].total);
        }
    }

    #define vx_sock_recv(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 0, (s))
    #define vx_sock_send(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 1, (s))
    #define vx_sock_fail(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 2, (s))
And this general definition is then used where appropriate, for example in the __sock_sendmsg() function like this:
    len = sock->ops->sendmsg(iocb, sock, msg, size);
    if (sock->sk) {
        if (len == size)
            vx_sock_send(sock->sk, size);
        else
            vx_sock_fail(sock->sk, size);
    }
Example Limits: Virtual Memory
    #define vx_pages_avail(m, p, r) \
        __vx_pages_avail((m)->mm_vx_info, (r), (p), __FILE__, __LINE__)

    static inline int __vx_pages_avail(struct vx_info *vxi,
        int res, int pages, char *file, int line)
    {
        if (!vxi)
            return 1;
        if (vxi->limit.rlim[res] == RLIM_INFINITY)
            return 1;
        if (atomic_read(&vxi->limit.res[res]) + pages < vxi->limit.rlim[res])
            return 1;
        return 0;
    }

    #define vx_vmpages_avail(m,p)  vx_pages_avail(m, p, RLIMIT_AS)
    #define vx_vmlocked_avail(m,p) vx_pages_avail(m, p, RLIMIT_MEMLOCK)
    #define vx_rsspages_avail(m,p) vx_pages_avail(m, p, RLIMIT_RSS)
And again, the limits are tested at certain places, for example here in copy_process():
    /* check vserver memory */
    if (p->mm && !(clone_flags & CLONE_VM)) {
        if (vx_vmpages_avail(p->mm, p->mm->total_vm))
            vx_pages_add(p->mm->mm_vx_info, RLIMIT_AS, p->mm->total_vm);
        else
            goto bad_fork_free;
    }
Example Virtualization: Uptime
    void vx_vsi_uptime(struct timespec *uptime)
    {
        struct vx_info *vxi = current->vx_info;

        set_normalized_timespec(uptime,
            uptime->tv_sec - vxi->cvirt.bias_tp.tv_sec,
            uptime->tv_nsec - vxi->cvirt.bias_tp.tv_nsec);
        return;
    }

    /* at the call site */
    if (vx_flags(VXF_VIRT_UPTIME, 0))
        vx_vsi_uptime(&uptime);
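The arithmetic behind the virtualized uptime can be tried out in user space. The sketch below is an assumption-laden stand-in: normalize_timespec() imitates what the kernel's set_normalized_timespec() does, virt_uptime() is a hypothetical name, and the bias is assumed to be the offset recorded for the context (for example the host uptime at context start).

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Userspace stand-in for set_normalized_timespec(): store sec/nsec
 * in ts so that 0 <= tv_nsec < NSEC_PER_SEC holds afterwards. */
static void normalize_timespec(struct timespec *ts, time_t sec, long nsec)
{
    while (nsec >= NSEC_PER_SEC) {
        nsec -= NSEC_PER_SEC;
        sec++;
    }
    while (nsec < 0) {
        nsec += NSEC_PER_SEC;
        sec--;
    }
    ts->tv_sec = sec;
    ts->tv_nsec = nsec;
}

/* Shift the host uptime by the per-context bias, in place. */
static void virt_uptime(struct timespec *uptime, const struct timespec *bias)
{
    normalize_timespec(uptime,
                       uptime->tv_sec - bias->tv_sec,
                       uptime->tv_nsec - bias->tv_nsec);
}
```

Subtracting the two fields separately can leave a negative nanosecond value, which is why the result is normalized instead of being stored directly.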
Future Direction