summaryrefslogtreecommitdiffstats
path: root/man/man7/sched.7
diff options
context:
space:
mode:
Diffstat (limited to 'man/man7/sched.7')
-rw-r--r--man/man7/sched.7992
1 files changed, 992 insertions, 0 deletions
diff --git a/man/man7/sched.7 b/man/man7/sched.7
new file mode 100644
index 000000000..71f098e48
--- /dev/null
+++ b/man/man7/sched.7
@@ -0,0 +1,992 @@
+.\" Copyright (C) 2014 Michael Kerrisk <mtk.manpages@gmail.com>
+.\" and Copyright (C) 2014 Peter Zijlstra <peterz@infradead.org>
+.\" and Copyright (C) 2014 Juri Lelli <juri.lelli@gmail.com>
+.\" Various pieces from the old sched_setscheduler(2) page
+.\" Copyright (C) Tom Bjorkholm, Markus Kuhn & David A. Wheeler 1996-1999
+.\" and Copyright (C) 2007 Carsten Emde <Carsten.Emde@osadl.org>
+.\" and Copyright (C) 2008 Michael Kerrisk <mtk.manpages@gmail.com>
+.\"
+.\" SPDX-License-Identifier: GPL-2.0-or-later
+.\"
+.\" Worth looking at: http://rt.wiki.kernel.org/index.php
+.\"
+.TH sched 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+sched \- overview of CPU scheduling
+.SH DESCRIPTION
+Since Linux 2.6.23, the default scheduler is CFS,
+the "Completely Fair Scheduler".
+The CFS scheduler replaced the earlier "O(1)" scheduler.
+.\"
+.SS API summary
+Linux provides the following system calls for controlling
+the CPU scheduling behavior, policy, and priority of processes
+(or, more precisely, threads).
+.TP
+.BR nice (2)
+Set a new nice value for the calling thread,
+and return the new nice value.
+.TP
+.BR getpriority (2)
+Return the nice value of a thread, a process group,
+or the set of threads owned by a specified user.
+.TP
+.BR setpriority (2)
+Set the nice value of a thread, a process group,
+or the set of threads owned by a specified user.
+.TP
+.BR sched_setscheduler (2)
+Set the scheduling policy and parameters of a specified thread.
+.TP
+.BR sched_getscheduler (2)
+Return the scheduling policy of a specified thread.
+.TP
+.BR sched_setparam (2)
+Set the scheduling parameters of a specified thread.
+.TP
+.BR sched_getparam (2)
+Fetch the scheduling parameters of a specified thread.
+.TP
+.BR sched_get_priority_max (2)
+Return the maximum priority available in a specified scheduling policy.
+.TP
+.BR sched_get_priority_min (2)
+Return the minimum priority available in a specified scheduling policy.
+.TP
+.BR sched_rr_get_interval (2)
+Fetch the quantum used for threads that are scheduled under
+the "round-robin" scheduling policy.
+.TP
+.BR sched_yield (2)
+Cause the caller to relinquish the CPU,
+so that some other thread be executed.
+.TP
+.BR sched_setaffinity (2)
+(Linux-specific)
+Set the CPU affinity of a specified thread.
+.TP
+.BR sched_getaffinity (2)
+(Linux-specific)
+Get the CPU affinity of a specified thread.
+.TP
+.BR sched_setattr (2)
+Set the scheduling policy and parameters of a specified thread.
+This (Linux-specific) system call provides a superset of the functionality of
+.BR sched_setscheduler (2)
+and
+.BR sched_setparam (2).
+.TP
+.BR sched_getattr (2)
+Fetch the scheduling policy and parameters of a specified thread.
+This (Linux-specific) system call provides a superset of the functionality of
+.BR sched_getscheduler (2)
+and
+.BR sched_getparam (2).
+.\"
+.SS Scheduling policies
+The scheduler is the kernel component that decides which runnable thread
+will be executed by the CPU next.
+Each thread has an associated scheduling policy and a \fIstatic\fP
+scheduling priority,
+.IR sched_priority .
+The scheduler makes its decisions based on knowledge of the scheduling
+policy and static priority of all threads on the system.
+.P
+For threads scheduled under one of the normal scheduling policies
+(\fBSCHED_OTHER\fP, \fBSCHED_IDLE\fP, \fBSCHED_BATCH\fP),
+\fIsched_priority\fP is not used in scheduling
+decisions (it must be specified as 0).
+.P
+Processes scheduled under one of the real-time policies
+(\fBSCHED_FIFO\fP, \fBSCHED_RR\fP) have a
+\fIsched_priority\fP value in the range 1 (low) to 99 (high).
+(As the numbers imply, real-time threads always have higher priority
+than normal threads.)
+Note well: POSIX.1 requires an implementation to support only a
+minimum 32 distinct priority levels for the real-time policies,
+and some systems supply just this minimum.
+Portable programs should use
+.BR sched_get_priority_min (2)
+and
+.BR sched_get_priority_max (2)
+to find the range of priorities supported for a particular policy.
+.P
+Conceptually, the scheduler maintains a list of runnable
+threads for each possible \fIsched_priority\fP value.
+In order to determine which thread runs next, the scheduler looks for
+the nonempty list with the highest static priority and selects the
+thread at the head of this list.
+.P
+A thread's scheduling policy determines
+where it will be inserted into the list of threads
+with equal static priority and how it will move inside this list.
+.P
+All scheduling is preemptive: if a thread with a higher static
+priority becomes ready to run, the currently running thread
+will be preempted and
+returned to the wait list for its static priority level.
+The scheduling policy determines the
+ordering only within the list of runnable threads with equal static
+priority.
+.SS SCHED_FIFO: First in-first out scheduling
+\fBSCHED_FIFO\fP can be used only with static priorities higher than
+0, which means that when a \fBSCHED_FIFO\fP thread becomes runnable,
+it will always immediately preempt any currently running
+\fBSCHED_OTHER\fP, \fBSCHED_BATCH\fP, or \fBSCHED_IDLE\fP thread.
+\fBSCHED_FIFO\fP is a simple scheduling
+algorithm without time slicing.
+For threads scheduled under the
+\fBSCHED_FIFO\fP policy, the following rules apply:
+.IP \[bu] 3
+A running \fBSCHED_FIFO\fP thread that has been preempted by another thread of
+higher priority will stay at the head of the list for its priority and
+will resume execution as soon as all threads of higher priority are
+blocked again.
+.IP \[bu]
+When a blocked \fBSCHED_FIFO\fP thread becomes runnable, it
+will be inserted at the end of the list for its priority.
+.IP \[bu]
+If a call to
+.BR sched_setscheduler (2),
+.BR sched_setparam (2),
+.BR sched_setattr (2),
+.BR pthread_setschedparam (3),
+or
+.BR pthread_setschedprio (3)
+changes the priority of the running or runnable
+.B SCHED_FIFO
+thread identified by
+.I pid
+the effect on the thread's position in the list depends on
+the direction of the change to the thread's priority:
+.RS
+.IP (a) 5
+If the thread's priority is raised,
+it is placed at the end of the list for its new priority.
+As a consequence,
+it may preempt a currently running thread with the same priority.
+.IP (b)
+If the thread's priority is unchanged,
+its position in the run list is unchanged.
+.IP (c)
+If the thread's priority is lowered,
+it is placed at the front of the list for its new priority.
+.RE
+.IP
+According to POSIX.1-2008,
+changes to a thread's priority (or policy) using any mechanism other than
+.BR pthread_setschedprio (3)
+should result in the thread being placed at the end of
+the list for its priority.
+.\" In Linux 2.2.x and Linux 2.4.x, the thread is placed at the front of the queue
+.\" In Linux 2.0.x, the Right Thing happened: the thread went to the back -- MTK
+.IP \[bu]
+A thread calling
+.BR sched_yield (2)
+will be put at the end of the list.
+.P
+No other events will move a thread
+scheduled under the \fBSCHED_FIFO\fP policy in the wait list of
+runnable threads with equal static priority.
+.P
+A \fBSCHED_FIFO\fP
+thread runs until either it is blocked by an I/O request, it is
+preempted by a higher priority thread, or it calls
+.BR sched_yield (2).
+.SS SCHED_RR: Round-robin scheduling
+\fBSCHED_RR\fP is a simple enhancement of \fBSCHED_FIFO\fP.
+Everything
+described above for \fBSCHED_FIFO\fP also applies to \fBSCHED_RR\fP,
+except that each thread is allowed to run only for a maximum time
+quantum.
+If a \fBSCHED_RR\fP thread has been running for a time
+period equal to or longer than the time quantum, it will be put at the
+end of the list for its priority.
+A \fBSCHED_RR\fP thread that has
+been preempted by a higher priority thread and subsequently resumes
+execution as a running thread will complete the unexpired portion of
+its round-robin time quantum.
+The length of the time quantum can be
+retrieved using
+.BR sched_rr_get_interval (2).
+.\" On Linux 2.4, the length of the RR interval is influenced
+.\" by the process nice value -- MTK
+.\"
+.SS SCHED_DEADLINE: Sporadic task model deadline scheduling
+Since Linux 3.14, Linux provides a deadline scheduling policy
+.RB ( SCHED_DEADLINE ).
+This policy is currently implemented using
+GEDF (Global Earliest Deadline First)
+in conjunction with CBS (Constant Bandwidth Server).
+To set and fetch this policy and associated attributes,
+one must use the Linux-specific
+.BR sched_setattr (2)
+and
+.BR sched_getattr (2)
+system calls.
+.P
+A sporadic task is one that has a sequence of jobs, where each
+job is activated at most once per period.
+Each job also has a
+.IR "relative deadline" ,
+before which it should finish execution, and a
+.IR "computation time" ,
+which is the CPU time necessary for executing the job.
+The moment when a task wakes up
+because a new job has to be executed is called the
+.I arrival time
+(also referred to as the request time or release time).
+The
+.I start time
+is the time at which a task starts its execution.
+The
+.I absolute deadline
+is thus obtained by adding the relative deadline to the arrival time.
+.P
+The following diagram clarifies these terms:
+.P
+.in +4n
+.EX
+arrival/wakeup absolute deadline
+ | start time |
+ | | |
+ v v v
+-----x--------xooooooooooooooooo--------x--------x---
+ |<- comp. time ->|
+ |<------- relative deadline ------>|
+ |<-------------- period ------------------->|
+.EE
+.in
+.P
+When setting a
+.B SCHED_DEADLINE
+policy for a thread using
+.BR sched_setattr (2),
+one can specify three parameters:
+.IR Runtime ,
+.IR Deadline ,
+and
+.IR Period .
+These parameters do not necessarily correspond to the aforementioned terms:
+usual practice is to set Runtime to something bigger than the average
+computation time (or worst-case execution time for hard real-time tasks),
+Deadline to the relative deadline, and Period to the period of the task.
+Thus, for
+.B SCHED_DEADLINE
+scheduling, we have:
+.P
+.in +4n
+.EX
+arrival/wakeup absolute deadline
+ | start time |
+ | | |
+ v v v
+-----x--------xooooooooooooooooo--------x--------x---
+ |<-- Runtime ------->|
+ |<----------- Deadline ----------->|
+ |<-------------- Period ------------------->|
+.EE
+.in
+.P
+The three deadline-scheduling parameters correspond to the
+.IR sched_runtime ,
+.IR sched_deadline ,
+and
+.I sched_period
+fields of the
+.I sched_attr
+structure; see
+.BR sched_setattr (2).
+These fields express values in nanoseconds.
+.\" FIXME It looks as though specifying sched_period as 0 means
+.\" "make sched_period the same as sched_deadline".
+.\" This needs to be documented.
+If
+.I sched_period
+is specified as 0, then it is made the same as
+.IR sched_deadline .
+.P
+The kernel requires that:
+.P
+.in +4n
+.EX
+sched_runtime <= sched_deadline <= sched_period
+.EE
+.in
+.P
+.\" See __checkparam_dl in kernel/sched/core.c
+In addition, under the current implementation,
+all of the parameter values must be at least 1024
+(i.e., just over one microsecond,
+which is the resolution of the implementation), and less than 2\[ha]63.
+If any of these checks fails,
+.BR sched_setattr (2)
+fails with the error
+.BR EINVAL .
+.P
+The CBS guarantees non-interference between tasks, by throttling
+threads that attempt to over-run their specified Runtime.
+.P
+To ensure deadline scheduling guarantees,
+the kernel must prevent situations where the set of
+.B SCHED_DEADLINE
+threads is not feasible (schedulable) within the given constraints.
+The kernel thus performs an admittance test when setting or changing
+.B SCHED_DEADLINE
+policy and attributes.
+This admission test calculates whether the change is feasible;
+if it is not,
+.BR sched_setattr (2)
+fails with the error
+.BR EBUSY .
+.P
+For example, it is required (but not necessarily sufficient) for
+the total utilization to be less than or equal to the total number of
+CPUs available, where, since each thread can maximally run for
+Runtime per Period, that thread's utilization is its
+Runtime divided by its Period.
+.P
+In order to fulfill the guarantees that are made when
+a thread is admitted to the
+.B SCHED_DEADLINE
+policy,
+.B SCHED_DEADLINE
+threads are the highest priority (user controllable) threads in the
+system; if any
+.B SCHED_DEADLINE
+thread is runnable,
+it will preempt any thread scheduled under one of the other policies.
+.P
+A call to
+.BR fork (2)
+by a thread scheduled under the
+.B SCHED_DEADLINE
+policy fails with the error
+.BR EAGAIN ,
+unless the thread has its reset-on-fork flag set (see below).
+.P
+A
+.B SCHED_DEADLINE
+thread that calls
+.BR sched_yield (2)
+will yield the current job and wait for a new period to begin.
+.\"
+.\" FIXME Calling sched_getparam() on a SCHED_DEADLINE thread
+.\" fails with EINVAL, but sched_getscheduler() succeeds.
+.\" Is that intended? (Why?)
+.\"
+.SS SCHED_OTHER: Default Linux time-sharing scheduling
+\fBSCHED_OTHER\fP can be used at only static priority 0
+(i.e., threads under real-time policies always have priority over
+.B SCHED_OTHER
+processes).
+\fBSCHED_OTHER\fP is the standard Linux time-sharing scheduler that is
+intended for all threads that do not require the special
+real-time mechanisms.
+.P
+The thread to run is chosen from the static
+priority 0 list based on a \fIdynamic\fP priority that is determined only
+inside this list.
+The dynamic priority is based on the nice value (see below)
+and is increased for each time quantum the thread is ready to run,
+but denied to run by the scheduler.
+This ensures fair progress among all \fBSCHED_OTHER\fP threads.
+.P
+In the Linux kernel source code, the
+.B SCHED_OTHER
+policy is actually named
+.BR SCHED_NORMAL .
+.\"
+.SS The nice value
+The nice value is an attribute
+that can be used to influence the CPU scheduler to
+favor or disfavor a process in scheduling decisions.
+It affects the scheduling of
+.B SCHED_OTHER
+and
+.B SCHED_BATCH
+(see below) processes.
+The nice value can be modified using
+.BR nice (2),
+.BR setpriority (2),
+or
+.BR sched_setattr (2).
+.P
+According to POSIX.1, the nice value is a per-process attribute;
+that is, the threads in a process should share a nice value.
+However, on Linux, the nice value is a per-thread attribute:
+different threads in the same process may have different nice values.
+.P
+The range of the nice value
+varies across UNIX systems.
+On modern Linux, the range is \-20 (high priority) to +19 (low priority).
+On some other systems, the range is \-20..20.
+Very early Linux kernels (before Linux 2.0) had the range \-infinity..15.
+.\" Linux before 1.3.36 had \-infinity..15.
+.\" Since Linux 1.3.43, Linux has the range \-20..19.
+.P
+The degree to which the nice value affects the relative scheduling of
+.B SCHED_OTHER
+processes likewise varies across UNIX systems and
+across Linux kernel versions.
+.P
+With the advent of the CFS scheduler in Linux 2.6.23,
+Linux adopted an algorithm that causes
+relative differences in nice values to have a much stronger effect.
+In the current implementation, each unit of difference in the
+nice values of two processes results in a factor of 1.25
+in the degree to which the scheduler favors the higher priority process.
+This causes very low nice values (+19) to truly provide little CPU
+to a process whenever there is any other
+higher priority load on the system,
+and makes high nice values (\-20) deliver most of the CPU to applications
+that require it (e.g., some audio applications).
+.P
+On Linux, the
+.B RLIMIT_NICE
+resource limit can be used to define a limit to which
+an unprivileged process's nice value can be raised; see
+.BR setrlimit (2)
+for details.
+.P
+For further details on the nice value, see the subsections on
+the autogroup feature and group scheduling, below.
+.\"
+.SS SCHED_BATCH: Scheduling batch processes
+(Since Linux 2.6.16.)
+\fBSCHED_BATCH\fP can be used only at static priority 0.
+This policy is similar to \fBSCHED_OTHER\fP in that it schedules
+the thread according to its dynamic priority
+(based on the nice value).
+The difference is that this policy
+will cause the scheduler to always assume
+that the thread is CPU-intensive.
+Consequently, the scheduler will apply a small scheduling
+penalty with respect to wakeup behavior,
+so that this thread is mildly disfavored in scheduling decisions.
+.P
+.\" The following paragraph is drawn largely from the text that
+.\" accompanied Ingo Molnar's patch for the implementation of
+.\" SCHED_BATCH.
+.\" commit b0a9499c3dd50d333e2aedb7e894873c58da3785
+This policy is useful for workloads that are noninteractive,
+but do not want to lower their nice value,
+and for workloads that want a deterministic scheduling policy without
+interactivity causing extra preemptions (between the workload's tasks).
+.\"
+.SS SCHED_IDLE: Scheduling very low priority jobs
+(Since Linux 2.6.23.)
+\fBSCHED_IDLE\fP can be used only at static priority 0;
+the process nice value has no influence for this policy.
+.P
+This policy is intended for running jobs at extremely low
+priority (lower even than a +19 nice value with the
+.B SCHED_OTHER
+or
+.B SCHED_BATCH
+policies).
+.\"
+.SS Resetting scheduling policy for child processes
+Each thread has a reset-on-fork scheduling flag.
+When this flag is set, children created by
+.BR fork (2)
+do not inherit privileged scheduling policies.
+The reset-on-fork flag can be set by either:
+.IP \[bu] 3
+ORing the
+.B SCHED_RESET_ON_FORK
+flag into the
+.I policy
+argument when calling
+.BR sched_setscheduler (2)
+(since Linux 2.6.32);
+or
+.IP \[bu]
+specifying the
+.B SCHED_FLAG_RESET_ON_FORK
+flag in
+.I attr.sched_flags
+when calling
+.BR sched_setattr (2).
+.P
+Note that the constants used with these two APIs have different names.
+The state of the reset-on-fork flag can analogously be retrieved using
+.BR sched_getscheduler (2)
+and
+.BR sched_getattr (2).
+.P
+The reset-on-fork feature is intended for media-playback applications,
+and can be used to prevent applications evading the
+.B RLIMIT_RTTIME
+resource limit (see
+.BR getrlimit (2))
+by creating multiple child processes.
+.P
+More precisely, if the reset-on-fork flag is set,
+the following rules apply for subsequently created children:
+.IP \[bu] 3
+If the calling thread has a scheduling policy of
+.B SCHED_FIFO
+or
+.BR SCHED_RR ,
+the policy is reset to
+.B SCHED_OTHER
+in child processes.
+.IP \[bu]
+If the calling process has a negative nice value,
+the nice value is reset to zero in child processes.
+.P
+After the reset-on-fork flag has been enabled,
+it can be reset only if the thread has the
+.B CAP_SYS_NICE
+capability.
+This flag is disabled in child processes created by
+.BR fork (2).
+.\"
+.SS Privileges and resource limits
+Before Linux 2.6.12, only privileged
+.RB ( CAP_SYS_NICE )
+threads can set a nonzero static priority (i.e., set a real-time
+scheduling policy).
+The only change that an unprivileged thread can make is to set the
+.B SCHED_OTHER
+policy, and this can be done only if the effective user ID of the caller
+matches the real or effective user ID of the target thread
+(i.e., the thread specified by
+.IR pid )
+whose policy is being changed.
+.P
+A thread must be privileged
+.RB ( CAP_SYS_NICE )
+in order to set or modify a
+.B SCHED_DEADLINE
+policy.
+.P
+Since Linux 2.6.12, the
+.B RLIMIT_RTPRIO
+resource limit defines a ceiling on an unprivileged thread's
+static priority for the
+.B SCHED_RR
+and
+.B SCHED_FIFO
+policies.
+The rules for changing scheduling policy and priority are as follows:
+.IP \[bu] 3
+If an unprivileged thread has a nonzero
+.B RLIMIT_RTPRIO
+soft limit, then it can change its scheduling policy and priority,
+subject to the restriction that the priority cannot be set to a
+value higher than the maximum of its current priority and its
+.B RLIMIT_RTPRIO
+soft limit.
+.IP \[bu]
+If the
+.B RLIMIT_RTPRIO
+soft limit is 0, then the only permitted changes are to lower the priority,
+or to switch to a non-real-time policy.
+.IP \[bu]
+Subject to the same rules,
+another unprivileged thread can also make these changes,
+as long as the effective user ID of the thread making the change
+matches the real or effective user ID of the target thread.
+.IP \[bu]
+Special rules apply for the
+.B SCHED_IDLE
+policy.
+Before Linux 2.6.39,
+an unprivileged thread operating under this policy cannot
+change its policy, regardless of the value of its
+.B RLIMIT_RTPRIO
+resource limit.
+Since Linux 2.6.39,
+.\" commit c02aa73b1d18e43cfd79c2f193b225e84ca497c8
+an unprivileged thread can switch to either the
+.B SCHED_BATCH
+or the
+.B SCHED_OTHER
+policy so long as its nice value falls within the range permitted by its
+.B RLIMIT_NICE
+resource limit (see
+.BR getrlimit (2)).
+.P
+Privileged
+.RB ( CAP_SYS_NICE )
+threads ignore the
+.B RLIMIT_RTPRIO
+limit; as with older kernels,
+they can make arbitrary changes to scheduling policy and priority.
+See
+.BR getrlimit (2)
+for further information on
+.BR RLIMIT_RTPRIO .
+.SS Limiting the CPU usage of real-time and deadline processes
+A nonblocking infinite loop in a thread scheduled under the
+.BR SCHED_FIFO ,
+.BR SCHED_RR ,
+or
+.B SCHED_DEADLINE
+policy can potentially block all other threads from accessing
+the CPU forever.
+Before Linux 2.6.25, the only way of preventing a runaway real-time
+process from freezing the system was to run (at the console)
+a shell scheduled under a higher static priority than the tested application.
+This allows an emergency kill of tested
+real-time applications that do not block or terminate as expected.
+.P
+Since Linux 2.6.25, there are other techniques for dealing with runaway
+real-time and deadline processes.
+One of these is to use the
+.B RLIMIT_RTTIME
+resource limit to set a ceiling on the CPU time that
+a real-time process may consume.
+See
+.BR getrlimit (2)
+for details.
+.P
+Since Linux 2.6.25, Linux also provides two
+.I /proc
+files that can be used to reserve a certain amount of CPU time
+to be used by non-real-time processes.
+Reserving CPU time in this fashion allows some CPU time to be
+allocated to (say) a root shell that can be used to kill a runaway process.
+Both of these files specify time values in microseconds:
+.TP
+.I /proc/sys/kernel/sched_rt_period_us
+This file specifies a scheduling period that is equivalent to
+100% CPU bandwidth.
+The value in this file can range from 1 to
+.BR INT_MAX ,
+giving an operating range of 1 microsecond to around 35 minutes.
+The default value in this file is 1,000,000 (1 second).
+.TP
+.I /proc/sys/kernel/sched_rt_runtime_us
+The value in this file specifies how much of the "period" time
+can be used by all real-time and deadline scheduled processes
+on the system.
+The value in this file can range from \-1 to
+.BR INT_MAX \-1.
+Specifying \-1 makes the run time the same as the period;
+that is, no CPU time is set aside for non-real-time processes
+(which was the behavior before Linux 2.6.25).
+The default value in this file is 950,000 (0.95 seconds),
+meaning that 5% of the CPU time is reserved for processes that
+don't run under a real-time or deadline scheduling policy.
+.SS Response time
+A blocked high priority thread waiting for I/O has a certain
+response time before it is scheduled again.
+The device driver writer
+can greatly reduce this response time by using a "slow interrupt"
+interrupt handler.
+.\" as described in
+.\" .BR request_irq (9).
+.SS Miscellaneous
+Child processes inherit the scheduling policy and parameters across a
+.BR fork (2).
+The scheduling policy and parameters are preserved across
+.BR execve (2).
+.P
+Memory locking is usually needed for real-time processes to avoid
+paging delays; this can be done with
+.BR mlock (2)
+or
+.BR mlockall (2).
+.\"
+.SS The autogroup feature
+.\" commit 5091faa449ee0b7d73bc296a93bca9540fc51d0a
+Since Linux 2.6.38,
+the kernel provides a feature known as autogrouping to improve interactive
+desktop performance in the face of multiprocess, CPU-intensive
+workloads such as building the Linux kernel with large numbers of
+parallel build processes (i.e., the
+.BR make (1)
+.B \-j
+flag).
+.P
+This feature operates in conjunction with the
+CFS scheduler and requires a kernel that is configured with
+.BR CONFIG_SCHED_AUTOGROUP .
+On a running system, this feature is enabled or disabled via the file
+.IR /proc/sys/kernel/sched_autogroup_enabled ;
+a value of 0 disables the feature, while a value of 1 enables it.
+The default value in this file is 1, unless the kernel was booted with the
+.I noautogroup
+parameter.
+.P
+A new autogroup is created when a new session is created via
+.BR setsid (2);
+this happens, for example, when a new terminal window is started.
+A new process created by
+.BR fork (2)
+inherits its parent's autogroup membership.
+Thus, all of the processes in a session are members of the same autogroup.
+An autogroup is automatically destroyed when the last process
+in the group terminates.
+.P
+When autogrouping is enabled, all of the members of an autogroup
+are placed in the same kernel scheduler "task group".
+The CFS scheduler employs an algorithm that equalizes the
+distribution of CPU cycles across task groups.
+The benefits of this for interactive desktop performance
+can be described via the following example.
+.P
+Suppose that there are two autogroups competing for the same CPU
+(i.e., presume either a single CPU system or the use of
+.BR taskset (1)
+to confine all the processes to the same CPU on an SMP system).
+The first group contains ten CPU-bound processes from
+a kernel build started with
+.IR "make\~\-j10" .
+The other contains a single CPU-bound process: a video player.
+The effect of autogrouping is that the two groups will
+each receive half of the CPU cycles.
+That is, the video player will receive 50% of the CPU cycles,
+rather than just 9% of the cycles,
+which would likely lead to degraded video playback.
+The situation on an SMP system is more complex,
+.\" Mike Galbraith, 25 Nov 2016:
+.\" I'd say something more wishy-washy here, like cycles are
+.\" distributed fairly across groups and leave it at that, as your
+.\" detailed example is incorrect due to SMP fairness (which I don't
+.\" like much because [very unlikely] worst case scenario
+.\" renders a box sized group incapable of utilizing more that
+.\" a single CPU total). For example, if a group of NR_CPUS
+.\" size competes with a singleton, load balancing will try to give
+.\" the singleton a full CPU of its very own. If groups intersect for
+.\" whatever reason on say my quad lappy, distribution is 80/20 in
+.\" favor of the singleton.
+but the general effect is the same:
+the scheduler distributes CPU cycles across task groups such that
+an autogroup that contains a large number of CPU-bound processes
+does not end up hogging CPU cycles at the expense of the other
+jobs on the system.
+.P
+A process's autogroup (task group) membership can be viewed via the file
+.IR /proc/ pid /autogroup :
+.P
+.in +4n
+.EX
+$ \fBcat /proc/1/autogroup\fP
+/autogroup\-1 nice 0
+.EE
+.in
+.P
+This file can also be used to modify the CPU bandwidth allocated
+to an autogroup.
+This is done by writing a number in the "nice" range to the file
+to set the autogroup's nice value.
+The allowed range is from +19 (low priority) to \-20 (high priority).
+(Writing values outside of this range causes
+.BR write (2)
+to fail with the error
+.BR EINVAL .)
+.\" FIXME .
+.\" Because of a bug introduced in Linux 4.7
+.\" (commit 2159197d66770ec01f75c93fb11dc66df81fd45b made changes
+.\" that exposed the fact that autogroup didn't call scale_load()),
+.\" it happened that *all* values in this range caused a task group
+.\" to be further disfavored by the scheduler, with \-20 resulting
+.\" in the scheduler mildly disfavoring the task group and +19 greatly
+.\" disfavoring it.
+.\"
+.\" A patch was posted on 23 Nov 2016
+.\" ("sched/autogroup: Fix 64bit kernel nice adjustment";
+.\" check later to see in which kernel version it lands.
+.P
+The autogroup nice setting has the same meaning as the process nice value,
+but applies to distribution of CPU cycles to the autogroup as a whole,
+based on the relative nice values of other autogroups.
+For a process inside an autogroup, the CPU cycles that it receives
+will be a product of the autogroup's nice value
+(compared to other autogroups)
+and the process's nice value
+(compared to other processes in the same autogroup.
+.P
+The use of the
+.BR cgroups (7)
+CPU controller to place processes in cgroups other than the
+root CPU cgroup overrides the effect of autogrouping.
+.P
+The autogroup feature groups only processes scheduled under
+non-real-time policies
+.RB ( SCHED_OTHER ,
+.BR SCHED_BATCH ,
+and
+.BR SCHED_IDLE ).
+It does not group processes scheduled under real-time and
+deadline policies.
+Those processes are scheduled according to the rules described earlier.
+.\"
+.SS The nice value and group scheduling
+When scheduling non-real-time processes (i.e., those scheduled under the
+.BR SCHED_OTHER ,
+.BR SCHED_BATCH ,
+and
+.B SCHED_IDLE
+policies), the CFS scheduler employs a technique known as "group scheduling",
+if the kernel was configured with the
+.B CONFIG_FAIR_GROUP_SCHED
+option (which is typical).
+.P
+Under group scheduling, threads are scheduled in "task groups".
+Task groups have a hierarchical relationship,
+rooted under the initial task group on the system,
+known as the "root task group".
+Task groups are formed in the following circumstances:
+.IP \[bu] 3
+All of the threads in a CPU cgroup form a task group.
+The parent of this task group is the task group of the
+corresponding parent cgroup.
+.IP \[bu]
+If autogrouping is enabled,
+then all of the threads that are (implicitly) placed in an autogroup
+(i.e., the same session, as created by
+.BR setsid (2))
+form a task group.
+Each new autogroup is thus a separate task group.
+The root task group is the parent of all such autogroups.
+.IP \[bu]
+If autogrouping is enabled, then the root task group consists of
+all processes in the root CPU cgroup that were not
+otherwise implicitly placed into a new autogroup.
+.IP \[bu]
+If autogrouping is disabled, then the root task group consists of
+all processes in the root CPU cgroup.
+.IP \[bu]
+If group scheduling was disabled (i.e., the kernel was configured without
+.BR CONFIG_FAIR_GROUP_SCHED ),
+then all of the processes on the system are notionally placed
+in a single task group.
+.P
+Under group scheduling,
+a thread's nice value has an effect for scheduling decisions
+.IR "only relative to other threads in the same task group" .
+This has some surprising consequences in terms of the traditional semantics
+of the nice value on UNIX systems.
+In particular, if autogrouping
+is enabled (which is the default in various distributions), then employing
+.BR setpriority (2)
+or
+.BR nice (1)
+on a process has an effect only for scheduling relative
+to other processes executed in the same session
+(typically: the same terminal window).
+.P
+Conversely, for two processes that are (for example)
+the sole CPU-bound processes in different sessions
+(e.g., different terminal windows,
+each of whose jobs are tied to different autogroups),
+.I modifying the nice value of the process in one of the sessions
+.I has no effect
+in terms of the scheduler's decisions relative to the
+process in the other session.
+.\" More succinctly: the nice(1) command is in many cases a no-op since
+.\" Linux 2.6.38.
+.\"
+A possibly useful workaround here is to use a command such as
+the following to modify the autogroup nice value for
+.I all
+of the processes in a terminal session:
+.P
+.in +4n
+.EX
+$ \fBecho 10 > /proc/self/autogroup\fP
+.EE
+.in
+.SS Real-time features in the mainline Linux kernel
+.\" FIXME . Probably this text will need some minor tweaking
+.\" ask Carsten Emde about this.
+Since Linux 2.6.18, Linux is gradually
+becoming equipped with real-time capabilities,
+most of which are derived from the former
+.I realtime\-preempt
+patch set.
+Until the patches have been completely merged into the
+mainline kernel,
+they must be installed to achieve the best real-time performance.
+These patches are named:
+.P
+.in +4n
+.EX
+patch\-\fIkernelversion\fP\-rt\fIpatchversion\fP
+.EE
+.in
+.P
+and can be downloaded from
+.UR http://www.kernel.org\:/pub\:/linux\:/kernel\:/projects\:/rt/
+.UE .
+.P
+Without the patches and prior to their full inclusion into the mainline
+kernel, the kernel configuration offers only the three preemption classes
+.BR CONFIG_PREEMPT_NONE ,
+.BR CONFIG_PREEMPT_VOLUNTARY ,
+and
+.B CONFIG_PREEMPT_DESKTOP
+which respectively provide no, some, and considerable
+reduction of the worst-case scheduling latency.
+.P
+With the patches applied or after their full inclusion into the mainline
+kernel, the additional configuration item
+.B CONFIG_PREEMPT_RT
+becomes available.
+If this is selected, Linux is transformed into a regular
+real-time operating system.
+The FIFO and RR scheduling policies are then used to run a thread
+with true real-time priority and a minimum worst-case scheduling latency.
+.SH NOTES
+The
+.BR cgroups (7)
+CPU controller can be used to limit the CPU consumption of
+groups of processes.
+.P
+Originally, Standard Linux was intended as a general-purpose operating
+system being able to handle background processes, interactive
+applications, and less demanding real-time applications (applications that
+need to usually meet timing deadlines).
+Although the Linux 2.6
+allowed for kernel preemption and the newly introduced O(1) scheduler
+ensures that the time needed to schedule is fixed and deterministic
+irrespective of the number of active tasks, true real-time computing
+was not possible up to Linux 2.6.17.
+.SH SEE ALSO
+.ad l
+.nh
+.BR chcpu (1),
+.BR chrt (1),
+.BR lscpu (1),
+.BR ps (1),
+.BR taskset (1),
+.BR top (1),
+.BR getpriority (2),
+.BR mlock (2),
+.BR mlockall (2),
+.BR munlock (2),
+.BR munlockall (2),
+.BR nice (2),
+.BR sched_get_priority_max (2),
+.BR sched_get_priority_min (2),
+.BR sched_getaffinity (2),
+.BR sched_getparam (2),
+.BR sched_getscheduler (2),
+.BR sched_rr_get_interval (2),
+.BR sched_setaffinity (2),
+.BR sched_setparam (2),
+.BR sched_setscheduler (2),
+.BR sched_yield (2),
+.BR setpriority (2),
+.BR pthread_getaffinity_np (3),
+.BR pthread_getschedparam (3),
+.BR pthread_setaffinity_np (3),
+.BR sched_getcpu (3),
+.BR capabilities (7),
+.BR cpuset (7)
+.ad
+.P
+.I Programming for the real world \- POSIX.4
+by Bill O.\& Gallmeister, O'Reilly & Associates, Inc., ISBN 1-56592-074-0.
+.P
+The Linux kernel source files
+.IR \%Documentation/\:scheduler/\:sched\-deadline\:.txt ,
+.IR \%Documentation/\:scheduler/\:sched\-rt\-group\:.txt ,
+.IR \%Documentation/\:scheduler/\:sched\-design\-CFS\:.txt ,
+and
+.I \%Documentation/\:scheduler/\:sched\-nice\-design\:.txt