141 lines
5.6 KiB
ReStructuredText
141 lines
5.6 KiB
ReStructuredText
=====================
|
|
Restartable Sequences
|
|
=====================
|
|
|
|
Restartable Sequences allow to register a per thread userspace memory area
|
|
to be used as an ABI between kernel and userspace for three purposes:
|
|
|
|
* userspace restartable sequences
|
|
|
|
* quick access to read the current CPU number, node ID from userspace
|
|
|
|
* scheduler time slice extensions
|
|
|
|
Restartable sequences (per-cpu atomics)
|
|
---------------------------------------
|
|
|
|
Restartable sequences allow userspace to perform update operations on
|
|
per-cpu data without requiring heavyweight atomic operations. The actual
|
|
ABI is unfortunately only available in the code and selftests.
|
|
|
|
Quick access to CPU number, node ID
|
|
-----------------------------------
|
|
|
|
Allows to implement per CPU data efficiently. Documentation is in code and
|
|
selftests. :(
|
|
|
|
Scheduler time slice extensions
|
|
-------------------------------
|
|
|
|
This allows a thread to request a time slice extension when it enters a
|
|
critical section to avoid contention on a resource when the thread is
|
|
scheduled out inside of the critical section.
|
|
|
|
The prerequisites for this functionality are:
|
|
|
|
* Enabled in Kconfig
|
|
|
|
* Enabled at boot time (default is enabled)
|
|
|
|
* A rseq userspace pointer has been registered for the thread
|
|
|
|
The thread has to enable the functionality via prctl(2)::
|
|
|
|
prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
|
|
PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
|
|
|
|
prctl() returns 0 on success or otherwise with the following error codes:
|
|
|
|
========= ==============================================================
|
|
Errorcode Meaning
|
|
========= ==============================================================
|
|
EINVAL Functionality not available or invalid function arguments.
|
|
Note: arg4 and arg5 must be zero
|
|
ENOTSUPP Functionality was disabled on the kernel command line
|
|
ENXIO Available, but no rseq user struct registered
|
|
========= ==============================================================
|
|
|
|
The state can be also queried via prctl(2)::
|
|
|
|
prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
|
|
|
|
prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
|
|
disabled. Otherwise it returns with the following error codes:
|
|
|
|
========= ==============================================================
|
|
Errorcode Meaning
|
|
========= ==============================================================
|
|
EINVAL Functionality not available or invalid function arguments.
|
|
Note: arg3 and arg4 and arg5 must be zero
|
|
========= ==============================================================
|
|
|
|
The availability and status is also exposed via the rseq ABI struct flags
|
|
field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
|
|
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
|
|
space and only for informational purposes.
|
|
|
|
If the mechanism was enabled via prctl(), the thread can request a time
|
|
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
|
|
interrupted and the interrupt results in a reschedule request in the
|
|
kernel, then the kernel can grant a time slice extension and return to
|
|
userspace instead of scheduling out. The length of the extension is
|
|
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec; which
|
|
is the minimum value. It can be incremented to 50 usecs, however doing so
|
|
can/will affect the minimum scheduling latency.
|
|
|
|
Any proposed changes to this default will have to come with a selftest and
|
|
rseq-slice-hist.py output that shows the new value has merrit.
|
|
|
|
The kernel indicates the grant by clearing rseq::slice_ctrl::request and
|
|
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
|
|
thread after granting the extension, the kernel clears the granted bit to
|
|
indicate that to userspace.
|
|
|
|
If the request bit is still set when the leaving the critical section,
|
|
userspace can clear it and continue.
|
|
|
|
If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
|
|
leaving the critical section to relinquish the CPU. The kernel enforces
|
|
this by arming a timer to prevent misbehaving userspace from abusing this
|
|
mechanism.
|
|
|
|
If both the request bit and the granted bit are false when leaving the
|
|
critical section, then this indicates that a grant was revoked and no
|
|
further action is required by userspace.
|
|
|
|
The required code flow is as follows::
|
|
|
|
rseq->slice_ctrl.request = 1;
|
|
barrier(); // Prevent compiler reordering
|
|
critical_section();
|
|
barrier(); // Prevent compiler reordering
|
|
rseq->slice_ctrl.request = 0;
|
|
if (rseq->slice_ctrl.granted)
|
|
rseq_slice_yield();
|
|
|
|
As all of this is strictly CPU local, there are no atomicity requirements.
|
|
Checking the granted state is racy, but that cannot be avoided at all::
|
|
|
|
if (rseq->slice_ctrl.granted)
|
|
-> Interrupt results in schedule and grant revocation
|
|
rseq_slice_yield();
|
|
|
|
So there is no point in pretending that this might be solved by an atomic
|
|
operation.
|
|
|
|
If the thread issues a syscall other than rseq_slice_yield(2) within the
|
|
granted timeslice extension, the grant is also revoked and the CPU is
|
|
relinquished immediately when entering the kernel. This is required as
|
|
syscalls might consume arbitrary CPU time until they reach a scheduling
|
|
point when the preemption model is either NONE or VOLUNTARY and therefore
|
|
might exceed the grant by far.
|
|
|
|
The preferred solution for user space is to use rseq_slice_yield(2) which
|
|
is side effect free. The support for arbitrary syscalls is required to
|
|
support onion layer architectured applications, where the code handling the
|
|
critical section and requesting the time slice extension has no control
|
|
over the code within the critical section.
|
|
|
|
The kernel enforces flag consistency and terminates the thread with SIGSEGV
|
|
if it detects a violation.
|