Missing Lower-Bound Validation in sched_fair_server_write Enables CFS Starvation and Kernel Crash
File: kernel/sched/debug.c
Function: sched_fair_server_write() (line 342)
Severity: High — Local Denial of Service / Kernel Panic
Type: Missing Input Validation (CWE-20) + Logic Error
Attack surface: /sys/kernel/debug/sched/fair_server/cpu<N>/runtime
Vulnerability Description
sched_fair_server_write() allows a caller to set the dl_runtime of a CPU’s CFS
fair-server via debugfs. The only validation applied to the user-supplied runtime value
is an upper-bound check (runtime > period). There is no lower bound: a value
of 0 passes every guard and is committed to the scheduler.
The kernel itself acknowledges the danger (lines 384–386):
    if (!runtime)
        printk_deferred(
            "Fair server disabled in CPU %d, system may crash due to starvation.\n",
            cpu_of(rq));
Once dl_runtime = 0 is applied, the CFS fair-server for that CPU is effectively
disabled. Deadline or real-time tasks that never yield can then monopolise the CPU
indefinitely. All CFS tasks queued on that CPU, which may include PID 1, system daemons,
and user sessions, are starved and the system can hang. The soft-lockup or NMI watchdog
may then fire and, where the kernel is configured to panic on lockup, trigger a panic.
Root Cause
    // kernel/sched/debug.c line 373-377
    if (runtime > period ||
        period > fair_server_period_max ||
        period < fair_server_period_min) {
        return -EINVAL;
    }
runtime = 0 passes this guard because all three conditions evaluate to false:
- 0 > period is false (period is a positive nanosecond value).
- The two period range checks say nothing about runtime.
The call then proceeds to:
    update_rq_clock(rq);
    dl_server_stop(&rq->fair_server);       // server halted
    retval = dl_server_apply_params(        // runtime=0 committed
            &rq->fair_server, runtime, period, 0);
    if (rq->cfs.h_nr_queued)
        dl_server_start(&rq->fair_server);  // server restarted with zero budget
The fair server is restarted with a zero-nanosecond budget, making it useless.
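The guard quoted above can be modelled in userspace to confirm that zero slips through. The sketch below is a standalone approximation, not kernel code: the function name current_check_rejects is hypothetical, and the two period bounds are assumed values (100 us and roughly 4 s) that should be verified against the tree being audited.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Assumed bounds for illustration; confirm against kernel/sched/debug.c. */
#define FAIR_SERVER_PERIOD_MAX ((1ULL << 22) * NSEC_PER_USEC) /* ~4 s   */
#define FAIR_SERVER_PERIOD_MIN (100ULL * NSEC_PER_USEC)       /* 100 us */

/*
 * Hypothetical userspace model of the existing guard.
 * Returns true when the pair would be rejected with -EINVAL.
 */
static bool current_check_rejects(uint64_t runtime, uint64_t period)
{
    return runtime > period ||
           period > FAIR_SERVER_PERIOD_MAX ||
           period < FAIR_SERVER_PERIOD_MIN;
}
```

Calling current_check_rejects(0, 1000000000ULL) returns false under these assumptions: a zero runtime paired with a 1 s period is accepted, which matches the behaviour described in this section.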
Secondary Issue: Early return Inside scoped_guard(rq_lock_irqsave, rq)
Lines 376 and 391–392 contain return statements while the runqueue spinlock is held
and hardware interrupts are disabled:
    scoped_guard (rq_lock_irqsave, rq) {
        ...
        if (runtime > period || ...)
            return -EINVAL;   // line 376: early return with IRQs disabled
        ...
        if (retval < 0)
            return retval;    // line 391: early return with IRQs disabled
    }
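With a compiler that supports the cleanup attribute (GCC and Clang both do), the guard
releases the lock and restores the IRQ state on scope exit, so neither return leaks the
lock. If cleanup support were absent or the CLASS() implementation were defective, the
rq spinlock would never be released and IRQs would stay disabled on that CPU
permanently, an unrecoverable hard lockup. Note that the runtime = 0 path itself takes
neither early return: it passes the check at line 376, and the second return is reached
only if dl_server_apply_params() fails. The secondary issue is therefore a hazard for
rejected writes on a broken toolchain rather than a step in the zero-runtime crash.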
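The reason an early return inside a guarded scope is normally safe can be shown with the compiler's cleanup attribute directly. The following is a minimal userspace sketch, assuming GCC or Clang; fake_rq_lock, fake_unlock, and guarded_write are hypothetical stand-ins and do not reproduce the kernel's scoped_guard() or CLASS() macros.

```c
/* Hypothetical stand-in for a runqueue lock; not a kernel type. */
struct fake_rq_lock {
    int held;      /* 1 while "locked" */
    int releases;  /* number of times the cleanup handler ran */
};

/* Cleanup handlers receive a pointer to the annotated variable. */
static void fake_unlock(struct fake_rq_lock **guard)
{
    (*guard)->held = 0;
    (*guard)->releases++;
}

/* Takes the fake lock, then optionally bails out early with an error. */
static int guarded_write(struct fake_rq_lock *lock, int reject)
{
    /* fake_unlock runs automatically whenever 'guard' goes out of scope. */
    struct fake_rq_lock *guard __attribute__((cleanup(fake_unlock))) = lock;

    guard->held = 1;
    if (reject)
        return -1;  /* early return: the handler still runs */
    return 0;
}
```

After guarded_write(&lock, 1) returns -1, lock.held is 0 again, which is the property the kernel guard relies on. A toolchain that ignored the attribute would leave held set to 1.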
Exploitation
    # Trigger CFS starvation on CPU 0 (requires write access to debugfs)
    echo 0 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
The file permissions are 0644, so by default only the owner (root) can write to it. On
systems where access to debugfs has been loosened (as sometimes happens on development
kernels or in containers that expose the host's debugfs), a non-root local user may be
able to trigger the crash. Even where root is required, the write gives a reliable,
single-write denial of service to any process that already holds that access.
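Whether a given account is exposed can be checked without writing anything. The helper below is a small sketch that only asks the kernel whether the calling process may write to a path; the name can_write_path is hypothetical, and access(2) evaluates the real UID and GID, so the answer reflects the invoking user.

```c
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical audit helper: returns true if the calling process has
 * write permission on 'path'. It performs no write and changes no state.
 */
static bool can_write_path(const char *path)
{
    return access(path, W_OK) == 0;
}
```

Passing /sys/kernel/debug/sched/fair_server/cpu0/runtime as the path from an unprivileged account indicates whether that account can reach the attack surface. A false result also covers the case where debugfs is not mounted at all.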
Impact
| Property | Detail |
|---|---|
| Trigger | Single write(2) of the string "0" to a debugfs file |
| Privileges | Write access to debugfs fair_server/cpuN/runtime |
| Immediate effect | CFS starvation → all normal tasks hang |
| Final effect | Watchdog-triggered kernel panic / hard lockup |
| Recovery | Hard reboot required |
Fix
Reject a zero (or sub-minimum) runtime before applying the parameters:
    if (!runtime ||                          /* <-- add: disallow zero */
        runtime > period ||
        period > fair_server_period_max ||
        period < fair_server_period_min) {
        return -EINVAL;
    }
If deliberate disabling is a supported use case, it should be gated behind a separate, explicitly documented sysctl with appropriate capability checks rather than a bare debugfs write.
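The sub-minimum variant of the fix can be sketched as a userspace model. Everything here is illustrative: validate_fair_server_params and FAIR_SERVER_RUNTIME_MIN are hypothetical names, the 1 us floor is an arbitrary placeholder rather than a recommended value, and the period bounds are assumed and should be checked against the target tree.

```c
#include <errno.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Assumed bounds for illustration; confirm against kernel/sched/debug.c. */
#define FAIR_SERVER_PERIOD_MAX ((1ULL << 22) * NSEC_PER_USEC) /* ~4 s   */
#define FAIR_SERVER_PERIOD_MIN (100ULL * NSEC_PER_USEC)       /* 100 us */

/* Hypothetical floor; the real minimum would be a maintainer decision. */
#define FAIR_SERVER_RUNTIME_MIN (1ULL * NSEC_PER_USEC)

/* Returns 0 if the pair is acceptable, -EINVAL otherwise. */
static int validate_fair_server_params(uint64_t runtime, uint64_t period)
{
    if (runtime < FAIR_SERVER_RUNTIME_MIN ||  /* rejects zero and tiny budgets */
        runtime > period ||
        period > FAIR_SERVER_PERIOD_MAX ||
        period < FAIR_SERVER_PERIOD_MIN)
        return -EINVAL;

    return 0;
}
```

Because zero is below any positive floor, a single lower-bound comparison covers both the zero case and budgets too small to be useful. Under these assumptions a 50 ms runtime with a 1 s period is still accepted.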