Missing Lower-Bound Validation in sched_fair_server_write Enables CFS Starvation and Kernel Crash

File: kernel/sched/debug.c
Function: sched_fair_server_write() (line 342)
Severity: High — Local Denial of Service / Kernel Panic
Type: Missing Input Validation (CWE-20) + Logic Error
Attack surface: /sys/kernel/debug/sched/fair_server/cpu<N>/runtime


Vulnerability Description

sched_fair_server_write() allows a caller to set the dl_runtime of a CPU’s CFS fair-server via debugfs. The only validation applied to the user-supplied runtime value is an upper-bound check (runtime > period). There is no lower bound: a value of 0 passes every guard and is committed to the scheduler.

The kernel itself acknowledges the danger (lines 384–386):

if (!runtime)
    printk_deferred(
        "Fair server disabled in CPU %d, system may crash due to starvation.\n",
        cpu_of(rq));

Once dl_runtime = 0 is applied, the CFS fair-server for that CPU is effectively disabled. Deadline/real-time tasks can then monopolise the CPU indefinitely. If they do, all CFS tasks queued on that CPU, which may include PID 1, system daemons, and user sessions, are starved and the system can hang. The kernel’s own soft-lockup / NMI watchdog may then fire and, depending on configuration, trigger a panic.


Root Cause

// kernel/sched/debug.c  line 373-377
if (runtime > period ||
    period > fair_server_period_max ||
    period < fair_server_period_min) {
    return -EINVAL;
}

runtime = 0 triggers none of the three conditions (each evaluates to false), so the guard does not return -EINVAL, because:

  • 0 > period → false (period is a positive nanosecond value)
  • The period range checks say nothing about runtime
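
The same logic can be reproduced outside the kernel. The following is a minimal user-space sketch of the guard, not the kernel function itself: model_check() and demo_model_check() are hypothetical names, and the two period bounds are assumed values chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed bounds, for illustration only; the real limits are defined in
 * kernel/sched/debug.c and may differ. */
static const uint64_t model_period_min_ns = 100ULL * 1000;               /* 100 us */
static const uint64_t model_period_max_ns = 4ULL * 1000 * 1000 * 1000;   /* 4 s */

/* User-space model of the guard quoted above.
 * Returns 0 if the pair is accepted, -EINVAL if it is rejected. */
int model_check(uint64_t runtime, uint64_t period)
{
    if (runtime > period ||
        period > model_period_max_ns ||
        period < model_period_min_ns)
        return -EINVAL;
    return 0;
}

/* Small demonstration of which inputs the guard accepts. */
void demo_model_check(void)
{
    const uint64_t period = 1000ULL * 1000 * 1000;               /* 1 s */

    assert(model_check(50ULL * 1000 * 1000, period) == 0);       /* ordinary value: accepted */
    assert(model_check(period + 1, period) == -EINVAL);          /* upper bound is enforced */
    assert(model_check(0, period) == 0);                         /* zero is accepted too */
}
```

Calling demo_model_check() from a test harness exercises all three cases; the last assertion is the one that corresponds to the bug.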

The call then proceeds to:

update_rq_clock(rq);
dl_server_stop(&rq->fair_server);        // server halted
retval = dl_server_apply_params(         // runtime=0 committed
    &rq->fair_server, runtime, period, 0);
if (rq->cfs.h_nr_queued)
    dl_server_start(&rq->fair_server);   // server restarted with zero budget

The fair server is restarted with a zero-nanosecond budget, making it useless.
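
To make the effect concrete, the share of each period reserved for fair tasks is simply runtime divided by period. The sketch below works through that arithmetic in user space; fair_share_ppm() is a hypothetical helper, and the 50 ms / 1 s pair is used as an assumed example configuration rather than a value taken from this report.

```c
#include <assert.h>
#include <stdint.h>

/* Share of each period reserved for fair tasks, in parts per million.
 * Hypothetical helper for illustration; not a kernel function. */
uint64_t fair_share_ppm(uint64_t runtime_ns, uint64_t period_ns)
{
    if (period_ns == 0)
        return 0;   /* defensive: avoid dividing by zero */
    return runtime_ns * 1000000ULL / period_ns;
}

/* Worked examples of the reserved share. */
void demo_fair_share(void)
{
    const uint64_t period = 1000ULL * 1000 * 1000;     /* 1 s */
    const uint64_t runtime = 50ULL * 1000 * 1000;      /* 50 ms, assumed example */

    assert(fair_share_ppm(runtime, period) == 50000);  /* 5% of the CPU */
    assert(fair_share_ppm(0, period) == 0);            /* zero budget: nothing reserved */
}
```

With a zero runtime the reserved share is zero regardless of the period, which is why restarting the server afterwards gives CFS tasks no protection.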


Secondary Issue: Early return Inside scoped_guard(rq_lock_irqsave, rq)

Lines 376 and 391–392 contain return statements while the runqueue spinlock is held and hardware interrupts are disabled:

scoped_guard (rq_lock_irqsave, rq) {
    ...
    if (runtime > period || ...)
        return -EINVAL;   // line 376 — early return with IRQs disabled
    ...
    if (retval < 0)
        return retval;    // line 391 — early return with IRQs disabled
}

With a compiler that supports __attribute__((cleanup)), which scoped_guard() and CLASS() rely on, the lock is released and the interrupt state restored on every scope exit, including these early returns, so the lock is not actually leaked. The concern is therefore conditional: if cleanup support were absent or the CLASS() implementation defective, the rq spinlock would never be released and IRQs would stay disabled on that CPU permanently, an unrecoverable hard lockup. Note that a write of runtime = 0 passes the validation and therefore does not take the first early return; it reaches the second one only if dl_server_apply_params() fails.
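
The cleanup behaviour can be illustrated in user space. This is a minimal sketch assuming a GCC- or Clang-compatible compiler; struct fake_lock, guarded_write() and the helper functions are hypothetical stand-ins and do not reproduce the kernel's guard macros.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the runqueue lock: a held / not-held flag. */
struct fake_lock { int held; };

static void fake_lock_acquire(struct fake_lock *l) { l->held = 1; }

/* Cleanup handler: the compiler passes a pointer to the guarded variable. */
static void fake_lock_release(struct fake_lock **lp) { (*lp)->held = 0; }

/* Takes the lock, then optionally returns early while it is held. */
int guarded_write(struct fake_lock *lock, int reject)
{
    /* fake_lock_release(&guard) runs whenever guard goes out of scope. */
    struct fake_lock *guard __attribute__((cleanup(fake_lock_release))) = lock;

    fake_lock_acquire(guard);

    if (reject)
        return -EINVAL;   /* early return: the cleanup handler still runs */
    return 0;
}

/* Checks that the lock is released on both exit paths. */
void demo_guarded_write(void)
{
    struct fake_lock lock = { 0 };

    assert(guarded_write(&lock, 1) == -EINVAL);
    assert(lock.held == 0);   /* released despite the early return */
    assert(guarded_write(&lock, 0) == 0);
    assert(lock.held == 0);   /* released on the normal path as well */
}
```

The lock is dropped on both paths, which is the property the early returns in sched_fair_server_write() depend on.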


Exploitation

# Trigger CFS starvation on CPU 0 — requires write access to debugfs
echo 0 > /sys/kernel/debug/sched/fair_server/cpu0/runtime

The file permissions are 0644 (writable by the owner only, normally root). On systems where debugfs is mounted with relaxed permissions or exposed to less-privileged contexts (as can happen with development kernels or containers), a non-root local user may be able to trigger the crash. Even where root is required, the attack gives a privileged attacker a simple, single-write way to take the system down, for example as a post-exploitation cleanup step.
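
Whether a given account is exposed can be checked without writing anything. The following is a minimal sketch using access(2); can_write() and demo_can_write() are hypothetical helper names, and the check reflects the real UID of the caller at the moment it runs.

```c
#include <assert.h>
#include <unistd.h>

/* Returns 1 if the calling process may write to path, 0 otherwise.
 * Performs a permission check only; nothing is written. */
int can_write(const char *path)
{
    return access(path, W_OK) == 0;
}

/* A path that does not exist is reported as not writable. */
void demo_can_write(void)
{
    assert(can_write("/nonexistent/fair_server/cpu0/runtime") == 0);
}
```

Calling can_write("/sys/kernel/debug/sched/fair_server/cpu0/runtime") as the account under review indicates whether that account could perform the write shown above.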


Impact

Property          Detail
Trigger           Single write(2) of the string "0" to a debugfs file
Privileges        Write access to debugfs fair_server/cpuN/runtime
Immediate effect  CFS starvation → all normal tasks hang
Final effect      Watchdog-triggered kernel panic / hard lockup
Recovery          Hard reboot required

Fix

Reject a zero (or sub-minimum) runtime before applying the parameters:

if (!runtime ||                          /* <-- add: disallow zero */
    runtime > period ||
    period > fair_server_period_max ||
    period < fair_server_period_min) {
    return -EINVAL;
}

If deliberate disabling is a supported use case, it should be gated behind a separate, explicitly documented sysctl with appropriate capability checks, not a bare debugfs write.
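
The "sub-minimum" variant of the check can be sketched in user space as follows. This is an illustration under stated assumptions, not a proposed patch: patched_check() is a hypothetical name, and all three bounds, in particular the runtime floor patched_runtime_min_ns, are assumed values that do not appear in the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed bounds, for illustration only. */
static const uint64_t patched_period_min_ns  = 100ULL * 1000;              /* 100 us */
static const uint64_t patched_period_max_ns  = 4ULL * 1000 * 1000 * 1000;  /* 4 s */
static const uint64_t patched_runtime_min_ns = 1000ULL;  /* hypothetical floor: 1 us */

/* User-space model of the guard with a lower bound on runtime.
 * Returns 0 if the pair is accepted, -EINVAL if it is rejected. */
int patched_check(uint64_t runtime, uint64_t period)
{
    if (runtime < patched_runtime_min_ns ||   /* rejects zero and tiny budgets */
        runtime > period ||
        period > patched_period_max_ns ||
        period < patched_period_min_ns)
        return -EINVAL;
    return 0;
}

/* Demonstrates that zero is now rejected while ordinary values still pass. */
void demo_patched_check(void)
{
    const uint64_t period = 1000ULL * 1000 * 1000;                 /* 1 s */

    assert(patched_check(0, period) == -EINVAL);                   /* zero rejected */
    assert(patched_check(999, period) == -EINVAL);                 /* below the floor */
    assert(patched_check(50ULL * 1000 * 1000, period) == 0);       /* ordinary value accepted */
    assert(patched_check(period + 1, period) == -EINVAL);          /* upper bound unchanged */
}
```

A floor above zero also rejects budgets too small to be useful, at the cost of having to choose and document that floor; the plain !runtime test shown above avoids that choice.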