Leveraging eBPF as a feedback loop to runtime tune of the systems

Problem Motivation

When building large projects like Android AOSP, system resource limitations often dictate success or failure. For instance, running a make build (i.e., make -j$(nproc)) command on a high-end system (e.g., 128GB RAM and a modern multi-core CPU) typically completes successfully due to ample memory and processing power. By contrast, on a system with only 16GB RAM, the build process is prone to failures caused by Linux’s OOM (Out-Of-Memory) Killer terminating processes to reclaim memory, or other bugs stem from resource contention, such as segmentation faults due to extreme memory pressure.

However, the issue is rarely just about RAM. System-wide bottlenecks—such as a slow CPU struggling to compile code efficiently, or sluggish I/O (e.g., HDDs or slow SSDs) delaying read/write operations—can compound memory pressure. These limitations force processes to linger in memory longer than necessary, overloading the system and triggering instability. In such cases, even adequate RAM may not prevent failures if the CPU or storage cannot keep pace with the build’s demands.

Background

We use eBPF to trace system-wide metrics for CPU, memory, IO, and Scheduler events. Our throttling (i.e., build process resource consumption rate) is based on resource thresholds, such as Pressure Stall Information [1-2] APIs or watermarks [3] for memory solutions. We can use eBPF to read those different watermarks levels and tune the system, such as increase background kswapd frequencey on-demand so that we have enough available memory all the time. PSI APIs for all memory, IO, and CPU that gives how long the system has been under resource pressure.

Our solution is two steps: (1) we do runtime tuning of system, such as reducing the numbe of parallel jobs (2) If the system gets saturated and no options left, then we freeze the entire process per cgroup granularity instead of getting killed, tune the system, then thaw the ongoing process. One solution is to intercept make -j$(nproc) and use ninja with pool that scales dynamically.

I am thinking to build an eBPF intercept (i.e., wrapper) layer that takes "make -jnproc" and based on runtime system condition, it throttles the resource use rate by the process. Can I do this so that every build will pass. This may need to extend its functionality to kernel, if that is the case this we will use kfunc from eBPF to kernel modules, do those throttling logic inside kernel to runtime tune the system for the current job. One challenge would be we need to guarantee whatever we extend to kernel, we need to make sure that is free of bugs, and is well optimized. For FDO (Feedback directed optimization) + LTO (link time optimization) + other compiler optimizations.

Let us try to come up with how exactly we use eBPF wrapper to intercept build process.

Create cgroup for manage CPU, memory, and IO (i.e., cgcreate -g cpu,io,memory:$(throttle_build_cgroup)
We develop an eBPF program that observes watermarks, or PSI APIs for CPU, IO, and Memory.
We attach our cgroup with this eBPF tool (i.e, `$(eBPF_tool) --cgroup $(throttle_build_cgroup)`
We run build process inside cgroup (i.e., cgexec -g cpu,io,memory :$(throttle_build_cgroup) $(our_build_job)".
Our $(eBPF_tool) has callbacks that detect memory, io, CPU pressure, alerts it and handles it directly or by delegating to other in-kernel (or user-space) handler.

Upto step 5 is easy to solve. But how to handle exactly is hard question.

The handling logic should be something like this.

if memory_usage > 85% || cpu_usage > 90% || io_latency > 100ms:
    - Solution 1:
        - pause to build (`pkill -STOP $(pid_of_build_process`)
        - NEW_JOBS=$(calc_new_job_count) //this is difficult, so this should be our research
        - make -j$NEW_JOBS
    - Solution 2:
        -  cgroup throttling
            -  echo "50000 100000" > /sys/fs/cgroup/"$(throttle_build_cgroup)"/cpu.max
    -  Solution 3:
        -  In case, system saturates and we are in high watermarks level, then we may freeze entire cgroup.

One note, we also need to disable default OOM killer that system uses, and guarantee our throttler is the only thing that is solving this low-resources issue.

Second note, if doing tuning or throttling in user-space is slow incase we delegated solution to user-space, then we can do that via kernel-mdules. Or when cases such as we reach out eBPF limitation to perform certain things then we may need to use kfunc to kernel modules, where we have superpower and larger context, and then do those actions that were hard or impossible in eBPF alone. One challenge is that we need to guarantee the kernel modules won't bring new kernel bugs.

Other on-the-fly system tuning ideas

  1. To make build process pass is that we can also track the resource utilization, and where there is bottleneck (for example, if we detect CPU bottleneck) then we can overclock the CPU on the fly. [5]

  2. Let's say we use Mellanox Gigabit NIC that has chance of overheating. We detect this using eBPF and modify the fan speed on the fly. Other ideas can be, track those recently modified files and speed up incrementatl backups [5]

  3. Detect what process hangs (e.g., the firefox tab which is slow and stuck your mouse) and kill the process.

References

[1] https://docs.kernel.org/accounting/psi.html
[2] https://source.android.com/docs/core/perf/lmkd
[3] https://www.kernel.org/doc/gorman/html/understand/understand005.html
[4] https://lpc.events/event/4/contributions/404/attachments/326/550/Handling_memory_pressure_on_Android.pdf
[5]: https://dl.acm.org/doi/10.1016/j.sysarc.2024.103130