OOM-killer

Posted by Marcus Folkesson on Friday, December 1, 2017

OOM-killer

When the system is running out of memory, the Out-Of-Memory (OOM) killer picks a process to kill based on the current memory footprint. In case of OOM, we will calculate a badness score between 0 (never kill) and 1000 for each process in the system. The process with the highest score will be killed. A score of 0 is reserved for unkillable tasks such as the global init process (see [1]) or kernel threads (processes with PF_KTHREAD flag set).

/media/oomkiller.jpg

The current score of a given process is exposed in procfs, see /proc/[pid]/oom_score, and may be adjusted by setting /proc/[pid]/oom_score_adj. The value of oom_score_adj is added to the score before it's used to determine which task to kill. The value may be set between OOM_SCORE_ADJ_MIN (-1000) and OOM_SCORE_DJ_MAX (+1000). This is useful if you want to guarantee that a process never is selected by the OOM killer.

The calculation is simple (nowadays), if a task is using all its allowed memory, the badness score will be calculated to 1000. If it's using half of its allowed memory, the badness score is calculated to 500 and so on. By setting oom_score_adj to -1000, the badness score sums up to <=0 and the task will never be killed by OOM.

There is one more thing that affects the calculation; if the process is running with the capability CAP_SYS_ADMIN, it gets a 3% discount, but that is simply it.

The old implementation

Before v2.6.36, the calculation of badness score tried to be smarter, besides looking for the total memory usage (task->mm->total_vm), it also considered: - Whether the process creates a lot of children - Whether the process has been running for a long time, or has used a lot of CPU time - Whether the process has a low nice value - Whether the process is privileged (CAP_SYS_ADMIN or CAP_SYS_RESOURCE set) - Whether the process is making direct hardware access

At first glance, all these criteria looks valid, but if you think about it a bit, there is a lot of pitfalls here which makes the selection not so fair. For example: A process that creates a lot of children and consumes some memory could be a leaky web server. Another process that fits into the description is your session manager for your desktop environment which naturally creates a lot of child processes.

The new implementation

This heuristic selection has evolved over time, instead of looking on mm->total_vm for each task, the task's RSS (resident set size, [2]) and swap space is used instead. RSS and Swap space gives a better indication of the amount that we will be able to free if we chose this task. The drawback with using mm->total_vm is that it includes overcommitted memory ( see [3] for more information ) which is pages that the process has claimed but hasn't been physically allocated.

The process is now only counted as privileged if CAP_SYS_ADMIN is set, not CAP_SYS_RESOURCE as before.

The code

The whole implementation of OOM killer is located in mm/oom_kill.c. The function oom_badness() will be called for each task in the system and returns the calculated badness score.

Let's go through the function.

1    unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
2                  const nodemask_t *nodemask, unsigned long totalpages)
3    {
4        long points;
5        long adj;
6
7        if (oom_unkillable_task(p, memcg, nodemask))
8            return 0;

Looking for unkillable tasks such as the global init process.

 1        p = find_lock_task_mm(p);
 2        if (!p)
 3            return 0;
 4
 5        adj = (long)p->signal->oom_score_adj;
 6        if (adj == OOM_SCORE_ADJ_MIN ||
 7                test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 8                in_vfork(p)) {
 9            task_unlock(p);
10            return 0;
11        }

If proc/[pid]/oom_score_adj is set to OOM_SCORE_ADJ_MIN (-1000), do not even consider this task:

1        points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
2            atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
3        task_unlock(p);

Calculate a score based on RSS, pagetables and used swap space:

1        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
2            points -= (points * 3) / 100;

If it's root process, give it a 3% discount. We are no mean people after all:

1        adj *= totalpages / 1000;
2        points += adj;

Normalize and add the oom_score_adj value :

1        return points > 0 ? points : 1;

At last, never return 0 for an eligible task as it's reserved for non killable tasks

}

Conclusion

The OOM logic is quite straightforward and seems to have been stable for a long time (v2.6.36 was released in october 2010). The reason why I was looking at the code was that I did not think the behavior I saw when experimenting corresponds to what was written in the man page for oom_score. It turned out that the manpage was not updated when the new calculation was introduced back in 2010.

I have updated the manpage and it's available in v4.14 of the Linux manpage project [4].

commit 5753354a3af20c8b361ec3d53caf68f7217edf48
Author: Marcus Folkesson <marcus.folkesson@gmail.com>
Date:   Fri Nov 17 13:09:44 2017 +0100

    proc.5: Update description of /proc/<pid>/oom_score

    After Linux 2.6.36, the heuristic calculation of oom_score
    has changed to only consider used memory and CAP_SYS_ADMIN.

    See kernel commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10.

    Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>
    Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>

diff --git a/man5/proc.5 b/man5/proc.5
index 82d4a0646..4e44b8fba 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -1395,7 +1395,9 @@ Since Linux 2.6.36, use of this file is deprecated in favor of
 .IR /proc/[pid]/oom_score_adj .
 .TP
 .IR /proc/[pid]/oom_score " (since Linux 2.6.11)"
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
 This file displays the current score that the kernel gives to
 this process for the purpose of selecting a process
 for the OOM-killer.
@@ -1403,7 +1405,16 @@ A higher score means that the process is more likely to be
 selected by the OOM-killer.
 The basis for this score is the amount of memory used by the process,
 with increases (+) or decreases (\-) for factors including:
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
+.RS
+.IP * 2
+whether the process is privileged (\-);
+.\" More precisely, if it has CAP_SYS_ADMIN or (pre 2.6.36) CAP_SYS_RESOURCE
+.RE
+.IP
+Before kernel 2.6.36 the following factors were also used in the calculation of oom_score:
 .RS
 .IP * 2
 whether the process creates a lot of children using
@@ -1413,10 +1424,7 @@ whether the process creates a lot of children using
 whether the process has been running a long time,
 or has used a lot of CPU time (\-);
 .IP *
-whether the process has a low nice value (i.e., > 0) (+);
-.IP *
-whether the process is privileged (\-); and
-.\" More precisely, if it has CAP_SYS_ADMIN or CAP_SYS_RESOURCE
+whether the process has a low nice value (i.e., > 0) (+); and
 .IP *
 whether the process is making direct hardware access (\-).
 .\" More precisely, if it has CAP_SYS_RAWIO