Checkpoint-restore in Linux
I'm working on power-saving features for a project based on a Raspberry Pi Zero. Unfortunately, the RPi does not support features such as hibernation to disk or suspend to RAM because of how the processor is constructed (the GPU is actually the main processor). So I was looking for alternatives.
That's when I stumbled upon CRIU ( [1], [2] ), Checkpoint-Restore In Userspace. (I actually started out reading about PTRACE_SEIZE [4] and ptrace parasite code [3] and found out that CRIU is one of their users.)
CRIU
CRIU is a project that implements checkpoint/restore functionality by freezing the state of a process and its subtasks. CRIU uses ptrace [4] for this: it attaches to the process with a PTRACE_SEIZE request so that it can stop it, then injects parasite code that dumps the process's memory pages into image files to create a recoverable checkpoint.
The dumped process information includes memory pages (collected from /proc/$PID/smaps, /proc/$PID/map_files/ and /proc/$PID/pagemap), as well as open files, credentials, registers, task states and more.
My first concern was that this could not work very well: what about open sockets (especially clients)? It turns out that CRIU already handles most of that. There are only a few scenarios that cannot be dumped [5] yet.
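To get a feel for what that /proc data looks like, here is a minimal sketch that walks a maps-style file and adds up the mapped address ranges. It is not how CRIU does it: it is meant for /proc/self/maps rather than smaps (smaps uses the same header lines plus per-mapping statistics), and count_mappings is a hypothetical helper name of my own.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of CRIU): count the mappings listed in a
 * /proc/<pid>/maps-style file and sum their sizes in bytes.
 * Returns the number of mappings, or -1 if the file cannot be opened.
 * Assumes every line fits in the 4 KiB buffer.
 */
long count_mappings(const char *maps_path, unsigned long long *total_bytes)
{
    FILE *f = fopen(maps_path, "r");
    if (!f)
        return -1;

    char line[4096];
    long n = 0;
    unsigned long long total = 0;

    while (fgets(line, sizeof(line), f)) {
        unsigned long long start, end;
        char perms[5];

        /* Each mapping line starts with "start-end perms", e.g. "7f00-7f10 r-xp". */
        if (sscanf(line, "%llx-%llx %4s", &start, &end, perms) == 3) {
            n++;
            total += end - start;
        }
    }

    fclose(f);
    if (total_bytes)
        *total_bytes = total;
    return n;
}
```

Calling count_mappings("/proc/self/maps", &total) from a main() gives the number of mappings of the calling process and their combined size; a checkpoint tool has to account for every one of them.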
Usage
CRIU has many possible use-cases. Some of those are:
- Container live migration
- Slow-boot services speed up
- Seamless kernel upgrade
- "Save" ability in apps (games) that lack one
- Snapshots of apps
My use case for now is just to save a snapshot of an application and power off the CPU module, so that I can later power it on again and restore the application.
PTRACE
For those not familiar with ptrace(2):
The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers. It's primarily used to implement breakpoint debugging and system call tracing.
ptrace is the main interface that the Linux kernel provides for poking around in, and fetching information from, another running process (think debuggers and tracers).
The PTRACE_SEIZE request was introduced in Linux 3.4:
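As a small illustration of "fetching information from inside another application", the sketch below forks a child, lets it change its own copy of a variable, and reads that value back from the parent with PTRACE_PEEKDATA. The names secret and peek_child_value are made up for this example, and it uses the classic PTRACE_TRACEME handshake rather than anything CRIU-specific.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical variable whose value we read out of the child's memory. */
static volatile long secret = 1234;

/*
 * Fork a child that sets its private copy of `secret` to `value`, then read
 * that word from the parent via ptrace. Returns 0 on success, -1 on error.
 */
int peek_child_value(long value, long *out)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        secret = value;                       /* only the child's copy changes */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == 0)
            raise(SIGSTOP);                   /* stop so the parent can inspect us */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                            /* child exited: PTRACE_TRACEME failed */

    /* PEEKDATA returns the word itself, so errors are reported through errno. */
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid, (void *)&secret, NULL);
    int failed = (word == -1 && errno != 0);

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);

    if (failed)
        return -1;
    *out = word;
    return 0;
}
```

After fork() the variable lives at the same virtual address in both processes, which is why the parent can pass its own &secret as the address to read in the child.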
PTRACE_SEIZE (since Linux 3.4)
Attach to the process specified in pid, making it a tracee of the calling process. Unlike PTRACE_ATTACH, PTRACE_SEIZE does not stop the process. Group-stops are reported as PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop signal. Automatically attached children stop with PTRACE_EVENT_STOP and WSTOPSIG(status) returns SIGTRAP instead of having SIGSTOP signal delivered to them. execve(2) does not deliver an extra SIGTRAP. Only a PTRACE_SEIZEd process can accept PTRACE_INTERRUPT and PTRACE_LISTEN commands. The "seized" behavior just described is inherited by children that are automatically attached using PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, and PTRACE_O_TRACECLONE. addr must be zero. data contains a bit mask of ptrace options to activate immediately.
Permission to perform a PTRACE_SEIZE is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.
A dedicated capability for checkpoint/restore arrived much later, in Linux 5.9; see capabilities(7):
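The seize-then-interrupt sequence from the man page text can be sketched roughly like this. It is a toy, assuming a forked child as the tracee and a made-up function name seize_and_interrupt; it only shows that PTRACE_SEIZE leaves the tracee running and that a following PTRACE_INTERRUPT is reported as a PTRACE_EVENT_STOP.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork an idle child, seize it, interrupt it and check that the stop is
 * reported as PTRACE_EVENT_STOP. Returns 0 if that happened, -1 otherwise.
 */
int seize_and_interrupt(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        for (;;)
            pause();                          /* idle until killed */
    }

    int result = -1;
    int status = 0;

    /* SEIZE attaches without stopping; INTERRUPT then requests the stop. */
    if (ptrace(PTRACE_SEIZE, pid, NULL, NULL) == 0 &&
        ptrace(PTRACE_INTERRUPT, pid, NULL, NULL) == 0 &&
        waitpid(pid, &status, 0) == pid &&
        WIFSTOPPED(status) &&
        (status >> 16) == PTRACE_EVENT_STOP) {
        result = 0;
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
    }

    /* Clean up the child regardless of the outcome. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return result;
}
```

The event code sits in the upper bits of the wait status, which is why the check shifts status right by 16 instead of only looking at WSTOPSIG(status).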
CAP_CHECKPOINT_RESTORE (since Linux 5.9)
- Update /proc/sys/kernel/ns_last_pid (see pid_namespaces(7));
- employ the set_tid feature of clone3(2);
- read the contents of the symbolic links in /proc/pid/map_files for other processes.
This capability was added in Linux 5.9 to separate out checkpoint/restore functionality from the overloaded CAP_SYS_ADMIN capability.
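To see whether a process actually holds that capability, one option is to look at the CapEff line in /proc/<pid>/status. The sketch below does that by hand; the helper names are hypothetical, and the bit number 40 is my assumption for CAP_CHECKPOINT_RESTORE taken from linux/capability.h, hard-coded so the example also builds against older headers.

```c
#include <stdio.h>

/* Assumed bit number of CAP_CHECKPOINT_RESTORE (see linux/capability.h). */
#define CHECKPOINT_RESTORE_BIT 40

/* Return 1 if capability bit `cap` is set in `mask`, else 0. */
int mask_has_cap(unsigned long long mask, int cap)
{
    return (int)((mask >> cap) & 1ULL);
}

/*
 * Read the effective capability mask (the "CapEff:" line) from a
 * /proc/<pid>/status-style file. Returns 0 on success, -1 on failure.
 */
int read_effective_caps(const char *status_path, unsigned long long *mask)
{
    FILE *f = fopen(status_path, "r");
    if (!f)
        return -1;

    char line[256];
    int found = -1;

    while (fgets(line, sizeof(line), f)) {
        /* The mask is printed as hexadecimal after the "CapEff:" label. */
        if (sscanf(line, "CapEff: %llx", mask) == 1) {
            found = 0;
            break;
        }
    }

    fclose(f);
    return found;
}
```

Reading /proc/self/status with read_effective_caps() and passing the result to mask_has_cap(mask, CHECKPOINT_RESTORE_BIT) tells you whether the current process could, for example, write ns_last_pid without full CAP_SYS_ADMIN.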
Example
I wrote a simple C application that just counts a variable up once per second and prints the value:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("My PID is %i\n", (int)getpid());

        int count = 0;
        while (1) {
            /* Print the counter, then sleep; this is the state we checkpoint. */
            printf("%d\n", count++);
            sleep(1);
        }
    }

Compile the code:
    gcc main.c -o main

Start the application:

    [17:26:03]marcus@goliat:~/tmp/count$ ./main
    My PID is 2483855
    0
    1
    2
    3
    4
    5
    6

The process is started with process ID 2483855.
We can now dump the process and store its state. The --shell-job flag tells CRIU that the process was spawned from a shell (and therefore has file descriptors open to PTYs that need to be restored).

    [17:27:26]marcus@goliat:~/tmp/criu$ sudo criu dump -t 2483855 --shell-job
    Warn (compel/arch/x86/src/lib/infect.c:356): Will restore 2483855 with interrupted system call

CRIU needs either the CAP_SYS_ADMIN or the CAP_CHECKPOINT_RESTORE capability. To set the latter on the binary:

    setcap cap_checkpoint_restore+eip /usr/bin/criu

The criu dump command generates a bunch of files that store the current state of the application. These include open file descriptors, registers, stack frames, memory maps and more:

    [17:28:00]marcus@goliat:~/tmp/criu$ ls -1
    core-2483855.img
    fdinfo-2.img
    files.img
    fs-2483855.img
    ids-2483855.img
    inventory.img
    mm-2483855.img
    pagemap-2483855.img
    pages-1.img
    pstree.img
    seccomp.img
    stats-dump
    timens-0.img
    tty-info.img

We can now restore the application from where we stopped:

    [17:29:07]marcus@goliat:~/tmp/criu$ sudo criu restore --shell-job
    27
    28
    29
    30

This is cool. But what is even cooler is that you may restore the application on a different host(!).
Summary
I do not know yet whether CRIU is applicable to what I want to achieve right now, but it's a cool project that I will probably find a use for in the future, so it is a welcome addition to my toolbag.