Contiguous Memory Allocator
Introduction
I find memory management to be one of the most fascinating subsystems in the Linux kernel, and I take every chance I get to talk about it. This post is inspired by a project I'm currently working on: an embedded Linux platform with a camera connected to the CSI-2 bus.
Before we dig into which problems we could trip over, let's talk briefly about how the kernel handles memory.
Memory subsystem
The memory management subsystem handles a wide spectrum of operations, all of which have an impact on system performance. The subsystem is therefore divided into several parts to sustain operational efficiency and optimize resource handling for different use cases.
These parts include:
- Page allocator
- Buddy system
- Kmalloc allocator
- Slab caches
- Vmalloc allocator
- Contiguous memory allocator
- ...
The smallest allocation unit of memory is a page frame. The Memory Management Unit (MMU) does a terrific job of arranging and mapping these page frames of the available physical memory into a virtual address space. Most allocations in the kernel are only virtually contiguous, which is fine for most use cases.
Some hardware/IP blocks require physically contiguous memory to work, though. Direct Memory Access (DMA) transfers are one such case where memory (often) needs to be physically contiguous. Many DMA controllers now support scatter-gather, which lets you hand-pick addresses so that the buffer appears to be contiguous and then let the (IO)MMU do the rest.
For this to work, the hardware/IP block must actually perform its memory accesses through the (IO)MMU, which is not always the case.
Multimedia devices such as GPUs or VPUs often require huge blocks of physically contiguous memory and (with exceptions, see Raspberry Pi 4 below) do not make use of the (IO)MMU.
Contiguous memory
In order to meet this requirement for big chunks of physically contiguous memory, we have to reserve it from the main memory during system boot.
Before CMA, we had to use the mem kernel parameter to limit how much of the system memory is available to the allocators in the Linux system.
The memory outside this mem region is not touched by the system and can be remapped into the linear address space by the driver.
Here is the documentation for the mem kernel parameter [1]:
mem=nn[KMG]     [KNL,BOOT] Force usage of a specific amount of memory
                Amount of memory to be used in cases as follows:

                1 for test;
                2 when the kernel is not able to see the whole system memory;
                3 memory that lies after 'mem=' boundary is excluded from
                  the hypervisor, then assigned to KVM guests.
                4 to limit the memory available for kdump kernel.

                [ARC,MICROBLAZE] - the limit applies only to low memory,
                high memory is not affected.

                [ARM64] - only limits memory covered by the linear
                mapping. The NOMAP regions are not affected.

                [X86] Work as limiting max address. Use together
                with memmap= to avoid physical address space collisions.
                Without memmap= PCI devices could be placed at addresses
                belonging to unused RAM.

                Note that this only takes effects during boot time since
                in above case 3, memory may need be hot added after boot
                if system memory of hypervisor is not sufficient.
The mem parameter has a few drawbacks. The driver needs to know where to find the reserved memory, and the memory lies unused whenever the driver is not accessing it.
Therefore the Contiguous Memory Allocator (CMA) was introduced to manage these reserved memory areas.
The benefit of using CMA is that this area is handled by the allocator algorithms instead of the device driver itself. This lets both devices and the system allocate and use memory from the CMA area: through the page allocator for regular needs, and through the DMA allocation routines when DMA capabilities are needed.
A few words about Raspberry Pi
Raspberry Pi uses a configuration file (config.txt [4]) that is read by the GPU to initialize the system. The configuration file has many tweakable parameters, and one of those is gpu_mem.
This parameter specifies how much memory (in megabytes) to reserve exclusively for the GPU. This works pretty much like the mem kernel command line parameter described above, with the very same drawbacks. The memory reserved for the GPU is not available to the ARM CPU, so it should be kept as low as your application can work with.
One big difference between the variants of the Raspberry Pi modules is that the Raspberry Pi 4 has a GPU with its own MMU, which allows the GPU to use memory that is dynamically allocated within Linux. The gpu_mem value can therefore be kept small on that platform.
The GPU is normally used for displays, 3D calculations, codecs and cameras. One important thing regarding the camera is that the default camera stack (libcamera) uses CMA memory to allocate buffers instead of the reserved GPU memory. When the GPU is only used for camera purposes, gpu_mem can be kept small.
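To get a feeling for how quickly camera buffers eat into the CMA area, here is a rough back-of-the-envelope sketch. The resolution, pixel format and buffer count are hypothetical values picked for illustration, and the helper names are made up; real buffers are usually larger because of stride padding and page alignment.

```python
def frame_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size of one unpadded frame in bytes."""
    return width * height * bits_per_pixel // 8


def buffers_mib(width: int, height: int, bits_per_pixel: int, count: int) -> float:
    """Total size of `count` unpadded frame buffers in MiB."""
    return count * frame_bytes(width, height, bits_per_pixel) / (1024 * 1024)


# Hypothetical stream: 1920x1080, YUV420 (12 bits per pixel), 4 buffers.
print(f"{buffers_mib(1920, 1080, 12, 4):.1f} MiB")
```

Every additional stream, and every step up in resolution, adds to this figure, and all of it has to fit in the CMA area.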
How much CMA is already reserved?
The easiest way to determine how much memory is reserved for CMA is to consult meminfo:
# grep Cma /proc/meminfo
CmaTotal: 983040 kB
CmaFree: 612068 kB
or look at the boot log:
# dmesg | grep CMA
[    0.000000] Reserved memory: created CMA memory pool at 0x0000000056000000, size 960 MiB
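The two outputs use different units (kB in meminfo, MiB in the boot log), so a small conversion helps when comparing them. The following is a minimal sketch that parses meminfo-style text; the function name cma_usage is made up for this example, and it assumes a kernel built with CMA so that both fields are present and non-zero.

```python
import re


def cma_usage(meminfo_text: str) -> dict:
    """Extract CmaTotal/CmaFree (reported in kB) and convert them to MiB."""
    fields = dict(re.findall(r"^(Cma\w+):\s+(\d+) kB$", meminfo_text, re.MULTILINE))
    total_kb = int(fields["CmaTotal"])
    free_kb = int(fields["CmaFree"])
    return {
        "total_mib": total_kb / 1024,
        "free_mib": free_kb / 1024,
        "used_percent": 100 * (total_kb - free_kb) / total_kb,
    }


# Sample taken from the meminfo output above; on a target, pass the
# contents of /proc/meminfo instead.
sample = "CmaTotal: 983040 kB\nCmaFree: 612068 kB\n"
print(cma_usage(sample))
```

For the sample, the total comes out as 960 MiB, which matches the pool size reported in the boot log.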
Reserve memory with CMA
The CMA area is reserved during boot and there are a few ways to do this.
By device tree
This is the preferred way to define CMA areas.
This example is taken from the device tree bindings documentation [2]:
reserved-memory {
    #address-cells = <1>;
    #size-cells = <1>;
    ranges;

    /* global autoconfigured region for contiguous allocations */
    linux,cma {
        compatible = "shared-dma-pool";
        reusable;
        size = <0x4000000>;
        alignment = <0x2000>;
        linux,cma-default;
    };
};
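In the node above, size = <0x4000000> is 64 MiB and alignment = <0x2000> is 8 KiB. Device tree values are written as 32-bit cells, so with #size-cells = <1> a size has to fit in 32 bits, and larger values need two cells with the most significant one first. A minimal sketch of that conversion follows; to_cells is a hypothetical helper, not part of any device tree tooling.

```python
def to_cells(value: int, n_cells: int) -> str:
    """Format `value` as a device tree property made of 32-bit cells."""
    if value < 0 or value >= 1 << (32 * n_cells):
        raise ValueError(f"{value:#x} does not fit in {n_cells} cell(s)")
    # Most significant cell first.
    cells = [(value >> (32 * i)) & 0xFFFFFFFF for i in reversed(range(n_cells))]
    return "<" + " ".join(f"{c:#x}" for c in cells) + ">"


MiB = 1024 * 1024
print(to_cells(64 * MiB, 1))      # the size used in the node above
print(to_cells(8 * 1024**3, 2))   # hypothetical 8 GiB value with #size-cells = <2>
```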
By kernel command line
The CMA area size can also be specified on the kernel command line. There are tons of references out there stating that the command line parameter is overridden by the device tree, but I thought it sounded weird so I looked it up: the kernel command line overrides the device tree, not the other way around.
At least nowadays [3]:
static int __init rmem_cma_setup(struct reserved_mem *rmem)
{
    ...
    if (size_cmdline != -1 && default_cma) {
        pr_info("Reserved memory: bypass %s node, using cmdline CMA params instead\n",
                rmem->name);
        return -EBUSY;
    }
    ...
}
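The precedence can be summarized with a small model. This is a sketch and not the kernel's code: select_cma_size is a made-up name, and it assumes that the kernel configuration default is only used as a fallback when neither the command line nor a linux,cma-default node provides a size, which the excerpt above does not show.

```python
from typing import Optional


def select_cma_size(cmdline: Optional[int], dt_default: Optional[int], kconfig: int) -> tuple:
    """Return (source, size_in_bytes) for the default CMA area.

    None means "not given", mirroring size_cmdline == -1 in the kernel.
    """
    if cmdline is not None:
        # The default device tree node is bypassed, even for cma=0.
        return ("cmdline", cmdline)
    if dt_default is not None:
        return ("device tree", dt_default)
    # Assumption: the Kconfig default is only a fallback.
    return ("kconfig", kconfig)


MiB = 1024 * 1024
print(select_cma_size(256 * MiB, 64 * MiB, 16 * MiB))
```

Note that the check in the excerpt only applies to the node marked linux,cma-default; other shared-dma-pool nodes are not affected by the command line.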
Here is the documentation for the cma kernel parameter [1]:
cma=nn[MG]@[start[MG][-end[MG]]]
                [KNL,CMA]
                Sets the size of kernel global memory area for
                contiguous memory allocations and optionally the
                placement constraint by the physical address range of
                memory allocations. A value of 0 disables CMA
                altogether. For more information, see
                kernel/dma/contiguous.c
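To make the syntax concrete, here is a small sketch that splits a cma= value into size, start and end in bytes. It is an illustration of the documented format and not the kernel's parser; parse_cma is a made-up name, the K suffix is accepted in addition to the documented M and G, and an empty start after @ is rejected for simplicity.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_NUM = r"(0[xX][0-9a-fA-F]+|\d+)([KMG]?)"
_CMA_RE = re.compile(rf"^{_NUM}(?:@{_NUM}(?:-{_NUM})?)?$")


def _to_bytes(number: str, unit: str) -> int:
    """Convert a decimal or 0x-prefixed number with an optional suffix to bytes."""
    base = 16 if number.lower().startswith("0x") else 10
    return int(number, base) * _UNITS[unit]


def parse_cma(arg: str) -> tuple:
    """Parse the value of cma= into (size, start, end); missing parts are None."""
    m = _CMA_RE.match(arg)
    if m is None:
        raise ValueError(f"unsupported cma= value: {arg!r}")
    size = _to_bytes(m.group(1), m.group(2))
    start = _to_bytes(m.group(3), m.group(4)) if m.group(3) is not None else None
    end = _to_bytes(m.group(5), m.group(6)) if m.group(5) is not None else None
    return size, start, end


print(parse_cma("256M"))        # size only
print(parse_cma("256M@0-4G"))   # size, placed somewhere between 0 and 4 GiB
```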
By kernel configuration
The kernel configuration can be used to set the default size of the CMA area, either as a fixed number of megabytes, as a percentage of the available memory, or as the minimum or maximum of the two:
- CONFIG_CMA
- CONFIG_CMA_AREAS
- CONFIG_DMA_CMA
- CONFIG_DMA_PERNUMA_CMA
- CONFIG_CMA_SIZE_MBYTES
- CONFIG_CMA_SIZE_SEL_MBYTES
- CONFIG_CMA_SIZE_SEL_PERCENTAGE
- CONFIG_CMA_SIZE_SEL_MIN
- CONFIG_CMA_SIZE_SEL_MAX
- CONFIG_CMA_ALIGNMENT
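How the four CONFIG_CMA_SIZE_SEL_* choices interact can be sketched as follows. This is a simplified model working in bytes, not the kernel's implementation; kconfig_cma_size is a made-up name, and the percentage argument stands in for CONFIG_CMA_SIZE_PERCENTAGE, which is not in the list above and is named here as an assumption.

```python
MiB = 1024 * 1024


def kconfig_cma_size(total_ram: int, size_mbytes: int, percentage: int, select: str) -> int:
    """Default CMA size in bytes for one CONFIG_CMA_SIZE_SEL_* choice."""
    by_mbytes = size_mbytes * MiB                # CONFIG_CMA_SIZE_MBYTES
    by_percent = total_ram * percentage // 100   # share of the available memory
    choices = {
        "MBYTES": by_mbytes,
        "PERCENTAGE": by_percent,
        "MIN": min(by_mbytes, by_percent),
        "MAX": max(by_mbytes, by_percent),
    }
    return choices[select]


# Hypothetical board: 1 GiB of RAM, 64 MB configured, 10 percent configured.
for select in ("MBYTES", "PERCENTAGE", "MIN", "MAX"):
    print(select, kconfig_cma_size(1024 * MiB, 64, 10, select))
```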
Conclusion
As soon as we use camera devices with higher resolutions and do the image manipulation in the VPU/GPU, we almost always have to increase the CMA area size. Otherwise we will end up with errors like this:
cma_alloc: alloc failed, req-size: 8192 pages, ret: -12
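The message is easier to read once the numbers are translated. The sketch below decodes such a line, assuming 4 KiB pages (the page size depends on the kernel configuration); decode_cma_failure is a made-up helper name.

```python
import errno
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def decode_cma_failure(line: str) -> dict:
    """Pull the request size and error name out of a cma_alloc failure message."""
    m = re.search(r"req-size: (\d+) pages, ret: -(\d+)", line)
    if m is None:
        raise ValueError("not a cma_alloc failure line")
    pages, err = int(m.group(1)), int(m.group(2))
    return {
        "pages": pages,
        "mib": pages * PAGE_SIZE / (1024 * 1024),
        "error": errno.errorcode.get(err, "unknown"),
    }


print(decode_cma_failure("cma_alloc: alloc failed, req-size: 8192 pages, ret: -12"))
```

Under that assumption, the failed request above was for a single 32 MiB contiguous block, and -12 is ENOMEM: the CMA area could not provide a free range of that size.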
References
[1] https://docs.kernel.org/admin-guide/kernel-parameters.html
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/devicetree/bindings/reserved-memory/shared-dma-pool.yaml?h=v6.1.10
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/dma/contiguous.c?h=v6.2-rc6#n408
[4] https://www.raspberrypi.com/documentation/computers/config_txt.html