Fundamentally, memory is just a form of storage. Computers have many layers of storage — CPU registers, CPU cache, RAM, disk. Accessing each layer costs time: the fewer the cycles, the faster the access.
CPU Registers - 1 CPU cycle
CPU Cache - 3-14 CPU cycles
RAM - ~250 CPU cycles
Disk - ~40 million CPU cycles
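To make the gap concrete, here is a minimal (and deliberately rough) C sketch: walking a buffer small enough to stay in cache versus one large enough to spill into RAM shows very different per-access costs. The buffer sizes, stride constant, and iteration counts are arbitrary illustrative choices, and real measurements need care — prefetching and TLB effects can mask the difference.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Compare per-access cost of a cache-resident buffer vs. a RAM-sized one.
 * The stride multiplier makes accesses jump around to defeat prefetching. */
static double walk_ns(char *buf, size_t size, size_t touches) {
    struct timespec t0, t1;
    volatile size_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < touches; i++)
        sum += buf[(i * 64 * 1021) % size];  /* 64 = one cache-line stride */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    size_t small = 16 * 1024;            /* likely fits in L1 cache */
    size_t large = 256 * 1024 * 1024;    /* forces misses out to RAM */
    size_t touches = 50 * 1000 * 1000;
    char *a = malloc(small), *b = malloc(large);
    if (!a || !b) return 1;
    memset(a, 1, small);
    memset(b, 1, large);                 /* also ensures pages are mapped */
    printf("small buffer: %.2f ns/access\n", walk_ns(a, small, touches) / touches);
    printf("large buffer: %.2f ns/access\n", walk_ns(b, large, touches) / touches);
    free(a);
    free(b);
    return 0;
}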
In 32-bit systems, there are 2^32 unique addresses in total, and each of these addresses points to 1 byte in RAM. The term 32-bit usually refers to the width of the data bus, memory addresses, or the general-purpose registers that hold 32-bit values (numbers, addresses, etc.). This means the processor can handle 32-bit data in a single operation. These registers are named EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP; the "E" stands for "Extended", meaning they are 32 bits wide.
In a 32-bit system, we can have at most 4 GB (i.e., 2^32 bytes) of directly addressable RAM (setting aside technologies like Physical Address Extension), while 64-bit systems can address up to 16 EB (2^64 different addresses). Nowadays main memory is not allocated directly; only virtual memory is.
In modern computer systems (both 32-bit and 64-bit), virtual memory is used extensively. Virtual memory is an abstraction that allows programs to use more memory than is physically installed in the system, using disk space (paging, swapping, etc.) to extend the effective memory. It provides isolation between processes and makes memory management more flexible. Even though physical memory exists, the operating system manages it through virtual memory, which is mapped to physical memory by the Memory Management Unit (MMU) in the CPU. With virtual memory, each process thinks it has access to a large, contiguous block of memory, even if physical memory is fragmented or smaller. The OS uses paging or segmentation to map virtual memory addresses to actual physical memory (RAM). Older systems had no MMU hardware, which is why they had no virtual memory.
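As a concrete illustration of the MMU's virtual-to-physical mapping, the kernel exposes each process's translations through /proc/self/pagemap. The sketch below looks up the physical frame backing one stack variable; the assumption is a Linux system and root privileges, since on modern kernels unprivileged reads return a PFN of 0.

#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

/* Translate one virtual address of this process into its physical page
 * frame number (PFN) by reading /proc/self/pagemap. Each 64-bit entry
 * holds the PFN in bits 0-54 and a "page present" flag in bit 63.
 * Run as root: without CAP_SYS_ADMIN the kernel reports the PFN as 0. */
int main(void) {
    int x = 42;                          /* some variable on the stack */
    uintptr_t vaddr = (uintptr_t)&x;
    long page_size = sysconf(_SC_PAGESIZE);

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open pagemap"); return 1; }

    uint64_t entry;
    off_t offset = (off_t)(vaddr / page_size) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        perror("pread"); return 1;
    }
    close(fd);

    int present = (int)((entry >> 63) & 1);
    uint64_t pfn = entry & ((1ULL << 55) - 1);
    printf("virtual 0x%" PRIxPTR " -> present=%d pfn=0x%" PRIx64 "\n",
           vaddr, present, pfn);
    printf("physical address ~ 0x%" PRIx64 "\n",
           pfn * (uint64_t)page_size + (vaddr % page_size));
    return 0;
}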
Linux divides memory into "zones" — 32-bit and 64-bit systems have different memory zones.
32-bit
ZONE_DMA: The lowest 16 MB, for DMA-suitable memory used by ancient ISA devices
ZONE_NORMAL: RAM from 16 MB up to 896 MB
ZONE_HIGHMEM: All RAM above 896 MB
64-bit
ZONE_DMA: The lowest 16 MB, for DMA-suitable memory used by ancient ISA devices
ZONE_DMA32: From 16 MB to 4 GB, for DMA-suitable memory in the 32-bit addressable area
ZONE_NORMAL: All RAM above 4 GB
ZONE_MOVABLE: Deals with fragmentation — a zone of movable pages
A zone might be restricted to certain types of memory allocations, like memory reserved for kernel space (i.e., memory used by the operating system itself) or critical system memory. The operating system may try to allocate memory from the most easily available zone first, but if that's not possible, it may need to fall back on using a more restrictive zone where memory is harder to allocate or where only certain types of allocations are allowed.
A node refers to a physical or logical grouping of memory in large servers or systems with non-uniform memory architecture (NUMA). NUMA is a design where the system has multiple memory modules (or memory banks) spread across different physical locations. So, when a system has more than one node, the system has multiple memory groups, and each group has its own memory zones. You can inspect them with cat /proc/zoneinfo:
Node 0, zone DMA
per-node stats
nr_inactive_anon 0
nr_active_anon 997013
nr_inactive_file 1148635
nr_active_file 709357
nr_unevictable 15735
nr_slab_reclaimable 91751
nr_slab_unreclaimable 180400
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 0
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
nr_anon_pages 991098
nr_mapped 259280
nr_file_pages 1879713
nr_dirty 262
nr_writeback 0
nr_writeback_temp 0
nr_shmem 20126
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 193
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 2521054
nr_written 2185652
nr_throttled_written 0
nr_kernel_misc_reclaimable 0
nr_foll_pin_acquired 1270
nr_foll_pin_released 1270
nr_kernel_stack 28288
nr_page_table_pages 14630
nr_sec_page_table_pages 0
nr_swapcached 0
pgpromote_success 0
pgpromote_candidate 0
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0
pages free 2818
boost 0
min 1
low 4
high 7
spanned 4095
present 3999
managed 3842
cma 0
protection: (0, 1665, 128452, 128452, 128452)
nr_free_pages 2818
nr_zone_inactive_anon 0
nr_zone_active_anon 0
nr_zone_inactive_file 0
nr_zone_active_file 0
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_bounce 0
nr_zspages 0
nr_free_cma 0
nr_unaccepted 0
numa_hit 1
numa_miss 0
numa_foreign 0
numa_interleave 1
numa_local 1
numa_other 0
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 12
[pagesets for cpu 1 through cpu 31 omitted — identical to cpu 0]
node_unreclaimable: 0
start_pfn: 1
Node 0, zone DMA32
pages free 442199
boost 0
min 219
low 645
high 1071
spanned 1044480
present 460687
managed 443735
cma 0
protection: (0, 0, 126786, 126786, 126786)
nr_free_pages 442199
nr_zone_inactive_anon 0
nr_zone_active_anon 0
nr_zone_inactive_file 0
nr_zone_active_file 0
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_bounce 0
nr_zspages 0
nr_free_cma 0
nr_unaccepted 0
numa_hit 2
numa_miss 0
numa_foreign 0
numa_interleave 2
numa_local 2
numa_other 0
pagesets
cpu: 0
count: 0
high: 252
batch: 63
vm stats threshold: 60
[pagesets for cpu 1 through cpu 31 omitted for brevity]
node_unreclaimable: 0
start_pfn: 4096
Node 0, zone Normal
pages free 28931385
boost 0
min 16674
low 49131
high 81588
spanned 33021568
present 33021568
managed 32459074
cma 0
protection: (0, 0, 0, 0, 0)
nr_free_pages 28931385
nr_zone_inactive_anon 0
nr_zone_active_anon 997013
nr_zone_inactive_file 1148635
nr_zone_active_file 709357
nr_zone_unevictable 15735
nr_zone_write_pending 262
nr_mlock 10897
nr_bounce 0
nr_zspages 0
nr_free_cma 0
nr_unaccepted 0
numa_hit 81052041
numa_miss 0
numa_foreign 0
numa_interleave 5013
numa_local 81052040
numa_other 0
pagesets
cpu: 0
count: 1251
high: 1535
batch: 63
vm stats threshold: 125
[pagesets for cpu 1 through cpu 31 omitted for brevity; counts vary per CPU]
node_unreclaimable: 0
start_pfn: 1048576
Node 0, zone Movable
pages free 0
boost 0
min 32
low 32
high 32
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0, 0)
Node 0, zone Device
pages free 0
boost 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0, 0)
In the zoneinfo output above, you can see I have five zones.
The address-translation path looks like this:
CPU chip: the CPU plus the MMU (the MMU is hardware).
Virtual (or logical) addresses: an address space that can be larger than physical RAM — this is the memory abstraction; it is divided into pages.
Physical memory: the actual RAM, divided into page frames.
Logical (virtual) memory doesn't physically reside anywhere; it is just a label or tagging. Physical memory is the actual RAM. If a process allocates more logical memory than it actually uses, internal (logical) memory fragmentation results.
In 64-bit systems, we can have 2^64 addressable units — that is how big the virtual address space can be, although we have limited RAM installed. On earlier 32-bit systems, Physical Address Extension (PAE) could extend physical addressing to a maximum of 64 GB.
Paging is basically the movement of pages in and out of main memory and storage. It allows partially loaded programs, and programs larger than memory, to execute. Unlike swapping, which moves an entire program in and out, paging moves only pages, which are relatively small (e.g., 4 KB). Processes that are swapped out are still known to the kernel, because their metadata remains resident in kernel memory. The kernel prioritizes swapping based on various factors such as thread priority, wait time, and process size: the longer a process has been waiting and the smaller it is, the higher in the queue it will be. Modern Linux does not perform traditional swapping at all; it performs paging to a swap device or file instead. Some UNIX systems still perform actual swapping. In Linux, the kernel also uses various caches to optimize performance.
File system paging involves reading and writing pages of memory-mapped files (mmap()) and of file systems that use the page cache.
If a file system page has been modified in main memory, it is a "dirty" page and requires a write to disk. If it has not been modified (a "clean" page), the page-out simply frees the memory immediately.
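As a small illustration of file system paging through the page cache, here is a hedged C sketch: it maps a file with mmap(), dirties one page by writing through the mapping, and uses msync() to force the dirty page back to disk. The /tmp filename is an arbitrary choice.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* File-backed mmap: reads are served from the page cache; writing
 * through the mapping dirties the page, and msync() asks the kernel
 * to write the dirty page back to disk. */
int main(void) {
    int fd = open("/tmp/mmap-demo", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 'A';                          /* the page is now dirty */
    if (msync(p, 4096, MS_SYNC) < 0)     /* force write-back to disk */
        perror("msync");

    munmap(p, 4096);
    close(fd);
    unlink("/tmp/mmap-demo");
    return 0;
}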
Anonymous paging involves memory that is private to processes (process heaps and stacks). It is called anonymous because it lacks a named location in the operating system, such as a file system path.
Anonymous page-outs require the data to be moved to physical swap devices or swap files — "swapping". Anonymous paging, or swapping, hurts performance and is thus considered "bad" paging.
An application that needs access to an anonymous page that has been paged out triggers an anonymous page-in, which is a blocking I/O call to the disk.
Page-outs themselves may not negatively affect performance, since they can be done asynchronously, while page-ins are synchronous.
On-demand paging maps pages of virtual memory to main memory only on demand. It defers the CPU overhead of creating mappings until they are actually needed and accessed, rather than when the memory is first allocated.
A page fault occurs when a page is accessed that has no mapping from virtual memory to main memory.
If the fault can be satisfied from another page already in memory, it is called a minor fault. This may occur when mapping a new page from available memory, or when mapping to an existing page, such as reading a page from a shared library. If satisfying the fault requires disk I/O — for example, paging in from a swap device — it is a major fault.
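A quick way to watch demand paging and minor faults in action is getrusage(). In this sketch (the mapping size is an arbitrary choice), the fault counter barely moves at mmap() time and jumps only when the pages are first touched.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Demonstrate demand paging: mmap() only creates the mapping; the
 * minor-fault counter climbs when the pages are first touched. */
static void show(const char *label) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%-13s minor faults=%ld major faults=%ld\n",
           label, ru.ru_minflt, ru.ru_majflt);
}

int main(void) {
    size_t len = 64 * 1024 * 1024;       /* 64 MB anonymous mapping */
    show("start:");
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    show("after mmap:");                 /* mapping exists, no pages yet */
    memset(p, 1, len);                   /* touch every page */
    show("after touch:");                /* ~16k more minor faults with 4 KB
                                            pages (fewer if THP kicks in) */
    munmap(p, len);
    return 0;
}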
The UNIX virtual memory model has four states for a page:
1. Unallocated
2. Allocated, but unmapped (unpopulated and not yet faulted)
3. Allocated, and mapped to main memory (RAM)
4. Allocated, and mapped to the physical swap device (paged out)
Overcommit allows more memory to be allocated than is physically available (more than main memory plus swap). It depends on on-demand paging and on applications not using more than a fraction of their allocated memory. It allows malloc() requests to succeed instead of failing, as the system will rarely decline requests for virtual memory. The consequences of overcommit depend on the tunables and on how the kernel manages memory pressure — most frequently you'll see processes OOM-killed (by the out-of-memory killer).
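A minimal sketch of overcommit in action, assuming the default heuristic mode: a malloc() far larger than installed RAM can still succeed, because only virtual address space is handed out until the pages are written. The 64 GB figure is a hypothetical size — pick something larger than your installed RAM to see the effect.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* show the current overcommit mode (0 = heuristic, the default) */
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (f) {
        int mode;
        if (fscanf(f, "%d", &mode) == 1)
            printf("overcommit_memory mode: %d\n", mode);
        fclose(f);
    }

    size_t size = 64UL * 1024 * 1024 * 1024;   /* 64 GB, untouched */
    void *p = malloc(size);
    printf("malloc(64 GB) %s\n",
           p ? "succeeded (virtual address space only!)" : "failed");
    /* The allocation consumes almost no physical memory until written to;
     * actually writing all of it could invoke the OOM killer. */
    free(p);
    return 0;
}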
Truly free memory is not useful — it does nothing — so the OS attempts to use spare memory for the file system cache. The kernel is also able to quickly reclaim memory from the file system cache. This process is transparent to applications, and logical I/O latency is much lower, as requests are served from main memory.
The cache grows over time and "free" memory shrinks. Caching is used to improve read performance, and buffering inside the cache is used to improve write performance.
Buffer cache is stored in the page cache in modern Linux and is used for disk I/O to buffer writes.
The cache is dynamic, and its current size can be checked in /proc/meminfo. The page cache improves the performance of file and directory I/O; both virtual memory pages and file system pages are stored in it. Dirty file system pages are flushed by per-device flusher threads (named flush).
Flushing happens after:
- an interval has elapsed (typically 30 seconds)
- sync(), fsync(), or msync() system calls
- too many dirty pages accumulate (the dirty_ratio and dirty_bytes tunables)
- no available pages remain in the page cache
If there is a system memory deficit, kswapd will look for dirty pages to write to disk. All I/O goes through the page cache unless explicitly set not to (Direct I/O). If the page cache fills completely, all writes can end up blocked, and when all writes are blocked, the operating system tends to stall.
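The write path through the page cache can be seen from user space: write() returns as soon as the pages are dirtied in the cache, while fsync() blocks until they reach disk. A small sketch (the /tmp filename and sizes are arbitrary); watch nr_dirty in /proc/vmstat or Dirty in /proc/meminfo while it runs.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/dirty-demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 25600; i++)      /* ~100 MB of dirty pages */
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
            perror("write"); return 1;
        }
    printf("writes done (data may still be dirty in the page cache)\n");

    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    printf("fsync done (dirty pages flushed to disk)\n");

    close(fd);
    unlink("/tmp/dirty-demo");
    return 0;
}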
It is possible to drop the page, dentry (directory entry cache), and inode caches in Linux, either to forcefully free up memory, or to test file system performance prior to anything being cached.
To drop the page cache: echo 1 > /proc/sys/vm/drop_caches. To drop the dentry and inode caches: echo 2 > /proc/sys/vm/drop_caches. To drop both: echo 3 > /proc/sys/vm/drop_caches.
Linux uses various memory management techniques, such as paging, cache shrinking, and cache removal, to handle memory usage. However, there are times when these methods are insufficient, and that’s when the OOM Killer comes into action.
The OOM Killer will terminate processes to free up memory and ensure the system stays operational. It also terminates processes that share the same memory structure (mm_struct) as the chosen process.
You can make certain processes less likely to be killed by modifying the oom_score_adj value (legacy oom_adj) at /proc/<pid>/oom_score_adj.
For root-owned processes, the OOM score is slightly reduced by 30 points, which makes them less likely to be targeted by the OOM Killer.
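A process can also adjust its own score through /proc/self/oom_score_adj (the modern replacement for oom_adj; range -1000 to 1000, where -1000 exempts the process entirely and requires privilege). A small sketch that makes the calling process a more attractive OOM victim:

#include <stdio.h>

int main(void) {
    /* Raising the score needs no privilege; lowering it does. */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) { perror("fopen oom_score_adj"); return 1; }
    fprintf(f, "500\n");   /* make this process a likelier OOM victim */
    fclose(f);

    /* read back the resulting badness score */
    f = fopen("/proc/self/oom_score", "r");
    if (f) {
        int score;
        if (fscanf(f, "%d", &score) == 1)
            printf("current oom_score: %d\n", score);
        fclose(f);
    }
    return 0;
}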
The OOM killer is triggered primarily for low-order allocations — order 3 (2³ pages) or smaller. In Linux, memory allocation is done in powers of 2. For example:
Memory pages are allocated in powers of 2: an order-3 allocation involves 2³ = 8 contiguous pages, with the total size depending on the page size (32 KB with 4 KB pages).
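The arithmetic is easy to tabulate; a tiny sketch, assuming 4 KB pages:

#include <stdio.h>

/* Worked arithmetic for buddy-allocator orders: an order-n allocation
 * is 2^n contiguous pages; with 4 KB pages, order 3 is 8 pages = 32 KB. */
int main(void) {
    const long page_size = 4096;   /* assumed page size */
    for (int order = 0; order <= 10; order++) {
        long pages = 1L << order;
        printf("order %2d: %5ld pages = %8ld KB\n",
               order, pages, pages * page_size / 1024);
    }
    return 0;
}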
What causes this?
The most common cause is that the system is truly out of memory. If /proc/meminfo shows SwapFree and MemFree are around 1% or less, this is likely the cause.
In rarer cases, a kernel data structure issue or memory leak could be responsible. To investigate, check /proc/meminfo for SwapFree and MemFree, and then examine /proc/slabinfo. One sign of trouble is if the task_struct objects are unusually high, indicating the system might be forking many processes and exhausting kernel memory. You can also identify the specific object consuming the most memory.
SwapFree may appear misleading if a program uses mlock() or HugeTLB, as memory allocated this way cannot be swapped. In typical setups where swap is not enabled, SwapFree is not usually relevant.
In most situations, the system is indeed running out of memory, so tracking process memory usage to identify the culprit is crucial.
The issue can also be triggered by specific memory allocation requirements — for example, allocations that must come from a restricted zone (such as DMA-capable memory below 4 GB) or that need physically contiguous pages.
Segfaults are access violations. Hardware with memory protection will notify the OS that a memory access violation has occurred. This might be caused by trying to read a part of memory that the application is not allowed to access, or trying to use a section of memory in a way that is not allowed, such as trying to write to read-only memory.
Segfaults are ultimately caused by software errors, most often seen in C programs where a pointer references a portion of virtual memory it is not allowed to access.
Some programs have exception handling built in for segfaults, but most do not, and the segfault results in the process crashing and potentially generating a core dump. Core dumps are files containing a process's memory address space at a specific time — in this case, the time of the crash. In practice, other pieces of program state are often dumped too, such as processor registers (including the program counter and stack pointer), general memory management information, and other processor and operating system flags.
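For illustration, here is what built-in segfault handling can look like: a SIGSEGV handler that uses siglongjmp() to escape the faulting code instead of crashing. This is a sketch of the mechanism, not a recommendation — recovering from a segfault is only sound in narrow cases.

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf env;

/* Jump out of the faulting context instead of letting the process die. */
static void handler(int sig) {
    (void)sig;
    siglongjmp(env, 1);
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    if (sigsetjmp(env, 1) == 0) {
        int *volatile p = NULL;
        *p = 42;           /* access violation: write through NULL */
        printf("never reached\n");
    } else {
        printf("caught SIGSEGV, continuing instead of dumping core\n");
    }
    return 0;
}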
A page allocation failure implies that the system has failed to allocate a page.
It can be caused by memory fragmentation — the available memory is so fragmented that there is not enough contiguous space for allocations that require contiguous pages.
It can also be caused by a general lack of memory. As discussed earlier, the OOM killer doesn't trigger on high-order allocations, so if you try to allocate a larger set of pages than would trigger the OOM killer while low on memory, you may see a page allocation failure instead.
A pointer is a variable whose value is the memory address of another variable.
A NULL pointer dereference occurs when a pointer holding NULL is used under the assumption that it points to a valid memory address.
It almost always results in the process crashing, unless exception handling is built in, as with segfaults.
MCEs (Machine Check Exceptions) are hardware errors, thrown when the CPU detects a hardware problem. The main potential causes are errors on the system bus, in memory, or in the CPU cache.
As discussed earlier, pages are generally 4 KB in size; however, this can be changed. Huge pages allow pages that are 2 MB or 1 GB in size.
Modern processors contain a limited set of TLB (Translation Lookaside Buffer) entries for caching translations — with larger pages, each entry covers more memory, so the processor can work with more memory without falling back to slower page-table walks.
Using huge pages explicitly requires applications to be aware of them and coded to take advantage of them, as in the sketch below.
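What "huge-page-aware" code looks like, as a sketch: explicitly requesting a 2 MB huge page with mmap(MAP_HUGETLB). This fails with ENOMEM unless huge pages were reserved beforehand (e.g., by writing to /proc/sys/vm/nr_hugepages).

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;     /* one 2 MB huge page */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* ENOMEM if no huge pages reserved */
        return 1;
    }
    memset(p, 0, len);                /* touch it: one TLB entry covers 2 MB */
    printf("got a 2 MB huge page at %p\n", (void *)p);
    munmap(p, len);
    return 0;
}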
Transparent Huge Pages (THP) are an attempt to abstract this away so that everything can take advantage of huge pages. However, THP can cause odd behavior in some applications, and some vendors, such as Red Hat, explicitly state that it is problematic for certain workloads such as databases.
Setting /sys/kernel/mm/transparent_hugepage/enabled to never will disable them.
You can modify the system overcommit behavior by modifying /proc/sys/vm/overcommit_memory
(You generally should not modify this setting.)
0: Heuristic overcommit. Ensures "crazy" allocations fail while allowing more normal overallocation. (Default)
1: Always allows overcommitting
2: Never allow overcommitting. The total address space is limited to swap plus a configurable percentage of physical RAM; the percentage defaults to 50% and is set in /proc/sys/vm/overcommit_ratio.
When checking or repairing a file system, you can see fairly extreme memory requirements, particularly when a large file system is involved. Specifics vary, but XFS, for example, is particularly onerous — SGI recommends 2 GB of RAM per TB of space, plus 200 MB of RAM per million inodes. This can be worked around with a large swap partition or, with fsck, a scratch file (configurable in /etc/e2fsck.conf).
When in doubt, it is safer to side with more RAM and swap — repairs failing due to a lack of memory can be damaging to the data you are trying to save.
Memory zones: Linux uses specific memory zones (DMA, DMA32, NORMAL, MOVABLE) for different purposes, while Android's zoning is tailored for mobile devices.
Low Memory Killer: Android employs a Low Memory Killer (LMK) to aggressively terminate background processes when memory is low, which is not present in standard Linux systems.
Kernel Same-page Merging (KSM): Android makes extensive use of KSM to reduce memory footprint, especially for duplicate app data. This feature is less commonly used in desktop Linux.
Zram: Android relies heavily on zram for compressed swap space in RAM, which is less common in traditional Linux systems.
Memory allocation: Android uses a custom memory allocator (jemalloc) optimized for mobile devices, while Linux typically uses glibc's malloc.
OOM handling: Android's out-of-memory (OOM) handling is more aggressive, prioritizing foreground apps and system processes over background tasks.
Memory tracking: Android includes additional memory tracking tools like meminfo and procrank, which are specific to its ecosystem.
Contended CPU, memory, or I/O resources lead to latency spikes, throughput losses, and potential OOM kills. Without precise metrics, users must either under-utilize resources to avoid risk or overcommit, facing frequent disruptions.
PSI (Pressure Stall Information) identifies and quantifies disruptions caused by resource shortages, measuring their impact on workloads and systems. Accurate data on productivity losses helps users optimize workload sizing or hardware provisioning. By aggregating real-time information, systems can be dynamically managed through load shedding, job migration, or pausing low-priority tasks. This enables efficient hardware use without compromising workload stability or risking OOM kills.
upgautam@amd:~$ sudo cat /proc/pressure/cpu
some avg10=0.00 avg60=0.00 avg300=0.00 total=150844896
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
upgautam@amd:~$ sudo cat /proc/pressure/io
some avg10=0.00 avg60=0.01 avg300=0.00 total=129022505
full avg10=0.00 avg60=0.01 avg300=0.00 total=126547588
upgautam@amd:~$ sudo cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=12
full avg10=0.00 avg60=0.00 avg300=0.00 total=12
CPU Pressure: The pressure values in /proc/pressure/cpu reflect how much time the CPU has been under stress due to a combination of factors, including how much it’s used by active processes and how much it’s having to wait or contend with other resource demands (like memory or I/O).
These values are percentages of time; the maximum is 100.
some: Represents the percentage of time during which some (at least one) tasks were stalled waiting on the resource — for CPU, runnable tasks were delayed because the CPUs were busy with other work; contended, but not necessarily fully saturated.
full: Represents the percentage of time during which all non-idle tasks were stalled on the resource simultaneously, meaning no productive work was happening at all. (At the system level, full is undefined for CPU and is reported as 0.)
avg10, avg60, avg300: Average pressure over the last 10, 60, and 300 seconds, respectively.
total: The absolute stall time in microseconds, accumulated since the system booted. It is useful for spotting brief pressure spikes that the averages would smooth away.
Running stress-ng, I confirm from the top command that my CPU is 99.9% used.
upgautam@amd:~$ sudo cat /proc/pressure/cpu
some avg10=0.67 avg60=0.80 avg300=0.70 total=47501908
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Linux is efficient at distributing CPU tasks. Even though stress-ng uses a lot of CPU, it is a single process, and the scheduler handles it efficiently by distributing tasks across the available CPU cores. Hence, the CPU is not "waiting" for resources the way it would be with a mix of processes with different resource demands (CPU, I/O, memory). The value increased only when I used multiple processes.
If you open multiple Chrome tabs and some other applications, and also run upgautam@amd:~$ sudo stress-ng --cpu 32 --verbose
then
upgautam@amd:~$ sudo cat /proc/pressure/cpu
some avg10=7.78 avg60=2.18 avg300=0.64 total=4712040
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Next, I stress memory toward its limit. I have 128 GB of fast RAM; to stress it, I used sudo stress-ng --vm 4 --vm-bytes 62G --timeout 60s --vm-method all
and then checked PSI for memory,
upgautam@amd:~$ sudo cat /proc/pressure/memory
some avg10=2.19 avg60=0.77 avg300=0.19 total=755622
full avg10=1.57 avg60=0.58 avg300=0.14 total=603300
Pressure on RAM is the kind that degrades user interactivity the most.
upgautam@amd:~$ sudo stress-ng --io 32 --timeout 60s
(or you can use iomix as `sudo stress-ng --iomix 1024`)
upgautam@amd:~$ sudo cat /proc/pressure/io
some avg10=74.87 avg60=38.62 avg300=10.40 total=36519348
full avg10=73.75 avg60=38.16 avg300=10.28 total=35167137
upgautam@amd:~$ sudo cat /proc/pressure/memory
some avg10=2.19 avg60=0.77 avg300=0.19 total=755622
full avg10=1.57 avg60=0.58 avg300=0.14 total=603300
We can write a user-space program to monitor memory pressure:
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Monitor memory partial ("some") stall with a 1 s tracking window
 * and a 150 ms threshold.
 */
int main() {
    // Trigger format: <type> <cumulative_stall_threshold_us> <tracking_window_us>
    // 150000 us (150 ms) of cumulative stall within any 1000000 us (1 s)
    // window causes the kernel to notify us.
    const char trig[] = "some 150000 1000000";
    struct pollfd fds;
    int n;

    fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
    if (fds.fd < 0) {
        printf("/proc/pressure/memory open error: %s\n", strerror(errno));
        return 1;
    }
    fds.events = POLLPRI;

    if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
        printf("/proc/pressure/memory write error: %s\n", strerror(errno));
        return 1;
    }

    printf("waiting for events...\n");
    while (1) {
        n = poll(&fds, 1, -1);
        if (n < 0) {
            printf("poll error: %s\n", strerror(errno));
            return 1;
        }
        if (fds.revents & POLLERR) {
            printf("got POLLERR, event source is gone\n");
            return 0;
        }
        if (fds.revents & POLLPRI) {
            printf("event triggered!\n");
        } else {
            printf("unknown event received: 0x%x\n", fds.revents);
            return 1;
        }
    }
    return 0;
}
Alternatively, we can use the cgroup2 interface.
Cgroups provide isolation, among several other benefits. You can move the processes you want into your cgroup:
echo 12345 | sudo tee /sys/fs/cgroup/my_cgroup/cgroup.procs
echo 12346 | sudo tee /sys/fs/cgroup/my_cgroup/cgroup.procs
and so on.
This cgroup2 interface basically gives you PSI monitoring at per-cgroup (process-group) granularity.
In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem mounted, pressure stall information is also tracked for tasks grouped into cgroups. Each subdirectory in the cgroupfs mountpoint contains cpu.pressure, memory.pressure, and io.pressure files; the format is the same as the /proc/pressure/ files.
(Enable this in the kernel config: CONFIG_CGROUPS=y.)
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main() {
    // Path to the memory pressure file in the cgroup
    const char *cgroup_memory_pressure_path = "/sys/fs/cgroup/my_cgroup/memory.pressure";
    // Trigger string: monitor "some" memory pressure, 150ms threshold in a 1s window
    const char trigger[] = "some 150000 1000000";
    struct pollfd fds;
    int n;

    // Open the memory pressure file for the specific cgroup
    fds.fd = open(cgroup_memory_pressure_path, O_RDWR | O_NONBLOCK);
    if (fds.fd < 0) {
        printf("Error opening %s: %s\n", cgroup_memory_pressure_path, strerror(errno));
        return 1;
    }
    fds.events = POLLPRI;

    // Write the trigger to set up monitoring
    if (write(fds.fd, trigger, strlen(trigger) + 1) < 0) {
        printf("Error writing trigger to %s: %s\n", cgroup_memory_pressure_path, strerror(errno));
        close(fds.fd);
        return 1;
    }

    printf("Monitoring memory pressure for cgroup at %s...\n", cgroup_memory_pressure_path);

    // Poll for events
    while (1) {
        n = poll(&fds, 1, -1); // Wait indefinitely for an event
        if (n < 0) {
            printf("Poll error: %s\n", strerror(errno));
            close(fds.fd);
            return 1;
        }
        // Check for events
        if (fds.revents & POLLERR) {
            printf("POLLERR: Event source is gone\n");
            break;
        } else if (fds.revents & POLLPRI) {
            printf("Memory pressure event triggered!\n");
        } else {
            printf("Unknown event: 0x%x\n", fds.revents);
            break;
        }
    }

    close(fds.fd);
    return 0;
}
Steps to do:
sudo mount -t cgroup2 none /sys/fs/cgroup
sudo mkdir /sys/fs/cgroup/my_cgroup
echo <PID> | sudo tee /sys/fs/cgroup/my_cgroup/cgroup.procs
Then run stress-ng to generate memory pressure.