Multi-Gen LRU
The multi-gen LRU is an alternative LRU implementation that optimizes page reclaim and improves performance under memory pressure. Page reclaim decides the kernel’s caching policy and ability to overcommit memory. It directly impacts the kswapd CPU usage and RAM efficiency.
Quick start
Build the kernel with the following configurations.
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
All set!
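To confirm that a given kernel was built with these options, a small check like the following can help. It is only a sketch: the function name check_lru_gen_config is hypothetical, and the location of the kernel config file (for example /boot/config-$(uname -r)) is an assumption that varies between distributions, so the path is passed in by the caller.

```shell
# Report whether the two build options from this section are set to "y"
# in the kernel config file given as $1. Returns non-zero if either is
# missing. The caller chooses the path; no default location is assumed.
check_lru_gen_config() {
    config_file=$1
    status=0
    for opt in CONFIG_LRU_GEN CONFIG_LRU_GEN_ENABLED; do
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "$opt=y"
        else
            echo "$opt is not set to y"
            status=1
        fi
    done
    return "$status"
}

# Hypothetical usage (the path is an assumption):
#   check_lru_gen_config "/boot/config-$(uname -r)"
```

A non-zero status for CONFIG_LRU_GEN_ENABLED alone only means the feature is off by default; it can still be turned on at runtime through the kill switch described below.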
Runtime options
/sys/kernel/mm/lru_gen/ contains stable ABIs described in the following subsections.
Kill switch
enabled accepts different values to enable or disable the following components. Its default value depends on CONFIG_LRU_GEN_ENABLED.
All the components should be enabled unless some of them have unforeseen side effects. Writing to enabled has no effect when a component is not supported by the hardware, and valid values will be accepted even when the main switch is off.
Values | Components |
---|---|
0x0001 | The main switch for the multi-gen LRU. |
0x0002 | Clearing the accessed bit in leaf page table entries in large batches, when MMU sets it (e.g., on x86). This behavior can theoretically worsen lock contention (mmap_lock). If it is disabled, the multi-gen LRU will suffer a minor performance degradation. |
0x0004 | Clearing the accessed bit in non-leaf page table entries as well, when MMU sets it (e.g., on x86). This behavior was not verified on x86 varieties other than Intel and AMD. If it is disabled, the multi-gen LRU will suffer a negligible performance degradation. |
[yYnN] | Apply to all the components above. |
E.g.,
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
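The value read back is a bitmask of the components in the table above. As an illustration, the sketch below decodes such a value into component labels. The function name and the labels main_switch, leaf_accessed_bit_batching and nonleaf_accessed_bit_clearing are made up for this example; only the bit values come from the table.

```shell
# Decode a value read from the enabled file (e.g. 0x0005) into labels.
# The label names are hypothetical; the bit values follow the table.
decode_lru_gen_enabled() {
    val=$(( $1 ))
    out=""
    if [ $(( $val & 0x0001 )) -ne 0 ]; then
        out="$out main_switch"
    fi
    if [ $(( $val & 0x0002 )) -ne 0 ]; then
        out="$out leaf_accessed_bit_batching"
    fi
    if [ $(( $val & 0x0004 )) -ne 0 ]; then
        out="$out nonleaf_accessed_bit_clearing"
    fi
    # Strip the leading space left by the concatenation above.
    echo "${out# }"
}

# Possible usage on a kernel built with CONFIG_LRU_GEN=y:
#   decode_lru_gen_enabled "$(cat /sys/kernel/mm/lru_gen/enabled)"
```

For the 0x0005 value shown above, this prints main_switch nonleaf_accessed_bit_clearing, i.e., only the batched clearing of leaf entries is turned off.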
Thrashing prevention
Personal computers are more sensitive to thrashing because it can cause janks (lags when rendering UI) and negatively impact user experience. The multi-gen LRU offers thrashing prevention to the majority of laptop and desktop users who do not have oomd.
Users can write N to min_ttl_ms to prevent the working set of N milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory. In other words, this option works as an adjustable pressure relief valve, and when open, it terminates applications that are hopefully not being used.
Based on the average human detectable lag (~100ms), N=1000 usually eliminates intolerable janks due to thrashing. Larger values like N=3000 make janks less noticeable at the risk of premature OOM kills.
The default value 0 means disabled.
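A minimal sketch of setting this option from a script follows. The helper name set_min_ttl_ms is hypothetical, and the optional second argument exists only so the function can be pointed at another directory for testing; by default it writes to min_ttl_ms under /sys/kernel/mm/lru_gen, which normally requires root.

```shell
# Write N (milliseconds) to min_ttl_ms after checking that it is a
# non-negative integer. $2 optionally overrides the sysfs directory.
set_min_ttl_ms() {
    ttl=$1
    dir=${2:-/sys/kernel/mm/lru_gen}
    case $ttl in
        ''|*[!0-9]*)
            echo "min_ttl_ms must be a non-negative integer" >&2
            return 1
            ;;
    esac
    echo "$ttl" > "$dir/min_ttl_ms"
}

# Possible usage, as root:
#   set_min_ttl_ms 1000    # protect the working set of the last second
#   set_min_ttl_ms 0       # back to the default (disabled)
```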
Experimental features
/sys/kernel/debug/lru_gen accepts commands described in the following subsections. Multiple command lines are supported, as is concatenation with the delimiters , and ;.
/sys/kernel/debug/lru_gen_full provides additional stats for debugging. CONFIG_LRU_GEN_STATS=y keeps historical stats from evicted generations in this file.
Working set estimation
Working set estimation measures how much memory an application requires in a given time interval, and it is usually done with little impact on the performance of the application. E.g., data centers want to optimize job scheduling (bin packing) to improve memory utilization. When a new job comes in, the job scheduler needs to find out whether each server it manages can allocate a certain amount of memory for this new job before it can pick a candidate. To do so, this job scheduler needs to estimate the working sets of the existing jobs.
When it is read, lru_gen returns a histogram of numbers of pages accessed over different time intervals for each memcg and node. MAX_NR_GENS decides the number of bins for each histogram.
memcg  memcg_id  memcg_path
   node  node_id
       min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
       ...
       max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
Each generation contains an estimated number of pages that have been accessed within age_in_ms. The counts are non-cumulative. E.g., min_gen_nr contains the coldest pages and max_gen_nr contains the hottest pages, since age_in_ms of the former is the largest and that of the latter is the smallest.
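The layout above can be flattened into one record per generation, which is easier to feed to other tools. The following is a sketch only: the function name flatten_lru_gen is hypothetical, and it assumes that every generation line consists of exactly four numeric fields, as in the layout shown.

```shell
# Read lru_gen output on stdin and print one line per generation:
#   memcg_id node_id gen_nr age_in_ms total_pages
# where total_pages is nr_anon_pages + nr_file_pages.
flatten_lru_gen() {
    awk '
        $1 == "memcg" { memcg = $2; next }   # remember the current memcg
        $1 == "node"  { node = $2; next }    # remember the current node
        NF == 4 && $1 ~ /^[0-9]+$/ {         # a generation line
            print memcg, node, $1, $2, $3 + $4
        }
    '
}

# Possible usage, as root with debugfs mounted:
#   flatten_lru_gen < /sys/kernel/debug/lru_gen
```

Within each memcg and node, the first record printed is the coldest generation (min_gen_nr) and the last is the hottest (max_gen_nr).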
Users can write + memcg_id node_id max_gen_nr [can_swap [full_scan]] to lru_gen to create a new generation max_gen_nr+1. can_swap defaults to the swap setting and, if it is set to 1, it forces the scan of anon pages when swap is off. full_scan defaults to 1 and, if it is set to 0, it reduces the overhead as well as the coverage when scanning page tables.
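As an illustration of the command syntax, the sketch below assembles the aging command and checks its arguments before anything is written to debugfs. The function name lru_gen_age_cmd is hypothetical, and the IDs in the usage comment are made-up values; real ones come from reading lru_gen.

```shell
# Print "+ memcg_id node_id max_gen_nr [can_swap [full_scan]]".
# All arguments must be non-negative integers; 3 to 5 are accepted.
lru_gen_age_cmd() {
    if [ "$#" -lt 3 ] || [ "$#" -gt 5 ]; then
        echo "usage: lru_gen_age_cmd memcg_id node_id max_gen_nr [can_swap [full_scan]]" >&2
        return 1
    fi
    for arg in "$@"; do
        case $arg in
            ''|*[!0-9]*)
                echo "invalid argument: $arg" >&2
                return 1
                ;;
        esac
    done
    echo "+ $*"
}

# Possible usage, as root with debugfs mounted (IDs are hypothetical):
#   lru_gen_age_cmd 1 0 7 1 0 > /sys/kernel/debug/lru_gen
```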
A typical use case is that a job scheduler writes to lru_gen at a certain time interval to create new generations, and it ranks the servers it manages based on the sizes of their cold memory defined by this time interval.
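One way a scheduler could turn the histogram into a single cold memory figure is to sum the pages in every generation whose age_in_ms is at least the chosen interval. The sketch below does this; the function name cold_pages and the choice to sum over all memcgs and nodes in the input are assumptions made for illustration, not something the kernel prescribes.

```shell
# Read lru_gen output on stdin and print the number of pages
# (anon + file) in generations whose age_in_ms is >= $1,
# summed over every memcg and node present in the input.
cold_pages() {
    awk -v min_age="$1" '
        NF == 4 && $1 ~ /^[0-9]+$/ && $2 + 0 >= min_age + 0 {
            total += $3 + $4
        }
        END { print total + 0 }
    '
}

# Possible usage, as root: pages not accessed for at least 30 seconds
#   cold_pages 30000 < /sys/kernel/debug/lru_gen
```

Running this on each server and sorting the results gives the ranking described above.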
Proactive reclaim
Proactive reclaim induces memory reclaim when there is no memory pressure and usually targets cold memory only. E.g., when a new job comes in, the job scheduler wants to proactively reclaim memory on the server it has selected to improve the chance of successfully landing this new job.
Users can write - memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]] to lru_gen to evict generations less than or equal to min_gen_nr. Note that min_gen_nr should be less than max_gen_nr-1, as max_gen_nr and max_gen_nr-1 are not fully aged and therefore cannot be evicted. swappiness overrides the default value in /proc/sys/vm/swappiness. nr_to_reclaim limits the number of pages to evict.
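The constraint on min_gen_nr is easy to get wrong in a script, so a sketch that enforces it before building the eviction command may be useful. The function name lru_gen_evict_cmd is hypothetical; max_gen_nr is passed in by the caller (it can be taken from the last generation line of the histogram) and is used only for the check, not sent to the kernel. Arguments are assumed to be non-negative integers.

```shell
# Print "- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]".
# Usage: lru_gen_evict_cmd memcg_id node_id min_gen_nr max_gen_nr \
#            [swappiness [nr_to_reclaim]]
# Fails if min_gen_nr is not less than max_gen_nr-1.
lru_gen_evict_cmd() {
    if [ "$#" -lt 4 ] || [ "$#" -gt 6 ]; then
        echo "usage: lru_gen_evict_cmd memcg_id node_id min_gen_nr max_gen_nr [swappiness [nr_to_reclaim]]" >&2
        return 1
    fi
    memcg_id=$1
    node_id=$2
    min_gen_nr=$3
    max_gen_nr=$4
    shift 4
    if [ "$min_gen_nr" -ge $(( $max_gen_nr - 1 )) ]; then
        echo "min_gen_nr must be less than max_gen_nr-1" >&2
        return 1
    fi
    cmd="- $memcg_id $node_id $min_gen_nr"
    if [ "$#" -gt 0 ]; then
        cmd="$cmd $*"    # append swappiness and nr_to_reclaim if given
    fi
    echo "$cmd"
}

# Possible usage, as root with debugfs mounted (values are hypothetical):
#   lru_gen_evict_cmd 1 0 3 6 50 1000 > /sys/kernel/debug/lru_gen
```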
A typical use case is that a job scheduler writes to lru_gen before it tries to land a new job on a server, and if it fails to materialize the cold memory without impacting the existing jobs on this server, it retries on the next server according to the ranking result obtained from the working set estimation step described earlier.