Manually configuring huge pages on NUMA nodes
Background
- The machine has NUMA support enabled
- Huge pages are reserved through kernel boot parameters (a sketch of the parameters involved follows this section)
With these two points, huge pages can be reserved in a simple way for special applications on Linux, or for pods in Kubernetes, to request.
However, once NUMA support is enabled, the details of huge pages become a bit more complicated. This document is based on the kernel's official huge page documentation and excerpts the key parts that tie huge pages to NUMA.
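As a concrete illustration of the second point, the reservation is expressed with the hugepagesz=/hugepages= boot parameters. This is only a sketch: the sizes and counts below are placeholder values, not recommendations, and the 1 GB line requires 1 GB page support in the CPU.

    # /etc/default/grub (illustrative values; adjust sizes and counts to your workload)
    # reserve 100 huge pages of the default 2 MB size, plus two 1 GB pages
    GRUB_CMDLINE_LINUX="default_hugepagesz=2M hugepagesz=2M hugepages=100 hugepagesz=1G hugepages=2"
    # then regenerate the grub config (e.g. update-grub) and reboot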
Basic huge page allocation logic when NUMA is enabled
With NUMA enabled, memory is grouped into nodes, so how does the kernel decide which nodes to place huge pages on when reserving them? The kernel documentation first describes the basic logic (a quick way to inspect the resulting per-node distribution is sketched after the quoted text):
- under the default memory policy, the huge page reservation is spread evenly across every NUMA node
- provided that the NUMA node has enough contiguous physical memory available
- if the physical memory on some NUMA node cannot be reserved as huge pages, other NUMA nodes make up the difference
On a NUMA platform, the kernel will attempt to distribute the huge page pool
over all the set of allowed nodes specified by the NUMA memory policy of the
task that modifies nr_hugepages. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory. Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the discussion
below of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in system at the time of the
allocation attempt. If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.
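A quick way to verify how the reserved pool actually got distributed is to read the per-node sysfs counters. A minimal sketch, assuming the default 2 MB huge page size (so the subdirectory is hugepages-2048kB):

    # per-node persistent and free counts for 2 MB huge pages
    for n in /sys/devices/system/node/node*; do
        echo "$n: $(cat $n/hugepages/hugepages-2048kB/nr_hugepages) total," \
             "$(cat $n/hugepages/hugepages-2048kB/free_hugepages) free"
    done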
Huge page overcommit
Regarding huge page overcommit (a small sketch follows the quoted text):
- a value written to /proc/sys/vm/nr_overcommit_hugepages allows applications to obtain that many additional huge pages on top of the configured persistent reservation
- these additional huge pages are requested from the kernel's normal page pool
- once these additional huge pages are no longer used, they are returned to the normal page pool
/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
requested by applications. Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.
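A minimal sketch of enabling overcommit and watching where the surplus shows up; the count 64 is arbitrary, and the hugepages-2048kB path assumes 2 MB huge pages:

    # allow up to 64 surplus huge pages on top of the persistent pool
    echo 64 > /proc/sys/vm/nr_overcommit_hugepages
    # surplus pages currently handed out (non-zero only while in use)
    grep HugePages_Surp /proc/meminfo
    cat /sys/kernel/mm/hugepages/hugepages-2048kB/surplus_hugepages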
Duplicated user-space interfaces
If you search the internet for huge page configuration, you will come across different methods and locations. The official documentation also addresses the fact that Linux exposes more than one filesystem entry point for controlling huge pages (a short listing of the sysfs tree is sketched after the quoted text):
- /proc/sys/vm, the old entry point kept for backwards compatibility
- /sys/kernel/mm/hugepages, the sysfs entry point
With support for multiple huge page pools at run-time available, much of
the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
The /proc interfaces discussed above have been retained for backwards
compatibility. The root huge page control directory in sysfs is:

    /sys/kernel/mm/hugepages
For each huge page size supported by the running kernel, a subdirectory
will exist, of the form:

    hugepages-${size}kB
Inside each of these directories, the same set of files will exist:
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
free_hugepages
resv_hugepages
surplus_hugepages
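For illustration, the tree can be explored like this (assuming the running kernel's default huge page size is 2 MB, so the subdirectory is hugepages-2048kB); for the default size, nr_hugepages here and /proc/sys/vm/nr_hugepages are views of the same pool:

    ls /sys/kernel/mm/hugepages/
    ls /sys/kernel/mm/hugepages/hugepages-2048kB/
    # for the default huge page size these two report the same value
    cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    cat /proc/sys/vm/nr_hugepages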
Huge page allocation policies
Built around nr_hugepages_mempolicy, more customized, advanced control over huge page allocation is possible when NUMA is enabled.
- First, adjusting huge pages through nr_hugepages_mempolicy follows a somewhat unusual logic: the NUMA node(s) affected are determined by the NUMA memory policy of the process that writes to nr_hugepages_mempolicy. For example, if a process whose memory policy is bound to NUMA node 1 writes a value into /proc/sys/vm/nr_hugepages_mempolicy, the kernel applies the resulting adjustment of the huge page pool to node 1. Note also that when the plain nr_hugepages attribute is used, the memory policy is ignored.

Whether huge pages are allocated and freed via the /proc interface or
the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
nodes from which huge pages are allocated or freed are controlled by the
NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
sysctl or attribute. When the nr_hugepages attribute is used, mempolicy
is ignored.

- Next, the kernel documentation gives two examples:

    numactl --interleave <node-list> echo 20 \
        >/proc/sys/vm/nr_hugepages_mempolicy

    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

The second half of each command, echo ... > ..., is just the usual way of writing a value into one of the kernel's /proc or sysfs files, nothing special. When I first saw this command, though, I could not figure out what the leading numactl was for, and the output does not mention any particular NUMA node either. Only after reading the explanation above did I understand that the leading numactl sets the NUMA memory policy of the echo command. Another thing to note is that the value after echo is not a per-node count but the new target size of the whole persistent pool: the kernel allocates or frees abs(value - current nr_hugepages) pages, and only on the nodes selected by the memory policy.
The recommended method to allocate or free huge pages to/from the kernel
huge page pool, using the nr_hugepages example above, is:

    numactl --interleave <node-list> echo 20 \
        >/proc/sys/vm/nr_hugepages_mempolicy

or, more succinctly:

    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

This will allocate or free abs(20 - nr_hugepages) to or from the nodes
specified in <node-list>, depending on whether number of persistent huge pages
is initially less than or greater than 20, respectively. No huge pages will be
allocated nor freed on any node not included in the specified <node-list>.
- Finally, the two numactl commands above use --interleave and -m respectively. What is the difference? The kernel documentation goes on to explain how adjusting huge pages through nr_hugepages_mempolicy behaves under the different memory policy modes, and numactl's --interleave, -m, and related options each select one of these policies. The core logic of the four modes, in brief:
- interleave: the pages are split evenly across the NUMA nodes in the policy. The persistent huge pages pre-allocated through kernel boot parameters are distributed in the same even fashion. Under this policy, however, if a node runs out of contiguous memory the allocation will not fall back to other nodes, which differs slightly from the boot-time pre-allocation behavior.
- bind: adjust the huge pages exactly on the specified NUMA node(s). Note that, as the quoted text below explains, the pages are still spread interleave-style across the specified nodes, with no fallback outside the policy. (A direct per-node sysfs alternative is sketched after the quoted text.)
- local: huge pages are allocated on the NUMA node of the CPU the process is currently running on. This is a good fit when the process is pinned to CPUs on a single node; otherwise the result changes as the process migrates.
- preferred: only the NUMA node with the lowest numeric id among those specified is used for the huge page allocation.
When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
memory policy mode--bind, preferred, local or interleave--may be used. The
resulting effect on persistent huge page allocation is as follows:

1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
persistent huge pages will be distributed across the node or nodes
specified in the mempolicy as if "interleave" had been specified.
However, if a node in the policy does not contain sufficient contiguous
memory for a huge page, the allocation will not "fallback" to the nearest
neighbor node with sufficient contiguous memory. To do this would cause
undesirable imbalance in the distribution of the huge page pool, or
possibly, allocation of persistent huge pages on nodes not allowed by
the task's memory policy.

2) One or more nodes may be specified with the bind or interleave policy.
If more than one node is specified with the preferred policy, only the
lowest numeric id will be used. Local policy will select the node where
the task is running at the time the nodes_allowed mask is constructed.
For local policy to be deterministic, the task must be bound to a cpu or
cpus in a single node. Otherwise, the task could be migrated to some
other node at any time after launch and the resulting node will be
indeterminate. Thus, local policy is not very useful for this purpose.
Any of the other mempolicy modes may be used to specify a single node.

3) The nodes allowed mask will be derived from any non-default task mempolicy,
whether this policy was set explicitly by the task itself or one of its
ancestors, such as numactl. This means that if the task is invoked from a
shell with non-default policy, that policy will be used. One can specify a
node list of "all" with numactl --interleave or --membind [-m] to achieve
interleaving over all nodes in the system or cpuset.

4) Any task mempolicy specified--e.g., using numactl--will be constrained by
the resource limits of any cpuset in which the task runs. Thus, there will
be no way for a task with non-default policy running in a cpuset with a
subset of the system nodes to allocate huge pages outside the cpuset
without first moving to a cpuset that contains all of the desired nodes.

5) Boot-time huge page allocation attempts to distribute the requested number
of huge pages over all on-lines nodes with memory.
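Besides the mempolicy-driven method quoted above, the same kernel document also describes per-node attributes under /sys/devices/system/node/, which let you set the count on one node directly, without caring about the writing task's memory policy. A minimal sketch, assuming 2 MB huge pages and node 1; the effect is comparable to the numactl -m approach used in the test below:

    # set the persistent 2 MB huge page count on node 1 directly
    echo 100 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
    cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages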
Hands-on configuration test
- Configure 100 persistent huge pages through the kernel boot parameters:

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="hugepages=100"
    # update-grub
- Check the totals:

    root@node1:~# cat /proc/meminfo | fgrep Huge
    AnonHugePages:     90112 kB
    ShmemHugePages:        0 kB
    FileHugePages:         0 kB
    HugePages_Total:     100
    HugePages_Free:      100
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    Hugetlb:          204800 kB
- Check how the huge pages are distributed across the NUMA nodes:

    root@node1:~# cat /sys/devices/system/node/node*/meminfo | fgrep Huge
    Node 0 AnonHugePages:  55296 kB
    Node 0 ShmemHugePages:     0 kB
    Node 0 FileHugePages:      0 kB
    # 50 pages on each node
    Node 0 HugePages_Total:    50
    Node 0 HugePages_Free:     50
    Node 0 HugePages_Surp:      0
    Node 1 AnonHugePages:  18432 kB
    Node 1 ShmemHugePages:     0 kB
    Node 1 FileHugePages:      0 kB
    Node 1 HugePages_Total:    50
    Node 1 HugePages_Free:     50
    Node 1 HugePages_Surp:      0
- Use the membind policy (-m) to adjust the huge pages on one NUMA node at run time:

    root@node1:~# numactl -m 1 echo 150 > /proc/sys/vm/nr_hugepages_mempolicy
    root@node1:~# cat /sys/devices/system/node/node*/meminfo | fgrep Huge
    Node 0 AnonHugePages:  53248 kB
    Node 0 ShmemHugePages:     0 kB
    Node 0 FileHugePages:      0 kB
    Node 0 HugePages_Total:    50
    Node 0 HugePages_Free:     50
    Node 0 HugePages_Surp:      0
    Node 1 AnonHugePages:  20480 kB
    Node 1 ShmemHugePages:     0 kB
    Node 1 FileHugePages:      0 kB
    # the pool grows from 100 to 150; the extra abs(150-100)=50 pages all land on node 1: 50+50=100
    Node 1 HugePages_Total:   100
    Node 1 HugePages_Free:    100
    Node 1 HugePages_Surp:      0
- Modify the overcommit setting. This setting is independent of the configuration above. (A short sketch of how applications can consume the reserved pool through hugetlbfs follows.)

    root@node1:~# echo 100 > /proc/sys/vm/nr_overcommit_hugepages
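To connect this back to the background section (reserving pages for applications to request): one common way for an ordinary process to consume the pool is through a hugetlbfs mount, from which it can mmap files backed by huge pages. A minimal sketch; the /mnt/huge path is just an example:

    # mount a hugetlbfs instance backed by the default huge page size
    mkdir -p /mnt/huge
    mount -t hugetlbfs none /mnt/huge
    # reservations and free pages can then be watched in /proc/meminfo
    grep -E 'HugePages_(Free|Rsvd)' /proc/meminfo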
References
- Linux kernel documentation on huge pages: Documentation/vm/hugetlbpage.txt (now Documentation/admin-guide/mm/hugetlbpage.rst)
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.