Written by arstercz
A brief summary of NUMA usage
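Most mainstream physical servers today (including some cloud vendors' high-spec VMs) use a NUMA architecture, which has significant advantages for CPU and memory usage. However, some of its policies, together with the split between local and remote memory access, can cause unexpected problems; see the article percona-mysql-start-slowly for more. This post gives a brief summary of NUMA and how to work with it.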
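System parameter control
The kernel's numa_balancing parameter controls whether the system migrates memory between nodes automatically. The exception is a process that has already been bound to specific NUMA nodes: in that case this parameter should be disabled.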
kernel.numa_balancing = 1
Enables/disables automatic page fault based NUMA memory
balancing. Memory is moved automatically to nodes
that access it often.
When this feature is enabled the kernel samples what task thread is
accessing memory by periodically unmapping pages and later trapping
a page fault. At the time of the page fault, it is determined if the
data being accessed should be migrated to a local memory node.
The unmapping of pages and trapping faults incur additional overhead that
ideally is offset by improved memory locality but there is no universal
guarantee. If the target workload is already bound to NUMA nodes then this
feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls.
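Whether automatic balancing is currently on can be checked by reading the file behind the sysctl. The following is a minimal sketch in C, not a definitive tool: the helper name `read_sysctl_int` is made up for this illustration, and it assumes the value is exposed at /proc/sys/kernel/numa_balancing (on kernels built without NUMA balancing the file is absent, and the helper returns -1).

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a sysctl-style file.
 * Returns the value, or -1 if the file is missing or unparsable. */
int read_sysctl_int(const char *path) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    int value = -1;
    if (fscanf(fp, "%d", &value) != 1)
        value = -1;
    fclose(fp);
    return value;
}
```

Calling `read_sysctl_int("/proc/sys/kernel/numa_balancing")` yields 0 when balancing is disabled and a non-zero value when it is enabled. A launcher that binds its workload to fixed nodes could use this to warn that the setting is still on.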
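Controlling the NUMA policy
Command-line control
In the early days, when machines were modestly specced, many memory-hungry processes (such as mysql) had their memory policy set with the numactl command. The invocation below interleaves allocations across all nodes, which allows the process to use remote memory: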
numactl --interleave=all <program> <args>
System call control
A program can also set the policy in its own code, as in the following C example:
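#include <stdio.h>
#include <stdlib.h>
#include <numa.h>    /* numa_available, numa_get_mems_allowed; link with -lnuma */
#include <numaif.h>  /* set_mempolicy, MPOL_INTERLEAVE */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    // Set the memory policy to interleave across all nodes this task may use
    struct bitmask *numa_nodes = numa_get_mems_allowed();
    // maxnode is passed as size + 1: the kernel expects one more than the number of mask bits
    long ret = set_mempolicy(MPOL_INTERLEAVE, numa_nodes->maskp, numa_nodes->size + 1);
    if (ret != 0) {
        perror("set_mempolicy");
        exit(1);
    }
    // Now, pages for subsequent allocations are spread round-robin across those nodes
    int *ptr = malloc(1024);
    // ... other code ...
    free(ptr);
    numa_bitmask_free(numa_nodes);
    return 0;
}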
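Note: early MySQL versions (5.1, 5.5) relied on the command-line approach to work around NUMA problems, while newer versions use the system call approach (search the source for the set_mempolicy function).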
Checking NUMA status
# NUMA memory allocation per node
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 15836 MB
node 0 free: 928 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 16123 MB
node 1 free: 4177 MB
node distances:
node 0 1
0: 10 20
1: 20 10
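The distance table at the end of that output is also exposed per node under sysfs: on Linux, /sys/devices/system/node/nodeN/distance holds one row of the table as space-separated integers. Below is a small sketch of parsing such a row in C; `parse_distance_row` is a name invented here and is not part of numactl or libnuma.

```c
#include <stdlib.h>

/* Hypothetical helper: parse a row such as "10 20" into out[].
 * Returns the number of values stored (at most max). */
int parse_distance_row(const char *row, int *out, int max) {
    int count = 0;
    const char *p = row;
    while (count < max) {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p)      /* nothing left to convert */
            break;
        out[count++] = (int)v;
        p = end;
    }
    return count;
}
```

For node 0 in the output above, the row `10 20` means local access has a relative cost of 10 and access to node 1 a relative cost of 20. These are relative estimates reported by the platform, not measured latencies.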
# NUMA memory usage statistics
$ numastat
node0 node1
numa_hit 18136592413 17603414333
numa_miss 938561440 4793700752
numa_foreign 4793700752 938561440
interleave_hit 568140 568108
local_node 18216194200 11673197270
other_node 858959653 10723917815
Check the NUMA status first to decide which policy to enable, and ideally try any such change in a test environment beforehand.
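One way to read the numastat counters is to compute, per node, the share of allocations that landed on that node even though the process preferred a different one: numa_miss / (numa_hit + numa_miss). The C sketch below does only that arithmetic. The function name `numa_miss_ratio` is invented for this example, and what ratio is high enough to act on is left open.

```c
#include <stdint.h>

/* Hypothetical helper: fraction of a node's allocations that were misses.
 * Returns 0.0 when the node has recorded no allocations at all. */
double numa_miss_ratio(uint64_t numa_hit, uint64_t numa_miss) {
    uint64_t total = numa_hit + numa_miss;
    if (total == 0)
        return 0.0;
    return (double)numa_miss / (double)total;
}
```

With the figures above, node0 comes out at roughly 4.9% and node1 at roughly 21.4%. This fits the numactl -H output, where node 0 has far less free memory, so allocations intended for node 0 spill over to node 1.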