kubernetes source code analysis: where huge page data comes from and how it is updated
Background
- Huge pages are typically configured via /etc/default/grub.
- On a node with huge pages configured, k8s detects them automatically and adds them to the node capacity.
- On a NUMA-enabled node, huge pages configured via /etc/default/grub are by default spread evenly across all NUMA nodes:

```
# /etc/default/grub
GRUB_CMDLINE_LINUX="hugepages=100"

root@master1:~# cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
100
root@master1:~# cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
50
50
```
- With a boot-time systemd service plus a script, you can bypass the even NUMA split and assign different huge page counts to different NUMA nodes. To check which takes priority when both the grub parameter and the script are present, I deliberately changed the kernel parameter's huge page count to 110:

```
root@master1:~# cat /etc/default/grub | grep hugepage
GRUB_CMDLINE_LINUX="hugepages=110"
root@master1:~# cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
100
root@master1:~# cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
90
10
```

As the output shows, the script has the higher priority and overrides the boot parameter. Also, huge pages configured via /sys/devices/system/node are propagated to /sys/kernel/mm/hugepages, so k8s can see them too.
- Risk: kubelet is itself started by systemd, and so is the boot-time service above. This creates a service ordering problem: kubelet should start only after the huge page script has run, so that it reads the up-to-date huge page count.
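The ordering constraint above can be expressed directly in a systemd unit. The following is a sketch only: the unit name, script commands, and the 90/10 per-node split are illustrative assumptions, not taken from a real cluster. The key line is Before=kubelet.service, which guarantees this oneshot unit finishes before kubelet starts reading sysfs.

```ini
# /etc/systemd/system/hugepages-numa.service (hypothetical name and path)
[Unit]
Description=Assign per-NUMA-node huge pages before kubelet starts
# Ordering: run this unit before kubelet so kubelet reads the final values.
Before=kubelet.service

[Service]
Type=oneshot
# Example per-node split (90/10) for 2 MiB pages; adjust to your topology.
ExecStart=/bin/sh -c 'echo 90 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages'
ExecStart=/bin/sh -c 'echo 10 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages'

[Install]
WantedBy=multi-user.target
```

With Type=oneshot, systemd runs each ExecStart in order and only considers the unit started once they all complete, which is what makes the Before= ordering meaningful here.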
With two questions in mind, let's walk through how kubelet collects and reports the node's total resource capacity, focusing on the huge page parts:
- Where does kubelet read the huge page numbers from?
- Can kubelet dynamically track huge page changes on the node?
Node capacity information
In the kubernetes source tree, pkg/kubelet/nodestatus/setters.go contains MachineInfo, a function closely tied to the node's resource status; we start tracing from there.
```go
// pkg/kubelet/nodestatus/setters.go:275
func MachineInfo(nodeName string,
	maxPods int,
	podsPerCore int,
	machineInfoFunc func() (*cadvisorapiv1.MachineInfo, error), // typically Kubelet.GetCachedMachineInfo
	capacityFunc func() v1.ResourceList, // typically Kubelet.containerManager.GetCapacity
	devicePluginResourceCapacityFunc func() (v1.ResourceList, v1.ResourceList, []string), // typically Kubelet.containerManager.GetDevicePluginResourceCapacity
	nodeAllocatableReservationFunc func() v1.ResourceList, // typically Kubelet.containerManager.GetNodeAllocatableReservation
	recordEventFunc func(eventType, event, message string), // typically Kubelet.recordEvent
) Setter {
	...
}
```
The parameter capacityFunc looks closely related to the node's capacity, so let's first follow the argument passed in for it.
```go
// pkg/kubelet/cm/container_manager_linux.go:1060
func (cm *containerManagerImpl) GetCapacity() v1.ResourceList {
	return cm.capacity
}

// pkg/kubelet/cm/container_manager_linux.go:250
capacity := cadvisor.CapacityFromMachineInfo(machineInfo)

// Trace machineInfo
// pkg/kubelet/cadvisor/cadvisor_linux.go:157
func (cc *cadvisorClient) MachineInfo() (*cadvisorapi.MachineInfo, error) {
	return cc.GetMachineInfo()
}

// github.com/google/cadvisor@v0.39.0/manager/manager.go:813
func (m *manager) GetMachineInfo() (*info.MachineInfo, error) {
	m.machineMu.RLock()
	defer m.machineMu.RUnlock()
	return m.machineInfo.Clone(), nil
}
```
Note the ticker below: the machine info obtained here is refreshed dynamically, and therefore so is the capacity above.
```go
// github.com/google/cadvisor@v0.39.0/manager/manager.go:354
func (m *manager) updateMachineInfo(quit chan error) {
	ticker := time.NewTicker(*updateMachineInfoInterval)
	for {
		select {
		case <-ticker.C:
			// Keep tracing from here
			info, err := machine.Info(m.sysFs, m.fsInfo, m.inHostNamespace)
			if err != nil {
				klog.Errorf("Could not get machine info: %v", err)
				break
			}
			m.machineMu.Lock()
			m.machineInfo = *info
			m.machineMu.Unlock()
			klog.V(5).Infof("Update machine info: %+v", *info)
		case <-quit:
			ticker.Stop()
			quit <- nil
			return
		}
	}
}
```
```go
// github.com/google/cadvisor/machine/info.go:57
func Info(sysFs sysfs.SysFs, fsInfo fs.FsInfo, inHostNamespace bool) (*info.MachineInfo, error) {
	...
	// Here is where we deal with huge pages most directly
	hugePagesInfo, err := sysinfo.GetHugePagesInfo(sysFs, hugepagesDirectory)
	if err != nil {
		return nil, err
	}
	...
}
```
Where the huge page data comes from
Looking at sysinfo.GetHugePagesInfo, the end point of the trace above, we can tentatively conclude: k8s obtains its huge page information through cadvisor, specifically from /sys/kernel/mm/hugepages/, where it reads the page size and page count.
```go
// GetHugePagesInfo returns information about pre-allocated huge pages
// hugepagesDirectory should be top directory of hugepages
// Such as: /sys/kernel/mm/hugepages/
func GetHugePagesInfo(sysFs sysfs.SysFs, hugepagesDirectory string) ([]info.HugePagesInfo, error) {
	var hugePagesInfo []info.HugePagesInfo
	files, err := sysFs.GetHugePagesInfo(hugepagesDirectory)
	if err != nil {
		// treat as non-fatal since kernels and machine can be
		// configured to disable hugepage support
		return hugePagesInfo, nil
	}
	for _, st := range files {
		nameArray := strings.Split(st.Name(), "-")
		pageSizeArray := strings.Split(nameArray[1], "kB")
		pageSize, err := strconv.ParseUint(string(pageSizeArray[0]), 10, 64)
		if err != nil {
			return hugePagesInfo, err
		}
		val, err := sysFs.GetHugePagesNr(hugepagesDirectory, st.Name())
		if err != nil {
			return hugePagesInfo, err
		}
		var numPages uint64
		// we use sscanf as the file as a new-line that trips up ParseUint
		// it returns the number of tokens successfully parsed, so if
		// n != 1, it means we were unable to parse a number from the file
		n, err := fmt.Sscanf(string(val), "%d", &numPages)
		if err != nil || n != 1 {
			return hugePagesInfo, fmt.Errorf("could not parse file nr_hugepage for %s, contents %q", st.Name(), string(val))
		}
		hugePagesInfo = append(hugePagesInfo, info.HugePagesInfo{
			NumPages: numPages,
			PageSize: pageSize,
		})
	}
	return hugePagesInfo, nil
}
```
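To make the parsing above concrete, here is a small self-contained sketch (parseHugePageDir is a hypothetical helper, not part of cadvisor) that applies the same string handling to a sysfs directory name such as hugepages-2048kB and the contents of its nr_hugepages file:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseHugePageDir mirrors the logic above: the directory name
// "hugepages-<size>kB" yields the page size in kB, and the nr_hugepages
// file content (which ends with a newline) yields the page count.
func parseHugePageDir(dirName, nrHugepages string) (pageSizeKB, numPages uint64, err error) {
	nameArray := strings.Split(dirName, "-")         // ["hugepages", "2048kB"]
	pageSizeArray := strings.Split(nameArray[1], "kB") // ["2048", ""]
	pageSizeKB, err = strconv.ParseUint(pageSizeArray[0], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	// Sscanf tolerates the trailing newline that would trip up ParseUint.
	n, err := fmt.Sscanf(nrHugepages, "%d", &numPages)
	if err != nil || n != 1 {
		return 0, 0, fmt.Errorf("could not parse %q", nrHugepages)
	}
	return pageSizeKB, numPages, nil
}

func main() {
	size, num, err := parseHugePageDir("hugepages-2048kB", "100\n")
	if err != nil {
		panic(err)
	}
	fmt.Printf("pageSize=%dkB numPages=%d\n", size, num) // pageSize=2048kB numPages=100
}
```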
nodeStatus updates
Now the question: if we update the node's huge pages dynamically, as described in the Background section, will kubelet update the huge page amount in the node capacity accordingly? To answer this, we need to find out when the MachineInfo function above is called.
```go
// pkg/kubelet/kubelet_node_status.go:604
func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) error {
	...
	var setters []func(n *v1.Node) error
	setters = append(setters,
		nodestatus.NodeAddress(kl.nodeIPs, kl.nodeIPValidator, kl.hostname, kl.hostnameOverridden, kl.externalCloudProvider, kl.cloud, nodeAddressesFunc),
		nodestatus.MachineInfo(string(kl.nodeName), kl.maxPods, kl.podsPerCore, kl.GetCachedMachineInfo, kl.containerManager.GetCapacity,
			kl.containerManager.GetDevicePluginResourceCapacity, kl.containerManager.GetNodeAllocatableReservation, kl.recordEvent),
		nodestatus.VersionInfo(kl.cadvisor.VersionInfo, kl.containerRuntime.Type, kl.containerRuntime.Version),
		nodestatus.DaemonEndpoints(kl.daemonEndpoints),
		nodestatus.Images(kl.nodeStatusMaxImages, kl.imageManager.GetImageList),
		nodestatus.GoRuntime(),
	)
	...
}
```
This is where MachineInfo is registered; from the variable it is registered into, we can find where it is invoked. Below is the topmost call path down to the setters: after startup, kubelet keeps updating the node status at an interval given by the nodeStatusUpdateFrequency parameter.
```go
// pkg/kubelet/kubelet.go:1409
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	...
	if kl.kubeClient != nil {
		// Start syncing node status immediately, this may set up things the runtime needs to run.
		go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
	}
	...
}
```
More concretely, here is the direct caller of the setters. Although the setters run periodically and update the node status, not every run results in a call to the API server that triggers a patch. To reduce pressure on the API server, the nodeStatusReportFrequency setting further stretches the interval between actual node status patches.
```go
// pkg/kubelet/kubelet_node_status.go:480
func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {
	...
	kl.setNodeStatus(node)
	now := kl.clock.Now()
	if now.Before(kl.lastStatusReportTime.Add(kl.nodeStatusReportFrequency)) {
		if !podCIDRChanged && !nodeStatusHasChanged(&originalNode.Status, &node.Status) {
			// We must mark the volumes as ReportedInUse in volume manager's dsw even
			// if no changes were made to the node status (no volumes were added or removed
			// from the VolumesInUse list).
			//
			// The reason is that on a kubelet restart, the volume manager's dsw is
			// repopulated and the volume ReportedInUse is initialized to false, while the
			// VolumesInUse list from the Node object still contains the state from the
			// previous kubelet instantiation.
			//
			// Once the volumes are added to the dsw, the ReportedInUse field needs to be
			// synced from the VolumesInUse list in the Node.Status.
			//
			// The MarkVolumesAsReportedInUse() call cannot be performed in dsw directly
			// because it does not have access to the Node object.
			// This also cannot be populated on node status manager init because the volume
			// may not have been added to dsw at that time.
			kl.volumeManager.MarkVolumesAsReportedInUse(node.Status.VolumesInUse)
			return nil
		}
	}
	// Patch the current status on the API server
	updatedNode, _, err := nodeutil.PatchNodeStatus(kl.heartbeatClient.CoreV1(), types.NodeName(kl.nodeName), originalNode, node)
	if err != nil {
		return err
	}
	kl.lastStatusReportTime = now
	kl.setLastObservedNodeAddresses(updatedNode.Status.Addresses)
	// If update finishes successfully, mark the volumeInUse as reportedInUse to indicate
	// those volumes are already updated in the node's status
	kl.volumeManager.MarkVolumesAsReportedInUse(updatedNode.Status.VolumesInUse)
	return nil
}
```
In summary, we can tentatively conclude that kubelet periodically recomputes the node capacity and periodically patches the node status object, so a dynamic change to the huge pages should, in theory, propagate into the node capacity. But when I actually tried it, the node capacity did not change after I modified the huge pages without restarting kubelet; only a kubelet restart picked up the new value.
Frequency settings
Since practice and theory disagree, the first suspects are the two kubelet settings mentioned above. In a cluster deployed with kubekey, the kubelet config file carries the following defaults:
```yaml
# /var/lib/kubelet/config.yaml
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
```
Changing both settings to 5s did not solve the problem.
kubelet log level
To verify whether these node status setters are actually invoked periodically, I raised kubelet's log level. Although /var/lib/kubelet/config.yaml has a logging verbosity setting, changing it had no effect. After some searching, I found that editing /var/lib/kubelet/kubeadm-flags.env and appending --v to KUBELET_KUBEADM_ARGS does work. The logs then show that the node status setters are indeed called on schedule:
```go
func (kl *Kubelet) updateNodeStatus() error {
	klog.V(5).InfoS("Updating node status")
	...
}
```

```
root@master1:~# journalctl -u kubelet -f | grep "Updating node status"
Mar 15 16:19:59 master1 kubelet[65392]: I0315 16:19:59.193290   65392 kubelet_node_status.go:464] "Updating node status"
Mar 15 16:20:09 master1 kubelet[65392]: I0315 16:20:09.209810   65392 kubelet_node_status.go:464] "Updating node status"
Mar 15 16:20:19 master1 kubelet[65392]: I0315 16:20:19.226005   65392 kubelet_node_status.go:464] "Updating node status"
Mar 15 16:20:29 master1 kubelet[65392]: I0315 16:20:29.244722   65392 kubelet_node_status.go:464] "Updating node status"
```
Back to square one
When reality contradicts theory, we have to re-examine the theory's starting point, so let's return to MachineInfo.
```go
// pkg/kubelet/nodestatus/setters.go:274
func MachineInfo(nodeName string,
	maxPods int,
	podsPerCore int,
	machineInfoFunc func() (*cadvisorapiv1.MachineInfo, error), // typically Kubelet.GetCachedMachineInfo
	capacityFunc func() v1.ResourceList, // typically Kubelet.containerManager.GetCapacity
	devicePluginResourceCapacityFunc func() (v1.ResourceList, v1.ResourceList, []string), // typically Kubelet.containerManager.GetDevicePluginResourceCapacity
	nodeAllocatableReservationFunc func() v1.ResourceList, // typically Kubelet.containerManager.GetNodeAllocatableReservation
	recordEventFunc func(eventType, event, message string), // typically Kubelet.recordEvent
) Setter {
	...
}
```
Looking closely at the body of this function, it turns out the parameter name is misleading. Although it is called capacityFunc, its only use inside the function is the snippet below: it merely provides the amount of local ephemeral storage, nothing more.
```go
if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
	// TODO: all the node resources should use ContainerManager.GetCapacity instead of deriving the
	// capacity for every node status request
	initialCapacity := capacityFunc()
	if initialCapacity != nil {
		if v, exists := initialCapacity[v1.ResourceEphemeralStorage]; exists {
			node.Status.Capacity[v1.ResourceEphemeralStorage] = v
		}
	}
}
```
What actually determines the node capacity is another parameter…
```go
info, err := machineInfoFunc()
if err != nil {
	...
} else {
	for rName, rCap := range cadvisor.CapacityFromMachineInfo(info) {
		node.Status.Capacity[rName] = rCap
	}
}
```
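For reference, CapacityFromMachineInfo turns each HugePagesInfo entry into a node resource named hugepages-&lt;size&gt;, with a capacity of pageSize × numPages bytes. The helper below is a simplified sketch of that naming scheme only (the real code formats the size with k8s's resource.Quantity in BinarySI notation):

```go
package main

import "fmt"

// hugePageResourceName sketches how the capacity key is derived from a page
// size in kB: 2048 kB pages become "hugepages-2Mi", 1 GiB pages become
// "hugepages-1Gi". Only the common power-of-two sizes are handled here.
func hugePageResourceName(pageSizeKB uint64) string {
	switch {
	case pageSizeKB >= 1024*1024:
		return fmt.Sprintf("hugepages-%dGi", pageSizeKB/(1024*1024))
	case pageSizeKB >= 1024:
		return fmt.Sprintf("hugepages-%dMi", pageSizeKB/1024)
	default:
		return fmt.Sprintf("hugepages-%dKi", pageSizeKB)
	}
}

func main() {
	fmt.Println(hugePageResourceName(2048))    // hugepages-2Mi
	fmt.Println(hugePageResourceName(1048576)) // hugepages-1Gi
}
```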
So let's trace the argument passed in for this parameter. The comment alone already gives the answer away.
```go
// pkg/kubelet/kubelet_getters.go:413
// GetCachedMachineInfo assumes that the machine info can't change without a reboot
func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapiv1.MachineInfo, error) {
	kl.machineInfoLock.RLock()
	defer kl.machineInfoLock.RUnlock()
	return kl.machineInfo, nil
}
```
Tracing further, kl.machineInfo is set when the kubelet instance is initialized, and that is the only call site:
```go
// pkg/kubelet/kubelet.go:341
func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
	...
) {
	...
	machineInfo, err := klet.cadvisor.MachineInfo()
	if err != nil {
		return nil, err
	}
	// Avoid collector collects it as a timestamped metric
	// See PR #95210 and #97006 for more details.
	machineInfo.Timestamp = time.Time{}
	klet.setCachedMachineInfo(machineInfo)
	...
}
```
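The behavior boils down to a snapshot-at-construction pattern: the machine info is read once when the kubelet is built, and every later call returns that snapshot, so out-of-band changes to nr_hugepages are never observed. A minimal sketch with hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
)

// kubeletSketch stands in for the Kubelet; an int stands in for
// *cadvisorapiv1.MachineInfo.
type kubeletSketch struct {
	mu   sync.RWMutex
	info int
}

// newKubeletSketch takes the snapshot exactly once, at construction time,
// mirroring the single setCachedMachineInfo call in NewMainKubelet.
func newKubeletSketch(read func() int) *kubeletSketch {
	return &kubeletSketch{info: read()}
}

// GetCachedMachineInfo returns the construction-time snapshot, like its
// namesake in kubelet_getters.go.
func (k *kubeletSketch) GetCachedMachineInfo() int {
	k.mu.RLock()
	defer k.mu.RUnlock()
	return k.info
}

func main() {
	hugepages := 100
	kl := newKubeletSketch(func() int { return hugepages })
	hugepages = 110 // simulate editing nr_hugepages after kubelet starts
	fmt.Println(kl.GetCachedMachineInfo()) // 100: the change is invisible
}
```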
Conclusion
- kubelet reads the node's huge page configuration from /sys/kernel/mm/hugepages/.
- Although /sys/kernel/mm/hugepages/ can be reconfigured dynamically without rebooting the node, kubelet does not refresh this value at runtime; the only way to pick up a new huge page configuration is to restart kubelet.
References
- 7.3. Configuring HugeTLB Huge Pages, Red Hat Enterprise Linux 7 | Red Hat Customer Portal
- kubernetes - how to enable kubelet logging verbosity - Stack Overflow
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.