Background

  1. Huge pages are usually configured via /etc/default/grub.

  2. On a node where huge pages are configured, k8s automatically detects them and adds them to the node capacity.

  3. On a NUMA-enabled node, huge pages configured through /etc/default/grub are split evenly across all NUMA nodes by default:

     # /etc/default/grub
     GRUB_CMDLINE_LINUX="hugepages=100"
    
     root@master1:~# cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
     100
    
     root@master1:~# cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
     50
     50
  4. Using a service that runs at boot plus a script, we can sidestep the even split of huge pages across NUMA nodes and configure a different number of huge pages per NUMA node (a sketch follows after this list). To determine which wins when both the grub parameter and the script are configured, the hugepage count in the kernel parameters was deliberately changed to 110:

     root@master1:~# cat /etc/default/grub|grep hugepage
     GRUB_CMDLINE_LINUX="hugepages=110"
    
     root@master1:~# cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
     100
    
     root@master1:~# cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
     90
     10

     Evidently the script has the higher priority and overrides the boot parameter. Also, huge pages configured through /sys/devices/system/node are synced back to /sys/kernel/mm/hugepages, so k8s can see them as well.

  5. Risk: kubelet is started by systemd, and so is the boot-time service from the previous step. This creates a service-ordering concern: kubelet should start after the hugepage script runs, so that it reads the final hugepage counts.
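
To make this concrete, here is a minimal sketch of such a setup. The unit name and script path are hypothetical, and the page counts match the 90/10 split above; Before=kubelet.service addresses the ordering risk in point 5, since with Type=oneshot systemd finishes the script before starting kubelet.

# /etc/systemd/system/numa-hugepages.service (hypothetical name)
[Unit]
Description=Distribute huge pages unevenly across NUMA nodes
# order before kubelet so it reads the final hugepage counts
Before=kubelet.service

[Service]
# oneshot + RemainAfterExit: units ordered after this one wait for the script to finish
Type=oneshot
ExecStart=/usr/local/bin/set-numa-hugepages.sh
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

#!/bin/bash
# /usr/local/bin/set-numa-hugepages.sh (hypothetical path)
# override the even split produced by the grub hugepages= parameter
echo 90 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 10 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages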

With the following two questions in mind, let's walk through the code by which kubelet computes and reports the node's total resources, focusing on the hugepage-related parts:

  1. Where does kubelet read the hugepage numbers from?
  2. Can kubelet dynamically track changes to the huge pages on a node?

Node capacity

In pkg/kubelet/nodestatus/setters.go in the kubernetes source tree we find MachineInfo, the function that fills in the node's resource status; our trace starts there.

// pkg/kubelet/nodestatus/setters.go:275
func MachineInfo(nodeName string,
    maxPods int,
    podsPerCore int,
    machineInfoFunc func() (*cadvisorapiv1.MachineInfo, error), // typically Kubelet.GetCachedMachineInfo
    capacityFunc func() v1.ResourceList, // typically Kubelet.containerManager.GetCapacity
    devicePluginResourceCapacityFunc func() (v1.ResourceList, v1.ResourceList, []string), // typically Kubelet.containerManager.GetDevicePluginResourceCapacity
    nodeAllocatableReservationFunc func() v1.ResourceList, // typically Kubelet.containerManager.GetNodeAllocatableReservation
    recordEventFunc func(eventType, event, message string), // typically Kubelet.recordEvent
) Setter {
  ...
}
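
For reference, a Setter is simply a function that mutates the node object in place; in this era of the codebase it is defined in the same file as:

// pkg/kubelet/nodestatus/setters.go
type Setter func(node *v1.Node) error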

The parameter capacityFunc looks closely tied to the node capacity, so we first chase down the actual argument passed for it:

// pkg/kubelet/cm/container_manager_linux.go:1060
func (cm *containerManagerImpl) GetCapacity() v1.ResourceList {
    return cm.capacity
}
// pkg/kubelet/cm/container_manager_linux.go:250
capacity := cadvisor.CapacityFromMachineInfo(machineInfo)
// tracing machineInfo
// pkg/kubelet/cadvisor/cadvisor_linux.go:157
func (cc *cadvisorClient) MachineInfo() (*cadvisorapi.MachineInfo, error) {
    return cc.GetMachineInfo()
}
// github.com/google/cadvisor@v0.39.0/manager/manager.go:813
func (m *manager) GetMachineInfo() (*info.MachineInfo, error) {
    m.machineMu.RLock()
    defer m.machineMu.RUnlock()
    return m.machineInfo.Clone(), nil
}

Note the ticker below: the machine info obtained here is refreshed dynamically, so the capacity above is refreshed dynamically as well.

// github.com/google/cadvisor@v0.39.0/manager/manager.go:354
func (m *manager) updateMachineInfo(quit chan error) {
    ticker := time.NewTicker(*updateMachineInfoInterval)
    for {
        select {
        case <-ticker.C:
            // keep tracing here
            info, err := machine.Info(m.sysFs, m.fsInfo, m.inHostNamespace)
            if err != nil {
                klog.Errorf("Could not get machine info: %v", err)
                break
            }
            m.machineMu.Lock()
            m.machineInfo = *info
            m.machineMu.Unlock()
            klog.V(5).Infof("Update machine info: %+v", *info)
        case <-quit:
            ticker.Stop()
            quit <- nil
            return
        }
    }
}
// github.com/google/cadvisor/machine/info.go:57
func Info(sysFs sysfs.SysFs, fsInfo fs.FsInfo, inHostNamespace bool) (*info.MachineInfo, error) {
    ...
    // here is where we deal with huge pages most directly
    hugePagesInfo, err := sysinfo.GetHugePagesInfo(sysFs, hugepagesDirectory)
    if err != nil {
        return nil, err
    }
    ...
}

Hugepage data source

Looking at sysinfo.GetHugePagesInfo, where the trace above ends, we can draw a preliminary conclusion:

k8s obtains its hugepage information through cadvisor, specifically from /sys/kernel/mm/hugepages/, where it reads the page size (pageSize) and the number of pages (pageNum).

// GetHugePagesInfo returns information about pre-allocated huge pages
// hugepagesDirectory should be top directory of hugepages
// Such as: /sys/kernel/mm/hugepages/
func GetHugePagesInfo(sysFs sysfs.SysFs, hugepagesDirectory string) ([]info.HugePagesInfo, error) {
    var hugePagesInfo []info.HugePagesInfo
    files, err := sysFs.GetHugePagesInfo(hugepagesDirectory)
    if err != nil {
        // treat as non-fatal since kernels and machine can be
        // configured to disable hugepage support
        return hugePagesInfo, nil
    }

    for _, st := range files {
        nameArray := strings.Split(st.Name(), "-")
        pageSizeArray := strings.Split(nameArray[1], "kB")
        pageSize, err := strconv.ParseUint(string(pageSizeArray[0]), 10, 64)
        if err != nil {
            return hugePagesInfo, err
        }

        val, err := sysFs.GetHugePagesNr(hugepagesDirectory, st.Name())
        if err != nil {
            return hugePagesInfo, err
        }
        var numPages uint64
        // we use sscanf as the file has a new-line that trips up ParseUint
        // it returns the number of tokens successfully parsed, so if
        // n != 1, it means we were unable to parse a number from the file
        n, err := fmt.Sscanf(string(val), "%d", &numPages)
        if err != nil || n != 1 {
            return hugePagesInfo, fmt.Errorf("could not parse file nr_hugepage for %s, contents %q", st.Name(), string(val))
        }

        hugePagesInfo = append(hugePagesInfo, info.HugePagesInfo{
            NumPages: numPages,
            PageSize: pageSize,
        })
    }
    return hugePagesInfo, nil
}

nodeStatus updates

Now the question: if we update the node's huge pages dynamically using the method from the Background section, will kubelet dynamically update the hugepage amount in the node capacity?

To settle this, we need to find out when the MachineInfo function above is actually called.

// pkg/kubelet/kubelet_node_status.go:604
func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) error {
...
    var setters []func(n *v1.Node) error
    setters = append(setters,
        nodestatus.NodeAddress(kl.nodeIPs, kl.nodeIPValidator, kl.hostname, kl.hostnameOverridden, kl.externalCloudProvider, kl.cloud, nodeAddressesFunc),
        nodestatus.MachineInfo(string(kl.nodeName), kl.maxPods, kl.podsPerCore, kl.GetCachedMachineInfo, kl.containerManager.GetCapacity,
            kl.containerManager.GetDevicePluginResourceCapacity, kl.containerManager.GetNodeAllocatableReservation, kl.recordEvent),
        nodestatus.VersionInfo(kl.cadvisor.VersionInfo, kl.containerRuntime.Type, kl.containerRuntime.Version),
        nodestatus.DaemonEndpoints(kl.daemonEndpoints),
        nodestatus.Images(kl.nodeStatusMaxImages, kl.imageManager.GetImageList),
        nodestatus.GoRuntime(),
    )
...
}

This is where MachineInfo gets registered; from the slice it is appended to we can locate its callers. Below is the topmost call path down to the setters: after startup, kubelet keeps syncing the node status on a period of nodeStatusUpdateFrequency.

// pkg/kubelet/kubelet.go:1409
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
...
    if kl.kubeClient != nil {
        // Start syncing node status immediately, this may set up things the runtime needs to run.
        go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
    }
...
}
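
Here wait.Until(f, period, stopCh) calls f immediately and then once per period until stopCh is closed. A rough, self-contained equivalent of the pattern (not the actual apimachinery implementation):

package main

import (
    "fmt"
    "time"
)

// until mimics wait.Until: run f right away, then every period, until stop closes.
func until(f func(), period time.Duration, stop <-chan struct{}) {
    for {
        f()
        select {
        case <-stop:
            return
        case <-time.After(period):
        }
    }
}

func main() {
    stop := make(chan struct{})
    go until(func() { fmt.Println("syncNodeStatus tick") }, time.Second, stop)
    time.Sleep(3500 * time.Millisecond) // observe a few ticks
    close(stop)
}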

More concretely, here is the setters' direct caller. Although the setters run periodically to refresh the node status, not every refresh results in a call to the API server. To reduce load on the API server, the actual patch of the node status is throttled further by nodeStatusReportFrequency:

// pkg/kubelet/kubelet_node_status.go:480
func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {
...
  kl.setNodeStatus(node)

    now := kl.clock.Now()
    if now.Before(kl.lastStatusReportTime.Add(kl.nodeStatusReportFrequency)) {
        if !podCIDRChanged && !nodeStatusHasChanged(&originalNode.Status, &node.Status) {
            // We must mark the volumes as ReportedInUse in volume manager's dsw even
            // if no changes were made to the node status (no volumes were added or removed
            // from the VolumesInUse list).
            //
            // The reason is that on a kubelet restart, the volume manager's dsw is
            // repopulated and the volume ReportedInUse is initialized to false, while the
            // VolumesInUse list from the Node object still contains the state from the
            // previous kubelet instantiation.
            //
            // Once the volumes are added to the dsw, the ReportedInUse field needs to be
            // synced from the VolumesInUse list in the Node.Status.
            //
            // The MarkVolumesAsReportedInUse() call cannot be performed in dsw directly
            // because it does not have access to the Node object.
            // This also cannot be populated on node status manager init because the volume
            // may not have been added to dsw at that time.
            kl.volumeManager.MarkVolumesAsReportedInUse(node.Status.VolumesInUse)
            return nil
        }
    }

    // Patch the current status on the API server
    updatedNode, _, err := nodeutil.PatchNodeStatus(kl.heartbeatClient.CoreV1(), types.NodeName(kl.nodeName), originalNode, node)
    if err != nil {
        return err
    }
    kl.lastStatusReportTime = now
    kl.setLastObservedNodeAddresses(updatedNode.Status.Addresses)
    // If update finishes successfully, mark the volumeInUse as reportedInUse to indicate
    // those volumes are already updated in the node's status
    kl.volumeManager.MarkVolumesAsReportedInUse(updatedNode.Status.VolumesInUse)
    return nil
}

To sum up so far: kubelet appears to recompute the node capacity periodically and patch the node status object periodically, so a dynamic change to the huge pages should, in theory, also make its way into the node capacity.

In practice, however, the node capacity does not change after modifying the huge pages unless kubelet is restarted; only a restart makes it pick up the new value.

Frequency settings

Since practice contradicts theory, the first suspects are the two kubelet settings mentioned above. In a cluster deployed with kubekey, the kubelet config file carries the following defaults:

# /var/lib/kubelet/config.yaml
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s

Changing both settings to 5s did not solve the problem.
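
For the record, this is the change that was tested. Note that 0s effectively means "unset": kubelet then falls back to its built-in defaults (10s for nodeStatusUpdateFrequency, 5m for nodeStatusReportFrequency, per the KubeletConfiguration reference below), which matches the 10-second cadence visible in the logs later on.

# /var/lib/kubelet/config.yaml
nodeStatusUpdateFrequency: 5s
nodeStatusReportFrequency: 5s

# apply the change
systemctl restart kubelet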

kubelet log verbosity

To further confirm whether these node status setters are being invoked periodically, we raise kubelet's log verbosity.

Although /var/lib/kubelet/config.yaml has a logging verbosity setting, changing it turned out to have no effect.

Some searching shows that editing /var/lib/kubelet/kubeadm-flags.env and appending --v to KUBELET_KUBEADM_ARGS does work, for example:
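
(The existing flags vary per install, so they are elided here rather than guessed.)

# /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="<existing flags> --v=5"

# restart kubelet to apply
systemctl restart kubelet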

The logs now show that the node status setters are indeed invoked on schedule:

// pkg/kubelet/kubelet_node_status.go
func (kl *Kubelet) updateNodeStatus() error {
    klog.V(5).InfoS("Updating node status")
    ...
    }
root@master1:~# journalctl -u kubelet -f|grep "Updating node status"
Mar 15 16:19:59 master1 kubelet[65392]: I0315 16:19:59.193290   65392 kubelet_node_status.go:464] "Updating node status"
Mar 15 16:20:09 master1 kubelet[65392]: I0315 16:20:09.209810   65392 kubelet_node_status.go:464] "Updating node status"
Mar 15 16:20:19 master1 kubelet[65392]: I0315 16:20:19.226005   65392 kubelet_node_status.go:464] "Updating node status"
Mar 15 16:20:29 master1 kubelet[65392]: I0315 16:20:29.244722   65392 kubelet_node_status.go:464] "Updating node status"

Back to square one

When reality and theory collide, we have to re-examine the theory's starting point and return to MachineInfo:

// pkg/kubelet/nodestatus/setters.go:274
func MachineInfo(nodeName string,
    maxPods int,
    podsPerCore int,
    machineInfoFunc func() (*cadvisorapiv1.MachineInfo, error), // typically Kubelet.GetCachedMachineInfo
    capacityFunc func() v1.ResourceList, // typically Kubelet.containerManager.GetCapacity
    devicePluginResourceCapacityFunc func() (v1.ResourceList, v1.ResourceList, []string), // typically Kubelet.containerManager.GetDevicePluginResourceCapacity
    nodeAllocatableReservationFunc func() v1.ResourceList, // typically Kubelet.containerManager.GetNodeAllocatableReservation
    recordEventFunc func(eventType, event, message string), // typically Kubelet.recordEvent
) Setter {
...
}

Looking carefully inside this function, it turns out we were fooled by the parameter's name. Although the parameter is called capacityFunc, its only use in the function is the snippet below: all it does is extract the ephemeral (local) storage capacity, nothing more.

if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
    // TODO: all the node resources should use ContainerManager.GetCapacity instead of deriving the
    // capacity for every node status request
    initialCapacity := capacityFunc()
    if initialCapacity != nil {
        if v, exists := initialCapacity[v1.ResourceEphemeralStorage]; exists {
            node.Status.Capacity[v1.ResourceEphemeralStorage] = v
        }
    }
}

What actually determines the node capacity is a different parameter...

info, err := machineInfoFunc()
if err != nil {
    ...
} else {
    for rName, rCap := range cadvisor.CapacityFromMachineInfo(info) {
        node.Status.Capacity[rName] = rCap
    }
}

So we trace this parameter to its actual argument, and the comment alone gives the game away. fuck

// pkg/kubelet/kubelet_getters.go:413
// GetCachedMachineInfo assumes that the machine info can't change without a reboot
func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapiv1.MachineInfo, error) {
    kl.machineInfoLock.RLock()
    defer kl.machineInfoLock.RUnlock()
    return kl.machineInfo, nil
}

Following further, kl.machineInfo is populated when the kubelet instance is initialized, and this is the only call site:

// pkg/kubelet/kubelet.go:341
func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
...
){
...
    machineInfo, err := klet.cadvisor.MachineInfo()
    if err != nil {
        return nil, err
    }
    // Avoid collector collects it as a timestamped metric
    // See PR #95210 and #97006 for more details.
    machineInfo.Timestamp = time.Time{}
    klet.setCachedMachineInfo(machineInfo)
...
}

Conclusion

  1. kubelet reads the node's hugepage configuration from /sys/kernel/mm/hugepages/.
  2. Although /sys/kernel/mm/hugepages/ can be reconfigured dynamically without rebooting the node, kubelet does not refresh this value on the fly; the only way to pick up a new hugepage configuration is to restart kubelet (a quick verification follows below).
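
A quick way to verify this, assuming a node named master1 with 2 MiB pages:

# grow the 2 MiB page pool at runtime
echo 120 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# node capacity still reports the old value
kubectl describe node master1 | grep hugepages-2Mi

# restart kubelet so it rebuilds its cached machine info, then check again
systemctl restart kubelet
kubectl describe node master1 | grep hugepages-2Mi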

References

7.3. Configuring HugeTLB Huge Pages Red Hat Enterprise Linux 7 | Red Hat Customer Portal

kubernetes - how to enable kubelet logging verbosity - Stack Overflow

Kubelet Configuration (v1beta1) | Kubernetes