Blog: Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)

Authors: Dixita Narang (Google)

Kubernetes v1.27, released in April 2023, introduced changes to
Memory QoS (alpha) to improve memory management capabilities on Linux nodes.

Support for Memory QoS was initially added in Kubernetes v1.22. Some limitations
around the formula for calculating memory.high were identified later, and they are
addressed in Kubernetes v1.27.

Background

Kubernetes allows you to optionally specify how much of each resource a container needs
in the Pod specification. The most common resources to specify are CPU and memory.

For example, a Pod manifest that defines container resource requirements could look like:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"

spec.containers[].resources.requests

When you specify the resource request for containers in a Pod, the Kubernetes
scheduler uses this information to decide which node to place the Pod on. The scheduler
ensures that for each resource type, the sum of the resource requests of the
scheduled containers is less than the total allocatable resources on the node.

spec.containers[].resources.limits

When you specify the resource limit for containers in a Pod, the kubelet enforces
those limits so that the running containers are not allowed to use more of those
resources than the limits you set.
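
To make the units in the manifest above concrete, here is a minimal Python sketch that converts quantity strings such as "64Mi" and "250m" into bytes and millicores. The function names parse_memory and parse_cpu_millis are hypothetical helpers for illustration, not part of any Kubernetes client library, and only a subset of the quantity suffixes is handled.

```python
# Hypothetical helpers for illustration; not a Kubernetes API.
# Only plain numbers and a few common suffixes are handled.
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "64Mi" into bytes."""
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: the value is already in bytes


def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "0.5" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


print(parse_memory("64Mi"))      # 67108864
print(parse_cpu_millis("250m"))  # 250
```

In the example manifest, the memory request and limit are both 67108864 bytes, and the CPU request is a quarter of one CPU.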

When the kubelet starts a container as part of a Pod, it passes the
container’s requests and limits for CPU and memory to the container runtime.
The container runtime assigns both the CPU request and the CPU limit to the container.
Provided the system has free CPU time, containers are guaranteed to be
allocated as much CPU as they request. Containers cannot use more CPU than
the configured limit: a container’s CPU usage is throttled if it
uses more CPU than the specified limit within a given time slice.

Prior to the Memory QoS feature, the container runtime only used the memory
limit and discarded the memory request (requests were, and still are,
used to influence scheduling).
If a container uses more memory than the configured limit,
the Linux Out Of Memory (OOM) killer is invoked.

Let’s compare how the container runtime on Linux typically configures the memory
request and limit in cgroups, with and without the Memory QoS feature:

Memory request

The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
scheduling. In cgroups v1, there are no controls to specify the minimum amount
of memory a cgroup must always retain. Hence, the container runtime did not
use the value of requested memory set in the Pod spec.

cgroups v2 introduced a memory.min setting, used to specify the minimum
amount of memory that should remain available to the processes within
a given cgroup. If the memory usage of a cgroup is within its effective
min boundary, the cgroup’s memory won’t be reclaimed under any conditions.
If the kernel cannot maintain at least memory.min bytes of memory for the
processes within the cgroup, the kernel invokes its OOM killer. In other words,
the kernel guarantees at least this much memory is available or terminates
processes (which may be outside the cgroup) in order to make memory more available.
Memory QoS maps memory.min to spec.containers[].resources.requests.memory
to ensure the availability of memory for containers in Kubernetes Pods.

Memory limit

The memory limit specifies the maximum amount of memory a container may use. If the
container tries to allocate more memory than this limit, the Linux kernel terminates a
process with an OOM (Out of Memory) kill. If the terminated process was the main (or only)
process inside the container, the container may exit.

In cgroups v1, the memory.limit_in_bytes interface is used to set the memory usage limit.
However, unlike CPU, it is not possible to apply memory throttling: as soon as a
container crosses the memory limit, it is OOM killed.

In cgroups v2, memory.max is analogous to memory.limit_in_bytes in cgroups v1.
Memory QoS maps memory.max to spec.containers[].resources.limits.memory to
specify the hard limit for memory usage. If the memory consumption goes above this
level, the kernel invokes its OOM killer.

cgroups v2 also added the memory.high setting. Memory QoS uses memory.high
to set the memory usage throttle limit. If the memory.high limit is breached,
the offending cgroups are throttled, and the kernel tries to reclaim memory,
which may avoid an OOM kill.
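
The comparison above can be summarized in a small Python sketch. The function cgroup_memory_settings and its parameters are hypothetical names chosen for illustration; the sketch only restates which interface files the text describes as being written, and it leaves out memory.high, whose value depends on a formula.

```python
from typing import Dict, Optional


def cgroup_memory_settings(
    request: Optional[int],
    limit: Optional[int],
    cgroup_version: int,
    memory_qos: bool,
) -> Dict[str, int]:
    """Return the memory interface files set for one container (sketch).

    request and limit are byte counts, or None when not specified.
    """
    settings: Dict[str, int] = {}
    if limit is not None:
        # The hard limit is applied in both cgroup versions.
        name = "memory.limit_in_bytes" if cgroup_version == 1 else "memory.max"
        settings[name] = limit
    if memory_qos and cgroup_version == 2 and request is not None:
        # Only with Memory QoS on cgroups v2 is the request passed down.
        settings["memory.min"] = request
    return settings
```

For a container with a 64Mi request and a 128Mi limit, this returns only the hard limit on cgroups v1 or without Memory QoS, and both memory.max and memory.min when Memory QoS is enabled on cgroups v2.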

How it works

Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in
Kubernetes. The cgroups v2 interfaces that this feature uses are:

memory.max
memory.min
memory.high

Memory QoS Levels

memory.max is mapped to limits.memory specified in the Pod spec. The kubelet and
the container runtime configure the limit in the respective cgroup. The kernel
enforces the limit to prevent the container from using more than the configured
resource limit. If a process in a container tries to consume more than the
specified limit, the kernel terminates one or more processes with an
Out of Memory (OOM) error.

memory.min is mapped to requests.memory, which results in reservation of memory resources
that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of
memory for Kubernetes pods. If there’s no unprotected reclaimable memory available, the OOM
killer is invoked to make more memory available.

For memory protection, in addition to the original way of limiting memory usage, Memory QoS
throttles a workload that approaches its memory limit, ensuring that the system is not
overwhelmed by sporadic increases in memory usage. A new field, memoryThrottlingFactor, is
available in the KubeletConfiguration when you enable the MemoryQoS feature. It is set to 0.9
by default. memory.high is mapped to a throttling limit calculated by a formula that uses
memoryThrottlingFactor, requests.memory and limits.memory, with the result rounded down to
a multiple of the page size.

Note: If a container has no memory limit specified, node allocatable memory is substituted for limits.memory.
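
As a sketch of this calculation, assuming the form of the formula given in the Memory QoS KEP: memory.high = floor[(requests.memory + memoryThrottlingFactor × (limits.memory or node allocatable memory − requests.memory)) / pageSize] × pageSize. The Python function below is a hypothetical illustration, not kubelet code, and the 4096-byte page size is an assumption that is common on Linux but not universal.

```python
import math
from typing import Optional


def memory_high(
    request: int,
    limit: Optional[int],
    node_allocatable: int,
    throttling_factor: float = 0.9,
    page_size: int = 4096,  # assumed page size in bytes
) -> int:
    """Compute the memory.high throttle limit in bytes (sketch)."""
    # Fall back to node allocatable memory when no limit is specified.
    ceiling = limit if limit is not None else node_allocatable
    raw = request + throttling_factor * (ceiling - request)
    # Round down to a whole number of pages.
    return math.floor(raw / page_size) * page_size


MiB = 1024 * 1024
# 64Mi request, 128Mi limit, default factor of 0.9
print(memory_high(64 * MiB, 128 * MiB, node_allocatable=8 * 1024 * MiB))
# prints 127504384
```

With a 64Mi request and a 128Mi limit, throttling would start at 127504384 bytes (about 121.6 MiB), a little below the 128Mi hard limit.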

Summary:

File	Description
memory.max	Specifies the maximum amount of memory a container is allowed to use. If a process within the container tries to consume more memory than the configured limit, the kernel terminates the process with an Out of Memory (OOM) error. It is mapped to the container’s memory limit specified in the Pod manifest.
memory.min	Specifies the minimum amount of memory a cgroup must always retain, that is, memory that should never be reclaimed by the system. If there’s no unprotected reclaimable memory available, an OOM kill is invoked. It is mapped to the container’s memory request specified in the Pod manifest.
memory.high	Specifies the memory usage throttle limit. This is the main mechanism to control a cgroup’s memory use. If a cgroup’s memory use goes over the high boundary specified here, the cgroup’s processes are throttled and put under heavy reclaim pressure. Kubernetes uses a formula to calculate memory.high, depending on the container’s memory request, its memory limit or node allocatable memory (if the container’s memory limit is empty), and a throttling factor. Please refer to the KEP for more details on the formula.

Note: memory.high is set only on container-level cgroups, while memory.min is set on
container-, pod-, and node-level cgroups.
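
To check these values on a node, you can read the interface files from a container’s cgroup directory. The sketch below assumes a cgroups v2 hierarchy, where each file holds either a byte count or the literal string max (meaning no limit). The function name read_memory_settings is hypothetical, and the directory you pass to it is left to you, since the actual cgroup path depends on the node’s cgroup driver and container runtime.

```python
from pathlib import Path
from typing import Dict, Optional


def read_memory_settings(cgroup_dir: str) -> Dict[str, Optional[int]]:
    """Read memory.min, memory.high and memory.max from a cgroup v2 directory.

    Returns byte counts, or None where the file contains "max" (no limit)
    or does not exist.
    """
    result: Dict[str, Optional[int]] = {}
    for name in ("memory.min", "memory.high", "memory.max"):
        path = Path(cgroup_dir) / name
        if not path.is_file():
            result[name] = None
            continue
        text = path.read_text().strip()
        result[name] = None if text == "max" else int(text)
    return result
```

For a Guaranteed pod you would expect memory.high to read as max, because Memory QoS leaves it unset for that class.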

`memory.min` calculations for cgroups hierarchy

When a container specifies a memory request, the kubelet passes memory.min to the back-end
CRI runtime (such as containerd or CRI-O) via the Unified field in CRI during
container creation. The memory.min in the container-level cgroup is set to:

$\text{memory.min} = \text{pod.spec.containers}[i].\text{resources.requests[memory]}$
for every $i^{\text{th}}$ container in a pod

Since the memory.min interface requires that it is also set on all ancestor cgroup
directories, the pod-level and node-level cgroup directories need to be set correctly.

memory.min in pod level cgroup:
$\text{memory.min} = \sum_{i=1}^{\text{no. of containers}} \text{pod.spec.containers}[i].\text{resources.requests[memory]}$
for every $i^{\text{th}}$ container in a pod

memory.min in node level cgroup:
$\text{memory.min} = \sum_{i=1}^{\text{no. of pods}} \sum_{j=1}^{\text{no. of containers}} \text{pod}[i].\text{spec.containers}[j].\text{resources.requests[memory]}$
for every $j^{\text{th}}$ container in every $i^{\text{th}}$ pod on a node

The kubelet manages the cgroup hierarchy of the pod-level and node-level cgroups
directly using the libcontainer library (from the runc project), while container
cgroup limits are managed by the container runtime.
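
The three levels of memory.min can be sketched in Python as follows. The data layout (a node as a list of pods, each pod a list of per-container memory requests in bytes) and the function names are assumptions made for illustration, not the kubelet’s internal representation.

```python
from typing import List

Pod = List[int]  # memory request in bytes for each container in the pod


def container_memory_min(pod: Pod, i: int) -> int:
    """memory.min for the i-th container: its own memory request."""
    return pod[i]


def pod_memory_min(pod: Pod) -> int:
    """memory.min for the pod-level cgroup: sum over its containers."""
    return sum(pod)


def node_memory_min(pods: List[Pod]) -> int:
    """memory.min for the node-level cgroup: sum over all pods."""
    return sum(pod_memory_min(pod) for pod in pods)


MiB = 1024 * 1024
pods = [[64 * MiB, 32 * MiB], [128 * MiB]]
print(pod_memory_min(pods[0]))  # 100663296 (96 MiB)
print(node_memory_min(pods))    # 234881024 (224 MiB)
```

Each level is the sum of the level below it, which is what keeps the ancestor cgroups consistent with the containers they hold.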

Support for Pod QoS classes

Feedback on the alpha feature in Kubernetes v1.22 showed that some users would like
to opt out of Memory QoS on a per-pod basis to ensure there is no early memory throttling.
Therefore, in Kubernetes v1.27 Memory QoS also sets memory.high according to the
Quality of Service (QoS) class of the Pod. The cases for memory.high per QoS class
are as follows:

Guaranteed pods, by their QoS definition, require memory requests to equal memory limits
and are not overcommitted. Hence the Memory QoS feature is disabled on those pods by not
setting memory.high. This ensures that Guaranteed pods can fully use their memory requests
up to their set limit without hitting any throttling.

Burstable pods, by their QoS definition, require at least one container in the Pod with
a CPU or memory request or limit set.
- When requests.memory and limits.memory are set, the formula is used as-is:

  (Figure: memory.high when requests and limits are set)
- When requests.memory is set and limits.memory is not set, node allocatable memory is
  substituted for limits.memory in the formula:

  (Figure: memory.high when requests and limits are not set)

BestEffort pods, by their QoS definition, do not require any memory or CPU limits or requests.
For this case, Kubernetes sets requests.memory = 0 and substitutes node allocatable
memory for limits.memory in the formula:

(Figure: memory.high for BestEffort Pod)

Summary: Only Pods in Burstable and BestEffort QoS classes will set memory.high.
Guaranteed QoS pods do not set memory.high as their memory is guaranteed.
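
The per-class behavior can be sketched as below. This is a simplified, single-container illustration with hypothetical names: real QoS classification looks at CPU as well as memory and at every container in the Pod, and the formula is assumed to take the form described in the KEP. The 4096-byte page size is also an assumption.

```python
import math
from typing import Optional


def memory_high_for_container(
    request: Optional[int],
    limit: Optional[int],
    node_allocatable: int,
    throttling_factor: float = 0.9,
    page_size: int = 4096,  # assumed page size in bytes
) -> Optional[int]:
    """Return memory.high in bytes, or None when it is left unset (sketch).

    Simplification: the QoS class is inferred from the memory settings of
    a single container only.
    """
    if request is None and limit is not None:
        request = limit  # Kubernetes defaults an unset request to the limit
    if request is not None and request == limit:
        return None  # Guaranteed: memory.high is not set
    req = request or 0  # BestEffort: requests.memory is treated as 0
    # Burstable without a limit, and BestEffort, use node allocatable memory.
    ceiling = limit if limit is not None else node_allocatable
    raw = req + throttling_factor * (ceiling - req)
    return math.floor(raw / page_size) * page_size
```

Called with equal request and limit, the function returns None; with neither set, it returns 0.9 times node allocatable memory rounded down to a page boundary.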

How do I use it?

The prerequisites for enabling the Memory QoS feature on your Linux node are:

Verify that the requirements related to Kubernetes support for cgroups v2 are met.

Ensure that the CRI runtime supports Memory QoS. At the time of writing, only containerd
and CRI-O provide support compatible with Memory QoS (alpha). This was implemented
in the following PRs:
- containerd: Feature: containerd-cri support LinuxContainerResources.Unified #5627.
- CRI-O: implement kube alpha features for 1.22 #5207.
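
As a quick check for the first prerequisite, a node using the cgroups v2 unified hierarchy exposes a cgroup.controllers file at the root of the cgroup filesystem. The sketch below relies on that. The function names are hypothetical, and /sys/fs/cgroup is assumed as the mount point, which is the usual location but can differ. It does not cover the other requirements for cgroups v2 support in Kubernetes.

```python
from pathlib import Path


def uses_cgroup_v2(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """Return True if cgroup_root looks like a cgroups v2 unified hierarchy."""
    return (Path(cgroup_root) / "cgroup.controllers").is_file()


def memory_controller_available(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """Return True if the memory controller is listed as available."""
    controllers = Path(cgroup_root) / "cgroup.controllers"
    if not controllers.is_file():
        return False
    # The file holds a space-separated list of controller names.
    return "memory" in controllers.read_text().split()


if __name__ == "__main__":
    print("cgroups v2:", uses_cgroup_v2())
    print("memory controller:", memory_controller_available())
```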

Memory QoS remains an alpha feature for Kubernetes v1.27. You can enable the feature by setting
MemoryQoS=true in the kubelet configuration file:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true

How do I get involved?

Huge thank you to all the contributors who helped with the design, implementation,
and review of this feature:

Dixita Narang (ndixita)
Tim Xu (xiaoxubeii)
Paco Xu (pacoxu)
David Porter (bobbypage)
Mrunal Patel (mrunalp)

For those interested in getting involved in future discussions on the Memory QoS feature,
you can reach out to SIG Node by several means.

Originally posted on Kubernetes – Production-Grade Container Orchestration